Audio Coding Techniques (I)
Introduction: How is audio different from speech? Human auditory system
Lossless audio coding: Reversibility of closed-loop DPCM; Inter-channel decorrelation
Perceptual audio coding: Psychoacoustics; Perceptual entropy
EE493Q: Digital Speech Processing
Introduction to Audio
What is audio?
“Of or relating to high-fidelity sound reproduction”
How is audio different from speech?
Higher sampling rate (wider bandwidth): CD-quality music covers 20 kHz; wideband speech in video conferencing covers 7 kHz
Higher accuracy: 12-16 bits per sample
Multi-channel: mono, stereo, 6-channel
Requires much more bandwidth: raw data rate ~700 kbps per channel
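The raw data rate quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the usual CD parameters (44.1 kHz sampling, 16 bits per sample), which the slide does not state explicitly; the function name `raw_bit_rate` is only illustrative.

```python
def raw_bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Uncompressed PCM bit rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


# Assumed CD parameters: 44.1 kHz sampling, 16 bits per sample.
cd_mono = raw_bit_rate(44_100, 16)
cd_stereo = raw_bit_rate(44_100, 16, channels=2)

print(f"mono:   {cd_mono / 1000:.1f} kbps")
print(f"stereo: {cd_stereo / 1000:.1f} kbps")
```

Under these assumptions one channel needs 705.6 kbps, which matches the "~700 kbps per channel" figure.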
Stereo-Audio
[Figure: left (L) and right (R) channels of a stereo signal]
Audio Compression
How is it different from speech compression?
Requirement: Lossless compression is important in some applications (e.g., archiving and mixing of high-quality recordings in professional environments)
Perceptually lossless compression is sufficient in others (e.g., MP3 music)
Principles: No physical model exists for audio production
Instead, more emphasis is put on the human auditory system, in particular the psychoacoustic masking effect
Sound Quality Requirements
Review Question (I)
Q: Given an audio signal sampled at 16 kHz, its 25th subband (12-12.5 kHz) has an SPL of 10 dB. Can the human ear hear it?
A: No. The threshold in quiet near 12 kHz lies above 10 dB SPL.
Review Question (II)
Q: Given a masker tone at 2 kHz and 60 dB, if the test tone is played in the 15th critical band (CB) with an SPL of 50 dB, is it masked?
A: No.
Review Question (III)
Q: Consider the echo hiding scheme for audio watermarking. Do we want to insert echoes before or after the masker?
A: After, since post-masking lasts much longer than pre-masking.
Overview
Hans and Schafer, “Lossless compression of digital audio,” IEEE Signal Processing Magazine, July 2001
Note: such an approach does not take inter-channel correlation into account, so it is unlikely to be optimal
Intra-Channel Decorrelation
Notes: the prediction residues e(n) are integers because the predictor output is rounded
A(z) is the autoregressive (AR) model; B(z) is the moving average (MA) model
Justification of Reversibility
Recall: quantization is not invertible, so how can we achieve lossless compression regardless of the rounding operation?
Encoder:
$$\hat{x}(n) = \sum_{k=1}^{K} a_k\, x(n-k), \qquad e(n) = x(n) - Q[\hat{x}(n)]$$
Decoder:
$$\hat{x}(n) = \sum_{k=1}^{K} a_k\, x(n-k), \qquad x(n) = e(n) + Q[\hat{x}(n)]$$
Answer: closed-loop DPCM guarantees the reversibility
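The reversibility argument can be illustrated with a short round trip. The sketch below assumes integer PCM samples, a fixed set of predictor coefficients, and rounding to the nearest integer as the quantizer Q; the names `predict`, `encode`, and `decode` and the coefficient values are illustrative, not taken from the slides. The decoder forms its prediction from samples it has already reconstructed exactly, so it computes the same rounded value as the encoder.

```python
def predict(history, coeffs):
    """Rounded prediction Q[x_hat(n)] from the most recent samples.

    history holds x(n-K)..x(n-1) in time order; coeffs holds a_1..a_K.
    Fewer than K past samples are allowed at the start of the signal.
    """
    x_hat = sum(a * x for a, x in zip(coeffs, reversed(history)))
    return round(x_hat)


def encode(samples, coeffs):
    """e(n) = x(n) - Q[x_hat(n)]; residues are integers."""
    K = len(coeffs)
    return [x - predict(samples[max(0, n - K):n], coeffs)
            for n, x in enumerate(samples)]


def decode(residues, coeffs):
    """x(n) = e(n) + Q[x_hat(n)], using already decoded samples."""
    K = len(coeffs)
    out = []
    for e in residues:
        out.append(e + predict(out[max(0, len(out) - K):], coeffs))
    return out


samples = [0, 3, 7, 12, 14, 13, 9, 4, -2, -7]
coeffs = [1.8, -0.9]          # hypothetical second-order predictor
residues = encode(samples, coeffs)
restored = decode(residues, coeffs)
print(residues)
print(restored == samples)
```

Rounding discards information about the prediction, not about the signal, which is why the residues plus the shared predictor are enough to recover every sample.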
Inter-Channel Decorrelation
[Figure: channels L and R converted to an average channel s and a difference channel d]
Average: s = (L + R)/2
Difference: d = R - L
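For integer samples the average (L+R)/2 is not always an integer, so a lossless coder has to decide what to do with the lost half. One common approach, sketched here as an assumption rather than as the method of the slides, is to floor the average and recover the dropped bit from the parity of d, since L+R and R-L are always both even or both odd. The function names are illustrative.

```python
def encode_ms(left, right):
    """Average/difference channels with a floored integer average."""
    s = [(l + r) >> 1 for l, r in zip(left, right)]   # floor((L+R)/2)
    d = [r - l for l, r in zip(left, right)]          # R - L
    return s, d


def decode_ms(s, d):
    """Exact inverse of encode_ms."""
    left, right = [], []
    for si, di in zip(s, d):
        total = 2 * si + (di & 1)      # L+R, parity restored from d
        right.append((total + di) >> 1)
        left.append((total - di) >> 1)
    return left, right


L = [3, -4, 7, 0, -3]
R = [5, -4, -2, 1, 0]
s, d = encode_ms(L, R)
print(s, d)
print(decode_ms(s, d) == (L, R))
```

When the two channels are similar, d stays close to zero and is cheaper to code than R itself.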
Stereo Recording Techniques*
X-Y technique: two directional microphones are placed coincidentally, typically at an angle of 90 degrees or more to each other; the result is mono-compatible
A-B technique: two omnidirectional microphones are placed some distance apart (from 20 centimeters up to several meters)
Adding a third microphone at the center turns the A-B arrangement into a “Decca Tree”
Audio Coding Techniques (II)
MP3 Audio Compression: Filter bank/Modified DCT; Psychoacoustic Models; Bit Allocation
Advanced Audio Coding (AAC) Techniques: MPEG-1,2,4; SONY ATRAC; Lucent PAC; Dolby AC-3
Introduction
What does ISO MPEG-1 Audio provide?
A transparently lossy audio compression system based on the weaknesses of the human ear
Can provide compression by a factor of 6 and retain sound quality
One part of a three-part standard that includes audio, video, and audio/video synchronization
MPEG-2 and MPEG-4 have advanced audio coding (AAC) options
ITU-T has its own standardized algorithm for wideband speech (audio)
MPEG-1 Audio Features
PCM sampling rate of 32, 44.1, or 48 kHz
Four channel modes: monophonic, dual-monophonic, stereo, and joint-stereo
Three modes (called layers in MPEG-1 terminology):
Layer I: computationally cheapest, bit rates > 128 kbps
Layer II: bit rate ~128 kbps, used in VCD
Layer III: most complicated encoding/decoding, bit rates ~64 kbps, originally intended for streaming audio
MPEG-1 Encoder Architecture
Polyphase Filter Bank: transforms PCM samples to frequency-domain signals in 32 subbands
Psychoacoustic Model: calculates the acoustically irrelevant parts of the signal
Bit Allocation: allots bits to subbands according to input from the psychoacoustic calculation
Frame Creation: generates an MPEG-1 compliant bit stream
What is Filter Bank?
[Figure: analysis filter bank followed by synthesis filter bank]
Filter Bank Illustration
Modified Discrete Cosine Transform
Forward Transform
Inverse Transform
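The transform equations on this slide did not survive extraction, so the following is a sketch under stated assumptions rather than the slide's own formulas. It uses one common MDCT convention: a frame of 2N samples gives N coefficients, X(k) = sum over n of w(n) x(n) cos[(pi/N)(n + 1/2 + N/2)(k + 1/2)], with a sine window w(n) and a 2/N scale factor in the inverse. Each inverse frame contains time-domain aliasing, which cancels when consecutive frames are overlapped by N samples and added. All function names are illustrative.

```python
import math


def sine_window(N):
    """Sine window of length 2N (satisfies the Princen-Bradley condition)."""
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]


def mdct(frame, N):
    """Forward transform: 2N windowed samples -> N coefficients."""
    w = sine_window(N)
    return [sum(w[n] * frame[n]
                * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]


def imdct(coeffs, N):
    """Inverse transform: N coefficients -> 2N windowed, aliased samples."""
    w = sine_window(N)
    return [w[n] * (2.0 / N)
            * sum(coeffs[k]
                  * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                  for k in range(N))
            for n in range(2 * N)]


def analyze(signal, N):
    """Frames of length 2N with hop N; len(signal) must be a multiple of N."""
    padded = [0.0] * N + list(signal) + [0.0] * N
    return [mdct(padded[i:i + 2 * N], N)
            for i in range(0, len(padded) - N, N)]


def synthesize(frames, N):
    """Overlap-add the inverse frames; the aliasing terms cancel."""
    out = [0.0] * ((len(frames) + 1) * N)
    for i, coeffs in enumerate(frames):
        for n, y in enumerate(imdct(coeffs, N)):
            out[i * N + n] += y
    return out[N:-N]


N = 8
signal = [math.sin(0.3 * n) + 0.1 * n for n in range(4 * N)]
rebuilt = synthesize(analyze(signal, N), N)
print(max(abs(a - b) for a, b in zip(signal, rebuilt)))
```

The transform is critically sampled: each hop of N new samples produces N coefficients, even though every frame spans 2N samples.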
Pre-Echo Distortion
MPEG-1 Psychoacoustic Models
The MPEG-1 standard defines two models:
Psychoacoustic Model 1: less computationally expensive; makes some serious compromises in what it assumes a listener cannot hear
Psychoacoustic Model 2: provides more features suited for Layer III coding, assuming, of course, increased processor bandwidth
Step 1: Spectral Analysis and SPL Normalization
Convert samples to the frequency domain
Use a Hann weighting and then a DFT
The window tapers each block toward zero at its ends, which suppresses the edge artifacts caused by the finite window size
Model 1 uses a 512-sample (Layer I) or 1024-sample (Layers II and III) window
Model 2 uses a 1024-sample window and two calculations per frame
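The spectral analysis step can be sketched as a Hann-windowed DFT whose power is expressed in dB. This is a minimal illustration, not the normalization defined in the standard: the `ref_db` offset that would map the result to sound pressure level is a hypothetical parameter left at 0, and a direct DFT is used instead of an FFT to keep the example dependency-free.

```python
import cmath
import math


def hann(N):
    """Periodic Hann window of length N."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / N)) for n in range(N)]


def power_spectrum_db(frame, ref_db=0.0):
    """Power in dB for bins 0..N/2 of the Hann-windowed frame.

    ref_db is an assumed calibration offset (hypothetical), added so that
    the result could be aligned with an SPL scale.
    """
    N = len(frame)
    xw = [w * x for w, x in zip(hann(N), frame)]
    twiddle = [cmath.exp(-2j * math.pi * m / N) for m in range(N)]
    spectrum = []
    for k in range(N // 2 + 1):
        acc = sum(xw[n] * twiddle[(k * n) % N] for n in range(N))
        power = max(abs(acc) ** 2, 1e-12)   # floor avoids log10(0)
        spectrum.append(ref_db + 10.0 * math.log10(power))
    return spectrum


N = 512                       # Model 1 window length for Layer I
k0 = 40                       # test tone placed exactly on bin 40
frame = [math.cos(2.0 * math.pi * k0 * n / N) for n in range(N)]
spec = power_spectrum_db(frame)
peak_bin = max(range(len(spec)), key=spec.__getitem__)
print(peak_bin, round(spec[peak_bin], 2))
```

With the Hann window, a tone centered on a bin leaks only into its two immediate neighbors, each about 6 dB below the peak.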
Step 2: Identification of Tonal and Noise Maskers
Need to separate the sound into “tone” and “noise” components
Model 1: local peaks are tones; the remaining spectrum in each critical band is lumped into noise at a representative frequency
Model 2: calculates a “tonality” index, based on the previous two analysis windows, to determine the likelihood of each spectral point being a tone
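The Model 1 idea that "local peaks are tones" can be sketched as a simple peak picker on a dB spectrum. This is a simplified illustration: the 7 dB margin and the fixed neighbor offset are assumptions chosen for the example, whereas an actual encoder varies the neighborhood with frequency. The name `find_tonal_peaks` is hypothetical.

```python
def find_tonal_peaks(spectrum_db, neighbor_offsets=(2,), margin_db=7.0):
    """Indices of bins that are local maxima and stand out from neighbors.

    A bin counts as tonal when it exceeds its immediate neighbors and
    exceeds the bins at each offset in neighbor_offsets by margin_db.
    Both parameters are illustrative assumptions.
    """
    reach = max(neighbor_offsets)
    peaks = []
    for k in range(reach, len(spectrum_db) - reach):
        p = spectrum_db[k]
        is_local_max = p > spectrum_db[k - 1] and p >= spectrum_db[k + 1]
        stands_out = all(p > spectrum_db[k - d] + margin_db and
                         p > spectrum_db[k + d] + margin_db
                         for d in neighbor_offsets)
        if is_local_max and stands_out:
            peaks.append(k)
    return peaks


spectrum = [10, 12, 11, 10, 40, 34, 12, 10, 11, 30, 10, 10]
print(find_tonal_peaks(spectrum))
```

Everything that is not picked as a tone would then be summed per critical band to form the noise maskers.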
Graphic Illustration
[Figure: spectrum with the identified maskers; X: tonal, O: noise]
Three Types of Frequency Masking
Noise-Masking-Tone (NMT): SMR = 4 dB
Tone-Masking-Noise (TMN): SMR = 24 dB
Noise-Masking-Noise (NMN): SMR = 26 dB
Asymmetry: noise masks a tone (NMT) much more effectively than a tone masks noise (TMN)
Step 3: Decimation and Reorganization of Maskers
“Smear” each masker within its critical band
Use either a masking function (Model 1) or a spreading function (Model 2)
Adjust the calculated threshold by incorporating a “quiet” mask: the masking threshold for each frequency when no other frequencies are present
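The "quiet" mask is the absolute threshold of hearing. The slides do not give a formula, so the sketch below assumes the widely used Terhardt approximation, which returns the threshold in dB SPL as a function of frequency. The function names are illustrative.

```python
import math


def threshold_in_quiet_db(f_hz):
    """Approximate absolute threshold of hearing in dB SPL (Terhardt fit)."""
    f_khz = f_hz / 1000.0
    return (3.64 * f_khz ** -0.8
            - 6.5 * math.exp(-0.6 * (f_khz - 3.3) ** 2)
            + 1e-3 * f_khz ** 4)


def is_audible_in_quiet(f_hz, spl_db):
    """True when a tone at f_hz exceeds the threshold with no masker present."""
    return spl_db > threshold_in_quiet_db(f_hz)


for f in (100, 1000, 3300, 12250):
    print(f, round(threshold_in_quiet_db(f), 1))
print(is_audible_in_quiet(12250, 10))
```

Under this approximation the ear is most sensitive around 3-4 kHz, and a 10 dB SPL component near 12 kHz falls below the threshold.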
Step 4: Calculation of Individual Masking Thresholds
Calculate a masking threshold for each subband in the polyphase filter bank
Model 1: selects the minimum of the masking threshold values in the range of each subband; inaccurate at higher frequencies, because subbands are linearly distributed but critical bands are not
Model 2: if the subband is wider than the critical band, use the minimal masking threshold in the subband; if the critical band is wider than the subband, use the average masking threshold in the subband
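The mismatch between linearly spaced subbands and critical bands can be made concrete by measuring each subband's width in Bark. The sketch below assumes a common Zwicker-style Hz-to-Bark approximation, a 48 kHz sampling rate, and 32 equal subbands; it treats a subband wider than 1 Bark as "wider than a critical band" and averages thresholds directly in dB. All of these choices are simplifying assumptions for illustration.

```python
import math


def hz_to_bark(f_hz):
    """Approximate critical-band rate in Bark (Zwicker-style fit)."""
    return (13.0 * math.atan(0.00076 * f_hz)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))


def subband_width_bark(index, sample_rate_hz=48_000, n_subbands=32):
    """Width of one uniform subband, measured in Bark."""
    width_hz = sample_rate_hz / 2 / n_subbands
    return hz_to_bark((index + 1) * width_hz) - hz_to_bark(index * width_hz)


def subband_threshold_db(thresholds_db, width_bark):
    """Model 2 rule from the slide: minimum if the subband is wider than
    a critical band, otherwise the average (taken in dB here)."""
    if width_bark > 1.0:
        return min(thresholds_db)
    return sum(thresholds_db) / len(thresholds_db)


for i in (0, 1, 8, 31):
    print(i, round(subband_width_bark(i), 2))
```

The lowest subband spans several critical bands while the highest covers only a small fraction of one, which is why a single rule per subband is inaccurate at one end or the other.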
Graphic Illustration
[Figure: individual masking thresholds for tonal components and for noise components]
Step 5: Calculating Global Masking Thresholds
The hard work is done; now we just calculate the signal-to-mask ratio (SMR) per subband
SMR = signal energy / masking threshold
The audio codec uses the calculated SMR values to decide how many bits to spend on each subband
This is where most of the compression occurs: a coefficient that lies below the masking threshold does not need any bits
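How the SMR turns into a bit budget can be sketched with the rule of thumb that each quantizer bit buys about 6 dB of signal-to-noise ratio. This is a deliberately simplified assumption: a real encoder allocates bits iteratively under a total bit-rate constraint, and the names and the `max_bits` cap below are hypothetical. In dB, the ratio of signal energy to masking threshold becomes a difference.

```python
import math

DB_PER_BIT = 6.02   # approximate SNR gain per quantizer bit


def smr_db(signal_db, mask_db):
    """Signal-to-mask ratio in dB (ratio of energies = difference in dB)."""
    return signal_db - mask_db


def bits_for_subband(signal_db, mask_db, max_bits=15):
    """Smallest bit count that pushes quantization noise below the mask."""
    smr = smr_db(signal_db, mask_db)
    if smr <= 0:
        return 0                      # fully masked: spend no bits
    return min(max_bits, math.ceil(smr / DB_PER_BIT))


subbands = [(40, 50), (62, 50), (80, 50)]   # (signal dB, mask dB) examples
print([bits_for_subband(s, m) for s, m in subbands])
```

A subband whose signal lies below its masking threshold gets zero bits, which is the source of most of the savings.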
Graphic Illustration
Psychoacoustic Model Summary
Input: audio frame
Step 1: Spectral Analysis and SPL Normalization
Step 2: Identification of Tonal and Noise Maskers
Step 3: Decimation and Reorganization of Maskers
Step 4: Calculation of Individual Masking Thresholds
Step 5: Calculating Global Masking Thresholds
Output: Signal-to-Masking Ratios (SMR)
Example: Calculating Signal Energy
Calculating Masking Thresholds
SMR Results
How Is Perceptually Lossless Compression Achieved?
[Figure: coefficients A, B, C, and D plotted against the masking threshold]
Coefficient A requires bits, but coefficient B does not (it is masked)
Question: how about coefficients C and D?
Summary of Perceptual Audio Coding
Psychoacoustics
Frequency dependency: human ears are most sensitive to 2-4 kHz
Masking: a tone can be inaudible because of the presence of another one (close in frequency or time)
Asymmetry: noise-masking-tone is easier than tone-masking-noise
MP3
Time-to-frequency transformation by filter bank or modified discrete cosine transform
Psychoacoustic Model 1 or 2 produces the signal-to-masking ratio (SMR) that guides the bit allocation process for each subband
Perceptually lossless at bit rates of 64-128 kbps
Headphone Technology
http://www.technologyreview.com/read_article.aspx?id=17642&ch=infotech
Audio Coding Techniques (II)
MP3 Audio Compression: Filter bank/Modified DCT; Psychoacoustic Models; Bit Allocation
Beyond technical issues: Legal, practical, and ethical issues; Open discussions
Legal Issues Surrounding MP3
It's a civil offense, punishable by fine, if you distribute music that you don't own the rights to
It's a criminal offense to copy music illegally and then redistribute it for financial gain
There is a great deal of uncertainty about how copyright laws should function in the digital world, but the laws themselves are clear
The Story of Home-Taping Nightmare
In the 1970s, tapes became easy to duplicate at home, and nobody was caught as a copyright violator, right?
The economics of the entire system actually collapsed and was only revived by the forced implementation of an entirely new audio format, the compact disc (CD)
A tax on all blank tapes and taping mechanisms was created in accordance with the Audio Home Recording Act of 1992 to offset lost revenues
Now Comes the MP3 Nightmare
The internet makes downloading and sharing MP3s easy
Those mammoth companies are still going to sue every college student they find with MP3s on their site
On 9 October 1998, the RIAA filed for a temporary restraining order to prevent San Jose-based Diamond Multimedia from selling its new MP3 player
Called the "Rio," this player retails for $199 and is essentially a Walkman for MP3 files
What is Legal?
Most MP3 files on the internet are illegal, except:
Recorded works to which you personally own the copyrights
Recorded works in the public domain
As long as you keep your MP3s in the privacy of your own hard drive and not on the Web, you are very hard to catch and relatively harmless
MP3: The Transformation of Recording Industry
Why didn't the ISO/IEC address the copyright issue when developing MPEG-1?
Its members weren't necessarily thinking about the legal ramifications but instead focused on creating an effective technology
Unsuccessful fight-back strategy by the RIAA: search and destroy
Only by systematic, consistent, massive legal action can the record companies possibly hope to win this war
Watermarking: a Technical Savior?
RIAA's Secure Digital Music Initiative (SDMI) failed around 2000
Your midterm project might have shown this
No existing watermarking technique is good enough to win this war against piracy
Ultimately, it's all about a workable revenue model. Once that's been established, then perhaps the quality and convenience of the MP3 format can be seen as a boon to the industry instead of a threat.
Ethical Issues
Copying music for your own private use is cool, but posting that music to a Web site or distributing it in any way is not
By doing this you are robbing people who worked very hard to create the music you like
Think about it: is there any difference between downloading an album from the internet for free and stealing an album from the local store?
Open Discussions
Who should get paid?
Is there a better business model?
Is there any better technical solution than watermarking to fight against piracy?
Which side will you take? A defender of the RIAA or a hacker?