Perceptual Coding of General Purpose Audio
By
Syed Masroor Hassan Zaidi, MSc Communication Engineering, The University of York, UK. Now at Warid Telecom, Pakistan.
1.0 Introduction

Transmitting or storing one second of CD-quality audio requires about 1.4 megabits, roughly one eighth of the capacity of a typical floppy diskette. This amount of data can be reduced significantly, without any apparent loss, using compression techniques. Compression ratios of approximately 10 to 15 can be achieved with various signal processing and source coding approaches by removing redundancies and those components of the audio signal that are inaudible to the human auditory system (idea adapted from [8]). In the context of this report, coding is defined as "the process of reducing the bit-rate required to store or transmit an audio signal". The input to a coder is a digital audio signal and its output is a lower-rate digital audio signal. Historically, coders can be divided into four kinds: lossless coders, lossy coders, numerical coders and source coders. Lossless coding is reversible, which means the original signal can be reconstructed bit by bit. In lossy coding techniques, on the other hand, the output signal is an approximation of the original input, and the loss, and hence the quality of the output, depends entirely on the kind of coding used.

2.0 Acoustic Signal

The human auditory system responds to acoustic signals. The atmosphere is made up of tiny molecules; these molecules have energy and produce air pressure. When a person speaks or claps their hands, this generates variations in that air pressure. Clapping the hands squashes air molecules together; they then move apart and, in reaction to the lower pressure, move back together again, then apart, then together, and so on. This causes a knock-on effect that propagates throughout the room. The molecules near the object do not actually travel through the air towards the ear: it is this knock-on effect, and the pressure variation it carries, that constitutes the acoustic signal.
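Before moving on to the auditory system, the bit-rate figures quoted in the introduction can be checked with a short sketch. The parameters below are the standard CD values (44.1 kHz sampling rate, 16 bits per sample, two channels), and the 10-15 compression ratios are simply the range quoted above, not a property of any particular codec.

    # Back-of-the-envelope check of the bit rates quoted in the introduction.
    SAMPLE_RATE_HZ = 44_100      # CD sampling rate
    BITS_PER_SAMPLE = 16         # CD word length
    CHANNELS = 2                 # stereo

    cd_bit_rate = SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS
    print(f"CD bit rate: {cd_bit_rate} bit/s (~{cd_bit_rate / 1e6:.2f} Mbit/s)")

    # With the 10-15x compression ratios mentioned above:
    for ratio in (10, 15):
        print(f"ratio {ratio}: ~{cd_bit_rate / ratio / 1000:.0f} kbit/s")

This reproduces the 1.4 megabits per second figure and shows that a ratio of 10 to 15 brings the rate down to roughly 94-141 kbit/s.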
3.0 The Human Auditory System

In order to understand audio source coding techniques, the human auditory system cannot be ignored. For analysis purposes it is naturally subdivided into the outer, middle and inner ear. The acoustic signal described in the previous section travels through the air, is captured by the pinna and is then funnelled through the auditory canal. At the end of the auditory canal the sound waves hit the tympanic membrane and make it vibrate. This vibration is then forwarded to the cochlea through the ossicles (malleus, incus and stapes). The main function of these three small bones is to match the impedance of the air to that of the fluid in the inner ear; the outer ear works at a lower impedance than the inner ear, and without this impedance matching mechanism most of the acoustic energy would be reflected back.
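To see why this matching matters, a rough sketch can estimate how much acoustic power would pass directly from air into the cochlear fluid without the middle ear. The impedance values below are textbook approximations for air and for water (used here as a stand-in for the cochlear fluid); they are assumptions for illustration, not figures from this report.

    # Fraction of acoustic power transmitted across an air/water boundary,
    # ignoring the middle ear (plane-wave, normal-incidence approximation).
    Z_AIR = 415.0        # characteristic acoustic impedance of air, Pa*s/m (approx.)
    Z_WATER = 1.48e6     # characteristic acoustic impedance of water, Pa*s/m (approx.)

    transmitted = 4 * Z_AIR * Z_WATER / (Z_AIR + Z_WATER) ** 2
    print(f"power transmitted: {transmitted * 100:.2f} %")   # roughly 0.1 %
    print(f"power reflected:   {(1 - transmitted) * 100:.2f} %")

Under these assumptions only about 0.1 % of the incident power would be transmitted, which is why the ossicles are needed.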
Fig. 1. Anatomy of the human ear. Adapted from www.open2.net.
The cochlea is a fluid-filled structure that houses the endings of the auditory nerve. When the ossicles vibrate, the fluid in the cochlea moves, and this movement stimulates the auditory nerve endings. The auditory nerve carries these signals to the brain for further processing.

3.1 The cochlea

The cochlea acts as a transducer: it converts acoustic vibrations into electrical neural activity. It also performs a spectral analysis of the sound waves that is very significant for the human perception of sound. The cochlea is a bony-walled spiral cavity (similar to a snail shell) filled with fluid. The widest end, near the oval window, is called the base and the narrowest end is called the apex. As Fig. 2 shows, the cochlea is divided by two membranes, Reissner's membrane and the basilar membrane, creating three fluid-filled ducts: the scala vestibuli, the scala tympani and the scala media. The vestibular and tympanic ducts are connected to each other at the apex by a small aperture called the helicotrema. The scala media is an entirely separate compartment with a different fluid composition. Just below the tectorial membrane and on top of the basilar membrane is the organ of Corti, with its rows of hair cells. "In each human cochlea there is one row of inner hair cells (closest to the inside of the cochlear spiral) and up to five rows of outer hair cells" [3]. The inner hair cells transform the vibration of the basilar membrane into electrical spikes, while the outer hair cells are responsible for changing the mechanical properties of the basilar membrane.
Fig. 2. Cross-section of the cochlea. Adapted from [2].
The width of the basilar membrane is not uniform: it is narrowest near the base, by the oval window, and widest near the helicotrema at the apex. The whole structure acts as a mechanical spectrum analyser, which performs a spectral analysis of the sound. The part of the basilar membrane that resonates with the applied sound changes with frequency: higher frequencies cause a resonance near the oval window, while lower frequencies cause resonances further away. It is widely accepted that the distance from the apex is a logarithmic function of frequency, so tones separated in octave steps stimulate evenly spaced resonances along the basilar membrane. The prediction of the particular location of resonance on the basilar membrane is known as the place theory, or place analysis, of sound: the frequency of vibration determines where on the basilar membrane the maximum vibration occurs. The following diagrams show that there is a very strong relationship between frequency and distance along the basilar membrane.
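This logarithmic place-frequency map (see Fig. 3 below) is often approximated by the Greenwood function. The sketch below uses the commonly quoted human constants for that function; they come from the psychoacoustics literature, not from this report, so treat them as an assumption.

    import numpy as np

    def greenwood_frequency(x):
        """Approximate characteristic frequency (Hz) at relative position x along
        the basilar membrane, with x = 0 at the apex and x = 1 at the base.
        Constants are the usual human values (A = 165.4, a = 2.1, k = 0.88)."""
        A, a, k = 165.4, 2.1, 0.88
        return A * (10 ** (a * x) - k)

    for x in np.linspace(0.0, 1.0, 6):
        print(f"x = {x:.1f} -> {greenwood_frequency(x):8.0f} Hz")

Under these assumed constants the map runs from roughly 20 Hz at the apex to about 20 kHz at the base, consistent with the range of human hearing.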
The organ of Corti sits on top of the basilar membrane and acts as a sensor for its vibration. It contains the hair cells, which sense the vibration. When the basilar membrane moves, the organ of Corti moves with it and the hairs interact with the tectorial membrane. The resulting shearing causes the hairs to bend. The hairs are interconnected by tip-links, and when they bend one way the tip-links pull open ion channels, which behave like small trap doors.
Fig. 3. Place-frequency representation in the cochlea. Adapted from [5].
Fig. 4. Travelling wave on the basilar membrane. Adapted from [5].
When these channels are open, endolymph, the fluid from the scala media, flows into the hair cells. The hair cells have a resting electrical potential of about -45 mV and are depolarised by this process. As a hair cell depolarises it causes spikes in the auditory nerve, known as nerve firings. Nerve firing is therefore triggered by the deflection of the hairs, and the resulting signals are sent to the brain along the auditory nerve. The nerve firings do not map the vibration of the basilar membrane perfectly: at higher frequencies the firing becomes discontinuous, but it keeps the same phase relationship. The firing rate per second conveys the amplitude of the sound. In this way the ear converts mechanical energy into electrical signals.
Fig. 5. Hair cells showing their interconnections. Adapted from [5].
Frequencies below about 50 Hz do not change the pattern of vibration, and such vibrations can no longer be measured from the nerve firing rate [8].

4.0 Levels and Loudness

"The ear can detect a sound pressure variation of only 2×10⁻⁵ Pascals r.m.s. and so this figure is used as a reference against which sound pressure level (SPL) is measured" [8]. In audio measurement the sensation of loudness is a logarithmic function of SPL, and as a result it is expressed in decibels (dB).
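The conversion from pressure to SPL is a simple logarithmic formula, sketched below; the example pressures are illustrative values chosen here, not figures from the report.

    import math

    P_REF = 2e-5  # reference pressure in Pa (threshold of hearing, as quoted above)

    def spl_db(pressure_pa):
        """Sound pressure level in dB relative to 2x10^-5 Pa."""
        return 20 * math.log10(pressure_pa / P_REF)

    for p in (2e-5, 2e-3, 0.2, 20.0):   # example r.m.s. pressures in Pa
        print(f"{p:8.5f} Pa -> {spl_db(p):6.1f} dB SPL")

The reference pressure itself maps to 0 dB SPL, and each factor of ten in pressure adds 20 dB.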
The full range of hearing exceeds 130 dB, but the extremes of this range are painful to listen to, so it is rarely necessary to reproduce audio over the whole range. If the dynamic range is compressed too far, however, the audio also becomes difficult to hear. The frequency response of the ear changes with SPL. "The subjective response to level is called loudness and is measured in phons" [8]. The loudness scale, measured in phons, and the sound pressure level scale, measured in dB SPL, coincide only at 1 kHz; at other frequencies the phon scale deviates from the SPL scale. The reason is that the phon scale shows the actual sound pressure level judged by the human ear to be as loud as a reference level at 1 kHz. More details on this topic can be found in [8]. Measuring loudness directly is nearly impossible because it is a subjective reaction. A further complication, beyond the level-dependent frequency response, is that people listen to a sound in order to draw conclusions about the source that produced it. For example, a listener may describe the sound of a helicopter that is far away as loud: the source generating the sound is clearly very loud, and even though the listener is far away, the distance is compensated for.

5.0 Beats

If two sine waves of slightly different frequencies are added together linearly, the envelope of the combined signal varies as the two waves move in and out of phase. If the frequency transform could be calculated with infinite accuracy, and the amplitudes are constant, there would be no envelope modulation; such a measurement, however, requires a long period of time. Where the available time is short, the frequency discrimination falls and the bands in which the energy is detected become wider. "The rate at which the envelope amplitude changes is called the beat frequency" [8]. This beat frequency is not present in the input signal. The author of [8] also notes that "Beats are an artifact of finite frequency resolution transforms". By measuring beats we can also measure the critical bandwidth.
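The beat phenomenon is easy to reproduce numerically. The sketch below adds two tones a few hertz apart (the particular frequencies and durations are arbitrary choices for illustration) and shows that the envelope fluctuates at the difference frequency even though no component at that frequency is present in the signal.

    import numpy as np

    fs = 8000                      # sample rate in Hz (arbitrary for this demo)
    t = np.arange(0, 1.0, 1 / fs)  # one second of signal
    f1, f2 = 440.0, 444.0          # two tones 4 Hz apart

    x = np.sin(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t)

    # The envelope of the sum is |2*cos(pi*(f2 - f1)*t)|, so it fluctuates at
    # the beat frequency f2 - f1 = 4 Hz even though no 4 Hz tone is present.
    envelope = np.abs(2 * np.cos(np.pi * (f2 - f1) * t))
    print("max |x|     :", np.max(np.abs(x)))   # about 2 when the tones are in phase
    print("min envelope:", np.min(envelope))    # near 0 when they cancel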
6.0 Perceptual Coding

Source coding can be lossless or lossy. Source coding removes redundancies by estimating a model of the source that generated the signal. This model can be mathematical; an example is the "transform gain" that arises when a bank of filters decomposes the audio signal into frequency components. The goal of source coders is to increase the signal-to-noise ratio (SNR), or to decrease other metrics such as the mean square error (MSE) of the audio signal, with the help of accurate models. Typical source coding methods are linear predictive coding (LPC), sub-band coding, transform coding, vector quantization and so on. Well-known source coding examples are delta modulation, G.728, DPCM, LD-CELP, ADPCM, LPC-10E and G.721. Details of these algorithms are beyond the scope of this report; information about them can be found in source coding textbooks.

Numerical coding is mostly a lossless type of coding technique. Numerical coding means that a mathematical model is used to remove the redundancies from the coded data. Nowadays, newer lossy numerical coders provide very fine scalability of the bit rate. Examples of common numerical coding techniques are Huffman coding, arithmetic coding and Lempel-Ziv coding. Common numerical coders use entropy coding for the actual bit-rate reduction, whereas source coding most often uses a signal model to reduce the signal redundancies, which produces a lossy coding system. Both methods consider the behaviour of the source, and both are used to reduce the amount of redundant information in the signal. Perceptual coding, in contrast, uses a model of the destination, that is, the human listener who will use the data, rather than a model of the source of the signal. Perceptual coding removes those parts of the signal which cannot be perceived by a human.
Fig. 6. Block diagram of a perceptual coder: the input passes through a coding filter bank to give the filtered audio signal; the filter-bank values are quantized under quantization and rate control, guided by the perceptual threshold from a perceptual model; noiseless coding and bitstream packing then produce the coded bitstream. Adapted from [6].
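To make the structure in Fig. 6 concrete, here is a deliberately simplified sketch: an FFT stands in for the coding filter bank, and a crude fixed-offset rule (smearing the spectrum and allowing quantization noise 20 dB below it) stands in for the perceptual model. Everything here is our own simplification for illustration and is not taken from [6] or from any real codec.

    import numpy as np

    def toy_perceptual_quantize(frame, offset_db=20.0):
        # "Coding filter bank": here just a windowed FFT of one frame.
        n = len(frame)
        coeffs = np.fft.rfft(frame * np.hanning(n))
        power = np.abs(coeffs) ** 2 + 1e-20

        # Crude "perceptual model": smear the power spectrum over neighbouring
        # bins (a stand-in for a spreading function) and allow the quantization
        # noise to sit offset_db below that smeared level.
        smeared = np.convolve(power, np.ones(9) / 9.0, mode="same")
        noise_allowed = smeared * 10 ** (-offset_db / 10.0)

        # "Quantization and rate control": a uniform quantizer per bin whose
        # step size puts the noise power near the allowed level (step**2 / 12).
        step = np.sqrt(12.0 * noise_allowed)
        q_re = np.round(coeffs.real / step)   # integers that would then be
        q_im = np.round(coeffs.imag / step)   # entropy coded and packed

        # Decoder side: rebuild the coefficients and invert the filter bank.
        return np.fft.irfft((q_re + 1j * q_im) * step, n)

    frame = 0.5 * np.sin(2 * np.pi * 440 * np.arange(2048) / 44100)
    decoded = toy_perceptual_quantize(frame)
    err = frame * np.hanning(len(frame)) - decoded
    print("r.m.s. error:", np.sqrt(np.mean(err ** 2)))

A real coder would use an MDCT-style filter bank, a genuine psychoacoustic model and entropy coding of the quantized values, as discussed in the following sections.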
Perceptual coding is a lossy coding method, which means the original input signal cannot be reconstructed bit by bit. The information removed by a perceptual coder because it is imperceptible to a human is often referred to as the irrelevancy of the audio signal. In practice, most coders attempt to eliminate both irrelevancy and redundancy from the input signal in order to reduce the required bit rate. The SNR of a perceptual coder is comparatively lower than that of a source coder, but a perceptual coder has higher perceived quality than a source coder at the same bit rate.

6.1 Auditory Masking and the Perceptual Model

The phenomenon of auditory masking plays a very important part in perceptual coding. When a strong signal occurs near a weaker signal, either in frequency or in time, the weaker signal is in many situations imperceptible because of the limited detection ability of the human auditory system. Before describing the auditory filter bank, it is worth explaining the critical bandwidth, or critical bands: "Critical bandwidth or equivalent rectangular bandwidth is a range of frequencies over which the masking signal to noise ratio remains approximately constant".

6.2 Auditory Filter Bank

The human cochlea can be modelled as a bank of many filters. The shape of the filter at any particular position on the cochlea is known as the cochlear filter for that point, and a critical band corresponds approximately to the pass-band bandwidth of that filter. Measurements of the equivalent rectangular bandwidth (ERB) result in filters that are slightly narrower at the lower frequencies and substantially narrower at mid and high frequencies. The cochlear filter can be thought of as consisting of two parts, a high-pass filter and a low-pass filter, one of which is set by the action of the hair cells; this is also called the neural tuning of a tonal stimulus. The time and frequency resolution of the cochlear filters varies by at least 10:1, which means that an audio coding filter bank has to be able to accommodate changes in resolution of at least 10:1.
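For reference, the ERB of the auditory filter is commonly approximated by the Glasberg and Moore formula ERB(f) = 24.7(4.37 f/1000 + 1) Hz. This formula comes from the general psychoacoustics literature rather than from this report; note that in absolute terms the ERB widens with centre frequency.

    def erb_hz(centre_freq_hz):
        """Equivalent rectangular bandwidth (Hz) of the auditory filter at a
        given centre frequency, using the Glasberg & Moore approximation."""
        return 24.7 * (4.37 * centre_freq_hz / 1000.0 + 1.0)

    for f in (100, 500, 1000, 4000, 10000):   # centre frequencies in Hz
        print(f"{f:6d} Hz -> ERB ~ {erb_hz(f):6.1f} Hz")

Under this approximation the ERB is only a few tens of hertz at low frequencies but around a kilohertz at 10 kHz, which is why the noise must be controlled most finely at the low end.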
6.3 Issues in Filter-Bank Design

There are two conflicting requirements in filter-bank design. The first requirement is good frequency resolution, which is important for the following reasons:

1- Source coding gain.
2- The auditory filters are quite narrow at low frequencies, which means the filter bank must provide good control of the noise there.

The problem with good frequency resolution is poor time resolution. The second requirement of filter-bank design is good time resolution, which is important to control pre-echo and post-echo effects. The problem with good time resolution is that it does not allow enough redundancy removal, because the signal is not sufficiently decorrelated and there is not enough frequency resolution for efficient and effective coding at low frequencies. So, while designing the filter bank, care must be taken that it has good time as well as good frequency resolution in order to code efficiently. An efficient coder is therefore likely to use a time-varying filter bank that responds both to the perceptual requirements and to the signal statistics.

To implement the psychoacoustic threshold while maintaining an acceptable bit rate, perceptual coders use quantization and rate control. Many approaches are used to tackle the quantization and rate-control issue, all with the same common goal: first enforce the required rate, then implement the psychoacoustic threshold, and finally, when there are not enough bits, add the noise in those parts of the signal where it is least offensive.
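The time/frequency trade-off described above can be quantified with a short sketch. The transform lengths below are illustrative choices of ours (they happen to match the long and short block sizes used by some MDCT-based coders, but that choice is not taken from this report).

    FS = 44_100  # sampling rate in Hz

    # For an N-point transform, the frequency resolution is roughly fs/N and
    # the time resolution (frame length) is N/fs: improving one worsens the other.
    for n in (256, 2048):
        freq_res = FS / n           # Hz per bin
        time_res = 1000.0 * n / FS  # frame length in ms
        print(f"N = {n:5d}: ~{freq_res:6.1f} Hz per bin, frame ~{time_res:5.1f} ms")

A long transform gives fine frequency resolution but a long frame (risking pre-echo around transients), while a short transform confines errors in time but smears them across frequency, which is why time-varying filter banks are attractive.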
7.0 Examples of Codecs

In terms of performance, the best audio coder is MPEG-2 AAC, where AAC stands for Advanced Audio Coding. It is an audio format that was first used by the Japanese digital broadcasting system [7], known as ISDB (Integrated Services Digital Broadcasting). MPEG-2 AAC is also the main technology currently used by XM Radio; one of the two satellite radio services currently operating in the US is based on this codec. MPEG-4 AAC is an extended version of MPEG-2 AAC, but most of the advanced tools available in MPEG-4 AAC are not required by the companies creating products. Detailed information about MPEG and the various codecs can be found in [8].

8.0 Conclusion

Perceptual coding is designed for final-delivery applications. It is not recommended for recording signals, or for other situations where the signals need to be processed after the perceptual coding has been applied. Perceptual coding performs very well in situations where the signal does not need to be reprocessed, equalized or otherwise modified between the coder and the final delivery to the target listener. Audio perceptual coding is a very powerful technique for the final delivery of sound, especially where the delivery bit rate is very limited or the storage space is very small.

References:

[1] www.open2.net, last visited 25 July 2006.
[2] http://www.dmc.org/chm/img/cochlea-p.jpg, last visited 25 July 2006.
[3] C. Plack, The Sense of Hearing.
[4] http://www.vimm.it/cochlea/cochleapages/theory/bm/bm.htm, last visited 25 July 2006.
[5] http://www.sv.vt.edu/classes/ESM4714/Student_Proj/class96/cotton/Links.jpg, last visited 25 July 2006.
[6] http://www.ece.rochester.edu/~gsharma/SPS_Rochester/presentations/JohnstonPerceptualAudioCoding.pdf, last visited 25 July 2006.
[7] http://www.vialicensing.com/products/mpeg2aac/standard.html, last visited 25 July 2006.
[8] Audio Compression Handout.
[9] http://www.crc.ca/en/html/aas/home/source_coding/source_coding, last visited 25 July 2006.