EEM.ssr: Speaker & Speech Recognition
Speech Analysis
by Dr Philip Jackson, lecturer in speech & audio
Centre for Vision, Speech & Signal Processing, Department of Electronic Engineering
http://www.ee.surrey.ac.uk/Teaching/Courses/eem.ssr
What’s the point of analysing speech?
• Speech analysis, or speech processing, transforms a speech waveform into a representation from which its features can be extracted, by either:
– Human visual inspection, e.g., by a speech scientist, speech therapist, or forensic phonetician
– Computer analysis, e.g., for automatic speech recognition, speaker recognition, or paralinguistic processing
And what does that mean?
• Suitable could be:
– amenable to human visual inspection
– using a small number of bits per second (for transmission or storage)
– compatible with the models in a speech recognizer
– in line with our understanding of human auditory processing
Cochlear section
• Cochlea, or inner ear, has a spiral form:
– vestibular canal
– basilar membrane
– tympanic canal
– auditory nerve
Response of the cochlea
Basilar membrane
• sound enters at the stapes
• travels along the basilar membrane
• vibrates at matching position
• activates auditory nerves
Short-term spectrum
• Represents the distribution of power with respect to frequency over a time interval centred at time t, like a vertical slice through the spectrogram
• From a source-filter perspective, it gives us some information about the shape of the vocal tract at time t
• From a human speech perception view, it provides similar information to that sent from the cochlea to the auditory nerve
Computing the ST-spectrum
• Analogue-to-Digital (A/D) Conversion – convert the analogue signal from the microphone into a digital signal
• Windowing – select a short section of speech, centred at time t, and smooth
• Frequency analysis – estimate the distribution of power with respect to frequency
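The windowing and frequency-analysis steps can be sketched as follows, starting from a signal that has already been digitised. This is a minimal illustration rather than the course's own code: the Hamming window is one common choice of smoothing window, and the function names, the 1 kHz test tone and the 64-sample frame are all assumptions made for the example. The DFT is written out directly in pure Python so the block is self-contained; an FFT routine would give the same values much faster.

```python
import cmath
import math

def hamming(N):
    """Hamming window of length N (an assumed, commonly used smoothing window)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def short_term_power_spectrum(x, centre, N):
    """Power spectrum of the N-sample frame of x centred on sample index `centre`."""
    start = centre - N // 2
    if start < 0 or start + N > len(x):
        raise ValueError("frame falls outside the signal")
    frame = x[start:start + N]                         # select a short section
    xw = [s * w for s, w in zip(frame, hamming(N))]    # smooth the frame edges
    # Direct DFT, O(N^2) operations
    X = [sum(xw[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
         for k in range(N)]
    return [abs(Xk) ** 2 for Xk in X]                  # power in each frequency bin

fs = 8000                                              # sample rate in Hz
x = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(800)]
power = short_term_power_spectrum(x, centre=400, N=64)
peak_bin = max(range(64 // 2 + 1), key=lambda k: power[k])   # search 0 .. fS/2
print("strongest component near", peak_bin * fs / 64, "Hz")
```

Bin k corresponds to frequency k·fS/N, so only bins 0 to N/2 are searched; the remaining bins mirror them for a real-valued signal.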
Waterfall display
Speech spectrogram
Derived formant tracks
A/D conversion
• Sampling measures the speech signal at regular intervals, indexed by n
• Quantisation encodes each sample x_n with a discrete value
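The two A/D steps can be illustrated with a small sketch. It assumes a uniform quantiser over the amplitude range [-1, 1] and a synthetic 440 Hz tone standing in for the microphone signal; the function names and parameter values are hypothetical, chosen only for the example.

```python
import math

def sample_and_quantise(signal, duration_s, sample_rate_hz, n_bits):
    """Sample a continuous-time function and quantise each sample uniformly.

    `signal` maps time in seconds to an amplitude in [-1, 1].
    Returns integer codes in the range 0 .. 2**n_bits - 1.
    """
    levels = 2 ** n_bits                          # number of discrete codes
    step = 2.0 / levels                           # quantisation step over [-1, 1]
    n_samples = int(round(duration_s * sample_rate_hz))
    codes = []
    for n in range(n_samples):
        x_n = signal(n / sample_rate_hz)          # sampling: measure at t = n / fS
        code = int(math.floor((x_n + 1.0) / step))
        codes.append(min(max(code, 0), levels - 1))   # clip to the valid codes
    return codes

def tone(t):
    """Synthetic test signal: a 440 Hz sine at half of full scale."""
    return 0.5 * math.sin(2 * math.pi * 440.0 * t)

codes = sample_and_quantise(tone, duration_s=0.01, sample_rate_hz=8000, n_bits=8)
print(len(codes), min(codes), max(codes))
```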
Sample rate
• Nyquist’s theorem: for a signal bandlimited to B Hz, a rate of at least 2B samples per second is needed to encode the signal faithfully
• Human ear is sensitive up to 20 kHz (hence the 44.1 kHz rate for CDs)
• But for speech:
– high quality needs 10 kHz bandwidth, i.e., 20 kHz sample rate
– bandwidth can be reduced to ~4 kHz (8 kHz rate), for telephone quality
– e.g., 8-bit PCM at 8 kHz = 64 kbps
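The arithmetic behind these figures is simple enough to check directly. The helper names below are hypothetical; the sketch only restates the 2B rule and multiplies sample rate by bits per sample.

```python
def min_sample_rate_hz(bandwidth_hz):
    """Lowest sample rate allowed by the 2B rule for a signal bandlimited to B Hz."""
    return 2 * bandwidth_hz

def pcm_bit_rate_bps(sample_rate_hz, bits_per_sample, channels=1):
    """Bit rate of uncompressed PCM in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

print(min_sample_rate_hz(4000))     # telephone bandwidth -> 8000 samples per second
print(min_sample_rate_hz(10000))    # high-quality speech -> 20000 samples per second
print(pcm_bit_rate_bps(8000, 8))    # 8-bit PCM at 8 kHz  -> 64000 bits per second
```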
CD-quality: fS = 44.1 kHz
High-quality speech: fS = 20 kHz
Telephone speech: fS = 8 kHz
Window functions
Frequency analysis
• The Discrete Fourier Transform (DFT) is applied to the windowed digital waveform {x(n): n = 0,…,N−1}.
• With an N-sample window, an N-point complex spectrum is obtained, {X(k): k = 0,…,N−1}.
• The modulus squared gives the power spectrum, |X(k)|².
• The logarithm gives the log-power spectrum, log |X(k)|².
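Discrete Fourier transform
• over a finite period of time
• sampled at regular intervals

Forward transform:
$$X(k) = \sum_{n=0}^{N-1} x(n) \left[ \cos\left(\frac{2\pi k n}{N}\right) - j \sin\left(\frac{2\pi k n}{N}\right) \right] = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}$$

Inverse transform:
$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k) \left[ \cos\left(\frac{2\pi k n}{N}\right) + j \sin\left(\frac{2\pi k n}{N}\right) \right] = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{+j 2\pi k n / N}$$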
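As a check on the transform pair, the sketch below writes both sums out directly, using the standard convention of cos − j sin for the forward transform and cos + j sin with a 1/N factor for the inverse. The function names and the four-sample test signal are illustrative assumptions; this direct form costs N² operations and is meant for reading, not speed.

```python
import math

def dft(x):
    """Forward DFT: X(k) = sum over n of x(n) [cos(2*pi*k*n/N) - j sin(2*pi*k*n/N)]."""
    N = len(x)
    return [sum(x[n] * complex(math.cos(2 * math.pi * k * n / N),
                               -math.sin(2 * math.pi * k * n / N))
                for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse DFT: x(n) = (1/N) sum over k of X(k) [cos(...) + j sin(...)]."""
    N = len(X)
    return [sum(X[k] * complex(math.cos(2 * math.pi * k * n / N),
                               math.sin(2 * math.pi * k * n / N))
                for k in range(N)) / N
            for n in range(N)]

x = [1.0, 2.0, 0.0, -1.0]
X = dft(x)
x_back = idft(X)          # should reproduce x up to rounding error
print(X)
print([round(v.real, 6) for v in x_back])
```

Applying the inverse to the forward transform recovers the original samples, which is the sense in which the Fourier representation is reversible.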
Frequency analysis
• Alternative methods include:
– filter-bank analysis (based on a set of band-pass filters)
– approximations of the spectral envelope, e.g., linear predictive coding (LPC)
Time-frequency resolution 1
• If the window is long then
– the time resolution is poor
– the number of points, N, is large
– there are N points in the spectrum
– so there is fine frequency resolution
– narrow-band frequency analysis, or narrow-band spectrum
Narrow-band spectrum
Time-frequency resolution 2
• If the window is short then
– the time resolution is good
– the number of points, N, is small
– there are N points in the spectrum
– so the frequency resolution is coarse
– broad-band frequency analysis, or broad-band spectrum
Wide-band spectrum
Time-frequency resolution 3
• In summary:
– long window, narrow-band spectrum;
– short window, broad-band spectrum.
• Indeed, the bandwidth-time product cannot exceed a half:
$$BT \le \frac{1}{2}$$
where T = N / f_S and f_S is the sample rate
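The trade-off can be made concrete by computing the window duration T = N / fS and the spacing between DFT bins, fS / N, for a short and a long window. The 16 kHz sample rate and the two window lengths are illustrative assumptions, not values from the slides.

```python
def window_resolution(n_samples, sample_rate_hz):
    """Return (window duration in seconds, DFT bin spacing in Hz)."""
    duration_s = n_samples / sample_rate_hz        # T = N / fS
    bin_spacing_hz = sample_rate_hz / n_samples    # N bins spread over fS
    return duration_s, bin_spacing_hz

for N in (64, 512):
    T, df = window_resolution(N, 16000)
    print(f"N = {N}: window = {T * 1000:.1f} ms, bin spacing = {df:.2f} Hz")
```

The short window localises events in time but spaces its bins widely in frequency; the long window does the opposite.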
Wide-band and narrow-band spectrograms
Mel-frequency filter bank
• Allocation of DFT bins to filters, spaced according to the Mel scale
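One way to allocate DFT bins to Mel-spaced filters is sketched below. Several details are assumptions rather than anything stated on the slide: the Mel formula used (2595 log10(1 + f/700)) is one widely used form, the filters are triangular with edges equally spaced on the Mel scale, and the function names and parameter values are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Hz to Mel, using one common formula (an assumption, not from the slide)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, sample_rate_hz):
    """Triangular weights, one row per filter, over DFT bins 0 .. n_fft // 2."""
    n_bins = n_fft // 2 + 1
    top_mel = hz_to_mel(sample_rate_hz / 2.0)
    # n_filters + 2 edge frequencies, equally spaced on the Mel scale
    edges_hz = [mel_to_hz(top_mel * i / (n_filters + 1))
                for i in range(n_filters + 2)]
    bank = []
    for i in range(1, n_filters + 1):
        lo, mid, hi = edges_hz[i - 1], edges_hz[i], edges_hz[i + 1]
        row = []
        for k in range(n_bins):
            f = k * sample_rate_hz / n_fft             # frequency of DFT bin k
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))      # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))      # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

bank = mel_filter_bank(n_filters=10, n_fft=256, sample_rate_hz=8000)
peaks = [row.index(max(row)) for row in bank]          # strongest bin per filter
print(peaks)
```

Multiplying each row by the power spectrum and summing gives one energy per filter; the filters widen with frequency, so more DFT bins are pooled at the high end.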
The real cepstrum
• Procedure for computing cepstral coefficients from the magnitude spectrum
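A common definition of the real cepstrum, assumed here, is the inverse DFT of the log magnitude spectrum. The sketch follows that chain (DFT, magnitude, logarithm, inverse DFT) in pure Python; the function name and the small floor that guards against log(0) are additions for the example.

```python
import cmath
import math

def real_cepstrum(frame):
    """Real cepstrum of a frame: inverse DFT of the log magnitude spectrum."""
    N = len(frame)
    # Forward DFT of the (already windowed) frame
    X = [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
         for k in range(N)]
    # Log magnitude spectrum; the floor avoids log(0) for empty bins
    log_mag = [math.log(abs(Xk) + 1e-12) for Xk in X]
    # Inverse DFT; the result is real because log_mag is symmetric for real input
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * n / N)
                for k in range(N)).real / N
            for n in range(N)]

# A scaled impulse has a flat magnitude spectrum, so only c(0) is non-zero.
frame = [math.e] + [0.0] * 7
cepstrum = real_cepstrum(frame)
print([round(c, 6) for c in cepstrum])
```

Low-order coefficients describe the smooth spectral envelope, while higher-order ones capture fine structure such as pitch harmonics.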
Mel-frequency cepstrum
• Procedure for computing cepstral coefficients, based on the output from Mel-frequency binning
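A typical version of this procedure, assumed here, takes the logarithm of each Mel filter's output energy and applies a discrete cosine transform. The sketch uses an unnormalised DCT-II; the function name, the floor inside the logarithm and the example energies are hypothetical.

```python
import math

def mel_cepstrum(mel_energies, n_coeffs):
    """Cepstral coefficients from Mel filter-bank energies: log, then DCT-II."""
    M = len(mel_energies)
    log_e = [math.log(e + 1e-12) for e in mel_energies]    # floor avoids log(0)
    return [sum(log_e[m] * math.cos(math.pi * q * (m + 0.5) / M)
                for m in range(M))
            for q in range(n_coeffs)]

# Equal energy in every filter gives a flat log spectrum:
# only the zeroth coefficient is non-zero.
flat = mel_cepstrum([math.e] * 8, n_coeffs=4)
print([round(c, 6) for c in flat])

# A tilted spectrum shows up in the first coefficient.
tilted = mel_cepstrum([8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0], n_coeffs=4)
print([round(c, 3) for c in tilted])
```

Keeping only the first few coefficients retains the broad shape of the Mel spectrum in a compact form.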
Summary of Fourier analysis
• Fourier analysis leads to a frequency representation
– good for visualisation
– is reversible
– continuous and discrete time forms
• Wide- and narrow-band spectra are obtained by adjusting the frame size
• Windowing
– reduces spectral smearing
– allows for adaptation