Slides03

  • November 2019
EEM.ssr: Speaker & Speech Recognition

Speech Analysis
by Dr Philip Jackson, lecturer in speech & audio
Centre for Vision, Speech & Signal Processing, Department of Electronic Engineering
http://www.ee.surrey.ac.uk/Teaching/Courses/eem.ssr

What’s the point of analysing speech?
• Speech analysis, or speech processing, transforms a speech waveform into a representation that is suitable for extracting its features:
• Human visual inspection – e.g., by a speech scientist, speech therapist, or forensic phonetician
• Computer analysis – e.g., for automatic speech recognition, speaker recognition, or paralinguistic processing

And what does that mean?
• Suitable could be:
  – amenable to human visual inspection
  – using a small number of bits per second (for transmission or storage)
  – compatible with the models in a speech recognizer
  – in line with our understanding of human auditory processing

Cochlear section
• Cochlea, or inner ear, has a spiral form:
  – vestibular canal
  – basilar membrane
  – tympanic canal
  – auditory nerve

Response of the cochlea

Basilar membrane

• sound enters at the stapes
• travels along the basilar membrane
• vibrates at matching position
• activates auditory nerves

Short-term spectrum
• Represents the distribution of power with respect to frequency over a time interval centred at time t, like a vertical slice through the spectrogram
• From a source-filter perspective, it gives us some information about the shape of the vocal tract at time t
• From a human speech perception view, it provides similar information to that sent from the cochlea to the auditory nerve

Computing the ST-spectrum
• Analogue-to-Digital (A/D) Conversion
  – convert the analogue signal from the microphone into a digital signal
• Windowing
  – select a short section of speech, centred at time t, and smooth
• Frequency analysis
  – estimate the distribution of power with respect to frequency
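The windowing and frequency-analysis steps can be sketched in a few lines. This is a minimal illustration only, assuming a Hann taper and a direct O(N^2) DFT; the function name `short_term_power_spectrum`, the 8 kHz sample rate and the 1 kHz test tone are hypothetical choices, not taken from the slides.

```python
import cmath
import math

def short_term_power_spectrum(signal, centre, frame_len):
    """Power spectrum of a tapered frame centred on sample index `centre`."""
    start = centre - frame_len // 2
    frame = signal[start:start + frame_len]
    # Windowing: a Hann taper (assumed choice) smooths the frame edges
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    x = [s * w for s, w in zip(frame, window)]
    # Frequency analysis: direct DFT, then modulus squared
    spectrum = [sum(x[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len))
                for k in range(frame_len)]
    return [abs(X) ** 2 for X in spectrum]

fs = 8000        # sample rate in Hz (assumed)
tone_hz = 1000   # test tone (assumed)
signal = [math.sin(2 * math.pi * tone_hz * n / fs) for n in range(800)]

frame_len = 64
power = short_term_power_spectrum(signal, centre=400, frame_len=frame_len)

# Search only the bins from 0 Hz up to fs/2
peak_bin = max(range(frame_len // 2), key=lambda k: power[k])
peak_hz = peak_bin * fs / frame_len
```

Repeating the call for a sequence of `centre` values would give the successive vertical slices of a spectrogram.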

Waterfall display

Speech spectrogram

Derived formant tracks

A/D conversion
• Sampling measures the speech signal at regular intervals, n
• Quantisation encodes the signal x_n with a discrete value
[Figure: sampled signal x_n plotted against sample index n]
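As a rough sketch of the two A/D steps, the code below samples a continuous function at regular intervals and then maps each sample to one of a fixed number of levels. The uniform mid-rise quantiser, the [-1, 1) input range, and the names `sample` and `quantise` are assumptions made for illustration.

```python
import math

def sample(analogue, fs, duration):
    """Measure analogue(t) at regular intervals of 1/fs seconds."""
    n_samples = int(round(duration * fs))
    return [analogue(n / fs) for n in range(n_samples)]

def quantise(x, n_bits):
    """Uniform quantiser for values in [-1, 1); returns integer codes and step size."""
    levels = 2 ** n_bits
    step = 2.0 / levels
    codes = []
    for v in x:
        code = int(math.floor(v / step))
        # Clip to the representable range of signed n-bit codes
        code = max(-(levels // 2), min(levels // 2 - 1, code))
        codes.append(code)
    return codes, step

fs = 8000  # sample rate in Hz (assumed)
x = sample(lambda t: 0.9 * math.sin(2 * math.pi * 200 * t), fs, duration=0.01)
codes, step = quantise(x, n_bits=8)

# Decode each code to the centre of its quantisation interval
reconstructed = [(c + 0.5) * step for c in codes]
max_error = max(abs(a - b) for a, b in zip(x, reconstructed))
```

For inputs inside the quantiser range, the reconstruction error stays within half a step, which is why more bits per sample give a more faithful encoding.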

Sample rate
• Nyquist’s theorem: for a signal bandlimited to B Hz, a rate of at least 2B samples per second is needed to encode the signal faithfully
• Human ear is sensitive up to about 20 kHz (hence the 44.1 kHz rate for CDs)
• But for speech:
  – high quality needs 10 kHz bandwidth, i.e., 20 kHz sample rate
  – bandwidth can be reduced to ~4 kHz (8 kHz rate) for telephone quality
  – e.g., 8-bit PCM at 8 kHz = 64 kbps

CD-quality: f_S = 44.1 kHz

High-quality speech: f_S = 20 kHz

Telephone speech: f_S = 8 kHz
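The sample-rate and bit-rate figures above follow from simple arithmetic, sketched here. The helper names `nyquist_rate` and `pcm_bit_rate` are hypothetical.

```python
def nyquist_rate(bandwidth_hz):
    """Minimum sample rate for a signal bandlimited to bandwidth_hz."""
    return 2 * bandwidth_hz

def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Bit rate of uncompressed PCM in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

telephone_rate = nyquist_rate(4000)                 # 4 kHz bandwidth
telephone_bps = pcm_bit_rate(telephone_rate, 8)     # 8-bit PCM
high_quality_rate = nyquist_rate(10000)             # 10 kHz bandwidth

print(telephone_rate, telephone_bps, high_quality_rate)
```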

Window functions
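For reference, three commonly used window functions can be written as follows. The slides do not say which windows were plotted, so the rectangular, Hann and Hamming windows here are assumed examples, given in their symmetric forms.

```python
import math

def rectangular(N):
    """No tapering: every sample in the frame is weighted equally."""
    return [1.0] * N

def hann(N):
    """Raised cosine that falls to zero at both ends of the frame."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def hamming(N):
    """Raised cosine on a small pedestal (0.08 at the frame edges)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def apply_window(frame, window):
    """Multiply a frame by a window, sample by sample."""
    return [s * w for s, w in zip(frame, window)]

w_hann = hann(5)
w_hamming = hamming(5)
tapered = apply_window([1.0] * 5, w_hann)
```

Tapering the frame edges reduces the discontinuities that a rectangular cut would introduce, at the cost of a wider main lobe in the spectrum.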

Frequency analysis
• Discrete Fourier Transform (DFT) is applied to the windowed digital waveform {x(n): n = 0, …, N−1}.
• With an N-sample window, an N-point complex spectrum is obtained, {X(k): k = 0, …, N−1}.
• The modulus squared gives the power spectrum, |X(k)|^2.
• The logarithm gives the log-power spectrum, log |X(k)|^2.
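A small sketch of the power and log-power spectra for one frame follows. It assumes the natural logarithm (a decibel scale would use 10 log10 instead) and a small floor to avoid taking the log of zero; the function names and the 16-sample cosine test frame are hypothetical.

```python
import cmath
import math

def dft(x):
    """Direct N-point DFT of a real or complex sequence."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def power_spectrum(x):
    """Modulus squared of each DFT coefficient."""
    return [abs(X) ** 2 for X in dft(x)]

def log_power_spectrum(x, floor=1e-12):
    """Natural log of the power spectrum, floored to avoid log(0)."""
    return [math.log(max(p, floor)) for p in power_spectrum(x)]

N = 16
frame = [math.cos(2 * math.pi * 3 * n / N) for n in range(N)]  # 3 cycles per frame

power = power_spectrum(frame)
log_power = log_power_spectrum(frame)
peak_bin = max(range(N // 2 + 1), key=lambda k: power[k])
```

A cosine with exactly three cycles in the frame puts its energy in bin 3 and in the mirror-image bin N − 3.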

Discrete Fourier transform
• over a finite period of time
• sampled at regular intervals

Forward transform:

\[ X(k) = \sum_{n=0}^{N-1} x(n) \left( \cos\frac{-2\pi kn}{N} + j \sin\frac{-2\pi kn}{N} \right) \]

Inverse transform:

\[ x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k) \left( \cos\frac{2\pi kn}{N} + j \sin\frac{2\pi kn}{N} \right) \]
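The forward and inverse transforms can be checked numerically with a direct implementation of the two sums. This is an illustrative O(N^2) sketch using cosine and sine terms; practical systems would use a fast Fourier transform. The names `dft` and `idft` and the four-sample test sequence are hypothetical.

```python
import math

def dft(x):
    """Forward transform: X(k) = sum_n x(n) [cos(2*pi*k*n/N) - j sin(2*pi*k*n/N)]."""
    N = len(x)
    return [sum(x[n] * complex(math.cos(2 * math.pi * k * n / N),
                               -math.sin(2 * math.pi * k * n / N))
                for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse transform: x(n) = (1/N) sum_k X(k) [cos(2*pi*k*n/N) + j sin(2*pi*k*n/N)]."""
    N = len(X)
    return [sum(X[k] * complex(math.cos(2 * math.pi * k * n / N),
                               math.sin(2 * math.pi * k * n / N))
                for k in range(N)) / N
            for n in range(N)]

x = [1.0, 2.0, 0.0, -1.0]
X = dft(x)
x_back = idft(X)   # complex values whose real parts recover x
```

Applying `idft` to the output of `dft` returns the original samples to within rounding error, which is the sense in which the representation is reversible.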

Frequency analysis
• Alternative methods include:
  – filter-bank analysis (based on a set of band-pass filters)
  – approximations of the spectral envelope, e.g., linear predictive coding (LPC)
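To give a flavour of LPC, the sketch below estimates predictor coefficients with the autocorrelation method and the Levinson-Durbin recursion. This is one standard approach and is an assumption here; the slides do not specify a method. The function names and the second-order test filter are hypothetical.

```python
def autocorrelation(x, max_lag):
    """r[lag] = sum_n x(n) x(n - lag) for lag = 0..max_lag."""
    N = len(x)
    return [sum(x[n] * x[n - lag] for n in range(lag, N))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Predictor a[1..p] for x(n) ~ sum_i a[i] x(n - i), plus the residual energy."""
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this model order
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= (1.0 - k * k)
    return a[1:], error

# Test signal: impulse response of the all-pole filter
# h(n) = 1.3 h(n-1) - 0.6 h(n-2) + delta(n)   (coefficients chosen for illustration)
h = [1.0, 1.3]
for n in range(2, 200):
    h.append(1.3 * h[n - 1] - 0.6 * h[n - 2])

r = autocorrelation(h, 2)
coeffs, residual = levinson_durbin(r, 2)
```

The recovered coefficients describe an all-pole filter whose frequency response approximates the spectral envelope of the analysed frame.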

Time-frequency resolution 1
• If the window is long then
  – the time resolution is poor
  – the number of points, N, is large
  – there are N points in the spectrum
  – so there is fine frequency resolution
  – narrow-band frequency analysis, or narrow-band spectrum

Narrow-band spectrum

Time-frequency resolution 2
• If the window is short then
  – the time resolution is good
  – the number of points, N, is small
  – there are N points in the spectrum
  – so the frequency resolution is coarse
  – broad-band frequency analysis, or broad-band spectrum

Wide-band spectrum
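The trade-off between the two cases can be made concrete by computing the window duration and the spacing between DFT points for a long and a short window. The 16 kHz sample rate and the window lengths of 512 and 64 samples are assumed values for illustration, and `analysis_resolution` is a hypothetical helper.

```python
def analysis_resolution(n_samples, fs):
    """Return (window duration in seconds, spacing of DFT points in Hz)."""
    duration_s = n_samples / fs
    bin_spacing_hz = fs / n_samples
    return duration_s, bin_spacing_hz

fs = 16000  # sample rate in Hz (assumed)

long_duration, long_spacing = analysis_resolution(512, fs)    # narrow-band analysis
short_duration, short_spacing = analysis_resolution(64, fs)   # broad-band analysis

print(f"long window:  {long_duration * 1000:.0f} ms, {long_spacing:.2f} Hz between points")
print(f"short window: {short_duration * 1000:.0f} ms, {short_spacing:.2f} Hz between points")
```

With these values the long window spans 32 ms with points every 31.25 Hz, while the short window spans 4 ms with points every 250 Hz.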

Time-frequency resolution 3
• In summary:
  – long window, narrow-band spectrum;
  – short window, broad-band spectrum.

• Indeed, the bandwidth-time product cannot fall below a half:
\[ BT \ge \frac{1}{2} \]
where T = N / f_S is the window duration and f_S is the sample rate

Wide-band and narrow-band spectrograms

Mel-frequency filter bank
• Allocation of DFT bins to filters, spaced according to the Mel scale
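One way to allocate DFT bins to Mel-spaced filters is sketched below. It assumes the common formula mel = 2595 log10(1 + f/700) and triangular filter shapes, neither of which is confirmed by the slides; the function names and the parameter values (10 filters, 256-point DFT, 8 kHz sample rate) are hypothetical.

```python
import math

def hz_to_mel(f):
    """A widely used Mel-scale approximation (assumed here)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, fs):
    """Triangular weights over DFT bins 0..n_fft/2, with edges equally spaced in Mel."""
    n_bins = n_fft // 2 + 1
    top_mel = hz_to_mel(fs / 2)
    edges_hz = [mel_to_hz(top_mel * i / (n_filters + 1))
                for i in range(n_filters + 2)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, centre, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        weights = []
        for k in range(n_bins):
            f = k * fs / n_fft   # centre frequency of DFT bin k
            if lo < f <= centre:
                w = (f - lo) / (centre - lo)     # rising slope
            elif centre < f < hi:
                w = (hi - f) / (hi - centre)     # falling slope
            else:
                w = 0.0
            weights.append(w)
        bank.append(weights)
    return bank, edges_hz

bank, edges_hz = mel_filter_bank(n_filters=10, n_fft=256, fs=8000)
widths_hz = [edges_hz[i + 1] - edges_hz[i] for i in range(len(edges_hz) - 1)]

def filter_bank_energies(power, bank):
    """Weighted sum of power-spectrum bins for each filter."""
    return [sum(w * p for w, p in zip(weights, power)) for weights in bank]
```

Because the edges are equally spaced in Mel, the filters are narrow at low frequencies and progressively wider towards f_S / 2.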

The real cepstrum
• Procedure for computing cepstral coefficients from the magnitude spectrum
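A minimal sketch of the real cepstrum, assuming the usual definition as the inverse DFT of the log magnitude spectrum (the original diagram is not available, so this may differ in detail from the procedure on the slide). The function names and the test frame are hypothetical.

```python
import cmath
import math

def dft(x, inverse=False):
    """Direct DFT, or inverse DFT (including the 1/N factor) when inverse=True."""
    N = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[n] * cmath.exp(sign * 2j * math.pi * k * n / N) for n in range(N))
           for k in range(N)]
    return [v / N for v in out] if inverse else out

def real_cepstrum(frame, floor=1e-12):
    """Inverse DFT of the log magnitude spectrum of a frame."""
    log_magnitude = [math.log(max(abs(X), floor)) for X in dft(frame)]
    return [c.real for c in dft(log_magnitude, inverse=True)]

frame = [1.0, 0.5, 0.25, 0.125, 0.0, 0.0, 0.0, 0.0]
c_original = real_cepstrum(frame)
c_doubled = real_cepstrum([2.0 * v for v in frame])
c_impulse = real_cepstrum([1.0] + [0.0] * 7)
```

Doubling the amplitude of the frame changes only the zeroth coefficient (by log 2), which illustrates how the cepstrum separates overall level from spectral shape.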

Mel-frequency cepstrum
• Procedure for computing cepstral coefficients, based on the output from Mel-frequency binning
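A common way to obtain Mel-frequency cepstral coefficients is to take the logarithm of the filter-bank outputs and apply a discrete cosine transform. The sketch below assumes that recipe with an unnormalised DCT-II; the original diagram is not available, so details may differ. The function names and the flat test input are hypothetical.

```python
import math

def dct_ii(values, n_coeffs):
    """Unnormalised DCT-II: c[i] = sum_m values[m] * cos(pi * i * (m + 0.5) / M)."""
    M = len(values)
    return [sum(v * math.cos(math.pi * i * (m + 0.5) / M)
                for m, v in enumerate(values))
            for i in range(n_coeffs)]

def mel_cepstrum(filter_bank_energies, n_coeffs, floor=1e-12):
    """Cepstral coefficients from Mel filter-bank output energies."""
    log_energies = [math.log(max(e, floor)) for e in filter_bank_energies]
    return dct_ii(log_energies, n_coeffs)

# A flat set of filter outputs: all the information ends up in c[0]
flat_energies = [math.e] * 8
coeffs = mel_cepstrum(flat_energies, n_coeffs=4)
```

In this scheme the zeroth coefficient tracks overall log energy, while the higher coefficients describe progressively finer detail in the spectral envelope.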

Summary of Fourier analysis
• Fourier leads to frequency representation
  – good for visualisation
  – is reversible
  – continuous and discrete time forms
• Wide- and narrow-band spectra obtained by adjusting frame size
• Windowing
  – reduces spectral smearing
  – allows for adaptation
