EEM.ssr: Speaker & Speech Recognition
Speech Communication
by Dr Philip Jackson, Lecturer in Speech & Audio
Centre for Vision, Speech & Signal Processing, Department of Electronic Engineering
http://www.ee.surrey.ac.uk/Teaching/Courses/eem.ssr
World of speech technologies
[Diagram of areas related to speech technology: automatic speech recognition, spoken dialogue processing, speaker recognition, speech perception, spoken language understanding, speech enhancement, speech coding, spoken language generation, emotion in speech, phonetics, speech production, speech modification, speech analysis, speech synthesis]
Speech-related disciplines
[Diagram of disciplines related to speech science: psychology, maths & stats, linguistics, phonetics, acoustics, computer science, electronics, signal processing]
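The speech chain
[Diagram: the speech chain from SPEAKER to LISTENER. Speaker: linguistic level, then physiological level (motor nerves, vocal muscles), with a feedback link through the speaker's own ear and sensory nerves. Sound waves form the acoustic level. Listener: physiological level (ear, sensory nerves), then linguistic level.]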
It comes as naturally as breathing
• Speech is man’s preferred modality
• Can use natural language for interacting with complex systems
• Hands-free
• Eyes-free
• Small footprint
• Requires no training
Ideas and language
• Ideas are concepts or abstract notions
• Language has a grammar and syntax, and is made up of words
• We develop our understanding of the world at the same time as we are learning to talk:
  – Many of our thoughts are framed in terms of words
  – Language (and culture) affect the way we think
Written vs. spoken language
• Written language
  – discrete words separated by spaces
  – usually complete, correct spelling
  – opportunity to skip, skim or re-read
• Spoken language
  – continuous sequence of sounds, usually without spaces
  – often damaged, interrupted, parts mumbled
Speech is not acoustic text
Sounds and words
• Phonetics
  – How speech sounds are produced
  – Acoustic result of speech articulation
• Phonology
  – How sounds are used to make words
  – The functions of the sounds within a particular language
Acoustic signal
• Sound produced by vibration of vocal cords
• Sound modified by resonances of the vocal tract
• International Phonetic Alphabet (IPA)
  – phoneme = smallest unit in speech where substitution could change meaning
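The substitution test in the phoneme definition above can be illustrated with minimal pairs: words whose transcriptions differ in exactly one sound. The following is a minimal sketch only. The lexicon and its ASCII phoneme symbols are hypothetical stand-ins made up for illustration, not IPA and not taken from the course material.

```python
from itertools import combinations

# Hypothetical toy lexicon: word -> phoneme sequence.
# The symbols are informal ASCII stand-ins, not an official IPA transcription.
LEXICON = {
    "pat": ("p", "a", "t"),
    "bat": ("b", "a", "t"),
    "pit": ("p", "i", "t"),
    "bit": ("b", "i", "t"),
    "spat": ("s", "p", "a", "t"),
}


def differing_positions(a, b):
    """Return the indices at which two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def minimal_pairs(lexicon):
    """Find word pairs whose transcriptions differ in exactly one phoneme."""
    pairs = []
    for (w1, p1), (w2, p2) in combinations(sorted(lexicon.items()), 2):
        if len(p1) != len(p2):
            continue  # only a substitution counts, not an insertion or deletion
        diff = differing_positions(p1, p2)
        if len(diff) == 1:
            i = diff[0]
            pairs.append((w1, w2, p1[i], p2[i]))
    return pairs


if __name__ == "__main__":
    for w1, w2, s1, s2 in minimal_pairs(LEXICON):
        print(f"{w1} / {w2}: /{s1}/ vs /{s2}/")
```

Each pair found is evidence that the two differing sounds are separate phonemes in this toy language, because swapping one for the other changes the word.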
Speech sounds
• Speech production
  – Articulators: how do they affect the speech waveform?
• Phonemes
  – What are they, why are they useful?
  – Phonemes are speech sounds in an ideal world.
• Phonetics
  – How are phonemes actually realised?
  – Phones are speech sounds in the real world.
  – Allophones are different realisations of the same phoneme.
• The wider context
  – Language, accent,
  – Speaker differences,
  – Effect of external factors.
Vowels, consonants and syllables
• Vowels
  – Vibrating vocal cords in larynx with clear vocal tract
  – Produced using slower extrinsic muscles
• Consonants
  – Usually some occlusion of the vocal tract
  – Sound source can be from larynx, click or hiss
  – Produced using faster intrinsic muscles
• Syllables
  – All languages have CV syllables
  – Basic unit of articulation
  – Consonant clusters
Phonetics vs. orthography
• Letter-phoneme mapping is not 1-to-1:
• Some sounds require several letters
  – e.g., “sh”, “ph”
• Some letters have several pronunciations
  – e.g., “g”, “c”
• Some sounds have several transcriptions
  – e.g., /f/: “f” and “ph”
• Some letters produce several sounds
  – e.g., “x” /ks/
• Some combinations have complex relations
  – e.g., “-ough-”
• Different accents use different phonemes
  – e.g., “bath”
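The many-to-many relation between letters and sounds listed above can be made concrete with a small table. This is a rough sketch under stated assumptions: the word list, the hand alignment of letter groups to sounds, and the ASCII phoneme symbols are all hypothetical and chosen only to mirror the slide's examples.

```python
from collections import defaultdict

# Hypothetical hand-aligned examples: (word, [(letters, sounds), ...]).
# Sound symbols are informal ASCII stand-ins, not official IPA.
ALIGNED = [
    ("ship", [("sh", "S"), ("i", "I"), ("p", "p")]),
    ("photo", [("ph", "f"), ("o", "oU"), ("t", "t"), ("o", "oU")]),
    ("fox", [("f", "f"), ("o", "Q"), ("x", "ks")]),
    ("gem", [("g", "dZ"), ("e", "E"), ("m", "m")]),
    ("got", [("g", "g"), ("o", "Q"), ("t", "t")]),
    ("cat", [("c", "k"), ("a", "ae"), ("t", "t")]),
    ("cell", [("c", "s"), ("e", "E"), ("ll", "l")]),
]


def build_maps(aligned):
    """Collect letters -> sounds and sounds -> letters from aligned words."""
    letters_to_sounds = defaultdict(set)
    sounds_to_letters = defaultdict(set)
    for _word, pairs in aligned:
        for letters, sounds in pairs:
            letters_to_sounds[letters].add(sounds)
            sounds_to_letters[sounds].add(letters)
    return letters_to_sounds, sounds_to_letters


if __name__ == "__main__":
    l2s, s2l = build_maps(ALIGNED)
    # Letters with more than one pronunciation in this toy data.
    print({k: sorted(v) for k, v in l2s.items() if len(v) > 1})
    # Sounds with more than one spelling in this toy data.
    print({k: sorted(v) for k, v in s2l.items() if len(v) > 1})
```

Neither direction of the mapping is a function, which is one reason recognisers and synthesisers typically rely on a pronunciation dictionary rather than on spelling alone.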
Prosody
• Pitch
  – Corresponds to the frequency of vibration of the vocal cords
  – (Has phonetic significance in tonal languages)
• Intensity
  – How loud a particular word or syllable is
• Timing
  – Durations depend on the phrasing (punctuation), context (cf. “league”, “leek”), etc.
  – Stress-timed vs. syllable-timed languages
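Pitch and intensity can both be measured from a short frame of the waveform. The sketch below is one simple way to do this, not the method used in the course: it estimates pitch by picking the autocorrelation peak and measures intensity as RMS level in decibels. The 8 kHz sample rate, the 100 ms frame, the 125 Hz synthetic tone standing in for a voiced sound, and the 60 to 400 Hz search range are all assumptions made for illustration.

```python
import math


def synth_tone(f0_hz, sample_rate, n_samples, amplitude=0.5):
    """Generate a pure sine tone as a crude stand-in for a voiced frame."""
    return [
        amplitude * math.sin(2 * math.pi * f0_hz * n / sample_rate)
        for n in range(n_samples)
    ]


def estimate_pitch(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate pitch in Hz from the lag of the largest autocorrelation value.

    Assumes the frame is longer than sample_rate / fmin samples.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalised autocorrelation at this lag.
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def intensity_db(frame):
    """Return the RMS level in dB relative to an amplitude of 1.0."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20 * math.log10(max(rms, 1e-12))  # floor avoids log(0) on silence


if __name__ == "__main__":
    rate = 8000
    frame = synth_tone(125.0, rate, 800)  # 100 ms frame
    print(f"pitch estimate: {estimate_pitch(frame, rate):.1f} Hz")
    print(f"intensity: {intensity_db(frame):.2f} dB")
```

Real speech is noisier and only quasi-periodic, so practical pitch trackers add voicing decisions and smoothing over time. Timing, the third prosodic cue, would come from segment durations rather than from a single frame.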
Language, accent and dialect
• Language
  – A system of communication with a vocabulary of words, grammar and syntax
  – Different languages have different phonetic contrasts (“right”, “light”)
• Accent
  – Pronunciation variations that do not affect meaning of spoken utterance (“good”, “food”)
  – Intelligible by native speakers
• Dialect
  – Variations in vocabulary, and possibly other aspects, for distinct population
Non-acoustic signals
• Many other sources of information from other senses: face, body, gesture, touch, …
  – can make you “hear” things differently
• Lip reading
  – Information about articulation can be derived from (peripherally) observing lips
  – Major cue for hearing impaired
  – Significant effect for normal hearers (McGurk)
• Para-linguistic information
  – Facial mood and emotion
  – Culturally-grounded gestures
  – Modifying gestures
  – Body language
  – Stress and emphasis
Complexity demands intelligence
• Speech is very complex
  – requires fusion of many sources of knowledge
• Humans have developed large brains, and the highest intelligence in the animal kingdom, to deal with it:
  – very large number of neurons, working in parallel
Summary of speech communication
• Speech is man’s natural modality for interacting with machines
  – Ideas and language
  – Written vs. spoken language (phonology)
  – Continuous acoustic signal (phonetics)
• Phonemes, phones and allophones
  – Vowels and consonants
  – Phonetic vs. orthographic transcriptions
  – Intelligible by native speakers
• Para-linguistic information
  – Prosody: intensity, pitch and timing
  – Language, accent and dialect
  – Visual, haptic and contextual information
Speech recognition
What is speech recognition?
• Types of spoken language processing:
  – Automatic speech recognition (ASR)
  – Spoken language understanding
  – Dialogue systems
  – Paralinguistic speech processing
  – Speech verification
  – Speech coding, enhancement & modification
  – Speech synthesis
  – Spoken language generation
  – Speaker recognition: identification and authentication/verification
Speech recognition problem
• The dream and reality
  – Intelligent machines?
  – Size of vocabulary: 50, 1000, 20000 words
  – Speaker-dependent vs. speaker-independent
• Discovering our ignorance
  – How does the ear work?
  – How does the brain process sounds to perceive concepts?
• Circumventing our ignorance
  – Ad-hoc rules vs. pattern matching techniques
  – Most successful based on stochastic modelling
  – Recent advances in neural network approaches
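As a rough illustration of the pattern matching idea, the sketch below compares an input against stored word templates using dynamic time warping (DTW). DTW is chosen here as a classic example of template matching; the slides do not name a specific technique. The one-dimensional feature sequences and the two templates are made-up numbers, whereas a real system would use vectors of acoustic features per frame.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch b: stay on b[j-1]
                cost[i][j - 1],      # stretch a: stay on a[i-1]
                cost[i - 1][j - 1],  # advance both sequences
            )
    return cost[n][m]


# Hypothetical templates: one made-up feature contour per vocabulary word.
TEMPLATES = {
    "yes": [1, 3, 5, 3, 1],
    "no": [5, 4, 2, 1, 1],
}


def recognise(features, templates):
    """Return the vocabulary word whose template is closest under DTW."""
    return min(templates, key=lambda word: dtw_distance(features, templates[word]))


if __name__ == "__main__":
    slow_yes = [1, 1, 3, 3, 5, 5, 3, 1]  # same contour as "yes", spoken slower
    print(recognise(slow_yes, TEMPLATES))
```

The warping step absorbs differences in speaking rate, which is why the slowed-down contour still matches its template. Stochastic models replace the fixed templates with probability distributions learned from many examples.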
Dimensions of difficulty
• Speaker dependency
• Vocabulary size
• Isolated words vs. continuous speech
• Language constraints and knowledge sources
• Acoustic ambiguity
• Noise robustness
Speech recognition summary
• Dream and reality
  – Speech-to-text machines
  – Vocabulary size and speaker-dependency trade off against recognition accuracy
• Incomplete specification
  – Of language, of the human ear, the auditory nerves and of how the cortex processes speech to derive meaning
• An engineering solution
  – Use pattern matching techniques
  – Most successful based on Hidden Markov Models
  – Recent advances in HMM/ANN hybrids
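To give a feel for how a Hidden Markov Model is used, the sketch below runs Viterbi decoding on a toy model: given a sequence of observations, it finds the most probable sequence of hidden states. Everything about the model is an assumption made for illustration. The two states, the "quiet"/"loud" observation symbols and all probabilities are invented, and real recognisers use many more states, continuous acoustic features and log probabilities to avoid underflow.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its probability).

    Assumes a non-empty observation sequence.
    """
    # best[t][s] = (probability of the best path ending in s at time t,
    #               previous state on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            prob, back = max((prev[r][0] * trans_p[r][s], r) for r in states)
            column[s] = (prob * emit_p[s][obs], back)
        best.append(column)

    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, prob


# Hypothetical toy model: hidden sound classes emitting coarse loudness labels.
STATES = ("consonant", "vowel")
START_P = {"consonant": 0.6, "vowel": 0.4}
TRANS_P = {
    "consonant": {"consonant": 0.3, "vowel": 0.7},
    "vowel": {"consonant": 0.6, "vowel": 0.4},
}
EMIT_P = {
    "consonant": {"quiet": 0.8, "loud": 0.2},
    "vowel": {"quiet": 0.1, "loud": 0.9},
}

if __name__ == "__main__":
    path, prob = viterbi(["quiet", "loud", "quiet"], STATES, START_P, TRANS_P, EMIT_P)
    print(path, prob)
```

For the observation sequence quiet, loud, quiet, the best path in this toy model is consonant, vowel, consonant. In an HMM/ANN hybrid, a neural network would typically supply the emission scores while the HMM keeps the sequence structure.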