Slides01

  • November 2019

EEM.ssr: Speaker & Speech Recognition

Speech Communication
Dr Philip Jackson, Lecturer in Speech & Audio
Centre for Vision, Speech & Signal Processing, Department of Electronic Engineering
http://www.ee.surrey.ac.uk/Teaching/Courses/eem.ssr

World of speech technologies
[Diagram: areas of speech technology]
• Automatic speech recognition
• Spoken dialogue processing
• Speaker recognition
• Speech perception
• Spoken language understanding
• Speech enhancement
• Speech coding
• Spoken language generation
• Emotion in speech
• Phonetics
• Speech production
• Speech modification
• Speech analysis
• Speech synthesis

Speech-related disciplines
[Diagram: speech science and related disciplines]
• Psychology
• Maths & stats
• Linguistics
• Phonetics
• Acoustics
• Computer science
• Electronics
• Signal processing

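The speech chain
[Diagram: the speech chain from SPEAKER to LISTENER]
• Speaker: linguistic level, then physiological level (motor nerves, vocal muscles)
• Acoustic level: sound waves
• Listener: physiological level (ear, sensory nerves), then linguistic level
• Feedback link: the speaker's own ear and sensory nerves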

It comes as naturally as breathing
• Speech is man’s preferred modality
• Can use natural language for interacting with complex systems
• Hands-free
• Eyes-free
• Small footprint
• Requires no training

Ideas and language
• Ideas are concepts or abstract notions
• Language has a grammar and syntax, and is made up of words
• We develop our understanding of the world at the same time as we are learning to talk:
  – Many of our thoughts are framed in terms of words
  – Language (and culture) affects the way we think

Written vs. spoken language
• Written language
  – discrete words separated by spaces
  – usually complete, correct spelling
  – opportunity to skip, skim or re-read
• Spoken language
  – continuous sequence of sounds, usually without spaces
  – often damaged, interrupted, parts mumbled

Speech is not acoustic text

Sounds and words
• Phonetics
  – How speech sounds are produced
  – Acoustic result of speech articulation
• Phonology
  – How sounds are used to make words
  – The functions of the sounds within a particular language

Acoustic signal
• Sound produced by vibration of vocal cords
• Sound modified by resonances of the vocal tract
• International Phonetic Alphabet (IPA)
  – Phoneme = smallest unit in speech where substitution could change meaning
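
The phoneme definition above can be illustrated with minimal pairs: two words whose phoneme sequences differ in exactly one position, so that the substitution changes the meaning. The following is a minimal sketch, assuming hand-written broad transcriptions; the `PRONUNCIATIONS` table and the function name `is_minimal_pair` are hypothetical and are not part of the lecture material.

```python
def is_minimal_pair(a, b):
    """Return True if two phoneme sequences differ in exactly one position."""
    if len(a) != len(b):
        return False
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences == 1


# Hypothetical broad transcriptions, one list element per phoneme.
PRONUNCIATIONS = {
    "pat": ["p", "æ", "t"],
    "bat": ["b", "æ", "t"],
    "pad": ["p", "æ", "d"],
    "bad": ["b", "æ", "d"],
}

# /p/ vs /b/ is the only difference, so the two sounds contrast.
print(is_minimal_pair(PRONUNCIATIONS["pat"], PRONUNCIATIONS["bat"]))  # True
# Two positions differ, so this is not a minimal pair.
print(is_minimal_pair(PRONUNCIATIONS["pat"], PRONUNCIATIONS["bad"]))  # False
```

Finding such a pair in a language is evidence that the two sounds are separate phonemes in that language rather than allophones of one phoneme.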

Speech sounds
• Speech production
  – Articulators: how do they affect the speech waveform?
• Phonemes
  – What are they, why are they useful?
  – Phonemes are speech sounds in an ideal world.
• Phonetics
  – How are phonemes actually realized?
  – Phones are speech sounds in the real world.
  – Allophones are different types of realisation.
• The wider context
  – Language, accent
  – Speaker differences
  – Effect of external factors

Vowels, consonants and syllables
• Vowels
  – Vibrating vocal cords in larynx with clear vocal tract
  – Produced using slower extrinsic muscles
• Consonants
  – Usually some occlusion of the vocal tract
  – Sound source can be from larynx, click or hiss
  – Produced using faster intrinsic muscles
• Syllables
  – All languages have CV syllables
  – Basic unit of articulation
  – Consonant clusters

Phonetics vs. orthography
Letter-phoneme mapping is not 1-to-1:
• Some sounds require several letters
  – e.g., “sh”, “ph”
• Some letters have several pronunciations
  – e.g., “g”, “c”
• Some sounds have several transcriptions
  – e.g., /f/: “f” and “ph”
• Some letters produce several sounds
  – e.g., “x” /ks/
• Some combinations have complex relations
  – e.g., “-ough-”
• Different accents use different phonemes
  – e.g., “bath”
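
The many-to-many mapping listed above can be made concrete by collecting, from a few aligned words, which phonemes each letter group produces and which letter groups spell each phoneme. This is a small sketch under stated assumptions: the `ALIGNED` table is a hypothetical, hand-aligned set of broad transcriptions chosen to mirror the slide's examples, not a real pronunciation dictionary.

```python
from collections import defaultdict

# Hypothetical hand-aligned (letters, phoneme) pairs; broad transcriptions only.
ALIGNED = {
    "ship":  [("sh", "ʃ"), ("i", "ɪ"), ("p", "p")],
    "photo": [("ph", "f"), ("o", "əʊ"), ("t", "t"), ("o", "əʊ")],
    "fish":  [("f", "f"), ("i", "ɪ"), ("sh", "ʃ")],
    "get":   [("g", "g"), ("e", "ɛ"), ("t", "t")],
    "gem":   [("g", "dʒ"), ("e", "ɛ"), ("m", "m")],
    "cat":   [("c", "k"), ("a", "æ"), ("t", "t")],
    "cell":  [("c", "s"), ("e", "ɛ"), ("ll", "l")],
    "box":   [("b", "b"), ("o", "ɒ"), ("x", "ks")],
}


def build_maps(aligned):
    """Collect letters -> phonemes and phoneme -> letters from aligned words."""
    letters_to_phonemes = defaultdict(set)
    phoneme_to_letters = defaultdict(set)
    for pairs in aligned.values():
        for letters, phoneme in pairs:
            letters_to_phonemes[letters].add(phoneme)
            phoneme_to_letters[phoneme].add(letters)
    return letters_to_phonemes, phoneme_to_letters


g2p, p2g = build_maps(ALIGNED)
print(sorted(g2p["c"]))  # ['k', 's']   one letter, several pronunciations
print(sorted(p2g["f"]))  # ['f', 'ph']  one sound, several spellings
print(sorted(g2p["x"]))  # ['ks']       one letter, several sounds
```

A recogniser therefore cannot derive pronunciations from spelling by simple letter lookup; in practice a pronunciation lexicon or a trained letter-to-sound model fills this role.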

Prosody
• Pitch
  – Corresponds to the frequency of vibration of the vocal cords
  – (Has phonetic significance in tonal languages)
• Intensity
  – How loud a particular word or syllable is
• Timing
  – Durations depend on the phrasing (punctuation), context (cf. “league”, “leek”), etc.
  – Stress-timed vs. syllable-timed languages
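
Pitch and intensity can both be estimated from a short frame of the waveform. The sketch below is illustrative only: it synthesises a periodic test tone instead of using recorded speech, estimates the fundamental frequency by picking the autocorrelation peak (one common technique, chosen here as an assumption and not taken from the slides), and reports intensity as RMS level in dB. The function names, the 8 kHz sample rate, the 240-sample window and the 70 to 400 Hz search range are all assumptions.

```python
import math


def synth_tone(f0, sample_rate, n_samples):
    """Periodic test signal: a fundamental plus a weaker second harmonic."""
    return [
        0.5 * math.sin(2 * math.pi * f0 * n / sample_rate)
        + 0.25 * math.sin(2 * math.pi * 2 * f0 * n / sample_rate)
        for n in range(n_samples)
    ]


def estimate_f0(signal, sample_rate, window=240, f_min=70.0, f_max=400.0):
    """Estimate fundamental frequency (Hz) from the autocorrelation peak."""
    min_lag = int(sample_rate / f_max)  # shortest period considered
    max_lag = int(sample_rate / f_min)  # longest period considered
    if len(signal) < window + max_lag:
        raise ValueError("signal too short for the requested lag range")
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate a fixed-length window with a copy delayed by `lag` samples.
        score = sum(signal[n] * signal[n + lag] for n in range(window))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def rms_db(frame):
    """Intensity of a frame as RMS level in dB relative to full scale 1.0."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20 * math.log10(rms)


SAMPLE_RATE = 8000
tone = synth_tone(100.0, SAMPLE_RATE, 400)    # 50 ms of a 100 Hz tone
print(estimate_f0(tone, SAMPLE_RATE))         # 100.0
print(round(rms_db(tone[:240]), 1))           # level of the first 30 ms frame
```

Real speech is only quasi-periodic, so practical pitch trackers add voicing decisions and smoothing across frames; this sketch omits both.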

Language, accent and dialect
• Language
  – A system of communication with a vocabulary of words, grammar and syntax
  – Different languages have different phonetic contrasts (“right”, “light”)
• Accent
  – Pronunciation variations that do not affect the meaning of a spoken utterance (“good”, “food”)
  – Intelligible to native speakers
• Dialect
  – Variations in vocabulary, and possibly other aspects, for a distinct population

Non-acoustic signals
• Many other sources of information from other senses: face, body, gesture, touch, …
  – can make you “hear” things differently
• Lip reading
  – Information about articulation can be derived from (peripherally) observing lips
  – Major cue for hearing impaired
  – Significant effect for normal hearers (McGurk effect)
• Para-linguistic information
  – Facial mood and emotion
  – Culturally-grounded gestures
  – Modifying gestures
  – Body language
  – Stress and emphasis

Complexity demands intelligence
• Speech is very complex
  – requires fusion of many sources of knowledge
• Humans have developed large brains, and the highest intelligence in the animal kingdom, to deal with it:
  – very large number of neurons, working in parallel

Summary of speech comm.
• Speech is man’s natural modality for interacting with machines
  – Ideas and language
  – Written vs. spoken language (phonology)
  – Continuous acoustic signal (phonetics)
• Phonemes, phones and allophones
  – Vowels and consonants
  – Phonetic vs. orthographic transcriptions
  – Intelligible to native speakers
• Para-linguistic information
  – Prosody: intensity, pitch and timing
  – Language, accent and dialect
  – Visual, haptic and contextual information

Speech recognition

What is speech recognition?
• Types of spoken language processing:
  – Automatic speech recognition (ASR)
  – Spoken language understanding
  – Dialogue systems
  – Paralinguistic speech processing
  – Speech verification
  – Speech coding, enhancement & modification
  – Speech synthesis
  – Spoken language generation
  – Speaker recognition: identification and authentication/verification

Speech recognition problem
• The dream and reality
  – Intelligent machines?
  – Size of vocabulary: 50, 1000, 20000 words
  – Speaker-dependent vs. speaker-independent
• Discovering our ignorance
  – How does the ear work?
  – How does the brain process sounds to perceive concepts?
• Circumventing our ignorance
  – Ad-hoc rules vs. pattern matching techniques
  – Most successful based on stochastic modelling
  – Recent advances in neural network approaches
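
As one concrete instance of pattern matching (as opposed to ad-hoc rules), the sketch below compares an utterance with stored word templates using dynamic time warping (DTW), which tolerates differences in speaking rate. DTW is chosen here as an illustration and is not named in the slides; the one-dimensional feature tracks in `TEMPLATES` are hypothetical stand-ins for real acoustic feature vectors.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats a frame
                cost[i][j - 1],      # b advances, a repeats a frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Hypothetical one-dimensional feature tracks, one template per word.
TEMPLATES = {
    "yes": [1, 3, 4, 3, 1],
    "no":  [4, 2, 1, 2, 4],
}


def recognise(utterance):
    """Return the template word with the smallest DTW distance."""
    return min(TEMPLATES, key=lambda word: dtw_distance(TEMPLATES[word], utterance))


# A slower rendition of "yes": same shape, stretched in time.
slow_yes = [1, 1, 3, 3, 4, 4, 3, 1]
print(recognise(slow_yes))                       # yes
print(dtw_distance(TEMPLATES["yes"], slow_yes))  # 0.0
```

Template matching of this kind scales poorly with vocabulary size and speaker variation, which is one motivation for the stochastic models mentioned above.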

Dimensions of difficulty
• Speaker dependency
• Vocabulary size
• Isolated words vs. continuous speech
• Language constraints and knowledge sources
• Acoustic ambiguity
• Noise robustness

Speech recognition summary
• Dream and reality
  – Speech-to-text machines
  – Vocabulary size and speaker-dependency trade off against recognition accuracy
• Incomplete specification
  – Of language, of the human ear, the auditory nerves and of how the cortex processes speech to derive meaning
• An engineering solution
  – Use pattern matching techniques
  – Most successful based on Hidden Markov Models
  – Recent advances in HMM/ANN hybrids
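
To give a feel for how a Hidden Markov Model is used in recognition, the sketch below runs Viterbi decoding on a toy two-state model to find the most likely hidden state sequence for a short observation sequence. Everything in the model is an assumption made for illustration: the state names, the quantised observation labels and all probabilities are invented, and real recognisers use many more states with continuous acoustic features.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Most likely state path and its log probability for a discrete HMM."""
    # best[t][s] = log probability of the best path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


def to_log(table):
    """Convert a (possibly nested) dict of probabilities to log probabilities."""
    return {
        k: to_log(v) if isinstance(v, dict) else math.log(v)
        for k, v in table.items()
    }


# Hypothetical toy model: two sound classes, two quantised acoustic labels.
STATES = ["fricative", "vowel"]
START = to_log({"fricative": 0.5, "vowel": 0.5})
TRANS = to_log({
    "fricative": {"fricative": 0.7, "vowel": 0.3},
    "vowel":     {"fricative": 0.2, "vowel": 0.8},
})
EMIT = to_log({
    "fricative": {"noisy": 0.8, "voiced": 0.2},
    "vowel":     {"noisy": 0.1, "voiced": 0.9},
})

observed = ["noisy", "noisy", "voiced", "voiced", "voiced"]
path, log_prob = viterbi(observed, STATES, START, TRANS, EMIT)
print(path)  # ['fricative', 'fricative', 'vowel', 'vowel', 'vowel']
```

Working in log probabilities avoids numerical underflow on long utterances, since products of many small probabilities become sums.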
