Speech Recognition

Speech recognition
From Wikipedia, the free encyclopedia

Speech recognition (in many contexts also known as automatic speech recognition, computer speech recognition or, erroneously, as voice recognition) is the process of converting a speech signal to a sequence of words by means of an algorithm implemented as a computer program. Speech recognition applications that have emerged over the last few years include voice dialing (e.g., "Call home"), call routing (e.g., "I would like to make a collect call"), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a radiology report), domotic appliance control, and content-based spoken audio search (e.g., finding a podcast where particular words were spoken). Voice recognition or speaker recognition is a related process that attempts to identify the person speaking, as opposed to what is being said.

Contents

1 Speech recognition technology
2 Performance of speech recognition systems
   2.1 Hidden Markov model (HMM)-based speech recognition
   2.2 Neural network-based speech recognition
   2.3 Dynamic time warping (DTW)-based speech recognition
3 Speech recognition patents and patent disputes
4 For further information
5 Applications of speech recognition
6 Microphone
7 See also
8 References
9 Books
10 External links

Speech recognition technology

In terms of technology, most technical textbooks nowadays emphasize the use of hidden Markov models as the underlying technology. The dynamic programming approach, the neural network-based approach, and the knowledge-based learning approach were studied intensively in the 1980s and 1990s.

Performance of speech recognition systems

The performance of speech recognition systems is usually specified in terms of accuracy and speed. Accuracy is measured with the word error rate, whereas speed is measured with the real time factor.

Most speech recognition users would tend to agree that dictation machines can achieve very high performance in controlled conditions. Part of the confusion mainly comes from the mixed usage of the terms "speech recognition" and "dictation". Speaker-dependent dictation systems requiring a short period of training can capture continuous speech with a large vocabulary at a normal pace with very high accuracy. Most commercial companies claim that recognition software can achieve between 98% and 99% accuracy (getting one to two words out of one hundred wrong) if operated under optimal conditions. These optimal conditions usually mean that the test subjects have 1) speaker characteristics that match the training data, 2) proper speaker adaptation, and 3) a clean environment (e.g., office space). This explains why some users, especially those whose speech is heavily accented, might actually perceive the recognition rate to be much lower than the expected 98% to 99%.

Limited vocabulary systems, requiring no training, can recognize a small number of words (for instance, the ten digits) as spoken by most speakers. Such systems are popular for routing incoming phone calls to their destinations in large organizations.

Both acoustic modeling and language modeling are important parts of modern statistical speech recognition. In this entry, we will focus on the use of the hidden Markov model (HMM) because it is so widely used in many systems. (Language modeling has many other applications, such as smart keyboards and document classification; refer to the corresponding entries.) Carnegie Mellon University has made good progress in increasing the speed of speech recognition chips by using ASICs (application-specific integrated circuits) and reconfigurable chips called FPGAs (field programmable gate arrays). [1]
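For reference (a standard definition, not spelled out in the article), the real time factor compares how long the recognizer takes to the duration of the audio it processes:

$$\mathrm{RTF} = \frac{T_{\text{processing}}}{T_{\text{audio}}}$$

An RTF of 1 or below means the system keeps up with the incoming speech; for example, taking 3 seconds to decode a 6-second utterance gives an RTF of 0.5.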

Hidden Markov model (HMM)-based speech recognition

Modern general-purpose speech recognition systems are generally based on hidden Markov models (HMMs). An HMM is a statistical model which outputs a sequence of symbols or quantities. One possible reason why HMMs are used in speech recognition is that a speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal: over a short time span on the order of 10 milliseconds, speech can be approximated as a stationary process. Speech can thus be thought of as a Markov model over many stochastic processes (known as states).

Another reason why HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use. In speech recognition, to give the very simplest setup possible, the hidden Markov model outputs a sequence of n-dimensional real-valued vectors, with n around, say, 13, emitting one of these every 10 milliseconds. The vectors, again in the very simplest case, consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short-time window of speech, decorrelating the spectrum using a cosine transform, and then taking the first (most significant) coefficients. The hidden Markov model will tend to have, in each state, a statistical distribution called a mixture of diagonal covariance Gaussians, which gives a likelihood for each observed vector. Each word, or (for more general speech recognition systems) each phoneme, has a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individually trained hidden Markov models for the separate words and phonemes.

Described above are the core elements of the most common, HMM-based approach to speech recognition. Modern speech recognition systems use various combinations of a number of standard techniques in order to improve results over the basic approach described above. A typical large-vocabulary system would need context dependency for the phones (so phones with different left and right context have different realizations as HMM states); it would use cepstral normalization to normalize for different speaker and recording conditions; for further speaker normalization it might use vocal tract length normalization (VTLN) for male-female normalization and maximum likelihood linear regression (MLLR) for more general speaker adaptation. The features would have so-called delta and delta-delta coefficients to capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis (HLDA); or the system might skip the delta and delta-delta coefficients and use splicing and an LDA-based projection, followed perhaps by heteroscedastic linear discriminant analysis or a global semi-tied covariance transform (also known as maximum likelihood linear transform, or MLLT). Many systems use so-called discriminative training techniques, which dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. Examples are maximum mutual information (MMI), minimum classification error (MCE), and minimum phone error (MPE).

Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path, and here there is a choice between dynamically creating a combination hidden Markov model, which includes both the acoustic and language model information, or combining it statically beforehand (the finite state transducer, or FST, approach).
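The front end described above (roughly 13 cepstral coefficients every 10 milliseconds, plus delta and delta-delta coefficients) can be sketched in a few lines of Python. This is a minimal illustration using the librosa library; the file name, sampling rate, and window/hop sizes are assumptions for the example, not values taken from the article.

```python
import numpy as np
import librosa

# Hypothetical input file; 16 kHz is a common rate for speech.
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs per frame: 25 ms analysis window, 10 ms frame shift.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)

delta = librosa.feature.delta(mfcc)              # first-order dynamics
delta2 = librosa.feature.delta(mfcc, order=2)    # second-order dynamics

# 39-dimensional feature vectors, one column per 10 ms frame.
features = np.vstack([mfcc, delta, delta2])
print(features.shape)
```

In a full system each frame vector would then be scored against the per-state Gaussian mixture distributions mentioned above.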

Neural network-based speech recognition

Another approach in acoustic modeling is the use of neural networks. They are capable of solving much more complicated recognition tasks, but do not scale as well as HMMs when it comes to large vocabularies. Rather than being used for general-purpose speech recognition applications, they are used to handle low-quality, noisy data and to provide speaker independence. Such systems can achieve greater accuracy than HMM-based systems, as long as there is sufficient training data and the vocabulary is limited.

A more general approach using neural networks is phoneme recognition. This is an active field of research, but generally the results are better than for HMMs. There are also NN-HMM hybrid systems that use the neural network part for phoneme recognition and the hidden Markov model part for language modeling.

Dynamic time warping (DTW)-based speech recognition

Main article: Dynamic time warping

Dynamic time warping is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach. Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed. For instance, similarities in walking patterns would be detected, even if in one video the person was walking slowly and in another walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics; indeed, any data which can be turned into a linear representation can be analyzed with DTW. A well-known application has been automatic speech recognition, to cope with different speaking speeds.

In general, DTW is a method that allows a computer to find an optimal match between two given sequences (e.g., time series) with certain restrictions, i.e., the sequences are "warped" non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models.
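As a concrete illustration of the dynamic programming idea behind DTW (a generic sketch, not code from any particular recognizer), the function below aligns two one-dimensional sequences and returns the accumulated warping cost:

```python
import numpy as np

def dtw_distance(x, y):
    """Accumulated cost of the best non-linear alignment between sequences x and y."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])              # local distance between frames
            cost[i, j] = d + min(cost[i - 1, j],      # stretch x
                                 cost[i, j - 1],      # stretch y
                                 cost[i - 1, j - 1])  # advance both
    return cost[n, m]

# The slower rendition of the same pattern still aligns with zero cost.
print(dtw_distance([1, 2, 3, 4], [1, 1, 2, 3, 3, 4]))
```

In an isolated-word DTW recognizer, an incoming utterance would be compared against stored reference templates (sequences of feature vectors rather than scalars), and the template with the lowest warping cost would be chosen as the recognized word.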

Speech recognition patents and patent disputes

Microsoft and Alcatel-Lucent hold patents in speech recognition and are in dispute as of March 2, 2007.[2]

For further information

Popular speech recognition conferences held each year or two include ICASSP, Eurospeech/ICSLP (now named Interspeech), and the IEEE ASRU. Conferences in the field of natural language processing, such as ACL, NAACL, EMNLP, and HLT, are beginning to include papers on speech processing. Important journals include the IEEE Transactions on Speech and Audio Processing (now named IEEE Transactions on Audio, Speech and Language Processing), Computer Speech and Language, and Speech Communication.

Books like "Fundamentals of Speech Recognition" by Lawrence Rabiner can be useful for acquiring basic knowledge, but may not be fully up to date (1993). Another good source is "Statistical Methods for Speech Recognition" by Frederick Jelinek, a more up-to-date book (1998). A good insight into the techniques used in the best modern systems can be gained by paying attention to government-sponsored competitions such as those organised by DARPA (the largest speech recognition-related project ongoing as of 2007 is the GALE project, which involves both speech recognition and translation components).

In terms of freely available resources, the HTK book (and the accompanying HTK toolkit) is one place to start both to learn about speech recognition and to start experimenting. Another such resource is Carnegie Mellon University's SPHINX toolkit.

Applications of speech recognition

• Automatic translation
• Automotive speech recognition
• Dictation
• Hands-free computing: voice command recognition computer user interface
• Home automation
• Interactive voice response
• Medical transcription
• Mobile telephony
• Pronunciation evaluation in computer-aided language learning applications[1]
• Robotics

Microphone

The microphone type recommended for speech recognition is the array microphone.

See also

• Audio-visual speech recognition
• Cockpit (aviation) (also termed Direct Voice Input)
• Keyword spotting
• List of speech recognition projects
• Microphone
• Speech analytics
• Speaker identification
• Speech processing
• Speech synthesis
• Speech verification
• Text-to-speech (TTS)
• VoiceXML
• Acoustic model
• Speech corpus

References

• "Survey of the State of the Art in Human Language Technology" (1997), Ron Cole et al.

1. Dennis van der Heijden. "Computer Chips to Enhance Speech Recognition", Axistive.com, 2003-10-06.
2. Roger Cheng and Carmen Fleetwood. "Judge dismisses Lucent patent suit against Microsoft", Wall Street Journal, 2007-03-02.

Books

• Multilingual Speech Processing, edited by Tanja Schultz and Katrin Kirchhoff, April 2006. Researchers and developers in industry and academia with different backgrounds but a common interest in multilingual speech processing will find an excellent overview of research problems and solutions, detailed from theoretical and practical perspectives. Contents: CH 1: Introduction; CH 2: Language Characteristics; CH 3: Linguistic Data Resources; CH 4: Multilingual Acoustic Modeling; CH 5: Multilingual Dictionaries; CH 6: Multilingual Language Modeling; CH 7: Multilingual Speech Synthesis; CH 8: Automatic Language Identification; CH 9: Other Challenges.

External links

• NIST Speech Group
• How to install and configure speech recognition in Windows
• Entropic/Cambridge Hidden Markov Model Toolkit
• Open CV library, especially the multi-stream speech and vision combination programs
• LT-World: Portal to information and resources on the internet
• LDC – The Linguistic Data Consortium
• Evaluations and Language resources Distribution Agency
• OLAC – Open Language Archives Community
• BAS – Bavarian Archive for Speech Signals
• Think-A-Move – Speech and Tongue Control of Robots and Wheelchairs

Audio-visual speech recognition
From Wikipedia, the free encyclopedia

Audio-visual speech recognition (AVSR) is a technique that uses image processing capabilities in lip reading to aid speech recognition systems in recognizing indeterminate phones or in giving preponderance to one of several decisions with near-equal probabilities. The lip-reading system and the speech recognition system work separately, and their results are then combined at the feature fusion stage.
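As a rough sketch of what "feature fusion" can mean in practice (a hypothetical illustration with made-up dimensions, not the method of any specific AVSR system), per-frame acoustic features and per-frame lip-shape features can simply be concatenated before being passed to a single recognizer:

```python
import numpy as np

num_frames = 100
audio_feats = np.random.randn(num_frames, 13)   # placeholder acoustic features (e.g., MFCCs)
visual_feats = np.random.randn(num_frames, 6)   # placeholder lip-reading features

# Frame-synchronous feature fusion: one combined vector per frame.
fused = np.concatenate([audio_feats, visual_feats], axis=1)
print(fused.shape)  # (100, 19)
```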

External links

• IBM Research – Audio Visual Speech Technologies


1.2: Speech Recognition

Victor Zue, Ron Cole, & Wayne Ward
MIT Laboratory for Computer Science, Cambridge, Massachusetts, USA
Oregon Graduate Institute of Science & Technology, Portland, Oregon, USA
Carnegie Mellon University, Pittsburgh, Pennsylvania, USA

Defining the Problem

Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words. The recognized words can be the final results, as for applications such as command and control, data entry, and document preparation. They can also serve as the input to further linguistic processing in order to achieve speech understanding, a subject covered in a later section.

Speech recognition systems can be characterized by many parameters, some of the more important of which are summarized in the table below. An isolated-word speech recognition system requires that the speaker pause briefly between words, whereas a continuous speech recognition system does not. Spontaneous, or extemporaneously generated, speech contains disfluencies and is much more difficult to recognize than speech read from a script. Some systems require speaker enrollment: a user must provide samples of his or her speech before using them, whereas other systems are said to be speaker-independent, in that no enrollment is necessary. Some of the other parameters depend on the specific task. Recognition is generally more difficult when vocabularies are large or have many similar-sounding words. When speech is produced in a sequence of words, language models or artificial grammars are used to restrict the combination of words. The simplest language model can be specified as a finite-state network, where the permissible words following each word are given explicitly. More general language models approximating natural language are specified in terms of a context-sensitive grammar.

One popular measure of the difficulty of the task, combining the vocabulary size and the language model, is perplexity, loosely defined as the geometric mean of the number of words that can follow a word after the language model has been applied (see the section on language modeling for a discussion of language modeling in general and perplexity in particular). Finally, there are some external parameters that can affect speech recognition system performance, including the characteristics of the environmental noise and the type and placement of the microphone.
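To make the perplexity measure concrete (a standard formulation, not given explicitly in this survey excerpt), the perplexity of a language model P on a test word sequence w_1, ..., w_N can be written as

$$\mathrm{PP} = P(w_1, w_2, \ldots, w_N)^{-1/N},$$

the inverse probability of the sequence normalized by its length; a lower perplexity means the language model leaves fewer plausible choices for each next word, which generally makes recognition easier.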

Table: Typical parameters used to characterize the capability of speech recognition systems

Speech recognition is a difficult problem, largely because of the many sources of variability associated with the signal. First, the acoustic realizations of phonemes, the smallest sound units of which words are composed, are highly dependent on the context in which they appear. These phonetic variabilities are exemplified by the acoustic differences of the phoneme /t/ in two, true, and butter in American English. At word boundaries, contextual variations can be quite dramatic, making gas shortage sound like gash shortage in American English, and devo andare sound like devandare in Italian. Second, acoustic variabilities can result from changes in the environment as well as in the position and characteristics of the transducer. Third, within-speaker variabilities can result from changes in the speaker's physical and emotional state, speaking rate, or voice quality. Finally, differences in sociolinguistic background, dialect, and vocal tract size and shape can contribute to across-speaker variabilities.

The figure below shows the major components of a typical speech recognition system. The digitized speech signal is first transformed into a set of useful measurements or features at a fixed rate, typically once every 10--20 msec (signal representation and digital signal processing are covered elsewhere in this survey, including in section 11.3). These measurements are then used to search for the most likely word candidate, making use of constraints imposed by the acoustic, lexical, and language models. Throughout this process, training data are used to determine the values of the model parameters.

Figure: Components of a typical speech recognition system.

Speech recognition systems attempt to model the sources of variability described above in several ways. At the level of signal representation, researchers have developed representations that emphasize perceptually important, speaker-independent features of the signal and de-emphasize speaker-dependent characteristics [Her90]. At the acoustic phonetic level, speaker variability is typically modeled using statistical techniques applied to large amounts of data. Speaker adaptation algorithms have also been developed that adapt speaker-independent acoustic models to those of the current speaker during system use (see the relevant section of this survey). Effects of linguistic context at the acoustic phonetic level are typically handled by training separate models for phonemes in different contexts; this is called context-dependent acoustic modeling.

Word-level variability can be handled by allowing alternate pronunciations of words in representations known as pronunciation networks. Common alternate pronunciations of words, as well as effects of dialect and accent, are handled by allowing search algorithms to find alternate paths of phonemes through these networks. Statistical language models, based on estimates of the frequency of occurrence of word sequences, are often used to guide the search through the most probable sequence of words.

The dominant recognition paradigm of the past fifteen years is known as hidden Markov models (HMM). An HMM is a doubly stochastic model, in which the generation of the underlying phoneme string and the frame-by-frame surface acoustic realizations are both represented probabilistically as Markov processes (discussed elsewhere in this survey, including section 11.2). Neural networks have also been used to estimate the frame-based scores; these scores are then integrated into HMM-based system architectures, in what has come to be known as hybrid systems, as described in section 11.5. An interesting feature of frame-based HMM systems is that speech segments are identified during the search process, rather than explicitly. An alternate approach is to first identify speech segments, then classify the segments and use the segment scores to recognize words. This approach has produced competitive recognition performance in several tasks [ZGPS90,FBC95].
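To make the "doubly stochastic" description above concrete, here is a toy sketch (hypothetical numbers, discrete output symbols instead of real acoustic features) that scores an observation sequence under an HMM with the forward algorithm:

```python
import numpy as np

# Toy HMM: two hidden states, three discrete observation symbols.
A = np.array([[0.7, 0.3],        # state transition probabilities
              [0.2, 0.8]])
B = np.array([[0.6, 0.3, 0.1],   # per-state emission probabilities over symbols 0..2
              [0.1, 0.4, 0.5]])
pi = np.array([0.5, 0.5])        # initial state distribution

def forward_likelihood(obs):
    """Total probability of the observation sequence, summed over all hidden state paths."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

print(forward_likelihood([0, 1, 2, 1]))
```

In a real recognizer the hidden states correspond to sub-phonetic units, the discrete emissions are replaced by Gaussian mixture densities over feature vectors, and the search uses the Viterbi algorithm to recover the best state (and hence word) sequence.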

1.2.2 State of the Art

Comments about the state of the art need to be made in the context of specific applications which reflect the constraints on the task. Moreover, different technologies are sometimes appropriate for different tasks. For example, when the vocabulary is small, the entire word can be modeled as a single unit. Such an approach is not practical for large vocabularies, where word models must be built up from subword units. Performance of speech recognition systems is typically described in terms of word error rate, E, defined as:

E = (S + I + D) / N × 100%

where N is the total number of words in the test set, and S, I, and D are the total number of substitutions, insertions, and deletions, respectively.

The past decade has witnessed significant progress in speech recognition technology. Word error rates continue to drop by a factor of 2 every two years. Substantial progress has been made in the basic technology, leading to the lowering of barriers to speaker independence, continuous speech, and large vocabularies. There are several factors that have contributed to this rapid progress.

First, there is the coming of age of the HMM. The HMM is powerful in that, with the availability of training data, the parameters of the model can be trained automatically to give optimal performance.

Second, much effort has gone into the development of large speech corpora for system development, training, and testing. Some of these corpora are designed for acoustic phonetic research, while others are highly task specific. Nowadays, it is not uncommon to have tens of thousands of sentences available for system training and testing. These corpora permit researchers to quantify the acoustic cues important for phonetic contrasts and to determine parameters of the recognizers in a statistically meaningful way. While many of these corpora (e.g., TIMIT, RM, ATIS, and WSJ; see section 12.3) were originally collected under the sponsorship of the U.S. Defense Advanced Research Projects Agency (ARPA) to spur human language technology development among its contractors, they have nevertheless gained world-wide acceptance (e.g., in Canada, France, Germany, Japan, and the U.K.) as standards on which to evaluate speech recognition.

Third, progress has been brought about by the establishment of standards for performance evaluation. Only a decade ago, researchers trained and tested their systems using locally collected data, and had not been very careful in delineating training and testing sets. As a result, it was very difficult to compare performance across systems, and a system's performance typically degraded when it was presented with previously unseen data. The recent availability of a large body of data in the public domain, coupled with the specification of evaluation standards, has resulted in uniform documentation of test results, thus contributing to greater reliability in monitoring progress (corpus development activities and evaluation methodologies are summarized in chapters 12 and 13, respectively).

Finally, advances in computer technology have also indirectly influenced our progress. The availability of fast computers with inexpensive mass storage capabilities has enabled researchers to run many large scale experiments in a short amount of time. This means that the elapsed time between an idea and its implementation and evaluation is greatly reduced. In fact, speech recognition systems with reasonable performance can now run in real time using high-end workstations without additional hardware, a feat unimaginable only a few years ago.
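For illustration, the word error rate defined above can be computed with a standard Levenshtein alignment over words. This is a generic sketch (not from the survey); the example strings are made up:

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + I + D) / N, where N is the number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum number of edits between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("call home right now", "call home now"))  # one deletion -> 0.25
```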

One of the most popular, and potentially most useful, tasks with low perplexity (PP=11) is the recognition of digits. For American English, speaker-independent recognition of digit strings spoken continuously and restricted to telephone bandwidth can achieve an error rate of 0.3% when the string length is known.

One of the best-known moderate-perplexity tasks is the 1,000-word so-called Resource Management (RM) task, in which inquiries can be made concerning various naval vessels in the Pacific Ocean. The best speaker-independent performance on the RM task is less than 4%, using a word-pair language model that constrains the possible words following a given word (PP=60). More recently, researchers have begun to address the issue of recognizing spontaneously generated speech. For example, in the Air Travel Information Service (ATIS) domain, word error rates of less than 3% have been reported for a vocabulary of nearly 2,000 words and a bigram language model with a perplexity of around 15.

High-perplexity tasks with a vocabulary of thousands of words are intended primarily for the dictation application. After working on isolated-word, speaker-dependent systems for many years, the community has since 1992 moved towards very-large-vocabulary (20,000 words and more), high-perplexity, speaker-independent, continuous speech recognition. The best system in 1994 achieved an error rate of 7.2% on read sentences drawn from North American business news [PFF 94].

With the steady improvements in speech recognition performance, systems are now being deployed within telephone and cellular networks in many countries. Within the next few years, speech recognition will be pervasive in telephone networks around the world. There are tremendous forces driving the development of the technology; in many countries, touch tone penetration is low, and voice is the only option for controlling automated services. In voice dialing, for example, users can dial 10--20 telephone numbers by voice (e.g., call home) after having enrolled their voices by saying the words associated with telephone numbers. AT&T, on the other hand, has installed a call routing system using speaker-independent word-spotting technology that can detect a few key phrases (e.g., person to person, calling card) in sentences such as: I want to charge it to my calling card.

At present, several very large vocabulary dictation systems are available for document generation. These systems generally require speakers to pause between words. Their performance can be further enhanced if one can apply constraints of the specific domain, such as dictating medical reports.

Even though much progress is being made, machines are a long way from recognizing conversational speech. Word recognition rates on telephone conversations in the Switchboard corpus are around 50% [CGF94]. It will be many years before unlimited vocabulary, speaker-independent continuous dictation capability is realized.

1.2.3 Future Directions

In 1992, the U.S. National Science Foundation sponsored a workshop to identify the key research challenges in the area of human language technology and the infrastructure needed to support the work. The key research challenges are summarized in [CH 92]. The following research areas were identified for speech recognition:

Robustness: In a robust system, performance degrades gracefully (rather than catastrophically) as conditions become more different from those under which it was trained. Differences in channel characteristics and acoustic environment should receive particular attention.

Portability: Portability refers to the goal of rapidly designing, developing and deploying systems for new applications. At present, systems tend to suffer significant degradation when moved to a new task. In order to return to peak performance, they must be trained on examples specific to the new task, which is time consuming and expensive.

Adaptation: How can systems continuously adapt to changing conditions (new speakers, microphone, task, etc.) and improve through use? Such adaptation can occur at many levels in systems: subword models, word pronunciations, language models, etc.

Language Modeling: Current systems use statistical language models to help reduce the search space and resolve acoustic ambiguity. As vocabulary size grows and other constraints are relaxed to create more habitable systems, it will be increasingly important to get as much constraint as possible from language models; perhaps incorporating syntactic and semantic constraints that cannot be captured by purely statistical models.
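As a concrete (and deliberately tiny) illustration of the statistical language models referred to here, the sketch below estimates add-one-smoothed bigram probabilities from a handful of training sentences; the sentences and smoothing choice are assumptions for the example, not details from the survey:

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Return a function prob(prev, word) giving add-one-smoothed bigram probabilities."""
    unigrams, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        vocab.update(words[1:])
        unigrams.update(words[:-1])                 # counts of bigram contexts
        bigrams.update(zip(words[:-1], words[1:]))  # counts of word pairs
    V = len(vocab)
    def prob(prev, word):
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)
    return prob

p = train_bigram_lm(["call home", "call the office", "please call home"])
print(p("call", "home"))    # frequently seen continuation
print(p("call", "office"))  # unseen bigram, small smoothed probability
```

A recognizer would combine such probabilities with acoustic scores during the search to favor more plausible word sequences.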

Confidence Measures: Most speech recognition systems assign scores to hypotheses for the purpose of rank ordering them. These scores do not provide a good indication of whether a hypothesis is correct or not, just that it is better than the other hypotheses. As we move to tasks that require actions, we need better methods to evaluate the absolute correctness of hypotheses.

Out-of-Vocabulary Words: Systems are designed for use with a particular set of words, but system users may not know exactly which words are in the system vocabulary. This leads to a certain percentage of out-of-vocabulary words in natural conditions. Systems must have some method of detecting such out-of-vocabulary words, or they will end up mapping a word from the vocabulary onto the unknown word, causing an error.

Spontaneous Speech: Systems that are deployed for real use must deal with a variety of spontaneous speech phenomena, such as filled pauses, false starts, hesitations, ungrammatical constructions and other common behaviors not found in read speech. Development on the ATIS task has resulted in progress in this area, but much work remains to be done.

Prosody: Prosody refers to acoustic structure that extends over several segments or words. Stress, intonation, and rhythm convey important information for word recognition and the user's intentions (e.g., sarcasm, anger). Current systems do not capture prosodic structure. How to integrate prosodic information into the recognition architecture is a critical question that has not yet been answered.

Modeling Dynamics: Systems assume a sequence of input frames which are treated as if they were independent. But it is known that perceptual cues for words and phonemes require the integration of features that reflect the movements of the articulators, which are dynamic in nature. How to model dynamics and incorporate this information into recognition systems is an unsolved problem.

Voice recognition

Voice or speech recognition is the ability of a machine or program to receive and interpret dictation, or to understand and carry out spoken commands. For use with computers, analog audio must be converted into digital signals, which requires analog-to-digital conversion. For a computer to decipher the signal, it must have a digital database, or vocabulary, of words or syllables, and a speedy means of comparing this data with signals. The speech patterns are stored on the hard drive and loaded into memory when the program is run. A comparator checks these stored patterns against the output of the A/D converter.

In practice, the size of a voice-recognition program's effective vocabulary is directly related to the random access memory capacity of the computer in which it is installed. A voice-recognition program runs many times faster if the entire vocabulary can be loaded into RAM, as compared with searching the hard drive for some of the matches. Processing speed is critical as well, because it affects how fast the computer can search the RAM for matches.

All voice-recognition systems or programs make errors. Screaming children, barking dogs, and loud external conversations can produce false input. Much of this can be avoided only by using the system in a quiet room. There is also a problem with words that sound alike but are spelled differently and have different meanings, for example, "hear" and "here." This problem might someday be largely overcome using stored contextual information. However, this will require more RAM and faster processors than are currently available in personal computers. Though a number of voice recognition systems are available on the market, the industry leaders are IBM and Dragon Systems.

Question: There has been some consideration for using voice recognition with contact centers to deflect queuing calls. Recently there have been some nice implementations from both Nuance and Speechworks. How do you see this market segment developing, and especially how would you advise someone interested in this technology to ensure they leverage existing investments in either outbound scripts (Siebel Smartscripts) or knowledge bases (Primus and eGain)?

Expert response: I believe that the successful use of speech recognition in contact centers hinges on two critical factors: humanization and application. Let's take these two factors and explore them further.

1. Humanization. People do not like talking with a computer. Most interactions involving speech recognition use either text-to-speech or cold, robotic-sounding prompts to interact with the customer. Neither of these works toward building a relationship with the customer. I know it sounds odd to think about a computer building a relationship with a customer, but that is at the heart of real communication. If the computer sounds like a person and responds as a person would, then your ability to engage a customer and keep them engaged for an automated session increases significantly. As an example, compare the following:

"Please state your full name" (stated in a monotone)
"Would you please say your first and last name" (stated with full dynamics)

Clearly the second interaction would be preferred. Achieving this is the first success factor.

2. Application. Certain types of applications lend themselves well toward an automated interaction with a customer. A good example would be calling in a prescription refill to a pharmacy, or checking to see when an order shipped and when it is expected to be delivered. These types of applications don't require the skills of a highly trained agent but can be very time consuming in terms of personnel cost. Imagine the value of reducing your headcount of less-skilled agents while not wasting the time of your highly trained and well-compensated agents.

Summary. It is these types of applications where the largest value can be gained. Don't try to replace your entire agent population. That is not going to happen. Be realistic. Focus on applications where the form of the transaction is fairly consistent.

Voice recognition is also the name of the field of computer science that deals with designing computer systems that can recognize spoken words. Note that voice recognition implies only that the computer can take dictation, not that it understands what is being said. Comprehending human languages falls under a different field of computer science called natural language processing.

A number of voice recognition systems are available on the market. The most powerful can recognize thousands of words. However, they generally require an extended training session during which the computer system becomes accustomed to a particular voice and accent. Such systems are said to be speaker dependent. Many systems also require that the speaker speak slowly and distinctly and separate each word with a short pause. These systems are called discrete speech systems. Recently, great strides have been made in continuous speech systems, voice recognition systems that allow you to speak naturally. There are now several continuous-speech systems available for personal computers.

Because of their limitations and high cost, voice recognition systems have traditionally been used only in a few specialized situations. For example, such systems are useful in instances when the user is unable to use a keyboard to enter data because his or her hands are occupied or disabled. Instead of typing commands, the user can simply speak into a headset. Increasingly, however, as the cost decreases and performance improves, speech recognition systems are entering the mainstream and are being used as an alternative to keyboards.
