Title: Seminar
1- Seminar
- Speech Recognition
- 2003
- E.M. Bakker
- LIACS Media Lab
- Leiden University
2Outline
- Introduction and State of the Art
- A Speech Recognition Architecture
- Acoustic modeling
- Language modeling
- Practical issues
- Applications
- NB: Some of the slides are adapted from the
presentation "Can Advances in Speech Recognition
Make Spoken Language as Convenient and as
Accessible as Online Text?", an excellent
presentation by Dr. Patti Price (Speech
Technology Consulting, Menlo Park, California
94025) and Dr. Joseph Picone (Institute for Signal
and Information Processing, Dept. of Elect. and
Comp. Eng., Mississippi State University).
3Research Areas
- Speech Analysis (Production, Perception, Parameter Estimation)
- Speech Coding/Compression
- Speech Synthesis (TTS)
- Speaker Identification/Recognition/Verification (Sprint, TI)
- Language Identification (Transparent Dialogue)
- Speech Recognition (Dragon, IBM, AT&T)
- Speech recognition sub-categories
- Discrete/Connected/Continuous Speech/Word Spotting
- Speaker Dependent/Independent
- Small/Medium/Large/Unlimited Vocabulary
- Speaker-Independent Large Vocabulary Continuous Speech Recognition (LVCSR for short)
4Introduction What is Speech Recognition?
Goal: Automatically extract the string of words
spoken from the speech signal.
- Other interesting areas
- Who the talker is (speaker recognition, identification)
- Speech output (speech synthesis)
- What the words mean (speech understanding, semantics)
5IntroductionApplications
- Command and control
- Manufacturing
- Consumer products
http://www.speech.philips.com
- Database query
- Resource management
- Air travel information
- Stock quote
Nuance, American Airlines 1-800-433-7300, touch 1
- Dictation
- http://www.lhsl.com/contacts/
- http://www-4.ibm.com/software/speech
- http://www.microsoft.com/speech/
6Introduction State of the Art
- Speech-recognition software
- IBM (ViaVoice, Voice Server Applications, ...)
- Speaker independent, continuous command recognition
- Large vocabulary recognition
- Text-to-speech confirmation
- Barge-in (the ability to interrupt an audio prompt as it is playing)
- Dragon Systems, Lernout & Hauspie (L&H Voice Xpress)
- Philips
- Dictation
- Telephone
- Voice Control (SpeechWave, VoCon SDK, chip-sets)
- Microsoft (Whisper, Dr Who)
7Introduction State of the Art
- Speech over the telephone.
- AT&T Bell Labs pioneered the use of speech-recognition systems for telephone transactions.
- Companies such as Nuance, Philips and SpeechWorks have been active in this field for some years now.
- IBM applications over the telephone
- request news, internet pages, e-mail
- stock quotes, traveling info
- weather information
8Introduction State of the Art
- Speech over the telephone (Philips)
- SpeechPearl: large vocabulary natural language recognition (up to 200,000 words)
- SpeechMania: mixed-initiative dialog gives the caller the impression of a truly natural dialogue; full replacement of the human operator.
- SpeechWave: relatively small vocabularies (up to hundreds of words), available in nearly 40 languages
- Voice ReQuest: the system recognizes the request and routes the call to the appropriate extension, all without the intervention of an operator.
9Introduction State of the Art
- Speech over the telephone.
- Brokerage: E*Trade, ...
- Banking
- Travel: United Airlines, Continental Airlines
- BellSouth
- HP
- Federal Express
- Foodline
10Introduction State of the Art
- Speech over the telephone.
11Introduction State of the Art
- The following is a transcript of a telephone conversation between Victor Zue and the Jupiter system for weather information. Jupiter's toll-free number is 1-888-573-8255.
- (Phone ringing)
- Jupiter: Hi. Welcome to Jupiter, the M.I.T. Lab for Computer Science Weather Information System. This call is being recorded for system development. You may hang up or ask for help at any time. How can I help you?
- V.Z.: What is the forecast for Boston today?
- Jupiter: In Boston this afternoon, partly sunny and windy. High around 50. Tonight, clear, lows around 35. What other information can I give you?
- V.Z.: Will it rain tomorrow?
- Jupiter: To my knowledge, the forecast calls for no rain in Boston tomorrow. Is there anything else?
12Introduction State of the Art
- V.Z.: How about Seattle?
- Jupiter: To my knowledge, the forecast calls for no rain in Seattle tomorrow. Is there something else you would like to know?
- V.Z.: Is it sunny anywhere in the Caribbean?
- Jupiter: Sorry, I am not sure what you said.
- V.Z.: Is it sunny anywhere in the Caribbean?
- Jupiter: I am sorry, I may not know the city you are asking about. Try asking for the state or country. For example, what cities do you know about in Massachusetts?
- V.Z.: What is the temperature in Montego Bay?
- Jupiter: In Montego Bay today, high 86 and low 73. Is there something else?
- V.Z.: Good-bye.
13Factors that Affect Performance of Speech
Recognition Systems
14How Do You Measure the Performance?
- USC, October 15, 1999: the world's first machine system that can recognize spoken words better than humans can.
- In benchmark testing using just a few spoken words, USC's Berger-Liaw System not only bested all existing computer speech recognition systems but outperformed the keenest human ears.
- What benchmarks?
- What was the training?
- What was the test?
- Were they independent?
- How large was the vocabulary and the sample size?
- Did they really test all existing systems?
- Is that different from chance?
- Was the noise added or coincident with speech?
- What kind of noise? Was it independent of the speech?
15Evaluation Metrics
[Figure: Word Error Rate (WER), axis ticks 0 to 40, plotted against Level Of Difficulty. Tasks shown: Command and Control, Digits, Letters and Numbers, Continuous Digits, Read Speech, Broadcast News, Conversational Speech.]
- Spontaneous telephone speech is still a grand challenge.
- Telephone-quality speech is still central to the problem.
- Broadcast news is a very dynamic domain.
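The slide names Word Error Rate without defining it. A minimal sketch of the usual definition follows: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The function name and the whitespace tokenization are assumptions made for illustration; real scoring tools also normalize text before alignment.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # (none yet) and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One deleted word ("the") and one substituted word ("boston" -> "austin")
wer = word_error_rate(
    "what is the forecast for boston today",
    "what is forecast for austin today",
)
```

Because insertions count as errors, this ratio can exceed 1 when the hypothesis is much longer than the reference.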
16Evaluation MetricsHuman Performance
[Figure: Word Error Rate (axis ticks 0 to 20) versus Speech-To-Noise Ratio (10 dB, 16 dB, 22 dB, Quiet) on Wall Street Journal with additive noise; curves for Machines and for Human Listeners (Committee).]
- Human performance exceeds machine performance by a factor ranging from 4x to 10x, depending on the task.
- On some tasks, such as credit card number recognition, machine performance exceeds humans because of the limits of human memory retrieval capacity.
- The nature of the noise is as important as the SNR (e.g., cellular phones).
- A primary failure mode for humans is inattention.
- A second major failure mode is the lack of familiarity with the domain (e.g., business terms and corporation names).
17Evaluation MetricsMachine Performance
[Figure: machine word error rate (axis ticks 1, 10, 100) by year, 1988 to 2003. Chart labels: Read Speech, Conversational Speech, Broadcast Speech, Spontaneous Speech, Varied Microphones, Noisy, (Foreign), vocabulary sizes 1k, 5k, and 20k, and a "10 X" annotation.]
18What does a speech signal look like?
19Spectrogram
20Speech Recognition
21Recognition ArchitecturesWhy Is Speech
Recognition So Difficult?
[Figure: three phone classes (Ph_1, Ph_2, Ph_3) plotted on Feature No. 1 versus Feature No. 2.]
- Measurements of the signal are ambiguous.
- Region of overlap represents classification errors.
- Reduce overlap by introducing acoustic and linguistic context (e.g., context-dependent phones).
22Overlap in the cepstral space (alphadigits)
[Figure panels: Female "iy", Female "aa", Male "iy", Male "aa"]
23Overlap in the cepstral space (alphadigits)
[Figure panels: Male "iy" (blue) vs. Female "iy" (red); Male "aa" (green) vs. Female "aa" (black)]
- Combined Comparisons
- Male "aa" (green)
- Female "aa" (black)
- Male "iy" (blue)
- Female "iy" (red)
24OVERLAP IN THE CEPSTRAL SPACE (SWB-All)
The following plots demonstrate overlap of
recognition features in the cepstral space. These
plots consist of all vowels excised from tokens
in the SWITCHBOARD conversational speech corpus.
[Figure panels: All Male Vowels, All Vowels, All Female Vowels]
25Recognition ArchitecturesA Communication
Theoretic Approach
[Figure: chain of Message Source, Linguistic Channel, Articulatory Channel, Acoustic Channel; observables: Message, Words, Sounds, Features.]
- Bayesian formulation for speech recognition
- $P(W \mid A) = P(A \mid W)\,P(W) / P(A)$, where $W$ is the word sequence and $A$ the acoustic observations
- Objective: minimize the word error rate
- Approach: maximize $P(W \mid A)$ during training
- Components
- $P(A \mid W)$: acoustic model (hidden Markov models, mixtures)
- $P(W)$: language model (statistical, finite state networks, etc.)
- The language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams).
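As a rough illustration of how the Bayesian formulation is used at recognition time: P(A) is the same for every candidate word sequence, so picking the W with the largest posterior only needs the acoustic score and the language model score. The sketch below assumes the candidates and their probabilities are already available; the numbers, the function name, and the `lm_weight` scaling factor are hypothetical and not taken from the slides.

```python
import math


def best_hypothesis(candidates, lm_weight=1.0):
    """Pick the word sequence W maximizing log P(A|W) + lm_weight * log P(W).

    candidates: list of (words, acoustic_logprob, lm_logprob) tuples.
    P(A) is identical for every candidate, so it drops out of the argmax.
    """
    return max(candidates, key=lambda c: c[1] + lm_weight * c[2])[0]


# Made-up scores: the second candidate fits the audio slightly better,
# but the language model considers it far less likely.
candidates = [
    ("recognize speech", math.log(0.0010), math.log(0.020)),
    ("wreck a nice beach", math.log(0.0012), math.log(0.001)),
]

with_lm = best_hypothesis(candidates)
acoustic_only = best_hypothesis(candidates, lm_weight=0.0)
```

Working in log probabilities avoids numerical underflow when many small probabilities are multiplied.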
26Recognition ArchitecturesIncorporating Multiple
Knowledge Sources
[Figure: recognizer block diagram; surviving labels: Input Speech, Language Model P(W).]
27Acoustic ModelingFeature Extraction
[Figure: feature extraction block diagram. Blocks: Input Speech, Fourier Transform, Cepstral Analysis, Perceptual Weighting, and two Time Derivative stages. Outputs: Energy and Mel-Spaced Cepstrum; Delta Energy and Delta Cepstrum; Delta-Delta Energy and Delta-Delta Cepstrum.]
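The two Time Derivative stages in the diagram can be sketched as follows. This is one common way to compute delta features, a regression over a few neighboring frames, and is an assumption rather than the method used in the slides; the window of 2 frames, the edge handling, and the toy input are all illustrative.

```python
def delta(frames, window=2):
    """Regression-based time derivative of a feature sequence.

    frames: list of feature vectors (lists of floats), one per time step.
    Frames beyond either end are replaced by the first or last frame.
    """
    if not frames:
        return []
    denom = 2 * sum(n * n for n in range(1, window + 1))
    last = len(frames) - 1
    dim = len(frames[0])
    out = []
    for t in range(len(frames)):
        row = []
        for k in range(dim):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, last)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


# Toy static features: a linearly rising "energy" and one constant coefficient
static = [[float(t), 2.0] for t in range(6)]
d1 = delta(static)   # delta features
d2 = delta(d1)       # delta-delta features
# Final feature vector per frame: static + delta + delta-delta
features = [s + a + b for s, a, b in zip(static, d1, d2)]
```

Appending the derivatives triples the feature dimension and gives the acoustic model some information about how the spectrum is changing over time.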
28Acoustic ModelingHidden Markov Models
- Acoustic models encode the temporal evolution of the features (spectrum).
- Gaussian mixture distributions are used to account for variations in speaker, accent, and pronunciation.
- Phonetic model topologies are simple left-to-right structures.
- Skip states (time-warping) and multiple paths (alternate pronunciations) are also common features of models.
- Sharing model parameters is a common strategy to reduce complexity.
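A minimal sketch of a left-to-right model with a skip transition, scored with the forward algorithm. To keep it short it uses discrete output symbols instead of the Gaussian mixtures mentioned above; all probabilities, state counts, and names are made up for illustration.

```python
def forward_likelihood(obs, init, trans, emit):
    """P(obs | model) computed with the forward algorithm.

    init[i]: probability of starting in state i
    trans[i][j]: probability of moving from state i to state j
    emit[i][o]: probability of emitting symbol o in state i
    """
    n = len(init)
    alpha = [init[i] * emit[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o]
            for j in range(n)
        ]
    return sum(alpha)


# Three-state left-to-right topology: each row allows staying, advancing,
# or (from the first state) skipping ahead; no backward transitions.
init = [1.0, 0.0, 0.0]
trans = [
    [0.6, 0.3, 0.1],  # stay, advance, skip
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]
emit = [
    {"a": 0.8, "b": 0.2},
    {"a": 0.3, "b": 0.7},
    {"a": 0.5, "b": 0.5},
]

likelihood = forward_likelihood(["a", "a", "b"], init, trans, emit)
```

The zeros below the diagonal of `trans` are what make the topology left-to-right; the self-loops let the model absorb variation in duration.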
29Acoustic ModelingParameter Estimation
- Word level transcription
- Supervises closed-loop, data-driven modeling
- Initial parameter estimation
- The expectation/maximization (EM) algorithm is used to improve our parameter estimates.
- Computationally efficient training algorithms (Forward-Backward) are crucial.
- Batch mode parameter updates are typically preferred.
- Decision trees and the use of additional linguistic knowledge are used to optimize parameter-sharing and system complexity.
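To make the EM idea concrete, here is a small sketch on a simpler problem than HMM training: fitting a two-component, one-dimensional Gaussian mixture in batch mode. It is not the Forward-Backward procedure named above, only the same expectation/maximization pattern (soft assignments, then re-estimation). The data, starting values, and the variance floor are assumptions.

```python
import math


def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    return sum(
        math.log(sum(w * gaussian(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_step(data, weights, means, variances, var_floor=1e-3):
    k = len(weights)
    # E-step: posterior probability of each component for each sample
    resp = []
    for x in data:
        joint = [w * gaussian(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: re-estimate parameters from the soft counts (batch update)
    counts = [sum(r[c] for r in resp) for c in range(k)]
    new_weights = [n / len(data) for n in counts]
    new_means = [
        sum(r[c] * x for r, x in zip(resp, data)) / counts[c] for c in range(k)
    ]
    new_vars = [
        max(var_floor,
            sum(r[c] * (x - new_means[c]) ** 2 for r, x in zip(resp, data))
            / counts[c])
        for c in range(k)
    ]
    return new_weights, new_means, new_vars


data = [-2.1, -1.9, -2.0, -1.8, 2.0, 2.2, 1.9, 2.1]
weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
history = [log_likelihood(data, weights, means, variances)]
for _ in range(10):
    weights, means, variances = em_step(data, weights, means, variances)
    history.append(log_likelihood(data, weights, means, variances))
```

Each pass uses all of the data before updating any parameter, which is the batch mode referred to in the list above; `history` records the log-likelihood so its improvement can be inspected.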
30Language ModelingIs A Lot Like Wheel of Fortune
31Language ModelingN-Grams The Good, The Bad, and
The Ugly
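A minimal sketch of an N-gram language model with N = 2: bigram counts give the probability of a word given the previous word, and linear interpolation with the unigram estimate keeps unseen word pairs from getting zero probability. The tiny corpus, the sentence boundary markers, and the interpolation weight `lam` are assumptions for illustration.

```python
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def bigram_prob(word, prev, unigrams, bigrams, lam=0.7):
    """Interpolate the bigram estimate P(word | prev) with the unigram P(word)."""
    total = sum(unigrams.values())
    p_uni = unigrams[word] / total
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni


corpus = [
    "what is the forecast for boston",
    "what is the temperature in boston",
    "will it rain in seattle",
]
unigrams, bigrams = train_bigram(corpus)

p_seen = bigram_prob("is", "what", unigrams, bigrams)      # pair occurs in the corpus
p_unseen = bigram_prob("rain", "what", unigrams, bigrams)  # pair never occurs
```

The good part of N-grams is that they are this simple to estimate; the bad part is that most word pairs never appear in any training corpus, which is why some form of smoothing or interpolation is needed.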
32Language ModelingIntegration of Natural Language
33Implementation IssuesDynamic Programming-Based
Search
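The best-known dynamic programming search for HMM-based recognizers is the Viterbi algorithm, sketched below on a toy three-state left-to-right model with discrete symbols. The model, its probabilities, and the function name are hypothetical; a real LVCSR decoder adds pruning, a lexicon, and language model scores on top of this recursion.

```python
import math


def viterbi(obs, init, trans, emit):
    """Return the most likely state sequence and its log probability."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n = len(init)
    score = [log(init[i]) + log(emit[i][obs[0]]) for i in range(n)]
    backptr = []
    for o in obs[1:]:
        new_score, pointers = [], []
        for j in range(n):
            # Best predecessor of state j at this time step
            best_i = max(range(n), key=lambda i: score[i] + log(trans[i][j]))
            new_score.append(score[best_i] + log(trans[best_i][j]) + log(emit[j][o]))
            pointers.append(best_i)
        score = new_score
        backptr.append(pointers)
    # Trace back from the best final state
    state = max(range(n), key=lambda i: score[i])
    best = score[state]
    path = [state]
    for pointers in reversed(backptr):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best


init = [1.0, 0.0, 0.0]
trans = [
    [0.6, 0.3, 0.1],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]
emit = [
    {"a": 0.8, "b": 0.2},
    {"a": 0.3, "b": 0.7},
    {"a": 0.5, "b": 0.5},
]
obs = ["a", "a", "b", "b"]
path, best_logprob = viterbi(obs, init, trans, emit)
```

Only the best score per state is kept at each time step, so the cost grows linearly with the length of the utterance instead of exponentially with the number of possible state sequences.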
34Implementation IssuesCross-Word Decoding Is
Expensive
- Cross-word decoding: since word boundaries don't occur in spontaneous speech, we must allow for sequences of sounds that span word boundaries.
- Cross-word decoding significantly increases memory requirements.
35Implementation Issues Search Is Resource
Intensive
- Typical LVCSR systems have about 10M free parameters, which makes training a challenge.
- Large speech databases are required (several hundred hours of speech).
- Tying, smoothing, and interpolation are required.
36Applications Conversational Speech
- Conversational speech collected over the telephone contains background noise, music, fluctuations in the speech rate, laughter, partial words, hesitations, mouth noises, etc.
- WER (Word Error Rate) has decreased from 100% to 30% in six years.
- Laughter
- Singing
- Unintelligible
- Spoonerism
- Background Speech
- No pauses
- Restarts
- Vocalized Noise
- Coinage
37ApplicationsAudio Indexing of Broadcast News
- Broadcast news offers some unique challenges
- Lexicon: important information in infrequently occurring words
- Acoustic Modeling: variations in channel, particularly within the same segment (in the studio vs. on location)
- Language Model: must adapt (Bush, Clinton, Bush, McCain, ???)
- Language: multilingual systems? language-independent acoustic modeling?
38ApplicationsAutomatic Phone Centers
- Portals: BeVocal, TellMe, HeyAnita
- VoiceXML 2.0
- Automatic Information Desk
- Reservation Desk
- Automatic Help-Desk
- With Speaker identification
- bank account services
- e-mail services
- corporate services
39Applications Real-Time Translation
- From President Clinton's State of the Union address (January 27, 2000):
- These kinds of innovations are also propelling our remarkable prosperity...
- Soon researchers will bring us devices that can translate foreign languages as fast as you can talk... molecular computers the size of a tear drop with the power of today's fastest supercomputers.
- Human Language Engineering: a sophisticated integration of many speech and language related technologies... a science for the next millennium.
40Technology Future Directions
- The algorithmic issues for the next decade
- Better features by extracting articulatory information?
- Bayesian statistics? Bayesian networks?
- Decision Trees? Information-theoretic measures?
- Nonlinear dynamics? Chaos?