Title: Automatic Speech Recognition: Recent Progress, Current Applications, and Future Trends
1. Evaluation Metrics: Evolution

[Figure: Word error rate (0-40%) plotted against level of difficulty, from easy to hard: command and control, digits, letters and numbers, continuous digits, read speech, broadcast news, and conversational speech.]

- Spontaneous telephone speech is still a grand challenge.
- Telephone-quality speech is still central to the problem.
- Our vision for speech technology continues to evolve.
- Broadcast news is a very dynamic domain.
2. Evaluation Metrics: Human Performance

- Human performance exceeds machine performance by a factor ranging from 4x to 10x, depending on the task.
- On some tasks, such as credit card number recognition, machine performance exceeds human performance because of the limits of human memory retrieval.
- The nature of the noise is as important as the SNR (e.g., cellular phones).
- A primary failure mode for humans is inattention.
- A second major failure mode is a lack of familiarity with the domain (e.g., business terms and corporation names).

[Figure: Word error rate (0-20%) on Wall Street Journal data with additive noise, plotted against speech-to-noise ratio (quiet, 22 dB, 16 dB, 10 dB). Machine error rates rise steeply as noise increases, while a committee of human listeners remains far more robust.]
3. Evaluation Metrics: Machine Performance

[Figure: Word error rate (log scale, 1%-100%) on benchmark tasks from 1988 to 2003. Tasks include read speech at 1k-, 5k-, and 20k-word vocabularies (including noisy and foreign-language conditions), varied microphones, broadcast speech, spontaneous speech, and conversational speech (including a foreign-language condition), shown against a 10x error-reduction reference line.]
4. Evaluation Metrics: Beyond WER (Named Entity)

- Information extraction is the analysis of natural language to collect information about specified types of entities.
- As the focus shifts to providing enhanced annotations, WER may not be the most appropriate measure of performance (content-based scoring).
5. Recognition Architectures: Why Is Speech Recognition So Difficult?

[Figure: Three phone classes (Ph_1, Ph_2, Ph_3) shown as overlapping regions in a two-dimensional feature space (Feature No. 1 vs. Feature No. 2).]

- Our measurements of the signal are ambiguous.
- Regions of overlap represent classification errors.
- Overlap is reduced by introducing acoustic and linguistic context (e.g., context-dependent phones).
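The link between overlapping class distributions and classification errors can be made concrete with a toy calculation. The sketch below assumes two hypothetical phone classes modeled as one-dimensional Gaussians with made-up means and a shared variance (equal priors), and numerically estimates the minimum achievable (Bayes) error caused by the region of overlap:

```python
import numpy as np

# Two hypothetical phone classes measured along a single feature
# dimension (made-up means and variance, equal priors). The region
# where the class-conditional densities overlap is exactly where even
# an optimal classifier must make errors.
mu1, mu2, sigma = 0.0, 2.0, 1.0

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Integrate min(p1, p2) on a fine grid; with equal priors, half of
# this overlap mass is the minimum achievable (Bayes) error rate.
x = np.linspace(-6.0, 8.0, 10001)
dx = x[1] - x[0]
overlap = np.minimum(gaussian(x, mu1, sigma), gaussian(x, mu2, sigma)).sum() * dx
bayes_error = 0.5 * overlap
print(round(bayes_error, 4))
```

Context-dependent modeling attacks exactly this overlap: conditioning each phone on its neighbors splits broad, overlapping distributions into narrower, better-separated ones.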
6. Recognition Architectures: A Communication Theoretic Approach

[Figure: Channel model of speech production. A message source passes through a linguistic channel (words), an articulatory channel (sounds), and an acoustic channel (features); only the acoustic features are observable.]

- Bayesian formulation for speech recognition:
  P(W|A) = P(A|W) P(W) / P(A)
- Objective: minimize the word error rate.
- Approach: maximize P(W|A) during training.
- Components:
  - P(A|W): acoustic model (hidden Markov models, mixtures)
  - P(W): language model (statistical, finite state networks, etc.)
- The language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams).
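As a minimal sketch of this formulation, the fragment below scores two candidate transcriptions in the log domain. The acoustic and language model log-probabilities are made-up illustrative numbers, and P(A) is dropped because it is constant over all word hypotheses:

```python
import math

# Hypothetical scores for two candidate transcriptions of one utterance.
# log P(A|W) would come from the acoustic model (HMMs) and log P(W)
# from the language model; the denominator P(A) does not affect the argmax.
candidates = {
    "recognize speech": {"log_p_a_given_w": -120.0, "log_p_w": math.log(1e-4)},
    "wreck a nice beach": {"log_p_a_given_w": -118.5, "log_p_w": math.log(1e-9)},
}

def posterior_score(c):
    # log P(W|A) up to a constant: log P(A|W) + log P(W)
    return c["log_p_a_given_w"] + c["log_p_w"]

best = max(candidates, key=lambda w: posterior_score(candidates[w]))
print(best)
```

Note how the language model overrules a slightly better acoustic score: the acoustically preferred hypothesis loses because its word sequence is far less probable a priori.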
7. Recognition Architectures: Incorporating Multiple Knowledge Sources

[Figure: Recognition system block diagram showing input speech decoded against multiple knowledge sources, including the language model P(W).]
8. Acoustic Modeling: Feature Extraction

[Figure: Feature extraction pipeline. Input speech passes through a Fourier transform, cepstral analysis, and perceptual weighting to produce the energy and mel-spaced cepstrum; a time derivative yields the delta energy and delta cepstrum, and a second time derivative yields the delta-delta energy and delta-delta cepstrum.]
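A minimal sketch of this front end follows, using numpy only. The frame size, filter count, cepstral order, and the synthetic test tone are arbitrary choices for illustration; production front ends differ in details such as pre-emphasis and liftering:

```python
import numpy as np

def mel(f):
    # Hz -> mel scale (standard formula)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr=8000, n_filters=24, n_ceps=12):
    # Fourier transform of a Hamming-windowed frame -> power spectrum
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    # Perceptual weighting: triangular filters spaced evenly on the mel scale
    edges = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    fbank = np.zeros(n_filters)
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        up = (freqs - lo) / (mid - lo)
        down = (hi - freqs) / (hi - mid)
        fbank[i] = np.dot(np.clip(np.minimum(up, down), 0.0, None), spec)
    # Cepstral analysis: log filterbank energies followed by a DCT
    logfb = np.log(fbank + 1e-10)
    n = np.arange(n_filters)
    return np.array([np.sum(logfb * np.cos(np.pi * k * (n + 0.5) / n_filters))
                     for k in range(1, n_ceps + 1)])

def deltas(feats):
    # Time derivative, approximated frame-to-frame by symmetric differences
    return np.gradient(feats, axis=0)

# Ten 25 ms frames cut from a synthetic 200 Hz tone (stand-in for speech)
sr = 8000
t = np.arange(200) / sr
frames = np.stack([np.sin(2 * np.pi * 200 * t + p) for p in np.linspace(0, 1, 10)])
cep = np.stack([mfcc_frame(f, sr) for f in frames])   # mel-spaced cepstrum
d = deltas(cep)                                       # delta cepstrum
dd = deltas(d)                                        # delta-delta cepstrum
print(cep.shape, d.shape, dd.shape)
```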
9. Acoustic Modeling: Hidden Markov Models

- Acoustic models encode the temporal evolution of the features (spectrum).
- Gaussian mixture distributions are used to account for variations in speaker, accent, and pronunciation.
- Phonetic model topologies are simple left-to-right structures.
- Skip states (time-warping) and multiple paths (alternate pronunciations) are also common features of these models.
- Sharing model parameters is a common strategy to reduce complexity.
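A left-to-right topology with a skip state, plus a Gaussian mixture emission density, can be sketched as follows. The transition probabilities and mixture parameters are illustrative values, not trained ones:

```python
import numpy as np

# Hypothetical 3-state left-to-right phone model with a skip transition.
# Zeros below the diagonal forbid backward moves; the entry two states
# ahead in row 1 is the skip (time-warping) transition.
A = np.array([
    [0.6, 0.3, 0.1],   # state 1: self-loop, advance, skip
    [0.0, 0.7, 0.3],   # state 2: self-loop, advance
    [0.0, 0.0, 1.0],   # state 3: absorbing final state
])

def gmm_loglik(x, weights, means, variances):
    # Log-likelihood of a scalar observation under a 1-D Gaussian mixture;
    # real systems use multivariate mixtures over cepstral feature vectors.
    comps = weights * np.exp(-0.5 * (x - means) ** 2 / variances) \
                    / np.sqrt(2.0 * np.pi * variances)
    return np.log(comps.sum())

# Each row of A is a valid probability distribution over next states...
print(A.sum(axis=1))
# ...and the mixture assigns a finite log-likelihood to an observation.
print(gmm_loglik(0.5, np.array([0.4, 0.6]), np.array([0.0, 1.0]), np.array([1.0, 1.0])))
```

Parameter sharing enters here too: many context-dependent phones can point at the same transition matrix or mixture components to keep the parameter count manageable.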
10. Acoustic Modeling: Parameter Estimation

- Closed-loop, data-driven modeling, supervised only by a word-level transcription.
- The expectation-maximization (EM) algorithm is used to improve the parameter estimates.
- Computationally efficient training algorithms (forward-backward) have been crucial.
- Batch-mode parameter updates are typically preferred.
- Decision trees are used to optimize parameter sharing, system complexity, and the use of additional linguistic knowledge.
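The efficiency of forward-backward training comes from dynamic programming over the state lattice. The sketch below implements just the forward pass for a toy two-state, discrete-emission HMM with made-up parameters; Baum-Welch (the EM instance used here) builds its E-step on exactly this recursion:

```python
import numpy as np

# Toy left-to-right HMM with two states and two discrete output symbols.
# All parameters are illustrative, not trained.
A = np.array([[0.7, 0.3],
              [0.0, 1.0]])          # transition probabilities
B = np.array([[0.9, 0.1],           # P(symbol | state 0)
              [0.2, 0.8]])          # P(symbol | state 1)
pi = np.array([1.0, 0.0])           # always start in state 0
obs = [0, 0, 1, 1]                  # observed symbol sequence

# Forward recursion: alpha[j] = P(obs so far, state j). Cost is
# O(T * N^2) rather than summing over all N^T state paths.
alpha = pi * B[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]
likelihood = alpha.sum()            # P(obs | model)
print(likelihood)
```

In training, this quantity (together with the symmetric backward pass) yields the state occupancy statistics that the EM update uses to re-estimate A and B.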
11. Language Modeling: Is a Lot Like Wheel of Fortune
12. Language Modeling: N-Grams (The Good, the Bad, and the Ugly)
13. Language Modeling: Integration of Natural Language
14. Implementation Issues: Search Is Resource Intensive

- Typical LVCSR systems have about 10M free parameters, which makes training a challenge.
- Large speech databases are required (several hundred hours of speech).
- Tying, smoothing, and interpolation are required.
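As one example of why smoothing is required, the sketch below linearly interpolates a maximum-likelihood bigram with a unigram over a tiny invented corpus; the interpolation weight is an arbitrary choice (in practice it is tuned on held-out data):

```python
from collections import Counter

# Tiny invented corpus; real language models train on millions of
# words, and unseen N-grams are still unavoidable at any scale.
corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def p_unigram(w):
    return unigrams[w] / total

def p_bigram_ml(w, prev):
    # Maximum-likelihood estimate: zero for any bigram never seen in training.
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_interp(w, prev, lam=0.7):
    # Linear interpolation smoothing: back off toward the unigram so
    # unseen bigrams still receive nonzero probability mass.
    return lam * p_bigram_ml(w, prev) + (1.0 - lam) * p_unigram(w)

print(p_bigram_ml("mat", "ran"))   # unseen bigram: zero probability
print(p_interp("mat", "ran"))      # smoothed: small but nonzero
```

Without smoothing, a single unseen bigram would zero out the probability of an entire candidate transcription during search.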
15. Implementation Issues: Dynamic Programming-Based Search
16. Implementation Issues: Cross-Word Decoding Is Expensive

- Cross-word decoding: since word boundaries don't occur in spontaneous speech, we must allow for sequences of sounds that span word boundaries.
- Cross-word decoding significantly increases memory requirements.
17. Implementation Issues: Decoding Example
18. Implementation Issues: Internet-Based Speech Recognition
19. Technology: Conversational Speech

- Conversational speech collected over the telephone contains background noise, music, fluctuations in the speech rate, laughter, partial words, hesitations, mouth noises, etc.
- WER has decreased from 100% to 30% in six years.
- Examples of challenging phenomena:
- Laughter
- Singing
- Unintelligible
- Spoonerism
- Background Speech
- No pauses
- Restarts
- Vocalized Noise
- Coinage
20. Technology: Audio Indexing of Broadcast News

- Broadcast news offers some unique challenges:
- Lexicon: important information is carried in infrequently occurring words.
- Acoustic modeling: variations in channel, particularly within the same segment (in the studio vs. on location).
- Language model: must adapt ("Bush," "Clinton," "Bush," "McCain," ???).
- Language: multilingual systems? Language-independent acoustic modeling?
21. Technology: Real-Time Translation

- From President Clinton's State of the Union address (January 27, 2000):
- "These kinds of innovations are also propelling our remarkable prosperity... Soon researchers will bring us devices that can translate foreign languages as fast as you can talk... molecular computers the size of a teardrop with the power of today's fastest supercomputers."
- Human language engineering: a sophisticated integration of many speech and language related technologies... a science for the next millennium.
22. Technology: Future Directions

- What are the algorithmic issues for the next decade?
- Better features by extracting articulatory information?
- Bayesian statistics? Bayesian networks?
- Decision trees? Information-theoretic measures?
- Nonlinear dynamics? Chaos?
23. To Probe Further: References

Journals and Conferences:

[1] N. Deshmukh, et al., "Hierarchical Search for Large Vocabulary Conversational Speech Recognition," IEEE Signal Processing Magazine, vol. 16, no. 5, pp. 84-107, September 1999.
[2] N. Deshmukh, et al., "Benchmarking Human Performance for Continuous Speech Recognition," Proceedings of the Fourth International Conference on Spoken Language Processing, pp. SuP1P1.10, Philadelphia, Pennsylvania, USA, October 1996.
[3] R. Grishman, "Information Extraction and Speech Recognition," presented at the DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, Virginia, USA, February 1998.
[4] R. P. Lippmann, "Speech Recognition by Machines and Humans," Speech Communication, vol. 22, pp. 1-15, July 1997.
[5] M. Maybury (editor), "News on Demand," Communications of the ACM, vol. 43, no. 2, February 2000.
[6] D. Miller, et al., "Named Entity Extraction from Broadcast News," presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.
[7] D. Pallett, et al., "Broadcast News Benchmark Test Results," presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.
[8] J. Picone, "Signal Modeling Techniques in Speech Recognition," IEEE Proceedings, vol. 81, no. 9, pp. 1215-1247, September 1993.
[9] P. Robinson, et al., "Overview: Information Extraction from Broadcast News," presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.
[10] F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1998.

URLs and Resources:

[11] Speech Corpora, The Linguistic Data Consortium, http://www.ldc.upenn.edu.
[12] Technology Benchmarks, Spoken Natural Language Processing Group, National Institute of Standards and Technology, http://www.itl.nist.gov/iaui/894.01/index.html.
[13] Signal Processing Resources, Institute for Signal and Information Processing, Mississippi State University, http://www.isip.msstate.edu.
[14] Internet-Accessible Speech Recognition Technology, http://www.isip.msstate.edu/projects/speech/index.html.
[15] A Public Domain Speech Recognition System, http://www.isip.msstate.edu/projects/speech/software/index.html.
[16] Remote Job Submission, http://www.isip.msstate.edu/projects/speech/experiments/index.html.
[17] The Switchboard Corpus, http://www.isip.msstate.edu/projects/switchboard/index.html.