Title: Automatic Speech Recognition: Recent Progress, Current Applications, and Future Trends
1. Evaluation Metrics: Evolution

[Figure: Word error rate (0-40%) plotted against level of difficulty, from easy to hard: command and control, digits, letters and numbers, continuous digits, read speech, broadcast news, and conversational speech.]

- Spontaneous telephone speech is still a grand challenge.
- Telephone-quality speech is still central to the problem.
- Our vision for speech technology continues to evolve.
- Broadcast news is a very dynamic domain.
2. Evaluation Metrics: Human Performance

- Human performance exceeds machine performance by a factor ranging from 4x to 10x, depending on the task.
- On some tasks, such as credit card number recognition, machine performance exceeds human performance because of the limits of human memory retrieval.
- The nature of the noise is as important as the SNR (e.g., cellular phones).
- A primary failure mode for humans is inattention.
- A second major failure mode is a lack of familiarity with the domain (e.g., business terms and corporation names).

[Figure: Word error rate (0-20%) on Wall Street Journal data with additive noise, plotted against speech-to-noise ratio (quiet, 22 dB, 16 dB, 10 dB). Machine error rates rise steeply as noise increases, while a committee of human listeners remains far more robust.]
3. Evaluation Metrics: Machine Performance

[Figure: Word error rate (log scale, 1%-100%) on benchmark tasks from 1988 to 2003. Tasks include read speech at 1k-, 5k-, and 20k-word vocabularies (including noisy and foreign-language conditions), varied microphones, broadcast speech, spontaneous speech, and conversational speech (including a foreign-language condition), shown against a 10x error-reduction reference line.]
4. Evaluation Metrics: Beyond WER (Named Entity)

- Information extraction is the analysis of natural language to collect information about specified types of entities.
- As the focus shifts to providing enhanced annotations, WER may not be the most appropriate measure of performance (content-based scoring).
5. Recognition Architectures: Why Is Speech Recognition So Difficult?

[Figure: Three phone classes (Ph_1, Ph_2, Ph_3) shown as overlapping regions in a two-dimensional feature space (Feature No. 1 vs. Feature No. 2).]

- Our measurements of the signal are ambiguous.
- Regions of overlap represent classification errors.
- Overlap is reduced by introducing acoustic and linguistic context (e.g., context-dependent phones).
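The link between overlapping class distributions and classification errors can be made concrete with a toy calculation. The sketch below assumes two hypothetical phone classes modeled as one-dimensional Gaussians with made-up means and a shared variance (equal priors), and numerically estimates the minimum achievable (Bayes) error caused by the region of overlap:

```python
import numpy as np

# Two hypothetical phone classes measured along a single feature
# dimension (made-up means and variance, equal priors). The region
# where the class-conditional densities overlap is exactly where even
# an optimal classifier must make errors.
mu1, mu2, sigma = 0.0, 2.0, 1.0

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Integrate min(p1, p2) on a fine grid; with equal priors, half of
# this overlap mass is the minimum achievable (Bayes) error rate.
x = np.linspace(-6.0, 8.0, 10001)
dx = x[1] - x[0]
overlap = np.minimum(gaussian(x, mu1, sigma), gaussian(x, mu2, sigma)).sum() * dx
bayes_error = 0.5 * overlap
print(round(bayes_error, 4))
```

Context-dependent modeling attacks exactly this overlap: conditioning each phone on its neighbors splits broad, overlapping distributions into narrower, better-separated ones.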
6. Recognition Architectures: A Communication Theoretic Approach

[Figure: Channel model of speech production. A message source passes through a linguistic channel (words), an articulatory channel (sounds), and an acoustic channel (features); only the acoustic features are observable.]

- Bayesian formulation for speech recognition:
  P(W|A) = P(A|W) P(W) / P(A)
- Objective: minimize the word error rate.
- Approach: maximize P(W|A) during training.
- Components:
  - P(A|W): acoustic model (hidden Markov models, mixtures)
  - P(W): language model (statistical, finite state networks, etc.)
- The language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams).
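As a minimal sketch of this formulation, the fragment below scores two candidate transcriptions in the log domain. The acoustic and language model log-probabilities are made-up illustrative numbers, and P(A) is dropped because it is constant over all word hypotheses:

```python
import math

# Hypothetical scores for two candidate transcriptions of one utterance.
# log P(A|W) would come from the acoustic model (HMMs) and log P(W)
# from the language model; the denominator P(A) does not affect the argmax.
candidates = {
    "recognize speech": {"log_p_a_given_w": -120.0, "log_p_w": math.log(1e-4)},
    "wreck a nice beach": {"log_p_a_given_w": -118.5, "log_p_w": math.log(1e-9)},
}

def posterior_score(c):
    # log P(W|A) up to a constant: log P(A|W) + log P(W)
    return c["log_p_a_given_w"] + c["log_p_w"]

best = max(candidates, key=lambda w: posterior_score(candidates[w]))
print(best)
```

Note how the language model overrules a slightly better acoustic score: the acoustically preferred hypothesis loses because its word sequence is far less probable a priori.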
7. Recognition Architectures: Incorporating Multiple Knowledge Sources

[Figure: Recognition system block diagram showing input speech decoded against multiple knowledge sources, including the language model P(W).]
8. Acoustic Modeling: Feature Extraction

[Figure: Feature extraction pipeline. Input speech passes through a Fourier transform, cepstral analysis, and perceptual weighting to produce the energy and mel-spaced cepstrum; a time derivative yields the delta energy and delta cepstrum, and a second time derivative yields the delta-delta energy and delta-delta cepstrum.]
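A minimal sketch of this front end follows, using numpy only. The frame size, filter count, cepstral order, and the synthetic test tone are arbitrary choices for illustration; production front ends differ in details such as pre-emphasis and liftering:

```python
import numpy as np

def mel(f):
    # Hz -> mel scale (standard formula)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr=8000, n_filters=24, n_ceps=12):
    # Fourier transform of a Hamming-windowed frame -> power spectrum
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    # Perceptual weighting: triangular filters spaced evenly on the mel scale
    edges = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    fbank = np.zeros(n_filters)
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        up = (freqs - lo) / (mid - lo)
        down = (hi - freqs) / (hi - mid)
        fbank[i] = np.dot(np.clip(np.minimum(up, down), 0.0, None), spec)
    # Cepstral analysis: log filterbank energies followed by a DCT
    logfb = np.log(fbank + 1e-10)
    n = np.arange(n_filters)
    return np.array([np.sum(logfb * np.cos(np.pi * k * (n + 0.5) / n_filters))
                     for k in range(1, n_ceps + 1)])

def deltas(feats):
    # Time derivative, approximated frame-to-frame by symmetric differences
    return np.gradient(feats, axis=0)

# Ten 25 ms frames cut from a synthetic 200 Hz tone (stand-in for speech)
sr = 8000
t = np.arange(200) / sr
frames = np.stack([np.sin(2 * np.pi * 200 * t + p) for p in np.linspace(0, 1, 10)])
cep = np.stack([mfcc_frame(f, sr) for f in frames])   # mel-spaced cepstrum
d = deltas(cep)                                       # delta cepstrum
dd = deltas(d)                                        # delta-delta cepstrum
print(cep.shape, d.shape, dd.shape)
```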
9. Acoustic Modeling: Hidden Markov Models

- Acoustic models encode the temporal evolution of the features (spectrum).
- Gaussian mixture distributions are used to account for variations in speaker, accent, and pronunciation.
- Phonetic model topologies are simple left-to-right structures.
- Skip states (time-warping) and multiple paths (alternate pronunciations) are also common features of these models.
- Sharing model parameters is a common strategy to reduce complexity.
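A left-to-right topology with a skip state, plus a Gaussian mixture emission density, can be sketched as follows. The transition probabilities and mixture parameters are illustrative values, not trained ones:

```python
import numpy as np

# Hypothetical 3-state left-to-right phone model with a skip transition.
# Zeros below the diagonal forbid backward moves; the entry two states
# ahead in row 1 is the skip (time-warping) transition.
A = np.array([
    [0.6, 0.3, 0.1],   # state 1: self-loop, advance, skip
    [0.0, 0.7, 0.3],   # state 2: self-loop, advance
    [0.0, 0.0, 1.0],   # state 3: absorbing final state
])

def gmm_loglik(x, weights, means, variances):
    # Log-likelihood of a scalar observation under a 1-D Gaussian mixture;
    # real systems use multivariate mixtures over cepstral feature vectors.
    comps = weights * np.exp(-0.5 * (x - means) ** 2 / variances) \
                    / np.sqrt(2.0 * np.pi * variances)
    return np.log(comps.sum())

# Each row of A is a valid probability distribution over next states...
print(A.sum(axis=1))
# ...and the mixture assigns a finite log-likelihood to an observation.
print(gmm_loglik(0.5, np.array([0.4, 0.6]), np.array([0.0, 1.0]), np.array([1.0, 1.0])))
```

Parameter sharing enters here too: many context-dependent phones can point at the same transition matrix or mixture components to keep the parameter count manageable.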
10. Acoustic Modeling: Parameter Estimation

- Closed-loop, data-driven modeling, supervised only by a word-level transcription.
- The expectation-maximization (EM) algorithm is used to improve the parameter estimates.
- Computationally efficient training algorithms (forward-backward) have been crucial.
- Batch-mode parameter updates are typically preferred.
- Decision trees are used to optimize parameter sharing, system complexity, and the use of additional linguistic knowledge.
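The efficiency of forward-backward training comes from dynamic programming over the state lattice. The sketch below implements just the forward pass for a toy two-state, discrete-emission HMM with made-up parameters; Baum-Welch (the EM instance used here) builds its E-step on exactly this recursion:

```python
import numpy as np

# Toy left-to-right HMM with two states and two discrete output symbols.
# All parameters are illustrative, not trained.
A = np.array([[0.7, 0.3],
              [0.0, 1.0]])          # transition probabilities
B = np.array([[0.9, 0.1],           # P(symbol | state 0)
              [0.2, 0.8]])          # P(symbol | state 1)
pi = np.array([1.0, 0.0])           # always start in state 0
obs = [0, 0, 1, 1]                  # observed symbol sequence

# Forward recursion: alpha[j] = P(obs so far, state j). Cost is
# O(T * N^2) rather than summing over all N^T state paths.
alpha = pi * B[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]
likelihood = alpha.sum()            # P(obs | model)
print(likelihood)
```

In training, this quantity (together with the symmetric backward pass) yields the state occupancy statistics that the EM update uses to re-estimate A and B.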
11. Language Modeling: Is a Lot Like Wheel of Fortune
12. Language Modeling: N-Grams (The Good, the Bad, and the Ugly)
13. Language Modeling: Integration of Natural Language
14. Implementation Issues: Search Is Resource Intensive

- Typical LVCSR systems have about 10M free parameters, which makes training a challenge.
- Large speech databases are required (several hundred hours of speech).
- Tying, smoothing, and interpolation are required.
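As one example of why smoothing is required, the sketch below linearly interpolates a maximum-likelihood bigram with a unigram over a tiny invented corpus; the interpolation weight is an arbitrary choice (in practice it is tuned on held-out data):

```python
from collections import Counter

# Tiny invented corpus; real language models train on millions of
# words, and unseen N-grams are still unavoidable at any scale.
corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def p_unigram(w):
    return unigrams[w] / total

def p_bigram_ml(w, prev):
    # Maximum-likelihood estimate: zero for any bigram never seen in training.
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_interp(w, prev, lam=0.7):
    # Linear interpolation smoothing: back off toward the unigram so
    # unseen bigrams still receive nonzero probability mass.
    return lam * p_bigram_ml(w, prev) + (1.0 - lam) * p_unigram(w)

print(p_bigram_ml("mat", "ran"))   # unseen bigram: zero probability
print(p_interp("mat", "ran"))      # smoothed: small but nonzero
```

Without smoothing, a single unseen bigram would zero out the probability of an entire candidate transcription during search.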
15. Implementation Issues: Dynamic Programming-Based Search
16. Implementation Issues: Cross-Word Decoding Is Expensive

- Cross-word decoding: since word boundaries don't occur in spontaneous speech, we must allow for sequences of sounds that span word boundaries.
- Cross-word decoding significantly increases memory requirements.
17. Implementation Issues: Decoding Example
18. Implementation Issues: Internet-Based Speech Recognition
19. Technology: Conversational Speech

- Conversational speech collected over the telephone contains background noise, music, fluctuations in the speech rate, laughter, partial words, hesitations, mouth noises, etc.
- WER has decreased from 100% to 30% in six years.
- Examples of challenging phenomena:
- Laughter
- Singing
- Unintelligible
- Spoonerism
- Background Speech
- No pauses
- Restarts
- Vocalized Noise
- Coinage
20. Technology: Audio Indexing of Broadcast News

- Broadcast news offers some unique challenges:
- Lexicon: important information is carried in infrequently occurring words.
- Acoustic modeling: variations in channel, particularly within the same segment (in the studio vs. on location).
- Language model: must adapt ("Bush," "Clinton," "Bush," "McCain," ???).
- Language: multilingual systems? Language-independent acoustic modeling?
21. Technology: Real-Time Translation

- From President Clinton's State of the Union address (January 27, 2000):
- "These kinds of innovations are also propelling our remarkable prosperity... Soon researchers will bring us devices that can translate foreign languages as fast as you can talk... molecular computers the size of a teardrop with the power of today's fastest supercomputers."
- Human language engineering: a sophisticated integration of many speech and language related technologies... a science for the next millennium.
22. Technology: Future Directions

- What are the algorithmic issues for the next decade?
- Better features by extracting articulatory information?
- Bayesian statistics? Bayesian networks?
- Decision trees? Information-theoretic measures?
- Nonlinear dynamics? Chaos?
23. To Probe Further: References

Journals and Conferences:

[1] N. Deshmukh, et al., "Hierarchical Search for Large Vocabulary Conversational Speech Recognition," IEEE Signal Processing Magazine, vol. 16, no. 5, pp. 84-107, September 1999.
[2] N. Deshmukh, et al., "Benchmarking Human Performance for Continuous Speech Recognition," Proceedings of the Fourth International Conference on Spoken Language Processing, pp. SuP1P1.10, Philadelphia, Pennsylvania, USA, October 1996.
[3] R. Grishman, "Information Extraction and Speech Recognition," presented at the DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, Virginia, USA, February 1998.
[4] R. P. Lippmann, "Speech Recognition by Machines and Humans," Speech Communication, vol. 22, pp. 1-15, July 1997.
[5] M. Maybury (editor), "News on Demand," Communications of the ACM, vol. 43, no. 2, February 2000.
[6] D. Miller, et al., "Named Entity Extraction from Broadcast News," presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.
[7] D. Pallett, et al., "Broadcast News Benchmark Test Results," presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.
[8] J. Picone, "Signal Modeling Techniques in Speech Recognition," IEEE Proceedings, vol. 81, no. 9, pp. 1215-1247, September 1993.
[9] P. Robinson, et al., "Overview: Information Extraction from Broadcast News," presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.
[10] F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1998.

URLs and Resources:

[11] Speech Corpora, The Linguistic Data Consortium, http://www.ldc.upenn.edu.
[12] Technology Benchmarks, Spoken Natural Language Processing Group, National Institute of Standards and Technology, http://www.itl.nist.gov/iaui/894.01/index.html.
[13] Signal Processing Resources, Institute for Signal and Information Processing, Mississippi State University, http://www.isip.msstate.edu.
[14] Internet-Accessible Speech Recognition Technology, http://www.isip.msstate.edu/projects/speech/index.html.
[15] A Public Domain Speech Recognition System, http://www.isip.msstate.edu/projects/speech/software/index.html.
[16] Remote Job Submission, http://www.isip.msstate.edu/projects/speech/experiments/index.html.
[17] The Switchboard Corpus, http://www.isip.msstate.edu/projects/switchboard/index.html.