Automatic Speaker Recognition: Recent Progress, Current Applications, and Future Trends

1
Evaluation Metrics: Evolution
  • Spontaneous telephone speech is still a grand
    challenge.
  • Telephone-quality speech is still central to the
    problem.
  • The vision for speech technology continues to
    evolve.
  • Broadcast news is a very dynamic domain.

[Figure: word error rate (0-40%) vs. level of difficulty, for tasks ordered from command and control, digits, letters and numbers, and continuous digits, through read speech and broadcast news, up to conversational speech.]
2
Evaluation Metrics: Human Performance
  • Human performance exceeds machine performance by
    a factor ranging from 4x to 10x, depending on the
    task.
  • On some tasks, such as credit card number
    recognition, machine performance exceeds human
    performance because of limits on human memory
    retrieval.
  • The nature of the noise is as important as the
    SNR (e.g., cellular phones).
  • A primary failure mode for humans is inattention.
  • A second major failure mode is lack of
    familiarity with the domain (e.g., business
    terms and corporation names).

[Figure: word error rate vs. speech-to-noise ratio on Wall Street Journal data with additive noise. From quiet conditions down to 10 dB SNR, machines degrade far more sharply than a committee of human listeners.]
3
Evaluation Metrics: Machine Performance

[Figure: NIST benchmark word error rates, 1988-2003, on a logarithmic scale from 1% to 100%. Curves track steady improvement on read speech (1k-, 5k-, and 20k-word vocabularies, including noisy and foreign-language conditions), broadcast speech, conversational speech, and spontaneous speech with varied microphones.]
4
Evaluation Metrics: Beyond WER (Named Entity)
  • Information extraction is the analysis of
    natural language to collect information about
    specified types of entities.
  • As the focus shifts to providing enhanced
    annotations, WER may not be the most appropriate
    measure of performance (content-based scoring).

5
Recognition Architectures: Why Is Speech Recognition So Difficult?
  • Our measurements of the signal are ambiguous.
  • The region of overlap between classes represents
    classification errors.
  • Overlap is reduced by introducing acoustic and
    linguistic context (e.g., context-dependent
    phones).

[Figure: three phone classes (Ph_1, Ph_2, Ph_3) shown as overlapping regions in a two-dimensional feature space (Feature No. 1 vs. Feature No. 2).]
6
Recognition Architectures: A Communication-Theoretic Approach

[Figure: communication-theoretic model. A message source passes through a linguistic channel (producing words), an articulatory channel (producing sounds), and an acoustic channel (producing features) to yield the observable message.]

  • Bayesian formulation for speech recognition:
    P(W|A) = P(A|W) P(W) / P(A)
  • Objective: minimize the word error rate.
  • Approach: maximize P(W|A) during training.
  • Components:
    P(A|W): acoustic model (hidden Markov models,
    mixtures)
    P(W): language model (statistical, finite
    state networks, etc.)
  • The language model typically predicts a small set
    of next words based on knowledge of a finite
    number of previous words (N-grams).
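The Bayesian formulation above can be sketched as a toy noisy-channel decoder. The hypothesis strings and all log probabilities below are invented purely for illustration; a real recognizer computes log P(A|W) from HMM acoustic models and log P(W) from an N-gram language model.

```python
# Toy noisy-channel decoder: pick the hypothesis W maximizing
# P(W|A), which is proportional to P(A|W) * P(W).  All scores
# here are made-up log probabilities for illustration only.
def decode(acoustic_logprob, lm_logprob, hypotheses):
    """Return the hypothesis with the best combined log score."""
    def score(w):
        return acoustic_logprob[w] + lm_logprob[w]  # log P(A|W) + log P(W)
    return max(hypotheses, key=score)

# Two acoustically similar hypotheses; the language model breaks the tie.
acoustic = {"recognize speech": -12.0, "wreck a nice beach": -11.5}
lm = {"recognize speech": -2.0, "wreck a nice beach": -6.0}
best = decode(acoustic, lm, list(acoustic))
```

Note that P(A) is dropped from the maximization: it is constant over hypotheses, so it does not affect the argmax.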
7
Recognition Architectures: Incorporating Multiple Knowledge Sources

[Figure: decoder block diagram combining input speech with the language model P(W) and other knowledge sources during search.]
8
Acoustic Modeling: Feature Extraction

Input Speech → Fourier Transform → Cepstral Analysis → Perceptual Weighting → Energy + Mel-Spaced Cepstrum → Time Derivative (Delta Energy, Delta Cepstrum) → Time Derivative (Delta-Delta Energy, Delta-Delta Cepstrum)
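A minimal NumPy sketch of this front end (FFT, mel-spaced filterbank, log, DCT cepstrum, time derivatives). The frame length, hop, filter count, and number of cepstra are illustrative defaults, not values taken from the presentation.

```python
import numpy as np

def mel(f):
    """Hz to mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    pts = np.linspace(mel(0), mel(sr / 2), n_filters + 2)
    hz = 700.0 * (10.0 ** (pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, c):
            fb[i - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[i - 1, k] = (hi - k) / max(hi - c, 1)
    return fb

def mfcc(signal, sr=8000, frame=256, hop=128, n_filters=24, n_ceps=13):
    """Mel-spaced cepstra, one row per analysis frame."""
    fb = mel_filterbank(n_filters, frame, sr)
    n = np.arange(n_filters)
    feats = []
    for start in range(0, len(signal) - frame + 1, hop):
        x = signal[start:start + frame] * np.hamming(frame)
        power = np.abs(np.fft.rfft(x)) ** 2          # Fourier transform
        logmel = np.log(fb @ power + 1e-10)          # mel filterbank + log
        # DCT-II of the log filterbank energies gives the cepstrum.
        cep = np.array([np.sum(logmel * np.cos(np.pi * q * (n + 0.5) / n_filters))
                        for q in range(n_ceps)])
        feats.append(cep)
    return np.array(feats)

def deltas(feats):
    """First-order time derivative by simple frame differencing."""
    return np.diff(feats, axis=0, prepend=feats[:1])
```

Applying `deltas` twice yields the delta-delta terms in the diagram; perceptual weighting (e.g., pre-emphasis or liftering) is omitted here for brevity.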
9
Acoustic Modeling: Hidden Markov Models
  • Acoustic models encode the temporal evolution of
    the features (spectrum).
  • Gaussian mixture distributions are used to
    account for variations in speaker, accent, and
    pronunciation.
  • Phonetic model topologies are simple
    left-to-right structures.
  • Skip states (time-warping) and multiple paths
    (alternate pronunciations) are also common
    features of models.
  • Sharing model parameters is a common strategy to
    reduce complexity.
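The left-to-right topology described above can be illustrated with a toy three-state model and the forward recursion used to score observations against it. The transition matrix and entry distribution below are invented values; real models use Gaussian mixture emission densities estimated from data.

```python
import numpy as np

# Simple left-to-right topology: no backward arcs, optional self-loops.
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])   # always enter at the first state

def forward(obs_likelihoods):
    """P(observations | model) via the forward recursion.

    obs_likelihoods[t, j] = p(o_t | state j), here supplied directly;
    a real system evaluates Gaussian mixtures on feature vectors.
    """
    alpha = pi * obs_likelihoods[0]
    for t in range(1, len(obs_likelihoods)):
        alpha = (alpha @ A) * obs_likelihoods[t]
    return alpha.sum()
```

Adding a skip arc from state 1 to state 3 (a nonzero `A[0, 2]`) would model the time-warping mentioned above; multiple parallel paths model alternate pronunciations.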

10
Acoustic Modeling: Parameter Estimation
  • Closed-loop, data-driven modeling, supervised
    only by a word-level transcription.
  • The expectation/maximization (EM) algorithm is
    used to improve our parameter estimates.
  • Computationally efficient training algorithms
    (Forward-Backward) have been crucial.
  • Batch mode parameter updates are typically
    preferred.
  • Decision trees are used to optimize
    parameter-sharing, system complexity, and the
    use of additional linguistic knowledge.
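The Forward-Backward computation behind these EM updates can be sketched as follows: it produces the state-occupancy posteriors (the E-step statistics) that weight the data when parameters are re-estimated. The model values used in testing are toys; in practice `b` comes from evaluating the Gaussian mixtures on each frame.

```python
import numpy as np

def forward_backward(A, pi, b):
    """State-occupancy posteriors gamma[t, j] = P(state j at time t | O).

    A:  state transition matrix, pi: entry distribution,
    b:  b[t, j] = p(o_t | state j) (frame emission likelihoods).
    """
    T, N = b.shape
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * b[0]                      # forward pass
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * b[t]
    beta[-1] = 1.0                            # backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (b[t + 1] * beta[t + 1])
    gamma = alpha * beta                      # joint, then normalize per frame
    return gamma / gamma.sum(axis=1, keepdims=True)
```

In the M-step these posteriors weight the frames: for example, a Gaussian mean becomes the gamma-weighted average of the feature vectors assigned to that state.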

11
Language Modeling Is a Lot Like Wheel of Fortune
12
Language Modeling: N-Grams (The Good, The Bad, and The Ugly)
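As a concrete illustration of an N-gram predictor, and of one of its ugly sides (unseen events), here is a minimal bigram model with add-one smoothing over a toy corpus. The corpus and the smoothing choice are assumptions for illustration; production systems use far larger histories, corpora, and better smoothing schemes.

```python
from collections import Counter

def train_bigrams(tokens):
    """Return p(word | prev) estimated with add-one (Laplace) smoothing."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = len(unigrams)
    def p(word, prev):
        # The +1 keeps unseen bigrams from zeroing out P(W) entirely.
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)
    return p

corpus = "the cat sat on the mat the cat ran".split()
p = train_bigrams(corpus)
```

The good: the model is trivial to estimate and query. The bad: probability mass is spread thinly over many histories. The ugly: without smoothing, any bigram absent from training would make an entire sentence impossible.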
13
Language Modeling: Integration of Natural Language
14
Implementation Issues: Search Is Resource-Intensive
  • Typical LVCSR systems have about 10M free
    parameters, which makes training a challenge.
  • Large speech databases are required (several
    hundred hours of speech).
  • Tying, smoothing, and interpolation are required.

15
Implementation Issues: Dynamic Programming-Based Search
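The dynamic-programming search named in the slide title is typically the Viterbi recursion; a minimal sketch over a toy two-state model follows. The states, transitions, and emission scores below are invented; production decoders search huge cross-word networks and rely on pruning to stay tractable.

```python
import numpy as np

def viterbi(logA, log_pi, log_b):
    """Best state sequence through T frames of log emission scores.

    logA[i, j]: log transition score i -> j; log_pi: log entry scores;
    log_b[t, j]: log emission score of frame t in state j.
    """
    T, N = log_b.shape
    delta = log_pi + log_b[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA        # scores[i, j]: extend i -> j
        back[t] = scores.argmax(axis=0)       # remember best predecessor
        delta = scores.max(axis=0) + log_b[t]
    path = [int(delta.argmax())]              # backtrace from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Working in log space turns the products of probabilities into sums and avoids numeric underflow over long utterances, which is why decoders score everything in log probabilities.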
16
Implementation Issues: Cross-Word Decoding Is Expensive
  • Cross-word decoding: since word boundaries don't
    occur in spontaneous speech, we must allow for
    sequences of sounds that span word boundaries.
  • Cross-word decoding significantly increases
    memory requirements.

17
Implementation Issues: Decoding Example
18
Implementation Issues: Internet-Based Speech Recognition
19
Technology: Conversational Speech
  • Conversational speech collected over the
    telephone contains background noise, music,
    fluctuations in the speech rate, laughter,
    partial words, hesitations, mouth noises, etc.
  • WER has decreased from 100% to 30% in six years.
  • Typical annotated phenomena include: laughter,
    singing, unintelligible speech, spoonerisms,
    background speech, missing pauses, restarts,
    vocalized noise, and coinages.

20
Technology: Audio Indexing of Broadcast News
  • Broadcast news offers some unique challenges:
  • Lexicon: important information appears in
    infrequently occurring words.
  • Acoustic modeling: variations in channel,
    particularly within the same segment (in the
    studio vs. on location).
  • Language model: must adapt (Bush, Clinton,
    Bush, McCain, ???).
  • Language: multilingual systems?
    language-independent acoustic modeling?

21
Technology Real-Time Translation
  • From President Clintons State of the Union
    address (January 27, 2000)
  • These kinds of innovations are also propelling
    our remarkable prosperity...
  • Soon researchers will bring us devices that can
    translate foreign languages
  • as fast as you can talk... molecular computers
    the size of a tear drop with the
  • power of todays fastest supercomputers.
  • Human Language Engineering a sophisticated
    integration of many speech and
  • language related technologies... a science
    for the next millennium.

22
Technology: Future Directions
  • What are the algorithmic issues for the next
    decade?
  • Better features by extracting articulatory
    information?
  • Bayesian statistics? Bayesian networks?
  • Decision trees? Information-theoretic measures?
  • Nonlinear dynamics? Chaos?

23
To Probe Further: References

Journals and Conferences

[1] N. Deshmukh et al., "Hierarchical Search for Large-Vocabulary Conversational Speech Recognition," IEEE Signal Processing Magazine, vol. 1, no. 5, pp. 84-107, September 1999.
[2] N. Deshmukh et al., "Benchmarking Human Performance for Continuous Speech Recognition," Proceedings of the Fourth International Conference on Spoken Language Processing, pp. SuP1P1.10, Philadelphia, Pennsylvania, USA, October 1996.
[3] R. Grishman, "Information Extraction and Speech Recognition," presented at the DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, Virginia, USA, February 1998.
[4] R. P. Lippmann, "Speech Recognition by Machines and Humans," Speech Communication, vol. 22, pp. 1-15, July 1997.
[5] M. Maybury (editor), "News on Demand," Communications of the ACM, vol. 43, no. 2, February 2000.
[6] D. Miller et al., "Named Entity Extraction from Broadcast News," presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.
[7] D. Pallett et al., "Broadcast News Benchmark Test Results," presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.
[8] J. Picone, "Signal Modeling Techniques in Speech Recognition," IEEE Proceedings, vol. 81, no. 9, pp. 1215-1247, September 1993.
[9] P. Robinson et al., "Overview: Information Extraction from Broadcast News," presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.
[10] F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1998.

URLs and Resources

[11] Speech Corpora, The Linguistic Data Consortium, http://www.ldc.upenn.edu.
[12] Technology Benchmarks, Spoken Natural Language Processing Group, The National Institute for Standards, http://www.itl.nist.gov/iaui/894.01/index.html.
[13] Signal Processing Resources, Institute for Signal and Information Technology, Mississippi State University, http://www.isip.msstate.edu.
[14] Internet-Accessible Speech Recognition Technology, http://www.isip.msstate.edu/projects/speech/index.html.
[15] A Public Domain Speech Recognition System, http://www.isip.msstate.edu/projects/speech/software/index.html.
[16] Remote Job Submission, http://www.isip.msstate.edu/projects/speech/experiments/index.html.
[17] The Switchboard Corpus, http://www.isip.msstate.edu/projects/switchboard/index.html.