Title: Automatic Speech Recognition Introduction
Slide 1: Automatic Speech Recognition: Introduction
- Readings: Jurafsky & Martin, 7.1-2
- HLT Survey, Chapter 1
Slide 2: The Human Dialogue System
Slide 3: The Human Dialogue System
Slide 4: Computer Dialogue Systems
(Diagram: a pipeline of components. Audition and Automatic Speech Recognition turn the speech signal into words; Natural Language Understanding maps words to a logical form; Dialogue Management and Planning operate on the logical form; Natural Language Generation produces words; Text-to-speech turns words back into a signal.)
Slide 5: Computer Dialogue Systems
(Same pipeline with abbreviated labels: Audition, ASR, NLU, Dialogue Mgmt., Planning, NLG, Text-to-speech; edges labeled signal, words, and logical form.)
Slide 6: Parameters of ASR Capabilities
- Different types of tasks come with different difficulties:
- Speaking mode (isolated words / continuous speech)
- Speaking style (read / spontaneous)
- Enrollment (speaker-independent / speaker-dependent)
- Vocabulary (small: < 20 words / large: > 20,000 words)
- Language model (finite state / context sensitive)
- Perplexity (small: < 10 / large: > 100)
- Signal-to-noise ratio (high: > 30 dB / low: < 10 dB)
- Transducer (high-quality microphone / telephone)
Slide 7: The Noisy Channel Model
(Diagram: a message passes through a noisy channel to produce a signal; the decoder recovers the message from the signal.)
Decoding model: find Message* = argmax P(Message | Signal). But how do we represent each of these things?
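Spelled out with Bayes' rule, the decoding objective splits into a channel (acoustic) term and a source (language) term; this is the same decomposition the Bayesian-inference slides later in the deck derive in full:

    \hat{M} = \arg\max_{M} P(M \mid S)
            = \arg\max_{M} \frac{P(S \mid M)\,P(M)}{P(S)}
            = \arg\max_{M} P(S \mid M)\,P(M)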
Slide 8: ASR using HMMs
- Try to solve P(Message | Signal) by breaking the problem up into separate components
- Most common method: Hidden Markov Models
- Assume that a message is composed of words
- Assume that words are composed of sub-word parts (phones)
- Assume that phones have some sort of acoustic realization
- Use probabilistic models for matching acoustics to phones to words
Slide 9: HMMs: The Traditional View
(Diagram: a Markov model backbone composed of phones, hidden because we don't know the correspondences. The words "go home" expand into the phone sequence g, o, h, o, m, which is aligned to the acoustic observations x0 through x9. Each line in the diagram represents a probability estimate; more on this later.)
Slide 10: HMMs: The Traditional View
(Same diagram: the phone backbone for "go home" aligned to the acoustic observations x0 through x9.)
Even with the same word hypothesis, we can have different alignments. Also, we have to search over all word hypotheses.
Slide 11: HMMs as Dynamic Bayesian Networks
(Diagram: the Markov model backbone composed of phones for "go home". Each time step has a hidden state labeled with a phone, q0=g, q1=o, q2=o, q3=o, q4=h, q5=o, q6=o, q7=o, q8=m, q9=m, and each state emits the corresponding acoustic observation x0 through x9.)
Slide 12: HMMs as Dynamic Bayesian Networks
(Same diagram: phone-labeled states q0 through q9 emitting the acoustic observations x0 through x9.)
ASR: what is the best assignment to q0...q9 given x0...x9?
Slide 13: Hidden Markov Models and DBNs
(Diagram: the same model shown in its DBN representation and in its traditional Markov model representation.)
Slide 14: Parts of an ASR System
(Diagram: Feature Calculation, Acoustic Modeling, and Language Modeling feed a SEARCH component that outputs "The cat chased the dog". The acoustic model maps features to phones such as k and @; the language model box shows bigram probabilities: cat dog 0.00002, cat the 0.0000005, the cat 0.029, the dog 0.031, the mail 0.054.)
Slide 15: Parts of an ASR System
(Same diagram, annotated with each component's role: produces the acoustics (xt), maps acoustics to phones, maps phones to words, and strings words together.)
Slide 16: Feature Calculation
Slide 17: Feature Calculation
(Spectrogram: frequency versus time.)
Find the energy at each time step in each frequency channel.
Slide 18: Feature Calculation
(Spectrogram: frequency versus time.)
Take the inverse Discrete Fourier Transform to decorrelate the frequencies.
Slide 19: Feature Calculation
Input: the speech signal (shown as an image on the original slide).
Output: a matrix of feature values, e.g.
  -0.1  0.3  1.4 -1.2  2.3  2.6
   0.2  0.1  1.2 -1.2  4.4  2.2
  -6.1 -2.1  3.1  2.4  1.0  2.2
   0.2  0.0  1.2 -1.2  4.4  2.2
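As a concrete illustration of the steps on slides 17-19 (energy per frequency channel, then a decorrelating transform), here is a minimal Python sketch. The frame length, hop size, window choice, and number of coefficients are illustrative assumptions, and a real front end would insert a mel filterbank before the log, which this sketch skips:

    import numpy as np

    def dct_ii(x):
        # Type-II DCT, written out directly to avoid a scipy dependency.
        n = len(x)
        k = np.arange(n)
        return np.array([np.sum(x * np.cos(np.pi * (k + 0.5) * m / n)) for m in range(n)])

    def simple_features(signal, frame_len=400, hop=160, n_coeffs=13):
        # Framewise log spectral energy, decorrelated with a DCT (slides 17-18).
        frames = []
        for start in range(0, len(signal) - frame_len + 1, hop):
            frame = signal[start:start + frame_len] * np.hamming(frame_len)
            spectrum = np.abs(np.fft.rfft(frame)) ** 2        # energy per frequency channel
            log_energy = np.log(spectrum + 1e-10)
            frames.append(dct_ii(log_energy)[:n_coeffs])      # keep the low-order coefficients
        return np.array(frames)                               # shape: (num_frames, n_coeffs)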
Slide 20: Robust Speech Recognition
- Different schemes have been developed for dealing with noise and reverberation
- Additive noise: reduce the effects of particular frequencies
- Convolutional noise: remove the effects of linear filters (cepstral mean subtraction)
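Cepstral mean subtraction is easy to state in code: subtract the per-utterance mean of each cepstral coefficient, since a fixed linear filter shows up as a constant offset in the log-cepstral domain. A minimal sketch, assuming features shaped (num_frames, num_coefficients) as produced by the sketch above:

    import numpy as np

    def cepstral_mean_subtraction(features):
        # Remove the per-utterance mean of each cepstral coefficient;
        # a time-invariant channel contributes a constant offset here.
        return features - features.mean(axis=0, keepdims=True)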
Slide 21: Now What?
(Diagram: the feature matrix from slide 19 on the left, "???" in the middle, and the word output "That you" on the right.)
Slide 22: Machine Learning!
(Same diagram, with the "???" filled in: pattern recognition with HMMs maps the feature matrix to the words "That you".)
Slide 23: Hidden Markov Models (again!)
P(state_t+1 | state_t): pronunciation / language models
P(acoustics_t | state_t): acoustic model
Slide 24: Acoustic Model
- Assume that you can label each vector with a phonetic label
- Collect all of the examples of a phone together and build a Gaussian model (or some other statistical model, e.g. neural networks)
N_a(μ, Σ) ≈ P(X | state = a)
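A minimal sketch of that idea in Python: group labeled feature vectors by phone, fit one Gaussian per phone, and score new vectors with the log-density. The diagonal-covariance form and the variance floor are simplifying assumptions for illustration, not details from the slide:

    import numpy as np
    from collections import defaultdict

    def fit_phone_gaussians(vectors, phone_labels):
        # Fit a diagonal-covariance Gaussian N_a(mean, var) for each phone a.
        grouped = defaultdict(list)
        for x, phone in zip(vectors, phone_labels):
            grouped[phone].append(x)
        models = {}
        for phone, xs in grouped.items():
            xs = np.array(xs)
            models[phone] = (xs.mean(axis=0), xs.var(axis=0) + 1e-4)   # variance floor
        return models

    def log_likelihood(x, model):
        # log P(x | state = a) under a diagonal Gaussian (mean, var).
        mean, var = model
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)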
Slide 25: Building up the Markov Model
- Start with a model for each phone
- Typically, we use 3 states per phone to give a minimum duration constraint, but we ignore that here (a small sketch of the 3-state structure follows below)
(Diagram: a phone model drawn as states with arcs, each arc labeled with a transition probability.)
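As an illustration of the 3-states-per-phone structure mentioned above, here is a minimal sketch that builds a left-to-right transition matrix for a single phone model. The self-loop probability is an arbitrary illustrative value:

    import numpy as np

    def phone_transition_matrix(n_states=3, self_loop=0.6):
        # Left-to-right phone model: each state either repeats (self-loop)
        # or advances to the next state, giving a minimum duration of n_states frames.
        A = np.zeros((n_states + 1, n_states + 1))     # last index is an exit state
        for i in range(n_states):
            A[i, i] = self_loop
            A[i, i + 1] = 1.0 - self_loop
        return A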
Slide 26: Building up the Markov Model
- The pronunciation model gives the connections between phones and words
- Multiple pronunciations are possible
(Diagram: a pronunciation network built from the phones t, m, ow, ey, and ah, with alternative arcs for different pronunciations of the same word.)
Slide 27: Building up the Markov Model
- The language model gives the connections between words (e.g., a bigram grammar)
(Diagram: the final phone t of the word "that" connects to following words with probabilities p(he | that) and p(you | that).)
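A bigram grammar of this kind can be estimated directly from counts over a text corpus. A minimal sketch with add-one smoothing (the smoothing choice is an assumption for illustration, not something the slide specifies):

    from collections import Counter

    def bigram_probs(sentences):
        # Estimate p(w2 | w1) from tokenized sentences, with add-one smoothing.
        unigrams, bigrams, vocab = Counter(), Counter(), set()
        for words in sentences:
            padded = ["<s>"] + words + ["</s>"]
            vocab.update(padded)
            unigrams.update(padded[:-1])
            bigrams.update(zip(padded[:-1], padded[1:]))
        V = len(vocab)
        return lambda w1, w2: (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

    p = bigram_probs([["that", "you", "go"], ["that", "he", "went"]])
    print(p("that", "you"), p("that", "he"))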
Slide 28: ASR as Bayesian Inference
(Diagram: states q1/w1, q2/w1, q3/w1 emitting observations x1, x2, x3, with phone labels such as iy, d, and t, and language-model arcs p(he | that) and p(you | that).)
argmax_W P(W | X)
  = argmax_W P(X | W) P(W) / P(X)
  = argmax_W P(X | W) P(W)
  = argmax_W Σ_Q P(X, Q | W) P(W)
  ≈ argmax_W max_Q P(X, Q | W) P(W)
  ≈ argmax_W max_Q P(X | Q) P(Q | W) P(W)
Slide 29: ASR Probability Models
- Three probability models:
- P(X | Q): acoustic model
- P(Q | W): duration / transition / pronunciation model
- P(W): language model
- The language and pronunciation models are inferred from prior knowledge
- Other models are learned from data (how?)
Slide 30: Parts of an ASR System
(Same diagram as slide 14, with the three models labeled on it: P(X | Q) for acoustic modeling, P(Q | W) for the pronunciation/transition model, and P(W) for language modeling; SEARCH combines them and outputs "The cat chased the dog".)
Slide 31: EM for ASR: The Forward-Backward Algorithm
- Determine state occupancy probabilities
- i.e., assign each data vector to a state
- Calculate new transition probabilities and new means and standard deviations (emission probabilities) using those assignments
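A minimal sketch of the forward-backward computation of state occupancies gamma[t, j] = P(q_t = j | X), assuming an initial distribution pi, a transition matrix A, and a matrix B of per-frame emission likelihoods (illustrative names, not from the slides). Re-estimating means and variances would then weight each frame by these occupancies. No rescaling is done here, so it is only suitable for short toy sequences:

    import numpy as np

    def forward_backward(pi, A, B):
        # pi: (S,) initial probabilities; A: (S, S) transitions;
        # B: (T, S) emission likelihoods P(x_t | state). Returns gamma: (T, S).
        T, S = B.shape
        alpha = np.zeros((T, S))
        beta = np.zeros((T, S))
        alpha[0] = pi * B[0]
        for t in range(1, T):                      # forward pass
            alpha[t] = (alpha[t - 1] @ A) * B[t]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):             # backward pass
            beta[t] = A @ (B[t + 1] * beta[t + 1])
        gamma = alpha * beta
        return gamma / gamma.sum(axis=1, keepdims=True)

    # Toy example: 2 states, 3 frames of made-up likelihoods.
    g = forward_backward(np.array([0.6, 0.4]),
                         np.array([[0.7, 0.3], [0.4, 0.6]]),
                         np.array([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]]))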
Slide 32: ASR as Bayesian Inference
(Repeat of slide 28: the same diagram and the same derivation, ending in argmax_W max_Q P(X | Q) P(Q | W) P(W), shown again as the setup for the search problem.)
Slide 33: Search
- When trying to find W* = argmax_W P(W | X), we need to look at (in theory):
- All possible word sequences W
- All possible segmentations/alignments of W to X
- Generally, this is done by searching the space of W
- Viterbi search: a dynamic programming approach that looks for the most likely path (sketched below)
- A* search: an alternative method that keeps a stack of hypotheses around
- If W is large, pruning becomes important
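A minimal Viterbi sketch over an HMM, using the same illustrative pi / A / B conventions as the forward-backward sketch above; it returns the most likely state path rather than per-frame occupancies:

    import numpy as np

    def viterbi(pi, A, B):
        # pi: (S,) initial probabilities; A: (S, S) transitions;
        # B: (T, S) emission likelihoods. Returns the most likely state sequence.
        T, S = B.shape
        log_A = np.log(A + 1e-300)
        delta = np.log(pi + 1e-300) + np.log(B[0] + 1e-300)
        backptr = np.zeros((T, S), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + log_A            # best way to reach each state
            backptr[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + np.log(B[t] + 1e-300)
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):                  # trace the best path backwards
            path.append(int(backptr[t, path[-1]]))
        return path[::-1]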
Slide 34: How to Train an ASR System
- Have a speech corpus at hand
- It should have word (and preferably phone) transcriptions
- Divide it into training, development, and test sets
- Develop models of prior knowledge:
- Pronunciation dictionary
- Grammar
- Train the acoustic models
- Possibly realigning the corpus phonetically
Slide 35: How to Train an ASR System
- Test on your development data (baseline)
- Think real hard
- Figure out some neat new modification
- Retrain the system component
- Test on your development data
- Lather, rinse, repeat
- Then, at the end of the project, test on the test data.
Slide 36: Judging the Quality of a System
- Usually, ASR performance is judged by the word error rate
- Error Rate = 100 * (Subs + Ins + Dels) / N_words
- Example:
  REF:  I   WANT  TO   GO  HOME  ***
  REC:  **  WANT  TWO  GO  HOME  NOW
  SC:   D   C     S    C   C     I
- 100 * (1 Sub + 1 Ins + 1 Del) / 5 = 60%
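Word error rate is computed from an edit-distance alignment between the reference and the recognized word strings. A minimal sketch that finds the minimum number of substitutions, insertions, and deletions by dynamic programming:

    def word_error_rate(ref, hyp):
        # ref, hyp: lists of words. Returns WER in percent.
        R, H = len(ref), len(hyp)
        d = [[0] * (H + 1) for _ in range(R + 1)]   # d[i][j] = edits between ref[:i], hyp[:j]
        for i in range(R + 1):
            d[i][0] = i                             # deletions
        for j in range(H + 1):
            d[0][j] = j                             # insertions
        for i in range(1, R + 1):
            for j in range(1, H + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return 100.0 * d[R][H] / R

    # The slide's example: 1 substitution + 1 insertion + 1 deletion over 5 reference words.
    print(word_error_rate("I WANT TO GO HOME".split(), "WANT TWO GO HOME NOW".split()))  # 60.0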
Slide 37: Judging the Quality of a System
- Usually, ASR performance is judged by the word error rate
- This assumes that all errors are equal
- There is also a bit of a mismatch between the optimization criterion and the error measurement
- Other (task-specific) measures are sometimes used:
- Task completion
- Concept error rate