Title: Robust Speech Recognition: Hidden Markov Models in Speech Recognition
1 Robust Speech Recognition: Hidden Markov Models in Speech Recognition
- Richard Stern
- Robust Speech Recognition Group
- Carnegie Mellon University
- Telephone (412) 268-2535
- Fax (412) 268-3890
- rms@cs.cmu.edu
- http://www.cs.cmu.edu/rms
- Short Course at UNAM
- August 14-17, 2007
2 Acknowledgements
- Much of this talk is derived from
- the paper "An Introduction to Hidden Markov Models" by Rabiner and Juang
- the talk "Hidden Markov Models for Continuous Speech Recognition" by Kai-Fu Lee
- notes compiled by Wayne Ward and Roni Rosenfeld
3 Topics
- Markov Models and Hidden Markov Models
- HMMs applied to speech recognition
- Training
- Decoding
- Note: in this talk we describe discrete HMMs (the simplest type); we will comment on more modern generalizations.
4 Intro: Hidden Markov Models (HMMs)
- The Hidden Markov Model is a doubly-stochastic process
- A random sequence of states
- Each state transition causes a random observation to be emitted
- The three classic HMM problems
- Computing the probability of the observations, given a model
- Finding the state sequence that best explains the observations, given a model
- Finding the model parameters that maximize the probability of the observations
5 Speech Recognition
[Block diagram: Analog Speech -> Front End -> Discrete Observations -> Match/Search -> Word Sequence]
6 ML Continuous Speech Recognition
- Goal
- Given acoustic data A = a1, a2, ..., ak
- Find the word sequence W = w1, w2, ..., wn
- Such that P(W | A) is maximized
- Bayes Rule: P(W | A) = P(A | W) P(W) / P(A)
- P(A | W) is the acoustic model (HMMs); P(W) is the language model
- P(A) is a constant for a complete sentence
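As a minimal sketch (not from the slides), the search for the best word sequence can be written in the log domain; the hypothesis list and scoring functions below are hypothetical placeholders, not part of any recognizer described here.

```python
def decode(hypotheses, acoustic_logprob, language_logprob):
    """Pick the word sequence W maximizing log P(A | W) + log P(W);
    P(A) is dropped because it is constant over hypotheses.
    hypotheses: iterable of candidate word sequences.
    acoustic_logprob(W): log P(A | W) from the HMM acoustic model.
    language_logprob(W): log P(W) from the language model."""
    return max(hypotheses,
               key=lambda W: acoustic_logprob(W) + language_logprob(W))
```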
7 Markov Models
- Elements: states, transition probabilities
- Markov assumption: the transition probability depends only on the current state
8 Single Fair Coin
- State 1: P(H) = 1.0, P(T) = 0.0
- State 2: P(H) = 0.0, P(T) = 1.0
- Outcome heads corresponds to State 1, tails to State 2
- The observation sequence uniquely defines the state sequence
9 Hidden Markov Models
- Elements
- States
- Transition probabilities
- Output probability distributions bj(k): the probability of emitting symbol k in state j
[Figure: output probability distribution, probability vs. observation]
10 Discrete Observation HMM
- State 1: P(R) = 0.31, P(B) = 0.50, P(Y) = 0.19
- State 2: P(R) = 0.50, P(B) = 0.25, P(Y) = 0.25
- State 3: P(R) = 0.38, P(B) = 0.12, P(Y) = 0.50
- An observation sequence such as R B Y Y R is not unique to a state sequence
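As a small illustration of how the output distributions above can be stored (the state ordering is taken from the list above; everything else here is an assumption for illustration):

```python
import numpy as np

SYMBOLS = {"R": 0, "B": 1, "Y": 2}
# rows = the three states in the order listed above, columns = symbols R, B, Y
B = np.array([[0.31, 0.50, 0.19],
              [0.50, 0.25, 0.25],
              [0.38, 0.12, 0.50]])

# every state can emit each symbol of R B Y Y R with nonzero probability,
# so the observation sequence does not pin down the state sequence
obs = [SYMBOLS[s] for s in "RBYYR"]
print(B[:, obs])   # per-state emission probability of each observation
```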
11 HMMs in Speech Recognition
- Represent speech as a sequence of observations
- Use an HMM to model some unit of speech (phone, word)
- Concatenate units into larger units
[Figure: phone model for "ih"; word model formed by concatenating phone models]
12 HMM Problems and Solutions
- Evaluation
- Problem: compute the probability of an observation sequence given a model
- Solution: Forward Algorithm and Viterbi Algorithm
- Decoding
- Problem: find the state sequence which maximizes the probability of the observation sequence
- Solution: Viterbi Algorithm
- Training
- Problem: adjust model parameters to maximize the probability of observed sequences
- Solution: Forward-Backward Algorithm
13 Evaluation
- The probability of the observation sequence O given the HMM model λ is
  P(O | λ) = Σ over all state sequences Q of P(O | Q, λ) P(Q | λ), where Q = q0 q1 ... qT
- Not practical, since the number of paths is on the order of N^T
- N = number of states in the model, T = number of observations in the sequence
14 The Forward Algorithm
- Compute α recursively:
  α0(j) = 1 if j is the start state, 0 otherwise
  αt(j) = [ Σ over i of αt-1(i) aij ] bj(ot)
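A minimal sketch of the forward recursion for a discrete HMM, assuming (as in the trellis on the next slide) that the output probability of the arrival state is applied on each transition; the variable names are mine.

```python
import numpy as np

def forward(A, B, obs, start_state=0):
    """A[i, j]: transition probability from state i to state j.
    B[j, k]: probability of emitting symbol k in state j.
    obs: observation sequence as symbol indices.
    Returns alpha[t, j] = P(o_1 .. o_t, q_t = j | model)."""
    N = A.shape[0]
    alpha = np.zeros((len(obs) + 1, N))
    alpha[0, start_state] = 1.0          # 1 if j is the start state, 0 otherwise
    for t, o in enumerate(obs, start=1):
        # sum over predecessor states, then apply the arrival state's output prob
        alpha[t] = (alpha[t - 1] @ A) * B[:, o]
    return alpha

# Two-state example from the trellis slides: symbols A = 0, B = 1
A = np.array([[0.6, 0.4],
              [0.0, 1.0]])
B = np.array([[0.8, 0.2],
              [0.3, 0.7]])
print(forward(A, B, [0, 0, 1]))          # rows t0..t3 of the forward trellis
```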
15 Forward Trellis
Two-state example; state 1 is the initial state, state 2 the final state; observation sequence A A B; the output probability of the arrival state is applied on each transition.
- Transition probabilities: a11 = 0.6, a12 = 0.4, a22 = 1.0
- Output probabilities: b1(A) = 0.8, b1(B) = 0.2, b2(A) = 0.3, b2(B) = 0.7
Forward probabilities αt(j):
            t0      t1      t2      t3
  state 1   1.0     0.48    0.23    0.03
  state 2   0.0     0.12    0.09    0.13
16 The Backward Algorithm
- Compute β recursively:
  βT(i) = 1 if i is the end state, 0 otherwise
  βt(i) = Σ over j = 1..N of aij bj(ot+1) βt+1(j)
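A matching sketch of the backward recursion, under the same conventions and naming as the forward sketch above.

```python
import numpy as np

def backward(A, B, obs, end_state=-1):
    """Returns beta[t, i] = P(o_{t+1} .. o_T | q_t = i, model),
    using the same A, B, and arrival-state convention as forward() above."""
    N = A.shape[0]
    beta = np.zeros((len(obs) + 1, N))
    beta[-1, end_state] = 1.0            # 1 if i is the end state, 0 otherwise
    for t in range(len(obs) - 1, -1, -1):
        # sum over successor states j of a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t]] * beta[t + 1])
    return beta
```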
17 Backward Trellis
Same two-state model and observation sequence A A B as the forward trellis.
Backward probabilities βt(i):
            t0      t1      t2      t3
  state 1   0.13    0.22    0.28    0.0
  state 2   0.06    0.21    0.7     1.0
18 The Viterbi Algorithm
- For decoding: find the state sequence Q which maximizes P(O, Q | λ)
- Similar to the Forward Algorithm, except MAX instead of SUM
- Recursive computation: Vt(j) = [ max over i of Vt-1(i) aij ] bj(ot)
- Save each maximum for the backtrace at the end
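A minimal sketch of the Viterbi recursion with backtrace, again under the same conventions as the forward sketch.

```python
import numpy as np

def viterbi(A, B, obs, start_state=0):
    """Most likely state sequence for the observation sequence obs."""
    N, T = A.shape[0], len(obs)
    V = np.zeros((T + 1, N))
    back = np.zeros((T + 1, N), dtype=int)
    V[0, start_state] = 1.0
    for t, o in enumerate(obs, start=1):
        scores = V[t - 1][:, None] * A        # score of each i -> j transition
        back[t] = scores.argmax(axis=0)       # save each maximum for the backtrace
        V[t] = scores.max(axis=0) * B[:, o]   # MAX instead of SUM
    path = [int(V[-1].argmax())]              # backtrace from the best final state
    for t in range(T, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(V[-1].max())
```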
19 Viterbi Trellis
Same two-state model and observation sequence A A B as the forward trellis; each cell holds the best single-path score Vt(j).
            t0      t1      t2      t3
  state 1   1.0     0.48    0.23    0.03
  state 2   0.0     0.12    0.06    0.06
20 Training HMM Parameters
- Train the parameters of the HMM
- Tune λ to maximize P(O | λ)
- No efficient algorithm for the global optimum
- An efficient iterative algorithm finds a local optimum
- Baum-Welch (Forward-Backward) re-estimation
- Compute probabilities using the current model λ
- Refine λ based on the computed values
- Use the α and β values from the Forward-Backward computation
21 Forward-Backward Algorithm
- ξt(i, j): the probability of transiting from state i to state j at time t, given O
  ξt(i, j) = P(qt = i, qt+1 = j | O, λ) = αt(i) aij bj(ot+1) βt+1(j) / P(O | λ)
22 Baum-Welch Re-estimation
23 Convergence of FB Algorithm
- 1. Initialize λ = (A, B)
- 2. Compute α, β, and γ
- 3. Estimate λ' = (A', B') from γ
- 4. Replace λ with λ'
- 5. If not converged, go to 2
- It can be shown that P(O | λ') > P(O | λ) unless λ' = λ
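A compact sketch of one Baum-Welch re-estimation pass for a single observation sequence, reusing the forward() and backward() sketches above; this is an illustrative reconstruction, not code from the course.

```python
import numpy as np

def baum_welch_step(A, B, obs, start_state=0, end_state=-1, eps=1e-12):
    """One re-estimation of (A, B) from a single observation sequence."""
    alpha = forward(A, B, obs, start_state)
    beta = backward(A, B, obs, end_state)
    total = alpha[-1] @ beta[-1]                     # P(O | model)
    # xi[t, i, j]: prob of transiting from i to j between times t and t+1
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * B[:, obs].T[:, None, :] * beta[1:, None, :]) / total
    gamma = alpha * beta / total                     # state occupancy per time
    A_new = xi.sum(axis=0) / (gamma[:-1].sum(axis=0)[:, None] + eps)
    B_new = np.zeros_like(B)
    for t, o in enumerate(obs, start=1):             # emissions occur at t = 1..T
        B_new[:, o] += gamma[t]
    B_new /= gamma[1:].sum(axis=0)[:, None] + eps
    return A_new, B_new
```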
24 HMMs in Speech Recognition
- Represent speech as a sequence of symbols
- Use an HMM to model some unit of speech (phone, word)
- Output probabilities: the probability of observing a symbol in a state
- Transition probabilities: the probability of staying in, or skipping, a state
[Figure: phone model]
25 Training HMMs for Continuous Speech
- Use only the orthographic transcription of the sentence
- No need for segmented or labeled data
- Concatenate phone models to give a word model
- Concatenate word models to give a sentence model
- Train the entire sentence model on the entire spoken sentence
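As a rough sketch of the concatenation idea (the 3-state phone topology and the 0.6/0.4 stay/exit probabilities are assumptions for illustration, not values from the course):

```python
import numpy as np

def linear_hmm(n_states, stay=0.6):
    """A left-to-right phone model: each state loops or moves to the next."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i], A[i, i + 1] = stay, 1.0 - stay
    A[-1, -1] = 1.0
    return A

def concatenate(models, exit_prob=0.4):
    """Chain left-to-right HMMs: the last state of one feeds the next model."""
    n = sum(m.shape[0] for m in models)
    A = np.zeros((n, n))
    offset = 0
    for m in models:
        k = m.shape[0]
        A[offset:offset + k, offset:offset + k] = m
        if offset + k < n:               # not the last model: add an exit arc
            A[offset + k - 1, offset + k - 1] = 1.0 - exit_prob
            A[offset + k - 1, offset + k] = exit_prob
        offset += k
    return A

word_model = concatenate([linear_hmm(3) for _ in range(3)])   # e.g. a 3-phone word
```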
26 Forward-Backward Training for Continuous Speech
[Figure: sentence model for "SHOW ALL ALERTS", formed by concatenating the phone models SH OW (SHOW), AA L (ALL), and AX L ER TS (ALERTS)]
27 Recognition Search
28 Viterbi Search
- Uses Viterbi decoding
- Takes MAX, not SUM
- Finds the optimal state sequence P(O, Q | λ), not the optimal word sequence P(O | λ)
- Time synchronous
- Extends all paths by one time step
- All paths have the same length (no need to normalize to compare scores)
29 Viterbi Search Algorithm
- 0. Create a state list with one cell for each state in the system
- 1. Initialize the state list with the initial states for time t = 0
- 2. Clear the state list for time t+1
- 3. Compute within-word transitions from time t to t+1
- If a new state is reached, update the score and BackPtr
- If a better score is found for a state, update the score and BackPtr
- 4. Compute between-word transitions at time t+1
- If a new state is reached, update the score and BackPtr
- If a better score is found for a state, update the score and BackPtr
- 5. If end of utterance, print the backtrace and quit
- 6. Else increment t and go to Step 2
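Below is a highly simplified sketch of one frame of the loop in steps 2 through 4; the dictionary-based state list, the "initial"/"final" state labels, and the helper callbacks are my own illustrative choices, not the structures used in the course's recognizer.

```python
def viterbi_search_step(active, within_word_arcs, between_word_arcs):
    """One frame of time-synchronous Viterbi search.
    active: dict mapping (word, state) -> (score, backptr) at time t.
    within_word_arcs(word, state): yields (next_state, log_prob) pairs,
        combining transition and output log probabilities.
    between_word_arcs(word): yields (next_word, log_bigram_prob) pairs.
    Returns the state list for time t+1."""
    new = {}                                         # step 2: clear list for t+1

    def relax(key, score, backptr):
        # create the cell if the state is new, else keep the better score
        if key not in new or score > new[key][0]:
            new[key] = (score, backptr)

    for (word, state), (score, _) in active.items():
        for next_state, logp in within_word_arcs(word, state):      # step 3
            relax((word, next_state), score + logp, (word, state))
        if state == "final":                                        # step 4
            for next_word, log_lm in between_word_arcs(word):
                relax((next_word, "initial"), score + log_lm, (word, state))
    return new
```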
30 Viterbi Search Algorithm
[Figure: search trellis from time t to time t+1, with states S1, S2, S3 for Word 1 and Word 2. Within-word transition score: OldProb(S1) x OutProb x TransProb. Between-word transition score: OldProb(S3) x P(W2 | W1). Each cell stores Score, BackPtr, ParmPtr.]
31 Continuous Density HMMs
- The model so far has assumed discrete observations: each observation in a sequence was one of a set of M discrete symbols
- Speech input must be vector quantized in order to provide discrete input
- VQ leads to quantization error
- The discrete probability density bj(k) can be replaced with the continuous probability density bj(x), where x is the observation vector
- Typically Gaussian densities are used
- A single Gaussian is not adequate, so a weighted sum of Gaussians is used to approximate the actual PDF
32 Mixture Density Functions
- bj(x) = Σ over m = 1..M of cjm N(x; μjm, Σjm) is the probability density function for state j
- x: observation vector
- M: number of mixtures (Gaussians)
- cjm: weight of mixture m in state j, where the weights for a state sum to 1
- N: Gaussian density function
- μjm: mean vector for mixture m, state j
- Σjm: covariance matrix for mixture m, state j
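A small sketch of bj(x) as a diagonal-covariance Gaussian mixture in plain numpy; the diagonal-covariance restriction is an assumption here (the full-covariance case follows the same pattern):

```python
import numpy as np

def log_gmm_density(x, weights, means, variances):
    """log b_j(x) for one state: a weighted sum of M diagonal Gaussians.
    weights: (M,) mixture weights summing to 1.
    means, variances: (M, D); x: (D,) observation vector."""
    diff = x - means                                             # (M, D)
    log_norm = -0.5 * np.log(2.0 * np.pi * variances).sum(axis=1)
    log_expo = -0.5 * ((diff ** 2) / variances).sum(axis=1)
    log_terms = np.log(weights) + log_norm + log_expo
    m = log_terms.max()                    # log-sum-exp over the M mixtures
    return m + np.log(np.exp(log_terms - m).sum())
```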
33 Summary
- We have (very briefly) reviewed the approaches to the major HMM problems of modeling and decoding
- Keep in mind
- The doubly-stochastic nature of the model
- The roles that the state transitions and the output densities play
34 (No transcript)
35 Viterbi Beam Search
- Viterbi Search
- All states enumerated
- Not practical for large grammars
- Most states are inactive at any given time
- Viterbi Beam Search: prune less likely paths
- States scoring worse than a threshold range below the best are pruned
- FROM and TO structures are created dynamically as lists of active states
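A minimal sketch of the pruning step, assuming log-domain scores and a fixed beam width; the names are placeholders:

```python
def prune_to_beam(states, beam_width):
    """states: dict mapping a state to its log score at the current frame.
    Keep only the states whose score is within beam_width of the best."""
    if not states:
        return {}
    best = max(states.values())
    return {s: score for s, score in states.items() if score >= best - beam_width}
```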
36 Viterbi Beam Search
[Figure: FROM beam at time t and TO beam at time t+1 for Word 1 and Word 2 (states S1, S2, S3). The FROM beam holds the states within a threshold of the best state; the TO beam is constructed dynamically. For each transition: is it within the threshold? does the state already exist in the TO beam? is the score better than the existing score in the TO beam?]
37 Discrete HMM vs. Continuous HMM
- Problems with discrete HMMs
- Quantization errors
- Codebook and HMMs are modeled separately
- Problems with continuous mixtures
- A small number of mixtures performs poorly
- A large number of mixtures increases the computation and the number of parameters to be estimated
- Continuous HMMs make more assumptions than discrete HMMs, especially with diagonal-covariance pdfs
- A discrete probability is a table lookup; continuous mixtures require many multiplications
38 Model Topologies
- Ergodic: fully connected; each state has a transition to every other state
- Left-to-Right: transitions only to states with a higher index than the current state. Inherently imposes temporal order. These are most often used for speech.
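A small sketch contrasting the two topologies as transition-matrix structures (the uniform probabilities are placeholders chosen only for illustration):

```python
import numpy as np

def ergodic(n):
    """Fully connected: every state can transition to every other state."""
    return np.full((n, n), 1.0 / n)

def left_to_right(n, max_skip=1):
    """Transitions only to states whose index is >= the current index."""
    A = np.zeros((n, n))
    for i in range(n):
        targets = range(i, min(n - 1, i + 1 + max_skip) + 1)
        for j in targets:
            A[i, j] = 1.0 / len(targets)
    return A
```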