1. CSE 552
- Hidden Markov Models for Speech Recognition
- Spring, 2004
- Oregon Health & Science University
- OGI School of Science & Engineering
- John-Paul Hosom
- Lecture Notes for May 5
- Gamma, Xi, and the Forward-Backward Algorithm
2. Review: α and β
- Define the variable α, which has the meaning of the probability of observations o1 through ot and being in state i at time t, given our HMM:
  αt(i) = P(o1 o2 … ot, qt = i | λ)
- Compute α and P(O | λ) with the following procedure:
  Initialization:  α1(i) = πi bi(o1),  1 ≤ i ≤ N
  Induction:  αt+1(j) = [Σi αt(i) aij] bj(ot+1),  1 ≤ t ≤ T−1,  1 ≤ j ≤ N
  Termination:  P(O | λ) = Σi αT(i)
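For reference, a minimal sketch of this forward procedure in Python (not from the original slides); the layout of A, B, and pi, with B holding precomputed bj(ot) values, is an assumption:

```python
def forward(A, B, pi):
    """Forward procedure. A[i][j] = a_ij, B[t][j] = b_j(o_t), pi[j] = initial prob.
    Returns (alpha, P(O|lambda))."""
    T, N = len(B), len(pi)
    alpha = [[0.0] * N for _ in range(T)]
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    for i in range(N):
        alpha[0][i] = pi[i] * B[0][i]
    # Induction: alpha_{t+1}(j) = [sum_i alpha_t(i) * a_ij] * b_j(o_{t+1})
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[t][j]
    # Termination: P(O|lambda) = sum_i alpha_T(i)
    return alpha, sum(alpha[T-1])
```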
3. Review: α and β
- In the same way that we defined α, we can define β.
- Define the variable β, which has the meaning of the probability of observations ot+1 through oT, given that we're in state i at time t, and given our HMM:
  βt(i) = P(ot+1 ot+2 … oT | qt = i, λ)
- Compute β with the following procedure:
  Initialization:  βT(i) = 1,  1 ≤ i ≤ N
    where the value of 1 is chosen arbitrarily (but won't affect results)
  Induction:  βt(i) = Σj aij bj(ot+1) βt+1(j),  t = T−1, …, 1,  1 ≤ i ≤ N
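A matching sketch of the backward procedure under the same assumed layout:

```python
def backward(A, B):
    """Backward procedure. A[i][j] = a_ij, B[t][j] = b_j(o_t). Returns beta."""
    T, N = len(B), len(A)
    beta = [[0.0] * N for _ in range(T)]
    # Initialization: beta_T(i) = 1 (arbitrary; does not affect results)
    for i in range(N):
        beta[T-1][i] = 1.0
    # Induction: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[t+1][j] * beta[t+1][j] for j in range(N))
    return beta
```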
4. Forward Procedure Algorithm Example
  [HMM diagram: two states, h and ay, with emission plots; the values read off the plots for the observations below are bh(0.8) = 0.55, bay(0.8) = 0.15, bh(0.2) = 0.20, bay(0.2) = 0.65.]
- Observed features:  o1 = 0.8,  o2 = 0.8,  o3 = 0.2
  α1(h) = 0.55                                        α1(ay) = 0.0
  α2(h) = (0.55·0.3 + 0.0·0.0) · 0.55 = 0.09075       α2(ay) = (0.55·0.7 + 0.0·0.4) · 0.15 = 0.05775
  α3(h) = (0.09075·0.3 + 0.05775·0.0) · 0.20 = 0.0054
  α3(ay) = (0.09075·0.7 + 0.05775·0.4) · 0.65 = 0.0563
  Σi α3(i) = 0.0617
5. Backward Procedure Algorithm Example
  β3(h) = 1.0                                          β3(ay) = 1.0
  β2(h) = 0.3·0.20·1.0 + 0.7·0.65·1.0 = 0.515          β2(ay) = 0.0·0.20·1.0 + 0.4·0.65·1.0 = 0.260
  β1(h) = 0.3·0.55·0.515 + 0.7·0.15·0.260 = 0.1123     β1(ay) = 0.0·0.55·0.515 + 0.4·0.15·0.260 = 0.0156
  β0(·) = 1.0·0.55·0.1123 + 0.0·0.15·0.0156 = 0.0618
  β0(·) ≈ Σi α3(i) = P(O | λ)
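The two worked examples above can be reproduced with the forward and backward sketches from the review slides; the parameter values below are the ones used in the hand calculations (π and the transition probabilities are read off the computations, the b values off the slide's diagram):

```python
# 2-state h/ay example from the slides (states indexed h = 0, ay = 1)
pi = [1.0, 0.0]
A  = [[0.3, 0.7],   # from h:  a_h,h = 0.3,  a_h,ay = 0.7
      [0.0, 0.4]]   # from ay: a_ay,h = 0.0, a_ay,ay = 0.4
# b_j(o_t) for o = 0.8, 0.8, 0.2 (values read off the emission plots)
B  = [[0.55, 0.15],
      [0.55, 0.15],
      [0.20, 0.65]]

alpha, p = forward(A, B, pi)
beta = backward(A, B)
print(p)          # ~0.0617 = P(O|lambda)
print(alpha[2])   # ~[0.0054, 0.0563]
print(beta[0])    # ~[0.1123, 0.0156]
print(sum(pi[i] * B[0][i] * beta[0][i] for i in range(2)))   # ~0.0618 = beta_0
```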
6. Probability of Gamma
- Now we can define γ, the probability of being in state i at time t, given an observation sequence and HMM:
  γt(i) = P(qt = i | O, λ)
- Also, αt(i) βt(i) = P(o1 … ot, qt = i | λ) · P(ot+1 … oT | qt = i, λ) = P(O, qt = i | λ), so
  γt(i) = αt(i) βt(i) / P(O | λ) = αt(i) βt(i) / Σj αt(j) βt(j)
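A minimal sketch of computing γ from α and β, assuming the same list-of-lists layout as in the earlier sketches:

```python
def gamma(alpha, beta):
    """gamma_t(i) = alpha_t(i)*beta_t(i) / sum_j alpha_t(j)*beta_t(j)."""
    g = []
    for a_t, b_t in zip(alpha, beta):
        prod = [a * b for a, b in zip(a_t, b_t)]
        total = sum(prod)               # equals P(O|lambda) at every t
        g.append([p / total for p in prod])
    return g
```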
7. Probability of Gamma: Illustration
- Illustration: what is the probability of being in state 2 at time 2?
  [Trellis diagram: states 1, 2, and 3 across times 1 to 3, with observation probabilities bj(ot) at each state and time, and transition probabilities a12, a21, a22, a23, a32 on the arcs into and out of state 2.]
8. Gamma Example
- Given this 3-state HMM and set of 4 observations, what is the probability of being in state A at time 2?
  [HMM diagram: three states A, B, and C; the transition, initial, and emission values (0.2, 0.3, 1.0, 0.8, 0.7, 1.0, 1.0, 0.0, 0.0, 1.0) are shown in the figure.]
  O = 0.2, 0.3, 0.4, 0.5
9. Gamma Example
1. Compute forward probabilities up to time 2.
10. Gamma Example
2. Compute backward probabilities for times 4, 3, 2.
11. Gamma Example
3. Compute γ2(A).
12. Xi
- We can define one more variable: ξ is the probability of being in state i at time t, and in state j at time t+1, given the observations and HMM:
  ξt(i,j) = P(qt = i, qt+1 = j | O, λ)
- We can specify ξ as follows:
  ξt(i,j) = αt(i) aij bj(ot+1) βt+1(j) / P(O | λ)
          = αt(i) aij bj(ot+1) βt+1(j) / Σi Σj αt(i) aij bj(ot+1) βt+1(j)
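A sketch of computing ξ under the same assumed layout (A[i][j] = aij, B[t][j] = bj(ot)):

```python
def xi(alpha, beta, A, B):
    """xi_t(i,j) = alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j) / P(O|lambda),
    for t = 1 .. T-1."""
    T, N = len(B), len(A)
    p_obs = sum(alpha[T-1])             # P(O|lambda) from the forward termination step
    x = []
    for t in range(T - 1):
        x.append([[alpha[t][i] * A[i][j] * B[t+1][j] * beta[t+1][j] / p_obs
                   for j in range(N)] for i in range(N)])
    return x
```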
13. Xi Diagram
- This diagram illustrates ξ:
  [Trellis diagram: states 1, 2, and 3 across times t−1, t, t+1, t+2, with observation probabilities bj(ot) at each node and transition probabilities aij on the arcs; the highlighted terms α2(1), a12·b2(o3), and β3(2) illustrate the product that defines ξ2(1,2).]
14. Xi Example 1
- Given the same HMM and observations as before, what is ξ2(A,B)?
15. Xi Example 2
- Given this 3-state HMM and set of 4 observations, what is the expected number of transitions from B to C?
  [HMM diagram: the same three-state HMM (states A, B, C) and values as in the Gamma example.]
  O = 0.2, 0.3, 0.4, 0.5
16. Xi Example 2
- The expected number of transitions from B to C is the sum of ξt(B,C) over t = 1 … T−1.
17. Xi
- We can also specify γ in terms of ξ:
  γt(i) = Σj ξt(i,j)
18. How Do We Improve Estimates of HMM Parameters?
- With the Expectation-Maximization algorithm, also known as the Baum-Welch method.
- In this case, we can use the following re-estimation formulae:
  π̄i = expected frequency in state i at time t = 1 = γ1(i)
  āij = expected number of transitions from state i to state j / expected number of transitions from state i
      = Σt=1…T−1 ξt(i,j) / Σt=1…T−1 γt(i)
19. How Do We Improve Estimates of HMM Parameters?
- After computing new model parameters, we
maximize by substituting the new parameter
values in place of the old parameter values
and repeat.
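To make the re-estimation concrete, here is a sketch of the π and aij updates for a single observation sequence, taking the γ and ξ tables produced by the earlier sketches as input; this is the standard Baum-Welch update written out, not code from the course:

```python
def reestimate_pi_a(g, x, N):
    """pi_i <- gamma_1(i);  a_ij <- sum_t xi_t(i,j) / sum_t gamma_t(i), t = 1..T-1."""
    new_pi = list(g[0])
    new_A = [[0.0] * N for _ in range(N)]
    for i in range(N):
        denom = sum(g[t][i] for t in range(len(x)))         # expected transitions out of i
        for j in range(N):
            numer = sum(x[t][i][j] for t in range(len(x)))  # expected transitions i -> j
            new_A[i][j] = numer / denom if denom > 0 else 0.0
    return new_pi, new_A
```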
20. How Do We Improve Estimates of HMM Parameters?
- (j = state, k = mixture component!!)
  γt(j,k) = p(being in state j from component k)
          = [ αt(j) βt(j) / Σi αt(i) βt(i) ] · [ cjk N(ot; μjk, Σjk) / Σm cjm N(ot; μjm, Σjm) ]
  where the first factor is γt(j) = p(being in state j).
21. How Do We Improve Estimates of HMM Parameters?
  μ̄jk = Σt γt(j,k) ot / Σt γt(j,k)
      = expected value of ot based on existing λ
  Σ̄jk = Σt γt(j,k) (ot − μjk)(ot − μjk)ᵀ / Σt γt(j,k)
      = expected value of the diagonal of the covariance matrix based on existing λ
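A sketch of the mean and variance updates for a single scalar Gaussian per state, using γt(j) as the per-frame weight; with mixtures, γt(j) would simply be replaced by γt(j,k). Using the newly estimated mean inside the variance update is a common convention, not something specified on the slide:

```python
def reestimate_gaussian(g, obs, j):
    """mu_j  <- sum_t gamma_t(j) * o_t / sum_t gamma_t(j)
    var_j <- sum_t gamma_t(j) * (o_t - mu_j)^2 / sum_t gamma_t(j)
    obs is a list of scalar observations; g[t][j] is gamma_t(j)."""
    occ = sum(g[t][j] for t in range(len(obs)))                      # state occupancy
    mu = sum(g[t][j] * obs[t] for t in range(len(obs))) / occ        # weighted mean
    var = sum(g[t][j] * (obs[t] - mu) ** 2 for t in range(len(obs))) / occ
    return mu, var
```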
22. How Do We Improve Estimates of HMM Parameters?
- EM, called Baum-Welch, is also called the forward-backward algorithm.
- This process is guaranteed to converge monotonically to a maximum-likelihood estimate.
- There may be many local maxima; we can't guarantee the process will reach the globally best result.
23. Multiple Training Files
- So far, we've implicitly assumed a single set of observations for training. Most systems are trained on multiple sets of observations (files). This makes it necessary to use accumulators.
- Initialize:
    for each file:
      compute initial state boundaries (e.g. flat start)
      add information to accumulators
    compute average, standard deviation
- Update:
    for each iteration:
      reset accumulators
      for each file:
        add information to accumulators
      compute average, standard deviation
      update estimates
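A minimal sketch of the accumulator idea for one parameter (a state mean); the per_file_stats structure, holding each file's γ table and observation sequence, is an assumed interface:

```python
def reestimate_mean(per_file_stats, state_j):
    """per_file_stats: list of (gamma, obs) pairs, one per training file,
    where gamma[t][j] is gamma_t(j) for that file and obs is its observation list."""
    weighted_sum = 0.0   # accumulator: sum over all files of sum_t gamma_t(j) * o_t
    occupancy = 0.0      # accumulator: sum over all files of sum_t gamma_t(j)
    for g, obs in per_file_stats:
        for t in range(len(obs)):
            weighted_sum += g[t][state_j] * obs[t]
            occupancy += g[t][state_j]
    return weighted_sum / occupancy   # update once, after all files are accumulated
```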
24. Viterbi Search Project Notes
- Assume that any state can follow any other state; this will greatly simplify the implementation.
- Also assume that this is a whole-word recognizer, and that each word is recognized with a separate execution of the program. This will greatly simplify the implementation.
- Print out both the score for the utterance and the most likely state sequence from t = 1 to T.
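A sketch of the Viterbi search under the project's simplifying assumption that any state can follow any other state; working in the log domain is an implementation choice to avoid underflow, not something required by the notes:

```python
import math

def viterbi(A, B, pi):
    """A[i][j] = a_ij, B[t][j] = b_j(o_t), pi[j] = initial prob.
    Returns (best log score, most likely state sequence for t = 1..T)."""
    T, N = len(B), len(pi)
    log = lambda p: math.log(p) if p > 0 else float('-inf')
    delta = [[0.0] * N for _ in range(T)]   # best log score ending in state j at time t
    psi = [[0] * N for _ in range(T)]       # backpointer to the best previous state
    for j in range(N):
        delta[0][j] = log(pi[j]) + log(B[0][j])
    for t in range(1, T):
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[t-1][i] + log(A[i][j]))
            delta[t][j] = delta[t-1][best_i] + log(A[best_i][j]) + log(B[t][j])
            psi[t][j] = best_i
    # Backtrace the most likely state sequence
    last = max(range(N), key=lambda j: delta[T-1][j])
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return delta[T-1][last], path
```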
25. Viterbi Search Project Notes
- Does the Normal p.d.f. return probabilities?? Techniques from multivariate calculus must be used to show that the density integrates to 1:
  ∫ N(x; μ, σ) dx = 1,  where N(x; μ, σ) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²))   (Devore, p. 138)
- Examples:
  ot = 2.0, μ = 4.0, σ = 5.0:  N = 0.07365
  ot = 3.9, μ = 4.0, σ = 0.2:  N = 1.76032
- Conclusion: when σ is small and ot is near μ, N(ot; μ, σ) can exceed 1, so it yields likelihoods instead of probabilities.
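The two example values can be checked directly from the univariate normal density, e.g.:

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate normal density N(x; mu, sigma)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(normal_pdf(2.0, 4.0, 5.0))   # ~0.07365
print(normal_pdf(3.9, 4.0, 0.2))   # ~1.76032 (> 1: a likelihood/density, not a probability)
```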