Title: Fast Inference and Learning in Large-State-Space HMMs
Slide 1: Fast Inference and Learning in Large-State-Space HMMs
- Sajid M. Siddiqi
- Andrew W. Moore
- The Auton Lab
- Carnegie Mellon University
Slide 3: [Photos: Sajid Siddiqi happy; Sajid Siddiqi discontented]
Slide 4: Hidden Markov Models
[Figure: state-transition diagram; edge labels such as 1/3 and 1]
Slide 5: Hidden Markov Models

Transition model: row i gives P(q_{t+1} = s_j | q_t = s_i) = a_ij for j = 1..N.

    i    P(q_{t+1}=s_1|q_t=s_i)  P(q_{t+1}=s_2|q_t=s_i)  ...  P(q_{t+1}=s_j|q_t=s_i)  ...  P(q_{t+1}=s_N|q_t=s_i)
    1    a_11    a_12    ...    a_1j    ...    a_1N
    2    a_21    a_22    ...    a_2j    ...    a_2N
    3    a_31    a_32    ...    a_3j    ...    a_3N
    i    a_i1    a_i2    ...    a_ij    ...    a_iN
    N    a_N1    a_N2    ...    a_Nj    ...    a_NN

Each of these probability tables is identical at every timestep.
Slide 6: Observation Model
[Figure: graphical model with hidden states emitting observations O_0, O_1, O_2, O_3, O_4]
Slide 7: Observation Model

Notation: b_i(k) = P(O_t = k | q_t = s_i), for observation symbols k = 1..M.

    i    P(O_t=1|q_t=s_i)  P(O_t=2|q_t=s_i)  ...  P(O_t=k|q_t=s_i)  ...  P(O_t=M|q_t=s_i)
    1    b_1(1)    b_1(2)    ...    b_1(k)    ...    b_1(M)
    2    b_2(1)    b_2(2)    ...    b_2(k)    ...    b_2(M)
    3    b_3(1)    b_3(2)    ...    b_3(k)    ...    b_3(M)
    i    b_i(1)    b_i(2)    ...    b_i(k)    ...    b_i(M)
    N    b_N(1)    b_N(2)    ...    b_N(k)    ...    b_N(M)

[Figure: as on slide 6, with observations O_0 ... O_4]
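To ground the notation, here is a minimal sketch (the numeric values are illustrative, not from the talk) of the two parameter tables for a small discrete-observation HMM in Python/NumPy:

```python
import numpy as np

N, M = 3, 4  # N hidden states, M discrete observation symbols

# Transition table: A[i, j] = P(q_{t+1} = s_j | q_t = s_i); rows sum to 1.
A = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])

# Observation table: B[i, k] = b_i(k) = P(O_t = k | q_t = s_i); rows sum to 1.
B = np.array([[0.50, 0.20, 0.20, 0.10],
              [0.10, 0.10, 0.40, 0.40],
              [0.25, 0.25, 0.25, 0.25]])

assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)
```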
Slides 8-16: Some Famous HMM Tasks
- Question 1: State Estimation
  - What is P(q_T = s_i | O_1 O_2 ... O_T)?
- Question 2: Most Probable Path
  - Given O_1 O_2 ... O_T, what is the most probable path that I took?
  - Example: woke up at 8:35, got on the bus at 9:46, sat in lecture 10:05-11:22.
- Question 3: Learning HMMs
  - Given O_1 O_2 ... O_T, what is the maximum-likelihood HMM that could have produced this string of observations?

[Figure: three-state HMM with states Eat, Bus, and Walk; transition probabilities a_AA, a_AB, a_BA, a_BB, a_BC, a_CB, a_CC; emission probabilities b_A(O_{t-1}), b_B(O_t), b_C(O_{t+1})]
Slides 17-18: Basic Operations in HMMs
- For an observation sequence O = O_1 ... O_T, the three basic HMM operations are:

    Problem      Task                                    Algorithm           Complexity
    Evaluation   Calculating P(O | λ)                    Forward-Backward    O(TN²)
    Inference    Computing Q* = argmax_Q P(O, Q | λ)     Viterbi Decoding    O(TN²)
    Learning     Computing λ* = argmax_λ P(O | λ)        Baum-Welch (EM)     O(TN²)

(T timesteps, N states)

This talk: a simple approach to reducing the complexity in N.
Slide 19: Reducing the Quadratic Penalty in N
- Why does it matter?
  - Quadratic HMM algorithms hinder HMM computations when N is large
  - There are several promising applications for efficient large-state-space HMM algorithms in:
    - biological sequence analysis
    - speech recognition
    - real-time HMM systems, such as for activity monitoring
Slides 20-23: Idea One: a Sparse Transition Matrix
- Only K << N non-zero next-state probabilities per row
- The basic operations then cost only O(TNK)
- But it can get very badly confused by impossible transitions
- And it cannot learn the sparse structure (once chosen, the structure cannot change)
Slide 24: Dense-Mostly-Constant (DMC) Transitions
- K non-constant probabilities per row
- DMC HMMs comprise a richer and more expressive class of models than sparse HMMs

[Figure: a DMC transition matrix with K = 2]
Slide 25: Dense-Mostly-Constant Transitions
- The transition model for state s_i now comprises (see the sketch after this list):
  - NC_i = { j : s_i → s_j is a non-constant transition probability }
  - c_i = the constant transition probability from s_i to every state not in NC_i
  - a_ij = the non-constant transition probability for s_i → s_j, j ∈ NC_i
- Example: NC_3 = {2, 5}, c_3 = 0.05, a_32 = 0.25, a_35 = 0.6
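A small data-structure sketch of one DMC row (the class and field names are mine, not the paper's). The slide's example row is consistent with N = 5 states, which the assertion checks:

```python
import numpy as np

class DMCRow:
    """Transition model for one state s_i under the DMC assumption:
    K explicit (non-constant) probabilities plus one shared constant
    c_i for every remaining destination state."""
    def __init__(self, nc_indices, nc_probs, c, N):
        self.nc = dict(zip(nc_indices, nc_probs))  # NC_i: j -> a_ij
        self.c = c                                 # constant prob c_i
        # The row must be a valid distribution over all N states.
        assert np.isclose(sum(nc_probs) + (N - len(nc_indices)) * c, 1.0)

    def prob(self, j):
        # a_ij if j is a non-constant destination, otherwise c_i
        return self.nc.get(j, self.c)

# The slide's example (states numbered 1..N, N = 5):
row3 = DMCRow(nc_indices=[2, 5], nc_probs=[0.25, 0.60], c=0.05, N=5)
assert row3.prob(5) == 0.60 and row3.prob(1) == 0.05
```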
Slides 26-31: HMM Filtering
[Animation: the forward-variable trellis, a table with rows t = 1, 2, ..., T and columns α_t(1), α_t(2), α_t(3), ..., α_t(N), filled in one timestep at a time]
Slides 32-33: Fast Evaluation in DMC HMMs

The forward recursion is

    α_{t+1}(j) = b_j(O_{t+1}) Σ_i α_t(i) a_ij

and under the DMC assumption the sum splits as

    Σ_i α_t(i) a_ij = Σ_i α_t(i) c_i + Σ_{i : j ∈ NC_i} α_t(i) (a_ij − c_i)

- The first sum is O(N), but is common to all j at each timestep t
- The second sum is O(K) for each α_t(j)
- This yields O(TNK) complexity for the evaluation problem (see the sketch below).
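To make the O(TNK) recursion concrete, here is a minimal Python/NumPy sketch, assuming the DMC model is stored as a length-N array c of constants plus a list nc of per-row {j: a_ij} dicts (0-indexed; the names are mine, not the paper's). It omits the per-timestep rescaling any practical forward pass needs:

```python
import numpy as np

def dmc_forward(c, nc, B, obs, pi):
    """Forward pass in O(T*N*K) under the DMC assumption.

    c   : length-N array, constant transition prob c_i for each row
    nc  : list over i of {j: a_ij} dicts (the non-constant entries NC_i)
    B   : N x M observation table, B[j, k] = b_j(k)
    obs : length-T sequence of observation symbols
    pi  : length-N initial state distribution
    Returns the (unscaled) forward variables alpha as a T x N array.
    """
    N, T = len(c), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(T - 1):
        common = alpha[t] @ c            # O(N), shared by every j
        nxt = np.full(N, common)
        for i, row in enumerate(nc):     # O(N*K) corrections in total
            for j, a_ij in row.items():
                nxt[j] += alpha[t, i] * (a_ij - c[i])
        alpha[t + 1] = B[:, obs[t + 1]] * nxt
    return alpha
```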
Slide 34: Fast Inference in DMC HMMs
- The Viterbi algorithm uses dynamic programming to calculate the globally optimal state sequence Q* = argmax_Q P(Q, O | λ).
- Define δ_t(i) = max_{q_1 ... q_{t-1}} P(q_1 ... q_{t-1}, q_t = s_i, O_1 ... O_t | λ).
- The δ variables can be computed in O(TN²) time, with the O(N) inductive step

    δ_{t+1}(j) = b_j(O_{t+1}) max_i δ_t(i) a_ij

- Under the DMC assumption, this step can be carried out in O(K) time (a sketch follows).
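The same constant/non-constant split applies to the max. Below is my own sketch of one inductive step under the slide's assumptions; for brevity it sorts the values δ_t(i)·c_i (O(N log N) per step, rather than the O(K)-per-entry cache the slide's bound implies) and omits the backpointers needed to recover the path:

```python
import numpy as np

def dmc_viterbi_step(delta_t, c, nc, excluders, b_next):
    """One Viterbi induction step under the DMC assumption.

    delta_t   : length-N array of delta_t(i)
    c, nc     : as in dmc_forward
    excluders : list over j of the set {i : j in NC_i}
    b_next    : length-N array, b_j(O_{t+1}) for each j
    """
    N = len(delta_t)
    v = delta_t * c                 # candidate scores via constant rows
    order = np.argsort(-v)
    delta_next = np.zeros(N)
    for j in range(N):
        # Best predecessor among rows that reach j with the constant c_i:
        best = 0.0
        for i in order:
            if i not in excluders[j]:
                best = v[i]
                break
        # Compare against the explicit (non-constant) entries into j:
        for i in excluders[j]:
            best = max(best, delta_t[i] * nc[i][j])
        delta_next[j] = b_next[j] * best
    return delta_next
```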
Slides 35-40: Learning a DMC HMM
- Idea One:
  - Ask the user to tell us the DMC structure
  - Learn the parameters using EM
  - Simple, but in general we don't know the DMC structure
- Idea Two: use EM to learn the DMC structure too (outlined in the sketch after this list):
  1. Guess a DMC structure
  2. Find expected transition counts and observation parameters, given the current model and observations
  3. Find the maximum-likelihood DMC model given the counts
  4. Go to 2
- The DMC structure can (and does) change between iterations!
- In fact, just start with an all-constant transition model
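In outline, the loop might look like the following sketch, where init_all_constant_model, e_step_counts, and dmc_m_step are hypothetical helpers standing in for the routines described above:

```python
def learn_dmc_hmm(obs, N, K, n_iters=50):
    """Outline of the structure-learning EM loop (a sketch only:
    the three helpers below are hypothetical stand-ins)."""
    model = init_all_constant_model(N, K)     # step 1: all-constant start
    for _ in range(n_iters):
        # step 2: expected transition counts + observation statistics
        S, obs_stats = e_step_counts(model, obs)
        # step 3: ML DMC model; each row keeps its K largest expected
        # counts as NC_i and pools the rest into the constant c_i,
        # so the DMC structure can change between iterations
        model = dmc_m_step(S, obs_stats, K)
    return model                              # step 4 is the loop itself
```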
Slide 41: Learning a DMC HMM
- Find expected transition counts and observation parameters, given the current model and observations
Slides 42-45: We want a new estimate of a_ij:

    â_ij = (expected # of transitions s_i → s_j) / (expected # of transitions out of s_i)

where the expected count of s_i → s_j transitions is

    S_ij = a_ij Σ_{t=1}^{T-1} α_t(i) b_j(O_{t+1}) β_{t+1}(j)
Slides 46-48: The forward and backward variables α and β are T × N matrices; each can be computed in O(TN) time.
[Figure: the α and β matrices, each T × N]
Slides 49-53: We want S_ij = a_ij Σ_t α_t(i) r_t(j), where

    r_t(j) = b_j(O_{t+1}) β_{t+1}(j)

so that, up to the factor a_ij, S_ij is the dot product of column i of α with column j of r; e.g., entry S_24 comes from columns α_2 and r_4. Computing all N² column dot products of the T × N matrices α and r takes O(TN²). A naive NumPy version appears below.

[Figure: the T × N matrices α and r, with columns 2 and 4 highlighted and their dot product feeding entry S_24 of the N × N matrix S]
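In matrix form the whole table of counts is one product; a naive NumPy version of the O(TN²) baseline the next slides attack might look like:

```python
import numpy as np

def expected_transition_counts_naive(alpha, r, A):
    """S[i, j] = a_ij * sum_t alpha[t, i] * r[t, j]  --  O(T N^2).
    alpha and r are T x N; A is the N x N transition matrix."""
    return A * (alpha.T @ r)
```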
Slides 54-58: Speedups considered for this O(TN²) bottleneck:
- Strassen
- Approximate α by DMC
- Approximate randomized AᵀB
- Sparse structure: fine?
- Fixed DMC: fine?
- Goal: a speedup without approximation
Slide 59: Two insights
- Insight One: we only need the top K entries in each row of S
- Insight Two: the values in the rows of α and r are often very skewed

[Figure: the T × N matrices α and r]
Slides 60-65: Precached index structures (see the sketch after this list)
- For i = 1..N, store the indexes of the R largest values in the ith column of α: "a-biggies(i)"
- For j = 1..N, store the indexes of the R largest values in the jth column of r: "r-biggies(j)"
- R << T: it takes O(TN) time to build all the indexes
- The Rth largest value in the ith column of α: O(1) time to obtain
- The Rth largest value in the jth column of r: O(1) time to obtain (precached for all j in time O(TN))
- With these caches, bounding an entry is an O(R) computation
- There's an important detail I'm omitting here to do with prescaling the rows of α and r.
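A sketch of how the "biggies" caches might be built with NumPy (argpartition selects the top R per column without a full sort; the prescaling detail mentioned above is ignored):

```python
import numpy as np

def top_R_indexes(mat, R):
    """For each column of the T x N matrix mat, return the row indexes
    of its R largest entries. argpartition takes O(T) per column, so
    building all N caches takes O(TN), since R << T."""
    return np.argpartition(-mat, R - 1, axis=0)[:R]   # shape (R, N)

# a_biggies[:, i] would index the R largest entries of column i of alpha;
# r_biggies[:, j] likewise for column j of r.
```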
Slides 66-70: Computing the ith row of S
- In O(NR) time, we can put upper and lower bounds on S_ij for j = 1, 2, ..., N
- We only need exact values of S_ij for the K largest values within the row
- Ignore the j's that can't be in the top K
- Be exact for the rest: O(T) time each
- If there's enough pruning, the total time is O(TN + RN²) (a sketch of the bound computation follows)

[Figure: bar chart of upper and lower bounds on S_ij across j = 1, 2, 3, ..., N]
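One way to realize the O(R) bound per entry, under the assumptions above (nonnegative α and r, the caches from the previous sketch, prescaling ignored): sum the terms indexed by either cache exactly; every remaining term is nonnegative and at most the product of the two columns' Rth-largest values.

```python
import numpy as np

def bounds_on_S_ij(alpha_col, r_col, a_ij, ai_idx, rj_idx, ai_Rth, rj_Rth):
    """O(R) lower/upper bounds on S_ij = a_ij * sum_t alpha_col[t]*r_col[t].

    ai_idx, rj_idx : cached indexes of the R largest entries of column i
                     of alpha and column j of r (previous sketch)
    ai_Rth, rj_Rth : those columns' Rth-largest values (precached, O(1))
    """
    idx = np.union1d(ai_idx, rj_idx)              # at most 2R timesteps
    lower = a_ij * float(alpha_col[idx] @ r_col[idx])
    # Every term outside idx is >= 0 and <= ai_Rth * rj_Rth, since its
    # alpha entry is outside the top R of its column, and likewise for r:
    upper = lower + a_ij * (len(alpha_col) - len(idx)) * ai_Rth * rj_Rth
    return lower, upper
```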
Slide 71: Evaluation and Inference Speedup
Dataset: synthetic data with T = 2000 timesteps
[Plot not transcribed]

Slide 72: Parameter Learning Speedup
Dataset: synthetic data with T = 2000 timesteps
[Plot not transcribed]
Slide 73: Performance Experiments
- DMC-friendly dataset: a 2-D Gaussian 20-state DMC HMM with K = 5 (20,000 train, 5,000 test)
- Anti-DMC dataset: a 2-D Gaussian 20-state regular HMM with steadily varying, well-distributed transition probabilities (20,000 train, 5,000 test)
- Motionlogger dataset: accelerometer data from two sensors worn over several days (10,000 train, 4,720 test)
- Models compared:
  - Regular and DMC HMMs with 20 states
  - Small HMM: a 5-state regular HMM
  - Uniform HMM: a 20-state HMM with uniform transition probabilities
Slides 74-80: Learning Curves for DMC-friendly data
[Plots not transcribed]
Slides 81-87: Learning Curves for Anti-DMC data
[Plots not transcribed]
Slides 88-94: Learning Curves for Motionlogger data
[Plots not transcribed]
Slide 95: Tradeoffs between N and K
- We vary N and K while keeping the number of transition parameters (N × K) constant
- Increasing N while decreasing K allows more states for modeling data features, but fewer parameters per state for temporal structure
Slide 96: Tradeoffs between N and K
- Average test-set log-likelihoods at convergence
- Datasets:
  - A: DMC-friendly
  - B: Anti-DMC
  - C: Motionlogger
- Each dataset has a different optimal N-vs-K tradeoff
Slide 97: Regularization with DMC HMMs
- # of transition parameters in a regular 100-state HMM: 100² = 10,000
- # of transition parameters in a DMC 100-state HMM with K = 5: 100 × 5 = 500
Slide 98: Conclusions
- DMC HMMs are an important class of models that allow parameterized complexity-vs-efficiency tradeoffs in large state spaces
- The speedup can be several orders of magnitude
- Even for non-DMC domains, DMC HMMs yield higher scores than baseline models
- The DMC HMM model can be applied to arbitrary state spaces and observation densities
Slide 99: Related Work
- Felzenszwalb et al. (2003): fast HMM algorithms when transition probabilities can be expressed as distances in an underlying parameter space
- Murphy and Paskin (2002): fast inference in hierarchical HMMs cast as DBNs
- Salakhutdinov et al. (2003): combined EM and conjugate gradient for faster HMM learning when the amount of missing information is high
- Beam search: a widely used heuristic in word recognition for speech systems
Slides 100-101: Future Work
- Investigate DMC HMMs as a regularization mechanism
- Eliminate the R parameter using an automatic backoff evaluation approach
- Devise ways to set the K parameter automatically, and allow per-row K parameters

The End