Title: Fast Inference and Learning in Large-State-Space HMMs
Slide 1: Fast Inference and Learning in Large-State-Space HMMs
- Sajid M. Siddiqi
- Andrew W. Moore
- The Auton Lab
- Carnegie Mellon University
Slide 3: [Photos: Sajid Siddiqi happy; Sajid Siddiqi discontented]
Slide 4: Hidden Markov Models
[Figure: state-transition diagram; edge labels such as 1/3 and 1]
Slide 5: Hidden Markov Models

Transition model: row i gives P(q_{t+1} = s_j | q_t = s_i) = a_ij for j = 1..N.

    i    P(q_{t+1}=s_1|q_t=s_i)  P(q_{t+1}=s_2|q_t=s_i)  ...  P(q_{t+1}=s_j|q_t=s_i)  ...  P(q_{t+1}=s_N|q_t=s_i)
    1    a_11    a_12    ...    a_1j    ...    a_1N
    2    a_21    a_22    ...    a_2j    ...    a_2N
    3    a_31    a_32    ...    a_3j    ...    a_3N
    i    a_i1    a_i2    ...    a_ij    ...    a_iN
    N    a_N1    a_N2    ...    a_Nj    ...    a_NN

Each of these probability tables is identical at every timestep.
Slide 6: Observation Model
[Figure: graphical model with hidden states emitting observations O_0, O_1, O_2, O_3, O_4]
Slide 7: Observation Model

Notation: b_i(k) = P(O_t = k | q_t = s_i), for observation symbols k = 1..M.

    i    P(O_t=1|q_t=s_i)  P(O_t=2|q_t=s_i)  ...  P(O_t=k|q_t=s_i)  ...  P(O_t=M|q_t=s_i)
    1    b_1(1)    b_1(2)    ...    b_1(k)    ...    b_1(M)
    2    b_2(1)    b_2(2)    ...    b_2(k)    ...    b_2(M)
    3    b_3(1)    b_3(2)    ...    b_3(k)    ...    b_3(M)
    i    b_i(1)    b_i(2)    ...    b_i(k)    ...    b_i(M)
    N    b_N(1)    b_N(2)    ...    b_N(k)    ...    b_N(M)

[Figure: as on slide 6, with observations O_0 ... O_4]
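To ground the notation, here is a minimal sketch (the numeric values are illustrative, not from the talk) of the two parameter tables for a small discrete-observation HMM in Python/NumPy:

```python
import numpy as np

N, M = 3, 4  # N hidden states, M discrete observation symbols

# Transition table: A[i, j] = P(q_{t+1} = s_j | q_t = s_i); rows sum to 1.
A = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])

# Observation table: B[i, k] = b_i(k) = P(O_t = k | q_t = s_i); rows sum to 1.
B = np.array([[0.50, 0.20, 0.20, 0.10],
              [0.10, 0.10, 0.40, 0.40],
              [0.25, 0.25, 0.25, 0.25]])

assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)
```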
Slides 8-16: Some Famous HMM Tasks
- Question 1: State Estimation
  - What is P(q_T = s_i | O_1 O_2 ... O_T)?
- Question 2: Most Probable Path
  - Given O_1 O_2 ... O_T, what is the most probable path that I took?
  - Example: woke up at 8:35, got on the bus at 9:46, sat in lecture 10:05-11:22.
- Question 3: Learning HMMs
  - Given O_1 O_2 ... O_T, what is the maximum-likelihood HMM that could have produced this string of observations?

[Figure: three-state HMM with states Eat, Bus, and Walk; transition probabilities a_AA, a_AB, a_BA, a_BB, a_BC, a_CB, a_CC; emission probabilities b_A(O_{t-1}), b_B(O_t), b_C(O_{t+1})]
Slides 17-18: Basic Operations in HMMs
- For an observation sequence O = O_1 ... O_T, the three basic HMM operations are:

    Problem      Task                                    Algorithm           Complexity
    Evaluation   Calculating P(O | λ)                    Forward-Backward    O(TN²)
    Inference    Computing Q* = argmax_Q P(O, Q | λ)     Viterbi Decoding    O(TN²)
    Learning     Computing λ* = argmax_λ P(O | λ)        Baum-Welch (EM)     O(TN²)

(T timesteps, N states)

This talk: a simple approach to reducing the complexity in N.
Slide 19: Reducing the Quadratic Penalty in N
- Why does it matter?
  - Quadratic HMM algorithms hinder HMM computations when N is large
  - There are several promising applications for efficient large-state-space HMM algorithms in:
    - biological sequence analysis
    - speech recognition
    - real-time HMM systems, such as for activity monitoring
Slides 20-23: Idea One: a Sparse Transition Matrix
- Only K << N non-zero next-state probabilities per row
- The basic operations then cost only O(TNK)
- But it can get very badly confused by impossible transitions
- And it cannot learn the sparse structure (once chosen, the structure cannot change)
Slide 24: Dense-Mostly-Constant (DMC) Transitions
- K non-constant probabilities per row
- DMC HMMs comprise a richer and more expressive class of models than sparse HMMs

[Figure: a DMC transition matrix with K = 2]
Slide 25: Dense-Mostly-Constant Transitions
- The transition model for state s_i now comprises (see the sketch after this list):
  - NC_i = { j : s_i → s_j is a non-constant transition probability }
  - c_i = the constant transition probability from s_i to every state not in NC_i
  - a_ij = the non-constant transition probability for s_i → s_j, j ∈ NC_i
- Example: NC_3 = {2, 5}, c_3 = 0.05, a_32 = 0.25, a_35 = 0.6
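A small data-structure sketch of one DMC row (the class and field names are mine, not the paper's). The slide's example row is consistent with N = 5 states, which the assertion checks:

```python
import numpy as np

class DMCRow:
    """Transition model for one state s_i under the DMC assumption:
    K explicit (non-constant) probabilities plus one shared constant
    c_i for every remaining destination state."""
    def __init__(self, nc_indices, nc_probs, c, N):
        self.nc = dict(zip(nc_indices, nc_probs))  # NC_i: j -> a_ij
        self.c = c                                 # constant prob c_i
        # The row must be a valid distribution over all N states.
        assert np.isclose(sum(nc_probs) + (N - len(nc_indices)) * c, 1.0)

    def prob(self, j):
        # a_ij if j is a non-constant destination, otherwise c_i
        return self.nc.get(j, self.c)

# The slide's example (states numbered 1..N, N = 5):
row3 = DMCRow(nc_indices=[2, 5], nc_probs=[0.25, 0.60], c=0.05, N=5)
assert row3.prob(5) == 0.60 and row3.prob(1) == 0.05
```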
Slides 26-31: HMM Filtering
[Animation: the forward-variable trellis, a table with rows t = 1, 2, ..., T and columns α_t(1), α_t(2), α_t(3), ..., α_t(N), filled in one timestep at a time]
Slides 32-33: Fast Evaluation in DMC HMMs

The forward recursion is

    α_{t+1}(j) = b_j(O_{t+1}) Σ_i α_t(i) a_ij

and under the DMC assumption the sum splits as

    Σ_i α_t(i) a_ij = Σ_i α_t(i) c_i + Σ_{i : j ∈ NC_i} α_t(i) (a_ij − c_i)

- The first sum is O(N), but is common to all j at each timestep t
- The second sum is O(K) for each α_t(j)
- This yields O(TNK) complexity for the evaluation problem (see the sketch below).
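To make the O(TNK) recursion concrete, here is a minimal Python/NumPy sketch, assuming the DMC model is stored as a length-N array c of constants plus a list nc of per-row {j: a_ij} dicts (0-indexed; the names are mine, not the paper's). It omits the per-timestep rescaling any practical forward pass needs:

```python
import numpy as np

def dmc_forward(c, nc, B, obs, pi):
    """Forward pass in O(T*N*K) under the DMC assumption.

    c   : length-N array, constant transition prob c_i for each row
    nc  : list over i of {j: a_ij} dicts (the non-constant entries NC_i)
    B   : N x M observation table, B[j, k] = b_j(k)
    obs : length-T sequence of observation symbols
    pi  : length-N initial state distribution
    Returns the (unscaled) forward variables alpha as a T x N array.
    """
    N, T = len(c), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(T - 1):
        common = alpha[t] @ c            # O(N), shared by every j
        nxt = np.full(N, common)
        for i, row in enumerate(nc):     # O(N*K) corrections in total
            for j, a_ij in row.items():
                nxt[j] += alpha[t, i] * (a_ij - c[i])
        alpha[t + 1] = B[:, obs[t + 1]] * nxt
    return alpha
```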
Slide 34: Fast Inference in DMC HMMs
- The Viterbi algorithm uses dynamic programming to calculate the globally optimal state sequence Q* = argmax_Q P(Q, O | λ).
- Define δ_t(i) = max_{q_1 ... q_{t-1}} P(q_1 ... q_{t-1}, q_t = s_i, O_1 ... O_t | λ).
- The δ variables can be computed in O(TN²) time, with the O(N) inductive step

    δ_{t+1}(j) = b_j(O_{t+1}) max_i δ_t(i) a_ij

- Under the DMC assumption, this step can be carried out in O(K) time (a sketch follows).
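The same constant/non-constant split applies to the max. Below is my own sketch of one inductive step under the slide's assumptions; for brevity it sorts the values δ_t(i)·c_i (O(N log N) per step, rather than the O(K)-per-entry cache the slide's bound implies) and omits the backpointers needed to recover the path:

```python
import numpy as np

def dmc_viterbi_step(delta_t, c, nc, excluders, b_next):
    """One Viterbi induction step under the DMC assumption.

    delta_t   : length-N array of delta_t(i)
    c, nc     : as in dmc_forward
    excluders : list over j of the set {i : j in NC_i}
    b_next    : length-N array, b_j(O_{t+1}) for each j
    """
    N = len(delta_t)
    v = delta_t * c                 # candidate scores via constant rows
    order = np.argsort(-v)
    delta_next = np.zeros(N)
    for j in range(N):
        # Best predecessor among rows that reach j with the constant c_i:
        best = 0.0
        for i in order:
            if i not in excluders[j]:
                best = v[i]
                break
        # Compare against the explicit (non-constant) entries into j:
        for i in excluders[j]:
            best = max(best, delta_t[i] * nc[i][j])
        delta_next[j] = b_next[j] * best
    return delta_next
```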
Slides 35-40: Learning a DMC HMM
- Idea One:
  - Ask the user to tell us the DMC structure
  - Learn the parameters using EM
  - Simple, but in general we don't know the DMC structure
- Idea Two: use EM to learn the DMC structure too (outlined in the sketch after this list):
  1. Guess a DMC structure
  2. Find expected transition counts and observation parameters, given the current model and observations
  3. Find the maximum-likelihood DMC model given the counts
  4. Go to 2
- The DMC structure can (and does) change between iterations!
- In fact, just start with an all-constant transition model
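In outline, the loop might look like the following sketch, where init_all_constant_model, e_step_counts, and dmc_m_step are hypothetical helpers standing in for the routines described above:

```python
def learn_dmc_hmm(obs, N, K, n_iters=50):
    """Outline of the structure-learning EM loop (a sketch only:
    the three helpers below are hypothetical stand-ins)."""
    model = init_all_constant_model(N, K)     # step 1: all-constant start
    for _ in range(n_iters):
        # step 2: expected transition counts + observation statistics
        S, obs_stats = e_step_counts(model, obs)
        # step 3: ML DMC model; each row keeps its K largest expected
        # counts as NC_i and pools the rest into the constant c_i,
        # so the DMC structure can change between iterations
        model = dmc_m_step(S, obs_stats, K)
    return model                              # step 4 is the loop itself
```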
Slide 41: Learning a DMC HMM
- Find expected transition counts and observation parameters, given the current model and observations
Slides 42-45: We want a new estimate of a_ij:

    â_ij = (expected # of transitions s_i → s_j) / (expected # of transitions out of s_i)

where the expected count of s_i → s_j transitions is

    S_ij = a_ij Σ_{t=1}^{T-1} α_t(i) b_j(O_{t+1}) β_{t+1}(j)
Slides 46-48: The forward and backward variables α and β are T × N matrices; each can be computed in O(TN) time.
[Figure: the α and β matrices, each T × N]
Slides 49-53: We want S_ij = a_ij Σ_t α_t(i) r_t(j), where

    r_t(j) = b_j(O_{t+1}) β_{t+1}(j)

so that, up to the factor a_ij, S_ij is the dot product of column i of α with column j of r; e.g., entry S_24 comes from columns α_2 and r_4. Computing all N² column dot products of the T × N matrices α and r takes O(TN²). A naive NumPy version appears below.

[Figure: the T × N matrices α and r, with columns 2 and 4 highlighted and their dot product feeding entry S_24 of the N × N matrix S]
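In matrix form the whole table of counts is one product; a naive NumPy version of the O(TN²) baseline the next slides attack might look like:

```python
import numpy as np

def expected_transition_counts_naive(alpha, r, A):
    """S[i, j] = a_ij * sum_t alpha[t, i] * r[t, j]  --  O(T N^2).
    alpha and r are T x N; A is the N x N transition matrix."""
    return A * (alpha.T @ r)
```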
Slides 54-58: Speedups considered for this O(TN²) bottleneck:
- Strassen
- Approximate α by DMC
- Approximate randomized AᵀB
- Sparse structure: fine?
- Fixed DMC: fine?
- Goal: a speedup without approximation
Slide 59: Two insights
- Insight One: we only need the top K entries in each row of S
- Insight Two: the values in the rows of α and r are often very skewed

[Figure: the T × N matrices α and r]
Slides 60-65: Precached index structures (see the sketch after this list)
- For i = 1..N, store the indexes of the R largest values in the ith column of α: "a-biggies(i)"
- For j = 1..N, store the indexes of the R largest values in the jth column of r: "r-biggies(j)"
- R << T: it takes O(TN) time to build all the indexes
- The Rth largest value in the ith column of α: O(1) time to obtain
- The Rth largest value in the jth column of r: O(1) time to obtain (precached for all j in time O(TN))
- With these caches, bounding an entry is an O(R) computation
- There's an important detail I'm omitting here to do with prescaling the rows of α and r.
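A sketch of how the "biggies" caches might be built with NumPy (argpartition selects the top R per column without a full sort; the prescaling detail mentioned above is ignored):

```python
import numpy as np

def top_R_indexes(mat, R):
    """For each column of the T x N matrix mat, return the row indexes
    of its R largest entries. argpartition takes O(T) per column, so
    building all N caches takes O(TN), since R << T."""
    return np.argpartition(-mat, R - 1, axis=0)[:R]   # shape (R, N)

# a_biggies[:, i] would index the R largest entries of column i of alpha;
# r_biggies[:, j] likewise for column j of r.
```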
Slides 66-70: Computing the ith row of S
- In O(NR) time, we can put upper and lower bounds on S_ij for j = 1, 2, ..., N
- We only need exact values of S_ij for the K largest values within the row
- Ignore the j's that can't be in the top K
- Be exact for the rest: O(T) time each
- If there's enough pruning, the total time is O(TN + RN²) (a sketch of the bound computation follows)

[Figure: bar chart of upper and lower bounds on S_ij across j = 1, 2, 3, ..., N]
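One way to realize the O(R) bound per entry, under the assumptions above (nonnegative α and r, the caches from the previous sketch, prescaling ignored): sum the terms indexed by either cache exactly; every remaining term is nonnegative and at most the product of the two columns' Rth-largest values.

```python
import numpy as np

def bounds_on_S_ij(alpha_col, r_col, a_ij, ai_idx, rj_idx, ai_Rth, rj_Rth):
    """O(R) lower/upper bounds on S_ij = a_ij * sum_t alpha_col[t]*r_col[t].

    ai_idx, rj_idx : cached indexes of the R largest entries of column i
                     of alpha and column j of r (previous sketch)
    ai_Rth, rj_Rth : those columns' Rth-largest values (precached, O(1))
    """
    idx = np.union1d(ai_idx, rj_idx)              # at most 2R timesteps
    lower = a_ij * float(alpha_col[idx] @ r_col[idx])
    # Every term outside idx is >= 0 and <= ai_Rth * rj_Rth, since its
    # alpha entry is outside the top R of its column, and likewise for r:
    upper = lower + a_ij * (len(alpha_col) - len(idx)) * ai_Rth * rj_Rth
    return lower, upper
```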
Slide 71: Evaluation and Inference Speedup
Dataset: synthetic data with T = 2000 timesteps
[Plot not transcribed]

Slide 72: Parameter Learning Speedup
Dataset: synthetic data with T = 2000 timesteps
[Plot not transcribed]
Slide 73: Performance Experiments
- DMC-friendly dataset: a 2-D Gaussian 20-state DMC HMM with K = 5 (20,000 train, 5,000 test)
- Anti-DMC dataset: a 2-D Gaussian 20-state regular HMM with steadily varying, well-distributed transition probabilities (20,000 train, 5,000 test)
- Motionlogger dataset: accelerometer data from two sensors worn over several days (10,000 train, 4,720 test)
- Models compared:
  - Regular and DMC HMMs with 20 states
  - Small HMM: a 5-state regular HMM
  - Uniform HMM: a 20-state HMM with uniform transition probabilities
Slides 74-80: Learning Curves for DMC-friendly data
[Plots not transcribed]
Slides 81-87: Learning Curves for Anti-DMC data
[Plots not transcribed]
Slides 88-94: Learning Curves for Motionlogger data
[Plots not transcribed]
Slide 95: Tradeoffs between N and K
- We vary N and K while keeping the number of transition parameters (N × K) constant
- Increasing N while decreasing K allows more states for modeling data features, but fewer parameters per state for temporal structure
Slide 96: Tradeoffs between N and K
- Average test-set log-likelihoods at convergence
- Datasets:
  - A: DMC-friendly
  - B: Anti-DMC
  - C: Motionlogger
- Each dataset has a different optimal N-vs-K tradeoff
Slide 97: Regularization with DMC HMMs
- # of transition parameters in a regular 100-state HMM: 100² = 10,000
- # of transition parameters in a DMC 100-state HMM with K = 5: 100 × 5 = 500
Slide 98: Conclusions
- DMC HMMs are an important class of models that allow parameterized complexity-vs-efficiency tradeoffs in large state spaces
- The speedup can be several orders of magnitude
- Even for non-DMC domains, DMC HMMs yield higher scores than baseline models
- The DMC HMM model can be applied to arbitrary state spaces and observation densities
Slide 99: Related Work
- Felzenszwalb et al. (2003): fast HMM algorithms when transition probabilities can be expressed as distances in an underlying parameter space
- Murphy and Paskin (2002): fast inference in hierarchical HMMs cast as DBNs
- Salakhutdinov et al. (2003): combined EM and conjugate gradient for faster HMM learning when the amount of missing information is high
- Beam search: a widely used heuristic in word recognition for speech systems
Slides 100-101: Future Work
- Investigate DMC HMMs as a regularization mechanism
- Eliminate the R parameter using an automatic backoff evaluation approach
- Devise ways to set the K parameter automatically, and allow per-row K parameters

The End