Title: HMM II: Parameter Estimation
1. HMM II: Parameter Estimation
2. Reminder: Hidden Markov Model
Markov chain transition probabilities: p(S_{i+1} = t | S_i = s) = a_{st}.
Emission probabilities: p(X_i = b | S_i = s) = e_s(b).
For each integer L > 0, this defines a probability space in which the simple events are HMM sequences of length L.
3. Hidden Markov Model for CpG Islands
Domain(S_i) = {+, -} x {A, C, T, G} (8 values).
(State diagram: the eight states A+, C+, G+, T+ and A-, C-, G-, T-.)
In this representation P(x_i | s_i) = 0 or 1, depending on whether x_i is consistent with s_i.
E.g., x_i = G is consistent with s_i = (+, G) and with s_i = (-, G), but not with any other value of s_i.
4. Most Probable State Path
Given an observation sequence x = (x_1, ..., x_L), a most probable path s = (s_1, ..., s_L) is one which maximizes p(s | x).
5. Viterbi's Algorithm for the Most Probable Path
(Diagram: state 0 followed by hidden states s_1, s_2, ..., s_{L-1}, s_L, each s_i emitting X_i.)
We add the special initial state 0.
Initialization: v_0(0) = 1, v_k(0) = 0 for k > 0.
For i = 1 to L, for each state l:
  v_l(i) = e_l(x_i) * max_k [v_k(i-1) a_{kl}]
  ptr_i(l) = argmax_k [v_k(i-1) a_{kl}]   (storing the previous state for reconstructing the path).
Termination: the probability of the most probable path is
  p(s_1, ..., s_L, x_1, ..., x_L) = max_k v_k(L).
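The recursion above translates directly into code. Below is a minimal Python/NumPy sketch (not from the slides); the function name and argument layout (a for transitions, e for emissions, a0 for transitions out of the initial state 0) are my own choices, and probabilities are kept as plain products, so for long sequences one would switch to log space.

```python
import numpy as np

def viterbi(x, a, e, a0):
    """Most probable state path for the observation sequence x.

    x  : list of symbol indices x_1, ..., x_L
    a  : a[k, l] = transition probability a_kl
    e  : e[k, b] = emission probability e_k(b)
    a0 : a0[l]   = transition probability from the initial state 0 to state l
    """
    L, m = len(x), a.shape[0]
    v = np.zeros((L, m))                 # v[i, l] = v_l(i)
    ptr = np.zeros((L, m), dtype=int)    # ptr[i, l] = ptr_i(l)

    v[0] = a0 * e[:, x[0]]               # step from the initial state 0
    for i in range(1, L):
        scores = v[i - 1][:, None] * a   # scores[k, l] = v_k(i-1) * a_kl
        ptr[i] = scores.argmax(axis=0)
        v[i] = e[:, x[i]] * scores.max(axis=0)

    # Termination and traceback of the most probable path
    path = [int(v[-1].argmax())]
    for i in range(L - 1, 0, -1):
        path.append(int(ptr[i, path[-1]]))
    return path[::-1], float(v[-1].max())
```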
6. Predicting CpG Islands via the Most Probable Path
Output symbols: A, C, G, T (4 letters). Markov chain states: 4 '-' states and 4 '+' states, two for each letter (8 states total). The most probable path (found by Viterbi's algorithm) predicts CpG islands.
An experiment (Durbin et al., pp. 60-61) shows that the predicted islands are shorter than the assumed ones. In addition, quite a few false negatives are found.
7. Most Probable State
Given an output sequence x = (x_1, ..., x_L), s_i is a most probable state (at location i) if
s_i = argmax_k p(S_i = k | x).
8. Finding the Most Probable State
f_l(i) = p(x_1, ..., x_i, s_i = l), the probability of a path which emits (x_1, ..., x_i) and in which s_i = l.
b_l(i) = p(x_{i+1}, ..., x_L | s_i = l), the probability of a path which emits (x_{i+1}, ..., x_L), given that s_i = l.
- The forward algorithm finds f_k(i) = P(x_1, ..., x_i, s_i = k) for k = 1, ..., m.
- The backward algorithm finds b_k(i) = P(x_{i+1}, ..., x_L | s_i = k) for k = 1, ..., m.
- Return p(S_i = k | x) = f_k(i) b_k(i) / p(x) for k = 1, ..., m.
- To compute this for every i, simply run the forward and backward algorithms once, and compute f_k(i) b_k(i) for every i and k.
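As a companion to the Viterbi sketch, here is a minimal Python/NumPy sketch of the forward and backward algorithms and the resulting posterior state probabilities; the names (forward_backward, F, B, a0) and the unscaled computation are my own simplifications, not part of the slides.

```python
import numpy as np

def forward_backward(x, a, e, a0):
    """Compute F[i, k] = f_k(i), B[i, k] = b_k(i), p(x), and p(S_i = k | x)."""
    L, m = len(x), a.shape[0]
    F = np.zeros((L, m))
    B = np.zeros((L, m))

    # Forward: f_k(i) = P(x_1..x_i, s_i = k)
    F[0] = a0 * e[:, x[0]]
    for i in range(1, L):
        F[i] = e[:, x[i]] * (F[i - 1] @ a)

    # Backward: b_k(i) = P(x_{i+1}..x_L | s_i = k)
    B[-1] = 1.0
    for i in range(L - 2, -1, -1):
        B[i] = a @ (e[:, x[i + 1]] * B[i + 1])

    px = F[-1].sum()                # p(x | theta)
    posterior = F * B / px          # posterior[i, k] = p(S_i = k | x)
    return F, B, px, posterior
```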
9. Finding the Probability that a Letter Is in a CpG Island, via the Algorithm for the Most Probable State
The probability that the occurrence of G in the i-th location is in a CpG island (a '+' state) is
Σ_{s in the '+' states} p(S_i = s | x) = Σ_{s in the '+' states} F_s(i) B_s(i) / p(x),
where the summation is formally over the 4 '+' states, but actually only the state G+ needs to be considered (why?).
10. Parameter Estimation for HMM
An HMM model is defined by the parameters a_{kl} and e_k(b), for all states k, l and all symbols b. Let θ denote the collection of these parameters.
11. Parameter Estimation for HMM (cont.)
To determine the values of (the parameters in) θ, use a training set x^1, ..., x^n, where each x^j is a sequence which is assumed to fit the model. Given the parameters θ, each sequence x^j has an assigned probability p(x^j | θ).
12. Maximum Likelihood Parameter Estimation for HMM
The elements of the training set x^1, ..., x^n are assumed to be independent, so p(x^1, ..., x^n | θ) = Π_j p(x^j | θ). ML parameter estimation looks for a θ which maximizes the above. The exact method for finding or approximating this θ depends on the nature of the training set used.
13. Data for HMM
Possible properties of (the sequences in) the training set:
- For each x^j, the information on the states s^j_i
- The size (number of sequences) of the training set
14. Case 1: ML When State Paths Are Fully Known
We know the complete state path of each sequence in the training set x^1, ..., x^n. We wish to estimate a_{kl} and e_k(b) for all pairs of states k, l and all symbols b.
By the ML method, we look for parameters θ* which maximize the probability of the sample set:
p(x^1, ..., x^n | θ*) = max_θ p(x^1, ..., x^n | θ).
15. Case 1: State Paths Are Fully Known
For each x^j (with its known state path s^j) we have p(x^j, s^j | θ) = Π_i a_{s_{i-1} s_i} e_{s_i}(x_i).
Let m_{kl} = #{i : s_{i-1} = k, s_i = l} (in x^j), and m_k(b) = #{i : s_i = k, x_i = b} (in x^j).
Then p(x^j, s^j | θ) = Π_{k,l} a_{kl}^{m_{kl}} · Π_{k,b} e_k(b)^{m_k(b)}.
16. Case 1 (cont.)
By the independence of the x^j's, p(x^1, ..., x^n | θ) = Π_j p(x^j | θ).
Thus, if A_{kl} = #(transitions from k to l) in the training set, and E_k(b) = #(emissions of symbol b from state k) in the training set, we have
p(x^1, ..., x^n | θ) = Π_{k,l} a_{kl}^{A_{kl}} · Π_{k,b} e_k(b)^{E_k(b)}.
17. Case 1 (cont.)
So we need to find a_{kl}'s and e_k(b)'s which maximize
F = Π_{k,l} a_{kl}^{A_{kl}} · Π_{k,b} e_k(b)^{E_k(b)},
subject to: for every state k, Σ_l a_{kl} = 1 and Σ_b e_k(b) = 1 (with all parameters nonnegative).
18. Case 1 (cont.)
Rewriting, we need to maximize
F = Π_k [Π_l a_{kl}^{A_{kl}}] · Π_k [Π_b e_k(b)^{E_k(b)}],
a product of independent factors, one per state k.
19. Case 1 (cont.)
If we maximize each factor Π_l a_{kl}^{A_{kl}} and each factor Π_b e_k(b)^{E_k(b)} separately, then we will also maximize F. Each of these is a simpler ML problem, similar to ML parameter estimation for a die, treated next.
20. MLE for n Outcomes
The MLE is given by the relative frequencies: if outcome i is observed N_i times out of N = Σ_i N_i independent trials, then the ML estimate of its probability is p_i = N_i / N.
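For completeness, a short justification (not spelled out on the slide) of why the relative frequencies maximize the likelihood, using a Lagrange multiplier for the normalization constraint:

```latex
\max_{p_1,\dots,p_n} \sum_i N_i \log p_i
\quad \text{s.t.} \quad \sum_i p_i = 1
\;\Longrightarrow\;
\frac{\partial}{\partial p_i}\Big(\sum_j N_j \log p_j - \lambda\big(\sum_j p_j - 1\big)\Big)
  = \frac{N_i}{p_i} - \lambda = 0
\;\Longrightarrow\;
p_i = \frac{N_i}{\lambda}, \qquad
\lambda = \sum_j N_j = N, \qquad
\hat{p}_i = \frac{N_i}{N}.
```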
21. Applying the ML Method to HMM
Let A_{kl} = #(transitions from k to l) in the training set, and E_k(b) = #(emissions of symbol b from state k) in the training set. We need to maximize
F = Π_{k,l} a_{kl}^{A_{kl}} · Π_{k,b} e_k(b)^{E_k(b)}.
22. Applying the ML Method to HMM (cont.)
We apply the previous technique to get, for each state k, the parameters {a_{kl} : l = 1, ..., m} and {e_k(b) : b in the alphabet}, which gives the optimal ML parameters:
a_{kl} = A_{kl} / Σ_{l'} A_{kl'},   e_k(b) = E_k(b) / Σ_{b'} E_k(b').
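The counting-and-normalizing recipe of slides 21-22 is short enough to sketch directly. The following Python/NumPy function is my own illustration, with assumed argument names; it estimates a_{kl} and e_k(b) from sequences whose state paths are fully known (Case 1 only).

```python
import numpy as np

def ml_estimate_known_paths(sequences, paths, m, n_symbols):
    """ML estimates of a_kl and e_k(b) when all state paths are known.

    sequences : list of symbol-index sequences x^1, ..., x^n
    paths     : list of state-index sequences  s^1, ..., s^n (same lengths)
    """
    A = np.zeros((m, m))          # A[k, l] = number of k -> l transitions
    E = np.zeros((m, n_symbols))  # E[k, b] = number of emissions of b from k

    for x, s in zip(sequences, paths):
        for i in range(len(x)):
            E[s[i], x[i]] += 1
            if i > 0:
                A[s[i - 1], s[i]] += 1

    # a_kl = A_kl / sum_l' A_kl'  and  e_k(b) = E_k(b) / sum_b' E_k(b')
    # (states that never occur give 0/0 here; pseudocounts, next slide, fix this)
    a = A / A.sum(axis=1, keepdims=True)
    e = E / E.sum(axis=1, keepdims=True)
    return a, e
```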
23. Summary of Case 1: State Paths Are Fully Known
We know the complete state path of each sequence in the training set x^1, ..., x^n. We wish to estimate a_{kl} and e_k(b) for all pairs of states k, l and all symbols b.
Summary: when everything is known, we can find the (unique set of) parameters θ* which maximizes p(x^1, ..., x^n | θ*) = max_θ p(x^1, ..., x^n | θ).
24. Adding Pseudocounts in HMM
If the sample set is too small, we may get a biased result. In this case we modify the actual counts using our prior knowledge/belief: r_{kl} is our prior belief on transitions from k to l, and r_k(b) is our prior belief on emissions of b from state k. The estimates then use A_{kl} + r_{kl} and E_k(b) + r_k(b) in place of the raw counts.
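Continuing the previous sketch, the pseudocount-smoothed estimates can be written as follows (again an illustration; r_trans and r_emit are assumed matrices holding r_{kl} and r_k(b), and A, E are the count matrices from ml_estimate_known_paths):

```python
def smoothed_estimates(A, E, r_trans, r_emit):
    """Pseudocount-smoothed ML estimates."""
    # a_kl = (A_kl + r_kl) / sum_l' (A_kl' + r_kl')
    a = (A + r_trans) / (A + r_trans).sum(axis=1, keepdims=True)
    # e_k(b) = (E_k(b) + r_k(b)) / sum_b' (E_k(b') + r_k(b'))
    e = (E + r_emit) / (E + r_emit).sum(axis=1, keepdims=True)
    return a, e
```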
25. Case 2: State Paths Are Unknown
For a given θ we have p(x^1, ..., x^n | θ) = p(x^1 | θ) · ... · p(x^n | θ) (since the x^j are independent).
For each sequence x, p(x | θ) = Σ_s p(x, s | θ), the sum taken over all state paths s which emit x.
26. Case 2: State Paths Are Unknown (cont.)
Thus, for the n sequences (x^1, ..., x^n) we have
p(x^1, ..., x^n | θ) = Σ_{(s^1, ..., s^n)} p(x^1, ..., x^n, s^1, ..., s^n | θ),
where the summation is taken over all tuples of n state paths (s^1, ..., s^n) which generate (x^1, ..., x^n). For simplicity, we will assume that n = 1.
27. Case 2: State Paths Are Unknown (cont.)
So we need to maximize p(x | θ) = Σ_s p(x, s | θ), where the summation is over all the state paths s which produce the output sequence x. Finding a θ which maximizes Σ_s p(x, s | θ) is hard, unlike finding a θ which maximizes p(x, s | θ) for a single pair (x, s).
28. ML Parameter Estimation for HMM
The general process for finding θ in this case is:
- Start with an initial value of θ.
- Find a θ' such that p(x | θ') > p(x | θ).
- Set θ = θ'.
- Repeat until some convergence criterion is met.
A general algorithm of this type is the Expectation Maximization (EM) algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training.
29. Baum-Welch Training
We start with some values of a_{kl} and e_k(b), which define prior values of θ. Then we use an iterative algorithm which attempts to replace θ by a θ' such that p(x | θ') > p(x | θ). This is done by imitating the algorithm for Case 1, where all states are known.
30. Baum-Welch Training (cont.)
In Case 1 we computed the optimal values of a_{kl} and e_k(b) (for the optimal θ) by simply counting the number A_{kl} of transitions from state k to state l, and the number E_k(b) of emissions of symbol b from state k, in the training set. This was possible since we knew all the states.
31. Baum-Welch Training (cont.)
When the states are unknown, the counting process is replaced by an averaging process: for each edge s_{i-1} → s_i we compute the expected number of k to l transitions, for all possible pairs (k, l), over this edge. Then, for each k and l, we take A_{kl} to be the sum over all edges.
32. Baum-Welch Training (cont.)
Similarly, for each emission edge s_i → x_i with x_i = b, and for each state k, we compute the probability that s_i = k, which is the expected number of k → b emissions on this edge. Then we take E_k(b) to be the sum over all such edges. These expected values are computed assuming the current parameters θ.
33. A_{kl} and E_k(b) When States Are Unknown
A_{kl} and E_k(b) are computed according to the current distribution θ, that is:
A_{kl} = Σ_s A^s_{kl} p(s | x, θ), where A^s_{kl} is the number of k to l transitions in the state path s.
E_k(b) = Σ_s E^s_k(b) p(s | x, θ), where E^s_k(b) is the number of times k emits b in the state path s with output x.
34. Baum-Welch Step 1a: Count the Expected Number of State Transitions
For each i, k, l, compute the state transition probabilities under the current θ:
P(s_{i-1} = k, s_i = l | x, θ).
For this, we use the forward and backward algorithms.
35. Finding State Probabilities
F_k(i) = p(x_1, ..., x_i, s_i = k), the probability that a path emits (x_1, ..., x_i) and s_i = k.
B_k(i) = p(x_{i+1}, ..., x_L | s_i = k), the probability that a path emits (x_{i+1}, ..., x_L), given that s_i = k.
p(s_i = k, x) = F_k(i) B_k(i). The values F_k(i) and B_k(i) for every i and k are computed by one run of the forward and backward algorithms.
36. Baum-Welch Step 1a (cont.)
Claim: by the probability distribution of the HMM (a_{kl} and e_l(x_i) are the parameters defined by θ, and F_k(i-1), B_l(i) are computed by the forward and backward algorithms),
P(x_1, ..., x_L, s_{i-1} = k, s_i = l | θ) = F_k(i-1) · a_{kl} e_l(x_i) · B_l(i).
37. Step 1a: Computing P(s_{i-1} = k, s_i = l | x, θ)
P(x_1, ..., x_L, s_{i-1} = k, s_i = l | θ) = P(x_1, ..., x_{i-1}, s_{i-1} = k | θ) · a_{kl} e_l(x_i) · P(x_{i+1}, ..., x_L | s_i = l, θ)
= F_k(i-1) · a_{kl} e_l(x_i) · B_l(i).
Dividing by p(x | θ) gives P(s_{i-1} = k, s_i = l | x, θ).
38. Step 1a (cont.)
For each pair (k, l), compute the expected number of state transitions from k to l as the sum, over all L edges, of the expected number of k to l transitions on each edge:
A_{kl} = Σ_i P(s_{i-1} = k, s_i = l | x, θ) = (1 / p(x | θ)) Σ_i F_k(i-1) a_{kl} e_l(x_i) B_l(i).
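Reusing F, B and p(x) from the forward-backward sketch above, the expected transition counts of step 1a can be accumulated as follows (my own sketch; the edge out of the special initial state 0 is left out for simplicity):

```python
import numpy as np

def expected_transition_counts(x, a, e, F, B, px):
    """A_kl = sum_i P(s_{i-1}=k, s_i=l | x, theta)
            = sum_i F_k(i-1) * a_kl * e_l(x_i) * B_l(i) / p(x)."""
    L, m = len(x), a.shape[0]
    A = np.zeros((m, m))
    for i in range(1, L):
        # outer(...)[k, l] = F_k(i-1) * e_l(x_i) * B_l(i)
        A += np.outer(F[i - 1], e[:, x[i]] * B[i]) * a / px
    return A
```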
39. Step 1a for Many Sequences
When we have n independent input sequences (x^1, ..., x^n), then A_{kl} is given by summing the per-sequence expected counts:
A_{kl} = Σ_j Σ_i P(s_{i-1} = k, s_i = l | x^j, θ).
40. Baum-Welch Step 1b: Count the Expected Number of Symbol Emissions
For each state k and each symbol b, for each i where x_i = b, compute the expected number of times that s_i = k.
41. Baum-Welch Step 1b (cont.)
For each state k and each symbol b, compute the expected number of emissions of b from k as the sum of the expected number of times that s_i = k, over all i's for which x_i = b:
E_k(b) = Σ_{i : x_i = b} P(s_i = k | x, θ) = (1 / p(x | θ)) Σ_{i : x_i = b} F_k(i) B_k(i).
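The corresponding accumulation for step 1b, in the same illustrative style and again reusing F, B and p(x) from the forward-backward sketch:

```python
import numpy as np

def expected_emission_counts(x, n_symbols, F, B, px):
    """E_k(b) = sum over positions i with x_i = b of P(s_i = k | x, theta)
              = sum_{i : x_i = b} F_k(i) * B_k(i) / p(x)."""
    L, m = len(x), F.shape[1]
    E = np.zeros((m, n_symbols))
    posterior = F * B / px          # posterior[i, k] = P(s_i = k | x, theta)
    for i in range(L):
        E[:, x[i]] += posterior[i]
    return E
```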
42. Step 1b for Many Sequences
Exercise: when we have n sequences (x^1, ..., x^n), the expected number of emissions of b from k is given by
E_k(b) = Σ_j Σ_{i : x^j_i = b} P(s_i = k | x^j, θ).
43. Summary of Steps 1a and 1b: the E Part of Baum-Welch Training
These steps compute the expected numbers A_{kl} of k to l transitions, for all pairs of states k and l, and the expected numbers E_k(b) of emissions of symbol b from state k, for all states k and symbols b. The next step is the M step, which is identical to the computation of the optimal ML parameters when all states are known.
44. Baum-Welch Step 2
Use the A_{kl}'s and E_k(b)'s to compute the new values of a_{kl} and e_k(b), as in Case 1. These values define a new θ'.
The correctness of the EM algorithm implies that p(x^1, ..., x^n | θ') ≥ p(x^1, ..., x^n | θ), i.e., θ' increases (or at least does not decrease) the probability of the data. This procedure is iterated until some convergence criterion is met.
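Putting the pieces together, here is one possible way to wire the E step and M step into a training loop, reusing the forward_backward, expected_transition_counts and expected_emission_counts sketches above. It is an illustration under simplifying assumptions (the initial-state distribution a0 is kept fixed, no pseudocounts, no scaling for long sequences), not a definitive Baum-Welch implementation.

```python
import numpy as np

def baum_welch(sequences, a, e, a0, n_iter=100, tol=1e-6):
    """Iterate E and M steps until the log-likelihood stops improving."""
    m, n_symbols = e.shape
    prev_ll = -np.inf
    for _ in range(n_iter):
        A = np.zeros((m, m))
        E = np.zeros((m, n_symbols))
        ll = 0.0
        # E step: accumulate expected counts over all training sequences
        for x in sequences:
            F, B, px, _ = forward_backward(x, a, e, a0)
            A += expected_transition_counts(x, a, e, F, B, px)
            E += expected_emission_counts(x, n_symbols, F, B, px)
            ll += np.log(px)
        # M step: same normalization as in Case 1, with expected counts
        a = A / A.sum(axis=1, keepdims=True)
        e = E / E.sum(axis=1, keepdims=True)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return a, e
```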