Title: Hidden Markov Models
1. Hidden Markov Models & Gene Finding
- Lecture 21, November 15, 2005
- Algorithms in Biosequence Analysis
- Nathan Edwards, Fall 2005
2. Hidden Markov Models
- Set of states Q
- Transition probabilities p_s(t), for s, t in Q
- Output alphabet Σ
- Emission probabilities e_s(x), for s in Q, x in Σ
- Separate the hidden states from the observed sequence.
3. Dishonest casino
[Diagram: a two-state HMM. The Fair die emits 1-6 each with probability 1/6; the Loaded die emits 1-5 with probability 1/10 each and 6 with probability 1/2. Transitions: Fair→Fair 0.95, Fair→Loaded 0.05, Loaded→Loaded 0.90, Loaded→Fair 0.10.]
4. Dishonest casino
- Q = {Fair, Loaded}, Σ = {1, 2, 3, 4, 5, 6}
- p_Fair(Fair) = 0.95, p_Fair(Loaded) = 0.05
- p_Loaded(Fair) = 0.10, p_Loaded(Loaded) = 0.90
- e_Fair(1) = 1/6, ..., e_Fair(6) = 1/6
- e_Loaded(1) = 1/10, ..., e_Loaded(6) = 1/2 (see the sketch below)
[Diagram: the two-state Fair/Loaded HMM from slide 3.]
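To make the notation concrete, here is a minimal Python sketch of this model as plain dictionaries; the variable names are illustrative choices, and only the numbers come from the slide.

    # A minimal, illustrative encoding of the dishonest-casino HMM.
    # Keys and names are arbitrary; only the probabilities come from the slide.

    states = ["Fair", "Loaded"]
    alphabet = [1, 2, 3, 4, 5, 6]

    # Transition probabilities p_s(t): trans[s][t]
    trans = {
        "Fair":   {"Fair": 0.95, "Loaded": 0.05},
        "Loaded": {"Fair": 0.10, "Loaded": 0.90},
    }

    # Emission probabilities e_s(x): emit[s][x]
    emit = {
        "Fair":   {x: 1 / 6 for x in alphabet},
        "Loaded": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
    }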
5. Finding CpG Islands
6. Finding CpG Islands
- Q = {A, C, G, T, A-, C-, G-, T-}
- Σ = {A, C, G, T}
- e_A(A) = 1, ..., e_A(T) = 0
- e_A-(A) = 1, ..., e_A-(T) = 0
- ...
- p_x(y): CpG-island transition a_xy
- p_x-(y-): non-CpG-island transition a-_xy
- p_x(y-): probability of switching from CpG to non-CpG
- p_x-(y): probability of switching from non-CpG to CpG (one possible wiring is sketched below)
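As an illustration only, the sketch below assembles the 8-state transition matrix from two 4x4 within-region tables and a single switching probability. The tables and the p_switch value are made-up placeholders, not estimates from real CpG data, and real models typically estimate separate switch probabilities in each direction.

    # Illustrative construction of the 8-state CpG-island HMM topology.
    # All numeric values below are placeholders, not real CpG statistics.

    bases = ["A", "C", "G", "T"]
    uniform = {y: 0.25 for y in bases}

    # Within-island transitions a_xy and within-background transitions a-_xy
    # (placeholders; a real model estimates these from labeled sequence).
    a_plus = {x: dict(uniform) for x in bases}
    a_minus = {x: dict(uniform) for x in bases}

    p_switch = 0.01   # hypothetical probability of leaving the current region

    states = bases + [b + "-" for b in bases]   # A,C,G,T inside; A-,...,T- outside
    trans = {}
    for x in bases:
        # From an island state: stay in the island block or switch out.
        trans[x] = {y: (1 - p_switch) * a_plus[x][y] for y in bases}
        trans[x].update({y + "-": p_switch * a_minus[x][y] for y in bases})
        # From a background state: stay outside or switch into an island.
        trans[x + "-"] = {y + "-": (1 - p_switch) * a_minus[x][y] for y in bases}
        trans[x + "-"].update({y: p_switch * a_plus[x][y] for y in bases})

    # Each state emits its own base with probability 1.
    emit = {s: {y: 1.0 if s.rstrip("-") == y else 0.0 for y in bases} for s in states}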
7. Dynamic Programming Algorithms
- Viterbi: find the most likely state sequence S* for a given observed sequence x (see the sketch below).
  S* = argmax_S P_{p,e}(S | x) = argmax_S P_{p,e}(S, x)
- Forward, Backward: find the probability of the observed sequence x.
  P_{p,e}(x) = Σ_S P(S, x)
  f_k(i) = P(x_1, ..., x_i, S_i = k),  b_k(i) = P(x_{i+1}, ..., x_L | S_i = k)
- Posterior state probability: the likelihood, given the observed sequence x, that S_i = k.
  P_{p,e}(S_i = k | x) = f_k(i) b_k(i) / P(x)
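A minimal Viterbi sketch in Python, written against the dictionary representation used for the casino example above. The initial state distribution init is left implicit on the slides, so the uniform default here is an assumption; the computation is done in log space to avoid underflow.

    import math

    def viterbi(x, states, trans, emit, init=None):
        """Most likely state path for observations x, computed in log space."""
        if init is None:
            init = {s: 1.0 / len(states) for s in states}   # assumed uniform start
        log = lambda p: math.log(p) if p > 0 else float("-inf")

        # v[i][k] = log-probability of the best path ending in state k at position i
        v = [{k: log(init[k]) + log(emit[k][x[0]]) for k in states}]
        back = []
        for i in range(1, len(x)):
            v.append({})
            back.append({})
            for k in states:
                best_prev = max(states, key=lambda j: v[i - 1][j] + log(trans[j][k]))
                back[-1][k] = best_prev
                v[i][k] = v[i - 1][best_prev] + log(trans[best_prev][k]) + log(emit[k][x[i]])

        # Trace back from the best final state.
        last = max(states, key=lambda k: v[-1][k])
        path = [last]
        for ptrs in reversed(back):
            path.append(ptrs[path[-1]])
        return list(reversed(path))

    # Example: decode a short roll sequence with the casino model defined above.
    # print(viterbi([1, 6, 6, 6, 2], states, trans, emit))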
8. Parameter Estimation
- Given a set of training sequences (and Q and Σ), determine p and e.
- Require a set of training sequences, much as in any data-mining problem.
- Ideally, these are already labeled with the states in Q.
9. Parameter Estimation
- Given sequences S and x, compute p_s(t) and e_s(y).
- Count the number of times s is followed by t in S, and the number of times y is output in state s.
- p_s(t) = #{i : S_i = s, S_{i+1} = t} / #{i : S_i = s}
         = #{i : S_i = s, S_{i+1} = t} / Σ_k #{i : S_i = s, S_{i+1} = k}
- e_s(y) = #{i : x_i = y, S_i = s} / #{i : S_i = s}
         = #{i : x_i = y, S_i = s} / Σ_y' #{i : x_i = y', S_i = s}
- What if we never see a particular transition?
10. Parameter Estimation
- p_s(t) = #{i : S_i = s, S_{i+1} = t} / Σ_k #{i : S_i = s, S_{i+1} = k}
- e_s(y) = #{i : x_i = y, S_i = s} / Σ_y' #{i : x_i = y', S_i = s}
- These are the maximum likelihood estimates: (p, e) = argmax_{p,e} P_{p,e}(S, x)
- What if we never see a particular transition or emission?
- Add pseudo-counts to represent our Bayesian belief that these events happen, albeit rarely (see the sketch below).
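A sketch of the counting estimator with pseudo-counts, assuming labeled training data arrives as (state sequence, observation sequence) pairs; the pseudo-count of 1 is an arbitrary illustrative default.

    def estimate(labeled_pairs, states, alphabet, pseudo=1.0):
        """Maximum-likelihood p and e from (state_seq, obs_seq) pairs, with pseudo-counts."""
        t_count = {s: {t: pseudo for t in states} for s in states}
        e_count = {s: {y: pseudo for y in alphabet} for s in states}
        for S, x in labeled_pairs:
            for i in range(len(S)):
                e_count[S[i]][x[i]] += 1          # count emission of x_i in state S_i
                if i + 1 < len(S):
                    t_count[S[i]][S[i + 1]] += 1  # count transition S_i -> S_{i+1}
        # Normalize counts row by row to get probability distributions.
        trans = {s: {t: c / sum(row.values()) for t, c in row.items()}
                 for s, row in t_count.items()}
        emit = {s: {y: c / sum(row.values()) for y, c in row.items()}
                for s, row in e_count.items()}
        return trans, emit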
11. What if we have no labels?
- If the state sequence is unknown, we can't just count.
- If only we had p and e, we could compute the expected number of s-to-t transitions by summing over all paths.
- P_{p,e}(S_i = s, S_{i+1} = t | x) = f_s(i) p_s(t) e_t(x_{i+1}) b_t(i+1) / P_{p,e}(x)
- E[#{S_i = s, S_{i+1} = t}] = Σ_x Σ_i P_{p,e}(S_i = s, S_{i+1} = t | x)
- E[#{x_i = y, S_i = s}] = Σ_x Σ_i P_{p,e}(S_i = s, x_i = y | x)
12. Baum-Welch
- If we use these expected counts to estimate the parameters,
  p'_s(t) = E_{p,e}[#{S_i = s, S_{i+1} = t}] / Σ_k E_{p,e}[#{S_i = s, S_{i+1} = k}]
  e'_s(y) = E_{p,e}[#{x_i = y, S_i = s}] / Σ_y' E_{p,e}[#{x_i = y', S_i = s}]
- then P_{p',e'}(x) > P_{p,e}(x), or p' = p and e' = e.
- So, pick initial p, e values and iterate; stop when no more improvement is made (one pass is sketched below).
- This is a special case of a general class of parameter estimation procedures called Expectation-Maximization (EM).
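A self-contained sketch of one Baum-Welch pass over a set of training sequences, using the same dictionary representation as earlier plus an explicit initial distribution init (an assumption, since the slides leave it implicit). It works with raw probabilities, with no scaling, log space, or pseudo-counts, so it is only suitable for short sequences.

    def forward(x, states, trans, emit, init):
        # f[i][k] = P(x_1..x_{i+1}, S_{i+1} = k), 0-indexed positions
        f = [{k: init[k] * emit[k][x[0]] for k in states}]
        for i in range(1, len(x)):
            f.append({k: emit[k][x[i]] * sum(f[i - 1][j] * trans[j][k] for j in states)
                      for k in states})
        return f

    def backward(x, states, trans, emit):
        # b[i][k] = P(x_{i+2}..x_L | S_{i+1} = k), 0-indexed positions
        b = [dict() for _ in x]
        b[-1] = {k: 1.0 for k in states}
        for i in range(len(x) - 2, -1, -1):
            b[i] = {k: sum(trans[k][j] * emit[j][x[i + 1]] * b[i + 1][j] for j in states)
                    for k in states}
        return b

    def baum_welch_step(seqs, states, alphabet, trans, emit, init):
        """One EM update: expected counts (E-step), then re-normalization (M-step)."""
        t_exp = {s: {t: 0.0 for t in states} for s in states}
        e_exp = {s: {y: 0.0 for y in alphabet} for s in states}
        for x in seqs:
            f = forward(x, states, trans, emit, init)
            b = backward(x, states, trans, emit)
            px = sum(f[-1][k] for k in states)          # P(x)
            for i in range(len(x)):
                for s in states:
                    # Expected emission count: posterior P(S_i = s | x)
                    e_exp[s][x[i]] += f[i][s] * b[i][s] / px
                    if i + 1 < len(x):
                        for t in states:
                            # Expected transition count: P(S_i = s, S_{i+1} = t | x)
                            t_exp[s][t] += (f[i][s] * trans[s][t]
                                            * emit[t][x[i + 1]] * b[i + 1][t] / px)
        new_trans = {s: {t: c / sum(row.values()) for t, c in row.items()}
                     for s, row in t_exp.items()}
        new_emit = {s: {y: c / sum(row.values()) for y, c in row.items()}
                    for s, row in e_exp.items()}
        return new_trans, new_emit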
13. Baum-Welch
- The solution found by this iterative procedure is a local maximum,
  very much dependent on the initial parameters.
- The updates are equivalent to choosing p', e' by
  max_{p',e'} L(p, e, p', e') = Σ_S P_{p,e}(S | x) log P_{p',e'}(S, x)
  s.t. Σ_t p'_s(t) = 1 for all s in Q,
       Σ_y e'_s(y) = 1 for all s in Q,
       p'_s(t) ≥ 0, e'_s(y) ≥ 0.
14. Baum-Welch
- Could use Viterbi state assignments instead: less accurate, but a little faster.
- Many computational biology applications of HMMs use training sequences with labels.
15. HMM structure
- These basic algorithms can be quite expressive, but we need more.
- Our current models emit a geometric number of emissions in each state:
  P(l | s) = p_s(s)^(l-1) (1 - p_s(s))
- This is fine, as long as it's really true!
16. Dishonest casino
[Diagram: the same two-state Fair/Loaded HMM as slide 3, with self-transitions 0.95 (Fair) and 0.90 (Loaded).]
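As a quick worked instance of the geometric length model (the mean of a geometric distribution, 1/(1 - p_s(s)), is a standard fact rather than something stated on the slide):

    E[l | Fair]   = 1 / (1 - 0.95) = 20 rolls on average
    E[l | Loaded] = 1 / (1 - 0.90) = 10 rolls on average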
17. HMM structure
[Diagram: two state topologies: one enforcing a minimum length of 7 with a geometric length distribution otherwise, and one realizing an arbitrary distribution of lengths between 2 and 8.]
18. HMM structure
- Using topology in this way is equivalent to setting some p_s(t) to zero (see the sketch below).
- Typically, we also constrain the e_s to be equal for all s in a group Q_j.
- The dynamic programming algorithms stay much the same, but become much more graph-like.
- Can also change the emission model.
- Silent states
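A small illustrative sketch of the topology idea in Python: it expands one state into a chain of copies with tied emissions, so every visit lasts at least min_len steps and is geometric beyond that. The function name, arguments, and the commented example values are made up for illustration.

    def expand_min_length(state, min_len, p_stay, emission, exit_targets):
        """Replace `state` by a chain state_1 .. state_min_len with tied emissions.

        The first min_len - 1 copies move to the next copy with probability 1
        (so the run cannot end early); the last copy keeps a geometric tail via
        p_stay and leaves according to `exit_targets` (a dict of target: prob
        summing to 1 - p_stay).
        """
        names = [f"{state}_{i}" for i in range(1, min_len + 1)]
        trans = {}
        for a, b in zip(names, names[1:]):
            trans[a] = {b: 1.0}                      # forced progression through the chain
        trans[names[-1]] = {names[-1]: p_stay, **exit_targets}
        emit = {n: dict(emission) for n in names}    # tied emission distributions
        return names, trans, emit

    # Example: a "Loaded" state that must last at least 7 rolls, then behaves
    # geometrically, leaving to a hypothetical "Fair_1" entry state.
    # names, trans, emit = expand_min_length(
    #     "Loaded", 7, 0.90,
    #     {x: (0.1 if x < 6 else 0.5) for x in range(1, 7)},
    #     {"Fair_1": 0.10},
    # )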
19. HMM structure
- Suppose we need to model skipping states.
- Direct skip transitions mean too many parameters: O(n^2).
- Silent states (which emit nothing) avoid this.
20. Semi-HMM
- Semi-Hidden Markov Models extend this idea still further.
- Explicitly model the output length distribution.
- Each state emits a sequence of characters, whose length is drawn from some distribution.
- No more s-to-s transitions!
21. Dishonest casino
[Diagram: the casino model as a semi-HMM. Fair and Loaded keep their emission tables (1-6 at 1/6 each vs. 1-5 at 1/10 and 6 at 1/2), their state lengths are drawn from Geom(p = 0.95) and Geom(p = 0.90) respectively, and each state transitions to the other with probability 1.00.]
22. Finding CpG Islands
23. CpG Islands
[Diagram: the CpG-island model as a semi-HMM, with explicit length distributions P+(length) for islands and P-(length) for non-island regions.]
24. Semi-HMMs
- Semi-HMMs free us to choose arbitrary length distributions and emit sequences of (arbitrary) characters:
  - Markov chains for local dependencies
  - Highly structured models that capture specific modules:
    - CpG / non-CpG
    - Gene / Intergenic
    - Exon / Intron
25. Example Gene-finding HMM
26. Example Gene-finding HMM
- 5th-order intergenic Markov model
  - P(x_i | x_{i-5} x_{i-4} x_{i-3} x_{i-2} x_{i-1})
  - 4^5 * 3 = 3072 parameters
- First-order internal codon model
  - 61 * 60 = 3660 parameters
- Stop codon choice
  - 2 parameters
- Total of 6734 parameters + two length distributions! (See the arithmetic check below.)
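Reading the counts as 4^5 contexts times 3 free probabilities each for the intergenic model, and 61 sense codons times 60 free probabilities each for the codon model (an interpretation, but the one that matches the stated totals), the sum checks out:

    4^5 * 3 + 61 * 60 + 2 = 3072 + 3660 + 2 = 6734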
27. Viterbi for Semi-HMM
- Let v_k(i) be the probability of the most probable sequence of states that ends in state k with observation i; then
  v_k(i) = max_{k' in Q} e_k(x_i) p_{k'}(k) v_{k'}(i-1)
- Let V_k(i) be the probability of the most probable sequence of states that ends in state k with an emission that ends at position i.
  V_k(i) = max_j ...
28. Viterbi for Semi-HMM
- V_k(i) = max_j max_{k'} P_k(x_{j+1} ... x_i) p_{k'}(k) V_{k'}(j),
  where P_k(x_{j+1} ... x_i) is the probability that state k emits exactly that segment, length distribution included (see the sketch below).
- This should look familiar:
  - arbitrary gap models for sequence alignment!
- Unfortunately, this means O(N^2) time.
- For gene-finding models, not so bad:
  - exons are short
  - start and stop codons define few positions to check
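A naive Python sketch of this recurrence, assuming each state k comes with a length distribution dur[k] (segment length -> probability, lengths ≥ 1) and a segment-probability function seg_prob(k, segment); both names, and the init start distribution, are illustrative assumptions rather than anything fixed by the slides. The O(N^2) inner loop is kept deliberately simple.

    def semi_viterbi(x, states, trans, dur, seg_prob, init):
        """Best segmentation of x under a semi-HMM; returns (start, end, state) triples.

        V[i][k] = probability of the best segmentation of x[:i] whose last
        segment is emitted by state k.
        """
        L = len(x)
        V = [{k: 0.0 for k in states} for _ in range(L + 1)]
        choice = [{} for _ in range(L + 1)]
        for i in range(1, L + 1):
            for k in states:
                best, best_prob = None, 0.0
                for l, pl in dur[k].items():           # candidate segment lengths
                    j = i - l
                    if j < 0:
                        continue
                    emit_p = pl * seg_prob(k, x[j:i])  # P_k(x_{j+1}..x_i)
                    if j == 0:
                        p, prev = init[k] * emit_p, None   # first segment of the sequence
                    else:
                        kp = max(states, key=lambda s: V[j][s] * trans[s][k])
                        p, prev = V[j][kp] * trans[kp][k] * emit_p, (j, kp)
                    if p > best_prob:
                        best_prob, best = p, prev
                V[i][k], choice[i][k] = best_prob, best
        # Trace back the best segmentation from the end of the sequence.
        k = max(states, key=lambda s: V[L][s])
        segments, i = [], L
        while True:
            prev = choice[i][k]
            if prev is None:
                segments.append((0, i, k))
                break
            j, kp = prev
            segments.append((j, i, k))
            i, k = j, kp
        return list(reversed(segments))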
29. Summary
- HMM parameters are estimated from training data:
  - explicit frequencies if states are known
  - Baum-Welch if states are unknown
- HMM structure makes domain knowledge explicit:
  - silent states
  - semi-HMMs for arbitrary length distributions
  - emit sequences (Markov dependencies) rather than single characters