Title: Hidden Markov Models
1. Hidden Markov Models
2. What is an HMM?
- An HMM is a stochastic machine M = (Q, α, Pt, Pe) consisting of the following:
  - a finite set of states, Q = {q0, q1, ..., qm}
  - a finite alphabet α = {s0, s1, ..., sn}
  - a transition distribution Pt : Q × Q → [0,1], i.e., Pt(qj | qi)
  - an emission distribution Pe : Q × α → [0,1], i.e., Pe(sj | qi)
An Example
M1 = (Q = {q0, q1, q2}, α = {Y, R}, Pt, Pe)
Pt = { (q0,q1,1), (q1,q1,0.8), (q1,q2,0.15), (q1,q0,0.05), (q2,q2,0.7), (q2,q1,0.3) }
Pe = { (q1,Y,1), (q1,R,0), (q2,Y,0), (q2,R,1) }
[State diagram: q0 → q1 (100%); q1 → q1 (80%), q1 → q2 (15%), q1 → q0 (5%); q2 → q2 (70%), q2 → q1 (30%); q1 emits Y 100% / R 0%; q2 emits R 100% / Y 0%]
3. Probability of a Sequence
P(YRYRY | M1) = a_{0,1} · b_{1,Y} · a_{1,2} · b_{2,R} · a_{2,1} · b_{1,Y} · a_{1,2} · b_{2,R} · a_{2,1} · b_{1,Y} · a_{1,0}
             = 1 × 1 × 0.15 × 1 × 0.3 × 1 × 0.15 × 1 × 0.3 × 1 × 0.05
             = 0.00010125
(where a_{i,j} = Pt(qj | qi) and b_{i,s} = Pe(s | qi))
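As a sanity check, the same product can be computed directly in code. This is a minimal Python sketch; the nested-dictionary layout for Pt and Pe is an illustrative choice, not part of the slides.

# Transition and emission distributions of the toy model M1
trans = {"q0": {"q1": 1.0},
         "q1": {"q1": 0.8, "q2": 0.15, "q0": 0.05},
         "q2": {"q2": 0.7, "q1": 0.3}}
emit = {"q1": {"Y": 1.0, "R": 0.0},
        "q2": {"Y": 0.0, "R": 1.0}}

# Path followed for YRYRY: q0 -> q1 -> q2 -> q1 -> q2 -> q1 -> q0
path, seq = ["q0", "q1", "q2", "q1", "q2", "q1", "q0"], "YRYRY"

p = 1.0
for k, sym in enumerate(seq):
    p *= trans[path[k]][path[k + 1]]   # transition into the emitting state
    p *= emit[path[k + 1]][sym]        # emission of the observed symbol
p *= trans[path[-2]][path[-1]]         # final transition back to q0
print(p)  # ≈ 0.00010125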
4. Another Example
M2 = (Q, α, Pt, Pe)
Q = {q0, q1, q2, q3, q4}
α = {A, C, G, T}
[State diagram: transitions among q0..q4 labeled with percentages (100, 80, 65, 50, 50, 35, 20); the four emitting states carry emission distributions over A/C/G/T, e.g. A 35% T 25% C 15% G 25%; A 10% T 30% C 40% G 20%; A 27% T 14% C 22% G 37%; A 11% T 17% C 43% G 29%]
5. Finding the Most Probable Path
Example: C A T T A A T A G
[Same state diagram as M2; the path through the top states has probability 7.0×10^-7, the path through the bottom states 2.8×10^-9]
The most probable path is:
  States:   122222224
  Sequence: CATTAATAG
resulting in this parse:
  feature 1 = C   feature 2 = ATTAATA   feature 3 = G
6. Decoding with an HMM
Given a sequence S = x0...xL-1, find the most probable (MAP) state path:
  φ* = argmax_φ P(φ | S) = argmax_φ P(S, φ)
where, taking φ_{-1} = q0, the joint probability is a product of a transition prob. and an emission prob. at each position, plus the final transition back to q0:
  P(S, φ) = [ ∏_{k=0}^{L-1} Pt(φ_k | φ_{k-1}) · Pe(x_k | φ_k) ] · Pt(q0 | φ_{L-1})
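For a model as small as M1 the MAP path can even be found by brute force, which makes the argmax above concrete. A minimal sketch, using M1 rather than M2 (enumerating all paths is only feasible for toy models):

from itertools import product

trans = {"q0": {"q1": 1.0},
         "q1": {"q1": 0.8, "q2": 0.15, "q0": 0.05},
         "q2": {"q2": 0.7, "q1": 0.3}}
emit = {"q1": {"Y": 1.0, "R": 0.0},
        "q2": {"Y": 0.0, "R": 1.0}}

def joint(seq, path):
    """P(seq, path) including the initial and final transitions through q0."""
    p, prev = 1.0, "q0"
    for state, sym in zip(path, seq):
        p *= trans[prev].get(state, 0.0) * emit[state].get(sym, 0.0)
        prev = state
    return p * trans[prev].get("q0", 0.0)

seq = "YRYRY"
best = max(product(["q1", "q2"], repeat=len(seq)), key=lambda path: joint(seq, path))
print(best, joint(seq, best))  # the path q1,q2,q1,q2,q1 with probability ≈ 0.00010125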
7. The Best Partial Parse
The best partial parse probability, call it V(i,k), is the probability of the most probable path that emits the subsequence x0...xk-1 and ends in state qi, i.e., with symbol xk-1 emitted by state qi.
8. The Viterbi Algorithm
[DP matrix diagram: rows = states, columns = sequence positions ...k-2, k-1, k, k+1...; cell (i,k) is computed from the cells in column k-1]
V(i,k) = Pe(x_{k-1} | qi) · max_j V(j,k-1) · Pt(qi | qj)
9. Viterbi Traceback
The most probable path is recovered by following the traceback pointers T from the best state in the last column back through every column to column 0:
  T( T( T( ... T( T(i, L-1), L-2) ..., 2), 1), 0)
10Viterbi Algorithm in Pseudocode
?transqiqj Pt(qiqj)gt0 ?emits qi
Pe(sqi)gt0
initialization
fill out main part of DP matrix
choose best state from last column in DP matrix
traceback
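A minimal Python sketch of these four steps, using nested dictionaries for Pt and Pe (the function and variable names are illustrative, not from the slides):

def viterbi(seq, states, trans, emit, start="q0"):
    """Return (best_path, best_prob) for seq under an HMM given as
    trans[qi][qj] = Pt(qj|qi) and emit[qi][s] = Pe(s|qi)."""
    L = len(seq)
    V = [{} for _ in range(L)]   # V[k][qi] = prob. of best partial parse ending in qi at position k
    T = [{} for _ in range(L)]   # traceback pointers

    # initialization: first symbol is emitted after a transition out of the start state
    for qi in states:
        V[0][qi] = trans[start].get(qi, 0.0) * emit[qi].get(seq[0], 0.0)
        T[0][qi] = start

    # fill out main part of the DP matrix
    for k in range(1, L):
        for qi in states:
            best_prev, best_p = None, 0.0
            for qj in states:
                p = V[k - 1][qj] * trans[qj].get(qi, 0.0)
                if p > best_p:
                    best_prev, best_p = qj, p
            V[k][qi] = best_p * emit[qi].get(seq[k], 0.0)
            T[k][qi] = best_prev

    # choose best state from the last column (including the final transition back to q0)
    last = max(states, key=lambda qi: V[L - 1][qi] * trans[qi].get(start, 0.0))
    best_prob = V[L - 1][last] * trans[last].get(start, 0.0)

    # traceback
    path = [last]
    for k in range(L - 1, 0, -1):
        path.append(T[k][path[-1]])
    return path[::-1], best_prob

# Example with the toy model M1 from earlier in the slides
trans = {"q0": {"q1": 1.0},
         "q1": {"q1": 0.8, "q2": 0.15, "q0": 0.05},
         "q2": {"q2": 0.7, "q1": 0.3}}
emit = {"q1": {"Y": 1.0}, "q2": {"R": 1.0}}
print(viterbi("YRYRY", ["q1", "q2"], trans, emit))
# (['q1', 'q2', 'q1', 'q2', 'q1'], ≈ 0.00010125)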
11The Forward Algorithm Probability of a Sequence
F(i,k) represents the probability P(S0..k-1 qi)
that the machine emits the subsequence x0...xk-1
by any path ending in state qii.e., so that
symbol xk-1 is emitted by state qi.
12. The Forward Algorithm: Probability of a Sequence
Viterbi follows the single most probable path; the Forward algorithm sums over all paths, i.e., the max in the recurrence is replaced by a sum:
  F(i,k) = Pe(x_{k-1} | qi) · Σ_j F(j,k-1) · Pt(qi | qj)
[DP matrix diagram as before: cell (i,k) is computed from all cells in column k-1]
13. The Forward Algorithm in Pseudocode
- fill out the DP matrix
- sum over the final column to get P(S)
A runnable sketch is given below.
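A minimal Python sketch of the Forward algorithm, identical to the Viterbi sketch above except that the max is replaced by a sum (names are illustrative):

def forward(seq, states, trans, emit, start="q0"):
    """Return P(seq), summed over all state paths, with
    trans[qi][qj] = Pt(qj|qi) and emit[qi][s] = Pe(s|qi)."""
    L = len(seq)
    F = [{} for _ in range(L)]   # F[k][qi] = P(x0..xk, any path ending in qi)

    # first column: transition out of the start state, then emit x0
    for qi in states:
        F[0][qi] = trans[start].get(qi, 0.0) * emit[qi].get(seq[0], 0.0)

    # fill out the DP matrix: sum (not max) over predecessor states
    for k in range(1, L):
        for qi in states:
            total = sum(F[k - 1][qj] * trans[qj].get(qi, 0.0) for qj in states)
            F[k][qi] = total * emit[qi].get(seq[k], 0.0)

    # sum over the final column (with the final transition back to q0) to get P(S)
    return sum(F[L - 1][qi] * trans[qi].get(start, 0.0) for qi in states)

# With M1 only one path has nonzero probability, so Forward equals the Viterbi value
trans = {"q0": {"q1": 1.0},
         "q1": {"q1": 0.8, "q2": 0.15, "q0": 0.05},
         "q2": {"q2": 0.7, "q1": 0.3}}
emit = {"q1": {"Y": 1.0}, "q2": {"R": 1.0}}
print(forward("YRYRY", ["q1", "q2"], trans, emit))  # ≈ 0.00010125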
14Training an HMM from Labeled Sequences
CGATATTCGATTCTACGCGCGTATACTAGCTTATCTGATC 011111112222222111111222211111112222111110
to state to state to state
0 1 2
from state 0 0 (0) 1 (100) 0 (0)
from state 1 1 (4) 21 (84) 3 (12)
from state 2 0 (0) 3 (20) 12 (80)
transitions
symbol symbol symbol symbol
A C G T
in state 1 6 (24) 7 (28) 5 (20) 7 (28)
in state 2 3 (20) 3 (20) 2 (13) 7 (47)
emissions
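Labeled-sequence training amounts to counting and normalizing, as in this minimal sketch (the variable names are illustrative; the printed values match the tables above):

from collections import Counter

seq    = "CGATATTCGATTCTACGCGCGTATACTAGCTTATCTGATC"
labels = "011111112222222111111222211111112222111110"  # includes leading/trailing q0

# Count transitions along the label string and emissions at each labeled position
trans_counts = Counter(zip(labels, labels[1:]))
emit_counts  = Counter(zip(labels[1:-1], seq))

# Normalize counts into maximum-likelihood estimates of Pt and Pe
def normalize(counts):
    totals = Counter()
    for (frm, _), n in counts.items():
        totals[frm] += n
    return {(frm, to): n / totals[frm] for (frm, to), n in counts.items()}

Pt = normalize(trans_counts)
Pe = normalize(emit_counts)
print(Pt[("1", "2")])  # 3/25 = 0.12
print(Pe[("2", "T")])  # 7/15 ≈ 0.47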
15. Recall: Eukaryotic Gene Structure
[Diagram: the complete mRNA and its coding segment; exons alternating with introns; the coding segment begins at the start codon (ATG) and ends at the stop codon (TGA); each intron begins at a donor site (GT) and ends at an acceptor site (AG)]
16. Using an HMM for Gene Prediction
[Diagram: the Markov model has states for Exon, Intron, Donor, Acceptor, Start codon, Stop codon, and Intergenic, plus the start state q0. The input sequence (e.g., AGCTAGCAGTATGTCATGGCATGTTCGGAGGTAGTACGTAGAGGTAGCTAGTATAGGTCGATAGTACGCGA) is decoded; the most probable path through the states yields the gene prediction, here exon 1, exon 2, and exon 3]
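A minimal sketch of turning a decoded state path into predicted features (the single-letter state labels are illustrative, following the I = intron, E = exon, N = intergenic convention used on the H3 slide below):

def path_to_features(path, seq):
    """Group consecutive positions that share a state label into features."""
    features = []
    start = 0
    for k in range(1, len(path) + 1):
        if k == len(path) or path[k] != path[start]:
            features.append((path[start], start, k, seq[start:k]))
            start = k
    return features

# Toy example: E = exon, I = intron, N = intergenic
path = "NNNEEEEIIIIEEEENN"
seq  = "AGCATGACGGTAAGTCA"
for state, beg, end, sub in path_to_features(path, seq):
    print(state, beg, end, sub)
# N 0 3 AGC / E 3 7 ATGA / I 7 11 CGGT / E 11 15 AAGT / N 15 17 CA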
17. Higher Order Markovian Eukaryotic Recognizer (HOMER)
Versions considered below: H3, H5, H17, H27, H77, H95
18HOMER, version H3
Intron
Iintron state Eexon state Nintergenic state
Donor
Acceptor
?
Exon
Start codon
Stop codon
Intergenic
tested on 500 Arabidopsis genes
q0
nucleotides nucleotides nucleotides splice sites splice sites start/stop codons start/stop codons exons exons exons genes genes
Sn Sp F Sn Sp Sn Sp Sn Sp F Sn
baseline 100 28 44 0 0 0 0 0 0 0 0 0
H3 53 88 66 0 0 0 0 0 0 0 0 0
19. Recall: Sensitivity and Specificity
Sn = TP / (TP + FN),   Sp = TP / (TP + FP),   F = 2·Sn·Sp / (Sn + Sp)
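For example, applying this F definition to the H3 nucleotide-level results reproduces the tabulated value: F = 2 × 53 × 88 / (53 + 88) ≈ 66.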
20. HOMER, version H5
Three exon states, for the three codon positions.

          nucleotides      splice sites   start/stop   exons            genes
          Sn   Sp   F      Sn   Sp        Sn   Sp      Sn   Sp   F      Sn   #
H3        53   88   66     0    0         0    0       0    0    0      0    0
H5        65   91   76     1    3         3    3       0    0    0      0    0
21. HOMER, version H17
[State diagram highlighting the donor site, acceptor site, start codon, and stop codon submodels]

          nucleotides      splice sites   start/stop   exons            genes
          Sn   Sp   F      Sn   Sp        Sn   Sp      Sn   Sp   F      Sn   #
H5        65   91   76     1    3         3    3       0    0    0      0    0
H17       81   93   87     34   48        43   37      19   24   21     7    35
22. Maintaining Phase Across an Intron
[Diagram: codon phase track 0,1,2,0,1,2,... over the sequence GTATGCGATAGTCAAGAGTGATCGCTAGACC (coordinates 0-30); the phase reached at the end of one exon must be resumed at the start of the next exon, across the intervening intron]
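A minimal sketch of the bookkeeping this requires (the convention that phase counts coding bases mod 3 is an assumption consistent with the 0/1/2 track above):

def next_phase(phase, exon_length):
    """Phase at which the next exon must start, given the phase at which
    the current exon started and its length in bases."""
    return (phase + exon_length) % 3

# e.g. an exon that starts in phase 0 and is 8 bases long
print(next_phase(0, 8))  # 2: the next exon must resume in phase 2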
23. HOMER, version H27
Three separate intron models (one per codon phase).

          nucleotides      splice sites   start/stop   exons            genes
          Sn   Sp   F      Sn   Sp        Sn   Sp      Sn   Sp   F      Sn   #
H17       81   93   87     34   48        43   37      19   24   21     7    35
H27       83   93   88     40   49        41   36      23   27   25     8    38
24Recall Weight Matrices
(stop codons)
T G A
(start codons)
T A A
A T G
T A G
(acceptor splice sites)
(donor splice sites)
A G
G T
25HOMER version H77
positional biases near splice sites
nucleotides nucleotides nucleotides splice splice start/stop start/stop exons exons exons genes genes
Sn Sp F Sn Sp Sn Sp Sn Sp F Sn
H27 83 93 88 40 49 41 36 23 27 25 8 38
H77 88 96 92 66 67 51 46 47 46 46 13 65
26. HOMER, version H95

          nucleotides      splice sites   start/stop   exons            genes
          Sn   Sp   F      Sn   Sp        Sn   Sp      Sn   Sp   F      Sn   #
H77       88   96   92     66   67        51   46      47   46   46     13   65
H95       92   97   94     79   76        57   53      62   59   60     19   93
27. Summary of HOMER Results
28Higher-order Markov Models
P(G)
A C G C T A
0th order
P(GC)
A C G C T A
1st order
P(GAC)
A C G C T A
2nd order
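A minimal sketch of estimating such higher-order probabilities from counts in a training sequence (k is the order; names are illustrative):

from collections import Counter

def higher_order_probs(train_seq, k):
    """Estimate P(x | previous k bases) from (k+1)-mer and k-mer counts."""
    kmer_counts = Counter(train_seq[i:i + k]     for i in range(len(train_seq) - k))
    kp1_counts  = Counter(train_seq[i:i + k + 1] for i in range(len(train_seq) - k))
    return {kp1: n / kmer_counts[kp1[:k]] for kp1, n in kp1_counts.items()}

probs = higher_order_probs("ACGCTAACGCTTACGGTA", 2)
print(probs.get("ACG"))  # estimate of P(G | AC) from this toy training sequence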
29Higher-order Markov Models
order nucleotides nucleotides nucleotides splice sites splice sites starts/ stops starts/ stops exons exons exons genes genes
order Sn Sp F Sn Sp Sn Sp Sn Sp F Sn
H95 0 92 97 94 79 76 57 53 62 59 60 19 93
H95 1 95 98 97 87 81 64 61 72 68 70 25 127
H95 2 98 98 98 91 82 65 62 76 69 72 27 136
H95 3 98 98 98 91 82 67 63 76 69 72 28 140
H95 4 98 97 98 90 81 69 64 76 68 72 29 143
H95 5 98 97 98 90 81 66 62 74 67 70 27 137
0
1
2
3
4
5
30. Variable-Order Markov Models
31. Interpolation Results
32. Summary
- An HMM is a stochastic generative model which emits sequences
- Parsing with an HMM can be accomplished using a decoding algorithm (such as Viterbi) to find the most probable (MAP) path generating the input sequence
- Training of unambiguous HMMs can be accomplished using labeled sequence training
- Training of ambiguous HMMs can be accomplished using Viterbi training or the Baum-Welch algorithm (next lesson...)
- Posterior decoding can be used to estimate the probability that a given symbol or substring was generated by a particular state (next lesson...)