Title: Hidden Markov Models
1. Hidden Markov Models
2. What is an HMM?
- An HMM is a stochastic machine M = (Q, α, Pt, Pe) consisting of the following:
  - a finite set of states, Q = {q0, q1, ..., qm}
  - a finite alphabet α = {s0, s1, ..., sn}
  - a transition distribution Pt : Q × Q → [0,1], i.e., Pt(qj | qi)
  - an emission distribution Pe : Q × α → [0,1], i.e., Pe(sj | qi)
An Example
M1 = (Q = {q0, q1, q2}, α = {Y, R}, Pt, Pe)
Pt = { (q0,q1,1), (q1,q1,0.8), (q1,q2,0.15), (q1,q0,0.05), (q2,q2,0.7), (q2,q1,0.3) }
Pe = { (q1,Y,1), (q1,R,0), (q2,Y,0), (q2,R,1) }
[State diagram: q0 → q1 (100%); q1 → q1 (80%), q1 → q2 (15%), q1 → q0 (5%); q2 → q2 (70%), q2 → q1 (30%); q1 emits Y 100% / R 0%; q2 emits R 100% / Y 0%]
3. Probability of a Sequence
P(YRYRY | M1) = a_{0,1} · b_{1,Y} · a_{1,2} · b_{2,R} · a_{2,1} · b_{1,Y} · a_{1,2} · b_{2,R} · a_{2,1} · b_{1,Y} · a_{1,0}
             = 1 × 1 × 0.15 × 1 × 0.3 × 1 × 0.15 × 1 × 0.3 × 1 × 0.05
             = 0.00010125
(where a_{i,j} = Pt(qj | qi) and b_{i,s} = Pe(s | qi))
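As a sanity check, the same product can be computed directly in code. This is a minimal Python sketch; the nested-dictionary layout for Pt and Pe is an illustrative choice, not part of the slides.

# Transition and emission distributions of the toy model M1
trans = {"q0": {"q1": 1.0},
         "q1": {"q1": 0.8, "q2": 0.15, "q0": 0.05},
         "q2": {"q2": 0.7, "q1": 0.3}}
emit = {"q1": {"Y": 1.0, "R": 0.0},
        "q2": {"Y": 0.0, "R": 1.0}}

# Path followed for YRYRY: q0 -> q1 -> q2 -> q1 -> q2 -> q1 -> q0
path, seq = ["q0", "q1", "q2", "q1", "q2", "q1", "q0"], "YRYRY"

p = 1.0
for k, sym in enumerate(seq):
    p *= trans[path[k]][path[k + 1]]   # transition into the emitting state
    p *= emit[path[k + 1]][sym]        # emission of the observed symbol
p *= trans[path[-2]][path[-1]]         # final transition back to q0
print(p)  # ≈ 0.00010125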
4. Another Example
M2 = (Q, α, Pt, Pe)
Q = {q0, q1, q2, q3, q4}
α = {A, C, G, T}
[State diagram: transitions among q0..q4 labeled with percentages (100, 80, 65, 50, 50, 35, 20); the four emitting states carry emission distributions over A/C/G/T, e.g. A 35% T 25% C 15% G 25%; A 10% T 30% C 40% G 20%; A 27% T 14% C 22% G 37%; A 11% T 17% C 43% G 29%]
5. Finding the Most Probable Path
Example: C A T T A A T A G
[Same state diagram as M2; the path through the top states has probability 7.0×10^-7, the path through the bottom states 2.8×10^-9]
The most probable path is:
  States:   122222224
  Sequence: CATTAATAG
resulting in this parse:
  feature 1 = C   feature 2 = ATTAATA   feature 3 = G
6. Decoding with an HMM
Given a sequence S = x0...xL-1, find the most probable (MAP) state path:
  φ* = argmax_φ P(φ | S) = argmax_φ P(S, φ)
where, taking φ_{-1} = q0, the joint probability is a product of a transition prob. and an emission prob. at each position, plus the final transition back to q0:
  P(S, φ) = [ ∏_{k=0}^{L-1} Pt(φ_k | φ_{k-1}) · Pe(x_k | φ_k) ] · Pt(q0 | φ_{L-1})
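For a model as small as M1 the MAP path can even be found by brute force, which makes the argmax above concrete. A minimal sketch, using M1 rather than M2 (enumerating all paths is only feasible for toy models):

from itertools import product

trans = {"q0": {"q1": 1.0},
         "q1": {"q1": 0.8, "q2": 0.15, "q0": 0.05},
         "q2": {"q2": 0.7, "q1": 0.3}}
emit = {"q1": {"Y": 1.0, "R": 0.0},
        "q2": {"Y": 0.0, "R": 1.0}}

def joint(seq, path):
    """P(seq, path) including the initial and final transitions through q0."""
    p, prev = 1.0, "q0"
    for state, sym in zip(path, seq):
        p *= trans[prev].get(state, 0.0) * emit[state].get(sym, 0.0)
        prev = state
    return p * trans[prev].get("q0", 0.0)

seq = "YRYRY"
best = max(product(["q1", "q2"], repeat=len(seq)), key=lambda path: joint(seq, path))
print(best, joint(seq, best))  # the path q1,q2,q1,q2,q1 with probability ≈ 0.00010125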
7. The Best Partial Parse
The best partial parse probability, call it V(i,k), is the probability of the most probable path that emits the subsequence x0...xk-1 and ends in state qi, i.e., with symbol xk-1 emitted by state qi.
8. The Viterbi Algorithm
[DP matrix diagram: rows = states, columns = sequence positions ...k-2, k-1, k, k+1...; cell (i,k) is computed from the cells in column k-1]
V(i,k) = Pe(x_{k-1} | qi) · max_j V(j,k-1) · Pt(qi | qj)
9. Viterbi Traceback
The most probable path is recovered by following the traceback pointers T from the best state in the last column back through every column to column 0:
  T( T( T( ... T( T(i, L-1), L-2) ..., 2), 1), 0)
10Viterbi Algorithm in Pseudocode
?transqiqj Pt(qiqj)gt0 ?emits qi
Pe(sqi)gt0
initialization
fill out main part of DP matrix
choose best state from last column in DP matrix
traceback
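A minimal Python sketch of these four steps, using nested dictionaries for Pt and Pe (the function and variable names are illustrative, not from the slides):

def viterbi(seq, states, trans, emit, start="q0"):
    """Return (best_path, best_prob) for seq under an HMM given as
    trans[qi][qj] = Pt(qj|qi) and emit[qi][s] = Pe(s|qi)."""
    L = len(seq)
    V = [{} for _ in range(L)]   # V[k][qi] = prob. of best partial parse ending in qi at position k
    T = [{} for _ in range(L)]   # traceback pointers

    # initialization: first symbol is emitted after a transition out of the start state
    for qi in states:
        V[0][qi] = trans[start].get(qi, 0.0) * emit[qi].get(seq[0], 0.0)
        T[0][qi] = start

    # fill out main part of the DP matrix
    for k in range(1, L):
        for qi in states:
            best_prev, best_p = None, 0.0
            for qj in states:
                p = V[k - 1][qj] * trans[qj].get(qi, 0.0)
                if p > best_p:
                    best_prev, best_p = qj, p
            V[k][qi] = best_p * emit[qi].get(seq[k], 0.0)
            T[k][qi] = best_prev

    # choose best state from the last column (including the final transition back to q0)
    last = max(states, key=lambda qi: V[L - 1][qi] * trans[qi].get(start, 0.0))
    best_prob = V[L - 1][last] * trans[last].get(start, 0.0)

    # traceback
    path = [last]
    for k in range(L - 1, 0, -1):
        path.append(T[k][path[-1]])
    return path[::-1], best_prob

# Example with the toy model M1 from earlier in the slides
trans = {"q0": {"q1": 1.0},
         "q1": {"q1": 0.8, "q2": 0.15, "q0": 0.05},
         "q2": {"q2": 0.7, "q1": 0.3}}
emit = {"q1": {"Y": 1.0}, "q2": {"R": 1.0}}
print(viterbi("YRYRY", ["q1", "q2"], trans, emit))
# (['q1', 'q2', 'q1', 'q2', 'q1'], ≈ 0.00010125)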
11The Forward Algorithm Probability of a Sequence
F(i,k) represents the probability P(S0..k-1 qi)
that the machine emits the subsequence x0...xk-1
by any path ending in state qii.e., so that
symbol xk-1 is emitted by state qi.
12. The Forward Algorithm: Probability of a Sequence
Viterbi follows the single most probable path; the Forward algorithm sums over all paths, i.e., the max in the recurrence is replaced by a sum:
  F(i,k) = Pe(x_{k-1} | qi) · Σ_j F(j,k-1) · Pt(qi | qj)
[DP matrix diagram as before: cell (i,k) is computed from all cells in column k-1]
13. The Forward Algorithm in Pseudocode
- fill out the DP matrix
- sum over the final column to get P(S)
A runnable sketch is given below.
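A minimal Python sketch of the Forward algorithm, identical to the Viterbi sketch above except that the max is replaced by a sum (names are illustrative):

def forward(seq, states, trans, emit, start="q0"):
    """Return P(seq), summed over all state paths, with
    trans[qi][qj] = Pt(qj|qi) and emit[qi][s] = Pe(s|qi)."""
    L = len(seq)
    F = [{} for _ in range(L)]   # F[k][qi] = P(x0..xk, any path ending in qi)

    # first column: transition out of the start state, then emit x0
    for qi in states:
        F[0][qi] = trans[start].get(qi, 0.0) * emit[qi].get(seq[0], 0.0)

    # fill out the DP matrix: sum (not max) over predecessor states
    for k in range(1, L):
        for qi in states:
            total = sum(F[k - 1][qj] * trans[qj].get(qi, 0.0) for qj in states)
            F[k][qi] = total * emit[qi].get(seq[k], 0.0)

    # sum over the final column (with the final transition back to q0) to get P(S)
    return sum(F[L - 1][qi] * trans[qi].get(start, 0.0) for qi in states)

# With M1 only one path has nonzero probability, so Forward equals the Viterbi value
trans = {"q0": {"q1": 1.0},
         "q1": {"q1": 0.8, "q2": 0.15, "q0": 0.05},
         "q2": {"q2": 0.7, "q1": 0.3}}
emit = {"q1": {"Y": 1.0}, "q2": {"R": 1.0}}
print(forward("YRYRY", ["q1", "q2"], trans, emit))  # ≈ 0.00010125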
14Training an HMM from Labeled Sequences
CGATATTCGATTCTACGCGCGTATACTAGCTTATCTGATC 011111112222222111111222211111112222111110
to state to state to state
0 1 2
from state 0 0 (0) 1 (100) 0 (0)
from state 1 1 (4) 21 (84) 3 (12)
from state 2 0 (0) 3 (20) 12 (80)
transitions
symbol symbol symbol symbol
A C G T
in state 1 6 (24) 7 (28) 5 (20) 7 (28)
in state 2 3 (20) 3 (20) 2 (13) 7 (47)
emissions
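Labeled-sequence training amounts to counting and normalizing, as in this minimal sketch (the variable names are illustrative; the printed values match the tables above):

from collections import Counter

seq    = "CGATATTCGATTCTACGCGCGTATACTAGCTTATCTGATC"
labels = "011111112222222111111222211111112222111110"  # includes leading/trailing q0

# Count transitions along the label string and emissions at each labeled position
trans_counts = Counter(zip(labels, labels[1:]))
emit_counts  = Counter(zip(labels[1:-1], seq))

# Normalize counts into maximum-likelihood estimates of Pt and Pe
def normalize(counts):
    totals = Counter()
    for (frm, _), n in counts.items():
        totals[frm] += n
    return {(frm, to): n / totals[frm] for (frm, to), n in counts.items()}

Pt = normalize(trans_counts)
Pe = normalize(emit_counts)
print(Pt[("1", "2")])  # 3/25 = 0.12
print(Pe[("2", "T")])  # 7/15 ≈ 0.47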
15. Recall: Eukaryotic Gene Structure
[Diagram: the complete mRNA and its coding segment; exons alternating with introns; the coding segment begins at the start codon (ATG) and ends at the stop codon (TGA); each intron begins at a donor site (GT) and ends at an acceptor site (AG)]
16. Using an HMM for Gene Prediction
[Diagram: the Markov model has states for Exon, Intron, Donor, Acceptor, Start codon, Stop codon, and Intergenic, plus the start state q0. The input sequence (e.g., AGCTAGCAGTATGTCATGGCATGTTCGGAGGTAGTACGTAGAGGTAGCTAGTATAGGTCGATAGTACGCGA) is decoded; the most probable path through the states yields the gene prediction, here exon 1, exon 2, and exon 3]
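A minimal sketch of turning a decoded state path into predicted features (the single-letter state labels are illustrative, following the I = intron, E = exon, N = intergenic convention used on the H3 slide below):

def path_to_features(path, seq):
    """Group consecutive positions that share a state label into features."""
    features = []
    start = 0
    for k in range(1, len(path) + 1):
        if k == len(path) or path[k] != path[start]:
            features.append((path[start], start, k, seq[start:k]))
            start = k
    return features

# Toy example: E = exon, I = intron, N = intergenic
path = "NNNEEEEIIIIEEEENN"
seq  = "AGCATGACGGTAAGTCA"
for state, beg, end, sub in path_to_features(path, seq):
    print(state, beg, end, sub)
# N 0 3 AGC / E 3 7 ATGA / I 7 11 CGGT / E 11 15 AAGT / N 15 17 CA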
17. Higher Order Markovian Eukaryotic Recognizer (HOMER)
Versions considered below: H3, H5, H17, H27, H77, H95
18HOMER, version H3
Intron
Iintron state Eexon state Nintergenic state
Donor
Acceptor
?
Exon
Start codon
Stop codon
Intergenic
tested on 500 Arabidopsis genes
q0
nucleotides nucleotides nucleotides splice sites splice sites start/stop codons start/stop codons exons exons exons genes genes
Sn Sp F Sn Sp Sn Sp Sn Sp F Sn
baseline 100 28 44 0 0 0 0 0 0 0 0 0
H3 53 88 66 0 0 0 0 0 0 0 0 0
19. Recall: Sensitivity and Specificity
Sn = TP / (TP + FN),   Sp = TP / (TP + FP),   F = 2·Sn·Sp / (Sn + Sp)
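For example, applying this F definition to the H3 nucleotide-level results reproduces the tabulated value: F = 2 × 53 × 88 / (53 + 88) ≈ 66.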
20. HOMER, version H5
Three exon states, for the three codon positions.

          nucleotides      splice sites   start/stop   exons            genes
          Sn   Sp   F      Sn   Sp        Sn   Sp      Sn   Sp   F      Sn   #
H3        53   88   66     0    0         0    0       0    0    0      0    0
H5        65   91   76     1    3         3    3       0    0    0      0    0
21. HOMER, version H17
[State diagram highlighting the donor site, acceptor site, start codon, and stop codon submodels]

          nucleotides      splice sites   start/stop   exons            genes
          Sn   Sp   F      Sn   Sp        Sn   Sp      Sn   Sp   F      Sn   #
H5        65   91   76     1    3         3    3       0    0    0      0    0
H17       81   93   87     34   48        43   37      19   24   21     7    35
22. Maintaining Phase Across an Intron
[Diagram: codon phase track 0,1,2,0,1,2,... over the sequence GTATGCGATAGTCAAGAGTGATCGCTAGACC (coordinates 0-30); the phase reached at the end of one exon must be resumed at the start of the next exon, across the intervening intron]
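A minimal sketch of the bookkeeping this requires (the convention that phase counts coding bases mod 3 is an assumption consistent with the 0/1/2 track above):

def next_phase(phase, exon_length):
    """Phase at which the next exon must start, given the phase at which
    the current exon started and its length in bases."""
    return (phase + exon_length) % 3

# e.g. an exon that starts in phase 0 and is 8 bases long
print(next_phase(0, 8))  # 2: the next exon must resume in phase 2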
23. HOMER, version H27
Three separate intron models (one per codon phase).

          nucleotides      splice sites   start/stop   exons            genes
          Sn   Sp   F      Sn   Sp        Sn   Sp      Sn   Sp   F      Sn   #
H17       81   93   87     34   48        43   37      19   24   21     7    35
H27       83   93   88     40   49        41   36      23   27   25     8    38
24Recall Weight Matrices
(stop codons)
T G A
(start codons)
T A A
A T G
T A G
(acceptor splice sites)
(donor splice sites)
A G
G T
25HOMER version H77
positional biases near splice sites
nucleotides nucleotides nucleotides splice splice start/stop start/stop exons exons exons genes genes
Sn Sp F Sn Sp Sn Sp Sn Sp F Sn
H27 83 93 88 40 49 41 36 23 27 25 8 38
H77 88 96 92 66 67 51 46 47 46 46 13 65
26. HOMER, version H95

          nucleotides      splice sites   start/stop   exons            genes
          Sn   Sp   F      Sn   Sp        Sn   Sp      Sn   Sp   F      Sn   #
H77       88   96   92     66   67        51   46      47   46   46     13   65
H95       92   97   94     79   76        57   53      62   59   60     19   93
27. Summary of HOMER Results
28Higher-order Markov Models
P(G)
A C G C T A
0th order
P(GC)
A C G C T A
1st order
P(GAC)
A C G C T A
2nd order
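A minimal sketch of estimating such higher-order probabilities from counts in a training sequence (k is the order; names are illustrative):

from collections import Counter

def higher_order_probs(train_seq, k):
    """Estimate P(x | previous k bases) from (k+1)-mer and k-mer counts."""
    kmer_counts = Counter(train_seq[i:i + k]     for i in range(len(train_seq) - k))
    kp1_counts  = Counter(train_seq[i:i + k + 1] for i in range(len(train_seq) - k))
    return {kp1: n / kmer_counts[kp1[:k]] for kp1, n in kp1_counts.items()}

probs = higher_order_probs("ACGCTAACGCTTACGGTA", 2)
print(probs.get("ACG"))  # estimate of P(G | AC) from this toy training sequence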
29Higher-order Markov Models
order nucleotides nucleotides nucleotides splice sites splice sites starts/ stops starts/ stops exons exons exons genes genes
order Sn Sp F Sn Sp Sn Sp Sn Sp F Sn
H95 0 92 97 94 79 76 57 53 62 59 60 19 93
H95 1 95 98 97 87 81 64 61 72 68 70 25 127
H95 2 98 98 98 91 82 65 62 76 69 72 27 136
H95 3 98 98 98 91 82 67 63 76 69 72 28 140
H95 4 98 97 98 90 81 69 64 76 68 72 29 143
H95 5 98 97 98 90 81 66 62 74 67 70 27 137
0
1
2
3
4
5
30. Variable-Order Markov Models
31. Interpolation Results
32. Summary
- An HMM is a stochastic generative model which emits sequences
- Parsing with an HMM can be accomplished using a decoding algorithm (such as Viterbi) to find the most probable (MAP) path generating the input sequence
- Training of unambiguous HMMs can be accomplished using labeled sequence training
- Training of ambiguous HMMs can be accomplished using Viterbi training or the Baum-Welch algorithm (next lesson...)
- Posterior decoding can be used to estimate the probability that a given symbol or substring was generated by a particular state (next lesson...)