Title: CSE182-L10
1CSE182-L10
2Probability of being in specific states
- What is the probability that we were in state $k$ at step $i$?
- $\Pr[\text{all paths that passed through state } k \text{ at step } i \text{ and emitted } x] \,/\, \Pr[\text{all paths that emitted } x]$
3The Forward Algorithm
- Recall $v(i,j)$: the probability of the most likely path the automaton chose in emitting $x_1 \ldots x_i$ and ending up in state $j$.
- Define $F(i,j)$: the probability that the automaton started from state 1, emitted $x_1 \ldots x_i$, and ended up in state $j$.
- What is the difference?
4Most Likely path versus Probability of Arrival
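- There are multiple paths from state 1 to state $j$ along which the automaton can output $x_1 \ldots x_i$.
- In computing the Viterbi path, we choose the most likely path:
- $v(i,j) = \max_{\pi} \Pr[x_1 \ldots x_i, \pi]$
- The probability of emitting $x_1 \ldots x_i$ and ending up in state $j$ is given by
- $F(i,j) = \sum_{\pi} \Pr[x_1 \ldots x_i, \pi]$
- In both cases $\pi$ ranges over the paths that start in state 1 and are in state $j$ at step $i$.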
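The difference between the maximum and the sum can be checked by brute force on a toy model. The sketch below enumerates every path explicitly; the two-state HMM (`START`, `A`, `E`) is hypothetical and not from the lecture, and `START[j]` stands in for the transition out of state 1. It assumes a non-empty sequence.

```python
from itertools import product

# Hypothetical two-state HMM, used only for illustration.
START = [0.5, 0.5]                       # transition from the start state into state j
A = [[0.9, 0.1], [0.2, 0.8]]             # A[l][j]: transition probability l -> j
E = [{"H": 0.5, "T": 0.5}, {"H": 0.9, "T": 0.1}]  # E[j][c]: emission probability


def joint(x, path):
    """Joint probability of emitting x along one specific state path."""
    p = START[path[0]] * E[path[0]][x[0]]
    for i in range(1, len(x)):
        p *= A[path[i - 1]][path[i]] * E[path[i]][x[i]]
    return p


def v_and_f(x, j):
    """Return (v, F) for prefix x and end state j by enumerating all paths."""
    prefixes = product(range(len(A)), repeat=len(x) - 1)
    probs = [joint(x, prefix + (j,)) for prefix in prefixes]
    return max(probs), sum(probs)   # Viterbi keeps the best path, Forward adds them all


v, f = v_and_f("HH", 1)
print(v, f)
```

Enumeration takes time exponential in the sequence length, which is why the recurrences are used in practice.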
5The Forward Algorithm
- Recall that
- $v(i,j) = \max_{l \in Q} v(i-1,l) \cdot A_{l,j} \cdot e_j(x_i)$
- Instead
- $F(i,j) = \sum_{l \in Q} F(i-1,l) \cdot A_{l,j} \cdot e_j(x_i)$
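A minimal sketch of the two recurrences side by side, keeping only the current column of each table. The toy HMM (`START`, `A`, `E`) is hypothetical, and `START[j]` is assumed to play the role of the transition out of state 1. The only difference between the two updates is `sum` versus `max`.

```python
# Hypothetical two-state HMM, used only for illustration.
START = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]                       # A[l][j]
E = [{"H": 0.5, "T": 0.5}, {"H": 0.9, "T": 0.1}]   # E[j][c]


def forward_and_viterbi(x, start, A, E):
    """Return the last columns F(n, .) and v(n, .) for a non-empty sequence x."""
    states = range(len(A))
    F = [start[j] * E[j][x[0]] for j in states]
    v = list(F)  # with a single symbol there is only one path per state
    for sym in x[1:]:
        # Forward: add up the contributions of every predecessor state l.
        F = [sum(F[l] * A[l][j] for l in states) * E[j][sym] for j in states]
        # Viterbi: keep only the best predecessor state l.
        v = [max(v[l] * A[l][j] for l in states) * E[j][sym] for j in states]
    return F, v


F, v = forward_and_viterbi("HH", START, A, E)
print(F, v)
```

For long sequences the products underflow, so a practical version would work with logarithms or rescale each column.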
6The Backward Algorithm
- Define $B(i,j)$: the probability that the automaton, starting from state $j$ at step $i$, emitted $x_{i+1} \ldots x_n$ and ended up in the final state.
7Forward Backward Scoring
- $F(i,j) = \sum_{l \in Q} F(i-1,l) \cdot A_{l,j} \cdot e_j(x_i)$
- $B(i,j) = \sum_{l \in Q} A_{j,l} \cdot e_l(x_{i+1}) \cdot B(i+1,l)$
- $\Pr[x, \pi_i = k] = F(i,k) \cdot B(i,k)$
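The forward-backward product can be sketched as follows. The toy HMM is hypothetical, and two simplifications are assumed: `START[j]` replaces the transition out of state 1, and every state may move to the final state for free, so the backward table starts at 1 in its last row. Dividing $F(i,k) \cdot B(i,k)$ by $\Pr[x]$ gives the probability of being in state $k$ at step $i$.

```python
# Hypothetical two-state HMM, used only for illustration.
START = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]                       # A[l][j]
E = [{"H": 0.5, "T": 0.5}, {"H": 0.9, "T": 0.1}]   # E[j][c]


def forward_table(x, start, A, E):
    n, m = len(x), len(A)
    F = [[0.0] * m for _ in range(n)]
    for j in range(m):
        F[0][j] = start[j] * E[j][x[0]]
    for i in range(1, n):
        for j in range(m):
            F[i][j] = sum(F[i - 1][l] * A[l][j] for l in range(m)) * E[j][x[i]]
    return F


def backward_table(x, A, E):
    n, m = len(x), len(A)
    B = [[1.0] * m for _ in range(n)]  # last row: assumed free move to the final state
    for i in range(n - 2, -1, -1):
        for j in range(m):
            B[i][j] = sum(A[j][l] * E[l][x[i + 1]] * B[i + 1][l] for l in range(m))
    return B


def posterior(x, start, A, E):
    """post[i][k] = Pr[state k at step i | x] = F(i,k) * B(i,k) / Pr[x]."""
    F = forward_table(x, start, A, E)
    B = backward_table(x, A, E)
    px = sum(F[-1])  # Pr[x]: total probability of all paths that emit x
    return [[F[i][k] * B[i][k] / px for k in range(len(A))] for i in range(len(x))]


for row in posterior("HHT", START, A, E):
    print(row)
```

Each row of the result sums to 1, since the automaton must be in some state at every step.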
8Application of HMMs
- How do we modify this to handle indels?
9Applications of the HMM paradigm
- Modifying Profile HMMs to handle indels
- States $I_i$: insertion states
- States $D_i$: deletion states
1 2 3 4 5 6 7 8
0.9 0.4 0.3 0.6 0.1 0.0 0.2 1.0 0.0
0.2 0.7 0.0 0.3 0.0 0.0 0.0 0.1
0.2 0.0 0.0 0.3 1.0 0.3 0.0 0.0 0.2
0.0 0.4 0.3 0.0 0.5 0.0
A C G T
10Profile HMMs
- An assignment of states implies insertion, match, or deletion. Example: ACACTGTA
[Figure: the profile emission table from the previous slide (positions 1 to 8; rows A, C, G, T), repeated for the example.]
11Viterbi Algorithm revisited
- Define $v^M_j(i)$ as the log-likelihood score of the best path for matching $x_1 \ldots x_i$ to the profile HMM, ending with $x_i$ emitted by the state $M_j$.
- $v^I_j(i)$ and $v^D_j(i)$ are defined similarly.
12Viterbi Equations for Profile HMMs
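$$v^M_j(i) = \log e_{M_j}(x_i) + \max \begin{cases} v^M_{j-1}(i-1) + \log A_{M_{j-1},M_j} \\ v^I_{j-1}(i-1) + \log A_{I_{j-1},M_j} \\ v^D_{j-1}(i-1) + \log A_{D_{j-1},M_j} \end{cases}$$

$$v^I_j(i) = \log e_{I_j}(x_i) + \max \begin{cases} v^M_j(i-1) + \log A_{M_j,I_j} \\ v^I_j(i-1) + \log A_{I_j,I_j} \\ v^D_j(i-1) + \log A_{D_j,I_j} \end{cases}$$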
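The recurrences for match and insert states can be sketched in code as below. Several things here are assumptions rather than part of the slides: transition probabilities depend only on the kind of state (M, I, D) and not on the column, all insert states share one emission distribution, the silent begin state is treated as $M_0$, the transition into the end state is ignored, and the delete-state recurrence (not given on the slide) is the analogous one with no emission term. The function returns only the best score, without the traceback.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """log(p), with log(0) mapped to -infinity instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def profile_viterbi_score(x, match_emit, insert_emit, trans):
    """Best log score of matching x to a profile with L = len(match_emit) columns.

    match_emit[j-1][c]: emission probability of c in match state M_j
    insert_emit[c]:     emission probability of c in any insert state (assumption)
    trans[(s, t)]:      transition probability between state kinds s, t in "MID",
                        identical for every column (assumption)
    """
    L, n = len(match_emit), len(x)
    la = {k: safe_log(p) for k, p in trans.items()}
    vM = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    vI = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    vD = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    vM[0][0] = 0.0  # the silent begin state plays the role of M_0

    for j in range(L + 1):
        for i in range(n + 1):
            if j >= 1 and i >= 1:  # M_j emits x_i, coming from column j-1
                vM[j][i] = safe_log(match_emit[j - 1].get(x[i - 1], 0.0)) + max(
                    vM[j - 1][i - 1] + la["M", "M"],
                    vI[j - 1][i - 1] + la["I", "M"],
                    vD[j - 1][i - 1] + la["D", "M"],
                )
            if i >= 1:  # I_j emits x_i, staying in column j
                vI[j][i] = safe_log(insert_emit.get(x[i - 1], 0.0)) + max(
                    vM[j][i - 1] + la["M", "I"],
                    vI[j][i - 1] + la["I", "I"],
                    vD[j][i - 1] + la["D", "I"],
                )
            if j >= 1:  # D_j emits nothing, coming from column j-1
                vD[j][i] = max(
                    vM[j - 1][i] + la["M", "D"],
                    vI[j - 1][i] + la["I", "D"],
                    vD[j - 1][i] + la["D", "D"],
                )
    return max(vM[L][n], vI[L][n], vD[L][n])


# Hypothetical two-column profile and transition table.
MATCH = [{"A": 1.0}, {"C": 1.0}]
INSERT = {c: 0.25 for c in "ACGT"}
TRANS = {
    ("M", "M"): 0.8, ("M", "I"): 0.1, ("M", "D"): 0.1,
    ("I", "M"): 0.5, ("I", "I"): 0.4, ("I", "D"): 0.1,
    ("D", "M"): 0.5, ("D", "I"): 0.1, ("D", "D"): 0.4,
}
print(profile_viterbi_score("AC", MATCH, INSERT, TRANS))
```

Filling column `j` before column `j + 1`, and positions in increasing `i` within a column, guarantees that every value on the right-hand side of a recurrence is already available.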
13Compositional Signals
- CpG islands. In genomic sequence, the CG di-nucleotide is rarely seen.
- CG helps methylation of C, and subsequent mutation to T.
- In regions around a gene, the methylation is suppressed, and therefore CG is more common.
- CpG islands: islands of CG on the genome.
- How can you detect CpG islands?
14An HMM for Genomic regions
- Node A emits A with probability 1, and every other base with probability 0.
- The start and end nodes do not emit any symbol.
- All outgoing edges from a node are equiprobable, except for the ones coming out of C.
[Figure: HMM with start and end nodes and one node for each of A, C, G, T; edge labels shown: 0.1, 0.4, .25, .25.]
15An HMM for CpG islands
- Node A emits A with probability 1, and every other base with probability 0.
- The start and end nodes do not emit any symbol.
- All outgoing edges from a node are equiprobable, except for the ones coming out of C.
[Figure: HMM with start and end nodes and one node for each of A, C, G, T; edge labels shown: 0.25, 0.25, 0.25.]
16HMM for detecting CpG Islands
[Figure: two copies of the four-node HMM, labeled A and B, each with its own start and end nodes; edge labels shown: 0.1, 0.4.]
- In the best parse of a genomic sequence, each base is assigned a state from the sets A and B.
- Any substring with multiple states coming from B can be described as a CpG island.
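A minimal sketch of such a parse, under loudly stated assumptions. The values 0.1, 0.4, and 0.25 appear in the figures, but which edge each one labels is assumed here: in set A (ordinary genomic sequence) C goes to G with probability 0.1 and to T with 0.4, and in set B (CpG island) every edge has probability 0.25. The probability `P_SWITCH` of moving between the two sets is hypothetical. Because each node emits its own base with probability 1, the base at a position already fixes the node within a set, so Viterbi only has to choose between A and B at each position. Input is assumed to contain only the letters A, C, G, T.

```python
import math

# Probabilities of the edges leaving C in each set (assumed assignment).
C_OUT = {
    "A": {"A": 0.25, "C": 0.25, "G": 0.10, "T": 0.40},  # ordinary genomic sequence
    "B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # CpG island
}
P_SWITCH = 0.05  # hypothetical probability of moving between sets A and B


def step_prob(model, prev_base, base):
    """Edge probability prev_base -> base inside one set; only C is special."""
    return C_OUT[model][base] if prev_base == "C" else 0.25


def label_regions(seq):
    """Viterbi parse: return one label, 'A' or 'B', for every base of seq."""
    if not seq:
        return ""
    models = "AB"
    score = {m: math.log(1 / 8) for m in models}  # uniform start over the 8 nodes
    back = []
    for prev_base, base in zip(seq, seq[1:]):
        new_score, choice = {}, {}
        for m in models:
            cands = {
                k: score[k] + math.log(1 - P_SWITCH if k == m else P_SWITCH)
                for k in models
            }
            best = max(cands, key=cands.get)
            choice[m] = best  # best label for the previous base
            new_score[m] = cands[best] + math.log(step_prob(m, prev_base, base))
        back.append(choice)
        score = new_score
    m = max(score, key=score.get)
    labels = [m]
    for choice in reversed(back):  # follow the back-pointers from the last base
        m = choice[m]
        labels.append(m)
    return "".join(reversed(labels))


print(label_regions("CT" * 10 + "CG" * 10 + "CT" * 10))
```

With these assumed parameters, the CG-rich middle stretch is assigned to B and the flanks to A. Positions where neither set is favored, such as the C that opens the island, may fall on either side of the boundary.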
17HMM Summary
- HMMs are a natural technique for modeling many biological domains.
- They can capture position-dependent as well as compositional properties.
- HMMs have been very useful in an important Bioinformatics application: gene finding.