Title: Hidden Markov Models (modified by Winfried Just)
1. Hidden Markov Models (modified by Winfried Just)
2. Outline
- CG-islands
- The Fair Bet Casino
- Hidden Markov Model
- Decoding Algorithm
- Forward-Backward Algorithm
- HMM Parameter Estimation
- Profile HMM Alignment
3. CG-Islands
- Given 4 nucleotides, the probability of occurrence of each is 1/4. Thus, the probability of occurrence of a given dinucleotide is 1/16.
- However, the frequencies of dinucleotides in DNA sequences vary widely.
- In particular, CG is typically underrepresented.
- CG often mutates to TG; thus, the probability of CG occurrence is typically less than 1/16.
4. Why CG-Islands?
- CG is the least frequent dinucleotide because the C in CG is easily methylated and then has a tendency to mutate into T.
- However, methylation is suppressed around genes in a genome, so CG appears at relatively high frequency within these CG-islands.
- Finding the CG-islands in a genome is therefore an important problem.
5. CG-Islands and the Fair Bet Casino
- The CG-islands problem can be modeled after a problem named The Fair Bet Casino.
6. The Fair Bet Casino
- The game is to flip coins, with only two possible outcomes: Heads or Tails.
- Suppose that the dealer uses both Fair and Biased coins.
- The Fair coin gives Heads and Tails with the same probability of ½.
- The Biased coin gives Heads with probability ¾ (and Tails with probability ¼).
7. The Fair Bet Casino (cont'd)
- Thus, we define the probabilities:
  - P(H|F) = P(T|F) = ½
  - P(H|B) = ¾, P(T|B) = ¼
- The crooked dealer switches between the Fair and Biased coins with probability 0.1 at each toss.
8. The Fair Bet Casino Problem
- Input: A sequence x = x1 x2 x3 ... xn of coin tosses made with two possible coins (F or B).
- Output: A sequence π = π1 π2 π3 ... πn, with each πi being either F or B, indicating whether xi is the result of tossing the Fair or the Biased coin, respectively.
9. Problem
Fair Bet Casino Problem: Any observed outcome could have been generated by any sequence of coin tosses!
We need to incorporate a way to grade different state sequences differently.
This leads to the Decoding Problem.
10. P(x|fair coin) vs. P(x|biased coin)
- Some definitions:
  - P(x|fair coin): the probability of generating the outcome x if the dealer uses only the Fair coin.
  - P(x|biased coin): the probability of generating the outcome x if the dealer uses only the Biased coin.
  - k: the number of Heads in x.
11. P(x|fair coin) vs. P(x|biased coin)
- P(x|fair coin) = 1/2^n
- P(x|biased coin) = 3^k/4^n
- P(x|fair coin) = P(x|biased coin)
- when k = n / log2 3, i.e. k ≈ 0.63n
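As a quick check of these formulas, here is a minimal Python sketch (the toss sequence is made up for illustration):

```python
# Minimal sketch: compare P(x|fair coin) and P(x|biased coin) for a toss string.
from math import log2

def prob_fair(n):
    """P(x|fair coin) = (1/2)^n, independent of the number of Heads."""
    return 0.5 ** n

def prob_biased(n, k):
    """P(x|biased coin) = (3/4)^k * (1/4)^(n-k) = 3^k / 4^n."""
    return (3 ** k) / (4 ** n)

x = "01011101001"              # hypothetical toss sequence (1 = Heads, 0 = Tails)
n, k = len(x), x.count("1")
print(prob_fair(n), prob_biased(n, k))

# The two probabilities cross when k = n / log2(3), i.e. at a Heads fraction of:
print(1 / log2(3))             # about 0.63
```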
12. Log-odds Ratio
- We define the log-odds ratio as follows:

  log2( P(x|fair coin) / P(x|biased coin) ) = Σ_{i=1}^{n} log2( p+(xi) / p−(xi) ) = n − k·log2 3

  where p+(xi) and p−(xi) are the Fair-coin and Biased-coin emission probabilities for xi.
13. Computing the Log-odds Ratio in Sliding Windows
x1 x2 x3 x4 x5 x6 x7 x8 ... xn
Consider a sliding window over the outcome sequence and find the log-odds ratio for this short window.
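A minimal sketch of such a scan, assuming the casino model above and an arbitrarily chosen window size (for CG-islands one would analogously score dinucleotide frequencies):

```python
from math import log2

def window_log_odds(x, window=8):
    """Log-odds score log2(P(w|fair)/P(w|biased)) for every window w of x.
    For the casino model this equals len(w) - (#Heads in w) * log2(3); a
    negative score suggests the Biased coin for that stretch of tosses."""
    scores = []
    for start in range(len(x) - window + 1):
        w = x[start:start + window]
        scores.append(window - w.count("1") * log2(3))
    return scores

print(window_log_odds("0100101111110100", window=8))
```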
14. Hidden Markov Model (HMM)
- Can be viewed as an abstract machine with k hidden states.
- Each state has its own probability distribution, and the machine switches between states according to this probability distribution.
- At each step, the machine makes two decisions:
  - What state should it move to next?
  - What symbol from its alphabet should it emit?
15. Why "Hidden"?
- Observers can see the emitted symbols of an HMM but cannot know which state the HMM is currently in.
- Thus, the goal is to infer the most likely sequence of states of an HMM based on a given sequence of emitted symbols.
16. HMM Parameters
- Σ: the set of all possible emission characters.
  - Ex.: Σ = {H, T} for coin tossing
  - Σ = {1, 2, 3, 4, 5, 6} for dice tossing
  - Σ = {a, c, g, t} for nucleotide sequences
- Q: the set of hidden states, each emitting symbols from Σ.
  - Ex.: Fair or Biased coin
  - CG-island or not CG-island
  - coding region or non-coding region
17. HMM Parameters (cont'd)
- A = (a_kl): a |Q| × |Q| matrix of the probabilities of changing from state k to state l.
- E = (e_k(b)): a |Q| × |Σ| matrix of the probabilities of emitting symbol b during a step in which the HMM is in state k.
18. HMM for the Fair Bet Casino
- The Fair Bet Casino can be defined in HMM terms as follows:
  - Σ = {0, 1} (0 for Tails, 1 for Heads)
  - Q = {F, B} (F for the Fair coin, B for the Biased coin)
  - aFF = aBB = 0.9
  - aFB = aBF = 0.1
  - eF(0) = ½, eF(1) = ½
  - eB(0) = ¼, eB(1) = ¾
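These parameters translate directly into code. A minimal sketch using plain dictionaries (variable names are ours; they are reused in the later sketches, and the start distribution is an assumption, not part of this slide):

```python
# The Fair Bet Casino HMM written out as plain Python dictionaries.
states = ["F", "B"]                      # Q: Fair and Biased coin
alphabet = ["0", "1"]                    # Sigma: 0 = Tails, 1 = Heads

transition = {                           # A = (a_kl)
    "F": {"F": 0.9, "B": 0.1},
    "B": {"F": 0.1, "B": 0.9},
}
emission = {                             # E = (e_k(b))
    "F": {"0": 0.50, "1": 0.50},
    "B": {"0": 0.25, "1": 0.75},
}
start = {"F": 0.5, "B": 0.5}             # assumed initial distribution (½ each coin)
```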
19. HMM for the Fair Bet Casino (cont'd)
- Visualization of the Transition Probabilities A
20. HMM for the Fair Bet Casino (cont'd)
- Visualization of the Emission Probabilities E
21. HMM for the Fair Bet Casino (cont'd)
HMM model for the Fair Bet Casino Problem
22. Hidden Paths
- A path π = π1 ... πn in the HMM is defined as a sequence of states.
- Consider the path π = FFFBBBBBFFF and the sequence x = 01011101001:

  x:             0    1    0    1    1    1    0    1    0    0    1
  π:             F    F    F    B    B    B    B    B    F    F    F
  P(xi|πi):      ½    ½    ½    ¾    ¾    ¾    ¼    ¾    ½    ½    ½
  P(πi-1 → πi):  ½   9/10 9/10 1/10 9/10 9/10 9/10 9/10 1/10 9/10 9/10
23. P(x|π) Calculation
- P(x|π): the probability that sequence x was generated and path π was followed, according to the model M.

  P(x|π) = P(π0 → π1) · Π_{i=1}^{n} P(xi|πi) · P(πi → πi+1)
         = a_{π0, π1} · Π_{i=1}^{n} e_{πi}(xi) · a_{πi, πi+1}
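A small sketch of this calculation, reusing the dictionaries from the slide-18 sketch; we assume the begin state enters F or B with probability ½ each (as in the table on slide 22) and omit the final a_{πn, end} factor, since the casino model has no explicit end state:

```python
def path_probability(x, path, transition, emission, start):
    """P(x|pi): probability that x was emitted while the HMM followed path pi.
    The final a_{pi_n, end} factor is omitted (no explicit end state here)."""
    p = start[path[0]] * emission[path[0]][x[0]]
    for i in range(1, len(x)):
        p *= transition[path[i - 1]][path[i]] * emission[path[i]][x[i]]
    return p

print(path_probability("01011101001", "FFFBBBBBFFF", transition, emission, start))
```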
24. Decoding Problem
- Goal: Find an optimal hidden path of states given the observations.
- Input: A sequence of observations x = x1 ... xn generated by an HMM M(Σ, Q, A, E).
- Output: A path π that maximizes P(x|π) (and thus P(π|x)) over all possible paths π.
25. Building a Manhattan for the Decoding Problem
- Andrew Viterbi used the Manhattan grid model to solve the Decoding Problem.
- Every choice of π = π1 ... πn corresponds to a path in the graph.
- The only valid direction in the graph is eastward.
- This graph has |Q|²(n−1) edges.
26. Edit Graph for the Decoding Problem
27. Decoding Problem vs. Alignment Problem
Valid directions in the alignment problem.
Valid directions in the decoding problem.
28. Decoding Problem as Finding a Longest Path in a DAG
- The Decoding Problem is reduced to finding a longest path in the directed acyclic graph (DAG) above.
- Note: the length of a path is defined as the product of its edge weights, not the sum.
29. Decoding Problem (cont'd)
- Every path in the graph has weight P(x|π).
- The Viterbi algorithm finds the path that maximizes P(x|π) among all possible paths.
- The Viterbi algorithm runs in O(n·|Q|²) time.
30. Decoding Problem (cont'd)
- Consider the edge from vertex (k, i) to vertex (l, i+1) in the graph. Its weight w is given by w = e_l(x_{i+1}) · a_{kl}.
- Letting s_{k,i} denote the weight of the best path for the prefix x1 ... xi that ends in state k, this gives the Viterbi recurrence s_{l,i+1} = max_{k∈Q} { s_{k,i} · a_{kl} · e_l(x_{i+1}) }.
31. Decoding Problem (cont'd)
- Initialization:
  - s_{begin,0} = 1
  - s_{k,0} = 0 for k ≠ begin.
- Final result:
  - Let π* be the optimal path. Then

    P(x|π*) = max_{k∈Q} { s_{k,n} · a_{k,end} }
32. Viterbi Algorithm
- The value of the product can become extremely small, which leads to underflow.
- To avoid underflow, use logarithms instead. The recurrence becomes

  s_{l,i+1} = log(e_l(x_{i+1})) + max_{k∈Q} { s_{k,i} + log(a_{kl}) }
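A sketch of Viterbi decoding in log space along these lines, reusing the dictionaries from the slide-18 sketch (natural logs are used, since only comparisons matter, and the begin/end states are folded into the assumed start distribution):

```python
from math import log

def viterbi(x, states, transition, emission, start):
    """Viterbi decoding in log space, following the recurrence above.
    Returns the most probable state path for the observed sequence x."""
    s = [{k: log(start[k]) + log(emission[k][x[0]]) for k in states}]   # initialization
    back = [{}]
    for i in range(1, len(x)):
        s.append({})
        back.append({})
        for l in states:
            # choose the best predecessor state k for state l at position i
            best_k = max(states, key=lambda k: s[i - 1][k] + log(transition[k][l]))
            back[i][l] = best_k
            s[i][l] = log(emission[l][x[i]]) + s[i - 1][best_k] + log(transition[best_k][l])
    # traceback from the best final state
    state = max(states, key=lambda k: s[-1][k])
    path = [state]
    for i in range(len(x) - 1, 0, -1):
        state = back[i][state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("01011101001", states, transition, emission, start))
```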
33. Forward-Backward Problem
- Given: a sequence of coin tosses generated by an HMM.
- Goal: find the probability that the dealer was using the Biased coin at a particular time.
34. Forward Algorithm
- Define f_{k,i} (the forward probability) as the probability of emitting the prefix x1 ... xi and reaching the state πi = k.
- The recurrence for the forward algorithm is

  f_{k,i} = e_k(xi) · Σ_{l∈Q} f_{l,i-1} · a_{lk}
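A sketch of this recurrence, again reusing the slide-18 dictionaries (for long sequences one would scale or work in log space to avoid underflow):

```python
def forward(x, states, transition, emission, start):
    """Forward table: f[i][k] = probability of emitting x[0..i] and ending in state k.
    Recurrence (slide 34): f_{k,i} = e_k(x_i) * sum over l of f_{l,i-1} * a_{lk}."""
    f = [{k: start[k] * emission[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        f.append({
            k: emission[k][x[i]] * sum(f[i - 1][l] * transition[l][k] for l in states)
            for k in states
        })
    return f

f = forward("01011101001", states, transition, emission, start)
print(sum(f[-1].values()))     # P(x): total probability of the observed sequence
```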
35. Backward Algorithm
- However, the forward probability is not the only factor affecting P(πi = k | x).
- The sequence of transitions and emissions that the HMM undergoes between πi and πn also affects P(πi = k | x).
36. Backward Algorithm (cont'd)
- The backward probability b_{k,i} is the probability of being in state πi = k and emitting the suffix x_{i+1} ... xn.
- The backward algorithm's recurrence:

  b_{k,i} = Σ_{l∈Q} e_l(x_{i+1}) · b_{l,i+1} · a_{kl}
37. Forward-Backward Algorithm
- The probability that the dealer used the Biased coin at moment i is given by

  P(πi = k | x) = P(x, πi = k) / P(x) = f_k(i) · b_k(i) / P(x)
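A sketch combining the backward recurrence with the forward probabilities from the previous sketch to obtain the posterior P(πi = B | x):

```python
def backward(x, states, transition, emission):
    """Backward table: b[i][k] = probability of emitting x[i+1..] given state k at i."""
    n = len(x)
    b = [{} for _ in range(n)]
    b[n - 1] = {k: 1.0 for k in states}
    for i in range(n - 2, -1, -1):
        b[i] = {
            k: sum(emission[l][x[i + 1]] * b[i + 1][l] * transition[k][l] for l in states)
            for k in states
        }
    return b

def posterior(x, states, transition, emission, start):
    """P(pi_i = k | x) = f_k(i) * b_k(i) / P(x) for every position i and state k."""
    f = forward(x, states, transition, emission, start)
    b = backward(x, states, transition, emission)
    px = sum(f[-1].values())
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))]

for i, dist in enumerate(posterior("01011101001", states, transition, emission, start)):
    print(i + 1, round(dist["B"], 3))   # probability the Biased coin was used at toss i+1
```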
38. HMM Parameter Estimation
- So far, we have assumed that the transition and emission probabilities are known.
- However, in most HMM applications the probabilities are not known, and it is very hard to estimate them.
39. HMM Parameter Estimation (cont'd)
- Let Θ be a vector combining the unknown transition and emission probabilities.
- Given training sequences x = (x_1, ..., x_m), let P(x_j|Θ) be the probability of sequence x_j under the parameter assignment Θ.
- Then our goal is to find

  max_Θ Π_{j=1}^{m} P(x_j | Θ)
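Estimation is easiest in the special case where the hidden path of each training sequence is known: the maximum-likelihood estimates are then just normalized counts. A sketch of that simple case (with pseudocounts so unseen transitions and emissions do not get probability 0); the general case with unknown paths requires iterative methods and is beyond this sketch:

```python
from collections import defaultdict

def estimate_parameters(sequences, paths, states, alphabet, pseudocount=1.0):
    """Maximum-likelihood estimates of A and E by counting, for the simple
    case where the hidden path of each training sequence is known."""
    trans = {k: defaultdict(lambda: pseudocount) for k in states}
    emit = {k: defaultdict(lambda: pseudocount) for k in states}
    for x, pi in zip(sequences, paths):
        for i, (symbol, state) in enumerate(zip(x, pi)):
            emit[state][symbol] += 1
            if i > 0:
                trans[pi[i - 1]][state] += 1
    A = {k: {l: trans[k][l] / sum(trans[k][m] for m in states) for l in states}
         for k in states}
    E = {k: {b: emit[k][b] / sum(emit[k][c] for c in alphabet) for b in alphabet}
         for k in states}
    return A, E

A, E = estimate_parameters(["01011101001"], ["FFFBBBBBFFF"], ["F", "B"], ["0", "1"])
print(A)
print(E)
```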
40. Finding Distant Members of a Protein Family
- Motivation: Distant cousins of functionally related biological sequences in a protein family may have only weak pairwise similarities, and thus fail statistical significance tests, yet they may have weak similarities with many members of the family. So, the goal is to align a sequence to all members of the family at once.
- Families of related proteins can be represented by their multiple alignment and the corresponding profile.
41. Profile Representation of Protein Families
- Aligned DNA sequences can be represented by a 4 × n profile matrix reflecting the frequencies of nucleotides at each position.
- A protein family can be represented by a 20 × n profile matrix representing the frequencies of amino acids.
42. Profiles and HMMs
- HMMs can also be used for aligning a sequence against a profile representing a protein family.
- A 20 × n profile P corresponds to n sequentially linked match states M1, ..., Mn in the profile HMM of P.
43. Profile HMM
A profile HMM
44. Insertion and Deletion States of a Profile HMM
- States Ii: insertion states
- States Di: deletion states
- Assumption:
  - e_{Ij}(a) = p(a),
  - where p(a) is the frequency of occurrence of the symbol a in all the sequences.
45. Profile HMM Alignment
- Define v_{Mj}(i) as the logarithmic likelihood score of the best path for matching x1, ..., xi to the profile HMM, ending with xi emitted by the state Mj.
- v_{Ij}(i) and v_{Dj}(i) are defined similarly.
46. Profile HMM Alignment: Dynamic Programming

  v_{Mj}(i) = log( e_{Mj}(xi) / p(xi) ) + max {
      v_{Mj-1}(i-1) + log(a_{Mj-1, Mj}),
      v_{Ij-1}(i-1) + log(a_{Ij-1, Mj}),
      v_{Dj-1}(i-1) + log(a_{Dj-1, Mj})
  }

  v_{Ij}(i) = log( e_{Ij}(xi) / p(xi) ) + max {
      v_{Mj}(i-1) + log(a_{Mj, Ij}),
      v_{Ij}(i-1) + log(a_{Ij, Ij}),
      v_{Dj}(i-1) + log(a_{Dj, Ij})
  }
47. Profile HMM Alignment: Dynamic Programming (cont'd)

  v_{Dj}(i) = max {
      v_{Mj-1}(i) + log(a_{Mj-1, Dj}),
      v_{Ij-1}(i) + log(a_{Ij-1, Dj}),
      v_{Dj-1}(i) + log(a_{Dj-1, Dj})
  }

  (Deletion states emit no symbol, so the sequence index i does not advance.)
48. Paths in the Edit Graph and the Profile HMM
- A path through an edit graph and the
corresponding path through a profile HMM
49. Speech Recognition
- Create an HMM of the words in a language:
  - Each word is a state in Q.
  - Each of the basic sounds in the language is a symbol in Σ.
- Input: use speech as the input sequence.
- Goal: find the most probable sequence of states.
50. Speech Recognition: Building the Model
- Analyze some large source of English sentences, such as a database of newspaper articles, to form the probability matrices.
- A_{0i}: the chance that word i begins a sentence.
- A_{ij}: the chance that word j follows word i.
51. Building the Model (cont'd)
- Analyze English speakers to determine which sounds are emitted with which words.
- E_k(b): the chance that sound b is spoken in word k. This allows for alternate pronunciations of words.
52. Speech Recognition: Using the Model
- Use the same dynamic programming algorithm as before.
- Weave the spoken sounds through the model the same way we wove the coin tosses through the casino model.
- The optimal path π represents the most likely sequence of words.
53. Using the Model (cont'd)
- How well does it work?
- Common words, such as "the", "a", and "of", make prediction less accurate, since so many words can normally follow them.
- We can add more states to incorporate a little context into the decision.