Title: HMMs (Hidden Markov Models)
1. HMMs (Hidden Markov Models)
2. Definitions and Uses
- A probabilistic model which deals with sequences of symbols.
- Originally used in speech recognition (the symbols being phonemes).
- Useful in biology, where the sequence of symbols is the DNA sequence.
3. Questions where HMMs are used
- Does this sequence belong to a particular family?
- Can we identify regions in a sequence (for instance alpha helices, beta sheets)?
- And, less directly:
- Pairwise/multiple sequence alignment
- Searching databases for protein families (building profiles).
4. Markov Chains (reminder)
- A sequence of random variables X1, X2, ..., where each state depends only on the previous state.
- A → C → G → G → T → A (drawn vertically or horizontally!)
- These conditional probabilities can be illustrated as follows (for DNA):
5. Markov chain probabilities
- Each arrow carries a transition probability, e.g. P_{CA} = P(x_i = A | x_{i-1} = C).
- Thus the probability of a sequence x = x_1 x_2 ... x_L is P(x) = P(x_1) \prod_{i=2}^{L} P(x_i | x_{i-1}).
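- As a small illustration (my own sketch, not from the slides), the chain probability can be computed directly; the transition values below are illustrative placeholders, not estimates from real DNA:

    import math

    init = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # P(x1), placeholder values
    trans = {                                              # P(x_i | x_{i-1}), placeholders
        "A": {"A": 0.30, "C": 0.20, "G": 0.30, "T": 0.20},
        "C": {"A": 0.25, "C": 0.30, "G": 0.20, "T": 0.25},
        "G": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
        "T": {"A": 0.25, "C": 0.20, "G": 0.30, "T": 0.25},
    }

    def chain_log_prob(seq):
        """log P(x) = log P(x1) + sum over i of log P(x_i | x_{i-1})."""
        lp = math.log(init[seq[0]])
        for prev, cur in zip(seq, seq[1:]):
            lp += math.log(trans[prev][cur])
        return lp

    print(chain_log_prob("ACGGTA"))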
6. Some terminology
- We differentiate between states and symbols.
- Symbols are what we see: our observations, the sequence we see.
- States are what we talk of in theory. They are part of the probabilistic model (P_{st} is the probability of the transition between two theoretical states, s and t; here s and t are A, C, G or T).
- We will see the importance of this difference later on.
7. CpG islands
- In the human genome, for biological reasons, the occurrence of the dinucleotide CG is lower than would be expected from the independent probabilities of C and G.
- On the other hand, around promoters or the
start regions of many genes, we see many more
CG couplets (and in fact more C and G generally).
8. Markov chains in use
- Let's answer one of the previous questions using Markov chains. For example: given a sequence, is it a CpG island?
- Let's assume that we have the two sets of transition probabilities: transition probabilities in standard DNA, and transition probabilities in CpG DNA.
9. Markov chains in use (continued)
- CpG islands (the "+" model) and normal DNA (the "-" model): one table of transition probabilities for each.
10. Markov chains in use (continued)
- Thus, we have two models which we can compare statistically (a likelihood ratio test): S(x) = \sum_{i=1}^{L} \log ( P^{+}_{x_{i-1} x_i} / P^{-}_{x_{i-1} x_i} ).
- Note: P_{x_0 x_1} = P(x_1), the probability of beginning with x_1.
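- A minimal sketch of the comparison in Python (my own illustration); the two transition tables are toy placeholders, not values estimated from annotated CpG/non-CpG training data, and the P(x_1) term is omitted for brevity:

    import math

    bases = "ACGT"
    plus = {p: {c: 0.25 for c in bases} for p in bases}    # "+" (CpG island) model, placeholders
    minus = {p: {c: 0.25 for c in bases} for p in bases}   # "-" (normal DNA) model, placeholders
    plus["C"]["G"], plus["C"]["A"] = 0.40, 0.10            # toy: C->G common inside islands
    minus["C"]["G"], minus["C"]["A"] = 0.05, 0.45          # toy: C->G rare in normal DNA

    def log_odds(seq):
        """S(x) = sum over i of log( P+(x_i | x_{i-1}) / P-(x_i | x_{i-1}) )."""
        return sum(math.log(plus[p][c] / minus[p][c]) for p, c in zip(seq, seq[1:]))

    print(log_odds("GCGCGCATCG"))   # positive => the "+" model fits better (toy numbers)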
11. And now to HMMs
- HMMs can answer a question we posed before that Markov chains cannot: how do we find CpG islands in a long, un-annotated sequence?
- Instead of two Markov chain models (as before), we need one model that incorporates both of the previous models. How do we do that in the previous example?
12. The solution
- We relabel the states. In order to integrate the two models, we define A+, G+, C+, T+ (CpG areas) and A-, G-, C-, T- (normal DNA). Both A+ and A- emit the symbol A.
13. Many state sequences emit the same sequence of symbols
- For example, the state paths A+G+C+T+G-C-C-T- and A-G-C+T+G+C+C+T- both emit the symbol sequence A G C T G C C T.
14. Most probable state path: the Viterbi algorithm
- We can compute the likelihood of each one of these state paths.
- Look for the maximum-likelihood state sequence.
- This way, given a long un-annotated sequence, we get the maximum-likelihood state path at the end, e.g. A-G-C-G-T-T-T-C-G+C+A+G+A-C-G-T-C+G+T+.
- The Viterbi algorithm does this with dynamic programming (an example later; a generic sketch follows below).
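- A generic Viterbi sketch in Python (my own illustration, not code from the lecture); it works in log space and returns the best log-likelihood together with the back-traced state path:

    import math

    def _log(p):
        # log of a probability, tolerating zero emissions/transitions
        return math.log(p) if p > 0 else float("-inf")

    def viterbi(obs, states, start_p, trans_p, emit_p):
        """obs: list of symbols; start_p[s], trans_p[r][s], emit_p[s][symbol] are probabilities."""
        V = [{s: _log(start_p[s]) + _log(emit_p[s][obs[0]]) for s in states}]
        back = [{}]
        for t in range(1, len(obs)):
            V.append({})
            back.append({})
            for s in states:
                prev, best = max(((r, V[t - 1][r] + _log(trans_p[r][s])) for r in states),
                                 key=lambda pair: pair[1])
                V[t][s] = best + _log(emit_p[s][obs[t]])
                back[t][s] = prev
        last = max(V[-1], key=V[-1].get)          # best final state
        path = [last]
        for t in range(len(obs) - 1, 0, -1):      # follow the back-pointers
            path.append(back[t][path[-1]])
        return V[-1][last], path[::-1]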
15. HMM Formalism
[Diagram: a chain of hidden states S_1, ..., S_n connected by transition probabilities A, each state emitting an observation K_1, ..., K_n with emission probabilities B.]
- An HMM is specified by S, K, π, P, B:
- S = {s_1, ..., s_N} are the values for the hidden states.
- K = {k_1, ..., k_M} are the values for the observations. The hidden states emit/generate the symbols (observations).
- π = {π_i} are the initial state probabilities.
- P = {P_ij} are the state transition probabilities.
- B = {b_ik} are the emission probabilities (which in our case are 0 or 1: P(A|A+) = P(A|A-) = 1, P(A|C+) = 0).
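- In the CpG example the emissions are deterministic: each state emits only its own letter. A tiny sketch (my own illustration) of that emission table:

    states = [base + sign for base in "ACGT" for sign in "+-"]       # A+, A-, C+, C-, ...
    symbols = list("ACGT")
    emit_p = {st: {sym: 1.0 if st[0] == sym else 0.0 for sym in symbols}
              for st in states}
    assert emit_p["A+"]["A"] == 1.0 and emit_p["C+"]["A"] == 0.0     # P(A|A+)=1, P(A|C+)=0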
16. Markov chain vs. HMM
- The state sequence itself follows a simple Markov chain. But...
- The essential difference between a Markov chain and an HMM is that it is no longer possible to know the state just by looking at the symbols: the state is hidden.
17. Another example: the dishonest casino
- In a casino, they use a fair die most of the time, but occasionally switch to an unfair die. The switch between dice can be represented by an HMM:
[Diagram: two states. FAIR emits each face 1-6 with probability 1/6; UNFAIR emits faces 1-5 with probability 1/10 each and face 6 with probability 1/2. Transitions: FAIR→FAIR 0.95, FAIR→UNFAIR 0.05, UNFAIR→UNFAIR 0.9, UNFAIR→FAIR 0.1.]
18. Dishonest casino (continued)
- The symbols (observations) are the sequence of rolls: 3 5 6 2 1 4 6 3 6
- What is hidden? Whether the die is fair or unfair: f f f f u u u f f. This state sequence is a Markov chain. In addition to it, we have:
- Emission probabilities: given a state, there are 6 possible matching symbols, each with an emission probability.
19. Exposing the casino
- Once again we can compute the most probable state path, and so estimate when the casino was cheating (a sketch follows below).
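- Plugging the numbers from the diagram into the viterbi() sketch from slide 14 (assumed to be in scope); the roll sequence and the uniform initial probabilities are illustrative assumptions, not values from the lecture:

    states = ["FAIR", "UNFAIR"]
    start_p = {"FAIR": 0.5, "UNFAIR": 0.5}                     # assumed; not given on the slide
    trans_p = {"FAIR":   {"FAIR": 0.95, "UNFAIR": 0.05},
               "UNFAIR": {"FAIR": 0.10, "UNFAIR": 0.90}}
    emit_p = {"FAIR":   {r: 1 / 6 for r in range(1, 7)},
              "UNFAIR": {**{r: 1 / 10 for r in range(1, 6)}, 6: 1 / 2}}

    rolls = [3, 5, 6, 2, 1, 4, 6, 3, 6, 6, 6, 5, 6, 6, 1, 2]   # made-up example rolls
    loglik, path = viterbi(rolls, states, start_p, trans_p, emit_p)
    print(path)   # stretches labelled "UNFAIR" are where cheating is most plausible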
20. Rate Variation Among Sites
- Many methods assume an equal rate of evolution at all sites, which is unrealistic.
- Rate4site: a method for deducing the rate at each site. It assumes independence of the rates between sites.
- Rate variation estimation should allow some correlation between the rates of evolution at adjacent sites.
21. A Hidden Markov Model Approach to Variation Among Sites in Rate of Evolution (Churchill and Felsenstein, 1996)
22. Motivation
- To find the path of rates that fits the data in the best way.
- We will represent a general path as follows: (c1, c2, ..., cn), where ci = 10.0 or 2.0 or 0.3 (in the example above). ci corresponds to the rate at site i; n is the length of the sequence.
23. Paths → HMM
- Each path corresponds to a state path. The real path is hidden.
- Before, we had the symbols A, C, G corresponding to the hidden states A+/-, C+/-, G+/-.
- Now we have c1, c2, c3 corresponding to the columns of a multiple sequence alignment.
[Diagram: a three-sequence alignment; column i of the alignment is the observation at site i, and ci is its hidden rate.]
24. Motivation (notes)
- Finding the path that best fits the data: how? The path that makes the maximum contribution to the likelihood of the data.
- Instead of going over all possible paths (a lot of computation) → dynamic programming.
25. Outline of the algorithm
- 1. Assume we have k categories of rate: r1, r2, ..., rk. We know f_{r_i}, the prior probability of rate r_i, for i = 1..k, and we know P(r_i → r_j) for i, j = 1..k (correlation values).
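- One common way to parameterize these correlation values (I believe this is the form used in the Felsenstein/Churchill HMM, but treat the exact form here as an assumption) is a single autocorrelation parameter λ:

    P(r_j \mid r_i) = \lambda \delta_{ij} + (1 - \lambda) f_{r_j}, \qquad 0 \le \lambda \le 1,

  so with probability λ the next site keeps the same rate category, and otherwise its rate is drawn afresh from the prior f.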
26. Outline (continued)
- 2. The likelihood of a given phylogeny T sums over all rate combinations.
- n is the number of sites in the sequence.
- The combination (c1, ..., cn) denotes a combination of rates, one rate for each site.
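- The formula itself did not survive the conversion of the slide; reconstructing it from slides 33-34 (so treat the exact notation as my reconstruction), the likelihood is

    L(T) = P(D \mid T) = \sum_{(c_1,\ldots,c_n)} P(c_1,\ldots,c_n) \prod_{i=1}^{n} P(D_i \mid T, c_i),
    where P(c_1,\ldots,c_n) = f_{c_1} P_{c_1 c_2} P_{c_2 c_3} \cdots P_{c_{n-1} c_n}.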
27. Outline (continued)
- 3. Look for the combination which makes the largest contribution to the likelihood.
28. The gory details, 1: Choosing rate categories
- How do we choose the rates?
- Estimate the values of ri and fi by ML, using the EM algorithm.
- Examine a few sets of rates and choose the one with the maximum likelihood.
- Use equal-probability categories from a gamma distribution (Yang, 1995).
29. Computing the likelihood
- Computing the likelihood: via the "up" algorithm.
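- The slide's own formula was lost in conversion; for reference, the per-site "up" (post-order pruning) recursion, which I assume is the algorithm meant here, computes for every internal node v and character state s

    L_v(s) = \prod_{u \in \mathrm{children}(v)} \sum_{s'} P_{s s'}(c \cdot t_u) \, L_u(s'),

  with L_v(s) = 1 at a leaf observed in state s (0 otherwise); the site likelihood is \sum_s \pi_s L_{\mathrm{root}}(s), and the site's rate c scales the branch lengths t_u.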
30. The most probable rates combination
- The most probable rates combination: a version of the Viterbi algorithm. Define, for k ≤ n:
- D(k): the data at sites k through n
- Dk: the data at site k
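- The defining equation was lost in conversion; based on the description on the next slide, the quantity being defined is presumably

    V_k(c_k) = P(D_k \mid c_k) \cdot \max_{c_{k+1},\ldots,c_n} \left[ \prod_{i=k+1}^{n} P_{c_{i-1} c_i} P(D_i \mid c_i) \right],

  i.e. the best achievable likelihood contribution of sites k through n given that site k has rate c_k.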
31. The most probable rates combination (continued)
- This is the likelihood contribution of sites k through n, where site k has rate ck and the rest (k+1 through n) have the combination of rates that maximizes the contribution.
32. The most probable rates combination (continued)
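- The recursion on this slide was lost in conversion; a reconstruction consistent with the definition above and with the proof on slide 33 is

    V_n(c_n) = P(D_n \mid c_n),
    V_k(c_k) = P(D_k \mid c_k) \cdot \max_{c_{k+1}} \left[ P_{c_k c_{k+1}} V_{k+1}(c_{k+1}) \right] for k = n-1, ..., 1.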
33. Proof (the essentials)
- Because the rate categories are the outcome of a Markov chain, P(c1, ..., cn) = f_{c_1} P_{c_1 c_2} P_{c_2 c_3} ... P_{c_{n-1} c_n}.
- And, due to the assumption that once the rates are set the sites evolve independently, P(D | c1, ..., cn) = \prod_{i=1}^{n} P(D_i | c_i).
34. The most probable rates combination (continued)
- Thus, we compute the above equation from sites n, n-1, n-2 down to 1, for each one of the rate categories.
- At the end, we take the largest of the resulting quantities.
- This is the quantity that maximizes the contribution to the likelihood.
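- In symbols (same reconstruction as above), the largest of these quantities is

    \max_{c_1} f_{c_1} V_1(c_1),

  the maximum contribution of any single rate combination to the likelihood.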
35. Backtracking the rate combination
- We still don't know which combination c1, ..., cn brought us here, so we backtrack.
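- Concretely (same reconstruction as above): keep the maximizing argument at each step and read the combination off forwards,

    c_1^* = \arg\max_{c_1} f_{c_1} V_1(c_1), \qquad c_{k+1}^* = \arg\max_{c_{k+1}} P_{c_k^* c_{k+1}} V_{k+1}(c_{k+1}) for k = 1, ..., n-1.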
36. THE END!