Title: HMMs (Hidden Markov Models)
1. HMMs (Hidden Markov Models)
2. Definitions and Uses
- A probabilistic model which deals with sequences of symbols.
- Originally used in speech recognition (the symbols being phonemes).
- Useful in biology, where the sequence of symbols is the DNA sequence.
3. Questions where HMMs are used
- Does this sequence belong to a particular family?
- Can we identify regions in a sequence (for instance alpha helices, beta sheets)?
- And, less directly:
- Pairwise/multiple sequence alignment
- Searching databases for protein families (building profiles).
4. Markov Chains (reminder)
- A sequence of random variables X1, X2, ..., where each state depends only on the previous state.
- A → C → G → G → T → A (drawn vertically or horizontally!)
- These conditional probabilities can be illustrated as follows (for DNA):
5. Markov chain probabilities
- Each arrow carries a transition probability, e.g. P_{CA} = P(x_i = A | x_{i-1} = C).
- Thus the probability of a sequence x = x_1 x_2 ... x_L is P(x) = P(x_1) \prod_{i=2}^{L} P(x_i | x_{i-1}).
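- As a small illustration (my own sketch, not from the slides), the chain probability can be computed directly; the transition values below are illustrative placeholders, not estimates from real DNA:

    import math

    init = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # P(x1), placeholder values
    trans = {                                              # P(x_i | x_{i-1}), placeholders
        "A": {"A": 0.30, "C": 0.20, "G": 0.30, "T": 0.20},
        "C": {"A": 0.25, "C": 0.30, "G": 0.20, "T": 0.25},
        "G": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
        "T": {"A": 0.25, "C": 0.20, "G": 0.30, "T": 0.25},
    }

    def chain_log_prob(seq):
        """log P(x) = log P(x1) + sum over i of log P(x_i | x_{i-1})."""
        lp = math.log(init[seq[0]])
        for prev, cur in zip(seq, seq[1:]):
            lp += math.log(trans[prev][cur])
        return lp

    print(chain_log_prob("ACGGTA"))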
6. Some terminology
- We differentiate between states and symbols.
- Symbols are what we see: our observations, the sequence we see.
- States are what we talk of in theory. They are part of the probabilistic model (P_{st} is the probability of the transition between two theoretical states, s and t; here s and t are A, C, G or T).
- We will see the importance of this difference later on.
7. CpG islands
- In the human genome, for biological reasons, the occurrence of the dinucleotide CG is lower than would be expected from the independent probabilities of C and G.
- On the other hand, around promoters or the
start regions of many genes, we see many more
CG couplets (and in fact more C and G generally).
8. Markov chains in use
- Let's answer one of the previous questions using Markov chains. For example: given a sequence, is it a CpG island?
- Let's assume that we have the two sets of transition probabilities: transition probabilities in standard DNA, and transition probabilities in CpG DNA.
9. Markov chains in use (continued)
- CpG islands (the "+" model) and normal DNA (the "-" model): one table of transition probabilities for each.
10. Markov chains in use (continued)
- Thus, we have two models which we can compare statistically (a likelihood ratio test): S(x) = \sum_{i=1}^{L} \log ( P^{+}_{x_{i-1} x_i} / P^{-}_{x_{i-1} x_i} ).
- Note: P_{x_0 x_1} = P(x_1), the probability of beginning with x_1.
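- A minimal sketch of the comparison in Python (my own illustration); the two transition tables are toy placeholders, not values estimated from annotated CpG/non-CpG training data, and the P(x_1) term is omitted for brevity:

    import math

    bases = "ACGT"
    plus = {p: {c: 0.25 for c in bases} for p in bases}    # "+" (CpG island) model, placeholders
    minus = {p: {c: 0.25 for c in bases} for p in bases}   # "-" (normal DNA) model, placeholders
    plus["C"]["G"], plus["C"]["A"] = 0.40, 0.10            # toy: C->G common inside islands
    minus["C"]["G"], minus["C"]["A"] = 0.05, 0.45          # toy: C->G rare in normal DNA

    def log_odds(seq):
        """S(x) = sum over i of log( P+(x_i | x_{i-1}) / P-(x_i | x_{i-1}) )."""
        return sum(math.log(plus[p][c] / minus[p][c]) for p, c in zip(seq, seq[1:]))

    print(log_odds("GCGCGCATCG"))   # positive => the "+" model fits better (toy numbers)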
11. And now to HMMs
- HMMs can answer a question we posed before that Markov chains cannot: how do we find CpG islands in a long, un-annotated sequence?
- Instead of two Markov chain models (as before), we need one model that incorporates both of the previous models. How do we do that in the previous example?
12. The solution
- We relabel the states. In order to integrate the two models, we define A+, G+, C+, T+ (CpG areas) and A-, G-, C-, T- (normal DNA). Both A+ and A- emit the symbol A.
13. Many state sequences emit the same sequence of symbols
- For example, the state paths A+G+C+T+G-C-C-T- and A-G-C+T+G+C+C+T- both emit the symbol sequence A G C T G C C T.
14. Most probable state path: the Viterbi algorithm
- We can compute the likelihood of each one of these state paths.
- Look for the maximum-likelihood state sequence.
- This way, given a long un-annotated sequence, we get the maximum-likelihood state path at the end, e.g. A-G-C-G-T-T-T-C-G+C+A+G+A-C-G-T-C+G+T+.
- The Viterbi algorithm does this with dynamic programming (an example later; a generic sketch follows below).
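- A generic Viterbi sketch in Python (my own illustration, not code from the lecture); it works in log space and returns the best log-likelihood together with the back-traced state path:

    import math

    def _log(p):
        # log of a probability, tolerating zero emissions/transitions
        return math.log(p) if p > 0 else float("-inf")

    def viterbi(obs, states, start_p, trans_p, emit_p):
        """obs: list of symbols; start_p[s], trans_p[r][s], emit_p[s][symbol] are probabilities."""
        V = [{s: _log(start_p[s]) + _log(emit_p[s][obs[0]]) for s in states}]
        back = [{}]
        for t in range(1, len(obs)):
            V.append({})
            back.append({})
            for s in states:
                prev, best = max(((r, V[t - 1][r] + _log(trans_p[r][s])) for r in states),
                                 key=lambda pair: pair[1])
                V[t][s] = best + _log(emit_p[s][obs[t]])
                back[t][s] = prev
        last = max(V[-1], key=V[-1].get)          # best final state
        path = [last]
        for t in range(len(obs) - 1, 0, -1):      # follow the back-pointers
            path.append(back[t][path[-1]])
        return V[-1][last], path[::-1]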
15. HMM Formalism
[Diagram: a chain of hidden states S_1, ..., S_n connected by transition probabilities A, each state emitting an observation K_1, ..., K_n with emission probabilities B.]
- An HMM is specified by S, K, π, P, B:
- S = {s_1, ..., s_N} are the values for the hidden states.
- K = {k_1, ..., k_M} are the values for the observations. The hidden states emit/generate the symbols (observations).
- π = {π_i} are the initial state probabilities.
- P = {P_ij} are the state transition probabilities.
- B = {b_ik} are the emission probabilities (which in our case are 0 or 1: P(A|A+) = P(A|A-) = 1, P(A|C+) = 0).
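- In the CpG example the emissions are deterministic: each state emits only its own letter. A tiny sketch (my own illustration) of that emission table:

    states = [base + sign for base in "ACGT" for sign in "+-"]       # A+, A-, C+, C-, ...
    symbols = list("ACGT")
    emit_p = {st: {sym: 1.0 if st[0] == sym else 0.0 for sym in symbols}
              for st in states}
    assert emit_p["A+"]["A"] == 1.0 and emit_p["C+"]["A"] == 0.0     # P(A|A+)=1, P(A|C+)=0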
16. Markov chain vs. HMM
- The state sequence itself follows a simple Markov chain. But...
- The essential difference between a Markov chain and an HMM is that it is no longer possible to know the state just by looking at the symbols: the state is hidden.
17. Another example: the dishonest casino
- In a casino, they use a fair die most of the time, but occasionally switch to an unfair die. The switch between dice can be represented by an HMM:
[Diagram: two states. FAIR emits each face 1-6 with probability 1/6; UNFAIR emits faces 1-5 with probability 1/10 each and face 6 with probability 1/2. Transitions: FAIR→FAIR 0.95, FAIR→UNFAIR 0.05, UNFAIR→UNFAIR 0.9, UNFAIR→FAIR 0.1.]
18. Dishonest casino (continued)
- The symbols (observations) are the sequence of rolls: 3 5 6 2 1 4 6 3 6
- What is hidden? Whether the die is fair or unfair: f f f f u u u f f. This state sequence is a Markov chain. In addition to it, we have:
- Emission probabilities: given a state, there are 6 possible matching symbols, each with an emission probability.
19. Exposing the casino
- Once again we can compute the most probable state path, and so estimate when the casino was cheating (a sketch follows below).
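- Plugging the numbers from the diagram into the viterbi() sketch from slide 14 (assumed to be in scope); the roll sequence and the uniform initial probabilities are illustrative assumptions, not values from the lecture:

    states = ["FAIR", "UNFAIR"]
    start_p = {"FAIR": 0.5, "UNFAIR": 0.5}                     # assumed; not given on the slide
    trans_p = {"FAIR":   {"FAIR": 0.95, "UNFAIR": 0.05},
               "UNFAIR": {"FAIR": 0.10, "UNFAIR": 0.90}}
    emit_p = {"FAIR":   {r: 1 / 6 for r in range(1, 7)},
              "UNFAIR": {**{r: 1 / 10 for r in range(1, 6)}, 6: 1 / 2}}

    rolls = [3, 5, 6, 2, 1, 4, 6, 3, 6, 6, 6, 5, 6, 6, 1, 2]   # made-up example rolls
    loglik, path = viterbi(rolls, states, start_p, trans_p, emit_p)
    print(path)   # stretches labelled "UNFAIR" are where cheating is most plausible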
20. Rate Variation Among Sites
- Many methods assume an equal rate of evolution at all sites, which is unrealistic.
- Rate4site: a method for deducing the rate at each site. It assumes independence of the rates between sites.
- Rate variation estimation should allow some correlation between the rates of evolution at adjacent sites.
21. A Hidden Markov Model Approach to Variation Among Sites in Rate of Evolution (Churchill and Felsenstein, 1996)
22. Motivation
- To find the path of rates that fits the data in the best way.
- We will represent a general path as follows: (c1, c2, ..., cn), where ci = 10.0 or 2.0 or 0.3 (in the example above). ci corresponds to the rate at site i; n is the length of the sequence.
23. Paths → HMM
- Each path corresponds to a state path. The real path is hidden.
- Before, we had the symbols A, C, G corresponding to the hidden states A+/-, C+/-, G+/-.
- Now we have c1, c2, c3 corresponding to the columns of a multiple sequence alignment.
[Diagram: a three-sequence alignment; column i of the alignment is the observation at site i, and ci is its hidden rate.]
24. Motivation (notes)
- Finding the path that best fits the data: how? The path that makes the maximum contribution to the likelihood of the data.
- Instead of going over all possible paths (a lot of computation) → dynamic programming.
25. Outline of the algorithm
- 1. Assume we have k categories of rate: r1, r2, ..., rk. We know f_{r_i}, the prior probability of rate r_i, for i = 1..k, and we know P(r_i → r_j) for i, j = 1..k (correlation values).
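- One common way to parameterize these correlation values (I believe this is the form used in the Felsenstein/Churchill HMM, but treat the exact form here as an assumption) is a single autocorrelation parameter λ:

    P(r_j \mid r_i) = \lambda \delta_{ij} + (1 - \lambda) f_{r_j}, \qquad 0 \le \lambda \le 1,

  so with probability λ the next site keeps the same rate category, and otherwise its rate is drawn afresh from the prior f.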
26. Outline (continued)
- 2. The likelihood of a given phylogeny T sums over all rate combinations.
- n is the number of sites in the sequence.
- The combination (c1, ..., cn) denotes a combination of rates, one rate for each site.
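- The formula itself did not survive the conversion of the slide; reconstructing it from slides 33-34 (so treat the exact notation as my reconstruction), the likelihood is

    L(T) = P(D \mid T) = \sum_{(c_1,\ldots,c_n)} P(c_1,\ldots,c_n) \prod_{i=1}^{n} P(D_i \mid T, c_i),
    where P(c_1,\ldots,c_n) = f_{c_1} P_{c_1 c_2} P_{c_2 c_3} \cdots P_{c_{n-1} c_n}.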
27. Outline (continued)
- 3. Look for the combination which makes the largest contribution to the likelihood.
28. The gory details, 1: Choosing rate categories
- How do we choose the rates?
- Estimate the values of ri and fi by ML, using the EM algorithm.
- Examine a few sets of rates and choose the one with the maximum likelihood.
- Use equal-probability categories from a gamma distribution (Yang, 1995).
29. Computing the likelihood
- Computing the likelihood: via the "up" algorithm.
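- The slide's own formula was lost in conversion; for reference, the per-site "up" (post-order pruning) recursion, which I assume is the algorithm meant here, computes for every internal node v and character state s

    L_v(s) = \prod_{u \in \mathrm{children}(v)} \sum_{s'} P_{s s'}(c \cdot t_u) \, L_u(s'),

  with L_v(s) = 1 at a leaf observed in state s (0 otherwise); the site likelihood is \sum_s \pi_s L_{\mathrm{root}}(s), and the site's rate c scales the branch lengths t_u.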
30. The most probable rates combination
- The most probable rates combination: a version of the Viterbi algorithm. Define, for k ≤ n:
- D(k): the data at sites k through n
- Dk: the data at site k
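- The defining equation was lost in conversion; based on the description on the next slide, the quantity being defined is presumably

    V_k(c_k) = P(D_k \mid c_k) \cdot \max_{c_{k+1},\ldots,c_n} \left[ \prod_{i=k+1}^{n} P_{c_{i-1} c_i} P(D_i \mid c_i) \right],

  i.e. the best achievable likelihood contribution of sites k through n given that site k has rate c_k.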
31. The most probable rates combination (continued)
- This is the likelihood contribution of sites k through n, where site k has rate ck and the rest (k+1 through n) have the combination of rates that maximizes the contribution.
32. The most probable rates combination (continued)
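- The recursion on this slide was lost in conversion; a reconstruction consistent with the definition above and with the proof on slide 33 is

    V_n(c_n) = P(D_n \mid c_n),
    V_k(c_k) = P(D_k \mid c_k) \cdot \max_{c_{k+1}} \left[ P_{c_k c_{k+1}} V_{k+1}(c_{k+1}) \right] for k = n-1, ..., 1.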
33. Proof (the essentials)
- Because the rate categories are the outcome of a Markov chain, P(c1, ..., cn) = f_{c_1} P_{c_1 c_2} P_{c_2 c_3} ... P_{c_{n-1} c_n}.
- And, due to the assumption that once the rates are set the sites evolve independently, P(D | c1, ..., cn) = \prod_{i=1}^{n} P(D_i | c_i).
34. The most probable rates combination (continued)
- Thus, we compute the above equation from sites n, n-1, n-2 down to 1, for each one of the rate categories.
- At the end, we take the largest of the resulting quantities.
- This is the quantity that maximizes the contribution to the likelihood.
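- In symbols (same reconstruction as above), the largest of these quantities is

    \max_{c_1} f_{c_1} V_1(c_1),

  the maximum contribution of any single rate combination to the likelihood.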
35. Backtracking the rate combination
- We still don't know which combination c1, ..., cn brought us here, so we backtrack.
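- Concretely (same reconstruction as above): keep the maximizing argument at each step and read the combination off forwards,

    c_1^* = \arg\max_{c_1} f_{c_1} V_1(c_1), \qquad c_{k+1}^* = \arg\max_{c_{k+1}} P_{c_k^* c_{k+1}} V_{k+1}(c_{k+1}) for k = 1, ..., n-1.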
36. THE END!