1
HMMs Hidden Markov Models
2
Definitions & Uses
  • A probabilistic model which deals with sequences of symbols.
  • Originally used in speech recognition (the symbols being phonemes).
  • Useful in biology, the symbols being, for example, DNA nucleotides.

3
Questions where HMMs are used
  • Does this sequence belong to a particular family?
  • Can we identify regions in a sequence (for
    instance alpha helices, beta sheets)?
  • and, less directly:
  • Pairwise/multiple sequence alignment
  • Searching databases for protein families
    (building profiles).

4
Markov Chains (reminder)
  • A sequence of random variables X1, X2, ..., where each state depends only on the previous state.
  • A → C → G → G → T → A (drawn vertically or horizontally!)
  • These conditional probabilities can be illustrated as follows (for DNA):

5
Markov chain probabilities
  • Each arrow has a transition probability, e.g. P_CA = P(x_i = A | x_{i-1} = C).
  • Thus the probability of a sequence x of length L is
    P(x) = P(x_1) · P(x_2 | x_1) · ... · P(x_L | x_{L-1}) = P(x_1) · Π_{i=2..L} P(x_i | x_{i-1})
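
As a quick illustration, here is a minimal sketch of this product for a short DNA sequence; the initial and transition probabilities are made-up placeholders rather than the values from the slide's figure.

```python
# Probability of a DNA sequence under a first-order Markov chain.
# The numbers below are illustrative placeholders only.
P_INIT = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}        # P(x1)
P_TRANS = {                                                   # P(x_i | x_{i-1})
    "A": {"A": 0.30, "C": 0.20, "G": 0.30, "T": 0.20},
    "C": {"A": 0.25, "C": 0.30, "G": 0.20, "T": 0.25},
    "G": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "T": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}

def sequence_probability(seq):
    """P(x) = P(x1) * product over i of P(x_i | x_{i-1})."""
    p = P_INIT[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= P_TRANS[prev][cur]
    return p

print(sequence_probability("ACGGTA"))
```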

6
Some terminology
  • We differentiate between states and symbols.
  • Symbols are what we see: our observations, the sequence itself.
  • States are what we talk of in theory; they are part of the probabilistic model (P_st is the probability of the transition between two theoretical states s and t; here s and t are each A, C, G or T).
  • We will see the importance of this difference later on.

7
CpG islands
  • In the human genome, for biological reasons, the dinucleotide CG occurs less often than would be expected from the independent probabilities of C and G.
  • On the other hand, around promoters or the start regions of many genes, we see many more CG couplets (and in fact more C and G generally).

8
Markov chains in use
  • Let's answer one of the previous questions using Markov chains. For example: given a sequence, is it a CpG island?
  • Let's assume that we have two sets of transition probabilities: those estimated from standard DNA and those from CpG-island DNA.

9
Markov chains in use - continued
CpG islands (+ model)
normal DNA (- model)
10
Markov chains in use - continued
  • Thus, we have two models, which we can compare statistically (a log-likelihood ratio test).

Note: P(x_1 | x_0) = P(x_1), the probability of beginning with x_1 (x_0 being a dummy begin state).
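
A minimal sketch of this comparison, assuming two hypothetical transition tables (in practice they would be estimated from annotated CpG-island and non-island training data). The score S(x) = Σ_i log[ P+(x_i | x_{i-1}) / P-(x_i | x_{i-1}) ] is positive when the + model fits the sequence better.

```python
import math

# Log-likelihood-ratio test between the CpG (+) model and the normal (-) model.
# The tables are placeholders that differ only in the C -> G transition.
BASES = "ACGT"
TRANS_PLUS = {a: ({"A": 0.20, "C": 0.20, "G": 0.40, "T": 0.20} if a == "C"
                  else {b: 0.25 for b in BASES}) for a in BASES}
TRANS_MINUS = {a: ({"A": 0.32, "C": 0.32, "G": 0.04, "T": 0.32} if a == "C"
                   else {b: 0.25 for b in BASES}) for a in BASES}

def log_odds(seq):
    # S(x) = sum_i [ log P+(x_i | x_{i-1}) - log P-(x_i | x_{i-1}) ]
    return sum(math.log(TRANS_PLUS[a][b] / TRANS_MINUS[a][b])
               for a, b in zip(seq, seq[1:]))

print(log_odds("CGCGCGTA"))   # positive score -> looks like a CpG island
```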
11
And now to HMMs
  • HMMs can answer a question we posed before that Markov chains cannot: how do we find CpG islands in a long, un-annotated sequence?
  • Instead of two separate Markov chain models (as before), we need one model that incorporates both. How do we do that in the previous example?

12
The solution
  • We relabel the states. In order to integrate the two models, we define the states A+, G+, C+, T+ (CpG areas) and A-, G-, C-, T- (normal DNA). Both A+ and A- emit the symbol A.

13
  • Many state sequences can emit the same sequence of symbols, for example:
    A+ G+ C+ T+ G- C- C- T-
    A- G- C+ T+ G+ C+ C+ T-
    both emit the symbols
    A  G  C  T  G  C  C  T

14
Most probable state path - Viterbi algorithm
  • We can compute the likelihood of each one of these state paths.
  • Look for the maximum-likelihood state sequence.
  • This way, given a long un-annotated sequence, we get the ML state path at the end, e.g.
    A- G- C- G- T- T- T- C- G+ C+ A+ G+ A- C- G- T- C+ G+ T+
  • The Viterbi algorithm does this with dynamic programming (an example later).

15
HMM Formalism
[Figure: HMM trellis - a chain of hidden states S1, ..., Si-1, Si, Si+1, ..., Sn linked by transition probabilities (A), each emitting an observation K1, ..., Kn with emission probabilities (B)]
  • The model is specified by (S, K, Π, P, B)
  • S = {s1, ..., sN} are the values of the hidden states
  • K = {k1, ..., kM} are the values of the observations
  • The hidden states emit/generate the symbols (observations)
  • Π = {πi} are the initial state probabilities
  • P = {Pij} are the state transition probabilities
  • B = {bik} are the emission probabilities (which in our case are 0 or 1: P(A | A+) = P(A | A-) = 1, P(A | C+) = 0)
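
As a concrete (if trivial) illustration, the sketch below lays these quantities out for the CpG model of the previous slides. Only the 0/1 emission structure comes from the slide; the initial and transition values shown are uniform placeholders that would in practice be estimated from training data.

```python
# The (S, K, Pi, P, B) formalism instantiated for the CpG-island HMM.
STATES = [base + sign for sign in "+-" for base in "ACGT"]   # hidden states S: A+..T+, A-..T-
SYMBOLS = list("ACGT")                                       # observations K

# B[state][symbol] is 1 if the state's base matches the symbol, else 0.
B = {st: {sym: 1.0 if st[0] == sym else 0.0 for sym in SYMBOLS} for st in STATES}

# Placeholder initial (Pi) and transition (P) probabilities, uniform over the 8 states.
PI = {st: 1.0 / len(STATES) for st in STATES}
P = {s: {t: 1.0 / len(STATES) for t in STATES} for s in STATES}

print(B["A+"])   # {'A': 1.0, 'C': 0.0, 'G': 0.0, 'T': 0.0}
```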

16
Markov chain vs. HMM
  • The state sequence itself follows a simple Markov chain. But -
  • The essential difference between a Markov chain and an HMM is that it is no longer possible to know the state just by looking at the symbols: the state is hidden.

17
Another example the dishonest casino
  • In a casino, they use a fair die most of the
    time, but occasionally switch to an unfair die.
    The switch between dice can be represented by an
    HMM

[Figure: a two-state HMM. FAIR state: emits 1-6 with probability 1/6 each; stays FAIR with probability 0.95, switches to UNFAIR with probability 0.05. UNFAIR state: emits 1-5 with probability 1/10 each and 6 with probability 1/2; stays UNFAIR with probability 0.9, switches to FAIR with probability 0.1]
18
Dishonest casino - continued
  • The symbols (observations) are the sequence of rolls: 3 5 6 2 1 4 6 3 6 ...
  • What is hidden? Whether the die is fair or unfair: f f f f u u u f f ... This part is a Markov chain.
  • In addition we have emission probabilities: given a state, there are 6 possible symbols, each with its own emission probability.

19
Exposing the casino
  • Once again we can estimate the most probable state path, and thus estimate when the casino was cheating (see the sketch below).
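
Below is a sketch of the Viterbi dynamic program for this casino HMM. The transition and emission values are the ones from the figure above; the initial state probabilities (0.5 each) are an assumption, since the slides do not give them.

```python
import math

# Viterbi decoding for the dishonest-casino HMM (log space to avoid underflow).
STATES = ["F", "U"]                                     # fair, unfair
START = {"F": 0.5, "U": 0.5}                            # assumed, not given on the slide
TRANS = {"F": {"F": 0.95, "U": 0.05}, "U": {"F": 0.1, "U": 0.9}}
EMIT = {"F": {r: 1 / 6 for r in range(1, 7)},
        "U": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}

def viterbi(rolls):
    """Return the most probable hidden state path for a sequence of rolls."""
    v = [{s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}]
    back = []
    for x in rolls[1:]:
        col, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda t: v[-1][t] + math.log(TRANS[t][s]))
            ptr[s] = best
            col[s] = v[-1][best] + math.log(TRANS[best][s]) + math.log(EMIT[s][x])
        v.append(col)
        back.append(ptr)
    state = max(STATES, key=lambda s: v[-1][s])         # best final state
    path = [state]
    for ptr in reversed(back):                          # backtrack
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

# The run of 6s at the end is decoded as the UNFAIR die:
print(viterbi([1, 2, 4, 5, 5, 2, 3, 6, 6, 6, 6]))
```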

20
Rate Variation Among Sites
  • Many methods assume an equal rate of evolution at all sites - unrealistic.
  • Rate4site: a method for deducing the rate at each site. It assumes independence of the rates between sites.
  • Rate variation estimation should allow some correlation between the rates of evolution at adjacent sites.

21
A Hidden Markov Model Approach to Variation Among
Sites in Rate of Evolution (Churchill and
Felsenstein 1996)
22
Motivation
  • Goal: to find the path of rates that fits the data best.
  • We will represent a general path as (c1, c2, ..., cn), where ci = 10.0, 2.0 or 0.3 (in the example above); ci is the rate at site i, and n is the length of the sequence.

23
Paths → HMM
  • Each rate path corresponds to a state path. The real path is hidden.
  • Before, we had state paths such as A+C+G+ emitting the symbols A C G.
  • Now we have c1, c2, c3 corresponding to the columns of a multiple sequence alignment:

[Figure: an alignment with columns (x1, y1), (x2, y2), (x3, y3) and hidden states s1, s2, s3, one rate per column]
24
Motivation - notes
  • How do we find the path that best fits the data? It is the path that makes the maximum contribution to the likelihood of the data.
  • Instead of going over all possible paths (a lot of computation) → dynamic programming.

25
Outline of the algorithm
  • 1. Assume we have k categories of rates: r1, r2, ..., rk. We know f(ri) for i = 1..k, and we know P(ri, rj) for i, j = 1..k (the correlation between rates at adjacent sites).

26
Outline - continued
  • 2. The likelihood of a given phylogeny T is a sum over all rate combinations (see the sketch below).
  • n is the number of sites in the sequence.
  • The combination (c1, ..., cn) denotes an assignment of a rate to each site.
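
In the notation of these slides, the likelihood can be written as follows (a hedged reconstruction from the definitions above, with Di denoting the data at site i):

    L(D | T) = Σ over all combinations (c1, ..., cn) of  P(c1, ..., cn) · Π_{i=1..n} P(Di | ci, T)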

27
Outline - continued
  • 3. Look for the combination which makes the
    largest contribution to the likelihood

28
The gory details - 1. Choosing rate categories
  • How do we choose the rates?
    - estimate the values of ri and fi by ML, using the EM algorithm
    - examine a few sets of rates and choose the one with the maximum likelihood
    - use equal-probability categories from a gamma distribution (Yang, 1995)

29
  • Computing the likelihood: via the up algorithm.

30
The most probable rates combination
  • The most probable rates combination is found with a version of the Viterbi algorithm.
  • Define the following quantity, where for k ≤ n:
  • D(k) = the data at sites k through n
  • Dk = the data at site k
31
The most probable rates combination - continued
  • This is the likelihood contribution of sites k through n, where site k has rate ck, and the rest (k+1 through n) have the combination of rates that maximizes the contribution.

32
The most probable rates combination - continued
  • We thus get the recursion sketched below.
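
A hedged reconstruction of that recursion, writing Vk(ck) for the quantity defined above (the maximal likelihood contribution of sites k through n, given that site k has rate ck):

    Vn(cn) = P(Dn | cn, T)
    Vk(ck) = P(Dk | ck, T) · max over c(k+1) of [ P(ck, c(k+1)) · V(k+1)(c(k+1)) ],  for k = n-1, ..., 1

The best overall contribution is then max over c1 of f(c1) · V1(c1).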

33
Proof (the essentials)
  • Because the rate categories are the outcome of a Markov chain,
    P(c1, ..., cn) = f(c1) · P(c1, c2) · P(c2, c3) · ... · P(cn-1, cn)
  • Due to the assumption that once the rates are set, the sites evolve independently,
    P(D | c1, ..., cn, T) = Π_{i=1..n} P(Di | ci, T)

34
The most probable rates combination - continued
  • Thus, we compute the above equation for sites n, n-1, n-2, ... down to 1, for each one of the rate categories.
  • At the end, we take the largest of the resulting quantities.
  • This is the quantity that maximizes the contribution to the likelihood.

35
Backtracking the rate combination
  • We still don't know which combination c1, ..., cn brought us here, so we backtrack.

36
THE END!