Title: Sequence Alignment via HMM
Background Readings: chapters 3.4, 3.5, and 4 in Durbin et al.
HMM model structure: 1. Duration Modeling
- Markov chains are rather limited in describing sequences of symbols with non-random structures. For instance, a Markov chain forces the distribution of the lengths of segments in which some state is repeated k times to be geometric, p^(k-1)(1-p), for some p.
- Several ways enable modeling of other distributions. One is assigning k > 1 states to represent the same real state, as sketched below. This can guarantee at least k repetitions, with any desired probability.
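A minimal sketch (the function names are ours, assuming each copy of the state has self-transition probability p) of how chaining k copies changes the duration distribution from geometric to negative binomial:

import random

def run_length_single(p):
    # One state with self-transition probability p:
    # P(duration = k) = p^(k-1) * (1-p), i.e. geometric.
    k = 1
    while random.random() < p:
        k += 1
    return k

def run_length_chain(k_copies, p):
    # The same real state represented by k_copies chained sub-states:
    # the duration is a sum of geometrics (negative binomial),
    # and is always at least k_copies.
    return sum(run_length_single(p) for _ in range(k_copies))

random.seed(0)
samples = [run_length_chain(3, 0.6) for _ in range(10000)]
print(min(samples), sum(samples) / len(samples))  # min >= 3, mean ~ 3/(1-p) = 7.5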
HMM model structure: 2. Silent states
- States which do not emit symbols (as we saw in the ABO locus).
- Can be used to model duration distributions.
- Also used to allow arbitrary jumps (may be used to model deletions).
- Need to generalize the Forward and Backward algorithms for arbitrary acyclic digraphs to account for the silent states (see next).
[Figure: silent states drawn above a chain of regular states]
E.g., the forward algorithm should be generalized as follows: an emitting state l collects probability from the previous position as usual, f_l(i+1) = e_l(x_{i+1}) * sum_k f_k(i) a_{kl}, while a silent state l collects probability within the current position, f_l(i) = sum_k f_k(i) a_{kl}, with the silent states processed in topological order (see the sketch below).
Directed cycles of silent (or other) states complicate things, and should be avoided.
[Figure: silent states above a chain of regular states emitting symbols]
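A sketch, under assumed data structures of our choosing (not from the slides), of the forward algorithm over an acyclic digraph with silent states:

def forward_with_silent(obs, emitting, silent_topo, a, e, start):
    # Assumptions: a[k][l] = transition probability k -> l; e[l][c] =
    # emission probability of symbol c in emitting state l; start =
    # initial state distribution; silent_topo = the silent states in
    # topological order (the silent subgraph must be acyclic).
    def silent_flow(f):
        # Silent states accumulate probability without consuming a
        # symbol; topological order guarantees predecessors are final.
        for s in silent_topo:
            f[s] = f.get(s, 0.0) + sum(
                p * a[k].get(s, 0.0) for k, p in list(f.items()) if k != s)
        return f

    f = silent_flow(dict(start))       # silent flow before the 1st symbol
    for c in obs:
        nf = {l: e[l].get(c, 0.0) * sum(p * a[k].get(l, 0.0) for k, p in f.items())
              for l in emitting}       # emitting states consume symbol c
        f = silent_flow(nf)            # silent states consume nothing
    return sum(f.values())             # P(obs), letting the path end anywhere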
HMM model structure: 3. High Order Markov Chains
- Markov chains in which the transition probabilities depend on the last n states: P(x_i | x_{i-1}, ..., x_1) = P(x_i | x_{i-1}, ..., x_{i-n}).
- Can be represented by a standard Markov chain with more states, as in the figure and sketch below.
[Figure: a 2nd-order chain over {A, B} represented as a standard chain with four pair-states AA, AB, BA, BB]
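A minimal sketch (names and numbers are hypothetical) of the encoding: a 2nd-order chain over an alphabet becomes a 1st-order chain over symbol pairs.

from itertools import product

def second_to_first_order(alphabet, p2):
    # p2[(u, v)][w] = P(x_i = w | x_{i-2} = u, x_{i-1} = v).
    states = list(product(alphabet, repeat=2))   # e.g. AA, AB, BA, BB
    trans = {}
    for (u, v) in states:
        # From pair-state (u, v) we can only move to (v, w); the
        # transition carries the 2nd-order probability of w after u, v.
        trans[(u, v)] = {(v, w): p2[(u, v)][w] for w in alphabet}
    return states, trans

# Hypothetical uniform probabilities, only to show the encoding:
p2 = {(u, v): {"A": 0.5, "B": 0.5} for u in "AB" for v in "AB"}
print(second_to_first_order("AB", p2)[1][("A", "B")])
# {('B', 'A'): 0.5, ('B', 'B'): 0.5}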
HMM model structure: 4. Inhomogeneous Markov Chains
- An important task in analyzing DNA sequences is recognizing the genes which code for proteins.
- A triplet of 3 nucleotides, called a codon, codes for an amino acid (see next slide).
- It is known that in the parts of DNA which code for genes, the three codon positions have different statistics.
- Thus a Markov chain model for DNA should represent not only the nucleotide (A, C, G or T) but also its position within the codon: the same nucleotide in different positions will have different transition probabilities. This seems to work better in some cases (see e.g. Durbin et al., p. 76). A sketch follows below.
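A minimal sketch (assuming reading frame 0 and a dict-of-dicts matrix layout of our choosing) of such an inhomogeneous chain, where the transition matrix cycles with the codon position:

def codon_chain_prob(seq, mats, start):
    # mats[j][a][b] = P(next = b | current = a) when the next nucleotide
    # falls at codon position j in {0, 1, 2}; start[a] = initial probability.
    p = start[seq[0]]
    for i in range(1, len(seq)):
        p *= mats[i % 3][seq[i - 1]][seq[i]]
    return p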
Genetic Code
There are 20 amino acids from which proteins are built.
Sequence Comparison using HMM
- An HMM for sequence alignment, which incorporates affine gap scores.
- Hidden states:
  - Match (M)
  - Insertion in x (X)
  - Insertion in y (Y)
- Symbols emitted:
  - Match: (a, b), with a, b in Σ
  - Insertion in x: (a, -), with a in Σ
  - Insertion in y: (-, a), with a in Σ
The Transition Probabilities
Transition probabilities (note the forbidden ones): d is the probability of a 1st gap symbol (opening a gap), e the probability of a trailing gap symbol (extending a gap). A concrete sketch follows the list below.
- Emission probabilities:
  - Match (a, b), with probability p_ab, only from the M state
  - Insertion in x (a, -), with probability q_a, only from the X state
  - Insertion in y (-, a), with probability q_a, only from the Y state
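A sketch of the transition matrix as Python dicts; the assumption that the forbidden transitions are the direct X <-> Y moves follows the standard pair HMM of Durbin et al., ch. 4 (the END state comes later):

def pair_hmm_transitions(d, e):
    # d = probability of opening a gap, e = of extending one;
    # X <-> Y transitions are the forbidden ones (probability 0).
    return {
        "M": {"M": 1 - 2 * d, "X": d,   "Y": d},
        "X": {"M": 1 - e,     "X": e,   "Y": 0.0},
        "Y": {"M": 1 - e,     "X": 0.0, "Y": e},
    }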
The Transition Probabilities
Each aligned pair is generated by the above HMM with a certain probability. Note that the hidden states can be reconstructed from the alignment. However, for each pair of sequences x (of length m) and y (of length n) there are many alignments of x and y, each corresponding to a different state path (the lengths of the paths are between max{m, n} and m + n). Our task is to score alignments using this model. The score should reflect the probability of the alignment.
Most probable alignment
- Let v^M(i,j) be the probability of the most probable alignment of x(1..i) and y(1..j) which ends with a match. Then, using a recursive argument, we get the recursion below.
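Reconstructing the recursion from Durbin et al., ch. 4, in the slide's notation (termination is ignored until it is introduced below):

v^M(i,j) = p_{x_i y_j} * max{ (1-2d) v^M(i-1,j-1), (1-e) v^X(i-1,j-1), (1-e) v^Y(i-1,j-1) }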
Most probable alignment
- Similarly, v^X(i,j) and v^Y(i,j), the probabilities of the most probable alignment of x(1..i) and y(1..j) which ends with an insertion in x or in y, satisfy the recursions below.
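Again reconstructed from Durbin et al., ch. 4, in the slide's notation:

v^X(i,j) = q_{x_i} * max{ d * v^M(i-1,j), e * v^X(i-1,j) }
v^Y(i,j) = q_{y_j} * max{ d * v^M(i,j-1), e * v^Y(i,j-1) }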
Adding termination probabilities
We may want a model which defines a probability distribution over all possible sequences.
For this, an END state is added, with transition probability t from any other state to END. This assumes an expected sequence length of 1/t.
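Indeed, with termination probability t at every step the length is geometric:

P(length = n) = (1-t)^{n-1} * t,   so   E[length] = sum_{n>=1} n (1-t)^{n-1} t = 1/t.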
Full probability of x and y
- Rather than computing the probability of the most probable alignment, we look for the total probability that x and y are related by our model.
- Let f^M(i,j) be the sum of the probabilities of all alignments of x(1..i) and y(1..j) which end with a match. Similarly, f^X(i,j) and f^Y(i,j) are the sums over those alignments which end with an insertion in x (resp. y). A forward-type algorithm computes these functions (a sketch follows the next slide).
- Initialization: f^M(0,0) = 1, f^X(0,0) = f^Y(0,0) = 0 (we start from M, but we could select another initial state).
Full probability of x and y (cont.)
The total probability of all alignments is P(x, y | model) = f^M(m,n) + f^X(m,n) + f^Y(m,n).
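A sketch of the forward-type algorithm, with termination probabilities left out for brevity (p, q, d, e as defined above; the function name is ours):

def pair_forward(x, y, p, q, d, e):
    # p[(a, b)] = match emission, q[a] = single-symbol emission,
    # d / e = gap open / extend transition probabilities.
    m, n = len(x), len(y)
    fM = [[0.0] * (n + 1) for _ in range(m + 1)]
    fX = [[0.0] * (n + 1) for _ in range(m + 1)]
    fY = [[0.0] * (n + 1) for _ in range(m + 1)]
    fM[0][0] = 1.0                       # start from M, as on the slide
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:          # alignments ending with a match
                fM[i][j] = p[(x[i-1], y[j-1])] * (
                    (1 - 2 * d) * fM[i-1][j-1]
                    + (1 - e) * (fX[i-1][j-1] + fY[i-1][j-1]))
            if i > 0:                    # ... ending with insertion in x
                fX[i][j] = q[x[i-1]] * (d * fM[i-1][j] + e * fX[i-1][j])
            if j > 0:                    # ... ending with insertion in y
                fY[i][j] = q[y[j-1]] * (d * fM[i][j-1] + e * fY[i][j-1])
    return fM[m][n] + fX[m][n] + fY[m][n]   # P(x, y | model)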
The log-odds scoring function
- We wish to know if the alignment score is above or below the score of a random alignment.
- Recall the log-odds ratio: s(a,b) = log(p_ab / (q_a q_b)).
- log(p_ab / (q_a q_b)) > 0 iff the probability that a and b are related by our model is larger than the probability that they are picked at random.
- To adapt this to the HMM model, we need to model random sequences by an HMM, with an END state.
The log-odds scoring function
The transition probabilities for the random model (X is the start state) are shown in the figure on the next slide. The emission probability for a is q_a. Thus the probability of x (of length m) and y (of length n) being random is given below.
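Reconstructing the formula from Durbin et al., ch. 4, writing t for the random model's termination probability as well:

P(x, y | R) = t^2 (1-t)^{m+n} * prod_{i=1..m} q_{x_i} * prod_{j=1..n} q_{y_j}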
Markov Chains for Random and Model
[Figure: state diagrams of the random model and the alignment model]
The log-odds scoring function
- The resulting scoring function s in the HMM model is reconstructed below.
- d and e are the penalties for a first gap symbol and a trailing gap symbol, resp., which are subtracted from the score.
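Following Durbin et al., ch. 4, eqs. (4.13)-(4.15) (where δ, ε, τ denote the model's gap-open, gap-extend and termination probabilities and η the random model's termination probability; the slides write d, e, t for the first three):

s(a,b) = log( p_ab / (q_a q_b) ) + log( (1 - 2δ - τ) / (1 - η)^2 )
d = -log( δ (1 - ε - τ) / ((1 - η)(1 - 2δ - τ)) )
e = -log( ε / (1 - η) )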
The gap penalties prepay part of the return to M, and the algorithm corrects for this extra prepayment at termination, as follows:
Log-odds alignment algorithm
Initialization: V^M(0,0) = log t - 2 log η.
Termination: V = max{ V^M(m,n), V^X(m,n) + c, V^Y(m,n) + c }, where c = log(1 - 2d - t) - log(1 - e - t).
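A sketch of the full log-odds Viterbi recursion in log space, combining the recursions above (s, d, e are the log-odds score and penalties; c and v0 = V^M(0,0) are the constants just defined; the function name is ours):

def log_odds_align_score(x, y, s, d, e, c, v0):
    NEG = float("-inf")
    m, n = len(x), len(y)
    VM = [[NEG] * (n + 1) for _ in range(m + 1)]
    VX = [[NEG] * (n + 1) for _ in range(m + 1)]
    VY = [[NEG] * (n + 1) for _ in range(m + 1)]
    VM[0][0] = v0                        # initialization, e.g. log t - 2 log eta
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:          # match: add the substitution score
                VM[i][j] = s(x[i-1], y[j-1]) + max(
                    VM[i-1][j-1], VX[i-1][j-1], VY[i-1][j-1])
            if i > 0:                    # insertion in x: open (-d) or extend (-e)
                VX[i][j] = max(VM[i-1][j] - d, VX[i-1][j] - e)
            if j > 0:                    # insertion in y: open (-d) or extend (-e)
                VY[i][j] = max(VM[i][j-1] - d, VY[i][j-1] - e)
    return max(VM[m][n], VX[m][n] + c, VY[m][n] + c)   # termination correction c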