Title: Sequence Alignment via HMM
Background Readings: chapters 3.4, 3.5, and 4 in Durbin et al.
HMM model structure: 1. Duration Modeling
- Markov chains are rather limited in describing sequences of symbols with non-random structures. For instance, a Markov chain forces the distribution of the lengths of segments in which some state is repeated k times to be geometric, p^(k-1)(1-p), for some p.
- Several ways enable modeling of other distributions. One is assigning k > 1 states to represent the same real state, as sketched below. This can guarantee at least k repetitions, with any desired probability.
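A minimal sketch (the function names are ours, assuming each copy of the state has self-transition probability p) of how chaining k copies changes the duration distribution from geometric to negative binomial:

import random

def run_length_single(p):
    # One state with self-transition probability p:
    # P(duration = k) = p^(k-1) * (1-p), i.e. geometric.
    k = 1
    while random.random() < p:
        k += 1
    return k

def run_length_chain(k_copies, p):
    # The same real state represented by k_copies chained sub-states:
    # the duration is a sum of geometrics (negative binomial),
    # and is always at least k_copies.
    return sum(run_length_single(p) for _ in range(k_copies))

random.seed(0)
samples = [run_length_chain(3, 0.6) for _ in range(10000)]
print(min(samples), sum(samples) / len(samples))  # min >= 3, mean ~ 3/(1-p) = 7.5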
HMM model structure: 2. Silent states
- States which do not emit symbols (as we saw in the ABO locus).
- Can be used to model duration distributions.
- Also used to allow arbitrary jumps (may be used to model deletions).
- Need to generalize the Forward and Backward algorithms for arbitrary acyclic digraphs to account for the silent states (see next).
[Figure: silent states drawn above a chain of regular states]
E.g., the forward algorithm should be generalized as follows: an emitting state l collects probability from the previous position as usual, f_l(i+1) = e_l(x_{i+1}) * sum_k f_k(i) a_{kl}, while a silent state l collects probability within the current position, f_l(i) = sum_k f_k(i) a_{kl}, with the silent states processed in topological order (see the sketch below).
Directed cycles of silent (or other) states complicate things, and should be avoided.
[Figure: silent states above a chain of regular states emitting symbols]
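A sketch, under assumed data structures of our choosing (not from the slides), of the forward algorithm over an acyclic digraph with silent states:

def forward_with_silent(obs, emitting, silent_topo, a, e, start):
    # Assumptions: a[k][l] = transition probability k -> l; e[l][c] =
    # emission probability of symbol c in emitting state l; start =
    # initial state distribution; silent_topo = the silent states in
    # topological order (the silent subgraph must be acyclic).
    def silent_flow(f):
        # Silent states accumulate probability without consuming a
        # symbol; topological order guarantees predecessors are final.
        for s in silent_topo:
            f[s] = f.get(s, 0.0) + sum(
                p * a[k].get(s, 0.0) for k, p in list(f.items()) if k != s)
        return f

    f = silent_flow(dict(start))       # silent flow before the 1st symbol
    for c in obs:
        nf = {l: e[l].get(c, 0.0) * sum(p * a[k].get(l, 0.0) for k, p in f.items())
              for l in emitting}       # emitting states consume symbol c
        f = silent_flow(nf)            # silent states consume nothing
    return sum(f.values())             # P(obs), letting the path end anywhere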
HMM model structure: 3. High Order Markov Chains
- Markov chains in which the transition probabilities depend on the last n states: P(x_i | x_{i-1}, ..., x_1) = P(x_i | x_{i-1}, ..., x_{i-n}).
- Can be represented by a standard Markov chain with more states, as in the figure and sketch below.
[Figure: a 2nd-order chain over {A, B} represented as a standard chain with four pair-states AA, AB, BA, BB]
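A minimal sketch (names and numbers are hypothetical) of the encoding: a 2nd-order chain over an alphabet becomes a 1st-order chain over symbol pairs.

from itertools import product

def second_to_first_order(alphabet, p2):
    # p2[(u, v)][w] = P(x_i = w | x_{i-2} = u, x_{i-1} = v).
    states = list(product(alphabet, repeat=2))   # e.g. AA, AB, BA, BB
    trans = {}
    for (u, v) in states:
        # From pair-state (u, v) we can only move to (v, w); the
        # transition carries the 2nd-order probability of w after u, v.
        trans[(u, v)] = {(v, w): p2[(u, v)][w] for w in alphabet}
    return states, trans

# Hypothetical uniform probabilities, only to show the encoding:
p2 = {(u, v): {"A": 0.5, "B": 0.5} for u in "AB" for v in "AB"}
print(second_to_first_order("AB", p2)[1][("A", "B")])
# {('B', 'A'): 0.5, ('B', 'B'): 0.5}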
HMM model structure: 4. Inhomogeneous Markov Chains
- An important task in analyzing DNA sequences is recognizing the genes which code for proteins.
- A triplet of 3 nucleotides, called a codon, codes for an amino acid (see next slide).
- It is known that in the parts of DNA which code for genes, the three codon positions have different statistics.
- Thus a Markov chain model for DNA should represent not only the nucleotide (A, C, G or T) but also its position within the codon: the same nucleotide in different positions will have different transition probabilities. This seems to work better in some cases (see e.g. Durbin et al., p. 76). A sketch follows below.
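A minimal sketch (assuming reading frame 0 and a dict-of-dicts matrix layout of our choosing) of such an inhomogeneous chain, where the transition matrix cycles with the codon position:

def codon_chain_prob(seq, mats, start):
    # mats[j][a][b] = P(next = b | current = a) when the next nucleotide
    # falls at codon position j in {0, 1, 2}; start[a] = initial probability.
    p = start[seq[0]]
    for i in range(1, len(seq)):
        p *= mats[i % 3][seq[i - 1]][seq[i]]
    return p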
Genetic Code
There are 20 amino acids from which proteins are built.
Sequence Comparison using HMM
- An HMM for sequence alignment, which incorporates affine gap scores.
- Hidden states:
  - Match (M)
  - Insertion in x (X)
  - Insertion in y (Y)
- Symbols emitted:
  - Match: (a, b), with a, b in Σ
  - Insertion in x: (a, -), with a in Σ
  - Insertion in y: (-, a), with a in Σ
The Transition Probabilities
Transition probabilities (note the forbidden ones): d is the probability of a 1st gap symbol (opening a gap), e the probability of a trailing gap symbol (extending a gap). A concrete sketch follows the list below.
- Emission probabilities:
  - Match (a, b), with probability p_ab, only from the M state
  - Insertion in x (a, -), with probability q_a, only from the X state
  - Insertion in y (-, a), with probability q_a, only from the Y state
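A sketch of the transition matrix as Python dicts; the assumption that the forbidden transitions are the direct X <-> Y moves follows the standard pair HMM of Durbin et al., ch. 4 (the END state comes later):

def pair_hmm_transitions(d, e):
    # d = probability of opening a gap, e = of extending one;
    # X <-> Y transitions are the forbidden ones (probability 0).
    return {
        "M": {"M": 1 - 2 * d, "X": d,   "Y": d},
        "X": {"M": 1 - e,     "X": e,   "Y": 0.0},
        "Y": {"M": 1 - e,     "X": 0.0, "Y": e},
    }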
The Transition Probabilities
Each aligned pair is generated by the above HMM with a certain probability. Note that the hidden states can be reconstructed from the alignment. However, for each pair of sequences x (of length m) and y (of length n) there are many alignments of x and y, each corresponding to a different state path (the lengths of the paths are between max{m, n} and m + n). Our task is to score alignments using this model. The score should reflect the probability of the alignment.
Most probable alignment
- Let v^M(i,j) be the probability of the most probable alignment of x(1..i) and y(1..j) which ends with a match. Then, using a recursive argument, we get the recursion below.
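Reconstructing the recursion from Durbin et al., ch. 4, in the slide's notation (termination is ignored until it is introduced below):

v^M(i,j) = p_{x_i y_j} * max{ (1-2d) v^M(i-1,j-1), (1-e) v^X(i-1,j-1), (1-e) v^Y(i-1,j-1) }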
Most probable alignment
- Similarly, v^X(i,j) and v^Y(i,j), the probabilities of the most probable alignment of x(1..i) and y(1..j) which ends with an insertion in x or in y, satisfy the recursions below.
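Again reconstructed from Durbin et al., ch. 4, in the slide's notation:

v^X(i,j) = q_{x_i} * max{ d * v^M(i-1,j), e * v^X(i-1,j) }
v^Y(i,j) = q_{y_j} * max{ d * v^M(i,j-1), e * v^Y(i,j-1) }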
Adding termination probabilities
We may want a model which defines a probability distribution over all possible sequences.
For this, an END state is added, with transition probability t from any other state to END. This assumes an expected sequence length of 1/t.
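Indeed, with termination probability t at every step the length is geometric:

P(length = n) = (1-t)^{n-1} * t,   so   E[length] = sum_{n>=1} n (1-t)^{n-1} t = 1/t.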
Full probability of x and y
- Rather than computing the probability of the most probable alignment, we look for the total probability that x and y are related by our model.
- Let f^M(i,j) be the sum of the probabilities of all alignments of x(1..i) and y(1..j) which end with a match. Similarly, f^X(i,j) and f^Y(i,j) are the sums over those alignments which end with an insertion in x (resp. y). A forward-type algorithm computes these functions (a sketch follows the next slide).
- Initialization: f^M(0,0) = 1, f^X(0,0) = f^Y(0,0) = 0 (we start from M, but we could select another initial state).
Full probability of x and y (cont.)
The total probability of all alignments is P(x, y | model) = f^M(m,n) + f^X(m,n) + f^Y(m,n).
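A sketch of the forward-type algorithm, with termination probabilities left out for brevity (p, q, d, e as defined above; the function name is ours):

def pair_forward(x, y, p, q, d, e):
    # p[(a, b)] = match emission, q[a] = single-symbol emission,
    # d / e = gap open / extend transition probabilities.
    m, n = len(x), len(y)
    fM = [[0.0] * (n + 1) for _ in range(m + 1)]
    fX = [[0.0] * (n + 1) for _ in range(m + 1)]
    fY = [[0.0] * (n + 1) for _ in range(m + 1)]
    fM[0][0] = 1.0                       # start from M, as on the slide
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:          # alignments ending with a match
                fM[i][j] = p[(x[i-1], y[j-1])] * (
                    (1 - 2 * d) * fM[i-1][j-1]
                    + (1 - e) * (fX[i-1][j-1] + fY[i-1][j-1]))
            if i > 0:                    # ... ending with insertion in x
                fX[i][j] = q[x[i-1]] * (d * fM[i-1][j] + e * fX[i-1][j])
            if j > 0:                    # ... ending with insertion in y
                fY[i][j] = q[y[j-1]] * (d * fM[i][j-1] + e * fY[i][j-1])
    return fM[m][n] + fX[m][n] + fY[m][n]   # P(x, y | model)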
The log-odds scoring function
- We wish to know if the alignment score is above or below the score of a random alignment.
- Recall the log-odds ratio: s(a,b) = log(p_ab / (q_a q_b)).
- log(p_ab / (q_a q_b)) > 0 iff the probability that a and b are related by our model is larger than the probability that they are picked at random.
- To adapt this to the HMM model, we need to model random sequences by an HMM, with an END state.
The log-odds scoring function
The transition probabilities for the random model (X is the start state) are shown in the figure on the next slide. The emission probability for a is q_a. Thus the probability of x (of length m) and y (of length n) being random is given below.
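Reconstructing the formula from Durbin et al., ch. 4, writing t for the random model's termination probability as well:

P(x, y | R) = t^2 (1-t)^{m+n} * prod_{i=1..m} q_{x_i} * prod_{j=1..n} q_{y_j}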
Markov Chains for Random and Model
[Figure: state diagrams of the random model and the alignment model]
The log-odds scoring function
- The resulting scoring function s in the HMM model is reconstructed below.
- d and e are the penalties for a first gap symbol and a trailing gap symbol, resp., which are subtracted from the score.
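Following Durbin et al., ch. 4, eqs. (4.13)-(4.15) (where δ, ε, τ denote the model's gap-open, gap-extend and termination probabilities and η the random model's termination probability; the slides write d, e, t for the first three):

s(a,b) = log( p_ab / (q_a q_b) ) + log( (1 - 2δ - τ) / (1 - η)^2 )
d = -log( δ (1 - ε - τ) / ((1 - η)(1 - 2δ - τ)) )
e = -log( ε / (1 - η) )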
The gap penalties prepay part of the return to M, and the algorithm corrects for this extra prepayment at termination, as follows:
Log-odds alignment algorithm
Initialization: V^M(0,0) = log t - 2 log η.
Termination: V = max{ V^M(m,n), V^X(m,n) + c, V^Y(m,n) + c }, where c = log(1 - 2d - t) - log(1 - e - t).
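A sketch of the full log-odds Viterbi recursion in log space, combining the recursions above (s, d, e are the log-odds score and penalties; c and v0 = V^M(0,0) are the constants just defined; the function name is ours):

def log_odds_align_score(x, y, s, d, e, c, v0):
    NEG = float("-inf")
    m, n = len(x), len(y)
    VM = [[NEG] * (n + 1) for _ in range(m + 1)]
    VX = [[NEG] * (n + 1) for _ in range(m + 1)]
    VY = [[NEG] * (n + 1) for _ in range(m + 1)]
    VM[0][0] = v0                        # initialization, e.g. log t - 2 log eta
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:          # match: add the substitution score
                VM[i][j] = s(x[i-1], y[j-1]) + max(
                    VM[i-1][j-1], VX[i-1][j-1], VY[i-1][j-1])
            if i > 0:                    # insertion in x: open (-d) or extend (-e)
                VX[i][j] = max(VM[i-1][j] - d, VX[i-1][j] - e)
            if j > 0:                    # insertion in y: open (-d) or extend (-e)
                VY[i][j] = max(VM[i][j-1] - d, VY[i][j-1] - e)
    return max(VM[m][n], VX[m][n] + c, VY[m][n] + c)   # termination correction c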