Dynamic Programming for Sequence alignment


  • Neha Jain
  • Lecturer
  • School of Biotechnology
  • Devi Ahilya University, Indore

Sequence alignment
  • Sequence alignment is the procedure of comparing
    two (pairwise alignment) or more (multiple
    alignment) sequences by searching for a series of
    individual characters or patterns that are in the
    same order in the sequences.
  • There are two types of alignment: local and
    global.
  • In global alignment, an attempt is made to align
    the entire sequence. If two sequences have
    approximately the same length and are quite
    similar, they are suitable for global alignment.
  • Local alignment concentrates on finding
    stretches of sequence with a high level of
    matches.

Interpretation of sequence alignment
  • Sequence alignment is useful for discovering
    structural, functional and evolutionary
    information.
  • Sequences that are very much alike may have
    similar secondary and 3D structure, similar
    function and likely a common ancestral sequence.
    It is extremely unlikely that such sequences
    acquired their similarity by chance.
  • Large-scale genome studies have revealed
    horizontal transfer of genes and other sequences
    between species, which may cause similarity
    between some sequences in very distant species.

Methods of sequence alignment
  • Dot matrix analysis: one sequence (A) is written
    across the top of the page and the other (B) down
    the left side. Starting from the first character
    in B, one moves across the page along the first
    row, placing a dot in any column where the
    character in A is the same. The process is
    continued row by row until all possible
    comparisons between the two sequences have been
    made. Any region of similarity is revealed by a
    diagonal row of dots.
  • The dynamic programming (DP) algorithm: the
    method compares every pair of characters in the
    two sequences and generates an alignment that is
    optimal for the chosen scoring system.
  • Word or k-tuple methods: BLAST is the best-known
    example.
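
The dot matrix procedure described above can be sketched in a few lines of Python. This is only an illustration: the function name `dot_matrix` and the choice of `*` for a dot are assumptions, and no window or stringency filtering is applied.

```python
def dot_matrix(seq_a, seq_b):
    """Build a dot plot with seq_a across the top and seq_b down the side.

    Each returned string is one row: '*' where the two characters are
    identical, '.' everywhere else.
    """
    return [
        "".join("*" if a == b else "." for a in seq_a)
        for b in seq_b
    ]


rows = dot_matrix("TACT", "AATC")
print("  TACT")
for b, row in zip("AATC", rows):
    print(b, row)
```

Runs of `*` along a diagonal mark stretches where the two sequences agree in the same order.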

Pairwise Sequence Alignment
  • Aim: given two sequences and a scoring system,
    find the best alignment
  • Points to remember:
  • 1) All possible alignments of the pair should be
    considered
  • 2) Take the best score found
  • 3) There may be more than one best alignment

Finding the best alignment is hard!!
  • How to get the optimal alignment?
  • The number of possible alignments is large.
  • If both sequences have the same length, there is
    only one possible complete alignment with no
    gaps.
  • It becomes more complicated when gaps are allowed.
  • Enumerating all alignments is not practical.
  • Solution: the dynamic programming algorithm
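
To get a feeling for how quickly the number of alignments grows once gaps are allowed, the sketch below counts them under one simple assumption: every alignment is a path of diagonal, horizontal and vertical steps through the comparison grid, and paths that differ only in the order of adjacent gaps are counted separately. The function name `count_alignments` is illustrative.

```python
def count_alignments(n, m):
    """Count gapped global alignments of sequences of lengths n and m.

    counts[i][j] is the number of ways to align the first i letters of
    one sequence with the first j letters of the other.
    """
    counts = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            counts[i][j] = (
                counts[i - 1][j - 1]  # last column pairs two letters
                + counts[i - 1][j]    # last column is a gap in sequence 2
                + counts[i][j - 1]    # last column is a gap in sequence 1
            )
    return counts[n][m]


for length in (4, 10, 20):
    print(length, count_alignments(length, length))
```

Even two sequences of length 4 already have 321 alignments under this counting, which is why exhaustive enumeration is avoided.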

Dynamic Programming
  • General optimization method
  • Proposed by Richard Bellman in the 1950s, while
    he was at the RAND Corporation. The word
    "dynamic" was chosen by Bellman to capture the
    time-varying aspect of the problems, and because
    it sounded impressive. The word "programming"
    referred to the use of the method to find an
    optimal program.
  • Extensively used in sequence alignment and other
    computational problems
  • Applied to biological sequences by Needleman and
    Wunsch (1970)

Dynamic Programming
  • The original problem is broken into smaller
    subproblems, which are then solved
  • The pieces of the larger problem have a
    sequential dependency
  • The 4th piece can be solved using the solution of
    the 3rd piece, the 3rd piece can be solved using
    the solution of the 2nd piece, and so on

Dynamic Programming
  • First solve all the subproblems
  • Store each intermediate solution in a table along
    with a score
  • Uses an m x n matrix of scores where m and n
    are the lengths of sequences being aligned.
  • Can be used for
  • Local Alignment (Smith-Waterman Algorithm)
  • Global Alignment (Needleman-Wunsch Algorithm)

Formal description of dynamic programming
  • The moves that are possible to reach a certain
    position (i,j) start either from the previous
    row and column at position (i-1, j-1) or from any
    position in the same row or column
  • A diagonal move carries no gap penalty; a move
    from any other position in column j or row i
    carries a gap penalty that depends on the size of
    the gap
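
Written as a formula, the moves described above might look as follows. This is a sketch under stated assumptions: $s(a_i, b_j)$ is the match/mismatch score for the characters at position $(i,j)$, and $w_x \le 0$ is an assumed symbol for the score of a gap of length $x$.

$$
S_{i,j} = \max
\begin{cases}
S_{i-1,j-1} + s(a_i, b_j) & \text{(diagonal move, no gap)} \\
\max_{x \ge 1} \left( S_{i-x,j} + w_x \right) & \text{(move along row } j \text{, gap of length } x\text{)} \\
\max_{y \ge 1} \left( S_{i,j-y} + w_y \right) & \text{(move along column } i \text{, gap of length } y\text{)}
\end{cases}
$$

When every gap position has the same score $w$, so that $w_x = x \cdot w$, only the immediate neighbors $S_{i-1,j}$ and $S_{i,j-1}$ need to be examined.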

Dynamic Programming
  • Sequence alignment has an optimal-substructure
    property
  • As a result, DP makes it feasible to consider all
    possible alignments
  • DP algorithms solve optimization problems by
    dividing the problem into overlapping
    subproblems
  • Each subproblem is then solved only once, and the
    answer is stored in a table, thus avoiding the
    work of recomputing the solution.
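
The "solve once, store in a table" idea can be illustrated with a memoized recursion. This is a sketch, not the tabular method used in the worked example later in these slides: the function name `best_score` is hypothetical, the scores (match 3, mismatch -1, gap -2) are borrowed from that worked example, and Python's `functools.lru_cache` plays the role of the table.

```python
from functools import lru_cache


def best_score(a, b, match=3, mismatch=-1, gap=-2):
    """Return (optimal global score, number of subproblems actually solved)."""
    solved = 0

    @lru_cache(maxsize=None)
    def score(i, j):
        # Best score for aligning the first i letters of a with the
        # first j letters of b; the body runs only on a cache miss.
        nonlocal solved
        solved += 1
        if i == 0:
            return j * gap
        if j == 0:
            return i * gap
        pair = match if a[i - 1] == b[j - 1] else mismatch
        return max(
            score(i - 1, j - 1) + pair,
            score(i - 1, j) + gap,
            score(i, j - 1) + gap,
        )

    return score(len(a), len(b)), solved


print(best_score("TACT", "AATC"))
```

For two sequences of length 4 only 25 subproblems are solved, one per pair of prefix lengths, even though the recursion would otherwise revisit them many times.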

Dynamic Programming
  • With sequence alignment, the subproblems can be
    thought of as the alignment of the prefixes of
    the two sequences to a certain point.
  • DP matrix is computed.
  • The optimal alignment score for any particular
    point in the matrix is built upon the optimal
    alignment that has been computed to that point.

Dynamic Programming
  • Advantage: the method is guaranteed to give a
    global optimum for the chosen parameters (the
    scoring matrix and gap penalty)
  • Disadvantage: many alignments may give the same
    optimal score, and none of them is guaranteed to
    be the biologically correct alignment

Dynamic Programming
  • Comparing the α- and β-chains of chicken
    hemoglobin, Fitch and Smith found 17 optimal
    alignments, only one of which was biologically
    correct (1317 alignments were within 5% of the
    optimal score)
  • More bad news: the time required to align two
    sequences of lengths n and m is proportional to
    n × m.
  • This makes DP impractical for searching a large
    sequence database for a match to a probe sequence

Dynamic Programming
  • Steps Involved
  • Initialization
  • Matrix Fill (scoring)
  • Traceback (alignment)

Gap Penalties..????
  • Gaps are due to Insertion or deletion mutations
    in the genes.
  • Penalties are given for the gaps.
  • Through empirical studies of globular proteins,
    a set of penalty values has been developed that
    appears to suit most alignment purposes.
  • They are normally implemented as default values
    in most alignment programs.

Gap Penalties..????
  • Caution:
  • Penalties too low: gaps become numerous, and even
    unrelated pairs will be aligned.
  • Penalties too high: it becomes difficult to align
    even related pairs.
  • Another factor to consider is the cost difference
    between opening a gap and extending an existing
    gap. It is known that it is easier to extend a
    gap that has already been started. Thus, gap
    opening should have a much higher penalty than
    gap extension.
  • This is based on the rationale that if insertions
    and deletions ever occur, several adjacent
    residues are likely to have been inserted or
    deleted together.
  • Affine gap penalties: the gap opening penalty is
    higher than the gap extension penalty.
  • Constant penalty: the gap opening and gap
    extension penalties are the same.
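
The two penalty schemes can be compared with a small sketch. The numeric values (opening 10, extension 0.5) are illustrative assumptions, not recommended defaults, and the convention used here (the first gap position costs the opening penalty, each further position costs the extension penalty) is only one of several in use.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Penalty for one gap of `length` positions under an affine scheme."""
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


def constant_gap_penalty(length, per_position=2.0):
    """Opening and extension cost the same, so the penalty grows linearly."""
    return affine_gap_penalty(length, per_position, per_position)


# One gap of four positions versus four separate single-position gaps.
print(affine_gap_penalty(4), 4 * affine_gap_penalty(1))
print(constant_gap_penalty(4), 4 * constant_gap_penalty(1))
```

Under the affine scheme a single long gap is far cheaper than several short ones, which reflects the rationale that adjacent residues tend to be inserted or deleted together; under the constant scheme the two cases cost the same.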

Global Alignment Needleman-Wunsch Algorithm
  • In global sequence alignment, an attempt to align
    the entirety of two different sequences is made,
    up to and including the ends of the sequence.
  • Needleman and Wunsch (1970) were among the first
    to describe a dynamic programming algorithm for
    global sequence alignment.

  • Two sequences: TACT, AATC
  • Scoring system:
  • Match: 3
  • Mismatch: -1
  • Gap: -2

  • Initialize entry (0,0) to 0
  • Fill the matrix from top left to bottom right
  • The score in each entry (i,j) is calculated from
    the values of its three neighboring entries
  • The global alignment score is the bottom right
    cell
  • There may be more than one optimal alignment

Construct a matrix with one sequence (TACT) along
the top and the other sequence (AATC) down the left:

           0    1    2    3    4
           -    T    A    C    T
   0  -
   1  A
   2  A
   3  T
   4  C

  • Entry (i,j)
  • i for column, j for row
  • alignment of the first i letters of one sequence
  • with the first j letters of the other

Initialization: entry (0,0) = 0
Fill the matrix from top left to bottom right.
Horizontal line: gap in the left sequence
  • entry (1,0) = entry (0,0) + gap score
    = 0 + (-2) = -2      (T over -)
  • entry (2,0) = entry (1,0) + gap score
    = -2 + (-2) = -4     (TA over --)
  • entry (3,0) = entry (2,0) + gap score
    = -4 + (-2) = -6     (TAC over ---)
  • entry (4,0) = entry (3,0) + gap score
    = -6 + (-2) = -8     (TACT over ----)
Vertical line: gap in the top sequence
  • entries (0,1) to (0,4) are -2, -4, -6, -8
    (---- over AATC)

           0    1    2    3    4
           -    T    A    C    T
   0  -    0   -2   -4   -6   -8
   1  A   -2
   2  A   -4
   3  T   -6
   4  C   -8

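Global Alignment Needleman-Wunsch Algorithm
For each position, $S_{i,j}$ is defined to be the
maximum score at position (i,j), i.e.

$$
S_{i,j} = \max
\begin{cases}
S_{i-1,j-1} + s(a_i, b_j) & \text{(match/mismatch in the diagonal)} \\
S_{i,j-1} + w & \text{(gap in sequence 1)} \\
S_{i-1,j} + w & \text{(gap in sequence 2)}
\end{cases}
$$

where $s(a_i, b_j)$ is the match or mismatch score
for characters $a_i$ and $b_j$, and $w$ is the gap
score.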
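
A minimal Python sketch of this recurrence, applied to the example sequences and scores from these slides (match 3, mismatch -1, gap -2). The function name `nw_matrix` is illustrative, and the matrix is stored as `S[row][column]`, so the slide entry (i,j) corresponds to `S[j][i]`.

```python
def nw_matrix(top, left, match=3, mismatch=-1, gap=-2):
    """Fill the Needleman-Wunsch score matrix for two sequences.

    Columns follow `top`, rows follow `left`; S[j][i] holds the best
    score for the first i letters of `top` and first j letters of `left`.
    """
    cols, rows = len(top) + 1, len(left) + 1
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, cols):          # first row: gaps in the left sequence
        S[0][i] = S[0][i - 1] + gap
    for j in range(1, rows):          # first column: gaps in the top sequence
        S[j][0] = S[j - 1][0] + gap
    for j in range(1, rows):
        for i in range(1, cols):
            pair = match if top[i - 1] == left[j - 1] else mismatch
            S[j][i] = max(
                S[j - 1][i - 1] + pair,  # diagonal: match or mismatch
                S[j - 1][i] + gap,       # from above: gap in the top sequence
                S[j][i - 1] + gap,       # from the left: gap in the left sequence
            )
    return S


S = nw_matrix("TACT", "AATC")
for row in S:
    print(row)
print("global alignment score:", S[-1][-1])
```

The bottom right cell of the filled matrix is the global alignment score.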

Entry (1,1), T against A: three options
  • First option: entry (0,0) + mismatch score
    = 0 + (-1) = -1      (T over A)
  • Second option: entry (1,0) + gap score
    = -2 + (-2) = -4     (T- over -A)
  • Third option: entry (0,1) + gap score
    = -2 + (-2) = -4     (-T over A-)
  • Choosing the option with the maximal score:
    entry (1,1) = -1

           0    1    2    3    4
           -    T    A    C    T
   0  -    0   -2   -4   -6   -8
   1  A   -2   -1
   2  A   -4
   3  T   -6
   4  C   -8

Entry (2,1), A against A: three options
  • First option: entry (1,0) + match score
    = -2 + 3 = 1         (TA over -A)
  • Second option: entry (2,0) + gap score
    = -4 + (-2) = -6     (TA- over --A)
  • Third option: entry (1,1) + gap score
    = -1 + (-2) = -3     (TA over A-)
  • Choosing the option with the maximal score:
    entry (2,1) = 1      (TA over -A)

           0    1    2    3    4
           -    T    A    C    T
   0  -    0   -2   -4   -6   -8
   1  A   -2   -1    1
   2  A   -4
   3  T   -6
   4  C   -8

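
The three options weighed for each entry can be reproduced with a few lines of Python. This is a sketch using the scores from these slides; the helper name `cell_options` and its argument names are illustrative.

```python
def cell_options(diagonal, above, left, is_match,
                 match=3, mismatch=-1, gap=-2):
    """Return the three candidate scores for one matrix entry.

    `diagonal`, `above` and `left` are the already computed scores of
    the three neighboring entries.
    """
    return {
        "diagonal": diagonal + (match if is_match else mismatch),
        "gap in top sequence": above + gap,    # vertical move
        "gap in left sequence": left + gap,    # horizontal move
    }


# Entry (1,1): T against A, neighbors (0,0) = 0, (1,0) = -2, (0,1) = -2.
entry_1_1 = cell_options(diagonal=0, above=-2, left=-2, is_match=False)
# Entry (2,1): A against A, neighbors (1,0) = -2, (2,0) = -4, (1,1) = -1.
entry_2_1 = cell_options(diagonal=-2, above=-4, left=-1, is_match=True)

print(entry_1_1, "->", max(entry_1_1.values()))
print(entry_2_1, "->", max(entry_2_1.values()))
```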
Completing row 1 gives entry (3,1) = -1 and
entry (4,1) = -3 (TACT over -A--).
Entry (1,2) = -3 is reached with the same score by
two alignments: T- over AA, and -T over AA.
Continuing in the same way gives the completed
matrix:

           0    1    2    3    4
           -    T    A    C    T
   0  -    0   -2   -4   -6   -8
   1  A   -2   -1    1   -1   -3
   2  A   -4   -3    2    0   -2
   3  T   -6   -1    0    1    3
   4  C   -8   -3   -2    3    1

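Traceback: starting from the bottom right entry
(4,4) = 1, follow the moves that produced each score
back to entry (0,0).
Three possible alignments, each with score 1:

  • Path (0,0) → (1,0) → (2,1) → (3,2) → (4,3) → (4,4)
    with scores 0, -2, 1, 0, 3, 1

    T A C T -
    - A A T C

  • Path (0,0) → (1,1) → (2,2) → (2,3) → (3,4) → (4,4)
    with scores 0, -1, 2, 0, 3, 1

    T A - C T
    A A T C -

  • Path (0,0) → (1,1) → (2,2) → (3,2) → (4,3) → (4,4)
    with scores 0, -1, 2, 0, 3, 1

    T A C T -
    A A - T C
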
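
The traceback can be sketched in Python as a search that follows every move consistent with the stored scores, so that all co-optimal alignments are reported. The function name `nw_all_alignments` is illustrative, the scores are those of the example above, and the matrix is stored as `S[row][column]`. Exhaustive enumeration like this is only sensible for short sequences.

```python
def nw_all_alignments(top, left, match=3, mismatch=-1, gap=-2):
    """Return (optimal score, list of all optimal global alignments)."""
    cols, rows = len(top) + 1, len(left) + 1
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, cols):
        S[0][i] = i * gap
    for j in range(1, rows):
        S[j][0] = j * gap
    for j in range(1, rows):
        for i in range(1, cols):
            pair = match if top[i - 1] == left[j - 1] else mismatch
            S[j][i] = max(S[j - 1][i - 1] + pair,
                          S[j - 1][i] + gap,
                          S[j][i - 1] + gap)

    alignments = []

    def walk(i, j, rev_top, rev_left):
        # rev_top / rev_left hold the alignment built so far, reversed.
        if i == 0 and j == 0:
            alignments.append((rev_top[::-1], rev_left[::-1]))
            return
        if i > 0 and j > 0:
            pair = match if top[i - 1] == left[j - 1] else mismatch
            if S[j][i] == S[j - 1][i - 1] + pair:   # diagonal move
                walk(i - 1, j - 1, rev_top + top[i - 1], rev_left + left[j - 1])
        if i > 0 and S[j][i] == S[j][i - 1] + gap:  # gap in the left sequence
            walk(i - 1, j, rev_top + top[i - 1], rev_left + "-")
        if j > 0 and S[j][i] == S[j - 1][i] + gap:  # gap in the top sequence
            walk(i, j - 1, rev_top + "-", rev_left + left[j - 1])

    walk(len(top), len(left), "", "")
    return S[-1][-1], alignments


score, found = nw_all_alignments("TACT", "AATC")
print("score:", score)
for aligned_top, aligned_left in found:
    print(aligned_top)
    print(aligned_left)
    print()
```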
Local Alignment Algorithm
  • Algorithm of Smith and Waterman (1981)
  • Makes an optimal alignment of the best segment of
    similarity between two sequences
  • Suited to sequences that are not highly similar as
    a whole, but contain regions that are highly
    similar
  • Use when one sequence is short and the other is
    very long (e.g. database)
  • Can return a number of highly aligned segments
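
A minimal sketch of the local alignment idea, assuming the same simple scores as the global example (match 3, mismatch -1, gap -2). Compared with the global method, scores are never allowed to fall below 0 and the traceback starts at the highest-scoring cell and stops at the first 0. The function name `smith_waterman` and the test sequences are illustrative, and only a single best segment is returned.

```python
def smith_waterman(a, b, match=3, mismatch=-1, gap=-2):
    """Return (best local score, aligned segment of a, aligned segment of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                      # start a new local alignment
                          H[i - 1][j - 1] + pair,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    seg_a, seg_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            seg_a.append(a[i - 1]); seg_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            seg_a.append(a[i - 1]); seg_b.append("-")
            i -= 1
        else:
            seg_a.append("-"); seg_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(seg_a)), "".join(reversed(seg_b))


print(smith_waterman("TTTGCGCTTT", "AAAGCGCAAA"))
```

In this example the flanking regions do not match at all, so only the shared segment GCGC is reported.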

Does a Local Alignment program always produce a
Local Alignment and a Global Alignment program
always produce a Global Alignment?
  • Using a computer program based on the
    Smith-Waterman local alignment algorithm to
    produce an optimal alignment does not assure
    that a local alignment will be produced.
  • The scoring matrix or match/mismatch scores and
    gap penalties chosen also influence whether or
    not a local alignment is obtained.
  • The same is true of programs based on the
    Needleman-Wunsch global alignment algorithm.

  • If the matched regions are long, cover most of
    the sequences and depend on the presence of many
    gaps, the alignment is global.
  • A local alignment tends to be shorter and does
    not include many gaps.

Tools based on Dynamic programming
  • Global Alignment-
  • GAP- No penalties for terminal gaps, so it suits
    sequences of unequal length.
  • Local Alignment-

Multiple Sequence Alignment
  • It is theoretically possible to use dynamic
    programming to align any number of sequences, as
    for pairwise alignment
  • The amount of computing time increases
    exponentially as the number of sequences
    increases
  • Therefore full dynamic programming cannot be
    applied to datasets having more than ten
    sequences
  • So heuristic methods are used for MSA.
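
The exponential growth can be made concrete with a rough count, assuming for simplicity that all sequences have the same length and that the pairwise matrix is generalized to one dimension per sequence. The function names are illustrative and the figures are cell counts, not measured running times.

```python
def dp_cells(num_sequences, length):
    """Number of entries in the DP table: one axis per sequence."""
    return (length + 1) ** num_sequences


def moves_per_cell(num_sequences):
    """Predecessor cells to check per entry: each sequence either
    contributes a letter or a gap, excluding the all-gap column."""
    return 2 ** num_sequences - 1


for k in (2, 3, 5, 10):
    print(k, dp_cells(k, 100), moves_per_cell(k))
```

For two sequences this gives the familiar three moves per cell; each added sequence multiplies the table size by roughly the sequence length.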

