Title: Dynamic Programming for Sequence alignment
1 Dynamic Programming for Sequence Alignment
- Neha Jain
- Lecturer
- School of Biotechnology
- Devi Ahilya University, Indore
2 Sequence alignment
- Sequence alignment is the procedure of comparing two sequences (pairwise alignment) or more (multiple alignment) by searching for a series of individual characters or patterns that occur in the same order in the sequences.
- There are two types of alignment: local and global.
- In global alignment, an attempt is made to align the entire sequences. Two sequences that have approximately the same length and are quite similar are suitable for global alignment.
- Local alignment concentrates on finding stretches of the sequences with a high level of matches.
3 Interpretation of sequence alignment
- Sequence alignment is useful for discovering structural, functional and evolutionary information.
- Sequences that are very much alike may have similar secondary and 3D structure, similar function and, likely, a common ancestral sequence. It is extremely unlikely that such sequences became similar by chance.
- Large-scale genome studies have revealed horizontal transfer of genes and other sequences between species, which may cause similarity between some sequences in very distant species.
4 Methods of sequence alignment
- Dot matrix analysis: One sequence (A) is written across the top of the page and the second sequence down the left side. Starting from the first character in the second sequence, one moves across the page, keeping in the first row, and places a dot in any column where the character in A is the same. The process is continued row by row until all possible comparisons between the two sequences have been made. Any region of similarity is revealed by a diagonal row of dots.
- The dynamic programming (DP) algorithm: The method compares every pair of characters in the two sequences and generates an alignment that is optimal under the chosen scoring system.
- Word or k-tuple methods: BLAST is the best-known example of a k-tuple method.
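The dot matrix procedure described above can be sketched in a few lines of Python. This is only an illustration: the function name `dot_matrix` and the use of `*` and `.` as plot characters are assumptions made here, and the two short sequences are arbitrary examples.

```python
def dot_matrix(seq_a, seq_b):
    """Return a dot plot as text rows.

    seq_a runs across the top (one column per character),
    seq_b runs down the left (one row per character).
    A '*' marks a position where the two characters are identical.
    """
    return ["".join("*" if a == b else "." for a in seq_a) for b in seq_b]


# Illustrative sequences (hypothetical input, not taken from a real dataset)
for label, row in zip("AATC", dot_matrix("TACT", "AATC")):
    print(label, row)
```

Runs of `*` along a diagonal correspond to stretches where the two sequences match in the same order.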
5 Pairwise Sequence Alignment
- The aim: given two sequences and a scoring system, find the best alignment
- Points to remember
- 1) Consider all possible pairings
- 2) Take the best score found
- 3) There may be more than one best alignment
6 Finding the best alignment is hard!!
- How to get the optimal alignment?
- The number of possible alignments is large.
- If both sequences have the same length, there is only one possible complete alignment with no gaps.
- It becomes more complicated when gaps are allowed.
- It is not a good idea to go over all alignments one by one.
- Solution: the dynamic programming algorithm
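To give a feel for how quickly the number of gapped alignments grows, the sketch below counts them for two sequences of lengths n and m. It assumes that every path through the alignment grid counts as a separate alignment (so a gap in one sequence followed by a gap in the other is counted separately from the reverse order). The function name `count_alignments` is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(n, m):
    """Count global alignments of sequences of lengths n and m.

    The last column of an alignment is either a pair of letters,
    a letter over a gap, or a gap over a letter, which gives the
    three-term recursion below.
    """
    if n == 0 or m == 0:
        return 1  # only gaps remain: exactly one way to finish
    return (count_alignments(n - 1, m - 1)
            + count_alignments(n - 1, m)
            + count_alignments(n, m - 1))


for size in (2, 4, 8, 16):
    print(size, count_alignments(size, size))
```

Even for two sequences of length 4 there are already 321 candidate alignments under this counting, which is why enumerating them one by one is impractical for real sequences.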
7 Dynamic Programming
- General optimization method
- Proposed by Richard Bellman of Princeton University in the 1950s. The word "dynamic" was chosen by Bellman to capture the time-varying aspect of the problems, and because it sounded impressive. The word "programming" referred to the use of the method to find an optimal program
- Extensively used in sequence alignment and other computational problems
- Applied to biological sequences by Needleman and Wunsch
8 Dynamic Programming
- The original problem is broken into smaller subproblems, which are then solved
- The pieces of the larger problem have a sequential dependency
- The 4th piece can be solved using the solution of the 3rd piece, the 3rd piece using the solution of the 2nd piece, and so on
9 Dynamic Programming
- First solve all the subproblems
- Store each intermediate solution in a table along with a score
- Uses an m x n matrix of scores, where m and n are the lengths of the sequences being aligned (plus one extra row and column for the leading gap).
- Can be used for
- Local Alignment (Smith-Waterman Algorithm)
- Global Alignment (Needleman-Wunsch Algorithm)
10 Formal description of dynamic programming algorithm
- This diagram indicates the moves that are possible to reach a certain position (i,j): either from the previous row and column at position (i-1, j-1), or from any position in the same row or column
- A diagonal move carries no gap penalty; a move from any other position in column j or row i carries a gap penalty that depends on the size of the gap
11 Dynamic Programming
- Sequence alignment has an optimal-substructure property
- As a result, DP makes it feasible to consider all possible alignments
- DP algorithms solve optimization problems by dividing the problem into overlapping subproblems.
- Each subproblem is then solved only once, and the answer is stored in a table, thus avoiding the work of recomputing the solution.
12 Dynamic Programming
- With sequence alignment, the subproblems can be thought of as the alignments of the prefixes of the two sequences up to a certain point.
- A DP matrix is computed.
- The optimal alignment score for any particular point in the matrix is built upon the optimal alignment scores already computed up to that point.
13 Dynamic Programming
- Advantage: The method is guaranteed to give a global optimum for the chosen parameters (the scoring matrix and gap penalty), with no approximation
- Disadvantage: Many alignments may give the same optimal score, and there is no guarantee that any of them corresponds to the biologically correct alignment
14 Dynamic Programming
- Comparing the α- and β-chains of chicken hemoglobin, Fitch and Smith found 17 optimal alignments, only one of which was biologically correct (1317 alignments were within 5% of the optimal score)
- More bad news: the time required to align two sequences of lengths n and m is proportional to n x m.
- This makes DP unsuitable for searching a sequence database for a match to a probe sequence
15 Dynamic Programming
- Steps involved
- Initialization
- Matrix Fill (scoring)
- Traceback (alignment)
16 Gap Penalties..????
- Gaps are due to insertion or deletion mutations in the genes.
- Penalties are given for the gaps.
- Through empirical studies of globular proteins, a set of penalty values has been developed that appears to suit most alignment purposes.
- These are normally implemented as the default values in most alignment programs.
17 Gap Penalties..????
- Caution:
- If penalties are too low, gaps become numerous and even unrelated pairs will be aligned.
- If penalties are too high, it is difficult to pair even the related ones.
- Another factor to consider is the cost difference between opening a gap and extending an existing gap. It is known that it is easier to extend a gap that has already been started. Thus, gap opening should have a much higher penalty than gap extension.
- This is based on the rationale that if insertions and deletions ever occur, several adjacent residues are likely to have been inserted or deleted together.
- Affine gap penalties: the gap opening penalty is higher than the gap extension penalty.
- Constant penalty: the gap opening and gap extension penalties are the same (often called a linear gap penalty).
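The difference between the two penalty schemes can be made concrete with a small sketch. It assumes one common convention, in which a gap of length L costs the opening penalty once plus the extension penalty for each of the remaining L - 1 positions. The function names and the numeric values are placeholders chosen for illustration, not recommended defaults.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Penalty for one gap of `length` positions under an affine scheme.

    Assumed convention: the first gapped position costs `gap_open`,
    every further position costs `gap_extend`.
    """
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


def constant_gap_penalty(length, gap=2.0):
    """Penalty when opening and extending cost the same per position."""
    return max(length, 0) * gap


for length in (1, 2, 5):
    print(length, affine_gap_penalty(length), constant_gap_penalty(length))
```

Under the affine scheme one gap of five positions (12.0 with these placeholder values) is much cheaper than five separate gaps of one position each (50.0), which matches the rationale that adjacent residues tend to be inserted or deleted together.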
18 Global Alignment: Needleman-Wunsch Algorithm
- In global sequence alignment, an attempt is made to align the entirety of two different sequences, up to and including the ends of the sequences.
- Needleman and Wunsch (1970) were among the first to describe a dynamic programming algorithm for global sequence alignment.
19 Example
- Two sequences: TACT, AATC
- Scoring system
- Match: 3
- Mismatch: -1
- Gap: -2
20
- Initialize entry (0,0) to 0
- Fill the matrix from top left to bottom right
- The score in each entry (i,j) is calculated from the values of the three neighboring entries
- The global alignment score is the value in the bottom-right cell
- More than one alignment may be found
21
Construct a matrix with one sequence (TACT) along the top and the other sequence (AATC) down the left.
        0   1   2   3   4
        -   T   A   C   T
0  -
1  A
2  A
3  T
4  C
- Entry (i,j)
- i for column, j for row
- Score of the alignment of the first i letters of one sequence
- with the first j letters of the other
22
        0   1   2   3   4
        -   T   A   C   T
0  -    0
1  A
2  A
3  T
4  C
Initialization: entry (0,0) = 0
Fill the matrix from top left to bottom right
23
        0   1   2   3   4
        -   T   A   C   T
0  -    0  -2
1  A
2  A
3  T
4  C
entry (1,0) = entry (0,0) + gap score = 0 + (-2) = -2
T
-
A horizontal move places a gap in the left sequence
24
        0   1   2   3   4
        -   T   A   C   T
0  -    0  -2  -4
1  A
2  A
3  T
4  C
TA
--
entry (2,0) = entry (1,0) + gap score = -2 + (-2) = -4
25
        0   1   2   3   4
        -   T   A   C   T
0  -    0  -2  -4  -6
1  A
2  A
3  T
4  C
TAC
---
entry (3,0) = entry (2,0) + gap score = -4 + (-2) = -6
26
        0   1   2   3   4
        -   T   A   C   T
0  -    0  -2  -4  -6  -8
1  A
2  A
3  T
4  C
TACT
----
entry (4,0) = entry (3,0) + gap score = -6 + (-2) = -8
27
        0   1   2   3   4
        -   T   A   C   T
0  -    0  -2  -4  -6  -8
1  A   -2
2  A   -4
3  T   -6
4  C   -8
----
AATC
A vertical move places a gap in the top sequence
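28 Global Alignment: Needleman-Wunsch Algorithm
For each position, S_{i,j} is defined to be the maximum score at position (i,j), i.e.
$$
S_{i,j} = \max
\begin{cases}
S_{i-1,j-1} + s(a_i, b_j) & \text{(match/mismatch in the diagonal)} \\
S_{i,j-1} + w & \text{(gap in sequence 1)} \\
S_{i-1,j} + w & \text{(gap in sequence 2)}
\end{cases}
$$
where s(a_i, b_j) is the match or mismatch score for the two letters and w is the gap score (a negative number, so it acts as a penalty).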
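The recurrence can be turned into a short matrix-fill routine. The sketch below is one possible reading of it, using the example's scoring values as defaults. The function name `needleman_wunsch_matrix` and its parameter names are assumptions made for illustration. The matrix is stored row by row, so `S[j][i]` corresponds to entry (i,j) in the slides' column/row convention.

```python
def needleman_wunsch_matrix(top, left, match=3, mismatch=-1, gap=-2):
    """Fill the global alignment score matrix.

    S[j][i] is the best score for aligning the first i letters of `top`
    with the first j letters of `left` (i = column, j = row).
    """
    cols, rows = len(top) + 1, len(left) + 1
    S = [[0] * cols for _ in range(rows)]

    # Initialization: first row and first column contain only gaps.
    for i in range(1, cols):
        S[0][i] = S[0][i - 1] + gap
    for j in range(1, rows):
        S[j][0] = S[j - 1][0] + gap

    # Matrix fill: each entry depends on its three neighbors.
    for j in range(1, rows):
        for i in range(1, cols):
            s = match if top[i - 1] == left[j - 1] else mismatch
            S[j][i] = max(S[j - 1][i - 1] + s,  # diagonal: match/mismatch
                          S[j][i - 1] + gap,    # horizontal: gap in left sequence
                          S[j - 1][i] + gap)    # vertical: gap in top sequence
    return S


matrix = needleman_wunsch_matrix("TACT", "AATC")
for row in matrix:
    print(row)
print("global alignment score:", matrix[-1][-1])
```

The bottom-right entry is the global alignment score; recovering the alignments themselves requires a separate traceback step.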
29
        0   1   2   3   4
        -   T   A   C   T
0  -    0  -2  -4  -6  -8
1  A   -2   ?
2  A   -4
3  T   -6
4  C   -8
Three options
30
        0   1   2   3   4
        -   T   A   C   T
0  -    0  -2  -4  -6  -8
1  A   -2  -1
2  A   -4
3  T   -6
4  C   -8
First option: entry (0,0) + mismatch score = 0 + (-1) = -1
T
A
31
        0   1   2   3   4
        -   T   A   C   T
0  -    0  -2  -4  -6  -8
1  A   -2  -4
2  A   -4
3  T   -6
4  C   -8
Second option: entry (1,0) + gap score = -2 + (-2) = -4
T-
-A
32
        0   1   2   3   4
        -   T   A   C   T
0  -    0  -2  -4  -6  -8
1  A   -2  -4
2  A   -4
3  T   -6
4  C   -8
Third option: entry (0,1) + gap score = -2 + (-2) = -4
-T
A-
33
        0   1   2   3   4
        -   T   A   C   T
0  -    0  -2  -4  -6  -8
1  A   -2  -1
2  A   -4
3  T   -6
4  C   -8
Choosing the option with the maximal score
T
A
34
        0   1   2   3   4
        -   T   A   C   T
0  -    0  -2  -4  -6  -8
1  A   -2  -1   ?
2  A   -4
3  T   -6
4  C   -8
35
        0   1   2   3   4
        -   T   A   C   T
0  -    0  -2  -4  -6  -8
1  A   -2  -1   ?
2  A   -4
3  T   -6
4  C   -8
First option: entry (1,0) + match score = -2 + 3 = 1
TA
-A
36
        0   1   2   3   4
        -   T   A   C   T
0  -    0  -2  -4  -6  -8
1  A   -2  -1   ?
2  A   -4
3  T   -6
4  C   -8
Second option: entry (2,0) + gap score = -4 + (-2) = -6
TA-
--A
37
        0   1   2   3   4
        -   T   A   C   T
0  -    0  -2  -4  -6  -8
1  A   -2  -1   ?
2  A   -4
3  T   -6
4  C   -8
Third option: entry (1,1) + gap score = -1 + (-2) = -3
TA
A-
38
        0   1   2   3   4
        -   T   A   C   T
0  -    0  -2  -4  -6  -8
1  A   -2  -1   1
2  A   -4
3  T   -6
4  C   -8
Choosing the option with the maximal score
TA
-A
39
        0   1   2   3   4
        -   T   A   C   T
0  -    0  -2  -4  -6  -8
1  A   -2  -1   1  -1  -3
2  A   -4
3  T   -6
4  C   -8
TACT
-A--
40
        0   1   2   3   4
        -   T   A   C   T
0  -    0  -2  -4  -6  -8
1  A   -2  -1   1  -1  -3
2  A   -4  -3
3  T   -6
4  C   -8
Two alignments give the same score:
T-
AA
and
-T
AA
41
        0   1   2   3   4
        -   T   A   C   T
0  -    0  -2  -4  -6  -8
1  A   -2  -1   1  -1  -3
2  A   -4  -3   2   0
3  T   -6
4  C   -8
Two alignments give the same score:
TAC
-AA
and
TAC
AA-
42
        0   1   2   3   4
        -   T   A   C   T
0  -    0  -2  -4  -6  -8
1  A   -2  -1   1  -1  -3
2  A   -4  -3   2   0  -2
3  T   -6  -1   0   1   3
4  C   -8  -3  -2   3   1
43
        0   1   2   3   4
        -   T   A   C   T
0  -    0  -2  -4  -6  -8
1  A   -2  -1   1  -1  -3
2  A   -4  -3   2   0  -2
3  T   -6  -1   0   1   3
4  C   -8  -3  -2   3   1
44
        0   1   2   3   4
        -   T   A   C   T
0  -    0  -2
1  A       -1   1
2  A            2   0
3  T            0       3
4  C                3   1
Three possible alignments
45
        0   1   2   3   4
        -   T   A   C   T
0  -    0  -2
1  A            1
2  A                0
3  T                    3
4  C                    1
T A C T -
- A A T C
46
        0   1   2   3   4
        -   T   A   C   T
0  -    0
1  A       -1
2  A            2
3  T            0
4  C                3   1
T A - C T
A A T C -
47
        0   1   2   3   4
        -   T   A   C   T
0  -    0
1  A       -1
2  A            2   0
3  T                    3
4  C                    1
T A C T -
A A - T C
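The traceback step can be sketched as a recursive walk from the bottom-right cell back to entry (0,0), following every move that explains the score in the current cell. The sketch below refills the matrix first so that it runs on its own. The function name `all_optimal_alignments` is hypothetical, and enumerating every optimal path this way is only practical for short sequences.

```python
def all_optimal_alignments(top, left, match=3, mismatch=-1, gap=-2):
    """Return (score, list of optimal global alignments) for two sequences."""
    cols, rows = len(top) + 1, len(left) + 1
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, cols):
        S[0][i] = i * gap
    for j in range(1, rows):
        S[j][0] = j * gap
    for j in range(1, rows):
        for i in range(1, cols):
            s = match if top[i - 1] == left[j - 1] else mismatch
            S[j][i] = max(S[j - 1][i - 1] + s, S[j][i - 1] + gap, S[j - 1][i] + gap)

    results = []

    def walk(i, j, aligned_top, aligned_left):
        """Step back from entry (i,j), prepending one alignment column per move."""
        if i == 0 and j == 0:
            results.append((aligned_top, aligned_left))
            return
        if i > 0 and j > 0:
            s = match if top[i - 1] == left[j - 1] else mismatch
            if S[j][i] == S[j - 1][i - 1] + s:   # diagonal move
                walk(i - 1, j - 1, top[i - 1] + aligned_top, left[j - 1] + aligned_left)
        if i > 0 and S[j][i] == S[j][i - 1] + gap:   # horizontal: gap in left sequence
            walk(i - 1, j, top[i - 1] + aligned_top, "-" + aligned_left)
        if j > 0 and S[j][i] == S[j - 1][i] + gap:   # vertical: gap in top sequence
            walk(i, j - 1, "-" + aligned_top, left[j - 1] + aligned_left)

    walk(len(top), len(left), "", "")
    return S[-1][-1], results


score, alignments = all_optimal_alignments("TACT", "AATC")
print("score:", score)
for aligned_top, aligned_left in alignments:
    print(aligned_top)
    print(aligned_left)
    print()
```

For TACT and AATC with the example scoring system, this walk finds three alignments that all reach the optimal score of 1.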
48 Local Alignment Algorithm
- Algorithm of Smith and Waterman (1981)
- Makes an optimal alignment of the best segment of similarity between two sequences
- Suits sequences that are not highly similar as a whole but contain regions that are highly similar
- Use when one sequence is short and the other is very long (e.g. a database)
- Can return a number of well-aligned segments
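A minimal sketch of the local alignment idea follows. Compared with the global fill, scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The function name `smith_waterman` is an assumption, and the match, mismatch and gap values are illustrative. Only one best segment is returned here, whereas full implementations can report several.

```python
def smith_waterman(a, b, match=3, mismatch=-1, gap=-2):
    """Return (best score, aligned segment of a, aligned segment of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill: same three moves as global alignment, but floored at zero.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the best cell until a zero score is reached.
    i, j = best_pos
    seg_a, seg_b = "", ""
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            seg_a, seg_b = a[i - 1] + seg_a, b[j - 1] + seg_b
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            seg_a, seg_b = a[i - 1] + seg_a, "-" + seg_b
            i -= 1
        else:
            seg_a, seg_b = "-" + seg_a, b[j - 1] + seg_b
            j -= 1
    return best, seg_a, seg_b


print(smith_waterman("TACT", "AATC"))
```

With these illustrative scores, the best local segment for TACT and AATC pairs ACT with AAT for a score of 5, leaving the unmatched ends out of the alignment.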
54 Does a local alignment program always produce a local alignment, and a global alignment program always produce a global alignment?
- Although a computer program based on the Smith-Waterman local alignment algorithm is used to produce an optimal alignment, this does not assure that a local alignment will be produced.
- The scoring matrix or match/mismatch scores and the gap penalties chosen also influence whether or not a local alignment is obtained.
- The same is true of the Needleman-Wunsch algorithm.
55
- If the matched regions are long, cover most of the sequences, and depend on the presence of many gaps, the alignment is global.
- A local alignment tends to be shorter and does not include many gaps.
56 Tools based on Dynamic programming
- Global alignment:
- GAP: no penalties for terminal gaps, so it suits sequences of unequal length.
- Local alignment:
- SIM, SSEARCH and LALIGN
57 Multiple Sequence Alignment
- It is theoretically possible to use dynamic programming to align any number of sequences, just as for pairwise alignment
- The amount of computing time increases exponentially as the number of sequences increases
- Therefore full dynamic programming cannot practically be applied to datasets of more than about ten sequences
- So heuristic methods are used for MSA.
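The exponential growth can be illustrated with a rough cost estimate. The sketch assumes the straightforward extension of the pairwise matrix to one dimension per sequence, with every sequence having the same length. The function name `full_dp_cost` is hypothetical, and the figures are cell counts, not measured running times.

```python
def full_dp_cost(seq_len, num_seqs):
    """Return (number of cells, predecessor moves per cell) for full DP.

    With one matrix dimension per sequence there are (seq_len + 1) ** num_seqs
    cells, and each cell can be reached by any non-empty combination of
    "advance" / "gap" choices across the sequences: 2 ** num_seqs - 1 moves.
    """
    cells = (seq_len + 1) ** num_seqs
    moves_per_cell = 2 ** num_seqs - 1
    return cells, moves_per_cell


for k in (2, 3, 5, 10):
    cells, moves = full_dp_cost(100, k)
    print(f"{k} sequences of length 100: {cells:.3e} cells, {moves} moves per cell")
```

For two sequences this reduces to the familiar matrix with three moves per cell; each additional sequence multiplies the number of cells by roughly the sequence length.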
58 Thank you