Dynamic Programming for Sequence alignment - PowerPoint PPT Presentation

1
Dynamic Programming for Sequence alignment

Neha Jain
Lecturer
School of Biotechnology
Devi Ahilya University, Indore

2
Sequence alignment

Sequence alignment is the procedure of comparing
two (pairwise alignment) or more (multiple
alignment) sequences by searching for a series of
individual characters or patterns that are in the
same order in the sequences.
There are two types of alignment: local and
global.
In global alignment, an attempt is made to align
the entire sequences. If two sequences have
approximately the same length and are quite
similar, they are suitable for global alignment.
Local alignment concentrates on finding
stretches of sequence with a high density of
matches.

3
Interpretation of sequence alignment

Sequence alignment is useful for discovering
structural, functional and evolutionary
information.
Sequences that are very much alike may have
similar secondary and 3D structure, similar
function and likely a common ancestral sequence.
It is extremely unlikely that such sequences
acquired their similarity by chance.
Large-scale genome studies have revealed the
existence of horizontal transfer of genes and
other sequences between species, which may cause
similarity between some sequences in very distant
species.

4
Methods of sequence alignment

Dot matrix analysis- One sequence (A) is written
across the top of the page and the other (B) down
the left side. Starting from the first character
of B, one moves across the page along the first
row, placing a dot in every column where the
character in A is the same. The process is
continued row by row until all possible
comparisons between the two sequences have been
made. Any region of similarity is revealed by a
diagonal row of dots.
The dynamic programming (DP) algorithm- The
method compares every pair of characters in the
two sequences and generates an alignment that is
optimal for the chosen scoring system.
Word or k-tuple methods- BLAST is the best-known
example of a k-tuple method.
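
As a rough illustration of the dot matrix idea, the following Python sketch marks the matching positions between two short sequences. The function name dot_matrix and the '*' and '.' symbols are choices made for this example only; real dot-plot programs usually add window and stringency filtering, which is left out here.

```python
def dot_matrix(seq_a: str, seq_b: str) -> list[str]:
    """One text row per character of seq_b; '*' marks columns where seq_a matches."""
    return ["".join("*" if a == b else "." for a in seq_a) for b in seq_b]


if __name__ == "__main__":
    seq_a, seq_b = "TACT", "AATC"   # A across the top, B down the left
    print("  " + seq_a)
    for label, row in zip(seq_b, dot_matrix(seq_a, seq_b)):
        print(label, row)
```

Runs of '*' along a diagonal correspond to stretches where the two sequences agree without gaps.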

5
Pairwise Sequence Alignment

The aim: given two sequences and a scoring system,
find the best alignment
Points to remember
1) Should consider all possible pairs of characters
2) Take the best score found
3) There may be more than one best alignment

6
Finding the best alignment is hard!!

How to get the optimal alignment?
The number of possible alignments is large.
If both sequences have the same length, there is
only one possible complete alignment with no gaps.
It becomes more complicated when gaps are allowed.
It is not a good idea to go over all alignments
one by one.
Solution: Dynamic Programming Algorithm
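
To get a feel for how quickly the number of alignments grows once gaps are allowed, here is a small Python sketch. It assumes that every path through the alignment grid counts as a distinct alignment (so a gap in one sequence followed directly by a gap in the other is counted separately from the reverse order). The name count_alignments is made up for this illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(n: int, m: int) -> int:
    """Number of gapped global alignments of sequences of lengths n and m."""
    if n == 0 or m == 0:
        return 1  # only one way: align the remaining letters against gaps
    # Last column is either a letter pair, a gap in one sequence, or a gap in the other.
    return (count_alignments(n - 1, m - 1)
            + count_alignments(n - 1, m)
            + count_alignments(n, m - 1))


if __name__ == "__main__":
    for length in (1, 2, 3, 4, 10):
        print(length, count_alignments(length, length))
```

Under this counting, two sequences of length 4 already have 321 possible alignments, and the count keeps growing exponentially with length.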

7
Dynamic Programming

General optimization method
Proposed by Richard Bellman in the 1950s, while he
was working at the RAND Corporation. The word
dynamic was chosen by Bellman to capture the
time-varying aspect of the problems, and because
it sounded impressive.
The word programming referred to the use of the
method to find an optimal program.
Extensively used in sequence alignment and other
computational problems
Applied to biological sequences by Needleman and
Wunsch

8
Dynamic Programming

The original problem is broken into smaller
subproblems, which are then solved
The pieces of the larger problem have a sequential
dependency
The 4th piece can be solved using the solution of
the 3rd piece, the 3rd piece can be solved using
the solution of the 2nd piece, and so on

9
Dynamic Programming

First solve all the subproblems
Store each intermediate solution in a table along
with a score
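Uses a matrix of scores with (m + 1) x (n + 1)
entries, where m and n are the lengths of the
sequences being aligned; the extra row and column
hold the scores for aligning against gaps only.
Can be used for:
Local Alignment (Smith-Waterman Algorithm)
Global Alignment (Needleman-Wunsch Algorithm)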

10
Formal description of dynamic programming
algorithm

This diagram indicates the moves that are
possible to reach a certain position (i,j):
either from the previous row and column at
position (i-1, j-1), or from any position in the
same row or column.
A diagonal move carries no gap penalty; a move
from any other position in column j or row i
carries a gap penalty that depends on the size of
the gap.
11
Dynamic Programming

Sequence alignment has an optimal-substructure
property
As a result DP makes it feasible to consider all
possible alignments
DP algorithms solve optimization problems by
dividing the problem into overlapping
subproblems.
Each subproblem is then only solved once, and the
answer is stored in a table, thus avoiding the
work of recomputing the solution.
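
The "solve once, store in a table" idea can be sketched in Python with a memoized recursion. This is only an illustration: the scores (match 3, mismatch -1, gap -2), the function name best_score and the use of functools.lru_cache as the table are assumptions made for the example.

```python
from functools import lru_cache

MATCH, MISMATCH, GAP = 3, -1, -2  # illustrative scores, assumed for this sketch


def best_score(a: str, b: str) -> tuple[int, int]:
    """Return (optimal global alignment score, number of subproblems actually solved)."""
    solved = 0

    @lru_cache(maxsize=None)          # the cache plays the role of the DP table
    def score(i: int, j: int) -> int:
        nonlocal solved
        solved += 1                   # counted only on a cache miss
        if i == 0:
            return j * GAP            # prefix of b aligned against gaps
        if j == 0:
            return i * GAP            # prefix of a aligned against gaps
        sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
        return max(score(i - 1, j - 1) + sub,
                   score(i - 1, j) + GAP,
                   score(i, j - 1) + GAP)

    return score(len(a), len(b)), solved


if __name__ == "__main__":
    print(best_score("TACT", "AATC"))
```

For two sequences of length 4 only 25 subproblems (one per prefix pair) are solved, even though the recursion asks for the same prefix pairs many times.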

12
Dynamic Programming

With sequence alignment, the subproblems can be
thought of as the alignment of the prefixes of
the two sequences to a certain point.
DP matrix is computed.
The optimal alignment score for any particular
point in the matrix is built upon the optimal
alignment that has been computed to that point.

13
Dynamic Programming

Advantage: The method is guaranteed to give a
global optimum for the chosen parameters (the
scoring matrix and gap penalty), with no
approximation.
Disadvantage: Many alignments may give the same
optimal score, and none of these may correspond to
the biologically correct alignment.

14
Dynamic Programming

Comparing the α- and β-chains of chicken
hemoglobin, Fitch & Smith found 17 optimal
alignments, only one of which was biologically
correct (1317 alignments were within 5% of the
optimal score).
More bad news: the time required to align two
sequences of lengths n and m is proportional to
n x m.
This makes DP unsuitable for searching a
sequence database for a match to a probe sequence.

15
Dynamic Programming

Steps Involved
Initialization
Matrix Fill (scoring)
Traceback (alignment)

16
Gap Penalties..????

Gaps are due to insertion or deletion mutations
in the genes.
Penalties are assigned to gaps.
Through empirical studies on globular proteins,
a set of penalty values has been developed that
appears to suit most alignment purposes.
These are normally implemented as default values
in most alignment programs.

17
Gap Penalties..????

Caution-
Penalties too low- gaps become numerous, and even
unrelated pairs will be aligned.
Penalties too high- it becomes difficult to align
even related sequences.
Another factor to consider is the cost difference
between opening a gap and extending an existing
gap. It is known that it is easier to extend a
gap that has already been started. Thus, gap
opening should have a much higher penalty than
gap extension.
This is based on the rationale that if insertions
and deletions occur at all, several adjacent
residues are likely to have been inserted or
deleted together.
Affine Gap Penalties- The gap opening penalty is
higher than the gap extension penalty.
Constant Penalty- The gap opening and gap
extension penalties are the same, so every gap
position costs the same amount.
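
The difference between the two schemes can be sketched in Python as follows. This assumes one common convention (the first gap position pays the opening penalty, each further position pays the extension penalty); some programs define it slightly differently. The function name and the numeric values are illustrative only and are not defaults of any particular program.

```python
def gap_penalty(length: int, gap_open: int, gap_extend: int) -> int:
    """Total penalty for a single gap spanning `length` positions."""
    if length <= 0:
        return 0
    # First position pays the opening penalty, the rest pay the extension penalty.
    return gap_open + (length - 1) * gap_extend


if __name__ == "__main__":
    # Affine: opening (10) is much more expensive than extending (1).
    print([gap_penalty(k, 10, 1) for k in range(1, 6)])
    # Same penalty for opening and extending (2): cost grows evenly with length.
    print([gap_penalty(k, 2, 2) for k in range(1, 6)])
```

With the affine values, one gap of length 5 costs 14, while five separate gaps of length 1 cost 50, which is why affine scoring favours a few long gaps over many short ones.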

18
Global Alignment Needleman-Wunsch Algorithm

In global sequence alignment, an attempt to align
the entirety of two different sequences is made,
up to and including the ends of the sequence.
Needleman and Wunsch (1970) were among the first
to describe a dynamic programming algorithm for
global sequence alignment.

19
Example

Two sequences: TACT, AATC
Scoring system:
Match: 3
Mismatch: -1
Gap: -2

Initialize entry (0,0) = 0
Fill the matrix from top left to bottom right
The score in each entry (i,j) is calculated from
the values of the three neighboring entries
(left, above and diagonal)
The global alignment score is the bottom-right
cell value
There may be more than one optimal alignment

21
        0    1    2    3    4
        -    T    A    C    T
0  -    .    .    .    .    .
1  A    .    .    .    .    .
2  A    .    .    .    .    .
3  T    .    .    .    .    .
4  C    .    .    .    .    .
Construct a matrix with one sequence (TACT) along
the top and the other sequence (AATC) down the left

Entry (i,j):
i is the column, j is the row
The entry describes the alignment of the first i
letters of one sequence with the first j letters
of the other

22
        0    1    2    3    4
        -    T    A    C    T
0  -    0    .    .    .    .
1  A    .    .    .    .    .
2  A    .    .    .    .    .
3  T    .    .    .    .    .
4  C    .    .    .    .    .
Initialization: entry (0,0) = 0
Fill the matrix from top left to bottom right
23
        0    1    2    3    4
        -    T    A    C    T
0  -    0   -2    .    .    .
1  A    .    .    .    .    .
2  A    .    .    .    .    .
3  T    .    .    .    .    .
4  C    .    .    .    .    .
entry (1,0) = entry (0,0) + gap score = 0 + (-2) = -2
T
-
Horizontal move: gap in the left sequence
24
        0    1    2    3    4
        -    T    A    C    T
0  -    0   -2   -4    .    .
1  A    .    .    .    .    .
2  A    .    .    .    .    .
3  T    .    .    .    .    .
4  C    .    .    .    .    .
TA
--
entry (2,0) = entry (1,0) + gap score = -2 + (-2) = -4
25
        0    1    2    3    4
        -    T    A    C    T
0  -    0   -2   -4   -6    .
1  A    .    .    .    .    .
2  A    .    .    .    .    .
3  T    .    .    .    .    .
4  C    .    .    .    .    .
TAC
---
entry (3,0) = entry (2,0) + gap score = -4 + (-2) = -6
26
        0    1    2    3    4
        -    T    A    C    T
0  -    0   -2   -4   -6   -8
1  A    .    .    .    .    .
2  A    .    .    .    .    .
3  T    .    .    .    .    .
4  C    .    .    .    .    .
TACT
----
entry (4,0) = entry (3,0) + gap score = -6 + (-2) = -8
27
        0    1    2    3    4
        -    T    A    C    T
0  -    0   -2   -4   -6   -8
1  A   -2    .    .    .    .
2  A   -4    .    .    .    .
3  T   -6    .    .    .    .
4  C   -8    .    .    .    .
----
AATC
Vertical move: gap in the top sequence
28
Global Alignment Needleman-Wunsch Algorithm
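For each position, $S_{i,j}$ is defined to be the
maximum score at position (i,j), i.e.

$$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + s(a_i, b_j) & \text{(match/mismatch on the diagonal)} \\ S_{i,j-1} + w & \text{(gap in sequence 1)} \\ S_{i-1,j} + w & \text{(gap in sequence 2)} \end{cases}$$

Here $s(a_i, b_j)$ is the match or mismatch score
and $w$ is the gap score.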
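
The recurrence above can be sketched in Python as a matrix fill. This is a minimal sketch using the scores of the running example (match 3, mismatch -1, gap -2) and a single fixed gap score. The function name nw_matrix and the layout S[row][column] are choices made for this illustration.

```python
def nw_matrix(top: str, left: str, match: int = 3, mismatch: int = -1,
              gap: int = -2) -> list[list[int]]:
    """Fill the global alignment score matrix; S[j][i] is row j, column i."""
    rows, cols = len(left) + 1, len(top) + 1
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, cols):              # first row: top prefix against gaps
        S[0][i] = S[0][i - 1] + gap
    for j in range(1, rows):              # first column: left prefix against gaps
        S[j][0] = S[j - 1][0] + gap
    for j in range(1, rows):
        for i in range(1, cols):
            s = match if top[i - 1] == left[j - 1] else mismatch
            S[j][i] = max(S[j - 1][i - 1] + s,   # diagonal: match/mismatch
                          S[j][i - 1] + gap,     # horizontal: gap in left sequence
                          S[j - 1][i] + gap)     # vertical: gap in top sequence
    return S


if __name__ == "__main__":
    for row in nw_matrix("TACT", "AATC"):
        print(row)
```

For TACT against AATC the bottom-right entry, and therefore the global alignment score, is 1.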
29
        0    1    2    3    4
        -    T    A    C    T
0  -    0   -2   -4   -6   -8
1  A   -2    ?    .    .    .
2  A   -4    .    .    .    .
3  T   -6    .    .    .    .
4  C   -8    .    .    .    .
Three options for entry (1,1)
30
        0    1    2    3    4
        -    T    A    C    T
0  -    0   -2   -4   -6   -8
1  A   -2   -1    .    .    .
2  A   -4    .    .    .    .
3  T   -6    .    .    .    .
4  C   -8    .    .    .    .
First option: entry (0,0) + mismatch score = 0 + (-1) = -1
T
A
31
        0    1    2    3    4
        -    T    A    C    T
0  -    0   -2   -4   -6   -8
1  A   -2   -4    .    .    .
2  A   -4    .    .    .    .
3  T   -6    .    .    .    .
4  C   -8    .    .    .    .
Second option: entry (1,0) + gap score = -2 + (-2) = -4
T-
-A
32
        0    1    2    3    4
        -    T    A    C    T
0  -    0   -2   -4   -6   -8
1  A   -2   -4    .    .    .
2  A   -4    .    .    .    .
3  T   -6    .    .    .    .
4  C   -8    .    .    .    .
Third option: entry (0,1) + gap score = -2 + (-2) = -4
-T
A-
33
        0    1    2    3    4
        -    T    A    C    T
0  -    0   -2   -4   -6   -8
1  A   -2   -1    .    .    .
2  A   -4    .    .    .    .
3  T   -6    .    .    .    .
4  C   -8    .    .    .    .
Choosing the option with the maximal score: -1
T
A
34
        0    1    2    3    4
        -    T    A    C    T
0  -    0   -2   -4   -6   -8
1  A   -2   -1    ?    .    .
2  A   -4    .    .    .    .
3  T   -6    .    .    .    .
4  C   -8    .    .    .    .
35
        0    1    2    3    4
        -    T    A    C    T
0  -    0   -2   -4   -6   -8
1  A   -2   -1    ?    .    .
2  A   -4    .    .    .    .
3  T   -6    .    .    .    .
4  C   -8    .    .    .    .
First option: entry (1,0) + match score = -2 + 3 = 1
TA
-A
36
        0    1    2    3    4
        -    T    A    C    T
0  -    0   -2   -4   -6   -8
1  A   -2   -1    ?    .    .
2  A   -4    .    .    .    .
3  T   -6    .    .    .    .
4  C   -8    .    .    .    .
Second option: entry (2,0) + gap score = -4 + (-2) = -6
TA-
--A
37
        0    1    2    3    4
        -    T    A    C    T
0  -    0   -2   -4   -6   -8
1  A   -2   -1    ?    .    .
2  A   -4    .    .    .    .
3  T   -6    .    .    .    .
4  C   -8    .    .    .    .
Third option: entry (1,1) + gap score = -1 + (-2) = -3
TA
A-
38
        0    1    2    3    4
        -    T    A    C    T
0  -    0   -2   -4   -6   -8
1  A   -2   -1    1    .    .
2  A   -4    .    .    .    .
3  T   -6    .    .    .    .
4  C   -8    .    .    .    .
Choosing the option with the maximal score: 1
TA
-A
39
        0    1    2    3    4
        -    T    A    C    T
0  -    0   -2   -4   -6   -8
1  A   -2   -1    1   -1   -3
2  A   -4    .    .    .    .
3  T   -6    .    .    .    .
4  C   -8    .    .    .    .
TACT
-A--
40
        0    1    2    3    4
        -    T    A    C    T
0  -    0   -2   -4   -6   -8
1  A   -2   -1    1   -1   -3
2  A   -4   -3    .    .    .
3  T   -6    .    .    .    .
4  C   -8    .    .    .    .
Two alignments give the same score for entry (1,2):
T-
AA
and
-T
AA
41
        0    1    2    3    4
        -    T    A    C    T
0  -    0   -2   -4   -6   -8
1  A   -2   -1    1   -1   -3
2  A   -4   -3    2    0    .
3  T   -6    .    .    .    .
4  C   -8    .    .    .    .
Two alignments give the same score for entry (3,2):
TAC
-AA
and
TAC
AA-
42
        0    1    2    3    4
        -    T    A    C    T
0  -    0   -2   -4   -6   -8
1  A   -2   -1    1   -1   -3
2  A   -4   -3    2    0   -2
3  T   -6   -1    0    1    3
4  C   -8   -3   -2    3    1
43
        0    1    2    3    4
        -    T    A    C    T
0  -    0   -2   -4   -6   -8
1  A   -2   -1    1   -1   -3
2  A   -4   -3    2    0   -2
3  T   -6   -1    0    1    3
4  C   -8   -3   -2    3    1
44
        0    1    2    3    4
        -    T    A    C    T
0  -    0   -2    .    .    .
1  A    .   -1    1    .    .
2  A    .    .    2    0    .
3  T    .    .    0    .    3
4  C    .    .    .    3    1
Three possible alignments
45
        0    1    2    3    4
        -    T    A    C    T
0  -    0   -2    .    .    .
1  A    .    .    1    .    .
2  A    .    .    .    0    .
3  T    .    .    .    .    3
4  C    .    .    .    .    1
TACT-
-AATC
46
        0    1    2    3    4
        -    T    A    C    T
0  -    0    .    .    .    .
1  A    .   -1    .    .    .
2  A    .    .    2    .    .
3  T    .    .    0    .    .
4  C    .    .    .    3    1
TA-CT
AATC-
47
        0    1    2    3    4
        -    T    A    C    T
0  -    0    .    .    .    .
1  A    .   -1    .    .    .
2  A    .    .    2    0    .
3  T    .    .    .    .    3
4  C    .    .    .    .    1
TACT-
AA-TC
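
The traceback step can be sketched in Python as a recursive walk from the bottom-right entry back to entry (0,0), following every move that reproduces the stored score. This is an illustrative sketch with the example scores (match 3, mismatch -1, gap -2); the function name all_optimal_alignments is made up for this example, and the exhaustive walk is only practical for short sequences.

```python
def all_optimal_alignments(top: str, left: str, match: int = 3,
                           mismatch: int = -1, gap: int = -2):
    """Return (score, sorted list of (top_aligned, left_aligned)) for all optimal paths."""
    rows, cols = len(left) + 1, len(top) + 1
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, cols):
        S[0][i] = i * gap
    for j in range(1, rows):
        S[j][0] = j * gap
    for j in range(1, rows):
        for i in range(1, cols):
            s = match if top[i - 1] == left[j - 1] else mismatch
            S[j][i] = max(S[j - 1][i - 1] + s, S[j][i - 1] + gap, S[j - 1][i] + gap)

    def walk(i: int, j: int):
        """Yield every optimal alignment of top[:i] with left[:j]."""
        if i == 0 and j == 0:
            yield "", ""
            return
        if i > 0 and j > 0:
            s = match if top[i - 1] == left[j - 1] else mismatch
            if S[j][i] == S[j - 1][i - 1] + s:        # diagonal move
                for a, b in walk(i - 1, j - 1):
                    yield a + top[i - 1], b + left[j - 1]
        if i > 0 and S[j][i] == S[j][i - 1] + gap:    # horizontal: gap in left sequence
            for a, b in walk(i - 1, j):
                yield a + top[i - 1], b + "-"
        if j > 0 and S[j][i] == S[j - 1][i] + gap:    # vertical: gap in top sequence
            for a, b in walk(i, j - 1):
                yield a + "-", b + left[j - 1]

    return S[rows - 1][cols - 1], sorted(walk(cols - 1, rows - 1))


if __name__ == "__main__":
    score, alignments = all_optimal_alignments("TACT", "AATC")
    print("score:", score)
    for a, b in alignments:
        print(a, b, sep="\n", end="\n\n")
```

For TACT and AATC this yields a score of 1 and three alignments: TA-CT / AATC-, TACT- / -AATC and TACT- / AA-TC.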
48
Local Alignment Algorithm

Algorithm of Smith and Waterman (1981)
Makes an optimal alignment of the best segment of
similarity between two sequences
Suited to sequences that are not highly similar as
a whole, but contain regions that are highly
similar
Use when one sequence is short and the other is
very long (e.g. a database)
Can return a number of well-aligned segments
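
A minimal Python sketch of the local alignment idea follows. Compared with global alignment, every entry is floored at 0 and the traceback starts from the highest-scoring entry and stops at the first 0. The scores (match 3, mismatch -1, gap -2) are carried over from the global example as an assumption, the function name smith_waterman is chosen for this illustration, and only a single best segment is returned.

```python
def smith_waterman(a: str, b: str, match: int = 3, mismatch: int = -1,
                   gap: int = -2) -> tuple[int, str, str]:
    """Return (best local score, aligned segment of a, aligned segment of b)."""
    rows, cols = len(b) + 1, len(a) + 1
    H = [[0] * cols for _ in range(rows)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for j in range(1, rows):
        for i in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[j][i] = max(0, H[j - 1][i - 1] + s, H[j][i - 1] + gap, H[j - 1][i] + gap)
            if H[j][i] > best:
                best, best_pos = H[j][i], (j, i)

    # Trace back from the best entry until a 0 is reached.
    j, i = best_pos
    seg_a, seg_b = [], []
    while H[j][i] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[j][i] == H[j - 1][i - 1] + s:
            seg_a.append(a[i - 1]); seg_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[j][i] == H[j][i - 1] + gap:
            seg_a.append(a[i - 1]); seg_b.append("-"); i -= 1
        else:
            seg_a.append("-"); seg_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(seg_a)), "".join(reversed(seg_b))


if __name__ == "__main__":
    print(smith_waterman("TACT", "AATC"))
```

For TACT and AATC the best local segment is ACT against AAT with score 5, higher than the global score of 1 because the unmatched ends are simply left out.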

54
Does a local alignment program always produce a
local alignment, and does a global alignment
program always produce a global alignment?

Although a computer program based on the
Smith-Waterman local alignment algorithm produces
an optimal alignment, this does not assure that a
local alignment will be produced.
The scoring matrix or match/mismatch scores and
the gap penalties chosen also influence whether or
not a local alignment is obtained.
The same is true of the Needleman-Wunsch
algorithm.

If the matched regions are long, cover most of
the sequences and depend on the presence of many
gaps, the alignment is global.
A local alignment tends to be shorter and does
not include many gaps.

56
Tools based on Dynamic programming

Global Alignment-
GAP- No penalties for terminal gaps, so it is
suitable for sequences of unequal length.
Local Alignment-
SIM, SSEARCH and LALIGN

57
Multiple Sequence Alignment

It is theoretically possible to use dynamic
programming to align any number of sequences, as
is done for pairwise alignment
The amount of computing time increases
exponentially as the number of sequences
increases
Therefore full dynamic programming cannot be
applied to datasets having more than ten
sequences
So heuristic methods are used for multiple
sequence alignment (MSA).
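
The exponential growth can be made concrete with a little arithmetic, sketched here in Python. It assumes the pairwise matrix is generalized directly to one dimension per sequence, so k sequences of length n need (n + 1)^k entries, each reachable by up to 2^k - 1 moves. The function name msa_dp_cost and the lengths used are illustrative only.

```python
def msa_dp_cost(n: int, k: int) -> tuple[int, int]:
    """Return (number of matrix entries, number of move evaluations) for k sequences of length n."""
    cells = (n + 1) ** k          # one axis of size n + 1 per sequence
    moves_per_cell = 2 ** k - 1   # each sequence either advances or takes a gap, not all gaps
    return cells, cells * moves_per_cell


if __name__ == "__main__":
    for k in (2, 3, 5, 10):
        cells, work = msa_dp_cost(100, k)
        print(f"k={k:2d}  entries={cells:.3e}  move evaluations={work:.3e}")
```

For two sequences of length 4 this gives the familiar 25 entries; each additional sequence multiplies the number of entries by n + 1.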

58
Thank you
