Title: Pairwise Sequence Alignment III
1Introduction to bioinformatics 2007 Lecture 7
Pair-wise Sequence Alignment (III)
2What can sequence alignment tell us about
structure HSSP Sander Schneider, 1991
30 sequence identity
3Global dynamic programmingPAM250, Gap 6 (linear)
These values are copied from the PAM250 matrix
(see earlier slide)
Start in left upper cell before either sequence
(circled in red). Path will end in lower right
cell (circled in blue)
SPEARE S-HAKE
The easy algorithm is only for linear gap
penalties
Higgs Attwood, p. 124 Note There are errors
in the matrices!!
4DP is a two-step process
- Forward step calculate scores
- Trace back start at highest score and
reconstruct the path leading to the highest score - These two steps lead to the highest scoring
alignment (the optimal alignment) - This is guaranteed when you use DP!
5Time and memory complexity of DP
- The time complexity is O(n2) if you would align
two sequences of n residues, you would need to
perform n2 algorithmic steps (square search
matrix has n2 cells that need to be filled) - The memory (space) complexity is also O(n2) if
you would align two sequences of n residues, you
would need a square search matrix of n by n
containing n2 cells
6Global dynamic programming
j-1 j
i-1 i
H(i-1,j-1) S(i,j) H(i-1,j) - g H(i,j-1) - g
diagonal vertical horizontal
H(i,j) Max
- Problem with simple DP approach
- Can only do linear gap penalties
- Not suitable for affine and concave penalties
7Global dynamic programmingusing affine penalties
j-2 j-1 j
Looking back from cell (i, j) we can adapt the
algorithm for affine gap penalties by looking
back to four more cells (magenta)
i-2 i-1 i
If you came from here, gap was already open, so
apply gap-extension penalty
If you came from here, gap was opened so apply
gap-open penalty
8Global dynamic programmingall three types of gap
penalties
j-1
i-1
Gap opening penalty
MaxS0ltxlti-1, j-1 - Pi - (i-x-1)Px Si-1,j-1 MaxS
i-1, 0ltyltj-1 - Pi - (j-y-1)Px
Si,j si,j Max
Gap extension penalty
9Global dynamic programmingGapo10, Gape2
These values are copied from the PAM250 matrix
(see earlier slide), after being made
non-negative by adding 8 to each PAM250 matrix
cell (-8 is the lowest number in the PAM250
matrix)
The extra bottom row and rightmost column give
the final global alignment scores
10Easy DP recipe for using affine gap penalties
j-1
i-1
- Mi,j is optimal alignment (highest scoring
alignment until i,j) - Check
- preceding row until j-2 apply appropriate gap
penalties - preceding row until i-2 apply appropriate gap
penalties - and celli-1, j-1 apply score for celli-1,
j-1
11DP is a two-step process
- Forward step calculate scores
- Trace back start at highest score and
reconstruct the path leading to the highest score - These two steps lead to the highest scoring
alignment (the optimal alignment) - This is guaranteed when you use DP!
12Global dynamic programming
13There are three kinds of alignments
- Global alignment
- Semi-global alignment
- Local alignment
14Semi-global pairwise alignment
- Global alignment all gaps are penalised
- Semi-global alignment N- and C-terminal gaps
(end-gaps) are not penalised - MSTGAVLIY--TS-----
- ---GGILLFHRTSGTSNS
End-gaps
End-gaps
15Semi-global dynamic programmingPAM250, Gap 6
(linear)
These values are copied from the PAM250 matrix
(see earlier slide)
Start in left upper cell before either sequence
(circled in red). Path will end in cell anywhere
in the bottom row or rightmost columns with the
highest score
SPEARE -SHAKE
The easy algorithm is only for linear gap
penalties
Higgs Attwood, p. 124 Note There are errors
in the matrices!!
16Semi-global dynamic programming- two examples
with different gap penalties -
These values are copied from the PAM250 matrix
(see earlier slide), after being made
non-negative by adding 8 to each PAM250 matrix
cell (-8 is the lowest number in the PAM250
matrix)
Global score would have been 65 10 12 10
22 (because of the two end gaps)
17Semi-global pairwise alignment
- Applications of semi-global
- Finding a gene in genome
- Placing marker onto a chromosome
- Generally if one sequence is much longer than
the other -
- Danger if gap penalties high -- really bad
alignments for divergent sequences
18Local dynamic programming (Smith Waterman,
1981)
LCFVMLAGSTVIVGTR
E D A S T I L C G S
Negative numbers
Amino Acid Exchange Matrix
Search matrix
Gap penalties (open, extension)
AGSTVIVG A-STILCG
19Local dynamic programming (Smith and Waterman,
1981)basic algorithm
j-1 j
i-1 i
H(i-1,j-1) S(i,j) H(i-1,j) - g H(i,j-1) - g 0
diagonal vertical horizontal
H(i,j) Max
20Local dynamic programmingMatch/mismatch 1/-1,
Gap 2
Fill the matrix (forward pass), then do trace
back from highest cell anywhere in the matrix
till you reach 0 or the beginning of a sequence
21Local dynamic programmingMatch/mismatch 1/-1,
Gap 2
GAC GAC
Fill the matrix (forward pass), then do trace
back from highest cell anywhere in the matrix
till you reach 0 or the beginning of a sequence
22 Local dynamic programming (Smith Waterman,
1981)
j-1
This is the general DP algorithm, which is
suitable for linear, affine and concave
penalties, although for the examples here affine
penalties are used
i-1
Gap opening penalty
Si,j MaxS0ltxlti-1,j-1 - Pi - (i-x-1)Px Si,j
Si-1,j-1 Si,j Max Si-1,0ltyltj-1 - Pi -
(j-y-1)Px 0
Si,j Max
Gap extension penalty
23Local dynamic programming
24Global and local alignment
B
B
C
A
A
B
A
A
C
A
B
C
A
Local
B
Local
A
B
C
A
B
C
B
A
Global
Global
A
B
C
A
25Global or Local Pairwise alignment
B
B
C
A
A
B
A
A
C
A
B
C
A
Local
B
Local
A
B
C
A
B
C
B
A
Global
Global
A
B
C
A
26Globin fold ? protein myoglobin PDB 1MBN
Alpha-helices are labelled A (blue) to H
(red). The D helix can be missing in some
globins What happens with the alignment if
D-helix containing globin sequences are aligned
with D-less ones?
27 ? sandwich ? protein immunoglobulin PDB 7FAB
Immunoglobulinstructures have variable regions
where numbers of amino acids can vary
substantially
28TIM barrel ? / ? protein Triose phosphate
IsoMerase PDB 1TIM
The evolutionary history of this protein family
has been the subject of rigorous debate.
Arguments have been made in favor of both
convergent and divergent evolution. Because of
the general lack of sequence homology, the
ancestry of this molecule is still a mystery.
29Pyruvate kinase Phosphotransferase
b barrel regulatory domain a/b barrel
catalytic substrate binding domain a/b
nucleotide binding domain
30What does all this mean for alignments?
- Alignments need to be able to skip secondary
structural elements to complete domains (i.e.
putting gaps opposite these motifs in the shorter
sequence). - Depending on gap penalties chosen, the algorithm
might have difficulty with making such long gaps
(for example when using high affine gap
penalties), resulting in incorrect alignment. - Alignments are only meaningful for homologous
sequences (with a common ancestor)
31There are three kinds of pairwise alignments
- Global alignment align all residues in both
sequences all gaps are penalised - Semi-global alignment align all residues in
both sequences end gaps are not penalised (zero
end gap penalties) - Local alignment align one part of each
sequence end gaps are not applicable
32Easy global DP recipe for using affine gap
penalties (after Gotoh)
j-1
Penalty Pi gap_lengthPe
MaxS0ltxlti-1, j-1 - Pi - (i-x-1)Px Si-1,j-1 MaxS
i-1, 0ltyltj-1 - Pi - (j-y-1)Px
Si,j si,j Max
i-1
- Mi,j is optimal alignment (highest scoring
alignment until i, j) - At each cell i, j in search matrix, check Max
coming from - any cell in preceding row until j-2 add score
for celli, j minus appropriate gap penalties - any cell in preceding column until i-2 add score
for celli, j minus appropriate gap penalties - or celli-1, j-1 add score for celli, j
- Select highest scoring cell in bottom row and
rightmost column and do trace-back
33Lets do an example global alignmentGotohs DP
algorithm with affine gap penalties (PAM250,
Pi10, Pe2)
PAM250
Cell (D2, T4) can alternatively come from two
cells (same score) high-road or low-road
Row and column 0 are filled with 0, -12, -14,
-16, if global alignment is used (for
N-terminal end-gaps) also extra row and column
at the end to calculate the score including
C-terminal end-gap penalties. Note that only
non-diagonal arrows are indicated for clarity
(no arrow means that you go back to earlier
diagonal cell).
34Lets do another example semi-global
alignmentGotohs DP algorithm with affine gap
penalties (PAM250, Pi10, Pe2)
PAM250
Starting row and column 0, and extra column at
right or extra row at bottom is not necessary
when using semi global alignment (zero end-gaps).
Rest works as under global alignment.
35Easy local DP recipe for using affine gap
penalties (after Gotoh)
j-1
Penalty Pi gap_lengthPe
Si,j MaxS0ltxlti-1,j-1 - Pi - (i-x-1)Px Si,j
Si-1,j-1 Si,j Max Si-1,0ltyltj-1 - Pi -
(j-y-1)Px 0
Si,j Max
i-1
- Mi,j is optimal alignment (highest scoring
alignment until i, j) - At each cell i, j in search matrix, check Max
coming from - any cell in preceding row until j-2 add score
for celli, j minus appropriate gap penalties - any cell in preceding column until i-2 add score
for celli, j minus appropriate gap penalties - or celli-1, j-1 add score for celli, j
- Select highest scoring cell anywhere in matrix
and do trace-back until zero-valued cell or start
of sequence(s)
36Lets do yet another example local
alignmentGotohs DP algorithm with affine gap
penalties (PAM250, Pi10, Pe2)
PAM250
Extra start/end columns/rows not necessary (no
end-gaps). Each negative scoring cell is set to
zero. Highest scoring cell may be found anywhere
in search matrix after calculating it. Trace
highest scoring cell back to first cell with zero
value (or the beginning of one or both sequences)
37Dot plots
- Way of representing (visualising) sequence
similarity without doing dynamic programming (DP) - Make same matrix, but locally represent sequence
similarity by averaging using a window
38Comparing two sequences We want to be able to
choose the best alignment between two
sequences. A simple method of visualising
similarities between two sequences is to use dot
plots. The first sequence to be compared is
assigned to the horizontal axis and the second is
assigned to the vertical axis.
39Dot plots can be filtered by window approaches
(to calculate running averages) and applying a
threshold They can identify insertions,
deletions, inversions
40For your first exam D1Make sure you
understand and can carry out 1. the simple DP
algorithm for global, semi-global and local
alignment (using linear gap penalties but make
sure you know the extension of the basic
algorithm for affine gap penalties) and 2. The
general DP algorithm for global, semi-global and
local alignment (using linear, affine and concave
gap penalties)!