Title: Introduction to bioinformatics Lecture 5 Pair-wise sequence alignment
1 Introduction to bioinformatics, Lecture 5: Pair-wise sequence alignment
2 Bioinformatics
- "Nothing in Biology makes sense except in the light of evolution" (Theodosius Dobzhansky, 1900-1975)
- "Nothing in bioinformatics makes sense except in the light of Biology"
3 Example today: Pairwise sequence alignment needs sense of evolution
Global dynamic programming
[Figure: the sequence MDAGSTVILCFVG evolves into MDAASTILCGS; the two sequences are compared in a search matrix using an amino acid exchange matrix and gap penalties (open, extension), which gives the alignment:]
MDAGSTVILCFVG-
MDAAST-ILC--GS
4 Evolution
- Ancestral sequence: ABCD
- Descendant 1: ACCD (mutation, B → C)
- Descendant 2: ABD (deletion, C → ø)
- Pairwise alignment of the two descendants, two candidates:
    ACCD    or    ACCD
    AB-D          A-BD
- The first candidate is the true alignment: its gap sits at the position of the deleted C.
6 A protein sequence alignment:
MSTGAVLIY--TSILIKECHAMPAGNE-----
---GGILLFHRTHELIKESHAMANDEGGSNNS
A DNA sequence alignment:
attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc----
7 Searching for similarities
What is the function of the new gene?
The "lazy" investigation (i.e., no biological experiments, just bioinformatics techniques):
- Find a set of similar protein sequences to the unknown sequence
- Identify similarities and differences
- For long proteins: identify domains
8 Evolutionary and functional relationships
- Reconstruct evolutionary relation
- Based on sequence
  - Identity (simplest method)
  - Similarity
- Homology (common ancestry: the ultimate goal)
- Other (e.g., 3D structure)
- Functional relation
- Sequence → Structure → Function
9 Searching for similarities
Common ancestry is more interesting: it makes it more likely that genes share the same function.
Homology means sharing a common ancestor. It is a binary property (yes/no) and a useful tool: when an unknown gene X is homologous to a known gene G, we gain a lot of information about X, because what we know about G can be transferred to X as a good suggestion.
10 How to go from DNA to protein sequence
A piece of double-stranded DNA:
5' attcgttggcaaatcgcccctatccggc 3'
3' taagcaaccgtttagcggggataggccg 5'
DNA direction is from 5' to 3'.
11 How to go from DNA to protein sequence
6-frame translation using the codon table (last lecture):
5' attcgttggcaaatcgcccctatccggc 3'
3' taagcaaccgtttagcggggataggccg 5'
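A minimal sketch of a 6-frame translation, assuming the standard genetic code. The function names (`reverse_complement`, `translate`, `six_frame_translation`) and the frame labels `+1` to `-3` are illustrative choices, not part of the lecture. The three forward frames read the top strand at offsets 0, 1 and 2; the three reverse frames read the complementary strand, again in the 5' to 3' direction.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed for codons in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...); '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the complementary strand, written 5' to 3'."""
    complement = dna.upper().translate(str.maketrans("ACGT", "TGCA"))
    return complement[::-1]


def translate(dna, offset=0):
    """Translate complete codons starting at `offset`; a trailing partial codon is ignored."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(offset, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate all three reading frames of both strands."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq, offset)
    return frames


for name, protein in six_frame_translation("attcgttggcaaatcgcccctatccggc").items():
    print(name, protein)
```

Only the four unambiguous bases are handled; a sequence containing other symbols (such as N) would need an extended table.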
12 Evolution and three-dimensional protein structure information
Isocitrate dehydrogenase: the distance from the active site (in yellow) determines the rate of evolution (red = fast evolution, blue = slow evolution).
Dean, A. M. and G. B. Golding, Pacific Symposium on Biocomputing 2000
13 How to determine similarity
Frequent evolutionary events at the DNA level:
1. Substitution
2. Insertion, deletion
3. Duplication
4. Inversion
We will restrict ourselves to these events.
15 Dynamic programming: Scoring alignments
- Substitution (or match/mismatch) scores, for DNA and for proteins
- Gap penalty for a gap of length k:
  - Linear: $gp(k) = ak$
  - Affine: $gp(k) = b + ak$
  - Concave, e.g. $gp(k) = \log(k)$
The score for an alignment is the sum of the scores of all alignment columns.
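16 Dynamic programming: Scoring alignments
$$S = \sum s(a,b) - \sum gp(k), \qquad gp(k) = \text{gapinit} + k \cdot \text{gapextension}$$
The first sum runs over the aligned residue pairs (a, b) and the second over the gaps, each of length k (affine gap penalties).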
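The three gap-penalty shapes named in this lecture can be written as small functions. This is only a sketch: the function and parameter names are illustrative, and the penalties are returned as positive numbers to be subtracted from the alignment score.

```python
import math


def linear_gap(k, a):
    """Linear penalty: every gap position costs the same amount a."""
    return a * k


def affine_gap(k, gap_init, gap_extension):
    """Affine penalty: a fixed cost for opening the gap plus a cost per gap position."""
    return gap_init + k * gap_extension


def concave_gap(k):
    """Concave penalty: each additional gap position costs less than the previous one."""
    return math.log(k)


for k in (1, 2, 5, 10):
    print(k, linear_gap(k, 2), affine_gap(k, 10, 2), round(concave_gap(k), 2))
```

With an affine penalty one long gap is cheaper than several short gaps of the same total length, which matches the idea that a single insertion or deletion event can span several residues.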
17 DNA: define a score for match/mismatch of letters
Simple:
     A    C    G    T
A    1   -1   -1   -1
C   -1    1   -1   -1
G   -1   -1    1   -1
T   -1   -1   -1    1
Used in genome alignments:
       A     C     G     T
A     91  -114   -31  -123
C   -114   100  -125   -31
G    -31  -125   100  -114
T   -123   -31  -114    91
18 Dynamic programming: Scoring alignments
T D W V T A L K
T D W L - - I K
Amino Acid Exchange Matrix (20×20)
Affine gap penalties (open, extension): 10, 1
Score = s(T,T) + s(D,D) + s(W,W) + s(V,L) - Po - 2Px + s(L,I) + s(K,K)
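The score formula above can be evaluated column by column. The sketch below assumes the exchange values are taken from the PAM250 matrix in this lecture (only the six pairs needed are listed) and assumes Po = 10 and Px = 1 purely for illustration; `score_alignment` is a hypothetical helper name.

```python
def score_alignment(top, bottom, exchange, gap_open, gap_extend):
    """Sum exchange values over aligned pairs and subtract affine gap penalties.

    A run of consecutive gap columns counts as one gap: gap_open is charged
    once, gap_extend once per gap column. For simplicity, a gap in one
    sequence directly followed by a gap in the other is treated as one run.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            if not in_gap:
                total -= gap_open
                in_gap = True
            total -= gap_extend
        else:
            total += exchange[frozenset((a, b))]
            in_gap = False
    return total


# PAM250 values for the residue pairs that occur in this alignment
PAM250_SUBSET = {
    frozenset("T"): 3,
    frozenset("D"): 4,
    frozenset("W"): 17,
    frozenset("VL"): 2,
    frozenset("LI"): 2,
    frozenset("K"): 5,
}

print(score_alignment("TDWVTALK", "TDWL--IK", PAM250_SUBSET, gap_open=10, gap_extend=1))
```

With these assumed penalties the terms are 3 + 4 + 17 + 2 - 10 - 2·1 + 2 + 5 = 21.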
19 Amino acid exchange matrices (20×20)
How do we get one? And how do we get associated gap penalties?
The first systematic method to derive a.a. exchange matrices is due to Margaret Dayhoff et al. (1978), Atlas of Protein Sequence and Structure.
20 PAM250 matrix: amino acid exchange matrix (log odds)
A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4  -4  -5  12
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4
B   0  -1   2   3  -4   1   2   0   1  -2  -3   1  -2  -5  -1   0   0  -5  -3  -2   2
Z   0   0   1   3  -5   3   3  -1   2  -2  -3   0  -2  -5   0   0  -1  -6  -4  -2   2   3
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
Positive exchange values denote mutations that
are more likely than randomly expected, while
negative numbers correspond to avoided mutations
compared to the randomly expected situation
21 Amino acid exchange matrices
Amino acids are not equal:
1. Some are easily substituted because they have similar physico-chemical properties or structure
2. Some mutations between amino acids occur more often due to similar codons
The two observations above give us ways to define substitution matrices.
22 Pair-wise alignment
T D W V T A L K
T D W L - - I K
Combinatorial explosion:
- 1 gap in 1 sequence: n+1 possibilities
- 2 gaps in 1 sequence: (n+1)n
- 3 gaps in 1 sequence: (n+1)n(n-1), etc.
$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$$
- 2 sequences of 300 a.a.: about 10^179 alignments
- 2 sequences of 1000 a.a.: about 10^600 alignments!
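The size of the search space can be checked directly, since Python integers have arbitrary precision. A small sketch, assuming the binomial count (2n)!/(n!)^2 for two sequences of length n; the function names are illustrative.

```python
import math


def count_alignments(n):
    """Exact binomial count (2n choose n) for two sequences of length n."""
    return math.comb(2 * n, n)


def log10_estimate(n):
    """log10 of the approximation 2^(2n) / sqrt(pi * n)."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)


for n in (10, 300, 1000):
    digits = len(str(count_alignments(n)))
    print(f"n = {n}: {digits} digits, estimate 10^{log10_estimate(n):.1f}")
```

Even at a billion alignments per second, enumerating a number with hundreds of digits is out of the question, which is why an exhaustive search is never attempted.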
23 Technique to overcome the combinatorial explosion: Dynamic Programming
- Alignment is simulated as a Markov process; all sequence positions are seen as independent
- Chances of sequence events are independent
24 Sequence alignment: History of the Dynamic Programming algorithm
- 1970, Needleman-Wunsch: global pair-wise alignment. Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48(3), 443-453.
- 1981, Smith-Waterman: local pair-wise alignment. Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197.
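26 Global dynamic programming
[Figure: search matrix in which cell (i, j) is filled from row i-1 and column j-1]
$$S_{i,j} = s_{i,j} + \max \begin{cases} \max_{0<x<i-1} \{ S_{x,\,j-1} - P_{\mathrm{i}} - (i-x-1)P_{\mathrm{x}} \} \\ S_{i-1,\,j-1} \\ \max_{0<y<j-1} \{ S_{i-1,\,y} - P_{\mathrm{i}} - (j-y-1)P_{\mathrm{x}} \} \end{cases}$$
Here $s_{i,j}$ is the exchange value for the residues at positions i and j, $P_{\mathrm{i}}$ is the gap initiation penalty and $P_{\mathrm{x}}$ is the gap extension penalty.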
27 Global dynamic programming
These values are copied from the PAM250 matrix (see earlier slide), after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix).
Global score is 65 - 10 - 1×2 - 10 - 2×2
29 Global dynamic programming: Gapo = 10, Gape = 2
Search matrix of cumulative scores:
          D    W    V    T    A    L    K
     0  -12  -14  -16  -18  -20  -22  -24
T  -12    8   -9   -6   -5   -9  -11  -14  -34
D  -14    0    9    2    2    3   -5   -3  -21
W  -16  -13   25   11    5    4    9    0  -16
V  -18  -10   -4   37   21   19   19   15    1
L  -20  -14   -2   23   46   31   37   26   14
K  -22  -12   -9   17   33   53   39   50
        -34  -29   -1   17   39   27          50
Exchange values:
     D    W    V    T    A    L    K
T    8    3    8   11    9    9    8
D   12    1    6    8    8    4    8
W    1   25    2    3    2    6    5
V    6    2   12    8    8   10    6
L    4    6   10    9    6   14    5
K    8    5    6    8    7    5   13
These values are copied from the PAM250 matrix (see earlier slide), after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix).
The extra bottom row and rightmost column give the final global alignment scores.
30 Easy DP recipe for using affine gap penalties
[Figure: cell (i, j) with the preceding row i-1 and the preceding column j-1]
- M(i,j) is the optimal alignment (highest-scoring alignment) until (i,j)
- To fill cell (i,j), check:
  - the preceding row until j-2: apply appropriate gap penalties
  - the preceding column until i-2: apply appropriate gap penalties
  - and cell (i-1, j-1): apply the score for cell (i-1, j-1)
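The recipe can be sketched as a forward pass over the search matrix. The code below is an illustration under stated assumptions, not the lecturer's implementation: index 0 is used as a boundary row and column that carry the penalties for leading end gaps, a gap of length k costs `gap_open + k * gap_extend`, and `gap_penalty`, `global_dp`, `ROWS`, `COLS` and `VALUES` are hypothetical names. The exchange values are the non-negative PAM250-derived values listed in this lecture's worked example, with gap open 10 and gap extension 2.

```python
def gap_penalty(length, gap_open, gap_extend):
    """Affine penalty for a gap of `length` residues (0 when there is no gap)."""
    return gap_open + length * gap_extend if length > 0 else 0


def global_dp(seq_a, seq_b, exchange, gap_open, gap_extend):
    """Forward step of global DP; returns (final score, search matrix)."""
    n, m = len(seq_a), len(seq_b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # leading end gaps in seq_b
        S[i][0] = -gap_penalty(i, gap_open, gap_extend)
    for j in range(1, m + 1):          # leading end gaps in seq_a
        S[0][j] = -gap_penalty(j, gap_open, gap_extend)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = S[i - 1][j - 1]     # no gap: come from the diagonal cell
            for x in range(i - 1):     # preceding column, rows up to i-2
                best = max(best, S[x][j - 1] - gap_penalty(i - 1 - x, gap_open, gap_extend))
            for y in range(j - 1):     # preceding row, columns up to j-2
                best = max(best, S[i - 1][y] - gap_penalty(j - 1 - y, gap_open, gap_extend))
            S[i][j] = exchange[seq_a[i - 1]][seq_b[j - 1]] + best

    # Trailing end gaps: penalise the residues left over after the last aligned pair.
    final = max(
        max(S[i][m] - gap_penalty(n - i, gap_open, gap_extend) for i in range(1, n + 1)),
        max(S[n][j] - gap_penalty(m - j, gap_open, gap_extend) for j in range(1, m + 1)),
    )
    return final, S


# Exchange values of the worked example (rows T D W V L K, columns D W V T A L K)
ROWS, COLS = "TDWVLK", "DWVTALK"
VALUES = [
    [8, 3, 8, 11, 9, 9, 8],
    [12, 1, 6, 8, 8, 4, 8],
    [1, 25, 2, 3, 2, 6, 5],
    [6, 2, 12, 8, 8, 10, 6],
    [4, 6, 10, 9, 6, 14, 5],
    [8, 5, 6, 8, 7, 5, 13],
]
EXCHANGE = {r: dict(zip(COLS, row)) for r, row in zip(ROWS, VALUES)}

score, matrix = global_dp(ROWS, COLS, EXCHANGE, gap_open=10, gap_extend=2)
print(score)
```

Under these assumptions the final score is 50, the final global score of the worked example. Scanning a whole row and column for every cell makes this cubic in sequence length; it follows the recipe literally rather than aiming for speed.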
31 DP is a two-step process
- Forward step: calculate scores
- Trace back: start at the highest score and reconstruct the path leading to the highest score
- These two steps lead to the highest-scoring alignment (the optimal alignment)
- This is guaranteed when you use DP!
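The two steps can be seen together in a compact sketch. To keep the trace back short, this example deliberately swaps in a linear gap penalty and simple match/mismatch scores instead of the affine scheme and exchange matrix used elsewhere in this lecture, and it starts the trace back in the bottom-right cell, where the global score sits when end gaps are penalised like any other gap. The function name and default scores are illustrative.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty; returns (score, aligned a, aligned b)."""
    n, m = len(a), len(b)

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Forward step: fill the matrix of best scores.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + pair_score(i, j),  # align a[i-1] with b[j-1]
                S[i - 1][j] + gap,                   # gap in b
                S[i][j - 1] + gap,                   # gap in a
            )

    # Trace back: repeat the decision that produced each cell's score.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + pair_score(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("ACCD", "ABD")
print(score)
print(top)
print(bottom)
```

When several paths reach the same score, the trace back returns just one of them; which one depends on the order in which the three moves are tested.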
32Global dynamic programming
33 Global pairwise alignment
- Global alignment: all gaps are penalised
- Semi-global alignment: N- and C-terminal gaps (end-gaps) are not penalised
MSTGAVLIY--TS-----
---GGILLFHRTSGTSNS
(End-gaps: the three leading gaps in the second sequence and the five trailing gaps in the first.)
34 Semi-global pairwise alignment
- Applications of semi-global alignment:
  - Finding a gene in a genome
  - Placing a marker onto a chromosome
  - One sequence much longer than the other
- Danger: if gap penalties are high, this gives really bad alignments for divergent sequences
35 Local dynamic programming (Smith-Waterman, 1981)
[Figure: the sequences LCFVMLAGSTVIVGTR and EDASTILCGS are compared in a search matrix using an amino acid exchange matrix that contains negative numbers and gap penalties (open, extension), which gives the local alignment:]
AGSTVIVG
A-STILCG
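36 Local dynamic programming (Smith-Waterman, 1981)
[Figure: search matrix in which cell (i, j) is filled from row i-1 and column j-1]
$$S_{i,j} = \max \begin{cases} s_{i,j} + \max_{0<x<i-1} \{ S_{x,\,j-1} - P_{\mathrm{i}} - (i-x-1)P_{\mathrm{x}} \} \\ s_{i,j} + S_{i-1,\,j-1} \\ s_{i,j} + \max_{0<y<j-1} \{ S_{i-1,\,y} - P_{\mathrm{i}} - (j-y-1)P_{\mathrm{x}} \} \\ 0 \end{cases}$$
Here $s_{i,j}$ is the exchange value for the residues at positions i and j, $P_{\mathrm{i}}$ is the gap initiation penalty and $P_{\mathrm{x}}$ is the gap extension penalty.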
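A minimal sketch of the local forward step, assuming the same search over the preceding row and column as in the global case but with every cell floored at 0, so that an alignment can start anywhere. Index 0 is used as a zero-valued boundary, and `local_dp` and the toy match/mismatch scores are illustrative assumptions; in practice `exchange` would look up an amino acid exchange matrix that contains negative numbers.

```python
def local_dp(seq_a, seq_b, exchange, gap_open, gap_extend):
    """Forward step of local DP; returns (best score, cell where it ends)."""
    def gap(k):
        return gap_open + k * gap_extend

    n, m = len(seq_a), len(seq_b)
    S = [[0] * (m + 1) for _ in range(n + 1)]   # boundary cells stay 0
    best_score, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = S[i - 1][j - 1]              # diagonal: no gap
            for x in range(i - 1):              # preceding column, rows up to i-2
                prev = max(prev, S[x][j - 1] - gap(i - 1 - x))
            for y in range(j - 1):              # preceding row, columns up to j-2
                prev = max(prev, S[i - 1][y] - gap(j - 1 - y))
            # The floor at 0 lets a new local alignment start at this cell.
            S[i][j] = max(0, exchange(seq_a[i - 1], seq_b[j - 1]) + prev)
            if S[i][j] > best_score:
                best_score, best_cell = S[i][j], (i, j)
    return best_score, best_cell


def toy_exchange(a, b):
    """Toy scoring for illustration: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


print(local_dp("TTACGTT", "GGACGGG", toy_exchange, gap_open=10, gap_extend=2))
```

The trace back for a local alignment starts at the best-scoring cell and stops at the first cell whose score is 0. Without negative exchange values the floor would never trigger and the local alignment would tend to span the whole sequences.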
37Local dynamic programming
38 Dot plots
- Way of representing (visualising) sequence similarity without doing dynamic programming (DP)
- Make the same matrix, but locally represent sequence similarity by averaging using a window
- See Lesk's book, pp. 167-171
39 Comparing two sequences
We want to be able to choose the best alignment between two sequences. A simple method of visualising similarities between two sequences is to use dot plots. The first sequence to be compared is assigned to the horizontal axis and the second is assigned to the vertical axis.
40 Dot plots can be filtered by window approaches (to calculate running averages) and by applying a threshold. They can identify insertions, deletions and inversions.
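A windowed dot plot is short enough to sketch in full. The version below assumes identity matching (a dot requires identical letters) and an odd window centred on each cell along the diagonal; the function names and the text rendering are illustrative. A cell gets a dot when the fraction of matching positions inside its window reaches the threshold.

```python
def dot_plot(seq_x, seq_y, window=1, threshold=1.0):
    """Return the set of (i, j) cells whose diagonal window passes the threshold.

    seq_x runs along the horizontal axis (index i), seq_y along the vertical
    axis (index j). `window` should be odd so that it is centred on the cell.
    """
    half = window // 2
    dots = set()
    for j in range(len(seq_y)):
        for i in range(len(seq_x)):
            matches = 0
            for k in range(-half, half + 1):
                xi, yj = i + k, j + k
                if 0 <= xi < len(seq_x) and 0 <= yj < len(seq_y) and seq_x[xi] == seq_y[yj]:
                    matches += 1
            # Window positions that fall outside the sequences count as mismatches.
            if matches / window >= threshold:
                dots.add((i, j))
    return dots


def render(dots, seq_x, seq_y):
    """Draw the plot as text, one row per residue of seq_y."""
    return [
        "".join("*" if (i, j) in dots else "." for i in range(len(seq_x)))
        for j in range(len(seq_y))
    ]


for row in render(dot_plot("MDAGSTVILCFVG", "MDAASTILCGS", window=3, threshold=0.66),
                  "MDAGSTVILCFVG", "MDAASTILCGS"):
    print(row)
```

Similar regions show up as diagonal runs, and an insertion or deletion shifts the run to a neighbouring diagonal. Detecting inversions in DNA would additionally require comparing against the reverse complement, which this sketch does not do.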