Title: One cannot compare apples and pears (German saying)
1 One cannot compare apples and pears (German saying)
Is this true?
2 or
the origin of life
3 One cannot compare apples and pears (German saying)
Is this true?
Apple (Malus):
OC Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; Rosidae; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
Pear (Pyrus):
OC Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; Rosidae; eurosids I; Rosales; Rosaceae; Maloideae; Pyrus.
4 "Nothing in Biology Makes Sense Except in the Light of Evolution"
Theodosius Dobzhansky (1900-1975)
5 Whole genome sequencing facilitates a systematic analysis of redundancy within the genome
Arabidopsis thaliana (MIPS)
6 How does evolution work?
7 Redundancy / homology: intra- and inter-genomes
- Diversification within one genome
- Speciation/diversification between genomes
8 Pairwise sequence alignment
- Why sequence alignment and definitions
- What is sequence alignment
- How to align sequences
- Global vs. local alignment
- Gaps
- Substitution matrix
- Dot plot
- Dynamic programming
9 Why sequence alignment
- Lots of sequences with unknown structure and function vs. a few (but a growing number of) sequences with known structure and function
- If they align, they are similar
- If they are similar, then they might have similar structure and/or function. Identify conserved patterns (motifs)
- If one of them has known structure/function, then alignment of the other might yield insight into how its structure/function works. Similar motif content might hint at similar function
- Define evolutionary relationships
10 Basics in sequence comparison
- Identity
  - The extent to which two (nucleotide or amino acid) sequences are invariant (identical).
- Similarity
  - The extent to which (nucleotide or amino acid) sequences are related. The extent of similarity between two sequences can be based on percent sequence identity and/or conservation. In BLAST, similarity refers to a positive matrix score. This is quite flexible (see the later examples of DNA polymerases): sequences may be similar across the whole sequence, or similarity may be restricted to domains!
- Homology
  - Similarity attributed to descent from a common ancestor.
11 Homologous: similarity attributed to descent of the genes from a common ancestor.
Orthologous: homologous sequences in different species that arose from a common ancestral gene during speciation. Orthologous genes may or may not have the same function! In most cases they will.
Paralogous: homologous sequences within a single species that arose by gene duplication.
13 Aligning biological sequences
- Nucleic acid (4-letter alphabet + gap)
  TT-GCAC
  TTTACAC
- Proteins (20-letter alphabet + gap)
  RKVA--GMAKPNM
  RKIAVAAASKPAV
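The two aligned pairs above can be used to illustrate the identity measure defined earlier. The following is a minimal sketch, not a standard tool: the function name `percent_identity` is hypothetical, and dividing by the full alignment length (gap columns included) is an assumption, since other conventions divide by the shorter sequence or by the number of ungapped columns.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent of alignment columns in which both rows carry the same residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # A column counts as identical only if both symbols match and are not gaps.
    identical = sum(1 for x, y in zip(a, b) if x == y and x != gap)
    # Assumption: the denominator is the alignment length, gaps included.
    return 100.0 * identical / len(a)


dna = percent_identity("TT-GCAC", "TTTACAC")
protein = percent_identity("RKVA--GMAKPNM", "RKIAVAAASKPAV")
print(f"DNA: {dna:.1f}%  protein: {protein:.1f}%")
```

With this convention the nucleotide pair has 5 identical columns out of 7 and the protein pair 5 out of 13.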
14 Problem
- Any two sequences can always be aligned
- There are many possible alignments
- Sequence alignment needs to be scored to find the optimal alignment
- In many cases there will be several solutions with the same score
ACGTACGTACGTACGTACGTACGTACGT
GATCGATCGATCGATCGATCGATCGATC
[Figure: the same pair of sequences shown four times]
Question: what is similar enough to be relevant?
ACCGGTACGTTACGATACGTAACGTTACTGTACTGT
GATCGATCGATCGATCGATCGATCGATC
15 What is sequence alignment
- Given two sequences of letters (strings) and a scoring scheme for evaluating matching letters, find the optimal pairing of letters from one sequence to letters of the other sequence
- Align
  THIS IS A RATHER LONGER SENTENCE THAN THE NEXT
  THIS IS A SHORT SENTENCE
- Either
  THIS IS A RATHER LONGER - SENTENCE THAN THE NEXT
  THIS IS A --SH-- -O---R T SENTENCE ---- --- ----
- or
  THIS IS A RATHER LONGER SENTENCE THAN THE NEXT
  THIS IS A SHORT- ------ SENTENCE ---- --- ----
16 Statement of problem
- Given
  - 2 sequences
  - Scoring system for evaluating a match (or mismatch) of two characters (simple for nucleic acids, difficult for proteins)
  - Penalty function for gaps in sequences
- Produce
  - Optimal pairing of sequences that retains the order of characters in each sequence, perhaps introducing gaps, such that the total score is optimal.
18 Pairwise sequence alignment
- Global alignment
- Local alignment
19 Global alignment
1 TGTCGATTAAGCGGTCGTAGCTGACCTGAGATTGCCCGATGGCGTAGTAGCTGACC 56
1 TGTCGATTATGCGGTCGTAG..GACCTGAGTTTCCCCGATGGCGTAGTAGGTGACC 54
Two closely related sequences
Algorithm: GAP (Needleman-Wunsch). Produces an end-to-end alignment.
20 Global alignment
1 AGGATTGGAATGCTCAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAATTAAAGAGGAGGTAGACCG... 67
1 AGGATTGGAATGCTAGGCTTGATTGCCTACCTGTAGCCACATCAGAAGCACTAAAGCGTCAGCGAGACCG 70
Two sequences sharing several regions of local similarity
Algorithm: GAP (Needleman-Wunsch). Produces an end-to-end alignment.
21 Local alignment
- Algorithm: Bestfit (Smith-Waterman). Identifies the region with the best local similarity.
- Algorithm: Similarity (X. Huang). Identifies all regions with local similarity.
23 Global alignment: the gap
1 AGGATTGGAATGCTCAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAATTAAAGAGGAGGTAGACCG 67
1 AGGATTGGAATGCTACAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAATTAAAGAGGAGGTAGACCG 68
24 Parameters for sequence alignment
- Gap penalties
  - Opening: the cost to introduce a gap
  - Extension: the cost to extend a gap
- Scoring systems
  - Every symbol pairing is assigned a numerical value that is based on a symbol comparison or replacement table/matrix
25 Drawing alignments
- Exact matches OK, inexact costly, gaps cheap
  This is a rather longer sentence than the next
  This is a ------ ------ sentence ---- --- ----
- Exact matches OK, inexact costly, gaps cheap
  This is a -rather longer - sentence than the next
  This is a s---h-- -o---r-t sentence ---- --- ----
- Exact matches OK, inexact moderately costly, gap extension cheap
  This is a rather longer sentence than the next
  This is a -----s hort-- sentence ---- --- ----
- Exact matches OK, inexact cheap, gaps costly
  This is a rather longer sentence than the next
  This is a short sentence----------------------
Which is best? What does "best" mean? How can this be calculated?
26 Gaps (insertions/deletions) are scored
ATGTAATGCA
TATGTGGAATGA
or
ATGT..AATGCA
TATGTGGAATGA
Insertion / deletion
The generation of a gap is penalized with a negative score
27 Why gap penalties?
Scoring: match = 5, mismatch = -4
Gaps not permitted: score 0
1 GTGATAGACACAGACCGGTGGCATTGTGG 29
1 GTGTCGGGAAGAGATAACTCCGATGGTTG 29
13 matches, 16 mismatches
Gaps allowed but not penalized: score 88
1 GTG.ATAG.ACACAGA..CCGGT..GGCATTGTGG 29
1 GTGTAT.GGA.AGAGATACC..TCCG..ATGGTTG 29
20 matches, 3 mismatches
28 Why gap penalties?
- The optimal alignment of two similar sequences usually
  - maximizes the number of matches and
  - minimizes the number of gaps.
- Permitting the insertion of arbitrarily many gaps might lead to high-scoring alignments of non-homologous sequences.
- Penalizing gaps forces alignments to have relatively few gaps.
Gap penalties increase the quality of an alignment: non-homologous sequences are not aligned.
29 Gap penalties
- Linear gap penalty score: $\gamma(g) = -g\,d$
- Affine gap penalty score: $\gamma(g) = -d - (g - 1)\,e$
where
- $\gamma(g)$: gap penalty score of a gap of length $g$
- $d$: gap opening penalty
- $e$: gap extension penalty
- $g$: gap length
30 Scoring insertions and deletions
Scoring: match = 1, mismatch = 0
Without gaps:
T A T G T G C G T A T A
A T G T T A T A C
Total score: 4
With a gap:
T A T G T G C G T A T A
A T G T - - - T A T A C
Total score: 8 + (-3.2) = 4.8
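The two gap penalty formulas and the worked example above can be checked with a few lines of code. This is a minimal sketch; the function names are hypothetical, and the values d = 3 and e = 0.1 are an assumption taken from the parameters used on the following slide, because they reproduce the -3.2 shown for the gap of length 3.

```python
def linear_gap(g: int, d: float) -> float:
    """Linear penalty: every gap position costs d."""
    return -g * d


def affine_gap(g: int, d: float, e: float) -> float:
    """Affine penalty: opening costs d, each further position costs e."""
    return -d - (g - 1) * e


matches = 8                            # matched columns in the gapped alignment
gap_score = affine_gap(3, d=3, e=0.1)  # one gap of length 3 (assumed d and e)
total = matches * 1 + gap_score        # match = 1, mismatch = 0
print(round(gap_score, 1), round(total, 1))
```

With a linear penalty and the same d, the gap of length 3 would cost 9 instead of 3.2, which is why affine penalties favour one long gap over several short ones.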
31 Modification of gap penalties
Score matrix: BLOSUM62
Gap opening penalty 3, gap extension penalty 0.1: score 6.3
1 ...VLSPADKFLTNV 12
1 VFTELSPAKTV.... 11
Gap opening penalty 0, gap extension penalty 0.1: score 11.3
1 V...LSPADKFLTNV 12
1 VFTELSPA.K..T.V 11
32 Aligning biological sequences: where to put the gap?
- Nucleic acid (4-letter alphabet + gap)
  TT-GCAC
  TTTACAC
  or
  TTG-CAC
  TTTACAC
Mismatch G/A or G/T?
34 Parameters for sequence alignment
- Gap penalties
  - Opening: the cost to introduce a gap
  - Extension: the cost to extend a gap
- Scoring systems
  - Every symbol pairing is assigned a numerical value that is based on a symbol comparison or substitution table/matrix
35 DNA scoring systems
Sequence 1: ACTACCAGTTCATTTGATACTTCTCAAA
Sequence 2: TACCATTACCGTGTTAACTGAAAGGACTTAAAGACT
     A   C   G   T
A    1   0   0   0
C    0   1   0   0
G    0   0   1   0
T    0   0   0   1
Match: 5 x 1 = 5
Mismatch: 19 x 0 = 0
Score: 5
36 DNA scoring systems
Sequence 1: ACTACCAGTTCATTTGATACTTCTCAAA
Sequence 2: TACCATTACCGTGTTAACTGAAAGGACTTAAAGACT
Negative scoring values to penalize mismatches
     A   C   G   T
A    5  -4  -4  -4
C   -4   5  -4  -4
G   -4  -4   5  -4
T   -4  -4  -4   5
Match: 5 x 5 = 25
Mismatch: 19 x -4 = -76
Score: -51
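The arithmetic of the two DNA scoring systems can be reproduced with a small sketch. The helper names are hypothetical, and because the slide does not state at which offset the two sequences were compared, the example uses a synthetic pair with the same counts (5 matches, 19 mismatches) rather than the sequences themselves.

```python
def dna_matrix(match: int, mismatch: int) -> dict:
    """Build a 4 x 4 nucleotide scoring matrix as a dictionary."""
    return {(a, b): (match if a == b else mismatch)
            for a in "ACGT" for b in "ACGT"}


def ungapped_score(seq1: str, seq2: str, matrix: dict) -> int:
    """Sum the matrix scores over the columns of an ungapped alignment."""
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))


identity = dna_matrix(1, 0)     # match 1, mismatch 0
penalised = dna_matrix(5, -4)   # match 5, mismatch -4

# Synthetic pair (assumption, not the slide's sequences):
# 5 matching positions followed by 19 mismatching positions.
seq_a = "A" * 24
seq_b = "A" * 5 + "C" * 19
print(ungapped_score(seq_a, seq_b, identity))
print(ungapped_score(seq_a, seq_b, penalised))
```

The same alignment scores 5 under the identity matrix and -51 once mismatches are penalised, so only the second scheme distinguishes a poor alignment from a short good one.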
37 Complication: inexact is not binary (1/0) but something relative
Amino acids have different physical and biochemical properties that may or may not be important for function and thus influence their probability of being replaced in evolution
38 Protein scoring systems
Substitution matrix:
     C   S   T   P   A   G   N   D  ...
C    9
S   -1   4
T   -1   1   5
P   -3  -1  -1   7
A    0   1   0  -1   4
G   -3   0  -2  -2   0   6
N   -3   1   0  -2  -2   0   5
D   -3   0  -1  -1  -2  -1   1   6
...
T:G = -2, T:T = 5, ...
Score: 48
39 Substitution (scoring) matrix
- Grouping of side chains by charge, polarity, ...
- Exchange of D (Asp) by E (Glu) is better (both are negatively charged) than replacement by e.g. F (Phe) (aromatic)
- C (Cys) forms disulphide bridges and cannot be exchanged for another residue → high score of 9
40 Alignment of human hemoglobin α and β chains
Legend: | identical, : highly similar, . similar, (blank) unrelated
Symbol comparison table: PAM250
Gap opening penalty: 3
Gap extension penalty: 0.1
Score: 116
41 How are substitution matrices generated?
- Manually align protein structures (or, more risky, sequences)
- Look for the frequency of amino acid substitutions at structurally constant sites
- Entry: $\log\left(\frac{\text{freq(observed)}}{\text{freq(expected)}}\right)$
  - positive: more likely than random
  - 0: at random base rate
  - negative: less likely than random
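The sign convention of a log-odds entry can be made concrete with a short sketch. The function name and the frequencies are hypothetical, and the use of base 2 without any scaling factor is an assumption; published matrices scale and round the values.

```python
import math


def log_odds(observed: float, expected: float) -> float:
    """Log-odds entry: log2 of observed over expected pair frequency."""
    return math.log2(observed / expected)


# Illustrative frequencies only (assumed, not taken from any real matrix).
more_likely = log_odds(0.02, 0.01)   # pair seen twice as often as expected
at_random = log_odds(0.01, 0.01)     # pair seen exactly as often as expected
less_likely = log_odds(0.005, 0.01)  # pair seen half as often as expected
print(more_likely, at_random, less_likely)
```

A pair observed more often than chance gets a positive entry, a pair at the random base rate gets 0, and a pair observed less often gets a negative entry.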
42 PAM (Percent Accepted Mutations) matrices
- Derived from global alignments of protein families. Family members share at least 85% identity (Dayhoff et al., 1978).
- Construction of a phylogenetic tree and ancestral sequences for each protein family
- Computation of the number of substitutions for each pair of amino acids
43 PAM (Percent Accepted Mutations) matrices
- The PAM-1 matrix, which was computed by counting the number of substitutions, reflects an average change of 1% of all amino acid positions. PAM matrices for larger evolutionary distances can be extrapolated from the PAM-1 matrix.
- PAM250 = 250 mutations per 100 residues.
- Greater numbers mean bigger evolutionary distance
44 PAM250
45 BLOSUM (Blocks Substitution Matrix)
- Derived from alignments of domains of distantly related proteins (Henikoff & Henikoff, 1992)
- Occurrences of each amino acid pair in each column of each block alignment are counted
- The numbers derived from all blocks were used to compute the BLOSUM matrices
Example column: A, A, C, E, C
Pair counts: A-A 1, A-C 4, A-E 2, C-E 2, C-C 1
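The pair counts for the example column can be reproduced by enumerating every unordered pair of residues in the column. This is a minimal sketch of the counting step only (the function name is hypothetical); it does not cover clustering or the conversion of counts into log-odds scores.

```python
from collections import Counter
from itertools import combinations


def count_pairs(column: str) -> Counter:
    """Count unordered residue pairs within one alignment column."""
    pairs = Counter()
    for a, b in combinations(column, 2):
        # Sort so that C-A and A-C are counted as the same pair.
        pairs["-".join(sorted((a, b)))] += 1
    return pairs


counts = count_pairs("AACEC")
for pair, n in sorted(counts.items()):
    print(pair, n)
```

A column of 5 residues yields 10 pairs in total, which matches the sum of the counts listed above.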
46 BLOSUM (Blocks Substitution Matrix)
- Sequences within blocks are clustered according to their level of identity
- Clusters are counted as a single sequence
- Different BLOSUM matrices differ in the percentage of sequence identity used in clustering
- The number in the matrix name (e.g. 62 in BLOSUM62) refers to the percentage of sequence identity used to build the matrix
- Greater numbers mean smaller evolutionary distance
47 Different substitution matrices for different alignments
[Figure: matrices ordered from less stringent to more stringent]
- BLOSUM matrices usually perform better than PAM matrices for local similarity searches (Henikoff & Henikoff, 1993)
- When comparing closely related proteins one should use lower PAM or higher BLOSUM matrices; for distantly related proteins, higher PAM or lower BLOSUM matrices
- For database searching the commonly used (default) matrix is BLOSUM62
48 Calculating alignments: global vs. local alignment
- For optimal GLOBAL alignment, we want the best score in the final row or final column
  - GLOBAL: best alignment of the entirety of both sequences (possibly at the expense of great local similarity)
- For optimal LOCAL alignment, we want the best score anywhere in the matrix
  - LOCAL: best alignment of segments, without regard to the rest of the two sequences (at the expense of the overall score)
50 Dotplot
A dotplot gives an overview of all possible alignments.
The ideal case: two identical sequences
Sequence 1: TATCGAAGTA
Sequence 2: TATCGAAGTA
Every word in one sequence is aligned with each word in the second sequence.
51 Dotplot
A dotplot gives an overview of all possible alignments.
The normal case: two somewhat similar sequences
Sequence 1: TATCGAAGTA
Sequence 2: TATTCATGTA
[Figure: dot matrix showing isolated dots, 2 dots forming a diagonal, and 3 dots forming a diagonal]
52 Dotplot
A dotplot gives an overview of all possible alignments.
Sequence 1: TATCGAAGTA
Sequence 2: TATTCATGTA
Word size 1
53 Dotplot
In a dotplot each diagonal corresponds to a possible (ungapped) alignment.
Sequence 1: TATCGAAGTA
Sequence 2: TATTCATGTA
One possible alignment:
TATCGAAGTA
TATTCATGTA
Word size 1
54 Dotplot
A dotplot gives an overview of all possible alignments. Filters (word size) can be introduced to get rid of noise.
Sequence 1: TATCGAAGTA
Sequence 2: TATTCATGTA
Word size 1
[Figure: isolated dots, 2 dots forming a diagonal, and 3 dots forming a diagonal are all shown]
55 Dotplot
Same sequences, word size 2
[Figure: only the diagonals formed by 2 dots and by 3 dots remain]
56 Dotplot
Same sequences, word size 3
[Figure: only the diagonal formed by 3 dots remains]
57 Dotplot
Same sequences, word size 4
Conditions too stringent!
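The effect of the word size filter on the two example sequences can be sketched as follows. This is an illustrative implementation, not the program used for the slides; the function name `dotplot` and the convention of reporting each dot as the start position `(i, j)` of a matching word are assumptions.

```python
def dotplot(seq1: str, seq2: str, word_size: int = 1) -> set:
    """Return the set of (i, j) positions where a word of length
    word_size starting at seq1[i] equals the word starting at seq2[j]."""
    dots = set()
    for i in range(len(seq1) - word_size + 1):
        for j in range(len(seq2) - word_size + 1):
            if seq1[i:i + word_size] == seq2[j:j + word_size]:
                dots.add((i, j))
    return dots


seq1 = "TATCGAAGTA"
seq2 = "TATTCATGTA"
for w in (1, 2, 3, 4):
    print(f"word size {w}: {len(dotplot(seq1, seq2, w))} dots")
```

For these sequences a word size of 3 leaves only the words TAT and GTA on the main diagonal, and a word size of 4 leaves no dots at all, which is the "too stringent" case.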
58 Word size algorithm: sliding window
Sequences compared: TACGGTATG and ACAGTATC
[Figure: a window slides along both sequences so that every word of one sequence is compared with every word of the other]
Word size 3: only perfect matches are counted
59 Word size algorithm: stringency
Sequences compared: TACGGTATG and ACAGTATC
Word size 3 with one mismatch allowed: the number of matches required within a word is called the stringency
60 Word size algorithm: sliding window
Sequences compared: TACGGTATG and ACAGTATC
Word size 5, stringency 4
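The window/stringency variant can be sketched by counting matches inside each window instead of demanding a perfect word match. The function name is hypothetical, and counting plain identities is an assumption made for nucleotides; for proteins the window is usually scored with a substitution matrix instead.

```python
def dotplot_window(seq1: str, seq2: str, window: int, stringency: int) -> set:
    """Place a dot at (i, j) if at least `stringency` of the `window`
    positions starting at seq1[i] and seq2[j] are identical."""
    dots = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(a == b for a, b in
                          zip(seq1[i:i + window], seq2[j:j + window]))
            if matches >= stringency:
                dots.add((i, j))
    return dots


s1, s2 = "TACGGTATG", "ACAGTATC"
exact = dotplot_window(s1, s2, window=3, stringency=3)    # perfect words only
relaxed = dotplot_window(s1, s2, window=3, stringency=2)  # one mismatch allowed
wide = dotplot_window(s1, s2, window=5, stringency=4)
print(sorted(exact), len(relaxed), sorted(wide))
```

Lowering the stringency can only add dots, so the relaxed plot always contains the exact one; with window 5 and stringency 4 the short exact diagonal is extended into a longer one.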
61 Window / stringency
PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM
Substitution matrix filtering
Matrix: PAM250, window 12, stringency 9
[Figure: the sequence pair is shown three times with window scores 11, 11 and 7]
62 Insertions / deletions in a dotplot
Sequence 1: TACGACTATG
Sequence 2: TACGATATG
TACGACTATG
TACGA-TATG
Word size 1, stringency 1
63 Insertions / deletions in a dotplot
Sequence 1: TACGACTATG
Sequence 2: TACGATATG
TACGACTATG
TACGA-TATG
Word size 2, stringency 2
64 Insertions / deletions in a dotplot
Sequence 1: TACGGGTATG
Sequence 2: TACGGTATG
TACGGGTATG
TACGG-TATG
TACGGGTATG
TACG-GTATG
TACGGGTATG
TAC-GGTATG
Word size 1, stringency 1
65 Considerations
- The window/stringency method is more sensitive than the word size method (ambiguities are permitted)
- The smaller the window, the larger the weight of statistical (unspecific) matches
- With large windows the sensitivity for short sequences is reduced
- Insertions/deletions are not treated explicitly
66 Dot matrix: example of a repetitive DNA sequence
- In addition to the main diagonal, there are several other diagonals
- Only one half of the matrix is shown because of the symmetry
A perfect tool to visualize repeats
67 Problems with dot matrices
- Rely on visual analysis (necessarily merely a screen dump due to the number of operations). Improvement: Dotter (Sonnhammer et al.)
- Difficult to find optimal alignments
- Difficult to estimate the significance of alignments
- Insensitive to conservative substitutions (e.g. L → I or S → T) if no substitution matrix can be applied
- Compares only two sequences (vs. multiple alignment)
- Time consuming (1,000 bp vs. 1,000 bp = 10^6 operations; 1,000,000 bp vs. 1,000,000 bp = 10^12 operations)
81 days!
68 Sequence alignment of phage T7 and Thermus aquaticus DNA polymerases
- Exonuclease domain
- Polymerase domain
Which are the catalytically important residues?
69 Multiple sequence alignment to identify catalytically important residues
- F discriminates between dNTP and ddNTP
- Y does not discriminate: important for DNA sequencing
If not identical, what is significant?
71 Dynamic programming
- Automatic procedure that finds the best alignment with an optimal score depending on the selected parameters.
- Needleman and Wunsch algorithm: global alignment
- Smith and Waterman algorithm: local alignment
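72 Needleman-Wunsch: global alignment
Align sequences MGKP and MGPKKP. Which alignment is optimal?
Candidate alignments (first string over second string):
- MGKP / MGPKKP (ungapped, shown at several relative offsets)
- MGKP / MG.PKKP
- MGK..P / MGPKKP
- MG..KP / MGPKKP
- MG.K.P / MGPKKP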
73 Generation of an alignment path matrix
Idea: build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences
- Construct a matrix F indexed by i and j (one index for each sequence)
- F(i,j) is the score of the best alignment between the initial segment $x_1 \dots x_i$ of x and the initial segment $y_1 \dots y_j$ of y
- Build F(i,j) recursively, beginning with F(0,0) = 0
74 Generation of an alignment path matrix
- We can calculate F(i,j) if F(i-1,j-1), F(i-1,j) and F(i,j-1) are known
- Three possibilities
  - $x_i$ and $y_j$ are aligned: $F(i,j) = F(i-1,j-1) + s(x_i,y_j)$
  - $x_i$ is aligned to a gap: $F(i,j) = F(i-1,j) - d$
  - $y_j$ is aligned to a gap: $F(i,j) = F(i,j-1) - d$
- The best score up to (i,j) is the highest value of the three options:
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
where $s(x_i,y_j)$ is the score from the substitution matrix and $d$ is the gap penalty.
75 Problem: align sequences MGKP and MGPKKP
Generation of an alignment path matrix
1. Gap penalties (d = gap penalty, here 6)
Boundary conditions:
$F(i,0) = F(i-1,0) - d$
$F(0,j) = F(0,j-1) - d$
Values along i (MGKP): 0, -6, -12, -18, -24
Values along j (MGPKKP): -6, -12, -18, -24, -30, -36
76 Problem: align sequences MGKP and MGPKKP
Generation of an alignment path matrix
2. Substitution matrix (here BLOSUM62), d = gap penalty (here 6)
Values along i (MGKP): 0, -6, -12, -18, -24
Values along j (MGPKKP): -6, -12, -18, -24, -30, -36
Score matrix BLOSUM62:
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X
A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
B  -2  -1   3   4  -3   0   1  -1   0  -3  -4   0  -3  -3  -2   0  -1  -4  -3  -3   4
Z  -1   0   0   1  -3   3   4  -2   0  -3  -3   1  -1  -3  -1   0  -1  -3  -2  -2   1   4
X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1  -1  -1
77 Problem: align sequences MGKP and MGPKKP
Generation of an alignment path matrix
2. Substitution matrix (here BLOSUM62), d = gap penalty (here 6)
[Figure: path matrix containing only the boundary values 0, -6, -12, -18, -24 along i and -6, -12, -18, -24, -30, -36 along j]
78 Problem: align sequences MGKP and MGPKKP
3. Path matrix generation: calculation of F(i,j)
[Figure: path matrix with the boundary values and the first computed cell, 5]
79 Problem: align sequences MGKP and MGPKKP
3. Path matrix generation: calculation of F(i,j)
[Figure: path matrix with the boundary values and the computed cells 5, -1]
80 Problem: align sequences MGKP and MGPKKP
3. Path matrix generation: calculation of F(i,j)
[Figure: path matrix with the boundary values and the computed cells 5, -1, -7, -13 in the first line and -1, 11 in the second]
81 Calculating the possible paths
- Diagonal: add the blue value from the field below
- Vertical/horizontal: add the penalty score (here -6)
F(3,4) is the maximum of
- F(2,3) + s(x_i,y_j) = 5 + 5 = 10
- F(2,4) - d = -1 - 6 = -7
- F(3,3) - d = 10 - 6 = 4
so F(3,4) = 10
82 Constructing the alignment
4. Alignment
- Diagonal: match
- Vertical/horizontal: gap
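The four steps of the worked example (boundary conditions, substitution scores, filling F(i,j), traceback) can be put together in a short sketch. This is an illustrative implementation under stated assumptions, not the GAP program: the function and variable names are hypothetical, only the BLOSUM62 entries needed for M, G, K and P are included (copied from the table shown earlier), the gap penalty is linear with d = 6, and ties in the traceback are broken in favour of the diagonal.

```python
# BLOSUM62 entries for the residues M, G, K, P only (subset of the full table).
SUBSET = {
    ("M", "M"): 5, ("G", "G"): 6, ("K", "K"): 5, ("P", "P"): 7,
    ("M", "G"): -3, ("M", "K"): -1, ("M", "P"): -2,
    ("G", "K"): -2, ("G", "P"): -2, ("K", "P"): -1,
}


def s(a: str, b: str) -> int:
    """Symmetric lookup of the substitution score."""
    return SUBSET[(a, b)] if (a, b) in SUBSET else SUBSET[(b, a)]


def needleman_wunsch(x: str, y: str, d: int):
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # boundary: F(i,0) = F(i-1,0) - d
        F[i][0] = F[i - 1][0] - d
    for j in range(1, m + 1):          # boundary: F(0,j) = F(0,j-1) - d
        F[0][j] = F[0][j - 1] - d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    # Traceback from the bottom-right cell; diagonal moves are tried first.
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))


F, top, bottom = needleman_wunsch("MGKP", "MGPKKP", d=6)
print(F[4][6])
print(top)
print(bottom)
```

Under these assumptions the first cells reproduce the values shown in the path matrix (5, then 11 on the diagonal), and the traceback returns one of several alignments that share the optimal score; a different tie-breaking rule would return another.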
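84 Smith-Waterman: local alignment
Two differences from Needleman-Wunsch:
1. F(i,j) is set to 0 whenever all three options are negative, so an alignment can start anywhere in the matrix
2. An alignment can end anywhere in the matrix
Example:
Sequence 1: HEAGAWGHEE
Sequence 2: PAWHEAE
Scoring parameters: log-odds ratios
Gap penalty: linear gap penalty of 8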
85 Smith-Waterman
         H   E   A   G   A   W   G   H   E   E
     0   0   0   0   0   0   0   0   0   0   0
P    0   0   0   0   0   0   0   0   0   0   0
A    0   0   0   5   0   5   0   0   0   0   0
W    0   0   0   0   2   0  20  12   4   0   0
H    0  10   2   0   0   0  12  18  22  14   6
E    0   2  16   8   0   0   4  10  18  28  20
A    0   0   8  21  13   5   0   4  10  20  27
E    0   0   6  13  18  12   4   0   4  16  26
High score: 28
Optimal local alignment:
AWGHE
AW.HE
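The matrix and the optimal local alignment above can be reproduced with a short sketch. This is an illustrative implementation, not the Bestfit program: the names are hypothetical, and the substitution scores in `SCORES` are an assumption, chosen to be consistent with the matrix shown above (the slide only states that log-odds ratios were used). The gap penalty is linear with d = 8, and ties in the traceback are broken in favour of the diagonal.

```python
# Assumed substitution scores (hypothetical table consistent with the matrix above).
# Outer key: residue of the row sequence; inner key: residue of the column sequence.
SCORES = {
    "P": {"H": -2, "E": -1, "A": -1, "G": -2, "W": -4},
    "A": {"H": -2, "E": -1, "A": 5, "G": 0, "W": -3},
    "W": {"H": -3, "E": -3, "A": -3, "G": -3, "W": 15},
    "H": {"H": 10, "E": 0, "A": -2, "G": -2, "W": -3},
    "E": {"H": 0, "E": 6, "A": -1, "G": -3, "W": -3},
}


def smith_waterman(rows: str, cols: str, d: int):
    n, m = len(rows), len(cols)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,  # difference 1: scores never drop below 0
                          F[i - 1][j - 1] + SCORES[rows[i - 1]][cols[j - 1]],
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:  # difference 2: the alignment may end anywhere
                best, best_pos = F[i][j], (i, j)
    # Traceback from the best cell until a cell with score 0 is reached.
    i, j = best_pos
    a_rows, a_cols = [], []
    while F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + SCORES[rows[i - 1]][cols[j - 1]]:
            a_rows.append(rows[i - 1]); a_cols.append(cols[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            a_rows.append(rows[i - 1]); a_cols.append("-"); i -= 1
        else:
            a_rows.append("-"); a_cols.append(cols[j - 1]); j -= 1
    return best, "".join(reversed(a_cols)), "".join(reversed(a_rows)), F


score, top, bottom, F = smith_waterman("PAWHEAE", "HEAGAWGHEE", d=8)
print(score)
print(top)
print(bottom)
```

Because the search starts at the highest cell and stops at the first 0, only the best-scoring segment is returned and the unaligned ends of both sequences are ignored.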
86 Extended Smith-Waterman
- To obtain multiple local alignments
  - delete regions around the best path
  - repeat backtracking
87 Extended Smith-Waterman
[Figure: same matrix as on slide 85]
88 Extended Smith-Waterman
[Figure: same matrix as on slide 85, with the next high score marked]
Second best local alignment:
HEA
HEA
89 Summary
- Critical user choices are
  - Availability (web server)
  - Speed
  - Gap penalty
  - Replacement matrix
  - Local vs. global alignment
90
- Mind the parameters you apply in sequence alignment and comparison!
- If you can change parameters/search criteria, you need to know what the consequences will be