Title: Sequence Alignment
1Sequence Alignment
- Arthur Chou
- Clark University
- Spring 2005
2Sequence Alignment
- Input two sequences over the same alphabet
- Output an alignment of the two sequences
- Example
- GCGCATGGATTGAGCGA
- TGCGCCATTGATGACCA
- A possible alignment
- -GCGC-ATGGATTGAGCGA
- TGCGCCATTGAT-GACC-A
3Why align sequences?
- Lots of sequences dont have known ancestry,
structure, or function. A few of them do. - If they align, they are similar.
- If they are similar, they might have the same
- ancestry, similar structure or function.
- If one of them has known ancestry, structure,
or - function, then alignment to the others yields
- insight about them.
4Alignments
-GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A Three
kinds of match Exact matches Mismatches
Indels (gaps)
5Choosing Alignments
- There are many possible alignments
- For example, compare
- -GCGC-ATGGATTGAGCGA
- TGCGCCATTGAT-GACC-A
- to
- ------GCGCATGGATTGAGCGA
- TGCGCC----ATTGATGACCA--
- Which one is better?
6Scoring Alignments
- Similar sequences evolved from a common ancestor
- Evolution changed the sequences from this
ancestral sequence by mutations - Replacement one letter replaced by another
- Deletion deletion of a character
- Insertion insertion of a character
- Scoring of sequence similarity should examine how
many and which operations took place
7Simple Scoring Rule
- Score each position independently
- Match 1
- Mismatch -1
- Indel -2
- Score of an alignment is sum of position scores
8Example
-
- -GCGC-ATGGATTGAGCGA
- TGCGCCATTGAT-GACC-A
- Score (1?13) (-1 ? 2) (-2 ? 4)
3 - ------GCGCATGGATTGAGCGA
- TGCGCC----ATTGATGACCA--
- Score (1 ? 5) (-1 ? 6) (-2 ? 11)
-23
9More General Scores
- The choice of 1,-1, and -2 scores is quite
arbitrary - Depending on the context, some changes are more
plausible than others - Exchange of an amino-acid by one with similar
properties (size, charge, etc.) vs. - Exchange of an amino-acid by one with opposite
properties - Probabilistic interpretation How likely is one
alignment versus another ?
10Dot Matrix Method
- A dot is placed at each position where two
residues match. - It's a visual aid. The human eye can rapidly
identify similar regions in sequences. - It's a good way to explore sequence organization
e.g. sequence repeats. - It does not provide an alignment.
THEFA-TCAT THEFASTCAT
- This method produces dot-plots with too much
noise - to be useful
- The noise can be reduced by calculating a score
using a window of residues. - The score is compared to a threshold or
stringency.
11Dot Matrix Representation
- Produces a graphical representation of similarity
regions - The horizontal and vertical dimensions correspond
to the compared sequences - A region of similarity stands out as a diagonal
12Dot Matrix or Dot-plot
- Each window of the first sequence is aligned
(without - gaps) to each window of the 2nd sequence
- A colour is set into a rectangular array
according to the - score of the aligned windows
13Dot Matrix Display
- Diagonal rows ( ) of dots
- reveal sequence similarity
- or repeats.
- Anti-diagonal rows ( )
- of dots represent inverted
- repeats.
- Isolated dots represent
- random similarity.
H C G E T F G R W F T P E W K C G
P T
F G R
I A C G E
M
14We can filter it by using a sliding window
looking for longer strings of matches and
eliminates random matches
15Longest Common Subsequence
- Sequence A nematode_knowledge
- Sequence B empty_bottle
- n e m a t o d e _ k n o w l e d g e
-
- e m p t y _ b o t t l e
- LCS Alignment with match score 1,
- mismatch score 0, and gap penalty 0
16What is an algorithm?
- A step-by-step description of the procedures to
accomplish a task. - Properties
- Determination of output for each input
- Generality
- Termination
- Correctness (proof, test, etc.)
- Time efficiency (no. of steps is small)
- Space efficiency (spaced used is small)
17Naïve algorithm exhaustive search
- G C G A A T G G A T T G A G C G T
- T G A G C C A T T G A T G A C C A
sequences of length n
i
j
i j j i j i j j i i . . . . . . . . . . . . . .
2n
Worst case time complexity is 2
18Dynamic programming algorithms for pairwise
sequence alignment
- Similar to Longest Common Subsequence
- Introduced for biological sequences by
- S. B. Needleman C. D. Wunsch. A general method
applicable to the search for similarities in the
amino acid sequence of two proteins. J. Mol.
Biol. 48443-453 (1970)
19Dynamic Programming
- Optimality substructure
- Reduction to a small number of sub-problems
- Memorization of solutions to sub-problems in a
table - Table look-up and tracing
- G C G C A T G G A T T G A G C G A T G C G C C
A T T G A T G A C C - A
20Recursive LCS
lcs_len( i , j ) length of LCS from i-th
position onward in String A and from j-th
position onward in String B
int lcs_len ( i , j ) if (A i \0
B j \0 ) return 0 else
if (A i B j ) return ( 1 lcs_len (
i1, j1 ) ) else return max ( lcs_len
( i1, j ) , lcs_len ( i, j1 )
)
21Reduction to Subproblems
int lcs_len ( String A , String B )
return subproblem ( 0, 0 ) int
subproblem ( int i, int j ) if (A i
\0 B j \0) return 0 else
if ( A i B j ) return (1
subproblem ( i1, j1 )) else return max (
subproblem ( i1, j ) , subproblem (
i, j1 ) )
22Memorizing the solutions
- Matrix L i , j -1 // initializing
the memory device - int subproblem ( int i, int j )
- if ( Li, j lt 0 )
- if (A i \0 B j \0) Li ,
j 0 - else if ( A i B j )
- Li, j 1 subproblem(i1,
j1) - else Li, j max( subproblem(i1,
j), - subproblem(i, j1))
- return L i, j
-
23Iterative LCS Table Look-up
To get the length of LCS of Sq. A and Sq. B
first allocate storage for the matrix L
for each row i from m downto 0 for each
column j from n downto 0 if (A i
\0 or B j \0) L i, j 0
else if (A i B j ) L i, j 1
Li1, j1 else L i, j
max(Li1, j, Li, j1)
return L0, 0
24Iterative LCS Table Look-up
- int lcs_len ( String A , String B ) // the
length -
- // First allocate storage for the matrix L
- for ( i m i gt 0 i-- ) // A has
length m1 - for ( j n j gt 0 j-- ) // B
has length n1 - if (A i \0 B j \0)
L i, j 0 - else if (A i B j ) L i, j 1
Li1, j1 - else L i, j max(Li1, j,
Li, j1) -
- return L0, 0
-
25Dynamic Programming Algorithm
- Li, j 1 Li1, j1 , if A i B
j - Li, j max ( Li1, j, Li, j1 )
otherwise
j
j1
B
A
Matrix L
L i, j
L i, j1
i
Li1, j1
L i1, j
i1
26 n e m a t o d e _ k n o w l e dÂ
g ee  7 7 6 5 5 5 5 5 4 3 3 3 2Â
2 2 1 1 1 0m  6 6 6 5 5 4 4 4 4Â
3 3 3 2 2 1 1 1 1 0p  5 5 5 5 5Â
4 4 4 4 3 3 3 2 2 1 1 1 1 0t  5Â
5Â 5Â 5Â 5Â 4Â 4Â 4Â 4Â 3Â 3Â 3Â 2Â 2Â 1Â 1Â 1Â
1 0y  4 4 4 4 4 4 4 4 4 3 3 3 2Â
2Â 1Â 1Â 1Â 1Â 0_Â Â 4Â 4Â 4Â 4Â 4Â 4Â 4Â 4Â 4Â
3 3 3 2 2 1 1 1 1 0b  3 3 3 3 3Â
3 3 3 3 3 3 3 2 2 1 1 1 1 0o  3Â
3Â 3Â 3Â 3Â 3Â 3Â 3Â 3Â 3Â 3Â 3Â 2Â 2Â 1Â 1Â 1Â
1 0t  3 3 3 3 3 2 2 2 2 2 2 2 2Â
2 1 1 1 1 0t  3 3 3 3 3 2 2 2 2Â
2 2 2 2 2 1 1 1 1 0l  2 2 2 2 2Â
2 2 2 2 2 2 2 2 2 1 1 1 1 0e  1Â
1Â 1Â 1Â 1Â 1Â 1Â 1Â 1Â 1Â 1Â 1Â 1Â 1Â 1Â 1Â 1Â
1Â 0Â Â Â 0Â 0Â 0Â 0Â 0Â 0Â 0Â 0Â 0Â 0Â 0Â 0Â 0Â
0Â 0Â 0Â 0Â 0Â 0
27Obtain the subsequence
- Sequence S empty // the LCS
- i 0 j 0
- while ( i lt m j lt n)
- if ( A i B j )
- add Ai to end of S
- i j
- else
- if ( Li1, j gt Li, j1) i
- else j
-
28 n e m a t o d e _ k n o w l e d g e  Â
e o-o o-o-o-o-o-o o-o-o-o-o-o-o o-o-o o     Â
 \  \  \  \   mÂ
o-o-o o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o      Â
\ Â Â Â pÂ
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o     Â
   tÂ
o-o-o-o-o o-o-o-o-o-o-o-o-o-o-o-o-o-o     Â
 \    yÂ
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o     Â
   _Â
o-o-o-o-o-o-o-o-o o-o-o-o-o-o-o-o-o-o     Â
 \    bÂ
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o     Â
   oÂ
o-o-o-o-o-o o-o-o-o o o o-o-o-o-o-o-o     Â
 \     tÂ
o-o-o-o-o o-o-o-o-o-o-o-o-o-o-o-o-o-o     Â
 \    tÂ
o-o-o-o-o o-o-o-o-o-o-o-o-o-o-o-o-o-o     Â
 \    lÂ
o-o-o-o-o-o-o-o-o-o-o-o-o-o o-o-o-o-o     Â
 \    e o-o
o-o-o-o-o-o o-o-o-o-o-o-o o-o-o o       \
 \  \  \      o o o o o o
o o o o o o o o o o o o o
29 Dynamic Programming with scores and penalties
x
y
j
30Dynamic Programming with scores and penalties
- from i-th pos. in A and j-th pos. in B
onward - s ( Ai , Bj )
Si1, j1 - Si , j max max Six, j w( x )
- gap x in
sequence B - max Si, jy
w( y ) - gap y in sequence A
s score
w penalty function
best score from i, j onward
31Algorithm for simple gap penalty
- If for each gap, the penalty is a fixed constant
c, then - s(A i , B j ) Si1, j1
- Si , j max S i1, j c //
one gap - S i, j1 c // one gap
32Table Tracing
- To do table tracing based on similarity matrix
of amino acids, we re-define Si , j to be the
optimal score of choosing the match of Ai with
Bj. - S i , j s (A i , B j ) // s
score - Si1, j1 // w
gap penalty - max Si1x, j1 w( x )
- max gap x in sequence B
- max Si1, j1y w( y )
- gap y in
sequence A
33Diagram
Matrix S
j
j1
i
i1
34Summation operation
- 1. Start at lower right corner.
- 2. Move diagonally up one position.
- 3. Find largest value in either
- ? row segment starting diagonally below
current position and extending to the right or - ? column segment starting diagonally
below current position and extending down. - 4. Add this value to the value in the current
cell. - 5. Repeat steps 3 and 4 for all cells to the left
in current row and all cells above in current
column. - 6. If we are not in the top left corner, go to
step 2.
35(No Transcript)
36(No Transcript)
37(No Transcript)
38(No Transcript)
39(No Transcript)
40(No Transcript)
41(No Transcript)
42(No Transcript)
43(No Transcript)
44(No Transcript)
45(No Transcript)
46(No Transcript)
47(No Transcript)
48(No Transcript)
49(No Transcript)
50----V HGQKV
51(No Transcript)
52----VA HGQKVA
53----VADALTK HGQKVADALTK
54----VADALTK HGQKVADALTK
55----VADALTKPVNFKFA HGQKVADALTK------A
56----VADALTKPVNFKFAVAH HGQKVADALTK------AVAH
57Dynamic Programming by Tracing a Similarity Matrix
- Recall the algorithm Table Tracing
- Tracing a Similarity Matrix of Amino Acids
58Use of dynamic programming to evaluate homology
between pairs of sequences
- If we just want to know maximum match possible
between two sequences, then we dont need to do
trace-back but can just look at the highest value
in the first row or column (match score). This
represents the best possible alignment score.
59Gap penalty alternatives
- constant gap penalty for gap gt 1
- gap penalty proportional to gap size (affine gap
penalty) - one penalty for starting a gap (gap opening
penalty) - different (lower) penalty for adding to a gap
(gap extension penalty) - dynamic programming algorithm can be made more
efficient
60Gap penalty alternatives (cont.)
- gap penalty proportional to gap size and sequence
- for nucleic acids, can be used to mimic
thermodynamics of helix formation. - two kinds of gap opening penalties
- one for gap closed by AT, different for GC.
- different gap extension penalty.
61End gaps
- Some programs treat end gaps as normal gaps and
apply penalties, other programs do not apply
penalties for end gaps.
62End gaps (cont.)
- Can determine which a program does by adding
extra (unmatched) bases to the end of one
sequence and seeing if match score changes. - Penalties for end gaps appropriate for aligned
sequences where ends "should match. - Penalties for end gaps inappropriate when
surrounding sequences are expected to be
different (e.g., conserved exon surrounded by
varying introns).
63Global vs. Local Similarity
- Should result of alignment include all amino
acids or proteins or just those that match? - If yes, a global alignment is desired
- If no, a local alignment is desired
- Global alignment is accomplished by including
negative scores for mismatched positions, thus
scores get worse as we move away from region of
match (local alignment). - Instead of starting trace-back with highest value
in first row or column, start with highest value
in entire matrix, stop when score hits zero.