Title: Sequence Alignment
1 Sequence Alignment
- Lecture 15, October 20, 2005
- Algorithms in Biosequence Analysis
- Nathan Edwards - Fall, 2005
2 Sequence Alignment
- Global alignment of S, length n, and T, length m
- each char of S (T) aligned with char of T (S) or space "-"
- in O(nm) time, using (n+1)(m+1) space
- (min) edit-distance and (max) similarity formulations
- Dynamic Programming
- Base conditions + recurrence relation
- Dynamic programming table (bottom up)
- Traceback from (n,m) to (0,0) to obtain sequence alignment
3 Recurrence Relation
- Base conditions
- V(i,0) = Σ_{k≤i} s(S(k),-), for all i = 0,...,n
- V(0,j) = Σ_{k≤j} s(-,T(k)), for all j = 0,...,m
- Recurrence
- V(i,j) = max { V(i-1,j) + s(S(i),-),
                 V(i,j-1) + s(-,T(j)),
                 V(i-1,j-1) + s(S(i),T(j)) }
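The base conditions, recurrence, and traceback above can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function name `global_align` and the scoring defaults (match 2, mismatch -2, space -1, borrowed from the local alignment example later in the deck) are assumptions.

```python
def global_align(S, T, match=2, mismatch=-2, space=-1):
    """Global alignment by similarity (maximization); returns (score, S', T')."""
    def s(a, b):
        # Scoring function s(x, y); "-" stands for a space (assumed scoring scheme).
        if a == "-" or b == "-":
            return space
        return match if a == b else mismatch

    n, m = len(S), len(T)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Base conditions: a prefix aligned entirely against spaces.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + s(S[i - 1], "-")
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + s("-", T[j - 1])
    # Fill the table row by row (bottom up).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j] + s(S[i - 1], "-"),
                          V[i][j - 1] + s("-", T[j - 1]),
                          V[i - 1][j - 1] + s(S[i - 1], T[j - 1]))
    # Traceback from (n, m) to (0, 0), re-deriving which term gave the max.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + s(S[i - 1], T[j - 1]):
            a.append(S[i - 1]); b.append(T[j - 1]); i -= 1; j -= 1
        elif i > 0 and V[i][j] == V[i - 1][j] + s(S[i - 1], "-"):
            a.append(S[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(T[j - 1]); j -= 1
    return V[n][m], "".join(reversed(a)), "".join(reversed(b))
```

The traceback here recomputes the winning term instead of storing back-pointers; either choice uses O(nm) space because the whole table is kept.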
4 Dynamic Programming Table
5 Similar Protein Sequences (Human v Worm)
  8 FAKDFLAGGVAAAISKTAVAPIERVKLLLQVQHASKQITADKQYKGIIDCVVRIPKEQGV 67
    F  D   GG AAA SKTAVAPIERVKLLLQVQ ASK I  DK YKGI D   R PKEQGV
 12 FLIDLASGGTAAAVSKTAVAPIERVKLLLQVQDASKAIAVDKRYKGIMDVLIRVPKEQGV 71

 68 LSFWRGNLANVIRYFPTQALNFAFKDKYKQIFLGGVDKRTQFWRYFAGNLASGGAAGATS 127
       WRGNLANVIRYFPTQA NFAFKD YK IFL G DK   FW  FAGNLASGGAAGATS
 72 AALWRGNLANVIRYFPTQAMNFAFKDTYKAIFLEGLDKKKDFWKFFAGNLASGGAAGATS 131

128 LCFVYPLDFARTRLAADVGKAGAEREFRGLGDCLVKIYKSDGIKGLYQGFNVSVQGIIIY 187
    LCFVYPLDFARTRLAAD GKA   REF GL DCL KI KSDG  GLY GF VSVQGIIIY
132 LCFVYPLDFARTRLAADIGKAN-DREFKGLADCLIKIVKSDGPIGLYRGFFVSVQGIIIY 190

188 RAAYFGIYDTAKGML-PDPKNTHIVISWMIAQTVTAVAGLTSYPFDTVRRRMMMQSGRKG 246
    RAAYFG  DTAK     D         W IAQ VT   G  SYP DTVRRRMMMQSGRK
191 RAAYFGMFDTAKMVFASDGQKLNFFAAWGIAQVVTVGSGILSYPWDTVRRRMMMQSGRK- 249

247 TDIMYTGTLDCWRKIARDEGGKAFFKGAWSNVLRGMGGAFVLVLYDEIKKY 297
     DI Y  TLDC  KI   EG  A FKGA SNV RG GGA VL  YDEI K
250 -DILYKNTLDCAKKIIQNEGMSAMFKGALSNVFRGTGGALVLAIYDEIQKF 299
6 Global Alignment Schematic
[Figure: DP table with T across the top and S down the side, running from cell (0,0) to cell (n,m)]
7-9 End-space free variant
[Figures: DP table schematics with T across the top and S down the side, running from cell (0,0) to cell (n,m)]
10 End-space free variant
- Don't charge for optimal alignment starting in cells (i,0) or (0,j)
- Base conditions: V(i,0) = V(0,j) = 0
- Don't charge for adding spaces at end of alignment
- Find cell (n,j) or (i,m) with maximum similarity value, begin traceback from there
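A small sketch of the end-space free variant: the first row and column are zero, and the answer is the best cell in the last row or last column. The function name `end_space_free_score` and the scoring defaults are assumptions made for illustration; the traceback is omitted to keep the example short.

```python
def end_space_free_score(S, T, match=2, mismatch=-2, space=-1):
    """Return (score, i, j): the best cell in the last row or last column."""
    n, m = len(S), len(T)
    # Base conditions V(i,0) = V(0,j) = 0: leading spaces are free.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if S[i - 1] == T[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,
                          V[i - 1][j] + space,
                          V[i][j - 1] + space)
    # Trailing spaces are free: scan cells (n,j) and (i,m) for the maximum.
    last_row = [(V[n][j], n, j) for j in range(m + 1)]
    last_col = [(V[i][m], i, m) for i in range(n + 1)]
    return max(last_row + last_col)
```

This is the form typically used to detect overlaps, where a suffix of one sequence matches a prefix of the other; for example, `end_space_free_score("xxxacgt", "acgtyyy")` finds the shared `acgt` ending at cell (7,4).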
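11 Approximate Search
[Figure: DP table with T across the top and P down the side, from cell (0,0) to cell (n,m), marking a substring T' of T with Similarity(P, T') ≥ δ]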
12 Approximate Search
- Don't charge for optimal alignment starting in cells (0,j)
- Base conditions: V(0,j) = 0, V(i,0) = Σ_{k≤i} s(S(k),-), with the pattern P in the role of S
- Don't charge for ending alignment at end of P (but not necessarily T)
- Find cell (n,j) with similarity value ≥ δ
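The approximate search conditions can be sketched as below: row 0 is free, column 0 is charged, and every cell (n,j) whose value reaches the threshold is reported. The name `approximate_matches`, the parameter `delta` (standing for δ), and the scoring defaults are assumptions for illustration. Only two rows are kept because no traceback is performed.

```python
def approximate_matches(P, T, delta, match=2, mismatch=-2, space=-1):
    """Return [(j, score)] for every end position j in T with V(n,j) >= delta."""
    m = len(T)
    prev = [0] * (m + 1)                 # V(0,j) = 0: a match may start anywhere in T
    for i, p in enumerate(P, start=1):
        cur = [i * space] + [0] * m      # V(i,0): all of P[1..i] aligned to spaces
        for j, t in enumerate(T, start=1):
            sub = match if p == t else mismatch
            cur[j] = max(prev[j - 1] + sub, prev[j] + space, cur[j - 1] + space)
        prev = cur
    # Row n: P is fully consumed; T may continue past position j at no charge.
    return [(j, prev[j]) for j in range(m + 1) if prev[j] >= delta]
```

Positions are 1-based, so `(5, 6)` means an occurrence ending at T(5) with similarity 6.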
13 Local alignment
- In many biological contexts, two strings may only have regions of similarity.
- S = pqraxabcstvq, T = xyaxbacsll
- poor global alignment, but for α = axabcs and β = axbacs, there is strong similarity.
14 Local alignment problem
- Given two sequences S, length n, and T, length m, find substrings α from S and β from T whose similarity is maximum over all pairs of substrings from S and T
- For S = pqraxabcstvq, T = xyaxbacsll, the alignment
    a x a b - c s
    a x - b a c s
  has similarity 8 for match score 2, mismatch -2, and space -1 (5 matches and 2 spaces).
15 Local alignment
- Surprisingly, the optimal local alignment can be computed in O(nm) time and O(nm) space.
- Base conditions: v(i,0) = v(0,j) = 0 for all i,j
- Recurrence: v(i,j) = max { 0,
                             v(i-1,j) + s(S(i),-),
                             v(i,j-1) + s(-,T(j)),
                             v(i-1,j-1) + s(S(i),T(j)) }
- Check each cell to find max v(i,j) for all i,j.
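A minimal sketch of the local alignment recurrence, using the scoring from the example above (match 2, mismatch -2, space -1). The function name `local_align` is an assumption; the traceback stops at the first cell with value 0, which marks the start of the local alignment.

```python
def local_align(S, T, match=2, mismatch=-2, space=-1):
    """Return (score, alpha', beta'): the best local alignment with spaces as '-'."""
    n, m = len(S), len(T)
    v = [[0] * (m + 1) for _ in range(n + 1)]   # v(i,0) = v(0,j) = 0
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if S[i - 1] == T[j - 1] else mismatch
            v[i][j] = max(0,
                          v[i - 1][j] + space,
                          v[i][j - 1] + space,
                          v[i - 1][j - 1] + sub)
            if v[i][j] > best:                   # remember the maximum cell
                best, bi, bj = v[i][j], i, j
    # Traceback from the maximum cell until the value drops to 0.
    a, b = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and v[i][j] > 0:
        sub = match if S[i - 1] == T[j - 1] else mismatch
        if v[i][j] == v[i - 1][j - 1] + sub:
            a.append(S[i - 1]); b.append(T[j - 1]); i -= 1; j -= 1
        elif v[i][j] == v[i - 1][j] + space:
            a.append(S[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(T[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))
```

Calling `local_align("pqraxabcstvq", "xyaxbacsll")` recovers the substrings axabcs and axbacs with similarity 8; which of the equally good placements of the two spaces is returned depends on the tie-breaking order in the traceback.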
16-18 Local Alignment Schematic
[Figures: DP table schematics with T across the top and S down the side, running from cell (0,0) to cell (n,m)]
19 Local alignment
- Don't charge for optimal alignment starting in any cell (i,j)
- Base conditions: V(i,0) = V(0,j) = 0
- Can re-start alignment in any cell.
- Don't charge for ending alignment in any cell
- Find cell (i,j) with maximum similarity value
- Traceback from end of alignment.
20 Terminology
- Global alignment is often called Needleman-Wunsch alignment
- Local alignment is often called Smith-Waterman alignment
21 Gap alignment models
- A gap is a consecutive run of spaces in a sequence alignment
- Need to model block insertions and deletions better than linear gap model does.
- The linear model gives no encouragement for long gaps to form
- Arbitrary gap model
- cost of gap of length g is w(g)
- Affine gap model (open + extension cost)
- cost of gap of length g is o + e·g
22 Gap alignment models
- Have to keep track of whether we are opening or extending a gap
- Current DP formulation doesn't cut it!
- Consider any alignment of S[1...i] and T[1...j]. Either
- 1.) S(i) and T(j) are aligned with each other, or
- 2.) S(i) is aligned to T(j'), with j' < j, or
- 3.) T(j) is aligned to S(i'), with i' < i
23 Gap alignment models
- Let G(i,j) be maximum value of any alignment with S(i) aligned with T(j) [type 1]
- Let E(i,j) be maximum value of any alignment with T(j) aligned with a gap [type 2]
- Let F(i,j) be maximum value of any alignment with S(i) aligned with a gap [type 3]
- Let V(i,j) = max { E(i,j), F(i,j), G(i,j) }
24 Arbitrary gap cost recurrence
- Alignment type 1
- G(i,j) = V(i-1,j-1) + s(S(i),T(j))
- Alignment type 2
- E(i,j) = max_{0≤k≤j-1} [ V(i,k) - w(j-k) ]
- Alignment type 3
- F(i,j) = max_{0≤k≤i-1} [ V(k,j) - w(i-k) ]
- V(i,j) = max { E(i,j), F(i,j), G(i,j) }
25 Arbitrary gap cost recurrence
- Base conditions
- V(i,0) = -w(i), E(i,0) = -w(i)
- V(0,j) = -w(j), F(0,j) = -w(j)
- V(0,0) = G(0,0) = 0
- Optimal value of alignment is found in cell (n,m)
- Traceback may jump multiple cells horizontally or vertically
- Running time is O(nm(n+m)), space is O(nm) as before.
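The arbitrary gap cost recurrence can be sketched as follows; the inner `max` over k is what raises the running time to O(nm(n+m)). The function name `align_arbitrary_gaps`, the callable `w`, and the match/mismatch defaults are assumptions for illustration, and only the optimal value is computed.

```python
def align_arbitrary_gaps(S, T, w, match=2, mismatch=-2):
    """Optimal global similarity when a gap of length g costs w(g)."""
    n, m = len(S), len(T)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Base conditions: one gap covering the whole prefix.
    for i in range(1, n + 1):
        V[i][0] = -w(i)
    for j in range(1, m + 1):
        V[0][j] = -w(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if S[i - 1] == T[j - 1] else mismatch
            G = V[i - 1][j - 1] + sub                          # type 1: S(i) with T(j)
            E = max(V[i][k] - w(j - k) for k in range(j))      # type 2: gap ends at T(j)
            F = max(V[k][j] - w(i - k) for k in range(i))      # type 3: gap ends at S(i)
            V[i][j] = max(E, F, G)
    return V[n][m]
```

With `w = lambda g: g` this reduces to the linear model with space cost 1; with `w = lambda g: 2 + g` a single gap of length 3 costs 5, less than three separate gaps of length 1.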
26 Affine gap model recurrence
- Base conditions
- V(i,0) = E(i,0) = -(o + e·i)
- V(0,j) = F(0,j) = -(o + e·j)
- V(0,0) = G(0,0) = 0
- Recurrences
- V(i,j) = max { E(i,j), F(i,j), G(i,j) }
- G(i,j) = V(i-1,j-1) + s(S(i),T(j))
- E(i,j) = max { E(i,j-1) - e, V(i,j-1) - o - e }
- F(i,j) = max { F(i-1,j) - e, V(i-1,j) - o - e }
- Running time O(nm), space O(nm)
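A sketch of the affine recurrences, with o and e taken as non-negative costs so that a gap of length g costs o + e·g. The function name `affine_align_score` and the match/mismatch defaults are assumptions. One deliberate difference from the base conditions listed above: this sketch sets the boundary entries of E and F to negative infinity, so that a gap in one sequence is never extended into a gap in the other at the price of a single opening.

```python
NEG_INF = float("-inf")

def affine_align_score(S, T, o, e, match=2, mismatch=-2):
    """Optimal global similarity with gap cost o + e*g (o = open, e = extend)."""
    n, m = len(S), len(T)
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # T(j) aligned with a gap
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # S(i) aligned with a gap
    V[0][0] = 0
    for i in range(1, n + 1):
        V[i][0] = -(o + e * i)
    for j in range(1, m + 1):
        V[0][j] = -(o + e * j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if S[i - 1] == T[j - 1] else mismatch
            G = V[i - 1][j - 1] + sub
            # Either extend the current gap (pay e) or open a new one (pay o + e).
            E[i][j] = max(E[i][j - 1] - e, V[i][j - 1] - o - e)
            F[i][j] = max(F[i - 1][j] - e, V[i - 1][j] - o - e)
            V[i][j] = max(E[i][j], F[i][j], G)
    return V[n][m]
```

Each cell now takes constant time, which is how the affine model gets back to O(nm) from the O(nm(n+m)) of the arbitrary model.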
27 Linear space global alignment algorithm
- Notice that if we only wanted the value of the optimal alignment, then O(m) space is sufficient
- Only use previous row of table when computing current row
- So V(n,m) in O(m) space and O(nm) time.
- How can we recover the optimal alignment without giving up O(m) space?
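The row-by-row observation can be sketched as below, keeping only the previous and current rows. The function name `global_score_linear_space` and the scoring defaults are assumptions for illustration.

```python
def global_score_linear_space(S, T, match=2, mismatch=-2, space=-1):
    """Value V(n,m) of the optimal global alignment using O(m) space."""
    m = len(T)
    prev = [j * space for j in range(m + 1)]        # row 0: V(0,j)
    for i, a in enumerate(S, start=1):
        cur = [i * space] + [0] * m                 # V(i,0)
        for j, b in enumerate(T, start=1):
            sub = match if a == b else mismatch
            cur[j] = max(prev[j - 1] + sub,         # diagonal
                         prev[j] + space,           # from the row above
                         cur[j - 1] + space)        # from the left
        prev = cur                                  # discard the older row
    return prev[m]
```

Because earlier rows are thrown away, no traceback is possible from this table alone, which is the problem the following slides address.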
28 Optimal global alignment in linear space
- Define V^R(i,j) to be the similarity of S^R[1...i] and T^R[1...j], where S^R and T^R are the reversed strings
- Run DP from bottom right corner up left
- V(n,m) = max_{0≤k≤m} [ V(n/2,k) + V^R(n/2,m-k) ]
- The optimal alignment can be broken into the piece for S[1...n/2] and the piece for S[n/2+1...n]
- T[1...k] aligns with the first half of S, while T[k+1...m] aligns with the second half of S.
29 Optimal global alignment in linear space
- Compute the values in row n/2 of V in O(nm) time and O(m) space.
- Compute the values in row n/2 of V^R in O(nm) time and O(m) space.
- Check each possible k to find the maximizing k* in O(m) time.
- We know there is an optimal alignment passing through cell (n/2,k*).
30 Optimal global alignment in linear space
- When computing row n/2 of V and V^R, retain DP back-pointers.
- Use back-pointers of V to find an optimal path from (n/2,k*) to (n/2-1,k1)
- Use back-pointers of V^R to find an optimal path from (n/2,k*) to (n/2+1,k2)
- Recursively solve global alignment of S[1...n/2-1], T[1...k1] and S[n/2+1...n], T[k2...m]
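The divide-and-conquer idea can be sketched as below. Note that this is the common Hirschberg-style formulation, which splits at (n/2, k*) and recurses directly on the two halves, not the back-pointer variant with k1 and k2 described above. All names (`last_row`, `hirschberg`, `score_alignment`) and the scoring defaults are assumptions, and the single-character base case assumes match ≥ mismatch.

```python
def last_row(S, T, match, mismatch, space):
    """Last row of the global DP table for S vs T, in O(len(T)) space."""
    prev = [j * space for j in range(len(T) + 1)]
    for a in S:
        cur = [prev[0] + space]
        for j, b in enumerate(T, start=1):
            sub = match if a == b else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + space, cur[j - 1] + space))
        prev = cur
    return prev

def hirschberg(S, T, match=2, mismatch=-2, space=-1):
    """Optimal global alignment (S', T') without storing the full table."""
    n, m = len(S), len(T)
    if n == 0:
        return "-" * m, T
    if m == 0:
        return S, "-" * n
    if n == 1:
        # Base case: place the single character of S opposite its best partner.
        j = T.find(S)
        sub = match if j >= 0 else mismatch
        if sub < 2 * space:                  # cheaper to leave it unaligned
            return S + "-" * m, "-" + T
        j = max(j, 0)
        return "-" * j + S + "-" * (m - j - 1), T
    h = n // 2
    top = last_row(S[:h], T, match, mismatch, space)                 # V(n/2, k)
    bot = last_row(S[h:][::-1], T[::-1], match, mismatch, space)     # V^R(n/2, m-k)
    k = max(range(m + 1), key=lambda c: top[c] + bot[m - c])         # k*
    a1, b1 = hirschberg(S[:h], T[:k], match, mismatch, space)
    a2, b2 = hirschberg(S[h:], T[k:], match, mismatch, space)
    return a1 + a2, b1 + b2

def score_alignment(a, b, match=2, mismatch=-2, space=-1):
    """Similarity of an alignment given as two equal-length gapped strings."""
    return sum(space if "-" in (x, y) else match if x == y else mismatch
               for x, y in zip(a, b))
```

A quick sanity check is to compare `score_alignment(*hirschberg(S, T))` with the last entry of `last_row(S, T, 2, -2, -1)`; the two should agree because both equal V(n,m).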
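31 Optimal global alignment in linear space
[Figure: rows n/2-1, n/2 and n/2+1 of the DP table with columns k1, k* and k2 marked, and the two remaining subproblems labelled A and B]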
32 Optimal global alignment in linear space
- Running time analysis
- T(n,m) = T(n/2,k*) + T(n/2,m-k*) + O(nm)
- Final term is time to find k*.
- In second phase, time to find each k*
- first subproblem O(n/2 · k*),
- second subproblem O(n/2 · (m-k*)).
- Total O(nm/2)
- T(n,m) = O(nm + nm/2 + nm/4 + ...) = O(2nm)
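The final bound is a geometric series. As a short worked step, assume the work at the top level is at most c·nm for some constant c (c is introduced here only for illustration); the subproblems at recursion depth i together cover a table area of nm/2^i, so:

```latex
T(n,m) \;\le\; \sum_{i=0}^{\log_2 n} c\,\frac{nm}{2^{i}}
       \;<\; c\,nm \sum_{i=0}^{\infty} 2^{-i}
       \;=\; 2c\,nm \;=\; O(nm)
```

So the linear-space algorithm does at most about twice the work of the standard O(nm) table fill.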