Title: CS 5263 Bioinformatics
1CS 5263 Bioinformatics
- Lecture 4 Local Sequence Alignment, More
Efficient Sequence Alignment Algorithms
2Roadmap
- Review of last lecture
- Local sequence alignment
- More efficient sequence alignment algorithms
3- Given a scoring scheme,
- Match m
- Mismatch -s
- Gap -d
- We can easily compute an optimal alignment by
dynamic programming
4- Look at any column of an alignment between two
sequences X x1x2xM, Y y1y1yN - Only three cases
- xi is aligned to yj
- xi is aligned to a gap
- yj is aligned to a gap
F(i-1, j-1) ? (xi, yj) F(i, j) max
F(i-1, j) - d F(i, j-1) - d
5(No Transcript)
6(No Transcript)
7Example
F(i,j) j 0 1 2 3 4
i 0
A A
G -
T T
A A
A A
G -
T T
A A
1
2
3
8Equivalent graph problem
S1
G
A
T
A
(0,0)
? a gap in the 2nd sequence ? a gap in the 1st
sequence match / mismatch
1
1
A
S2
1
T
Value on vertical/horizontal line -d Value on
diagonal m or -s
1
1
A
(3,4)
- Number of steps length of the alignment
- Path length alignment score
- Optimal alignment find the longest path from (0,
0) to (3, 4) - General longest path problem cannot be found with
DP. Longest path on this graph can be found by DP
since no cycle is possible.
9Variants of Needleman-Wunsch alg
- LCS longest common subsequence
- No penalty for gaps or mutations
- Change score function
- Overlapping variants
- No penalty for starting/ending gaps
- Change initial / termination step
- Other variants
- cDNA-genome alignment
10Local alignment
11The local alignment problem
- Given two strings X x1xM,
- Y y1yN
- Find substrings x, y whose similarity (optimal
global alignment value) is maximum - e.g. X abcxdex X cxde
- Y xxxcde Y c-de
x
y
12Why local alignment
- Conserved regions may be a small part of the
whole - Global alignment might miss them if flanking
junk outweighs similar regions - Genes are shuffled between genomes
C
D
B
A
D
A
B
C
13Naïve algorithm
- for all substrings X of X and Y of Y
- Align X Y via dynamic programming
- Retain pair with max value
- end
- Output the retained pair
- Time O(n2) choices for A, O(m2) for B, O(nm) for
DP, so O(n3m3 ) total.
14Reminder
- The overlap detection algorithm
- We do not give penalty to gaps in the ends
Free gap
Free gap
15The local alignment idea
- Do not penalize the unaligned regions (gaps or
mismatches) - The alignment can start anywhere and ends
anywhere - Strategy whenever we get to some low similarity
region (negative score), we restart a new
alignment - By resetting alignment score to zero
16The Smith-Waterman algorithm
- Initialization F(0, j) F(i, 0) 0
-
- 0
- F(i 1, j) d
- F(i, j 1) d
- F(i 1, j 1) ?(xi, yj)
Iteration F(i, j) max
17The Smith-Waterman algorithm
- Termination
- If we want the best local alignment
- FOPT maxi,j F(i, j)
- If we want all local alignments scoring gt t
- For all i, j find F(i, j) gt t, and trace back
18Match 2 Mismatch -1 Gap -1
19Match 2 Mismatch -1 Gap -1
20Match 2 Mismatch -1 Gap -1
21Match 2 Mismatch -1 Gap -1
22Match 2 Mismatch -1 Gap -1
23Match 2 Mismatch -1 Gap -1
24Match 2 Mismatch -1 Gap -1
25Trace back
Match 2 Mismatch -1 Gap -1
26Trace back
Match 2 Mismatch -1 Gap -1
cxde c-de
x-de xcde
27- No negative values in local alignment DP array
- Optimal local alignment will never have a gap on
either end - Local alignment Smith-Waterman
- Global alignment Needleman-Wunsch
28Analysis
- Time
- O(MN) for finding the best alignment
- Time to report all alignments depends on the
number of sub-opt alignments - Memory
- O(MN)
- O(MN) possible
29- More efficient alignment algorithms
30- Given two sequences of length M, N
- Time O(MN)
- ok
- Space O(MN)
- bad
- 1Mb seq x 1Mb seq 1000G memory
- Can we do better?
31Bounded alignment
- Good alignment should appear near the diagonal
32Bounded Dynamic Programming
- If we know that x and y are very similar
- Assumption gaps(x, y) lt k
- xi
- Then, implies i j lt k
- yj
33Bounded Dynamic Programming
- Initialization
- F(i,0), F(0,j) undefined for i, j gt k
- Iteration
- For i 1M
- For j max(1, i k)min(N, ik)
- F(i 1, j 1) ?(xi, yj)
- F(i, j) max F(i, j 1) d, if j gt i k
- F(i 1, j) d, if j lt i k
- Termination same
x1 xM
yN y1
k
34Analysis
- Time O(kM) ltlt O(MN)
- Space O(kM) with some tricks
gt
M
M
2k
2k
35(No Transcript)
36- Given two sequences of length M, N
- Time O(MN)
- ok
- Space O(MN)
- bad
- 1mb seq x 1mb seq 1000G memory
- Can we do better?
37Linear space algorithm
- If all we need is the alignment score but not the
alignment, easy!
We only need to keep two rows (You only need one
row, with a little trick)
But how do we get the alignment?
38Linear space algorithm
- When we finish, we know how we have aligned the
ends of the sequences
XM YN
Naïve idea Repeat on the smaller subproblem
F(M-1, N-1) Time complexity O((MN)(MN))
39(0, 0)
M/2
(M, N)
Key observation optimal alignment (longest path)
must use an intermediate point on the M/2-th row.
Call it (M/2, k), where k is unknown.
40(0,0)
(3,2)
(3,4)
(3,6)
(3,0)
(6,6)
- Longest path from (0, 0) to (6, 6) is max_k
(LP(0,0,3,k) LP(6,6,3,k))
41Hirschbergs idea
Y
Forward algorithm Align x1x2xM/2 with Y
X
M/2
F(M/2, k) represents the best alignment between
x1x2xM/2 and y1y2yk
42Backward Algorithm
Y
Backward algorithm Align reverse(xM/21xM) with
reverse(Y)
X
M/2
B(M/2, k) represents the best alignment between
reverse(xM/21xM) and reverse(ykyk1yN )
43Linear-space alignment
- Using 2 (4) rows of space, we can compute
- for k 1N, F(M/2, k), B(M/2, k)
M/2
44Linear-space alignment
- Now, we can find k maximizing F(M/2, k) B(M/2,
k) - Also, we can trace the path exiting column M/2
from k
Conclusion In O(NM) time, O(N) space, we found
optimal alignment path at row M/2
45Linear-space alignment
- Iterate this procedure to the two sub-problems!
M/2
k
M/2
N-k
46Analysis
- Memory O(N) for computation, O(NM) to store the
optimal alignment - Time
- MN for first iteration
- k M/2 (N-k) M/2 MN/2 for second
k
M/2
M/2
N-k
47MN
MN/2
MN/4
MN MN/2 MN/4 MN/8 MN (1 ½ ¼
1/8 1/16 ) 2MN O(MN)
MN/8