Title: CS 5263 Bioinformatics
1CS 5263 Bioinformatics
- Lecture 3 Dynamic Programming and Global
Sequence Alignment
2Evolution at the DNA level
C
ACGGTGCAGTCACCA
ACGTTGC-GTCCACCA
DNA evolutionary events (sequence
edits) Mutation, deletion, insertion
3Sequence conservation implies function
next generation
OK
OK
OK
X
X
Still OK?
4Why sequence alignment?
- Conserved regions are more likely to be
functional - Can be used for finding genes, regulatory
elements, etc. - Similar sequences often have similar origin and
function - Can be used to predict functions for new genes /
proteins - Sequence alignment is one of the most widely used
computational tools in biology
5Global Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGG
TCGATTTGCCCGAC
S
T
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC-
-GACCGC--GGTCGATTTGCCCGAC
S
T
- Definition
- An alignment of two strings S, T is a pair of
strings S, T (with spaces) s.t. - S T, and (S length of S)
- removing all spaces in S, T leaves S, T
6What is a good alignment?
- Alignment
- The best way to match the letters of one
sequence with those of the other - How do we define best?
7S -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- T
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
- The score of aligning (characters or spaces) x
y is s (x,y). - Score of an alignment
- An optimal alignment one with max score
8Scoring Function
- Sequence edits
- AGGCCTC
- Mutations AGGACTC
- Insertions AGGGCCTC
- Deletions AGG-CTC
- Scoring Function
- Match m AAC
- Mismatch -s A-A
- Gap (indel) -d
9- Match 2, mismatch -1, gap -1
- Score 3 x 2 2 x 1 1 x 1 3
10More complex scoring function
- Substitution matrix
- Similarity score of matching two letters a, b
should reflect the probability of a, b derived
from same ancestor - It is usually defined by log likelihood ratio
(Durbin book) - Active research area. Especially for proteins.
- Commonly used PAM, BLOSUM
11An example substitution matrix
12How to find an optimal alignment?
- A naïve algorithm
- for all subseqs A of S, B of T s.t. A B do
- align Ai with Bi, 1 i A
- align all other chars to spaces
- compute its value
- retain the max
- end
- output the retained alignment
S abcd A cd T wxyz B xz -abc-d
a-bc-d w--xyz -w-xyz
13Analysis
- Assume S T n
- Cost of evaluating one alignment n
- How many alignments are there
- pick n chars of S,T together
- say k of them are in S
- match these k to the k unpicked chars of T
- Total time
- E.g., for n 20, time is gt 240 gt1012 operations
14- Intro to Dynamic Programming
15Dynamic programming
- What is dynamic programming?
- A method for solving problems exhibiting the
properties of overlapping subproblems and optimal
substructure - Key idea tabulating sub-problem solutions rather
than re-computing them repeatedly - Two simple examples
- Computing Fibonacci numbers
- Find the special shortest path in a grid
16Example 1 Fibonacci numbers
- 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89,
- F(0) 1
- F(1) 1
- F(n) F(n-1) f(n-2)
- How to compute F(n)?
17A recursive algorithm
- function fib(n)
- if (n 0 or n 1) return 1
- else return fib(n-1) fib(n-2)
18n/2
n
- Time complexity
- Between 2n/2 and 2n
- O(1.62n), i.e. exponential
- Why recursive Fib algorithm is inefficient?
- Overlapping subproblems
19An iterative algorithm
- function fib(n)
- F0 1 F1 1
- for i 2 to n
- Fi Fi-1 Fi-2
- Return Fn
Time complexity Time O(n), space O(n)
20Example 2 shortest path in a grid
S
m
G
n
Each edge has a length (cost). We need to get to
G from S. Can only move right or down. Aim find
a path with the minimum total length
21Optimal substructures
- Naïve algorithm enumerate all possible paths and
compare costs - Exponential number of paths
- Key observation
- If a path P(S, G) is the shortest from S to G,
any of its sub-path P(S,x), where x is on P(S,G),
is the shortest from S to x
22Proof
- Proof by contradiction
- If the path between P(S,x) is not the shortest,
i.e., P(S,x) lt P(S,x) - Construct a new path P(S,G) P(S,x) P(x, G)
- P(S,G) lt P(S,G) gt P(S,G) is not the shortest
- Contradiction
- Therefore, P(S, x) is the shortest
S
x
G
23Recursive solution
(0,0)
- Index each intersection by two indices, (i, j)
- Let F(i, j) be the total length of the shortest
path from (0, 0) to (i, j). Therefore, F(m, n) is
the shortest path we wanted. - To compute F(m, n), we need to compute both
F(m-1, n) and F(m, n-1)
m
(m, n)
n
F(m-1, n) length((m-1, n), (m, n)) F(m,
n) min F(m, n-1) length((m,
n-1), (m, n))
24Recursive solution
F(i-1, j) length((i-1, j), (i, j)) F(i, j)
min F(i, j-1) length((i, j-1), (i, j))
(0,0)
- But if we use recursive call, many subpaths will
be recomputed for many times - Strategy pre-compute F values starting from the
upper-left corner. Fill in row by row (what other
order will also do?)
(i-1, j)
(i, j)
(i, j-1)
m
(m, n)
n
25Dynamic programming illustration
S
9
1
2
3
3
12
13
15
0
5
3
3
3
3
2
5
2
3
6
8
13
15
5
2
3
3
9
3
4
2
3
2
9
7
11
13
16
6
2
3
7
4
6
3
3
3
11
14
17
20
13
4
6
3
1
3
2
3
2
1
17
17
18
20
17
G
F(i-1, j) length(i-1, j, i, j) F(i, j)
min F(i, j-1) length(i, j-1, i, j)
26Trackback
9
1
2
3
3
12
13
15
0
5
3
3
3
3
2
5
2
3
6
8
13
15
5
2
3
3
9
3
4
2
3
2
9
7
11
13
16
6
2
3
7
4
6
3
3
3
11
14
17
20
13
4
6
3
1
3
2
3
2
1
17
17
18
20
17
27Elements of dynamic programming
- Optimal sub-structures
- Optimal solutions to the original problem
contains optimal solutions to sub-problems - Overlapping sub-problems
- Some sub-problems appear in many solutions
- Memorization and reuse
- Carefully choose the order that sub-problems are
solved
28Dynamic Programming for sequence alignment
- Suppose we wish to align
- x1xM
- y1yN
- Let F(i,j) optimal score of aligning
- x1xi
- y1yj
- Scoring Function
- Match m
- Mismatch -s
- Gap (indel) -d
29Optimal substructure
- If xi is aligned to yj in the optimal
alignment between x1..M and y1..N, then - The alignment between x1..i and y1..j is also
optimal - Easy to prove by contradiction
30Recursive formula
- Notice three possible cases
- xM aligns to yN
- xM
- yN
- 2. xM aligns to a gap
- xM
- ?
- yN aligns to a gap
- ?
- yN
m, if xM yN F(M,N)
F(M-1, N-1) -s, if not
F(M,N) F(M-1, N) - d
F(M,N) F(M, N-1) - d
31Recursive formula
- Generalize
- F(i-1, j-1) ?(Xi,Yj)
- F(i,j) max F(i-1, j) d
- F(i, j-1) d
- ?(Xi,Yj) m if Xi Yj, and s otherwise
- Boundary conditions
- F(0, 0) 0.
- F(0, j) ?
- F(i, 0) ?
-jd y1..j aligned to gaps.
-id x1..i aligned to gaps.
32What order to fill?
33What order to fill?
34Example
F(i,j) i 0 1 2 3 4
j 0
1
2
3
35Example
F(i,j) i 0 1 2 3 4
j 0
1
2
3
36Example
F(i,j) i 0 1 2 3 4
j 0
1
2
3
37Example
F(i,j) i 0 1 2 3 4
j 0
1
2
3
38Example
F(i,j) i 0 1 2 3 4
Optimal Alignment F(4,3) 2
j 0
1
2
3
39Example
F(i,j) i 0 1 2 3 4
Optimal Alignment F(4,3) 2 This only tells us
the best score
j 0
1
2
3
40Trace-back
F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
j 0
1
2
3
41Trace-back
F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
j 0
1
2
3
42Trace-back
F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
j 0
1
2
3
43Trace-back
F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
j 0
1
2
3
44Trace-back
F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
Optimal Alignment F(4,3) 2 AGTA A?TA
j 0
1
2
3
45Using trace-back pointers
F(i,j) i 0 1 2 3 4
j 0
1
2
3
46Using trace-back pointers
F(i,j) i 0 1 2 3 4
j 0
1
2
3
47Using trace-back pointers
F(i,j) i 0 1 2 3 4
j 0
1
2
3
48Using trace-back pointers
F(i,j) i 0 1 2 3 4
j 0
1
2
3
49Using trace-back pointers
F(i,j) i 0 1 2 3 4
j 0
1
2
3
50Using trace-back pointers
F(i,j) i 0 1 2 3 4
j 0
1
2
3
51Using trace-back pointers
F(i,j) i 0 1 2 3 4
j 0
1
2
3
52Using trace-back pointers
F(i,j) i 0 1 2 3 4
Optimal Alignment F(4,3) 2 AGTA A?TA
j 0
1
2
3
53The Needleman-Wunsch Algorithm
- Initialization.
- F(0, 0) 0
- F(0, j) - j ? d
- F(i, 0) - i ? d
- Main Iteration. Filling in scores
- For each i 1M
- For each j 1N
- F(i-1,j) d case 1
- F(i, j) max F(i, j-1) d
case 2 - F(i-1, j-1) s(xi, yj) case 3
- UP, if case 1
- Ptr(i,j) LEFT if case 2
- DIAG if case 3
- Termination. F(M, N) is the optimal score, and
from Ptr(M, N) can trace back optimal alignment
54Performance
- Time
- O(NM)
- Space
- O(NM)
- Later we will cover more efficient methods
55Equivalent graph problem
S1
G
A
T
A
(0,0)
? a gap in the 2nd sequence ? a gap in the 1st
sequence match / mismatch
1
1
A
S2
1
T
Value on vertical/horizontal line -d Value on
diagonal m or -s
1
1
A
(3,4)
- Number of steps length of the alignment
- Path length alignment score
- Optimal alignment find the longest path from (0,
0) to (3, 4) - General longest path problem cannot be found with
DP. Longest path on this graph can be found by DP
since no cycle is possible.
56Question
- If we change the scoring scheme, will the optimal
alignment be changed? - Old Match 1, mismatch gap -1
- New match 2, mismatch gap 0
- New Match 2, mismatch gap -2?
57Question
- What kind of alignment is represented by these
paths?
A
A
A
A
A
B C
B C
B C
B C
B C
A- BC
A-- -BC
--A BC-
-A- B-C
-A BC
Alternating gaps are impossible if s gt -2d
58A variant of the basic algorithm
- Scoring scheme m s d 1
- Seq1 CAGCA-CTTGGATTCTCGG
-
- Seq2 ---CAGCGTGG--------
- Seq1 CAGCACTTGGATTCTCGG
-
- Seq2 CAGC-----G-T----GG
- The first alignment may be biologically more
realistic
Score -7
Score -2
59A variant of the basic algorithm
- Maybe it is OK to have an unlimited of gaps in
the beginning and end
----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC GCG
AGTTCATCTATCAC--GACCGC--GGTCG--------------
- Then, we dont want to penalize gaps in the ends
60The Overlap Detection variant
- Changes
- Initialization
- For all i, j,
- F(i, 0) 0
- F(0, j) 0
- Termination
- maxi F(i, N)
- FOPT max maxj F(M, j)
x1 xM
yN y1
61Different types of overlaps
x
x
y
y
62A non-bio variant
- Shell command diff Compare two text files
- Given file1 and file2
- Find the difference between file1 and file2
- Similar to sequence alignment
- How to score?
- Longest common subsequence (LCS)
- Match has score 1
- No mismatch penalty
- No gap penalty
63(No Transcript)
64 diff file1 file2 1c1 lt A --- gt G 4c4 lt D --- gt -
LCS 4
65The LCS variant
- Changes
- Initialization
- For all i, j, F(i, 0) F(0, j) 0
- Filling in table
- F(i-1,j)
- F(i, j) max F(i, j-1)
- F(i-1, j-1) s(xi, yj)
- where s(xi, yj) 1 if xi yj and 0 otherwise.
- Termination
- maxi F(i, N)
- FOPT max
- maxj F(M, j)
66More efficient algorithms
- What happens if you have 1 million lines of text
in each file? - O(mn) algorithm is too inefficient
- Memory inefficient
- 1 TB memory to store the matrix
- Bounded DP
- maybe the majority of the two files are the same?
(e.g., two versions of a software) - Linear-space algorithm
- same time complexity