CS 5263 Bioinformatics - PowerPoint PPT Presentation

Provided by: jianhu

Learn more at: http://www.cs.utsa.edu

1
CS 5263 Bioinformatics

Lecture 3: Dynamic Programming and Global Sequence Alignment

2
Evolution at the DNA level
C
ACGGTGCAGTCACCA
ACGTTGC-GTCCACCA
DNA evolutionary events (sequence edits): mutation, deletion, insertion
3
Sequence conservation implies function

[Figure: sequences passed on to the next generation; variants are labeled OK, X, or "Still OK?"]


4
Why sequence alignment?

Conserved regions are more likely to be functional
  Can be used for finding genes, regulatory elements, etc.
Similar sequences often have similar origin and function
  Can be used to predict functions for new genes / proteins
Sequence alignment is one of the most widely used computational tools in biology

5
Global Sequence Alignment
S  = AGGCTATCACCTGACCTCCAGGCCGATGCCC
T  = TAGCTATCACGACCGCGGTCGATTTGCCCGAC

S' = -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
T' = TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition
An alignment of two strings S, T is a pair of strings S', T' (with spaces) such that
$|S'| = |T'|$, where $|S|$ denotes the length of S, and
removing all spaces in S', T' leaves S, T

6
What is a good alignment?

Alignment
The best way to match the letters of one
sequence with those of the other
How do we define best?

7
S' = -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
T' = TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

The score of aligning x with y (each a character or a space) is $\sigma(x, y)$.
Score of an alignment: $\sum_{i=1}^{|S'|} \sigma(S'_i, T'_i)$
An optimal alignment: one with the maximum score

8
Scoring Function

Sequence edits
  AGGCCTC
  Mutations:  AGGACTC
  Insertions: AGGGCCTC
  Deletions:  AGG-CTC
Scoring Function
  Match: m
  Mismatch: -s
  Gap (indel): -d

Example with match = 2, mismatch = -1, gap = -1, for an alignment with 3 matches, 2 mismatches and 1 gap:
Score = 3 x 2 - 2 x 1 - 1 x 1 = 3
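
The scoring function above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `score_alignment` and its parameter names are assumptions, and the default values follow the slide's example (match 2, mismatch -1, gap -1).

```python
def score_alignment(s_aligned, t_aligned, match=2, mismatch=-1, gap=-1):
    """Sum the column scores of two gapped strings of equal length."""
    if len(s_aligned) != len(t_aligned):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(s_aligned, t_aligned):
        if a == "-" and b == "-":
            raise ValueError("a column cannot pair two gaps")
        if a == "-" or b == "-":
            total += gap        # indel column
        elif a == b:
            total += match      # identical letters
        else:
            total += mismatch   # substitution
    return total


# The deletion and mutation examples from the slide: 6 matches plus one
# gap (or one mismatch), so both score 6 * 2 - 1.
print(score_alignment("AGGCCTC", "AGG-CTC"))
print(score_alignment("AGGCCTC", "AGGACTC"))
```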

10
More complex scoring function

Substitution matrix
  Similarity score of matching two letters a, b should reflect the probability of a, b being derived from the same ancestor
  It is usually defined by a log likelihood ratio (Durbin book)
  Active research area, especially for proteins
  Commonly used: PAM, BLOSUM
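
For reference, the log likelihood ratio is commonly written in the log-odds form below. The symbols are assumptions for illustration and do not appear on the slide: $p_{ab}$ is the probability of seeing a aligned with b in related sequences, and $q_a$, $q_b$ are the background frequencies of the two letters.

```latex
\sigma(a, b) = \log \frac{p_{ab}}{q_a \, q_b}
```

A positive score means the pair is seen more often in related sequences than chance would predict; a negative score means less often.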

11
An example substitution matrix
12
How to find an optimal alignment?

A naïve algorithm
for all subseqs A of S, B of T s.t. |A| = |B| do
  align A[i] with B[i], 1 <= i <= |A|
  align all other chars to spaces
  compute its value
  retain the max
end
output the retained alignment

Example: S = abcd, A = cd; T = wxyz, B = xz
  -abc-d      a-bc-d
  w--xyz      -w-xyz
13
Analysis

Assume |S| = |T| = n
Cost of evaluating one alignment: at least n
How many alignments are there: $\binom{2n}{n}$
  pick n chars of S, T together
  say k of them are in S
  match these k to the k unpicked chars of T
Total time: $\ge n \binom{2n}{n} > 2^{2n}$, for n > 3
E.g., for n = 20, time is $> 2^{40} > 10^{12}$ operations
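
The counting argument can be checked numerically. The sketch below is illustrative only (the function name `count_naive_alignments` is an assumption): it sums, over k, the ways to choose k characters of S and k characters of T, and compares the total with the closed form.

```python
from math import comb


def count_naive_alignments(n):
    """Number of (A, B) pairs enumerated by the naive algorithm when |S| = |T| = n."""
    # Choose k characters of S and k characters of T to match up, for every k.
    return sum(comb(n, k) ** 2 for k in range(n + 1))


n = 20
alignments = count_naive_alignments(n)
print(alignments == comb(2 * n, n))          # the sum equals C(2n, n)
print(n * alignments > 2 ** 40 > 10 ** 12)   # the bound quoted for n = 20
```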

Intro to Dynamic Programming

15
Dynamic programming

What is dynamic programming?
  A method for solving problems exhibiting the properties of overlapping subproblems and optimal substructure
  Key idea: tabulating sub-problem solutions rather than re-computing them repeatedly
Two simple examples
  Computing Fibonacci numbers
  Finding a special shortest path in a grid

16
Example 1 Fibonacci numbers

1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ...
F(0) = 1
F(1) = 1
F(n) = F(n-1) + F(n-2)
How to compute F(n)?

17
A recursive algorithm

function fib(n)
  if (n = 0 or n = 1) return 1
  else return fib(n-1) + fib(n-2)

18
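[Figure: recursion tree of fib(n); its shortest branch has depth n/2 and its longest branch has depth n]

Time complexity
  Between $2^{n/2}$ and $2^n$
  $O(1.62^n)$, i.e. exponential
Why is the recursive Fib algorithm inefficient?
  Overlapping subproblems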

19
An iterative algorithm

function fib(n)
  F[0] = 1; F[1] = 1
  for i = 2 to n
    F[i] = F[i-1] + F[i-2]
  return F[n]

Time complexity: time O(n), space O(n)
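
The two Fibonacci algorithms can be compared directly. The sketch below is an illustration under the slide's convention F(0) = F(1) = 1; the call counter passed to `fib_recursive` is an addition made here to expose the overlapping subproblems, not part of the lecture's pseudocode.

```python
def fib_recursive(n, counter):
    """Naive recursion; counter[0] records how many calls are made."""
    counter[0] += 1
    if n == 0 or n == 1:
        return 1
    return fib_recursive(n - 1, counter) + fib_recursive(n - 2, counter)


def fib_iterative(n):
    """Tabulate F[0..n] once: O(n) time and O(n) space."""
    f = [1] * (n + 1)
    for i in range(2, n + 1):
        f[i] = f[i - 1] + f[i - 2]
    return f[n]


calls = [0]
value = fib_recursive(20, calls)
print(value, calls[0])      # same value, but tens of thousands of calls
print(fib_iterative(20))    # same value from a single pass over the table
```

Keeping only the last two table entries would reduce the space to O(1) without changing the running time.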
20
Example 2: shortest path in a grid
[Figure: an m x n grid with S at the upper-left corner and G at the lower-right corner]
Each edge has a length (cost). We need to get to G from S. Can only move right or down. Aim: find a path with the minimum total length
21
Optimal substructures

Naïve algorithm: enumerate all possible paths and compare costs
  Exponential number of paths
Key observation
  If a path P(S, G) is the shortest from S to G, then any of its sub-paths P(S, x), where x is on P(S, G), is the shortest from S to x

22
Proof

Proof by contradiction
Suppose the sub-path P(S, x) is not the shortest, i.e., there is a path P'(S, x) with |P'(S, x)| < |P(S, x)|
Construct a new path P'(S, G) = P'(S, x) + P(x, G)
|P'(S, G)| < |P(S, G)| => P(S, G) is not the shortest
Contradiction
Therefore, P(S, x) is the shortest

[Figure: a path from S to G passing through x]
23
Recursive solution
[Figure: grid with upper-left corner (0,0) and lower-right corner (m, n)]

Index each intersection by two indices, (i, j)
Let F(i, j) be the total length of the shortest path from (0, 0) to (i, j). Therefore, F(m, n) is the length of the shortest path we wanted.
To compute F(m, n), we need to compute both F(m-1, n) and F(m, n-1)

$F(m, n) = \min \{\, F(m-1, n) + \mathrm{length}((m-1, n), (m, n)),\; F(m, n-1) + \mathrm{length}((m, n-1), (m, n)) \,\}$
24
Recursive solution
$F(i, j) = \min \{\, F(i-1, j) + \mathrm{length}((i-1, j), (i, j)),\; F(i, j-1) + \mathrm{length}((i, j-1), (i, j)) \,\}$

But if we use recursive calls, many sub-paths will be recomputed many times
Strategy: pre-compute F values starting from the upper-left corner. Fill in row by row (what other order will also do?)

[Figure: intersection (i, j) and its two predecessors (i-1, j) and (i, j-1), in a grid from (0,0) to (m, n)]
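
The row-by-row fill can be sketched as follows. The data layout is an assumption made for this example: `down[i][j]` holds the length of the edge from (i, j) to (i+1, j) and `right[i][j]` the length of the edge from (i, j) to (i, j+1). The edge lengths in the usage example are made up and are not the ones from the lecture's figure.

```python
def grid_shortest_path(down, right):
    """Shortest right/down path from (0, 0) to (m, n), with the path itself."""
    m, n = len(down), len(right[0])
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):                 # top row: only reachable from the left
        F[0][j] = F[0][j - 1] + right[0][j - 1]
    for i in range(1, m + 1):
        F[i][0] = F[i - 1][0] + down[i - 1][0]    # left column: only from above
        for j in range(1, n + 1):
            F[i][j] = min(F[i - 1][j] + down[i - 1][j],
                          F[i][j - 1] + right[i][j - 1])
    # Trace back from (m, n): step to whichever predecessor produced F[i][j].
    path = [(m, n)]
    i, j = m, n
    while (i, j) != (0, 0):
        if i > 0 and F[i][j] == F[i - 1][j] + down[i - 1][j]:
            i -= 1
        else:
            j -= 1
        path.append((i, j))
    path.reverse()
    return F[m][n], path


# Hypothetical 2 x 2 grid (3 x 3 intersections).
down = [[1, 4, 2], [3, 1, 5]]
right = [[2, 3], [1, 1], [4, 2]]
print(grid_shortest_path(down, right))
```

Filling column by column, or along anti-diagonals, would work equally well, because each cell only needs the cell above it and the cell to its left.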
25
Dynamic programming illustration
[Figure: grid from S to G showing the length of each edge and the computed F value at each intersection]

$F(i, j) = \min \{\, F(i-1, j) + \mathrm{length}((i-1, j), (i, j)),\; F(i, j-1) + \mathrm{length}((i, j-1), (i, j)) \,\}$
26
Trace-back
[Figure: the same grid as on the previous slide, with the shortest path traced back from G to S]
27
Elements of dynamic programming

Optimal sub-structures
  Optimal solutions to the original problem contain optimal solutions to sub-problems
Overlapping sub-problems
  Some sub-problems appear in many solutions
Memoization and reuse
  Carefully choose the order in which sub-problems are solved

28
Dynamic Programming for sequence alignment

Suppose we wish to align
  x1...xM
  y1...yN
Let F(i, j) = optimal score of aligning
  x1...xi
  y1...yj
Scoring Function
  Match: m
  Mismatch: -s
  Gap (indel): -d

29
Optimal substructure

If xi is aligned to yj in the optimal
alignment between x1..M and y1..N, then
The alignment between x1..i and y1..j is also
optimal
Easy to prove by contradiction

30
Recursive formula

Notice three possible cases:
1. xM aligns to yN
   xM
   yN
   $F(M, N) = F(M-1, N-1) + m$ if $x_M = y_N$, and $F(M, N) = F(M-1, N-1) - s$ if not
2. xM aligns to a gap
   xM
   -
   $F(M, N) = F(M-1, N) - d$
3. yN aligns to a gap
   -
   yN
   $F(M, N) = F(M, N-1) - d$
31
Recursive formula

Generalize:
$F(i, j) = \max \{\, F(i-1, j-1) + \sigma(x_i, y_j),\; F(i-1, j) - d,\; F(i, j-1) - d \,\}$
$\sigma(x_i, y_j) = m$ if $x_i = y_j$, and $-s$ otherwise
Boundary conditions
  F(0, 0) = 0
  F(0, j) = -jd: y1..j aligned to gaps
  F(i, 0) = -id: x1..i aligned to gaps
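
The recurrence and its boundary conditions translate directly into a table fill. The sketch below is a minimal illustration, assuming m, s and d are given as positive magnitudes (so the mismatch and gap terms are subtracted, as in the formula); the function name `nw_fill` is hypothetical.

```python
def nw_fill(x, y, m=1, s=1, d=1):
    """Return the table F, where F[i][j] is the best score of x[:i] against y[:j]."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d                      # x1..i aligned to gaps
    for j in range(1, N + 1):
        F[0][j] = -j * d                      # y1..j aligned to gaps
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sigma,   # x_i aligned to y_j
                          F[i - 1][j] - d,           # x_i aligned to a gap
                          F[i][j - 1] - d)           # y_j aligned to a gap
    return F


F = nw_fill("AGTA", "ATA")
for row in F:
    print(row)
print(F[4][3])   # optimal global alignment score
```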
32
What order to fill?
34
Example

x = AGTA    m = 1
y = ATA     s = 1
            d = 1

[Table: F(i, j) for i = 0..4 (columns) and j = 0..3 (rows), to be filled in]
39
Example

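x = AGTA    m = 1
y = ATA     s = 1
            d = 1

F(i,j)     i = 0    1    2    3    4
                    A    G    T    A
j = 0          0   -1   -2   -3   -4
j = 1  A      -1    1    0   -1   -2
j = 2  T      -2    0    0    1    0
j = 3  A      -3   -1   -1    0    2

Optimal alignment score: F(4,3) = 2
This only tells us the best score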
40
Trace-back

x = AGTA    m = 1
y = ATA     s = 1
            d = 1

$F(i, j) = \max \{\, F(i-1, j-1) + \sigma(x_i, y_j),\; F(i-1, j) - d,\; F(i, j-1) - d \,\}$

[Table: F(i, j) for i = 0..4 (columns) and j = 0..3 (rows)]
44
Trace-back

x = AGTA    m = 1
y = ATA     s = 1
            d = 1

Starting from F(4,3), repeatedly step to the cell that produced the current maximum:
(4,3) -> (3,2) -> (2,1) -> (1,1) -> (0,0)

Optimal alignment, F(4,3) = 2:
AGTA
A-TA
45
Using trace-back pointers

x = AGTA    m = 1
y = ATA     s = 1
            d = 1

[Table: F(i, j) for i = 0..4 (columns) and j = 0..3 (rows), filled in together with trace-back pointers]
52
Using trace-back pointers

x = AGTA    m = 1
y = ATA     s = 1
            d = 1

Optimal alignment, F(4,3) = 2:
AGTA
A-TA
53
The Needleman-Wunsch Algorithm

Initialization.
  F(0, 0) = 0
  F(0, j) = -j × d
  F(i, 0) = -i × d
Main Iteration. Filling in scores
  For each i = 1..M
    For each j = 1..N
      F(i, j) = max of
        F(i-1, j) - d              [case 1]
        F(i, j-1) - d              [case 2]
        F(i-1, j-1) + σ(xi, yj)    [case 3]
      Ptr(i, j) = UP if case 1, LEFT if case 2, DIAG if case 3
  where σ(xi, yj) = m if xi = yj, and -s otherwise
Termination. F(M, N) is the optimal score, and from Ptr(M, N) we can trace back the optimal alignment
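
A compact Python version of the three phases might look as follows. This is a sketch under stated assumptions rather than a reference implementation: m, s and d are positive magnitudes, ties are broken in the order DIAG, UP, LEFT (the slides do not specify a tie rule), and the boundary cells are given UP or LEFT pointers so the trace-back can run all the way to (0, 0).

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Global alignment: returns (score, aligned x, aligned y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    # Initialization: leading gaps are penalized.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"
    # Main iteration: fill scores and remember which case won.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            candidates = [
                (F[i - 1][j - 1] + sigma, "DIAG"),   # case 3
                (F[i - 1][j] - d, "UP"),             # case 1
                (F[i][j - 1] - d, "LEFT"),           # case 2
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # Termination: follow the pointers from (M, N) back to (0, 0).
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "UP":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = needleman_wunsch("AGTA", "ATA")
print(score)
print(top)
print(bottom)
```

Storing pointers costs one extra table, but it avoids re-evaluating the recurrence at every step of the trace-back.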

54
Performance

Time: O(NM)
Space: O(NM)
Later we will cover more efficient methods

55
Equivalent graph problem
[Figure: alignment grid with S1 along the top and S2 down the side, running from (0,0) to (3,4); diagonal edges between matching letters are labeled 1]

Horizontal edge: a gap in the 2nd sequence
Vertical edge: a gap in the 1st sequence
Diagonal edge: match / mismatch
Value on vertical/horizontal line: -d
Value on diagonal: m or -s

Number of steps: length of the alignment
Path length: alignment score
Optimal alignment: find the longest path from (0, 0) to (3, 4)
The general longest path problem cannot be solved with DP. The longest path on this graph can be found by DP since no cycle is possible.

56
Question

If we change the scoring scheme, will the optimal alignment be changed?
  Old: match = 1, mismatch = gap = -1
  New: match = 2, mismatch = gap = 0
  New: match = 2, mismatch = gap = -2?

57
Question

What kind of alignment is represented by these paths?

[Figure: five paths through the alignment grid of A against BC, corresponding to the alignments below]

A-    A--    --A    -A-    -A
BC    -BC    BC-    B-C    BC

Alternating gaps are impossible if -s > -2d
58
A variant of the basic algorithm

Scoring scheme: m = s = d = 1

Alignment 1 (score = -7):
Seq1 CAGCA-CTTGGATTCTCGG
Seq2 ---CAGCGTGG--------

Alignment 2 (score = -2):
Seq1 CAGCACTTGGATTCTCGG
Seq2 CAGC-----G-T----GG

The first alignment may be biologically more realistic
59
A variant of the basic algorithm

Maybe it is OK to have an unlimited number of gaps at the beginning and end

----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------

Then, we don't want to penalize gaps at the ends

60
The Overlap Detection variant

Changes
  Initialization
    For all i, j:
    F(i, 0) = 0
    F(0, j) = 0
  Termination
    $F_{OPT} = \max \{\, \max_i F(i, N),\; \max_j F(M, j) \,\}$

[Figure: x1 ... xM overlapping y1 ... yN]
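
The two changes (free leading gaps, best score taken over the last row and last column) can be sketched as below. This is an illustration only: the function name `overlap_score` and the test sequences are made up, and the fill step reuses the global-alignment recurrence unchanged.

```python
def overlap_score(x, y, m=1, s=1, d=1):
    """Best alignment score when gaps at both ends are not penalized."""
    M, N = len(x), len(y)
    # Initialization: F(i, 0) = F(0, j) = 0, so leading gaps are free.
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sigma,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    # Termination: trailing gaps are free, so take the best cell on the
    # last column (all of y used) or the last row (all of x used).
    best_last_column = max(F[i][N] for i in range(M + 1))
    best_last_row = max(F[M][j] for j in range(N + 1))
    return max(best_last_column, best_last_row)


# The suffix AGT of the first string overlaps the prefix AGT of the second.
print(overlap_score("CCCAGT", "AGTGGG"))
```

Recovering the overlap itself would start the trace-back at the cell that achieved the maximum and stop on reaching the first row or column.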
61
Different types of overlaps
[Figure: different ways in which x and y can overlap]
62
A non-bio variant

Shell command diff: compare two text files
  Given file1 and file2
  Find the difference between file1 and file2
Similar to sequence alignment
How to score?
  Longest common subsequence (LCS)
    Match has score 1
    No mismatch penalty
    No gap penalty

64
diff file1 file2
1c1
< A
---
> G
4c4
< D
---
> -

LCS = 4
65
The LCS variant

Changes
  Initialization
    For all i, j: F(i, 0) = F(0, j) = 0
  Filling in table
    $F(i, j) = \max \{\, F(i-1, j),\; F(i, j-1),\; F(i-1, j-1) + \sigma(x_i, y_j) \,\}$
    where $\sigma(x_i, y_j) = 1$ if $x_i = y_j$ and 0 otherwise
  Termination
    $F_{OPT} = \max \{\, \max_i F(i, N),\; \max_j F(M, j) \,\}$
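
A minimal sketch of the LCS variant is shown below. The function name `lcs_length` is an assumption, and the two six-line "files" are hypothetical contents chosen only to be consistent with the diff output shown earlier (lines 1 and 4 differ); the original files are not in the transcript.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two sequences."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]   # F(i, 0) = F(0, j) = 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = 1 if x[i - 1] == y[j - 1] else 0
            F[i][j] = max(F[i - 1][j], F[i][j - 1], F[i - 1][j - 1] + sigma)
    # Termination as stated on the slide. With no penalties the table never
    # decreases along a row or column, so this maximum is always F[M][N].
    return max(max(F[i][N] for i in range(M + 1)),
               max(F[M][j] for j in range(N + 1)))


# Works on any indexable sequences: lists of lines or plain strings.
file1 = ["A", "B", "C", "D", "E", "F"]   # hypothetical file contents
file2 = ["G", "B", "C", "-", "E", "F"]
print(lcs_length(file1, file2))
print(lcs_length("AGTA", "ATA"))
```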

66
More efficient algorithms

What happens if you have 1 million lines of text in each file?
  An O(mn) algorithm is too inefficient
  Memory inefficient
    1 TB of memory to store the matrix ($10^6 \times 10^6 = 10^{12}$ entries)
Bounded DP
  Maybe the majority of the two files are the same? (e.g., two versions of a piece of software)
Linear-space algorithm
  Same time complexity
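
One simple way to see why linear space is possible: each row of F depends only on the row before it, so the score can be computed with two rows. The sketch below is an assumption-laden illustration (hypothetical function name, m, s, d as positive magnitudes) and returns the score only. Recovering the alignment itself in linear space needs an additional divide-and-conquer step, as in Hirschberg's algorithm, which is not shown here.

```python
def nw_score_linear_space(x, y, m=1, s=1, d=1):
    """Global alignment score using O(min(|x|, |y|)) memory and O(|x||y|) time."""
    if len(y) > len(x):
        x, y = y, x                  # scoring is symmetric; keep rows short
    prev = [-j * d for j in range(len(y) + 1)]       # row i = 0
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0] * len(y)               # F(i, 0) = -i * d
        for j in range(1, len(y) + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            curr[j] = max(prev[j - 1] + sigma,       # diagonal
                          prev[j] - d,               # from the row above
                          curr[j - 1] - d)           # from the left
        prev = curr                                  # discard the older row
    return prev[len(y)]


print(nw_score_linear_space("AGTA", "ATA"))
```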