CS 5263 Bioinformatics - PowerPoint PPT Presentation

1 / 66
About This Presentation
Title:

CS 5263 Bioinformatics

Description:

... subproblems and optimal substructure ... Optimal substructures. Na ve algorithm: enumerate all possible paths and compare costs ... Optimal substructure ... – PowerPoint PPT presentation

Number of Views:70
Avg rating:3.0/5.0
Slides: 67
Provided by: jianhu
Learn more at: http://www.cs.utsa.edu
Category:

less

Transcript and Presenter's Notes

Title: CS 5263 Bioinformatics


1
CS 5263 Bioinformatics
  • Lecture 3 Dynamic Programming and Global
    Sequence Alignment

2
Evolution at the DNA level
C
ACGGTGCAGTCACCA
ACGTTGC-GTCCACCA
DNA evolutionary events (sequence
edits) Mutation, deletion, insertion
3
Sequence conservation implies function



next generation
OK



OK



OK



X



X



Still OK?



4
Why sequence alignment?
  • Conserved regions are more likely to be
    functional
  • Can be used for finding genes, regulatory
    elements, etc.
  • Similar sequences often have similar origin and
    function
  • Can be used to predict functions for new genes /
    proteins
  • Sequence alignment is one of the most widely used
    computational tools in biology

5
Global Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGG
TCGATTTGCCCGAC
S
T
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC-
-GACCGC--GGTCGATTTGCCCGAC
S
T
  • Definition
  • An alignment of two strings S, T is a pair of
    strings S, T (with spaces) s.t.
  • S T, and (S length of S)
  • removing all spaces in S, T leaves S, T

6
What is a good alignment?
  • Alignment
  • The best way to match the letters of one
    sequence with those of the other
  • How do we define best?

7
S -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- T
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
  • The score of aligning (characters or spaces) x
    y is s (x,y).
  • Score of an alignment
  • An optimal alignment one with max score

8
Scoring Function
  • Sequence edits
  • AGGCCTC
  • Mutations AGGACTC
  • Insertions AGGGCCTC
  • Deletions AGG-CTC
  • Scoring Function
  • Match m AAC
  • Mismatch -s A-A
  • Gap (indel) -d

9
  • Match 2, mismatch -1, gap -1
  • Score 3 x 2 2 x 1 1 x 1 3

10
More complex scoring function
  • Substitution matrix
  • Similarity score of matching two letters a, b
    should reflect the probability of a, b derived
    from same ancestor
  • It is usually defined by log likelihood ratio
    (Durbin book)
  • Active research area. Especially for proteins.
  • Commonly used PAM, BLOSUM

11
An example substitution matrix
12
How to find an optimal alignment?
  • A naïve algorithm
  • for all subseqs A of S, B of T s.t. A B do
  • align Ai with Bi, 1 i A
  • align all other chars to spaces
  • compute its value
  • retain the max
  • end
  • output the retained alignment

S abcd A cd T wxyz B xz -abc-d
a-bc-d w--xyz -w-xyz
13
Analysis
  • Assume S T n
  • Cost of evaluating one alignment n
  • How many alignments are there
  • pick n chars of S,T together
  • say k of them are in S
  • match these k to the k unpicked chars of T
  • Total time
  • E.g., for n 20, time is gt 240 gt1012 operations

14
  • Intro to Dynamic Programming

15
Dynamic programming
  • What is dynamic programming?
  • A method for solving problems exhibiting the
    properties of overlapping subproblems and optimal
    substructure
  • Key idea tabulating sub-problem solutions rather
    than re-computing them repeatedly
  • Two simple examples
  • Computing Fibonacci numbers
  • Find the special shortest path in a grid

16
Example 1 Fibonacci numbers
  • 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89,
  • F(0) 1
  • F(1) 1
  • F(n) F(n-1) f(n-2)
  • How to compute F(n)?

17
A recursive algorithm
  • function fib(n)
  • if (n 0 or n 1) return 1
  • else return fib(n-1) fib(n-2)

18
n/2
n
  • Time complexity
  • Between 2n/2 and 2n
  • O(1.62n), i.e. exponential
  • Why recursive Fib algorithm is inefficient?
  • Overlapping subproblems

19
An iterative algorithm
  • function fib(n)
  • F0 1 F1 1
  • for i 2 to n
  • Fi Fi-1 Fi-2
  • Return Fn

Time complexity Time O(n), space O(n)
20
Example 2 shortest path in a grid
S
m
G
n
Each edge has a length (cost). We need to get to
G from S. Can only move right or down. Aim find
a path with the minimum total length
21
Optimal substructures
  • Naïve algorithm enumerate all possible paths and
    compare costs
  • Exponential number of paths
  • Key observation
  • If a path P(S, G) is the shortest from S to G,
    any of its sub-path P(S,x), where x is on P(S,G),
    is the shortest from S to x

22
Proof
  • Proof by contradiction
  • If the path between P(S,x) is not the shortest,
    i.e., P(S,x) lt P(S,x)
  • Construct a new path P(S,G) P(S,x) P(x, G)
  • P(S,G) lt P(S,G) gt P(S,G) is not the shortest
  • Contradiction
  • Therefore, P(S, x) is the shortest

S
x
G
23
Recursive solution
(0,0)
  • Index each intersection by two indices, (i, j)
  • Let F(i, j) be the total length of the shortest
    path from (0, 0) to (i, j). Therefore, F(m, n) is
    the shortest path we wanted.
  • To compute F(m, n), we need to compute both
    F(m-1, n) and F(m, n-1)

m
(m, n)
n
F(m-1, n) length((m-1, n), (m, n)) F(m,
n) min F(m, n-1) length((m,
n-1), (m, n))
24
Recursive solution
F(i-1, j) length((i-1, j), (i, j)) F(i, j)
min F(i, j-1) length((i, j-1), (i, j))
(0,0)
  • But if we use recursive call, many subpaths will
    be recomputed for many times
  • Strategy pre-compute F values starting from the
    upper-left corner. Fill in row by row (what other
    order will also do?)

(i-1, j)
(i, j)
(i, j-1)
m
(m, n)
n
25
Dynamic programming illustration
S
9
1
2
3
3
12
13
15
0
5
3
3
3
3
2
5
2
3
6
8
13
15
5
2
3
3
9
3
4
2
3
2
9
7
11
13
16
6
2
3
7
4
6
3
3
3
11
14
17
20
13
4
6
3
1
3
2
3
2
1
17
17
18
20
17
G
F(i-1, j) length(i-1, j, i, j) F(i, j)
min F(i, j-1) length(i, j-1, i, j)
26
Trackback
9
1
2
3
3
12
13
15
0
5
3
3
3
3
2
5
2
3
6
8
13
15
5
2
3
3
9
3
4
2
3
2
9
7
11
13
16
6
2
3
7
4
6
3
3
3
11
14
17
20
13
4
6
3
1
3
2
3
2
1
17
17
18
20
17
27
Elements of dynamic programming
  • Optimal sub-structures
  • Optimal solutions to the original problem
    contains optimal solutions to sub-problems
  • Overlapping sub-problems
  • Some sub-problems appear in many solutions
  • Memorization and reuse
  • Carefully choose the order that sub-problems are
    solved

28
Dynamic Programming for sequence alignment
  • Suppose we wish to align
  • x1xM
  • y1yN
  • Let F(i,j) optimal score of aligning
  • x1xi
  • y1yj
  • Scoring Function
  • Match m
  • Mismatch -s
  • Gap (indel) -d

29
Optimal substructure
  • If xi is aligned to yj in the optimal
    alignment between x1..M and y1..N, then
  • The alignment between x1..i and y1..j is also
    optimal
  • Easy to prove by contradiction

30
Recursive formula
  • Notice three possible cases
  • xM aligns to yN
  • xM
  • yN
  • 2. xM aligns to a gap
  • xM
  • ?
  • yN aligns to a gap
  • ?
  • yN

m, if xM yN F(M,N)
F(M-1, N-1) -s, if not
F(M,N) F(M-1, N) - d
F(M,N) F(M, N-1) - d
31
Recursive formula
  • Generalize
  • F(i-1, j-1) ?(Xi,Yj)
  • F(i,j) max F(i-1, j) d
  • F(i, j-1) d
  • ?(Xi,Yj) m if Xi Yj, and s otherwise
  • Boundary conditions
  • F(0, 0) 0.
  • F(0, j) ?
  • F(i, 0) ?

-jd y1..j aligned to gaps.
-id x1..i aligned to gaps.
32
What order to fill?
33
What order to fill?
34
Example
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
j 0
1
2
3
35
Example
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
j 0
1
2
3
36
Example
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
j 0
1
2
3
37
Example
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
j 0
1
2
3
38
Example
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
Optimal Alignment F(4,3) 2
j 0
1
2
3
39
Example
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
Optimal Alignment F(4,3) 2 This only tells us
the best score
j 0
1
2
3
40
Trace-back
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
j 0
1
2
3
41
Trace-back
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
j 0
1
2
3
42
Trace-back
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
j 0
1
2
3
43
Trace-back
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
j 0
1
2
3
44
Trace-back
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
Optimal Alignment F(4,3) 2 AGTA A?TA
j 0
1
2
3
45
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
j 0
1
2
3
46
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
j 0
1
2
3
47
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
j 0
1
2
3
48
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
j 0
1
2
3
49
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
j 0
1
2
3
50
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
j 0
1
2
3
51
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
j 0
1
2
3
52
Using trace-back pointers
  • x AGTA m 1
  • y ATA s 1
  • d 1

F(i,j) i 0 1 2 3 4
Optimal Alignment F(4,3) 2 AGTA A?TA
j 0
1
2
3
53
The Needleman-Wunsch Algorithm
  • Initialization.
  • F(0, 0) 0
  • F(0, j) - j ? d
  • F(i, 0) - i ? d
  • Main Iteration. Filling in scores
  • For each i 1M
  • For each j 1N
  • F(i-1,j) d case 1
  • F(i, j) max F(i, j-1) d
    case 2
  • F(i-1, j-1) s(xi, yj) case 3
  • UP, if case 1
  • Ptr(i,j) LEFT if case 2
  • DIAG if case 3
  • Termination. F(M, N) is the optimal score, and
    from Ptr(M, N) can trace back optimal alignment

54
Performance
  • Time
  • O(NM)
  • Space
  • O(NM)
  • Later we will cover more efficient methods

55
Equivalent graph problem
S1
G
A
T
A
(0,0)
? a gap in the 2nd sequence ? a gap in the 1st
sequence match / mismatch
1
1
A
S2
1
T
Value on vertical/horizontal line -d Value on
diagonal m or -s
1
1
A
(3,4)
  • Number of steps length of the alignment
  • Path length alignment score
  • Optimal alignment find the longest path from (0,
    0) to (3, 4)
  • General longest path problem cannot be found with
    DP. Longest path on this graph can be found by DP
    since no cycle is possible.

56
Question
  • If we change the scoring scheme, will the optimal
    alignment be changed?
  • Old Match 1, mismatch gap -1
  • New match 2, mismatch gap 0
  • New Match 2, mismatch gap -2?

57
Question
  • What kind of alignment is represented by these
    paths?

A
A
A
A
A
B C
B C
B C
B C
B C
A- BC
A-- -BC
--A BC-
-A- B-C
-A BC
Alternating gaps are impossible if s gt -2d
58
A variant of the basic algorithm
  • Scoring scheme m s d 1
  • Seq1 CAGCA-CTTGGATTCTCGG
  • Seq2 ---CAGCGTGG--------
  • Seq1 CAGCACTTGGATTCTCGG
  • Seq2 CAGC-----G-T----GG
  • The first alignment may be biologically more
    realistic

Score -7
Score -2
59
A variant of the basic algorithm
  • Maybe it is OK to have an unlimited of gaps in
    the beginning and end

----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC GCG
AGTTCATCTATCAC--GACCGC--GGTCG--------------
  • Then, we dont want to penalize gaps in the ends

60
The Overlap Detection variant
  • Changes
  • Initialization
  • For all i, j,
  • F(i, 0) 0
  • F(0, j) 0
  • Termination
  • maxi F(i, N)
  • FOPT max maxj F(M, j)

x1 xM
yN y1
61
Different types of overlaps
x
x
y
y
62
A non-bio variant
  • Shell command diff Compare two text files
  • Given file1 and file2
  • Find the difference between file1 and file2
  • Similar to sequence alignment
  • How to score?
  • Longest common subsequence (LCS)
  • Match has score 1
  • No mismatch penalty
  • No gap penalty

63
(No Transcript)
64
diff file1 file2 1c1 lt A --- gt G 4c4 lt D --- gt -
LCS 4
65
The LCS variant
  • Changes
  • Initialization
  • For all i, j, F(i, 0) F(0, j) 0
  • Filling in table
  • F(i-1,j)
  • F(i, j) max F(i, j-1)
  • F(i-1, j-1) s(xi, yj)
  • where s(xi, yj) 1 if xi yj and 0 otherwise.
  • Termination
  • maxi F(i, N)
  • FOPT max
  • maxj F(M, j)

66
More efficient algorithms
  • What happens if you have 1 million lines of text
    in each file?
  • O(mn) algorithm is too inefficient
  • Memory inefficient
  • 1 TB memory to store the matrix
  • Bounded DP
  • maybe the majority of the two files are the same?
    (e.g., two versions of a software)
  • Linear-space algorithm
  • same time complexity
Write a Comment
User Comments (0)
About PowerShow.com