CS 5263 Bioinformatics

Slides: 48

Provided by: jianhu

Learn more at: http://www.cs.utsa.edu

1
CS 5263 Bioinformatics

Lecture 4: Local Sequence Alignment, More Efficient Sequence Alignment Algorithms

2
Roadmap

Review of last lecture
Local sequence alignment
More efficient sequence alignment algorithms

Given a scoring scheme:
Match: m
Mismatch: -s
Gap: -d
we can easily compute an optimal alignment by dynamic programming.

Look at any column of an alignment between two sequences X = x_1 x_2 ... x_M and Y = y_1 y_2 ... y_N.
Only three cases are possible:
x_i is aligned to y_j
x_i is aligned to a gap
y_j is aligned to a gap

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + \sigma(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$$
where \sigma(x_i, y_j) = m if x_i = y_j and -s otherwise.
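
The recurrence above can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function name `needleman_wunsch`, the default scores m = s = d = 1, and the tie-breaking order in the traceback (diagonal first, then gap in Y, then gap in X) are all assumptions made for the example.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Global alignment: return (score, aligned_x, aligned_y).

    Scoring follows the slide: match +m, mismatch -s, gap -d.
    The default values are illustrative assumptions.
    """
    M, N = len(x), len(y)
    # F[i][j] = best score for aligning x[:i] with y[:j]
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d          # x[:i] aligned entirely to gaps
    for j in range(1, N + 1):
        F[0][j] = -j * d          # y[:j] aligned entirely to gaps
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sigma,   # x_i aligned to y_j
                          F[i - 1][j] - d,           # x_i aligned to a gap
                          F[i][j - 1] - d)           # y_j aligned to a gap
    # Trace back from (M, N) to (0, 0) to recover one optimal alignment.
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sigma = m if x[i - 1] == y[j - 1] else -s
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sigma:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    print(needleman_wunsch("AGTA", "ATA"))
```

With these assumed scores, aligning AGTA with ATA needs one gap and keeps three matches, so the optimal score is 3 - 1 = 2.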
7
Example
(DP table F(i, j) for i = 0..3, j = 0..4; cell values not transcribed.)
Resulting alignment:
AGTA
A-TA
8
Equivalent graph problem
(Figure: grid graph over sequences S1 and S2, with source node (0,0) and sink node (3,4).)
Edge types: a gap in the 2nd sequence, a gap in the 1st sequence, match / mismatch (diagonal).
Value on a vertical/horizontal edge: -d. Value on a diagonal edge: m or -s.

Number of steps = length of the alignment
Path length = alignment score
Optimal alignment = the longest path from (0, 0) to (3, 4)
The general longest-path problem cannot be solved by DP, but the longest path on this graph can, since no cycle is possible.

9
Variants of the Needleman-Wunsch algorithm

LCS (longest common subsequence)
  No penalty for gaps or mutations
  Change the score function
Overlapping variants
  No penalty for starting/ending gaps
  Change the initial / termination step
Other variants
  cDNA-genome alignment
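
As an illustration of the LCS variant, the sketch below reuses the global alignment recurrence with only the score function changed: match = 1, mismatch = 0, gap = 0. The function name `lcs_length` is an assumption made for this example.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y.

    Same DP as global alignment, with an assumed score function
    match = +1, mismatch = 0, gap = 0 (no penalty for gaps or mutations).
    """
    M, N = len(x), len(y)
    # Row 0 and column 0 stay at 0 because gaps cost nothing.
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = 1 if x[i - 1] == y[j - 1] else 0
            F[i][j] = max(F[i - 1][j - 1] + sigma,  # match or free mismatch
                          F[i - 1][j],              # free gap
                          F[i][j - 1])              # free gap
    return F[M][N]


if __name__ == "__main__":
    print(lcs_length("AGTA", "ATA"))
```

Because nothing is penalized, the optimal score simply counts the matched columns, which is the LCS length.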

10
Local alignment
11
The local alignment problem

Given two strings X = x_1 ... x_M, Y = y_1 ... y_N,
find substrings x', y' whose similarity (optimal global alignment value) is maximum.
e.g. X = abcxdex, x' = cxde
     Y = xxxcde,  y' = c-de (shown in aligned form)
12
Why local alignment

Conserved regions may be a small part of the
whole
Global alignment might miss them if flanking
junk outweighs similar regions
Genes are shuffled between genomes

(Figure: segments A, B, C, D appear in a different order in the two genomes.)
13
Naïve algorithm

for all substrings X' of X and Y' of Y
    Align X' and Y' via dynamic programming
    Retain the pair with max value
end
Output the retained pair
Time: O(n^2) choices for X', O(m^2) choices for Y', O(nm) for the DP, so O(n^3 m^3) total.
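
The naïve procedure can be written down directly, which makes its cost easy to see: four nested loops over substring boundaries, each running a full global alignment. This is a sketch under assumptions: the names `global_score` and `naive_local` are invented here, the default scores (match 2, mismatch -1, gap -1) are borrowed from the worked example later in the lecture, and the empty alignment with score 0 is taken as the baseline.

```python
def global_score(a, b, m=2, s=1, d=1):
    """Optimal global alignment score of a and b (score only)."""
    prev = [-j * d for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [-i * d]
        for j in range(1, len(b) + 1):
            sigma = m if a[i - 1] == b[j - 1] else -s
            curr.append(max(prev[j - 1] + sigma, prev[j] - d, curr[j - 1] - d))
        prev = curr
    return prev[-1]


def naive_local(x, y, m=2, s=1, d=1):
    """Try every pair of substrings; keep the best-scoring pair."""
    best = (0, "", "")  # assumed baseline: the empty alignment scores 0
    for i in range(len(x) + 1):
        for i2 in range(i, len(x) + 1):          # O(n^2) substrings of x
            for j in range(len(y) + 1):
                for j2 in range(j, len(y) + 1):  # O(m^2) substrings of y
                    score = global_score(x[i:i2], y[j:j2], m, s, d)
                    if score > best[0]:
                        best = (score, x[i:i2], y[j:j2])
    return best


if __name__ == "__main__":
    print(naive_local("abcxdex", "xxxcde"))
```

This is only practical for very short strings, which is the point of the slide.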

14
Reminder

The overlap detection algorithm
Gaps at the ends are not penalized.
(Figure: two overlapping sequences; the overhanging end on each side is labeled "Free gap".)
15
The local alignment idea

Do not penalize the unaligned regions (gaps or mismatches)
The alignment can start anywhere and end anywhere
Strategy: whenever we reach a low-similarity region (negative score), we restart a new alignment by resetting the alignment score to zero

16
The Smith-Waterman algorithm

Initialization: F(0, j) = F(i, 0) = 0

Iteration:
$$F(i, j) = \max \begin{cases} 0 \\ F(i-1, j) - d \\ F(i, j-1) - d \\ F(i-1, j-1) + \sigma(x_i, y_j) \end{cases}$$
17
The Smith-Waterman algorithm

Termination
If we want the best local alignment:
$$F_{OPT} = \max_{i,j} F(i, j)$$
If we want all local alignments scoring > t:
For all i, j with F(i, j) > t, trace back from (i, j)
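
A minimal Python sketch of the Smith-Waterman steps (zero initialization, the max-with-0 iteration, and termination at the best cell) is given below. The function name `smith_waterman`, the default scores (match 2, mismatch -1, gap -1, as in the lecture's worked example), and the traceback tie-breaking order are assumptions; only the single best alignment is reported, not all alignments above a threshold.

```python
def smith_waterman(x, y, m=2, s=1, d=1):
    """Local alignment: return (score, aligned_x, aligned_y)."""
    M, N = len(x), len(y)
    # Initialization: F(0, j) = F(i, 0) = 0
    F = [[0] * (N + 1) for _ in range(M + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(0,                         # restart the alignment
                          F[i - 1][j] - d,
                          F[i][j - 1] - d,
                          F[i - 1][j - 1] + sigma)
            if F[i][j] > best:                       # track F_OPT
                best, bi, bj = F[i][j], i, j
    # Trace back from the best cell until a zero cell is reached.
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        sigma = m if x[i - 1] == y[j - 1] else -s
        if F[i][j] == F[i - 1][j - 1] + sigma:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    print(smith_waterman("abcxdex", "xxxcde"))
```

For X = abcxdex and Y = xxxcde the best score is 5. Two alignments reach it (cxde / c-de and x-de / xcde); which one is returned depends on how ties are broken in the traceback.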

18-24
Example: Match = 2, Mismatch = -1, Gap = -1
(Step-by-step fill of the DP table not transcribed.)
25-26
Trace back
Two optimal local alignments:
cxde
c-de
and
x-de
xcde
27

No negative values in the local alignment DP array
An optimal local alignment will never have a gap at either end
Local alignment: Smith-Waterman
Global alignment: Needleman-Wunsch

28
Analysis

Time
  O(MN) for finding the best alignment
  Time to report all alignments depends on the number of sub-optimal alignments
Memory
  O(MN)
  O(M+N) possible

More efficient alignment algorithms

Given two sequences of length M, N
Time: O(MN) (ok)
Space: O(MN) (bad)
1 Mb seq x 1 Mb seq = 1000 G memory
Can we do better?

31
Bounded alignment

Good alignment should appear near the diagonal

32
Bounded Dynamic Programming

If we know that x and y are very similar
Assumption: # gaps(x, y) < k
Then x_i aligned to y_j implies |i - j| < k

33
Bounded Dynamic Programming

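Initialization
F(i, 0), F(0, j) undefined for i, j > k
Iteration
For i = 1..M
  For j = max(1, i - k)..min(N, i + k)
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + \sigma(x_i, y_j) \\ F(i, j-1) - d, & \text{if } j > i - k \\ F(i-1, j) - d, & \text{if } j < i + k \end{cases}$$
Termination: same

(Figure: DP matrix with x_1 ... x_M on one axis and y_1 ... y_N on the other; only a band of width k around the diagonal is filled.)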
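
The bounded iteration can be sketched as below. This is an illustrative sketch, not the lecture's implementation: the name `banded_global_score`, the default scores, and the use of a dictionary for the band are assumptions. Cells outside the band are treated as minus infinity, which has the same effect as the "if j > i - k" and "if j < i + k" conditions in the recurrence.

```python
NEG_INF = float("-inf")


def banded_global_score(x, y, k, m=1, s=1, d=1):
    """Global alignment score restricted to the band |i - j| <= k."""
    M, N = len(x), len(y)
    if abs(M - N) > k:
        raise ValueError("end cell (M, N) lies outside the band")
    # Only cells inside the band are stored, so space is O(kM).
    F = {(0, 0): 0}
    for i in range(1, min(M, k) + 1):
        F[(i, 0)] = -i * d            # F(i, 0) is undefined for i > k
    for j in range(1, min(N, k) + 1):
        F[(0, j)] = -j * d            # F(0, j) is undefined for j > k
    for i in range(1, M + 1):
        for j in range(max(1, i - k), min(N, i + k) + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[(i, j)] = max(
                F.get((i - 1, j - 1), NEG_INF) + sigma,
                F.get((i, j - 1), NEG_INF) - d,   # missing when j = i - k
                F.get((i - 1, j), NEG_INF) - d,   # missing when j = i + k
            )
    return F[(M, N)]


if __name__ == "__main__":
    print(banded_global_score("AGTA", "ATA", k=1))
```

The inner loop visits at most 2k + 1 cells per row, which gives the O(kM) running time. If the true optimal path leaves the band, the returned score is only a lower bound on the unrestricted optimum.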
34
Analysis

Time: O(kM) << O(MN)
Space: O(kM) with some tricks

(Figure: the diagonal band of the matrix with M rows is stored as an M x 2k array.)
36

Given two sequences of length M, N
Time: O(MN) (ok)
Space: O(MN) (bad)
1 Mb seq x 1 Mb seq = 1000 G memory
Can we do better?

37
Linear space algorithm

If all we need is the alignment score but not the alignment, easy!

We only need to keep two rows (or just one row, with a little trick)
But how do we get the alignment?
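
The two-row idea can be sketched as follows: row i of the DP table depends only on row i-1, so the rest of the table can be discarded. The function name `global_score_two_rows` and the default scores are assumptions for illustration.

```python
def global_score_two_rows(x, y, m=1, s=1, d=1):
    """Global alignment score using O(N) memory (two rows of the table)."""
    N = len(y)
    prev = [-j * d for j in range(N + 1)]      # row 0: F(0, j) = -j * d
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0] * N              # F(i, 0) = -i * d
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            curr[j] = max(prev[j - 1] + sigma,  # F(i-1, j-1) + sigma
                          prev[j] - d,          # F(i-1, j) - d
                          curr[j - 1] - d)      # F(i, j-1) - d
        prev = curr                             # row i becomes the old row
    return prev[N]


if __name__ == "__main__":
    print(global_score_two_rows("AGTA", "ATA"))
```

The score is exact, but the traceback pointers are gone, which is why recovering the alignment itself needs a further idea.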
38
Linear space algorithm

When we finish, we know how we have aligned the ends of the sequences (x_M and y_N)

Naïve idea: repeat on the smaller subproblem, e.g. F(M-1, N-1)
Time complexity: O((M+N)(MN))
39
(Figure: DP matrix from (0, 0) to (M, N) with the M/2-th row marked.)
Key observation: the optimal alignment (longest path) must use an intermediate point on the M/2-th row. Call it (M/2, k), where k is unknown.
40
(Figure: grid from (0,0) to (6,6) with candidate midpoints (3,0), (3,2), (3,4), (3,6) on row 3.)

Longest path from (0, 0) to (6, 6) is
$$\max_k \big( LP(0,0,3,k) + LP(6,6,3,k) \big)$$

41
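Hirschberg's idea

Divide and conquer!

Forward algorithm: align x_1 x_2 ... x_{M/2} with Y
F(M/2, k) represents the best alignment between x_1 x_2 ... x_{M/2} and y_1 y_2 ... y_k
(Figure: DP matrix with X on one axis and Y on the other; rows 0 to M/2 are filled.)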
42
Backward Algorithm
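Backward algorithm: align reverse(x_{M/2+1} ... x_M) with reverse(Y)
B(M/2, k) represents the best alignment between reverse(x_{M/2+1} ... x_M) and reverse(y_{k+1} ... y_N)
(Figure: DP matrix with X on one axis and Y on the other; rows M/2 to M are filled.)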
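
The forward and backward passes can be combined to locate the crossing point (M/2, k) in linear space, as sketched below. This shows only the split step, not the full recursive algorithm. The names `last_row` and `hirschberg_split` and the default scores are assumptions; when several k tie, the smallest is returned.

```python
def last_row(a, b, m=1, s=1, d=1):
    """Last DP row for aligning a with every prefix of b (two-row DP)."""
    prev = [-j * d for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [-i * d]
        for j in range(1, len(b) + 1):
            sigma = m if a[i - 1] == b[j - 1] else -s
            curr.append(max(prev[j - 1] + sigma, prev[j] - d, curr[j - 1] - d))
        prev = curr
    return prev


def hirschberg_split(x, y, m=1, s=1, d=1):
    """Return (k, score): the column where an optimal path crosses row M/2."""
    mid = len(x) // 2
    N = len(y)
    # fwd[k] = F(M/2, k): x_1..x_{M/2} aligned with y_1..y_k
    fwd = last_row(x[:mid], y, m, s, d)
    # bwd[t]: reversed second half of x aligned with the last t characters of y
    bwd = last_row(x[mid:][::-1], y[::-1], m, s, d)
    # B(M/2, k) = bwd[N - k]; choose k maximizing F(M/2, k) + B(M/2, k)
    k = max(range(N + 1), key=lambda c: fwd[c] + bwd[N - c])
    return k, fwd[k] + bwd[N - k]


if __name__ == "__main__":
    print(hirschberg_split("AGTA", "ATA"))
```

The returned score equals the optimal global alignment score, and the two subproblems (x_1..x_{M/2} with y_1..y_k, and the remaining halves) can then be solved independently in the same way.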
43
Linear-space alignment