CS 5263 Bioinformatics - PowerPoint PPT Presentation

1 / 47
About This Presentation
Title:

CS 5263 Bioinformatics

Description:

We can easily compute an optimal alignment by dynamic programming ... Global alignment might miss them if flanking 'junk' outweighs similar regions ... – PowerPoint PPT presentation

Number of Views:34
Avg rating:3.0/5.0
Slides: 48
Provided by: jianhu
Learn more at: http://www.cs.utsa.edu
Category:

less

Transcript and Presenter's Notes

Title: CS 5263 Bioinformatics


1
CS 5263 Bioinformatics
  • Lecture 4 Local Sequence Alignment, More
    Efficient Sequence Alignment Algorithms

2
Roadmap
  • Review of last lecture
  • Local sequence alignment
  • More efficient sequence alignment algorithms

3
  • Given a scoring scheme,
  • Match m
  • Mismatch -s
  • Gap -d
  • We can easily compute an optimal alignment by
    dynamic programming

4
  • Look at any column of an alignment between two
    sequences X x1x2xM, Y y1y1yN
  • Only three cases
  • xi is aligned to yj
  • xi is aligned to a gap
  • yj is aligned to a gap

F(i-1, j-1) ? (xi, yj) F(i, j) max
F(i-1, j) - d F(i, j-1) - d
5
(No Transcript)
6
(No Transcript)
7
Example
F(i,j) j 0 1 2 3 4
i 0
A A
G -
T T
A A
A A
G -
T T
A A
1
2
3
8
Equivalent graph problem
S1
G
A
T
A
(0,0)
? a gap in the 2nd sequence ? a gap in the 1st
sequence match / mismatch
1
1
A
S2
1
T
Value on vertical/horizontal line -d Value on
diagonal m or -s
1
1
A
(3,4)
  • Number of steps length of the alignment
  • Path length alignment score
  • Optimal alignment find the longest path from (0,
    0) to (3, 4)
  • General longest path problem cannot be found with
    DP. Longest path on this graph can be found by DP
    since no cycle is possible.

9
Variants of Needleman-Wunsch alg
  • LCS longest common subsequence
  • No penalty for gaps or mutations
  • Change score function
  • Overlapping variants
  • No penalty for starting/ending gaps
  • Change initial / termination step
  • Other variants
  • cDNA-genome alignment

10
Local alignment
11
The local alignment problem
  • Given two strings X x1xM,
  • Y y1yN
  • Find substrings x, y whose similarity (optimal
    global alignment value) is maximum
  • e.g. X abcxdex X cxde
  • Y xxxcde Y c-de

x
y
12
Why local alignment
  • Conserved regions may be a small part of the
    whole
  • Global alignment might miss them if flanking
    junk outweighs similar regions
  • Genes are shuffled between genomes

C
D
B
A
D
A
B
C
13
Naïve algorithm
  • for all substrings X of X and Y of Y
  • Align X Y via dynamic programming
  • Retain pair with max value
  • end
  • Output the retained pair
  • Time O(n2) choices for A, O(m2) for B, O(nm) for
    DP, so O(n3m3 ) total.

14
Reminder
  • The overlap detection algorithm
  • We do not give penalty to gaps in the ends

Free gap
Free gap
15
The local alignment idea
  • Do not penalize the unaligned regions (gaps or
    mismatches)
  • The alignment can start anywhere and ends
    anywhere
  • Strategy whenever we get to some low similarity
    region (negative score), we restart a new
    alignment
  • By resetting alignment score to zero

16
The Smith-Waterman algorithm
  • Initialization F(0, j) F(i, 0) 0
  • 0
  • F(i 1, j) d
  • F(i, j 1) d
  • F(i 1, j 1) ?(xi, yj)

Iteration F(i, j) max
17
The Smith-Waterman algorithm
  • Termination
  • If we want the best local alignment
  • FOPT maxi,j F(i, j)
  • If we want all local alignments scoring gt t
  • For all i, j find F(i, j) gt t, and trace back

18
Match 2 Mismatch -1 Gap -1
19
Match 2 Mismatch -1 Gap -1
20
Match 2 Mismatch -1 Gap -1
21
Match 2 Mismatch -1 Gap -1
22
Match 2 Mismatch -1 Gap -1
23
Match 2 Mismatch -1 Gap -1
24
Match 2 Mismatch -1 Gap -1
25
Trace back
Match 2 Mismatch -1 Gap -1
26
Trace back
Match 2 Mismatch -1 Gap -1
cxde c-de
x-de xcde
27
  • No negative values in local alignment DP array
  • Optimal local alignment will never have a gap on
    either end
  • Local alignment Smith-Waterman
  • Global alignment Needleman-Wunsch

28
Analysis
  • Time
  • O(MN) for finding the best alignment
  • Time to report all alignments depends on the
    number of sub-opt alignments
  • Memory
  • O(MN)
  • O(MN) possible

29
  • More efficient alignment algorithms

30
  • Given two sequences of length M, N
  • Time O(MN)
  • ok
  • Space O(MN)
  • bad
  • 1Mb seq x 1Mb seq 1000G memory
  • Can we do better?

31
Bounded alignment
  • Good alignment should appear near the diagonal

32
Bounded Dynamic Programming
  • If we know that x and y are very similar
  • Assumption gaps(x, y) lt k
  • xi
  • Then, implies i j lt k
  • yj

33
Bounded Dynamic Programming
  • Initialization
  • F(i,0), F(0,j) undefined for i, j gt k
  • Iteration
  • For i 1M
  • For j max(1, i k)min(N, ik)
  • F(i 1, j 1) ?(xi, yj)
  • F(i, j) max F(i, j 1) d, if j gt i k
  • F(i 1, j) d, if j lt i k
  • Termination same

x1 xM
yN y1
k
34
Analysis
  • Time O(kM) ltlt O(MN)
  • Space O(kM) with some tricks

gt
M
M
2k
2k
35
(No Transcript)
36
  • Given two sequences of length M, N
  • Time O(MN)
  • ok
  • Space O(MN)
  • bad
  • 1mb seq x 1mb seq 1000G memory
  • Can we do better?

37
Linear space algorithm
  • If all we need is the alignment score but not the
    alignment, easy!

We only need to keep two rows (You only need one
row, with a little trick)
But how do we get the alignment?
38
Linear space algorithm
  • When we finish, we know how we have aligned the
    ends of the sequences

XM YN
Naïve idea Repeat on the smaller subproblem
F(M-1, N-1) Time complexity O((MN)(MN))
39
(0, 0)
M/2
(M, N)
Key observation optimal alignment (longest path)
must use an intermediate point on the M/2-th row.
Call it (M/2, k), where k is unknown.
40
(0,0)
(3,2)
(3,4)
(3,6)
(3,0)
(6,6)
  • Longest path from (0, 0) to (6, 6) is max_k
    (LP(0,0,3,k) LP(6,6,3,k))

41
Hirschbergs idea
  • Divide and conquer!

Y
Forward algorithm Align x1x2xM/2 with Y
X
M/2
F(M/2, k) represents the best alignment between
x1x2xM/2 and y1y2yk
42
Backward Algorithm
Y
Backward algorithm Align reverse(xM/21xM) with
reverse(Y)
X
M/2
B(M/2, k) represents the best alignment between
reverse(xM/21xM) and reverse(ykyk1yN )
43
Linear-space alignment
  • Using 2 (4) rows of space, we can compute
  • for k 1N, F(M/2, k), B(M/2, k)

M/2
44
Linear-space alignment
  • Now, we can find k maximizing F(M/2, k) B(M/2,
    k)
  • Also, we can trace the path exiting column M/2
    from k

Conclusion In O(NM) time, O(N) space, we found
optimal alignment path at row M/2
45
Linear-space alignment
  • Iterate this procedure to the two sub-problems!

M/2
k
M/2
N-k
46
Analysis
  • Memory O(N) for computation, O(NM) to store the
    optimal alignment
  • Time
  • MN for first iteration
  • k M/2 (N-k) M/2 MN/2 for second

k
M/2
M/2
N-k
47
MN
MN/2
MN/4
MN MN/2 MN/4 MN/8 MN (1 ½ ¼
1/8 1/16 ) 2MN O(MN)
MN/8
Write a Comment
User Comments (0)
About PowerShow.com