Title: L3:%20Blast:%20Local%20Alignment%20and%20other%20flavors
1L3 Blast Local Alignment and other flavors
2An Example
T C A T - T G C A A
T C A T T G C A A
A1
A2
- Align sTCAT with tTGCAA
- Match Score 1
- Mismatch score -1, Indel Score -1
- Score A1?, Score A2?
3Sequence Alignment
- Recall Instead of computing the optimum
alignment, we are computing the score of the
optimum alignment - Let Si,j denote the score of the optimum
alignment of the prefix s1..i and t 1..j
4An O(nm) algorithm for score computation
For i 1 to n For j 1 to m
- The iteration ensures that all values on the
right are computed in earlier steps.
5Base case (Initialization)
6A tableaux approach
t
1
j
n
1
s
Si-1,j
Si-1,j-1
i
Si,j-1
Si,j
n
Cell (i,j) contains the score Si,j. Each cell
only looks at 3 neighboring cells
7Alignment Table
T G C A
A
0 -1 -2 -3 -4 -5
-1 1 0 -1 -2 -3
-2 0 0 1 0 -1
-3 -1 -1 0 2 1
-4 -2 -2 -1 1 1
T
C
A
T
8Alignment Table
- S4,5 1 is the score of an optimum alignment
- Therefore, A2 is an optimum alignment
- We know how to obtain the optimum Score. How do
we get the best alignment?
T G C A A
T
C
A
T
9Computing Optimum Alignment
- At each cell, we have 3 choices
- We maintain additional information to record the
choice at each step.
For i 1 to n For j 1 to m
j-1
j
If (Si,j Si-1,j-1 C(si,tj)) Mi,j
i-1
If (Si,j Si-1,j C(si,-)) Mi,j
If (Si,j Si,j-1 C(-,tj) ) Mi,j
i
10Computing Optimal Alignments
T G C A
A
T
C
A
T
11Retrieving Opt.Alignment
- M4,5
- Implies that S4,5S3,4C(A,T)
- or
1 2 3 4 5
T G C A A
A
T
1
T
C
2
M3,4 Implies that S3,4S2,3 C(A,A)
or
A
3
T
4
A
A
T
A
12Retrieving Opt.Alignment
- M2,3
- Implies that S2,3S1,2C(C,C)
- or
1 2 3 4 5
T G C A A
A
A
C
T
1
T
A
C
C
2
M1,2 Implies that S1,2S1,1 C(-,G)
or
A
3
T
4
A
A
C
-
T
T
A
C
G
T
13Algorithm to retrieve optimal alignment
- RetrieveAl(i,j)
- if (Mi,j \)
- return (RetrieveAl (i-1,j-1) . )
- else if (Mi,j )
- return (RetrieveAl (i-1,j) . )
si
tj
si
-
else if (Mi,j --) return
(RetrieveAl (i,j-1) . )
-
tj
14Summary
- An optimal alignment of strings of length n and m
can be computed in O(nm) time - There is a tight connection between computation
of optimal score, and computation of opt.
Alignment - True for all DP based solutions
15Global versus Local Alignment
- Consider
- s ACCACCCCTT
- t ATCCCCACAT
ACCACCCCTT
ACCACCCCT T
A TCCCCACAT
ATCCCCACAT
16Blast Outputs Local Alignment
Schematic
26
422
query
db
19
405
17Local Alignment
- Compute maximum scoring interval over all
sub-intervals (a,b), and (a,b) - How can we compute this efficiently?
a
b
a
b
18Local Alignment
- Recall that in global alignment, we compute the
best score for all prefix pairs s(1..i) and
t(1..j). - Instead, compute the best score for all
sub-alignments that end in s(i) and t(j). - What changes in the recurrence?
a
i
a
j
19Local Alignment
j
j-1
- The original recurrence still works, except when
the optimum score Si,j is negative - When Si,j lt0, it means that the optimum local
alignment cannot include the point (i,j). - So, we must reset the score to 0.
i-1
i
si
tj
20Local Alignment Trick (Smith-Waterman algorithm)
How can we compute the local alignment itself?
21Generalizing Gap Cost
- It is more likely for gaps to be contiguous
- The penalty for a gap of length l should be
22Using affine gap penalties
- What is the time taken for this?
- What are the values that l can take?
- Can we get rid of the extra Dimension?
23Affine gap penalties
- Define Di,j Score of the best alignment,
given that the final column is a deletion (si
is aligned to a gap) - Define Ii,j Score of the best alignment, given
that the final column is an insertion (tj is
aligned to a gap)
-
tj
24O(nm) solution for affine gap costs
25Alignment Space?
- How much space to we need?
- What if the query and database are each 1Mbp?
26Alignment (Linear Space)
For i 1 to n For j 1 to m
27Linear Space Alignment
- In Linear Space, we can do each row of the D.P.
- We need to compute the optimum path from the
origin (0,0) to (m,n)
28Linear Space (contd)
- At in/2, we know scores of all the optimal paths
ending at that row. - Define Fj Sn/2,j
- One of these j is on the true path. Which one?
29Backward alignment
- Let Si,j be the optimal score of aligning
si..n with tj..m - Define Bj Sn/2,j
- One of these j is on the true path. Which one?
30Forward, Backward computation
- At the optimal coordinate, j
- FjBjSn,m
- In O(nm) time, and O(m) space, we can compute one
of the coordinates on the optimum path.
31Linear Space Alignment
- Align(1..n,1..m)
- For all 1ltj lt m
- Compute FjS(n/2,j)
- For all 1ltj lt m
- Compute BjSb(n/2,j)
- j maxj FjBj
- X Align(1..n/2,1..j)
- Y Align(n/2..n,j..m)
- Return X,j,Y
32Linear Space complexity
- T(nm) c.nm T(nm/2) O(nm)
- Space O(m)
33Summary
- We considered the basics of sequence alignment
- Opt score computation
- Reconstructing alignments
- Local alignments
- Affine gap costs
- Space saving measures
- Can we recreate Blast?
34Blast and local alignment
- Concatenate all of the database sequences to form
one giant sequence. - Do local alignment computation with the query.
35Large database search
Database size n100M, Querysize m1000. O(nm)
1011 computations
36Why is Blast Fast?
37Silly Question!
- True or False No two people in new york city
have the same number of hair
38Observations
- Much of the database is random from the querys
perspective - Consider a random DNA string of length n.
- PrAPrC PrGPrT0.25
- Assume for the moment that the query is all As
(length m). - What is the probability that an exact match to
the query can be found?
39Basic probability
- Probability that there is a match starting at a
fixed position i 0.25m - What is the probability that some position i has
a match. - Dependencies confound probability estimates.
40Basic ProbabilityExpectation
- Q Toss a coin each time it comes up heads, you
get a dollar - What is the money you expect to get after n
tosses? - Let Xi be the amount earned in the i-th toss
41Expected number of matches
- Expected number of matches can still be computed.
i
- Let Xi1 if there is a match starting at
position i, Xi0 otherwise
- Expected number of matches
42Expected number of exact Matches is small!
- Expected number of matches n0.25m
- If n107, m10,
- Then, expected number of matches 9.537
- If n107, m11
- expected number of hits 2.38
- n107,m12,
- Expected number of hits 0.5 lt 1
- Bottom Line An exact match to a substring of the
query is unlikely just by chance.
43Observation 2
- What is the pigeonhole principle?
44Why is this important?
- Suppose we are looking for sequences that are 80
identical to the query sequence of length 100. - Assume that the mismatches are randomly
distributed. - What is the probability that there is no stretch
of 10 bp, where the query and the subject match
exactly? - Rough calculations show that it is very low.
Exact match of a short query substring to a truly
similar subject is very high. - The above equation does not take dependencies
into account - Reality is better because the matches are not
randomly distributed
45Just the Facts
- Consider the set of all substrings of the query
string of fixed length W. - Prob. of exact match to a random database string
is very low. - Prob. of exact match to a true homolog is very
high. - Keyword Search (exact matches) is MUCH faster
than sequence alignment
46BLAST
Database (n)
- Consider all (m-W) query words of size W (Default
11) - Scan the database for exact match to all such
words - For all regions that hit, extend using a dynamic
programming alignment. - Can be many orders of magnitude faster than SW
over the entire string
47Why is BLAST fast?
- Assume that keyword searching does not consume
any time and that alignment computation the
expensive step. - Query m1000, random Db n107, no TP
- SW O(nm) 1000107 1010 computations
- BLAST, W11
- E(11-mer hits) 1000 (1/4)11 1072384
- Number of computations 23841001002.384107
- Ratio1010/(2.384107)420
- Further speed improvements are possible
48Keyword Matching
- How fast can we match keywords?
- Hash table/Db index? What is the size of the hash
table, for m11 - Suffix trees? What is the size of the suffix
trees? - Trie based search. We will do this in class.
AATCA
567
49Related notes
- How to choose the alignment region?
- Extend greedily until the score falls below a
certain threshold - What about protein sequences?
- Default word size 3, and mismatches are
allowed. - Like sequences, BLAST has been evolving
continuously - Banded alignment
- Seed selection
- Scanning for exact matches, keyword search versus
database indexing
50P-value computation
- How significant is a score? What happens to
significance when you change the score function - A simple empirical method
- Compute a distribution of scores against a random
database. - Use an estimate of the area under the curve to
get the probability. - OR, fit the distribution to one of the standard
distributions.
51Z-scores for alignment
- Initial assumption was that the scores followed a
normal distribution. - Z-score computation
- For any alignment, score S, shuffle one of the
sequences many times, and recompute alignment.
Get mean and standard deviation - Look up a table to get a P-value
52Blast E-value
- Initial (and natural) assumption was that scores
followed a Normal distribution - 1990, Karlin and Altschul showed that ungapped
local alignment scores follow an exponential
distribution - Practical consequence
- Longer tail.
- Previously significant hits now not so
significant
53Exponential distribution
- Random Database, Pr(1) p
- What is the expected number of hits to a sequence
of k 1s - Instead, consider a random binary Matrix.
Expected of diagonals of k 1s
54- As you increase k, the number decreases
exponentially. - The number of diagonals of k runs can be
approximated by a Poisson process - In ungapped alignments, we replace the coin
tosses by column scores, but the behaviour does
not change (Karlin Altschul). - As the score increases, the number of alignments
that achieve the score decreases exponentially
55Blast E-value
- Choose a score such that the expected score
between a pair of residues lt 0 - Expected number of alignments with a particular
score - For small values, E-value and P-value are the
same
56Blast Variants
- What is mega-blast?
- What is discontiguous mega-blast?
- Phi-Blast/Psi-Blast?
- BLAT?
- PatternHunter?
Longer seeds. Seeds with dont care
values Later Database pre-processing Seeds with
dont care values
57Keyword Matching
P O T A S T P O T A T O
O
T
A
T
O
T
U
I
S
A
E
58(No Transcript)