L3:%20Blast:%20Local%20Alignment%20and%20other%20flavors - PowerPoint PPT Presentation

About This Presentation

Title:

L3:%20Blast:%20Local%20Alignment%20and%20other%20flavors

Description:

L3: Blast: Local Alignment and other flavors – PowerPoint PPT presentation

Number of Views:143

Avg rating:3.0/5.0

Slides: 59

Provided by: Vine88

Learn more at: https://cseweb.ucsd.edu

Category:

more less

Transcript and Presenter's Notes

Title: L3:%20Blast:%20Local%20Alignment%20and%20other%20flavors

1
L3 Blast Local Alignment and other flavors
2
An Example
T C A T - T G C A A
T C A T T G C A A
A1
A2

Align sTCAT with tTGCAA
Match Score 1
Mismatch score -1, Indel Score -1
Score A1?, Score A2?

3
Sequence Alignment

Recall Instead of computing the optimum
alignment, we are computing the score of the
optimum alignment
Let Si,j denote the score of the optimum
alignment of the prefix s1..i and t 1..j

4
An O(nm) algorithm for score computation
For i 1 to n For j 1 to m

The iteration ensures that all values on the
right are computed in earlier steps.

5
Base case (Initialization)
6
A tableaux approach
t
1
j
n
1
s
Si-1,j
Si-1,j-1
i
Si,j-1
Si,j
n
Cell (i,j) contains the score Si,j. Each cell
only looks at 3 neighboring cells
7
Alignment Table
T G C A
A
0 -1 -2 -3 -4 -5
-1 1 0 -1 -2 -3
-2 0 0 1 0 -1
-3 -1 -1 0 2 1
-4 -2 -2 -1 1 1
T
C
A
T
8
Alignment Table

S4,5 1 is the score of an optimum alignment
Therefore, A2 is an optimum alignment
We know how to obtain the optimum Score. How do
we get the best alignment?

T G C A A
T
C
A
T
9
Computing Optimum Alignment

At each cell, we have 3 choices
We maintain additional information to record the
choice at each step.

For i 1 to n For j 1 to m
j-1
j
If (Si,j Si-1,j-1 C(si,tj)) Mi,j

i-1
If (Si,j Si-1,j C(si,-)) Mi,j
If (Si,j Si,j-1 C(-,tj) ) Mi,j
i
10
Computing Optimal Alignments
T G C A
A
T
C
A
T
11
Retrieving Opt.Alignment

M4,5
Implies that S4,5S3,4C(A,T)
or

1 2 3 4 5
T G C A A
A
T
1
T
C
2
M3,4 Implies that S3,4S2,3 C(A,A)
or
A
3
T
4
A
A
T
A
12
Retrieving Opt.Alignment

M2,3
Implies that S2,3S1,2C(C,C)
or

1 2 3 4 5
T G C A A
A
A
C
T
1
T
A
C
C
2
M1,2 Implies that S1,2S1,1 C(-,G)
or
A
3
T
4
A
A
C
-
T
T
A
C
G
T
13
Algorithm to retrieve optimal alignment

RetrieveAl(i,j)
if (Mi,j \)
return (RetrieveAl (i-1,j-1) . )
else if (Mi,j )
return (RetrieveAl (i-1,j) . )

si
tj
si
-
else if (Mi,j --) return
(RetrieveAl (i,j-1) . )
-
tj
14
Summary

An optimal alignment of strings of length n and m
can be computed in O(nm) time
There is a tight connection between computation
of optimal score, and computation of opt.
Alignment
True for all DP based solutions

15
Global versus Local Alignment

Consider
s ACCACCCCTT
t ATCCCCACAT

ACCACCCCTT
ACCACCCCT T
A TCCCCACAT
ATCCCCACAT
16
Blast Outputs Local Alignment
Schematic
26
422
query
db
19
405
17
Local Alignment

Compute maximum scoring interval over all
sub-intervals (a,b), and (a,b)
How can we compute this efficiently?

a
b
a
b
18
Local Alignment

Recall that in global alignment, we compute the
best score for all prefix pairs s(1..i) and
t(1..j).
Instead, compute the best score for all
sub-alignments that end in s(i) and t(j).
What changes in the recurrence?

a
i
a
j
19
Local Alignment
j
j-1

The original recurrence still works, except when
the optimum score Si,j is negative
When Si,j lt0, it means that the optimum local
alignment cannot include the point (i,j).
So, we must reset the score to 0.

i-1
i
si
tj
20
Local Alignment Trick (Smith-Waterman algorithm)
How can we compute the local alignment itself?
21
Generalizing Gap Cost

It is more likely for gaps to be contiguous
The penalty for a gap of length l should be

22
Using affine gap penalties

What is the time taken for this?
What are the values that l can take?
Can we get rid of the extra Dimension?

23
Affine gap penalties

Define Di,j Score of the best alignment,
given that the final column is a deletion (si
is aligned to a gap)
Define Ii,j Score of the best alignment, given
that the final column is an insertion (tj is
aligned to a gap)

-
tj
24
O(nm) solution for affine gap costs
25
Alignment Space?

How much space to we need?
What if the query and database are each 1Mbp?

26
Alignment (Linear Space)

Score computation

For i 1 to n For j 1 to m
27
Linear Space Alignment

In Linear Space, we can do each row of the D.P.
We need to compute the optimum path from the
origin (0,0) to (m,n)

28
Linear Space (contd)

At in/2, we know scores of all the optimal paths
ending at that row.
Define Fj Sn/2,j
One of these j is on the true path. Which one?

29
Backward alignment

Let Si,j be the optimal score of aligning
si..n with tj..m
Define Bj Sn/2,j
One of these j is on the true path. Which one?

30
Forward, Backward computation

At the optimal coordinate, j
FjBjSn,m
In O(nm) time, and O(m) space, we can compute one
of the coordinates on the optimum path.

31
Linear Space Alignment

Align(1..n,1..m)
For all 1ltj lt m
Compute FjS(n/2,j)
For all 1ltj lt m
Compute BjSb(n/2,j)
j maxj FjBj
X Align(1..n/2,1..j)
Y Align(n/2..n,j..m)
Return X,j,Y

32
Linear Space complexity

T(nm) c.nm T(nm/2) O(nm)
Space O(m)

33
Summary

We considered the basics of sequence alignment
Opt score computation
Reconstructing alignments
Local alignments
Affine gap costs
Space saving measures
Can we recreate Blast?

34
Blast and local alignment

Concatenate all of the database sequences to form
one giant sequence.
Do local alignment computation with the query.

35
Large database search
Database size n100M, Querysize m1000. O(nm)
1011 computations
36
Why is Blast Fast?
37
Silly Question!

True or False No two people in new york city
have the same number of hair

38
Observations

Much of the database is random from the querys
perspective
Consider a random DNA string of length n.
PrAPrC PrGPrT0.25
Assume for the moment that the query is all As
(length m).
What is the probability that an exact match to
the query can be found?

39
Basic probability

Probability that there is a match starting at a
fixed position i 0.25m
What is the probability that some position i has
a match.
Dependencies confound probability estimates.

40
Basic ProbabilityExpectation

Q Toss a coin each time it comes up heads, you
get a dollar
What is the money you expect to get after n
tosses?
Let Xi be the amount earned in the i-th toss

41
Expected number of matches

Expected number of matches can still be computed.

Let Xi1 if there is a match starting at
position i, Xi0 otherwise

Expected number of matches

42
Expected number of exact Matches is small!

Expected number of matches n0.25m
If n107, m10,
Then, expected number of matches 9.537
If n107, m11
expected number of hits 2.38
n107,m12,
Expected number of hits 0.5 lt 1
Bottom Line An exact match to a substring of the
query is unlikely just by chance.

43
Observation 2

What is the pigeonhole principle?

44
Why is this important?

Suppose we are looking for sequences that are 80
identical to the query sequence of length 100.
Assume that the mismatches are randomly
distributed.
What is the probability that there is no stretch
of 10 bp, where the query and the subject match
exactly?
Rough calculations show that it is very low.
Exact match of a short query substring to a truly
similar subject is very high.
The above equation does not take dependencies
into account
Reality is better because the matches are not
randomly distributed

45
Just the Facts

Consider the set of all substrings of the query
string of fixed length W.
Prob. of exact match to a random database string
is very low.
Prob. of exact match to a true homolog is very
high.
Keyword Search (exact matches) is MUCH faster
than sequence alignment

46
BLAST
Database (n)

Consider all (m-W) query words of size W (Default
11)
Scan the database for exact match to all such
words
For all regions that hit, extend using a dynamic
programming alignment.
Can be many orders of magnitude faster than SW
over the entire string

47
Why is BLAST fast?

Assume that keyword searching does not consume
any time and that alignment computation the
expensive step.
Query m1000, random Db n107, no TP
SW O(nm) 1000107 1010 computations
BLAST, W11
E(11-mer hits) 1000 (1/4)11 1072384
Number of computations 23841001002.384107
Ratio1010/(2.384107)420
Further speed improvements are possible

48
Keyword Matching

How fast can we match keywords?
Hash table/Db index? What is the size of the hash
table, for m11
Suffix trees? What is the size of the suffix
trees?
Trie based search. We will do this in class.

AATCA
567
49
Related notes

How to choose the alignment region?
Extend greedily until the score falls below a
certain threshold
What about protein sequences?
Default word size 3, and mismatches are
allowed.
Like sequences, BLAST has been evolving
continuously
Banded alignment
Seed selection
Scanning for exact matches, keyword search versus
database indexing

50
P-value computation

How significant is a score? What happens to
significance when you change the score function
A simple empirical method
Compute a distribution of scores against a random
database.
Use an estimate of the area under the curve to
get the probability.
OR, fit the distribution to one of the standard
distributions.

51
Z-scores for alignment

Initial assumption was that the scores followed a
normal distribution.
Z-score computation
For any alignment, score S, shuffle one of the
sequences many times, and recompute alignment.
Get mean and standard deviation
Look up a table to get a P-value

52
Blast E-value

Initial (and natural) assumption was that scores
followed a Normal distribution
1990, Karlin and Altschul showed that ungapped
local alignment scores follow an exponential
distribution
Practical consequence
Longer tail.
Previously significant hits now not so
significant

53
Exponential distribution

Random Database, Pr(1) p
What is the expected number of hits to a sequence
of k 1s
Instead, consider a random binary Matrix.
Expected of diagonals of k 1s

As you increase k, the number decreases
exponentially.
The number of diagonals of k runs can be
approximated by a Poisson process
In ungapped alignments, we replace the coin
tosses by column scores, but the behaviour does
not change (Karlin Altschul).
As the score increases, the number of alignments
that achieve the score decreases exponentially

55
Blast E-value

Choose a score such that the expected score
between a pair of residues lt 0
Expected number of alignments with a particular
score
For small values, E-value and P-value are the
same

56
Blast Variants

What is mega-blast?
What is discontiguous mega-blast?
Phi-Blast/Psi-Blast?
BLAT?
PatternHunter?

Longer seeds. Seeds with dont care
values Later Database pre-processing Seeds with
dont care values
57
Keyword Matching
P O T A S T P O T A T O
O
T
A
T
O
T
U
I
S
A
E
58
(No Transcript)

Write a Comment

User Comments (0)