L3:%20Blast:%20Local%20Alignment%20and%20other%20flavors - PowerPoint PPT Presentation

About This Presentation
Title:

L3:%20Blast:%20Local%20Alignment%20and%20other%20flavors

Description:

L3: Blast: Local Alignment and other flavors – PowerPoint PPT presentation

Number of Views:138
Avg rating:3.0/5.0
Slides: 59
Provided by: Vine88
Learn more at: https://cseweb.ucsd.edu
Category:

less

Transcript and Presenter's Notes

Title: L3:%20Blast:%20Local%20Alignment%20and%20other%20flavors


1
L3 Blast Local Alignment and other flavors
2
An Example
T C A T - T G C A A
T C A T T G C A A
A1
A2
  • Align sTCAT with tTGCAA
  • Match Score 1
  • Mismatch score -1, Indel Score -1
  • Score A1?, Score A2?

3
Sequence Alignment
  • Recall Instead of computing the optimum
    alignment, we are computing the score of the
    optimum alignment
  • Let Si,j denote the score of the optimum
    alignment of the prefix s1..i and t 1..j

4
An O(nm) algorithm for score computation
For i 1 to n For j 1 to m
  • The iteration ensures that all values on the
    right are computed in earlier steps.

5
Base case (Initialization)
6
A tableaux approach
t
1
j
n
1
s
Si-1,j
Si-1,j-1
i
Si,j-1
Si,j
n
Cell (i,j) contains the score Si,j. Each cell
only looks at 3 neighboring cells
7
Alignment Table
T G C A
A
0 -1 -2 -3 -4 -5
-1 1 0 -1 -2 -3
-2 0 0 1 0 -1
-3 -1 -1 0 2 1
-4 -2 -2 -1 1 1
T
C
A
T
8
Alignment Table
  • S4,5 1 is the score of an optimum alignment
  • Therefore, A2 is an optimum alignment
  • We know how to obtain the optimum Score. How do
    we get the best alignment?

T G C A A
T
C
A
T
9
Computing Optimum Alignment
  • At each cell, we have 3 choices
  • We maintain additional information to record the
    choice at each step.

For i 1 to n For j 1 to m
j-1
j
If (Si,j Si-1,j-1 C(si,tj)) Mi,j


i-1
If (Si,j Si-1,j C(si,-)) Mi,j
If (Si,j Si,j-1 C(-,tj) ) Mi,j
i
10
Computing Optimal Alignments
T G C A
A
T
C
A
T
11
Retrieving Opt.Alignment
  • M4,5
  • Implies that S4,5S3,4C(A,T)
  • or

1 2 3 4 5
T G C A A
A
T
1
T
C
2
M3,4 Implies that S3,4S2,3 C(A,A)
or
A
3
T
4
A
A
T
A
12
Retrieving Opt.Alignment
  • M2,3
  • Implies that S2,3S1,2C(C,C)
  • or

1 2 3 4 5
T G C A A
A
A
C
T
1
T
A
C
C
2
M1,2 Implies that S1,2S1,1 C(-,G)
or
A
3
T
4
A
A
C
-
T
T
A
C
G
T
13
Algorithm to retrieve optimal alignment
  • RetrieveAl(i,j)
  • if (Mi,j \)
  • return (RetrieveAl (i-1,j-1) . )
  • else if (Mi,j )
  • return (RetrieveAl (i-1,j) . )

si
tj
si
-
else if (Mi,j --) return
(RetrieveAl (i,j-1) . )
-
tj
14
Summary
  • An optimal alignment of strings of length n and m
    can be computed in O(nm) time
  • There is a tight connection between computation
    of optimal score, and computation of opt.
    Alignment
  • True for all DP based solutions

15
Global versus Local Alignment
  • Consider
  • s ACCACCCCTT
  • t ATCCCCACAT

ACCACCCCTT
ACCACCCCT T
A TCCCCACAT
ATCCCCACAT
16
Blast Outputs Local Alignment
Schematic
26
422
query
db
19
405
17
Local Alignment
  • Compute maximum scoring interval over all
    sub-intervals (a,b), and (a,b)
  • How can we compute this efficiently?

a
b
a
b
18
Local Alignment
  • Recall that in global alignment, we compute the
    best score for all prefix pairs s(1..i) and
    t(1..j).
  • Instead, compute the best score for all
    sub-alignments that end in s(i) and t(j).
  • What changes in the recurrence?

a
i
a
j
19
Local Alignment
j
j-1
  • The original recurrence still works, except when
    the optimum score Si,j is negative
  • When Si,j lt0, it means that the optimum local
    alignment cannot include the point (i,j).
  • So, we must reset the score to 0.

i-1
i
si
tj
20
Local Alignment Trick (Smith-Waterman algorithm)
How can we compute the local alignment itself?
21
Generalizing Gap Cost
  • It is more likely for gaps to be contiguous
  • The penalty for a gap of length l should be

22
Using affine gap penalties
  • What is the time taken for this?
  • What are the values that l can take?
  • Can we get rid of the extra Dimension?

23
Affine gap penalties
  • Define Di,j Score of the best alignment,
    given that the final column is a deletion (si
    is aligned to a gap)
  • Define Ii,j Score of the best alignment, given
    that the final column is an insertion (tj is
    aligned to a gap)

-
tj
24
O(nm) solution for affine gap costs
25
Alignment Space?
  • How much space to we need?
  • What if the query and database are each 1Mbp?

26
Alignment (Linear Space)
  • Score computation

For i 1 to n For j 1 to m
27
Linear Space Alignment
  • In Linear Space, we can do each row of the D.P.
  • We need to compute the optimum path from the
    origin (0,0) to (m,n)

28
Linear Space (contd)
  • At in/2, we know scores of all the optimal paths
    ending at that row.
  • Define Fj Sn/2,j
  • One of these j is on the true path. Which one?

29
Backward alignment
  • Let Si,j be the optimal score of aligning
    si..n with tj..m
  • Define Bj Sn/2,j
  • One of these j is on the true path. Which one?

30
Forward, Backward computation
  • At the optimal coordinate, j
  • FjBjSn,m
  • In O(nm) time, and O(m) space, we can compute one
    of the coordinates on the optimum path.

31
Linear Space Alignment
  • Align(1..n,1..m)
  • For all 1ltj lt m
  • Compute FjS(n/2,j)
  • For all 1ltj lt m
  • Compute BjSb(n/2,j)
  • j maxj FjBj
  • X Align(1..n/2,1..j)
  • Y Align(n/2..n,j..m)
  • Return X,j,Y

32
Linear Space complexity
  • T(nm) c.nm T(nm/2) O(nm)
  • Space O(m)

33
Summary
  • We considered the basics of sequence alignment
  • Opt score computation
  • Reconstructing alignments
  • Local alignments
  • Affine gap costs
  • Space saving measures
  • Can we recreate Blast?

34
Blast and local alignment
  • Concatenate all of the database sequences to form
    one giant sequence.
  • Do local alignment computation with the query.

35
Large database search
Database size n100M, Querysize m1000. O(nm)
1011 computations
36
Why is Blast Fast?
37
Silly Question!
  • True or False No two people in new york city
    have the same number of hair

38
Observations
  • Much of the database is random from the querys
    perspective
  • Consider a random DNA string of length n.
  • PrAPrC PrGPrT0.25
  • Assume for the moment that the query is all As
    (length m).
  • What is the probability that an exact match to
    the query can be found?

39
Basic probability
  • Probability that there is a match starting at a
    fixed position i 0.25m
  • What is the probability that some position i has
    a match.
  • Dependencies confound probability estimates.

40
Basic ProbabilityExpectation
  • Q Toss a coin each time it comes up heads, you
    get a dollar
  • What is the money you expect to get after n
    tosses?
  • Let Xi be the amount earned in the i-th toss

41
Expected number of matches
  • Expected number of matches can still be computed.

i
  • Let Xi1 if there is a match starting at
    position i, Xi0 otherwise
  • Expected number of matches

42
Expected number of exact Matches is small!
  • Expected number of matches n0.25m
  • If n107, m10,
  • Then, expected number of matches 9.537
  • If n107, m11
  • expected number of hits 2.38
  • n107,m12,
  • Expected number of hits 0.5 lt 1
  • Bottom Line An exact match to a substring of the
    query is unlikely just by chance.

43
Observation 2
  • What is the pigeonhole principle?

44
Why is this important?
  • Suppose we are looking for sequences that are 80
    identical to the query sequence of length 100.
  • Assume that the mismatches are randomly
    distributed.
  • What is the probability that there is no stretch
    of 10 bp, where the query and the subject match
    exactly?
  • Rough calculations show that it is very low.
    Exact match of a short query substring to a truly
    similar subject is very high.
  • The above equation does not take dependencies
    into account
  • Reality is better because the matches are not
    randomly distributed

45
Just the Facts
  • Consider the set of all substrings of the query
    string of fixed length W.
  • Prob. of exact match to a random database string
    is very low.
  • Prob. of exact match to a true homolog is very
    high.
  • Keyword Search (exact matches) is MUCH faster
    than sequence alignment

46
BLAST
Database (n)
  • Consider all (m-W) query words of size W (Default
    11)
  • Scan the database for exact match to all such
    words
  • For all regions that hit, extend using a dynamic
    programming alignment.
  • Can be many orders of magnitude faster than SW
    over the entire string

47
Why is BLAST fast?
  • Assume that keyword searching does not consume
    any time and that alignment computation the
    expensive step.
  • Query m1000, random Db n107, no TP
  • SW O(nm) 1000107 1010 computations
  • BLAST, W11
  • E(11-mer hits) 1000 (1/4)11 1072384
  • Number of computations 23841001002.384107
  • Ratio1010/(2.384107)420
  • Further speed improvements are possible

48
Keyword Matching
  • How fast can we match keywords?
  • Hash table/Db index? What is the size of the hash
    table, for m11
  • Suffix trees? What is the size of the suffix
    trees?
  • Trie based search. We will do this in class.







AATCA
567
49
Related notes
  • How to choose the alignment region?
  • Extend greedily until the score falls below a
    certain threshold
  • What about protein sequences?
  • Default word size 3, and mismatches are
    allowed.
  • Like sequences, BLAST has been evolving
    continuously
  • Banded alignment
  • Seed selection
  • Scanning for exact matches, keyword search versus
    database indexing

50
P-value computation
  • How significant is a score? What happens to
    significance when you change the score function
  • A simple empirical method
  • Compute a distribution of scores against a random
    database.
  • Use an estimate of the area under the curve to
    get the probability.
  • OR, fit the distribution to one of the standard
    distributions.

51
Z-scores for alignment
  • Initial assumption was that the scores followed a
    normal distribution.
  • Z-score computation
  • For any alignment, score S, shuffle one of the
    sequences many times, and recompute alignment.
    Get mean and standard deviation
  • Look up a table to get a P-value

52
Blast E-value
  • Initial (and natural) assumption was that scores
    followed a Normal distribution
  • 1990, Karlin and Altschul showed that ungapped
    local alignment scores follow an exponential
    distribution
  • Practical consequence
  • Longer tail.
  • Previously significant hits now not so
    significant

53
Exponential distribution
  • Random Database, Pr(1) p
  • What is the expected number of hits to a sequence
    of k 1s
  • Instead, consider a random binary Matrix.
    Expected of diagonals of k 1s

54
  • As you increase k, the number decreases
    exponentially.
  • The number of diagonals of k runs can be
    approximated by a Poisson process
  • In ungapped alignments, we replace the coin
    tosses by column scores, but the behaviour does
    not change (Karlin Altschul).
  • As the score increases, the number of alignments
    that achieve the score decreases exponentially

55
Blast E-value
  • Choose a score such that the expected score
    between a pair of residues lt 0
  • Expected number of alignments with a particular
    score
  • For small values, E-value and P-value are the
    same

56
Blast Variants
  1. What is mega-blast?
  2. What is discontiguous mega-blast?
  3. Phi-Blast/Psi-Blast?
  4. BLAT?
  5. PatternHunter?

Longer seeds. Seeds with dont care
values Later Database pre-processing Seeds with
dont care values
57
Keyword Matching
P O T A S T P O T A T O
O
T
A
T
O
T
U
I
S
A
E
58
(No Transcript)
Write a Comment
User Comments (0)
About PowerShow.com