CS882' Chapter 1' Modern Homology Search Techniques - PowerPoint PPT Presentation

1 / 50
About This Presentation
Title:

CS882' Chapter 1' Modern Homology Search Techniques

Description:

... WU-Blast, psi-Blast, SIM, BLAT, MUMmer, Quasar, etc. ... Suffix tree: MUMmer, Quasar, REPuter. Only good for precise matches, highly similar sequences. ... – PowerPoint PPT presentation

Number of Views:38
Avg rating:3.0/5.0
Slides: 51
Provided by: min87
Category:

less

Transcript and Presenter's Notes

Title: CS882' Chapter 1' Modern Homology Search Techniques


1
CS882. Chapter 1. Modern Homology Search
Techniques
Ming Li CRC Chair in Bioinformatics,
Professor Computer Science. U. Waterloo
2
  • I want to teach you how simple beautiful ideas
    can transform a field.

3
Homology Search
  • Definition Given two DNA sequences, find all
    similar regions (local homologies, with gapped
    alignments).
  • Formalize Edit distance --- insert, delete,
    mutate.
  • Applications This is the single most important
    task in bioinformatics. Also comparative
    genomics, assembly
  • Smith-Waterman dynamic programming, Blast, FASTA,
    MegaBlast, SENSEI, WU-Blast, psi-Blast, SIM,
    BLAT, MUMmer, Quasar, etc.
  • PatternHunter

4
Importance
  • A significant fraction of worlds super-computing
    power is doing homology search.
  • NCBI Blast server serves 100,000 users each day.
    10-15 increase each month.
  • Blast paper is referenced 100,000 times.
  • These have two implications
  • Big demands.
  • Current software are too slow.

5
Scalability of Software
  • The trend of genetic data growth
  • Genomes Human, Rice, Mouse, Fly, plus many.
  • Software must be scalable to large datasets.

30 billion in year 2005
6
History
  • 1970s Smith-Waterman dynamic programming full
    sensitivity, slow
  • 1980-90s BLAST, FASTA heuristics fast, but not
    sensitive
  • PatternHunter Smith-Waterman sensitivity, BLAST
    speed. (Or BLAST sensitivity, many times faster.)

7
Bioinformatics Companies living on Blast
  • Paracel (Celera)
  • TimeLogic
  • TurboGenomics (TurboWorx)
  • NSF, NIH, pharmaceuticals proudly support many
    supercomputing centers mainly for homology search
  • Hardware high cost become obsolete in 2 years.

8
Outline
  • Review dynamic programming and heuristics like
    Blast
  • Optimized spaced seeds
  • Mathematical theory why they are better
  • How to compute them
  • NP completeness
  • Data models, coding regions, HMM
  • Multiple optimized spaced seeds
  • How to compute them
  • Approach Smith-Waterman sensitivity

9
Dynamic Programming
  • Longest Common Subsequence (LCS).
  • Vv1v2 vn
  • Ww1w2 wm
  • s(i,j) length of LCS
  • of V1..i and W1..j
  • Dynamic Programming
  • s(i-1,j)
  • s(i,j)max s(i,j-1)
  • s(i-1,j-1)1,viwj
  • Sequence Alignment (Needleman-Wunsch, 1970)
  • s(i,j)max
  • s(i-1,j) d(vi,-)
  • s(i,j-1)d(-,wj)
  • s(i-1,j-1)d(vi,wj)
  • 0 (Smith-Waterman, 1981)
  • No affine gap penalties,
  • where d, for proteins, is either PAM or BLOSUM
    matrix. For DNA, it can be match 1, mismatch -1,
    gap -3 (In fact open gap -5, extension -1.)
  • d(-,x)d(x,-) -a, d(x,y)-u. When a0,
    uinfinity, it is LCS.

10
Misc. Concerns
  • Local sequence alignment, add si,j0.
  • Gap penalties. For good reasons, we charge first
    gap cost a, and then subsequent continuous
    insertions blta.
  • Space efficient sequence alignment. Hirschberg
    alg. in O(n2) time, O(n) space.
  • Multiple alignment of k sequences O(nk)

11
Blast
  • Popular software, using heuristics. By Altschul,
    Gish, Miller, Myers, Lipman, 1990.
  • E(xpected)-value e dmne -lS, here S is score, m
    is database length and n is query length.
  • Meaning e is number of hits one can expect to
    see just by chance when searching a database of
    size m.
  • Basic Strategy For all 3 a.a. (and closely
    related) triples, remember their locations in
    database. Given a query sequence S. For all
    triples in S, find their location in database,
    then extend as long as e-value significant.
    Similar strategy for DNA (7-11 nucleotides). Too
    slow in genome level.

12
Blast Algorithm
  • Find seeded matches
  • Extent to HSPs (High Scoring Pairs)
  • Gapped Extension, dynamic programming
  • Report all local alignments

13
Here is an example of a BLASTP match (E-value 0)
between gene 0189 in C. pneumoniae and gene 131
in C. trachomatis. Query CPn0189
Score
(bits) E-value Aligned with CT131 hypothetical
protein 1240
0.0 Query 1 MKRRSWLKILGICLGSSIVLGFLIFLPQLLSTE
SRKYLVFSL I HKESGLSCSAEELKISW 60
MKR W KI G L L L LP SES KYL
SKEGL EL SW Sbjct 1
MKRSPWYKIFGYYLLVGVPLALLALLPKFFSSESGKYLFLSVLNKETGLQ
F EIEQLHLSW 60 Query 61 FGRQTARKIKLTG-EAKDEVFS
AEKFELDGSLLRLL I YKKPKGITLSGWSLKINEPASID 119
FG QTAKI G EFAEK GSL
RLLY PK TLGWSLIE S Sbjct 61
FGSQTAKKIRIRGIDSDSEIFAAEKI IVKGSLPRLLL
YRFPKALTLTGWSLQIDESLSMN 120 Etc. Note Because of
powerpoint character alignment problems, I
inserted some white space to make it look more
aligned.
14
PatternHunter(Ma, Tromp, Li Bioinformatics,
183, 2002, 440-445)
  • Hundred times faster than Blastn at same
    sensitive level.
  • Scalable, portable, memory-friendly,
    super-computer tasks on a PC, written in Java
    any genome anywhere.
  • Used in Mouse Genome Consortium (Nature, Dec. 5,
    2002), as well as in hundreds of institutions and
    industry.
  • PatternHunter demo.

15
Comparison with Blastn, MegaBlast
  • On Pentium III 700MH, 1GB
  • Blastn MB
    PH
  • E.coli vs H.inf 716s 5s/561M
    14s/68M
  • Arabidopsis 2 vs 4 -- 21720s/1087M
    498s/280M
  • Human 21 vs 22 -- --
    5250s/417M
  • 61M vs 61M
    3hr37m/700M
  • 100M vs 35M
    6m
  • Human (3G) vs Mouse (x39G) 20
    days
  • All with filter off and identical parameters
  • 16M reads of Mouse genome against Human genome
    for MIT Whitehead. Best Blast program takes 19
    years at the same sensitivity (seed length 11).

16
Quality Comparisonx-axis alignment
ranky-axis alignment scoreboth axes in
logarithmic scale
A. thaliana chr 2 vs 4
E. Coli vs H. influenza
17
(No Transcript)
18
(No Transcript)
19
(No Transcript)
20
Genome Alignment by PatternHunter (4 seconds)
21
New thinking
  • Three lines of existing approaches
  • Smith-Waterman exhaustive dynamic programming.
  • Blast family find seed matches (11 for Blastn,
    28 for MegaBlast), then extend. Dilemma
    increasing seed size speeds up but loses
    sensitivity descreasing seed size gains
    sensitivity but loses speed.
  • Suffix tree MUMmer, Quasar, REPuter. Only good
    for precise matches, highly similar sequences.
  • Blast approach is the only way to deal with real
    and large problems, but we must to solve Blast
    dilemma We need a way to improve sensitivity and
    speed simultaneously.
  • Goals Improve (a) Blast, (b) Smith-Waterman.

22
Deterministic Optimized Space Seed
  • Blastn finds a match of length 11, then extend
    from there. MegaBlast, in order to increase
    speed, increases this to 28.
  • Spaced Seed PatternHunter looks for matches of
    11 nonconsecutive matches and optimized such
    seeding scheme.
  • Example 111010010100110111 is weight 11 optimal
    seed (Blast default seed is 11111111111).
  • This seemingly simple change makes a big
    difference significantly increases hit to
    homologous region while reducing bad hits.

23
Sensitivity PH weight 11 seed vs Blast 11 10
24
PH 2-hit sensitivity vs Blastn 11, 12 1-hit
25
Literature
  • Random multiple spaced q-grams were used in the
    following work
  • FLASH by Califano and Rigoutsos
  • Multiple filtration by Pevzner and Waterman
  • LSH of Buhler
  • Praparata et al
  • Recent work on spaced seeds theory and
    algorithms
  • Ma, Tromp, Li original PH paper.
  • Choi and Zhang x 2
  • Brejova, Brown, Vinar x 2
  • Buhler, Keich, Sun
  • Keich, Li, Ma, Tromp
  • Li, Ma, Kisman, Tromp

26
Expect Less, Get More
  • Lemma The expected number of hits of a weight W
    length M seed model within a length L region with
    similarity p is
  • (L-M1)pW
  • Proof The expected number of hits is the sum,
    over the L-M1 possible positions of fitting the
    seed within the region, of the probability of W
    specific matches, the latter being pW.
  • Example In a region of length 64 with 0.7
    similarity, PH has probability of 0.466 to hit vs
    Blast 0.3, 50 increase. On the other hand, by
    above lemma, Blast expects 1.07 hits, while PH
    0.93, 14 less.

27
Why are the spaced seeds better?
  • A wrong (but intuitive) proof seed s, interval
    I, similarity p
  • E(hits in I)
  • Prob.(s hits in I) E(hits in I s hits in I)
  • Thus Prob.(s hits in I)
  • Lpw / E(hits in I s hits in I)
  • For optimized spaced seed
  • 111010010100110111 Non overlap
    Prob
  • 111010010100110111 6
    p6
  • 111010010100110111 6
    p6
  • 111010010100110111 6
    p6
  • 111010010100110111 7
    p7
  • ..
  • For spaced seed the divisor is 1p6p6p6p7
  • For Blast seed the divisor is bigger 1 p p2
    p3

28
Why are the spaced seeds better
  • Theorem 1. Let I be a homologous region, homology
    level p. For any sequence 0i0 lt i1 lt in-1
    I-s, let Aj be the event a spaced seed s hits
    I at ij, Bj be event the consecutive seed hits I
    at j.
  • (1) P(Ujltn Aj) P(Ujltn Bj), when
    ijj, gt holds.
  • Proof. Induction on n. For n1, P(A0)pWP(B0).
    Assume the theorem holds for nN, we prove it
    holds for N1. Let Ek denote the event that, when
    putting a seed at I0, 0th, 1st k-1st 1s in
    the seed all match the k-th 1 mismatches.
    Clearly, Ek is a partition of sample space for
    all k0, .. , W, and P(Ek)pk(1-p) for any seed.
    Thus it is sufficient to show, for kW
  • (2) P(Uj0..N AjEk)P(Uj0..NBjEk)
  • When kW, both sides of above equal 1. For
    kltW since (UjkBj)?EkF and Bk1,Bk2, BN are
    mutually independent of Ek, we have
  • (3) P(Uj0..NBjEk)P(Ujk1..NBj)
  • Now consider the first term in (2). For each
    k in 0, W-1, at most k1 of the events Aj
    satisfy Aj?EkF because Aj?EkF iff when aligned
    at ij seed s has a 1 bit at overlapping k-th bit
    when the seed was at 0 there are at most k1
    choices. Thus, there exist indices 0ltmk1ltmk2
    ltmN N such that Amj?Ek?F. Since Ek means all
    previous bits matched, it is clear that Ek is
    non-negatively correlated with Ujk1..N Amj,
    thus
  • (4) P(Uj0..N
    AjEk)P(Ujk1..N AmjEk)P(Ujk1..NAmj)
  • The inductive hypothesis yields
  • P(Ujk1..N Amj)
    P(Ujk1..NBj)
  • Combined with (3),(4), this proves (2),
    hence part of (1) of the theorem.

29
To prove when ijj, P(Uj0..n-1Aj)gtP(Uj0..n-1Bj)
--- (5)
  • We prove (5) by induction on n. For n2, we have
  • P(Uj0,1Aj)2pw p2w-Shift1gt2pw-pw1P
    (Uj0,1Bj) ()
  • where shift1number of overlaped bits of the
    spaced seed s when shifted to the right by 1 bit.
  • For inductive step, note that the proof of (1)
    shows that for all k0,1, ,W,
    P(Uj0..n-1AjEk)P(Uj0..n-1BjEk). Thus to
    prove (5) we only need to prove that
  • P(Uj0..n-1AjE0)gtP(Uj0..n-1BjE0).
  • The above follows from the inductive hypothesis
    as follows
  • P(Uj0..n-1AjE0)P(Uj1..n-1Aj)gtP(Uj1..n-1Bj)P
    (Uj0..n-1BjE0).

30
Corollary and Open Question
  • Corollary Assume I is infinite. Let ps and pc be
    the first positions a spaced seed and the
    consecutive seed hit I, respectively. Then Eps
    lt Epc.
  • Proof. EpsSk0..8P(psgtk)
  • Sk0..8(1-P(Uj0..kAj))
  • lt Sk0..8(1-P(Uj0..kBj))
  • Epc.
  • Open question Quantitative estimate?

31
Theorem 2. It is NP hard to find the optimal
seed(PH II paper)
  • Proof. Reduction from 3-SAT. Given a 3-SAT
    instance C1 Cn, on Boolean variables x1, ,
    xm, where Ci l1 l2 l3, each l is some xi or
    its complement. We reduce this to seed selection
    problem. The required seed has weight Wm2 and
    length L2m2, and is intended to be of the form
    1(0110)m1, a straightforward variable assignment
    encoding. The i-th pair of bits after the initial
    1 is called the code for variable xi. A 01 code
    corresponds to setting xi true, while a 10 means
    false. To prohibit code 11, we introduce, for
    every 1 i m, a region
  • Ai1 (11)i-101(11)m-i 10m1
    1(11)i-110(11)m-i 1.
  • Since the seed has m 0s, it cannot bridge the
    middle 0m1 part, and is thus forced to hit at
    the start or at the end. This obviously rules out
    code 11 for variable xi. Code 00 automatically
    becomes ruled out as well, since the seed must
    distribute m 1s among all m codes. Finally, for
    each clause, we introduce a region of the form
  • R1(11)a-1 c1(11)m-a 1 000
    1(11)b-1c2(11)m-b 1 000 1(11)c-1c3(11)m-c 1.
  • The total length of R is (2m2)3(2m2)3(2m2)
    6m12 it consists of three parts each
    corresponding to a literal in the constraint,
    with ci being the code for making that literal
    true. It's easy to see that a properly formed
    seed (which lacks 3 consecutive 0s) hits R iff
    the assignment encoded by the seed satisfies the
    clause.

32
Alg. DPFinding the best spaced seeds
  • Let f(i,b) be the probability that seed s hits
    the length i prefix of R that ends with b.
  • Thus, if s matches b, then
  • f(i,b) 1,
  • otherwise we have the recursive relationship
  • f(i,b) (1-p)f(i-1,0b') pf(i-1,1b')
  • where b' is b deleting the last bit.
  • Then the probability of s hitting R is
  • SbM Prob(b) f(L-M,b).
  • Choose s with the highest probability.

33
Coding Region HMM
  • The Hidden Markov Model is a finite set
    of states, each of which is associated with a
    probability distribution. Transitions among the
    states are governed by a set of probabilities
    called transition probabilities. In a particular
    state an outcome or observation can be generated,
    according to the associated probability
    distribution. It is only the outcome, not the
    state visible to an external observer and
    therefore states are hidden'' to the outside
    hence the name Hidden Markov Model. An HMM
    completely has the following components
    satisfying usual probability laws
  • The number of states of the model, N.
  • The number of observation symbols in the
    alphabet, M.
  • A set of state transition probabilities,
    depending on current state.
  • A probability distribution of the observable
    symbol at each of the states.

34
Modeling coding region with HMM
  • PatternHunter original assumption I is a
    sequence of N independent Bernoulli variables X0,
    XN-1 with P(Xi)p.
  • Coding region third position usually is less
    conserved. Skipping it should be a good idea
    (BLAT 110110110). With spaced seeds, we can model
    it using HMM.
  • Brejova, Brown, Vinar M(3) HMM I is a sequence
    of N independent Bernoulli random variables
    X0,X1, XN-1 where Pr(Xi1)pi mod 3. In
    particular
  • p0
    p1 p2
  • human/mouse 0.82 0.87 0.61
  • human/fruit fly 0.67 0.77 0.40
  • DP algorithm naturally extends to M(3).

35
Picture of M(3)
Extending M(8), Brejova-Brown-Vinar proposed
M(8), above, to model Dependencies among
positions within codons.
From Brejova-Brown-Vinar
36
Sensitivity of all seeds Under 4 models
From Brejova-Brown-Vinar
37
Multiple Spaced Seeds Approaching Smith-Waterman
Sensitivity
  • The biggest problem for Blast was low sensitivity
    (and low speed). Massive parallel machines are
    built to do Smith-Waterman exhaustive dynamic
    programming.
  • Spaced seeds give PH a unique opportunity of
    using several optimal seeds to achieve optimal
    sensitivity, this was not possible for Blast
    technology.
  • Economy --- 2 seeds (½ time slow down) achieve
    sensitivity of 1 seed of shorter length (4 times
    speed). --- note, added
  • With multiple optimal seeds. PH II approaches
    Smith-Waterman sensitivity, and 3000 times
    faster.
  • Finding optimal seeds, even one, is NP-hard.
  • Experiment 29715 mouse EST, 4407 human EST.
    Sensitivity and speed comparison next two slides.

38
Finding Multiple Seeds (PH II)
  • Computing the hit probability of given k seeds on
    a uniformly distributed random region is NP-hard.
  • Finding a set of k optimal seeds to maximize the
    hit probability cannot be approximated within
    1-1/e unless NPP
  • Given k seeds, algorithm DP can be adapted to
    compute their sensitivity (probability one of the
    seeds has a hit).
  • Greedy Strategy
  • Compute one optimal seed a. Sa
  • Given S, find b maximizing hit probability of S U
    b.
  • Coding region, use M(3), 0.8, 0.8, 0.5 (for mod 3
    positions)
  • We took 12 CPU-days on 3GHz PC to compute 16
    weight 11 seeds. General seeds (first 4)
    111010010100110111, 111100110010100001011,
    11010000110001010111, 1110111010001111.

39
Sensitivity Comparison with Smith-Waterman (at
100) The thick dashed curve is the sensitivity
of Blastn, seed weight 11. From low to high,
the solid curves are the sensitivity of PH II
using 1, 2, 4, 8 weight 11 coding region seeds,
and the thin dashed curves are the sensitivity
1, 2, 4, 8 weight 11 general purpose seeds,
respectively
40
Speed Comparison with Smith-Waterman
  • Smith-Waterman (SSearch) 20 CPU days.
  • PatternHunter II with 4 seeds 475 CPU seconds.
    3638 times faster than Smith-Waterman dynamic
    programming at the same sensitivity.

41
PHunterViewer --- Arabidopsis Chr 2 vs Chr 4
42
Selecting Interesting Region
43
Amplify the region
44
Focus further
45
Sorted View
46
Another View
47
Running PHDownload at www.BioinformaticsSolution
s.com
  • java jar ph.jar i query.fna j subject.fna o
    out.txt b multi 4
  • -Xmx512m --- for large files
  • -j missing query.fna self-comparison
  • -db multiple sequence input, 0,1,2,3 (no, query,
    subject, both)
  • -W seed weight
  • -G open gap penalty (default 5)
  • -E gap extension (default 1)
  • -q mismatch penalty (default 1)
  • -r reward for match (default 1)
  • -model specify model in binary
  • -H hits before extension
  • -P show progress
  • -b Blast output format
  • -multi number of seeds to be used (default 1
    max 16)

48
Summary and Open Questions
  • Simple ideas 0 created SW, 1 created Blast. 1
    0 created PatternHunter.
  • Good spaced seeds help to improve sensitivity and
    reduce irrelevant hit.
  • Multiple spaced seeds allow us to improve
    sensitivity to approach Smith-Waterman, not
    possible with BLAST.
  • Computing spaced seeds by DP, while it is NP
    hard.
  • Proper data models help to improve design of
    spaced seeds (HMM)
  • Open Problems (a) A mathematical theory of
    spaced seeds. I.e. quantitative statements for
    Theorem 1.
  • (b) Applications to other fields.

49
Assignment 1
  • Problem 1. Why is the proof on page 27 wrong?
  • Problem 2. Prove () on page 29
  • Problem 3. Write a program to compute the optimal
    seed, weight 11, length 18, for a uniform random
    region of length 64 at homology level 0.7.
    Estimate Prob(Blast hit optimal seed does not
    hit) for the same region, for different homology
    levels.

50
Smaller (more solvable) questions (perhaps good
for course projects)
  • Multiple protein seeds for Smith-Waterman
    sensitivity?
  • Applications in other fields, like internet
    document search?
  • Prove Theorem 1 for other models for example,
    M(3) model 0.8,0.8,0.5 together with specialized
    seeds.
  • More extensive study of Problem 3.
Write a Comment
User Comments (0)
About PowerShow.com