Title: CS882' Chapter 1' Modern Homology Search Techniques
1CS882. Chapter 1. Modern Homology Search
Techniques
Ming Li CRC Chair in Bioinformatics,
Professor Computer Science. U. Waterloo
2- I want to teach you how simple beautiful ideas
can transform a field.
3Homology Search
- Definition Given two DNA sequences, find all
similar regions (local homologies, with gapped
alignments). - Formalize Edit distance --- insert, delete,
mutate. - Applications This is the single most important
task in bioinformatics. Also comparative
genomics, assembly - Smith-Waterman dynamic programming, Blast, FASTA,
MegaBlast, SENSEI, WU-Blast, psi-Blast, SIM,
BLAT, MUMmer, Quasar, etc. - PatternHunter
4Importance
- A significant fraction of worlds super-computing
power is doing homology search. - NCBI Blast server serves 100,000 users each day.
10-15 increase each month. - Blast paper is referenced 100,000 times.
- These have two implications
- Big demands.
- Current software are too slow.
5 Scalability of Software
- The trend of genetic data growth
- Genomes Human, Rice, Mouse, Fly, plus many.
- Software must be scalable to large datasets.
30 billion in year 2005
6History
- 1970s Smith-Waterman dynamic programming full
sensitivity, slow - 1980-90s BLAST, FASTA heuristics fast, but not
sensitive - PatternHunter Smith-Waterman sensitivity, BLAST
speed. (Or BLAST sensitivity, many times faster.)
7Bioinformatics Companies living on Blast
- Paracel (Celera)
- TimeLogic
- TurboGenomics (TurboWorx)
- NSF, NIH, pharmaceuticals proudly support many
supercomputing centers mainly for homology search - Hardware high cost become obsolete in 2 years.
8Outline
- Review dynamic programming and heuristics like
Blast - Optimized spaced seeds
- Mathematical theory why they are better
- How to compute them
- NP completeness
- Data models, coding regions, HMM
- Multiple optimized spaced seeds
- How to compute them
- Approach Smith-Waterman sensitivity
9Dynamic Programming
- Longest Common Subsequence (LCS).
- Vv1v2 vn
- Ww1w2 wm
- s(i,j) length of LCS
- of V1..i and W1..j
- Dynamic Programming
- s(i-1,j)
- s(i,j)max s(i,j-1)
- s(i-1,j-1)1,viwj
- Sequence Alignment (Needleman-Wunsch, 1970)
- s(i,j)max
- s(i-1,j) d(vi,-)
- s(i,j-1)d(-,wj)
- s(i-1,j-1)d(vi,wj)
- 0 (Smith-Waterman, 1981)
- No affine gap penalties,
- where d, for proteins, is either PAM or BLOSUM
matrix. For DNA, it can be match 1, mismatch -1,
gap -3 (In fact open gap -5, extension -1.) - d(-,x)d(x,-) -a, d(x,y)-u. When a0,
uinfinity, it is LCS.
10Misc. Concerns
- Local sequence alignment, add si,j0.
- Gap penalties. For good reasons, we charge first
gap cost a, and then subsequent continuous
insertions blta. - Space efficient sequence alignment. Hirschberg
alg. in O(n2) time, O(n) space. - Multiple alignment of k sequences O(nk)
11Blast
- Popular software, using heuristics. By Altschul,
Gish, Miller, Myers, Lipman, 1990. - E(xpected)-value e dmne -lS, here S is score, m
is database length and n is query length. - Meaning e is number of hits one can expect to
see just by chance when searching a database of
size m. - Basic Strategy For all 3 a.a. (and closely
related) triples, remember their locations in
database. Given a query sequence S. For all
triples in S, find their location in database,
then extend as long as e-value significant.
Similar strategy for DNA (7-11 nucleotides). Too
slow in genome level.
12Blast Algorithm
- Find seeded matches
- Extent to HSPs (High Scoring Pairs)
- Gapped Extension, dynamic programming
- Report all local alignments
13Here is an example of a BLASTP match (E-value 0)
between gene 0189 in C. pneumoniae and gene 131
in C. trachomatis. Query CPn0189
Score
(bits) E-value Aligned with CT131 hypothetical
protein 1240
0.0 Query 1 MKRRSWLKILGICLGSSIVLGFLIFLPQLLSTE
SRKYLVFSL I HKESGLSCSAEELKISW 60
MKR W KI G L L L LP SES KYL
SKEGL EL SW Sbjct 1
MKRSPWYKIFGYYLLVGVPLALLALLPKFFSSESGKYLFLSVLNKETGLQ
F EIEQLHLSW 60 Query 61 FGRQTARKIKLTG-EAKDEVFS
AEKFELDGSLLRLL I YKKPKGITLSGWSLKINEPASID 119
FG QTAKI G EFAEK GSL
RLLY PK TLGWSLIE S Sbjct 61
FGSQTAKKIRIRGIDSDSEIFAAEKI IVKGSLPRLLL
YRFPKALTLTGWSLQIDESLSMN 120 Etc. Note Because of
powerpoint character alignment problems, I
inserted some white space to make it look more
aligned.
14PatternHunter(Ma, Tromp, Li Bioinformatics,
183, 2002, 440-445)
- Hundred times faster than Blastn at same
sensitive level. - Scalable, portable, memory-friendly,
super-computer tasks on a PC, written in Java
any genome anywhere. - Used in Mouse Genome Consortium (Nature, Dec. 5,
2002), as well as in hundreds of institutions and
industry. - PatternHunter demo.
15Comparison with Blastn, MegaBlast
- On Pentium III 700MH, 1GB
- Blastn MB
PH - E.coli vs H.inf 716s 5s/561M
14s/68M - Arabidopsis 2 vs 4 -- 21720s/1087M
498s/280M - Human 21 vs 22 -- --
5250s/417M - 61M vs 61M
3hr37m/700M - 100M vs 35M
6m - Human (3G) vs Mouse (x39G) 20
days - All with filter off and identical parameters
- 16M reads of Mouse genome against Human genome
for MIT Whitehead. Best Blast program takes 19
years at the same sensitivity (seed length 11).
16Quality Comparisonx-axis alignment
ranky-axis alignment scoreboth axes in
logarithmic scale
A. thaliana chr 2 vs 4
E. Coli vs H. influenza
17(No Transcript)
18(No Transcript)
19(No Transcript)
20Genome Alignment by PatternHunter (4 seconds)
21New thinking
- Three lines of existing approaches
- Smith-Waterman exhaustive dynamic programming.
- Blast family find seed matches (11 for Blastn,
28 for MegaBlast), then extend. Dilemma
increasing seed size speeds up but loses
sensitivity descreasing seed size gains
sensitivity but loses speed. - Suffix tree MUMmer, Quasar, REPuter. Only good
for precise matches, highly similar sequences. - Blast approach is the only way to deal with real
and large problems, but we must to solve Blast
dilemma We need a way to improve sensitivity and
speed simultaneously. - Goals Improve (a) Blast, (b) Smith-Waterman.
22Deterministic Optimized Space Seed
- Blastn finds a match of length 11, then extend
from there. MegaBlast, in order to increase
speed, increases this to 28. - Spaced Seed PatternHunter looks for matches of
11 nonconsecutive matches and optimized such
seeding scheme. - Example 111010010100110111 is weight 11 optimal
seed (Blast default seed is 11111111111). - This seemingly simple change makes a big
difference significantly increases hit to
homologous region while reducing bad hits.
23Sensitivity PH weight 11 seed vs Blast 11 10
24PH 2-hit sensitivity vs Blastn 11, 12 1-hit
25Literature
- Random multiple spaced q-grams were used in the
following work - FLASH by Califano and Rigoutsos
- Multiple filtration by Pevzner and Waterman
- LSH of Buhler
- Praparata et al
- Recent work on spaced seeds theory and
algorithms - Ma, Tromp, Li original PH paper.
- Choi and Zhang x 2
- Brejova, Brown, Vinar x 2
- Buhler, Keich, Sun
- Keich, Li, Ma, Tromp
- Li, Ma, Kisman, Tromp
26Expect Less, Get More
- Lemma The expected number of hits of a weight W
length M seed model within a length L region with
similarity p is - (L-M1)pW
- Proof The expected number of hits is the sum,
over the L-M1 possible positions of fitting the
seed within the region, of the probability of W
specific matches, the latter being pW. - Example In a region of length 64 with 0.7
similarity, PH has probability of 0.466 to hit vs
Blast 0.3, 50 increase. On the other hand, by
above lemma, Blast expects 1.07 hits, while PH
0.93, 14 less.
27Why are the spaced seeds better?
- A wrong (but intuitive) proof seed s, interval
I, similarity p - E(hits in I)
- Prob.(s hits in I) E(hits in I s hits in I)
- Thus Prob.(s hits in I)
- Lpw / E(hits in I s hits in I)
- For optimized spaced seed
- 111010010100110111 Non overlap
Prob - 111010010100110111 6
p6 - 111010010100110111 6
p6 - 111010010100110111 6
p6 - 111010010100110111 7
p7 - ..
- For spaced seed the divisor is 1p6p6p6p7
- For Blast seed the divisor is bigger 1 p p2
p3
28Why are the spaced seeds better
- Theorem 1. Let I be a homologous region, homology
level p. For any sequence 0i0 lt i1 lt in-1
I-s, let Aj be the event a spaced seed s hits
I at ij, Bj be event the consecutive seed hits I
at j. - (1) P(Ujltn Aj) P(Ujltn Bj), when
ijj, gt holds. - Proof. Induction on n. For n1, P(A0)pWP(B0).
Assume the theorem holds for nN, we prove it
holds for N1. Let Ek denote the event that, when
putting a seed at I0, 0th, 1st k-1st 1s in
the seed all match the k-th 1 mismatches.
Clearly, Ek is a partition of sample space for
all k0, .. , W, and P(Ek)pk(1-p) for any seed.
Thus it is sufficient to show, for kW - (2) P(Uj0..N AjEk)P(Uj0..NBjEk)
- When kW, both sides of above equal 1. For
kltW since (UjkBj)?EkF and Bk1,Bk2, BN are
mutually independent of Ek, we have - (3) P(Uj0..NBjEk)P(Ujk1..NBj)
- Now consider the first term in (2). For each
k in 0, W-1, at most k1 of the events Aj
satisfy Aj?EkF because Aj?EkF iff when aligned
at ij seed s has a 1 bit at overlapping k-th bit
when the seed was at 0 there are at most k1
choices. Thus, there exist indices 0ltmk1ltmk2
ltmN N such that Amj?Ek?F. Since Ek means all
previous bits matched, it is clear that Ek is
non-negatively correlated with Ujk1..N Amj,
thus - (4) P(Uj0..N
AjEk)P(Ujk1..N AmjEk)P(Ujk1..NAmj) - The inductive hypothesis yields
- P(Ujk1..N Amj)
P(Ujk1..NBj) - Combined with (3),(4), this proves (2),
hence part of (1) of the theorem.
29To prove when ijj, P(Uj0..n-1Aj)gtP(Uj0..n-1Bj)
--- (5)
- We prove (5) by induction on n. For n2, we have
- P(Uj0,1Aj)2pw p2w-Shift1gt2pw-pw1P
(Uj0,1Bj) () - where shift1number of overlaped bits of the
spaced seed s when shifted to the right by 1 bit.
- For inductive step, note that the proof of (1)
shows that for all k0,1, ,W,
P(Uj0..n-1AjEk)P(Uj0..n-1BjEk). Thus to
prove (5) we only need to prove that - P(Uj0..n-1AjE0)gtP(Uj0..n-1BjE0).
- The above follows from the inductive hypothesis
as follows - P(Uj0..n-1AjE0)P(Uj1..n-1Aj)gtP(Uj1..n-1Bj)P
(Uj0..n-1BjE0).
30Corollary and Open Question
- Corollary Assume I is infinite. Let ps and pc be
the first positions a spaced seed and the
consecutive seed hit I, respectively. Then Eps
lt Epc. - Proof. EpsSk0..8P(psgtk)
- Sk0..8(1-P(Uj0..kAj))
- lt Sk0..8(1-P(Uj0..kBj))
- Epc.
- Open question Quantitative estimate?
31Theorem 2. It is NP hard to find the optimal
seed(PH II paper)
- Proof. Reduction from 3-SAT. Given a 3-SAT
instance C1 Cn, on Boolean variables x1, ,
xm, where Ci l1 l2 l3, each l is some xi or
its complement. We reduce this to seed selection
problem. The required seed has weight Wm2 and
length L2m2, and is intended to be of the form
1(0110)m1, a straightforward variable assignment
encoding. The i-th pair of bits after the initial
1 is called the code for variable xi. A 01 code
corresponds to setting xi true, while a 10 means
false. To prohibit code 11, we introduce, for
every 1 i m, a region -
- Ai1 (11)i-101(11)m-i 10m1
1(11)i-110(11)m-i 1. - Since the seed has m 0s, it cannot bridge the
middle 0m1 part, and is thus forced to hit at
the start or at the end. This obviously rules out
code 11 for variable xi. Code 00 automatically
becomes ruled out as well, since the seed must
distribute m 1s among all m codes. Finally, for
each clause, we introduce a region of the form - R1(11)a-1 c1(11)m-a 1 000
1(11)b-1c2(11)m-b 1 000 1(11)c-1c3(11)m-c 1. - The total length of R is (2m2)3(2m2)3(2m2)
6m12 it consists of three parts each
corresponding to a literal in the constraint,
with ci being the code for making that literal
true. It's easy to see that a properly formed
seed (which lacks 3 consecutive 0s) hits R iff
the assignment encoded by the seed satisfies the
clause.
32Alg. DPFinding the best spaced seeds
- Let f(i,b) be the probability that seed s hits
the length i prefix of R that ends with b. - Thus, if s matches b, then
- f(i,b) 1,
- otherwise we have the recursive relationship
- f(i,b) (1-p)f(i-1,0b') pf(i-1,1b')
- where b' is b deleting the last bit.
- Then the probability of s hitting R is
- SbM Prob(b) f(L-M,b).
- Choose s with the highest probability.
33Coding Region HMM
- The Hidden Markov Model is a finite set
of states, each of which is associated with a
probability distribution. Transitions among the
states are governed by a set of probabilities
called transition probabilities. In a particular
state an outcome or observation can be generated,
according to the associated probability
distribution. It is only the outcome, not the
state visible to an external observer and
therefore states are hidden'' to the outside
hence the name Hidden Markov Model. An HMM
completely has the following components
satisfying usual probability laws - The number of states of the model, N.
- The number of observation symbols in the
alphabet, M. - A set of state transition probabilities,
depending on current state. - A probability distribution of the observable
symbol at each of the states.
34Modeling coding region with HMM
- PatternHunter original assumption I is a
sequence of N independent Bernoulli variables X0,
XN-1 with P(Xi)p. - Coding region third position usually is less
conserved. Skipping it should be a good idea
(BLAT 110110110). With spaced seeds, we can model
it using HMM. - Brejova, Brown, Vinar M(3) HMM I is a sequence
of N independent Bernoulli random variables
X0,X1, XN-1 where Pr(Xi1)pi mod 3. In
particular - p0
p1 p2 - human/mouse 0.82 0.87 0.61
- human/fruit fly 0.67 0.77 0.40
- DP algorithm naturally extends to M(3).
35Picture of M(3)
Extending M(8), Brejova-Brown-Vinar proposed
M(8), above, to model Dependencies among
positions within codons.
From Brejova-Brown-Vinar
36Sensitivity of all seeds Under 4 models
From Brejova-Brown-Vinar
37Multiple Spaced Seeds Approaching Smith-Waterman
Sensitivity
- The biggest problem for Blast was low sensitivity
(and low speed). Massive parallel machines are
built to do Smith-Waterman exhaustive dynamic
programming. - Spaced seeds give PH a unique opportunity of
using several optimal seeds to achieve optimal
sensitivity, this was not possible for Blast
technology. - Economy --- 2 seeds (½ time slow down) achieve
sensitivity of 1 seed of shorter length (4 times
speed). --- note, added - With multiple optimal seeds. PH II approaches
Smith-Waterman sensitivity, and 3000 times
faster. - Finding optimal seeds, even one, is NP-hard.
- Experiment 29715 mouse EST, 4407 human EST.
Sensitivity and speed comparison next two slides.
38Finding Multiple Seeds (PH II)
- Computing the hit probability of given k seeds on
a uniformly distributed random region is NP-hard. - Finding a set of k optimal seeds to maximize the
hit probability cannot be approximated within
1-1/e unless NPP - Given k seeds, algorithm DP can be adapted to
compute their sensitivity (probability one of the
seeds has a hit). - Greedy Strategy
- Compute one optimal seed a. Sa
- Given S, find b maximizing hit probability of S U
b. - Coding region, use M(3), 0.8, 0.8, 0.5 (for mod 3
positions) - We took 12 CPU-days on 3GHz PC to compute 16
weight 11 seeds. General seeds (first 4)
111010010100110111, 111100110010100001011,
11010000110001010111, 1110111010001111.
39Sensitivity Comparison with Smith-Waterman (at
100) The thick dashed curve is the sensitivity
of Blastn, seed weight 11. From low to high,
the solid curves are the sensitivity of PH II
using 1, 2, 4, 8 weight 11 coding region seeds,
and the thin dashed curves are the sensitivity
1, 2, 4, 8 weight 11 general purpose seeds,
respectively
40Speed Comparison with Smith-Waterman
- Smith-Waterman (SSearch) 20 CPU days.
- PatternHunter II with 4 seeds 475 CPU seconds.
3638 times faster than Smith-Waterman dynamic
programming at the same sensitivity.
41PHunterViewer --- Arabidopsis Chr 2 vs Chr 4
42Selecting Interesting Region
43Amplify the region
44Focus further
45Sorted View
46Another View
47Running PHDownload at www.BioinformaticsSolution
s.com
- java jar ph.jar i query.fna j subject.fna o
out.txt b multi 4 - -Xmx512m --- for large files
- -j missing query.fna self-comparison
- -db multiple sequence input, 0,1,2,3 (no, query,
subject, both) - -W seed weight
- -G open gap penalty (default 5)
- -E gap extension (default 1)
- -q mismatch penalty (default 1)
- -r reward for match (default 1)
- -model specify model in binary
- -H hits before extension
- -P show progress
- -b Blast output format
- -multi number of seeds to be used (default 1
max 16)
48Summary and Open Questions
- Simple ideas 0 created SW, 1 created Blast. 1
0 created PatternHunter. - Good spaced seeds help to improve sensitivity and
reduce irrelevant hit. - Multiple spaced seeds allow us to improve
sensitivity to approach Smith-Waterman, not
possible with BLAST. - Computing spaced seeds by DP, while it is NP
hard. - Proper data models help to improve design of
spaced seeds (HMM) - Open Problems (a) A mathematical theory of
spaced seeds. I.e. quantitative statements for
Theorem 1. - (b) Applications to other fields.
49Assignment 1
- Problem 1. Why is the proof on page 27 wrong?
- Problem 2. Prove () on page 29
- Problem 3. Write a program to compute the optimal
seed, weight 11, length 18, for a uniform random
region of length 64 at homology level 0.7.
Estimate Prob(Blast hit optimal seed does not
hit) for the same region, for different homology
levels.
50Smaller (more solvable) questions (perhaps good
for course projects)
- Multiple protein seeds for Smith-Waterman
sensitivity? - Applications in other fields, like internet
document search? - Prove Theorem 1 for other models for example,
M(3) model 0.8,0.8,0.5 together with specialized
seeds. - More extensive study of Problem 3.