CS882' Chapter 1' Modern Homology Search Techniques

About This Presentation

Title:

CS882' Chapter 1' Modern Homology Search Techniques

Description:

... WU-Blast, psi-Blast, SIM, BLAT, MUMmer, Quasar, etc. ... Suffix tree: MUMmer, Quasar, REPuter. Only good for precise matches, highly similar sequences. ... – PowerPoint PPT presentation

Number of Views:38

Avg rating:3.0/5.0

Slides: 51

Provided by: min87

Category:

more less

Transcript and Presenter's Notes

Title: CS882' Chapter 1' Modern Homology Search Techniques

1
CS882. Chapter 1. Modern Homology Search
Techniques
Ming Li CRC Chair in Bioinformatics,
Professor Computer Science. U. Waterloo
2

I want to teach you how simple beautiful ideas
can transform a field.

3
Homology Search

Definition Given two DNA sequences, find all
similar regions (local homologies, with gapped
alignments).
Formalize Edit distance --- insert, delete,
mutate.
Applications This is the single most important
task in bioinformatics. Also comparative
genomics, assembly
Smith-Waterman dynamic programming, Blast, FASTA,
MegaBlast, SENSEI, WU-Blast, psi-Blast, SIM,
BLAT, MUMmer, Quasar, etc.
PatternHunter

4
Importance

A significant fraction of worlds super-computing
power is doing homology search.
NCBI Blast server serves 100,000 users each day.
10-15 increase each month.
Blast paper is referenced 100,000 times.
These have two implications
Big demands.
Current software are too slow.

5
Scalability of Software

The trend of genetic data growth
Genomes Human, Rice, Mouse, Fly, plus many.
Software must be scalable to large datasets.

30 billion in year 2005
6
History

1970s Smith-Waterman dynamic programming full
sensitivity, slow
1980-90s BLAST, FASTA heuristics fast, but not
sensitive
PatternHunter Smith-Waterman sensitivity, BLAST
speed. (Or BLAST sensitivity, many times faster.)

7
Bioinformatics Companies living on Blast

Paracel (Celera)
TimeLogic
TurboGenomics (TurboWorx)
NSF, NIH, pharmaceuticals proudly support many
supercomputing centers mainly for homology search
Hardware high cost become obsolete in 2 years.

8
Outline

Review dynamic programming and heuristics like
Blast
Optimized spaced seeds
Mathematical theory why they are better
How to compute them
NP completeness
Data models, coding regions, HMM
Multiple optimized spaced seeds
How to compute them
Approach Smith-Waterman sensitivity

9
Dynamic Programming

Longest Common Subsequence (LCS).
Vv1v2 vn
Ww1w2 wm
s(i,j) length of LCS
of V1..i and W1..j
Dynamic Programming
s(i-1,j)
s(i,j)max s(i,j-1)
s(i-1,j-1)1,viwj

Sequence Alignment (Needleman-Wunsch, 1970)
s(i,j)max
s(i-1,j) d(vi,-)
s(i,j-1)d(-,wj)
s(i-1,j-1)d(vi,wj)
0 (Smith-Waterman, 1981)
No affine gap penalties,
where d, for proteins, is either PAM or BLOSUM
matrix. For DNA, it can be match 1, mismatch -1,
gap -3 (In fact open gap -5, extension -1.)
d(-,x)d(x,-) -a, d(x,y)-u. When a0,
uinfinity, it is LCS.

10
Misc. Concerns

Local sequence alignment, add si,j0.
Gap penalties. For good reasons, we charge first
gap cost a, and then subsequent continuous
insertions blta.
Space efficient sequence alignment. Hirschberg
alg. in O(n2) time, O(n) space.
Multiple alignment of k sequences O(nk)

11
Blast

Popular software, using heuristics. By Altschul,
Gish, Miller, Myers, Lipman, 1990.
E(xpected)-value e dmne -lS, here S is score, m
is database length and n is query length.
Meaning e is number of hits one can expect to
see just by chance when searching a database of
size m.
Basic Strategy For all 3 a.a. (and closely
related) triples, remember their locations in
database. Given a query sequence S. For all
triples in S, find their location in database,
then extend as long as e-value significant.
Similar strategy for DNA (7-11 nucleotides). Too
slow in genome level.

12
Blast Algorithm

Find seeded matches
Extent to HSPs (High Scoring Pairs)
Gapped Extension, dynamic programming
Report all local alignments

13
Here is an example of a BLASTP match (E-value 0)
between gene 0189 in C. pneumoniae and gene 131
in C. trachomatis. Query CPn0189
Score
(bits) E-value Aligned with CT131 hypothetical
protein 1240
0.0 Query 1 MKRRSWLKILGICLGSSIVLGFLIFLPQLLSTE
SRKYLVFSL I HKESGLSCSAEELKISW 60
MKR W KI G L L L LP SES KYL
SKEGL EL SW Sbjct 1
MKRSPWYKIFGYYLLVGVPLALLALLPKFFSSESGKYLFLSVLNKETGLQ
F EIEQLHLSW 60 Query 61 FGRQTARKIKLTG-EAKDEVFS
AEKFELDGSLLRLL I YKKPKGITLSGWSLKINEPASID 119
FG QTAKI G EFAEK GSL
RLLY PK TLGWSLIE S Sbjct 61
FGSQTAKKIRIRGIDSDSEIFAAEKI IVKGSLPRLLL
YRFPKALTLTGWSLQIDESLSMN 120 Etc. Note Because of
powerpoint character alignment problems, I
inserted some white space to make it look more
aligned.
14
PatternHunter(Ma, Tromp, Li Bioinformatics,
183, 2002, 440-445)

Hundred times faster than Blastn at same
sensitive level.
Scalable, portable, memory-friendly,
super-computer tasks on a PC, written in Java
any genome anywhere.
Used in Mouse Genome Consortium (Nature, Dec. 5,
2002), as well as in hundreds of institutions and
industry.
PatternHunter demo.

15
Comparison with Blastn, MegaBlast

On Pentium III 700MH, 1GB
Blastn MB
PH
E.coli vs H.inf 716s 5s/561M
14s/68M
Arabidopsis 2 vs 4 -- 21720s/1087M
498s/280M
Human 21 vs 22 -- --
5250s/417M
61M vs 61M
3hr37m/700M
100M vs 35M
6m
Human (3G) vs Mouse (x39G) 20
days
All with filter off and identical parameters
16M reads of Mouse genome against Human genome
for MIT Whitehead. Best Blast program takes 19
years at the same sensitivity (seed length 11).

16
Quality Comparisonx-axis alignment
ranky-axis alignment scoreboth axes in
logarithmic scale
A. thaliana chr 2 vs 4
E. Coli vs H. influenza
17
(No Transcript)
18
(No Transcript)
19
(No Transcript)
20
Genome Alignment by PatternHunter (4 seconds)
21
New thinking

Three lines of existing approaches
Smith-Waterman exhaustive dynamic programming.
Blast family find seed matches (11 for Blastn,
28 for MegaBlast), then extend. Dilemma
increasing seed size speeds up but loses
sensitivity descreasing seed size gains
sensitivity but loses speed.
Suffix tree MUMmer, Quasar, REPuter. Only good
for precise matches, highly similar sequences.
Blast approach is the only way to deal with real
and large problems, but we must to solve Blast
dilemma We need a way to improve sensitivity and
speed simultaneously.
Goals Improve (a) Blast, (b) Smith-Waterman.

22
Deterministic Optimized Space Seed

Blastn finds a match of length 11, then extend
from there. MegaBlast, in order to increase
speed, increases this to 28.
Spaced Seed PatternHunter looks for matches of
11 nonconsecutive matches and optimized such
seeding scheme.
Example 111010010100110111 is weight 11 optimal
seed (Blast default seed is 11111111111).
This seemingly simple change makes a big
difference significantly increases hit to
homologous region while reducing bad hits.

23
Sensitivity PH weight 11 seed vs Blast 11 10
24
PH 2-hit sensitivity vs Blastn 11, 12 1-hit
25
Literature

Random multiple spaced q-grams were used in the
following work
FLASH by Califano and Rigoutsos
Multiple filtration by Pevzner and Waterman
LSH of Buhler
Praparata et al
Recent work on spaced seeds theory and
algorithms
Ma, Tromp, Li original PH paper.
Choi and Zhang x 2
Brejova, Brown, Vinar x 2
Buhler, Keich, Sun
Keich, Li, Ma, Tromp
Li, Ma, Kisman, Tromp

26
Expect Less, Get More

Lemma The expected number of hits of a weight W
length M seed model within a length L region with
similarity p is
(L-M1)pW
Proof The expected number of hits is the sum,
over the L-M1 possible positions of fitting the
seed within the region, of the probability of W
specific matches, the latter being pW.
Example In a region of length 64 with 0.7
similarity, PH has probability of 0.466 to hit vs
Blast 0.3, 50 increase. On the other hand, by
above lemma, Blast expects 1.07 hits, while PH
0.93, 14 less.

27
Why are the spaced seeds better?

A wrong (but intuitive) proof seed s, interval
I, similarity p
E(hits in I)
Prob.(s hits in I) E(hits in I s hits in I)
Thus Prob.(s hits in I)
Lpw / E(hits in I s hits in I)
For optimized spaced seed
111010010100110111 Non overlap
Prob
111010010100110111 6
p6
111010010100110111 6
p6
111010010100110111 6
p6
111010010100110111 7
p7
..
For spaced seed the divisor is 1p6p6p6p7
For Blast seed the divisor is bigger 1 p p2
p3

28
Why are the spaced seeds better

Theorem 1. Let I be a homologous region, homology
level p. For any sequence 0i0 lt i1 lt in-1
I-s, let Aj be the event a spaced seed s hits
I at ij, Bj be event the consecutive seed hits I
at j.
(1) P(Ujltn Aj) P(Ujltn Bj), when
ijj, gt holds.
Proof. Induction on n. For n1, P(A0)pWP(B0).
Assume the theorem holds for nN, we prove it
holds for N1. Let Ek denote the event that, when
putting a seed at I0, 0th, 1st k-1st 1s in
the seed all match the k-th 1 mismatches.
Clearly, Ek is a partition of sample space for
all k0, .. , W, and P(Ek)pk(1-p) for any seed.
Thus it is sufficient to show, for kW
(2) P(Uj0..N AjEk)P(Uj0..NBjEk)
When kW, both sides of above equal 1. For
kltW since (UjkBj)?EkF and Bk1,Bk2, BN are
mutually independent of Ek, we have
(3) P(Uj0..NBjEk)P(Ujk1..NBj)
Now consider the first term in (2). For each
k in 0, W-1, at most k1 of the events Aj
satisfy Aj?EkF because Aj?EkF iff when aligned
at ij seed s has a 1 bit at overlapping k-th bit
when the seed was at 0 there are at most k1
choices. Thus, there exist indices 0ltmk1ltmk2
ltmN N such that Amj?Ek?F. Since Ek means all
previous bits matched, it is clear that Ek is
non-negatively correlated with Ujk1..N Amj,
thus
(4) P(Uj0..N
AjEk)P(Ujk1..N AmjEk)P(Ujk1..NAmj)
The inductive hypothesis yields
P(Ujk1..N Amj)
P(Ujk1..NBj)
Combined with (3),(4), this proves (2),
hence part of (1) of the theorem.

29
To prove when ijj, P(Uj0..n-1Aj)gtP(Uj0..n-1Bj)
--- (5)

We prove (5) by induction on n. For n2, we have
P(Uj0,1Aj)2pw p2w-Shift1gt2pw-pw1P
(Uj0,1Bj) ()
where shift1number of overlaped bits of the
spaced seed s when shifted to the right by 1 bit.
For inductive step, note that the proof of (1)
shows that for all k0,1, ,W,
P(Uj0..n-1AjEk)P(Uj0..n-1BjEk). Thus to
prove (5) we only need to prove that
P(Uj0..n-1AjE0)gtP(Uj0..n-1BjE0).
The above follows from the inductive hypothesis
as follows
P(Uj0..n-1AjE0)P(Uj1..n-1Aj)gtP(Uj1..n-1Bj)P
(Uj0..n-1BjE0).

30
Corollary and Open Question

Corollary Assume I is infinite. Let ps and pc be
the first positions a spaced seed and the
consecutive seed hit I, respectively. Then Eps
lt Epc.
Proof. EpsSk0..8P(psgtk)
Sk0..8(1-P(Uj0..kAj))
lt Sk0..8(1-P(Uj0..kBj))
Epc.
Open question Quantitative estimate?

31
Theorem 2. It is NP hard to find the optimal
seed(PH II paper)

Proof. Reduction from 3-SAT. Given a 3-SAT
instance C1 Cn, on Boolean variables x1, ,
xm, where Ci l1 l2 l3, each l is some xi or
its complement. We reduce this to seed selection
problem. The required seed has weight Wm2 and
length L2m2, and is intended to be of the form
1(0110)m1, a straightforward variable assignment
encoding. The i-th pair of bits after the initial
1 is called the code for variable xi. A 01 code
corresponds to setting xi true, while a 10 means
false. To prohibit code 11, we introduce, for
every 1 i m, a region
Ai1 (11)i-101(11)m-i 10m1
1(11)i-110(11)m-i 1.
Since the seed has m 0s, it cannot bridge the
middle 0m1 part, and is thus forced to hit at
the start or at the end. This obviously rules out
code 11 for variable xi. Code 00 automatically
becomes ruled out as well, since the seed must
distribute m 1s among all m codes. Finally, for
each clause, we introduce a region of the form
R1(11)a-1 c1(11)m-a 1 000
1(11)b-1c2(11)m-b 1 000 1(11)c-1c3(11)m-c 1.
The total length of R is (2m2)3(2m2)3(2m2)
6m12 it consists of three parts each
corresponding to a literal in the constraint,
with ci being the code for making that literal
true. It's easy to see that a properly formed
seed (which lacks 3 consecutive 0s) hits R iff
the assignment encoded by the seed satisfies the
clause.

32
Alg. DPFinding the best spaced seeds

Let f(i,b) be the probability that seed s hits
the length i prefix of R that ends with b.
Thus, if s matches b, then
f(i,b) 1,
otherwise we have the recursive relationship
f(i,b) (1-p)f(i-1,0b') pf(i-1,1b')
where b' is b deleting the last bit.
Then the probability of s hitting R is
SbM Prob(b) f(L-M,b).
Choose s with the highest probability.

33
Coding Region HMM

The Hidden Markov Model is a finite set
of states, each of which is associated with a
probability distribution. Transitions among the
states are governed by a set of probabilities
called transition probabilities. In a particular
state an outcome or observation can be generated,
according to the associated probability
distribution. It is only the outcome, not the
state visible to an external observer and
therefore states are hidden'' to the outside
hence the name Hidden Markov Model. An HMM
completely has the following components
satisfying usual probability laws
The number of states of the model, N.
The number of observation symbols in the
alphabet, M.
A set of state transition probabilities,
depending on current state.
A probability distribution of the observable
symbol at each of the states.

34
Modeling coding region with HMM

PatternHunter original assumption I is a
sequence of N independent Bernoulli variables X0,
XN-1 with P(Xi)p.
Coding region third position usually is less
conserved. Skipping it should be a good idea
(BLAT 110110110). With spaced seeds, we can model
it using HMM.
Brejova, Brown, Vinar M(3) HMM I is a sequence
of N independent Bernoulli random variables
X0,X1, XN-1 where Pr(Xi1)pi mod 3. In
particular
p0
p1 p2
human/mouse 0.82 0.87 0.61
human/fruit fly 0.67 0.77 0.40
DP algorithm naturally extends to M(3).

35
Picture of M(3)
Extending M(8), Brejova-Brown-Vinar proposed
M(8), above, to model Dependencies among
positions within codons.
From Brejova-Brown-Vinar
36
Sensitivity of all seeds Under 4 models
From Brejova-Brown-Vinar
37
Multiple Spaced Seeds Approaching Smith-Waterman
Sensitivity

The biggest problem for Blast was low sensitivity
(and low speed). Massive parallel machines are
built to do Smith-Waterman exhaustive dynamic
programming.
Spaced seeds give PH a unique opportunity of
using several optimal seeds to achieve optimal
sensitivity, this was not possible for Blast
technology.
Economy --- 2 seeds (½ time slow down) achieve
sensitivity of 1 seed of shorter length (4 times
speed). --- note, added
With multiple optimal seeds. PH II approaches
Smith-Waterman sensitivity, and 3000 times
faster.
Finding optimal seeds, even one, is NP-hard.
Experiment 29715 mouse EST, 4407 human EST.
Sensitivity and speed comparison next two slides.

38
Finding Multiple Seeds (PH II)

Computing the hit probability of given k seeds on
a uniformly distributed random region is NP-hard.
Finding a set of k optimal seeds to maximize the
hit probability cannot be approximated within
1-1/e unless NPP
Given k seeds, algorithm DP can be adapted to
compute their sensitivity (probability one of the
seeds has a hit).
Greedy Strategy
Compute one optimal seed a. Sa
Given S, find b maximizing hit probability of S U
b.
Coding region, use M(3), 0.8, 0.8, 0.5 (for mod 3
positions)
We took 12 CPU-days on 3GHz PC to compute 16
weight 11 seeds. General seeds (first 4)
111010010100110111, 111100110010100001011,
11010000110001010111, 1110111010001111.

39
Sensitivity Comparison with Smith-Waterman (at
100) The thick dashed curve is the sensitivity
of Blastn, seed weight 11. From low to high,
the solid curves are the sensitivity of PH II
using 1, 2, 4, 8 weight 11 coding region seeds,
and the thin dashed curves are the sensitivity
1, 2, 4, 8 weight 11 general purpose seeds,
respectively
40
Speed Comparison with Smith-Waterman

Smith-Waterman (SSearch) 20 CPU days.
PatternHunter II with 4 seeds 475 CPU seconds.
3638 times faster than Smith-Waterman dynamic
programming at the same sensitivity.

41
PHunterViewer --- Arabidopsis Chr 2 vs Chr 4
42
Selecting Interesting Region
43
Amplify the region
44
Focus further
45
Sorted View
46
Another View
47
Running PHDownload at www.BioinformaticsSolution
s.com

java jar ph.jar i query.fna j subject.fna o
out.txt b multi 4
-Xmx512m --- for large files
-j missing query.fna self-comparison
-db multiple sequence input, 0,1,2,3 (no, query,
subject, both)
-W seed weight
-G open gap penalty (default 5)
-E gap extension (default 1)
-q mismatch penalty (default 1)
-r reward for match (default 1)
-model specify model in binary
-H hits before extension
-P show progress
-b Blast output format
-multi number of seeds to be used (default 1
max 16)

48
Summary and Open Questions

Simple ideas 0 created SW, 1 created Blast. 1
0 created PatternHunter.
Good spaced seeds help to improve sensitivity and
reduce irrelevant hit.
Multiple spaced seeds allow us to improve
sensitivity to approach Smith-Waterman, not
possible with BLAST.
Computing spaced seeds by DP, while it is NP
hard.
Proper data models help to improve design of
spaced seeds (HMM)
Open Problems (a) A mathematical theory of
spaced seeds. I.e. quantitative statements for
Theorem 1.
(b) Applications to other fields.

49
Assignment 1

Problem 1. Why is the proof on page 27 wrong?
Problem 2. Prove () on page 29
Problem 3. Write a program to compute the optimal
seed, weight 11, length 18, for a uniform random
region of length 64 at homology level 0.7.
Estimate Prob(Blast hit optimal seed does not
hit) for the same region, for different homology
levels.

50
Smaller (more solvable) questions (perhaps good
for course projects)

Multiple protein seeds for Smith-Waterman
sensitivity?
Applications in other fields, like internet
document search?
Prove Theorem 1 for other models for example,
M(3) model 0.8,0.8,0.5 together with specialized
seeds.
More extensive study of Problem 3.

Write a Comment

User Comments (0)

About PowerShow.com

CS882' Chapter 1' Modern Homology Search Techniques - PowerPoint PPT Presentation

CS882' Chapter 1' Modern Homology Search Techniques

... WU-Blast, psi-Blast, SIM, BLAT, MUMmer, Quasar, etc. ... Suffix tree: MUMmer, Quasar, REPuter. Only good for precise matches, highly similar sequences. ... – PowerPoint PPT presentation