PatternHunter: faster and more sensitive homology search

About This Presentation

Description:

PatternHunter: faster and more sensitive homology search, by Bin Ma, John Tromp and Ming Li. B92902019 B92902033 B92902039

Slides: 120

Transcript and Presenter's Notes

Title: PatternHunter: faster and more sensitive homology search

1
PatternHunter: faster and more sensitive homology search

By Bin Ma, John Tromp and Ming Li

B92902019, B92902033, B92902039, B92902072, B92902086, B92902087
2
Agenda

PatternHunter
  Spaced Seed
  Algorithm
  Performance
PatternHunter II
  Algorithm
  Performance
Translated PatternHunter

3
PatternHunter Spaced Seed
4
Outline

A short review of BLAST.
Some definitions and background.
What are the differences and similarities between BLAST and PatternHunter?
Why is PatternHunter better?
Nonconsecutive seeds
Proof

5
BLAST Algorithm

Find seeded matches
Extend to HSPs (High-scoring Segment Pairs)
Gapped Extension, dynamic programming
Report significant local alignments

6
A short review about BLAST

Find hits.
BLAST first scans the database for words that
score at least T when aligned with some word
within the query sequence. Any aligned word pair
satisfying this condition is called a hit.

7
A short review about BLAST

Find HSPs
An HSP (High-scoring Segment Pair) is much longer
than a single word pair, and may therefore
entail multiple hits on the same diagonal within
a relatively short distance of one another.

8
A short review about BLAST

Generate gapped alignment
This means that two or more HSPs in BLAST with
scores well below 38 bits can, in combination,
rise to statistical significance. If any one of
these HSPs is missed, so may be the combined
result.

9
A short review about BLAST

In summary, the new gapped BLAST algorithm
requires two non-overlapping hits of score at
least T, within a distance A of one another, to
invoke an ungapped extension of the second hit.
If the HSP generated has a normalized score of at least S_g
bits, then a gapped extension is triggered.

10
Some definitions, some background

Similarity
How similar are two sequences?
Usually this means the probability that the two
sequences carry the same symbol at any given position.
Sensitivity
The probability of finding a local alignment.
Specificity
Among all local alignments found, the fraction that
are truly homologous.

11
Define the Seed
Reference Bin Ma, John Tromp, Ming Li
Bioinformatics Vol. 18 no. 3 2002

Defining the seed
w -> weight, or number of positions to match
Blastn: 11, MegaBlast: 28
model -> relative positions of the w letters to match
m -> length of the model window

12
Seed Parameters
w = 11
letters: 0, 1

1 1 1 0 1 0 0 1 0 1 0 0 1 1 0 1 1 1

m = 18

model
1 = exact match required; 0 = no match required (any value)
This is PatternHunter's most sensitive model
The Blastn seed is all 1s
13
Seed, Hit, Homology

What is a seed?
Seeds determine how an algorithm looks for hits
What is a hit?
Hits indicate a similarity that may indicate a
homology

14
hit
GCNTACACGTCACCATCTGTGCCACCACNCATGTCTCTAGTGATCCCTCA
TAAGTTCCAACAAAGTTTGC

GCCTACACACCGCCAGTTGTG-TTCCTGCTATGTCTCTAGTGAT
CCCTGAAAAGTTCCAGCGTATTTTGC GAGTACTCAACACCAACATTGA
TGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----

GAATACTCAACAGCAACATCAAC
GGGCAGCAGAAAATAGGCTTTGCCATCACTGCCATTAAGGATGTGGG -
-----------------TGTTGAGGAAAGCAGACATTGACCTCACCGAGA
GGGCAGGCGAGCTCAGGTA

TTGACAGTACACTCATAGTGTTGAGGAAAGCTGACGTTGACCTCACC
AAGTGGGCAGGAGAACTCACTGA GGATGAGGTGGAGCATATGATCACC
ATCATACAGAACTCAC-------CAAGATTCCAGACTGGTTCTTG

GGATGAGATGGAACGTGTGATGACCAT
TATGCAGAATCCATGCCAGTACAAGATCCCAGACTGGTTCTTG
Human-Mouse genome homology
15
Example

Consider the following two sequences
GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT
GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT
What is the difference between BLAST and
PatternHunter in finding a seed hit here?

16
BLAST uses consecutive seeds

In BLAST, the consecutive model with weight 11 is
typically used.
GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT
GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT
11111111111
However, it fails to find the alignment between the
two sequences: the longest run of consecutive matches
is 9, so the seed fits nowhere.

17
Consecutive seeds

There is also a dilemma for BLAST-type searches.
Dilemma
Sensitivity needs shorter seeds
  but these give too many random hits and slow computation
Speed needs longer seeds
  but these lose distant homologies

18
PatternHunter uses non-consecutive seeds

In PatternHunter, the spaced model with weight 11
and length 18 is typically used.
GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT
GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT
       111010010100110111
Placed at offset 7 (counting from 0), every 1 of the
model falls on a matching position, so the seed hits.
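The comparison on the two example sequences can be checked with a short sketch. The function name `seed_hits` and the 0-based offsets are choices made for this illustration, not part of PatternHunter itself; the sketch only tests the ungapped, same-offset alignment of the two strings.

```python
def seed_hits(seq_a, seq_b, model):
    """Return the 0-based offsets at which `model` hits the ungapped
    alignment of seq_a and seq_b (same offset in both sequences)."""
    # Positions inside the model window where a match is required.
    required = [k for k, flag in enumerate(model) if flag == "1"]
    span = len(model)
    n = min(len(seq_a), len(seq_b))
    return [
        i
        for i in range(n - span + 1)
        if all(seq_a[i + k] == seq_b[i + k] for k in required)
    ]


S1 = "GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT"
S2 = "GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT"
BLAST_SEED = "1" * 11                 # consecutive model, weight 11
PH_SEED = "111010010100110111"        # spaced model, weight 11, length 18

print("consecutive seed hits:", seed_hits(S1, S2, BLAST_SEED))
print("spaced seed hits:     ", seed_hits(S1, S2, PH_SEED))
```

The consecutive model finds nothing on this pair, while the spaced model finds one hit, which is the point the slides make.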

19
Consecutive vs. Nonconsecutive?

The non-consecutive seed is the primary
difference and strength of PatternHunter
Blastn:
1 1 1 1 1 1 1 1 1 1 1
PatternHunter:
1 1 1 0 1 0 0 1 0 1 0 0 1 1 0 1 1 1

20
A trivial comparison between spaced and
consecutive seeds
Reference: Ming Li, NHC2005

Consider 111 and 1101.
To make seed 111 fail, we can use
110110110110...
which has 66.66% similarity.
But we can prove that seed 1101 hits every
sufficiently long region with 61% similarity.

21
Proof

Suppose there is a length-100 region which is not
hit by 1101.
We can break the region into blocks of the form 1^a 0^b.
Apart from the last block, every block falls into
one of the following cases:
1 0^b for b >= 1
11 0^b for b >= 2
111 0^b for b >= 2
In each such block, the similarity is <= 3/5.
The last block has at most 3 matches.
So in total there are at most 61 matches in 100
positions. The similarity is <= 61%.
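The bound in the proof can be cross-checked numerically. The sketch below is not from the slides: it is a small dynamic program (the name `max_matches_without_hit` is made up here) that tracks the last len(seed) - 1 bits and returns the largest number of matches a region can contain while the seed still misses it.

```python
from itertools import product


def max_matches_without_hit(seed, length):
    """Largest number of 1s in a binary string of the given length
    that the seed does not hit anywhere."""
    m = len(seed)
    required = [k for k, flag in enumerate(seed) if flag == "1"]
    if length < m:
        return length  # the seed does not fit, so all-ones is allowed
    # State: the last m - 1 bits. Value: best count of 1s so far.
    best = {bits: sum(bits) for bits in product((0, 1), repeat=m - 1)}
    for _ in range(length - (m - 1)):
        nxt = {}
        for state, ones in best.items():
            for bit in (0, 1):
                window = state + (bit,)
                if all(window[k] for k in required):
                    continue  # the seed would hit this window
                key = window[1:]
                if ones + bit > nxt.get(key, -1):
                    nxt[key] = ones + bit
        best = nxt
    return max(best.values())


for seed in ("111", "1101"):
    worst = max_matches_without_hit(seed, 100)
    print(seed, "can miss a length-100 region with", worst, "matches")
```

For seed 111 and length 12 this reproduces the 8 matches of 110110110110; for seed 1101 the result can be compared against the 61-match bound above.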

22
Formalize

Given an i.i.d. sequence (homology region) with
Pr(1) = p and Pr(0) = 1 - p for each bit:
1100111011101101011101101011111011101
Which seed is more likely to hit this region?
BLAST seed: 11111111111
Spaced seed: 111010010100110111
23
Expect Less, Get More

Lemma: The expected number of hits of a weight-W,
length-M seed model within a length-L region with
homology level p is
(L - M + 1) p^W
Proof. E(hits) = \sum_{i=1}^{L-M+1} p^W = (L - M + 1) p^W
Example: In a region of length 64 with p = 0.7
Pr(BLAST seed hits) = 0.30
E(# of hits by BLAST seed) = 1.07
Pr(optimal spaced seed hits) = 0.466, 50% more
E(# of hits by spaced seed) = 0.93, 14% less
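The two expected values in the example follow directly from the lemma. A minimal sketch (the function name `expected_hits` is made up for this illustration):

```python
def expected_hits(region_len, seed_len, weight, p):
    """Expected number of seed hits: (L - M + 1) * p**W."""
    return (region_len - seed_len + 1) * p ** weight


# Region of length 64, similarity 0.7, both seeds have weight 11.
blast = expected_hits(64, 11, 11, 0.7)    # consecutive seed, M = 11
spaced = expected_hits(64, 18, 11, 0.7)   # spaced seed, M = 18
print(f"BLAST seed:  {blast:.2f}")
print(f"spaced seed: {spaced:.2f}")
```

The spaced seed has fewer placements (47 instead of 54), so its expected hit count is lower, even though its probability of hitting at least once is higher.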

24
Why Is Spaced Seed Better?

A wrong, but intuitive, proof: seed s, interval
I, similarity p
E(hits) = Pr(s hits) x E(hits | s hits)
Thus
Pr(s hits) = L p^w / E(hits | s hits)
For the optimized spaced seed, E(hits | s hits) is built
from shifted copies of the seed:
Shift   Non-overlapping 1s   Prob
1       6                    p^6
2       6                    p^6
3       6                    p^6
4       7                    p^7
...
For the spaced seed the divisor is 1 + p^6 + p^6 + p^6 + p^7 + ...
For the BLAST seed the divisor is bigger: 1 + p + p^2 + p^3 + ...

25
Simulated sensitivity curves
26
Observations of spaced seeds

Seed models with different shapes can detect
different homologies.
Two consequences:
Some models may detect more homologies than
others
  -> more sensitive homology search: PatternHunter I
Several seed models can be used simultaneously to hit
more homologies
  -> approaching 100% sensitive homology search: PatternHunter II

27
PatternHunter Algorithm Performance
28
Outline

Hit generation
Hit extension
Gapped extension
Performance

29
Hit generation

Index created for each position in the query
sequence

30
Hit generation

Similar to MegaBlast: hash tables
Encode A, T, C, G as the binary codes
00, 01, 10, 11 respectively
Compute the seed value at each position of one of the
sequences and record the offsets in the hash table

31
Hit generation

An example
Now we want to find hits between sequences S and
T

32
Spaced seed

For sequence T
Encoding: A = 00, T = 01, C = 10, G = 11
Scan:  A T A T G C A T
Model: 1 1 0 1 0 1 1 0
Seed:  A T T C A -> 0001011000 = 88
Weight = 5, so the value is between 0 and 2^10 - 1
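The encoding on this slide and the table lookup that follows can be sketched as below. The helper names (`spaced_key`, `build_index`, `find_hits`) are invented for this illustration, a Python dict stands in for the fixed-size array of 4^w entries, and letters other than A, T, C, G (such as N) are not handled.

```python
from collections import defaultdict

# 2-bit codes as given on the slide: A=00, T=01, C=10, G=11.
CODE = {"A": 0, "T": 1, "C": 2, "G": 3}


def spaced_key(seq, start, model):
    """Pack the letters under the 1s of the model into one integer."""
    key = 0
    for k, flag in enumerate(model):
        if flag == "1":
            key = (key << 2) | CODE[seq[start + k]]
    return key


def build_index(t, model):
    """Map each key to the list of positions in T where it occurs."""
    index = defaultdict(list)
    for pos in range(len(t) - len(model) + 1):
        index[spaced_key(t, pos, model)].append(pos)
    return index


def find_hits(s, t, model):
    """Return (position in S, position in T) pairs with equal keys."""
    index = build_index(t, model)
    hits = []
    for pos_s in range(len(s) - len(model) + 1):
        for pos_t in index.get(spaced_key(s, pos_s, model), []):
            hits.append((pos_s, pos_t))
    return hits


MODEL = "11010110"
print(spaced_key("ATATGCAT", 0, MODEL))           # key for A T T C A
print(find_hits("ATCTCCAG", "ATATGCAT", MODEL))   # differs only under 0s
```

In the second call the two strings differ at three positions, but all of them fall under 0s of the model, so the keys are equal and a hit is reported.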
33
After filling in the hash table

Index   Positions in T
0       10 19 34
1       (NULL)
2       14
3       10 48 134
...
87      ...
88      2 8 33
...

For each position in S
1. Calculate the int value
2. Find hits in S by the lookup value

34
Hash table space required

4^w integers + |T| integers
Total: 4^(w+1) + 4|T| bytes

Index   Positions in T
0       34 19 10
1       (NULL)
2       14
3       134 48 10
...
87      ...
88      33 8 2
...
35
Does it cost a lot to build the hash table?

If the number of hits found for one index is
large, the cost of computing the index is relatively
negligible.

36
Hit extension

HSP: High-scoring Segment Pair
Scan those hits with a window, and choose the
highest-scored one.

37
Hit extension
S
The chosen hit
T
38
Hit extension

Set the midpoint of the chosen hit as the cut
point, and split the graph into 4 quadrants

39
Hit extension
S
T
40
Hit extension

Then run Smith-Waterman in 2 of the 4 quadrants,
until the score reaches the dropoff.

41
Hit extension
S
Smith-Waterman
Cost: 1/2 O(mn)
Smith-Waterman
T
42
Hit extension

If the resulting segment pair has a score below
a certain minimum, ignore it.
Otherwise we have an HSP and move on to the next step,
gapped extension.

43
Hit extension

A question: when extending in 2 directions, how
do we synchronize the score?

44
Gapped Extension

To find the best way to extend an HSP to the left
across gaps.
To extend an HSP we try all candidates from a
diagonal-sorted set.
Penalties for gap open, gap extension, and cropping

45
Gapped Extension
Search front
46
From left to right
Optimal Left
Too Far Right
Too Far Right
Optimal Left
47
From left to right
Optimal Left
Too Far Right
48
Descriptions in the paper

We use a red-black tree for this.
Insert HSP when the optimal alignment to its left
is found
An HSP is retired from the tree once newly generated HSPs
are too far beyond its right endpoint to make use
of it.

49
Thought 1

The first one will be inserted -> fast

50
Thought 1

May not find the best one

End
Start
Better
Worse
51
Thought 2

Insert HSP when the optimal alignment to its left
is found

Not complete HSP
52
Thought 2
Insert both HSPs
Far but long (Good)
Close but short (Bad)
Next turn
53
Thought 1

Retired alignments are put into a priority queue
according to their scores.

Tree 1
Tree 2
54
Performance
Ref. Altschul, S.F. et al. (1997) Nucleic Acids
Res., 25, 3389-3402.
Ref. Bin Ma, John Tromp, Ming Li, Bioinformatics
Vol. 18 no. 3, 2002
55
PatternHunter II
56
Outline

Overview
PatternHunter II design
Computing hit probability
Finding seeds set
Seed performance
PHII performance

57
Overview

PatternHunter: spaced seed
PH2: designed for better sensitivity
Achieve a sensitivity approaching that of Smith-Waterman
with a speed similar to the default Blastn
Extend the single spaced seed to multiple ones
Two main problems
Large memory required for multiple hash tables
Complexity of finding the optimal seed combination

58
PatternHunter II design

A hash table is built for each seed
All hits generated from all hash tables are used
for gap extension
In two-hit mode, two nearby hits can be from
different hash tables

59
PatternHunter II design (cont.)

Large memory problem
Divide into smaller segments
e.g., with k = 8, w = 11, and n = 32 x 10^6,
the hash tables use about 256 MB of
memory
Extend alignments across division boundary
Still may lose alignments

60
Computing hit probability

Use DP, but extend the algorithm from a single seed
to multiple seeds
Definitions
Homologous region R with length L
The substring from i to j is denoted by R[i..j]
A set of k seeds A = {a_1, ..., a_k}
A hits R if there is an a_i that hits R
p is called the similarity level of R if each position
of R is an identity with probability p

61
Computing hit probability (cont.)

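For a binary string b and |b| <= i <= L,
define f(i, b) = Pr[A hits the length-i prefix of R | that prefix ends with b]
The goal is to find f(L, ε), where ε is the empty string
For any i > |b|, we have f(i, b) = (1 - p) f(i, 0b) + p f(i, 1b)
We can compute f(i, b) from other values f(i', b')
computed earlier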

62
Computing hit probability (cont.)

Definition
b is compatible with a seed a if b[|b| - j] = 1
whenever a[|a| - j] = 1, for 0 < j <= min(|a|, |b|)
Define
B as the set of binary strings that are not hit
by A but are compatible with some a in A.
B(x) denotes the longest proper prefix of x that is in B

63
Computing hit probability (cont.)

First, ε is in B
Suppose b is in B; then b is compatible with some
a in A by definition. Therefore, 1b is also
compatible with some a in A
If 1b is not in B, it must be hit by some a in A, so
f(i, 1b) = 1
If 0b is not in B, it cannot be hit by A,
therefore it cannot be compatible with any a in
A, so f(i, 0b) = f(i - |b| + |b'|, 0b'), where 0b' = B(0b)
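For intuition, the hit probability of a seed set can also be computed with a simpler dynamic program than the one described on these slides. The sketch below is an assumption-labelled substitute, not Algorithm DP: instead of the compatible-suffix set B, it keeps the full last (M - 1) bits as state, where M is the longest seed length, so it needs up to 2^(M-1) states and is practical only for short seeds. The name `hit_probability` is made up here.

```python
def hit_probability(seeds, length, p):
    """Probability that at least one seed hits an i.i.d. region of the
    given length in which each position matches with probability p."""
    span = max(len(s) for s in seeds)
    required = [
        (len(s), [k for k, flag in enumerate(s) if flag == "1"]) for s in seeds
    ]
    # State: the most recent (span - 1) bits; value: probability of
    # reaching that state without any hit so far.
    no_hit = {(): 1.0}
    for _ in range(length):
        nxt = {}
        for state, prob in no_hit.items():
            for bit, bit_prob in ((1, p), (0, 1.0 - p)):
                window = state + (bit,)
                n = len(window)
                hit = any(
                    n >= slen and all(window[n - slen + k] for k in ones)
                    for slen, ones in required
                )
                if hit:
                    continue  # this probability mass now counts as "hit"
                key = window[max(0, n - (span - 1)):]
                nxt[key] = nxt.get(key, 0.0) + prob * bit_prob
        no_hit = nxt
    return 1.0 - sum(no_hit.values())


print(hit_probability(["111"], 12, 0.7))
print(hit_probability(["1101"], 12, 0.7))
print(hit_probability(["111", "1101"], 12, 0.7))
```

Restricting the state to strings compatible with some seed, as the slides describe, is what keeps the real algorithm feasible for weight-11 seeds of length up to 21.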

64
Computing hit probability (cont.)
Ref. Li, M. et al. (2004) Comput. Biol., 2,
417-440.
65
Computing hit probability (cont.)

Can also compute k-hits probability
Change f(i,b) to f(i,b,k)
We already have k = 1. By induction, compute each
f(i,b,k) from f(i,b,k-1)

66
Computing hit probability (cont.)
67
Computing hit probability (cont.)

Complexity
It is proved that computing the hit probability
of multiple seeds is NP-hard
The time complexity of the algorithm is which

68
Computing hit probability (cont.)

Implementing Algorithm DP on a PC
It took 0.70 sec to compute the hit probability for a
set of 16 weight-11 seeds with length <= 21 on a
random region of length 64
It took only 0.37 sec for the same number of seeds
and the same length when the weight is changed to 12
The running time largely depends on the maximum
number of 0s in each seed

69
Finding seeds set

We cannot enumerate all possible seed sets and run
Algorithm DP on each
Their number is exponential!
Also, finding the optimal spaced seed set is
proved NP-hard
Use a greedy method

70
Finding seeds set (cont.)

Compute the first seed a_1, which maximizes the hit
probability of the set {a_1}
Then compute the second seed a_2 for the set {a_1,
a_2}. Then a_3, and so on
Compute a_i until
the desired number of seeds is reached, or
the desired hit probability is reached

71
Finding seeds set (cont.)

May not optimize the hit probability
It is still time-consuming
e.g. It took 12 CPU days for a Pentium 4 3GHz PC
to compute a set of 16 weight-11 seeds, each
no longer than 21
It takes much longer if the seeds become
slightly longer
Need a different approach

72
Finding seeds set (cont.)

Suppose we already have N seeds, and C is the
candidate set for the (N+1)-th seed
For each c in C, estimate the hit probability on
m random region samples
m is reasonably large, such as 500
Remove the worst-performing half from C, and
increase m to 2m
Repeat until only one seed is left
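The halving procedure above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: regions are sampled as i.i.d. bits with match probability p, and the function names, the default region length of 64, and the default p of 0.7 are choices made here.

```python
import random


def is_hit(region, seeds):
    """True if any seed hits the 0/1 region at any offset."""
    for s in seeds:
        ones = [k for k, flag in enumerate(s) if flag == "1"]
        for i in range(len(region) - len(s) + 1):
            if all(region[i + k] for k in ones):
                return True
    return False


def estimate_hit_probability(seeds, length, p, samples, rng):
    """Monte Carlo estimate of the hit probability of a seed set."""
    hits = 0
    for _ in range(samples):
        region = [rng.random() < p for _ in range(length)]
        hits += is_hit(region, seeds)
    return hits / samples


def pick_next_seed(current, candidates, length=64, p=0.7, m=500, rng=None):
    """Keep the better half of the candidates, double m, and repeat
    until one candidate is left. `candidates` must be non-empty."""
    rng = rng or random.Random(0)
    pool = list(candidates)
    while len(pool) > 1:
        ranked = sorted(
            pool,
            key=lambda c: estimate_hit_probability(current + [c], length, p, m, rng),
            reverse=True,
        )
        pool = ranked[: (len(pool) + 1) // 2]  # drop the worst half
        m *= 2
    return pool[0]


chosen = pick_next_seed(["1101"], ["1011", "11001", "10101", "1111"], length=16)
print("next seed:", chosen)
```

Cheap, noisy estimates are spent on the many early candidates, and the sample size grows only as the pool shrinks, which is the point of doubling m each round.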

73
Seed performance

Two ways to increase the sensitivity
Increase the number of seeds
Reduce the weight of a single seed
Both increase running time
The sensitivity gained by doubling the number of seeds
is approximately equal to that gained by reducing the
weight of a single seed by 1
At high similarity levels, doubling the number of seeds
achieves better sensitivity

74
Seed performance (cont.)

From low to high
Solid curves: using the first k (k = 1, 2, 4, 8, 16)
weight-11 seeds
Dashed curves: single optimal seeds of weight w (w = 10, 9, 8,
7)

75
Comparison

Sensitivity / Speed
PatternHunter II
Blast
Smith-Waterman algorithm
SSearch

76
SSearch Configuration

Smith-Waterman algorithm
A sub-program in the FASTA package
FASTA package
ftp://ftp.virginia.edu/pub/FASTA/

77
Common Environment

Score scheme
Match: 1
Mismatch: -1
Gap open: -5
Gap extension: -1
Local alignments with scores >= 16

78
Common Environment

DNA sequences
2 sets of human and mouse EST sequences
ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/
month.est_human.Z
month.est_mouse.Z
Pentium IV 3GHz Linux PC

79
Term Explanation

EST
Expressed Sequence Tag
A unique stretch of DNA within a coding region of
a gene that is useful for identifying the gene.
A short sub-sequence of a transcribed sequence.

80
Term Explanation

Coding Regions
Regions of DNA/RNA sequences that code for
proteins. Usually starts with a start codon (ATG)
and ends with a stop codon.
The coding region of a gene is the portion of DNA
that is transcribed into mRNA and translated into
proteins.

81
Repeat Masking

Fact
ESTs contain long runs of identical letters
Especially of A's and T's
Example (will be shown later)
Solution
Turn every run of ten or more identical
letters into N's.
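The masking rule is simple enough to state as code. A minimal sketch, assuming plain upper-case A/C/G/T input (the function name `mask_repeats` is made up here):

```python
import re


def mask_repeats(seq, min_run=10):
    """Replace every run of at least `min_run` identical nucleotides
    with the same number of N's."""
    # One letter followed by at least (min_run - 1) copies of itself.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: "N" * len(m.group(0)), seq)


print(mask_repeats("GATTACA" + "A" * 12 + "CGT"))
print(mask_repeats("GATC" + "T" * 9 + "G"))  # run of 9: left unchanged
```

Masking keeps the sequence length unchanged, so alignment coordinates reported on the masked sequence still refer to the original positions.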

82
SSearch Result

Num of humans EST 4
Num of mouses EST 2005
EST example (show)

83
Optimal Versus Sub-Optimal

Neither PatternHunter nor Blast tries to compute
the optimal alignments for the homologies they
have found.
Q: Why not find the optimal alignments?
A: Use Blast or PH2 to detect the homologies first, then
compute the optimal alignments only for those.

84
Found

SSearch finds a local alignment
with score x
PatternHunter II finds a local alignment
with score >= x/2
Then the homology counts as found for that pair of ESTs

85
Sensitivity Definition

Smith-Waterman
Finds y pairs of ESTs
with local alignment score at least x
Other programs
y' of the y pairs can be found
with alignment score >= x/2
Sensitivity ratio: y' / y
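The ratio can be written down as a small function. The data layout below (dicts keyed by EST pair, holding the best local alignment score per pair) is an assumption made for this sketch; the slides do not say how the scores were stored.

```python
def sensitivity(reference_scores, program_scores, x):
    """Fraction y'/y of the reference pairs (score >= x under
    Smith-Waterman) that the other program finds with score >= x/2."""
    targets = [pair for pair, score in reference_scores.items() if score >= x]
    if not targets:
        raise ValueError("no reference pair reaches score x")
    found = sum(1 for pair in targets if program_scores.get(pair, 0) >= x / 2)
    return found / len(targets)


# Hypothetical scores, for illustration only.
ssearch = {("h1", "m1"): 40, ("h1", "m2"): 30, ("h2", "m1"): 10}
other = {("h1", "m1"): 25, ("h1", "m2"): 9}
print(sensitivity(ssearch, other, x=20))
```

Accepting any score of at least x/2 reflects the earlier remark that neither PatternHunter nor Blast tries to compute the optimal alignment for a homology it has detected.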

86
Blastn Configuration

Version 2.2.6
From NCBI's website
-F F option
To turn off the low-complexity region filtering
Weight 11 seeds
11111111111

87
Speed comparison
88
Sensitivity comparison

From low to high
Dashed: Blastn, seed weight 11
Solid: PH II with 1, 2, 4, 8 seeds of weight 11

89
Compare with other seeds

From left to right
PH II, two weight 11 seeds
PH II, one weight 10 seed
1101100101000101101
HMM model ,

90
Seed Selection

Use heuristic or exponential time algorithms
For general seed selection problem
PTAS
polynomial time approximation scheme

91
Homology Search

Time-consuming
DNA-DNA searches
Blastn
translated DNA-protein searches
tBlastx
tPH
protein-protein searches
Small query and database sizes

92
Conclusion

Optimized spaced seeds
Blastn PH II
Same sensitivity
Speeds up by 5-100 times
Optimized multiple spaced seeds
PH II Smith-Waterman
Approximately same sensitivity
>1000 times faster

93
Translated PatternHunter
94
Outline

What is translated search?
BLAST's translated search
Translated Pattern Hunter
Performance

95
What is translated search?

Translating a DNA sequence into a protein
sequence for alignment with another protein
sequence
But what is translation?

96
What is translation?

In biology, translation means mapping DNA
to amino acids (AA) with a universal genetic
code, three nucleotides (one codon) at a time.
The DNA sequence is first transcribed into an RNA
sequence in which all T's are replaced by U's

97
The Genetic code

We can use translation in homology search since
the genetic code is universal
Degeneracy: some DNA codons map to the same AA
They usually differ in the third position of the codon
Translation is one-way: DNA -> Protein

98
Why do we need translated search?

When a DNA database or a protein database is not
available
Blastx: DNA query, protein database
tBlastn: protein query, DNA database
To find very distant homologies
tBlastx: DNA query and DNA database, both translated
Slowest, but finds functional and structural homology
in addition to sequence homology
Why?

99
Substitution Matrix

Some AAs are similar in their chemical or
physical properties
Not only match/mismatch in substitution anymore!
Stop codon is assigned the most negative score in
BLAST and tPH
PAM (Point Accepted Mutation)
Based on global alignment of closely related
proteins (1% divergence for PAM1)
BLOSUM (BLOck SUbstitution Matrix)
Based on local alignment of divergent proteins
(62% similarity for BLOSUM 62)

100
Substitution Matrix

Short alignments need to be relatively strong to
rise above background noise, so they can only detect
closely related homologies

Query length   Substitution matrix   Gap costs
<35            PAM-30                (9,1)
35-50          PAM-70                (10,1)
50-85          BLOSUM-80             (10,1)
>85            BLOSUM-62             (10,1)
Adapted from the NCBI substitution matrix recommendations
101
BLAST's translated search

The same in tBlast, tBlastn, tBlastx
Aligns the 6-frame translations of the DNA
sequence against another protein sequence

102
Reading Frame of DNA Sequence

The DNA sequence can be read in six reading
frames, three in the forward and three in the
reverse direction.

Open Reading Frame
103
BLAST's translated search

Translate the DNA sequence into all 6 possible
frames
Align each frame against the protein sequence,
just like BLASTp.
The pairs with significant scores are reported
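The six-frame translation step can be sketched as follows. The sketch uses the standard genetic code written in the usual compact T, C, A, G order and works on DNA letters (T rather than U); the function names and the frame labels "+1" to "-3" are choices made for this illustration, and any codon containing a letter outside A, C, G, T is rendered as X.

```python
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon from the first base; '*' marks a stop."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the three forward and three reverse-strand translations."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGCCAAATGG").items():
    print(name, protein)
```

Each of the six protein strings would then be aligned against the protein sequence independently, which is why a frame shift in the middle of a gene splits one homology into separate pieces.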

104
How good is significant?

The expected number of alignments scoring S or
greater between two sequences of lengths m and n is
E = K m n e^{-λS}, or E = m n e^{-S'} with the normalized score S' = λS - ln K
where K and λ, used for normalization, depend on the
sequence composition
A different K and λ are used for each frame
Non-coding sequences tend to yield alignments of
marginal significance

105
Translated PatternHunter

The version of PH for translated search
Compared with PatternHunter, tPH uses very
different algorithms for hit generation and
gapped extensions

106
Hit Generation in tPH

Weight 5 instead of 11
Space complexity: 20^5 table entries, versus 4^11 in PH
Length 6 or 7
Does not require exact matches
Hit all the five pairs have scores 0 and the
total score is above a tolerance T
Use BLOSUM 62
Multiple seeds are used

107
Hit Generation in tPH
Seed 1011, T = 7
[Figure: a query, all possible hits, and the indexed subject; nucleotides shown: AACGUUUUCUACUAGAAAGAGCA]
108
Gapped Extension in tPH

The same as in BLAST?
BLAST can't handle frame shift errors
Huh?

109
Frame Shift Error

When a single nucleotide is deleted or inserted, it causes
the reading frame to shift

AACGUUUUCUACUAGAAAGAGA

BLAST can't detect such variation
It aligns the 6 frames with the subject independently
In fact, most frame shift mutations
completely abolish the protein's function
They are usually lethal

110
Frame Shift Error

In this example
BLAST can only find at most two separated
segments
tPH can connect them with a single deletion of
C
How?

111
Gapped Extension in tPH

tPH regards the DNA sequence as a sequence of
overlapping codons
Use a modified Smith-Waterman algorithm that can
take frame shifts into account; S(i, j) is the maximum of
Substitution: S(i-1, j-3) + s(p_i, n_{j-2..j})
Insertion of 1 nucleotide: S(i, j-1) + frameshift
Insertion of 2 nucleotides: S(i, j-2) + frameshift
Insertion of an AA: S(i, j-3) + gap
Deletion of an AA: S(i-1, j) + gap
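The recurrence can be turned into a small local-alignment scorer. This is a sketch under stated assumptions rather than tPH's actual code: a toy match/mismatch score (5 and -4) stands in for BLOSUM 62, gaps are linear, the frameshift and gap penalties (-1 and -2) are the ones shown on the next slide, and the DNA is written with T instead of U. The function name `frameshift_local_score` is made up here.

```python
# Standard genetic code in compact T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def frameshift_local_score(protein, dna, match=5, mismatch=-4,
                           gap=-2, frameshift=-1):
    """Best local score aligning a protein against DNA, where S[i][j]
    covers the first i residues and the first j nucleotides."""
    rows, cols = len(protein) + 1, len(dna) + 1
    S = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(rows):
        for j in range(cols):
            options = [0]  # local alignment: may restart anywhere
            if i >= 1 and j >= 3:
                aa = CODON_TABLE.get(dna[j - 3:j], "X")
                s = match if aa == protein[i - 1] else mismatch
                options.append(S[i - 1][j - 3] + s)         # substitution
            if j >= 1:
                options.append(S[i][j - 1] + frameshift)    # 1 extra base
            if j >= 2:
                options.append(S[i][j - 2] + frameshift)    # 2 extra bases
            if j >= 3:
                options.append(S[i][j - 3] + gap)           # extra codon
            if i >= 1:
                options.append(S[i - 1][j] + gap)           # missing codon
            S[i][j] = max(options)
            best = max(best, S[i][j])
    return best


clean = "ATGGCCAAATGG"       # codons for M A K W
shifted = "ATGGCCCAAATGG"    # one extra C after the second codon
print(frameshift_local_score("MAKW", clean))
print(frameshift_local_score("MAKW", shifted))
```

With the extra base, a per-frame aligner would see M A in one frame and K W in another; the frameshift move lets the two halves join into one alignment at the cost of a single penalty.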

112
Scoring Scheme
n = GACACUAGAAUCG
p = Asp Arg Tyr Ser
[Figure: dynamic programming matrix for n against p]
Query GAC ACU A-- GAA --- UCG Asp Thr
--- Glu Tyr Ser Subject Asp --- --- Arg Tyr Ser
Substitution: S(i-1, j-3) + s(p_i, n_{j-2..j})
Frameshift (-1): S(i, j-1) or S(i, j-2)
Gap (-2): S(i, j-3) or S(i-1, j)
113
Performance Evaluation