Title: A Table-Driven, Full-Sensitivity Similarity Search Algorithm
1A Table-Driven, Full-Sensitivity Similarity
Search Algorithm
- Gene Myers and Richard Durbin
- Presented by Wang, Jia-Nan and Huang, Yu-Feng
2Outline
- Introduction
- Background
- Preliminary
- Method
- Experiment
3Introduction
- Given a query and a database, perform local alignment
- Smith-Waterman: guaranteed to find all local alignments, but expensive
- BLAST
- FASTA
4Improvement
- Hardware: invest more in computers and CPUs
- Software
- Phil Green's SWAT: appeals to sparsity and some
machine-level coding tricks
- 60% of the dynamic programming matrix has value 0
- Avoids computing most of these unproductive
entries
5- Focus on improving protein similarity searches
- This approach examines and computes only 4% of the
underlying dynamic programming matrix
6Recall
- Sequence alignment
- Local sequence alignment
- Global sequence alignment
- Goal: find the matching path with the highest score
- Table-based computation and dynamic programming
7Dynamic Programming
- Three basic components
- Recurrence relation
- Tabular computation
- Traceback
8Smith-Waterman Method
- Dynamic programming algorithm
- Find the most similar subsequences of two
sequences
- Problem
- Lots of computation → the amount of work becomes enormous
- Programmers → driven crazy, and excited
- Why? → How can it be accelerated?
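To make the cost concrete, here is a minimal sketch of the Smith-Waterman score computation. It is not the presenters' code: the function name and the default match, mismatch, and gap values are illustrative assumptions. Every cell of the (query length) x (target length) table is visited, which is exactly the work the rest of the talk tries to avoid.

```python
def smith_waterman_score(query, target, match=8, mismatch=-5, gap=-20):
    """Best local alignment score under a simple (constant gap) scheme.

    The scoring values are illustrative assumptions, not fixed by the method.
    Rows correspond to target letters, columns to query letters.
    """
    p = len(query)
    prev = [0] * (p + 1)          # previous row of the table
    best = 0
    for t in target:              # one table row per target letter
        cur = [0] * (p + 1)
        for i in range(1, p + 1):
            s = match if query[i - 1] == t else mismatch
            cur[i] = max(
                0,                # start a new local alignment here
                prev[i - 1] + s,  # substitution (diagonal edge)
                prev[i] + gap,    # gap: target letter left unaligned
                cur[i - 1] + gap, # gap: query letter left unaligned
            )
            best = max(best, cur[i])
        prev = cur
    return best
```

Only two rows are kept in memory because the score alone is wanted; recovering the alignment itself (the traceback step) would require keeping the whole table or pointers into it.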
9Background
- Scoring System
- Simple scoring scheme
- Affine gap penalty scoring scheme
- PAM120 (PAMn)
- BLOSUM62 (BLOSUMn)
10Simple Scoring Scheme
- Match (e.g. 8)
- Mismatch (e.g. -5)
- Gap constant penalty (e.g. -20)
11Affine Gap Penalty Scoring Scheme
- Match (e.g. 8)
- Mismatch (e.g. -5)
- Gap symbol (e.g. -5)
- Gap open penalty (e.g. -10)
12PAM
- PAM: Percent Accepted Mutation
- Dayhoff et al. (1978)
- PAM unit
- Evolutionary time corresponding to an average of 1
mutation per 100 residues → 1% accepted
- PAMn
- Relates to mutation probabilities over an evolutionary
interval of n PAM units
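The relation between PAM1 and PAMn can be illustrated with a toy example: in the Dayhoff model, the mutation probability matrix for n PAM units is the PAM1 probability matrix multiplied by itself n times. The two-letter matrix below is invented purely for illustration (real PAM matrices are 20 x 20 and are then converted to log-odds scores), and the function names are assumptions.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Mutation probabilities after n PAM units: the n-th power of pam1."""
    size = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]          # identity matrix = 0 PAM units
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


# Toy two-letter "PAM1": each letter changes with probability 1% per unit.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam120 = pam_n(TOY_PAM1, 120)
```

Each row of the result still sums to 1, and the off-diagonal entries grow with n, which is why a larger n suits more distantly related sequences.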
Some information from http://www.apl.jhu.edu/przytyck/CAMS_2004_1b.pdf
13PAM120
Source: http://eta.embl-heidelberg.de:8000/misc/mat/pam120.html
14BLOSUM62
- BLOSUM: BLOcks SUbstitution Matrix
- Steven Henikoff and Jorja G. Henikoff (1992)
- Paper: "Amino acid substitution matrices from
protein blocks" (PubMed)
- BLOSUMn
- Relates to substitution probabilities observed
between pairs of related proteins, with sequences
above n% identity clustered together
Some information from http://www.apl.jhu.edu/przytyck/CAMS_2004_1b.pdf
15BLOSUM62
16Preliminaries
- Σ: the alphabet over which sequences are composed
- Σ × Σ substitution matrix S giving the score of aligning each pair of symbols
- Uniform gap penalty g > 0
- Query q1 q2 ... qP of P letters
- Target t1 t2 ... tN of N letters
- Threshold T > 0
17Score Table → Edit Graph
Picture source: http://searchlauncher.bcm.tmc.edu/help/Pictures/S-Wexample.gif
19Problem
- Find a high-scoring local alignment between Query
and Target whose path score ≥ T
- Edit graph (Figure 1)
- Limit our attention to prefix-positive paths
- If there is a path of score T or greater in the
edit graph, then there is a prefix-positive path
of score T or greater
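The claim about prefix-positive paths can be checked on a list of edge scores. The sketch below assumes that "prefix-positive" means every non-empty prefix of the path has a strictly positive score; the helper names are hypothetical. Cutting a path just after its lowest non-positive prefix sum leaves a prefix-positive remainder whose score is at least that of the original path.

```python
def is_prefix_positive(scores):
    """True if every non-empty prefix of the path has a positive score."""
    total = 0
    for s in scores:
        total += s
        if total <= 0:
            return False
    return True


def trim_to_prefix_positive(scores):
    """Drop the prefix ending at the last minimal non-positive prefix sum.

    The remaining suffix scores at least as much as the whole path, and all
    of its prefixes are positive.
    """
    total, low, cut = 0, 0, 0
    for k, s in enumerate(scores, start=1):
        total += s
        if total <= low:      # new (or repeated) minimum: cut after edge k
            low, cut = total, k
    return scores[cut:]


path = [-5, 8, 8, -20, 8, 8, 8]          # illustrative edge scores
trimmed = trim_to_prefix_positive(path)  # [8, 8, 8]
```

Here the original path scores 15 and is not prefix-positive, while the trimmed path scores 24 and is, so nothing is lost by searching only prefix-positive paths.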
20Definition
- A set P of index-value pairs (i, v), where i is in [0, P]
21The start and extension tables
- Consider a vertex x in row j of the edit graph of
Query vs. Target
23Start Trimming
- Limiting the dynamic programming to the
startable vertices requires a table Start(w),
where w ∈ Σ^ks
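The idea of a table indexed by target words can be sketched with a deliberately simplified variant. The code below considers only gap-free (diagonal) paths and records, for each word w of length k, the query positions from which aligning the next k query letters against w keeps every prefix score positive. This restriction to diagonal paths is an assumption made for brevity, not the talk's full definition, and `build_start_table` and its parameters are hypothetical names.

```python
from itertools import product


def build_start_table(query, k, score, alphabet):
    """Map each length-k word w to the query positions that can seed a path.

    Simplification (assumption): only substitution edges are considered, so a
    position i qualifies when query[i:i+k] aligned against w has all prefix
    scores strictly positive.
    """
    table = {}
    for letters in product(alphabet, repeat=k):
        w = "".join(letters)
        starts = []
        for i in range(len(query) - k + 1):
            total = 0
            positive = True
            for a, b in zip(query[i:i + k], w):
                total += score(a, b)
                if total <= 0:
                    positive = False
                    break
            if positive:
                starts.append(i)
        table[w] = starts
    return table


def simple_score(a, b):
    """Illustrative scores: +8 for a match, -5 for a mismatch."""
    return 8 if a == b else -5


start = build_start_table("ACGT", 2, simple_score, "ACGT")
```

The table is built once per query; during the scan, the next k target letters are looked up to find which vertices are worth starting from, instead of evaluating every cell.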
24Start Trimming
- Worst case
- Let α be the expected percentage of vertices that
are seeds
25Extension Trimming
- A table that eliminates vertices that are not
extendable
- (i,j) is an extendable vertex iff C(i,j) > Extend(i, Target[j+1..j+ke])
26Extension Trimming
28A Table-Driven Scheme for DP
- Goal: restrict the SW computation to
productive vertices
- Jump table captures the effect of Advance and
Delete over kJ > 0 rows
- Space → unmanageably large
- But only record those for which
29- Jump table
- Start table
- Space-saving version for Jump and Start tables
30- Check for paths scoring T or more
32Recall Affine Gap Penalty
- Score
- Match
- Mismatch
- Gap symbol - gsp
- Gap open penalty - gop
- Affine cost of a gap of length k
- g + kh, where g = gop and h = gsp
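As a small worked example of the affine cost, the sketch below evaluates g + kh. The default values (gop = 10, gsp = 5) reuse the magnitudes of the example penalties given earlier in the talk; the function name is an assumption.

```python
def affine_gap_cost(k, gop=10, gsp=5):
    """Cost g + k*h of a gap of length k, with g = gop and h = gsp."""
    if k < 1:
        raise ValueError("a gap has length at least 1")
    return gop + k * gsp


costs = [affine_gap_cost(k) for k in (1, 2, 3)]  # [15, 20, 25]
```

One gap of length 3 costs 25, whereas three separate gaps of length 1 cost 45, so the affine scheme favors a few long gaps over many short ones.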
33Diagram of Affine Gap Penalty
Source: kmchaos lecture note
34- Recurrence system - Gotoh
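A minimal sketch of a Gotoh-style recurrence for local alignment with affine gaps follows. It keeps, for each cell, the best score ending in a substitution (C), in a gap that consumes target letters (D), and in a gap that consumes query letters (tracked in `ins`). The variable names, the function name, and the default scores are assumptions for illustration; a gap of length k costs gop + k*gsp.

```python
def gotoh_local_score(query, target, match=8, mismatch=-5, gop=10, gsp=5):
    """Best local alignment score with affine gap cost gop + k*gsp."""
    neg_inf = float("-inf")
    p = len(query)
    prev_c = [0] * (p + 1)        # best scores in the previous row
    prev_d = [neg_inf] * (p + 1)  # best scores ending in a target-side gap
    best = 0
    for t in target:              # one row per target letter
        cur_c = [0] * (p + 1)
        cur_d = [neg_inf] * (p + 1)
        ins = neg_inf             # best score ending in a query-side gap
        for i in range(1, p + 1):
            # Extend an open gap, or open a new one from the cell above.
            cur_d[i] = max(prev_d[i] - gsp, prev_c[i] - gop - gsp)
            # Same choice for a gap running along the current row.
            ins = max(ins - gsp, cur_c[i - 1] - gop - gsp)
            s = match if query[i - 1] == t else mismatch
            cur_c[i] = max(0, prev_c[i - 1] + s, cur_d[i], ins)
            best = max(best, cur_c[i])
        prev_c, prev_d = cur_c, cur_d
    return best
```

For instance, aligning "AAAATTTT" against "AAAACCTTTT" scores 8 matches (64) minus one gap of length 2 (10 + 2*5 = 20), giving 44, which beats the best gap-free alignment.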
35The Case of Affine Gap Costs
- Simple scoring scheme → affine gap penalty scheme
- Affine edit graph and vertex structure
- Question: how should the equations defined
above be modified?
37Recurrence System for Affine Gap Costs
- Two observations
- To compute the jth row from the (j-1)st requires
knowing only the vectors of … and … values in
row j-1, and not the … values in that row
- If … then the value at vertex … need not be
recorded, as any maximal path through its … will
have score less than the maximal path
passing through the corresponding …
38Recurrence System
39Results
40Experiment
- Method
- Edit graph based approach vs. SWAT
- Scoring matrix
- PAM120
- Affine gap cost
- 8 + 4n
- Database (target)
- 3 million residue subset of the PIR database
- Query
- A periodic clock protein of length 173 (pcp)
- A lactate dehydrogenase of length 319 (dehydro)
- A cGMP kinase of length 670 (kinase)
- A growth factor of length 1210 (g factor)
41PAM120 Gap Cost 8 + 4n
42BLOSUM62 Gap Cost 8 + 2n
43Ending
Thanks for Your Attention