Title: Review of BLAST
1Review of BLAST
- Heuristic, local alignment algorithm
- Permits trade-off between speed and sensitivity
- HSPs are dependent on matrix used
- Statistical significance of results can be assessed with the E-value
- Sensitivity = (significant matches detected) / (significant matches in DB)
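The E-value and sensitivity bullets can be made concrete with a small sketch. It assumes the Karlin-Altschul form E = K * m * n * exp(-lambda * S), which the slide itself does not spell out. The function names and the lambda and K values in the usage lines are illustrative placeholders; real values depend on the scoring matrix and gap costs.

```python
import math

def expected_hits(score, query_len, db_len, lam, k):
    """E-value under the Karlin-Altschul form E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)

def sensitivity(detected, in_db):
    """Fraction of the significant matches in the database that were detected."""
    return detected / in_db

# Placeholder parameters (assumptions, not taken from the slides).
LAM, K = 0.3, 0.1
e_low = expected_hits(score=40, query_len=250, db_len=1_000_000, lam=LAM, k=K)
e_high = expected_hits(score=80, query_len=250, db_len=1_000_000, lam=LAM, k=K)
print(e_low, e_high)        # the higher score gives the smaller E-value
print(sensitivity(45, 50))  # 0.9
```

A higher alignment score shrinks E exponentially, which is why a modest change in score can move a match across the reporting threshold.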
2Refinements of BLAST
- Two-hit method
- Gapped Alignments
- Position-Specific Iteration
- Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. SF Altschul, TL Madden, AA Schaffer, J Zhang, Z Zhang, W Miller, and DJ Lipman. Nucl. Acids Res. 25: 3389-3402 (1997).
  http://nar.oxfordjournals.org/cgi/content/full/25/17/3389
3Two-hit method
- BLAST v1
  - Seeks short word pairs whose aligned score is at least T
  - Each hit is extended to test whether it lies within a high-scoring alignment (this step consumes most of the processing time)
- BLAST v2
  - Seeks two non-overlapping word pairs on the same diagonal, within a certain distance of each other
  - T is lowered to preserve sensitivity, yielding more hits
  - Far fewer hits meet the two-hit requirement, so fewer extensions are performed and average compute time decreases
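The two-hit criterion can be sketched as a filter over word hits. This is a minimal illustration, not the NCBI implementation: hits are assumed to arrive as (query position, database position) pairs, and the function name and the default `word_size` and `window` values are chosen here for the example.

```python
def two_hit_seeds(hits, word_size=3, window=40):
    """Return hits preceded by a non-overlapping hit on the same diagonal
    no more than `window` positions away."""
    last_on_diag = {}  # diagonal -> query position of the most recent hit
    seeds = []
    for q, d in sorted(hits):
        diag = d - q
        prev = last_on_diag.get(diag)
        if prev is not None:
            gap = q - prev
            if gap < word_size:
                continue  # overlaps the previous hit: ignore this one
            if gap <= window:
                seeds.append((q, d))  # second hit of a pair: extend from here
        last_on_diag[diag] = q
    return seeds

hits = [(0, 5), (1, 6), (10, 15), (3, 100), (60, 65)]
print(two_hit_seeds(hits))  # [(10, 15)]
```

Only (10, 15) triggers an extension: (1, 6) overlaps the first hit, (3, 100) is alone on its diagonal, and (60, 65) is too far from its predecessor.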
4Gapped Alignments
- BLAST v1
  - Often finds several alignments involving a single database sequence
  - When these alignments are combined, the resulting alignment is statistically significant
  - When they are not combined, the individual alignments may not meet the statistical threshold to be reported
- BLAST v2
  - Introduces an algorithm to generate gapped alignments, overcoming these issues with BLAST v1
  - Allows T to be raised, increasing the speed of the initial database scan
  - The gapped alignment algorithm uses dynamic programming (DP) to extend a central pair of aligned residues in both directions, considering only DP cells whose score falls no more than a set amount below the best score found so far, rather than confining the search to a pre-defined strip of the DP path graph
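The idea of a score-limited gapped extension can be sketched as follows. This is a simplified illustration under stated assumptions: it extends in one direction only, uses a linear gap penalty and a match/mismatch score instead of a substitution matrix, and still visits every column of each row. The function name and all parameter defaults are hypothetical.

```python
NEG_INF = float("-inf")

def xdrop_gapped_extend(a, b, match=1, mismatch=-1, gap=-2, x_drop=5):
    """Best score of a gapped alignment of a prefix of `a` with a prefix of `b`.

    Cells scoring more than `x_drop` below the best score seen so far are
    discarded, so the explored region adapts to the data.
    """
    best = 0
    # Row 0: only gaps in `a`.
    prev = [0] + [NEG_INF] * len(b)
    for j in range(1, len(b) + 1):
        s = prev[j - 1] + gap
        prev[j] = s if s >= best - x_drop else NEG_INF
    for i in range(1, len(a) + 1):
        cur = [NEG_INF] * (len(b) + 1)
        s = prev[0] + gap
        if s >= best - x_drop:
            cur[0] = s
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            s = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            if s >= best - x_drop:
                cur[j] = s
                best = max(best, s)
        if all(v == NEG_INF for v in cur):
            break  # every cell was pruned: stop extending
        prev = cur
    return best

print(xdrop_gapped_extend("ACGT", "ACGT"))  # 4
```

A full extension would run this once to the right and once to the left of the seed pair and add the two scores.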
5PSI-BLAST Position-Specific Iterative BLAST
- Motif or profile search methods are much more sensitive than pairwise comparison methods at detecting distant relationships
- Basic Idea
  - BLAST searches may be iterated, with a position-specific score matrix generated from significant alignments found in round i used for round i + 1
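The iteration described above can be sketched as a loop that stops once a round finds no new significant sequences. `search` and `build_model` are hypothetical callables supplied by the caller, not functions of any BLAST package, and the E-value cutoff default is an illustrative value.

```python
def iterate_search(query_model, search, build_model, e_cutoff=0.005, max_rounds=10):
    """Repeat search -> rebuild model until no new significant hits appear.

    `search(model)` returns {sequence_id: e_value};
    `build_model(ids)` returns the model to use in the next round.
    """
    model = query_model
    included = set()
    for round_no in range(1, max_rounds + 1):
        hits = search(model)
        significant = {seq_id for seq_id, e in hits.items() if e <= e_cutoff}
        if significant <= included:
            return included, round_no  # converged: nothing new this round
        included |= significant
        model = build_model(included)
    return included, max_rounds

# Toy stand-ins: each "model" is a set of ids, and a search finds their neighbours.
NEIGHBOURS = {"query": ["a"], "a": ["b"], "b": []}

def toy_search(model):
    return {n: 0.001 for member in model for n in NEIGHBOURS[member]}

def toy_build(included):
    return included | {"query"}

found, rounds = iterate_search({"query"}, toy_search, toy_build)
print(sorted(found), rounds)  # ['a', 'b'] 3
```

In the toy run, sequence "b" is reachable only through "a", which mirrors how a later round can detect relatives that the first pairwise search misses.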
6Definition of Profile/Motif
- Profile: an analysis representing the extent to which something exhibits various characteristics
- Examples from ProSite
  - N-glycosylation site
    - N-{P}-[ST]-{P}
  - Glycosaminoglycan attachment site
    - S-G-x-G
  - cAMP- and cGMP-dependent protein kinase phosphorylation site
    - [RK](2)-x-[ST]
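Patterns of this kind map directly onto regular expressions. The sketch below handles only the subset of PROSITE syntax used in these examples: `[..]` for allowed residues, `{..}` for excluded residues, `x` for any residue, and `(n)` for repeats. The function name is chosen here for illustration, and the N-glycosylation pattern is written in full PROSITE notation as N-{P}-[ST]-{P}.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # exclusions
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        parts.append(element)
    return "".join(parts)

glyco = prosite_to_regex("N-{P}-[ST]-{P}")
kinase = prosite_to_regex("[RK](2)-x-[ST]")
print(glyco)   # N[^P][ST][^P]
print(kinase)  # [RK]{2}.[ST]
print(re.search(glyco, "AANGSAA").group())  # NGSA
```

The order of the replacements matters: the exclusion braces are converted before the parentheses, so the braces produced for repeat counts are not rewritten.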
7Position-Specific Scoring Matrix
- A PSSM is a motif descriptor
- The description includes a weight (score, probability, likelihood) for each symbol occurring at each position along the motif
- Examples of motifs
  - Protein active sites, structural elements, zinc fingers, intron/exon boundaries, transcription-factor binding sites, etc.
8PSSM Example
- For DNA
- [GTA]-[AT]-G-[AC]-N-[TAC]
10Position-Specific Scoring Matrix
- Construction of a PSSM is a multi-stage process
- Architecture of the matrix
- Create the multiple alignment from which the matrix is derived
- Calculate frequencies for each position
- Applying BLAST to PSSM
11Position-Specific Scoring Matrix
- 10 vertebrate donor site sequences aligned at
exon/intron boundary
12Position-Specific Scoring Matrix
- Calculate the absolute frequency of each
nucleotide at each position
14Position-Specific Scoring Matrix
- Calculate the relative frequency of each
nucleotide at each position
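Both frequency steps can be sketched in a few lines. The four-sequence alignment below is invented toy data for illustration only; it is not the ten donor-site sequences from the slides, and the function names are chosen here.

```python
from collections import Counter

def column_counts(alignment, alphabet="ACGT"):
    """Absolute frequency of each nucleotide at each alignment position."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    counts = []
    for j in range(length):
        column = Counter(seq[j] for seq in alignment)
        counts.append({nt: column.get(nt, 0) for nt in alphabet})
    return counts

def relative_frequencies(counts):
    """Divide each count by the number of sequences in its column."""
    return [{nt: n / sum(col.values()) for nt, n in col.items()} for col in counts]

toy_alignment = ["AGGT", "AGGT", "CGGT", "AGGA"]  # made-up example data
counts = column_counts(toy_alignment)
freqs = relative_frequencies(counts)
print(counts[0])  # {'A': 3, 'C': 1, 'G': 0, 'T': 0}
print(freqs[0])   # {'A': 0.75, 'C': 0.25, 'G': 0.0, 'T': 0.0}
```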
16Position-Specific Scoring Matrix
- What is the probability of finding CAGGTTGGA?
- The product of the frequency of each nucleotide at each position
- C is 0.2 at position 1, A is 0.6 at position 2, etc.
  -> 0.2 × 0.6 × 0.7 × 1 × 1 × 0.1 × 0.1 × 0.5 × 0.1 = 4.2 × 10^-5
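The product can be checked directly. The nine values are the per-position frequencies quoted on the slide for CAGGTTGGA; the variable and function names are chosen here for illustration.

```python
import math

# Frequencies of C, A, G, G, T, T, G, G, A at positions 1..9, as given on the slide.
site_freqs = [0.2, 0.6, 0.7, 1.0, 1.0, 0.1, 0.1, 0.5, 0.1]

def sequence_probability(freqs):
    """Probability of a sequence under the model: product of per-position frequencies."""
    return math.prod(freqs)

p_model = sequence_probability(site_freqs)
print(p_model)  # about 4.2e-05
```

Multiplying positions independently assumes that the nucleotide at one position does not depend on its neighbours.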
17Position-Specific Scoring Matrix
- The ratio of the probability of a sequence S under a given model M, P(S|M), to its probability under a random model R, P(S|R), is a likelihood ratio (or odds ratio): P(S|M) / P(S|R)
- Its logarithm is a log likelihood ratio (or log odds ratio)
- If the log ratio is
  - = 0, S has the same probability of appearing in M as in R
  - > 0, S is more likely to appear in M than in R
  - < 0, S is less likely to appear in M than in R
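A short sketch of the log odds ratio, applied to the CAGGTTGGA example. It assumes a random model in which each of the nine positions has probability 0.25 and takes the model probability 4.2e-5 from the product of the slide's frequencies; the function name is illustrative.

```python
import math

def log_odds(p_model, p_random):
    """Natural log of the likelihood ratio P(S|M) / P(S|R)."""
    return math.log(p_model / p_random)

p_model = 4.2e-5       # P(S|M) for CAGGTTGGA
p_random = 0.25 ** 9   # P(S|R): nine positions, each with probability 0.25
score = log_odds(p_model, p_random)
print(round(score, 2))  # 2.4
```

The positive value indicates that the sequence is more likely under the donor-site model than under the random model.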
18Position-Specific Scoring Matrix
- Compute the log-likelihood values with the transformation ln(M_ij / P_i)
  - where M_ij is the probability of nucleotide i at position j and P_i is the background probability of nucleotide i
- Assume each nucleotide is equally likely at any position, so P_i = 0.25 in the random model
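The transformation can be sketched as a function over per-position frequency tables. Mapping a zero frequency to negative infinity is an assumption made here, since the slides do not say how zeros are handled; pseudocounts are a common alternative. The function name and the one-column example are illustrative.

```python
import math

def log_odds_matrix(freqs, background=0.25):
    """Apply ln(M_ij / P_i) to every entry of a list of per-position frequency dicts."""
    matrix = []
    for column in freqs:
        matrix.append({
            nt: math.log(f / background) if f > 0 else float("-inf")
            for nt, f in column.items()
        })
    return matrix

# Made-up single column: A at 0.75, C at 0.25, G and T never observed.
pssm = log_odds_matrix([{"A": 0.75, "C": 0.25, "G": 0.0, "T": 0.0}])
print(round(pssm[0]["A"], 3))  # 1.099
print(pssm[0]["C"])            # 0.0
```

A nucleotide observed at exactly the background rate scores 0, so it neither supports nor contradicts the motif.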
20Position-Specific Scoring Matrix
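- Use the profile to scan a sequence
- Sum the coefficients from the matrix for each nucleotide in each position
- Formally, the score of matrix M for a sequence s of length l (s = s_1, ..., s_l, with each s_k being one of A, C, G, T) is computed as
  score(s) = \sum_{k=1}^{l} M_{s_k, k}
  where M_{s_k, k} is the matrix coefficient for nucleotide s_k at position k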
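Scanning amounts to sliding a window of the matrix width along the sequence and summing one coefficient per position. The three-column matrix below is a made-up toy with integer scores favouring AGT; it is not the donor-site matrix from the slides, and the function name is illustrative.

```python
def scan(sequence, pssm):
    """Score every window of len(pssm) nucleotides; returns (start, window, score) tuples."""
    width = len(pssm)
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pssm[k][nt] for k, nt in enumerate(window))
        results.append((start, window, score))
    return results

# Toy matrix (assumed values): one dict of coefficients per motif position.
toy_pssm = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": -1},
    {"A": -1, "C": -1, "G": -1, "T": 2},
]
windows = scan("CAGTA", toy_pssm)
print(windows)  # [(0, 'CAG', -3), (1, 'AGT', 6), (2, 'GTA', -3)]
print(max(windows, key=lambda w: w[2]))  # (1, 'AGT', 6)
```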
24Position-Specific Scoring Matrix
- Assume the sequence GTAGTAGAAGGTAAGTGTCCGTAG
- Find the score for each 9-nucleotide window (shown in lowercase)
  - GTAGTAgaaggtaagTGTCCGTAG (-8)
  - GTAGTAGaaggtaagtGTCCGTAG (8.33)
  - GTAGTAGAAGGtaagtgtccGTAG (0.24)
25PSI-BLAST
- Confirming relationships of purine nucleotide metabolism proteins
26PSI-BLAST
E-value cutoff for PSSM
27RESULTS Initial BLASTP
Same results as protein-protein BLAST
28Results of First PSSM Search
Other purine nucleotide metabolizing enzymes not
found by ordinary BLAST
29Third PSSM Search Convergence