Title: BLAST: Database Search
1BLAST Database Search
- Heuristic Algorithm
- Some slides courtesy of Dr. Pevsner and Dr. Dirk
Husmeier
2What is BLAST?
- BLAST (Basic Local Alignment Search Tool)
- allows rapid sequence comparison of a query
- sequence against a database.
- Smith-Waterman is rigorous and it is guaranteed
- to find an optimal alignment.
- But also time and space consuming. It is
especially inefficient in database searches. - BLAST provides a rapid alternative.o S-W
3Why do we use BLAST?
- To understanding the relatedness of any protein
or DNA sequence (query sequence) to other known
sequences (database) - Identify sequences with a common ancestor
(orthologs) and paralogs - Discover new genes or proteins
- Explore protein structure and function
-
4The BLAST Algorithm
- S. F. Altschul, et al., 1997, Nucleic Acids
Research, 253389
The central idea of the BLAST algorithm is to
confine attention to segment pairs that contain
a word pair of length w with a score of at least
T. Altschul et al. (1990)
5How the original BLAST algorithm works Step 1.
size w words in the query sequence
Look at the query sequence by a moving window of
size w Example for a human RBP query FSGTWYA
(query word is in yellow) The moving window of
words FSG SGT GTW TWY WYA
page 101
6Step 1 compile a list of words scoring at least
T with query word
GTW 6,5,11 22 ASW 6,1,11 18 word
hits ATW 0,5,11 16 gt threshold NTW
0,5,11 16 GTY 6,5,2 13 GNW 10 GAW
9 word hits lt threshold
(T11)
72. Scan the database for entries that contains
any word from the compiled hit list.
3. Extend when you manage to find a hit extend
the hit in either direction. Keep track of the
score (use a scoring matrix) Stop when the score
drops below some cutoff.
KENFDKARFSGTWYAMAKKDPEG 50 RBP (query) MKGLDIQKVAG
TWYSLAMAASD. 44 lactoglobulin (hit)
extend
extend
Hit!
8Alignment Score
It is important to assess the statistical
significance of search results. For local
alignments (including BLAST search results), the
scores follow an extreme value distribution
(EVD). E-value is closely related to the
analysis of the distribution of alignment
score Karlin, S. Altschul, S.F. (1990)
"Methods for assessing the statistical
significance of molecular sequence features by
using general scoring schemes." Proc. Natl. Acad.
Sci. USA 872264-2268.(PubMed)
9Alignment score as a random walk
10Max Score in an Excursion
Pi frequency of residue i in 1st seq. Pk
frequency of residue k in 2nd seq.
11Protein Scoring
12Distn of any excursion Y
an exponential distribution
13The Max of n variables
- Y1, Y2, , Yn are identical independently
distributed - Ymax is the max of the above all. Then
- Prob(Ymaxlty)(Prob(Ylty))n
14The Max of n Exponential variables
- Y1, Y2, , Yn are independent exponential
variables - Ymax is the max of the above all. Then
- Prob(Ymaxlty)(Prob(Ylty))n
- (1-e-?y)n
- Prob(Ymaxgty)1-(Prob(Ylty))n
- 1-(1-e-?y)n
15In a database of n seq.s
- Number of sequences n
- Y1, Y2,, Yn are i.i.d. exponential
- What happens when n is large?
- Using a widely used rule (1x/n)n?exp(x)
- ?1-exp(-nCe?y)
- Probability of scores/excursion higher than y
- The distribution of Ymax follows an extreme value
distn
16the sum of a large number of independent
identically distributed (i.i.d) random variables
tends to a normal distribution,
0.40
0.35
0.30
0.25
normal distribution
probability
0.20
0.15
0.10
0.05
0
0
1
2
3
4
5
-1
-2
-3
-4
-5
x
17the maximum of a large number of i.i.d. random
variables tends to an extreme value distribution
0.40
0.35
0.30
0.25
normal distribution
extreme value distribution
probability
0.20
0.15
0.10
0.05
0
0
1
2
3
4
5
-1
-2
-3
-4
-5
x
18Expected number of better scores/higher excursions
E-value
p-value
19E Kmn e-lS
E is the number of hits you would expect from
your search with scores greater than S where K
is a constant m is the size of the query n is
the size of the database being searched l scales
for the specific scoring matrix used (decay
constant from the extreme value distribution)
20How to interpret BLAST E values and p values
Very small E values are very similar to p values.
E values of about 1 to 10 are far easier to
interpret than corresponding p values. E p1-exp
(-E) 10 0.99995460 5 0.99326205 2 0.86466472 1
0.63212056 0.1 0.09516258 (about
0.1) 0.05 0.04877058 (about 0.05) 0.001 0.000999
50 (about 0.001) 0.0001 0.0001000
21 22Extreme value distribution
- The distribution
- The area to the right of S
- Scaling to a particular type of score
- where µ is the mode and ? is a scale factor.
23Extreme value distribution
Compute this value for x0.
- The distribution
- The area to the right of S
- Scaling to a particular type of score
- where µ is the mode and ? is a scale factor.
24Extreme value distribution
Compute this value for x 0. Solution exp-1
0.368
- The distribution
- The area to the right of S
- Scaling to a particular type of score
- where µ is the mode and ? is a scale factor.
25An example
- You run BLAST and get a score of 45. You then
run BLAST on a shuffled version of the database,
and fit an extreme value distribution to the
resulting empirical distribution. The parameters
of the EVD are µ 25 and ? 0.693. What is the
p-value associated with 45?
26An example
- You run BLAST and get a score of 45. You then
run BLAST on a shuffled version of the database,
and fit an extreme value distribution to the
resulting empirical distribution. The parameters
of the EVD are µ 25 and ? 0.693. What is the
p-value associated with 45?
27Another example
- You run BLAST and get a score of 23. You then
run BLAST on a shuffled version of the database,
and fit an extreme value distribution to the
resulting empirical distribution. The parameters
of the EVD are µ 20 and ? 0.744. What is the
p-value associated with 23?
28Another example
- You run BLAST and get a score of 23. You then
run BLAST on a shuffled version of the database,
and fit an extreme value distribution to the
resulting empirical distribution. The parameters
of the EVD are µ 20 and ? 0.744. What is the
p-value associated with 23?
29BLAST optional parameters
You can... choose the organism to search
turn filtering on/off change the substitution
matrix change the expect (e) value change the
word size change the output format
30Choosing Gap Penalty
- Choice must be made corresponding to each type of
scoring system to place gaps where they will
increase the overall alignment score. - There are no hard and fast rules for choosing gap
penalties. - Both g (opening penalty) and r (extension
penalty) should be non-zero. - The value of g r should be greater than the
maximum score used for a match if insertions and
deletions are considered to be rarer than
nucleotide substitutions. - The value of g strongly influences the number of
gaps introduced into a region separating two
closely matching regions.
31Comparing Scoring Matrix
- PAM
- Homologous seq.s during evolution
- Based on extrapolation of a small evol. Period
- Track evolutionary origins
- BLOSUM
- Conserved blocks
- Based on a range of evol. Periods
- Find conserved domains
32Another way to compare
- perform a search of a sequence database with a
known member of a protein family and to find how
many members of the family are found. When gap
penalty was not considered, the BLOSUM62 matrix
outperformed the PAM250 matrix in finding more
members of 504 different families on the Prosite
database.
33BLAST Phase 3
Extension In the original (1990)
implementation of BLAST, hits were extended in
either direction. In a 1997 refinement of BLAST,
two independent hits are required. The hits must
occur in close proximity to each other. With this
modification, only one seventh as many extensions
occur, greatly speeding the time required for a
search.
34How a BLAST search works threshold
You can modify the threshold parameter. The
default value for blastp is 11. To change it,
enter -f 16 or -f 5 in the advanced options.
35How to interpret a BLAST search expect value
The expect value E is the number of
alignments with scores greater than or equal to
score S that are expected to occur by chance in a
database search. An E value is related to a
probability value p. The key equation describing
an E value is E Kmn e-lS
36E Kmn e-lS
This equation is derived from a description of
the extreme value distribution S the score E
the expect value the expected number of HSPs
with a score gt S m, n the length of two
sequences l, K Karlin Altschul statistics
37Some properties of the equation E Kmn e-lS
- The value of E decreases exponentially with
increasing S - (higher S values correspond to better
alignments). Very - high scores correspond to very low E values.
- The E value for aligning a pair of random
sequences must - be negative! Otherwise, long random alignments
would - acquire great scores
- Parameter K describes the search space
(database). - For E1, one match with a similar score is
expected to - occur by chance. For a very much larger or
smaller - database, you would expect E to vary accordingly
38From raw scores to bit scores
- There are two kinds of scores
- raw scores (calculated from a substitution
matrix) and - bit scores (normalized scores)
- Bit scores are comparable between different
searches - because they are normalized to account for the
use - of different scoring matrices and different
database sizes - S bit score (lS - lnK) / ln2
- The E value corresponding to a given bit score
is - E mn 2 -S
- Bit scores allow you to compare results between
different - database searches, even using different scoring
matrices.
39How to interpret BLAST E values and p values
The expect value E is the number of
alignments with scores greater than or equal to
score S that are expected to occur by chance in a
database search. A p value is a different way
of representing the significance of an
alignment. p 1 - e-E