BLAST: Database Search - PowerPoint PPT Presentation

1 / 39

About This Presentation

Title:

BLAST: Database Search

Description:

BLAST (Basic Local Alignment Search Tool) allows rapid sequence comparison of ... Smith-Waterman is rigorous and it is guaranteed. to find an optimal alignment. ... – PowerPoint PPT presentation

Number of Views:94

Avg rating:3.0/5.0

Slides: 40

Provided by: sch17

Category:

more less

Transcript and Presenter's Notes

Title: BLAST: Database Search

1
BLAST Database Search

Heuristic Algorithm
Some slides courtesy of Dr. Pevsner and Dr. Dirk
Husmeier

2
What is BLAST?

BLAST (Basic Local Alignment Search Tool)
allows rapid sequence comparison of a query
sequence against a database.
Smith-Waterman is rigorous and it is guaranteed
to find an optimal alignment.
But also time and space consuming. It is
especially inefficient in database searches.
BLAST provides a rapid alternative.o S-W

3
Why do we use BLAST?

To understanding the relatedness of any protein
or DNA sequence (query sequence) to other known
sequences (database)
Identify sequences with a common ancestor
(orthologs) and paralogs
Discover new genes or proteins
Explore protein structure and function

4
The BLAST Algorithm

S. F. Altschul, et al., 1997, Nucleic Acids
Research, 253389

The central idea of the BLAST algorithm is to
confine attention to segment pairs that contain
a word pair of length w with a score of at least
T. Altschul et al. (1990)
5
How the original BLAST algorithm works Step 1.
size w words in the query sequence
Look at the query sequence by a moving window of
size w Example for a human RBP query FSGTWYA
(query word is in yellow) The moving window of
words FSG SGT GTW TWY WYA
page 101
6
Step 1 compile a list of words scoring at least
T with query word
GTW 6,5,11 22 ASW 6,1,11 18 word
hits ATW 0,5,11 16 gt threshold NTW
0,5,11 16 GTY 6,5,2 13 GNW 10 GAW
9 word hits lt threshold
(T11)
7
2. Scan the database for entries that contains
any word from the compiled hit list.
3. Extend when you manage to find a hit extend
the hit in either direction. Keep track of the
score (use a scoring matrix) Stop when the score
drops below some cutoff.
KENFDKARFSGTWYAMAKKDPEG 50 RBP (query) MKGLDIQKVAG
TWYSLAMAASD. 44 lactoglobulin (hit)
extend
extend
Hit!
8
Alignment Score
It is important to assess the statistical
significance of search results. For local
alignments (including BLAST search results), the
scores follow an extreme value distribution
(EVD). E-value is closely related to the
analysis of the distribution of alignment
score Karlin, S. Altschul, S.F. (1990)
"Methods for assessing the statistical
significance of molecular sequence features by
using general scoring schemes." Proc. Natl. Acad.
Sci. USA 872264-2268.(PubMed)
9
Alignment score as a random walk
10
Max Score in an Excursion
Pi frequency of residue i in 1st seq. Pk
frequency of residue k in 2nd seq.
11
Protein Scoring
12
Distn of any excursion Y
an exponential distribution
13
The Max of n variables

Y1, Y2, , Yn are identical independently
distributed
Ymax is the max of the above all. Then
Prob(Ymaxlty)(Prob(Ylty))n

14
The Max of n Exponential variables

Y1, Y2, , Yn are independent exponential
variables
Ymax is the max of the above all. Then
Prob(Ymaxlty)(Prob(Ylty))n
(1-e-?y)n
Prob(Ymaxgty)1-(Prob(Ylty))n
1-(1-e-?y)n

15
In a database of n seq.s

Number of sequences n
Y1, Y2,, Yn are i.i.d. exponential
What happens when n is large?
Using a widely used rule (1x/n)n?exp(x)
?1-exp(-nCe?y)
Probability of scores/excursion higher than y
The distribution of Ymax follows an extreme value
distn

16
the sum of a large number of independent
identically distributed (i.i.d) random variables
tends to a normal distribution,
0.40
0.35
0.30
0.25
normal distribution
probability
0.20
0.15
0.10
0.05
0
0
1
2
3
4
5
-1
-2
-3
-4
-5
x
17
the maximum of a large number of i.i.d. random
variables tends to an extreme value distribution
0.40
0.35
0.30
0.25
normal distribution
extreme value distribution
probability
0.20
0.15
0.10
0.05
0
0
1
2
3
4
5
-1
-2
-3
-4
-5
x
18
Expected number of better scores/higher excursions
E-value
p-value
19
E Kmn e-lS
E is the number of hits you would expect from
your search with scores greater than S where K
is a constant m is the size of the query n is
the size of the database being searched l scales
for the specific scoring matrix used (decay
constant from the extreme value distribution)
20
How to interpret BLAST E values and p values
Very small E values are very similar to p values.
E values of about 1 to 10 are far easier to
interpret than corresponding p values. E p1-exp
(-E) 10 0.99995460 5 0.99326205 2 0.86466472 1
0.63212056 0.1 0.09516258 (about
0.1) 0.05 0.04877058 (about 0.05) 0.001 0.000999
50 (about 0.001) 0.0001 0.0001000
21

22
Extreme value distribution

The distribution
The area to the right of S
Scaling to a particular type of score
where µ is the mode and ? is a scale factor.

23
Extreme value distribution
Compute this value for x0.

The distribution
The area to the right of S
Scaling to a particular type of score
where µ is the mode and ? is a scale factor.

24
Extreme value distribution
Compute this value for x 0. Solution exp-1
0.368

The distribution
The area to the right of S
Scaling to a particular type of score
where µ is the mode and ? is a scale factor.

25
An example

You run BLAST and get a score of 45. You then
run BLAST on a shuffled version of the database,
and fit an extreme value distribution to the
resulting empirical distribution. The parameters
of the EVD are µ 25 and ? 0.693. What is the
p-value associated with 45?

26
An example

You run BLAST and get a score of 45. You then
run BLAST on a shuffled version of the database,
and fit an extreme value distribution to the
resulting empirical distribution. The parameters
of the EVD are µ 25 and ? 0.693. What is the
p-value associated with 45?

27
Another example

You run BLAST and get a score of 23. You then
run BLAST on a shuffled version of the database,
and fit an extreme value distribution to the
resulting empirical distribution. The parameters
of the EVD are µ 20 and ? 0.744. What is the
p-value associated with 23?

28
Another example

You run BLAST and get a score of 23. You then
run BLAST on a shuffled version of the database,
and fit an extreme value distribution to the
resulting empirical distribution. The parameters
of the EVD are µ 20 and ? 0.744. What is the
p-value associated with 23?

29
BLAST optional parameters
You can... choose the organism to search
turn filtering on/off change the substitution
matrix change the expect (e) value change the
word size change the output format
30
Choosing Gap Penalty

Choice must be made corresponding to each type of
scoring system to place gaps where they will
increase the overall alignment score.
There are no hard and fast rules for choosing gap
penalties.
Both g (opening penalty) and r (extension
penalty) should be non-zero.
The value of g r should be greater than the
maximum score used for a match if insertions and
deletions are considered to be rarer than
nucleotide substitutions.
The value of g strongly influences the number of
gaps introduced into a region separating two
closely matching regions.

31
Comparing Scoring Matrix

PAM
Homologous seq.s during evolution
Based on extrapolation of a small evol. Period
Track evolutionary origins

BLOSUM
Conserved blocks
Based on a range of evol. Periods
Find conserved domains

32
Another way to compare

perform a search of a sequence database with a
known member of a protein family and to find how
many members of the family are found. When gap
penalty was not considered, the BLOSUM62 matrix
outperformed the PAM250 matrix in finding more
members of 504 different families on the Prosite
database.

33
BLAST Phase 3
Extension In the original (1990)
implementation of BLAST, hits were extended in
either direction. In a 1997 refinement of BLAST,
two independent hits are required. The hits must
occur in close proximity to each other. With this
modification, only one seventh as many extensions
occur, greatly speeding the time required for a
search.
34
How a BLAST search works threshold
You can modify the threshold parameter. The
default value for blastp is 11. To change it,
enter -f 16 or -f 5 in the advanced options.
35
How to interpret a BLAST search expect value
The expect value E is the number of
alignments with scores greater than or equal to
score S that are expected to occur by chance in a
database search. An E value is related to a
probability value p. The key equation describing
an E value is E Kmn e-lS
36
E Kmn e-lS
This equation is derived from a description of
the extreme value distribution S the score E
the expect value the expected number of HSPs
with a score gt S m, n the length of two
sequences l, K Karlin Altschul statistics
37
Some properties of the equation E Kmn e-lS

The value of E decreases exponentially with
increasing S
(higher S values correspond to better
alignments). Very
high scores correspond to very low E values.
The E value for aligning a pair of random
sequences must
be negative! Otherwise, long random alignments
would
acquire great scores
Parameter K describes the search space
(database).
For E1, one match with a similar score is
expected to
occur by chance. For a very much larger or
smaller
database, you would expect E to vary accordingly

38
From raw scores to bit scores

There are two kinds of scores
raw scores (calculated from a substitution
matrix) and
bit scores (normalized scores)
Bit scores are comparable between different
searches
because they are normalized to account for the
use
of different scoring matrices and different
database sizes
S bit score (lS - lnK) / ln2
The E value corresponding to a given bit score
is
E mn 2 -S
Bit scores allow you to compare results between
different
database searches, even using different scoring
matrices.

39
How to interpret BLAST E values and p values
The expect value E is the number of
alignments with scores greater than or equal to
score S that are expected to occur by chance in a
database search. A p value is a different way
of representing the significance of an
alignment. p 1 - e-E

Write a Comment

User Comments (0)