Statistics Inference of BLAST - PowerPoint PPT Presentation

Provided by: alh6

Title: Statistics Inference of BLAST

1
Statistics Inference of BLAST

Ai-Ling Hour
02-29052464
022446_at_mail.fju.edu.tw

2
Contents

Algorithms
Parameters
Score
p-value
E-value

3
References

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

4
http://www.bioinfbook.org/
http://www.sdsc.edu/~babu/UCSD/week02/dbSearch_tut.html
5
BLAST procedure
Query seq.
Database
E-value threshold
Output: homologous subject sequences and their HSPs
6
Preliminary

BLAST programs
Input / Output
Gap penalty
Score matrix
Filter
HSP
Score
E-Value

7
Algorithm

The original algorithm does not allow gaps but
allows multiple hits to the same database
sequence.
BLAST programs were designed for fast database
searching, with minimal sacrifice of sensitivity
to distantly related sequences.
BLAST programs search databases in a special,
compressed format.
BLAST looks first for short subsequences which it
then tries to extend.

8
How BLAST works

Make a list of high-scoring words (score ≥ T)
Compare the word list against the database
If two hits fall within a given window, perform a gapped
extension of the second hit in both directions
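
The word-list step can be illustrated with a minimal sketch. It assumes the list holds every word of the query word's length that scores at least T against it. The scoring function `toy_score`, the alphabet, and the values of w and T below are hypothetical choices for illustration, not BLAST's actual parameters.

```python
from itertools import product


def neighborhood_words(query_word, alphabet, score, threshold):
    """Return all words of the same length scoring >= threshold against query_word."""
    hits = []
    for candidate in product(alphabet, repeat=len(query_word)):
        total = sum(score(a, b) for a, b in zip(query_word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits


def toy_score(a, b):
    # Hypothetical scoring: +2 for a match, -1 for a mismatch.
    return 2 if a == b else -1


# Word size w = 3, threshold T = 3: the exact word and all single-mismatch words qualify.
words = neighborhood_words("ACG", "ACGT", toy_score, threshold=3)
print(len(words), sorted(words))
```

Raising the threshold to 6 in this toy setting would leave only the exact word, which shows why a higher T shrinks the word list.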

9
BLAST Search Algorithm
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/BLAST_algorithm.html
10
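[Figure: trade-off between sensitivity (better to worse) and search speed (slower to faster) as the word size w and the threshold T vary. A lower T gives better sensitivity but a slower search; a higher T gives a faster search but worse sensitivity.]
http://www.bioinfbook.org/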
11
Program Advanced Options
-G  Cost to open a gap (Integer); default: 5 for nucleotides, 11 for proteins
-E  Cost to extend a gap (Integer); default: 2 for nucleotides, 1 for proteins
-q  Penalty for a nucleotide mismatch (Integer); default: -3
-r  Reward for a nucleotide match (Integer); default: 1
-e  Expect value (Real); default: 10
-W  Word size (Integer); default: 11 for nucleotides, 3 for proteins
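
To show how these options combine into a raw score, here is a minimal sketch that scores an already-aligned pair of nucleotide strings using the nucleotide defaults listed above. It assumes an affine gap cost in which a gap of length k costs (open + k × extend). The function name `alignment_score` and the `-` gap symbol are illustrative choices, not part of any BLAST interface.

```python
def alignment_score(top, bottom, reward=1, penalty=-3, gap_open=5, gap_extend=2):
    """Score two equal-length aligned strings; '-' marks a gap position."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    score = 0
    current_gap = None  # which sequence the currently open gap is in, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            side = "top" if a == "-" else "bottom"
            if side != current_gap:
                # Assumed affine cost: pay the opening cost once per gap.
                score -= gap_open
                current_gap = side
            score -= gap_extend
        else:
            current_gap = None
            score += reward if a == b else penalty
    return score


print(alignment_score("ACGT", "ACGT"))    # four matches
print(alignment_score("ACGT", "AGGT"))    # three matches, one mismatch
print(alignment_score("AC-GT", "ACTGT"))  # four matches, one gap of length 1
```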
12
Filtering

Low-complexity
SEG: amino acid sequences
DUST: nucleic acid sequences
Human repeats, ...
lookup table
Lower Case

13
SEG output
14
Score of Alignment

How strong an alignment can be expected from
chance alone?
To analyze how high a score is likely to arise by
chance, a model of random sequences is needed.
The expected score for aligning a random pair of
amino acids is required to be negative.
The resulting scores follow an extreme value distribution.
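
The negative-expectation requirement can be checked directly for a given scoring scheme. The sketch below uses nucleotides rather than amino acids for brevity and assumes uniform base frequencies with a +1/-3 match/mismatch scheme; the function names are illustrative.

```python
def expected_score(freqs, score):
    """Expected score of aligning two residues drawn independently from freqs."""
    return sum(pa * pb * score(a, b)
               for a, pa in freqs.items()
               for b, pb in freqs.items())


def match_mismatch(a, b):
    # Assumed scoring: +1 for a match, -3 for a mismatch.
    return 1 if a == b else -3


uniform_dna = {base: 0.25 for base in "ACGT"}
print(expected_score(uniform_dna, match_mismatch))  # -2.0
```

A positive expected score would let random alignments grow without bound, so a scheme like +1/-3 with uniform frequencies satisfies the requirement.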

15
The probability density function of the extreme
value distribution (characteristic value u = 0 and
decay constant λ = 1)
[Figure: probability (y-axis, 0 to 0.40) plotted against x (-5 to 5), comparing the normal distribution with the extreme value distribution.]
http://www.bioinfbook.org/
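
The two curves can be tabulated with a short sketch. It assumes the standard form of the extreme value density, f(x) = λ exp(-λ(x - u) - exp(-λ(x - u))), with u = 0 and λ = 1 as on the slide, and a standard normal for comparison. The function names are illustrative.

```python
import math


def evd_pdf(x, u=0.0, lam=1.0):
    """Extreme value density with characteristic value u and decay constant lam."""
    z = lam * (x - u)
    return lam * math.exp(-z - math.exp(-z))


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


for x in range(-5, 6):
    print(f"{x:3d}  normal={normal_pdf(x):.4f}  evd={evd_pdf(x):.4f}")
```

The table shows the asymmetry that matters for alignment scores: the extreme value density falls off sharply on the left but has a much heavier right tail than the normal density.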
16
Extreme Value Distribution

The most one can say reliably is that if 100
random alignments have score inferior to the
alignment of interest, the P-value in question is
likely less than 0.01.
Multiple tests: an alignment with a P-value of 0.0001
in the context of a single trial may be assigned
a P-value of only 0.1 if it was selected as the
best among 1000 independent trials.
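
The multiple-testing figure can be reproduced with a one-line calculation, assuming the 1000 trials are independent so that the chance of at least one result this extreme is 1 - (1 - p)^n. The function name is illustrative.

```python
def best_of_n_p_value(p_single, n_trials):
    """Probability that at least one of n independent trials reaches p_single."""
    return 1.0 - (1.0 - p_single) ** n_trials


print(round(best_of_n_p_value(0.0001, 1000), 4))  # 0.0952
```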

17
Entropy
18
Entropy

For random DNA, $p = 0.25$ for each base:
$H = -4(0.25)(-2) = 2$ bits
With $p(A/T) = 0.9$ and $p(C/G) = 0.1$:
$H = -2[(0.45)(-1.15) + (0.05)(-4.32)] = 1.47$ bits
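
Both entropy values can be checked with a short sketch of the Shannon entropy, H = -Σ p log2 p. It assumes that p(A/T) = 0.9 means A and T each have probability 0.45, and likewise 0.05 each for C and G.

```python
import math


def shannon_entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]
at_rich = [0.45, 0.45, 0.05, 0.05]  # A, T, C, G
print(shannon_entropy(uniform))            # 2.0
print(round(shannon_entropy(at_rich), 2))  # 1.47
```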

19
lod Score

Pairs of amino acids or nucleotides
log2 odds ratio
Random model: $q_{ij} = p_i p_j$
$S_{ij} = \log_2 (q_{ij} / p_i p_j)$
Random pairing: $S_{ij} = 0$
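
A small numeric sketch of the lod score, assuming base-2 logarithms as the slide's "log2" label suggests. The frequencies below are made-up illustrative values, not taken from any real substitution matrix.

```python
import math


def lod_score(q_ij, p_i, p_j):
    """Log-odds score in bits: observed pair frequency versus chance expectation."""
    return math.log2(q_ij / (p_i * p_j))


p = 0.25  # hypothetical background frequency for both residues
print(lod_score(0.0625, p, p))   # pair seen exactly as often as chance
print(lod_score(0.125, p, p))    # pair seen twice as often as chance
print(lod_score(0.03125, p, p))  # pair seen half as often as chance
```

Pairs observed more often than chance get positive scores and pairs observed less often get negative ones, with zero marking the random model.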

20
Score

The scores of any substitution matrix with
negative expected score can be written uniquely
in the form
$S_{ij} = \ln(q_{ij} / p_i p_j) / \lambda$
where the $q_{ij}$, called target frequencies, are
positive numbers that sum to 1, the $p_i$ are
background frequencies for the various residues,
and $\lambda$ is a positive constant.
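
Because the target frequencies must sum to 1, the form above implies that the sum over all i, j of p_i p_j exp(λ S_ij) equals 1, which pins down λ for a given scoring scheme. The sketch below solves this by bisection for an assumed +1/-3 nucleotide scheme with uniform background frequencies. The function names and the bracketing interval are illustrative choices.

```python
import math


def solve_lambda(freqs, score, lo=1e-3, hi=10.0, iterations=100):
    """Find the positive root of sum_ij p_i p_j exp(lam * s_ij) = 1 by bisection."""
    def f(lam):
        total = sum(pa * pb * math.exp(lam * score(a, b))
                    for a, pa in freqs.items()
                    for b, pb in freqs.items())
        return total - 1.0

    # With a negative expected score, f is negative just above 0 and positive for large lam.
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def match_mismatch(a, b):
    # Assumed scoring: +1 for a match, -3 for a mismatch.
    return 1 if a == b else -3


uniform_dna = {base: 0.25 for base in "ACGT"}
lam = solve_lambda(uniform_dna, match_mismatch)
print(round(lam, 3))  # 1.374
```

With λ in hand, the implied target frequencies are q_ij = p_i p_j exp(λ S_ij), and they sum to 1 by construction.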

21
Bit Score

Raw scores have little meaning without detailed
knowledge of the scoring system used, or more
simply its statistical parameters K and lambda.
Unless the scoring system is understood, citing a
raw score alone is like citing a distance without
specifying feet, meters, or light years.
$S' = (\lambda S - \ln K) / \ln 2$
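
The conversion from a raw score S to a bit score S' is a direct transcription of the formula. In the sketch below the values of λ and K are illustrative assumptions rather than parameters taken from the slides, and the function name is hypothetical.

```python
import math


def bit_score(raw_score, lam, K):
    """Normalize a raw score to bits: S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


# Illustrative parameter values only (assumed, not from the slides).
LAM, K = 0.318, 0.134
for raw in (40, 80, 120):
    print(raw, round(bit_score(raw, LAM, K), 1))
```

Two searches run with different scoring systems yield raw scores on different scales, but their bit scores can be compared directly.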

22
Parameters

The parameters K and λ can be thought of simply
as natural scales for the search space size and
the scoring system respectively.

23
E-value

Bit score S', which has a standard set of units.
The E-value corresponding to a given bit score is
simply
$E = m n 2^{-S'}$
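
As a rough sketch of this relation, the code below converts a bit score to an E-value and back. It assumes m is the query length and n the total database length in letters, and it borrows the query length 352 and the 23,415,242,475 total letters quoted on a later slide. The function names are illustrative.

```python
import math


def e_value_from_bits(m, n, bits):
    """E = m * n * 2^(-S') for bit score S'."""
    return m * n * 2.0 ** (-bits)


def bits_for_e_value(m, n, e_value):
    """Bit score needed to reach a given E-value in a search space of size m * n."""
    return math.log2(m * n / e_value)


m, n = 352, 23_415_242_475
needed = bits_for_e_value(m, n, 1e-50)
print(round(needed, 1))
print(e_value_from_bits(m, n, needed))
```

Each additional bit halves the E-value, so reaching a very small E in a large database requires a correspondingly large bit score.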

24
E-value

The number of hits one can "expect" to see just
by chance when searching a database of a
particular size
The E value describes the random background noise
that exists for matches between sequences.
The expected number of HSPs with score at least S
is given by the formula
$E = K m n e^{-\lambda S}$
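
The raw-score formula can be explored numerically. The sketch below assumes m is the query length and n the database length in letters. The values of λ and K and the raw score are illustrative assumptions, and the function name is hypothetical.

```python
import math


def e_value(m, n, raw_score, lam, K):
    """Expected number of chance HSPs with score at least raw_score: K*m*n*exp(-lam*S)."""
    return K * m * n * math.exp(-lam * raw_score)


# Illustrative parameter values only (assumed, not from the slides).
LAM, K = 0.318, 0.134
m, n = 352, 23_415_242_475

base = e_value(m, n, 100, LAM, K)
doubled_db = e_value(m, 2 * n, 100, LAM, K)
print(base)
print(doubled_db / base)  # E scales linearly with the search space
```

The same alignment therefore becomes less significant as the database grows, while each increase of ln 2 / λ in raw score halves E.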

25
p-value

The number of random HSPs with score ≥ S is
described by a Poisson distribution.
This means that the probability of finding
exactly x HSPs with score ≥ S is given by
$e^{-E} E^x / x!$, where E is the E-value of S.
Specifically, the chance of finding zero HSPs with
score ≥ S is $e^{-E}$, so the probability of finding
at least one such HSP is $1 - e^{-E}$.
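
The Poisson relation between E and p is easy to verify numerically. This is a minimal sketch with illustrative function names; `math.expm1` is used so that very small E-values do not lose precision.

```python
import math


def prob_exactly(x, e_value):
    """Poisson probability of exactly x chance HSPs when E are expected."""
    return math.exp(-e_value) * e_value ** x / math.factorial(x)


def p_value(e_value):
    """Probability of at least one chance HSP: 1 - exp(-E)."""
    return -math.expm1(-e_value)


for e in (10, 1, 0.1, 0.001):
    print(e, f"{p_value(e):.8f}")
```

For small E the two quantities nearly coincide, since 1 - exp(-E) is approximately E when E is much less than 1.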

26
E values and p values
Very small E values are very similar to p values.
E values of about 1 to 10 are far easier to
interpret than corresponding p values.

E        p
10       0.99995460
5        0.99326205
2        0.86466472
1        0.63212056
0.1      0.09516258 (about 0.1)
0.05     0.04877058 (about 0.05)
0.001    0.00099950 (about 0.001)
0.0001   0.0001000

Table 4.4 page 107
27
Query length: 142196. Database: 6,672,153 sequences, 23,415,242,475 total letters. E < 1E-100
28
Query length: 142196. Database: 6,672,153 sequences, 23,415,242,475 total letters. E < 1E-100
29
Query length: 352. Database: 6,672,153 sequences, 23,415,242,475 total letters. E < 1E-50
30
Query length: 352. Database: 6,672,153 sequences, 23,415,242,475 total letters. E < 1E-50
32
Thank you
