Title: Statistics Inference of BLAST
1Statistics Inference of BLAST
- Ai-Ling Hour
- 02-29052464
- 022446_at_mail.fju.edu.tw
2Contents
- Algorithms
- Parameters
- Score
- p-value
- E-value
3References
- http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
4http://www.bioinfbook.org/
http://www.sdsc.edu/babu/UCSD/week02/dbSearch_tut.html
5BLAST procedure
- Input: query sequence, database, E-value threshold
- Output: homologous subject sequences, HSPs
6Preliminary
- BLAST programs
- Input / Output
- Gap penalty
- Score matrix
- Filter
- HSP
- Score
- E-Value
7Algorithm
- The original algorithm does not allow gaps but allows multiple hits to the same database sequence.
- BLAST programs were designed for fast database searching, with minimal sacrifice of sensitivity to distantly related sequences.
- BLAST programs search databases in a special, compressed format.
- BLAST looks first for short subsequences, which it then tries to extend.
8How BLAST works
- Make a list of high-scoring words (score ≥ T)
- Compare the word list against the database
- If two hits fall within a given window, perform a gapped extension of the second hit in both directions
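The seeding steps above can be sketched in Python. This is a simplified illustration under stated assumptions, not the BLAST implementation: it matches exact words only (no neighbourhood words scored against T), it stops before any extension, and the function names and the `window` value are hypothetical.

```python
from collections import defaultdict

def query_words(query, w):
    """Map each length-w word of the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index

def find_hits(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a query word occurs in the subject."""
    index = query_words(query, w)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits

def two_hit_seeds(hits, w, window):
    """Keep the second of two non-overlapping hits on the same diagonal within `window`."""
    last_on_diag = {}  # diagonal (subject_pos - query_pos) -> subject_pos of last kept hit
    seeds = []
    for i, j in sorted(hits, key=lambda h: h[1]):
        diag = j - i
        prev = last_on_diag.get(diag)
        if prev is not None and w <= j - prev <= window:
            seeds.append((i, j))  # candidate for extension in both directions
        if prev is None or j - prev >= w:
            last_on_diag[diag] = j  # overlapping hits do not replace the previous one
    return seeds

hits = find_hits("AAACCCGGGTTT", "AAACCCGGGTTT", 3)
print(two_hit_seeds(hits, 3, 10))  # [(3, 3), (6, 6), (9, 9)]
```

Each seed returned here would be the starting point for the extension step; the extension itself is left out to keep the sketch short.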
9BLAST Search Algorithm
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/BLAST_algorithm.html
10(Figure: trade-off between sensitivity, from better to worse, and search speed, from slower to faster, as a function of word size w, large or small, and threshold T, lower or higher)
http://www.bioinfbook.org/
11Program Advanced Options
- -G: cost to open a gap (Integer); default 5 for nucleotides, 11 for proteins
- -E: cost to extend a gap (Integer); default 2 for nucleotides, 1 for proteins
- -q: penalty for a nucleotide mismatch (Integer); default -3
- -r: reward for a nucleotide match (Integer); default 1
- -e: expect value (Real); default 10
- -W: word size (Integer); default 11 for nucleotides, 3 for proteins
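As a hedged illustration of how these options fit together, the sketch below assembles an argument list for a legacy `blastall` call using the defaults listed above. Nothing is executed; the helper name `blast_args`, the database name `nt` and the file name `query.fa` are assumptions made for the example.

```python
# Default values copied from the option list above (legacy blastall flags).
DEFAULTS = {
    "nucleotide": {"-G": 5, "-E": 2, "-q": -3, "-r": 1, "-e": 10, "-W": 11},
    "protein": {"-G": 11, "-E": 1, "-e": 10, "-W": 3},
}

def blast_args(program, database, query_file, seq_type, overrides=None):
    """Build (but do not run) a blastall argument list with explicit option values."""
    options = dict(DEFAULTS[seq_type])
    options.update(overrides or {})  # e.g. a stricter E-value threshold
    args = ["blastall", "-p", program, "-d", database, "-i", query_file]
    for flag, value in options.items():
        args.extend([flag, str(value)])
    return args

# Hypothetical database and file names, for illustration only.
print(" ".join(blast_args("blastn", "nt", "query.fa", "nucleotide", {"-e": 0.001})))
```

Writing the defaults out explicitly makes it easy to see which parameter a given search has changed.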
12Filtering
- Low-complexity
- SEG: amino acid sequences
- DUST: nucleic acid sequences
- Human repeats, ...
- lookup table
- Lower Case
13SEG output
14Score of Alignment
- How strong an alignment can be expected from chance alone?
- To analyze how high a score is likely to arise by chance, a model of random sequences is needed.
- The expected score for aligning a random pair of amino acids is required to be negative.
- The scores follow an extreme value distribution.
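The requirement of a negative expected score can be checked numerically. The sketch below is a minimal illustration for nucleotides, assuming uniform base frequencies and the match reward 1 and mismatch penalty -3; the function names are hypothetical.

```python
def expected_score(freqs, score):
    """Expected score of aligning two residues drawn independently from `freqs`."""
    return sum(freqs[a] * freqs[b] * score(a, b) for a in freqs for b in freqs)

def match_mismatch(a, b):
    """Reward 1 for a match, penalty -3 for a mismatch."""
    return 1 if a == b else -3

uniform = {base: 0.25 for base in "ACGT"}
print(expected_score(uniform, match_mismatch))  # -2.0
```

With these values a match occurs with probability 0.25 and a mismatch with probability 0.75, so the expected score is 0.25(1) + 0.75(-3) = -2, which is negative as required.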
15The probability density function of the extreme value distribution (characteristic value $u = 0$ and decay constant $\lambda = 1$)
(Figure: probability density of the normal distribution and of the extreme value distribution; x-axis: x from -5 to 5; y-axis: probability from 0 to 0.40)
http://www.bioinfbook.org/
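The two curves in this figure can be recomputed with a few lines of Python. This is a sketch, assuming the standard density $f(x) = \lambda e^{-\lambda(x-u)} \exp(-e^{-\lambda(x-u)})$ for the extreme value distribution and a standard normal (mean 0, standard deviation 1) for comparison; the function names are hypothetical.

```python
import math

def evd_pdf(x, u=0.0, lam=1.0):
    """Extreme value density with characteristic value u and decay constant lam."""
    z = lam * (x - u)
    return lam * math.exp(-z - math.exp(-z))

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Compare the two densities at a few points along the x-axis.
for x in (-2, 0, 2, 4):
    print(x, round(normal_pdf(x), 4), round(evd_pdf(x), 4))
```

The extreme value density falls off much more slowly on the right than the normal density, which is why high alignment scores are less surprising than a normal model would suggest.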
16Extreme Value Distribution
- The most one can say reliably is that if 100 random alignments have score inferior to the alignment of interest, the P-value in question is likely less than 0.01.
- Multiple tests: an alignment with P-value 0.0001 in the context of a single trial may be assigned a P-value of only 0.1 if it was selected as the best among 1000 independent trials.
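The multiple-testing figure above follows from the probability that at least one of N independent trials reaches the score, $1 - (1 - p)^N$. A minimal sketch, with hypothetical function names:

```python
def best_of_n_p_value(p_single, n_trials):
    """P-value of the best result among n independent trials with single-trial P-value p_single."""
    return 1.0 - (1.0 - p_single) ** n_trials

def empirical_p_upper_bound(n_random_all_lower):
    """Rough bound when every one of n random alignments scores below the real one."""
    return 1.0 / n_random_all_lower

print(round(best_of_n_p_value(0.0001, 1000), 3))  # 0.095, roughly 0.1
print(empirical_p_upper_bound(100))               # 0.01
```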
17Entropy
18Entropy
- Random DNA, $p = 0.25$ for each base
- $H = -4(0.25)(-2) = 2$ bits
- $p(\mathrm{A\ or\ T}) = 0.9$, $p(\mathrm{C\ or\ G}) = 0.1$
- $H = -2[(0.45)(-1.15) + (0.05)(-4.32)] = 1.47$ bits
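Both entropy values can be reproduced with the definition $H = -\sum_i p_i \log_2 p_i$. A short sketch, assuming the second case splits as p(A) = p(T) = 0.45 and p(C) = p(G) = 0.05; the function name is hypothetical.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.25, 0.25, 0.25, 0.25]))            # 2.0
print(round(entropy_bits([0.45, 0.45, 0.05, 0.05]), 2))  # 1.47
```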
19lod Score
- Pairs of amino acids or nucleotides
- $\log_2$ odds ratio
- Random model: $q_{ij} = p_i p_j$
- $S_{ij} = \log_2 (q_{ij} / p_i p_j)$
- Pairing randomly: $S_{ij} = 0$
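A small numerical illustration of the log-odds score, using made-up pair frequencies (the values of `q_ij` below are hypothetical and chosen so the results are exact):

```python
import math

def lod_score(q_ij, p_i, p_j):
    """Log-odds score in bits: observed pair frequency against the random expectation."""
    return math.log2(q_ij / (p_i * p_j))

print(lod_score(0.0625, 0.25, 0.25))   # 0.0, pair seen exactly as often as by chance
print(lod_score(0.125, 0.25, 0.25))    # 1.0, pair seen twice as often as by chance
print(lod_score(0.03125, 0.25, 0.25))  # -1.0, pair seen half as often as by chance
```

A positive score therefore marks a pairing that is over-represented in true alignments, and a negative score one that is under-represented.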
20Score
- The scores of any substitution matrix with negative expected score can be written uniquely in the form
- $S_{ij} = \ln(q_{ij} / p_i p_j) / \lambda$
- where the $q_{ij}$, called target frequencies, are positive numbers that sum to 1, the $p_i$ are background frequencies for the various residues, and $\lambda$ is a positive constant.
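Because the target frequencies $q_{ij} = p_i p_j e^{\lambda S_{ij}}$ must sum to 1, $\lambda$ is the positive solution of $\sum_{i,j} p_i p_j e^{\lambda S_{ij}} = 1$. The sketch below finds it by bisection for a nucleotide match/mismatch scheme of 1 and -3 with uniform background frequencies; the function names are hypothetical and the method is one simple choice among several.

```python
import math

def solve_lambda(freqs, score, tol=1e-10):
    """Positive root of sum_ij p_i p_j exp(lam * s_ij) = 1.

    Assumes a negative expected score and at least one positive score.
    """
    def f(lam):
        return sum(freqs[a] * freqs[b] * math.exp(lam * score(a, b))
                   for a in freqs for b in freqs) - 1.0

    lo, hi = 1e-6, 1.0
    while f(hi) <= 0.0:      # expand the bracket until f changes sign
        hi *= 2.0
    while hi - lo > tol:     # plain bisection
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def match_mismatch(a, b):
    return 1 if a == b else -3

uniform = {base: 0.25 for base in "ACGT"}
lam = solve_lambda(uniform, match_mismatch)
print(round(lam, 3))
```

Once $\lambda$ is known, the implied target frequencies follow directly from $q_{ij} = p_i p_j e^{\lambda S_{ij}}$.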
21Bit Score
- Raw scores have little meaning without detailed knowledge of the scoring system used, or more simply its statistical parameters $K$ and $\lambda$.
- Unless the scoring system is understood, citing a raw score alone is like citing a distance without specifying feet, meters, or light years.
- $S' = (\lambda S - \ln K) / \ln 2$
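The conversion between raw and bit scores is a one-line calculation. In the sketch below the values `lam = 0.267` and `K = 0.041` are illustrative assumptions only, and the function names are hypothetical.

```python
import math

def bit_score(raw, lam, K):
    """Normalized score S' in bits from a raw score S."""
    return (lam * raw - math.log(K)) / math.log(2)

def raw_from_bits(bits, lam, K):
    """Inverse conversion: the raw score S that corresponds to a bit score S'."""
    return (bits * math.log(2) + math.log(K)) / lam

# Illustrative parameter values, assumed for this example.
lam, K = 0.267, 0.041
bits = bit_score(100, lam, K)
print(round(bits, 1), round(raw_from_bits(bits, lam, K), 6))
```

Two raw scores from different scoring systems become directly comparable once each is converted with its own $\lambda$ and $K$.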
22Parameters
- The parameters $K$ and $\lambda$ can be thought of simply as natural scales for the search space size and the scoring system respectively.
23E-value
- Bit score $S'$, which has a standard set of units.
- The E-value corresponding to a given bit score is simply
- $E = mn\,2^{-S'}$
24E-value
- The number of hits one can "expect" to see just by chance when searching a database of a particular size
- The E value describes the random background noise that exists for matches between sequences.
- The expected number of HSPs with score at least $S$ is given by the formula
- $E = Kmn\,e^{-\lambda S}$
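The two E-value formulas, $E = Kmn\,e^{-\lambda S}$ and $E = mn\,2^{-S'}$, give the same number once $S'$ is computed from $S$. The sketch below checks this and also inverts the formula to find the raw score needed for a target E-value. The values of `lam` and `K` are illustrative assumptions, `m` and `n` are taken as the plain query length and database size in letters (no edge-effect correction), and the function names are hypothetical.

```python
import math

def e_value(raw, m, n, lam, K):
    """Expected number of HSPs with raw score at least `raw`."""
    return K * m * n * math.exp(-lam * raw)

def e_value_from_bits(bits, m, n):
    """Same quantity expressed with the bit score."""
    return m * n * 2.0 ** (-bits)

def raw_score_for_e(e, m, n, lam, K):
    """Raw score at which the expected number of chance HSPs equals `e`."""
    return math.log(K * m * n / e) / lam

lam, K = 0.267, 0.041          # illustrative values, assumed for this example
m, n = 352, 23_415_242_475     # query length and total database letters
bits = (lam * 80 - math.log(K)) / math.log(2)
print(e_value(80, m, n, lam, K), e_value_from_bits(bits, m, n))
print(round(raw_score_for_e(10, m, n, lam, K), 1))
```

The first line prints the same E-value twice, up to floating-point rounding, which confirms that the bit score absorbs $K$ and $\lambda$.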
25p-value
- The number of random HSPs with score $\geq S$ is described by a Poisson distribution.
- This means that the probability of finding exactly $x$ HSPs with score $\geq S$ is given by $e^{-E} E^x / x!$, where $E$ is the E-value of $S$.
- Specifically, the chance of finding zero HSPs with score $\geq S$ is $e^{-E}$, so the probability of finding at least one such HSP is $1 - e^{-E}$.
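The Poisson relationship between E-values and p-values can be evaluated directly. A minimal sketch with hypothetical function names; `math.expm1` is used so that very small E-values keep their precision.

```python
import math

def prob_exactly(x, e):
    """Poisson probability of exactly x chance HSPs when E of them are expected."""
    return math.exp(-e) * e ** x / math.factorial(x)

def p_value(e):
    """Probability of at least one chance HSP: 1 - exp(-E)."""
    return -math.expm1(-e)

for e in (10, 1, 0.05, 0.0001):
    print(e, p_value(e))
```

For small E the p-value is almost equal to E, while for E of 1 or more it saturates near 1 and carries little information.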
26E values and p values
Very small E values are very similar to p values.
E values of about 1 to 10 are far easier to interpret than corresponding p values.
E         p
10        0.99995460
5         0.99326205
2         0.86466472
1         0.63212056
0.1       0.09516258 (about 0.1)
0.05      0.04877058 (about 0.05)
0.001     0.00099950 (about 0.001)
0.0001    0.0001000
Table 4.4 page 107
27Query length: 142196; database: 6,672,153 sequences, 23,415,242,475 total letters; E < 1E-100
28Query length: 142196; database: 6,672,153 sequences, 23,415,242,475 total letters; E < 1E-100
29Query length: 352; database: 6,672,153 sequences, 23,415,242,475 total letters; E < 1E-50
30Query length: 352; database: 6,672,153 sequences, 23,415,242,475 total letters; E < 1E-50
32Thank you