Title: Similarity Score Significance
1. Similarity Score Significance
- Lecture 18 November 1, 2005
- Algorithms in Biosequence Analysis
- Nathan Edwards - Fall, 2005
2. Similarity Score Significance
- Which of the answers represent homologous sequences?
- What is a good similarity score?
- How can we tell which answers are good?
- Why do good scores happen for bad answers?
- What similarity scores could we expect for alignments of random sequences?
3. Similarity Score Significance
- We saw last time how the alignment score is a log likelihood of H vs R
- $\text{Score} = \log \left( P[S,T \mid H] \,/\, P[S,T \mid R] \right)$
- H: homology simulator
- R: random sequence simulator
- Score > 0 ⇒ evidence for H
- Score < 0 ⇒ evidence for R
- Is a score of 1 convincing evidence of homology?
- What about 5, 10, 15, or 20?
- We need some notion of scale for the score axis, some measure of confidence.
4. Similarity Score Significance
- Want $P[H \mid S,T]$!
- Are the two sequences, S and T, homologous?
- Bayes' Theorem to the rescue!
- Plus some other probability identities...
- $P[X \mid Y] = P[Y \mid X]\, P[X] \,/\, P[Y]$
- $P[Y] = P[Y \mid A]\, P[A] + P[Y \mid B]\, P[B]$ for partition A, B.
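One way the two identities combine, as a sketch: take the partition to be H and R, write $\rho = P[H]/P[R]$ for the prior odds (the symbol $\rho$ is chosen here for illustration), and assume the score is a base-2 log likelihood ratio, so that $P[S,T \mid H] / P[S,T \mid R] = 2^{\text{Score}}$.

```latex
\begin{align*}
P[H \mid S,T]
  &= \frac{P[S,T \mid H]\, P[H]}{P[S,T]} \\
  &= \frac{P[S,T \mid H]\, P[H]}{P[S,T \mid H]\, P[H] + P[S,T \mid R]\, P[R]} \\
  &= \frac{\rho \, 2^{\text{Score}}}{\rho \, 2^{\text{Score}} + 1}
\end{align*}
```

The last step divides numerator and denominator by $P[S,T \mid R]\, P[R]$.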
5. Similarity Score Significance
- After some manipulation: $P[H \mid S,T] = \rho\, 2^S / (\rho\, 2^S + 1)$, where $\rho$ is the a priori odds ratio and $S$ is the similarity score (a base-2 log likelihood ratio).
- Logistic function: $s(x) = e^x / (e^x + 1)$
- Translates $(-\infty, \infty)$ to $(0, 1)$; $s(0) = 0.5$
- $P[H \mid S,T] = s(S \log 2 + \log \rho)$
- A posteriori probability of H, given S, T, is related to the score, adjusted by the a priori odds ratio.
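The logistic relation on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the score is in bits (base-2 log likelihood ratio); the names `logistic`, `posterior_homology`, `score_bits`, and `prior_odds` are hypothetical and not from the lecture.

```python
import math

def logistic(x: float) -> float:
    """Logistic function s(x) = e^x / (e^x + 1), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (ex + 1.0)

def posterior_homology(score_bits: float, prior_odds: float) -> float:
    """P[H | S,T] = s(S log 2 + log(prior odds)) for a score in bits."""
    return logistic(score_bits * math.log(2.0) + math.log(prior_odds))

# With even prior odds (1.0), see how fast the posterior saturates.
for score in (0, 1, 5, 10, 15, 20):
    print(score, posterior_homology(score, 1.0))
```

With even prior odds a score of 0 gives 0.5 and a score of 1 bit gives 2/3, so the interesting question is how the prior odds shift this curve.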
6. Similarity Score Significance
- Bayesian: our new understanding is based on our observation, plus whatever else we know.
- Suppose we know (or believe) that a database of $(N+1)$ proteins contains 1 protein homologous to our query P.
- $P[H] = 1/(N+1)$, $P[R] = N/(N+1)$, prior odds $= 1/N$.
- $P[H \mid S,T] = s(S - \log N)$, with $S$ in natural-log units (for a score in bits, $s(S \log 2 - \log N)$)
- Now need a higher score than before!
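To see how much higher, here is a small sketch under the slide's assumption of exactly one homolog among N + 1 database proteins, with the score taken in bits. The function names and the example values of N are illustrative only.

```python
import math

def posterior_in_database(score_bits: float, n_unrelated: int) -> float:
    """P[H | S,T] when 1 of the N + 1 database proteins is homologous."""
    # Posterior odds = prior odds (1/N) * likelihood ratio (2^S).
    odds = 2.0 ** score_bits / n_unrelated
    return odds / (odds + 1.0)

def break_even_score_bits(n_unrelated: int) -> float:
    """Score in bits at which the posterior probability is exactly 0.5."""
    return math.log2(n_unrelated)

for n in (1, 1_000, 1_000_000):
    print(f"N={n:>9}: break-even score {break_even_score_bits(n):5.2f} bits, "
          f"P[H | score 20] = {posterior_in_database(20.0, n):.4f}")
```

Each factor of two in database size costs one bit of score before the posterior reaches 0.5.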
7. Logistic function
8. Similarity score significance
- For local alignments, things are much less clear
- $nm$ local alignments between T and P (sequence lengths $n$ and $m$)
- Naïvely, this implies a $\log(nm)$ correction
- What if local alignments are not independent?
- Need small nudge factor to compensate
- Need model of random alignments...
- $P[H \mid S,T] = s(S - \log(k\, n\, m))$
9. Similarity Score Significance
- Determining an appropriate prior log likelihood for the Bayesian analysis requires two pieces
- knowledge of homologies in database
- model of non-homologies/random alignments
- Classical/frequentist approach
- Show that it is very unlikely to be random
- Reject the null hypothesis... that random alignment is plausible
10. Similarity score significance
- Let's start out simple
- ungapped global alignments,
- scoring model: match +1, mismatch -1
- Score S of length n alignment, under R?
- Each position: +1 with prob. ¼, -1 with prob. ¾
- Each position independent.
- Alignment score $S \sim -n + 2\,\mathrm{Binom}(¼, n)$
11. Similarity score significance
- $E_R(S) = -n/2$, $\mathrm{Var}_R(S) = \tfrac{3}{4} n$
- For large enough n, behaves like normal distribution
- So $S \sim \mathrm{Normal}(-n/2, \sqrt{\tfrac{3}{4} n})$.
- $P_R[S > \text{score}]$ can be computed from normal distribution tables...
- Example
- alignment of length 300 with score 120
- $P[N(-150, 15) > 120] \approx 1 \times 10^{-72}$ (18 standard deviations above the mean)
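The normal-table lookup can be sketched with the standard library's complementary error function. This assumes the ungapped +1/-1 model with match probability ¼ from the previous slide; the function names are illustrative, and the normal approximation this far into the tail should be read as a rough order of magnitude only.

```python
import math

def null_score_moments(n: int, p_match: float = 0.25) -> tuple[float, float]:
    """Mean and standard deviation of S = 2 * Binom(n, p_match) - n under R."""
    mean = n * (2.0 * p_match - 1.0)
    sd = math.sqrt(4.0 * n * p_match * (1.0 - p_match))
    return mean, sd

def normal_tail(n: int, score: float, p_match: float = 0.25) -> float:
    """Normal approximation to P_R[S > score] for a length-n alignment."""
    mean, sd = null_score_moments(n, p_match)
    z = (score - mean) / sd
    # Upper tail of the standard normal: 0.5 * erfc(z / sqrt(2)).
    return 0.5 * math.erfc(z / math.sqrt(2.0))

print(null_score_moments(300))   # mean and sd for the slide's example
print(normal_tail(300, 120))     # tail probability for score 120
```

For length 300 the moments are mean -150 and standard deviation 15, so a score of 120 sits 18 standard deviations above the mean.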
12. Similarity score significance
- However, we are rarely considering just one alignment.
- Suppose we have a database of N proteins to compare against query P
- What is probability that the best of N random alignments scores at least S?
- Given cdf $F(x) = P_R[\text{score} \le x]$, and independent alignments, $P[\text{all } N \text{ alignments score} \le S] = F(S)^N$
13. Similarity score significance
- We want prob. at least one alignment is > S
- $P_R[\text{max of } N \text{ scores} > S] = 1 - F(S)^N$
- Alternative approach: $P_R[\ge 1 \text{ score} > S] = P_R[\text{1st score} > S \text{ or 2nd score} > S \text{ or } \ldots] \le \sum_N P_R[\text{score} > S] = N(1 - F(S))$
- Doesn't assume independence...
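The two database corrections can be compared numerically. In this sketch the single-alignment tail probability 1 - F(S) is passed in directly as `tail`; the function names and the example values are illustrative. The exact form is computed with `log1p`/`expm1` because 1 - F(S)^N loses precision when the tail is tiny.

```python
import math

def p_any_independent(tail: float, n: int) -> float:
    """1 - F(S)^N for N independent alignments, where tail = 1 - F(S)."""
    if tail >= 1.0:
        return 1.0
    # F(S)^N = exp(N * log(1 - tail)); expm1 keeps small results accurate.
    return -math.expm1(n * math.log1p(-tail))

def p_any_union_bound(tail: float, n: int) -> float:
    """Upper bound N * (1 - F(S)); needs no independence assumption."""
    return min(1.0, n * tail)

for tail, n in ((1e-6, 1_000), (1e-6, 1_000_000), (1e-3, 10_000)):
    print(tail, n, p_any_independent(tail, n), p_any_union_bound(tail, n))
```

When N(1 - F(S)) is small the two values nearly coincide, and the bound becomes loose only once it approaches 1.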
14. Similarity score significance
- We can get the cdf F(x) in a variety of ways.
- Given an analytical model for R, we can determine F(x)
- Given R, we can determine an approximate analytical model for R, and determine F(x) of approximate model
- Simulate R, fit analytical model to simulation observations, determine F(x) of fitted model
- Simulate R, count number of times $S \le x$ for all x, to estimate F(x) (Histogram)
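The last option, simulating R and counting, might look like the following sketch. It assumes R is the simple ungapped model used earlier (each position matches independently with probability ¼, scoring +1 or -1); the function names, trial count, and seed are illustrative choices.

```python
import bisect
import random

def simulate_scores(n: int, trials: int, p_match: float = 0.25, seed: int = 0) -> list[int]:
    """Sorted scores of `trials` random ungapped alignments of length n."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        matches = sum(rng.random() < p_match for _ in range(n))
        scores.append(2 * matches - n)  # +1 per match, -1 per mismatch
    scores.sort()
    return scores

def empirical_cdf(sorted_scores: list[int], x: float) -> float:
    """Estimate F(x) as the fraction of simulated scores that are <= x."""
    return bisect.bisect_right(sorted_scores, x) / len(sorted_scores)

scores = simulate_scores(n=100, trials=5000)
for x in (-70, -60, -50, -40, -30):
    print(x, empirical_cdf(scores, x))
```

A histogram estimate like this cannot resolve tail probabilities much smaller than 1/trials, which is one reason to fit an analytical model to the simulation instead.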
15. Similarity score significance
- Extreme-Value (Gumbel) Distribution models the maximum of a (large) number of i.i.d. random variables.
- Normal approximation not appropriate for local alignments
- We need max of n x m local alignments
- Karlin-Altschul theory determines EVD parameters in terms of n, m, and score matrix.
16. Karlin-Altschul theory
- Assumes
- At least one alignment score is positive
- Expected scores are negative
- Characters of sequences are i.i.d.
- No gaps
- We assume, for simplicity, log likelihood s(x,y)
- Then the expected number of alignments (e-value) with score at least S is
- $E = K\, m\, n\, e^{-S}$
17. Karlin-Altschul theory
- K compensates for lack of independence of nearby local alignments
- Number of local alignments with score $\ge S$ is Poisson distributed
- $P[k \text{ local alignments} \ge S] = e^{-E} E^k / k!$
- $P[\text{at least one local alignment} \ge S] = \text{p-value} = 1 - e^{-E}$
- When E < 0.01, e-value and p-value are essentially the same.
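A short sketch of the e-value and p-value relationship, assuming scores are natural-log likelihood ratios so that E = K m n e^{-S} as on the previous slide. The value of K and the sequence lengths below are made-up placeholders, not parameters from any real scoring matrix.

```python
import math

def e_value(score: float, m: int, n: int, k: float) -> float:
    """Expected number of local alignments with score >= S: K * m * n * e^(-S)."""
    return k * m * n * math.exp(-score)

def p_value(e: float) -> float:
    """P[at least one alignment with score >= S] = 1 - e^(-E) (Poisson model)."""
    return -math.expm1(-e)  # accurate even when E is very small

K_PLACEHOLDER = 0.1  # hypothetical value, for illustration only
for score in (10.0, 15.0, 20.0, 25.0):
    e = e_value(score, m=300, n=1_000_000, k=K_PLACEHOLDER)
    print(f"S={score:4.1f}  E={e:.3e}  p={p_value(e):.3e}")
```

For E below about 0.01 the two printed columns agree to within half a percent, since 1 - e^{-E} ≈ E - E²/2.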
18. Similarity score significance
- Karlin-Altschul doesn't extend to gapped scoring models...
- ...but simulation suggests the same approach works.
- As with Bayesian approach, correct for number of independent trials
- some fraction of nm.
19. Summary
- Significance of local alignment similarity score depends on
- Score matrix, length of query, database
- Bayesian approach
- determine $P[H \mid S,T]$
- need prior log likelihood for H vs R
- Frequentist approach
- determine $P_R[\text{max score} > S]$
- need cdf F(x) for score function, or
- EVD for $P[\text{max score} > S]$