Title: Dayhoff
1 Dayhoff's Markov Model of Evolution
2 Brands of Soup Revisited
[Transition diagram: Brand A and Brand B, with $P(B \mid A) = 2/7$ and $P(A \mid B) = 2/7$]
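3 Brands of Soup Revisited
Transition Diagram
[Brand A and Brand B, with $P(B \mid A) = p = 2/7$ and $P(A \mid B) = p = 2/7$]
Conditional Probability Formulas
$$P(A_k) = P(A_{k-1})\,(1-p) + P(B_{k-1})\,p = \tfrac{5}{7}\,P(A_{k-1}) + \tfrac{2}{7}\,P(B_{k-1})$$
$$P(B_k) = P(A_{k-1})\,p + P(B_{k-1})\,(1-p) = \tfrac{2}{7}\,P(A_{k-1}) + \tfrac{5}{7}\,P(B_{k-1})$$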
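Matrix Representation
$$\begin{pmatrix} P(A_k) & P(B_k) \end{pmatrix} = \begin{pmatrix} P(A_{k-1}) & P(B_{k-1}) \end{pmatrix} \begin{pmatrix} 5/7 & 2/7 \\ 2/7 & 5/7 \end{pmatrix}$$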
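The conditional probability formulas and the matrix form describe the same update, which a few lines of Python can check. This is a minimal sketch: the value p = 2/7 comes from the slide, while the starting distribution (everyone buys Brand A) and the function names are assumptions made for illustration.

```python
from fractions import Fraction

p = Fraction(2, 7)  # probability of switching brands in one step (from the slide)

def step_formulas(a_prev, b_prev):
    """One step using the two conditional probability formulas directly."""
    a_next = a_prev * (1 - p) + b_prev * p
    b_next = a_prev * p + b_prev * (1 - p)
    return [a_next, b_next]

# Row-stochastic matrix, state order (A, B):
# M[i][j] = P(next state is j | current state is i).
M = [[1 - p, p],
     [p, 1 - p]]

def step_matrix(dist):
    """One step as row vector times matrix: new[j] = sum_i dist[i] * M[i][j]."""
    return [sum(dist[i] * M[i][j] for i in range(2)) for j in range(2)]

# Assumed starting point (not from the slides): everyone buys Brand A.
dist = [Fraction(1), Fraction(0)]
for k in range(1, 4):
    assert step_formulas(*dist) == step_matrix(dist)  # both forms agree
    dist = step_matrix(dist)
    print(f"k={k}: P(A)={dist[0]}, P(B)={dist[1]}")
```

Exact fractions are used so that the two forms can be compared with `==` rather than with a floating-point tolerance.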
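8 Markov Processes Can Be Represented by Matrices
E.g., a 3-state process
[Transition diagram: three states; edge labels include 1/2, 1/3, and 1/4]
can be represented with this matrix:
[Figure: the corresponding 3 x 3 transition matrix]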
9 Each Step Involves an Inner Product
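A short sketch of what "inner product" means here: each entry of the next probability vector is the inner product of the current probability vector with one column of the transition matrix. The 3-state matrix below is a made-up example (it is not the matrix from the slide), and the helper names are hypothetical.

```python
from fractions import Fraction as F

# Hypothetical 3-state transition matrix; each row sums to 1.
M = [[F(1, 2), F(1, 4), F(1, 4)],
     [F(1, 3), F(1, 3), F(1, 3)],
     [F(1, 4), F(1, 4), F(1, 2)]]

def inner(u, v):
    """Inner (dot) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def column(matrix, j):
    """Column j of the matrix: probabilities of arriving in state j."""
    return [row[j] for row in matrix]

def step(dist, matrix):
    """Next distribution: one inner product per destination state."""
    return [inner(dist, column(matrix, j)) for j in range(len(matrix))]

start = [F(1), F(0), F(0)]   # assumed start: certainly in state 0
after_one = step(start, M)   # equals row 0 of M
after_two = step(after_one, M)
print(after_one, after_two)
```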
11 Markov Matrix Properties
- The probabilities in each row must sum to 1.
- No change corresponds to the identity matrix (1s on the diagonal, 0s elsewhere).
- If the process is well-behaved, multiplying the matrix by itself many times converges to a limit.
- Within each column of this limit matrix, all elements are identical.
- Each row of the limit matrix therefore gives the equilibrium probabilities for the process.
Well-behaved means that (1) every state can transition to every other state, at least indirectly, and (2) the greatest common divisor of the cycle lengths in the transition diagram is 1.
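The convergence property can be illustrated with plain Python lists. This sketch uses the Brands of Soup matrix (switching probability 2/7) as the well-behaved case and a hypothetical two-state "flip" matrix, whose only cycles have even length, as a case that violates condition (2). The function names are assumptions for illustration.

```python
def matmul(x, y):
    """Product of two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][t] * y[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def matpow(matrix, power):
    """matrix multiplied by itself `power` times, starting from the identity."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = matmul(result, matrix)
    return result

soup = [[5 / 7, 2 / 7],
        [2 / 7, 5 / 7]]   # well-behaved: converges
flip = [[0.0, 1.0],
        [1.0, 0.0]]       # always switches: cycle lengths 2, 4, ... (gcd 2)

limit = matpow(soup, 50)
print(limit)               # both rows approach the equilibrium probabilities
print(matpow(flip, 50))    # identity matrix
print(matpow(flip, 51))    # back to flip: the powers oscillate, no limit
```

For the soup matrix both rows approach (1/2, 1/2), so the starting brand no longer matters; the flip matrix never settles down.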
12 Ask Mathematica!
Recall m
13 Margaret Dayhoff
- Had a large (for 1978) database of related proteins.
- Asked: what is the probability that two aligned sequences are related by evolution?
Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. 1978. A model of evolutionary change in proteins. Pp. 345-352 in M. O. Dayhoff, ed. Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington, D.C.
14 Dayhoff Model
- Amino acids change over time independently of their position in a protein (a simplifying assumption).
- The probability of a substitution depends only on the amino acids involved and not on the prior history (Markov model).
15 A Sequence Alignment
(Example alignment from a BLAST search)
>gi|1173266|sp|P44374|RS5_HAEIN 30S ribosomal protein S5
Length = 166
Score = 263 bits (672), Expect = 1e-70
Identities = 154/166 (92%), Positives = 159/166 (95%)

Query   1 MAHIEKQAGELQEKLIAVNRVSKTVKGGRIFSFTALTVVGDGNGRVGFGYGKAREVPAAI 60
          M++IEKQ GELQEKLIAVNRVSKTVKGGRI SFTALTVVGDGNGRVGFGYGKAREVPAAI
Sbjct   1 MSNIEKQVGELQEKLIAVNRVSKTVKGGRIMSFTALTVVGDGNGRVGFGYGKAREVPAAI 60

Query  61 QKAMEKARRNMINVALNNGTLQHPVKGVHTGSRVFMQPASEGTGIIAGGAMRAVLEVAGV 120
          QKAMEKARRNMINVALN GTLQHPVKGVHTGSRVFMQPASEGTGIIAGGAMRAVLEVAGV
Sbjct  61 QKAMEKARRNMINVALNEGTLQHPVKGVHTGSRVFMQPASEGTGIIAGGAMRAVLEVAGV 120

Query 121 HNVLAKAYGSTNPINVVRATIDGLENMNSPEMVAAKRGKSVEEILG 166
           NVL+KAYGSTNPINVVRATID L NM SPEMVAAKRGK+V+EILG
Sbjct 121 RNVLSKAYGSTNPINVVRATIDALANMKSPEMVAAKRGKTVDEILG 166
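Counting identities and tallying which amino acid pairs differ in an alignment like the one above can be sketched in a few lines of Python. The two sequences are copied from the Query and Sbjct rows of the slide; the variable names are assumptions, and this simple tally is only an illustration of pair counting, not Dayhoff's full procedure (which worked from many alignments of closely related proteins).

```python
from collections import Counter

# Query and Sbjct sequences from the BLAST example (ungapped, 166 residues each).
query = ("MAHIEKQAGELQEKLIAVNRVSKTVKGGRIFSFTALTVVGDGNGRVGFGYGKAREVPAAI"
         "QKAMEKARRNMINVALNNGTLQHPVKGVHTGSRVFMQPASEGTGIIAGGAMRAVLEVAGV"
         "HNVLAKAYGSTNPINVVRATIDGLENMNSPEMVAAKRGKSVEEILG")
sbjct = ("MSNIEKQVGELQEKLIAVNRVSKTVKGGRIMSFTALTVVGDGNGRVGFGYGKAREVPAAI"
         "QKAMEKARRNMINVALNEGTLQHPVKGVHTGSRVFMQPASEGTGIIAGGAMRAVLEVAGV"
         "RNVLSKAYGSTNPINVVRATIDALANMKSPEMVAAKRGKTVDEILG")

# Identities: aligned positions where both sequences have the same residue.
identities = sum(a == b for a, b in zip(query, sbjct))
percent_identity = 100 * identities // len(query)  # truncated, as BLAST reports it

# Substitutions: unordered residue pairs at positions that differ.
# Sorting each pair makes (A, S) and (S, A) count as the same substitution,
# because neither sequence is treated as the ancestor.
substitutions = Counter(tuple(sorted(pair))
                        for pair in zip(query, sbjct) if pair[0] != pair[1])

print(f"Identities = {identities}/{len(query)} ({percent_identity}%)")
for (a, b), count in sorted(substitutions.items()):
    print(a, b, count)
```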
16 Observed Substitution Frequencies
A
R  30
N 109  17
D 154   0 532
C  33  10   0   0
Q  93 120  50  76   0
E 266   0  94 831   0 422
G 579  10 156 162  10  30 112
H  21 103 226  43  10 243  23  10
I  66  30  36  13  17   8  35   0   3
L  95  17  37   0   0  75  15  17  40 253
K  57 477 322  85   0 147 104  60  23  43  39
M  29  17   0   0   0  20   7   7   0  57 207  90
F  20   7   7   0   0   0   0  17  20  90 167   0  17
P 345  67  27  10  10  93  40  49  50   7  43  43   4   7
S 772 137 432  98 117  47  86 450  26  20  32 168  20  40 269
T 590  20 169  57  10  37  31  50  14 129  52 200  28  10  73 696
W   0  27   3   0   0   0   0   0   3   0  13   0   0  10   0  17   0
Y  20   3  36   0  30   0  10   0  40  13  23  10   0 260   0  22  23   6
V 365  20  13  17  33  27  37  97  30 661 303  17  77  10  50  43 186   0  17
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y
17 Building a Markov Model
- From the observed substitution data, Dayhoff et al. were able to estimate the joint probabilities of two amino acids substituting for each other. This yields a big, diagonally symmetric matrix of probabilities, $M_{ab}$. The diagonal elements $M_{aa}$ (no substitution) are by far the largest.
- But the matrix of joint probabilities, $P(b \cap a)$, does not represent a Markov process. Recall that the elements of a Markov process matrix are conditional probabilities, $P(b \mid a) = P(b \cap a) / P(a)$. $P(a)$ is just the probability (frequency) of amino acid a, so each row a of $M_{ab}$ is divided by the frequency of the corresponding amino acid. The diagonal elements are now all close to 1.
- Dayhoff then adjusts the small off-diagonal elements by a common factor that makes the expected number of amino acid substitutions equal to 1 in 100. The diagonal elements are then adjusted to make each row add up to 1, as required by the law of total probability.
- This is the PAM1 Markov matrix (PAM = Point Accepted Mutation; 1 = 1% substitution frequency).
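The steps above can be sketched on a toy example. Everything concrete here is an assumption for illustration: a 3-letter alphabet, made-up symmetric pair counts, and the convention (taken from the Markov Matrix Properties slide) that row a holds the conditional probabilities $P(b \mid a)$ and sums to 1. It follows the simplified description in the bullets rather than the exact procedure of the 1978 paper.

```python
# Hypothetical counts of aligned residue pairs for a toy 3-letter alphabet.
# The matrix is symmetric; diagonal entries count unchanged positions.
letters = ["A", "G", "S"]
counts = [[900, 30, 70],
          [30, 800, 20],
          [70, 20, 700]]
n = len(letters)

# Step 1: joint probabilities P(a and b), a symmetric matrix that sums to 1.
total = sum(sum(row) for row in counts)
joint = [[c / total for c in row] for row in counts]

# Step 2: conditional probabilities P(b | a) = P(a and b) / P(a).
freq = [sum(row) for row in joint]  # P(a), the amino acid frequencies
cond = [[joint[a][b] / freq[a] for b in range(n)] for a in range(n)]

# Step 3: scale the off-diagonal elements by one common factor so that the
# expected fraction of substituted positions is 1 in 100.
expected_change = sum(freq[a] * (1 - cond[a][a]) for a in range(n))
scale = 0.01 / expected_change
pam1 = [[scale * cond[a][b] if a != b else 0.0 for b in range(n)]
        for a in range(n)]

# Step 4: set each diagonal element so that its row sums to 1.
for a in range(n):
    pam1[a][a] = 1.0 - sum(pam1[a])

for letter, row in zip(letters, pam1):
    print(letter, [round(v, 5) for v in row])
```

The common scale factor preserves the relative sizes of the off-diagonal elements within the matrix while fixing the overall substitution rate at 1%.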
18 Using the PAM Model
- The PAM1 Markov matrix can be multiplied by itself to yield the PAM2 Markov matrix, and multiplied by PAM1 again to yield the PAM3 matrix, etc. PAM1 is a unit of evolutionary distance.
- PAM250 is commonly used. Note that this does not mean 250% of the amino acids have been substituted; it's more like 80%.
- The PAM Markov matrices arrived at by matrix multiplication need to be converted into the scoring matrices that one would use for BLAST or CLUSTALW.
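Why 250 PAM units do not mean 250% substitution can be seen by raising a PAM1-like matrix to the 250th power. The sketch below does not use the real PAM1 matrix; it assumes a deliberately simple toy model in which every amino acid changes with probability 0.01 per step and is equally likely to become any of the other 19. Repeated substitutions at the same site, including changes back to the original residue, keep the observed difference well below 100%.

```python
N = 20        # number of amino acids
RATE = 0.01   # probability of a substitution in one PAM step

# Toy PAM1 (an assumption, not Dayhoff's matrix): uniform substitutions.
toy_pam1 = [[1 - RATE if i == j else RATE / (N - 1) for j in range(N)]
            for i in range(N)]

def matmul(x, y):
    """Product of two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][t] * y[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def matpow(matrix, power):
    """matrix multiplied by itself `power` times, starting from the identity."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = matmul(result, matrix)
    return result

toy_pam250 = matpow(toy_pam1, 250)
unchanged = toy_pam250[0][0]  # P(same residue after 250 steps)

# Closed form for this uniform model, used as a cross-check.
closed_form = 1 / N + (1 - 1 / N) * (1 - RATE * N / (N - 1)) ** 250

print(f"fraction still identical: {unchanged:.4f}")
print(f"fraction different:       {1 - unchanged:.4f}")
```

In this toy model roughly 88% of positions differ after 250 steps; the real PAM250 figure quoted on the slide (about 80%) is lower because actual substitutions are far from uniform.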
19 Probability of an Alignment
In a random model, the two proteins x and y are independent, so the probability of their alignment is the product of the background probabilities $q_a$ of all the amino acids in both sequences:
$$P(x, y \mid \text{random}) = \prod_i q_{x_i} \prod_i q_{y_i}$$
(Note that the $q_a$ are not all equal to 1/20.)
In a match model, the proteins have descended from a common ancestor protein and the amino acid sequences are no longer independent. In this model, the probability is expressed with a matrix of joint probabilities $p_{ab}$, one factor per aligned pair:
$$P(x, y \mid \text{match}) = \prod_i p_{x_i y_i}$$
(Note that $p_{ab} = p_{ba}$ because neither protein is first.)
Dayhoff and coworkers could estimate these probabilities from the frequencies of amino acid substitutions they observed in their database of evolutionarily related proteins.
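A small numerical sketch of the two models, using a made-up two-letter alphabet. The frequencies `q`, the joint probabilities `p`, and the three-column alignment are all hypothetical values chosen so that `p` is symmetric and its row sums equal `q`.

```python
import math

# Hypothetical background frequencies q_a (not all equal).
q = {"A": 0.6, "S": 0.4}

# Hypothetical joint probabilities p_ab for aligned pairs; symmetric, sums to 1.
p = {("A", "A"): 0.5, ("A", "S"): 0.1,
     ("S", "A"): 0.1, ("S", "S"): 0.3}

x = "AAS"
y = "ASS"

# Random model: the sequences are independent, so multiply every q.
p_random = math.prod(q[a] for a in x) * math.prod(q[b] for b in y)

# Match model: one joint probability per aligned column.
p_match = math.prod(p[(a, b)] for a, b in zip(x, y))

print(f"P(x, y | random) = {p_random:.6f}")
print(f"P(x, y | match)  = {p_match:.6f}")
```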
20 A Log-Odds Score
We are interested in the ratio of the match model probability of the alignment to the random model probability:
$$\frac{P(x, y \mid \text{match})}{P(x, y \mid \text{random})} = \prod_i \frac{p_{x_i y_i}}{q_{x_i}\, q_{y_i}}$$
In practice, we usually take the log of these quantities for a substitution scoring matrix. This changes the multiplications into additions and reduces round-off error:
$$S = \sum_i S(x_i, y_i), \qquad S(a, b) = \log \frac{p_{ab}}{q_a\, q_b}$$
S(a,b) is the number you usually see in a substitution matrix. These numbers are usually rounded to integers to ease computation.
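The log-odds score can be sketched with the same kind of toy numbers. The frequencies `q`, joint probabilities `p`, and the short alignment are hypothetical, and the scaling used for the integer matrix (10 times the base-10 logarithm) is an assumption made for illustration; other scalings are also in use.

```python
import math

# Hypothetical background frequencies and symmetric joint probabilities.
q = {"A": 0.6, "S": 0.4}
p = {("A", "A"): 0.5, ("A", "S"): 0.1,
     ("S", "A"): 0.1, ("S", "S"): 0.3}

# Odds ratio for each aligned pair: match model over random model.
odds = {(a, b): p[(a, b)] / (q[a] * q[b]) for (a, b) in p}

# Log-odds score S(a, b) (natural log), and an integer version for lookup.
S = {pair: math.log(value) for pair, value in odds.items()}
S_int = {pair: round(10 * math.log10(value)) for pair, value in odds.items()}

# Scoring an alignment: add the per-column scores instead of multiplying odds.
x = "AAS"
y = "ASS"
score = sum(S[(a, b)] for a, b in zip(x, y))
ratio = math.prod(odds[(a, b)] for a, b in zip(x, y))

print(S_int)
print(f"sum of scores = {score:.4f}, log of ratio = {math.log(ratio):.4f}")
```

Pairs that occur more often in related proteins than chance predicts get positive scores, and pairs that occur less often get negative scores.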
21 Questions?
- I will post a Mathematica notebook.