Title: Markov models
1Markov models
Marco Salemi, Ph.D.
Dept. Pathology U.F. Gainesville, FL (U.S.A.)
2Modelling genetic change
- Given two or more nt or amino acid sequences,
usually the first goal is to calculate some
measure of sequence dissimilarity - easiest way to estimate genetic distances
- p-distance (number of nt differences between two
sequences divided by the sequence length)
3Ancestral Sequence
AACCTGTGCA
Seq1 AATCTGTGTA seq2 ATCCTGGGTT
4p-distance proportion of nt between two
sequences
Seq1 AATCTGTGTA seq2 ATCCTGGGTT
p-distance0.4
Usually underestimate the true distance genetic
(or evolutionary) distance d
5Multiple, parallel, and back-substitutions
AACCTGTGCA
AACCTGTGCA T A A C AACCAGTGAA
AACCTGTGCA T G A C
ACCCGGTGAA
6p
0.7 - 0.6 - 0.5 - 0.4 - 0.3 - 0.2 - 0.1
-
Relationship between p (observed) distance and d
genetic distance
d
1 2 3 4
7Modeling nt substitutions as a time-homogeneous
time-continuous stationary Markov process (1)
Markov property
- At any given site in a sequence the rate of
change from base i to base j is independent from
the base that occupied that site prior i
8Modeling nt substitutions as a time-homogeneous
time-continuous stationary Markov process (2)
- Homogeneity
- Substitution rates do not change over time
- Stationarity
- The relative frequencies of A, C, G, and T (pA,
pC, pG, pT) are at equilibrium, i.e. remain
constant.
9The Q-matrix
pi frequency of nt i a, b, c, etc. relative
rate parameters non-diagonal entries rate flow
from nucleotide i to nucleotide j diagonal
entries total rate flow that leaves nucleotide i
(rate at which nt i disappear per site per
sequence) .
10General Time Reversible (GTR) Models
Substitutions from nucleotide i to nucleotide j
have the same rate of Substitutions from
nucleotide j to nucleotide i. In general f 1
and a, b, c, d, e are estimated from the data
via maximum likelihood
11(No Transcript)
12¾ m relative rate of nucleotide i (i A, C, G,
T) replacement during evolution nt substitutions
per sequence per site per year 2 (¾ m ) t nt
substitutions per site between two sequences
d
13Estimating transition probabilities
As soon as the Q matrix, and thus the
evolutionary model, is specified, it is possible
to calculate the probabilities of change from any
base to any other during the evolutionary time t,
P(t), by computing the matrix exponential P(t)exp
(Qt)
14Jukes and Cantor (JC) model solution
By computing P(t)exp(Qt) with Q according to the
JC model
Pii(t) 1/4 3/4 exp(-mt) Pij(t) 1/4 - 1/4
exp(-mt)
Pii(t) probability of nt i to remain the same
during time t Pij(t) probability of nt i to be
replaced by nt j during time t
15Estimating the genetic distance (1)
Anc
Prob. of sequence A and B of being
identical IAB(iG) PGG(t) PGG(t)
Sequence A B
- The probability that the ancestral base i
remained the same during the interval (t0,t1) is
given by Pii(t1), for any of the two sequences,
and by Pii(t1)2, for both.
16Estimating the genetic distance (2)
Anc
Prob. of sequence A and B of being
identical IAB(iG) PAG(t) PAG(t) PCG(t)
PCG(t) PTG(t) PTG(t)
Sequence A B
- the probability that an ancestral base i mutated
into j in both sequences is given by Pij(t1)2
however, since j could have been anyone of the
remaining three bases, the total probability of
the second scenario will be 3Pij(t1)2
17Estimating the genetic distance (3)
- The total probability of two sequences sharing
the same nucleotide at any given site, I(t) will
be the sum - I(t) Pii(t1)2 3Pij(t1)2
- and the probability of the two sequences being
different, p1-I(t), - p 1 -Pii(t1)2 - 3Pij(t1)2
- An estimator of p is the observed proportion of
different sites between two sequences (
p-distance).
18Estimating the genetic distance (4)
Substituting Pii(t) 1/4 3/4 exp(-mt) and
Pij(t) 1/4 - 1/4 exp(-mt) into p 1
-Pii(t1)2 - 3Pij(t1)2 we obtain for two
sequences that split t time units ago p 3/4 1-
exp(-2mt) Solving for mt we get mt - 1/2 log
(1- 4/3 p)
192 (¾ m ) t nt substitutions per site between
two sequences d Therefore mt 2/3 d
20Genetic distances correction formula JC model
Substituting mt with 2/3d in the equation mt -
1/2 log (1- 4/3 p) we finally obtain the
Jukes-Cantor correction formula for the genetic
distance d between two sequences d - 3/4 ln
(1- 4/3 p) It can also be demonstrated that the
variance V(d) will be given by V(d)
9p(1-p)/(3-4p)2
21Seq1 AATCTGTGTA seq2 ATCCTGGGTT
p-distance 0.4 d (JC model) - 3/4 ln 1-
4/3 (0.4) 0.5716
22Ancestral Sequence
AACCTGTGCA
AATCTGTGTA
ATCCTGGGTT
p-distance 0.4 d (JC model) - 3/4 ln 1-
4/3 (0.4) 0.5716
23Q-matrix for the Tajima-Nei model
pA? pT ? pC ? pG
A C G T
A
m pC m pG
m pT m pA
m pG m pT m pA m pC
m pT m pA m pC
m pG
C
G
T
24Tajima-Nei model correction formula
- d - B ln (1- p/B)
- Where B 1 (p2 pT2 pC2 pG2)
- p observed distance
- When pA pT pC pG0.25, B ¾, and the
formula becomes equivalent to the one obtained
for the JC model
25Q-matrix for the Kimura-2p (K80) model
pA pT pC pG0.25 bek (Transitions) acd
f1 (Transverstions) k --gt Ti/Tv parameter
A C G T
A
-3/4m(k2) 1/4m 1/4km 1/4m 1/4m
-3/4m(k2) 1/4m 1/4km 1/4km 1/4m
-3/4m(k2) 1/4m 1/4m 1/4km 1/4m
-3/4m(k2)
C
G
T
26K80 model solution
By computing P(t)exp(Qt) with Q according to the
K80 model
Pii(t) ¼ ¼ e-mt ½ e-m(k1)t/2 Pij(t)_Ti
¼ ¼ e-mt - ½ e-m (k1)t/2 Pij(t)_Tv ¼ - ¼
e-mt
Pii(t) probability of nt i to remain the same
during time t Pij(t)_Ti probability of nt i ?
j transitions during time t Pij(t)_Tv
probability of nt i ? j transversions during time
t
27Genetic distances correction formula K80 model
Following the same derivation as for the JC
model, the K80 correction formula for the genetic
distance d between two sequences is found to
be d 1/2 ln 1/(1-2P-Q) 1/4 ln 1/(1-2Q)
P observed proportions of transition-type
differences Q observed proportions of
transversion-type differences
28Nucleotide frequencies in HIV/SIV are at
equilibrium pol gene
Average SEQUENCE COMPOSITION (HIV-O/HIV-M full
pol) 5 chi-square test
p-value SE8538a passed
97.80 97TZ02a passed
94.59 BOLO122b passed
99.94 CAM1b passed
96.73 NY5CGb passed
97.64 98IN022c passed
99.44 94IN112c passed
98.68 93IN101c passed
99.61 VI850f passed
97.09 X138g passed
86.61 SE6165g passed
95.73 VI991h passed
98.23 SE9173j passed
96.17 SE92809j passed
96.50 MP535k passed
69.92 92UG001d passed
86.20 HIVO passed
77.48
pA 39.0 pC 16.6 pG 22.8 pT
21.6 Average Ti/Tv2.6
29Nucleotide frequencies in HIV/SIV are at
equilibrium env gene
Average SEQUENCE COMPOSITION (SIV/HIV full
envelope) 5 chi-square test
p-value MVP5180 passed
14.60 SIVcpzUS passed
48.09 SIVcpzGAB passed
51.77 92UG037a passed
84.58 92UG975g passed
99.73 92RU131g passed
97.45 93IN905c passed
77.15 92BRO25c passed
59.51 92UG021d passed
94.89 92UG024d passed
92.60 BSSG3b passed
97.86 SFMHS20b passed
92.40 91TH652b passed
92.86 MBC18R01b passed
99.59
pA 34.5 pC 17.4 pG 23.4 pT
24.7 Average Ti/Tv1.5
30Q-matrix for the F84 model (equivalent to the
HKY85 model)
31F84 (HKY85) model solution
By computing P(t)exp(Qt) with Q according to the
HKY85 (or F84) model analytical formulas can be
found for
Pii(t) probability of nt i to remain the same
during time t Pij(t)_Ti probability of nt i ?
j transitions during time t Pij(t)_Tv
probability of nt i ? j transversions during time
t
F84 correction formula
d -2A ln1-P/2A-(A-B)Q/2AC) 2(A-B-C) ln
(1-Q/2C) A pCpT/(pCpT) pApG/(pApG), B
pCpTpApG, C(pApG) (pCpT), Pobserved
proportions of transition-type differences,
Qobserved proportions of transversion-type
differences
32Nucleotide substitution patterns in HIV/SIV
Average frequency of changes between states
To
SIV/HIV-1 envelope
346.2 697.4 290.3
241.9 123
320.8
From
515.4 126.6
117.1
Transitions
215.6 371 144.6
Transversions
Average Ti/Tv1.5
33More complex models
More complex models, like Tamura-Nei (TN93), or
the general time reversible (GTR) model usually
requires numerical algorithms in order to
calculate d. Several phylogenetic software
packages exist that can estimate genetic
distances between nucleotide sequences according
to different evolutionary models (MEGA3, PAUP,
PHYLIP, TREE-PUZZLE, DAMBE).
34Estimating HIV genetic distancesenv gene
HIV-1B vs HIV-O/SIVcpz/HIV-1C full envelope
35Estimating HIV genetic distancespol gene
HIV-1B vs HIV-O/HIV-1C full pol
p-distance JC69 K80
Tajima-Nei
HIV-O HIV-1C
0.257 (.007) 0.315 (.010) 0.318 (.011)
0.324 (.011) 0.103 (.005) 0.111 (.005)
0.113 (.006) 0.114 (.006)
36(No Transcript)
37Modeling Rate heterogeneity over sites
- All the models discussed so far assume rate
homogeneity over sites (each site along the
sequence is assumed to evolve at the same rate) - Different codon position evolve at different
rates (usually rate3rd gt rate1st gt rate2nd) - Within a protein there can be several conserved
regions (very low mutation rate) and
hypervariable regions (mutational hotspots) - For example, HIV V3 loop
- Genetic distances can be grossly underestimated
if rate heterogeneity over sites is not
considered
38Nucleotide substitution rate heterogeneity over
sites in HIV/SIV
SIV/HIV-1 envelope
Number of Characters
Number of steps
39G-distributed rates across sites
- The numbers of changes for each site in a given
alignment are expected to be - Poisson-distributed if the nucleotide
substitution rate is constant across sites - Negative-binomial distributed if the rate is
G-distributed across sites - In most real data sets the nucleotide
substitution rate seems to fit a G-distribution
40The G-distribution
- Known density function
- The shape of the distribution depends by the a
parameter - a lt 1 ? L-shape
- a gt 1 ? bell shape
- a ? ? ? one single value on the x-axis
41Modeling Rate heterogeneity over sites with a
G-distribution (1)
- A wide variety of rate distributions can be
obtained by setting a1/b and varying a. - The G-distribution can be incorporated in the
nucleotide substitution models - G-models are more realistic since they take into
account rate heterogeneity over sites - Strong rate heterogeneity is modeled using alt1
(for example, a0.5) - Weak rate heterogeneity is modeled using agt1 (for
example, a2)
42Modeling Rate heterogeneity over sites with a
G-distribution (2)
Most of the sites have relative rate around 1
Invariable or nearly-invariable sites
Mutational hotspots
43Some estimated a values
44JC model with G-distributed rates
- Golding (1983), and Nei and Gojobori (1986), have
shown that when the nucleotide substitution at
each site follows the JC model, but the
substitution rate varies with a G-distribution,
the G-distance is given by the formula on the
right
45The Discrete G-distribution (1)
- a is usually estimated from the data with maximum
likelihood (ML) - A set of aligned sequences and a tree are
required for the ML estimate - Integrating the likelihood function using a
continuous G-distribution is too expensive
computationally - Yang (1994) proposed to approximate the
continuous G-function using 4-8 discrete
categories of rates each one with equal
probability
46The Discrete G-distribution (2)
0.04
Four discrete categories
0.02
Frequency
0.0
0
1
2
3
4
5
Relative rate
- In general, the more the categories, the better
the approximation. - 4-8 categories are usually enough
47Estimating HIV genetic distanceswith G-models
HIV-1B vs HIV-O/SIVcpz/HIV-1C full envelope, K80
model