Markov models - PowerPoint PPT Presentation

1 / 47

About This Presentation

Title:

Markov models

Description:

Q-matrix for the Jukes and Cantor (JC) model -3/4m 1/4m 1/4m 1/4m. 1/4m -3/4m 1/4m 1/4m ... Jukes and Cantor (JC) model solution. Pii(t) = 1/4 3/4 exp(-mt) ... – PowerPoint PPT presentation

Number of Views:56

Avg rating:3.0/5.0

Slides: 48

Provided by: ktl

Category:

more less

Transcript and Presenter's Notes

Title: Markov models

1
Markov models

in Molecular phylogeny

Marco Salemi, Ph.D.
Dept. Pathology U.F. Gainesville, FL (U.S.A.)
2
Modelling genetic change

Given two or more nt or amino acid sequences,
usually the first goal is to calculate some
measure of sequence dissimilarity
easiest way to estimate genetic distances
p-distance (number of nt differences between two
sequences divided by the sequence length)

3
Ancestral Sequence
AACCTGTGCA
Seq1 AATCTGTGTA seq2 ATCCTGGGTT
4
p-distance proportion of nt between two
sequences
Seq1 AATCTGTGTA seq2 ATCCTGGGTT
p-distance0.4
Usually underestimate the true distance genetic
(or evolutionary) distance d
5
Multiple, parallel, and back-substitutions
AACCTGTGCA
AACCTGTGCA T A A C AACCAGTGAA

AACCTGTGCA T G A C
ACCCGGTGAA
6
p
0.7 - 0.6 - 0.5 - 0.4 - 0.3 - 0.2 - 0.1
-
Relationship between p (observed) distance and d
genetic distance
d
1 2 3 4
7
Modeling nt substitutions as a time-homogeneous
time-continuous stationary Markov process (1)
Markov property

At any given site in a sequence the rate of
change from base i to base j is independent from
the base that occupied that site prior i

8
Modeling nt substitutions as a time-homogeneous
time-continuous stationary Markov process (2)

Homogeneity
Substitution rates do not change over time
Stationarity
The relative frequencies of A, C, G, and T (pA,
pC, pG, pT) are at equilibrium, i.e. remain
constant.

9
The Q-matrix
pi frequency of nt i a, b, c, etc. relative
rate parameters non-diagonal entries rate flow
from nucleotide i to nucleotide j diagonal
entries total rate flow that leaves nucleotide i
(rate at which nt i disappear per site per
sequence) .
10
General Time Reversible (GTR) Models
Substitutions from nucleotide i to nucleotide j
have the same rate of Substitutions from
nucleotide j to nucleotide i. In general f 1
and a, b, c, d, e are estimated from the data
via maximum likelihood
11
(No Transcript)
12
¾ m relative rate of nucleotide i (i A, C, G,
T) replacement during evolution nt substitutions
per sequence per site per year 2 (¾ m ) t nt
substitutions per site between two sequences
d
13
Estimating transition probabilities
As soon as the Q matrix, and thus the
evolutionary model, is specified, it is possible
to calculate the probabilities of change from any
base to any other during the evolutionary time t,
P(t), by computing the matrix exponential P(t)exp
(Qt)
14
Jukes and Cantor (JC) model solution
By computing P(t)exp(Qt) with Q according to the
JC model
Pii(t) 1/4 3/4 exp(-mt) Pij(t) 1/4 - 1/4
exp(-mt)
Pii(t) probability of nt i to remain the same
during time t Pij(t) probability of nt i to be
replaced by nt j during time t
15
Estimating the genetic distance (1)
Anc
Prob. of sequence A and B of being
identical IAB(iG) PGG(t) PGG(t)
Sequence A B

The probability that the ancestral base i
remained the same during the interval (t0,t1) is
given by Pii(t1), for any of the two sequences,
and by Pii(t1)2, for both.

16
Estimating the genetic distance (2)
Anc
Prob. of sequence A and B of being
identical IAB(iG) PAG(t) PAG(t) PCG(t)
PCG(t) PTG(t) PTG(t)
Sequence A B

the probability that an ancestral base i mutated
into j in both sequences is given by Pij(t1)2
however, since j could have been anyone of the
remaining three bases, the total probability of
the second scenario will be 3Pij(t1)2

17
Estimating the genetic distance (3)

The total probability of two sequences sharing
the same nucleotide at any given site, I(t) will
be the sum
I(t) Pii(t1)2 3Pij(t1)2
and the probability of the two sequences being
different, p1-I(t),
p 1 -Pii(t1)2 - 3Pij(t1)2
An estimator of p is the observed proportion of
different sites between two sequences (
p-distance).

18
Estimating the genetic distance (4)
Substituting Pii(t) 1/4 3/4 exp(-mt) and
Pij(t) 1/4 - 1/4 exp(-mt) into p 1
-Pii(t1)2 - 3Pij(t1)2 we obtain for two
sequences that split t time units ago p 3/4 1-
exp(-2mt) Solving for mt we get mt - 1/2 log
(1- 4/3 p)
19
2 (¾ m ) t nt substitutions per site between
two sequences d Therefore mt 2/3 d
20
Genetic distances correction formula JC model
Substituting mt with 2/3d in the equation mt -
1/2 log (1- 4/3 p) we finally obtain the
Jukes-Cantor correction formula for the genetic
distance d between two sequences d - 3/4 ln
(1- 4/3 p) It can also be demonstrated that the
variance V(d) will be given by V(d)
9p(1-p)/(3-4p)2
21
Seq1 AATCTGTGTA seq2 ATCCTGGGTT
p-distance 0.4 d (JC model) - 3/4 ln 1-
4/3 (0.4) 0.5716
22
Ancestral Sequence
AACCTGTGCA
AATCTGTGTA
ATCCTGGGTT
p-distance 0.4 d (JC model) - 3/4 ln 1-
4/3 (0.4) 0.5716
23
Q-matrix for the Tajima-Nei model
pA? pT ? pC ? pG
A C G T
A
m pC m pG
m pT m pA
m pG m pT m pA m pC
m pT m pA m pC
m pG
C
G
T
24
Tajima-Nei model correction formula

d - B ln (1- p/B)
Where B 1 (p2 pT2 pC2 pG2)
p observed distance
When pA pT pC pG0.25, B ¾, and the
formula becomes equivalent to the one obtained
for the JC model

25
Q-matrix for the Kimura-2p (K80) model
pA pT pC pG0.25 bek (Transitions) acd
f1 (Transverstions) k --gt Ti/Tv parameter
A C G T
A
-3/4m(k2) 1/4m 1/4km 1/4m 1/4m
-3/4m(k2) 1/4m 1/4km 1/4km 1/4m
-3/4m(k2) 1/4m 1/4m 1/4km 1/4m
-3/4m(k2)
C
G
T
26
K80 model solution
By computing P(t)exp(Qt) with Q according to the
K80 model
Pii(t) ¼ ¼ e-mt ½ e-m(k1)t/2 Pij(t)_Ti
¼ ¼ e-mt - ½ e-m (k1)t/2 Pij(t)_Tv ¼ - ¼
e-mt
Pii(t) probability of nt i to remain the same
during time t Pij(t)_Ti probability of nt i ?
j transitions during time t Pij(t)_Tv
probability of nt i ? j transversions during time
t
27
Genetic distances correction formula K80 model
Following the same derivation as for the JC
model, the K80 correction formula for the genetic
distance d between two sequences is found to
be d 1/2 ln 1/(1-2P-Q) 1/4 ln 1/(1-2Q)
P observed proportions of transition-type
differences Q observed proportions of
transversion-type differences
28
Nucleotide frequencies in HIV/SIV are at
equilibrium pol gene
Average SEQUENCE COMPOSITION (HIV-O/HIV-M full
pol) 5 chi-square test
p-value SE8538a passed
97.80 97TZ02a passed
94.59 BOLO122b passed
99.94 CAM1b passed
96.73 NY5CGb passed
97.64 98IN022c passed
99.44 94IN112c passed
98.68 93IN101c passed
99.61 VI850f passed
97.09 X138g passed
86.61 SE6165g passed
95.73 VI991h passed
98.23 SE9173j passed
96.17 SE92809j passed
96.50 MP535k passed
69.92 92UG001d passed
86.20 HIVO passed
77.48
pA 39.0 pC 16.6 pG 22.8 pT
21.6 Average Ti/Tv2.6
29
Nucleotide frequencies in HIV/SIV are at
equilibrium env gene
Average SEQUENCE COMPOSITION (SIV/HIV full
envelope) 5 chi-square test
p-value MVP5180 passed
14.60 SIVcpzUS passed
48.09 SIVcpzGAB passed
51.77 92UG037a passed
84.58 92UG975g passed
99.73 92RU131g passed
97.45 93IN905c passed
77.15 92BRO25c passed
59.51 92UG021d passed
94.89 92UG024d passed
92.60 BSSG3b passed
97.86 SFMHS20b passed
92.40 91TH652b passed
92.86 MBC18R01b passed
99.59
pA 34.5 pC 17.4 pG 23.4 pT
24.7 Average Ti/Tv1.5
30
Q-matrix for the F84 model (equivalent to the
HKY85 model)
31
F84 (HKY85) model solution
By computing P(t)exp(Qt) with Q according to the
HKY85 (or F84) model analytical formulas can be
found for
Pii(t) probability of nt i to remain the same
during time t Pij(t)_Ti probability of nt i ?
j transitions during time t Pij(t)_Tv
probability of nt i ? j transversions during time
t
F84 correction formula
d -2A ln1-P/2A-(A-B)Q/2AC) 2(A-B-C) ln
(1-Q/2C) A pCpT/(pCpT) pApG/(pApG), B
pCpTpApG, C(pApG) (pCpT), Pobserved
proportions of transition-type differences,
Qobserved proportions of transversion-type
differences
32
Nucleotide substitution patterns in HIV/SIV
Average frequency of changes between states
To
SIV/HIV-1 envelope
346.2 697.4 290.3
241.9 123
320.8
From
515.4 126.6
117.1
Transitions
215.6 371 144.6
Transversions
Average Ti/Tv1.5
33
More complex models
More complex models, like Tamura-Nei (TN93), or
the general time reversible (GTR) model usually
requires numerical algorithms in order to
calculate d. Several phylogenetic software
packages exist that can estimate genetic
distances between nucleotide sequences according
to different evolutionary models (MEGA3, PAUP,
PHYLIP, TREE-PUZZLE, DAMBE).
34
Estimating HIV genetic distancesenv gene
HIV-1B vs HIV-O/SIVcpz/HIV-1C full envelope
35
Estimating HIV genetic distancespol gene
HIV-1B vs HIV-O/HIV-1C full pol
p-distance JC69 K80
Tajima-Nei
HIV-O HIV-1C
0.257 (.007) 0.315 (.010) 0.318 (.011)
0.324 (.011) 0.103 (.005) 0.111 (.005)
0.113 (.006) 0.114 (.006)
36
(No Transcript)
37
Modeling Rate heterogeneity over sites

All the models discussed so far assume rate
homogeneity over sites (each site along the
sequence is assumed to evolve at the same rate)
Different codon position evolve at different
rates (usually rate3rd gt rate1st gt rate2nd)
Within a protein there can be several conserved
regions (very low mutation rate) and
hypervariable regions (mutational hotspots)
For example, HIV V3 loop
Genetic distances can be grossly underestimated
if rate heterogeneity over sites is not
considered

38
Nucleotide substitution rate heterogeneity over
sites in HIV/SIV
SIV/HIV-1 envelope
Number of Characters
Number of steps
39
G-distributed rates across sites

The numbers of changes for each site in a given
alignment are expected to be
Poisson-distributed if the nucleotide
substitution rate is constant across sites
Negative-binomial distributed if the rate is
G-distributed across sites
In most real data sets the nucleotide
substitution rate seems to fit a G-distribution

40
The G-distribution

Known density function
The shape of the distribution depends by the a
parameter
a lt 1 ? L-shape
a gt 1 ? bell shape
a ? ? ? one single value on the x-axis

41
Modeling Rate heterogeneity over sites with a
G-distribution (1)

A wide variety of rate distributions can be
obtained by setting a1/b and varying a.
The G-distribution can be incorporated in the
nucleotide substitution models
G-models are more realistic since they take into
account rate heterogeneity over sites
Strong rate heterogeneity is modeled using alt1
(for example, a0.5)
Weak rate heterogeneity is modeled using agt1 (for
example, a2)

42
Modeling Rate heterogeneity over sites with a
G-distribution (2)
Most of the sites have relative rate around 1
Invariable or nearly-invariable sites
Mutational hotspots
43
Some estimated a values
44
JC model with G-distributed rates

Golding (1983), and Nei and Gojobori (1986), have
shown that when the nucleotide substitution at
each site follows the JC model, but the
substitution rate varies with a G-distribution,
the G-distance is given by the formula on the
right

45
The Discrete G-distribution (1)

a is usually estimated from the data with maximum
likelihood (ML)
A set of aligned sequences and a tree are
required for the ML estimate
Integrating the likelihood function using a
continuous G-distribution is too expensive
computationally
Yang (1994) proposed to approximate the
continuous G-function using 4-8 discrete
categories of rates each one with equal
probability