Markov models - PowerPoint PPT Presentation

1 / 47
About This Presentation
Title:

Markov models

Description:

Q-matrix for the Jukes and Cantor (JC) model -3/4m 1/4m 1/4m 1/4m. 1/4m -3/4m 1/4m 1/4m ... Jukes and Cantor (JC) model solution. Pii(t) = 1/4 3/4 exp(-mt) ... – PowerPoint PPT presentation

Number of Views:56
Avg rating:3.0/5.0
Slides: 48
Provided by: ktl
Category:
Tags: cantor | markov | models

less

Transcript and Presenter's Notes

Title: Markov models


1
Markov models
  • in Molecular phylogeny

Marco Salemi, Ph.D.
Dept. Pathology U.F. Gainesville, FL (U.S.A.)
2
Modelling genetic change
  • Given two or more nt or amino acid sequences,
    usually the first goal is to calculate some
    measure of sequence dissimilarity
  • easiest way to estimate genetic distances
  • p-distance (number of nt differences between two
    sequences divided by the sequence length)

3
Ancestral Sequence
AACCTGTGCA
Seq1 AATCTGTGTA seq2 ATCCTGGGTT
4
p-distance proportion of nt between two
sequences
Seq1 AATCTGTGTA seq2 ATCCTGGGTT
p-distance0.4
Usually underestimate the true distance genetic
(or evolutionary) distance d
5
Multiple, parallel, and back-substitutions
AACCTGTGCA
AACCTGTGCA T A A C AACCAGTGAA

AACCTGTGCA T G A C
ACCCGGTGAA
6
p
0.7 - 0.6 - 0.5 - 0.4 - 0.3 - 0.2 - 0.1
-
Relationship between p (observed) distance and d
genetic distance
d
1 2 3 4
7
Modeling nt substitutions as a time-homogeneous
time-continuous stationary Markov process (1)
Markov property
  • At any given site in a sequence the rate of
    change from base i to base j is independent from
    the base that occupied that site prior i

8
Modeling nt substitutions as a time-homogeneous
time-continuous stationary Markov process (2)
  • Homogeneity
  • Substitution rates do not change over time
  • Stationarity
  • The relative frequencies of A, C, G, and T (pA,
    pC, pG, pT) are at equilibrium, i.e. remain
    constant.

9
The Q-matrix
pi frequency of nt i a, b, c, etc. relative
rate parameters non-diagonal entries rate flow
from nucleotide i to nucleotide j diagonal
entries total rate flow that leaves nucleotide i
(rate at which nt i disappear per site per
sequence) .
10
General Time Reversible (GTR) Models
Substitutions from nucleotide i to nucleotide j
have the same rate of Substitutions from
nucleotide j to nucleotide i. In general f 1
and a, b, c, d, e are estimated from the data
via maximum likelihood
11
(No Transcript)
12
¾ m relative rate of nucleotide i (i A, C, G,
T) replacement during evolution nt substitutions
per sequence per site per year 2 (¾ m ) t nt
substitutions per site between two sequences
d
13
Estimating transition probabilities
As soon as the Q matrix, and thus the
evolutionary model, is specified, it is possible
to calculate the probabilities of change from any
base to any other during the evolutionary time t,
P(t), by computing the matrix exponential P(t)exp
(Qt)
14
Jukes and Cantor (JC) model solution
By computing P(t)exp(Qt) with Q according to the
JC model
Pii(t) 1/4 3/4 exp(-mt) Pij(t) 1/4 - 1/4
exp(-mt)
Pii(t) probability of nt i to remain the same
during time t Pij(t) probability of nt i to be
replaced by nt j during time t
15
Estimating the genetic distance (1)
Anc
Prob. of sequence A and B of being
identical IAB(iG) PGG(t) PGG(t)
Sequence A B
  • The probability that the ancestral base i
    remained the same during the interval (t0,t1) is
    given by Pii(t1), for any of the two sequences,
    and by Pii(t1)2, for both.

16
Estimating the genetic distance (2)
Anc
Prob. of sequence A and B of being
identical IAB(iG) PAG(t) PAG(t) PCG(t)
PCG(t) PTG(t) PTG(t)
Sequence A B
  • the probability that an ancestral base i mutated
    into j in both sequences is given by Pij(t1)2
    however, since j could have been anyone of the
    remaining three bases, the total probability of
    the second scenario will be 3Pij(t1)2

17
Estimating the genetic distance (3)
  • The total probability of two sequences sharing
    the same nucleotide at any given site, I(t) will
    be the sum
  • I(t) Pii(t1)2 3Pij(t1)2
  • and the probability of the two sequences being
    different, p1-I(t),
  • p 1 -Pii(t1)2 - 3Pij(t1)2
  • An estimator of p is the observed proportion of
    different sites between two sequences (
    p-distance).

18
Estimating the genetic distance (4)
Substituting Pii(t) 1/4 3/4 exp(-mt) and
Pij(t) 1/4 - 1/4 exp(-mt) into p 1
-Pii(t1)2 - 3Pij(t1)2 we obtain for two
sequences that split t time units ago p 3/4 1-
exp(-2mt) Solving for mt we get mt - 1/2 log
(1- 4/3 p)
19
2 (¾ m ) t nt substitutions per site between
two sequences d Therefore mt 2/3 d
20
Genetic distances correction formula JC model
Substituting mt with 2/3d in the equation mt -
1/2 log (1- 4/3 p) we finally obtain the
Jukes-Cantor correction formula for the genetic
distance d between two sequences d - 3/4 ln
(1- 4/3 p) It can also be demonstrated that the
variance V(d) will be given by V(d)
9p(1-p)/(3-4p)2
21
Seq1 AATCTGTGTA seq2 ATCCTGGGTT
p-distance 0.4 d (JC model) - 3/4 ln 1-
4/3 (0.4) 0.5716
22
Ancestral Sequence
AACCTGTGCA
AATCTGTGTA
ATCCTGGGTT
p-distance 0.4 d (JC model) - 3/4 ln 1-
4/3 (0.4) 0.5716
23
Q-matrix for the Tajima-Nei model
pA? pT ? pC ? pG
A C G T
A
m pC m pG
m pT m pA
m pG m pT m pA m pC
m pT m pA m pC
m pG
C
G
T
24
Tajima-Nei model correction formula
  • d - B ln (1- p/B)
  • Where B 1 (p2 pT2 pC2 pG2)
  • p observed distance
  • When pA pT pC pG0.25, B ¾, and the
    formula becomes equivalent to the one obtained
    for the JC model

25
Q-matrix for the Kimura-2p (K80) model
pA pT pC pG0.25 bek (Transitions) acd
f1 (Transverstions) k --gt Ti/Tv parameter
A C G T
A
-3/4m(k2) 1/4m 1/4km 1/4m 1/4m
-3/4m(k2) 1/4m 1/4km 1/4km 1/4m
-3/4m(k2) 1/4m 1/4m 1/4km 1/4m
-3/4m(k2)
C
G
T
26
K80 model solution
By computing P(t)exp(Qt) with Q according to the
K80 model
Pii(t) ¼ ¼ e-mt ½ e-m(k1)t/2 Pij(t)_Ti
¼ ¼ e-mt - ½ e-m (k1)t/2 Pij(t)_Tv ¼ - ¼
e-mt
Pii(t) probability of nt i to remain the same
during time t Pij(t)_Ti probability of nt i ?
j transitions during time t Pij(t)_Tv
probability of nt i ? j transversions during time
t
27
Genetic distances correction formula K80 model
Following the same derivation as for the JC
model, the K80 correction formula for the genetic
distance d between two sequences is found to
be d 1/2 ln 1/(1-2P-Q) 1/4 ln 1/(1-2Q)
P observed proportions of transition-type
differences Q observed proportions of
transversion-type differences
28
Nucleotide frequencies in HIV/SIV are at
equilibrium pol gene
Average SEQUENCE COMPOSITION (HIV-O/HIV-M full
pol) 5 chi-square test
p-value SE8538a passed
97.80 97TZ02a passed
94.59 BOLO122b passed
99.94 CAM1b passed
96.73 NY5CGb passed
97.64 98IN022c passed
99.44 94IN112c passed
98.68 93IN101c passed
99.61 VI850f passed
97.09 X138g passed
86.61 SE6165g passed
95.73 VI991h passed
98.23 SE9173j passed
96.17 SE92809j passed
96.50 MP535k passed
69.92 92UG001d passed
86.20 HIVO passed
77.48
pA 39.0 pC 16.6 pG 22.8 pT
21.6 Average Ti/Tv2.6
29
Nucleotide frequencies in HIV/SIV are at
equilibrium env gene
Average SEQUENCE COMPOSITION (SIV/HIV full
envelope) 5 chi-square test
p-value MVP5180 passed
14.60 SIVcpzUS passed
48.09 SIVcpzGAB passed
51.77 92UG037a passed
84.58 92UG975g passed
99.73 92RU131g passed
97.45 93IN905c passed
77.15 92BRO25c passed
59.51 92UG021d passed
94.89 92UG024d passed
92.60 BSSG3b passed
97.86 SFMHS20b passed
92.40 91TH652b passed
92.86 MBC18R01b passed
99.59
pA 34.5 pC 17.4 pG 23.4 pT
24.7 Average Ti/Tv1.5
30
Q-matrix for the F84 model (equivalent to the
HKY85 model)
31
F84 (HKY85) model solution
By computing P(t)exp(Qt) with Q according to the
HKY85 (or F84) model analytical formulas can be
found for
Pii(t) probability of nt i to remain the same
during time t Pij(t)_Ti probability of nt i ?
j transitions during time t Pij(t)_Tv
probability of nt i ? j transversions during time
t
F84 correction formula
d -2A ln1-P/2A-(A-B)Q/2AC) 2(A-B-C) ln
(1-Q/2C) A pCpT/(pCpT) pApG/(pApG), B
pCpTpApG, C(pApG) (pCpT), Pobserved
proportions of transition-type differences,
Qobserved proportions of transversion-type
differences
32
Nucleotide substitution patterns in HIV/SIV
Average frequency of changes between states
To
SIV/HIV-1 envelope
346.2 697.4 290.3
241.9 123
320.8
From
515.4 126.6
117.1
Transitions
215.6 371 144.6
Transversions
Average Ti/Tv1.5
33
More complex models
More complex models, like Tamura-Nei (TN93), or
the general time reversible (GTR) model usually
requires numerical algorithms in order to
calculate d. Several phylogenetic software
packages exist that can estimate genetic
distances between nucleotide sequences according
to different evolutionary models (MEGA3, PAUP,
PHYLIP, TREE-PUZZLE, DAMBE).
34
Estimating HIV genetic distancesenv gene
HIV-1B vs HIV-O/SIVcpz/HIV-1C full envelope
35
Estimating HIV genetic distancespol gene
HIV-1B vs HIV-O/HIV-1C full pol
p-distance JC69 K80
Tajima-Nei
HIV-O HIV-1C
0.257 (.007) 0.315 (.010) 0.318 (.011)
0.324 (.011) 0.103 (.005) 0.111 (.005)
0.113 (.006) 0.114 (.006)
36
(No Transcript)
37
Modeling Rate heterogeneity over sites
  • All the models discussed so far assume rate
    homogeneity over sites (each site along the
    sequence is assumed to evolve at the same rate)
  • Different codon position evolve at different
    rates (usually rate3rd gt rate1st gt rate2nd)
  • Within a protein there can be several conserved
    regions (very low mutation rate) and
    hypervariable regions (mutational hotspots)
  • For example, HIV V3 loop
  • Genetic distances can be grossly underestimated
    if rate heterogeneity over sites is not
    considered

38
Nucleotide substitution rate heterogeneity over
sites in HIV/SIV
SIV/HIV-1 envelope
Number of Characters
Number of steps
39
G-distributed rates across sites
  • The numbers of changes for each site in a given
    alignment are expected to be
  • Poisson-distributed if the nucleotide
    substitution rate is constant across sites
  • Negative-binomial distributed if the rate is
    G-distributed across sites
  • In most real data sets the nucleotide
    substitution rate seems to fit a G-distribution

40
The G-distribution
  • Known density function
  • The shape of the distribution depends by the a
    parameter
  • a lt 1 ? L-shape
  • a gt 1 ? bell shape
  • a ? ? ? one single value on the x-axis

41
Modeling Rate heterogeneity over sites with a
G-distribution (1)
  • A wide variety of rate distributions can be
    obtained by setting a1/b and varying a.
  • The G-distribution can be incorporated in the
    nucleotide substitution models
  • G-models are more realistic since they take into
    account rate heterogeneity over sites
  • Strong rate heterogeneity is modeled using alt1
    (for example, a0.5)
  • Weak rate heterogeneity is modeled using agt1 (for
    example, a2)

42
Modeling Rate heterogeneity over sites with a
G-distribution (2)
Most of the sites have relative rate around 1
Invariable or nearly-invariable sites
Mutational hotspots
43
Some estimated a values
44
JC model with G-distributed rates
  • Golding (1983), and Nei and Gojobori (1986), have
    shown that when the nucleotide substitution at
    each site follows the JC model, but the
    substitution rate varies with a G-distribution,
    the G-distance is given by the formula on the
    right

45
The Discrete G-distribution (1)
  • a is usually estimated from the data with maximum
    likelihood (ML)
  • A set of aligned sequences and a tree are
    required for the ML estimate
  • Integrating the likelihood function using a
    continuous G-distribution is too expensive
    computationally
  • Yang (1994) proposed to approximate the
    continuous G-function using 4-8 discrete
    categories of rates each one with equal
    probability

46
The Discrete G-distribution (2)
0.04
Four discrete categories
0.02
Frequency
0.0
0
1
2
3
4
5
Relative rate
  • In general, the more the categories, the better
    the approximation.
  • 4-8 categories are usually enough

47
Estimating HIV genetic distanceswith G-models
HIV-1B vs HIV-O/SIVcpz/HIV-1C full envelope, K80
model
Write a Comment
User Comments (0)
About PowerShow.com