Maximum likelihood and Bayesian methods in Molecular phylogeny - PowerPoint PPT Presentation

1 / 45
About This Presentation
Title:

Maximum likelihood and Bayesian methods in Molecular phylogeny

Description:

Maximum likelihood and Bayesian methods in Molecular ... conditional probability of the data (aligned sequences) given a hypothesis (a ... 2 (loge L1 loge L0) ... – PowerPoint PPT presentation

Number of Views:961
Avg rating:3.0/5.0
Slides: 46
Provided by: annemiev1
Category:

less

Transcript and Presenter's Notes

Title: Maximum likelihood and Bayesian methods in Molecular phylogeny


1
Maximum likelihood and Bayesian methods in
Molecular phylogeny
Marco Salemi, Ph.D.
Dept. Pathology U.F. Gainesville, FL (U.S.A.)
2
The Likelihood function
  • Likelihood function
  • conditional probability of the data (aligned
    sequences) given a hypothesis (a model of
    substitution with a set of parameters ?, and the
    tree ?, including topology and branch lengths)
  • L(?,?) Prob(Data??,?)
  • or
  • Prob(Aligned sequences?tree, model of evolution)
  • The maximum likelihood estimates (MLEs) of ? and
    ? are those making the likelihood function as
    large as possible
  • t , ? max L(?,?)
  • Therefore, what we usually call the likelihood
    of the tree IS NOT the likelihood of the tree but
    the probability of the data given that the tree
    is the true tree!

3
Calculating the likelihood of a known tree with a
known evolutionary model
1
j
n
1)
(1)
. . C G G A C A C G T T T A C . .
(2)
. . C A G A C A C C T C T A C . .
A
(3)
. . C G G A T A A G T T A A C . .
(4)
. . C G G A T A G C C T A G C . .
A
?
A
A
2)
1
3
C
P(A) gt P(C) gt P(G) P(T)
2
4
4
Calculating the likelihood known tree (contd)
3)
C
C
A
G
C
C
A
G
A
C
L(j) Prob
Prob
3)
L L(1) L(2) L(n) ? L(j)
A
A
C
C
A
G
4)
ln L ln L(1) ln L(2) ln L(n) ? L(j)
G
Prob
.
C
22n-2 internal nodes for a tree of n taxa 22n-2
terms in the summation! (More efficient
way Felsenstein Pruning Algorithm)
C
C
A
G
Prob
.
T
T
5
Finding the ML tree
  • Exhaustive search guaranteed to find the minimum
    tree because all tree topologies are evaluated.
    Not possible for more than 10 sequences
  • Branch and bound guaranteed to find the minimum
    tree without evaluating all tree topologies a
    larger number of taxa can be evaluated but still
    limited (depends on the dataset)
  • Heuristic searches not guaranteed to find the
    minimal tree. Uses stepwise addition of taxa plus
    a rearrangement process (branch swapping)

6
Trees combinatorial explosion
  • Number of unrooted trees for n taxa
  • NU(2n-5)!/2n-3(n-3)!
  • Number of rooted trees for n taxa
  • NR(2n-3)!/2n-2(n-2)!

7
Heuristic Search strategy (1)choosing a
starting tree
  • Simple Stepwise addition

8
Stepwise addition
A
Start tree construction with first 3 sequences
C
B
L3 gt L2 , L1
B
A
A
B
A
C
Add 4th
L3
L1
L2
D
C
C
D
B
D
E
E
A
B
A
B
A
B
E
Add 5th
D
C
D
C
D
C
E
B
B
A
A
E
Add next sequence
Max L
D
C
D
C
9
Tree landscape and input order
  • During stepwise addition, in spite of branch and
    bound or other rearrangement algorithms, one can
    get stuck into a sub optimal tree peak.
  • Tree input order juggling can explore more than
    one tree peak, with better chances at finding the
    optimal tree.

10
Heuristic Search strategy (1)choosing a
starting tree
  • Simple Stepwise addition
  • Stepwise addition with random repeats (at least
    5 to 10)
  • ADVANTAGE exploring different portion of the
    tree-space
  • DISADVANTAGE many random replicates necessary
    (computationally slow for more than 35-40 taxa)

11
Global Maximum Likelihood tree
local Maximum Likelihood tree
Likelihood
Trees
Starting tree of the heuristic search
Starting tree of the heuristic search
12
Heuristic Search strategy (1)choosing a
starting tree
  • Simple Stepwise addition
  • Stepwise addition with random repeats (at least
    5 to 10)
  • ADVANTAGE exploring different portion of the
    tree-space
  • DISADVANTAGE many random replicates necessary
    (computationally slow for more than 35-40 taxa)
  • NJ tree
  • ADVANTAGE starting tree probably not too wrong
    (too far from the global maximum)
  • DISADVANTAGE possible to get stuck in local
    maxima

13
Heuristic Search strategy (2)Re-arranging the
tree
  • The starting tree is re-arranged in order to find
    a better (higher likelihood) tree with a Pruning
    algorithm
  • Neighbor Nearest Interchange (NNI) subset of SPR
  • Subtree Pruning Regrafting (SPR) subset of
    TBR
  • Tree Bisection Reconnection (TBR)

14
Branch swapping
C
D
A
E
Tree bisection and reconnection (TBR)
E
F
B
A
D
G
F
C
B
G
C
D
E
A
B
G
F
A
F
B
G
D
C
Evaluate Likelihood again ...
E
15
Maximum Likelihood Estimates (MLEs) of
substitution model parameters
  • Each free parameter of a given nt substitution
    model can be estimated by ML
  • The MLEs of the parameters are those making the
    likelihood of the tree the largest
  • In theory parameters and tree topology could be
    estimated simultaneously
  • In practice the parameters are estimated using a
    reasonable starting tree (for example NJ or ML
    obtained with arbitrarily fixed parameters).

16
Nucleotide substitution models (again!)
  • is the mean substitution rate a, b, c, d, are
    relative rate parameters ?A, ?C, ?G, and ?T are
    frequency parameters
  • How do we decide which model best-fit our data?

17
The Likelihood Ratio Test (LRT)
  • The Likelihood function offers a natural way of
    comparing nested evolutionary hypothesis using
    the Likelihood Ratio (LR) statistic
  • ? 2 (loge L1 loge L0)
  • L1 maximum likelihood under the more
    parameter-rich, complex model (alternative
    hypothesis, H1)
  • L0 maximum likelihood under the less
    parameter-rich simple model (null hypothesis, H0)
  • If the models are nested and the null hypothesis
    (H0) is correct, D is asymptotically distributed
    as ?2 with a number of degrees of freedom equal
    to the difference in number of free parameters
    between the two models

18
The Likelihood Ratio Test (LRT) (2)
  • If LRT is significant (i.e. p lt 0.05, or lt0.01)
    the inclusion of additional parameters in the
    alternative model increases significantly the
    likelihood of the data
  • When D is close to zero (p gt 0.05) the
    alternative hypothesis does not fit the data
    significantly better than the null hypothesis,
    i.e. adding parameters to the null model does not
    give us a better explanation of the data!

19
The Likelihood Ratio Test (LRT) (3)
Nested evolutionary hypotheses That two models
are nested when one model (H0, null model or
constrained model) is equivalent to restrict the
possible values that one or more parameters can
take in the other model (H1, alternative,
unconstrained or full model
JC69 no Ti/Tv ratio, pA pT pC pG 0
K80 Ti/Tv ratio, pA pT pC pG 1
F84/HKY85 Ti/Tv ratio, pA ? pT ? pC ? pG 4
20
The Likelihood Ratio Test (LRT) (4)
  • Estimate a tree from the data (the "base tree").
    This tree has been shown to not have influence in
    the final model selected as far as it is not a
    random tree . A neighbor-joining tree with the
    JC69 or K80 model will be fast and will do fine.
  • Estimate the likelihood of the candidate models,
    for the given data set and the "base tree".
  • Compare the likelihood of the candidate models
    through a hierarchy of LRTs to select the
    best-fit model among the candidate ones.

21
Model free parameters (in
the Q matrix) GTR 8 a?b?c?d?e?f,
pi TN93 5 pi ? pj, Ti/Tv ratio,
R/Y ratio (b, e, acdf) 4 pi ?
pj Ti/Tv ratio (be, acdf)
3 pi ? pj (a bcdef) 1 pA
pT pC pG0.25 Ti/Tv ratio (be,
acdf) JC69 0 pA
pT pC pG0.25 (a bcdef)
HKY85 F84 F81 (?Tajima-Nei) K80
22
Testing rate heterogeneity over sites with
G-models (1)
  • Each one of the models discussed above assumes
    rate homogeneity over sites
  • A discrete G-distribution (Yang, 1994) is
    commonly used to model rate heterogeneity over
    sites
  • The a parameter of the G-distribution can be
    estimated, as any other parameter of the
    nucleotide substitution model, via maximum
    likelihood

23
Testing rate heterogeneity over sites with
G-models (2)
  • Any given G-model with some value of a is nested
    with the corresponding non G-model (which is the
    special case of the G-model for a?) D.o.f.1
  • For example, we can compare in a LRT the
    likelihood of a tree under the JC model and the
    JCG model. Since only one additional parameter
    is estimated by the more complex (JCG) model,
    the hypothesis of rate homogeneity over sites is
    rejected if ? gt ?20.05 for 1 d.o.f.

24

The classic molecular clock
  • The molecular clock hypothesis postulates that
    for any given macromolecule (a protein or a DNA
    sequence) the rate of evolution is approximately
    constant over time in all evolutionary lineages
    (Zuckerkandl and and Pauling 1962 and 1965)

25

Mutation and evolutionary rate
  • Mutation Rate (m) error rate (biochemical
    concept), rate of polymerase nucleotide
    misincorporation
  • Evolutionary rate (r) number of mutants arising
    per generation x probability of fixation
  • r 2Nm 1/2N m (Kimura, 1968)

26
A global molecular clock?
The hypothesis known as global clock was based on
the observation that a linear relation seems to
exist between the number of amino acid
substitutions between homologous proteins of
different species, and the species divergence
times estimated from archaeological data.
27

Why is the molecular clock attractive ?
  • If macromolecules evolve at constant rates, they
    can be used to date species-divergence times and
    other types of evolutionary events, similar to
    the dating of geological time using radioactive
    elements
  • Phylogenetic reconstruction is much simpler under
    constant rates that under nonconstant rates
  • The degree of rate variation among lineages may
    provide much insight into the mechanisms of
    molecular evolution (e.g. Kimura 1983 Gillespie
    1991 Salemi et al., 1999).

28
Estimating ancestral divergence times under the
clock hypothesis
knowing a divergence time T r dac/2T all the
other divergence dates in the internal nodes of
the tree can be estimated using the rate r
29
Global vs Local clocks
  • The molecular clock hypothesis is in perfect
    agreement with the neutral theory of evolution
    (Kimura 1968 Kimura 1983).
  • The existence of a clock seems to be a major
    support of the neutral theory against natural
    selection
  • The global clock hypothesis is today known to be
    untrue.
  • Factors like different metabolic rates, different
    lifespan, and different efficiency in the DNA
    repair mechanisms between distantly related
    species are responsible for different
    evolutionary rates of homologous genes.

30
Evolutionary rates of organisms
nucleotide substitutions per site per year
10 - 9
10 - 8
10 - 7
10 - 6
10 - 5
10 - 4
10 - 3
10 - 2
10 - 1
cellular genes
RNA viruses
DNA viruses
Human mtDNA
31
Clock-like phylogenies
When the molecular clock holds the different
lineages in a phylogenetic tree accumulate
mutations more or less at the same rate during
evolution they have a constant evolutionary rate
r (nucleotide substitutions/site/year).
?
time
clock-like tree
non clock-like tree
n-1 2n-3 independent branch
lengths to be estimated (n number of TAXA)

32
Likelihood ratio test for clock hypothesis
  • Given a phylogenetic tree topology, branch
    lengths proportional to the sequence divergence
    can be estimated via maximum likelihood assuming
    or not a constant evolutionary rate for each
    lineage of the tree.
  • The likelihood ratio statistics
  • ? 2 (loge LnoClock loge Lclock)
  • is c2 distributed with (2n-3)-(n-1)n-2 degrees
    of freedom
  • n number of TAXA (usually the clock-hypothesis
    is rejected when D lt c20.05).

33
LRT and non-parametric bootstrapping
  • The ?2 approximation to assess the significance
    of the LRT is not appropriate when the two
    competing hypothesis are not nested.
  • The LRT may perform poorly when the data include
    very short sequences relative to the number of
    parameters to be estimated.
  • The null distribution of the LRT statistic can be
    approximated by Monte Carlo simulation
  • Advantage statistically sound
  • Disadvantage computational expensiveness

34
Non-parametric bootstrapping (1)
  • Select the competing models one for the null
    hypothesis H0 and one for the alternative
    hypothesis H1.
  • Estimate the tree and the parameters of the model
    under the null hypothesis.
  • Use the tree and the estimated parameters to
    simulate 200-1000 replicate data sets of the same
    size as the original.

35
Non-parametric bootstrapping (2)
  • Infer the distribution of LRT statistic ?2 (loge
    L1 loge L0) from the simulated data sets
  • H0 is rejected if the original ? is higher than
    95 (99) percentile of the ? values in the
    distribution,.

36
Bayes formula
  • Let A and B be events
  • The intersection A?B AB is the event of A and B
    both occurring
  • If B occurs with probability P(B) and, if B has
    occurred, A can occur with probability P(AB)
    prob. of A given that B has occurred, then
  • P(AB) P(AB)P(B) (1)
  • Similarly, P(AB) P(BA)P(A). By substituting
    P(AB) in equation (1) and solving for P(AB)

37
A silly example
  • The probability of getting 5 by rolling a fair
    die is P(5) 1/6
  • The conditional probability of getting 5 knowing
    that rolling the die has given as outcome an odd
    number is P(5odd)1/3

38
Bayes formula in practice (1)
  • In practical applications P(BA) and P(A) are
    usually easy to calculate, whereas calculating
    P(B) requires a sum over all possible ways in
    which event B can occur
  • An example
  • A lab blood test has 95 prob. of detecting the
    disease and gives 1 false positives. 0.5 of the
    population is known to have the disease.
  • What is the (conditional) probability of a person
    actually having the disease if her test resulted
    positive?

39
Bayes formula in practice (2)
  • P(Dis positive test) P(positive test
    Dis)P(Dis) / P(positive test)
  • P(positive test Dis) 0.95
  • P(Dis) 0.005
  • P(positive test) P(positive test Dis)P(Dis)
    P(positive test no Dis)P(no Dis)
  • P(positive test) (0.95)
    (0.005) (0.01)
    (0.995)
  • P(Dis positive test) 0.323

40
Likelihood vs Bayesian in Molecular phylogeny
  • The Likelihood function gives the probability of
    the aligned sequences if the tree (and the model)
    is true
  • Prob (Aligned sequences?tree)
  • The Posterior probability of the tree is the
    probability of a particular tree given the
    observed data set
  • Prob (tree?Aligned sequences)

41

Bayesian Inference in phylogeny (1)
  • Instead of searching for the ML tree, trees are
    sampled according to their posterior probability
  • Bayesian inference takes full advantage of the
    information contained in the data
  • Computationally faster (does not require an
    exhaustive search to find the ML tree)
  • Can be used to select an evolutionary model
    without knowing the topology of the optimal (or
    nearly optimal) tree

42
Posterior probability of a phylogenetic tree
  • P(Aligned sequences?Tree) likelihood
  • P(Tree) a priori probability of any particular
    phylogeny easy to calculate by assuming that,
    before any observation has been made, any
    phylogenetic tree for the data has equal chance
    of being the true tree
  • P(Aligned sequences) requires summation over all
    trees and integration over all possible branch
    lengths and model parameter values!!!

43
The Markov Chain Monte Carlo (MCMC) sampler
  • Metropolis et al. (1953) and Hastings (1970)
    proposed an algorithm (a variant of MCMC) that
    allows to visit a number of trees in the tree
    space and (if run properly and long enough) to
    visit more more often the trees with the higher
    posterior probability
  • The algorithm starts from a random tree and
    obtains a second tree by randomly perturbing the
    first tree
  • The new tree is then accepted or rejected with a
    probability that depends upon the ratio of the
    posterior probabilities of the two trees being
    compared
  • P(Tree2?Aligned sequences) / P(Tree1?Aligned
    sequences). Therefore the term P(Aligned
    sequences) does not need to be calculated!

44
Inferring a high probability tree with
Metropolis-Hastings algorithm
  • Obtain a random starting tree, Tree
  • Randomly perturb Tree to obtain Tree. Let
    q(Tree) the probability of proposing the new
    tree from the previous one and q(Tree) the
    probability of proposing the old tree conditional
    on starting at the new tree (Tree) q(X) can be
    picked for some proposal distribution
  • Calculate the probability R of accepting Tree
    over Tree
  • R min 1 , P(Aligned sequences?Tree)P(Tree
    )q(Tree)
  • P(Aligned
    sequences?Tree)P(Tree)q(Tree)
  • Generate a random variable U uniformly
    distributed in the interval (0,1)
  • If UltR accept Tree, let TreeTree, and go back
    to step 2 otherwise continue with the current
    tree
  • If the process is repeated for a sufficiently
    large number of iterations the long-run
    frequencies of states visited by the Markov chain
    WILL approximate the posterior distribution

45
Bayesian Inference of phylogeny (2)
  • The Bayesian framework can be used also
  • Evaluating uncertainty in phylogenies instead of
    classinc bootstrap methods (Larget Simon, Mol.
    Biol. Evol. 1999)
  • Detecting positive selection (Nielsen Yang,
    Genetics 1998)
  • Testing evolutionary models (Huelsenbeck et al.,
    Mol. Biol. Evol. 2004)
  • Testing the molecular clock (Surchard et al.,
    Mol. Biol. Evol. 2001)
  • Software for Bayesian inference programs include
  • Mr. Bayes
Write a Comment
User Comments (0)
About PowerShow.com