Title: Likelihood Methods
1. Likelihood Methods
2. Likelihood maximization is a very common approach to inference.
3. A coin is known to be biased. The coin is tossed three times, giving two heads and one tail. Use the principle of maximum likelihood (ML) to estimate the probability of throwing a head.
Try P = 0.2: L(D) = 0.2 × 0.2 × 0.8 = 0.032
Try P = 0.4: L(D) = 0.4 × 0.4 × 0.6 = 0.096
Try P = 0.6: L(D) = 0.6 × 0.6 × 0.4 = 0.144
Try P = 0.8: L(D) = 0.8 × 0.8 × 0.2 = 0.128
[Figure: likelihood of the data (y-axis) plotted against probability of a head (x-axis)]
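The trial values above can be extended to a fine grid to see where the likelihood peaks. The following is a minimal sketch; the function name `coin_likelihood` and the grid resolution are illustrative choices, not part of the original slides. The binomial coefficient is omitted because it does not depend on P and so does not move the maximum.

```python
def coin_likelihood(p, heads=2, tails=1):
    """Likelihood of the observed tosses for a head probability p."""
    return p ** heads * (1.0 - p) ** tails

# Reproduce the four trial values from the slide.
for p in (0.2, 0.4, 0.6, 0.8):
    print(f"P = {p:.1f}  L(D) = {coin_likelihood(p):.3f}")

# Search a fine grid of candidate values (illustrative resolution).
grid = [i / 1000 for i in range(1001)]
best_p = max(grid, key=coin_likelihood)
print(f"grid maximum near P = {best_p:.3f}")
```

The grid maximum lies close to 2/3, the observed proportion of heads, which is the analytical ML estimate for this problem.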
4. Genetic Distance
Given a model of the way sequences evolve, we can derive a function that describes the likelihood of observing a particular pair of sequences as a function of the inferred genetic distance between them.
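5. Consider a simplified Jukes and Cantor model for the sequence pair
A C G
G A G
(two sites differ and one site matches)
D = 0.3: 0.3 × 0.3 × 0.7 = 0.063
D = 0.6: 0.6 × 0.6 × 0.4 = 0.144
D = 0.9: 0.9 × 0.9 × 0.1 = 0.081
Note: this is a major simplification.
[Figure: likelihood (y-axis) plotted against distance (x-axis)]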
6. Genetic Distance using Maximum Likelihood
- Require a model of evolution
- Optimise all parameters of the model
- Each evolutionary event has an associated likelihood given an inferred genetic distance
- The likelihood of the sequence pair is a function of the genetic distance (just the product of the likelihoods of each of the inferred events at each sequence position)
- The function is maximized (equivalently, its negative logarithm is minimized)
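The steps listed for slide 6 can be sketched for the full Jukes and Cantor (JC69) model rather than a simplified version. This is an illustrative sketch only: the two sequences are hypothetical, and a plain grid search stands in for whatever optimiser a real programme would use. Under JC69 the per-site probabilities are 1/4 + 3/4·exp(-4d/3) for an unchanged base and 1/4 - 1/4·exp(-4d/3) for each specific change, where d is the expected number of substitutions per site.

```python
import math

def jc69_site_prob(same, d):
    """Joint probability of one aligned site pair under JC69 at distance d."""
    e = math.exp(-4.0 * d / 3.0)
    transition = 0.25 + 0.75 * e if same else 0.25 - 0.25 * e
    return 0.25 * transition  # 0.25 = equilibrium base frequency

def log_likelihood(d, seq_a, seq_b):
    """Sum of per-site log-likelihoods (log of the product over sites)."""
    return sum(math.log(jc69_site_prob(a == b, d)) for a, b in zip(seq_a, seq_b))

def ml_distance(seq_a, seq_b, step=1e-4, d_max=3.0):
    """Grid search for the distance that maximizes the likelihood."""
    grid = [step * i for i in range(1, int(d_max / step) + 1)]
    return max(grid, key=lambda d: log_likelihood(d, seq_a, seq_b))

# Hypothetical aligned sequences: 20 sites, 4 of which differ.
seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "CCGTAAGTACTTACGAACGT"

d_ml = ml_distance(seq_a, seq_b)
p_diff = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
d_closed = -0.75 * math.log(1.0 - 4.0 * p_diff / 3.0)
print(f"grid ML distance = {d_ml:.4f}, closed-form JC69 distance = {d_closed:.4f}")
```

For JC69 the grid estimate should agree with the well-known closed-form distance to within the grid step; for richer models no closed form exists and numerical optimisation is the only route.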
7. Phylogenetic trees using Maximum Likelihood
- Require a model of evolution
- Each substitution has an associated likelihood given a branch of a certain length
- A function is derived to represent the likelihood of the data given the tree, branch lengths and additional parameters
- Optimise over parameters of the model
- Optimise over branch lengths
- Sum the likelihood over all possible sequences at ancestral nodes
- Search for the best tree (using heuristics such as TBR, tree bisection and reconnection)
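The summation over all possible ancestral sequences is usually done with Felsenstein's pruning algorithm rather than by explicit enumeration. The sketch below assumes the JC69 model, a fixed tree and fixed branch lengths; the tree encoding (a leaf is a name, an internal node is a tuple of `(child, branch_length)` pairs) and the sequences are hypothetical choices made for illustration. Optimisation of branch lengths and tree search are not shown.

```python
import math

BASES = "ACGT"

def jc69_prob(i, j, t):
    """JC69 probability of base i becoming base j along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def conditional(node, site, seqs):
    """Likelihood of the data below `node`, for each possible base at `node`."""
    if isinstance(node, str):  # leaf: the observed base has likelihood 1
        return [1.0 if b == seqs[node][site] else 0.0 for b in BASES]
    result = [1.0] * len(BASES)
    for child, length in node:
        child_like = conditional(child, site, seqs)
        for xi, x in enumerate(BASES):
            # Sum over every base the child node could carry.
            result[xi] *= sum(
                jc69_prob(x, y, length) * child_like[yi]
                for yi, y in enumerate(BASES)
            )
    return result

def log_likelihood(tree, seqs):
    """Log-likelihood of an alignment given a rooted tree with branch lengths."""
    n_sites = len(next(iter(seqs.values())))
    total = 0.0
    for site in range(n_sites):
        root_like = conditional(tree, site, seqs)
        total += math.log(sum(0.25 * value for value in root_like))
    return total

# Hypothetical three-taxon example.
inner = (("seq1", 0.1), ("seq2", 0.1))
example_tree = ((inner, 0.05), ("seq3", 0.2))
example_seqs = {"seq1": "ACGT", "seq2": "ACGA", "seq3": "GCGT"}
print(f"log-likelihood = {log_likelihood(example_tree, example_seqs):.4f}")
```

A tree-search programme would call a function like `log_likelihood` repeatedly, re-optimising branch lengths and model parameters for each candidate topology.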
8. Models can be made more parameter rich to increase their realism
- The most common additional parameters are:
  - A correction to allow different rates for each type of nucleotide change
  - A correction for the proportion of sites which are unable to change
  - A correction for variable rates at those sites which can change
- The values of the additional parameters will be estimated in the process
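9. A gamma distribution can be used to model site rate heterogeneity
Can be iterative:
- Rates of evolution can be worked out accurately if the tree is known
- Accurate rates of evolution for each sequence position improve the accuracy of the tree
[Diagram: iterative cycle in which a rates programme produces rates, the rates feed a tree programme, the tree programme produces a tree, and the tree is passed back to the rates programme]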
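To give a feel for how a gamma distribution describes rate variation across sites, the sketch below draws relative site rates from a gamma distribution with mean 1. The shape values and the function name `sample_site_rates` are illustrative assumptions; phylogenetic programmes typically use a small number of discrete rate categories rather than random draws.

```python
import random

def sample_site_rates(alpha, n_sites, seed=0):
    """Draw relative rates from a gamma distribution with shape alpha and mean 1."""
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale); scale = 1/alpha gives mean 1.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

for alpha in (0.5, 2.0, 10.0):
    rates = sample_site_rates(alpha, 20000)
    mean = sum(rates) / len(rates)
    variance = sum((r - mean) ** 2 for r in rates) / len(rates)
    print(f"alpha = {alpha:>4}: mean rate = {mean:.2f}, variance = {variance:.2f}")
```

With the mean fixed at 1 the variance of the rates is 1/alpha, so a small shape parameter corresponds to strong heterogeneity (many slow sites and a few fast ones) and a large one to nearly equal rates.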
10. Likelihood and the number of parameters
- More parameters always lead to an apparently better fit to the data
11. Adding parameters to a nested model can only raise (never lower) the maximized likelihood, whether or not the additional parameters provide a significantly better fit to the data.
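12. Are the extra parameters justified?
- Likelihood ratio test (applies to nested models): twice the difference in log-likelihood between the two models is compared with a chi-squared distribution, with degrees of freedom (dof) = number of additional parameters
- Akaike Information Criterion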
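Both criteria are simple to compute once the maximized log-likelihoods are available. The sketch below uses hypothetical log-likelihood values and parameter counts, and computes the chi-squared tail probability with a standard recurrence so that no external library is needed. The AIC is taken in its usual form, 2k - 2 ln L, where k is the number of free parameters.

```python
import math

def chi2_sf(x, dof):
    """Upper-tail probability of a chi-squared variable with integer dof."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if dof % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1   # exact tail for dof = 1
    else:
        q, k = math.exp(-half), 2              # exact tail for dof = 2
    while k < dof:                             # step up two dof at a time
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

def likelihood_ratio_test(lnl_simple, lnl_general, extra_params):
    """Return the LRT statistic and its chi-squared p-value."""
    stat = 2.0 * (lnl_general - lnl_simple)
    return stat, chi2_sf(stat, extra_params)

def aic(lnl, n_params):
    """Akaike Information Criterion: lower values are preferred."""
    return 2.0 * n_params - 2.0 * lnl

# Hypothetical maximized log-likelihoods for two nested models.
lnl_jc, lnl_k2p = -1250.0, -1240.0
stat, p_value = likelihood_ratio_test(lnl_jc, lnl_k2p, extra_params=1)
print(f"LRT statistic = {stat:.1f}, p = {p_value:.2g}")
# Parameter counts here include only substitution-model parameters (an assumption).
print(f"AIC simple = {aic(lnl_jc, 0):.1f}, AIC general = {aic(lnl_k2p, 1):.1f}")
```

Unlike the likelihood ratio test, the AIC comparison does not require the models to be nested.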
13. One model is nested in another if it is a special case of the more general model, e.g. the Jukes and Cantor model is nested in the Kimura two-parameter (K2P) model.
14. Modeltest
- Uses PAUP
- Tries out many nested models of nucleotide substitution
- Decides how many parameters are justified by the data
GTR does not overfit the data for at least some HIV sequences.
15. Likelihood-based tests of topologies
- Kishino-Hasegawa (KH) test
- Trees are specified a priori
- KH can be used to test whether two competing hypotheses have significantly different likelihoods
- Involves a non-parametric bootstrap to estimate how much the likelihoods vary
- NB: should not be used to test trees that have been chosen on the basis of the data!
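The bootstrap step of a KH-style comparison can be sketched as follows. This is a simplified illustration, not the procedure of any particular programme: it resamples per-site log-likelihood differences (the RELL shortcut) instead of re-optimising each bootstrap replicate, and the per-site values are hypothetical.

```python
import random

def kh_test(site_lnl_a, site_lnl_b, replicates=1000, seed=0):
    """Two-sided bootstrap test of the log-likelihood difference between two trees."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(site_lnl_a, site_lnl_b)]
    n_sites = len(diffs)
    observed = sum(diffs)
    # Centre the differences so the bootstrap mimics the null of no difference.
    mean = observed / n_sites
    centred = [d - mean for d in diffs]
    extreme = 0
    for _ in range(replicates):
        resampled = sum(rng.choice(centred) for _ in range(n_sites))
        if abs(resampled) >= abs(observed):
            extreme += 1
    return observed, extreme / replicates

# Hypothetical per-site log-likelihoods under two trees specified a priori.
tree_a = [-2.1, -3.4, -1.9, -2.8, -4.0, -2.2, -3.1, -2.5]
tree_b = [-2.3, -3.3, -2.4, -2.7, -4.6, -2.1, -3.5, -2.6]
delta, p_value = kh_test(tree_a, tree_b)
print(f"delta lnL = {delta:.2f}, bootstrap p = {p_value:.3f}")
```

The centring step is what makes the test valid only for trees chosen without reference to the data: if one tree is the ML tree, its expected advantage is not zero.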
16. Likelihood-based tests of topologies
- Shimodaira-Hasegawa (SH) test
- Can be used to test confidence in the ML tree compared with related trees (e.g. the second most likely tree from the data)
- PAUP; Andrew Rambaut: http://evolve.zoo.ox.ac.uk/software/shtests
17. Inferring Sequences at Ancestral Nodes
- Maximum likelihood estimates of tree topologies also provide inferred sequences at ancestral nodes
- Analysis of sequences at ancestral nodes and sequence changes on ancestral branches can provide information about the timing of the acquisition of a novel trait or mutation
- PAML (Phylogenetic Analysis by Maximum Likelihood)
- Confidence intervals provided
- Selection can be inferred
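As a small illustration of how an ancestral base and its uncertainty can be inferred, the sketch below computes the posterior probability of each base at a single ancestral node joined directly to three observed sequences. It assumes the JC69 model with equal base frequencies; the observed bases and branch lengths are hypothetical, and this is not a description of how PAML is implemented.

```python
import math

BASES = "ACGT"

def jc69_prob(i, j, t):
    """JC69 probability of base i becoming base j along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def ancestral_posterior(leaf_bases, branch_lengths):
    """Posterior probability of each base at a node joined directly to the leaves."""
    weights = {}
    for x in BASES:
        weight = 0.25  # prior: equal base frequencies under JC69
        for base, length in zip(leaf_bases, branch_lengths):
            weight *= jc69_prob(x, base, length)
        weights[x] = weight
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

# Hypothetical site: two descendants carry A, one carries G.
posterior = ancestral_posterior("AAG", [0.1, 0.1, 0.1])
for base, prob in posterior.items():
    print(f"P(ancestor = {base}) = {prob:.3f}")
```

The posterior probabilities play the role of a confidence measure for the reconstructed base: longer branches flatten the distribution and make the inference less certain.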