Title: Molecular phylogenetics 3
1Molecular phylogenetics 3
- Level 3 Molecular Evolution and Bioinformatics
- Jim Provan
Page and Holmes Sections 6.5-6
2Maximum likelihood
- Principle of likelihood suggests that the
explanation that makes the observed outcome most
probable is preferred
- More formally
- $L_D = \Pr(D \mid H)$
- In a phylogenetic context
- D is the set of sequences being compared
- H is a phylogenetic tree
- The tree that makes the data the most probable
evolutionary outcome is the maximum likelihood
estimate of the phylogeny
3Models, data and hypotheses
- Maximum likelihood requires three elements
- A model of sequence evolution
- A tree
- A data set
- ML methods of tree building must solve two
problems
- For a given tree topology, what set of branch
lengths makes the observed data most likely?
- Which tree has the greatest likelihood?
4Models, data and hypotheses
- Suppose we have two sequences, 1 and 2, separated
by an average of d substitutions per site
- $d = \mu t$
- Given a model of substitution for each site we
can compute the probability $P_{ij}(d)$ that two
sequences separated by d would have nucleotides i
and j
- For example, if sequence 1 had nucleotide A then
$P_{AG}(d)$ is the probability that sequence 2 has a G
in the corresponding position
- The log likelihood of obtaining the observed
sequences is the sum of the log likelihoods of
each individual site
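The site-by-site sum can be sketched in a few lines. This is a minimal illustration that assumes the Jukes-Cantor model for $P_{ij}(d)$; the two sequences, the function names and the grid search are made up for the example and are not taken from Page and Holmes.

```python
import math

def jc69_prob(i, j, d):
    """P_ij(d) under Jukes-Cantor, with d in expected substitutions per site."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay

def pairwise_log_likelihood(seq1, seq2, d):
    """Sum of the per-site log likelihoods for two aligned sequences."""
    total = 0.0
    for i, j in zip(seq1, seq2):
        # 0.25 is the equilibrium frequency of nucleotide i under Jukes-Cantor.
        total += math.log(0.25 * jc69_prob(i, j, d))
    return total

def ml_distance(seq1, seq2, d_max=2.0, steps=2000):
    """Crude grid search for the d that makes the observed pair most likely."""
    grid = [d_max * k / steps for k in range(1, steps + 1)]
    return max(grid, key=lambda d: pairwise_log_likelihood(seq1, seq2, d))

# Hypothetical aligned sequences differing at 2 of 10 sites.
seq1 = "ACGTACGTAC"
seq2 = "ACGTACGTTT"
d_hat = ml_distance(seq1, seq2)
print(f"ML estimate of d: {d_hat:.3f}")
```

Under this model the grid-search estimate should agree closely with the familiar closed-form Jukes-Cantor distance, which is a useful sanity check on the likelihood function.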
5Models, data and hypotheses
- What model?
- Transition/transversion ratio
- Base composition
- Variation in rate across sites
- In all but the simplest models (e.g. Jukes-Cantor),
differences in transition / transversion rates
can be taken into account
- Keeping other parameters constant, it is possible
to calculate ML estimates of individual parameters
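As a rough sketch of estimating one parameter while holding another constant, the example below assumes Kimura's two-parameter model, fixes the distance d, and searches for the transition/transversion rate ratio (called `kappa` here) that maximises the likelihood. The sequences, the fixed value of d and the search grid are all hypothetical.

```python
import math

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def k80_prob(i, j, d, kappa):
    """P_ij(d) under Kimura's two-parameter model.

    kappa is the transition/transversion rate ratio; kappa = 1 gives Jukes-Cantor.
    """
    e1 = math.exp(-4.0 * d / (kappa + 2.0))
    e2 = math.exp(-2.0 * d * (kappa + 1.0) / (kappa + 2.0))
    if i == j:
        return 0.25 + 0.25 * e1 + 0.5 * e2
    if (i, j) in TRANSITIONS:
        return 0.25 + 0.25 * e1 - 0.5 * e2
    return 0.25 - 0.25 * e1  # each of the two possible transversions

def k80_log_likelihood(seq1, seq2, d, kappa):
    """Per-site log likelihoods summed over the alignment."""
    return sum(math.log(0.25 * k80_prob(i, j, d, kappa)) for i, j in zip(seq1, seq2))

def ml_kappa(seq1, seq2, d):
    """Grid search over kappa with d held constant."""
    grid = [0.1 * k for k in range(1, 301)]
    return max(grid, key=lambda kappa: k80_log_likelihood(seq1, seq2, d, kappa))

# Hypothetical pair with 3 transitions and 1 transversion in 20 sites.
seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "GTAAACGTACGTACGTACGT"
kappa_hat = ml_kappa(seq_a, seq_b, d=0.2)
print(f"ML estimate of kappa at d = 0.2: {kappa_hat:.1f}")
```

In practice the parameters would be optimised jointly or in alternation; fixing d here simply keeps the illustration one-dimensional.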
6Likelihood ratio tests
- We can test alternative hypotheses concerning the
same data using a likelihood ratio test
- The likelihood ratio statistic ($\Delta$) is the ratio of
the likelihood of the alternative hypothesis (H1)
to that of the null hypothesis (H0)
- Because likelihoods are often very small, it is
more convenient to use log likelihoods
- $\Delta = \log L_1 - \log L_0$
- where
- $L_1$ is the maximum likelihood of the alternative
hypothesis H1
- $L_0$ is the maximum likelihood of the null
hypothesis H0
- Can be used to test various hypotheses, such as
whether a particular model of evolution is valid,
whether a molecular clock adequately describes
the data, or whether one phylogenetic hypothesis
is better than another
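A small worked example of the statistic follows. It assumes the two hypotheses are nested and differ by one free parameter, in which case twice the log-likelihood difference is commonly compared against a chi-square distribution with one degree of freedom. The log-likelihood values are invented for illustration.

```python
import math

def likelihood_ratio_statistic(log_l1, log_l0):
    """Difference in log likelihood between alternative and null hypotheses."""
    return log_l1 - log_l0

def chi2_upper_tail_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical maximised log likelihoods for the same data set.
log_l0 = -1710.2  # null hypothesis (simpler model)
log_l1 = -1705.4  # alternative hypothesis (one extra free parameter)

delta = likelihood_ratio_statistic(log_l1, log_l0)
p_value = chi2_upper_tail_df1(2.0 * delta)
print(f"2 * delta = {2.0 * delta:.2f}, approximate p = {p_value:.4f}")
```

With more than one extra parameter the degrees of freedom change, and a general chi-square routine would be needed in place of the one-degree-of-freedom shortcut used here.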
7Testing models
- A model can be tested to measure how well it fits
the observed data by comparing the likelihood that
a tree and a model confer on the data ($L_{tree}$)
with the theoretical best ($L_{max}$)
- A likelihood ratio test can be performed to test
the adequacy of the HKY85 model to describe the
hominid mtDNA data set
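The slide does not say how the theoretical best is obtained. One common choice, assumed here, is the unconstrained (multinomial) likelihood, in which each distinct site pattern is given a probability equal to its observed frequency. The sketch below computes its logarithm for a tiny made-up alignment.

```python
import math
from collections import Counter

def unconstrained_log_likelihood(alignment):
    """log L_max under the multinomial model: sum of n_k * log(n_k / N),
    where n_k counts each distinct alignment column and N is the number of sites."""
    columns = list(zip(*alignment))
    n_sites = len(columns)
    pattern_counts = Counter(columns)
    return sum(n * math.log(n / n_sites) for n in pattern_counts.values())

# Hypothetical two-sequence alignment with three distinct site patterns.
toy_alignment = ["AAGG", "AAGC"]
log_l_max = unconstrained_log_likelihood(toy_alignment)
print(f"log L_max = {log_l_max:.4f}")
```

The gap between this value and the log likelihood of the tree and model then plays the role of the likelihood ratio statistic; its significance is often judged by simulation rather than from a standard distribution.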
8Testing rate variation
- If sequences are evolving at different rates,
then an ultrametric tree will give a poor
representation of relationships between taxa
- $2\Delta = 2(\log L_{\text{no clock}} - \log L_{\text{clock}})$
9Comparing phylogenetic hypotheses
- If two trees are not significantly different then
the sum of the site-by-site differences in log
likelihood between them
will not be significantly different from zero
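The comparison can be sketched as a simple paired test on per-site log likelihoods. This is an assumed, simplified version of the idea (sum the differences, then scale by the standard error of that sum); the per-site values are invented and the function name is hypothetical.

```python
import math

def paired_site_test(site_lnl_a, site_lnl_b):
    """Return the summed per-site log-likelihood difference and a z-like score."""
    diffs = [a - b for a, b in zip(site_lnl_a, site_lnl_b)]
    n = len(diffs)
    total = sum(diffs)
    mean = total / n
    variance = sum((x - mean) ** 2 for x in diffs) / (n - 1)
    std_error_of_sum = math.sqrt(n * variance)
    return total, total / std_error_of_sum

# Hypothetical per-site log likelihoods for the same alignment on two trees.
tree_a = [-2.1, -3.4, -1.9, -2.8, -3.0, -2.2]
tree_b = [-2.3, -3.3, -2.0, -2.9, -2.8, -2.4]

total_diff, z_score = paired_site_test(tree_a, tree_b)
print(f"sum of differences = {total_diff:.2f}, z = {z_score:.2f}")
```

A score well inside roughly plus or minus two suggests the summed difference is not distinguishable from zero, so neither tree is clearly preferred. Real data sets have far more sites than this toy example.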
10Objections to likelihood
- Requires an explicit model of evolution
- This is a strength, since it makes us aware of
the assumptions being made
- However, dependence on a model raises the question
of which model to use
- Computationally expensive
- Finding the best combination of model and tree is
technically difficult
- Computing likelihood is also time consuming and
it may be that there is more than one maximal
likelihood value for a given tree
- It has been suggested that likelihood is better
for testing models than as an all-purpose
phylogenetic tool
11Splits
- In the above example, the split {gorilla,
orang-utan, gibbon}, {human, chimp} can be
written as 00011 in binary notation, or 3 in
decimal notation
- One advantage is that we can refer to any split
by a single number
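The numbering can be reproduced with a short sketch. It assumes the taxa are ordered human, chimp, gorilla, orang-utan, gibbon, with human as the rightmost binary digit, which is the ordering that yields 00011 for the example split. The function names are illustrative.

```python
TAXA = ["human", "chimp", "gorilla", "orang-utan", "gibbon"]

def split_to_binary(subset, taxa=TAXA):
    """Binary string for a split: 1 for taxa in `subset`, 0 otherwise.
    The first taxon in `taxa` is the rightmost (least significant) digit."""
    return "".join("1" if taxon in subset else "0" for taxon in reversed(taxa))

def split_to_number(subset, taxa=TAXA):
    """Decimal number corresponding to the binary string of the split."""
    return int(split_to_binary(subset, taxa), 2)

example = {"human", "chimp"}
print(split_to_binary(example), split_to_number(example))
```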
12Spectral analysis
- Provides a means of visualising support for each
split
- In simple terms, consists of plotting the
frequencies of each split in the data set
- Straightforward if there are two states for each
character

Taxon        Sequence               Binary characters
Human        G T C A T C A T C C     1  1  0  1  1  0  1  1  0  1
Chimp        A T T A C C A T T C     0  1  1  1  0  0  1  1  1  1
Gorilla      G T T G T T A T T A     1  1  1  0  1  1  1  1  1  0
Orang-utan   A C C A C T C C C A     0  0  0  1  0  1  0  0  0  0
Gibbon       A C C G C C C C C A     0  0  0  0  0  0  0  0  0  0
Split                                5  7  6 11  5 12  7  7  6  3
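The split numbers in the bottom row of the table, and the frequency of each split, can be recomputed from the binary characters. The sketch below assumes Human is the least significant bit and Gibbon the most significant, which reproduces the numbers shown; the variable names are illustrative.

```python
from collections import Counter

TAXA_ORDER = ["Human", "Chimp", "Gorilla", "Orang-utan", "Gibbon"]

# Binary characters copied from the table, one string of 10 sites per taxon.
BINARY_CHARACTERS = {
    "Human": "1101101101",
    "Chimp": "0111001111",
    "Gorilla": "1110111110",
    "Orang-utan": "0001010000",
    "Gibbon": "0000000000",
}

def split_numbers(characters, taxa):
    """Decimal split number for each site, with taxa[0] as the lowest bit."""
    n_sites = len(characters[taxa[0]])
    return [
        sum(int(characters[taxon][site]) << bit for bit, taxon in enumerate(taxa))
        for site in range(n_sites)
    ]

numbers = split_numbers(BINARY_CHARACTERS, TAXA_ORDER)
spectrum = Counter(numbers)  # split number -> number of supporting sites
for split, support in sorted(spectrum.items()):
    print(split, support)
```

Plotting `spectrum` as a bar chart, one bar per split, gives the kind of display that spectral analysis refers to.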
13Spectral analysis
14Spectral analysis
- Since not all splits can coexist in the same tree,
some method is needed to decide which splits to
use to construct the tree
- Five trivial splits will be in every tree
- One possible solution is to choose the two
mutually compatible, non-trivial splits which
have the best support
- In this case, the best non-trivial split is
{Orang-utan, Gibbon}
- The next best supported split is {Human, Chimp},
which is compatible with this split
- This gives the basic topology ((Human, Chimp),
Gorilla, (Orang-utan, Gibbon))
- Problems with spectral analysis
- Computationally expensive (half a million splits
for 20 sequences)
- Potential for more than two character states
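Two of the points above lend themselves to a quick check: whether two splits are compatible, and how fast the number of splits grows. The sketch assumes the usual compatibility criterion (two splits can share a tree if at least one of the four pairwise intersections of their sides is empty) and assumes the count of possible splits is $2^{n-1}$ for n sequences, which matches the half-million figure. The function names are illustrative.

```python
ALL_TAXA = {"Human", "Chimp", "Gorilla", "Orang-utan", "Gibbon"}

def compatible(side_a, side_b, taxa=ALL_TAXA):
    """True if the splits defined by side_a and side_b can occur in one tree."""
    a, b = set(side_a), set(side_b)
    not_a, not_b = taxa - a, taxa - b
    intersections = [a & b, a & not_b, not_a & b, not_a & not_b]
    return any(len(part) == 0 for part in intersections)

def number_of_splits(n_taxa):
    """Number of possible splits, assuming the 2**(n - 1) count."""
    return 2 ** (n_taxa - 1)

print(compatible({"Orang-utan", "Gibbon"}, {"Human", "Chimp"}))  # compatible pair
print(compatible({"Human", "Chimp"}, {"Human", "Gorilla"}))      # conflicting pair
print(number_of_splits(20))
```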
15Split decomposition
Site         1 2 3 4 5 6 7 8 9
Human        T C C T T A A A A
Chimp        T T C T A T A A A
Gorilla      T T A C A A T A A
Orang-utan   C C A C A A A T A
Gibbon       C C A C A A A A T
16Split decomposition