Title: Phylogenetic Inference
1Phylogenetic Inference
2Need an optimality criterion, plus an algorithm to
search for the best tree under that criterion
3Best tree OR True tree
4Types of optimality criteria used to infer
phylogeny from sequence
- Distance methods
- Parsimony
- Likelihood
- Probabilistic methods
- Phylogenetic invariants
5Distance based methods
- Minimum Evolution Principle
- The tree with the smallest sum of branch lengths
is the best tree
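6[Figure: unrooted tree of four sequences. A and B attach by branches r and s, C and D by branches u and v; the internal branch is t.]
$$d_{AB} + d_{CD} + d_{AD} + d_{BC} = (r + s) + (u + v) + (r + t + v) + (s + t + u) = 2r + 2s + 2u + 2v + 2t$$
$$T = r + s + u + v + t = (d_{AB} + d_{CD} + d_{AD} + d_{BC}) / 2$$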
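The four-taxon identity T = (dAB + dCD + dAD + dBC) / 2 can be checked numerically. This is a minimal sketch; the branch lengths are made-up values chosen only for illustration, and the distances are assumed to be perfectly additive on the tree.

```python
# Hypothetical branch lengths for the tree ((A,B),(C,D)); values are invented.
# r and s lead to A and B, u and v lead to C and D, t is the internal branch.
r, s, t, u, v = 0.10, 0.20, 0.05, 0.30, 0.15

# Pairwise distances implied by the tree (sum of branches on each path).
d_AB = r + s
d_CD = u + v
d_AD = r + t + v
d_BC = s + t + u

# Tree length computed directly, and recovered from the four distances.
tree_length = r + s + t + u + v
from_distances = (d_AB + d_CD + d_AD + d_BC) / 2

print(tree_length, from_distances)
```

Each branch is crossed exactly twice by the four paths, which is why halving the sum gives the tree length.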
7Number of possible unrooted trees from n
sequences
e.g. for 20 sequences there are approximately 10^20
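The count can be reproduced with the standard result that n sequences have (2n - 5)!! = 3 x 5 x ... x (2n - 5) unrooted bifurcating trees. The formula is not spelled out on the slide, so treat it as an assumption about what the slide showed; the function name is hypothetical.

```python
def count_unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 labelled sequences."""
    if n < 3:
        raise ValueError("need at least 3 sequences")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product for n = 3).
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count

for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

For 20 sequences the result is about 2 x 10^20, the order of magnitude quoted above.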
8For realistic numbers of sequences it is
impossible to consider all possible trees. Need
algorithms that can arrive at the best tree
without considering all possible trees.
9Neighbour joining is an approximation to minimum
evolution
10Neighbour Joining
[Figure: tree of eight sequences, labelled 1 to 8]
Choose the pair that minimizes the length of the
resulting tree
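One common way to make this choice is the Q criterion, which picks the same pair as minimizing the total tree length. The sketch below performs only the pair-selection step of one neighbour joining iteration; the function name and the five-taxon distance matrix are hypothetical, not taken from the slides.

```python
def nj_pick_pair(d):
    """Return ((i, j), q) for the pair with the smallest Q value.

    d is a symmetric distance matrix given as a list of lists.
    Q(i, j) = (n - 2) * d[i][j] - sum(d[i]) - sum(d[j])
    """
    n = len(d)
    row_sums = [sum(row) for row in d]
    best_pair, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best_pair, best_q = (i, j), q
    return best_pair, best_q

# Hypothetical distances between five sequences (indices 0 to 4).
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

pair, q_value = nj_pick_pair(dist)
print(pair, q_value)
```

Note that the chosen pair (0, 1) is not the pair with the smallest raw distance (3 and 4), which is the difference between neighbour joining and simply clustering the closest sequences. A full implementation would then join the pair into a new node, recompute distances, and repeat.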
11Distance methods
- Calculate the distance CORRECTING FOR MULTIPLE HITS
- The Distance Matrix

    7
Rat       0.0000 0.0646 0.1434 0.1456 0.3213 0.3213 0.7018
Mouse     0.0646 0.0000 0.1716 0.1743 0.3253 0.3743 0.7673
Rabbit    0.1434 0.1716 0.0000 0.0649 0.3582 0.3385 0.7522
Human     0.1456 0.1743 0.0649 0.0000 0.3299 0.2915 0.7116
Opossum   0.3213 0.3253 0.3582 0.3299 0.0000 0.3279 0.6653
Chicken   0.3213 0.3743 0.3385 0.2915 0.3279 0.0000 0.5721
Frog      0.7018 0.7673 0.7522 0.7116 0.6653 0.5721 0.0000
12Correction for multiple hits
- A great many models used for nucleotide sequences
(e.g. JC, K2P, HKY, Rev, Maximum Likelihood)
- aa sequences are even more complicated!
- Can take account of different rates of evolution
at sites (e.g. gamma distribution)
- Accuracy falls off drastically for highly
divergent sequences
- Is it necessary to use the most realistic model??
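As an illustration of a multiple-hits correction, the simplest model in the list (JC, Jukes-Cantor) converts the observed proportion of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The sketch below is a minimal version under stated assumptions: the function names are hypothetical, sequences are assumed to be aligned and gap-free, and none of the richer models are covered.

```python
import math

def p_distance(seq1, seq2):
    """Proportion of sites that differ between two aligned, gap-free sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    diffs = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return diffs / len(seq1)

def jc69_distance(p):
    """Jukes-Cantor corrected distance; undefined (returned as inf) for p >= 0.75."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# The correction is small for similar sequences and grows rapidly near saturation.
for p in (0.05, 0.30, 0.60, 0.70):
    print(f"p = {p:.2f}  JC d = {jc69_distance(p):.3f}")
```

As p approaches 0.75 the corrected distance grows without bound, so small sampling errors in p produce large errors in d. This is one way to see why accuracy falls off for highly divergent sequences.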
13The most accurate nucleotide substitution model
doesn't necessarily give the best estimate of the
true tree
- Models with higher numbers of
parameters provide distance estimates with higher
variance
14 How to infer the true tree? How to keep
reviewers/editors happy?
15In short, distance methods
- Can be fast and simple
- e.g. UPGMA, Neighbour Joining, Minimum Evolution,
Fitch-Margoliash
16Maximum Parsimony
- Occam's Razor
- Entia non sunt multiplicanda praeter
necessitatem (entities should not be multiplied beyond necessity).
- William of Occam (1300-1349)
The best tree is the one which requires the fewest
substitutions
17- Check each topology
- Count the minimum number of changes required to
explain the data
- Choose the tree with the smallest number of
changes
- Usually performs well with closely related
sequences but often performs badly with very
distantly related sequences
- With distantly related sequences homoplasy
becomes a major problem
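The counting step can be sketched with Fitch's algorithm, one standard way to obtain the minimum number of changes for a single site on a fixed tree. The slides do not name a counting method, so this choice is an assumption; the tree encoding (nested tuples), function names, and the four short sequences are all hypothetical.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    tree is a leaf name (str) or a pair (left, right) of subtrees.
    states maps each leaf name to its character at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one extra change is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the per-site minimum change counts over an alignment (dict of name -> sequence)."""
    length = len(next(iter(alignment.values())))
    total = 0
    for site in range(length):
        site_states = {name: seq[site] for name, seq in alignment.items()}
        total += fitch(tree, site_states)[1]
    return total

# Invented alignment of four sequences, five sites each.
alignment = {"A": "AAGTA", "B": "AAGCA", "C": "GAGTG", "D": "GATCG"}

# The three possible unrooted topologies for four sequences, rooted arbitrarily.
topologies = [(("A", "B"), ("C", "D")), (("A", "C"), ("B", "D")), (("A", "D"), ("B", "C"))]
scores = {t: parsimony_score(t, alignment) for t in topologies}
best = min(scores, key=scores.get)
print(scores, best)
```

With four sequences all three topologies can be checked exhaustively, as here. For larger data sets the number of topologies makes exhaustive checking impractical.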
18Not all trees need to be considered: the branch and
bound method is still guaranteed to find the MP
tree. In practice a heuristic search is often
performed (involving branch swapping, e.g. NNI,
SPR, TBR), with no guarantee of finding the MP tree.
19Long Branches Attract
- In a set of sequences evolving at different
rates the sequences evolving rapidly are drawn
together
20Comparison of methods
- Inconsistency
- Neighbour Joining (NJ) is very fast but depends
on accurate estimates of distance. This is more
difficult with very divergent data
- NJ can suffer from Long Branch Attraction
- Parsimony suffers from Long Branch Attraction.
This may be a particular problem for very
divergent data
- Parsimony can be computationally intensive
- Codon usage bias can be a problem for MP and NJ
- NJ and MP both perform well if sequences are not
too divergent
- Maximum Likelihood can be the most reliable but
depends on the choice of model and can be very
slow
- Methods may be combined
22The Molecular Clock
- For a given protein the rate of sequence
evolution is approximately constant across
lineages
- Zuckerkandl and Pauling (1965)
This would allow speciation and duplication
events to be dated accurately based on molecular
data
Local and approximate molecular clocks are more
reasonable
23Relative Rate Test
- Test whether sets of sequences are evolving at
equal rates (local molecular clock hypothesis)
e.g. RRTree, Robinson-Rechavi http://pbil.univ-lyon1.fr/software/rrtree.html
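The quantity such a test examines can be sketched from three pairwise distances. With an outgroup O, the branch lengths from the common ancestor of A and B follow from the three-point formulas, and under a local clock they should be roughly equal. The sketch below only computes the two branch lengths; it does not assess significance and is not the method implemented in RRTree. The function name is hypothetical, and the values are the Rat, Mouse and Opossum entries of the distance matrix given earlier, with Opossum assumed to be a suitable outgroup.

```python
def branches_from_ancestor(d_ab, d_ao, d_bo):
    """Branch lengths from the A/B common ancestor to A and to B, using outgroup O."""
    branch_a = (d_ab + d_ao - d_bo) / 2
    branch_b = (d_ab + d_bo - d_ao) / 2
    return branch_a, branch_b

# Rat-Mouse, Rat-Opossum and Mouse-Opossum distances from the matrix above.
rat_branch, mouse_branch = branches_from_ancestor(0.0646, 0.3213, 0.3253)

# Under equal rates the difference should be close to zero.
print(rat_branch, mouse_branch, rat_branch - mouse_branch)
```

A real test also needs the variance of this difference in order to decide whether it departs significantly from zero.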