Title: Molecular Phylogenetics
1 Molecular Phylogenetics
2 Trees
- Diagram consisting of branches and nodes
- Species tree (how are my species related?)
- contains only one representative from each species
- when did speciation take place?
- all nodes indicate speciation events
- Gene tree (how are my genes related?)
- normally contains a number of genes from a single species
- nodes relate either to speciation or gene duplication events
4 The purpose of a phylogenetic tree is to
illustrate how a group of objects (usually genes
or organisms) are related to one another
5 Phylogenetic trees are about visualising evolutionary relationships
"Nothing in Biology Makes Sense Except in the Light of Evolution" - Theodosius Dobzhansky (1900-1975)
6 Terms
- Phylogeny (phylon = tribe, genesis = origin)
- Homologue
- Orthologue
- Paralogue
- Tree topology
- Cladogram
- Phenogram
7 Cladogram
8 Phenogram
9 Clade: a set of species which includes all of the species derived from a single common ancestor
10 Molecular Evolution - Li
11 Cladistics and Phenetics
- Cladistic approach: trees are drawn based on the conserved characters
- Phenetic approach: trees are based on some measure of distance between the leaves
- Molecular phylogenies are inferred from molecular (usually sequence) data
- either cladistic (e.g. gene order) or phenetic
12 Classes of algorithm used to infer phylogeny from sequence
- Distance methods
- Parsimony
- Likelihood
- Probabilistic methods
- Phylogenetic invariants
13 Distance methods
- Calculate the distance, CORRECTING FOR MULTIPLE HITS
- The Distance Matrix

    7
Rat       0.0000  0.0646  0.1434  0.1456  0.3213  0.3213  0.7018
Mouse     0.0646  0.0000  0.1716  0.1743  0.3253  0.3743  0.7673
Rabbit    0.1434  0.1716  0.0000  0.0649  0.3582  0.3385  0.7522
Human     0.1456  0.1743  0.0649  0.0000  0.3299  0.2915  0.7116
Opossum   0.3213  0.3253  0.3582  0.3299  0.0000  0.3279  0.6653
Chicken   0.3213  0.3743  0.3385  0.2915  0.3279  0.0000  0.5721
Frog      0.7018  0.7673  0.7522  0.7116  0.6653  0.5721  0.0000
14 Distance methods
- Normally fast and simple
- e.g. UPGMA, Neighbour Joining, Minimum Evolution, Fitch-Margoliash
15 Correction for multiple hits
- Only differences can be observed directly, not distances
- All distance methods rely (crucially) on this
- A great many models are used for nucleotide sequences (e.g. JC, K2P, HKY, REV, Maximum Likelihood)
- Amino acid (aa) sequences are far more complicated!
- Accuracy falls off drastically for highly divergent sequences
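
As an illustration of a multiple-hits correction, the sketch below applies the Jukes-Cantor (JC) model named above, d = -3/4 ln(1 - 4p/3), where p is the observed proportion of differing sites. The function names and the example values of p are illustrative assumptions, not part of the original slides.

```python
import math

def p_distance(seq_a: str, seq_b: str) -> float:
    """Observed proportion of differing sites (columns with a gap are skipped)."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor(p: float) -> float:
    """JC distance: expected substitutions per site for observed difference p."""
    if p >= 0.75:
        # Under JC, unrelated sequences are expected to differ at 75% of sites,
        # so the correction is undefined at or beyond that point.
        raise ValueError("p >= 0.75: sequences are saturated under JC")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical observed differences: the correction grows with divergence.
for p in (0.05, 0.30, 0.60):
    print(f"observed p = {p:.2f}  ->  corrected d = {jukes_cantor(p):.3f}")
```

The gap between p and d widens as p approaches 0.75, which is one way to see why accuracy falls off for highly divergent sequences: small errors in p become large errors in d.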
16 Minimum Evolution
- The total length of all branches in the tree should be a minimum
- It has been shown that the minimum evolution tree is expected to be the true tree, provided branch lengths are corrected for multiple hits
17 Neighbour Joining
(Figure: tree diagram with leaves numbered 1 to 8)
18 Neighbour joining is an approximation to minimum evolution
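
A minimal sketch of the first neighbour-joining step, using the distance matrix from the Distance methods slide. It uses the standard selection criterion Q(i,j) = (n - 2) d(i,j) - r(i) - r(j), where r is a row sum. The function name `nj_pick_pair` is an assumption made for illustration, and a full implementation would repeat this step on a reduced matrix until the tree is resolved.

```python
from itertools import combinations

taxa = ["Rat", "Mouse", "Rabbit", "Human", "Opossum", "Chicken", "Frog"]
D = [
    [0.0000, 0.0646, 0.1434, 0.1456, 0.3213, 0.3213, 0.7018],
    [0.0646, 0.0000, 0.1716, 0.1743, 0.3253, 0.3743, 0.7673],
    [0.1434, 0.1716, 0.0000, 0.0649, 0.3582, 0.3385, 0.7522],
    [0.1456, 0.1743, 0.0649, 0.0000, 0.3299, 0.2915, 0.7116],
    [0.3213, 0.3253, 0.3582, 0.3299, 0.0000, 0.3279, 0.6653],
    [0.3213, 0.3743, 0.3385, 0.2915, 0.3279, 0.0000, 0.5721],
    [0.7018, 0.7673, 0.7522, 0.7116, 0.6653, 0.5721, 0.0000],
]

def nj_pick_pair(names, dist):
    """Return the pair minimizing Q and the branch lengths to their new node."""
    n = len(names)
    row_sums = [sum(row) for row in dist]

    def q(i, j):
        return (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]

    i, j = min(combinations(range(n), 2), key=lambda ij: q(*ij))
    # Branch lengths from i and j to the internal node that joins them.
    len_i = 0.5 * dist[i][j] + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
    len_j = dist[i][j] - len_i
    return names[i], names[j], len_i, len_j

a, b, len_a, len_b = nj_pick_pair(taxa, D)
print(f"join {a} ({len_a:.4f}) and {b} ({len_b:.4f})")
```

Note that the criterion does not simply pick the smallest raw distance (Rat-Mouse here). Subtracting the row sums favours pairs that are far from everything else, so with this matrix the first pair joined is Chicken and Frog, which are neighbours in the unrooted tree.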
19 Maximum Parsimony
- Occam's Razor
- "Entia non sunt multiplicanda praeter necessitatem" (entities should not be multiplied beyond necessity) - William of Occam (1300-1349)
The best tree is the one which requires the fewest substitutions
20
- Check each topology
- Count the minimum number of changes required to explain the data
- Choose the tree with the smallest number of changes
- Usually performs well with closely related sequences but often performs badly with very distantly related sequences
- With distantly related sequences homoplasy becomes a major problem
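
The counting step can be sketched with Fitch's algorithm, one common way to find the minimum number of changes for a fixed topology. The four-taxon alignment, the taxon names and the nested-tuple tree encoding below are hypothetical choices made for illustration.

```python
def fitch_score(tree, states):
    """Minimum number of changes at one site on a rooted binary tree.

    tree: nested 2-tuples whose leaves are taxon names (strings).
    states: dict mapping taxon name -> character at this site.
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:
            return common
        changes += 1          # no shared state: one substitution is required
        return left | right

    visit(tree)
    return changes

def parsimony_length(tree, alignment):
    """Sum of Fitch scores over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {name: seq[k] for name, seq in alignment.items()})
        for k in range(n_sites)
    )

# Hypothetical alignment; the three possible groupings of four taxa.
alignment = {"A": "AAC", "B": "AAC", "C": "GGC", "D": "GGT"}
topologies = [(("A", "B"), ("C", "D")), (("A", "C"), ("B", "D")), (("A", "D"), ("B", "C"))]
for topology in topologies:
    print(topology, parsimony_length(topology, alignment))
best = min(topologies, key=lambda t: parsimony_length(t, alignment))
```

Checking every topology like this is only feasible for a handful of taxa, because the number of topologies grows very quickly with the number of sequences; this is one reason parsimony is computationally intensive.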
21 Maximum Likelihood
- Requires a model of evolution
- Each substitution has an associated likelihood given a branch of a certain length
- A function is derived to represent the likelihood of the data given the tree, branch lengths and additional parameters
- This function is maximized (in practice, the negative log-likelihood is usually minimized)
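
For the simplest possible case, two aligned sequences separated by a single branch under the Jukes-Cantor model, the likelihood function can be written down and optimized directly. The sketch below is an assumption-laden toy: the site counts are hypothetical, and real programs optimize topology, all branch lengths and model parameters jointly.

```python
import math

def jc_log_likelihood(t, n_same, n_diff):
    """Log-likelihood of a pairwise alignment under JC for branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * e   # P(base unchanged after t)
    p_diff = 0.25 - 0.25 * e   # P(change to one specific other base after t)
    # The factor 0.25 is the equilibrium frequency of the starting base.
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)

def ml_branch_length(n_same, n_diff, lo=1e-6, hi=5.0, tol=1e-9):
    """Golden-section search for the branch length with the highest likelihood."""
    phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - phi * (b - a)
        d = a + phi * (b - a)
        if jc_log_likelihood(c, n_same, n_diff) > jc_log_likelihood(d, n_same, n_diff):
            b = d   # the maximum lies in [a, d]
        else:
            a = c   # the maximum lies in [c, b]
    return (a + b) / 2

# Hypothetical data: 100 sites, 30 of which differ.
t_hat = ml_branch_length(n_same=70, n_diff=30)
print(f"ML branch length: {t_hat:.4f} substitutions per site")
```

In this two-sequence case the maximum-likelihood branch length coincides with the analytic Jukes-Cantor distance for p = 0.30, which gives a useful sanity check on the search.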
22 Models can be made more parameter rich to increase their realism
- The most common additional parameters are
- A correction to allow different substitution rates for each type of nucleotide change
- A correction for the proportion of sites which are unable to change
- A correction for variable site rates at those sites which can change
- The values of the additional parameters will be estimated in the process (e.g. PAUP)
23 A gamma distribution can be used to model site rate heterogeneity
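
A small sketch of what this means in practice: relative site rates are drawn from a gamma distribution with shape alpha and mean 1, so a small alpha gives many nearly invariant sites plus a few fast ones, while a large alpha gives nearly equal rates. The alpha values, sample size and function name are illustrative assumptions.

```python
import random

def sample_site_rates(alpha, n_sites, seed=0):
    """Draw relative substitution rates for n_sites from Gamma(shape=alpha, mean=1)."""
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale); scale = 1/alpha fixes the mean at 1.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

def fraction_below(rates, threshold=0.1):
    """Share of sites evolving at less than `threshold` times the average rate."""
    return sum(r < threshold for r in rates) / len(rates)

for alpha in (0.5, 5.0):
    rates = sample_site_rates(alpha, 20000)
    mean = sum(rates) / len(rates)
    print(f"alpha = {alpha}: mean rate = {mean:.2f}, "
          f"fraction of very slow sites = {fraction_below(rates):.3f}")
```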
24 Long Branches Attract
- In a set of sequences evolving at different rates, the sequences evolving rapidly are drawn together
25 Comparison of methods
- Inconsistency
- Neighbour Joining (NJ) is very fast but depends on accurate estimates of distance. This is more difficult with very divergent data
- Parsimony suffers from Long Branch Attraction. This may be a particular problem for very divergent data
- NJ can suffer from Long Branch Attraction
- Parsimony is also computationally intensive
- Codon usage bias can be a problem for MP and NJ
- Maximum Likelihood is the most reliable but depends on the choice of model and is very slow
- Methods may be combined
26 The Molecular Clock
- For a given protein the rate of sequence evolution is approximately constant across lineages - Zuckerkandl and Pauling (1965)
This would allow speciation and duplication events to be dated accurately based on molecular data
Local and approximate molecular clocks are more reasonable
27 Relative Rate Test
- Test whether sets of sequences are evolving at equal rates (local molecular clock hypothesis)
e.g. RRTree, Robinson-Rechavi, http://pbil.univ-lyon1.fr/software/rrtree.html
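
The idea behind a relative rate comparison can be sketched with three pairwise distances: two ingroup sequences A and B and an outgroup C. The example below takes Rat, Mouse and Human values from the distance matrix shown earlier; treating Human as the outgroup and the function name `relative_rate` are assumptions made for illustration. It does not compute a significance test, which would need a variance estimate of the kind a tool such as RRTree provides.

```python
def relative_rate(d_ab, d_ac, d_bc):
    """Branch lengths from the A/B common ancestor to A and to B (C is the outgroup).

    Returns (branch_a, branch_b, difference); a difference near zero is what a
    local molecular clock predicts.
    """
    branch_a = (d_ab + d_ac - d_bc) / 2
    branch_b = (d_ab + d_bc - d_ac) / 2
    return branch_a, branch_b, d_ac - d_bc

# A = Rat, B = Mouse, C = Human (assumed outgroup for this illustration).
rat, mouse, diff = relative_rate(d_ab=0.0646, d_ac=0.1456, d_bc=0.1743)
print(f"Rat branch = {rat:.4f}, Mouse branch = {mouse:.4f}, difference = {diff:.4f}")
```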
28 Rooting the Tree
- In an unrooted tree the direction of evolution is unknown
- The root is the hypothesized ancestor of the sequences in the tree
- The root can either be placed on a branch or at a node
- You should start by viewing an unrooted tree
32 Automatic rooting
- Many software packages will root trees automatically (e.g. mid-point rooting in NJPlot)
- This must involve assumptions: BEWARE!
33 Rooting Using an Outgroup
- 1. The outgroup should be a sequence (or set of sequences) known to be less closely related to the rest of the sequences than they are to each other
- 2. It should ideally be as closely related as possible to the rest of the sequences while still satisfying condition 1
- The root must be somewhere between the outgroup and the rest (either on the node or in a branch)
34 Sometimes two trees may look very different
but, in fact, differ only in the position of the
root
35 What sequences should I use for organism phylogenies?
- Slowly evolving / Fast evolving
- rRNA
- mitochondrion
- other
36 How confident am I that my tree is correct?
- Bootstrap values
- Bootstrapping is a statistical technique that uses random resampling of the data to estimate the sampling error of tree topologies
37 Bootstrapping phylogenies
- Characters are resampled with replacement to create many bootstrap replicate data sets
- Each bootstrap replicate data set is analysed (e.g. with parsimony, distance, ML etc.)
- Agreement among the resulting trees is summarized with a majority-rule consensus tree
- Frequencies of occurrence of groups, bootstrap proportions (BPs), are a measure of support for those groups
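
The steps above can be sketched as follows. To keep the example self-contained, the tree-building step is replaced by a toy stand-in, `closest_pair`, which reports only the single most similar pair of sequences as a group; a real analysis would run parsimony, distance or ML on each replicate and collect every group in the resulting tree. All names and the alignment are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (same length as the original)."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

def closest_pair(alignment):
    """Toy stand-in for tree inference: the pair with the fewest differences."""
    def n_diff(a, b):
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    best = min(combinations(sorted(alignment), 2), key=lambda ab: n_diff(*ab))
    return {frozenset(best)}

def bootstrap_support(alignment, infer_groups, n_reps=100, seed=0):
    """Bootstrap proportion of every group seen across the replicates."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_reps):
        counts.update(infer_groups(bootstrap_alignment(alignment, rng)))
    return {group: count / n_reps for group, count in counts.items()}

# Hypothetical alignment of three sequences.
toy = {"s1": "ACGTACGTAC", "s2": "ACGTACGAAC", "s3": "TTGTCCGAAT"}
for group, bp in bootstrap_support(toy, closest_pair).items():
    print(sorted(group), f"{bp:.2f}")
```

Resampling whole columns, rather than individual characters, keeps each site's pattern across taxa intact, which is what the tree-building method needs.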
38 Bootstrapping - an example
Ciliate SSUrDNA - parsimony bootstrap, majority-rule consensus
(Figure: consensus tree of Ochromonas (1), Symbiodinium (2), Prorocentrum (3), Loxodes (4), Tracheloraphis (5), Spirostomum (6), Gruberia (7), Euplotes (8) and Tetrahymena (9), with bootstrap proportions between 84 and 100 on the internal branches)
Wim de Grave et al. Fiocruz bioinformatics training course
39 Bootstrapping
Majority-rule consensus (with minority components)
Wim de Grave et al. Fiocruz bioinformatics
training course
40 Bootstrap - interpretation
- Bootstrapping is a very valuable and widely used technique (it is demanded by some journals)
- BPs give an idea of how likely a given branch would be to be unaffected if additional data, with the same distribution, became available
- BPs are not the same as confidence intervals. There is no simple mapping between bootstrap values and confidence intervals. There is no agreement about what constitutes a good bootstrap value (> 70, > 80, > 85?)
- Some theoretical work indicates that BPs can be a conservative estimate of confidence intervals
- If the estimated tree is inconsistent, all the bootstraps in the world won't help you
41 Jack-knifing
- Jack-knifing is very similar to bootstrapping and differs only in the character resampling strategy
- Jack-knifing is not as widely available or widely used as bootstrapping
- Tends to produce broadly similar results
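
To make the difference in resampling strategy concrete, the sketch below draws a jack-knife replicate by keeping a random subset of columns without replacement, so no column appears twice and the replicate is shorter than the original. The 50% retention fraction, the function name and the alignment are assumptions made for illustration; other deletion fractions are also used.

```python
import random

def jackknife_alignment(alignment, rng, keep_fraction=0.5):
    """Keep a random subset of columns, each at most once, in original order."""
    n_sites = len(next(iter(alignment.values())))
    n_keep = max(1, round(keep_fraction * n_sites))
    cols = sorted(rng.sample(range(n_sites), n_keep))
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

# Hypothetical alignment; each replicate would then be analysed like a bootstrap replicate.
toy = {"s1": "ACGTACGTAC", "s2": "ACGTACGAAC", "s3": "TTGTCCGAAT"}
replicate = jackknife_alignment(toy, random.Random(0))
for name, seq in replicate.items():
    print(name, seq)
```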
42 Likelihood-based tests of topologies
- Kishino-Hasegawa test
- Trees specified a priori
- KH can be used to test whether two competing hypotheses have significantly different likelihoods
- NB: should not be used to test trees that have been chosen on the basis of the data!
- Shimodaira-Hasegawa test
- Can be used to test confidence of ML tree compared to related trees (e.g. second most likely tree from the data)
- Andrew Rambaut, http://evolve.zoo.ox.ac.uk/software/shtests
43 Inferring Sequences at Ancestral Nodes
- Maximum likelihood estimates of tree topologies also provide inferred sequences at ancestral nodes
- Analysis of sequences at ancestral nodes and sequence changes along ancestral branches can provide information about when a novel trait or mutation was acquired
- PAML (Phylogenetic Analysis by Maximum Likelihood)
- Confidence intervals provided
- Selection can be inferred
44 Coalescent models
- Consider genetic lineages going back in time
- Make inferences from patterns of coalescent events (e.g. effective population size, migrations etc.)
- Improved efficiency for sequence simulations
- Great number of software packages including LAMARC (Kuhner and Felsenstein)
- Has generated enormous interest and a large body of literature (for review see Rosenberg and Nordborg, Nature Reviews Genetics, 2002)
45
- For an excellent review of recent progress in phylogenetics see
- "Molecular phylogenetics: state-of-the-art methods for looking into the past"
- Whelan et al., Trends in Genetics, 2001
46 Inferring Function from Sequence Homology
High throughput Genome Annotation
47 Basic Methods
Taken from JA Eisen, Genome Research, 1998
- Highest BLAST hit
- uncharacterised gene is assigned the function of the best BLAST hit
- An additional cut-off value may be used
- Still very frequently used, sometimes referred to as first-pass annotation
- Top Hits
- Examine a set of top BLAST hits and consider the consensus function
- COGs (Clusters of Orthologous Groups)
- Clustering of genes into orthologous groups based on similarity scores
48 Phylogenomics
Taken from JA Eisen, Genome Research, 1998
- Choose genes of interest
- Identify homologues
- Align sequences
- Calculate gene tree
- Overlay known functions onto tree
- Infer likely function of genes of interest
- Note: only proteins with confirmed functions should be used (to avoid error propagation)
49 Increased power over similarity methods. However, increased power comes with increased costs in terms of labour
50 Duplication and Speciation
- Gene duplication may accelerate the rate of evolution and the rate of functional divergence
- Orthologues may provide better information about function (for a given level of divergence) than paralogues
- Software is available for automatic inference of gene duplication events on a phylogenetic tree (e.g. SDI, Seán Eddy). This can be used for improved automation of phylogenomics (e.g. RIO, Eddy et al.)
- ref: Zmasek and Eddy, BMC Bioinformatics, 2002
51 Functional Genomics
- e.g. Xun Gu, Molecular Biology and Evolution, 1999
- Examine functional divergence after duplication
- Type I (rates change)
- Type II (replacement at conserved site)
- DIVERGE program