Title: Phylogenies and the Tree of Life
1. Phylogenies and the Tree of Life
Basic Principles of Phylogenetics: Parsimony - Distance - Likelihood
Topologies - Super Trees - Testing
Networks
Challenges
Empirical Investigations: Molecular Clock, Biochemical Rates, Selection Strength
Tree Shapes, Branching Patterns, Rootings
Open Questions
2. Central Principles of Phylogeny Reconstruction
Example sequences: TTCAGT, TCCAGT, GCCAAT, GCCAAT
3. From Distance to Phylogenies
What is the relationship of a, b, c, d and e?
4. Enumerating Trees: Unrooted, valency 3
Recursion: $T_n = (2n-5)\,T_{n-1}$
Initialisation: $T_1 = T_2 = T_3 = 1$
n     T_n
4     3
5     15
6     105
7     945
8     10395
9     1.4 × 10^5
10    2.0 × 10^6
15    7.9 × 10^12
20    2.2 × 10^20
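The recursion can be checked numerically. The following is a minimal sketch; the function name `count_unrooted_trees` is hypothetical and not part of the slides.

```python
def count_unrooted_trees(n):
    """Number of unrooted, leaf-labelled trees with n leaves (internal valency 3).

    Uses the recursion T_n = (2n - 5) * T_{n-1} with T_1 = T_2 = T_3 = 1.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    count = 1  # T_1 = T_2 = T_3 = 1
    for k in range(4, n + 1):
        count *= 2 * k - 5
    return count


if __name__ == "__main__":
    for n in (4, 5, 6, 7, 8, 9, 10, 15, 20):
        print(n, count_unrooted_trees(n))
```

Python integers are unbounded, so the exact counts are available even where the table only gives orders of magnitude.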
5. Heuristic Searches in Tree Space
Nearest Neighbour Interchange
Subtree regrafting
Subtree rerooting and regrafting
6. Assignment to internal nodes: the simple way.
What is the cheapest assignment of nucleotides to internal nodes, given some (symmetric) distance function d(N1,N2)?
If there are k leaves, there are k-2 internal nodes and $4^{k-2}$ possible assignments of nucleotides. For k = 22, this is more than $10^{12}$.
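As a quick arithmetic check of why exhaustive enumeration is impractical, the count can be computed directly. The function name `num_internal_assignments` is hypothetical.

```python
def num_internal_assignments(k, alphabet_size=4):
    """Number of ways to label the k-2 internal nodes of a tree with k leaves."""
    return alphabet_size ** (k - 2)


if __name__ == "__main__":
    # For k = 22 this is 4**20, which exceeds 10**12.
    print(num_internal_assignments(22))
```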
7. 5S RNA Alignment and Phylogeny (Hein, 1990)
Transitions cost 2, transversions cost 5. Total weight: 843.
10  tatt-ctggtgtcccaggcgtagaggaaccacaccgatccatctcgaacttggtggtgaaactctgccgcggt--aaccaatact-cg-gg-gggggccct-gcggaaaaatagctcgatgccagga--ta
17  t--t-ctggtgtcccaggcgtagaggaaccacaccaatccatcccgaacttggtggtgaaactctgctgcggt--ga-cgatact-tg-gg-gggagcccg-atggaaaaatagctcgatgccagga--t-
9   t--t-ctggtgtctcaggcgtggaggaaccacaccaatccatcccgaacttggtggtgaaactctattgcggt--ga-cgatactgta-gg-ggaagcccg-atggaaaaatagctcgacgccagga--t-
14  t----ctggtggccatggcgtagaggaaacaccccatcccataccgaactcggcagttaagctctgctgcgcc--ga-tggtact-tg-gg-gggagcccg-ctgggaaaataggacgctgccag-a--t-
3   t----ctggtgatgatggcggaggggacacacccgttcccataccgaacacggccgttaagccctccagcgcc--aa-tggtact-tgctc-cgcagggag-ccgggagagtaggacgtcgccag-g--c-
11  t----ctggtggcgatggcgaagaggacacacccgttcccataccgaacacggcagttaagctctccagcgcc--ga-tggtact-tg-gg-ggcagtccg-ctgggagagtaggacgctgccag-g--c-
4   t----ctggtggcgatagcgagaaggtcacacccgttcccataccgaacacggaagttaagcttctcagcgcc--ga-tggtagt-ta-gg-ggctgtccc-ctgtgagagtaggacgctgccag-g--c-
15  g----cctgcggccatagcaccgtgaaagcaccccatcccat-ccgaactcggcagttaagcacggttgcgcccaga-tagtact-tg-ggtgggagaccgcctgggaaacctggatgctgcaag-c--t-
8   g----cctacggccatcccaccctggtaacgcccgatctcgt-ctgatctcggaagctaagcagggtcgggcctggt-tagtact-tg-gatgggagacctcctgggaataccgggtgctgtagg-ct-t-
12  g----cctacggccataccaccctgaaagcaccccatcccgt-ccgatctgggaagttaagcagggttgagcccagt-tagtact-tg-gatgggagaccgcctgggaatcctgggtgctgtagg-c--t-
7   g----cttacgaccatatcacgttgaatgcacgccatcccgt-ccgatctggcaagttaagcaacgttgagtccagt-tagtact-tg-gatcggagacggcctgggaatcctggatgttgtaag-c--t-
16  g----cctacggccatagcaccctgaaagcaccccatcccgt-ccgatctgggaagttaagcagggttgcgcccagt-tagtact-tg-ggtgggagaccgcctgggaatcctgggtgctgtagg-c--t-
1   a----tccacggccataggactctgaaagcactgcatcccgt-ccgatctgcaaagttaaccagagtaccgcccagt-tagtacc-ac-ggtgggggaccacgcgggaatcctgggtgctgt-gg-t--t-
18  a----tccacggccataggactctgaaagcaccgcatcccgt-ccgatctgcgaagttaaacagagtaccgcccagt-tagtacc-ac-ggtgggggaccacatgggaatcctgggtgctgt-gg-t--t-
2   a----tccacggccataggactgtgaaagcaccgcatcccgt-ctgatctgcgcagttaaacacagtgccgcctagt-tagtacc-at-ggtgggggaccacatgggaatcctgggtgctgt-gg-t--t-
5   g---tggtgcggtcataccagcgctaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggccagaa-cagtact-gg-gatgggtgacctcccgggaagtcctggtgccgcacc-c--c-
13  g----ggtgcggtcataccagcgttaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggccagcc-tagtact-ag-gatgggtgacctcctgggaagtcctgatgctgcacc-c--t-
6   g----ggtgcgatcataccagcgttaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggttggag-tagtact-ag-gatgggtgacctcctgggaagtcctaatattgcacc-c-tt-
8. Cost of a history: minimizing over internal states
States at each node: A, C, G, T
9. Cost of a history: leaves (initialisation).
States at each node: A, C, G, T
Initialisation at the leaves: Cost(N) = 0 if N is the nucleotide observed at the leaf, otherwise infinity.
Example leaves: G and A.
Empty: Cost = 0.
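10. Fitch-Hartigan-Sankoff Algorithm
Each node carries a cost vector over (A,C,G,T). Each entry is the cost of the cheapest tree hanging from this node, given that nucleotide (e.g. a C) at this node.
Example cost vectors: (A,C,G,T) = (9,7,7,7) and (A,C,G,T) = (10,2,10,2).
Leaves: (A,C,G,T) cost vectors with 0 at the observed nucleotide (three leaves shown).
Figure labels: substitution costs 5 and 2; nucleotides C, A, T, G.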
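The bottom-up cost recursion can be sketched as follows. This is a minimal illustration, not the implementation behind the slides. It assumes the transition cost 2 and transversion cost 5 quoted for the 5S RNA example, and a hypothetical three-leaf tree ((C,T),G), chosen because it reproduces the cost vectors (10,2,10,2) and (9,7,7,7) quoted on this slide. All function names are hypothetical.

```python
import math

NUCS = "ACGT"
PURINES = {"A", "G"}


def sub_cost(x, y, transition=2, transversion=5):
    """Symmetric substitution cost (assumed: transitions 2, transversions 5)."""
    if x == y:
        return 0
    same_class = (x in PURINES) == (y in PURINES)
    return transition if same_class else transversion


def leaf_costs(nuc):
    """Initialisation: cost 0 for the observed nucleotide, infinity otherwise."""
    return [0 if n == nuc else math.inf for n in NUCS]


def sankoff(tree):
    """Cost vector over (A,C,G,T) for a tree.

    A leaf is a one-letter string; an internal node is a tuple of subtrees.
    Entry i is the cost of the cheapest subtree given NUCS[i] at this node.
    """
    if isinstance(tree, str):
        return leaf_costs(tree)
    child_vectors = [sankoff(child) for child in tree]
    costs = []
    for parent in NUCS:
        total = 0
        for vec in child_vectors:
            # Cheapest choice of the child's state, given the parent's state.
            total += min(c + sub_cost(parent, n) for n, c in zip(NUCS, vec))
        costs.append(total)
    return costs


if __name__ == "__main__":
    print(sankoff(("C", "T")))          # cost vector of the (C,T) cherry
    print(sankoff((("C", "T"), "G")))   # cost vector at the root
    print(min(sankoff((("C", "T"), "G"))))  # cheapest total cost
```

The work per node is constant in the number of leaves (16 state pairs per child), so the total cost grows linearly with the size of the tree instead of as $4^{k-2}$.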
11. The Felsenstein Zone (Felsenstein-Cavender, 1979)
Patterns (16, only 8 shown; each column is one pattern over the four leaves):
0 1 0 0 0 0 0 0
0 0 1 0 0 1 0 1
0 0 0 1 0 1 1 0
0 0 0 0 1 0 1 1
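12. Bootstrapping (Felsenstein, 1985)
Alignment (four sequences, 11 columns):
ATCTGTAGTCT
ATCTGTAGTCT
ATCTGTAGTCT
ATCTGTAGTCT
Number of times each column is drawn in one bootstrap replicate: 1 0 2 3 0 1 0 1 2 0 1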
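One bootstrap replicate resamples alignment columns with replacement, keeping the number of columns fixed. The sketch below is a minimal illustration under that assumption; the function name `bootstrap_columns` is hypothetical, and the tree reconstruction that would be rerun on each replicate is not shown.

```python
import random
from collections import Counter


def bootstrap_columns(alignment, rng):
    """Draw one bootstrap replicate of an alignment.

    alignment: list of equal-length strings (one per sequence).
    Returns (replicate, counts), where counts[i] is how often original
    column i was drawn.
    """
    n_cols = len(alignment[0])
    # Sample column indices with replacement, as many as there are columns.
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    counts = Counter(picks)
    replicate = ["".join(seq[i] for i in picks) for seq in alignment]
    return replicate, [counts[i] for i in range(n_cols)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    alignment = ["ATCTGTAGTCT"] * 4
    replicate, counts = bootstrap_columns(alignment, rng)
    print(counts)
    for row in replicate:
        print(row)
```

The same column indices are applied to every sequence, so each resampled column remains a genuine site pattern of the original alignment.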
13. Assignment to internal nodes: the simple way.
If branch lengths and the evolutionary process are known, what is the probability of the nucleotides at the leaves?
cctacggccatacca a ccctgaaagcaccccatcccgt
cttacgaccatatca c cgttgaatgcacgccatcccgt
cctacggccatagca c ccctgaaagcaccccatcccgt
cccacggccatagga c ctctgaaagcactgcatcccgt
tccacggccatagga a ctctgaaagcaccgcatcccgt
ttccacggccatagg c actgtgaaagcaccgcatcccg
tggtgcggtcatacc g agcgctaatgcaccggatccca
ggtgcggtcatacca t gcgttaatgcaccggatcccat
14. Probability of leaf observations: summing over internal states
States at each node: A, C, G, T
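Summing over internal states has the same recursive shape as the parsimony cost calculation, with the minimum replaced by a sum and costs replaced by transition probabilities. The sketch below assumes a Jukes-Cantor substitution process with uniform base frequencies; the slides do not say which model was used, so this choice and all function names are assumptions.

```python
import math

NUCS = "ACGT"


def jc_prob(x, y, t):
    """Jukes-Cantor probability of y after time t given x.

    t is measured in expected substitutions per site (assumed model).
    """
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def partial_likelihoods(tree):
    """Probability of the leaves below a node, for each state at that node.

    A leaf is a one-letter string; an internal node is a list of
    (subtree, branch_length) pairs.
    """
    if isinstance(tree, str):
        return [1.0 if n == tree else 0.0 for n in NUCS]
    result = [1.0] * len(NUCS)
    for child, t in tree:
        child_vec = partial_likelihoods(child)
        for i, parent in enumerate(NUCS):
            # Sum over the child's state, given the parent's state.
            result[i] *= sum(jc_prob(parent, n, t) * p
                             for n, p in zip(NUCS, child_vec))
    return result


def site_likelihood(tree):
    """Probability of the leaf nucleotides, with a uniform root distribution."""
    return sum(0.25 * p for p in partial_likelihoods(tree))


if __name__ == "__main__":
    # Hypothetical tree ((C:0.1, T:0.2):0.3, G:0.4) for a single site.
    tree = [([("C", 0.1), ("T", 0.2)], 0.3), ("G", 0.4)]
    print(site_likelihood(tree))
```

For a whole alignment, the per-site probabilities would be multiplied (or their logarithms added) under the usual assumption that sites evolve independently.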
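15. Output from Likelihood Method.
Likelihood: $7.9 \times 10^{-14}$; parameter estimates: 0.31, 0.18
Likelihood: $6.2 \times 10^{-12}$; parameter estimates: 0.34, 0.16
$2\,[\ln(6.2 \times 10^{-12}) - \ln(7.9 \times 10^{-14})]$ is asymptotically $\chi^2$ distributed with (n-2) degrees of freedom.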
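The test statistic for the two likelihoods on this slide can be computed directly. The sketch assumes the smaller likelihood belongs to the more constrained of two nested models and uses the standard likelihood ratio form, twice the difference in log-likelihoods. The function name is hypothetical, and since n is not given on the slide, no p-value is computed.

```python
import math


def likelihood_ratio_statistic(lik_constrained, lik_general):
    """Twice the log-likelihood difference between two nested models."""
    return 2.0 * (math.log(lik_general) - math.log(lik_constrained))


if __name__ == "__main__":
    # Assumption: 7.9e-14 is the constrained model, 6.2e-12 the general one.
    stat = likelihood_ratio_statistic(7.9e-14, 6.2e-12)
    print(stat)  # roughly 8.7; compare with a chi-square with (n-2) d.f.
```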
16. The Molecular Clock
First noted by Zuckerkandl and Pauling (1964) as an empirical fact. How can one detect it?
17. Rootings
Purpose: 1) To give time direction in the phylogeny and identify the most ancient point. 2) To be able to define concepts such as a monophyletic group.
1) Outgroup: Enhance the data set with a sequence from a species definitely distant to all of them. It will be joined at the root of the original data.
2) Midpoint: Find the midpoint of the longest path in the tree.
3) Assume a molecular clock.
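Midpoint rooting can be sketched on an unrooted tree stored as a weighted adjacency map. This is a minimal illustration with hypothetical names and a hypothetical example tree. It finds the longest path with two traversals, which is valid for trees with non-negative branch lengths, and then walks along that path to the edge containing the midpoint.

```python
def farthest(adj, start):
    """Return (node, distance, parent_map) for the node farthest from start."""
    dist = {start: 0.0}
    parent = {start: None}
    stack = [start]
    while stack:
        u = stack.pop()
        for v, w in adj[u].items():
            if v not in dist:
                dist[v] = dist[u] + w
                parent[v] = u
                stack.append(v)
    far = max(dist, key=dist.get)
    return far, dist[far], parent


def midpoint_edge(adj):
    """Locate the midpoint of the longest path in a tree.

    Returns ((u, v, offset), path_length): the root lies on edge u-v,
    at distance offset from u.
    """
    any_node = next(iter(adj))
    a, _, _ = farthest(adj, any_node)     # one end of the longest path
    b, length, parent = farthest(adj, a)  # the other end, tree rooted at a
    half = length / 2.0
    walked = 0.0
    node = b
    # Walk from b towards a until half of the path length is covered.
    while parent[node] is not None:
        w = adj[node][parent[node]]
        if walked + w >= half:
            return (node, parent[node], half - walked), length
        walked += w
        node = parent[node]
    raise ValueError("tree has no edges")


# Hypothetical tree: leaves A, B, C, D; internal nodes X, Y.
example = {
    "A": {"X": 1.0},
    "B": {"X": 3.0},
    "C": {"Y": 4.0},
    "D": {"Y": 1.0},
    "X": {"A": 1.0, "B": 3.0, "Y": 2.0},
    "Y": {"C": 4.0, "D": 1.0, "X": 2.0},
}

if __name__ == "__main__":
    print(midpoint_edge(example))
```

In the example the longest path is B-X-Y-C with length 9, so the root falls on the edge X-Y, 4.5 away from both B and C.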
18. Rooting the 3 kingdoms
3 billion years ago: no reliable clock, no outgroup.
Given 2 sets of homologous proteins, e.g. MDH and LDH, can the archaea, prokaryotes and eukaryotes be rooted?
19. The generation/year-time clock (Langley-Fitch, 1973)
20. The generation/year-time clock (Langley-Fitch, 1973)
Can the generation time clock be tested?
21. The generation/year-time clock (Langley-Fitch, 1973)
k = 3, t = 2: dg = 4
k, t: dg = (2k-3) - (t-1)
22. α- and β-globin, cytochrome c, fibrinopeptide A and the generation time clock (Langley-Fitch, 1973)
- Relative rates:
- α-globin 0.342
- β-globin 0.452
- cytochrome c 0.069
- fibrinopeptide A 0.137
23. Almost Clocks
References: M.J. Sanderson (1997) A Nonparametric Approach to Estimating Divergence Times in the Absence of Rate Constancy. Mol. Biol. Evol. 14(12):1218-31; J.L. Thorne et al. (1998) Estimating the Rate of Evolution of the Rate of Evolution. Mol. Biol. Evol. 15(12):1647-57; J.P. Huelsenbeck et al. (2000) A Compound Poisson Process for Relaxing the Molecular Clock. Genetics 154:1879-92.
I. Smoothing a non-clock tree onto a clock tree (Sanderson).
II. Rate of evolution of the rate of evolution (Thorne et al.): the rate of evolution can change at each bifurcation.
III. Relaxed molecular clock (Huelsenbeck et al.): at random points in time, the rate changes by being multiplied with a gamma-distributed random variable.
Comment: Makes perfect sense. Testing no clock versus a perfect clock is choosing between two unrealistic extremes.
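The relaxed clock in III can be illustrated with a small simulation along a single branch. This is a sketch of the idea as described on the slide, not the inference method of Huelsenbeck et al. The parameter names and the choice of a gamma multiplier with mean 1 are assumptions.

```python
import random


def relaxed_clock_branch(duration, start_rate, event_rate, shape, scale, rng):
    """Simulate rate changes along one branch.

    Rate-change events arrive as a Poisson process with intensity event_rate.
    At each event the current rate is multiplied by a Gamma(shape, scale)
    variable. Returns (expected substitutions on the branch, final rate).
    """
    t = 0.0
    rate = start_rate
    length = 0.0
    while True:
        # Waiting time to the next rate change (never, if event_rate is 0).
        wait = rng.expovariate(event_rate) if event_rate > 0 else float("inf")
        if t + wait >= duration:
            length += rate * (duration - t)
            return length, rate
        length += rate * wait
        t += wait
        rate *= rng.gammavariate(shape, scale)


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed only to make the demo repeatable
    # shape * scale = 1, so the multiplier has mean 1 (an assumption).
    print(relaxed_clock_branch(10.0, 0.5, 0.3, 2.0, 0.5, rng))
```

With `event_rate` set to 0 the function reduces to a strict clock, where the branch length is simply rate times duration. That is one of the two extremes mentioned in the comment above.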
24. Spannoids
Advantage: Decomposes large trees into small trees.
Questions: How to find the optimal spannoid? How well do they approximate?
25. Profiloids and Staroids
Questions: Parameter changes on edges relating HMMs. Choosing optimal staroids.