Title: Phylogenetics
1 Phylogenetics
What is a tree, and how many are there? Principles of phylogenetic reconstruction. Special issues: rooting a tree, the molecular clock, almost clocks.
2 Trees: graphical & biological.
A graph is a set of vertices (nodes) v1,..,vk and a set of edges e1 = (vi1,vj1), .., en = (vin,vjn). Edges can be directed, in which case (vi,vj) is viewed as different (opposite direction) from (vj,vi), or undirected.
[Figure: a graph on the nodes v1, v2, v3, v4, showing a directed edge (v1 -> v2) and an undirected edge written (v2, v4) or (v4, v2).]
Nodes can be labelled or unlabelled. In
phylogenies the leaves are labelled and the rest
unlabelled. The degree of a node is the number
of edges it is a part of. A leaf has degree 1.
A graph is connected if any two nodes have a path connecting them. A tree is a connected graph without any cycles, i.e. there is only one path between any two nodes.
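The definition above can be checked mechanically. The following is a minimal sketch, assuming the graph is given as a list of node names and a list of undirected edges (the function name `is_tree` is illustrative, not from the slides). It uses the fact that a connected graph on k nodes is a tree exactly when it has k-1 edges.

```python
from collections import deque


def is_tree(nodes, edges):
    """Return True if the undirected graph (nodes, edges) is connected and acyclic."""
    if len(edges) != len(nodes) - 1:
        return False  # a tree on k nodes has exactly k-1 edges
    adjacency = {v: [] for v in nodes}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    # Breadth-first search from an arbitrary node to test connectivity.
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == len(nodes)


star = [("v1", "v2"), ("v2", "v3"), ("v2", "v4")]
print(is_tree(["v1", "v2", "v3", "v4"], star))
```

With the right number of edges, a connected graph cannot contain a cycle, so the edge count plus one search is enough.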
3 Trees: phylogenies.
A tree with k nodes has k-1 edges (easy to show by induction). A root is a special node with degree 2 that is interpreted as the point furthest back in time. The leaves are interpreted as being contemporary. A root introduces a time direction in a tree. A rooted tree is said to be bifurcating if all nodes other than the leaves and the root have degree 3, corresponding to 1 ancestor and 2 children. An unrooted tree with this property is said to have valency 3.
Edges can be labelled with a positive real number interpreted as time duration or amount of evolution. If the length of the path from the root to any leaf is the same, the tree obeys a molecular clock.
Tree topology: the discrete structure, i.e. the phylogeny without branch lengths.
[Figure: a rooted tree with the root, the internal nodes and the leaves labelled.]
4 Enumerating Trees: Unrooted & valency 3
Recursion: T_n = (2n-5) T_{n-1}
Initialisation: T_1 = T_2 = T_3 = 1
n     4   5    6     7     8       9            10           15            20
T_n   3   15   105   945   10395   1.4 x 10^5   2.0 x 10^6   7.9 x 10^12   2.2 x 10^20
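The recursion can be evaluated directly. This is a small sketch (the function name is illustrative) that starts from T_3 = 1 and multiplies by (2m-5) for each added leaf.

```python
def unrooted_tree_count(n):
    """Number of unrooted, valency-3 tree topologies on n labelled leaves (n >= 3)."""
    count = 1  # T_3 = 1
    for m in range(4, n + 1):
        count *= 2 * m - 5  # T_m = (2m-5) * T_{m-1}
    return count


for n in (4, 5, 6, 7, 8, 9, 10, 15, 20):
    print(n, unrooted_tree_count(n))
```

Python integers are arbitrary precision, so the exact counts for n = 15 and n = 20 are returned rather than rounded values.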
5 Local operations on trees.
Nearest Neighbor Interchange
[Figure: two trees on the leaves A, B, C, D that differ by one nearest neighbor interchange.]
Subtree cut and regrafting (subtree root kept)
Subtree cut and regrafting (subtree root possibly new)
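6 Central Principles of Phylogeny Reconstruction
Sequences: TTCAGT, TCCAGT, GCCAAT, GCCAAT
[Figure: a tree for the four sequences under each of the three principles, Parsimony, Distance and Likelihood, with a value on each branch.]
Parsimony: total weight 4.
Likelihood: L = 3.1 x 10^-7, with parameter estimates.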
7 Distance Concepts on Trees I: A Metric, d( , )
i   d(a,b) = 0 <=> a = b
ii  d(a,b) = d(b,a)
iii d(a,b) <= d(a,c) + d(c,b)
[Figure: three points a, b and c.]
8 Distance Concepts on Trees II
Tree Metric (distance function originates from tree): d(x,y) + d(z,w) = d(x,z) + d(y,w) > d(x,w) + d(y,z), where x,y,z,w is a permutation of a,b,c,d. (> implies that no branch has length 0.)
Reconstruction Principle: d(s1,i) = (d(s1,s2) + d(s1,s3) - d(s2,s3))/2, where i is the internal node joining s1, s2 and s3.
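A minimal sketch of both statements on this slide, assuming distances are stored as a nested dictionary `d[x][y]` (the helper names and the example tree are illustrative). The example distances come from a tree with split ab|cd, pendant branch lengths a=1, b=2, c=3, d=4 and an internal branch of length 5.

```python
def pair_sums(d, x, y, z, w):
    """The three ways of pairing four leaves, each as a sum of two distances."""
    return [d[x][y] + d[z][w], d[x][z] + d[y][w], d[x][w] + d[y][z]]


def satisfies_four_point(d, x, y, z, w, tol=1e-9):
    """Tree metric condition: the two largest of the three sums are equal."""
    smallest, middle, largest = sorted(pair_sums(d, x, y, z, w))
    return abs(largest - middle) <= tol


def distance_to_internal_node(d, s1, s2, s3):
    """Distance from s1 to the internal node joining s1, s2 and s3."""
    return (d[s1][s2] + d[s1][s3] - d[s2][s3]) / 2


def symmetric(pairs):
    """Build d[x][y] == d[y][x] from a dictionary keyed by unordered pairs."""
    d = {}
    for (a, b), value in pairs.items():
        d.setdefault(a, {})[b] = value
        d.setdefault(b, {})[a] = value
    return d


d = symmetric({("a", "b"): 3, ("c", "d"): 7, ("a", "c"): 9,
               ("a", "d"): 10, ("b", "c"): 10, ("b", "d"): 11})
print(sorted(pair_sums(d, "a", "b", "c", "d")))
print(distance_to_internal_node(d, "a", "b", "c"))
```

The smallest sum identifies the split: here d(a,b) + d(c,d) is smallest, so a and b sit on one side of the internal branch and c and d on the other.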
9 Distance Concepts on Trees III
Ultra Metric (distance function originates from tree): d(x,z) = d(y,z) > d(x,y), where x,y,z is a permutation of a,b,c. (> implies that no branch has length 0.)
Reconstruction Principle: d(s1,i) = d(s1,s2)/2, where i is the most recent common ancestor of s1 and s2.
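The three-point condition can be tested over all triples of leaves. This is a small sketch under the assumption that distances are given as a nested dictionary (function name and example values are illustrative).

```python
from itertools import combinations


def is_ultrametric(d, labels, tol=1e-9):
    """True if, for every triple, the two largest pairwise distances are equal."""
    for x, y, z in combinations(labels, 3):
        smallest, middle, largest = sorted([d[x][y], d[x][z], d[y][z]])
        if abs(largest - middle) > tol:
            return False
    return True


# Clock-like tree ((a,b),c): a and b join at height 1, the root is at height 3.
clock = {"a": {"b": 2, "c": 6}, "b": {"a": 2, "c": 6}, "c": {"a": 6, "b": 6}}
print(is_ultrametric(clock, ["a", "b", "c"]))
```

For the example, d(a,b)/2 = 1 recovers the height of the most recent common ancestor of a and b.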
10 UPGMA (Sokal and Michener, 1958)
Unweighted Pair-Group Method with Arithmetic Mean
Input: matrix D with pairwise distances between sequences.
1 Find the smallest distance, d_{i,j}.
2 i and j are now siblings with a distance d_{i,j}/2 to their MRCA (i,j).
3 Form a new distance matrix of dimension (n-1) x (n-1), where i and j have been replaced by (i,j). All distances to (i,j) are d_{k,(i,j)} = (d_{k,i} + d_{j,k})/2 (when i and j are clusters of unequal size, UPGMA weights the two distances by cluster size).
4 This is done n-1 times and the tree has been reconstructed.
Output: an ultrametric.
Comment i. If UPGMA is given an ultrametric, it will reconstruct the same ultrametric.
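The four steps can be sketched as follows. This is an illustrative implementation, not the original authors' code: the tree is returned as nested tuples `(left, right, height)`, and step 3 uses the size-weighted average, which reduces to (d_{k,i} + d_{j,k})/2 when the two merged clusters have the same number of leaves.

```python
def upgma(labels, matrix):
    """Cluster by UPGMA. Returns (tree, root_height); tree is nested (left, right, height)."""
    n = len(labels)
    # cluster id -> (subtree, number of leaves)
    clusters = {i: (labels[i], 1) for i in range(n)}
    d = {(i, j): matrix[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    height = 0.0
    while len(clusters) > 1:
        i, j = min(d, key=d.get)                 # step 1: smallest distance
        tree_i, size_i = clusters.pop(i)
        tree_j, size_j = clusters.pop(j)
        height = d[(i, j)] / 2                   # step 2: distance to the MRCA
        new_d = {}
        for k in clusters:                       # step 3: distances to the new cluster
            d_ki = d[(min(k, i), max(k, i))]
            d_kj = d[(min(k, j), max(k, j))]
            new_d[(k, next_id)] = (size_i * d_ki + size_j * d_kj) / (size_i + size_j)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new_d)
        clusters[next_id] = ((tree_i, tree_j, height), size_i + size_j)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree, height


tree, root_height = upgma(["a", "b", "c"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]])
print(tree, root_height)
```

The input here is an ultrametric, so the heights 1 and 3 of the generating clock tree are recovered, in line with Comment i.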
11 Assignment to internal nodes: The simple way.
[Figure: a tree with 8 leaves carrying the nucleotides A, G, C, T, C, C, C, A and 6 internal nodes marked "?".]
What is the cheapest assignment of nucleotides to internal nodes, given some (symmetric) distance function d(N1,N2)?
If there are k leaves, there are k-2 internal nodes and 4^(k-2) possible assignments of nucleotides. For k = 22, this is more than 10^12.
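12 Cost of a history - minimizing over internal states
[Figure: a node with a cost vector over (A, C, G, T) above the cost vectors of its two subtrees. Example term for a G at the node and a C at its left child: d(C,G) + w_C(left subtree).]
The cost w_N(node) of nucleotide N at a node is the minimum over N' of d(N,N') + w_N'(left subtree) plus the minimum over N'' of d(N,N'') + w_N''(right subtree).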
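13 Cost of a history: leaves (initialisation).
Initialisation at leaves: Cost(N) = 0 if N is the nucleotide at the leaf, otherwise infinity.
[Figure: leaves G and A, each with a cost vector over (A, C, G, T); the empty subtrees below a leaf have cost 0.]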
14 Fitch-Hartigan-Sankoff Algorithm
Costs: Transition 2, Transversion 5.
[Figure: a tree on the leaves C, A, T, G with a cost vector over (A, C, G, T) at each node: cost 0 for the observed nucleotide at each leaf, (10,2,10,2) at an internal node and (9,7,7,7) at a node higher up. Each entry is the cost of the cheapest tree hanging from this node given the nucleotide at this node, e.g. given there is a C at this node.]
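A minimal sketch of the recursion with the costs from this slide (transition 2, transversion 5). The tree encoding as nested 2-tuples and the function names are assumptions made for the illustration. The two cost vectors legible in the figure are reproduced by a node above the leaves C and T, and by a node joining G with that subtree.

```python
import math

NUCLEOTIDES = "ACGT"
PURINES = {"A", "G"}


def substitution_cost(x, y, transition=2, transversion=5):
    """Cost d(x, y): 0 if equal, transition cost within purines or within pyrimidines."""
    if x == y:
        return 0
    same_class = (x in PURINES) == (y in PURINES)
    return transition if same_class else transversion


def sankoff(tree):
    """tree is a nucleotide (leaf) or a (left, right) tuple; returns {nucleotide: cost}."""
    if isinstance(tree, str):
        # Initialisation: 0 for the observed nucleotide, infinity otherwise.
        return {n: (0 if n == tree else math.inf) for n in NUCLEOTIDES}
    left, right = (sankoff(child) for child in tree)
    return {
        n: min(substitution_cost(n, m) + left[m] for m in NUCLEOTIDES)
        + min(substitution_cost(n, m) + right[m] for m in NUCLEOTIDES)
        for n in NUCLEOTIDES
    }


print(list(sankoff(("C", "T")).values()))
print(list(sankoff(("G", ("C", "T"))).values()))
```

The weight of the most parsimonious history is the minimum entry of the vector at the root. The work is linear in the number of nodes, instead of enumerating all 4^(k-2) assignments.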
15 5S RNA Alignment & Phylogeny (Hein, 1990)
[Figure: phylogeny of the numbered 5S RNA sequences, with the groups Mitochondria, Plants, Prokaryotes, Fungi and Animals labelled.]
Transitions 2, transversions 5. Total weight 843.
10 tatt-ctggtgtcccaggcgtagaggaaccacaccgatccatctcga
acttggtggtgaaactctgccgcggt--aaccaatact-cg-gg-ggggg
ccct-gcggaaaaatagctcgatgccagga--ta
17 t--t-ctggtgtcccaggcgtagaggaaccacaccaatccatcccgaact
tggtggtgaaactctgctgcggt--ga-cgatact-tg-gg-gggagccc
g-atggaaaaatagctcgatgccagga--t-
9 t--t-ctggtgtctcaggcgtggaggaaccacaccaatccatcccgaact
tggtggtgaaactctattgcggt--ga-cgatactgta-gg-ggaagccc
g-atggaaaaatagctcgacgccagga--t-
14 t----ctggtggccatggcgtagaggaaacaccccatcccataccgaact
cggcagttaagctctgctgcgcc--ga-tggtact-tg-gg-gggagccc
g-ctgggaaaataggacgctgccag-a--t-
3 t----ctggtgatgatggcggaggggacacacccgttcccataccgaaca
cggccgttaagccctccagcgcc--aa-tggtact-tgctc-cgcaggga
g-ccgggagagtaggacgtcgccag-g--c-
11 t----ctggtggcgatggcgaagaggacacacccgttcccataccgaaca
cggcagttaagctctccagcgcc--ga-tggtact-tg-gg-ggcagtcc
g-ctgggagagtaggacgctgccag-g--c-
4 t----ctggtggcgatagcgagaaggtcacacccgttcccataccgaaca
cggaagttaagcttctcagcgcc--ga-tggtagt-ta-gg-ggctgtcc
c-ctgtgagagtaggacgctgccag-g--c-
15 g----cctgcggccatagcaccgtgaaagcaccccatcccat-ccgaact
cggcagttaagcacggttgcgcccaga-tagtact-tg-ggtgggagacc
gcctgggaaacctggatgctgcaag-c--t-
8 g----cctacggccatcccaccctggtaacgcccgatctcgt-ctgatct
cggaagctaagcagggtcgggcctggt-tagtact-tg-gatgggagacc
tcctgggaataccgggtgctgtagg-ct-t-
12 g----cctacggccataccaccctgaaagcaccccatcccgt-ccgatct
gggaagttaagcagggttgagcccagt-tagtact-tg-gatgggagacc
gcctgggaatcctgggtgctgtagg-c--t-
7 g----cttacgaccatatcacgttgaatgcacgccatcccgt-ccgatct
ggcaagttaagcaacgttgagtccagt-tagtact-tg-gatcggagacg
gcctgggaatcctggatgttgtaag-c--t-
16 g----cctacggccatagcaccctgaaagcaccccatcccgt-ccgatct
gggaagttaagcagggttgcgcccagt-tagtact-tg-ggtgggagacc
gcctgggaatcctgggtgctgtagg-c--t-
1 a----tccacggccataggactctgaaagcactgcatcccgt-ccgatct
gcaaagttaaccagagtaccgcccagt-tagtacc-ac-ggtgggggacc
acgcgggaatcctgggtgctgt-gg-t--t-
18 a----tccacggccataggactctgaaagcaccgcatcccgt-ccgatct
gcgaagttaaacagagtaccgcccagt-tagtacc-ac-ggtgggggacc
acatgggaatcctgggtgctgt-gg-t--t-
2 a----tccacggccataggactgtgaaagcaccgcatcccgt-ctgatct
gcgcagttaaacacagtgccgcctagt-tagtacc-at-ggtgggggacc
acatgggaatcctgggtgctgt-gg-t--t-
5 g---tggtgcggtcataccagcgctaatgcaccggatcccat-cagaact
ccgcagttaagcgcgcttgggccagaa-cagtact-gg-gatgggtgacc
tcccgggaagtcctggtgccgcacc-c--c-
13 g----ggtgcggtcataccagcgttaatgcaccggatcccat-cagaact
ccgcagttaagcgcgcttgggccagcc-tagtact-ag-gatgggtgacc
tcctgggaagtcctgatgctgcacc-c--t-
6 g----ggtgcgatcataccagcgttaatgcaccggatcccat-cagaact
ccgcagttaagcgcgcttgggttggag-tagtact-ag-gatgggtgacc
tcctgggaagtcctaatattgcacc-c-tt-
16 The Felsenstein Zone: Felsenstein-Cavender (1979)
[Figure: the true tree and the reconstructed tree on s1, s2, s3, s4.]
Patterns (16, only 8 shown):
s1  0 1 0 0 0 0 0 0
s2  0 0 1 0 0 1 0 1
s3  0 0 0 1 0 1 1 0
s4  0 0 0 0 1 0 1 1
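17 Bootstrapping: Felsenstein (1985)
[Figure: an alignment of four sequences (each shown as ATCTGTAGTCT, 11 columns) is resampled column by column with replacement. In the example the columns are drawn 1 0 2 3 0 1 0 1 2 0 1 times, 11 draws in total. A tree on the leaves 1, 2, 3, 4 is reconstructed from the resampled alignment.]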
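A sketch of the column resampling, assuming the alignment is a list of equal-length strings. The function names are illustrative, and `build_topology` is a hypothetical caller-supplied function standing in for whatever tree reconstruction method is being bootstrapped.

```python
import random
from collections import Counter


def resample_columns(alignment, rng):
    """Draw as many columns as the alignment has, with replacement."""
    length = len(alignment[0])
    picks = [rng.randrange(length) for _ in range(length)]
    resampled = ["".join(seq[c] for c in picks) for seq in alignment]
    return resampled, Counter(picks)  # Counter: how often each column was drawn


def bootstrap_support(alignment, build_topology, replicates, rng):
    """Fraction of bootstrap replicates in which each topology is reconstructed."""
    topologies = Counter()
    for _ in range(replicates):
        resampled, _counts = resample_columns(alignment, rng)
        topologies[build_topology(resampled)] += 1
    return {topology: n / replicates for topology, n in topologies.items()}


alignment = ["ATCTGTAGTCT", "ATCTGTAGTCA", "ATCTGTAGACT", "TTCTGTAGTCT"]
resampled, counts = resample_columns(alignment, random.Random(1))
print(resampled, sorted(counts.items()))
```

The same column indices are applied to every sequence, so each resampled column is an intact column of the original alignment.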
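18 Probability of a pattern - summing over internal states
[Figure: a tree with 6 leaves carrying the nucleotides A, C, G, A, A, T and 4 internal nodes marked "?"; each node has a vector over (A, C, G, T).]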
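19 Probability of leaf observations - summing over internal states
[Figure: a node with a probability vector over (A, C, G, T) above the vectors of its two subtrees. Example term for the left child: P(C -> G) * P_C(left subtree).]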
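The summation over internal states can be sketched in the same way as the parsimony recursion, with minima replaced by sums and costs by transition probabilities. The slides do not name a substitution model at this point, so the Jukes-Cantor model with uniform root frequencies is an assumption made for the illustration, as are the tree encoding and the function names.

```python
import math

NUCLEOTIDES = "ACGT"


def jc69_probability(x, y, t):
    """P(x -> y) over a branch of length t (expected substitutions per site), Jukes-Cantor."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def partial_likelihoods(tree):
    """tree is a nucleotide (leaf) or ((left, t_left), (right, t_right)).

    Returns {n: probability of the leaves below this node given state n here}.
    """
    if isinstance(tree, str):
        return {n: 1.0 if n == tree else 0.0 for n in NUCLEOTIDES}
    (left, t_left), (right, t_right) = tree
    below_left = partial_likelihoods(left)
    below_right = partial_likelihoods(right)
    return {
        n: sum(jc69_probability(n, m, t_left) * below_left[m] for m in NUCLEOTIDES)
        * sum(jc69_probability(n, m, t_right) * below_right[m] for m in NUCLEOTIDES)
        for n in NUCLEOTIDES
    }


def site_likelihood(tree):
    """Probability of the observed leaf pattern, with uniform frequencies at the root."""
    partial = partial_likelihoods(tree)
    return sum(0.25 * partial[n] for n in NUCLEOTIDES)


example = (((("A", 0.1), ("C", 0.2)), 0.3), ("G", 0.4))
print(site_likelihood(example))
```

For a full alignment, the likelihood is the product of the site likelihoods over columns, usually accumulated as a sum of logarithms.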
20 Output from Likelihood Method
[Figure: maximum likelihood trees for s1, s2, s3, s4, s5 with estimated branch lengths and standard deviations, fitted with a clock and without a clock.]
With clock: Likelihood 7.9 x 10^-14, ?? ? 0.31.1, 0.18.1
Without clock: Likelihood 6.2 x 10^-12, ?? ? 0.34.1, 0.16.1
2 [ln(6.2 x 10^-12) - ln(7.9 x 10^-14)] is approximately chi-square distributed with (n-2) degrees of freedom.
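A worked version of this comparison, written as a likelihood ratio test of the clock model against the model without a clock. The two likelihoods and n = 5 sequences are taken from the slide; using the usual factor 2 in the statistic and a 5% level are assumptions of this sketch, and 7.815 is the 95% quantile of the chi-square distribution with 3 degrees of freedom.

```python
import math


def clock_test_statistic(likelihood_clock, likelihood_no_clock):
    """Likelihood ratio statistic 2 * (ln L_no_clock - ln L_clock)."""
    return 2.0 * (math.log(likelihood_no_clock) - math.log(likelihood_clock))


n_sequences = 5                                # s1 .. s5
degrees_of_freedom = n_sequences - 2           # (2n-3) - (n-1) branch parameters
statistic = clock_test_statistic(7.9e-14, 6.2e-12)
critical_value_5_percent = 7.815               # chi-square, 3 d.f., 95% quantile
reject_clock = statistic > critical_value_5_percent
print(round(statistic, 2), degrees_of_freedom, reject_clock)
```

The statistic exceeds the critical value, so under these assumptions the clock would be rejected at the 5% level for this data set.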
21 The Molecular Clock
First noted by Zuckerkandl & Pauling (1964) as an empirical fact. How can one detect it?
[Figure: two panels, "Known Ancestor Time" (ancestor a at time T) and "Unknown Ancestor Time" (ancestor marked "?"), drawn on the sequences s1, s2 and s3.]
22 Rooting the 3 kingdoms
3 billion years ago: no reliable clock, no outgroup. Given 2 sets of homologous proteins, e.g. MDH & LDH, can the archaea (A), prokaryotes (P) and eukaryotes (E) be rooted?
[Figure: unrooted LDH and MDH trees on A, P and E, and a combined tree in which the LDH subtree and the MDH subtree, each on P, A and E, are joined.]
23 Rootings
Purpose: 1) To give a time direction in the phylogeny and identify the most ancient point. 2) To be able to define concepts such as a monophyletic group.
Methods: 1) Outgroup: enhance the data set with a sequence from a species definitely distant to all of them. It will be joined at the root of the original data set. 2) Midpoint: find the midpoint of the longest path in the tree. 3) Assume a molecular clock.
24 The generation/year-time clock (Illustration of Langley-Fitch)
[Figure: an unrooted tree on s1, s2, s3 with branch lengths l1, l2, l3, next to the clock tree with a given root, for which l1 = l2 < l3.]
(2k-3) - (k-1) = (k-2) degrees of freedom are lost in imposing a clock.
Assumptions: 1. Ancestral sequences are observable. 2. The number of events on a branch is Poisson distributed with a mean proportional to the branch length, with the same proportionality constant for all branches. 3. The observed differences between sequences at two neighboring nodes are the actual number of events.
[Figure: two sets of sequences on the same tree: s1, s2, s3 with branch lengths l1, l2, l3 (sequences 1) and s1', s2', s3' with branch lengths cl1, cl2, cl3 (sequences 2).]
k sequences s species s(2k-3)s
s(k-1) (2k-3)s s(k-1)
25 Almost Clocks
(M.J. Sanderson (1997) A Nonparametric Approach to Estimating Divergence Times in the Absence of Rate Constancy. Mol.Biol.Evol. 14.12.1218-31; J.L. Thorne et al. (1998) Estimating the Rate of Evolution of the Rate of Evolution. Mol.Biol.Evol. 15(12).1647-57; J.P. Huelsenbeck et al. (2000) A Compound Poisson Process for Relaxing the Molecular Clock. Genetics 154.1879-92.)
I Smoothing a non-clock tree onto a clock tree (Sanderson).
II Rate of evolution of the rate of evolution (Thorne et al.). The rate of evolution can change at each bifurcation.
III Relaxed molecular clock (Huelsenbeck et al.). At random points in time, the rate changes by multiplication with a (gamma distributed) random variable.
26 Non-contemporaneous leaves.
(A. Rambaut (2000) Estimating the rate of molecular evolution: incorporating non-contemporaneous sequences into maximum likelihood phylogenies. Bioinformatics 16.4.395-399)
27 Recombination and the Molecular Clock I
In the presence of recombination and gene conversion, the relationship among sequences might not be describable by a phylogeny!
Common practice: I Finding the phylogeny anyway. II Testing for the molecular clock.
28 Recombination and the Molecular Clock II
Schierup & Hein (2000) Recombination and the Molecular Clock. Mol.Biol.Evol. 17.10.1578-79
Schierup & Hein (2000) Consequences of Recombination on Traditional Phylogenetic Analysis. Genetics 156.879-91.
What are the consequences of this practice? I Simulate data with a model including recombination. II Reconstruct the phylogeny. III Test for a clock.
29 History of Phylogenetic Methods
1958 Sokal and Michener publish the UPGMA method for making distance trees with a clock.
1964 Parsimony principle defined, but not advocated, by Edwards and Cavalli-Sforza.
1962-65 Zuckerkandl and Pauling introduce the notion of a Molecular Clock.
1967 First large molecular phylogenies by Fitch and Margoliash.
1969 Heuristic method used by Dayhoff to make trees and reconstruct ancestral sequences.
1970 Neyman analyzes a three-sequence stochastic model with Jukes-Cantor substitution.
1971-73 Fitch, Hartigan & Sankoff independently come up with the same algorithm for reconstructing parsimony ancestral sequences.
1973 Sankoff treats alignment and phylogenies as one general problem: phylogenetic alignment.
30 History of Phylogenetic Methods (continued)
1979 Cavender and Felsenstein independently come up with the same evolutionary model where parsimony is inconsistent. Later called the Felsenstein Zone.
1981 Felsenstein: Maximum Likelihood Model & Program DNAML (in the program package PHYLIP).
1981 Parsimony tree problem is shown to be NP-Complete.
1985 Felsenstein introduces bootstrapping as a confidence interval on phylogenies.
1986 Bandelt and Dress introduce split decomposition as a generalization of trees.
1985- Many authors (Sawyer, Hein, Stephens, M. Smith) try to address the problem of recombinations in phylogenies.
1997-9 Thorne et al., Sanderson & Huelsenbeck introduce the Almost Clock.
2000 Rambaut (and others) make methods that can find trees with non-contemporaneous leaves.
2001- Major rise in the interest in phylogenetic statistical alignment.
31 Phylogeny literature, www and packages.
Books: Molecular Systematics (1996) (eds. Hillis and Craig). New Uses for Phylogenies (1996) (eds. P. Harvey). W. Maddison and D. Maddison: MacClade. Semple & Steel (2003) Phylogenetics, OUP.
Journals: Molecular Biology and Evolution. J. Molecular Evolution. Molecular Phylogenetics. Systematic Biology. J. of Classification.
www-pages:
PAUP: probably the best package for phylogenetic analysis available. David Swofford. http://www.lms.si.edu/PAUP/about.html
MacClade: W. & D. Maddison. http://phylogeny.arizona.edu/macclade/macclade.html
PHYLIP: J. Felsenstein. http://depts.washington.edu/genetics/faculty/felsenstein.html
PAML: Z. Yang. http://abacus.gene.ucl.ac.uk/
32 Global Fit Methods
1 Error function: sum over pairs i<j of w_{i,j} (d_{i,j} - p_{i,j})^a.
2 Minimisation has two parts: topology & branch lengths. Try all topologies and solve the branch length problem for each.
3 A_{(i,j),k} is an (n(n-1)/2) x (2n-3) matrix with 1 if k is an edge on the path from i to j, 0 otherwise.
4 The path length from i to j, p_{i,j}, in the given topology is given by p_{i,j} = sum over k of A_{(i,j),k} s_k.
5 If w_{i,j} = 1 and a = 2, this can be solved by linear algebra: minimise the sum over pairs i<j of (d_{i,j} - sum over k of A_{(i,j),k} s_k)^2.
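For a fixed topology, the case w_{i,j} = 1, a = 2 is an ordinary least-squares problem. The sketch below solves it through the normal equations A^T A s = A^T d with a small Gaussian elimination, so that it needs no external libraries. The function names, the 4-leaf example topology ab|cd and its edge ordering are assumptions made for the illustration.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    rows = [list(matrix[r]) + [rhs[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[r][n] / rows[r][r] for r in range(n)]


def least_squares_branch_lengths(A, d):
    """Branch lengths s minimising sum (d - A s)^2 for one topology."""
    n_edges = len(A[0])
    ata = [[sum(row[a] * row[b] for row in A) for b in range(n_edges)]
           for a in range(n_edges)]
    atd = [sum(row[a] * value for row, value in zip(A, d)) for a in range(n_edges)]
    return solve_linear_system(ata, atd)


# Topology ab|cd; edges: 0..3 are the branches to a, b, c, d; 4 is the internal branch.
A = [
    [1, 1, 0, 0, 0],  # path a-b
    [1, 0, 1, 0, 1],  # path a-c
    [1, 0, 0, 1, 1],  # path a-d
    [0, 1, 1, 0, 1],  # path b-c
    [0, 1, 0, 1, 1],  # path b-d
    [0, 0, 1, 1, 0],  # path c-d
]
d = [3, 9, 10, 10, 11, 7]
print(least_squares_branch_lengths(A, d))
```

With n = 4 the matrix has n(n-1)/2 = 6 rows and 2n-3 = 5 columns, as in point 3. The example distances are additive on this topology, so the fit is exact.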
33 Neighbor Joining (Saitou and Nei, 1987)
Input: distance matrix D.
1 For each leaf, the summed distance to the others is calculated: r_i = d_{i,1} + d_{i,2} + ... + d_{i,n}.
2 A rate-corrected distance matrix, M, is constructed: m_{i,j} = d_{i,j} - (r_i + r_j)/(n-2). Only the minimal m_{i,j} is needed.
3 Make an ancestral node, u, of the pair i & j giving the minimal m_{i,j}. New branch lengths are defined by s_{i,u} = d_{i,j}/2 + (r_i - r_j)/(2(n-2)) and s_{j,u} = d_{i,j} - s_{i,u}.
4 The distances from u to the others are set to d_{k,u} = (d_{i,k} + d_{j,k} - d_{i,j})/2.
Do this n-2 times.
Alternative characterisation of the method: start with the best least-squares fit of a tree with k internal nodes (k < n), and add the internal branch that gives the largest improvement in the least-squares fit (now k+1 nodes). Continue until the whole tree is built (k-1 internal nodes have been added).
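A minimal sketch of the joining loop. It takes r_i to be the summed distance from i to the other nodes, which is the form for which dividing by (n-2) gives the standard Saitou-Nei criterion. The function name, the edge-list output and the internal node names "u0", "u1", ... are illustrative assumptions, so leaf labels should not clash with those names.

```python
def neighbor_joining(labels, matrix):
    """Return the NJ tree as a list of (node, node, branch_length) edges."""
    nodes = list(labels)
    d = {(a, b): matrix[i][j]
         for i, a in enumerate(labels) for j, b in enumerate(labels) if i != j}
    edges = []
    next_internal = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Step 1: summed distance from each node to all the others.
        r = {a: sum(d[(a, b)] for b in nodes if b != a) for a in nodes}
        # Step 2: pair minimising m_ij = d_ij - (r_i + r_j)/(n-2).
        i, j = min(
            ((a, b) for x, a in enumerate(nodes) for b in nodes[x + 1:]),
            key=lambda p: d[p] - (r[p[0]] + r[p[1]]) / (n - 2),
        )
        # Step 3: new ancestral node u and the two branch lengths.
        u = f"u{next_internal}"
        next_internal += 1
        s_iu = d[(i, j)] / 2 + (r[i] - r[j]) / (2 * (n - 2))
        s_ju = d[(i, j)] - s_iu
        edges += [(i, u, s_iu), (j, u, s_ju)]
        # Step 4: distances from u to the remaining nodes.
        for k in nodes:
            if k not in (i, j):
                d[(k, u)] = d[(u, k)] = (d[(i, k)] + d[(j, k)] - d[(i, j)]) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    edges.append((nodes[0], nodes[1], d[(nodes[0], nodes[1])]))
    return edges


matrix = [[0, 3, 9, 10], [3, 0, 10, 11], [9, 10, 0, 7], [10, 11, 7, 0]]
for edge in neighbor_joining(["a", "b", "c", "d"], matrix):
    print(edge)
```

The example matrix is additive on a tree with split ab|cd, pendant branches 1, 2, 3, 4 and an internal branch of length 5, and the loop recovers exactly those lengths after n-2 = 2 joins.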
34 Branch and Bound Algorithm
Ø: a low upper estimate of the weight of the tree, possibly the weight of a well-guessed tree.
W(n): the weight of the tree at node n.
R(n): a high lower estimate of the weight increase from adding the rest of the sequences.
Condition for bounding: W(n) + R(n) > Ø, e.g. 97 + 7 > 102.
How is R(n) computed?
A T C G
A C G G
T C G G
35 Tree topology comparison.
I. Bootstrapping columns in the alignment.
Example: Human, Chimp, Gorilla & Orangutan with root.
position  1 2 3 4 5 6 7 8 9 10 11 12 ... 586
H         T C T G A C G T T T  G  A  ... C
C         T C T G A C G G T T  G  A  ... C
G         T C T G A C G G T T  G  A  ... C
O         T C A G A C G G T C  G  A  ... C
root      T C A G A C G T A A  G  A  ... C
15 possible trees, only 3 of relevance:
Tree (leaf order)              H C G O   H G C O               C G H O
I. Bootstrap probabilities     0.80      0.09                  0.11
II. Differences in likelihood  0.0       -16.63 (s.d. 14.22)   -15.12 (s.d. 13.95)