Title: Combinatorics of Phylogenies
1 Combinatorics of Phylogenies
- Evaluating the Size of the Problem
- Understanding the Structure of the Problem
- Designing Combinatorial Search Algorithms
- Enumerating main classes of trees
- Enumerating other Genealogical Structures
http://www.math.canterbury.ac.nz/m.steel/
http://www.eecs.berkeley.edu/yss/
http://www.stats.ox.ac.uk/research/genome/projects
2 Trees: graphical & biological.
A graph is a set of vertices (nodes) v1,..,vk and a set of edges e1=(vi1,vj1),..,en=(vin,vjn).
Edges can be directed, in which case (vi,vj) is viewed as different (opposite direction) from (vj,vi), or undirected.
Nodes can be labelled or unlabelled. In phylogenies the leaves are labelled and the remaining nodes are unlabelled.
The degree of a node is the number of edges it is a part of. A leaf has degree 1.
A graph is connected if any two nodes have a path connecting them.
A tree is a connected graph without any cycles, i.e. there is exactly one path between any two nodes.
3 Trees: phylogenies.
A tree with k nodes has k-1 edges (easy to show by induction).
A root is a special node with degree 2 that is interpreted as the point furthest back in time. The leaves are interpreted as being contemporary.
A root introduces a time direction in a tree.
A rooted tree is said to be bifurcating if every node other than the leaves and the root has degree 3, corresponding to 1 ancestor and 2 children. An unrooted tree with this property is said to have valency 3.
Edges can be labelled with a positive real number, interpreted as time duration or amount of evolution.
If the length of the path from the root to any leaf is the same, the tree obeys a molecular clock.
Tree topology: the discrete structure of a phylogeny, without branch lengths.
4 Pruefer Code: Number of Spanning Trees on Labeled Nodes
From tree to tuple
From tuple to tree
Aigner & Ziegler, Proofs from THE BOOK, chapter "Cayley's formula for the number of trees", Springer.
van Lint & Wilson (1992) A Course in Combinatorics, chapter 2 "Trees".
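The two directions named on this slide (tree to tuple, tuple to tree) can be sketched as follows. This is a minimal illustration of the standard Pruefer construction, assuming nodes are labelled 1..n and the tree is given as a list of edges; the function names are hypothetical and not taken from the slides.

```python
import heapq
from itertools import product

def tree_to_pruefer(n, edges):
    """Encode a labelled tree on nodes 1..n as a tuple of length n-2."""
    adj = {v: set() for v in range(1, n + 1)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    leaves = [v for v in adj if len(adj[v]) == 1]
    heapq.heapify(leaves)
    code = []
    for _ in range(n - 2):
        leaf = heapq.heappop(leaves)      # smallest remaining leaf
        (nbr,) = adj[leaf]                # its unique neighbour
        code.append(nbr)
        adj[nbr].discard(leaf)
        if len(adj[nbr]) == 1:            # neighbour has become a leaf
            heapq.heappush(leaves, nbr)
    return tuple(code)

def pruefer_to_tree(code):
    """Decode a tuple over 1..n (n = len(code) + 2) into a list of edges."""
    n = len(code) + 2
    degree = {v: 1 for v in range(1, n + 1)}
    for v in code:
        degree[v] += 1
    leaves = [v for v in degree if degree[v] == 1]
    heapq.heapify(leaves)
    edges = []
    for v in code:
        leaf = heapq.heappop(leaves)
        edges.append((leaf, v))
        degree[v] -= 1
        if degree[v] == 1:
            heapq.heappush(leaves, v)
    # exactly two degree-1 nodes remain; they form the last edge
    edges.append((heapq.heappop(leaves), heapq.heappop(leaves)))
    return edges

def count_labelled_trees(n):
    """Brute-force check of Cayley's formula: every tuple gives a distinct tree."""
    trees = {
        frozenset(frozenset(e) for e in pruefer_to_tree(c))
        for c in product(range(1, n + 1), repeat=n - 2)
    }
    return len(trees)

example_edges = [(1, 4), (2, 4), (3, 5), (4, 5), (5, 6)]
example_code = tree_to_pruefer(6, example_edges)
```

Because the map is a bijection between trees and tuples in {1..n}^(n-2), the brute-force count agrees with Cayley's n^(n-2) for small n.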
5 Enumerating Trees: Unrooted, valency 3
Recursion: T_n = (2n-5) T_{n-1}
Initialisation: T_1 = T_2 = T_3 = 1
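The recursion can be evaluated directly. The sketch below assumes n counts labelled leaves and uses a hypothetical function name; the factor (2n-5) is the number of edges of a tree with n-1 leaves onto which the n-th leaf can be attached.

```python
def num_unrooted_bifurcating(n):
    """Number of unrooted valency-3 trees on n labelled leaves."""
    if n < 1:
        raise ValueError("n must be at least 1")
    count = 1                      # T_1 = T_2 = T_3 = 1
    for k in range(4, n + 1):
        count *= 2 * k - 5         # T_k = (2k-5) T_{k-1}
    return count

counts = [num_unrooted_bifurcating(n) for n in range(1, 11)]
```

The numbers grow very quickly, which is why exhaustive search over topologies is only feasible for a handful of leaves.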
6 Number of phylogenies with arbitrary valencies
Felsenstein, 1979; Artemisa Labi (2007 summer project)
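7 Number of Coalescent Topologies
- The time ranking of internal nodes is recorded.
[Figure: a coalescent history on 5 leaves, alternating "Waiting" and "Coalescing" steps. Labels recovered from the figure: 1,2,3,4,5; (1,2)--(3,(4,5)); 1,23,4,5; 1--2; 123,4,5; 3--(4,5); 1234,5; 4--5; 12345]
S_1 = S_2 = 1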
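A small sketch of the count, under the assumption (not stated on the slide, whose formula did not survive) that the intended recursion is the standard one for ranked labelled histories: going back in time, any of the C(k,2) pairs among k lineages may coalesce, so S_k = C(k,2) S_{k-1}. The function name is hypothetical.

```python
from math import comb, factorial

def num_coalescent_topologies(n):
    """Number of labelled tree topologies with ranked internal nodes."""
    s = 1                          # S_1 = 1, and C(2,2) = 1 gives S_2 = 1
    for k in range(2, n + 1):
        s *= comb(k, 2)            # choose which pair of k lineages coalesces
    return s

s5 = num_coalescent_topologies(5)
```

The product telescopes to n!(n-1)!/2^(n-1), which can be used as a cross-check.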
8 Non-isomorphic trees
Dobson, A. (1974) Unrooted Trees for Numerical Taxonomy. J. Appl. Prob. 11.1.32-42
Felsenstein (2004) p30
9 Heuristic Searches in Tree Space
Nearest Neighbour Interchange
Subtree regrafting
Subtree rerooting and regrafting
10 Tree Reconstruction
Basic Principles of Phylogenetics: Distance, Parsimony, Compatibility, Inconsistency, Likelihood
11 Central Principles of Phylogeny Reconstruction
TTCAGT TCCAGT GCCAAT GCCAAT
12 From Distance to Phylogenies
What is the relationship of a, b, c, d and e?
13 UPGMA: Unweighted Pair Group Method using Arithmetic Averages

        A     B     C     D     E
A       -  1715  2147  3091  2326
B             -  2991  3399  2058
C                   -  2795  3943
D                         -  4289
E                               -

       AB     C     D     E
AB      -  2569  3245  2192
C             -  2795  3943
D                   -  4289
E                         -

      ABE     C     D
ABE     -  3027  3593
C             -  2795
D                   -

      ABE    CD
ABE     -  3310
CD            -

From Molecular Systematics p486

UPGMA can fail: A and B are siblings, but A and C are closest.
Siblings will have (d(A,?) + d(B,?) - d(A,B))/2 maximal.
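The successive tables on this slide can be reproduced with a short sketch of UPGMA. The code below is illustrative only: it assumes single-letter leaf names (so that merged labels such as "AB" stay readable) and averages over all leaf pairs by weighting with cluster sizes; the function name and data layout are hypothetical.

```python
from itertools import combinations

def upgma(names, dist):
    """Return the list of merges (cluster_a, cluster_b, distance)."""
    sizes = {name: 1 for name in names}            # cluster label -> number of leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    merges = []
    while len(sizes) > 1:
        # pick the closest pair of current clusters
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = "".join(sorted(a + b))
        merges.append((a, b, d[frozenset((a, b))]))
        for c in sizes:
            if c not in (a, b):
                # size-weighted mean equals the mean over all leaf pairs
                d[frozenset((merged, c))] = (
                    sizes[a] * d[frozenset((a, c))] + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return merges

DIST = {
    ("A", "B"): 1715, ("A", "C"): 2147, ("A", "D"): 3091, ("A", "E"): 2326,
    ("B", "C"): 2991, ("B", "D"): 3399, ("B", "E"): 2058,
    ("C", "D"): 2795, ("C", "E"): 3943, ("D", "E"): 4289,
}
merges = upgma("ABCDE", DIST)
```

On the slide's matrix this joins A with B, then E, then C with D, and finally the two remaining clusters.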
14 Assignment to internal nodes: The simple way.
What is the cheapest assignment of nucleotides to internal nodes, given some (symmetric) distance function d(N1,N2)?
If there are k leaves, there are k-2 internal nodes and 4^(k-2) possible assignments of nucleotides. For k = 22, this is more than 10^12.
15 5S RNA Alignment & Phylogeny (Hein, 1990)
Transitions 2, transversions 5. Total weight 843.
10 tatt-ctggtgtcccaggcgtagaggaaccacaccgatccatctcga
acttggtggtgaaactctgccgcggt--aaccaatact-cg-gg-ggggg
ccct-gcggaaaaatagctcgatgccagga--ta
17 t--t-ctggtgtcccaggcgtagaggaaccacaccaatccatcccgaact
tggtggtgaaactctgctgcggt--ga-cgatact-tg-gg-gggagccc
g-atggaaaaatagctcgatgccagga--t-
9 t--t-ctggtgtctcaggcgtggaggaaccacaccaatccatcccgaact
tggtggtgaaactctattgcggt--ga-cgatactgta-gg-ggaagccc
g-atggaaaaatagctcgacgccagga--t-
14 t----ctggtggccatggcgtagaggaaacaccccatcccataccgaact
cggcagttaagctctgctgcgcc--ga-tggtact-tg-gg-gggagccc
g-ctgggaaaataggacgctgccag-a--t-
3 t----ctggtgatgatggcggaggggacacacccgttcccataccgaaca
cggccgttaagccctccagcgcc--aa-tggtact-tgctc-cgcaggga
g-ccgggagagtaggacgtcgccag-g--c-
11 t----ctggtggcgatggcgaagaggacacacccgttcccataccgaaca
cggcagttaagctctccagcgcc--ga-tggtact-tg-gg-ggcagtcc
g-ctgggagagtaggacgctgccag-g--c-
4 t----ctggtggcgatagcgagaaggtcacacccgttcccataccgaaca
cggaagttaagcttctcagcgcc--ga-tggtagt-ta-gg-ggctgtcc
c-ctgtgagagtaggacgctgccag-g--c-
15 g----cctgcggccatagcaccgtgaaagcaccccatcccat-ccgaact
cggcagttaagcacggttgcgcccaga-tagtact-tg-ggtgggagacc
gcctgggaaacctggatgctgcaag-c--t-
8 g----cctacggccatcccaccctggtaacgcccgatctcgt-ctgatct
cggaagctaagcagggtcgggcctggt-tagtact-tg-gatgggagacc
tcctgggaataccgggtgctgtagg-ct-t-
12 g----cctacggccataccaccctgaaagcaccccatcccgt-ccgatct
gggaagttaagcagggttgagcccagt-tagtact-tg-gatgggagacc
gcctgggaatcctgggtgctgtagg-c--t-
7 g----cttacgaccatatcacgttgaatgcacgccatcccgt-ccgatct
ggcaagttaagcaacgttgagtccagt-tagtact-tg-gatcggagacg
gcctgggaatcctggatgttgtaag-c--t-
16 g----cctacggccatagcaccctgaaagcaccccatcccgt-ccgatct
gggaagttaagcagggttgcgcccagt-tagtact-tg-ggtgggagacc
gcctgggaatcctgggtgctgtagg-c--t-
1 a----tccacggccataggactctgaaagcactgcatcccgt-ccgatct
gcaaagttaaccagagtaccgcccagt-tagtacc-ac-ggtgggggacc
acgcgggaatcctgggtgctgt-gg-t--t-
18 a----tccacggccataggactctgaaagcaccgcatcccgt-ccgatct
gcgaagttaaacagagtaccgcccagt-tagtacc-ac-ggtgggggacc
acatgggaatcctgggtgctgt-gg-t--t-
2 a----tccacggccataggactgtgaaagcaccgcatcccgt-ctgatct
gcgcagttaaacacagtgccgcctagt-tagtacc-at-ggtgggggacc
acatgggaatcctgggtgctgt-gg-t--t-
5 g---tggtgcggtcataccagcgctaatgcaccggatcccat-cagaact
ccgcagttaagcgcgcttgggccagaa-cagtact-gg-gatgggtgacc
tcccgggaagtcctggtgccgcacc-c--c-
13 g----ggtgcggtcataccagcgttaatgcaccggatcccat-cagaact
ccgcagttaagcgcgcttgggccagcc-tagtact-ag-gatgggtgacc
tcctgggaagtcctgatgctgcacc-c--t-
6 g----ggtgcgatcataccagcgttaatgcaccggatcccat-cagaact
ccgcagttaagcgcgcttgggttggag-tagtact-ag-gatgggtgacc
tcctgggaagtcctaatattgcacc-c-tt-
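16 Cost of a history - minimizing over internal states
[Figure: table of costs for the states A, C, G, T at each node]
17 Cost of a history - leaves (initialisation).
[Figure: table of costs for the states A, C, G, T at each leaf; example leaves G and A]
Initialisation (leaves): Cost(N) = 0 if N is the nucleotide at the leaf, otherwise infinity.
Empty: Cost = 0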
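The minimisation over internal states can be sketched as a recursion from the leaves towards the root (often attributed to Sankoff). The sketch below makes several assumptions that are not in the slides: the tree is a nested tuple with nucleotide strings at the leaves, and the distance function charges 2 for a transition and 5 for a transversion, borrowing the weights quoted on the 5S RNA slide purely for illustration.

```python
import math

NUCS = "ACGT"
PURINES = {"A", "G"}

def d(x, y):
    """Illustrative symmetric cost: 0 if equal, 2 for a transition, 5 for a transversion."""
    if x == y:
        return 0
    return 2 if (x in PURINES) == (y in PURINES) else 5

def cost_vector(node):
    """Cost(N) for each nucleotide N at this node, minimised over the subtree below."""
    if isinstance(node, str):
        # leaf initialisation: 0 for the observed nucleotide, infinity otherwise
        return {n: (0 if n == node else math.inf) for n in NUCS}
    child_vectors = [cost_vector(child) for child in node]
    return {
        n: sum(min(d(n, m) + cv[m] for m in NUCS) for cv in child_vectors)
        for n in NUCS
    }

def min_cost(tree):
    """Cheapest total cost over all assignments to internal nodes."""
    return min(cost_vector(tree).values())

example_cost = min_cost((("A", "G"), ("C", "T")))
```

Each node needs only 4 x 4 comparisons per child, so the work is linear in the number of nodes instead of growing as 4^(k-2).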
18 Compatibility and Branch Popping
Definition: Two columns are compatible if they can be placed on the same tree, each explained by 1 mutation.
A GCACGTGCAGTTAGGA
B GCACGTGCAGTTAGGA
C TCTCGTGCAGTTAGGA
D TCTCATGCAATTAGGA
E TCTCATGCAATTATGA
F TCTCATGCAATTATGA
This is equivalent to: in the two columns, only 3 of the 4 possible character pairs are observed.
Multistate definition: the number of mutations needed to explain a pair of columns is the sum of the mutations needed to explain the individual columns.
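The pairwise criterion for two-state columns (at most 3 of the 4 possible character pairs observed) is easy to check mechanically. The sketch below applies it to the six sequences on this slide; the helper names are hypothetical, and the test only covers columns with two states, not the multistate definition.

```python
from itertools import combinations

SEQS = {
    "A": "GCACGTGCAGTTAGGA",
    "B": "GCACGTGCAGTTAGGA",
    "C": "TCTCGTGCAGTTAGGA",
    "D": "TCTCATGCAATTAGGA",
    "E": "TCTCATGCAATTATGA",
    "F": "TCTCATGCAATTATGA",
}

def column(seqs, i):
    """Column i of the alignment as a string, one character per sequence."""
    return "".join(s[i] for s in seqs.values())

def variable_columns(seqs):
    """0-based indices of columns in which more than one state is observed."""
    length = len(next(iter(seqs.values())))
    return [i for i in range(length) if len(set(column(seqs, i))) > 1]

def compatible(col_a, col_b):
    """Two-state columns are compatible iff at most 3 distinct pairs occur."""
    return len(set(zip(col_a, col_b))) <= 3

var_cols = variable_columns(SEQS)
all_compatible = all(
    compatible(column(SEQS, i), column(SEQS, j))
    for i, j in combinations(var_cols, 2)
)
```

For this data set every pair of variable columns passes, so all of them can be placed on one tree with a single mutation each.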
19 The Felsenstein Zone: Felsenstein-Cavender (1979)
Patterns (16, only 8 shown; rows are the 4 taxa, columns are patterns):
0 1 0 0 0 0 0 0
0 0 1 0 0 1 0 1
0 0 0 1 0 1 1 0
0 0 0 0 1 0 1 1
20 Bootstrapping: Felsenstein (1985)
ATCTGTAGTCT
ATCTGTAGTCT
ATCTGTAGTCT
ATCTGTAGTCT
Number of times each column is resampled: 1 0 2 3 0 1 0 1 2 0 1
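Column resampling can be sketched in a few lines. This is only an illustration of the resampling step, assuming an alignment stored as a list of equal-length strings; the function name is hypothetical, and the tree reconstruction that would follow each resample is not shown.

```python
import random
from collections import Counter

def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement.

    Returns the resampled alignment and, for each original column,
    how many times it was drawn.
    """
    length = len(alignment[0])
    picks = [rng.randrange(length) for _ in range(length)]
    counts = Counter(picks)
    resampled = ["".join(seq[i] for i in picks) for seq in alignment]
    return resampled, [counts[i] for i in range(length)]

alignment = ["ATCTGTAGTCT"] * 4
resampled, counts = bootstrap_columns(alignment, random.Random(0))
```

The per-column counts always sum to the alignment length, as in the row of counts on the slide.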
21 Output from Likelihood Method.
Likelihood 7.9·10^-14   ?? ? 0.31 0.18
Likelihood 6.2·10^-12   ?? ? 0.34 0.16
2·[ln(6.2·10^-12) - ln(7.9·10^-14)] is approximately χ² distributed with (n-2) degrees of freedom
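The test statistic for the two likelihoods on this slide can be worked out directly. The sketch assumes the conventional likelihood ratio form, twice the difference in log-likelihoods, and assumes that the smaller likelihood (7.9·10^-14) belongs to the more constrained model; the function name is hypothetical.

```python
import math

def lrt_statistic(lik_constrained, lik_general):
    """Twice the log-likelihood difference between nested models."""
    return 2.0 * (math.log(lik_general) - math.log(lik_constrained))

stat = lrt_statistic(7.9e-14, 6.2e-12)   # roughly 8.7
```

The value would then be compared with a χ² distribution with n-2 degrees of freedom; n is not given on the slide, so no p-value is computed here.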
22 Bayesian Approach
Likelihood function L(): the probability of the data as a function of the parameters, L(Θ,D).
In likelihood analysis, Θ is not a stochastic variable; Θmax(D) is.
In Bayesian analysis, Θ is a stochastic variable with a prior distribution before data is included in the analysis.
After the observation of data, there will be a posterior on Θ.
Bayesian analysis has seen a major rise in use as a consequence of numerical/stochastic integration techniques such as Markov Chain Monte Carlo.
The likelihood function L(Θ,D) is central to both approaches.
23 Assignment to internal nodes: The simple way.
If branch lengths and the evolutionary process are known, what is the probability of the nucleotides at the leaves?
Cctacggccatacca a ccctgaaagcaccccatcccgt
Cttacgaccatatca c cgttgaatgcacgccatcccgt
Cctacggccatagca c ccctgaaagcaccccatcccgt
Cccacggccatagga c ctctgaaagcactgcatcccgt
Tccacggccatagga a ctctgaaagcaccgcatcccgt
Ttccacggccatagg c actgtgaaagcaccgcatcccg
Tggtgcggtcatacc g agcgctaatgcaccggatccca
Ggtgcggtcatacca t gcgttaatgcaccggatcccat
24 Probability of leaf observations - summing over internal states
[Figure: table of probabilities for the states A, C, G, T at each node]
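Summing over internal states has the same recursive shape as the cost minimisation, with min replaced by sum and addition of costs replaced by multiplication of probabilities (Felsenstein's pruning). The sketch below assumes a Jukes-Cantor substitution model with uniform root frequencies, which is a choice made here for illustration and not taken from the slides; the tree layout (tuples of (child, branch length) pairs) and function names are likewise hypothetical.

```python
import math

NUCS = "ACGT"

def jc_prob(x, y, t):
    """Jukes-Cantor probability of x -> y over a branch of length t
    (t in expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def partial(node):
    """P(leaf data below node | state at node), for each state."""
    if isinstance(node, str):
        return {n: (1.0 if n == node else 0.0) for n in NUCS}
    result = {n: 1.0 for n in NUCS}
    for child, t in node:
        below = partial(child)
        for n in NUCS:
            # sum over the unknown state m at the child end of the branch
            result[n] *= sum(jc_prob(n, m, t) * below[m] for m in NUCS)
    return result

def site_likelihood(tree):
    """Probability of the leaf nucleotides at one site."""
    return sum(0.25 * p for p in partial(tree).values())

two_leaf = (("A", 0.1), ("A", 0.2))
lik = site_likelihood(two_leaf)
```

For a two-leaf tree the result reduces to 0.25 times the probability of no net change over the total branch length, which gives a simple sanity check.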
25 The Molecular Clock
First noted by Zuckerkandl & Pauling (1964) as an empirical fact. How can one detect it?
26 Rootings
Purpose:
1) To give a time direction in the phylogeny and identify the most ancient point.
2) To be able to define concepts such as monophyletic group.
1) Outgroup: Enhance the data set with a sequence from a species definitely distant to all of them. It will be joined at the root of the original data.
2) Midpoint: Find the midpoint of the longest path in the tree.
3) Assume a molecular clock.
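Midpoint rooting can be sketched on an unrooted tree given as a weighted edge list. The code below is a minimal illustration with hypothetical names; it assumes positive branch lengths and finds the longest path with two sweeps (the farthest node from an arbitrary start, then the farthest node from that one), which is valid for trees.

```python
from collections import defaultdict

def midpoint(edges):
    """edges: list of (u, v, length). Returns (u, v, offset): the midpoint of the
    longest path lies on edge u-v, at distance offset from u."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    def farthest(start):
        # depth-first walk keeping the path (node, distance from start)
        best = (0.0, [(start, 0.0)])
        stack = [(start, None, 0.0, [(start, 0.0)])]
        while stack:
            node, parent, dist, path = stack.pop()
            if dist > best[0]:
                best = (dist, path)
            for nxt, w in adj[node]:
                if nxt != parent:
                    stack.append((nxt, node, dist + w, path + [(nxt, dist + w)]))
        return best

    _, path = farthest(edges[0][0])          # first sweep: one end of the longest path
    length, path = farthest(path[-1][0])     # second sweep: the longest path itself
    half = length / 2.0
    for (u, du), (v, dv) in zip(path, path[1:]):
        if du <= half <= dv:
            return u, v, half - du

example_edges = [("A", "x", 1.0), ("B", "x", 2.0), ("x", "y", 1.0),
                 ("C", "y", 5.0), ("D", "y", 1.0)]
root_position = midpoint(example_edges)
```

In the example the longest path runs from C to B (length 8), so the root is placed 4 units from C on the edge leading to C.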
27 Rooting the 3 kingdoms
3 billion years ago: no reliable clock, no outgroup.
Given 2 sets of homologous proteins, i.e. MDH & LDH, can the archaea, prokaria and eukaria be rooted?
28 The generation/year-time clock (Langley-Fitch, 1973)
29 The generation/year-time clock (Langley-Fitch, 1973)
Can the generation time clock be tested?
30 The generation/year-time clock (Langley-Fitch, 1973)
k3, t2 dg4 k, t dg (2k-3)-(t-1)
31 α-, β-globin, cytochrome c, fibrinopeptide A & generation time clock (Langley-Fitch, 1973)
- Relative rates
- α-globin 0.342
- β-globin 0.452
- cytochrome c 0.069
- fibrinopeptide A 0.137
32 Almost Clocks
MJ Sanderson (1997) A Nonparametric Approach to Estimating Divergence Times in the Absence of Rate Constancy. Mol.Biol.Evol. 14.12.1218-31
J.L. Thorne et al. (1998) Estimating the Rate of Evolution of the Rate of Evolution. Mol.Biol.Evol. 15(12).1647-57
JP Huelsenbeck et al. (2000) A compound Poisson Process for Relaxing the Molecular Clock. Genetics 154.1879-92
I Smoothing a non-clock tree onto a clock tree (Sanderson)
II Rate of evolution of the rate of evolution (Thorne et al.). The rate of evolution can change at each bifurcation.
III Relaxed molecular clock (Huelsenbeck et al.). At random points in time, the rate changes by multiplication with a (gamma distributed) random variable.
Comment: Makes perfect sense. Testing no clock versus a perfect clock is choosing between two unrealistic extremes.
33 Adaptive Evolution: Yang, Swanson, Nielsen,..
- Models with positive selection.
- Positive selection is interesting as it is a functional change and could at times be correlated with change between species.
34 Summary
Combinatorics of Trees
Principles of Phylogeny Inference: Distance, Parsimony, Probabilistic Methods
Applications: Clocks, Selection