Combinatorics of Phylogenies

Provided by: stat285
1
Combinatorics of Phylogenies
  • Motivation
    • Evaluating the size of the problem
    • Understanding the structure of the problem
    • Designing combinatorial search algorithms
  • Topics
    • Enumerating main classes of trees
    • Enumerating other genealogical structures
    • Size of neighborhoods

http://www.math.canterbury.ac.nz/m.steel/
http://www.eecs.berkeley.edu/yss/
http://www.stats.ox.ac.uk/research/genome/projects
2
Trees: graphical and biological.
A graph is a set of vertices (nodes) v1,..,vk and a set of edges e1=(vi1,vj1),..,en=(vin,vjn).
Edges can be directed, in which case (vi,vj) is viewed as different (opposite direction) from (vj,vi), or undirected.
Nodes can be labelled or unlabelled. In phylogenies the leaves are labelled and the rest are unlabelled.
The degree of a node is the number of edges it is a part of. A leaf has degree 1.
A graph is connected if any two nodes have a path connecting them.
A tree is a connected graph without any cycles, i.e. there is only one path between any two nodes.
3
Trees and phylogenies.
A tree with k nodes has k-1 edges (easy to show by induction).
A root is a special node with degree 2 that is interpreted as the point furthest back in time. The leaves are interpreted as being contemporary.
A root introduces a time direction in a tree.
A rooted tree is said to be bifurcating if all nodes other than the leaves and the root have degree 3, corresponding to 1 ancestor and 2 children. An unrooted tree with this property is said to have valency 3.
Edges can be labelled with a positive real number interpreted as time duration or amount of evolution.
If the length of the path from the root to any leaf is the same, the tree obeys a molecular clock.
Tree topology: the discrete structure of a phylogeny, without branch lengths.
4
Prüfer Code: Number of spanning trees on labeled nodes
From tree to tuple
From tuple to tree
Aigner & Ziegler, Proofs from THE BOOK, chapter on Cayley's formula for the number of trees. Springer.
van Lint & Wilson (1992) A Course in Combinatorics, chapter 2: Trees.
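The two directions named on this slide (tree to tuple, tuple to tree) can be sketched as below. This is a minimal illustration, not the presenter's code: it assumes nodes are labelled 1..n and uses the usual "remove the smallest leaf" convention. The function names are hypothetical. The last function checks Cayley's count n^(n-2) by brute force for small n.

```python
import heapq
from itertools import product

def tree_to_prufer(n, edges):
    """Encode a labelled tree on nodes 1..n (n >= 2) as a tuple of length n-2."""
    adj = {v: set() for v in range(1, n + 1)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    leaves = [v for v in adj if len(adj[v]) == 1]
    heapq.heapify(leaves)
    code = []
    for _ in range(n - 2):
        leaf = heapq.heappop(leaves)   # smallest remaining leaf
        (nbr,) = adj[leaf]             # its unique neighbour
        code.append(nbr)
        adj[nbr].discard(leaf)
        adj[leaf].clear()
        if len(adj[nbr]) == 1:         # neighbour has become a leaf
            heapq.heappush(leaves, nbr)
    return code

def prufer_to_tree(code):
    """Decode a sequence over 1..n of length n-2 into the n-1 edges of a tree."""
    n = len(code) + 2
    degree = {v: 1 for v in range(1, n + 1)}
    for v in code:
        degree[v] += 1
    leaves = [v for v in degree if degree[v] == 1]
    heapq.heapify(leaves)
    edges = []
    for v in code:
        leaf = heapq.heappop(leaves)
        edges.append((leaf, v))
        degree[v] -= 1
        if degree[v] == 1:
            heapq.heappush(leaves, v)
    # Exactly two nodes of degree 1 remain; they form the last edge.
    edges.append((heapq.heappop(leaves), heapq.heappop(leaves)))
    return edges

def count_labelled_trees(n):
    """Brute-force count of distinct trees obtained from all sequences."""
    trees = {frozenset(frozenset(e) for e in prufer_to_tree(list(code)))
             for code in product(range(1, n + 1), repeat=n - 2)}
    return len(trees)

edges = [(1, 4), (2, 4), (3, 5), (4, 5), (5, 6)]
print(tree_to_prufer(6, edges))   # [4, 4, 5, 5]
print(count_labelled_trees(5))    # 125 = 5**3
```

Since every sequence decodes to a different tree and every tree has a sequence, the number of labelled trees on n nodes equals the number of sequences, n^(n-2).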
5
Enumerating Trees: Unrooted, valency 3
Recursion: T_n = (2n-5) T_{n-1}
Initialisation: T_1 = T_2 = T_3 = 1
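A short sketch of the recursion on this slide, to show how fast the number of unrooted bifurcating trees grows. The function names are hypothetical. The rooted count is an added standard fact, not from the slide: rooting a tree on n leaves corresponds to attaching one extra leaf, so it equals T_{n+1}.

```python
def unrooted_bifurcating(n):
    """Number of unrooted trees of valency 3 on n labelled leaves (n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    count = 1                      # T_1 = T_2 = T_3 = 1
    for k in range(4, n + 1):
        count *= 2 * k - 5         # leaf k can be attached to any of 2k-5 edges
    return count

def rooted_bifurcating(n):
    """Number of rooted bifurcating trees on n >= 2 labelled leaves."""
    return unrooted_bifurcating(n + 1)

for n in (4, 5, 10):
    print(n, unrooted_bifurcating(n))   # 3, 15, 2027025
```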
6
Number of phylogenies with arbitrary valencies
Felsenstein, 1979; Artemisa Labi (2007 summer project)
7
Number of Coalescent Topologies
  • The time ranking of internal nodes is recorded

Waiting              Coalescing
{1,2,3,4,5}          (1,2)--(3,(4,5))
{1,2}{3,4,5}         1--2
{1}{2}{3,4,5}        3--(4,5)
{1}{2}{3}{4,5}       4--5
{1}{2}{3}{4}{5}
  • Bifurcating
  • Multifurcating

S1 = S2 = 1
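For the bifurcating case, the count can be sketched as follows. The transcript does not preserve the formula, so this uses the standard argument as an assumption: going back in time from k lineages, any of the C(k,2) pairs may coalesce next, and the order of coalescences is recorded. The function name is hypothetical.

```python
from math import comb, factorial

def coalescent_topologies(n):
    """Bifurcating coalescent topologies (ranked histories) on n labelled leaves."""
    count = 1
    for k in range(2, n + 1):
        count *= comb(k, 2)        # choose which pair of the k lineages coalesces
    return count

# The product has the closed form n! (n-1)! / 2^(n-1).
for n in range(1, 8):
    closed_form = factorial(n) * factorial(n - 1) // 2 ** (n - 1)
    assert coalescent_topologies(n) == closed_form

print([coalescent_topologies(n) for n in range(1, 6)])   # [1, 1, 3, 18, 180]
```

The first two values agree with the initialisation S1 = S2 = 1 on the slide.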
8
Non-isomorphic trees
Dobson, A. (1974) Unrooted Trees for Numerical Taxonomy. J. Appl. Prob. 11(1):32-42.
Felsenstein (2004) p. 30
9
Heuristic Searches in Tree Space
Nearest Neighbour Interchange
Subtree regrafting
Subtree rerooting and regrafting
10
Tree Reconstruction
Basic Principles of Phylogenetics: Distance, Parsimony, Compatibility, Inconsistency, Likelihood
11
Central Principles of Phylogeny Reconstruction
TTCAGT TCCAGT GCCAAT GCCAAT
12
From Distance to Phylogenies
What is the relationship of a, b, c, d and e?
13
UPGMA: Unweighted Pair Group Method using Arithmetic Averages

        A      B      C      D      E
A              1715   2147   3091   2326
B                     2991   3399   2058
C                            2795   3943
D                                   4289
E

        AB     C      D      E
AB             2569   3245   2192
C                     2795   3943
D                            4289
E

        ABE    C      D
ABE            3027   3593
C                     2795
D

        ABE    CD
ABE            3310
CD

From Molecular Systematics p486

UPGMA can fail
A and B are siblings, but A and C are closest.
Siblings will have d(A,?) + d(B,?) - d(A,B)/2 maximal.
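The clustering steps on this slide can be reproduced with a minimal UPGMA sketch. It uses the five-taxon distances from the slide; the function name, the dictionary layout and the convention of naming a cluster by its sorted member letters are assumptions for illustration (taxon labels are assumed to be single characters).

```python
from itertools import combinations

def upgma(labels, dist):
    """Return the merges as (cluster name, distance at which it was joined)."""
    size = {name: 1 for name in labels}
    d = {frozenset(pair): value for pair, value in dist.items()}
    active = list(labels)
    merges = []
    while len(active) > 1:
        # Join the closest pair of current clusters.
        a, b = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        merged = "".join(sorted(a + b))
        merges.append((merged, d[frozenset((a, b))]))
        # Distance to the new cluster: average over all member taxa.
        for c in active:
            if c not in (a, b):
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        active = [c for c in active if c not in (a, b)] + [merged]
    return merges

DIST = {
    ("A", "B"): 1715, ("A", "C"): 2147, ("A", "D"): 3091, ("A", "E"): 2326,
    ("B", "C"): 2991, ("B", "D"): 3399, ("B", "E"): 2058,
    ("C", "D"): 2795, ("C", "E"): 3943, ("D", "E"): 4289,
}

for name, distance in upgma("ABCDE", DIST):
    print(name, distance)
```

The merge order is AB, then ABE, then CD, then all five taxa, joined at distances 1715, 2192, 2795 and 3310.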
14
Assignment to internal nodes: the simple way.
What is the cheapest assignment of nucleotides to internal nodes, given some (symmetric) distance function d(N1,N2)?
If there are k leaves, there are k-2 internal nodes and 4^(k-2) possible assignments of nucleotides. For k=22, this is more than 10^12.

15
5S RNA Alignment and Phylogeny (Hein, 1990)
Transitions 2, transversions 5. Total weight 843.
10: tatt-ctggtgtcccaggcgtagaggaaccacaccgatccatctcga
    acttggtggtgaaactctgccgcggt--aaccaatact-cg-gg-ggggg
    ccct-gcggaaaaatagctcgatgccagga--ta
17: t--t-ctggtgtcccaggcgtagaggaaccacaccaatccatcccgaact
    tggtggtgaaactctgctgcggt--ga-cgatact-tg-gg-gggagccc
    g-atggaaaaatagctcgatgccagga--t-
9:  t--t-ctggtgtctcaggcgtggaggaaccacaccaatccatcccgaact
    tggtggtgaaactctattgcggt--ga-cgatactgta-gg-ggaagccc
    g-atggaaaaatagctcgacgccagga--t-
14: t----ctggtggccatggcgtagaggaaacaccccatcccataccgaact
    cggcagttaagctctgctgcgcc--ga-tggtact-tg-gg-gggagccc
    g-ctgggaaaataggacgctgccag-a--t-
3:  t----ctggtgatgatggcggaggggacacacccgttcccataccgaaca
    cggccgttaagccctccagcgcc--aa-tggtact-tgctc-cgcaggga
    g-ccgggagagtaggacgtcgccag-g--c-
11: t----ctggtggcgatggcgaagaggacacacccgttcccataccgaaca
    cggcagttaagctctccagcgcc--ga-tggtact-tg-gg-ggcagtcc
    g-ctgggagagtaggacgctgccag-g--c-
4:  t----ctggtggcgatagcgagaaggtcacacccgttcccataccgaaca
    cggaagttaagcttctcagcgcc--ga-tggtagt-ta-gg-ggctgtcc
    c-ctgtgagagtaggacgctgccag-g--c-
15: g----cctgcggccatagcaccgtgaaagcaccccatcccat-ccgaact
    cggcagttaagcacggttgcgcccaga-tagtact-tg-ggtgggagacc
    gcctgggaaacctggatgctgcaag-c--t-
8:  g----cctacggccatcccaccctggtaacgcccgatctcgt-ctgatct
    cggaagctaagcagggtcgggcctggt-tagtact-tg-gatgggagacc
    tcctgggaataccgggtgctgtagg-ct-t-
12: g----cctacggccataccaccctgaaagcaccccatcccgt-ccgatct
    gggaagttaagcagggttgagcccagt-tagtact-tg-gatgggagacc
    gcctgggaatcctgggtgctgtagg-c--t-
7:  g----cttacgaccatatcacgttgaatgcacgccatcccgt-ccgatct
    ggcaagttaagcaacgttgagtccagt-tagtact-tg-gatcggagacg
    gcctgggaatcctggatgttgtaag-c--t-
16: g----cctacggccatagcaccctgaaagcaccccatcccgt-ccgatct
    gggaagttaagcagggttgcgcccagt-tagtact-tg-ggtgggagacc
    gcctgggaatcctgggtgctgtagg-c--t-
1:  a----tccacggccataggactctgaaagcactgcatcccgt-ccgatct
    gcaaagttaaccagagtaccgcccagt-tagtacc-ac-ggtgggggacc
    acgcgggaatcctgggtgctgt-gg-t--t-
18: a----tccacggccataggactctgaaagcaccgcatcccgt-ccgatct
    gcgaagttaaacagagtaccgcccagt-tagtacc-ac-ggtgggggacc
    acatgggaatcctgggtgctgt-gg-t--t-
2:  a----tccacggccataggactgtgaaagcaccgcatcccgt-ctgatct
    gcgcagttaaacacagtgccgcctagt-tagtacc-at-ggtgggggacc
    acatgggaatcctgggtgctgt-gg-t--t-
5:  g---tggtgcggtcataccagcgctaatgcaccggatcccat-cagaact
    ccgcagttaagcgcgcttgggccagaa-cagtact-gg-gatgggtgacc
    tcccgggaagtcctggtgccgcacc-c--c-
13: g----ggtgcggtcataccagcgttaatgcaccggatcccat-cagaact
    ccgcagttaagcgcgcttgggccagcc-tagtact-ag-gatgggtgacc
    tcctgggaagtcctgatgctgcacc-c--t-
6:  g----ggtgcgatcataccagcgttaatgcaccggatcccat-cagaact
    ccgcagttaagcgcgcttgggttggag-tagtact-ag-gatgggtgacc
    tcctgggaagtcctaatattgcacc-c-tt-
16
Cost of a history - minimizing over internal
states
A C G T
17
Cost of a history: leaves (initialisation).
A C G T
Initialisation at leaves: Cost(N) = 0 if N is the nucleotide at the leaf, otherwise infinity.
G
A
Empty Cost 0
Empty Cost 0
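The two "cost of a history" slides describe a dynamic programme over the tree for a single alignment column: initialise at the leaves, then minimise over the states of the children at each internal node. A minimal sketch follows, under stated assumptions: the tree is written as nested tuples with one-letter strings as leaves, and the weights (transitions 2, transversions 5) are borrowed from the 5S RNA slide. All names are hypothetical.

```python
INF = float("inf")
NUCS = "ACGT"
PURINES = {"A", "G"}

def subst_cost(x, y, transition=2, transversion=5):
    """Symmetric distance d(N1, N2) between two nucleotides."""
    if x == y:
        return 0
    same_class = (x in PURINES) == (y in PURINES)
    return transition if same_class else transversion

def node_costs(node):
    """Map each nucleotide N to the cheapest cost of the subtree with N at its top."""
    if isinstance(node, str):
        # Initialisation: cost 0 for the observed nucleotide, infinity otherwise.
        return {n: (0 if n == node else INF) for n in NUCS}
    children = [node_costs(child) for child in node]
    # Recursion: for each child, pick the child state that is cheapest to reach.
    return {
        n: sum(min(subst_cost(n, m) + costs[m] for m in NUCS) for costs in children)
        for n in NUCS
    }

def parsimony_cost(tree):
    """Minimise over the state at the top node as well."""
    return min(node_costs(tree).values())

print(parsimony_cost((("A", "G"), ("C", "T"))))   # 9: two transitions, one transversion
```

The work is proportional to the number of nodes times 4 x 4 state pairs, instead of enumerating every assignment to the internal nodes.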
18
Compatibility and Branch Popping
Definition: Two columns are compatible if they can be placed on the same tree with each explained by 1 mutation.
A GCACGTGCAGTTAGGA
B GCACGTGCAGTTAGGA
C TCTCGTGCAGTTAGGA
D TCTCATGCAATTAGGA
E TCTCATGCAATTATGA
F TCTCATGCAATTATGA
This is equivalent to: in the two columns only 3 of the 4 possible character pairs are observed.
Multistate definition: The number of mutations needed to explain a pair of columns is the sum of the mutations needed to explain the individual columns.
A GCACGTGCAGTTAGGA
B GCACGTGCAGTTAGGA
C TCTCGTGCAGTTAGGA
D TCTCATGCAATTAGGA
E TCTCATGCAATTATGA
F TCTCATGCAATTATGA
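The pair-counting criterion for two-state columns can be checked directly on the six sequences of this slide. A minimal sketch, with hypothetical function names; it applies only to the two-state definition, not the multistate one.

```python
from itertools import combinations

SEQS = [
    "GCACGTGCAGTTAGGA",  # A
    "GCACGTGCAGTTAGGA",  # B
    "TCTCGTGCAGTTAGGA",  # C
    "TCTCATGCAATTAGGA",  # D
    "TCTCATGCAATTATGA",  # E
    "TCTCATGCAATTATGA",  # F
]

def variable_sites(seqs):
    """Indices (0-based) of columns in which more than one character occurs."""
    return [i for i in range(len(seqs[0])) if len({s[i] for s in seqs}) > 1]

def compatible(col1, col2):
    """Two-state columns are compatible if at most 3 of the 4 pairs are observed."""
    return len(set(zip(col1, col2))) <= 3

def all_pairwise_compatible(seqs):
    sites = variable_sites(seqs)
    cols = {i: [s[i] for s in seqs] for i in sites}
    return all(compatible(cols[i], cols[j]) for i, j in combinations(sites, 2))

print(variable_sites(SEQS))            # [0, 2, 4, 9, 13]
print(all_pairwise_compatible(SEQS))   # True
print(compatible("0011", "0101"))      # False: all 4 pairs 00, 01, 10, 11 occur
```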
19
The Felsenstein Zone: Felsenstein-Cavender (1979)
Patterns (16, only 8 shown):
0 1 0 0 0 0 0 0
0 0 1 0 0 1 0 1
0 0 0 1 0 1 1 0
0 0 0 0 1 0 1 1
20
Bootstrapping Felsenstein (1985)
ATCTGTAGTCT
ATCTGTAGTCT
ATCTGTAGTCT
ATCTGTAGTCT
Number of times each column is resampled: 1 0 2 3 0 1 0 1 2 0 1
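One bootstrap replicate resamples the alignment columns with replacement, so some columns appear several times and others not at all. A minimal sketch with hypothetical names; the seed is arbitrary and the tree-building step that would follow each replicate is left out.

```python
import random

def bootstrap_alignment(seqs, rng):
    """Resample columns with replacement; return the new alignment and column counts."""
    n_cols = len(seqs[0])
    chosen = [rng.randrange(n_cols) for _ in range(n_cols)]
    counts = [chosen.count(i) for i in range(n_cols)]      # times each column was drawn
    resampled = ["".join(s[i] for i in chosen) for s in seqs]
    return resampled, counts

rng = random.Random(0)                  # fixed seed only to make the run repeatable
alignment = ["ATCTGTAGTCT"] * 4         # the four rows shown on the slide
replicate, counts = bootstrap_alignment(alignment, rng)
print(counts, sum(counts))              # the counts always sum to the number of columns
```

In practice the tree is re-estimated on many such replicates, and the fraction of replicates containing a given branch is reported as its support.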
21
Output from Likelihood Method.
Likelihood 7.9×10^-14 ?? ? 0.31 0.18
Likelihood 6.2×10^-12 ?? ? 0.34 0.16
2[ln(6.2×10^-12) - ln(7.9×10^-14)] is approximately χ2 distributed with (n-2) degrees of freedom
22
Bayesian Approach
Likelihood function L(): the probability of the data as a function of the parameters, L(Q,D).
In likelihood analysis, Q is not a stochastic variable, but the estimate Qmax(D) is.
In Bayesian analysis, Q is a stochastic variable with a prior distribution before data is included in the analysis.
After the observation of data, there will be a posterior on Q.
Bayesian analysis has seen a major rise in use as a consequence of numerical/stochastic integration techniques such as Markov Chain Monte Carlo.
The likelihood function L(Q,D) is central to both approaches.
23
Assignment to internal nodes: the simple way.
If branch lengths and the evolutionary process are known, what is the probability of the nucleotides at the leaves?
Cctacggccatacca a ccctgaaagcaccccatcccgt
Cttacgaccatatca c cgttgaatgcacgccatcccgt
Cctacggccatagca c ccctgaaagcaccccatcccgt
Cccacggccatagga c ctctgaaagcactgcatcccgt
Tccacggccatagga a ctctgaaagcaccgcatcccgt
Ttccacggccatagg c actgtgaaagcaccgcatcccg
Tggtgcggtcatacc g agcgctaatgcaccggatccca
Ggtgcggtcatacca t gcgttaatgcaccggatcccat
24
Probability of leaf observations - summing over
internal states
A C G T
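Summing over internal states can be organised from the leaves upwards, in the same way as the minimisation used for parsimony. A minimal sketch for one alignment column follows. The substitution model is an assumption: the transcript does not name one, so Jukes-Cantor with a uniform distribution at the root is used here. The tree encoding (an internal node is a tuple of (child, branch length) pairs, a leaf is a one-letter string) and all names are hypothetical.

```python
from math import exp

NUCS = "ACGT"

def jc_prob(x, y, t):
    """Jukes-Cantor probability of x -> y along a branch of length t."""
    e = exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def partial(node):
    """Map each state N to P(observed leaves below node | state N at node)."""
    if isinstance(node, str):
        return {n: (1.0 if n == node else 0.0) for n in NUCS}
    result = {n: 1.0 for n in NUCS}
    for child, t in node:
        below = partial(child)
        for n in NUCS:
            # Sum over the unknown state m at the child end of the branch.
            result[n] *= sum(jc_prob(n, m, t) * below[m] for m in NUCS)
    return result

def site_likelihood(tree):
    """Probability of the leaf nucleotides, with a uniform root distribution."""
    return sum(0.25 * p for p in partial(tree).values())

tree = (((("A", 0.1), ("C", 0.1)), 0.2), ("G", 0.3))
print(site_likelihood(tree))
```

Multiplying the site likelihoods over all columns gives the likelihood of the alignment for the given tree and branch lengths.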
25
The Molecular Clock
First noted by Zuckerkandl & Pauling (1964) as an empirical fact. How can one detect it?
26
Rootings
Purpose: 1) To give a time direction in the phylogeny and identify the most ancient point. 2) To be able to define concepts such as a monophyletic group.
1) Outgroup: Enhance the data set with a sequence from a species definitely distant to all of them. It will be joined at the root of the original data.
2) Midpoint: Find the midpoint of the longest path in the tree.
3) Assume a molecular clock.
27
Rooting the 3 kingdoms
3 billion years ago: no reliable clock, no outgroup.
Given 2 sets of homologous proteins, i.e. MDH and LDH, can the archaea, prokarya and eukarya be rooted?
28
The generation/year-time clock Langley-Fitch,1973
29
The generation/year-time clock Langley-Fitch,1973
Can the generation time clock be tested?
30
The generation/year-time clock Langley-Fitch,1973
k3, t2 dg4 k, t dg (2k-3)-(t-1)
31
  • β-globin, cytochrome c, fibrinopeptide A: generation time clock
  • Langley-Fitch, 1973
  • Relative rates
  • α-globin 0.342
  • β-globin 0.452
  • cytochrome c 0.069
  • fibrinopeptide A 0.137

32
Almost Clocks
MJ Sanderson (1997) A Nonparametric Approach to Estimating Divergence Times in the Absence of Rate Constancy. Mol.Biol.Evol. 14(12):1218-31.
JL Thorne et al. (1998) Estimating the Rate of Evolution of the Rate of Evolution. Mol.Biol.Evol. 15(12):1647-57.
JP Huelsenbeck et al. (2000) A Compound Poisson Process for Relaxing the Molecular Clock. Genetics 154:1879-92.
I Smoothing a non-clock tree onto a clock tree (Sanderson).
II Rate of evolution of the rate of evolution (Thorne et al.). The rate of evolution can change at each bifurcation.
III Relaxed molecular clock (Huelsenbeck et al.). At random points in time, the rate changes by multiplication with a gamma-distributed random variable.
Comment: Makes perfect sense. Testing no clock versus a perfect clock is choosing between two unrealistic extremes.
33
Adaptive Evolution (Yang, Swanson, Nielsen, ...)
  • Models with positive selection.
  • Positive selection is interesting as it is a
    functional change and could at times be
    correlated with change between species.

34
Summary
Combinatorics of Trees
Principles of Phylogeny Inference: Distance, Parsimony, Probabilistic Methods
Applications: Clocks, Selection