Title: The Tree of Life Initiative
1The Tree of Life Initiative
- Tandy Warnow
- Radcliffe Institute
- The University of Texas at Austin
2Phylogeny
[Figure: phylogeny of Orangutan, Human, Gorilla, and Chimpanzee. From the Tree of Life Website, University of Arizona]
3Reconstructing the Tree of Life
Handling large datasets: millions of species
4Evolution informs about everything in biology
- Big genome sequencing projects just produce data: so what?
- Evolutionary history relates all organisms and genes, and helps us understand and predict:
- interactions between genes (genetic networks)
- drug design
- predicting functions of genes
- influenza vaccine development
- origins and spread of disease
- origins and migrations of humans
5- Cyber Infrastructure for Phylogenetic Research
- Purpose: to create a national infrastructure of hardware, open source software, database technology, etc., necessary to infer the Tree of Life.
6- Funded by an $11.6M ITR (Information Technology Research) grant from NSF
- 40 biologists, computer scientists, and mathematicians collaborating on the project
7CIPRes Members
8What are the Components of the CIPRes Project?
Software Development
Outreach
Algorithms
Production
Database
Simulations
9What are the Components of the CIPRes Project?
Software Development
Simulations
Production
Algorithms
Database
10Main Research Projects
- Databases (Miranker and Donoghue)
- Simulations (Kim)
- Algorithms (Warnow)
- Software (Moret, W. Maddison, and Swofford)
11Algorithms group
- John Huelsenbeck (UCSD)
- Warren Hunt (Texas)
- Dick Karp (Berkeley)
- Bernard Moret (UNM)
- Elchanan Mossel (Berkeley)
- Gene Myers (Berkeley)
- Christos Papadimitriou (Berkeley)
- Satish Rao (Berkeley)
- Stuart Russell (Berkeley)
- Tandy Warnow (Texas)
- Tiffani Williams (UNM)
12This talk
- New techniques to solve Maximum Parsimony and Maximum Likelihood on massive datasets
- The GRAPPA software for whole genome phylogeny reconstruction
- Both projects joint with Bernard Moret at UNM
13Campanulaceae chloroplast genome phylogeny
[Figure: tree with leaves Merciera, Wahlenbergia, Triodanis, Legousia, Asyneuma, Trachelium, Symphyandra, Campanula, Codonopsis, Tobacco, Adenophora, Cyananthus, Platycodon]
The strict consensus of 24 trees, each with inversion length of 64. Finished within 40 minutes on a laptop using GRAPPA version 1.8.
14DNA Sequence Evolution
15Phylogeny Problem
[Figure: taxa U, V, W, X, Y; sequences TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT; a tree with leaves X, U, Y, V, W]
16Methods for phylogenetic inference
- Polynomial time methods, mostly based upon estimating evolutionary distances
- Heuristics for hard optimization problems (such as maximum parsimony and maximum likelihood)
- Bayesian methods
17Standard problem: Maximum Parsimony (Hamming distance Steiner Tree)
- Input: Set S of n aligned sequences of length k
- Output: A phylogenetic tree T
- leaf-labeled by sequences in S
- additional sequences of length k labeling the internal nodes of T
- such that $\sum_{(u,v) \in E(T)} H(u,v)$ is minimized, where $H(u,v)$ is the Hamming distance between the sequences labeling the endpoints of edge $(u,v)$.
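Maximum parsimony scores a fully labeled tree by the total Hamming distance across its edges. The following is a minimal sketch of that evaluation, assuming the tree is given as a list of edges between node names plus a dictionary of sequences. The function names `hamming` and `tree_length`, the node names, and the choice of internal labels are illustrative and not from the talk.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def tree_length(edges, labels) -> int:
    """Sum of Hamming distances over all edges of a fully labeled tree."""
    return sum(hamming(labels[u], labels[v]) for u, v in edges)


# Hypothetical four-leaf tree: leaves a and b hang off internal node p,
# leaves c and d hang off internal node q, and p is joined to q.
labels = {"a": "ACT", "b": "ACA", "c": "GTT", "d": "GTA",
          "p": "ACA", "q": "GTA"}
edges = [("a", "p"), ("b", "p"), ("p", "q"), ("c", "q"), ("d", "q")]
print(tree_length(edges, labels))
```

The hard part of the problem is not this sum but the search: both the topology and the internal labels have to be chosen to make it as small as possible.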
18Maximum parsimony (example)
- Input: Four sequences
- ACT
- ACA
- GTT
- GTA
- Question: which of the three trees has the best MP score?
19Maximum Parsimony (MP)
[Figure: the three unrooted trees on the leaves ACT, ACA, GTT, GTA]
20Maximum Parsimony (MP)
[Figure: the three trees on ACT, ACA, GTT, GTA with internal node labels and edge lengths: MP score 7, MP score 5, and MP score 4 (the optimal MP tree)]
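For a fixed topology, the smallest achievable parsimony length can be computed site by site. The sketch below uses Fitch's algorithm, which the slides do not name, so treat it as one standard way to do this rather than the method used in the talk. It assumes each unrooted four-leaf tree is written as a pair of pairs, rooted on its internal edge; the rooting does not change the count.

```python
def fitch_site(node, i):
    """Return (candidate state set, number of changes) for site i below node."""
    if isinstance(node, str):            # leaf: observed state, no changes yet
        return {node[i]}, 0
    left, right = node
    left_states, left_cost = fitch_site(left, i)
    right_states, right_cost = fitch_site(right, i)
    common = left_states & right_states
    if common:                           # children can agree: no new change
        return common, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1


def fitch_score(tree, k):
    """Minimum parsimony length of a binary tree over k aligned sites."""
    return sum(fitch_site(tree, i)[1] for i in range(k))


topologies = {
    "ACT,ACA | GTT,GTA": (("ACT", "ACA"), ("GTT", "GTA")),
    "ACT,GTT | ACA,GTA": (("ACT", "GTT"), ("ACA", "GTA")),
    "ACT,GTA | ACA,GTT": (("ACT", "GTA"), ("ACA", "GTT")),
}
for name, tree in topologies.items():
    print(name, fitch_score(tree, 3))
```

Fitch's algorithm minimizes over all internal labelings, so a score drawn for one particular choice of internal labels can exceed the value it reports for the same topology.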
21Maximum Parsimony: computational complexity
22Maximum likelihood
- For a given model (e.g., K2P, HKY, GTR, etc.), try to find the tree T and its associated parameters (substitution matrices, edge lengths, etc.) that maximize $\Pr[S \mid T, \text{parameters}]$
- Even harder in practice than MP.
23Solving NP-hard problems exactly is unlikely
- Number of (unrooted) binary trees on n leaves is (2n-5)!!
- If each tree on 1000 taxa could be analyzed in 0.001 seconds, we would find the best tree in 2890 millennia
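The double factorial (2n-5)!! grows quickly enough that it is easier to appreciate numerically. The sketch below counts unrooted binary trees exactly with Python integers; the function name is illustrative. For n = 1000 the count is roughly 10^2860, which is why exhaustive search is out of the question.

```python
import math


def num_unrooted_binary_trees(n: int) -> int:
    """(2n-5)!! = 1 * 3 * 5 * ... * (2n-5), the number of unrooted
    binary trees on n >= 3 labeled leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    return math.prod(range(1, 2 * n - 4, 2))


for n in (4, 5, 6, 10):
    print(n, num_unrooted_binary_trees(n))

count = num_unrooted_binary_trees(1000)
print("decimal digits in the tree count for n = 1000:", len(str(count)))
```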
24Approaches for solving MP
- Hill-climbing heuristics (which can get stuck in local optima)
- Randomized algorithms for getting out of local optima
- Approximation algorithms for MP (based upon Steiner Tree approximation algorithms).
26Main research objectives
- Determine the best current methods available for MP (PAUP, TNT, etc.) and ML (???), and then improve upon them
- Focus on performance within one day, one week, or one month, on large real datasets (1K to 20K sequences for MP)
- Final objective is hundreds of thousands (or millions) of sequences.
27Initial results
- TNT (by Pablo Goloboff) is much better at solving Maximum Parsimony than the other software we studied (especially on large datasets).
- Our research (Moret, Roshan, Warnow, and Williams) shows that we need to get within 0.01% of optimal MP scores (or better even, on large datasets) to return reasonable estimates of an optimal tree's topology.
- ML heuristics don't seem to be able to analyze large datasets with any accuracy.
28Datasets
Obtained from various researchers and online
databases
- 1322 lsu rRNA of all organisms
- 2000 Eukaryotic rRNA
- 2594 rbcL DNA
- 4583 Actinobacteria 16s rRNA
- 6590 ssu rRNA of all Eukaryotes
- 7180 three-domain rRNA
- 7322 Firmicutes bacteria 16s rRNA
- 8506 three-domain2org rRNA
- 11361 ssu rRNA of all Bacteria
- 13921 Proteobacteria 16s rRNA
29Problems with current techniques for MP
Average MP scores above optimal of the best methods at 24 hours across 10 datasets
The best current techniques fail to reach 0.01% of optimal by the end of 24 hours on large datasets
30Problems with current techniques for MP
The best current method (default TNT) fails to reach acceptable levels of accuracy (0.01% of optimal) within 24 hours on many large datasets -- evidence suggests that this level will not be reached for weeks or months (or more) of further analysis.
Performance of TNT with time
31Observations
- The best methods cannot get acceptably good solutions within 24 hours on most of these large datasets.
- Datasets of these sizes may need months (or years) of further analysis to reach reasonable solutions.
- Apparent convergence can be misleading.
34Our objective: speed up the best MP heuristics
[Figure (fake study): MP score of best trees versus time, comparing the performance of a hill-climbing heuristic with the desired performance]
35- How can we improve upon existing techniques?
36- Tree Bisection and Reconnection (TBR)
37- Tree Bisection and Reconnection (TBR)
Delete an edge
38- Tree Bisection and Reconnection (TBR)
39- Tree Bisection and Reconnection (TBR)
Reconnect the trees with a new edge that
bifurcates an edge in each tree
40A conjecture as to why current techniques are poor
- Our studies suggest that trees with near optimal scores tend to be topologically close (RF distance less than 15) to the other near optimal trees.
- The standard technique (TBR) for moving around tree space explores O(n^3) trees, which are mostly topologically distant.
- So TBR may be useful initially (to reach near optimality) but then more localized searches are more productive.
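The RF (Robinson-Foulds) distance used above counts the bipartitions that appear in one tree but not the other. The sketch below is a small illustration, assuming trees are written as nested tuples of leaf names and compared as unrooted trees. The representation and the function names are assumptions made for the example.

```python
def leaves(tree):
    """Set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions of an unrooted tree, each stored as the
    side that does not contain a fixed reference leaf."""
    all_leaves = leaves(tree)
    ref = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        side = frozenset().union(*(visit(child) for child in node))
        # Skip trivial splits (a single leaf on one side) and the root.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(all_leaves - side if ref in side else side)
        return side

    visit(tree)
    return found


def rf_distance(t1, t2):
    """Number of bipartitions found in exactly one of the two trees."""
    if leaves(t1) != leaves(t2):
        raise ValueError("trees must have the same leaf set")
    return len(splits(t1) ^ splits(t2))


t1 = (("A", "B"), ("C", ("D", "E")))
t2 = (("A", "C"), ("B", ("D", "E")))
print(rf_distance(t1, t2))
```

Two trees that differ only in where the root is drawn have distance 0 under this definition, which matches the unrooted setting of the talk.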
41Disk-Covering Methods (DCMs)
- DCMs are divide-and-conquer methods that our group has developed for use in phylogeny reconstruction
- DCM2 was designed for speeding up maximum parsimony and maximum likelihood heuristics. DCM2 was good enough for PAUP.
- DCM3 is a recent improvement over DCM2 that enables iteration (and gives smaller subproblems), and is good enough for TNT.
42Boosting MP heuristics
- These DCMs boost the performance of phylogeny
reconstruction methods.
[Diagram: a DCM applied to a base method M yields the boosted method DCM-M]
43DCM3 technique for speeding up MP searches
44DCM3: a new DCM decomposition
- DCM3 decompositions
- (1) are faster to compute than earlier DCMs,
- (2) yield smaller subproblems than DCM2, and
- (3) can be used iteratively
45Short Subtree Graph
- Given an edge-weighted tree T, for every edge e, let X(e) be the set of leaves in each of the four subtrees around e that are closest to e.
- Let G be the union of cliques on all X(e)'s.
- Theorem: G is triangulated. Hence a clique separator in G can be found in polynomial time.
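The construction of the short subtree graph can be sketched in a few lines. This is an illustration under stated assumptions, not the DCM3 implementation: the tree is a dictionary of weighted adjacency dictionaries with string node names, only internal edges are processed (a leaf edge does not have four subtrees around it), and ties for the closest leaf are all kept. The names `nearest_leaves` and `short_subtree_graph` are hypothetical.

```python
from itertools import combinations


def nearest_leaves(adj, start, parent):
    """Leaves closest to `parent` in the subtree hanging off edge (parent, start)."""
    best, best_dist = set(), float("inf")
    stack = [(start, parent, adj[parent][start])]
    while stack:
        node, prev, dist = stack.pop()
        children = [n for n in adj[node] if n != prev]
        if not children:                      # reached a leaf
            if dist < best_dist:
                best, best_dist = {node}, dist
            elif dist == best_dist:
                best.add(node)
        for child in children:
            stack.append((child, node, dist + adj[node][child]))
    return best


def short_subtree_graph(adj):
    """Union of cliques on X(e) over all internal edges e, as a set of leaf pairs."""
    graph = set()
    for u in adj:
        for v in adj[u]:
            if u < v and len(adj[u]) > 1 and len(adj[v]) > 1:
                x_e = set()
                for a, b in ((u, v), (v, u)):
                    for n in adj[a]:
                        if n != b:            # each subtree on a's side of e
                            x_e |= nearest_leaves(adj, n, a)
                graph |= {frozenset(p) for p in combinations(sorted(x_e), 2)}
    return graph


# Hypothetical five-leaf tree with internal nodes x, y, z.
adj = {
    "A": {"x": 1}, "B": {"x": 2}, "C": {"y": 1}, "D": {"z": 1}, "E": {"z": 3},
    "x": {"A": 1, "B": 2, "y": 1},
    "y": {"x": 1, "C": 1, "z": 1},
    "z": {"y": 1, "D": 1, "E": 3},
}
print(len(short_subtree_graph(adj)))
```

In this small example the two far-apart leaves B and E never share a set X(e), so they are the only pair left unconnected.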
46Strict Consensus Merger (SCM)
47Computing DCM3 decompositions
- An optimal DCM3 decomposition takes O(n^3) to compute - same as for DCM2
- The centroid edge DCM3 decomposition can be computed in O(n^2) time
- An approximate centroid edge decomposition can be computed in O(n) time
48Iterative-DCM3
[Diagram: Iterative-DCM3 loop in which a tree T is passed through DCM3 and the base method to produce a new tree T]
49New DCMs
- DCM3
- (1) Compute subproblems using DCM3 decomposition
- (2) Apply base method to each subproblem to yield subtrees
- (3) Merge subtrees using the Strict Consensus Merger technique
- (4) Randomly refine to make it binary
- Recursive-DCM3
- Iterative DCM3
- (1) Compute a DCM3 tree
- (2) Perform local search and go to step 1
- Recursive-Iterative DCM3
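The control flow of DCM3 and Iterative DCM3 can be written as a small skeleton. This sketch shows only the order of the steps: every step (`decompose`, `base_method`, `merge`, `refine`, `local_search`, `score`) is a hypothetical callable supplied by the caller, and keeping the best-scoring tree seen so far is an assumption of the sketch rather than something stated in the talk.

```python
def dcm3_tree(guide_tree, decompose, base_method, merge, refine):
    """One DCM3 pass: decompose, solve each subproblem, merge, refine."""
    subproblems = decompose(guide_tree)              # DCM3 decomposition
    subtrees = [base_method(s) for s in subproblems] # e.g. an MP heuristic
    merged = merge(subtrees)                         # e.g. Strict Consensus Merger
    return refine(merged)                            # e.g. random binary refinement


def iterative_dcm3(start_tree, iterations, score, local_search, **steps):
    """Repeat DCM3 followed by local search, feeding each result back in
    as the next guide tree. Lower scores are assumed to be better."""
    current = best = start_tree
    for _ in range(iterations):
        current = local_search(dcm3_tree(current, **steps))
        if score(current) < score(best):
            best = current
    return best
```

A caller would pass the four DCM3 steps as keyword arguments, for example `iterative_dcm3(tree, 10, score, local_search, decompose=..., base_method=..., merge=..., refine=...)`. The recursive variants would apply the same pass again inside `base_method` when a subproblem is still too large.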
50Boosting MP heuristics
- We examine DCMs using DCM2 and DCM3, and using
recursion and/or iteration.
[Diagram: a DCM applied to a base method M yields the boosted method DCM-M]
51Performance Study
- How well do these boosted versions of the best MP heuristics perform, compared to the best MP heuristics?
- We examine performance with respect to optimal MP scores (best found so far, using any method) for a number of very large datasets, over 24 hours.
- The benchmark MP heuristic is the default TNT.
52Datasets
Obtained from various researchers and online
databases
- 1322 lsu rRNA of all organisms
- 2000 Eukaryotic rRNA
- 2594 rbcL DNA
- 4583 Actinobacteria 16s rRNA
- 6590 ssu rRNA of all Eukaryotes
- 7180 three-domain rRNA
- 7322 Firmicutes bacteria 16s rRNA
- 8506 three-domain2org rRNA
- 11361 ssu rRNA of all Bacteria
- 13921 Proteobacteria 16s rRNA
53Comparison of DCM decompositions (Maximum subset size)
DCM2 subproblems are almost as large as the full
dataset size on datasets 1 through 4. On
datasets 5-10 DCM2 was too slow to compute a
decomposition within 24 hours.
54Comparison of DCMs (4583 sequences)
Base method is the TNT-ratchet. DCM2 takes almost 10 hours to produce a tree and is too slow to run on larger datasets. Rec-I-DCM3 is the best method at all times.
55Comparison of DCMs (13,921 sequences)
Base method is the TNT-ratchet. Note the improvement in DCMs as we move from the default to recursion to iteration to recursion+iteration. On very large datasets Rec-I-DCM3 gives significant improvements over unboosted TNT.
56Rec-I-DCM3 significantly improves performance
[Figure: comparison of TNT to Rec-I-DCM3(TNT) on one large dataset; curves for the current best techniques and for the DCM-boosted version of the best techniques]
57Rec-I-DCM3(TNT) vs. TNT (Comparison of scores at 24 hours)
Base method is the default TNT technique, the current best method for MP. Rec-I-DCM3 significantly improves upon the unboosted TNT by returning trees which are at most 0.01% above optimal on most datasets.
58Summary of Part I
- Rec-I-DCM3 is a powerful technique for escaping local optima, and boosts the performance of the best heuristics for solving MP
- The improvement increases with the difficulty of the dataset
- Rec-I-DCM3(TNT) is 50 times faster than TNT on our hardest datasets, but we expect even bigger speedups in our next version
- DCMs also boost the performance of Maximum Likelihood heuristics (not shown)
59Limitations of DNA phylogenetics
- Deep evolutionary histories may not be recoverable from DNA sequence phylogeny due to lack of specificity -- too much noise (homoplasy) and insufficient sequence length
- The systematics community has looked to rare genomic changes for better sources of phylogenetic signal
60Part II: Whole-Genome Phylogenetics
61Genomes As Signed Permutations
1 -5 3 4 -2 -6 or 6 2 -4 -3 5 -1 etc.
62Genomes Evolve by Rearrangements
1 2 3 4 5 6 7 8 9 10
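A signed permutation is easy to manipulate as a list of signed integers. The sketch below illustrates two operations under that assumed representation: reading the same genome from the opposite strand, and an inversion, which reverses a segment and flips the sign of every gene in it. The function names and index convention (0-based, inclusive) are choices made for the example.

```python
def reverse_complement(genome):
    """The same genome read in the opposite direction."""
    return [-g for g in reversed(genome)]


def inversion(genome, i, j):
    """Invert genes i..j (0-based, inclusive): reverse their order
    and flip each sign."""
    if not 0 <= i <= j < len(genome):
        raise ValueError("need 0 <= i <= j < len(genome)")
    segment = [-g for g in reversed(genome[i:j + 1])]
    return genome[:i] + segment + genome[j + 1:]


identity = list(range(1, 11))
print(inversion(identity, 2, 7))
print(reverse_complement([1, -5, 3, 4, -2, -6]))
```

Applying the same inversion twice restores the original order, which is one reason inversion distance behaves like an edit distance on genomes.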
63Other types of events
- Duplications, Insertions, and Deletions (change gene content)
- Fissions and Fusions (for genomes with more than one chromosome)
- These events change the number of copies of each gene in each genome (unequal gene content)
64Genome Rearrangement Has A Huge State Space
- DNA sequences: 4 states per site
- Signed circular genomes with n genes: $2^{n-1}(n-1)!$ states, 1 site
- Circular genomes (1 site)
- with 37 genes (mitochondria): $2.56 \times 10^{52}$ states
- with 120 genes (chloroplasts): $3.70 \times 10^{232}$ states
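The size of this state space can be checked directly. The sketch below assumes the usual count for signed circular genomes on n genes, 2^(n-1) (n-1)!, obtained by fixing one gene's position and orientation and then choosing the order and sign of the remaining n-1 genes. The function name is illustrative.

```python
import math


def signed_circular_genomes(n: int) -> int:
    """Number of distinct signed circular gene orders on n genes."""
    if n < 1:
        raise ValueError("need at least one gene")
    return 2 ** (n - 1) * math.factorial(n - 1)


for label, n in (("mitochondria", 37), ("chloroplasts", 120)):
    print(f"{label}: n = {n}, states = {signed_circular_genomes(n):.2e}")
```

By comparison, a single DNA site has 4 states, so one gene-order character carries far more possible values than any sequence site.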
65Why use gene orders?
- Rare genomic changes: huge state space and relative infrequency of events (compared to site substitutions) could make the inference of deep evolution easier, or more accurate.
- Our research shows this is true, but accurate analysis of gene order data is computationally very intensive!
66Phylogeny reconstruction from gene orders
- Distance-based reconstruction: estimate pairwise distances, and apply methods like Neighbor-Joining or Weighbor
- Maximum Parsimony: find tree with the minimum length (inversions, transpositions, or other edit distances)
- Maximum Likelihood: find tree and parameters of evolution most likely to generate the observed data
67Maximum Parsimony on Rearranged Genomes (MPRG)
- The leaves are rearranged genomes.
- Find the tree that minimizes the total number of
rearrangement events (e.g., inversion phylogeny
minimizes the number of inversions)
68Optimization problems for gene order phylogeny
- Breakpoint phylogeny: find the phylogeny which minimizes the total number of breakpoints (NP-hard, even to find the median of three genomes)
- Inversion phylogeny: find the phylogeny which minimizes the sum of inversion distances on the edges (NP-hard, even to find the median of three genomes)
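The pairwise quantity behind the breakpoint phylogeny is simple to compute even though the tree problem is hard. The sketch below counts breakpoints between two genomes, assuming both are signed circular permutations of the same genes with one copy of each. An adjacency (a, b) in the first genome is conserved if the second genome contains (a, b) or its opposite-strand reading (-b, -a). The function names are illustrative.

```python
def adjacencies(genome):
    """Adjacent gene pairs of a circular signed genome, in both reading directions."""
    n = len(genome)
    pairs = set()
    for i in range(n):
        a, b = genome[i], genome[(i + 1) % n]   # wrap around: circular genome
        pairs.add((a, b))
        pairs.add((-b, -a))
    return pairs


def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that are not conserved in g2."""
    if sorted(map(abs, g1)) != sorted(map(abs, g2)):
        raise ValueError("genomes must contain the same genes")
    conserved = adjacencies(g2)
    n = len(g1)
    return sum((g1[i], g1[(i + 1) % n]) not in conserved for i in range(n))


print(breakpoint_distance([1, 2, 3, 4], [1, -3, -2, 4]))
```

A single inversion creates at most two breakpoints, one at each end of the inverted segment, which is what the example shows.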
69 Inversion and Breakpoint phylogenies
- When the data are close to saturated, even Weighbor(EDE) analyses are insufficiently accurate. In these cases, our initial investigations suggest that the inversion and breakpoint phylogeny approaches may be superior.
- Problem: finding the best trees is enormously hard, since even the point estimation problem is hard (worse than estimating branch lengths in ML).
[Figure: MP score over the space of phylogenetic trees, showing a local optimum and the global optimum]
70Observations
- For equal gene content, heuristics for the inversion phylogeny problem are extremely accurate, even under model conditions in which transpositions are dominant.
- For unequal gene content, the parsimony style problems are too computationally intense -- but NJ (neighbor joining) with a new distance estimator (Moret et al. 2004) works extremely well.
71Software
- BPAnalysis (Sankoff): open source, restricted to the breakpoint phylogeny reconstruction
- GRAPPA (Moret et al.): open source, restricted to single chromosome genomes, but can handle both equal and unequal gene content
- MGR (Pevzner et al.): multiple chromosome, limited to equal gene content, performs well if the dataset is small (less than 10 genomes)
- Bayesian analysis by Bret Larget (not released): works on some small datasets but not all.
72[Figure: tree with leaves Merciera, Wahlenbergia, Triodanis, Legousia, Asyneuma, Trachelium, Symphyandra, Campanula, Codonopsis, Tobacco, Adenophora, Cyananthus, Platycodon]
The strict consensus of 24 trees, each with inversion length of 64. Finished within 40 minutes on a laptop using GRAPPA version 1.8.
73GRAPPA (Genome Rearrangement Analysis under
Parsimony and other Phylogenetic Algorithms)
- http://www.cs.unm.edu/~moret/GRAPPA/
- Heuristics for maximum parsimony style problems for equal gene content
- Fast polynomial time distance-based methods
- Contributors: U. New Mexico, U. Texas at Austin, Università di Bologna, Italy
- Freely available in source code at this site.
- Project leader: Bernard Moret (UNM) (moret_at_cs.unm.edu)
74Summary
- Computational phylogenetics offers interesting problems for computer scientists
- Collaboration with biologists is essential
- Testing on real and/or simulated data is the only way to know that the methods are worth pursuing
75Research problems in phylogenetic reconstruction
- Hard optimization problems
- Bayesian inference
- Whole Genome Phylogeny (e.g., gene order/content)
- Reticulate evolution
- Gene Tree/Species Tree
- Processing sets of trees: compact representations and consensus methods
- Supertree methods
- Statistical issues with respect to stochastic models of evolution (e.g., fast converging methods)
76Contest planned!
- The CIPRES project will host a contest for software to solve maximum parsimony and maximum likelihood for molecular sequences (not gene orders)
- Dates not yet set, but should be 2005
- Benchmarks will be real and simulated datasets
- For more information, check http://www.phylo.org
77Acknowledgements
- The NSF
- The David and Lucile Packard Foundation
- The Radcliffe Institute for Advanced Study, the Program in Evolutionary Dynamics at Harvard, and the Institute for Cellular and Molecular Biology at UT-Austin.
- DCM: Usman Roshan, Bernard Moret, and Tiffani Williams
- GRAPPA: Bernard Moret, Li-San Wang, Jijun Tang, Bob Jansen,
78Phylolab, U. Texas
Please visit us at http://www.cs.utexas.edu/users/phylo/