Title: Gene Order Phylogeny
1 Gene Order Phylogeny
- Tandy Warnow
- The Program in Evolutionary Dynamics, Harvard University
- The University of Texas at Austin
2
- Cyber-Infrastructure for Phylogenetic RESearch (http://www.phylo.org)
- Main research: large-scale phylogenetics, reticulate evolution, gene order phylogeny, complex simulations, and databases
- Funded by an $11.6M ITR grant from NSF
- 40 biologists, computer scientists, and mathematicians collaborating on the project
3 CIPRES Members
4 Limitations of DNA phylogenetics
- Deep evolutionary histories may not be recoverable from DNA sequence phylogeny due to lack of specificity: too much noise (homoplasy) and insufficient sequence length
- The systematics community has looked to rare genomic changes for better sources of phylogenetic signal
5 Whole-Genome Phylogenetics
6 Genomes As Signed Permutations
1 5 3 4 -2 -6 or 6 2 -4 3 5 1 etc.
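A signed circular genome can be written down in several equivalent ways: reading can start at any gene, and the other strand gives the same genes in reverse order with every sign flipped. The Python sketch below picks one canonical written form so that two such lists can be compared. The helper names (`reverse_complement`, `rotations`, `canonical`) and the three-gene example are hypothetical, chosen only for illustration.

```python
def reverse_complement(genome):
    """Read the genome from the other strand: reverse the order, flip every sign."""
    return [-g for g in reversed(genome)]


def rotations(genome):
    """All starting points for reading a circular genome."""
    return [genome[i:] + genome[:i] for i in range(len(genome))]


def canonical(genome):
    """Smallest list (in lexicographic order) among all equivalent readings."""
    genome = list(genome)
    candidates = rotations(genome) + rotations(reverse_complement(genome))
    return tuple(min(candidates))


# Hypothetical example: three readings of the same circular genome.
same = canonical([1, -2, 3]) == canonical([3, 1, -2]) == canonical([-3, 2, -1])
print(same)
```

Two lists describe the same circular genome exactly when their canonical forms are equal.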
7 Genomes Evolve by Rearrangements
1 2 3 4 5 6 7 8 9 10
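As an illustration of rearrangements acting on the ten-gene genome above, here is a minimal Python sketch of two event types, an inversion and a transposition, on a genome stored as a list of signed integers. The function names and the index convention (half-open slices `[i:j]`) are assumptions made for this sketch, not taken from any particular software package.

```python
def inversion(genome, i, j):
    """Reverse the segment genome[i:j] and flip the sign of each gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transposition(genome, i, j, k):
    """Move the segment genome[i:j] so that it follows genome[j:k] (requires i <= j <= k)."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


identity = list(range(1, 11))  # 1 2 3 4 5 6 7 8 9 10

print(inversion(identity, 2, 5))         # genes 3, 4, 5 reversed and negated
print(transposition(identity, 2, 5, 8))  # genes 3, 4, 5 moved after gene 8
```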
8 Other types of events
- Duplications, Insertions, and Deletions (change gene content)
- Fissions and Fusions (for genomes with more than one chromosome)
- These events change the number of copies of each gene in each genome (unequal gene content)
9 Genome Rearrangement Has A Huge State Space
- DNA sequences: 4 states per site
- Signed circular genomes with n genes: $2^{n-1}(n-1)!$ states, 1 site
- Circular genomes (1 site)
  - with 37 genes (mitochondria): $2^{36} \cdot 36! \approx 2.56 \times 10^{52}$ states
  - with 120 genes (chloroplasts): $2^{119} \cdot 119! \approx 3.70 \times 10^{232}$ states
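The size of this state space can be checked with a few lines of Python. The sketch assumes the standard count for signed circular genomes: there are 2^n * n! signed linear orders of n genes, and each circular genome corresponds to 2n of them (n starting points times 2 strands), which gives 2^(n-1) * (n-1)!. The function name is hypothetical.

```python
from math import factorial


def signed_circular_genomes(n):
    """Number of distinct signed circular genomes on n genes: 2^(n-1) * (n-1)!."""
    return 2 ** (n - 1) * factorial(n - 1)


for n, label in [(37, "mitochondria"), (120, "chloroplasts")]:
    count = signed_circular_genomes(n)
    # Python integers are exact, so report the order of magnitude from the digit count.
    print(f"{n} genes ({label}): about 10^{len(str(count)) - 1} states")
```

By comparison, a single DNA site has only 4 states.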
10 Why use gene orders?
- Rare genomic changes: huge state space and relative infrequency of events (compared to site substitutions) could make the inference of deep evolution easier, or more accurate.
- Our research shows this is true, but accurate analysis of gene order data is computationally very intensive!
11 Phylogeny reconstruction from gene orders
- Distance-based reconstruction: estimate pairwise distances, and apply methods like Neighbor-Joining or Weighbor
- Maximum Parsimony: find the tree with the minimum length (inversions, transpositions, or other edit distances)
- Maximum Likelihood: find the tree and parameters of evolution most likely to generate the observed data
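To make the distance-based option concrete, the following is a minimal Python sketch of the Neighbor-Joining agglomeration step applied to a matrix of pairwise distances. It returns only the topology, as nested tuples, and omits branch lengths. The function name, the `frozenset`-keyed distance dictionary, and the four-taxon example are assumptions made for this sketch; any estimator of pairwise gene order distance could supply the input.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Join taxa by the NJ criterion until three nodes remain.

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) to the distance between a and b
    Returns an unrooted topology as a tuple of three (possibly nested) nodes.
    """
    nodes = list(names)
    d = dict(dist)
    while len(nodes) > 3:
        n = len(nodes)
        total = {x: sum(d[frozenset((x, y))] for y in nodes if y != x) for x in nodes}
        # Choose the pair minimizing Q(a, b) = (n - 2) * d(a, b) - total(a) - total(b).
        a, b = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - total[p[0]] - total[p[1]],
        )
        new = (a, b)
        rest = [x for x in nodes if x != a and x != b]
        for x in rest:
            # Distance from the new internal node to every remaining node.
            d[frozenset((new, x))] = 0.5 * (
                d[frozenset((a, x))] + d[frozenset((b, x))] - d[frozenset((a, b))]
            )
        nodes = rest + [new]
    return tuple(nodes)


# Hypothetical additive distances for the tree ((A, B), (C, D)).
pairs = {("A", "B"): 3, ("C", "D"): 7, ("A", "C"): 9,
         ("A", "D"): 10, ("B", "C"): 10, ("B", "D"): 11}
dist = {frozenset(k): v for k, v in pairs.items()}
print(neighbor_joining(["A", "B", "C", "D"], dist))
```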
12 Maximum Parsimony on Rearranged Genomes (MPRG)
- The leaves are rearranged genomes.
- Find the tree that minimizes the total number of
rearrangement events (e.g., inversion phylogeny
minimizes the number of inversions)
13 Optimization problems for gene order phylogeny
- Breakpoint phylogeny: find the phylogeny which minimizes the total number of breakpoints (NP-hard, even to find the median of three genomes)
- Inversion phylogeny: find the phylogeny which minimizes the sum of inversion distances on the edges (NP-hard, even to find the median of three genomes)
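The breakpoint objective is easy to evaluate even though optimizing it is hard. The Python sketch below computes the breakpoint distance between two signed circular genomes with equal gene content, and the score of a candidate median for a set of genomes. It does not search for the best median, which is the NP-hard part. All function names are hypothetical.

```python
def adjacencies(genome):
    """Set of gene adjacencies of a signed circular genome, independent of strand."""
    n = len(genome)
    adj = set()
    for i in range(n):
        a, b = genome[i], genome[(i + 1) % n]
        # The pair (a, b) on one strand reads as (-b, -a) on the other;
        # keep the larger tuple as the canonical representative.
        adj.add(max((a, b), (-b, -a)))
    return adj


def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that are not adjacencies of g2."""
    return len(adjacencies(g1) - adjacencies(g2))


def median_score(candidate, genomes):
    """Total breakpoint distance from a candidate median to the given genomes."""
    return sum(breakpoint_distance(candidate, g) for g in genomes)


g1 = [1, 2, 3, 4, 5]
g2 = [1, -3, -2, 4, 5]  # g1 after inverting the segment 2 3
print(breakpoint_distance(g1, g2))
```

A single inversion cuts the genome in two places, so it can create at most two breakpoints.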
14 Inversion phylogenies
- When the data are close to saturated, even the best distance-based analyses are insufficiently accurate. In these cases, our initial investigations suggest that the inversion phylogeny approach may be superior.
- Problem: finding the best trees is enormously hard, since even the point estimation problem is hard (worse than estimating branch lengths in ML).
[Figure: tree length plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked]
15 Observations
- For equal gene content, heuristics for the inversion phylogeny problem are extremely accurate, even under model conditions in which transpositions are dominant.
- For unequal gene content, the parsimony-style problems are too computationally intense, but NJ (neighbor joining) with a new distance estimator (Moret et al. 2004) works extremely well.
16 Software
- BPAnalysis (Sankoff): open source, restricted to breakpoint phylogeny reconstruction
- GRAPPA (Moret et al.): open source, restricted to single-chromosome genomes, but can handle both equal and unequal gene content
- MGR (Pevzner et al.): multiple chromosomes, limited to equal gene content, performs well if the dataset is small (fewer than 10 genomes)
- Bayesian analysis by Bret Larget (not yet released).
17
[Figure: tree on the taxa Merciera, Wahlenbergia, Tiodanus, Legousia, Asyneuma, Trachelium, Symphyandra, Campanula, Codonopsis, Tobacco, Adenophora, Cyananthus, and Platycodon]
The strict consensus of 24 trees, each with an inversion length of 64. Finished within 40 minutes on a laptop using GRAPPA version 1.8.
18 GRAPPA (Genome Rearrangement Analysis under Parsimony and other Phylogenetic Algorithms)
- http://www.cs.unm.edu/moret/GRAPPA/
- Heuristics for maximum parsimony style problems for equal gene content
- Fast polynomial time distance-based methods
- Contributors: U. New Mexico, U. Texas at Austin, Università di Bologna, Italy
- Freely available in source code at this site.
- Project leader: Bernard Moret (UNM) (moret_at_cs.unm.edu)
19 Speeding up MP and ML: DCM3
- Tandy Warnow
- Radcliffe Institute
- The University of Texas at Austin
20 Reconstructing the Tree of Life
Handling large datasets: millions of species
21 Methods for phylogenetic inference
- Polynomial time methods, mostly based upon estimating evolutionary distances
- Heuristics for hard optimization problems (such as maximum parsimony and maximum likelihood)
- Bayesian methods
22 Main research objectives
- Determine the best current methods available for MP and ML, and then improve upon them
- Focus on performance within one day, one week, or one month, on large real datasets (1K to 20K sequences for MP)
- Final objective is hundreds of thousands (or millions) of sequences.
23 Initial results
- Very large datasets are hard for both MP and ML, no matter what software is used
- Suboptimal solutions to MP yield reasonable estimates of the optimal MP trees, but only if they are within 0.01% of the optimal MP score
- Improving upon techniques for searching treespace will yield improvements for both MP and ML
24 Datasets
Obtained from various researchers and online databases
- 1322 lsu rRNA of all organisms
- 2000 Eukaryotic rRNA
- 2594 rbcL DNA
- 4583 Actinobacteria 16S rRNA
- 6590 ssu rRNA of all Eukaryotes
- 7180 three-domain rRNA
- 7322 Firmicutes bacteria 16S rRNA
- 8506 three-domain2org rRNA
- 11361 ssu rRNA of all Bacteria
- 13921 Proteobacteria 16S rRNA
25 Problems with current techniques for MP
[Figure: average MP scores above optimal of the best methods at 24 hours, across 10 datasets]
The best current techniques fail to come within 0.01% of optimal at the end of 24 hours on large datasets
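The accuracy criterion used here can be written as a short check. The Python sketch below assumes that the 0.01 threshold is a percentage above the best known MP score and that lower scores are better. Both function names and the example scores are hypothetical.

```python
def percent_above_optimal(score, best_known):
    """How far a parsimony score lies above the best known score, in percent."""
    return 100.0 * (score - best_known) / best_known


def is_acceptable(score, best_known, threshold_percent=0.01):
    """True if the score is within the threshold (in percent) of the best known score."""
    return percent_above_optimal(score, best_known) <= threshold_percent


# Hypothetical scores: with a best known score of 100000,
# any score up to 100010 meets a 0.01% threshold.
print(is_acceptable(100010, 100000))
print(is_acceptable(100020, 100000))
```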
26 Problems with current techniques for MP
The best current method (default TNT) fails to reach acceptable levels of accuracy (0.01% of optimal) within 24 hours on many large datasets. Evidence suggests that this level will not be reached for weeks or months (or more) of further analysis.
[Figure: performance of TNT with time]
27 Observations
- The best methods cannot get acceptably good solutions within 24 hours on most of these large datasets.
- Datasets of these sizes may need months (or years) of further analysis to reach reasonable solutions.
- Apparent convergence can be misleading.
30 Disk-Covering Methods (DCMs)
- DCMs are divide-and-conquer methods that our group has developed for use in phylogeny reconstruction
- DCM2 was designed for speeding up maximum parsimony and maximum likelihood heuristics. DCM2 was good enough for PAUP.
- DCM3 is a recent improvement over DCM2 which enables iteration (and gives smaller subproblems), and is good enough for TNT.
31 Boosting MP heuristics
- DCMs boost the performance of phylogeny reconstruction methods.
[Diagram: a DCM combined with a base method M gives the boosted method DCM-M]
32 DCM3 technique for speeding up MP searches
33 Iterative-DCM3
[Diagram: a loop in which a tree T passes through DCM3 and the base method to produce a new tree T]
34 New DCMs
- DCM3
  1. Compute subproblems using DCM3 decomposition
  2. Apply base method to each subproblem to yield subtrees
  3. Merge subtrees using the Strict Consensus Merger technique
  4. Randomly refine to make it binary
- Recursive-DCM3
- Iterative DCM3
  1. Compute a DCM3 tree
  2. Perform local search and go to step 1
- Recursive-Iterative DCM3
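The control flow of DCM3 and Iterative DCM3 can be sketched in Python as a pair of higher-order functions. This is a skeleton under stated assumptions, not the actual implementation: `decompose`, `base_method`, `merge`, `refine`, and `local_search` are hypothetical callables standing in for the DCM3 decomposition, the base heuristic, the Strict Consensus Merger, random refinement, and local search. Passing the current tree to `decompose` and stopping after a fixed number of iterations are also assumptions, since no stopping rule is given above.

```python
def dcm3_pass(tree, taxa, decompose, base_method, merge, refine):
    """One DCM3 pass: decompose, solve each subproblem, merge, refine."""
    subproblems = decompose(tree, taxa)               # step 1: compute subproblems
    subtrees = [base_method(s) for s in subproblems]  # step 2: base method on each one
    merged = merge(subtrees)                          # step 3: merge the subtrees
    return refine(merged)                             # step 4: refine to a binary tree


def iterative_dcm3(start_tree, taxa, decompose, base_method, merge, refine,
                   local_search, iterations):
    """Repeat: compute a DCM3 tree, improve it by local search, start again from it."""
    tree = start_tree
    for _ in range(iterations):
        tree = dcm3_pass(tree, taxa, decompose, base_method, merge, refine)
        tree = local_search(tree)
    return tree
```

Keeping the five components as parameters mirrors the idea of boosting: the same wrapper can be placed around any base method.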
35 Boosting MP heuristics
- We examine DCMs using DCM2 and DCM3, and using recursion and/or iteration.
[Diagram: a DCM combined with a base method M gives the boosted method DCM-M]
36 Performance Study
- How well do these boosted versions of the best MP heuristics perform, compared to the best MP heuristics?
- We examine performance with respect to optimal MP scores (best found so far, using any method) for a number of very large datasets, over 24 hours.
- The benchmark MP heuristic is the default TNT.
38 Rec-I-DCM3 significantly improves performance
[Figure: comparison of TNT to Rec-I-DCM3(TNT) on one large dataset; curves labeled "Current best techniques" and "DCM boosted version of best techniques"]
39 Rec-I-DCM3(TNT) vs. TNT (comparison of scores at 24 hours)
The base method is the default TNT technique, the current best method for MP. Rec-I-DCM3 significantly improves upon the unboosted TNT by returning trees which are at most 0.01% above optimal on most datasets.
40 Summary
- Rec-I-DCM3 is a powerful technique for escaping local optima, and boosts the performance of the best heuristics for solving MP
- The improvement increases with the difficulty of the dataset
- Rec-I-DCM3(TNT) is 50 times faster than TNT on our hardest datasets, but we expect even bigger speedups in our next version
- DCMs also boost the performance of Maximum Likelihood heuristics (not shown)
41 Acknowledgements
- Collaborators: Bernard Moret (UNM), Usman Roshan (UT-Austin), and Tiffani Williams (UNM)
- Funding: NSF, The David and Lucile Packard Foundation, The Radcliffe Institute for Advanced Study, The Institute for Cellular and Molecular Biology at UT-Austin, and The Program in Evolutionary Dynamics at Harvard University
- Software will be part of the CIPRES Project's first distribution; see http://www.phylo.org
42
- Cyber-Infrastructure for Phylogenetic RESearch (http://www.phylo.org)
- Main research: large-scale phylogenetics, reticulate evolution, gene order phylogeny, and databases
- Funded by an $11.6M ITR grant from NSF
- 40 biologists, computer scientists, and mathematicians collaborating on the project