CIPRES: Enabling Tree of Life Projects

About This Presentation

Description:

... of New Mexico Bernard Moret David Bader UCSD/SDSC Fran Berman Alex Borchers Phil Bourne John Huelsenbeck Terri Liebowitz Mark Miller University of ...

Slides: 37

Provided by: tandyw

Learn more at: https://www.cs.utexas.edu

Transcript and Presenter's Notes

1
CIPRES: Enabling Tree of Life Projects

Tandy Warnow
The University of Texas at Austin

2
Phylogeny
[Figure: phylogeny of Orangutan, Human, Gorilla, and Chimpanzee. From the Tree of Life Website, University of Arizona]
3
Reconstructing the Tree of Life
Handling large datasets: millions of species
The Tree of Life is not really a tree:
reticulate evolution
4
Cyber Infrastructure for Phylogenetic Research
Purpose: to create a national infrastructure of hardware, open source
software, database technology, etc., necessary to infer the Tree of Life.
Group: 40 biologists, computer scientists, and mathematicians from 13
institutions.
Funding: $11.6 M (large ITR grant from NSF).
URL: http://www.phylo.org
5
CIPRES Members
6
CIPRES activity

Databases - e.g. TreeBase II (Bill Piel and
others)
Simulations of large-scale complex genome-scale
evolution (Junhyong Kim)
Outreach (Michael Donoghue and Brent Mishler)
Algorithms (Tandy Warnow)
Open source software (Wayne Maddison, Dave
Swofford, Mark Holder, and Bernard Moret)
Computer cluster at SDSC (Fran Berman and Mark
Miller) - available to ATOL projects and other
groups with datasets above 1000 taxa

7
Phylogeny Problem
[Figure: five input sequences (TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT) for the taxa U, V, W, X, Y, and an output tree on the leaves X, U, Y, V, W]
8
Steps in a phylogenetic analysis

Gather data
Align sequences
Estimate phylogeny (or combine with previous
step)
Estimate the reliable aspects of the evolutionary
history (using bootstrapping, consensus trees, or
other methods)
Perform post-tree analyses.
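
As a rough illustration of the bootstrapping mentioned in the steps above, the sketch below resamples alignment columns with replacement to build replicate datasets. It assumes an ungapped alignment held in a plain dictionary; the function name `bootstrap_alignment` and the pairing of taxa with sequences are made up for this example, since the transcript does not preserve which sequence belongs to which taxon.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one bootstrap replicate by resampling columns with replacement."""
    names = list(alignment)
    length = len(alignment[names[0]])
    # Draw as many column indices as the alignment has columns.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


# Sequences from the "Phylogeny Problem" slide; the taxon labels attached
# to each sequence are an assumption made for illustration only.
alignment = {
    "U": "TAGCCCA",
    "V": "TAGACTT",
    "W": "TGCACAA",
    "X": "TGCGCTT",
    "Y": "AGGGCAT",
}

rng = random.Random(0)  # fixed seed so the replicates are reproducible
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

Each replicate would then be passed to the tree estimation step, and the resulting trees summarized (for example with a consensus tree) to see which groupings are reliably recovered.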

9
CIPRES research in algorithms

Multiple sequence alignment
Genomic alignment
Heuristics for Maximum Parsimony and Maximum
Likelihood
Bayesian MCMC methods
Supertree methods
Whole genome phylogeny reconstruction
Reticulate evolution detection and reconstruction
Data mining on sets of trees, and compact
representations of these sets

10
Software distributions

The first distribution (in the coming months) will
focus on Rec-I-DCM3(PAUP): fast heuristic
searches for maximum parsimony on large datasets,
for PAUP users
All software will be open source
Community contributions to software will be
enabled

11
Phylogenetic reconstruction methods

Heuristics for hard optimization criteria
(Maximum Parsimony and Maximum Likelihood) - hard
to solve on large datasets

Polynomial time distance-based methods: Neighbor
Joining, FastME, Weighbor, etc. - poor accuracy
on datasets with large evolutionary distances
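
To make the maximum parsimony criterion concrete: scoring one fixed tree is easy, and it is the search over all trees that is hard. The sketch below scores a tree with Fitch's small-parsimony algorithm, a standard technique that the slides do not name. Trees are written as nested 2-tuples of leaf names, rooted arbitrarily (the parsimony score does not depend on the root position). The function names and the pairing of taxa with sequences are assumptions for illustration.

```python
def _site_cost(node, site, sequences):
    """Return (possible ancestral states, changes needed) for one site."""
    if isinstance(node, str):  # leaf: its observed state, no changes
        return {sequences[node][site]}, 0
    left_states, left_cost = _site_cost(node[0], site, sequences)
    right_states, right_cost = _site_cost(node[1], site, sequences)
    shared = left_states & right_states
    if shared:  # children can agree: no extra change at this node
        return shared, left_cost + right_cost
    # children disagree: one substitution is charged here
    return left_states | right_states, left_cost + right_cost + 1


def fitch_score(tree, sequences):
    """Parsimony length of a binary tree, summed over all sites."""
    n_sites = len(next(iter(sequences.values())))
    return sum(_site_cost(tree, site, sequences)[1] for site in range(n_sites))


# Sequences from the "Phylogeny Problem" slide; the taxon labels are an
# assumed pairing, not taken from the original figure.
sequences = {
    "U": "TAGCCCA",
    "V": "TAGACTT",
    "W": "TGCACAA",
    "X": "TGCGCTT",
    "Y": "AGGGCAT",
}

tree_a = ((("U", "V"), "W"), ("X", "Y"))
tree_b = ((("U", "X"), "W"), ("V", "Y"))
print(fitch_score(tree_a, sequences), fitch_score(tree_b, sequences))
```

A heuristic search for maximum parsimony repeatedly proposes new topologies and keeps those with lower scores, so a fast scoring routine of this kind sits in the inner loop.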

12
DCMs: Divide-and-conquer for improving phylogeny
reconstruction
13
Boosting phylogeny reconstruction methods

DCMs boost the performance of phylogeny
reconstruction methods.

[Diagram: DCM + base method M -> DCM-M]
14
DCMs (Disk-Covering Methods)

DCMs for polynomial time methods improve
topological accuracy (empirical observation), and
have provable theoretical guarantees under Markov
models of evolution
DCMs for hard optimization problems reduce
running time needed to achieve good levels of
accuracy (empirical observation)

15
DCM1-boosting distance-based methods (Nakhleh et
al., ISMB 2001)

DCM1-boosting makes distance-based methods more
accurate
Theoretical guarantees that DCM1-NJ converges to
the true tree from polynomial length sequences

[Figure: error rate (0 to 0.8) vs. number of taxa (0 to 1600) for NJ and DCM1-NJ]
16
Major challenge: MP and ML

Maximum Parsimony (MP) and Maximum Likelihood
(ML) remain the methods of choice for most
systematists
The main challenge here is to make it possible to
obtain good solutions to MP or ML in reasonable
time periods on large datasets

17
Solving NP-hard problems exactly is unlikely
leaves  trees
4       3
5       15
6       105
7       945
8       10395
9       135135
10      2027025
20      2.2 x 10^20
100     1.7 x 10^182
1000    1.9 x 10^2860

Number of (unrooted) binary trees on n leaves is
(2n-5)!!
If each tree on 1000 taxa could be analyzed in
0.001 seconds, finding the best tree by exhaustive
search would take roughly 6 x 10^2846 millennia
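
The double-factorial formula on this slide can be evaluated exactly with arbitrary-precision integers. The sketch below does that and estimates the exhaustive-search time at 0.001 seconds per tree. The function names are made up for this example, and a millennium is assumed to be 1000 years of 365.25 days.

```python
def num_unrooted_binary_trees(n):
    """Return (2n-5)!! = 1 * 3 * 5 * ... * (2n-5) for n >= 3 leaves."""
    if n < 3:
        raise ValueError("the formula applies to n >= 3 leaves")
    count = 1
    for odd in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= odd
    return count


def order_of_magnitude(value):
    """Base-10 exponent of a positive integer (number of digits minus one)."""
    return len(str(value)) - 1


# Assumption: 1000 trees scored per second, 86,400 seconds per day,
# 365,250 days per millennium.
TREES_PER_MILLENNIUM = 1000 * 86_400 * 365_250


def millennia_to_enumerate(n):
    """Whole millennia needed to score every tree on n leaves."""
    return num_unrooted_binary_trees(n) // TREES_PER_MILLENNIUM


for n in (4, 5, 6, 7, 8, 9, 10):
    print(n, num_unrooted_binary_trees(n))
for n in (20, 100, 1000):
    print(n, "about 10^%d trees" % order_of_magnitude(num_unrooted_binary_trees(n)))
print("1000 taxa: about 10^%d millennia" % order_of_magnitude(millennia_to_enumerate(1000)))
```

Working with exact integers avoids floating-point overflow, which would otherwise occur well before n = 1000.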

18
How good an MP analysis do we need?

Our research shows that we need to get within
0.01% of optimal (or even better, on large
datasets) to return reasonable estimates of the
true tree's topology

19
Problems with current techniques for MP
Shown here is the performance of a heuristic
maximum parsimony analysis on a real dataset of
almost 14,000 sequences. ("Optimal" here means the
best score to date, using any method for any
amount of time.) Acceptable error is below 0.01%.
[Figure: performance of TNT with time]
20
Strict Consensus Merger (SCM)
21
Observations

The best MP heuristics cannot get acceptably good
solutions within 24 hours on most of these large
datasets.
Datasets of these sizes may need months (or
years) of further analysis to reach reasonable
solutions.
Apparent convergence can be misleading.

22
Our objective: speed up the best MP heuristics
[Figure (fake study): MP score of best trees vs. time, comparing the performance of a hill-climbing heuristic with the desired performance]
23
DCM3 decomposition

DCM3 decompositions
(1) can be obtained in O(n) time
(2) yield small subproblems
(3) can be used iteratively
(4) can be applied recursively

24
Iterative-DCM3
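[Diagram: a cycle in which the current tree T guides a DCM3 decomposition, the base method is applied to the subproblems, and a new tree T is produced]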
25
New DCMs

DCM3
1. Compute subproblems using DCM3 decomposition
2. Apply base method to each subproblem to yield
subtrees
3. Merge subtrees using the Strict Consensus Merger
technique
4. Randomly refine the merged tree to make it binary
Recursive-DCM3
Iterative DCM3
1. Compute a DCM3 tree
2. Perform local search and go to step 1
Recursive-Iterative DCM3
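
The control flow of DCM3 and Iterative DCM3 listed above can be sketched as follows. This is only a skeleton under stated assumptions: every step (`decompose`, `base_method`, `merge`, `refine`, `local_search`) is a hypothetical callable supplied by the caller, the stand-ins at the bottom are trivial placeholders rather than real phylogenetic routines, and the recursive variants are not shown.

```python
def dcm3_round(tree, taxa, decompose, base_method, merge, refine):
    """One DCM3 pass: decompose, solve subproblems, merge, refine."""
    subproblems = decompose(tree, taxa)               # step 1
    subtrees = [base_method(s) for s in subproblems]  # step 2
    merged = merge(subtrees)                          # step 3 (e.g. SCM)
    return refine(merged)                             # step 4


def iterative_dcm3(tree, taxa, decompose, base_method, merge, refine,
                   local_search, iterations):
    """Repeat: compute a DCM3 tree, improve it by local search, start over."""
    for _ in range(iterations):
        tree = dcm3_round(tree, taxa, decompose, base_method, merge, refine)
        tree = local_search(tree)
    return tree


# Trivial stand-ins, only to show how the pieces plug together.
def toy_decompose(tree, taxa):
    half = len(taxa) // 2
    return [taxa[:half + 1], taxa[half:]]  # two overlapping subsets


def toy_base_method(subset):
    return tuple(subset)


def toy_merge(subtrees):
    return tuple(sorted(set().union(*subtrees)))


def identity(tree):
    return tree


result = iterative_dcm3(None, ["a", "b", "c", "d"], toy_decompose,
                        toy_base_method, toy_merge, identity, identity,
                        iterations=3)
print(result)
```

Passing the steps in as functions keeps the loop independent of the base method, which mirrors the way the slides describe boosting an arbitrary base method such as TNT or PAUP.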

26
Rec-I-DCM3 significantly improves performance
[Figure: comparison of TNT to Rec-I-DCM3(TNT) on one large dataset; the curves show the current best techniques and the DCM-boosted version of the best techniques]
27
Datasets
Obtained from various researchers and online
databases

1322 lsu rRNA of all organisms
2000 Eukaryotic rRNA
2594 rbcL DNA
4583 Actinobacteria 16s rRNA
6590 ssu rRNA of all Eukaryotes
7180 three-domain rRNA
7322 Firmicutes bacteria 16s rRNA
8506 three-domain2org rRNA
11361 ssu rRNA of all Bacteria
13921 Proteobacteria 16s rRNA

28
Rec-I-DCM3(TNT) vs. TNT (comparison of scores at
24 hours)
Base method is the default TNT technique, the
current best method for MP. Rec-I-DCM3
significantly improves upon the unboosted TNT by
returning trees which are at most 0.01% above
optimal on most datasets.
29
Observations

Rec-I-DCM3 improves upon the best performing
heuristics for MP.
The improvement increases with the difficulty of
the dataset.

30
DCMs

DCM for NJ and other distance methods produces
absolute fast-converging (afc) methods
DCMs for MP heuristics
DCMs for use with the GRAPPA software for whole
genome phylogenetic analysis: these have been
shown to let GRAPPA scale from its maximum of
about 15-20 genomes to 1000 genomes.
Current projects: DCM development for maximum
likelihood and multiple sequence alignment.

31
Part II: Whole-Genome Phylogenetics
32
Genomes Evolve by Rearrangements
1 2 3 4 5 6 7 8 9 10

Inversion (Reversal)

Transposition

Inverted Transposition
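
The three rearrangement events named on this slide can be illustrated on the gene order 1 2 ... 10, with a negative sign marking a gene whose orientation is reversed. This is a minimal sketch: the genome is treated as a linear list, and the function names and index conventions (half-open segment `[i, j)`, target position `k >= j`) are choices made for this example.

```python
def inversion(genome, i, j):
    """Reverse the segment genome[i:j] and flip the sign of each gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transposition(genome, i, j, k):
    """Move the segment genome[i:j] to just before original position k (k >= j)."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


def inverted_transposition(genome, i, j, k):
    """Move the segment genome[i:j] as above, and invert it on the way."""
    segment = [-g for g in reversed(genome[i:j])]
    return genome[:i] + genome[j:k] + segment + genome[k:]


genome = list(range(1, 11))
print(inversion(genome, 2, 6))               # [1, 2, -6, -5, -4, -3, 7, 8, 9, 10]
print(transposition(genome, 2, 6, 8))        # [1, 2, 7, 8, 3, 4, 5, 6, 9, 10]
print(inverted_transposition(genome, 2, 6, 8))  # [1, 2, 7, 8, -6, -5, -4, -3, 9, 10]
```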

33
Genome Rearrangement Has A Huge State Space

DNA sequences: 4 states per site
Signed circular genomes with n genes:
2^(n-1) (n-1)! states, 1 site
Circular genomes (1 site)
with 37 genes: about 2.56 x 10^52 states
with 120 genes: about 3.70 x 10^232 states
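
The size of this state space can be computed directly. The sketch below assumes the standard count for signed circular genomes on n genes: n! orderings times 2^n sign choices, divided by the n rotations and the 2 reading directions of the circle, which gives 2^(n-1) (n-1)!. The function name is made up for this example.

```python
from math import factorial


def signed_circular_genomes(n):
    """Number of distinct signed circular gene orders on n genes."""
    # n! * 2**n arrangements, identified under n rotations and 2 strands.
    return 2 ** (n - 1) * factorial(n - 1)


print("DNA site: 4 states")
for n in (37, 120):
    # Both values fit in a float, so scientific notation is safe here.
    print(n, "genes:", f"{float(signed_circular_genomes(n)):.2e}", "states")
```

The comparison with 4 states per DNA site is the point of the slide: a whole genome behaves like a single character with an astronomically large number of states.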

34
Why use gene orders?

Rare genomic changes: huge state space and
relative infrequency of events (compared to site
substitutions) could make the inference of deep
evolution easier, or more accurate.
Our research shows this is true, but accurate
analysis of gene order data is computationally
very intensive!

35
Maximum Parsimony on Rearranged Genomes (MPRG)

The leaves are rearranged genomes.
Find the tree that minimizes the total number of
rearrangement events (NP-hard)

36
Solving the inversion phylogeny

Usual issue of getting stuck in local optima,
since the optimization problems are NP-hard
Additional problem: finding the best trees is
enormously hard, since even the point
estimation problem is hard (worse than
estimating branch lengths in ML).

[Figure: MP score across the space of phylogenetic trees, showing a local optimum and the global optimum]
37
Benchmark gene order dataset: Campanulaceae

12 genomes + 1 outgroup (Tobacco), 105 gene
segments
NP-hard optimization problems: breakpoint and
inversion phylogenies (techniques score every
tree)
Joint work with Bob Jansen, Linda Raubeson, Jijun
Tang, and Li-San Wang
1997: BPAnalysis (Blanchette and Sankoff): 200
years (est.)
2000: Using GRAPPA v1.1 on the 512-processor Los
Lobos Supercluster machine: 2 minutes
(200,000-fold speedup per processor)
2003: Using latest version of GRAPPA: 2 minutes
on a single processor (1-billion-fold speedup per
processor)
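
Breakpoint phylogenies build on the breakpoint distance between two gene orders, which counts the adjacencies of one genome that are missing from the other. The sketch below is a simplified illustration, not GRAPPA's implementation: it treats genomes as linear signed permutations framed by end markers 0 and n+1 (the chloroplast genomes in this benchmark are circular), and the function name is made up for this example.

```python
def breakpoint_distance(g1, g2):
    """Count adjacencies of g1 that do not occur in g2 (signed, linear)."""
    n = len(g1)
    # Frame both genomes with fixed end markers 0 and n + 1.
    a = [0] + list(g1) + [n + 1]
    b = [0] + list(g2) + [n + 1]
    adjacencies = set()
    for x, y in zip(b, b[1:]):
        adjacencies.add((x, y))
        adjacencies.add((-y, -x))  # the same adjacency read on the other strand
    return sum((x, y) not in adjacencies for x, y in zip(a, a[1:]))


identity = [1, 2, 3, 4, 5]
inverted = [1, -4, -3, -2, 5]  # genes 2..4 inverted as a block
print(breakpoint_distance(identity, identity))  # 0
print(breakpoint_distance(identity, inverted))  # 2
```

A single inversion creates at most two breakpoints, one at each end of the inverted segment, which is why the second example reports 2.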

38
GRAPPA (Genome Rearrangement Analysis under
Parsimony and other Phylogenetic Algorithms)

http://www.cs.unm.edu/moret/GRAPPA/
Heuristics for NP-hard optimization problems
Fast polynomial time distance-based methods
Contributors: U. New Mexico, U. Texas at Austin,
Università di Bologna, Italy
Freely available in source code at this site.
Project leader: Bernard Moret (UNM)
(moret_at_cs.unm.edu)

39
Limitations and ongoing research

Current methods are mostly limited to single
chromosomes with equal gene content (or very
small amounts of deletions and duplications).
We have made some progress on developing a
reliable distance-based method for chromosomes
with unequal gene content (tests on real and
simulated data show high accuracy)
Handling the multiple chromosome case is harder

40
Acknowledgements