CIPRES: Enabling Tree of Life Projects - PowerPoint PPT Presentation

About This Presentation
Title:

CIPRES: Enabling Tree of Life Projects

Description:

... of New Mexico Bernard Moret David Bader UCSD/SDSC Fran Berman Alex Borchers Phil Bourne John Huelsenbeck Terri Liebowitz Mark Miller University of ... – PowerPoint PPT presentation

Provided by: tandyw

Transcript and Presenter's Notes

Title: CIPRES: Enabling Tree of Life Projects


1
CIPRES: Enabling Tree of Life Projects
  • Tandy Warnow
  • The University of Texas at Austin

2
Phylogeny
From the Tree of Life Website, University of
Arizona
[Figure: tree with leaves Orangutan, Human, Gorilla,
Chimpanzee]
3
Reconstructing the Tree of Life
Handling large datasets: millions of species
The Tree of Life is not really a tree:
reticulate evolution
4
Cyber Infrastructure for Phylogenetic Research
Purpose: to create a national infrastructure of
hardware, open source software, database
technology, etc., necessary to infer the Tree of
Life.
Group: 40 biologists, computer scientists, and
mathematicians from 13 institutions.
Funding: $11.6M (large ITR grant from NSF).
URL: http://www.phylo.org
5
CIPRES Members
6
CIPRES activity
  • Databases - e.g. TreeBase II (Bill Piel and
    others)
  • Simulations of large-scale complex genome-scale
    evolution (Junhyong Kim)
  • Outreach (Michael Donoghue and Brent Mishler)
  • Algorithms (Tandy Warnow)
  • Open source software (Wayne Maddison, Dave
    Swofford, Mark Holder, and Bernard Moret)
  • Computer cluster at SDSC (Fran Berman and Mark
    Miller) - available to ATOL projects and other
    groups with datasets above 1000 taxa

7
Phylogeny Problem
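[Figure: input is five taxa labelled U, V, W, X, Y
with the sequences TAGCCCA, TAGACTT, TGCACAA,
TGCGCTT, AGGGCAT; output is a tree on the leaves
X, U, Y, V, W]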
8
Steps in a phylogenetic analysis
  • Gather data
  • Align sequences
  • Estimate phylogeny (or combine with previous
    step)
  • Estimate the reliable aspects of the evolutionary
    history (using bootstrapping, consensus trees, or
    other methods)
  • Perform post-tree analyses.

9
CIPRES research in algorithms
  • Multiple sequence alignment
  • Genomic alignment
  • Heuristics for Maximum Parsimony and Maximum
    Likelihood
  • Bayesian MCMC methods
  • Supertree methods
  • Whole genome phylogeny reconstruction
  • Reticulate evolution detection and reconstruction
  • Data mining on sets of trees, and compact
    representations of these sets

10
Software distributions
  • The first distribution (in the coming months)
    will focus on Rec-I-DCM3(PAUP): fast heuristic
    searches for maximum parsimony on large datasets,
    for PAUP users
  • All software will be open source
  • Community contributions to software will be
    enabled

11
Phylogenetic reconstruction methods
  1. Heuristics for hard optimization criteria
    (Maximum Parsimony and Maximum Likelihood): hard
    to solve on large datasets
  2. Polynomial time distance-based methods (Neighbor
    Joining, FastME, Weighbor, etc.): poor accuracy
    on datasets with large evolutionary distances
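
Distance-based methods such as Neighbor Joining start from a matrix of pairwise distances between the sequences. The following is a minimal sketch of building that input (not of Neighbor Joining itself), assuming the simple uncorrected p-distance on already-aligned sequences. The function names are hypothetical, and the sequences are the five shown on the Phylogeny Problem slide.

```python
from itertools import combinations

# The five aligned sequences from the "Phylogeny Problem" slide.
SEQUENCES = ["TAGCCCA", "TAGACTT", "TGCACAA", "TGCGCTT", "AGGGCAT"]


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """Map every unordered pair of sequences to its p-distance."""
    return {(s, t): p_distance(s, t) for s, t in combinations(seqs, 2)}


if __name__ == "__main__":
    for (s, t), d in distance_matrix(SEQUENCES).items():
        print(f"{s}  {t}  {d:.3f}")
```

Real analyses usually apply a model-based correction to these raw distances, which matters most when evolutionary distances are large.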

12
DCMs: Divide-and-conquer for improving phylogeny
reconstruction
13
Boosting phylogeny reconstruction methods
  • DCMs boost the performance of phylogeny
    reconstruction methods.

[Diagram: DCM combined with base method M gives
DCM-M]
14
DCMs (Disk-Covering Methods)
  • DCMs for polynomial time methods improve
    topological accuracy (empirical observation), and
    have provable theoretical guarantees under Markov
    models of evolution
  • DCMs for hard optimization problems reduce
    running time needed to achieve good levels of
    accuracy (empirical observation)

15
DCM1-boosting distance-based methods (Nakhleh et
al., ISMB 2001)
  • DCM1-boosting makes distance-based methods more
    accurate
  • Theoretical guarantees that DCM1-NJ converges to
    the true tree from polynomial length sequences

[Figure: error rate (0 to 0.8) versus number of
taxa (0 to 1600) for NJ and DCM1-NJ]
16
Major challenge: MP and ML
  • Maximum Parsimony (MP) and Maximum Likelihood
    (ML) remain the methods of choice for most
    systematists
  • The main challenge here is to make it possible to
    obtain good solutions to MP or ML in reasonable
    time periods on large datasets
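
For context on what an MP score is, the sketch below scores one fixed tree with Fitch's algorithm, which counts the minimum number of state changes needed per site. It only evaluates a single tree; the hard part described above is searching over all trees. The leaf labels s1 to s5, the example tree shape, and the function names are hypothetical and not taken from the slides.

```python
def _fitch_site(node, states):
    """Return (candidate state set, number of changes) for one site below node."""
    if isinstance(node, str):                 # leaf: its observed state, no changes
        return {states[node]}, 0
    (lset, lcost), (rset, rcost) = (_fitch_site(child, states) for child in node)
    common = lset & rset
    if common:                                # children agree: no extra change
        return common, lcost + rcost
    return lset | rset, lcost + rcost + 1     # children disagree: one change


def parsimony_score(tree, seqs):
    """Parsimony score of a binary tree given as nested 2-tuples of leaf names."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        _fitch_site(tree, {name: s[i] for name, s in seqs.items()})[1]
        for i in range(n_sites)
    )


if __name__ == "__main__":
    # Hypothetical labels attached to the five slide-7 sequences.
    seqs = {"s1": "TAGCCCA", "s2": "TAGACTT", "s3": "TGCACAA",
            "s4": "TGCGCTT", "s5": "AGGGCAT"}
    tree = (("s1", "s2"), (("s3", "s4"), "s5"))   # arbitrary example topology
    print(parsimony_score(tree, seqs))
```

The tree is written as rooted only for convenience; the unweighted parsimony score does not depend on where the root is placed.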

17
Solving NP-hard problems exactly is unlikely
leaves   trees
4        3
5        15
6        105
7        945
8        10395
9        135135
10       2027025
20       2.2 x 10^20
100      4.5 x 10^190
1000     2.7 x 10^2900
  • Number of (unrooted) binary trees on n leaves is
    (2n-5)!!
  • If each tree on 1000 taxa could be analyzed in
    0.001 seconds, we would find the best tree in
    2890 millennia
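
The (2n-5)!! count can be checked directly for small n. The sketch below multiplies the odd numbers 1 * 3 * 5 * ... * (2n-5); the function name is hypothetical.

```python
def num_unrooted_binary_trees(n: int) -> int:
    """Number of unrooted binary trees on n labelled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for odd in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n-5
        count *= odd
    return count


if __name__ == "__main__":
    for n in range(4, 11):
        print(n, num_unrooted_binary_trees(n))
```

Because Python integers have arbitrary precision, the same function can be called for much larger n without overflow.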

18
How good an MP analysis do we need?
  • Our research shows that we need to get within
    0.01% of optimal (or even better, on large
    datasets) to return reasonable estimates of the
    true tree's topology

19
Problems with current techniques for MP
Shown here is the performance of a heuristic
maximum parsimony analysis on a real dataset of
almost 14,000 sequences. ("Optimal" here means
best score to date, using any method for any
amount of time.) Acceptable error is below 0.01%.
[Figure: performance of TNT with time]
20
Strict Consensus Merger (SCM)
21
Observations
  • The best MP heuristics cannot get acceptably good
    solutions within 24 hours on most of these large
    datasets.
  • Datasets of these sizes may need months (or
    years) of further analysis to reach reasonable
    solutions.
  • Apparent convergence can be misleading.

22
Our objective: speed up the best MP heuristics
[Figure (fake study): MP score of best trees versus
time, showing the performance of a hill-climbing
heuristic and the desired performance]
23
DCM3 decomposition
  • DCM3 decompositions
    (1) can be obtained in O(n) time
    (2) yield small subproblems
    (3) can be used iteratively
    (4) can be applied recursively

24
Iterative-DCM3
[Diagram: a tree T is passed to DCM3 and the base
method, which yield a new tree T]
25
New DCMs
  • DCM3
    1. Compute subproblems using DCM3 decomposition
    2. Apply base method to each subproblem to yield
       subtrees
    3. Merge subtrees using the Strict Consensus
       Merger technique
    4. Randomly refine to make it binary
  • Recursive-DCM3
  • Iterative DCM3
    1. Compute a DCM3 tree
    2. Perform local search and go to step 1
  • Recursive-Iterative DCM3
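
The control flow of the DCM3 and Iterative DCM3 steps listed above can be sketched as follows. This is only a skeleton under stated assumptions: the decomposition, base method, Strict Consensus Merger, random refinement, and local search are passed in as caller-supplied functions, and every name here is hypothetical rather than part of the CIPRES software.

```python
def dcm3_pass(tree, decompose, base_method, merge, refine):
    """One DCM3 pass: decompose, solve each subproblem, merge, then refine."""
    subproblems = decompose(tree)                     # DCM3 decomposition guided by the tree
    subtrees = [base_method(sp) for sp in subproblems]
    merged = merge(subtrees)                          # e.g. a Strict Consensus Merger
    return refine(merged)                             # e.g. random refinement to a binary tree


def iterative_dcm3(start_tree, n_iterations, decompose, base_method,
                   merge, refine, local_search):
    """Repeat: compute a DCM3 tree, then improve it by local search."""
    tree = start_tree
    for _ in range(n_iterations):
        tree = dcm3_pass(tree, decompose, base_method, merge, refine)
        tree = local_search(tree)
    return tree
```

The slides do not state a stopping rule, so the fixed iteration count is an assumption; a time budget or a score threshold would fit the same loop.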

26
Rec-I-DCM3 significantly improves performance
[Figure: comparison of TNT to Rec-I-DCM3(TNT) on one
large dataset; curves show the current best
techniques and the DCM-boosted version of the best
techniques]
27
Datasets
Obtained from various researchers and online
databases
  • 1322 lsu rRNA of all organisms
  • 2000 Eukaryotic rRNA
  • 2594 rbcL DNA
  • 4583 Actinobacteria 16s rRNA
  • 6590 ssu rRNA of all Eukaryotes
  • 7180 three-domain rRNA
  • 7322 Firmicutes bacteria 16s rRNA
  • 8506 three-domain2org rRNA
  • 11361 ssu rRNA of all Bacteria
  • 13921 Proteobacteria 16s rRNA

28
Rec-I-DCM3(TNT) vs. TNT (comparison of scores at
24 hours)
Base method is the default TNT technique, the
current best method for MP. Rec-I-DCM3
significantly improves upon the unboosted TNT by
returning trees which are at most 0.01% above
optimal on most datasets.
29
Observations
  • Rec-I-DCM3 improves upon the best performing
    heuristics for MP.
  • The improvement increases with the difficulty of
    the dataset.

30
DCMs
  • DCM for NJ and other distance methods produces
    absolute fast converging (afc) methods
  • DCMs for MP heuristics
  • DCMs for use with the GRAPPA software for whole
    genome phylogenetic analysis: these have been
    shown to let GRAPPA scale from its maximum of
    about 15-20 genomes to 1000 genomes.
  • Current projects: DCM development for maximum
    likelihood and multiple sequence alignment.

31
Part II: Whole-Genome Phylogenetics
32
Genomes Evolve by Rearrangements
1 2 3 4 5 6 7 8 9 10
  • Inversion (Reversal)
  • Transposition
  • Inverted Transposition
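
As an illustration of the three operations, the sketch below applies each of them to the gene order 1 to 10. It assumes a genome is represented as a Python list of signed integers, where the sign gives the strand, and treats the list as linear for simplicity. The function names and the index convention are hypothetical.

```python
def inversion(genome, i, j):
    """Reverse the segment genome[i:j] and flip the sign of every gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transposition(genome, i, j, k):
    """Move the segment genome[i:j] to just after genome[j:k] (needs i < j < k)."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


def inverted_transposition(genome, i, j, k):
    """Like transposition, but the moved segment is also inverted."""
    moved = [-g for g in reversed(genome[i:j])]
    return genome[:i] + genome[j:k] + moved + genome[k:]


if __name__ == "__main__":
    genome = list(range(1, 11))
    print(inversion(genome, 2, 5))
    print(transposition(genome, 2, 5, 8))
    print(inverted_transposition(genome, 2, 5, 8))
```

Each function returns a new list and leaves its input unchanged, which makes it easy to chain operations when simulating a sequence of events.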

33
Genome Rearrangement Has A Huge State Space
  • DNA sequences: 4 states per site
  • Signed circular genomes with n genes: 1 site,
    with a number of states that grows rapidly
    with n
  • Examples: circular genomes (1 site) with 37
    genes, or with 120 genes
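
As a rough sketch of how large this state space is, the snippet below assumes the standard count of signed circular gene orders on n genes, 2^(n-1) * (n-1)!, in which rotations of the circle and reading the opposite strand are treated as the same genome. That formula and the function name are assumptions brought in for illustration, not figures taken from the slide.

```python
from math import factorial


def signed_circular_genomes(n: int) -> int:
    """Assumed count of distinct signed circular orders of n genes: 2^(n-1) * (n-1)!."""
    if n < 1:
        raise ValueError("need at least one gene")
    return 2 ** (n - 1) * factorial(n - 1)


if __name__ == "__main__":
    for n in (37, 120):
        print(f"{n} genes: {signed_circular_genomes(n):.2e} states")
```

Compare this with the 4 states per site of a DNA sequence: the whole genome acts as a single character with an astronomically large number of states.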

34
Why use gene orders?
  • Rare genomic changes: huge state space and
    relative infrequency of events (compared to site
    substitutions) could make the inference of deep
    evolution easier, or more accurate.
  • Our research shows this is true, but accurate
    analysis of gene order data is computationally
    very intensive!

35
Maximum Parsimony on Rearranged Genomes (MPRG)
  • The leaves are rearranged genomes.
  • Find the tree that minimizes the total number of
    rearrangement events (NP-hard)

36
Solving the inversion phylogeny
  • Usual issue of getting stuck in local optima,
    since the optimization problems are NP-hard
  • Additional problem: finding the best trees is
    enormously hard, since even the point
    estimation problem is hard (worse than
    estimating branch lengths in ML).

[Figure: MP score over the space of phylogenetic
trees, showing a local optimum and the global
optimum]
37
Benchmark gene order dataset: Campanulaceae
  • 12 genomes + 1 outgroup (Tobacco), 105 gene
    segments
  • NP-hard optimization problems: breakpoint and
    inversion phylogenies (techniques score every
    tree)
  • Joint work with Bob Jansen, Linda Raubeson, Jijun
    Tang, and Li-San Wang
  • 1997: BPAnalysis (Blanchette and Sankoff), 200
    years (est.)
  • 2000: Using GRAPPA v1.1 on the 512-processor Los
    Lobos Supercluster machine, 2 minutes
    (200,000-fold speedup per processor)
  • 2003: Using latest version of GRAPPA, 2 minutes
    on a single processor (1-billion-fold speedup per
    processor)
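
The breakpoint phylogeny mentioned above is built on the breakpoint distance between two gene orders. A minimal sketch of that pairwise distance for signed circular genomes with equal gene content follows. It is not GRAPPA's or BPAnalysis's implementation and does not search over trees; the function names are hypothetical.

```python
def _adjacencies(genome):
    """Adjacencies of a signed circular genome, in a strand-independent form."""
    adj = set()
    n = len(genome)
    for idx in range(n):
        a, b = genome[idx], genome[(idx + 1) % n]   # wrap around the circle
        # (a, b) read on one strand is (-b, -a) on the other; keep one canonical copy.
        adj.add(min((a, b), (-b, -a)))
    return adj


def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that do not occur in g2."""
    if sorted(map(abs, g1)) != sorted(map(abs, g2)):
        raise ValueError("genomes must have equal gene content")
    return len(_adjacencies(g1) - _adjacencies(g2))


if __name__ == "__main__":
    print(breakpoint_distance([1, 2, 3, 4], [1, -3, -2, 4]))
```

In the usage line, the second genome differs from the first by a single inversion of genes 2 and 3, which breaks the two adjacencies at the ends of the inverted segment.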

38
GRAPPA (Genome Rearrangement Analysis under
Parsimony and other Phylogenetic Algorithms)
  • http://www.cs.unm.edu/~moret/GRAPPA/
  • Heuristics for NP-hard optimization problems
  • Fast polynomial time distance-based methods
  • Contributors: U. New Mexico, U. Texas at Austin,
    Università di Bologna, Italy
  • Freely available in source code at this site.
  • Project leader: Bernard Moret (UNM)
    (moret_at_cs.unm.edu)

39
Limitations and ongoing research
  • Current methods are mostly limited to single
    chromosomes with equal gene content (or very
    small amounts of deletions and duplications).
  • We have made some progress on developing a
    reliable distance-based method for chromosomes
    with unequal gene content (tests on real and
    simulated data show high accuracy)
  • Handling the multiple chromosome case is harder

40
Acknowledgements
  • NSF
  • The David and Lucile Packard Foundation
  • The Program in Evolutionary Dynamics at Harvard
  • The Institute for Cellular and Molecular Biology
    at UT-Austin
  • See http://www.phylo.org and
    http://www.cs.utexas.edu/~tandy for more info