The Tree of Life Initiative

The Tree of Life Initiative. Tandy Warnow. Radcliffe Institute. The University of Texas at Austin ... Daniel Miranker. Usman Roshan. Luay Nakhleh. University ...

Provided by: tandyw

1
The Tree of Life Initiative
  • Tandy Warnow
  • Radcliffe Institute
  • The University of Texas at Austin

2
Phylogeny
From the Tree of Life website, University of Arizona
[Figure: tree relating Orangutan, Human, Gorilla, and Chimpanzee]
3
Reconstructing the Tree of Life
Handling large datasets: millions of species
4
Evolution informs about everything in biology
  • Big genome sequencing projects just produce data.
    So what?
  • Evolutionary history relates all organisms and
    genes, and helps us understand and predict
    • interactions between genes (genetic networks)
    • drug design
    • predicting functions of genes
    • influenza vaccine development
    • origins and spread of disease
    • origins and migrations of humans

5
  • Cyber Infrastructure for Phylogenetic
    Research
  • Purpose: to create a national infrastructure of
    hardware, open source software, database
    technology, etc., necessary to infer the Tree of
    Life.

6
  • Funded by a $11.6M ITR (Information Technology
    Research) Grant from NSF
  • 40 biologists, computer scientists, and
    mathematicians collaborating on the project

7
CIPRes Members
8
What are the Components of the CIPRes Project?
Software Development
Outreach
Algorithms
Production
Database
Simulations
9
What are the Components of the CIPRes Project?
Software Development
Simulations
Production
Algorithms
Database
10
Main Research Projects
  • Databases (Miranker and Donoghue)
  • Simulations (Kim)
  • Algorithms (Warnow)
  • Software (Moret, W. Maddison, and Swofford)

11
Algorithms group
  • John Huelsenbeck (UCSD)
  • Warren Hunt (Texas)
  • Dick Karp (Berkeley)
  • Bernard Moret (UNM)
  • Elchanan Mossel (Berkeley)
  • Gene Myers (Berkeley)
  • Christos Papadimitriou (Berkeley)
  • Satish Rao (Berkeley)
  • Stuart Russell (Berkeley)
  • Tandy Warnow (Texas)
  • Tiffani Williams (UNM)

12
This talk
  • New techniques to solve Maximum Parsimony and
    Maximum Likelihood on massive datasets
  • The GRAPPA software for whole genome phylogeny
    reconstruction
  • Both projects joint with Bernard Moret at UNM

13
Campanulaceae chloroplast genome phylogeny
[Figure: tree on Merciera, Wahlenbergia, Triodanis, Legousia, Asyneuma, Trachelium, Symphyandra, Campanula, Codonopsis, Tobacco, Adenophora, Cyananthus, and Platycodon]
The strict consensus of 24 trees, each with
inversion length of 64. Finished within 40
minutes on a laptop using GRAPPA version 1.8
14
DNA Sequence Evolution
15
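Phylogeny Problem
[Figure: input of five taxa (U, V, W, X, Y) with the aligned sequences TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, and AGGGCAT; output is a tree with leaves X, U, Y, V, W]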
16
Methods for phylogenetic inference
  • Polynomial time methods, mostly based upon
    estimating evolutionary distances
  • Heuristics for hard optimization problems (such
    as maximum parsimony and maximum likelihood)
  • Bayesian methods

17
Standard problem: Maximum Parsimony (Hamming
distance Steiner Tree)
  • Input: Set S of n aligned sequences of length k
  • Output: A phylogenetic tree T
    • leaf-labeled by sequences in S
    • additional sequences of length k labeling the
      internal nodes of T
    • such that $\sum_{(u,v) \in E(T)} H(u,v)$ is
      minimized, where H(u,v) is the Hamming distance
      between the sequences labeling u and v.
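
The quantity being minimized is the total Hamming distance over the edges of a fully labeled tree. A minimal Python sketch follows; the edge-list representation and the function names `hamming` and `tree_length` are illustrative assumptions, not part of the talk.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def tree_length(edges):
    """Parsimony length of a fully labeled tree.

    `edges` is a list of (label_u, label_v) pairs, one per tree edge, where
    each label is the sequence at that node (leaf or internal).
    """
    return sum(hamming(u, v) for u, v in edges)


# Toy example: four leaves, two internal nodes labeled ACA and GTA.
example_edges = [
    ("ACT", "ACA"),  # leaf ACT to internal node ACA
    ("ACA", "ACA"),  # leaf ACA to internal node ACA
    ("ACA", "GTA"),  # the internal edge
    ("GTA", "GTA"),  # leaf GTA to internal node GTA
    ("GTA", "GTT"),  # leaf GTT to internal node GTA
]
example_length = tree_length(example_edges)
```

Maximum parsimony asks for the topology and internal labels that make this sum as small as possible.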

18
Maximum parsimony (example)
  • Input: four sequences
    • ACT
    • ACA
    • GTT
    • GTA
  • Question: which of the three trees has the best
    MP score?

19
Maximum Parsimony (MP)
[Figure: the three possible trees on the leaves ACT, ACA, GTT, and GTA]
20
Maximum Parsimony (MP)
[Figure: the three candidate trees with labeled internal nodes and substitution counts on the edges. Their MP scores are 7, 5, and 4; the tree with MP score 4 is the optimal MP tree.]
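
For a fixed topology, the best internal labels can be found efficiently. The sketch below uses Fitch's algorithm, which the slides do not name, to score the three quartet topologies on ACT, ACA, GTT, and GTA. The nested-tuple tree encoding and the name `fitch_score` are assumptions made for illustration.

```python
def fitch_score(tree):
    """Minimum number of substitutions needed on a rooted binary tree.

    A tree is either a leaf (a sequence string) or a pair (left, right).
    Rooting an unrooted tree on any edge gives the same score.
    """
    def first_leaf(node):
        return node if isinstance(node, str) else first_leaf(node[0])

    changes = 0

    def state_set(node, site):
        # Fitch's bottom-up pass: intersect child sets when possible,
        # otherwise take the union and count one substitution.
        nonlocal changes
        if isinstance(node, str):
            return {node[site]}
        left = state_set(node[0], site)
        right = state_set(node[1], site)
        if left & right:
            return left & right
        changes += 1
        return left | right

    for site in range(len(first_leaf(tree))):
        state_set(tree, site)
    return changes


quartets = {
    "ACT,ACA | GTT,GTA": (("ACT", "ACA"), ("GTT", "GTA")),
    "ACT,GTT | ACA,GTA": (("ACT", "GTT"), ("ACA", "GTA")),
    "ACT,GTA | ACA,GTT": (("ACT", "GTA"), ("ACA", "GTT")),
}
scores = {name: fitch_score(tree) for name, tree in quartets.items()}
best = min(scores, key=scores.get)
```

The topology that pairs ACT with ACA scores 4, matching the optimal tree on the slide. Fitch's count is a minimum over all internal labelings, so a tree drawn with particular internal labels can have a larger length than the value computed here.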
21
Maximum Parsimony computational complexity
22
Maximum likelihood
  • For a given model (e.g., K2P, HKY, GTR, etc.), try
    to find the tree T and its associated parameters
    (substitution matrices, edge lengths, etc.) that
    maximize
  • Pr[S | T, parameters]
  • Even harder in practice than MP.

23
Solving NP-hard problems exactly is unlikely
  • Number of (unrooted) binary trees on n leaves is
    (2n-5)!!
  • If each tree on 1000 taxa could be analyzed in
    0.001 seconds, we would find the best tree in
  • 2890 millennia
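
To get a feel for how quickly (2n-5)!! grows, here is a small Python sketch that counts unrooted binary trees exactly. The function name is an illustrative assumption.

```python
import math


def num_unrooted_binary_trees(n):
    """(2n-5)!! = 1 * 3 * 5 * ... * (2n-5), the number of unrooted
    binary trees on n labeled leaves (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count


# Order of magnitude for 1000 taxa (math.log10 accepts arbitrarily large ints).
log10_trees_1000 = math.log10(num_unrooted_binary_trees(1000))
```

For n = 1000 the count is roughly 10^2860, so exhaustive search is hopeless at any realistic per-tree cost.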

24
Approaches for solving MP
  • Hill-climbing heuristics (which can get stuck in
    local optima)
  • Randomized algorithms for getting out of local
    optima
  • Approximation algorithms for MP (based upon
    Steiner Tree approximation algorithms).

26
Main research objectives
  • Determine the best current methods available for
    MP (PAUP, TNT, etc.) and ML (???), and then
    improve upon them
  • Focus on performance within one day, one week, or
    one month, on large real datasets (1K to 20K
    sequences for MP)
  • Final objective is hundreds of thousands (or
    millions) of sequences.

27
Initial results
  • TNT (by Pablo Goloboff) is much better at solving
    Maximum Parsimony than the other software we
    studied (especially on large datasets).
  • Our research (Moret, Roshan, Warnow, and
    Williams) shows that we need to get within 0.01%
    of optimal MP scores (or even better, on large
    datasets) to return reasonable estimates of an
    optimal tree's topology.
  • ML heuristics don't seem to be able to analyze
    large datasets with any accuracy.

28
Datasets
Obtained from various researchers and online
databases
  • 1322 lsu rRNA of all organisms
  • 2000 Eukaryotic rRNA
  • 2594 rbcL DNA
  • 4583 Actinobacteria 16s rRNA
  • 6590 ssu rRNA of all Eukaryotes
  • 7180 three-domain rRNA
  • 7322 Firmicutes bacteria 16s rRNA
  • 8506 three-domain2org rRNA
  • 11361 ssu rRNA of all Bacteria
  • 13921 Proteobacteria 16s rRNA

29
Problems with current techniques for MP
Average MP scores above optimal of best methods
at 24 hours across 10 datasets
Best current techniques fail to reach 0.01% of
optimal at the end of 24 hours on large datasets
30
Problems with current techniques for MP
The best current method (default TNT) fails to
reach acceptable levels of accuracy (0.01% of
optimal) within 24 hours on many large datasets.
Evidence suggests that this level will not be
reached for weeks or months (or more) of further
analysis.
[Figure: performance of TNT with time]
31
Observations
  • The best methods cannot get acceptably good
    solutions within 24 hours on most of these large
    datasets.
  • Datasets of these sizes may need months (or
    years) of further analysis to reach reasonable
    solutions.
  • Apparent convergence can be misleading.

34
Our objective: speed up the best MP heuristics
[Figure (fake study): MP score of best trees versus time, comparing the performance of a hill-climbing heuristic with the desired performance]
35
  • How can we improve upon existing techniques?

36
  • Tree Bisection and Reconnection (TBR)

37
  • Tree Bisection and Reconnection (TBR)

Delete an edge
38
  • Tree Bisection and Reconnection (TBR)

39
  • Tree Bisection and Reconnection (TBR)

Reconnect the trees with a new edge that
bifurcates an edge in each tree
40
A conjecture as to why current techniques are
poor
  • Our studies suggest that trees with near-optimal
    scores tend to be topologically close (RF
    distance less than 15) to the other near-optimal
    trees.
  • The standard technique (TBR) for moving around
    tree space explores O(n^3) trees, which are mostly
    topologically distant.
  • So TBR may be useful initially (to reach near
    optimality), but after that more localized
    searches are more productive.
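
The RF (Robinson-Foulds) distance used above counts the nontrivial bipartitions found in exactly one of two trees on the same leaf set. The following Python sketch computes the unnormalized count. The nested-tuple encoding and function names are assumptions for illustration, and some tools report half this value or a percentage instead.

```python
def leaf_set(tree):
    """All leaf names below a node; a leaf is a string, an internal node a tuple."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Nontrivial leaf bipartitions induced by the edges of the tree.

    Each split is stored as the side that does NOT contain a fixed anchor
    leaf, so the same split always has the same representation.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if anchor in clade else clade
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
```

Here `rf_distance(t1, t2)` counts the two splits that differ (AB|CDE in one tree, AC|BDE in the other), while the shared split DE|ABC contributes nothing.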

41
Disk-Covering Methods (DCMs)
  • DCMs are divide-and-conquer methods that our
    group has developed for use in phylogeny
    reconstruction
  • DCM2 was designed for speeding up maximum
    parsimony and maximum likelihood heuristics. DCM2
    was good enough for PAUP.
  • DCM3 is a recent improvement over DCM2 which
    enables iteration (and gives smaller subproblems)
    - and is good enough for TNT.

42
Boosting MP heuristics
  • These DCMs boost the performance of phylogeny
    reconstruction methods.

DCM
Base method M
DCM-M
43
DCM3 technique for speeding up MP searches
44
DCM3: a new DCM decomposition
  • DCM3 decompositions
    (1) are faster to compute than earlier DCMs,
    (2) yield smaller subproblems than DCM2, and
    (3) can be used iteratively

45
Short Subtree Graph
  • Given an edge-weighted tree T, for every edge e,
    let X(e) be the set of leaves in each of the four
    subtrees around e that are closest to e.
  • Let G be the union of cliques on all the sets X(e).
  • Theorem: G is triangulated. Hence a clique
    separator in G can be found in polynomial time.

46
Strict Consensus Merger (SCM)
47
Computing DCM3 decompositions
  • An optimal DCM3 decomposition takes O(n^3) time to
    compute, the same as for DCM2
  • The centroid edge DCM3 decomposition can be
    computed in O(n^2) time
  • An approximate centroid edge decomposition can be
    computed in O(n) time

48
Iterative-DCM3
[Figure: cycle T → DCM3 → base method → T]
49
New DCMs
  • DCM3
    1. Compute subproblems using DCM3 decomposition
    2. Apply base method to each subproblem to yield
       subtrees
    3. Merge subtrees using the Strict Consensus Merger
       technique
    4. Randomly refine to make it binary
  • Recursive-DCM3
  • Iterative DCM3
    1. Compute a DCM3 tree
    2. Perform local search and go to step 1
  • Recursive-Iterative DCM3

50
Boosting MP heuristics
  • We examine DCMs using DCM2 and DCM3, and using
    recursion and/or iteration.

DCM
Base method M
DCM-M
51
Performance Study
  • How well do these boosted versions of the best
    MP heuristics perform, compared to the best MP
    heuristics?
  • We examine performance with respect to optimal
    MP scores (best found so far, using any method)
    for a number of very large datasets, over 24
    hours.
  • The benchmark MP heuristic is the default TNT.

52
Datasets
Obtained from various researchers and online
databases
  • 1322 lsu rRNA of all organisms
  • 2000 Eukaryotic rRNA
  • 2594 rbcL DNA
  • 4583 Actinobacteria 16s rRNA
  • 6590 ssu rRNA of all Eukaryotes
  • 7180 three-domain rRNA
  • 7322 Firmicutes bacteria 16s rRNA
  • 8506 three-domain2org rRNA
  • 11361 ssu rRNA of all Bacteria
  • 13921 Proteobacteria 16s rRNA

53
Comparison of DCM decompositions (Maximum subset
size)
DCM2 subproblems are almost as large as the full
dataset size on datasets 1 through 4. On
datasets 5-10 DCM2 was too slow to compute a
decomposition within 24 hours.
54
Comparison of DCMs (4583 sequences)
Base method is the TNT-ratchet. DCM2 takes
almost 10 hours to produce a tree and is too
slow to run on larger datasets. Rec-I-DCM3 is the
best method at all times.
55
Comparison of DCMs (13,921 sequences)
Base method is the TNT-ratchet. Note the
improvement in DCMs as we move from the
default to recursion to iteration to
recursion+iteration. On very large datasets
Rec-I-DCM3 gives significant improvements over
unboosted TNT.
56
Rec-I-DCM3 significantly improves performance
Current best techniques
DCM boosted version of best techniques
Comparison of TNT to Rec-I-DCM3(TNT) on one large
dataset
57
Rec-I-DCM3(TNT) vs. TNT (Comparison of scores at
24 hours)
Base method is the default TNT technique, the
current best method for MP. Rec-I-DCM3
significantly improves upon the unboosted TNT by
returning trees which are at most 0.01% above
optimal on most datasets.
58
Summary of Part I
  • Rec-I-DCM3 is a powerful technique for escaping
    local optima, and boosts the performance of the
    best heuristics for solving MP
  • The improvement increases with the difficulty of
    the dataset: Rec-I-DCM3(TNT) is 50 times faster
    than TNT on our hardest datasets, but we expect
    even bigger speedups in our next version
  • DCMs also boost the performance of Maximum
    Likelihood heuristics (not shown)

59
Limitations of DNA phylogenetics
  • Deep evolutionary histories may not be
    recoverable from DNA sequence phylogeny due to
    lack of specificity -- too much noise (homoplasy)
    and insufficient sequence length
  • The systematics community has looked to rare
    genomic changes for better sources of
    phylogenetic signal

60
Part II: Whole-Genome Phylogenetics
61
Genomes As Signed Permutations
1 -5 3 4 -2 -6 or 6 2 -4 -3 5 -1, etc.
62
Genomes Evolve by Rearrangements
1 2 3 4 5 6 7 8 9 10
  • Inversion (Reversal)
  • Transposition
  • Inverted Transposition
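
The three rearrangement events can be written as operations on a signed gene list. This is a minimal sketch that treats the genome as linear for simplicity; the function names and index conventions (half-open slices, with `i <= j <= k`) are illustrative assumptions.

```python
def inversion(genome, i, j):
    """Reverse the block genome[i:j] and flip the sign of each gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transposition(genome, i, j, k):
    """Move the block genome[i:j] to just after genome[j:k], orientation kept."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


def inverted_transposition(genome, i, j, k):
    """Like a transposition, but the moved block is also inverted."""
    block = [-g for g in reversed(genome[i:j])]
    return genome[:i] + genome[j:k] + block + genome[k:]


identity = list(range(1, 11))                         # 1 2 3 4 5 6 7 8 9 10
after_inversion = inversion(identity, 2, 6)           # block 3..6 inverted
after_transposition = transposition(identity, 2, 6, 8)
after_inverted_transposition = inverted_transposition(identity, 2, 6, 8)
```

Each function returns a new list, so the starting genome can be reused to compare the effect of the different events.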

63
Other types of events
  • Duplications, Insertions, and Deletions (changes
    gene content)
  • Fissions and Fusions (for genomes with more than
    one chromosome)
  • These events change the number of copies of each
    gene in each genome (unequal gene content)

64
Genome Rearrangement Has A Huge State Space
  • DNA sequences: 4 states per site
  • Signed circular genomes with n genes:
    $2^{n-1}(n-1)!$ states, 1 site
  • Circular genomes (1 site)
    • with 37 genes (mitochondria): about
      $2.56 \times 10^{52}$ states
    • with 120 genes (chloroplasts): about
      $3.70 \times 10^{232}$ states
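
A short Python sketch for sizing this state space. It assumes the usual count for signed circular genomes, $2^{n-1}(n-1)!$, obtained by fixing the position and orientation of one gene; the function name is illustrative.

```python
import math


def signed_circular_genomes(n):
    """Number of signed circular gene orders on n genes, counting
    arrangements that differ only by rotation as the same genome
    (assumed formula: 2^(n-1) * (n-1)!)."""
    if n < 1:
        raise ValueError("need at least one gene")
    return 2 ** (n - 1) * math.factorial(n - 1)


mitochondrial = signed_circular_genomes(37)   # typical mitochondrial genome
chloroplast = signed_circular_genomes(120)    # typical chloroplast genome
log10_mito = math.log10(mitochondrial)
log10_chloro = math.log10(chloroplast)
```

Under this formula the whole genome is a single character with a state space of about 10^52 for 37 genes and about 10^232 for 120 genes, compared with 4 states per site for DNA.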

65
Why use gene orders?
  • Rare genomic changes: huge state space and
    relative infrequency of events (compared to site
    substitutions) could make the inference of deep
    evolution easier, or more accurate.
  • Our research shows this is true, but accurate
    analysis of gene order data is computationally
    very intensive!

66
Phylogeny reconstruction from gene orders
  • Distance-based reconstruction: estimate pairwise
    distances, and apply methods like
    Neighbor-Joining or Weighbor
  • Maximum Parsimony: find tree with the minimum
    length (inversions, transpositions, or other edit
    distances)
  • Maximum Likelihood: find tree and parameters of
    evolution most likely to generate the observed
    data

67
Maximum Parsimony on Rearranged Genomes (MPRG)
  • The leaves are rearranged genomes.
  • Find the tree that minimizes the total number of
    rearrangement events (e.g., inversion phylogeny
    minimizes the number of inversions)

68
Optimization problems for gene order phylogeny
  • Breakpoint phylogeny: find the phylogeny which
    minimizes the total number of breakpoints
    (NP-hard, even to find the median of three
    genomes)
  • Inversion phylogeny: find the phylogeny which
    minimizes the sum of inversion distances on the
    edges (NP-hard, even to find the median of three
    genomes)
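
The pairwise breakpoint distance that the breakpoint phylogeny sums over edges is easy to compute, unlike the median problem. The sketch below assumes signed circular genomes with equal gene content and treats an adjacency (a, b) as identical to (-b, -a); the function names are illustrative.

```python
def adjacencies(genome):
    """All adjacencies of a signed circular genome, in both reading directions."""
    pairs = set()
    n = len(genome)
    for idx in range(n):
        a, b = genome[idx], genome[(idx + 1) % n]
        pairs.add((a, b))
        pairs.add((-b, -a))  # same adjacency read from the other strand
    return pairs


def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that do not occur in g2."""
    adj2 = adjacencies(g2)
    n = len(g1)
    return sum((g1[i], g1[(i + 1) % n]) not in adj2 for i in range(n))


reference = [1, 2, 3, 4, 5]
one_inversion = [1, -3, -2, 4, 5]      # genes 2..3 inverted
genome = [1, -5, 3, 4, -2, -6]
same_genome_reversed = [6, 2, -4, -3, 5, -1]
```

A single inversion creates at most two breakpoints, and reading a circular genome from the opposite strand creates none, which is why the two signed permutations above describe the same genome.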

69
Inversion and Breakpoint phylogenies
  • When the data are close to saturated, even
    Weighbor(EDE) analyses are insufficiently
    accurate. In these cases, our initial
    investigations suggest that the inversion and
    breakpoint phylogeny approaches may be superior.
  • Problem: finding the best trees is enormously
    hard, since even the point estimation problem
    is hard (worse than estimating branch lengths in
    ML).

[Figure: MP score plotted over the space of phylogenetic trees, marking a local optimum and the global optimum]
70
Observations
  • For equal gene content, heuristics for the
    inversion phylogeny problem are extremely
    accurate, even under model conditions in which
    transpositions are dominant.
  • For unequal gene content, the parsimony style
    problems are too computationally intense -- but
    NJ (neighbor joining) with a new distance
    estimator (Moret et al. 2004) works extremely
    well.

71
Software
  • BPAnalysis (Sankoff): open source, restricted to
    the breakpoint phylogeny reconstruction
  • GRAPPA (Moret et al.): open source, restricted to
    single chromosome genomes, but can handle both
    equal and unequal gene content
  • MGR (Pevzner et al.): multiple chromosome,
    limited to equal gene content, performs well if
    the dataset is small (less than 10 genomes)
  • Bayesian analysis by Bret Larget (not released):
    works on some small datasets but not all.

72
[Figure: tree on Merciera, Wahlenbergia, Triodanis, Legousia, Asyneuma, Trachelium, Symphyandra, Campanula, Codonopsis, Tobacco, Adenophora, Cyananthus, and Platycodon]
The strict consensus of 24 trees, each with
inversion length of 64. Finished within 40
minutes on a laptop using GRAPPA version 1.8
73
GRAPPA (Genome Rearrangement Analysis under
Parsimony and other Phylogenetic Algorithms)
  • http://www.cs.unm.edu/moret/GRAPPA/
  • Heuristics for maximum parsimony style problems
    for equal gene content
  • Fast polynomial time distance-based methods
  • Contributors: U. New Mexico, U. Texas at Austin,
    Università di Bologna, Italy
  • Freely available in source code at this site.
  • Project leader: Bernard Moret (UNM)
    (moret_at_cs.unm.edu)

74
Summary
  • Computational phylogenetics offers interesting
    problems for computer scientists
  • Collaboration with biologists is essential
  • Testing on real and/or simulated data is the only
    way to know that the methods are worth pursuing

75
Research problems in phylogenetic reconstruction
  • Hard optimization problems
  • Bayesian inference
  • Whole Genome Phylogeny (e.g., gene order/content)
  • Reticulate evolution
  • Gene Tree/Species Tree
  • Processing sets of trees: compact representations
    and consensus methods
  • Supertree methods
  • Statistical issues with respect to stochastic
    models of evolution (e.g., fast converging
    methods)

76
Contest planned!
  • The CIPRES project will host a contest for
    software to solve maximum parsimony and maximum
    likelihood for molecular sequences (not gene
    orders)
  • Dates not yet set, but should be 2005
  • Benchmarks will be real and simulated datasets
  • For more information, check http://www.phylo.org

77
Acknowledgements
  • The NSF
  • The David and Lucile Packard Foundation
  • The Radcliffe Institute for Advanced Study, the
    Program in Evolutionary Dynamics at Harvard, and
    the Institute for Cellular and Molecular Biology
    at UT-Austin.
  • DCM: Usman Roshan, Bernard Moret, and Tiffani
    Williams
  • GRAPPA: Bernard Moret, Li-San Wang, Jijun Tang,
    Bob Jansen,

78
Phylolab, U. Texas
Please visit us at http://www.cs.utexas.edu/users/phylo/