The Tree of Life Initiative

About This Presentation

Title:

The Tree of Life Initiative

Description:

The Tree of Life Initiative. Tandy Warnow. Radcliffe Institute. The University of Texas at Austin ... Daniel Miranker. Usman Roshan. Luay Nakhleh. University ... – PowerPoint PPT presentation

Slides: 62

Provided by: tandyw

Transcript and Presenter's Notes

Title: The Tree of Life Initiative

1
The Tree of Life Initiative

Tandy Warnow
Radcliffe Institute
The University of Texas at Austin

2
Phylogeny
From the Tree of Life website, University of
Arizona
[Figure: tree on Orangutan, Human, Gorilla, Chimpanzee]
3
Reconstructing the Tree of Life
Handling large datasets: millions of species
4
Evolution informs about everything in biology

Big genome sequencing projects just produce data
-- so what?
Evolutionary history relates all organisms and
genes, and helps us understand and predict
interactions between genes (genetic networks)
drug design
predicting functions of genes
influenza vaccine development
origins and spread of disease
origins and migrations of humans

Cyber Infrastructure for Phylogenetic
Research
Purpose: to create a national infrastructure of
hardware, open source software, database
technology, etc., necessary to infer the Tree of
Life.

Funded by $11.6M ITR (Information Technology
Research) Grant from NSF
40 biologists, computer scientists, and
mathematicians collaborating on the project

7
CIPRes Members
8
What are the Components of the CIPRes Project?
Software Development
Outreach
Algorithms
Production
Database
Simulations
9
What are the Components of the CIPRes Project?
Software Development
Simulations
Production
Algorithms
Database
10
Main Research Projects

Databases (Miranker and Donoghue)
Simulations (Kim)
Algorithms (Warnow)
Software (Moret, W. Maddison, and Swofford)

11
Algorithms group

John Huelsenbeck (UCSD)
Warren Hunt (Texas)
Dick Karp (Berkeley)
Bernard Moret (UNM),
Elchanan Mossel (Berkeley)
Gene Myers (Berkeley)
Christos Papadimitriou (Berkeley)
Satish Rao (Berkeley)
Stuart Russell (Berkeley)
Tandy Warnow (Texas)
Tiffani Williams (UNM)

12
This talk

New techniques to solve Maximum Parsimony and
Maximum Likelihood on massive datasets
The GRAPPA software for whole genome phylogeny
reconstruction
Both projects joint with Bernard Moret at UNM

13
Campanulaceae chloroplast genome phylogeny
[Figure: tree on Merciera, Wahlenbergia, Tiodanus, Legousia, Asyneuma,
Trachelium, Symphyandra, Campanula, Codonopsis, Tobacco, Adenophora,
Cyananthus, Platycodon]
The strict consensus of 24 trees, each with
inversion length of 64. Finished within 40
minutes on a laptop using GRAPPA version 1.8
14
DNA Sequence Evolution
15
Phylogeny Problem
[Figure: input sequences TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT
for the taxa U, V, W, X, Y, and an output tree with leaves X, U, Y, V, W]
16
Methods for phylogenetic inference

Polynomial time methods, mostly based upon
estimating evolutionary distances
Heuristics for hard optimization problems (such
as maximum parsimony and maximum likelihood)
Bayesian methods
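
As a rough illustration of "estimating evolutionary distances", the sketch below computes the fraction of differing sites between two aligned sequences and applies the Jukes-Cantor correction. The choice of Jukes-Cantor is an assumption made here for simplicity (the slides do not say which correction is used), and the function names are hypothetical.

```python
import math

def p_distance(a, b):
    """Fraction of aligned sites at which the two sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jukes_cantor_distance(a, b):
    """Expected substitutions per site under the Jukes-Cantor model
    (an assumed model, chosen only because it has a closed form)."""
    p = p_distance(a, b)
    if p >= 0.75:
        # Saturated: the correction is undefined at or beyond 3/4.
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)

# Two of the example sequences from the Phylogeny Problem slide.
d = jukes_cantor_distance("TAGCCCA", "TAGACTT")
print(round(d, 4))
```

A matrix of such pairwise distances is the input that polynomial-time methods such as neighbor joining work from.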

17
Standard problem: Maximum Parsimony (Hamming
distance Steiner Tree)

Input: Set S of n aligned sequences of length k
Output: A phylogenetic tree T
leaf-labeled by sequences in S
additional sequences of length k labeling the
internal nodes of T
such that the sum of H(u,v) over all edges (u,v)
of T is minimized, where H(u,v) is the Hamming
distance between the sequences labeling u and v.
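
To make the objective concrete, here is a minimal sketch that scores one fully labeled tree as the sum of Hamming distances over its edges. The tree shape, node names, and sequences are illustrative assumptions, not taken from the slides.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

def tree_length(edges, labels):
    """Parsimony length of a tree whose nodes are all labeled by sequences."""
    return sum(hamming(labels[u], labels[v]) for u, v in edges)

# Hypothetical 4-leaf tree: leaves a, b attach to internal node p,
# leaves c, d attach to internal node q, and p-q is the middle edge.
edges = [("a", "p"), ("b", "p"), ("p", "q"), ("c", "q"), ("d", "q")]
labels = {"a": "ACT", "b": "ACA", "c": "GTT", "d": "GTA",
          "p": "ACA", "q": "GTA"}
length = tree_length(edges, labels)
print(length)  # 1 + 0 + 2 + 1 + 0 = 4
```

Maximum parsimony asks for the tree and internal labeling that make this sum as small as possible.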

18
Maximum parsimony (example)

Input: four sequences
ACT
ACA
GTT
GTA
Question: which of the three trees has the best
MP score?

19
Maximum Parsimony (MP)
[Figure: the three possible trees on ACT, ACA, GTT, GTA]
20
Maximum Parsimony (MP)
[Figure: the three trees on ACT, ACA, GTT, GTA with labeled internal
nodes and edge costs; MP scores 7, 5, and 4]
The tree with MP score 4 is the optimal MP tree.
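
The four-sequence example can be checked with a short sketch of Fitch's algorithm, which computes the best achievable score for a fixed topology. The nested-tuple tree encoding is an assumption made here. Because Fitch picks the best internal labeling for each topology, a tree drawn with a particular labeling can show a higher score than this sketch reports.

```python
def fitch(tree, site):
    """Return (state set, cost) for one alignment column.
    tree: a leaf name (str) or a pair (left, right); site: leaf -> character."""
    if isinstance(tree, str):
        return {site[tree]}, 0
    (ls, lc), (rs, rc) = fitch(tree[0], site), fitch(tree[1], site)
    common = ls & rs
    if common:
        return common, lc + rc
    # Disjoint child sets force one substitution at this node.
    return ls | rs, lc + rc + 1

def parsimony_score(tree, seqs):
    """Sum the Fitch cost over all columns of the alignment."""
    k = len(next(iter(seqs.values())))
    return sum(fitch(tree, {name: s[i] for name, s in seqs.items()})[1]
               for i in range(k))

# Leaves are named by their own sequences for readability.
seqs = {s: s for s in ("ACT", "ACA", "GTT", "GTA")}
topologies = [
    (("ACT", "ACA"), ("GTT", "GTA")),
    (("ACT", "GTT"), ("ACA", "GTA")),
    (("ACT", "GTA"), ("ACA", "GTT")),
]
scores = [parsimony_score(t, seqs) for t in topologies]
print(scores)  # [4, 5, 6]
```

The first topology, which pairs ACT with ACA and GTT with GTA, reaches the optimal score of 4.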
21
Maximum Parsimony computational complexity
22
Maximum likelihood

For a given model (e.g., K2P, HKY, GTR, etc.), try
to find the tree T and its associated parameters
(substitution matrices, edge lengths, etc.) that
maximize
Pr[S | T, parameters]
Even harder in practice than MP.

23
Solving NP-hard problems exactly is unlikely

Number of (unrooted) binary trees on n leaves is
(2n-5)!!
If each tree on 1000 taxa could be analyzed in
0.001 seconds, we would find the best tree in
2890 millennia
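
The size of tree space can be checked directly from the (2n-5)!! formula. The sketch below is only a back-of-the-envelope calculation; the 0.001 seconds per tree comes from the slide, while the 365.25-day year is an assumption. For 1000 taxa the count exceeds 10^2860, so the running time in millennia is itself a number with thousands of digits.

```python
import math

def num_unrooted_binary_trees(n):
    """(2n-5)!! = 1 * 3 * 5 * ... * (2n-5), for n >= 3 leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd numbers 3 .. 2n-5
        count *= k
    return count

small = [num_unrooted_binary_trees(n) for n in (3, 4, 5, 6)]
print(small)  # [1, 3, 15, 105]

trees_1000 = num_unrooted_binary_trees(1000)
log10_trees = math.log10(trees_1000)

# Exhaustive search at 0.001 s per tree, expressed in millennia (log10).
seconds_per_millennium = 1000 * 365.25 * 24 * 3600
log10_millennia = log10_trees + math.log10(0.001) - math.log10(seconds_per_millennium)
print(int(log10_trees), int(log10_millennia))
```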

24
Approaches for solving MP

Hill-climbing heuristics (which can get stuck in
local optima)
Randomized algorithms for getting out of local
optima
Approximation algorithms for MP (based upon
Steiner Tree approximation algorithms).

26
Main research objectives

Determine the best current methods available for
MP (PAUP, TNT, etc.) and ML (???), and then
improve upon them
Focus on performance within one day, one week, or
one month, on large real datasets (1K to 20K
sequences for MP)
Final objective is hundreds of thousands (or
millions) of sequences.

27
Initial results

TNT (by Pablo Goloboff) is much better at solving
Maximum Parsimony than the other software we
studied (especially on large datasets).
Our research (Moret, Roshan, Warnow, and
Williams) shows that we need to get within 0.01%
of optimal MP scores (or even better, on large
datasets) to return reasonable estimates of an
optimal tree's topology.
ML heuristics don't seem to be able to analyze
large datasets with any accuracy.

28
Datasets
Obtained from various researchers and online
databases

1322 lsu rRNA of all organisms
2000 Eukaryotic rRNA
2594 rbcL DNA
4583 Actinobacteria 16s rRNA
6590 ssu rRNA of all Eukaryotes
7180 three-domain rRNA
7322 Firmicutes bacteria 16s rRNA
8506 three-domain2org rRNA
11361 ssu rRNA of all Bacteria
13921 Proteobacteria 16s rRNA

29
Problems with current techniques for MP
[Figure: average MP scores above optimal of the best methods
at 24 hours, across 10 datasets]
Best current techniques fail to reach 0.01% of
optimal at the end of 24 hours, on large datasets
30
Problems with current techniques for MP
The best current method (default TNT) fails to
reach acceptable levels of accuracy (0.01% of
optimal) within 24 hours on many large datasets
-- evidence suggests that this level will not be
reached for weeks or months (or more) of further
analysis.
[Figure: performance of TNT with time]
31
Observations

The best methods cannot get acceptably good
solutions within 24 hours on most of these large
datasets.
Datasets of these sizes may need months (or
years) of further analysis to reach reasonable
solutions.
Apparent convergence can be misleading.

34
Our objective: speed up the best MP heuristics
[Figure (fake study): MP score of best trees vs. time, comparing the
performance of a hill-climbing heuristic with the desired performance]
35

How can we improve upon existing techniques?

Tree Bisection and Reconnection (TBR)

Delete an edge
Reconnect the two resulting trees with a new edge
that bifurcates an edge in each tree
40
A conjecture as to why current techniques are
poor

Our studies suggest that trees with near optimal
scores tend to be topologically close (RF
distance less than 15) to the other near
optimal trees.
The standard technique (TBR) for moving around
tree space explores O(n^3) trees, which are mostly
topologically distant.
So TBR may be useful initially (to reach near
optimality) but then more localized searches
are more productive.
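
The RF (Robinson-Foulds) distance used above counts the bipartitions that appear in one tree but not the other. A minimal sketch, assuming trees are written as nested tuples of leaf names (an encoding chosen here, not one used by the slides):

```python
def leaf_set(tree):
    """All leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def splits(tree):
    """Non-trivial bipartitions, each stored as the side without a reference leaf."""
    everything = leaf_set(tree)
    ref = min(everything)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(walk(child) for child in node))
        side = clade if ref not in clade else everything - clade
        if 2 <= len(side) <= len(everything) - 2:
            found.add(side)
        return clade

    walk(tree)
    return found

def rf_distance(t1, t2):
    """Size of the symmetric difference of the two bipartition sets."""
    return len(splits(t1) ^ splits(t2))

t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(t1, t2))  # 2
```

Both trees share the split that separates D and E from the rest, and differ only in how A, B, and C are grouped.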

41
Disk-Covering Methods (DCMs)

DCMs are divide-and-conquer methods that our
group has developed for use in phylogeny
reconstruction
DCM2 was designed for speeding up maximum
parsimony and maximum likelihood heuristics. DCM2
was good enough for PAUP.
DCM3 is a recent improvement over DCM2 which
enables iteration (and gives smaller subproblems)
- and is good enough for TNT.

42
Boosting MP heuristics

These DCMs boost the performance of phylogeny
reconstruction methods.

[Figure: DCM + Base method M -> DCM-M]
43
DCM3 technique for speeding up MP searches
44
DCM3: a new DCM decomposition

DCM3 decompositions
(1) are faster to compute than
earlier DCMs,
(2) yield smaller subproblems
than DCM2, and
(3) can be used iteratively

45
Short Subtree Graph

Given an edge-weighted tree T, for every edge e,
let X(e) be the set consisting of, for each of the
four subtrees around e, the leaf in that subtree
closest to e.
Let G be the union of cliques on all the sets X(e).
Theorem: G is triangulated. Hence a clique
separator in G can be found in polynomial time.
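
The construction can be sketched as follows. This is a simplified reading of the definition, with several assumptions: the tree is given as a weighted adjacency dictionary, distance to an edge is measured from its nearer endpoint, and for an edge touching a leaf that leaf is itself taken as the representative on its side. The example tree and weights are invented for illustration.

```python
from itertools import combinations

def closest_leaf(adj, node, came_from, dist):
    """(distance, leaf) for the nearest leaf reachable from `node`
    without stepping back to `came_from`."""
    onward = [(nbr, w) for nbr, w in adj[node].items() if nbr != came_from]
    if not onward:  # `node` is itself a leaf
        return (dist, node)
    return min(closest_leaf(adj, nbr, node, dist + w) for nbr, w in onward)

def short_subtree_graph(adj):
    """Union of cliques on X(e) over all edges e, as a set of leaf pairs."""
    graph = set()
    for u in adj:
        for v in adj[u]:
            if u >= v:  # visit each undirected edge once
                continue
            x_e = set()
            for end, other in ((u, v), (v, u)):
                branches = [(n, w) for n, w in adj[end].items() if n != other]
                if not branches:
                    x_e.add(end)
                for n, w in branches:
                    x_e.add(closest_leaf(adj, n, end, w)[1])
            graph.update(frozenset(pair) for pair in combinations(x_e, 2))
    return graph

# Hypothetical caterpillar tree with leaves A..E and internal nodes x, y, z.
adj = {
    "A": {"x": 1}, "B": {"x": 2}, "C": {"y": 1}, "D": {"z": 1}, "E": {"z": 2},
    "x": {"A": 1, "B": 2, "y": 1},
    "y": {"x": 1, "C": 1, "z": 1},
    "z": {"y": 1, "D": 1, "E": 2},
}
g = short_subtree_graph(adj)
print(len(g))  # 9 of the 10 possible leaf pairs; B and E are not adjacent
```

In this example B and E sit at opposite ends of the tree and are never the closest leaves around the same edge, so they stay unconnected in G.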

46
Strict Consensus Merger (SCM)
47
Computing DCM3 decompositions

An optimal DCM3 decomposition takes O(n^3) time to
compute - same as for DCM2
The centroid edge DCM3 decomposition can be
computed in O(n^2) time
An approximate centroid edge decomposition can be
computed in O(n) time

48
Iterative-DCM3
[Figure: tree T -> DCM3 -> Base method -> new tree T]
49
New DCMs

DCM3
1. Compute subproblems using DCM3 decomposition
2. Apply base method to each subproblem to yield
subtrees
3. Merge subtrees using the Strict Consensus Merger
technique
4. Randomly refine to make it binary
Recursive-DCM3
Iterative DCM3
1. Compute a DCM3 tree
2. Perform local search and go to step 1
Recursive-Iterative DCM3
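
The control flow of the DCM3 and Iterative DCM3 steps listed above can be sketched as a small driver with pluggable steps. Everything here is hypothetical scaffolding: the function names, the rule of keeping a candidate only if its score does not get worse, and the toy stand-ins (which sort a list and have nothing to do with phylogenetics) are assumptions used only to make the loop runnable.

```python
import heapq

def dcm_round(tree, decompose, base_method, merge, refine):
    """One DCM pass: decompose, solve each subproblem, merge, refine."""
    subtrees = [base_method(sub) for sub in decompose(tree)]
    return refine(merge(subtrees))

def iterative_dcm(tree, rounds, score, local_search, **steps):
    """Repeat a DCM pass followed by local search, keeping the best result."""
    best = tree
    for _ in range(rounds):
        candidate = local_search(dcm_round(best, **steps))
        if score(candidate) <= score(best):  # lower score is better, as in MP
            best = candidate
    return best

# Toy stand-ins: the "tree" is a list, the "score" counts descents.
def descents(xs):
    return sum(a > b for a, b in zip(xs, xs[1:]))

start = [5, 3, 8, 1, 9, 2]
result = iterative_dcm(
    start,
    rounds=2,
    score=descents,
    local_search=lambda t: t,
    decompose=lambda t: [t[: len(t) // 2], t[len(t) // 2:]],
    base_method=sorted,
    merge=lambda parts: list(heapq.merge(*parts)),
    refine=lambda t: t,
)
print(result)  # [1, 2, 3, 5, 8, 9]
```

In a real setting the base method would be an MP heuristic such as TNT, and the merge step would be the Strict Consensus Merger.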

50
Boosting MP heuristics

We examine DCMs using DCM2 and DCM3, and using
recursion and/or iteration.

[Figure: DCM + Base method M -> DCM-M]
51
Performance Study

How well do these boosted versions of the best
MP heuristics perform, compared to the best MP
heuristics?
We examine performance with respect to optimal
MP scores (best found so far, using any method)
for a number of very large datasets, over 24
hours.
The benchmark MP heuristic is the default TNT.

52
Datasets
Obtained from various researchers and online
databases

1322 lsu rRNA of all organisms
2000 Eukaryotic rRNA
2594 rbcL DNA
4583 Actinobacteria 16s rRNA
6590 ssu rRNA of all Eukaryotes
7180 three-domain rRNA
7322 Firmicutes bacteria 16s rRNA
8506 three-domain2org rRNA
11361 ssu rRNA of all Bacteria
13921 Proteobacteria 16s rRNA

53
Comparison of DCM decompositions (maximum subset
size)
DCM2 subproblems are almost as large as the full
dataset size on datasets 1 through 4. On
datasets 5-10 DCM2 was too slow to compute a
decomposition within 24 hours.
54
Comparison of DCMs (4583 sequences)
Base method is the TNT-ratchet. DCM2 takes
almost 10 hours to produce a tree and is too
slow to run on larger datasets. Rec-I-DCM3 is the
best method at all times.
55
Comparison of DCMs (13,921 sequences)
Base method is the TNT-ratchet. Note the
improvement in DCMs as we move from the
default to recursion to iteration to
recursion+iteration. On very large datasets
Rec-I-DCM3 gives significant improvements over
unboosted TNT.
56
Rec-I-DCM3 significantly improves performance
Current best techniques
DCM boosted version of best techniques
Comparison of TNT to Rec-I-DCM3(TNT) on one large
dataset
57
Rec-I-DCM3(TNT) vs. TNT (comparison of scores at
24 hours)
Base method is the default TNT technique, the
current best method for MP. Rec-I-DCM3
significantly improves upon the unboosted TNT by
returning trees which are at most 0.01% above
optimal on most datasets.
58
Summary of Part I

Rec-I-DCM3 is a powerful technique for escaping
local optima, and boosts the performance of the
best heuristics for solving MP
The improvement increases with the difficulty of
the dataset - Rec-I-DCM3(TNT) is 50 times faster
than TNT on our hardest datasets, but we expect
even bigger speedups in our next version
DCMs also boost the performance of Maximum
Likelihood heuristics (not shown)

59
Limitations of DNA phylogenetics

Deep evolutionary histories may not be
recoverable from DNA sequence phylogeny due to
lack of specificity -- too much noise (homoplasy)
and insufficient sequence length
The systematics community has looked to rare
genomic changes for better sources of
phylogenetic signal

60
Part II: Whole-Genome Phylogenetics
61
Genomes As Signed Permutations
1 5 3 4 -2 -6 or 6 2 -4 3 5 1 etc.
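
A circular genome written as a signed permutation has several equivalent spellings: it can start at any gene, and it can be read on the opposite strand (reverse the order and flip every sign). The sketch below picks one canonical spelling so that two gene orders can be compared. The example genomes are illustrative values chosen here, and the function names are hypothetical.

```python
def reverse_complement(genome):
    """Read the genome on the other strand: reverse order, flip signs."""
    return [-g for g in reversed(genome)]

def canonical(genome):
    """Smallest tuple among all rotations of both strands."""
    n = len(genome)
    candidates = []
    for g in (list(genome), reverse_complement(genome)):
        candidates.extend(tuple(g[i:] + g[:i]) for i in range(n))
    return min(candidates)

def same_circular_genome(a, b):
    return canonical(a) == canonical(b)

g = [1, -5, 3, 4, -2, -6]
print(reverse_complement(g))                           # [6, 2, -4, -3, 5, -1]
print(same_circular_genome(g, [6, 2, -4, -3, 5, -1]))  # True
print(same_circular_genome(g, [3, 4, -2, -6, 1, -5]))  # True (a rotation)
```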
62
Genomes Evolve by Rearrangements
1 2 3 4 5 6 7 8 9 10

Inversion (Reversal)

Transposition

Inverted Transposition
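
The three rearrangement events can be written as operations on a signed permutation stored as a list. This is a sketch on a linear representation with half-open index ranges; the index conventions and function names are assumptions made here.

```python
def inversion(perm, i, j):
    """Reverse the segment perm[i:j] and flip the sign of each gene in it."""
    return perm[:i] + [-g for g in reversed(perm[i:j])] + perm[j:]

def transposition(perm, i, j, k):
    """Move the segment perm[i:j] so that it sits just before index k."""
    if not 0 <= i < j <= k <= len(perm):
        raise ValueError("need 0 <= i < j <= k <= len(perm)")
    return perm[:i] + perm[j:k] + perm[i:j] + perm[k:]

def inverted_transposition(perm, i, j, k):
    """Like transposition, but the moved segment is also inverted."""
    if not 0 <= i < j <= k <= len(perm):
        raise ValueError("need 0 <= i < j <= k <= len(perm)")
    moved = [-g for g in reversed(perm[i:j])]
    return perm[:i] + perm[j:k] + moved + perm[k:]

genome = list(range(1, 11))  # 1 2 3 4 5 6 7 8 9 10
print(inversion(genome, 2, 5))                  # 1 2 -5 -4 -3 6 7 8 9 10
print(transposition(genome, 2, 5, 8))           # 1 2 6 7 8 3 4 5 9 10
print(inverted_transposition(genome, 2, 5, 8))  # 1 2 6 7 8 -5 -4 -3 9 10
```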

63
Other types of events

Duplications, Insertions, and Deletions (changes
gene content)
Fissions and Fusions (for genomes with more than
one chromosome)
These events change the number of copies of each
gene in each genome (unequal gene content)

64
Genome Rearrangement Has A Huge State Space

DNA sequences: 4 states per site
Signed circular genomes with n genes:
2^(n-1) (n-1)! states, 1 site
Circular genomes (1 site)
with 37 genes (mitochondria):
2.56 x 10^52 states
with 120 genes (chloroplasts):
3.70 x 10^232 states
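
The size of this state space can be checked numerically. The sketch assumes the standard count for signed circular gene orders: 2^n n! signed linear orders, divided by n rotations and by 2 reading directions, which gives 2^(n-1) (n-1)!.

```python
import math

def signed_circular_genomes(n):
    """Number of distinct signed circular gene orders on n genes."""
    if n < 1:
        raise ValueError("need at least one gene")
    return 2 ** (n - 1) * math.factorial(n - 1)

mito = signed_circular_genomes(37)    # typical mitochondrial gene count
chloro = signed_circular_genomes(120) # typical chloroplast gene count
print(f"{mito:.2e}", f"{chloro:.2e}")
```

By comparison, a single DNA site has only 4 states, which is why one gene order carries far more potential information than one nucleotide position.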

65
Why use gene orders?

Rare genomic changes: huge state space and
relative infrequency of events (compared to site
substitutions) could make the inference of deep
evolution easier, or more accurate.
Our research shows this is true, but accurate
analysis of gene order data is computationally
very intensive!

66
Phylogeny reconstruction from gene orders

Distance-based reconstruction: estimate pairwise
distances, and apply methods like
Neighbor-Joining or Weighbor
Maximum Parsimony: find tree with the minimum
length (inversions, transpositions, or other edit
distances)
Maximum Likelihood: find tree and parameters of
evolution most likely to generate the observed
data

67
Maximum Parsimony on Rearranged Genomes (MPRG)

The leaves are rearranged genomes.
Find the tree that minimizes the total number of
rearrangement events (e.g., inversion phylogeny
minimizes the number of inversions)

68
Optimization problems for gene order phylogeny

Breakpoint phylogeny: find the phylogeny which
minimizes the total number of breakpoints
(NP-hard, even to find the median of three
genomes)
Inversion phylogeny: find the phylogeny which
minimizes the sum of inversion distances on the
edges (NP-hard, even to find the median of three
genomes)
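
For two genomes with equal gene content, a breakpoint is an adjacency that occurs in one genome but not in the other. The sketch below counts breakpoints between two signed circular gene orders; it treats the adjacency (a, b) as identical to (-b, -a), which is the same pair read on the opposite strand. Function names and example genomes are assumptions made for illustration.

```python
def adjacencies(genome):
    """Set of strand-independent adjacencies of a signed circular genome."""
    n = len(genome)
    found = set()
    for i in range(n):
        a, b = genome[i], genome[(i + 1) % n]
        # (a, b) and (-b, -a) describe the same adjacency; keep one form.
        found.add(min((a, b), (-b, -a)))
    return found

def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that are missing from g2."""
    return len(adjacencies(g1) - adjacencies(g2))

identity = [1, 2, 3, 4, 5]
inverted = [1, -3, -2, 4, 5]  # identity with the segment 2 3 inverted
print(breakpoint_distance(identity, inverted))  # 2
```

A single inversion creates at most two breakpoints, one at each end of the inverted segment, as in this example.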

69
Inversion and Breakpoint phylogenies

When the data are close to saturated, even
Weighbor(EDE) analyses are insufficiently
accurate. In these cases, our initial
investigations suggest that the inversion and
breakpoint phylogeny approaches may be superior.
Problem: finding the best trees is enormously
hard, since even the point estimation problem
is hard (worse than estimating branch lengths in
ML).

[Figure: MP score over the space of phylogenetic trees, showing a
local optimum and the global optimum]
70
Observations

For equal gene content, heuristics for the
inversion phylogeny problem are extremely
accurate, even under model conditions in which
transpositions are dominant.
For unequal gene content, the parsimony style
problems are too computationally intense -- but
NJ (neighbor joining) with a new distance
estimator (Moret et al. 2004) works extremely
well.

71
Software

BPAnalysis (Sankoff): open source, restricted to
the breakpoint phylogeny reconstruction
GRAPPA (Moret et al.): open source, restricted to
single chromosome genomes, but can handle both
equal and unequal gene content
MGR (Pevzner et al.): multiple chromosome,
limited to equal gene content, performs well if
the dataset is small (less than 10 genomes)
Bayesian analysis by Bret Larget (not released):
works on some small datasets but not all.

72
[Figure: tree on Merciera, Wahlenbergia, Tiodanus, Legousia, Asyneuma,
Trachelium, Symphyandra, Campanula, Codonopsis, Tobacco, Adenophora,
Cyananthus, Platycodon]
The strict consensus of 24 trees, each with
inversion length of 64. Finished within 40
minutes on a laptop using GRAPPA version 1.8
73
GRAPPA (Genome Rearrangement Analysis under
Parsimony and other Phylogenetic Algorithms)

http://www.cs.unm.edu/moret/GRAPPA/
Heuristics for maximum parsimony style problems
for equal gene content
Fast polynomial time distance-based methods
Contributors: U. New Mexico, U. Texas at Austin,
Università di Bologna, Italy
Freely available in source code at this site.
Project leader: Bernard Moret (UNM)
(moret_at_cs.unm.edu)

74
Summary

Computational phylogenetics offers interesting
problems for computer scientists
Collaboration with biologists is essential
Testing on real and/or simulated data is the only
way to know that the methods are worth pursuing

75
Research problems in phylogenetic reconstruction

Hard optimization problems
Bayesian inference
Whole Genome Phylogeny (e.g., gene order/content)
Reticulate evolution
Gene Tree/Species Tree
Processing sets of trees: compact representations
and consensus methods
Supertree methods
Statistical issues with respect to stochastic
models of evolution (e.g., fast converging
methods)

76
Contest planned!

The CIPRES project will host a contest for
software to solve maximum parsimony and maximum
likelihood for molecular sequences (not gene
orders)
Dates not yet set, but should be 2005
Benchmarks will be real and simulated datasets
For more information, check http://www.phylo.org

77
Acknowledgements

The NSF
The David and Lucile Packard Foundation
The Radcliffe Institute for Advanced Study, the
Program in Evolutionary Dynamics at Harvard, and
the Institute for Cellular and Molecular Biology
at UT-Austin.
DCM: Usman Roshan, Bernard Moret, and Tiffani
Williams
GRAPPA: Bernard Moret, Li-San Wang, Jijun Tang,
Bob Jansen,

78
Phylolab, U. Texas
Please visit us at http://www.cs.utexas.edu/users/phylo/
