Title: Computing the Tree of Life
1 Computing the Tree of Life
- The University of Texas at Austin
- Department of Computer Sciences
- Tandy Warnow
2 Phylogeny
(Figure: phylogeny of Orangutan, Human, Gorilla, and Chimpanzee. From the Tree of Life Website, University of Arizona.)
3 DNA Sequence Evolution
4 Molecular Phylogenetics
- Taxa: U, V, W, X, Y
- Sequences: TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT
(Figure: the sequences placed at the leaves of an unrooted tree on U, V, W, X, Y.)
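One simple way to compare aligned sequences like the five on this slide is to count the positions at which they differ (the Hamming distance); distance-based tree methods start from a matrix of such pairwise values. The sketch below is illustrative only: the function names are hypothetical, and the sequences are keyed by their own text because the assignment of sequences to the taxa U, V, W, X, Y is not recoverable from the slide text.

```python
from itertools import combinations

# The five aligned sequences shown on the slide.
SEQUENCES = ["TAGCCCA", "TAGACTT", "TGCACAA", "TGCGCTT", "AGGGCAT"]


def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(seqs):
    """Map each unordered pair of sequences to its Hamming distance."""
    return {(a, b): hamming(a, b) for a, b in combinations(seqs, 2)}


if __name__ == "__main__":
    for (a, b), d in pairwise_distances(SEQUENCES).items():
        print(a, b, d)
```

Raw Hamming distances ignore multiple substitutions at the same site, so practical methods usually correct them under a model of sequence evolution before building a tree.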
6 Evolution informs about everything in biology
- Big genome sequencing projects just produce data. So what?
- Evolutionary history relates all organisms and genes, and helps us understand and predict:
  - interactions between genes (genetic networks)
  - drug design
  - predicting functions of genes
  - influenza vaccine development
  - origins and spread of disease
  - origins and migrations of humans
7 Evolutionary trees and the pharmaceutical industry
- Big genome sequencing projects just produce data. So what? Evolutionary history relates all organisms and genes, and evolutionary trees are used to make important biological discoveries.
- The pharmaceutical industry uses phylogenies for many applications, such as the development of influenza vaccines!
- Inaccuracies in the phylogenies lead to inaccurate predictions (e.g., vaccines that don't work, drugs that don't have the required properties). Current software isn't accurate enough, or fast enough!
- This means !
8 NSF's program for Assembling the Tree of Life
- The Tree of Life has proven useful in many fields, such as
  - choosing experimental systems for biological research,
  - determining which genes are common to many kinds of organisms and which are unique,
  - tracking the origin and spread of emerging diseases and their vectors,
  - bio-prospecting for pharmaceutical and agrochemical products,
  - developing databases for genetic information, and
  - evaluating risk factors for species conservation and ecosystem restoration.
9 Computational challenges for Assembling the Tree of Life (NSF)
- 8 million species for the Tree of Life; we cannot currently analyze more than a few hundred (and even this takes years)
- We need new methods for inferring large phylogenies (hard optimization problems!)
- We need new software for visualizing large trees
- We need new database technology
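One reason large phylogenies are hard to infer is the size of the search space: the number of distinct unrooted binary trees on n labeled leaves is (2n-5)!! = 3 x 5 x ... x (2n-5), a standard combinatorial fact. The sketch below simply evaluates that product; the function name is hypothetical, and the slide itself does not state this formula.

```python
def num_unrooted_binary_trees(n):
    """Count unrooted binary trees on n labeled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (empty product when n == 3).
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (5, 10, 20, 50):
        print(n, num_unrooted_binary_trees(n))
```

The count grows faster than exponentially in n, which is why exhaustive search is out of reach well before a few hundred taxa and heuristics are used instead.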
10 We are world leaders in research in Computational Phylogenetics
- DCM-boosting for phylogeny reconstruction: improves accuracy and speeds up heuristics for NP-hard problems (Warnow, UT-Austin)
- GRAPPA: software for whole genome phylogeny (Moret, UNM)
- Visualization of large trees, and sets of trees (Amenta, UC Davis)
- Phylogenetic databases (Miranker)
11 DCM-boosting improves methods (Nakhleh et al., ISMB 2001)
- Random trees
- K2P+Gamma model
- Sequence length = 1000
- Average branch length = 0.05
12 (Figure) Nakhleh et al., ISMB 2001
- Random trees
- K2P+Gamma model
- Sequence length = 1000
- Average branch length = 0.05
(Figure: average RF error rate, 0 to 0.8, versus number of taxa, 0 to 1600. Labels: DCM-NJ, NJ, DCM-NJ+MP, HGT-FP.)
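The "Avg. RF" axis refers to the Robinson-Foulds (RF) measure, which compares an inferred tree to the true tree by counting the bipartitions (splits of the taxa induced by internal edges) found in one tree but not the other. The sketch below is a minimal illustration, assuming trees are given as nested tuples of leaf names and that the rate is normalized by 2(n-3), the maximum for two binary trees; the representation, the function names, and the normalization are assumptions, not details taken from the paper.

```python
def bipartitions(tree, taxa):
    """Nontrivial bipartitions of a tree given as nested tuples of leaf names."""
    taxa = frozenset(taxa)
    ref = min(taxa)  # fixed reference taxon used to pick a canonical side
    splits = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves_below(child) for child in node))
        # Keep only splits with at least two taxa on each side.
        if 1 < len(below) < len(taxa) - 1:
            splits.add(below if ref not in below else taxa - below)
        return below

    leaves_below(tree)
    return splits


def rf_distance(t1, t2, taxa):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(bipartitions(t1, taxa) ^ bipartitions(t2, taxa))


def rf_rate(t1, t2, taxa):
    """RF distance divided by 2(n-3), its maximum for binary trees."""
    return rf_distance(t1, t2, taxa) / (2 * (len(taxa) - 3))


if __name__ == "__main__":
    taxa = ["U", "V", "W", "X", "Y"]
    true_tree = ((("U", "V"), "W"), ("X", "Y"))
    inferred = ((("U", "W"), "V"), ("X", "Y"))
    print(rf_rate(true_tree, inferred, taxa))
```

A rate of 0 means the two trees share every internal edge; a rate of 1 means they share none.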
13 DCM-boosting phylogenetic reconstruction methods (Nakhleh et al., ISMB 2001)
- DCM-boosting makes fast methods more accurate
- DCM-boosting speeds up heuristics for hard optimization problems
(Figure: error rate, 0 to 0.8, versus number of taxa, 0 to 1600, for NJ and DCM-NJ.)
14 Whole-Genome Phylogenetics
17 Benchmark gene order dataset: Campanulaceae
- 12 genomes + 1 outgroup (Tobacco), 105 gene segments
- NP-hard optimization problems: breakpoint and inversion phylogenies
- 1997: BPAnalysis (Blanchette and Sankoff): 200 years (est.)
- 2000: Using GRAPPA v1.1 on the 512-processor Los Lobos Supercluster machine: 2 minutes (200,000-fold speedup per processor)
- 2003: Using latest version of GRAPPA: 2 minutes on a single processor (1-billion-fold speedup per processor)
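Breakpoint phylogenies score trees using the breakpoint distance between gene orders: the number of gene adjacencies in one genome that are not preserved in the other. The sketch below is a simplified illustration for linear genomes written as signed integer lists (a negative sign marks a gene on the opposite strand); it is not GRAPPA's or BPAnalysis's implementation, the function names are hypothetical, and chloroplast genomes such as these are circular, which this sketch ignores.

```python
def consecutive_pairs(order):
    """Adjacent gene pairs in a linear signed gene order."""
    return list(zip(order, order[1:]))


def adjacency_set(order):
    """All adjacencies of a genome, read in both orientations."""
    adj = set()
    for a, b in consecutive_pairs(order):
        adj.add((a, b))
        adj.add((-b, -a))  # the same adjacency read from the other strand
    return adj


def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that do not occur in g2."""
    if sorted(abs(g) for g in g1) != sorted(abs(g) for g in g2):
        raise ValueError("genomes must contain the same genes")
    preserved = adjacency_set(g2)
    return sum(1 for pair in consecutive_pairs(g1) if pair not in preserved)


if __name__ == "__main__":
    ancestor = [1, 2, 3, 4, 5]
    inverted = [1, -3, -2, 4, 5]  # genes 2..3 inverted
    print(breakpoint_distance(ancestor, inverted))
```

A single inversion creates at most two breakpoints, one at each end of the inverted segment, which is why breakpoint counts are used as a proxy for rearrangement distance.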
18 GRAPPA (Genome Rearrangement Analysis under Parsimony and other Phylogenetic Algorithms)
- http://www.cs.unm.edu/moret/GRAPPA/
- Heuristics for NP-hard optimization problems
- Fast polynomial time distance-based methods
- Contributors: U. New Mexico, U. Texas at Austin, Università di Bologna (Italy)
- Fastest and most accurate software for whole genome phylogeny worldwide
19 Opportunities
- New phylogenetic reconstruction software can improve pharmaceutical R&D (making more accurate solutions achievable in hours or days, rather than months or years)
- Software for researchers is freely available (open source), but users need the latest tools now, with proper interfaces: a business opportunity.
20 Participants and Funding
- University of Texas computer scientists: Warnow, Dhillon, Hunt, and Miranker
- University of Texas biologists: Jansen, Linder, and Hillis
- Other institutions: UNM, UC Davis, Central Washington, CUNY, JGI
- Funding: three NSF ITR grants, NSF Biocomplexity, David and Lucile Packard Foundation
21 Phylolab, U. Texas
Please visit us at http://www.cs.utexas.edu/users/phylo/