Title: Algorithms for Inferring the Tree of Life
1. Algorithms for Inferring the Tree of Life
- Tandy Warnow
- Dept. of Computer Science
- The University of Texas at Austin
2. Phylogeny
[Figure: a phylogeny of Orangutan, Human, Gorilla, and Chimpanzee, from the Tree of Life Web Project, University of Arizona.]
3. Evolution informs about everything in biology
- Big genome sequencing projects just produce data; so what?
- Evolutionary history relates all organisms and genes, and helps us understand and predict:
  - interactions between genes (genetic networks)
  - drug design
  - predicting functions of genes
  - influenza vaccine development
  - origins and spread of disease
  - origins and migrations of humans
4. Reconstructing the Tree of Life
Handling large datasets: millions of species, NP-hard problems. Lots of computer science research to do.
5. Steps in a phylogenetic analysis
- Gather data
- Align sequences
- Estimate phylogeny on the multiple alignment (see the sketch after this list)
- Estimate the reliable aspects of the evolutionary history (using bootstrapping, consensus trees, or other methods)
- Perform post-tree analyses.
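To make the tree-estimation step concrete, here is a minimal sketch of estimating a tree from an already-computed multiple alignment, assuming Biopython is available; the filename aligned.fasta, the identity-based distance, and the choice of Neighbor Joining are illustrative assumptions, not anything prescribed by the slides.

```python
# Minimal sketch of the tree-estimation step: build a tree from an existing alignment.
# Assumes Biopython is installed; "aligned.fasta" is a hypothetical input file.
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

alignment = AlignIO.read("aligned.fasta", "fasta")

# Pairwise distances under a simple identity-based measure.
distance_matrix = DistanceCalculator("identity").get_distance(alignment)

# Neighbor Joining: one of the polynomial-time, distance-based methods (slide 8).
tree = DistanceTreeConstructor().nj(distance_matrix)

Phylo.write(tree, "estimated_tree.nwk", "newick")
```

A maximum parsimony or maximum likelihood analysis, and the bootstrapping step, would replace or wrap this with considerably heavier machinery.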
6. DNA Sequence Evolution
7. Phylogeny Problem
[Figure: five taxa U, V, W, X, Y with the sequences TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT at the leaves of an unrooted tree (drawn with leaf order X, U, Y, V, W); the problem is to infer the tree from the sequences alone.]
8. Phylogenetic reconstruction methods
- Hill-climbing heuristics for hard optimization criteria (Maximum Parsimony and Maximum Likelihood)
- Polynomial time distance-based methods: UPGMA, Neighbor Joining, FastME, Weighbor, etc.
9. Performance criteria
- Running time.
- Space.
- Statistical performance issues (e.g., statistical consistency) with respect to a Markov model of evolution.
- Topological accuracy with respect to the underlying true tree. Typically studied in simulation.
- Accuracy with respect to a particular criterion (e.g., tree length or likelihood score), on real data.
10. How can we infer evolution?
- While there are more than two sequences, DO:
  - Find the closest pair of sequences and make them siblings
  - Replace the pair by a single sequence (a small sketch of this loop appears below)
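Here is a bare-bones sketch of that loop, using Hamming distance as the notion of "closest" and keeping one member of each joined pair as its representative; both choices are simplifying assumptions (UPGMA proper works on a distance matrix and averages distances between clusters).

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def naive_cluster_tree(seqs):
    """Repeatedly join the closest pair of sequences, as in the loop above.

    `seqs` maps labels to equal-length sequences.  Keeping one sequence of a
    joined pair as the pair's representative is a simplification; UPGMA proper
    averages distances between clusters.  The nested tuples record the tree.
    """
    nodes = dict(seqs)                      # label (or tuple of labels) -> sequence
    while len(nodes) > 2:
        a, b = min(combinations(nodes, 2),
                   key=lambda pair: hamming(nodes[pair[0]], nodes[pair[1]]))
        representative = nodes[a]
        del nodes[a], nodes[b]
        nodes[(a, b)] = representative      # the pair becomes a single "sequence"
    return tuple(nodes)                     # the last two groups are siblings

# Demo on the four sequences used in the Maximum Parsimony example later.
print(naive_cluster_tree({"s1": "ACT", "s2": "ACA", "s3": "GTT", "s4": "GTA"}))
# (('s1', 's2'), ('s3', 's4'))
```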
11. That was called UPGMA
- Advantages: UPGMA is polynomial time and works well under the strong molecular clock hypothesis.
- Disadvantages: UPGMA does not work well in simulations, perhaps because the molecular clock hypothesis does not generally apply.
- Other polynomial time methods, also distance-based, work better. One of the best of these is Neighbor Joining.
12. Quantifying Error
FN = false negative (missing edge); FP = false positive (incorrect edge).
[Figure: a true tree and an estimated tree with a missing edge (FN) and an incorrect edge (FP) marked; the example shown has a 50% error rate.]
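As a concrete illustration, the sketch below computes FN and FP rates when each tree has been reduced to the set of bipartitions (splits) induced by its internal edges; that split-set input format, and the toy trees at the end, are assumptions made for this example.

```python
def fn_fp_rates(true_splits, estimated_splits):
    """FN/FP rates for two trees given as sets of bipartitions (splits).

    A split of the true tree absent from the estimated tree is a false
    negative; a split of the estimated tree absent from the true tree is a
    false positive.
    """
    fn = len(true_splits - estimated_splits) / len(true_splits)
    fp = len(estimated_splits - true_splits) / len(estimated_splits)
    return fn, fp

# Toy example: each split is recorded by one side of an internal edge.
true_tree = {frozenset({"a", "b"}), frozenset({"a", "b", "c"})}
estimated = {frozenset({"a", "b"}), frozenset({"a", "b", "d"})}
print(fn_fp_rates(true_tree, estimated))   # (0.5, 0.5), i.e. a 50% error rate
```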
13. Neighbor joining has poor performance on large diameter trees (Nakhleh et al., ISMB 2001)
- Simulation study based upon fixed edge lengths, K2P model of evolution, and sequence lengths fixed to 1000 nucleotides.
- Error rates reflect the proportion of incorrect edges in inferred trees.
[Figure: NJ error rate (y-axis, 0 to 0.8) versus number of taxa (x-axis, 0 to 1600); the error rate grows with the number of taxa.]
14.
- Other standard polynomial time methods don't improve substantially on NJ (and have the same problem with large diameter datasets).
- What about trying to solve maximum parsimony or maximum likelihood?
15. Maximum Parsimony
- Input: a set S of n aligned sequences of length k
- Output:
  - a phylogenetic tree T leaf-labeled by the sequences in S
  - additional sequences of length k labeling the internal nodes of T
  - such that the sum of H(i,j) over all edges (i,j) of T is minimized, where H(i,j) denotes the Hamming distance between the sequences at nodes i and j (a small scoring sketch appears below)
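This definition translates almost directly into code. The sketch below assumes the tree is given as an edge list and that every node, internal ones included, already carries a sequence; choosing the internal sequences that minimize the score is the hard part and is not attempted here.

```python
def hamming(a, b):
    """H(i, j): number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def parsimony_score(edges, label):
    """Sum of H(i, j) over all edges (i, j) of T, as in the definition above.

    `edges` lists the tree's edges as pairs of node names, and `label` maps
    every node (leaf or internal) to its sequence.
    """
    return sum(hamming(label[i], label[j]) for i, j in edges)

# One labeling of the tree that pairs ACT with ACA and GTT with GTA
# (u and v are its two internal nodes); these are the sequences of the next slides.
edges = [("leaf_ACT", "u"), ("leaf_ACA", "u"), ("u", "v"),
         ("v", "leaf_GTT"), ("v", "leaf_GTA")]
label = {"leaf_ACT": "ACT", "leaf_ACA": "ACA", "leaf_GTT": "GTT",
         "leaf_GTA": "GTA", "u": "ACT", "v": "GTT"}
print(parsimony_score(edges, label))   # 4
```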
16. Maximum parsimony (example)
- Input: four sequences
  - ACT
  - ACA
  - GTT
  - GTA
- Question: which of the three possible trees has the best MP score?
17. Maximum Parsimony
[Figure: the three possible unrooted trees on the four sequences ACT, ACA, GTT, GTA.]
18. Maximum Parsimony
[Figure: the same three trees with internal-node labels and per-edge Hamming distances shown; the labelings shown give MP scores of 7, 5, and 4. The tree scoring 4, which pairs ACT with ACA and GTT with GTA, is the optimal MP tree.]
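The score of 4 can be verified with Fitch's algorithm, the standard way to compute the best achievable parsimony score on a fixed topology; Fitch is not covered in these slides, so the sketch below (specialized to four-leaf trees) is an add-on rather than the method used here.

```python
def fitch_quartet_score(a, b, c, d):
    """Best achievable parsimony score for the quartet tree (a,b) | (c,d).

    Fitch's algorithm, applied site by site: intersect the state sets of each
    sibling pair (taking the union and counting one change when the
    intersection is empty), then compare the two internal sets across the
    middle edge.
    """
    score = 0
    for x, y, z, w in zip(a, b, c, d):
        left, left_changes = ({x} & {y}, 0) if {x} & {y} else ({x} | {y}, 1)
        right, right_changes = ({z} & {w}, 0) if {z} & {w} else ({z} | {w}, 1)
        score += left_changes + right_changes + (0 if left & right else 1)
    return score

# The tree pairing ACT with ACA and GTT with GTA achieves the optimal score 4.
print(fitch_quartet_score("ACT", "ACA", "GTT", "GTA"))   # 4
```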
19. Maximum Parsimony: computational complexity
Finding an optimal maximum parsimony tree is NP-hard.
20. Solving NP-hard problems exactly is unlikely
- The number of (unrooted) binary trees on n leaves is (2n-5)!! (see the quick computation below)
- If each tree on 1000 taxa could be analyzed in 0.001 seconds, we would find the best tree in 2890 millennia
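A quick computation of (2n-5)!! shows how fast the tree count explodes; the values for small n are easy to check by hand.

```python
def num_unrooted_binary_trees(n):
    """(2n-5)!! = 3 * 5 * ... * (2n-5): the number of unrooted binary trees on n leaves."""
    count = 1
    for k in range(3, 2 * n - 4, 2):        # odd factors 3, 5, ..., 2n-5
        count *= k
    return count

for n in (5, 10, 20, 50):
    print(n, num_unrooted_binary_trees(n))
# 5 -> 15;  10 -> 2,027,025;  20 -> about 2.2e20;  50 -> about 2.8e74
```

Already at 20 taxa, examining every tree at 0.001 seconds apiece would take on the order of 10^17 seconds, i.e. billions of years.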
21. Approaches for solving MP/ML
- Hill-climbing heuristics (which can get stuck in local optima)
- Randomized algorithms for getting out of local optima (a generic sketch of these two ideas appears below)
- Approximation algorithms for MP (based upon Steiner Tree approximation algorithms)
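To show the control flow of the first two bullets, here is a generic hill-climbing-with-restarts skeleton; the search space, neighbor relation, and scoring function in the demo are toy stand-ins, since a real MP/ML search moves between tree topologies via operations such as NNI and scores them by parsimony length or likelihood.

```python
import random

def hill_climb_with_restarts(score, neighbors, random_start, restarts=20, seed=0):
    """Generic hill climbing with random restarts.

    Each restart begins at a random point, moves to the best-scoring neighbor
    as long as that improves the score (hill climbing), and stops at a local
    optimum; the best local optimum over all restarts is returned.  Lower
    scores are better, as with maximum parsimony.
    """
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        current = random_start(rng)
        while True:
            candidate = min(neighbors(current), key=score)
            if score(candidate) >= score(current):
                break                      # stuck in a local optimum
            current = candidate
        if best is None or score(current) < score(best):
            best = current
    return best

# Toy stand-in for tree space: "trees" are integers 0..99, neighbors differ by 1,
# and the score is a bumpy function with many local optima.
score = lambda t: (t - 73) ** 2 + 40 * (t % 7)
neighbors = lambda t: [x for x in (t - 1, t + 1) if 0 <= x < 100]
start = lambda rng: rng.randrange(100)

print(hill_climb_with_restarts(score, neighbors, start))
```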
22. Problems with current techniques for MP
Shown here is the performance of a heuristic maximum parsimony analysis on a real dataset of almost 14,000 sequences. ("Optimal" here means the best score to date, using any method for any amount of time.) Acceptable error is below 0.01.
[Figure: performance of TNT over time.]
23. Observations
- The best MP heuristics cannot get acceptably good solutions within 24 hours on most of these large datasets.
- Datasets of these sizes may need months (or years) of further analysis to reach reasonable solutions.
- Apparent convergence can be misleading.
24. Empirical problems with existing methods
- Heuristics for Maximum Parsimony (MP) and Maximum Likelihood (ML) cannot handle large datasets (they take too long!); we need new heuristics for MP/ML that can analyze large datasets.
- Polynomial time methods have poor topological accuracy on large diameter datasets; we need better polynomial time methods.
25. My research
- Focused on the design and analysis of algorithms for large-scale phylogeny reconstruction and multiple sequence alignment.
- Objective: the design of new algorithms with better performance than existing algorithms, as evidenced by mathematical theory, experiments, or empirical studies.
- Collaborations with biologists for modelling and data analysis.
- Current group: four PhD students, one postdoc, and two undergrads.
26. What happens after the analysis?
- The result of a phylogenetic analysis is often thousands (or tens of thousands) of equally good trees.
- How should we analyze the set of trees?
- How can we store the set of trees?
- Current approaches use consensus methods, as well as other techniques, to try to infer what the characteristics of the true tree are likely to be (a minimal consensus sketch appears below). Current techniques use too much space, take too much time, and are not sufficiently informative.
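As one small piece of this, the sketch below computes the splits of a majority-rule consensus, assuming each input tree has already been reduced to its set of bipartitions (the same representation used in the FN/FP sketch earlier); reassembling the retained splits into a tree is omitted.

```python
from collections import Counter

def majority_consensus_splits(trees_as_splits, threshold=0.5):
    """Bipartitions appearing in more than `threshold` of the input trees.

    Each input tree is summarized as a set of bipartitions (splits); the
    majority-rule consensus keeps exactly the splits that occur in more than
    half of the trees.
    """
    counts = Counter(split for splits in trees_as_splits for split in splits)
    cutoff = threshold * len(trees_as_splits)
    return {split for split, c in counts.items() if c > cutoff}

# Three equally good trees on the same leaf set, each given by its splits.
trees = [
    {frozenset({"a", "b"}), frozenset({"a", "b", "c"})},
    {frozenset({"a", "b"}), frozenset({"a", "b", "d"})},
    {frozenset({"a", "b"}), frozenset({"a", "b", "c"})},
]
print(majority_consensus_splits(trees))
# contains the splits {a, b} and {a, b, c}
```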
27. General comments
- There is interesting computer science research to be done in computational phylogenetics, with a tremendous potential for impact.
- Algorithm development must be tested on both real and simulated data.
- The interplay between data, stochastic models of evolution, optimization problems, and algorithms is important and instructive.