Title: Algorithmic research in phylogeny reconstruction
1 Algorithmic research in phylogeny reconstruction
- Tandy Warnow
- The University of Texas at Austin
2 Phylogeny
[Figure: tree relating Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life website, University of Arizona.]
3 Reconstructing the Tree of Life
- Handling large datasets: millions of species
- NSF funds many projects towards this goal under the
Assembling the Tree of Life (ATOL) program
4 Current projects
- Heuristics for NP-hard optimization problems for
phylogeny reconstruction
- Phylogenetic multiple sequence alignment
- Detecting and reconstructing horizontal gene
transfer and hybridization
- Constructing phylogenies on languages
- Graph theory, combinatorial optimization, and
probabilistic analysis are fundamental to
algorithm development in this area, but all
methods are also extensively tested in simulation
and on real data. Collaborations with
biologists or linguists are essential.
5 DNA Sequence Evolution
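6 Phylogeny Problem
[Figure: five input sequences (TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT) for the taxa U, V, W, X, Y, and an output tree whose leaves appear in the order X, U, Y, V, W.]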
7 Solving NP-hard problems exactly is unlikely
leaves    trees
4         3
5         15
6         105
7         945
8         10395
9         135135
10        2027025
20        2.2 x 10^20
100       4.5 x 10^190
1000      2.7 x 10^2900
- Number of (unrooted) binary trees on n leaves is
(2n-5)!!
- If each tree on 1000 taxa could be analyzed in
0.001 seconds, we would find the best tree in
2890 millennia
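The (2n-5)!! count can be checked directly with a short Python sketch. The function name `num_unrooted_binary_trees` is chosen here for illustration and is not from the slides; the double factorial is taken to be the product of the odd numbers 3, 5, ..., 2n-5 (an empty product, i.e. 1, when n = 3).

```python
def num_unrooted_binary_trees(n):
    """Return (2n-5)!!, the number of unrooted binary trees on n labeled leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5.
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    # Python integers are arbitrary precision, so large n is not a problem.
    for n in (4, 5, 6, 7, 8, 9, 10, 20):
        print(n, num_unrooted_binary_trees(n))
```

Because the count grows faster than exponentially in n, exhaustive enumeration stops being practical after only a few dozen leaves.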
8 Approaches for solving hard optimization
problems (like maximum parsimony)
- Hill-climbing heuristics (which can get stuck in
local optima)
- Randomized algorithms for getting out of local
optima
- Approximation algorithms (give bounds on what is
possible)
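All of these approaches repeatedly score candidate trees. For maximum parsimony, the score of one fixed tree is the minimum number of substitutions needed to explain the sequences, which Fitch's algorithm computes site by site. The sketch below is an illustration under stated assumptions, not the implementation used in the work described here: trees are nested 2-tuples of leaf names, sequences are aligned strings of equal length, and the names `fitch_score`, `tree`, and `seqs` are hypothetical. The tree is written as rooted, but the parsimony score does not depend on where the root is placed.

```python
def fitch_score(tree, seqs):
    """Parsimony score of a binary tree (nested 2-tuples) under Fitch's algorithm."""
    n_sites = len(next(iter(seqs.values())))

    def visit(node, site):
        # Returns (candidate state set, substitutions needed below this node).
        if isinstance(node, str):
            return {seqs[node][site]}, 0
        left_states, left_cost = visit(node[0], site)
        right_states, right_cost = visit(node[1], site)
        common = left_states & right_states
        if common:
            return common, left_cost + right_cost
        # Disjoint state sets: one substitution is needed at this node.
        return left_states | right_states, left_cost + right_cost + 1

    return sum(visit(tree, site)[1] for site in range(n_sites))


if __name__ == "__main__":
    # Toy data (hypothetical): one site, two taxa with A and two with C.
    seqs = {"a": "A", "b": "A", "c": "C", "d": "C"}
    print(fitch_score((("a", "b"), ("c", "d")), seqs))  # groups equal states
    print(fitch_score((("a", "c"), ("b", "d")), seqs))  # splits equal states
```

On the toy data the first tree needs one substitution and the second needs two. A hill-climbing heuristic would prefer the first and keep proposing local rearrangements until no neighboring tree scores lower.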
9 Problems with current techniques for MP
Shown here is the performance of a heuristic
maximum parsimony analysis on a real dataset of
almost 14,000 sequences. ("Optimal" here means
best score to date, using any method for any
amount of time.) Acceptable error is below 0.01.
[Figure: Performance of TNT with time]
10 Performance of NJ, a popular polynomial-time
method [Nakhleh et al. ISMB 2001]
- Simulation study based upon fixed edge lengths,
K2P model of evolution, sequence lengths fixed to
1000 nucleotides.
- Error rates reflect proportion of incorrect edges
in inferred trees.
[Figure: error rate of NJ (y-axis, 0 to 0.8) against number of taxa (x-axis, 0 to 1600)]
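One common way to make "proportion of incorrect edges" concrete is to compare the bipartitions (splits of the leaf set) induced by the internal edges of the inferred tree against those of the true tree. The sketch below assumes that definition, which the slide does not spell out. Trees are nested tuples of leaf names, and the function names `bipartitions` and `edge_error_rate` are hypothetical.

```python
def leaf_set(tree):
    """All leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Nontrivial leaf bipartitions induced by the internal edges of the tree."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # used to pick one canonical side of each split
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - below
        if len(below) > 1 and len(other) > 1:
            # Store the side containing the anchor so each split appears once.
            splits.add(below if anchor in below else other)
        return below

    visit(tree)
    return splits


def edge_error_rate(inferred, true):
    """Fraction of internal edges of the inferred tree that are absent from the true tree."""
    inferred_splits = bipartitions(inferred)
    if not inferred_splits:
        return 0.0
    return len(inferred_splits - bipartitions(true)) / len(inferred_splits)


if __name__ == "__main__":
    true_tree = ((("a", "b"), "c"), ("d", "e"))
    inferred_tree = ((("a", "c"), "b"), ("d", "e"))
    print(edge_error_rate(inferred_tree, true_tree))
```

In this toy case the inferred tree recovers the split {a, b, c} | {d, e} but wrongly groups a with c, so one of its two internal edges is incorrect.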
11 DCMs (Disk-Covering Methods)
- DCMs for polynomial-time methods improve
topological accuracy (empirical observation), and
have provable theoretical guarantees under Markov
models of evolution
- DCMs for hard optimization problems reduce
running time needed to achieve good levels of
accuracy (empirical observation)
12 DCMs: Divide-and-conquer for improving phylogeny
reconstruction
13 Boosting phylogeny reconstruction methods
- DCMs boost the performance of phylogeny
reconstruction methods.
[Figure: DCM combined with a base method M yields DCM-M]
14 Iterative-DCM3
[Figure: T, DCM3, Base method, T]
15 Rec-I-DCM3 significantly improves performance
[Figure: comparison of TNT to Rec-I-DCM3(TNT) on one large
dataset. Curves: current best techniques; DCM-boosted version of best techniques.]
16 DCM1-boosting distance-based methods [Nakhleh et
al. ISMB 2001]
- DCM1-boosting makes distance-based methods more
accurate
- Theoretical guarantees that DCM1-NJ converges to
the true tree from polynomial-length sequences
[Figure: error rates of NJ and DCM1-NJ (y-axis, 0 to 0.8) against number of taxa (x-axis, 0 to 1600)]
17 General comments
- Everything in phylogeny (just about) is NP-hard
- Graph theory, probability, and optimization are
the basic tools for algorithmic advances
- Algorithms are tested on both real and simulated
data.
- Collaborations with domain experts (biologists or
linguists) are essential to success. (At UT, we have
wonderful biologists to work with, and all my
students collaborate with them.)
18 For more information
- Send me email to make an appointment
- Check my webpage for tutorials on the subject
- See http://www.phylo.org and
http://www.cs.utexas.edu/tandy for more info