Computational and mathematical challenges involved in very large-scale phylogenetics - PowerPoint PPT Presentation

About This Presentation

Title:

Computational and mathematical challenges involved in very large-scale phylogenetics

Description:

Modelling variation between characters: Rates-across-sites ... Modelling variation between characters: The 'no-common-mechanism' model ... – PowerPoint PPT presentation

Number of Views:20

Avg rating:3.0/5.0

Slides: 48

Provided by: csUt8

Learn more at: https://www.cs.utexas.edu

Category:

more less

Transcript and Presenter's Notes

Title: Computational and mathematical challenges involved in very large-scale phylogenetics

1
Computational and mathematical challenges
involved in very large-scale phylogenetics

Tandy Warnow
The University of Texas at Austin

2
How did life evolve on earth?
An international effort to understand how life
evolved on earth Biomedical applications drug
design, protein structure and function
prediction, biodiversity.

Courtesy of the Tree of Life project

3
Steps in a phylogenetic analysis

Gather data
Align sequences
Reconstruct phylogeny on the multiple alignment -
often obtaining a large number of trees
Compute consensus (or otherwise estimate the
reliable components of the evolutionary history)
Perform post-tree analyses.

4
Reconstructing the Tree of Life
Handling large datasets millions of species,
and most problems are NP-hard
Evolution is complex Reticulate evolution
Indels, genome rearrangements, etc. Gene
tree/species tree differences
5
The CIPRES Project (Cyber-Infrastructure for
Phylogenetic Research)

The US National Science Foundation funds this
project, which has the following major
components
ALGORITHMS and SOFTWARE scaling to millions of
sequences (open source, freely distributed)
MATHEMATICS/PROBABILITY/STATISTICS Obtaining
better mathematical theory under complex models
of evolution
DATABASES Producing new database technology for
structured data, to enable scientific discoveries
SIMULATIONS The first million taxon simulation
under realistically complex models
OUTREACH Museum partners, K-12, general
scientific public
PORTAL available to all researchers
See www.phylo.org for more about CIPRES.

6
This talk

Part 1 Overview of some interesting mathematical
issues
Part 2 Better heuristics for NP-hard
optimization problems
Part 3 New methods for simultaneous estimation
of trees and alignments
Part 4 Related research and open problems.

7
Part 1 Mathematical issues under stochastic
models
8
DNA Sequence Evolution
9
Markov models of single site evolution

Simplest (Jukes-Cantor)
The model tree is a pair (T,e,p(e)), where T is
a rooted binary tree, and p(e) is the probability
of a substitution on the edge e.
The state at the root is random.
If a site changes on an edge, it changes with
equal probability to each of the remaining
states.
The evolutionary process is Markovian.
More complex models (such as the General Markov
model) are also considered, often with little
change to the theory.

10
Modelling variation between characters
Rates-across-sites

If a site (i.e., character) is twice as fast as
another on one edge, it is twice as fast
everywhere.
The distribution of the rates is typically
assumed to be gamma.

B
D
A
C
B
D
A
C
11
Modelling variation between characters The
no-common-mechanism model

A separate random variable for every combination
of site and edge - the underlying tree is fixed,
but otherwise there are no constraints on
variation between sites.

C
A
D
B
B
D
A
C
12
Identifiability and statistical consistency

A model is identifiable if it is uniquely
characterized by the probability distribution it
defines.
A phylogenetic reconstruction method is
statistically consistent under a model if the
probability that the method reconstructs the true
tree goes to 1 as the sequence length increases.

13
Identifiability results

The standard Markov models (from Jukes-Cantor
to the General Markov model) are identifiable.
These models are also identifiable when sites
draw rates from a gamma distribution (easy to
prove if the distribution is known, and harder to
prove if the distribution must be estimated - cf.
Allmans talk).
However, mixed models are often not identifiable
(cf. Matsens talk), nor are some models in which
sites draw rates from more complex distributions.
Phylogeny estimation typically is done under
identifiable models.

14
What about phylogeny reconstruction methods?
U
V
W
X
Y
TAGCCCA
TAGACTT
TGCACAA
TGCGCTT
AGGGCAT
X
U
Y
V
W
15
Phylogenetic reconstruction methods

Hill-climbing heuristics for NP-hard optimization
criteria (Maximum Parsimony and Maximum
Likelihood)

Polynomial time distance-based methods Neighbor
Joining, FastME, Weighbor, etc.
Bayesian methods

16
Performance criteria

Running time.
Space.
Statistical performance issues (e.g., statistical
consistency and sequence length requirements)
Topological accuracy with respect to the
underlying true tree. Typically studied in
simulation.
Accuracy with respect to a particular criterion
(e.g. tree length or likelihood score), on real
data.

17
Quantifying Error
FN
FN false negative (missing edge) FP false
positive (incorrect edge) 50 error rate
FP
18
Statistical consistency, exponential convergence,
and absolute fast convergence (afc)
19
Theoretical results statistical consistency
under typical models?

Neighbor Joining is polynomial time, and
statistically consistent.
Maximum Parsimony is NP-hard, and even exact
solutions are not statistically consistent.
Maximum Likelihood is NP-hard, but exact
solutions are statistically consistent

20
Theoretical results sequence length requirements
under typical models?

Neighbor joining (and some other distance-based
methods) will return the true tree with high
probability provided sequence lengths are
exponential in the diameter of the tree (Erdos et
al., Atteson).
Maximum likelihood will return the true tree with
high probability provided sequence lengths are
exponential in the number of taxa (Steel and
Szekely).

21
Neighbor joining has poor performance on large
diameter trees Nakhleh et al. ISMB 2001

Simulation study based upon fixed edge lengths,
K2P model of evolution, sequence lengths fixed to
1000 nucleotides.
Error rates reflect proportion of incorrect edges
in inferred trees.

0.8
NJ
0.6
Error Rate
0.4
0.2
0
0
400
800
1600
1200
No. Taxa
22
DCM1 Warnow, St. John, and Moret, SODA 2001
Exponentially converging method
Absolute fast converging method
DCM
SQS

A two-phase procedure which reduces the sequence
length requirement of methods. The DCM phase
produces a collection of trees, and the SQS phase
picks the best tree.
The base method is applied to subsets of the
original dataset. When the base method is NJ,
you get DCM1-NJ.

23
DCM1-boosting distance-based methodsNakhleh et
al. ISMB 2001

Theorem DCM1-NJ converges to the true tree from
polynomial length sequences

0.8
NJ
DCM1-NJ
0.6
Error Rate
0.4
0.2
0
0
400
800
1600
1200
No. Taxa
24
However,

The best accuracy in simulation tends to be from
computationally intensive methods (and most
molecular phylogeneticists prefer these methods).
Unfortunately, these approaches can take weeks or
more, just to reach decent local optima.
Conclusion We need better heuristics!

25
Part 2 Improved heuristics for NP-hard
optimization problems
26
Approaches for solving MP/ML

Hill-climbing heuristics (which can get stuck in
local optima)
Randomized algorithms for getting out of local
optima
Approximation algorithms for MP (based upon
Steiner Tree approximation algorithms).

27
Problems with current techniques for MP
Shown here is the performance of a TNT heuristic
maximum parsimony analysis on a real dataset of
almost 14,000 sequences. (Optimal here means
best score to date, using any method for any
amount of time.) Acceptable error is below 0.01.
Performance of TNT with time
28
Observations

The best MP heuristics cannot get acceptably good
solutions within 24 hours on some large datasets.
Datasets of these sizes may need months (or
years) of further analysis to reach reasonable
solutions.
Apparent convergence can be misleading.

29
Rec-I-DCM3 a new technique (Roshan et al.)

Combines a new decomposition technique (DCM3)
with recursion and iteration, to produce a novel
approach for escaping local optima
Tested initially on MP (maximum parsimony), but
also implemented for ML and other optimization
problems

30
The DCM3 decomposition

DCM3 decompositions
can be obtained in O(n) time (the
short subtree graph is triangulated)
(2) yield small subproblems
(3) can be used iteratively

31
Iterative-DCM3
T
DCM3
Base method
T
32
Rec-I-DCM3 significantly improves performance
(Roshan et al.)
Current best techniques
DCM boosted version of best techniques
Comparison of TNT to Rec-I-DCM3(TNT) on one large
dataset
33
Observations

Rec-I-DCM3 improves upon the best performing
heuristics for MP.
The improvement increases with the difficulty of
the dataset.
Rec-I-DCM3 also improves maximum likelihood
(using RAxML) analyses (data not shown), and is
included in the CIPRES portal.

34
Part 3 Multiple sequence alignment
35
Steps in a phylogenetic analysis

Gather data
Align sequences
Reconstruct phylogeny on the multiple alignment -
often obtaining a large number of trees
Compute consensus (or otherwise estimate the
reliable components of the evolutionary history)
Perform post-tree analyses.

36
Steps in a phylogenetic analysis

Gather data
Align sequences
Reconstruct phylogeny on the multiple alignment -
often obtaining a large number of trees
Compute consensus (or otherwise estimate the
reliable components of the evolutionary history)
Perform post-tree analyses.

37
Basic observations about standard two-phase
methods

Many new MSA methods improve on ClustalW, with
ProbCons and MAFFT the two best MSA methods.
Although alignment accuracy correlates with
phylogenetic accuracy, it has less effect than
might be expected (Wang et al., unpublished).

38
What about simultaneous estimation?

Several Bayesian methods for simultaneous
estimation of trees and alignments have been
developed, but none can be applied to datasets
with more than (approx.) 20 sequences.
POY attempts to solve the NP-hard minimum length
tree problem, where gaps contribute to the
length of the tree and can be applied to large
datasets. However, its performance on simulated
data isnt competitive with the best two-phase
methods (unpublished data).

39
New method SATe (Simultaneous Alignment and
Tree estimation)

Developers Warnow, Linder, Liu, Nelesen, and
Zhao.
Basic technique propose alignments (using
treelength under a selected affine gap penalty),
and compute maximum likelihood trees for these
alignments under GTR.
Unpublished.

40
Simulation study

100 taxon model trees (generated by r8s and then
modified, so as to deviate from the molecular
clock).
DNA sequences evolved under ROSE (indel events of
blocks of nucleotides, plus HKY site evolution).
The root sequence has 1000 sites.
We vary the gap length distribution, probability
of gaps, and probability of substitutions, to
produce 8 model conditions models 1-4 have long
gaps and 5-8 have short gaps.
We compared RAxML on various alignments
(including the true alignment) to SATe.

41
Topological accuracy

FN (false negative) proportion of correct edges
missing from the estimated tree
FP (false positive) proportion of incorrect
edges in the estimated tree

42
Alignment accuracy

Normalized number of columns in the estimated
alignment relative to the true alignment.

43
Alignment accuracy

FN proportion of correctly homologous pairs of
nucleotides missing from the estimated alignment
(i.e., 1-SP score).
FP proportion of incorrect pairings of
nucleotides in the estimated alignment.

44
Other observations

SATe is more accurate at estimating the number of
gaps and the correct alignment length than other
methods.
The standard alignment accuracy measure, SP, is
not particularly predictive of phylogenetic
accuracy.
We still need methods for MSA under models that
include rearrangements and duplications.

45
Part IV Open questions

Tree shape (including branch lengths) has an
impact on phylogeny reconstruction - but what
model of tree shape to use?
What is the sequence length requirement for
Maximum Likelihood? (Current bound is probably
too large.)
Why is MP not so bad?
How to detect and reconstruct reticulate
evolution?
Teasing apart trees under complex models?
What complex models are identifiable?

46
General comments

Current models of sequence evolution are clearly
too simple, and more realistic ones are not
identifiable.
Relative performance between methods can change
as the models become more complex or as the
number of taxa increases.
We do not know how methods perform under
realistic conditions (nor how long we need to let
computationally intensive methods run).

47
Acknowledgements

Funding NSF, The David and Lucile Packard
Foundation, The Program in Evolutionary Dynamics
at Harvard, and The Institute for Cellular and
Molecular Biology at UT-Austin.
Collaborators Peter Erdos, Daniel Huson, Randy
Linder, Kevin Liu, Bernard Moret, Serita Nelesen,
Usman Roshan, Mike Steel, Katherine St. John,
Laszlo Szekely, Tiffani Williams, and David Zhao.

Write a Comment

User Comments (0)