Computational and mathematical challenges involved in very large-scale phylogenetics - PowerPoint PPT Presentation

About This Presentation
Title:

Computational and mathematical challenges involved in very large-scale phylogenetics

Description:

Modelling variation between characters: Rates-across-sites ... Modelling variation between characters: The 'no-common-mechanism' model ... – PowerPoint PPT presentation

Number of Views:20
Avg rating:3.0/5.0
Slides: 48
Provided by: csUt8
Category:

less

Transcript and Presenter's Notes

Title: Computational and mathematical challenges involved in very large-scale phylogenetics


1
Computational and mathematical challenges
involved in very large-scale phylogenetics
  • Tandy Warnow
  • The University of Texas at Austin

2
How did life evolve on earth?
An international effort to understand how life
evolved on earth Biomedical applications drug
design, protein structure and function
prediction, biodiversity.
  • Courtesy of the Tree of Life project

3
Steps in a phylogenetic analysis
  • Gather data
  • Align sequences
  • Reconstruct phylogeny on the multiple alignment -
    often obtaining a large number of trees
  • Compute consensus (or otherwise estimate the
    reliable components of the evolutionary history)
  • Perform post-tree analyses.

4
Reconstructing the Tree of Life
Handling large datasets millions of species,
and most problems are NP-hard
Evolution is complex Reticulate evolution
Indels, genome rearrangements, etc. Gene
tree/species tree differences
5
The CIPRES Project (Cyber-Infrastructure for
Phylogenetic Research)
  • The US National Science Foundation funds this
    project, which has the following major
    components
  • ALGORITHMS and SOFTWARE scaling to millions of
    sequences (open source, freely distributed)
  • MATHEMATICS/PROBABILITY/STATISTICS Obtaining
    better mathematical theory under complex models
    of evolution
  • DATABASES Producing new database technology for
    structured data, to enable scientific discoveries
  • SIMULATIONS The first million taxon simulation
    under realistically complex models
  • OUTREACH Museum partners, K-12, general
    scientific public
  • PORTAL available to all researchers
  • See www.phylo.org for more about CIPRES.

6
This talk
  • Part 1 Overview of some interesting mathematical
    issues
  • Part 2 Better heuristics for NP-hard
    optimization problems
  • Part 3 New methods for simultaneous estimation
    of trees and alignments
  • Part 4 Related research and open problems.

7
Part 1 Mathematical issues under stochastic
models
8
DNA Sequence Evolution
9
Markov models of single site evolution
  • Simplest (Jukes-Cantor)
  • The model tree is a pair (T,e,p(e)), where T is
    a rooted binary tree, and p(e) is the probability
    of a substitution on the edge e.
  • The state at the root is random.
  • If a site changes on an edge, it changes with
    equal probability to each of the remaining
    states.
  • The evolutionary process is Markovian.
  • More complex models (such as the General Markov
    model) are also considered, often with little
    change to the theory.

10
Modelling variation between characters
Rates-across-sites
  • If a site (i.e., character) is twice as fast as
    another on one edge, it is twice as fast
    everywhere.
  • The distribution of the rates is typically
    assumed to be gamma.

B
D
A
C
B
D
A
C
11
Modelling variation between characters The
no-common-mechanism model
  • A separate random variable for every combination
    of site and edge - the underlying tree is fixed,
    but otherwise there are no constraints on
    variation between sites.

C
A
D
B
B
D
A
C
12
Identifiability and statistical consistency
  • A model is identifiable if it is uniquely
    characterized by the probability distribution it
    defines.
  • A phylogenetic reconstruction method is
    statistically consistent under a model if the
    probability that the method reconstructs the true
    tree goes to 1 as the sequence length increases.

13
Identifiability results
  • The standard Markov models (from Jukes-Cantor
    to the General Markov model) are identifiable.
  • These models are also identifiable when sites
    draw rates from a gamma distribution (easy to
    prove if the distribution is known, and harder to
    prove if the distribution must be estimated - cf.
    Allmans talk).
  • However, mixed models are often not identifiable
    (cf. Matsens talk), nor are some models in which
    sites draw rates from more complex distributions.
  • Phylogeny estimation typically is done under
    identifiable models.

14
What about phylogeny reconstruction methods?
U
V
W
X
Y
TAGCCCA
TAGACTT
TGCACAA
TGCGCTT
AGGGCAT
X
U
Y
V
W
15
Phylogenetic reconstruction methods
  1. Hill-climbing heuristics for NP-hard optimization
    criteria (Maximum Parsimony and Maximum
    Likelihood)
  1. Polynomial time distance-based methods Neighbor
    Joining, FastME, Weighbor, etc.
  2. Bayesian methods

16
Performance criteria
  • Running time.
  • Space.
  • Statistical performance issues (e.g., statistical
    consistency and sequence length requirements)
  • Topological accuracy with respect to the
    underlying true tree. Typically studied in
    simulation.
  • Accuracy with respect to a particular criterion
    (e.g. tree length or likelihood score), on real
    data.

17
Quantifying Error
FN
FN false negative (missing edge) FP false
positive (incorrect edge) 50 error rate
FP
18
Statistical consistency, exponential convergence,
and absolute fast convergence (afc)
19
Theoretical results statistical consistency
under typical models?
  • Neighbor Joining is polynomial time, and
    statistically consistent.
  • Maximum Parsimony is NP-hard, and even exact
    solutions are not statistically consistent.
  • Maximum Likelihood is NP-hard, but exact
    solutions are statistically consistent

20
Theoretical results sequence length requirements
under typical models?
  • Neighbor joining (and some other distance-based
    methods) will return the true tree with high
    probability provided sequence lengths are
    exponential in the diameter of the tree (Erdos et
    al., Atteson).
  • Maximum likelihood will return the true tree with
    high probability provided sequence lengths are
    exponential in the number of taxa (Steel and
    Szekely).

21
Neighbor joining has poor performance on large
diameter trees Nakhleh et al. ISMB 2001
  • Simulation study based upon fixed edge lengths,
    K2P model of evolution, sequence lengths fixed to
    1000 nucleotides.
  • Error rates reflect proportion of incorrect edges
    in inferred trees.

0.8
NJ
0.6
Error Rate
0.4
0.2
0
0
400
800
1600
1200
No. Taxa
22
DCM1 Warnow, St. John, and Moret, SODA 2001
Exponentially converging method
Absolute fast converging method
DCM
SQS
  • A two-phase procedure which reduces the sequence
    length requirement of methods. The DCM phase
    produces a collection of trees, and the SQS phase
    picks the best tree.
  • The base method is applied to subsets of the
    original dataset. When the base method is NJ,
    you get DCM1-NJ.

23
DCM1-boosting distance-based methodsNakhleh et
al. ISMB 2001
  • Theorem DCM1-NJ converges to the true tree from
    polynomial length sequences

0.8
NJ
DCM1-NJ
0.6
Error Rate
0.4
0.2
0
0
400
800
1600
1200
No. Taxa
24
However,
  • The best accuracy in simulation tends to be from
    computationally intensive methods (and most
    molecular phylogeneticists prefer these methods).
  • Unfortunately, these approaches can take weeks or
    more, just to reach decent local optima.
  • Conclusion We need better heuristics!

25
Part 2 Improved heuristics for NP-hard
optimization problems
26
Approaches for solving MP/ML
  1. Hill-climbing heuristics (which can get stuck in
    local optima)
  2. Randomized algorithms for getting out of local
    optima
  3. Approximation algorithms for MP (based upon
    Steiner Tree approximation algorithms).

27
Problems with current techniques for MP
Shown here is the performance of a TNT heuristic
maximum parsimony analysis on a real dataset of
almost 14,000 sequences. (Optimal here means
best score to date, using any method for any
amount of time.) Acceptable error is below 0.01.
Performance of TNT with time
28
Observations
  • The best MP heuristics cannot get acceptably good
    solutions within 24 hours on some large datasets.
  • Datasets of these sizes may need months (or
    years) of further analysis to reach reasonable
    solutions.
  • Apparent convergence can be misleading.

29
Rec-I-DCM3 a new technique (Roshan et al.)
  • Combines a new decomposition technique (DCM3)
    with recursion and iteration, to produce a novel
    approach for escaping local optima
  • Tested initially on MP (maximum parsimony), but
    also implemented for ML and other optimization
    problems

30
The DCM3 decomposition
  • DCM3 decompositions
  • can be obtained in O(n) time (the
  • short subtree graph is triangulated)
  • (2) yield small subproblems
  • (3) can be used iteratively

31
Iterative-DCM3
T
DCM3
Base method
T
32
Rec-I-DCM3 significantly improves performance
(Roshan et al.)
Current best techniques
DCM boosted version of best techniques
Comparison of TNT to Rec-I-DCM3(TNT) on one large
dataset
33
Observations
  • Rec-I-DCM3 improves upon the best performing
    heuristics for MP.
  • The improvement increases with the difficulty of
    the dataset.
  • Rec-I-DCM3 also improves maximum likelihood
    (using RAxML) analyses (data not shown), and is
    included in the CIPRES portal.

34
Part 3 Multiple sequence alignment
35
Steps in a phylogenetic analysis
  • Gather data
  • Align sequences
  • Reconstruct phylogeny on the multiple alignment -
    often obtaining a large number of trees
  • Compute consensus (or otherwise estimate the
    reliable components of the evolutionary history)
  • Perform post-tree analyses.

36
Steps in a phylogenetic analysis
  • Gather data
  • Align sequences
  • Reconstruct phylogeny on the multiple alignment -
    often obtaining a large number of trees
  • Compute consensus (or otherwise estimate the
    reliable components of the evolutionary history)
  • Perform post-tree analyses.

37
Basic observations about standard two-phase
methods
  • Many new MSA methods improve on ClustalW, with
    ProbCons and MAFFT the two best MSA methods.
  • Although alignment accuracy correlates with
    phylogenetic accuracy, it has less effect than
    might be expected (Wang et al., unpublished).

38
What about simultaneous estimation?
  • Several Bayesian methods for simultaneous
    estimation of trees and alignments have been
    developed, but none can be applied to datasets
    with more than (approx.) 20 sequences.
  • POY attempts to solve the NP-hard minimum length
    tree problem, where gaps contribute to the
    length of the tree and can be applied to large
    datasets. However, its performance on simulated
    data isnt competitive with the best two-phase
    methods (unpublished data).

39
New method SATe (Simultaneous Alignment and
Tree estimation)
  • Developers Warnow, Linder, Liu, Nelesen, and
    Zhao.
  • Basic technique propose alignments (using
    treelength under a selected affine gap penalty),
    and compute maximum likelihood trees for these
    alignments under GTR.
  • Unpublished.

40
Simulation study
  • 100 taxon model trees (generated by r8s and then
    modified, so as to deviate from the molecular
    clock).
  • DNA sequences evolved under ROSE (indel events of
    blocks of nucleotides, plus HKY site evolution).
    The root sequence has 1000 sites.
  • We vary the gap length distribution, probability
    of gaps, and probability of substitutions, to
    produce 8 model conditions models 1-4 have long
    gaps and 5-8 have short gaps.
  • We compared RAxML on various alignments
    (including the true alignment) to SATe.

41
Topological accuracy
  • FN (false negative) proportion of correct edges
    missing from the estimated tree
  • FP (false positive) proportion of incorrect
    edges in the estimated tree

42
Alignment accuracy
  • Normalized number of columns in the estimated
    alignment relative to the true alignment.

43
Alignment accuracy
  • FN proportion of correctly homologous pairs of
    nucleotides missing from the estimated alignment
    (i.e., 1-SP score).
  • FP proportion of incorrect pairings of
    nucleotides in the estimated alignment.

44
Other observations
  • SATe is more accurate at estimating the number of
    gaps and the correct alignment length than other
    methods.
  • The standard alignment accuracy measure, SP, is
    not particularly predictive of phylogenetic
    accuracy.
  • We still need methods for MSA under models that
    include rearrangements and duplications.

45
Part IV Open questions
  • Tree shape (including branch lengths) has an
    impact on phylogeny reconstruction - but what
    model of tree shape to use?
  • What is the sequence length requirement for
    Maximum Likelihood? (Current bound is probably
    too large.)
  • Why is MP not so bad?
  • How to detect and reconstruct reticulate
    evolution?
  • Teasing apart trees under complex models?
  • What complex models are identifiable?

46
General comments
  • Current models of sequence evolution are clearly
    too simple, and more realistic ones are not
    identifiable.
  • Relative performance between methods can change
    as the models become more complex or as the
    number of taxa increases.
  • We do not know how methods perform under
    realistic conditions (nor how long we need to let
    computationally intensive methods run).

47
Acknowledgements
  • Funding NSF, The David and Lucile Packard
    Foundation, The Program in Evolutionary Dynamics
    at Harvard, and The Institute for Cellular and
    Molecular Biology at UT-Austin.
  • Collaborators Peter Erdos, Daniel Huson, Randy
    Linder, Kevin Liu, Bernard Moret, Serita Nelesen,
    Usman Roshan, Mike Steel, Katherine St. John,
    Laszlo Szekely, Tiffani Williams, and David Zhao.
Write a Comment
User Comments (0)
About PowerShow.com