Large-Scale Phylogenetic Analysis

Slides: 47
Provided by: lisan4
1
Large-Scale Phylogenetic Analysis
  • Tandy Warnow
  • Associate Professor, Department of Computer
    Sciences
  • Graduate Program in Evolution and Ecology
  • Co-Director, The Center for Computational Biology
    and Bioinformatics
  • The University of Texas at Austin

2
Outline of Talk
  • Phylogenetic reconstruction from DNA sequences:
    the problems, and the progress
  • Phylogenetic reconstruction from gene order and
    content in whole genomes: initial work
  • The future of large-scale phylogeny, and the
    possibilities of inferring the Tree of Life

3
I. Molecular Systematics
[Figure: five taxa (U, V, W, X, Y), their DNA sequences (TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT), and a tree with those taxa as leaves]
4
DNA Sequence Evolution
5
Major Phylogenetic Reconstruction Methods
  • Polynomial-time distance-based methods (neighbor
    joining is the most popular)
  • NP-hard sequence-based methods
    • Maximum Parsimony
    • Maximum Likelihood
  • Heated debates over the relative performance of
    these methods

6
Quantifying Error
FN: false negative (missing edge)
FP: false positive (incorrect edge)
[Figure: example tree comparison with FN and FP edges marked; 50% error rate]
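The two error rates can be sketched in code. This is a minimal illustration, assuming each tree is given as the set of bipartitions induced by its internal edges; the function name `fn_fp_rates` and the toy trees are hypothetical and do not come from the talk.

```python
def fn_fp_rates(true_splits, est_splits):
    """Return (FN rate, FP rate) of an estimated tree against the true tree.

    Each argument is a set of bipartitions (one per internal edge). A
    bipartition is written as the frozenset of taxa on the side that does
    not contain taxon Y, so equal splits compare equal.
    """
    missing = true_splits - est_splits      # true edges that were not recovered
    incorrect = est_splits - true_splits    # estimated edges absent from the true tree
    return len(missing) / len(true_splits), len(incorrect) / len(est_splits)


# Hypothetical 5-taxon example: a binary tree on 5 leaves has 2 internal edges.
true_tree = {frozenset("UV"), frozenset("UVW")}   # ((U,V),W,(X,Y))
est_tree = {frozenset("UV"), frozenset("UVX")}    # ((U,V),X,(W,Y))
print(fn_fp_rates(true_tree, est_tree))  # (0.5, 0.5)
```

For two binary trees on the same taxa the two rates coincide, because both trees have the same number of internal edges.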
7
Main Result: DCM-Boosting and DCM-NJ+ML
We have developed the first polynomial-time
methods that improve upon NJ (with respect to
topological accuracy) and are never worse than
NJ. The method is obtained through DCM-boosting.
8
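Basis of Distance-Based Methods: Additivity
  • A distance matrix $D$ is additive if there exists
    a tree $T$ and an edge weighting $w$ such
    that $D_{ij} = \sum_{e \in P_{ij}} w(e)$, where $P_{ij}$ is the
    path in $T$ between leaves $i$ and $j$.
  • Waterman et al. (1977) showed that an additive $D$
    determines the weighted tree $(T, w)$ uniquely, and that
    the tree can be reconstructed from $D$ in polynomial time.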
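
Additivity can be tested with the four-point condition, a standard characterization that the slide does not state: for every four taxa, the two largest of the three pairwise sums must be equal. The sketch below assumes a symmetric matrix with a zero diagonal; the function name and the example matrices are illustrative only.

```python
from itertools import combinations


def is_additive(D, tol=1e-9):
    """Check the four-point condition on a symmetric distance matrix D."""
    n = len(D)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([D[i][j] + D[k][l],
                       D[i][k] + D[j][l],
                       D[i][l] + D[j][k]])
        # The two largest sums must agree (up to numerical tolerance).
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Quartet tree ((A,B),(C,D)): leaf edges of weight 1, internal edge of weight 2.
D = [[0, 2, 4, 4],
     [2, 0, 4, 4],
     [4, 4, 0, 2],
     [4, 4, 2, 0]]
print(is_additive(D))      # True

# Perturbing one entry (an estimation error) breaks additivity.
D_bad = [row[:] for row in D]
D_bad[0][2] = D_bad[2][0] = 5
print(is_additive(D_bad))  # False
```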

9
Distance-based Phylogenetic Methods
10
Statistical Consistency
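  • Atteson (1999) showed that $NJ(d) = T$ if
    $\max_{i,j} |d_{ij} - D_{ij}|$ is small enough, where $d$ is the
    estimated distance matrix and $D$ is the additive matrix of
    the true tree $T$.

Hence NJ is statistically consistent for many
models of evolution. But what about performance
on finite sequence lengths?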
11
We focus on performance on finite sequence lengths
12
Absolute fast convergence vs. exponential
convergence
13
General Markov (GM) Model
  • A GM model tree is a pair $(T, \{M(e)\})$ where
  • $T$ is a rooted binary tree.
  • $\{M(e) : e \in E(T)\}$, and $M(e)$ is
    a stochastic substitution matrix with
    $\det(M(e)) \notin \{0, 1, -1\}$.
  • The sequence at the root of $T$ is drawn from a
    uniform distribution.
  • The rates of evolution across the sites can be
    drawn from a fixed distribution.
  • GM contains models like the Jukes-Cantor (JC) and
    Kimura 2-Parameter (K2P) models.
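
As a rough illustration of sequence evolution on a model tree, the sketch below evolves a uniformly random root sequence across two edges with a Jukes-Cantor substitution matrix, one of the special cases named above. The helper names, the per-edge change probability `p`, and the fixed random seed are assumptions made for the example.

```python
import random

NUCS = ["A", "C", "G", "T"]


def jc_matrix(p):
    """Jukes-Cantor matrix for one edge: a site changes with total
    probability p, spread evenly over the three other nucleotides."""
    return {a: {b: (1 - p) if a == b else p / 3 for b in NUCS} for a in NUCS}


def evolve(seq, M, rng):
    """Evolve every site independently across one edge with matrix M."""
    return "".join(
        rng.choices(NUCS, weights=[M[a][b] for b in NUCS])[0] for a in seq
    )


rng = random.Random(0)                                  # fixed seed for repeatability
root = "".join(rng.choice(NUCS) for _ in range(1000))   # uniform root sequence
M = jc_matrix(0.1)
left = evolve(root, M, rng)    # sequence at one child of the root
right = evolve(root, M, rng)   # sequence at the other child

# Observed fraction of sites that differ between the two leaves.
print(sum(x != y for x, y in zip(left, right)) / len(root))
```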

14
Absolute Fast Convergence
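  • Let $\lambda(e) = -\log |\det(M(e))|$, where $M(e)$ is the
    substitution matrix on edge $e$. Define
    $f = \min_e \lambda(e)$ and $g = \max_e \lambda(e)$. We parameterize the GM model
    by $f$ and $g$, writing $GM_{f,g}$.
  • A phylogenetic reconstruction method $\Phi$ is
    absolute fast-converging (AFC) for the GM model
    if for all positive $f, g, \varepsilon$ there is a
    polynomial $p$ such that for all $(T, \{M(e)\}) \in GM_{f,g}$,
    on a set $S$ of $n$ sequences of length at
    least $p(n)$ generated on $T$, we have
    $\Pr[\Phi(S) = T] > 1 - \varepsilon$.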

15
Theoretical Comparison of Early AFC Methods to NJ
  • Theorem 1 [Warnow et al. 2001]: DCM-NJ+SQS is
    absolute fast converging for the GM model.
  • Theorem 2 [Csurös 2001]: HGT/FP is absolute fast
    converging for the GM model.
  • Theorem 3 [Atteson 1999]: NJ is exponentially
    converging for the GM model (but is not known to
    be AFC).

16
DCM-Boosting [Warnow et al. 2001]
  • DCM+SQS is a two-phase procedure which reduces
    the sequence length requirement of methods.

[Figure: exponentially converging method → DCM → SQS → absolute fast converging method]
  • DCM-NJ+SQS is the result of DCM-boosting NJ.

17
Experimental Comparison of Early AFC Methods to NJ
  • rbcL 500-taxon tree
  • Jukes-Cantor model
  • Avg. branch length 0.264

18
Improving upon early AFC methods
  • These early AFC methods outperform NJ only on
    long enough sequences and on large enough trees
    with high enough rates of evolution.
  • Hence we need new fast converging methods which
    improve upon NJ on more of the parameter space,
    and are never worse than NJ.
  • We modify the second phase to improve the
    empirical performance, replacing SQS with ML
    (maximum likelihood) or MP (maximum parsimony).

19
DCM-NJ+ML vs. other methods on a fixed tree
  • 500-taxon rbcL tree
  • K2P+Γ model (parameter values 2 and 1)
  • Avg. branch length 0.278
  • Typical performance

20
Comparison of methods on random trees as a
function of the number of taxa
  • Random tree topologies
  • K2P+Γ model (parameter values 2 and 1)
  • Avg. branch length 0.05
  • Seq. length 1000

21
Summary
  • These are the first polynomial-time methods that
    improve upon NJ (with respect to topological
    accuracy) and are never worse than NJ.
  • The advantage obtained with DCM-NJ+MP and DCM-NJ+ML
    increases with the number of taxa.
  • In practice these new methods are slower than NJ
    (minutes vs. seconds), but still much faster than
    MP and ML (which can take days).
  • Conjecture: DCM-NJ+ML is AFC.

22
II. Whole-Genome Phylogeny
23
Genomes As Signed Permutations
1 5 3 4 -2 -6, or 6 2 -4 3 5 1, etc.
24
Genomes Evolve by Rearrangements
1 2 3 4 5 6 7 8 9 10
  • Inversion (Reversal)
  • Transposition
  • Inverted Transposition
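
The three event types can be sketched on a genome stored as a Python list of signed integers. This is an illustration only: it assumes a circular genome is written as a linear list cut at an arbitrary point, and the function names and index conventions (half-open segments, `i < j <= k`) are choices made for the example.

```python
def inversion(g, i, j):
    """Reverse the segment g[i:j] and flip the sign of each gene in it."""
    return g[:i] + [-x for x in reversed(g[i:j])] + g[j:]


def transposition(g, i, j, k):
    """Move the segment g[i:j] so that it follows the segment g[j:k]."""
    return g[:i] + g[j:k] + g[i:j] + g[k:]


def inverted_transposition(g, i, j, k):
    """Transposition in which the moved segment is also inverted."""
    moved = [-x for x in reversed(g[i:j])]
    return g[:i] + g[j:k] + moved + g[k:]


g = list(range(1, 11))
print(inversion(g, 3, 8))                  # [1, 2, 3, -8, -7, -6, -5, -4, 9, 10]
print(transposition(g, 1, 3, 6))           # [1, 4, 5, 6, 2, 3, 7, 8, 9, 10]
print(inverted_transposition(g, 1, 3, 6))  # [1, 4, 5, 6, -3, -2, 7, 8, 9, 10]
```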

25
Genome Rearrangement Has A Huge State Space
  • DNA sequences: 4 states per site
  • Signed circular genomes with $n$ genes:
    $2^{n-1}(n-1)!$ states, 1 site
  • Circular genomes (1 site)
    • with 37 genes: $2.56 \times 10^{52}$ states
    • with 120 genes: $3.70 \times 10^{232}$ states
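
The size of this state space can be checked with a short calculation. The sketch assumes signed circular genomes are counted up to rotation and reading direction: n! gene orders times 2^n sign choices, divided by 2n equivalent ways of writing the same circle, which gives 2^(n-1) (n-1)!. The function name is illustrative.

```python
from math import factorial


def signed_circular_genomes(n):
    """Number of distinct signed circular gene orders on n genes."""
    return 2 ** (n - 1) * factorial(n - 1)


for n in (37, 120):
    print(n, f"{signed_circular_genomes(n):.2e}")
# 37 2.56e+52
# 120 3.70e+232
```

By comparison, a DNA site has 4 states, so a whole genome treated as a single character has an enormously larger state space than any one sequence site.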

26
Distance-based Phylogenetic Methods for Genomes
27
Genomic Distance Estimators
  • Standard
    • Breakpoint distance
    • (Minimum) Inversion distance
  • Our estimators: we attempt to estimate
    the actual number of events (the true
    evolutionary distance)
    • EDE [Moret et al., ISMB'01]
    • Approx-IEBP [Wang and Warnow, STOC'01]
    • Exact-IEBP [Wang, WABI'01]

28
Breakpoint Distance
  • Breakpoint distance = 5

1 2 3 4 5 6 7 8 9 10
1 3 2 4 5 9 6 7 8 10
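
The count for this example can be reproduced with a few lines of code. No gene signs survive in the transcript, so the sketch treats both genomes as unsigned linear gene orders; that reading is an assumption. For signed genomes an adjacency (a, b) would instead match only (a, b) or (-b, -a).

```python
def breakpoint_distance(g1, g2):
    """Count adjacencies of g1 that do not occur in g2 (unsigned, linear)."""
    adjacencies_g2 = {frozenset(pair) for pair in zip(g2, g2[1:])}
    return sum(frozenset(pair) not in adjacencies_g2 for pair in zip(g1, g1[1:]))


g1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
g2 = [1, 3, 2, 4, 5, 9, 6, 7, 8, 10]
print(breakpoint_distance(g1, g2))  # 5
```

The five breakpoints are the adjacencies 1-2, 3-4, 5-6, 8-9 and 9-10 of the first genome.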
29
Minimum Inversion Distance
  • Inversion distance = 3

1 2 3 4 5 6 7 8 9 10
1 2 3 8 7 6 5 4 9 10
1 8 3 2 7 6 5 4 9 10
1 8 3 7 2 6 5 4 9 10
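
The three steps listed above can be replayed in code. The sketch treats the genomes as unsigned, since the transcript shows no signs, and it only confirms that three inversions are enough to connect the first and last genomes. It does not prove that no shorter sequence exists. The helper name and the segment indices are choices made for the example.

```python
def reverse_segment(g, i, j):
    """Unsigned inversion: reverse the order of the genes in g[i:j]."""
    return g[:i] + g[i:j][::-1] + g[j:]


g0 = list(range(1, 11))
g1 = reverse_segment(g0, 3, 8)   # reverse the block 4 5 6 7 8
g2 = reverse_segment(g1, 1, 4)   # reverse the block 2 3 8
g3 = reverse_segment(g2, 3, 5)   # reverse the block 2 7
for g in (g0, g1, g2, g3):
    print(*g)
```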
30
Measured Distance vs. Actual Number of Events
[Figure: two panels, Breakpoint Distance and Inversion Distance, each plotted against the actual number of events; 120 genes, inversion-only evolution]
31
Generalized Nadeau-Taylor Model
  • Three types of events:
    • Inversions
    • Transpositions
    • Inverted Transpositions
  • Events of the same type are equiprobable
  • The probabilities of the three types have a fixed ratio:
    Inv : Trp : Inv.Trp = (1-a-b) : a : b
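
A small simulation makes the model concrete. The sketch below applies random events to a signed genome, choosing the event type with probabilities (1-a-b), a, b and the segment boundaries uniformly at random, then measures the signed breakpoint distance to the starting genome. It simplifies the circular genome to a linear list, and the function names, the seed, and the number of events are assumptions for the example.

```python
import random


def random_event(g, a, b, rng):
    """Apply one event: inversion w.p. 1-a-b, transposition w.p. a,
    inverted transposition w.p. b. Events of one type are equiprobable."""
    n = len(g)
    u = rng.random()
    if u < 1 - a - b:                                   # inversion
        i, j = sorted(rng.sample(range(n + 1), 2))
        return g[:i] + [-x for x in reversed(g[i:j])] + g[j:]
    i, j, k = sorted(rng.sample(range(n + 1), 3))
    moved = g[i:j]
    if u >= 1 - b:                                      # inverted transposition
        moved = [-x for x in reversed(moved)]
    return g[:i] + g[j:k] + moved + g[k:]               # (inverted) transposition


def signed_breakpoints(g1, g2):
    """Adjacencies (x, y) of g1 that appear in g2 neither as (x, y) nor (-y, -x)."""
    adjacencies_g2 = set()
    for x, y in zip(g2, g2[1:]):
        adjacencies_g2.add((x, y))
        adjacencies_g2.add((-y, -x))
    return sum(pair not in adjacencies_g2 for pair in zip(g1, g1[1:]))


rng = random.Random(1)
start = list(range(1, 121))        # 120 genes
g = start
num_events = 20
for _ in range(num_events):
    g = random_event(g, a=1 / 3, b=1 / 3, rng=rng)   # all three types equiprobable
print(signed_breakpoints(start, g))
```

Each event creates at most three breakpoints, so the measured distance underestimates the true number of events more and more as events accumulate.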

32
Estimating True Evolutionary Distances for Genomes
  • Given fixed probabilities for each type of
    event, we estimate the expected breakpoint
    distance after k random events
  • Approx-IEBP [Wang, Warnow 2001]
    • Polynomial-time closed-form approximation to the
      expected breakpoint distance
    • Proven error bound
  • Exact-IEBP [Wang 2001]
    • Exact, recursive solution for the expected
      breakpoint distance
    • Polynomial-time but slower than Approx-IEBP

33
Estimating True Evolutionary Distances for
Genomes (cont.)
  • Estimating the expected inversion distance
  • EDE [Moret, Wang, Warnow, Wyman 2001]
    • Closed-form formula based upon an empirical
      estimation of the expected inversion distance
      after k random events (based upon 120 genes and
      inversion only, but robust to errors in the
      model).
    • Polynomial time, fastest of the three.

34
Goodness of fit for Approx-IEBP
  • 120 genes
  • Inversion-only evolution (similar performance
    under other models)
  • EDE and Exact-IEBP have similar performance
35
Absolute Difference
  • 120 genes
  • Inversion-only evolution (similar relative
    performance under other models)

36
Accuracy of Neighbor Joining Using Distance
Estimators
  • 120 genes
  • Inversion-only evolution
  • 10, 20, 40, 80, and 160 genomes
  • Similar relative performance under other
    models

37
Accuracy of Neighbor Joining Using Distance
Estimators
  • 120 genes
  • All three event types equiprobable
  • 10, 20, 40, 80, and 160 genomes
  • Similar relative performance under other
    models

38
Summary of Genomic Distance Estimators
  • Statistically based estimation of genomic
    distances improves NJ analyses
  • Our IEBP estimators assume knowledge of the
    probabilities of each type of event, but are
    robust to model violations
  • NJ(EDE) outperforms NJ with the other estimators,
    under all models studied
  • Accuracy is very good, except when very close to
    saturation

39
Maximum Parsimony on Rearranged Genomes (MPRG)
  • The leaves are rearranged genomes.
  • Find the tree that minimizes the total number of
    rearrangement events

40
GRAPPA [Bader et al., PSB'01]
  • (Genome Rearrangements Analysis under
    Parsimony and other Phylogenetic Algorithms)
  • Reimplementation of BPAnalysis [Blanchette et
    al. 1997] for the Breakpoint Phylogeny problem.
  • Uses algorithm engineering to improve
    performance.
  • Improves the algorithm by reducing the number of
    tree length evaluations. (Evaluating the length
    of a fixed tree is NP-hard.)

41
Campanulaceae
42
Analysis of Campanulaceae
  • 12 genomes + 1 outgroup (Tobacco)
  • 105 gene segments
  • BPAnalysis [Blanchette et al. 1997]: over 200
    years [Cosner et al. 2000]
  • Using GRAPPA v1.1 on the 512-processor Los Lobos
    Supercluster machine:
    2 minutes, a 100 million-fold speedup (200,000-fold
    speedup per processor)

43
Consensus of 216 MP Trees
Strict consensus of 216 trees: 6 out of 10
internal edges recovered.
44
Future Work
  • New focus on Rare Genomic Changes
  • New data
  • New models
  • New methods
  • New techniques for large scale analyses
  • Divide-and-conquer methods
  • Non-tree models
  • Visualization of large trees and large sets of
    trees

45
Acknowledgements
  • Funding
    • The David and Lucile Packard Foundation,
    • The National Science Foundation, and
    • Paul Angello
  • Collaborators
    • Robert Jansen (U. Texas)
    • Bernard Moret, David Bader, Mi-Yan (U.
      New Mexico)
    • Daniel Huson (Celera)
    • Katherine St. John (CUNY)
    • Linda Raubeson (Central Washington U.)
    • Luay Nakhleh, Usman Roshan, Jerry Sun,
      Li-San Wang, Stacia Wyman (Phylolab, U.
      Texas)

46
Phylolab, U. Texas
Please visit us at http://www.cs.utexas.edu/users/phylo/