Large-Scale Phylogenetic Analysis

Slides: 47

Provided by: lisan4

Learn more at: https://www.cs.utexas.edu

Transcript and Presenter's Notes

1
Large-Scale Phylogenetic Analysis

Tandy Warnow
Associate Professor, Department of Computer Sciences
Graduate Program in Evolution and Ecology
Co-Director, The Center for Computational Biology
and Bioinformatics
The University of Texas at Austin

2
Outline of Talk

Phylogenetic reconstruction from DNA sequences:
the problems and the progress
Phylogenetic reconstruction from gene order and
content in whole genomes: initial work
The future of large-scale phylogeny, and the
possibilities of inferring the Tree of Life

3
I. Molecular Systematics
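[Figure: five taxa U, V, W, X, Y with DNA sequences TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT (listed in extraction order), and a tree with leaf order X, U, Y, V, W.]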
4
DNA Sequence Evolution
5
Major Phylogenetic Reconstruction Methods

Polynomial-time distance-based methods (neighbor
joining is the most popular)
NP-hard sequence-based methods:
Maximum Parsimony
Maximum Likelihood
Heated debates over the relative performance of
these methods

6
Quantifying Error
FN: false negative (missing edge)
FP: false positive (incorrect edge)
[Figure: a true tree and an estimated tree with a 50% error rate.]
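
The FN and FP rates can be computed by comparing the internal edges of the two trees as bipartitions of the leaf set. The sketch below is a minimal illustration, assuming each tree is already given as a set of bipartitions; the helper names `split` and `error_rates` and the two example trees are hypothetical and not taken from the talk.

```python
def split(side, taxa):
    """Canonical bipartition: keep the side that excludes a fixed reference taxon."""
    side = frozenset(side)
    ref = min(taxa)
    return side if ref not in side else frozenset(taxa) - side

def error_rates(true_splits, est_splits):
    """Return (FN rate, FP rate) for two sets of internal-edge bipartitions."""
    true_splits, est_splits = set(true_splits), set(est_splits)
    fn = len(true_splits - est_splits) / len(true_splits)   # true edges that are missing
    fp = len(est_splits - true_splits) / len(est_splits)    # estimated edges that are wrong
    return fn, fp

taxa = {"U", "V", "W", "X", "Y"}
true_tree = {split("UV", taxa), split("XY", taxa)}   # ((U,V),W,(X,Y))
est_tree = {split("UV", taxa), split("WX", taxa)}    # ((U,V),Y,(W,X))
fn, fp = error_rates(true_tree, est_tree)
print(fn, fp)  # 0.5 0.5
```

In this toy case one of the two true internal edges is missing and one of the two estimated edges is incorrect, so both rates are 50%.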
7
Main Result: DCM-Boosting and DCM-NJ+ML
We have developed the first polynomial time
methods that improve upon NJ (with respect to
topological accuracy) and are never worse than
NJ. The methods are obtained through DCM-boosting.
8
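Basis of Distance-Based Methods: Additivity

A distance matrix $D$ is additive if there exists
a tree $T$ and an edge weighting $w$ such
that $D_{ij} = \sum_{e \in P_{ij}} w(e)$, where $P_{ij}$ is the path in $T$ between leaves $i$ and $j$.
Waterman et al. (1977) showed that, given an additive matrix $D$, the tree $T$ and the weighting $w$ can be constructed in polynomial time.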
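
One way to test additivity in practice is the four-point condition: for every four leaves, the two largest of the three pairwise sums must be equal. The slide does not state this test; it is the standard characterization of tree metrics and is used here as an assumption. The function name `is_additive` and the example matrix are illustrative, and the input is assumed to be a symmetric matrix with a zero diagonal.

```python
from itertools import combinations

def is_additive(D, tol=1e-9):
    """Four-point condition: among the three pairings of any four leaves,
    the two largest sums of distances must coincide."""
    for i, j, k, l in combinations(range(len(D)), 4):
        sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Tree ((A,B),(C,D)) with pendant edges 1, 2, 3, 4 and internal edge 5.
D = [[0, 3, 9, 10],
     [3, 0, 10, 11],
     [9, 10, 0, 7],
     [10, 11, 7, 0]]
print(is_additive(D))   # True

# Perturbing one distance (as estimation error would) breaks additivity.
noisy = [row[:] for row in D]
noisy[0][2] = noisy[2][0] = 12
print(is_additive(noisy))   # False
```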

9
Distance-based Phylogenetic Methods
10
Statistical Consistency

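Atteson (1999) showed that $NJ(d) = T$ if $L_\infty(d, D)$
is small enough, where $D$ is the additive matrix of the true tree $T$ and $d$ is the estimated distance matrix.

Hence NJ is statistically consistent for many
models of evolution. But what about performance
on finite sequence lengths?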
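
For reference, neighbor joining itself is short to write down. The following is a minimal sketch of the textbook algorithm (agglomerate the pair minimizing the usual Q criterion, then update distances), not the implementation used in the experiments. It returns only the topology, as a set of bipartitions; the function name and the example matrix are assumptions made for illustration.

```python
def neighbor_joining(names, D):
    """Return the internal-edge bipartitions of the NJ tree for distance matrix D."""
    leaves = frozenset(names)
    clusters = {i: frozenset([name]) for i, name in enumerate(names)}
    dist = {(i, j): float(D[i][j]) for i in clusters for j in clusters if i != j}
    next_id = len(names)
    splits = set()
    while len(clusters) > 3:
        nodes = list(clusters)
        m = len(nodes)
        r = {i: sum(dist[i, j] for j in nodes if j != i) for i in nodes}
        # Join the pair minimizing Q(i, j) = (m - 2) d(i, j) - r_i - r_j.
        a, b = min(((i, j) for i in nodes for j in nodes if i < j),
                   key=lambda p: (m - 2) * dist[p] - r[p[0]] - r[p[1]])
        # Distance from the new internal node to every remaining node.
        for k in nodes:
            if k not in (a, b):
                dist[next_id, k] = dist[k, next_id] = 0.5 * (dist[a, k] + dist[b, k] - dist[a, b])
        merged = clusters.pop(a) | clusters.pop(b)
        clusters[next_id] = merged
        next_id += 1
        # Store each split by the side that excludes the first taxon.
        splits.add(merged if names[0] not in merged else leaves - merged)
    return splits

# Additive matrix for the tree ((A,B),C,(D,E)).
names = ["A", "B", "C", "D", "E"]
D = [[0, 3, 5, 8, 9],
     [3, 0, 6, 9, 10],
     [5, 6, 0, 9, 10],
     [8, 9, 9, 0, 9],
     [9, 10, 10, 9, 0]]
tree = neighbor_joining(names, D)
```

On this exactly additive input the sketch recovers both internal edges of the generating tree, AB|CDE and ABC|DE. With estimated distances from finite sequences the input is only approximately additive, which is where the question of sequence length arises.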
11
We focus on performance on finite sequence lengths
12
Absolute fast convergence vs. exponential
convergence
13
General Markov (GM) Model

A GM model tree is a pair $(T, M)$ where $T$
is a rooted binary tree,
$M = \{M(e) : e \in E(T)\}$, and $M(e)$ is
a stochastic substitution matrix with
[condition on $M(e)$ lost in extraction].
The sequence at the root of $T$ is drawn from a
uniform distribution.
The rates of evolution across the sites can be
drawn from a fixed distribution.
GM contains models such as the Jukes-Cantor (JC) and
Kimura 2-Parameter (K2P) models.
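
As a concrete instance, the Jukes-Cantor model assigns each edge a 4x4 stochastic matrix that depends only on the edge length. The sketch below builds that matrix and evolves a sequence along a single edge, site by site. The function names, the parameterization by expected substitutions per site, and the fixed random seed are assumptions for illustration.

```python
import math
import random

def jc_matrix(t):
    """Jukes-Cantor substitution matrix for an edge with t expected substitutions per site."""
    same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def evolve(seq, M, rng):
    """Evolve every site independently along one edge with substitution matrix M."""
    return "".join(rng.choices("ACGT", weights=M["ACGT".index(c)])[0] for c in seq)

rng = random.Random(0)
root = "".join(rng.choice("ACGT") for _ in range(20))   # root sequence drawn uniformly
child = evolve(root, jc_matrix(0.264), rng)             # one edge of the model tree
```

Repeating `evolve` along every edge from the root to the leaves generates a full data set on a model tree.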

14
Absolute Fast Convergence

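Let $0 < f \le g$. Define
$GM_{f,g} = \{(T, M) \in GM : f \le \lambda(e) \le g \text{ for all } e \in E(T)\}$, where $\lambda(e)$ is the length of edge $e$. We parameterize the GM model by $f$ and $g$.
A phylogenetic reconstruction method $\Phi$ is
absolute fast-converging (AFC) for the GM model
if for all positive $f, g, \varepsilon$ there is a
polynomial $p$ such that for all $(T, M) \in GM_{f,g}$,
on a set $S$ of $n$ sequences of length at
least $p(n)$ generated on $T$, we have $\Pr[\Phi(S) = T] > 1 - \varepsilon$.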

15
Theoretical Comparison of Early AFC Methods to NJ

Theorem 1 [Warnow et al. 2001]: DCM-NJ+SQS is
absolute fast converging for the GM model.
Theorem 2 [Csűrös 2001]: HGT+FP is absolute fast
converging for the GM model.
Theorem 3 [Atteson 1999]: NJ is exponentially
converging for the GM model (but is not known to
be AFC).

16
DCM-Boosting [Warnow et al. 2001]

DCM+SQS is a two-phase procedure which reduces
the sequence length requirement of methods.

[Diagram: exponentially converging method -> DCM -> SQS -> absolute fast converging method]

DCM-NJ+SQS is the result of DCM-boosting NJ.

17
Experimental Comparison of Early AFC Methods to NJ

rbcL 500-taxon tree
Jukes-Cantor model
Avg. branch length 0.264

18
Improving upon early AFC methods

These early AFC methods outperform NJ only on
long enough sequences and on large enough trees
with high enough rates of evolution.
Hence we need new fast converging methods which
improve upon NJ on more of the parameter space,
and are never worse than NJ.
We modify the second phase to improve the
empirical performance, replacing SQS with ML
(maximum likelihood) or MP (maximum parsimony).

19
DCM-NJ+ML vs. other methods on a fixed tree

500-taxon rbcL tree
K2P? model (?2, ?1)
Avg. branch length 0.278
Typical performance

20
Comparison of methods on random trees as a
function of the number of taxa

Random tree topologies
K2P? model (?2, ?1)
Avg. branch length 0.05
Seq. length 1000

21
Summary

These are the first polynomial time methods that
improve upon NJ (with respect to topological
accuracy) and are never worse than NJ.
The advantage obtained with DCM-NJ+MP and DCM-NJ+ML
increases with the number of taxa.
In practice these new methods are slower than NJ
(minutes vs. seconds), but still much faster than
MP and ML (which can take days).
Conjecture: DCM-NJ+ML is AFC.

22
II. Whole-Genome Phylogeny
23
Genomes As Signed Permutations
1 5 3 4 -2 -6 or 6 2 -4 -3 -5 -1, etc.
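
A circular genome has no fixed starting gene and can be read from either strand, so several signed permutations describe the same genome. The sketch below enumerates them and picks one representative; the function names and the choice of representative (start at gene +1) are assumptions for illustration.

```python
def equivalent_forms(genome):
    """All signed permutations describing the same circular genome."""
    g = list(genome)
    other_strand = [-x for x in reversed(g)]   # reverse the order and flip every sign
    forms = set()
    for reading in (g, other_strand):
        for s in range(len(reading)):
            forms.add(tuple(reading[s:] + reading[:s]))   # start at any gene
    return forms

def canonical(genome):
    """Representative that starts with gene 1 in positive orientation."""
    return next(f for f in equivalent_forms(genome) if f[0] == 1)

forms = equivalent_forms([1, 5, 3, 4, -2, -6])
print(len(forms))                          # 12
print((6, 2, -4, -3, -5, -1) in forms)     # True
print(canonical([6, 2, -4, -3, -5, -1]))   # (1, 5, 3, 4, -2, -6)
```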
24
Genomes Evolve by Rearrangements
1 2 3 4 5 6 7 8 9 10

Inversion (Reversal)

Transposition

Inverted Transposition
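
The three event types can be written as list operations on a signed gene order. This is a minimal sketch that treats the genome as a linear list for simplicity (a circular genome would also allow segments that wrap around); the function names and index conventions are assumptions.

```python
def inversion(g, i, j):
    """Reverse the segment g[i:j] and flip the sign of every gene in it."""
    return g[:i] + [-x for x in reversed(g[i:j])] + g[j:]

def transposition(g, i, j, k):
    """Move the segment g[i:j] to just before position k (i < j <= k),
    which swaps the adjacent blocks g[i:j] and g[j:k]."""
    return g[:i] + g[j:k] + g[i:j] + g[k:]

def inverted_transposition(g, i, j, k):
    """Transposition in which the moved segment is also inverted."""
    return g[:i] + g[j:k] + [-x for x in reversed(g[i:j])] + g[k:]

g = list(range(1, 11))
print(inversion(g, 3, 8))                  # [1, 2, 3, -8, -7, -6, -5, -4, 9, 10]
print(transposition(g, 3, 6, 8))           # [1, 2, 3, 7, 8, 4, 5, 6, 9, 10]
print(inverted_transposition(g, 3, 6, 8))  # [1, 2, 3, 7, 8, -6, -5, -4, 9, 10]
```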

25
Genome Rearrangement Has A Huge State Space

DNA sequences: 4 states per site
Signed circular genomes with $n$ genes:
$2^{n-1}(n-1)!$ states, 1 site
Circular genomes (1 site)
with 37 genes: $2.56 \times 10^{52}$ states
with 120 genes: $3.70 \times 10^{232}$ states
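
The size of the state space follows from a counting argument: there are $2^n n!$ signed linear orders of $n$ genes, and each circular genome corresponds to $2n$ of them ($n$ starting points times 2 strands), which leaves $2^{n-1}(n-1)!$ distinct genomes. The sketch below evaluates this count; the function name is an assumption.

```python
from math import factorial

def signed_circular_genomes(n):
    """Number of distinct signed circular gene orders on n genes."""
    return 2 ** (n - 1) * factorial(n - 1)

for n in (37, 120):
    print(n, f"{signed_circular_genomes(n):.2e}")
# 37 2.56e+52
# 120 3.70e+232
```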

26
Distance-based Phylogenetic Methods for Genomes
27
Genomic Distance Estimators

Standard
Breakpoint distance
(Minimum) Inversion distance
Our estimators: we attempt to estimate
the actual number of events (the true
evolutionary distance)
EDE [Moret et al., ISMB'01]
Approx-IEBP [Wang and Warnow, STOC'01]
Exact-IEBP [Wang, WABI'01]

28
Breakpoint Distance

Breakpoint distance = 5

1 2 3 4 5 6 7 8 9 10
1 3 2 4 5 9 6 7 8 10
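
The count can be checked mechanically. The sketch below counts the adjacencies of the first genome that are absent from the second, treating the genomes as unsigned and linear with fixed end markers, which is the reading that reproduces the value for the example above. For signed genomes the adjacency test would also have to take orientation into account. The function name and the end-marker convention are assumptions.

```python
def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that do not occur in g2 (unsigned, linear genomes)."""
    def adjacencies(g):
        framed = [0] + list(g) + [len(g) + 1]   # fixed markers at both ends
        return {frozenset(pair) for pair in zip(framed, framed[1:])}
    return len(adjacencies(g1) - adjacencies(g2))

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
b = [1, 3, 2, 4, 5, 9, 6, 7, 8, 10]
print(breakpoint_distance(a, b))   # 5
```

The five broken adjacencies are 1-2, 3-4, 5-6, 8-9, and 9-10.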
29
Minimum Inversion Distance

Inversion distance = 3

1 2 3 4 5 6 7 8 9 10
1 2 3 8 7 6 5 4 9 10
1 8 3 2 7 6 5 4 9 10
1 8 3 7 2 6 5 4 9 10
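
For an example this small, the minimum can be confirmed by exhaustive breadth-first search over all inversions. The sketch below does this for unsigned linear gene orders, matching the example above. It is exponential in the number of steps and is meant only as a check on toy inputs, not as a practical distance algorithm; the function name and the `max_depth` cutoff are assumptions.

```python
from collections import deque

def unsigned_inversion_distance(start, target, max_depth=3):
    """Fewest inversions turning start into target, or None if more than max_depth."""
    start, target = tuple(start), tuple(target)
    if start == target:
        return 0
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        g, d = frontier.popleft()
        if d == max_depth:
            break
        for i in range(len(g)):
            for j in range(i + 2, len(g) + 1):       # segments of length >= 2
                h = g[:i] + g[i:j][::-1] + g[j:]
                if h == target:
                    return d + 1
                if h not in seen:
                    seen.add(h)
                    frontier.append((h, d + 1))
    return None

identity = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
final = [1, 8, 3, 7, 2, 6, 5, 4, 9, 10]
print(unsigned_inversion_distance(identity, final))   # 3
```

The three-step path shown above is therefore optimal: no sequence of one or two inversions reaches the final order.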
30
Measured Distance vs. Actual Number of Events
[Figure: two panels, Breakpoint Distance and Inversion Distance, each plotted against the actual number of events; 120 genes, inversion-only evolution.]
31
Generalized Nadeau-Taylor Model

Three types of events:
Inversions
Transpositions
Inverted Transpositions
Events of the same type are equiprobable
The probabilities of the three types have a fixed ratio:
Inv : Trp : Inv.Trp = (1-a-b) : a : b

32
Estimating True Evolutionary Distances for Genomes

Given fixed probabilities for each type of
event, we estimate the expected breakpoint
distance after k random events
Approx-IEBP [Wang, Warnow 2001]
Polynomial-time closed-form approximation to the
expected breakpoint distance
Proven error bound
Exact-IEBP [Wang 2001]
Exact, recursive solution for the expected
breakpoint distance
Polynomial-time but slower than Approx-IEBP
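
The idea behind both estimators is to invert the curve "expected breakpoint distance after k events": given an observed breakpoint distance, return the k that would be expected to produce it. The sketch below illustrates only that inversion step. It replaces the closed-form and recursive solutions with a Monte Carlo estimate of the curve under inversion-only evolution on a linear genome, so it is a stand-in for the technique and not either published estimator. All function names, the number of trials, and the seed are assumptions.

```python
import random

def random_inversion(g, rng):
    """Apply one inversion chosen uniformly among all segments of a signed linear genome."""
    i, j = sorted(rng.sample(range(len(g) + 1), 2))
    return g[:i] + [-x for x in reversed(g[i:j])] + g[j:]

def signed_breakpoints(g):
    """Breakpoints of a signed linear genome relative to the identity 1..n."""
    framed = [0] + g + [len(g) + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)

def expected_breakpoint_curve(n, max_events, trials, rng):
    """Monte Carlo estimate of the mean breakpoint distance after k = 0..max_events inversions."""
    totals = [0] * (max_events + 1)
    for _ in range(trials):
        g = list(range(1, n + 1))
        for k in range(1, max_events + 1):
            g = random_inversion(g, rng)
            totals[k] += signed_breakpoints(g)
    return [t / trials for t in totals]

def estimate_events(observed, curve):
    """Smallest k whose expected breakpoint distance reaches the observed distance."""
    for k, expected in enumerate(curve):
        if expected >= observed:
            return k
    return len(curve) - 1   # observed distance is beyond the simulated range

curve = expected_breakpoint_curve(n=120, max_events=50, trials=20, rng=random.Random(1))
k_hat = estimate_events(30, curve)
```

Near saturation the curve flattens, so small changes in the observed distance map to large changes in the estimate.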

33
Estimating True Evolutionary Distances for
Genomes (cont.)

Estimating the expected inversion distance
EDE [Moret, Wang, Warnow, Wyman 2001]
Closed-form formula based upon an empirical
estimation of the expected inversion distance
after k random events (based upon 120 genes and
inversion-only evolution, but robust to errors in the
model).
Polynomial time, fastest of the three.

34
Goodness of fit for Approx-IEBP

120 genes
Inversion-only evolution (similar performance under other models)
EDE and Exact-IEBP have similar performance
35
Absolute Difference

120 genes
Inversion only evolution
(Similar relative
performance under
other models)

36
Accuracy of Neighbor Joining Using Distance
Estimators

120 genes
Inversion-only evolution
10, 20, 40, 80, and 160 genomes
Similar relative
performance
under other
models

37
Accuracy of Neighbor Joining Using Distance
Estimators

120 genes
All three event types equiprobable
10, 20, 40, 80, and 160 genomes
Similar relative
performance under
other models

38
Summary of Genomic Distance Estimators

Statistically based estimation of genomic
distances improves NJ analyses
Our IEBP estimators assume knowledge of the
probabilities of each type of event, but are
robust to model violations
NJ(EDE) outperforms NJ with the other estimators under
all models studied
Accuracy is very good, except when very close to
saturation

39
Maximum Parsimony on Rearranged Genomes (MPRG)

The leaves are rearranged genomes.
Find the tree that minimizes the total number of
rearrangement events

40
GRAPPA [Bader et al., PSB'01]

(Genome Rearrangements Analysis under
Parsimony and other Phylogenetic Algorithms)
Reimplementation of BPAnalysis [Blanchette et
al. 1997] for the Breakpoint Phylogeny problem.
Uses algorithm engineering to improve
performance.
Improves the algorithm by reducing the number of
tree length evaluations. (Evaluating the length
of a fixed tree is NP-hard.)

41
Campanulaceae
42
Analysis of Campanulaceae

12 genomes + 1 outgroup (Tobacco)
105 gene segments
BPAnalysis [Blanchette et al. 1997]: over 200
years [Cosner et al. 2000]
Using GRAPPA v1.1 on the 512-processor Los Lobos
Supercluster machine:
2 minutes, a 100 million-fold speedup (200,000-fold
speedup per processor)

43
Consensus of 216 MP Trees
Strict consensus of 216 trees: 6 out of 10
internal edges recovered.
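
A strict consensus keeps exactly the internal edges (bipartitions) that appear in every input tree. The sketch below assumes each tree is given as a set of bipartitions stored in one canonical form, here the side that excludes taxon A; the function name and the three toy trees are illustrative and unrelated to the Campanulaceae data.

```python
def strict_consensus(trees):
    """Bipartitions shared by all trees; each tree is a set of canonical splits."""
    return set.intersection(*(set(t) for t in trees))

# Three trees on taxa A..E; each split is stored as the side not containing A.
t1 = {frozenset("DE"), frozenset("CDE")}
t2 = {frozenset("DE"), frozenset("BDE")}
t3 = {frozenset("DE"), frozenset("CDE")}
consensus = strict_consensus([t1, t2, t3])
print(consensus == {frozenset("DE")})   # True
```

Only the edge separating D and E from the rest survives, because the three trees disagree on where B and C attach.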
44
Future Work

New focus on Rare Genomic Changes
New data
New models
New methods
New techniques for large scale analyses
Divide-and-conquer methods
Non-tree models
Visualization of large trees and large sets of
trees

45
Acknowledgements

Funding
The David and Lucile Packard Foundation,
The National Science Foundation, and
Paul Angello
Collaborators
Robert Jansen (U. Texas)
Bernard Moret, David Bader, Mi-Yan (U.
New Mexico)
Daniel Huson (Celera)
Katherine St. John (CUNY)
Linda Raubeson (Central Washington U.)
Luay Nakhleh, Usman Roshan, Jerry Sun,
Li-San Wang, Stacia Wyman (Phylolab, U.
Texas)