Title: Gene Order Phylogeny

1
Gene Order Phylogeny

Tandy Warnow
The Program in Evolutionary Dynamics, Harvard
University
The University of Texas at Austin

2
Cyber-Infrastructure for Phylogenetic RESearch
(http://www.phylo.org)

Main research: large-scale phylogenetics,
reticulate evolution, gene order phylogeny,
complex simulations, and databases
Funded by an $11.6M ITR grant from NSF
40 biologists, computer scientists, and
mathematicians collaborating on the project

3
CIPRES Members
4
Limitations of DNA phylogenetics

Deep evolutionary histories may not be
recoverable from DNA sequence phylogeny due to
lack of specificity -- too much noise (homoplasy)
and insufficient sequence length
The systematics community has looked to rare
genomic changes for better sources of
phylogenetic signal

5
Whole-Genome Phylogenetics
6
Genomes As Signed Permutations
1 5 3 4 -2 -6 or 6 2 -4 3 5 1, etc.
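
A minimal sketch of why one circular genome has several equivalent signed-permutation readings, assuming it can be read from any starting gene and from either strand. The function names are illustrative and do not come from any tool named in this talk.

```python
def reverse_complement(genome):
    """Read the genome from the opposite strand: reverse the order, flip every sign."""
    return [-g for g in reversed(genome)]


def rotations(genome):
    """All readings of a circular gene order that differ only in the starting gene."""
    return [genome[i:] + genome[:i] for i in range(len(genome))]


def canonical(genome):
    """Pick the unique reading that starts with gene +1 (assumes gene 1 is present)."""
    candidates = rotations(genome) + rotations(reverse_complement(genome))
    return next(c for c in candidates if c[0] == 1)


def same_circular_genome(a, b):
    """True if a and b are two readings of the same signed circular genome."""
    return canonical(a) == canonical(b)


print(same_circular_genome([1, -2, 3], [2, -1, -3]))  # other strand, other start
print(same_circular_genome([1, 2, 3], [1, -2, 3]))    # gene 2 flipped: a different genome
```

Fixing gene +1 at the front is only one convenient choice of representative; any consistent rule would do.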
7
Genomes Evolve by Rearrangements
1 2 3 4 5 6 7 8 9 10

Inversion (Reversal)

Transposition

Inverted Transposition
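
A small sketch of the three rearrangement events applied to the identity genome 1 2 ... 10. For simplicity it treats the genome as a linear Python list with half-open index ranges; the function names and index conventions are assumptions made for this illustration.

```python
def inversion(genome, i, j):
    """Reverse the segment genome[i:j] and flip the sign of every gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transposition(genome, i, j, k):
    """Move the segment genome[i:j] to just before original position k (j <= k)."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


def inverted_transposition(genome, i, j, k):
    """Transposition in which the moved segment is also inverted."""
    moved = [-g for g in reversed(genome[i:j])]
    return genome[:i] + genome[j:k] + moved + genome[k:]


identity = list(range(1, 11))
print(inversion(identity, 2, 6))                  # genes 3..6 reversed and negated
print(transposition(identity, 2, 6, 8))           # genes 3..6 moved after gene 8
print(inverted_transposition(identity, 2, 6, 8))  # moved and inverted
```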

8
Other types of events

Duplications, Insertions, and Deletions (changes
gene content)
Fissions and Fusions (for genomes with more than
one chromosome)
These events change the number of copies of each
gene in each genome (unequal gene content)

9
Genome Rearrangement Has A Huge State Space

DNA sequences: 4 states per site
Signed circular genomes with n genes:
$2^{n-1}(n-1)!$ states, 1 site
Circular genomes (1 site)
with 37 genes (mitochondria):
about $2.56 \times 10^{52}$ states
with 120 genes (chloroplasts):
about $3.70 \times 10^{232}$ states
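
The size of this state space can be checked with a few lines of Python. The sketch assumes the standard count for signed circular genomes: n! orders times 2^n sign choices, divided by n rotations and by 2 strand readings, which gives 2^(n-1) (n-1)!.

```python
import math


def signed_circular_states(n):
    """Number of distinct signed circular gene orders on n genes: 2^(n-1) * (n-1)!."""
    return 2 ** (n - 1) * math.factorial(n - 1)


for n, label in [(37, "mitochondria"), (120, "chloroplasts")]:
    print(f"{n} genes ({label}): about {signed_circular_states(n):.2e} states")
```

Compare this with a DNA site, which has only 4 states.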

10
Why use gene orders?

Rare genomic changes: the huge state space and
relative infrequency of events (compared to site
substitutions) could make the inference of deep
evolution easier, or more accurate.
Our research shows this is true, but accurate
analysis of gene order data is computationally
very intensive!

11
Phylogeny reconstruction from gene orders

Distance-based reconstruction: estimate pairwise
distances, and apply methods like
Neighbor-Joining or Weighbor
Maximum Parsimony: find the tree with the minimum
length (inversions, transpositions, or other edit
distances)
Maximum Likelihood: find the tree and parameters of
evolution most likely to generate the observed
data

12
Maximum Parsimony on Rearranged Genomes (MPRG)

The leaves are rearranged genomes.
Find the tree that minimizes the total number of
rearrangement events (e.g., inversion phylogeny
minimizes the number of inversions)

13
Optimization problems for gene order phylogeny

Breakpoint phylogeny: find the phylogeny which
minimizes the total number of breakpoints
(NP-hard, even to find the median of three
genomes)
Inversion phylogeny: find the phylogeny which
minimizes the sum of inversion distances on the
edges (NP-hard, even to find the median of three
genomes)
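
To make the median problem concrete, here is a hedged sketch of the breakpoint distance between two signed circular genomes and an exhaustive search for the median of three. The brute-force search is an illustration only, not the method used by BPAnalysis or GRAPPA. It visits every signed circular gene order, so it is usable only for a handful of genes. All function names are assumptions.

```python
from itertools import permutations, product


def adjacencies(genome):
    """Strand-independent set of adjacencies of a signed circular genome."""
    n = len(genome)
    adj = set()
    for i in range(n):
        a, b = genome[i], genome[(i + 1) % n]
        # (a, b) read on one strand is the same adjacency as (-b, -a) on the other.
        adj.add(max((a, b), (-b, -a)))
    return adj


def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that do not occur in g2."""
    return len(adjacencies(g1) - adjacencies(g2))


def median_of_three(g1, g2, g3):
    """Exhaustively find a genome minimizing the summed breakpoint distance to g1, g2, g3."""
    genes = sorted(abs(g) for g in g1)
    first, rest = genes[0], genes[1:]
    best, best_score = None, None
    # Fixing the first gene as positive skips rotations and strand flips.
    for order in permutations(rest):
        for signs in product((1, -1), repeat=len(rest)):
            candidate = [first] + [s * g for s, g in zip(signs, order)]
            score = sum(breakpoint_distance(candidate, g) for g in (g1, g2, g3))
            if best_score is None or score < best_score:
                best, best_score = candidate, score
    return best, best_score


a = [1, 2, 3, 4, 5]
b = [1, -3, -2, 4, 5]  # a with the segment 2 3 inverted
print(breakpoint_distance(a, b))
print(median_of_three(a, a, b))
```

The two genomes in the example differ by a single inversion, which breaks two adjacencies, so their breakpoint distance is 2.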

14
Inversion phylogenies

When the data are close to saturated, even the
best distance-based analyses are insufficiently
accurate. In these cases, our initial
investigations suggest that the inversion
phylogeny approach may be superior.
Problem: finding the best trees is enormously
hard, since even the point estimation problem
is hard (worse than estimating branch lengths in
ML).

[Figure: tree length plotted over the space of phylogenetic trees, marking a local optimum and the global optimum]
15
Observations

For equal gene content, heuristics for the
inversion phylogeny problem are extremely
accurate, even under model conditions in which
transpositions are dominant.
For unequal gene content, the parsimony style
problems are too computationally intense -- but
NJ (neighbor joining) with a new distance
estimator (Moret et al. 2004) works extremely
well.

16
Software

BPAnalysis (Sankoff): open source, restricted to
the breakpoint phylogeny reconstruction
GRAPPA (Moret et al.): open source, restricted to
single chromosome genomes, but can handle both
equal and unequal gene content
MGR (Pevzner et al.): multiple chromosome,
limited to equal gene content, performs well if
the dataset is small (less than 10 genomes)
Bayesian analysis by Bret Larget (not yet
released).

17
The strict consensus of 24 trees, each with
inversion length of 64. Finished within 40
minutes on a laptop using GRAPPA version 1.8
[Figure: consensus tree on Merciera, Wahlenbergia, Triodanis, Legousia, Asyneuma, Trachelium, Symphyandra, Campanula, Codonopsis, Tobacco, Adenophora, Cyananthus, and Platycodon]
18
GRAPPA (Genome Rearrangement Analysis under
Parsimony and other Phylogenetic Algorithms)

http://www.cs.unm.edu/~moret/GRAPPA/
Heuristics for maximum parsimony style problems
for equal gene content
Fast polynomial time distance-based methods
Contributors: U. New Mexico, U. Texas at Austin,
Università di Bologna, Italy
Freely available in source code at this site.
Project leader: Bernard Moret (UNM)
(moret@cs.unm.edu)

19
Speeding up MP and ML: DCM3

Tandy Warnow
Radcliffe Institute
The University of Texas at Austin

20
Reconstructing the Tree of Life
Handling large datasets: millions of species
21
Methods for phylogenetic inference

Polynomial time methods, mostly based upon
estimating evolutionary distances
Heuristics for hard optimization problems (such
as maximum parsimony and maximum likelihood)
Bayesian methods

22
Main research objectives

Determine the best current methods available for
MP and ML, and then improve upon them
Focus on performance within one day, one week, or
one month, on large real datasets (1K to 20K
sequences for MP)
Final objective is hundreds of thousands (or
millions) of sequences.

23
Initial results

Very large datasets are hard for both MP and ML,
no matter what software is used
Suboptimal solutions to MP yield reasonable
estimates of the optimal MP trees, but only if
they are within 0.01% of the optimal MP score
Improving upon techniques for searching treespace
will yield improvements for both MP and ML

24
Datasets
Obtained from various researchers and online
databases

Sequences  Dataset
1322       LSU rRNA of all organisms
2000       Eukaryotic rRNA
2594       rbcL DNA
4583       Actinobacteria 16S rRNA
6590       SSU rRNA of all Eukaryotes
7180       three-domain rRNA
7322       Firmicutes bacteria 16S rRNA
8506       three-domain2org rRNA
11361      SSU rRNA of all Bacteria
13921      Proteobacteria 16S rRNA

25
Problems with current techniques for MP
[Figure: average MP scores above optimal for the best methods at 24 hours, across 10 datasets]
Best current techniques fail to get within 0.01% of
optimal at the end of 24 hours on large datasets
26
Problems with current techniques for MP
The best current method (default TNT) fails to
reach acceptable levels of accuracy (within 0.01% of
optimal) within 24 hours on many large datasets
-- evidence suggests that this level will not be
reached for weeks or months (or more) of further
analysis.
[Figure: performance of TNT over time]
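
As a hedged illustration of the accuracy criterion, the sketch below measures how far a tree's MP score lies above the best known score. It reads the 0.01 threshold as a percentage of the best known score, which is an assumption about how the figure is normalized. The scores in the example are made up.

```python
def percent_above_optimal(score, best_known):
    """Excess of an MP score over the best known score, as a percentage of the latter."""
    return 100.0 * (score - best_known) / best_known


def acceptable(score, best_known, threshold_percent=0.01):
    """True if the score is within threshold_percent of the best known score."""
    return percent_above_optimal(score, best_known) <= threshold_percent


best_known = 100_000               # hypothetical best MP score found by any method
for score in (100_005, 100_020):   # hypothetical scores after 24 hours
    print(score, percent_above_optimal(score, best_known), acceptable(score, best_known))
```

On a tree of length 100,000 this threshold allows only about 10 extra steps, which is why apparently small gaps in score matter.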
27
Observations

The best methods cannot get acceptably good
solutions within 24 hours on most of these large
datasets.
Datasets of these sizes may need months (or
years) of further analysis to reach reasonable
solutions.
Apparent convergence can be misleading.

30
Disk-Covering Methods (DCMs)

DCMs are divide-and-conquer methods that our
group has developed for use in phylogeny
reconstruction
DCM2 was designed for speeding up maximum
parsimony and maximum likelihood heuristics. DCM2
was good enough for PAUP.
DCM3 is a recent improvement over DCM2 which
enables iteration (and gives smaller subproblems)
- and is good enough for TNT.

31
Boosting MP heuristics

DCMs boost the performance of phylogeny
reconstruction methods.

[Diagram: a DCM applied to a base method M yields the boosted method DCM-M]
32
DCM3 technique for speeding up MP searches
33
Iterative-DCM3
[Diagram: a tree T is fed to DCM3 and the base method, producing a new tree that starts the next iteration]
34
New DCMs

DCM3
  1. Compute subproblems using the DCM3 decomposition
  2. Apply the base method to each subproblem to yield
     subtrees
  3. Merge subtrees using the Strict Consensus Merger
     technique
  4. Randomly refine the merged tree to make it binary
Recursive-DCM3
Iterative DCM3
  1. Compute a DCM3 tree
  2. Perform local search and go to step 1
Recursive-Iterative DCM3
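
The control flow of one DCM3 pass and of the iterative variant can be sketched as below. This is only a skeleton under stated assumptions: every step (decomposition, base method, Strict Consensus Merger, refinement, local search, scoring) is passed in as a hypothetical callable, the current tree is assumed to guide the next decomposition, the recursive variants are omitted, and keeping the best-scoring tree across iterations is a choice made for this sketch.

```python
def dcm3_tree(taxa, guide_tree, decompose, base_method, merge, refine):
    """One DCM3 pass: decompose, solve each subproblem, merge, then refine."""
    subproblems = decompose(taxa, guide_tree)                    # step 1
    subtrees = [base_method(subset) for subset in subproblems]  # step 2
    merged = merge(subtrees)                                     # step 3
    return refine(merged)                                        # step 4


def iterative_dcm3(taxa, start_tree, steps, score, local_search, iterations):
    """Alternate DCM3 passes with local search, keeping the best tree seen."""
    best = current = start_tree
    for _ in range(iterations):
        candidate = dcm3_tree(taxa, current, **steps)  # compute a DCM3 tree
        current = local_search(candidate)              # local search, then repeat
        if score(current) < score(best):               # lower MP score is better
            best = current
    return best
```

Here `steps` is a dictionary with the keys `decompose`, `base_method`, `merge`, and `refine`. Passing the steps as callables keeps the loop independent of any particular base method, which matches the idea of a DCM as a booster wrapped around a method M.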

35
Boosting MP heuristics

We examine DCMs using DCM2 and DCM3, and using
recursion and/or iteration.

[Diagram: a DCM applied to a base method M yields the boosted method DCM-M]
36
Performance Study

How well do these boosted versions of the best
MP heuristics perform, compared to the best MP
heuristics?
We examine performance with respect to optimal
MP scores (best found so far, using any method)
for a number of very large datasets, over 24
hours.
The benchmark MP heuristic is the default TNT.

38
Rec-I-DCM3 significantly improves performance
[Figure: comparison of TNT to Rec-I-DCM3(TNT) on one large dataset; the two curves are the current best techniques and the DCM-boosted version of the best techniques]
39
Rec-I-DCM3(TNT) vs. TNT (comparison of scores at
24 hours)
Base method is the default TNT technique, the
current best method for MP. Rec-I-DCM3
significantly improves upon the unboosted TNT by
returning trees which are at most 0.01% above
optimal on most datasets.
40
Summary

Rec-I-DCM3 is a powerful technique for escaping
local optima, and boosts the performance of the
best heuristics for solving MP
The improvement increases with the difficulty of
the dataset - Rec-I-DCM3(TNT) is 50 times faster
than TNT on our hardest datasets, but we expect
even bigger speedups in our next version
DCMs also boost the performance of Maximum
Likelihood heuristics (not shown)

41
Acknowledgements

Collaborators: Bernard Moret (UNM), Usman Roshan
(UT-Austin), and Tiffani Williams (UNM)
Funding: NSF, The David and Lucile Packard
Foundation, The Radcliffe Institute for Advanced
Study, The Institute for Cellular and Molecular
Biology at UT-Austin, and The Program in
Evolutionary Dynamics at Harvard University
Software will be part of the CIPRES Project's
first distribution; see http://www.phylo.org
