Title: Genome Rearrangement Phylogeny
1Genome Rearrangement Phylogeny
Robert K. Jansen, School of Biology, University of Texas at Austin
Bernard M.E. Moret, Department of Computer Science, University of New Mexico
Li-San Wang and Tandy Warnow, Department of Computer Sciences, University of Texas at Austin
2Outline
- Introduction
- Genome rearrangement phylogeny reconstruction
- Application
- Other methods
- Future research
3New Phylogenetic Signals
- High-throughput sequencing efforts lead to larger datasets
- Challenge: inferring deep evolutionary events
- Biologists are turning to rare genomic changes
- Rare
- Large state space
- High signal-to-noise ratio
- Potential for clarifying early evolution
- Best studied: gene order evolution (genome rearrangement)
4Genomes As Signed Permutations
1 5 3 4 -2 -6, or 5 1 6 2 -4 -3, etc.
5Gene Order Data
- Rare changes on the genomic scale
- Large state space
- DNA: 4 states/character
- Protein (amino acid sequence): 20 states/character
- Circular gene order with 120 genes: 2^119 × 119! ≈ 3.70 × 10^232 states/character
- High signal-to-noise ratio
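The size of the gene-order state space can be checked with a short calculation. This is a sketch under one stated assumption: two signed circular orders count as the same genome when they differ only by the starting gene (rotation) or by the strand being read (reverse the order and negate every sign). The function name is illustrative.

```python
from math import factorial


def circular_signed_orders(n: int) -> int:
    """Count signed circular gene orders on n genes.

    Assumption: orders that differ only by rotation or by reading the
    opposite strand (reverse + negate) are the same genome, so the
    2^n * n! signed linear orders are divided by 2n.
    """
    return 2 ** (n - 1) * factorial(n - 1)


count = circular_signed_orders(120)
# Compare with 4 states per DNA character and 20 per protein character.
print(f"120-gene circular order: a {len(str(count))}-digit number of states")
```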
6Genomes Evolve by Rearrangements
1 2 3 4 5 6 7 8 9 10
Inversion
1 2 -6 -5 -4 -3 7 8 9 10
Transposition
1 2 7 8 3 4 5 6 9 10
Inverted Transposition
1 2 7 8 -6 -5 -4 -3 9 10
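The three rearrangements can be sketched as list operations on a signed permutation. This is a minimal illustration, assuming genomes are Python lists cut at 0-based positions a < b < c; the function names are not from the talk. An inversion reverses a segment and flips every sign in it, so the block 3 4 5 6 becomes -6 -5 -4 -3.

```python
def inversion(g, a, b):
    """Reverse the segment g[a:b] and flip the sign of every gene in it."""
    return g[:a] + [-x for x in reversed(g[a:b])] + g[b:]


def transposition(g, a, b, c):
    """Move the segment g[a:b] so that it follows the segment g[b:c]."""
    return g[:a] + g[b:c] + g[a:b] + g[c:]


def inverted_transposition(g, a, b, c):
    """Like a transposition, but the moved segment is also inverted."""
    return g[:a] + g[b:c] + [-x for x in reversed(g[a:b])] + g[c:]


genome = list(range(1, 11))
print(inversion(genome, 2, 6))
print(transposition(genome, 2, 6, 8))
print(inverted_transposition(genome, 2, 6, 8))
```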
7Edit Distances Between Genomes
- (INV) Inversion distance [Hannenhalli & Pevzner 1995]
- Computable in linear time [Moret et al. 2001]
- (BP) Breakpoint distance [Watterson et al. 1982]
- Computable in linear time
- NJ(BP) [Blanchette, Kunisawa, Sankoff 1999]
A = 1 2 3 4 5 6 7 8 9 10
B = 1 2 3 -8 -7 -6 4 5 9 10
BP(A,B) = 3
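The breakpoint count for A and B can be reproduced with a few lines. This sketch assumes signed circular genomes on the same gene set and treats the adjacency a b as identical to -b -a (the same pair read on the other strand); the helper names are illustrative.

```python
def adjacencies(genome):
    """Strand-independent adjacencies of a signed circular genome."""
    n = len(genome)
    result = set()
    for i in range(n):
        a, b = genome[i], genome[(i + 1) % n]
        # (a, b) and (-b, -a) are the same adjacency; keep one canonical form.
        result.add(min((a, b), (-b, -a)))
    return result


def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that do not occur in g2."""
    return len(adjacencies(g1) - adjacencies(g2))


A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
B = [1, 2, 3, -8, -7, -6, 4, 5, 9, 10]
print(breakpoint_distance(A, B))  # 3
```

The three breakpoints in B fall between 3 and -8, between -6 and 4, and between 5 and 9.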
8Our Model: the Generalized Nadeau-Taylor Model [STOC'01]
- Three types of events
- Inversions (INV)
- Transpositions (TRP)
- Inverted Transpositions (ITP)
- Events of the same type are equiprobable
- The probabilities of the three types have a fixed ratio
- We focus on signed circular genomes in this talk.
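A simulation under this kind of model might look like the sketch below. It is not the authors' simulator. It assumes the event type is drawn with fixed weights (the `weights` parameter is hypothetical) and that, within a type, cut points are drawn uniformly on a linearized copy of the circular genome, which only approximates "equiprobable events of the same type".

```python
import random


def apply_random_event(genome, weights, rng):
    """Apply one random inversion, transposition, or inverted transposition."""
    n = len(genome)
    kind = rng.choices(["INV", "TRP", "ITP"], weights=weights)[0]
    if kind == "INV":
        a, b = sorted(rng.sample(range(n + 1), 2))
        return genome[:a] + [-x for x in reversed(genome[a:b])] + genome[b:]
    a, b, c = sorted(rng.sample(range(n + 1), 3))
    block = genome[a:b]
    if kind == "ITP":
        block = [-x for x in reversed(block)]
    # Move the (possibly inverted) block g[a:b] behind the block g[b:c].
    return genome[:a] + genome[b:c] + block + genome[c:]


def evolve(genome, k, weights=(1, 1, 1), seed=0):
    """Apply k random events; weights give the INV:TRP:ITP ratio."""
    rng = random.Random(seed)
    for _ in range(k):
        genome = apply_random_event(genome, weights, rng)
    return genome


ancestor = list(range(1, 121))
descendant = evolve(ancestor, k=20, weights=(1, 1, 1), seed=42)
print(descendant[:10])
```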
9Simulation Study Protocol
[Figure: Evolutionary Process (known in simulation) -> Synthetic Input -> Phylogenetic Method -> Inferred Tree]
10Quantifying Error
11Outline
- Genome rearrangement evolution
- Genome rearrangement phylogeny reconstruction
- Application
- Other methods
- Future research
12 Gene Order Parsimony
13Breakpoint Phylogeny [Sankoff & Blanchette 1998]
- Maximum Parsimony-style problem
- Find tree(s), leaf-labeled by genomes, with shortest breakpoint length
- NP-hard problem on two levels
- Find the shortest tree (the space of trees has exponential size)
- Given a tree, find its breakpoint length (NP-hard even for a tree with 3 leaves, but can be reduced to TSP)
- BPAnalysis [Sankoff & Blanchette 1998]
- Takes 200 years to compute our 13-taxon dataset on a Sun workstation
14BPAnalysis
- Tree length evaluation for EVERY tree
- Given a fixed tree topology, evaluate the tree length
- Iteratively evaluate the median problem (tree length for a 3-leaf tree)
15GRAPPA (Genome Rearrangement Analysis under
Parsimony and other Phylogenetic Algorithms)
- http://www.cs.unm.edu/moret/GRAPPA/
- Uses lowerbound techniques to speed up the search
- Used on real datasets, producing thousand-fold speedups over BPAnalysis [ISMB'01]
- Contributors (led by Bernard Moret at UNM): U. New Mexico, U. Texas at Austin, Università di Bologna, Italy
16The Circular Lowerbound of the Length of a Tree
- Given a tree, we can lowerbound its length very
quickly
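One common way to obtain such a bound, sketched here under the assumption that the pairwise distance obeys the triangle inequality: visit the leaves in the left-to-right order of the tree and sum the distances between consecutive leaves, wrapping around at the end. Every tree edge is crossed twice on that tour, so half the circular sum cannot exceed the tree length. The nested-tuple tree format and the 5-taxon matrix are illustrative choices, not GRAPPA's data structures.

```python
NAMES = ["S1", "S2", "S3", "S4", "S5"]
# Upper triangle of an illustrative additive distance matrix.
UPPER = [[9, 15, 14, 17], [14, 13, 16], [13, 16], [13]]
DIST = {}
for i, row in enumerate(UPPER):
    for offset, value in enumerate(row):
        DIST[frozenset((NAMES[i], NAMES[i + 1 + offset]))] = value


def leaves_in_order(tree):
    """Leaves of a nested-tuple tree in left-to-right (depth-first) order."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves_in_order(child)]


def circular_sum(tree, dist):
    """Sum of distances between consecutive leaves around the tree."""
    order = leaves_in_order(tree)
    n = len(order)
    return sum(dist[frozenset((order[i], order[(i + 1) % n]))] for i in range(n))


good = (("S1", "S2"), "S3", ("S4", "S5"))
bad = (("S1", "S3"), "S2", ("S4", "S5"))
print(circular_sum(good, DIST) / 2)  # lower bound on the length of `good`
print(circular_sum(bad, DIST) / 2)   # lower bound on the length of `bad`
```

Because the bound is half the circular sum, discarding a tree whose bound exceeds the best length found so far is the same as comparing the circular sum with twice that length.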
17The Lowerbound Technique
- Avoid any tree X without potential:
- a tree X whose lowerbound lb(X) is higher than twice the length c(T) of the best tree T
- Finding a good starting tree quickly is of utmost importance
- We turn to distance-based methods
- Neighbor joining (NJ) [Saitou and Nei 1987]
- Weighbor [Bruno et al. 2000]
18Additive Distance Matrix and True Evolutionary
Distance (T.E.D.)
     S1  S2  S3  S4  S5
S1    0   9  15  14  17
S2        0  14  13  16
S3            0  13  16
S4                0  13
S5                    0
[Figure: tree realizing the matrix, with edge lengths: ((S1:5, S2:4):3, S3:7, (S4:5, S5:8):1)]
Theorem [Waterman et al. 1977]: Given an m × m additive distance matrix, we can reconstruct a tree realizing the distances in O(m^2) time.
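Additivity can be tested with the four-point condition: for every four taxa, the two largest of the three pairwise sums must be equal. The sketch below applies that check to the 5-taxon matrix; it is a brute-force O(m^4) test for illustration, not the O(m^2) reconstruction of the theorem, and the function names are assumptions.

```python
from itertools import combinations


def full_matrix(upper):
    """Build a symmetric matrix from its upper triangle (diagonal is 0)."""
    m = len(upper) + 1
    D = [[0] * m for _ in range(m)]
    for i, row in enumerate(upper):
        for offset, value in enumerate(row):
            j = i + 1 + offset
            D[i][j] = D[j][i] = value
    return D


def is_additive(D):
    """Four-point condition; exact comparison, so use integer distances."""
    for i, j, k, l in combinations(range(len(D)), 4):
        sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
        if sums[1] != sums[2]:
            return False
    return True


D = full_matrix([[9, 15, 14, 17], [14, 13, 16], [13, 16], [13]])
print(is_additive(D))
```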
19Error Tolerance of Neighbor Joining
- Theorem [Atteson 1999]: Let $D_{ij}$ be the true evolutionary distances and $d_{ij}$ the estimated distances for T. Let $\epsilon$ be the length of the shortest edge in T. If for all taxa i, j we have $|D_{ij} - d_{ij}| < \epsilon/2$, then neighbor joining returns T.
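The statement can be tried out on a small case. The sketch below is a plain textbook neighbor joining that records only the splits it creates. It is run on an illustrative 5-taxon additive matrix whose tree has a shortest edge of length 1, and again after every entry is perturbed by 0.4, which is less than half that edge. All names are assumptions made for the example.

```python
from itertools import combinations


def neighbor_joining(names, D):
    """Nontrivial splits of the NJ tree, each given as the side without names[0]."""
    clusters = [frozenset([x]) for x in names]
    d = {}
    for (i, a), (j, b) in combinations(enumerate(clusters), 2):
        d[a, b] = d[b, a] = D[i][j]
    splits = set()
    while len(clusters) > 3:
        m = len(clusters)
        r = {a: sum(d[a, b] for b in clusters if b != a) for a in clusters}
        # Join the pair minimizing the standard NJ criterion.
        a, b = min(combinations(clusters, 2),
                   key=lambda p: (m - 2) * d[p] - r[p[0]] - r[p[1]])
        new = a | b
        clusters = [c for c in clusters if c not in (a, b)]
        for c in clusters:
            d[new, c] = d[c, new] = 0.5 * (d[a, c] + d[b, c] - d[a, b])
        clusters.append(new)
        splits.add(new if names[0] not in new else frozenset(names) - new)
    return splits


NAMES = ["S1", "S2", "S3", "S4", "S5"]
TRUE = [[0, 9, 15, 14, 17],
        [9, 0, 14, 13, 16],
        [15, 14, 0, 13, 16],
        [14, 13, 13, 0, 13],
        [17, 16, 16, 13, 0]]
# Perturb every off-diagonal entry by +/-0.4 (symmetric in i and j).
NOISY = [[TRUE[i][j] + (0.4 * (-1) ** (i + j) if i != j else 0)
          for j in range(5)] for i in range(5)]

print(neighbor_joining(NAMES, TRUE) == neighbor_joining(NAMES, NOISY))
```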
20BP and INV
[Figure: scatter plots of INV vs K and BP/2 vs K (120 genes, inversion-only evolution; K = actual number of inversions)]
21NJ(BP) [Blanchette, Kunisawa, Sankoff 1999] and NJ(INV)
[Figure: two panels, "Inversion only" and "Transpositions/inverted transpositions only"; 120 genes, 160 leaves, uniformly random tree]
22Estimate True Evolutionary Distances Using BP
- To use the scatter plot to estimate the actual number of events (K):
- (1) Compute BP/2
- (2) From the curve, look up the corresponding value of K
[Figure: BP/2 vs K (120 genes, inversion-only evolution; K = actual number of inversions)]
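The lookup idea can be imitated numerically. The sketch below simulates inversion-only evolution to tabulate the mean BP/2 after k events, then reads K off that curve for an observed BP/2. It is a simulation-based stand-in for the plotted curve, not the IEBP or EDE estimator, and the function names, repetition count, and seed are assumptions.

```python
import random


def breakpoints_from_identity(g):
    """Breakpoints between the signed circular genome g and 1, 2, ..., n."""
    n = len(g)
    count = 0
    for i in range(n):
        a, b = g[i], g[(i + 1) % n]
        # Conserved if b follows a in the identity, on either strand.
        if not (b - a == 1 or (a, b) in ((n, 1), (-1, -n))):
            count += 1
    return count


def mean_half_bp_curve(n, max_k, reps, seed=0):
    """curve[k] = mean BP/2 after k random inversions, over reps runs."""
    rng = random.Random(seed)
    totals = [0.0] * (max_k + 1)
    for _ in range(reps):
        g = list(range(1, n + 1))
        for k in range(1, max_k + 1):
            a, b = sorted(rng.sample(range(n + 1), 2))
            g[a:b] = [-x for x in reversed(g[a:b])]
            totals[k] += breakpoints_from_identity(g) / 2
    return [t / reps for t in totals]


def estimate_k(half_bp, curve):
    """Look up the k whose mean BP/2 is closest to the observed value."""
    return min(range(len(curve)), key=lambda k: abs(curve[k] - half_bp))


curve = mean_half_bp_curve(n=120, max_k=60, reps=20, seed=1)
print(estimate_k(15.0, curve))
```

Because a later inversion can reuse an existing breakpoint, the mean BP/2 falls further below K as K grows, so the looked-up K is never smaller than the observed BP/2 on average.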
23True Evolutionary Distance (t.e.d.) Estimators for Gene Order Data
- IEBP: Inverting the Expected BreakPoint distance
- EDE: Empirically Derived Estimator
24True Evolutionary Distance Estimators
[Figure: scatter plots of Exact-IEBP vs K and BP vs K (120 genes, inversion-only evolution; K = actual number of inversions)]
25Variance of True Evolutionary Distance Estimators
- There are new distance-based phylogeny reconstruction methods (though designed for DNA sequences)
- Weighbor [Bruno et al. 2000]
- These methods use the variance of good t.e.d.s, and yield more accurate trees than NJ.
- Variance estimates for the t.e.d.s [Wang WABI'02]
- Weighbor(IEBP), Weighbor(EDE)
[Figure: K vs Exact-IEBP (120 genes)]
26Using T.E.D. Helps
120 genes, 160 leaves, uniformly random tree; transpositions/inverted transpositions only (180 runs per figure)
27Observations
- EDE is the best distance estimator when used with NJ and Weighbor.
- True evolutionary distance estimators are reliable even when we do not know the GNT model parameters (the probability ratios of the three types of events).
28Outline
- Genome rearrangement evolution
- Genome rearrangement phylogeny reconstruction
- Application
- Other methods
- Future research
29Percentage of Trees Eliminated Through Bounding [ISMB'01]
30Campanulaceae cpDNA
- 13 taxa (tobacco as outgroup)
- 105 gene segments
- GRAPPA finds 216 trees with shortest breakpoint length (out of 654,729,075 trees)
- Running time
- BPAnalysis takes 2 centuries on a Sun workstation
- GRAPPA takes 1.5 hours on a 512-node supercluster
- About 2300-fold speedup on a single node
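Two of these figures can be sanity-checked with simple arithmetic. The number of unrooted binary trees on n leaves is (2n-5)!!, and 654,729,075 is that count for n = 12; how the 13th taxon is accounted for is not stated here. The per-node speedup is estimated assuming 365-day years and perfect scaling across the 512 nodes.

```python
def unrooted_binary_trees(n_leaves):
    """(2n-5)!! = 3 * 5 * ... * (2n-5), the number of unrooted binary trees."""
    count = 1
    for k in range(3, 2 * n_leaves - 4, 2):
        count *= k
    return count


print(unrooted_binary_trees(12))  # 654729075

# Rough per-node speedup: 200 years on one machine vs 1.5 h on 512 nodes.
bpanalysis_hours = 200 * 365 * 24
grappa_node_hours = 1.5 * 512
speedup = bpanalysis_hours / grappa_node_hours
print(round(speedup))
```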
31Campanulaceae [Moret et al. ISMB 2001]
Strict consensus of 216 optimal trees found by GRAPPA
6 out of 10 max. edges found
32Outline
- Genome rearrangement evolution
- Genome rearrangement phylogeny reconstruction
- Application
- Other methods
- Future research
33Fast Approaches for Genome Rearrangement
Phylogeny
- Basic technique: encode data as strings and apply maximum parsimony
- Running time exponential in the number of genomes, but polynomial in the number of genes (faster than GRAPPA)
- MPBE [ISMB'00]: Maximum Parsimony using Binary Encodings
- MPME [Boore et al. Nature '95, PSB'02]: Maximum Parsimony using Multi-state Encodings
- The length of a tree using these two methods is a lowerbound of the true breakpoint length [Bryant '01]
34Maximum Parsimony using Binary Encoding (MPBE)
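Input genomes (circular), each also shown read on the opposite strand
A: 1 2 3 4 = -4 -3 -2 -1
B: 1 -4 -3 -2 = 2 3 4 -1
C: 1 2 -3 -4 = 4 3 -2 -1
MPBE Strings (one binary character per adjacency; 1 = adjacency present)
A: 1 1 1 1 0 0 0 0 0
B: 0 1 1 0 1 1 0 0 0
C: 1 0 0 0 0 0 1 1 1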
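A binary encoding of this kind can be sketched as follows: collect every adjacency that occurs in any genome, then give each genome a 1 where it has that adjacency and a 0 otherwise. The genomes used are A = 1 2 3 4, B = 1 -4 -3 -2 and C = 1 2 -3 -4, with signs chosen so that the output reproduces the three strings of this slide. The column order (first appearance) and the function names are assumptions.

```python
def adjacency_list(genome):
    """Strand-independent adjacencies of a signed circular genome, in order."""
    n = len(genome)
    result = []
    for i in range(n):
        a, b = genome[i], genome[(i + 1) % n]
        result.append(min((a, b), (-b, -a)))  # canonical form of the pair
    return result


def mpbe(genomes):
    """Map each genome name to its binary presence/absence string."""
    columns = []
    for g in genomes.values():
        for adj in adjacency_list(g):
            if adj not in columns:
                columns.append(adj)
    encoded = {}
    for name, g in genomes.items():
        present = set(adjacency_list(g))
        encoded[name] = [1 if col in present else 0 for col in columns]
    return encoded


genomes = {"A": [1, 2, 3, 4], "B": [1, -4, -3, -2], "C": [1, 2, -3, -4]}
for name, bits in mpbe(genomes).items():
    print(name, *bits)
```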
35Maximum Parsimony using Multistate Encoding (MPME)
Input genomes (circular), each also shown read on the opposite strand
A: 1 2 3 4 = -4 -3 -2 -1
B: 1 -4 -3 -2 = 2 3 4 -1
C: 1 2 -3 -4 = 4 3 -2 -1
MPME Strings (site x holds the gene that follows x)
      1   2   3   4  -1  -2  -3  -4
A     2   3   4   1  -4  -1  -2  -3
B    -4   3   4  -1   2   1  -2  -3
C     2  -3  -2   3   4  -1  -4   1
We use PAUP to solve Maximum Parsimony => Constraint: the number of states per site cannot exceed 32
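The multistate encoding can be sketched in the same style: there is one site per signed gene (1..n, then -1..-n), and the state at site x is the signed gene that immediately follows x when the circular genome is read on the strand where x appears. The genomes are A = 1 2 3 4, B = 1 -4 -3 -2 and C = 1 2 -3 -4; the function name is illustrative.

```python
def mpme(genome):
    """Multistate encoding: successor of each signed gene, sites 1..n then -1..-n."""
    n = len(genome)
    successor = {}
    for i in range(n):
        a, b = genome[i], genome[(i + 1) % n]
        successor[a] = b      # reading this strand, b follows a
        successor[-b] = -a    # reading the other strand, -a follows -b
    sites = list(range(1, n + 1)) + [-x for x in range(1, n + 1)]
    return [successor[x] for x in sites]


for name, g in {"A": [1, 2, 3, 4], "B": [1, -4, -3, -2], "C": [1, 2, -3, -4]}.items():
    print(name, *mpme(g))
```

Each site can take up to 2n - 2 distinct states across a dataset, which is why a cap of 32 states per site can be exceeded on larger genomes.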
36NJ vs MP (120 genes, 160 genomes)
All three event types equiprobable (datasets that
exceed 32-state limit for MPME are dropped)
37Inversion Phylogeny
- Inversion median has higher running time than breakpoint median
- Inversion phylogeny overall has shorter running time than breakpoint phylogeny, and returns more accurate trees [Moret et al. WABI'02]
38DCM-GRAPPA [Moret & Tang 2003]
- Disk-Covering Method: divide the original problem into subproblems [Huson, Nettles, Parida, Warnow and Yooseph, 1998]
- Uses inversion distance
- DCM-GRAPPA can now process thousands of genomes, each having hundreds of genes
39Ongoing and Future Research
- Genome rearrangement phylogeny with unequal gene content (duplications, deletions, etc.)
- Non-uniform genome rearrangement models (segment-length dependent model, hotspots)
40Acknowledgements
- University of Texas: Tandy Warnow (Advisor), Robert K. Jansen, Stacia Wyman, Luay Nakhleh, Usman Roshan, Cara Stockham, Jerry Sun
- University of New Mexico: Bernard M.E. Moret, David Bader, Jijun Tang, Mi Yan
- Central Washington University: Linda Raubeson
41Phylolab, Department of Computer Sciences, University of Texas at Austin
Please visit us at http://www.cs.utexas.edu/users/phylo/