Title: Fast Algorithms for Minimum Evolution
1 Fast Algorithms for Minimum Evolution
- Richard Desper, NCBI
- Olivier Gascuel, LIRMM
2 Overview
- Statement of the phylogeny reconstruction problem and various approaches to solving it.
- Tree length formula as a function of average distances.
- Greedy algorithms for tree building and tree swapping.
- Simulation results.
- A few extras regarding consistency and branch lengths.
3 Phylogeny Reconstruction
- General problem: reconstruct the evolutionary history for a set L of extant species.
- Input: multiple sequence alignment for L, or a matrix of estimates of pairwise evolutionary distances.
- Output: weighted phylogeny representing the history of L and its common ancestors.
4 Methods
- Likelihood methods: model-based likelihood maximization.
- Parsimony methods: minimize the total number of mutations in the tree.
- Distance methods: fit a tree structure to inferred evolutionary distances. Leading methods include Felsenstein-Fitch-Margoliash weighted least-squares and Neighbor-Joining and its variants.
5 Felsenstein-Fitch-Margoliash Least-squares Method
- FITCH searches the space of topologies by iteratively adding leaves and by tree swapping.
- Edge weights and topology are chosen to minimize the sum of squared differences between D and $D^T$, weighted by terms $s_{ij}$ (D is the input metric, $D^T$ is the induced tree metric).
If $s_{ij} = 1$ for all i and j, this is called the ordinary least-squares method.
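One common way to write such a criterion is sketched below. It assumes that $s_{ij}$ acts as a variance-like term dividing each squared residual; whether it divides or multiplies is an assumption here, and the name $\mathrm{SS}$ is introduced only for illustration. With $s_{ij} = 1$ both readings give the same ordinary least-squares sum.

```latex
\[
  \mathrm{SS}(T) \;=\; \sum_{i<j} \frac{\left(D_{ij} - D^{T}_{ij}\right)^{2}}{s_{ij}}
\]
```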
6 Minimum Evolution
- Developed by Rzhetsky and Nei (1992) as a modification of the OLS method.
- For each topology T:
- Define a function l assigning OLS lengths to the edges of T.
- Define the size of the tree as the sum of its edge lengths, $l(T) = \sum_{e \in T} l(e)$.
- Choose the T minimizing l(T).
7 Recursive Definition of $D_{AB}$
All average distances for all pairs of non-intersecting subtrees of a given topology can be calculated in $O(n^2)$ time.
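The recursion can be sketched as follows, assuming the usual size-weighted definition: for two leaves the average is the input distance, and if B splits into child subtrees B1 and B2 then $D_{AB} = (|B_1| D_{AB_1} + |B_2| D_{AB_2}) / |B|$. The nested-tuple tree encoding and the names `leaves` and `avg_dist` are hypothetical, chosen only for this illustration. The sketch recomputes values instead of storing them, so it shows the recursion rather than the $O(n^2)$ bookkeeping.

```python
def leaves(subtree):
    """Return the leaf labels of a subtree given as nested 2-tuples."""
    if not isinstance(subtree, tuple):
        return [subtree]
    return [leaf for child in subtree for leaf in leaves(child)]


def avg_dist(d, a, b):
    """Average distance between disjoint subtrees a and b.

    d is a symmetric dict-of-dicts of leaf-to-leaf distances.
    """
    if isinstance(b, tuple):
        # Split b into its two children and weight each by its size.
        b1, b2 = b
        n1, n2 = len(leaves(b1)), len(leaves(b2))
        return (n1 * avg_dist(d, a, b1) + n2 * avg_dist(d, a, b2)) / (n1 + n2)
    if isinstance(a, tuple):
        # b is a leaf, so swap roles and split a instead.
        return avg_dist(d, b, a)
    return d[a][b]
```

For a leaf `a` at distances 2, 4 and 6 from leaves `b`, `c` and `d`, the call `avg_dist(d, "a", ("b", ("c", "d")))` gives the plain mean of the three distances.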
8 External OLS Edge Length Function
- If e is the edge connecting the leaf i to the subtrees A and B, then $l(e) = \frac{1}{2}\left(D_{Ai} + D_{Bi} - D_{AB}\right)$.
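9 Internal OLS Edge Length Function
- The length of an internal edge e is given by a closed formula (Vach, 1988) in the average distances between the four subtrees adjacent to e, where the weighting constant $\lambda$ depends on the sizes of those subtrees.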
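A minimal sketch of both OLS edge-length functions follows. It assumes the internal edge e separates subtrees A and B from subtrees C and D, and uses the commonly stated form $l(e) = \frac{1}{2}[\lambda(D_{AC} + D_{BD}) + (1-\lambda)(D_{AD} + D_{BC}) - (D_{AB} + D_{CD})]$ with $\lambda = (|A||D| + |B||C|) / ((|A|+|B|)(|C|+|D|))$. The labeling of the four subtrees and the function names are assumptions made for illustration.

```python
def external_length(d_ai, d_bi, d_ab):
    """OLS length of the edge joining leaf i to the subtrees A and B."""
    return 0.5 * (d_ai + d_bi - d_ab)


def internal_length(avg, size):
    """OLS length of the internal edge separating A and B from C and D.

    avg maps pair names such as "AC" to average distances;
    size maps subtree names to their number of leaves.
    """
    lam = (size["A"] * size["D"] + size["B"] * size["C"]) / (
        (size["A"] + size["B"]) * (size["C"] + size["D"]))
    return 0.5 * (lam * (avg["AC"] + avg["BD"])
                  + (1 - lam) * (avg["AD"] + avg["BC"])
                  - (avg["AB"] + avg["CD"]))
```

On a four-leaf tree with every edge of length 1, the within-pair distances are 2 and the cross distances are 3, and both functions return 1.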
10 Tree Length Formula
- Lemma: with T as to the right,
- let denote the root of subtree X,
- and the edge to X for
- Then,
11 Tree Length Formula
- With T as in prior slide,
Using lemma and branch length formula for l(e),
12 General approach
- To search the space of topologies, we'll keep two data structures in memory:
- Sizes of each subtree of the given topology.
- Matrix of average distances $D_{XY}$ for X, Y disjoint subtrees in the given topology.
- As we move from one topology to another, we'll update the matrix efficiently, changing only the entries that need it.
13 Tree Swapping by NNI
NNI (nearest neighbor interchange) swapping is a basic step in topology building and searching.
14 Tree Length Formula
- With T as in prior slide,
Using lemma and branch length formula for l(e),
15 Tree Length after NNI
- Given $T \to T'$, the tree swap in the prior slide, and l the edge length function,
(1)
where $\lambda$ and $\lambda'$ are constants depending on the topologies.
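One form such an update can take is sketched below. It is a derivation under stated assumptions, not necessarily the labeling or sign convention of Equation (1): the internal edge of T separates A and B from C and D, the NNI swaps B and C, $\lambda = (|A||D| + |B||C|) / ((|A|+|B|)(|C|+|D|))$, and $\lambda'$ is the same expression with B and C exchanged. Under those assumptions the change is $l(T') - l(T) = \frac{1}{2}[(1-\lambda)(D_{AC} + D_{BD}) - (1-\lambda')(D_{AB} + D_{CD}) + (\lambda - \lambda')(D_{AD} + D_{BC})]$. The function names are hypothetical.

```python
def ols_lambda(n_a, n_b, n_c, n_d):
    """OLS weight for an internal edge separating (a, b) from (c, d)."""
    return (n_a * n_d + n_b * n_c) / ((n_a + n_b) * (n_c + n_d))


def ols_nni_delta(avg, size):
    """Return l(T') - l(T) when B and C are swapped across the edge AB|CD.

    avg maps pair names such as "AC" to average distances;
    size maps subtree names to their number of leaves.
    A negative value means the swap shortens the tree.
    """
    lam = ols_lambda(size["A"], size["B"], size["C"], size["D"])
    lam_new = ols_lambda(size["A"], size["C"], size["B"], size["D"])
    return 0.5 * ((1 - lam) * (avg["AC"] + avg["BD"])
                  - (1 - lam_new) * (avg["AB"] + avg["CD"])
                  + (lam - lam_new) * (avg["AD"] + avg["BC"]))
```

Only averages between the four subtrees around the edge enter the computation, which is why each candidate swap can be evaluated in constant time once the matrix of averages is available.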
16 OLS FASTNNI
- Step 1: Pre-compute average distances between non-intersecting sub-trees. ($O(n^2)$ computations)
- Step 2: Loop over all internal edges and select the best swap using Equation (1). ($O(n)$)
- Step 3: If no swap improves the length of the tree, stop and return the tree; otherwise perform the best swap, update the matrix of average distances, and repeat Step 2. ($O(n)$ per swap, since there is only one new split.)
- Thus, if we require p swaps, the total complexity of FASTNNI is $O(n^2 + pn)$.
17 Balanced Minimum Evolution
- Gascuel (2000) observed that the OLS/ME method was weaker than NJ in approximating the correct topology.
- Pauplin (2000), to simplify tree length computation, proposed a balanced version of Minimum Evolution, which weights each sub-tree equally when calculating averages between sub-trees A and B of T.
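The balanced average can be sketched as follows, assuming the usual recursion in which a subtree B with children B1 and B2 contributes $D_{AB} = \frac{1}{2}(D_{AB_1} + D_{AB_2})$ whatever the sizes of B1 and B2. The nested-tuple encoding and the name `balanced_avg` are hypothetical.

```python
def balanced_avg(d, a, b):
    """Balanced average distance between disjoint subtrees a and b.

    Subtrees are leaf labels or nested 2-tuples; d is a symmetric
    dict-of-dicts of leaf-to-leaf distances.
    """
    if isinstance(b, tuple):
        # Each child of b gets weight 1/2, regardless of its size.
        b1, b2 = b
        return 0.5 * (balanced_avg(d, a, b1) + balanced_avg(d, a, b2))
    if isinstance(a, tuple):
        # b is a leaf, so swap roles and split a instead.
        return balanced_avg(d, b, a)
    return d[a][b]
```

Unlike the size-weighted average, the result depends on the shape of the subtrees: for a leaf `a` at distances 2, 4 and 6 from `b`, `c` and `d`, the balanced average to `("b", ("c", "d"))` is 3.5, whereas the plain mean is 4.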
18 BNNI
- Step 1: Calculate balanced averages of all pairs of sub-trees. ($O(n^2)$)
- Step 2: Calculate the improvement for each swap using
(2)
- Step 3: If no tree swap improves the length of the tree, stop and return the tree; otherwise update the matrix of average distances and repeat Step 2. ($O(n\,\mathrm{diam}(T))$ per swap)
- The average complexity, when performing p swaps, is $O(n^2 + pn\,\mathrm{diam}(T))$.
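A form the balanced improvement test can take is sketched below. It assumes the internal edge separates A and B from C and D, and that with balanced averages the change for the NNI producing the split AC|BD is $\frac{1}{4}[(D_{AC} + D_{BD}) - (D_{AB} + D_{CD})]$, with the symmetric expression for AD|BC. This follows from setting the OLS weight to one half; the labeling and the name `bnni_best_swap` are assumptions, and the slide's own Equation (2) may be written differently.

```python
def bnni_best_swap(avg):
    """Pick the better of the two NNIs across the edge AB|CD.

    avg maps pair names such as "AC" to balanced average distances.
    Returns (delta, new_split) if a swap shortens the tree, else None.
    """
    current = avg["AB"] + avg["CD"]
    candidates = {
        "AC|BD": 0.25 * ((avg["AC"] + avg["BD"]) - current),
        "AD|BC": 0.25 * ((avg["AD"] + avg["BC"]) - current),
    }
    split, delta = min(candidates.items(), key=lambda item: item[1])
    return (delta, split) if delta < 0 else None
```

A full BNNI pass would call a test like this for every internal edge, perform the swap with the most negative change, and then refresh the affected balanced averages.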
19 Updating Subtree Averages
[Figure: tree T with edge e, subtrees A, B, C, D, X and Y, and leaf x]
Q: How many recalculations? (Hint: you can count (x, y) pairs.)
A: $O(n\,\mathrm{diam}(T))$
20 Building trees from scratch
- We have NNI algorithms for OLS and balanced
branch lengths. But what if we have no initial
topology for NNIs?
21 OLS Greedy Minimum Evolution
- Start with a three-taxon tree $T_3$.
- For k = 4 to n:
- Calculate $D_{kA}$ for each subtree A in $T_{k-1}$.
- Express the cost of inserting k along edge e as f(e). (Use Equation (3) on the next slide.)
- Choose e minimizing f. Insert k along e to form $T_k$.
- Update the matrix of average distances between every pair of 2-distant subtrees.
- GME runs in $O(n^2)$ time.
22 Greedy Minimum Evolution
We use a variant of Equation (1), where D = k.
Let L = l(T).
Then
23 Balanced Minimum Evolution
- Same as GME, except (modifications):
- Calculate balanced average distances instead of ordinary average distances.
- Use $\lambda = \frac{1}{2}$ to find weights for insertion points.
- Must keep average distances for all pairs of sub-trees.
- BME runs in $O(n^2\,\mathrm{diam}(T))$ time.
24 Simulations
- Created 24- and 96-taxon trees, 2000 per size, using a Yule-Harding process (giving molecular-clock trees).
- Edge lengths multiplied by $(1.0 + \mu X)$, where X is exponentially distributed.
- Generated trees with three rates of evolution.
- SeqGen used to generate sequences for each tree and rate (12,000 in all).
- DNADIST used to calculate distance matrices.
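The edge-length perturbation step can be sketched as below. The sketch assumes X is drawn from an exponential distribution with rate 1 and treats the value of `mu` as a free parameter; neither the rate nor any specific value of `mu` is given here, and the function name is hypothetical.

```python
import random


def perturb_edge_lengths(lengths, mu, rng):
    """Multiply each edge length by (1.0 + mu * X), with X ~ Exp(1).

    rng is a random.Random instance so that runs can be reproduced.
    """
    return [length * (1.0 + mu * rng.expovariate(1.0)) for length in lengths]


# Example: perturb three clock-like edge lengths with a hypothetical mu.
perturbed = perturb_edge_lengths([0.1, 0.2, 0.3], mu=0.5, rng=random.Random(0))
```

Because the multiplier is never below 1, every edge is stretched by a different random amount, which moves the tree away from the molecular clock.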
25 Results: topological distances
BNNI improved all input trees.
26 Results: topological distances
This improvement is large with fast rates and high numbers of taxa.
27 Results: topological distances
NNI trees are close to the best possible for BME.
28 Results: topological distances
The quality of the NNI tree is (mostly) independent of starting point.
29 Results: topological distances
30 Computational Times
Times in MM:SS.

              24 Taxa   96 Taxa   1000 Taxa   4000 Taxa
GME + BNNI     0.0263    0.0842     11.3390     06:02.1
HGT/FP         0.0252    0.1349     13.8080     03:33.1
NJ/BIONJ       0.0630    0.1628     21.2500     20:55.9
WEIGHBOR       0.4244   26.8818           -           -
FITCH          4.3745         -           -           -

Computations done on Sun Enterprise E4500/E5500 running Solaris 8 on 10 400-MHz processors with 7 GB memory.
31 Average number of NNIs

                 24 Taxa   96 Taxa   1000 Taxa   4000 Taxa
GME + FASTNNI      1.244     8.446        44.9      336.50
GME + BNNI         1.446    11.177        59.1      343.75
BME + BNNI         1.070     6.933        29.1      116.25

We see that the average number of NNIs is considerably lower than the number of taxa.
32 BME and WLS
- Why does the balanced approach work so well?
- Pauplin's formula for the length of a tree is $l(T) = \sum_{i<j} 2^{1 - p_T(i,j)} D_{ij}$.
- BME is a weighted least squares approach with variances proportional to $2^{p_T(i,j)}$,
where $p_T(i,j)$ is the length of the (i,j) path in T.
Distantly related taxa see their importance decrease exponentially.
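The tree length under Pauplin's formula can be computed directly from the topology and the input distances, as in the sketch below. It assumes the form $l(T) = \sum_{i<j} 2^{1 - p_T(i,j)} D_{ij}$, with $p_T(i,j)$ counted as the number of edges on the path between leaves i and j. The adjacency-dict encoding and the function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def path_edges(adj, start):
    """Breadth-first search: number of edges from start to every node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def balanced_tree_length(adj, leaf_names, d):
    """Sum 2^(1 - p) * d[i][j] over all leaf pairs, p = edges on the path."""
    total = 0.0
    for i, j in combinations(leaf_names, 2):
        p = path_edges(adj, i)[j]
        total += 2.0 ** (1 - p) * d[i][j]
    return total
```

No edge lengths are estimated along the way: the topology alone fixes the weight of each pairwise distance, and each extra edge on a path halves that weight.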
33 Bonus features
- BME is a consistent method. As observed distances converge to true distances, the true topology becomes the minimum evolution tree.
- The BNNI tree has no negative branch lengths. A negative value of the branch length function implies an NNI leading to a smaller tree.
34 Consistency of Balanced ME
- Theorem: Suppose S is a weighted tree, and $\mathcal{T}$ is a tree topology incompatible with S. Let T be the tree of topology $\mathcal{T}$ with weights determined by the balanced scheme. Then l(T) > l(S).
- Lemma: it suffices to prove the case when S is a split metric.
35 Balanced ME consistency
- Basic idea: let l be the tree length function on the space of topologies. We find a sequence of topologies, $T = T_0, T_1, \ldots, T_k = S$, such that
- each $T_{i+1}$ can be reached from $T_i$ via one of two simple topological transformations;
- $l(T_i) > l(T_{i+1})$ for all i.
- Proof structure modeled after the OLS/ME proof (Rzhetsky and Nei, 1993).
36 Type I transformation
Color the leaves black or white according to the split metric S. A Type I transformation uses an NNI to form a larger monochromatic cluster.
This transformation reduces the size of the tree under l.
37 Type II transformation
A Type II transformation uses two NNIs to form two monochromatic subtrees.
This transformation also reduces the size of the tree under l.
38 Positive Branch Lengths after BNNI
- Recall that the length of an edge is described by
We do not perform the switch because
i.e.
Thus
Similarly,
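The argument can be written out as follows. This is a sketch under assumptions: the internal edge e separates A and B from C and D, all averages are balanced, and the balanced edge length is the OLS formula with weight one half. The labeling of the four subtrees is an assumption.

```latex
\begin{align*}
  l(e) &= \tfrac{1}{2}\Bigl[\tfrac{1}{2}\,(D_{AC} + D_{BD})
          + \tfrac{1}{2}\,(D_{AD} + D_{BC}) - (D_{AB} + D_{CD})\Bigr]. \\
  \intertext{BNNI stops only when neither swap across $e$ shortens the tree, so}
  D_{AC} + D_{BD} &\ge D_{AB} + D_{CD},
  \qquad
  D_{AD} + D_{BC} \ge D_{AB} + D_{CD}. \\
  \intertext{Substituting both bounds into the edge length gives}
  l(e) &\ge \tfrac{1}{2}\Bigl[\tfrac{1}{2}\,(D_{AB} + D_{CD})
          + \tfrac{1}{2}\,(D_{AB} + D_{CD}) - (D_{AB} + D_{CD})\Bigr] = 0.
\end{align*}
```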
39 Conclusions
- BME + BNNI runs in $O((n^2 + pn)\,\mathrm{diam}(T))$ and outputs trees comparable to (better than) those of FITCH, Weighbor, BioNJ, or NJ.
- FastME is faster than NJ or its variants.
- BNNI consistently improved output trees in all settings, even when WLS/Fitch trees were input.
- BNNI outputs trees without negative branch lengths.
- FastME software is available at http://www.ncbi.nlm.nih.gov/CBBResearch/Desper/FastME.html or http://www.lirmm.fr/w3ifa/MAAS/.