Title: On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities
1 On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities
- Ilan Gronau, Shlomo Moran
- Technion, Israel Institute of Technology
- Haifa, Israel
2 Pairwise-Distance Based Reconstruction
[Figure: a tree T over the taxa B, E, G, H, L, M and its tree metric D_T]
3 Optimization Criteria
We wish the tree metric D_T to approximate all the pairwise distances in D simultaneously: D_T should be close to D.
Two closeness measures are studied here:
- Maximal Difference (ℓ∞)
- Maximal Distortion (MaxDist)
4 Maximal Difference (ℓ∞) vs. Maximal Distortion
[Figure: the matrices D and D_T over the taxa B, E, G, H, L, M]
Goal: find an optimal tree T, which minimizes the maximal difference/distortion between D and D_T.
5 Previous Works on Approximating Dissimilarities by Tree Distances
- Negative results (NP-hardness)
  - Closest tree metric (even ultrametric) to a dissimilarity matrix under ℓ1, ℓ2 (Day 87)
  - Closest tree metric to a dissimilarity matrix under ℓ∞ (ABFPT99)
    - Hard to approximate better than 1.125
  - Implicit: hard to approximate the closest MaxDist tree within any constant factor
- Positive results
  - Closest ultrametric to a dissimilarity matrix under ℓ∞ (Krivanek 88)
  - 3-approximation of the closest additive metric to a given metric (ABFPT99)
    - (implicit 6-approximation for general dissimilarity matrices)
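6 This Work: Triplet-Distances = Distances to Triplets' Midpoints
t_T(i; jk) is the distance in T from taxon i to C(i,j,k), the midpoint of the triplet {i, j, k}.
- t_T(i; jk) = t_T(i; kj)
- t_T(i; ij) = 0
- t_T(i; jj) = D_T(i, j)
[Figure: the taxa i, j, k joined at the midpoint C(i,j,k)]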
7 Triplet-Distances Defined by 2-Distances
- Each distance matrix D defines 3-trees:
t(i; jk) = ½[D(i,j) + D(i,k) - D(j,k)].
[Figure: any metric on 3 taxa i, j, k (here with pairwise distances 8, 9 and 7) defines a 3-tree]
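As a worked illustration of the formula t(i; jk) = ½[D(i,j) + D(i,k) - D(j,k)], here is a minimal Python sketch. The assignment of the figure's three distances to specific pairs (D(i,j) = 8, D(i,k) = 9, D(j,k) = 7) is an assumption, since the figure itself is not recoverable; the function name is hypothetical.

```python
def triplet_distance(D, i, j, k):
    """t(i; jk): distance from taxon i to the midpoint of the 3-tree on {i, j, k}."""
    return 0.5 * (D[i][j] + D[i][k] - D[j][k])


# Assumed assignment of the three distances shown on the slide.
D = {
    "i": {"i": 0, "j": 8, "k": 9},
    "j": {"i": 8, "j": 0, "k": 7},
    "k": {"i": 9, "j": 7, "k": 0},
}

t_i = triplet_distance(D, "i", "j", "k")  # 0.5 * (8 + 9 - 7) = 5.0
t_j = triplet_distance(D, "j", "i", "k")  # 0.5 * (8 + 7 - 9) = 3.0
t_k = triplet_distance(D, "k", "i", "j")  # 0.5 * (9 + 7 - 8) = 4.0

# The three edge lengths of the 3-tree add back up to the pairwise distances.
assert t_i + t_j == D["i"]["j"]
assert t_i + t_k == D["i"]["k"]
assert t_j + t_k == D["j"]["k"]
```

Under this assignment the 3-tree has edges of length 5, 3 and 4 leading to i, j and k respectively.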
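8 Triplet-Distance Based Reconstruction
t(i; jk) = ½[D(i,j) + D(i,k) - D(j,k)].
[Figure: a triplet-distance table with rows B, E, G, H, L, M and columns BB, BE, BG, ..., LL, LM, MM, from which an unknown tree is to be reconstructed]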
9 Why use Triplet-Distances?
1. They enable more accurate estimations of 2-distances.
2. They are used (de facto) by known reconstruction algorithms.
10 Improved Estimations of Pairwise Distances
Information loss
[Figure: the distance matrix D; calculate D(H,E)]
11 Improved Estimations (cont)
- Estimate D(H,E) by calculating all the 3-trees on {H, E, X}, X ≠ H, E
- (Or calculate just one 3-tree, for a trusted 3rd taxon X)
- V. Ranwez, O. Gascuel, Improvement of distance-based phylogenetic methods by a local maximum likelihood approach using triplets, Mol. Biol. Evol. 19(11), 1952-1963 (2002)
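One possible way to combine the 3-trees into a single estimate is sketched below. The slide does not say how the estimates are combined, so the plain average used here is an assumption, as are the function name and the table layout (a dict keyed by (i, j, k) holding t(i; jk)). The sketch relies only on the fact that in the 3-tree on {a, b, x} the distance between a and b is t(a; bx) + t(b; ax).

```python
def estimate_pairwise(t, taxa, a, b):
    """Average, over all third taxa x, the a-b distance implied by the 3-tree on {a, b, x}."""
    others = [x for x in taxa if x not in (a, b)]
    if not others:
        raise ValueError("need at least one third taxon")
    # In the 3-tree on {a, b, x}, a and b are joined through the midpoint,
    # so their distance is t(a; b x) + t(b; a x).
    estimates = [t[(a, b, x)] + t[(b, a, x)] for x in others]
    return sum(estimates) / len(estimates)


# Hypothetical triplet distances, for illustration only.
taxa = ["H", "E", "X1", "X2"]
t = {
    ("H", "E", "X1"): 2.0, ("E", "H", "X1"): 3.0,
    ("H", "E", "X2"): 2.5, ("E", "H", "X2"): 3.5,
}
estimate = estimate_pairwise(t, taxa, "H", "E")
```

Restricting `taxa` to H, E and one trusted taxon X gives the single-3-tree variant mentioned in the second bullet.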
12 (Implicit) Use of Triplet-Distances in 2-Distance Reconstruction Algorithms
t(i; jk) = ½[D(i,j) + D(i,k) - D(j,k)].
13 1st use: Triplet Distances from a Single Source
- Fix a taxon r, and construct a tree T which minimizes the error on the triplet distances from r, t(r; ij).
- An optimal solution is computable in O(n²) time, and is used e.g. in:
  - (FKW95) Optimal approximation of distances by ultrametric trees.
  - (ABFPT99) The best known approximation of distances by general trees.
  - (BB99) Fast construction of Buneman trees.
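The single-source table itself is easy to compute. The sketch below only builds the O(n²) table of values t(r; ij) for a fixed r; it is not the tree-construction step of FKW95, ABFPT99 or BB99. It also checks a standard property of exact tree metrics that is not stated on the slide: among t(r; ij), t(r; ik), t(r; jk), the two smallest values coincide. Function names and the example distances are hypothetical.

```python
from itertools import combinations


def single_source_table(D, r, taxa):
    """Table of t(r; ij) = 0.5 * (D(r,i) + D(r,j) - D(i,j)) for all pairs i, j != r."""
    others = [x for x in taxa if x != r]
    table = {}
    for i, j in combinations(others, 2):
        value = 0.5 * (D[r][i] + D[r][j] - D[i][j])
        table[(i, j)] = table[(j, i)] = value  # store both orders for easy lookup
    return table


def two_smallest_equal(table, i, j, k, tol=1e-9):
    """For an exact tree metric, the two smallest of the three values coincide."""
    vals = sorted([table[(i, j)], table[(i, k)], table[(j, k)]])
    return abs(vals[0] - vals[1]) <= tol


# Distances of a small hypothetical tree: r and a hang off one internal node,
# b and c hang off another.
taxa = ["r", "a", "b", "c"]
D = {
    "r": {"r": 0, "a": 3, "b": 8, "c": 9},
    "a": {"r": 3, "a": 0, "b": 9, "c": 10},
    "b": {"r": 8, "a": 9, "b": 0, "c": 9},
    "c": {"r": 9, "a": 10, "b": 9, "c": 0},
}
table = single_source_table(D, "r", taxa)
```

Here t(r; ab) = t(r; ac) = 1 and t(r; bc) = 4, reflecting that b and c separate from r later than a does.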
14 2nd use: Saitou & Nei Neighbour Joining
The neighbor-selection criterion of NJ selects a taxon pair i, j which maximizes a sum taken over the remaining taxa r.
[Figure: the taxa i and j surrounded by the remaining taxa r]
15 Previous Works on Triplet-Dissimilarities/Distances
- I. Gronau, S. Moran, Neighbor Joining Algorithms for Inferring Phylogenies via LCA-Distances, Journal of Computational Biology 14(1), pp. 1-15 (2007).
- Works which use the total weights of 3-trees:
  - S. Joly, G. Le Calvé, Three-Way Distances, Journal of Classification 12, pp. 191-205 (1995)
  - L. Pachter, D. Speyer, Reconstructing Trees from Subtree Weights, Applied Mathematics Letters 17, pp. 615-621 (2004)
  - D. Levy, R. Yoshida, L. Pachter, Beyond pairwise distances: Neighbor-joining with phylogenetic diversity estimates, Mol. Biol. Evol. 23(3), 491-498 (2006).
16 Summary of Results
- Results for Maximal Difference (ℓ∞)
  - The decision problem is NP-hard
    - Is there a tree T s.t. ‖t, t_T‖∞ ≤ Δ?
  - Hardness of approximation of the optimization problem
    - Finding a tree T s.t. ‖t, t_T‖∞ is within a factor 1.4 of ‖t, t_OPT‖∞
  - A 15-approximation algorithm
    - Using the 6-approximation algorithm for 2-dissimilarities from ABFPT99
- Result for Maximal Distortion
  - Hardness of approximation within any constant factor
17 NP-Hardness of the Decision Problem
We use a reduction from 3SAT (the problem of determining whether a 3CNF formula is satisfiable).
We show:
If one can determine, for (t, Δ), whether there exists a tree T s.t. ‖t, t_T‖∞ ≤ Δ, then one can determine for every 3CNF formula f whether it is satisfiable.
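The hardness concerns searching over all trees; checking a single candidate is straightforward. The sketch below assumes the candidate tree is given through its pairwise distances D_T (it does not verify that D_T is actually a tree metric), derives the triplet distances t_T from it, and compares them entry by entry with the input table t. All function names and the example values are hypothetical.

```python
from itertools import product


def induced_triplets(D, taxa):
    """Triplet distances t(i; jk) induced by the pairwise distances D."""
    return {
        (i, j, k): 0.5 * (D[i][j] + D[i][k] - D[j][k])
        for i, j, k in product(taxa, repeat=3)
    }


def max_difference(t, t_T):
    """Maximal difference between two triplet tables with the same keys."""
    return max(abs(t[key] - t_T[key]) for key in t)


def within_bound(t, D_T, taxa, delta):
    """True if the candidate's triplet distances are within delta of t everywhere."""
    return max_difference(t, induced_triplets(D_T, taxa)) <= delta


# Hypothetical candidate and an input table that deviates from it in one entry.
taxa = ["i", "j", "k"]
D_T = {
    "i": {"i": 0, "j": 8, "k": 9},
    "j": {"i": 8, "j": 0, "k": 7},
    "k": {"i": 9, "j": 7, "k": 0},
}
t = induced_triplets(D_T, taxa)
t[("i", "j", "k")] += 0.5
```

With this input, `within_bound(t, D_T, taxa, 0.5)` holds while `within_bound(t, D_T, taxa, 0.4)` does not.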
18 The Reduction
Given a 3CNF formula f, we define triplet distances t and an error bound Δ which force the output tree to imply a satisfying assignment to f.
- The set of taxa:
  - Taxa T, F.
  - A taxon for every literal (x_i, ¬x_i).
  - 3 taxa for every clause C_j (y_j1, y_j2, y_j3).
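The taxa set can be generated mechanically from a formula. In the sketch below the taxon names ("x1", "~x1", "y1_1", ...) and the clause encoding (triples of non-zero integers, negative for negated variables) are illustrative assumptions; only the counts follow the list above: 2 + 2n + 3m taxa for n variables and m clauses.

```python
def reduction_taxa(num_vars, clauses):
    """List the taxa of the reduction for a 3CNF formula.

    clauses: list of 3-tuples of non-zero ints (assumed encoding);
    only the number of clauses matters for the taxa set.
    """
    taxa = ["T", "F"]
    for i in range(1, num_vars + 1):
        taxa += [f"x{i}", f"~x{i}"]  # one taxon per literal
    for j in range(1, len(clauses) + 1):
        taxa += [f"y{j}_{k}" for k in (1, 2, 3)]  # three taxa per clause
    return taxa


# (x1 or not x2 or x3) and (not x1 or x2 or x3)
clauses = [(1, -2, 3), (-1, 2, 3)]
taxa = reduction_taxa(3, clauses)
```

For this formula the reduction uses 2 + 6 + 6 = 14 taxa.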
19 Properties Enforced by the Input (t, Δ)
- One of the following can be enforced on each taxa triplet (u, v, w):
  - taxon u is close to Path(v,w), or
  - taxon u is far from Path(v,w)
[Figure: taxon u and Path(v,w)]
20 Enforcing Truth Assignment
- A truth assignment to f is implied by the following:
  - T is far from F
  - For each i, x_i is far from ¬x_i, and both x_i and ¬x_i are close to Path(T,F)
Thus we set x_i = T iff x_i is close to T.
21 Enforcing Clauses-Satisfaction
A clause C = (l_1 ∨ l_2 ∨ l_3) is satisfied iff at least one literal l_i is true, i.e. is close to T.
(l_1 ∨ l_2 ∨ l_3) is satisfied iff it is not like this:
[Figure: the unsatisfied configuration, with none of l_1, l_2, l_3 close to T]
We need to guarantee, by the close/far relations, that all clauses avoid the above configuration.
22 Clauses-Satisfaction (cont)
- (l_1 ∨ l_2 ∨ l_3) is satisfied iff, out of the three paths Path(l_1, l_2), Path(l_1, l_3), Path(l_2, l_3), at least two paths are close to T.
[Figure: the literals l_1, l_2, l_3 positioned relative to T and F]
23 Clauses-Satisfaction (cont)
We attach a taxon to each such path:
- y_1 is close to Path(l_2, l_3)
- y_2 is close to Path(l_1, l_3)
- y_3 is close to Path(l_1, l_2)
Then (l_1 ∨ l_2 ∨ l_3) is satisfied iff at least two of the y_i's can be located close to T.
24 Clauses-Satisfaction (end)
And at least two of the y_i's can be located close to T iff Path(y_2, y_3), Path(y_1, y_3), Path(y_1, y_2) are all close to T.
So, (l_1 ∨ l_2 ∨ l_3) is satisfied iff all the above paths are close to T.
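The counting behind these two equivalences can be checked exhaustively. The sketch below is a purely combinatorial abstraction, not the geometric argument of the proof: it assumes that a path counts as "close to T" exactly when at least one of its endpoints is close to T, and it represents "close to T" as a boolean.

```python
from itertools import combinations, product


def pairs_with_true(values):
    """Number of pairs (paths) with at least one endpoint marked 'close to T'."""
    return sum(1 for a, b in combinations(values, 2) if a or b)


# A clause has a true literal iff at least two of the three literal paths are close to T.
for lits in product([False, True], repeat=3):
    assert any(lits) == (pairs_with_true(lits) >= 2)

# At least two y's are close to T iff all three paths between y's are close to T.
for ys in product([False, True], repeat=3):
    assert (sum(ys) >= 2) == (pairs_with_true(ys) == 3)
```

Both loops run over all 8 cases without tripping an assertion, which is the pigeonhole step used on the slides: one close endpoint covers two of the three pairs, and two close endpoints cover all three.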
25 Construction Example
f is satisfiable ⇒ there is a tree T which satisfies all bounds:
A1: t_T(T, F) ≥ 2α+2β
A2: for i = 1..n: t_T(T; x_i ¬x_i) ≤ α, t_T(F; x_i ¬x_i) ≤ α
B1: for j = 1..m: t_T(y_j1; l_j2 l_j3) ≤ α, t_T(y_j2; l_j1 l_j3) ≤ α, t_T(y_j3; l_j1 l_j2) ≤ α
B2: for j = 1..m: t_T(y_j1; T F) ≥ α, t_T(y_j2; T F) ≥ α, t_T(y_j3; T F) ≥ α
B3: for j = 1..m: t_T(T; y_j2 y_j3) ≤ α, t_T(T; y_j1 y_j3) ≤ α, t_T(T; y_j1 y_j2) ≤ α
26 Hardness of Approximation Results
By stretching the close/far restrictions, the following problems are also shown NP-hard:
- Approximating Maximal Difference
  - Finding a tree T s.t. ‖t, t_T‖∞ is within a factor 1.4 of ‖t, t_OPT‖∞
- Approximating Maximal Distortion
  - Finding a tree T s.t. MaxDist(t, t_T) is within a factor C of MaxDist(t, t_OPT), for any constant C
Details in: I. Gronau and S. Moran, On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities, Theoretical Computer Science 389(1-2), December 2007, pp. 44-55.
27 Open Problems/Further Research
- Extending the hardness results to 3-diss tables induced by 2-diss matrices
  - (t(i; jk) = ½[D(i,j) + D(i,k) - D(j,k)])
- Extending the hardness results to naturally looking trees
  - (binary trees with constant-bounded edge weights)
- Check the performance of NJ when the neighbor-selection formula is computed from real 3-distances.
- Devise algorithms which use 3-distances as input.
- Does optimization of 3-diss lead to good topological accuracy (under accepted models of sequence evolution)?
  - (it is known that optimization of 2-diss doesn't lead to good topological accuracy)
28 Thank You
29 Distance-Based Phylogenetic Reconstruction
- Compute distances between all taxon-pairs
- Find a tree (edge-weighted) best-describing the
distances
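To make "a tree describing the distances" concrete, the sketch below computes the tree metric D_T induced by an edge-weighted tree, i.e. the path length between every pair of taxa. The edge-list input format, the node names and the example tree are assumptions made for illustration.

```python
def tree_metric(edges, leaves):
    """Pairwise path lengths between leaves of an edge-weighted tree.

    edges: list of (u, v, weight) triples describing the tree (assumed format).
    """
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))

    def distances_from(src):
        # Depth-first traversal; in a tree every node is reached by a unique path.
        dist = {src: 0.0}
        stack = [src]
        while stack:
            u = stack.pop()
            for v, w in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + w
                    stack.append(v)
        return dist

    D = {}
    for a in leaves:
        dist = distances_from(a)
        D[a] = {b: dist[b] for b in leaves}
    return D


# Hypothetical tree: leaves r and a hang off internal node u, leaves b and c off v.
edges = [("r", "u", 1), ("a", "u", 2), ("u", "v", 3), ("b", "v", 4), ("c", "v", 5)]
D_T = tree_metric(edges, ["r", "a", "b", "c"])
```

For example, the path from a to c uses the edges of weight 2, 3 and 5, so D_T(a, c) = 10.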
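30 Optimization Criteria
- Known measures of closeness:
  - ℓ∞: max over i, j of |D(i,j) - D_T(i,j)|
  - ℓp: (sum over i, j of |D(i,j) - D_T(i,j)|^p)^(1/p)
  - MaxDist: defined through the ratios D(i,j)/D_T(i,j) and D_T(i,j)/D(i,j) (where 0/0 = 1)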
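The measures can be compared on a small example. The ℓ∞ and ℓp functions below follow the standard norm definitions. The MaxDist formula is not recoverable from the slide, so `max_dist` assumes the usual notion of distortion (largest expansion ratio times largest contraction ratio) together with the stated convention 0/0 = 1. All names and numbers are hypothetical.

```python
def ratio(a, b):
    """a / b with the convention 0/0 = 1; division of a non-zero by zero is infinite."""
    if a == 0 and b == 0:
        return 1.0
    if b == 0:
        return float("inf")
    return a / b


def l_inf(D, D_T, pairs):
    return max(abs(D[a][b] - D_T[a][b]) for a, b in pairs)


def l_p(D, D_T, pairs, p):
    return sum(abs(D[a][b] - D_T[a][b]) ** p for a, b in pairs) ** (1.0 / p)


def max_dist(D, D_T, pairs):
    # Assumed definition: maximal expansion times maximal contraction.
    expansion = max(ratio(D_T[a][b], D[a][b]) for a, b in pairs)
    contraction = max(ratio(D[a][b], D_T[a][b]) for a, b in pairs)
    return expansion * contraction


pairs = [("a", "b"), ("a", "c"), ("b", "c")]
D = {"a": {"b": 4.0, "c": 6.0}, "b": {"c": 8.0}}
D_T = {"a": {"b": 5.0, "c": 6.0}, "b": {"c": 6.0}}

diff = l_inf(D, D_T, pairs)         # largest absolute difference: |8 - 6| = 2
distortion = max_dist(D, D_T, pairs)  # (5/4) * (8/6)
```

The difference-based measures depend on the scale of the distances, while a ratio-based measure like this one is unchanged when both matrices are multiplied by the same constant.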
31 The Reduction
[Diagram: 3CNF formula f → (t, Δ)]
There is a tree T s.t. ‖t, t_T‖∞ ≤ Δ ⇔ f is satisfiable.
If one can determine, for (t, Δ), whether there exists a tree T s.t. ‖t, t_T‖∞ ≤ Δ, then one can determine for every 3CNF formula f whether it is satisfiable.
32 The Reduction
Define a set of lower and upper bounds:
A1: t_T(T, F) ≥ 2α+2β
A2: for i = 1..n: t_T(T; x_i ¬x_i) ≤ α, t_T(F; x_i ¬x_i) ≤ α
B1: for j = 1..m: t_T(y_j1; l_j2 l_j3) ≤ α, t_T(y_j2; l_j1 l_j3) ≤ α, t_T(y_j3; l_j1 l_j2) ≤ α
B2: for j = 1..m: t_T(y_j1; T F) ≥ α, t_T(y_j2; T F) ≥ α, t_T(y_j3; T F) ≥ α
B3: for j = 1..m: t_T(T; y_j2 y_j3) ≤ α, t_T(T; y_j1 y_j3) ≤ α, t_T(T; y_j1 y_j2) ≤ α
33 The Reduction
[Diagram: 3CNF formula f → lower and upper bounds (t_l, t_u), which are 2Δ apart]
There is a tree T s.t. t_l ≤ t_T ≤ t_u ⇔ f is satisfiable.
If one can determine, for (t, Δ), whether there exists a tree T s.t. ‖t, t_T‖∞ ≤ Δ, then one can determine for every 3CNF formula f whether it is satisfiable.
34 The Reduction
- Define the set of taxa.
- Define a set of lower and upper bounds on some entries of t_T.
- f is satisfiable ⇔ there is a tree T which satisfies all bounds.
- Define Δ according to the slackness required for the proof.
35 The Reduction
- Define the set of taxa:
  - Taxa T, F.
  - A taxon for every literal (x_i, ¬x_i).
  - 3 taxa for every clause (y_j1, y_j2, y_j3).
36 The Analysis
A1: t_T(T, F) ≥ 2α+2β
A2: for i = 1..n: t_T(T; x_i ¬x_i) ≤ α, t_T(F; x_i ¬x_i) ≤ α
- Trees satisfying A1 and A2 imply a truth assignment to x_1, ..., x_n.
37 The Analysis
B1: for j = 1..m: t_T(y_j1; l_j2 l_j3) ≤ α, t_T(y_j2; l_j1 l_j3) ≤ α, t_T(y_j3; l_j1 l_j2) ≤ α
B2: for j = 1..m: t_T(y_j1; T F) ≥ α, t_T(y_j2; T F) ≥ α, t_T(y_j3; T F) ≥ α
B3: for j = 1..m: t_T(T; y_j2 y_j3) ≤ α, t_T(T; y_j1 y_j3) ≤ α, t_T(T; y_j1 y_j2) ≤ α
There is a tree T which satisfies all bounds ⇒ f is satisfiable.
- B1 and B2 constrain the position of each y_ja relative to l_jb and l_jc, for {a,b,c} = {1,2,3}.
- B3 implies that at least two of y_j1, y_j2, y_j3 are satisfied.
38 The Reduction: t(f)
Bounds on the tree:
A1: t_T(T, F) ≥ 2α+2β
A2: for i = 1..n: t_T(T; x_i ¬x_i) ≤ α, t_T(F; x_i ¬x_i) ≤ α
B1: for j = 1..m: t_T(y_j1; l_j2 l_j3) ≤ α, t_T(y_j2; l_j1 l_j3) ≤ α, t_T(y_j3; l_j1 l_j2) ≤ α
B2: for j = 1..m: t_T(y_j1; T F) ≥ α, t_T(y_j2; T F) ≥ α, t_T(y_j3; T F) ≥ α
B3: for j = 1..m: t_T(T; y_j2 y_j3) ≤ α, t_T(T; y_j1 y_j3) ≤ α, t_T(T; y_j1 y_j2) ≤ α
- In our constructed tree:
  - All 2-distances are in [2α, 2α+2β].
  - All 3-distances are in [α, α+2β].
  - Δ = β.
The input table t:
A1: t(T, F) = 2α+3β
A2: for i = 1..n: t(T; x_i ¬x_i) = α-β, t(F; x_i ¬x_i) = α-β
B1: for j = 1..m: t(y_j1; l_j2 l_j3) = α-β, t(y_j2; l_j1 l_j3) = α-β, t(y_j3; l_j1 l_j2) = α-β
B2: for j = 1..m: t(y_j1; T F) = α+β, t(y_j2; T F) = α+β, t(y_j3; T F) = α+β
B3: for j = 1..m: t(T; y_j2 y_j3) = α-β, t(T; y_j1 y_j3) = α-β, t(T; y_j1 y_j2) = α-β
Other 2-distances: t(s, t) = 2α+2β
Other 3-distances: t(s; t u) = α+2β