Title: Algorithms For Quartet Based Phylogeny Reconstruction
1Algorithms For Quartet Based Phylogeny Reconstruction
Gang Wu
Department of Computing Science
University of Alberta
Edmonton, Alberta, Canada
2Outline
- Introduction
- Research methods
- Computational results
- Conclusions and future works
3Common Phylogeny Terminology
Phylogeny: pattern of historical relationships among species (taxa).
Tree: mathematical structure used to depict the evolutionary history of a group of taxa.
[Figure: rooted tree with leaves A, B, C, D, E and the following labeled parts]
Leaf nodes: represent the taxa (genes, populations, etc.) used to infer the phylogeny.
Branches or edges (including internal branches).
Root of the tree: common ancestor of all taxa.
Internal nodes: represent hypothetical ancestors of the taxa.
5Quartet Based Phylogeny Reconstruction
- Quartet: four taxa (A, B, C, D)
- Quartet topology: an unrooted tree for a quartet
- There are three possible quartet topologies for a quartet: AB|CD, AC|BD, AD|BC.
6Process of Quartet Based Phylogeny Reconstruction
7Definitions
A quartet topology ab|cd is consistent with a phylogeny T, or a phylogeny T satisfies a quartet topology ab|cd, iff a, b, c, d are all leaves of T and the path from a to b does not share any nodes with the path from c to d.
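The definition can be checked directly by comparing the two paths. The sketch below is only an illustration: the adjacency-dictionary representation, the function names, and the six-leaf example tree (internal nodes u, v, w, x) are assumptions, not taken from the slides.

```python
from collections import deque

def path(tree, src, dst):
    """Nodes on the unique src-dst path of a tree given as an adjacency dict."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in tree[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    nodes = []
    node = dst
    while node is not None:          # walk back from dst to src
        nodes.append(node)
        node = parent[node]
    return nodes[::-1]

def is_consistent(tree, a, b, c, d):
    """True iff ab|cd is consistent with the tree (paths a-b and c-d are disjoint)."""
    leaves = {n for n, nbrs in tree.items() if len(nbrs) == 1}
    if not {a, b, c, d} <= leaves:
        return False
    return not set(path(tree, a, b)) & set(path(tree, c, d))

# Hypothetical phylogeny with cherries (a,b), (c,d), (e,f) joined at a center x.
tree = {
    "a": ["u"], "b": ["u"], "u": ["a", "b", "x"],
    "c": ["v"], "d": ["v"], "v": ["c", "d", "x"],
    "e": ["w"], "f": ["w"], "w": ["e", "f", "x"],
    "x": ["u", "v", "w"],
}
print(is_consistent(tree, "a", "e", "c", "d"))  # ae|cd
print(is_consistent(tree, "a", "c", "e", "d"))  # ac|ed
```

On this assumed tree, ae|cd passes the test while ac|ed fails, because the a-c path and the e-d path both run through x and v.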
8Phylogeny T
[Figure: phylogeny T on leaves a, b, c, d, e, f]
The quartet topology ae|cd is consistent with T, or T satisfies ae|cd.
9Definitions
A quartet topology set Q on a taxon set S is complete iff Q contains a quartet topology for every four taxa in S.
Given a quartet topology set Q on a taxon set S,
Q is compatible iff there is a phylogeny on S
which satisfies all the quartet topologies in Q.
10Quartet topology set Q on taxon set S = {a, b, c, d, e, f}:
ae|cd ab|cd ab|ce ab|cf ab|de ab|df ab|ef
af|cd ac|ef ad|ef be|cd bf|cd bc|ef
bd|ef cd|ef
Q is complete
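Completeness is easy to test mechanically: every 4-subset of S must be covered by some topology in Q. A minimal sketch, assuming single-letter taxon names and topologies written as strings such as "ab|cd" (both conventions are assumptions made for this example):

```python
from itertools import combinations

def is_complete(quartets, taxa):
    """True iff every four taxa of `taxa` have a topology in `quartets`."""
    covered = {frozenset(q.replace("|", "")) for q in quartets}
    needed = {frozenset(four) for four in combinations(taxa, 4)}
    return needed <= covered

Q = ("ae|cd ab|cd ab|ce ab|cf ab|de ab|df ab|ef "
     "af|cd ac|ef ad|ef be|cd bf|cd bc|ef bd|ef cd|ef").split()

print(is_complete(Q, "abcdef"))       # all C(6, 4) = 15 subsets are covered
print(is_complete(Q[1:], "abcdef"))   # dropping ae|cd leaves {a, c, d, e} uncovered
```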
11Definitions
Given a taxon set S, we define the phylogeny that
reveals the correct relationships among the taxa
in S as the true phylogeny on S, denoted as
Ttrue. If a quartet topology q is inconsistent
with Ttrue, then q is a quartet error of Ttrue.
[Figure: true phylogeny Ttrue on leaves a, b, c, d, e, f; the quartet topology ac|ed is a quartet error]
12Research Goal
Given a quartet topology set Q on taxon set S, reconstruct a phylogeny on S that is as close as possible to the true phylogeny on S.
13Research Methods
- An answer set programming based method and a look-ahead branch and bound method for solving the general Maximum Quartet Consistency (MQC) problem
- A polynomial time algorithm for solving a special MQC problem
- Three efficient algorithms to reconstruct the true phylogeny with high success probabilities.
14MQC/MQI Problem
Maximum Quartet Consistency Problem (MQC)
Input: A set Q of quartet topologies on S.
Goal: Find a phylogeny T on S such that the number of consistent quartet topologies in Q is maximized.
Minimum Quartet Inconsistency Problem (MQI)
Input: A set Q of quartet topologies on S.
Goal: Find a phylogeny T on S such that the number of inconsistent quartet topologies in Q is minimized.
15MQC/MQI Problem
Quartet topology set Q on taxon set S = {a, b, c, d, e, f}:
ac|ed ab|cd ab|ce ab|cf ab|de ab|df ab|ef
af|cd ac|ef ad|ef be|cd bf|cd bc|ef
bd|ef cd|ef
We cannot find any phylogeny that satisfies all the quartet topologies in Q.
We therefore look for a phylogeny that satisfies a maximum number of quartet topologies in Q.
16MQC/MQI Problem
Quartet topology set Q on taxon set S = {a, b, c, d, e, f}:
ac|ed ab|cd ab|ce ab|cf ab|de ab|df ab|ef
af|cd ac|ef ad|ef be|cd bf|cd bc|ef
bd|ef cd|ef
[Figure: phylogeny Topt for the MQC problem]
If a quartet topology q ∈ Q is not consistent with the optimal phylogeny Topt, then it is a conflicting quartet topology.
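The MQC and MQI objectives can be evaluated for any candidate phylogeny by counting. The sketch below assumes the candidate tree is described by one side of each internal edge split, and uses the fact that ab|cd is consistent with a tree exactly when some edge separates {a, b} from {c, d}. The candidate with cherries (a,b), (c,d), (e,f) is a hypothetical choice, not necessarily the Topt drawn on the slide.

```python
def satisfies(splits, quartet):
    """True if some internal edge split separates the two pairs of the quartet."""
    left, right = (set(side) for side in quartet.split("|"))
    for side in map(set, splits):
        if (left <= side and not right & side) or (right <= side and not left & side):
            return True
    return False

def score(splits, quartets):
    """Return (number consistent, number inconsistent) for a candidate tree."""
    consistent = sum(satisfies(splits, q) for q in quartets)
    return consistent, len(quartets) - consistent

Q = ("ac|ed ab|cd ab|ce ab|cf ab|de ab|df ab|ef "
     "af|cd ac|ef ad|ef be|cd bf|cd bc|ef bd|ef cd|ef").split()

candidate = ["ab", "cd", "ef"]   # one side of each internal edge (assumed tree)
consistent, inconsistent = score(candidate, Q)
conflicting = [q for q in Q if not satisfies(candidate, q)]
print(consistent, inconsistent, conflicting)
```

Since the two counts always sum to |Q|, maximizing consistency (MQC) and minimizing inconsistency (MQI) select the same phylogenies. For this candidate only ac|ed is left unsatisfied.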
17Why MQC/MQI
- Simulation results show that in most cases the resultant phylogenies of the MQC/MQI problem are the true phylogenies, especially when the number of quartet errors is small.
- The MQC/MQI problem is NP-complete, and it is a challenge to solve it efficiently in practical phylogeny reconstruction.
18Outline
- Introduction
- Research methods
  - Answer set programming method based on ultrametric phylogeny
- Computational results
- Conclusions and future works
20Ultrametric Phylogeny and Matrix
- Ultrametric Phylogeny
  - Label each internal node with a positive integer number
  - Along any root-to-leaf path, the labels on the path are strictly decreasing
Ultrametric Matrix: each entry value is the label of the least common ancestor of the two leaf nodes. It is
- Symmetric, with M(i, i) = 0, and
- For every triplet (i, j, k), two of the values M(i, j), M(j, k), and M(i, k) are equal and greater than the third value.
e.g. i = 1, j = 3, k = 4: M(1, 3) = M(3, 4) > M(1, 4)
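As an illustration, the matrix can be filled in from a labeled tree and the two properties checked. The nested-tuple tree encoding `(label, child, child)` and the particular five-leaf tree are assumptions; the labels were chosen so that the matrix agrees with the values quoted on the slides.

```python
from itertools import combinations

def fill_matrix(node, M):
    """Record LCA labels for all leaf pairs below `node`; return those leaves."""
    if not isinstance(node, tuple):          # a leaf
        M[node, node] = 0
        return [node]
    label, *children = node
    groups = [fill_matrix(child, M) for child in children]
    for g1, g2 in combinations(groups, 2):   # leaves in different subtrees meet here
        for x in g1:
            for y in g2:
                M[x, y] = M[y, x] = label
    return [leaf for group in groups for leaf in group]

def is_ultrametric(M, leaves):
    """Check zero diagonal, symmetry, and the three-point condition."""
    if any(M[i, i] != 0 for i in leaves):
        return False
    if any(M[i, j] != M[j, i] for i, j in combinations(leaves, 2)):
        return False
    for i, j, k in combinations(leaves, 3):
        lo, mid, hi = sorted((M[i, j], M[j, k], M[i, k]))
        if not (lo < mid == hi):
            return False
    return True

# Assumed labeled tree: root 4, with subtrees ((s1, s5):1, s4):3 and (s3, s2):2.
tree = (4, (3, (1, "s1", "s5"), "s4"), (2, "s3", "s2"))
M = {}
leaves = fill_matrix(tree, M)
print(M["s1", "s3"], M["s3", "s4"], M["s1", "s4"])
print(is_ultrametric(M, leaves))
```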
21Theorem: A quartet topology ab|cd is consistent with a phylogeny T iff any ultrametric labeling scheme M of T satisfies min{M(a, c), M(b, d)} > min{M(a, b), M(c, d)}.
[Figure: ultrametric phylogeny on leaves s1, s5, s4, s3, s2 with internal nodes labeled 4, 3, 1, 2]
s1 s5 | s2 s3 is consistent with the tree and its corresponding matrix: min{M(1, 2), M(5, 3)} = 4 > min{M(1, 5), M(2, 3)} = 1. Condition satisfied!
An ultrametric matrix satisfies a quartet topology ab|cd if the above inequality is satisfied.
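The inequality is a one-line test on the matrix. In the sketch below the matrix entries are an assumption: they come from a five-leaf tree chosen to reproduce the values in the example (M(1, 5) = 1 and M(1, 2) = M(5, 3) = 4), and the helper names are hypothetical.

```python
from itertools import combinations

M = {}

def put(i, j, value):
    """Store a symmetric matrix entry."""
    M[i, j] = M[j, i] = value

# Assumed labels: (s1, s5) meet at 1, (s2, s3) at 2, s4 joins (s1, s5) at 3, root is 4.
put(1, 5, 1)
put(2, 3, 2)
put(1, 4, 3)
put(5, 4, 3)
for i in (1, 5, 4):
    for j in (2, 3):
        put(i, j, 4)

def satisfied(M, a, b, c, d):
    """True iff the matrix satisfies the quartet topology ab|cd."""
    return min(M[a, c], M[b, d]) > min(M[a, b], M[c, d])

print(satisfied(M, 1, 5, 2, 3))   # s1 s5 | s2 s3: 4 > 1
print(satisfied(M, 1, 2, 5, 3))   # s1 s2 | s5 s3

# For each four leaves, count how many of the three topologies are satisfied.
counts = [
    sum(satisfied(M, *t) for t in ((w, x, y, z), (w, y, x, z), (w, z, x, y)))
    for w, x, y, z in combinations((1, 2, 3, 4, 5), 4)
]
print(counts)
```

For this matrix exactly one of the three topologies holds for every four leaves, as expected for a fully resolved tree.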
22Theorem: Given a quartet topology set Q on S and an ultrametric phylogeny T on S, T satisfies k quartet topologies in Q if and only if the corresponding ultrametric matrix M on S satisfies the same k quartet topologies in Q.
We therefore transform the original MQC problem into an ultrametric matrix search problem.
23Problem Formulation
- Input
  - n × n matrix M(i, j); the domain of each matrix entry is 0..n-1
  - Quartet topology set Q
- Goal
  - Find a solution to M(i, j), so that
    - The matrix is ultrametric
    - The number of quartet topologies satisfied by the matrix is maximized
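For very small inputs the formulation can be solved by plain backtracking, which makes the search space concrete. This is only a sketch of the search problem, not the answer set programming method of the talk. Off-diagonal entries range over 1..n-1, the three-point condition is checked as soon as a triple is fully assigned, and the five-taxon input is an assumed example with one conflicting topology.

```python
from itertools import combinations

def three_point_ok(x, y, z):
    """Two of the three values are equal and greater than the third."""
    lo, mid, hi = sorted((x, y, z))
    return lo < mid == hi

def quartet_ok(M, q):
    (a, b), (c, d) = q.split("|")
    return min(M[a, c], M[b, d]) > min(M[a, b], M[c, d])

def best_matrix(taxa, quartets):
    """Exhaustive search for an ultrametric matrix satisfying the most quartets."""
    pairs = list(combinations(taxa, 2))
    M, best = {}, (-1, None)

    def consistent(i, j):
        # Check every triple that the new entry (i, j) completes.
        return all(
            three_point_ok(M[i, j], M[i, k], M[j, k])
            for k in taxa
            if k not in (i, j) and (i, k) in M and (j, k) in M
        )

    def search(pos):
        nonlocal best
        if pos == len(pairs):
            found = sum(quartet_ok(M, q) for q in quartets)
            if found > best[0]:
                best = (found, dict(M))
            return
        i, j = pairs[pos]
        for value in range(1, len(taxa)):    # off-diagonal domain 1..n-1
            M[i, j] = M[j, i] = value
            if consistent(i, j):
                search(pos + 1)
        del M[i, j], M[j, i]

    search(0)
    return best

taxa = "abcde"
Q = ["ab|cd", "ab|ce", "ab|de", "be|cd", "ac|ed"]   # assumed input; ac|ed conflicts
found, matrix = best_matrix(taxa, Q)
print(found, "of", len(Q), "quartet topologies satisfied")
```

The search grows exponentially with n, which is why a dedicated solver is used in practice.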
24Answer Set Programming
Given a set of logic rules (with c taken as a fact)
a :- b, c, not d.
b :- not d.
c.
Use a solver program to find the solution (answer set)
{a, b, c}
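What a solver computes can be imitated by brute force on a program this small: a candidate set of atoms is an answer set if it equals the least model of the program's reduct with respect to that candidate. The sketch below is a toy checker, not a real ASP solver. It assumes the rules read "a if b, c, not d" and "b if not d", and it adds c as a fact, which is an assumption needed for {a, b, c} to be the answer set.

```python
from itertools import chain, combinations

def least_model(positive_rules):
    """Least model of a negation-free program, by fixpoint iteration."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, body in positive_rules:
            if head not in model and set(body) <= model:
                model.add(head)
                changed = True
    return model

def answer_sets(rules):
    """Enumerate answer sets of rules given as (head, positive body, negative body)."""
    atoms = sorted({h for h, _, _ in rules}
                   | {x for _, pos, neg in rules for x in chain(pos, neg)})
    found = []
    for size in range(len(atoms) + 1):
        for candidate in map(set, combinations(atoms, size)):
            # Reduct: drop rules blocked by the candidate, then drop "not" literals.
            reduct = [(h, pos) for h, pos, neg in rules if not candidate & set(neg)]
            if least_model(reduct) == candidate:
                found.append(candidate)
    return found

rules = [
    ("a", ["b", "c"], ["d"]),   # a :- b, c, not d.
    ("b", [], ["d"]),           # b :- not d.
    ("c", [], []),              # c.   (assumed fact)
]
print(answer_sets(rules))
```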
25Formulation in Answer Set Programming
Domain
1 { m(1, 2, 1), m(1, 2, 2), m(1, 2, 3), m(1, 2, 4), m(1, 2, 5) } 1
matrix entry (1, 2) takes exactly one value in the domain [1, 5]
Ultrametric Constraints
for three matrix values m(i, j), m(j, k) and m(i, k), two of them are equal and greater than the third one
Quartet Constraints
if min{m(i, k), m(j, l)} > min{m(i, j), m(k, l)} then quartet ij|kl is satisfied
Objective
maximize q(i, j, k, l)
26Outline
- Introduction
- Research methods
  - Answer set programming method based on ultrametric phylogeny
  - A lookahead Branch and Bound algorithm
- Computational results
- Computational results
- Conclusions and future works
27Background
Local conflict: an incompatible set of 3 quartet topologies on 5 taxa, for example ab|cd, ac|be and ac|de.
Theorem: Given a complete set of quartet topologies Q over a set of taxa S and some taxon e in S, Q is compatible iff there exists no local conflict whose taxon set includes e.
Idea: Construct a local conflict list involving a taxon e, and then try to resolve all the local conflicts in the list by changing fewer than k quartet topologies.
Method: Branch and bound
28Lookahead
Contribution of changing a quartet topology: the difference between the sizes of the local conflict list before and after the change.
Example: suppose the current local conflict list has size 100. We choose a quartet topology ab|cd and change it to ac|bd. The new local conflict list has size 60. Then the contribution of ab|cd -> ac|bd is 40.
At each search node, a lookahead mechanism first tests the contribution of each possible branch and chooses the one with the maximum contribution to continue searching.
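A small sketch of the lookahead step, under stated assumptions: local conflicts are found by brute force, testing each triple of topologies on five taxa against the 15 binary trees on those taxa, and the input is a six-taxon set with the single error ac|ed. The function names and the single-letter taxon convention are hypothetical, and a real implementation would maintain the conflict list incrementally.

```python
from itertools import combinations

def topo(text):
    """Parse 'ab|cd' into an order-free quartet topology."""
    left, right = text.split("|")
    return frozenset((frozenset(left), frozenset(right)))

def taxa_of(q):
    return frozenset().union(*q)

def five_taxon_trees(five):
    """Quartet sets induced by the 15 binary trees on five taxa.

    Each tree has a middle leaf m between two cherries (p, q) and (r, s).
    """
    for m in five:
        x, y, z, w = (t for t in five if t != m)
        for p, q, r, s in ((x, y, z, w), (x, z, y, w), (x, w, y, z)):
            yield {topo(f"{p}{q}|{r}{s}"), topo(f"{q}{m}|{r}{s}"),
                   topo(f"{p}{m}|{r}{s}"), topo(f"{p}{q}|{m}{s}"),
                   topo(f"{p}{q}|{m}{r}")}

def local_conflicts(Q, taxa, e):
    """Incompatible triples of topologies in Q spanning five taxa that include e."""
    found = []
    for five in combinations(sorted(taxa), 5):
        if e not in five:
            continue
        trees = list(five_taxon_trees(five))
        inside = [q for q in Q if taxa_of(q) <= set(five)]
        for triple in combinations(inside, 3):
            spans_five = len(frozenset().union(*map(taxa_of, triple))) == 5
            if spans_five and not any(set(triple) <= tree for tree in trees):
                found.append(triple)
    return found

def contribution(Q, taxa, e, old, new):
    """Reduction in the size of the local conflict list when old becomes new."""
    before = len(local_conflicts(Q, taxa, e))
    changed = [new if q == old else q for q in Q]
    return before - len(local_conflicts(changed, taxa, e))

Q = [topo(t) for t in ("ac|ed ab|cd ab|ce ab|cf ab|de ab|df ab|ef "
                       "af|cd ac|ef ad|ef be|cd bf|cd bc|ef bd|ef cd|ef").split()]
old = topo("ac|ed")
options = {new: contribution(Q, "abcdef", "a", old, topo(new))
           for new in ("ae|cd", "ad|ce")}
best = max(options, key=options.get)
print(options, best)
```

In this example, changing ac|ed to ae|cd removes every local conflict involving a, so the lookahead prefers it over the alternative ad|ce.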
29Outline of Algorithm
At every node in the search tree:
1. Test whether to cut the node (m is the number of local conflicts, k is the maximum number of quartet errors, k1 is the number of changed quartet topologies so far).
2. Determine need-to-be-changed quartet topologies (if there are 3(k-k1) distinct local conflicts involving q, then q must be changed).
3. Determine need-to-be-fixed quartet topologies (find optimal edges; all the quartet topologies consistent with the optimal edges are fixed).
4. Use the quartet inference rules on the quartet topologies generated in step 3 (e.g. ab|cd and ab|ce -> ab|de).
30Outline of Algorithm-Contd
5. Build a local conflict list and partition it into two parts.
   IF there are need-to-be-changed quartet topologies
      Pick the need-to-be-changed quartet topology achieving the largest contribution to resolve
   ELSE
      Pick the way of resolving a conflict that achieves the largest contribution
31Outline
- Introduction
- Research methods
  - Answer set programming method based on ultrametric phylogeny
  - A lookahead Branch and Bound algorithm.
  - A polynomial time algorithm when the number of conflicting quartet topologies is O(n)
- Computational results
- Conclusions and future works
32Background
- MQC/MQI is NP-complete if Q is complete
- Known result: if the number of conflicting quartet topologies is less than [bound not preserved in the source] (n is the number of taxa), then MQC/MQI can be solved in polynomial time
- We extend the result to O(n).
33Lemma
Let E denote the set of conflicting quartet topologies in an optimal solution to the MQC/MQI problem. There is a taxon e such that the number of quartet topologies in E involving e is less than or equal to 4|E|/n.
- A quartet topology contains 4 taxa
- The quartet topologies in E contain 4|E| taxa in total, counted with multiplicity
- The input taxon set has size n, so some taxon appears in at most 4|E|/n of them.
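The counting argument behind the lemma can be written out as follows. The notation E_t, for the set of quartet topologies in E that involve taxon t, is introduced here only for this derivation.

```latex
\[
\sum_{t \in S} |E_t| = 4\,|E|
\quad\Longrightarrow\quad
\min_{t \in S} |E_t| \;\le\; \frac{1}{n} \sum_{t \in S} |E_t| \;=\; \frac{4\,|E|}{n}.
\]
\[
|E| \le c\,n \quad\Longrightarrow\quad \min_{t \in S} |E_t| \le 4c .
\]
```

The second line is the form used when the number of conflicting quartet topologies is at most cn: some taxon is involved in at most 4c of them.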
34Lemma
In the MQC/MQI problem, if no conflicting quartet topology involves taxon e, then the problem can be solved in O(n^4) time.
- Build a local conflict list involving e
- Of the 3 quartet topologies in a local conflict, at least 2 must contain e. Therefore they are not conflicting quartet topologies
- Then the remaining quartet topology, which does not contain e, must be changed to resolve the local conflict.
35Theorem
There is a polynomial time algorithm that solves the MQC/MQI problem when the number of conflicting quartet topologies in the given complete quartet topology set Q is at most cn, where c is a positive constant and n is the number of taxa.
- At least one taxon, e, is involved in at most 4c quartet topologies in E.
- We try every possible change for every possible combination of the 4c quartet topologies in E. This is still polynomial since 4c is a constant.
- The remaining quartet topologies in E do not involve e, and can be determined in O(n^4) time.
36Algorithm
- For (every taxon e ∈ S)
  - Construct the local conflict list involving taxon e
  - If (the size of the list is 30c(n - 4))
    - For (every k ∈ [0, 4c])
      - For (every combination of k topologies involving e)
        - For (every possible change of these k topologies)
          - Change (cn - k) topologies not involving e to resolve conflicts
          - Update the best solution if there is no conflict left
- If (the best solution is empty)
  - Set Q must contain more than cn conflicting quartet topologies, exit
- Else
  - Construct the phylogeny associated with the best solution
  - Return the best solution and its associated phylogeny.
37Outline
- Introduction
- Research methods
  - Answer set programming based on ultrametric phylogeny
  - A lookahead Branch and Bound algorithm.
  - A polynomial time algorithm when the number of conflicting quartet topologies is O(n)
  - A probabilistic model and three efficient algorithms to compute Ttrue with high probabilities.
- Computational results
- Conclusions and future works
38Background
Given a tree T on n leaves, there exists an internal node (separator) whose removal partitions the tree into connected components, each with at most n/2 leaves, and such a node can be found in O(n) time.
[Figure: a tree with its separator marked]
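One way to find such a node in linear time is to count the leaves below every node after rooting the tree at an arbitrary internal node. A minimal sketch, assuming an adjacency-dictionary tree; the example tree and all names are hypothetical.

```python
def find_separator(tree):
    """Return an internal node whose removal leaves components with <= n/2 leaves."""
    n = sum(1 for nbrs in tree.values() if len(nbrs) == 1)
    root = next(v for v, nbrs in tree.items() if len(nbrs) > 1)

    # Breadth-first traversal from the root; `order` grows while we iterate.
    parent, order = {root: None}, [root]
    for v in order:
        for w in tree[v]:
            if w not in parent:
                parent[w] = v
                order.append(w)

    # Number of leaves in the subtree below each node, accumulated bottom-up.
    count = {v: (1 if len(tree[v]) == 1 else 0) for v in tree}
    for v in reversed(order[1:]):
        count[parent[v]] += count[v]

    for v in order:
        if len(tree[v]) > 1:
            parts = [count[w] for w in tree[v] if parent[w] == v]
            parts.append(n - count[v])       # the component on the parent side
            if max(parts) <= n / 2:
                return v
    return None

# Hypothetical tree with cherries (a,b), (c,d), (e,f) joined at a center x.
tree = {
    "a": ["u"], "b": ["u"], "u": ["a", "b", "x"],
    "c": ["v"], "d": ["v"], "v": ["c", "d", "x"],
    "e": ["w"], "f": ["w"], "w": ["e", "f", "x"],
    "x": ["u", "v", "w"],
}
print(find_separator(tree))
```

Removing x from this tree leaves three components with two leaves each, which is within the bound n/2 = 3.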
39Probabilistic Model
- Given Ttrue, generate a complete quartet topology set QTtrue for Ttrue.
- For every quartet topology in QTtrue, with probability p/2, change its topology into each of the other two topologies (so every quartet topology has probability p of being a quartet error).
- Simulation results show that current quartet inference methods can achieve over 80% accuracy for inferred quartet topologies.
- We assume 0 < p < 1/3
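The error model is straightforward to simulate. The sketch below assumes topologies written as strings such as "ab|cd" with single-letter taxa; the function names and the example input are hypothetical.

```python
import random

def other_topologies(q):
    """The two alternative topologies on the same four taxa as q."""
    (a, b), (c, d) = q.split("|")
    return [f"{a}{c}|{b}{d}", f"{a}{d}|{b}{c}"]

def perturb(true_quartets, p, rng):
    """Replace each topology by each of the other two with probability p/2."""
    noisy = []
    for q in true_quartets:
        r = rng.random()
        first, second = other_topologies(q)
        if r < p / 2:
            noisy.append(first)
        elif r < p:
            noisy.append(second)
        else:
            noisy.append(q)          # kept with probability 1 - p
    return noisy

Q_true = ("ae|cd ab|cd ab|ce ab|cf ab|de ab|df ab|ef "
          "af|cd ac|ef ad|ef be|cd bf|cd bc|ef bd|ef cd|ef").split()
noisy = perturb(Q_true, 0.3, random.Random(1))
errors = sum(a != b for a, b in zip(noisy, Q_true))
print(errors, "quartet errors introduced")
```

The expected number of quartet errors is p times the number of quartet topologies. With p = 0 the set is returned unchanged.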
40Theorem
Given a quartet topology set Q with no quartet errors (p = 0), Ttrue can be constructed in O(n^2) time by querying at most (n-4) log(n-1) quartet topologies in Q.
- Start with a random quartet topology
- Iteratively insert a new taxon to grow the phylogeny.
41[Figure: a new taxon g is inserted into the phylogeny; quartet topologies shown: ae|cg, ag|cd]
42Theorem
When 0 < p < 1/3, we can reconstruct Ttrue in O(n^4 log n) time with a probability at least [bound not preserved in the source].
- Use a voting scheme to decide into which branch the new taxon should be inserted.
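43[Figure: voting on the branch into which the new taxon g is inserted; quartet topologies shown: ae|cg, ae|dg, be|cg, be|dg, ag|cf, bf|dg, ... and ag|cd, bg|cd, ec|gd, fg|cd]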
44Compatible 5-subset
A compatible complete quartet topology set on a taxon set of 5 taxa.
[Figure: phylogeny on leaves a, b, c, d, e]
ae|cd ab|cd ab|ce ab|de be|cd
45Theorem
When 0 < p < 1/3 and we can start with a compatible 5-subset, Ttrue can be reconstructed in O(n^5) time with a probability at least [bound and its parameters not preserved in the source].
46Outline
- Introduction
- Research methods
  - Answer set programming based on ultrametric phylogeny
  - A lookahead Branch and Bound algorithm.
  - A polynomial time algorithm when the number of conflicting quartet topologies is O(n)
  - A probabilistic model and three efficient algorithms to compute Ttrue with high probabilities.
- Computational results
- Conclusions and future works
47Running times of the exact algorithms for MQC
problem
DP: dynamic programming method by A. Ben-Dor et al.
GN: fixed parameter method by Gramm et al.
ASP: our answer set programming method
LBnB-Opt: our lookahead branch and bound method
48Running times on datasets with small quartet
errors
49Probability comparison among the proposed
algorithm (M-VOTE), the hypercleaning algorithm
(HC), the answer set programming method for the
MQC problem (ASP), and the theoretical success
probability of M-VOTE.
50Conclusions
- The answer set programming formulation gives a new perspective on the MQC problem.
- The proposed exact algorithms significantly outperform other exact algorithms.
- On general problem instances, the answer set programming method is the most efficient.
- If the number of quartet errors is small and the quartet topology set is complete, the lookahead branch and bound algorithm is the most efficient.
51Conclusions
- The polynomial time algorithm for solving a special MQC/MQI problem improves on the current result in this area.
- The probabilistic model and the proposed algorithms open up several research directions.
52Future Work
- Design a quartet-specific answer set programming solver
- Investigate other possible probability distributions and design efficient algorithms.
53Questions?