Title: Phylogenetic Networks with Constrained Recombination
1(No Transcript)
2Reconstructing Ancestral Recombination Graphs -
or Phylogenetic Networks with Recombination
Different parts of this work are joint with
Satish Eddhu, Charles Langley, Dean Hickerson,
Yun Song, Yufeng Wu.
3Reconstructing the Evolution of Binary
Bio-Sequences
- Perfect Phylogeny (tree) model
- Phylogenetic Networks (DAG) with recombination
- Phylogenetic Networks with disjoint cycles
Galled-Trees - Phylogenetic Networks with unconstrained cycles
Blobbed-Trees - Combinatorial Structure and Efficient Algorithms
- Efficiently Computed Lower and Upper bounds on
the number of recombinations needed
4Geneological or Phylogenetic Networks
- The major biological motivation comes from
genetics and attempts to reconstruct the history
of recombination in populations. - Also relates to phylogenetic-based haplotyping.
- Some of the algorithmic and mathematical results
have phylogenetic applications, for example in
hybrid speciation, lateral gene transfer.
5Why binary?
Single nucleotide polymorphisms are the key data
that we use. SNPs imply that the sequences are
binary, and also that the order of the sites
is fixed (on a chromosome). This is in contrast
to a set of taxonomic characters, which may be
binary, but where the given order is arbitrary.
6The Perfect Phylogeny Model for binary sequences
sites
12345
00000
Ancestral sequence
1
4
Site mutations on edges
3
00010
The tree derives the set M 10100 10000 01011 0101
0 00010
2
10100
5
10000
01010
01011
Extant sequences at the leaves
7The converse problem
Given a set of sequences M we want to find, if
possible, a perfect phylogeny that derives M.
Remember that each site can change state from 0
to 1 only once.
n will denote the number of sequences in M, and m
will denote the length of each sequence in M.
m
01101001 11100101 10101011
M
n
8When can a set of sequences be derived on a
perfect phylogeny?
- Classic NASC Arrange the sequences in a matrix.
Then (with no duplicate columns), the sequences
can be generated on a unique perfect phylogeny if
and only if no two columns (sites) contain all
four pairs - 0,0 and 0,1 and 1,0 and 1,1
This is the 4-Gamete Test
9A richer model
10100 10000 01011 01010 00010 10101 added
12345
00000
1
4
3
00010
2
10100
5
pair 4, 5 fails the three gamete-test. The sites
4, 5 conflict.
10000
01010
01011
Real sequence histories often involve
recombination.
10Sequence Recombination
01011
10100
S
P
5
Single crossover recombination
10101
A recombination of P and S at recombination point
5.
The first 4 sites come from P (Prefix) and the
sites from 5 onward come from S (Suffix).
11Network with Recombination
10100 10000 01011 01010 00010 10101 new
12345
00000
1
4
3
00010
2
10100
5
10000
P
01010
The previous tree with one recombination event
now derives all the sequences.
01011
5
S
10101
12Multiple Crossover Recombination
4-crossovers
2-crossovers gene conversion
13Elements of a Phylogenetic Network (single
crossover recombination)
- Directed acyclic graph.
- Integers from 1 to m written on the edges. Each
integer written only once. These represent
mutations. - A choice of ancestral sequence at the root.
- Every non-root node is labeled by a sequence
obtained from its parent(s) and any edge label on
the edge into it. - A node with two edges into it is a
recombination node, with a recombination point
r. One parent is P and one is S. - The network derives the sequences that label the
leaves.
14A Phylogenetic Network
00000
4
00010
a00010
3
1
10010
00100
5
00101
2
01100
S
b10010
4
S
P
01101
p
c00100
g00101
3
d10100
f01101
e01100
15Which Phylogenetic Networks are meaningful?
- Given M we want a phylogenetic network that
derives M, but which one?
A A perfect phylogeny (tree) if possible. As
little deviation from a tree, if a tree is not
possible. Use as little recombination or
gene-conversion as possible.
16Minimizing recombinations
- Any set M of sequences can be generated by a
phylogenetic network with enough recombinations,
and one mutation per site. This is not
interesting or useful. - However, the number of (observable)
recombinations is small in realistic sets of
sequences. Observable depends on n and m
relative to the number of recombinations. - Two algorithmic problems given a set of
sequences M, find a phylogenetic network
generating M, minimizing the number of
recombinations (Heins problem). Find a network
generating M that has some biologically-motivated
structural properties.
17Minimization is NP-hard
- The problem of finding a phylogenetic network
that creates a given set of sequences M, and
minimizes the number of recombinations, is
NP-hard. (Wang et al 2000) (Semple 2004) - Wang et al. explored the problem of finding a
phylogenetic network where the recombination
cycles are required to be node disjoint, if
possible. - They gave a sufficient but not a necessary
condition to recognize cases when this is
possible. O(nm n4) time.
18Recombination Cycles
- In a Phylogenetic Network, with a recombination
node x, if we trace two paths backwards from x,
then the paths will eventually meet. - The cycle specified by those two paths is called
a recombination cycle.
19Galled-Trees
- A recombination cycle in a phylogenetic network
is called a gall if it shares no node with any
other recombination cycle. - A phylogenetic network is called a galled-tree
if every recombination cycle is a gall.
20A galled-tree generating the sequences
generated by the prior network.
4
3
1
s
p
a 00010
3
c 00100
b 10010
d 10100
2
5
s
4
p
g 00101
e 01100
f 01101
21(No Transcript)
22Sales pitch for Galled-Trees
- Galled-trees represent a small deviation from
true trees. - There are sufficient applications where it is
plausible that a galled tree - exists that generates the sequences.
- Observable recombinations tend to be recent
block structure of human DNA recombination is
sparse, so the true history of observable - recombinations may be a galled-tree.
The number of recombinations is never more than
m/2. Moreover, when M can be derived on a
galled-tree, the number of recombinations used
is the minimum number over any phylogenetic
network, even if multiple cross-overs at a
recombination event are counted as a single
recombination. A galled-tree for M is almost
unique - implications for reconstructing
the correct history.
23Old (Aug. 2003) Results
- O(nm n3)-time algorithm to determine whether
or not M can be derived on a galled-tree with
all-0 ancestral sequence. - Proof that the galled-tree produced by the
algorithm is a nearly-unique solution. - Proof that the galled-tree (if one exists)
produced by the algorithm minimizes the number of
recombinations used, over all phylogenetic-network
s with all-0 ancestral sequence.
24New work
- We derive the galled-tree results in a more
- general setting that addresses unconstrained
recombination cycles and multiple - crossover recombination. This also solves the
- problem of finding the most tree-like
- network when a perfect phylogeny is not
- possible. In this algorithm, no ancestral
sequence is known in advance.
25 Blobbed-trees generalizing galled-trees
- In a phylogenetic network a maximal set of
intersecting cycles is called a blob. - Contracting each blob results in a directed,
rooted tree, otherwise one of the blobs was not
maximal. - So every phylogenetic network can be viewed as a
directed tree of blobs - a blobbed-tree. - The blobs are the non-tree-like parts of the
network.
26Every network is a tree of blobs. How do the
tree parts and the blobs relate?
How can we exploit this relationship?
Ugly tangled network inside the blob.
27Incompatible Sites
- A pair of sites (columns) of M that fail the
- 4-gametes test are said to be incompatible.
- A site that is not in such a pair is compatible.
281 2 3 4 5
Incompatibility Graph
a b c d e f g
0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 1 0
0 0 1 1 0 1 0 0 1 0 1
4
M
1
3
2
5
Two nodes are connected iff the pair of sites are
incompatible, i.e, fail the 4-gamete test.
THE MAIN TOOL We represent the pairwise
incompatibilities in a incompatibility graph.
29The connected components of G(M) are very
informative
- The number of non-trivial connected components is
a lower-bound on the number of recombinations
needed in any network (Bafna, Bansal Gusfield,
Hickerson). - When each blob is a single-cycle (galled-tree
case) all the incompatible sites in a blob must
come from a single connected component C, and
that blob must contain all the sites from C.
Compatible sites need not be inside any blob.
(Gusfield et al 2003-5)
30Simple Fact
- If sites two sites i and j are incompatible,
then the sites must be together on some
recombination cycle whose recombination point is
between the two sites i and j. - (This is a general fact for all phylogenetic
networks.)
Ex In the prior example, sites 1, 3 are
incompatible, as are 1, 4 as are 2, 5.
31A Phylogenetic Network
00000
4
00010
a00010
3
1
10010
00100
5
00101
2
01100
S
b10010
4
S
P
01101
p
c00100
g00101
3
d10100
f01101
e01100
32Simple Consequence of the simple fact
- All sites on the same (non-trivial) connected
component of the incompatibility graph - must be on the same blob in any blobbed-tree.
- Follows by transitivity.
- So we cant subdivide a blob into a tree-like
structure if it only contains sites from a single
connected component of the incompatibility graph.
33Key Result about Galls For galls, the converse
of the simple consequence is also true.
- Two sites that are in different (non-trivial)
connected - components cannot be placed on the same gall in
- any phylogenetic network for M.
- Hence, in a galled-tree T for M each gall
contains all and only the sites of one
(non-trivial) connected component of the
incompatibility graph. All compatible sites can
be put on edges outside of the galls. -
This is the key to the galled-tree solution.
34Incompatibility Graph
A galled-tree generating the sequences
generated by the prior network.
4
4
3
1
3
2
5
1
s
p
a 00010
2
c 00100
b 10010
d 10100
2
5
s
4
p
g 00101
e 01100
f 01101
35Motivated by the one-one correspondence between
galls and non-trivial connected components, we
ask To what extent does this one-one
correspondence hold in general blobbed-trees,
i.e. with no constraints on how recombination
cycles interweave?
36 The Decomposition Theorem (Recomb 2005)
- For any set of sequences M, there is a
blobbed-tree T(M) that derives M, where each
blob contains all and only the sites in one
non-trivial connected component of G(M). The
compatible sites can always be put on edges
outside of any blob.
A blobbed-tree with this structure is called
fully-decomposed.
37General Structure
- So, for any set of sequences M, there is a
phylogenetic network T(M) that is fully
decomposed. - Moreover, the tree part of T(M) is unique. And it
is easy to find the tree part.
38A fully-decomposed network for the sequences
generated by the prior network.
Incompatibility Graph
4
4
3
1
3
2
5
1
s
p
a 00010
2
c 00100
b 10010
d 10100
2
5
s
4
p
g 00101
e 01100
f 01101
39Moreover
- Since all sites from a single connected
component must be together on some blob in any
phylogenetic network, no network is more
decomposed than the fully decomposed network.
40Proof Ideas
- Let C be a connected component of G(M). Define
MC as the sequences in M restricted to the
sites in C.
411 2 3 4 5
a b c d e f g
0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 1 0
0 0 1 1 0 1 0 0 1 0 1
4
C2
C1
M
1
3
2
5
B1
B2
1 3 4
2 5
a b c d e f g
0 0 0 0 0 0 0 0 1 0 1 1 0 1
a 0 0 1 b 1 0 1 c 0 1 0 d 1 1 0 e 0 1 0 f 0 1
0 g 0 1 0
MC1
MC2
42Faux Proof
- Pick one site from each connected component C in
G(M) to represent C. No pair of those sites
are incompatible, so by the NASC for a perfect
phylogeny, there will be a perfect phylogeny T
for the sites. Expand each node to a network
generating the sequences in MC. - Incorrect, because the structure of T can be
wrong. We need to use information about all the
sites in each C.
43Now for each connected component C in G(M), call
each distinct sequence in MC a supercharacter,
and let W be the indicator matrix for the
supercharacters. So W indicates which rows of
M contain which particular supercharacters.
1 2 3 4 5 6 7 8
1 3 4
2 5
a b c d e f g
1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0
0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0
1 0 0 0 0 1
a b c d e f g
0 0 0 0 0 0 0 0 1 0 1 1 0 1
5 5 5 5 6 7 8
a 0 0 1 b 1 0 1 c 0 1 0 d 1 1 0 e 0 1 0 f 0 1 0 g
01 0
1 2 3 4 3 3 3
W
MC1
MC2
44Proof Ideas
- Lemma No pair of supercharacters are
incompatible. - So by the NASC for a Perfect Phylogeny, there is
a unique perfect phylogeny T for W.
45Proof Ideas
- For each connected component C of G(M), all
supercharacters that originate from C label edges
in T that are incident with one single node vC
in T. So, if we expand each node vC to be a
network that generates the supercharacters from C
(the sequences in MC), and connect each network
correctly to the edges in T, the resulting
network is a fully-decomposed blobbed-tree that
generates M.
46- Algorithmically, T is easy to find and is the
tree resulting from contracting each blob in the
fully-decomposed blobbed-tree T(M) for M. T can
be constructed from M in O(nm2) time.
47Broader Biological Applications
- Our major interest is in recombination, but
the proof of the decomposition theorem does not
explicitly use recombination. So it holds for
whatever biological phenomena caused the
incompatibility of sites. For example, back or
recurrent mutation, gene-conversion, lateral gene
transfer etc.
48The main open question
- The Decomposition Theorem says there is always
a fully-decomposed blobbed-tree for any M, but - Is there always a fully-decomposed
blobbed-tree that minimizes the number of
recombinations over all possible phylogenetic
networks for M?
49We conjecture the answer is yes.
- If true, then we can decompose the problem of
minimizing the total number of recombinations
into separate problems on each connected
component, and also find lower bounds on the
needed number of recombinations, in each
component separately, adding those bounds to get
a valid overall lower bound for M. This
computation of lower bounds is known to be
correct for certain lower bounds (Bafna, Bansal
2004).
50Progress on Proving the Conjecture
- Definition If N is a phylogenetic network for M,
and a node v in N is labeled with a sequence in
M, then v is said to be visible in N. - Theorem If every node in N is visible, then
there is a fully-decomposed network for M where
the number of recombinations is at most the
number in N. - Corollary The conjecture is true for any M where
the Haplotype or History lower bounds (S. Myers)
on the number of recombinations needed to
generate M, is tight.
51Optimality of Galled-Trees
Theorem (G,H,B,B) The minimum number of
recombination nodes in any phylogenetic network
for M is at least the number of non-trivial
connected components of the incompatibility
graph. Hence, when the sequences on each blob
on T(M) can be generated with a single
recombination node, the blobbed-tree
minimizes the number of recombination nodes over
all phylogenetic networks and all choices of
ancestral sequence. This solves the root-unknown
galled-tree problem in polynomial time. Code is
on the web.
52The number of arrangements on agall (all-0
ancestral sequence)
By analysing the algorithm to layout the sites
on a gall (not discussed here), one can prove
that the number of arrangements of any gall is at
most three, and this happens only if the gall has
two sites. If the gall has more than two sites,
then the number of arrangements is at most
two. If the gall has four or more sites, with at
least two sites on each side of the recombination
point (not the side of the gall) then the
arrangement is forced and unique.
53Computing close bounds on the minimum number of
recombinationsDan Gusfield UCD
- Y. Song, Y. F. Wu, D. Gusfield (ISMB2005)
-
- D. Gusfield, D. Hickerson (Dis. Appl. Math 2005)
- D. Gusfield, V. Bansal (Recomb 2005)
54The grandfather of all lower bounds - HK 1985
- Arrange the nodes of the incompatibility graph on
the line in order that the sites appear in the
sequence. This bound requires a linear order. - The HK bound is the minimum number of vertical
lines needed to cut every edge in the
incompatibility graph. Weak bound, but widely
used - not only to bound the number of
recombinations, but also to suggest their
locations.
55Justification for HK
- If two sites are incompatible, there must have
been some recombination where the crossover point
is between the two sites.
56HK Lower Bound
1 2 3 4 5
57HK Lower Bound 1
1 2 3 4 5
58More general view of HK
Given a set of intervals on the line, and for
each interval I, a number N(I), define the
composite problem Find the minimum number of
vertical lines so that every interval I
intersects at least N(I) of the vertical
lines. In HK, each incompatibility defines an
interval I where N(I) 1. The composite problem
is easy to solve by a left-to-right
myopic placement of vertical lines Sort the
intervals by right end-point Process the
intervals left to right in that order when the
right endpoint of an interval I is reached, place
there (if needed) additional vertical so that
N(I) lines intersect I.
59If each N(I) is a local lower bound on the
number of recombinations needed in interval I,
then the solution to the composite problem is a
valid lower bound for the full sequences. The
resulting bound is called the composite bound
given the local bounds.
This general approach is called the Composite
Method (Simon Myers 2002).
60Haplotype Bound (Simon Myers)
- Rh Number of distinct sequences (rows) - Number
of distinct sites (columns) -1 (folklore) - Before computing Rh, remove any site that is
compatible with all other sites. A valid lower
bound results - generally increases the bound. - Generally really bad bound, often negative, when
used on large intervals, but Very Good when used
as local bounds in the Composite Interval Method,
and other methods. -
61Composite Interval Method using RH bounds
- Compute Rh separately for each of the C(m,2)
intervals of the m sites let N(I) Rh(I) be the
local lower bound for interval I. Then compute
the composite bound using these local bounds. - Polynomial time and gives bounds that often
double HK in our simulations. -
62Composite Subset Method (Myers)
- Let S be subset of sites, and Rh(S) be the
haplotype bound for subset S. If the leftmost
site in S is L and the rightmost site in S is R,
then use Rh(S) as a local bound N(I) for interval
I S,L. - Compute Rh(S) on many subsets, and then solve the
composite problem to find a composite bound.
63RecMin (Myers)
- World Champion Lower Bound Program (until now).
Often RecMin gives a bound three times as large
as HK. - Computes Rh on subsets of sites, but limits the
size and the span of the subsets. Default
parameters are s 6, w 15 (s size, w
span). - Generally, impractical to set s w m, so
generally one doesnt know if increasing the
parameters would increase the bound. (example
Myers bound of 70 on the LPL data).
64Optimal RecMin Bound (ORB)
- The Optimal RecMin Bound is the lower bound that
RecMin would produce if both parameters were set
to their maximum values (s w m). - In general, RecMin cannot compute (in practical
time) the ORB. - Practical computation of the ORB is our first
contribution.
65Computing the ORB
- Gross Idea For each interval I, use ILP to find
a subset of sites S that maximizes Rh(S) over
all subsets in interval I. Call the result
Opt(I). - Set N(I) Opt(I), and compute the composite
bound using those local bounds. - The composite bound is the ORB. -- the result
one would get by using all 2m subsets in RecMin,
with s w m.
66We have moved from doing an exponential number of
simple computations (computing Rh for each
subset), to solving a quadratic number of
(possibly expensive) ILPs. Is this a good
trade-off in practice? Our experience - very
much so!
67How to compute Opt(I) by ILP
Create one variable Xi for each row i one
variable Yj for each column j in interval I. All
variables are 0/1 variables. Define S(i,i) as
the set of columns where rows i and i have
different values. Each column in S(i,i) is a
witness that rows i and i differ. For each pair
of rows (i,i), create the constraint Xi Xi
lt 1 ? Yj j in S(i,i) Objective Function
Maximize ? Xi - ? Yj -1
68Alternate way to compute Opt(I) by ILP
First remove any duplicate rows. Let N be the
number of rows remaining.
Create one variable Yj for each column j in
interval I. All variables are 0/1
variables. S(i,i) as before. For each pair of
rows (i,i) create the constraint 1 lt ? Yj
j in S(i,i) Objective Function Maximize N -
?(Yj) -1 Finds the smallest number of columns to
distinguish the rows.
69Two critical tricks
- Use the second ILP formulation to compute Opt(I).
It solves - much faster than the first (why?)
- 2) Reduce the number of intervals where Opt(I) is
computed
I
If the solution to Opt(I) uses columns that only
span interval L, then there is no need to find
Opt(I) in any interval I containing L. Each
ILP computation directly spawns at most 4 other
ILPs. Apply this idea recursively.
L
70- With the second trick we need to find Opt(I)
for only 0.5 - 35 of all the C(m,2) intervals
(empirical result). Surprisingly fast in
practice (with either the GNU solver or CPLEX).
71Bounds Higher Than the Optimal RecMin Bound
- Often the ORB underestimates the true minimum,
e.g. Kreitmans ADH data 6 v. 7 - How to derive sharper bound? Idea In the
composite method, check if each local bound N(I)
Opt(I) is tight, and if not, increase N(I) by
one. - Small increases in local bounds can increase the
composite bound.
72Bounds Sharper Than Optimal RecMin Bound
- A set of sequences M is called self-derivable if
there is a network generating M where the
sequence at every node (leaf and intermediate)
is in M. - Observation The haplotype bound for a set of
sequences is tight if and only if the sequences
are self-derivable. - So for each interval I where Opt(I) is computed,
we check self-derivability of the sequences
induced by columns S, where S are the columns
specified by Opt(I). Increase N(I) by 1 if the
sequences are not self-derivable.
73Algorithm for Self-Derivability
- Solution is easy when there are no mutations
--only recombinations are allowed and one initial
pair of sequences is chosen as reached. - Two reached sequences S1 and S2 can reach a third
sequence S3, if S3 can be created by recombining
S1 and S2. - Do BFS to see if all sequences can be reached by
successive application of the reach operation. - Clearly polynomial time and can be optimized with
suffix-trees etc. (Kececiouglu, Gusfield)
74 Self-derivability Test with mutations allowed
- For each site i, construct set MUT(i) of sequence
pairs (S1, S2) in M where S1 and S2 differ at
only site. - Try each sequence in M as root (which is the only
reached sequence initially). - For each root, enumerate all ways of choosing
exactly one ordered pair of sequences (Sp, Sq)
from each MUT(i). Sp is allowed to reach Sq. - Run the prior self-derivability algorithm with
these new permitted reach relations, to test if
all sequences in M can be reached. If so, then M
is self-derivable, otherwise it is not.
75Checking if N(I) should be increased by two
- If the set of sequences is not self-derivable,
- we can test if adding one new sequence makes it
self-derivable. - the number of candidate sequences is polynomial
and for each one added to M we check
self-derivability. N(I) should be increased by
two if none of these sets of sequences is
self-derivable.
76Self-derivability Test Algorithm
- Number of choices n x ?site i MUT(i)
O(nm1), but usually much less - To get even better bounds, allow, say 1, new
sequence to be added into M, and test
self-derivability. Can increase bound by 2. - This new sequence must be either recombinant of
two existing sequences, or a mutant of one
existing sequence. Thus their number is
polynomial size.
77Program HapBound
- HapBound S. Checks each Opt(I) subset for
self-derivability. Increase N(I) by 1 or 2 if
possible. This often beats ORB and is still
practical for most datasets. - HapBound M. Explicitly examine each interval
directly for self-derivability. Increase local
bound if possible. Derives lower bound of 7 for
Kreitmans ADH data in this mode.
78HapBound vs. RecMin on LPL from Clark et al.
Program Lower Bound Time
RecMin (default) 59 3s
RecMin s 25 w 25 75 7944s
RecMin s 48 w 48 No result 5 days
HapBound 75 31s
HapBound -S 78 1643s
2 Ghz PC
79Example where RecMin has difficulity in Finding
Optimal Bound on a 25 by 376 Data Matrix
Program Bound Time
RecMin default 36 1s
RecMin s 30 w 30 42 3m 25s
RecMin s 35 w 35 43 24m 2s
RecMin s 40 w 40 43 2h 9m 4s
RecMin s 45 w 45 43 10h 20m 59s
HapBound 44 2m 59s
HapBound -S 48 39m 30s
80Frequency of HapBound S Bound Sharper Than
Optimal RecMin Bound
ms param. Rho1 Rho5 Rho10 Rho20
Theta1 0.0 0.4 0.5 1.5
Theta5 0.7 4.0 10.4 27.0
Theta10 1.4 9.2 17.8 40.4
Theta20 1.4 10.5 27.8 45.4
For every ms parameters, 1000 data sets are used.
81Computing Upper Bounds
- The method is an adaptation of the history
lower bound (Myers). - A non-informative column is one with fewer than
two 0s or fewer than two 1s.
82Single History Computation
- Set W 0
- Collapse identical rows together, and remove
non-informative columns. Repeat until neither is
possible. - Let A be the data at this point. If A is empty,
stop, else remove some row r from A, and set W
W 1. Go to step 2). - Note that the choice of r is arbitrary in Step
3), so the resulting W can vary.
83History Lower Bound
- Theorem (Myers) Let W be the minimum W obtained
from all possible single history computations. - Then W is a valid lower bound on the number of
recombinations needed. - Naïve time theta(n!) (RecMin), but can be
- reduced to theta(2n) (Bafna, Bansal).
84Converting the History Lower Bound to an Upper
Bound
- Given a set of rows A and a single row r, define
w(r A - r) as the minimum number of
recombinations needed to create r from A-r (well
defined in our application). - w(r A-r) can be computed in linear time by a
greedy-type algorithm.
85Upper Bound Computation
- Set W 0
- Collapse identical rows together, and remove
non-informative columns. Repeat until neither is
possible. - Let A be the data at this point. If A is empty,
stop, else remove some row r from A, and set W
W W(r A-r). Go to step 2). - Note that the choice of r is arbitrary in Step
3), so the resulting W can vary.
This is the Single History Computation, with a
change in step 3).
86- Note, even a single execution of the upper bound
computation gives a valid upper bound, and a way
to construct a network. This is in contrast to
the History Bound which requires finding the
minimum W over all histories. - However, we also would like to find the lowest
upper bound possible with this approach. This can
be done in O(2n) time by DP. In practice, we can
use branch and bound to find the lowest upper
bound, but we have also found that branching on
the best local choice, or randomizing gives good
results.
87Branch and Bound
- (Branching) In Step 3) choose r to minimize
- w(r A-r) L(A-r), where L(A-r) is some
fast lower bound on the number of recombinations
needed for the set A-r. Even HK is good for this
purpose. - (Bounding) Let C be the min for an full solution
found so far If W L(A) gt C, then backtrack.
88Kreitmans 1983 ADH Data
- 11 sequences, 43 segregating sites
- Both HapBound and SHRUB took only a fraction of a
second to analyze this data. - Both produced 7 for the number of detected
recombination events - Therefore, independently of all other
methods, our lower and upper bound methods
together imply that 7 is the minimum number of
recombination events.
89A minimal ARG for Kreitmans data
SHRUB produces code that can be input to an open
source program to display the constructed ARG
90The Human LPL Data (Nickerson et al. 1998)
(88 Sequences, 88 sites)
Our new lower and upper bounds
Optimal RecMin Bounds
(We ignored insertion/deletion, unphased sites,
and sites with missing data.)
91Study on simulated data Exact-Match frequency
for varying parameters
- ? Scaled mutation rate
- ?? Scaled recombination rate
- n Number of sequences
Used Hudsons MS to generate1000 simulated
datasets for each pair of ??and ??
n 25
n 15
For ?????lt 5, our lower and upper bounds match
more than 95 of the time.
92Exact-Match frequency for varying number of
sequences
- Match frequency does not depend on n as much as
it does on ? or ?
93A closer look at the deviation
- ??? Average ratio of lower bound to upper bound
when they do not match
For n 25
The numerical difference between lower and upper
bounds grows as ??or ? increases, but their ratio
is more stable.
94Papers and Software
- wwwcsif.cs.ucdavis.edu/gusfield/