Phylogenetic Networks with Constrained Recombination - PowerPoint PPT Presentation

1 / 93
About This Presentation
Title:

Phylogenetic Networks with Constrained Recombination

Description:

Title: Phylogenetic Networks with Constrained Recombination Author: Dan Gusfield Last modified by: Dan Gusfield Created Date: 3/4/2003 6:04:43 AM Document ... – PowerPoint PPT presentation

Number of Views:134
Avg rating:3.0/5.0
Slides: 94
Provided by: DanG194
Category:

less

Transcript and Presenter's Notes

Title: Phylogenetic Networks with Constrained Recombination


1
(No Transcript)
2
Reconstructing Ancestral Recombination Graphs -
or Phylogenetic Networks with Recombination
  • Dan Gusfield
  • UC Davis

Different parts of this work are joint with
Satish Eddhu, Charles Langley, Dean Hickerson,
Yun Song, Yufeng Wu.
3
Reconstructing the Evolution of Binary
Bio-Sequences
  • Perfect Phylogeny (tree) model
  • Phylogenetic Networks (DAG) with recombination
  • Phylogenetic Networks with disjoint cycles
    Galled-Trees
  • Phylogenetic Networks with unconstrained cycles
    Blobbed-Trees
  • Combinatorial Structure and Efficient Algorithms
  • Efficiently Computed Lower and Upper bounds on
    the number of recombinations needed

4
Geneological or Phylogenetic Networks
  • The major biological motivation comes from
    genetics and attempts to reconstruct the history
    of recombination in populations.
  • Also relates to phylogenetic-based haplotyping.
  • Some of the algorithmic and mathematical results
    have phylogenetic applications, for example in
    hybrid speciation, lateral gene transfer.

5
Why binary?
Single nucleotide polymorphisms are the key data
that we use. SNPs imply that the sequences are
binary, and also that the order of the sites
is fixed (on a chromosome). This is in contrast
to a set of taxonomic characters, which may be
binary, but where the given order is arbitrary.
6
The Perfect Phylogeny Model for binary sequences

sites
12345
00000
Ancestral sequence
1
4
Site mutations on edges
3
00010
The tree derives the set M 10100 10000 01011 0101
0 00010
2
10100
5
10000
01010
01011
Extant sequences at the leaves
7
The converse problem
Given a set of sequences M we want to find, if
possible, a perfect phylogeny that derives M.
Remember that each site can change state from 0
to 1 only once.
n will denote the number of sequences in M, and m
will denote the length of each sequence in M.
m
01101001 11100101 10101011
M
n
8
When can a set of sequences be derived on a
perfect phylogeny?
  • Classic NASC Arrange the sequences in a matrix.
    Then (with no duplicate columns), the sequences
    can be generated on a unique perfect phylogeny if
    and only if no two columns (sites) contain all
    four pairs
  • 0,0 and 0,1 and 1,0 and 1,1

This is the 4-Gamete Test
9
A richer model

10100 10000 01011 01010 00010 10101 added
12345
00000
1
4
3
00010
2
10100
5
pair 4, 5 fails the three gamete-test. The sites
4, 5 conflict.
10000
01010
01011
Real sequence histories often involve
recombination.
10
Sequence Recombination
01011
10100
S
P
5
Single crossover recombination
10101
A recombination of P and S at recombination point
5.
The first 4 sites come from P (Prefix) and the
sites from 5 onward come from S (Suffix).
11
Network with Recombination

10100 10000 01011 01010 00010 10101 new
12345
00000
1
4
3
00010
2
10100
5
10000
P
01010
The previous tree with one recombination event
now derives all the sequences.
01011
5
S
10101
12
Multiple Crossover Recombination
4-crossovers
2-crossovers gene conversion
13
Elements of a Phylogenetic Network (single
crossover recombination)
  • Directed acyclic graph.
  • Integers from 1 to m written on the edges. Each
    integer written only once. These represent
    mutations.
  • A choice of ancestral sequence at the root.
  • Every non-root node is labeled by a sequence
    obtained from its parent(s) and any edge label on
    the edge into it.
  • A node with two edges into it is a
    recombination node, with a recombination point
    r. One parent is P and one is S.
  • The network derives the sequences that label the
    leaves.

14
A Phylogenetic Network
00000
4
00010
a00010
3
1
10010
00100
5
00101
2
01100
S
b10010
4
S
P
01101
p
c00100
g00101
3
d10100
f01101
e01100
15
Which Phylogenetic Networks are meaningful?
  • Given M we want a phylogenetic network that
    derives M, but which one?

A A perfect phylogeny (tree) if possible. As
little deviation from a tree, if a tree is not
possible. Use as little recombination or
gene-conversion as possible.
16
Minimizing recombinations
  • Any set M of sequences can be generated by a
    phylogenetic network with enough recombinations,
    and one mutation per site. This is not
    interesting or useful.
  • However, the number of (observable)
    recombinations is small in realistic sets of
    sequences. Observable depends on n and m
    relative to the number of recombinations.
  • Two algorithmic problems given a set of
    sequences M, find a phylogenetic network
    generating M, minimizing the number of
    recombinations (Heins problem). Find a network
    generating M that has some biologically-motivated
    structural properties.

17
Minimization is NP-hard
  • The problem of finding a phylogenetic network
    that creates a given set of sequences M, and
    minimizes the number of recombinations, is
    NP-hard. (Wang et al 2000) (Semple 2004)
  • Wang et al. explored the problem of finding a
    phylogenetic network where the recombination
    cycles are required to be node disjoint, if
    possible.
  • They gave a sufficient but not a necessary
    condition to recognize cases when this is
    possible. O(nm n4) time.

18
Recombination Cycles
  • In a Phylogenetic Network, with a recombination
    node x, if we trace two paths backwards from x,
    then the paths will eventually meet.
  • The cycle specified by those two paths is called
    a recombination cycle.

19
Galled-Trees
  • A recombination cycle in a phylogenetic network
    is called a gall if it shares no node with any
    other recombination cycle.
  • A phylogenetic network is called a galled-tree
    if every recombination cycle is a gall.

20
A galled-tree generating the sequences
generated by the prior network.
4
3
1
s
p
a 00010
3
c 00100
b 10010
d 10100
2
5
s
4
p
g 00101
e 01100
f 01101
21
(No Transcript)
22
Sales pitch for Galled-Trees
  • Galled-trees represent a small deviation from
    true trees.
  • There are sufficient applications where it is
    plausible that a galled tree
  • exists that generates the sequences.
  • Observable recombinations tend to be recent
    block structure of human DNA recombination is
    sparse, so the true history of observable
  • recombinations may be a galled-tree.

The number of recombinations is never more than
m/2. Moreover, when M can be derived on a
galled-tree, the number of recombinations used
is the minimum number over any phylogenetic
network, even if multiple cross-overs at a
recombination event are counted as a single
recombination. A galled-tree for M is almost
unique - implications for reconstructing
the correct history.
23
Old (Aug. 2003) Results
  • O(nm n3)-time algorithm to determine whether
    or not M can be derived on a galled-tree with
    all-0 ancestral sequence.
  • Proof that the galled-tree produced by the
    algorithm is a nearly-unique solution.
  • Proof that the galled-tree (if one exists)
    produced by the algorithm minimizes the number of
    recombinations used, over all phylogenetic-network
    s with all-0 ancestral sequence.

24
New work
  • We derive the galled-tree results in a more
  • general setting that addresses unconstrained
    recombination cycles and multiple
  • crossover recombination. This also solves the
  • problem of finding the most tree-like
  • network when a perfect phylogeny is not
  • possible. In this algorithm, no ancestral
    sequence is known in advance.

25
Blobbed-trees generalizing galled-trees
  • In a phylogenetic network a maximal set of
    intersecting cycles is called a blob.
  • Contracting each blob results in a directed,
    rooted tree, otherwise one of the blobs was not
    maximal.
  • So every phylogenetic network can be viewed as a
    directed tree of blobs - a blobbed-tree.
  • The blobs are the non-tree-like parts of the
    network.

26
Every network is a tree of blobs. How do the
tree parts and the blobs relate?
How can we exploit this relationship?
Ugly tangled network inside the blob.
27
Incompatible Sites
  • A pair of sites (columns) of M that fail the
  • 4-gametes test are said to be incompatible.
  • A site that is not in such a pair is compatible.

28
1 2 3 4 5
Incompatibility Graph
a b c d e f g
0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 1 0
0 0 1 1 0 1 0 0 1 0 1
4
M
1
3
2
5
Two nodes are connected iff the pair of sites are
incompatible, i.e, fail the 4-gamete test.
THE MAIN TOOL We represent the pairwise
incompatibilities in a incompatibility graph.
29
The connected components of G(M) are very
informative
  • The number of non-trivial connected components is
    a lower-bound on the number of recombinations
    needed in any network (Bafna, Bansal Gusfield,
    Hickerson).
  • When each blob is a single-cycle (galled-tree
    case) all the incompatible sites in a blob must
    come from a single connected component C, and
    that blob must contain all the sites from C.
    Compatible sites need not be inside any blob.
    (Gusfield et al 2003-5)

30
Simple Fact
  • If sites two sites i and j are incompatible,
    then the sites must be together on some
    recombination cycle whose recombination point is
    between the two sites i and j.
  • (This is a general fact for all phylogenetic
    networks.)

Ex In the prior example, sites 1, 3 are
incompatible, as are 1, 4 as are 2, 5.
31
A Phylogenetic Network
00000
4
00010
a00010
3
1
10010
00100
5
00101
2
01100
S
b10010
4
S
P
01101
p
c00100
g00101
3
d10100
f01101
e01100
32
Simple Consequence of the simple fact
  • All sites on the same (non-trivial) connected
    component of the incompatibility graph
  • must be on the same blob in any blobbed-tree.
  • Follows by transitivity.
  • So we cant subdivide a blob into a tree-like
    structure if it only contains sites from a single
    connected component of the incompatibility graph.

33
Key Result about Galls For galls, the converse
of the simple consequence is also true.
  • Two sites that are in different (non-trivial)
    connected
  • components cannot be placed on the same gall in
  • any phylogenetic network for M.
  • Hence, in a galled-tree T for M each gall
    contains all and only the sites of one
    (non-trivial) connected component of the
    incompatibility graph. All compatible sites can
    be put on edges outside of the galls.

This is the key to the galled-tree solution.
34
Incompatibility Graph
A galled-tree generating the sequences
generated by the prior network.
4
4
3
1
3
2
5
1
s
p
a 00010
2
c 00100
b 10010
d 10100
2
5
s
4
p
g 00101
e 01100
f 01101
35
Motivated by the one-one correspondence between
galls and non-trivial connected components, we
ask To what extent does this one-one
correspondence hold in general blobbed-trees,
i.e. with no constraints on how recombination
cycles interweave?
36
The Decomposition Theorem (Recomb 2005)
  • For any set of sequences M, there is a
    blobbed-tree T(M) that derives M, where each
    blob contains all and only the sites in one
    non-trivial connected component of G(M). The
    compatible sites can always be put on edges
    outside of any blob.

A blobbed-tree with this structure is called
fully-decomposed.
37
General Structure
  • So, for any set of sequences M, there is a
    phylogenetic network T(M) that is fully
    decomposed.
  • Moreover, the tree part of T(M) is unique. And it
    is easy to find the tree part.

38
A fully-decomposed network for the sequences
generated by the prior network.
Incompatibility Graph
4
4
3
1
3
2
5
1
s
p
a 00010
2
c 00100
b 10010
d 10100
2
5
s
4
p
g 00101
e 01100
f 01101
39
Moreover
  • Since all sites from a single connected
    component must be together on some blob in any
    phylogenetic network, no network is more
    decomposed than the fully decomposed network.

40
Proof Ideas
  • Let C be a connected component of G(M). Define
    MC as the sequences in M restricted to the
    sites in C.

41
1 2 3 4 5
a b c d e f g
0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 1 0
0 0 1 1 0 1 0 0 1 0 1
4
C2
C1
M
1
3
2
5
B1
B2
1 3 4
2 5
a b c d e f g
0 0 0 0 0 0 0 0 1 0 1 1 0 1
a 0 0 1 b 1 0 1 c 0 1 0 d 1 1 0 e 0 1 0 f 0 1
0 g 0 1 0
MC1
MC2
42
Faux Proof
  • Pick one site from each connected component C in
    G(M) to represent C. No pair of those sites
    are incompatible, so by the NASC for a perfect
    phylogeny, there will be a perfect phylogeny T
    for the sites. Expand each node to a network
    generating the sequences in MC.
  • Incorrect, because the structure of T can be
    wrong. We need to use information about all the
    sites in each C.

43
Now for each connected component C in G(M), call
each distinct sequence in MC a supercharacter,
and let W be the indicator matrix for the
supercharacters. So W indicates which rows of
M contain which particular supercharacters.
1 2 3 4 5 6 7 8
1 3 4
2 5
a b c d e f g
1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0
0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0
1 0 0 0 0 1
a b c d e f g
0 0 0 0 0 0 0 0 1 0 1 1 0 1
5 5 5 5 6 7 8
a 0 0 1 b 1 0 1 c 0 1 0 d 1 1 0 e 0 1 0 f 0 1 0 g
01 0
1 2 3 4 3 3 3
W
MC1
MC2
44
Proof Ideas
  • Lemma No pair of supercharacters are
    incompatible.
  • So by the NASC for a Perfect Phylogeny, there is
    a unique perfect phylogeny T for W.

45
Proof Ideas
  • For each connected component C of G(M), all
    supercharacters that originate from C label edges
    in T that are incident with one single node vC
    in T. So, if we expand each node vC to be a
    network that generates the supercharacters from C
    (the sequences in MC), and connect each network
    correctly to the edges in T, the resulting
    network is a fully-decomposed blobbed-tree that
    generates M.

46
  • Algorithmically, T is easy to find and is the
    tree resulting from contracting each blob in the
    fully-decomposed blobbed-tree T(M) for M. T can
    be constructed from M in O(nm2) time.

47
Broader Biological Applications
  • Our major interest is in recombination, but
    the proof of the decomposition theorem does not
    explicitly use recombination. So it holds for
    whatever biological phenomena caused the
    incompatibility of sites. For example, back or
    recurrent mutation, gene-conversion, lateral gene
    transfer etc.

48
The main open question
  • The Decomposition Theorem says there is always
    a fully-decomposed blobbed-tree for any M, but
  • Is there always a fully-decomposed
    blobbed-tree that minimizes the number of
    recombinations over all possible phylogenetic
    networks for M?

49
We conjecture the answer is yes.
  • If true, then we can decompose the problem of
    minimizing the total number of recombinations
    into separate problems on each connected
    component, and also find lower bounds on the
    needed number of recombinations, in each
    component separately, adding those bounds to get
    a valid overall lower bound for M. This
    computation of lower bounds is known to be
    correct for certain lower bounds (Bafna, Bansal
    2004).

50
Progress on Proving the Conjecture
  • Definition If N is a phylogenetic network for M,
    and a node v in N is labeled with a sequence in
    M, then v is said to be visible in N.
  • Theorem If every node in N is visible, then
    there is a fully-decomposed network for M where
    the number of recombinations is at most the
    number in N.
  • Corollary The conjecture is true for any M where
    the Haplotype or History lower bounds (S. Myers)
    on the number of recombinations needed to
    generate M, is tight.

51
Optimality of Galled-Trees
Theorem (G,H,B,B) The minimum number of
recombination nodes in any phylogenetic network
for M is at least the number of non-trivial
connected components of the incompatibility
graph. Hence, when the sequences on each blob
on T(M) can be generated with a single
recombination node, the blobbed-tree
minimizes the number of recombination nodes over
all phylogenetic networks and all choices of
ancestral sequence. This solves the root-unknown
galled-tree problem in polynomial time. Code is
on the web.
52
The number of arrangements on agall (all-0
ancestral sequence)
By analysing the algorithm to layout the sites
on a gall (not discussed here), one can prove
that the number of arrangements of any gall is at
most three, and this happens only if the gall has
two sites. If the gall has more than two sites,
then the number of arrangements is at most
two. If the gall has four or more sites, with at
least two sites on each side of the recombination
point (not the side of the gall) then the
arrangement is forced and unique.
53
Computing close bounds on the minimum number of
recombinationsDan Gusfield UCD
  • Y. Song, Y. F. Wu, D. Gusfield (ISMB2005)
  • D. Gusfield, D. Hickerson (Dis. Appl. Math 2005)
  • D. Gusfield, V. Bansal (Recomb 2005)

54
The grandfather of all lower bounds - HK 1985
  • Arrange the nodes of the incompatibility graph on
    the line in order that the sites appear in the
    sequence. This bound requires a linear order.
  • The HK bound is the minimum number of vertical
    lines needed to cut every edge in the
    incompatibility graph. Weak bound, but widely
    used - not only to bound the number of
    recombinations, but also to suggest their
    locations.

55
Justification for HK
  • If two sites are incompatible, there must have
    been some recombination where the crossover point
    is between the two sites.

56
HK Lower Bound
1 2 3 4 5
57
HK Lower Bound 1
1 2 3 4 5
58
More general view of HK
Given a set of intervals on the line, and for
each interval I, a number N(I), define the
composite problem Find the minimum number of
vertical lines so that every interval I
intersects at least N(I) of the vertical
lines. In HK, each incompatibility defines an
interval I where N(I) 1. The composite problem
is easy to solve by a left-to-right
myopic placement of vertical lines Sort the
intervals by right end-point Process the
intervals left to right in that order when the
right endpoint of an interval I is reached, place
there (if needed) additional vertical so that
N(I) lines intersect I.
59
If each N(I) is a local lower bound on the
number of recombinations needed in interval I,
then the solution to the composite problem is a
valid lower bound for the full sequences. The
resulting bound is called the composite bound
given the local bounds.
This general approach is called the Composite
Method (Simon Myers 2002).
60
Haplotype Bound (Simon Myers)
  • Rh Number of distinct sequences (rows) - Number
    of distinct sites (columns) -1 (folklore)
  • Before computing Rh, remove any site that is
    compatible with all other sites. A valid lower
    bound results - generally increases the bound.
  • Generally really bad bound, often negative, when
    used on large intervals, but Very Good when used
    as local bounds in the Composite Interval Method,
    and other methods.

61
Composite Interval Method using RH bounds
  • Compute Rh separately for each of the C(m,2)
    intervals of the m sites let N(I) Rh(I) be the
    local lower bound for interval I. Then compute
    the composite bound using these local bounds.
  • Polynomial time and gives bounds that often
    double HK in our simulations.

62
Composite Subset Method (Myers)
  • Let S be subset of sites, and Rh(S) be the
    haplotype bound for subset S. If the leftmost
    site in S is L and the rightmost site in S is R,
    then use Rh(S) as a local bound N(I) for interval
    I S,L.
  • Compute Rh(S) on many subsets, and then solve the
    composite problem to find a composite bound.

63
RecMin (Myers)
  • World Champion Lower Bound Program (until now).
    Often RecMin gives a bound three times as large
    as HK.
  • Computes Rh on subsets of sites, but limits the
    size and the span of the subsets. Default
    parameters are s 6, w 15 (s size, w
    span).
  • Generally, impractical to set s w m, so
    generally one doesnt know if increasing the
    parameters would increase the bound. (example
    Myers bound of 70 on the LPL data).

64
Optimal RecMin Bound (ORB)
  • The Optimal RecMin Bound is the lower bound that
    RecMin would produce if both parameters were set
    to their maximum values (s w m).
  • In general, RecMin cannot compute (in practical
    time) the ORB.
  • Practical computation of the ORB is our first
    contribution.

65
Computing the ORB
  • Gross Idea For each interval I, use ILP to find
    a subset of sites S that maximizes Rh(S) over
    all subsets in interval I. Call the result
    Opt(I).
  • Set N(I) Opt(I), and compute the composite
    bound using those local bounds.
  • The composite bound is the ORB. -- the result
    one would get by using all 2m subsets in RecMin,
    with s w m.

66
We have moved from doing an exponential number of
simple computations (computing Rh for each
subset), to solving a quadratic number of
(possibly expensive) ILPs. Is this a good
trade-off in practice? Our experience - very
much so!
67
How to compute Opt(I) by ILP
Create one variable Xi for each row i one
variable Yj for each column j in interval I. All
variables are 0/1 variables. Define S(i,i) as
the set of columns where rows i and i have
different values. Each column in S(i,i) is a
witness that rows i and i differ. For each pair
of rows (i,i), create the constraint Xi Xi
lt 1 ? Yj j in S(i,i) Objective Function
Maximize ? Xi - ? Yj -1
68
Alternate way to compute Opt(I) by ILP
First remove any duplicate rows. Let N be the
number of rows remaining.
Create one variable Yj for each column j in
interval I. All variables are 0/1
variables. S(i,i) as before. For each pair of
rows (i,i) create the constraint 1 lt ? Yj
j in S(i,i) Objective Function Maximize N -
?(Yj) -1 Finds the smallest number of columns to
distinguish the rows.
69
Two critical tricks
  • Use the second ILP formulation to compute Opt(I).
    It solves
  • much faster than the first (why?)
  • 2) Reduce the number of intervals where Opt(I) is
    computed

I
If the solution to Opt(I) uses columns that only
span interval L, then there is no need to find
Opt(I) in any interval I containing L. Each
ILP computation directly spawns at most 4 other
ILPs. Apply this idea recursively.
L
70
  • With the second trick we need to find Opt(I)
    for only 0.5 - 35 of all the C(m,2) intervals
    (empirical result). Surprisingly fast in
    practice (with either the GNU solver or CPLEX).

71
Bounds Higher Than the Optimal RecMin Bound
  • Often the ORB underestimates the true minimum,
    e.g. Kreitmans ADH data 6 v. 7
  • How to derive sharper bound? Idea In the
    composite method, check if each local bound N(I)
    Opt(I) is tight, and if not, increase N(I) by
    one.
  • Small increases in local bounds can increase the
    composite bound.

72
Bounds Sharper Than Optimal RecMin Bound
  • A set of sequences M is called self-derivable if
    there is a network generating M where the
    sequence at every node (leaf and intermediate)
    is in M.
  • Observation The haplotype bound for a set of
    sequences is tight if and only if the sequences
    are self-derivable.
  • So for each interval I where Opt(I) is computed,
    we check self-derivability of the sequences
    induced by columns S, where S are the columns
    specified by Opt(I). Increase N(I) by 1 if the
    sequences are not self-derivable.

73
Algorithm for Self-Derivability
  • Solution is easy when there are no mutations
    --only recombinations are allowed and one initial
    pair of sequences is chosen as reached.
  • Two reached sequences S1 and S2 can reach a third
    sequence S3, if S3 can be created by recombining
    S1 and S2.
  • Do BFS to see if all sequences can be reached by
    successive application of the reach operation.
  • Clearly polynomial time and can be optimized with
    suffix-trees etc. (Kececiouglu, Gusfield)

74
Self-derivability Test with mutations allowed
  • For each site i, construct set MUT(i) of sequence
    pairs (S1, S2) in M where S1 and S2 differ at
    only site.
  • Try each sequence in M as root (which is the only
    reached sequence initially).
  • For each root, enumerate all ways of choosing
    exactly one ordered pair of sequences (Sp, Sq)
    from each MUT(i). Sp is allowed to reach Sq.
  • Run the prior self-derivability algorithm with
    these new permitted reach relations, to test if
    all sequences in M can be reached. If so, then M
    is self-derivable, otherwise it is not.

75
Checking if N(I) should be increased by two
  • If the set of sequences is not self-derivable,
  • we can test if adding one new sequence makes it
    self-derivable.
  • the number of candidate sequences is polynomial
    and for each one added to M we check
    self-derivability. N(I) should be increased by
    two if none of these sets of sequences is
    self-derivable.

76
Self-derivability Test Algorithm
  • Number of choices n x ?site i MUT(i)
    O(nm1), but usually much less
  • To get even better bounds, allow, say 1, new
    sequence to be added into M, and test
    self-derivability. Can increase bound by 2.
  • This new sequence must be either recombinant of
    two existing sequences, or a mutant of one
    existing sequence. Thus their number is
    polynomial size.

77
Program HapBound
  • HapBound S. Checks each Opt(I) subset for
    self-derivability. Increase N(I) by 1 or 2 if
    possible. This often beats ORB and is still
    practical for most datasets.
  • HapBound M. Explicitly examine each interval
    directly for self-derivability. Increase local
    bound if possible. Derives lower bound of 7 for
    Kreitmans ADH data in this mode.

78
HapBound vs. RecMin on LPL from Clark et al.
Program Lower Bound Time
RecMin (default) 59 3s
RecMin s 25 w 25 75 7944s
RecMin s 48 w 48 No result 5 days
HapBound 75 31s
HapBound -S 78 1643s
2 Ghz PC
79
Example where RecMin has difficulity in Finding
Optimal Bound on a 25 by 376 Data Matrix
Program Bound Time
RecMin default 36 1s
RecMin s 30 w 30 42 3m 25s
RecMin s 35 w 35 43 24m 2s
RecMin s 40 w 40 43 2h 9m 4s
RecMin s 45 w 45 43 10h 20m 59s
HapBound 44 2m 59s
HapBound -S 48 39m 30s
80
Frequency of HapBound S Bound Sharper Than
Optimal RecMin Bound
ms param. Rho1 Rho5 Rho10 Rho20
Theta1 0.0 0.4 0.5 1.5
Theta5 0.7 4.0 10.4 27.0
Theta10 1.4 9.2 17.8 40.4
Theta20 1.4 10.5 27.8 45.4
For every ms parameters, 1000 data sets are used.
81
Computing Upper Bounds
  • The method is an adaptation of the history
    lower bound (Myers).
  • A non-informative column is one with fewer than
    two 0s or fewer than two 1s.

82
Single History Computation
  • Set W 0
  • Collapse identical rows together, and remove
    non-informative columns. Repeat until neither is
    possible.
  • Let A be the data at this point. If A is empty,
    stop, else remove some row r from A, and set W
    W 1. Go to step 2).
  • Note that the choice of r is arbitrary in Step
    3), so the resulting W can vary.

83
History Lower Bound
  • Theorem (Myers) Let W be the minimum W obtained
    from all possible single history computations.
  • Then W is a valid lower bound on the number of
    recombinations needed.
  • Naïve time theta(n!) (RecMin), but can be
  • reduced to theta(2n) (Bafna, Bansal).

84
Converting the History Lower Bound to an Upper
Bound
  • Given a set of rows A and a single row r, define
    w(r A - r) as the minimum number of
    recombinations needed to create r from A-r (well
    defined in our application).
  • w(r A-r) can be computed in linear time by a
    greedy-type algorithm.

85
Upper Bound Computation
  • Set W 0
  • Collapse identical rows together, and remove
    non-informative columns. Repeat until neither is
    possible.
  • Let A be the data at this point. If A is empty,
    stop, else remove some row r from A, and set W
    W W(r A-r). Go to step 2).
  • Note that the choice of r is arbitrary in Step
    3), so the resulting W can vary.

This is the Single History Computation, with a
change in step 3).
86
  • Note, even a single execution of the upper bound
    computation gives a valid upper bound, and a way
    to construct a network. This is in contrast to
    the History Bound which requires finding the
    minimum W over all histories.
  • However, we also would like to find the lowest
    upper bound possible with this approach. This can
    be done in O(2n) time by DP. In practice, we can
    use branch and bound to find the lowest upper
    bound, but we have also found that branching on
    the best local choice, or randomizing gives good
    results.

87
Branch and Bound
  • (Branching) In Step 3) choose r to minimize
  • w(r A-r) L(A-r), where L(A-r) is some
    fast lower bound on the number of recombinations
    needed for the set A-r. Even HK is good for this
    purpose.
  • (Bounding) Let C be the min for an full solution
    found so far If W L(A) gt C, then backtrack.

88
Kreitmans 1983 ADH Data
  • 11 sequences, 43 segregating sites
  • Both HapBound and SHRUB took only a fraction of a
    second to analyze this data.
  • Both produced 7 for the number of detected
    recombination events
  • Therefore, independently of all other
    methods, our lower and upper bound methods
    together imply that 7 is the minimum number of
    recombination events.

89
A minimal ARG for Kreitmans data
SHRUB produces code that can be input to an open
source program to display the constructed ARG
90
The Human LPL Data (Nickerson et al. 1998)
(88 Sequences, 88 sites)
Our new lower and upper bounds
Optimal RecMin Bounds
(We ignored insertion/deletion, unphased sites,
and sites with missing data.)
91
Study on simulated data Exact-Match frequency
for varying parameters
  • ? Scaled mutation rate
  • ?? Scaled recombination rate
  • n Number of sequences

Used Hudsons MS to generate1000 simulated
datasets for each pair of ??and ??
n 25
n 15
For ?????lt 5, our lower and upper bounds match
more than 95 of the time.
92
Exact-Match frequency for varying number of
sequences
  • Match frequency does not depend on n as much as
    it does on ? or ?

93
A closer look at the deviation
  • ??? Average ratio of lower bound to upper bound
    when they do not match

For n 25
The numerical difference between lower and upper
bounds grows as ??or ? increases, but their ratio
is more stable.
94
Papers and Software
  • wwwcsif.cs.ucdavis.edu/gusfield/
Write a Comment
User Comments (0)
About PowerShow.com