Title: Combinatorial Optimization Problems in Computational Biology
1 Combinatorial Optimization Problems in Computational Biology
- Ion Mandoiu
- CSE Department
2 What Is Computational Biology?
- [G. Lancia] Study of mathematical and computational problems of modeling biological processes in the cell, removing experimental errors from genomic data, interpreting the data and providing theories about their biological relations
- Multidisciplinary field at the intersection of computer science, biology, discrete mathematics, statistics, optimization, chemistry, physics, ...
3 Five Steps to Solving CB Problems
- Understand biological problem
- Represent biological data as mathematical objects (strings, sets, graphs, permutations, ...), map biological relations into mathematical relations, and formulate the biological question as an optimization or feasibility problem
- Study computational complexity: polynomial? NP-hard?
- Develop efficient algorithms
  - If in P, find fast and memory-efficient exact algorithms
  - If NP-hard, find practical exact algorithms and/or algorithms with provable approximation guarantees
- Validate algorithms on biological data
4 Outline
- Shortest Superstring
- Sequencing by Hybridization
- PCR Primer Selection
5 Shotgun Sequencing
6 Shortest Superstring
- Given set of strings $s_1, s_2, \dots, s_n$
- Find shortest string $s$ containing each $s_i$ as a substring
- Example
  - Set of strings: 000, 001, 010, 011, 100, 101, 110, 111
  - Superstring: 0001110100
- NP-hard [Maier, Storer 77]
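The example above can be checked mechanically. The following is a minimal sketch, not part of the original slides; the helper name `is_superstring` is hypothetical.

```python
def is_superstring(candidate: str, strings: list[str]) -> bool:
    """Return True if every string occurs in candidate as a contiguous substring."""
    return all(s in candidate for s in strings)


# The eight binary strings of length 3 and the superstring from the slide.
strings = ["000", "001", "010", "011", "100", "101", "110", "111"]
candidate = "0001110100"

# Eight distinct length-3 windows need at least 8 + 2 = 10 symbols,
# so a valid superstring of length 10 is also a shortest one.
print(is_superstring(candidate, strings), len(candidate))
```

Checking a candidate is easy; the NP-hardness lies in finding a shortest one.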
7 Greedy Merging Algorithm
- $S = \{s_1, s_2, \dots, s_n\}$
- While $|S| > 1$ do
  - Find $s, t$ in $S$ with longest overlap
  - $S \leftarrow (S \setminus \{s, t\}) \cup \{$s overlapped with t to maximum extent$\}$
- Output final string
- Approximation factor no better than 2
  - $s_1 = ab^k$, $s_2 = b^k c$, $s_3 = b^{k+1}$
  - Greedy output $ab^k c b^{k+1}$, length $2k+3$
  - Optimum $ab^{k+1}c$, length $k+3$
- Open problem: prove that the greedy superstring is always at most twice as long as the optimum
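The greedy merging loop can be sketched in a few lines. This is an illustrative quadratic-time version, assuming a non-empty input in which no string is a substring of another; the names `overlap_len` and `greedy_superstring` are hypothetical, and ties are broken by first occurrence, which the slide does not specify.

```python
def overlap_len(s: str, t: str) -> int:
    """Length of the longest suffix of s that is also a prefix of t."""
    for k in range(min(len(s), len(t)), 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0


def greedy_superstring(strings: list[str]) -> str:
    """Repeatedly merge the ordered pair with the longest overlap."""
    pool = list(strings)
    while len(pool) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, s in enumerate(pool):
            for j, t in enumerate(pool):
                if i != j:
                    k = overlap_len(s, t)
                    if k > best_k:  # strict '>' keeps the first best pair
                        best_k, best_i, best_j = k, i, j
        # s overlapped with t to maximum extent
        merged = pool[best_i] + pool[best_j][best_k:]
        pool = [x for idx, x in enumerate(pool) if idx not in (best_i, best_j)]
        pool.append(merged)
    return pool[0]


# The bad instance from the slide with k = 4.
k = 4
bad = ["a" + "b" * k, "b" * k + "c", "b" * (k + 1)]
out = greedy_superstring(bad)
print(out, len(out))  # length 2k + 3 = 11; the optimum has length k + 3 = 7
```

With this tie-breaking the first merge joins $ab^k$ with $b^k c$, after which $b^{k+1}$ overlaps nothing. The output therefore has the same length $2k+3$ as on the slide, although the pieces may appear in a different order.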
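8 Overlap & Prefix of 2 strings
- Overlap of s and t = longest suffix of s that is a prefix of t
- Prefix of s and t = s after removing overlap(s,t)
- $s = a_1 a_2 a_3 \dots a_{|s|-k+1} \dots a_{|s|}$
- $t = b_1 \dots b_k \dots b_{|t|}$
- [Figure: s and t aligned so that the last k symbols of s coincide with the first k symbols of t; prefix(s,t) labels the part of s before the shared region, overlap(s,t) labels the shared region]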
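The two definitions translate directly into code. This is a small sketch with hypothetical function names `overlap` and `prefix`, using a simple scan rather than a linear-time string-matching method.

```python
def overlap(s: str, t: str) -> str:
    """Longest suffix of s that is also a prefix of t (may be empty)."""
    for k in range(min(len(s), len(t)), 0, -1):
        if s.endswith(t[:k]):
            return t[:k]
    return ""


def prefix(s: str, t: str) -> str:
    """s after removing overlap(s, t) from its end."""
    return s[: len(s) - len(overlap(s, t))]


s, t = "ACGTAC", "TACGG"
print(overlap(s, t))        # TAC
print(prefix(s, t))         # ACG
print(prefix(s, t) + t)     # ACGTACGG, i.e. s and t merged to maximum extent
```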
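9 Lower Bound on OPT
With the strings numbered in order of leftmost occurrence in the optimal superstring:
$\mathrm{OPT} = |\mathrm{prefix}(s_1,s_2)| + \dots + |\mathrm{prefix}(s_{n-1},s_n)| + |\mathrm{prefix}(s_n,s_1)| + |\mathrm{overlap}(s_n,s_1)| \ge$ cost of tour $1 \to 2 \to \dots \to n$ in the prefix graph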
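The identity behind this bound holds for any ordering of the strings: the length of the string obtained by merging them in that order equals the tour cost plus the overlap of the last string with the first. The sketch below checks this on a toy instance; the function names and the example strings are illustrative assumptions, not taken from the slides.

```python
def overlap_len(s: str, t: str) -> int:
    """Length of the longest suffix of s that is also a prefix of t."""
    for k in range(min(len(s), len(t)), 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0


def tour_cost(strings: list[str], order: list[int]) -> int:
    """Sum of |prefix(s, next)| around the closed tour given by order."""
    n = len(order)
    total = 0
    for pos in range(n):
        s = strings[order[pos]]
        t = strings[order[(pos + 1) % n]]  # wrap around to close the tour
        total += len(s) - overlap_len(s, t)
    return total


def merge_in_order(strings: list[str], order: list[int]) -> str:
    """Overlap consecutive strings to maximum extent, following order."""
    result = strings[order[0]]
    for prev, cur in zip(order, order[1:]):
        result += strings[cur][overlap_len(strings[prev], strings[cur]):]
    return result


strings = ["ATGC", "GCAT", "CATG"]
order = [0, 1, 2]
merged = merge_in_order(strings, order)
wrap = overlap_len(strings[order[-1]], strings[order[0]])
print(merged, len(merged), tour_cost(strings, order), wrap)  # ATGCATG 7 4 3
```

Here the merged string has length 7 = 4 + 3, so the tour cost 4 is a strict lower bound on the length of this superstring.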
10 The Cycle Cover Algorithm
- Computing TSP in prefix graph is NP-hard
- Key idea: lower-bound OPT using min-weight cycle cover
- For every cycle $c = (i_1 \to i_2 \to \dots \to i_l \to i_1)$, $\sigma(c) = \mathrm{prefix}(s_{i_1},s_{i_2}) \dots \mathrm{prefix}(s_{i_l},s_{i_1})\, s_{i_1}$ is a superstring of $s_{i_1}, \dots, s_{i_l}$
- Cycle cover algorithm
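A toy version of the cycle cover algorithm is sketched below. It assumes that no input string is a substring of another and that self-loops use only proper self-overlaps. It finds the minimum-weight cycle cover by brute force over all successor permutations, so it is only usable for a handful of strings; a practical implementation would solve an assignment (bipartite matching) problem instead. All function names are hypothetical.

```python
from itertools import permutations


def overlap_len(strings: list[str], i: int, j: int) -> int:
    """Longest suffix of strings[i] that is a prefix of strings[j].

    For i == j only proper self-overlaps (shorter than the string) count.
    """
    s, t = strings[i], strings[j]
    start = min(len(s), len(t)) - (1 if i == j else 0)
    for k in range(start, 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0


def prefix_len(strings: list[str], i: int, j: int) -> int:
    """Weight of edge i -> j in the prefix graph."""
    return len(strings[i]) - overlap_len(strings, i, j)


def min_cycle_cover(strings: list[str]) -> list[list[int]]:
    """Brute force: succ[i] is the successor of i; minimize total edge weight."""
    n = len(strings)
    best = min(
        permutations(range(n)),
        key=lambda succ: sum(prefix_len(strings, i, succ[i]) for i in range(n)),
    )
    cycles, seen = [], set()
    for start in range(n):
        if start in seen:
            continue
        cycle, i = [], start
        while i not in seen:  # follow successors until the cycle closes
            seen.add(i)
            cycle.append(i)
            i = best[i]
        cycles.append(cycle)
    return cycles


def cycle_string(strings: list[str], cycle: list[int]) -> str:
    """Prefixes along the cycle, followed by the cycle's first string."""
    parts = []
    for pos, i in enumerate(cycle):
        j = cycle[(pos + 1) % len(cycle)]
        parts.append(strings[i][: prefix_len(strings, i, j)])
    parts.append(strings[cycle[0]])
    return "".join(parts)


def cycle_cover_superstring(strings: list[str]) -> str:
    """Concatenate the strings built from each cycle of a min-weight cover."""
    return "".join(cycle_string(strings, c) for c in min_cycle_cover(strings))


print(cycle_cover_superstring(["ATGC", "GCAT", "CATG"]))  # ATGCATGC
```

On this instance the cover is a single cycle of weight 4, and the output has length 8 (weight 4 plus the 4 symbols of the repeated first string), one more than the shortest superstring ATGCATG.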
11 The Cycle Cover Algorithm
- Theorem [Blum, Jiang, Li, Tromp, Yannakakis 94]: Cycle cover algorithm gives factor 4 approximation.
- Length of output is $\mathrm{wt}(C) + \sum_i |r_i|$, where $r_i$ is a representative string from cycle $c_i$
- $\mathrm{wt}(C) \le \mathrm{OPT}$
- If $r_i$ is no longer than $\mathrm{wt}(c_i)$ ⇒ output within factor 2 of optimum!
- $r_i$ can be much longer than $\mathrm{wt}(c_i)$ (periodic strings!)
- It can be shown that $\sum_i |r_i| \le \mathrm{OPT} + 2\,\mathrm{wt}(C)$ ⇒ factor 4
12 Improved Algorithm
Theorem [Blum, Jiang, Li, Tromp, Yannakakis 94]: The improved algorithm gives factor 3 approximation.
Proof using that the greedy algorithm gives at least ½ of the optimum compression.
Current best approximation factor is 2.596 [Breslauer, Jiang, Jiang 97]
13 Sequencing by Hybridization
- Exploits parallel hybridization in DNA arrays
  - All $4^k$ probes of a certain length k (k = 8 to 10) are synthesized on the array
  - Target DNA hybridizes at locations containing probes complementary to its k-substrings
- Sequencing by Hybridization (SBH) Problem: Reconstruct target DNA given its k-length substrings (spectrum)
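For experimentation, the spectrum of a known target can be generated directly. The sketch below assumes an idealized, error-free array readout that reports each k-length substring once, regardless of how often it occurs; the function name `spectrum` and the example sequence are illustrative.

```python
def spectrum(target: str, k: int) -> set[str]:
    """Set of all k-length substrings of target (multiplicities are lost)."""
    return {target[i:i + k] for i in range(len(target) - k + 1)}


print(sorted(spectrum("ATGCAGGTCC", 3)))
```

Because the spectrum is a set, a target with repeated k-substrings yields fewer than `len(target) - k + 1` elements, which is one source of ambiguity in reconstruction.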
14 Mathematical Formulation of SBH
- SBH is a special case of the shortest superstring problem: the solution corresponds to a Hamiltonian path (NP-hard to find) in the "prefix length 1" graph
- [Pevzner 89] SBH is equivalent to finding an Eulerian path (easy to find in linear time) in the following graph
  - Vertices are all (k-1)-tuples
  - Directed edge between two (k-1)-tuples u and v iff there is a k-length string in the spectrum whose first k-1 symbols match u and last k-1 symbols match v
- Choose the right mathematical abstraction!
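The Eulerian-path formulation can be sketched as follows, using Hierholzer's algorithm on the graph of (k-1)-tuples. The sketch assumes a non-empty, error-free spectrum in which every k-substring of the target occurs exactly once; when several Eulerian paths exist, it returns just one of them. The function name is hypothetical.

```python
from collections import defaultdict


def reconstruct_from_spectrum(spectrum: set[str]) -> str:
    """Spell out one Eulerian path in the graph of (k-1)-tuples."""
    graph = defaultdict(list)   # (k-1)-tuple -> list of successor (k-1)-tuples
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    for kmer in sorted(spectrum):       # sorted only to make runs repeatable
        u, v = kmer[:-1], kmer[1:]      # first and last k-1 symbols
        graph[u].append(v)
        out_deg[u] += 1
        in_deg[v] += 1

    # Start at the vertex with one more outgoing than incoming edge, if any.
    nodes = set(out_deg) | set(in_deg)
    start = next(
        (x for x in sorted(nodes) if out_deg[x] - in_deg[x] == 1),
        min(out_deg),
    )

    # Hierholzer's algorithm with an explicit stack.
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    if len(path) != len(spectrum) + 1:
        raise ValueError("spectrum does not admit an Eulerian path")
    # First vertex in full, then the last symbol of every later vertex.
    return path[0] + "".join(v[-1] for v in path[1:])


print(reconstruct_from_spectrum({"ATG", "TGA", "GAT", "ATC"}))  # ATGATC
```

Each k-substring is used as exactly one edge, which is why the path visits edges once (Eulerian) rather than vertices once (Hamiltonian).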
15 Polymerase Chain Reaction
16 Primer Selection Problem
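[Figure: the i-th amplification locus, with its forward sequence f_i and reverse sequence r_i (5' and 3' ends marked), the forward and reverse primers hybridizing to them, and length bounds expressed in terms of L]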
- Given
  - Pairs of forward/reverse sequences for the n amplification loci
  - Primer length k and amplification upper bound L
- Find
  - Minimum set of primers S of length k such that, for each amplification locus, there are two primers in S hybridizing to the forward and reverse sequences within a distance of L of each other
17 Previous Work
- [Pearson et al. 96] Logarithmic approximation factor using greedy set cover algorithm for a formulation that does not distinguish between forward and reverse primers
  - Similar formulations used by [Linhart, Shamir 02], [Souvenir et al. 03]
  - To enforce the bound of L on amplification length, must truncate forward and reverse sequences to length L/2
- [Fernandes, Skiena 02] model primer selection as a minimum multicolored subgraph problem
  - Vertices are candidate primers
  - Add edge colored by color i between primers u and v if they hybridize to the i-th forward and reverse sequences within a distance of L
  - Find minimum size set of vertices inducing edges of all colors
  - No non-trivial approximation factor proposed
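The set cover view of primer selection can be illustrated with the textbook greedy rule: repeatedly pick the candidate primer that covers the most still-uncovered sequences. The sketch below is not the implementation of any of the cited papers. It makes a strong simplifying assumption, namely that a primer "hybridizes" to a sequence exactly when it occurs in it as a substring, it does not distinguish forward from reverse sequences, and it assumes the sequences have already been truncated as needed. The function name and example data are hypothetical.

```python
def greedy_primer_cover(sequences: list[str], k: int) -> list[str]:
    """Greedy set cover: primers are k-mers, a primer covers the sequences containing it."""
    # candidate primer -> indices of the sequences it covers
    covers: dict[str, set[int]] = {}
    for idx, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            covers.setdefault(seq[i:i + k], set()).add(idx)

    uncovered = set(range(len(sequences)))
    chosen = []
    while uncovered:
        # Sorted candidates give deterministic tie-breaking.
        best = max(
            sorted(covers),
            key=lambda p: len(covers[p] & uncovered),
            default=None,
        )
        if best is None or not covers[best] & uncovered:
            raise ValueError("some sequence is shorter than k and cannot be covered")
        chosen.append(best)
        uncovered -= covers[best]
    return chosen


sequences = ["ACGTAC", "TTACGA", "GGTACC", "CCCTTT"]
print(greedy_primer_cover(sequences, 3))  # ['TAC', 'CCC']
```

In the example, TAC occurs in the first three sequences and is picked first; any 3-mer of the last sequence then completes the cover.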
18 Improved Approximations
- [Konwar, M, Russell, Shvartsman 04]
  - Logarithmic approximation factor using potential function greedy for the bounded amplification length primer selection problem
  - $O(L \ln n)$ approximation factor based on randomized rounding for the minimum multicolored subgraph problem of [Fernandes, Skiena 02]
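20 Key Lemma
If $r$ and $r'$ are representative strings from cycles $c$ and $c'$, then $|\mathrm{overlap}(r, r')| < \mathrm{wt}(c) + \mathrm{wt}(c')$.
If $|\mathrm{overlap}(r, r')| \ge \mathrm{wt}(c) + \mathrm{wt}(c')$, then, with $\alpha$ and $\alpha'$ the prefixes of $\mathrm{overlap}(r, r')$ of lengths $\mathrm{wt}(c)$ and $\mathrm{wt}(c')$, $\alpha^\infty = (\alpha')^\infty$ ⇒ $\alpha^\infty$ covers strings in both $c$ and $c'$ ⇒ cycle cover is not minimal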
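21 Proof of Factor 4
- Length of output: $\mathrm{wt}(C) + \sum_i |r_i|$
- Numbering the $r_i$'s in order of leftmost occurrence in OPT and using the Lemma
  - ⇒ $\sum_i |r_i| \le \mathrm{OPT} + \sum_i |\mathrm{overlap}(r_i, r_{i+1})| \le \mathrm{OPT} + 2\,\mathrm{wt}(C)$
- $\mathrm{wt}(C) \le \mathrm{OPT}$
- ⇒ Length of output $\le 4 \times \mathrm{OPT}$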
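22 Improved Algorithm Analysis
- Observation 1: The greedy algorithm is known to achieve at least ½ of the optimum compression, i.e., $\sum_i |\sigma(c_i)| - |\tau| \ge \tfrac{1}{2}\left(\sum_i |\sigma(c_i)| - \mathrm{OPT}_\sigma\right)$, where $\tau$ is the output of the improved algorithm and $\mathrm{OPT}_\sigma$ is the length of the shortest superstring of $\sigma(c_i)$, $i = 1, \dots, k$
  - ⇒ $|\tau| - \mathrm{OPT}_\sigma \le \tfrac{1}{2}\left(\sum_i |\sigma(c_i)| - \mathrm{OPT}_\sigma\right)$
- Observation 2: By numbering the $\sigma(c_i)$'s in order of leftmost occurrence in $\mathrm{OPT}_\sigma$ and using again the key Lemma
  - $\sum_i |\sigma(c_i)| - \mathrm{OPT}_\sigma \le \sum_i |\mathrm{overlap}(\sigma(c_i), \sigma(c_{i+1}))| \le 2\,\mathrm{wt}(C)$
  - ⇒ $|\tau| - \mathrm{OPT}_\sigma \le \mathrm{wt}(C)$
- Observation 3: $\mathrm{OPT}_\sigma \le \mathrm{OPT} + \mathrm{wt}(C)$
- ⇒ $|\tau| \le \mathrm{OPT} + 2\,\mathrm{wt}(C) \le 3\,\mathrm{OPT}$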