Combinatorial Optimization Problems in Computational Biology

1
Combinatorial Optimization Problems in
Computational Biology
  • Ion Mandoiu
  • CSE Department

2
What Is Computational Biology?
  • G. Lancia: Study of mathematical and
    computational problems of modeling biological
    processes in the cell, removing experimental
    errors from genomic data, interpreting the data
    and providing theories about their biological
    relations
  • Multidisciplinary field at the intersection of
    computer science, biology, discrete mathematics,
    statistics, optimization, chemistry, physics, ...

3
5 Steps to Solving CB Problems
  • Understand biological problem
  • Represent biological data as mathematical objects
    (strings, sets, graphs, permutations, ...), map
    biological relations into mathematical relations,
    and formulate the biological question as an
    optimization or feasibility problem
  • Study computational complexity: polynomial?
    NP-hard?
  • Develop efficient algorithms
      • If in P, find fast and memory-efficient exact
        algorithms
      • If NP-hard, find practical exact algorithms
        and/or algorithms with provable approximation
        guarantees
  • Validate algorithms on biological data

4
Outline
  • Shortest Superstring
  • Sequencing by Hybridization
  • PCR Primer Selection

5
Shotgun Sequencing

6
Shortest Superstring
  • Given a set of strings $s_1, s_2, \ldots, s_n$
  • Find the shortest string $s$ containing each $s_i$ as a
    substring
  • Example
      • Set of strings: 000, 001, 010, 011, 100, 101,
        110, 111
      • Superstring: 0001110100
  • NP-hard [Maier, Storer 77]

7
Greedy Merging Algorithm
  • $S \leftarrow \{s_1, s_2, \ldots, s_n\}$
  • While $|S| > 1$ do
      • Find $s, t$ in $S$ with longest overlap
      • $S \leftarrow (S \setminus \{s, t\}) \cup \{\, s \text{ overlapped with } t \text{ to
        maximum extent} \,\}$
  • Output final string
  • Approximation factor no better than 2
      • $s_1 = ab^k$, $s_2 = b^k c$, $s_3 = b^{k+1}$
      • Greedy output: $ab^k c\, b^{k+1}$, length $= 2k+3$
      • Optimum: $ab^{k+1}c$, length $= k+3$
  • Open problem: prove that the greedy superstring is
    always at most twice as long as the optimum
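
The greedy merging loop above can be sketched in Python. This is a minimal illustration, not the presenter's code: the function names `overlap` and `greedy_superstring` are made up here, strings contained in another input string are dropped up front (an assumption), and ties between equally long overlaps are broken by position in the list, which the slide leaves unspecified.

```python
def overlap(s, t):
    """Length of the longest suffix of s that is a prefix of t."""
    for k in range(min(len(s), len(t)), 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Greedy merging: repeatedly merge the pair with the longest overlap."""
    # Assumption: duplicates and strings contained in another input string
    # are removed first, so remaining pairs overlap only partially.
    pool = [s for s in dict.fromkeys(strings)
            if not any(s != t and s in t for t in strings)]
    while len(pool) > 1:
        # max() keeps the first best pair, so ties are broken by list order.
        k, i, j = max(((overlap(s, t), i, j)
                       for i, s in enumerate(pool)
                       for j, t in enumerate(pool) if i != j),
                      key=lambda item: item[0])
        merged = pool[i] + pool[j][k:]
        pool = [s for idx, s in enumerate(pool) if idx not in (i, j)] + [merged]
    return pool[0] if pool else ""


# The slide's bad instance with k = 3: s1 = ab^k, s2 = b^k c, s3 = b^(k+1).
bad_order = greedy_superstring(["abbb", "bbbc", "bbbb"])   # length 2k + 3
good_order = greedy_superstring(["bbbb", "abbb", "bbbc"])  # length k + 3
print(len(bad_order), len(good_order))
```

All three pairwise overlaps in this instance have length k, so the result depends entirely on which tie greedy picks first: merging s1 with s2 first leads to length 2k + 3, while another tie-break reaches the optimum k + 3.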

8
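Overlap and Prefix of 2 Strings
  • Overlap of $s$ and $t$: longest suffix of $s$ that is a
    prefix of $t$
  • Prefix of $s$ and $t$: $s$ after removing $\mathrm{overlap}(s,t)$
  • $s = a_1 a_2 a_3 \ldots a_{|s|-k+1} \ldots a_{|s|}$
  • $t = b_1 \ldots b_k \ldots b_{|t|}$
  • $\mathrm{prefix}(s,t) = a_1 \ldots a_{|s|-k}$
  • $\mathrm{overlap}(s,t) = a_{|s|-k+1} \ldots a_{|s|} = b_1 \ldots b_k$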
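
The two definitions translate directly into code. The sketch below is illustrative only; it returns the overlap and prefix as strings, and it assumes that when a string is compared with itself only proper overlaps (shorter than the whole string) count, a convention the slide does not state.

```python
def overlap(s, t):
    """Longest suffix of s that is also a prefix of t, returned as a string."""
    # Assumption: for s == t, only proper overlaps (shorter than s) count.
    limit = min(len(s), len(t)) - (1 if s == t else 0)
    for k in range(limit, 0, -1):
        if s[-k:] == t[:k]:
            return t[:k]
    return ""


def prefix(s, t):
    """s with overlap(s, t) removed from its end."""
    return s[:len(s) - len(overlap(s, t))]


s, t = "abbb", "bbbc"
print(overlap(s, t), prefix(s, t), prefix(s, t) + t)
```

Note that `prefix(s, t) + t` is exactly "s overlapped with t to maximum extent", the merge step of the greedy algorithm.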

9
Lower Bound on OPT
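$\mathrm{OPT} = |\mathrm{prefix}(s_1,s_2)| + \cdots + |\mathrm{prefix}(s_{n-1},s_n)| + |\mathrm{prefix}(s_n,s_1)| + |\mathrm{overlap}(s_n,s_1)| \ge$ cost
of tour $1 \to 2 \to \cdots \to n \to 1$ in the prefix graph
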
10
The Cycle Cover Algorithm
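  • Computing TSP in the prefix graph is NP-hard
  • Key idea: lower-bound OPT using a min-weight cycle
    cover
  • For every cycle $c = (i_1 \to i_2 \to \cdots \to i_l \to i_1)$, $\sigma(c) =
    \mathrm{prefix}(s_{i_1},s_{i_2}) \cdots \mathrm{prefix}(s_{i_l},s_{i_1})\, s_{i_1}$ is a
    superstring of $s_{i_1}, \ldots, s_{i_l}$
  • Cycle cover algorithm: find a min-weight cycle cover
    $C = \{c_1, \ldots, c_k\}$ of the prefix graph and output
    $\sigma(c_1) \cdots \sigma(c_k)$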
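
A small sketch of the cycle cover idea follows. It is not the presenter's implementation: a min-weight cycle cover is normally found in polynomial time as an assignment problem, but to stay dependency-free this version simply tries every permutation, so it only suits tiny inputs. The function names are hypothetical, self-loops are allowed using proper self-overlaps, and the input is assumed to contain no string that is a substring of another.

```python
from itertools import permutations


def overlap_len(s, t):
    """Length of the longest suffix of s that is a prefix of t."""
    # Assumption: for s == t, only proper overlaps (shorter than s) count.
    limit = min(len(s), len(t)) - (1 if s == t else 0)
    for k in range(limit, 0, -1):
        if s[-k:] == t[:k]:
            return k
    return 0


def prefix(s, t):
    """s with its overlap with t removed from the end."""
    return s[:len(s) - overlap_len(s, t)]


def min_weight_cycle_cover(strings):
    """Brute-force min-weight cycle cover of the prefix graph."""
    n = len(strings)
    w = [[len(prefix(s, t)) for t in strings] for s in strings]
    # A cycle cover is a permutation p: vertex i is followed by p[i].
    best = min(permutations(range(n)),
               key=lambda p: sum(w[i][p[i]] for i in range(n)))
    weight = sum(w[i][best[i]] for i in range(n))
    cycles, seen = [], set()
    for start in range(n):
        cycle, i = [], start
        while i not in seen:
            seen.add(i)
            cycle.append(i)
            i = best[i]
        if cycle:
            cycles.append(cycle)
    return cycles, weight


def sigma(cycle, strings):
    """Concatenate the prefixes around the cycle, then append its first string."""
    steps = zip(cycle, cycle[1:] + cycle[:1])
    return "".join(prefix(strings[a], strings[b]) for a, b in steps) + strings[cycle[0]]


def cycle_cover_superstring(strings):
    """Output the concatenation of sigma(c) over all cycles of the cover."""
    cycles, _ = min_weight_cycle_cover(strings)
    return "".join(sigma(c, strings) for c in cycles)


strings = ["abbb", "bbbc", "bbbb"]
print(min_weight_cycle_cover(strings), cycle_cover_superstring(strings))
```

On this instance the cover weight equals the optimum superstring length k + 3 = 6, so the lower bound wt(C) ≤ OPT is tight here, while the output itself is noticeably longer than the optimum.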

11
The Cycle Cover Algorithm
  • Theorem [Blum, Jiang, Li, Tromp, Yannakakis 94]: The cycle
    cover algorithm gives a factor 4 approximation.
  • Length of output is $\mathrm{wt}(C) + \sum_i |r_i|$,
    where $r_i$ is a representative string from
    cycle $c_i$
  • $\mathrm{wt}(C) \le \mathrm{OPT}$
      • If every $r_i$ is no longer than $\mathrm{wt}(c_i)$ $\Rightarrow$ output within
        factor 2 of optimum!
      • $r_i$ can be much longer than $\mathrm{wt}(c_i)$ (periodic
        strings!)
      • It can be shown that $\sum_i |r_i| \le \mathrm{OPT} + 2\,\mathrm{wt}(C)$ $\Rightarrow$
        factor 4

12
Improved Algorithm
Theorem [Blum, Jiang, Li, Tromp, Yannakakis 94]: The
improved algorithm gives a factor 3
approximation. The proof uses the fact that the greedy
algorithm achieves at least ½ of the optimum
compression. Current best approximation factor is
2.596 [Breslauer, Jiang, Jiang 97]

13
Sequencing by Hybridization
  • Exploits parallel hybridization in DNA arrays
  • All $4^k$ probes of a certain length $k$ ($k$ = 8 to 10)
    are synthesized on the array
  • Target DNA hybridizes at locations containing
    probes complementary to its k-substrings
  • Sequencing by Hybridization (SBH) Problem:
    Reconstruct target DNA given its k-length
    substrings (spectrum)

14
Mathematical Formulation of SBH
  • SBH is a special case of the shortest
    superstring problem: the solution corresponds to a
    Hamiltonian path (NP-hard to find) in the
    prefix-length-1 graph
  • [Pevzner 89]: SBH is equivalent to finding an
    Eulerian path (easy to find in linear time) in
    the following graph
      • Vertices are all (k-1)-tuples
      • Directed edge between two (k-1)-tuples u and v
        iff there is a k-length string in the spectrum
        whose first k-1 symbols match u and last k-1 symbols
        match v
  • Choose the right mathematical abstraction!
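
The Eulerian-path formulation can be sketched as follows. This is an illustrative version, not Pevzner's original procedure: `reconstruct_from_spectrum` is a made-up name, the spectrum is assumed to be error-free with each k-length string listed once and k at least 2, and the path is found with the standard iterative Hierholzer method. When several Eulerian paths exist, the sketch returns just one of them.

```python
from collections import defaultdict


def reconstruct_from_spectrum(spectrum):
    """Spell a string whose k-length substrings are exactly the given spectrum."""
    graph = defaultdict(list)   # (k-1)-tuple -> list of successor (k-1)-tuples
    indeg = defaultdict(int)
    for kmer in spectrum:
        u, v = kmer[:-1], kmer[1:]   # first and last k-1 symbols
        graph[u].append(v)
        indeg[v] += 1

    # Start where out-degree exceeds in-degree by one; otherwise anywhere.
    nodes = set(graph) | set(indeg)
    start = next((u for u in nodes if len(graph[u]) - indeg[u] == 1),
                 spectrum[0][:-1])

    # Hierholzer's algorithm: walk until stuck, then back up and splice.
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    if len(path) != len(spectrum) + 1:
        raise ValueError("spectrum has no Eulerian path")
    return path[0] + "".join(v[-1] for v in path[1:])


spectrum = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
print(reconstruct_from_spectrum(spectrum))
```

Each k-length string becomes one edge, so using every edge exactly once is the same as using every element of the spectrum exactly once, which is why the Eulerian view avoids the Hamiltonian path search.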

15
Polymerase Chain Reaction

16
Primer Selection Problem
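[Figure: the i-th amplification locus with its forward sequence $f_i$ and reverse sequence $r_i$ (5' and 3' ends marked), showing the forward primer, the reverse primer, and length bounds involving L]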
  • Given
      • Pairs of forward/reverse sequences for the n
        amplification loci
      • Primer length k and amplification upper bound L
  • Find
      • Minimum set of primers S of length k such that,
        for each amplification locus, there are two
        primers in S hybridizing to the forward and
        reverse sequences within a distance of L of each
        other

17
Previous Work
  • [Pearson et al. 96]: Logarithmic approximation
    factor using the greedy set cover algorithm for a
    formulation that does not distinguish between
    forward and reverse primers
      • Similar formulations used by [Linhart, Shamir 02],
        [Souvenir et al. 03]
      • To enforce the bound of L on amplification length,
        must truncate forward and reverse sequences to
        length L/2
  • [Fernandes, Skiena 02] model primer selection as
    a minimum multicolored subgraph problem
      • Vertices are candidate primers
      • Add an edge colored by color i between primers u
        and v if they hybridize to the i-th forward and
        reverse sequences within a distance of L
      • Find a minimum size set of vertices inducing edges
        of all colors
      • No non-trivial approximation factor proposed
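
For the formulation that does not distinguish forward from reverse primers, the greedy set cover rule is simple to sketch: repeatedly pick the primer that hybridizes to the most still-uncovered sequences. The code below is a generic illustration rather than the method of any cited paper; the primer names `p1` to `p4`, the sequence labels, and the hybridization sets are invented toy data.

```python
def greedy_set_cover(universe, covers):
    """Pick primers greedily until every sequence in `universe` is covered.

    covers maps each candidate primer to the set of sequences it hybridizes to.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Primer covering the most uncovered sequences (ties: first in dict order).
        best = max(covers, key=lambda p: len(covers[p] & uncovered))
        if not covers[best] & uncovered:
            raise ValueError("some sequences cannot be covered")
        chosen.append(best)
        uncovered -= covers[best]
    return chosen


# Hypothetical toy instance: forward (f) and reverse (r) sequences of 3 loci.
sequences = {"f1", "r1", "f2", "r2", "f3", "r3"}
hybridizes = {
    "p1": {"f1", "f2", "r3"},
    "p2": {"r1", "r2"},
    "p3": {"f3"},
    "p4": {"f1", "r1"},
}
print(greedy_set_cover(sequences, hybridizes))
```

This sketch ignores the amplification length bound L entirely; as the slide notes, that bound has to be enforced separately in this formulation by truncating the sequences.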

18
Improved Approximations
  • [Konwar, M., Russell, Shvartsman 04]
      • Logarithmic approximation factor using
        potential function greedy for the bounded
        amplification length primer selection problem
      • $O(L \ln n)$ approximation factor based on
        randomized rounding for the minimum multicolored
        subgraph problem of [Fernandes, Skiena 02]

20
Key Lemma
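If $r$ and $r'$ are representative strings from
cycles $c$ and $c'$, then $|\mathrm{overlap}(r,r')| < \mathrm{wt}(c) + \mathrm{wt}(c')$.
If $|\mathrm{overlap}(r,r')| \ge \mathrm{wt}(c) + \mathrm{wt}(c')$, then $\alpha^\infty = (\alpha')^\infty$,
where $\alpha$ and $\alpha'$ are the prefixes of $\mathrm{overlap}(r,r')$ of length $\mathrm{wt}(c)$ and $\mathrm{wt}(c')$
$\Rightarrow$ $\alpha^\infty$ covers strings in both $c$ and $c'$ $\Rightarrow$
cycle cover is not minimal
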
21
Proof of Factor 4
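  • Length of output: $\mathrm{wt}(C) + \sum_i |r_i|$
  • Numbering the $r_i$'s in order of leftmost occurrence in
    OPT and using the Lemma
  • $\Rightarrow \sum_i |r_i| \le \mathrm{OPT} + \sum_i |\mathrm{overlap}(r_i, r_{i+1})| \le \mathrm{OPT} +
    2\,\mathrm{wt}(C)$
  • $\mathrm{wt}(C) \le \mathrm{OPT}$
  • $\Rightarrow$ Length of output $\le 4 \times \mathrm{OPT}$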

22
Improved Algorithm Analysis
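  • Observation 1: The greedy algorithm is known to
    achieve at least ½ of the optimum compression,
    i.e.,
    $\sum_i |\sigma(c_i)| - |\tau| \ge \tfrac{1}{2} \left( \sum_i |\sigma(c_i)| - \mathrm{OPT}_\sigma \right)$,
    where $\tau$ is the greedy superstring of the $\sigma(c_i)$
    and $\mathrm{OPT}_\sigma$ is the length of the shortest superstring of
    $\sigma(c_i)$, $i = 1, \ldots, k$
  • $\Rightarrow |\tau| - \mathrm{OPT}_\sigma \le \tfrac{1}{2} \left( \sum_i |\sigma(c_i)| - \mathrm{OPT}_\sigma \right)$
  • Observation 2: By numbering the $\sigma(c_i)$'s in order of
    leftmost occurrence in $\mathrm{OPT}_\sigma$ and using again the
    key Lemma
  • $\sum_i |\sigma(c_i)| - \mathrm{OPT}_\sigma \le \sum_i |\mathrm{overlap}(\sigma(c_i), \sigma(c_{i+1}))| \le
    2\,\mathrm{wt}(C)$
  • $\Rightarrow |\tau| - \mathrm{OPT}_\sigma \le \mathrm{wt}(C)$
  • Observation 3: $\mathrm{OPT}_\sigma \le \mathrm{OPT} + \mathrm{wt}(C)$
  • $\Rightarrow |\tau| \le 3\,\mathrm{OPT}$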
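
The last step combines the three observations with the cycle cover bound $\mathrm{wt}(C) \le \mathrm{OPT}$. Written out as one chain, where $\tau$ is assumed to denote the improved algorithm's output, $\sigma(c_i)$ the superstring built from cycle $c_i$, and $\mathrm{OPT}_\sigma$ the length of the shortest superstring of the $\sigma(c_i)$:

```latex
\begin{align*}
|\tau| - \mathrm{OPT}_\sigma
  &\le \tfrac{1}{2}\Bigl(\sum_i |\sigma(c_i)| - \mathrm{OPT}_\sigma\Bigr)
     && \text{(Observation 1)} \\
  &\le \tfrac{1}{2}\cdot 2\,\mathrm{wt}(C) = \mathrm{wt}(C)
     && \text{(Observation 2)} \\[4pt]
|\tau|
  &\le \mathrm{OPT}_\sigma + \mathrm{wt}(C) \\
  &\le \mathrm{OPT} + 2\,\mathrm{wt}(C)
     && \text{(Observation 3)} \\
  &\le 3\,\mathrm{OPT}
     && \text{(since } \mathrm{wt}(C) \le \mathrm{OPT}\text{)}
\end{align*}
```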