Title: Nonbreaking Similarity of Genomes with Gene Repetitions
1 Non-breaking Similarity of Genomes with Gene Repetitions
- Binhai Zhu
- Computer Science Department, Montana State University
- Joint work with Zhixiang Chen, Bin Fu, Jinhui Xu, Boting Yang and Zhiyu Zhao
2 Background
- Computing the genomic distance between genomes is important in evolutionary molecular biology. The problem was first studied by Sturtevant and Dobzhansky in 1936.
- A lot of research has been done on computing genomic distances since 1990, assuming that each gene appears in a genome exactly once, e.g., the famous result by Hannenhalli and Pevzner on sorting signed permutations by reversals.
3 Background (cont.)
- On the other hand, gene repetition is very common in genomes, so computing genomic distances with gene repetition is a more realistic problem.
- This is a typical optimization problem, so it makes sense to study its approximability.
4 Definitions
- Given a set F of n gene families (an alphabet), a genome G is a sequence of elements of F such that each element has a sign (+/-).
- Example. F = {a, b, c, d},
- G = -b+d-c+a+b-d-c
- We will focus on unsigned sequences in this work.
- A genome G is said to be exemplar if every gene appears exactly once in G.
5 Definitions (cont.)
- Given exemplar genomes G and H over the same set of gene families, if the gene pair ab is a substring of G but not of H, then ab constitutes a breakpoint in G.
- Example. G = abcdefg
- H = efgdcab
- There are 3 breakpoints in G: bc, cd and de (and symmetrically 3 in H).
- The number of breakpoints between G and H is called the breakpoint distance between G and H.
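The breakpoint count in the example above can be checked with a short Python sketch. It assumes the definition exactly as stated on the slide: unsigned exemplar genomes, with ab treated as an ordered substring, so cd in G is not matched by dc in H. The function names `adjacencies` and `breakpoint_distance` are illustrative and not taken from the talk.

```python
def adjacencies(genome):
    """Return the set of ordered adjacent gene pairs of an exemplar genome."""
    return {(genome[i], genome[i + 1]) for i in range(len(genome) - 1)}


def breakpoint_distance(g, h):
    """Count adjacent pairs of g that are not adjacent (in that order) in h."""
    return len(adjacencies(g) - adjacencies(h))


# Example from the slide: G = abcdefg, H = efgdcab.
print(breakpoint_distance("abcdefg", "efgdcab"))
print(breakpoint_distance("efgdcab", "abcdefg"))
```

Under a convention that also accepts the reversed pair ba as a match for ab, the count for this example would differ, so the ordered-substring reading is the one that reproduces the slide's value.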
6 Exemplar Breakpoint Distance Problem
- Given two genomes G and H over n gene families, compute two exemplar genomes G' and H', obtained from G and H respectively, such that the breakpoint distance between G' and H' is minimized.
- We call this the exemplar breakpoint distance problem (between G and H). Denote this distance by eb(G,H) = b(G',H').
7 Approximation Algorithms
- Given a minimization (maximization) problem Π, let OPT be the value of an optimal solution of Π. An approximation algorithm A provides a performance guarantee of α for Π if, for every instance of Π, the solution value returned by A is at most α × OPT (at least OPT/α).
- Usually we say that A is a factor-α approximation for Π.
8 Prior Results (1)
- We showed that the exemplar breakpoint distance problem does not admit any approximation unless P = NP (in fact, deciding whether eb(G,H) = 0 is NP-complete) [Chen, Fu and Zhu, 2006].
- This result holds for any genomic distance d(·,·) satisfying: G = H implies d(G,H) = 0.
- Based on the above result, even under a weaker model of approximation, we showed that the exemplar conserved interval distance problem does not admit any WEAK approximation of a superlinear factor [Chen, Fowler, Fu and Zhu, 2007].
9 Prior Results (2)
- On the other hand, for the exemplar breakpoint distance problem, Sankoff has used branch-and-bound [Sankoff, 1999] and Nguyen, Tay and Zhang [2005] have used divide-and-conquer on practical datasets to obtain good empirical results.
- As a related but slightly different effort, Chauve et al. [2006] studied exemplar genomic similarity problems whose measures do not satisfy "G = H implies d(G,H) = 0", e.g., the exemplar common interval measure problem.
10 Background for this work
- We look at the complement of the breakpoint distance under the gene duplication model.
- As the problem is still hard to approximate, we follow Nguyen et al. by considering genomes satisfying some practical conditions.
11 Definitions
- Given exemplar genomes G and H drawn from the same alphabet, ab is a non-breaking point if ab appears in both G and H.
- Example. G = abcdefg
- H = fegcdab
- We have two non-breaking points in G and H (ab and cd). The number of non-breaking points is called the non-breaking similarity of G and H, denoted as nbs(G,H).
- Note that when |G| = |H| = n, if G = H, then nbs(G,H) = n-1.
- Given genomes G and H drawn from the same alphabet, possibly with gene repetitions, the exemplar non-breaking similarity problem is to delete redundant genes to obtain exemplar genomes G' and H' such that nbs(G',H') is maximized. The corresponding measure is denoted as enbs(G,H).
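As a minimal sketch of the definition above, the non-breaking similarity of two exemplar genomes can be computed by intersecting their sets of ordered adjacent pairs. The function name `nbs` mirrors the slide's notation; representing genomes as Python strings with one character per gene is an assumption made for illustration.

```python
def nbs(g, h):
    """Number of ordered adjacent pairs shared by exemplar genomes g and h."""
    pairs_g = {g[i:i + 2] for i in range(len(g) - 1)}
    pairs_h = {h[i:i + 2] for i in range(len(h) - 1)}
    return len(pairs_g & pairs_h)


# Example from the slide: the shared pairs are ab and cd.
print(nbs("abcdefg", "fegcdab"))
# Identical genomes of length n share all n - 1 adjacent pairs.
print(nbs("abcdefg", "abcdefg"))
```

For two exemplar genomes over the same n genes, each of the n - 1 adjacent pairs of G is either a non-breaking point or a breakpoint, which is the sense in which the similarity complements the breakpoint distance.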
12 Example
- G = abcadcefg
- H = cfegcdabf
- We have 4 possible exemplar genomes for G: abcdefg, abdcefg, bcadefg, badcefg.
- We have 4 possible exemplar genomes for H: cfegdab, cegdabf, fegcdab, egcdabf.
- enbs(G,H) = nbs(abcdefg, fegcdab) = 2.
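The example can be reproduced by exhaustive search. The sketch below enumerates every exemplar genome (one kept copy per gene) and maximizes the number of shared adjacent pairs. It is a brute-force illustration that takes exponential time in general, not the algorithm from the talk, and the helper names `exemplars`, `nbs` and `enbs` are chosen here for readability.

```python
from itertools import product


def exemplars(genome):
    """Yield every genome obtained by keeping exactly one copy of each gene."""
    positions = {}
    for i, gene in enumerate(genome):
        positions.setdefault(gene, []).append(i)
    for choice in product(*positions.values()):
        yield "".join(genome[i] for i in sorted(choice))


def nbs(g, h):
    """Number of ordered adjacent pairs shared by exemplar genomes g and h."""
    pairs_g = {g[i:i + 2] for i in range(len(g) - 1)}
    pairs_h = {h[i:i + 2] for i in range(len(h) - 1)}
    return len(pairs_g & pairs_h)


def enbs(g, h):
    """Brute-force exemplar non-breaking similarity."""
    return max(nbs(x, y) for x in set(exemplars(g)) for y in set(exemplars(h)))


G, H = "abcadcefg", "cfegcdabf"
print(sorted(set(exemplars(G))))
print(sorted(set(exemplars(H))))
print(enbs(G, H))
```

The maximum is attained by more than one pair of exemplar genomes; the pair abcdefg and fegcdab named on the slide is one of them.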
13 Inapproximability Result
- Theorem 1. Given an exemplar genome G and another genome H such that the genes are all from the same alphabet of size n and each gene appears in H at most two times, the Exemplar Non-breaking Similarity Problem over G and H does not admit any approximation of factor n^(1-ε), unless P = NP.
- Proof idea: a linear reduction from Independent Set (IS).
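14 [Figure: input graph on vertices v1, ..., v5 with edges e1, ..., e5]
- N = 5 vertices, M = 5 edges; N + M is even.
- G = v1 v1' v2 v2' v3 v3' v4 v4' v5 v5' x1 e1 x1' x2 e2 x2' x3 e3 x3' x4 e4 x4' x5 e5 x5'
- H = Y_(N+M-1) Y_(N+M-3) ... Y_1 Y_(N+M) Y_(N+M-2) ... Y_2
- H = x4 x4' x2 x2' v5 e4 e5 v5' v3 e1 v3' v1 e1 e2 v1' x5 x5' x3 x3' x1 x1' v4 e3 e5 v4' v2 e2 e3 e4 v2'
- Y_i = v_i A_i v_i', if i ≤ N; Y_(N+i) = x_i x_i', if i ≤ M. Judging from the expansion of H, A_i lists the edges incident to v_i.
- H' = x4 x4' x2 x2' v5 e5 v5' v3 v3' v1 e1 e2 v1' x5 x5' x3 x3' x1 x1' v4 v4' v2 e3 e4 v2' corresponds to the optimal independent set {v3, v4}.
- Input graph has an IS of size K iff enbs(G,H) = K.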
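The construction can be replayed on the example graph with a small Python sketch. Several points are assumptions rather than statements from the talk: the edge list is read off the expansion of H (e1 joins v1 and v3, e2 joins v1 and v2, e3 joins v2 and v4, e4 joins v2 and v5, e5 joins v4 and v5), A_i is taken to be the edges incident to v_i in index order, and all function names are hypothetical. Both the similarity and the independent set are found by exhaustive search, so this only checks the claimed equivalence on a tiny instance.

```python
from itertools import combinations, product


def build_instance(n_vertices, edges):
    """Build (G, H) from a graph; edges is a list of 1-based (u, v) pairs.

    Assumes N + M is even and that A_i lists the edges incident to v_i.
    """
    n, m = n_vertices, len(edges)
    g = []
    for i in range(1, n + 1):
        g += [f"v{i}", f"v{i}'"]
    for k in range(1, m + 1):
        g += [f"x{k}", f"e{k}", f"x{k}'"]
    blocks = {}
    for i in range(1, n + 1):
        incident = [f"e{k}" for k, (u, v) in enumerate(edges, 1) if i in (u, v)]
        blocks[i] = [f"v{i}"] + incident + [f"v{i}'"]
    for k in range(1, m + 1):
        blocks[n + k] = [f"x{k}", f"x{k}'"]
    # Odd-indexed blocks in decreasing order, then the even-indexed ones.
    order = list(range(n + m - 1, 0, -2)) + list(range(n + m, 0, -2))
    h = [gene for j in order for gene in blocks[j]]
    return g, h


def exemplars(genome):
    """Yield every way of keeping exactly one copy of each gene."""
    positions = {}
    for i, gene in enumerate(genome):
        positions.setdefault(gene, []).append(i)
    for choice in product(*positions.values()):
        yield [genome[i] for i in sorted(choice)]


def nbs(g, h):
    """Number of ordered adjacent pairs shared by two exemplar genomes."""
    return len(set(zip(g, g[1:])) & set(zip(h, h[1:])))


def enbs(g, h):
    """Brute-force exemplar non-breaking similarity (exponential time)."""
    return max(nbs(x, y) for x in exemplars(g) for y in exemplars(h))


def max_independent_set_size(n_vertices, edges):
    """Size of a largest independent set, by exhaustive search."""
    for size in range(n_vertices, 0, -1):
        for subset in combinations(range(1, n_vertices + 1), size):
            chosen = set(subset)
            if not any(u in chosen and v in chosen for u, v in edges):
                return size
    return 0


EDGES = [(1, 3), (1, 2), (2, 4), (2, 5), (4, 5)]  # e1..e5, inferred from H
G, H = build_instance(5, EDGES)
print(" ".join(H))
print(enbs(G, H), max_independent_set_size(5, EDGES))
```

In this instance the only pairs of G that can survive in an exemplar of H are of the form v_i v_i', which requires deleting every edge gene between them; since each edge must be kept once, the vertices treated this way form an independent set.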
16 Positive Results
- Our motivation came from Nguyen, Tay and Zhang [2005], who observed that for certain bacterial genome pairs (Baphi-Wigg, Pmult-Hinft, Ecoli-Styphi, Xaxo-Xcamp and Ypes), repeated genes are usually pegged, e.g.,
- xyxaba
18 Positive Results
- Definition
- occ(g,G) is the number of occurrences of g in G.
- span(g,G) is the maximum distance between two copies of g in G.
- totalocc(c,G) = Σ occ(g,G), summed over every gene g in G with span(g,G) ≥ c.
- Example. G = abcdaebd
- span(a,G) = 4, span(b,G) = 5, span(d,G) = 4,
- totalocc(4,G) = 6
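The three quantities are easy to compute directly, as in the sketch below. Two assumptions are made explicit: distance is measured as the difference of positions in the sequence, and the condition in totalocc is read as span at least c, since that reading reproduces the example value totalocc(4,G) = 6. A gene that occurs once is given span 0 here.

```python
def occ(g, genome):
    """Number of occurrences of gene g in the genome."""
    return genome.count(g)


def span(g, genome):
    """Largest position difference between two copies of g (0 if unique)."""
    idx = [i for i, x in enumerate(genome) if x == g]
    return idx[-1] - idx[0] if idx else 0


def totalocc(c, genome):
    """Total occurrences of genes whose span is at least c (assumed reading)."""
    return sum(occ(g, genome) for g in set(genome) if span(g, genome) >= c)


G = "abcdaebd"
print(span("a", G), span("b", G), span("d", G))
print(totalocc(4, G))
```

With c = 1 the same function counts all occurrences of repeated genes, which is how totalocc(1,G) is used in Theorem 2.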
20 Positive Results
- Theorem 2. Let G and H be two genomes with t = totalocc(1,G) + totalocc(c,H), for a constant c. Then enbs(G,H) can be computed in O(3^(t/3) n^(c+2+ε)) time.
- Idea 1: Given an exemplar genome G and another genome H satisfying span(g,H) ≤ c for every g in H, we can use divide and conquer to compute enbs(G,H) in O(n^(c+2+ε)) time.
- Roughly speaking, H = H1 H2 H3 with |H2| = c; enumerate all solutions on H2 and recurse.
- T(n) ≤ 2^(c+1) · 2T(n/2 + c) + O(n) = O(n^(c+2+ε))
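One way to see where the exponent comes from is sketched below. It assumes the recurrence reads T(n) ≤ 2^(c+1) · 2T(n/2 + c) + O(n), which is a reconstruction of the slide's formula, and it treats the additive c in the subproblem size as a lower-order effect that is absorbed into ε.

```latex
\begin{align*}
T(n) &\le 2^{c+1}\cdot 2\,T\!\left(\tfrac{n}{2}+c\right) + O(n)
      \;\approx\; 2^{c+2}\,T\!\left(\tfrac{n}{2}\right) + O(n), \\
\log_2 2^{c+2} &= c+2 > 1
      \quad\Longrightarrow\quad
      T(n) = O\!\left(n^{c+2}\right)
      \text{ for the simplified recurrence (master theorem).}
\end{align*}
```

Since c is a constant, the slightly larger subproblems of size n/2 + c are what the extra ε in O(n^(c+2+ε)) accounts for under this reading.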
21 Positive Results
- Theorem 2 (cont.). Idea 2: As t is considered a constant, we enumerate all possibilities for deleting duplicated genes in G (to obtain G') and for deleting genes with span greater than c in H (to obtain H'). By Lemma 6, there are at most 4·3^(t/3) such combinations. Therefore, the total running time is 4·3^(t/3) · O(n^(c+2+ε)) = O(3^(t/3) n^(c+2+ε)).
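To give a feel for the size of the enumeration, the sketch below counts the ways of keeping one copy of each gene in a genome and compares the count with 4·3^(t/3), where t is the total number of occurrences of repeated genes. It covers only the G side of the enumeration and does not reproduce Lemma 6, which is not stated on the slides; the function names are illustrative.

```python
from collections import Counter
from math import prod


def exemplar_choices(genome):
    """Number of ways to keep exactly one copy of every gene."""
    return prod(Counter(genome).values())


def repeated_occurrences(genome):
    """Total occurrences of genes that appear more than once."""
    return sum(k for k in Counter(genome).values() if k > 1)


G = "abcadcefg"  # genome from the earlier example; a and c occur twice
t = repeated_occurrences(G)
count = exemplar_choices(G)
bound = 4 * 3 ** (t / 3)
print(t, count, count <= bound)
```

The count is a product of occurrence numbers whose sum is t, and such a product is largest when the occurrences come in groups of three, which is consistent with the 3^(t/3) term in the bound.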
22 Positive Results
- Theorem 3. Let G and H be two genomes with a total of t genes g satisfying shift(g,G,H) > c, for some constant c. Then enbs(G,H) can be computed in O(3^(t/3) n^(2c+1+ε)) time.
- Example. G = abcadef
- H = bcedefad
- shift(a,G,H) = 6
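The slides do not define shift. The sketch below assumes it is the largest position difference between a copy of g in G and a copy of g in H, a reading that reproduces the example value shift(a,G,H) = 6. The helper `genes_with_large_shift` is hypothetical and simply lists the genes that Theorem 3 would count toward t for a given c.

```python
def shift(g, genome_g, genome_h):
    """Largest |i - j| over copies of g at position i in G and j in H.

    Assumed definition; returns 0 if g is missing from either genome.
    """
    pos_g = [i for i, x in enumerate(genome_g) if x == g]
    pos_h = [j for j, x in enumerate(genome_h) if x == g]
    return max((abs(i - j) for i in pos_g for j in pos_h), default=0)


def genes_with_large_shift(c, genome_g, genome_h):
    """Genes whose shift exceeds c, in alphabetical order."""
    genes = set(genome_g) | set(genome_h)
    return sorted(g for g in genes if shift(g, genome_g, genome_h) > c)


G, H = "abcadef", "bcedefad"
print(shift("a", G, H))
print(genes_with_large_shift(2, G, H))
```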
23 Conclusion
- We introduce non-breaking similarity, which is the complement of the famous breakpoint distance, for genome comparison.
- The general exemplar non-breaking similarity problem is hard to approximate.
- For some special cases, we can obtain polynomial-time solutions.