Title: Nonbreaking Similarity of Genomes with Gene Repetitions

1
Non-breaking Similarity of Genomes with Gene
Repetitions

Binhai Zhu
Computer Science Department, Montana State
University
Joint work with Zhixiang Chen, Bin Fu, Jinhui
Xu, Boting Yang and Zhiyu Zhao

2
Background

Computing genomic distance between genomes is
important in evolutionary molecular biology. The
problem was first studied by Sturtevant and
Dobzhansky in 1936.
Much research has been done on computing
genomic distances since 1990, assuming that each
gene appears in a genome exactly once, e.g., the
famous result by Hannenhalli and Pevzner on
sorting signed permutations by reversals.

3
Background (cont.)

On the other hand, gene repetition is very common
in genomes, so computing genomic distances with
gene repetition is a more realistic problem.
Since this is a typical optimization problem, it
makes sense to study the approximability of the
problem.

4
Definitions

Given n gene families (an alphabet) F, a genome G
is a sequence of elements of F such that each
element has a sign (+/-).
Example. F = {a, b, c, d},
G = -bd-cab-d-c
We will focus on unsigned sequences in this work.
A genome G is said to be exemplar if every gene
appears exactly once in G.

5
Definitions (cont.)

Given exemplar genomes G and H over the same
set of gene families, if ab is a substring
of G but not of H, then ab constitutes a
breakpoint in G.
Example. G = abcdefg
H = efgdcab
There are 3 breakpoints in G (bc, cd and de),
and symmetrically 3 in H.
The number of breakpoints between G and H is
called the breakpoint distance between G and H.
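
The breakpoint count in the example above can be checked with a few lines of Python. This is only a sketch under the unsigned, order-sensitive reading used on this slide (ab and ba are different substrings); the function names `adjacencies` and `breakpoints` are illustrative and not from the talk.

```python
def adjacencies(genome):
    """Return the set of ordered two-gene substrings of a genome."""
    return {genome[i:i + 2] for i in range(len(genome) - 1)}


def breakpoints(g, h):
    """Count two-gene substrings of exemplar genome g that do not occur in h."""
    adj_h = adjacencies(h)
    return sum(1 for i in range(len(g) - 1) if g[i:i + 2] not in adj_h)


G = "abcdefg"
H = "efgdcab"
print(breakpoints(G, H))  # 3 (bc, cd, de)
print(breakpoints(H, G))  # 3 (gd, dc, ca)
```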

6
Exemplar Breakpoint Distance Problem

Given two genomes G and H over n gene families,
compute two exemplar genomes G' and H' (of G and
H, respectively) such that the breakpoint
distance between G' and H' is minimized.
We call this the exemplar breakpoint distance
problem (between G and H). Denote this distance
by eb(G,H) = b(G',H').

7
Approximation Algorithms

Given a minimization (maximization) problem Π,
let the optimal solution value of Π be OPT. An
approximation algorithm A provides a performance
guarantee of α for Π if, for every instance of Π,
the solution value returned by A is at most α ×
OPT (at least OPT/α).
Usually we say that A is a factor-α approximation
for Π.

8
Prior Results (1)

We showed that the exemplar breakpoint distance
problem does not admit any approximation unless
P = NP (or, deciding whether eb(G,H) = 0 is
NP-complete) [Chen, Fu and Zhu, 2006].
This result holds for any genomic distance d( )
satisfying: G = H implies d(G,H) = 0.
Based on the above result, even under a weaker
model of approximation, we showed that the
exemplar conserved interval distance problem does
not admit any WEAK approximation of a superlinear
factor [Chen, Fowler, Fu and Zhu, 2007].

9
Prior Results (2)

On the other hand, for the exemplar breakpoint
distance problem, Sankoff has used
branch-and-bound [Sankoff, 1999], and Nguyen, Tay
and Zhang [2005] have used divide-and-conquer, on
practical datasets to obtain good empirical
results.
As a related but slightly different effort,
Chauve et al. [2006] studied exemplar genomic
similarity problems whose measures do not
satisfy: G = H implies d(G,H) = 0, e.g., the
exemplar common interval measure problem.

10
Background for this work

We try to look at the complement of the
breakpoint distance under the gene duplication
model.
As the problem is still hard to approximate, we
follow Nguyen, et al. by considering genomes
satisfying some practical conditions.

11
Definitions

Given exemplar genomes G and H drawn from the
same alphabet, ab is a non-breaking point if ab
appears in both G and H.
Example. G = abcdefg
H = fegcdab
We have two non-breaking points in G and H (ab
and cd). The number of non-breaking points is
called the non-breaking similarity of G and H,
denoted as nbs(G,H).
Note that when |G| = |H| = n, if G = H then
nbs(G,H) = n-1.
Given genomes G and H drawn from the same
alphabet, possibly with gene repetitions, the
exemplar non-breaking similarity problem is to
delete redundant genes to obtain exemplar genomes
G' and H' such that nbs(G',H') is maximized. The
corresponding measure is denoted as
enbs(G,H).
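
As a quick check of the definition, the following sketch counts non-breaking points of two exemplar genomes given as strings. The function name `nbs` mirrors the slide's notation; the string representation and the order-sensitive comparison of adjacent pairs are assumptions made for illustration.

```python
def nbs(g, h):
    """Count ordered two-gene substrings of g that also occur in h."""
    adj_h = {h[i:i + 2] for i in range(len(h) - 1)}
    return sum(1 for i in range(len(g) - 1) if g[i:i + 2] in adj_h)


print(nbs("abcdefg", "fegcdab"))  # 2 (ab and cd)
print(nbs("abcdefg", "abcdefg"))  # 6, i.e. n - 1 for n = 7
```

For exemplar genomes over the same n genes, each of the n-1 adjacent pairs of G is either a breakpoint or a non-breaking point, which is the sense in which the two measures are complementary.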

12
Example

G = abcadcefg
H = cfegcdabf
We have 4 possible exemplar genomes for G:
abcdefg, abdcefg, bcadefg, badcefg.
We have 4 possible exemplar genomes for H:
cfegdab, cegdabf, fegcdab, egcdabf.
enbs(G,H) = nbs(abcdefg, fegcdab) = 2.
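
The example can be reproduced by brute force: enumerate every exemplar genome of G and of H, then take the best pair. This is an exponential-time sketch for small inputs only, not the algorithm of the talk; `exemplars` and `enbs_bruteforce` are hypothetical helper names.

```python
from itertools import product


def exemplars(genome):
    """Return all exemplar genomes obtained by keeping one copy of each gene."""
    positions = {}
    for i, gene in enumerate(genome):
        positions.setdefault(gene, []).append(i)
    result = set()
    # Choose, for every gene, the index of the single copy that is kept.
    for choice in product(*positions.values()):
        result.add("".join(genome[i] for i in sorted(choice)))
    return result


def nbs(g, h):
    """Count ordered two-gene substrings of g that also occur in h."""
    adj_h = {h[i:i + 2] for i in range(len(h) - 1)}
    return sum(1 for i in range(len(g) - 1) if g[i:i + 2] in adj_h)


def enbs_bruteforce(g, h):
    """Maximum nbs over all pairs of exemplar genomes of g and h."""
    return max(nbs(x, y) for x in exemplars(g) for y in exemplars(h))


G = "abcadcefg"
H = "cfegcdabf"
print(sorted(exemplars(G)))   # ['abcdefg', 'abdcefg', 'badcefg', 'bcadefg']
print(enbs_bruteforce(G, H))  # 2
```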

13
Inapproximability Result

Theorem 1. Given an exemplar genome G and
another genome H such that the genes are all
from the same alphabet of size n and each gene
appears in H at most two times, the Exemplar
Non-breaking Similarity Problem over G and H
does not admit any approximation of factor
$n^{1-\epsilon}$, unless P = NP.
Proof idea: a linear reduction from Independent
Set (IS).

14
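[Figure: input graph with N = 5 vertices (v1, ..., v5) and M = 5 edges (e1, ..., e5); N+M is even]
G = v1v1'v2v2'v3v3'v4v4'v5v5'x1e1x1'x2e2x2'x3e3x3'x4e4x4'x5e5x5'
H = $Y_{N+M-1} Y_{N+M-3} \cdots Y_1 \, Y_{N+M} Y_{N+M-2} \cdots Y_2$
  = x4x4'x2x2' v5e4e5v5' v3e1v3' v1e1e2v1' x5x5'x3x3'x1x1' v4e3e5v4' v2e2e3e4v2'
where $Y_i = v_i A_i v_i'$ if $i \le N$, and $Y_{N+i} = x_i x_i'$ if $i \le M$; here $A_i$ lists the edges incident to $v_i$.
H' = x4x4'x2x2' v5e5v5' v3v3' v1e1e2v1' x5x5'x3x3'x1x1' v4v4' v2e3e4v2'
corresponds to the optimal independent set {v3, v4}.
Input graph has a maximum IS of size K iff enbs(G,H) = K.
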
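
The construction on this slide can be replayed on the 5-vertex example. The sketch below rests on two assumptions that are not spelled out in the transcript: the edge list is read off from the blocks of H (e1 = v1v3, e2 = v1v2, e3 = v2v4, e4 = v2v5, e5 = v4v5), and the two copies in each pair are distinct genes written with a prime (v1 and v1', x1 and x1'). All function names are hypothetical, and both quantities are computed by brute force.

```python
from itertools import combinations, product


def build_instance(n_vertices, edges):
    """Build (G, H) as gene lists from vertices 1..N and a list of edges."""
    N, M = n_vertices, len(edges)
    assert (N + M) % 2 == 0, "the block ordering on the slide assumes N + M is even"
    G = []
    for i in range(1, N + 1):
        G += [f"v{i}", f"v{i}'"]
    for j in range(1, M + 1):
        G += [f"x{j}", f"e{j}", f"x{j}'"]

    def block(k):
        if k <= N:  # Y_k = v_k A_k v_k', with A_k the edges incident to v_k
            incident = [f"e{j}" for j, (a, b) in enumerate(edges, 1) if k in (a, b)]
            return [f"v{k}"] + incident + [f"v{k}'"]
        j = k - N   # Y_{N+j} = x_j x_j'
        return [f"x{j}", f"x{j}'"]

    # Odd-indexed blocks in decreasing order, then even-indexed blocks.
    order = list(range(N + M - 1, 0, -2)) + list(range(N + M, 0, -2))
    H = [gene for k in order for gene in block(k)]
    return G, H


def exemplars(genome):
    """All exemplar genomes (as tuples) keeping one copy of each gene."""
    positions = {}
    for i, gene in enumerate(genome):
        positions.setdefault(gene, []).append(i)
    return {tuple(genome[i] for i in sorted(choice))
            for choice in product(*positions.values())}


def nbs(g, h):
    """Count ordered adjacent gene pairs of g that also occur in h."""
    adj_h = set(zip(h, h[1:]))
    return sum(1 for pair in zip(g, g[1:]) if pair in adj_h)


def enbs_bruteforce(g, h):
    return max(nbs(x, y) for x in exemplars(g) for y in exemplars(h))


def max_independent_set_size(n_vertices, edges):
    """Size of a largest vertex subset containing no edge (brute force)."""
    for k in range(n_vertices, 0, -1):
        for subset in combinations(range(1, n_vertices + 1), k):
            chosen = set(subset)
            if not any(a in chosen and b in chosen for a, b in edges):
                return k
    return 0


edges = [(1, 3), (1, 2), (2, 4), (2, 5), (4, 5)]  # assumed from the blocks of H
G, H = build_instance(5, edges)
print("".join(H))
print(enbs_bruteforce(G, H), max_independent_set_size(5, edges))  # 2 2
```

Only a pair v_i v_i' can survive as a non-breaking point, and it survives exactly when every edge gene is deleted from the block of v_i, which is why the kept pairs form an independent set.
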
16
Positive Results

Our motivation came from [Nguyen, Tay and Zhang,
2005], who observed that for certain bacterial
genome pairs (Baphi-Wigg, Pmult-Hinft,
Ecoli-Styphi, Xaxo-Xcamp and Ypes), repeated
genes are usually pegged, e.g.,
xyxaba

17
Positive Results

Definition
occ(g,G) is the number of occurrences of g in
G.
span(g,G) is the maximum distance between two
copies of g in G.
$\mathrm{totalocc}(c,G) = \sum_{g \text{ in } G,\ \mathrm{span}(g,G) \ge c} \mathrm{occ}(g,G)$

18
Positive Results

Example. G = abcdaebd
span(a,G) = 4, span(b,G) = 5, span(d,G) = 4,
totalocc(4,G) = 6
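
The three quantities can be computed directly, which also confirms the example values. This sketch assumes that distance means the difference of positions in the string, that a gene with a single copy has span 0, and that totalocc sums over genes with span at least c (the reading that reproduces totalocc(4,G) = 6).

```python
def occ(g, genome):
    """Number of occurrences of gene g in the genome."""
    return genome.count(g)


def span(g, genome):
    """Distance between the first and last copy of g (0 if at most one copy)."""
    idx = [i for i, x in enumerate(genome) if x == g]
    return idx[-1] - idx[0] if idx else 0


def totalocc(c, genome):
    """Total occurrences of all genes whose span is at least c."""
    return sum(occ(g, genome) for g in set(genome) if span(g, genome) >= c)


G = "abcdaebd"
print(span("a", G), span("b", G), span("d", G))  # 4 5 4
print(totalocc(4, G))                            # 6
```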

19
Positive Results

Theorem 2. Let G and H be two genomes with
t = totalocc(1,G) + totalocc(c,H), for a constant
c. Then enbs(G,H) can be computed in
$O(3^{t/3} n^{c+2+\epsilon})$ time.

20
Positive Results

Idea 1: Given an exemplar genome G and another
genome H satisfying span(g,H) ≤ c for every g in
H, we can use divide and conquer to compute
enbs(G,H) in $O(n^{c+2+\epsilon})$ time.
Roughly speaking, H = H1 H2 H3 with |H2| = c; then
enumerate all solutions on H2 and recurse.
$T(n) \le 2^{c+1} \cdot 2T(n/2 + c) + O(n) = O(n^{c+2+\epsilon})$

21
Positive Results

Idea 2: As t is considered a constant, we
enumerate all possibilities for deleting
duplicated genes in G (to obtain G') and for
deleting genes with span greater than c in H (to
obtain H'). By Lemma 6, there are at most
$4 \cdot 3^{t/3}$ such combinations. Therefore, the total
running time is $4 \cdot 3^{t/3} \cdot O(n^{c+2+\epsilon}) =
O(3^{t/3} n^{c+2+\epsilon})$.
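
The enumeration step of Idea 2 can be sketched as follows. The reading assumed here is that "deleting" a far-apart gene means keeping exactly one of its copies and removing the rest, while genes below the span threshold are left untouched for the divide-and-conquer step; the generator name `keep_choices` is hypothetical. The number of combinations is the product of the occurrence counts of the affected genes.

```python
from itertools import product


def keep_choices(genome, min_span):
    """Yield every genome obtained by keeping exactly one copy of each gene
    whose span is at least min_span; all other genes are left untouched."""
    positions = {}
    for i, gene in enumerate(genome):
        positions.setdefault(gene, []).append(i)
    far = {g: idx for g, idx in positions.items() if idx[-1] - idx[0] >= min_span}
    genes = list(far)
    for choice in product(*(far[g] for g in genes)):
        kept = dict(zip(genes, choice))  # gene -> index of the surviving copy
        yield "".join(x for i, x in enumerate(genome)
                      if x not in kept or kept[x] == i)


G = "abcadcefg"
H = "cfegcdabf"
print(list(keep_choices(G, 1)))  # the 4 exemplar genomes of G
print(list(keep_choices(H, 5)))  # only f has span >= 5 in H: 2 choices
```

With min_span = 1 every duplicated gene is reduced to one copy, so the output is exactly the set of exemplar genomes; a larger threshold leaves nearby duplicates in place.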

22
Positive Results

Theorem 3. Let G and H be two genomes with a
total of t genes g satisfying shift(g,G,H) > c,
for some constant c. Then enbs(G,H) can be
computed in $O(3^{t/3} n^{2c+1+\epsilon})$ time.
Example. G = abcadef
H = bcedefad
shift(a,G,H) = 6
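
The transcript does not define shift. One reading that reproduces the example value is the largest difference between the position of a copy of g in G and the position of a copy of g in H; the sketch below implements that assumed definition only, and it should not be taken as the definition used in the talk.

```python
def shift(g, G, H):
    """Assumed definition: largest position difference between a copy of g
    in G and a copy of g in H."""
    pos_g = [i for i, x in enumerate(G) if x == g]
    pos_h = [i for i, x in enumerate(H) if x == g]
    return max(abs(i - j) for i in pos_g for j in pos_h)


print(shift("a", "abcadef", "bcedefad"))  # 6
```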

23
Conclusion

1. We introduce non-breaking similarity, which is
the complement of the famous breakpoint distance,
for genome comparison.
2. The general exemplar non-breaking similarity
problem is hard to approximate.
3. For some special cases, we can obtain
polynomial-time solutions.
