Nonbreaking Similarity of Genomes with Gene Repetitions

1
Non-breaking Similarity of Genomes with Gene
Repetitions
  • Binhai Zhu
  • Computer Science Department, Montana State
    University
  • Joint work with Zhixiang Chen, Bin Fu, Jinhui
    Xu, Boting Yang and Zhiyu Zhao

2
Background
  • Computing the genomic distance between genomes is
    important in evolutionary molecular biology. The
    problem was first studied by Sturtevant and
    Dobzhansky in 1936.
  • A lot of research has been done on computing
    genomic distances since 1990, assuming that each
    gene appears in a genome once, e.g., the famous
    result by Hannenhalli and Pevzner on sorting
    signed permutations by reversals.

3
Background (cont.)
  • On the other hand, gene repetition is very common
    in genomes. So computing genomic distances with
    gene repetition is a more realistic problem.
  • This is a typical optimization problem, so it
    makes sense to study its approximability.

4
Definitions
  • Given a set F of n gene families (an alphabet),
    a genome G is a sequence of elements of F such
    that each element has a (+/-) sign.
  • Example. F = {a, b, c, d},
  • G = -bd-cab-d-c
  • We will focus on unsigned sequences in this work.
  • A genome G is said to be exemplar if every gene
    appears exactly once in G.

5
Definitions (cont.)
  • Given exemplar genomes G and H over the same
    set of gene families, if the gene pair ab is a
    substring of G but not of H, then ab constitutes
    a breakpoint in G.
  • Example. G = abcdefg
  • H = efgdcab
  • There are 3 breakpoints in G (bc, cd and de), and
    symmetrically 3 in H (gd, dc and ca).
  • The number of breakpoints between G and H is
    called the breakpoint distance between G and H.
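
The definition can be checked mechanically. Below is a minimal Python sketch, assuming unsigned exemplar genomes written as strings with one character per gene. The names `adjacencies` and `breakpoints` are illustrative and do not come from the slides. Adjacencies are treated as ordered pairs, which is the reading that reproduces the count of 3 in the example.

```python
def adjacencies(genome):
    """Return the set of ordered adjacent gene pairs (substrings of length 2)."""
    return {genome[i:i + 2] for i in range(len(genome) - 1)}


def breakpoints(g, h):
    """Count adjacencies of g that do not occur as substrings of h.

    Assumes g and h are unsigned exemplar genomes over the same genes,
    so every adjacency occurs at most once in each genome.
    """
    return len(adjacencies(g) - adjacencies(h))


G = "abcdefg"
H = "efgdcab"
print(breakpoints(G, H))  # breakpoints in G with respect to H
print(breakpoints(H, G))  # the symmetric count in H
```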

6
Exemplar Breakpoint Distance Problem
  • Given two genomes G and H over n gene families,
    compute two exemplar genomes G' and H' (obtained
    from G and H by deleting redundant genes) such
    that the breakpoint distance between G' and H'
    is minimized.
  • We call this the exemplar breakpoint distance
    problem (between G and H). Denote this distance
    by eb(G,H) = b(G',H'), where b is the breakpoint
    distance.

7
Approximation Algorithms
  • Given a minimization (maximization) problem Π,
    let the optimal solution value of Π be OPT. An
    approximation algorithm A provides a performance
    guarantee of α for Π if for every instance of Π
    the solution value returned by A is at most α ×
    OPT (at least OPT/α).
  • Usually we say that A is a factor-α approximation
    for Π.

8
Prior Results (1)
  • We showed that the exemplar breakpoint distance
    problem does not admit any approximation unless
    P = NP (deciding whether eb(G,H) = 0 is
    NP-complete) [Chen, Fu and Zhu, 2006].
  • This result holds for any genomic distance d
    satisfying: G = H implies d(G,H) = 0.
  • Based on the above result, even under a weaker
    model of approximation, we showed that the
    exemplar conserved interval distance problem does
    not admit any weak approximation of a superlinear
    factor [Chen, Fowler, Fu and Zhu, 2007].

9
Prior Results (2)
  • On the other hand, for the exemplar breakpoint
    distance problem, Sankoff has used
    branch-and-bound [Sankoff, 1999], and Nguyen, Tay
    and Zhang [2005] have used divide-and-conquer on
    practical datasets to obtain good empirical
    results.
  • As a related but slightly different effort,
    Chauve, et al. [2006] studied exemplar
    genomic similarity problems that do not
    satisfy "G = H implies d(G,H) = 0", e.g., the
    exemplar common interval measure problem.

10
Background for this work
  • We try to look at the complement of the
    breakpoint distance under the gene duplication
    model.
  • As the problem is still hard to approximate, we
    follow Nguyen, et al. by considering genomes
    satisfying some practical conditions.

11
Definitions
  • Given exemplar genomes G and H drawn from the
    same alphabet, ab is a non-breaking point if ab
    appears in both G and H.
  • Example. G = abcdefg
  • H = fegcdab
  • We have two non-breaking points (ab and cd) in G
    and H. This number is called the non-breaking
    similarity of G and H, denoted as nbs(G,H).
  • Note that when |G| = |H| = n, if G = H, then
    nbs(G,H) = n-1.
  • Given genomes G and H drawn from the same
    alphabet, possibly with gene repetitions, the
    exemplar non-breaking similarity problem is to
    delete redundant genes to obtain exemplar genomes
    G' and H' such that nbs(G',H') is maximized. The
    corresponding measure is denoted as
    enbs(G,H).
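
As a quick check of the definition, here is a minimal Python sketch of nbs for unsigned exemplar genomes, assuming one character per gene and ordered adjacencies (the reading under which the example gives two non-breaking points). The function name `nbs` mirrors the slide's notation; the implementation itself is only an illustration.

```python
def nbs(g, h):
    """Number of ordered adjacencies shared by exemplar genomes g and h."""
    pairs_g = {g[i:i + 2] for i in range(len(g) - 1)}
    pairs_h = {h[i:i + 2] for i in range(len(h) - 1)}
    return len(pairs_g & pairs_h)


G = "abcdefg"
H = "fegcdab"
print(nbs(G, H))                # shared adjacencies: ab and cd
print(nbs(G, G) == len(G) - 1)  # identical genomes share all n-1 adjacencies
```

For exemplar genomes of length n, every adjacency of G is either a breakpoint or a non-breaking point, so the breakpoint count and nbs(G,H) add up to n-1. This is the sense in which the two measures are complements.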

12
Example
  • G = abcadcefg
  • H = cfegcdabf
  • We have 4 possible exemplar genomes for G:
    abcdefg, abdcefg, bcadefg, badcefg.
  • We have 4 possible exemplar genomes for H:
    cfegdab, cegdabf, fegcdab, egcdabf.
  • enbs(G,H) = nbs(abcdefg, fegcdab) = 2.
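
The example can be reproduced by brute force. The sketch below assumes unsigned genomes written as strings with one character per gene; `exemplars` and `enbs` are illustrative names. The enumeration is exponential in the number of repeated genes, so it is meant only for tiny inputs like this one.

```python
from itertools import product


def exemplars(genome):
    """Return all exemplar genomes obtained by keeping one copy of each gene."""
    positions = {}
    for i, gene in enumerate(genome):
        positions.setdefault(gene, []).append(i)
    results = set()
    # For every gene, choose which single occurrence survives.
    for kept in product(*positions.values()):
        results.add("".join(genome[i] for i in sorted(kept)))
    return results


def nbs(g, h):
    """Number of ordered adjacencies shared by exemplar genomes g and h."""
    pairs_g = {g[i:i + 2] for i in range(len(g) - 1)}
    pairs_h = {h[i:i + 2] for i in range(len(h) - 1)}
    return len(pairs_g & pairs_h)


def enbs(g, h):
    """Brute-force exemplar non-breaking similarity (exponential time)."""
    return max(nbs(a, b) for a in exemplars(g) for b in exemplars(h))


G = "abcadcefg"
H = "cfegcdabf"
print(sorted(exemplars(G)))
print(sorted(exemplars(H)))
print(enbs(G, H))
```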

13
Inapproximability Result
  • Theorem 1. Given an exemplar genome G and
    another genome H such that the genes are all
    from the same alphabet of size n and each gene
    appears in H at most two times, the Exemplar
    Non-breaking Similarity Problem over G and H
    does not admit any approximation of factor
    n^(1-ε), unless P = NP.
  • Proof idea: a linear reduction from Independent
    Set (IS).

14
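[Figure: input graph with vertices v1, ..., v5 and edges e1, ..., e5]
  • N = 5 vertices, M = 5 edges; N+M is even.
  • G = v1v1' v2v2' v3v3' v4v4' v5v5' x1e1x1'
    x2e2x2' x3e3x3' x4e4x4' x5e5x5'
  • H = Y_(N+M-1) Y_(N+M-3) ... Y_1 Y_(N+M) Y_(N+M-2) ... Y_2
    = x4x4' x2x2' v5e4e5v5' v3e1v3' v1e1e2v1'
    x5x5' x3x3' x1x1' v4e3e5v4' v2e2e3e4v2'
  • Y_i = v_i A_i v_i', if i ≤ N, where A_i lists the
    edges incident to v_i.
  • Y_(N+i) = x_i x_i', if i ≤ M.
  • H' = x4x4' x2x2' v5e5v5' v3v3' v1e1e2v1'
    x5x5' x3x3' x1x1' v4v4' v2e3e4v2' corresponds to
    the optimal independent set {v3, v4}.
  • Input graph has an IS of size K iff enbs(G,H) = K.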
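
The construction can be replayed on the slide's 5-vertex graph. The Python sketch below rests on two assumptions that are not explicit in the transcript: the second copy of each v_i and x_i is a distinct primed gene (v_i', x_i'), and the edge endpoints are read off the adjacency lists A_i that appear inside H. All function names are hypothetical. The script builds G and H, brute-forces enbs over the 2^M exemplar versions of H, and compares the result with the maximum independent set size.

```python
from itertools import combinations, product

# Edge endpoints inferred from the blocks v_i A_i v_i' in the slide's H.
EDGES = {"e1": (1, 3), "e2": (1, 2), "e3": (2, 4), "e4": (2, 5), "e5": (4, 5)}
N, M = 5, len(EDGES)


def build_genomes():
    """Build G and H as lists of gene names, following the slide."""
    g = []
    for i in range(1, N + 1):
        g += [f"v{i}", f"v{i}'"]
    for j in range(1, M + 1):
        g += [f"x{j}", f"e{j}", f"x{j}'"]

    def block(k):
        if k <= N:  # Y_k = v_k A_k v_k'
            incident = [e for e, ends in sorted(EDGES.items()) if k in ends]
            return [f"v{k}", *incident, f"v{k}'"]
        return [f"x{k - N}", f"x{k - N}'"]  # Y_(N+j) = x_j x_j'

    # Odd-indexed blocks in decreasing order, then even-indexed ones.
    order = list(range(N + M - 1, 0, -2)) + list(range(N + M, 0, -2))
    h = [gene for k in order for gene in block(k)]
    return g, h


def pairs(seq):
    """Set of ordered adjacent gene pairs in a genome given as a list."""
    return set(zip(seq, seq[1:]))


def enbs_exemplar_g(g, h):
    """Brute-force enbs when g is already exemplar: try every exemplar of h."""
    spots = {}
    for i, gene in enumerate(h):
        spots.setdefault(gene, []).append(i)
    best = 0
    for kept in product(*spots.values()):
        h_prime = [h[i] for i in sorted(kept)]
        best = max(best, len(pairs(g) & pairs(h_prime)))
    return best


def max_independent_set():
    """Size of a largest vertex set containing no edge (brute force)."""
    for size in range(N, 0, -1):
        for subset in combinations(range(1, N + 1), size):
            if not any(set(ends) <= set(subset) for ends in EDGES.values()):
                return size
    return 0


g, h = build_genomes()
print(enbs_exemplar_g(g, h), max_independent_set())
```

In this construction the only adjacencies that G and an exemplar H' can share are pairs v_i v_i', which appear in H' exactly when every edge gene is deleted from the block of v_i. Because each edge must survive once, the vertices chosen this way form an independent set.
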
16
Positive Results
  • Our motivation came from Nguyen, Tay and Zhang
    [2005], who observed that for certain bacterial
    genome pairs (Baphi-Wigg, Pmult-Hinft,
    Ecoli-Styphi, Xaxo-Xcamp and Ypes), repeated
    genes are usually pegged, e.g.,
  • xyxaba

17
Positive Results
  • Definition
  • occ(g,G) is the number of occurrences of g in
    G.
  • span(g,G) is the maximum distance between two
    copies of g in G.
  • totalocc(c,G) = Σ occ(g,G), summed over genes g
    in G with span(g,G) ≥ c.

18
Positive Results
  • Example. G = abcdaebd
  • span(a,G) = 4, span(b,G) = 5, span(d,G) = 4,
  • totalocc(4,G) = 6
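
The three quantities are easy to compute directly. The following Python sketch assumes genomes as strings with one character per gene, takes span to be 0 for a gene with a single copy (the slides do not state this case), and reads the threshold in totalocc as "span at least c", which matches the later use of totalocc(1,G) to count the duplicated genes of G.

```python
def occ(g, genome):
    """Number of occurrences of gene g in the genome."""
    return genome.count(g)


def span(g, genome):
    """Maximum distance between two copies of g.

    Returns 0 when g occurs at most once (an assumption; the slides
    do not cover this case).
    """
    positions = [i for i, gene in enumerate(genome) if gene == g]
    return positions[-1] - positions[0] if positions else 0


def totalocc(c, genome):
    """Sum of occ(g) over the distinct genes g with span(g) >= c."""
    return sum(occ(g, genome) for g in set(genome) if span(g, genome) >= c)


G = "abcdaebd"
print({g: span(g, G) for g in "abd"})
print(totalocc(4, G))
```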

19
Positive Results
  • Theorem 2. Let G and H be two genomes with
    t = totalocc(1,G) + totalocc(c,H), for a constant
    c. Then enbs(G,H) can be computed in
    O(3^(t/3) n^(c+2+ε)) time.

20
Positive Results
  • Idea 1: Given an exemplar genome G and another
    genome H satisfying span(g,H) ≤ c for every g in
    H, we can use divide and conquer to compute
    enbs(G,H) in O(n^(c+2+ε)) time.
  • Roughly speaking, H = H1 H2 H3 with |H2| = c; then
    enumerate all solutions on H2 and recurse.
  • T(n) ≤ 2^(c+1) · 2T(n/2 + c) + O(n) = O(n^(c+2+ε))
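
To see where the exponent comes from, here is a short derivation. It assumes the recurrence has the form T(n) ≤ 2^(c+1) · 2T(n/2 + c) + O(n); the operators are reconstructed from the slide, so the exact form should be treated as an assumption. Shifting the argument by 2c, the fixed point of n ↦ n/2 + c, removes the additive term:

```latex
\begin{align*}
T(n) &\le 2^{c+2}\, T\!\left(\tfrac{n}{2} + c\right) + O(n), \\
S(m) &:= T(m + 2c) \;\Longrightarrow\;
        S(m) \le 2^{c+2}\, S\!\left(\tfrac{m}{2}\right) + O(m), \\
S(m) &= O\!\left(m^{\log_2 2^{c+2}}\right) = O\!\left(m^{c+2}\right), \\
T(n) &= S(n - 2c) = O\!\left(n^{c+2}\right) \subseteq O\!\left(n^{c+2+\epsilon}\right).
\end{align*}
```

The third line is the master theorem with a = 2^(c+2), b = 2 and f(m) = O(m); the recursive term dominates because c + 2 > 1.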

21
Positive Results
  • Idea 2: As t is considered a constant, we
    enumerate all possibilities for deleting
    duplicated genes in G (to obtain G') and for
    deleting genes with span greater than c in H (to
    obtain H'). By Lemma 6, there are at most
    4·3^(t/3) such combinations. Therefore, the total
    running time is 4·3^(t/3) · O(n^(c+2+ε)) =
    O(3^(t/3) n^(c+2+ε)).
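
The counting behind the 3^(t/3) factor can be sanity-checked numerically. The sketch below illustrates the counting argument only and is not Lemma 6 itself. It assumes that each repeated gene keeps exactly one of its copies, so the number of combinations is the product of the occurrence counts, and that these counts are integers of at least 2 summing to t. The helper name `max_choices` is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def max_choices(t):
    """Largest product of integers >= 2 that sum to t (1 if t == 0).

    Each factor stands for the occurrence count of one repeated gene,
    so the product is the number of ways to keep one copy per gene.
    """
    if t == 0:
        return 1
    best = 0  # stays 0 when t cannot be split into parts >= 2 (t == 1)
    for first in range(2, t + 1):
        best = max(best, first * max_choices(t - first))
    return best


for t in (2, 3, 4, 6, 9):
    print(t, max_choices(t), 4 * 3 ** (t / 3))
```

The product is largest when the repeated genes come in groups of three copies, which is where the base 3 and the exponent t/3 come from.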

22
Positive Results
  • Theorem 3. Let G and H be two genomes with a
    total of t genes g satisfying shift(g,G,H) > c,
    for some constant c. Then enbs(G,H) can be
    computed in O(3^(t/3) n^(2c+1+ε)) time.
  • Example. G = abcadef
  • H = bcedefad
  • shift(a,G,H) = 6

23
Conclusion
  • We introduce non-breaking similarity, which is
    the complement of the famous breakpoint distance,
    for genome comparison.
  • The general exemplar non-breaking similarity
    problem is hard to approximate.
  • For some special cases, we can obtain
    polynomial-time solutions.