Combinatorial and Statistical Approaches in Gene Rearrangement Analysis

Title: Department of Computer Science and Engineering and the South Carolina Information Technology Institute Author: buell Last modified by: jtang


1
Combinatorial and Statistical Approaches in Gene
Rearrangement Analysis
  • Jijun Tang
  • Computer Science and Engineering
  • University of South Carolina
  • jtang_at_cse.sc.edu
  • (803) 777-8923

2
Outline
  • Background
  • Branch-and-Bound Algorithms for the Median
    Problem
  • Maximum Likelihood Methods for Phylogenetic
    Reconstruction
  • Post-Analysis
  • Conclusions

3
(No Transcript)
4
Simple Rearrangements
5
Phylogenetic Reconstruction
6
Rearrangement Phylogeny
7
(No Transcript)
8
(No Transcript)
9
Median Problem
Goal: find M so that $D_{AM} + D_{BM} + D_{CM}$ is minimized
NP-hard for most metric distances
10
Multichromosomal Reversal Median problem
  • To find a median genome that minimizes the
    summation of the multichromosomal HP distances on
    the three edges
  • Events considered: reversal, translocation,
    fusion, fission
  • Exact and heuristic solvers exist for the
    Unichromosomal Reversal Median Problem (reversals
    are the only events)

11
Capless Breakpoint Graph
  • Genome A → non-perfect matching M(A)
  • Let a, b be adjacent genes in A. Then (a^t, b^h) is
    an edge in M(A)
  • A genome is composed of a set of edges and ends.
  • Matchings naturally correspond to Undirected
    Genomes (Flipping of chromosomes does not alter
    matchings)

12
Example
  • Example Genomes
  • A = -5, 1, 6, 3 , 2, 4
  • B = 1, 6 , -5, -4, -3, -2

Adjacency Graph
13
Capless Breakpoint Graph
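(Figure: capless breakpoint graph of A and B, with B-ends and A-ends marked)
  • Let C(A,B) denote the number of cycles, |AB| the
    number of AB-paths, |AA| the number of AA-paths,
    and |BB| the number of BB-paths in G(A,B), and
    let n be the number of genes
  • n = 6, C(A,B) = 1, |AB| = 4
  • d_HP = 6 - 1 - 4/2 = 3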

14
A Lower Bound of the HP Distance
  • A simpler lower bound only contains genes,
    cycles, paths.
  • Derived from Hannenhalli, Pevzner 1995
  • dHP (A,B)n C(A,B) - AB/2 AA - BB
  • Pseudo-cycle of A and B

15
Pseudo-cycle distance Median Problem
  • Pseudo-cycle distance
  • Pseudo-cycle distance Median Problem (PMP): find
    a median genome that minimizes the sum of the
    pseudo-cycle distances on the three edges
  • We use the pseudo-cycle distance as a lower bound
    for the HP distance to derive an RMP solver

16
Branch-and-Bound Algorithm
  • Enumerate the solution genomes gene by gene
    (Genome Enumeration)
  • After enumerating a gene, compute an upper bound
    based on the partial solution genome
  • Bound: check whether the upper bound of the
    partial solution is less than a criterion
  • Branch:
  • If it is, the partial genome is discarded;
    enumerate another gene
  • Otherwise, update the criterion and continue
    enumeration
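
The enumerate, bound, and branch steps can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the solver from the talk: it uses the breakpoint distance on signed, linear, single-chromosome gene orders instead of the HP distance, and it minimizes a median score with a lower bound instead of maximizing pseudo-cycles with an upper bound. All function names are hypothetical.

```python
from itertools import permutations, product

def adjacencies(genome):
    """Adjacencies of a signed linear gene order; (a, b) equals (-b, -a)."""
    return {min((a, b), (-b, -a)) for a, b in zip(genome, genome[1:])}

def median_score(candidate, genomes):
    """Sum of breakpoint distances from candidate to every input genome."""
    adj = adjacencies(candidate)
    return sum(len(adj - adjacencies(g)) for g in genomes)

def branch_and_bound_median(genomes):
    n = len(genomes[0])
    adj_sets = [adjacencies(g) for g in genomes]
    best = {"score": float("inf"), "genome": None}

    def missing(a, b):
        # How many input genomes lack the adjacency (a, b).
        key = min((a, b), (-b, -a))
        return sum(key not in s for s in adj_sets)

    def extend(partial, used, bound):
        # Bound: the cost fixed so far can only grow, so it is a lower bound.
        if bound >= best["score"]:
            return                        # discard this partial genome
        if len(partial) == n:
            best["score"], best["genome"] = bound, list(partial)
            return                        # new criterion for later tests
        for gene in range(1, n + 1):      # branch: enumerate the next gene
            if gene in used:
                continue
            for signed in (gene, -gene):
                step = missing(partial[-1], signed) if partial else 0
                extend(partial + [signed], used | {gene}, bound + step)

    extend([], frozenset(), 0)
    return best["score"], best["genome"]

def brute_force_median(genomes):
    """Exhaustive check, feasible only for very small n."""
    n = len(genomes[0])
    return min(
        median_score([g * s for g, s in zip(perm, signs)], genomes)
        for perm in permutations(range(1, n + 1))
        for signs in product((1, -1), repeat=n)
    )

inputs = [[1, 2, 3, 4], [1, -3, -2, 4], [2, 1, 3, 4]]
score, genome = branch_and_bound_median(inputs)
print(score, genome)
```

On inputs this small the pruned search can be compared with brute_force_median; the bound test only pays off as n grows and the brute-force enumeration becomes infeasible.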

17
Genome Enumeration for Multichromosome Genomes
Genome Enumeration: for genomes on genes 1, 2, 3
(Figure: enumeration tree with branches labeled 2 and -2)
18
Features
  • Main Components
  • Contraction Operation
  • Upper Bound on the number of pseudo-cycles
  • Genome enumeration
  • Extension of Caprara's method for unichromosomal
    genomes (1999)

19
Contraction Operation
  • Contraction of e = (a^t, b^h) on M(A): M(A)/e

20
Upper Bound on the Number of Pseudo-cycles
  • Let S be a genome and Z = {G1, G2, G3} a set of
    three input genomes
  • The maximal γ(S,Z) is denoted by γ*
  • Based on the triangle inequality, an upper bound on
    the number of pseudo-cycles can be derived

21
Notes
  • qn - γ* is the lower bound of the sum of
    pseudo-cycle distances between any S and each
    genome in Z = {G1, G2, G3}
  • Given an edge e, assume genome S contains e and
    maximizes γ(S,Z). Let Z' = {G1/e, G2/e, G3/e}, and
    assume S' maximizes γ(S',Z'). Then S = S' ∪ e

22
Upper Bound Test
  • In a step of the algorithm, the current partial
    solution is S_i = {e_1, e_2, ..., e_i}
  • The upper bound of γ(S,Z) for genomes containing S_i
    is the following
  • Let UB be the current upper bound
  • If UB_{S_i} < UB, then the best upper bound of the
    genomes containing S_i is worse than UB

23
Branch-and-Bound Algorithm for Multichromosomal
Genomes
  • Compute an initial Upper Bound (UB) from the
    input genomes.
  • In each step, either an end or an edge is fixed
    in the solution.
  • End Fixing: mark a node as an end of a
    chromosome.
  • Edge Fixing: fix an edge e to the current partial
    solution genome S_i.

24
Genome Enumeration for Multichromosome Genomes
Genome Enumeration: for genomes on genes 1, 2, 3
(Figure: enumeration tree with branches labeled 2 and -2)
  • Red line: end fixing
  • Black line: edge fixing

25
Properties
  • Can be extended to compute a given tree using
    iterative or progressive approaches
  • However, median computation is still difficult
  • Large nuclear genomes
  • Complex events
  • We also need to search for the best tree in the
    large tree space
  • N species
  • 20 species

26
Statistical Approaches
  • Combinatorial approaches are the focus of genome
    rearrangement research
  • Only one MCMC method exists
  • Maximum Likelihood methods have been very popular
    in sequence phylogenetic analysis
  • Bootstrapping (data resampling) is a popular
    method to assess quality of obtained trees
  • Hard to directly apply ML and bootstrapping to
    gene order

27
Sequence ML Phylogeny
  • For each position, generate all possible tree
    structures
  • Based on the evolutionary model, calculate
    likelihood of these trees and sum them to get the
    column likelihood
  • Calculate tree likelihood by multiplying the
    likelihood for each position
  • Choose tree with the greatest likelihood

28
Example
A acgcaa
B acataa
C atgtca
D gcgtta
29
All Possible Evolutionary Paths (Column 1)
(Figure: tree for column 1; each of three nodes can take any of the states a, c, g, t)
30
Likelihood for One Path
(Figure: tree for one path, with node labels a, a, a, g)
31
Sum of All Paths (Column 1)
(Figure: tree for column 1, summing over the states a, c, g, t at each of three nodes)
32
Whole Sequence
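
The column-by-column computation can be sketched in a few lines. This is a minimal sketch under assumptions that are not in the slides: a fixed rooted tree ((A,B),(C,D)), the Jukes-Cantor model, a single branch length t = 0.1 on every edge, and a uniform prior at the root. The function names are hypothetical. It sums over all state assignments of the three internal nodes for each column, then combines the columns by adding logarithms, which is equivalent to multiplying the column likelihoods.

```python
import math
from itertools import product

STATES = "acgt"

def jc69(x, y, t):
    """Jukes-Cantor probability that state x becomes y along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def column_likelihood(a, b, c, d, t=0.1):
    """Sum over all states of the root and the two inner nodes of ((A,B),(C,D))."""
    total = 0.0
    for root, x, y in product(STATES, repeat=3):
        path = 0.25                                   # uniform prior at the root
        path *= jc69(root, x, t) * jc69(root, y, t)   # root to inner nodes
        path *= jc69(x, a, t) * jc69(x, b, t)         # inner node x to leaves A, B
        path *= jc69(y, c, t) * jc69(y, d, t)         # inner node y to leaves C, D
        total += path
    return total

def log_likelihood(seqs, t=0.1):
    """Log-likelihood of the whole alignment: sum of the column log-likelihoods."""
    columns = zip(seqs["A"], seqs["B"], seqs["C"], seqs["D"])
    return sum(math.log(column_likelihood(*col, t=t)) for col in columns)

# The four example sequences from the slides.
alignment = {"A": "acgcaa", "B": "acataa", "C": "atgtca", "D": "gcgtta"}
print(log_likelihood(alignment))
```

Brute-force summation costs 4^3 terms per column here and grows exponentially with the number of internal nodes; real ML programs use a dynamic-programming pass over the tree instead.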
33
MLBE
  • Convert the gene-orders into binary sequences
    based on adjacencies
  • Convert the binary sequences into protein or DNA
    sequence
  • Use RAxML to compute a ML tree on the sequences
  • Binary encoding was used before for parsimony
    analysis, with reasonable results
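
The adjacency-based encoding can be sketched as follows. This is a minimal sketch, assuming signed, linear, single-chromosome gene orders with no end caps; the function names and the 0 to A, 1 to C mapping used to turn the binary strings into DNA are hypothetical choices for illustration, since the slides do not give the mapping.

```python
def adjacency_set(genome):
    """Adjacencies of a signed linear gene order; (a, b) equals (-b, -a)."""
    return {min((a, b), (-b, -a)) for a, b in zip(genome, genome[1:])}

def binary_encode(genomes):
    """Map {name: gene order} to {name: 0/1 string}, one column per adjacency."""
    sets = {name: adjacency_set(g) for name, g in genomes.items()}
    columns = sorted(set().union(*sets.values()))
    encoded = {
        name: "".join("1" if adj in s else "0" for adj in columns)
        for name, s in sets.items()
    }
    return columns, encoded

def to_dna(bits):
    """Hypothetical mapping of a binary string onto two nucleotides."""
    return bits.replace("0", "A").replace("1", "C")

genomes = {"G1": [1, 2, 3, 4], "G2": [1, -3, -2, 4]}
columns, encoded = binary_encode(genomes)
for name, bits in encoded.items():
    print(name, bits, to_dna(bits))
```

Each column records the presence or absence of one adjacency, so the two genomes above share a 1 only in the column for the adjacency between genes 2 and 3, which the inversion leaves intact.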

34
Binary Encoding
35
MLBE Sequences
36
Experimental Setup
  • Generate random trees of N taxa
  • Each tree is equally likely
  • Birth-death model is preferred
  • Starting from the root, apply r events along each
    edge
  • r is the expected number of events
  • Actual number is a sample between 1 and 2r
  • Compare the inferred tree with the true tree
    using the RF rate
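
The tree comparison can be sketched as below. This is a minimal sketch, assuming trees are written as nested tuples of leaf names and that the RF rate is the Robinson-Foulds distance divided by 2(N - 3), the number of internal edges in two binary trees on N taxa; the slides do not state the normalization, and the function names are hypothetical.

```python
def clades(tree):
    """Return (leaf set, list of leaf sets below each node) for a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
        found.append(child_leaves)
    return leaves, found

def bipartitions(tree):
    """Non-trivial splits of the leaf set induced by the edges of the tree."""
    all_leaves, found = clades(tree)
    splits = set()
    for clade in found:
        other = all_leaves - clade
        if len(clade) > 1 and len(other) > 1:    # skip splits that isolate one leaf
            splits.add(frozenset([clade, other]))
    return splits

def rf_rate(inferred, true):
    """Symmetric difference of the split sets, normalized by 2(N - 3)."""
    a, b = bipartitions(inferred), bipartitions(true)
    n = len(clades(true)[0])
    return len(a ^ b) / (2 * (n - 3))

true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred_tree = ((("A", "C"), "B"), ("D", "E"))
print(rf_rate(inferred_tree, true_tree))
```

Splits are stored as unordered pairs of leaf sets, so the position of the root does not affect the result.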

37
(No Transcript)
38
Experimental Results (Equal Content 1)
80% inversion, 20% transposition
39
Experimental Results (Equal Content 2)
80% inversion, 20% transposition
40
Experimental Results (Unequal 1)
90% inversion, 10% del/ins/dup, 5-30 genes per
segment
41
Experimental Results (Unequal 2)
90% inversion, 10% del/ins/dup, 5-30 genes per
segment
42
Multistate Encoding
43
MLME Results (200 genes 20 genomes)
44
MLME Results (1000 genes 20 genomes)
45
Post Analysis
  • Bootstrapping has been widely used to assess the
    quality of sequence phylogeny
  • The same procedure is impossible for gene order
    data since there is only one character
  • We tested jackknifing on simulated data to
    determine:
  • Whether jackknifing is useful
  • The best jackknifing rate
  • The threshold for the support values

46
DNA bootstrapping
47
Bootstrapping Results
48
Jackknifing Procedure
  • Generate a new dataset by removing half of the
    genes from the original genomes (orders are
    preserved)
  • Compute a tree on the new dataset
  • Repeat K times and obtain K replicates
  • Obtain a consensus tree with support values
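
The last two steps can be sketched as a simple count over replicates. This is a minimal sketch, assuming each replicate tree is already reduced to a set of splits (any hashable representation) and that the consensus keeps splits above a percentage threshold, as in a majority-rule consensus; the slides do not say which consensus method was used, and the function names are hypothetical.

```python
from collections import Counter

def support_values(replicates):
    """Percentage of replicate trees that contain each split."""
    counts = Counter(split for tree in replicates for split in tree)
    k = len(replicates)
    return {split: 100.0 * count / k for split, count in counts.items()}

def consensus(replicates, threshold=50.0):
    """Splits whose support exceeds the threshold, with their support values."""
    support = support_values(replicates)
    return {split: value for split, value in support.items() if value > threshold}

# Four replicate trees on taxa A-E, written as sets of split labels.
replicates = [
    {"AB|CDE", "ABC|DE"},
    {"AB|CDE", "ABC|DE"},
    {"AC|BDE", "ABC|DE"},
    {"AB|CDE", "ABC|DE"},
]
print(support_values(replicates))
print(consensus(replicates, threshold=50.0))
```

A threshold above 50% guarantees that the retained splits are mutually compatible, so they always fit on one tree.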

49
An Example
  • 1 2 3 4 5 6 7 8 9 10
  • 1 -4 5 2 8 10 9 -7 -6 3

New Genomes
  • 1 3 5 7 9
  • 1 5 9 -7 3
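
The gene-removal step in this example can be sketched as follows. This is a minimal sketch, assuming every genome contains the same genes and that the same randomly chosen genes are removed from all genomes; the function names and the rate parameter are hypothetical.

```python
import random

def drop_genes(genomes, removed):
    """Delete the given genes from every genome, keeping order and signs."""
    return {
        name: [g for g in genome if abs(g) not in removed]
        for name, genome in genomes.items()
    }

def jackknife(genomes, rate=0.5, rng=random):
    """Remove a random fraction `rate` of the genes from all genomes."""
    genes = sorted({abs(g) for genome in genomes.values() for g in genome})
    removed = set(rng.sample(genes, int(round(rate * len(genes)))))
    return drop_genes(genomes, removed)

original = {
    "G1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "G2": [1, -4, 5, 2, 8, 10, 9, -7, -6, 3],
}
# Removing genes 2, 4, 6, 8, 10 reproduces the new genomes from the slide.
print(drop_genes(original, {2, 4, 6, 8, 10}))
print(jackknife(original, rate=0.5, rng=random.Random(0)))
```

Passing a seeded random.Random instance makes a replicate reproducible; each of the K replicates would use a different draw.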
50
Jackknifing Rate
51
Support Value Threshold - FP
Up to 90% of FP can be identified with 85% as the
threshold
52
Trees with FP
53
Support Value Threshold - FN
54
Low Support Branches
55
Jackknife Properties
  • Jackknifing is necessary and useful for gene
    order phylogeny, and a large number of errors can
    be identified
  • A 40% jackknifing rate is reasonable
  • 85% is a conservative threshold; 75% can also be
    used
  • Low support branches should be examined in detail

56
Conclusions
  • Great progress has been made in genome
    rearrangement research
  • We are able to handle real size data
  • Now the question is what data
  • Data quality and biological modeling
  • Ancestral genome reconstruction is still
    difficult
  • Putting everything together has just started

57
Thank You!