Paul Medvedev - PowerPoint PPT Presentation

About This Presentation
Title:

Paul Medvedev

Description:

Maximum Likelihood Genome Assembly Paul Medvedev Michael Brudno Bioinformatics Algorithms Presented by Md. Tanvir Al Amin, Md. Shaifur Rahman Khalid Mahmood – PowerPoint PPT presentation

Slides: 94
Provided by: MdTanvi

Transcript and Presenter's Notes

Title: Paul Medvedev


1
Maximum Likelihood Genome Assembly
  • Paul Medvedev
  • Michael Brudno

Bioinformatics Algorithms
Presented by Md. Tanvir Al Amin, Md. Shaifur Rahman,
and Khalid Mahmood
Department of Computer Science and Engineering
BUET
Some of the slides are taken from other sources
2
Computational Genomics
  • Our genome encodes an enormous amount of
    information about us:
  • our looks
  • our size
  • how our bodies work
  • ...
  • our health
  • our behaviors
  • who we are!

gcgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtg
tgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatg
agcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtag
acttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagct
gatgctagaggtcagtgactgatgatcgatgcatgcatggatgatgcagc
tgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgat
agctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgc
tagatcgtaggtagtagctagatgcagggataaacacacggaggcgagtg
atcggtaccgggctgaggtgttagctaatgatgagtacgtatgaggcagg
atgagtgacccgatgaggctagatgcgatggatggatcgatgatcgatgc
atggtgatgcgatgctagatgatgtgtgtcagtaagtaagcgatgcggct
gctgagagcgtaggcccg.
3
Contributions of the paper
  • The contributions are two-fold. The first:
  • The first exact polynomial-time algorithm for finding
    the shortest double-stranded genome, given its
    k-molecule spectrum
  • This problem was solved for strings, but
    remained open for molecules

4
Contributions of the paper
  • The second:
  • Argues against the shortest genome as the objective,
    because it over-collapses repeats
  • Proposes a new objective instead:
  • A maximum likelihood framework for assembling the
    genome that is most likely the source of the
    reads

5
Contributions of the paper
  • Maximum likelihood framework
  • Assumes perfect reads
  • Uniform distribution
  • Advantage of high coverage (NGS)
  • Estimate copy counts of repeats
  • Combine with matepair data
  • Reads -> Contigs

6
Outline
  • Whole Genome Shotgun Assembly
  • Review of Related Work
  • The Medvedev-Brudno Method
  • Bidirected Overlap Graph
  • Adjustments to the Standard Min-cost Biflow
    Problem
  • Maximizing the Global Read-Count Likelihood
  • Efficiently Solving a Min-cost Biflow
  • Flow to Contigs
  • Conflict node resolution
  • Results
  • Discussion

8
Whole Genome Shotgun Sequencing
[Figure: pipeline DNA -> SEQUENCER (Sanger vs. NGS) -> reads -> ASSEMBLER -> contigs -> FINISHING -> sequence]
  • Problems in Assembly
  • Sequencing Errors
  • Unknown Orientation
  • Incomplete Coverage
  • Repeats

9
Whole Genome Shotgun Sequencing
  • Break genome into shotgun-sized fragments and
    sequence
  • Match the overlapping regions of contiguous
    sequences
  • Demonstrated by Celera Genomics to be feasible
    for whole genome assembly
  • Sequenced human genome at 1/10th the cost of the
    public Human Genome Project

10
Whole Genome Assembly
  • Next Generation Sequencing (NGS)
  • Improved speed and cost-effectiveness relative to
    the other methods
  • but much shorter read length (25-200 bp)
  • Only proven on resequencing projects, i.e. where a
    reference genome is already available
  • Poses significant challenges for de novo genome
    assembly: determination of a completely unknown
    genome.

11
Assemblers
  • Previous (Sanger) Assemblers
  • NGS Assemblers
  • SSAKE (Warren et al. 2007)
  • VCAKE (Jeck et al. 2007)
  • SHARCGS (Dohm et al. 2007)
  • Shorty (Chen and Skiena 2007)
  • ALLPATHS (Butler et al. 2008)
  • Edena (Hernandez et al. 2008)
  • Euler-(U)SR (Chaisson and Pevzner 2008, 2009)
  • Velvet (Zerbino and Birney, 2008)

12
Outline
  • Whole Genome Shotgun Assembly
  • Review of Related Work
  • The Medvedev-Brudno Method
  • Bidirected Overlap Graph
  • Adjustments to the Standard Min-cost Biflow
    Problem
  • Maximizing the Global Read-Count Likelihood
  • Efficiently Solving a Min-cost Biflow
  • Flow to Contigs
  • Conflict node resolution
  • Results
  • Discussion

13
Theoretical view
  • Input: a set of strings over {A, C, G, T}, called reads
  • Output: a common superstring of the reads
  • TACAT, CATAC, ACGTAC -> TACATACGTAC
  • Initially: Shortest Common Superstring (SCS)
  • NP-hard [Gallant et al. 1980]
  • Over-collapsing of repeats
  • Can be found using a TSP solver
  • de Bruijn graphs [Pevzner, Tang, Waterman 2001]
  • String graphs [Myers 2005]
  • Both formulations are NP-hard

14
String graph (Myers)
  • Represent reads as vertices, and read overlaps as
    edges
  • Remove redundant edges
  • Establish edge constraints
  • Unique? (flow is exactly one)
  • Required? (min. flow is 1)
  • Optional? (min. flow is 0)
  • Find shortest walk

15
EULER assembler (Pevzner, Tang and Waterman)
  • Represent reads as edges and overlaps as vertices
    in a de Bruijn graph
  • Assembly can be efficiently solved as an Eulerian
    Path Problem: each edge must be visited exactly
    once
  • Repeats dealt with by using multiple edges for a
    single repeat read

16
Overlap Graph
  • Nodes are reads
  • Edges are overlaps
  • Weights are lengths of the non-overlapping prefix
  • A TSP tour gives the SCS
  • Example:
  • TACAT, CATAC, ACGTAC -> TACATACGTAC
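
The overlap graph for the slide's three reads can be sketched as follows. This is a minimal illustration, not the presenters' code: the function names are hypothetical, the search is brute force over all read orders (feasible only for a handful of reads), and it assumes no read is contained in another. For these reads the search returns a 10-character superstring, one character shorter than the 11-character string on the slide; both follow the same cyclic read order, and the shorter one is a small instance of the over-collapsing that the SCS objective causes.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(reads):
    """Edge (a, b) is weighted by the length of a's non-overlapping prefix."""
    return {(a, b): len(a) - overlap(a, b)
            for a in reads for b in reads if a != b}


def merge_in_order(order):
    """Spell the superstring obtained by visiting the reads in this order."""
    result = order[0]
    for prev, nxt in zip(order, order[1:]):
        result += nxt[overlap(prev, nxt):]
    return result


def shortest_superstring(reads):
    """Brute-force search over all read orders (Hamiltonian paths)."""
    return min((merge_in_order(p) for p in permutations(reads)), key=len)


READS = ["TACAT", "CATAC", "ACGTAC"]
GRAPH = overlap_graph(READS)
SLIDE_ORDER = merge_in_order(READS)      # the order used on the slide
BEST = shortest_superstring(READS)       # shortest over all orders
```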

17
Why Shortest CS?
  • DNA is full of repeats: identical and nearly
    identical copies that appear multiple times
  • The Alu repeat is about 300 bases long and is present
    about 1,000,000 times in the human genome
  • The SCS approach over-collapses the repeats: they
    are only present once in the answer
  • Solution: model repeats explicitly through either
    de Bruijn graphs or string graphs
  • Maybe this will also become tractable?

18
De Bruijn Graphs
AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC
  • Nodes are (k-1)-mers
  • Edges are k-mers
  • The set of k-mers is called a k-spectrum
  • Finding the shortest string with a given k-spectrum
    is equivalent to the Chinese Postman problem

Pevzner 1989
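
A minimal sketch of the construction above, using the eight 3-mers from this slide. The function names are hypothetical. For this particular spectrum the graph happens to contain an Eulerian path, so a plain Hierholzer traversal is enough; in general the Chinese Postman formulation is needed because some edges must be repeated.

```python
from collections import defaultdict


def de_bruijn(kmers):
    """Nodes are (k-1)-mers; each k-mer is an edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    in_deg = defaultdict(int)
    for successors in graph.values():
        for w in successors:
            in_deg[w] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more out-edge than in-edges, if there is one.
    start = next((v for v in sorted(nodes)
                  if len(graph.get(v, [])) - in_deg.get(v, 0) == 1),
                 min(nodes))
    remaining = {v: list(ws) for v, ws in graph.items()}
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if remaining.get(v):
            stack.append(remaining[v].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Concatenate the first node with the last character of each later node."""
    return path[0] + "".join(node[-1] for node in path[1:])


KMERS = ["AGC", "ATC", "ATT", "CAG", "CAT", "GCA", "TCA", "TTC"]
ASSEMBLED = spell(eulerian_path(de_bruijn(KMERS)))
```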
19
De Bruijn Graphs with Walks
AGC, ATTCA, CATT, GCAG, ATG
  • Nodes are (k-1)-mers
  • Edges are k-mers
  • Reads are walks
  • Goal: find a superwalk (one that includes all the walks)
  • No polynomial-time algorithm is known
  • De Bruijn Superwalk is NP-hard

Pevzner et al 2001
20
Chinese Postman Tours
  • Solving Chinese Postman: an Eulerian tour is a
    solution
  • Eulerization: make a graph Eulerian
  • Can be done with min-cost flow
  • Unbalanced nodes are sources/sinks
  • Duplicate all edges used in the flow

[Figure: de Bruijn graph for AGC, ATTCA, CATT, GCAG, ATG with nodes TT, TC, AT, CA, GC, AG]
Pevzner 1989
21
DNA is not a String
AAC, ATT, CAA, CCA, GCC, TGC, TTG
[Figure: graph for the string ATTGCCAAC, read 5' to 3'; node label AA]
  • The shortest walk that visits every edge at least
    once (a Chinese postman tour) is the shortest
    string with the given k-spectrum [Pevzner 1989]

22
Complexity of CPT
Graph type    Complexity         Equivalent to
Undirected    Polynomial time    Matching
Directed      Polynomial time    Network flow
Mixed         NP-hard            -
Bidirected    Polynomial time    Bidirected flow
23
Modeling Double Strandedness
Kececioglu 91, Kececioglu-Myers 95
24
Modeling Double-Strandedness
  • How can two DNA molecules overlap?

[Figure: overlaps between the k-molecule AAC / GTT and the k-molecules CTT / AAG, TCG / CGA, and TGG / CCA; example molecule ATTGCCAAC, 5' to 3']
Kececioglu 1992
25
Walks in bidirected graphs
  • A walk has to match directions at each node.
  • Consider the node AA+/TT-.
  • Edge orientations correspond to strands
  • A path can use a node in both orientations

26
Rules for Matching Directions
  • When we walk through a node, we can
  • Come in using an in-arrow, then leave using an
    out-arrow
  • This is forward, so read the + strand, i.e. AA
    here
  • Come in using an out-arrow, then leave using an
    in-arrow
  • This is backward, so read the - strand, i.e. TT
    here.

27
Bidirected Graphs
  • So what does this walk correspond to?
  • GGCAAT
  • ATTGCC

28
Bidirected de Bruijn Graphs
  • The shortest walk that visits every edge at least
    once (a Chinese postman tour) is the shortest DNA
    molecule with the given k-spectrum

AAC, ATT, CAA, CCA, GCC, TGC, TTG
[Figure: bidirected de Bruijn graph for the molecule ATTGCCAAC, 5' to 3']
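
A small sketch of what "k-molecule" means computationally. The helper names are hypothetical, and the sequence is treated as linear rather than circular to keep the example short. Each k-mer is identified with its reverse complement by keeping the lexicographically smaller of the two, so a molecule and its reverse complement yield the same spectrum. Note that TTG and CAA are reverse complements of each other, so they count as two occurrences of one k-molecule.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """One representative per k-molecule: the smaller of the two strands."""
    return min(kmer, reverse_complement(kmer))


def k_molecule_spectrum(seq, k):
    """Multiset of k-molecules in a linear double-stranded sequence."""
    return Counter(canonical(seq[i:i + k]) for i in range(len(seq) - k + 1))


FORWARD = k_molecule_spectrum("ATTGCCAAC", 3)
REVERSE = k_molecule_spectrum(reverse_complement("ATTGCCAAC"), 3)
```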
29
Representing Bidirected graphs
30
Motivation Overlap Graphs
  • Several downsides of the de Bruijn approach
  • Division into k-mers arbitrary
  • Very sensitive to sequencing errors
  • Not memory efficient (one node per k-mer)
  • Goal
  • One node per read (or better)
  • No division into k-mers
  • Flexibility in the presence of sequencing errors

Myers 2005
31
How To Build An Overlap Graph (1)
TACATACGTAC
  • ACGTAC, CATAC, TACAT
  • Nodes are reads
  • Edges are overlaps
  • Weights are lengths of non-overlapping prefix
  • Transitively inferable overlaps

[Figure: overlap graph of the three reads with edge weights 3, 5, 3, 2, 2]
32
Bidirected Overlap Graph
  • In this work, the authors use a bidirected
    overlap graph.
  • In a bidirected overlap graph, each vertex is a
    double-stranded read
  • Edges represent read overlaps

33
Bidirected Overlap Graph
  • Three possible ways that two double-stranded
    reads can overlap (corresponding to the three types
    of edges)
  • Suppose we have two reads r1 and r2
  • Each read can be oriented to the left or to the
    right
  • The three possible overlaps are
  • i) Both reads point in the same direction (both
    can point left, or both can point right; it is
    the same overlap either way)
  • ii) r1 points left and r2 points right
  • iii) r1 points right and r2 points left

34
Bidirected Overlap Graph
  • The overlap graph is constructed by placing an
    edge between two reads if they overlap by at
    least a minimum number of characters, $o_{min}$
  • Question: how is $o_{min}$ determined?
  • Then perform transitive edge reduction: remove
    overlaps that can be inferred from two other
    overlaps
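
A rough sketch of overlap detection between two double-stranded reads with a minimum overlap threshold. The function names, the dictionary keys, and the example reads are all hypothetical; the sketch checks exact suffix-prefix matches in each relative orientation and ignores sequencing errors and transitive reduction.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def suffix_prefix_overlap(a, b, o_min):
    """Longest k >= o_min such that the last k bases of a equal the first k of b."""
    for k in range(min(len(a), len(b)), o_min - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def bidirected_overlaps(r1, r2, o_min):
    """Overlap length for each relative orientation of the two reads (0 = no edge)."""
    rc1, rc2 = reverse_complement(r1), reverse_complement(r2)
    return {
        "r1 -> r2": suffix_prefix_overlap(r1, r2, o_min),        # same direction
        "r2 -> r1": suffix_prefix_overlap(r2, r1, o_min),        # same direction
        "r1 -> rc(r2)": suffix_prefix_overlap(r1, rc2, o_min),   # opposite directions
        "rc(r1) -> r2": suffix_prefix_overlap(rc1, r2, o_min),   # opposite directions
    }


SAME = bidirected_overlaps("ATTGCC", "GCCAAC", 3)
FLIPPED = bidirected_overlaps("ATTGCC", "GTTGGC", 3)  # second read given on the other strand
```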

35
Observation
  • A bidirected graph contains an Eulerian circuit
    if and only if it is connected and balanced.

36
Chinese postman Problem on Bidirected Graphs
37
Chinese postman Problem on Bidirected Graphs
  • Let G be a weighted bidirected graph. There
    exists a circuit of weight i if and only if there
    exists an Eulerian extension of weight i.
  • G has a circuit if and only if it is strongly
    connected.
  • The minimum weight Eulerian extension of G has at
    most $2|E||V|$ edges.

38
Chinese postman Problem on Bidirected Graphs
  • The running time of Algorithm 1 is
    $O(|E|^2 \log|V| \log|E|)$.
  • Gabow's algorithm runs in
    $O(|E|^2 \log|V| \log(\max_e u(e)))$
  • u is the flow upper bound function
  • $f(e) < 2|E||V|$ for every edge e,
  • so we can safely let $u(e) = 2|E||V|$

39
Chinese postman Problem on Bidirected Graphs
  • Hence the theorem is proved
  • Given a set of k-molecules S, we can find the
    shortest (k-1)-circular DNA molecule whose
    k-molecule spectrum is S in time
    $O(|S|^2 \log^2 |S|)$.
  • This is a polynomial time algorithm, explicitly
    handling the double-strandedness
  • The first main result of this paper.

40
Outline
  • Whole Genome Shotgun Assembly
  • Review of Related Work
  • The Medvedev-Brudno Method
  • Bidirected Overlap Graph
  • Adjustments to the Standard Min-cost Biflow
    Problem
  • Maximizing the Global Read-Count Likelihood
  • Efficiently Solving a Min-cost Biflow
  • Flow to Contigs
  • Conflict node resolution
  • Results
  • Discussion

41
Sequence assembly using NGS
  • Sequence assembly using NGS
  • Several methods available now (e.g. SSAKE, VCAKE,
    SHARCGS, etc.)
  • All of these assume that the length of the
    assembled genome must be minimized
  • Results in over-collapsing of repeats
  • Given ubiquity of repeats in eukaryotic genomes,
    authors considered this a poor assumption

42
Goal of an Assembler
  • What should the goal of an assembler be ??
  • Shortest string ??
  • Problem of over-collapse

43
Maximum Likelihood Genome Assembly
  • Change goal of sequence assembly
  • Maximize the likelihood that the resultant genome
    was the source of the given reads
  • Take advantage of the high coverage of NGS to
    statistically estimate the copy-count of each
    read: identify and quantify repeats
  • Maximizing the likelihood of observed read
    frequencies can be cast as a minimum-cost
    bidirected flow (biflow) problem
  • Allows the solution to be obtained with an
    off-the-shelf network flow solver
  • Authors claim 99.99% accuracy

44
Maximum Likelihood Genome Assembly
  • Second important aspect is the use of matepair
    information for joining contigs
  • Other systems look for all paths between mated
    reads
  • The proposed method looks only for short paths
    between some pairs of reads
  • Question: how to decide the upper bound for these
    short paths? And how to decide which pairs of
    reads to examine?

45
Outline
  • Whole Genome Shotgun Assembly
  • Review of Related Work
  • The Medvedev-Brudno Method
  • Bidirected Overlap Graph
  • Adjustments to the Standard Min-cost Biflow
    Problem
  • Maximizing the Global Read-Count Likelihood
  • Efficiently Solving a Min-cost Biflow
  • Flow to Contigs
  • Conflict node resolution
  • Results
  • Discussion

46
Adjustments to the Standard Min-cost Biflow
Problem
  • Standard Min-cost Biflow Problem
  • Set upper and lower flow bounds on each edge
  • Flow function f: E -> N must obey the bound
    constraint for each edge e
  • For each vertex, the incoming flow is balanced
    with the outgoing flow
  • Objective: find the flow that minimizes the
    total cost

47
Adjustments to the Standard Min-cost Biflow
Problem
  • Medvedev-Brudno Min-cost Biflow Problem
  • Upper and lower flow bounds on vertices as well
  • Accomplished by splitting every vertex v into
    two:
  • v+ and v-

48
Adjustments to the Standard Min-cost Biflow
Problem
  • v- serves as the incoming vertex, and inherits
    v's incoming edges
  • v+ serves as the outgoing vertex, and inherits
    v's outgoing edges
  • Finally add one edge between v- and v+ and assign
    it the upper and lower flow bounds for v
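
The vertex-splitting step can be sketched as below. To keep the example short it uses an ordinary directed graph rather than a bidirected one, and the edge tuple layout and the function name are assumptions made for illustration.

```python
import math


def split_vertices(edges, vertex_bounds):
    """Turn vertex bounds into edge bounds.

    edges: list of (u, v, lower, upper) directed edges.
    vertex_bounds: {vertex: (lower, upper)}.
    Each vertex v becomes (v, "-") for incoming edges and (v, "+") for
    outgoing edges, joined by one edge that carries v's bounds.
    """
    new_edges = []
    for u, v, lower, upper in edges:
        # The original edge now leaves u's outgoing copy and enters v's incoming copy.
        new_edges.append(((u, "+"), (v, "-"), lower, upper))
    for v, (lower, upper) in vertex_bounds.items():
        new_edges.append(((v, "-"), (v, "+"), lower, upper))
    return new_edges


SPLIT = split_vertices(
    edges=[("a", "b", 0, math.inf)],
    vertex_bounds={"a": (1, math.inf), "b": (1, 3)},
)
```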

49
Adjustments to the Standard Min-cost Biflow
Problem
  • Second variation: represent the cost $c_e$ as a
    convex function
  • A function is convex if the set of points on or
    above its graph is a convex set
  • A convex set refers to an area where, for every
    pair of points within that area, every point on
    the straight line segment connecting those points
    also lies within that area

50
Convex Function
51
Adjustments to the Standard Min-cost Biflow
Problem
  • An area that is not convex would have some sort
    of concave portion that would contradict the
    above property of convex sets
  • In the overlap graph, convex functions are
    modelled with piecewise-linear approximations,
    allowing the flow to be solved as a linear
    min-cost flow problem
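
One standard way to linearize a convex cost for integer flows, shown here as a sketch under assumptions: an edge with convex cost c(f) is replaced by parallel unit-capacity arcs whose costs are the marginal costs c(k) - c(k-1). Convexity makes these marginals non-decreasing, so a min-cost flow fills the arcs in order and pays exactly c(f) - c(0). The slides do not say which breakpoints the authors used; the function names and the quadratic example cost are hypothetical.

```python
def unit_arc_costs(cost, max_flow):
    """Marginal cost of the k-th unit of flow, for k = 1..max_flow."""
    return [cost(k) - cost(k - 1) for k in range(1, max_flow + 1)]


def cheapest_cost(arc_costs, flow):
    """A min-cost flow sends its units through the cheapest parallel arcs."""
    return sum(sorted(arc_costs)[:flow])


def example_cost(f):
    """A hypothetical convex cost with its minimum at f = 3."""
    return (f - 3) ** 2


ARCS = unit_arc_costs(example_cost, 6)
```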

52
Adjustments to the Standard Min-cost Biflow
Problem
  • Supersource and supersink added to convert flow
    problem into circulation problem
  • Each vertex has a lower bound of 1, since each
    read must appear in the finished genome at least
    once
  • Edge bounds are set to 0 (lower bound) and
    infinity (upper bound)

53
Adjustments to the Standard Min-cost Biflow
Problem
  • Prohibitively large cost on the edge leading from
    the supersource and the edge leading to the
    supersink to ensure that the assembly uses the
    smallest number of contigs possible
  • Flow through each vertex represents number of
    times it appears in the assembled genome

54
Supersource and Supersink
55
Maximum Likelihood Framework
  • Let D be a circular genome of length N(D)
  • $d_i$ = number of times the k-molecule i appears
    in D
  • Suppose
  • i = ACGT

For simplicity, they are drawn as strings instead
of molecules
56
Maximum Likelihood Framework
  • Random trial
  • Sample a position and take a k-molecule
  • What is the probability that the k-molecule is i

For simplicity, they are drawn as strings instead
of molecules
57
Maximum Likelihood Framework
  • Sample uniformly
  • We call it a success if we get i
  • So p is the success probability
  • We do the experiment n times
  • Let $X_i$ be the random variable indicating the number of
    times we get i
  • What is the distribution of $X_i$?

Binomial distribution
58
Maximum Likelihood Framework
  • How many options for i?
  • There are $4^k$ possibilities.
  • Hence $4^k$ random variables.
  • Suppose k = 3
  • $X_1, X_2, X_3, X_4, X_5, X_6, \dots, X_{64}$
  • They are
  • $X_{AAA}, X_{AAC}, X_{AAG}, X_{AAT}, \dots, X_{TTT}$

59
Maximizing the Global Read-Count Likelihood
  • Taking all random variables over n experiments.
  • What is the probability that AAA comes $x_{AAA}$
    times, AAC comes $x_{AAC}$ times, ..., CGT
    comes $x_{CGT}$ times, and TTT comes $x_{TTT}$ times?
  • Each random variable for every possible k-mer has
    a binomial distribution. Their joint distribution
    is a multinomial distribution
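
The formula from the slide did not survive in this transcript. As a reference point, the standard multinomial probability mass function is written out below. The success probability $p_i = d_i / N(D)$ is an assumption based on the uniform sampling described earlier, and $m$ stands for the number of distinct k-molecules; the original slide may use different notation.

```latex
P(X_1 = x_1, \dots, X_m = x_m)
  = \frac{n!}{\prod_{i=1}^{m} x_i!} \, \prod_{i=1}^{m} p_i^{x_i},
\qquad p_i = \frac{d_i}{N(D)},
\qquad \sum_{i=1}^{m} x_i = n
```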

60
Maximum Likelihood Framework
  • D is not known, but the results of the n
    trials are known
  • The probability can be considered as the
    likelihood of the parameters of the distribution,
    $d_i$, given the outcome of the trials, $x_i$. This is
    called the
  • Global Read-count Likelihood

61
Maximizing the Global Read-Count Likelihood
  • Goal is to maximize L, or, equivalently, minimize
    the negative log of L
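
A worked form of the negative log-likelihood, assuming the multinomial model with success probabilities $p_i = d_i / N(D)$ (an assumption, since the slide's own formula is missing from this transcript). The first term does not depend on $D$, so only the sum matters for the minimization.

```latex
-\log L(D)
  = -\log \frac{n!}{\prod_{i} x_i!}
    \; - \; \sum_{i} x_i \log \frac{d_i}{N(D)}
```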

62
Maximizing the Global Read-Count Likelihood
  • To translate this problem into a convex min-cost
    biflow problem, we need convex functions $c_i$ for
    each k-mer
  • Problem: the $X_i$ random variables are not
    independent, because their values are constrained
    to sum to n
  • We need an objective that separates into one
    term per k-mer

63
Maximizing the Global Read-Count Likelihood
  • But, as the number of trials goes to infinity,
    the $X_i$ random variables become independent.
  • In NGS techniques, the number of trials is
    usually large enough to warrant the approximation
    of the multinomial distribution as the product of
    the binomial distributions for each $X_i$

64
Maximizing the Global Read-Count Likelihood
  • In this binomial approximation, the genome length
    N(D) is treated as a constant, independent of the
    sampling frequencies
  • Therefore, use N instead, which is the actual
    length of the genome

65
Maximizing the Global Read-Count Likelihood
  • New approximation of L
  • Now
  • And
  • $c_i$ is used as the convex cost function for the
    vertices of the min-cost biflow
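
A numerical sketch of a per-k-mer cost of this kind. It assumes the binomial model with success probability d / N, so that the cost is the negative log of $(d/N)^{x} (1 - d/N)^{n - x}$ with the constant binomial coefficient dropped. The exact expression on the original slide is missing from this transcript, and the function names and the numbers in the example are hypothetical.

```python
import math


def read_count_cost(d, x, n, genome_length):
    """Negative log-likelihood of seeing a k-mer x times in n trials
    if it occurs d times in a genome of the given length."""
    p = d / genome_length
    return -(x * math.log(p) + (n - x) * math.log(1 - p))


def best_copy_count(x, n, genome_length, max_copies):
    """Integer copy count in 1..max_copies with the lowest cost."""
    return min(range(1, max_copies + 1),
               key=lambda d: read_count_cost(d, x, n, genome_length))


# Hypothetical numbers: 5000 reads from a 1000 bp genome (5x coverage);
# a k-mer seen 10 times is most plausibly present in 2 copies.
COSTS = [read_count_cost(d, 10, 5000, 1000) for d in range(1, 21)]
BEST = best_copy_count(10, 5000, 1000, 20)
```

The cost is convex in d, which is what allows it to be used in a convex min-cost flow: the differences between consecutive entries of `COSTS` never decrease.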

66
Outline
  • Whole Genome Assembly
  • Review of Related Work
  • The Medvedev-Brudno Method
  • Methods Bidirected Overlap Graph
  • Methods Adjustments to the Standard Min-cost
    Biflow Problem
  • Methods Maximizing the Global Read-Count
    Likelihood
  • Methods Efficiently Solving a Min-cost Biflow
  • Methods Show Me the Contigs
  • Results
  • Discussion

67
Efficiently Solving a Min-Cost Biflow
  • Problem No existing efficient implementation of
    a min-cost biflow algorithm
  • Gabow (1983) presented a polynomial time
    algorithm for min-cost biflow, but
  • It is difficult to implement.
  • Authors did not find any existing implementation
    either
  • Authors solve by converting a bidirected flow
    into a directed flow problem.

68
Efficiently Solving a Min-Cost Biflow
  • Directed network flow is solved by reducing the
    problem to a linear program (LP)
  • Use an edge incidence matrix derived from
    the overlap graph
  • If cell (m, n) has a value of 1, then edge n is an
    in-edge for vertex m
  • If the value is -1, n is an out-edge
  • 0 means n and m are not on speaking terms (edge n
    is not incident to vertex m)
  • Use the incidence matrix as the constraint matrix for
    the LP: the optimal LP solution corresponds to a
    minimum-cost flow

69
Efficiently Solving a Min-Cost Biflow
  • The incidence matrix of a directed graph is Totally
    Unimodular (TU)
  • This leads to linear programs that always have integer
    solutions.
  • Makes it possible to produce an integral solution
    with LP, rather than resorting to Integer
    Programming, which is NP-hard

70
Efficiently Solving a Min-Cost Biflow
  • A 2 or -2 can appear in the incidence matrix of a
    bidirected graph, since both ends of an edge can
    enter (or leave) a single vertex
  • The incidence matrix is actually a binet matrix
  • The optimal LP solution for binet matrices
    is guaranteed to be half-integral (i.e. its
    values are multiples of 0.5)

Hochbaum 2004
71
Efficiently Solving a Min-Cost Biflow
72
Efficiently Solving a Min-Cost Biflow
  • Monotonization Procedure
  • For every vertex v in the bidirected graph,
    replace with two vertices v1 and v2 in the new
    graph
  • Each of v's in-edges is replaced with two edges,
    one of which points into v1, while the other
    points out of v2
  • Likewise, each of v's out-edges is replaced with
    two edges, one of which points out of v1, while
    the other points into v2
  • Bounds and costs from original graph are
    transferred to the new graph, and the solution of
    the new graph will be transferred to the original
    graph

Hochbaum 2004
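
A sketch of the edge rewriting described above. The edge representation is an assumption: a bidirected edge is a tuple (u, u_side, v, v_side), where each side is "in" or "out" depending on whether the edge is an in-edge or out-edge at that endpoint. The function name is hypothetical, and bounds and costs are left out for brevity.

```python
def monotonize(edges):
    """Replace every bidirected edge with two directed twin edges.

    Each vertex x becomes (x, 1) and (x, 2). An in-edge of x gets one
    twin pointing into (x, 1) and one pointing out of (x, 2); an
    out-edge gets one twin out of (x, 1) and one into (x, 2).
    """
    def tail(x, side):
        # Copy of x that a directed twin leaves from.
        return (x, 1) if side == "out" else (x, 2)

    def head(x, side):
        # Copy of x that a directed twin points into.
        return (x, 1) if side == "in" else (x, 2)

    directed = []
    for u, u_side, v, v_side in edges:
        directed.append((tail(u, u_side), head(v, v_side)))
        directed.append((tail(v, v_side), head(u, u_side)))
    return directed


ORDINARY = monotonize([("u", "out", "v", "in")])   # behaves like a directed edge u -> v
IN_IN = monotonize([("u", "in", "v", "in")])       # both ends point into their vertex
OUT_OUT = monotonize([("u", "out", "v", "out")])   # both ends point out of their vertex
```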
73
Efficiently Solving a Min Cost Flow
  • Problem can now be solved with off-the-shelf
    software
  • After finding the min cost flow in the directed
    graph, transfer the results to the original
    bidirected graph by adding the flows through the
    pairs of twin edges and dividing by two.
  • Hence, the optimal result is half integral and
    the monotonized flow is at worst a
    2-approximation to the optimal integral flow.

Hochbaum 2004
74
Outline
  • Whole Genome Assembly
  • Review of Related Work
  • The Medvedev-Brudno Method
  • Methods Bidirected Overlap Graph
  • Methods Adjustments to the Standard Min-cost
    Biflow Problem
  • Methods Maximizing the Global Read-Count
    Likelihood
  • Methods Efficiently Solving a Min-cost Biflow
    (Linear)
  • Methods Show Me the Contigs
  • Results
  • Discussion

75
Flow to Contigs
  • Flows have been solved,
  • Now, decompose it into a collection of walks,
    which translates into assembled contigs
  • Graph is first simplified by removing all edges
    with a flow of zero
  • Additional simplifications are possible ...

76
Flow to Contigs
  • by removing vertices v where
  • There is exactly one edge going into v and one
    edge leading out of v, and the flow on both edges
    is the same
  • Vertices where there is also a loop with the same
    flow as the other two edges, and
  • Split and join vertices, where the flow on the
    in-edges is the same as that on the out-edges

77
Flow to Contigs
  • After at most $2|V|$ of these simplifications, the
    remaining vertices are conflict vertices:
  • those that did not match the previous criteria

78
Conflict Node Resolution
  • Using matepair information
  • Look for edges at these vertices with opposite
    orientations supported by matepairs
  • Use BFS to find all reads within a certain
    distance from the vertex (in both directions)
  • We have two sets of vertices, L and R,
    corresponding to reads that were observed on the
    incoming side of a vertex and the outgoing side.
  • Match those reads that are matepairs.
  • For those matepairs where one read is on the
    incoming side and the other is on the outgoing
    side, find the shortest path between them using
    Dijkstra's algorithm
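
A minimal sketch of a distance-bounded shortest-path search of the kind described here. It works on an ordinary directed graph given as an adjacency dictionary; the function name, the graph layout, and the example weights are hypothetical, and the handling of bidirected edges is omitted.

```python
import heapq


def bounded_shortest_path(graph, source, target, max_dist):
    """Dijkstra's algorithm that gives up on paths longer than max_dist.

    graph: {node: [(neighbor, weight), ...]} with non-negative weights.
    Returns the shortest distance, or None if it exceeds max_dist.
    """
    best = {source: 0}
    heap = [(0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, []):
            new_dist = dist + weight
            # Never expand beyond the bound, which keeps the search local.
            if new_dist <= max_dist and new_dist < best.get(neighbor, float("inf")):
                best[neighbor] = new_dist
                heapq.heappush(heap, (new_dist, neighbor))
    return None


GRAPH = {
    "A": [("x", 4), ("y", 10)],
    "x": [("B", 5)],
    "y": [("B", 1)],
}
```

With this graph, `bounded_shortest_path(GRAPH, "A", "B", 10)` finds the path through `x`, while a bound of 8 reports that no sufficiently short path exists.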

79
Resolving Conflict Nodes with Mate Pairs
[Figure: reads A and B on opposite sides of a conflict node]
Does there exist a short path between A and B?
  • Bounded Dijkstra's shortest path algorithm
  • Greedily join edges if they have enough
    supporting reads.

80
Greedy Matching
  • Make note of the number of mates that fall within
    the expected insert distance
  • Pairs of in/out edges that have a significant
    number of matepairs that fall within the insert
    distance are joined into a common edge
  • The previous step is repeated until no more edges
    can be joined in this manner
  • Graph simplification continues in iterative
    phases until convergence

81
Outline
  • Whole Genome Assembly
  • Review of Related Work
  • The Medvedev-Brudno Method
  • Methods Bidirected Overlap Graph
  • Methods Adjustments to the Standard Min-cost
    Biflow Problem
  • Methods Maximizing the Global Read-Count
    Likelihood
  • Methods Efficiently Solving a Min-cost Biflow
    (Linear)
  • Methods Show Me the Contigs
  • Results
  • Discussion

82
Results
  • Generated synthetic reads from the E. coli genome,
    which has a total length of 4.6 Mbp.
  • Simulated matepair distances were uniformly
    distributed within 10% of the expected insert
    size
  • Reads were 25 bp long, and error-free

83
Results
  • Coverage levels tested: 50x, 75x, 100x, and 200x
  • Minimum overlap length varied between 17 and 21
  • Authors claim that the overall running time of the
    algorithm is approximately 1 hour on one machine
  • Question: what kind of machine?

84
Copy Count Results
  • Authors compared the flow going through every
    vertex in the overlap graph to the number of
    times that the corresponding read appears.

85
Read Count Results
  • Compared vertex flow with read frequency in the
    original genome
  • High degree of accuracy
  • Error rate between $10^{-4}$ and $10^{-6}$
  • Generally a greater tendency to overestimate read
    frequency
  • Authors claim only slight improvements beyond 75x
    coverage
  • but 200x coverage is fantastically good

86
Assembly Results
  • Take the edges of the graph produced after the
    conflict node resolution and generate the
    sequence it spells out
  • Compute N50: the length of the shortest contig
    such that 50% of the genome lies in longer contigs
  • Also compute N90: similar to N50, but the cutoff
    is 90%
  • Finally, compute errors by aligning each contig
    to the reference genome and seeing how many local
    alignments it takes to completely tile the contig
    (minus one, because it always takes at least one
    alignment to do it)
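
A short sketch of the N50 and N90 computation. The function name is hypothetical, and the fraction is measured against the total contig length rather than the reference genome length, which is an assumption; the two agree only when the contigs cover the genome exactly once.

```python
def n_stat(contig_lengths, fraction):
    """Length of the shortest contig in the smallest set of longest contigs
    that together cover at least `fraction` of the total contig length."""
    total = sum(contig_lengths)
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= fraction * total:
            return length
    return 0


# Hypothetical contig lengths, total 290.
CONTIGS = [80, 70, 50, 40, 30, 20]
N50 = n_stat(CONTIGS, 0.5)   # 80 + 70 = 150 >= 145
N90 = n_stat(CONTIGS, 0.9)   # 80 + 70 + 50 + 40 + 30 = 270 >= 261
```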

87
Assembly Results
N50 Results
N90 Results
88
Assembly Results (contd)
  • Length of contigs that contain 50% of the genome
    varied between 23-28 kb
  • Length of contigs that contain 90% of the genome
    varied between 7-8 kb
  • N50 error rate: 1 per 100-180 kb
  • N90 error rate: 1 per 100-160 kb
  • Greedy algorithm can be fooled by several strong
    edge matches
  • Contig size is good relative to other whole
    genome assemblies involving small read sizes

89
Outline
  • Whole Genome Assembly
  • Review of Related Work
  • The Medvedev-Brudno Method
  • Methods Bidirected Overlap Graph
  • Methods Adjustments to the Standard Min-cost
    Biflow Problem
  • Methods Maximizing the Global Read-Count
    Likelihood
  • Methods Efficiently Solving a Min-cost Biflow
    (Linear)
  • Methods Show Me the Contigs
  • Results
  • Discussion

90
Discussion
  • Demonstrated that bidirected flow is a powerful
    method for genome assembly.
  • Introduced a maximum likelihood framework for
    sequence assembly
  • By unifying Pevzner's work on de Bruijn graphs,
    Kececioglu and Myers' work on bidirected graphs
    in assembly, and Edmonds and Gabow's work on
    bidirected flow,
  • the paper gives an exact polynomial time assembly
    algorithm in the parsimony setting, explicitly
    dealing with double-strandedness.

91
Discussion
  • First major assumption: reads are error-free
  • Can be overcome with higher coverage
  • Second major assumption: uniform sampling of all
    genomic regions
  • Reality: certain portions of the genome are
    easier to sample than others
  • More difficult to overcome
  • Could be overcome by establishing the biases of
    the sequencing apparatus used

92
Future Research
  • Exploration of the exact biases of the NGS
    platforms
  • Correction for these biases
  • Is there a better heuristic for the greedy
    resolution?

93
Questions ??
  • Thank you