Title: Paul Medvedev
1Maximum Likelihood Genome Assembly
- Paul Medvedev
- Michael Brudno
Bioinformatics Algorithms
Presented by Md. Tanvir Al Amin, Md. Shaifur Rahman, Khalid Mahmood
Department of Computer Science and Engineering
BUET
Some of the slides are taken from other sources
2Computational Genomics
- Our genome encodes an enormous amount of information about our beings
- our looks
- our size
- how our bodies work
- ...
- our health
- our behaviors
- who we are!
gcgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtg
tgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatg
agcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtag
acttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagct
gatgctagaggtcagtgactgatgatcgatgcatgcatggatgatgcagc
tgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgat
agctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgc
tagatcgtaggtagtagctagatgcagggataaacacacggaggcgagtg
atcggtaccgggctgaggtgttagctaatgatgagtacgtatgaggcagg
atgagtgacccgatgaggctagatgcgatggatggatcgatgatcgatgc
atggtgatgcgatgctagatgatgtgtgtcagtaagtaagcgatgcggct
gctgagagcgtaggcccg.
3Contributions of the paper
- Two-fold, first one being
- First exact polynomial time algorithm for the shortest double-stranded genome, given its k-molecule spectrum
- A problem that was solved for strings, but remained open for molecules
4Contributions of the paper
- Second one
- Oppose the idea of the shortest genome
- Because it over-collapses repeats
- Instead, propose a new objective
- A maximum likelihood framework for assembling the genome that is most likely the source of the reads.
5Contributions of the paper
- Maximum likelihood framework
- Assumes perfect reads
- Uniform distribution
- Advantage of high coverage (NGS)
- Estimate copy counts of repeats
- Combine with matepair data
- Reads -> Contigs
6Outline
- Whole Genome Shotgun Assembly
- Review of Related Work
- The Medvedev-Brudno Method
- Bidirected Overlap Graph
- Adjustments to the Standard Min-cost Biflow Problem
- Maximizing the Global Read-Count Likelihood
- Efficiently Solving a Min-cost Biflow
- Flow to Contigs
- Conflict node resolution
- Results
- Discussion
8Whole Genome Shotgun Sequencing
[Figure: assembly pipeline. DNA -> SEQUENCER (Sanger vs. NGS) -> reads -> ASSEMBLER -> contigs -> FINISHING -> sequence]
- Problems in Assembly
- Sequencing Errors
- Unknown Orientation
- Incomplete Coverage
- Repeats
9Whole Genome Shotgun Sequencing
- Break genome into shotgun-sized fragments and sequence them
- Match the overlapping regions of contiguous sequences
- Demonstrated by Celera Genomics to be feasible for whole genome assembly
- Sequenced human genome at 1/10th the cost of the public Human Genome Project
10Whole Genome Assembly
- Next Generation Sequencing (NGS)
- Improved speed and cost-effectiveness relative to the other methods
- but much shorter read length (25-200 bp)
- Only proven on resequencing projects, i.e. where a reference genome is already available
- Poses significant challenges for de novo genome assembly: determination of a completely unknown genome.
11Assemblers
- Previous (Sanger) Assemblers
- NGS Assemblers
- SSAKE (Warren et al. 2007)
- VCAKE (Jeck et al. 2007)
- SHARCGS (Dohm et al. 2007)
- Shorty (Chen and Skiena 2007)
- ALLPATHS (Butler et al. 2008)
- Edena (Hernandez et al. 2008)
- Euler-(U)SR (Chaisson and Pevzner 2008, 2009)
- Velvet (Zerbino and Birney, 2008)
13Theoretical view
- Input: a set of strings over {A, C, G, T}, called reads
- Output: a common superstring of the reads.
- TACAT, CATAC, ACGTAC -> TACATACGTAC
- Initially: Shortest Common Superstring (SCS)
- NP-hard (Gallant et al. 1980)
- Over-collapsing of repeats
- Can be found using a TSP solver
- de Bruijn graphs (Pevzner, Tang, Waterman 01)
- string graphs (Myers 05)
- Both formulations are NP-hard.
14String graph (Myers)
- Represent reads as vertices, and read overlaps as edges
- Remove redundant edges
- Establish edge constraints
- Unique: flow is exactly one
- Required: min. flow is 1
- Optional: min. flow is 0
- Find shortest walk
15EULER assembler (Pevzner, Tang and Waterman)
- Represent reads as edges and overlaps as vertices in a de Bruijn graph
- Assembly can be efficiently solved as an Eulerian Path Problem: each edge must be visited exactly once
- Repeats dealt with by using multiple edges for a single repeat read
16Overlap Graph
- Nodes are reads
- Edges are overlaps
- Weights are lengths of the non-overlapping prefix
- A minimum-weight TSP tour gives the SCS
- Example
- TACAT, CATAC, ACGTAC -> TACATACGTAC
17Why Shortest CS?
- DNA is full of repeats: identical and nearly identical copies that appear multiple times
- The Alu repeat is about 300 bases long and is present about 1,000,000 times in the human genome
- The SCS approach over-collapses the repeats: they are only present once in the answer
- Solution: model repeats explicitly through either de Bruijn graphs or string graphs
- Maybe this will also become tractable?
18De Bruijn Graphs
AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC
- Nodes are (k-1)-mers
- Edges are k-mers
- The set of k-mers is called a k-spectrum
- Finding the shortest string with a given k-spectrum is equivalent to the Chinese Postman problem (Pevzner 1989)
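The construction on this slide can be sketched in a few lines of Python. This is only an illustration of the definition (nodes are (k-1)-mers, edges are k-mers), not code from the paper; the function name `de_bruijn_graph` is made up for the example.

```python
from collections import defaultdict

def de_bruijn_graph(kmers):
    """Map each (k-1)-mer node to the list of nodes its k-mer edges lead to."""
    graph = defaultdict(list)
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)  # one edge per k-mer
        graph[suffix]                 # touch the key so the end node exists
    return dict(graph)

# The k-spectrum shown on the slide (k = 3).
spectrum = ["AGC", "ATC", "ATT", "CAG", "CAT", "GCA", "TCA", "TTC"]
graph = de_bruijn_graph(spectrum)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

The eight 3-mers give eight edges over the six nodes AG, AT, CA, GC, TC and TT.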
19De Bruijn Graphs with Walks
AGC, ATTCA, CATT, GCAG, ATG
- Nodes are (k-1)-mers
- Edges are k-mers
- Reads are walks
- Finding a superwalk (one that includes all walks)
- No polynomial-time algorithm is known
- De Bruijn Superwalk is NP-hard (Pevzner et al. 2001)
20Chinese Postman Tours
- Solving Chinese Postman: an Eulerian tour is a solution
- Eulerization: make a graph Eulerian
- Can be done with min cost flow
- Unbalanced nodes are sources/sinks
- Duplicate all edges used in flow
[Figure: de Bruijn graph for the reads AGC, ATTCA, CATT, GCAG, ATG, with nodes TT, TC, AT, CA, GC, AG] (Pevzner 1989)
21DNA is not a String
[Figure: de Bruijn graph for the k-spectrum AAC, ATT, CAA, CCA, GCC, TGC, TTG of the string 5'-ATTGCCAAC-3']
- The shortest walk that visits every edge at least once (a Chinese postman tour) is the shortest string with the given k-spectrum (Pevzner 1989)
22Complexity of CPT
Graph type   Complexity        Equivalent to
Undirected   Polynomial Time   Matching
Directed     Polynomial Time   Matching
Mixed        NP-hard           Network Flow
Bidirected   Polynomial Time   Bidirected Flow
23Modeling Double Strandedness
Kececioglu 91, Kececioglu-Myers 95
24Modeling Double-Strandedness
- How can two DNA molecules overlap?
[Figure: the three ways two double-stranded molecules can overlap, shown for 3-molecules of 5'-ATTGCCAAC-3': AAC/-GTT with CTT/-AAG, AAC/-GTT with TCG/-CGA, and TGG/-CCA with AAC/-GTT] (Kececioglu 1992)
25Walks in bidirected graphs
- A walk has to match directions at each node.
- Suppose the node is AA+/TT-.
- Edge orientations correspond to strands
- A path can use a node in both orientations
26Rules for Matching Directions
- When we walk through it, we can
- Come in using an in arrow, then leave using an out arrow
- This is forward, so read the + strand, i.e. AA here
- Come in using an out arrow, then leave using an in arrow
- This is backward, so read the - strand, i.e. TT here.
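The matching rule can be written down as a small sketch. The names `reverse_complement` and `read_node`, and the "in"/"out" string labels, are conventions assumed for this example only and do not come from the paper.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def read_node(plus_strand, enter_by, leave_by):
    """Spell the label seen when a walk passes through a node.

    enter_by / leave_by are "in" or "out" (assumed labels), naming the
    arrow used at the node when arriving and when leaving.
    """
    if (enter_by, leave_by) == ("in", "out"):
        return plus_strand                      # forward: read the + strand
    if (enter_by, leave_by) == ("out", "in"):
        return reverse_complement(plus_strand)  # backward: read the - strand
    raise ValueError("walk does not match directions at this node")

print(read_node("AA", "in", "out"), read_node("AA", "out", "in"))
```

A walk that tries to enter and leave through arrows of the same kind is rejected, which is the "match directions at each node" condition.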
27Bidirected Graphs
- So what does this walk correspond to?
28Bidirected de Bruijn Graphs
- The shortest walk that visits every edge at least
once (a Chinese postman tour) is the shortest DNA
molecule with the given k-spectrum
[Figure: bidirected de Bruijn graph for the k-spectrum AAC, ATT, CAA, CCA, GCC, TGC, TTG of the molecule 5'-ATTGCCAAC-3']
29Representing Bidirected graphs
30Motivation Overlap Graphs
- Several downsides of the de Bruijn approach
- Division into k-mers arbitrary
- Very sensitive to sequencing errors
- Not memory efficient (one node per k-mer)
- Goal
- One node per read (or better)
- No division into k-mers
- Flexibility in the presence of sequencing errors
Myers 2005
31How To Build An Overlap Graph (1)
TACATACGTAC
- ACGTAC, CATAC, TACAT
- Nodes are reads
- Edges are overlaps
- Weights are lengths of non-overlapping prefix
- Transitively inferable overlaps
[Figure: overlap graph on the three reads, with edge weights 3, 5, 3, 2, 2]
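As a rough illustration of this slide, the following sketch builds the (single-stranded) overlap graph for the three example reads and spells the string along one walk. The helper names and the `min_overlap` parameter are assumptions of the example; real assemblers use indexed overlap detection rather than this quadratic loop.

```python
def overlap(a, b):
    """Length of the longest proper suffix of a that is a prefix of b."""
    for length in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-length:] == b[:length]:
            return length
    return 0

def overlap_graph(reads, min_overlap=1):
    """Edges (a, b) weighted by the length of a's non-overlapping prefix."""
    edges = {}
    for a in reads:
        for b in reads:
            if a != b:
                olen = overlap(a, b)
                if olen >= min_overlap:
                    edges[(a, b)] = len(a) - olen
    return edges

reads = ["ACGTAC", "CATAC", "TACAT"]
edges = overlap_graph(reads)

# Spell the walk TACAT -> CATAC -> ACGTAC: each edge contributes its
# non-overlapping prefix, and the last read is written out whole.
walk = ["TACAT", "CATAC", "ACGTAC"]
spelled = "".join(a[:edges[(a, b)]] for a, b in zip(walk, walk[1:])) + walk[-1]
print(spelled)
```

The five edges found carry the weights 2, 2, 3, 3 and 5, and the walk spells TACATACGTAC.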
32Bidirected Overlap Graph
- In this work, the authors use a bidirected overlap graph.
- In a bidirected overlap graph, each vertex is a double-stranded read
- Edges represent read overlaps
33Bidirected Overlap Graph
- Three possible ways that two double-stranded reads can overlap (corresponding to the three types of edges)
- Suppose we have two reads r1 and r2
- Each read can be oriented to the left or to the right
- The three possible overlaps are
- i) Both strands point in the same direction (both reads can point left, or both can point right; it is the same overlap either way)
- ii) r1 points left and r2 points right
- iii) r1 points right and r2 points left
34Bidirected Overlap Graph
- The overlap graph is constructed by placing an edge between two reads if they overlap by at least a minimum number of characters, o_min
- Question: How is o_min determined?
- Then perform transitive edge reduction: remove overlaps covered by two shorter overlaps
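One way to enumerate the overlaps between two double-stranded reads is to try each read as given and as its reverse complement. The sketch below does this by brute force; the "+"/"-" orientation labels and the function names are this example's own convention, not the paper's notation, and transitive reduction is not shown.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def suffix_prefix_overlap(a, b):
    """Longest proper suffix of a that equals a prefix of b (0 if none)."""
    for length in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-length:] == b[:length]:
            return length
    return 0

def strand_overlaps(r1, r2, o_min):
    """Overlaps of at least o_min with r1 on the left and r2 on the right.

    "+" means the read is used as given, "-" means its reverse complement.
    The ("-", "-") case is omitted: it is the ("+", "+") overlap of r2 with r1.
    """
    oriented = {"+": lambda s: s, "-": reverse_complement}
    found = {}
    for o1, o2 in [("+", "+"), ("+", "-"), ("-", "+")]:
        length = suffix_prefix_overlap(oriented[o1](r1), oriented[o2](r2))
        if length >= o_min:
            found[(o1, o2)] = length
    return found

print(strand_overlaps("ATTGC", "TGCCA", 3))
print(strand_overlaps("ATTGC", "TGGCA", 3))
```

The second call only finds an overlap after reverse-complementing the second read, which is the situation a single-stranded overlap graph would miss.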
35Observation
- A bidirected graph contains an Eulerian circuit
if and only if it is connected and balanced.
36Chinese postman Problem on Bidirected Graphs
37Chinese postman Problem on Bidirected Graphs
- Let G be a weighted bidirected graph. There exists a circuit of weight i if and only if there exists an Eulerian extension of weight i.
- G has a circuit if and only if it is strongly connected.
- The minimum weight Eulerian extension of G has at most 2|E||V| edges.
38Chinese postman Problem on Bidirected Graphs
- The running time of Algorithm 1 is O(|E|^2 log(|V|) log(|E|)).
- Gabow's algorithm runs in O(|E|^2 log(|V|) log(max(u(e))))
- u is the flow upper bound function
- f(e) < 2|E||V| for every edge e,
- So, we can safely let u(e) = 2|E||V|
39Chinese postman Problem on Bidirected Graphs
- Hence the theorem is proved
- Given a set of k-molecules S, we can find the shortest (k-1)-circular DNA molecule whose k-molecule spectrum is S in time O(|S|^2 log^2(|S|)).
- This is a polynomial time algorithm, explicitly handling the double strandedness
- The first main result of this paper.
41Sequence assembly using NGS
- Several methods available now (e.g. SSAKE, VCAKE, SHARCGS, etc.)
- All of these assume that the length of the assembled genome must be minimized
- Results in over-collapsing of repeats
- Given the ubiquity of repeats in eukaryotic genomes, the authors considered this a poor assumption
42Goal of an Assembler
- What should the goal of an assembler be ??
- Shortest string ??
- Problem of over-collapse
43Maximum Likelihood Genome Assembly
- Change goal of sequence assembly
- Maximize the likelihood that the resultant genome was the source of the given reads
- Take advantage of the high coverage of NGS to statistically estimate the copy-count of each read, i.e. identify and quantify repeats
- Maximizing the likelihood of observed read frequencies can be cast as a minimum cost bidirected flow (biflow) problem
- Allows solution to be obtained with an off-the-shelf network flow solver
- Authors claim 99.99% accuracy
44Maximum Likelihood Genome Assembly
- Second important aspect is the use of matepair information for joining contigs
- Other systems look for all paths between mated reads
- The proposed method looks only for short paths between some pairs of reads
- Question: How to decide the upper bound for these short paths? And how to decide which pairs of reads to examine?
46Adjustments to the Standard Min-cost Biflow
Problem
- Standard Min-cost Biflow Problem
- Set upper and lower flow bounds on each edge
- Flow function f: E -> N must obey the constraint l(e) <= f(e) <= u(e) for each edge e
- For each vertex, the incoming flow is balanced with the outgoing flow
- Objective: find the flow that minimizes the total cost, the sum over all edges e of c(e) f(e)
47Adjustments to the Standard Min-cost Biflow
Problem
- Medvedev-Brudno Min-cost Biflow Problem
- Upper and lower flow bounds on vertices as well
- Accomplished by splitting every vertex v into two: v+ and v-
48Adjustments to the Standard Min-cost Biflow
Problem
- v- serves as the incoming vertex, and inherits v's incoming edges
- v+ serves as the outgoing vertex, and inherits v's outgoing edges
- Finally add one edge between v- and v+ and assign it the upper and lower flow bounds for v
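The vertex-splitting trick can be sketched as follows. To keep the example short it works on an ordinary directed graph (the paper's graph is bidirected), and the tuple layout and the name `split_vertices` are assumptions made for illustration.

```python
def split_vertices(edges, vertex_bounds):
    """Turn vertex flow bounds into edge flow bounds by splitting vertices.

    edges: list of (tail, head, lower, upper) for a directed graph
           (a simplification assumed here; the slides use bidirected edges).
    vertex_bounds: dict vertex -> (lower, upper).
    Returns a new edge list over vertices named (v, "-") and (v, "+").
    """
    new_edges = []
    for tail, head, lower, upper in edges:
        # Outgoing edges leave v+, incoming edges enter v-.
        new_edges.append(((tail, "+"), (head, "-"), lower, upper))
    for v, (lower, upper) in vertex_bounds.items():
        # The internal edge v- -> v+ carries the bounds of v itself.
        new_edges.append(((v, "-"), (v, "+"), lower, upper))
    return new_edges

edges = [("a", "b", 0, 5)]
bounds = {"a": (1, 3), "b": (1, 2)}
for edge in split_vertices(edges, bounds):
    print(edge)
```

After the split, any flow passing through v must cross the single edge v- -> v+, so bounding that edge bounds the flow through v.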
49Adjustments to the Standard Min-cost Biflow
Problem
- Second variation: represent the cost c_e as a convex function
- A function is convex if the set of points on or above its graph is a convex set
- A convex set is an area where, for every pair of points within that area, every point on the straight line segment connecting those points also lies within that area
50Convex Function
51Adjustments to the Standard Min-cost Biflow
Problem
- An area that is not convex would have some sort of concave portion that would contradict the above property of convex sets
- In the overlap graph, convex functions are modelled with piecewise-linear approximations, allowing the flow to be solved as a linear min-cost flow problem
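A common way to linearize a convex cost on integer flows is to replace the edge by parallel unit-capacity edges, one per unit of flow, each priced at the marginal cost of that unit. The sketch below illustrates that idea with a made-up quadratic cost; it is a standard technique assumed here for illustration, not necessarily the exact approximation used by the authors.

```python
def unit_segments(cost, max_flow):
    """Marginal cost of each successive unit of flow, for flows 0..max_flow."""
    return [cost(f + 1) - cost(f) for f in range(max_flow)]

def cheapest_cost(segments, flow):
    """Cost of sending `flow` units over the parallel unit-capacity edges.

    A min-cost solver always fills the cheapest edges first.
    """
    return sum(sorted(segments)[:flow])

def example_cost(f):
    """An illustrative convex cost (assumed for this example)."""
    return (f - 3) ** 2

segments = unit_segments(example_cost, 6)
print(segments)
# Because the cost is convex, the marginal costs are already in
# non-decreasing order, so filling cheapest-first reproduces the cost.
for f in range(7):
    assert example_cost(0) + cheapest_cost(segments, f) == example_cost(f)
```

For a non-convex cost the marginal costs would not be sorted, and a linear solver could "skip ahead" to a cheap later unit, which is why convexity matters here.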
52Adjustments to the Standard Min-cost Biflow
Problem
- Supersource and supersink added to convert the flow problem into a circulation problem
- Each vertex has a lower bound of 1, since each read must appear in the finished genome at least once
- Edge bounds are set to 0 (lower bound) and infinity (upper bound)
53Adjustments to the Standard Min-cost Biflow
Problem
- Prohibitively large cost on the edge leading from the supersource and the edge leading to the supersink, to ensure that the assembly uses the smallest number of contigs possible
- Flow through each vertex represents the number of times it appears in the assembled genome
54Supersource and Supersink
55Maximum Likelihood Framework
- Let D be a circular genome of length N(D)
- d_i = number of times the k-molecule i appears in D
- Suppose i = ACGT
[Figure: the k-molecule ACGT]
For simplicity, they are drawn as strings instead of molecules
56Maximum Likelihood Framework
- Random trial
- Sample a position and take a k-molecule
- What is the probability that the k-molecule is i?
For simplicity, they are drawn as strings instead of molecules
57Maximum Likelihood Framework
- Sample uniformly
- We call it a success if we get i
- So, the success probability is p = d_i / N(D)
- We do the experiment n times
- Let X_i be the random variable indicating the number of times we get i
- What is the distribution of X_i? Binomial distribution
58Maximum Likelihood Framework
- How many options for i?
- There are 4^k possibilities.
- Hence 4^k random variables.
- Suppose k = 3
- X_1, X_2, X_3, X_4, X_5, X_6, ..., X_64
- They are
- X_AAA, X_AAC, X_AAG, X_AAT, ..., X_TTT
59Maximizing the Global Read-Count Likelihood
- Taking all random variables over n experiments.
- What is the probability that AAA comes x_AAA times, and AAC comes x_AAC times, ..., and CGT comes x_CGT times and TTT comes x_TTT times?
- Each random variable for every possible k-mer has a binomial distribution. Their joint distribution is the following multinomial distribution
- P(X_1 = x_1, ..., X_m = x_m) = n! / (x_1! ... x_m!) * prod_i (d_i / N(D))^(x_i), where m = 4^k
60Maximum Likelihood Framework
- D is not known, but the results of the n trials are known!
- The probability can be considered as the likelihood of the parameters of the distribution, d_i, given the outcome of the trials, x_i
- This is called the Global Read-count Likelihood
61Maximizing the Global Read-Count Likelihood
- Goal is to maximize L, or, equivalently, minimize
the negative log of L
62Maximizing the Global Read-Count Likelihood
- To translate this problem into a convex min-cost biflow problem, we need convex functions c_i for each k-mer
- Problem: the X_i random variables are not independent, because of the constraint that they sum to n
- We need something like -log L = sum_i c_i(d_i)
63Maximizing the Global Read-Count Likelihood
- But, as the number of trials goes to infinity, the X_i random variables become independent.
- In NGS techniques, the number of trials is usually large enough to warrant the approximation of the multinomial distribution as the product of the binomial distributions for each X_i
64Maximizing the Global Read-Count Likelihood
- In this binomial approximation, the genome length N(D) is treated as a constant, independent of the sampling frequencies
- Therefore, use N instead, which is the actual length of the genome
65Maximizing the Global Read-Count Likelihood
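- New approximation of L: L ~ prod_i C(n, x_i) (d_i / N)^(x_i) (1 - d_i / N)^(n - x_i)
- Now -log L ~ constant + sum_i c_i(d_i)
- And c_i(d_i) = -x_i log(d_i / N) - (n - x_i) log(1 - d_i / N)
- c_i is used as the convex function for the vertices of the min-cost biflow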
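To make the per-vertex cost concrete, here is a small numerical sketch. It assumes the binomial form described on these slides, with success probability d_i / N, so that c_i(d_i) = -log((d_i/N)^(x_i) (1 - d_i/N)^(n - x_i)) up to a constant; the toy numbers and function names are invented for the example.

```python
import math

def read_count_cost(d, x, n, genome_length):
    """Negative log of the binomial term (d/N)^x * (1 - d/N)^(n - x).

    The binomial coefficient is dropped because it does not depend on d.
    This form is an assumption reconstructed from the binomial model.
    """
    p = d / genome_length
    return -(x * math.log(p) + (n - x) * math.log(1.0 - p))

def best_copy_count(x, n, genome_length):
    """Integer copy count in 1..N-1 with the smallest cost."""
    return min(range(1, genome_length),
               key=lambda d: read_count_cost(d, x, n, genome_length))

# Toy numbers (assumed): n = 1000 sampled k-molecules from a genome of
# length N = 100, of which x = 30 were k-molecule i.
print(best_copy_count(x=30, n=1000, genome_length=100))
```

The minimizer is the copy count d for which d / N matches the observed frequency x / n, here 30 / 1000 * 100 = 3, and the cost grows convexly on either side of it.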
67Efficiently Solving a Min-Cost Biflow
- Problem: no existing efficient implementation of a min-cost biflow algorithm
- Gabow (1983) presented a polynomial time algorithm for min-cost biflow
- It is difficult to implement.
- Authors didn't find any existing implementation either
- Authors solve by converting the bidirected flow problem into a directed flow problem.
68Efficiently Solving a Min-Cost Biflow
- Directed network flow is solved by reducing the problem to a linear program (LP)
- Use an edge incidence matrix derived from the overlap graph
- If the cell in row m, column n has a value of 1, then edge n is an in-edge for vertex m
- If the value is -1, n is an out-edge
- 0 means edge n is not incident to vertex m
- Use the incidence matrix as the constraint matrix for the LP: the optimal LP solution corresponds to a minimum cost flow
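The incidence matrix and the flow-conservation constraint it encodes can be sketched without any LP library. The example graph and the function names below are assumptions for illustration; an actual solver would take this matrix as its equality constraints.

```python
def incidence_matrix(vertices, edges):
    """Rows are vertices, columns are edges: +1 for an in-edge, -1 for an out-edge."""
    row = {v: i for i, v in enumerate(vertices)}
    matrix = [[0] * len(edges) for _ in vertices]
    for col, (tail, head) in enumerate(edges):
        matrix[row[tail]][col] -= 1  # the edge leaves its tail
        matrix[row[head]][col] += 1  # the edge enters its head
    return matrix

def net_inflow(matrix, flow):
    """Matrix-vector product: inflow minus outflow at every vertex."""
    return [sum(a * f for a, f in zip(r, flow)) for r in matrix]

# A small directed graph, made up for this example.
vertices = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
matrix = incidence_matrix(vertices, edges)
for r in matrix:
    print(r)
print(net_inflow(matrix, [2, 2, 3, 1]))
```

A flow vector is a valid circulation exactly when the product is zero at every vertex, which is the set of equalities handed to the LP.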
69Efficiently Solving a Min-Cost Biflow
- The incidence matrix of a directed graph is Totally Unimodular (TU)
- Leads to linear programs that always have integer solutions (for integer bounds).
- Makes it possible to produce an integral solution with LP, rather than resort to Integer Programming, which is NP-hard
70Efficiently Solving a Min-Cost Biflow
- In a bidirected graph, 2 or -2 can appear in the incidence matrix, since both ends of an edge can enter (or leave) a single vertex
- The incidence matrix is actually a binet matrix
- The optimal LP solution for binet matrices is guaranteed to be half-integral (i.e. the values are multiples of 0.5)
Hochbaum 2004
71Efficiently Solving a Min-Cost Biflow
72Efficiently Solving a Min-Cost Biflow
- Monotonization Procedure
- For every vertex v in the bidirected graph, replace with two vertices v1 and v2 in the new graph
- Each of v's in-edges is replaced with two edges, one of which points into v1, while the other points out of v2
- Likewise, each of v's out-edges is replaced with two edges, one of which points out of v1, while the other points into v2
- Bounds and costs from the original graph are transferred to the new graph, and the solution of the new graph will be transferred to the original graph
Hochbaum 2004
73Efficiently Solving a Min Cost Flow
- Problem can now be solved with off-the-shelf software
- After finding the min cost flow in the directed graph, transfer the results to the original bidirected graph by adding the flows through the pairs of twin edges and dividing by two.
- Hence, the optimal result is half integral and the monotonized flow is at worst a 2-approximation to the optimal integral flow.
Hochbaum 2004
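The monotonization rule and the transfer of the solution back to the bidirected graph can be sketched as below. The edge encoding (an "in"/"out" label per endpoint) and the function names are assumptions of this example; bounds and costs are left out to keep it short.

```python
def monotonize(bidirected_edges):
    """Replace each bidirected edge by a pair of directed twin edges.

    bidirected_edges: list of (u, u_end, v, v_end); each *_end is "in" if the
    arrowhead at that endpoint points into the vertex, "out" if it points away
    (an encoding assumed for this sketch).
    Vertex v becomes (v, 1) and (v, 2). Returns a list of twin pairs,
    each twin written as (tail, head).
    """
    twins = []
    for u, u_end, v, v_end in bidirected_edges:
        if (u_end, v_end) == ("out", "in"):     # ordinary directed edge u -> v
            pair = (((u, 1), (v, 1)), ((v, 2), (u, 2)))
        elif (u_end, v_end) == ("in", "out"):   # ordinary directed edge v -> u
            pair = (((v, 1), (u, 1)), ((u, 2), (v, 2)))
        elif (u_end, v_end) == ("in", "in"):    # both arrowheads point inwards
            pair = (((v, 2), (u, 1)), ((u, 2), (v, 1)))
        elif (u_end, v_end) == ("out", "out"):  # both arrowheads point outwards
            pair = (((u, 1), (v, 2)), ((v, 1), (u, 2)))
        else:
            raise ValueError("endpoint labels must be 'in' or 'out'")
        twins.append(pair)
    return twins

def bidirected_flow(twins, directed_flow):
    """Average the flow on each twin pair to recover the bidirected flow."""
    return [(directed_flow[a] + directed_flow[b]) / 2 for a, b in twins]

twins = monotonize([("a", "out", "b", "in"), ("a", "in", "b", "in")])
flow = {twins[0][0]: 1, twins[0][1]: 2, twins[1][0]: 3, twins[1][1]: 3}
print(bidirected_flow(twins, flow))
```

In every case an end that points into v yields one twin entering v1 and one leaving v2, and an end that points out of v yields one twin leaving v1 and one entering v2. Averaging twin flows is what makes the recovered solution half-integral rather than integral.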
75Flow to Contigs
- Flows have been solved
- Now, decompose the flow into a collection of walks, which translate into assembled contigs
- Graph is first simplified by removing all edges with a flow of zero
- Additional simplifications possible...
76Flow to Contigs
- ...by removing vertices v where
- There is exactly one edge going into v and one edge leading out of v, and the flow on both edges is the same
- Vertices where there is also a loop with the same flow as the other two edges, and
- Split and join vertices, where the flow on the in-edges is the same as that on the out-edges
77Flow to Contigs
- After at most 2|V| of these simplifications, the remaining vertices are conflict vertices
- those that didn't match the previous criteria
78Conflict Node Resolution
- Using matepair information
- Look for edges at these vertices with opposite orientations supported by matepairs
- Use BFS to find all reads within a certain distance from the vertex (in both directions)
- We have two sets of vertices, L and R, corresponding to reads that were observed on the incoming side of the vertex and on the outgoing side
- Match those reads that are matepairs.
- For those matepairs where one read is on the incoming side and the other is on the outgoing side, find the shortest path between them using Dijkstra's algorithm
79Resolving Conflict Nodes with Mate Pairs
[Figure: matepair reads A and B on opposite sides of a conflict node]
Does there exist a short path between A and B?
- Dijkstra's shortest path algorithm, bounded
- Greedily join edges if they have enough supporting reads.
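A bounded shortest-path search of the kind mentioned here can be sketched with Python's `heapq`. The graph encoding, the function name and the `max_dist` parameter are assumptions of the example, and the graph is treated as an ordinary directed graph for simplicity.

```python
import heapq

def bounded_shortest_path(graph, source, target, max_dist):
    """Length of the shortest source-target path, or None if it exceeds max_dist.

    graph: dict node -> list of (neighbour, non-negative edge length).
    """
    best = {source: 0}
    heap = [(0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        if node == target:
            return dist
        for neighbour, length in graph.get(node, []):
            new_dist = dist + length
            # Prune anything beyond the bound: such paths are never explored.
            if new_dist <= max_dist and new_dist < best.get(neighbour, float("inf")):
                best[neighbour] = new_dist
                heapq.heappush(heap, (new_dist, neighbour))
    return None

# Toy graph (assumed): two routes from read A to read B.
graph = {"A": [("x", 2), ("y", 10)], "x": [("B", 3)], "y": [("B", 1)]}
print(bounded_shortest_path(graph, "A", "B", max_dist=6))
print(bounded_shortest_path(graph, "A", "B", max_dist=4))
```

Bounding the search keeps the explored region small: the search never expands beyond the expected insert distance, whatever the size of the whole graph.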
80Greedy Matching
- Make note of the number of mates that fall within the expected insert distance
- Pairs of in/out edges that have a significant number of matepairs that fall within the insert distance are joined into a common edge
- The previous step is repeated until no more edges can be joined in this manner
- Graph simplification continues in iterative phases until convergence
82Results
- Generated synthetic reads from the E. coli genome, which has a total length of 4.6 megabase pairs.
- Simulated matepair distances were uniformly distributed within 10% of the expected insert size
- Reads were 25 bp long, and error-free
83Results
- Coverage rates were 50x, 75x, 100x, and 200x
- Minimum overlap length varied between 17 and 21
- Authors claim that the overall running time of the algorithm is approx. 1 hour on one machine
- Question: What kind of machine?
84Copy Count Results
- Authors compared the flow going through every
vertex in the overlap graph to the number of
times that the corresponding read appears.
85Read Count Results
- Compared vertex flow with read frequency in the original genome
- High degree of accuracy
- Error rate between 10^-4 and 10^-6
- Generally more tendency to overestimate read frequency
- Authors claim only slight improvements beyond 75x coverage
- but 200x coverage is fantastically good
86Assembly Results
- Take the edges of the graph produced after the conflict node resolution and generate the sequence it spells out
- Compute N50: the largest length L such that contigs of length at least L contain 50% of the genome
- Also compute N90: similar to N50, but the cutoff is 90%
- Finally, compute errors by aligning each contig to the reference genome and seeing how many local alignments it takes to completely tile the contig (minus one because it always takes at least one alignment to do it)
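The N50 and N90 statistics can be computed with a short loop over the contigs sorted from longest to shortest. The function name `nxx` and the optional `total_length` argument are assumptions of this sketch; whether the denominator is the genome length or the summed contig length varies between reports.

```python
def nxx(contig_lengths, fraction, total_length=None):
    """Length of the contig at which the longest contigs first cover `fraction`.

    total_length defaults to the summed contig length; pass the genome length
    to measure against the genome instead (an option assumed for this sketch).
    """
    total = sum(contig_lengths) if total_length is None else total_length
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= fraction * total:
            return length
    return 0  # the contigs never reach the requested fraction

contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # toy contig lengths
print(nxx(contigs, 0.5), nxx(contigs, 0.9))
```

With these toy lengths the total is 54, so N50 is reached at the contig of length 8 and N90 at the contig of length 4.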
87Assembly Results
[Figures: N50 results; N90 results]
88Assembly Results (contd)
- Length of contigs that contain 50% of the genome varied between 23-28 kb
- Length of contigs that contain 90% of the genome varied between 7-8 kb
- N50 error rate: 1 per 100-180 kb
- N90 error rate: 1 per 100-160 kb
- Greedy algorithm can be fooled by several strong edge matches
- Contig size is good relative to other whole genome assemblies involving small read sizes
90Discussion
- Demonstrated that bidirected flow is a powerful method for genome assembly.
- Introduced a maximum likelihood framework for sequence assembly
- By unifying Pevzner's work on de Bruijn graphs, Kececioglu and Myers' work on bidirected graphs in assembly, and Edmonds and Gabow's work on bidirected flow, the paper gives an exact polynomial time assembly algorithm in the parsimony setting, explicitly dealing with double-strandedness.
91Discussion
- First major assumption: reads are error-free
- Can be overcome with higher coverage
- Second major assumption: uniform sampling of all genomic regions
- Reality: certain portions of the genome are easier to sample than others
- More difficult to overcome
- Could be overcome by establishing the biases of the sequencing apparatus used
92Future Research
- Exploration of the exact biases of the NGS platforms
- Correction for these
- Is there any better heuristic for the greedy resolution?
93Questions ??