Paul Medvedev

Maximum Likelihood Genome Assembly Paul Medvedev Michael Brudno Bioinformatics Algorithms Presented by Md. Tanvir Al Amin, Md. Shaifur Rahman Khalid Mahmood – PowerPoint PPT presentation

1
Maximum Likelihood Genome Assembly

Paul Medvedev
Michael Brudno

Bioinformatics Algorithms
Presented by Md. Tanvir Al Amin, Md. Shaifur
Rahman Khalid Mahmood
Department of Computer Science and Engineering
BUET
Some of the slides are taken from other sources
2
Computational Genomics

Our genome encodes an enormous amount of
information about our beings
our looks
our size
how our bodies work
.
our health
our behaviors
who we are!

gcgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtg
tgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatg
agcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtag
acttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagct
gatgctagaggtcagtgactgatgatcgatgcatgcatggatgatgcagc
tgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgat
agctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgc
tagatcgtaggtagtagctagatgcagggataaacacacggaggcgagtg
atcggtaccgggctgaggtgttagctaatgatgagtacgtatgaggcagg
atgagtgacccgatgaggctagatgcgatggatggatcgatgatcgatgc
atggtgatgcgatgctagatgatgtgtgtcagtaagtaagcgatgcggct
gctgagagcgtaggcccg.
3
Contributions of the paper

Two-fold, first one being
First exact polynomial time algorithm for the
shortest double-stranded genome, given its
k-molecule spectrum
A problem that was solved for strings, but
remained open for molecules

4
Contributions of the paper

Second one
Oppose the idea of the shortest genome
because it over-collapses repeats
Instead propose a new objective
A maximum likelihood framework for assembling the
genome that is most likely the source of the
reads.

5
Contributions of the paper

Maximum likelihood framework
Assumes perfect reads
Uniform distribution
Advantage of high coverage (NGS)
Estimate copy counts of repeats
Combine with matepair data
Reads -> Contigs

6
Outline

Whole Genome Shotgun Assembly
Review of Related Work
The Medvedev-Brudno Method
Bidirected Overlap Graph
Adjustments to the Standard Min-cost Biflow
Problem
Maximizing the Global Read-Count Likelihood
Efficiently Solving a Min-cost Biflow
Flow to Contigs
Conflict node resolution
Results
Discussion

8
Whole Genome Shotgun Sequencing
[Figure: pipeline from DNA through the sequencer (Sanger vs. NGS) to reads, through the assembler to contigs, and through finishing to the final sequence]

Problems in Assembly
Sequencing Errors
Unknown Orientation
Incomplete Coverage
Repeats
9
Whole Genome Shotgun Sequencing

Break genome into shotgun-sized fragments and
sequence
Match the overlapping regions of contiguous
sequences
Demonstrated by Celera Genomics to be feasible
for whole genome assembly
Sequenced human genome at 1/10th the cost of the
public Human Genome Project

10
Whole Genome Assembly

Next Generation Sequencing (NGS)?
Improved speed and cost-effectiveness relative to
other methods
but much shorter read length (25-200 bp)
Only proven on resequencing projects, i.e. where a
reference genome is already available
Poses significant challenges for de novo genome
assembly: determination of a completely unknown
genome.

11
Assemblers

Previous (Sanger) Assemblers
NGS Assemblers
SSAKE (Jeck et al., 2007)
VCAKE (Warren et al. 2007)
SHARCGS (Dohm et al. 2007)
Shorty (Chen and Skiena 2007)
ALLPATHS (Butler et al. 2008)
Edena (Hernandez et al. 2008)
Euler-(U)SR (Chaisson and Pevzner 2008, 2009)
Velvet (Zerbino and Birney, 2008)

12
Outline

Whole Genome Shotgun Assembly
Review of Related Work
The Medvedev-Brudno Method
Bidirected Overlap Graph
Adjustments to the Standard Min-cost Biflow
Problem
Maximizing the Global Read-Count Likelihood
Efficiently Solving a Min-cost Biflow
Flow to Contigs
Conflict node resolution
Results
Discussion

13
Theoretical view

Input: a set of strings over {A, C, G, T} called reads
Output: a common superstring of the reads.
TACAT, CATAC, ACGTAC -> TACATACGTAC
Initially: Shortest Common Superstring (SCS)
NP-hard (Gallant et al. 1980)
Over-collapsing of repeats
Can be found using a TSP solver
de Bruijn graphs (Pevzner, Tang, Waterman 2001)
string graphs (Myers 2005)
Both formulations are NP-hard.
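
A minimal brute-force sketch of the SCS objective on the three example reads. It tries every read order and merges consecutive reads on their longest suffix/prefix overlap, which is only feasible for a handful of reads and assumes no read is a substring of another. The function names are illustrative and not taken from the paper.

```python
from itertools import permutations


def merge(a, b):
    """Append b to a, reusing the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return a + b


def shortest_common_superstring(reads):
    """Exhaustive search over read orders (exponential time)."""
    best = None
    for order in permutations(reads):
        candidate = order[0]
        for read in order[1:]:
            candidate = merge(candidate, read)
        if best is None or len(candidate) < len(best):
            best = candidate
    return best


reads = ["TACAT", "CATAC", "ACGTAC"]
scs = shortest_common_superstring(reads)
print(scs, len(scs))
```

On these reads the search finds a 10-character superstring, one character shorter than the 11-character TACATACGTAC shown on the slide, so the shortest superstring need not be the string the reads were drawn from.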

14
String graph (Myers)

Represent reads as vertices, and read overlaps as
edges
Remove redundant edges
Establish edge constraints
Unique (flow is exactly one)
Required (min. flow is 1)
Optional (min. flow is 0)
Find shortest walk

15
EULER assembler (Pevzner, Tang and Waterman)

Represent reads as edges and overlaps as vertices
in a de Bruijn graph
Assembly can be efficiently solved as an Eulerian
Path Problem: each edge must be visited exactly
once
Repeats dealt with by using multiple edges for a
single repeat read

16
Overlap Graph

Nodes are reads
Edges are overlaps
Weights are lengths of non-overlapping prefix
A TSP tour gives the SCS
Example
TACAT, CATAC, ACGTAC -> TACATACGTAC

17
Why Shortest CS?

DNA is full of repeats: identical and nearly
identical copies that appear multiple times
The Alu repeat is about 300 bases long and present
about 1,000,000 times in the human genome
The SCS approach over-collapses repeats: they
are only present once in the answer
Solution: model repeats explicitly through either
de Bruijn graphs or string graphs
Maybe this will also become tractable?

18
De Bruijn Graphs
AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC

Nodes are (k-1)-mers
Edges are k-mers
The set of k-mers is called a k-spectrum
Finding shortest string with given k-spectrum
equivalent to Chinese Postman

Pevzner 1989
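
A small sketch of the construction on this slide, using the eight 3-mers listed above. Each k-mer becomes an edge from its (k-1)-prefix to its (k-1)-suffix. The helper names `de_bruijn` and `imbalance` are illustrative.

```python
from collections import defaultdict


def de_bruijn(kmers):
    """Map each (k-1)-mer node to the list of nodes its k-mer edges point to."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def imbalance(graph):
    """Return out-degree minus in-degree for every unbalanced node."""
    balance = defaultdict(int)
    for node, successors in graph.items():
        for succ in successors:
            balance[node] += 1
            balance[succ] -= 1
    return {node: b for node, b in balance.items() if b != 0}


spectrum = ["AGC", "ATC", "ATT", "CAG", "CAT", "GCA", "TCA", "TTC"]
graph = de_bruijn(spectrum)
unbalanced = imbalance(graph)
print(sorted(graph), unbalanced)
```

Here only AT (one extra outgoing edge) and TC (one extra incoming edge) are unbalanced, so a walk using every edge once can start at AT and end at TC.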
19
De Bruijn Graphs with Walks
AGC, ATTCA, CATT, GCAG, ATG

Nodes are (k-1)-mers
Edges are k-mers
Reads are walks
Finding superwalk (one that
includes all walks)
No polynomial-time algorithm is known:
De Bruijn Superwalk is NP-hard

Pevzner et al 2001
20
Chinese Postman Tours

Solving Chinese Postman: an Eulerian tour is a
solution
Eulerization: make the graph Eulerian
Can be done with min-cost flow
Unbalanced nodes are sources/sinks
Duplicate all edges used in the flow

[Figure: de Bruijn graph for the reads AGC, ATTCA, CATT, GCAG, ATG, with nodes TT, TC, AT, CA, GC, AG]
Pevzner 1989
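
Once a graph has an Eulerian path, the string can be spelled out by walking it. The sketch below uses Hierholzer's algorithm on the 3-spectrum AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC, which already admits an Eulerian path, so no edge duplication is needed. Starting at node AT is an assumption based on that node having one more outgoing than incoming edge. This illustrates the Eulerian step only, not the min-cost flow used for Eulerization.

```python
def eulerian_path(edges, start):
    """Hierholzer's algorithm: return the node sequence using every edge once."""
    adjacency = {}
    for tail, head in edges:
        adjacency.setdefault(tail, []).append(head)
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if adjacency.get(node):
            # Follow an unused edge out of the current node.
            stack.append(adjacency[node].pop())
        else:
            # Dead end: this node is final in the (reversed) path.
            path.append(stack.pop())
    return path[::-1]


kmers = ["AGC", "ATC", "ATT", "CAG", "CAT", "GCA", "TCA", "TTC"]
edges = [(kmer[:-1], kmer[1:]) for kmer in kmers]
path = eulerian_path(edges, "AT")
# Spell the string: first node, then the last letter of each later node.
text = path[0] + "".join(node[-1] for node in path[1:])
print(text)
```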
21
DNA is not a String
3-spectrum: AAC, ATT, CAA, CCA, GCC, TGC, TTG
String: 5' ATTGCCAAC 3'
[Figure: de Bruijn graph of the spectrum, including node AA]

The shortest walk that visits every edge at least
once (a Chinese postman tour) is the shortest
string with the given k-spectrum (Pevzner 1989)

22
Complexity of CPT
Graph type    Complexity         Equivalent to
Undirected    Polynomial time    Matching
Directed      Polynomial time    Network flow
Mixed         NP-hard            -
Bidirected    Polynomial time    Bidirected flow
23
Modeling Double Strandedness
Kececioglu 1991; Kececioglu and Myers 1995
24
Modeling Double-Strandedness

How can two DNA molecules overlap?

[Figure: three ways the double-stranded 3-molecule AAC/GTT can overlap another molecule (CTT/AAG, TCG/CGA, TGG/CCA), drawn against the strand 5' ATTGCCAAC 3']
Kececioglu 1992
25
Walks in bidirected graphs

A walk has to match directions at each node.
Suppose the node is AA+/TT-.
Edge orientations correspond to strands
A path can use a node in both orientations

26
Rules for Matching Directions

When we walk through a node, we can
Come in using an in arrow, then leave using an out
arrow
This is forward, so read the + strand, i.e. AA
here
Come in using an out arrow, then leave using an in
arrow
This is backward, so read the - strand, i.e. TT
here.

27
Bidirected Graphs

So what does this walk correspond to?

GGCAAT
ATTGCC
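
The two strings above are the two strands of one molecule: each is the reverse complement of the other. A short sketch of that relation follows. The `canonical` helper, which picks one strand as a representative key, is an illustrative convention and not something the slides define.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse to read the opposite strand 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(seq):
    """Pick the lexicographically smaller strand as the key for the molecule."""
    return min(seq, reverse_complement(seq))


print(reverse_complement("ATTGCC"))
print(canonical("GGCAAT") == canonical("ATTGCC"))
```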

28
Bidirected de Bruijn Graphs

The shortest walk that visits every edge at least
once (a Chinese postman tour) is the shortest DNA
molecule with the given k-spectrum

AAC, ATT, CAA, CCA, GCC, TGC, TTG
5' ATTGCCAAC 3'
29
Representing Bidirected graphs
30
Motivation Overlap Graphs

Several downsides of the de Bruijn approach
Division into k-mers arbitrary
Very sensitive to sequencing errors
Not memory efficient (one node per k-mer)
Goal
One node per read (or better)
No division into k-mers
Flexibility in the presence of sequencing errors

Myers 2005
31
How To Build An Overlap Graph (1)
TACATACGTAC

ACGTAC, CATAC, TACAT
Nodes are reads
Edges are overlaps
Weights are lengths of non-overlapping prefix
Transitively inferable overlaps

[Figure: overlap graph of the three reads with edge weights 3, 5, 3, 2, 2]
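
A sketch of the construction for the three reads on this slide, treating reads as single-stranded strings and ignoring reverse complements for simplicity. An edge a -> b exists when a suffix of a matches a prefix of b over at least `min_len` characters, and its weight is the length of the non-overlapping prefix of a. The value `min_len = 2` is an assumption chosen for this toy example.

```python
def overlap(a, b, min_len):
    """Length of the longest proper suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(reads, min_len):
    """Return {(a, b): weight} with weight = len(a) minus the overlap length."""
    edges = {}
    for a in reads:
        for b in reads:
            if a != b:
                k = overlap(a, b, min_len)
                if k:
                    edges[(a, b)] = len(a) - k
    return edges


reads = ["ACGTAC", "CATAC", "TACAT"]
edges = overlap_graph(reads, min_len=2)
for (a, b), weight in sorted(edges.items()):
    print(a, "->", b, "weight", weight)
```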
32
Bidirected Overlap Graph

In this work, authors have used Bidirected
overlap graph.
In a bidirected overlap graph, each vertex is a
double-stranded read
Edges represent read overlaps

33
Bidirected Overlap Graph

Three possible ways that two double-stranded
reads can overlap (corresponds to the three types
of edges)
Suppose we have two reads r1 and r2
Each read can be oriented to the left or to the
right
The three possible overlaps are
i) Both reads point in the same direction (both
reads can point left, or both can point right; it is
the same overlap either way)
ii) r1 points left and r2 points right
iii) r1 points right and r2 points left

34
Bidirected Overlap Graph

The overlap graph is constructed by placing an
edge between two reads if they overlap by at least a
minimum number of characters, o_min
Question: How is o_min determined?
Then perform transitive edge reduction: remove
overlaps covered by two shorter overlaps

35
Observation

A bidirected graph contains an Eulerian circuit
if and only if it is connected and balanced.

36
Chinese postman Problem on Bidirected Graphs
37
Chinese postman Problem on Bidirected Graphs

Let G be a weighted bidirected graph. There
exists a circuit of weight i if and only if there
exists an Eulerian extension of weight i.
G has a circuit if and only if it is strongly
connected.
The minimum weight Eulerian extension of G has at
most $2|E||V|$ edges.

38
Chinese postman Problem on Bidirected Graphs

The running time of Algorithm 1 is
$O(|E|^2 \log|V| \log|E|)$.
Gabow's algorithm runs in
$O(|E|^2 \log|V| \log(\max_e u(e)))$
u is the flow upper bound function
$f(e) < 2|E||V|$ for every edge e,
So, we can safely let $u(e) = 2|E||V|$

39
Chinese postman Problem on Bidirected Graphs

Hence the theorem is proved:
Given a set of k-molecules S, we can find the
shortest (k-1)-circular DNA molecule whose
k-molecule spectrum is S in time
$O(|S|^2 \log^2|S|)$.
This is a polynomial time algorithm, explicitly
handling the double strandedness
The first main result of this paper.

40
Outline

Whole Genome Shotgun Assembly
Review of Related Work
The Medvedev-Brudno Method
Bidirected Overlap Graph
Adjustments to the Standard Min-cost Biflow
Problem
Maximizing the Global Read-Count Likelihood
Efficiently Solving a Min-cost Biflow
Flow to Contigs
Conflict node resolution
Results
Discussion

41
Sequence assembly using NGS

Several methods available now (e.g. SSAKE, VCAKE,
SHARCGS, etc.)
All of these assume that the length of the
assembled genome must be minimized
Results in over-collapsing of repeats
Given ubiquity of repeats in eukaryotic genomes,
authors considered this a poor assumption

42
Goal of an Assembler

What should the goal of an assembler be ??
Shortest string ??
Problem of over-collapse

43
Maximum Likelihood Genome Assembly

Change the goal of sequence assembly
Maximize the likelihood that the resultant genome
was the source of the given reads
Take advantage of the high coverage of NGS to
statistically estimate the copy-count of each
read: identify and quantify repeats
Maximizing the likelihood of observed read
frequencies can be cast as a minimum cost
bidirected flow (biflow) problem
Allows solution to be obtained with an
off-the-shelf network flow solver
Authors claim 99.99% accuracy

44
Maximum Likelihood Genome Assembly

Second important aspect is the use of matepair
information for joining contigs
Other systems look for all paths between mated
reads
The proposed method looks only for short paths
between some pairs of reads
Question: How to decide the upper bound for these
short paths? And how to decide which pairs of
reads to examine?

45
Outline

Whole Genome Shotgun Assembly
Review of Related Work
The Medvedev-Brudno Method
Bidirected Overlap Graph
Adjustments to the Standard Min-cost Biflow
Problem
Maximizing the Global Read-Count Likelihood
Efficiently Solving a Min-cost Biflow
Flow to Contigs
Conflict node resolution
Results
Discussion

46
Adjustments to the Standard Min-cost Biflow
Problem

Standard Min-cost Biflow Problem
Set upper and lower flow bounds on each edge
Flow function $f : E \to \mathbb{N}$ must stay within these
bounds for each edge e
For each vertex, the incoming flow is balanced
with the outgoing flow
Objective: find the flow that minimizes the total
cost $\sum_{e \in E} c_e f(e)$

47
Adjustments to the Standard Min-cost Biflow
Problem

Medvedev-Brudno Min-cost Biflow Problem
Upper and lower flow bounds on vertices as well
Accomplished by splitting every vertex v into
two:
v+ and v-

48
Adjustments to the Standard Min-cost Biflow
Problem

v- serves as the incoming vertex, and inherits
v's incoming edges
v+ serves as the outgoing vertex, and inherits
v's outgoing edges
Finally add one edge between v- and v+ and assign
it the upper and lower flow bounds for v
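
The splitting step can be sketched on an ordinary directed graph (a simplification, since the slides apply it to a bidirected graph). Each vertex v becomes the pair (v, "-") and (v, "+"), joined by one edge carrying the vertex bounds. The tuple layout and the function name `split_vertices` are assumptions made for illustration.

```python
INF = float("inf")


def split_vertices(edges, vertex_bounds):
    """edges: [(u, v, lower, upper)]; vertex_bounds: {v: (lower, upper)}."""
    new_edges = []
    for u, v, lower, upper in edges:
        # An edge u -> v leaves the outgoing copy of u and enters the incoming copy of v.
        new_edges.append(((u, "+"), (v, "-"), lower, upper))
    for v, (lower, upper) in vertex_bounds.items():
        # The internal edge v- -> v+ carries the bounds that applied to vertex v.
        new_edges.append(((v, "-"), (v, "+"), lower, upper))
    return new_edges


edges = [("a", "b", 0, INF)]
bounds = {"a": (1, INF), "b": (1, INF)}
split = split_vertices(edges, bounds)
for edge in split:
    print(edge)
```

After the split, a solver that only supports bounds on edges can still enforce a per-vertex lower bound such as "each read appears at least once".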

49
Adjustments to the Standard Min-cost Biflow
Problem

Second variation: represent the cost c_e as a
convex function
A function is convex if the set of points on or
above its graph forms a convex set
A convex set refers to an area where, for every
pair of points within that area, every point on
the straight line segment connecting those points
also lies within that area

50
Convex Function
51
Adjustments to the Standard Min-cost Biflow
Problem

An area that is not convex would have some sort
of concave portion that would contradict the
above property of convex sets
In the overlap graph, convex functions are
modelled with piecewise-linear approximations,
allowing the flow to be solved as a linear
min-cost flow problem
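
One common way to linearize a convex cost, shown here as a sketch and not necessarily the segmentation the authors used: replace an edge by parallel unit-capacity segments whose costs are the marginal costs c(j) - c(j-1). Convexity makes these marginals non-decreasing, so a linear min-cost solver fills the cheap segments first. The quadratic `example_cost` is a made-up stand-in.

```python
def example_cost(flow):
    """Hypothetical convex cost with its minimum at flow = 3."""
    return (flow - 3) ** 2


def marginal_costs(cost, max_flow):
    """Cost of the j-th unit of flow, for j = 1..max_flow."""
    return [cost(j) - cost(j - 1) for j in range(1, max_flow + 1)]


def is_non_decreasing(values):
    return all(a <= b for a, b in zip(values, values[1:]))


segments = marginal_costs(example_cost, 6)
print(segments, is_non_decreasing(segments))
# Using the first three unit segments reproduces the cost at flow = 3.
print(example_cost(0) + sum(segments[:3]) == example_cost(3))
```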

52
Adjustments to the Standard Min-cost Biflow
Problem

Supersource and supersink added to convert flow
problem into circulation problem
Each vertex has a lower bound of 1, since each
read must appear in the finished genome at least
once
Edge bounds are set to 0 (lower bound) and
infinity (upper bound)

53
Adjustments to the Standard Min-cost Biflow
Problem

Prohibitively large cost on the edge leading from
the supersource and the edge leading to the
supersink to ensure that the assembly uses the
smallest number of contigs possible
Flow through each vertex represents number of
times it appears in the assembled genome

54
Supersource and Supersink
55
Maximum Likelihood Framework

Let D be a circular genome of length N(D)
d_i = number of times the k-molecule i appears
in D
Suppose
i = ACGT

For simplicity, they are drawn as strings instead
of molecules
56
Maximum Likelihood Framework

Random trial
Sample a position and take a k-molecule
What is the probability that the k-molecule is i?

For simplicity, they are drawn as strings instead
of molecules
57
Maximum Likelihood Framework

Sample uniformly
We call it a success if we get i
So, p = d_i / N(D) is the success probability
We do the experiment n times
Let X_i be the random variable indicating the number of
times we get i
What is the distribution of X_i?

Binomial distribution
58
Maximum Likelihood Framework

How many options for i?
There are 4^k possibilities.
Hence 4^k random variables.
Suppose k = 3
X_1, X_2, X_3, X_4, X_5, X_6, ..., X_64
They are
X_AAA, X_AAC, X_AAG, X_AAT, ..., X_TTT

59
Maximizing the Global Read-Count Likelihood

Taking all random variables over n experiments:
What is the probability that AAA comes x_AAA
times, AAC comes x_AAC times, ..., CGT
comes x_CGT times, ..., and TTT comes x_TTT times?
Each random variable for every possible k-mer has
a binomial distribution. Their joint distribution
is the following multinomial distribution
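
A sketch of the standard multinomial form, assuming each k-mer i is drawn with probability d_i / N(D) in each of the n independent trials. The symbol m = 4^k is introduced here only for brevity.

```latex
P(X_1 = x_1, \dots, X_m = x_m)
  = \frac{n!}{\prod_{i=1}^{m} x_i!}
    \prod_{i=1}^{m} \left( \frac{d_i}{N(D)} \right)^{x_i},
  \qquad \sum_{i=1}^{m} x_i = n, \quad m = 4^k
```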

60
Maximum Likelihood Framework

But D is not known, while the results of the n
trials are known!
The probability can be considered as the
likelihood of the parameters of the distribution,
d_i, given the outcome of the trials, x_i, which is
called the
Global Read-count Likelihood

61
Maximizing the Global Read-Count Likelihood

Goal is to maximize L, or, equivalently, minimize
the negative log of L

62
Maximizing the Global Read-Count Likelihood

To translate this problem into a convex min-cost
biflow problem, we need convex functions c_i for
each k-mer
Problem: the X_i random variables are not
independent, because they are constrained to sum to n
We need something like $-\log L = \sum_i c_i(d_i)$

63
Maximizing the Global Read-Count Likelihood

But, as the number of trials goes to infinity,
the X_i random variables become independent.
In NGS techniques, the number of trials is
usually large enough to warrant the approximation
of the multinomial distribution as the product of
the binomial distributions for each X_i

64
Maximizing the Global Read-Count Likelihood

In this binomial approximation, genome length
N(G) is constant, and independent of the sampling
frequencies
Therefore, use N instead, which is the actual
length of the genome G

65
Maximizing the Global Read-Count Likelihood

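New approximation of L:
$L \approx \prod_i \binom{n}{x_i} (d_i/N)^{x_i} (1 - d_i/N)^{n - x_i}$
Now
$-\log L \approx \sum_i c_i(d_i)$
And
$c_i(d_i) = -\log\left[ \binom{n}{x_i} (d_i/N)^{x_i} (1 - d_i/N)^{n - x_i} \right]$
c_i is used as the convex function for the
vertices of the min-cost biflow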
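
A minimal numerical sketch of such a per-vertex cost, assuming it is the negative log of a binomial probability with n trials, x observed copies of the read, and success probability d / N. The function name `read_count_cost` and the toy numbers are illustrative and not taken from the paper.

```python
import math


def read_count_cost(d, x, n, genome_length):
    """Negative log binomial likelihood of seeing x reads if the copy count is d."""
    p = d / genome_length
    log_binom = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return -(log_binom + x * math.log(p) + (n - x) * math.log(1 - p))


n, N, x = 1000, 100, 30  # toy values: trials, genome length, observed count
costs = {d: read_count_cost(d, x, n, N) for d in range(1, N)}
best = min(costs, key=costs.get)
print(best)

# Convexity check: second differences over d are non-negative.
ds = sorted(costs)
second_diffs = [costs[a] - 2 * costs[b] + costs[c]
                for a, b, c in zip(ds, ds[1:], ds[2:])]
print(all(diff >= -1e-9 for diff in second_diffs))
```

The cost is lowest at the copy count where d / N matches the observed frequency x / n, which is what lets the flow through a vertex act as a copy-count estimate.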

66
Outline

Whole Genome Assembly
Review of Related Work
The Medvedev-Brudno Method
Methods Bidirected Overlap Graph
Methods Adjustments to the Standard Min-cost
Biflow Problem
Methods Maximizing the Global Read-Count
Likelihood
Methods Efficiently Solving a Min-cost Biflow
Methods Show Me the Contigs
Results
Discussion

67
Efficiently Solving a Min-Cost Biflow

Problem No existing efficient implementation of
a min-cost biflow algorithm
Though Gabow (1983) presented a polynomial time
algorithm for min-cost biflow,
it is difficult to implement.
The authors did not find any existing implementation
either
The authors solve it by converting the bidirected flow
problem into a directed flow problem.

68
Efficiently Solving a Min-Cost Biflow

Directed network flow is solved by reducing the
problem to a linear program (LP)
Use an edge incidence matrix derived from
the overlap graph
If cell (m, n) has a value of 1, then edge n is an
in-edge for vertex m
If the value is -1, n is an out-edge
0 means n and m are not on speaking terms (not incident)
Use incidence matrix as constraint matrix for LP
optimal LP solution corresponds to a minimum-cost flow
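
A sketch of the incidence matrix for a small directed graph, following the sign convention above (+1 for an in-edge, -1 for an out-edge). Multiplying the matrix by a flow vector gives the net flow at each vertex, which must be zero for a circulation. The three-vertex cycle is a made-up example.

```python
def incidence_matrix(vertices, edges):
    """Rows are vertices, columns are edges; edges are (tail, head) pairs."""
    row = {v: i for i, v in enumerate(vertices)}
    matrix = [[0] * len(edges) for _ in vertices]
    for col, (tail, head) in enumerate(edges):
        matrix[row[tail]][col] -= 1  # edge leaves its tail
        matrix[row[head]][col] += 1  # edge enters its head
    return matrix


def net_flow(matrix, flow):
    """Matrix-vector product: incoming minus outgoing flow per vertex."""
    return [sum(a * f for a, f in zip(matrix_row, flow)) for matrix_row in matrix]


vertices = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "a")]
matrix = incidence_matrix(vertices, edges)
print(matrix)
print(net_flow(matrix, [2, 2, 2]))
```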

69
Efficiently Solving a Min-Cost Biflow

The incidence matrix of a directed graph is Totally
Unimodular (TU)
Leads to linear programs that always have integer
solutions.
Makes it possible to produce an integral solution
with LP, rather than resorting to Integer
Programming, which is NP-hard

70
Efficiently Solving a Min-Cost Biflow

Possible for 2 or -2 to appear in the incidence
matrix, since two in-edges/out-edges can enter a
single vertex
Incidence matrix is actually a binet matrix
Optimal LP solution for binet matrices
is guaranteed to be half-integral (i.e. the
coefficients are multiples of 0.5)

Hochbaum 2004
71
Efficiently Solving a Min-Cost Biflow
72
Efficiently Solving a Min-Cost Biflow

Monotonization Procedure
For every vertex v in the bidirected graph,
replace with two vertices v1 and v2 in the new
graph
Each of v's in-edges is replaced with two edges,
one of which points into v1, while the other
points out of v2
Likewise, each of v's out-edges is replaced with
two edges, one of which points out of v1, while
the other points into v2
Bounds and costs from original graph are
transferred to the new graph, and the solution of
the new graph will be transferred to the original
graph

Hochbaum 2004
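
A sketch of the edge-doubling rule as stated on this slide, not a reproduction of Hochbaum's procedure. A bidirected edge is written as (u, su, v, sv), where a sign of +1 means the edge is an in-edge at that endpoint and -1 means it is an out-edge. This encoding and all function names are assumptions. Each edge yields two directed twins as (tail, head) pairs over the copies (v, 1) and (v, 2).

```python
def points_into(sign, copy):
    """Copy 1 keeps the original orientation at the endpoint, copy 2 flips it."""
    into = sign == +1
    return into if copy == 1 else not into


def directed_twin(u, su, cu, v, sv, cv):
    """Return one directed twin as a (tail, head) pair of vertex copies."""
    a, b = (u, cu), (v, cv)
    return (b, a) if points_into(su, cu) else (a, b)


def monotonize(bidirected_edges):
    directed = []
    for u, su, v, sv in bidirected_edges:
        # Pick the copy of v whose role (tail or head) is opposite to that of u.
        cv = 1 if su != sv else 2
        directed.append(directed_twin(u, su, 1, v, sv, cv))
        directed.append(directed_twin(u, su, 2, v, sv, 3 - cv))
    return directed


def original_flow(twin_flows):
    """Average the flows on each pair of twin edges."""
    return [(f1 + f2) / 2 for f1, f2 in zip(twin_flows[0::2], twin_flows[1::2])]


print(monotonize([("a", +1, "b", +1)]))
print(monotonize([("a", -1, "b", +1)]))
print(original_flow([1, 2, 3, 3]))
```

Averaging the twin flows is where half-integral values such as 1.5 can appear in the bidirected solution.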
73
Efficiently Solving a Min Cost Flow

Problem can now be solved with off-the-shelf
software
After finding the min cost flow in the directed
graph, transfer the results to the original
bidirected graph by adding the flows through the
pairs of twin edges and dividing by two.
Hence, the optimal result is half integral and
the monotonized flow is at worst a
2-approximation to the optimal integral flow.

Hochbaum 2004
74
Outline

Whole Genome Assembly
Review of Related Work
The Medvedev-Brudno Method
Methods Bidirected Overlap Graph
Methods Adjustments to the Standard Min-cost
Biflow Problem
Methods Maximizing the Global Read-Count
Likelihood
Methods Efficiently Solving a Min-cost Biflow
(Linear)
Methods Show Me the Contigs
Results
Discussion

75
Flow to Contigs

Flows have been solved,
Now, decompose it into a collection of walks,
which translates into assembled contigs
Graph is first simplified by removing all edges
with a flow of zero
Additional simplifications possible .

76
Flow to Contigs

by removing vertices v where
There is exactly one edge going into v and one
edge leading out of v, and the flow on both edges
is the same
Vertices where there is also a loop with the same
flow as the other two edges, and
Split and join vertices, where the flow on the
in-edges is the same as that on the out-edges

77
Flow to Contigs

After at most $2|V|$ of these simplifications, the
remaining vertices are conflict vertices:
those that did not match the previous criteria

78
Conflict Node Resolution

Using matepair information
Look for edges at these vertices with opposite
orientations supported by matepairs
Use BFS to find all reads within a certain
distance from the vertex (in both directions)
We have two sets of vertices, L and R,
corresponding to reads that were observed on the
incoming side of the vertex and on the outgoing side.
Match those reads that are matepairs.
For those matepairs where one read is on the
incoming side and the other is on the outgoing
side, find the shortest path between them using
Dijkstra's algorithm

79
Resolving Conflict Nodes with Mate Pairs
[Figure: a conflict node with reads A and B on opposite sides]
Does there exist a short path between A and B?

Dijkstra's shortest path algorithm, bounded
Greedily join edges if they have enough
supporting reads.
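
A sketch of a distance-bounded Dijkstra search on a plain directed weighted graph. The graph encoding, the function name, and the toy weights are assumptions, and how the authors choose the bound is not stated here. The search gives up on any path longer than `max_dist`.

```python
import heapq


def bounded_shortest_path(graph, source, target, max_dist):
    """graph: {u: [(v, weight), ...]}. Return the distance, or None if above max_dist."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        if u == target:
            return d
        for v, weight in graph.get(u, []):
            nd = d + weight
            # Prune any extension that would exceed the bound.
            if nd <= max_dist and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None


graph = {"A": [("x", 2), ("y", 5)], "x": [("B", 2)], "y": [("B", 1)]}
print(bounded_shortest_path(graph, "A", "B", max_dist=10))
print(bounded_shortest_path(graph, "A", "B", max_dist=3))
```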

80
Greedy Matching

Make note of the number of mates that fall within
the expected insert distance
Pairs of in/out edges that have a significant
number of matepairs that fall within the insert
distance are joined into a common edge
The previous step is repeated until no more edges
can be joined in this manner
Graph simplification continues in iterative
phases until convergence

81
Outline

Whole Genome Assembly
Review of Related Work
The Medvedev-Brudno Method
Methods Bidirected Overlap Graph
Methods Adjustments to the Standard Min-cost
Biflow Problem
Methods Maximizing the Global Read-Count
Likelihood
Methods Efficiently Solving a Min-cost Biflow
(Linear)
Methods Show Me the Contigs
Results
Discussion

82
Results

Generated synthetic reads from E. coli genome,
which has a total length of 4.6 Mega basepairs.
Simulated matepair distances were uniformly
distributed within 10% of the expected insert
size
Reads were 25 bp long, and error-free

83
Results

Coverage rates involved 50x, 75x, 100x, and 200x
Minimum overlap length varied between 17 and 21
Authors claim that the overall running time of the
algorithm is approx. 1 hour on one machine
Question: What kind of machine?

84
Copy Count Results

Authors compared the flow going through every
vertex in the overlap graph to the number of
times that the corresponding read appears.

85
Read Count Results

Compared vertex flow with read frequency in the
original genome
High degree of accuracy
Error rate between $10^{-4}$ and $10^{-6}$
Generally more tendency to overestimate read
frequency
Authors claim only slight improvements beyond 75x
coverage
but 200x coverage is fantastically good

86
Assembly Results

Take the edges of the graph produced after the
conflict node resolution and generate the
sequence it spells out
Compute N50: the length of the shortest contig
such that 50% of the genome lies in contigs at least that long
Also compute N90: similar to N50, but the cutoff
is 90%
Finally, compute errors by aligning each contig
to the reference genome and seeing how many local
alignments it takes to completely tile the contig
(minus one because it always takes at least one
alignment to do it)
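
A sketch of the N50/N90 computation under the definition above, using the sum of contig lengths as the genome length unless one is supplied. The function name `nxx` and the contig lengths are made up for illustration.

```python
def nxx(contig_lengths, percent, genome_length=None):
    """Length of the shortest contig among the longest ones covering `percent`% of the total."""
    total = genome_length if genome_length is not None else sum(contig_lengths)
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        # Integer arithmetic avoids floating-point edge cases at the cutoff.
        if covered * 100 >= percent * total:
            return length
    return 0  # the contigs never reach the requested coverage


contigs = [10, 8, 6, 4, 2]
print(nxx(contigs, 50), nxx(contigs, 90))
```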

87
Assembly Results
N50 Results
N90 Results
88
Assembly Results (contd)

Length of contigs that contain 50% of the genome
varied between 23 and 28 kb
Length of contigs that contain 90% of the genome
varied between 7 and 8 kb
N50 error rate: 1 per 100-180 kb
N90 error rate: 1 per 100-160 kb
Greedy algorithm can be fooled by several strong
edge matches
Contig size is good relative to other whole
genome assemblies involving small read sizes

89
Outline

Whole Genome Assembly
Review of Related Work
The Medvedev-Brudno Method
Methods Bidirected Overlap Graph
Methods Adjustments to the Standard Min-cost
Biflow Problem
Methods Maximizing the Global Read-Count
Likelihood
Methods Efficiently Solving a Min-cost Biflow
(Linear)
Methods Show Me the Contigs
Results
Discussion

90
Discussion

Demonstrated that bidirected flow is a powerful
method for genome assembly.
Introduced a maximum likelihood framework for
sequence assembly
by unifying Pevzner's work on de Bruijn graphs,
Kececioglu and Myers' work on bidirected graphs
in assembly, and Edmonds and Gabow's work on
bidirected flow.
The paper gives an exact polynomial time assembly
algorithm in the parsimony setting explicitly
dealing with double-strandedness.

91
Discussion