Quasispecies Assembly Using Network Flows - PowerPoint PPT Presentation

1 / 30
About This Presentation
Title:

Quasispecies Assembly Using Network Flows

Description:

Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya Georgia ... – PowerPoint PPT presentation

Number of Views:155
Avg rating:3.0/5.0
Slides: 31
Provided by: Kelly349
Category:

less

Transcript and Presenter's Notes

Title: Quasispecies Assembly Using Network Flows


1
Quasispecies Assembly Using Network Flows
  • Alex Zelikovsky
  • Georgia State University
  • Joint work with
  • Kelly Westbrooks Georgia State University
  • Irina Astrovskaya Georgia State University
  • David Campo Centers for Disease Control
  • Yury Khudyakov Centers for Disease Control
  • Piotr Berman Pennsylvania State University
  • Ion Mandoiu University of Connecticut

2
Outline
  • 454 Sequencing of Virus Genome
  • Quasispecies Assembly
  • The Read Graph
  • Network Flow Formulations
  • Phasing Flow Problem
  • Maximum Likelihood
  • Simulation Results

3
HCV Quasispecies
  • HCV is a small, enveloped, positive sense single
    strand RNA virus that is responsible for
    Hepatitis C infection.
  • Over the course of infection, the mutations made
    in replication are passed down to descendants,
    eventually producing a family of related variants
    of the ancestral genome referred as quasispecies.
  • Due to HCV's high mutation rate, in time the
    quasispecies in an infected person can become
    very diverse.
  • A better understanding of HCV quasispecies
    diversity could potentially lead to new
    treatments.
  • The ultimate objective of our work is to develop
    a method of computationally inferring the
    quasispecies sequences in a HCV-infected
    individual.

4
454 Sequencing
  • 454 Sequencing is one promising technology that
    may prove useful for quasispecies sequencing. It
    is a massively-parallel pyrosequencing system
    developed the by biotechnology firm 454 Life
    Sciences for genome sequencing.
  • The system fragments the source genetic material
    to be sequenced into pieces called reads. Then,
    each read is sequenced and the original genome is
    reconstructed via software.
  • Since this system was originally designed to
    sequence genetic material from a single organism,
    the software assembles all of the reads to a
    single genome.
  • In order to use 454 for quasispecies sequencing,
    new methods and software are needed to correctly
    infer the sequences of the quasispecies present
    in an infected person and their population
    frequencies directly from 454 read data.

5
Problem Formulations
  • Given a collection of 454 reads taken from a
    quasispecies population of unknown size and
    distribution, reconstruct the quasispecies
    population, i.e. the sequences and tehir
    frequencies.

Original Quasispecies
454 Reads
Reconstructed Quasispecies
6
Parsimonious/Min Cost Quasispecies Assembly
  • Given a set of reads,
  • Find the minimum number set of quasispecies
    covering all reads
  • Given a set of reads with costs on read overlaps,
  • Find the minimum number set of quasispecies
    covering all reads
  • The cost of the assembly should be inversely
    connected to the likelihood that the assembly is
    the correct one.

Reconstruction 1 Cost C1
Original Quasispecies
454 Reads
Reconstruction 2 Cost C2
If C1 lt C2, then favor Reconstruction 1 over
Reconstruction 2
7
Read Alignment
  • Before beginning assembly, first find the genome
    offset of the read. We assume that the consensus
    sequence for the particular strain of HCV that
    the quasispecies came from is available to us.
  • We simply align each read to the consensus
    sequence, choosing the offset that yields the
    best alignment (i.e. lowest Hamming distance).
  • Because HCV quasispecies don't contain repeats as
    long as a 454 read, the alignment is both fast
    and extremely accurate.

GUCUCAUCGGAACAGCAAAACACUUGCCCCGAACGCUAGCGGUUGGGGUA
CUAUUCAAUGGCUGUAG AACAGCAAAACACUUGCUCCGA
ACGCUAGCGGUUGGGGAACUAU
8
The Read Graph
  • The data structure a directed acyclic graph that
    contains every possible quasispecies
    reconstruction.
  • An aligned read can be contained within another
    aligned read.
  • Find the subset of reads that are not contained
    within any other read
  • We call these reads superreads
  • subreads everything else
  • The superreads are vertices in the read graph.

9
Edges of the Read Graph
  • Put an edge between read X and read Y if
  • X overlaps Y in the alignment
  • some suffix of X some prefix of Y

UGGACUAGAUGUGGUGGGUGCUCUCCGGAAUACCUUGGUGGCGGGU
GAUGUGGUGGGUGCUCUCCGGAAUACCUUGGUGGCGGGUUAGAGA
GGGUGCUCUCCGGAAUACCUUGGUGGCGGGUUAGA
GAGAAGAGAGCA
CUCCGGAAUACCUUGGUGGCGGGUUAGAGAGAAGAGAGCAAGUGUCA
AUACCUUGGUGGCGGGUUAGAGA
GAAGAGAGCAAGUGUCAACGCCUA
10
Quasispecies in the Read Graph
  • Then, we add two extra vertices a new vertex
    with outgoing edges to all vertices with indegree
    0 (the source) and a new vertex with incoming
    edges from all vertices with outdegree 0 (the
    sink)

The Read Graph
Source
Sink
Any path from the source to the sink represents a
potential sequence in the quasispecies population!
11
Transitive Reduction
  • Edge u ? w logically follows from edges u? v and
    v? w
  • Drop edge u? w from consideration no
    information, any quasispecies sequence containing
    u and w will also have v
  • The transitive reduction of a graph
  • smallest subgraph that maintains all
    reachability relationships
  • The graph is partially closed the transitive
    reduction found in O(dE), where d is the read
    degree

12
Estimating Read Frequencies
  • In general, superreads may be contained in
    several quasispecies sequences.
  • Thus, each superread has associated with its
    frequency the sum over the quasispecies of the
    population frequencies of quasispecies that
    contains the superread.
  • Although the true read frequencies are unknown to
    us, we may estimate them by counting the number
    of subreads contained within each superread.
  • By definition, the read frequency of the source
    and sink vertices are 0.

13
Probability of a True Overlap
  • Given N reads over Q sequences, each read with L
    possible starting positions, the probability that
    a position is bu for some read u is N/(LQ).
  • Let (u, v) be an edge in transitive reduction.
  • The probability of bv-bu gt ? is proportional to
    exp(? N/(LQ)).
  • Probability of an edge from the source or to sink
    is 0.

bu
GUGGGGGCAGCGGACGUAUGC
GACGUAUGCAGAACUCUAGGCA
?
bv
14
Network Flow Through Vertices
  • Replacing the vertex for read r with
  • two vertices r_b and r_e
  • and the edge (r_b, r_e)

15
Networks Flows
  • Observe that the true quasispecies sequences in
    the read graph can be represented as a flow

1
1
1
1
2
1
3
1
2
2
2
2
1
1
2
1
1
1
Each vertex has a frequency proportional to the
number and frequencies of the quasispecies that
contains it's associated superread. When we solve
the flow, we demand that each vertex has a inflow
passing into it gt its frequency.
16
Min Cost Flow
  • We define the cost of a flow in the following
    manner
  • The flow cost of an edge is that edge's cost
    multiplied by the amount of flow that traverses
    the edge
  • The cost of a flow through the graph is the sum
    of all of the edge flow costs.
  • Out of all possible flows that go through all of
    the vertices in the graph, we seek to find the
    flow with the minimum cost.
  • After solving min cost flow for the graph, all of
    the edges that have flow gt 0 are assumed to
    participate in true quasispecies. The remaining
    edges can be dropped from the graph.

17
LP for Min Cost Flow
  • Although there are fast combinatorial algorithms
    for solving min cost flow, we opted to solve the
    flow using a linear program.
  • For each edge e, create a real-valued,
    nonnegative variable fe to represent the flow
    across that edge.

The Read Graph
Source
Sink
18
Linear Program for Min Cost Quasispecies Assembly
  • ObjectiveMinimize the sum of cost(e) fe over
    all edges e in the read graph.Subject to
  • For all vertices v
  • The sum of fe over incoming edges to v equals the
    sum of fe over outgoing edges from v.
  • The sum of fe over incoming edges to v is greater
    or equal to the frequency of v.

19
Splitting Flow in Quasispecies
  • Five quasispecies share a common long segment
    a,b and differ on the left and the right in
    value of a SNP. The resulted graph with network
    flow have multiple feasible solutions.

l a
b r
b-a gt the read length
A A C C T
T T A CT
Multi set L
Multi set R
A
T
f2
f3
f2
f1
C
A
f1
T
C
f1
20
Quasispecies Matching Problem
  • Given
  • two multisets of haplotypes
  • L on the segment l, b and
  • R on the segment a, r
  • such that all haplotypes are indistinguishable
    on a common
  • segment a, b (l lt a lt b lt r), L R,
  • Find
  • the matching between multisets L and R such
    that
  • concatenation of the matched haplotypes
  • correspond to the original quasispecies.

21
Decomposing the flow into paths
  • General strategy
  • Find a source-to-sink path with positive flow f
  • Subtract f from all of the edge flows in the path
  • How to find paths?
  • Pick the shortest path ? most likely
    quasispecies
  • Pick the maximum bandwidth path ? most frequent
    quasispecies

1?3?5?6 is the shortest path 1?2?4?6 is the
maximum bandwidth path
22
Finding Max Bandwidth Paths
  • A single iteration of the Bellman-Ford algorithm
    gives an efficient method for finding the maximum
    bandwidth path from the source to the sink
  • Initialize
  • For each vertex I
  • W(i) ? 0
  • p(i) ? 0
  • For the source s
  • W(s) ? infinity
  • Relax
  • For each edge (i,j) in order      (i,j)lt
    (k,l) if iltk or ik jltl
  • if W(j) lt min W(i), cap(i,j)       
    cap(i,j) capacity of (i,j)
  • W(j) ? min W(i), cap(i,j)
  • p(j) ? I
  • Return path p(sink), p(p(sink)),..., source0
  • Finding minimum cost paths is simple just grow a
    shortest-path tree from the source using costs
    for weights.

23
Maximum Likelihood Choice
  • After path decomposition, we have a set of
    candidate quasispecies sequences, but we dont
    know what their frequencies are.
  • Given a set of candidate quasispecies and
    observed reads
  • Expectation Maximization alternates between 2
    steps until convergence Expectation (E) and
    Maximization (M)
  • E Step Calculates the expected likelihood by
    including the current estimate of the latent
    variables
  • M step Computes maximum likelihood estimates of
    parameters by maximizing the expected likelihood
    found in the previous E step

24
EM Implementation
  • Create a bipartite graph
  • Left side quasispecies
  • Right side superreads
  • Put an edge if quasispecies contains read
  • Keep 3 sets of numbers
  • For each qsp q, keep its estimated frequency fq
  • For each superread r, keep its frequency nr
  • For each (q, r) edge, keep pqr
  • E step Compute pqr nr fq / S fq for every
    edge
  • M step Compute fq S pqr / S nr for every qsp

25
Validation
  • Real quasispecies population, simulated reads
  • Real data 44 sequences from E1E2 region of HCV
  • 3 populations consisting of 10 sequences each
  • Uniformly distributed frequencies (the uniform
    population)
  • Geometrically distributed frequencies (the
    geometric population)
  • Highly skewed distribution (the skewed
    population)

26
Instance Generation
  • Inputs A quasispecies population Q, n number
    of reads desired n, read length mean µ and
    variance s2
  • S ?Ø
  • While S lt number of reads desired
  • Randomly select a quasispecies q of length lq
    according to the population frequency
    distribution
  • Generate a read length lr using normal
    distribution (µ, s2)
  • Generate an offset o using uniform distribution
    on 0, lq - lr
  • Extract a substring of length lr starting at
    position o from quasispecies q and add it to S
  • Return S

27
Quality Measures
  • Percentage of correctly predicted sequences
  • Takes into account frequencies
  • Symmetric difference between multisets
  • Switching error
  • Generalization of switching error from the
    haplotype phasing community
  • Average number of times each path corresponding
    to a predicted quasispecies switches between
    paths in the read graph corresponding to real
    quasispecies.

28
Initial Results
Read Length Number of Reads Correctly Predicted
Uniform Uniform Uniform
400 40000 8
400 50000 4
500 40000 29
500 50000 38
Geometric Geometric Geometric
400 40000 21
400 50000 11
500 40000 50
500 50000 66
Skewed Skewed Skewed
400 40000 46
400 50000 10
500 40000 82
500 50000 84
29
Results Geometric Instances
30
Results Uniform Instances
Write a Comment
User Comments (0)
About PowerShow.com