Whole Genome Assembly - PowerPoint PPT Presentation

Slides: 26
Provided by: Owne834
Transcript and Presenter's Notes

1
Whole Genome Assembly
  • The problem: biological sequencing yields only
    relatively short fragments (clones) of DNA.
  • How can the whole genome be reconstructed?
  • Two general approaches have been implemented:
  • clone-by-clone
  • shotgun sequencing (the one discussed here)
  • The results were almost identical

2
Mapping using Clones
  • Clone
  • A large fragment of genomic DNA obtained using
    restriction enzymes
  • One can make a large number of faithful copies
    of a clone from a small number of initial
    clones.
  • All location information for a clone is assumed
    to be lost. For instance, it is not known:
  • Which chromosome a clone belongs to
  • Whether two clones overlap
  • What base-pair sequence the clone has, etc.

3
Clone Library
  • A large set of clones (a clone library) is created.
  • Locations of the clones are assumed to be
    distributed uniformly at random.
  • The sizes of all clones are roughly the same.
  • G = genome length, L = clone length,
  • N = number of clones in the library
  • Coverage c = NL/G
  • (c is the expected number of clones covering any
    location of the genome.)
  • If the coverage is at least 3 for a base, that
    base is assumed to be reliably determined.
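
The coverage formula can be checked numerically. The sketch below is only an illustration: the function names are made up for this example, and the numbers are the BAC library figures quoted elsewhere in these slides.

```python
import math

def coverage(n_clones: int, clone_len: int, genome_len: int) -> float:
    """Expected number of clones covering a genome position: c = N * L / G."""
    return n_clones * clone_len / genome_len

def clones_needed(target_c: float, clone_len: int, genome_len: int) -> int:
    """Smallest N for which N * L / G reaches the target coverage (hypothetical helper)."""
    return math.ceil(target_c * genome_len / clone_len)

# Human BAC library: N = 96,000 clones, L = 180 kb, G = 3,300 Mb.
c = coverage(96_000, 180_000, 3_300_000_000)
print(f"coverage = {c:.2f}x")

# Number of 180 kb clones needed for a full 6x library of the same genome.
print(clones_needed(6, 180_000, 3_300_000_000))
```

Note that c is an average: with randomly placed clones, some positions are covered more often and some not at all.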

4
Clone Library
[Figure: the genome, the clone library covering it,
and a minimal tiling path]
5
Clone Libraries Commonly Used
6
Example Genome Sizes
  • Bacteriophage lambda (virus), 50,000 bp
  • Escherichia coli (bacterium), 5,000,000 bp
  • Saccharomyces cerevisiae (yeast), 10,000,000 bp
  • Caenorhabditis elegans (worm), 100,000,000 bp
  • Drosophila melanogaster (fruit fly), 200,000,000 bp
  • Homo sapiens (human), 3,000,000,000 bp

7
Example
  • A BAC library for human
  • G = 3,300 Mb, L = 180 kb, N = 96,000
  • c = NL/G = (96 × 10^3 × 180 × 10^3) / (3.3 × 10^9) ≈ 5.2
  • 96,000 randomly chosen BACs from the human genome
    provide roughly a 5× library.
  • Certain regions of the genome may be difficult to
    clone and hence may not be represented in the
    library.
  • A Tiling Path is a subset of clones that
    minimally cover the genome.
  • Removal of any clone from the tiling path
    will leave some location of the genome uncovered.
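
A tiling path can be sketched as an interval-covering problem. The code below assumes, purely for illustration, that each clone's position is already known as a half-open interval (start, end); in a real project that information has to be inferred from fingerprints first. The function name and the sample intervals are made up. The greedy rule (among clones starting at or before the covered point, take the one reaching furthest) yields a cover with the fewest clones, so removing any clone leaves a gap.

```python
def tiling_path(clones, genome_len):
    """Greedy minimum cover of [0, genome_len) by half-open intervals (start, end)."""
    clones = sorted(clones)
    path, covered, i = [], 0, 0
    while covered < genome_len:
        best = None
        # Among clones that start at or before the covered point,
        # keep the one that extends furthest to the right.
        while i < len(clones) and clones[i][0] <= covered:
            if best is None or clones[i][1] > best[1]:
                best = clones[i]
            i += 1
        if best is None or best[1] <= covered:
            raise ValueError(f"no clone covers position {covered}")
        path.append(best)
        covered = best[1]
    return path

clones = [(0, 40), (10, 60), (30, 90), (50, 100), (85, 120)]
print(tiling_path(clones, 120))
```

The ValueError branch corresponds to the regions that are difficult to clone and therefore missing from the library.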

8
Mapping A Single Clone
  • Provide a clone with additional information, a
    fingerprint:
  • Restriction pattern
  • End sequencing (500 base pairs at each end)
  • Probes (hybridization probes, etc.)
  • Restriction pattern:
  • Take a clone and completely digest it into small
    pieces (restriction fragments) by a restriction
    enzyme.
  • The restriction fragments and their order are
    always the same for that clone.

9
  • Restriction patterns are too expensive for large
    genome projects.
  • Mapping with probes is much cheaper. Probes may
    be short random sequences or extracted from DNA.
  • STSs (sequence-tagged sites) are DNA pieces long
    enough to be unique with very high probability.
    Probing is usually done at the ends of clones.

10
Sequence Assembly
  • Idealized assembly with probes can be formalized
    as the following problem (assuming no errors in
    the read sequences):
  • Shortest Common Superstring (SCS) problem
  • Given a collection of n strings F = {f1, f2, ...,
    fn}, find the shortest string that is a
    superstring of every string in F.

11
Example
  • F = {actcc, gagca, ccctac, agg}
  • Each of these is a superstring of everything in F
    (because each contains everything in F as a
    substring):
  • actccgagcaccctacagg (length 19)
  • actcccctacaggagca (length 17)
  • aggagcactccctac (length 15)
  • (Overlaps were shown in bold on the slide.) The
    shortest one is best.
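
The three candidates can be checked mechanically, and for a set this small the shortest superstring can be found by trying every ordering. This is a brute-force sketch for illustration only: the function names are invented here, and the factorial running time makes it useless beyond a handful of fragments.

```python
from itertools import permutations

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge_in_order(order) -> str:
    """Glue fragments left to right, reusing the largest overlap at each joint."""
    s = order[0]
    for f in order[1:]:
        s += f[overlap(s, f):]
    return s

def is_superstring(s: str, frags) -> bool:
    return all(f in s for f in frags)

def brute_force_scs(frags) -> str:
    """Try all orderings; fine for tiny sets where no fragment contains another."""
    return min((merge_in_order(p) for p in permutations(frags)), key=len)

F = ["actcc", "gagca", "ccctac", "agg"]
for cand in ["actccgagcaccctacagg", "actcccctacaggagca", "aggagcactccctac"]:
    print(cand, len(cand), is_superstring(cand, F))
print(len(brute_force_scs(F)))
```

For this F no ordering does better than length 15, so the third candidate is a shortest common superstring.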

12
About the SCS Problem
  • No algorithm is known which is guaranteed to
    find the best solution (i.e. the shortest
    superstring) in reasonable time for large cases.
  • In practice we therefore cannot guarantee finding
    the correct/best superstring (all genome
    reconstruction problems are large).
  • That's not all: even if we had a solution,
    hard problems would remain.

13
  • Problems with the SCS solution:
  • It does not allow for experimental errors; it
    needs a perfect superstring.
  • The orientation of each fragment must be known,
    which is unrealistic.
  • It may not be the biological solution: if there
    are many long repeats, it is almost impossible to
    reconstruct the exact original sequence.

14
Example of the repeats problem
  • The real sequence may include
  • agactactactactga, and the covering fragments
    may be
  • agactactact and actactactga,
  • which solving the SCS will collapse to
    agactactactga,
  • overlapping too many of the act repeats.
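
The collapse can be reproduced in a few lines. This is a minimal sketch with an invented helper name: it merges the two fragments at their largest suffix/prefix overlap, which is what a shortest-superstring solution does.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

true_seq = "agactactactactga"        # four copies of "act"
left, right = "agactactact", "actactactga"

k = overlap(left, right)             # the fragments share "actactact"
merged = left + right[k:]
print(k, merged, len(merged), len(true_seq))
```

Both fragments really do come from the true sequence, yet the merged string has only three copies of the repeat: the shortest answer is not the biological one.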

15
Fragment Assembly
  • Despite the issues, we need to put the fragments
    together somehow, so:
  • Approximate methods are used, i.e. algorithms
    that try to find good (maybe not the best)
    solutions quickly.
  • Repeats are dealt with by focussing extra
    biotechnology effort to get large sequenced
    fragments (large reads) around certain areas
    which might contain multiple repeats.

16
Overlap multigraphs
A useful data structure for addressing the SCS is
the overlap multigraph of the set of strings.
This is denoted OM(F) for a set of strings
(fragments) F. First, an illustration for the
set of fragments
maryha, teass, ryhad, hite, aswhit, lambit, ssnow,
tlela, little, yhadali, lelam, tsfle, leecew, ecewas
[Figure: overlap multigraph of these fragments. The
edge between two fragments is labelled by the length
of their overlap.]
17
More formally
OM(F) is a directed graph in which we have an
edge from fa to fb whenever there is a nonzero
overlap between a suffix of fa and a prefix
of fb. The edge is labelled with the largest
such overlap. A Hamiltonian path in a
directed graph is a path which visits every node
in the graph precisely once. Any Hamiltonian
path in an overlap multigraph corresponds to a
superstring. Finally, finding the SCS
corresponds to finding the Hamiltonian path with
maximal weight (adding up the edges on the path).
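
A minimal sketch of this construction, using the fragments from the illustration. The function names are invented for this example, and the graph is stored as a plain dictionary keyed by (fa, fb). The last lines show why maximal weight means shortest superstring: the superstring's length is the total fragment length minus the weight of the path.

```python
from itertools import permutations

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def overlap_graph(frags):
    """OM(F) as {(fa, fb): largest overlap}, keeping only nonzero overlaps."""
    graph = {}
    for a, b in permutations(frags, 2):
        w = overlap(a, b)
        if w > 0:
            graph[(a, b)] = w
    return graph

def path_weight(graph, path) -> int:
    """Sum of edge labels along a path given as a list of fragments."""
    return sum(graph[(a, b)] for a, b in zip(path, path[1:]))

# Fragments listed in the order they occur in the underlying sentence.
F = ["maryha", "ryhad", "yhadali", "little", "tlela", "lelam", "lambit",
     "tsfle", "leecew", "ecewas", "aswhit", "hite", "teass", "ssnow"]
om = overlap_graph(F)
w = path_weight(om, F)
total = sum(len(f) for f in F)
print(om[("maryha", "ryhad")], w, total - w)
```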

18
Example
[Figure: the same overlap multigraph, with the
maximum-weight Hamiltonian path drawn in red.]
The red path is the longest (maximum-weight)
Hamiltonian (H) path in this case, and leads to the
correct sequence. But in genome sequencing it is
rarely so easy: the graph is immensely larger, and
there are very many H paths.
19
How it's done
  • The basic approach is this (where an edge is a
    pair (start, finish)):
  • 1. Build the OM(F) graph.
  • 2. Sort the edges in descending order of
    weight, breaking ties randomly.
  • 3. Iteratively choose the highest-weighted
    edge (first in the list) with the following
    properties:
  • -- its finish is not the finish of a previously
    chosen edge, and adding it does not close a
    cycle with the edges already chosen
  • -- its start is not the start of a
    previously chosen edge (i.e. we're building a
    path, not a tree).
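
The steps above can be sketched as follows. This is an illustrative sketch rather than a production assembler: the names are invented, fragments are assumed to be distinct with none containing another, and ties are broken by list order instead of randomly so that the run is repeatable.

```python
from itertools import permutations

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(frags) -> str:
    # Step 1: build OM(F) as a list of (weight, start, finish) edges.
    edges = [(overlap(a, b), a, b) for a, b in permutations(frags, 2)]
    edges = [e for e in edges if e[0] > 0]
    # Step 2: heaviest edges first (stable sort, so ties keep list order).
    edges.sort(key=lambda e: -e[0])
    succ, pred = {}, {}
    # Step 3: take an edge only if start and finish are still free
    # and the edge would not close a cycle.
    for _, a, b in edges:
        if a in succ or b in pred:
            continue
        node = b
        while node in succ:          # walk forward from the finish
            node = succ[node]
        if node == a:                # we came back to the start: a cycle
            continue
        succ[a], pred[b] = b, a
    # Spell out each chosen path; unconnected paths are simply concatenated.
    pieces = []
    for f in frags:
        if f in pred:
            continue
        s, node = f, f
        while node in succ:
            nxt = succ[node]
            s += nxt[overlap(node, nxt):]
            node = nxt
        pieces.append(s)
    return "".join(pieces)

F = ["maryha", "teass", "ryhad", "hite", "aswhit", "lambit", "ssnow",
     "tlela", "little", "yhadali", "lelam", "tsfle", "leecew", "ecewas"]
print(greedy_assemble(F))
```

On this small example the greedy choice happens to recover the full sentence; in general it gives a good superstring, not necessarily the shortest.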

20
Reconstruction
  • To deal with experimental errors, we need to
    introduce the concept of a distance between
    sequences.
  • Substring edit distance: every insertion,
    deletion, or substitution costs one unit of
    distance. Exception: deletions at the
    extremities of the second sequence have no cost.
  • If we measure substring edit distances between
    strings, then we can require a string S such that
    either f or its reverse complement is an
    approximate substring of S at error level e.

21
Given a collection F of strings and an error
tolerance e between 0 and 1, find the shortest
possible string S such that for every f in F,
dist(f, S) ≤ e·|f| or dist(f', S) ≤ e·|f|, where
f' is the reverse complement of f and |f| is the
length of f. In other words, either f or its
reverse complement must be an approximate
substring of S at error level e. This means that
we are allowed an average of e errors for each base
in f. If e = 0.05, we are allowed 5 errors per
hundred bases.
Example: if a = GCGATAG and b = CAGTCGCTGATCGTACG,
then the best alignment is
-----GC-GATAG----
CAGTCGCTGATCGTACG
(one gap and one mismatch, so dist(a, b) = 2).
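
The substring edit distance can be computed with the usual edit-distance table, changed so that starting and ending anywhere in the second sequence is free. The sketch below is one straightforward way to do it; the function names are invented for this example, and the comparison uses ≤, which is an assumption about the intended inequality.

```python
def substring_edit_distance(f: str, s: str) -> int:
    """Fewest insertions, deletions and substitutions needed to turn f into
    some substring of s (unmatched ends of s cost nothing)."""
    prev = [0] * (len(s) + 1)            # row 0: free to start anywhere in s
    for i in range(1, len(f) + 1):
        cur = [i] + [0] * len(s)         # column 0: i bases of f left unmatched
        for j in range(1, len(s) + 1):
            mismatch = 0 if f[i - 1] == s[j - 1] else 1
            cur[j] = min(prev[j - 1] + mismatch,   # match or substitute
                         prev[j] + 1,              # base of f with no partner
                         cur[j - 1] + 1)           # base of s with no partner
        prev = cur
    return min(prev)                     # free to end anywhere in s

def reverse_complement(f: str) -> str:
    return f[::-1].translate(str.maketrans("ACGT", "TGCA"))

def approx_substring(f: str, s: str, e: float) -> bool:
    """True if f or its reverse complement fits into s at error level e."""
    d = min(substring_edit_distance(f, s),
            substring_edit_distance(reverse_complement(f), s))
    return d <= e * len(f)

a, b = "GCGATAG", "CAGTCGCTGATCGTACG"
print(substring_edit_distance(a, b))     # the alignment above costs 2
print(approx_substring(a, b, 0.05), approx_substring(a, b, 0.3))
```

With 7 bases and 2 errors, a only counts as an approximate substring of b once e is at least 2/7.
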
22
The next step
  • Once short reads (a few hundred bp) have been
    assembled into contigs (long sequences) by some
    of these techniques, the contigs are assembled
    into supercontigs (or scaffolds) and placed at
    the right locations on the genome.

23
The order of the whole genome assembly
24
Placing the assembly on the genome
  • A sequence tag is a short sequence that is unique
    within the whole genome.
  • A genetic map contains many sequence tags and
    their locations.
  • Align the supercontigs to the genome according
    to the tags.
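
As a toy illustration of the last step, the sketch below orders supercontigs by the map positions of the tags they contain. Everything in it is hypothetical: the tag sequences, map positions, contig names and the function name are made up, tags are matched exactly, and orientation and sequencing errors are ignored.

```python
def place_contigs(contigs, tag_map):
    """Order contigs by estimated genome start, using exact tag matches.

    contigs: {name: sequence}; tag_map: {tag sequence: position on the map}.
    Contigs that contain no tag are left unplaced.
    """
    placed = []
    for name, seq in contigs.items():
        # Estimated contig start = tag's map position - tag's offset in the contig.
        starts = [pos - seq.find(tag) for tag, pos in tag_map.items() if tag in seq]
        if starts:
            placed.append((min(starts), name))
    return [name for _, name in sorted(placed)]

tag_map = {"ACGTAC": 100, "TTGACA": 5000, "GGCATC": 9000}
contigs = {"ctgB": "AATTGACAGG", "ctgA": "CCACGTACTT", "ctgC": "TGGCATCAA"}
print(place_contigs(contigs, tag_map))
```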

25
Review of WGS
  • WGS: whole-genome shotgun sequencing
  • 1. Create a clone library.
  • 2. Assemble the overlapping reads into contigs.
  • 3. Assemble the contigs into supercontigs.
  • 4. Align the supercontigs to the genome.
  • 5. Genome finishing: fill the possible gaps
    between supercontigs by additional sequencing.