Title: Whole Genome Assembly
1Whole Genome Assembly
- The problem: biological sequencing yields only relatively short fragments (clones) of DNA.
- How can the whole genome be reconstructed?
- Two general approaches have been implemented:
- clone-by-clone
- shotgun sequencing (the one discussed here)
- The results were almost identical.
2Mapping using Clones
- Clone
- A large fragment of genomic DNA obtained using restriction enzymes.
- A clone can be copied faithfully a large number of times, starting from a small number of initial clones.
- All location information for a clone is assumed to be lost. For instance, it is not known:
- which chromosome a clone belongs to
- whether two clones overlap
- what base-pair sequence the clone has, etc.
3Clone Library
- A large set of clones (a clone library) is created.
- The locations of the clones are assumed to be uniformly randomly distributed.
- The sizes of all clones are roughly the same.
- G = genome length, L = clone length, N = number of clones in the library
- Coverage: c = NL/G
- (c is the expected number of clones covering any location of the genome.)
- If a base is covered at least 3 times, it is assumed to be reliably determined.
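As a rough illustration of the coverage formula, the sketch below computes c = NL/G and estimates how often a base reaches a given depth. The Poisson model for the depth is an assumption (it follows from uniformly random clone placement when L is much smaller than G, but it is not stated on the slide), and the function names and the numbers are made up for the example.

```python
import math

def coverage(n_clones: int, clone_len: float, genome_len: float) -> float:
    """Expected number of clones covering a location: c = N*L/G."""
    return n_clones * clone_len / genome_len

def prob_covered_at_least(c: float, k: int) -> float:
    """P(depth >= k), assuming depth ~ Poisson(c) (modelling assumption)."""
    below_k = sum(math.exp(-c) * c**i / math.factorial(i) for i in range(k))
    return 1.0 - below_k

# Illustrative numbers only: 10,000 clones of 150 kb on a 300 Mb genome.
c = coverage(10_000, 150e3, 300e6)           # 5.0
p_uncovered = math.exp(-c)                   # chance that a base is in no clone
p_at_least_3 = prob_covered_at_least(c, 3)   # chance of depth 3 or more
print(f"c = {c:.1f}, P(uncovered) = {p_uncovered:.4f}, P(depth >= 3) = {p_at_least_3:.3f}")
```

Under this model, even a 5x library leaves a small fraction of bases uncovered and a noticeably larger fraction below depth 3.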
4Clone Library
[Figure: a genome, the clone library covering it, and a minimal tiling path]
5Clone Libraries Commonly Used
6Example Genome Sizes
- Bacteriophage lambda (virus): 50,000 bp
- Escherichia coli (bacterium): 5,000,000 bp
- Saccharomyces cerevisiae (yeast): 10,000,000 bp
- Caenorhabditis elegans (worm): 100,000,000 bp
- Drosophila melanogaster (fruit fly): 200,000,000 bp
- Homo sapiens (human): 3,000,000,000 bp
7Example
- A BAC library for human
- G = 3,300 Mb, L = 180 kb, N = 96,000
- c = NL/G = (96 × 10^3 × 180 × 10^3) / (3.3 × 10^9) ≈ 5.2
- 96,000 randomly chosen BACs from the human genome provide roughly a 5x library.
- Certain regions of the genome may be difficult to clone and hence may not be represented in the library.
- A tiling path is a subset of clones that minimally covers the genome.
- Removing any clone from the tiling path leaves some location of the genome uncovered.
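If clone positions were already known, a tiling path could be picked with a standard greedy interval cover. The sketch below assumes clones are given as half-open (start, end) coordinates on a genome [0, G); this representation and the function name are hypothetical, not taken from the slides.

```python
def minimal_tiling_path(clones, genome_len):
    """Greedy interval cover of [0, genome_len) by half-open (start, end) clones.

    Returns a cover using the fewest clones (so no clone in it is redundant),
    or None if some location is not covered by any clone.
    """
    clones = sorted(clones)
    path, pos, i = [], 0, 0
    while pos < genome_len:
        best = None
        # Among clones starting at or before the first uncovered position,
        # keep the one that reaches furthest to the right.
        while i < len(clones) and clones[i][0] <= pos:
            if best is None or clones[i][1] > best[1]:
                best = clones[i]
            i += 1
        if best is None or best[1] <= pos:
            return None  # a gap: nothing extends the covered region
        path.append(best)
        pos = best[1]
    return path

# Toy library on a genome of length 100.
library = [(0, 30), (10, 45), (25, 60), (40, 70), (55, 100), (65, 90)]
tiling = minimal_tiling_path(library, 100)        # [(0, 30), (25, 60), (55, 100)]
gapped = minimal_tiling_path([(0, 30), (40, 100)], 100)  # None: 30..40 is uncovered
```

In a real project the clone coordinates are exactly what is unknown, which is why the fingerprinting and probing steps are needed first.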
8Mapping A Single Clone
- Provide a clone with additional information: a fingerprint
- Restriction pattern
- End sequencing (500 base pairs on each end)
- Probes (hybridization probes, etc.)
- Restriction pattern
- Take a clone and completely digest it into small pieces (restriction fragments) with a restriction enzyme.
- The restriction fragments and their order are always the same for that clone.
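A complete digest can be mimicked on a string to show why the restriction pattern works as a fingerprint. This is a minimal sketch: it assumes the enzyme EcoRI (recognition site GAATTC, cut between G and A), models only one strand, and uses hypothetical function names and a made-up clone sequence.

```python
def restriction_fragments(clone: str, site: str = "GAATTC", cut_offset: int = 1):
    """Completely digest `clone` at every occurrence of `site`.

    The enzyme is assumed to cut `cut_offset` bases into the site
    (EcoRI cuts G^AATTC, i.e. offset 1). Only one strand is modelled.
    """
    fragments, start = [], 0
    pos = clone.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(clone[start:cut])
        start = cut
        pos = clone.find(site, pos + 1)
    fragments.append(clone[start:])
    return fragments

def fingerprint(clone: str):
    """Sorted fragment lengths, roughly what a gel would show."""
    return sorted(len(f) for f in restriction_fragments(clone))

clone = "TTGAATTCAGGCTAGAATTCCA"          # made-up sequence with two EcoRI sites
pieces = restriction_fragments(clone)     # ['TTG', 'AATTCAGGCTAG', 'AATTCCA']
sizes = fingerprint(clone)                # [3, 7, 12]
```

The same clone always yields the same fragments, so two clones with many fragment sizes in common are likely to overlap.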
9
- Restriction patterns are too expensive for large genome projects.
- Mapping with probes is much cheaper. Probes may be short random sequences or sequences extracted from DNA.
- STSs (sequence-tagged sites) are DNA pieces long enough to be unique with very high probability. Probing is usually done at the ends of clones.
10Sequence Assembly
- Idealized assembly with probes can be formalized as the following problem (assuming no errors in the read sequences):
- Shortest Common Superstring (SCS) Problem
- Given a collection of n strings F = {f1, f2, ..., fn}, find the shortest string that is a superstring of every string in F.
11Example
- F = {actcc, gagca, ccctac, agg}
- Each of these is a superstring of everything in F (because each contains every string in F as a substring):
- actccgagcaccctacagg
- actcccctacaggagca
- aggagcactccctac
- (Overlaps were shown in bold on the slide.) The shortest one is best.
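For a set this small the shortest common superstring can be found by brute force. The sketch below tries every ordering of the fragments and glues neighbours with their maximal overlap. It assumes no fragment is contained in another, and the function names are illustrative.

```python
from itertools import permutations

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge_in_order(fragments):
    """Glue fragments left to right, overlapping each with the next as much as possible."""
    result = fragments[0]
    for prev, nxt in zip(fragments, fragments[1:]):
        result += nxt[overlap(prev, nxt):]
    return result

def shortest_common_superstring(fragments):
    """Exhaustive search over all n! orderings; only feasible for a handful of fragments."""
    return min((merge_in_order(p) for p in permutations(fragments)), key=len)

F = ["actcc", "gagca", "ccctac", "agg"]
scs = shortest_common_superstring(F)   # 'aggagcactccctac' (length 15)
```

The number of orderings grows factorially, so this approach is hopeless beyond a dozen or so fragments.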
12About the SCS Problem
- No algorithm is known that is guaranteed to find the best solution (i.e. the shortest superstring) in reasonable time for large instances; the problem is NP-hard.
- In practice we therefore cannot guarantee finding the correct/best superstring (all genome reconstruction problems are large).
- That's not all: even if we had a solution, there would still be hard problems.
13
- Problems with the SCS solution:
- It does not allow for experimental errors; it requires a perfect superstring.
- The orientation of each fragment must be known, which is unrealistic.
- It may not be the biological solution: if there are many long repeats, it is almost impossible to reconstruct the exact original sequence.
14Example of the repeats problem
- The real sequence may include agactactactactga, and the covering fragments may be agactactact and actactactga,
- which solving the SCS will collapse to agactactactga, overlapping too many of the act repeats.
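The collapse can be reproduced in a few lines. The sketch merges the two fragments at their maximal overlap, which is what a shortest-superstring solution does; `overlap` is a helper written for this example.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

real = "agactactactactga"                  # four copies of the repeat 'act'
f1, f2 = "agactactact", "actactactga"      # both are genuine substrings of `real`

k = overlap(f1, f2)                        # 9: three whole repeats overlap
assembled = f1 + f2[k:]                    # 'agactactactga', only three repeats
lost = len(real) - len(assembled)          # 3 bases, i.e. one repeat unit
```

Both fragments are error-free reads of the real sequence, yet the shortest superstring silently drops one repeat copy.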
15Fragment Assembly
- Despite these issues, we need to put the fragments together somehow, so
- approximate methods are used: algorithms that try to find good (maybe not the best) solutions quickly.
- Repeats are dealt with by focusing extra biotechnology effort on obtaining large sequenced fragments (long reads) around areas that might contain multiple repeats.
16Overlap multigraphs
A useful data structure for addressing the SCS is
the overlap multigraph of the set of strings.
This is denoted OM(F) for a set of strings
(fragments) F. First, an illustration for the
set of fragments
[Figure: overlap multigraph whose nodes are the fragments maryha, teass, ryhad, hite, aswhit, lambit, ssnow, tlela, little, yhadali, lelam, tsfle, leecew, ecewas, with edges labelled by overlap lengths between 1 and 4]
Each edge between two fragments is labelled with the length of their overlap.
17More formally
OM(F) is a directed graph with an edge from fa to fb whenever there is a nonzero overlap between a suffix of fa and a prefix of fb. The edge is labelled with the largest such overlap. A Hamiltonian path in a directed graph is a path that visits every node in the graph precisely once. Any Hamiltonian path in an overlap multigraph corresponds to a superstring. Finally, finding the SCS corresponds to finding the Hamiltonian path with maximal weight (the sum of the edge labels along the path).
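To make the definition concrete, the sketch below builds the overlap graph for the small fragment set {actcc, gagca, ccctac, agg} and shows how path weight relates to superstring length. The dictionary representation and the function names are choices made for this example.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def overlap_graph(fragments):
    """Return {(fa, fb): w} for every ordered pair of distinct fragments with overlap w > 0."""
    graph = {}
    for fa in fragments:
        for fb in fragments:
            w = overlap(fa, fb) if fa != fb else 0
            if w > 0:
                graph[(fa, fb)] = w
    return graph

def path_weight(path, graph):
    """Sum of edge labels along a path; a missing edge counts as overlap 0."""
    return sum(graph.get(edge, 0) for edge in zip(path, path[1:]))

F = ["actcc", "gagca", "ccctac", "agg"]
OM = overlap_graph(F)
path = ["agg", "gagca", "actcc", "ccctac"]
weight = path_weight(path, OM)                       # 1 + 1 + 2 = 4
superstring_len = sum(len(f) for f in F) - weight    # 19 - 4 = 15
```

Each unit of path weight saves one character, which is why the heaviest Hamiltonian path gives the shortest superstring.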
18Example
[Figure: the same overlap multigraph, with the maximum-weight Hamiltonian path highlighted in red]
The red path is the heaviest Hamiltonian path in this case, and it leads to the correct sequence. But in genome sequencing it is rarely so easy: the graph is immensely larger, and there are very many Hamiltonian paths.
19How it's done
- The basic approach is this (where an edge is a pair (start, finish)):
- 1. Build the OM(F) graph.
- 2. Sort the edges in descending order of weight, breaking ties randomly.
- 3. Iteratively choose the highest-weighted remaining edge (first in the list) with the following properties:
- -- its finish is not the finish of a previously chosen edge, and adding it does not close a cycle with the edges already chosen
- -- its start is not the start of a previously chosen edge (i.e. we're building a path, not a tree).
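The greedy procedure can be sketched as follows. Edges are ordered from heaviest to lightest so that the first edge in the list is the highest-weighted one. The cycle test uses a small union-find structure and ties are broken by input order rather than randomly; both are implementation choices for this example, not taken from the slides. Fragments are assumed to be distinct strings.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assembly(fragments):
    # 1. Build OM(F) as a list of (weight, start, finish) edges.
    edges = [(overlap(a, b), a, b) for a in fragments for b in fragments if a != b]
    edges = [e for e in edges if e[0] > 0]
    # 2. Heaviest edges first; Python's sort is stable, so ties keep input order.
    edges.sort(key=lambda e: -e[0])

    successor, has_pred = {}, set()
    parent = {f: f for f in fragments}      # union-find forest for cycle detection

    def find(f):
        while parent[f] != f:
            parent[f] = parent[parent[f]]   # path halving
            f = parent[f]
        return f

    # 3. Accept an edge only if start has no successor yet, finish has no
    #    predecessor yet, and the two ends are not already on the same chain.
    for w, a, b in edges:
        if a in successor or b in has_pred or find(a) == find(b):
            continue
        successor[a] = (b, w)
        has_pred.add(b)
        parent[find(a)] = find(b)

    # Read off each chain from its first fragment and concatenate the chains.
    pieces = []
    for head in fragments:
        if head in has_pred:
            continue
        cur, text = head, head
        while cur in successor:
            cur, w = successor[cur]
            text += cur[w:]
        pieces.append(text)
    return "".join(pieces)

F = ["actcc", "gagca", "ccctac", "agg"]
assembled = greedy_assembly(F)   # 'aggagcactccctac'
```

On this tiny input the greedy choice happens to find the optimum; in general it only gives a good, not necessarily shortest, superstring.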
20Reconstruction
- To deal with experimental errors, we need to introduce the concept of a distance between sequences.
- Substring edit distance: every insertion, deletion, or substitution costs one unit of distance. Exception: deletions at the extremities of the second sequence have no cost.
- If we measure substring edit distances between strings, then we can require a string S such that, for each fragment f, either f or its reverse complement is an approximate substring of S at error level e.
21
Given a collection F of strings and an error tolerance e between 0 and 1, we need to find the shortest possible string S such that for every f in F we have dist(f, S) ≤ e|f|, where dist is the substring edit distance and f may be replaced by its reverse complement. In other words: find a string S, as short as possible, such that every f (or its reverse complement) is an approximate substring of S at error level e. This means that we are allowed an average of e errors per base of f; if e = 0.05, we are allowed 5 errors per hundred bases.
Example: if a = GCGATAG and b = CAGTCGCTGATCGTACG, then the best alignment is
-----GC-GATAG----
CAGTCGCTGATCGTACG
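The substring edit distance can be computed with the usual edit-distance dynamic program, except that starting and ending anywhere in the second sequence is free. The sketch below is one way to write it; the function names are illustrative, and the bound dist ≤ e|f| is used as the acceptance test.

```python
def reverse_complement(seq: str) -> str:
    """Reverse the sequence and swap A<->T, C<->G."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def substring_edit_distance(f: str, s: str) -> int:
    """Minimum number of unit-cost edits turning f into some substring of s."""
    prev = [0] * (len(s) + 1)            # row 0: skipping a prefix of s is free
    for i, fc in enumerate(f, start=1):
        cur = [i]                        # f[:i] against nothing costs i
        for j, sc in enumerate(s, start=1):
            cur.append(min(prev[j - 1] + (fc != sc),   # match or substitution
                           prev[j] + 1,                # base of f against a gap
                           cur[j - 1] + 1))            # base of s against a gap
        prev = cur
    return min(prev)                     # skipping a suffix of s is free

def is_approximate_substring(f: str, s: str, e: float) -> bool:
    """True if f or its reverse complement fits in s with at most e*|f| errors."""
    d = min(substring_edit_distance(f, s),
            substring_edit_distance(reverse_complement(f), s))
    return d <= e * len(f)

a, b = "GCGATAG", "CAGTCGCTGATCGTACG"
d = substring_edit_distance(a, b)   # 2: one gap and one mismatch, as in the alignment
```

With |a| = 7, two errors are acceptable only for an error level of at least 2/7, so `is_approximate_substring(a, b, 0.3)` holds while `is_approximate_substring(a, b, 0.05)` does not.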
22The next step
- Once short reads (a few hundred bp) have been assembled into contigs (long sequences) by some of these techniques, the contigs are assembled into supercontigs (or scaffolds) and placed at the right locations on the genome.
23The order of the whole genome assembly
24Placing the assembly on the genome
- A sequence tag is a short sequence that is unique in the whole genome.
- A genetic map contains many sequence tags and their locations.
- Align the supercontigs to the genome according to the tags.
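As a toy illustration of tag-based placement, the sketch below looks for known tags inside each supercontig and infers where the supercontig starts on the genome. Everything here is hypothetical: the function name, the toy genome (used only to make the example checkable), and the simplifications that tags match exactly and supercontigs are in forward orientation.

```python
def place_by_tags(supercontigs, tag_positions):
    """Return sorted (genome_start, supercontig) pairs.

    `tag_positions` maps each tag sequence to its known genome coordinate.
    A supercontig is placed using the first tag found in it; supercontigs
    containing no tag are left out.
    """
    placed = []
    for contig in supercontigs:
        for tag, genome_pos in tag_positions.items():
            offset = contig.find(tag)
            if offset != -1:
                placed.append((genome_pos - offset, contig))
                break
    return sorted(placed)

# Toy data: in practice the genome is unknown and only the tag map is available.
genome = "AACCGTTGACATGCGTATTCAGGCTAGCCATGGTCAAGTTCCGA"
tag_map = {tag: genome.find(tag) for tag in ("GACATGCG", "GCTAGCCA")}
contig_a, contig_b = genome[20:40], genome[3:18]
placements = place_by_tags([contig_a, contig_b], tag_map)
# contig_b is placed at position 3 and contig_a at position 20
```

A real pipeline would also have to handle reverse-complemented supercontigs, sequencing errors in the tags, and conflicting tag evidence.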
25Review of WGS
- WGS: Whole Genome Shotgun sequencing
- 1. Create a clone library.
- 2. Assemble the overlapping reads into contigs.
- 3. Assemble the contigs into supercontigs.
- 4. Align the supercontigs to the genome.
- 5. Genome finishing: fill the possible gaps between supercontigs by additional sequencing.