Title: JAZZ: A Whole Genome Shotgun Assembler
1JAZZ A Whole Genome Shotgun Assembler
- Jarrod Chapman
- Nik Putnam
- Isaac Ho
- Dan Rokhsar
2Q: What is Whole Genome Shotgun Assembly?
A: Solving a one-dimensional jigsaw puzzle with millions of pieces (without the box)
3WGS The Statistical Ensemble
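Poisson statistics ($N_r$ reads of length $L_r$, genome size $G$):
- Depth: $d = N_r L_r / G$
- $\langle \text{reads/contig} \rangle = e^{d}$
- $\langle \text{unsequenced bases} \rangle = G e^{-d}$
Lander and Waterman 1988, Genomics 2(3):231-9
Idealizations: random sampling; random sequence (without repeats)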
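The idealized expectations above can be evaluated directly. The following is a minimal sketch, assuming a hypothetical project (400 Mb genome, 500-base reads, 6X depth); the function name and the example numbers are illustrative and do not come from the slides.

```python
import math

def lander_waterman(n_reads, read_len, genome_size):
    """Idealized shotgun statistics (random sampling, no repeats)."""
    d = n_reads * read_len / genome_size        # depth d = N_r * L_r / G
    reads_per_contig = math.exp(d)              # <reads/contig> = e^d
    unsequenced = genome_size * math.exp(-d)    # <unsequenced bases> = G * e^-d
    coverage = 1.0 - math.exp(-d)               # fraction of the genome covered
    return d, reads_per_contig, unsequenced, coverage

# Hypothetical project: 400 Mb genome, 500-base reads, 6X depth.
G = 400_000_000
L_r = 500
N_r = 6 * G // L_r
d, rpc, gaps, cov = lander_waterman(N_r, L_r, G)
print(f"depth={d:.1f} reads/contig={rpc:.0f} unsequenced={gaps:.0f} coverage={cov:.4f}")
```

At 6X the idealized covered fraction is 1 - e^-6, about 99.75%. Real data sets deviate from these numbers because neither idealization holds exactly.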
4Complications of Real Data Sets
- Inherent
  - Repeats
  - Paired-end reads
  - Cloning bias
  - Polymorphism
- Experimental
  - Sequencing errors
  - Contamination
  - Tracking errors
  - ???
5Assembly Goals and Motivations
- Paired ends: Treat sequence overlap and paired-end information on an equal footing. Build in flexibility to include BAC localization and other mapping data for mixed projects.
- Polymorphism: Allow for polymorphisms. Unlike microbes and flies, fugu, ciona, and other sequencing targets are neither haploid nor inbred: they are individuals from the wild.
- Efficiency: Efficient assemblies to provide good substrates for annotation. 6X depth should statistically give good coverage (99.7%), contiguity (30 kb), and contig linking.
- Scalability: Develop a parallel implementation from the start. Scale to large projects (Fugu rubripes @ 400 MB, Poplar @ 600 MB, Xenopus tropicalis @ 1.7 GB).
- Visualization: Integrate assembly visualization and analysis tools for QC and validation; visualize multiple scales.
6JAZZ assembly pipeline
- Use the Overlap - Layout - Consensus paradigm
0. TRIM: trim for vector, quality
1. MALIGN: all-vs-all comparison
2. GRAPHY: build layout
3. THREE: find consensus
4. GAPCLOSER: close captured gaps
[Pipeline diagram. Data-flow labels: data in, reads, overlaps DB, layout, consensus, genome out. Annotations: parallelized and distributed over project lifetime; embarrassingly parallel; multithreaded]
7MALIGN Rapid screening for overlaps
- Identify all pairs of reads that share ten or more informative words (reverse complement, too).
- Use a parallel hash scheme for speed/memory.
- Designate over-represented words in the quality-trimmed data set as unhashable (e.g. AAAAAAA). Their shared occurrence in two reads is not a reliable indicator of a true overlap.
- Align candidate overlapping reads using a banded Smith-Waterman algorithm. Reject low-identity alignments.
Screen read pairs for potential overlaps using a hashing scheme.
Minimal detectable overlap: N + M - 1
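The screening step can be illustrated with a small single-process sketch. It is not MALIGN's implementation: the word length `k`, the `max_occurrences` cutoff for "unhashable" words, and the function names are assumptions chosen for the example, and the real scheme is parallel. Reading the slide's minimal detectable overlap as N + M - 1 for N shared words of length M is also an assumption; under it, the settings below would detect overlaps of 10 + 16 - 1 = 25 bases or more.

```python
import random
from collections import defaultdict
from itertools import combinations

def revcomp(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def candidate_pairs(reads, k=16, min_shared=10, max_occurrences=50):
    """reads: dict name -> sequence. Returns {(a, b): number of shared words}."""
    index = defaultdict(set)                  # canonical word -> read names
    for name, seq in reads.items():
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            index[min(word, revcomp(word))].add(name)   # both strands alike

    shared = defaultdict(int)
    for names in index.values():
        if len(names) > max_occurrences:      # over-represented: skip ("unhashable")
            continue
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared}

# Toy data: r1/r2 overlap by 30 bases, r3 is a reverse-strand read
# overlapping r2 by 45 bases, r4 is unrelated.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(200))
reads = {
    "r1": genome[0:60],
    "r2": genome[30:90],
    "r3": revcomp(genome[45:105]),
    "r4": "".join(rng.choice("ACGT") for _ in range(60)),
}
pairs = candidate_pairs(reads)
print(sorted(pairs))
```

Pairs that pass this screen are only candidates; in the pipeline described above they would still be aligned and rejected if identity is low.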
8GRAPHY graphical layout algorithm
- Given a set of all-vs-all alignments (from MALIGN):
- Estimate the likelihood that each edge is true.
- Construct an initial solution from the highest-confidence, unique sequence.
- Improve the solution iteratively with a self-consistency requirement.
Layout problem: finding the sub-graph of true edges.
True edges join reads actually derived from overlapping portions of the genome. False edges arise from repeats.
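One way to picture "start from the highest-confidence edges" is a greedy pass over the overlap graph. The sketch below is a generic greedy path-building heuristic, not GRAPHY's algorithm: it assumes each edge already carries an estimated probability of being true, lets each read keep at most two neighbors (a simplification of a linear layout), and refuses edges that would close a cycle. The function name, threshold, and edge list are hypothetical.

```python
def greedy_layout(edges, min_conf=0.9):
    """edges: iterable of (read_a, read_b, confidence). Returns accepted edges."""
    parent = {}     # union-find forest over reads
    degree = {}     # accepted neighbors per read

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x

    accepted = []
    for a, b, conf in sorted(edges, key=lambda e: e[2], reverse=True):
        if conf < min_conf:
            break                             # remaining edges are weaker still
        if degree.get(a, 0) >= 2 or degree.get(b, 0) >= 2:
            continue                          # read already has two neighbors
        ra, rb = find(a), find(b)
        if ra == rb:
            continue                          # would close a cycle
        parent[ra] = rb
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
        accepted.append((a, b, conf))
    return accepted

# Hypothetical edges: r2-r7 stands in for a repeat-induced false edge.
edges = [
    ("r1", "r2", 0.99), ("r2", "r3", 0.98), ("r2", "r7", 0.95),
    ("r7", "r8", 0.93), ("r3", "r9", 0.40),
]
layout = greedy_layout(edges)
print(layout)
```

A pass like this only gives an initial solution; the iterative, self-consistent improvement described on the slide is not modeled here.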
9GRAPHY Graphical layout algorithm
- Key ingredients:
  - Use of rectangles and other local structures in the read graph to corroborate overlaps and reads
  - Iterative, self-consistent formation of contigs and scaffolds
[Diagram labels: sister edge; neighborhood of R; R; long mismatch; contigs]
10THREE Reaching consensus
- Identify a backbone of forward-moving, minimally overlapping reads spanning each contig.
- Use central high-quality segments of reads as a proto-consensus.
- Form a reference by concatenating the segments closest to the center of each backbone read.
- Make a master-slave alignment of each reference segment to its overlapping reads (with quality-weighted voting).
- Mark potential polymorphisms/misassemblies in the alignment.
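Quality-weighted voting can be sketched per alignment column. This is an illustration under stated assumptions, not THREE's code: each column is given as the bases observed there plus their quality scores, the base with the largest summed quality wins, and a column is flagged when the runner-up collects at least `flag_ratio` of the winner's weight. The flagging rule and all names are hypothetical.

```python
from collections import defaultdict

def consensus_column(bases, quals):
    """Return (winning base, its summed quality, runner-up summed quality)."""
    votes = defaultdict(int)
    for base, q in zip(bases, quals):
        votes[base] += q                      # each read votes with its quality
    ranked = sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
    winner, top = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else 0
    return winner, top, second

def call_consensus(columns, flag_ratio=0.5):
    """columns: list of (bases, quals). Returns (sequence, flagged column indices)."""
    seq, flagged = [], []
    for i, (bases, quals) in enumerate(columns):
        winner, top, second = consensus_column(bases, quals)
        seq.append(winner)
        if second > 0 and second >= flag_ratio * top:
            flagged.append(i)                 # possible polymorphism or misassembly
    return "".join(seq), flagged

# Three toy columns; the middle one is split between C and T.
columns = [
    ("AAAA", [40, 40, 30, 20]),
    ("CCTT", [40, 35, 40, 38]),
    ("GGGA", [40, 40, 40, 5]),
]
sequence, flagged = call_consensus(columns)
print(sequence, flagged)
```

Weighting by quality lets a single low-quality mismatch be outvoted, while a well-supported second allele still shows up as a flagged column.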
11Accuracy (Prochlorococcus @ 8X, 3 kb only)
- BLAST finished sequence vs JAZZ assembly
- White lines connect hits to same scaffold
[Figure labels: chromosome is circular; err vs finished < 10^-5, < 10^-4]
12JAZZ scaffolds are a good substrate for annotation
- GeneWise models introduce few or no indels
- Approximate error rate < 1 indel/10 kb, as expected at 6X depth
13JAZZ view of assembly
[Screenshot of the JAZZ assembly viewer. Labeled elements: local depth; number of internal pairs, mean insert size; local edge; clones spanning gap, estimated gap size; selected read; contig name, reads, size, depth, scaffold; clone coverage]
14Fugu rubripes
Whole-Genome Shotgun Assembly and Analysis of the Genome of Fugu rubripes. Aparicio, Chapman, ..., Putnam, ..., Rokhsar, Brenner. Science 23 August 2002 Vol. 297 No. 5585 1301-1310
http://genome.jgi-psf.org/fugu6/fugu6.home.html
15Comparison of assembled and measured BAC sizes in
Fugu
BAC sizes: assembly vs fingerprint
- Several thousand fingerprinted BACs have both ends placed in the same (cosmid-O&O'd) scaffold.
- With a small calibration correction, the distance between ends on the (small-insert-only) assembly equals the sum of restriction fragment lengths.
[Scatter plot: assembly insert size (kb), 0 to 200, vs. fingerprint insert size (kb), 0 to 250]
16Ciona intestinalis 7X assembly summary
The Draft Genome of Ciona intestinalis: Insights into Chordate and Vertebrate Origins. Science 13 December 2002 Vol. 298 No. 5601 2157-2166
http://genome.jgi-psf.org/ciona4/ciona4.home.html
Assembly summary
- N90: 1,172 scaffolds > 13 kB (109 MB = 90%)
- N50: 178 scaffolds > 191 kB; 1,002 contigs > 33.9 kb
[Plot: cumulative sequence length (MB), 0 to 120, vs. number of contigs/scaffolds, 1000 to 3000; curves for contigs and scaffolds, with N50 and N90 marked]
- Length of assembled genome vs number of contigs/scaffolds
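The N50 and N90 marks on a cumulative-length curve follow the usual definition: sort sequences from longest to shortest and walk down until the running total reaches the chosen fraction of the assembly. A minimal sketch, with a hypothetical function name and toy lengths that are not the Ciona data:

```python
def nxx(lengths, fraction=0.5):
    """Return (length, count): the `count` longest sequences, each at least
    `length` long, together hold at least `fraction` of the total length."""
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    raise ValueError("no sequence lengths given")

# Toy scaffold lengths (arbitrary units), total 300.
toy = [100, 80, 60, 40, 20]
n50 = nxx(toy, 0.5)
n90 = nxx(toy, 0.9)
print(n50, n90)
```

Read this way, a statement such as "178 scaffolds > 191 kB" at N50 says that the 178 longest scaffolds, all longer than 191 kb, contain half of the assembled sequence.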
17Ciona genome is polymorphic
[Plot: polymorphism rate vs. sequence position (100 kb)]
Half Moon Bay Ciona intestinalis: 1.5% average allelic polymorphism rate
Polymorphism detection
- SNP and VNTR variants
Legend: c = consensus; A/B = reads from two different haplotypes; . = high-qual match; o = low-qual match
c  TTGCTAAGCTTTTCGCTTTTTTGATAAAAAAAAACGTTTTATGTGTTACTGTGTGGCAGT
A  .........................-.......C...............----------.
A  .........................-.......C...............----------.
A  .........................-.......C...............----------.
A  oooooo............ooooooo-oo.....C........ooooooo----------o
B  ......C.....................................................
B  ......C.....................................................

c  GTGGGTCTGTAGAAGCGAAGTTAAAACCTATTTGAGTGTGATTTTAAGAAAGCTATTTGG
A  ............................................................
A  ............................................................
A  ..
A  .....ooooo....................ooo........o.o................
B  ..............................................ooooo...G.....
B  ......................................................G.....
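Alignment rows in the slide's notation ('.' high-quality match, 'o' low-quality match, a letter for a mismatch, '-' for a gap) lend themselves to a simple column scan. The sketch below is an illustrative assumption about how candidate polymorphic columns might be picked out, not the method used for the Ciona analysis: it reports every column where the same non-matching symbol appears in at least `min_support` reads. The function name and toy rows are hypothetical.

```python
def variant_columns(rows, min_support=2):
    """rows: equal-length strings in dot notation.
    Returns {column index: {symbol: number of supporting reads}}."""
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("rows must have equal length")
    out = {}
    for col in range(width):
        counts = {}
        for row in rows:
            ch = row[col]
            if ch not in ".o":                # anything but a match is a difference
                counts[ch] = counts.get(ch, 0) + 1
        supported = {s: n for s, n in counts.items() if n >= min_support}
        if supported:
            out[col] = supported
    return out

# Toy rows: three reads share a gap and a C, two reads share a G.
rows = [
    "..-..C..",
    "..-..C..",
    "o.-..C.o",
    "G.......",
    "G.......",
]
variants = variant_columns(rows)
print(variants)
```

Requiring support from more than one read separates consistent haplotype differences from isolated sequencing errors; a shared run of '-' symbols would correspond to a length (VNTR-like) variant.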
18Assembly Goals Revisited
- Paired ends: Plasmid-, cosmid-, BAC-end data used to achieve good contiguity
- Polymorphism: 1.5% Ciona intestinalis, 0.4% Fugu
- Efficiency: Successful annotation of fugu, ciona; assembly provides a good annotation substrate
- Scalability: 400 MB fugu assembled; gearing up for poplar (600 MB), Xenopus (1.7 GB)
- Visualization: JAZZ view allows read-, contig-, and scaffold-level visualization
To come: microbial consortia; larger polymorphic genomes; general availability
19Acknowledgements
- DOE Joint Genome Institute (JGI)
- JGI Assembly Team
- Nik Putnam, Isaac Ho, Dan Rokhsar
- Susan Lucas, Paul Richardson, Chris Detter
- Fugu and Ciona Genome Consortia
- DOE CSGF / Krell Institute
JGI: http://www.jgi.doe.gov