JAZZ: A Whole Genome Shotgun Assembler - PowerPoint PPT Presentation

Provided by: paulfp
1
JAZZ: A Whole Genome Shotgun Assembler
  • Jarrod Chapman
  • Nik Putnam
  • Isaac Ho
  • Dan Rokhsar

2
Q: What is Whole Genome Shotgun Assembly?
A: Solving a one-dimensional jigsaw puzzle with millions of pieces (without the box)
3
WGS: The Statistical Ensemble
Poisson statistics: $d = N_r L_r / G$
$\langle \text{reads/contig} \rangle = e^{d}$
$\langle \text{unsequenced bases} \rangle = G e^{-d}$
Lander and Waterman 1988, Genomics 2(3):231-9
Idealizations: random sampling; random sequence (without repeats)
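As a rough numerical illustration of the Lander-Waterman estimates on this slide, the sketch below evaluates them for a given project. The function name and the example read count, read length, and genome size are illustrative assumptions, not values taken from the presentation.

```python
import math

def lander_waterman(n_reads, read_len, genome_size):
    """Idealized Lander-Waterman estimates (random sampling, no repeats)."""
    d = n_reads * read_len / genome_size                  # depth d = N_r * L_r / G
    return {
        "depth": d,
        "reads_per_contig": math.exp(d),                  # <reads/contig> = e^d
        "unsequenced_bases": genome_size * math.exp(-d),  # <unsequenced bases> = G * e^-d
        "covered_fraction": 1.0 - math.exp(-d),           # fraction of G that is sequenced
    }

# Hypothetical project: 4.8 million 500-base reads on a 400-megabase genome.
stats = lander_waterman(n_reads=4_800_000, read_len=500, genome_size=400_000_000)
print(stats)
```

With these assumed inputs the depth is 6X and the covered fraction is about 99.75%, under the stated idealizations only.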
4
Complications of Real Data Sets
  • Inherent
    • Repeats
    • Paired-end reads
    • Cloning bias
    • Polymorphism
  • Experimental
    • Sequencing errors
    • Contamination
    • Tracking errors
    • ???

5
Assembly Goals and Motivations
  • Paired ends: Treat sequence overlap and paired-end
    information on an equal footing. Build in
    flexibility to include BAC localization and other
    mapping data for mixed projects.
  • Polymorphism: Allow for polymorphisms. Unlike
    microbes and flies, fugu, ciona, and other
    sequencing targets are neither haploid nor inbred;
    they are individuals from the wild.
  • Efficiency: Efficient assemblies to provide good
    substrates for annotation. 6X depth should
    statistically give good coverage (99.7%),
    contiguity (30 kb), and contig linking.
  • Scalability: Develop parallel implementation from
    the start. Scale to large projects (Fugu rubripes
    @ 400 MB, Poplar @ 600 MB, Xenopus tropicalis
    @ 1.7 GB).
  • Visualization: Integrate assembly visualization
    and analysis tools for QC and validation;
    visualize multiple scales.

6
JAZZ assembly pipeline
  • Use the Overlap, Layout, Consensus paradigm

Pipeline stages (from the slide diagram):
0. TRIM: trim for vector, quality
1. MALIGN: all-vs-all comparison
2. GRAPHY: build layout
3. THREE: find consensus
4. GAPCLOSER: close captured gaps
Data-flow labels: data in; reads; overlaps DB; layout; consensus; genome out
Stage annotations: parallelized and distributed over project lifetime; embarrassingly parallel; multithreaded
7
MALIGN: Rapid screening for overlaps
Screen read pairs for potential overlaps using a
hashing scheme
  • Identify all pairs of reads that share ten or
    more informative words (reverse complement, too)
  • Use a parallel hash scheme for speed/memory.
  • Designate over-represented words in the
    quality-trimmed data set (e.g. AAAAAAA) as
    unhashable: their shared occurrence in two reads
    is not a reliable indicator of a true overlap.
  • Align candidate overlapping reads using a banded
    Smith-Waterman algorithm. Reject low-identity
    alignments.

Minimal detectable overlap: $N_M + M - 1$
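The word-based screening described on this slide can be sketched roughly as follows. This is a minimal single-process illustration, not MALIGN itself: the function names, the word length `k`, and the `max_occurrences` cutoff for over-represented words are assumptions (only the "ten or more shared words" threshold comes from the slide), and the banded Smith-Waterman alignment step is not shown.

```python
import random
from collections import Counter, defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def words(seq, k):
    """Set of distinct k-letter words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def candidate_overlaps(reads, k=8, min_shared=10, max_occurrences=50):
    """Map (i, j, strand) -> shared word count for read pairs i < j that share
    at least min_shared informative words. strand "-" means read i was
    reverse-complemented before comparison."""
    index = defaultdict(set)                 # word -> ids of reads containing it
    for rid, seq in enumerate(reads):
        for w in words(seq, k):
            index[w].add(rid)
    # Words present in too many reads are treated as unhashable (assumed cutoff).
    informative = {w: ids for w, ids in index.items() if len(ids) <= max_occurrences}

    shared = Counter()
    for rid, seq in enumerate(reads):
        for strand, s in (("+", seq), ("-", revcomp(seq))):
            for w in words(s, k):
                for other in informative.get(w, ()):
                    if other > rid:          # count each unordered pair once
                        shared[(rid, other, strand)] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared}

# Toy data: reads cut from a made-up random sequence (illustrative only).
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(200))
reads = [genome[0:60], genome[30:90], revcomp(genome[60:120]), genome[140:200]]
hits = candidate_overlaps(reads, k=12, min_shared=10)
print(sorted(hits))
```

In the toy data, reads 0 and 1 overlap on the same strand, read 2 overlaps read 1 on the opposite strand, and read 3 overlaps nothing, so only the first two pairs should pass the screen.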
8
GRAPHY: Graphical layout algorithm
  • Given a set of all-vs-all alignments (from
    MALIGN):
  • Estimate the likelihood that each edge is true.
  • Construct an initial solution from the
    highest-confidence, unique sequence.
  • Improve the solution iteratively with a
    self-consistency requirement.

Layout problem = finding the sub-graph of true
edges
True edges join reads actually derived from
overlapping portions of the genome. False edges
arise from repeats.
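One generic way to build an initial layout from the most confident edges first is a greedy pass over the overlap graph, sketched below. This is a textbook greedy construction offered for illustration, not GRAPHY's actual algorithm: the edge representation (a confidence plus a directed "end of read a overlaps start of read b" pair), the function name, and the example scores are all assumptions, and read orientation and the iterative self-consistency step are left out.

```python
def greedy_layout(n_reads, edges):
    """edges: iterable of (confidence, a, b), meaning the end of read a is
    believed to overlap the start of read b. Returns chains of read ids."""
    parent = list(range(n_reads))            # union-find over reads

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    succ, pred = {}, {}
    for conf, a, b in sorted(edges, reverse=True):   # most confident first
        if a in succ or b in pred:
            continue                         # that read end is already used
        ra, rb = find(a), find(b)
        if ra == rb:
            continue                         # edge would close a cycle
        succ[a], pred[b] = b, a
        parent[ra] = rb

    chains = []
    for r in range(n_reads):
        if r not in pred:                    # r starts a chain
            chain = [r]
            while chain[-1] in succ:
                chain.append(succ[chain[-1]])
            chains.append(chain)
    return chains

# Hypothetical edge confidences; (0.40, 0, 2) and (0.20, 2, 0) play the role of false edges.
edges = [(0.99, 0, 1), (0.95, 1, 2), (0.40, 0, 2), (0.90, 3, 4), (0.20, 2, 0)]
layout = greedy_layout(5, edges)
print(layout)
```

The low-confidence edges are rejected because they conflict with edges accepted earlier, which is the sense in which the layout is a sub-graph of the overlap graph.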
9
GRAPHY: Graphical layout algorithm
  • Key ingredients
  • Use of rectangles and other local structures in
    read graph to corroborate overlaps and reads
  • Iterative, self-consistent formation of contigs
    and scaffolds

[Figure labels: sister edge; neighborhood of R; R; long mismatch; contigs]
10
THREE: Reaching consensus
Use central high-quality segments of reads as a
proto-consensus
  • Identify backbone of forward-moving, minimally
    overlapping reads spanning each contig
  • Form reference by concatenating the segments
    closest to the center of each backbone read
  • Make master-slave alignment of reference segment
    to its overlapping reads (with quality-weighted
    voting)
  • Mark potential polymorphisms/misassemblies in
    alignment
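Quality-weighted voting on a single alignment column can be sketched as below. This is an illustrative assumption about how such a vote might work, not THREE's documented rule: the function name, the use of summed quality scores as weights, and the `minor_fraction` threshold for flagging a column are all hypothetical.

```python
from collections import defaultdict

def call_column(observations, minor_fraction=0.3):
    """observations: non-empty list of (base, quality) pairs from one column.
    Returns (consensus_base, flagged), where flagged marks a potential
    polymorphism or misassembly."""
    weight = defaultdict(float)
    for base, qual in observations:
        weight[base] += qual                 # each read votes with its quality
    ranked = sorted(weight.items(), key=lambda kv: kv[1], reverse=True)
    best_base = ranked[0][0]
    total = sum(weight.values())
    second = ranked[1][1] if len(ranked) > 1 else 0.0
    # Flag the column when the runner-up base carries a large share of the weight.
    flagged = total > 0 and second / total >= minor_fraction
    return best_base, flagged

clean = call_column([("A", 40), ("A", 35), ("C", 10)])
mixed = call_column([("A", 40), ("A", 38), ("C", 40), ("C", 37)])
print(clean, mixed)
```

In the first column a single low-quality disagreement is outvoted; in the second, two well-supported bases compete, which is the pattern a two-haplotype polymorphism would be expected to produce.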

11
Accuracy (Prochlorococcus @ 8X, 3 kb only)
[Figure label: chromosome is circular]
Error vs finished sequence: $< 10^{-5}$, $< 10^{-4}$
  • BLAST finished sequence vs JAZZ assembly
  • White lines connect hits to same scaffold

12
JAZZ scaffolds are a good substrate for annotation
  • GeneWise models introduce few or no indels
  • Approximate error rate < 1 indel/10 kb, as
    expected at 6X depth

13
JAZZ view of assembly
[Screenshot labels:]
  • local depth
  • number of internal pairs, mean insert size
  • local edge
  • clones spanning gap, estimated gap size
  • selected read
  • contig name, reads, size, depth, scaffold
  • clone coverage
14
Fugu rubripes
Whole-Genome Shotgun Assembly and Analysis of
the Genome of Fugu rubripes. Aparicio, Chapman, ...,
Putnam, ..., Rokhsar, Brenner. Science 23 August
2002 Vol. 297 No. 5585 1301-1310
http://genome.jgi-psf.org/fugu6/fugu6.home.html
15
Comparison of assembled and measured BAC sizes in
Fugu
BAC sizes: assembly vs fingerprint
  • Several thousand fingerprinted BACs have both
    ends placed in the same (cosmid-O&O'd) scaffold
  • With a small calibration correction, the distance
    between ends on the (small-insert-only) assembly
    equals the sum of restriction fragment lengths

[Plot: assembly insert size (kb) vs fingerprint insert size (kb)]
16
Ciona intestinalis 7X assembly summary
  • N90: 1,172 scaffolds > 13 kb (109 MB = 90%)
  • N50: 178 scaffolds > 191 kb; 1,002 contigs > 33.9 kb
  • Length of assembled genome vs number of
    contigs/scaffolds

[Plot: cumulative sequence length (MB) vs number of contigs/scaffolds, with curves for contigs and scaffolds and the N50 and N90 levels marked]
The Draft Genome of Ciona intestinalis: Insights
into Chordate and Vertebrate Origins. Science 13
December 2002 Vol. 298 No. 5601 2157-2166
http://genome.jgi-psf.org/ciona4/ciona4.home.html
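N50- and N90-style figures like those quoted on this slide can be computed from a list of contig or scaffold lengths as sketched below. The function name and the example lengths are illustrative assumptions; the Ciona lengths themselves are not reproduced here.

```python
def nxx(lengths, fraction):
    """Return (count, length): the smallest number of longest sequences whose
    combined length reaches `fraction` of the total, and the length of the
    shortest sequence in that set."""
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return count, length
    raise ValueError("lengths must be non-empty and positive")

# Made-up scaffold lengths in kb.
toy_lengths = [100, 80, 60, 40, 20]
n50 = nxx(toy_lengths, 0.5)
n90 = nxx(toy_lengths, 0.9)
print(n50, n90)
```

Read this way, a statement such as "178 scaffolds > 191 kb" at N50 means that the 178 longest scaffolds, each longer than 191 kb, together hold half of the assembled sequence.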

17
Ciona genome is polymorphic
Polymorphism rate vs. sequence position (100 kb)
Half Moon Bay Ciona intestinalis: 1.5% average
allelic polymorphism rate
Polymorphism detection: SNP and VNTR variants
Legend: c = consensus; A/B = reads from two different haplotypes; . = high-qual match; o = low-qual match; an explicit base or - = polymorphism

c TTGCTAAGCTTTTCGCTTTTTTGATAAAAAAAAACGTTTTATGTGTTACTGTGTGGCAGT
A .........................-.......C...............----------.
A .........................-.......C...............----------.
A .........................-.......C...............----------.
A oooooo............ooooooo-oo.....C........ooooooo----------o
B ......C.....................................................
B ......C.....................................................

c GTGGGTCTGTAGAAGCGAAGTTAAAACCTATTTGAGTGTGATTTTAAGAAAGCTATTTGG
A ............................................................
A ............................................................
A ..
A .....ooooo....................ooo........o.o................
B ..............................................ooooo...G.....
B ......................................................G.....
18
Assembly Goals Revisited
  • Paired ends: Plasmid-, cosmid-, BAC-end data used
    to achieve good contiguity
  • Polymorphism: 1.5% Ciona intestinalis, 0.4% Fugu
  • Efficiency: Successful annotation of fugu, ciona;
    assembly provides good annotation substrate
  • Scalability: 400 MB fugu assembled; gearing up
    for poplar (600 MB), Xenopus (1.7 GB)
  • Visualization: JAZZ view allows read, contig,
    scaffold level visualization

To come: microbial consortia; larger polymorphic
genomes; general availability
19
Acknowledgements
  • DOE Joint Genome Institute (JGI)
  • JGI Assembly Team
  • Nik Putnam, Isaac Ho, Dan Rokhsar
  • Susan Lucas, Paul Richardson, Chris Detter
  • Fugu and Ciona Genome Consortia
  • DOE CSGF / Krell Institute

JGI: http://www.jgi.doe.gov