Title: Special Topics in Genomics
1Special Topics in Genomics
- Next-generation Sequencing
2Work flow of conventional versus
second-generation sequencing
(a) With high-throughput shotgun Sanger
sequencing, genomic DNA is fragmented, then
cloned to a plasmid vector and used to transform
E. coli. For each sequencing reaction, a single
bacterial colony is picked and plasmid DNA
isolated. Each cycle sequencing reaction takes
place within a microliter-scale volume,
generating a ladder of ddNTP-terminated,
dye-labeled products, which are subjected to
high-resolution electrophoretic separation within
one of 96 or 384 capillaries in one run of a
sequencing instrument. As fluorescently labeled
fragments of discrete sizes pass a detector, the
four-channel emission spectrum is used to
generate a sequencing trace. (b) In shotgun
sequencing with cyclic-array methods, common
adaptors are ligated to fragmented genomic DNA,
which is then subjected to one of several
protocols that results in an array of millions of
spatially immobilized PCR colonies or
'polonies'. Each polony consists of many copies
of a single shotgun library fragment. As all
polonies are tethered to a planar array, a single
microliter-scale reagent volume (e.g., for primer
hybridization and then for enzymatic extension
reactions) can be applied to manipulate all array
features in parallel. Similarly, imaging-based
detection of fluorescent labels incorporated with
each extension can be used to acquire sequencing
data on all features in parallel. Successive
iterations of enzymatic interrogation and imaging
are used to build up a contiguous sequencing read
for each array feature.
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135-1145 (2008)
3Available next-generation sequencing platforms
- Illumina/Solexa
- ABI SOLiD
- Roche 454
- Polonator
- HeliScope
4Example: Illumina/Solexa
1. Prepare genomic DNA
2. Attach DNA to surface
3. Bridge amplification
4. Fragments become double stranded
5. Denature the double-stranded molecules
6. Complete amplification
5Illumina/Solexa
7. Determine first base
8. Image first base
9. Determine second base
10. Image second base
11. Sequence reads over multiple cycles
12. Align data
>50 million clusters per flow cell, each with about 1,000 copies of the same template; 1 billion bases per run; 1% of the cost of capillary-based methods.
(From http://www.illumina.com/downloads/SS_DNAsequencing.pdf)
6Clonal amplification of sequencing features in
the second-generation sequencing
(a) The 454, the Polonator and SOLiD platforms
rely on emulsion PCR to amplify clonal
sequencing features. In brief, an in
vitro-constructed adaptor-flanked shotgun library
(shown as gold and turquoise adaptors flanking
unique inserts) is PCR amplified (that is,
multi-template PCR, not multiplex PCR, as only a
single primer pair is used, corresponding to the
gold and turquoise adaptors) in the context of a
water-in-oil emulsion. One of the PCR primers is
tethered to the surface (5'-attached) of
micron-scale beads that are also included in the
reaction. A low template concentration results in
most bead-containing compartments having either
zero or one template molecule present. In
productive emulsion compartments (where both a
bead and a template molecule are present), PCR
amplicons are captured to the surface of the
bead. After breaking the emulsion, beads bearing
amplification products can be selectively
enriched. Each clonally amplified bead will bear
on its surface PCR products corresponding to
amplification of a single molecule from the
template library. (b) The Solexa technology
relies on bridge PCR (aka 'cluster PCR') to
amplify clonal sequencing features. In brief, an
in vitro-constructed adaptor-flanked shotgun
library is PCR amplified, but both primers
densely coat the surface of a solid substrate,
attached at their 5' ends by a flexible linker.
As a consequence, amplification products
originating from any given member of the template
library remain locally tethered near the point of
origin. At the conclusion of the PCR, each clonal
cluster contains ~1,000 copies of a single member
of the template library. Accurate measurement of
the concentration of the template library is
critical to maximize the cluster density while
simultaneously avoiding overcrowding.
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135-1145 (2008)
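The claim that a low template concentration leaves most compartments with zero or one template can be checked with a simple loading model. The sketch below assumes templates fall into bead-containing compartments independently, so the number per compartment is Poisson with mean `lam`; that model and the `lam` values are illustrative assumptions, not figures from the paper.

```python
import math

def occupancy(lam):
    """Return (P(0), P(1), P(>=2)) templates per compartment under Poisson(lam)."""
    p0 = math.exp(-lam)
    p1 = lam * math.exp(-lam)
    return p0, p1, 1.0 - p0 - p1

def clonal_fraction(lam):
    """Among compartments holding at least one template, fraction holding exactly one."""
    p0, p1, _ = occupancy(lam)
    return p1 / (1.0 - p0)

# Illustrative mean loadings (hypothetical values).
for lam in (0.05, 0.1, 0.5, 1.0):
    p0, p1, p2 = occupancy(lam)
    print(f"lam={lam:4.2f}  empty={p0:.3f}  single={p1:.3f}  "
          f"multi={p2:.3f}  clonal={clonal_fraction(lam):.3f}")
```

Lower loading makes productive beads more likely to be clonal, but it also leaves more beads empty, which is why the enrichment step after breaking the emulsion matters.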
7Strategies for cyclic array sequencing
- (a) With the 454 platform, clonally amplified 28-μm
beads generated by emulsion PCR serve as
sequencing features and are randomly deposited to
a microfabricated array of picoliter-scale wells.
With pyrosequencing, each cycle consists of the
introduction of a single nucleotide species,
followed by addition of substrate (luciferin,
adenosine 5'-phosphosulphate) to drive light
production at wells where polymerase-driven
incorporation of that nucleotide took place. This
is followed by an apyrase wash to remove
unincorporated nucleotide.
- (b) With the Solexa technology, a dense array of
clonally amplified sequencing features is
generated directly on a surface by bridge PCR.
Each sequencing cycle includes the simultaneous
addition of a mixture of four modified
deoxynucleotide species, each bearing one of four
fluorescent labels and a reversibly terminating
moiety at the 3' hydroxyl position. A modified
DNA polymerase drives synchronous extension of
primed sequencing features. This is followed by
imaging in four channels and then cleavage of
both the fluorescent labels and the terminating
moiety.
- (c) With the SOLiD and the Polonator platforms,
clonally amplified 1-μm beads are used to generate
a disordered, dense array of sequencing features.
Sequencing is performed with a ligase, rather
than a polymerase. With SOLiD, each sequencing
cycle introduces a partially degenerate
population of fluorescently labeled octamers. The
population is structured such that the label
correlates with the identity of the central 2 bp
in the octamer (the correlation with 2 bp, rather
than 1 bp, is the basis of two-base encoding).
After ligation and imaging in four channels, the
labeled portion of the octamer (that is, 'zzz')
is cleaved via a modified linkage between bases 5
and 6, leaving a free end for another cycle of
ligation. Several such cycles will iteratively
interrogate an evenly spaced, discontiguous set
of bases. The system is then reset (by
denaturation of the extended primer), and the
process is repeated with a different offset
(e.g., a primer set back from the original
position by one or several bases) such that a
different set of discontiguous bases is
interrogated on the next round of serial
ligations.
- (d) With the HeliScope platform, single nucleic
acid molecules are sequenced directly, that is,
there is no clonal amplification step required.
Poly-A-tailed template molecules are captured by
hybridization to surface-tethered poly-T
oligomers to yield a disordered array of primed
single-molecule sequencing templates. Templates
are labeled with Cy3, such that imaging can
identify the subset of array coordinates where a
sequencing read is expected. Each cycle consists
of the polymerase-driven incorporation of a
single species of fluorescently labeled
nucleotide at a subset of templates, followed by
fluorescence imaging of the full array and
chemical cleavage of the label.
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135-1145 (2008)
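Two-base encoding in (c) can be illustrated with a small color-space encoder. The sketch assumes the commonly described four-color scheme, in which each color is the XOR of 2-bit base codes (A=0, C=1, G=2, T=3). The slide does not spell out this mapping, so treat it as an assumption; the sequences are made up.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"

def encode(seq):
    """Return (first base, colors); each color summarizes two adjacent bases."""
    colors = [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors

def decode(first_base, colors):
    """Rebuild the base sequence from a known first base and the color string."""
    bases = [first_base]
    for color in colors:
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ color])
    return "".join(bases)

reference = "ATGGCA"   # hypothetical reference fragment
variant = "ATCGCA"     # same fragment with a single-base change at position 3

_, ref_colors = encode(reference)
_, var_colors = encode(variant)
changed = [i for i, (r, v) in enumerate(zip(ref_colors, var_colors)) if r != v]
print(ref_colors, var_colors, changed)
```

Because every base contributes to two neighboring colors, a true single-base variant changes two adjacent colors, whereas an isolated measurement error changes only one. That difference is what lets color-space data separate errors from variants.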
8Conventional sequencing
- Can sequence up to ~1,000 bp per read, with per-base 'raw'
accuracies as high as 99.999%. In the context of
high-throughput shotgun genomic sequencing,
Sanger sequencing costs on the order of $0.50 per
kilobase.
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135-1145 (2008)
9Second-generation DNA sequencing technologies
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135-1145 (2008)
10Applications of next-generation sequencing
Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135-1145 (2008)
11Base calling
Schematic representation of main Illumina noise
factors. (a-d) A DNA cluster comprises identical
DNA templates (colored boxes) that are attached
to the flow cell. Nascent strands (black boxes)
and DNA polymerase (black ovals) are depicted.
(a) In the ideal situation, after several cycles
the signal (green arrows) is strong, coherent and
corresponds to the interrogated position. (b)
Phasing noise introduces lagging (blue arrows)
and leading (red arrow) nascent strands, which
transmit a mixture of signals. (c) Fading is
attributed to loss of material that reduces the
signal intensity. (d) Changes in the
fluorophore cross-talk cause misinterpretation of
the received signal (teal arrows). For
simplicity, the noise factors are presented
separately from each other.
Erlich et al. Nature Methods 5 679-682 (2008)
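The phasing effect in (b) can be sketched with a toy model of how strands in one cluster drift out of step. In this sketch each strand, in each cycle, fails to extend with probability `p_lag`, extends by two bases with probability `p_lead`, and otherwise extends by one. The model and the rates are illustrative assumptions, not the parameters or the deconvolution used by Alta-Cyclic.

```python
def phase_distribution(n_cycles, p_lag, p_lead):
    """Return {offset: fraction of strands} after n_cycles.

    Offset 0 means the strand is at the interrogated position,
    negative offsets are lagging strands, positive are leading strands.
    """
    dist = {0: 1.0}
    p_ok = 1.0 - p_lag - p_lead
    for _ in range(n_cycles):
        nxt = {}
        for offset, frac in dist.items():
            for step, prob in ((-1, p_lag), (0, p_ok), (1, p_lead)):
                nxt[offset + step] = nxt.get(offset + step, 0.0) + frac * prob
        dist = nxt
    return dist

# Hypothetical per-cycle rates, chosen only for illustration.
for n in (10, 36, 78):
    dist = phase_distribution(n, p_lag=0.01, p_lead=0.005)
    print(f"cycle {n:2d}: in-phase fraction = {dist[0]:.3f}")
```

The in-phase fraction shrinks with every cycle, so later cycles report an increasingly mixed signal. This is one reason error rates rise toward the end of a read.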
12Base calling: Alta-Cyclic
The training process (green arrows) starts with
creation of the training set, beginning with
sequences generated by the standard Illumina
pipeline, by linking intensity reads and a
corresponding genome sequence (the 'correct'
sequence). Then, two grid searches are used to
optimize the parameters to call the bases. After
optimization, a final array of SVMs is created, each
of which corresponds to a cycle. In the
base-calling stage (blue arrows), the intensity
files of the desired library undergo
deconvolution to correct for phasing noise using
the optimized values and are sent for
classification with the SVM array. The output is
processed, and sequences and quality scores are
reported.
Erlich et al. Nature Methods 5 679-682 (2008)
13Alta-Cyclic performance
(a) Analysis of the HepG2 RNA library using
Alta-Cyclic. The absolute number of additional
fully correct reads (in addition to those
generated by the Illumina base caller) is
indicated by the red line; the fold change of the
improvement is indicated by the blue bars. (b) A
comparison of fully correct reads for the
Tetrahymena micronuclear library by the Illumina
base caller and Alta-Cyclic. (c) The average
error rate in calls of the artificial SNP
locations in the phi X library as a function of
the cycle in which they were called. The dashed
line represents 1% error rate (Q20). The plot on
the right shows the last 18 cycles in a different
scale. (d) A comparison of fully correct reads
for the phi X library with 1% artificial SNPs.
(e) Phi X sequences generated by Alta-Cyclic or
Illumina were exhaustively aligned to the
reference genome (allowing up to 53 mismatches
out of 78). The distribution of alignment scores
is shown beginning with an identical number of
raw reads for input into each base caller.
Erlich et al. Nature Methods 5 679-682 (2008)
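The "1% error rate (Q20)" label in panel (c) follows from the standard Phred scale, Q = -10 log10(p). The sketch below shows that conversion and why "fully correct reads" become rare as reads get longer. The function names are illustrative, and treating the per-base error rate as uniform and independent across the read is a simplifying assumption.

```python
import math

def phred_from_error(p_error):
    """Phred quality score for a per-base error probability."""
    return -10.0 * math.log10(p_error)

def error_from_phred(q):
    """Per-base error probability for a Phred quality score."""
    return 10.0 ** (-q / 10.0)

def p_fully_correct(read_length, p_error):
    """Chance a read has no errors, assuming independent, uniform per-base error."""
    return (1.0 - p_error) ** read_length

print(phred_from_error(0.01))        # Q score for a 1% error rate
print(p_fully_correct(36, 0.01))     # 36-cycle read at 1% error per base
print(p_fully_correct(78, 0.01))     # 78-cycle read at 1% error per base
```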
14ChIP-Seq
Hongkai Ji et al. Nature Biotechnology 26
1293-1300. 2008
15ChIP-Seq Analysis
Alignment
Peak Detection
Annotation
Visualization
Sequence Analysis
Motif Analysis
16Alignment
17Peak detection
- FindPeaks
- CHiPSeq
- BS-Seq
- SISSRs
- QuEST
- MACS
- CisGenome
18Two common designs
- One sample experiment
- contains only a ChIP'd sample
- Two sample experiment
- contains a ChIP'd sample and a negative control
sample
19One sample analysis
A simple approach is the sliding-window method.
A Poisson background model is commonly used to
estimate the error rate:
$k_i \sim \mathrm{Poisson}(\lambda_0)$, where $k_i$ is the read count in window $i$
Alternatively, Monte Carlo simulations are used. Both are
based on the assumption that the read sampling rate
is constant across the genome.
Hongkai Ji et al. Nature Biotechnology 26
1293-1300. 2008
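A minimal sketch of the Poisson background calculation follows. It assumes precomputed read counts in non-overlapping windows rather than a true sliding window, and it estimates the constant rate as the genome-wide mean count. The function names and toy counts are hypothetical, and this is not the CisGenome implementation.

```python
import math

def poisson_sf(k, lam):
    """P(K >= k) for K ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam ** j / math.factorial(j) for j in range(k))
    return max(0.0, 1.0 - cdf)

def fdr_at_cutoff(counts, cutoff):
    """Estimated FDR when calling every window with count >= cutoff."""
    lam0 = sum(counts) / len(counts)                 # constant-rate assumption
    expected_false = len(counts) * poisson_sf(cutoff, lam0)
    observed = sum(1 for k in counts if k >= cutoff)
    if observed == 0:
        return None                                  # nothing called at this cutoff
    return min(1.0, expected_false / observed)

# Toy data: 995 background windows plus 5 enriched windows (made-up counts).
counts = [0, 1, 2, 1, 1] * 199 + [20] * 5
for cutoff in (3, 5, 8):
    print(cutoff, fdr_at_cutoff(counts, cutoff))
```

The estimate is only as good as the constant-rate assumption: if the background rate varies along the genome, the Poisson tail is too thin and the FDR is underestimated.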
20The constant rate assumption does not hold!
Negative binomial model fits the data better!
$k_i \mid \lambda_i \sim \mathrm{Poisson}(\lambda_i)$, $\lambda_i \sim \mathrm{Gamma}(\alpha, \beta)$
Marginally, $k_i \sim \mathrm{NegBinom}(\alpha, \beta)$
Hongkai Ji et al. Nature Biotechnology 26
1293-1300. 2008
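To see why the choice of background model matters, the sketch below compares upper-tail probabilities under a Poisson and a negative binomial with the same mean. It assumes the Gamma prior is parameterized by shape `alpha` and rate `beta`, so the marginal mean is `alpha / beta`. The slide does not state the parameterization, and the parameter values are made up.

```python
import math

def negbinom_pmf(k, alpha, beta):
    """Marginal P(K = k) when K|lam ~ Poisson(lam), lam ~ Gamma(shape=alpha, rate=beta)."""
    log_p = (math.lgamma(k + alpha) - math.lgamma(k + 1) - math.lgamma(alpha)
             + alpha * math.log(beta / (beta + 1.0))
             - k * math.log(beta + 1.0))
    return math.exp(log_p)

def negbinom_sf(k, alpha, beta):
    """P(K >= k) under the negative binomial above."""
    return max(0.0, 1.0 - sum(negbinom_pmf(j, alpha, beta) for j in range(k)))

def poisson_sf(k, lam):
    """P(K >= k) for K ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam ** j / math.factorial(j) for j in range(k))
    return max(0.0, 1.0 - cdf)

alpha, beta = 2.0, 1.0            # hypothetical values; mean = alpha / beta = 2
mean = alpha / beta
for cutoff in (4, 6, 8):
    print(cutoff, poisson_sf(cutoff, mean), negbinom_sf(cutoff, alpha, beta))
```

With equal means, the negative binomial puts more probability on high counts, so a cutoff that looks stringent under the Poisson model admits more background windows than expected.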
21FDR estimation based on Poisson and negative
binomial model
Hongkai Ji et al. Nature Biotechnology 26
1293-1300. 2008
22Read direction provides extra information
Hongkai Ji et al. Nature Biotechnology 26
1293-1300. 2008
23CisGenome procedure
Alignment
Exploration
FDR computation
Negative binomial model
Peak Detection
Use read direction to refine peak boundary and
filter low quality peaks
Post Processing
Hongkai Ji et al. Nature Biotechnology 26
1293-1300. 2008
24Two sample analysis
Reason: read sampling rates at the same genomic
locus are correlated across different samples.
Hongkai Ji et al. Nature Biotechnology 26
1293-1300. 2008
25CisGenome two sample analysis
$n_i = k_{1i} + k_{2i}$, $k_{1i} \mid n_i \sim \mathrm{Binom}(n_i, p_0)$
Hongkai Ji et al. Nature Biotechnology 26
1293-1300. 2008
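The conditional binomial model can be sketched as a per-window test. The code assumes `p0` (the expected share of a window's reads that come from the ChIP sample under no enrichment) is supplied by the caller; the slide does not say how it is estimated. The function names and counts are hypothetical.

```python
import math

def binom_sf(k, n, p):
    """P(K >= k) for K ~ Binom(n, p)."""
    return sum(math.comb(n, j) * p ** j * (1.0 - p) ** (n - j)
               for j in range(k, n + 1))

def window_pvalue(k_chip, k_control, p0):
    """Upper-tail p-value for ChIP enrichment in one window, given n = k_chip + k_control."""
    n = k_chip + k_control
    return binom_sf(k_chip, n, p0)

# Made-up windows: (ChIP reads, control reads), with p0 = 0.5 assumed.
for k1, k2 in ((10, 10), (15, 5), (18, 2)):
    print(k1, k2, window_pvalue(k1, k2, p0=0.5))
```

Conditioning on the window total `n` is what makes the test robust to locus-specific sampling rates: a region that draws many reads in both samples is not called unless the ChIP share itself is unusually high.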
26A comparative study of ChIP-chip and ChIP-seq
- NRSF ChIP-chip
- 2 ChIP + 2 Mock IP in Jurkat cells, profiled
using Affymetrix Human Tiling 2.0R arrays.
- NRSF ChIP-seq
- ChIP + Negative Control in Jurkat cells,
sequenced with the next-generation sequencer made
by Illumina/Solexa.
Hongkai Ji et al. Nature Biotechnology 26
1293-1300. 2008
27Intersection
Before post-processing
After post-processing
Hongkai Ji et al. Nature Biotechnology 26
1293-1300. 2008
28Signal correlation
Hongkai Ji et al. Nature Biotechnology 26
1293-1300. 2008
29Visual comparison
Hongkai Ji et al. Nature Biotechnology 26
1293-1300. 2008
30Comparison of peak detection results
Hongkai Ji et al. Nature Biotechnology 26
1293-1300. 2008
31Are array specific peaks noise or signal?
Hongkai Ji et al. Nature Biotechnology 26
1293-1300. 2008
32Effects of read number in ChIP-seq
Hongkai Ji et al. Nature Biotechnology 26
1293-1300. 2008
33RNA-Seq
- (a) After two rounds of poly(A) selection, RNA is
fragmented to an average length of 200 nt by
magnesium-catalyzed hydrolysis and then converted
into cDNA by random priming. The cDNA is then
converted into a molecular library for
Illumina/Solexa 1G sequencing, and the resulting
25-bp reads are mapped onto the genome.
Normalized transcript prevalence is calculated
with an algorithm from the ERANGE package.
- (b) Primary data from mouse muscle RNAs that map
uniquely in the genome to a 1-kb region of the
Myf6 locus, including reads that span introns.
The RNA-Seq graph above the gene model summarizes
the quantity of reads, so that each point
represents the number of reads covering each
nucleotide, per million mapped reads (normalized
scale of 0-5.5 reads).
- (c) Detection and quantification of differential
expression. Mouse poly(A)-selected RNAs from
brain, liver and skeletal muscle for a 20-kb
region of chromosome 10 containing Myf6 and its
paralog Myf5, which are muscle specific. In
muscle, Myf6 is highly expressed in mature
muscle, whereas Myf5 is expressed at very low
levels from a small number of cells. The
specificity of RNA-Seq is high: Myf6 expression
is known to be highly muscle specific, and only 4
reads out of 71 million total liver and brain
mapped reads were assigned to the Myf6 gene
model.
Mortazavi et al., Nature Methods 5, 621-628 (2008)
34Reproducibility, linearity and sensitivity
(a) Comparison of two brain technical replicate
RNA-Seq determinations for all mouse gene models
(from the UCSC genome database), measured in
reads per kilobase of exon per million mapped
sequence reads (RPKM), which is a normalized
measure of exonic read density; R² = 0.96. (b)
Distribution of uniquely mappable reads onto gene
parts in the liver sample. Although 93% of the
reads fall onto exons or the RNAFAR-enriched
regions (see Fig. 3 and text), another 4% of the
reads fall onto introns and 3% in intergenic
regions. (c) Six in vitro-synthesized reference
transcripts of lengths 0.3-10 kb were added to
the liver RNA sample (1.2 × 10^4 to 1.2 × 10^9
transcripts per sample; R² > 0.99). (d)
Robustness of RPKM measurement as a function of
RPKM expression level and depth of sequencing.
Subsets of the entire liver dataset (with 41
million mapped unique + splice + multireads) were
used to calculate the expression level of genes
in four different expression classes to their
final expression level. Although the measured
expression level of the 211 most highly expressed
genes (black and cyan) was effectively unchanged
after 8 million mappable reads, the measured
expression levels of the other two classes
(purple and red) converged more slowly. The
fraction of genes for which the measured
expression level was within ±5% of the final value
is reported. 3 RPKM corresponds to approximately
one transcript per cell in liver. The
corresponding number of spliced reads in each
subset is shown on the top x axis.
Mortazavi et al., Nature Methods 5, 621-628 (2008)
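RPKM, as defined in panel (a), is reads per kilobase of exon per million mapped reads. A minimal sketch of the calculation follows. The function names and input numbers are made up for illustration, and the transcripts-per-cell helper only restates the liver-specific rule of thumb from the caption.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)

def approx_transcripts_per_cell(rpkm_value, rpkm_per_transcript=3.0):
    """Rough conversion using the caption's liver estimate (3 RPKM ~ 1 transcript per cell)."""
    return rpkm_value / rpkm_per_transcript

# Hypothetical gene: 820 reads on 2,000 bp of exon, 41 million mapped reads.
value = rpkm(820, 2_000, 41_000_000)
print(value, approx_transcripts_per_cell(value))
```

Dividing by exon length removes the bias toward long transcripts, and dividing by total mapped reads makes samples of different sequencing depth comparable.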
35ERANGE
(a) The main steps in the computational pipeline
are outlined at left, with different aspects of
read assignment and weighting diagrammed at right
and the corresponding number of gene model reads
treated in muscle shown in parentheses. In each
step, the sequence read or reads being assigned
by the algorithm are shown as a black rectangle,
and their assignment to one or more gene models
is indicated in color. Sequence reads falling
outside known or predicted regions are shown in
gray. RNAFAR regions (clusters of reads that do
not belong to any gene model in our reference
set) are shown as dotted lines. They can either
be assigned to neighboring gene models, if they
are within a specified threshold radius (purple),
or assigned their own predicted transcript model
(green). Multireads (shown as parallelograms) are
assigned fractionally to their different possible
locations based on the expression levels of their
respective gene models as described in the text.
(b) Comparison of mouse liver expanded RPKM
values to publicly available Affymetrix
microarray intensities from GEO (GSE6850) for
genes called as present by Rosetta Resolver.
Expanded RPKMs include unique reads, spliced
reads and RNAFAR candidate exon aggregation, but
not multireads. Genes with >30% contribution of
multireads to their final RPKM (Supplementary
Fig. 4) are marked in red. (c) Comparison of
Affymetrix intensity values with final RPKMs,
which includes multireads. Note that the
multiread-affected genes that are below the
regression line in b straddle the regression line
in c.
Mortazavi et al., Nature Methods 5, 621-628 (2008)
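The fractional assignment of multireads can be sketched as a proportional split. The code assumes each candidate gene model already has an expression estimate from unique reads, and it falls back to an even split when none of the candidates has any. Both the function and the fallback are illustrative assumptions, not ERANGE's actual code.

```python
def assign_multiread(candidate_genes, unique_rpkm):
    """Split one multiread across gene models in proportion to their unique-read RPKM."""
    weights = [unique_rpkm.get(gene, 0.0) for gene in candidate_genes]
    total = sum(weights)
    if total == 0:
        # No expression evidence for any candidate: even split (assumption of this sketch).
        return {gene: 1.0 / len(candidate_genes) for gene in candidate_genes}
    return {gene: w / total for gene, w in zip(candidate_genes, weights)}

# Hypothetical gene models and expression levels.
unique_rpkm = {"geneA": 30.0, "geneB": 10.0}
print(assign_multiread(["geneA", "geneB"], unique_rpkm))
print(assign_multiread(["geneC", "geneD"], unique_rpkm))
```

Weighting by expression sends most of an ambiguous read to the paralog that is demonstrably transcribed. This is consistent with panel (c), where adding multireads moves the affected genes back toward the regression line.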
36Summary
- Next-generation sequencing is
- Rapidly evolving
- Democratizing the extent to which individual
investigators can pursue projects at a scale
previously accessible only to major genome
centers
- New statistical tools need to be developed