Title: 454 Next-generation Sequencing: Quality Control and Benchmarks
1. 454 Next-generation Sequencing: Quality Control and Benchmarks
- Jennifer M Taylor
- Bioinformatics Leader, Plant Industry
- June 2009
2. 454 Quality Control
- Technical Details
- Roche 454 Workflow
- Applications
- Data
- General features
- Quality Metrics
- Genome Sequencing
- Variant Detection
3. Roche 454 Pyrosequencing
- High-throughput sequencing-by-synthesis technology
- GS FLX with Titanium series reagents
- Current: 200-400 bp read lengths, 400 Mbp per 10-hour run
- Projected: 500-700 bp read lengths (20 kb, 8 kb, 3 kb paired-end reads)
- Applications
- Genome resequencing
- Variant detection
- De novo genome sequencing
- Metagenomics
- ChIP-Seq: protein-DNA interactions
- RNA-Seq: transcript profiling, small RNA profiling
4. Roche 454 Pyrosequencing
- Workflow
- Library preparation
- Fragmentation
- Adaptors
- Emulsion PCR
- Immobilisation
- Dilution to single bead / well
- Emulsion-based amplification
5. Roche 454 Pyrosequencing
- Workflow (contd)
- Sequencing by Synthesis
- Enrichment of beads with amplified fragments
- Arrayed to 1 bead / well / fibre-optic slide
- Nucleotide flows (T, A, C, G)
- Luciferase reaction measured for 1 or more nucleotide incorporations
- Data Analysis
- Integration of position and signal
- Application of base-calling and quality filtering
6. Roche 454 General Data Features
- Flowgram / Pyrogram
- Does not read sequence bases directly
- Example: flow cycles = 3, nucleotide flows = 12, sequence length = 13
- TCAGGTTTTTAAG
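The flowgram idea can be sketched in a few lines of Python. This is a minimal illustration, assuming the flow order T, A, C, G from the workflow slide and ideal, noise-free signals; the function names are hypothetical and are not part of any Roche software.

```python
FLOW_ORDER = "TACG"  # assumed flow order (T, A, C, G), as on the workflow slide


def sequence_to_flowgram(seq, flow_order=FLOW_ORDER):
    """Return ideal homopolymer counts, one per nucleotide flow.

    The flowgram is padded to the end of the last flow cycle.
    """
    unknown = set(seq) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")
    flows = []
    pos = 0
    while pos < len(seq):
        for base in flow_order:
            run = 0
            # A single flow incorporates the whole homopolymer run at once.
            while pos < len(seq) and seq[pos] == base:
                run += 1
                pos += 1
            flows.append(run)
    return flows


def flowgram_to_sequence(flows, flow_order=FLOW_ORDER):
    """Rebuild the sequence from integer homopolymer counts per flow."""
    return "".join(
        flow_order[i % len(flow_order)] * count for i, count in enumerate(flows)
    )


flows = sequence_to_flowgram("TCAGGTTTTTAAG")
print(flows)                        # 12 values, i.e. 3 cycles of 4 flows
print(flowgram_to_sequence(flows))  # TCAGGTTTTTAAG
```

The example sequence needs 12 nucleotide flows (3 flow cycles) for 13 bases because the GG, TTTTT and AA runs are each read in a single flow.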
7. Roche 454 Pyrosequencing
- Sources of Error
- Undercall / Overcall
- T (signal 0.49, called 0), T (signal 1.6, called 2)
- Typically result in insertions / deletions
- Miscalls
- Called TCTTG, true TCTCG (overcalled T AND undercalled C)
- Variation in sequence coverage
- PCR amplification bias
- PCR error (1)
- Sanger sequencing can average out PCR error
- NGS reads are derived from a single molecule, so the PCR error rate transfers to the reads
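The undercall and overcall examples above can be reproduced with a naive base caller. This is only a sketch: it assumes the flow order T, A, C, G and calls each flow by rounding the signal to the nearest integer, which is a simplification and not the actual 454 base-calling algorithm. The signal values other than 0.49 and 1.6 are made up for illustration.

```python
import math


def call_homopolymer(signal):
    """Round a flow signal to the nearest non-negative integer run length."""
    return max(0, math.floor(signal + 0.5))


def call_bases(signals, flow_order="TACG"):
    """Naive base calling: one rounded homopolymer call per nucleotide flow."""
    return "".join(
        flow_order[i % len(flow_order)] * call_homopolymer(s)
        for i, s in enumerate(signals)
    )


# Ideal signals for the true sequence TCTCG (flows T A C G T A C G).
ideal = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0]
# Hypothetical noisy read: second T overcalled (1.6 -> 2),
# second C undercalled (0.49 -> 0).
noisy = [1.0, 0.0, 1.0, 0.0, 1.6, 0.0, 0.49, 1.0]

print(call_bases(ideal))  # TCTCG
print(call_bases(noisy))  # TCTTG
```

An overcall and an undercall in neighbouring flows look like a substitution in the read, although each is really an insertion or deletion at the flow level.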
8. Roche 454 Pyrosequencing
- Example sequences: TCAGGTTTTTAAG, TAACGGTTTACGG
- For sequence length (n)
- Min(f) = n, Max(f) = 3n + 1
- n fixed: f ~ N(µ, σ²)
- µ and σ² increase linearly with sequence length
- Nucleotide frequencies → equal: µ → max, σ² → min
- f fixed: n ~ N(µ, σ²)
- µ and σ² increase linearly with flow cycle number
- Nucleotide frequencies → equal: µ → min, σ² → min
- Kong, 2009
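The relationship between sequence length and number of flows can be explored with a small counter. This sketch rests on two assumptions that are not stated on the slide: f is counted in individual nucleotide flows with flow order T, A, C, G, and the bounds n and 3n + 1 refer to sequences with no adjacent identical bases. The function name is hypothetical.

```python
def flows_needed(seq, flow_order="TACG"):
    """Count nucleotide flows until the last base of seq is incorporated."""
    unknown = set(seq) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")
    flows = 0
    pos = 0
    while pos < len(seq):
        base = flow_order[flows % len(flow_order)]
        flows += 1
        # One flow extends through a whole homopolymer run.
        while pos < len(seq) and seq[pos] == base:
            pos += 1
    return flows


# Best case without homopolymers: every flow incorporates one base.
print(flows_needed("TACGTACG"))       # n = 8 flows
# Worst case: each base is the last one offered after the previous base.
print(flows_needed("GCATGCAT"))       # 3n + 1 = 25 flows
# The two 13-base example sequences need different numbers of flows.
print(flows_needed("TCAGGTTTTTAAG"))  # 12
print(flows_needed("TAACGGTTTACGG"))  # 8
```

Under these assumptions, sequences of equal length can need quite different numbers of flows, which is why the flow count for fixed n is described by a distribution.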
9. Roche 454 Quality Filters/Scores
- [Figure: signal axis, 0.5 to 0.75]
- 0.5 < signal < 0.7: overlap region
- Allow only those reads with < 5% of flows in overlap region
- Excluded reads trimmed from end until
- < 5% of flows in overlap region
- < 82 flows (21 flow cycles)
- Exclude reads with > 5 ambiguous calls (N)
- Margulies et al., 2005
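The overlap-region filter can be sketched as follows. This is one reading of the bullets above, not the published implementation: a read is trimmed from the 3' end until fewer than 5% of its flows fall in the overlap region, and it is discarded if fewer than 82 flows remain. The function names and default parameters are assumptions taken from the slide values.

```python
def overlap_fraction(signals, low=0.5, high=0.7):
    """Fraction of flows whose signal lies in the ambiguous overlap region."""
    if not signals:
        return 0.0
    return sum(low < s < high for s in signals) / len(signals)


def filter_read(signals, max_fraction=0.05, min_flows=82):
    """Return the retained flow signals, or None if the read is discarded."""
    kept = list(signals)
    # Trim from the end until the overlap fraction drops below the threshold.
    while kept and overlap_fraction(kept) >= max_fraction:
        kept.pop()
    if len(kept) < min_flows:
        return None
    return kept


clean = [1.0, 0.0] * 50                # 100 unambiguous flows
noisy_tail = [1.0, 0.0] * 45 + [0.6] * 10
hopeless = [0.6] * 100

print(len(filter_read(clean)))       # read kept at full length
print(len(filter_read(noisy_tail)))  # read kept after trimming the tail
print(filter_read(hopeless))         # None, read discarded
```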
10. Roche 454 Quality Filters/Scores
- n: length of homopolymer
- s: signal
- j is position in read
- P(s|n) empirically determined to follow a Gaussian distribution
- P(n) for random nucleotide sequence is (¼)^n
- Margulies et al., 2005
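A possible way to combine these two quantities is Bayes' rule, P(n|s) proportional to P(s|n) P(n), normalised over candidate lengths. The sketch below assumes that form; the Gaussian standard deviations (base_sd, sd_growth), the maximum length and the Phred-style conversion are illustrative assumptions, not values from Margulies et al.

```python
import math


def gaussian_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def homopolymer_posterior(signal, max_len=8, base_sd=0.15, sd_growth=0.05):
    """Posterior P(n | s) for n = 0..max_len.

    Likelihood P(s | n): Gaussian centred on n (hypothetical spread).
    Prior P(n): (1/4)**n, as for a random nucleotide sequence.
    """
    weights = [
        gaussian_pdf(signal, n, base_sd + sd_growth * n) * 0.25 ** n
        for n in range(max_len + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def phred(p_correct, cap=40.0):
    """Phred-style score from the probability that the call is correct."""
    p_error = 1.0 - p_correct
    if p_error <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(p_error))


for s in (1.0, 1.5, 2.0):
    post = homopolymer_posterior(s)
    best = max(range(len(post)), key=post.__getitem__)
    print(s, best, round(phred(post[best]), 1))
```

A signal halfway between two integers gives a low score for either call, and the (1/4)^n prior pulls ambiguous signals toward the shorter homopolymer.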
11. 454 Quality Control
- Technical Details
- 454 Workflow
- Applications
- Data
- General features
- Quality Metrics
- Genome Sequencing
- Variant Detection
12. Genome Sequencing: Accuracy and Coverage
- Alignment of multiple reads
- Integration with Sanger sequencing
- Wicker et al., 2006: Barley Genome
- Compared 454-derived and ABI-Sanger-derived consensus sequences
- Error rates of 0.07% / position
- Moore et al., 2006: Plastid Genomes
- Comparison across genomes w.r.t. consensus
- 0.031%, 0.043%
13. Roche 454 Quality Filters/Scores
- Modifications to Quality Filters / Scores: Huse et al., 2007
- 43 reference templates of known sequence
- Divergent bacteria
- > 340,000 reads
- P(n) for random nucleotide sequence is (¼)^n
- Penalises long homopolymers indiscriminately
14. Accuracy and Coverage
- Huse et al., 2007
- Error rate of 0.49%
- 39% were homopolymer effects (insertions 36%, deletions 27%)
- < 2% of reads accounted for nearly 50% of errors
- Ambiguous bases
- Strong correlation between Ns and other types of errors
- 454 quality control allows up to 5 Ns per read
- Removal of all sequences containing Ns
- Significant error rate improvement: 0.24%
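The stricter ambiguity filter amounts to a one-line read filter. This is a minimal sketch with a hypothetical function name; max_n=0 removes every read containing an N, while max_n=5 mimics the more permissive setting mentioned above.

```python
def drop_reads_with_n(reads, max_n=0):
    """Keep only reads with at most max_n ambiguous base calls (N)."""
    return [read for read in reads if read.upper().count("N") <= max_n]


reads = ["ACGTACGT", "ACGNACGT", "NNNNNNAC"]
print(drop_reads_with_n(reads))           # strict: no Ns allowed
print(drop_reads_with_n(reads, max_n=5))  # permissive: up to 5 Ns per read
```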
15. Accuracy and Coverage
- Harismendy et al., 2009
- 260 kb across 4 individuals: 454, Illumina, ABI SOLiD, ABI Sanger
- Saturating coverage: 43x (Roche), 188x (Illumina), 841x (SOLiD)
16. Accuracy and Coverage: Variant Detection
- Harismendy et al., 2009
- Variant detection accuracy
- ABI Sanger: FP 0.9, FN 0.31
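False positive (FP) and false negative (FN) rates of a variant caller can be computed against a reference call set as sketched below. The definitions used here (FP as a fraction of all calls made, FN as a fraction of all true variants) are an assumption and may differ from those used by Harismendy et al.; the variant positions are made up for illustration.

```python
def variant_call_rates(called, truth):
    """Compare called variants with a trusted set of true variants."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # variants called and real
    fp = len(called - truth)   # variants called but not real
    fn = len(truth - called)   # real variants that were missed
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "fp_rate": fp / len(called) if called else 0.0,
        "fn_rate": fn / len(truth) if truth else 0.0,
    }


# Hypothetical (position, ref, alt) variants.
truth = {(100, "A", "G"), (200, "C", "T"), (300, "G", "A"), (400, "T", "C")}
called = {(100, "A", "G"), (200, "C", "T"), (500, "A", "C")}
print(variant_call_rates(called, truth))
```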
17. Accuracy and Coverage
- Amplicon End Bias
- 2.3% of total reference sequence
- 56% of Illumina sequence reads
- 11% of SOLiD
- 5% of 454
- Bias after fragmentation
- SOLiD and 454 library preparation adaptations
- Repeats
- SOLiD (1/2-fold coverage)
- 454 (equal coverage)
- Illumina (2-fold coverage)
- Sequence composition
- Low-coverage regions for short reads (SR) tend to be AT-rich.
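Coverage biases such as these are usually quantified from per-base read depth. The sketch below computes depth from aligned read intervals and expresses the depth of a region as a fold change relative to the overall mean. The interval convention (0-based, half-open) and the function names are assumptions, and the toy alignments are illustrative only.

```python
def per_base_depth(reference_length, alignments):
    """Depth at each reference position from (start, end) read intervals.

    Intervals are 0-based and half-open; reads are clipped to the reference.
    """
    diff = [0] * (reference_length + 1)
    for start, end in alignments:
        start, end = max(0, start), min(reference_length, end)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    depth, running = [], 0
    for delta in diff[:-1]:
        running += delta
        depth.append(running)
    return depth


def fold_coverage(depth, start, end):
    """Mean depth of depth[start:end] relative to the overall mean depth."""
    overall = sum(depth) / len(depth)
    if overall == 0 or end <= start:
        return 0.0
    return (sum(depth[start:end]) / (end - start)) / overall


depth = per_base_depth(10, [(0, 5), (0, 5), (5, 10)])
print(depth)
print(fold_coverage(depth, 0, 5), fold_coverage(depth, 5, 10))
```

A fold value above 1 marks an over-represented region (for example amplicon ends or repeats on some platforms), and a value below 1 marks an under-represented one.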
18. Conclusions
- 454 quality control scores need optimisation
- Naïve penalties for homopolymer length
- Inadequate control of ambiguities
- Lack of control of undercalling
- Brockman et al., 2008 Genome Research
- Uniformity of per-base sequence coverage needs to be improved
- Pyrosequencing shows high specificity in variant detection and low error rates in the construction of consensus sequences IN THE PRESENCE OF saturating coverage.
19. Acknowledgements and References
- Andrew Spriggs
- Karl Gordon
- David Townley
- David Lovell
- Brockman et al., Genome Research 2008, 18:763-770
- Margulies et al., Nature 2005, 437(15):376-380
- Kong, J. Comp. Biology 2009, 16(1):1-12
- Huse et al., Genome Biology 2007, 8:R143
- Harismendy et al., Genome Biology 2009, 10:R32
20. Thank you
Plant Industry
Jennifer M Taylor, Bioinformatics Leader
Phone: 61 2 62464929
Email: Jen.Taylor_at_csiro.au
Web