Title: Shotgun Sequencing
1Shotgun Sequencing Assembly
- Chuong Huynh
- NIH/NLM/NCBI
- huynh_at_ncbi.nlm.nih.gov
- Bangkok, Thailand
- July 15, 2002
- Acknowledgement: Daniel Lawson, Artur Gruber
2Sequencing
Genomic DNA
Shearing/Sonication
Small DNA fragments (1.0-2.0 kb)
Clone library (pUC18)
Subclone and sequence
DNA sequencing of random clones
Shotgun reads
Assembly
Contigs
Finishing reads
Coverage on both strands, gaps filled
Finishing
Complete sequence
3Genome Sequencing - Review
Strategy
Libraries
Sequencing
Assembly
Closure
Annotation
Release
4Sequencing Software
- Staden Package
- Pregap4 (prepares sequence trace data for analysis)
- Gap4 (assembly program)
- http://www.mrc-lmb.cam.ac.uk/pubseq/
- CAP3 (Sequence Assembly Program)
- Email: huang_at_mtu.edu
- Phred/Phrap/Consed distribution package: http://www.phrap.org/
- Sequencing software utilities available at Sanger Centre: http://www.sanger.ac.uk/software/
5Large-scale genome projects
- Sequencing DNA molecules in the Mb size range
- All strategies employ the same underlying principles
- Random shotgun sequencing
6Growth of Nucleotide Database: EMBL/GenBank/DDBJ
7Progress on Large Sequencing Projects
8Strategies for sequencing
- How big can you go??
- Large-insert clones
- cosmids 30-40 kb
- BACs/PACs 50 - 100 kb
- Whole chromosomes
- Whole genomes
9Genome size and sequencing strategies
Figure: genome size (log Mb, axis from 0 to 4) against sequencing strategy
- H. sapiens (3000 Mb)
- D. melanogaster (170 Mb)
- C. elegans (100 Mb)
- P. falciparum (30 Mb)
- S. cerevisiae (14 Mb)
- E. coli (4 Mb)
Strategies shown: whole genome shotgun (WGS); clone-by-clone; whole chromosome shotgun (WCS); whole genome shotgun (WGS) with clone skims
10Strategies for sequencing
- Size and GC composition of genome
- Volume of data
- Ease of cloning
- Ease of sequencing
- Genome complexity
- dispersed repetitive sequence
- telomeres and centromeres
- Politics/Funding
11Strategies Whole Genome shotgun (WGS)
- Moderate to high complexity (tens to hundreds of thousands of reads)
- Problems with repeats
- Complex informatics
- Quality of physical map
- Fingerprint map
- STS markers
- End-sequences
- Skims of mapped clones
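The read counts quoted for a whole genome shotgun can be sanity-checked with a back-of-the-envelope coverage calculation. The sketch below uses the Lander-Waterman model, which is not named in these slides: coverage is c = N * L / G, and the expected fraction of bases hit by no read is about e^(-c). The 500 bp read length and 8x coverage are assumptions chosen for illustration; the 30 Mb genome size is the P. falciparum figure from the genome-size slide.

```python
import math


def reads_needed(genome_bp, read_len_bp, coverage):
    """Smallest number of reads N such that N * L / G >= coverage."""
    return math.ceil(coverage * genome_bp / read_len_bp)


def expected_uncovered_fraction(coverage):
    """Lander-Waterman estimate of the fraction of bases hit by no read."""
    return math.exp(-coverage)


if __name__ == "__main__":
    # Assumed values: 500 bp reads, 8x coverage, 30 Mb genome.
    n = reads_needed(30_000_000, 500, 8)
    print(n, expected_uncovered_fraction(8))
```

The model ignores repeats and cloning bias, so real projects leave more gaps than it predicts.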
12Sequencing
Genomic DNA
Shearing/Sonication
Small DNA fragments (1.0-2.0 kb)
Clone library (pUC18)
Subclone and sequence
DNA sequencing of random clones
Shotgun reads
Assembly
Contigs
Finishing reads
Coverage on both strands, gaps filled
Finishing
Complete sequence
13Sequencing
- Library construction
- Colony picking
- DNA preparation
- Sequencing reactions
- Electrophoresis
- Tracking/Base calling
14Libraries
- Essentially Sub-cloning
- Generation of small-insert libraries in a well-characterised vector
- Ease of propagation
- Ease of DNA purification
- e.g. pUC18, M13
- Test the libraries
15Sequence generation
- Pick colonies
- Template preparation
- Sequence reactions
- Standard terminator chemistry
- pUC libraries sequenced with forward and reverse primers
16Sequence generation
- Electrophoresis of products
- Old style - slab gels, 32 > 64 > 96 lanes
- New style - capillary gels, 96 lanes
- Transfer of gel image to UNIX
- Sequencing machines use a slave Mac/PC
- Move data to centralised storage area for processing
17Gel image processing
- Light-to-Dye estimation
- Lane tracking
- Lane editing
- Trace extraction
- Trace standardisation
- Mobility correction
- Background subtraction
18Pre-processing
- Base calling using Phred
- modifies SCF file
- Quality clipping
- Vector clipping
- Sequencing vector
- Cloning vector
- Screen for contaminants
- Feature mark up (repeats/transposons)
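Quality clipping can be illustrated with a simple sliding-window rule. This is a minimal sketch of the general idea, not the algorithm used by Phred or Pregap4: the function name, the 10-base window and the mean-quality threshold of 20 are all assumptions. It keeps the span from the first window whose mean quality passes the threshold to the last such window.

```python
def quality_clip(quals, window=10, min_mean=20.0):
    """Return (start, end) of the retained region, end exclusive.

    Returns (0, 0) when no window reaches the threshold. Low-quality
    dips between the first and last passing window are kept.
    """
    n = len(quals)
    if n < window:
        return (0, 0)
    # good[i] is True when the window starting at base i passes.
    good = [
        sum(quals[i:i + window]) / window >= min_mean
        for i in range(n - window + 1)
    ]
    if not any(good):
        return (0, 0)
    first = good.index(True)
    last = len(good) - 1 - good[::-1].index(True)
    return (first, last + window)


if __name__ == "__main__":
    # Hypothetical read: poor ends around a good middle.
    example = [5] * 10 + [40] * 30 + [5] * 10
    print(quality_clip(example))
```

Because the rule works on window means, a few low-quality bases at each boundary survive the clip; a stricter scheme would trim those individually.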
19Phred
- Phred is a program that performs several tasks
- a. Reads trace files: compatible with most file formats, SCF (standard chromatogram format), ABI (373/377/3700), ESD (MegaBACE) and LI-COR.
- b. Calls bases: attributes a base to each identified peak, with a lower error rate than the standard base-calling programs.
- c. Assigns quality values to the bases: a Phred value based on an error-rate estimate calculated for each individual base.
- d. Creates output files: base calls and quality values are written to output files.
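The relationship between a Phred quality value and the estimated error probability is Q = -10 * log10(P), so Q20 corresponds to 1 error in 100 bases and Q30 to 1 in 1000. A minimal sketch of the conversion in both directions (the function names are illustrative, not part of Phred):

```python
import math


def error_prob_to_phred(p_error):
    """Convert an estimated per-base error probability to a Phred value."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def phred_to_error_prob(q):
    """Convert a Phred value back to an error probability."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```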
20Trace File: High-quality region, no ambiguities (Ns)
- No ambiguities, no noise, peaks very well spaced
21Trace File: Medium-quality region, some ambiguities (Ns)
- Some ambiguities (Ns)
- No noise
- Peaks very well spaced
- Some homopolymeric stretches are not well resolved
22Trace File: Poor-quality region, low confidence
- Overlapping peaks, peaks not evenly spaced
- Low resolution, low confidence in base assignment
23Finishing
- Assembly: the process of taking raw single-pass reads into a contiguous consensus sequence
- Closure: the process of ordering and merging consensus sequences into a single contiguous sequence
- Finished is defined as sequenced on both strands using multiple clones. In the absence of multiple clones, the clone must be sequenced with multiple chemistries. The overall error rate is estimated at less than 1 error per 10 kb
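One way to check the error-rate criterion is to sum the per-base error probabilities implied by the consensus quality values. The sketch below assumes Phred-scaled qualities (P = 10^(-Q/10)) and treats errors as independent; the function names are hypothetical, and real finishing standards also require the strand and clone coverage rules above.

```python
def expected_errors(quals):
    """Expected number of wrong bases given Phred-scaled qualities."""
    return sum(10.0 ** (-q / 10.0) for q in quals)


def errors_per_10kb(quals):
    """Expected errors, normalised to a 10 kb stretch."""
    if not quals:
        raise ValueError("no quality values supplied")
    return expected_errors(quals) / len(quals) * 10_000


def meets_error_target(quals, max_errors_per_10kb=1.0):
    """True when the estimate is below the target (default: 1 per 10 kb)."""
    return errors_per_10kb(quals) < max_errors_per_10kb


if __name__ == "__main__":
    # Hypothetical consensus: mostly Q50 with a short Q20 stretch.
    consensus_quals = [50] * 9_900 + [20] * 100
    print(errors_per_10kb(consensus_quals), meets_error_target(consensus_quals))
```

A uniform Q40 consensus sits right at the 1 per 10 kb boundary, which is why a short low-quality stretch is enough to fail the target.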
24Genome Assembly
- Pre-assembly
- Assembly
- Automated appraisal
- Manual review
25Assembly
- Run Cross_match to remove or screen out vector sequences before running Phrap
- Assemble using Phrap
- Read FASTA and quality scores from CAF file
- Merge existing Phrap .ace file as necessary
- Adjust clipping
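Vector screening can be pictured as masking any part of a read that matches the vector, so the assembler ignores it. The sketch below is a toy stand-in, not how Cross_match works: Cross_match aligns reads to the vector with scoring, whereas this version only masks exact k-mer matches on the forward strand. The function name, the k value and the sequences are assumptions for illustration.

```python
def mask_vector(read, vector, k=12):
    """Replace read bases that lie in an exact k-mer match to the vector with 'X'."""
    read_u = read.upper()
    vec = vector.upper()
    vector_kmers = {vec[i:i + k] for i in range(len(vec) - k + 1)}
    masked = list(read)
    for i in range(len(read_u) - k + 1):
        if read_u[i:i + k] in vector_kmers:
            # Mask every base covered by this matching window.
            for j in range(i, i + k):
                masked[j] = "X"
    return "".join(masked)


if __name__ == "__main__":
    # Made-up vector fragment and a read that starts inside it.
    vector = "GAATTCGAGCTCGGTACCCGGGGATCC"
    read = vector[:20] + "T" * 16
    print(mask_vector(read, vector))
```

Exact matching misses vector sequence that contains base-calling errors, which is one reason real screening tools use alignment instead.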
26Phrap: Phragment Assembly Program
- Phrap is a program for assembling shotgun DNA sequence data
- Key features
- a. Uses the entire read content: no need for trimming.
- b. Uses user-supplied (e.g. Repbase) and internally computed data: better accuracy of assembly in the presence of repeats.
- c. Contig sequence is constituted by a mosaic of the highest-quality parts of the reads: it's not a consensus!
- d. Provides extensive information about the assembly, contained in the phrap.out, .ace and .screen.contigs.qual files
- e. Handles very large datasets: hundreds of thousands of reads are easily manipulated.
- f. Generates output files: these contain important data and enable visualization by other programs
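The difference between a mosaic and a consensus can be shown on a tiny alignment. The sketch below is a deliberately simplified illustration, not Phrap's algorithm: Phrap selects stretches of reads, while this toy picks, column by column, the base from whichever read has the highest quality there. The input layout (pre-aligned reads padded with "-") and the function name are assumptions.

```python
def mosaic_sequence(aligned_reads):
    """Pick, per column, the base from the read with the highest quality.

    aligned_reads: list of (sequence, qualities) pairs of equal length,
    with '-' in the sequence wherever a read does not cover the column.
    Columns covered by no read are reported as 'N'.
    """
    length = len(aligned_reads[0][0])
    out = []
    for col in range(length):
        best_base, best_q = None, -1
        for seq, quals in aligned_reads:
            base = seq[col]
            if base != "-" and quals[col] > best_q:
                best_base, best_q = base, quals[col]
        out.append(best_base if best_base is not None else "N")
    return "".join(out)


if __name__ == "__main__":
    reads = [
        ("ACGT--", [30, 30, 10, 10, 0, 0]),
        ("--CTAG", [0, 0, 40, 40, 20, 20]),
    ]
    print(mosaic_sequence(reads))
```

In the third column the two reads disagree (G at quality 10, C at quality 40); the mosaic takes the high-quality C rather than voting.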
27Assembly appraisal
- auto-edit
- removes 70% of read discrepancies
- Remove cloning vector (VecScreen, CrossMatch)
- Mark up sequence features
- finish
- Identify low-quality regions
- Cover using re-runs and long-runs
- Compare with current databases
- plate contamination
28Gap closure
- Read pairs
- PCR reactions (long-range / combinatorial)
- Small-insert libraries
- Transposon-insertion libraries
29Gap closure - contig ordering
- Read pair consistency
- STS mapping
- Physical mapping
- Genetic mapping
- Optical mapping
- Large-insert clone
- skims
- end-sequencing
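Read-pair consistency is the cheapest ordering evidence: when the forward and reverse reads of one subclone land in different contigs, those contigs are probably neighbours. A minimal sketch that counts such links, assuming a hypothetical mapping from read name to contig name and a list of mate pairs (names are made up; orientation and insert-size checks are left out):

```python
from collections import Counter


def contig_links(read_to_contig, read_pairs):
    """Count read pairs whose two mates sit in different contigs."""
    links = Counter()
    for mate_a, mate_b in read_pairs:
        contig_a = read_to_contig.get(mate_a)
        contig_b = read_to_contig.get(mate_b)
        # Skip pairs with an unplaced mate or both mates in one contig.
        if contig_a is None or contig_b is None or contig_a == contig_b:
            continue
        links[tuple(sorted((contig_a, contig_b)))] += 1
    return links


if __name__ == "__main__":
    placements = {
        "c1.f": "contig1", "c1.r": "contig2",
        "c2.f": "contig1", "c2.r": "contig2",
        "c3.f": "contig2", "c3.r": "contig2",
        "c4.f": "contig3",
    }
    pairs = [("c1.f", "c1.r"), ("c2.f", "c2.r"), ("c3.f", "c3.r"), ("c4.f", "c4.r")]
    print(contig_links(placements, pairs))
```

Contig pairs supported by several independent subclones are good candidates for gap-spanning PCR; a single link may simply be a chimeric clone.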
30Consed
- Consed is a program for viewing and editing assemblies produced by Phrap
- a. Assembly viewer: allows visualization of contigs, assembly (aligned reads), quality values of reads and final sequence.
- b. Trace file viewer: single and multiple trace files can be visualized, allowing comparison of a given sequence in several reads.
- c. Navigation: identifies and lists regions which are below a given quality threshold, contain high-quality discrepancies, have single-strand coverage, etc.
- d. Autofinish: an automatic set of functions for gap closure, improvement of sequence quality, determination of the relative orientation of contigs, and identification of regions covered by a single read or by reads of a single strand. The program automatically performs primer picking and chooses the templates.
31Consed in Action
32AutoFinish
- Autofinish is part of the Consed software package
- Automatically chooses finishing reads in order to finish a project according to user-supplied parameters
- Software that helps determine how contigs are ordered and oriented
- Close gaps
- Improve error rate
- Cover every base by reads from at least 2 different subclones!
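The "two different subclones" rule is easy to check once read placements are known. The sketch below is not Autofinish itself; it only illustrates the bookkeeping, assuming each read is given as a hypothetical (subclone, start, end) tuple in contig coordinates with an exclusive end. It reports the stretches where fewer than two distinct subclones cover a base.

```python
def weak_regions(contig_len, reads, min_subclones=2):
    """Return (start, end) intervals covered by fewer than min_subclones subclones."""
    covering = [set() for _ in range(contig_len)]
    for subclone, start, end in reads:
        for pos in range(max(start, 0), min(end, contig_len)):
            covering[pos].add(subclone)
    regions = []
    region_start = None
    for pos in range(contig_len):
        weak = len(covering[pos]) < min_subclones
        if weak and region_start is None:
            region_start = pos
        elif not weak and region_start is not None:
            regions.append((region_start, pos))
            region_start = None
    if region_start is not None:
        regions.append((region_start, contig_len))
    return regions


if __name__ == "__main__":
    # Two reads from subclone A do not count twice; only B adds a second subclone.
    placements = [("A", 0, 60), ("A", 20, 80), ("B", 40, 100)]
    print(weak_regions(100, placements))
```

Each reported interval is a candidate for an extra finishing read from a subclone not already present there.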
34Annotation
- DNA features (repeats/similarities)
- Gene finding
- Peptide features
- Initial role assignment
- Others- regulatory regions
35Annotation of eukaryotic genomes
Figure: from genomic DNA to function, with the annotation approach that applies at each stage
Genomic DNA (ab initio gene prediction)
transcription
Unprocessed RNA
RNA processing
Mature mRNA, Gm3 ... AAAAAAA (comparative gene prediction)
translation
Nascent polypeptide
folding
Active enzyme (functional identification)
Function: Reactant A to Product B
36Genome analysis overview C.elegans
37DNA features
- Similarity features
- mapping repeats
- simple tandem and inverted
- repeat families
- mapping DNA similarities
- EST/mRNAs in eukaryotes
- Duplications
- RNAs
- mapping peptide similarities
- protein similarities
38Gene finding
- ORF finding (simple but messy)
- ab initio prediction
- Measures of codon bias
- Simple statistical frequencies
- Comparative prediction
- Using similarity data
- Using cross-species similarities
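ORF finding is the simplest of these approaches, and a short sketch shows why it is "simple but messy": it reports every ATG-to-stop stretch in all six frames, with no notion of introns or coding potential. The function names and the minimum-length cutoff are assumptions; coordinates on the "-" strand refer to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Reverse-complement an upper-case DNA string (other letters are kept)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, start, end) for ATG..stop ORFs in all six frames.

    end is exclusive and includes the stop codon. min_codons counts
    codons from the ATG up to, but not including, the stop.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3))
                    start = None
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAACCCGGGTAACC"))
```

On real genomic sequence a low cutoff produces many spurious hits, which is where codon-bias measures and similarity data come in.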
39Peptide features
- Peptide features
- low-complexity regions
- trans-membrane regions
- structural information (coiled-coil)
- Similarities and alignments
- Protein families (InterPro/COGs)
40Initial role assignment
- Simple attempt to describe the functional identity of a peptide
- Uses data from
- peptide similarities
- protein families
- Vital for data mining
- Large number of predicted genes remain hypothetical or unknown
41Other regulatory features
- Ribosomal binding sites
- Promoter regions
42Data Release
- DNA release
- Unfinished
- Finished
- Nucleotide databases
- GenBank/EMBL/DDBJ
- Peptide databases
- SWISS-PROT/TrEMBL/GenPept
- Others
43What do you get?
DATA!!, DATA!!, and more DATA!!
- Sequence
- incomplete vs. complete
- First-pass annotation
- Gene discovery
- Full annotation
- A starting point for research
44Genome annotation is central to functional genomics
45Sequencing my genome
Politics
Production
Finishing
Annotation
TIME
MONEY
46Extra Slides
47What is Phred/Phrap/Consed ?
- Phred/Phrap/Consed is a package distributed worldwide for
- a. Trace file (chromatogram) reading
- b. Quality (confidence) assignment to each individual base
- c. Vector and repeat sequence identification and masking
- d. Sequence assembly
- e. Assembly visualization and editing
- f. Automatic finishing