Shotgun Sequencing - PowerPoint PPT Presentation


About This Presentation

Title:

Shotgun Sequencing

Description:

Most genomes will be sequenced and can be sequenced; few problems are unsolvable. ... Assembly: Process of taking raw single-pass reads into ...

Slides: 48

Provided by: chu69

Transcript and Presenter's Notes

Title: Shotgun Sequencing

1
Shotgun Sequencing Assembly

Chuong Huynh
NIH/NLM/NCBI
huynh_at_ncbi.nlm.nih.gov
Bangkok, Thailand
July 15, 2002
Acknowledgement: Daniel Lawson, Artur Gruber

2
Sequencing
Genomic DNA
Shearing/Sonication
Small DNA fragments (1.0-2.0 kb)
Clone library (pUC18)
Subclone and sequence
DNA sequencing: random clones
Shotgun reads
Assembly
Contigs
Finishing reads
Both-strand coverage, gaps filled
Finishing
Complete sequence
3
Genome Sequencing - Review
Strategy -> Libraries -> Sequencing -> Assembly -> Closure -> Annotation -> Release
4
Sequencing Software

Staden Package
Pregap4 (prepares sequence trace data for analysis)
Gap4 (assembly program)
http://www.mrc-lmb.cam.ac.uk/pubseq/
CAP3 (Sequence Assembly Program)
Email: huang_at_mtu.edu
Phred/Phrap/Consed distribution package
http://www.phrap.org/
Sequencing software utilities available at the Sanger Centre: http://www.sanger.ac.uk/software/

5
Large-scale genome projects

Sequencing DNA molecules in the Mb size range
All strategies employ the same underlying
principles
Random Shotgun sequencing

6
Growth of Nucleotide Database: EMBL/GenBank/DDBJ
7
Progress on Large Sequencing Projects
8
Strategies for sequencing

How big can you go??
Large-insert clones
cosmids 30-40 kb
BACs/PACs 50 - 100 kb
Whole chromosomes
Whole genomes

9
Genome size and sequencing strategies
[Figure: genome size against sequencing strategy. Axis: genome size (log Mb), 0 to 4]
Genomes shown: H. sapiens (3000 Mb), D. melanogaster (170 Mb), C. elegans (100 Mb), P. falciparum (30 Mb), S. cerevisiae (14 Mb), E. coli (4 Mb)
Strategies shown: Whole Genome Shotgun (WGS), Clone-by-clone, Whole Chromosome Shotgun (WCS), Whole Genome Shotgun (WGS) with clone skims
10
Strategies for sequencing

Size and GC composition of genome
Volume of data
Ease of cloning
Ease of sequencing
Genome complexity
dispersed repetitive sequence
telomeres centromeres
Politics/Funding

11
Strategies: Whole Genome Shotgun (WGS)

Moderate to high complexity (tens to hundreds of thousands of reads)
Problems with repeats
Complex informatics
Quality of physical map
Fingerprint map
STS markers
End-sequences
Skims of mapped clones
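
The read counts quoted for this strategy can be checked with a back-of-the-envelope calculation: reads needed = coverage x genome size / read length. The sketch below is only an illustration; the function name `reads_needed`, the 500-base read length and the 8-fold coverage are assumptions that do not come from the slides. The genome sizes are the ones listed on the genome size slide.

```python
import math


def reads_needed(genome_size_bp, read_length_bp=500, coverage=8):
    """Estimate how many shotgun reads give the requested average coverage.

    Hypothetical helper: read_length_bp and coverage are illustrative
    defaults, not values taken from the presentation.
    """
    return math.ceil(coverage * genome_size_bp / read_length_bp)


if __name__ == "__main__":
    # Genome sizes as listed in the deck (in base pairs).
    for name, size in [("E. coli", 4_000_000), ("P. falciparum", 30_000_000)]:
        print(name, reads_needed(size))
```

Under these assumptions a 4 Mb genome needs tens of thousands of reads and a 30 Mb genome needs hundreds of thousands, which is the range the slide gives.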

12
Sequencing
Genomic DNA
Shearing/Sonication
Small DNA fragments (1.0-2.0 kb)
Clone library (pUC18)
Subclone and sequence
DNA sequencing: random clones
Shotgun reads
Assembly
Contigs
Finishing reads
Both-strand coverage, gaps filled
Finishing
Complete sequence
13
Sequencing

Library construction
Colony picking
DNA preparation
Sequencing reactions
Electrophoresis
Tracking/Base calling

14
Libraries

Essentially sub-cloning
Generation of small-insert libraries in a well-characterised vector
Ease of propagation
Ease of DNA purification
e.g. pUC18, M13
Test the libraries

15
Sequence generation

Pick colonies
Template preparation
Sequence reactions
Standard terminator chemistry
pUC libraries sequenced with forward and reverse
primers

16
Sequence generation

Electrophoresis of products
Old style - slab gels, 32 > 64 > 96 lanes
New style - capillary gels, 96 lanes
Transfer of gel image to UNIX
Sequencing machines use a slave Mac/PC
Move data to centralised storage area for
processing

17
Gel image processing

Light-to-Dye estimation
Lane tracking
Lane editing
Trace extraction
Trace standardisation
Mobility correction
Background subtraction

18
Pre-processing

Base calling using Phred
modifies SCF file
Quality clipping
Vector clipping
Sequencing vector
Cloning vector
Screen for contaminants
Feature mark up (repeats/transposons)
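
Quality clipping can be sketched as a sliding-window scan over the per-base quality values. This is a minimal illustration, not the algorithm used by Phred or Pregap4; the function name `quality_clip`, the threshold of 20 and the window of 10 bases are assumptions chosen for the example.

```python
def quality_clip(quals, threshold=20, window=10):
    """Return the (start, end) half-open interval of the read to keep.

    Hypothetical helper: keeps the span from the first to the last
    window whose mean quality is at least `threshold`. Returns (0, 0)
    when no window qualifies.
    """
    n = len(quals)
    if n < window:
        return (0, 0)
    good = [
        i
        for i in range(n - window + 1)
        if sum(quals[i:i + window]) / window >= threshold
    ]
    if not good:
        return (0, 0)
    return (good[0], good[-1] + window)


if __name__ == "__main__":
    # A read with poor ends and a good middle.
    read_quals = [5] * 10 + [40] * 30 + [5] * 10
    print(quality_clip(read_quals))
```

A window-based rule tolerates a single bad base inside a good region, at the cost of keeping a few low-quality bases at each edge.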

19
Phred

Phred is a program that performs several tasks:
a. Reads trace files: compatible with most file formats, including SCF (standard chromatogram format), ABI (373/377/3700), ESD (MegaBACE) and LI-COR.
b. Calls bases: attributes a base to each identified peak, with a lower error rate than the standard base-calling programs.
c. Assigns quality values to the bases: a Phred value based on an error rate estimate calculated for each individual base.
d. Creates output files: base calls and quality values are written to output files.
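
The relation between a Phred quality value and the estimated error probability is Q = -10 log10(P), so Q20 corresponds to 1 error in 100 and Q30 to 1 error in 1000. The helper names below are illustrative and are not part of the Phred distribution.

```python
import math


def error_prob_to_phred(p):
    """Convert an estimated per-base error probability into a Phred value."""
    return -10.0 * math.log10(p)


def phred_to_error_prob(q):
    """Convert a Phred quality value back into an error probability."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```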

20
Trace File: high-quality region, no ambiguities (Ns)
- No ambiguities
- No noise
- Peaks very well spaced
21
Trace File: medium-quality region, some ambiguities (Ns)

- Some ambiguities (Ns)
- No noise
- Peaks very well spaced
- Some homopolymeric stretches are not well resolved

22
Trace File: poor-quality region, low confidence

- Overlapping peaks; peaks not evenly spaced
- Low resolution; low confidence in base assignment

23
Finishing

Assembly: the process of turning raw single-pass reads into contiguous consensus sequence
Closure: the process of ordering and merging consensus sequences into a single contiguous sequence
Finished: sequenced on both strands using multiple clones. In the absence of multiple clones, the clone must be sequenced with multiple chemistries. The overall error rate is estimated at less than 1 error per 10 kb.
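
The error-rate criterion (less than 1 error per 10 kb, i.e. a rate below 1e-4) can be checked against consensus quality values. The sketch below assumes that each quality value is a Phred-style score and that per-base errors can simply be summed; both are assumptions of the example, and the function names are hypothetical.

```python
def expected_errors(quals):
    """Sum the per-base error probabilities implied by Phred-style values."""
    return sum(10.0 ** (-q / 10.0) for q in quals)


def meets_finished_error_rate(quals, max_rate=1e-4):
    """True if the estimated error rate is below 1 error per 10 kb.

    Hypothetical check: covers only the error-rate part of the
    definition, not the both-strand or multiple-clone requirements.
    """
    if not quals:
        return False
    return expected_errors(quals) / len(quals) < max_rate


if __name__ == "__main__":
    print(meets_finished_error_rate([50] * 10_000))
    print(meets_finished_error_rate([30] * 10_000))
```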

24
Genome Assembly

Pre-assembly
Assembly
Automated appraisal
Manual review

25
Assembly

Run Cross_match to remove or screen out vector
sequences before running phrap
Assemble using Phrap
Read fasta quality scores from CAF file
Merge existing Phrap .ace file as necessary
Adjust clipping

26
Phrap: Phragment Assembly Program

Phrap is a program for assembling shotgun DNA sequence data.
Key features:
a. Uses the entire read content: no need for trimming.
b. Uses user-supplied (e.g. Repbase) and internally computed data: better accuracy of assembly in the presence of repeats.
c. Contig sequence is a mosaic of the highest-quality parts of the reads: it's not a consensus!
d. Provides extensive information about the assembly in the phrap.out, .ace and .screen.contigs.qual files.
e. Handles very large datasets: hundreds of thousands of reads are easily manipulated.
f. Generates output files that contain important data and enable visualization by other programs.
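
The difference between a consensus and a quality-driven choice (feature c) can be shown on a single alignment column. This is a toy contrast only: Phrap builds its mosaic from parts of reads rather than from isolated columns, and neither function below reproduces its algorithm. All names are hypothetical.

```python
from collections import Counter


def majority_base(column):
    """Consensus-style call: the most frequent base in the column.

    `column` is a list of (base, quality) pairs from the aligned reads.
    """
    counts = Counter(base for base, _ in column)
    return counts.most_common(1)[0][0]


def highest_quality_base(column):
    """Quality-style call: the base from the read with the best quality."""
    return max(column, key=lambda base_qual: base_qual[1])[0]


if __name__ == "__main__":
    # Two low-quality reads say A, one high-quality read says G.
    column = [("A", 10), ("A", 12), ("G", 45)]
    print(majority_base(column), highest_quality_base(column))
```

In the example column the two approaches disagree, which is the situation the slide's "it's not a consensus!" remark points at.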

27
Assembly appraisal

auto-edit
removes 70% of read discrepancies
Remove cloning vector (VecScreen, CrossMatch)
Mark up sequence features
finish
Identify low-quality regions
Cover using re-runs and long-runs
Compare with current databases
plate contamination
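
Identifying low-quality regions amounts to scanning the consensus quality values for runs below a threshold. The sketch below is an illustration under assumed names and an assumed threshold of 30; it is not how the finishing software itself is implemented.

```python
def low_quality_regions(quals, threshold=30):
    """Return half-open (start, end) intervals where quality < threshold."""
    regions = []
    start = None
    for i, q in enumerate(quals):
        if q < threshold:
            if start is None:
                start = i  # a low-quality run begins here
        elif start is not None:
            regions.append((start, i))  # the run ended at the previous base
            start = None
    if start is not None:
        regions.append((start, len(quals)))  # run extends to the last base
    return regions


if __name__ == "__main__":
    print(low_quality_regions([40, 40, 10, 15, 40, 5]))
```

Each reported interval is a candidate for a re-run or a long-run read.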

28
Gap closure

Read pairs
PCR reactions (long-range / combinatorial)
Small-insert libraries
Transposon-insertion libraries

29
Gap closure - contig ordering

Read pair consistency
STS mapping
Physical mapping
Genetic mapping
Optical mapping
Large-insert clone
skims
end-sequencing
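
Read-pair consistency can be used for contig ordering by counting how many clones have their forward and reverse reads in two different contigs. The sketch below assumes a simple mapping from read name to contig name; the data layout, the read-naming scheme and the function name are hypothetical.

```python
from collections import Counter


def contig_links(read_to_contig, pairs):
    """Count read pairs whose two ends were assembled into different contigs.

    read_to_contig: dict mapping read name -> contig name.
    pairs: iterable of (forward_read, reverse_read) names from one clone.
    Pairs with an unassembled end, or both ends in one contig, are skipped.
    """
    links = Counter()
    for fwd, rev in pairs:
        a = read_to_contig.get(fwd)
        b = read_to_contig.get(rev)
        if a is None or b is None or a == b:
            continue
        links[tuple(sorted((a, b)))] += 1
    return links


if __name__ == "__main__":
    placement = {
        "c1.f": "contig1", "c1.r": "contig2",
        "c2.f": "contig1", "c2.r": "contig2",
        "c3.f": "contig1", "c3.r": "contig1",
    }
    clones = [("c1.f", "c1.r"), ("c2.f", "c2.r"), ("c3.f", "c3.r")]
    print(dict(contig_links(placement, clones)))
```

Contig pairs supported by several clones are good candidates for adjacency and for a gap-spanning PCR.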

30
Consed

Consed is a program for viewing and editing assemblies produced by Phrap.
a. Assembly viewer: allows visualization of contigs, the assembly (aligned reads), quality values of reads and the final sequence.
b. Trace file viewer: single and multiple trace files can be visualized, allowing comparison of a given sequence in several reads.
c. Navigation: identifies and lists regions that are below a given quality threshold, contain high-quality discrepancies, have single-strand coverage, etc.
d. Autofinish: an automatic set of functions for gap closure, improvement of sequence quality, determination of the relative orientation of contigs, and identification of regions covered by a single read or by reads of a single strand. The program automatically performs primer picking and chooses the templates.

31
Consed in Action
32
AutoFinish

Autofinish is part of the Consed software package
Automatically chooses finishing reads to finish a project according to user-supplied parameters
Software that helps determine how contigs are ordered and oriented
Close gaps
Improve error rate
Cover every base with reads from at least 2 different subclones!!!
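
The last goal, every base covered by reads from at least two different subclones, can be checked with a simple per-position tally. This is a sketch under assumed inputs (read placements given as subclone name plus half-open contig coordinates); it is not Autofinish's own logic and the function name is hypothetical.

```python
def bases_needing_second_subclone(contig_length, reads):
    """List contig positions covered by fewer than 2 distinct subclones.

    reads: iterable of (subclone_id, start, end) with half-open
    coordinates on the contig. Two reads from the same subclone
    count only once.
    """
    subclones = [set() for _ in range(contig_length)]
    for subclone, start, end in reads:
        for pos in range(max(start, 0), min(end, contig_length)):
            subclones[pos].add(subclone)
    return [pos for pos, seen in enumerate(subclones) if len(seen) < 2]


if __name__ == "__main__":
    placements = [("a", 0, 6), ("a", 2, 8), ("b", 4, 10)]
    print(bases_needing_second_subclone(10, placements))
```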

34
Annotation

DNA features (repeats/similarities)
Gene finding
Peptide features
Initial role assignment
Others- regulatory regions

35
Annotation of eukaryotic genomes
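[Figure: from genomic DNA to function]
Genomic DNA -> transcription -> unprocessed RNA -> RNA processing -> mature mRNA (labelled Gm3 and AAAAAAA) -> translation -> nascent polypeptide -> folding -> active enzyme -> function (reactant A to product B)
Annotation steps shown alongside: ab initio gene prediction, comparative gene prediction, functional identification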
36
Genome analysis overview C.elegans
37
DNA features

Similarity features
mapping repeats
simple tandem and inverted
repeat families
mapping DNA similarities
EST/mRNAs in eukaryotes
Duplications
RNAs
mapping peptide similarities
protein similarities

38
Gene finding

ORF finding (simple but messy)
ab initio prediction
Measures of codon bias
Simple statistical frequencies
Comparative prediction
Using similarity data
Using cross-species similarities
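
ORF finding, the simplest of the approaches listed, can be sketched as a six-frame scan for ATG-to-stop stretches. The code below is a minimal illustration with assumed names and an assumed minimum length; it reports every qualifying stretch, which is one reason the approach is "simple but messy" on real genomic sequence.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Return a list of (strand, frame, orf) for ATG...stop stretches.

    min_codons counts the codons before the stop codon. Frames are
    0, 1 or 2 relative to the start of the scanned strand.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, s[start:i + 3]))
                    start = None
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTTAGCC"))
```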

39
Peptide features

Peptide features
low-complexity regions
trans-membrane regions
structural information (coiled-coil)
Similarities and alignments
Protein families (InterPro/COGS)

40
Initial role assignment

Simple attempt to describe the functional
identity of a peptide
Uses data from
peptide similarities
protein families
Vital for data mining
A large number of predicted genes remain hypothetical or unknown

41
Other regulatory features

Ribosomal binding sites
Promoter regions

42
Data Release

DNA release
Unfinished
Finished
Nucleotide databases
GENBANK/EMBL/DDBJ
Peptide databases
SWISSPROT/TREMBL/GENPEPT
Others

43
What do you get?
DATA!!, DATA!!, and more DATA!!

Sequence
incomplete vs. complete
First-pass annotation
Gene discovery
Full annotation
A starting point for research

44
Genome annotation is central to functional
genomics
45
Sequencing my genome
Politics
Production
Finishing
Annotation
TIME
MONEY
46
Extra Slides
47
What is Phred/Phrap/Consed?