Shotgun Sequencing - PowerPoint PPT Presentation

1
Shotgun Sequencing Assembly
  • Chuong Huynh
  • NIH/NLM/NCBI
  • huynh_at_ncbi.nlm.nih.gov
  • Bangkok, Thailand
  • July 15, 2002
  • Acknowledgement: Daniel Lawson, Artur Gruber

2
Sequencing
  • Genomic DNA
  • Shearing/sonication
  • Small DNA fragments (1.0-2.0 kb)
  • Clone library (pUC18): subclone and sequence
  • DNA sequencing of random clones
  • Shotgun reads
  • Assembly
  • Contigs
  • Finishing reads: both-strand coverage, gaps filled
  • Finishing
  • Complete sequence
3
Genome Sequencing - Review
  • Strategy
  • Libraries
  • Sequencing
  • Assembly
  • Closure
  • Annotation
  • Release
4
Sequencing Software
  • Staden Package
  • Pregap4 (prepare sequence trace data for
    analysis)
  • Gap4 (assembly program)
  • http://www.mrc-lmb.cam.ac.uk/pubseq/
  • CAP3 (Sequence Assembly Program)
  • Email: huang_at_mtu.edu
  • Phred/Phrap/Consed distribution package:
    http://www.phrap.org/
  • Sequencing software utilities available at Sanger
    Centre: http://www.sanger.ac.uk/software/

5
Large-scale genome projects
  • Sequencing DNA molecules in the Mb size range
  • All strategies employ the same underlying
    principles
  • Random Shotgun sequencing

6
Growth of Nucleotide Database: EMBL/GenBank/DDBJ
7
Progress on Large Sequencing Projects
8
Strategies for sequencing
  • How big can you go??
  • Large-insert clones
  • cosmids 30-40 kb
  • BACs/PACs 50 - 100 kb
  • Whole chromosomes
  • Whole genomes

9
Genome size and sequencing strategies
(Figure: genomes placed on an axis of genome size, log Mb, with the sequencing strategies used across that range.)
  • H. sapiens (3000 Mb)
  • D. melanogaster (170 Mb)
  • C. elegans (100 Mb)
  • P. falciparum (30 Mb)
  • S. cerevisiae (14 Mb)
  • E. coli (4 Mb)
  • Strategies shown: Whole Genome Shotgun (WGS), clone-by-clone, Whole Chromosome Shotgun (WCS), WGS with clone skims
10
Strategies for sequencing
  • Size and GC composition of genome
  • Volume of data
  • Ease of cloning
  • Ease of sequencing
  • Genome complexity
  • dispersed repetitive sequence
  • telomeres and centromeres
  • Politics/Funding

11
Strategies: Whole Genome Shotgun (WGS)
  • Moderate to high complexity (tens to hundreds of thousands of reads)
  • Problems with repeats
  • Complex informatics
  • Quality of physical map
  • Fingerprint map
  • STS markers
  • End-sequences
  • Skims of mapped clones
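
The read counts quoted for a WGS project can be sanity-checked with a back-of-the-envelope estimate. The sketch below is not from the slides: the 8x coverage depth and the 500 bp average read length are illustrative assumptions, and `reads_needed` is a hypothetical helper name. Only the genome sizes come from the deck.

```python
import math


def reads_needed(genome_size_bp: int, coverage: float = 8.0,
                 read_length_bp: int = 500) -> int:
    """Estimate how many reads give the requested average coverage.

    coverage and read_length_bp are assumed values for illustration only.
    """
    return math.ceil(genome_size_bp * coverage / read_length_bp)


# Genome sizes in Mb as listed in the deck.
GENOMES_MB = {"E.coli": 4, "S.cerevisiae": 14, "P.falciparum": 30}

for name, size_mb in GENOMES_MB.items():
    print(name, reads_needed(size_mb * 1_000_000))
```

Under these assumptions a 4 Mb genome needs on the order of tens of thousands of reads, and a 30 Mb genome several hundred thousand.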

12
Sequencing
  • Genomic DNA
  • Shearing/sonication
  • Small DNA fragments (1.0-2.0 kb)
  • Clone library (pUC18): subclone and sequence
  • DNA sequencing of random clones
  • Shotgun reads
  • Assembly
  • Contigs
  • Finishing reads: both-strand coverage, gaps filled
  • Finishing
  • Complete sequence
13
Sequencing
  • Library construction
  • Colony picking
  • DNA preparation
  • Sequencing reactions
  • Electrophoresis
  • Tracking/Base calling

14
Libraries
  • Essentially Sub-cloning
  • Generation of small insert libraries in a well
    characterised vector.
  • Ease of propagation
  • Ease of DNA purification
  • e.g. pUC18, M13
  • Test the libraries

15
Sequence generation
  • Pick colonies
  • Template preparation
  • Sequence reactions
  • Standard terminator chemistry
  • pUC libraries sequenced with forward and reverse
    primers

16
Sequence generation
  • Electrophoresis of products
  • Old style - slab gels, 32, then 64, then 96 lanes
  • New style - capillary gels, 96 lanes
  • Transfer of gel image to UNIX
  • Sequencing machines use a slave Mac/PC
  • Move data to centralised storage area for
    processing

17
Gel image processing
  • Light-to-Dye estimation
  • Lane tracking
  • Lane editing
  • Trace extraction
  • Trace standardisation
  • Mobility correction
  • Background subtraction

18
Pre-processing
  • Base calling using Phred
  • modifies SCF file
  • Quality clipping
  • Vector clipping
  • Sequencing vector
  • Cloning vector
  • Screen for contaminants
  • Feature mark up (repeats/transposons)
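
Quality clipping can be sketched as finding the stretch of a read whose bases are, on balance, above a quality cutoff. The example below uses a maximum-sum segment over (quality minus cutoff). It is one common way to do this, offered as an illustration rather than a description of what any particular package does; the function name and the cutoff of 20 are assumptions.

```python
def quality_clip(quals, cutoff=20):
    """Return (start, end) of the best-quality segment, end exclusive.

    Each base scores (quality - cutoff); the segment with the largest
    total score is kept. Returns (0, 0) if no base beats the cutoff.
    The cutoff value is an assumption for illustration.
    """
    best_sum, best_range = 0, (0, 0)
    run_sum, run_start = 0, 0
    for i, q in enumerate(quals):
        run_sum += q - cutoff
        if run_sum <= 0:
            # The running segment is no longer worth extending; restart.
            run_sum, run_start = 0, i + 1
        elif run_sum > best_sum:
            best_sum, best_range = run_sum, (run_start, i + 1)
    return best_range


quals = [5, 8, 30, 35, 40, 32, 9, 4]
start, end = quality_clip(quals)
print(start, end, quals[start:end])
```

The low-quality bases at both ends of the example read fall outside the returned range, which is the behaviour wanted before vector clipping and contaminant screening.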

19
Phred
  • Phred is a program that performs several tasks:
  • a. Reads trace files: compatible with most file
    formats, including SCF (standard chromatogram format), ABI
    (373/377/3700), ESD (MegaBACE) and LI-COR.
  • b. Calls bases: attributes a base to each
    identified peak, with a lower error rate than the
    standard base-calling programs.
  • c. Assigns quality values to the bases: a Phred
    value based on an error rate estimate
    calculated for each individual base.
  • d. Creates output files: base calls and quality
    values are written to output files.
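
The relationship between a Phred quality value and the estimated error probability is Q = -10 log10(P). A minimal sketch of the conversion in both directions follows; the function names are chosen here for illustration and are not part of Phred itself.

```python
import math


def error_probability(q: float) -> float:
    """Estimated probability that a base with quality q is wrong."""
    return 10 ** (-q / 10)


def phred_quality(p_error: float) -> float:
    """Quality value corresponding to an estimated error probability."""
    return -10 * math.log10(p_error)


for q in (10, 20, 30, 40):
    print(q, error_probability(q))
```

A quality of 20 therefore corresponds to an estimated 1 error in 100 bases, and a quality of 40 to 1 in 10,000.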

20
Trace File: High-quality region, no ambiguities (Ns)
  • No ambiguities
  • No noise
  • Peaks very well spaced
21
Trace File: Medium-quality region, some ambiguities (Ns)
  • Some ambiguities (Ns)
  • No noise
  • Peaks very well spaced
  • Some homopolymeric stretches are not well
    resolved

22
Trace File: Poor-quality region, low confidence
  • Overlapping peaks; peaks not evenly spaced
  • Low resolution; low confidence in base assignment

23
Finishing
  • Assembly: the process of turning raw single-pass
    reads into contiguous consensus sequence
  • Closure: the process of ordering and merging
    consensus sequences into a single contiguous
    sequence
  • Finished is defined as sequenced on both strands
    using multiple clones. In the absence of multiple
    clones, the clone must be sequenced with multiple
    chemistries. The overall error rate is estimated
    at less than 1 error per 10 kb.
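
An error rate of 1 per 10 kb is an error probability of 10^-4 per base, which corresponds to an average Phred quality of 40. The sketch below estimates the expected number of errors in a consensus from its per-base quality values and checks it against that target. The function names are hypothetical, and treating per-base quality values as independent error estimates is a simplifying assumption.

```python
def expected_errors(quals):
    """Sum of per-base error probabilities implied by Phred qualities."""
    return sum(10 ** (-q / 10) for q in quals)


def meets_error_target(quals, max_errors_per_10kb=1.0):
    """True if the estimated error rate is below the target per 10 kb."""
    if not quals:
        return False
    rate_per_10kb = expected_errors(quals) / len(quals) * 10_000
    return rate_per_10kb < max_errors_per_10kb


print(meets_error_target([50] * 1000))  # comfortably better than target
print(meets_error_target([30] * 1000))  # roughly 10 errors per 10 kb
```

This covers only the error-rate part of the definition; coverage on both strands from multiple clones has to be checked separately.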

24
Genome Assembly
  • Pre-assembly
  • Assembly
  • Automated appraisal
  • Manual review

25
Assembly
  • Run Cross_match to remove or screen out vector
    sequences before running Phrap
  • Assemble using Phrap
  • Read FASTA quality scores from CAF file
  • Merge existing Phrap .ace file as necessary
  • Adjust clipping
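
The idea of screening out vector before assembly can be illustrated with a toy example that masks vector bases with X so that they cannot contribute to overlaps. This is a deliberately simplified sketch: it uses exact string matching, whereas a real screening tool works from sequence alignments, and the function name and sequences are made up for illustration.

```python
def mask_vector(read: str, vector_fragment: str) -> str:
    """Replace exact occurrences of vector_fragment in read with X.

    Exact matching is an assumption made to keep the example short;
    real vector screening tolerates mismatches and partial matches.
    """
    if not vector_fragment:
        return read
    return read.replace(vector_fragment, "X" * len(vector_fragment))


read = "GGATCCACGTACGT"    # hypothetical read starting with vector
vector_end = "GGATCC"      # hypothetical vector fragment
print(mask_vector(read, vector_end))
```

Masking rather than deleting keeps the read length and coordinates unchanged, so quality values still line up with the bases.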

26
Phrap: Phragment Assembly Program
  • Phrap is a program for assembling shotgun DNA
    sequence data
  • Key features:
  • a. Uses the entire read content: no need for
    trimming.
  • b. Uses user-supplied (i.e. Repbase) and internally
    computed data: better accuracy of assembly in
    the presence of repeats.
  • c. Contig sequence is a mosaic of
    the highest-quality parts of the reads: it's not
    a consensus!
  • d. Provides extensive information about the assembly,
    contained in the phrap.out, .ace and
    .screen.contigs.qual files
  • e. Handles very large datasets: hundreds of
    thousands of reads are easily manipulated.
  • f. Generates output files that contain important
    data and enable visualization by other programs
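
The difference between a consensus and a mosaic of the highest-quality read parts can be shown on a toy alignment. The sketch below is a per-column simplification written for illustration; it is not Phrap's actual algorithm, and the data structure (a list of aligned columns, each holding (base, quality) pairs) is an assumption.

```python
from collections import Counter


def majority_consensus(columns):
    """Pick the most frequent base in each aligned column."""
    return "".join(
        Counter(base for base, _ in col).most_common(1)[0][0]
        for col in columns
    )


def highest_quality_mosaic(columns):
    """Pick, in each column, the base from the highest-quality read."""
    return "".join(max(col, key=lambda bq: bq[1])[0] for col in columns)


# Each column lists (base, quality) for the reads covering that position.
columns = [
    [("A", 40), ("A", 35), ("A", 30)],
    [("C", 10), ("C", 12), ("G", 45)],  # two weak reads disagree with one strong read
    [("T", 38), ("T", 20)],
]
print(majority_consensus(columns))
print(highest_quality_mosaic(columns))
```

In the second column the majority vote follows the two low-quality reads, while the quality-based choice follows the single high-quality read.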

27
Assembly appraisal
  • auto-edit
  • removes 70% of read discrepancies
  • Remove cloning vector (VecScreen, CrossMatch)
  • Mark up sequence features
  • finish
  • Identify low-quality regions
  • Cover using re-runs and long-runs
  • Compare with current databases
  • plate contamination

28
Gap closure
  • Read pairs
  • PCR reactions (long-range / combinatorial)
  • Small-insert libraries
  • Transposon-insertion libraries

29
Gap closure - contig ordering
  • Read pair consistency
  • STS mapping
  • Physical mapping
  • Genetic mapping
  • Optical mapping
  • Large-insert clone
  • skims
  • end-sequencing
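
Read pair consistency can be sketched as counting how many clones have their forward read in one contig and their reverse read in another: contigs joined by several such pairs are likely neighbours. The read names, contig names and function name below are hypothetical, and the example ignores orientation and insert-size checks.

```python
from collections import Counter


def contig_links(read_to_contig, read_pairs):
    """Count read pairs whose two ends fall in different contigs."""
    links = Counter()
    for fwd, rev in read_pairs:
        a = read_to_contig.get(fwd)
        b = read_to_contig.get(rev)
        if a is None or b is None or a == b:
            continue  # unplaced read, or both ends in the same contig
        links[tuple(sorted((a, b)))] += 1
    return links


placements = {
    "c1.f": "contigA", "c1.r": "contigB",
    "c2.f": "contigA", "c2.r": "contigB",
    "c3.f": "contigB", "c3.r": "contigB",
    "c4.f": "contigC",
}
pairs = [("c1.f", "c1.r"), ("c2.f", "c2.r"), ("c3.f", "c3.r"), ("c4.f", "c4.r")]
print(contig_links(placements, pairs))
```

Here two independent clones bridge contigA and contigB, which suggests the gap between them could be closed by sequencing across those clones.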

30
Consed
  • Consed is a program for viewing and editing
    assemblies produced by Phrap
  • a. Assembly viewer: allows visualization of
    contigs, assembly (aligned reads), quality values
    of reads and final sequence.
  • b. Trace file viewer: single and multiple trace
    files can be visualized, allowing comparison
    of a given sequence in several reads.
  • c. Navigation: identifies and lists regions which
    are below a given quality threshold, contain high
    quality discrepancies, have single-strand coverage,
    etc.
  • d. Autofinish: automatic set of functions for
    gap closure, improvement of sequence quality,
    determination of relative orientation of contigs,
    identification of regions covered by a single
    read or by reads of a single strand. The program
    automatically performs primer picking and chooses
    the templates.
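
Listing regions below a quality threshold amounts to scanning the consensus quality values for runs of low values. The sketch below illustrates that idea only; it is not Consed's implementation, and the function name and the threshold of 25 are assumptions.

```python
def low_quality_regions(quals, threshold=25):
    """Return (start, end) runs, end exclusive, where quality < threshold."""
    regions, start = [], None
    for i, q in enumerate(quals):
        if q < threshold:
            if start is None:
                start = i          # a low-quality run begins here
        elif start is not None:
            regions.append((start, i))
            start = None
    if start is not None:          # run extends to the end of the sequence
        regions.append((start, len(quals)))
    return regions


print(low_quality_regions([40, 40, 10, 12, 40, 5]))
```

Each returned interval is a candidate for a finishing read.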

31
Consed in Action
32
AutoFinish
  • Autofinish is part of the Consed software package
  • Automatically chooses finishing reads in order to
    finish a project as defined by user-supplied
    parameters
  • Software that helps determine how contigs are
    ordered and oriented
  • Close gaps
  • Improve error rate
  • Cover every base by reads from at least 2
    different subclones!!!

33
(No Transcript)
34
Annotation
  • DNA features (repeats/similarities)
  • Gene finding
  • Peptide features
  • Initial role assignment
  • Others: regulatory regions

35
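Annotation of eukaryotic genomes
(Figure: the path from genomic DNA to function, labelled with the prediction methods that apply along it.)
  • Stages: genomic DNA, transcription, unprocessed RNA, RNA processing, mature mRNA (Gm3 ... AAAAAAA), translation, nascent polypeptide, folding, active enzyme, function (Reactant A to Product B)
  • Methods: ab initio gene prediction, comparative gene prediction, functional identification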
36
Genome analysis overview: C. elegans
37
DNA features
  • Similarity features
  • mapping repeats
  • simple tandem and inverted
  • repeat families
  • mapping DNA similarities
  • EST/mRNAs in eukaryotes
  • Duplications
  • RNAs
  • mapping peptide similarities
  • protein similarities

38
Gene finding
  • ORF finding (simple but messy)
  • ab initio prediction
  • Measures of codon bias
  • Simple statistical frequencies
  • Comparative prediction
  • Using similarity data
  • Using cross-species similarities
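
ORF finding is simple because it only needs a scan for a start codon followed in-frame by a stop codon, and messy because every such stretch is reported whether or not it is a real gene. A minimal sketch over all six reading frames follows; the standard stop codons and ATG start are used, while the function names and the minimum length are assumptions for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq: str, min_codons: int = 3):
    """Return (strand, start, end, orf) for ATG...stop in all six frames.

    Coordinates are 0-based, end exclusive, and refer to the strand that
    was scanned ('-' coordinates are on the reverse complement).
    min_codons counts the stop codon and is an illustrative default.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


print(find_orfs("CCATGAAATTTTAGCC"))
```

With a small `min_codons`, real genomic sequence yields large numbers of short spurious ORFs, which is why ab initio and comparative prediction are layered on top.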

39
Peptide features
  • Peptide features
  • low-complexity regions
  • trans-membrane regions
  • structural information (coiled-coil)
  • Similarities and alignments
  • Protein families (InterPro/COGS)

40
Initial role assignment
  • Simple attempt to describe the functional
    identity of a peptide
  • Uses data from
  • peptide similarities
  • protein families
  • Vital for data mining
  • Large number of predicted genes remain
    hypothetical or unknown

41
Other regulatory features
  • Ribosomal binding sites
  • Promoter regions

42
Data Release
  • DNA release
  • Unfinished
  • Finished
  • Nucleotide databases
  • GENBANK/EMBL/DDBJ
  • Peptide databases
  • SWISSPROT/TREMBL/GENPEPT
  • Others

43
What do you get?
DATA!!, DATA !!, and more DATA!!
  • Sequence
  • incomplete vs. complete
  • First-pass annotation
  • Gene discovery
  • Full annotation
  • A starting point for research

44
Genome annotation is central to functional
genomics
45
Sequencing my genome
Politics
Production
Finishing
Annotation
TIME
MONEY
46
Extra Slides
47
What is Phred/Phrap/Consed?
  • Phred/Phrap/Consed is a package distributed
    worldwide for:
  • a. Trace file (chromatogram) reading
  • b. Quality (confidence) assignment to each
    individual base
  • c. Vector and repeat sequence identification
    and masking
  • d. Sequence assembly
  • e. Assembly visualization and editing
  • f. Automatic finishing.