Title: DNA Sequencing and the Human Genome Project
2DNA Sequencing and the Human Genome Project
- History
- Technology
- Analysis
3Technology
4Aspects of Sequencing Genomes
- Sequencing method
- Cloned DNA
- Clone/Sequence Assembly
5Sequencing Methods
- Sanger chain termination method: >90% of all sequencing
- Relies on ability of DNA polymerase to incorporate nucleotide analogs while synthesizing template-driven DNA
6Dideoxynucleotide-based Sequencing
7Automating Sanger Sequencing
8Other Automated Methods
- Hybridization method
- Hybridize to oligos on a chip
- Affymetrix can do 30K resequence
- Limited by number of features and hybridization specificity
- Single molecule methods
- Pore-based - threads DNA through molecular pore in membrane - bases determined by changes in conductance
- Mass spec - best for small molecules now like SNPs
9Most-used hardware
- ABI 377 - gel based - 96 lanes a pop - read length 500bp - run time 4-16h - >40,000 bases/run x 3 runs/day = 120,000 bases/day
- ABI 3700 - capillary based - 48 capillaries - read length 500bp - run time 40 minutes - >950,000
10Whaddaya determine the sequence of?
- Major problem is all methods only get 500bp/read
- Shotgun method can help
11Basic Shotgun Strategy
12Power of Shotgun
- Needs no prior knowledge of target
- Requires no maps or landmarks
- Got genome? Can start!
13Limits of Shotgun
- Requires redundant sequence
- Not only bad - gives higher accuracy
- Requires representative library
- No missing clones
- Difficulty of locating overlaps
- Harder with larger genomes
- Vertebrate genomes have repetitive DNA
- Especially human: 50% repeats
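The redundancy requirement can be illustrated with the usual Poisson approximation for random shotgun reads. This is a minimal sketch under an assumption that is not on the slide: reads land uniformly at random, so a base sequenced at average redundancy c is missed entirely with probability e^(-c). The function name is illustrative.

```python
import math

def expected_uncovered_fraction(redundancy):
    """Poisson approximation: chance that a given base is hit by no read
    when reads fall uniformly at random at the given average redundancy."""
    return math.exp(-redundancy)

for c in (1, 5, 10):
    frac = expected_uncovered_fraction(c)
    print(f"{c}X redundancy: about {frac:.5f} of bases left unsequenced")
```

Under this model, low redundancy leaves a large share of the target unsequenced, which is one way to see why redundant sequence is required. Cloning bias and repeats make real libraries behave worse than the model.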
14Computers Help
- When first begun
- PDP-11 (DEC) was state of art
- No PCs existed
- Data input punch cards
- Data output unformatted text!
- Conclusion: You guys are fortunate!
15Locating Overlap - considerations
- Need rules
- How much overlap to call it real?
- How much mismatch in overlap?
- How to determine error?
- How to automate?
- Do you let the CPU do it?
- Do you Edit? If so how?
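The first two questions (how much overlap, how much mismatch) can be written down as explicit rules. The sketch below is one possible rule set, not the method used by any particular assembler: `min_len` and `max_mismatch_frac` are hypothetical parameters, and only ungapped end overlaps are considered.

```python
def end_overlap(left, right, min_len=4, max_mismatch_frac=0.1):
    """Return (length, mismatches) for the longest suffix of `left` that
    matches a prefix of `right` within the mismatch tolerance, else None."""
    longest = min(len(left), len(right))
    for length in range(longest, min_len - 1, -1):
        suffix, prefix = left[-length:], right[:length]
        mismatches = sum(a != b for a, b in zip(suffix, prefix))
        if mismatches <= max_mismatch_frac * length:
            return length, mismatches
    return None

# First two reads of the sample contig slide
print(end_overlap("GGCTCTTAGGAGATT", "GATTTAGTTATGTTATTGTGCAACTATC"))
```

With these toy settings the two reads join on a 4-base overlap (GATT). An overlap that short would occur by chance many times in a real genome, so a practical threshold would have to be much higher.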
16The CONTIG
- CONTIG: CONTIGuous sequence from overlapping consensus
- Phrase coined by Roger Staden (MRC)
- First level of assembly in shotgun
- Definition of consensus carefully controlled
17Sample Contig
GGCTCTTAGGAGATT
           GATTTAGTTATGTTATTGTGCAACTATC
Overlap?
                    ATGTTATTCTGCAACCATCGCTGCGGACGAATAGCTGT
                          TTGTGCAACAATCGCTGCGGACGA
111111111113452244556622233333111111122333323331113333333
GGCTCTTAGGAGATTTAGTTATGTTATTGTGCAACNATCGCTGCGGACGAATAGCTGT
What constitutes consensus?
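One simple answer to the question is a per-column majority vote, with N wherever the reads tie. The sketch below applies that rule to the four reads on this slide. The offsets are an assumption, chosen as the placements at which the reads line up with the consensus shown, and the voting rule is only one possible definition of consensus.

```python
from collections import Counter

# (offset, read): offsets are assumed placements, not given on the slide
READS = [
    (0, "GGCTCTTAGGAGATT"),
    (11, "GATTTAGTTATGTTATTGTGCAACTATC"),
    (20, "ATGTTATTCTGCAACCATCGCTGCGGACGAATAGCTGT"),
    (26, "TTGTGCAACAATCGCTGCGGACGA"),
]

def majority_consensus(placed_reads):
    """Call each column by simple majority; ties and empty columns become N."""
    length = max(offset + len(seq) for offset, seq in placed_reads)
    columns = [Counter() for _ in range(length)]
    for offset, seq in placed_reads:
        for i, base in enumerate(seq):
            columns[offset + i][base] += 1
    calls = []
    for column in columns:
        ranked = column.most_common(2)
        if not ranked:
            calls.append("N")
        elif len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            calls.append("N")  # no base has a clear majority
        else:
            calls.append(ranked[0][0])
    return "".join(calls)

print(majority_consensus(READS))
```

With these offsets the vote reproduces the slide's consensus, including the single N where the three overlapping reads show T, C and A. Where two reads say G and one says C, the majority G is called.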
18PHRED, PHRAP, CONSED, FINISHER
- Phil's Read Editor (PHRED)
- Reads tracings from ABI
- Calls bases using best available graphical analyzer
- Makes quality assessment based on signal strength, background, overlapping bands etc.
- This gives a quantitative basis for establishing a contig
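The quality assessment is usually reported as a Phred quality score, Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The slide does not state this formula; it is the standard convention and is added here for illustration, with hypothetical function names.

```python
import math

def phred_quality(error_probability):
    """Quality score for a given estimated base-call error probability."""
    return -10 * math.log10(error_probability)

def error_probability(quality):
    """Inverse mapping: estimated error probability for a quality score."""
    return 10 ** (-quality / 10)

for q in (10, 20, 30, 40):
    print(f"Q{q}: about 1 error in {round(1 / error_probability(q))} calls")
```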
19Phrap
- Phil's Revised Assembly Program
- Takes PHRED output as input
- Compares all PHRED reads to all other reads and contigs
- Makes tentative contigs with biases
- End overlaps better than middle
- Overlap must reach threshold score
- Score is identity plus PHRED quality factor
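To illustrate how identity and base quality can be combined into one overlap score, here is a deliberately simplified sketch. It is not Phrap's actual scoring scheme: each aligned column is weighted by the lower of the two quality values, so a mismatch between two confident calls costs far more than a mismatch involving a poor call. The function name and weighting are assumptions.

```python
def overlap_score(seq_a, qual_a, seq_b, qual_b):
    """Score two aligned, equal-length overlap segments.
    Matches add, and mismatches subtract, the lower of the two qualities."""
    if not (len(seq_a) == len(qual_a) == len(seq_b) == len(qual_b)):
        raise ValueError("sequences and quality lists must have equal length")
    score = 0
    for a, qa, b, qb in zip(seq_a, qual_a, seq_b, qual_b):
        weight = min(qa, qb)
        score += weight if a == b else -weight
    return score

print(overlap_score("GATT", [30, 30, 30, 30], "GATT", [30, 30, 30, 30]))  # perfect
print(overlap_score("GATT", [30, 30, 30, 5], "GATA", [30, 30, 30, 40]))   # weak mismatch
print(overlap_score("GATT", [30, 30, 30, 30], "GATA", [30, 30, 30, 30]))  # strong mismatch
```

An overlap would then be accepted only if its score reaches a chosen threshold, which is the role of the threshold score in the list above.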
20Consed
- CONtig Sequence Editor
- Permits finisher to edit overlaps
- Permits/confirms contig joins
- Permits (but discourages) sequence editing
- Allows identification of repeats
- Uses RepeatMasker output
21Finisher
- Part of Consed
- Makes suggestions for closure
- Tells which clones to extend or reverse sequence
- Derives PCR primers for gap filling
- Estimated that finishing takes over twice as long
as sequencing
22Workflow
BAC or Small Genomic DNA
Fragment - sonication preferred
Clone library - 5-10X representative
Sequence - Enough runs
Data to PHRED -> PHRAP -> Consed -> Finisher
Decide to fill gaps or done
Post to NCBI
23Strategies
- Divide and conquer
- Create physical map
- Create smaller and smaller subclones of mapped pieces
- Carry out shotgun sequencing of smallest pieces
- Whole genome
- Generate sequence-able clones
- Determine sequences at random using shotgun
- Use sequence overlap to reassemble into consensus
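The reassembly step can be sketched as a greedy merge: repeatedly join the two pieces with the longest end overlap until nothing overlaps any more. This is a toy illustration only. It uses exact overlaps, ignores reverse complements, sequencing errors and reads contained inside other reads, and the reads and `min_len` value are made up.

```python
def overlap_len(a, b, min_len):
    """Length of the longest exact suffix of a that is a prefix of b,
    or 0 if it is shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]

# Hypothetical reads drawn from the toy target GATTACAGGCTTCAGTCCA
print(greedy_assemble(["CTTCAGTCCA", "GATTACAGG", "ACAGGCTTCA"]))
```

The all-against-all comparison in the inner loops grows with the square of the number of reads, which is one reason whole-genome shotgun is described as computationally intensive.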
24Philosophical issues regarding which to use
Wet-bench intensive
Fully map then sequence
Partial sequence ends, construct map, then
sequence
Full genome shotgun
Computationally intensive
25Limitations on shotgun sequencing genomes
- Obtaining enough clones to cover all spots
- Finding credible sequence overlaps
- Repetitive DNA in Humans
- Computational power
26More on repeats
Reads are only 500bp
GCTAGGCTAGTGGCATG
Genome is 3,000,000,000bp
Identical repeat sequences are interspersed throughout the genome, so it is impossible to
place repeat-containing reads.
Reads are only 500bp
GCTAGGCTAGTGGCATG
GCTAGGCTAGTGGCATG
GCTAGGCTAGTGGCATG
GCTAGGCTAGTGGCATG
GCTAGGCTAGTGGCATG
Clone 1 sequence
CGAGCGTGTTGTACGTGTGA
GCTAGGCTAGTGGCATG
Clone 2 sequence
GGAGTGCTGAGTGGTGCAGCTAGGCTAGTGGCATGGGAGTGCTGAGTGGTGCA
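The placement problem can be shown with the sequences on this slide. In the sketch below a read consisting only of the repeat matches both clones, while a read that runs from unique flanking sequence into the repeat matches only one. It assumes that the two lines under "Clone 1 sequence" form one continuous clone; the helper names are illustrative.

```python
CLONES = {
    # Assumption: clone 1 is the unique flank followed by the repeat
    "clone 1": "CGAGCGTGTTGTACGTGTGA" + "GCTAGGCTAGTGGCATG",
    "clone 2": "GGAGTGCTGAGTGGTGCAGCTAGGCTAGTGGCATGGGAGTGCTGAGTGGTGCA",
}
REPEAT = "GCTAGGCTAGTGGCATG"

def find_all(read, target):
    """Return every start position at which read occurs exactly in target."""
    positions = []
    start = target.find(read)
    while start != -1:
        positions.append(start)
        start = target.find(read, start + 1)
    return positions

def placements(read):
    """Map each clone name to the positions where the read could be placed."""
    return {name: find_all(read, seq) for name, seq in CLONES.items()}

print(placements(REPEAT))          # repeat-only read: fits both clones
print(placements("GTGTGAGCTAGG"))  # read spanning unique flank: one clone only
```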
27Mapping first - Shotgun sequence later
M13 or plasmid - 1 BAC (150kb) needs 6000
sequence reads or 2000-3000 clones
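The read count can be related to redundancy with simple arithmetic. Assuming the 500 bp read length quoted earlier in the deck, 6000 reads on a 150 kb BAC correspond to about 20-fold raw redundancy. The function names are illustrative.

```python
import math

def fold_redundancy(num_reads, read_len_bp, target_len_bp):
    """Average redundancy: total bases read divided by target length."""
    return num_reads * read_len_bp / target_len_bp

def reads_needed(redundancy, read_len_bp, target_len_bp):
    """Number of reads required to reach a given average redundancy."""
    return math.ceil(redundancy * target_len_bp / read_len_bp)

print(fold_redundancy(6000, 500, 150_000))  # slide's figure for one BAC
print(reads_needed(10, 500, 150_000))       # reads for 10X on the same BAC
```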
28Clones
- Large insert clones
- YACs (Yeast Artificial Chromosomes)
- Useful for mapping 1 Mb inserts
- Unstable during construction and propagation
- Not useful for sequencing
- BACs (Bacterial Artificial Chromosomes)
- 150kb insert
- Extremely stable and easy to propagate
- Gold standard for sequencing targets and chromosome-scale maps
- Cosmids
- 50kb insert
- Extremely stable and easy to propagate
- Useful for sequencing but too small for
chromosome maps
29Sequence-ready clones
- Plasmids
- 1-10kb insert capacity
- High copy number
- Easy to sequence bi-directionally
- Automated clone picking/DNA isolation possible
- Examples: pUC18, pBR322
- Single-stranded Bacteriophage
- 1-5kb insert capacity
- Grows at high copy as plasmid and is shed into medium as single-stranded DNA phage
- Easy to isolate, pick, sequence
- Easy to automate
- M13 is used almost exclusively
30Mapping
- Human Genome Maps
- BAC Fingerprint map
- Genetic Map
- Cytogenetic Map
- STS-based physical map (YACs)
- Radiation Hybrid Map
31Clone map from UCSC
32Genetic Map
- Genethon and Marshfield
- Used CEPH families to map
- Used microsatellite markers (highly polymorphic)
- Mapping on only 100 families attained 0.7cM map
- Gave 5000 well ordered PHYSICAL markers
- Can be used to order clones and contigs
33Cytogenetics
- FISH (Fluorescence in situ hybridization) -
useful to locate clones
34STS Content YAC Map
35RH Map
36RH principle
37General RH mapping on Panels
38RH Mapping Panels
- Genebridge
- Number: 93
- Retention: 32%
- Avg. size: 25 Mb
- Stanford G3
- Number: 83
- Retention: 16%
- Avg. size: 2.4 Mb
40Output from Stanford
- From: rhserver_at_paxil.stanford.edu
- Date: Tue Sep 9, 2003 9:27:26 AM America/Denver
- To: krauter_at_colorado.edu
- Subject: SHGC RHSERVER
- This email message has been sent automatically by the Stanford Human Genome Center RHserver in response to your submission.
- If you have questions or comments please submit them to webmaster_at_shgc.stanford.edu and include the message ID krauter_at_colorado.edu1063121245.93
- Duplicate markers are indicated with a (D) after the marker name, a LOD score is now given for duplicates.
- Reference Number   Stanford RH Panel: G3   Lowest LOD Reported: 4   Chromosome Value: 0
- Results for HUM_GEN
- ----------------------------------------
- Submitted Vector: 11000000100000001010001000001000000110011001000001100100010000011101000110000000110
- SHGCNAME CHROM LOD_SCORE DIST. (cRs)
- 1 SHGC-57080 22 19.18 4
- Vector: 11000000000000001010001000001000000110011001000001100100010000011101000110000000110
- 2 SHGC-7822 22 18.63 4
- Vector: 1100000000000000R010001000001000000110011001000001100100010000011101000110000000110
- 3 SHGC-58507 22 17.15 7
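Each vector in this output has one character per hybrid in the panel (83 for G3), and markers that lie close together on a chromosome tend to be retained in the same hybrids. The sketch below compares the submitted vector with the two marker vectors from the email by counting agreements. Treating R as an unscored hybrid and skipping it is an assumption; the real server computes LOD scores rather than simple counts.

```python
SUBMITTED = ("1100000010000000101000100000100000"
             "0110011001000001100100010000011101000"
             "110000000110")
SHGC_57080 = ("1100000000000000101000100000100000011001100"
              "1000001100100010000011101000110000000110")
SHGC_7822 = ("1100000000000000R01000100000100000011001100"
             "1000001100100010000011101000110000000110")

def compare_vectors(a, b):
    """Return (agreements, disagreements) over hybrids scored 0 or 1 in both.
    Any other symbol (e.g. R) is skipped as unscored, which is an assumption."""
    if len(a) != len(b):
        raise ValueError("vectors must cover the same panel")
    agree = disagree = 0
    for x, y in zip(a, b):
        if x in "01" and y in "01":
            if x == y:
                agree += 1
            else:
                disagree += 1
    return agree, disagree

print(len(SUBMITTED))
print(compare_vectors(SUBMITTED, SHGC_57080))
print(compare_vectors(SUBMITTED, SHGC_7822))
```

The submitted vector differs from each marker at a single hybrid, which fits the high LOD scores and short distances reported for chromosome 22.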
41END of LINE
- At the end, one has
- Detailed marker and clone maps
- Collection of BACs covering most of genome
- Sequence minimal tiling path of BACs
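Choosing a minimal tiling path from mapped clones can be sketched as a greedy interval cover: from the point covered so far, always take the overlapping clone that reaches furthest. The clone names and map coordinates below are hypothetical, and real path selection also weighs clone quality and fingerprint confidence.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """clones: list of (name, start, end) map positions.
    Returns names of a smallest set of clones covering the region."""
    ordered = sorted(clones, key=lambda clone: clone[1])
    path, covered_to, i = [], region_start, 0
    while covered_to < region_end:
        best = None
        # Among clones starting within the covered part, keep the longest reach
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        path.append(best[0])
        covered_to = best[2]
    return path

# Hypothetical BACs with map positions in kb
BACS = [("A", 0, 150), ("B", 50, 200), ("C", 120, 270),
        ("D", 140, 300), ("E", 260, 410), ("F", 290, 450)]
print(minimal_tiling_path(BACS, 0, 450))
```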
42Assembly-line process at MIT Genome Center
Grow in 2ml cultures
Pick from plate into dishes
Bar code 384-well dishes
ABI 3700 Sequencer
Multiposition robot preps DNA
Sanger rxns done in thermal cyclers
43Technical sidebar
- Literally hundreds of millions of clones must be sequenced
- Must automate (i.e. use robots)
- Methods to pick clones with inserts, prepare DNA,
carry out sequencing reactions and load automated
sequencers must be fully automatic (no human
steps)
44Both Celera and HGC compromised
- Human Genome Consortium
- Derived STS maps
- Sequenced BAC ends and fingerprinted to make maps
- Then sequenced minimum tiling path of BACs
- Celera
- Did full random shotgun of 1-3kb and 10-20kb clones
- Used STS, EST, and BAC maps to order small contigs into larger contigs
45What's the diff?
- Did the methods produce different outcomes?
- No
- Both produced gapped sequences
- Both lacked highly repeated segments
- Both produced sequence of sufficient quality to begin detailed analyses
- Yes
- Several regions had significantly different sequence orders
- Not all genes in one were present in the other
- HGC had, on average, smaller but better contigs
- Celera had higher redundancy (i.e. accuracy) sequence
46Products
- HGC
- Reads: 181 x 10^6
- Bases: 23 x 10^9
- Av. Contig: 3 x 10^5
- No. Gaps: 1.5 x 10^5
- No. Genes: 24,500
- Celera
- Reads: 27 x 10^6
- Bases: 14 x 10^9
- Av. Contig: 3 x 10^6
- No. Gaps: 1 x 10^5
- No. Genes: 26,383