Title: Lecture 1 Introduction to high throughput sequencing
2Lecture 1: Introduction to high throughput sequencing
- Michael Brudno
- CSC 2431
- January 13, 2010
Adapted from presentations by Francis Ouellette (OICR), Michael Stromberg (BC) and Asim Siddiqui (ABI)
3DNA sequencing
- How we obtain the sequence of nucleotides of a
species
ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGAC
TACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG
ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT
4DNA Sequencing
- Goal
- Find the complete sequence of A, C, G, Ts in DNA
- Challenge
- There is no machine that takes long DNA as an input and gives the complete sequence as output
- Can only sequence 500 letters at a time
5Generations of Sequences
- Sanger-style: Classic
- 454: First Next-gen
- Illumina, ABI SOLiD: Next-gen
- Helicos: 2.5 Gen
- PacBio: Next-next-gen, 3rd gen
6Why are we sequencing?
- Before Next-generation
- DNA, RNA, (proteins), (populations), sampling, averages, consensus
- Problems: sampling, averages, consensus.
- After Next-generation
- Genome sequence and structure
- Less cloning/PCR
- Single molecules (for some)
7Sanger (old-gen) Sequencing vs. Now-Gen Sequencing
Whole Genome
- Sanger: Human (early drafts), model organisms, bacteria, viruses and mitochondria (chloroplast), low coverage
- Now-Gen: New human (!), individual genome, 1,000 normal, 25,000 cancer matched control pairs, rare samples
RNA
- Sanger: cDNA clones, ESTs, Full Length Insert cDNAs, other RNAs
- Now-Gen: RNA-Seq, digitization of transcriptome, alternative splicing events, miRNA
Communities
- Environmental sampling, 16S RNA populations, ocean sampling, Human microbiome, deep environmental sequencing, Bar-Seq
Other
- Epigenome, rearrangements, ChIP-Seq
8Differences between the various platforms
- Nanotechnology used.
- Resolution of the image analysis.
- Chemistry and enzymology.
- Signal to noise detection in the software
- Software/images/file size/pipeline
- Cost
9Adapted from Richard Wilson, School of Medicine, Washington University, Sequencing the Cancer Genome, http://tinyurl.com/5f3alk
Next Generation DNA Sequencing Technologies
                             3730                     454               Illumina
Human genome                 6 GB (6000 MB)           6 GB (6000 MB)    6 GB (6000 MB)
Reqd coverage                6                        12                30
bp/read                      600                      400               2 x 75
reads/run                    96                       500,000           100,000,000
bp/run                       57,600                   0.5 GB            15 GB
runs reqd                    625,000                  144               12
runs/day                     2                        1                 0.1
Machine days/human genome    312,500 (856 years)      144               120
Cost/run                     48                       6,800             9,300
Total cost                   15,000,000               979,200           111,600
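The "runs reqd" and "machine days" rows follow from the other rows by simple arithmetic: genome size times required coverage, divided by bases per run, then divided by runs per day. A minimal Python sketch, using only figures copied from the table (the function and variable names are illustrative and not from the original slides):

```python
# Reproduce the "runs reqd" and "machine days" rows of the table.
GENOME_BP = 6_000_000_000  # 6 GB human genome, as listed in the table

# platform: (required coverage, bp per run, runs per day), copied from the table
PLATFORMS = {
    "3730": (6, 57_600, 2),
    "454": (12, 500_000_000, 1),
    "Illumina": (30, 15_000_000_000, 0.1),
}

def runs_required(coverage, bp_per_run):
    """Number of runs needed to reach the requested fold coverage."""
    return GENOME_BP * coverage / bp_per_run

def machine_days(coverage, bp_per_run, runs_per_day):
    """Days of machine time, given how many runs fit in one day."""
    return runs_required(coverage, bp_per_run) / runs_per_day

for name, (cov, bp, per_day) in PLATFORMS.items():
    print(name, round(runs_required(cov, bp)), round(machine_days(cov, bp, per_day)))
```

The cost rows are not recomputed here; only the throughput arithmetic is sketched.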
10Solexa-based Whole Genome Sequencing
Adapted from Richard Wilson, School of Medicine, Washington University, Sequencing the Cancer Genome, http://tinyurl.com/5f3alk
11Illumina (Solexa)
12Illumina (Solexa)
13Illumina (Solexa)
14From Debbie Nickerson, Department of Genome Sciences, University of Washington, http://tinyurl.com/6zbzh4
15What is a base quality?
Base Quality    P_error(obs. base)
3               50.12%
5               31.62%
10              10.00%
15              3.16%
20              1.00%
25              0.32%
30              0.10%
35              0.03%
40              0.01%
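The values in this table are consistent with the standard Phred relationship, Q = -10 log10(P_error), so that P_error = 10^(-Q/10). A short Python sketch that regenerates the table (the function name is illustrative):

```python
def phred_to_error_probability(q):
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)

# Print the error probability as a percentage for the qualities on the slide.
for q in (3, 5, 10, 15, 20, 25, 30, 35, 40):
    print(q, f"{100 * phred_to_error_probability(q):.2f}%")
```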
16Next-gen sequencers
From John McPherson, OICR
[Figure: bases per machine run (1 Mb to 100 Gb) plotted against read length (10 bp to 1,000 bp)]
- AB/SOLiDv3, Illumina/GAII short-read sequencers: 10 Gb in 50-100 bp reads, >100M reads, 4-8 days
- 454 GS FLX pyrosequencer: 100-500 Mb in 100-400 bp reads, 0.5-1M reads, 5-10 hours
- ABI capillary sequencer: 0.04-0.08 Mb in 450-800 bp reads, 96 reads, 1-3 hours
17DNA sequencing vectors
[Figure: DNA is sheared ("Shake") into DNA fragments, which are placed at a known location (restriction site) in a vector]
- Vector: circular genome (bacterium, plasmid)
18Method to sequence longer regions
genomic segment
cut many times at random (Shotgun)
Get two reads from each segment (500 bp each)
19Reconstructing the Sequence (Fragment Assembly)
reads
Cover region with 7-fold redundancy (7X)
Overlap reads and extend to reconstruct the
original genomic region
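The "overlap and extend" idea can be sketched with a greedy merge: find the pair of reads with the longest suffix-prefix overlap, merge them, and repeat. This is a toy illustration under strong assumptions (error-free reads, no repeats, a single strand); the function names and the minimum overlap of 3 are hypothetical choices, and real assemblers are considerably more involved.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two reads with the largest overlap until none overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        # All-pairs comparison: quadratic in the number of reads.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge: the remaining reads are separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Three overlapping reads taken from the first 18 letters shown on the "DNA sequencing" slide.
print(greedy_assemble(["ACGTGACTGA", "GACTGAGGAC", "GAGGACCGTG"]))
```

The nested loop over read pairs is the source of the quadratic cost that makes naive fragment assembly expensive.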
20Definition of Coverage
- Length of genomic segment: L
- Number of reads: n
- Length of each read: l
- Definition: Coverage C = n l / L
- How much coverage is enough?
- Lander-Waterman model
- Assuming uniform distribution of reads, C = 10 results in 1 gapped region per 1,000,000 nucleotides
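The "1 gap per 1,000,000 nucleotides" figure can be checked with the simplest form of the Lander-Waterman estimate. Assuming uniformly placed reads and ignoring the minimum overlap needed to join two reads, the expected number of gaps is about n e^(-C), which is (C / l) e^(-C) per nucleotide. The sketch below assumes a read length of 500, taken from the earlier "500 letters at a time" slide; the function names are illustrative.

```python
import math

def coverage(n_reads, read_len, genome_len):
    """Fold coverage C = n * l / L."""
    return n_reads * read_len / genome_len

def expected_gaps_per_base(c, read_len):
    """Simplified Lander-Waterman estimate: (C / l) * exp(-C) gaps per nucleotide."""
    return (c / read_len) * math.exp(-c)

c = coverage(n_reads=20_000, read_len=500, genome_len=1_000_000)  # C = 10
print(c, expected_gaps_per_base(c, 500) * 1_000_000)  # gaps per million nucleotides
```

With C = 10 and l = 500 this gives roughly 0.9 gaps per million nucleotides, in line with the slide.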
21Challenges with Fragment Assembly
- Sequencing errors
- 1-2% of bases are wrong
- Repeats
- Computation: O(N^2), where N = number of reads
[Figure: false overlap due to repeat]
22 History of DNA Sequencing
Adapted from Eric Green, NIH; adapted from Messing & Llaca, PNAS (1998)
[Figure: timeline of milestones plotted against sequencing efficiency (bp/person/year); efficiency axis labels: 1; 15; 150; 1,500; 15,000; 25,000; 50,000; 200,000; 50,000,000; 100,000,000,000]
- 1870: Miescher discovers DNA
- 1940: Avery proposes DNA as genetic material
- 1953: Watson & Crick, double helix structure of DNA
- 1965: Holley sequences yeast tRNA-Ala
- 1970: Wu sequences λ cohesive end DNA
- 1977: Sanger, dideoxy chain termination; Gilbert, chemical degradation
- 1980: Messing, M13 cloning
- 1986: Hood et al., partial automation
- 1990: Cycle sequencing; improved sequencing enzymes; improved fluorescent detection schemes
- 2002: (year marked on the axis, no event label)
- 2009: Next generation sequencing; improved enzymes and chemistry; new image processing
23Which representative of the species?
- Which human?
- Answer one
- Answer two: it doesn't matter
- Polymorphism rate: number of letter changes between two different members of a species
- Humans: 1/1,000 to 1/10,000
- Other organisms have much higher polymorphism rates
24Why humans are so similar
- A small population that interbred reduced the genetic variation
- Out of Africa 40,000 years ago
[Figure: Out of Africa]
25Migration of human variation
- http://info.med.yale.edu/genetics/kkidd/point.html
26Migration of human variation
- http://info.med.yale.edu/genetics/kkidd/point.html
27Migration of human variation
- http://info.med.yale.edu/genetics/kkidd/point.html
28Genetic Variations Why?
Phenotypic differences
Inherited diseases
Ancestral history
29Genetic Variations SNPs INDELs
30Structural Variations
Paul Medvedev review in prep July 2009
31SNP Discovery Goal
SNP
sequencing errors
32SNP Discovery Base Qualities
High quality
Low quality
33SNPs Bayesian Statistics
base quality
number of individuals
allele call in read
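One way the quantities named on this slide can be combined is sketched below: each read contributes an allele call and a base quality, and Bayes' rule turns the per-read likelihoods into a posterior over diploid genotypes. This is only an illustrative model, not the method used by any particular SNP caller. The flat prior, the assumption that an error is equally likely to produce any of the three other bases, and all names in the code are assumptions; a prior informed by the number of individuals sampled could be passed in through `priors`.

```python
from math import prod

def phred_to_p(q):
    """Error probability for Phred quality q."""
    return 10 ** (-q / 10)

def p_observed(base, q, allele):
    """P(observed base | true allele), assuming errors are spread evenly over the other 3 bases."""
    e = phred_to_p(q)
    return 1 - e if base == allele else e / 3

def genotype_posteriors(pileup, alleles=("A", "C"), priors=None):
    """pileup: list of (base, quality) pairs at one site. Returns {genotype: posterior}."""
    a, b = alleles
    genotypes = [(a, a), (a, b), (b, b)]
    if priors is None:
        priors = [1 / 3] * 3  # flat prior, chosen only for illustration
    likelihoods = [
        # Each read is drawn from either chromosome with probability 1/2.
        prod(0.5 * p_observed(base, q, g[0]) + 0.5 * p_observed(base, q, g[1])
             for base, q in pileup)
        for g in genotypes
    ]
    total = sum(l * p for l, p in zip(likelihoods, priors))
    return {"/".join(g): l * p / total for g, l, p in zip(genotypes, likelihoods, priors)}

# Hypothetical pileup: two reads calling A and two calling C, all at quality 30.
print(genotype_posteriors([("A", 30), ("A", 30), ("C", 30), ("C", 30)]))
```

With high-quality calls split evenly between the two alleles, the heterozygous genotype dominates; with a single low-quality mismatch, the homozygous genotype would remain the most probable, which is how base qualities help separate SNPs from sequencing errors.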
34SNP Discovery
haploid
- strain 1: AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA
- strain 2: AACGTTCGCATA AACGTTCGCATA
- strain 3: AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA
diploid
- individual 1: AACGTTAGCATA AACGTTAGCATA AACGTTCGCATA AACGTTCGCATA
- individual 2: AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA
- individual 3: AACGTTAGCATA AACGTTAGCATA
35Genotyping Consensus Generation
haploid
- strain 1 (A): AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA
- strain 2 (C): AACGTTCGCATA AACGTTCGCATA
- strain 3 (A): AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA
diploid
- individual 1 (A/C): AACGTTAGCATA AACGTTAGCATA AACGTTCGCATA AACGTTCGCATA
- individual 2 (C/C): AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA
- individual 3 (A/A): AACGTTAGCATA AACGTTAGCATA
36Visualization Consed
371000 Genomes Project
381000G Goals
- Discover genetic variations
- 1% minor allele frequencies across genome
- 0.1-0.5% MAF across gene regions
- Variant alleles
- Estimate frequencies
- Identify haplotype background
- Characterize linkage disequilibrium
391000G Pilot Projects
- Pilot 1
- Low coverage
- 180 samples
- 70 samples @ 4X
- 110 samples @ 2X
- 2.7 Tbp total
- 202 Gbp 454
- 1.8 Tbp Illumina
- 640 Gbp AB SOLiD
- Pilot 2
- Deep trios (CEU, YRI)
- 6 samples
- 1.1 Tbp total
- 87 Gbp 454
- 773 Gbp Illumina
- 270 Gbp AB SOLiD
- Pilot 3
- Exon capture
- 607 samples
- 2.2 Mbp of targets
- 8800 targets
- 10-20x coverage
40Questions about the genome
- Obtaining a genome sequence is one step towards understanding biological processes
- Questions that follow from the genome are
- What is transcribed?
- Where do proteins bind?
- What is methylated?
- In other words, how does it work?
41Central dogma
[Figure: DNA is transcribed into RNA (mRNA, tRNA, rRNA, snRNA); mRNA is translated into a polypeptide]
42Transcription
- The DNA is contained in the nucleus of the cell.
- A stretch of it unwinds there, and its message (or sequence) is copied onto a molecule of mRNA.
- The mRNA then exits from the cell nucleus.
43DNA
RNA
A T G C
T → U
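RNA uses uracil (U) where DNA uses thymine (T); the other three bases are shared. As a minimal Python sketch, writing the RNA copy of a DNA coding-strand sequence is a single substitution (the function name is illustrative, and the coding-strand convention is an assumption):

```python
def transcribe(dna):
    """Return the RNA version of a DNA coding-strand sequence (T replaced by U)."""
    return dna.upper().replace("T", "U")

print(transcribe("ACGTGACTGAGG"))  # ACGUGACUGAGG
```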
44More complexity
- The RNA message is sometimes edited.
- Exons are nucleotide segments whose codons will be expressed.
- Introns are intervening segments (genetic gibberish) that are snipped out.
- Exons are spliced together to form mRNA.
45Splicing
- frgjjthissentencehjfmkcontainsjunkelm
- thissentencecontainsjunk
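The sentence analogy can be reproduced directly: given the exon coordinates, splicing keeps the exons, in order, and drops everything else. In this Python sketch the coordinates were worked out by hand for the example string, and the function name is illustrative.

```python
def splice(pre_mrna, exons):
    """Join the exon intervals (half-open, 0-based, in order); everything else is discarded."""
    return "".join(pre_mrna[start:end] for start, end in exons)

pre_mrna = "frgjjthissentencehjfmkcontainsjunkelm"
exons = [(5, 17), (22, 34)]  # "thissentence" and "containsjunk"
print(splice(pre_mrna, exons))  # thissentencecontainsjunk
```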
46Key player RNA polymerase
- It is the enzyme that brings about transcription
by going down the line, pairing mRNA nucleotides
with their DNA counterparts.
47Promoters
- Promoters are sequences in the DNA just upstream of transcripts that define the sites of initiation.
- The role of the promoter is to attract RNA polymerase to the correct start site so transcription can be initiated.
[Figure: DNA strand labelled 5' and 3', with the promoter marked]
49Transcription key steps
DNA
- Initiation
- Elongation
- Termination
DNA
RNA
50Genes can be switched on/off
- In an adult multicellular organism, there is a wide variety of cell types, e.g. muscle, nerve and blood cells.
- The different cell types contain the same DNA, though.
- This differentiation arises because different cell types express different genes.
- Promoters are one type of gene regulator
51Transcription (recap)
- The DNA is contained in the nucleus of the cell.
- A stretch of it unwinds there, and its message (or sequence) is copied onto a molecule of mRNA.
- The mRNA then exits from the cell nucleus.
- Its destination is a molecular workbench in the
cytoplasm, a structure called a ribosome.
52The Transcriptome
- The transcriptome is the entire set of RNA transcripts in the cell, tissue or organ.
- The transcriptome is cell type specific and time dependent, i.e. it is a function of cell state
- The transcriptome can help us understand how cells differentiate and respond to changes in their environment.
53Transcriptome complexity
- Transcripts may be
- Modified
- Spliced
- Edited
- Degraded
- Transcriptome is substantially more complex than
the genome and is time variant.
54ESTs
- ESTs were the first genome wide scan for transcriptional elements
- Different library types
- Proportional
- Normalized
- Subtractive
- Can be sequenced from the 5' or 3' end
55Hello Mr Chips
- Microarray chips introduced in 90s
- Parallel way to measure many genes
- Probes placed on slides
- RNA -> cDNA, labelled with fluorescent dye and hybridized.
- Fluorescence measured
- Chips have been highly successful
- Simplified analysis
- Useful when there is no genome sequence
- Linear signal across 500 fold variation
- Standardization has aided use in medical diagnostics
- E.g. Mammaprint
56Microarray expression profiling by 2-color assay
(cDNA arrays)
- Array: PCR products, 6250 yeast ORFs
- Hybridized cDNAs: green = control, red = experiment
Schena et al., 1995
57Chips pros and cons
- Advantages
- Do not require a genome sequence
- Highly characterised, with many s/w packages available
- One Affymetrix chip FDA approved
- Disadvantages
- Measurements limited to what's on the array
- Hard to distinguish isoforms when used for expression
- Can't detect balanced translocations or inversions when used for resequencing
58mRNA-seq
- Basic work flow
- Align reads (sometimes to transcriptome first and then the genome)
- Tally transcript counts
- Align tags to spliced transcripts
- Add to transcript counts
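The two-pass counting in this work flow can be sketched as follows. Exact substring matching stands in for a real read aligner, which is a deliberate simplification, and the gene names, sequences and function name are all hypothetical toy data. Reads that fall within a single exon are counted in the first pass; the remainder are tried against spliced transcripts, where reads spanning an exon junction can match.

```python
from collections import Counter

def count_reads(reads, exons_by_gene, spliced_by_gene):
    """Return (per-gene counts, unmapped reads) using two matching passes."""
    counts = Counter()
    leftover = []
    # Pass 1: reads contained entirely within one exon of exactly one gene.
    for read in reads:
        hits = {g for g, exons in exons_by_gene.items() if any(read in e for e in exons)}
        if len(hits) == 1:
            counts[hits.pop()] += 1
        else:
            leftover.append(read)
    # Pass 2: remaining reads against spliced transcripts (catches junction reads).
    unmapped = []
    for read in leftover:
        hits = [g for g, seq in spliced_by_gene.items() if read in seq]
        if len(hits) == 1:
            counts[hits[0]] += 1
        else:
            unmapped.append(read)  # no match, or ambiguous between genes
    return counts, unmapped

exons = {"geneA": ["ACGTTAGC", "GGATCCAT"], "geneB": ["TTTTCCCC", "AAAAGGGG"]}
spliced = {g: "".join(e) for g, e in exons.items()}
reads = ["CGTTAG", "TAGCGGAT", "CCCCAAAA", "GATTACA"]
print(count_reads(reads, exons, spliced))
```

Reads that match more than one gene are set aside here rather than counted; how to treat such multi-mapping reads is a design choice that differs between pipelines.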
59Cloonan et al. 2008
- Used SOLiD to generate 10Gb of data from mouse embryonic stem cells and embryoid bodies
- Used a library of exon junctions to map across known splice events
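A junction library of this kind can be sketched as a set of short sequences, each made from the end of one exon joined to the start of a later exon, so that a read crossing the splice site has something to align to. The code below is a generic illustration and not the procedure from the paper: it enumerates every ordered exon pair within a gene (an assumption, which also covers exon skipping), and the function name and toy exons are hypothetical.

```python
from itertools import combinations

def junction_library(exons, read_len):
    """Map (i, j) with i < j to the sequence spanning the junction of exon i and exon j.

    Each side contributes at most read_len - 1 bases, so any read that crosses
    the junction with at least one base on each side fits inside the entry.
    """
    flank = read_len - 1
    return {
        (i, j): exons[i][-flank:] + exons[j][:flank]
        for i, j in combinations(range(len(exons)), 2)
    }

library = junction_library(["AAAACCCC", "GGGGTTTT", "ACACACAC"], read_len=5)
for pair, seq in library.items():
    print(pair, seq)
```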
60Distribution of tags
61Tag locations
62General issues
- Coverage across the transcript may not be random
- Some reads map to multiple locations
- Some reads don't map at all
- Reads mapping outside of known exons may represent
- New gene models
- New genes
63Size of the transcriptome
- Carter et al (2005)
- Using arrays, estimated 520,000 to 850,000 transcripts per cell.
- Use upper limit and estimate average transcript size of 2kb
- Transcriptome: about 2GB
- Transcriptome cost genome cost
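The size estimate is a single multiplication of the two figures quoted on this slide. A small Python check (the variable and function names are illustrative):

```python
def transcriptome_size_bp(transcripts_per_cell, avg_transcript_len_bp):
    """Total bases of transcript in one cell."""
    return transcripts_per_cell * avg_transcript_len_bp

# Upper limit of 850,000 transcripts per cell, average length 2 kb.
size = transcriptome_size_bp(850_000, 2_000)
print(size)  # 1700000000
```

That is 1.7 billion bases, which the slide rounds to 2GB.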
64The Boundome
- DNA binding proteins control genome function
- Histones impact chromatin structure
- Activators and repressors impact gene expression
- The location of these proteins helps us
understand how the genome works
65ChIP
66Chip-Seq
- Instead of probing against a chip, measure directly
- Basic work flow
- Align reads to the genome
- Identify clusters and peaks
- Determine bound sites
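The "identify clusters and peaks" step can be illustrated with a fixed-threshold sketch: bin the aligned read start positions, keep bins with enough reads, and merge adjacent enriched bins into candidate bound regions. The bin size, threshold and function name are hypothetical, and this is not how any specific peak caller works; real tools also model background signal, typically from a control sample.

```python
from collections import Counter

def call_peaks(read_starts, bin_size=100, min_reads=5):
    """Return (start, end, read_count) for each run of adjacent bins with >= min_reads reads."""
    bins = Counter(pos // bin_size for pos in read_starts)
    enriched = sorted(b for b, n in bins.items() if n >= min_reads)
    peaks = []
    for b in enriched:
        if peaks and b == peaks[-1][1] + 1:
            # Adjacent to the previous enriched bin: extend that peak.
            peaks[-1][1] = b
            peaks[-1][2] += bins[b]
        else:
            peaks.append([b, b, bins[b]])
    return [(first * bin_size, (last + 1) * bin_size, n) for first, last, n in peaks]

# Hypothetical read start positions on one chromosome.
starts = [1010, 1020, 1030, 1040, 1050, 1110, 1120, 1130, 1140, 1150, 5000, 7300, 9000]
print(call_peaks(starts))  # [(1000, 1200, 10)]
```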
67Robertson et al. 2007
- Used Illumina technology to find STAT1 binding sites
- Comparisons with two ChIP-PCR data sets suggested that ChIP-seq sensitivity was between 70% and 92% and specificity was at least 95%.
68Tag statistics
69Typical Profile
70Mikkelsen et al., 2007
- Performed a comparison with ChIP-chip methods
- 98% concordance
71Comparison with ChIP-seq
72The Methylome
- In methylated DNA, cytosines are methylated.
- This leads to silencing of genes in the region, e.g. X inactivation
- It is yet another form of transcriptional control and, together with histone modifications, a key component of epigenetics
73Bi-sulphite sequencing
- Converts un-methylated cytosines to uracil (which becomes thymine when converted to cDNA)
- Experimental procedure is difficult
- Sequence alignment is tricky, but the basic
concepts hold
74Taylor et al, 2007
- Targeted sequencing reduced alignment difficulties
- Used dynamic programming to identify alignments of sequences against an in silico bisulphite-converted sequence of the target amplicon regions
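An in silico conversion of this kind can be sketched in a few lines: every unmethylated C in the reference is written as T, and a read aligned to the original reference then reveals methylation wherever it still shows C. This is a generic illustration rather than the pipeline from the paper; it handles only the forward strand and ungapped alignments, and the function names and toy sequence are hypothetical.

```python
def bisulphite_convert(seq, methylated_positions=frozenset()):
    """Write every C as T unless its 0-based position is listed as methylated."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq.upper())
    )

def methylation_calls(reference, read, offset=0):
    """For an ungapped read placed at offset, classify each reference cytosine it covers."""
    calls = {}
    for i, base in enumerate(read):
        pos = offset + i
        if reference[pos] == "C":
            if base == "C":
                calls[pos] = "methylated"    # protected from conversion
            elif base == "T":
                calls[pos] = "unmethylated"  # converted C -> U -> T
    return calls

reference = "ACGTCCGA"
read = bisulphite_convert(reference, methylated_positions={4})  # simulate one methylated C
print(bisulphite_convert(reference), read, methylation_calls(reference, read))
```

Aligning reads against the fully converted reference avoids penalising the expected C to T differences, which is part of what makes bisulphite alignment tricky.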
75Metagenomics
- Craig Venter's sequencing of the sea: one of the earliest and most well known examples
- Used Sanger sequencing
- Many recent studies including
- Angly et al studied ocean virome
- Cox-Foster et al studied colony collapse disorder
- All use 454 for its longer read length and target amplification of 16S or 18S ribosomal subunits