Lecture 1 Introduction to high throughput sequencing

Provided by: MichaelSt151

Learn more at: http://www.cs.toronto.edu

2
Lecture 1: Introduction to high throughput
sequencing

Michael Brudno
CSC 2431
January 13, 2010

Adapted from presentations by Francis Ouellette,
OICR; Michael Stromberg, BC; and Asim Siddiqui, ABI
3
DNA sequencing

How we obtain the sequence of nucleotides of a
species

ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGAC
TACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG
ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT
4
DNA Sequencing

Goal:
Find the complete sequence of A, C, G, Ts in
DNA
Challenge:
There is no machine that takes long DNA as an
input and gives the complete sequence as output
Can only sequence 500 letters at a time

5
Generations of Sequences

Sanger-style: Classic
454: First Next-gen
Illumina, ABI SOLiD: Next-gen
Helicos: 2.5 Gen
PacBio: Next-next-gen, 3rd gen

6
Why are we sequencing?

Before Next-generation:
DNA, RNA, (proteins), (populations), sampling,
averages, consensus
Problems: sampling, averages, consensus.
After Next-generation:
Genome sequence and structure
Less cloning/PCR
Single molecules (for some)

7
Sanger (old-gen) Sequencing vs. Now-Gen Sequencing
Whole Genome
  Sanger: Human (early drafts), model organisms, bacteria, viruses and mitochondria (chloroplast), low coverage
  Now-gen: New human (!), individual genome, 1,000 normal, 25,000 cancer matched control pairs, rare samples
RNA
  Sanger: cDNA clones, ESTs, Full Length Insert cDNAs, other RNAs
  Now-gen: RNA-Seq: digitization of transcriptome, alternative splicing events, miRNA
Communities
  Sanger: Environmental sampling, 16S RNA populations, ocean sampling
  Now-gen: Human microbiome, deep environmental sequencing, Bar-Seq
Other
  Now-gen: Epigenome, rearrangements, ChIP-Seq
8
Differences between the various platforms

Nanotechnology used.
Resolution of the image analysis.
Chemistry and enzymology.
Signal to noise detection in the software
Software/images/file size/pipeline
Cost

9
Adapted from Richard Wilson, School of Medicine,
Washington University, Sequencing the Cancer
Genome http://tinyurl.com/5f3alk
Next Generation DNA Sequencing Technologies
Platform | 3730 | 454 | Illumina
Human genome | 6 GB = 6000 MB | 6 GB = 6000 MB | 6 GB = 6000 MB
Reqd coverage | 6 | 12 | 30
bp/read | 600 | 400 | 2x75
reads/run | 96 | 500,000 | 100,000,000
bp/run | 57,600 | 0.5 GB | 15 GB
runs reqd | 625,000 | 144 | 12
runs/day | 2 | 1 | 0.1
Machine days/human genome | 312,500 (856 years) | 144 | 120
Cost/run | 48 | 6,800 | 9,300
Total cost | 15,000,000 | 979,200 | 111,600
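The "runs reqd" and "machine days" rows follow from the other rows by simple arithmetic. A minimal sketch, assuming runs required = genome size x coverage / bp per run and machine days = runs / runs per day (the function and variable names are illustrative, not from the lecture):

```python
import math

GENOME_BP = 6_000_000_000  # diploid human genome size used in the table

# (required coverage, bp per run, runs per day), copied from the table
PLATFORMS = {
    "3730": (6, 57_600, 2),
    "454": (12, 500_000_000, 1),
    "Illumina": (30, 15_000_000_000, 0.1),
}


def runs_required(coverage, bp_per_run, genome_bp=GENOME_BP):
    """Number of machine runs needed to reach the requested coverage."""
    return math.ceil(genome_bp * coverage / bp_per_run)


def machine_days(runs, runs_per_day):
    """Days one machine needs to complete the given number of runs."""
    return runs / runs_per_day


for name, (cov, bp_run, per_day) in PLATFORMS.items():
    runs = runs_required(cov, bp_run)
    print(name, runs, round(machine_days(runs, per_day)))
```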
10
Solexa-based Whole Genome Sequencing
Adapted from Richard Wilson, School of Medicine,
Washington University, Sequencing the Cancer
Genome http://tinyurl.com/5f3alk
11
Illumina (Solexa)
12
Illumina (Solexa)
13
Illumina (Solexa)
14
From Debbie Nickerson, Department of Genome
Sciences, University of Washington,
http://tinyurl.com/6zbzh4
15
What is a base quality?
Base quality | P_error(obs. base)
3 | 50.12%
5 | 31.62%
10 | 10.00%
15 | 3.16%
20 | 1.00%
25 | 0.32%
30 | 0.10%
35 | 0.03%
40 | 0.01%
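The values in this table are consistent with the Phred convention, in which a quality Q corresponds to an error probability of 10^(-Q/10). The slide does not name the formula, so the sketch below is an assumption that happens to reproduce the listed numbers; the function name is illustrative.

```python
def phred_to_error_prob(q):
    """Error probability implied by a Phred-scaled base quality."""
    return 10 ** (-q / 10)


# Reproduce the table, printing the error probability as a percentage.
for q in (3, 5, 10, 15, 20, 25, 30, 35, 40):
    print(q, f"{100 * phred_to_error_prob(q):.2f}%")
```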
16
Next-gen sequencers
From John McPherson, OICR
[Figure: bases per machine run (1 Mb to 100 Gb) plotted against read length (10 bp to 1,000 bp)]
AB/SOLiDv3, Illumina/GAII short-read sequencers
(10 Gb in 50-100 bp reads, >100M reads, 4-8 days)
454 GS FLX pyrosequencer
(100-500 Mb in 100-400 bp reads, 0.5-1M reads, 5-10 hours)
ABI capillary sequencer
(0.04-0.08 Mb in 450-800 bp reads, 96 reads, 1-3 hours)
17
DNA sequencing vectors
DNA
Shake
DNA fragments
Known location (restriction site)
Vector: circular genome (bacterium, plasmid)

18
Method to sequence longer regions
genomic segment
cut many times at random (Shotgun)
Get two reads from each segment, 500 bp from each end
19
Reconstructing the Sequence (Fragment Assembly)
reads
Cover region with 7-fold redundancy (7X)
Overlap reads and extend to reconstruct the
original genomic region
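The overlap-and-extend idea can be illustrated with a toy greedy assembler. This is a minimal sketch, not the method used by any production assembler: it assumes error-free reads from one strand, exact suffix-prefix overlaps, and no repeats, and all names are illustrative.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best = (0, None, None)
        # Every pass compares all ordered pairs, so the work grows
        # quadratically with the number of reads.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # no remaining overlaps; leave separate contigs
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)
    return contigs


reads = ["ACGTGACTGA", "GACTGAGGAC", "GAGGACCGTG"]
print(greedy_assemble(reads))
```

With real data, sequencing errors and repeated regions produce false overlaps, which is why this naive approach does not scale to whole genomes.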
20
Definition of Coverage
C

Length of genomic segment: L
Number of reads: n
Length of each read: l
Definition: Coverage C = n l / L
How much coverage is enough?
Lander-Waterman model:
Assuming uniform distribution of reads, C = 10
results in 1 gapped region / 1,000,000 nucleotides
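The coverage definition and the Lander-Waterman estimate can be checked numerically. The sketch below assumes the simplified form of the model in which the expected number of gaps is n * exp(-C), ignoring the minimum overlap needed to join two reads, and assumes 500-letter reads as stated earlier in the lecture; the function names are illustrative.

```python
import math


def coverage(n_reads, read_len, genome_len):
    """Coverage C = n * l / L."""
    return n_reads * read_len / genome_len


def expected_gaps(n_reads, read_len, genome_len):
    """Simplified Lander-Waterman estimate: about n * exp(-C) gaps."""
    c = coverage(n_reads, read_len, genome_len)
    return n_reads * math.exp(-c)


L = 1_000_000   # genomic segment length
l = 500         # read length (assumed)
n = 20_000      # number of reads, chosen so that C = 10

print(coverage(n, l, L))       # 10.0
print(expected_gaps(n, l, L))  # a little under 1 gap per 1,000,000 nucleotides
```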

21
Challenges with Fragment Assembly

Sequencing errors:
1-2% of bases are wrong
Repeats:
false overlap due to repeat
Computation: O(N^2) where N = # reads
22
History of DNA Sequencing
Adapted from Eric Green, NIH; adapted from
Messing & Llaca, PNAS (1998)
[Timeline figure: efficiency (bp/person/year) against year]
1870: Miescher discovers DNA
1940: Avery proposes DNA as genetic material
1953: Watson & Crick: double helix structure of DNA
1965: Holley sequences yeast tRNA-Ala
1970: Wu sequences lambda cohesive end DNA
1977: Sanger: dideoxy chain termination; Gilbert: chemical degradation
1980: Messing: M13 cloning
1986: Hood et al.: partial automation
1990: Cycle sequencing; improved sequencing enzymes; improved fluorescent detection schemes
2009: Next generation sequencing; improved enzymes and chemistry; new image processing
Efficiency values shown along the timeline, in order: 1; 15; 150; 1,500; 15,000; 25,000; 50,000; 200,000; 50,000,000 (2002); 100,000,000,000 (2009)
23
Which representative of the species?

Which human?
Answer one:
Answer two: it doesn't matter
Polymorphism rate: number of letter changes
between two different members of a species
Humans: 1/1,000 to 1/10,000
Other organisms have much higher polymorphism
rates
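As a rough illustration of the polymorphism rate, one can count mismatching letters between two aligned sequences. This sketch assumes the two sequences are already aligned, have equal length, and differ only by substitutions; the function name is illustrative.

```python
def polymorphism_rate(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return diffs / len(seq_a)


# Two 12-letter sequences differing at a single position: rate = 1/12.
print(polymorphism_rate("AACGTTAGCATA", "AACGTTCGCATA"))
```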

24
Why humans are so similar

A small population that interbred reduced the
genetic variation
Out of Africa 40,000 years ago

Out of Africa
25
Migration of human variation

http://info.med.yale.edu/genetics/kkidd/point.html

26
Migration of human variation

http://info.med.yale.edu/genetics/kkidd/point.html

27
Migration of human variation

http://info.med.yale.edu/genetics/kkidd/point.html

28
Genetic Variations: Why?
Phenotypic differences
Inherited diseases
Ancestral history
29
Genetic Variations: SNPs & INDELs
30
Structural Variations
Paul Medvedev review in prep July 2009
31
SNP Discovery: Goal
SNP
sequencing errors
32
SNP Discovery: Base Qualities
High quality
Low quality
33
SNPs: Bayesian Statistics
base quality
# of individuals
allele call in read
34
SNP Discovery
haploid
strain 1: AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA
strain 2: AACGTTCGCATA AACGTTCGCATA
strain 3: AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA
diploid
individual 1: AACGTTAGCATA AACGTTAGCATA AACGTTCGCATA AACGTTCGCATA
individual 2: AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA
individual 3: AACGTTAGCATA AACGTTAGCATA
35
Genotyping & Consensus Generation
haploid
strain 1 (A): AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA
strain 2 (C): AACGTTCGCATA AACGTTCGCATA
strain 3 (A): AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA
diploid
individual 1 (A/C): AACGTTAGCATA AACGTTAGCATA AACGTTCGCATA AACGTTCGCATA
individual 2 (C/C): AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA
individual 3 (A/A): AACGTTAGCATA AACGTTAGCATA
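A diploid genotype call at the variable site can be sketched by scoring each candidate genotype against the observed read bases and their qualities. The version below is a simplified illustration, not the caller used in the lecture: it assumes independent reads, Phred-scaled qualities, errors spread evenly over the three other bases, and a flat prior over genotypes, so the most probable genotype is simply the one with the highest likelihood. All names are illustrative.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def genotype_log_likelihoods(bases, quals):
    """Log-likelihood of the observed bases under each diploid genotype."""
    out = {}
    for a1, a2 in combinations_with_replacement(BASES, 2):
        ll = 0.0
        for b, q in zip(bases, quals):
            e = 10 ** (-q / 10)                    # error probability of this base
            p1 = 1 - e if b == a1 else e / 3       # read drawn from allele 1
            p2 = 1 - e if b == a2 else e / 3       # read drawn from allele 2
            ll += math.log(0.5 * p1 + 0.5 * p2)    # either allele equally likely
        out[a1 + "/" + a2] = ll
    return out


def call_genotype(bases, quals):
    """Genotype with the highest likelihood (flat prior assumed)."""
    ll = genotype_log_likelihoods(bases, quals)
    return max(ll, key=ll.get)


# Bases observed at the variable site, all with quality 30.
print(call_genotype("AACC", [30] * 4))  # individual with reads from both alleles
print(call_genotype("CCCC", [30] * 4))
print(call_genotype("AA", [30] * 2))
```

A real caller would replace the flat prior with population allele frequencies, which is where the number of individuals enters the calculation.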
36
Visualization: Consed
37
1000 Genomes Project
38
1000G Goals

Discover genetic variations
1% minor allele frequencies (MAF) across genome
0.1-0.5% MAF across gene regions
Variant alleles
Estimate frequencies
Identify haplotype background
Characterize linkage disequilibrium

39
1000G Pilot Projects

Pilot 1: Low coverage
180 samples
70 samples @ 4X
110 samples @ 2X
2.7 Tbp total
202 Gbp 454
1.8 Tbp Illumina
640 Gbp AB SOLiD

Pilot 2: Deep trios (CEU, YRI)
6 samples
1.1 Tbp total
87 Gbp 454
773 Gbp Illumina
270 Gbp AB SOLiD

Pilot 3: Exon capture
607 samples
2.2 Mbp of targets
8800 targets
10-20x coverage
40
Questions about the genome

Obtaining a genome sequence is one step towards
understanding biological processes
Questions that follow from the genome are
What is transcribed?
Where do proteins bind?
What is methylated?
In other words, how does it work?

41
Central dogma
[Figure: DNA is transcribed into RNA (mRNA, tRNA, rRNA, snRNA); mRNA is translated into a polypeptide]
42
Transcription

The DNA is contained in the nucleus of the cell.
A stretch of it unwinds there, and its message
(or sequence) is copied onto a molecule of mRNA.
The mRNA then exits from the cell nucleus.

43
DNA: A T G C
RNA: T -> U
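In terms of letters, the RNA copy reads like the DNA coding strand with every T replaced by U. A minimal sketch, assuming the input is the coding (sense) strand written 5' to 3'; the function name is illustrative.

```python
def transcribe(coding_strand):
    """Return the mRNA sequence matching a DNA coding strand (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


print(transcribe("ATGC"))
```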
44
More complexity

The RNA message is sometimes edited.
Exons are nucleotide segments whose codons will
be expressed.
Introns are intervening segments (genetic
gibberish) that are snipped out.
Exons are spliced together to form mRNA.

45
Splicing

frgjjthissentencehjfmkcontainsjunkelm
thissentencecontainsjunk
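The sentence analogy can be written as code: given the positions of the exons, splicing keeps those segments in order and discards everything between them. The exon coordinates below are worked out by hand for this example string and use 0-based, end-exclusive positions; the function name is illustrative.

```python
def splice(pre_mrna, exons):
    """Join exon segments, given as (start, end) pairs, in order."""
    return "".join(pre_mrna[start:end] for start, end in exons)


pre_mrna = "frgjjthissentencehjfmkcontainsjunkelm"
exons = [(5, 17), (22, 34)]  # "thissentence" and "containsjunk"

print(splice(pre_mrna, exons))
```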

46
Key player: RNA polymerase

It is the enzyme that brings about transcription
by going down the line, pairing mRNA nucleotides
with their DNA counterparts.

47
Promoters

Promoters are sequences in the DNA just upstream
of transcripts that define the sites of
initiation.
The role of the promoter is to attract RNA
polymerase to the correct start site so
transcription can be initiated.

5'
3'
Promoter
48
Promoters

Promoters are sequences in the DNA just upstream
of transcripts that define the sites of
initiation.
The role of the promoter is to attract RNA
polymerase to the correct start site so
transcription can be initiated.

5'
3'
Promoter
49
Transcription: key steps
DNA

Initiation
Elongation
Termination

DNA

RNA
50
Genes can be switched on/off

An adult multicellular organism contains a wide
variety of cell types, e.g. muscle, nerve and
blood cells.
The different cell types nevertheless contain the
same DNA.
This differentiation arises because different
cell types express different genes.
Promoters are one type of gene regulator.

51
Transcription (recap)

The DNA is contained in the nucleus of the cell.
A stretch of it unwinds there, and its message
(or sequence) is copied onto a molecule of mRNA.
The mRNA then exits from the cell nucleus.
Its destination is a molecular workbench in the
cytoplasm, a structure called a ribosome.

52
The Transcriptome

The transcriptome is the entire set of RNA
transcripts in the cell, tissue or organ.
The transcriptome is cell-type specific and time
dependent, i.e. it is a function of cell state
The transcriptome can help us understand how
cells differentiate and respond to changes in
their environment.

53
Transcriptome complexity

Transcripts may be
Modified
Spliced
Edited
Degraded
Transcriptome is substantially more complex than
the genome and is time variant.

54
ESTs

ESTs were the first genome-wide scan for
transcriptional elements
Different library types:
Proportional
Normalized
Subtractive
Can be sequenced from the 5' or 3' end

55
Hello Mr Chips

Microarray chips introduced in the 90s
Parallel way to measure many genes
Probes placed on slides
RNA -> cDNA, labelled with fluorescent dye and
hybridized
Fluorescence measured
Chips have been highly successful
Simplified analysis
Useful when there is no genome sequence
Linear signal across 500-fold variation
Standardization has aided use in medical
diagnostics
E.g. MammaPrint

56
Microarray expression profiling by 2-color assay
(cDNA arrays)
Array: PCR products of 6250 yeast ORFs
Hybridized cDNAs: green = control, red = experiment
Schena et al., 1995
57
Chips: pros and cons

Advantages:
Do not require a genome sequence
Highly characterised, with many software packages
available
One Affymetrix chip FDA approved
Disadvantages:
Measurements limited to what's on the array
Hard to distinguish isoforms when used for
expression
Can't detect balanced translocations or
inversions when used for resequencing

58
mRNA-seq

Basic work flow
Align reads (sometimes to transcriptome first and
then the genome)
Tally transcript counts
Align tags to spliced transcripts
Add to transcript counts
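The two-pass tallying in this workflow can be illustrated with a toy counter. This sketch uses exact substring matching as a stand-in for a real read aligner, and the transcript names, sequences and data layout are made up for illustration: reads are first matched against exon sequences, and only unmatched reads are then tried against exon-exon junction sequences.

```python
from collections import Counter


def tally_reads(reads, exons, junctions):
    """Count reads per transcript; exons/junctions map name -> list of sequences."""
    counts = Counter()
    for read in reads:
        # Pass 1: look for the read inside known exon sequences.
        hits = {name for name, seqs in exons.items()
                if any(read in s for s in seqs)}
        # Pass 2: reads that span a splice site only match junction sequences.
        if not hits:
            hits = {name for name, seqs in junctions.items()
                    if any(read in s for s in seqs)}
        # Count only reads that map to exactly one transcript.
        if len(hits) == 1:
            counts[hits.pop()] += 1
    return counts


exons = {
    "geneA": ["ACGTGACTGAGG", "TTAGCATACCGG"],
    "geneB": ["CTAGCTAGACTACG"],
}
junctions = {"geneA": ["ACTGAGGTTAGCAT"]}  # end of exon 1 joined to start of exon 2
reads = ["GTGACT", "GAGGTTAG", "AGCTAGAC", "GGGGGG"]

print(tally_reads(reads, exons, junctions))
```

Reads that match several transcripts or none are skipped here; how to handle them is a design decision in any real pipeline.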

59
Cloonan et al. 2008

Used SOLiD to generate 10 Gb of data from mouse
embryonic stem cells and embryoid bodies
Used a library of exon junctions to map across
known splice events

60
Distribution of tags
61
Tag locations
62
General issues

Coverage across the transcript may not be random
Some reads map to multiple locations
Some reads don't map at all
Reads mapping outside of known exons may
represent:
New gene models
New genes

63
Size of the transcriptome

Carter et al. (2005)
Using arrays, estimated 520,000 to 850,000
transcripts per cell.
Use the upper limit and an estimated average
transcript size of 2 kb
Transcriptome ~ 2 GB
Transcriptome cost ~ genome cost
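The size estimate is a single multiplication, shown here as a quick check. The sketch uses the upper transcript count and the 2 kb average from the slide; the function name is illustrative.

```python
def transcriptome_size_bp(transcripts_per_cell, avg_transcript_len_bp):
    """Total bases in all transcripts of one cell."""
    return transcripts_per_cell * avg_transcript_len_bp


size = transcriptome_size_bp(850_000, 2_000)
print(size)         # 1700000000 bases
print(size / 1e9)   # 1.7, which the slide rounds to about 2 GB
```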

64
The Boundome

DNA binding proteins control genome function
Histones impact chromatin structure
Activators and repressors impact gene expression
The location of these proteins helps us
understand how the genome works

65
ChIP
66
ChIP-Seq

Instead of probing against a chip, measure
directly
Basic work flow
Align reads to the genome
Identify clusters and peaks
Determine bound sites
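The cluster-and-peak step can be sketched with fixed-width bins. This is a toy illustration under simple assumptions, not the method of any published peak caller: aligned read start positions on one chromosome are counted per bin, bins with at least a threshold number of reads are kept, and adjacent kept bins are merged into one region. The bin size, threshold and names are illustrative.

```python
from collections import Counter


def call_peaks(read_starts, bin_size=100, min_reads=5):
    """Return (start, end) regions whose bins hold at least min_reads reads."""
    bins = Counter(pos // bin_size for pos in read_starts)
    enriched = sorted(b for b, count in bins.items() if count >= min_reads)
    peaks = []
    for b in enriched:
        start, end = b * bin_size, (b + 1) * bin_size
        if peaks and peaks[-1][1] == start:
            peaks[-1] = (peaks[-1][0], end)  # extend the previous region
        else:
            peaks.append((start, end))
    return peaks


read_starts = [1010, 1020, 1030, 1050, 1090,
               1100, 1110, 1120, 1130, 1140, 1150,
               5000, 7020, 7030]
print(call_peaks(read_starts))
```

Real analyses also compare against a control sample, since some regions attract reads regardless of protein binding.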

67
Robertson et al. 2007

Used Illumina technology to find STAT1 binding
sites
Comparisons with two ChIP-PCR data sets suggested
that ChIP-seq sensitivity was between 70% and 92%
and specificity was at least 95%.

68
Tag statistics
69
Typical Profile
70
Mikkelsen et al., 2007

Performed a comparison with ChIP-chip methods
98% concordance

71
Comparison with ChIP-seq
72
The Methylome

In methylated DNA, cytosines are methylated.
This leads to silencing of genes in the region
e.g. X inactivation
It is yet another form of transcriptional control
and together with histone modifications a key
component of epigenetics

73
Bi-sulphite sequencing

Converts un-methylated cytosines to uracil (which
becomes thymine when converted to cDNA)
Experimental procedure is difficult
Sequence alignment is tricky, but the basic
concepts hold

74
Taylor et al, 2007

Targeted sequencing reduced alignment
difficulties
Used dynamic programming to identify alignments
of sequences against an in silico bisulphite
converted sequence of the target amplicon regions
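An in silico bisulphite conversion, and the methylation call it enables, can be sketched as follows. This is a simplified illustration that handles the forward strand only and assumes the read is already aligned to the reference without gaps; it is not the procedure from Taylor et al., and the function names are illustrative.

```python
def bisulfite_convert(seq, methylated_positions=()):
    """Convert every unmethylated C to T; methylated Cs are left unchanged."""
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def methylated_sites(reference, read):
    """Positions where the reference has C and the converted read still has C."""
    return [i for i, (r, b) in enumerate(zip(reference, read))
            if r == "C" and b == "C"]


reference = "ACGTCCGA"
fully_converted = bisulfite_convert(reference)             # alignment target
read = bisulfite_convert(reference, methylated_positions={1})

print(fully_converted)
print(read)
print(methylated_sites(reference, read))
```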

75
Metagenomics

Craig Venter's sequencing of the sea is one of the
earliest and most well-known examples
Used Sanger sequencing
Many recent studies, including:
Angly et al. studied the ocean virome
Cox-Foster et al. studied colony collapse
disorder
All use 454 for its longer read length and target
amplification of 16S or 18S ribosomal subunits
