Lecture 1: Introduction to high throughput sequencing

2
Lecture 1: Introduction to high throughput sequencing
  • Michael Brudno
  • CSC 2431
  • January 13, 2010

Adapted from presentations by Francis Ouellette (OICR),
Michael Stromberg (BC) and Asim Siddiqui (ABI)
3
DNA sequencing
  • How we obtain the sequence of nucleotides of a
    species

ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGAC
TACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG
ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT
4
DNA Sequencing
  • Goal
  • Find the complete sequence of A, C, G, Ts in
    DNA
  • Challenge
  • There is no machine that takes long DNA as an
    input, and gives the complete sequence as output
  • Can only sequence 500 letters at a time

5
Generations of Sequencers
  • Sanger-style: classic
  • 454: first next-gen
  • Illumina, ABI SOLiD: next-gen
  • Helicos: 2.5 gen
  • PacBio: next-next-gen, 3rd gen

6
Why are we sequencing?
  • Before Next-generation
  • DNA, RNA, (proteins), (populations), sampling,
    averages, consensus
  • Problems: sampling, averages, consensus.
  • After Next-generation
  • Genome sequence and structure
  • Less cloning/PCR
  • Single molecules (for some)

7
Applications: Sanger (old-gen) sequencing vs. now-gen sequencing
  • Whole genome
    Sanger: human (early drafts), model organisms, bacteria, viruses and mitochondria (chloroplast), low coverage
    Now-gen: new human (!), individual genome, 1,000 normal, 25,000 cancer matched control pairs, rare samples
  • RNA
    Sanger: cDNA clones, ESTs, full-length insert cDNAs, other RNAs
    Now-gen: RNA-Seq (digitization of transcriptome, alternative splicing events, miRNA)
  • Communities
    Sanger: environmental sampling, 16S RNA populations, ocean sampling
    Now-gen: human microbiome, deep environmental sequencing, Bar-Seq
  • Other
    Now-gen: epigenome, rearrangements, ChIP-Seq
8
Differences between the various platforms
  • Nanotechnology used.
  • Resolution of the image analysis.
  • Chemistry and enzymology.
  • Signal to noise detection in the software
  • Software/images/file size/pipeline
  • Cost

9
Next Generation DNA Sequencing Technologies
Adapted from Richard Wilson, School of Medicine,
Washington University, Sequencing the Cancer
Genome, http://tinyurl.com/5f3alk

                             3730                   454                  Illumina
Human genome                 6 Gb (6,000 Mb)        6 Gb (6,000 Mb)      6 Gb (6,000 Mb)
Required coverage            6                      12                   30
bp/read                      600                    400                  2 x 75
reads/run                    96                     500,000              100,000,000
bp/run                       57,600                 0.5 Gb               15 Gb
Runs required                625,000                144                  12
Runs/day                     2                      1                    0.1
Machine days/human genome    312,500 (856 years)    144                  120
Cost/run ($)                 48                     6,800                9,300
Total cost ($)               15,000,000             979,200              111,600
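The "runs required" and "machine days" rows follow directly from genome size, required coverage and bp/run. A minimal Python sketch of that arithmetic, using the values in the table; the function names and the "days per run" figures (taken as the inverse of the runs/day row) are illustrative assumptions, not part of the original slide:

```python
GENOME_BP = 6_000_000_000  # 6 Gb, the genome size used in the table

# platform: (required coverage, bp per run, days per run)
# "days per run" is the inverse of the table's runs/day row (2, 1, 0.1).
PLATFORMS = {
    "3730": (6, 57_600, 0.5),
    "454": (12, 500_000_000, 1),
    "Illumina": (30, 15_000_000_000, 10),
}

def runs_required(coverage, bp_per_run, genome_bp=GENOME_BP):
    """Runs needed to reach the target fold coverage, rounded up."""
    total_bp = genome_bp * coverage
    return -(-total_bp // bp_per_run)  # integer ceiling division

def machine_days(runs, days_per_run):
    """Total instrument time if a single machine does all the runs."""
    return runs * days_per_run

for name, (cov, bp_per_run, days_per_run) in PLATFORMS.items():
    runs = runs_required(cov, bp_per_run)
    days = machine_days(runs, days_per_run)
    print(f"{name}: {runs:,} runs, {days:,} machine days")
```

The total cost row is then runs required times cost/run for each platform.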
10
Solexa-based Whole Genome Sequencing
Adapted from Richard Wilson, School of Medicine,
Washington University, Sequencing the Cancer
Genome, http://tinyurl.com/5f3alk
11
Illumina (Solexa)
12
Illumina (Solexa)
13
Illumina (Solexa)
14
From Debbie Nickerson, Department of Genome
Sciences, University of Washington,
http://tinyurl.com/6zbzh4
15
What is a base quality?
Base quality    P_error(obs. base)
 3              50.12%
 5              31.62%
10              10.00%
15               3.16%
20               1.00%
25               0.32%
30               0.10%
35               0.03%
40               0.01%
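The table values match the standard Phred scale, where a quality Q corresponds to an error probability of 10^(-Q/10). A short Python sketch of the conversion in both directions (the function names are illustrative):

```python
import math

def phred_to_error_prob(q):
    """Probability that the observed base is wrong, given Phred quality q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred quality corresponding to error probability p."""
    return -10 * math.log10(p)

# Print the error probability as a percentage for each quality in the table.
for q in (3, 5, 10, 15, 20, 25, 30, 35, 40):
    print(f"Q{q}: {100 * phred_to_error_prob(q):.2f}%")
```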
16
Next-gen sequencers
From John McPherson, OICR
[Figure: bases per machine run (1 Mb to 100 Gb) plotted against read length (10 bp to 1,000 bp)]
  • AB/SOLiDv3, Illumina/GAII short-read sequencers:
    10 Gb in 50-100 bp reads, >100M reads, 4-8 days
  • 454 GS FLX pyrosequencer: 100-500 Mb in 100-400 bp
    reads, 0.5-1M reads, 5-10 hours
  • ABI capillary sequencer: 0.04-0.08 Mb in 450-800 bp
    reads, 96 reads, 1-3 hours
17
DNA sequencing vectors
DNA -> shake -> DNA fragments
Vector: circular genome (bacterium, plasmid) with a
known location (restriction site)


18
Method to sequence longer regions
genomic segment
cut many times at random (Shotgun)
Get two reads from each segment, 500 bp each
19
Reconstructing the Sequence (Fragment Assembly)
reads
Cover region with 7-fold redundancy (7X)
Overlap reads and extend to reconstruct the
original genomic region
20
Definition of Coverage
  • Length of genomic segment: L
  • Number of reads: n
  • Length of each read: l
  • Definition: coverage C = n l / L
  • How much coverage is enough?
  • Lander-Waterman model
  • Assuming uniform distribution of reads, C = 10
    results in 1 gapped region per 1,000,000 nucleotides
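The coverage definition and the Lander-Waterman estimate can be checked numerically. The sketch below uses a simplified form of the model, in which the expected number of gaps is about n * exp(-C) and the minimum overlap needed to join two reads is ignored; the 500 bp read length is an assumption taken from the earlier slide, and the function names are illustrative:

```python
import math

def coverage(n_reads, read_len, genome_len):
    """C = n * l / L."""
    return n_reads * read_len / genome_len

def uncovered_fraction(c):
    """Fraction of bases hit by no read, assuming uniformly placed reads."""
    return math.exp(-c)

def expected_gaps(n_reads, read_len, genome_len):
    """Simplified Lander-Waterman estimate: about n * exp(-C) gaps."""
    c = coverage(n_reads, read_len, genome_len)
    return n_reads * math.exp(-c)

# 20,000 reads of 500 bp over 1,000,000 nucleotides gives C = 10.
print(coverage(20_000, 500, 1_000_000))
print(expected_gaps(20_000, 500, 1_000_000))
```

With these numbers the estimate comes out at just under one gap per 1,000,000 nucleotides, in line with the figure quoted on the slide.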

21
Challenges with Fragment Assembly
  • Sequencing errors
  • 1-2% of bases are wrong
  • Repeats
  • Computation: O(N^2), where N = number of reads

false overlap due to repeat
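The O(N^2) cost comes from comparing every read against every other read. A naive Python sketch of that all-pairs suffix/prefix overlap step is shown here; exact matching, the minimum overlap of 3 and the toy reads are assumptions for illustration, and a real assembler must also tolerate sequencing errors and repeats:

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a equal to a prefix of b, or 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def all_overlaps(reads, min_len=3):
    """Compare every ordered pair of reads: N * (N - 1) comparisons."""
    found = {}
    for i, j in permutations(range(len(reads)), 2):
        k = overlap(reads[i], reads[j], min_len)
        if k:
            found[(i, j)] = k
    return found

# Toy reads cut from the sequence ACGTGACTGAGGACCG.
reads = ["ACGTGACT", "GACTGAGG", "GAGGACCG"]
print(all_overlaps(reads))
```

Extending reads along the detected overlaps reconstructs the original region; a repeat longer than the reads produces the false overlaps noted above.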
22
History of DNA Sequencing
Adapted from Eric Green, NIH; adapted from
Messing & Llaca, PNAS (1998)
  • 1870: Miescher discovers DNA
  • 1940: Avery proposes DNA as genetic material
  • 1953: Watson & Crick, double helix structure of DNA
  • 1965: Holley sequences yeast tRNA-Ala
  • 1970: Wu sequences λ cohesive end DNA
  • 1977: Sanger, dideoxy chain termination; Gilbert,
    chemical degradation
  • 1980: Messing, M13 cloning
  • 1986: Hood et al., partial automation
  • 1990: Cycle sequencing; improved sequencing enzymes;
    improved fluorescent detection schemes
  • 2002
  • 2009: Next generation sequencing; improved enzymes
    and chemistry; new image processing

Efficiency (bp/person/year) axis values, in increasing
order: 1; 15; 150; 1,500; 15,000; 25,000; 50,000;
200,000; 50,000,000; 100,000,000,000
23
Which representative of the species?
  • Which human?
  • Answer one
  • Answer two: it doesn't matter
  • Polymorphism rate: number of letter changes
    between two different members of a species
  • Humans: 1/1,000 to 1/10,000
  • Other organisms have much higher polymorphism
    rates

24
Why humans are so similar
  • A small population that interbred reduced the
    genetic variation
  • Out of Africa 40,000 years ago

Out of Africa
25
Migration of human variation
  • http://info.med.yale.edu/genetics/kkidd/point.html

28
Genetic Variations: Why?
Phenotypic differences
Inherited diseases
Ancestral history
29
Genetic Variations: SNPs, INDELs
30
Structural Variations
Paul Medvedev review in prep July 2009
31
SNP Discovery: Goal
Distinguish true SNPs from sequencing errors
32
SNP Discovery: Base Qualities
High quality
Low quality
33
SNPs: Bayesian Statistics
Inputs: base quality, number of individuals, allele call in read
34
SNP Discovery
haploid
  • strain 1: reads AACGTTAGCATA
  • strain 2: reads AACGTTCGCATA
  • strain 3: reads AACGTTAGCATA
diploid
  • individual 1: reads AACGTTAGCATA and AACGTTCGCATA
  • individual 2: reads AACGTTCGCATA
  • individual 3: reads AACGTTAGCATA
35
Genotyping: Consensus Generation
haploid
  • strain 1: reads AACGTTAGCATA; call A
  • strain 2: reads AACGTTCGCATA; call C
  • strain 3: reads AACGTTAGCATA; call A
diploid
  • individual 1: reads AACGTTAGCATA and AACGTTCGCATA; call A/C
  • individual 2: reads AACGTTCGCATA; call C/C
  • individual 3: reads AACGTTAGCATA; call A/A
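One way to turn the allele calls in the reads plus their base qualities into a diploid genotype is a simple likelihood calculation. The Python sketch below assumes a flat prior over genotypes (so the most probable genotype is the maximum-likelihood one), independent reads, and errors spread evenly over the three other bases; it is an illustration of the idea, not the method used by any particular SNP caller, and all names are hypothetical:

```python
import math
from itertools import combinations_with_replacement

def genotype_log_likelihood(genotype, observations):
    """genotype: pair of alleles; observations: list of (base, phred quality)."""
    total = 0.0
    for base, qual in observations:
        err = 10 ** (-qual / 10)
        p = 0.0
        # Each read is drawn from one of the two alleles with probability 1/2.
        for allele in genotype:
            p += 0.5 * ((1 - err) if base == allele else err / 3)
        total += math.log(p)
    return total

def call_genotype(observations, alleles="ACGT"):
    """Return the genotype with the highest likelihood under a flat prior."""
    genotypes = combinations_with_replacement(alleles, 2)
    return max(genotypes, key=lambda g: genotype_log_likelihood(g, observations))

# Bases seen at the variable position, all at an assumed quality of Q30.
individual_1 = [("A", 30), ("A", 30), ("C", 30), ("C", 30)]
individual_2 = [("C", 30)] * 4
print(call_genotype(individual_1), call_genotype(individual_2))
```

Lowering the quality of the minority bases shifts the call toward a homozygous genotype, which is how low-quality mismatches get treated as likely sequencing errors.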
36
Visualization: Consed
37
1000 Genomes Project
38
1000G Goals
  • Discover genetic variations
  • 1% minor allele frequency (MAF) across the genome
  • 0.1-0.5% MAF across gene regions
  • Variant alleles
  • Estimate frequencies
  • Identify haplotype background
  • Characterize linkage disequilibrium

39
1000G Pilot Projects
  • Pilot 1: low coverage
  • 180 samples (70 samples @ 4X, 110 samples @ 2X)
  • 2.7 Tbp total (202 Gbp 454, 1.8 Tbp Illumina,
    640 Gbp AB SOLiD)
  • Pilot 2: deep trios (CEU, YRI)
  • 6 samples
  • 1.1 Tbp total (87 Gbp 454, 773 Gbp Illumina,
    270 Gbp AB SOLiD)
  • Pilot 3: exon capture
  • 607 samples
  • 2.2 Mbp of targets (8,800 targets)
  • 10-20x coverage
40
Questions about the genome
  • Obtaining a genome sequence is one step towards
    understanding biological processes
  • Questions that follow from the genome are
  • What is transcribed?
  • Where do proteins bind?
  • What is methylated?
  • In other words, how does it work?

41
Central dogma
DNA -> (transcription) -> RNA (mRNA, tRNA, rRNA, snRNA)
mRNA -> (translation) -> polypeptide
42
Transcription
  • The DNA is contained in the nucleus of the cell.
  • A stretch of it unwinds there, and its message
    (or sequence) is copied onto a molecule of mRNA.
  • The mRNA then exits from the cell nucleus.

43
DNA bases: A T G C
RNA bases: T -> U (uracil replaces thymine)
44
More complexity
  • The RNA message is sometimes edited.
  • Exons are nucleotide segments whose codons will
    be expressed.
  • Introns are intervening segments (genetic
    gibberish) that are snipped out.
  • Exons are spliced together to form mRNA.

45
Splicing (introns removed, exons joined)
  • frgjjthissentencehjfmkcontainsjunkelm
  • thissentencecontainsjunk
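The sentence analogy maps directly onto string slicing. A small Python sketch, where the exon coordinates (0-based, end-exclusive) are worked out by hand for this example string and are not part of the original slide:

```python
def splice(pre_mrna, exons):
    """Join the exon intervals in order, dropping everything between them."""
    return "".join(pre_mrna[start:end] for start, end in exons)

pre_mrna = "frgjjthissentencehjfmkcontainsjunkelm"
exons = [(5, 17), (22, 34)]  # "thissentence" and "containsjunk"

print(splice(pre_mrna, exons))
```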

46
Key player: RNA polymerase
  • It is the enzyme that brings about transcription
    by going down the line, pairing mRNA nucleotides
    with their DNA counterparts.
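The pairing step can be sketched as a per-base lookup. This minimal Python example assumes the template strand is given in the 3' to 5' direction, so the mRNA comes out 5' to 3'; the function names and example sequence are illustrative:

```python
# RNA nucleotide that pairs with each DNA template base (U replaces T).
PAIRING = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3_to_5):
    """Build the mRNA by pairing one RNA nucleotide with each template base."""
    return "".join(PAIRING[base] for base in template_3_to_5)

def coding_to_mrna(coding_5_to_3):
    """Equivalent shortcut: the mRNA equals the coding strand with T -> U."""
    return coding_5_to_3.replace("T", "U")

print(transcribe("TACGGT"))
print(coding_to_mrna("ATGCCA"))
```

Both calls give the same mRNA, because the coding strand is the complement of the template strand.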

47
Promoters
  • Promoters are sequences in the DNA just upstream
    of transcripts that define the sites of
    initiation.
  • The role of the promoter is to attract RNA
    polymerase to the correct start site so
    transcription can be initiated.

5' [Promoter] 3'
49
Transcription: key steps
DNA
  • Initiation
  • Elongation
  • Termination

DNA

RNA
50
Genes can be switched on/off
  • In an adult multicellular organism there is a
    wide variety of cell types, e.g. muscle, nerve
    and blood cells.
  • The different cell types contain the same DNA,
    though.
  • This differentiation arises because different
    cell types express different genes.
  • Promoters are one type of gene regulator.

51
Transcription (recap)
  • The DNA is contained in the nucleus of the cell.
  • A stretch of it unwinds there, and its message
    (or sequence) is copied onto a molecule of mRNA.
  • The mRNA then exits from the cell nucleus.
  • Its destination is a molecular workbench in the
    cytoplasm, a structure called a ribosome.

52
The Transcriptome
  • The transcriptome is the entire set of RNA
    transcripts in the cell, tissue or organ.
  • The transcriptome is cell-type specific and time
    dependent, i.e. it is a function of cell state.
  • The transcriptome can help us understand how
    cells differentiate and respond to changes in
    their environment.

53
Transcriptome complexity
  • Transcripts may be
  • Modified
  • Spliced
  • Edited
  • Degraded
  • Transcriptome is substantially more complex than
    the genome and is time variant.

54
ESTs
  • ESTs were the first genome-wide scan for
    transcriptional elements
  • Different library types
  • Proportional
  • Normalized
  • Subtractive
  • Can be sequenced from the 5' or 3' end

55
Hello Mr Chips
  • Microarray chips were introduced in the 1990s
  • Parallel way to measure many genes
  • Probes placed on slides
  • RNA -> cDNA, labelled with fluorescent dye and
    hybridized.
  • Fluorescence measured
  • Chips have been highly successful
  • Simplified analysis
  • Useful when there is no genome sequence
  • Linear signal across 500 fold variation
  • Standardization has aided use in medical
    diagnostics
  • E.g. Mammaprint

56
Microarray expression profiling by 2-color assay
(cDNA arrays)
Array: PCR products of 6,250 yeast ORFs
Hybridized cDNAs: green = control, red = experiment
Schena et al., 1995
57
Chips: pros and cons
  • Advantages
  • Do not require a genome sequence
  • Highly characterised, with many software packages
    available
  • One Affymetrix chip is FDA approved
  • Disadvantages
  • Measurements limited to what's on the array
  • Hard to distinguish isoforms when used for
    expression
  • Can't detect balanced translocations or
    inversions when used for resequencing

58
mRNA-seq
  • Basic work flow
  • Align reads (sometimes to transcriptome first and
    then the genome)
  • Tally transcript counts
  • Align tags to spliced transcripts
  • Add to transcript counts
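The work flow above can be sketched as two passes over the reads. In this toy Python version, exact substring matching stands in for a real aligner, reads that hit more than one transcript are left unassigned, and the transcript names and sequences are made up for illustration:

```python
from collections import Counter

def tally_reads(reads, transcripts):
    """transcripts: {name: [exon_seq, ...]}. Returns (counts, unassigned reads)."""
    counts = Counter()
    leftover = []
    # Pass 1: reads that fall entirely within a single exon.
    for read in reads:
        hits = [name for name, exons in transcripts.items()
                if any(read in exon for exon in exons)]
        if len(hits) == 1:
            counts[hits[0]] += 1
        else:
            leftover.append(read)
    # Pass 2: remaining reads against the spliced transcripts (exons joined),
    # which picks up reads that span an exon junction.
    unassigned = []
    for read in leftover:
        hits = [name for name, exons in transcripts.items()
                if read in "".join(exons)]
        if len(hits) == 1:
            counts[hits[0]] += 1
        else:
            unassigned.append(read)
    return counts, unassigned

transcripts = {"geneA": ["ACGTAC", "GGATTC"], "geneB": ["TTTTGCA", "CCCGGA"]}
reads = ["CGTA", "ACGG", "TTGC", "GGGG"]
print(tally_reads(reads, transcripts))
```

Here "ACGG" only matches across the geneA exon junction, so it is counted in the second pass, and "GGGG" matches nothing and stays unassigned.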

59
Cloonan et al. 2008
  • Used SOLiD to generate 10 Gb of data from mouse
    embryonic stem cells and embryoid bodies
  • Used a library of exon junctions to map across
    known splice events

60
Distribution of tags
61
Tag locations
62
General issues
  • Coverage across the transcript may not be random
  • Some reads map to multiple locations
  • Some reads don't map at all
  • Reads mapping outside of known exons may
    represent:
  • New gene models
  • New genes

63
Size of the transcriptome
  • Carter et al (2005)
  • Using arrays, estimated 520,000 to 850,000
    transcripts per cell.
  • Use the upper limit and an estimated average
    transcript size of 2 kb
  • Transcriptome: about 2 Gb
  • Transcriptome cost is roughly the genome cost

64
The Boundome
  • DNA binding proteins control genome function
  • Histones impact chromatin structure
  • Activators and repressors impact gene expression
  • The location of these proteins helps us
    understand how the genome works

65
ChIP
66
ChIP-Seq
  • Instead of probing against a chip, measure
    directly
  • Basic work flow
  • Align reads to the genome
  • Identify clusters and peaks
  • Determine bound sites
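The "identify clusters and peaks" step can be illustrated with a very simple binning approach. This Python sketch assumes the reads are already aligned and reduced to start positions on one chromosome; the bin size and the fixed read-count threshold are arbitrary assumptions, whereas real peak callers model the background (often with a control sample) instead:

```python
from collections import Counter

def find_peaks(read_starts, bin_size=100, min_reads=5):
    """Bin read start positions and merge adjacent bins that pass the threshold.

    Returns a list of (start, end, read_count) tuples.
    """
    bins = Counter(pos // bin_size for pos in read_starts)
    enriched = sorted(b for b, n in bins.items() if n >= min_reads)
    peaks = []
    for b in enriched:
        start, end = b * bin_size, (b + 1) * bin_size
        if peaks and peaks[-1][1] == start:
            # Extend the previous peak when this bin touches it.
            prev_start, _, prev_count = peaks[-1]
            peaks[-1] = (prev_start, end, prev_count + bins[b])
        else:
            peaks.append((start, end, bins[b]))
    return peaks

read_starts = [105, 110, 120, 130, 190, 200, 210, 220, 230, 240, 250, 900, 1500]
print(find_peaks(read_starts))
```

Candidate bound sites are then the regions returned, typically refined to the position of maximum read density within each one.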

67
Robertson et al. 2007
  • Used Illumina technology to find STAT1 binding
    sites
  • Comparisons with two ChIP-PCR data sets suggested
    that ChIP-seq sensitivity was between 70% and 92%
    and specificity was at least 95%.

68
Tag statistics
69
Typical Profile
70
Mikkelsen et al., 2007
  • Performed a comparison with ChIP-chip methods:
    98% concordance

71
Comparison with ChIP-seq
72
The Methylome
  • In methylated DNA, cytosines are methylated.
  • This leads to silencing of genes in the region
    e.g. X inactivation
  • It is yet another form of transcriptional control
    and together with histone modifications a key
    component of epigenetics

73
Bi-sulphite sequencing
  • Converts un-methylated cytosines to uracil (which
    becomes thymine when converted to cDNA)
  • Experimental procedure is difficult
  • Sequence alignment is tricky, but the basic
    concepts hold

74
Taylor et al, 2007
  • Targeted sequencing reduced alignment
    difficulties
  • Used dynamic programming to identify alignments
    of sequences against an in silico bisulphite
    converted sequence of the target amplicon regions
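In silico bisulphite conversion and a basic methylation call can be sketched in a few lines of Python. This example assumes forward-strand reads that are already aligned without gaps at the start of the reference, which replaces the dynamic programming alignment step with a direct position-by-position comparison; the function names and sequences are illustrative:

```python
def bisulfite_convert(seq):
    """In silico conversion of a fully unmethylated sequence: every C reads as T."""
    return seq.replace("C", "T")

def call_methylation(reference, read):
    """For each reference C covered by the read, report True if it is still C
    in the read (protected, so methylated) or False if it reads as T (converted)."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base == "C":
            if read_base == "C":
                calls[i] = True
            elif read_base == "T":
                calls[i] = False
    return calls

reference = "ACGTCCGA"
print(bisulfite_convert(reference))
print(call_methylation(reference, "ACGTTCGA"))
```

Aligning reads against the converted reference rather than the original avoids penalising every converted C as a mismatch.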

75
Metagenomics
  • Craig Venter's sequencing of the sea is one of
    the earliest and best-known examples
  • Used Sanger sequencing
  • Many recent studies, including:
  • Angly et al. studied the ocean virome
  • Cox-Foster et al. studied colony collapse
    disorder
  • All use 454 for its longer read length and target
    amplification of 16S or 18S ribosomal subunits