Title: Next Generation Sequencing
1Next Generation Sequencing
2Sequencing techniques
- ChIP-seq
- MBD-seq (MIRA-seq)
- BS-seq
- RNA-seq
- miRNA-seq
3ChIP-seq
- ChIP-Seq is a new frontier technology for analyzing in vivo protein-DNA interactions.
- ChIP-Seq
  - Combination of chromatin immunoprecipitation (ChIP) with ultra-high-throughput massively parallel sequencing
  - Allows mapping of protein-DNA interactions in vivo on a genome scale
4Workflow of ChIP-Seq
Mardis, E.R. Nat. Methods 4, 613-614 (2007)
7The advantages of ChIP-seq
- Current microarray and ChIP-chip designs require knowing the sequence of interest in advance, such as a promoter, enhancer, or RNA-coding domain; ChIP-seq does not.
- Lower cost
- Higher resolution
- Higher accuracy
- Alterations in transcription-factor binding in
response to environmental stimuli can be
evaluated for the entire genome in a single
experiment.
8Sequencers
- Solexa (Illumina)
  - 1 GB of sequences in a single run
  - 35 bases in length
- 454 Life Sciences (Roche Diagnostics)
  - 25-50 MB of sequences in a single run
  - Up to 500 bases in length
- SOLiD (Applied Biosystems)
  - 6 GB of sequences in a single run
  - 35 bases in length
9Illumina Genome Analysis System
10Sequencing
11Sequencer Output
12Sequence Files
- 10-40 million reads per lane
- 500 MB files
13Quality Score Files
- Quality scores describe the confidence of bases in each read
- The Solexa pipeline assigns a quality score to each of the four possible nucleotides for each sequenced base
- 9 million sequences (500 MB file) → 6.5 GB quality score file
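As a rough illustration of how four per-nucleotide scores can be turned into a base call with a single confidence value, here is a minimal sketch. It assumes the scores are on the Solexa log-odds scale, s = 10*log10(p / (1 - p)), and come in the order A, C, G, T; the function names are hypothetical and not part of the Solexa pipeline.

```python
import math

BASES = "ACGT"  # assumed column order of the four scores


def score_to_prob(score):
    """Probability implied by a log-odds score s = 10*log10(p / (1 - p))."""
    return 1.0 / (1.0 + 10.0 ** (-score / 10.0))


def call_base(scores):
    """Pick the nucleotide with the highest score.

    Returns (base, phred), where phred = -10*log10(error probability),
    computed as 10*log10(1 + 10**(s/10)) to avoid cancellation.
    """
    best = max(range(4), key=lambda i: scores[i])
    phred = 10.0 * math.log10(1.0 + 10.0 ** (scores[best] / 10.0))
    return BASES[best], phred


def call_read(per_base_scores):
    """Call every base of a read; qualities are rounded to integers."""
    calls = [call_base(s) for s in per_base_scores]
    return "".join(b for b, _ in calls), [round(q) for _, q in calls]
```

Storing four scores per base instead of one base call is what makes the quality score file so much larger than the sequence file.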
14Bioinformatics Challenges
- Rapid mapping of these short sequence reads to the reference genome
- Visualize mapping results
  - Thousands of enriched regions
- Peak analysis
  - Peak detection
  - Finding exact binding sites
- Compare results of different experiments
  - Normalization
  - Statistical tests
15Mapping of Short Oligonucleotides to the
Reference Genome
- Mapping Methods
  - Need to allow mismatches and gaps
    - SNP locations
    - Sequencing errors
    - Reading errors
  - Indexing and hashing
    - genome
    - oligonucleotide reads
  - Use of quality scores
  - Use of SNP knowledge
- Performance
  - Partitioning the genome or sequence reads
16Mapping Methods: Indexing the Genome
- Fast sequence similarity search algorithms (like BLAST)
  - Not specifically designed for mapping millions of query sequences
  - Take a very long time
  - e.g. 2 days to map half a million sequences to a 70 MB reference genome (using BLAST)
- Indexing the genome is memory expensive
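To make the memory cost of genome indexing concrete, here is a minimal sketch of a k-mer hash index over a reference sequence. It is an illustration only, not the data structure of any particular mapper; `build_kmer_index` and `find_exact` are hypothetical names, and only exact matches are handled.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def find_exact(read, genome, index, k):
    """Return genome positions where the read matches exactly.

    The first k bases of the read are looked up in the index, and each
    candidate position is then verified against the full read.
    """
    hits = []
    for pos in index.get(read[:k], []):
        if genome.startswith(read, pos):
            hits.append(pos)
    return hits
```

The index holds one entry per genome position, so its size grows with the genome rather than with the number of reads, which is why this approach needs a lot of RAM for a mammalian genome.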
18SOAP (Li et al, 2008)
- Both reads and reference genome are converted to a numeric data type using 2-bits-per-base coding
- The reference genome is loaded into memory
  - For the human genome, 14 GB of RAM is required to store the reference sequences and index tables
- 300 (gapped) to 1200 (ungapped) times faster than BLAST
- Allows up to 2 mismatches or one continuous gap of 1-3 bp
- Errors accumulate during the sequencing process
  - Much higher number of sequencing errors at the 3'-end (sometimes making the reads unalignable to the reference genome)
  - Iteratively trim several base pairs at the 3'-end and redo the alignment
  - Improves sensitivity
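The two ideas on this slide, 2-bits-per-base coding and iterative 3'-end trimming, can be sketched together as follows. This is not SOAP's implementation: the function names are hypothetical, the brute-force genome scan stands in for SOAP's index tables, gaps are not handled, and the trim step and minimum length are arbitrary assumptions.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2 bits per base


def encode(seq):
    """Pack a DNA string into an int, first base in the highest bits."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def mismatches(a, b, length):
    """Count differing bases between two encoded sequences of equal length."""
    diff = a ^ b
    # Fold each 2-bit pair into its low bit: set if either bit differs.
    low_mask = int("01" * length, 2)
    flags = (diff | (diff >> 1)) & low_mask
    return bin(flags).count("1")


def align_with_trimming(read, genome, max_mismatches=2, trim_step=2, min_length=8):
    """Place the read on the genome; on failure, trim the 3' end and retry.

    Returns (position, aligned_length) for the first acceptable hit,
    or None if the read cannot be placed even after trimming.
    """
    length = len(read)
    while length >= min_length:
        query = encode(read[:length])
        for pos in range(len(genome) - length + 1):
            target = encode(genome[pos:pos + length])
            if mismatches(query, target, length) <= max_mismatches:
                return pos, length
        length -= trim_step  # drop error-prone bases from the 3' end
    return None
```

With exact matching (`max_mismatches=0`), a read whose last four bases are wrong fails at full length and at length 10, then aligns once it has been trimmed to 8 bases.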
19Mapping Methods: Indexing the Oligonucleotide Reads
- ELAND (Cox, unpublished)
  - Efficient Large-Scale Alignment of Nucleotide Databases (Solexa Ltd.)
- SeqMap (Jiang, 2008)
  - Mapping massive amount of oligonucleotides to the genome
- RMAP (Smith, 2008)
  - Using quality scores and longer reads improves accuracy of Solexa read mapping
- MAQ (Li, 2008)
  - Mapping short DNA sequencing reads and calling variants using mapping quality scores
20Mapping Algorithm (2 mismatches)
- Partition reads into 4 seeds: A, B, C, D
  - With at most 2 mismatches, at least 2 seeds must match with no mismatches
- Scan the genome to identify locations where the seeds match exactly
  - 6 possible combinations of the seeds to search
  - AB, CD, AC, BD, AD, BC
  - 6 scans to find all candidates
- Do approximate matching around the exactly-matching seeds
  - Determine all targets for the reads
  - Ins/del can be incorporated
- The reads are indexed and hashed before scanning the genome
- Bit operations are used to accelerate mapping
  - Each nt encoded into 2 bits
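The seed scheme above can be sketched in a few lines. This is a simplified illustration under stated assumptions: all reads have the same length, the length is divisible by 4, plain strings are used instead of 2-bit words, and insertions/deletions are ignored. `map_reads` and `hamming` are hypothetical names, not functions from ELAND or SeqMap.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_reads(reads, genome, max_mismatches=2):
    """Map equal-length reads with up to 2 mismatches using 4 seeds.

    If a read has at most 2 mismatches, at least 2 of its 4 seeds match
    exactly, so hashing the reads by each of the 6 seed pairs and scanning
    the genome once per pair finds every candidate location.
    """
    length = len(reads[0])
    if length % 4 != 0 or any(len(r) != length for r in reads):
        raise ValueError("reads must share one length divisible by 4")
    seed_len = length // 4

    def seed_pair(seq, i, j):
        return (seq[i * seed_len:(i + 1) * seed_len],
                seq[j * seed_len:(j + 1) * seed_len])

    hits = defaultdict(set)  # read -> genome positions
    for i, j in combinations(range(4), 2):  # AB, AC, AD, BC, BD, CD
        # Index the reads (not the genome) by the contents of seeds i and j.
        table = defaultdict(list)
        for read in reads:
            table[seed_pair(read, i, j)].append(read)
        # One genome scan per seed pair.
        for pos in range(len(genome) - length + 1):
            window = genome[pos:pos + length]
            for read in table.get(seed_pair(window, i, j), []):
                # Approximate matching around the exactly-matching seeds.
                if hamming(read, window) <= max_mismatches:
                    hits[read].add(pos)
    return {read: sorted(positions) for read, positions in hits.items()}
```

Because the hash tables are built from the reads, memory use scales with the number of reads rather than with the genome size, at the price of six passes over the genome.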
21ELAND (Cox, unpublished)
- Commercial sequence mapping program that comes with the Solexa machine
- Allows at most 2 mismatches
- Maps sequences up to 32 nt in length
- All sequences have to be the same length
24RMAP (Smith et al, 2008)
- Improve mapping accuracy
  - Possible sequencing errors at 3'-ends of longer reads
  - Base-call quality scores
- Use of base-call quality scores
  - Quality cutoff
  - High-quality positions are checked for mismatches
  - Low-quality positions always induce a match
- Quality control step eliminates reads with too many low-quality positions
- Allows any number of mismatches
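The quality-aware comparison described above can be sketched as follows. This is a minimal illustration of the idea, not RMAP's code: the function names, the cutoff, and the low-quality limit are hypothetical, and qualities are assumed to be one integer per base where higher means more reliable.

```python
def passes_quality_control(quals, cutoff, max_low_quality):
    """Reject reads that have too many positions below the quality cutoff."""
    low = sum(q < cutoff for q in quals)
    return low <= max_low_quality


def quality_aware_mismatches(read, quals, target, cutoff):
    """Count mismatches at high-quality positions only.

    Positions below the cutoff are treated as matching whatever the
    reference base is, so they never contribute to the mismatch count.
    """
    return sum(
        1
        for base, q, ref in zip(read, quals, target)
        if q >= cutoff and base != ref
    )
```

Since low-quality positions act as wildcards, the quality control step matters: without it, a read made up mostly of low-quality bases would match almost anywhere.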
27Visualization
- BED files are built to summarize mapping results
- BED files can be easily visualized in the UCSC Genome Browser
- http://genome.ucsc.edu
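As an illustration of what such a BED summary might look like, the sketch below formats mapped reads as six-column BED records (chromosome, 0-based start, exclusive end, name, score, strand) under a custom track line. The input tuple layout and the function names are assumptions made for this example.

```python
def bed_line(chrom, start, length, name, strand, score=0):
    """Format one mapped read as a 6-column BED record.

    BED uses a 0-based start and an exclusive end coordinate.
    """
    end = start + length
    return "\t".join([chrom, str(start), str(end), name, str(score), strand])


def to_bed(hits, track_name="mapped_reads"):
    """Build BED text from (chrom, start, length, name, strand) tuples."""
    lines = [f'track name="{track_name}"']
    lines.extend(bed_line(*hit) for hit in hits)
    return "\n".join(lines) + "\n"
```

For example, `to_bed([("chr1", 100, 35, "read1", "+")])` yields a track line followed by one record spanning chr1 positions 100 to 135; the resulting text can be saved to a file and uploaded as a custom track.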
28Visualization: Genome Browser
Robertson, G. et al. Nat. Methods 4, 651-657
(2007)
29Visualization: Custom
300 kb region from mouse ES cells
Mikkelsen,T.S. et al. Nature 448, 553-562 (2007)
30Screen shot for ZNF263 peaks
Frietze et al JBC 2010
31ChIP-seq peak analysis programs
- SISSRs (Site Identification from Short Sequence Reads) Jothi et al. NAR, 2008.
- MACS (Model-based Analysis of ChIP-Seq) Zhang et al. Genome Biology, 2008.
- QuEST (Genome-wide analysis of transcription factor binding sites based on ChIP-seq data) Valouev, A. et al. Nature Methods, 2008.
- PeakSeq (PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls) Rozowsky, J. et al. Nature Biotech. 2009.
- FindPeaks (FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology) Fejes, A.P. et al. Bioinformatics, 2008.
- Hpeak (An HMM-based algorithm for defining read-enriched regions from massive parallel sequencing data) Xu et al. Bioinformatics, 2008.
32MBD-seq (MIRA-seq)
- The methyl-CpG binding domain-based capture (MBDCap) technology is used to capture methylated sites. Double-stranded methylated DNA fragments can be detected, and the method is sensitive to different methylation densities.
- Genome-wide sequencing is used to obtain the sequence of each short fragment.
- The sequenced reads are mapped to the human genome to find their locations.
33Application on MBD-seq data (MCF7)
Lan et al Unpublished
34BS-seq
- BS-seq: genomic DNA is treated with sodium bisulphite (BS) to convert cytosine, but not methylcytosine, to uracil, followed by high-throughput sequencing.
- Truly single-base resolution
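The logic behind the single-base resolution can be shown with a small simulation. This is a sketch under simplifying assumptions: only one strand is considered, conversion is taken to be complete, uracil is reported as T (as it appears after amplification and sequencing), and the read is already aligned to the reference without mismatches. The function names are hypothetical.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate the sequenced read-out of a bisulfite-treated strand.

    Unmethylated C is converted to U and read as T; methylated C stays C.
    """
    out = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated_positions:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


def call_methylation(reference, read):
    """Positions where a reference C is still C in the read (inferred methylated)."""
    return [
        i
        for i, (ref, base) in enumerate(zip(reference, read))
        if ref == "C" and base == "C"
    ]
```

Comparing each reference cytosine with the corresponding read base gives a methylation call per cytosine, which is where the per-base resolution comes from.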
35RNA-seq
- RNA-Seq is a new approach to transcriptome
profiling that uses deep-sequencing technologies.
- Studies using this method have already altered
our view of the extent and complexity of
eukaryotic transcriptomes. RNA-Seq also provides
a far more precise measurement of levels of
transcripts and their isoforms than other methods.
36RNA-seq protocol
37The advantages of RNA-seq
- Single base resolution
- High throughput
- Low background noise
- Ability to distinguish different isoforms and allelic expression
- Relatively low cost