Title: ChIP Sequencing BMI/IBGP 730
1ChIP Sequencing BMI/IBGP 730
- Victor Jin, Ph.D.
- (Slides from Dr. H. Gulcin Ozer)
- Department of Biomedical Informatics
2What is ChIP-Sequencing?
- ChIP-Sequencing is a new frontier technology to
analyze protein interactions with DNA.
- ChIP-Seq
- Combination of chromatin immunoprecipitation
(ChIP) with ultra high-throughput massively
parallel sequencing
- Allows mapping of protein-DNA interactions in vivo
on a genome scale
3Workflow of ChIP-Seq
Mardis, E.R. Nat. Methods 4, 613-614 (2007)
4Workflow of ChIP-Seq
6Johnson et al, 2007
- ChIP-Seq technology is used to understand in vivo
binding of the neuron-restrictive silencer factor
(NRSF)
- Results are compared to known binding sites
- ChIP-Seq signals strongly agree with the
existing knowledge
- Sharp resolution of binding position
- New noncanonical NRSF binding motifs are
identified
8Robertson et al, 2007
- ChIP-Seq technology is used to study genome-wide
profiles of STAT1 DNA association
- STAT1 targets in interferon-γ-stimulated and
unstimulated human HeLa S3 cells are compared
- The performance of ChIP-Seq is compared to the
alternative protein-DNA interaction methods of
ChIP-PCR and ChIP-chip.
- 41,582 and 11,004 putative STAT1 binding regions
are identified in stimulated and unstimulated
cells, respectively.
9Why ChIP-Sequencing?
- Current microarray and ChIP-chip designs require
knowing the sequence of interest, such as a promoter,
enhancer, or RNA-coding domain.
- Lower cost
- Less work in ChIP-Seq
- Higher accuracy
- Alterations in transcription-factor binding in
response to environmental stimuli can be
evaluated for the entire genome in a single
experiment.
11Sequencers
- Solexa (Illumina)
- 1 GB of sequences in a single run
- 35 bases in length
- 454 Life Sciences (Roche Diagnostics)
- 25-50 MB of sequences in a single run
- Up to 500 bases in length
- SOLiD (Applied Biosystems)
- 6 GB of sequences in a single run
- 35 bases in length
12Illumina Genome Analysis System
13Sequencing
14Sequencer Output
15Sequence Files
- 10 million sequences per lane
- 500 MB files
16Quality Score Files
- Quality scores describe the confidence of bases
in each read
- Solexa pipeline assigns a quality score to the
four possible nucleotides for each sequenced base
- 9 million sequences (500 MB file) → 6.5 GB quality
score file
17Bioinformatics Challenges
- Rapid mapping of these short sequence reads to
the reference genome
- Visualize mapping results
- Thousands of enriched regions
- Peak analysis
- Peak detection
- Finding exact binding sites
- Compare results of different experiments
- Normalization
- Statistical tests
18Mapping of Short Oligonucleotides to the
Reference Genome
- Mapping Methods
- Need to allow mismatches and gaps
- SNP locations
- Sequencing errors
- Reading errors
- Indexing and hashing
- genome
- oligonucleotide reads
- Use of quality scores
- Use of SNP knowledge
- Performance
- Partitioning the genome or sequence reads
19Mapping Methods: Indexing the Genome
- Fast sequence similarity search algorithms (like
BLAST)
- Not specifically designed for mapping millions of
query sequences
- Take a very long time
- e.g. 2 days to map half a million sequences to a 70 MB
reference genome (using BLAST)
- Indexing the genome is memory expensive
21SOAP (Li et al, 2008)
- Both reads and reference genome are converted to
numeric data type using 2-bits-per-base coding
- Load reference genome into memory
- For human genome, 14 GB RAM required for storing
reference sequences and index tables
- 300 (gapped) to 1200 (ungapped) times faster than
BLAST
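The 2-bits-per-base coding can be sketched as follows. This is a minimal illustration, not SOAP's actual code: the mapping A=0, C=1, G=2, T=3 and the function names are assumptions made for the example.

```python
# Hypothetical 2-bit code; the mapping A=0, C=1, G=2, T=3 is an assumption.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def encode(seq):
    """Pack a DNA string into one integer, 2 bits per base (first base in the high bits)."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def decode(value, length):
    """Unpack an integer produced by encode() back into a DNA string."""
    out = []
    for _ in range(length):
        out.append(BASES[value & 3])
        value >>= 2
    return "".join(reversed(out))


def count_mismatches(a, b, length):
    """Count differing bases between two encoded sequences of equal length."""
    diff = a ^ b  # a non-zero 2-bit group marks a mismatching base
    mismatches = 0
    for _ in range(length):
        if diff & 3:
            mismatches += 1
        diff >>= 2
    return mismatches
```

Once both sequences are packed this way, a single XOR compares all bases at once, which is one reason numeric coding speeds up alignment.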
22SOAP (Li et al, 2008)
- 2 mismatches or 1-3 bp continuous gap
- Errors accumulate during the sequencing process
- Much higher number of sequencing errors at the
3'-end (sometimes making the reads unalignable to
the reference genome)
- Iteratively trim several base pairs at the 3'-end
and redo the alignment
- Improve sensitivity
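The trim-and-retry idea can be sketched as below. This is a toy illustration under stated assumptions, not SOAP's implementation: the aligner is a brute-force ungapped scan, and `trim_step`, `min_length` and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_hit(read, genome, max_mismatches=2):
    """Return (position, mismatches) of the best ungapped hit, or None."""
    best = None
    for pos in range(len(genome) - len(read) + 1):
        mm = hamming(read, genome[pos:pos + len(read)])
        if mm <= max_mismatches and (best is None or mm < best[1]):
            best = (pos, mm)
    return best


def align_with_trimming(read, genome, trim_step=2, min_length=12, max_mismatches=2):
    """Try the full read first; on failure, trim bases from the 3' end and retry.

    Returns (position, aligned_length, mismatches) or None.
    """
    length = len(read)
    while length >= min_length:
        hit = best_hit(read[:length], genome, max_mismatches)
        if hit is not None:
            return hit[0], length, hit[1]
        length -= trim_step  # drop the error-prone 3' bases and redo the alignment
    return None
```

A read whose last few bases are wrong fails at full length but aligns once those bases are trimmed, which is how trimming improves sensitivity.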
23Mapping Methods: Indexing the Oligonucleotide
Reads
- ELAND (Cox, unpublished)
- Efficient Large-Scale Alignment of Nucleotide
Databases (Solexa Ltd.)
- SeqMap (Jiang, 2008)
- Mapping massive amount of oligonucleotides to
the genome
- RMAP (Smith, 2008)
- Using quality scores and longer reads improves
accuracy of Solexa read mapping
- MAQ (Li, 2008)
- Mapping short DNA sequencing reads and calling
variants using mapping quality scores
24Mapping Algorithm (2 mismatches)
GATGCATTGCTATGCCTCCCAGTCCGCAACTTCACG
25Mapping Algorithm (2 mismatches)
- Partition reads into 4 seeds: A, B, C, D
- At least 2 seeds must map with no mismatches
- Scan genome to identify locations where the seeds
match exactly
- 6 possible combinations of the seeds to search
- AB, CD, AC, BD, AD, BC
- 6 scans to find all candidates
- Do approximate matching around the
exactly-matching seeds.
- Determine all targets for the reads
- Ins/del can be incorporated
- The reads are indexed and hashed before scanning
genome
- Bit operations are used to accelerate mapping
- Each nt encoded into 2 bits
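The seed-pair strategy can be sketched as follows. With at most 2 mismatches spread over 4 seeds, at least 2 seeds are mismatch-free, so the 6 seed pairs cover every case. This is a simplified illustration, not the code of any named tool: it uses plain strings and Python dictionaries instead of 2-bit coding, ignores insertions and deletions, and the function names are hypothetical.

```python
from itertools import combinations


def split_seeds(read, n_seeds=4):
    """Split a read into n_seeds contiguous seeds (the last seed takes any remainder)."""
    size = len(read) // n_seeds
    bounds = [i * size for i in range(n_seeds)] + [len(read)]
    return [read[bounds[i]:bounds[i + 1]] for i in range(n_seeds)]


def map_reads(reads, genome, max_mismatches=2):
    """Map equal-length reads to the genome, allowing up to max_mismatches substitutions."""
    read_len = len(reads[0])
    hits = {read: set() for read in reads}
    # One genome scan per seed pair: AB, AC, AD, BC, BD, CD.
    for i, j in combinations(range(4), 2):
        # Index the reads (not the genome) by the contents of seeds i and j.
        index = {}
        for read in reads:
            seeds = split_seeds(read)
            index.setdefault((seeds[i], seeds[j]), []).append(read)
        for pos in range(len(genome) - read_len + 1):
            window = genome[pos:pos + read_len]
            wseeds = split_seeds(window)
            # Candidate reads share both seeds exactly; verify the full read.
            for read in index.get((wseeds[i], wseeds[j]), []):
                mm = sum(a != b for a, b in zip(read, window))
                if mm <= max_mismatches:
                    hits[read].add((pos, mm))
    return {read: sorted(found) for read, found in hits.items()}
```

Because the hash table holds the reads, memory grows with the number of reads rather than with the genome size, and the genome is streamed once per seed pair.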
26ELAND (Cox, unpublished)
- Commercial sequence mapping program that comes with
the Solexa machine
- Allows at most 2 mismatches
- Maps sequences up to 32 nt in length
- All sequences have to be the same length
29RMAP (Smith et al, 2008)
- Improve mapping accuracy
- Possible sequencing errors at 3'-ends of longer
reads
- Base-call quality scores
- Use of base-call quality scores
- Quality cutoff
- High quality positions are checked for mismatches
- Low quality positions always induce a match
- Quality control step eliminates reads with too
many low quality positions
- Allow any number of mismatches
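The quality-cutoff rule can be sketched as follows. This is a minimal illustration of the idea described above, not RMAP's actual code; the cutoff of 20, the limit of 5 low-quality positions and the function names are assumptions.

```python
def quality_aware_mismatches(read, quals, ref, cutoff=20):
    """Count mismatches only at high-quality positions.

    Positions with quality below the cutoff always count as a match.
    """
    return sum(
        1 for base, q, r in zip(read, quals, ref)
        if q >= cutoff and base != r
    )


def passes_quality_control(quals, cutoff=20, max_low_quality=5):
    """Reject reads that have too many low-quality positions."""
    low = sum(1 for q in quals if q < cutoff)
    return low <= max_low_quality
```

The quality control step matters because a read made up mostly of low-quality positions would otherwise match almost anywhere.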
32Bioinformatics Challenges
- Rapid mapping of these short sequence reads to
the reference genome
- Visualize mapping results
- Thousands of enriched regions
- Peak analysis
- Peak detection
- Finding exact binding sites
- Compare results of different experiments
- Normalization
- Statistical tests
33Visualization
- BED files are built to summarize mapping results
- BED files can be easily visualized in Genome
Browser
- http://genome.ucsc.edu
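Writing mapping results as BED lines can be sketched as follows. The input tuple layout and the function name are hypothetical; the output uses the standard 6-column BED layout (chrom, start, end, name, score, strand) with 0-based starts and exclusive ends.

```python
def reads_to_bed(hits, read_length):
    """Format mapped reads as 6-column BED lines.

    hits: iterable of (chrom, start, strand, name) with a 0-based start.
    """
    lines = []
    for chrom, start, strand, name in hits:
        end = start + read_length  # the BED end coordinate is exclusive
        lines.append(f"{chrom}\t{start}\t{end}\t{name}\t0\t{strand}")
    return "\n".join(lines)
```

The score column is set to 0 here as a placeholder; a mismatch count or mapping score could be stored there instead.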
34Visualization: Genome Browser
Robertson, G. et al. Nat. Methods 4, 651-657
(2007)
35Visualization: Custom
300 kb region from mouse ES cells
Mikkelsen,T.S. et al. Nature 448, 553-562 (2007)
36Visualization
Huang, 2008 (unpublished)
37Huang, 2008 (unpublished)
38Bioinformatics Challenges
- Rapid mapping of these short sequence reads to
the reference genome
- Visualize mapping results
- Thousands of enriched regions
- Peak analysis
- Peak detection
- Finding exact binding sites
- Compare results of different experiments
- Normalization
- Statistical tests
39Peak Analysis
- Peak Detection
- ChIP-Peak Analysis Module (Swiss Institute of
Bioinformatics)
- ChIPSeq Peak Finder (Wold Lab, Caltech)
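The general idea of peak detection can be sketched with a simple binned read count. This is a generic illustration only and not the algorithm of either tool named above; the bin size, the read threshold and the function name are assumptions.

```python
from collections import Counter


def find_enriched_regions(read_starts, bin_size=100, min_reads=5):
    """Return merged (start, end, read_count) regions for bins with >= min_reads reads."""
    counts = Counter(pos // bin_size for pos in read_starts)
    enriched = sorted(b for b, c in counts.items() if c >= min_reads)
    regions = []
    for b in enriched:
        if regions and regions[-1][1] == b * bin_size:
            # Extend the previous region when this bin is directly adjacent to it.
            start, _, total = regions[-1]
            regions[-1] = (start, (b + 1) * bin_size, total + counts[b])
        else:
            regions.append((b * bin_size, (b + 1) * bin_size, counts[b]))
    return regions
```

A fixed count threshold like this does not model background noise, which is where statistical peak callers go further.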
42Peak Analysis
- Finding Exact Binding Site
- Determining the exact binding sites from short
reads generated from ChIP-Seq experiments
- SISSRs (Site Identification from Short Sequence
Reads) (Jothi 2008)
- MACS (Model-based Analysis of ChIP-Seq) (Zhang et
al, 2008)
43Bioinformatics Challenges
- Rapid mapping of these short sequence reads to
the reference genome
- Visualize mapping results
- Thousands of enriched regions
- Peak analysis
- Peak detection
- Finding exact binding sites
- Compare results of different experiments
- Normalization
- Statistical tests
44Compare Samples
Huang, 2008 (unpublished)
45Compare Samples
- Fold change
- HPeak: An HMM-based algorithm for defining
read-enriched regions from massive parallel
sequencing data
- Xu et al, 2008
- Advanced statistics
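A normalized fold change between two samples can be sketched as follows. This is one common simple approach, not the method of HPeak or of the unpublished work cited above; scaling to reads per million, the pseudocount and the function name are assumptions.

```python
def normalized_fold_change(count_a, total_a, count_b, total_b, pseudocount=1.0):
    """Fold change of sample A over sample B for one region.

    Counts are scaled to reads per million mapped reads so that samples
    sequenced to different depths are comparable. The pseudocount avoids
    division by zero when a region has no reads in sample B.
    """
    rpm_a = (count_a + pseudocount) * 1e6 / total_a
    rpm_b = (count_b + pseudocount) * 1e6 / total_b
    return rpm_a / rpm_b
```

Without the depth normalization, a sample with twice as many mapped reads would appear enriched everywhere.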
46QUESTIONS?