ChIP Sequencing BMI/IBGP 730

Description:

ChIP Sequencing BMI/IBGP 730 Victor Jin, Ph.D. (Slides from Dr. H. Gulcin Ozer) Department of Biomedical Informatics ... – PowerPoint PPT presentation

1
ChIP Sequencing BMI/IBGP 730

Victor Jin, Ph.D.
(Slides from Dr. H. Gulcin Ozer)
Department of Biomedical Informatics

2
What is ChIP-Sequencing?

ChIP-Sequencing is a new frontier technology to
analyze protein interactions with DNA.
ChIP-Seq
Combination of chromatin immunoprecipitation
(ChIP) with ultra high-throughput massively
parallel sequencing
Allows mapping of protein-DNA interactions in vivo
on a genome scale

3
Workflow of ChIP-Seq
Mardis, E.R. Nat. Methods 4, 613-614 (2007)
4
Workflow of ChIP-Seq
5
(No Transcript)
6
Johnson et al, 2007

ChIP-Seq technology is used to understand in vivo
binding of the neuron-restrictive silencer factor
(NRSF)
Results are compared to known binding sites
ChIP-Seq signals strongly agree with existing
knowledge
Sharp resolution of binding position
New noncanonical NRSF binding motifs are
identified

7
(No Transcript)
8
Robertson et al, 2007

ChIP-Seq technology is used to study genome-wide
profiles of STAT1 DNA association
STAT1 targets in interferon-γ-stimulated and
unstimulated human HeLa S3 cells are compared
The performance of ChIP-Seq is compared to the
alternative protein-DNA interaction methods of
ChIP-PCR and ChIP-chip.
41,582 and 11,004 putative STAT1 binding regions
are identified in stimulated and unstimulated
cells, respectively.

9
Why ChIP-Sequencing?

Current microarray and ChIP-chip designs require
knowing the sequence of interest in advance, such
as a promoter, enhancer, or RNA-coding domain.
Lower cost
Less work in ChIP-Seq
Higher accuracy
Alterations in transcription-factor binding in
response to environmental stimuli can be
evaluated for the entire genome in a single
experiment.

10
(No Transcript)
11
Sequencers

Solexa (Illumina)
1 GB of sequences in a single run
35 bases in length
454 Life Sciences (Roche Diagnostics)
25-50 MB of sequences in a single run
Up to 500 bases in length
SOLiD (Applied Biosystems)
6 GB of sequences in a single run
35 bases in length

12
Illumina Genome Analysis System
13
Sequencing
14
Sequencer Output
15
Sequence Files

10 million sequences per lane
500 MB files

16
Quality Score Files

Quality scores describe the confidence of bases
in each read
Solexa pipeline assigns a quality score to the
four possible nucleotides for each sequenced base
9 million sequences (500 MB file) → 6.5 GB quality
score file
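
To make the idea of per-base confidence concrete, here is a minimal sketch of how four per-nucleotide scores could be turned into a base call and an error probability. It assumes the early Solexa scoring convention Q = -10 log10(p / (1 - p)) and an A, C, G, T ordering of the four scores; the function names and the input layout are hypothetical and not taken from the slides.

```python
import math

def solexa_to_error_prob(q_solexa):
    """Error probability p for a Solexa-style score Q = -10*log10(p / (1 - p))."""
    return 1.0 / (1.0 + 10 ** (q_solexa / 10.0))

def solexa_to_phred(q_solexa):
    """Equivalent Phred-style score, Q = -10*log10(p)."""
    return -10.0 * math.log10(solexa_to_error_prob(q_solexa))

def call_bases(scores_per_cycle, order="ACGT"):
    """Pick, for each sequencing cycle, the nucleotide with the highest score.

    scores_per_cycle: list of 4-tuples, one score per nucleotide (assumed order A, C, G, T).
    """
    called = []
    for scores in scores_per_cycle:
        best_index = max(range(4), key=lambda i: scores[i])
        called.append(order[best_index])
    return "".join(called)
```

Storing four scores for every base of every read is one reason the quality file is so much larger than the sequence file.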

17
Bioinformatics Challenges

Rapid mapping of these short sequence reads to
the reference genome
Visualize mapping results
Thousands of enriched regions
Peak analysis
Peak detection
Finding exact binding sites
Compare results of different experiments
Normalization
Statistical tests

18
Mapping of Short Oligonucleotides to the
Reference Genome

Mapping Methods
Need to allow mismatches and gaps
SNP locations
Sequencing errors
Reading errors
Indexing and hashing
genome
oligonucleotide reads
Use of quality scores
Use of SNP knowledge
Performance
Partitioning the genome or sequence reads

19
Mapping Methods: Indexing the Genome

Fast sequence similarity search algorithms (like
BLAST)
Not specifically designed for mapping millions of
query sequences
Take a very long time
e.g. 2 days to map half a million sequences to a
70 MB reference genome (using BLAST)
Indexing the genome is memory expensive

20
(No Transcript)
21
SOAP (Li et al, 2008)

Both reads and reference genome are converted to
numeric data type using 2-bits-per-base coding
Load reference genome into memory
For human genome, 14GB RAM required for storing
reference sequences and index tables
300 (gapped) to 1,200 (ungapped) times faster than
BLAST
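
As an illustration of 2-bits-per-base coding, the sketch below packs a DNA string into an integer and counts mismatches with bit operations. The particular bit assignment (A=00, C=01, G=10, T=11) and the function names are assumptions made for this example; SOAP's internal encoding may differ.

```python
# Hypothetical 2-bit code; the mapping actually used by SOAP may differ.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def encode_2bit(seq):
    """Pack a DNA string into an integer, 2 bits per base (first base in the highest bits)."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | BASE_TO_BITS[base]
    return value

def count_mismatches(a, b, length):
    """Count differing bases between two encoded sequences of `length` bases (length >= 1)."""
    diff = a ^ b
    # Mask selecting the low bit of every 2-bit slot: 0b0101...01.
    low_mask = int("01" * length, 2)
    # A base differs if either of its two bits differs, so fold the high bit onto the low bit.
    per_base = (diff | (diff >> 1)) & low_mask
    return bin(per_base).count("1")
```

For example, `count_mismatches(encode_2bit("ACGT"), encode_2bit("ACGA"), 4)` compares four bases with a single XOR instead of four character comparisons.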

22
SOAP (Li et al, 2008)

Allows up to 2 mismatches or one continuous gap
of 1-3 bp
Errors accumulate during the sequencing process
Many more sequencing errors occur at the 3'-end,
which can make reads unalignable to the reference
genome
Iteratively trim several base pairs at the 3'-end
and redo the alignment
Improve sensitivity
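
The trim-and-retry idea can be sketched as follows. This is a simplified, brute-force illustration (ungapped, linear scan of the genome), not SOAP's implementation; the parameter names `trim_step` and `min_length` and their defaults are assumptions.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def align_with_trimming(read, genome, max_mismatches=2, trim_step=2, min_length=8):
    """Try the full read first; on failure, trim bases from the 3' end and retry.

    Returns (position, aligned_length) for the first acceptable hit, or None.
    """
    length = len(read)
    while length >= min_length:
        query = read[:length]  # keep the 5' part, drop the error-prone 3' end
        for pos in range(len(genome) - length + 1):
            if hamming(query, genome[pos:pos + length]) <= max_mismatches:
                return pos, length
        length -= trim_step
    return None
```

A read whose last few bases are wrong fails at full length but aligns once those bases are trimmed, which is how trimming improves sensitivity.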

23
Mapping Methods: Indexing the Oligonucleotide
Reads

ELAND (Cox, unpublished)
Efficient Large-Scale Alignment of Nucleotide
Databases (Solexa Ltd.)
SeqMap (Jiang, 2008)
Mapping massive amount of oligonucleotides to
the genome
RMAP (Smith, 2008)
Using quality scores and longer reads improves
accuracy of Solexa read mapping
MAQ (Li, 2008)
Mapping short DNA sequencing reads and calling
variants using mapping quality scores

24
Mapping Algorithm (2 mismatches)
GATGCATTGCTATGCCTCCCAGTCCGCAACTTCACG
25
Mapping Algorithm (2 mismatches)

Partition reads into 4 seeds A,B,C,D
At least 2 seeds must match with no mismatches
Scan genome to identify locations where the seeds
match exactly
6 possible combinations of the seeds to search
AB, CD, AC, BD, AD, BC
6 scans to find all candidates
Do approximate matching around the
exactly-matching seeds.
Determine all targets for the reads
Insertions/deletions can be incorporated
The reads are indexed and hashed before scanning
genome
Bit operations are used to accelerate mapping
Each nt encoded into 2-bits
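
The seed strategy above can be sketched in a few lines. With at most 2 mismatches spread over 4 seeds, at least 2 seeds must be error-free, so hashing the reads by each of the 6 seed pairs and scanning the genome once per pair finds every candidate. This is a minimal illustration using plain strings rather than 2-bit encoding; the function name, the equal-length assumption, and the way the remainder is assigned to seed D are choices made for this example.

```python
from collections import defaultdict
from itertools import combinations

def map_reads(reads, genome, max_mismatches=2):
    """Map equal-length reads to a genome, allowing up to max_mismatches substitutions.

    Returns {read_index: sorted list of (position, mismatches)}.
    """
    read_len = len(reads[0])  # all reads are assumed to have the same length
    q = read_len // 4
    # Seed boundaries A, B, C, D; the last seed takes any remainder.
    bounds = [(0, q), (q, 2 * q), (2 * q, 3 * q), (3 * q, read_len)]
    hits = defaultdict(set)
    # Six seed pairs (AB, AC, AD, BC, BD, CD), one genome scan per pair.
    for (a0, a1), (b0, b1) in combinations(bounds, 2):
        index = defaultdict(list)  # hash the reads by this pair of seeds
        for rid, read in enumerate(reads):
            index[(read[a0:a1], read[b0:b1])].append(rid)
        for pos in range(len(genome) - read_len + 1):
            window = genome[pos:pos + read_len]
            key = (window[a0:a1], window[b0:b1])
            for rid in index.get(key, ()):
                # Both seeds match exactly; verify the rest of the read.
                mismatches = sum(x != y for x, y in zip(reads[rid], window))
                if mismatches <= max_mismatches:
                    hits[rid].add((pos, mismatches))
    return {rid: sorted(found) for rid, found in hits.items()}
```

Indexing the reads rather than the genome keeps the hash tables small, at the cost of scanning the whole genome once per seed pair.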

26
ELAND (Cox, unpublished)

Commercial sequence mapping program that comes
with the Solexa machine
Allows at most 2 mismatches
Maps sequences up to 32 nt in length
All sequences have to be the same length

27
(No Transcript)
28
(No Transcript)
29
RMAP (Smith et al, 2008)

Improve mapping accuracy
Possible sequencing errors at 3'-ends of longer
reads
Base-call quality scores
Use of base-call quality scores
Quality cutoff
High quality positions are checked for mismatches
Low quality positions always induce a match
Quality control step eliminates reads with too
many low quality positions
Allow any number of mismatches
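
A minimal sketch of quality-aware matching as described above: only positions at or above a quality cutoff can contribute mismatches, and reads with too many low-quality positions are discarded first. The function names and the default values for `quality_cutoff` and `max_low_quality` are hypothetical and not taken from RMAP.

```python
def quality_aware_mismatches(read, quals, ref, quality_cutoff=20):
    """Count mismatches at high-quality positions only.

    Positions with quality below the cutoff always count as a match.
    """
    return sum(
        1
        for base, qual, ref_base in zip(read, quals, ref)
        if qual >= quality_cutoff and base != ref_base
    )

def passes_quality_control(quals, quality_cutoff=20, max_low_quality=5):
    """Reject reads that have too many low-quality positions."""
    low_quality = sum(1 for qual in quals if qual < quality_cutoff)
    return low_quality <= max_low_quality
```

Because low-quality positions behave like wildcards, the quality control step is needed to stop mostly low-quality reads from matching everywhere.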

30
(No Transcript)
31
(No Transcript)
32
Bioinformatics Challenges

Rapid mapping of these short sequence reads to
the reference genome
Visualize mapping results
Thousands of enriched regions
Peak analysis
Peak detection
Finding exact binding sites
Compare results of different experiments
Normalization
Statistical tests

33
Visualization

BED files are built to summarize mapping results
BED files can be easily visualized in the UCSC
Genome Browser
http://genome.ucsc.edu
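
As a rough illustration, mapped reads can be written out as a BED track like this. BED uses tab-separated fields with a 0-based start and an exclusive end; the input tuple layout, the function name, and the track name are assumptions made for this example.

```python
def reads_to_bed(mapped_reads, track_name="chipseq_reads"):
    """Format mapped reads as BED text (chrom, start, end, name, score, strand).

    mapped_reads: iterable of (chrom, start, length, strand, name) tuples,
    with a 0-based start coordinate.
    """
    lines = [f'track name="{track_name}" description="Mapped ChIP-Seq reads"']
    for chrom, start, length, strand, name in mapped_reads:
        end = start + length  # BED end coordinate is exclusive
        lines.append("\t".join([chrom, str(start), str(end), name, "0", strand]))
    return "\n".join(lines) + "\n"
```

The resulting text can be saved to a file and uploaded as a custom track.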

34
Visualization: Genome Browser
Robertson, G. et al. Nat. Methods 4, 651-657
(2007)
35
Visualization: Custom
300 kb region from mouse ES cells
Mikkelsen, T.S. et al. Nature 448, 553-562 (2007)
36
Visualization
Huang, 2008 (unpublished)
37
Huang, 2008 (unpublished)
38
Bioinformatics Challenges

Rapid mapping of these short sequence reads to
the reference genome
Visualize mapping results
Thousands of enriched regions
Peak analysis
Peak detection
Finding exact binding sites
Compare results of different experiments
Normalization
Statistical tests

39
Peak Analysis

Peak Detection
ChIP-Peak Analysis Module (Swiss Institute of
Bioinformatics)
ChIPSeq Peak Finder (Wold Lab, Caltech)
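
To show what peak detection means in its simplest form, the sketch below extends each mapped read to an assumed fragment length, builds a per-base coverage profile, and reports regions whose coverage reaches a threshold. This is a generic illustration, not the algorithm of either tool listed above; `fragment_length` and `min_height` are hypothetical parameters.

```python
def coverage_profile(read_starts, chrom_length, fragment_length=200):
    """Per-base coverage after extending each read start to fragment_length bases."""
    coverage = [0] * chrom_length
    for start in read_starts:
        for i in range(start, min(start + fragment_length, chrom_length)):
            coverage[i] += 1
    return coverage

def call_peaks(coverage, min_height):
    """Return (start, end, max_height) for each run of positions with coverage >= min_height.

    Coordinates are 0-based with an exclusive end.
    """
    peaks = []
    start = None
    for i, depth in enumerate(coverage):
        if depth >= min_height and start is None:
            start = i
        elif depth < min_height and start is not None:
            peaks.append((start, i, max(coverage[start:i])))
            start = None
    if start is not None:
        peaks.append((start, len(coverage), max(coverage[start:])))
    return peaks
```

Real peak finders add a background model or control sample to decide which thresholds are statistically meaningful.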

40
(No Transcript)
41
(No Transcript)
42
Peak Analysis

Finding Exact Binding Site
Determining the exact binding sites from short
reads generated from ChIP-Seq experiments
SISSRs (Site Identification from Short Sequence
Reads) (Jothi 2008)
MACS (Model-based Analysis of ChIP-Seq) (Zhang et
al, 2008)

43
Bioinformatics Challenges