Day 5-2

Provided by: dbrm


1
Day 5-2
What bioinformatics tools can be used for
analysing ChIP data?
2
After this seminar

You should be able to
Understand the differences between ChIP-chip and
ChIP-Seq and identify key decision-making steps
for choosing a platform
Identify bioinformatics steps needed for handling
ChIP-chip and ChIP-Seq datasets
Understand underlying data from genome tiling
arrays
Understand how to search for binding sites in
genomic data
Understand the need for skills in handling large
datasets

3
General problem

Find accessible regions of DNA that are bound to
your protein.
What method is best?
What sort of bioinformatics skills are required?
What is real signal and what is noise?
What do you do with the regions once you have
identified them?

Zheng, M. et al. (2007) ChIP-chip data, model,
and analysis. Biometrics, Vol 63, 787-796.
4
Experimental methods give different types of data

ChIP-chip
microarray data defining genomic regions
probe (with position usually defined)
expression
ChIP-Seq
high throughput DNA sequence
ACGATGTCA sequence fragments (from
Solexa/SOLiD/454)
sequence position undefined (search required)
The same issues exist for microarray vs. deep
sequencing in gene expression experiments
coverage
cost
practicality?

5
Raw (sequence) data

Flat files, processed from base-calls to fasta
format
Solexa
25-30 bp reads
Barcode is used to pool samples in one sequence
run
ACGT Expt1
TGAC Expt2
ACGTSequence
TGACSequence
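
The barcode pooling above can be sketched in a few lines. This is a minimal illustration only, assuming the barcode is the first four bases of each read and must match exactly; the function name `demultiplex` and the example reads are made up for this sketch.

```python
from collections import defaultdict

def demultiplex(reads, barcodes):
    """Assign each read to an experiment by its leading barcode, then trim the barcode."""
    by_experiment = defaultdict(list)
    for read in reads:
        for barcode, experiment in barcodes.items():
            if read.startswith(barcode):
                by_experiment[experiment].append(read[len(barcode):])
                break
        else:
            # No barcode matched exactly (assumption: no mismatches tolerated)
            by_experiment["unassigned"].append(read)
    return dict(by_experiment)

# Barcodes from the slide; the reads are invented for illustration
barcodes = {"ACGT": "Expt1", "TGAC": "Expt2"}
reads = ["ACGTGGATCCTTAGC", "TGACCATTGGAAGCT", "NNNNACGTACGTACG"]
pools = demultiplex(reads, barcodes)
```

Real pipelines usually tolerate a sequencing error in the barcode; exact matching is used here only to keep the sketch short.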

6
Choice of experiment

Choice of experiment depends on the focus you
require
Whole genome broad coverage (of known genome)?
or focused genomic region?
or discovery based (known or unknown genome)?
How much coverage do you need?
Fewer broad experiments vs. many focused
experiments?
Custom chips can be easily designed for focused
regions and custom applications.

7
ChIP workflow

ChIP-seq
Select antibody
Decide how deep to sequence (cost vs. coverage)
Sequence fragments
Map sequence to genomic position (BLAST/BLAT)
Identify peaks from data and minimise false
positives
Analyse peaks to predict binding sites

ChIP-chip
Select antibody
Select chip or design and select probes
Map array probes to genomic position (BLAST/BLAT
or lookup table from chip supplier)
Identify peaks from data and minimise false
positives
Analyse peaks to predict binding sites

8
ChIP-chip Ringo workflow example
9
ChIP output

Peaks on the genome
Score for each genomic position

BMC Bioinformatics 2007, 8:219
10
Antibody selection

Success depends on your antibody
Select antibodies that are suitable for ChIP-chip
experiments
Only a small number so far!
List available from
http://www.chiponchip.org/antibody.html

11
Microarray companies

DNA microarrays suitable for ChIP-chip assays
Affymetrix
Human Chr21/22 tiling microarrays
(oligonucleotide arrays)
Human ENCODE tiling arrays (oligonucleotide
arrays)
Agilent
Custom oligonucleotide arrays
Nimblegen Systems, Inc.
Human promoter microarrays
Human ENCODE microarrays
Custom oligonucleotide arrays
Aviva Systems Biology
Hu5K promoter arrays (PCR product arrays)
Hu20K promoter arrays (Oligo arrays)

12
Probe Design

Tiling
high-resolution arrays
target genomic regions of interest
whole genome or specific targeted regions?
Agilent eArray probe database
>21 million tiled CGH and ChIP-on-chip probes
Do it yourself
unassembled genomes, etc.
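
For the do-it-yourself route, tiling amounts to sliding a fixed-length window along the target region. The sketch below assumes 50-mer probes with a 25 bp step; both values and the function name `tile_probes` are illustrative choices, not figures from the slides or from any vendor.

```python
def tile_probes(sequence, probe_length=50, step=25):
    """Return (start, probe) pairs covering the sequence at a fixed step."""
    probes = []
    for start in range(0, len(sequence) - probe_length + 1, step):
        probes.append((start, sequence[start:start + probe_length]))
    return probes

# A made-up 120 bp region; real designs would read the target from a FASTA file
region = "ACGT" * 30
probes = tile_probes(region, probe_length=50, step=25)
```

A smaller step gives higher resolution at the cost of more probes. Real designs also filter probes for repeats and melting temperature, which this sketch ignores.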

13
Mapping to genome

The genome assembly is still changing for many
organisms
You must map the probe/sequence to its genomic
location using
standard alignment software (BLAST/BLAT/vmatch/...)
or rely on data files from the vendor (recommended
for most cases)
R packages exist for annotating probes to genomic
location
R packages exist for annotating probes to genomic
location

14
Mapping to genome

For sequence-based methods this step is critical
(and slow)
need unix server to run (or VMware)
Do I need access to a computing cluster?
choice of parameters for short sequences
Filter raw sequences -> representative sequence
set
Do I need to pre-filter data (some seqs will
account for most of the compute time)?
must be aware of speed vs. specificity for large
datasets
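
One way to reduce raw reads to a representative set before alignment is to drop unusable reads and collapse exact duplicates, so each distinct sequence is aligned only once. This is a sketch under assumed filters (minimum length 20, no ambiguous bases); the thresholds and the name `collapse_reads` are hypothetical.

```python
from collections import Counter

def collapse_reads(reads, min_length=20):
    """Count identical reads after dropping short reads and reads containing N."""
    cleaned = (read.upper() for read in reads)
    return Counter(r for r in cleaned if len(r) >= min_length and "N" not in r)

# Invented reads: three copies of one sequence, one unique, one with Ns, one too short
a = "ACGTACGTACGTACGTACGTACGTA"
reads = [a, a, a.lower(), "TTGACCATTGGAAGCTTAGGCATCA",
         "ACGTNNNNACGTACGTACGTACGTA", "ACGTAC"]
representative = collapse_reads(reads)
```

Only the keys of `representative` need to be aligned; the counts can be restored afterwards when building coverage.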

15
Normalisation

A normalization procedure
(a) The MA plot before normalization shows a need
for rotation to correct dye bias.
(b) To determine the correct angle of rotation,
the s(M) vs s(A) plot of the differences between
probes is generated. This circumvents the effect
of binding signal in determining the rotation
angle for the original MA plot in (a).
(c) The MA plot after rotation by the angle
determined in (b). The green line is the fitted
line after rotation.
(d) The MA plot after normalization.

BMC Bioinformatics 2007, 8:219.

MA plot is a scatterplot with transformed axes.
The X-axis represents the average log intensity
from 2 channels while Y-axis represents the
log-ratios.
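
The M and A values defined above can be computed directly from the two channel intensities. In this sketch the channels are called `red` and `green`, and M is taken as red over green; which channel holds the ChIP sample is an assumption that depends on the experiment.

```python
import math

def ma_values(red, green):
    """Return (M, A): per-probe log2 ratio and average log2 intensity."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a

# Made-up intensities for three probes
red = [1024.0, 256.0, 64.0]
green = [256.0, 256.0, 256.0]
m, a = ma_values(red, green)
```

Plotting `a` on the X-axis against `m` on the Y-axis gives the MA plot. The rotation-based normalization described above is not implemented here.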
16
Peak detection

What regions of DNA contain signal peaks?
How to define a statistically significant peak?

Zheng, M. et al. (2007) ChIP-chip data, model,
and analysis. Biometrics, Vol 63, 787-796.
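
As a starting point for thinking about peak regions, the sketch below reports runs of consecutive probes whose score exceeds a cutoff. It is a simple heuristic, not the model of Zheng et al. and not a significance test; the threshold, the minimum run length, and the name `call_peaks` are assumptions for illustration.

```python
def call_peaks(positions, scores, threshold=1.0, min_probes=3):
    """Return (start, end) for runs of at least min_probes probes scoring >= threshold.

    positions must be sorted along the chromosome and aligned with scores.
    """
    peaks = []
    run = []
    for pos, score in zip(positions, scores):
        if score >= threshold:
            run.append(pos)
        else:
            if len(run) >= min_probes:
                peaks.append((run[0], run[-1]))
            run = []
    # Close a run that reaches the end of the data
    if len(run) >= min_probes:
        peaks.append((run[0], run[-1]))
    return peaks

# Made-up log-ratios for probes spaced 100 bp apart
positions = list(range(0, 1000, 100))
scores = [0.1, 0.2, 1.5, 2.0, 1.8, 0.3, 1.2, 0.1, 0.0, 0.2]
peaks = call_peaks(positions, scores)
```

Requiring several neighbouring probes suppresses isolated high values such as the single probe at position 600. Choosing the cutoff in a statistically principled way is the harder question raised on this slide.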
17
Normalisation

Before normalization
the mock control appears to show the same
differential enrichment between genic and
intergenic regions as the histone occupancy,
suggesting that the differential enrichment may
be an artifact.
After normalization
the mock control no longer shows significant
differential enrichment while H3 and H4 profiles
still do

Peng et al. BMC Bioinformatics 2007, 8:219.
doi:10.1186/1471-2105-8-219

18
Noise

Contamination
Do sequences match the expected genome?
Sequencing errors
Can you determine where a sequencing error is?
Multiple-mapping sequences
Many sequences do not have unique genome matches
Dye specific bias
ChIP-chip data for chromatin-associated proteins
and histone modifications present additional
challenges
as they often display broad regions of
enrichment. This is in contrast to the isolated
and sharp peaks that are typical for the binding
of transcription factors.

19
Peak detection - replicates

Use replicates to improve detection
Peaks that are consistent between replicates
are more likely to be true

Zheng, M. et al. (2007) ChIP-chip data, model,
and analysis. Biometrics, Vol 63, 787-796.
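
A minimal way to act on this is to keep only the parts of peaks that overlap between two replicates. The sketch assumes peaks are (start, end) intervals on the same chromosome with inclusive ends; the name `consistent_peaks` and the coordinates are invented.

```python
def consistent_peaks(peaks_a, peaks_b):
    """Return the overlapping portions of peaks found in both replicates."""
    shared = []
    for start_a, end_a in peaks_a:
        for start_b, end_b in peaks_b:
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start <= end:
                shared.append((start, end))
    return shared

# Made-up peak calls from two replicates
rep1 = [(200, 400), (1000, 1200)]
rep2 = [(300, 500), (2000, 2100)]
shared = consistent_peaks(rep1, rep2)
```

The pairwise comparison is quadratic in the number of peaks, which is fine for a sketch; for genome-scale data a sorted sweep or an interval tree would be more appropriate.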
20
What next?

Given that you've identified accessible regions
in the genome
What information can be gathered from this
sequence?
Use discovery methods to look for common patterns
in the regions
MEME, etc
Use TFBS databases to look for known
transcription factor binding sites in the
sequence
Transfac
High coverage
Noisy database
Jaspar
Low coverage
Higher quality?
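
Searching for known binding sites usually means scanning the peak sequence with a position weight matrix. The sketch below builds a log-odds matrix from base counts and scores every window; the count matrix is invented (a TATA-like motif) and is not taken from Transfac or Jaspar, and the pseudocount, uniform background, and score cutoff are assumptions.

```python
import math

def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2((column.get(base, 0) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return matrix

def scan(sequence, matrix, min_score):
    """Return (start, window, score) for every window scoring at least min_score."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows with ambiguous bases
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= min_score:
            hits.append((start, window, score))
    return hits

# Hypothetical counts: 8 aligned sites that all read TATA
counts = [{"T": 8}, {"A": 8}, {"T": 8}, {"A": 8}]
matrix = log_odds_matrix(counts)
hits = scan("GGCTATAAGG", matrix, min_score=5.0)
```

This scans one strand only; a real search would also scan the reverse complement and use a background model that matches the genome's base composition.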

21
R packages for ChIP-chip

Ringo
Well documented workflow and good tutorial
BAC
Bayesian Analysis of ChIP-chip data
Minimal documentation

22
Summary

You should be able to
Understand the differences between ChIP-chip and
ChIP-Seq and identify key decision-making steps
for choosing a platform
Identify bioinformatics requirements for handling
ChIP-chip and ChIP-Seq datasets
Find transcription factor binding sites in
genomic data
Understand the need for skills in handling large
datasets