Title: ChIP-chip Data, Model and Analysis
1ChIP-chip Data, Model and Analysis
- Ying Nian Wu
- Dept. of Statistics
- UCLA
- Joint with Ming Zheng, Leah Barrera, Bing Ren
2ChIP-chip
- A technology for isolating and identifying the DNA sequences occupied by specific DNA-binding proteins (regulatory sequences) in living cells.
- Chromatin immunoprecipitation (ChIP) and microarray analysis (chip) are combined to study protein-DNA interaction in vivo.
- Also known as genome-wide location analysis.
3ChIP-chip process
- Step 1: Bound transcription factors are cross-linked to DNA with formaldehyde.
4ChIP-chip process (contd)
- Step 2: Sonication is used to break genomic DNA into small DNA fragments (various lengths, difficult to measure, 1-2 kb).
5ChIP-chip process (contd)
- Step 3: A specific antibody is added to immunoprecipitate the DNA segments cross-linked with the target protein.
6ChIP-chip process (contd)
- Step 4.1: The cross-linking between DNA and protein is reversed, and the DNA is amplified by LM-PCR and labeled with the fluorescent dye Cy5.
7ChIP-chip process (contd)
- Step 4.2: As a negative control, a sample of DNA that is not enriched by the immunoprecipitation process is also amplified by LM-PCR and labeled with another dye, Cy3.
8ChIP-chip process (contd)
- Step 5: Both the IP-enriched and IP-unenriched samples are hybridized to the same oligonucleotide array.
9ChIP-chip process (contd)
- Step 6: The microarray is scanned, the Cy5 and Cy3 signal strengths are extracted, and log(Cy5/Cy3) is calculated after normalization.
Ren, B. UCSD
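Step 6 can be illustrated with a minimal sketch. The slides do not say which normalization or which logarithm base is used, so the median centering and the base-2 logarithm below are assumptions chosen only for illustration, and the function name `log_ratio` is hypothetical.

```python
import numpy as np

def log_ratio(cy5, cy3):
    """Per-probe log2(Cy5/Cy3), centered so the median probe is at 0.

    Assumptions (not from the slides): base-2 logarithm and median
    centering as a simple stand-in for the normalization step.
    """
    cy5 = np.asarray(cy5, dtype=float)
    cy3 = np.asarray(cy3, dtype=float)
    ratio = np.log2(cy5 / cy3)
    return ratio - np.median(ratio)

# Three probes with a shared Cy3 level and increasing Cy5 intensity.
print(log_ratio([200, 400, 800], [100, 100, 100]))
```

After centering, a probe with a typical Cy5/Cy3 ratio sits near 0, which matches the zero-mean background assumed later for the P value of peaks.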
10Summary of ChIP-chip
- Protein bound to DNA
- Sonication
- Immunoprecipitation
- Amplify DNA and add control
- Hybridize to probes
- Microarray analysis
Ren, B. UCSD
11ChIP-chip data
SignalMap, NimbleGen Inc.
- One probe is one data point in the dataset.
- The x-axis represents the genomic position of the probe.
- The y-axis (the height) denotes the signal strength log(Cy5/Cy3) of each probe.
12A closer look
SignalMap, NimbleGen Inc.
13Cy5 signal
- The Cy5 signal strength at a point should be
proportional to the probability that an
IP-enriched segment contains that point.
14Single binding site scenario
- Assume there is only one binding site, at the origin.
- For a DNA segment to contribute to the signal at position x:
- 1) this binding site is bound by protein
- 2) no cut should occur between 0 and x
- The signal at x is (approximately) proportional to the probability that both conditions hold.
15Model derivation
- Assume the rate of cuts to be constant around the binding site. Then the Cy5 signal strength should decrease exponentially with distance from the binding site.
- Log(Cy5/Cy3) decreases linearly from the binding site, giving a triangular shape.
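The argument for a single binding site can be written out as a short derivation. This is a sketch in assumed notation that does not appear in the slides: p is the probability that the site is bound, lambda is a constant rate of sonication cuts per base pair (cuts modeled as a Poisson process), x is the probe position relative to the binding site, and a collects all terms that do not depend on x, including an approximately constant Cy3 level.

```latex
\begin{align*}
\Pr(\text{no cut between } 0 \text{ and } x) &= e^{-\lambda |x|}, \\
\mathrm{Cy5}(x) &\propto p \, e^{-\lambda |x|}, \\
\log \frac{\mathrm{Cy5}(x)}{\mathrm{Cy3}(x)} &\approx a - \lambda |x|.
\end{align*}
```

The last line is linear in |x| on each side of the binding site, which is the triangular shape described above.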
16Two binding sites scenario
17General scenario
18General scenario
19Regression to fit triangle
A simple case: probes are evenly spaced.
20Best fitted triangle
- Fixing the left boundary and the right boundary, we can identify the slopes and the intercept.
- Over the different combinations of left and right boundaries, find the one with the minimum variance of the residuals.
- This is the best fitted triangle centered at the probe we are considering.
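The search for the best fitted triangle can be sketched as follows. This is an illustrative least-squares version for evenly spaced probes, not the Mpeak source code: the function names, the `min_half` and `max_half` search limits, and the use of the degrees-of-freedom-corrected residual variance are all assumptions.

```python
import numpy as np

def fit_triangle(y, center, left, right):
    """Least-squares fit of two line segments sharing one intercept at `center`.

    Returns (coef, residual_variance), where coef = [intercept,
    left_slope, right_slope] and slopes are per probe of distance
    from the center.
    """
    idx = np.arange(left, right + 1)
    d = idx - center
    # Columns: intercept, distance to the left, distance to the right.
    X = np.column_stack([np.ones(len(idx)), np.maximum(-d, 0), np.maximum(d, 0)])
    coef, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
    resid = y[idx] - X @ coef
    dof = len(idx) - 3  # three fitted parameters
    return coef, float(resid @ resid) / dof

def best_triangle(y, center, min_half=2, max_half=10):
    """Try every (left, right) boundary pair; keep the minimum residual variance."""
    best = None
    for lh in range(min_half, max_half + 1):
        for rh in range(min_half, max_half + 1):
            left, right = center - lh, center + rh
            if left < 0 or right >= len(y):
                continue
            coef, var = fit_triangle(y, center, left, right)
            if best is None or var < best[0]:
                best = (var, left, right, coef)
    return best

# Noise-free triangle: height 3 at probe 20, dropping 0.5 per probe.
y = np.maximum(0.0, 3.0 - 0.5 * np.abs(np.arange(41) - 20))
var, left, right, coef = best_triangle(y, center=20)
print(left, right, np.round(coef, 3))
```

Requiring at least two probes on each side keeps the residual variance defined, since three parameters are fitted.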
21Mpeak process
- Arrange local maxima by their signal strength.
- For the first local maximum, find the best fitted triangle in a small neighborhood and identify the center as a peak.
22Mpeak process
- For any local maximum in the range of this triangle, if the difference between the two fitted values is small, mark it as a non-peak.
- Continue this process until every local maximum has been considered or the remaining local maxima are smaller than a threshold.
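The ordering and suppression logic of this process can be sketched in a simplified form. This is not the Mpeak implementation: a fixed `half_width` stands in for the range of the fitted triangle, every local maximum inside that range is suppressed without comparing fitted values, and the function names and the `threshold` default are hypothetical.

```python
import numpy as np

def local_maxima(y):
    """Indices of probes that are higher than the left neighbor and not lower than the right."""
    return [i for i in range(1, len(y) - 1) if y[i] > y[i - 1] and y[i] >= y[i + 1]]

def greedy_peaks(y, half_width=5, threshold=1.0):
    """Visit local maxima from strongest to weakest and keep the unsuppressed ones.

    Simplification: `half_width` replaces the fitted triangle range.
    """
    y = np.asarray(y, dtype=float)
    candidates = sorted(local_maxima(y), key=lambda i: y[i], reverse=True)
    peaks, suppressed = [], set()
    for i in candidates:
        if y[i] < threshold:
            break  # all remaining maxima are weaker still
        if i in suppressed:
            continue  # already explained by a stronger peak
        peaks.append(i)
        suppressed.update(range(i - half_width, i + half_width + 1))
    return peaks

signal = np.zeros(40)
signal[10], signal[12], signal[30], signal[35] = 3.0, 2.0, 2.5, 0.5
print(greedy_peaks(signal))
```

Here the maximum at probe 12 is absorbed by the stronger one at probe 10, and the maximum at probe 35 falls below the threshold.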
23P value of peaks
- Null hypothesis: the background signal in ChIP-chip data follows a normal distribution with mean 0.
- The statistic used for testing is zero-mean and variance stabilized.
- Background signals are not independent: probes close to each other tend to be included in the same segment simultaneously.
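The effect of dependence between neighboring probes on a peak's P value can be illustrated with a small sketch. The statistic and variance approximation from the slides are not reproduced here. The sketch assumes, purely for illustration, that the statistic is the sum of the probe signals in a peak and that the correlation between probes i and j is rho^|i-j|; all names are hypothetical.

```python
import math
import numpy as np

def sum_variance(n, sigma=1.0, rho=0.0):
    """Variance of the sum of n background probes with corr(i, j) = rho**|i - j|."""
    idx = np.arange(n)
    cov = sigma ** 2 * rho ** np.abs(idx[:, None] - idx[None, :])
    return float(cov.sum())

def upper_tail_p(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def peak_p_value(signals, sigma=1.0, rho=0.0):
    """Standardize the summed signal under the zero-mean null, then take the upper tail."""
    signals = np.asarray(signals, dtype=float)
    z = signals.sum() / math.sqrt(sum_variance(len(signals), sigma, rho))
    return upper_tail_p(z)

probes = [1.0, 1.5, 2.0, 1.5, 1.0]
print(peak_p_value(probes, rho=0.0), peak_p_value(probes, rho=0.5))
```

With positive correlation the variance of the sum grows, so treating the probes as independent would make the same peak look more significant than it is.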
24Variance approximation
25Result
SignalMap, NimbleGen Inc.
26Result
SignalMap, NimbleGen Inc.
27Result
SignalMap, NimbleGen Inc.
28Result
SignalMap, NimbleGen Inc.
29How good is the fit?
30Result
9,328 promoters for known transcripts
1,196 putative promoters for unknown transcripts
Ren, B. UCSD
- Kim, T.H. et al. (2005). A high-resolution map of active promoters in the human genome. Nature, 436, 876-880.
31Comparison with kernel smoothing
32Multi-resolution Peak tree
33Why use model?
- A promoter is characterized not only by a large probe signal, but also by a truncated triangle shape.
- Identify the neighboring probes that are caused by the same promoter to pool the information for ranking the potential binding sites.
SignalMap, NimbleGen Inc.
34Model justification
- Intuitively, human vision recognizes the local shape, instead of a single probe, to detect peaks.
- Model fitting improves detection: 1) the largest signal may not always be the tip of the best fitted triangle; 2) we can handle outliers caused by probe malfunction.
- For window smoothing, if the window size is not chosen well, a local maximum of the window average can well be the bottom of a valley.
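The window-smoothing pitfall can be reproduced on a toy signal. The data below are synthetic and the window sizes are chosen for illustration only: two triangular peaks at probes 6 and 14 with a valley at probe 10, smoothed with a moving average that is wide enough to cover both tips at once.

```python
import numpy as np

def moving_average(y, window):
    """Centered moving average (zero-padded at the ends); `window` should be odd."""
    kernel = np.ones(window) / window
    return np.convolve(y, kernel, mode="same")

positions = np.arange(21)
y = (np.maximum(0, 4 - np.abs(positions - 6))
     + np.maximum(0, 4 - np.abs(positions - 14)))

smoothed = moving_average(y, window=9)
top = int(np.argmax(smoothed))
print(top, y[top])  # location of the smoothed maximum and the raw signal there
```

With a 9-probe window the largest window average sits at probe 10, where the raw signal is 0, because that is the only window containing both tips. A 3-probe window places the maximum back on one of the true tips.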
35Model justification
- The model gives us a sensible way to choose the range.
- This enables us to pool many weak signals together if they form a good triangle, so that we can reduce the chance of false negatives.
- This prevents us from pooling too many weak signals together if they do not form a good triangle, so that we can reduce the chance of false positives.
36Model justification
- Probabilistic approx: Poisson process
- Fact: two different slopes around the non-differentiable tip
- Functional approx: line segments locally
- Gives reasonable fit to data
- Not enough data for more complex model
- Not enough computational power to fit more
complex model within minutes
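The Poisson-process approximation can be checked with a small simulation. This is a sketch under stated assumptions: cuts occur at a constant rate `cut_rate` per base pair (the value 1/500 is arbitrary), so the distance from the binding site to the first cut on one side is exponentially distributed, and the function name is hypothetical.

```python
import numpy as np

def coverage_probability(x, cut_rate, n_fragments=200_000, seed=0):
    """Fraction of simulated fragments that extend from the binding site past x.

    For a Poisson process of cuts, the distance to the first cut is
    exponential with mean 1 / cut_rate.
    """
    rng = np.random.default_rng(seed)
    first_cut = rng.exponential(scale=1.0 / cut_rate, size=n_fragments)
    return float(np.mean(first_cut > x))

cut_rate = 1.0 / 500.0  # assumed: one cut per 500 bp on average
for x in (100, 300, 600):
    simulated = coverage_probability(x, cut_rate)
    predicted = float(np.exp(-cut_rate * x))
    print(x, round(simulated, 3), round(predicted, 3))
```

The simulated fraction should track exp(-cut_rate * x) closely, which is the exponential decay that becomes a straight line on the log scale.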
37Software
- Fast: 10 seconds for 400,000 probes on a regular PC.
- Robust to noise (data shown later).
- Software and source code publicly available: www.stat.ucla.edu/zmdl/Mpeak
38Chromosome structure
Lodish, H. et al. Molecular Cell Biology.
39Histone and transcription
LS3 class note, UCLA
- Histone proteins need to be modified and DNA
needs to be released for transcription to take
place.
40Histone and transcription
Ren, B., UCSD
41Twin-peak phenomenon
The promoter region is in between two binding
sites of the modified histone protein, e.g.,
Acetylated histone H3 (AcH3). ChIP-chip data for
AcH3 show a twin-peak phenomenon, with a valley
corresponding to promoter region.
LS3 class note, UCLA
SignalMap, NimbleGen Inc.
42Possible solutions
- Fit a twin-peak shape to the data based on the probability model for the two binding site scenario.
- Use Witkin's scale-space filtering to detect peaks and twin-peaks.
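The scale-space idea can be sketched by smoothing a signal with Gaussian kernels of increasing width and counting the local maxima that survive at each scale. This is a minimal illustration on synthetic data, not a full implementation of Witkin's method (which also tracks extrema across scales); the function names, the kernel truncation at about three standard deviations, and the chosen scales are assumptions.

```python
import numpy as np

def gaussian_smooth(y, sigma):
    """Convolve y with a normalized Gaussian kernel truncated near 3 sigma."""
    radius = int(3 * sigma) + 1
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    return np.convolve(y, kernel, mode="same")

def count_local_maxima(y):
    """Number of interior points strictly higher than both neighbors."""
    y = np.asarray(y, dtype=float)
    inner = y[1:-1]
    return int(np.sum((inner > y[:-2]) & (inner > y[2:])))

# Synthetic twin peak: tips at probes 46 and 54, valley at probe 50.
positions = np.arange(101)
signal = (np.maximum(0, 4 - np.abs(positions - 46))
          + np.maximum(0, 4 - np.abs(positions - 54)))

for sigma in (1, 2, 4, 8):
    print(sigma, count_local_maxima(gaussian_smooth(signal, sigma)))
```

At a fine scale the two tips remain separate maxima, while at a coarse scale they merge into one. The scale at which the merge happens is what distinguishes a twin peak from two unrelated peaks.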