ChIP-chip: Data, Model and Analysis

Provided by: ZMDL
Learn more at: http://www.stat.ucla.edu


1
ChIP-chip: Data, Model and Analysis
  • Ying Nian Wu
  • Dept. of Statistics
  • UCLA
  • Joint with Ming Zheng, Leah Barrera, Bing Ren

2
ChIP-chip
  • A technology for isolation and identification of
    the DNA sequences occupied by specific DNA
    binding proteins (regulatory sequences) in living
    cells.
  • Chromatin-immunoprecipitation and microarray
    analysis (chip) are combined to study protein-DNA
    interaction in vivo.
  • Also known as genome-wide location analysis.

3
ChIP-chip process
  • Step 1: Bound transcription factors are
    cross-linked to DNA with formaldehyde.

4
ChIP-chip process (contd)
  • Step 2: Sonication is used to break genomic DNA
    into small DNA fragments (various lengths,
    difficult to measure, 1-2kb).

5
ChIP-chip process (contd)
  • Step 3: A specific antibody is added to
    immunoprecipitate DNA segments cross-linked
    with the target protein.

6
ChIP-chip process (contd)
  • Step 4.1: The cross-linking between DNA and
    protein is reversed, and the DNA is amplified by
    LM-PCR and labeled with a fluorescent dye, Cy5.

7
ChIP-chip process (contd)
  • Step 4.2: As a negative control, a sample of DNA
    that is not enriched by the immunoprecipitation
    process is also amplified by LM-PCR and labeled
    with another dye, Cy3.

8
ChIP-chip process (contd)
  • Step 5: Both IP-enriched and IP-unenriched
    samples are hybridized to the same
    oligonucleotide array.

9
ChIP-chip process (contd)
  • Step 6: The microarray is scanned, Cy5 and Cy3
    signal strengths are extracted, and
    log(Cy5/Cy3) is calculated after normalization.
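
Step 6 can be sketched in a few lines. The slides do not say which normalization or log base is used, so the median centring and base 2 below are assumptions chosen only for illustration, and log_ratios is a hypothetical helper name.

```python
import math
from statistics import median

def log_ratios(cy5, cy3):
    """Return median-centred log2(Cy5/Cy3) values, one per probe."""
    raw = [math.log2(a / b) for a, b in zip(cy5, cy3)]
    # Assumed normalization: most probes are background, so the
    # median log ratio is shifted to zero.
    centre = median(raw)
    return [r - centre for r in raw]

# Toy intensities for five probes (made-up numbers).
cy5 = [200.0, 400.0, 1600.0, 200.0, 100.0]
cy3 = [100.0, 200.0, 200.0, 100.0, 50.0]
ratios = log_ratios(cy5, cy3)
print(ratios)
```

Only the third probe stands out after centring, which is the kind of value plotted on the y-axis of the data views.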

Ren, B. UCSD
10
Summary of ChIP-chip
  • Protein bound to DNA
  • Sonication
  • Immunoprecipitation
  • Amplify DNA and add control
  • Hybridize to probes
  • Microarray analysis

Ren, B. UCSD
11
ChIP-chip data
SignalMap, NimbleGen Inc.
  • One probe is one data point in the dataset.
  • The x-axis represents the genomic position of the
    probe.
  • The y-axis (the height) denotes the signal
    strength log(Cy5/Cy3) of each probe.

12
A closer look
SignalMap, NimbleGen Inc.
13
Cy5 signal
  • The Cy5 signal strength at a point should be
    proportional to the probability that an
    IP-enriched segment contains that point.

14
Single binding site scenario
  • Assume there is only one binding site at the
    origin.
  • To contribute to the signal at position x:
  • 1) this binding site is bound by protein
  • 2) no cut should occur between 0 and x
  • Signal at x is proportional to (approx)
    P(site is bound) × P(no cut between 0 and x)

15
Model derivation
  • Assume the cutting rate to be constant around
    the binding site. Therefore, the Cy5 signal
    strength should decrease exponentially from the
    binding site.
  • log(Cy5/Cy3) decreases linearly from the binding
    site: a triangular shape.
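
The exponential decay can be checked with a small simulation. This is a sketch under stated assumptions: cuts are placed as a Poisson process with a constant rate per base pair, and the names no_cut_probability, rate, span and trials are hypothetical, as is the rate of 0.001. If cuts arrive at constant rate, the chance of no cut between the binding site at 0 and a point x should be close to exp(-rate·|x|), so the log signal falls linearly with |x|.

```python
import math
import random

def no_cut_probability(x, rate, trials=20000, span=2000.0, seed=0):
    """Monte Carlo estimate of P(no cut between 0 and x) for Poisson cuts."""
    rng = random.Random(seed)
    lo, hi = min(0.0, x), max(0.0, x)
    survived = 0
    for _ in range(trials):
        # Walk along the DNA from -span, jumping by exponential gaps.
        pos = -span
        cut_inside = False
        while True:
            pos += rng.expovariate(rate)
            if pos > hi:
                break            # passed the interval without cutting it
            if pos >= lo:
                cut_inside = True
                break
        survived += not cut_inside
    return survived / trials

rate = 0.001                     # assumed: one cut per 1 kb on average
estimate = no_cut_probability(500.0, rate)
expected = math.exp(-rate * 500.0)
print(round(estimate, 3), round(expected, 3))
```

Taking logs of the expected value gives -rate·|x|, which is the straight line on each side of the binding site.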

16
Two binding sites scenario
17
General scenario
18
General scenario
19
Regression to fit triangle
A simple case: probes are evenly spaced.
20
Best fitted triangle
  • With the left and right boundaries fixed, we can
    identify the slopes and intercept.
  • Among the different combinations of left and
    right boundaries, choose the one with the
    minimum residual variance.
  • This is the best fitted triangle centered at the
    probe we are considering.
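
A minimal sketch of this fit, not the Mpeak source: it assumes the triangle is a shared tip height at the centre probe plus one slope on each side, fitted by ordinary least squares, and that the residual variance is RSS/(n - 3). The names fit_triangle, best_triangle, min_side and max_side are hypothetical.

```python
def fit_triangle(xs, ys, centre, left, right):
    """Least-squares triangle with its tip at probe `centre`, over left..right."""
    c = xs[centre]
    idx = range(left, right + 1)
    u = [min(xs[i] - c, 0.0) for i in idx]   # signed distance, left side only
    v = [max(xs[i] - c, 0.0) for i in idx]   # signed distance, right side only
    y = [ys[i] for i in idx]
    n = len(y)
    su, sv, sy = sum(u), sum(v), sum(y)
    suu = sum(a * a for a in u)
    svv = sum(a * a for a in v)
    suy = sum(a * b for a, b in zip(u, y))
    svy = sum(a * b for a, b in zip(v, y))
    # The normal equations have a closed form because u * v == 0 per probe.
    tip = (sy - su * suy / suu - sv * svy / svv) / (
        n - su * su / suu - sv * sv / svv)
    slope_l = (suy - tip * su) / suu
    slope_r = (svy - tip * sv) / svv
    rss = sum((yi - tip - slope_l * ui - slope_r * vi) ** 2
              for yi, ui, vi in zip(y, u, v))
    return tip, slope_l, slope_r, rss / (n - 3)

def best_triangle(xs, ys, centre, min_side=2, max_side=6):
    """Try every boundary pair and keep the smallest residual variance."""
    best = None
    for l in range(min_side, max_side + 1):
        for r in range(min_side, max_side + 1):
            left, right = centre - l, centre + r
            if left < 0 or right >= len(xs):
                continue
            fit = fit_triangle(xs, ys, centre, left, right)
            if best is None or fit[3] < best[0]:
                best = (fit[3], left, right, fit[:3])
    return best

xs = [100.0 * i for i in range(11)]                    # evenly spaced probes
ys = [max(0.0, 1.2 - 0.004 * abs(x - 500.0)) for x in xs]
tip, slope_l, slope_r, var = fit_triangle(xs, ys, 5, 2, 8)
best = best_triangle(xs, ys, 5)
print(tip, slope_l, slope_r, best[1], best[2])
```

On this toy signal the selected window stays inside the sloped region, because extending it into the flat background increases the residual variance.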

21
Mpeak process
  • Arrange local maxima by their signal strength.
  • For the first local maximum, find the best fitted
    triangle in a small neighborhood and identify the
    center as peak.

22
Mpeak process
  • For any local maximum in the range of this
    triangle, if the difference between two fitted
    values is small, mark it as non-peak.
  • Continue this process until every local maximum
    has been considered or the remaining local
    maxima are smaller than a threshold.
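
The ordering and stopping rule can be sketched as a greedy loop. This is a deliberately simplified illustration, not the Mpeak implementation: a fixed reach (in probes) stands in for the range of the fitted triangle, and every weaker local maximum inside that reach is marked as non-peak without the fitted-value comparison. The names call_peaks, reach and threshold are hypothetical.

```python
def call_peaks(ys, reach=3, threshold=0.5):
    """Greedy sketch: visit local maxima from strongest to weakest."""
    maxima = [i for i in range(1, len(ys) - 1)
              if ys[i] >= ys[i - 1] and ys[i] > ys[i + 1]]
    maxima.sort(key=lambda i: ys[i], reverse=True)
    suppressed, peaks = set(), []
    for i in maxima:
        if ys[i] < threshold:
            break                # all remaining maxima are weaker still
        if i in suppressed:
            continue             # already explained by a stronger peak
        peaks.append(i)
        # Assumed stand-in for "in the range of this triangle".
        for j in maxima:
            if j != i and abs(j - i) <= reach:
                suppressed.add(j)
    return peaks

ys = [0, 1, 0, 3, 0, 0, 0, 0, 2, 0, 0.2, 0]
peaks = call_peaks(ys)
print(peaks)
```

Here the maximum at index 1 is absorbed by the stronger one at index 3, and the one at index 10 falls below the threshold.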

23
P value of peaks
  • Null hypothesis: background signal in ChIP-chip
    data follows a normal distribution with mean 0.
  • The statistic used for testing is zero-mean and
    variance stabilized.
  • Background signals are not independent: probes
    close to each other tend to be included in the
    same segment simultaneously.
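
To illustrate why the dependence matters, here is a hedged sketch that is not the statistic from the slides: it assumes a standardized window sum as the test statistic, a known background standard deviation sigma, and a correlation between probes i and j of rho^|i-j|. All of these, and the name window_p_value, are assumptions for illustration only.

```python
import math

def window_p_value(values, sigma, rho):
    """One-sided normal p-value for a window sum of correlated probes."""
    n = len(values)
    # Variance of the sum is the sum of all pairwise covariances.
    var_sum = sigma ** 2 * sum(rho ** abs(i - j)
                               for i in range(n) for j in range(n))
    z = sum(values) / math.sqrt(var_sum)
    # Upper-tail probability of a standard normal.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

window = [1.0, 1.0, 1.0, 1.0]
p_independent = window_p_value(window, sigma=1.0, rho=0.0)
p_correlated = window_p_value(window, sigma=1.0, rho=0.5)
print(p_independent, p_correlated)
```

Ignoring positive correlation understates the variance of the pooled signal, so the p-value computed under independence is too small.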

24
Variance approximation
25
Result
SignalMap, NimbleGen Inc.
26
Result
SignalMap, NimbleGen Inc.
27
Result
SignalMap, NimbleGen Inc.
28
Result
SignalMap, NimbleGen Inc.
29
How good is the fit?
30
Result
9,328 promoters for known transcripts
1,196 putative promoters for unknown transcripts
Ren, B. UCSD
  • Kim, T.H. et al. (2005). A high-resolution map
    of active promoters in the human genome. Nature,
    436, 876-880.

31
Comparison with kernel smoothing
32
Multi-resolution Peak tree
33
Why use a model?
  • A promoter is characterized not only by a large
    probe signal, but also by a truncated triangle
    shape.
  • Identify the neighboring probes that are caused
    by the same promoter to pool the info for ranking
    the potential binding sites

SignalMap, NimbleGen Inc.
34
Model justification
  • Intuitively, human vision recognizes the local
    shape, instead of a single probe, to detect
    peaks.
  • Model fitting improves detection: 1) the largest
    signal may not always be the tip of the best
    fitted triangle, 2) we can handle outliers caused
    by probe malfunctioning.
  • For window smoothing, if the window size is not
    chosen well, a local maximum of the window
    average can well be the bottom of a valley.  
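
The window-size pitfall is easy to reproduce on a toy twin-peak signal. This is a sketch with made-up values; moving_average and half_width are hypothetical names.

```python
def moving_average(ys, half_width):
    """Centred moving average, with the window truncated at the ends."""
    out = []
    for i in range(len(ys)):
        lo = max(0, i - half_width)
        hi = min(len(ys), i + half_width + 1)
        out.append(sum(ys[lo:hi]) / (hi - lo))
    return out

# Two peaks at indices 4 and 8 with a valley at index 6.
ys = [0, 0, 0, 2, 3, 2, 0.5, 2, 3, 2, 0, 0, 0]

wide = moving_average(ys, half_width=3)
narrow = moving_average(ys, half_width=1)
wide_max = max(range(len(ys)), key=lambda i: wide[i])
narrow_max = max(range(len(ys)), key=lambda i: narrow[i])
print(wide_max, narrow_max)
```

With the wide window the largest average sits at index 6, the bottom of the valley, because that window covers both peaks at once. The narrow window recovers one of the true peaks.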

35
Model justification
  • The model gives us a sensible way to choose the
    range:
  • It enables us to pool many weak signals
    together if they form a good triangle, which
    reduces the chance of false negatives.
  • It prevents us from pooling too many weak
    signals together if they do not form a good
    triangle, which reduces the chance of
    false positives.

36
Model justification
  • Probabilistic approximation: Poisson process
  • Fact: two different slopes around the
    non-differentiable tip
  • Functional approximation: line segments locally
  • Gives a reasonable fit to the data
  • Not enough data for a more complex model
  • Not enough computational power to fit a more
    complex model within minutes

37
Software
  • Fast: 10 seconds for 400,000 probes on a
    regular PC.
  • Robust to noise (data shown later).
  • Software and source code publicly available:
    www.stat.ucla.edu/zmdl/Mpeak

38
Chromosome structure
Lodish, H. et al. Molecular Cell Biology.
39
Histone and transcription
LS3 class note, UCLA
  • Histone proteins need to be modified and DNA
    needs to be released for transcription to take
    place.

40
Histone and transcription
Ren, B., UCSD
41
Twin-peak phenomenon
The promoter region lies between two binding
sites of the modified histone protein, e.g.,
acetylated histone H3 (AcH3). ChIP-chip data for
AcH3 show a twin-peak phenomenon, with a valley
corresponding to the promoter region.
LS3 class note, UCLA
SignalMap, NimbleGen Inc.
42
Possible solutions
  • Fit a twin-peak shape to the data based on the
    probability model for the two binding site
    scenario.
  • Use Witkin's scale-space filtering to detect
    peaks and twin-peaks.