Model-based analysis of oligonucleotide arrays, dChip software

1
Model-based analysis of oligonucleotide arrays,
dChip software
Cheng Li (Joint work with Wing Wong)
  • Statistics and Genomics Lecture 4
  • Department of Biostatistics
  • Harvard School of Public Health
  • January 23-25, 2002

2
Source: Affymetrix website
3
Custom software: raw image
4
Custom software: getting a representative value of
a probe cell
5
Normalization is needed to minimize
non-biological variation between arrays
6
Normalization methods
  • Current software uses linear normalization
  • Nonlinear curve fitting based on a scatter plot is
    still inadequate because 1) effects of
    differentially expressed genes may be
    normalized away and 2) the regression phenomenon
    and asymmetry

7
Regression phenomenon and asymmetry
8
Invariant set normalization method
  • A set of points (x_i, y_i) is said to be
    order-preserving if y_i < y_j whenever x_i < x_j
  • The maximal order-preserving subset can be
    obtained by dynamic programming
  • If a gene is really differentially expressed,
    its cells tend not to be included in a large
    order-preserving subset
  • Our method is based on an approximately
    order-preserving subset, called the invariant set
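
A minimal sketch of the dynamic-programming step, assuming each point is an (x, y) pair of intensities for one probe cell on the two arrays. The function name is hypothetical, ties in x or y are simply kept out of the same chain, and this finds the exact maximal subset, not the approximate invariant set the method actually uses.

```python
def maximal_order_preserving_subset(points):
    """Return a largest chain of (x, y) points that is strictly increasing in both x and y."""
    pts = sorted(points)  # sort by x, then by y
    n = len(pts)
    if n == 0:
        return []
    best = [1] * n    # best[i]: size of the largest chain ending at pts[i]
    prev = [-1] * n   # prev[i]: index of the previous point in that chain
    for i in range(n):
        for j in range(i):
            increasing = pts[j][0] < pts[i][0] and pts[j][1] < pts[i][1]
            if increasing and best[j] + 1 > best[i]:
                best[i] = best[j] + 1
                prev[i] = j
    # Walk back from the end of the longest chain.
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(pts[i])
        i = prev[i]
    return chain[::-1]
```

This is the quadratic-time form of the recursion; for a whole array of probe cells a faster longest-increasing-subsequence routine would likely be needed.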

9
Fig. 2.9 Normalization of a pair of replicated
arrays
10
Figure 2.10. Two different samples. The smoothing
spline in (A) is affected by several points at
the lower-right corner, which might belong to
differentially expressed genes, whereas the
invariant set does not include these points
when determining the normalization curve, leading
to a different normalization relationship at the
high end.
11
A pair of split-sample replicate arrays
12
Source: Affymetrix website
13
Data for one probe set, one array
PM/MM differences eliminate background and
cross-hybridization signals
14
Validation experiments suggest Average
Differences are linear in mRNA concentration over
a certain dynamic range
Lockhart et al. (1996) Nature Biotechnology, Vol. 14,
1675-1680
15
Data for one gene in many arrays
16
Box plot showing array and probe effects
17
Modeling probe effects
1) Probe sequences have different hybridization
efficiencies
2) Cross-hybridization, SNPs, alternative splicing
3) Probe position effect, 3' bias
Probe effects can dominate the biological
variation of interest
Previous methods: use multiple probes and average
to reduce noise
Our methods: statistical models for probe effects,
meta-analysis, learning algorithms, and estimation
of expression level conditional on knowledge of
the probe effects
18
Principal component analysis (42 points in
20-space) suggests the data matrix has approx.
rank 1
19
Model for one gene in multiple arrays
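
The slide's equation did not survive in the transcript. For reference, the model published by Li and Wong (2001) for probe pair j of one gene on array i is usually written as below; that this is exactly the slide's model (1) is an assumption.

```latex
\begin{aligned}
MM_{ij} &= \nu_j + \theta_i \alpha_j + \varepsilon, \\
PM_{ij} &= \nu_j + \theta_i \alpha_j + \theta_i \phi_j + \varepsilon
\end{aligned}
```

Here \theta_i is the expression index of the gene on array i, \nu_j is the baseline response of probe pair j, \alpha_j is the rate of increase of the MM response, and \phi_j is the additional rate of increase of the PM response.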
20
Figure 1.1. Black curves are the PM and MM data
of gene A in the first 6 arrays. Light curves are
the fitted values to model (1). Probe pairs are
labeled 1 to 20 on the horizontal axis.
21
Using PM/MM Differences
  • PM/MM differences eliminate most background and
    cross-hybridization signals
  • Affymetrix's GeneChip software uses average
    differences as the basis for determining fold
    changes, and their validation showed that average
    differences are linear in mRNA concentration over
    a certain dynamic range

22
Model for PM/MM differences (1.2)
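
The equation itself is missing from the transcript. In the published form of the Li and Wong model for PM/MM differences (assumed here to match the slide), with i indexing arrays and j = 1, ..., J indexing probe pairs:

```latex
y_{ij} = PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij},
\qquad \sum_{j=1}^{J} \phi_j^2 = J,
\qquad \varepsilon_{ij} \sim N(0, \sigma^2)
```

\theta_i is the model-based expression index of the gene on array i and \phi_j is the sensitivity of probe pair j; the constraint on the \phi_j makes the two sets of parameters identifiable.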
23
Figure 1.2. Black curves are the PM-MM difference
data of gene A in the first 6 arrays. Light
curves are the fitted values to model (2).
24
(No Transcript)
25
Residuals of the fitting
26
Model fitting amounts to fixing the φ's and regressing
to estimate θ
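
A minimal sketch of this alternating regression for the multiplicative model y_ij = θ_i φ_j + ε_ij with the constraint Σ φ_j² = J. The function name, the all-ones starting value, and the fixed iteration count are assumptions, and the outlier exclusion that dChip performs is left out.

```python
import math

def fit_li_wong(y, n_iter=50):
    """Fit y[i][j] ~ theta[i] * phi[j] by alternating least squares.

    y holds one row per array; each row has J PM-MM differences (J > 1).
    Returns (theta, phi, se_theta) with sum(phi_j ** 2) equal to J.
    """
    n_arrays, n_probes = len(y), len(y[0])
    phi = [1.0] * n_probes  # starting value (assumption); satisfies the constraint
    theta = [0.0] * n_arrays
    for _ in range(n_iter):
        # Fix the phi's; regress each array's row on them to estimate theta_i.
        ss_phi = sum(p * p for p in phi)
        theta = [sum(y[i][j] * phi[j] for j in range(n_probes)) / ss_phi
                 for i in range(n_arrays)]
        # Fix the theta's; regress each probe's column on them to estimate phi_j.
        ss_theta = sum(t * t for t in theta)
        phi = [sum(y[i][j] * theta[i] for i in range(n_arrays)) / ss_theta
               for j in range(n_probes)]
        # Rescale so sum(phi_j ** 2) = J while leaving theta_i * phi_j unchanged.
        scale = math.sqrt(n_probes / sum(p * p for p in phi))
        phi = [p * scale for p in phi]
        theta = [t / scale for t in theta]
    # Conditional standard error of theta_i given phi: sqrt(sigma_i^2 / J), with
    # sigma_i^2 estimated from array i's residuals on J - 1 degrees of freedom.
    se_theta = []
    for i in range(n_arrays):
        rss = sum((y[i][j] - theta[i] * phi[j]) ** 2 for j in range(n_probes))
        se_theta.append(math.sqrt(rss / (n_probes - 1) / n_probes))
    return theta, phi, se_theta
```

An array whose row fits the shared probe pattern poorly gets a large `se_theta` entry, which is the quantity the array-outlier figure refers to.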
27
Fig. 1.5 Array outlier: large standard error of θ4
28
Fig. 1.6 Probe outlier: large standard error of
φ17
Also see gene 6898
29
Fig. 1.4 Array outlier image shows that the model
automatically handles image contamination
30
Compare Model-based expression with Average
Difference
  • Array set 5 has 29 pairs of arrays replicated
    at the split-mRNA level
  • The differences between the replicated arrays
    provide an opportunity to assess different
    expression calculation methods

31
Figure 2.5. Log (base 10) expression indexes of a
pair of replicate arrays (array 1 and 2 of array
set 5) for MBEI method (A) and AD method (B). The
center line is y = x, and the flanking lines
indicate a difference of a factor of two.
32
(A)
(B)
Figure 2.6. Boxplots of average absolute log
(base 10) ratios between replicate arrays
stratified by presence proportion for (A) MBEI
method, (B) AD method.
33
Source: Affymetrix website
34
Finding Confidence Interval of Fold Change
35
Table 2.1 Using expression levels and associated
standard errors to determine confidence intervals
of fold changes
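
One way to turn two expression levels and their standard errors into a confidence interval for the fold change is to invert the statistic (θ1 - r θ2)² / (δ1² + r² δ2²), treated as approximately chi-square with one degree of freedom, and solve for the ratio r. The sketch below assumes that construction; the function name is hypothetical, and whether Table 2.1 was computed exactly this way is not stated in the transcript.

```python
import math

def fold_change_ci(theta1, theta2, se1, se2, z=1.645):
    """Confidence interval for theta1 / theta2 (z = 1.645 gives a two-sided 90% interval).

    Solves (theta1 - r * theta2)**2 <= z**2 * (se1**2 + r**2 * se2**2) for r.
    Returns (lower, upper), or None when the interval is unbounded.
    """
    a = theta2 ** 2 - (z * se2) ** 2
    if a <= 0:
        return None  # theta2 is not distinguishable from zero at this level
    b = theta1 * theta2
    c = theta1 ** 2 - (z * se1) ** 2
    root = math.sqrt(max(b * b - a * c, 0.0))
    return (b - root) / a, (b + root) / a
```

With zero standard errors the interval collapses to the plain ratio; as the standard errors grow the interval widens, so a large observed fold change with a low lower bound is flagged as unreliable.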
36
Resampling hierarchical clustering using standard
errors of model-based expression
37
Incorporate biological knowledge and database
when analyzing microarray data
Right picture: Gene Ontology: tool for the
unification of biology, Nature Genetics, 25, p. 25
38
Functionally significant clusters
Found 13 structural protein genes in a
49-gene cluster (all 198/2622, PValue 1.00e000)
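
A common way to score such an enrichment is the upper tail of the hypergeometric distribution: the chance of seeing at least 13 annotated genes in a cluster of 49 when 198 of the 2622 genes carry the annotation. The sketch below assumes that test and a hypothetical function name; the transcript does not say how the slide's P value was computed.

```python
from math import comb

def hypergeom_upper_tail(k, cluster_size, annotated, total):
    """P(X >= k) when cluster_size genes are drawn without replacement from
    total genes, of which `annotated` carry the annotation."""
    denom = comb(total, cluster_size)
    upper = min(cluster_size, annotated)
    numer = sum(comb(annotated, x) * comb(total - annotated, cluster_size - x)
                for x in range(k, upper + 1))
    return numer / denom

# Numbers from the slide: 13 of 49 cluster genes, 198 of 2622 genes overall.
p_value = hypergeom_upper_tail(13, 49, 198, 2622)
```

The expected count under random sampling is about 49 × 198 / 2622 ≈ 3.7, so 13 annotated genes is far out in the tail.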
39
Problems with LWR model
  • Statistical analysis of high-density
    oligonucleotide arrays: a multiplicative noise
    model
  • R. Sasik and J. Corbeil (UCSF)
  • LWR model
  • The expression index can still be negative.
  • Genes with negative index can still be classified
    as present.

Slides prepared by Xuemin Fang
40
Statistical model
  • Based on the same assumption as the LW model,
    that PM intensity is directly proportional to the
    concentration c_i of the transcript. Write the
    relation in the form (equation not in transcript)
  • Our model is (equation not in transcript)
  • where (definitions not in transcript)
  • Least-squares estimation of the parameters.
  • Constraint (equation not in transcript)

41
Algorithm: when analyzing a batch of n_s samples
  • Normalize all samples to the first one on the
    list by requiring the sum of all PM intensities
    to be the same as that of the first sample.
  • Select the background probes using Naef's method
    (MM is used in this step).
  • Subtract the median of the background probe
    intensities from every PM probe in the array.
  • Probes that become negative are eliminated.
  • Fit the model; the probes that contribute most to
    the sum of squares are eliminated.
  • Normalize again and repeat steps 1-5 until the
    distribution of residuals is Gaussian.
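
The normalization and background-subtraction steps can be sketched as below. This is an illustration under assumptions: the function names are hypothetical, each sample is a plain list of PM intensities, and the background probe intensities are passed in already selected, since the details of Naef's method are not given on the slide.

```python
from statistics import median

def normalize_to_first(pm_arrays):
    """Scale every sample so its total PM intensity equals that of the first sample."""
    target = sum(pm_arrays[0])
    return [[v * target / sum(pm) for v in pm] for pm in pm_arrays]

def subtract_background(pm, background_values):
    """Subtract the median background intensity and drop probes that become negative.

    Returns {probe_index: corrected_intensity} for the probes that are kept.
    """
    bg = median(background_values)
    corrected = {j: v - bg for j, v in enumerate(pm)}
    return {j: v for j, v in corrected.items() if v >= 0}
```

Keeping the surviving probes in a dictionary keyed by probe index preserves which probes were eliminated, which the later model-fitting and elimination steps would need.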

42
  • Bias, variance and fit for three measures of
    expression: AvDiff, Li and Wong's, and
    AvLog(PM - BG)
  • Rafael Irizarry, Terry Speed (Johns Hopkins)

Slides prepared by Xuemin Fang
43
A background plus signal model
  • The background term represents the
    background signal caused by optical noise and
    non-specific binding.
  • It is split into a mean background level
    and a random component.
  • The transcript signal
    contains a probe affinity effect,
    the log expression measure, and an
    error term.
  • Both error terms are
    independent standard normal.

44
Expression index
  • A naïve estimate of the log expression measure
    is given by the average over probes of
    log2(PM - BG),
  • with log2(BG) the mode of the log2(MM)
    distribution.
  • An estimate of this distribution is obtained
    using a kernel density estimate.
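
A minimal sketch of this estimate, read from the measure's name AvLog(PM - bg): take the mode of a kernel density estimate of log2(MM) as the log2 background, subtract that background from each PM, and average the logs. The function names, the Gaussian kernel, the rule-of-thumb bandwidth, the grid search, and the choice to skip probes at or below background are all assumptions, since the slide's formula is not in the transcript.

```python
import math
from statistics import mean, stdev

def kde_mode(values, grid_size=512):
    """Mode of a Gaussian kernel density estimate of `values`, located on a grid."""
    n = len(values)
    # Silverman-style rule-of-thumb bandwidth (assumption; the slide names none).
    h = 1.06 * stdev(values) * n ** (-0.2)
    if h == 0.0:
        h = 1e-3  # all values identical: any small bandwidth gives the same mode
    lo, hi = min(values), max(values)
    best_x, best_density = lo, -1.0
    for k in range(grid_size):
        x = lo + (hi - lo) * k / (grid_size - 1)
        # Unnormalized density is enough to locate the maximum.
        density = sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in values)
        if density > best_density:
            best_x, best_density = x, density
    return best_x

def avlog_pm_minus_bg(pm, mm):
    """Average of log2(PM - BG), with log2(BG) taken as the mode of log2(MM)."""
    bg = 2.0 ** kde_mode([math.log2(v) for v in mm])
    # Probes at or below background have no defined log and are skipped (assumption).
    return mean(math.log2(v - bg) for v in pm if v > bg)
```

Using the mode rather than the mean of log2(MM) keeps the background estimate from being pulled upward by the MM probes that pick up real signal.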

45
Acknowledgement
Data source: Stan Nelson (UCLA), Sven de Vos
(UCLA), Dan Tang (DFCI), Andy Bhattacharjee
(DFCI), Richardson Andresa (DFCI), Allen Fienberg
(Rockefeller)