Title: Probe Level Analysis of Affymetrix™ Data
1Probe Level Analysis of Affymetrix™ Data
2Outline
- Design of Affy probesets
- Background
- Normalization
- Non-specific hybridization
- Estimation
- Comparison of Methods
3Affymetrix GeneChip Probe Arrays
Single stranded, fluorescently labeled DNA target
4Affymetrix Probe Design
PM is exactly complementary to the published sequence
MM is changed at the 13th base
5Chip Layout
- Typical chips are square: 640x640 (U95A), 712x712 (U133) or 1042x1042 (Plus2)
- Older chips placed all probes for one gene in a row
- Modern chips distribute probes according to sequence, not gene
6Chip Nomenclature
- HGU133A: Human Genome, UniGene build 133, first chip
- PM: perfect match
- MM: mismatch
- Control sequence: sequence from an unrelated organism
- Signal: intensity; doesn't translate directly to abundance
- Cross-hybridization: binding of sequences other than the target
7Affymetrix Background Adjustment and Normalization
8What's the Issue?
- Background: some Affy chips show consistently higher values for the lowest signals (presumably absent) than others
- Background may vary over a chip
- Normalization: distribution of probe signals may differ between chips, independent of background adjustment
- PM and MM may be shifted differently
9Probe Intensities in 23 Replicates
10Approaches to Background
- Subtract common estimate of background
- Fit local background across chip and subtract (MAS 5.0)
- Consider background as a random variable
- Use statistical theory to derive background correction
11RMA Bayesian BG Correction
- Each S = BG + Intensity + e
- BG randomly sampled from a Normal distribution
- Intensity randomly sampled from an exponential distribution
- Estimate mean and SD of the BG distribution by fitting values below the mode of the signal distribution
- Estimate Intensity, conditional on S, by integrating over possible values of BG
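A minimal sketch of the conditional estimate, assuming the noise term e is absorbed into the Normal background, so that S = BG + Intensity with BG ~ Normal(mu, sigma) and Intensity ~ Exponential(rate alpha). Under those assumptions, integrating over BG gives the mean of a normal distribution truncated at zero. The function name and the parameter values are hypothetical; in practice mu, sigma and alpha would be estimated from the chip.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def expected_intensity(s, mu, sigma, alpha):
    """E[Intensity | S = s] for S = BG + Intensity,
    BG ~ Normal(mu, sigma), Intensity ~ Exponential(rate=alpha).

    The posterior of Intensity is Normal(a, sigma) truncated to
    positive values, with a = s - mu - alpha * sigma**2.
    """
    a = s - mu - alpha * sigma ** 2
    return a + sigma * normal_pdf(a / sigma) / normal_cdf(a / sigma)


# Hypothetical parameters: background mean 100, SD 20, signal rate 0.01.
low = expected_intensity(50.0, mu=100.0, sigma=20.0, alpha=0.01)
high = expected_intensity(150.0, mu=100.0, sigma=20.0, alpha=0.01)
```

The corrected value stays positive even when the observed signal is below the background mean, which plain subtraction would not guarantee.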
12Approaches to Normalization
- Simple: find the average of each chip; divide all values by the chip average
- MAS5: trimmed mean
- Invariant set: find a subset of probes in almost the same rank order in each chip
- Quantile normalization: fit to average quantiles across the experiment
13Probes on Different Chips
Plots of two Affymetrix chips against the
experiment means
14MAS 5.0
- Plot probes from each chip against a common baseline chip
- Fit regression line to the middle 98% of probes
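The two steps on this slide can be sketched as follows. This is an illustration only, not the manufacturer's code: it assumes the "middle 98%" is chosen by intensity rank on the chip being normalized, and the function names are hypothetical.

```python
def fit_line_middle(baseline, chip, trim=0.01):
    """Least-squares line baseline ~ intercept + slope * chip,
    fitted after dropping the lowest and highest `trim` fraction
    of probes (ranked by intensity on `chip`)."""
    n = len(chip)
    order = sorted(range(n), key=lambda i: chip[i])
    k = int(n * trim)
    keep = order[k:n - k]
    xs = [chip[i] for i in keep]
    ys = [baseline[i] for i in keep]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def normalize_to_baseline(baseline, chip, trim=0.01):
    """Map every probe on `chip` onto the baseline scale."""
    slope, intercept = fit_line_middle(baseline, chip, trim)
    return [intercept + slope * value for value in chip]


# Toy data: the chip is a linear distortion of the baseline,
# with one corrupted probe at each extreme.
baseline = [float(i) for i in range(1, 201)]
chip = [2.0 * b + 10.0 for b in baseline]
chip[0] = -1000.0
chip[199] = 100000.0
slope, intercept = fit_line_middle(baseline, chip)
normalized = normalize_to_baseline(baseline, chip)
```

Trimming before fitting keeps the two corrupted probes from pulling the line away from the bulk of the data.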
15Invariant Set (Li-Wong) Method
- Select baseline chip X
- For each other chip Y
- Select probes $p_1, \dots, p_K$ ($K \approx 10000$), such that $p_1 < p_2 < \dots < p_K$ in both chips
- Fit running median through points $(x_{p_1}, y_{p_1}), \dots, (x_{p_K}, y_{p_K})$
- Repeat
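A rough sketch of the selection and smoothing steps, under simplifying assumptions: a probe counts as "invariant" when its rank on the two chips differs by at most a fixed tolerance, and the selection is repeated within the surviving set until it stops shrinking. The tolerance, window size and function names are hypothetical and are not taken from dChip.

```python
from statistics import median


def ranks(values):
    """Rank of each value (0 = smallest); ties broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def invariant_set(x, y, max_rank_diff=1, max_rounds=20):
    """Indices of probes whose ranks nearly agree on chips x and y."""
    selected = list(range(len(x)))
    for _ in range(max_rounds):
        rank_x = ranks([x[i] for i in selected])
        rank_y = ranks([y[i] for i in selected])
        keep = [i for i, a, b in zip(selected, rank_x, rank_y)
                if abs(a - b) <= max_rank_diff]
        if len(keep) == len(selected):
            break  # nothing dropped in this round
        selected = keep
    return selected


def running_median(x, y, selected, half_window=1):
    """Running median of y against x over the selected probes."""
    points = sorted((x[i], y[i]) for i in selected)
    curve = []
    for k in range(len(points)):
        lo = max(0, k - half_window)
        hi = min(len(points), k + half_window + 1)
        curve.append((points[k][0], median(p[1] for p in points[lo:hi])))
    return curve


# Toy chips: the last probe changes rank drastically between chips.
chip_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
chip_y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 1.0]
selected = invariant_set(chip_x, chip_y)
curve = running_median(chip_x, chip_y, selected)
```

The resulting curve would then serve as the normalization relation between chip Y and the baseline X.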
16Quantile Method (RMA)
- Distributions of probe intensities vary substantially among replicate chips
- This cannot be even approximately resolved by any linear transformation
- Drastic solution: shoehorn all probe intensities into the same distribution
- Ideal distribution is taken as average of all
17Quantile Normalization
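Formula: $x_{\mathrm{norm}} = F_2^{-1}(F_1(x))$
$F_1(x)$: cumulative distribution function of the chip intensities
$F_2(x)$: cumulative distribution function of the reference distribution
Assumes gene distribution changes little
[Figure: density functions and cumulative distribution functions of the chip intensities and the reference distribution; axis labels a, x, y]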
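With empirical distributions, the mapping amounts to replacing each probe's value by the reference value of the same rank. A minimal sketch, assuming the reference distribution is the average of the sorted chips, that all chips have the same number of probes, and that ties are broken by position (a simplification); the function name is hypothetical.

```python
def quantile_normalize(chips):
    """Force every chip to share one distribution.

    `chips` is a list of equal-length lists of intensities.
    The reference is the mean of the k-th smallest values across
    chips; each probe then receives the reference value of its rank.
    """
    n_probes = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    reference = [
        sum(s[k] for s in sorted_chips) / len(chips)
        for k in range(n_probes)
    ]
    normalized = []
    for chip in chips:
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, index in enumerate(order):
            out[index] = reference[rank]
        normalized.append(out)
    return normalized


# Two toy chips with three probes each.
result = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 9.0]])
```

After normalization every chip contains exactly the same set of values; only the assignment of values to probes differs.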
18Ratio-Intensity Before
19Ratio-Intensity After
20Critique of RMA Normalization
- Distribution of signals looks more like exponential on log scale
- No allowance for regional biases in BG
- Quantile normalization is very strong: highly expressed genes won't be equal
- Better to let the higher end be roughly linear
- Requires much memory; could be implemented differently
21Model-based Estimates for Affymetrix Raw Data
22Many Probes for One Gene
How to combine signals from multiple probes into
a single gene abundance estimate?
23Probe Variation
- Individual probes don't agree on fold changes
- Probes for one gene may vary by two orders of magnitude on each chip
- GC content is the most important factor in signal strength
Signal from 16 probes along one gene on one chip
24Competing Models 2005
- GCOS (Affymetrix Microarray Suite 5.0): manufacturer's software
- dChip: Li and Wong, HSPH
- Bioconductor affy package (RMA): Bolstad, Irizarry, Speed, et al.
- Variants such as gcRMA, vsn
- Probe-level analyses: affyPLM, logit-t, ...
25Probe Measure Variation
- Typical probes are two orders of magnitude different!
- GC content is the most important factor
- RNA target folding also affects hybridization
[Figure: probe signals; axis from 0 to 3x10^4]
26Principles of MAS 5 method
- First estimate background
- bg = MM (if physically possible)
- log(bg) = log(PM) - log(non-specific proportion) (if impossible)
- Non-specific proportion = max(SB, e)
- SB = TukeyBiweight(log(PM) - log(MM))
- Signal = TukeyBiweight(log(Adjusted PM))
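A sketch of a one-step Tukey biweight average, the robust mean used twice on this slide. The tuning constants (c = 5, epsilon = 0.0001) are the values commonly quoted for MAS 5 and should be treated as assumptions here; the function name is hypothetical.

```python
from statistics import median


def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight: a weighted mean in which points far
    from the median (in units of the median absolute deviation)
    receive little or no weight."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    weighted_sum = 0.0
    weight_total = 0.0
    for v in values:
        u = (v - center) / (c * mad + epsilon)
        # Bisquare weight; points with |u| >= 1 are ignored.
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        weighted_sum += w * v
        weight_total += w
    return weighted_sum / weight_total


# Four consistent log-scale values plus one gross outlier.
estimate = tukey_biweight([1.0, 2.0, 3.0, 4.0, 100.0])
```

The outlier at 100 receives zero weight, so the estimate stays near the bulk of the values rather than near their arithmetic mean of 22.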
27Critique of MAS 5 principle
- Not clear what an average of different probes should mean
- Tukey bi-weight can be unstable when data cluster at either end, which is frequently the case here
- No learning based on cross-chip performance of individual probes
28Motivation for multi-chip models
- Probe level data from spike-in study (log scale); note the parallel trend of all probes
Courtesy of Terry Speed
29Linear Models
- Extension of linear regression
- Essential features
- Measurement errors independent of each other
- random noise
- Needs normalization to eliminate systematic variation
- Noise levels comparable at different levels of signal
- Small number of factors give predicted levels
- combine in linear function or simple algebraic form
30Model for Probe Signal
- Each probe signal is proportional to
- i) the amount of target sample, a
- ii) the affinity of the specific probe sequence to the target, f
- NB: high affinity is not the same as specificity
- Probe can give high signal to intended target and also to other transcripts
[Figure: grid of probes 1, 2, 3 (affinities f1, f2, f3) against chips 1, 2 (amounts a1, a2)]
31Multiplicative Model
- For each gene, a set of probes $p_1, \dots, p_k$
- Each probe $p_j$ binds the gene with efficiency $f_j$
- In each sample there is an amount $q_i$
- Probe intensity should be proportional to $f_j \times q_i$
- Always some noise!
32Robust Statistics
- Outlier: a measure that is far beyond the typical random variation
- common in biological measures
- 10-15% in Affy probe sets
- Robust methods try to fit the majority of data points
- Issue is to identify which points to down-weight or ignore
- Median is very robust but inefficient
- Trimmed means are almost as robust and much more efficient
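A small illustration of the trade-off, using made-up probe signals with one gross outlier. The `trimmed_mean` helper and its 10% trimming proportion are hypothetical choices for the example.

```python
from statistics import mean, median


def trimmed_mean(values, proportion=0.1):
    """Mean after dropping the lowest and highest `proportion`
    of the sorted values."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    return mean(ordered[k:len(ordered) - k])


# Nine consistent signals and one outlier (10% contamination).
signals = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]

plain = mean(signals)            # pulled far upward by the outlier
robust = median(signals)         # ignores the outlier entirely
trimmed = trimmed_mean(signals)  # drops one value at each end
```

The median and the trimmed mean both stay near 10, while the plain mean is almost doubled by a single bad probe.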
33Robust Linear Models
- Criterion of fit
- Least median squares
- Sum of weighted squares
- Least squares and throw out outliers
- Method for finding fit
- High-dimensional search
- Iteratively re-weighted least squares
- Median Polish
34Why Robust Models for GeneChips?
- 10-15% of individual signals in a probe set deviate greatly from pattern
- Often outliers lie close together
- Causes
- Scratches
- Proximity to heating elements
- Uneven fluid flow
35Li-Wong (dChip)
- Model: $PM_{ij} = q_i f_j + e_{ij}$
- Original model (dChip 1.0) used $PM_{ij} - MM_{ij} = q_i f_j + e_{ij}$, by analogy with Affy MAS 4
- Outlier removal
- Identify extreme residuals
- Remove
- Re-fit
- Iterate
Fitting probes in one set on one chip
Dark blue: PM values; Red: fitted values; Light blue: probe SD
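One way to fit a multiplicative model of this form is alternating least squares: hold the probe affinities fixed and solve for the sample amounts, then the reverse. The sketch below illustrates that idea and is not dChip's code. It omits the outlier-removal loop, and it assumes the scale is fixed by requiring the squared affinities to sum to the number of probes.

```python
import math


def fit_multiplicative(pm, n_iter=50):
    """Fit pm[i][j] ~ q[i] * f[j] by alternating least squares.

    Rows are chips (samples), columns are probes in one probe set.
    Scale is fixed by sum(f_j ** 2) == number of probes (assumption).
    """
    n_chips, n_probes = len(pm), len(pm[0])
    f = [1.0] * n_probes
    q = [0.0] * n_chips
    for _ in range(n_iter):
        # Given f, the least-squares amount for each chip.
        ff = sum(v * v for v in f)
        q = [sum(row[j] * f[j] for j in range(n_probes)) / ff for row in pm]
        # Given q, the least-squares affinity for each probe.
        qq = sum(v * v for v in q)
        f = [sum(pm[i][j] * q[i] for i in range(n_chips)) / qq
             for j in range(n_probes)]
        # Rescale; the products q[i] * f[j] are unchanged.
        scale = math.sqrt(sum(v * v for v in f) / n_probes)
        f = [v / scale for v in f]
        q = [v * scale for v in q]
    return q, f


# Noise-free toy probe set: 3 chips, 4 probes.
true_q = [100.0, 200.0, 400.0]
true_f = [0.5, 1.0, 1.5, 2.0]
pm = [[qi * fj for fj in true_f] for qi in true_q]
q, f = fit_multiplicative(pm)
```

In a fuller version, probes or chips with extreme residuals would be dropped and the fit repeated, as the slide describes.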
36Critique of Li-Wong model
- Model assumes that noise for all probes has the same magnitude
- All biological measurements exhibit intensity-dependent noise
37Bolstad, Irizarry, Speed (RMA)
- For each probe set, take the log transform of $PM_{ij} = q_i f_j$
- i.e. fit the model $\mathrm{nlog}(PM_{ij}) = \log(q_i) + \log(f_j) + e_{ij}$
- Fit this additive model by iteratively re-weighted least-squares or median polish
Where nlog() stands for logarithm after normalization
Critique: assumes probe noise is constant (homoscedastic) on log scale
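Median polish can be sketched as alternately sweeping row and column medians out of a chips-by-probes table of log intensities. This is a simplified illustration with hypothetical names and toy data, not the Bioconductor implementation. The expression estimate for chip i is taken here as the overall level plus that chip's effect.

```python
from statistics import median


def median_polish(table, n_iter=10, tol=1e-8):
    """Fit table[i][j] ~ overall + row_eff[i] + col_eff[j] robustly.

    Rows are chips, columns are probes (values on the log scale).
    Returns (overall, row_eff, col_eff, residuals).
    """
    n_rows, n_cols = len(table), len(table[0])
    resid = [list(row) for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        change = 0.0
        # Sweep out row medians.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
            change += abs(m)
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep out column medians.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
            change += abs(m)
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
        if change < tol:
            break
    return overall, row_eff, col_eff, resid


# Toy log-scale probe set: 3 chips x 5 probes, exactly additive,
# with one outlying probe on the first chip.
chip_level = [0.0, 1.0, 2.0]
probe_level = [-2.0, -1.0, 0.0, 1.0, 2.0]
table = [[8.0 + c + p for p in probe_level] for c in chip_level]
table[0][4] += 50.0
overall, row_eff, col_eff, resid = median_polish(table)
expression = [overall + r for r in row_eff]
```

The outlier ends up in the residuals rather than distorting the chip-level estimates, which is the reason for using a robust fit.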
38Comparison of Methods
Green: MAS5.0; Black: Li-Wong; Blue, Red: RMA
20 replicate arrays: variance should be small
Standard deviations of expression estimates on arrays, arranged in four groups of genes by increasing mean expression level
Courtesy of Terry Speed
39Steady Improvement
- Affymetrix improves their model
- PLIER is a multi-chip model
- MAS P/A calls reasonable
- MAS 5.0 estimation does a reasonable job on probe sets that are bright
- Abundant genes
- dChip and RMA do better on genes that are less abundant
- Signalling proteins, transcription factors, etc.
40Expression Comparison 1 MAS 4
Ratio-Intensity Plot comparing two chips from spike-in experiment
White dots represent unchanged genes; red numbers flag spike-in genes
Courtesy of Terry Speed
41Expression Comparison 2 MAS 5
t-scores
changed genes
Theoretical t-distribution
Courtesy of Terry Speed
42Expression Comparison 3 Li-Wong
Courtesy of Terry Speed
43Expression Comparison 4 - RMA
Courtesy of Terry Speed
44Comparison on Real Data
- These results are based on samples with 14 spike-ins: not realistic complexity
- Choe et al. (Genome Biology 2005) produced a spike-in data set with realistic complexity and found MAS5 PM correction worked well
- Comparisons of biological variation vs technical variation in replicated samples suggest RMA defaults work best
45Mix and Match Methods in affy
- Background: rma, mas
- Normalization: quantile, constant, ...
- PM-correction: none, ...
- Model: median polish, mas
- Estimates <- expresso(cel.data, bgcorrect.method = "mas", normalize.method = "quantiles", ...
46gcRMA Estimating Non-specific Hybridization
- Each probe has its own characteristic cross-hybridizations (NSH)
- Mismatch is not a good estimate of NSH
- GC content may predict NSH reasonably well