Probe Level Analysis of Affymetrix™ Data


Provided by: markre8


1
Probe Level Analysis of Affymetrix™ Data

Mark Reimers, NCI

2
Outline

Design of Affy probesets
Background
Normalization
Non-specific hybridization
Estimation
Comparison of Methods

3
Affymetrix GeneChip Probe Arrays
Single stranded, fluorescently labeled DNA target
4
Affymetrix Probe Design
PM is exactly complementary to the published sequence
MM is changed at the 13th base
5
Chip Layout

Typical chips are square 640x640 (U95A), 712x712
(U133) or 1042x1042 (Plus2)
Older chips placed all probes for one gene in a
row
Modern chips distribute probes according to
sequence, not gene

6
Chip Nomenclature

HGU133A: Human Genome, UniGene build 133, first chip
PM: perfect match
MM: mismatch
Control sequence: sequence from an unrelated organism
Signal: intensity; doesn't translate directly to abundance
Cross-hybridization: binding of sequences other than the target

7
Affymetrix Background Adjustment and Normalization
8
What's the Issue?

Background: some Affy chips show consistently
higher values for the lowest signals (presumably
absent) than others
Background may vary over a chip
Normalization: distribution of probe signals may
differ between chips, independent of background
adjustment
PM and MM may be shifted differently

9
Probe Intensities in 23 Replicates
10
Approaches to Background

Subtract common estimate of background
Fit local background across chip and subtract -
MAS 5.0
Consider background as random variable
Use statistical theory to derive background
correction

11
RMA Bayesian BG Correction

Each S = BG + Intensity + e
BG randomly sampled from a Normal distribution
Intensity randomly sampled from an exponential
distribution
Estimate mean and SD of the BG distribution by fitting
values below the mode of the signal distribution
Estimate Intensity, conditional on S, by
integrating over possible values of BG
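The last step above has a closed form when the error term e is ignored, so that S = BG + Intensity with BG Normal and Intensity exponential. The sketch below evaluates that posterior mean. The parameter names mu, sigma (mean and SD of BG) and alpha (rate of the exponential) are illustrative assumptions, and their estimation from values below the mode is not shown.

```python
import math


def normal_pdf(z):
    """Standard Normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rma_background_adjust(s, mu, sigma, alpha):
    """Posterior mean of Intensity given the observed signal s.

    Assumes s = BG + Intensity, BG ~ Normal(mu, sigma^2),
    Intensity ~ Exponential(rate alpha); mu, sigma, alpha are
    taken as already estimated (hypothetical inputs here).
    """
    a = s - mu - sigma * sigma * alpha
    b = sigma
    numerator = normal_pdf(a / b) - normal_pdf((s - a) / b)
    denominator = normal_cdf(a / b) + normal_cdf((s - a) / b) - 1.0
    return a + b * numerator / denominator
```

A signal far above background is reduced by roughly mu + sigma^2 * alpha, while a signal below the background mean still yields a small positive estimate rather than a negative one.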

12
Approaches to Normalization

Simple: find average of each chip; divide all
values by chip average
MAS5: trimmed mean
Invariant set: find subset of probes in almost
same rank order in each chip
Quantile normalization: fit to average quantiles
across experiment

13
Probes on Different Chips
Plots of two Affymetrix chips against the
experiment means
14
MAS 5.0

Plot probes from each chip against common
base-line chip
Fit regression line to middle 98% of probes

15
Invariant Set (Li-Wong) Method

Select baseline chip X
For each other chip Y
Select probes p1, ..., pK (K ≈ 10000), such that
p1 < p2 < ... < pK in both chips
Fit running median through points
(xp1, yp1), ..., (xpK, ypK)
Repeat
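One way to read the selection step above is as a search for a largest set of probes that appear in the same increasing order on both chips. The sketch below does that with a longest-increasing-subsequence search. This is an illustrative stand-in, not dChip's actual iterative procedure, and the running-median fit is not shown; the function name is hypothetical.

```python
from bisect import bisect_left


def rank_invariant_probes(x, y):
    """Indices of a largest probe set ordered identically on both chips.

    x: intensities on the baseline chip, y: intensities on the other chip.
    Illustrative longest-increasing-subsequence selection (an assumption,
    not the published dChip rule).
    """
    # Sort by baseline intensity; ties in x are ordered by decreasing y so
    # that two probes with equal x can never both be selected.
    order = sorted(range(len(x)), key=lambda i: (x[i], -y[i]))
    tails = []      # tails[k]: smallest final y of an increasing run of length k+1
    tails_pos = []  # position in `order` of that final element
    prev = [-1] * len(order)
    for pos, i in enumerate(order):
        k = bisect_left(tails, y[i])
        if k == len(tails):
            tails.append(y[i])
            tails_pos.append(pos)
        else:
            tails[k] = y[i]
            tails_pos[k] = pos
        prev[pos] = tails_pos[k - 1] if k > 0 else -1
    # Walk back from the end of the longest run.
    chain = []
    pos = tails_pos[-1] if tails_pos else -1
    while pos != -1:
        chain.append(order[pos])
        pos = prev[pos]
    return chain[::-1]
```

The returned indices are sorted by increasing intensity on both chips, so the pairs (x[i], y[i]) can be passed directly to a smoother.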

16
Quantile Method (RMA)

Distributions of probe intensities vary
substantially among replicate chips
This cannot be even approximately resolved by any
linear transformation
Drastic solution: shoehorn all probe
intensities into same distribution
Ideal distribution is taken as average of all
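A minimal sketch of this idea, assuming every chip has the same number of probes and that the reference is the mean of the sorted intensities across chips. Ties are broken by probe order here, which is a simplification.

```python
def quantile_normalize(chips):
    """Force every chip onto the same intensity distribution.

    chips: list of equal-length lists, one list of probe intensities per chip.
    The reference distribution is the average of the sorted chips.
    """
    n_probes = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    reference = [
        sum(sc[k] for sc in sorted_chips) / len(chips) for k in range(n_probes)
    ]
    normalized = []
    for chip in chips:
        # Rank the probes on this chip, then replace each value by the
        # reference value of the same rank.
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized
```

After this step every chip has exactly the same set of values; only the assignment of values to probes differs between chips.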

17
Quantile Normalization
Distribution of chip intensities: F1(x)
Reference distribution: F2(x)
Formula: $x_{\mathrm{norm}} = F_2^{-1}(F_1(x))$
Assumes gene distribution changes little
[Figure: density functions and cumulative distribution functions F1(x) and F2(x), with points labeled a, x and y]
18
Ratio-Intensity Before
19
Ratio-Intensity After
20
Critique of RMA Normalization

Distribution of signals looks more like
exponential on log scale
No allowance for regional biases in BG
Quantile normalization is very strong: highly
expressed genes won't be equal
Better to let higher end be roughly linear
Requires much memory - could be implemented
differently

21
Model-based Estimates for Affymetrix Raw Data
22
Many Probes for One Gene
How to combine signals from multiple probes into
a single gene abundance estimate?
23
Probe Variation

Individual probes don't agree on fold changes
Probes for one gene may vary by two orders of
magnitude on each chip
CG content is most important factor in signal
strength

Signal from 16 probes along one gene on one chip
24
Competing Models 2005

GCOS (Affymetrix MicroArray Suite 5.0)
Manufacturer's software
dChip
Li and Wong, HSPH
Bioconductor affy package (RMA)
Bolstad, Irizarry, Speed, et al
Variants such as gcRMA, vsn
Probe-level analyses
affyPLM, logit-t,

25
Probe Measure Variation

Typical probes are two orders of magnitude
different!
CG content is most important factor
RNA target folding also affects hybridization

26
Principles of MAS 5 method

First estimate background
bg = MM (if physically possible)
log(bg) = log(PM) - log(non-specific proportion)
(if impossible)
Non-specific proportion = max(SB, e)
SB = Tukeybiweight(log(PM) - log(MM))
Signal = Tukeybiweight(log(Adjusted PM))
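The Tukey biweight used for SB and Signal above can be sketched as a one-step weighted mean around the median. The tuning constant c = 5 and the small eps = 0.0001 are assumed defaults, not values taken from these slides.

```python
from statistics import median


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight location estimate.

    Points far from the median (relative to the median absolute
    deviation) get weight zero; c and eps are assumed defaults.
    """
    values = list(values)
    center = median(values)
    mad = median([abs(v - center) for v in values])
    weighted_sum = 0.0
    weight_total = 0.0
    for v in values:
        u = (v - center) / (c * mad + eps)
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        weighted_sum += w * v
        weight_total += w
    return weighted_sum / weight_total
```

For example, with log differences clustered near 1.0 and a single value at 10.0, the outlying value receives zero weight and the estimate stays near 1.0.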

27
Critique of MAS 5 principle

Not clear what an average of different probes
should mean
Tukey bi-weight can be unstable when data cluster
at either end, which is frequently the case here
No learning based on cross-chip performance of
individual probes

28
Motivation for multi-chip models

Probe-level data from spike-in study (log scale);
note parallel trend of all probes

Courtesy of Terry Speed
29
Linear Models

Extension of linear regression
Essential features
Measurement errors are independent of each other:
random noise
Needs normalization to eliminate systematic
variation
Noise levels comparable at different levels of
signal
Small number of factors give predicted levels;
they combine in a linear function or simple algebraic
form

30
Model for Probe Signal

Each probe signal is proportional to
i) the amount of target in the sample: a
ii) the affinity of the specific probe sequence
to the target: f
NB: High affinity is not the same as specificity
A probe can give high signal to its intended target and
also to other transcripts

[Diagram: probes 1, 2, 3 with affinities f1, f2, f3; chip 1 and chip 2 with target amounts a1 and a2]
31
Multiplicative Model

For each gene, a set of probes p1, ..., pk
Each probe pj binds the gene with efficiency fj
In each sample i there is an amount qi
Probe intensity should be proportional to fj x qi
Always some noise!

32
Robust Statistics

Outlier: a measurement that is far beyond the typical
random variation
common in biological measures
10-15% in Affy probe sets
Robust methods try to fit the majority of data
points
Issue is to identify which points to down-weight
or ignore
Median is very robust but inefficient
Trimmed means are almost as robust and much more
efficient
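A small illustration of the points above, using made-up log-scale probe signals with a single outlier. The values and the 20% trimming fraction are hypothetical choices.

```python
from statistics import mean, median


def trimmed_mean(values, trim=0.2):
    """Mean after discarding the lowest and highest `trim` fraction.

    `trim` must be below 0.5 so that some values remain.
    """
    xs = sorted(values)
    k = int(len(xs) * trim)
    return mean(xs[k:len(xs) - k])


# Hypothetical log-scale signals for one probe set; the last one is an outlier.
probe_signals = [7.9, 8.1, 8.0, 8.2, 7.8, 8.0, 8.1, 7.9, 8.0, 12.5]

plain = mean(probe_signals)             # pulled upward by the outlier
robust_median = median(probe_signals)   # ignores the outlier entirely
robust_trimmed = trimmed_mean(probe_signals)  # drops two values at each end
```

The plain mean is dragged toward the outlier, while the median and the trimmed mean both stay close to 8.0. The trimmed mean still averages six of the ten values, which is why it loses less efficiency than the median.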

33
Robust Linear Models

Criterion of fit
Least median squares
Sum of weighted squares
Least squares and throw out outliers
Method for finding fit
High-dimensional search
Iteratively re-weighted least squares
Median Polish

34
Why Robust Models for GeneChips?

10-15% of individual signals in a probe set
deviate greatly from pattern
Often outliers lie close together
Causes
Scratches
Proximity to heating elements
Uneven fluid flow

35
Li-Wong (dChip)

Model: PMij = qi fj + eij
- Original model (dChip 1.0) used PMij - MMij =
qi fj + eij
by analogy with Affy MAS 4
Outlier removal
Identify extreme residuals
Remove
Re-fit
Iterate

Fitting probes in one set on one chip
Dark blue: PM values; Red: fitted values; Light
blue: probe SD
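The multiplicative fit for one probe set can be sketched with alternating least squares: hold the probe effects f fixed and solve for the sample amounts q, then the reverse. This is a minimal sketch without the outlier removal steps listed above. The constraint that the squared f values sum to the number of probes is an assumed identifiability convention, and the function name is hypothetical.

```python
def fit_multiplicative(pm, n_iter=50):
    """Fit pm[i][j] ~ q[i] * f[j] by alternating least squares.

    pm: one row per chip i, one column per probe j of a single probe set.
    Returns (q, f), rescaled so that sum(f[j]**2) equals the probe count.
    """
    n_chips, n_probes = len(pm), len(pm[0])
    f = [1.0] * n_probes
    q = [0.0] * n_chips
    for _ in range(n_iter):
        # Least-squares q for fixed f.
        ff = sum(v * v for v in f)
        q = [sum(pm[i][j] * f[j] for j in range(n_probes)) / ff
             for i in range(n_chips)]
        # Least-squares f for fixed q.
        qq = sum(v * v for v in q)
        f = [sum(pm[i][j] * q[i] for i in range(n_chips)) / qq
             for j in range(n_probes)]
        # Rescale; the products q[i] * f[j] are unchanged.
        scale = (sum(v * v for v in f) / n_probes) ** 0.5
        f = [v / scale for v in f]
        q = [v * scale for v in q]
    return q, f
```

Residuals pm[i][j] - q[i] * f[j] from such a fit are what the outlier removal step would inspect before re-fitting.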
36
Critique of Li-Wong model

Model assumes that noise for all probes has same
magnitude
All biological measurements exhibit
intensity-dependent noise

37
Bolstad, Irizarry, Speed (RMA)

For each probe set, take the log transform of
PMij = qi fj
i.e. fit the model
nlog(PMij) = log(qi) + log(fj) + eij
Where nlog() stands for logarithm after
normalization
Fit this additive model by iteratively
re-weighted least-squares or median polish

Critique: assumes probe noise is constant
(homoscedastic) on log scale
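Median polish, one of the two fitting options named above, can be sketched as repeated sweeps of row and column medians over the chips-by-probes table of log intensities. This is a generic median polish under the assumption of a fixed number of sweeps, not the affy package's own code.

```python
from statistics import median


def median_polish(log_pm, n_iter=10):
    """Fit log_pm[i][j] ~ overall + chip[i] + probe[j] by median polish.

    log_pm: one row per chip, one column per probe (log scale).
    Returns (overall, chip_effects, probe_effects, residuals).
    """
    n, m = len(log_pm), len(log_pm[0])
    resid = [list(row) for row in log_pm]
    overall = 0.0
    chip_eff = [0.0] * n
    probe_eff = [0.0] * m
    for _ in range(n_iter):
        # Sweep out row (chip) medians.
        for i in range(n):
            d = median(resid[i])
            chip_eff[i] += d
            resid[i] = [v - d for v in resid[i]]
        d = median(probe_eff)
        overall += d
        probe_eff = [v - d for v in probe_eff]
        # Sweep out column (probe) medians.
        for j in range(m):
            d = median([resid[i][j] for i in range(n)])
            probe_eff[j] += d
            for i in range(n):
                resid[i][j] -= d
        d = median(chip_eff)
        overall += d
        chip_eff = [v - d for v in chip_eff]
    return overall, chip_eff, probe_eff, resid
```

The expression estimate for chip i is overall + chip_effects[i]. A single aberrant probe value ends up in the residuals rather than shifting that chip's estimate.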
38
Comparison of Methods
Green: MAS 5.0; Black: Li-Wong; Blue, Red: RMA
20 replicate arrays: variance should be small
Standard deviations of expression estimates
on arrays, arranged in four groups of genes by
increasing mean expression level
Courtesy of Terry Speed
39
Steady Improvement

Affymetrix improves their model
PLIER is a multi-chip model
MAS P/A (present/absent) calls reasonable
MAS 5.0 estimation does a reasonable job on probe
sets that are bright
Abundant genes
dChip and RMA do better on genes that are less
abundant
Signalling proteins, transcription factors, etc

40
Expression Comparison 1 MAS 4
Ratio-Intensity Plot comparing two chips from
spike-in experiment
White dots represent unchanged genes; red numbers
flag spike-in genes
Courtesy of Terry Speed
41
Expression Comparison 2 MAS 5
[Plot labels: t-scores; changed genes; theoretical t-distribution]
Courtesy of Terry Speed
42
Expression Comparison 3 Li-Wong
Courtesy of Terry Speed
43
Expression Comparison 4 - RMA
Courtesy of Terry Speed
44
Comparison on Real Data

These results are based on samples with 14
spike-ins: not realistic complexity
Choe et al (Genome Biology 2005) produced a spike-in
data set with realistic complexity and found that
MAS5 PM correction worked well
Comparisons of biological variation vs technical
variation in replicated samples suggest RMA
defaults work best

45
Mix and Match Methods in affy

Background: rma, mas
Normalization: quantile, constant,
PM-correction: none,
Model: median polish, mas
Estimates <- expresso(cel.data, bgcorrect.method = "mas",
normalize.method = "quantiles",

46
gcRMA Estimating Non-specific Hybridization