Probe Level Analysis of Affymetrix™ Data

1
Probe Level Analysis of Affymetrix™ Data
  • Mark Reimers, NCI

2
Outline
  • Design of Affy probesets
  • Background
  • Normalization
  • Non-specific hybridization
  • Estimation
  • Comparison of Methods

3
Affymetrix GeneChip Probe Arrays
Single stranded, fluorescently labeled DNA target
4
Affymetrix Probe Design
PM is exactly complementary to the published sequence
MM is changed at the 13th base
5
Chip Layout
  • Typical chips are square 640x640 (U95A), 712x712
    (U133) or 1042x1042 (Plus2)
  • Older chips placed all probes for one gene in a
    row
  • Modern chips distribute probes according to
    sequence, not gene

6
Chip Nomenclature
  • HGU133A - Human Genome Unigene build 133, first
    chip
  • PM - perfect match
  • MM - mismatch
  • Control sequence - sequence from an unrelated
    organism
  • Signal - intensity; doesn't translate directly to
    abundance
  • Cross-hybridization - binding of sequences other
    than the target

7
Affymetrix Background Adjustment and Normalization
8
What's the Issue?
  • Background: some Affy chips show consistently
    higher values for the lowest signals (presumably
    absent) than others
  • Background may vary over a chip
  • Normalization: the distribution of probe signals
    may differ between chips, independent of background
    adjustment
  • PM and MM may be shifted differently

9
Probe Intensities in 23 Replicates
10
Approaches to Background
  • Subtract common estimate of background
  • Fit local background across chip and subtract -
    MAS 5.0
  • Consider background as random variable
  • Use statistical theory to derive background
    correction

11
RMA Bayesian BG Correction
  • Each S = BG + Intensity + e
  • BG randomly sampled from a Normal distribution
  • Intensity randomly sampled from an exponential
    distribution
  • Estimate mean and SD of the BG distribution by
    fitting values below the mode of the signal
    distribution
  • Estimate Intensity, conditional on S, by
    integrating over possible values of BG
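
The last bullet can be illustrated with the closed-form conditional expectation published for the RMA convolution model. This is a minimal sketch, not the affy implementation: it ignores the noise term e, and it assumes the background mean `mu`, background SD `sigma` and exponential rate `alpha` have already been estimated. All function and parameter names are hypothetical.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rma_background_correct(s, mu, sigma, alpha):
    """Expected Intensity given observed signal s.

    Assumes S = BG + Intensity with BG ~ Normal(mu, sigma^2) and
    Intensity ~ Exponential(rate alpha); mu, sigma, alpha are taken
    as known here (assumption for the sketch).
    """
    a = s - mu - sigma * sigma * alpha
    b = sigma
    numerator = normal_pdf(a / b) - normal_pdf((s - a) / b)
    denominator = normal_cdf(a / b) + normal_cdf((s - a) / b) - 1.0
    return a + b * numerator / denominator


# Made-up parameter values, for illustration only.
bright = rma_background_correct(1000.0, mu=100.0, sigma=20.0, alpha=0.01)
dim = rma_background_correct(50.0, mu=100.0, sigma=20.0, alpha=0.01)
```

For bright probes the correction is close to subtracting a constant; for signals below the background mean the estimate stays positive, which a plain subtraction would not guarantee.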

12
Approaches to Normalization
  • Simple: find the average of each chip and divide
    all values by the chip average
  • MAS5: trimmed mean
  • Invariant set: find a subset of probes in almost
    the same rank order in each chip
  • Quantile normalization: fit to average quantiles
    across the experiment
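
The first two approaches amount to rescaling each chip by a location estimate. A minimal sketch follows, with hypothetical function names; the trimming fraction is a free parameter here and is not taken from the slides.

```python
def trimmed_mean(values, trim=0.02):
    """Mean after dropping a fraction `trim` of values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def scale_normalize(chips, target=None, trim=0.0):
    """Rescale every chip so its (trimmed) mean equals `target`.

    chips: list of lists of probe intensities, one list per chip.
    With trim=0.0 this is the simple chip-average method; a positive
    trim gives a trimmed-mean variant.
    """
    means = [trimmed_mean(chip, trim) for chip in chips]
    if target is None:
        # Default target: the average of the chip means.
        target = sum(means) / len(means)
    return [[v * target / m for v in chip] for chip, m in zip(chips, means)]


normalized = scale_normalize([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
```

Scaling only moves each chip by a constant factor, so it cannot remove differences in the shape of the intensity distributions.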

13
Probes on Different Chips
Plots of two Affymetrix chips against the
experiment means
14
MAS 5.0
  • Plot probes from each chip against common
    base-line chip
  • Fit regression line to middle 98% of probes

15
Invariant Set (Li-Wong) Method
  • Select baseline chip X
  • For each other chip Y:
  • Select probes p1, ..., pK (K ~ 10000) such that
    p1 < p2 < ... < pK in both chips
  • Fit running median through the points
    (x_p1, y_p1), ..., (x_pK, y_pK)
  • Repeat
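
Read literally, the selection step asks for a set of probes whose intensities are in the same order on both chips. One way to find a largest such set is a longest-increasing-subsequence search, sketched below. This is only a literal reading of the slide and is not claimed to be the selection procedure dChip itself uses; the function name is hypothetical, and the running-median fit is left out.

```python
from bisect import bisect_left


def rank_invariant_subset(x, y):
    """Indices of a largest probe subset strictly increasing in both x and y.

    x, y: intensities of the same probes on the baseline chip and
    on the chip to be normalized.
    """
    # Visit probes by increasing x; ties in x are visited by decreasing y
    # so that two probes with equal x can never both be selected.
    order = sorted(range(len(x)), key=lambda i: (x[i], -y[i]))
    tails = []     # tails[k]: smallest final y of an increasing run of length k+1
    tail_idx = []  # probe index that achieves tails[k]
    prev = {}      # back-pointers for reconstructing the subset
    for i in order:
        k = bisect_left(tails, y[i])
        if k == len(tails):
            tails.append(y[i])
            tail_idx.append(i)
        else:
            tails[k] = y[i]
            tail_idx[k] = i
        prev[i] = tail_idx[k - 1] if k > 0 else None
    subset = []
    i = tail_idx[-1] if tail_idx else None
    while i is not None:
        subset.append(i)
        i = prev[i]
    return subset[::-1]


# Probe 1 is out of order on the second chip and is excluded.
invariant = rank_invariant_subset([1, 2, 3, 4, 5], [10, 50, 20, 30, 40])
```

The normalization curve would then be fitted through the (x, y) pairs of the selected probes only.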

16
Quantile Method (RMA)
  • Distributions of probe intensities vary
    substantially among replicate chips
  • This cannot be even approximately resolved by any
    linear transformation
  • Drastic solution: shoehorn all probe
    intensities into the same distribution
  • Ideal distribution is taken as average of all

17
Quantile Normalization
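Formula: x_norm = F2^-1(F1(x))
F1(x): cumulative distribution function of the chip intensities
F2(x): cumulative distribution function of the reference distribution
Assumes gene distribution changes little
[Figure: density functions and cumulative distribution functions of the chip intensities and the reference distribution]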
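
With empirical distributions, the formula reduces to replacing each value by the reference value of the same rank, where the reference is the average of the sorted chips. A minimal sketch, assuming all chips have the same number of probes and breaking ties by position rather than averaging them; the function name is hypothetical.

```python
def quantile_normalize(chips):
    """Map every chip onto the average empirical distribution.

    chips: list of equal-length lists of probe intensities.
    """
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean of the k-th smallest value across chips.
    reference = [sum(col) / len(chips) for col in zip(*sorted_chips)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


normalized = quantile_normalize([[5, 2, 3], [4, 1, 9]])
```

After this step every chip has exactly the same set of values; only the assignment of values to probes differs between chips.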
18
Ratio-Intensity Before
19
Ratio-Intensity After
20
Critique of RMA Normalization
  • Distribution of signals looks more like
    exponential on log scale
  • No allowance for regional biases in BG
  • Quantile normalization is very strong: highly
    expressed genes won't be equal
  • Better to let higher end be roughly linear
  • Requires much memory - could be implemented
    differently

21
Model-based Estimates for Affymetrix Raw Data
22
Many Probes for One Gene
How to combine signals from multiple probes into
a single gene abundance estimate?
23
Probe Variation
  • Individual probes don't agree on fold changes
  • Probes for one gene may vary by two orders of
    magnitude on each chip
  • CG content is most important factor in signal
    strength

Signal from 16 probes along one gene on one chip
24
Competing Models 2005
  • GCOS (Affymetrix MicroArray Suite 5.0):
    manufacturer's software
  • dChip: Li and Wong, HSPH
  • Bioconductor affy package (RMA): Bolstad, Irizarry,
    Speed, et al.
  • Variants such as gcRMA, vsn
  • Probe-level analyses: affyPLM, logit-t, ...

25
Probe Measure Variation
  • Typical probes are two orders of magnitude
    different!
  • CG content is most important factor
  • RNA target folding also affects hybridization

[Figure: probe signals; axis runs from 0 to 3x10^4]
26
Principles of MAS 5 method
  • First estimate background:
  • bg = MM (if physically possible)
  • log(bg) = log(PM) - log(non-specific proportion)
    (if impossible)
  • Non-specific proportion = max(SB, e)
  • SB = Tukeybiweight(log(PM) - log(MM))
  • Signal = Tukeybiweight(log(Adjusted PM))
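
The steps above can be sketched with a one-step Tukey biweight. This follows the simplified rule on the slide, not the full MAS 5.0 algorithm; logs are taken base 2, SB is used directly as a log-scale offset, and the tuning constants `c`, `eps` and the floor `floor` are assumptions of the sketch. Function names are hypothetical.

```python
import math
from statistics import median


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that down-weights
    points far from the median, measured in units of the MAD."""
    m = median(values)
    mad = median([abs(v - m) for v in values])
    weights = []
    for v in values:
        u = (v - m) / (c * mad + eps)
        weights.append((1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def mas5_style_signal(pm, mm, floor=0.03):
    """Log2 signal for one probe set, following the slide's simplified rule."""
    sb = tukey_biweight([math.log2(p) - math.log2(m) for p, m in zip(pm, mm)])
    adjusted = []
    for p, m in zip(pm, mm):
        if m < p:
            bg = m  # mismatch is usable as background
        else:
            # Mismatch exceeds PM: derive background from PM instead.
            bg = p / 2.0 ** max(sb, floor)
        adjusted.append(p - bg)
    return tukey_biweight([math.log2(a) for a in adjusted])


# Made-up probe set in which the last probe has MM > PM.
signal = mas5_style_signal([200.0, 400.0, 300.0, 100.0],
                           [100.0, 150.0, 100.0, 120.0])
```

Because the offset max(SB, floor) is strictly positive, the adjusted PM is always positive and its logarithm is defined.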

27
Critique of MAS 5 principle
  • Not clear what an average of different probes
    should mean
  • Tukey bi-weight can be unstable when data cluster
    at either end, which is frequently the case here
  • No learning based on cross-chip performance of
    individual probes

28
Motivation for multi-chip models
  • Probe level data from spike-in study (log
    scale); note parallel trend of all probes

Courtesy of Terry Speed
29
Linear Models
  • Extension of linear regression
  • Essential features:
  • Measurement errors independent of each other
    (random noise); needs normalization to eliminate
    systematic variation
  • Noise levels comparable at different levels of
    signal
  • Small number of factors give predicted levels,
    combined in a linear function or simple algebraic
    form

30
Model for Probe Signal
  • Each probe signal is proportional to:
  • i) the amount of target sample, a
  • ii) the affinity of the specific probe sequence
    to the target, f
  • NB: high affinity is not the same as specificity
  • A probe can give high signal to its intended target
    and also to other transcripts

[Figure: probes 1, 2, 3 with affinities f1, f2, f3; chips 1 and 2 with target amounts a1, a2]
31
Multiplicative Model
  • For each gene, a set of probes p_1, ..., p_k
  • Each probe p_j binds the gene with efficiency f_j
  • In each sample i there is an amount q_i
  • Probe intensity should be proportional to f_j × q_i
  • Always some noise!

32
Robust Statistics
  • Outlier: a measurement that is far beyond the
    typical random variation
  • Common in biological measures
  • 10-15% in Affy probe sets
  • Robust methods try to fit the majority of data
    points
  • Issue is to identify which points to down-weight
    or ignore
  • Median is very robust but inefficient
  • Trimmed means are almost as robust and much more
    efficient

33
Robust Linear Models
  • Criterion of fit
  • Least median squares
  • Sum of weighted squares
  • Least squares and throw out outliers
  • Method for finding fit
  • High-dimensional search
  • Iteratively re-weighted least squares
  • Median Polish

34
Why Robust Models for GeneChips?
  • 10-15% of individual signals in a probe set
    deviate greatly from pattern
  • Often outliers lie close together
  • Causes
  • Scratches
  • Proximity to heating elements
  • Uneven fluid flow

35
Li-Wong (dChip)
  • Model: PM_ij = q_i f_j + e_ij
  • Original model (dChip 1.0) used
    PM_ij - MM_ij = q_i f_j + e_ij,
    by analogy with Affy MAS 4
  • Outlier removal
  • Identify extreme residuals
  • Remove
  • Re-fit
  • Iterate

Fitting probes in one set on one chip
Dark blue: PM values; red: fitted values; light blue: probe SD
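
A multiplicative model of this form can be fitted by alternating least squares: hold the sample amounts fixed and solve for the probe affinities, then the reverse. The sketch below covers only the fit, without the outlier-removal loop. The constraint that the squared affinities sum to the number of probes is chosen here to make the scale identifiable; the function name and iteration count are hypothetical.

```python
import math


def fit_li_wong(pm, iterations=20):
    """Least-squares fit of pm[i][j] ~ q[i] * f[j].

    pm: one row per chip i, one column per probe j.
    Returns (q, f) with sum(f_j^2) equal to the number of probes.
    """
    n_chips, n_probes = len(pm), len(pm[0])
    q = [sum(row) / n_probes for row in pm]  # start from chip means
    f = [1.0] * n_probes
    for _ in range(iterations):
        # Best f for fixed q.
        qq = sum(v * v for v in q)
        f = [sum(pm[i][j] * q[i] for i in range(n_chips)) / qq
             for j in range(n_probes)]
        # Rescale so that sum(f_j^2) == n_probes.
        scale = math.sqrt(n_probes / sum(v * v for v in f))
        f = [v * scale for v in f]
        # Best q for fixed f.
        ff = sum(v * v for v in f)
        q = [sum(pm[i][j] * f[j] for j in range(n_probes)) / ff
             for i in range(n_chips)]
    return q, f


# Noise-free made-up data: 3 chips, 3 probes.
pm = [[qi * fj for fj in (0.5, 1.0, 1.5)] for qi in (100.0, 200.0, 400.0)]
q, f = fit_li_wong(pm)
```

Residuals pm[i][j] - q[i] * f[j] from such a fit are what the outlier-removal steps would inspect before re-fitting.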
36
Critique of Li-Wong model
  • Model assumes that noise for all probes has same
    magnitude
  • All biological measurements exhibit
    intensity-dependent noise

37
Bolstad, Irizarry, Speed (RMA)
  • For each probe set, take the log transform of
    PM_ij = q_i f_j,
  • i.e. fit the model
    nlog(PM_ij) = log(q_i) + log(f_j) + e_ij
  • Fit this additive model by iteratively
    re-weighted least-squares or median polish

Where nlog() stands for logarithm after
normalization
Critique: assumes probe noise is constant
(homoscedastic) on log scale
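
Median polish fits an additive model by alternately sweeping out row medians and column medians. A minimal sketch follows, assuming the input is already a table of normalized log intensities with one row per chip and one column per probe; the function name and iteration count are hypothetical.

```python
from statistics import median


def median_polish(table, iterations=10):
    """Decompose table[i][j] into overall + row[i] + col[j] + residual[i][j]."""
    n_rows, n_cols = len(table), len(table[0])
    resid = [list(row) for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(iterations):
        # Sweep row medians into the row effects.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep column medians into the column effects.
        for j in range(n_cols):
            m = median([resid[i][j] for i in range(n_rows)])
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff, resid


# Made-up log-scale probe set: 3 chips (rows) by 4 probes (columns).
chip_levels = [1.0, 2.0, 4.0]
probe_effects = [0.0, 10.0, 20.0, -5.0]
table = [[a + b for b in probe_effects] for a in chip_levels]
overall, row_eff, col_eff, resid = median_polish(table)
```

With chips as rows, overall + row_eff[i] serves as the log-scale expression estimate for chip i, and a single outlying probe value ends up in the residuals instead of shifting the estimate.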
38
Comparison of Methods
Green: MAS5.0; Black: Li-Wong; Blue, Red: RMA
20 replicate arrays: variance should be small
Standard deviations of expression estimates on
arrays, arranged in four groups of genes by
increasing mean expression level
Courtesy of Terry Speed
39
Steady Improvement
  • Affymetrix improves their model
  • PLIER is a multi-chip model
  • MAS P/A calls reasonable
  • MAS 5.0 estimation does a reasonable job on probe
    sets that are bright (abundant genes)
  • dChip and RMA do better on genes that are less
    abundant (signalling proteins, transcription
    factors, etc.)

40
Expression Comparison 1: MAS 4
Ratio-Intensity Plot comparing two chips from
spike-in experiment
White dots represent unchanged genes
Red numbers flag spike-in genes
Courtesy of Terry Speed
41
Expression Comparison 2: MAS 5
[Figure: t-scores, with changed genes marked, against the theoretical t-distribution]
Courtesy of Terry Speed
42
Expression Comparison 3: Li-Wong
Courtesy of Terry Speed
43
Expression Comparison 4: RMA
Courtesy of Terry Speed
44
Comparison on Real Data
  • These results are based on samples with 14
    spike-ins - not realistic complexity
  • Choe et al. (Genome Biology 2005) produced a
    spike-in data set with realistic complexity and
    found that MAS5 PM correction worked well
  • Comparisons of biological variation vs technical
    variation in replicated samples suggest RMA
    defaults work best

45
Mix and Match Methods in affy
  • Background: rma, mas
  • Normalization: quantile, constant, ...
  • PM-correction: none, ...
  • Model: median polish, mas
  • Estimates <- expresso(cel.data, bgcorrect.method = "mas",
    normalize.method = "quantiles", ...

46
gcRMA: Estimating Non-specific Hybridization
  • Each probe has its own characteristic
    cross-hybridizations (NSH)
  • Mismatch is not a good estimate of NSH
  • GC content may predict NSH reasonably well