Title: Microarray Data Processing for Affymetrix arrays
1. Microarray Data Processing for Affymetrix arrays
- Ben Bolstad
- Biostatistics
- University of California, Berkeley
- www.stat.berkeley.edu/bolstad
2. Goals of this session
- To understand and use some of the tools for exploring and pre-processing Affymetrix data.
- This session has two parts:
  - Theory: discussion of methodology
  - Hands-on experimentation with BioC tools
3. Affymetrix GeneChip arrays
- High density oligonucleotide array technology as developed by Affymetrix
- www.affymetrix.com
Overview images courtesy of Affymetrix unless otherwise specified
4. Probes and Probesets
5. Two Probe Types
Reference Sequence: TAGGTCTGTATGACAGACACAAAGAAGATG
PM (the Perfect Match): CAGACATAGTGTCTGTGTTTCTTCT
MM (the Mismatch): CAGACATAGTGTGTGTGTTTCTTCT
6. Constructing the Chip
Source: Lipshutz et al. (1999), Nature Genetics Supplement, The Chipping Forecast
7. Focusing on a Single GeneChip Cell Location
8. Sample Preparation
9. Hybridization to the Chip
10. The Chip is Scanned
11. Chip dat file: checkered board, close up, pixel selection
12. Chip cel file: checkered board
Courtesy F. Colin
13. Pre-processing Affymetrix Microarrays
- Take the 500K probe intensities and turn them into 15K gene expression measures
- Computing expression measures:
  - Background adjustment
  - Normalization
  - Summarization
- I will discuss the steps in the RMA algorithm in more detail
14. Background/Signal Adjustment
- A method which does some or all of the following:
  - Corrects for background noise and processing effects
  - Adjusts for cross hybridization
  - Adjusts estimated expression values to fall on the proper scale
- Probe intensities are used in background adjustment to compute the correction (unlike cDNA arrays, where the area surrounding the spot might be used)
15. RMA Background Approach
Observed: O
Signal: S
Noise: N
16. Correction is given by
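The correction formula on this slide did not survive as text. As a hedged sketch: published descriptions of RMA model the observed PM intensity as O = S + N, with the signal S exponentially distributed (rate alpha) and the noise N normally distributed (mean mu, standard deviation sigma), and take the corrected value to be E[S | O = o]. The function names below are hypothetical, and the parameter values in the usage lines are made up for illustration; in practice mu, sigma and alpha are estimated from the observed intensities.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rma_background_correct(o, mu, sigma, alpha):
    """E[S | O = o] under the assumed exponential-plus-normal model.

    Closed form as commonly published for RMA, with
    a = o - mu - sigma^2 * alpha and b = sigma.
    """
    a = o - mu - sigma * sigma * alpha
    b = sigma
    numerator = normal_pdf(a / b) - normal_pdf((o - a) / b)
    denominator = normal_cdf(a / b) + normal_cdf((o - a) / b) - 1.0
    return a + b * numerator / denominator


# Illustrative (made-up) parameters, not estimated from real data.
observed = [50.0, 100.0, 200.0, 1000.0]
corrected = [rma_background_correct(o, mu=100.0, sigma=20.0, alpha=0.01)
             for o in observed]
print(corrected)
```

The corrected values stay positive even when the observed intensity is below the background mean, which is one reason this style of adjustment is preferred to simple subtraction.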
17. Other background correction methods
- MAS 5.0
  - Location-specific gridding
  - Subtraction of Mismatch
- GCRMA
  - Uses sequence information to derive a background adjustment
18. Normalization
- "Non-biological factors can contribute to the variability of data ... In order to reliably compare data from multiple probe arrays, differences of non-biological origin must be minimized." [1]
- Normalization is a process of reducing unwanted variation across chips. It may use information from multiple chips.
[1] GeneChip 3.1 Expression Analysis Algorithm Tutorial, Affymetrix technical support
19. Non-Biological Variability
5 scanners for 6 dilution groups
20. Non-linear normalization needed
Panels: Unnormalized; Scaled; A Non-linear Normalization
21. Quantile Normalization
- Normalize so that the quantiles of each chip are equal. Simple and fast algorithm. The goal is to give the same distribution to each chip.
Figure labels: Original Distribution; Target Distribution
22.
- Sort columns of original matrix
- Take averages across rows
- Set average as value for all elements in the row
- Unsort columns of matrix to original order
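The four steps above can be sketched in a few lines. This is a minimal illustration, assuming rows are probes and columns are chips; the function name is hypothetical and ties are given no special handling, which a production implementation would need to address.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a probes-by-chips matrix given as a list of rows."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])

    # Step 1: sort each column, remembering where each value came from.
    order = []
    sorted_cols = []
    for j in range(n_cols):
        idx = sorted(range(n_rows), key=lambda i: matrix[i][j])
        order.append(idx)
        sorted_cols.append([matrix[i][j] for i in idx])

    # Step 2: average across each row of the sorted matrix.
    row_means = [sum(sorted_cols[j][r] for j in range(n_cols)) / n_cols
                 for r in range(n_rows)]

    # Steps 3 and 4: assign the row mean to every element of that row,
    # then put each column back into its original order.
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        for rank, i in enumerate(order[j]):
            result[i][j] = row_means[rank]
    return result


# Toy data: 4 probes on 2 chips (made-up values).
chips = [[2.0, 4.0],
         [5.0, 14.0],
         [4.0, 8.0],
         [3.0, 9.0]]
print(quantile_normalize(chips))
# [[3.0, 3.0], [9.5, 9.5], [6.5, 5.5], [5.5, 6.5]]
```

After normalization both columns contain the same set of values (3.0, 5.5, 6.5, 9.5), so every chip has the same empirical distribution while each probe keeps its rank within its chip.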
23. It Reduces Variability
Plot labels: Fold change; Expression Values
Also no serious bias effects. For more, see Bolstad et al. (2003).
24. Other normalization methods
- Scaling
- Non-linear with baseline
- Cyclic Loess
- Contrast
- VSN
25. Summarization
- Problem: calculating gene expression values.
- How do we reduce the 11-20 probe intensities for each probeset to a single gene expression value?
- Our approach:
  - RMA: a robust multi-chip linear model fit on the log scale
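26. The RMA Model
$y_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij}$
- where $y_{ij}$ is the pre-processed (background-adjusted, normalized) log2 PM intensity of probe $i$ on array $j$
- $\alpha_i$ is a probe effect, $i = 1, \dots, I$
- $\beta_j$ is a chip effect ($\mu + \beta_j$ is the log2 gene expression on array $j$), $j = 1, \dots, J$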
27. Median Polish Algorithm
Imposes Constraints
Sweep Rows
Sweep Columns
Iterate
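The sweep-and-iterate idea can be sketched as follows. This is a minimal illustration of Tukey's median polish, assuming rows are probes, columns are arrays, and the input is already on the log2 scale; the function name and stopping rule are hypothetical choices, not the RMA software's own. The constraints imposed are that the row (probe) effects and the column (array) effects each have median zero.

```python
from statistics import median


def median_polish(y, max_iter=10, tol=1e-8):
    """Fit y[i][j] ~ overall + row_eff[i] + col_eff[j] by median polish."""
    n_rows, n_cols = len(y), len(y[0])
    resid = [row[:] for row in y]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols

    for _ in range(max_iter):
        change = 0.0

        # Sweep rows: move each row median into the row effect.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
            change += abs(m)
        m = median(col_eff)  # keep column effects centred at zero
        overall += m
        col_eff = [c - m for c in col_eff]

        # Sweep columns: move each column median into the column effect.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
            change += abs(m)
        m = median(row_eff)  # keep row effects centred at zero
        overall += m
        row_eff = [r - m for r in row_eff]

        # Iterate until the sweeps no longer change anything.
        if change < tol:
            break

    return overall, row_eff, col_eff, resid


# Toy probeset: 3 probes x 3 arrays, with one outlying intensity (20.0).
probeset = [[6.5, 7.0, 20.0],
            [7.5, 8.0, 8.5],
            [9.5, 10.0, 10.5]]
overall, probe_eff, array_eff, resid = median_polish(probeset)
expression = [overall + b for b in array_eff]
print(expression)
```

In this toy probeset the outlier ends up in the residuals rather than in the array-level expression estimates, which is the robustness that motivates using medians instead of means.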
28. Other summarization approaches
- Single chip
  - AvDiff (Affymetrix): no longer recommended for use due to many flaws
  - MAS 5.0 (Affymetrix): uses a one-step Tukey biweight to combine the probe intensities on the log scale
- Multiple chip
  - MBEI (Li-Wong dChip): a multiplicative model on the natural scale
30. RMA mostly does well in practice
Detecting Differential Expression
Not noisy in low intensities
Panels: RMA; MAS 5.0
31. One Drawback
Panels: RMA; MAS 5.0
Some fixes for this are being developed; see GCRMA (Irizarry and Wu, JHU).
32. For more comparisons see affycomp
33. Probe Level Modelling
- Robust regression using M-estimation
- In this talk, we will use Huber's influence function. The software handles many more.
- The fitting algorithm is IRLS, with weights dependent on the current residuals
- Software for fitting such models is part of the affyPLM package of Bioconductor
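To make the IRLS idea concrete, here is a minimal sketch for the simplest possible case, a robust location estimate, rather than a full probe-level model. It is not the affyPLM implementation: the function names are hypothetical, and the tuning constant k = 1.345 and the MAD-based scale estimate are conventional choices assumed here for illustration.

```python
from statistics import median


def huber_weight(u, k=1.345):
    """Huber weight: 1 for small standardized residuals, k/|u| beyond k."""
    au = abs(u)
    return 1.0 if au <= k else k / au


def irls_location(values, k=1.345, max_iter=50, tol=1e-10):
    """Robust location estimate by iteratively reweighted least squares."""
    est = median(values)  # robust starting value
    weights = [1.0] * len(values)
    for _ in range(max_iter):
        resid = [v - est for v in values]
        # Scale from the median absolute residual (assumed choice).
        scale = median(abs(r) for r in resid) / 0.6745
        if scale == 0.0:
            break
        # Weights depend on the current residuals.
        weights = [huber_weight(r / scale, k) for r in resid]
        # Weighted least squares step for a single location parameter.
        new_est = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        converged = abs(new_est - est) < tol
        est = new_est
        if converged:
            break
    return est, weights


# Made-up log2 intensities with one outlier.
intensities = [8.0, 8.1, 7.9, 8.2, 7.8, 15.0]
estimate, weights = irls_location(intensities)
print(estimate, weights)
```

The outlying value receives a weight far below 1, so it has little influence on the estimate. In a probe-level model the same loop applies, with the weighted mean replaced by a weighted least squares fit of probe and array effects.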
34. We Will Focus on the Summarization PLM
- Array effect model
- With constraint
Equation labels: Pre-processed Log PM intensity; Probe Effect; Array Effect
35. Quality Assessment using PLM
- PLM quantities useful for assessing chip quality
- Weights
- Residuals
- Standard Errors
- Expression values relative to median chip
36. Pseudo-chip images
Residuals
Weights
Positive Residuals
Negative Residuals
37. An Image Gallery
Crop Circles
Tricolor
Ring of Fire
http://www.stat.berkeley.edu/bolstad/PLMImageGallery/
38. NUSE Plots
- NUSE: Normalized Unscaled Standard Errors
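As a rough sketch of how NUSE values are usually formed: the standard error of each probeset's expression estimate is divided by the median standard error of that probeset across arrays, so a typical value is 1. The function name and the numbers below are hypothetical; in practice the standard errors come from the PLM fit.

```python
from statistics import median


def nuse(standard_errors):
    """Scale each probeset's standard errors by their median across arrays.

    standard_errors: list of rows (probesets), one value per array.
    """
    result = []
    for row in standard_errors:
        m = median(row)
        result.append([se / m for se in row])
    return result


# Made-up standard errors: 3 probesets x 3 arrays.
se = [[0.10, 0.20, 0.40],
      [0.05, 0.05, 0.10],
      [0.20, 0.20, 0.30]]
values = nuse(se)

# One summary per array: the median NUSE over probesets.
per_array = [median(row[j] for row in values) for j in range(3)]
print(per_array)
```

An array whose NUSE values sit well above 1 has larger standard errors than the other arrays for most probesets, which flags it as potentially lower quality.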
39. RLE Plots
Relative Log Expression
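A minimal sketch of the RLE computation, assuming log2 expression values arranged with probesets as rows and arrays as columns: each value is expressed relative to the median of its probeset across arrays (the "median chip"). The function name and data are hypothetical.

```python
from statistics import median


def relative_log_expression(log_expr):
    """Subtract each probeset's median across arrays from its values."""
    result = []
    for row in log_expr:
        m = median(row)
        result.append([v - m for v in row])
    return result


# Made-up log2 expression values: 2 probesets x 3 arrays.
log_expr = [[8.0, 8.5, 9.0],
            [5.0, 5.0, 7.0]]
print(relative_log_expression(log_expr))
# [[-0.5, 0.0, 0.5], [0.0, 0.0, 2.0]]
```

When most genes are not changing, the RLE values for a good array should be centred near zero with a small spread, so boxplots of RLE per array make problem arrays stand out.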
40. A word of acknowledgement
Some slides: Terry Speed, Francois Colin, Rafael Irizarry