Title: Analysis of Affymetrix Microarray Data
1Analysis of Affymetrix Microarray Data
John Okyere (NASC)
2Overview of Presentation
- Experiment Design
- Data Normalization and Expression Value Calculation
- Statistical Analysis
- Data Interpretation
3Experiment Design
5Affymetrix Terminology
Probe: A 25-mer oligo complementary to a sequence of interest, attached to a glass surface on the probe array
Perfect Match (PM): Probes that are complementary to the sequence of interest
Mismatch (MM): Probes that are complementary to the sequence of interest except for a homomeric base change (A-T or G-C) at the 13th position
Probe Pair (PP): A combination of a PM and an MM probe; 11-16 probe pairs per probe set
Probe Cell: A single feature; size can be 18 x 18 or 20 x 20 µm
9Experimental Design Flow
Simplified Data Analysis
Pilot Study
Full Scale Experiment
Publication
Bioinformatics
Data Validation
Complete Analysis
10Advantages of a Pilot Study
- Estimate experimental variability
- Refine laboratory methods/techniques
- Refine experimental design
- Allows for rapid screening
- Provides preliminary data for project funding
11Three Sources of Variability
- Biological: Differences between samples
- - The ultimate goal of the research
- Technical: Sample preparation
- - Protocols and operator
- System: Probe array analysis
- - Arrays, instruments, reagents
12Controlling Biological Variability
- Biological variability contributes more to experimental variability than technical variability
- To mitigate biological variability:
- - Consider all potential variables as part of the experiment design
- - Increase the number of biological replicates until the Coefficient of Variation (CV) stabilizes
13Examples of Biological Variability
- Cell Cycle Patterns: What time of day were the samples isolated?
- Circadian Rhythm: What is the time interval between time course samples?
- Nutrient: Media types will affect expression levels
- Tissue: Each cell type has a different expression pattern
- Temperature: Growth room temperature may vary within a 24 h period
- Disease: Defense genes will alter the global gene expression pattern
- Germination time: Different seed batches will alter the gene expression pattern
14Practical Questions to Consider
- How much variability does your system have?
- - Understand and minimize variation
- What level of significance is needed?
- - More replicates needed for subtle changes
- How many treatments? How many controls?
- - Comparative analysis (one experimental condition) or serial analysis design (multiple experimental conditions)?
15Percentage CV as Estimate of Variability
- CV is a measure of variability amongst replicates of a single condition
- Defined as the standard deviation divided by the mean, multiplied by 100
- Example: 5 signal values representing 5 replicates
- - 230.4, 241.7, 252.9, 338.8, 178.9
- - Mean 248.54 ± 57.9 (standard deviation), CV ≈ 23.3%
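The CV definition above can be checked with a few lines of Python. This is a minimal sketch that assumes the sample standard deviation (n - 1 denominator) is intended; the function name `percent_cv` is illustrative only.

```python
import statistics

def percent_cv(values):
    """Return the percentage coefficient of variation of replicate signals."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator), assumed here
    return 100.0 * sd / mean

# The five replicate signal values from the slide
replicates = [230.4, 241.7, 252.9, 338.8, 178.9]
print(f"mean={statistics.mean(replicates):.2f} "
      f"sd={statistics.stdev(replicates):.2f} "
      f"CV={percent_cv(replicates):.2f}%")
```

Re-running the same function as biological replicates are added is one way to see whether the CV has stabilized.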
19Experimental Replicates
- Technical replicates: Taken from the same sample; reproduce the contribution of bench effects to the overall variability
- Biological replicates: True replicates that reproduce the biological conditions explored in the experimental design
- - Permit the use of formal statistical tests
- - Also allow the interrogation of technical variability
20RNA Sample Pooling
- Can increase sample quantity
- A common variance mitigation strategy
- Can result in irreversible loss of information by introducing a bias
- If necessary, pool a minimum of three and a maximum of five RNAs
- Equal pooling of RNA samples is essential
21Data Normalization
22Why Normalize?
- To correct for systematic measurement error and bias in data
- - Differences in probe labeling
- - Target concentration
- - Hybridization efficiency
- - Scanner noise
- Allows for data comparison
23Data Normalization Methods
- Scaling Factor (linear) normalization
- - Global or selected gene set
- - Works well when data quality metrics are consistent
- - Simplifies database construction
- - Weakness: assumes error is uniform across all genes and assumes total mRNA is the same for all cells
- Non-linear
- - Can provide higher precision, especially at the extremes
- - Requires selected gene (invariant) set
- - May give false confidence in poor data
24Normalization Curves
Not normalized
Normalized
25Scaling Data to a Target Intensity
(Figure: average intensities of Exp. 1 to Exp. 7 relative to the Target Intensity of 100)
TGT = Average intensity × Scaling Factor
- If the scaling factor is < 3-fold, comparisons can be made between all experiments
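The relation TGT = average intensity × scaling factor can be sketched as follows. Two assumptions are made here and are not stated in the slides: a plain arithmetic mean is used as the "average intensity" (analysis software may use a trimmed mean instead), and the 3-fold rule is read as "the largest and smallest scaling factors across arrays differ by less than 3-fold". All function names and intensity values are hypothetical.

```python
def scaling_factor(intensities, target=100.0):
    """Factor that brings the array's average intensity to the target."""
    average = sum(intensities) / len(intensities)  # plain mean (assumption)
    return target / average

def scale_array(intensities, target=100.0):
    """Multiply every intensity on one array by that array's scaling factor."""
    sf = scaling_factor(intensities, target)
    return [x * sf for x in intensities]

def comparable(factors, max_fold=3.0):
    """True if the scaling factors of all arrays lie within max_fold of each other."""
    return max(factors) / min(factors) < max_fold

# Hypothetical raw intensities for three arrays
arrays = {
    "Exp. 1": [40.0, 60.0, 80.0],
    "Exp. 2": [90.0, 120.0, 150.0],
    "Exp. 3": [70.0, 100.0, 130.0],
}
factors = {name: scaling_factor(vals) for name, vals in arrays.items()}
print(factors)
print("comparable:", comparable(list(factors.values())))
```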
26Expression Value Calculation (Signal)
- The signal represents the amount of transcript in solution
- Signal is calculated as follows:
- - Cell intensities are preprocessed for global background
- - An ideal mismatch value is calculated and subtracted to adjust the PM intensity
- - The adjusted PM intensities are log transformed to stabilize the variance
- - Tukey's biweight estimator is used to provide a robust mean of the signal
- - Signal is output as the antilog of the mean signal value
- - Finally, the signal is scaled to generate normalized data
27Expression Value Calculation (Signal)
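Method
Specific Background (SB):
$SB = T_{bi}(\log_2(PM) - \log_2(MM)) \rightarrow IM$
Probe Value (PV) and Signal Log Value (SLV):
$V = \max(PM - IM, \delta)$
$PV = \log_2(V)$
$SLV = T_{bi}(PV_1, \ldots, PV_n)$
($T_{bi}$: Tukey's biweight; $IM$: ideal mismatch; $\delta$: a small positive constant that keeps $V$ positive)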
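The signal steps can be sketched in Python as below. This is an illustration under stated assumptions, not the vendor implementation: the one-step Tukey biweight uses tuning constants c = 5 and eps = 0.0001, the ideal-mismatch rule uses contrast and scale parameters of 0.03 and 10, and the floor is 2^-20. These values follow the published description of the algorithm as recalled here and should be checked against the vendor documentation. Background preprocessing and the final scaling step are omitted, and all function names are hypothetical.

```python
import math
import statistics

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a robust weighted mean centred on the median."""
    m = statistics.median(values)
    mad = statistics.median([abs(x - m) for x in values])
    weights = []
    for x in values:
        u = (x - m) / (c * mad + eps)
        # Points far from the median (|u| >= 1) get zero weight
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def ideal_mismatch(pm, mm, sb, contrast_tau=0.03, scale_tau=10.0):
    """Use MM when it is informative (MM < PM); otherwise derive a value below PM."""
    if mm < pm:
        return mm
    if sb > contrast_tau:
        return pm / 2 ** sb
    return pm / 2 ** (contrast_tau / (1 + (contrast_tau - sb) / scale_tau))

def signal(pm_values, mm_values, delta=2 ** -20):
    """Unscaled signal for one probe set from background-adjusted PM/MM intensities."""
    sb = tukey_biweight([math.log2(pm) - math.log2(mm)
                         for pm, mm in zip(pm_values, mm_values)])
    probe_values = []
    for pm, mm in zip(pm_values, mm_values):
        v = max(pm - ideal_mismatch(pm, mm, sb), delta)
        probe_values.append(math.log2(v))
    slv = tukey_biweight(probe_values)  # signal log value
    return 2 ** slv                     # antilog gives the signal

# Hypothetical probe set with four probe pairs; the last MM exceeds its PM
print(signal([200.0, 220.0, 180.0, 210.0], [100.0, 120.0, 90.0, 300.0]))
```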
28Statistical Analysis
29Statistical Software
- Affymetrix data files are accessible to many statistical packages
- These packages include DMT, Spotfire, Genespring, STATA, Gene Data, Gene Maths, dChip, RMA, S, R, OmniViz, etc.
- For information regarding these products, please contact the manufacturers
30Microarray Data Distribution
- Are the data approximately normally distributed, with each group having equal variance?
- Yes: Parametric Analysis
- - Assumes equal variance in the data in order to determine significance between data sets
- No: Non-Parametric Analysis
- - Uses ranks of the numerical data in order to determine significance between data sets
31Normally Distributed Data
- Single symmetrical peak at the mean
- Tails continue along the horizontal axis to infinity
- 68% of the data lie within one standard deviation of the mean
- 95% of the data lie within two standard deviations of the mean
- The mean and median are approximately equal
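The 68% and 95% figures can be verified from the standard normal cumulative distribution. A small sketch using Python's standard library (`fraction_within` is an illustrative name):

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def fraction_within(k):
    """Fraction of a normal population lying within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2):
    print(f"within {k} SD: {100 * fraction_within(k):.1f}%")
```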
32Types of Statistical Analysis
- Two Sample Comparison
- - Parametric: Student's t-test
- - Non-Parametric: Mann-Whitney
- Multivariate Analysis
- - Parametric: Analysis of Variance (ANOVA)
- - Non-Parametric: Kruskal-Wallis
33Student's t-test
- Compares the means of two populations, taking their standard deviations into account
- Populations must be normally distributed
- Computes a p-value to test the null hypothesis
34Student's t-test
- Unpaired t-test
- - Compares expression patterns of genes in two groups of samples
- - The more common analysis in experiments using expression data
- - The two groups can be different sizes
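As an illustration of the unpaired test, the sketch below computes the t statistic and degrees of freedom for one gene measured in two groups of different sizes. It assumes the equal-variance (pooled) form, in line with the parametric assumptions listed earlier. Converting t to a p-value requires the t distribution, which is not in Python's standard library, so that step is left to a statistics package. The function name and signal values are hypothetical.

```python
import math
import statistics

def unpaired_t(group_a, group_b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    df = na + nb - 2
    pooled_var = ((na - 1) * var_a + (nb - 1) * var_b) / df
    standard_error = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean_a - mean_b) / standard_error, df

# Hypothetical signal values for one gene: 4 control vs 3 treated arrays
control = [230.4, 241.7, 252.9, 238.0]
treated = [310.2, 295.5, 330.1]
t_stat, df = unpaired_t(control, treated)
print(f"t = {t_stat:.3f}, df = {df}")
```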
35Mann-Whitney Rank Test
- Non-parametric, non-paired, two sample rank test
- Ignores distribution of the data
- Sorts the data values and assigns ranks to them
- Compares the sum of ranks of two data sets
- Computes a p-value to test the null hypothesis
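The ranking and rank-sum steps can be sketched as below. The code computes only the U statistics; turning U into a p-value (via exact tables or a normal approximation) is omitted. Tied values receive the average of their ranks, which is a common convention assumed here. Function names are illustrative.

```python
def average_ranks(values):
    """Rank values from 1 upward; tied values share the average of their ranks."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) computed from the rank sums of the pooled data."""
    na, nb = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[:na])
    u_a = rank_sum_a - na * (na + 1) / 2
    u_b = na * nb - u_a
    return u_a, u_b

# Hypothetical signal values for one gene in two groups
print(mann_whitney_u([230.4, 241.7, 252.9], [310.2, 295.5, 330.1, 248.0]))
```

The smaller of the two U values is the one usually compared against critical values.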
36Analysis of Variance (ANOVA)
- Parametric, multiple comparison test
- Population must be normally distributed
- Compares means and variance among groups
- Computes a p-value
- Determines whether the means and variances of the populations are the same
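A one-way ANOVA compares the variance between group means with the variance within groups. The sketch below computes the F statistic and its degrees of freedom for a single gene; the p-value lookup from the F distribution is omitted because it is not available in Python's standard library. The function name and data are hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of measurements."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical signal values for one gene under three conditions
print(one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]))
```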
37Kruskal-Wallis
- Non-parametric, multiple comparison test
- Ignores distribution of data
- Sorts the data values and assigns ranks to them
- Compares the sum of ranks of more than two data sets
- Computes a p-value
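The rank-sum comparison across more than two groups can be sketched as the H statistic below. Assumptions: tied values share average ranks, no tie-correction factor is applied, and the p-value step (comparison against a chi-squared distribution) is omitted. Function names are illustrative.

```python
def average_ranks(values):
    """Rank values from 1 upward; tied values share the average of their ranks."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def kruskal_wallis_h(groups):
    """H statistic from the rank sums of each group (no tie correction)."""
    pooled = [x for g in groups for x in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted_rank_sums = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted_rank_sums += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * weighted_rank_sums - 3 * (n_total + 1)

# Hypothetical signal values for one gene under three conditions
print(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```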
38Multiple Comparison Corrections
39Bonferroni Correction
- Conservative error correction method
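A minimal sketch of the Bonferroni correction: each p-value is multiplied by the number of tests (capped at 1) and then compared with the chosen significance level. The function name, the p-values, and the 0.05 threshold are illustrative assumptions.

```python
def bonferroni(p_values, alpha=0.05):
    """Return Bonferroni-adjusted p-values and which tests remain significant."""
    n_tests = len(p_values)
    adjusted = [min(1.0, p * n_tests) for p in p_values]
    significant = [p_adj < alpha for p_adj in adjusted]
    return adjusted, significant

# Hypothetical p-values for three genes
adjusted, significant = bonferroni([0.01, 0.02, 0.5])
print(adjusted, significant)
```

With thousands of genes on an array the multiplier becomes very large, which is why the method is described as conservative.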
40Statistical Analysis Flow Diagram
41Data Interpretation
42Clustering Gene Expression Data
- Summarize genes by co- or anti-correlation of expression profiles
- Employ guilt-by-association functional prediction
- Search for regulatory elements in promoters of co-expressed genes
- Help identify interesting genes
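Co- and anti-correlation of expression profiles is commonly measured with the Pearson correlation coefficient. The sketch below assumes that measure; the slides do not name a specific one. Gene names and profile values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Hypothetical profiles across four time points
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.1, 3.9, 6.2, 8.0]   # rises with gene_a (co-correlated)
gene_c = [8.0, 6.1, 3.8, 2.0]   # falls as gene_a rises (anti-correlated)
print(pearson(gene_a, gene_b), pearson(gene_a, gene_c))
```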
43Clustering Algorithms
- Hierarchical
- K-means
- Self-Organizing Maps
44Hierarchical Clustering
Agglomerative
Divisive
45K-Means Clustering
- Non hierarchical
- User defines the number of clusters (K)
- Data are partitioned into K clusters
- Relationships between clusters are undetermined
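The partitioning step can be sketched with a plain K-means loop. Assumptions made for this illustration: Euclidean distance, the first K profiles used as starting centroids (real tools usually choose seeds randomly), and a fixed cap on iterations. Names and data are hypothetical.

```python
import math

def kmeans(points, k, iterations=100):
    """Partition points into k clusters; returns (cluster labels, centroids)."""
    centroids = [list(p) for p in points[:k]]  # assumption: first k points as seeds
    labels = [None] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical two-condition expression profiles for six genes
profiles = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```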
46Self-Organizing Maps (SOM)
- Similar to K-means but constrained to a two-dimensional grid
- User chooses the topology of the map and hence the number of clusters
- Objects are iteratively pulled towards clusters
- An object can belong to only one cluster, as in K-means
47Summary of Data Analysis