Title: Statistical Analysis of Microarray Data
1 An Analysis of MicroArray Quality Control Data
James J. Chen, Ph.D.
Division of Biometry and Risk Assessment
National Center for Toxicological Research
U.S. Food and Drug Administration
2006 FDA and Industry Workshop, September 29, 2006
The views expressed in this presentation do not represent those of the U.S. Food and Drug Administration.
2 Outline
- Background: MAQC experimental design and data
- Microarray platform comparisons
  - Inter-platform analysis
  - Intra-platform analysis and platform performance
    - concordance, site effects, consistency, discriminability
    - sensitivity, specificity, and accuracy in gene selection
    - self-consistency of titration mixture
- TaqMan and microarray platform comparability
- Conclusion
3 MicroArray Quality Control Project
Objective: To compare expression data generated at multiple test sites (labs) using several microarray-based and alternative technology platforms.
Microarray platforms:
- Applied Biosystems: ABI (1)
- Affymetrix: AFX (1)
- Agilent: AGI (1, 2)
- Eppendorf: EPP (1)
- GE Healthcare: GEH (1)
- Illumina: ILM (1)
- NCI_Operon: NCI (2)
Alternative platforms:
- Applied Biosystems (TAQ)
- Panomics (QGN)
- Gene Express (GEX)
Nature Biotechnology v24(9), Sep (2006)
4 MAQC Experimental Design
- Four RNA samples
  - Sample A: Universal human reference RNA (Stratagene)
  - Sample B: Human brain reference RNA (Ambion)
  - Sample C: 75% A + 25% B
  - Sample D: 25% A + 75% B
- Three sites for each microarray platform (NCI: 2 sites)
- One site for TAQ, QGN, and GEX
- Five technical replicates for each microarray platform
- Four replicates for TAQ; three replicates for QGN and GEX
- Target genes: EPP 294, QGN 245, GEX 205
5 MAQC Data Used for Comparisons
Platform    ABI      AFX      AGI      GEH      ILM      TAQ
Sample      4        4        4        4        4        4
Rep (1)     5        5        5        5        5        4
Site        3        3        3        3        3        1
Probe       32,878   54,675   43,931   54,359   47,293   1,004
Array (2)   58       60       56       60       59       N/A
(1) Technical replicates. (2) A total of 293 arrays.
- 12,091 common genes among the microarray platforms
- 906 TAQ genes are among the 12,091 genes
6 Hierarchical clustering of 293 arrays on 12,091 genes, from all pairwise correlations between two arrays.
7 Concordance: all pairwise inter-platform sample correlation coefficients between two arrays from different platforms.
[Figure; annotated values: .82, .74, .71, .70, .68, .45]
Up to 2,250 (10 x 15 x 15) correlations computed for each sample.
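The count of 2,250 presumably comes from 10 platform pairs (5 platforms taken 2 at a time) times 15 x 15 array pairs (3 sites x 5 replicates per platform); "up to" because some platforms have fewer than 60 arrays. The sketch below enumerates those pairs and provides a plain Pearson correlation. The array names and the `layout` dictionary are hypothetical placeholders, not the MAQC data files.

```python
from itertools import combinations, product
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def inter_platform_pairs(arrays_by_platform):
    """Yield (array1, array2) for every pair of arrays on different platforms."""
    for p1, p2 in combinations(sorted(arrays_by_platform), 2):
        yield from product(arrays_by_platform[p1], arrays_by_platform[p2])

# Hypothetical layout for one sample: 5 platforms, 3 sites x 5 replicates each.
platforms = ["ABI", "AFX", "AGI", "GEH", "ILM"]
layout = {
    p: [f"{p}_site{s}_rep{r}" for s in range(1, 4) for r in range(1, 6)]
    for p in platforms
}
n_pairs = sum(1 for _ in inter_platform_pairs(layout))
print(n_pairs)  # 10 platform pairs x 15 x 15 array pairs
```

In practice each array name would map to its vector of intensities over the 12,091 common genes, and `pearson` would be applied to each yielded pair.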
8 Concordance: all pairwise inter-platform fold-change correlation coefficients between two arrays from different platforms.
[Figure; annotated values: .92, .85, .84, .82, .78, .78, .75, .53]
90 (10 x 3 x 3) correlations for each fold-change
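The count of 90 suggests one fold-change vector per platform and site (10 platform pairs x 3 x 3 site pairs). A minimal sketch of that calculation follows, assuming the fold change is the difference of replicate-averaged log2 intensities; the slides do not state how fold changes were computed, and the gene names and intensities below are made up for illustration.

```python
from math import log2, sqrt

def mean(values):
    return sum(values) / len(values)

def log2_fold_change(numerator, denominator):
    """Per-gene log2 fold change from replicate intensities at one site.

    Each argument maps gene -> list of positive replicate intensities.
    Assumption: replicates are averaged on the log2 scale.
    """
    return {
        g: mean([log2(v) for v in numerator[g]]) - mean([log2(v) for v in denominator[g]])
        for g in denominator if g in numerator
    }

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Toy data: samples A and B measured at one site on two hypothetical platforms.
p1_a = {"g1": [100, 110], "g2": [200, 190], "g3": [50, 55]}
p1_b = {"g1": [400, 420], "g2": [100, 95], "g3": [52, 50]}
p2_a = {"g1": [1000, 1050], "g2": [2100, 2000], "g3": [480, 500]}
p2_b = {"g1": [3500, 3600], "g2": [1200, 1100], "g3": [510, 490]}

fc1 = log2_fold_change(p1_b, p1_a)  # log2(B/A) on platform 1
fc2 = log2_fold_change(p2_b, p2_a)  # log2(B/A) on platform 2
genes = sorted(set(fc1) & set(fc2))
r = pearson([fc1[g] for g in genes], [fc2[g] for g in genes])
print(round(r, 3))
```

Because fold changes are ratios within a platform, platform-specific intensity scales cancel, which is one plausible reason fold-change correlations are higher than the raw sample correlations.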
9 Cross Platform Consistency
- Proportion of genes showing a significant platform-by-sample interaction in the gene-by-gene ANOVA:
  $y = \mu + P + \text{Sample} + P \times \text{Sample} + \varepsilon$
- A significant interaction means that the patterns of expression of the four samples are inconsistent across the platforms.
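To make the interaction test concrete, here is a minimal sketch of the interaction F statistic for one gene in a balanced two-way layout (platform by sample, with replicates). It is an illustration under that balanced-design assumption, not the presenter's code; the function name `interaction_f` and the toy data are hypothetical.

```python
def interaction_f(data):
    """Interaction F statistic for a balanced two-way layout.

    data[i][j] is the list of replicate values for platform i and sample j.
    Returns (F, df_interaction, df_error).
    """
    a, b, n = len(data), len(data[0]), len(data[0][0])
    cell = [[sum(c) / n for c in row] for row in data]            # cell means
    row_mean = [sum(r) / b for r in cell]                         # platform means
    col_mean = [sum(cell[i][j] for i in range(a)) / a for j in range(b)]  # sample means
    grand = sum(row_mean) / a
    ss_int = n * sum(
        (cell[i][j] - row_mean[i] - col_mean[j] + grand) ** 2
        for i in range(a) for j in range(b)
    )
    ss_err = sum(
        (y - cell[i][j]) ** 2
        for i in range(a) for j in range(b) for y in data[i][j]
    )
    df_int, df_err = (a - 1) * (b - 1), a * b * (n - 1)
    return (ss_int / df_int) / (ss_err / df_err), df_int, df_err

# Additive pattern: sample profiles are parallel across platforms, so F is 0.
additive = [[[r + c - 1, r + c + 1] for c in range(3)] for r in range(2)]
# Crossing pattern: the sample ordering flips between platforms, so F is large.
crossing = [[[0, 2], [9, 11]], [[9, 11], [0, 2]]]

f_add, _, _ = interaction_f(additive)
f_cross, df1, df2 = interaction_f(crossing)
print(f_add, f_cross, df1, df2)
```

A p-value would then be read from the F distribution with the returned degrees of freedom, and the proportion of genes with p below the chosen $\alpha$ gives the inconsistency measure described above.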
10 Plot of the p-values versus ranking proportions
[Figure: axes labelled Proportion and log10 p; 0.3 marked]
The proportion of significant genes is 30% at $\alpha = 0.01$.
11 Consistency (p > 0.01)
Inconsistency (p < 0.01)
12 Intra-Platform Analysis
- Concordance: all pairwise correlations between two arrays from different sites for samples A, B, C, and D (3 x 5 x 5 correlations).
- Site effects ANOVA: $y = \mu + \text{sample} + \text{site} + \text{sample} \times \text{site} + \varepsilon$
  - Site effect: the variance ratio, $F = \mathrm{MSE}_{\text{site}} / \mathrm{MSE}_{e}$
  - Consistency: proportion of genes shown to have a significant sample-by-site interaction ($\alpha = 0.01$).
- Discriminability ANOVA: $y = \mu + \text{sample} + \varepsilon$
  - Variability: residual mean square (total variation other than sample differences).
  - Discriminability: the proportion of genes shown to have significant sample effects ($\alpha = 0.0001$).
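As an illustration of the discriminability model $y = \mu + \text{sample} + \varepsilon$, the sketch below computes the one-way ANOVA F statistic and the residual mean square for a single gene. The function name and the toy values are hypothetical; the real analysis would run this per gene over the arrays of one platform.

```python
def one_way_anova(groups):
    """One-way ANOVA for one gene.

    groups: one list of measurements per sample (A, B, C, D in the MAQC design).
    Returns (F, residual_mean_square).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)  # residual mean square ("variability")
    return ms_between / ms_within, ms_within

# Toy example with three groups of three measurements.
f_stat, resid_ms = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat, resid_ms)
```

The residual mean square corresponds to the "Variability" item above, and the share of genes whose F test is significant at the stated $\alpha$ corresponds to "Discriminability".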
13 Individual Platforms Performance
Reproducibility and Consistency Performance
           Median Correlation        Site   Consistency (1)  MSE         Discriminability (2)
Platform   rA    rB    rC    rD      Fm     $\eta$           $\sigma^2$  $\tau$
AFX        .988  .988  .991  .992    24.    .012             .066        .618
ABI        .968  .964  .972  .969    15.    .008             .107        .620
AG1        .978  .982  .982  .981    28.    .063             .090        .633
ILM        .980  .979  .980  .981    242.   .020             .266        .441
GEH        .925  .904  .872  .862    64.    .097             .267        .453
(1) $\alpha = 0.01$. (2) $\alpha = 0.0001$.
14 Gold Standard Set
- A gene is differentially expressed if it was shown to be significant in at least 2 of the 5 platforms at $\alpha = 10^{-5}$.
  - $H_0: \mu_A - \mu_B = 0$ versus $H_1: \mu_A - \mu_B \neq 0$
  - (8265 genes were selected)
- A gene is non-differentially expressed if its fold change was shown to be between 0.90 and 1/0.90 in at least 2 of the 5 platforms at $\alpha = 10^{-3}$. Let $\delta = -\log_2(0.90)$.
  - Equivalence test: $H_0: |\mu_A - \mu_B| > \delta$ versus $H_1: |\mu_A - \mu_B| < \delta$
  - (498 genes were selected)
- Gold standard: 8607 genes (delete 78 overlaps)
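The gold-standard size can be checked arithmetically: 8265 + 498 - 2 x 78 = 8607, which is consistent with the 78 overlapping genes being dropped from both sets (an inference from the numbers, not something the slide states). A small sketch with synthetic gene identifiers:

```python
def build_gold_standard(differential, non_differential):
    """Drop genes that appear in both sets; return the two cleaned sets.

    Assumption: overlapping genes are removed from both lists, which is what
    the counts on the slide imply.
    """
    overlap = differential & non_differential
    return differential - overlap, non_differential - overlap

# Synthetic identifiers arranged so that exactly 78 genes fall in both sets.
differential = set(range(8265))
non_differential = set(range(8265 - 78, 8265 - 78 + 498))

de, non_de = build_gold_standard(differential, non_differential)
print(len(de), len(non_de), len(de) + len(non_de))
```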
15 Accuracy (AC), sensitivity (SN), specificity (SP), and FDR, using FWE = 0.05 and FDR = 0.05 as thresholds.
           FWE = 0.05                 FDR = 0.05
Platform   AC    SN    SP    FDR      AC    SN    SP    FDR
AFX        .77   .76   .95   .004     .92   .94   .55   .024
ABI        .74   .73   .95   .004     .89   .91   .59   .023
AG1        .81   .80   .80   .003     .92   .94   .55   .024
ILM        .55   .53   1.0   .000     .88   .88   .95   .023
GEH        .54   .52   .95   .005     .82   .82   .69   .019
$\alpha = 0.05/8607 = 5.8 \times 10^{-6}$
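For reference, the four measures can be computed from a selected gene list and a gold standard as sketched below, using the usual confusion-matrix definitions (assumed here; the slides do not spell them out). The per-test threshold 0.05/8607 is presumably the Bonferroni adjustment for FWE = 0.05. Gene identifiers are made up.

```python
def selection_metrics(selected, truly_de, truly_non_de):
    """Accuracy, sensitivity, specificity, and FDR of a selected gene list."""
    tp = len(selected & truly_de)        # differential genes that were selected
    fp = len(selected & truly_non_de)    # non-differential genes that were selected
    fn = len(truly_de - selected)        # differential genes that were missed
    tn = len(truly_non_de - selected)    # non-differential genes correctly left out
    return {
        "AC": (tp + tn) / (tp + fp + fn + tn),
        "SN": tp / (tp + fn),
        "SP": tn / (tn + fp),
        "FDR": fp / (tp + fp) if (tp + fp) else 0.0,
    }

# Toy gold standard: 8 differential genes, 2 non-differential genes.
truly_de = {1, 2, 3, 4, 5, 6, 7, 8}
truly_non_de = {9, 10}
selected = {1, 2, 3, 4, 5, 6, 9}
metrics = selection_metrics(selected, truly_de, truly_non_de)

alpha_fwe = 0.05 / 8607  # per-gene threshold for family-wise error 0.05
print(metrics, alpha_fwe)
```

Because differential genes dominate the gold standard, accuracy tracks sensitivity closely, which matches the near-identical AC and SN columns in the table.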
16 Comment on MAQC Gene Selection
- The MAQC project used technical replicates (small variance) with two distinct biological samples (large difference).
- The number of differentially expressed genes is much larger than in typical microarray experiments.
- Generating a gene list is not the problem; the problem is determining the number of genes in the list.
- General principle: identify a list of differentially expressed genes as accurately as possible.
17 Reproducibility of lists of differentially expressed genes: Percentage of Overlapping Genes (POG)
For AFX, 6319 genes have p < $10^{-5}$; 4370 genes have FC > 2.
For ABI, 6127 genes have p < $10^{-5}$; 4835 genes have FC > 2.
More than 4,000 genes can be selected with an FDR estimate of less than 2/4,000.
From MAQC Fig. S2 of the supplements.
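A minimal sketch of a POG calculation for two gene lists follows. The definition used here (shared genes as a percentage of the shorter list) is an assumption for illustration; the MAQC supplement may define POG with additional conditions, such as matching direction of change.

```python
def pog(list_a, list_b):
    """Percentage of overlapping genes between two gene lists.

    Assumed definition: shared genes divided by the size of the shorter list.
    """
    a, b = set(list_a), set(list_b)
    if not a or not b:
        raise ValueError("both gene lists must be non-empty")
    return 100.0 * len(a & b) / min(len(a), len(b))

# Toy lists from two hypothetical sites; three of four genes are shared.
site1 = ["g1", "g2", "g3", "g4"]
site2 = ["g2", "g3", "g4", "g5"]
overlap_pct = pog(site1, site2)
print(overlap_pct)
```

POG is usually reported as a function of list length, so in practice the lists would be the top-N genes under a ranking rule (p-value or fold change) for a range of N.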
18 Assessment of Titration Trend
- Titration correlations: 0.75A + 0.25B and C; 0.25A + 0.75B and D
- Titration model (a two-step test)
  - The titration relationship can be modelled by
    - M1t: $y = \mu + \beta \, \text{Conc} + \text{Site} + \varepsilon$
  - Full ANOVA model
    - M1: $y = \mu + \text{Sample} + \text{Site} + \varepsilon$
  - S1: Test for sample difference under M1, $H_{0t1}: \mu_A = \mu_B = \mu_C = \mu_D$
  - S2: Test for goodness of fit, $H_{0t2}$: M1t = M1
- Proportion of genes that reject $H_{0t1}$ and accept $H_{0t2}$
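The titration correlation compares the observed mixture sample with the value expected from its components. A minimal sketch, assuming the mixing is applied on the raw intensity scale and using made-up intensities for three genes:

```python
from math import sqrt

def expected_mixture(a, b, frac_a):
    """Expected per-gene intensity of a mixture with fraction frac_a of sample A."""
    return [frac_a * x + (1.0 - frac_a) * y for x, y in zip(a, b)]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    return sxy / sqrt(sum((p - mx) ** 2 for p in x) * sum((q - my) ** 2 for q in y))

# Hypothetical mean intensities for three genes in samples A and B.
sample_a = [100.0, 200.0, 400.0]
sample_b = [200.0, 100.0, 800.0]

expected_c = expected_mixture(sample_a, sample_b, 0.75)  # C = 0.75 A + 0.25 B
expected_d = expected_mixture(sample_a, sample_b, 0.25)  # D = 0.25 A + 0.75 B

# Hypothetical observed sample C, close to but not exactly the expectation.
observed_c = [130.0, 170.0, 510.0]
r_c = pearson(observed_c, expected_c)
print(expected_c, expected_d, round(r_c, 3))
```

The two-step test goes further than this correlation: it first requires a sample effect (reject $H_{0t1}$) and then checks that the one-parameter concentration model fits as well as the full sample model (accept $H_{0t2}$).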
19 Linear Titration Model
[Figure panels: H0t1A; H0t1R, H0t2A; H0t1R, H0t2R]
20 Titration correlation for samples C and D, and the proportions of the genes that follow the titration relationship.
Columns: Correlation (Sample C, Sample D); Titration Model ($\alpha_1$, $\alpha_2$) = (5, 5) and (1, 1)
.909 .911 .963
.976 .916 .928
.954 .967 .930
.939 .923 .944
.930 .936 .937
.954 .923 .934
.988 .988
AFX ABI AG1 ILM GEH
21 TaqMan and microarray platform concordance: box plots of all pairwise sample correlation coefficients.
[Figure; annotated values: .80, .78, .77, .76, .75, .74, .74, .71, .71, .66, .62, .52]
60 (4 x 15) correlations computed in each sample
22 TaqMan and microarray platform concordance: box plots of fold-change (B/A) correlation coefficients.
[Figure; annotated values: .90, .89, .89, .88, .86, .86, .82]
23 Consistency of TaqMan and Microarray platforms
- TaqMan and microarray platforms: proportions of significant genes 0.72, 0.57, 0.49, 0.65, 0.39
- Microarray platforms: proportion of significant genes 0.30
24 Conclusion (1)
- Inter-platform (microarray and TaqMan)
  - Concordance
    - Sample correlations: 0.45 (D) to 0.82 (A)
    - FC correlations: higher for B/A, lower for C/A
  - Inconsistency
    - Microarray platforms: thirty percent (30%) of genes show inconsistent expression patterns at $\alpha = 0.01$.
    - TaqMan and microarray platforms: the proportions are between 0.34 and 0.74 for the five platforms.
  - Comparability
    - Intensities measured by different microarray platforms, and between microarray and TaqMan platforms, are different.
25 Conclusion (2)
- Titration trend
  - Titration correlation: the correlations between observed intensity and expected intensity are more than 0.90.
  - Titration trend: all five platforms follow the linear titration relationship well.
- Intra-platform performance (microarray platforms)
  - Concordance: intra-platform correlations are high.
  - Site effect: all platforms show site effects.
  - Consistency: the patterns of expression are consistent across the three sites.