1 Design and Analysis of Microarray Studies for Diagnostic and Prognostic Classification
- Richard Simon, D.Sc.
- Chief, Biometric Research Branch
- National Cancer Institute
- http://linus.nci.nih.gov/brb
2 http://linus.nci.nih.gov/brb
- PowerPoint presentations
- Reprints and Technical Reports
- BRB-ArrayTools software
- BRB-ArrayTools Data Archive
- Sample Size Planning for Targeted Clinical Trials
3 Simon R, Korn E, McShane L, Radmacher M, Wright G, Zhao Y. Design and Analysis of DNA Microarray Investigations. Springer-Verlag, 2003.
Radmacher MD, McShane LM, Simon R. A paradigm for class prediction using gene expression profiles. Journal of Computational Biology 9:505-511, 2002.
Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the analysis of DNA microarray data. Journal of the National Cancer Institute 95:14-18, 2003.
Dobbin K, Simon R. Comparison of microarray designs for class comparison and class discovery. Bioinformatics 18:1462-69, 2002; 19:803-810, 2003; 21:2430-37, 2005; 21:2803-4, 2005.
Dobbin K, Simon R. Sample size determination in microarray experiments for class comparison and prognostic classification. Biostatistics 6:27-38, 2005.
Dobbin K, Shih J, Simon R. Questions and answers on design of dual-label microarrays for identifying differentially expressed genes. Journal of the National Cancer Institute 95:1362-69, 2003.
Wright G, Simon R. A random variance model for detection of differential gene expression in small microarray experiments. Bioinformatics 19:2448-55, 2003.
Korn EL, Troendle JF, McShane LM, Simon R. Controlling the number of false discoveries. Journal of Statistical Planning and Inference 124:379-398, 2004.
Molinaro A, Simon R, Pfeiffer R. Prediction error estimation: a comparison of resampling methods. Bioinformatics 21:3301-7, 2005.
4 Simon R. Using DNA microarrays for diagnostic and prognostic prediction. Expert Review of Molecular Diagnostics 3(5):587-595, 2003.
Simon R. Diagnostic and prognostic prediction using gene expression profiles in high-dimensional microarray data. British Journal of Cancer 89:1599-1604, 2003.
Simon R, Maitournam A. Evaluating the efficiency of targeted designs for randomized clinical trials. Clinical Cancer Research 10:6759-63, 2004.
Maitournam A, Simon R. On the efficiency of targeted clinical trials. Statistics in Medicine 24:329-339, 2005.
Simon R. When is a genomic classifier ready for prime time? Nature Clinical Practice Oncology 1:4-5, 2004.
Simon R. An agenda for Clinical Trials: clinical trials in the genomic era. Clinical Trials 1:468-470, 2004.
Simon R. Development and validation of therapeutically relevant multi-gene biomarker classifiers. Journal of the National Cancer Institute 97:866-867, 2005.
Simon R. A roadmap for developing and validating therapeutically relevant genomic classifiers. Journal of Clinical Oncology (in press).
Freidlin B, Simon R. Adaptive signature design. Clinical Cancer Research (in press).
Simon R. Validation of pharmacogenomic biomarker classifiers for treatment selection. Disease Markers (in press).
Simon R. Guidelines for the design of clinical studies for development and validation of therapeutically relevant biomarkers and biomarker classification systems. In: Hayes DF, Gasparini G (eds), Biomarkers in Breast Cancer. Humana Press (in press).
5 Myth
- That microarray investigations should be
unstructured data-mining adventures without clear
objectives
6
- Good microarray studies have clear objectives, but not generally gene-specific mechanistic hypotheses
- Design and analysis methods should be tailored to study objectives
7 Good Microarray Studies Have Clear Objectives
- Class Comparison
  - Find genes whose expression differs among predetermined classes
- Class Prediction
  - Prediction of predetermined class (phenotype) using information from gene expression profile
- Class Discovery
  - Discover clusters of specimens having similar expression profiles
  - Discover clusters of genes having similar expression profiles
8 Class Comparison and Class Prediction
- Not clustering problems
  - Global similarity measures generally used for clustering arrays may not distinguish classes
  - Don't control multiplicity, or distinguish data used for classifier development from data used for classifier evaluation
- Supervised methods
  - Require multiple biological samples from each class
9 Levels of Replication
- Technical replicates
  - RNA sample divided into multiple aliquots and re-arrayed
- Biological replicates
  - Multiple subjects
  - Replication of the tissue culture experiment
10
- Biological conclusions generally require independent biological replicates. The power of statistical methods for microarray data depends on the number of biological replicates.
- Technical replicates are useful insurance to ensure that at least one good-quality array of each specimen will be obtained.
11 Class Prediction
- Predict which tumors will respond to a particular treatment
- Predict which patients will relapse after a particular treatment
12
- Class prediction methods usually have gene selection as a component
- The criteria for gene selection for class prediction and for class comparison are different
  - For class comparison, false discovery rate is important
  - For class prediction, predictive accuracy is important
13 Clarity of Objectives is Important
- Patient selection
  - Many microarray studies developing classifiers are not therapeutically relevant
- Analysis methods
  - Many microarray studies use cluster analysis inappropriately or misleadingly
14 Microarray Platforms for Developing Predictive Classifiers
- Single-label arrays
  - Affymetrix GeneChips
- Dual-label arrays using common reference design
  - Dye swaps are unnecessary
15 Common Reference Design

        Array 1   Array 2   Array 3   Array 4
RED     A1        A2        B1        B2
GREEN   R         R         R         R

Ai = ith specimen from class A
Bi = ith specimen from class B
R = aliquot from reference pool
16
- The reference generally serves to control variation in the size of corresponding spots on different arrays and variation in sample distribution over the slide.
- The reference provides a relative measure of expression for a given gene in a given sample that is less variable than an absolute measure.
- The reference is not the object of comparison.
- The relative measure of expression will be compared among biologically independent samples from different classes.
18 Myth
- For two-color microarrays, each sample of interest should be labeled once with Cy3 and once with Cy5 in dye-swap pairs of arrays.
19 Dye Bias
- Average differences among dyes in label concentration, labeling efficiency, photon emission efficiency, and photon detection are corrected by normalization procedures
- Gene-specific dye bias may not be corrected by normalization
20
- Dye-swap technical replicates of the same two RNA samples are rarely necessary.
- Using a common reference design, dye-swap arrays are not necessary for valid comparisons of classes, since specimens labeled with different dyes are never compared.
- For two-label direct comparison designs for comparing two classes, it is more efficient to balance the dye-class assignments for independent biological specimens than to do dye-swap technical replicates.
21 Balanced Block Design

        Array 1   Array 2   Array 3   Array 4
RED     A1        B2        A3        B4
GREEN   A2        B1        B3        A4

Ai = ith specimen from class A
Bi = ith specimen from class B
22
- Detailed comparisons of the effectiveness of designs:
- Dobbin K, Simon R. Comparison of microarray designs for class comparison and class discovery. Bioinformatics 18:1462-9, 2002
- Dobbin K, Shih J, Simon R. Statistical design of reverse dye microarrays. Bioinformatics 19:803-10, 2003
- Dobbin K, Shih J, Simon R. Questions and answers on the design of dual-label microarrays for identifying differentially expressed genes. JNCI 95:1362-1369, 2003
23
- Common reference designs are very effective for many microarray studies. They are robust, permit comparisons among separate experiments, and permit many types of comparisons and analyses to be performed.
- For simple two-class comparison problems, balanced block designs require many fewer arrays than common reference designs.
  - Efficiency decreases for more than two classes
  - They are more difficult to apply to more complicated class comparison problems
  - They are not appropriate for class discovery or class prediction
- Loop designs are less robust, are dominated by either common reference designs or balanced block designs, and are not suitable for class prediction or class discovery.
24 What We Will Not Discuss Today
- Image analysis
- Normalization
- Clustering methods
- Class comparison
- SAM and other methods of gene finding
- FDR (false discovery rate) and methods for
controlling the number of false positive genes
25 Simple Procedures
- If each gene is tested for significance at level α and there are k genes, then the expected number of false discoveries is kα.
- To control E(FD) ≤ u:
  - Conduct each of the k tests at level α = u/k
- Bonferroni control of the familywise error (FWE) rate at level 0.05:
  - Conduct each of the k tests at level 0.05/k
  - At least 95% confident that FD = 0 (a numeric sketch follows)
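A minimal numeric sketch of these rules; the values of k, α, and u below are hypothetical, not from the talk:

```python
k = 10_000      # hypothetical number of genes tested
alpha = 0.001   # hypothetical per-gene significance level
print("E(FD) =", k * alpha)          # expected false discoveries = k * alpha

u = 1.0         # desired bound on E(FD)
print("test each gene at", u / k)    # alpha = u/k gives E(FD) <= u

fwe = 0.05      # familywise error rate
print("Bonferroni level:", fwe / k)  # each test at 0.05/k controls FWE
```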
26 False Discovery Rate (FDR)
- FDR = expected proportion of false discoveries among the tests declared significant
- Studied by Benjamini and Hochberg (1995)
27

                        Not rejected   Rejected                  Total
True null hypotheses    890            10 (false discoveries)    900
False null hypotheses   10             90 (true discoveries)     100
Total                   900            100                       1000

Here the realized false discovery proportion is 10/100 = 0.10.
28 Controlling the Expected False Discovery Rate
- Compare classes separately by gene and compute significance levels p_i
- Rank genes in order of significance:
  - p(1) < p(2) < ... < p(k)
- Find the largest index i for which
  - p(i) k / i ≤ FDR
- Consider the genes with the i smallest p values statistically significant (a sketch follows)
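A minimal sketch of this step-up procedure (the standard Benjamini-Hochberg rule); the p values below are simulated for illustration:

```python
import numpy as np

def bh_select(pvals, fdr=0.10):
    """Benjamini-Hochberg step-up: return the number of genes declared
    significant at the given expected-FDR level (a minimal sketch)."""
    p = np.sort(np.asarray(pvals))
    k = len(p)
    i = np.arange(1, k + 1)
    passing = np.nonzero(p * k / i <= fdr)[0]   # indices where p(i)*k/i <= FDR
    return 0 if passing.size == 0 else passing[-1] + 1

# Hypothetical p values: a few very small ones among uniform noise.
rng = np.random.default_rng(0)
pvals = np.concatenate([rng.uniform(0, 1e-4, 20), rng.uniform(0, 1, 980)])
print(bh_select(pvals, fdr=0.10), "genes declared significant")
```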
29 Problems With Simple Procedures
- Bonferroni control of FWE is very conservative
- p values based on normal theory are not accurate at extreme quantiles
- Difficult to achieve extreme quantiles for permutation p values of individual genes
- Controlling the expected number or proportion of false discoveries may not provide adequate control, because the distributions of FD and FDP may have large variances
- Multiple comparisons are controlled by adjustment of univariate (single-gene) p values and so may not take advantage of correlation among genes
30 Additional Procedures
- SAM: Significance Analysis of Microarrays
  - Tusher et al., PNAS, 2001
  - Estimates FDR
  - Statistical properties unclear
- Empirical Bayes
  - Efron et al., JASA, 2001
  - Related to FDR
  - Ad hoc aspects
- Multivariate permutation tests
  - Korn et al., 2001 (http://linus.nci.nih.gov/brb)
  - Control number or proportion of false discoveries
  - Can specify confidence level of control
31 Multivariate Permutation Procedures (Korn et al., 2001)
- Allows statements like:
  - FD Procedure: We are 95% confident that the (actual) number of false discoveries is no greater than 5.
  - FDP Procedure: We are 95% confident that the (actual) proportion of false discoveries does not exceed 0.10.
32 Class Prediction Model
- Given a sample with an expression profile vector x of log-ratios or log signals and unknown class
- Predict which class the sample belongs to
- The class prediction model is a function f that maps from the set of vectors x to the set of class labels {1, 2} (if there are two classes)
- f generally utilizes only some of the components of x (i.e., only some of the genes)
- Specifying the model f involves specifying some parameters (e.g., regression coefficients) by fitting the model to the data (learning the data)
33 Problems With Many Diagnostic/Prognostic Marker Studies
- Are not reproducible
  - Retrospective, non-focused analysis
  - Multiplicity problems
  - Inter-laboratory assay variation
- Have no impact
  - Not therapeutically relevant questions
  - Not therapeutically relevant group of patients
  - Black-box predictors
34 Components of Class Prediction
- Feature (gene) selection
  - Which genes will be included in the model
- Select model type
  - e.g., diagonal linear discriminant analysis, nearest-neighbor, ...
- Fitting parameters (regression coefficients) for model
- Selecting value of tuning parameters
35 Do Not Confuse Statistical Methods Appropriate for Class Comparison with Those Appropriate for Class Prediction
- Demonstrating statistical significance of prognostic factors is not the same as demonstrating predictive accuracy.
- Demonstrating goodness of fit of a model to the data used to develop it is not a demonstration of predictive accuracy.
- Statisticians are used to inference, not prediction
- Most statistical methods were not developed for p >> n prediction problems
36 Feature Selection
- Genes that are differentially expressed among the classes at a significance level α (e.g., 0.01)
- The α level is selected only to control the number of genes in the model
37 t-test Comparisons of Gene Expression
- x_j ~ N(μ_j1, σ_j²) for class 1
- x_j ~ N(μ_j2, σ_j²) for class 2
- H_0j: μ_j1 = μ_j2
38 Estimation of Within-Class Variance
- Estimate separately for each gene
  - Limited degrees of freedom
  - Gene list dominated by genes with small fold changes and small variances
- Assume all genes have the same variance
  - Poor assumption
- Random (hierarchical) variance model
  - Wright GW, Simon R. Bioinformatics 19:2448-2455, 2003
  - Inverse gamma distribution of residual variances
  - Results in exact F (or t) distribution of test statistics with increased degrees of freedom for error variance
  - For any normal linear model (a sketch of the resulting test follows)
39 Feature Selection
- Small subset of genes which together give most accurate predictions
  - Combinatorial optimization algorithms
  - Genetic algorithms
- Little evidence that complex feature selection is useful in microarray problems
  - Failure to compare to simpler methods
  - Some published complex methods for selecting combinations of features do not appear to have been properly evaluated
40 Linear Classifiers for Two Classes
41 Linear Classifiers for Two Classes
- Fisher linear discriminant analysis
  - Requires estimating correlations among all genes selected for model
  - y = vector of class labels
- Diagonal linear discriminant analysis (DLDA) assumes features are uncorrelated
- Naïve Bayes classifier
- Compound covariate predictor (Radmacher) and Golub's method are similar to DLDA in that they can be viewed as weighted voting of univariate classifiers
42 Linear Classifiers for Two Classes
- Compound covariate predictor
  - Uses the univariate t statistics as gene weights, instead of the weights used for DLDA
43 Linear Classifiers for Two Classes
- Support vector machines with inner product kernel are linear classifiers, with weights determined to separate the classes with a hyperplane that minimizes the length of the weight vector
44 Support Vector Machine
45 Perceptrons
- Perceptrons are neural networks with no hidden layer and linear transfer functions between input and output
- Number of input nodes equals number of genes selected
- Number of output nodes equals number of classes minus 1
- Number of inputs may be major principal components of genes, or major principal components of informative genes
- Perceptrons are linear classifiers
46 Naïve Bayes Classifier
- Expression profiles for class j assumed normal with mean vector m_j and diagonal covariance matrix D
- Likelihood of expression profile vector x is l(x; m_j, D)
- Posterior probability of class j for a case with expression profile vector x is proportional to p_j l(x; m_j, D) (a sketch follows)
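A minimal sketch of this rule; the means, variances, and priors below are hypothetical:

```python
import numpy as np

def naive_bayes_predict(x, means, var, priors):
    """Gaussian naïve Bayes with shared diagonal covariance D (a sketch).
    means: (classes, genes); var: (genes,), the diagonal of D; priors: (classes,)."""
    # Log posterior up to a constant: log p_j + log l(x; m_j, D)
    log_lik = -0.5 * np.sum((x - means) ** 2 / var, axis=1)
    return int(np.argmax(np.log(priors) + log_lik))

# Hypothetical two-class example with 4 genes.
means = np.array([[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, -1.0, 0.5]])
var = np.array([1.0, 0.5, 1.0, 2.0])
print(naive_bayes_predict(np.array([0.9, 0.8, -0.7, 0.1]), means, var,
                          priors=np.array([0.5, 0.5])))
```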
47 Compound Covariate Bayes Classifier
- Compound covariate y = Σ t_i x_i
  - Sum over the genes selected as differentially expressed
  - x_i = the expression level of the ith selected gene for the case whose class is to be predicted
  - t_i = the t statistic for testing differential expression for the ith gene
- Proceed as for the naïve Bayes classifier, but using the single compound covariate as the predictive variable (a sketch follows)
- GW Wright et al., PNAS 2005
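A minimal sketch of building the compound covariate on simulated data; classification would then treat the score y as the single predictor in the naïve Bayes rule above:

```python
import numpy as np
from scipy import stats

def compound_covariate(X1, X2, alpha=0.01):
    """Build y = sum_i t_i x_i over genes significant at level alpha (a sketch)."""
    t, p = stats.ttest_ind(X1, X2, axis=0)          # per-gene two-sample t tests
    sel = np.nonzero(p < alpha)[0]                  # selected genes
    return sel, (lambda x: float(t[sel] @ x[sel]))  # score for a new case

# Hypothetical data: 10 samples per class, 100 genes, first 5 informative.
rng = np.random.default_rng(2)
X1 = rng.normal(0, 1, (10, 100))
X2 = rng.normal(0, 1, (10, 100))
X2[:, :5] += 1.5
sel, score = compound_covariate(X1, X2)
print(sel, score(rng.normal(0, 1, 100)))
```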
48 When p >> n, the Linear Model is Too Complex
- It is always possible to find a set of features and a weight vector for which the classification error on the training set is zero.
- Why consider more complex models?
49 Myth
- Complex classification algorithms such as neural
networks perform better than simpler methods for
class prediction.
50
- Artificial intelligence sells to journal reviewers and peers who cannot distinguish hype from substance when it comes to microarray data analysis.
- Comparative studies have shown that simpler methods work as well or better for microarray problems because they avoid overfitting the data.
51 Other Simple Methods
- Nearest neighbor classification
- Nearest k-neighbors
- Nearest centroid classification
- Shrunken centroid classification
52 Nearest Neighbor Classifier
- To classify a sample in the validation set as being in outcome class 1 or outcome class 2, determine which sample in the training set its gene expression profile is most similar to.
  - The similarity measure used is based on genes selected as being univariately differentially expressed between the classes
  - Correlation similarity or Euclidean distance is generally used
- Classify the sample as being in the same class as its nearest neighbor in the training set (see the sketch below)
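A minimal sketch of this classifier, using Euclidean distance over univariately selected genes on simulated data:

```python
import numpy as np
from scipy import stats

def nearest_neighbor_predict(Xtr, ytr, x, alpha=0.01):
    """1-NN on Euclidean distance over univariately selected genes (a sketch)."""
    _, p = stats.ttest_ind(Xtr[ytr == 0], Xtr[ytr == 1], axis=0)
    sel = p < alpha                   # genes used for the similarity measure
    if not sel.any():
        sel = p <= p.min()            # fall back to the single best gene
    d = np.linalg.norm(Xtr[:, sel] - x[sel], axis=1)
    return ytr[np.argmin(d)]          # label of the nearest training sample

# Hypothetical data: 20 training samples, 50 genes, first 3 informative.
rng = np.random.default_rng(3)
ytr = np.repeat([0, 1], 10)
Xtr = rng.normal(0, 1, (20, 50))
Xtr[ytr == 1, :3] += 2
x = rng.normal(0, 1, 50)
x[:3] += 2
print(nearest_neighbor_predict(Xtr, ytr, x))
```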
53 Other Methods
- Neural networks
- Top-scoring pairs
- CART
- Random Forest
- Genetic algorithm based classification
54 Apparent Dimension Reduction Based Methods
- Principal component regression
- Supervised principal component regression
- Partial least squares
- Stepwise logistic regression
55 When There Are More Than 2 Classes
- Nearest neighbor type methods
- Decision tree of binary classifiers
56 Decision Tree of Binary Classifiers
- Partition the set of classes {1, 2, ..., K} into two disjoint subsets S1 and S2
- Develop a binary classifier for distinguishing the composite classes S1 and S2
- Compute the cross-validated classification error for distinguishing S1 and S2
- Repeat the above steps for all possible partitions in order to find the partition S1 and S2 for which the cross-validated classification error is minimized
- If S1 and S2 are not singleton sets, then repeat all of the above steps separately for the classes in S1 and S2 to optimally partition each of them (a sketch of the partition search follows)
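A minimal sketch of the partition search; cv_error is an assumed, caller-supplied function returning the cross-validated error for a candidate partition:

```python
from itertools import combinations

def best_partition(classes, cv_error):
    """Search all binary partitions of a set of class labels and return the
    one minimizing cv_error(S1, S2) (a sketch)."""
    classes = tuple(classes)
    best = None
    # Enumerate the 2^(K-1) - 1 distinct partitions; fixing classes[0] in S1
    # avoids counting each partition twice.
    rest = classes[1:]
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            s1 = (classes[0],) + combo
            s2 = tuple(c for c in classes if c not in s1)
            if not s2:
                continue
            err = cv_error(set(s1), set(s2))
            if best is None or err < best[0]:
                best = (err, set(s1), set(s2))
    return best

# Hypothetical error function, for illustration only.
print(best_partition([1, 2, 3, 4], lambda s1, s2: abs(len(s1) - len(s2))))
```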
58 Evaluating a Classifier
- "Prediction is difficult, especially the future." - Niels Bohr
- Fit of a model to the same data used to develop it is no evidence of prediction accuracy for independent data.
59 Evaluating a Classifier
- Fit of a model to the same data used to develop it is no evidence of prediction accuracy for independent data
  - Goodness of fit vs. prediction accuracy
- Demonstrating statistical significance of prognostic factors is not the same as demonstrating predictive accuracy
- Demonstrating stability of identification of gene predictors is not necessary for demonstrating predictive accuracy
60 Evaluating a Classifier
- The classification algorithm includes the following parts:
  - Determining what type of classifier to use
  - Gene selection
  - Fitting parameters
  - Optimizing with regard to tuning parameters
- If a re-sampling method such as cross-validation is to be used to estimate the predictive error of a classifier, all aspects of the classification algorithm must be repeated for each training set, and the accuracy of the resulting classifier scored on the corresponding validation set
61 Split-Sample Evaluation
- Training set
  - Used to select features, select model type, determine parameters and cut-off thresholds
- Test set
  - Withheld until a single model is fully specified using the training set
  - Fully specified model is applied to the expression profiles in the test set to predict class labels
  - Number of errors is counted
  - Ideally, test-set data is from different centers than the training data and assayed at a different time
62 Leave-One-Out Cross-Validation
- Omit sample 1
  - Develop multivariate classifier from scratch on training set with sample 1 omitted
  - Predict class for sample 1 and record whether prediction is correct
63 Leave-One-Out Cross-Validation
- Repeat analysis for training sets with each single sample omitted, one at a time
- e = number of misclassifications determined by cross-validation
- Subdivide e for estimation of sensitivity and specificity (a sketch of a correctly nested LOOCV loop follows)
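A minimal sketch of a correctly performed LOOCV, with gene selection repeated from scratch inside every fold; a nearest-centroid rule is used here for concreteness:

```python
import numpy as np
from scipy import stats

def loocv_error(X, y, alpha=0.01):
    """LOOCV with the full algorithm (gene selection + classifier fitting)
    repeated inside every fold (a sketch)."""
    errors = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i              # leave sample i out
        Xtr, ytr = X[keep], y[keep]
        _, p = stats.ttest_ind(Xtr[ytr == 0], Xtr[ytr == 1], axis=0)
        sel = p < alpha                            # selection: training data only
        if not sel.any():
            sel = p <= p.min()                     # fall back to the best gene
        c0 = Xtr[ytr == 0][:, sel].mean(axis=0)    # class centroids
        c1 = Xtr[ytr == 1][:, sel].mean(axis=0)
        xi = X[i, sel]
        pred = 0 if np.linalg.norm(xi - c0) <= np.linalg.norm(xi - c1) else 1
        errors += int(pred != y[i])
    return errors / len(y)
```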
64
- Cross-validation is only valid if the test set is not used in any way in the development of the model. Using the complete set of samples to select genes violates this assumption and invalidates cross-validation.
- With proper cross-validation, the model must be developed from scratch for each leave-one-out training set. This means that feature selection must be repeated for each leave-one-out training set.
- The cross-validated estimate of misclassification error is an estimate of the prediction error for the model fit by applying the specified algorithm to the full dataset.
- If you use cross-validation estimates of prediction error for a set of algorithms indexed by a tuning parameter and select the algorithm with the smallest CV error estimate, you do not have a valid estimate of the prediction error for the selected model.
65 Prediction on Simulated Null Data
- Generation of gene expression profiles
  - 14 specimens (P_i is the expression profile for specimen i)
  - Log-ratio measurements on 6000 genes
  - P_i ~ MVN(0, I_6000)
  - Can we distinguish between the first 7 specimens (Class 1) and the last 7 (Class 2)?
- Prediction method
  - Compound covariate prediction (discussed above)
  - Compound covariate built from the log-ratios of the 10 most differentially expressed genes (a simulation sketch follows)
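A sketch of this simulation, contrasting gene selection done once on all the data (incorrect) with selection repeated inside each fold (correct); on null data the incorrect version reports an optimistically low error:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (14, 6000))       # P_i ~ MVN(0, I_6000)
y = np.repeat([0, 1], 7)

def loocv_ccp(X, y, preselect_on_all_data):
    if preselect_on_all_data:          # the flawed step: uses every sample
        _, p = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
        sel_fixed = np.argsort(p)[:10]
    errors = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        Xtr, ytr = X[keep], y[keep]
        if preselect_on_all_data:
            sel = sel_fixed
        else:                          # proper: reselect within the fold
            _, p = stats.ttest_ind(Xtr[ytr == 0], Xtr[ytr == 1], axis=0)
            sel = np.argsort(p)[:10]
        t, _ = stats.ttest_ind(Xtr[ytr == 0][:, sel], Xtr[ytr == 1][:, sel], axis=0)
        score = X[i, sel] @ t          # compound covariate y = sum t_i x_i
        cut = 0.5 * ((Xtr[ytr == 0][:, sel] @ t).mean()
                     + (Xtr[ytr == 1][:, sel] @ t).mean())
        pred = 0 if score > cut else 1  # class 0 scores higher by construction
        errors += int(pred != y[i])
    return errors / len(y)

print("incorrect CV error:", loocv_ccp(X, y, True))   # optimistically near 0
print("correct CV error:  ", loocv_ccp(X, y, False))  # near 0.5 on null data
```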
67 Invalid Criticisms of Cross-Validation
- "You can always find a set of features that will provide perfect prediction for the training and test sets."
- For complex models, there may be many sets of features that provide zero training errors.
- A modeling strategy that either selects among those sets or aggregates among those models will have a generalization error that is validly estimated by cross-validation.
70 Simulated Data: 40 cases, 10 genes selected from 5000

Method             Estimate   Std Deviation
True               .078
Resubstitution     .007       .016
LOOCV              .092       .115
10-fold CV         .118       .120
5-fold CV          .161       .127
Split sample 1-1   .345       .185
Split sample 2-1   .205       .184
.632 bootstrap     .274       .084
71 DLBCL Data

Method           Bias    Std Deviation   MSE
LOOCV            -.019   .072            .008
10-fold CV       -.007   .063            .006
5-fold CV        .004    .07             .007
Split 1-1        .037    .117            .018
Split 2-1        .001    .119            .017
.632 bootstrap   -.006   .049            .004
72 Simulated Data: 40 cases

Method               Estimate   Std Deviation
True                 .078
10-fold              .118       .120
Repeated 10-fold     .116       .109
5-fold               .161       .127
Repeated 5-fold      .159       .114
Split 1-1            .345       .185
Repeated split 1-1   .371       .065
73 Permutation Distribution of Cross-Validated Misclassification Rate of a Multivariate Classifier
- Randomly permute class labels and repeat the entire cross-validation
- Re-do for all (or 1000) random permutations of class labels
- Permutation p value is the fraction of random permutations that gave as few misclassifications as e in the real data (a sketch follows)
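A minimal sketch of the wrapper; loocv_error is assumed to be a function (such as the one sketched earlier) that reruns the entire algorithm, including gene selection, inside each fold:

```python
import numpy as np

def cv_permutation_p(X, y, loocv_error, n_perm=1000, seed=0):
    """Permutation p value for a cross-validated error rate (a sketch)."""
    rng = np.random.default_rng(seed)
    e_obs = loocv_error(X, y)                      # error e on the real labels
    hits = sum(loocv_error(X, rng.permutation(y)) <= e_obs
               for _ in range(n_perm))             # permutations as extreme as e
    return hits / n_perm
```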
74 Gene-Expression Profiles in Hereditary Breast Cancer
- Breast tumors studied:
  - 7 BRCA1 tumors
  - 8 BRCA2 tumors
  - 7 sporadic tumors
- Log-ratio measurements of 3226 genes for each tumor after initial data filtering

RESEARCH QUESTION: Can we distinguish BRCA1+ from BRCA1- cancers and BRCA2+ from BRCA2- cancers based solely on their gene expression profiles?
75 BRCA1
76 BRCA2
77 Classification of BRCA2 Germline Mutations

Classification Method                    LOOCV Prediction Error (%)
Compound Covariate Predictor             14
Fisher LDA                               36
Diagonal LDA                             14
1-Nearest Neighbor                       9
3-Nearest Neighbor                       23
Support Vector Machine (linear kernel)   18
Classification Tree                      45
78 Common Problems With Internal Classifier Validation
- Pre-selection of genes using the entire dataset
- Failure to consider optimization of tuning parameters part of the classification algorithm
  - Varma & Simon, BMC Bioinformatics 2006
- Erroneous use of predicted class in a regression model
79 Incomplete (Incorrect) Cross-Validation
- Publications are using all the data to select genes and then cross-validating only the parameter-estimation component of model development
  - Highly biased
  - Many published complex methods make strong claims based on incorrect cross-validation
  - Frequently seen in complex feature-set selection algorithms
  - Some software encourages inappropriate cross-validation
80 Incomplete (Incorrect) Cross-Validation
- Let M(b, D) denote a classification model developed on a set of data D, where the model is of a particular type that is parameterized by a scalar b.
- Use cross-validation to estimate the classification error of M(b, D) for a grid of values of b: Err(b).
- Select the value of b that minimizes Err(b).
- Caution: the minimized Err(b) is a biased estimate of the prediction error of M(b, D).
- This error is made in some commonly used methods.
81 Complete (Correct) Cross-Validation
- Construct a learning set D as a subset of the full set S of cases.
- Use cross-validation restricted to D in order to estimate the classification error of M(b, D) for a grid of values of b: Err(b).
- Select the value of b that minimizes Err(b).
- Use the model M(b, D) to predict for the cases in S but not in D (S-D), and compute the error rate in S-D.
- Repeat this full procedure for different learning sets D1, D2, ... and average the error rates of the models M(b_i, D_i) over the corresponding validation sets S-D_i. (A nested cross-validation sketch follows.)
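A minimal sketch of complete (nested) cross-validation, using scikit-learn purely for illustration (an assumption; the talk's own software is BRB-ArrayTools); the inner loop selects the tuning parameter, the outer loop estimates error on held-out cases:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV, cross_val_score

# Simulated noise data: the honest nested-CV estimate should hover near 0.5.
rng = np.random.default_rng(0)
X = rng.normal(0, 1, (60, 2000))
y = rng.integers(0, 2, 60)

# Gene selection and classifier fitting are wrapped in one Pipeline, so both
# are redone inside every fold; the inner CV picks the tuning parameter C.
pipe = Pipeline([("select", SelectKBest(f_classif, k=10)),
                 ("clf", LinearSVC())])
inner = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0]}, cv=5)
outer_scores = cross_val_score(inner, X, y, cv=5)   # error scored on S-D
print("nested CV error:", 1 - outer_scores.mean())
```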
82 Does an Expression Profile Classifier Predict More Accurately Than Standard Prognostic Variables?
- Not an issue of which variables are significant after adjusting for which others, or which are independent predictors
  - Predictive accuracy and inference are different
- The two classifiers can be compared by ROC analysis as functions of the threshold for classification
- The predictiveness of the expression profile classifier can be evaluated within levels of the classifier based on standard prognostic variables
83 Does an Expression Profile Classifier Predict More Accurately Than Standard Prognostic Variables?
- Some publications fit a logistic model to standard covariates and the cross-validated predictions of expression profile classifiers
- This is valid only with split-sample analysis, because the cross-validated predictions are not independent
84 External Validation
- Should address clinical utility, not just predictive accuracy
  - Therapeutic relevance
- Should incorporate all sources of variability likely to be seen in broad clinical application
  - Expression profile assay distributed over time and space
  - Real-world tissue handling
  - Patients selected from different centers than those used for developing the classifier
85 Survival Risk Group Prediction
- Evaluate individual genes by fitting single-variable proportional hazards regression models to log signal or log-ratio for each gene
- Select genes based on a p-value threshold for the single-gene PH regressions
- Compute the first k principal components of the selected genes
- Fit a PH regression model with the k PCs as predictors. Let b1, ..., bk denote the estimated regression coefficients
- To predict for a case with expression profile vector x, compute the k supervised PCs y1, ..., yk and the predictive index = b1 y1 + ... + bk yk (a sketch of this pipeline follows)
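A minimal sketch of the supervised principal components pipeline, assuming the lifelines package for proportional hazards fitting (an assumed dependency, not part of the talk):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from lifelines import CoxPHFitter  # assumed available

def supervised_pc_index(X, time, event, alpha=0.001, k=2):
    """Supervised principal components predictive index (a sketch)."""
    # 1. Single-gene PH regressions; keep genes with p < alpha.
    pvals = []
    for g in range(X.shape[1]):
        df = pd.DataFrame({"x": X[:, g], "T": time, "E": event})
        fit = CoxPHFitter().fit(df, duration_col="T", event_col="E")
        pvals.append(fit.summary["p"].iloc[0])
    sel = np.asarray(pvals) < alpha
    # 2. First k principal components of the selected genes.
    pca = PCA(n_components=k).fit(X[:, sel])
    Y = pca.transform(X[:, sel])
    # 3. PH model on the k PCs: predictive index = b1*y1 + ... + bk*yk.
    df = pd.DataFrame(Y, columns=[f"y{j}" for j in range(k)])
    df["T"], df["E"] = time, event
    fit = CoxPHFitter().fit(df, duration_col="T", event_col="E")
    b = fit.params_.values
    return lambda x: float(pca.transform(x[sel][None, :])[0] @ b)
```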
86 Survival Risk Group Prediction
- LOOCV loop:
  - Create training set by omitting the ith case
  - Develop supervised PC PH model for the training set
  - Compute cross-validated predictive index for the ith case using the PH model developed for the training set
  - Compute the predictive risk percentile of the predictive index for the ith case among the predictive indices for cases in the training set
87 Survival Risk Group Prediction
- Plot Kaplan-Meier survival curves for cases with cross-validated risk percentiles above 50% and for cases with cross-validated risk percentiles below 50%
  - Or for however many risk groups and thresholds are desired
- Compute the log-rank statistic comparing the cross-validated Kaplan-Meier curves
88 Survival Risk Group Prediction
- Repeat the entire procedure for all (or a large number of) permutations of survival times and censoring indicators to generate the null distribution of the log-rank statistic
- The usual chi-square null distribution is not valid, because the cross-validated risk percentiles are correlated among cases
- Evaluate statistical significance of the association of survival and expression profiles by referring the log-rank statistic for the unpermuted data to the permutation null distribution
89 Survival Risk Group Prediction
- Other approaches to survival risk group prediction have been published
- The supervised PC method is implemented in BRB-ArrayTools
- BRB-ArrayTools also provides for comparing the risk group classifier based on expression profiles to one based on standard covariates, and to one based on a combination of both types of variables
90 Sample Size Planning References
- Dobbin K, Simon R. Sample size determination in microarray experiments for class comparison and prognostic classification. Biostatistics 6:27-38, 2005
- Dobbin K, Simon R. Sample size planning for developing classifiers using high dimensional DNA microarray data. Biostatistics (in press)
91 Sample Size Planning
- GOAL: Identify genes differentially expressed in a comparison of two pre-defined classes of specimens, on dual-label arrays using a reference design or on single-label arrays
- Compare classes separately by gene, with adjustment for multiple comparisons
- Approximate expression levels (log ratio or log signal) as normally distributed
- Determine the number of samples n/2 per class to give power 1-β for detecting mean difference δ at level α
92 Comparing 2 Equal-Size Classes
- n = 4σ²(z_α/2 + z_β)² / δ²
  - δ = mean log-ratio difference between classes
  - σ = within-class standard deviation of biological replicates
  - z_α/2, z_β = standard normal percentiles
- Choose α small, e.g. α = 0.001
- Use percentiles of the t distribution for improved accuracy
93 Total Number of Samples for Two-Class Comparison

α       β      δ            σ                        Samples Per Class
0.001   0.05   1 (2-fold)   0.5 (human tissue)       13
0.001   0.05   1 (2-fold)   0.25 (transgenic mice)   6 (t approximation)

(The sketch below reproduces these numbers.)
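A small sketch that reproduces the table, assuming scipy; the human-tissue entry matches the normal-theory formula, while the transgenic-mice entry requires the t-percentile iteration mentioned above:

```python
import math
from scipy.stats import norm, t

def n_per_class_normal(delta, sigma, alpha=0.001, beta=0.05):
    """Normal-theory samples per class: 2*sigma^2*(z_{alpha/2}+z_beta)^2/delta^2."""
    z = norm.isf(alpha / 2) + norm.isf(beta)
    return math.ceil(2 * sigma**2 * z**2 / delta**2)

def n_per_class_t(delta, sigma, alpha=0.001, beta=0.05):
    """Smallest n per class satisfying the same bound with t percentiles
    on 2n - 2 degrees of freedom (the slide's 't approximation')."""
    for n in range(2, 1000):
        q = t.isf(alpha / 2, 2 * n - 2) + t.isf(beta, 2 * n - 2)
        if n >= 2 * sigma**2 * q**2 / delta**2:
            return n

print(n_per_class_normal(1.0, 0.5))   # 13 (human tissue row)
print(n_per_class_t(1.0, 0.25))       # 6  (transgenic mice row)
```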
94 Dual-Label Arrays With Reference Design: Pools of k Biological Samples
95
- m = number of technical reps per sample
- k = number of samples per pool
- n = total number of arrays
- δ = mean difference between classes in log signal
- σ² = biological variance within class
- τ² = technical variance
- α = significance level, e.g. 0.001
- 1-β = power
- z = normal percentiles (use t percentiles for better accuracy)
96

Number of samples pooled per array   Number of arrays required   Number of samples required
1                                    25                          25
2                                    17                          34
3                                    14                          42
4                                    13                          52

α = 0.001, β = 0.05, δ = 1, σ² + τ² = 0.25, σ²/τ² = 4
97 α = 0.001, β = 0.05, δ = 1, σ² + τ² = 0.25, σ²/τ² = 4

m technical reps   n arrays required   Samples required
1                  25                  25
2                  42                  21
3                  60                  20
4                  76                  19
98 Number of Events Needed to Detect Gene-Specific Effects on Survival
- σ = standard deviation in log2 ratios for each gene
- δ = hazard ratio (>1) corresponding to a 2-fold change in gene expression
- α = 1/N for 1 expected false positive gene identified per N genes examined
- β = 0.05 for a 5% false-negative rate
99 Number of Events Required to Detect Gene-Specific Effects on Survival (α = 0.001, β = 0.05)

Hazard Ratio δ   σ     Events Required
2                0.5   26
1.5              0.5   76
100 Sample Size Planning for Classifier Development
- The expected value (over training sets) of the probability of correct classification, PCC(n), should be within γ of the maximum achievable, PCC(∞)
101 Probability Model
- Two classes
- Log expression or log ratio MVN in each class, with common covariance matrix
- m differentially expressed genes
- p - m noise genes
- Expression of differentially expressed genes is independent of expression of the noise genes
- All differentially expressed genes have the same inter-class mean difference 2δ
- Common variance for differentially expressed genes and for noise genes
102 Classifier
- Feature selection based on univariate t-tests for differential expression at significance level α
- Simple linear classifier with equal weights (except for sign) for all selected genes. Power for selecting each of the informative genes that are differentially expressed by mean difference 2δ is 1-β(n)
103
- For 2 classes of equal prevalence, let λ1 denote the largest eigenvalue of the covariance matrix of informative genes. Then
105 Sample size as a function of effect size (log-base-2 fold change between classes divided by standard deviation). Two different tolerances shown. Each class is equally represented in the population; 22,000 genes on an array.
107 Optimal significance level cutoffs for gene selection; 50 differentially expressed genes out of 22,000 genes on the microarrays

2d/s   n=10    n=30      n=50
1      0.167   0.003     0.00068
1.25   0.085   0.0011    0.00035
1.5    0.045   0.00063   0.00016
1.75   0.026   0.00036   0.00006
2      0.015   0.0002    0.00002
108 γ = 0.05, p = 22,000 genes, gene standard deviation s = 0.75

Distance 2d | Fold change | Sample size n | PCC(n), m=1 | PCCworst(n), m=1, pmin=0.25 | PCC(n), m=5 | PCCworst(n), m=5, pmin=0.25 | Adjusted sample size
0.75  | 1.68 | 106 | 0.64 | 0.34 | 0.82 | 0.70 | 212
1.125 | 2.18 | 58  | 0.72 | 0.45 | 0.93 | 0.87 | 116
1.5   | 2.83 | 38  | 0.80 | 0.52 | 0.98 | 0.95 | 76
1.875 | 3.67 | 28  | 0.85 | 0.63 | 0.99 | 0.96 | 56
109 a) Power example with 60 samples, effect size 2d/s = 1/0.71 for differentially expressed genes, alpha = 0.001 cutoffs for gene selection. As the proportion in the under-represented class gets smaller, the power to identify differentially expressed genes decreases.
110 b) PCC(60) as a function of the proportion in the under-represented class. Parameter settings same as a), with 10 differentially expressed genes among 22,000 total genes. If the proportion in the under-represented class is small (e.g., <20%), then PCC(60) can decline significantly.
116 BRB-ArrayTools
- Integrated software package using an Excel-based user interface, but state-of-the-art analysis methods programmed in R, Java, and Fortran
- Publicly available for non-commercial use: http://linus.nci.nih.gov/brb
117 Selected Features of BRB-ArrayTools
- Multivariate permutation tests for class comparison, to control the number and proportion of false discoveries with specified confidence level
  - Permits blocking by another variable, pairing of data, averaging of technical replicates
- SAM
  - Fortran implementation 7X faster than R versions
- Extensive annotation for identified genes
  - Internal annotation of NetAffx, Source, Gene Ontology, and pathway information
  - Links to annotations in genomic databases
- Find genes correlated with a quantitative factor while controlling number or proportion of false discoveries
- Find genes correlated with censored survival while controlling number or proportion of false discoveries
- Analysis of variance
- Time course analysis
  - Log intensities for non-reference designs
  - Mixed models
118 Selected Features of BRB-ArrayTools
- Gene set enrichment analysis
  - Find Gene Ontology groups and signaling pathways that are differentially expressed among classes
- Class prediction
  - DLDA, CCP, Nearest Neighbor, Nearest Centroid, Shrunken Centroids, SVM, Random Forests
  - Complete LOOCV, k-fold CV, repeated k-fold, .632 bootstrap
  - Permutation significance of cross-validated error rate
119 Selected Features of BRB-ArrayTools
- Clustering tools for class discovery, with reproducibility statistics on clusters
  - Internal access to Eisen's Cluster and TreeView
- Visualization tools, including rotating 3D principal components plot exportable to PowerPoint with rotation controls
- Extensible via R plug-in feature
- Tutorials and datasets
120 Acknowledgements
- Kevin Dobbin
- Michael Radmacher
- Sudhir Varma
- Yingdong Zhao
- BRB-ArrayTools Development Team
- Amy Lam
- Ming-Chung Li
- Supriya Menezes