Title: Multiple Comparisons for Microarray Experiments: Motivation and Methods
2Microarrays, pathways and Thomas Bayes
- Mik Black
- Department of Statistics, The University of Auckland
- Otago Genomics Facility Meeting
- February 13, 2004
3Overview
- Statisticians and microarrays.
- Summary of current research.
- Statistical models for microarray data.
- False discovery rate control.
- Current and future work.
4Statisticians and microarrays
- Statistician's role
- Experimental design
- Power calculations
- Normalization
- Analysis
- Similar to any experimental situation.
- Often little scope for novel statistical research.
5Why bother?
- Build research partnerships.
- Interdisciplinary crosstalk.
- Publications.
- Increase diversity of research.
- Encourage good statistical practice.
- Early adoption of novel statistical methods.
7Student research
- Marcus Davy (M.Sc. thesis)
- characteristics of FDR-controlling procedures for microarray experimentation.
- Hadley Wickham (M.Sc. project)
- spatial normalization.
- visualization techniques.
- Thomas Tiang (PGDipSci project)
- normalization strategies.
- Po-Hsun Huang (Summer studentship/Honours project)
- normalization for boutique arrays.
8Personal research
- Bayesian and likelihood-based models for microarray data.
- Semi-parametric modeling.
- False discovery rate control and extensions.
9Standard post-normalization analysis
- Calculate test statistics for each gene.
- These reflect the magnitude of observed differential expression relative to observed variability.
- Use resampling methods to obtain p-values.
- Bootstrapping
- Permutations
- Use p-values to determine genes undergoing significant differential expression, subject to a pre-determined level of error rate control.
- Parametric alternatives use normal distribution theory.
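The resampling step above can be sketched as a permutation test. This is a minimal illustration rather than the analysis used in the talk: it assumes a two-group design with a genes-by-replicates matrix per group, a pooled-variance t statistic, and random relabelling of columns. The function names, the simulated data, and the number of permutations are all assumptions made for the example.

```python
import numpy as np

def pooled_t(x, y):
    """Two-sample pooled-variance t statistic for each row (gene)."""
    nx, ny = x.shape[1], y.shape[1]
    sp2 = ((nx - 1) * x.var(axis=1, ddof=1)
           + (ny - 1) * y.var(axis=1, ddof=1)) / (nx + ny - 2)
    return (x.mean(axis=1) - y.mean(axis=1)) / np.sqrt(sp2 * (1 / nx + 1 / ny))

def permutation_pvalues(x, y, n_perm=1000, seed=0):
    """Two-sided permutation p-value per gene, from shuffled group labels."""
    rng = np.random.default_rng(seed)
    data = np.hstack([x, y])
    nx = x.shape[1]
    t_obs = np.abs(pooled_t(x, y))
    exceed = np.zeros(data.shape[0])
    for _ in range(n_perm):
        perm = rng.permutation(data.shape[1])
        xp, yp = data[:, perm[:nx]], data[:, perm[nx:]]
        # Count permutations whose statistic is at least as extreme as observed.
        exceed += np.abs(pooled_t(xp, yp)) >= t_obs
    # Adding 1 to numerator and denominator avoids p-values of exactly zero.
    return (exceed + 1) / (n_perm + 1)

# Simulated example: 200 genes, 5 replicates per group, first 20 genes shifted.
rng = np.random.default_rng(42)
x = rng.normal(size=(200, 5))
y = rng.normal(size=(200, 5))
x[:20] += 3.0
pvals = permutation_pvalues(x, y)
```

With only five replicates per group there are 252 distinct relabellings, so the smallest attainable p-value is limited; small designs like this are one reason the parametric alternative is sometimes preferred.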
10Bayesian models for microarrays
- Newton et al. (2001) introduced a Bayesian model for single-array experiments based on Gamma distributions.
- This approach has since been extended to encompass multiple arrays, and to provide greater flexibility in terms of (non)parametric assumptions (Kendziorski et al. 2002; Newton and Kendziorski, 2002; Newton et al. 2003).
- Produces estimates of the probability of differential expression for each gene in the experiment.
- These probabilities can then be used to produce a list of genes undergoing significant differential expression.
11Which approach?
- P-values
- Advantages: simple, (non-)parametric, standard.
- Disadvantages: under-powered, genes considered in isolation, variance structure?
- Bayesian model
- Advantages: intuitive output (probabilities), models underlying distribution of expression means (improved power), variance shrinkage.
- Disadvantages: poor performance under model mis-specification, complex implementation, non-standard.
12Determining differential expression
- Each of the previously described methods produces an estimate of the likelihood of differential expression for each gene.
- The next step involves deciding which genes should be considered to have undergone significant differential expression.
- This decision is closely linked to the level of error we are willing to tolerate in our analysis.
- Although there are many options available, control of the false discovery rate has become a popular approach.
13False discovery rate control
- Introduced by Benjamini and Hochberg (1995).
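- Want to control the number of incorrect rejections, V, as a proportion of the total number of rejections, R: $\mathrm{FDR} = E[V/R]$ (with $V/R = 0$ when $R = 0$).
- Stepwise p-value adjustment guarantees $\mathrm{FDR} \le \alpha$.
- Finner and Roters (2001) showed that $\mathrm{FDR} = \pi_0 \alpha$ (for independent test statistics), where $\pi_0$ is the proportion of true null hypotheses.
- FDR is an expectation, so control is on average.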
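The stepwise adjustment of Benjamini and Hochberg (1995) can be written in a few lines. The sketch below is a plain step-up implementation: sort the p-values, compare the i-th smallest with i times alpha divided by m, and reject everything up to the largest i that passes. The function name and the example p-values are illustrative assumptions.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of hypotheses rejected by the BH step-up rule."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Step-up: find the largest rank whose p-value passes its threshold,
        # then reject every hypothesis with rank at or below it.
        k = np.max(np.nonzero(below)[0])
        reject[order[:k + 1]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals, alpha=0.05)

# The step-up rule can reject p-values that fail their own threshold,
# provided a larger p-value further down the list passes.
step_up = benjamini_hochberg([0.02, 0.03, 0.035, 0.04], alpha=0.05)
```

In the first example only the two smallest p-values are rejected. In the second, 0.02 and 0.03 exceed their own thresholds (0.0125 and 0.025) but are still rejected because 0.04 passes at rank 4.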
14Adaptive control of the FDR
- Control of the FDR at level $\alpha$ requires estimation of the proportion of true null hypotheses, $\pi_0$.
- In the microarray setting, this is the proportion of genes on the array which do not undergo differential expression.
- Estimation of this quantity is not straightforward. Although Storey (2002) and Storey and Tibshirani (2003) have proposed methods for this, they can produce severely biased estimates (Black, 2004).
- Newton et al. (2003) demonstrated that their Bayesian approach provides adaptive FDR control.
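For orientation, the simplest form of the Storey (2002) estimator counts p-values above a cutoff and rescales, on the reasoning that large p-values come mostly from true nulls. The sketch below uses a fixed cutoff of 0.5 and simulated p-values; both are assumptions for illustration, and the smoothing over cutoffs proposed by Storey and Tibshirani (2003) is not reproduced here.

```python
import numpy as np

def storey_pi0(pvals, lam=0.5):
    """Estimate the proportion of true nulls from p-values above lam."""
    p = np.asarray(pvals, dtype=float)
    # Under the null, p-values are uniform, so a fraction (1 - lam) of the
    # true nulls is expected to exceed lam.
    pi0 = np.mean(p > lam) / (1.0 - lam)
    return min(1.0, pi0)

# Simulated mixture: 80% true nulls (uniform), 20% alternatives (near zero).
rng = np.random.default_rng(1)
p_null = rng.uniform(size=8000)
p_alt = rng.beta(0.1, 1.0, size=2000)
pi0_hat = storey_pi0(np.concatenate([p_null, p_alt]))
```

In this simulation the estimate sits slightly above the true value of 0.8, because some alternative p-values also exceed the cutoff. How large the bias becomes in practice depends on the choice of cutoff and on the shape of the alternative distribution.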
15Case study: DNA methylation in Arabidopsis
- Two-array, dye-swapped design.
- ddm1 mutant versus wild-type.
- 4224 spots, 1882 features (genes).
- Multiple replicate spots per gene.
- 1523 genes with 2 spots each.
16Logged data: background and foreground (array 1)
[Figure panels: Foreground, Background, Normalized Foreground]
17Logged data: background and foreground (array 2)
[Figure panels: Foreground, Background, Normalized Foreground]
18Per-array loess normalization (FG only)
19Spatial check
20Simple analysis of normalized data
- Calculate a two-sample pooled-variance t-test statistic for each gene.
- Calculate a p-value for each gene, either based on normal probabilities or from bootstrapping.
- Use p-values in the estimation procedure for $\pi_0$, the proportion of true null hypotheses.
- Use stepwise p-value adjustment to adaptively control the FDR at level $\alpha/\hat{\pi}_0$ to achieve $\mathrm{FDR} \approx \alpha$.
21Improved analysis
- Often small per-gene variances can make small fold-changes statistically significant.
- Tusher et al. (2001) proposed the SAM (Significance Analysis of Microarrays) method to overcome this problem.
- Add small fudge factor to denominator of test statistic.
- Usually a low quantile of the distribution of gene-specific standard errors.
- Functions as a shrinkage estimation procedure.
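The effect of the fudge factor can be seen by computing the ordinary and the moderated statistic side by side. The sketch below assumes a two-group design and simply takes a fixed low quantile of the per-gene standard errors as the fudge factor, as the slide describes. The published SAM method picks the constant by a data-driven criterion, which is not reproduced here; the function name, the 5% quantile, and the simulated data are assumptions.

```python
import numpy as np

def sam_like_statistic(x, y, quantile=0.05):
    """Return (t, d, s0): ordinary t, fudge-factor statistic, and the factor."""
    nx, ny = x.shape[1], y.shape[1]
    diff = x.mean(axis=1) - y.mean(axis=1)
    sp2 = ((nx - 1) * x.var(axis=1, ddof=1)
           + (ny - 1) * y.var(axis=1, ddof=1)) / (nx + ny - 2)
    se = np.sqrt(sp2 * (1 / nx + 1 / ny))
    # Fudge factor: a low quantile of the gene-specific standard errors.
    s0 = np.quantile(se, quantile)
    t = diff / se
    d = diff / (se + s0)
    return t, d, s0

rng = np.random.default_rng(7)
x = rng.normal(size=(500, 4))
y = rng.normal(size=(500, 4))
t, d, s0 = sam_like_statistic(x, y)
```

Because the same constant is added to every denominator, genes with very small standard errors are pulled towards zero the most, which is why the procedure behaves like shrinkage.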
22Bayesian analysis
- Fit model of Newton and Kendziorski (2002) to the data.
- Use probabilities of differential expression to achieve adaptive FDR control.
- Hierarchical model structure allows data sharing across genes (effectively producing shrinkage).
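One way to turn per-gene probabilities of differential expression into a gene list is to keep adding genes, most probable first, while the average posterior probability of no differential expression within the list stays at or below the target level. The sketch below assumes this rule and assumes the posterior probabilities are already available; fitting the hierarchical model itself is not shown, and the function name and example probabilities are hypothetical.

```python
import numpy as np

def bayes_fdr_list(prob_de, alpha=0.05):
    """Select the largest top-ranked list whose mean null probability <= alpha."""
    prob_de = np.asarray(prob_de, dtype=float)
    order = np.argsort(-prob_de)          # most probable genes first
    null_prob = 1.0 - prob_de[order]      # posterior probability of no change
    # Running mean of null probabilities = estimated FDR of the top-k list.
    running_fdr = np.cumsum(null_prob) / np.arange(1, prob_de.size + 1)
    # The running mean never decreases, so counting the passing prefixes
    # gives the size of the largest acceptable list.
    k = int(np.sum(running_fdr <= alpha))
    selected = np.zeros(prob_de.size, dtype=bool)
    selected[order[:k]] = True
    return selected

prob_de = [0.99, 0.98, 0.95, 0.90, 0.60, 0.20]
selected = bayes_fdr_list(prob_de, alpha=0.05)
```

Here the first four genes are selected: their null probabilities average 0.045, while adding the fifth gene would raise the average to 0.116. No separate estimate of the proportion of true nulls is needed, since that information is already carried by the posterior probabilities.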
23Summary of results
- Numbers of differentially expressed genes, and estimates of $\pi_0$
24Bayes versus p-values: gene order
- Percentage agreement on rankings of first n genes
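The comparison on this slide can be computed as the overlap between the top n genes under each method. The sketch below assumes that "agreement" means the percentage of genes shared by the two top-n sets, regardless of order within them; the original figure is not available, so this definition, the function name, and the toy scores are all assumptions.

```python
import numpy as np

def top_n_agreement(score_a, score_b, n_values):
    """Percentage of genes shared by the top-n lists of two scorings."""
    # Higher score = stronger evidence; for p-values pass 1 - p or -p.
    rank_a = np.argsort(-np.asarray(score_a, dtype=float))
    rank_b = np.argsort(-np.asarray(score_b, dtype=float))
    return {
        n: 100.0 * len(set(rank_a[:n].tolist()) & set(rank_b[:n].tolist())) / n
        for n in n_values
    }

score_a = [5, 4, 3, 2, 1]   # e.g. posterior probabilities from one method
score_b = [5, 3, 4, 1, 2]   # e.g. 1 - p-value from the other method
agreement = top_n_agreement(score_a, score_b, [1, 2, 3])
```

For the toy scores the two methods agree fully on the top gene and on the top three as a set, but share only one of their top two genes.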
25Conclusions
- Per-gene variances resulted in a large number of genes reported as differentially expressed under adaptive FDR control.
- Use of SAM procedure radically reduced this number through shrinkage estimation.
- Removes problem of small variances making small fold-changes significant.
- Bayesian model also used shrinkage estimators and adaptive FDR control, but detected more (and different) genes as differentially expressed.
- Simulations support superiority of Bayesian procedure assuming model is correctly specified.
26Current and future work
- Likelihood-based method which can control the actual (rather than average) proportion of false discoveries for a given set of rejections.
- e.g., for a list of differentially expressed genes, the probability that fewer than 10% of these are false positives is at least 95%.
- Applying the Bayesian analysis approach to the problem of identifying differentially regulated pathways.
- Extension of the work of Mootha et al. (2003).
27Acknowledgements
- Rebecca Doerge (Purdue University)
- Rob Martienssen (Cold Spring Harbor)
- Vincent Colot (URGV)
- Zach Lippman (Cold Spring Harbor)