Multiple Comparisons for Microarray Experiments: Motivation and Methods

Mik Black, The University of Auckland, February 13, 2004.


1
SPONSORED BY
2
Microarrays, pathways and Thomas Bayes

Mik Black
Department of Statistics, The University of
Auckland
Otago Genomics Facility Meeting
February 13, 2004

3
Overview

Statisticians and microarrays.
Summary of current research.
Statistical models for microarray data.
False discovery rate control.
Current and future work.

4
Statisticians and microarrays

Statistician's role:
Experimental design
Power calculations
Normalization
Analysis
Similar to any experimental situation.
Often little scope for novel statistical research.

6
Why bother?

Build research partnerships.
Interdisciplinary crosstalk.
Publications.
Increase diversity of research.
Encourage good statistical practice.
Early adoption of novel statistical methods.

7
Student research

Marcus Davy (M.Sc. thesis): characteristics of FDR-controlling procedures for
microarray experimentation.
Hadley Wickham (M.Sc. project): spatial normalization; visualization techniques.
Thomas Tiang (PGDipSci project): normalization strategies.
Po-Hsun Huang (Summer studentship/Honours project): normalization for boutique
arrays.

8
Personal research

Bayesian and likelihood-based models for
microarray data.
Semi-parametric modeling.
False discovery rate control and extensions.

9
Standard post-normalization analysis

Calculate test statistics for each gene.
These reflect the magnitude of observed
differential expression relative to observed
variability.
Use resampling methods to obtain p-values.
Bootstrapping
Permutations
Use p-values to determine genes undergoing
significant differential expression, subject to a
pre-determined level of error rate control.
Parametric alternatives use normal distribution
theory.
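
The resampling step above can be sketched for a single gene as follows. This is a minimal illustration, not the procedure used in the talk: the function names (`pooled_t`, `permutation_p_value`), the number of permutations and the example expression values are all assumptions made for the sketch, and it assumes the replicate values are not all identical (so the pooled variance is non-zero).

```python
import random
from math import sqrt
from statistics import mean, variance


def pooled_t(x, y):
    """Two-sample pooled-variance t statistic for one gene."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny))


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value: shuffle the group labels and count how
    often the permuted |t| is at least as large as the observed |t|."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    observed = abs(pooled_t(x, y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(pooled_t(pooled[:len(x)], pooled[len(x):])) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return (hits + 1) / (n_perm + 1)


# Hypothetical log-expression values for one gene in two conditions.
treated = [2.1, 2.5, 1.9, 2.4]
control = [0.2, -0.1, 0.3, 0.0]
p = permutation_p_value(treated, control)
```

With only four replicates per group there are just 70 distinct label assignments, so the smallest attainable two-sided p-value is about 2/70; this is one reason per-gene resampling can be under-powered in small experiments.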

10
Bayesian models for microarrays

Newton et al. (2001) introduced a Bayesian model
for single-array experiments based on Gamma
distributions.
This approach has since been extended to
encompass multiple arrays, and to provide greater
flexibility in terms of (non)parametric
assumptions (Kendziorski et al. 2002; Newton and
Kendziorski, 2002; Newton et al. 2003).
Produces estimates of the probability of
differential expression for each gene in the
experiment.
These probabilities can then be used to produce a
list of genes undergoing significant
differential expression.

11
Which approach?

P-values
Advantages: simple, (non-)parametric, standard.
Disadvantages: under-powered, genes considered in
isolation, variance structure?
Bayesian model
Advantages: intuitive output (probabilities),
models underlying distribution of expression
means (improved power), variance shrinkage.
Disadvantages: poor performance under model
mis-specification, complex implementation,
non-standard.

12
Determining differential expression

Each of the previously described methods produces
an estimate of the likelihood of differential
expression for each gene.
The next step involves deciding which genes
should be considered to have undergone
significant differential expression.
This decision is closely linked to the level of
error we are willing to tolerate in our analysis.
Although there are many options available,
control of the false discovery rate has become a
popular approach.

13
False discovery rate control

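Introduced by Benjamini and Hochberg (1995).
Want to control the number of incorrect
rejections, V, as a proportion of the total
number of rejections, R:

$\mathrm{FDR} = E[V/R]$, with $V/R$ taken to be 0 when $R = 0$.

Stepwise p-value adjustment at level $\alpha$ guarantees

$\mathrm{FDR} \le \alpha$.

Finner and Roters (2001) showed that, for independent test statistics,

$\mathrm{FDR} = \pi_0 \alpha$,

where $\pi_0$ is the proportion of true null
hypotheses.

FDR is an expectation, so control is on average.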
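
The stepwise p-value adjustment of Benjamini and Hochberg (1995) can be sketched as a step-up rule: sort the p-values, find the largest rank k whose p-value is at most k times the target level divided by the number of tests, and reject the k smallest. The function name `benjamini_hochberg` and the example p-values are assumptions for illustration only.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the (sorted) indices of hypotheses rejected by the
    Benjamini-Hochberg step-up procedure at target FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value falls under its threshold
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k = rank
    # Step-up: reject everything up to rank k, even if an earlier
    # rank missed its own threshold.
    return sorted(order[:k])


# Hypothetical p-values for four genes.
rejected = benjamini_hochberg([0.01, 0.04, 0.03, 0.035], alpha=0.05)
# rejected == [0, 1, 2, 3]
```

In the example the second-smallest p-value (0.03) exceeds its own threshold (0.025), but it is still rejected because a larger rank passes; this is what distinguishes the step-up rule from a simple per-rank cut-off.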

14
Adaptive control of the FDR

Control of the FDR at level $\alpha$ requires estimation
of the proportion of true null hypotheses, $\pi_0$.
In the microarray setting, this is the proportion
of genes on the array which do not undergo
differential expression.
Estimation of this quantity is not
straightforward. Although Storey (2002) and
Storey and Tibshirani (2003) have proposed
methods for this, they can produce severely
biased estimates (Black, 2004).
Newton et al. (2003) demonstrated that their
Bayesian approach provides adaptive FDR control.
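
As a rough sketch of the p-value-based route, the estimator of Storey (2002) counts the p-values above a tuning threshold and rescales; the step-up procedure is then run at an inflated level. The names `storey_pi0` and `adaptive_bh`, the threshold value 0.5, the lower bound placed on the estimate, and the example p-values are assumptions for illustration. As noted above, this estimate can be badly biased, so the sketch shows the mechanics rather than a recommendation.

```python
def storey_pi0(p_values, lam=0.5):
    """Storey-type estimate of the proportion of true nulls:
    (number of p-values above lam) / (m * (1 - lam)), capped at 1."""
    m = len(p_values)
    above = sum(1 for p in p_values if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))


def adaptive_bh(p_values, alpha=0.05, lam=0.5):
    """Step-up procedure run at level alpha / pi0_hat.
    Returns (sorted rejected indices, pi0_hat)."""
    m = len(p_values)
    # Guard (an assumption of this sketch): keep the estimate away from 0
    # so the inflated level stays finite.
    pi0_hat = max(storey_pi0(p_values, lam), 1.0 / m)
    level = alpha / pi0_hat
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * level / m:
            k = rank
    return sorted(order[:k]), pi0_hat


# Hypothetical p-values for eight genes.
p = [0.001, 0.002, 0.003, 0.004, 0.02, 0.04, 0.6, 0.9]
rejected, pi0_hat = adaptive_bh(p, alpha=0.05)
# pi0_hat == 0.5 and rejected == [0, 1, 2, 3, 4, 5]
```

For these values the non-adaptive procedure at level 0.05 would reject only the five smallest p-values; halving the estimated null proportion doubles the working level and admits a sixth.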

15
Case study: DNA methylation in Arabidopsis

Two-array, dye-swapped design.
ddm1 mutant versus wild-type.
4224 spots, 1882 features (genes).
Multiple replicate spots per gene.
1523 genes with 2 spots each.

16
Logged data: background and foreground (array 1)

[Figure panels: Foreground, Background, Normalized Foreground]

17
Logged data: background and foreground (array 2)

[Figure panels: Foreground, Background, Normalized Foreground]

18
Per-array loess normalization (FG only)

19
Spatial check

20
Simple analysis of normalized data

Calculate the two-sample pooled-variance t-test
statistic for each gene.
Calculate a p-value for each gene, either based on
normal probabilities or from bootstrapping.
Use the p-values in the $\pi_0$ estimation procedure.
Use stepwise p-value adjustment to adaptively
control the FDR at level $\alpha / \hat{\pi}_0$ to achieve
$\mathrm{FDR} \approx \alpha$.

21
Improved analysis

Often small per-gene variances can make small
fold-changes statistically significant.
Tusher et al. (2001) proposed the SAM
(Significance Analysis of Microarrays) method to
overcome this problem.
Add small fudge factor to denominator of test
statistic.
Usually a low quantile of the distribution of
gene-specific standard errors.
Functions as a shrinkage estimation procedure.
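
The effect of the fudge factor can be sketched as follows. This is an illustration of the idea on this slide, not the SAM software: the function names, the nearest-rank quantile rule, the default quantile level of 0.05 and the example data are all assumptions.

```python
from math import sqrt
from statistics import mean, variance


def standard_error(x, y):
    """Pooled-variance standard error of the difference in means."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return sqrt(sp2 * (1 / nx + 1 / ny))


def low_quantile(values, q):
    """Simple nearest-rank quantile (an assumption of this sketch)."""
    ordered = sorted(values)
    return ordered[int(q * (len(ordered) - 1))]


def sam_like_statistics(genes, q=0.05):
    """genes: list of (x, y) replicate lists, one pair per gene.
    Returns (list of d statistics, fudge factor s0)."""
    ses = [standard_error(x, y) for x, y in genes]
    s0 = low_quantile(ses, q)  # low quantile of gene-specific standard errors
    d = [(mean(x) - mean(y)) / (se + s0) for (x, y), se in zip(genes, ses)]
    return d, s0


# Hypothetical data: gene 0 has a tiny fold-change but an even tinier variance.
genes = [
    ([1.00, 1.01, 1.00], [0.95, 0.96, 0.95]),
    ([2.0, 2.6, 1.7], [0.1, -0.4, 0.5]),
    ([0.3, -0.2, 0.1], [0.0, 0.4, -0.3]),
]
d, s0 = sam_like_statistics(genes)
```

Because the same constant is added to every denominator, genes with very small standard errors are pulled back the most, while genes with large standard errors are barely affected.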

22
Bayesian analysis

Fit model of Newton and Kendziorski (2002) to the
data.
Use probabilities of differential expression to
achieve adaptive FDR control.
Hierarchical model structure allows data
sharing across genes (effectively producing
shrinkage).
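
One common way to turn per-gene probabilities of differential expression into a list with a target FDR is sketched below: rank genes by probability and keep adding genes while the average posterior probability of being null within the list stays at or below the target. This is a generic sketch and is not claimed to be the exact rule used in the cited papers; the function name `select_by_posterior` and the example probabilities are assumptions.

```python
def select_by_posterior(prob_de, alpha=0.05):
    """prob_de[i] is the posterior probability that gene i is differentially
    expressed. Returns indices of the largest top-ranked list whose average
    posterior null probability (1 - prob_de) does not exceed alpha."""
    order = sorted(range(len(prob_de)), key=lambda i: prob_de[i], reverse=True)
    selected = []
    null_mass = 0.0  # running sum of posterior null probabilities
    for i in order:
        null_mass += 1.0 - prob_de[i]
        # The running average cannot decrease as weaker genes are added,
        # so it is safe to stop at the first violation.
        if null_mass / (len(selected) + 1) > alpha:
            break
        selected.append(i)
    return selected


# Hypothetical posterior probabilities for five genes.
chosen = select_by_posterior([0.99, 0.5, 0.97, 0.90, 0.2], alpha=0.05)
# chosen == [0, 2, 3]
```

No separate estimate of the proportion of true nulls is needed here, since that information is already carried by the fitted probabilities.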

23
Summary of results

Numbers of differentially expressed genes, and
estimates of $\pi_0$

24
Bayes versus p-values: gene order

Percentage agreement on rankings of first n
genes
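
A comparison of this kind can be computed as the overlap between the first n genes of two rankings. The definition below (shared genes as a percentage of n, ignoring order within the first n) is an assumption, since the slide does not state how agreement was measured; the function name and gene labels are hypothetical.

```python
def top_n_agreement(ranking_a, ranking_b, n):
    """Percentage of genes common to the first n entries of two rankings."""
    if n <= 0:
        raise ValueError("n must be positive")
    shared = set(ranking_a[:n]) & set(ranking_b[:n])
    return 100.0 * len(shared) / n


# Hypothetical gene orderings from two methods.
by_p_value = ["g1", "g2", "g3", "g4"]
by_bayes = ["g2", "g1", "g4", "g3"]
agreement = [top_n_agreement(by_p_value, by_bayes, n) for n in (1, 2, 4)]
# agreement == [0.0, 100.0, 100.0]
```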

25
Conclusions

Per-gene variances resulted in a large number of
genes reported as differentially expressed under
adaptive FDR control.
Use of the SAM procedure radically reduced this
number through shrinkage estimation.
This removes the problem of small variances making
small fold-changes significant.
Bayesian model also used shrinkage estimators and
adaptive FDR control, but detected more (and
different) genes as differentially expressed.
Simulations support superiority of Bayesian
procedure assuming model is correctly specified.

26
Current and future work

Likelihood-based method which can control the
actual (rather than average) proportion of false
discoveries for a given set of rejections.
e.g., for a list of differentially expressed
genes, the probability that fewer than 10% of
these are false positives is at least 95%.
Applying the Bayesian analysis approach to the
problem of identifying differentially regulated
pathways.
Extension of the work of Mootha et al. (2003).

27
Acknowledgements