Title: Controlling the FDR in the Analysis of Genetic Expression
1Controlling the FDR in the Analysis of Genetic
Expression
Anat Reiner Tel-Aviv University
2Outline
- Correlated test statistics
3The False Discovery Rate (FDR) criterion
Benjamini and Hochberg (1995)
- R = number of rejected hypotheses (discoveries)
- V = number of false discoveries
- The type I error in the entire study is measured by
$FDR = E(Q)$, where $Q = V/R$ if $R > 0$ and $Q = 0$ if $R = 0$,
i.e. the expected proportion of false discoveries among
the discoveries (0 if none found).
4FDR controlling procedures
- Linear Step-Up Procedure
- (Benjamini and Hochberg, 1995)
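The linear step-up procedure sorts the m p-values, finds the largest k with p_(k) <= (k/m) q, and rejects the k hypotheses with the smallest p-values. A minimal sketch, assuming NumPy; the function name `bh_step_up` and the example p-values are illustrative only:

```python
import numpy as np

def bh_step_up(pvals, q=0.05):
    """Return a boolean array: True where the hypothesis is rejected."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                        # indices of p-values, smallest first
    thresholds = q * np.arange(1, m + 1) / m     # (i/m) * q for i = 1..m
    passed = np.nonzero(p[order] <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if passed.size:
        k = passed[-1]                           # largest i with p_(i) <= (i/m) q
        reject[order[:k + 1]] = True             # reject everything up to that rank
    return reject

# Illustrative p-values (made up for this sketch)
print(bh_step_up([0.02, 0.03, 0.035, 0.5], q=0.05))
```

Note the step-up behaviour: the two smallest p-values miss their own thresholds, but all three are rejected because the third one passes.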
5FDR Adjusted P-Values
- For an individual hypothesis, the FDR adjusted p-value is the smallest
FDR level q at which that hypothesis is rejected.
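For the linear step-up procedure, the adjusted p-value of the hypothesis ranked i is the minimum over k >= i of m p_(k) / k, capped at 1. A minimal sketch, assuming NumPy; `bh_adjusted` is an illustrative name:

```python
import numpy as np

def bh_adjusted(pvals):
    """BH (linear step-up) adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)          # m * p_(k) / k
    # running minimum taken from the largest p-value downwards
    adj_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty(m)
    adj[order] = np.minimum(adj_sorted, 1.0)             # cap at 1
    return adj

print(bh_adjusted([0.02, 0.03, 0.035, 0.5]))
```

Rejecting every hypothesis whose adjusted p-value is at most q gives the same decisions as running the step-up procedure at level q.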
6Data inter-dependencies
- Between genes
  - co-regulation
- Between measurement errors of expression levels
  - spatial effects
  - RNA source
  - normalization process
  - pooled variability estimation
Multiple testing of such data will produce
correlated test statistics!
7Resampling Idea
- Create B data sets by permuting the subjects' order
(mixing treatment and control).
- Underlying assumption: the joint distribution of
p-values corresponding to the true null
hypotheses, which is generated through the
p-value resampling scheme, represents the real
joint distribution under the null hypothesis.
8- For each value of p, the number of
resampling-based p-values less than p, denoted by
V(p), is an estimate of the expected number of
p-values less than p that correspond to true null
hypotheses.
- Since the FDR is also a function of the number of
false null hypotheses being rejected, estimate
conservatively the number of false null
hypotheses with p-values less than p.
9- FDR = E(V/R)
- R = V + S, where S is the number of false null hypotheses rejected
- The number of resampling-based p-values less than p,
V(p), is an estimate of the expected number of
p-values less than p that correspond to true null
hypotheses.
- Estimate conservatively the number of false null
hypotheses with p-values less than p.
10- Then conservatively estimate the FDR adjustment;
two estimators are suggested:
- The FDR local estimator: conservative on the mean
- The FDR upper limit: bounds the FDR with
probability 95%.
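The resampling idea can be illustrated with a simplified plug-in estimate at a fixed threshold p. This is a sketch only and not the exact local or upper-limit estimator of the cited procedures: the name `s_hat`, the quantile level `beta`, and the choice to estimate the number of false nulls as the observed count minus an upper quantile of the resampled counts are all assumptions made for illustration.

```python
import numpy as np

def resampling_fdr_estimate(p_obs, p_null, p, beta=0.95):
    """p_obs: (m,) observed p-values; p_null: (B, m) resampled null p-values."""
    r = np.sum(np.asarray(p_obs) <= p)                  # observed number of rejections at p
    v_star = np.sum(np.asarray(p_null) <= p, axis=1)    # V*(p) in each resampled data set
    # assumed conservative count of false nulls: observed count minus an upper quantile of V*
    s_hat = max(r - np.quantile(v_star, beta), 0.0)
    denom = v_star + s_hat
    # V* / (V* + s_hat), taken as 0 where nothing would be rejected
    ratio = np.divide(v_star, denom, out=np.zeros(len(v_star)), where=denom > 0)
    return ratio.mean()

# Synthetic illustration: 20 strong effects among 100 hypotheses, 200 resamples
rng = np.random.default_rng(0)
p_null = rng.uniform(size=(200, 100))
p_obs = np.r_[np.full(20, 1e-6), rng.uniform(size=80)]
print(resampling_fdr_estimate(p_obs, p_null, p=0.01))
```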
11- BH Point Estimator
- Use the linear step-up procedure to control the
FDR.
- Instead of raw p-values, p-values are estimated
by resampling from the marginal distribution.
- For the k-th gene, with an observed test
statistic t_k, the p-value is estimated from the
resampled test statistics.
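A minimal sketch of estimating marginal p-values by permutation, assuming a two-group design, a Welch-type t statistic, and the common (count + 1) / (B + 1) convention. These choices and the names `welch_t` and `permutation_pvalues` are assumptions for illustration, not taken from the slides. The resulting p-values would then be passed to the linear step-up procedure.

```python
import numpy as np

def welch_t(x, y):
    """Two-sample t statistic per gene (rows = genes, columns = subjects)."""
    nx, ny = x.shape[1], y.shape[1]
    se = np.sqrt(x.var(axis=1, ddof=1) / nx + y.var(axis=1, ddof=1) / ny)
    return (x.mean(axis=1) - y.mean(axis=1)) / se

def permutation_pvalues(data, is_treated, n_perm=1000, seed=0):
    """Estimate each gene's two-sided p-value by permuting the group labels."""
    rng = np.random.default_rng(seed)
    is_treated = np.asarray(is_treated, dtype=bool)
    t_obs = np.abs(welch_t(data[:, is_treated], data[:, ~is_treated]))
    exceed = np.zeros(data.shape[0])
    for _ in range(n_perm):
        perm = rng.permutation(is_treated)              # mix treatment and control
        t_perm = np.abs(welch_t(data[:, perm], data[:, ~perm]))
        exceed += t_perm >= t_obs                       # count resampled |t| at least as extreme
    return (exceed + 1) / (n_perm + 1)

# Synthetic illustration: 50 genes, 8 treated and 8 control subjects, one shifted gene
rng = np.random.default_rng(1)
data = rng.normal(size=(50, 16))
labels = np.arange(16) < 8
data[0, labels] += 5.0
print(permutation_pvalues(data, labels)[:5])
```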
12Data: Lipid Metabolism Study (Yang et al.,
2001)
Reference (common control): a pool from 8
control mice
- Purpose
- Identify genes with altered expression
13 Original Data Statistics
14Study: Applying Multiple Comparison Procedures to
Microarray Data
Procedures Used
- Control FWE
  - Resampling-based procedure (Westfall et al., 1989)
  - Holm's procedure
- Control FDR
  - Linear step-up procedure (Benjamini and Hochberg, 1995)
  - Two resampling-based procedures (Yekutieli et al., 1999)
  - BH point estimator
15Adjusted P-Values: Original Data
16Simulation Study
- To obtain simulation data
1. Remove effects.
2. Shuffle experiment and control groups.
3. Add effects to 70 randomly selected genes.
4. Apply multiple testing procedures (100
iterations).
- Repeat steps 1-4 400 times
- Calculate the average FDR and power over the 400
simulations
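Averaging FDR and power over simulated data sets can be sketched as below, assuming each run yields a rejection indicator and the known indicator of genes with an added effect. The function names and the two tiny runs are illustrative only.

```python
import numpy as np

def fdp_and_power(reject, has_effect):
    """False discovery proportion and power for one simulated data set."""
    reject = np.asarray(reject, dtype=bool)
    has_effect = np.asarray(has_effect, dtype=bool)
    r = reject.sum()                          # discoveries
    v = np.sum(reject & ~has_effect)          # false discoveries
    s = np.sum(reject & has_effect)           # true discoveries
    fdp = v / r if r > 0 else 0.0             # 0 if nothing is discovered
    power = s / has_effect.sum()              # assumes at least one gene has an effect
    return fdp, power

def average_fdr_and_power(runs):
    """runs: iterable of (reject, has_effect) pairs, one per simulated data set."""
    results = np.array([fdp_and_power(r, h) for r, h in runs])
    return results.mean(axis=0)               # [average FDR, average power]

# Two made-up runs with four genes each
runs = [
    ([True, True, False, False], [True, False, False, True]),
    ([False, False, False, False], [True, False, False, False]),
]
print(average_fdr_and_power(runs))
```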
17Simulation: Differential Expression Patterns
$r \, n^{1/p} \, i^{-1/p}, \quad i = 1, \dots, n$
18Adjusted P-Values: Simulation Data
20Test Power
21Conclusions
- All four FDR controlling procedures retain higher
power than FWE controlling procedures.
- The choice among the four is a matter of buying
more power and better properties at the expense
of more complicated computations
22Conclusions (cont'd)
- A substantial increase in power is gained when
the p-values are estimated by resampling, and
then used in the linear step-up procedure.
- Still, if the software is available, the
researcher may be better off using the more
powerful resampling estimators.
23Correlated Test Statistics
Positive Dependency (Benjamini and Yekutieli, 2001;
Yekutieli, 2002).
- The linear step-up procedure controls the FDR for
positively dependent test statistics.
- This condition is satisfied by
- positively correlated one-sided normal and t
test statistics.
- absolute values of normal and t test
statistics, when all null hypotheses are true.
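The claim for positively correlated one-sided normal statistics can be checked by simulation. A minimal sketch, assuming equicorrelated unit-variance normal test statistics built from one shared component; the parameter names and defaults (`rho`, `m`, `m0`, `shift`) are assumptions chosen for illustration.

```python
import math
import numpy as np

def bh_reject(p, q):
    """Linear step-up procedure: boolean rejection array."""
    m = p.size
    order = np.argsort(p)
    passed = np.nonzero(p[order] <= q * np.arange(1, m + 1) / m)[0]
    reject = np.zeros(m, dtype=bool)
    if passed.size:
        reject[order[:passed[-1] + 1]] = True
    return reject

def simulated_fdr(rho, m=50, m0=40, shift=3.0, q=0.05, n_sim=2000, seed=0):
    """Monte Carlo FDR of BH for one-sided tests with common correlation rho."""
    rng = np.random.default_rng(seed)
    upper_tail = np.vectorize(lambda z: 0.5 * math.erfc(z / math.sqrt(2.0)))
    mu = np.r_[np.zeros(m0), np.full(m - m0, shift)]    # first m0 hypotheses are true nulls
    fdp = np.empty(n_sim)
    for i in range(n_sim):
        shared = rng.normal()                           # component common to all statistics
        z = math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.normal(size=m) + mu
        reject = bh_reject(upper_tail(z), q)            # one-sided p-values
        r = reject.sum()
        fdp[i] = reject[:m0].sum() / r if r > 0 else 0.0
    return fdp.mean()

for rho in (0.0, 0.5, 0.9):
    print(rho, round(simulated_fdr(rho), 3))
```

Up to simulation noise, the estimates should stay at or below (m0/m) q, which is 0.04 with these defaults.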
24Correlated Test Statistics Simulation Study
[Figure panels: X1 vs. X2; abs(X1) vs. abs(X2)]
25FDR vs. ρ², m = 2
26FDR Deviation vs. ρ (m = 2)
27Joint Distribution of X2, X1: FDR Areas
28FDR vs. ρ²
29FDR vs. ρ²
30FDR vs. ρ², m = 3
31FDR Deviation vs. ρ (m = 3)
32FDR Deviation vs. ρ (m = 4)
33FDR Deviation vs. ρ (m = 6)
34FDR for General m and corr. = 1
- Consider a set of m p-values
- m0 of them correspond to the subset of true null
hypotheses
- m1 = m - m0 correspond to the subset of false null
hypotheses
- If the correlation is 1, all p-values in each subset
are identical
- represent these m p-values by two p-values,
- with respective weights w0 = m0, w1 = m1.
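With only two distinct p-values, the linear step-up procedure has three possible outcomes: reject everything, reject only the block with the smaller p-value, or reject nothing. A minimal sketch of that reduction; the symbols `p0` and `p1` for the two p-values and the function name are assumptions for illustration.

```python
def bh_two_groups(p0, p1, m0, m1, q):
    """BH outcome when m0 true nulls share p-value p0 and m1 false nulls share p1.

    Returns (V, S): the numbers of rejected true and false null hypotheses.
    """
    m = m0 + m1
    if max(p0, p1) <= q:                  # rank m passes its threshold q: reject all
        return m0, m1
    if p1 <= p0 and p1 <= m1 * q / m:     # only the false-null block passes, at rank m1
        return 0, m1
    if p0 < p1 and p0 <= m0 * q / m:      # only the true-null block passes, at rank m0
        return m0, 0
    return 0, 0

# Made-up values: 8 true nulls, 2 false nulls, q = 0.05
print(bh_two_groups(0.2, 0.005, 8, 2, 0.05))
print(bh_two_groups(0.03, 0.2, 8, 2, 0.05))
```

The second call shows the unfavourable case: only true nulls are rejected, so the false discovery proportion for that realization is 1.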
35FDR for General m BH proc.
36FDR for General m LF Case
37Maximal FDR and FDR Deviation vs. m0 / m
38FDR in Complex Study
- Family of Hypotheses (Westfall and Young, 1993)
- Questions asked form a natural and coherent unit
- All tests are considered simultaneously
- Probable that many or all hypotheses are true
39FDR in Complex Study
Family of Hypotheses (Westfall and Young, 1993)
Should FDR be directly controlled for all of the
hypotheses in the study?
40FDR in Complex Study
- Suggestions
- Direct approach: use the scalability of FDR.
- Select subsets using statistics that are
independent from step to step.
41FDR in Complex Study
- Suggestions
- Organize families in hierarchical tree structure
and use appropriate FDR controlling procedures.
42Gene expression relative to behavioral markers
- Purpose
- Identify changes in gene expression related to
behavioral effects of opioids.
- Data: expression of 26,300 genes for 10 mouse
strains in 5 brain regions.
44Research Questions
- Identify a pool of genes distinguishing between
strains
- Pairwise comparisons of genes
- Test for significant interaction indicating
unusual level of expression in particular strain
by brain-region combinations.
- Correlating strain differences in gene expression
levels and behavior markers.
45Strain by brain-region interaction
- subset method
- 957 genes identified in the first stage at
thresh. 0.05
- 50,000 interactions tested in second stage
- only 13 interactions discovered
46Strain by brain-region interaction
- Hierarchical testing scheme
- use thresh. 0.017
- 758 genes are selected in the first stage
- 76 interactions discovered
47Correlation Analysis
- subset method
- Use thresh. 0.025 in 1st stage, 0.05 in 2nd stage
- 225 triplicates discovered
48Correlation Analysis
- Hierarchical testing scheme
- Use thresh. 0.025 in each stage
- 230 triplicates discovered
49Tree: 1st stage 0.025, 2nd stage 0.025
50Subset: 1st stage 0.025, 2nd stage 0.05
51The two-staged procedure
- Benjamini, Krieger, Yekutieli (00)
- Use the BH at level q once, and get r1.
- Estimate m0 from r1 and m.
- Proved: FDR ≤ q under independence.
- Conjectured: FDR ≤ q under positive dependency.
52Further Procedures
- Recall that for the BH procedure FDR ≤ (m0/m) q
- Hence estimate m0, and
- use q* = q m/m0 instead of q in BH
- The adaptive procedure: Benjamini and Hochberg (89/00)
- The two-stage procedure: Benjamini, Krieger, Yekutieli (00)
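The two-stage idea can be sketched as follows: run BH once, estimate m0 from the number of rejections, then rerun BH at the inflated level q m / m0. This sketch follows the form later published by Benjamini, Krieger and Yekutieli (2006), where the first stage runs at q/(1+q) and m0 is estimated by m - r1. Both details are assumptions relative to the slide, which runs the first stage at level q.

```python
import numpy as np

def bh_count(p_sorted, q):
    """Number of rejections of the linear step-up procedure on sorted p-values."""
    m = p_sorted.size
    passed = np.nonzero(p_sorted <= q * np.arange(1, m + 1) / m)[0]
    return passed[-1] + 1 if passed.size else 0

def two_stage_bh(pvals, q=0.05):
    """Two-stage adaptive BH: boolean rejection array in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    p_sorted = p[order]
    q1 = q / (1 + q)                      # first-stage level (assumed, 2006 form)
    r1 = bh_count(p_sorted, q1)
    if r1 == 0 or r1 == m:
        k = r1                            # reject nothing, or everything: stop here
    else:
        m0_hat = m - r1                   # estimated number of true nulls
        k = bh_count(p_sorted, q1 * m / m0_hat)
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

# Made-up p-values for illustration
pvals = [0.001, 0.002, 0.003, 0.004, 0.03, 0.04, 0.2, 0.5, 0.7, 0.9]
print(two_stage_bh(pvals, q=0.05).sum())
```

On these made-up p-values the first stage rejects four hypotheses, and the second stage, run at the larger level, rejects six.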