Title: Test of significance for small samples
1 Test of significance for small samples
Javier Cabrera, Director, Biostatistics Institute, Rutgers University
Dhammika Amaratunga, Johnson & Johnson Pharmaceutical Research & Development
2 Outline
- Microarray experiments and differential expression
- Small sample size issues
- Conditional t approach
- Comparison with other methods
- Extensions
- Reference: Exploration and Analysis of DNA Microarray and Protein Array Data. Wiley, 2004. Amaratunga, Cabrera.
- Software: DNAMR and DNAMRweb
- http://www.rci.rutgers.edu/cabrera/DNAMR
4 Microarray experiment
- Microarray: cDNA or oligonucleotide preparation; print or synthesize on a glass slide; 5k-50k genes arrayed in a rectangular grid, one spot per gene.
- Sample: biological sample; mRNA; reverse transcribe and label.
- Hybridize, wash and scan; image; quantify spot intensities; gene expression data.
5 Differential gene expression
- An organism's genome is the complete set of genes in each of its cells. Given an organism, every one of its cells has a copy of the exact same genome, but
  - different cells express different genes
  - different genes express under different conditions
  - differential gene expression leads to altered cell states
6 Differential Expression for small samples
Gene     C1     C2     C3     T1     T2     T3
G1      4.67   4.44   4.42   4.73   4.85   4.69
G2      3.13   2.54   1.96   0.97   2.38   3.36
G3      6.22   6.77   5.32   6.40   6.94   6.87
G4     10.74  10.81  10.69  10.75  10.68  10.68
G5      3.76   4.16   5.27   3.05   3.20   2.85
G6      6.95   6.78   6.33   6.81   6.95   7.01
G7      4.98   4.61   4.56   4.57   4.90   4.44
G8      2.72   3.30   3.24   3.22   3.42   3.22
G9      5.29   4.79   5.13   3.31   4.67   5.27
G10     5.12   4.85   3.79   4.13   3.12   4.79
G11     4.67   3.50   4.77   4.09   3.86   2.88
G12     6.22   6.42   5.02   6.38   6.54   6.80
G13     2.88   3.76   2.78   2.98   4.81   4.15
.......
- Preprocessed data.
- Perform a t-test for each gene.
- Select the most significant subset.
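The three steps above can be sketched on the first four genes of the table. This is a minimal illustration, assuming the first three columns are controls and the last three are treatments, and using the standard pooled-variance t statistic; the function name `pooled_t` is hypothetical. It ranks genes by |t| only; turning t into a p-value would additionally need the Student t distribution with n1 + n2 - 2 degrees of freedom.

```python
import numpy as np

# First four genes from the table: columns C1 C2 C3 T1 T2 T3.
genes = ["G1", "G2", "G3", "G4"]
x = np.array([
    [4.67, 4.44, 4.42, 4.73, 4.85, 4.69],
    [3.13, 2.54, 1.96, 0.97, 2.38, 3.36],
    [6.22, 6.77, 5.32, 6.40, 6.94, 6.87],
    [10.74, 10.81, 10.69, 10.75, 10.68, 10.68],
])

def pooled_t(x, n_ctrl):
    """Per-gene pooled-variance t statistic and pooled standard deviation."""
    ctrl, trt = x[:, :n_ctrl], x[:, n_ctrl:]
    n1, n2 = ctrl.shape[1], trt.shape[1]
    # Pooled variance: within-group variances weighted by their degrees of freedom.
    sp2 = ((n1 - 1) * ctrl.var(axis=1, ddof=1)
           + (n2 - 1) * trt.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    sp = np.sqrt(sp2)
    t = (trt.mean(axis=1) - ctrl.mean(axis=1)) / (sp * np.sqrt(1 / n1 + 1 / n2))
    return t, sp

t, sp = pooled_t(x, n_ctrl=3)
# Rank genes from largest to smallest |t|.
for i in np.argsort(-np.abs(t)):
    print(f"{genes[i]}: t = {t[i]:.2f}, sp = {sp[i]:.3f}")
```

With only three arrays per group, each sp rests on four degrees of freedom, which is why the later slides look at t and sp jointly.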
7 The pooled variances T-test
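The formula on this slide did not survive in the transcript. For reference, the standard pooled-variance t statistic for gene g is written out here; the notation ($\bar{X}_{g\cdot j}$ and $s_{gj}^2$ for the mean and sample variance of gene g in group j, $n_j$ for the group size) is an assumption chosen to match the rest of the talk, not necessarily the authors' own.

```latex
t_g = \frac{\bar{X}_{g\cdot 2} - \bar{X}_{g\cdot 1}}
           {s_{p,g}\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_{p,g}^2 = \frac{(n_1 - 1)\,s_{g1}^2 + (n_2 - 1)\,s_{g2}^2}{n_1 + n_2 - 2}
```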
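8 Plot of $t$ vs $s_p$; distribution of $s_p$
[Figure: $t$ plotted against $s_p$, with the distribution of $s_p$; gene counts shown: 300 and 21983]
Differentially expressed genes have smaller $s_p$.
Is this effect statistical or biological?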
9 Simulation
500 simulations; 1000 genes; 4 controls, 4 treatments; iid Normal(0, $\sigma^2$).
100 genes are differentially expressed with mean difference 1 or -1.

$\sigma^2 = 1$ (constant), $\alpha = 0.05$
          False Discoveries   True Discoveries
T-test           44                  22
z-test           43                  29

$\sigma^2$ from Chi-square(df = 3), $\alpha = 0.05$
          False Discoveries   True Discoveries
T-test           43                  28
z-test           53                  13
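A single run of a simulation in this spirit can be sketched as follows. Several details are assumptions, since the slide does not spell them out: the z-test here plugs in one common variance (the average of the pooled variances across genes), the critical values are hard-coded for 6 degrees of freedom and the standard normal, and the function name `one_run` is hypothetical. The counts from one run will not match the table, which summarises 500 simulations.

```python
import numpy as np

T_CRIT = 2.447  # two-sided 5% critical value of Student t with 4 + 4 - 2 = 6 df
Z_CRIT = 1.960  # two-sided 5% critical value of the standard normal

def one_run(rng, sigma2, n=4, n_de=100):
    """One simulated experiment; returns (false, true) discovery counts per test."""
    n_genes = sigma2.size
    sd = np.sqrt(sigma2)[:, None]
    ctrl = rng.normal(size=(n_genes, n)) * sd
    trt = rng.normal(size=(n_genes, n)) * sd
    shift = np.zeros(n_genes)
    shift[:n_de] = rng.choice([-1.0, 1.0], size=n_de)  # mean difference 1 or -1
    trt += shift[:, None]

    diff = trt.mean(axis=1) - ctrl.mean(axis=1)
    sp2 = (ctrl.var(axis=1, ddof=1) + trt.var(axis=1, ddof=1)) / 2
    scale = np.sqrt(2.0 / n)
    t = diff / (np.sqrt(sp2) * scale)
    # Assumption: the z-test uses one common variance, the average of sp^2.
    z = diff / (np.sqrt(sp2.mean()) * scale)

    is_de = shift != 0
    out = {}
    for name, stat, crit in (("t", t, T_CRIT), ("z", z, Z_CRIT)):
        reject = np.abs(stat) > crit
        out[name] = (int((reject & ~is_de).sum()), int((reject & is_de).sum()))
    return out

rng = np.random.default_rng(0)
constant = one_run(rng, np.ones(1000))
varying = one_run(rng, rng.chisquare(3, size=1000))
print("constant variance:", constant)
print("chi-square(3) variance:", varying)
```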
10 The effect of small sample size
- Often the sample size per group is small.
- => unreliable variances (inferences)
- => dependence between the test statistics ($t_g$) and the standard error estimates ($s_g$)
- => borrow strength across genes (LPE/EB)
- => regularize the test statistics (SAM)
- => work with $t_g \mid s_g$ (Conditional t).
11 Analysis results
Top 10 genes (sorted by t-test p-value)
Gene     Fold   Dir  p         p(Bonf)
G6546    2.36   D    0.000004  0.0964
G19945   3.25   U    0.000005  0.1102
G21586   1.64   U    0.000008  0.1765
G18970   2.52   U    0.000019  0.4220
G7432    3.70   D    0.000033  0.7248
G19057   1.85   U    0.000046  1.0000
G17361   4.34   D    0.000067  1.0000
G8525    5.57   D    0.000067  1.0000
G425    18.11   D    0.000078  1.0000
G8524    4.74   D    0.000109  1.0000
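The p(Bonf) column is a Bonferroni adjustment: each p-value is multiplied by the number of tests and capped at 1. A small sketch follows. The number of tests, 22283, is an assumption (it is the sum of the two gene counts shown on the earlier t-vs-sp slide; the talk does not state it). Because the p-values in the table are rounded, the adjusted values computed here will not reproduce the column exactly.

```python
def bonferroni(p_values, n_tests):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    return [min(1.0, p * n_tests) for p in p_values]

# Rounded p-values of the first and sixth genes in the table (G6546, G19057).
# n_tests = 22283 is an assumed value, not one stated on the slide.
adjusted = bonferroni([0.000004, 0.000046], n_tests=22283)
print(adjusted)
```

With tens of thousands of genes, only p-values of a few millionths survive the adjustment, which is the practical difficulty with per-gene t-tests on small samples.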
12 SAM: Determining c
For each $\alpha$:
- $v_1(\alpha), \dots, v_7(\alpha)$: mad of $T_g$
- $cv(\alpha)$
Candidates: $cv(\alpha_1)$ with $s_1$, ..., $cv(\alpha_7)$ with $s_7$; take the Min.
[Figure: $T_g$ plotted against $s_g$]
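The diagram appears to follow the usual SAM recipe for the constant c in T_g = diff_g / (s_g + c): for each candidate, compute the MAD of T_g within bins of s_g, take the coefficient of variation of those MADs, and keep the candidate with the smallest value. The sketch below is one reading of that recipe, not the SAM software itself; the seven bins and seven candidates mirror the labels on the slide, and the function names and simulated inputs are hypothetical.

```python
import numpy as np

def mad(x):
    """Median absolute deviation from the median."""
    return np.median(np.abs(x - np.median(x)))

def choose_fudge(diff, s, n_candidates=7, n_bins=7):
    """Pick c in T_g = diff_g / (s_g + c) so that the spread of T_g is most
    even across bins of s_g (smallest coefficient of variation of the MADs)."""
    candidates = np.quantile(s, np.linspace(0.0, 1.0, n_candidates))
    edges = np.quantile(s, np.linspace(0.0, 1.0, n_bins + 1))
    # Bin index of each gene by its s_g; the largest s_g goes into the last bin.
    bins = np.clip(np.searchsorted(edges, s, side="right") - 1, 0, n_bins - 1)
    cvs = []
    for c in candidates:
        t = diff / (s + c)
        v = np.array([mad(t[bins == b]) for b in range(n_bins)])
        cvs.append(v.std(ddof=1) / v.mean())
    cvs = np.array(cvs)
    return candidates[int(np.argmin(cvs))], cvs

# Simulated stand-in data: 2000 null genes, 4 arrays per group.
rng = np.random.default_rng(1)
sigma = np.sqrt(rng.chisquare(3, size=2000))
diff = rng.normal(size=2000) * sigma * np.sqrt(0.5)
s = sigma * np.sqrt(rng.chisquare(6, size=2000) / 6) * np.sqrt(0.5)
c, cvs = choose_fudge(diff, s)
print("chosen c:", c)
```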
13 SAM: Gene selection
[Figure: D plotted against the expected value of D under permutations]
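14 Conditional t: Basic Model
- Let $X_{gij}$ denote the preprocessed intensity measurement for gene g in array i of group j.
- Model: $X_{gij} = \mu_{gj} + \sigma_g \epsilon_{gij}$
- Effect of interest: $\tau_g = \mu_{g2} - \mu_{g1}$
- Error model: $\epsilon_{gij} \sim F(\text{location} = 0, \text{scale} = 1)$
- Gene mean-variance model: $(\mu_{g1}, \sigma_g^2) \sim F_{\mu,\sigma}$ with marginals $\mu_{g1} \sim F_\mu$ and $\sigma_g^2 \sim F_\sigma$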
15 Possible approaches
Parametric: Assume functional forms for $F$ and $F_{\mu,\sigma}$ and apply either a Bayes or Empirical Bayes procedure.
Nonparametric:
1. [formula not transcribed] or [formula not transcribed]. For small samples, [formula not transcribed] is not a good estimator of $F_\sigma$. Use method of moments / target estimation.
2. Proceed via resampling and estimate the distribution of $t \mid s_p$ (Conditional t).
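The resampling idea in item 2 can be sketched as follows: simulate many null genes, record each pair (t, sp), and read off a critical value for |t| separately within bins of sp. This is an illustration under stated assumptions, not the authors' procedure (the procedure slides carry no text): errors are taken to be N(0, 1), variances are drawn from a user-supplied pool standing in for an estimate of F_sigma, and the binning and function names are hypothetical.

```python
import numpy as np

def conditional_critical_values(sigma2_pool, n=4, alpha=0.05,
                                n_sim=50_000, n_bins=20, seed=0):
    """Simulate null (t, sp) pairs and return sp bin edges together with the
    (1 - alpha) quantile of |t| within each sp bin."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(rng.choice(sigma2_pool, size=n_sim))[:, None]
    g1 = rng.normal(size=(n_sim, n)) * sd
    g2 = rng.normal(size=(n_sim, n)) * sd
    sp = np.sqrt((g1.var(axis=1, ddof=1) + g2.var(axis=1, ddof=1)) / 2)
    t = (g2.mean(axis=1) - g1.mean(axis=1)) / (sp * np.sqrt(2.0 / n))
    edges = np.quantile(sp, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, sp, side="right") - 1, 0, n_bins - 1)
    crit = np.array([np.quantile(np.abs(t[bins == b]), 1 - alpha)
                     for b in range(n_bins)])
    return edges, crit

def conditional_t_reject(t_obs, sp_obs, edges, crit):
    """Reject when |t| exceeds the critical value of the bin its sp falls in."""
    bins = np.clip(np.searchsorted(edges, sp_obs, side="right") - 1,
                   0, crit.size - 1)
    return np.abs(t_obs) > crit[bins]

# Assumed variance pool: chi-square(3) draws standing in for estimated variances.
pool = np.random.default_rng(1).chisquare(3, size=1000)
edges, crit = conditional_critical_values(pool)
print("critical |t| in the lowest and highest sp bins:", crit[0], crit[-1])
```

A gene with a small sp must clear a higher bar than one with a large sp, which is the dependence between t and sp that the method conditions on.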
16 Procedure
17 Procedure (cont.)
18 Roadblock
Let $X_{ij}$ be a sample from the model with $\sigma^2 \sim F_\sigma$, and let the variance obtained from the $X_{ij}$ be $s^2$. Then $\mathrm{Var}(s^2) > \mathrm{Var}(\sigma^2)$.
For example, if we assume that $F_\sigma = \chi^2_3$, $n = 4$ and $\epsilon \sim N(0,1)$, then $\mathrm{Var}(\sigma^2) = 6$ and $\mathrm{Var}(s^2) = 15$.
Fix by target estimation: method of moments; shrink towards the center.
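The inequality follows from the law of total variance. The step below assumes that $s^2$ is unbiased for $\sigma^2$ given $\sigma^2$, which holds for the usual sample variance; the inequality is strict whenever $s^2$ is not a deterministic function of $\sigma^2$.

```latex
\mathrm{Var}(s^2)
  = \mathrm{Var}\!\big(E[s^2 \mid \sigma^2]\big)
    + E\!\big[\mathrm{Var}(s^2 \mid \sigma^2)\big]
  = \mathrm{Var}(\sigma^2)
    + E\!\big[\mathrm{Var}(s^2 \mid \sigma^2)\big]
  > \mathrm{Var}(\sigma^2)
```

The observed variances are therefore more spread out than the true ones, which is why they are shrunk towards the center before resampling.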
19 Example: Checking for the distribution of $\sigma_g$
Mice data. Compare the distribution of $s_g$ vs simulation with
1. Df = 0.5
2. Df = 2
3. Df = 6
20 Another Example
Compare the distribution of $s_g$ vs simulation with
- Df = 0.5
- Df = 3
- Df = 6
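A comparison of this kind can be sketched numerically instead of graphically. The code below simulates s_g under chi-square variance distributions with different degrees of freedom and measures how far each is from an observed set of s_g. Everything here is an assumption for illustration: the errors are N(0, 1), the "observed" values are themselves simulated stand-ins (the real data are not in the transcript), and the distance (mean gap between log quantiles after removing the median) is a hypothetical choice that ignores overall scale.

```python
import numpy as np

def simulated_s(df, n_genes, n=4, seed=0):
    """Simulate s_g when sigma_g^2 ~ chi-square(df) and errors are N(0, 1)."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(rng.chisquare(df, size=n_genes))[:, None]
    x = rng.normal(size=(n_genes, n)) * sigma
    return x.std(axis=1, ddof=1)

def log_quantile_distance(s_obs, s_sim):
    """Mean absolute gap between median-centred log quantiles of two samples."""
    probs = np.linspace(0.05, 0.95, 19)
    q_obs = np.log(np.quantile(s_obs, probs)) - np.log(np.median(s_obs))
    q_sim = np.log(np.quantile(s_sim, probs)) - np.log(np.median(s_sim))
    return float(np.mean(np.abs(q_obs - q_sim)))

# Stand-in for observed data: simulated with df = 3 and a different seed.
s_obs = simulated_s(3, 5000, seed=42)
distances = {df: log_quantile_distance(s_obs, simulated_s(df, 5000, seed=7))
             for df in (0.5, 3, 6)}
print(distances)
```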
21 Fixing the variance distribution
22 Fixing the variance distribution (contd)
Proceed as before
23 Plot of $t$ vs $s_p$
Differentially expressed genes may have large $s_p$.
[Figure: $t$ plotted against $s_p$; gene counts shown: 191 and 22092]
24 Simulation
500 simulations; 1000 genes; 4 controls, 4 treatments; iid Normal(0, $\sigma^2$).
100 genes are differentially expressed with mean difference 1 or -1.

$\sigma^2 = 1$ (constant)
          False Discoveries   True Discoveries
T-test           44                  22
z-test           43                  29
C-t              45                  30

$\sigma^2$ from Chi-square(df = 3)
          False Discoveries   True Discoveries
T-test           43                  28
z-test           53                  13
C-t              42                  38
25 Using 8 iid samples from the Khan data, we make changes to 50 genes to make them differentially expressed for high level.
[Figure panels: T-test, SAM, Ct]
26 Generating p-values
27 Extensions
- F test
  - Condition on the sqrt(MSE)
- Multiple comparisons: Tukey, Dunnett, Bump.
  - Condition on the sqrt(MSE)
- Gene Ontology
  - Test for the significance of groups.
  - Use Hypergeometric Statistic, mean t, mean p-value, or other.
  - Condition on log of the number of genes per group
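As an illustration of the hypergeometric statistic for a gene group, the sketch below computes the upper-tail probability of seeing at least k group members among the selected genes. The function name and the example numbers are hypothetical; the conditioning on the log of the group size is not implemented here.

```python
from math import comb

def hypergeom_upper_tail(k, n_group, n_selected, n_total):
    """P(X >= k) when n_selected genes are drawn without replacement from
    n_total genes, of which n_group belong to the gene group."""
    hi = min(n_group, n_selected)
    total = comb(n_total, n_selected)
    hits = sum(comb(n_group, x) * comb(n_total - n_group, n_selected - x)
               for x in range(k, hi + 1))
    return hits / total

# Hypothetical numbers: 1000 genes on the array, 50 in the group,
# 100 genes selected as significant, 12 of those in the group.
p = hypergeom_upper_tail(12, n_group=50, n_selected=100, n_total=1000)
print(p)
```

A small p suggests the group holds more selected genes than chance alone would put there.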
28 Conditional F
29 GO Ontology: Conditioning on log(n)
[Figure: Abs(T) plotted against Log(n)]
30 The Details
- Reference: Exploration and Analysis of DNA Microarray and Protein Array Data. Wiley, Jan 2004. Amaratunga, Cabrera.
- Email
  - cabrera_at_stat.rutgers.edu
  - damaratu_at_prdus.jnj.com
- Webpage for DNAMR and DNAMRweb
  - http://www.rci.rutgers.edu/cabrera/DNAMR
31 Target Estimation
- Target Estimation: Cabrera, Fernholz (1999)
  - Bias reduction.
  - MSE reduction.
- Recent applications
  - Ellipse estimation (multivariate target).
  - Logistic regression
    - Cabrera, Fernholz, Devas (2003)
    - Patel (2003): Target Conditional MLE (TCMLE)
    - Implementation in StatXact (CYTEL) and logXact Procs in SAS (by CYTEL).
32 Target Estimation
33 Target Estimation
Algorithms
- Stochastic approximation.
- Simulation and iteration.
- Exact algorithm for TCMLE
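The simulation-and-iteration idea can be sketched on a toy problem. The reading of target estimation assumed here is: given an estimator T, choose the parameter value theta for which the expected value of T under theta equals the observed value of T. The example uses the variance estimator with divisor n, whose bias is known, so the answer can be checked against n/(n - 1) times the observed value. The function names, the normal model, and the simple fixed-point update are illustrative assumptions, not the published algorithm.

```python
import numpy as np

def biased_variance(x):
    """Estimator T: the variance with divisor n (biased downward)."""
    return x.var(axis=-1, ddof=0)

def target_estimate(t_obs, n, n_sim=20_000, n_iter=50, seed=0):
    """Find theta with E_theta[T] = t_obs by simulation and iteration."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n_sim, n))  # common random numbers, reused at each step
    theta = t_obs
    for _ in range(n_iter):
        # Monte Carlo estimate of E_theta[T] for N(0, theta) samples of size n.
        expected_t = biased_variance(np.sqrt(theta) * z).mean()
        # Move theta until the expected value of T matches the observed value.
        theta += t_obs - expected_t
    return theta

theta_hat = target_estimate(t_obs=2.0, n=5)
print(theta_hat)
```

Reusing the same random draws at every step keeps the iteration deterministic, so it settles on a fixed point instead of wandering with the simulation noise.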