1
Design of Experiments
Panu Somervuo, March 20, 2007
  • Problem formulation
  • Setting up the experiment
  • Analysis of data

2
Problem formulation
  • what is the biological question?
  • how to answer that?
  • what is already known?
  • what information is missing?
  • problem formulation → model of the biological
    system

3
Setting up an experiment
  • what kind of data is needed to answer the
    question?
  • how to collect the data?
  • how much data is needed?
  • biological and technical replicates
  • pooling
  • how to carry out the experiment (sample
    preparation, measurements)?

4
Analysis of data
  • preprocessing
  • filtering, outlier removal
  • normalization
  • statistical model fitting
  • hypothesis testing
  • reporting the results, documentation

5
Everything depends on everything
  • problem formulation: model of the system
  • analysis of data: statistical tests
  • setting up the experiment: number of samples
6
Practical guidelines
  • blocking unwanted effects (e.g. dye effect)
  • randomization (avoid systematic bias by
    randomizing e.g. the order of sample
    preparations)
  • replication (replicate measurements can be
    averaged to reduce the effect of random errors)

[Figure: group1 and group2 samples labeled with cy3 and cy5]
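
The randomization and dye-blocking guidelines above can be sketched in R. This is only an illustration under assumptions that are not in the slides: twelve hypothetical samples (`s1` to `s12`), six per group, and the column names `prep_order` and `dye` are made up for the example.

```r
# Hypothetical sample sheet: 12 samples, 6 per group (not from the slides)
set.seed(1)  # fixed seed only so that the illustration is reproducible

samples <- data.frame(
  id    = paste0("s", 1:12),
  group = rep(c("group1", "group2"), each = 6)
)

# Randomization: prepare the samples in a random order
samples$prep_order <- sample(nrow(samples))

# Blocking the dye effect: within each group, half of the samples
# are labeled with cy3 and half with cy5, assigned at random
samples$dye <- NA_character_
for (g in unique(samples$group)) {
  idx <- which(samples$group == g)
  samples$dye[idx] <- sample(rep(c("cy3", "cy5"), length.out = length(idx)))
}

dye_table <- table(samples$group, samples$dye)
print(samples[order(samples$prep_order), ])
print(dye_table)
```

Because each group is split evenly between the two dyes, a dye effect cannot be mistaken for a group effect.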
7
log transform, normalization
y = µ + F1 + F2 + ... + error
8
Pairwise sample comparison vs modeling
  • pairwise sample comparison is easy and
    straightforward
  • instead of comparing samples as such, we can
    construct a model for the measurements and then
    perform comparisons

9
Mathematical model of data
  • try to capture the essence of a (biological)
    phenomenon in mathematical terms
  • here we concentrate on linear models: an
    observation consists of the effects of one or
    more factors plus random error
  • a factor may have several levels (e.g. the factor
    sex has two levels, male and female)

10
Examples of models
normalization, log transform
  • single factor
  • y = µ + gene + error
  • two factors
  • y = µ + treatment + gene + error
  • two factors including interaction term
  • y = µ + treatment + gene + treatment.gene + error
  • four factors
  • y = µ + treatment + gene + dye + array + error
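
As a rough sketch, the example models can be written as R model formulas. The variable names `y`, `gene`, `treatment`, `dye` and `array` mirror the slide; the list name `models` is hypothetical. No data are needed just to inspect which terms each formula contains.

```r
# The slide's example models in R formula notation
models <- list(
  single_factor = y ~ gene,
  two_factors   = y ~ treatment + gene,
  interaction   = y ~ treatment * gene,   # expands to main effects + interaction
  four_factors  = y ~ treatment + gene + dye + array
)

# List the terms of each model (the intercept µ is implicit in R)
term_labels <- lapply(models, function(f) attr(terms(f), "term.labels"))
print(term_labels)
```

In R, `treatment * gene` is shorthand for `treatment + gene + treatment:gene`, where `treatment:gene` corresponds to the treatment.gene interaction term of the slide.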

11
From model to experimental design
  • y = µ + drug + sex + drug.sex + error
  • factor 1, drug: 3 levels
  • factor 2, sex: 2 levels
  • → 3x2 factorial design

                M                        F
no treatment    y111, y112, y113, y114   y121, y122, y123, y124
treatment A     y211, y212, y213, y214   y221, y222, y223, y224
treatment B     y311, y312, y313, y314   y321, y322, y323, y324
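
One way to lay out such a 3x2 factorial design with four replicates per cell is `expand.grid`. This is a minimal sketch; the column names `replicate` and `label` are assumptions chosen to reproduce the y-indices of the table.

```r
# All combinations of drug (3 levels) x sex (2 levels) x 4 replicates
design <- expand.grid(
  replicate = 1:4,
  sex       = c("M", "F"),
  drug      = c("no treatment", "treatment A", "treatment B")
)

# Label each observation as y<drug index><sex index><replicate>
design$label <- with(design,
                     paste0("y", as.integer(drug), as.integer(sex), replicate))

print(table(design$drug, design$sex))  # 4 observations in every cell
print(head(design))
```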
12
Analysis of variance
  • ANOVA can be used to analyse factorial designs
  • y = µ + drug + sex + drug.sex + error

summary(aov(y ~ drug * sex, data = data))
            Df  Sum Sq Mean Sq F value    Pr(>F)
drug         2 2.86750 1.43375 51.3582 3.644e-08 ***
sex          1 1.26042 1.26042 45.1493 2.673e-06 ***
drug:sex     2 0.06583 0.03292  1.1791    0.3302
Residuals   18 0.50250 0.02792
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                M                    F
no treatment    1.0, 1.1, 0.9, 1.3   0.7, 0.5, 0.6, 0.8
treatment A     1.1, 1.2, 0.8, 1.3   0.7, 0.8, 0.6, 0.9
treatment B     2.1, 1.9, 1.7, 2.0   1.5, 1.3, 1.4, 1.1
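
The analysis can be reproduced from the table with a short R script. The slide does not show how `data` was built, so the data frame construction below is an assumption; the drug level names `0`, `A`, `B` follow the labels used in the Tukey output of this deck.

```r
# Measurements from the table, row by row (M then F within each drug level)
y <- c(1.0, 1.1, 0.9, 1.3,  0.7, 0.5, 0.6, 0.8,   # no treatment
       1.1, 1.2, 0.8, 1.3,  0.7, 0.8, 0.6, 0.9,   # treatment A
       2.1, 1.9, 1.7, 2.0,  1.5, 1.3, 1.4, 1.1)   # treatment B

data <- data.frame(
  y    = y,
  drug = factor(rep(c("0", "A", "B"), each = 8)),
  sex  = factor(rep(rep(c("M", "F"), each = 4), times = 3))
)

# Two-way ANOVA with interaction: y = µ + drug + sex + drug.sex + error
fit <- aov(y ~ drug * sex, data = data)
anova_table <- summary(fit)[[1]]
print(anova_table)
```

Both main effects are clearly significant, while the drug:sex interaction is not (p = 0.33), which suggests that the drug effect is similar in both sexes.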
13
Multiple pairwise comparisons
  • ANOVA tells us that at least one drug treatment
    has an effect; to find out which one, we perform
    all pairwise comparisons

                M                    F
no treatment    1.0, 1.1, 0.9, 1.3   0.7, 0.5, 0.6, 0.8
treatment A     1.1, 1.2, 0.8, 1.3   0.7, 0.8, 0.6, 0.9
treatment B     2.1, 1.9, 1.7, 2.0   1.5, 1.3, 1.4, 1.1

TukeyHSD(aov(y ~ drug * sex, data = data), "drug")
  Tukey multiple comparisons of means
    95% family-wise confidence level
    factor levels have been ordered

Fit: aov(formula = y ~ drug * sex, data = data)

$drug
      diff        lwr       upr
A-0 0.0625 -0.1507113 0.2757113
B-0 0.7625  0.5492887 0.9757113
B-A 0.7000  0.4867887 0.9132113
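
A self-contained sketch of the pairwise comparison follows. As before, the construction of `data` is an assumption, since the slides only show the call and its output.

```r
# Rebuild the data set from the table (assumed layout, see lead-in)
y <- c(1.0, 1.1, 0.9, 1.3,  0.7, 0.5, 0.6, 0.8,   # no treatment: M, F
       1.1, 1.2, 0.8, 1.3,  0.7, 0.8, 0.6, 0.9,   # treatment A:  M, F
       2.1, 1.9, 1.7, 2.0,  1.5, 1.3, 1.4, 1.1)   # treatment B:  M, F
data <- data.frame(
  y    = y,
  drug = factor(rep(c("0", "A", "B"), each = 8)),
  sex  = factor(rep(rep(c("M", "F"), each = 4), times = 3))
)

# Tukey's honest significant differences for the drug factor
fit   <- aov(y ~ drug * sex, data = data)
tukey <- TukeyHSD(fit, "drug")
print(tukey)

# Differences between drug-level means: A-0, B-0, B-A
drug_diffs <- tukey$drug[, "diff"]
```

The interval for A-0 contains zero, whereas the intervals for B-0 and B-A do not, so only treatment B differs from the others at the 95% family-wise level.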

14
Benefits of (good) models
  • after fitting the model to data, the model can be
    used to answer questions such as:
  • is there a dye effect?
  • is the difference in gene expression levels
    between two conditions statistically significant?
  • is there an interaction between gene and another
    factor?
  • simple pairwise sample comparisons cannot give
    answers to all of these questions simultaneously

y = µ + F1 + F2 + ... + error
15
What is a good model?
  • a good model allows us to get more detailed
    results
  • the best model and parametrization are
    application specific
  • simple vs complex model
  • y = µ + F1 + F2 + F3 + ... + error
  • there should be a balance between model
    complexity and the amount of data

                dye1               dye2
control         y111, y112, y113   y121, y122, y123
treatment A     y211, y212, y213   y221, y222, y223
treatment B     y311, y312, y313   y321, y322, y323
16
How does the number of samples affect the confidence
of our results?
  • measurement error is always present; see the
    example: self-self hybridization

17
How does the number of samples affect the confidence
of our results?
  • let's compute the mean (average) expression
    level of a gene
  • how accurate is this value?
  • variance(mean) = variance(error) / number of
    samples
  • samples from normal distribution (mean = 0, sd = 1)
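
The relation variance(mean) = variance(error) / number of samples can be checked with a small simulation. This is a sketch only: the sample sizes in `n_values` and the number of repetitions `n_sim` are arbitrary choices, not taken from the slides.

```r
set.seed(42)               # fixed seed so the illustration is reproducible
n_values <- c(1, 2, 5, 10, 50)
n_sim    <- 10000          # simulated experiments per sample size

# For each n: draw n values from N(0, 1), take their mean, repeat n_sim times,
# then measure how much the mean varies from experiment to experiment
simulated <- sapply(n_values, function(n) {
  var(replicate(n_sim, mean(rnorm(n, mean = 0, sd = 1))))
})

result <- data.frame(n = n_values,
                     simulated   = simulated,
                     theoretical = 1 / n_values)   # variance(error) / n
print(result)
```

The standard deviation of the mean therefore shrinks only with the square root of n: halving it requires four times as many samples.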

18
Theoretical sample size calculations
  • for each statistical test, there is a
    (test-specific) relation between
  • power of a test = 1 - probability(type II error)
  • significance level = probability(type I error)
  • error variance
  • mean difference needed to be detected
  • number of samples

19
                                   | actual situation: drug has effect                    | actual situation: drug has no effect
our conclusion: drug has effect    | correct conclusion (true positive), probability 1-β | type I error (false positive), probability α
our conclusion: drug has no effect | type II error (false negative), probability β       | correct conclusion (true negative), probability 1-α
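
The two error probabilities in the table can be estimated by simulation. The sketch below rests on assumptions not stated on this slide: 10 samples per group, unit standard deviation, a true difference of 1 when the drug has an effect, and an equal-variance two-sample t-test at the 0.05 level. The helper `reject` is hypothetical.

```r
set.seed(123)   # fixed seed so the illustration is reproducible
n_sim <- 2000   # number of simulated experiments
n     <- 10     # samples per group (assumed)

# Simulate one experiment and report whether "drug has effect" is concluded
reject <- function(delta) {
  control <- rnorm(n, mean = 0,     sd = 1)
  treated <- rnorm(n, mean = delta, sd = 1)
  t.test(control, treated, var.equal = TRUE)$p.value < 0.05
}

# Drug has no effect: every rejection is a false positive (type I error)
type1_rate <- mean(replicate(n_sim, reject(0)))

# Drug has an effect of 1 unit: the rejection rate estimates the power 1 - β
power_est <- mean(replicate(n_sim, reject(1)))

print(c(type1_rate = type1_rate, power = power_est, type2_rate = 1 - power_est))
```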
20
How many samples are needed to detect a sample mean
difference of 1 unit?
R function power.t.test
> power.t.test(delta = 1, power = 0.95, sd = 1, sig.level = 0.05)

     Two-sample t test power calculation

              n = 26.98922
          delta = 1
             sd = 1
      sig.level = 0.05
          power = 0.95
    alternative = two.sided

NOTE: n is number in each group
21
What is the power of the test when using 10 samples?
R function power.t.test
> power.t.test(n = 10, delta = 1, sd = 1, sig.level = 0.05)

     Two-sample t test power calculation

              n = 10
          delta = 1
             sd = 1
      sig.level = 0.05
          power = 0.5619846
    alternative = two.sided

NOTE: n is number in each group
22
How small a difference between sample means can we
detect using 10 samples?
R function power.t.test
> power.t.test(n = 10, power = 0.95, sd = 1, sig.level = 0.05)

     Two-sample t test power calculation

              n = 10
          delta = 1.706224
             sd = 1
      sig.level = 0.05
          power = 0.95
    alternative = two.sided

NOTE: n is number in each group
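
The three calculations can be run together; `power.t.test` solves for whichever of `n`, `delta` and `power` is left unspecified. The variable names below are illustrative only.

```r
# 1. Samples per group needed to detect a difference of 1 with power 0.95
needed_n  <- power.t.test(delta = 1, power = 0.95, sd = 1, sig.level = 0.05)

# 2. Power achieved with 10 samples per group for a difference of 1
power_10  <- power.t.test(n = 10, delta = 1, sd = 1, sig.level = 0.05)

# 3. Smallest difference detectable with 10 samples per group at power 0.95
min_delta <- power.t.test(n = 10, power = 0.95, sd = 1, sig.level = 0.05)

summary_values <- c(n     = needed_n$n,
                    power = power_10$power,
                    delta = min_delta$delta)
print(summary_values)

# n is per group and must be a whole number, so round up
samples_per_group <- ceiling(needed_n$n)
print(samples_per_group)
```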
23
Two kinds of replicates
  • biological replicates: biological variability
  • technical replicates: measurement accuracy
  • most statistical programs assume independent
    samples

[Figure: replicates A1-A3, B1-B3, C1-C3, D1-D3]
24
Pooling
[Figure: pooling of samples A1-A3 and B1-B3]
25
Pooling
  • ok when the interest is not on the individual,
    but on common patterns across individuals
    (population characteristics)
  • results in averaging → reduces variability →
    substantive features are easier to find
  • recommended when fewer than 3 arrays are used in
    each condition
  • beneficial when many subjects are pooled
  • one pool vs independent samples in multiple pools
  • C. Kendziorski, R. A. Irizarry, K.-S. Chen, J. D.
    Haag, and M. N. Gould, "On the utility of pooling
    biological samples in microarray experiments",
    PNAS, March 2005, 102(12), 4252-4257

inference for most genes was not affected by
pooling
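
The variance-reducing effect of pooling can be illustrated with a simulation. The sketch rests on a simplifying assumption that is not stated in the slides: the measurement of a pool equals the average of the pooled individuals' expression levels plus an independent technical error. The standard deviations `sd_bio` and `sd_tech` are hypothetical values.

```r
set.seed(7)      # fixed seed so the illustration is reproducible
sd_bio  <- 1.0   # biological variability between individuals (assumed)
sd_tech <- 0.5   # technical measurement error of one array (assumed)
n_sim   <- 20000

# One array measuring a pool of k individuals
measure_pool <- function(k) {
  mean(rnorm(k, mean = 0, sd = sd_bio)) + rnorm(1, mean = 0, sd = sd_tech)
}

pool_sizes  <- c(1, 3, 10)
simulated   <- sapply(pool_sizes,
                      function(k) var(replicate(n_sim, measure_pool(k))))
theoretical <- sd_bio^2 / pool_sizes + sd_tech^2

print(data.frame(pool_size = pool_sizes, simulated, theoretical))
```

Under this model pooling reduces only the biological part of the variance; the technical error of the array remains, which is one reason why pooling helps most when many subjects are pooled.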
26
How to allocate the samples to microarrays?
  • which samples should be hybridized on the same
    slide?
  • different experimental designs
  • reference design, loop design
  • what is the optimal design?

27
Example of four-array experiment
array  cy3  cy5  log(cy5/cy3)
1      A    B    log(B) - log(A)
2      A    B    log(B) - log(A)
3      B    A    log(A) - log(B)
4      B    A    log(A) - log(B)

[Figure: diagram of arrays 1-4 between samples A and B, with cy3/cy5 labels]
28
Reference design
array  cy3  cy5  log(cy5/cy3)
1      Ref  A    log(A) - log(Ref)
2      Ref  B    log(B) - log(Ref)
3      Ref  C    log(C) - log(Ref)
4      Ref  D    log(D) - log(Ref)

[Figure: samples A, B, C, D each connected to Ref by arrays 1-4]

log(C/A) = log(C) - log(A)
         = log(C) - log(Ref) + log(Ref) - log(A)
         = (log(C) - log(Ref)) - (log(A) - log(Ref))
         = logratio(array3) - logratio(array1)
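
A numerical sketch of the reference-design calculation, using made-up, noise-free intensities. The values in `intensity` are hypothetical, and base-2 logarithms are an assumption, since the slides do not specify the base.

```r
# Hypothetical intensities of one gene (not from the slides)
intensity <- c(Ref = 1000, A = 500, B = 2000, C = 4000, D = 1000)

# Each array measures one sample (cy5) against the reference (cy3)
logratio <- log2(intensity[c("A", "B", "C", "D")] / intensity["Ref"])
names(logratio) <- paste0("array", 1:4)
print(logratio)

# Indirect comparison of C and A through the common reference
log_C_over_A <- unname(logratio["array3"] - logratio["array1"])
print(log_C_over_A)

# Same quantity computed directly from the intensities, for comparison
direct <- log2(intensity[["C"]] / intensity[["A"]])
print(direct)
```

With real data each log ratio carries measurement error, so the indirect estimate accumulates the error of two arrays.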
29
Loop design
array  cy3  cy5  log(cy5/cy3)
1      A    B    log(B) - log(A)
2      B    C    log(C) - log(B)
3      C    D    log(D) - log(C)
4      D    A    log(A) - log(D)

[Figure: loop A, B, C, D connected by arrays 1-4]

Path through B:
log(C/A) = log(C) - log(B) + log(B) - log(A)
         = logratio(array2) + logratio(array1)
Path through D:
log(C/A) = log(C) - log(D) + log(D) - log(A)
         = - logratio(array3) - logratio(array4)
Average of the two path estimates:
log(C/A) = (logratio1 + logratio2)/2
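
The two paths around the loop can be sketched numerically. The true log levels in `true_log` and the noise standard deviation are hypothetical values chosen for illustration.

```r
set.seed(3)                                 # reproducible illustration
true_log <- c(A = 0, B = 1, C = 3, D = 2)   # hypothetical true log levels

# Measured log(cy5/cy3) of the four arrays, with a small measurement error
logratio <- c(
  array1 = true_log[["B"]] - true_log[["A"]],
  array2 = true_log[["C"]] - true_log[["B"]],
  array3 = true_log[["D"]] - true_log[["C"]],
  array4 = true_log[["A"]] - true_log[["D"]]
) + rnorm(4, mean = 0, sd = 0.1)

# Two independent estimates of log(C/A), one per path around the loop
path_via_B <-  logratio[["array2"]] + logratio[["array1"]]
path_via_D <- -logratio[["array3"]] - logratio[["array4"]]

# Averaging the two paths reduces the error of the estimate
estimate <- (path_via_B + path_via_D) / 2
print(c(via_B = path_via_B, via_D = path_via_D, average = estimate))
```

The two paths use disjoint sets of arrays, so their errors are independent and the average is more precise than either path alone.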
30
Comparing the designs
                                    reference design   reference design with replicates   loop design
number of arrays                    3                  6                                  3
amount of RNA required per sample   1 + Ref            2 + Ref                            2
error                               2.0                1.0                                0.67
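
The slide does not say how the error row is computed. One reading that is consistent with the numbers is the variance of an estimated log ratio between two samples, in units of the error variance of a single array, for three samples A, B, C. Under that assumption, and with independent errors, the values follow from least squares; the helper `contrast_variance` is hypothetical.

```r
# Variance of a contrast of the least-squares estimates, in units of the
# single-array error variance: contrast' (X'X)^-1 contrast
contrast_variance <- function(X, contrast) {
  drop(t(contrast) %*% solve(crossprod(X)) %*% contrast)
}

# Reference design: parameters log(A/Ref), log(B/Ref), log(C/Ref),
# one array per sample
X_ref     <- diag(3)
X_ref_rep <- rbind(diag(3), diag(3))   # every array replicated once

# Loop design: parameters log(B/A), log(C/A); arrays A->B, B->C, C->A
X_loop <- rbind(c( 1,  0),    # log(B) - log(A)
                c(-1,  1),    # log(C) - log(B)
                c( 0, -1))    # log(A) - log(C)

errors <- c(
  reference            = contrast_variance(X_ref,     c(-1, 1, 0)),  # log(B/A)
  reference_replicated = contrast_variance(X_ref_rep, c(-1, 1, 0)),  # log(B/A)
  loop                 = contrast_variance(X_loop,    c(1, 0))       # log(B/A)
)
print(round(errors, 2))
```

With the same number of arrays, the loop design gives a smaller error than the reference design because no measurements are spent on the reference sample.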
31
Design with all direct pairwise comparisons
[Figure: design with arrays numbered 1-6]
32
Example examining genotype, phenotype, and
environment
Parental - stressed
Derived - stressed
Parental - unstressed
Derived - unstressed
33
Optimal design
  • maximize the accuracy of parameters of interest
  • procedure: enumerate all possible designs,
    calculate the parameter accuracy for each of
    them, and select the best design
  • optimal design is model specific

34
(No Transcript)
35
About the nature of microarray data
  • Microarray data can suggest hypotheses to be
    tested further
  • Results from microarray analysis should be
    verified by other means (qPCR, ...)
  • quality of microarray data depends on samples,
    probes, hybridization, lab work
  • data pre-processing, normalization, and outlier
    detection are as important as good experimental
    design

36
More about statistics
  • M.J. Crawley: Statistics: An Introduction using
    R, John Wiley & Sons, 2005
  • S.A. Glantz: Primer of Biostatistics,
    McGraw-Hill, 5th ed., 2002
  • D.C. Montgomery: Design and Analysis of
    Experiments, John Wiley & Sons, 5th ed., 2001
  • Google