Title: Identifying Differentially Expressed Genes
1Identifying Differentially Expressed Genes
- Suppose we have T genes which we measured under
two experimental conditions (Ctl and Nic) in n
replicated experiments - ti and pi are the t-statistic and the
corresponding p-value for the ith gene, i1,...,T - P-value is the probability of observing as
extreme or more extreme value of the t-statistic
under the null-distribution (i.e. the
distributions assuming that ?iCtl ?iNic ) than
the one calculated from the data (t) - The ith gene is "differentially expressed" if we
can reject the ith null hypothesis ?iCtl ?iNic
and conclude that ?iCtl ? ?iNic at a significance
level ? (i.e. if pilt?) - Type I error is committed when a null-hypothesis
is falsely rejected - Type II error is committed when a null-hypothesis
is not rejected but it is false - Experiment-wise Type I Error is committed if any
of a set of (T) null hypothesis is falsely
rejected - If the significance level is chosen prior to
conducting experiment, we know that by following
the hypothesis testing procedure, we will have
the probability of falsely concluding that any
one gene is differentially expressed (i.e.
falsely reject the null hypothesis) is equal to ?
2Experiment-wise error rate
Assuming that individual tests of hypothesis are
independent and true p(Not Committing The
Experiment-Wise Error) p(Not Rejecting H01 AND
Not Rejecting H02 AND ... AND Not Rejecting H0T)
(1- ? )(1- ? )...(1- ? ) (1- ?
)T p(Committing The Experiment-Wise Error) 1-(1-
? )T If we want to keep the FWER at ?
level Sidaks adjustment ?a 1-(1- ? )1/T
Bonferroni adjustment ?a ?/T
3Adjusting p-value
- Individual HypothesesH0i ?iCtl
?iNic ? pip(tdf gt ti) , i1,...,T - "Composite" Hypothesis
- H0 ?iCtl ?iNic, i1,...,T ? pminpi,
i1,...,T - The composite null hypothesis is rejected if even
a single individual hypothesis is rejected - Consequently the p-value for the composite
hypothesis is equal to the minimum of individual
p-values - If all tests have the same reference
distribution, this is equivalent topp(tdf gt
tmax)
4Null distribution of the composite p-value
p(minpi, i1,...,T lt ?) assuming
independence1-1-?T Instead of adjusting the
significance level, can adjust all p-values pia
1-1-piT pib piT
5This is not good enough
- Traditional statistical approaches to multiple
comparison adjustments which strictly control the
experiment-wise error rates are not optimal
probability of rejecting a single true null
hypothesis ? - Need a balance between the false positive and
false negative rates - Instead of controlling the probability of
generating a single false positive, we control
the proportion of false positives
6False Discovery Rate
- Using ? to test each hypothesis Expected number
of false positives is (m0)? - large if most null
hypothesis true and the probability of committing
a FWE is approximately 1-(1-?)(m0) - When using ?adjusted, the probability of
committing FWE is ? - False Discovery Rate (FDR) is equal to expected
V/R - We don't know m0 so multiple comparisons
adjustments are usually made assuming m0m - Turns out we can control FDR
7False Discovery Rate
Following decision making procedure will keep FDR
below q
Alternatively, adjust p-values as
8Randomization Issue
The ProblemIdentify genes whose expression in a
target organ (Lung) of a model organism (Rat) is
affected by an environmental toxicant
(W) PopulationAll model organisms of this type
(Rats) Sample12 randomly selected rats from the
population of all rats. (Randomly means that all
rats in the population have the equal chance of
being selected) RandomizationRandomly select 6
rats to be treated by the toxicant. Randomly is
the key word here that allows us to ascribe
observed changes to the treatment alone. Prepare
samples and extract RNA from all 12 rats Randomly
assign labeled RNA to different
microarrays Process microarrays in a random order
9Randomization Issue
The ProblemIdentify genes whose expression in a
target organ (Lung) of a model organism (Mouse)
is affected by an environmental toxicant
(Nickel) PopulationAll model organisms of this
type (Mice) Sample6 randomly selected rats from
the population of all rats. (Randomly means that
all rats in the population have the equal chance
of being selected) RandomizationRandomly select
3 rats to be treated by the toxicant. Randomly is
the key word here that allows us to ascribe
observed changes to the treatment alone. Prepare
samples and extract RNA from all 6 mice Randomly
assign labeled RNA to different
microarrays Process microarrays in a random order
10limma
- ... is a package for the analysis of microarray
data, especially the use of linear models for
analyzing designed experiments and the assessment
of differential expression. - Specially constructed data objects to represent
various aspects of microarray data - Specially constructed "object methods" for
importing, normalizing, displaying and analyzing
microarray data - All objects and methods are transparent
- All objects can be accessed and modified outside
of limma - Unique in the implementation of the empirical
Bayes procedure for identifying differentially
expressed genes by "borrowing" information from
different genes (everything so far has been gene
by gene)
11Measurement Error Model With Additive
BackgroundWNickel CCtl
Old Model
Foreground (F)
Background (B)
New Model
- There are other models for accounting for the
background signal - Simple subtraction of the background intensities
often introduces additional variability in the
observed signal - The problem is in the fact that we use a
single-observation estimate for ?B - With this in mind, various strategies have been
proposed to pool background information from more
than one spot to estimate ?B
12Single Channel Microarrays Each Sample Assigned
to a Different Microarray
- 6 microarrays, 6 samples (C1,...,C3,W1,...,W3)
- Randomly assign samples to different microarrays
- In terms of a single gene, 6 different "spots"
- Proceed with a two-sample t-test as we did so far
13Two-Channel Microarrays One C and One W Sample
Assigned to Each Microarray
- 3 microarrays, 6 samples (C1,...,C3,W1,...,W3)
- Randomly select pairs and assign then to
different microarrays - In terms of a single gene, 3 different "spots"
- Individual samples are no longer "free" to be
assigned to any microarray restriction on the
randomization process - Measurements are "blocked" within a microarray
(terminology) - We could still randomly assign samples and not
have treatment and the control on each
microarray, but this would be unreasonable
(arguments to come) - Need to use a paired t-test
14Paired t-test
- For a specific gene ri xiw -xic ith
difference, i1,,3
- Statistical Model of observed data
- Differential expression ? ? ? 0
t
-t
15Two-sample t-test vs paired t-test
- Reference Distribution t2n-2 tn-1
16limma
Data to import http//eh3.uc.edu/teaching/cfg/200
6/data/07-21-03_MO-S-06-23-03N17-72SV2-3-vs-CSV2-5
-B.gpr http//eh3.uc.edu/teaching/cfg/2006/data/07
-21-03_MO-S-06-23-03N73-CSV3-3-vs-72SV3-5-A.gpr ht
tp//eh3.uc.edu/teaching/cfg/2006/data/07-18-03_MO
-S-06-23-03N98-CSV1-3-vs-72SV1-5-B.gpr File
descriptionshttp//eh3.uc.edu/teaching/cfg/2006/
data/NTargets.txt Spot descriptionshttp//eh3.uc
.edu/teaching/cfg/2006/data/SpotTypes.txt Importin
g datasource("http//eh3.uc.edu/teaching/cfg/200
6/R/LimmaDataImport.R")
17limma
gt library(limma) gt data.directorylt-"http//eh3.uc.
edu/teaching/cfg/2006/data/" gt targetslt-readTarget
s("http//eh3.uc.edu/teaching/cfg/2006/data/NTarge
ts.txt") gt targets array
experiment cy3 1 17.0
07-21-03_MO-S-06-23-03N17-72SV2-3-vs-CSV2-5-B.gpr
Nic-WT72hr_2 2 73.0 07-21-03_MO-S-06-23-03N73-CSV
3-3-vs-72SV3-5-A.gpr Ctl-WT00hr_3 3 98.1
07-18-03_MO-S-06-23-03N98-CSV1-3-vs-72SV1-5-B.gpr
Ctl-WT00hr_1 cy5 date 1
Ctl-WT00hr_2 7/21/2003 2 Nic-WT72hr_3 7/21/2003 3
Nic-WT72hr_1 7/18/2003
18limma
gt spottypeslt-readSpotTypes("http//eh3.uc.edu/teac
hing/cfg/2006/data/SpotTypes.txt") gt spottypes
SpotType ID Name Color 1 cDNA
black 2 Blank Blank
blue 3 Control control red 4
Empty Empty blue 5 empty empty
blue gt
19RGList class
gt LimmaDataNickellt-read.maimages(filestargetsexp
eriment,source"genepix", path data.directory,
columnslist(Gf "F532 Median",Gb "B532
Median", Rf "F635 Median", Rb "B635
Median"), annotationc("Name","ID","Block","Row"
,"Column"),wt.funwtflags(0)) Read
http//eh3.uc.edu/teaching/cfg/2006/data//07-21-03
_MO-S-06-23-03N17-72SV2-3-vs-CSV2-5-B.gpr Read
http//eh3.uc.edu/teaching/cfg/2006/data//07-21-03
_MO-S-06-23-03N73-CSV3-3-vs-72SV3-5-A.gpr Read
http//eh3.uc.edu/teaching/cfg/2006/data//07-18-03
_MO-S-06-23-03N98-CSV1-3-vs-72SV1-5-B.gpr gt
attributes(LimmaDataNickel) names 1 "R"
"G" "Rb" "Gb" "weights" "targets"
"genes" class 1 "RGList" attr(,"package") 1
"limma"
20RGList class
gt LimmaDataNickelR13, 07-21-03_MO-S-06-23
-03N17-72SV2-3-vs-CSV2-5-B 07-21-03_MO-S-06-23-03N
73-CSV3-3-vs-72SV3-5-A 1,
2264
3642 2,
303
734 3,
140
164 07-18-03_MO-S-06-23-03N98-CSV1-
3-vs-72SV1-5-B 1,
726 2,
248 3,
120 gt LimmaDataNickelRb13,
07-21-03_MO-S-06-23-03N17-72SV2-3-vs-CSV2-5-B
07-21-03_MO-S-06-23-03N73-CSV3-3-vs-72SV3-5-A 1,
126
154 2,
126
150 3,
128
155
07-18-03_MO-S-06-23-03N98-CSV1-3-vs-72SV1-5-B 1,
129 2,
127 3,
127 gt