t-TESTS, ANOVA AND OTHER METHODS

1
t-TESTS, ANOVA AND OTHER METHODS FOR FINDING DIFFERENTIAL EXPRESSION IN MICROARRAY EXPERIMENTS
Francesca Little - 2006
References
Books
Jonathan Pevsner (2003). Bioinformatics and Functional Genomics. Wiley-Liss.
Roland Ennos (2000). Statistical and Data Handling Skills in Biology. Prentice Hall.
Papers
Cui X. and Churchill G.A. (2003). Statistical tests for differential expression in cDNA microarray experiments. Genome Biology, 4:210.1-210.10.
Tusher V.G., Tibshirani R. and Chu G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. PNAS, 98(9), 5116-5121.
2
MICROARRAY DATA
Expression values for many genes obtained for a few samples (replications) under a given number of conditions.
Ratios or relative intensities for Cy3 (green) dye and Cy5 (red) dye.
3
  • OVERVIEW OF MICROARRAY DATA ANALYSIS
  • 1. NORMALIZATION AND PRE-PROCESSING
  • To allow data sets from 2 or more samples
    to be compared to each other.
  • 2. INFERENTIAL STATISTICS
  • Hypothesis testing to allow statements about
    the likelihood that particular genes are
    significantly regulated.
  • 3. EXPLORATORY STATISTICS
  • Descriptive statistics, including cluster and
    principal component analysis,
    to look for biologically meaningful patterns.

4
  • INFERENTIAL STATISTICS
  • - to test the hypothesis that some genes are
    differentially expressed in an experimental
    comparison of 2 or more conditions
  • 1. State the hypotheses:
  • H0: There is no difference in signal intensity
    across the conditions being tested
  • versus
  • H1: There are differences in gene
    expression levels.
  • 2. Calculate a test statistic that
    characterizes the observed differential gene
    expression. This statistic depends on the design
    of the experiment.
  • 3. Determine how likely the observed value of the
    test statistic (or a more extreme one) is if H0 is true.
  • This is given by the p-value.
  • 4. Make a decision.
  • If the p-value is small, reject H0.
  • What is small?

5
THE NULL HYPOTHESIS
H0: There is no difference in expression between the conditions.
This is equivalent to saying that the true ratio between the expression of each gene under the two conditions = 1.
By taking logs, we transform multiplicative effects (ratios) into additive effects (differences), because $\log(A/B) = \log(A) - \log(B)$. Under H0 the true log ratio is therefore 0.
6
  • TEST STATISTICS
  • 1. FOLD CHANGE
  • Evaluate the log ratio between 2 conditions
    (or the average of ratios when there are
    replicates) and consider all genes that differ by
    more than an arbitrary cut-off value to be
    differentially expressed.
  • E.g., a 2-fold change is often used.
  • Problems: The cut-off is arbitrary.
  • The result depends on intensity,
  • in that a 2-fold change when genes are
    expressed at 300 and 200 (with a background
    intensity of 10) does not mean the same as when
    genes are expressed at 10 000 and 5000, but both
    represent a 2-fold difference.
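The fold-change rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the intensities and the choice to average log2 ratios over paired replicates are hypothetical, not taken from the slides.

```python
import math

def log2_fold_change(cond_a, cond_b):
    """Mean log2 ratio (A over B) across paired replicate intensities."""
    log_ratios = [math.log2(a / b) for a, b in zip(cond_a, cond_b)]
    return sum(log_ratios) / len(log_ratios)

def is_fold_change_hit(cond_a, cond_b, fold=2.0):
    """Flag a gene whose mean log2 ratio reaches the cut-off in either direction."""
    return abs(log2_fold_change(cond_a, cond_b)) >= math.log2(fold)

# Hypothetical intensities: one weakly and one strongly expressed gene.
low_gene = ([20.0, 22.0, 18.0], [10.0, 11.0, 9.0])
high_gene = ([10000.0, 11000.0, 9000.0], [5000.0, 5500.0, 4500.0])

for name, (a, b) in {"low": low_gene, "high": high_gene}.items():
    print(name, log2_fold_change(a, b), is_fold_change_hit(a, b))
```

Both hypothetical genes pass the same 2-fold cut-off although their absolute intensities differ by orders of magnitude, and the rule uses no information about replicate variability.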

7
  • 2. t-test
  • Can be viewed as a signal-to-noise ratio.
  • There are several variations:
  • Gene-specific t-test: $t_g = R_g / SE_g$, where $R_g$ is the
    mean log ratio for gene g and $SE_g$ is its std.err.
  • - low power because of small sample size
  • - unstable variance estimate → a small variance
    estimate can lead to a large t
  • Global t-test: $t_g = R_g / SE$,
  • where SE = std.err. computed by combining
    data across all genes
  • - based on the assumption that the variance
    is homogeneous across all genes.
  • - similar to a fold-change test in that it
    effectively ranks the genes in order and does not
    account for individual variability.
  • S-test: $S_g = R_g / (c + SE_g)$
  • Adds a constant c to the std.err. to make sure
    that genes with small fold changes will not be
    selected as significant. One way of choosing c is to
    make it equal to the 90th percentile of the $SE_g$
    values.
  • Regularized t-test
  • where the denominator is based on a weighted
    average of the gene-specific and global variance.
8
SIGNIFICANCE
p-value = measure of how likely it is to observe a value for your statistic as extreme or more extreme than the sample value, if H0 is true.
A small number of replications leads to power problems!
The t-distribution is the frequency distribution of the deviations from the mean divided by the std.err. of a large number of samples.
Are we correct in assuming that the underlying distribution is a t-distribution?
9
HYPOTHESIS TESTING AND TWO TYPES OF ERROR
Decision rule: Reject H0 if t > C.
There is a chance that H0 is true for values of t > C. This chance is limited by α (0.05, for example). → False positive error = type I error.
There is also a chance that H0 is false for values of t < C → a false negative error = type II error, when we fail to detect a differentially expressed gene; the probability of this is β.
POWER = 1 - β = P(detecting a differentially expressed gene | it is truly different)
- dependent on study design, sample size and precision as measured by variability.
10
MULTIPLE TESTING PROBLEM
- refers to the accumulation of type I errors (= false positives).
If we choose α = 0.05, we limit P(rejecting H0 | H0 is true) to be 0.05. But we have a large number of genes, say 10 (though actually 1000s). For independent tests in which every H0 is true:
$P(\geq 1\ H_0 \text{ rejected}) = 1 - P(\text{no } H_0 \text{ rejected}) = 1 - P(\text{all } p_i > 0.05) = 1 - (1 - 0.05)^{10} = 1 - 0.95^{10} = 1 - 0.59874 = 0.401$
i.e., even if 10 genes are NOT differentially expressed, there is a probability of 0.401 of finding at least one which we will declare as significantly different!
We define the family-wise TYPE I error rate (FWER) as the probability of making one or more type I errors in a set (or family) of c tests: FWER = $1 - (1 - \alpha)^c$.
We can control the FWER by using the BONFERRONI correction, which simply divides the nominal significance level by the number of tests, so for example for 10 tests, use 0.05/10 = 0.005; for 1000 tests, use 0.05/1000 = 0.00005 --- very strict, with a resulting reduction in power!!
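The arithmetic on this slide can be checked with a short Python sketch. It assumes independent tests, as the FWER formula does; the function names are hypothetical.

```python
def fwer(alpha, n_tests):
    """Probability of at least one type I error in n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests

for c in (1, 10, 1000):
    per_test = bonferroni_threshold(0.05, c)
    # Uncorrected FWER, Bonferroni per-test level, and FWER after correction.
    print(c, round(fwer(0.05, c), 3), per_test, round(fwer(per_test, c), 4))
```

With the corrected per-test level the family-wise rate stays at or below 0.05 for every c, at the price of a much stricter threshold for each individual gene.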
11
ALTERNATIVE APPROACH TO LIMITING FWER: CONTROL THE FDR
FDR = FALSE DISCOVERY RATE
FDR = proportion of false positives among all genes identified as being differentially expressed
FDR = expected proportion of type I errors among rejected hypotheses = $E(C/(C+D))$, where C counts the false positives among the C + D rejected hypotheses.
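The slides do not name a specific FDR-controlling procedure. As an illustration only, the sketch below uses the Benjamini-Hochberg step-up procedure, a standard method swapped in here; the p-values and function name are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Indices of hypotheses rejected by the BH step-up procedure at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value lies under the BH line rank * q / m
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k = rank
    return sorted(order[:k])

# Hypothetical p-values for 10 genes.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

bh_hits = benjamini_hochberg(p, q=0.05)
bonferroni_hits = [i for i, pv in enumerate(p) if pv <= 0.05 / len(p)]
print("BH:", bh_hits, "Bonferroni:", bonferroni_hits)
```

On these values the FDR-based rule rejects one more hypothesis than Bonferroni, which illustrates the gain in power from controlling the FDR rather than the FWER.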
12
SAM = Significance Analysis of Microarrays (Tusher et al., 2001)
Simulates lots of replicates to make up for small sample sizes.
Choose a cut-off that minimizes the FDR.
More specifically:
Each gene is given a score on the basis of its change in gene expression relative to the std.dev. of repeated measurements.
Genes with a score > a given threshold are considered to be potentially significant.
The percentage of such genes identified by chance is the false discovery rate.
This FDR is estimated by simulating nonsense genes using permutations of the sample measurements.
The threshold is adjusted to identify smaller or larger sets of genes, and FDRs are calculated for each set.
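The permutation idea can be sketched as follows. This is a deliberately simplified illustration, not the SAM algorithm as published: it uses a fixed score threshold, enumerates every relabelling of a tiny two-group design, and the data, the fudge constant `s0` and all function names are hypothetical.

```python
import itertools
import math
import statistics

def score(values, idx_a, idx_b, s0=0.1):
    """Difference in group means relative to (pooled std.err. + constant s0)."""
    a = [values[i] for i in idx_a]
    b = [values[i] for i in idx_b]
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / (se + s0)

def count_calls(data, idx_a, idx_b, threshold, s0):
    """Number of genes whose absolute score exceeds the threshold."""
    return sum(abs(score(v, idx_a, idx_b, s0)) > threshold for v in data.values())

def estimate_fdr(data, idx_a, idx_b, threshold, s0=0.1):
    """Genes called in the real labelling and a permutation-based FDR estimate."""
    n = len(idx_a) + len(idx_b)
    called = count_calls(data, idx_a, idx_b, threshold, s0)
    if called == 0:
        return 0, 0.0
    false_counts = []
    # Every way of relabelling the samples plays the role of a "nonsense" data set.
    for perm_a in itertools.combinations(range(n), len(idx_a)):
        perm_b = [i for i in range(n) if i not in perm_a]
        false_counts.append(count_calls(data, perm_a, perm_b, threshold, s0))
    expected_false = statistics.mean(false_counts)
    return called, min(1.0, expected_false / called)

# Hypothetical log expression values: samples 0-2 are condition A, 3-5 condition B.
data = {
    "geneA": [5.0, 5.2, 4.9, 1.0, 1.1, 0.9],
    "geneB": [2.0, 2.3, 1.8, 2.1, 1.9, 2.2],
    "geneC": [3.0, 1.0, 2.0, 2.5, 1.5, 3.5],
}
called, fdr = estimate_fdr(data, [0, 1, 2], [3, 4, 5], threshold=4.0)
print(called, fdr)
```

Repeating the call for a range of thresholds gives the trade-off described above: a lower threshold calls more genes at the cost of a higher estimated FDR.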
13
  • MORE THAN TWO CONDITIONS
  • Replace the t-test with an F-test and use ANOVA
  • We are interested in changes in the mean value of
    a response variable (e.g., gene expression level),
    but the data may be structured in different ways
    because of the structure of the experiment.
  • COMPLETELY RANDOMISED DESIGN
  • The data is classified into groups based on
    the values of ONE variable
  • Use ONE-WAY ANOVA
  • RANDOMISED BLOCK DESIGN
  • The data is classified into groups based on the
    values of TWO variables, where one variable
    represents a natural grouping of samples into
    homogeneous groups and the other represents a
    treatment or condition that we are interested in
    comparing.
  • Use TWO-WAY ANOVA
  • FACTORIAL DESIGN
  • Interested in studying the effect of two or more
    factors simultaneously on the response
  • REPEATED MEASURES ANOVA

14
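METHODOLOGY --- based on a decomposition of the total variance of the response into different components.
Let's assume k conditions or treatments, with $n_i$ replications per condition for each gene.
Let $y_{ij}$ = expression for replication j under condition i.
The total variation in the response is measured by the sum of the squared differences between the response and its mean value. We can write
$SST = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{..})^2$
It can be shown that
$\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{..})^2 = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2 + \sum_{i=1}^{k} n_i (\bar{y}_{i.} - \bar{y}_{..})^2$
where $\bar{y}_{i.}$ is the mean of group i and $\bar{y}_{..}$ is the overall mean,
i.e., TOTAL SS = WITHIN GROUP SS + BETWEEN GROUP SS.
If the group means differ a lot, they will be very different from the overall mean, and the BETWEEN GROUP SS will be large relative to the WITHIN GROUP SS. If all group means are roughly equal, the reverse is true.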
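The decomposition can be verified numerically. The sketch below uses hypothetical expression values for one gene under three conditions; the function name is an assumption.

```python
def sums_of_squares(groups):
    """Return (total SS, within-group SS, between-group SS) for one gene."""
    all_values = [y for g in groups for y in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    sst = sum((y - grand_mean) ** 2 for y in all_values)
    ssw = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    return sst, ssw, ssb

# Hypothetical replicates for one gene under k = 3 conditions.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
sst, ssw, ssb = sums_of_squares(groups)
print(sst, ssw, ssb)
```

Here the third group mean lies far from the other two, so the between-group sum of squares dominates the within-group sum of squares.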
15
We assume that the $n_i$ observations in the ith group form a random sample from a population with mean $\mu_i$ and variance $\sigma^2$, so we assume that the variance is CONSTANT for all groups.
Our null hypothesis of no group difference can be stated as
H0: $\mu_1 = \mu_2 = \dots = \mu_k = \mu$ versus H1: one or more of the $\mu_i$ differ.
There are 3 estimates for the common variance $\sigma^2$ (N = total number of observations):
MST = SST/(N-1)
MSW = SSW/(N-k)
MSB = SSB/(k-1)
and all 3 of these are unbiased if H0 is true.
However, when H0 is not true and the group means differ, the BETWEEN group mean square will increase and be greater than the WITHIN group mean square (which is still unbiased). So we can test H0 by comparing these two estimates of variance using an F-test: $F = MSB / MSW$, with k-1 and N-k degrees of freedom.
When k = 2 and we only have 2 conditions, it can be shown that the F statistic is just the square of the (pooled-variance) t statistic.
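A small sketch of the F statistic, including a numerical check of the k = 2 case against the pooled-variance two-sample t statistic. The data and function names are hypothetical, and no p-value is computed because the Python standard library has no F distribution.

```python
import math
import statistics

def one_way_f(groups):
    """F = between-group mean square / within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ssb = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ssw = sum((y - statistics.mean(g)) ** 2 for g in groups for y in g)
    return (ssb / (k - 1)) / (ssw / (n_total - k))

def pooled_t(a, b):
    """Two-sample t statistic with a pooled (equal) variance estimate."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        pooled_var * (1 / na + 1 / nb))

# Hypothetical replicates for one gene.
three_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0, 5.0]

print(one_way_f(three_groups))
print(one_way_f([a, b]), pooled_t(a, b) ** 2)
```

The last line prints two values that agree up to rounding, which is the k = 2 identity stated above.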
16
We can write a model for the one-way ANOVA as follows:
$y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$
where
$\mu$ = overall mean
$\alpha_i$ = difference between overall mean and ith group mean (BETWEEN)
$\varepsilon_{ij}$ = difference between individual observation and its group mean (WITHIN) or (RESIDUAL)
Testing for a group/condition effect is equivalent to comparing the above model with a simpler model:
$y_{ij} = \mu + \varepsilon_{ij}$
In a similar way we can compare more complicated models by adding extra terms, for example for a randomised block design or a factorial design.
The F-statistics for the different null hypotheses reduce to a comparison of the residual sums of squares from fitting each of these models.
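The equations for the extended designs did not survive in this transcript. As a hedged sketch, the usual textbook forms are given below. The symbols $\beta_j$ (block or second-factor effect), $(\alpha\beta)_{ij}$ (interaction) and the subscripts "full" and "reduced" are assumptions introduced here, extending the one-way model with overall mean, group effect and residual; the original slide may have used different notation.

```latex
% Randomised block design: treatment effect alpha_i plus block effect beta_j
y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}

% Two-factor factorial design with replicate index k and an interaction term
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}

% F statistic comparing a full model with a reduced (simpler) model,
% using their residual sums of squares (SS) and residual degrees of freedom (df)
F = \frac{(SS_{\text{reduced}} - SS_{\text{full}}) / (df_{\text{reduced}} - df_{\text{full}})}
         {SS_{\text{full}} / df_{\text{full}}}
```

For the one-way case the reduced model contains only the overall mean, and this ratio reduces to the between-group mean square divided by the within-group mean square.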
17
In microarray data, ANOVA can be used in 2 stages:
1. During NORMALIZATION, a model is fitted to the log intensities, where
$Y_{ijgr}$ = log(signal intensity)
$A_i$ = effect of array i
$D_j$ = effect of dye j
index g refers to the specific gene
index r refers to the replication
$r_{ijgr}$ = residual signal intensity with the effect of array and dye differences removed, i.e., your normalized data
2. In Stage 2, a gene-specific model is fitted to the normalized data, where VG is the term of primary interest in that it captures variation in expression levels of a gene across samples
- it is a catch-all term for the effects associated with the samples
- the term on which you'll put all the structure
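The normalization stage can be illustrated with a small sketch. It assumes a purely additive model with an overall mean plus array and dye main effects and a balanced design (the same genes in every array/dye cell), so that the least-squares effects are simple differences of means. The model on the original slide is not recoverable from this transcript and may have contained further terms; all names and numbers below are hypothetical.

```python
def normalize(y):
    """Remove overall mean, array and dye effects from log intensities.

    y[i][j] is the list of log intensities (one per gene) for array i, dye j.
    Returns (mu, array_effects, dye_effects, residuals), with residuals
    shaped like y. Assumes a balanced design.
    """
    n_arrays, n_dyes = len(y), len(y[0])
    cell_mean = [[sum(y[i][j]) / len(y[i][j]) for j in range(n_dyes)]
                 for i in range(n_arrays)]
    mu = sum(sum(row) for row in cell_mean) / (n_arrays * n_dyes)
    array_eff = [sum(cell_mean[i]) / n_dyes - mu for i in range(n_arrays)]
    dye_eff = [sum(cell_mean[i][j] for i in range(n_arrays)) / n_arrays - mu
               for j in range(n_dyes)]
    resid = [[[v - mu - array_eff[i] - dye_eff[j] for v in y[i][j]]
              for j in range(n_dyes)]
             for i in range(n_arrays)]
    return mu, array_eff, dye_eff, resid

# Hypothetical data: 2 arrays x 2 dyes x 3 genes, built from known effects.
gene_effects = [-1.0, 0.0, 1.0]
array_true = [0.5, -0.5]
dye_true = [0.2, -0.2]
y = [[[8.0 + a + d + g for g in gene_effects] for d in dye_true]
     for a in array_true]

mu, array_eff, dye_eff, resid = normalize(y)
print(mu, array_eff, dye_eff)
print(resid[0][1])
```

The residuals recover the gene-level signal once array and dye differences are removed; these residuals are what a second, gene-by-gene model would then analyse.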
18
  • NOTE
  • The anovas and hence the F-statistics are
    computed on a gene-by-gene basis.
  • Just as for the t-tests, there are variants of
    the F-tests based on whether you use the
    gene-specific within-group error as the
    denominator, a pooled estimate over all genes,
    or a weighted combination of the gene-specific and
    global estimates.
  • MIXED effects anova
  • Sometimes we view factors as random rather than
    fixed effects.
  • A factor should be a fixed effect if, should you
    repeat the experiment, you would be interested in
    exactly the same values / levels of that factor.
  • A factor is a random effect if it represents a
    random sample from a larger population, and hence
    should you repeat the experiment, you would not
    be including exactly the same levels for this
    factor. Examples include the terms relating to
    the array, or to biological replication. As soon
    as you have random effects, your model contains
    more than one level of variance and it becomes
    quite tricky to decide which level to use for
    constructing the F-tests.