Title: Analysis of Differential Expression
1Analysis of Differential Expression
- T-test
- ANOVA
- Non-parametric methods
- Correlation
- Regression
2Research Question
- Do nicotine-exposed rats have different gene X
expression than control rats in the ventral tegmental
area (VTA)?
- Design an experiment in which treatment rats
(N > 2) are exposed to nicotine and control rats
(N > 2) are exposed to saline.
- Collect RNA from the VTA and convert it to cDNA.
- Determine the amount of X transcript in each
individual.
- Perform a test of means considering the
variability within each group.
3Observed difference between groups
- May be due to
- Treatment
- Chance
4Hypothesis Testing
- Null hypothesis: There is no difference between
the means of the groups.
- Alternative hypothesis: The means of the groups are
different.
5Hypothesis testing
- You cannot accept (prove) the null hypothesis.
- You can reject it.
- You can support it (fail to reject it).
6P-value
- The P stands for probability. It measures how
likely it is to observe a difference between groups
at least as large as the one seen if chance alone
were at work (i.e., if the null hypothesis were true).
7P-value
- There is a significant difference between groups
if the P value is small enough (e.g., < 0.05).
- The significance threshold (e.g., 0.05) equals the
probability of a type I error when the null
hypothesis is true.
- Type I error: wrongly concluding that there is a
difference between groups (false positive).
- Type II error: wrongly concluding that there is
no difference between groups (false negative).
8Multiple tests on the same data
- Expression data on multiple genes from the same
individuals.
- Subsets of genes are coregulated, thus they are
not independent.
- Such data require multiple tests.
9Why not do multiple t-tests? Or if you do, adjust
the p-values
- Because it increases type I error.
- In a study involving four treatments, there are six
possible pairwise comparisons.
- If the chance of a type I error in one such
comparison is 0.05, then the chance of not
committing a type I error is 1 - 0.05 = 0.95.
- If the six comparisons are independent, then the
chance of not committing a type I error in any of
them is 0.95^6 = 0.74.
- Cumulative type I error: 1 - 0.74 = 0.26.
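The arithmetic on this slide can be reproduced in a few lines of Python. This is a minimal sketch that assumes the comparisons are independent; the function name `familywise_error` is hypothetical and chosen only for illustration.

```python
from math import comb


def familywise_error(alpha: float, n_tests: int) -> float:
    """Chance of at least one type I error across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


n_treatments = 4
n_pairs = comb(n_treatments, 2)          # number of pairwise comparisons
fwer = familywise_error(0.05, n_pairs)   # cumulative type I error
print(n_pairs, round(fwer, 2))
```

Coregulated genes are not independent, so for expression data this value is only a rough guide to how quickly the error rate grows.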
10Normal Distribution
- It is entirely defined by two quantities: its
mean and its standard deviation (SD).
- The mean determines where the peak occurs.
- The SD determines the spread of the curve.
11Curves same mean, different stds
12Rules of normal distribution
- 68.3% of the distribution falls within 1 SD of
the mean (i.e. between mean - SD and mean + SD).
- 95.4% of the distribution falls between mean - 2
SD and mean + 2 SD.
- 99.7% of the distribution falls between mean - 3
SD and mean + 3 SD.
13Most commonly used rule
- 95% of the distribution falls between mean - 1.96
SD and mean + 1.96 SD.
- If the data are normally distributed, one can
define a range (often called a reference range)
within which 95% of the data fall.
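The percentages quoted for the normal distribution can be checked numerically. The sketch below uses the standard identity that the fraction within k SDs of the mean equals erf(k / sqrt(2)); the helper name `coverage_within` is hypothetical.

```python
from math import erf, sqrt


def coverage_within(k: float) -> float:
    """Fraction of a normal distribution lying within k SDs of the mean."""
    return erf(k / sqrt(2))


for k in (1, 1.96, 2, 3):
    print(f"within {k} SD: {100 * coverage_within(k):.1f}%")
```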
14A sample
- Samples vary
- Samples are collected in limited numbers
- They are representatives of a population.
- A sample
- E.g., nicotine treated rat RNA
15Sample means
- Consider all possible samples of fixed size (n)
drawn from a population.
- Each of these samples has its own mean, and these
means will vary between samples.
- Each sample has its own distribution, and
therefore its own SD.
16Population mean
- The mean of all the sample means is equal to the
population mean (μ).
- The SD of the sample means measures the deviation
of individual sample means from the population mean
(μ).
17Standard error
- The standard error (SE) is the SD of the sample
means, estimated as SE = SD / sqrt(n).
- It reflects the effect of sample size: the larger
the SE, the higher the variation or the smaller the
sample size.
18Confidence Intervals
- A confidence interval gives a range of values
within which it is likely that the true
population value lies.
- It is defined as follows:
- 95% confidence interval: (sample mean - 1.96 SE)
to (sample mean + 1.96 SE).
- 99% confidence interval: (sample mean - 2.58 SE)
to (sample mean + 2.58 SE).
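As an illustration of the definition above, the following sketch computes a normal-approximation confidence interval for a mean. The expression values are made up, and `confidence_interval` is a hypothetical helper; for small samples the multiplier would normally come from the t-distribution rather than the normal distribution.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def confidence_interval(values, level=0.95):
    """Normal-approximation confidence interval for the mean of values."""
    n = len(values)
    se = stdev(values) / sqrt(n)                 # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)    # 1.96 for 95%, about 2.58 for 99%
    m = mean(values)
    return m - z * se, m + z * se


# Hypothetical expression measurements, for illustration only
expression = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.4, 4.9]
low95, high95 = confidence_interval(expression, 0.95)
low99, high99 = confidence_interval(expression, 0.99)
print(f"95% CI: {low95:.2f} to {high95:.2f}")
print(f"99% CI: {low99:.2f} to {high99:.2f}")
```

The 99% interval is always wider than the 95% interval, because more confidence requires a larger multiple of the SE.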
19T-distribution
- The t-distribution is similar in shape to the
Normal distribution, being symmetrical and
unimodal, but is generally more spread out with
longer tails.
- The exact shape depends on a quantity known as
the degrees of freedom, which in this context
is equal to the sample size minus 1.
20T-distribution
21One-sample t-test
- Null hypothesis: The mean of the sampled population
does not differ from the hypothesized mean, e.g., 0
(H0: μ = 0).
- A t-statistic (t) is calculated.
- t is the number of SEs that separate the sample
mean from the hypothesized value.
- The associated P value is obtained by comparison
with the t-distribution.
- The larger the absolute value of the t-statistic,
the lower the probability of obtaining such a value
under the null hypothesis, so the P value is smaller
and the result more significant.
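The calculation described above can be sketched as follows. The data are made up and `one_sample_t` is a hypothetical helper; it returns only the t-statistic and the degrees of freedom, and the P value would then be looked up in the t-distribution (from tables or a statistics library).

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(values, hypothesized_mean=0.0):
    """Return (t, degrees of freedom) for a one-sample t-test."""
    n = len(values)
    se = stdev(values) / sqrt(n)                   # standard error of the mean
    t = (mean(values) - hypothesized_mean) / se    # number of SEs from the hypothesized value
    return t, n - 1


# Hypothetical log fold-changes of gene X, for illustration only
log_fold_changes = [0.8, 1.1, 0.4, 0.9, 1.3, 0.7]
t_stat, df = one_sample_t(log_fold_changes, 0.0)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```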
22Paired t-test
- Used with paired data.
- Paired data arise in a number of different
situations:
- A matched case-control study in which individual
cases and controls are matched to each other, or
- A repeated-measures study in which some measurement
is made on the same set of individuals on more
than one occasion.
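A paired t-test is equivalent to a one-sample t-test on the within-pair differences. The sketch below assumes made-up before/after measurements on the same individuals; `paired_t` is a hypothetical helper that returns the t-statistic and the degrees of freedom only.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """Return (t, degrees of freedom) for a paired t-test."""
    if len(before) != len(after):
        raise ValueError("paired data must have equal lengths")
    diffs = [a - b for b, a in zip(before, after)]   # within-pair differences
    n = len(diffs)
    se = stdev(diffs) / sqrt(n)                      # SE of the mean difference
    return mean(diffs) / se, n - 1


# Hypothetical repeated measurements on five individuals
before = [10.0, 12.0, 9.0, 11.0, 13.0]
after = [12.0, 13.0, 12.0, 12.0, 16.0]
t_stat, df = paired_t(before, after)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```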
23Paired t-test
24Two-sample t-test
- Comparison of two groups with unpaired data.
- E.g., comparison of individuals of treatment and
those of control for a particular variable.
- Now there are two independent populations, thus
two SDs.
25Calculation of pooled STD
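- The pooled SD for the difference in means is
calculated as follows:
s_p = sqrt( ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2) )
where s1 and s2 are the SDs of the two groups and
n1 and n2 are their sample sizes.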
26Calculation of pooled SE
- The combined SE gives more weight to the larger
sample (if sample sizes are unequal) because
this is likely to be more reliable.
- The SE of the difference in means is calculated
from the pooled SD (s_p) as follows:
SE = s_p * sqrt(1/n1 + 1/n2)
27Two sample T-test
- Comparison of the means of two groups based on a
t-statistic and Student's t-distribution.
- The t-statistic is obtained by dividing the
difference between the sample means by the standard
error of the difference.
28T-statistic
- A P value may be obtained by comparison with the
t-distribution on n1 + n2 - 2 degrees of freedom.
- Again, the larger the t-statistic, the smaller
the P value will be.
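The two-sample calculation can be sketched as below, assuming equal variances in the two groups (the pooled-SD form of the test). The data are made up and `two_sample_t` is a hypothetical helper; it returns the t-statistic and the n1 + n2 - 2 degrees of freedom, leaving the P value to a t-distribution lookup.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(group1, group2):
    """Return (t, degrees of freedom) for a pooled-variance two-sample t-test."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    # Pooled variance weights each group's variance by its degrees of freedom
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    se_diff = sqrt(pooled_var) * sqrt(1 / n1 + 1 / n2)   # SE of the difference in means
    t = (mean(group1) - mean(group2)) / se_diff
    return t, df


# Hypothetical gene X expression values, for illustration only
nicotine = [5.0, 6.0, 7.0, 6.0]
saline = [3.0, 4.0, 5.0, 4.0]
t_stat, df = two_sample_t(nicotine, saline)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```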
29Example
30Calculation of SD
31Calculation of SE
32T-statistic
- t = (95 - 81)/2.41 = 14/2.41 = 5.81,
- with a corresponding P value less than 0.0001.
- Reject the null hypothesis, which states that the
sample means do not differ.
33Analysis of Variance
- ANOVA
- A technique for analyzing the way in which the
mean of a variable is affected by different types
and combinations of factors.
- E.g., the effect of three different diets on
total serum cholesterol.
34Sample Experiment
Variance
35Sum of squares calculations
- Total sum of squares
- Within-groups sum of squares
- Between-groups sum of squares
36Degrees of freedom
37Sources of variation
- A P value of 0.0039 indicates that the means of at
least two of the treatment groups differ.
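The sums of squares, degrees of freedom, and F-statistic of a one-way ANOVA can be sketched as follows. The cholesterol values are made up and do not reproduce the slide's example; `one_way_anova` is a hypothetical helper, and the P value would come from the F-distribution on the two degrees of freedom it returns.

```python
from statistics import mean


def one_way_anova(groups):
    """Return sums of squares, degrees of freedom and F for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_total": ss_total,
        "ss_within": ss_within,
        "ss_between": ss_between,
        "df_between": df_between,
        "df_within": df_within,
        "F": f_stat,
    }


# Hypothetical serum cholesterol values for three diets
diets = [[4.0, 5.0, 6.0], [5.0, 6.0, 7.0], [8.0, 9.0, 10.0]]
result = one_way_anova(diets)
print(result)
```

The total sum of squares always equals the within-groups plus the between-groups sum of squares, which is a useful check on the arithmetic.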
38Multiple Tests
- Post hoc comparisons between pairs of treatments.
- The overall type I error rate increases with the
number of pairwise comparisons.
- One has to maintain the 0.05 type I error rate
after all of the comparisons.
39Bonferroni Adjustment
- Adjusted significance threshold = 0.05 / number
of tests.
- Can be too conservative.
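A minimal sketch of the Bonferroni adjustment, using made-up P values; the helper name `bonferroni` is hypothetical. Each P value is compared with 0.05 divided by the number of tests.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the adjusted threshold and a significance flag per P value."""
    threshold = alpha / len(p_values)
    return threshold, [p <= threshold for p in p_values]


# Hypothetical P values from six pairwise comparisons
p_values = [0.001, 0.012, 0.030, 0.045, 0.200, 0.600]
threshold, significant = bonferroni(p_values)
print(f"threshold = {threshold:.4f}", significant)
```

In this made-up example four P values are below 0.05, but only one survives the adjustment, which illustrates why the method is often considered conservative.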
40NonParametric methods
- Many statistical methods require assumptions.
- The t-test requires that samples are normally
distributed.
- Data that violate this assumption may require
transformation.
- Nonparametric methods require few or no
assumptions about the distribution of the data.
41Wilcoxon signed rank test for paired data
42Wilcoxon signed rank test
43Central venous oxygen saturation on admission and
after 6 h into ICU.
- Patients have SvO2 values on admission and after
6 hours.
- Take the difference between the paired data
points.
44Central venous oxygen saturation on admission and
after 6 h into ICU.
- Rank differences regardless of their sign.
- Give a sign to the ranked differences
45Calculate
- Sum of positive ranks (R+)
- Sum of negative ranks (R-)
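The ranking steps for the signed rank test can be sketched as follows. The SvO2 values are made up and are not the data from the slides; `average_ranks` and `signed_rank_sums` are hypothetical helpers. The sketch assumes that zero differences are dropped and that tied absolute differences share their average rank, which is a common convention.

```python
def average_ranks(values):
    """Rank values in increasing order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def signed_rank_sums(first, second):
    """Return (R+, R-) for paired measurements, ignoring zero differences."""
    diffs = [b - a for a, b in zip(first, second) if b != a]
    ranks = average_ranks([abs(d) for d in diffs])    # rank regardless of sign
    r_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return r_plus, r_minus


# Hypothetical SvO2 (%) on admission and after 6 hours
admission = [60, 55, 70, 65, 50, 62]
six_hours = [66, 54, 73, 75, 52, 62]
r_plus, r_minus = signed_rank_sums(admission, six_hours)
print(r_plus, r_minus)
```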
46Sum of positive and negative ranks
47Critical values for WSR test when n = 10
5
48Wilcoxon rank-sum or Mann-Whitney test
- The Wilcoxon signed rank test is suited to paired
data.
- For unpaired data, the Wilcoxon rank-sum
(Mann-Whitney) test is used.
49Steps of Wilcoxon rank-sum test
50Total drug doses in patients with a 3 to 5 day
stay in intensive care unit.
- Rank all observations in increasing order,
regardless of grouping.
- Use the average rank if values tie.
- Add up the ranks in each group.
- Select the smaller value and calculate a P value
for it.
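The ranking and summing steps listed above can be sketched as follows. The dose values are made up, and `average_ranks` and `rank_sums` are hypothetical helpers. The sketch stops at the two rank sums; which one is carried forward as the test statistic depends on the convention and the table of critical values being used.

```python
def average_ranks(values):
    """Rank values in increasing order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sums(group1, group2):
    """Return the sum of ranks in each group after ranking all values together."""
    ranks = average_ranks(list(group1) + list(group2))
    return sum(ranks[:len(group1)]), sum(ranks[len(group1):])


# Hypothetical total drug doses for two patient groups
doses_a = [1.0, 3.0, 5.0, 5.0]
doses_b = [2.0, 5.0, 7.0, 8.0, 9.0]
sum_a, sum_b = rank_sums(doses_a, doses_b)
print(sum_a, sum_b)
```

As a check, the two sums must add up to N(N + 1)/2, where N is the total number of observations.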
51Critical values
52Correlation and Regression
- Correlation quantifies the strength of the
relationship between two paired samples.
- Regression expresses the relationship in the form
of an equation.
- Example: whether two genes, X and Y, are
coregulated, or whether the expression level of
gene X can be predicted from the expression level
of gene Y.
53Product moment correlation
r lies between -1 and 1
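The product moment (Pearson) correlation coefficient can be computed directly from its standard definition, as in this sketch. The expression values are made up and `pearson_r` is a hypothetical helper.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Product moment correlation coefficient between paired samples x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical expression levels of genes X and Y in five individuals
gene_x = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_y = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(gene_x, gene_y)
print(f"r = {r:.2f}")
```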
54Age and urea for 20 patients in emergency unit
55Scattergram
r = 0.62
56Confidence intervals around r
57Confidence of r
59Misuse of correlation
- There may be a third variable to which both of
the variables are related.
- Correlation does not imply causation.
- A nonlinear relationship may exist.
60Regression
61Method of least squares
- The regression line is obtained using the method
of least squares. Any line y = a + bx that we
draw through the points gives a predicted or
fitted value of y for each value of x in the
dataset.
- For a particular value of x, the vertical
difference between the observed and the fitted
value of y is known as the deviation or residual.
- The method of least squares finds the values of a
and b that minimize the sum of squares of all
deviations.
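For a straight line y = a + bx, the least squares values of a and b have a closed form, sketched below. The data are made up (not the age and urea data from the slides) and `least_squares` is a hypothetical helper.

```python
from statistics import mean


def least_squares(x, y):
    """Return intercept a and slope b minimizing the sum of squared residuals."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = my - b * mx          # intercept: the line passes through (mean x, mean y)
    return a, b


# Hypothetical paired observations
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
a, b = least_squares(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]   # observed minus fitted
print(f"y = {a:.2f} + {b:.2f}x")
```

The residuals of a least squares line with an intercept always sum to zero, which gives a quick sanity check on a fitted line.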
62Age and urea level
63Residuals
64Method of least squares