Title: Introduction to Statistical Inference
1Introduction to Statistical Inference
- J. Verducci
- MBI Summer Workshop
- August, 2005
2Recommended Text
- Fred L. RAMSEY and Daniel W. SCHAFER.
- The Statistical Sleuth
- A Course in Methods of Data Analysis
- Belmont, CA: Duxbury, 2002, xxvi + 742 pp.,
- $97.95 (H + CD),
- ISBN 0-534-38670-9.
3Outline
- Underlying philosophies about data
- Basics of probability and random variables
- Estimating a population proportion
- Inferring a difference between two distributions
- Assumptions about distributional forms
  - Normal Theory
  - Nonparametrics
- Hypothesis Testing
  - t-test
  - Mann-Whitney-Wilcoxon test
- Multiple Comparisons
  - Bonferroni
  - False Discovery Rate
4Affymetrix MAS 5.0 Expression Set
5Expression Data Matrix: 30,000 genes x 30 patients
6Philosophy
- Frequentist: Observed data X is an imperfect representation of an underlying idealized fixed truth $\theta$.
  - Law of Large Numbers: When experiments are repeated faithfully, the average of observations comes closer to their idealization.
- Bayesian: Observed data X is fixed, and the unknown generating parameter $\theta$ is random.
  - Certainty about $\theta$ depends on both empirical information X and prior knowledge about $\theta$.
7Examples: $\theta$ as a Population Percentage or Average
- Parameter $\theta$
  - Percent of population with a particular allele
  - Percent of free throws made by Shaq over his entire career
  - Mean expression level of BRCA1 gene in breast cancer cells
- Statistic x
  - Percent observed in a sample of 100 people
  - Set of Shaq's yearly free-throw percentages up to June, 2005
  - Sample averages from patients in Stages 1-4, or from patients with high and low HER2 expression
8Key Terms
- Population (Sample Space $\Omega$): set of all possible outcomes of an experiment
- Sample: subset of the population (Event) that is observed
- (Generative / Probability) Model: description of how samples are obtained from the population
- Parameter: a feature of the population used to describe the model
- Statistic: a summary of the sample that conveys information about the parameter of interest.
9Axioms of Probability
- Needed to specify sampling and modeling
- Definition: A probability measure P is a function from the set of all possible events into [0,1] such that
  - $P(\emptyset) = 0$
  - $P(\Omega) = 1$
  - $P\left(\bigcup_i A_i\right) = \sum_i P(A_i)$ for countable collections of disjoint events $A_i$
10Random Variables
- A random variable X is a function from the sample space $\Omega$ into the real numbers $\mathbb{R}$
  - $X: \Omega \to \mathbb{R}$
  - $X(\omega) = x$
- The value x is called a realization of the random variable X. It can also be thought of as a statistic, since it is a function/summary of the sample $\omega$.
11Example
- Experiment
  - Roll two dice (one red, one green)
  - $\Omega = \{(i,j) : i = 1,\dots,6;\ j = 1,\dots,6\}$
- Probability Model (based on symmetry)
  - $P(\{(i,j)\}) = 1/36$ for each ordered pair (i,j)
- Random variable $X((i,j)) = i + j$
- The probability model induces a probability function $f_X$ on the possible values x of X.
  - $f_X(x) = (6 - |x-7|)/36$, $x = 2,\dots,12$
12Probability Function for Sum of Two Dice
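The probability function of the sum can be tabulated directly from the 36 equally likely outcomes. The following is a minimal sketch, not part of the original slides; the names `omega`, `f_X` and `closed_form` are chosen here for illustration. It enumerates the sample space, builds the induced probability function, and prints a crude text bar chart.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space: all ordered pairs (i, j), each with probability 1/36.
omega = list(product(range(1, 7), repeat=2))
p_outcome = Fraction(1, len(omega))

# Induced probability function of X((i, j)) = i + j.
counts = Counter(i + j for i, j in omega)
f_X = {x: counts[x] * p_outcome for x in sorted(counts)}


def closed_form(x):
    """Closed form (6 - |x - 7|) / 36 for x = 2, ..., 12."""
    return Fraction(6 - abs(x - 7), 36)


# One row per value of the sum: value, probability, bar of height 36 * f_X(x).
for x, p in f_X.items():
    print(f"{x:2d}  {str(p):>5}  {'#' * counts[x]}")
```

Exact fractions are used so that the enumerated values can be compared with the closed form without rounding error.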
13Independence
- Two random variables Y and Z are independent if, for all possible y, z,
  - $P(Y = y \text{ and } Z = z) = P(Y = y)\,P(Z = z)$
- Dice Example
  - Let
    - $Y((i,j)) = i$
    - $Z((i,j)) = j$
  - Then, for y, z in $\{1,2,3,4,5,6\}$,
    - $P(Y = y \text{ and } Z = z) = 1/36 = (1/6)(1/6) = P(Y = y)\,P(Z = z)$
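The product rule in the dice example can be checked by brute force. This is a small sketch under the symmetric 1/36 model; the helper names `prob` and `independent` are illustrative, not from the slides. It also shows a pair that fails the check: the red die and the sum of both dice.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))


def prob(event):
    """P(event) under the symmetric model: (# outcomes in event) / 36."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))


def independent(U, V):
    """Check P(U=u and V=v) == P(U=u) P(V=v) for all attainable u, v."""
    u_values = {U(w) for w in omega}
    v_values = {V(w) for w in omega}
    return all(
        prob(lambda w: U(w) == u and V(w) == v)
        == prob(lambda w: U(w) == u) * prob(lambda w: V(w) == v)
        for u in u_values
        for v in v_values
    )


def Y(w):
    return w[0]  # red die


def Z(w):
    return w[1]  # green die


def X(w):
    return w[0] + w[1]  # sum of the two dice


print(independent(Y, Z))  # True
print(independent(Y, X))  # False: e.g. P(Y=1 and X=12) = 0
```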
14iid Sample
- iid: independent, identically distributed
- Model
  - $X_1, X_2, \dots, X_n$ are mutually independent, that is,
    - $P(X_1 = x_1, \dots, X_n = x_n) = P(X_1 = x_1) \cdots P(X_n = x_n)$
  - $X_1, X_2, \dots, X_n$ have the same probability function $f(\cdot \mid \theta)$
    - $X_i \sim f(x \mid \theta)$, $i = 1,\dots,n$
15Estimating Population Proportion $\theta$
- Code $X_i$ =
  - 1 if the ith observation has the characteristic of interest
  - 0 otherwise, for $i = 1,\dots,n$.
- Bernoulli Distribution
  - $f_X(x \mid \theta)$ =
    - $\theta$ for x = 1
    - $1-\theta$ for x = 0
    - 0 otherwise
- $X_i$ are iid Bernoulli($\theta$), $i = 1,\dots,n$.
16Maximum Likelihood Estimate
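- $Y = \sum_{i=1}^{n} X_i$ has the Binomial(n, $\theta$) distribution
  - $P(Y = y \mid \theta) = \binom{n}{y}\,\theta^{y}(1-\theta)^{n-y}$, $y = 0,\dots,n$
- The value of $\theta$ that maximizes this probability is called the maximum likelihood estimate; it has the form
  - $\hat{\theta} = Y/n$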
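To see numerically that the binomial likelihood peaks at the sample proportion, one can evaluate it on a grid. This is a minimal sketch: the counts `n = 100` and `y = 37` are hypothetical values chosen for illustration, and a grid search stands in for the calculus argument.

```python
from math import comb


def binomial_pmf(y, n, theta):
    """P(Y = y) when Y ~ Binomial(n, theta)."""
    return comb(n, y) * theta**y * (1 - theta) ** (n - y)


def grid_mle(y, n, steps=1000):
    """Value of theta on an evenly spaced grid that maximizes the likelihood."""
    grid = [k / steps for k in range(steps + 1)]
    return max(grid, key=lambda t: binomial_pmf(y, n, t))


n, y = 100, 37  # hypothetical sample: 37 of 100 have the characteristic
theta_hat = grid_mle(y, n)
print(theta_hat, y / n)
```

The grid maximizer and the closed form y / n agree to within the grid spacing, which is what the maximum likelihood argument predicts.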
17Generalization to Continuous Means
- Let N be the size of the whole population.
- Estimate the population average
  - $\theta = \frac{1}{N}\sum_{k=1}^{N} x_k$
- using the sample average
  - $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
18Central Limit Theorem
- $Y_i$ iid Binomial(n, $\theta$), n large, $0 < \theta < 1$
- Standardize each count: $Z_i = (Y_i - n\theta)/\sqrt{n\theta(1-\theta)}$
- Simulating 1,000,000 such $Z_i$ produces a bell-shaped curve, whose limiting form is called the Gaussian (also called Normal) Distribution
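The simulation described here can be sketched in a few lines. This version assumes that each Z_i is the standardized binomial count (Y_i - n·theta) / sqrt(n·theta·(1 - theta)); the values `n = 50`, `theta = 0.3`, the seed, and the reduced number of draws (20,000 rather than 1,000,000, to keep pure-Python run time short) are illustrative choices, not taken from the slides.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def standardized_binomial(n, theta, rng):
    """Draw Y ~ Binomial(n, theta) as a sum of Bernoulli trials, then standardize."""
    y = sum(rng.random() < theta for _ in range(n))
    return (y - n * theta) / (n * theta * (1 - theta)) ** 0.5


rng = random.Random(2005)          # fixed seed so the run is reproducible
n, theta, draws = 50, 0.3, 20_000  # illustrative settings
z = [standardized_binomial(n, theta, rng) for _ in range(draws)]

# For a standard normal limit, expect mean near 0, sd near 1,
# and roughly 95% of draws inside (-1.96, 1.96).
inside = sum(abs(v) < 1.96 for v in z) / draws
print(round(mean(z), 3), round(pstdev(z), 3), round(inside, 3))

# Crude text histogram: one row per half-unit bin of z.
bins = Counter(round(2 * v) / 2 for v in z)
for b in sorted(bins):
    print(f"{b:5.1f} {'#' * (bins[b] // 100)}")
```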
19Normal Family of Distributions
For any interval (a,b), the area under a density function between a and b represents the probability that a random variable is in that interval.
- A density function f is a nonnegative function such that $\int_{-\infty}^{\infty} f(x)\,dx = 1$
- Standard normal distribution is described by the density function $\phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^{2}/2}$
- If Z has a standard normal distribution, then $X = \theta + \sigma Z$ has a $N(\theta, \sigma^2)$ distribution with mean $\theta$ and standard deviation $\sigma$.
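The "area under the density" interpretation and the location-scale relation between the standard normal and the N(theta, sigma^2) family can both be checked numerically. This is a sketch using a simple midpoint rule; the function names and the example interval are illustrative assumptions.

```python
from math import erf, exp, pi, sqrt


def normal_density(x, theta=0.0, sigma=1.0):
    """Density of N(theta, sigma^2) at x."""
    z = (x - theta) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))


def normal_cdf(x, theta=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(theta, sigma^2), via the error function."""
    return 0.5 * (1 + erf((x - theta) / (sigma * sqrt(2))))


def area(a, b, theta=0.0, sigma=1.0, steps=10_000):
    """Midpoint-rule approximation of the area under the density on (a, b)."""
    h = (b - a) / steps
    return h * sum(
        normal_density(a + (k + 0.5) * h, theta, sigma) for k in range(steps)
    )


# Standard normal: about 95% of the probability lies in (-1.96, 1.96).
print(round(area(-1.96, 1.96), 4))

# X = theta + sigma * Z with theta = 3, sigma = 2:
# P(1 < X < 5) equals P(-1 < Z < 1).
print(round(area(1, 5, theta=3, sigma=2), 4), round(area(-1, 1), 4))
```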
20Normal Family of Density Functions
21Inferring a Difference in Means Between Two Normal Populations
- Take independent samples from two populations
  - $X_i$ iid $N(\theta_1, \sigma^2)$, $i = 1,\dots,n_1$
  - $Y_j$ iid $N(\theta_2, \sigma^2)$, $j = 1,\dots,n_2$
- For simplicity assume that the standard deviation (scale parameter) $\sigma$ is the same in both populations.
22Example: Expression of RNA coding MCL1 protein
- Samples (Golub study)
  - $X_i$: sample from $n_1 = 27$ ALL patients
  - $Y_j$: sample from $n_2 = 11$ AML patients
- Is the mean level of MCL1 RNA expression different for the two populations of patients?
- Sample means (Max Likelihood Estimates)
  - ALL: 6.8
  - AML: 8.1
- Is this enough evidence to conclude that the mean MCL1 expression is higher for the AML population than for the ALL population?
23Hypothesis Testing
- Null Hypothesis
  - Suppose AML mean is less than or equal to ALL mean
  - Least Favorable Configuration is when both means are equal ($\theta_1 = \theta_2$)
- P-value
  - Under the least favorable configuration, what is the probability that the sample means would be at least as far apart, in the same direction, as what was observed (8.1 - 6.8 = 1.3)?
24Sampling Distribution
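- Under the least favorable configuration, how would the difference D between the sample means vary over different samples of the same sizes?
  - $D \sim N\left(0,\ \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\right)$
25Importance of $\sigma$
- If $\sigma = 5$,
  - $\mathrm{sd}(D) = 5\sqrt{1/27 + 1/11} \approx 1.79$
  - $P(D > 1.3) \approx .23$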
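How strongly the P-value depends on the scale parameter can be explored with a short calculation. This sketch assumes that D is the difference of two independent sample means, so that sd(D) = sigma * sqrt(1/n1 + 1/n2), and that sigma is known; the grid of sigma values is an illustrative choice.

```python
from math import erfc, sqrt


def sd_of_difference(sigma, n1, n2):
    """sd of (mean of n2 obs) - (mean of n1 obs), common sd sigma, independent samples."""
    return sigma * sqrt(1 / n1 + 1 / n2)


def upper_tail_normal(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * erfc(z / sqrt(2))


n1, n2, observed = 27, 11, 1.3  # sample sizes and observed difference from the example
for sigma in (0.5, 1.0, 2.0, 5.0):  # illustrative values of the common sd
    sd = sd_of_difference(sigma, n1, n2)
    p = upper_tail_normal(observed / sd)
    print(f"sigma={sigma}: sd(D)={sd:.2f}, P(D > {observed})={p:.3g}")
```

The same observed difference of 1.3 is overwhelming evidence when sigma is small and unremarkable when sigma is large, which is why sigma must be estimated before a conclusion can be drawn.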
26Estimating $\sigma^2$
- In any distribution, $\sigma^2$ is the mean squared distance of an observation from its mean.
- Estimate this by the average squared difference of a sample observation from the sample mean.
- Adjust to compensate for the fact that
  - $\sum_i (x_i - c)^2$ is minimized by c = the sample average.
- Pooled estimate of common $\sigma^2$ is
  - $S_{pooled}^2 = \dfrac{\sum_{i=1}^{n_1} (x_i - \bar{x})^2 + \sum_{j=1}^{n_2} (y_j - \bar{y})^2}{n_1 + n_2 - 2}$
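27Student's t test statistic
- $T = \dfrac{\bar{Y} - \bar{X}}{S_{pooled}\sqrt{1/n_1 + 1/n_2}}$, which has a t distribution with $n_1 + n_2 - 2$ degrees of freedom when $\theta_1 = \theta_2$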
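The pooled variance and the two-sample t statistic can be computed as follows. This is a sketch assuming the standard pooled form, with the sum of squared deviations from each sample mean divided by n1 + n2 - 2; the two small samples are made-up numbers for illustration only, not the Golub data.

```python
from math import sqrt
from statistics import mean


def pooled_variance(x, y):
    """Pooled estimate of a common variance from two samples."""
    xbar, ybar = mean(x), mean(y)
    ss = sum((v - xbar) ** 2 for v in x) + sum((v - ybar) ** 2 for v in y)
    return ss / (len(x) + len(y) - 2)


def pooled_t(x, y):
    """Return (T, df) with T = (ybar - xbar) / (S_pooled * sqrt(1/n1 + 1/n2))."""
    s_pooled = sqrt(pooled_variance(x, y))
    t = (mean(y) - mean(x)) / (s_pooled * sqrt(1 / len(x) + 1 / len(y)))
    return t, len(x) + len(y) - 2


# Hypothetical expression values (NOT the Golub data).
x = [6.1, 6.9, 7.2, 6.5, 7.0]
y = [7.8, 8.4, 8.0, 8.6]

t, df = pooled_t(x, y)
print(round(t, 2), df)
```

Dividing by n1 + n2 - 2 rather than n1 + n2 compensates for having estimated the two means from the same data.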
28MCL1 Expression in ALL and AML
- Sample averages
  - ALL: 6.8 ($n_1 = 27$)
  - AML: 8.1 ($n_2 = 11$)
- Pooled estimate of $\sigma$
  - $S_{pooled} = .622$
- Test statistic
  - $T_{observed} = 5.76$
- P-value
  - $P(T > 5.76) < 10^{-7}$
- Conclusion
  - Mean value of MCL1 is higher in AML population
29Checking Assumptions
- $X_i$ and $Y_j$ each have normal distributions
  - Check using boxplots. Look for
    - Skewness
    - Outliers
  - If necessary, correct by either
    - Transformation
    - Use of Non-Parametric Test (Mann-Whitney-Wilcoxon)
- $X_i$ and $Y_j$ distributions have a common standard deviation
  - Check using boxplots
  - If necessary, correct using Welch modification
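The slides do not spell out the Welch modification. A common form, sketched here as an assumption, replaces the pooled standard error with separate variance estimates and uses the Welch-Satterthwaite approximation for the degrees of freedom. The sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch's unequal-variance t statistic and approximate degrees of freedom."""
    vx = variance(x) / len(x)  # estimated variance of the mean of x
    vy = variance(y) / len(y)  # estimated variance of the mean of y
    t = (mean(y) - mean(x)) / sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (vx + vy) ** 2 / (vx**2 / (len(x) - 1) + vy**2 / (len(y) - 1))
    return t, df


# Hypothetical expression values (NOT the Golub data).
x = [6.1, 6.9, 7.2, 6.5, 7.0]
y = [7.8, 8.4, 8.0, 8.6]

t, df = welch_t(x, y)
print(round(t, 2), round(df, 2))
```

The approximate degrees of freedom always fall between the smaller of n1 - 1 and n2 - 1 and the pooled value n1 + n2 - 2, so the Welch test is never less conservative than the pooled test about the amount of information available.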
30Box Plots of Raw MCL1 Expression
31Box Plots of Log MCL1 Expression
32Mann-Whitney-Wilcoxon Test
- Based on number of pairs $(X_i, Y_j)$ with $X_i < Y_j$
- Results are invariant to monotone increasing transformations (such as the log)
- Conclusion applies to population medians
- For MCL1 data, P-value is also $< 10^{-7}$
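The pair-counting statistic behind the test, and its invariance under an order-preserving transformation, can be illustrated with a few lines. This is a sketch: counting ties as one half is a common convention assumed here, and the data are hypothetical.

```python
from math import log


def u_statistic(x, y):
    """Number of pairs (x_i, y_j) with x_i < y_j; ties count one half (assumed convention)."""
    return sum(
        1.0 if xi < yj else 0.5 if xi == yj else 0.0
        for xi in x
        for yj in y
    )


# Hypothetical expression values (NOT the Golub data).
x = [6.1, 6.9, 7.2, 6.5, 7.0]
y = [7.1, 8.4, 8.0, 6.8]

u_raw = u_statistic(x, y)
u_log = u_statistic([log(v) for v in x], [log(v) for v in y])

# The log preserves order, so the count of pairs is unchanged.
print(u_raw, u_log, len(x) * len(y))  # 16.0 16.0 20
```

Because the statistic depends only on the ordering of the observations, the test gives the same answer on the raw and the log scale, unlike the t test.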
33Multiple Comparisons
- Chalk Board Presentation or Discussed Later