Introduction to Statistical Inference - PowerPoint PPT Presentation

Slides: 34
Provided by: stat267

Transcript and Presenter's Notes

Title: Introduction to Statistical Inference


1
Introduction to Statistical Inference
  • J. Verducci
  • MBI Summer Workshop
  • August, 2005

2
Recommended Text
  • Fred L. Ramsey and Daniel W. Schafer.
  • The Statistical Sleuth: A Course in Methods of Data Analysis
  • Belmont, CA: Duxbury, 2002, xxvi + 742 pp.,
  • 97.95 (H CD),
  • ISBN 0-534-38670-9.

3
Outline
  • Underlying philosophies about data
  • Basics of probability and random variables
  • Estimating a population proportion
  • Inferring a difference between two distributions
  • Assumptions about distributional forms
      • Normal Theory
      • Nonparametrics
  • Hypothesis Testing
      • t-test
      • Mann-Whitney-Wilcoxon test
  • Multiple Comparisons
      • Bonferroni
      • False Discovery Rate

4
Affymetrix MAS 5.0 Expression Set
5
Expression Data Matrix: 30,000 genes x 30 patients
6
Philosophy
  • Frequentist: Observed data X is an imperfect
    representation of an underlying idealized fixed
    truth $\theta$.
  • Law of Large Numbers: When experiments are
    repeated faithfully, the average of the observations
    comes closer to its idealization.
  • Bayesian: Observed data X is fixed, and the
    unknown generating parameter $\theta$ is random.
  • Certainty about $\theta$ depends on both empirical
    information X and prior knowledge about $\theta$.

7
Examples: $\theta$ as a Population Percentage or Average
  • Parameter $\theta$
      • Percent of population with a particular allele
      • Percent of free throws made by Shaq over his
        entire career
      • Mean expression level of BRCA1 gene in breast
        cancer cells
  • Statistic x
      • Percent observed in a sample of 100 people
      • Set of Shaq's yearly free-throw percentages up to
        June, 2005
      • Sample averages from patients in Stages 1-4, or from
        patients with high and low HER2 expression
8
Key Terms
  • Population (sample space $\Omega$): set of all possible
    outcomes of an experiment
  • Sample: subset of the population (event) that
    is observed
  • (Generative / probability) model: description of
    how samples are obtained from the population
  • Parameter: a feature of the population used to
    describe the model
  • Statistic: a summary of the sample that conveys
    information about the parameter of interest.

9
Axioms of Probability
  • Needed to specify sampling and modeling
  • Definition: A probability measure P is a
    function from the set of all possible events into
    [0,1] such that
      • $P(\emptyset) = 0$
      • $P(\Omega) = 1$
      • $P(\bigcup_i A_i) = \sum_i P(A_i)$ for countable collections of
        disjoint events $A_i$

10
Random Variables
  • A random variable X is a function from the sample
    space $\Omega$ into the real numbers $\mathbb{R}$
  • $X : \Omega \to \mathbb{R}$
  • $X(\omega) = x$
  • The value x is called a realization of the random
    variable X. It can also be thought of as a
    statistic, since it is a function/summary of the
    sample $\omega$.

11
Example
  • Experiment
      • Roll two dice (one red, one green)
      • $\Omega = \{(i,j) : i = 1,\dots,6;\ j = 1,\dots,6\}$
  • Probability model (based on symmetry)
      • $P(\{(i,j)\}) = 1/36$ for each ordered pair (i,j)
  • Random variable: $X((i,j)) = i + j$
  • The probability model induces a probability
    function $f_X$ on the possible values x of X.
  • $f_X(x) = (6 - |x - 7|) / 36$, $x = 2,\dots,12$
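
The induced probability function can be checked by brute-force enumeration. The following is a minimal sketch, assuming the symmetric model with probability 1/36 per ordered pair; the names `omega`, `f_X`, and `closed_form` are illustrative and not from the slides.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered pairs (red, green), each with probability 1/36.
omega = list(product(range(1, 7), repeat=2))

# Induced probability function of the sum X((i, j)) = i + j.
counts = Counter(i + j for i, j in omega)
f_X = {x: Fraction(c, len(omega)) for x, c in counts.items()}

def closed_form(x):
    """Closed-form probability (6 - |x - 7|) / 36 for x = 2, ..., 12."""
    return Fraction(6 - abs(x - 7), 36)

# Print the enumerated and closed-form probabilities side by side.
for x in sorted(f_X):
    print(x, f_X[x], closed_form(x))
```

Exact fractions are used instead of floats so that the comparison between enumeration and formula is an equality check rather than an approximation.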

12
Probability Function for Sum of Two Dice
13
Independence
  • Two random variables Y and Z are independent if,
    for all possible y, z,
  • $P(Y = y \text{ and } Z = z) = P(Y = y)\,P(Z = z)$
  • Dice example
  • Let
      • $Y((i,j)) = i$
      • $Z((i,j)) = j$
  • Then, for y, z in {1,2,3,4,5,6},
  • $P(Y = y \text{ and } Z = z) = 1/36 = (1/6)(1/6) = P(Y = y)\,P(Z = z)$
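
The product rule can be verified exhaustively for the dice model. This is a sketch under the same symmetric 1/36 assumption; the helper names `prob` and `is_independent_pair` are hypothetical. As a contrast, it also checks the red die against the sum of both dice, which is not an independent pair.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(omega))  # symmetric model: 1/36 per ordered pair

def prob(event):
    """Probability of the set of outcomes (i, j) for which event(i, j) is true."""
    return sum(p for (i, j) in omega if event(i, j))

def is_independent_pair(u, v):
    """Check P(U = a and V = b) == P(U = a) * P(V = b) for all values a, b."""
    u_vals = {u(i, j) for i, j in omega}
    v_vals = {v(i, j) for i, j in omega}
    return all(
        prob(lambda i, j: u(i, j) == a and v(i, j) == b)
        == prob(lambda i, j: u(i, j) == a) * prob(lambda i, j: v(i, j) == b)
        for a in u_vals
        for b in v_vals
    )

Y = lambda i, j: i       # value of the red die
Z = lambda i, j: j       # value of the green die
X = lambda i, j: i + j   # sum of the two dice

print(is_independent_pair(Y, Z))  # the two dice
print(is_independent_pair(Y, X))  # red die versus the sum
```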

14
iid Sample
  • iid: independent, identically distributed
  • Model
      • $X_1, X_2, \dots, X_n$ are mutually independent, that is,
        $P(X_1 = x_1, \dots, X_n = x_n) = P(X_1 = x_1) \cdots P(X_n = x_n)$
      • $X_1, X_2, \dots, X_n$ have the same probability function
        $f(\cdot \mid \theta)$
      • $X_i \sim f(x \mid \theta)$, $i = 1,\dots,n$

15
Estimating Population Proportion $\theta$
  • Code $X_i$ =
      • 1 if the ith observation has the characteristic
        of interest
      • 0 otherwise, for $i = 1,\dots,n$.
  • Bernoulli distribution: $f_X(x \mid \theta)$ =
      • $\theta$ for x = 1
      • $1 - \theta$ for x = 0
      • 0 otherwise
  • $X_i$ are iid Bernoulli($\theta$), $i = 1,\dots,n$.

16
Maximum Likelihood Estimate
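  • $Y = \sum_i X_i$ has the Binomial(n, $\theta$) distribution:
    $P(Y = y) = \binom{n}{y}\,\theta^{y}(1-\theta)^{n-y}$, $y = 0,\dots,n$
  • The value of $\theta$ that maximizes this probability
    is called the maximum likelihood estimate; it has
    the form $\hat{\theta} = y/n$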
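
For the binomial model the maximizer is the sample proportion y/n, which a grid search over the likelihood can confirm numerically. This is a rough sketch: the counts `n = 100` and `y = 37` are made up for illustration, and `grid_mle` is a hypothetical helper, not a library function.

```python
from math import comb

def binomial_pmf(y, n, theta):
    """P(Y = y) for Y ~ Binomial(n, theta)."""
    return comb(n, y) * theta**y * (1 - theta) ** (n - y)

def grid_mle(y, n, steps=1000):
    """Return the grid value of theta in [0, 1] with the largest likelihood."""
    grid = [k / steps for k in range(steps + 1)]
    return max(grid, key=lambda t: binomial_pmf(y, n, t))

n, y = 100, 37  # hypothetical data: 37 of 100 observations have the characteristic
theta_hat = grid_mle(y, n)
print(theta_hat, y / n)
```

A grid search is used only to keep the example transparent; in practice the closed form y/n makes any search unnecessary.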

17
Generalization to Continuous Means
  • Let N be the size of the whole population.
  • Estimate the population average
    $\theta = \frac{1}{N}\sum_{k=1}^{N} x_k$
    using the sample average
    $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
18
Central Limit Theorem
  • $Y_i$ iid Binomial(n, $\theta$), n large, $0 < \theta < 1$
  • Standardize: $Z_i = (Y_i - n\theta) / \sqrt{n\theta(1-\theta)}$

Simulating 1,000,000 such $Z_i$ produces a bell-shaped
curve, whose limiting form is called the
Gaussian (also called Normal) distribution.
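
A small simulation illustrates the bell shape. The sketch below assumes each Z is the standardized binomial count (Y - nθ)/sqrt(nθ(1-θ)), and uses hypothetical settings (n = 100, θ = 0.3, 20,000 draws rather than 1,000,000, to keep it fast). If the limit is standard normal, the mean should be near 0, the standard deviation near 1, and roughly 95% of draws should fall within ±1.96.

```python
import random
from math import sqrt
from statistics import mean, pstdev

def standardized_binomial(n, theta, rng):
    """Draw Y ~ Binomial(n, theta) as a sum of Bernoulli draws, then standardize."""
    y = sum(rng.random() < theta for _ in range(n))
    return (y - n * theta) / sqrt(n * theta * (1 - theta))

rng = random.Random(2005)            # fixed seed so the run is reproducible
n, theta, draws = 100, 0.3, 20_000   # hypothetical settings
z = [standardized_binomial(n, theta, rng) for _ in range(draws)]

# Share of draws within +/-1.96; approximate because Y is discrete.
within = sum(abs(v) <= 1.96 for v in z) / draws
print(f"mean={mean(z):.3f}  sd={pstdev(z):.3f}  share within +/-1.96: {within:.3f}")
```
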
19
Normal Family of Distributions
For any interval (a,b), the area under a density
function between a and b represents
the probability that a random variable is in
that interval.
  • A density function f is a nonnegative function
    such that $\int_{-\infty}^{\infty} f(x)\,dx = 1$
  • The standard normal distribution is described by the
    density function $\phi(z) = \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}$
  • If Z has a standard normal distribution, then
    $X = \theta + \sigma Z$ has a $N(\theta, \sigma^2)$ distribution with mean $\theta$
    and standard deviation $\sigma$.
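
The "area under the density" interpretation can be checked numerically. The sketch below integrates the normal density with the trapezoid rule and compares the result with the cumulative distribution function from Python's `statistics.NormalDist`. The mean 7, standard deviation 2, and interval (5, 9) are hypothetical values chosen only for illustration.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_density(x, theta=0.0, sigma=1.0):
    """Density of N(theta, sigma^2), i.e. of X = theta + sigma * Z."""
    z = (x - theta) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))

def area(a, b, theta=0.0, sigma=1.0, steps=10_000):
    """Trapezoid-rule area under the density between a and b."""
    h = (b - a) / steps
    inner = sum(normal_density(a + k * h, theta, sigma) for k in range(1, steps))
    ends = 0.5 * (normal_density(a, theta, sigma) + normal_density(b, theta, sigma))
    return h * (inner + ends)

theta, sigma = 7.0, 2.0   # hypothetical mean and standard deviation
a, b = 5.0, 9.0           # interval of one standard deviation around the mean
dist = NormalDist(theta, sigma)
exact = dist.cdf(b) - dist.cdf(a)
print(area(a, b, theta, sigma), exact)
```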

20
Normal Family of Density Functions
21
Inferring a Difference in Means Between Two Normal Populations
  • Take independent samples from two populations
      • $X_i$ iid $N(\theta_1, \sigma^2)$, $i = 1,\dots,n_1$
      • $Y_j$ iid $N(\theta_2, \sigma^2)$, $j = 1,\dots,n_2$
  • For simplicity assume that the standard deviation
    (scale parameter) $\sigma$ is the same in both
    populations.

22
Example: Expression of RNA coding MCL1 protein
  • Samples (Golub study)
      • $X_i$: sample from $n_1 = 27$ ALL patients
      • $Y_j$: sample from $n_2 = 11$ AML patients
  • Is the mean level of MCL1 RNA expression
    different for the two populations of patients?
  • Sample means (maximum likelihood estimates)
      • ALL: 6.8
      • AML: 8.1
  • Is this enough evidence to conclude that the mean
    MCL1 expression is higher for the AML population
    than for the ALL population?

23
Hypothesis Testing
  • Null hypothesis
      • Suppose the AML mean is less than or equal to the ALL
        mean ($\theta_2 \le \theta_1$)
      • The least favorable configuration is when both
        means are equal ($\theta_1 = \theta_2$)
  • P-value
      • Under the least favorable configuration, what
        is the probability that the sample means would be
        at least as far apart, in the same direction, as what was
        observed (8.1 - 6.8 = 1.3)?

24
Sampling Distribution
  • Under the least favorable configuration, how
    would the difference D between
    the sample means vary over different samples of
    the same sizes?
  • $D \sim N\left(0,\ \sigma^2 (1/n_1 + 1/n_2)\right)$

25
Importance of $\sigma$
  • If $\sigma = 5$,
  • sd(D) = $5\sqrt{1/27 + 1/11} \approx 1.79$
  • $P(D > 1.3) \approx 0.23$
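
How strongly the P-value depends on the common standard deviation can be seen by computing it for several values. This sketch assumes the standard result that the difference of two independent sample means has variance σ²(1/n₁ + 1/n₂) under equal means. The list of σ values and the function names are hypothetical; the sample sizes and observed difference are those of the MCL1 example.

```python
from math import sqrt
from statistics import NormalDist

def sd_of_difference(sigma, n1, n2):
    """Standard deviation of the difference between two independent sample means."""
    return sigma * sqrt(1 / n1 + 1 / n2)

def one_sided_p_value(observed_diff, sigma, n1, n2):
    """P(D > observed_diff) when D ~ N(0, sd_of_difference^2)."""
    sd = sd_of_difference(sigma, n1, n2)
    return 1 - NormalDist(0, sd).cdf(observed_diff)

n1, n2 = 27, 11       # ALL and AML sample sizes
observed = 8.1 - 6.8  # observed difference in sample means

for sigma in (0.5, 1.0, 2.0, 5.0):  # hypothetical values of the common sd
    sd = sd_of_difference(sigma, n1, n2)
    p = one_sided_p_value(observed, sigma, n1, n2)
    print(f"sigma={sigma:<4} sd(D)={sd:.3f}  P(D > {observed:.1f})={p:.3g}")
```

The same observed difference can look overwhelming or unremarkable depending on σ, which is why σ has to be estimated from the data.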

26
Estimating $\sigma^2$
  • In any distribution, $\sigma^2$ is the mean squared
    distance of an observation from its mean.
  • Estimate this by the average squared difference
    of a sample observation from the sample mean.
  • Adjust to compensate for the fact that
    $\sum_i (x_i - c)^2$ is minimized by c = the sample
    average.
  • Pooled estimate of common $\sigma^2$ is
    $s_{pooled}^2 = \frac{\sum_{i=1}^{n_1}(x_i - \bar{x})^2 + \sum_{j=1}^{n_2}(y_j - \bar{y})^2}{n_1 + n_2 - 2}$

27
Student's t test statistic
  • $T = \frac{\bar{y} - \bar{x}}{s_{pooled}\sqrt{1/n_1 + 1/n_2}}$
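
A minimal sketch of the equal-variance (pooled) two-sample t statistic follows. It assumes the usual pooled variance with n₁ + n₂ − 2 degrees of freedom. The two small samples are invented for illustration and are not the MCL1 measurements.

```python
from math import sqrt
from statistics import mean

def pooled_variance(x, y):
    """Pooled estimate of a common variance, with n1 + n2 - 2 degrees of freedom."""
    xbar, ybar = mean(x), mean(y)
    ss = sum((v - xbar) ** 2 for v in x) + sum((v - ybar) ** 2 for v in y)
    return ss / (len(x) + len(y) - 2)

def t_statistic(x, y):
    """(mean(y) - mean(x)) / (s_pooled * sqrt(1/n1 + 1/n2))."""
    s_pooled = sqrt(pooled_variance(x, y))
    return (mean(y) - mean(x)) / (s_pooled * sqrt(1 / len(x) + 1 / len(y)))

x = [6.1, 6.9, 7.2, 6.5, 7.0]  # hypothetical sample from population 1
y = [7.8, 8.4, 8.0, 8.3]       # hypothetical sample from population 2
print(pooled_variance(x, y), t_statistic(x, y))
```

Under the null hypothesis of equal means, this statistic is compared with a t distribution on n₁ + n₂ − 2 degrees of freedom.
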
28
MCL1 Expression in ALL and AML
  • Sample averages
      • ALL: 6.8 ($n_1 = 27$)
      • AML: 8.1 ($n_2 = 11$)
  • Pooled estimate of $\sigma$
      • $s_{pooled} = .622$
  • Test statistic
      • $T_{observed} = 5.76$
  • P-value
      • $P(T > 5.76) < 10^{-7}$
  • Conclusion
      • Mean value of MCL1 is higher in the AML population
29
Checking Assumptions
  • $X_i$ and $Y_j$ each have normal distributions
      • Check using boxplots. Look for
          • Skewness
          • Outliers
      • If necessary, correct by either
          • Transformation
          • Use of a nonparametric test (Mann-Whitney-Wilcoxon)
  • $X_i$ and $Y_j$ distributions have a common
    standard deviation
      • Check using boxplots
      • If necessary, correct using the Welch modification
30
Box Plots of Raw MCL1 Expression
31
Box Plots of Log MCL1 Expression
32
Mann-Whitney-Wilcoxon Test
  • Based on the number of pairs $(X_i, Y_j)$
    with $X_i < Y_j$
  • Results are invariant to monotone increasing
    transformations applied to both samples
  • Conclusion applies to population medians
  • For MCL1 data, P-value is also $< 10^{-7}$
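
The pair count behind the test, and its invariance under a log transformation, can be illustrated as follows. This sketch computes only the count of pairs, not the P-value, and the two samples are hypothetical raw expression values.

```python
from math import log

def u_statistic(x, y):
    """Number of pairs (x_i, y_j) with x_i < y_j (ties are not counted)."""
    return sum(1 for xi in x for yj in y if xi < yj)

x = [120.0, 310.0, 95.0, 410.0, 150.0]  # hypothetical raw values, group 1
y = [380.0, 900.0, 520.0, 1400.0]       # hypothetical raw values, group 2

u_raw = u_statistic(x, y)
u_log = u_statistic([log(v) for v in x], [log(v) for v in y])
print(u_raw, u_log, "out of", len(x) * len(y), "pairs")
```

Because the logarithm preserves order, every comparison $x_i < y_j$ has the same outcome on the raw and log scales, so the statistic is unchanged.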

33
Multiple Comparisons
  • Chalk Board Presentation or Discussed Later