Statistical Inference I: Hypothesis testing; sample size

Provided by: Kris147
Learn more at: https://web.stanford.edu

1
Statistical Inference I: Hypothesis testing;
sample size
2
Statistics Primer
  • Statistical Inference
  • Hypothesis testing
  • P-values
  • Type I error
  • Type II error
  • Statistical power
  • Sample size calculations

3
What is a statistic?
  • A statistic is any value that can be calculated
    from the sample data.
  • Sample statistics are calculated to give us an
    idea about the larger population.

4
Examples of statistics
  • mean
  • The average cost of a gallon of gas in the US is
    $2.65.
  • difference in means
  • The difference in the average gas price in Los
    Angeles ($2.96) compared with Des Moines, Iowa
    ($2.32) is 64 cents.
  • proportion
  • 67% of high school students in the U.S. exercise
    regularly.
  • difference in proportions
  • The difference in the proportion of Democrats who
    approve of Obama (83%) versus Republicans who do
    (14%) is 69%.

5
What is a statistic?
  • Sample statistics are estimates of population
    parameters.

6
Sample statistics estimate population parameters
7
What is sampling variation?
  • Statistics vary from sample to sample due to
    random chance.
  • Example
  • A population of 100,000 people has an average IQ
    of 100 (If you actually could measure them all!)
  • If you sample 5 random people from this
    population, what will you get?

8
Sampling Variation
Mean IQ = 100
9
Sampling Variation and Sample Size
  • Do you expect more or less sampling variability
    in samples of 10 people?
  • Of 50 people?
  • Of 1000 people?
  • Of 100,000 people?
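The questions above can be explored with a small simulation. The Python sketch below is illustrative only: it assumes IQ is normally distributed with a standard deviation of 15 (the slides give only the mean of 100) and draws from an effectively infinite population, so it does not cover the 100,000-person case, where sampling everyone leaves no sampling variability at all.

```python
import random
import statistics

def spread_of_sample_means(n, pop_mean=100.0, pop_sd=15.0, reps=300, seed=1):
    """Standard deviation of `reps` simulated sample means, each from n people.

    pop_sd=15 is an assumed value; it is not given in the slides.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        means.append(sum(sample) / n)
    return statistics.stdev(means)

# Larger samples give sample means that cluster more tightly around 100.
for n in (5, 10, 50, 1000):
    spread = spread_of_sample_means(n)
    print(f"n = {n:>4}: sample means vary with an SD of about {spread:.2f} IQ points")
```

The spread of the sample means shrinks as the sample size grows, which is the pattern the slide is asking about.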

10
Sampling Distributions
  • Most experiments are one-shot deals. So, how do
    we know if an observed effect from a single
    experiment is real or is just an artifact of
    sampling variability (chance variation)?
  • Answering this requires a priori knowledge about
    how sampling variability works.
  • Question: Why have I made you learn about
    probability distributions and about how to
    calculate and manipulate expected value and
    variance?
  • Answer: Because they form the basis of describing
    the distribution of a sample statistic.

11
Standard error
  • Standard Error is a measure of sampling
    variability.
  • Standard error is the standard deviation of a
    sample statistic.
  • It's a theoretical quantity! What would the
    distribution of my statistic be if I could repeat
    my experiment many times (with fixed sample
    size)? How much chance variation is there?
  • Standard error decreases with increasing sample
    size and increases with increasing variability of
    the outcome (e.g., IQ).
  • Standard errors can be predicted by computer
    simulation or mathematical theory (formulas).
  • The formula for standard error is different for
    every type of statistic (e.g., mean, difference
    in means, odds ratio).
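As one concrete case of such a formula, the standard error of a sample mean is the outcome's standard deviation divided by the square root of the sample size. The Python sketch below applies it to the IQ example; the standard deviation of 15 is an assumed value, not one taken from the slides.

```python
import math

def standard_error_of_mean(sd, n):
    """Standard error of a sample mean: outcome SD divided by sqrt(n)."""
    return sd / math.sqrt(n)

ASSUMED_IQ_SD = 15.0  # assumption for illustration; not given in the slides

for n in (5, 50, 1000):
    se = standard_error_of_mean(ASSUMED_IQ_SD, n)
    print(f"n = {n:>4}: standard error of the mean = {se:.2f}")

# More variable outcome -> larger standard error; larger n -> smaller standard error.
print(standard_error_of_mean(30.0, 50) > standard_error_of_mean(15.0, 50))
```

Quadrupling the sample size halves the standard error, and doubling the outcome's standard deviation doubles it.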

12
What is statistical inference?
  • The field of statistics provides guidance on how
    to make conclusions in the face of chance
    variation (sampling variability).

13
Example 1: Difference in proportions
  • Research Question: Are antidepressants a risk
    factor for suicide attempts in children and
    adolescents?
  • Example modified from: Olfson et al.
    Antidepressant Drug Therapy and Suicide in
    Severely Depressed Children and Adults. Arch Gen
    Psychiatry. 2006;63:865-872.

14
Example 1
  • Design: Case-control study
  • Methods: Researchers used Medicaid records to
    compare prescription histories between 263
    children and teenagers (6-18 years) who had
    attempted suicide and 1241 controls who had never
    attempted suicide (all subjects suffered from
    depression).
  • Statistical question: Is a history of use of
    antidepressants more common among cases than
    controls?

15
Example 1
  • Statistical question: Is a history of use of
    antidepressants more common among cases than
    controls?
  • What will we actually compare?
  • Proportion of cases who used antidepressants in
    the past vs. proportion of controls who did

16
Results
                               Cases (n = 263)   Controls (n = 1241)
                               No. (%)           No. (%)
Any antidepressant drug ever   120 (46%)         448 (36%)

Difference = 10%
17
What does a 10% difference mean?
  • Before we perform any formal statistical analysis
    on these data, we already have a lot of
    information.
  • Look at the basic numbers first; THEN consider
    statistical significance as a secondary guide.

18
Is the association statistically significant?
  • This 10% difference could reflect a true
    association or it could be a fluke in this
    particular sample.
  • The question: is 10% bigger or smaller than the
    expected sampling variability?

19
What is hypothesis testing?
  • Statisticians try to answer this question with a
    formal hypothesis test

20
Hypothesis testing
Step 1: Assume the null hypothesis.
Null hypothesis: There is no association between
antidepressant use and suicide attempts in the
target population (the difference is 0%).
21
Hypothesis Testing
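Step 2: Predict the sampling variability assuming
the null hypothesis is true: math theory (formula)
The standard error of the difference in two
proportions, using the pooled proportion
$\hat{p} = 568/1504 \approx 0.378$, is
$$SE = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)} = \sqrt{0.378 \times 0.622 \times \left(\frac{1}{263}+\frac{1}{1241}\right)} \approx 0.033$$
Thus, we expect to see differences between the
groups as big as about 6.6% (2 standard errors)
just by chance.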
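The 6.6% figure can be checked numerically. The Python sketch below assumes the pooled-proportion form of the standard error (both groups share one true proportion under the null), chosen here because it reproduces the slide's number; the function name is illustrative.

```python
import math

def se_diff_proportions_null(x1, n1, x2, n2):
    """Pooled standard error of p1 - p2, assuming one shared true proportion."""
    p_pooled = (x1 + x2) / (n1 + n2)
    return math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))

# 120 of 263 cases and 448 of 1241 controls had used antidepressants.
se = se_diff_proportions_null(120, 263, 448, 1241)
print(f"standard error    = {se:.4f}")
print(f"2 standard errors = {2 * se:.3f}")
```

Two standard errors come out close to 0.066, in line with the slide.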
22
Hypothesis Testing
Step 2: Predict the sampling variability assuming
the null hypothesis is true: computer simulation
  • In computer simulation, you simulate taking
    repeated samples of the same size from the same
    population and observe the sampling variability.
  • I used computer simulation to take 1000 samples
    of 263 cases and 1241 controls assuming the null
    hypothesis is true (i.e., no difference in
    antidepressant use between the groups).
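A simulation along these lines can be sketched in Python. This is not the presenter's code: it assumes both groups share the pooled proportion of antidepressant use (568/1504) under the null, and the seed and function name are arbitrary choices.

```python
import random
import statistics

def simulate_null_differences(n_cases=263, n_controls=1241,
                              p_null=568 / 1504, reps=1000, seed=2):
    """Simulate `reps` studies in which cases and controls share one true proportion.

    p_null is the pooled proportion from the results table (an assumption
    about how the null population was set up).
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        users_cases = sum(rng.random() < p_null for _ in range(n_cases))
        users_controls = sum(rng.random() < p_null for _ in range(n_controls))
        diffs.append(users_cases / n_cases - users_controls / n_controls)
    return diffs

diffs = simulate_null_differences()
empirical_se = statistics.stdev(diffs)
# Count simulated studies at least as extreme as the observed 10% (either direction).
n_extreme = sum(abs(d) >= 0.10 for d in diffs)
print(f"empirical standard error: {empirical_se:.4f}")
print(f"simulated differences of 10% or more: {n_extreme} of {len(diffs)}")
```

The standard deviation of the simulated differences estimates the standard error, and the share of simulated studies with a difference of 10% or more estimates a two-sided p-value.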

23
Computer Simulation Results
24
What is standard error?
Standard error: a measure of the variability of
sample statistics
25
Hypothesis Testing
Step 3: Do an experiment
We observed a difference of 10% between cases and
controls.
26
Hypothesis Testing
Step 4: Calculate a p-value
P-value: the probability of your data or something
more extreme under the null hypothesis.
27
Hypothesis Testing
Step 4: Calculate a p-value: mathematical theory
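The slide's calculation is not preserved in the transcript. One plausible version, sketched in Python, is a two-sided Z-test that divides the observed difference by the pooled standard error; both the choice of test and the helper name are assumptions.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

n_cases, n_controls = 263, 1241
users_cases, users_controls = 120, 448

observed_diff = users_cases / n_cases - users_controls / n_controls
p_pooled = (users_cases + users_controls) / (n_cases + n_controls)
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_cases + 1 / n_controls))

z = observed_diff / se
p_value = two_sided_p_from_z(z)
print(f"Z = {z:.2f}, two-sided p-value = {p_value:.4f}")
```

Using the unrounded proportions gives a Z slightly below 3; rounding the difference to 10%, as the slides do, gives a Z of about 3.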
28
The p-value from computer simulation
29
P-value
P-value: the probability of your data or something
more extreme under the null hypothesis. From our
simulation, we estimate the p-value to be 3/1000
or .003.
30
Hypothesis Testing
Step 5: Reject or do not reject the null
hypothesis.
Here we reject the null. Alternative hypothesis:
There is an association between antidepressant
use and suicide in the target population.
31
What does a 10% difference mean?
  • Is it statistically significant? YES
  • Is it clinically significant?
  • Is this a causal association?

32
What does a 10% difference mean?
  • Is it statistically significant? YES
  • Is it clinically significant? MAYBE
  • Is this a causal association? MAYBE

Statistical significance does not necessarily
imply clinical significance.
Statistical significance does not necessarily
imply a cause-and-effect relationship.
33
What would a lack of statistical significance
mean?
  • If this study had sampled only 50 cases and 50
    controls, the sampling variability would have
    been much higher, as shown in this computer
    simulation.

35
With only 50 cases and 50 controls
36
Two-tailed p-value
37
What does a 10% difference mean (50 cases/50
controls)?
  • Is it statistically significant? NO
  • Is it clinically significant? MAYBE
  • Is this a causal association? MAYBE

No evidence of an effect ≠ evidence of no effect.
38
Example 2: Difference in means
  • Example: Rosenthal, R. and Jacobson, L. (1966).
    Teachers' expectancies: Determinants of pupils'
    IQ gains. Psychological Reports, 19, 115-118.

39
The Experiment (note: exact numbers have been
altered)
  • Grade 3 students at Oak School were given an IQ
    test at the beginning of the academic year
    (n = 90).
  • Classroom teachers were given a list of names of
    students in their classes who had supposedly
    scored in the top 20 percent; these students were
    identified as academic bloomers (n = 18).
  • BUT the children on the teachers' lists had
    actually been randomly assigned to the list.
  • At the end of the year, the same IQ test was
    re-administered.

40
Example 2
  • Statistical question: Do students in the
    treatment group have more improvement in IQ than
    students in the control group?
  • What will we actually compare?
  • One-year change in IQ score in the treatment
    group vs. one-year change in IQ score in the
    control group.

41
Results
                                Academic bloomers (n = 18)   Controls (n = 72)
Change in IQ score, mean (SD)   12.2 (2.0)                   8.2 (2.0)

Difference = 4 points
42
What does a 4-point difference mean?
  • Before we perform any formal statistical analysis
    on these data, we already have a lot of
    information.
  • Look at the basic numbers first; THEN consider
    statistical significance as a secondary guide.

43
Is the association statistically significant?
  • This 4-point difference could reflect a true
    effect or it could be a fluke.
  • The question: is a 4-point difference bigger or
    smaller than the expected sampling variability?

44
Hypothesis testing
Step 1: Assume the null hypothesis.
Null hypothesis: There is no difference between
academic bloomers and normal students (the
difference is 0).
45
Hypothesis Testing
Step 2: Predict the sampling variability assuming
the null hypothesis is true: math theory
The standard error of the difference in two means
is
$$SE = \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}} = \sqrt{\frac{2.0^2}{18}+\frac{2.0^2}{72}} \approx 0.53$$
We expect to see differences between the groups as
big as about 1.0 point (2 standard errors) just by
chance.
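The "about 1.0" figure can be checked with a few lines of Python. The sketch assumes the usual two-sample form of the standard error, built from each group's standard deviation (2.0) and size (18 and 72); the function name is illustrative.

```python
import math

def se_diff_means(sd1, n1, sd2, n2):
    """Standard error of the difference between two independent sample means."""
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

se = se_diff_means(2.0, 18, 2.0, 72)
print(f"standard error    = {se:.3f} IQ points")
print(f"2 standard errors = {2 * se:.2f} IQ points")
```

Two standard errors come out just above 1 IQ point, consistent with the slide.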
46
Hypothesis Testing
Step 2: Predict the sampling variability assuming
the null hypothesis is true: computer simulation
  • In computer simulation, you simulate taking
    repeated samples of the same size from the same
    population and observe the sampling variability.
  • I used computer simulation to take 1000 samples
    of 18 treated and 72 controls, assuming the null
    hypothesis (that the treatment doesn't affect
    IQ).

47
Computer Simulation Results
48
What is the standard error?
Standard error: a measure of the variability of
sample statistics
49
Hypothesis Testing
Step 3: Do an experiment
We observed a difference of 4 points between
treated and controls.
50
Hypothesis Testing
Step 4: Calculate a p-value
P-value: the probability of your data or something
more extreme under the null hypothesis.
51
Hypothesis Testing
Step 4: Calculate a p-value: mathematical theory

p-value < .0001
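A p-value this small follows from how many standard errors the observed 4-point difference lies from zero. The Python sketch below uses a normal (Z) approximation as a simplifying assumption; with 90 students a t-test would be the more usual choice and leads to the same conclusion here.

```python
import math

def z_test_diff_means(diff, sd1, n1, sd2, n2):
    """Return (z, two-sided p) for an observed difference in means."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = diff / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p

z, p = z_test_diff_means(4.0, 2.0, 18, 2.0, 72)
print(f"Z = {z:.2f}, two-sided p-value = {p:.1e}")
```

The difference is more than 7 standard errors from zero, far beyond what chance alone would produce.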

52
Getting the P-value from computer simulation
53
P-value
P-value: the probability of your data or something
more extreme under the null hypothesis. Here,
p-value < .0001
54
Hypothesis Testing
Step 5: Reject or do not reject the null
hypothesis.
Here we reject the null. Alternative hypothesis:
There is an association between being labeled as
gifted and subsequent academic achievement.
55
What does a 4-point difference mean?
  • Is it statistically significant? YES
  • Is it clinically significant?
  • Is this a causal association?

56
What does a 4-point difference mean?
  • Is it statistically significant? YES
  • Is it clinically significant? MAYBE
  • Is this a causal association? MAYBE

Statistical significance does not necessarily
imply clinical significance.
Statistical significance does not necessarily
imply a cause-and-effect relationship.
57
What if our standard deviation had been higher?
  • The standard deviation for change scores in both
    treatment and control was 2.0. What if change
    scores had been much more variable, say a
    standard deviation of 10.0?
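The effect of the larger standard deviation can be explored by simulating the null hypothesis and counting how often chance alone produces a difference of 4 points or more. The Python sketch below assumes normally distributed change scores with a true mean change of zero in both groups; the function name and seed are arbitrary.

```python
import random

def chance_of_big_difference(sd, threshold=4.0, n_treated=18, n_control=72,
                             reps=2000, seed=3):
    """Fraction of null simulations with |difference in mean change| >= threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Under the null, both groups have the same true mean change (0 here).
        treated = sum(rng.gauss(0.0, sd) for _ in range(n_treated)) / n_treated
        control = sum(rng.gauss(0.0, sd) for _ in range(n_control)) / n_control
        if abs(treated - control) >= threshold:
            hits += 1
    return hits / reps

for sd in (2.0, 10.0):
    share = chance_of_big_difference(sd)
    print(f"std. dev. = {sd:>4}: chance of a 4-point difference under the null = {share:.3f}")
```

With a standard deviation of 2, a 4-point difference essentially never arises by chance; with a standard deviation of 10 it arises often enough that the result would not reach significance at the .05 level.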

59
With a std. dev. of 10.0
60
What would a 4.0 difference mean (std. dev. = 10)?
  • Is it statistically significant? NO
  • Is it clinically significant? MAYBE
  • Is this a causal association? MAYBE

No evidence of an effect ≠ evidence of no effect.
61
Hypothesis testing summary
  • Null hypothesis: the hypothesis of no effect
    (usually the opposite of what you hope to prove).
    The straw man you are trying to shoot down.
  • Example: antidepressants have no effect on
    suicide risk.
  • P-value: the probability of your observed data
    (or something more extreme) if the null
    hypothesis is true.
  • Example: The probability that the study would
    have found 10% higher suicide attempts in the
    antidepressant group (compared with control) if
    antidepressants had no effect (i.e., just by
    chance).
  • If the p-value is low enough (i.e., if our data
    are very unlikely given the null hypothesis),
    this is evidence that the null hypothesis is
    wrong.
  • If the p-value is low enough (typically < .05),
    we reject the null hypothesis and conclude that
    antidepressants do have an effect.

62
Summary: The underlying logic of hypothesis tests
Follows this logic: Assume A. If A, then B. Not B.
Therefore, Not A. But throw in a bit of
uncertainty: If A, then probably B.
63
Error and power
  • Type I error rate (or significance level): the
    probability of finding an effect that isn't real
    (false positive).
  • If we require p-value < .05 for statistical
    significance, this means that 1 time in 20 we
    will find a positive result just by chance when
    there is no real effect.
  • Type II error rate: the probability of missing an
    effect (false negative).
  • Statistical power: the probability of finding an
    effect if it is there (the probability of not
    making a type II error).
  • When we design studies, we typically aim for a
    power of 80% (allowing a false negative rate, or
    type II error rate, of 20%).
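The "1 time in 20" claim can be checked by simulating many studies in which the null hypothesis is true and counting how often p < .05. The Python sketch below is illustrative: the group size of 20, the known standard deviation of 1, and the Z-test are all assumptions made to keep the example short.

```python
import math
import random

def false_positive_rate(alpha=0.05, n_per_group=20, sd=1.0,
                        experiments=2000, seed=4):
    """Share of null experiments (no true effect) that reach p < alpha."""
    rng = random.Random(seed)
    se = sd * math.sqrt(2 / n_per_group)  # SD treated as known (Z-test)
    false_positives = 0
    for _ in range(experiments):
        mean_a = sum(rng.gauss(0.0, sd) for _ in range(n_per_group)) / n_per_group
        mean_b = sum(rng.gauss(0.0, sd) for _ in range(n_per_group)) / n_per_group
        z = (mean_a - mean_b) / se
        p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
        if p < alpha:
            false_positives += 1
    return false_positives / experiments

rate = false_positive_rate()
print(f"share of null experiments declared significant: {rate:.3f}")
```

The share should land near 0.05, which is the type I error rate implied by the .05 cutoff.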

64
Type I and Type II Error in a box
65
Reminds me of... Pascal's Wager
67
Review Question 1
  • If we have a p-value of 0.03 and so decide that
    our effect is statistically significant, what is
    the probability that we're wrong (i.e., that the
    hypothesis test gave us a false positive)?
  • .03
  • .06
  • Cannot tell
  • 1.96
  • 95%

69
Review Question 2
  • Standard error is
  • For a given variable, its standard deviation
    divided by the square root of n.
  • A measure of the variability of a sample
    statistic.
  • The inverse of sample size.
  • A measure of the variability of a characteristic.
  • All of the above.

71
Review Question 3
  • A randomized trial of two treatments for
    depression failed to show a statistically
    significant difference in improvement from
    depressive symptoms (p-value = .50). It follows
    that:
  • The treatments are equally effective.
  • Neither treatment is effective.
  • The study lacked sufficient power to detect a
    difference.
  • The null hypothesis should be rejected.
  • There is not enough evidence to reject the null
    hypothesis.

73
Review Question 4
  • Following the introduction of a new treatment
    regime in a rehab facility, alcoholism cure
    rates increased. The proportion of successful
    outcomes in the two years following the change
    was significantly higher than in the preceding
    two years (p-value < .005). It follows that:
  • The improvement in treatment outcome is
    clinically important.
  • The new regime cannot be worse than the old
    treatment.
  • Assuming that there are no biases in the study
    method, the new treatment should be recommended
    in preference to the old.
  • All of the above.
  • None of the above.

75
Statistical Power
  • Statistical power is the probability of finding
    an effect if it's real.

76
Can we quantify how much power we have for given
sample sizes?
77
Study 1: 263 cases, 1241 controls
Null distribution: difference = 0.
Clinically relevant alternative: difference = 10%.
78
Study 1: 263 cases, 1241 controls
Power: the chance of being in the rejection region
if the alternative is true (the area to the right
of this line, in yellow)
Power here > 80%
79
Study 1: 50 cases, 50 controls
Power closer to 20% now.
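The two power figures for study 1 can be approximated without the plots. The Python sketch below uses a normal approximation: power is the area of the alternative distribution beyond the two-sided .05 cutoff. It assumes the null (pooled) standard error also applies under the alternative and ignores the negligible chance of rejecting in the wrong direction; the function names are illustrative.

```python
import math
from statistics import NormalDist

def pooled_se(p, n1, n2):
    """Standard error of a difference in proportions with shared proportion p."""
    return math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

def approx_power(effect, se, alpha=0.05):
    """Approximate power: area of the alternative beyond the rejection cutoff."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(effect) / se - z_crit)

p_null = 568 / 1504  # pooled proportion from the results table (assumption)
power_full = approx_power(0.10, pooled_se(p_null, 263, 1241))
power_small = approx_power(0.10, pooled_se(p_null, 50, 50))
print(f"263 cases, 1241 controls: power is about {power_full:.2f}")
print(f"50 cases, 50 controls:    power is about {power_small:.2f}")
```

The approximation gives power above 80% for the full study and close to 20% for the 50/50 design, matching the slides.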
80
Study 2: 18 treated, 72 controls, std. dev. = 2
Clinically relevant alternative: difference = 4
points
Power is nearly 100%!
81
Study 2: 18 treated, 72 controls, std. dev. = 10
Power is about 40%
82
Study 2: 18 treated, 72 controls, effect size = 1.0
Clinically relevant alternative: difference = 1
point
Power is about 50%
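Power can also be estimated by brute force: simulate many studies in which the alternative is true and count how often the test rejects the null. The Python sketch below does this for the three study 2 scenarios. It assumes normally distributed change scores and a Z-test with known standard deviation, so its estimates are rough and need not match the slides' rounded figures exactly.

```python
import math
import random

def simulated_power(true_diff, sd, n_treated=18, n_control=72,
                    studies=2000, seed=5):
    """Share of simulated studies (true effect = true_diff) that reject at .05."""
    rng = random.Random(seed)
    se = sd * math.sqrt(1 / n_treated + 1 / n_control)  # SD treated as known
    rejections = 0
    for _ in range(studies):
        treated = sum(rng.gauss(true_diff, sd) for _ in range(n_treated)) / n_treated
        control = sum(rng.gauss(0.0, sd) for _ in range(n_control)) / n_control
        if abs(treated - control) / se > 1.96:  # two-sided test at alpha = .05
            rejections += 1
    return rejections / studies

scenarios = [(4.0, 2.0), (4.0, 10.0), (1.0, 2.0)]  # (true difference, std. dev.)
for true_diff, sd in scenarios:
    power = simulated_power(true_diff, sd)
    print(f"difference = {true_diff}, std. dev. = {sd}: estimated power = {power:.2f}")
```

The ordering is the point: power is highest for the large effect with little variability, and drops when the outcome is noisier or the effect is smaller.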
83
Factors Affecting Power
  • 1. Size of the effect
  • 2. Standard deviation of the characteristic
  • 3. Bigger sample size
  • 4. Significance level desired

84
1. Bigger difference from the null mean
85
2. Bigger standard deviation
86
3. Bigger Sample Size
87
4. Higher significance level
88
Sample size calculations
  • Based on these elements, you can write a formal
    mathematical equation that relates power, sample
    size, effect size, standard deviation, and
    significance level

89
Simple formula for difference in proportions
90
Simple formula for difference in means
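The slide formulas themselves are not preserved in the transcript. A common textbook form for two equal-sized groups is n per group = 2σ²(Z_β + Z_α/2)² / difference² for means, with p̄(1 − p̄) in place of σ² for proportions (p̄ is the average of the two proportions). The Python sketch below implements that form; whether it matches the slides exactly is an assumption, and the function names are illustrative.

```python
import math
from statistics import NormalDist

def z_sum_squared(alpha=0.05, power=0.80):
    """(Z_alpha/2 + Z_beta)^2 for a two-sided test at the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) ** 2

def n_per_group_means(sd, diff, alpha=0.05, power=0.80):
    """Per-group sample size to detect a difference in means (equal groups)."""
    return math.ceil(2 * sd ** 2 * z_sum_squared(alpha, power) / diff ** 2)

def n_per_group_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size to detect a difference in proportions (equal groups)."""
    p_bar = (p1 + p2) / 2
    return math.ceil(2 * p_bar * (1 - p_bar) * z_sum_squared(alpha, power)
                     / (p1 - p2) ** 2)

print("4-point difference, std. dev. 10:", n_per_group_means(10.0, 4.0), "per group")
print("46% vs. 36%:", n_per_group_proportions(0.46, 0.36), "per group")
```

Both functions round up, since a fraction of a participant cannot be enrolled, and both assume equal group sizes, unlike the 18/72 and 263/1241 splits in the examples.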
91
Sample size calculators on the web
  • http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize
  • http://calculators.stat.ucla.edu
  • http://hedwig.mgh.harvard.edu/sample_size/size.html

92
These sample size calculations are idealized
  • They do not account for losses to follow-up
    (prospective studies)
  • They do not account for non-compliance (for
    intervention trials or RCTs)
  • They assume that individuals are independent
    observations (not true in clustered designs)
  • Consult a statistician!

93
Review Question 5
  • Which of the following elements does not increase
    statistical power?
  • Increased sample size
  • Measuring the outcome variable more precisely
  • A significance level of .01 rather than .05
  • A larger effect size.

95
Review Question 6
  • Most sample size calculators ask you to input a
    value for σ. What are they asking for?
  • The standard error
  • The standard deviation
  • The standard error of the difference
  • The coefficient of deviation
  • The variance

97
Review Question 7
  • For your RCT, you want 80% power to detect a
    reduction of 10 points or more in the treatment
    group relative to placebo. What is 10 in your
    sample size formula?
  • a. Standard deviation
  • b. mean change
  • c. Effect size
  • d. Standard error
  • e. Significance level

99
Homework
  • Problem Set 3
  • Reading: continue reading textbook
  • Reading: p-value article
  • Journal article/article review sheet