Title: Statistical Inference I: Hypothesis testing; sample size
1. Statistical Inference I: Hypothesis testing; sample size
2. Statistics Primer
- Statistical Inference
- Hypothesis testing
- P-values
- Type I error
- Type II error
- Statistical power
- Sample size calculations
3. What is a statistic?
- A statistic is any value that can be calculated from the sample data.
- Sample statistics are calculated to give us an idea about the larger population.
4. Examples of statistics
- Mean
  - The average cost of a gallon of gas in the US is $2.65.
- Difference in means
  - The difference in the average gas price in Los Angeles ($2.96) compared with Des Moines, Iowa ($2.32) is 64 cents.
- Proportion
  - 67% of high school students in the U.S. exercise regularly.
- Difference in proportions
  - The difference in the proportion of Democrats who approve of Obama (83%) versus Republicans who do (14%) is 69 percentage points.
5. What is a statistic?
- Sample statistics are estimates of population
parameters.
6. Sample statistics estimate population parameters
7. What is sampling variation?
- Statistics vary from sample to sample due to random chance.
- Example: A population of 100,000 people has an average IQ of 100 (if you actually could measure them all!). If you sample 5 random people from this population, what will you get?
8. Sampling Variation
Mean IQ = 100
9. Sampling Variation and Sample Size
- Do you expect more or less sampling variability in samples of 10 people?
- Of 50 people?
- Of 1000 people?
- Of 100,000 people?
10. Sampling Distributions
- Most experiments are one-shot deals. So, how do we know if an observed effect from a single experiment is real or just an artifact of sampling variability (chance variation)?
- Answering this requires a priori knowledge about how sampling variability works.
- Question: Why have I made you learn about probability distributions and about how to calculate and manipulate expected value and variance?
- Answer: Because they form the basis of describing the distribution of a sample statistic.
11. Standard error
- Standard error is a measure of sampling variability.
- Standard error is the standard deviation of a sample statistic.
- It's a theoretical quantity: what would the distribution of my statistic be if I could repeat my experiment many times (with fixed sample size)? How much chance variation is there?
- Standard error decreases with increasing sample size and increases with increasing variability of the outcome (e.g., IQ).
- Standard errors can be predicted by computer simulation or by mathematical theory (formulas).
- The formula for standard error is different for every type of statistic (e.g., mean, difference in means, odds ratio).
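These properties can be checked with a quick simulation (a sketch, not from the slides; IQ scores are drawn here from a normal distribution with mean 100 and SD 15, a conventional assumption):

```python
import random
import statistics

random.seed(42)

def simulated_se_of_mean(n, reps=1000, mu=100, sigma=15):
    """Standard deviation of the sample mean across repeated samples of size n."""
    means = []
    for _ in range(reps):
        sample = [random.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.mean(sample))
    return statistics.stdev(means)

# Theory predicts SE = sigma / sqrt(n); the simulation should agree,
# and both shrink as n grows.
for n in (5, 10, 50, 500):
    print(n, round(simulated_se_of_mean(n), 2), round(15 / n ** 0.5, 2))
```

Note how the simulated and theoretical columns match, and how quadrupling the sample size roughly halves the standard error.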
12. What is statistical inference?
- The field of statistics provides guidance on how to draw conclusions in the face of chance variation (sampling variability).
13. Example 1: Difference in proportions
- Research question: Are antidepressants a risk factor for suicide attempts in children and adolescents?
- Example modified from: Olfson et al. Antidepressant Drug Therapy and Suicide in Severely Depressed Children and Adults. Arch Gen Psychiatry. 2006;63:865-872.
14. Example 1
- Design: Case-control study
- Methods: Researchers used Medicaid records to compare prescription histories between 263 children and teenagers (6-18 years) who had attempted suicide and 1241 controls who had never attempted suicide (all subjects suffered from depression).
- Statistical question: Is a history of antidepressant use more common among cases than controls?
15. Example 1
- Statistical question: Is a history of use of particular antidepressants more common among cases than controls?
- What will we actually compare?
  - Proportion of cases who used antidepressants in the past vs. proportion of controls who did.
16. Results

Any antidepressant drug, ever:
- Cases (n=263): 120 (46%)
- Controls (n=1241): 448 (36%)

Difference = 46% - 36% = 10 percentage points
17. What does a 10% difference mean?
- Before we perform any formal statistical analysis on these data, we already have a lot of information.
- Look at the basic numbers first; THEN consider statistical significance as a secondary guide.
18. Is the association statistically significant?
- This 10% difference could reflect a true association, or it could be a fluke in this particular sample.
- The question: is 10% bigger or smaller than the expected sampling variability?
19. What is hypothesis testing?
- Statisticians try to answer this question with a formal hypothesis test.
20. Hypothesis testing
Step 1: Assume the null hypothesis.
Null hypothesis: There is no association between antidepressant use and suicide attempts in the target population (i.e., the difference is 0).
21. Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true, using mathematical theory (a formula).
The standard error of the difference in two proportions is given by a formula; for this study it works out to about 3.3 percentage points.
Thus, we expect to see differences between the groups as big as about 6.6% (2 standard errors) just by chance.
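The formula itself did not survive transcription; the usual pooled version, applied to this study's numbers (120/263 cases and 448/1241 controls ever used an antidepressant), can be sketched as:

```python
import math

# Pooled standard error of a difference in two proportions under the
# null hypothesis (the standard formula; a sketch, since the slide's
# own formula image is missing).
p_pooled = (120 + 448) / (263 + 1241)      # overall proportion, ~0.378
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / 263 + 1 / 1241))
print(round(se, 3))   # ~0.033, so 2 SE is ~6.6 percentage points
```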
22. Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true, using computer simulation.
- In computer simulation, you simulate taking repeated samples of the same size from the same population and observe the sampling variability.
- I used computer simulation to take 1000 samples of 263 cases and 1241 controls, assuming the null hypothesis is true (i.e., no difference in antidepressant use between the groups).
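That simulation can be sketched as follows (the pooled null proportion of about 0.378 is an assumption taken from the Results slide):

```python
import random

# Draw cases and controls from the SAME population (null true) and
# record the chance difference in proportions each time.
random.seed(1)
P_NULL = 568 / 1504   # pooled proportion of antidepressant users

def null_difference(n_cases=263, n_controls=1241, p=P_NULL):
    cases = sum(random.random() < p for _ in range(n_cases)) / n_cases
    controls = sum(random.random() < p for _ in range(n_controls)) / n_controls
    return cases - controls

diffs = [null_difference() for _ in range(1000)]
# How often does chance alone produce a difference as big as the
# observed 10 percentage points?
print(sum(abs(d) >= 0.10 for d in diffs) / 1000)
```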
23. Computer Simulation Results
24. What is standard error?
Standard error: a measure of the variability of sample statistics.
25. Hypothesis Testing
Step 3: Do an experiment.
We observed a difference of 10 percentage points between cases and controls.
26. Hypothesis Testing
Step 4: Calculate a p-value.
P-value: the probability of your data, or something more extreme, under the null hypothesis.
27. Hypothesis Testing
Step 4: Calculate a p-value using mathematical theory.
28. The p-value from computer simulation
29. P-value
P-value: the probability of your data, or something more extreme, under the null hypothesis. From our simulation, we estimate the p-value to be 3/1000, or .003.
30. Hypothesis Testing
Step 5: Reject or do not reject the null hypothesis.
Here we reject the null. Alternative hypothesis: There is an association between antidepressant use and suicide in the target population.
31. What does a 10% difference mean?
- Is it statistically significant? YES
- Is it clinically significant?
- Is this a causal association?
32. What does a 10% difference mean?
- Is it statistically significant? YES
- Is it clinically significant? MAYBE
- Is this a causal association? MAYBE
Statistical significance does not necessarily imply clinical significance.
Statistical significance does not necessarily imply a cause-and-effect relationship.
33. What would a lack of statistical significance mean?
- If this study had sampled only 50 cases and 50 controls, the sampling variability would have been much higher, as shown in this computer simulation:
34. (No Transcript)
35. With only 50 cases and 50 controls
36. Two-tailed p-value
37. What does a 10% difference mean (50 cases/50 controls)?
- Is it statistically significant? NO
- Is it clinically significant? MAYBE
- Is this a causal association? MAYBE
No evidence of an effect ≠ evidence of no effect.
38. Example 2: Difference in means
- Example: Rosenthal, R. and Jacobson, L. (1966). Teachers' expectancies: Determinants of pupils' I.Q. gains. Psychological Reports, 19, 115-118.
39. The Experiment (note: exact numbers have been altered)
- Grade 3 students at Oak School were given an IQ test at the beginning of the academic year (n=90).
- Classroom teachers were given a list of names of students in their classes who had supposedly scored in the top 20 percent; these students were identified as "academic bloomers" (n=18).
- BUT the children on the teachers' lists had actually been randomly assigned to the list.
- At the end of the year, the same I.Q. test was re-administered.
40. Example 2
- Statistical question: Do students in the treatment group have more improvement in IQ than students in the control group?
- What will we actually compare?
  - One-year change in IQ score in the treatment group vs. one-year change in IQ score in the control group.
41. Results

Change in IQ score, mean (SD):
- Academic bloomers (n=18): 12.2 (2.0)
- Controls (n=72): 8.2 (2.0)

Difference = 12.2 - 8.2 = 4 points
42. What does a 4-point difference mean?
- Before we perform any formal statistical analysis on these data, we already have a lot of information.
- Look at the basic numbers first; THEN consider statistical significance as a secondary guide.
43. Is the association statistically significant?
- This 4-point difference could reflect a true effect, or it could be a fluke.
- The question: is a 4-point difference bigger or smaller than the expected sampling variability?
44. Hypothesis testing
Step 1: Assume the null hypothesis.
Null hypothesis: There is no difference between academic bloomers and normal students (i.e., the difference is 0).
45. Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true, using mathematical theory.
The standard error of the difference in two means is given by a formula; for this study it works out to about 0.5 points.
We expect to see differences between the groups as big as about 1.0 point (2 standard errors) just by chance.
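The formula image did not survive transcription; the usual version, applied to the study's numbers (SD = 2.0 in both groups, n = 18 and 72), can be sketched as:

```python
import math

# Standard error of a difference in two means (a sketch of the
# standard formula: sqrt(s1^2/n1 + s2^2/n2)).
sd = 2.0
se = math.sqrt(sd ** 2 / 18 + sd ** 2 / 72)
print(round(se, 2))   # ~0.53, so 2 SE is about 1 IQ point
```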
46. Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true, using computer simulation.
- In computer simulation, you simulate taking repeated samples of the same size from the same population and observe the sampling variability.
- I used computer simulation to take 1000 samples of 18 treated and 72 controls, assuming the null hypothesis (that the treatment doesn't affect IQ).
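A sketch of that simulation (using the control group's mean change, 8.2, as the assumed common null mean; the SD of 2.0 is from the Results slide):

```python
import random
import statistics

# Both groups drawn from the SAME population, so any difference in
# means is pure chance variation.
random.seed(2)

def null_difference(n_treat=18, n_ctrl=72, mu=8.2, sd=2.0):
    treated = [random.gauss(mu, sd) for _ in range(n_treat)]
    controls = [random.gauss(mu, sd) for _ in range(n_ctrl)]
    return statistics.mean(treated) - statistics.mean(controls)

diffs = [null_difference() for _ in range(1000)]
# Chance differences cluster within roughly 2 SE (~1 point); the
# observed 4-point difference lies far outside this range.
print(max(abs(d) for d in diffs))
```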
47. Computer Simulation Results
48. What is the standard error?
Standard error: a measure of the variability of sample statistics.
49. Hypothesis Testing
Step 3: Do an experiment.
We observed a difference of 4 points between treated and controls.
50. Hypothesis Testing
Step 4: Calculate a p-value.
P-value: the probability of your data, or something more extreme, under the null hypothesis.
51. Hypothesis Testing
Step 4: Calculate a p-value using mathematical theory.
p-value < .0001
52. Getting the p-value from computer simulation
53. P-value
P-value: the probability of your data, or something more extreme, under the null hypothesis. Here, p-value < .0001.
54. Hypothesis Testing
Step 5: Reject or do not reject the null hypothesis.
Here we reject the null. Alternative hypothesis: There is an association between being labeled as gifted and subsequent academic achievement.
55. What does a 4-point difference mean?
- Is it statistically significant? YES
- Is it clinically significant?
- Is this a causal association?
56. What does a 4-point difference mean?
- Is it statistically significant? YES
- Is it clinically significant? MAYBE
- Is this a causal association? MAYBE
Statistical significance does not necessarily
imply clinical significance.
Statistical significance does not necessarily
imply a cause-and-effect relationship.
57. What if our standard deviation had been higher?
- The standard deviation for change scores in both treatment and control was 2.0. What if change scores had been much more variable, say, a standard deviation of 10.0?
58. (No Transcript)
59. With a std. dev. of 10.0
60. What would a 4.0-point difference mean (std. dev. = 10)?
- Is it statistically significant? NO
- Is it clinically significant? MAYBE
- Is this a causal association? MAYBE
No evidence of an effect ≠ evidence of no effect.
61. Hypothesis testing summary
- Null hypothesis: the hypothesis of no effect (usually the opposite of what you hope to prove); the straw man you are trying to shoot down.
  - Example: antidepressants have no effect on suicide risk.
- P-value: the probability of your observed data (or something more extreme) if the null hypothesis is true.
  - Example: the probability that the study would have found a 10-percentage-point excess of suicide attempts in the antidepressant group (compared with control) if antidepressants had no effect (i.e., just by chance).
- If the p-value is low enough (i.e., if our data are very unlikely given the null hypothesis), this is evidence that the null hypothesis is wrong.
- If the p-value is low enough (typically < .05), we reject the null hypothesis and conclude that antidepressants do have an effect.
62. Summary: The underlying logic of hypothesis tests
Follows this logic: Assume A. If A, then B. Not B. Therefore, not A.
But throw in a bit of uncertainty: If A, then probably B. Not B. Therefore, probably not A.
63. Error and power
- Type I error rate (or significance level): the probability of finding an effect that isn't real (a false positive).
- If we require p-value < .05 for statistical significance, this means that 1 in 20 times we will find a positive result just by chance.
- Type II error rate: the probability of missing an effect (a false negative).
- Statistical power: the probability of finding an effect if it is there (the probability of not making a type II error).
- When we design studies, we typically aim for a power of 80% (allowing a false negative rate, or type II error rate, of 20%).
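The meaning of the type I error rate can be checked by simulation (a sketch under assumed conditions: two groups of 50 drawn from a standard normal population, compared with a simple z-test):

```python
import random
import statistics

# When the null is true and we reject whenever |z| > 1.96
# (alpha = .05), we should get a false positive about 1 time in 20.
random.seed(3)

def false_positive(n=50):
    # Two samples from the SAME population: any "effect" is spurious.
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    se = (statistics.variance(a) / n + statistics.variance(b) / n) ** 0.5
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return abs(z) > 1.96

rate = sum(false_positive() for _ in range(2000)) / 2000
print(rate)   # close to 0.05
```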
64. Type I and Type II Error in a box
65. Reminds me of... Pascal's Wager
66. Type I and Type II Error in a box
67. Review Question 1
- If we have a p-value of 0.03 and so decide that our effect is statistically significant, what is the probability that we're wrong (i.e., that the hypothesis test gave us a false positive)?
  - .03
  - .06
  - Cannot tell
  - 1.96
  - 95
68. Review Question 1
- If we have a p-value of 0.03 and so decide that our effect is statistically significant, what is the probability that we're wrong (i.e., that the hypothesis test gave us a false positive)?
  - .03
  - .06
  - Cannot tell
  - 1.96
  - 95
69. Review Question 2
- Standard error is:
  - For a given variable, its standard deviation divided by the square root of n.
  - A measure of the variability of a sample statistic.
  - The inverse of sample size.
  - A measure of the variability of a characteristic.
  - All of the above.
70. Review Question 2
- Standard error is:
  - For a given variable, its standard deviation divided by the square root of n.
  - A measure of the variability of a sample statistic.
  - The inverse of sample size.
  - A measure of the variability of a characteristic.
  - All of the above.
71. Review Question 3
- A randomized trial of two treatments for depression failed to show a statistically significant difference in improvement from depressive symptoms (p-value = .50). It follows that:
  - The treatments are equally effective.
  - Neither treatment is effective.
  - The study lacked sufficient power to detect a difference.
  - The null hypothesis should be rejected.
  - There is not enough evidence to reject the null hypothesis.
72. Review Question 3
- A randomized trial of two treatments for depression failed to show a statistically significant difference in improvement from depressive symptoms (p-value = .50). It follows that:
  - The treatments are equally effective.
  - Neither treatment is effective.
  - The study lacked sufficient power to detect a difference.
  - The null hypothesis should be rejected.
  - There is not enough evidence to reject the null hypothesis.
73. Review Question 4
- Following the introduction of a new treatment regime in a rehab facility, alcoholism "cure" rates increased. The proportion of successful outcomes in the two years following the change was significantly higher than in the preceding two years (p-value < .005). It follows that:
  - The improvement in treatment outcome is clinically important.
  - The new regime cannot be worse than the old treatment.
  - Assuming that there are no biases in the study method, the new treatment should be recommended in preference to the old.
  - All of the above.
  - None of the above.
74. Review Question 4
- Following the introduction of a new treatment regime in a rehab facility, alcoholism "cure" rates increased. The proportion of successful outcomes in the two years following the change was significantly higher than in the preceding two years (p-value < .005). It follows that:
  - The improvement in treatment outcome is clinically important.
  - The new regime cannot be worse than the old treatment.
  - Assuming that there are no biases in the study method, the new treatment should be recommended in preference to the old.
  - All of the above.
  - None of the above.
75. Statistical Power
- Statistical power is the probability of finding an effect if it's real.
76. Can we quantify how much power we have for given sample sizes?
77. Study 1: 263 cases, 1241 controls
Null distribution: difference = 0.
Clinically relevant alternative: difference = 10 percentage points.
78. Study 1: 263 cases, 1241 controls
Power = the chance of landing in the rejection region if the alternative is true = the area to the right of this line (shown in yellow).
Power here: > 80%.
79. Study 1: 50 cases, 50 controls
Power: closer to 20% now.
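Power figures like these can be reproduced by simulation (a sketch; the alternative of 46% vs. 36% is taken from the study's observed proportions, and the test is a simple pooled z-test at alpha = .05):

```python
import random

# Assume the clinically relevant alternative is TRUE and count how
# often each study design rejects the null.
random.seed(4)

def rejects(n_cases, n_controls, p1=0.46, p0=0.36):
    x1 = sum(random.random() < p1 for _ in range(n_cases))
    x0 = sum(random.random() < p0 for _ in range(n_controls))
    diff = x1 / n_cases - x0 / n_controls
    p_pool = (x1 + x0) / (n_cases + n_controls)
    se = (p_pool * (1 - p_pool) * (1 / n_cases + 1 / n_controls)) ** 0.5
    return abs(diff) / se > 1.96

power = {}
for design in ((263, 1241), (50, 50)):
    power[design] = sum(rejects(*design) for _ in range(1000)) / 1000
print(power)   # high power for the large study, low for the small one
```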
80. Study 2: 18 treated, 72 controls, std. dev. = 2
Clinically relevant alternative: difference = 4 points.
Power is nearly 100%!
81. Study 2: 18 treated, 72 controls, std. dev. = 10
Power is about 40%.
82. Study 2: 18 treated, 72 controls, effect size = 1.0
Clinically relevant alternative: difference = 1 point.
Power is about 50%.
83. Factors Affecting Power
1. Size of the effect
2. Standard deviation of the characteristic
3. Sample size
4. Significance level desired
84. 1. Bigger difference from the null mean
85. 2. Bigger standard deviation
86. 3. Bigger sample size
87. 4. Higher significance level
88. Sample size calculations
- Based on these elements, you can write a formal mathematical equation that relates power, sample size, effect size, standard deviation, and significance level.
89. Simple formula for difference in proportions
90. Simple formula for difference in means
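The formula images on these two slides did not survive transcription; the standard textbook approximations for two equal-sized groups (two-sided alpha, target power) can be sketched as follows, where sigma, delta, and the example numbers are illustrative:

```python
from statistics import NormalDist

Z = NormalDist().inv_cdf   # standard normal quantile function

def n_per_group_means(sigma, delta, alpha=0.05, power=0.80):
    """n per group to detect a difference in means of size delta."""
    return 2 * (Z(1 - alpha / 2) + Z(power)) ** 2 * sigma ** 2 / delta ** 2

def n_per_group_props(p1, p2, alpha=0.05, power=0.80):
    """n per group to detect a difference between two proportions."""
    q = (Z(1 - alpha / 2) + Z(power)) ** 2
    return q * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2

# Example 2's scenario with the more variable outcome (SD = 10),
# aiming to detect a 4-point IQ change with 80% power:
print(round(n_per_group_means(sigma=10, delta=4)))   # ~98 per group
```

Note how the required n scales with the square of the SD and inversely with the square of the effect size, matching the factors listed on the previous slides.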
91. Sample size calculators on the web
- http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize
- http://calculators.stat.ucla.edu
- http://hedwig.mgh.harvard.edu/sample_size/size.html
92. These sample size calculations are idealized
- They do not account for losses to follow-up (prospective studies).
- They do not account for non-compliance (for an intervention trial or RCT).
- They assume that individuals are independent observations (not true in clustered designs).
- Consult a statistician!
93. Review Question 5
- Which of the following elements does not increase statistical power?
  - Increased sample size
  - Measuring the outcome variable more precisely
  - A significance level of .01 rather than .05
  - A larger effect size
94. Review Question 5
- Which of the following elements does not increase statistical power?
  - Increased sample size
  - Measuring the outcome variable more precisely
  - A significance level of .01 rather than .05
  - A larger effect size
95. Review Question 6
- Most sample size calculators ask you to input a value for σ. What are they asking for?
  - The standard error
  - The standard deviation
  - The standard error of the difference
  - The coefficient of deviation
  - The variance
96. Review Question 6
- Most sample size calculators ask you to input a value for σ. What are they asking for?
  - The standard error
  - The standard deviation
  - The standard error of the difference
  - The coefficient of deviation
  - The variance
97. Review Question 7
- For your RCT, you want 80% power to detect a reduction of 10 points or more in the treatment group relative to placebo. What is "10" in your sample size formula?
  - a. Standard deviation
  - b. Mean change
  - c. Effect size
  - d. Standard error
  - e. Significance level
98. Review Question 7
- For your RCT, you want 80% power to detect a reduction of 10 points or more in the treatment group relative to placebo. What is "10" in your sample size formula?
  - a. Standard deviation
  - b. Mean change
  - c. Effect size
  - d. Standard error
  - e. Significance level
99. Homework
- Problem Set 3
- Reading: continue reading textbook
- Reading: p-value article
- Journal article / article review sheet