Title: Statistical Inference III
1. Statistical Inference III
2. But First: Diagnostic Testing and Screening Tests
3. Characteristics of a diagnostic test
- Sensitivity: Probability that, if you truly have
the disease, the diagnostic test will catch it.
- Specificity: Probability that, if you truly do not
have the disease, the test will register
negative.
4. Calculating sensitivity and specificity from a
2x2 table
                 Disease +    Disease -
Test positive        a            b
Test negative        c            d
Sensitivity = a/(a+c): Among those with true
disease, how many test positive?
Specificity = d/(b+d): Among those without the
disease, how many test negative?
5. Hypothetical Example
                 Disease +    Disease -    Total
Test positive        9           109        118
Test negative        1           881        882
Total               10           990       1000
Sensitivity = 9/10 = .90
1 false negative out of 10 cases
Specificity = 881/990 = .89
109 false positives out of 990
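The arithmetic in the hypothetical example can be checked with a few lines of Python. This is a minimal sketch; the function name `sensitivity_specificity` and its argument names are illustrative choices, not part of the original slides.

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Return (sensitivity, specificity) from the four cells of a 2x2 table."""
    sensitivity = tp / (tp + fn)  # true positives among everyone with disease
    specificity = tn / (tn + fp)  # true negatives among everyone without disease
    return sensitivity, specificity


# Cell counts from the hypothetical example:
# 10 diseased (9 test positive, 1 test negative),
# 990 disease-free (109 test positive, 881 test negative).
sens, spec = sensitivity_specificity(tp=9, fp=109, fn=1, tn=881)
print(f"Sensitivity = {sens:.2f}")
print(f"Specificity = {spec:.2f}")
```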
6. What factors determine the effectiveness of
screening?
- The prevalence (risk) of disease.
- The effectiveness of screening in preventing
illness or death.
- Is the test any good at detecting
disease/precursor (sensitivity of the test)?
- Is the test detecting a clinically relevant
condition?
- Is there anything we can do if disease (or
pre-disease) is detected (cures, treatments)?
- Does detecting and treating disease at an earlier
stage really result in a better outcome?
- The risks of screening, such as false positives
and radiation.
7. Positive predictive value
- The probability that if you test positive for the
disease, you actually have the disease.
- Depends on the characteristics of the test
(sensitivity, specificity) and the prevalence of
disease.
8. Example: Mammography
- Mammography utilizes ionizing radiation to image
breast tissue.
- The examination is performed by compressing the
breast firmly between a plastic plate and an
x-ray cassette that contains special x-ray film.
- Mammography can identify breast cancers too small
to detect on physical examination.
- Early detection and treatment of breast cancer
(before metastasis) can improve a woman's chances
of survival.
- Studies show that, among 50-69 year-old women,
screening results in 20-35% reductions in
mortality from breast cancer.
9. Mammography
- Controversy exists over the efficacy of
mammography in reducing mortality from breast
cancer in 40-49 year-old women.
- Mammography has a high rate of false positive
tests that cause anxiety and necessitate further
costly diagnostic procedures.
- Mammography exposes a woman to some radiation,
which may slightly increase the risk of mutations
in breast tissue.
10. Example
- A 60-year-old woman has an abnormal mammogram;
what is the chance that she has breast cancer?
I.e., what is the positive predictive value?
11. Calculating PPV and NPV from a 2x2 table
                 Disease +    Disease -
Test positive        a            b
Test negative        c            d
PPV = a/(a+b): Among those who test positive, how
many truly have the disease?
NPV = d/(c+d): Among those who test negative, how
many truly do not have the disease?
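12. Hypothetical Example
                 Disease +    Disease -    Total
Test positive        9           109        118
Test negative        1           881        882
Total               10           990       1000
PPV = 9/118 = 7.6%
NPV = 881/882 = 99.9%
Prevalence of disease = 10/1000 = 1%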
13. What if disease was twice as prevalent in the
population?
                 Disease +    Disease -
Test positive       18           108
Test negative        2           872
Total               20           980
Sensitivity = 18/20 = .90
Specificity = 872/980 = .89
Sensitivity and specificity are characteristics
of the test, so they don't change!
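14. What if disease was more prevalent?
                 Disease +    Disease -    Total
Test positive       18           108        126
Test negative        2           872        874
Total               20           980       1000
PPV = 18/126 = 14.3%
NPV = 872/874 = 99.8%
Prevalence of disease = 20/1000 = 2%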
15. Conclusions
- Positive predictive value increases with
increasing prevalence of disease.
- Or if you change the diagnostic tests to improve
their accuracy.
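The dependence of predictive value on prevalence can be sketched with Bayes' rule, holding the test's sensitivity and specificity fixed. The function name `predictive_values` is illustrative, and the 10% prevalence row is an added assumption to extend the pattern beyond the 1% and 2% cases in the slides.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    tp = sensitivity * prevalence              # true positive fraction
    fn = (1 - sensitivity) * prevalence        # false negative fraction
    tn = specificity * (1 - prevalence)        # true negative fraction
    fp = (1 - specificity) * (1 - prevalence)  # false positive fraction
    return tp / (tp + fp), tn / (tn + fn)


SENS = 9 / 10     # sensitivity from the hypothetical example
SPEC = 881 / 990  # specificity from the hypothetical example

# 1% and 2% come from the slides; 10% is an extra illustrative value.
for prevalence in (0.01, 0.02, 0.10):
    ppv, npv = predictive_values(SENS, SPEC, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV = {ppv:.1%}, NPV = {npv:.1%}")
```

Because the specificity is held at exactly 881/990 rather than recomputed from rounded counts, the 2% row may differ from the slide's 18/126 in the last decimal place.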
16. Review question 1
- In a group of patients presenting to the
hospital casualty department with abdominal pain,
30% of patients have acute appendicitis. 70% of
patients with appendicitis have a temperature
greater than 37.5ºC; 40% of patients without
appendicitis have a temperature greater than
37.5ºC.
- The sensitivity of temperature greater than
37.5ºC as a marker for appendicitis is 21/49.
- The specificity of temperature greater than
37.5ºC as a marker for appendicitis is 42/70.
- The positive predictive value of temperature
greater than 37.5ºC as a marker for appendicitis
is 21/30.
- The predictive value of the test will be the same
in a different population.
- The specificity of the test will depend upon the
prevalence of appendicitis in the population to
which it is applied.
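One way to check the statements in this question is to lay out the 2x2 table for a notional group of 100 patients. The sketch below does that; the group size of 100 is an assumption made only to turn the percentages into whole-number counts.

```python
# Notional cohort of 100 patients (assumed size, chosen for round numbers).
total = 100
with_appendicitis = total * 30 // 100             # 30% have appendicitis
without_appendicitis = total - with_appendicitis

tp = with_appendicitis * 70 // 100     # appendicitis and temperature > 37.5
fn = with_appendicitis - tp            # appendicitis, no raised temperature
fp = without_appendicitis * 40 // 100  # no appendicitis, temperature > 37.5
tn = without_appendicitis - fp         # no appendicitis, no raised temperature

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)

print(f"Sensitivity = {tp}/{tp + fn} = {sensitivity:.2f}")
print(f"Specificity = {tn}/{tn + fp} = {specificity:.2f}")
print(f"PPV         = {tp}/{tp + fp} = {ppv:.2f}")
```

Comparing these fractions with the ones quoted in each statement shows which denominators have been swapped.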
18. Pitfalls of hypothesis testing
19. Pitfall 1: over-emphasis on p-values
- Clinically unimportant effects may be
statistically significant if a study is large
(and therefore, has a small standard error and
extreme precision).
- Report the effect size and confidence interval.
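To illustrate how sample size alone can drive a p-value, the sketch below applies a two-sample z-test to the same small difference at two sample sizes. All the numbers (a 0.5-unit difference, a standard deviation of 10, and the two group sizes) are hypothetical values chosen for illustration, and the normal approximation stands in for whatever test a real study would use.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(difference, sd, n_per_group):
    """Two-sided p-value for a difference in means (normal approximation)."""
    standard_error = sd * sqrt(2 / n_per_group)  # SE of the difference
    z = difference / standard_error
    return 2 * NormalDist().cdf(-abs(z))


# Same tiny effect, very different sample sizes (hypothetical values).
p_small = two_sided_p(difference=0.5, sd=10, n_per_group=100)
p_large = two_sided_p(difference=0.5, sd=10, n_per_group=100_000)

print(f"n = 100 per group:     p = {p_small:.3f}")
print(f"n = 100,000 per group: p = {p_large:.3g}")
```

The effect size is identical in both runs; only the standard error changes, which is why the effect size and confidence interval carry information the p-value does not.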
20. Pitfall 2: association does not equal causation
- Statistical significance does not imply a
cause-effect relationship.
- Interpret results in the context of the study
design.
21. Pitfall 3: data dredging/multiple comparisons
- In 1980, researchers at Duke randomized 1073
heart disease patients into two groups, but
treated the groups equally.
- Not surprisingly, there was no difference in
survival.
- Then they divided the patients into 18 subgroups
based on prognostic factors.
- In a subgroup of 397 patients (with three-vessel
disease and an abnormal left ventricular
contraction) survival of those in group 1 was
significantly different from survival of those in
group 2 (p ...).
- How could this be since there was no treatment?
(Lee et al. Clinical judgment and statistics:
lessons from a simulated randomized trial in
coronary artery disease. Circulation, 61:
508-515, 1980.)
22. Pitfall 3: multiple comparisons
- The difference resulted from the combined effect
of small imbalances in the subgroups.
23. Pitfall 3: multiple comparisons
- If we compare survival of treatment and
control within each of 18 subgroups, that's 18
comparisons.
- If these comparisons were independent, the chance
of at least one false positive would be
1 - (.95)^18 = .60.
24. Multiple comparisons
With 18 independent comparisons, we have a 60%
chance of at least 1 false positive.
25. Multiple comparisons
With 18 independent comparisons, we expect about
1 false positive (18 x .05 = 0.9).
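Both figures for the 18 comparisons follow from the 5% per-test false positive rate, assuming the tests are independent. A minimal sketch (the variable names are illustrative):

```python
ALPHA = 0.05  # per-test significance level
N_TESTS = 18  # number of independent comparisons

# P(at least one false positive) = 1 - P(no false positives in any test)
p_at_least_one = 1 - (1 - ALPHA) ** N_TESTS

# Expected number of false positives across all tests
expected_false_positives = ALPHA * N_TESTS

print(f"Chance of at least one false positive: {p_at_least_one:.2f}")
print(f"Expected number of false positives:    {expected_false_positives:.1f}")
```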
26. Pitfall 3: multiple comparisons
- A significance level of 0.05 means that your
false positive rate for one test is 5%.
- If you run more than one test, your false
positive rate will be higher than 5%.
- Control study-wide type I error by planning a
limited number of tests. Distinguish between
planned and exploratory tests in the results.
Correct for multiple comparisons.
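The slides do not name a specific correction. As one common option, the Bonferroni correction divides the significance level by the number of tests; the sketch below shows its effect on the study-wide error rate for 18 independent tests. The function names are illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


def familywise_error(per_test_alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - per_test_alpha) ** n_tests


uncorrected = familywise_error(0.05, 18)
threshold = bonferroni_threshold(0.05, 18)
corrected = familywise_error(threshold, 18)

print(f"Uncorrected study-wide error: {uncorrected:.2f}")
print(f"Bonferroni per-test threshold: {threshold:.4f}")
print(f"Corrected study-wide error:   {corrected:.3f}")
```

Bonferroni is conservative; other corrections exist, and which one suits a given study is a design decision.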
27. Results from Class survey
- My research question was to test whether or not
being born on odd or even days predicted anything
about your future.
- I discovered that people who were born on even
days are more politically left-leaning and enjoy
manuscript writing more than people who were born
on odd days (p=.03, p=.02).
- The differences were large and clinically
meaningful. Those born on even days were 2.1
units more left-leaning on average (7.9 vs. 5.8
on a scale from 1-10) and enjoyed writing 2.5
points more (6.0 vs. 3.5).
28. Results from Class survey
- I can see the NEJM article title now:
- "Being born on even days predisposes you to
leftist politics and prolificness."
29. Results from Class survey
- Assuming that this difference can't be explained
by astrology, it's obviously an artifact!
- What's going on?
30. Results from Class survey
- After the odd/even day question, I asked you 24
other questions.
- I ran 24 statistical tests (comparing the outcome
variable between odd-day born people and even-day
born people).
- So, there was a high chance of finding at least
one false positive!
31. P-value distribution for the 24 tests
Recall: Under the null hypothesis of no
associations (which we'll assume is true here!),
p-values follow a uniform distribution.
32. Compare with...
Next, I generated 24 p-values from a random
number generator (uniform distribution). These
were the results from two runs:
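The original plots of the two runs are not reproduced here, but the exercise is easy to repeat. The sketch below draws 24 uniform "p-values" per run and counts how many fall below .05; the seeds are arbitrary assumptions, so the output will not match the lecture's two runs.

```python
import random


def simulate_null_pvalues(n_tests, seed):
    """Draw n_tests p-values from Uniform(0, 1), as expected under the null."""
    rng = random.Random(seed)  # seeded so each run is reproducible
    return [rng.random() for _ in range(n_tests)]


for seed in (1, 2):  # arbitrary seeds standing in for "two runs"
    pvalues = simulate_null_pvalues(24, seed)
    n_significant = sum(p < 0.05 for p in pvalues)
    print(f"Run with seed {seed}: {n_significant} of 24 p-values below .05; "
          f"smallest = {min(pvalues):.3f}")
```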
33. In the medical literature
- Hypothetical example
- Researchers wanted to compare nutrient intakes
between women who had fractured and women who had
not fractured.
- They used a food-frequency questionnaire (FFQ)
and a food diary to capture food intake.
- From these two instruments, they calculated daily
intakes of all the vitamins, minerals,
macronutrients, antioxidants, etc.
- Then they compared fracturers to non-fracturers
on all nutrients from both questionnaires.
- They found a statistically significant difference
in vitamin K between the two groups (p ...).
- They had a lovely explanation of the role of
vitamin K in injury repair, bone, clotting, etc.
34. In the medical literature
- Hypothetical example
- Of course, they found the association only on the
FFQ, not the food diary.
- What's going on? Almost certainly artifactual
(false positive!).
35. Pitfall 4: high type II error (low statistical
power)
- Results that are not statistically significant
should not be interpreted as "evidence of no
effect," but as "no evidence of effect."
- Studies may miss effects if they are
insufficiently powered (lack precision).
- Design adequately powered studies and report
approximate study power if results are null.
36. Statistical Power
- Statistical power is the probability of finding
an effect if it's real.
37. Can we quantify how much power we have for given
sample sizes?
40. Study 1: 263 cases, 1241 controls
Null distribution: difference=0.
Clinically relevant alternative: difference=10%.
41. Study 1: 263 cases, 1241 controls
Power = chance of being in the rejection region if
the alternative is true = area to the right of this
line (in yellow)
Power here: 80%
42. Study 1: 50 cases, 50 controls
Power closer to 20% now.
43. Study 2: 18 treated, 72 controls, STD DEV = 2
Clinically relevant alternative: difference=4
points
Power is nearly 100%!
44. Study 2: 18 treated, 72 controls, STD DEV = 10
Power is about 40%
45. Study 2: 18 treated, 72 controls, effect size = 1.0
Power is about 50%
Clinically relevant alternative: difference=1
point
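The Study 2 scenarios can be approximated numerically. The sketch below uses a two-sided z-test at the .05 level with a normal approximation; those choices are assumptions, since the slides read power off pictures and do not state the exact method, so the results need not match the quoted figures exactly. The function name `power_two_means` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def power_two_means(difference, sd, n1, n2, alpha=0.05):
    """Approximate power of a two-sided z-test for a difference in means."""
    z = NormalDist()
    standard_error = sd * sqrt(1 / n1 + 1 / n2)
    z_crit = z.inv_cdf(1 - alpha / 2)        # rejection cutoff under the null
    shift = abs(difference) / standard_error  # alternative mean in SE units
    # Probability of landing in either rejection region under the alternative
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


scenarios = [
    ("difference = 4, SD = 2", 4, 2),
    ("difference = 4, SD = 10", 4, 10),
    ("difference = 1, SD = 2", 1, 2),
]
for label, difference, sd in scenarios:
    power = power_two_means(difference, sd, n1=18, n2=72)
    print(f"18 treated, 72 controls, {label}: power = {power:.0%}")
```

The ordering is the point: the same sample gives near-certain detection when the effect is large relative to the standard deviation and much weaker power when it is not.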
46. Factors Affecting Power
- 1. Size of the effect
- 2. Standard deviation of the characteristic
- 3. Sample size
- 4. Significance level desired
47. Factor 1: Bigger difference from the null mean
48. Factor 2: Bigger standard deviation
49. Factor 3: Bigger sample size
50. Factor 4: Higher significance level
51. Sample size calculations
- Based on these elements, you can write a formal
mathematical equation that relates power, sample
size, effect size, standard deviation, and
significance level.
52. Simplified formula for difference in proportions
53. Simplified formula for difference in means
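The formulas on these two slides did not survive extraction. The block below gives the commonly used simplified forms for two equal-sized groups, offered as a likely reconstruction rather than the slides' exact notation. Here n is the sample size per group, p-bar is the average of the two proportions, sigma is the standard deviation of the outcome, Delta is the difference in means to detect, Z_{alpha/2} is the standard normal cutoff for the two-sided significance level (1.96 for .05), and Z_beta is the cutoff corresponding to the desired power (0.84 for 80%).

```latex
% Difference in proportions (n per group, equal group sizes assumed)
n = \frac{2\,\bar{p}\,(1-\bar{p})\,\left(Z_{\beta} + Z_{\alpha/2}\right)^{2}}{(p_{1} - p_{2})^{2}},
\qquad \bar{p} = \frac{p_{1} + p_{2}}{2}

% Difference in means (n per group, equal group sizes assumed)
n = \frac{2\,\sigma^{2}\,\left(Z_{\beta} + Z_{\alpha/2}\right)^{2}}{\Delta^{2}}
```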
54. Sample size calculators on the web
- http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize
- http://calculators.stat.ucla.edu
- http://hedwig.mgh.harvard.edu/sample_size/size.html
55. These sample size calculations are idealized
- They do not account for losses to follow-up
(prospective studies).
- They do not account for non-compliance (for
intervention trial or RCT).
- They assume that individuals are independent
observations (not true in clustered designs).
- Consult a statistician!
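As a rough illustration of the first caveat, the sketch below computes an idealized per-group sample size for a difference in means using the standard simplified formula n = 2 * SD^2 * (Z_alpha/2 + Z_beta)^2 / difference^2, then inflates it for anticipated losses to follow-up. The inputs (a 10-point difference, a standard deviation of 10, 20% dropout) are hypothetical, and dividing by the retention rate is one simple adjustment, not a substitute for a statistician's advice.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_means(difference, sd, alpha=0.05, power=0.80):
    """Idealized per-group n for a two-sided test of a difference in means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    z_beta = z.inv_cdf(power)           # about 0.84 for 80% power
    return 2 * sd ** 2 * (z_alpha + z_beta) ** 2 / difference ** 2


def inflate_for_dropout(n, dropout_rate):
    """Enroll enough that the expected number of completers is still n."""
    return ceil(n / (1 - dropout_rate))


# Hypothetical inputs chosen for illustration.
n_ideal = n_per_group_means(difference=10, sd=10)
n_enroll = inflate_for_dropout(n_ideal, dropout_rate=0.20)

print(f"Idealized n per group:        {ceil(n_ideal)}")
print(f"Enroll per group (20% loss):  {n_enroll}")
```

Non-compliance and clustering need their own adjustments, which this sketch does not attempt.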
56. Review Question 2
- Which of the following elements does not increase
statistical power?
- Increased sample size
- Measuring the outcome variable more precisely
- A significance level of .01 rather than .05
- A larger effect size
58. Review Question 3
- Most sample size calculators ask you to input a
value for σ. What are they asking for?
- The standard error
- The standard deviation
- The standard error of the difference
- The coefficient of deviation
- The variance
60. Review Question 4
- You are conducting an RCT of a new blood
pressure drug. You will test control and
treatment patients before and after receiving
drug and placebo. What test will you use to
compare blood pressure changes between treatment
and placebo?
- A paired t-test
- A two-sample t-test
- A two-sample test of proportions
- An odds ratio
62. Review Question 5
- For your RCT, you want 80% power to detect a
reduction of 10 points or more in the treatment
group relative to placebo. What is 10 in your
sample size formula?
- a. Standard deviation
- b. Mean change
- c. Effect size
- d. Standard error
- e. Significance level
64. Non-parametric tests
- t-tests require your outcome variable to be
normally distributed (or close enough).
- Non-parametric tests are based on RANKS instead
of means and standard deviations (population
parameters).
65. Example: non-parametric tests
10 dieters following Atkins diet vs. 10 dieters
following Jenny Craig (hypothetical).
RESULTS: Atkins group loses an average of 34.5
lbs. J. Craig group loses an average of 18.5
lbs.
Conclusion: Atkins is better?
66. Example: non-parametric tests
BUT, take a closer look at the individual
data...
Atkins, change in weight (lbs): 4, 3,
0, -3, -4, -5, -11, -14, -15, -300
J. Craig, change in weight (lbs): -8, -10, -12,
-16, -18, -20, -21, -24, -26, -30
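A quick check of the individual data shows how much the single extreme value drives the Atkins average. The sketch below compares the mean with the median for each group; using the median as the contrasting summary is an added illustration, not something the slides compute.

```python
from statistics import mean, median

# Change in weight (lbs) for each dieter, from the hypothetical example.
atkins = [4, 3, 0, -3, -4, -5, -11, -14, -15, -300]
jenny_craig = [-8, -10, -12, -16, -18, -20, -21, -24, -26, -30]

for name, changes in (("Atkins", atkins), ("J. Craig", jenny_craig)):
    print(f"{name:8s} mean change = {mean(changes):6.1f} lbs, "
          f"median change = {median(changes):6.1f} lbs")
```

The means favor Atkins while the medians favor Jenny Craig, which is the warning sign that a comparison of means is being dominated by one observation.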
67. Jenny Craig
[Histogram: percent of dieters (y-axis, 0 to 30)
by weight change in lbs (x-axis, -30 to 20)]
68. Atkins
[Histogram: percent of dieters (y-axis, 0 to 30)
by weight change in lbs (x-axis, -300 to 20)]
69. t-test doesn't work
- Comparing the mean weight loss of the two groups
is not appropriate here.
- The distributions do not appear to be normally
distributed.
- Moreover, there is an extreme outlier (this
outlier influences the mean a great deal).
70. Statistical tests to compare ranks
- The Wilcoxon Rank-Sum test is the analogue of the
two-sample t-test.
71. Wilcoxon Rank-Sum test
- RANK the values, 1 being the least weight loss
and 20 being the most weight loss.
- Atkins
- Change: 4, 3, 0, -3, -4, -5, -11, -14, -15, -300
- Rank: 1, 2, 3, 4, 5, 6, 9, 11, 12, 20
- J. Craig
- Change: -8, -10, -12, -16, -18, -20, -21, -24, -26, -30
- Rank: 7, 8, 10, 13, 14, 15, 16, 17, 18, 19
72. Wilcoxon Rank-Sum test
- Sum of Atkins ranks:
- 1 + 2 + 3 + 4 + 5 + 6 + 9 + 11 + 12 + 20 = 73
- Sum of Jenny Craig's ranks:
- 7 + 8 + 10 + 13 + 14 + 15 + 16 + 17 + 18 + 19 = 137
- Jenny Craig clearly ranked higher!
- P-value (from computer) = .018
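The ranking and rank sums can be reproduced in a few lines. The p-value below uses a normal approximation with a continuity correction, which is an assumption: the slides do not say how the computer's .018 was obtained, so the approximate value may differ slightly from it.

```python
from math import sqrt
from statistics import NormalDist

atkins = [4, 3, 0, -3, -4, -5, -11, -14, -15, -300]
jenny_craig = [-8, -10, -12, -16, -18, -20, -21, -24, -26, -30]

# Rank 1 = least weight loss (largest change), rank 20 = most weight loss.
pooled = sorted(atkins + jenny_craig, reverse=True)
rank_of = {value: i + 1 for i, value in enumerate(pooled)}  # no ties here

atkins_sum = sum(rank_of[v] for v in atkins)
jenny_craig_sum = sum(rank_of[v] for v in jenny_craig)

# Normal approximation to the null distribution of the Atkins rank sum.
n1, n2 = len(atkins), len(jenny_craig)
expected_sum = n1 * (n1 + n2 + 1) / 2
sd_sum = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (abs(atkins_sum - expected_sum) - 0.5) / sd_sum  # continuity correction
p_approx = 2 * NormalDist().cdf(-z)

print(f"Sum of Atkins ranks:      {atkins_sum}")
print(f"Sum of Jenny Craig ranks: {jenny_craig_sum}")
print(f"Approximate two-sided p:  {p_approx:.3f}")
```

The dictionary-based ranking works only because these 20 values are all distinct; data with ties would need average ranks.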
73. Non-normal class data: coffee
74. Hypothesis
- Students who played varsity sports in high school
(call them the athlete group) drink more
caffeinated coffee than those who did not (call
them the non-athlete group).
- Null hypothesis: no difference in coffee drinking
between athletes and non-athletes.
75. Use Wilcoxon rank-sum
- Because numbers are small and outcome variable is
non-normal, use Wilcoxon rank-sum test rather
than the t-test.
- Non-athlete values: 0 0 5 5 8 1 16 16
- Athlete values: 0 0 0 0 0 0 0 1 2 3 4 8 8
- P-value (from computer) = 0.08
76. Review Question 6
- When you want to compare mean blood pressure
between two groups, you should:
- Use a t-test.
- Use a nonparametric test.
- Use a t-test if blood pressure is normally
distributed.
- Use a two-sample proportions test.
- Use a two-sample proportions test only if blood
pressure is normally distributed.
78. Review Question 7
- You want to compare two groups with regard to
the proportion of people that have high blood
pressure. What test do you use?
- Use a t-test.
- Use a nonparametric test.
- Use a t-test if blood pressure is normally
distributed.
- Use a two-sample proportions test.
- Use a two-sample proportions test only if blood
pressure is normally distributed.
80. Review Question 8
- The other statistic available to compare
proportions between two groups is?
- Wilcoxon Rank-sum test
- Odds ratio/risk ratio
- Linear regression
- Paired t-test
82. Review Question 9
- A randomized trial of two treatments for
depression failed to show a statistically
significant difference in improvement from
depressive symptoms (p-value = .50). It follows
that:
- The treatments are equally effective.
- Neither treatment is effective.
- The study lacked sufficient power to detect a
difference.
- The null hypothesis should be rejected.
- There is not enough evidence to reject the null
hypothesis.
84. Review Question 10
- Following the introduction of a new treatment
regime in a rehab facility, alcoholism cure
rates increased. The proportion of successful
outcomes in the two years following the change
was significantly higher than in the preceding
two years (p-value ...).
- The improvement in treatment outcome is
clinically important.
- The new regime cannot be worse than the old
treatment.
- Assuming that there are no biases in the study
method, the new treatment should be recommended
in preference to the old.
- All of the above.
- None of the above.
86. Homework
- Reading: continue reading textbook
- Problem Set 6
- Journal Article