Statistical Inference III - PowerPoint PPT Presentation

Provided by: Kris147
Learn more at: https://web.stanford.edu

1
Statistical Inference III
2
But First: Diagnostic Testing and Screening Tests
3
Characteristics of a diagnostic test
  • Sensitivity: Probability that, if you truly have
    the disease, the diagnostic test will catch it.
  • Specificity: Probability that, if you truly do not
    have the disease, the test will register
    negative.

4
Calculating sensitivity and specificity from a
2x2 table
                    Test +    Test -    Total
  Disease +            a         b       a+b
  Disease -            c         d       c+d
Among those with true disease, how many test
positive? Sensitivity = a/(a+b)
Among those without the disease, how many test
negative? Specificity = d/(c+d)
5
Hypothetical Example
                    Test +    Test -    Total
  Disease +            9         1        10
  Disease -          109       881       990
Sensitivity = 9/10 = .90
1 false negative out of 10 cases
Specificity = 881/990 = .89
109 false positives out of 990
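The arithmetic in this hypothetical example can be checked with a short sketch. The variable and function names are illustrative choices, not part of the original slides; the counts are the ones quoted above.

```python
# Counts from the hypothetical example (1000 people screened).
true_pos, false_neg = 9, 1      # the 10 people who truly have the disease
false_pos, true_neg = 109, 881  # the 990 people who do not


def sensitivity(tp: int, fn: int) -> float:
    """Share of truly diseased people who test positive."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of disease-free people who test negative."""
    return tn / (tn + fp)


sens = sensitivity(true_pos, false_neg)
spec = specificity(true_neg, false_pos)
print(f"sensitivity = {sens:.2f}")
print(f"specificity = {spec:.2f}")
```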
6
What factors determine the effectiveness of
screening?
  • The prevalence (risk) of disease.
  • The effectiveness of screening in preventing
    illness or death.
  • Is the test any good at detecting
    disease/precursor (sensitivity of the test)?
  • Is the test detecting a clinically relevant
    condition?
  • Is there anything we can do if disease (or
    pre-disease) is detected (cures, treatments)?
  • Does detecting and treating disease at an earlier
    stage really result in a better outcome?
  • The risks of screening, such as false positives
    and radiation.

7
Positive predictive value
  • The probability that if you test positive for the
    disease, you actually have the disease.
  • Depends on the characteristics of the test
    (sensitivity, specificity) and the prevalence of
    disease.

8
Example: Mammography
  • Mammography utilizes ionizing radiation to image
    breast tissue.
  • The examination is performed by compressing the
    breast firmly between a plastic plate and an
    x-ray cassette that contains special x-ray film.
  • Mammography can identify breast cancers too small
    to detect on physical examination.
  • Early detection and treatment of breast cancer
    (before metastasis) can improve a woman's chances
    of survival.
  • Studies show that, among 50-69 year-old women,
    screening results in 20-35% reductions in
    mortality from breast cancer.

9
Mammography
  • Controversy exists over the efficacy of
    mammography in reducing mortality from breast
    cancer in 40-49 year old women.
  • Mammography has a high rate of false positive
    tests that cause anxiety and necessitate further
    costly diagnostic procedures.
  • Mammography exposes a woman to some radiation,
    which may slightly increase the risk of mutations
    in breast tissue.

10
Example
  • A 60-year-old woman has an abnormal mammogram.
    What is the chance that she has breast cancer?
    I.e., what is the positive predictive value?

11
Calculating PPV and NPV from a 2x2 table
                    Test +    Test -
  Disease +            a         b
  Disease -            c         d
  Total               a+c       b+d
Among those who test positive, how many truly
have the disease? PPV = a/(a+c)
Among those who test negative, how many truly do
not have the disease? NPV = d/(b+d)
12
Hypothetical Example
                    Test +    Test -
  Disease +            9         1
  Disease -          109       881
  Total              118       882
PPV = 9/118 = 7.6%
NPV = 881/882 = 99.9%
Prevalence of disease = 10/1000 = 1%
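As a check on these figures, here is a minimal sketch that computes the predictive values directly from the four cell counts of the hypothetical example. The variable names are illustrative only.

```python
# Cell counts from the hypothetical example (1000 people screened).
true_pos, false_neg = 9, 1      # people with disease: test +, test -
false_pos, true_neg = 109, 881  # people without disease: test +, test -

total = true_pos + false_neg + false_pos + true_neg

# PPV: among everyone who tests positive, the share who truly have disease.
ppv = true_pos / (true_pos + false_pos)

# NPV: among everyone who tests negative, the share who are truly disease-free.
npv = true_neg / (true_neg + false_neg)

prevalence = (true_pos + false_neg) / total

print(f"PPV        = {ppv:.1%}")
print(f"NPV        = {npv:.1%}")
print(f"prevalence = {prevalence:.1%}")
```

Note that only 9 of the 118 positive tests are true positives, which is why the PPV is so low even though the test catches 90% of cases.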
13
What if disease was twice as prevalent in the
population?
                    Test +    Test -    Total
  Disease +           18         2        20
  Disease -          108       872       980
Sensitivity = 18/20 = .90
Specificity = 872/980 = .89
Sensitivity and specificity are characteristics
of the test, so they don't change!
14
What if disease was more prevalent?
                    Test +    Test -
  Disease +           18         2
  Disease -          108       872
  Total              126       874
PPV = 18/126 = 14.3%
NPV = 872/874 = 99.8%
Prevalence of disease = 20/1000 = 2%
15
Conclusions
  • Positive predictive value increases with
    increasing prevalence of disease.
  • It also increases if you change the diagnostic
    test to improve its accuracy.
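The dependence of PPV on prevalence can be sketched with Bayes' rule. This is an illustration only: the function name is hypothetical, the sensitivity and specificity are taken from the first hypothetical example (9/10 and 881/990), and the 10% and 30% prevalences are extra values added here to show the trend.

```python
def ppv_from_prevalence(sens: float, spec: float, prev: float) -> float:
    """Positive predictive value via Bayes' rule."""
    true_pos = sens * prev                # P(test + and disease)
    false_pos = (1 - spec) * (1 - prev)   # P(test + and no disease)
    return true_pos / (true_pos + false_pos)


sens, spec = 9 / 10, 881 / 990  # test characteristics stay fixed

for prev in (0.01, 0.02, 0.10, 0.30):
    ppv = ppv_from_prevalence(sens, spec, prev)
    print(f"prevalence {prev:>4.0%} -> PPV {ppv:.1%}")
```

With the test held fixed, PPV rises steadily as the disease becomes more common.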

16
Review question 1
  • In a group of patients presenting to the
    hospital casualty department with abdominal pain,
    30% of patients have acute appendicitis. 70% of
    patients with appendicitis have a temperature
    greater than 37.5ºC; 40% of patients without
    appendicitis have a temperature greater than
    37.5ºC.
  • The sensitivity of temperature greater than
    37.5ºC as a marker for appendicitis is 21/49.
  • The specificity of temperature greater than
    37.5ºC as a marker for appendicitis is 42/70.
  • The positive predictive value of temperature
    greater than 37.5ºC as a marker for appendicitis
    is 21/30.
  • The predictive value of the test will be the same
    in a different population.
  • The specificity of the test will depend upon the
    prevalence of appendicitis in the population to
    which it is applied.
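One way to check the statements above is to lay out the 2x2 table for a hypothetical cohort. The cohort size of 100 is an assumption chosen only to give whole numbers; the percentages are the ones stated in the question.

```python
# Hypothetical cohort of 100 patients with abdominal pain (size is an assumption).
n_patients = 100
with_appendicitis = n_patients * 30 // 100               # 30% have appendicitis
without_appendicitis = n_patients - with_appendicitis

fever_with = with_appendicitis * 70 // 100                # 70% of cases have fever
fever_without = without_appendicitis * 40 // 100          # 40% of non-cases have fever
no_fever_without = without_appendicitis - fever_without

# Treat "temperature > 37.5ºC" as the test and appendicitis as the disease.
sensitivity = fever_with / with_appendicitis
specificity = no_fever_without / without_appendicitis
ppv = fever_with / (fever_with + fever_without)

print(f"sensitivity = {fever_with}/{with_appendicitis} = {sensitivity:.2f}")
print(f"specificity = {no_fever_without}/{without_appendicitis} = {specificity:.2f}")
print(f"PPV         = {fever_with}/{fever_with + fever_without} = {ppv:.2f}")
```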

18
Pitfalls of hypothesis testing
19
Pitfall 1: over-emphasis on p-values
  • Clinically unimportant effects may be
    statistically significant if a study is large
    (and therefore, has a small standard error and
    extreme precision).
  • Report the effect size and confidence interval.

20
Pitfall 2: association does not equal causation
  • Statistical significance does not imply a
    cause-effect relationship.
  • Interpret results in the context of the study
    design.

21
Pitfall 3: data dredging/multiple comparisons
  • In 1980, researchers at Duke randomized 1073
    heart disease patients into two groups, but
    treated the groups equally.
  • Not surprisingly, there was no difference in
    survival.
  • Then they divided the patients into 18 subgroups
    based on prognostic factors.
  • In a subgroup of 397 patients (with three-vessel
    disease and an abnormal left ventricular
    contraction) survival of those in group 1 was
    significantly different from survival of those in
    group 2 (p
  • How could this be since there was no treatment?

(Lee et al. Clinical judgment and statistics:
lessons from a simulated randomized trial in
coronary artery disease. Circulation, 61:
508-515, 1980.)
22
Pitfall 3: multiple comparisons
  • The difference resulted from the combined effect
    of small imbalances in the subgroups

23
Pitfall 3: multiple comparisons
  • If we compare survival of treatment and
    control within each of 18 subgroups, that's 18
    comparisons.
  • If these comparisons were independent, the chance
    of at least one false positive would be
    1 - (0.95)^18 = 0.60.

24
Multiple comparisons
With 18 independent comparisons, we have a 60%
chance of at least 1 false positive.
25
Multiple comparisons
With 18 independent comparisons, we expect about
1 false positive.
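Both figures follow from the 5% per-test false positive rate, assuming the 18 comparisons are independent. A minimal sketch of the arithmetic (variable names are illustrative):

```python
alpha = 0.05   # false positive rate for a single test
n_tests = 18   # number of subgroup comparisons

# Chance that none of the tests is a false positive is (1 - alpha) ** n_tests,
# so the chance of at least one false positive is the complement.
p_at_least_one = 1 - (1 - alpha) ** n_tests

# Expected number of false positives across all tests.
expected_false_positives = n_tests * alpha

print(f"P(at least one false positive) = {p_at_least_one:.2f}")
print(f"Expected false positives       = {expected_false_positives:.1f}")
```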
26
Pitfall 3: multiple comparisons
  • A significance level of 0.05 means that your
    false positive rate for one test is 5%.
  • If you run more than one test, your false
    positive rate will be higher than 5%.
  • Control study-wide type I error by planning a
    limited number of tests. Distinguish between
    planned and exploratory tests in the results.
    Correct for multiple comparisons.

27
Results from Class survey
  • My research question was to test whether or not
    being born on odd or even days predicted anything
    about your future.
  • I discovered that people who were born on even
    days are more politically left-leaning and enjoy
    manuscript writing more than people who were born
    on odd days (p=.03, p=.02).
  • The differences were large and clinically
    meaningful. Those born on even days were 2.1
    units more left-leaning on average (7.9 vs. 5.8
    on a scale from 1-10) and enjoyed writing by 2.5
    points more (6.0 vs. 3.5).

28
Results from Class survey
  • I can see the NEJM article title now:
  • "Being born on even days predisposes you to
    leftist politics and prolificness."

29
Results from Class survey
  • Assuming that this difference can't be explained
    by astrology, it's obviously an artifact!
  • What's going on?

30
Results from Class survey
  • After the odd/even day question, I asked you 24
    other questions.
  • I ran 24 statistical tests (comparing the outcome
    variable between odd-day born people and even-day
    born people).
  • So, there was a high chance of finding at least
    one false positive!
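To put a number on "a high chance", and to show what one correction would do to the two reported p-values, here is a small sketch. The Bonferroni correction is used only as one common choice; the slides do not name a specific method, and the tests are assumed independent.

```python
alpha = 0.05
n_tests = 24

# Chance of at least one false positive across 24 independent tests.
p_any_false_positive = 1 - (1 - alpha) ** n_tests

# Bonferroni (an assumed choice of correction): test each p-value
# against alpha divided by the number of tests.
bonferroni_threshold = alpha / n_tests

reported = {"political leaning": 0.03, "enjoys manuscript writing": 0.02}
still_significant = {
    name: p < bonferroni_threshold for name, p in reported.items()
}

print(f"P(at least one false positive) = {p_any_false_positive:.2f}")
print(f"Bonferroni threshold           = {bonferroni_threshold:.4f}")
for name, keep in still_significant.items():
    print(f"{name}: significant after correction? {keep}")
```

Neither finding survives the corrected threshold, which is consistent with treating both as artifacts.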

31
P-value distribution for the 24 tests
Recall: Under the null hypothesis of no
associations (which we'll assume is true here!),
p-values follow a uniform distribution.
32
Compare with
Next, I generated 24 p-values from a random
number generator (uniform distribution). These
were the results from two runs:
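The same experiment is easy to repeat. The sketch below is not the original simulation: the seed, the function name, and the number of repetitions are arbitrary assumptions. It draws 24 uniform p-values per run and then estimates how often a run contains at least one p-value below .05.

```python
import random


def simulate_p_values(n_tests: int, rng: random.Random) -> list[float]:
    """Draw p-values as they would behave if every null hypothesis were true."""
    return [rng.random() for _ in range(n_tests)]


rng = random.Random(0)  # fixed seed (arbitrary) so the runs are repeatable

# Two runs of 24 tests, as on the slide.
for run in (1, 2):
    p_values = simulate_p_values(24, rng)
    hits = sorted(p for p in p_values if p < 0.05)
    print(f"run {run}: {len(hits)} of 24 p-values below .05 -> {hits}")

# Repeat many times to estimate the chance of at least one "significant" result.
n_sims = 10_000
runs_with_hit = sum(
    any(p < 0.05 for p in simulate_p_values(24, rng)) for _ in range(n_sims)
)
share_with_hit = runs_with_hit / n_sims
print(f"share of runs with at least one p < .05: {share_with_hit:.2f}")
```

The estimated share should land near 1 - (0.95)^24, or about 0.71.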
33
In the medical literature
  • Hypothetical example
  • Researchers wanted to compare nutrient intakes
    between women who had fractured and women who had
    not fractured.
  • They used a food-frequency questionnaire and a
    food diary to capture food intake.
  • From these two instruments, they calculated daily
    intakes of all the vitamins, minerals,
    macronutrients, antioxidants, etc.
  • Then they compared fracturers to non-fracturers
    on all nutrients from both questionnaires.
  • They found a statistically significant difference
    in vitamin K between the two groups (p
  • They had a lovely explanation of the role of
    vitamin K in injury repair, bone, clotting, etc.

34
In the medical literature
  • Hypothetical example
  • Of course, they found the association only on the
    FFQ, not the food diary.
  • What's going on? Almost certainly artifactual
    (false positive!).

35
Pitfall 4: high type II error (low statistical
power)
  • Results that are not statistically significant
    should not be interpreted as "evidence of no
    effect," but as "no evidence of effect."
  • Studies may miss effects if they are
    insufficiently powered (lack precision).
  • Design adequately powered studies and report
    approximate study power if results are null.

36
Statistical Power
  • Statistical power is the probability of finding
    an effect if it's real.

37
Can we quantify how much power we have for given
sample sizes?
40
Study 1: 263 cases, 1241 controls
Null distribution: difference = 0.
Clinically relevant alternative: difference = 10%.
41
Study 1: 263 cases, 1241 controls
Power = chance of being in the rejection region if
the alternative is true = area to the right of this
line (in yellow)
Power here: 80%
42
Study 1: 50 cases, 50 controls
Power closer to 20% now.
43
Study 2: 18 treated, 72 controls, STD DEV = 2
Clinically relevant alternative: difference = 4
points
Power is nearly 100%!
44
Study 2: 18 treated, 72 controls, STD DEV = 10
Power is about 40%
45
Study 2: 18 treated, 72 controls, effect size = 1.0
Power is about 50%
Clinically relevant alternative: difference = 1
point
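The Study 2 scenarios can be roughly reproduced with a normal approximation. This is a sketch under stated assumptions, not the method used to draw the slides: it assumes a two-sided test at alpha = .05, a common standard deviation in both groups, and a z-test rather than a t-test, so the results only approximately track the quoted powers. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_means(diff: float, sd: float, n1: int, n2: int,
                    alpha: float = 0.05) -> float:
    """Approximate power of a two-sided z-test for a difference in means."""
    z = NormalDist()
    se = sd * sqrt(1 / n1 + 1 / n2)     # standard error of the difference
    z_crit = z.inv_cdf(1 - alpha / 2)   # edge of the rejection region
    shift = abs(diff) / se              # alternative mean, in SE units
    # Probability of landing in either rejection tail under the alternative.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


power_sd2_diff4 = power_two_means(diff=4, sd=2, n1=18, n2=72)
power_sd10_diff4 = power_two_means(diff=4, sd=10, n1=18, n2=72)
power_sd2_diff1 = power_two_means(diff=1, sd=2, n1=18, n2=72)

print(f"SD = 2,  difference = 4: power = {power_sd2_diff4:.2f}")
print(f"SD = 10, difference = 4: power = {power_sd10_diff4:.2f}")
print(f"SD = 2,  difference = 1: power = {power_sd2_diff1:.2f}")
```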
46
Factors Affecting Power
  • 1. Size of the effect
  • 2. Standard deviation of the characteristic
  • 3. Bigger sample size
  • 4. Significance level desired

47
1. Bigger difference from the null mean
48
2. Bigger standard deviation
49
3. Bigger Sample Size
50
4. Higher significance level
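The direction of each of the four factors can be illustrated by changing one input at a time. The sketch assumes a two-sided z-test for a difference in means; the baseline values (difference 1, SD 2, groups of 18 and 72, alpha .05) and the function name are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def power_two_means(diff: float, sd: float, n1: int, n2: int,
                    alpha: float = 0.05) -> float:
    """Approximate power of a two-sided z-test for a difference in means."""
    z = NormalDist()
    se = sd * sqrt(1 / n1 + 1 / n2)
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(diff) / se
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


baseline = power_two_means(diff=1, sd=2, n1=18, n2=72)

bigger_effect = power_two_means(diff=2, sd=2, n1=18, n2=72)   # factor 1
bigger_sd = power_two_means(diff=1, sd=4, n1=18, n2=72)       # factor 2
bigger_n = power_two_means(diff=1, sd=2, n1=36, n2=144)       # factor 3
higher_alpha = power_two_means(diff=1, sd=2, n1=18, n2=72, alpha=0.10)  # factor 4

print(f"baseline:                  {baseline:.2f}")
print(f"1. bigger difference:      {bigger_effect:.2f}")
print(f"2. bigger std deviation:   {bigger_sd:.2f}")
print(f"3. bigger sample size:     {bigger_n:.2f}")
print(f"4. higher alpha (.10):     {higher_alpha:.2f}")
```

Three of the changes raise power; a bigger standard deviation is the one that lowers it.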
51
Sample size calculations
  • Based on these elements, you can write a formal
    mathematical equation that relates power, sample
    size, effect size, standard deviation, and
    significance level.

52
Simplified formula for difference in proportions
53
Simplified formula for difference in means
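The formulas themselves are not reproduced in this transcript. As a sketch, the block below uses the standard textbook forms for two equal-sized groups and a two-sided test, which may differ in detail from the slide versions: n per group = 2 * sigma^2 * (z_power + z_alpha/2)^2 / difference^2 for means, with sigma^2 replaced by pbar * (1 - pbar) for proportions. Function names and the example inputs are assumptions.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_means(diff: float, sd: float,
                      power: float = 0.80, alpha: float = 0.05) -> int:
    """Per-group sample size for a difference in means (equal groups)."""
    z = NormalDist()
    z_sum = z.inv_cdf(power) + z.inv_cdf(1 - alpha / 2)
    return ceil(2 * sd ** 2 * z_sum ** 2 / diff ** 2)


def n_per_group_proportions(p1: float, p2: float,
                            power: float = 0.80, alpha: float = 0.05) -> int:
    """Per-group sample size for a difference in proportions (equal groups)."""
    z = NormalDist()
    z_sum = z.inv_cdf(power) + z.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2  # average proportion across the two groups
    return ceil(2 * p_bar * (1 - p_bar) * z_sum ** 2 / (p1 - p2) ** 2)


# Illustrative inputs (assumed): detect a 1-point difference with SD 2,
# or a difference between 50% and 40%, each with 80% power at alpha = .05.
n_means = n_per_group_means(diff=1, sd=2)
n_props = n_per_group_proportions(p1=0.50, p2=0.40)
print(f"means:       {n_means} per group")
print(f"proportions: {n_props} per group")
```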
54
Sample size calculators on the web
  • http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize
  • http://calculators.stat.ucla.edu
  • http://hedwig.mgh.harvard.edu/sample_size/size.html

55
These sample size calculations are idealized
  • They do not account for losses to follow-up
    (prospective studies)
  • They do not account for non-compliance (for
    intervention trial or RCT)
  • They assume that individuals are independent
    observations (not true in clustered designs)
  • Consult a statistician!
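As a rough illustration of why the idealized number is a floor, the sketch below applies two common rule-of-thumb inflations. Neither adjustment comes from the slides: the dropout inflation n / (1 - dropout rate) and the clustering design effect 1 + (m - 1) * ICC are widely used conventions, and the starting sample size, dropout rate, cluster size, and intraclass correlation are all hypothetical values.

```python
from math import ceil


def inflate_for_dropout(n: int, dropout_rate: float) -> int:
    """Enroll enough that n participants remain after expected dropout."""
    return ceil(n / (1 - dropout_rate))


def inflate_for_clustering(n: int, cluster_size: int, icc: float) -> int:
    """Multiply n by the design effect 1 + (m - 1) * ICC."""
    design_effect = 1 + (cluster_size - 1) * icc
    return ceil(n * design_effect)


n_idealized = 63  # hypothetical per-group result from a simple calculator

n_with_dropout = inflate_for_dropout(n_idealized, dropout_rate=0.20)
n_with_clusters = inflate_for_clustering(n_idealized, cluster_size=10, icc=0.05)

print(f"idealized:               {n_idealized} per group")
print(f"allowing 20% dropout:    {n_with_dropout} per group")
print(f"clusters of 10, ICC .05: {n_with_clusters} per group")
```

These are sketches of the direction and rough size of the adjustment; the slide's advice to consult a statistician still applies.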

56
Review Question 2
  • Which of the following elements does not increase
    statistical power?
  • Increased sample size
  • Measuring the outcome variable more precisely
  • A significance level of .01 rather than .05
  • A larger effect size.

58
Review Question 3
  • Most sample size calculators ask you to input a
    value for σ. What are they asking for?
  • The standard error
  • The standard deviation
  • The standard error of the difference
  • The coefficient of deviation
  • The variance

60
Review Question 4
  • You are conducting an RCT of a new blood
    pressure drug. You will test control and
    treatment patients before and after receiving
    drug and placebo. What test will you use to
    compare blood pressure changes between treatment
    and placebo?
  • A paired t-test
  • A two-sample t-test
  • A two-sample test of proportions.
  • An odds ratio.

62
Review Question 5
  • For your RCT, you want 80% power to detect a
    reduction of 10 points or more in the treatment
    group relative to placebo. What is 10 in your
    sample size formula?
  • a. Standard deviation
  • b. mean change
  • c. Effect size
  • d. Standard error
  • e. Significance level

64
Non-parametric tests
  • t-tests require your outcome variable to be
    normally distributed (or close enough).
  • Non-parametric tests are based on RANKS instead
    of means and standard deviations (population
    parameters).

65
Example: non-parametric tests
10 dieters following Atkins diet vs. 10 dieters
following Jenny Craig
Hypothetical RESULTS: Atkins group loses an average of 34.5
lbs. J. Craig group loses an average of 18.5
lbs.
Conclusion: Atkins is better?
66
Example: non-parametric tests
BUT, take a closer look at the individual
data:
Atkins, change in weight (lbs): 4, 3,
0, -3, -4, -5, -11, -14, -15, -300
J. Craig, change in weight (lbs): -8, -10, -12, -16, -18,
-20, -21, -24, -26, -30
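A quick way to see how much one value drives the Atkins average is to compare each group's mean with its median. The median is used here only as an illustrative outlier-resistant summary; the slides themselves do not report it.

```python
from statistics import mean, median

# Change in weight (lbs) for each dieter, as listed above.
atkins = [4, 3, 0, -3, -4, -5, -11, -14, -15, -300]
jenny_craig = [-8, -10, -12, -16, -18, -20, -21, -24, -26, -30]

atkins_mean, atkins_median = mean(atkins), median(atkins)
jc_mean, jc_median = mean(jenny_craig), median(jenny_craig)

print(f"Atkins:      mean = {atkins_mean}, median = {atkins_median}")
print(f"Jenny Craig: mean = {jc_mean}, median = {jc_median}")

# Dropping the single -300 value shows how much it pulls the Atkins mean.
atkins_without_outlier = [x for x in atkins if x != -300]
print(f"Atkins mean without the -300 value: {mean(atkins_without_outlier)}")
```

By the mean, Atkins looks better; by the median, the typical Jenny Craig dieter lost far more weight.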
67
[Histogram: Jenny Craig. Percent of dieters (0 to 30) by weight change in lbs (-30 to 20).]
68
[Histogram: Atkins. Percent of dieters (0 to 30) by weight change in lbs (-300 to 20).]
69
t-test doesn't work
  • Comparing the mean weight loss of the two groups
    is not appropriate here.
  • The distributions do not appear to be normally
    distributed.
  • Moreover, there is an extreme outlier (this
    outlier influences the mean a great deal).

70
Statistical tests to compare ranks
  • The Wilcoxon Rank-Sum test is the analogue of the
    two-sample t-test.

71
Wilcoxon Rank-Sum test
  • RANK the values, 1 being the least weight loss
    and 20 being the most weight loss.
  • Atkins
  • Change (lbs): 4, 3, 0, -3, -4, -5, -11, -14, -15, -300
  • Ranks: 1, 2, 3, 4, 5, 6, 9, 11, 12, 20
  • J. Craig
  • Change (lbs): -8, -10, -12, -16, -18, -20, -21, -24, -26, -30
  • Ranks: 7, 8, 10, 13, 14, 15, 16, 17, 18, 19

72
Wilcoxon Rank-Sum test
  • Sum of Atkins ranks:
  • 1 + 2 + 3 + 4 + 5 + 6 + 9 + 11 + 12 + 20 = 73
  • Sum of Jenny Craig's ranks:
  • 7 + 8 + 10 + 13 + 14 + 15 + 16 + 17 + 18 + 19 = 137
  • Jenny Craig clearly ranked higher!
  • P-value (from computer) = .018
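The ranking and the rank sums can be reproduced in a few lines. The p-value below uses a plain normal approximation without continuity correction, which is an assumption made for this sketch; statistical software may use an exact or corrected method, so its p-value can differ slightly from this one.

```python
from math import sqrt
from statistics import NormalDist

atkins = [4, 3, 0, -3, -4, -5, -11, -14, -15, -300]
jenny_craig = [-8, -10, -12, -16, -18, -20, -21, -24, -26, -30]

# Rank 1 = least weight loss (largest value), rank 20 = most weight loss.
# No two values are tied here, so every value gets a distinct rank.
pooled = sorted(atkins + jenny_craig, reverse=True)
rank_of = {value: rank for rank, value in enumerate(pooled, start=1)}

atkins_sum = sum(rank_of[v] for v in atkins)
jc_sum = sum(rank_of[v] for v in jenny_craig)

# Under the null hypothesis, the Atkins rank sum has this mean and SD.
n1, n2 = len(atkins), len(jenny_craig)
expected_sum = n1 * (n1 + n2 + 1) / 2
sd_sum = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)

z_score = (atkins_sum - expected_sum) / sd_sum
p_value = 2 * NormalDist().cdf(-abs(z_score))  # two-sided

print(f"Atkins rank sum      = {atkins_sum}")
print(f"Jenny Craig rank sum = {jc_sum}")
print(f"z = {z_score:.2f}, approximate two-sided p = {p_value:.3f}")
```

Because the test uses ranks, the -300 value counts only as "rank 20"; its size no longer dominates the comparison.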

73
Non-normal class data: coffee
74
Hypothesis
  • Students who played varsity sports in high school
    (call them the athlete group) drink more
    caffeinated coffee than those who did not (call
    them the non-athlete group).
  • Null hypothesis: no difference in coffee drinking
    between athletes and non-athletes

75
Use Wilcoxon rank-sum
  • Because numbers are small and the outcome variable is
    non-normal, use the Wilcoxon rank-sum test rather
    than the t-test.
  • Non-athlete values: 0 0 5 5 8 1 16 16
  • Athlete values: 0 0 0 0 0 0 0 1 2 3 4 8 8
  • P-value (from computer): 0.08

76
Review Question 6
  • When you want to compare mean blood pressure
    between two groups, you should:
  • Use a t-test
  • Use a nonparametric test
  • Use a t-test if blood pressure is normally
    distributed.
  • Use a two-sample proportions test.
  • Use a two-sample proportions test only if blood
    pressure is normally distributed.

78
Review Question 7
  • You want to compare two groups with regards to
    the proportion of people that have high blood
    pressure. What test do you use?
  • Use a t-test
  • Use a nonparametric test
  • Use a t-test if blood pressure is normally
    distributed.
  • Use a two-sample proportions test.
  • Use a two-sample proportions test only if blood
    pressure is normally distributed.

80
Review Question 8
  • The other statistic available to compare
    proportions between two groups is?
  • Wilcoxon Rank-sum test
  • Odds ratio/risk ratio
  • Linear regression
  • Paired t-test

82
Review Question 9
  • A randomized trial of two treatments for
    depression failed to show a statistically
    significant difference in improvement from
    depressive symptoms (p-value = .50). It follows
    that:
  • The treatments are equally effective.
  • Neither treatment is effective.
  • The study lacked sufficient power to detect a
    difference.
  • The null hypothesis should be rejected.
  • There is not enough evidence to reject the null
    hypothesis.

84
Review Question 10
  • Following the introduction of a new treatment
    regime in a rehab facility, alcoholism cure
    rates increased. The proportion of successful
    outcomes in the two years following the change
    was significantly higher than in the preceding
    two years (p-value
  • The improvement in treatment outcome is
    clinically important.
  • The new regime cannot be worse than the old
    treatment.
  • Assuming that there are no biases in the study
    method, the new treatment should be recommended
    in preference to the old.
  • All of the above.
  • None of the above.

86
Homework
  • Reading: continue reading textbook
  • Problem Set 6
  • Journal Article