Title: Statistical Inference in Cancer Clinical Trials
1. Statistical Inference in Cancer Clinical Trials
- Mark Krailo
- Keck School of Medicine, University of Southern California, and
- Children's Oncology Group
2. Planning and Execution
Do It
Protocol
3. Inference
4. Inference
5. Single Therapy Phase II Study
- Enroll 20 patients
- If 2 or fewer respond, reject the agent as ineffective; otherwise move it forward
- Design characteristics
- True response rate 5%: agent rejected for further development with probability 93%
- True response rate 35%: agent accepted for further development with probability 99%
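The design characteristics above are binomial tail probabilities. The following is a minimal sketch, assuming independent patients with a common response rate; the helper name `binom_cdf` is illustrative and not part of the original slides. It gives values close to the 93% and 99% quoted.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

n, cutoff = 20, 2  # reject the agent if 2 or fewer of 20 respond

# Chance of rejecting the agent when the true response rate is 5%.
p_reject_if_5 = binom_cdf(cutoff, n, 0.05)
# Chance of moving the agent forward when the true response rate is 35%.
p_accept_if_35 = 1 - binom_cdf(cutoff, n, 0.35)

print(f"reject at RR=0.05: {p_reject_if_5:.3f}")
print(f"accept at RR=0.35: {p_accept_if_35:.3f}")
```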
6. Single Therapy Phase II Study
- Enroll 20 patients and only 2 respond
- We do not know the true response rate
- Infer what it could be based on how unusual the result is
- The agent isn't effective because the data we see are consistent with a 5% response rate
7. Single Therapy Phase II Study
- What would be as or less likely than 2 of 20 responders if RR = 0.05?
8. Single Therapy Phase II Study
- What would be as or less likely than 2 of 20 responders if RR = 0.35?
9. Single Therapy Phase II Study
- RR = 0.05 is tenable; RR = 0.35 is not tenable
- Convention: if a parameter yields a p-value of less than or equal to 0.05, it is not tenable
- What we see is so unusual (an outcome as or less likely) that we infer the parameter cannot take on that value
- Identify all response rates that are tenable: the (1-p) x 100% confidence interval
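The "as or less likely" rule can be made concrete with a short calculation. This is a sketch under the assumption that the p-value is the total probability of every outcome whose probability does not exceed that of the observed one; the function names are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def exact_p_value(observed, n, rate):
    """Sum the probabilities of all outcomes as or less likely than the observed one."""
    p_obs = binom_pmf(observed, n, rate)
    return sum(
        binom_pmf(k, n, rate)
        for k in range(n + 1)
        if binom_pmf(k, n, rate) <= p_obs * (1 + 1e-9)  # small tolerance for ties
    )

p_05 = exact_p_value(2, 20, 0.05)  # well above 0.05, so RR = 0.05 is tenable
p_35 = exact_p_value(2, 20, 0.35)  # below 0.05, so RR = 0.35 is not tenable
print(f"{p_05:.3f} {p_35:.3f}")
```

Repeating this for a grid of response rates and keeping those with p-value above 0.05 would trace out the tenable set, which is the confidence interval.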
10. Single Therapy Phase II Study
- Precision is inversely related to confidence interval width
- Confidence intervals are related to hypothesis testing
- The confidence interval does not necessarily contain the truth
. cii 20 2, b exact

                                        -- Binomial Exact --
Variable |   Obs        Mean    Std. Err.    [95% Conf. Interval]
---------+--------------------------------------------------------
         |    20          .1     .067082     .0123485    .3169827
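The exact interval above can be reproduced without Stata. The sketch below assumes the Clopper-Pearson construction (each limit is the response rate at which the one-sided tail probability equals 2.5%) and finds the limits by bisection; `solve` and `clopper_pearson` are illustrative names.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def solve(f, lo=0.0, hi=1.0, iters=100):
    """Bisection for a function that is positive at lo and negative at hi."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(x, n, level=0.95):
    alpha = 1 - level
    # Lower limit: rate at which x or more responders has probability alpha/2.
    lower = 0.0 if x == 0 else solve(lambda p: alpha / 2 - (1 - binom_cdf(x - 1, n, p)))
    # Upper limit: rate at which x or fewer responders has probability alpha/2.
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p) - alpha / 2)
    return lower, upper

lower, upper = clopper_pearson(2, 20)
print(f"{lower:.7f} {upper:.7f}")
```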
11. Confidence Intervals Become More Precise When
- The number of observations increases
12. Confidence Intervals Become More Precise When
- The observed proportion of successes moves away from 50%
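Both effects can be seen from the width of an approximate interval. This sketch uses the normal (Wald) approximation rather than an exact interval, purely to keep the arithmetic visible; that choice and the function name are assumptions, not something the slides specify.

```python
from math import sqrt

def approx_ci_width(n, proportion, z=1.96):
    """Width of the normal-approximation 95% interval for a proportion."""
    return 2 * z * sqrt(proportion * (1 - proportion) / n)

# More observations, same observed proportion: the interval narrows.
width_n20 = approx_ci_width(20, 0.5)
width_n100 = approx_ci_width(100, 0.5)

# Same number of observations, proportion further from 50%: the interval narrows.
width_p50 = approx_ci_width(100, 0.5)
width_p10 = approx_ci_width(100, 0.1)

print(f"{width_n20:.3f} {width_n100:.3f} {width_p10:.3f}")
```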
13. How You Get There Can Be Important Too
- Design
- Enroll 10 patients. If none respond, stop and conclude the agent is ineffective
- If 1 or more patients respond, enroll 10 more. If 2 or fewer of the 20 respond, conclude the agent is ineffective
- Simon optimal two-stage design
- α = 0.07 for RR = 0.05; β = 0.12 (power 0.88) for RR = 0.25
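The error rates of this two-stage rule can be checked by enumerating the stage-1 and stage-2 outcomes. A minimal sketch, assuming the final cutoff of 2 applies to the total across both stages; the function and parameter names are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def prob_declare_active(rate, n1=10, n2=10, stop_at=0, max_inactive=2):
    """Probability of passing stage 1 and finishing with more than max_inactive responders."""
    total = 0.0
    for x1 in range(stop_at + 1, n1 + 1):  # stage-1 counts that continue to stage 2
        for x2 in range(n2 + 1):
            if x1 + x2 > max_inactive:
                total += binom_pmf(x1, n1, rate) * binom_pmf(x2, n2, rate)
    return total

alpha = prob_declare_active(0.05)  # false-positive rate under RR = 0.05
power = prob_declare_active(0.25)  # chance of moving forward under RR = 0.25
print(f"alpha = {alpha:.3f}, power = {power:.3f}, beta = {1 - power:.3f}")
```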
14. How You Get There Can Be Important Too
- Result
- Enrolled 10 patients: 1 patient responded
- Enrolled 10 more patients: 1 patient responded
- Conclusion: not enough evidence to consider the agent of sufficient activity to study further
- Naïve 95% confidence interval: 1.23%-31.7%
- Wrong, but not by much
- Actual 95% confidence interval: 1.42%-34.3%
15. How You Get There Can Be Important Too
- Why?
- Stopping early affects the possible outcomes of the trial
- Cannot get 0 of 10 responders and then 2 of 10 responders in the two-stage design
- This reduces the number of possible ways to get to 2 responders of 20, and so changes the calculation of what is as or less likely than the observed (2 of 20) outcome
- Advantage: can stop early if things look particularly bad
- Cost: slightly reduced precision at the trial's conclusion
16. When Designing or Evaluating a Single Therapy Phase II
- Start with a hypothesized true response rate (null hypothesis H0)
- Conventionally one that indicates the agent is of no further interest
- The plausibility of the null hypothesis is assessed by determining the probability of seeing an outcome as or less likely than what was observed (p-value)
- Set the level of plausibility that implies doubt in the null before the data are collected
- Type I (α) error: chance of declaring the null hypothesis false when, actually, it is true
- Conventionally set at 5-10%
17. When Designing or Evaluating a Single Therapy Phase II
- Enroll enough patients to be confident that, if a large enough difference exists, it will be identified
- Set level to be high
- Referred to as power and conventionally set at 80-90%
- Type II (β) error: chance of declaring H0 tenable when, actually, it is false
- Upon reporting the result
- Describe the design fully
- Evaluate the hypothesis plausibility
- Provide a measure of precision through a confidence interval; 95% is usually used
18. When Designing or Evaluating a Single Therapy Phase II
- How do we know how to "enroll enough patients to be confident that, if a large enough difference exists, it will be identified"?
- We could enroll 1 patient
- Rule: if the patient fails, H0 is supported
- Otherwise it is rejected
- If H0 is true, then reject only 5% of the time
- Unfortunately, fail to reject 65% of the time when RR = 0.35
19. When Designing or Evaluating a Single Therapy Phase II
- We could enroll 100,000 patients
- Rule: if fewer than 5113 patients respond, H0 is supported
- Otherwise it is rejected
- If H0 is true, then reject only 5% of the time
- Reject >99.99% of the time when RR = 0.35
- Software can find the number of patients N such that, if RR = 0.35, Pr(more than k of N patients respond) ≥ 1-β
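The search that such software performs can be sketched in a few lines. This is an illustrative brute-force version, assuming a single-stage design, a type I error of at most 5% at RR = 0.05 and a power of at least 90% at RR = 0.35; the 90% target and the function names are assumptions, since the slides give only a conventional 80-90% range.

```python
from math import comb

def binom_sf(k, n, p):
    """P(X > k) for X ~ Binomial(n, p)."""
    return 1 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def smallest_single_stage(p0, p1, alpha=0.05, power=0.90, max_n=200):
    """Smallest N, with its cutoff k, meeting both error-rate targets."""
    for n in range(1, max_n + 1):
        # Smallest cutoff k that keeps the type I error at or below alpha.
        k = next(k for k in range(n + 1) if binom_sf(k, n, p0) <= alpha)
        if binom_sf(k, n, p1) >= power:
            return n, k
    return None

n, k = smallest_single_stage(0.05, 0.35)
print(f"enroll {n} patients; reject H0 if more than {k} respond")
```

Between the 1-patient and the 100,000-patient extremes, a design of this kind needs only a modest number of patients for these two response rates.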
20. Comparing Categorical Factors
21. Categorical Factors and Outcomes: Evaluating Associations
- Two characteristics measured on each individual
- Age at enrollment
- Disease status at 5 years (recurrence v. disease free)
- Null hypothesis: the probability of being a 5-year survivor is the same in both age groups
- The plausibility of the null hypothesis is assessed by determining the probability of seeing an outcome as or less likely than what was observed (p-value)
- α will be set at 5%
22. Categorical Factors and Outcomes: Evaluating Associations
Column: Five-Year Event-Free Survivor; Row: Age at Enrollment

Age at Enrollment      Yes       No    TOTAL
--------------------------------------------
17 or under            221      200      421
  Row %               52.5     47.5
  Exp.               210.9    210.0
--------------------------------------------
18 or over              20       40       60
  Row %               33.3     66.7
  Exp.                30.0     29.9
--------------------------------------------
TOTAL                  241      240      481
  Row %               50.1     49.9

23. Categorical Factors and Outcomes: Evaluating Associations
24. Categorical Factors and Outcomes: Evaluating Associations
Column: Five-Year Event-Free Survivor; Row: Age at Enrollment

Age at Enrollment      Yes       No    TOTAL
--------------------------------------------
17 or under            221      200      421
  Row %               52.5     47.5
  Exp.               210.9    210.0
--------------------------------------------
18 or over              20       40       60
  Row %               33.3     66.7
  Exp.                30.0     29.9
--------------------------------------------
TOTAL                  241      240      481
  Row %               50.1     49.9

p-value = 0.006
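The slides do not say which test produced the p-value for this table. One test that follows the "as or less likely" definition directly is Fisher's exact test, sketched below with the table margins held fixed; treating it as the method behind the slide's figure is an assumption, and the result should only be expected to be of the same order as the 0.006 shown.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided exact p-value for the 2x2 table [[a, b], [c, d]] with margins fixed."""
    row1, col1, col2, total = a + b, a + c, b + d, a + b + c + d
    denom = comb(total, row1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(col1, x) * comb(col2, row1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, row1 - col2), min(row1, col1)
    # Add up every table that is as or less likely than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Rows: 17 or under, 18 or over; columns: event-free survivor yes, no.
p_age = fisher_exact_two_sided(221, 200, 20, 40)
print(f"{p_age:.4f}")
```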
25. Categorical Factors and Outcomes: Evaluating Associations
- The hypothesis that the failure rate is the same is not tenable
- Why?
- The statistical inference is easy; the causal inference is difficult
- Older patients are more frail
- Only older patients with high-risk disease participated in the investigation
- More older patients have poor pharmacological characteristics
26. Categorical Factors and Outcomes: Point Estimates and Precision
- Point estimates and measures of precision
- OR = 1 implies the probability of success is the same in both groups
- OR > 1 implies the probability of success is greater in Group 1 than in Group 2
- OR < 1 implies the probability of success is less in Group 1 than in Group 2
- OR = 1 is referred to as "group membership is independent of success probability"
27. Categorical Factors and Outcomes: Point Estimates and Precision
- Estimate OR by using the observed proportions of those succeeding and failing in each group
- In our case: (0.525 x 0.667)/(0.333 x 0.475) = 2.21
- 95% confidence interval
- Identify all the ORs that would be considered tenable according to our p-value criterion
- Short cut: 2.21 1.96x0.697
- (1.3, 3.9)
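A common shortcut for the odds ratio interval works on the log scale. The sketch below uses Woolf's standard error, the square root of the summed reciprocal cell counts; the slides do not name their shortcut, so this is an assumed method that happens to give limits close to (1.3, 3.9) for the age table.

```python
from math import exp, sqrt

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and log-scale (Woolf) interval for the table [[a, b], [c, d]]."""
    odds_ratio = (a * d) / (b * c)
    se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    return odds_ratio, odds_ratio * exp(-z * se_log), odds_ratio * exp(z * se_log)

# Rows: 17 or under, 18 or over; columns: event-free survivor yes, no.
or_age, low, high = odds_ratio_ci(221, 200, 20, 40)
print(f"OR = {or_age:.2f}, 95% CI ({low:.2f}, {high:.2f})")
```

The same function applies to any 2x2 table of counts with no empty cells.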
28. Categorical Factors and Outcomes
- Is a relative risk of 2 large? Some well-investigated ORs
- Smoking and lung cancer: 23
- Age at menarche and risk for breast cancer: 1.2 for those with menarche at 14 or older v. 11 and under
- First-degree relative with breast cancer and subject's risk for breast cancer: 2
- Most improvements in treatment outcome are between a 1.5- and 3-fold OR (0.33-0.67 OR of failure)
29. Categorical Factors and Outcomes: Assigning Intervention
- Two characteristics measured on each individual
- Treatment regimen
- Disease status at 5 years (recurrence v. disease free)
- Null hypothesis: the probability of being a 5-year survivor is the same in both treatment groups
- The plausibility of the null hypothesis is assessed by determining the probability of seeing an outcome as or less likely than what was observed (p-value)
- α will be set at 5%
- Assign the factor by randomization
30. Comparing Categorical Factors
31. Comparing Categorical Factors
Column: Five-Year Event-Free Survivor; Row: Regimen

Regimen                Yes       No    TOTAL
--------------------------------------------
Standard               110      136      246
  Row %               44.7     55.3
  Exp.               123.2    122.7
--------------------------------------------
Experimental           131      104      235
  Row %               55.7     44.3
  Exp.               117.7    117.2
--------------------------------------------
TOTAL                  241      240      481
  Row %               50.1     49.9

p-value = 0.018
32. Categorical Factors and Outcomes
- The hypothesis that the failure rate is the same is not tenable
- Why?
- The statistical inference is easy; the causal inference is (relatively) easy
- Is the experimental treatment superior?
- Did the process purposefully put all the good-prognosis patients on the experimental regimen? (No)
- Did the process accidentally assign good-prognosis patients to the experimental regimen? (With this many patients, the probability of a 10% imbalance is much less than 0.4)
- The only plausible explanation is the first choice: the experimental regimen is superior
- Randomization is the basis for causal inference
33. One Example of Why Randomization is Best
- National Wilms Tumor Study (NWTS) I and II
- NWTS-I established a 2-drug regimen as the standard
- NWTS-II randomized between 2 and 3 drugs
- The 3-drug regimen was no better than historical controls, but significantly better than the randomized 2-drug regimen
- Farewell and D'Angio. A simulated study of historical controls using real data. Biometrics 37, 169-176, 1981
34. Categorical Factors and Outcomes
- Estimate OR
- OR of being a 5-year EF survivor for the Experimental regimen: 1.56
- 95% confidence interval
- (1.09, 2.23)
35. When Designing or Evaluating a Comparative Trial
- Start with a hypothesized true response rate (null hypothesis H0)
- Conventionally one that indicates there is no difference between the groups
- The plausibility of the null hypothesis is assessed by determining the probability of seeing an outcome as or less likely than what was observed (p-value)
- Set the level of plausibility that implies doubt in the null before the data are collected
- Type I (α) error: chance of declaring the null hypothesis false when, actually, it is true
- Conventionally set at 5-10%
36. When Designing or Evaluating a Comparative Trial
- Enroll enough patients to be confident that, if a clinically relevant difference exists, it will be identified
- Set level to be high
- Referred to as power and conventionally set at 80-90%
- Type II (β) error: chance of declaring H0 tenable when, actually, it is false
- Upon reporting the result
- Describe the design fully
- Evaluate the hypothesis plausibility
- Provide a measure of precision through a confidence interval; 95% is usually used
37. When Designing or Evaluating a Comparative Trial
- How do we know how to "enroll enough patients to be confident that, if a large enough difference exists, it will be identified"?
- We could randomize 8 patients (4 to each regimen)
- Rule: if all four fail on one therapy and all four succeed on the other, H0 is rejected
- Otherwise it is accepted
- If H0 is true, then reject only 2% of the time
- Unfortunately, reject 9% of the time when Pr(success with standard) = 0.30 and Pr(success with experimental) = 0.50
38. When Designing or Evaluating a Comparative Trial: Size Matters
39. The Abuse of Power
- Power is calculated before the trial starts as an aid to determine whether the trial is worth doing
- Quantifies the chance of the experiment identifying a regimen that should replace the standard when a difference of believable size exists
- After the experiment is conducted, the confidence interval for the parameter quantifying the difference is what matters
- Post-hoc calculations of power are not useful
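40. The Abuse of Power
Column: Five-Year Event-Free Survivor; Row: Regimen (18 or Over)

Regimen                Yes       No    TOTAL
--------------------------------------------
Standard                10       18       28
  Row %               35.7     64.3
--------------------------------------------
Experimental            10       22       32
  Row %               31.3     68.8
--------------------------------------------
TOTAL                   20       40       60
  Row %               33.3     66.7

p-value = .78741
- OR 0.81; 95% confidence interval 0.28-2.4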
41. The Abuse of Power
- The subset is small and the results are consistent with the overall result
- The confidence interval supports values indicative of an advantage for the Experimental regimen
- Post-hoc calculations of power are not relevant, since the (lack of) precision is quantified by the 95% confidence interval
42. The Abuse of Power
Column: Five-Year Event-Free Survivor; Row: Metastases Present at Study Entry (Experimental Patients Only)

Metastases             Yes       No    TOTAL
--------------------------------------------
Yes                     11       44       55
  Row %               20.0     80.0
--------------------------------------------
No                     120       60      180
  Row %               66.7     33.3
--------------------------------------------
TOTAL                  131      104      235
  Row %               55.7     44.3

p-value < 0.001
- OR 0.125; 95% confidence interval 0.061-0.257
43. The Abuse of Power
- Even though the subset is small, the estimate and precision are indicative of an association
- Post-hoc calculations of power are not relevant
- Even though a group of this size would be unlikely to identify an OR of 0.66, that does not reflect on the actual estimated OR and its precision
44. Underpowered Randomized Phase II Designs
- Identify two regimens that are to be evaluated
- Randomize a moderate number of patients between the two regimens
- Select the regimen that has the best nominal response rate to go forward
- A certain COG study: randomize 72 patients to VTC v. VTC + Bevacizumab
- Select a regimen to be incorporated into the next front-line randomized trial
45. Underpowered Randomized Phase II Designs
- If we did choose a conventional level of evidence (α = 0.05), it is likely VTCB would not be selected even if the OR was 2.25
- 1 will likely be in the confidence interval for the OR
- No evidence to definitively differentiate between the regimens
- Our criterion for evidence, however, is essentially α = 0.50
- Because of the imprecision of the conclusions, we should not use randomized phase II results to displace the current standard
46. Comparing Time to Event
47. Comparing Time to Event
- Std: 1, 47, 63, 67, 67, ... (200 eligible patients randomized to standard)
- Exptl: 2, 3, 43, 64, 111, ... (198 eligible patients randomized to experimental)
- Characterize outcome
- Average? (1 + 47 + 63? + ...)/197
- Median? How is 63 considered?
- How do we compare across treatments?
48. Comparing Time to Event
49. Comparing Time to Event
50. Comparing Time to Event
51. Comparing Time to Event
- Add the scores
- Divide by a scaling factor (variance)
- Size of this statistic is compared with standardized tables
- Log-rank
- Is 10.02
- Should be below 3.84 if H0 is tenable at α = 0.05; p-value 0.0016
- Difference not likely because of chance
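The "add the scores, divide by the variance" recipe can be sketched as a two-group log-rank statistic. The code below is an illustrative implementation run on hypothetical toy data, not the trial data from the slides; it also converts a chi-square statistic with 1 degree of freedom into a p-value, which for the 10.02 reported above gives a value close to the quoted 0.0016.

```python
from math import erfc, sqrt

def logrank_chi2(times, events, groups):
    """Two-group log-rank statistic.

    events[i] is 1 for a failure and 0 for a censored time; groups[i] is 0 or 1.
    """
    observed_minus_expected = 0.0
    variance = 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = [g for s, g in zip(times, groups) if s >= t]
        n, n1 = len(at_risk), sum(at_risk)
        failed = [g for s, e, g in zip(times, events, groups) if s == t and e]
        d, d1 = len(failed), sum(failed)
        observed_minus_expected += d1 - d * n1 / n  # score at this failure time
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return observed_minus_expected**2 / variance

def chi2_p_value_1df(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(statistic / 2))

# Hypothetical toy data: group 1 fails at times 1 and 2, group 0 at times 3 and 4.
chi2 = logrank_chi2([1, 2, 3, 4], [1, 1, 1, 1], [1, 1, 0, 0])
p_slide = chi2_p_value_1df(10.02)
print(f"toy chi-square = {chi2:.3f}, p for 10.02 = {p_slide:.4f}")
```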
52. Comparing Time to Event
- Non-graphical summary
- Relative hazard rate (RHR)
- Average of how fast one group fails compared with the group identified as the baseline group
53. Comparing Time to Event
- A value of 1 for the RHR means that the average outcome is the same for both groups
- The precision in the estimate of the RHR can be quantified and confidence intervals formed
54. Comparing Time to Event
- For the example below
- RHR (E v. S): 0.625
- 95% confidence interval: 0.45-0.91
55. Comparing Time to Event
- As with complete data studies, study size can be planned
- Level of evidence needed to doubt H0 (α)
- Hazard rates that are plausible
- The certainty with which one wants to identify the difference (1-β)
- Accrual rate
- Study size depends on the number of patients through the number of events
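One widely used approximation makes the role of events explicit: with equal allocation, the required number of events is roughly 4(z_α/2 + z_β)² / (ln RHR)². The slides do not state which formula they rely on, so the sketch below (Schoenfeld's approximation, two-sided α = 0.05, 80% power, and a planning RHR of 0.625 borrowed from the example for illustration) is an assumption.

```python
from math import ceil, log
from statistics import NormalDist

def required_events(hazard_ratio, alpha=0.05, power=0.80):
    """Approximate number of events for a two-arm, 1:1 log-rank comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided type I error
    z_beta = z.inv_cdf(power)
    return ceil(4 * (z_alpha + z_beta) ** 2 / log(hazard_ratio) ** 2)

events = required_events(0.625)
print(events)
```

The number of patients then follows from the accrual rate and follow-up needed to observe that many events.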
56. Comparing Time to Event
57. Comparing Time to Event
58. Comparing Time to Event
59. Comparing Time to Event
60. Compliance
- "We'll randomize him and if it's not the therapy he's likely to take, we can always report he electively switched"
- Every eligible patient enrolled on a randomized study counts in outcome assessment
- If a patient refuses initial randomization
- Exclude as a non-complier (Treated as Randomized)
- Assign the outcome to the regimen he or she actually got (As Treated)
- Assign the outcome to the randomized regimen regardless of what she or he actually got (As Randomized)
61. Compliance: When Switching is Not Related to Prognosis
62. Compliance: When Switching is Related to Prognosis
63. Compliance
- As-randomized analysis gives a predictable result
- Correctly identifies when there is no treatment difference
- Underestimates the true treatment difference
- The other two strategies can identify the better regimen as inferior in some circumstances
- Rule of thumb: 4 compliant patients are needed for every 1 non-compliant patient to maintain the designed power
64. Example of Non-Compliance
65. Example of Non-Compliance
66. Example of Non-Compliance
67. Example of Non-Compliance
- As-randomized analysis indicates CrxXRT provides an outcome advantage
- The true treatment difference is underestimated
- The as-treated analysis is not significant
- Implies the as-treated analysis is even more severely biased
68. When Publishing or Evaluating a Publication
- Specify the design parameters
- Level of evidence required to consider H0 as untenable
- Alternative considered clinically relevant and power
- Planned time for enrollment and follow-up of last patient
- Interim monitoring plans
- Number of non-compliers
- How were these considered in the analysis?
- Was accrual adjusted to maintain power if non-compliance was observed?