Statistical Inference in Cancer Clinical Trials - PowerPoint PPT Presentation

1
Statistical Inference in Cancer Clinical Trials
  • Mark Krailo
  • Keck School of Medicine, University of Southern
    California
  • And
  • Children's Oncology Group

2
Planning and Execution
Do It
Protocol
3
Inference
4
Inference
5
Single Therapy Phase II Study
  • Enroll 20 patients
  • If 2 or fewer respond, reject the agent as
    ineffective; otherwise move it forward
  • Design characteristics
  • True response rate 5% => agent rejected for
    further development with probability 93%
  • True response rate 35% => agent accepted for
    further development with probability 99%
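
The two design characteristics above are binomial tail probabilities. A minimal sketch, assuming independent patients with a common response rate (the function names are illustrative, not from the presentation):

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k responders among n patients."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_reject_agent(n, cutoff, p):
    """Probability of seeing `cutoff` or fewer responders, i.e. rejecting the agent."""
    return sum(binom_pmf(k, n, p) for k in range(cutoff + 1))

# Rule from the slide: 20 patients, reject the agent if 2 or fewer respond.
p_reject_inactive = prob_reject_agent(20, 2, 0.05)      # true rate 5%
p_accept_active = 1 - prob_reject_agent(20, 2, 0.35)    # true rate 35%
print(f"reject when RR = 0.05: {p_reject_inactive:.3f}")
print(f"accept when RR = 0.35: {p_accept_active:.3f}")
```

Both probabilities should land close to the 93% and 99% quoted on the slide.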

6
Single Therapy Phase II Study
  • Enroll 20 patients and only 2 respond
  • We do not know the true response rate
  • Infer what it could be based on how unusual the
    result is
  • The agent isn't effective because the data we see
    are consistent with a 5% response rate

7
Single Therapy Phase II Study
  • What would be as or less likely than 2 of 20
    responders if RR = 0.05?

8
Single Therapy Phase II Study
  • What would be as or less likely than 2 of 20
    responders if RR = 0.35?

9
Single Therapy Phase II Study
  • RR = 0.05 is tenable; RR = 0.35 is not tenable
  • Convention: if a parameter yields a p-value of
    less than or equal to 0.05, it is not tenable
  • What we see is so unusual (an outcome as or less
    likely), we infer the parameter cannot take on
    that value
  • Identify all response rates that are tenable:
    the (1-p) x 100% confidence interval
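
One way to make "as or less likely" concrete is to add up the probabilities of every outcome that is no more probable than the one observed. The sketch below assumes that ordering of outcomes; other two-sided conventions exist, and the function names are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k responders among n patients."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def p_value_as_or_less_likely(observed, n, rate):
    """Total probability of outcomes no more likely than the observed count."""
    p_obs = binom_pmf(observed, n, rate)
    # The small relative tolerance guards against floating-point ties.
    return sum(q for q in (binom_pmf(k, n, rate) for k in range(n + 1))
               if q <= p_obs * (1 + 1e-9))

p_05 = p_value_as_or_less_likely(2, 20, 0.05)
p_35 = p_value_as_or_less_likely(2, 20, 0.35)
print(f"RR = 0.05: p = {p_05:.3f}")
print(f"RR = 0.35: p = {p_35:.3f}")
```

Under the 0.05 convention, the first rate stays tenable and the second does not, in line with the slide.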

10
Single Therapy Phase II Study
  • Precision is inversely related to confidence
    interval width
  • Confidence intervals are related to hypothesis
    testing
  • The confidence interval does not necessarily
    contain the truth

. cii 20 2, b exact

                                      -- Binomial Exact --
    Variable |   Obs      Mean   Std. Err.   [95% Conf. Interval]
-------------+----------------------------------------------------
             |    20        .1    .067082    .0123485    .3169827
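
The exact binomial interval in the Stata output can be reproduced without special software. This is a sketch assuming the Clopper-Pearson construction (each tail set to 2.5%), solved by bisection; the helper names are illustrative.

```python
from math import comb

def binom_cdf(k, n, p):
    """Probability of k or fewer responders among n patients."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def bisect_root(f, lo=0.0, hi=1.0, iters=100):
    """Root of f on [lo, hi], assuming f is positive at lo and negative at hi."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_ci(x, n, level=0.95):
    """Clopper-Pearson interval for x responders out of n."""
    tail = (1 - level) / 2
    # Lower limit: the rate at which P(X >= x) equals the tail area.
    lower = 0.0 if x == 0 else bisect_root(
        lambda p: tail - (1 - binom_cdf(x - 1, n, p)))
    # Upper limit: the rate at which P(X <= x) equals the tail area.
    upper = 1.0 if x == n else bisect_root(
        lambda p: binom_cdf(x, n, p) - tail)
    return lower, upper

lower, upper = exact_ci(2, 20)
print(f"{lower:.7f} {upper:.7f}")
```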
11
Confidence Intervals Become More Precise When
  • The number of observations increases

12
Confidence Intervals Become More Precise When
  • The observed proportion of successes moves away
    from 50%

13
How You Get There Can be Important Too
  • Design
  • Enroll 10 patients. If none respond, stop and
    conclude the agent is ineffective
  • If 1 or more patients respond, enroll 10 more.
    If 2 or fewer of the 20 respond, conclude the
    agent is ineffective
  • Simon optimal two-stage design
  • α = 0.07 for RR = 0.05; β = 0.12 (power 0.88) for RR = 0.25
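
The operating characteristics of this two-stage rule can be checked by summing over the stage 1 outcomes that allow the trial to continue. A minimal sketch, with parameter names chosen here for illustration:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k responders among n patients."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_declare_active(p, n1=10, n2=10, stop_at=0, final_cutoff=2):
    """Probability of passing stage 1 and exceeding the final cutoff."""
    total = 0.0
    for x1 in range(stop_at + 1, n1 + 1):      # stage 1 counts that continue
        needed = max(final_cutoff + 1 - x1, 0)  # responders still required
        stage2 = sum(binom_pmf(x2, n2, p) for x2 in range(needed, n2 + 1))
        total += binom_pmf(x1, n1, p) * stage2
    return total

alpha = prob_declare_active(0.05)   # type I error when RR = 0.05
power = prob_declare_active(0.25)   # power when RR = 0.25
print(f"alpha = {alpha:.3f}, power = {power:.3f}, beta = {1 - power:.3f}")
```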

14
How You Get There Can be Important Too
  • Result
  • Enrolled 10 patients: 1 patient responded
  • Enrolled 10 more patients: 1 patient responded
  • Conclusion: there is not enough evidence that the
    agent has sufficient activity to study further
  • Naïve 95% confidence interval: 1.23%-31.7%
  • Wrong, but not by much
  • Actual 95% confidence interval: 1.42%-34.3%

15
How You Get There Can be Important Too
  • Why?
  • Stopping early affects the possible outcomes of
    the trial
  • Cannot get 0 of 10 responders and then 2 of 10
    responders in the two-stage design
  • Reduces the number of possible ways to get to 2
    responders of 20 and so the calculation of what
    is as or less likely than the observed (2 of 20)
    outcome
  • Advantage: Can stop early if things look
    particularly bad
  • Cost: Slightly reduced precision at the trial's
    conclusion

16
When Designing or Evaluating a Single Therapy
Phase II
  • Start with a hypothesized true response rate
    (null hypothesis H0)
  • Conventionally one that indicates the agent is of
    no further interest
  • The plausibility of the null hypothesis is
    assessed by determining the probability of seeing
    an outcome as or less likely than what was
    observed (p-value)
  • Set the level of plausibility that implies doubt
    in the null before the data are collected
  • Type I (α) error: Chance of declaring the null
    hypothesis false when, actually, it is true
  • Conventionally set at 5-10%

17
When Designing or Evaluating a Single Therapy
Phase II
  • Enroll enough patients to be confident that, if a
    large enough difference exists, it will be
    identified
  • Set this probability to be high
  • Referred to as power and conventionally set at
    80-90%
  • Type II (β) error: Chance of declaring H0 tenable
    when, actually, it is false
  • Upon reporting the result
  • Describe the design fully
  • Evaluate the hypothesis plausibility
  • Provide a measure of precision through a
    confidence interval; 95% is usually used

18
When Designing or Evaluating a Single Therapy
Phase II
  • How do we know we have "enrolled enough patients
    to be confident that, if a large enough difference
    exists, it will be identified"?
  • We could enroll 1 patient
  • Rule: If the patient fails, H0 is supported
  • Otherwise it is rejected
  • If H0 is true, then reject only 5% of the time
  • Unfortunately, fail to reject 65% of the time
    when RR = 0.35

19
When Designing or Evaluating a Single Therapy
Phase II
  • We could enroll 100,000 patients
  • Rule: If fewer than 5113 patients respond, H0 is
    supported
  • Otherwise it is rejected
  • If H0 is true, then reject only 5% of the time
  • Reject >99.99% of the time when RR = 0.35
  • Software can find the number of patients N and
    cutoff k such that, if RR = 0.35, Pr(more than k
    of N patients respond) = 1-β
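
The search such software performs can be sketched directly. The code below assumes a one-sided rule "reject H0 if more than k of N respond", H0 of RR = 0.05, an alternative of RR = 0.35, and a 90% power target (the target is an assumption for illustration). The 100,000-patient cutoff is checked with a normal approximation rather than an exact calculation.

```python
from math import comb, sqrt
from statistics import NormalDist

def upper_tail(k, n, p):
    """Probability that more than k of n patients respond."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def smallest_design(p0, p1, alpha, power, max_n=200):
    """Smallest N (with its cutoff k) meeting the alpha and power targets."""
    for n in range(1, max_n + 1):
        # Smallest cutoff that keeps the type I error at or below alpha.
        k = next(k for k in range(n + 1) if upper_tail(k, n, p0) <= alpha)
        if upper_tail(k, n, p1) >= power:
            return n, k
    return None

n, k = smallest_design(p0=0.05, p1=0.35, alpha=0.05, power=0.90)
print(f"N = {n}, reject H0 if more than {k} respond")

# Normal approximation to the 5% cutoff for 100,000 patients under H0.
big_n, p0 = 100_000, 0.05
threshold = big_n * p0 + NormalDist().inv_cdf(0.95) * sqrt(big_n * p0 * (1 - p0))
print(f"approximate cutoff for N = 100,000: {threshold:.1f}")
```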

20
Comparing Categorical Factors
21
Categorical Factors and Outcomes: Evaluating
Associations
  • Two characteristics measured on each individual
  • Age at enrollment
  • Disease status at 5 years (recurrence v. disease
    free)
  • Null hypothesis: The probability of being a
    5-year survivor is the same in both age groups
  • The plausibility of the null hypothesis is
    assessed by determining the probability of seeing
    an outcome as or less likely than what was
    observed (p-value)
  • α will be set at 5%

22
Categorical Factors and Outcomes: Evaluating
Associations
Five-Year Event-Free Survivor (column) by Age at Enrollment (row)

Age at Enrollment     Yes       No    TOTAL
--------------------------------------------
17 or under           221      200      421
  Row %              52.5     47.5
  Exp.              210.9    210.0
--------------------------------------------
18 or over             20       40       60
  Row %              33.3     66.7
  Exp.               30.0     29.9
--------------------------------------------
TOTAL                 241      240      481
  Row %              50.1     49.9
23
Categorical Factors and Outcomes: Evaluating
Associations
24
Categorical Factors and Outcomes: Evaluating
Associations
Five-Year Event-Free Survivor (column) by Age at Enrollment (row)

Age at Enrollment     Yes       No    TOTAL
--------------------------------------------
17 or under           221      200      421
  Row %              52.5     47.5
  Exp.              210.9    210.0
--------------------------------------------
18 or over             20       40       60
  Row %              33.3     66.7
  Exp.               30.0     29.9
--------------------------------------------
TOTAL                 241      240      481
  Row %              50.1     49.9

p-value = 0.006
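
The expected counts in the table come from the row and column totals, and one common test built on them is the Pearson chi-square. The slides do not say which test produced the p-value, so the sketch below (uncorrected Pearson chi-square with 1 degree of freedom) is an assumption and may differ slightly from the quoted figure.

```python
from math import erfc, sqrt

def chi_square_2x2(table):
    """Expected counts, Pearson statistic and p-value for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    expected = [[r * col / n for col in col_totals] for r in row_totals]
    stat = sum((obs - exp) ** 2 / exp
               for obs_row, exp_row in zip(table, expected)
               for obs, exp in zip(obs_row, exp_row))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(stat / 2))
    return expected, stat, p_value

# Rows: 17 or under, 18 or over. Columns: survivor yes, no.
expected, stat, p_value = chi_square_2x2([[221, 200], [20, 40]])
print([[round(e, 1) for e in row] for row in expected])
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")
```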
25
Categorical Factors and Outcomes: Evaluating
Associations
  • The hypothesis that the failure rate is the same
    is not tenable
  • Why?
  • The statistical inference is easy; the causal
    inference is difficult
  • Older patients are more frail
  • Only older patients with high risk disease
    participated in the investigation
  • More older patients have poor pharmacological
    characteristics

26
Categorical Factors and Outcomes: Point Estimates
and Precision
  • Point estimates and measures of precision: the
    odds ratio (OR)
  • OR = 1 implies the probability of success is the
    same in both groups
  • OR > 1 implies the probability of success is
    greater in Group 1 than in Group 2
  • OR < 1 implies the probability of success is less
    in Group 1 when compared with Group 2
  • OR = 1 is referred to as "group membership is
    independent of success probability"

27
Categorical Factors and Outcomes: Point Estimates
and Precision
  • Estimate OR by using the observed proportions of
    those succeeding and failing in each group
  • In our case (0.525 x 0.667)/(0.333 x 0.475) = 2.21
  • 95% confidence interval
  • Identify all the ORs that would be considered
    tenable according to our p-value criteria
  • Short cut 2.21 1.96x0.697
  • (1.3, 3.9)
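
A common shortcut for the interval is a normal approximation on the log odds ratio scale. The sketch below assumes that approach; the slide's own shortcut is not fully legible, so this may not be the exact calculation the author used.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def odds_ratio_ci(a, b, c, d, level=0.95):
    """OR and approximate CI; a, b = successes, failures in group 1,
    c, d = successes, failures in group 2."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Group 1: age 17 or under (221 survivors, 200 not).
# Group 2: age 18 or over (20 survivors, 40 not).
or_age, lower, upper = odds_ratio_ci(221, 200, 20, 40)
print(f"OR = {or_age:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

To one decimal place the limits agree with the interval on the slide.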

28
Categorical Factors and Outcomes
  • Is a relative risk of 2 large? Some
    well-investigated ORs
  • Smoking and lung cancer: 23
  • Age at menarche and risk for breast cancer: 1.2
    for those with menarche at 14 or older v. 11 and
    under
  • First degree relative with breast cancer and
    subject's risk for breast cancer: 2
  • Most improvements in treatment outcome are
    between 1.5- and 3-fold OR (0.33-0.67 OR of
    failure)

29
Categorical Factors and Outcomes: Assigning
Intervention
  • Two characteristics measured on each individual
  • Treatment regimen
  • Disease status at 5 years (recurrence v. disease
    free)
  • Null hypothesis: The probability of being a
    5-year survivor is the same in both treatment
    groups
  • The plausibility of the null hypothesis is
    assessed by determining the probability of seeing
    an outcome as or less likely than what was
    observed (p-value)
  • α will be set at 5%
  • Assign the factor by randomization

30
Comparing Categorical Factors
31
Comparing Categorical Factors
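Five-Year Event-Free Survivor (column) by Regimen (row)

Regimen               Yes       No    TOTAL
--------------------------------------------
Standard              110      136      246
  Row %              44.7     55.3
  Exp.              123.2    122.7
--------------------------------------------
Experimental          131      104      235
  Row %              55.7     44.3
  Exp.              117.7    117.2
--------------------------------------------
TOTAL                 241      240      481
  Row %              50.1     49.9

p-value = 0.018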
32
Categorical Factors and Outcomes
  • The hypothesis that the failure rate is the same
    is not tenable
  • Why?
  • The statistical inference is easy; the causal
    inference is (relatively) easy
  • Is the experimental treatment superior?
  • Did the process put all the good prognosis
    patients on the experimental regimen
    purposefully? (No)
  • Did the process accidentally assign good
    prognosis patients to the experimental regimen?
    (With this many patients, the probability of a
    10% imbalance is much less than 0.4)
  • The only plausible explanation is the first
    choice: the experimental regimen is superior
  • Randomization is the basis for causal inference

33
One Example of Why Randomization is Best
  • National Wilms Tumor Study (NWTS) I and II
  • NWTS-I established a 2-drug regimen as the
    standard
  • NWTS-II randomized between 2 and 3 drugs
  • The 3-drug regimen was no better than historical
    controls, but significantly better than the
    randomized 2-drug regimen
  • Farewell and D'Angio. A simulated study of
    historical controls using real data. Biometrics
    37, 169-176, 1981

34
Categorical Factors and Outcomes
  • Estimate OR
  • OR of being a 5-year EF survivor for the
    Experimental regimen: 1.56
  • 95% confidence interval
  • (1.09, 2.23)

35
When Designing or Evaluating a Comparative Trial
  • Start with a hypothesized true response rate
    (null hypothesis H0)
  • Conventionally one that indicates there is no
    difference between the groups
  • The plausibility of the null hypothesis is
    assessed by determining the probability of seeing
    an outcome as or less likely than what was
    observed (p-value)
  • Set the level of plausibility that implies doubt
    in the null before the data are collected
  • Type I (α) error: Chance of declaring the null
    hypothesis false when, actually, it is true
  • Conventionally set at 5-10%

36
When Designing or Evaluating a Comparative Trial
  • Enroll enough patients to be confident that, if a
    clinically relevant difference exists, it will
    be identified
  • Set this probability to be high
  • Referred to as power and conventionally set at
    80-90%
  • Type II (β) error: Chance of declaring H0 tenable
    when, actually, it is false
  • Upon reporting the result
  • Describe the design fully
  • Evaluate the hypothesis plausibility
  • Provide a measure of precision through a
    confidence interval; 95% is usually used

37
When Designing or Evaluating a Comparative Trial
  • How do we know we have "enrolled enough patients
    to be confident that, if a large enough difference
    exists, it will be identified"?
  • We could randomize 8 patients (4 to each regimen)
  • Rule: If all four fail on one therapy and all
    four succeed on the other, H0 is rejected
  • Otherwise it is accepted
  • If H0 is true, then reject only 2% of the time
  • Unfortunately, reject only 9% of the time when
    Pr(success with standard) = 0.30 and Pr(success
    with experimental) = 0.50

38
When Designing or Evaluating a Comparative Trial:
Size Matters
39
The Abuse of Power
  • Power is calculated before the trial starts as an
    aid to determine whether the trial is worth doing
  • Quantifies the chance of the experiment
    identifying a regimen that should replace the
    standard when a difference of believable size
    exists
  • After the experiment is conducted, the confidence
    interval for the parameter quantifying difference
    is what matters
  • Post-hoc calculations of power are not useful

40
The Abuse of Power
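  • Five-Year Event-Free Survivor (column) by Regimen
    (row), patients 18 or over

    Regimen               Yes       No    TOTAL
    --------------------------------------------
    Standard               10       18       28
      Row %              35.7     64.3
    --------------------------------------------
    Experimental           10       22       32
      Row %              31.3     68.8
    --------------------------------------------
    TOTAL                  20       40       60
      Row %              33.3     66.7

    p-value = 0.78741
  • OR 0.81; 95% confidence interval 0.28-2.4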

41
The Abuse of Power
  • The subset is small and the results are
    consistent with the overall result
  • Confidence interval supports values indicative of
    an advantage for the Experimental regimen
  • Post-hoc calculations of power are not relevant
    since the (lack of) precision is quantified by
    the 95% confidence interval

42
The Abuse of Power
  • Five-Year Event-Free Survivor (column) by
    Metastases Present at Study Entry (row),
    Experimental patients only

    Metastases            Yes       No    TOTAL
    --------------------------------------------
    Yes                    11       44       55
      Row %              20.0     80.0
    --------------------------------------------
    No                    120       60      180
      Row %              66.7     33.3
    --------------------------------------------
    TOTAL                 131      104      235
      Row %              55.7     44.3

    p-value < 0.001
  • OR 0.125; 95% confidence interval 0.061-0.257

43
The Abuse of Power
  • Even though the subset is small, the estimate and
    precision are indicative of an association
  • Post-hoc calculations of power are not relevant
  • Even though a group of this size would be
    unlikely to identify an OR of 0.66, that does not
    reflect on the actual estimated OR and its
    precision

44
Underpowered Randomized Phase II Designs
  • Identify two regimens that are to be evaluated
  • Randomize a moderate number of patients between
    the two regimens
  • Select the regimen that has the best nominal
    response rate to go forward
  • A certain COG study: Randomize 72 patients to VTC
    v. VTC + Bevacizumab
  • Select a regimen to be incorporated into the next
    front-line randomized trial

45
Underpowered Randomized Phase II Designs
  • If we did choose a conventional level of evidence
    (α = 0.05), it is likely VTCB would not be selected
    even if the OR was 2.25
  • 1 will likely be in the confidence interval for
    the OR
  • No evidence to definitively differentiate between
    the regimens
  • Our criterion for evidence, however, is
    essentially α = 0.50
  • Because of the imprecision of the conclusions, we
    should not use randomized phase II results to
    displace the current standard
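
A rough power calculation shows why the conventional criterion and the pick-the-winner criterion behave so differently with 36 patients per arm. The sketch uses a normal approximation for two proportions. The 40% response rate on the standard arm is a hypothetical value chosen only for illustration, since the slides do not give one.

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p_std, odds_ratio, n_per_arm, one_sided_alpha):
    """Approximate power to favor the experimental arm (normal approximation)."""
    odds_exp = odds_ratio * p_std / (1 - p_std)
    p_exp = odds_exp / (1 + odds_exp)
    se = sqrt(p_std * (1 - p_std) / n_per_arm + p_exp * (1 - p_exp) / n_per_arm)
    z_crit = NormalDist().inv_cdf(1 - one_sided_alpha)
    return NormalDist().cdf((p_exp - p_std) / se - z_crit)

P_STD = 0.40  # hypothetical standard-arm response rate, not from the slides

# Two-sided alpha = 0.05 corresponds to 0.025 in the favorable direction.
power_conventional = power_two_proportions(P_STD, 2.25, 36, 0.025)
# Picking the nominally better arm behaves like a one-sided alpha of 0.50.
power_pick_winner = power_two_proportions(P_STD, 2.25, 36, 0.50)
print(f"conventional: {power_conventional:.2f}, pick the winner: {power_pick_winner:.2f}")
```

Under these assumptions the conventional test has well under 50% power, while the selection rule picks the truly better arm most of the time, at the cost of a 50% chance of selecting it when there is no difference.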

46
Comparing Time to Event
47
Comparing Time to Event
  • Std: 1, 47, 63, 67, 67, ... (200 eligible patients
    randomized to standard)
  • Exptl: 2, 3, 43, 64, 111, ... (198 eligible
    patients randomized to experimental)
  • Characterize outcome
  • Average? (1 + 47 + 63? + ...)/197
  • Median? How is 63 considered?
  • How do we compare across treatments?

48
Comparing Time to Event
49
Comparing Time to Event
50
Comparing Time to Event
51
Comparing Time to Event
  • Add the scores
  • Divide by a scaling factor (variance)
  • Size of this statistic is compared with
    standardized tables
  • Log-rank statistic is 10.02
  • Should be ≤ 3.84 (the 5% critical value) if H0 is
    tenable; p-value = 0.0016
  • Difference is not likely due to chance
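
The recipe above (add the scores, divide by the variance, compare with a chi-square table) can be sketched as follows. The patient data are hypothetical toy values, not the trial data, and the statistic is the standard two-group log-rank form referred to a chi-square distribution with 1 degree of freedom.

```python
from math import erfc, sqrt

def chi2_1df_p_value(stat):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return erfc(sqrt(stat / 2))

def logrank_statistic(group_a, group_b):
    """Log-rank statistic; each group is a list of (time, event) pairs,
    with event = 1 for a failure and 0 for a censored observation."""
    event_times = sorted({t for t, e in group_a + group_b if e})
    score = variance = 0.0
    for t in event_times:
        n_a = sum(1 for time, _ in group_a if time >= t)   # at risk in A
        n_b = sum(1 for time, _ in group_b if time >= t)   # at risk in B
        d_a = sum(1 for time, e in group_a if time == t and e)
        d_b = sum(1 for time, e in group_b if time == t and e)
        n, d = n_a + n_b, d_a + d_b
        score += d_a - d * n_a / n                          # observed - expected
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    return score ** 2 / variance

# Hypothetical toy data for illustration only.
toy_a = [(2, 1), (4, 1), (6, 0), (8, 1)]
toy_b = [(5, 1), (9, 0), (12, 1), (15, 0)]
toy_stat = logrank_statistic(toy_a, toy_b)
print(f"toy statistic = {toy_stat:.2f}, p = {chi2_1df_p_value(toy_stat):.3f}")

# The statistic quoted on the slide and the 5% critical value.
p_slide = chi2_1df_p_value(10.02)
p_critical = chi2_1df_p_value(3.84)
print(f"p for 10.02 = {p_slide:.4f}, p for 3.84 = {p_critical:.3f}")
```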

52
Comparing Time to Event
  • Non-graphical summary
  • Relative hazard rate (RHR)
  • Average of how fast one group fails compared
    with the group identified as the baseline group

53
Comparing Time to Event
  • A value of 1 for the RHR means that the average
    outcome is the same for both groups
  • The precision in the estimate of the RHR can be
    quantified and confidence intervals formed

54
Comparing Time to Event
  • For the example below
  • RHR (E v. S): 0.625
  • 95% confidence interval: 0.45-0.91

55
Comparing Time to Event
  • As with complete data studies, study size can be
    planned
  • Level of evidence needed to doubt H0 (a)
  • Hazard rates that are plausible
  • The certainty with which one wants to identify
    the difference (1-ß)
  • Accrual rate
  • Study size depends on the number of patients
    through the number of events
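
The dependence on the number of events can be illustrated with Schoenfeld's approximation for a 1:1 randomized comparison. This formula is a swapped-in standard approximation, not necessarily the method behind the slides. The two-sided α of 0.05 and the power values are assumptions, and the hazard ratio of 0.625 is borrowed from the example estimate.

```python
from math import ceil, log
from statistics import NormalDist

def events_needed(hazard_ratio, alpha=0.05, power=0.80):
    """Approximate total events for a 1:1 trial (Schoenfeld's formula)."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(4 * z_sum ** 2 / log(hazard_ratio) ** 2)

events_80 = events_needed(0.625, power=0.80)
events_90 = events_needed(0.625, power=0.90)
print(events_80, events_90)
```

The number of patients then follows from the accrual rate and follow-up time needed to observe that many events.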

56
Comparing Time to Event
57
Comparing Time to Event
58
Comparing Time to Event
59
Comparing Time to Event
60
Compliance
  • "We'll randomize him and if it's not the therapy
    he's likely to take, we can always report he
    electively switched"
  • Every eligible patient enrolled on a randomized
    study counts in outcome assessment
  • If a patient refuses the initial randomization:
  • Exclude as a non-complier (Treated as Randomized)
  • Assign outcome to the regimen he or she actually
    got (As Treated)
  • Assign the outcome to the randomized regimen
    regardless of what she or he actually got (As
    Randomized)

61
Compliance: When Switching is Not Related to
Prognosis
62
Compliance: When Switching is Related to Prognosis
63
Compliance
  • As randomized analysis gives a predictable result
  • Correctly identifies when there is no treatment
    difference
  • Underestimates true treatment difference
  • Other two strategies can identify the better
    regimen as inferior in some circumstances
  • Rule of thumb: 4 compliant patients are needed
    for every 1 non-compliant patient to maintain the
    designed power
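
Why the as-randomized analysis underestimates a true difference, yet still shows none when none exists, can be seen with simple arithmetic. The sketch assumes switching is unrelated to prognosis and that a patient who switches gets the success rate of the regimen actually received. All rates are hypothetical.

```python
def as_randomized_rates(p_std, p_exp, switch_exp_to_std=0.0, switch_std_to_exp=0.0):
    """Success rates observed per randomized arm when some patients switch."""
    obs_std = (1 - switch_std_to_exp) * p_std + switch_std_to_exp * p_exp
    obs_exp = (1 - switch_exp_to_std) * p_exp + switch_exp_to_std * p_std
    return obs_std, obs_exp

# Hypothetical: true rates 45% v. 55%, 20% of the experimental arm switches.
obs_std, obs_exp = as_randomized_rates(0.45, 0.55, switch_exp_to_std=0.20)
print(f"true difference 0.10, as-randomized difference {obs_exp - obs_std:.2f}")

# With no true difference, switching leaves both arms identical.
null_std, null_exp = as_randomized_rates(0.50, 0.50, switch_exp_to_std=0.20)
print(f"no true difference: {null_std:.2f} v. {null_exp:.2f}")
```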

64
Example of Non-Compliance
65
Example of Non-Compliance
66
Example of Non-Compliance
67
Example of Non-Compliance
  • As-randomized analysis indicates Crx + XRT provides
    an outcome advantage
  • The true treatment difference is underestimated
  • The As-Treated Analysis is not significant
  • Implies As-Treated is even more severely biased

68
When Publishing or Evaluating a Publication
  • Specify the design parameters
  • Level of evidence required to consider H0 as
    untenable
  • Alternative considered clinically relevant and
    power
  • Planned time for enrollment and follow-up of last
    patient
  • Interim monitoring plans
  • Number of non-compliers
  • How were these considered in the analysis?
  • Was accrual adjusted to maintain power if
    non-compliance was observed?