Title: Design and Statistical Principles for Randomized Clinical Trials: An Overview
1. Design and Statistical Principles for Randomized Clinical Trials: An Overview
2. Randomized Phase III Trials
- Comparing multiple treatments to definitively demonstrate:
  - Superiority: test better than control (with respect to the primary endpoint)
  - Non-inferiority (equivalence): test no worse than control by a pre-specified delta
- Failing to detect a significant difference does NOT imply equivalence
3. Randomized Trials
- Treatment assignment randomized, i.e. unbiased
- Except for planned treatment differences, treatment arms are equal with respect to baseline patient factors and study conduct
- Outcome differences attributable to the different treatments
4. Randomized Trial Design Elements
- Randomization ratio
- Stratification
- Degree of treatment identity blinding
5. Randomization Ratio
- Equal N to each arm: smallest total N
- Unequal N per arm, e.g. a 2:1 experimental vs. control ratio
  - More safety data for the experimental arm, more attractive to potential subjects if placebo controlled, etc.
- Incrementally higher N if the ratio is not too drastic, e.g. 2-arm total N is 12-14% higher for a 2:1 vs. a 1:1 ratio
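The size penalty for unequal allocation can be sketched with the standard normal-theory variance-inflation factor: for an r:1 ratio, total N scales as (r + 1)^2 / (4r) relative to 1:1, giving about 12.5% for 2:1, consistent with the 12-14% figure above. A minimal illustrative sketch (not the deck's own calculation):

```python
def total_n_inflation(r: float) -> float:
    """Relative increase in total N for an r:1 vs. a 1:1 allocation
    at fixed power, using the normal-theory variance inflation
    factor (r + 1)**2 / (4 * r)."""
    return (r + 1) ** 2 / (4 * r) - 1

print(f"2:1 ratio: {total_n_inflation(2):.1%} more patients")  # 12.5%
print(f"3:1 ratio: {total_n_inflation(3):.1%} more patients")  # 33.3%
```

Note how the penalty grows quickly beyond 2:1, which is why more drastic ratios are rarely used.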
6. Randomization: Stratification
- To ensure balanced randomization with respect to established important baseline prognostic factors
- Example:
  - Gender known to be highly related to the trial's primary endpoint, e.g. females do better
  - Randomize within gender to guard against chance imbalance, e.g. test same as control but disproportionally more females on the test arm, making it look better than control
- Not necessary if N is really large
7. Randomized Trials: Blinding
- Open label: treatment assignment identity known to all
- Single blind: subject
- Double blind: entire study staff (except masked treatment/device dispenser and statistician) and subject; highest assurance of trial conduct uniformity across treatment arms
- 1.5 blind: outcome scorer and subject
8. Randomized Trials: Intent-to-Treat/Effectiveness Analysis Principle (ITT)
- For treatment effectiveness, all randomized patients are included and analyzed according to the randomized treatment assignment, regardless of actual treatment deviations
9. Intent-to-Treat Principle
- Validity of the treatment comparison rests on equality of the treatment arms at baseline
- Between-arm patient balance achieved by randomization must be preserved for analysis
10. Intent-to-Treat Principle
- Post-randomization deviations are often treatment related
  - Those randomized to test who could not be treated, or who received control instead, could be the more difficult cases for the test treatment
  - Compliance: doing poorly → noncompliance
  - A better than B → more B patients not doing well / noncompliant → noncompliance exclusions reduce treatment differences
- In a double-blinded study, exclusions are OK if due to inclusion/exclusion violations or no treatment received
11. ITT Principle
- Generally, once randomized, a patient counts towards the randomly assigned treatment regardless of actual treatment deviations (received little/no treatment, wrong treatment)
- No randomization cancellations / post-hoc exclusions (except for lack of data due to consent withdrawal or loss to follow-up)
12. ITT Principle: Implications
- Minimize potential post-randomization protocol deviations by:
  - Truly informed consent, to reduce post-hoc refusals and dropouts; e.g. subjects favoring one treatment over another or having study requirement compliance issues should not be consented
  - Timing of randomization as close to treatment divergence as logistically feasible (e.g. intra-operative randomization)
13. Statistical Considerations
- False positive error
- False negative error
- Sample size estimation
14. False Positive Error (type I error, α error, p value, significance level)
- False positive chance of a comparison (consumer's risk): seeing a statistically significant difference when the truth is null
- Calculated from trial data
- Threshold pre-specified; p < 0.05 commonly accepted as a small enough false positive risk for concluding that an observed difference reflects a real difference
- Related to trial size
15. False Positive Error (type I error, α error, p value, significance level)
- Arm 1 success rate 10/100 (10%), Arm 2 success rate 30/100 (30%)
  - p value = 0.0007; with a false positive chance of 7/10,000, infer the true underlying rates to be different
- Arm 1 success rate 1/10 (10%), Arm 2 success rate 3/10 (30%)
  - p value = 0.58; with a false positive chance of 58/100, cannot infer the true underlying rates to be different
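The two examples can be reproduced with Fisher's exact test. A self-contained sketch using only the standard library (the slide does not state which test produced its p values, so treat this as one plausible reconstruction):

```python
from math import comb

def fisher_exact_2sided(a, b, c, d):
    """Two-sided Fisher exact p value for the 2x2 table
    [[a, b], [c, d]] (successes, failures per arm): sum the
    probabilities of all tables with the same margins that are
    no more likely than the observed one."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    denom = comb(n, row1)
    def prob(k):  # hypergeometric P(k successes in arm 1)
        return comb(col1, k) * comb(n - col1, row1 - k) / denom
    p_obs = prob(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    return sum(p for p in (prob(k) for k in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

print(fisher_exact_2sided(10, 90, 30, 70))  # well below 0.05
print(fisher_exact_2sided(1, 9, 3, 7))      # about 0.58, not significant
```

The same 10% vs. 30% contrast flips from significant to non-significant purely because of the sample size.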
16. False Positive Error (type I error, α error, p value, significance level)
- Most quoted p values are 2-sided, i.e. testing whether two treatments are the same
- Most trials are NOT symmetrical, e.g. interested in whether experimental is better than control
- A 2-sided p < 0.05 level is really a 1-sided p < 0.025 level (the true false positive rate)
- Non-inferiority is inherently 1-sided
17. False Positive Error (type I error, α error, p value, significance level)
- 1-sided α threshold of 0.025 → 1 out of every 40 positive trials would be falsely positive
  - An error rate not necessarily acceptable from FDA's perspective
- Two independent, identically designed positive trials, each with 1-sided p < 0.025: chance of both trials being falsely positive < 0.000625, or ~6/10,000
- Better off (smaller N) doing a single large trial with 1-sided α pre-set at 0.000625
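The "smaller N" claim can be checked with a back-of-the-envelope comparison. Per-arm N is proportional to (z_alpha + z_beta)^2 at a fixed effect size, so the two strategies can be compared on that scale (an illustrative normal-approximation sketch assuming 0.90 power per comparison, not a calculation from the deck):

```python
from statistics import NormalDist

z = NormalDist().inv_cdf
z_beta = z(0.90)  # 0.90 power per comparison

# Total "sample size units" at a fixed effect size.
two_trials = 2 * (z(1 - 0.025) + z_beta) ** 2   # two trials, 1-sided 0.025 each
one_trial = (z(1 - 0.025 ** 2) + z_beta) ** 2   # one trial, 1-sided 0.000625

print(round(two_trials, 1), round(one_trial, 1))  # the single trial needs less
```

The single trial is also more powerful overall: both of two 0.90-power trials come out positive only 0.81 of the time.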
18. False Negative Error (β)
- A trial's chance of missing a significant difference when a true difference exists (sponsor's risk)
- For the same α, β decreases when N increases
19. Statistical Error Rates
- α and β are specified when designing the trial
- Along with the primary endpoint's data type and the effect size of interest, they determine the trial's sample size
20. Approximate Total Sample Size, 2 Arms, Continuous (Bell Shape) Data: 1:1 randomization, α = 0.05
21. Sample Size: Continuous Bell Shape Data
- N determinant: mean Δ/SD ratio
  - The larger the signal-over-noise ratio, the smaller the N; twice the ratio → ¼ the N
- N increases incrementally as power increases from 0.80 to 0.90
  - 0.80 power should not be the norm: too high a sponsor's risk to miss 1/5 of good new agents or devices
  - Power of 0.85 or 0.90, i.e. β of 1/7 or 1/10, should be considered
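The Δ/SD relationship can be sketched with the usual normal-approximation formula (a generic textbook approximation, not necessarily the exact formula behind the deck's table):

```python
from math import ceil
from statistics import NormalDist

z = NormalDist().inv_cdf

def n_per_arm(delta_over_sd, alpha=0.05, power=0.90):
    """Approximate per-arm N for a 2-arm, 1:1 trial with a normally
    distributed endpoint and 2-sided alpha:
    n = 2 * (z_{1 - alpha/2} + z_{power})**2 / (delta / SD)**2."""
    return ceil(2 * (z(1 - alpha / 2) + z(power)) ** 2 / delta_over_sd ** 2)

# Doubling the signal-to-noise ratio roughly quarters the N:
print(n_per_arm(0.5), n_per_arm(1.0))
# Raising power from 0.80 to 0.90 costs comparatively little:
print(n_per_arm(0.5, power=0.80), n_per_arm(0.5, power=0.90))
```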
22. Approximate Total Sample Size, 2 Arms, Binary (Yes/No) Data: 1:1 randomization, α = 0.05
23. Sample Size: Binary Data
- Primary sample size determinant: absolute success rate difference (not relative difference or ratio)
- Actual rates also matter, e.g. for the same Δ of 30%, N is smaller for 10% vs. 40% compared to 35% vs. 65% (N largest for rates around 50%)
- 0.85 or 0.90 power recommended
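The rate-dependence point follows from the binomial variance p(1 − p) peaking at 50%. A sketch using the standard unpooled normal approximation (again a generic formula, not necessarily the deck's exact table):

```python
from math import ceil
from statistics import NormalDist

z = NormalDist().inv_cdf

def n_per_arm_binary(p1, p2, alpha=0.05, power=0.90):
    """Approximate per-arm N for a 2-arm, 1:1 trial with a binary
    endpoint (normal approximation, unpooled variance):
    n = (z_{1 - alpha/2} + z_{power})**2 * (p1*q1 + p2*q2) / (p1 - p2)**2."""
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z(1 - alpha / 2) + z(power)) ** 2 * var / (p1 - p2) ** 2)

# Same absolute difference of 30%, different N:
print(n_per_arm_binary(0.10, 0.40))  # smaller
print(n_per_arm_binary(0.35, 0.65))  # larger: rates nearer 50%
```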
24. Randomized Phase II Trials
- Screening for promising new treatments
- Relax α to 1-sided 0.10 to 0.20; results not definitive (1-sided α = 0.025 for phase IIIs)
- Power = 0.90
- Single-arm phase II more efficient (about 1/4 the N) if historical controls are well established and stable
25. 2-arm Randomized Phase II, Time-to-Event Data (e.g. disease progression): Total Events Required, 1:1 Randomization, 0.90 Power
26. Interim Analysis
- Can be done as often as desired if results are not disclosed and no action of any kind will be taken
- If results are for interim presentation/publication only, the trial WILL continue to the pre-planned accrual goal regardless of interim results (extreme +/-)
- Early disclosure must have no potential of altering trial conduct, e.g. enrollment pattern (no difference → no more enrollment), randomization acceptance (test appears beneficial → no one wants control), etc.
- Present patient characteristics and safety data only; maintain treatment masking if a blinded trial
27. Formal Interim Analysis - Traditional
- Early trial termination considered if extremely unfavorable or favorable findings are detected
- Analysis plan with early stopping rules must be pre-specified, i.e. not data driven
- Statistical cost:
  - The more often data are acted upon, the more chances for error
  - Total 1-sided false positive error rate needs to be < 0.025
  - Typical interim 1-sided α set at < 0.0025 for consideration of early trial closure
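Why repeated looks are costly can be shown by simulation: repeatedly testing a true-null trial at 2-sided 0.05 after each stage, and stopping at the first "significant" result, inflates the overall false positive rate well above 0.05 (a toy simulation of a normal-outcome trial, not the deck's example):

```python
import random
from statistics import NormalDist

random.seed(0)
Z_CRIT = NormalDist().inv_cdf(0.975)   # 2-sided 0.05 at each look

def inflated_alpha(n_looks=5, n_per_stage=10, sims=10000):
    """Monte Carlo overall type I error when a null (mean-0)
    normal-outcome trial is z-tested after every stage, declaring
    significance at the first boundary crossing."""
    hits = 0
    for _ in range(sims):
        total, n = 0.0, 0
        for _ in range(n_looks):
            total += sum(random.gauss(0, 1) for _ in range(n_per_stage))
            n += n_per_stage
            if abs(total) / n ** 0.5 > Z_CRIT:   # z-test of mean 0
                hits += 1
                break
    return hits / sims

print(inflated_alpha())  # roughly 0.14 with 5 looks, not 0.05
```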
28. Formal Interim Analyses - Traditional
- Reason for the very small interim α threshold: early α values are not stable → only extreme findings are unlikely to be false signals and warrant drastic actions
- Interim analysis for futility is difficult if the primary endpoint requires substantial follow-up: enrollment may be near completion by the time endpoint data are available for, e.g., the first 50% of subjects
29. Formal Interim Analyses: Adaptive Design
- Stop for extreme interim results
- Otherwise calculate conditional power (CP) at the interim trend, i.e. the chance of a positive trial at the planned N given the interim data
- Proceed as planned to the original N if CP is reasonable
- Increase N if CP is promising but less than ideal, e.g. > 0.50 but < 0.80
- No α inflation under broad conditions
30. Common Mis-practices
- Overlapping confidence intervals imply no statistically significant differences
- Subset analyses
- Survival by another time-related outcome
  - More treatment is better
  - Responders live longer
31. Overlapping Confidence Intervals
32. Overlapping Confidence Intervals
33. Overlapping Confidence Intervals
- Do not imply no statistically significant differences
- There has to be a highly significant difference in order for the confidence intervals to be non-overlapping
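A quick numerical illustration with hypothetical means and standard errors: two 95% confidence intervals can overlap while the difference between arms is still statistically significant, because the SE of the difference is √2·SE, not the sum of the two half-widths.

```python
from statistics import NormalDist

phi = NormalDist().cdf

m1, m2, se = 0.0, 3.0, 1.0             # hypothetical arm means, SE 1.0 each

ci1 = (m1 - 1.96 * se, m1 + 1.96 * se)
ci2 = (m2 - 1.96 * se, m2 + 1.96 * se)
print("overlap:", ci1[1] > ci2[0])     # True: the 95% CIs overlap

z_diff = (m2 - m1) / (2 ** 0.5 * se)   # SE of difference = sqrt(2) * se
p_2sided = 2 * (1 - phi(z_diff))
print("p =", round(p_2sided, 3))       # below 0.05: still significant
```

With equal-width 95% CIs, non-overlap requires roughly z > 1.96·√2 ≈ 2.77, i.e. a 2-sided p below about 0.006, matching the "highly significant" point above.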
34. Subset Analysis
- No overall treatment difference:
  - Is treatment effective in some subset of patients? (males, extremity lesions, good performance, ...)
- Overall treatment difference:
  - Is treatment more effective in some subsets than others?
35. Subset Analysis: 5-FU/Levamisole as Adjuvant Therapy for Colon Cancer
- NCCTG Trial: therapy most effective for
  - Females
  - Young
- SWOG Trial: therapy most effective for
  - Males
  - Old
36. Subset Analysis: NCI Melanoma Interferon Trials
- E1684, HD IFN vs. Obs (N = 300)
- E1690, HD IFN vs. LD IFN vs. Obs (N = 600)
- E1694, HD IFN vs. GMK vaccine (N = 900)
- Virtually identical patient populations
  - Thick primary node negative, or any positive nodes
37. Subset Analysis: Melanoma HD IFN Studies
- E1684, HD IFN vs. Obs
  - Benefited most from IFN: single node
- E1690, HD IFN vs. LD IFN vs. Obs
  - Benefited most from HD IFN: 2-3 nodes
- E1694, HD IFN vs. GMK vaccine
  - Benefited most from IFN: node negative
38. Post-Hoc Subset Analysis
- Most trials are underpowered to begin with: subset analysis has very little power
  - High false negative rate
- Often, many subset analyses are done without correcting α for multiple comparisons
  - High false positive rate
- Almost all post-hoc subset analyses are wrong
- All post-hoc subset analyses must be confirmed
39. Subset Analysis: How Should One Proceed?
- Definitive subset analysis:
  - Pre-trial specified hypotheses by subset
  - Adequate sample size in the subset of interest
  - Adjust α for multiple comparisons
- Post-hoc subset analysis (exploratory):
  - Global test of subset-by-treatment interaction
  - Do subsets just for the randomization stratification factors
  - Cannot get around small n's / the high false negative rate
40. Survival by Another Time-Related Outcome
41. Survival by Another Time-Related Outcome: NSABP Breast Cancer Trial of LPAM
42. DFS by Delivered Dose: NSABP Trial of LPAM
- LPAM dose received:
  - > 85%
  - 65-84%
  - < 65%
  - Placebo
43. Disease-Free Survival by Dose of Placebo
44. Survival by Delivered Dose
- Time bias:
  - The shortest-living patients by default receive the least amount of treatment
  - Extreme example: comparing survival duration for heart transplant patients (treatment = yes) with those who die on the wait list (treatment = no)
  - Treatment received favors longer-living patients
45. Survival by Delivered Dose
- Common patient factors related to both survival and delivered dose
  - Ice cream sales are related to drowning deaths
    - Don't have ice cream before swimming?
    - Both are effects of a 3rd factor
- Association ≠ cause and effect
46. Survival by Delivered Dose
- The more placebo the better? Unlikely
- More likely:
  - More placebo delivered selects out longer survivors
  - Those with favorable baseline characteristics live longer and receive more treatment
- Same for active treatment
47. DFS by Delivered Dose: NSABP Trial of LPAM
- LPAM dose received:
  - > 85%
  - 65-84%
  - < 65%
  - Placebo
48. Survival by Tumor Response
- Same problems as survival by dose received: both are time-related outcomes
- Time bias:
  - The shortest-living patients are by default in the no-response group
- Common patient factors related to both survival and response:
  - Baseline prognosis (e.g. disease sites)
  - Age
49. Survival by a Time-Related Factor
- Dose intensity questions should be answered by randomized trials
- Survival by response status:
  - Landmark analysis can remove time bias
- Proper analysis techniques may remove time bias and demonstrate association, but cannot remove common-factor self-selection, and therefore still cannot infer cause and effect
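The time-bias and landmark points can be illustrated with a small simulation (toy numbers of my choosing: exponential survival with a 12-month mean, "response" assigned at random at month 6, but only to patients still alive, so response carries no real benefit):

```python
import random

random.seed(1)
LANDMARK = 6.0                                  # months; response assessed here
times = [random.expovariate(1 / 12) for _ in range(20000)]  # survival times

responders, nonresponders = [], []
for t in times:
    responds = t > LANDMARK and random.random() < 0.5  # must be alive to respond
    (responders if responds else nonresponders).append(t)

def mean(xs):
    return sum(xs) / len(xs)

# Naive comparison: responders look far better purely from time bias,
# since everyone dying before month 6 lands in the no-response group.
naive_gap = mean(responders) - mean(nonresponders)

# Landmark analysis: compare only patients alive at the landmark.
landmark_gap = mean(responders) - mean([t for t in nonresponders if t > LANDMARK])

print(round(naive_gap, 1), round(landmark_gap, 1))  # large gap vs. near zero
```

The landmark comparison removes the guarantee-time bias here because response was random among survivors; with real self-selection by baseline prognosis, even the landmark gap would not imply cause and effect.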