Title: EPI 260 Statistics in Phase II Clinical Trials
1. EPI 260: Statistics in Phase II Clinical Trials
Jimmy Hwang, Ph.D., Biostatistics Core, Cancer Center, UC San Francisco. April 29, 2010
2. Early Phase Clinical Development: Phase II Studies Statistics (in syllabus)
- Purpose of Phase II clinical studies
- Phase II study design
- formulation of testable hypotheses
- determine the study endpoints and when they will be evaluated
- define the population to be studied
- select the appropriate study design
- Determine the required sample size by making assumptions about the extent of benefit to be achieved with the new treatment and acceptable errors in making a final decision about whether the null hypothesis can be rejected
- Methods for statistical analysis
3. Four types of trial designs (1)
- Phase I: pharmacologically oriented
- The safe dose range
- The side effects
- How the body copes with the drug
- If the treatment shrinks cancer
- Phase II: preliminary evidence of efficacy and safety
- If the new treatment works well enough to test in phase 3
- Which types of cancer it is effective against
- More about side effects and how to manage them
- More about the most effective dose to use
4. Four types of trial designs (2)
- Phase III: new treatments are compared with the best currently available treatment (the standard treatment).
- A completely new treatment with the standard treatment
- Different doses or ways of giving a standard treatment
- A new radiotherapy schedule with the standard one
- Phase IV: post-marketing surveillance
- More about the side effects and safety of the drug
- What the long-term risks and benefits are
- How well the drug works when it's used more widely than in clinical trials
5. Statistical Considerations
- Define the Clinical Question (Objectives).
- Study Development and Protocol Development
- Types of Study (pilot, clinical trial, observational, etc.)
- Endpoints (feasibility and appropriateness)
- Protocol Development (objectives, aims, statistical design, patient selection, data collection procedures, number of points, stopping rules and interim analysis, statistical endpoints, analysis plan, sample size)
- During the Study: randomization, data quality control, interim analysis and/or monitoring of patient safety
- Finishing the Study: data lock, data analysis and interpretation, assisting decisions for follow-up studies, and preparation of papers and presentations
6. Statistical Perspectives
- Philosophy of inference divides statisticians: frequentists versus Bayesians
- Statistical procedures are not standardized.
- Things to consider
- Randomization
- Intent-to-treat Design
- Unbalanced groups
- Stratification
- Large-scale and small clinical trials, meta-analysis
- Adjusted or weighted analysis
- Trials can provide confirmatory evidence.
- Other methods are valid for making clinical
inferences.
7. Basic Question
- Clinical reasoning requires generalizing from individual patients.
- Statistical reasoning emphasizes inference based on structured data processing.
- Which treatment is safer and better?
8. Benefit could be defined as
- Antitumor activity
- Safety
- The pharmacokinetics or pharmacodynamics
- The biologic correlates which may predict
response or resistance to treatment and/or
toxicity
9. Intent-to-treat (ITT) Principle
- Unlike animal studies, the investigator cannot dictate what a participant should do in a clinical trial.
- A participant may forget to take the pills, receive a dose reduction due to toxicity, drop out of the study at any point, or be lost to follow-up (f/u).
- Use only full compliers? Use all subjects?
- ITT compares intervention strategies and not
interventions.
10. Standards of Ethical Conduct
- The study participants must give voluntary consent.
- There must be no reasonable alternative to conducting the experiment.
- The anticipated results must have a basis in biological knowledge and animal experimentation.
- The procedures should avoid unnecessary suffering and injury.
- There is no expectation of death or disability as a result of the trial.
11. Standards of Ethical Conduct
- The degree of risk for the patient is consistent with the humanitarian importance of the study.
- The subjects are protected against even a remote possibility of death or injury.
- The study must be conducted by qualified scientists.
- The subject can stop participation at will.
- The investigator has an obligation to terminate
the experiment if injury seems likely.
12. Study Protocol
- Every well-designed study requires a protocol.
- The protocol is a written agreement between the investigators, the participants, and the scientific community.
- The protocol is a comprehensive operational manual; it specifies the standard operating procedures (SOPs).
13. Defining study questions
- Each clinical trial must have a primary question.
- The primary question, as well as any secondary or
subsidiary questions, should be carefully
selected, clearly defined, and stated in advance.
- Selection of the questions
- Primary and secondary objectives
- Interventions
- Response variables
- Surrogate endpoints, biomarkers
14. Primary Objective
- Define the one question that the investigators are most interested in answering and that is capable of being adequately answered.
- Define the primary endpoint
- Toxicity, efficacy (response/survival), QOL
- Define the type of study
- Hypothesis testing or estimation,
- Superiority or equivalence trials
- The sample size is based on the primary objective.
15. Secondary Objectives
- Different endpoints
- Subgroup hypotheses
- Prospectively defined
- Based on reasonable expectations
- Limited in number
- Hypothesis testing vs. hypothesis generating
- Hunting expedition vs. fishing expedition
- Multiplicity Issues
16. What Study Aims Tell You
- Type of study and general design
- (pilot; phase I, II, or III; study arms)
- Who is eligible
- Outcome measure
- (e.g. toxicity, response, duration, biomarker)
- When outcome will be evaluated
- (Timing of evaluations)
17. Interim Analysis: Why?
- Many trials require large N and/or long duration.
- Interim analysis can result in more efficient
designs and correct conclusion can be reached
sooner.
- Ethical considerations
- Pace of scientific advancement demands learning
from the observed data.
- Public health concerns, pressure from activists
- Requirement from IRB and other regulatory agencies
18. Interim Analysis: Factors to Consider before Early Termination
- Possible difference in prognostic factors among arms
- Bias in assessing response variables
- Impact of missing data
- Differential concomitant tx or adherence
- Differential side effects
- Secondary outcomes
- Internal consistency
- External consistency, other trials
19. Interim Analysis: Reasons for early stopping
- Efficacy: treatments are convincingly different, or convincingly not different (judged by impartial, knowledgeable experts)
- Toxicity: serious adverse events, side effects, or toxicity are too severe (outweigh the potential benefits)
- Futility: a significant difference at the end of the trial is unlikely
- Data are of poor quality
- Accrual is too slow to complete the trial in a timely fashion
- More information becomes available outside the study (unnecessary or unethical to continue)
- Scientific questions are no longer important
- Poor adherence (preventing answers to the basic question)
- Resources for the study are lost or no longer available
- Fraud or misconduct undermines study integrity.
20. Interim Analysis: To Stop or Not To Stop?
- How sure?
- Is the evidence strong enough, or just due to
stochastic variation, or imbalance in covariates
or other factors?
- Wrongly stopping for efficacy: a false positive
- False claim that the drug is active
- Waste time and money for future development
- Wrongly stopping for futility: a false negative
- Kill a promising drug
- Group ethics vs. individual ethics
21. Data Assessment: Reasons for Noncompliance
- Toxicity or side effects
- Involving life style/behavior change
- Complex or inconvenient interventions
- Insufficient or lack of understanding of the instructions
- Change of mind, refusal
- Lack of family support
- If non-compliance is treatment-dependent, it will result in biased data
22. Data Assessment: Non-adherence
- Includes non- or partial compliers, drop-ins and drop-outs
- Could be due to toxicity, lack of efficacy, or refusal.
- Need to compare the non-adherence rate between arms
- Exclude from the analysis
- Rationale: pts not taking the medication will not benefit from it.
- Compares the optimal intervention vs. control
- Can lead to biased results
- Include in the analysis
- Intent-to-treat (ITT) principle
- Power is reduced, but there is also less bias
- More relevant for generalizing the study result to the real-world setting
- Do both: sensitivity analysis
23. Data Assessment: Poor Quality or Missing Data
- Missing visits may or may not be due to outcomes related to treatment, such as the patient's health status
- Informative or non-informative missingness
- Missing completely at random
- Missing at random (missingness does not depend on unobserved values)
- Not missing at random
- Available methods
- Complete case analysis
- Last value carried forward
- Single imputation
- Multiple imputation
- Sensitivity analysis
24. Defining Response Variables
- Dose limiting toxicities (DLT), complications
- Response, incidence of a disease, total
mortality, death from a specific cause
- Overall survival, time to progression, time to cancer
- Blood pressure, biomarkers, PSA, CD4 count
- Quality of life
- Cost and ease of administrating the intervention
- In general, a single response variable should be
identified to answer the primary question.
25. Defining Response Variables
- Define the questions prospectively and specifically
- The study drug can increase the response rate (PR+CR) from 25% to 50% in patients with a certain cancer
- The primary response variable can be assessed in all participants and as completely as possible
- Informative drop-out or loss to f/u due to toxicity
- Participation generally ends when the primary response variable occurs
- Off-drug, off-study, extended f/u
- Response variables should be unbiased and precisely assessed
- Hard, objective endpoints vs. soft, subjective endpoints
- Standardization of evaluation, central lab and pre-trial training
26. Scales of measurement
- Nominal
- Ordinal
- Interval
- Ratio
27. Statistical Methods for Categorical Data
- Goal: Analysis
- Describe one group: proportion
- Compare one group to a hypothetical value: chi-square test
- Compare two unpaired groups: chi-square test
- Compare two paired groups: McNemar's test
- Compare three or more unmatched groups: chi-square test
- Model the effect of multiple prognostic variables: logistic regression
- When the sample size is small, use Fisher's exact test
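These tests are available in standard software. As a quick sketch (not from the lecture, assuming scipy and statsmodels are installed; the counts are made up for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 table of response (yes/no) by arm for two unpaired groups (made-up counts).
table = np.array([[18, 22],   # arm A: responders, non-responders
                  [ 9, 31]])  # arm B: responders, non-responders
chi2, p, dof, expected = chi2_contingency(table)
print("chi-square p =", p)

# Fisher's exact test is preferred when expected cell counts are small.
odds_ratio, p_exact = fisher_exact(table)
print("Fisher's exact p =", p_exact)

# McNemar's test for two paired groups (e.g., response on two paired occasions).
paired = np.array([[20, 12],
                   [ 5, 23]])
print("McNemar p =", mcnemar(paired, exact=True).pvalue)
```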
28. Statistical Methods for Continuous Data
- Goal: Analysis
- Describe one group: mean, SD
- Compare one group to a hypothetical value: one-sample t-test
- Compare two unpaired groups: two-sample t-test
- Compare paired data: paired t-test
- Compare three or more unmatched groups: one-way ANOVA
29. Statistical Methods for Non-Parametric Data
- Goal: Analysis
- Describe one group: median, percentiles
- Compare one group to a hypothetical value: signed-rank test
- Compare two unpaired groups: Mann-Whitney (Wilcoxon rank-sum) test
- Compare paired data: signed-rank test
- Compare three or more unmatched groups: Kruskal-Wallis test
30. Statistical Methods for Survival Data
- Goal: Analysis
- Describe one group: Kaplan-Meier
- Compare two unpaired groups: log-rank test
- Compare three or more unmatched groups / continuous risk factors: Cox regression
- Model the effect of multiple prognostic factors: Cox regression
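For the first two rows of this table, a minimal sketch (assuming the lifelines package is available; the survival times and censoring are simulated, and the censoring is applied crudely, for illustration only):

```python
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

# Simulated survival times (months) and event indicators for two arms.
rng = np.random.default_rng(0)
t_a = rng.exponential(scale=30, size=60)   # arm A
t_b = rng.exponential(scale=40, size=60)   # arm B
e_a = rng.random(60) < 0.8                 # True = event observed, False = censored
e_b = rng.random(60) < 0.8

# Kaplan-Meier estimate (describe one group).
km = KaplanMeierFitter().fit(t_a, event_observed=e_a, label="arm A")
print("median survival, arm A:", km.median_survival_time_)

# Log-rank test (compare two unpaired groups).
res = logrank_test(t_a, t_b, event_observed_A=e_a, event_observed_B=e_b)
print("log-rank p-value:", res.p_value)
```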
31. Samples and Population
- Research findings are based on samples drawn from populations
- Inferential statistics allow us to infer what the population is like, based on sample data
- The population is the defined group of individuals from which a sample is drawn
- The sample should closely reflect the population; otherwise there is sampling bias.
32. Sampling
- The process of choosing members of a population
to be included in the sample
- Research uses data from a sample to make
inferences about a population.
33. Variability
- How much do scores vary about the average?
- Variance = (sum of squared deviations of each score from the mean) / (n - 1)
- Variance is small when scores are close to the mean
- Standard deviation = square root of the variance
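In symbols, the sample variance and standard deviation referred to above are (with x̄ the sample mean):

```latex
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2,
\qquad
s = \sqrt{s^2}
```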
34. Within-group variability
- Within-group variability is measured by the variance divided by the sample size
- Tells us how far individual scores deviate from the group mean
- This reflects "error"
- The number becomes lower with increasing sample
size
35. Two Group Means
- Ask samples of males and females about their
number of doctor visits during the past year
- Suppose the mean for males is 1.3 and the mean
for females is 2.1
36. Do males and females differ?
- Is the mean number for males different from the
mean number for females?
- Obviously, the sample means are different
- Can we infer that the population means differ as
well?
37. What's the Problem?
- The difference observed in the samples may be real
- However, the difference could just reflect the fact that there is some chance of error; there is always a margin of error around the sample value
38. Hypothesis Testing
- α = Type I error (level of significance) = the false-positive rate; 1 - α = specificity
- β = Type II error = the false-negative rate; 1 - β = power = sensitivity
- There is an inverse relationship between α and β for a given sample size.
- Sample size calculation: find N such that α and β are under control. Typically, compute N for a given α to yield (1 - β) × 100% power. For example, compute N for α = 0.05 to yield 80% power.
39. Null and Research Hypotheses
- Null hypothesis H0
- Population means are in fact equal
- Any mean difference observed in the samples reflects the margin of error
- The "straw man", or what you want to reject
- Any observed deviation from what we expect to see is due to chance variability
- Research hypothesis H1
- Population means are not equal
- The mean difference observed is real
- The claim, or what you want to accept or test
40. Alternative Hypotheses H1
Is the new treatment:
- Different from the standard? (2-sided)
- Better than the standard? (1-sided, directional)
- Not different from the standard? (Equivalence)
- Not worse than the standard? (Non-inferiority)
41. Hypothesis testing
- Problem: determine whether or not the population
means of two groups of subjects truly differ with
respect to the outcome of interest.
- Solution: assume that the two groups do not
differ, and see if the sample data disagree with
this assumption. That is, perform a hypothesis
test.
42. Hypothesis testing (cont'd)
- The null hypothesis assumes that there is no
difference in outcome between the two groups.
- The alternative hypothesis assumes that one group has a more favorable outcome than the other.
- The research hypothesis is usually the
alternative hypothesis.
43. Hypothesis testing (cont'd)
- To do a hypothesis test
- Calculate a test statistic from the data.
- Determine whether the value of the test statistic
is likely or unlikely under the null hypothesis.
- If the value is very unlikely, reject the null
hypothesis.
44. Hypothesis testing (cont'd)
- Problem: we might reject the null hypothesis when it is true.
- That is, we might commit a Type I error.
- Solution: construct the test so that there is only a 5% chance of incorrectly rejecting the null hypothesis.
- That is, the level of the test (alpha) is 0.05.
45. Type I Error
- The chance of rejecting a NULL which is true is α; this type of mistake is called a Type I error or false positive
- Reject the null hypothesis when it is true
- The likelihood is set by the alpha-level decision rule (usually .05)
- 5% is a reasonably low probability of being wrong, but it could be set lower
- For early phase II trials, we often use more liberal Type I errors so as not to miss possibly effective treatments
- In medical contexts, the specificity of a test is the chance that the test result is negative given that the subject is negative; this is just 1 - α
46. P < .05
- The alpha level for rejecting the null hypothesis is conventionally set at .05
- The obtained sample data are inconsistent with what the null hypothesis expects
- Reject the null hypothesis and therefore accept the research hypothesis
- Therefore, conclude that the obtained difference
in means is statistically significant
47. Type II Error
- Incorrectly accepting the null hypothesis when there really is a difference
- The chance of not rejecting a NULL which is false is β; this type of mistake is called a Type II error or a false negative
- In medical contexts, the sensitivity of a test is the chance that the test result is positive given that the subject is positive; this is just 1 - β, also called power
48. Power
- Probability of correctly rejecting the null
hypothesis
- 1 - Beta
- Power is higher with
- Large sample size
- Large difference between group means
- Low within-group variability
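A brief sketch of these relationships using statsmodels (assumed available); the sample sizes and standardized effect sizes below are illustrative choices, not values from the lecture:

```python
from statsmodels.stats.power import TTestIndPower

# Power of a two-sample t-test as a function of sample size and of the
# standardized effect size d = (difference in group means) / (within-group SD).
analysis = TTestIndPower()
for n in (20, 50, 100):                 # patients per arm
    for d in (0.2, 0.5, 0.8):           # small, medium, large effect
        power = analysis.power(effect_size=d, nobs1=n, alpha=0.05,
                               ratio=1.0, alternative='two-sided')
        print(f"n per arm = {n:3d}, d = {d:.1f}: power = {power:.2f}")
```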
49. What is a p-value?
- The p-value is the probability of obtaining data at least as extreme as the observed result when the null hypothesis is true.
- That is, the p-value is the strength of the evidence against the null hypothesis.
- For a level 0.05 test, we reject the null hypothesis if the p-value is 0.05 or less.
- Smaller p-values mean stronger evidence against H0.
- Statistical Significance or Clinical Significance
- Large samples: small differences may be significant
- Small samples: large differences may not be significant
- The frequentist inference depends on the sample space, i.e. the design.
50. What is a p-value?
- Decide whether or not to reject the NULL hypothesis H0 based on the chance of obtaining a TS as or more extreme (as far away from what we expected, or even farther, in the direction of the ALT) than the one we got, ASSUMING THE NULL IS TRUE
- The likelihood of observing the same outcome or one more extreme if the study were carried out again.
- This chance is called the observed significance level, or p-value
- A TS with a p-value less than some prespecified false-positive level (or size) α is said to be statistically significant at that level
51. What is a p-value?
- The interpretation of a p-value is a little tricky: in particular, it does NOT tell us the probability that the NULL hypothesis is true
- The p-value represents the chance that we would see a difference as big as we saw (or bigger) if there were really nothing happening other than chance variability.
- p = 0.08: 8 times out of 100, the same result or one more extreme would occur due to chance alone
- A single convenient number giving a measure of
the degree of surprise which the experiment
should cause a believer of the null hypothesis
52. Judging a p-value
- p > 0.05: the results are not statistically significant
- 0.05 to 0.10: a trend toward statistical significance
- 0.01 to 0.05: the results are significant
- 0.001 to 0.01: the results are highly significant
- p < 0.001: the results are very highly significant
53. Statistical Significance Tests
- Significance tests provide a way of making a
decision about the population means - There are many such tests used for different
types of data. But all use the same logic
54. Test statistic
- Measure how far the observed data are from what
is expected assuming the NULL (H0) by computing
the value of a test statistic (TS) from the data
- The particular TS computed depends on the
parameter - For example, to test the population mean µ, the
TS is the sample mean (or standardized sample
mean)
55. Example
- An experiment is conducted to study the effect of
exercise on the reduction of the cholesterol
level in slightly obese patients considered to be
at risk for heart attack. 80 patients are put on
a specified exercise plan while maintaining a
normal diet. At the end of 4 weeks the change in
cholesterol level will be noted. It is thought
that the program will reduce the average
cholesterol reading by more than 25 points.
- Data
- sample mean = 27
- sample SD = 18
56. Steps in hypothesis testing (I)
- 1. Identify the population parameter being tested (i.e., the population mean). Here, the parameter being tested is the population mean cholesterol reading µ
- 2. Formulate the NULL (H0) and ALT hypotheses (H1)
- H0: µ = 25 (or µ ≤ 25)
- Ha: µ > 25
- 3. Compute the test statistic (TS)
- t = (27 - 25) / (18 / √80) ≈ 0.99
57. Steps in hypothesis testing (II)
- Compute the p-value.
- Here, p = P(T79 > 0.99) ≈ 0.16
- (Optional) Decision Rule
- REJECT H0 if the p-value ≤ α
- (This is a type of argument by contradiction.) A typical value of α is .05, but there's no law that it needs to be. If we use .05, the decision here will be:
- DO NOT REJECT H0
58. Summary
- Hypotheses
- Null: the new drug doesn't work
- Alternative: the new drug works
- Decisions
- New drug works: correctly reject H0 (power)
- Abandon the new drug: correctly don't reject H0
- Proceed with an ineffective drug: Type I error
- Abandon a drug that might work: Type II error
59. Pitfalls in hypothesis testing
- Even if a result is statistically significant, it can still be due to chance
- Statistical significance is not the same as practical importance
- A test of significance does not say how important the difference is, or what caused it
- A test does not check the study design: if the test is applied to a nonrandom sample (or the whole population), the p-value may be meaningless
- Data-snooping makes p-values hard to interpret
60. Introduction to the Permutation Test (Rank Test)
- A type of nonparametric hypothesis test
- Also called randomization test, exact test
- Very widely applicable class of tests
- Introduced in the 1930s
- Usually require only a few weak assumptions
- Often shows good power
61. 5 Steps to a Permutation Test
- 1. Analyze the problem: identify the NULL and ALT hypotheses
- 2. Choose a test statistic (TS)
- 3. Compute the TS for the original labeling of the observations
- 4. Rearrange (permute) the labels and recompute the TS for the rearranged labels (do this for all possible permutations)
- 5. Decide whether to reject the NULL based on this permutation distribution
62. Permutations
- A permutation is a reordering of the numbers 1, ..., n
- Example: What are some permutations of the numbers 1, 2, 3, 4?
- The NULL specifies that the permutations are all equally likely
- The sampling distribution of the TS under the
NULL is computed by forming all permutations,
calculating the TS for each and considering these
values all equally likely
63. Example
- Suppose we wish to compare the length of stay in
the hospital for patients with the same diagnosis
at two different hospitals. We have the following
results:
- 1st hospital
- 21,10,32,60,8,44,29,5,13,26,33
- 2nd hospital
- 86,27,10,68,87,76,125,60,35,73,96,44,238
- How could we carry out a permutation test to test
the NULL hypothesis of no difference between two
hospitals?
- Why is a t-test not useful in this case?
64. Example
- The distribution of length of stay is very skewed and far from a normal distribution.
- Using the rank-sum test:
- R = 83.5, T = 3.10, p = 0.002
- This is an example of an unpaired 2-sample test
- Here, we have to find all of the combinations (since order within each group doesn't matter)
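A sketch of this comparison in Python (assuming numpy and scipy are available), using the data from slide 63; the slide enumerates all combinations, while the code below approximates that with random permutations:

```python
import numpy as np
from scipy import stats

# Length-of-stay data from the slide 63 example.
h1 = np.array([21, 10, 32, 60, 8, 44, 29, 5, 13, 26, 33])
h2 = np.array([86, 27, 10, 68, 87, 76, 125, 60, 35, 73, 96, 44, 238])

# Wilcoxon rank-sum test (should be close to the p ≈ 0.002 quoted on the slide).
print(stats.ranksums(h1, h2))

# Monte Carlo permutation test on the difference in group means: shuffle the
# pooled values, reassign the group labels, and count how often the permuted
# difference is at least as extreme as the observed one.
rng = np.random.default_rng(0)
pooled = np.concatenate([h1, h2])
obs = h1.mean() - h2.mean()
n_perm = 20_000
count = 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)
    diff = perm[:len(h1)].mean() - perm[len(h1):].mean()
    if abs(diff) >= abs(obs):
        count += 1
print("two-sided permutation p ≈", count / n_perm)
```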
65. Advantages
- Can get a permutation test for any TS, even if
its sampling distribution is unknown
- This gives more freedom in choosing a TS
- Can use on unbalanced designs
- Can combine dependent tests on mixtures of
different data types (e.g. with numerical and
categorical data)
66. Limitations
- Assumption that the observations are exchangeable
under the NULL
- This allows us to randomly move observations
between the groups - For example, when testing for a difference in 2
group means you would need to assume that the
distributions in both groups have the same shape
and spread
- Cannot use for testing hypotheses in a single
population, or to compare groups that are
different under the NULL
67. Introduction to ROC curves
- ROC = Receiver Operating Characteristic
- Started in electronic signal detection theory (1940s-1950s)
- Has become very popular in biomedical applications, particularly radiology and imaging
- Also used in machine learning applications to assess classifiers
- Can be used to compare tests/procedures
- True positive rate (sensitivity) vs. false
positive rate (1-specificity)
68. Examples using ROC analysis
- Threshold selection for tuning an already trained classifier (e.g., neural nets)
- Defining signal thresholds in DNA microarrays
- Comparing test statistics for identifying differentially expressed genes in replicated microarray data
- Assessing performance of different protein prediction algorithms
- Inferring protein homology
69. ROC curves: simplest case
- Consider diagnostic test for a disease
- Test has 2 possible outcomes
- positive suggesting presence of disease
- negative
- An individual can test either positive or
negative for the disease
70. Specific Example (figure; axis: test result)
71. Threshold (figure; axis: test result)
72. Four groups (figure: true positives, true negatives, false negatives, false positives; axis: test result)
73. Moving the threshold (figure: true positives, true negatives, false negatives, false positives; axis: test result)
74. ROC Curve (figure; y-axis: true positive rate (sensitivity), x-axis: false positive rate (1 - specificity))
75. ROC Curve (figure; y-axis: true positive rate (sensitivity), x-axis: false positive rate (1 - specificity))
76. Area under the ROC curve (AUC)
- Overall measure of test performance
- Comparisons between two tests are based on differences between (estimated) AUCs
- For continuous data, AUC is equivalent to the Mann-Whitney U-statistic (non-parametric test of difference in location between two populations)
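A small sketch (assuming numpy, scipy, and scikit-learn are available, with simulated biomarker data) illustrating that the empirical AUC equals the Mann-Whitney U statistic divided by the product of the group sizes:

```python
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

# Simulated biomarker values for non-diseased (D=0) and diseased (D=1) subjects.
rng = np.random.default_rng(1)
x0 = rng.normal(loc=0.0, scale=1.0, size=200)   # non-diseased
x1 = rng.normal(loc=1.0, scale=1.0, size=150)   # diseased

y = np.concatenate([np.zeros_like(x0), np.ones_like(x1)])
scores = np.concatenate([x0, x1])
auc = roc_auc_score(y, scores)

# Mann-Whitney U for "diseased > non-diseased"; AUC = U / (n1 * n0).
u, _ = stats.mannwhitneyu(x1, x0, alternative='greater')
print(f"AUC = {auc:.3f}, U/(n1*n0) = {u / (len(x1) * len(x0)):.3f}")
```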
77. Interpretation of AUC
- The probability that the test result from a randomly chosen diseased individual is more indicative of disease than that from a randomly chosen non-diseased individual: P(Xi > Xj | Di = 1, Dj = 0)
- A nonparametric distance between diseased/non-diseased test results.
- No clinically relevant meaning
- A lot of the area comes from the range of large false-positive values; no one cares what's going on in that region.
- The curves might cross, so that there might be a
meaningful difference in performance that is not
picked up by AUC
78. Elements of sample size calculation
- Hypothesis
- H0: new treatment = standard treatment
- Ha: the new treatment is better.
- Type I and Type II errors
- α = .025 one-sided (or two-sided α = .05)
- β = .15 (power = 85%)
- Effect size
- δ = µ1 - µ2 (for continuous outcomes)
- δ = π1 - π2 (for dichotomous outcomes)
- Sample variation
- s(δ)
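As an illustration of how these elements combine, a sketch using statsmodels (assumed available); the 25% vs. 50% response rates echo the earlier slide, while the continuous-outcome numbers (δ = 10, SD = 25) are made up:

```python
from statsmodels.stats.power import NormalIndPower, TTestIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Dichotomous outcome: n per arm to detect pi1 = 0.50 vs pi2 = 0.25
# with two-sided alpha = 0.05 (one-sided 0.025) and 85% power.
es = proportion_effectsize(0.50, 0.25)
n_prop = NormalIndPower().solve_power(effect_size=es, alpha=0.05,
                                      power=0.85, alternative='two-sided')
print(f"proportions: ~{n_prop:.0f} patients per arm")

# Continuous outcome: n per arm for delta = mu1 - mu2 = 10 with SD = 25.
n_cont = TTestIndPower().solve_power(effect_size=10 / 25, alpha=0.05,
                                     power=0.85, alternative='two-sided')
print(f"means: ~{n_cont:.0f} patients per arm")
```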
79. Test of Proportions
- Determining the Sample Size
- What is the level of significance? (the probability of rejecting a true null hypothesis, or the α level)
- What are the chances of detecting a real difference? (power)
- How large a difference (δ) is clinically important?
80. Determining the Sample Size
- Criteria are inter-related
- If you know 3 of the 4 parameters, the other is fixed (n, α, β, and δ)
- Must keep the study feasible
- There are trade-offs
- There is no one correct answer
81. Sample Size Calculation Is Only An Estimate
- Parameters used in the calculation are themselves estimates, with a level of uncertainty.
- The estimated treatment effect may be based on a different population.
- The estimated treatment effect is often overly optimistic, based on highly selected pilot studies.
- Patient eligibility criteria may be changed, thus affecting the sample population.
- Better to design a larger study with early stopping than a smaller study and then try to expand N or extend f/u during the trial.
82. Sample Size and Power: Why?
- Before a study (in planning): how large a sample does the study require?
- After a study: if no association was found, could it be due to a true lack of association in the population, or to low power and small sample size?
83. Power and sample size
- Problem: we might fail to reject the null hypothesis when the alternative is true.
- That is, we might commit a Type II error.
- Solution: select a large enough sample so that there is an 80% chance of rejecting the null hypothesis if the alternative is true.
- Then the power to detect the alternative is 80%.
84. Power and sample size (cont'd)
- Problem: sometimes the sample size required is too large.
- Solutions
- Be content to detect with less power (allow more Type II error).
- Increase the level of the test (allow more Type I error).
- Pick a more extreme alternative.
85. Sample Size
- Larger sample sizes provide more accurate
estimates of the characteristics of the
population
- Confidence intervals specify where the population value probably lies
- As sample size increases, there is less margin of
error
86. Change in Sample Size: Test of Proportions
Test of hypothesis for a phase II trial, 1 arm:
- H0: p < 0.10 vs. H1: p > 0.25
- n = 40
- Design: α1 = 0.04, 1-sided test, 1 - β = 0.82, δ = 0.15, 1 arm
87. Change in Sample Size: Test of Proportions
Test of hypothesis for a phase II trial, 1 arm; H0: p < 0.10, H1: p > 0.25, δ = 0.15. Required n by one-sided α1 and power (1 - β):
- 1 - β = 0.80: n = 40 (α1 = 0.05), 49 (α1 = 0.025), 62 (α1 = 0.01)
- 1 - β = 0.90: n = 55 (α1 = 0.05), 64 (α1 = 0.025), 78 (α1 = 0.01)
- 1 - β = 0.95: n = 70 (α1 = 0.05), 79 (α1 = 0.025), 103 (α1 = 0.01)
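Tables like this can be checked with an exact single-arm binomial calculation; a sketch using scipy (assumed available), which should give a design close to the n = 40 entry (exact and asymptotic methods can differ slightly):

```python
from scipy.stats import binom

# Exact single-arm binomial design: for H0: p <= p0 vs H1: p >= p1, find for
# each n the smallest cutoff r with P(X >= r | p0) <= alpha, and report the
# smallest n whose power P(X >= r | p1) reaches the target.
def single_arm_design(p0=0.10, p1=0.25, alpha=0.05, power=0.80, max_n=200):
    for n in range(10, max_n + 1):
        for r in range(n + 1):
            size = binom.sf(r - 1, n, p0)       # P(X >= r | p0)
            if size <= alpha:
                pw = binom.sf(r - 1, n, p1)     # P(X >= r | p1)
                if pw >= power:
                    return n, r, size, pw
                break
    return None

print(single_arm_design())  # (n, cutoff r, attained alpha, attained power)
```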
88. TTP Example
Assumptions: 1 arm; α1 = 0.05; power = 0.80; H0: median = 30 mos.; H1: median = 40 mos.; hazard reduction = 26%; accrual = 12/mo.; duration of accrual = 14.7 mos.; follow-up = 24 mos. Total sample size = 176 pts.
89. Change to a 2-Arm Study
Assumptions: 2-arm study; α1 = 0.05; power = 0.80; H0: median = 30 mos.; H1: median = 40 mos.; hazard reduction = 26%; accrual = 12/mo.; duration of accrual = 43.1 mos.; follow-up = 24 mos. Total sample size = 518.
90. Increase Power
Assumptions: 2-arm study; α1 = 0.05; power = 0.80 vs. 0.90; H0: median = 30 mos.; H1: median = 40 mos.; hazard reduction = 26%; accrual = 12/mo.; duration of accrual = 43.1 vs. 55.8 mos.; follow-up = 24 mos. Total sample size = 518 vs. 670.
91. Statistical Power
- 0.01 to 0.69: unacceptable
- 0.70 to 0.79: poor
- 0.80 to 0.89: good
- 0.90 to 0.99: excellent
92. Characteristics of Phase I Trials
- Small sample sizes
- Not hypothesis driven
- Toxicity (DLT and MTD) and Efficacy
- Patient safety and benefit
- Dose escalation and drug discovery
- Clinician, Patients and Drug Development
93. Phase I trial designs
- Conventional/Standard Method
- 3+3 Dose Escalation Design
- Sequential/Bayesian Methods
- Continual Reassessment Method (CRM)
- Random Walk Rules (RWR)
- Decision-theoretic Approaches
- Escalation with Overdose Control (EWOC)
94. Phase I Dose Study, Standard Method: 3+3 design
- Treat 3 patients at each predefined dose level, starting with dose level 1.
- If 0 of 3 have a DLT, increase to the next level
- If 2 or more have a DLT, decrease to the previous level
- If 1 of 3 has a DLT, treat 3 more at the current dose
- If 1 of 6 has a DLT, increase to the next level
- If 2 or more have a DLT, decrease to the previous level
- If a dose has been de-escalated to the previous level:
- If only 3 had been treated, enroll 3 more for a total of 6
- If 6 have been treated, stop the study and declare it the MTD
- MTD: the largest dose for which 1 or fewer DLTs occurred
- Escalation never occurs to a dose at which 2 or more DLTs have occurred.
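As a sketch (not part of the lecture), a small simulation of these rules under an assumed dose-toxicity curve; the DLT probabilities and the handling of the de-escalated dose are illustrative simplifications:

```python
import numpy as np

def three_plus_three(true_dlt_probs, rng):
    """Simplified simulation of the 3+3 rules described on the slide.

    true_dlt_probs[d] is the (in practice unknown) DLT probability at dose d.
    Returns the index of the declared MTD, or -1 if even the lowest dose is
    too toxic.
    """
    n_doses = len(true_dlt_probs)
    treated = np.zeros(n_doses, dtype=int)   # patients treated per dose
    dlts = np.zeros(n_doses, dtype=int)      # DLTs observed per dose
    d = 0
    while True:
        dlt_new = (rng.random(3) < true_dlt_probs[d]).sum()   # 3 new patients
        treated[d] += 3
        dlts[d] += dlt_new
        if treated[d] == 3:
            if dlts[d] == 0:
                move = +1        # 0/3 DLT: escalate
            elif dlts[d] == 1:
                move = 0         # 1/3 DLT: expand to 6 at this dose
            else:
                move = -1        # >=2/3 DLT: de-escalate
        else:                    # 6 treated at this dose
            move = +1 if dlts[d] <= 1 else -1
        if move == 0:
            continue
        if move == +1:
            # Never escalate past the top dose or into a dose with >=2 DLTs.
            if d == n_doses - 1 or dlts[d + 1] >= 2:
                return d
            d += 1
        else:
            if d == 0:
                return -1        # lowest dose already too toxic
            d -= 1
            if treated[d] >= 6:
                # Simplification: declare this dose the MTD if it had <=1 DLT,
                # otherwise step down once more without re-checking.
                return d if dlts[d] <= 1 else (d - 1 if d > 0 else -1)
            # Otherwise the loop treats 3 more patients at the lower dose.

rng = np.random.default_rng(2024)
probs = [0.05, 0.10, 0.20, 0.35, 0.50]       # assumed dose-toxicity curve
mtds = [three_plus_three(probs, rng) for _ in range(1000)]
print("MTD selection frequencies (lowest-dose-too-toxic first):",
      np.bincount([m + 1 for m in mtds], minlength=len(probs) + 1))
```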
95. Sample Size for Safety Trials
- Type I Error (Alpha), Acceptable Safety Rate (Rho), and Sample Size (N):
- Alpha = 0.10, Rho = 5%: N = 45
- Alpha = 0.05, Rho = 10%: N = 28
- Alpha = 0.05, Rho = 14%: N = 20
- Alpha = 0.10, Rho = 20%: N = 10
- Alpha = 0.10, Rho = 25%: N = 8
96. Characteristics of Phase II Trials
- Aim: to determine the efficacy of a new treatment (what outcomes to observe)
- Small study of one experimental treatment (E)
- Often a single-arm trial of E alone, without randomization
- Efficacy and safety are evaluated using an early outcome
- Data on E are compared to historical data on the standard treatment (S)
- If E is promising, then organize a randomized phase III trial of E vs. S based on a time-to-event outcome (T)
97. Primary Outcome Measure and Point Estimate
- Mean Hgb: µ
- Proportion responding: p
- Median nadir PSA
- Failure rate: λ
98. Typical Phase II Trials
- Typical cancer phase II trials investigate the response rate
- Historical reference: p0
- Desired clinically significant response: p1
- Hypotheses
- H0: p ≤ p0 (the true response rate is no larger than p0, a minimum response rate of interest)
- H1: p ≥ p1 (the true response rate is at least p1, a target response rate)
- Stop the trial early if p is not sufficiently promising
99. Typical Phase II Trials
- One-stage design: use Fisher's exact test to reject the null
- Two-stage design: the first stage has N1 patients. If there are not enough responses, stop the trial; otherwise, continue to the full N (> N1) patients and evaluate the treatment response based on the number of responses
- The choice of N and N1 is according to prespecified type I and II errors.
100. Phase II Trial Designs
- Single sample (1 stage)
- Multiple stage design
- Gehan (2-stage), Fleming
- Simon's Optimal, Minimax
- Bayesian
- Multiple Outcomes Measures
- Interim Analyses
- Stop for toxicity or lack of activity
- Not rejecting null hypothesis.
101. Phase II Trial Designs
- Randomized Phase II (2 samples)
- Reduce bias by randomizing pts.
- Concurrent accrual/Comparative
- Control/Selection
- Randomized discontinuation
- Interim Analysis
- Stop for toxicity or lack of activity
- Not rejecting null hypothesis
- Adaptive
102. Simon's Optimal 2-Stage Design
- α = 0.10, β = 0.10, E(N | p0) = 48, PET(p0) = 0.65 (PET = probability of early termination)
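The operating characteristics of a Simon two-stage design follow from binomial probabilities; a sketch using scipy (assumed available), where the design parameters (r1, n1, r, n) and the response rates are hypothetical placeholders, not the design behind the numbers above:

```python
from scipy.stats import binom

def simon_two_stage(r1, n1, r, n, p):
    """Operating characteristics of a generic Simon two-stage design.

    Stage 1: treat n1 patients; stop for futility if <= r1 responses.
    Stage 2: otherwise treat n - n1 more; reject H0 if > r total responses.
    Returns (PET, expected sample size, probability of rejecting H0) when the
    true response rate is p.
    """
    pet = binom.cdf(r1, n1, p)                  # P(stop after stage 1)
    e_n = n1 + (1 - pet) * (n - n1)             # E(N)
    p_reject = sum(binom.pmf(x, n1, p) * binom.sf(r - x, n - n1, p)
                   for x in range(r1 + 1, n1 + 1))
    return pet, e_n, p_reject

# Hypothetical parameters for illustration only; the actual design behind the
# slide's E(N|p0) = 48 and PET(p0) = 0.65 is not shown in the extracted text.
p0, p1 = 0.10, 0.25
design = dict(r1=2, n1=18, r=7, n=43)
pet0, en0, alpha = simon_two_stage(p=p0, **design)
_, _, power = simon_two_stage(p=p1, **design)
print(f"PET(p0) = {pet0:.2f}, E(N|p0) = {en0:.1f}, "
      f"alpha = {alpha:.3f}, power = {power:.3f}")
```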
103. Characteristics of Phase III Trials
- Use phase 2 data to decide what to test in phase 3
- Randomize between E and S, usually multi-center
- Typically based on T = survival time or DFS time
- The scientific standard for deciding whether E is effective