Title: Handling Missing Data in the Analysis of CTN Trials: Pitfalls and Possible Solutions
1CTN Design & Analysis Workshop
Handling Missing Data in the Analysis of CTN Trials: Pitfalls and Possible Solutions
Neal Oden, PhD, DSC2-EMMES
Gaurav Sharma, PhD, DSC2-EMMES
Paul Van Veldhuisen, PhD, DSC2-EMMES
Paul Wakim, PhD, CCTN, NIDA
15 March 2011
2Today's Workshop
- The problem
- Prevention
- Types of missing data
- Analysis methods
- Case study
- Open discussion
3Missing Data
- Information within a trial that is meaningful for analysis but not collected
- Focus here mostly on primary outcome data, but relevant to missing secondary outcomes and covariates too
4Missing Data
- Randomization
- Balances treatment groups for known and unknown factors
- Lose benefits if there is drop-out, as groups at outcome may not have been similar at baseline
- Intention-to-treat principle
- Violates principle if not all participants contribute to the primary analysis
5Missing Data
- If missing unrelated to assigned treatment
- Reduces statistical power
- If missing related to assigned treatment or to outcome
- Biases the estimate of the treatment effect
6Causes of Missing Data
- Due to discontinuation of study treatment
- Outcomes undefined for some participants
- QOL measures after death
- Quantitative drug use hair analysis in individuals without hair
- Test fails/specimen lost
- Attrition
- Related to health status/drug use
- Unrelated to health status/drug use (e.g., moved)
7Continuing Data Collection for Drop-Outs
- Distinction between
- Premature end of treatment
- AND
- End of study
- Does collecting data after premature end of
treatment make sense?
8Rationale
- Preserves intention-to-treat approach
- Many CTN trials are pragmatic trials
- NOT: Does treatment work if perfectly delivered?
- but RATHER:
- Is this a good treatment strategy or policy?
- OR
- What happens once treatment starts or is recommended?
9Rationale
- Delivery of medicine deals with people in the real world
- A 100% efficacious cure for stimulant use is useless for public health if nobody can stand it.
- Strive to collect complete data for primary outcome on ALL participants, even in those who do not complete intervention
- Too much missing data => no way the result will be believable, no matter how sophisticated the statistical method
10Why Do We Like It?
- Weight loss diet
- People on the effective arm lose weight and stay in the study
- Some on the ineffective arm get discouraged and quit
- If we analyzed only the people who stayed in the trial, the ineffective arm would look too good
11Approaches to Missing Data
- Design and conduct of clinical trial that minimizes missing data
- May require trade-offs with generalizability
- Apply analysis methods that use information in observed data to help analyze primary outcome data in the presence of missing data
12B. Franklin
An ounce of prevention is worth a pound of cure
13Minimize Missing Data in... Trial Design
- Flexible dose
- Target population
- Allow rescue therapy for poor responders
- Define primary outcomes that are highly ascertainable
- Minimize participant burden/reduce follow-up
- Number of visits/assessments
14Minimize Missing Data in... Trial Conduct
- Explain importance of trial participation during consent process
- Emphasize to staff importance of maintaining follow-up even when treatment is refused
- Incentives
- For participants, need to ensure level is not viewed as coercive
15Minimize Missing Data in... Trial Conduct
- Expression of thanks
- Written/verbal
- Assistance with travel
- Reminders before visits
- Welcoming staff/friendly environment
- Keep locator information current
- Monitor and report to investigators extent of
missing data
16Availability of Primary Outcome: Percent of Measures with Values (N = 29 trials)
17What's the big deal?
We need N = 400 (based on power analysis)
But we expect 20% missing
So we set the initial N = 500
So that the final (analyzed) N = 400
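The arithmetic on this slide can be sketched as below. The function name `inflated_sample_size` is hypothetical and not from the workshop. Note that inflating N only restores the sample size needed for power; it does nothing about bias when missingness is related to treatment or outcome.

```python
def inflated_sample_size(n_required: int, missing_percent: int) -> int:
    """Smallest initial N such that N * (1 - missing%) >= n_required."""
    if not 0 <= missing_percent < 100:
        raise ValueError("missing_percent must be in [0, 100)")
    retained_percent = 100 - missing_percent
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-n_required * 100 // retained_percent)


initial_n = inflated_sample_size(400, 20)
print(initial_n)  # 500
```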
National Institute on Drug Abuse - National
Institutes of Health - U.S. Department of Health
and Human Services
18Technical terms that we can't escape
- Missing at random (MAR)
- Missing completely at random (MCAR)
- Missing not at random (MNAR)
- Ignorable
- Non-ignorable
... but what do they mean?
19Missing Completely at Random (MCAR)
(Non-technical) Definition: The fact that Y is missing has nothing to do with the unobserved value of Y, or with other variables
Therefore: The set of participants with complete data can be regarded as a simple random (or representative) sample of all participants
What to do? Ignore the missing data and analyze the available data
20Missing at Random (MAR)
(Non-technical) Definition: The fact that Y is missing can be explained by other observed values of Y, or by other measured variables
Therefore: The observed data can be used to account for the missing data
What to do? Use Maximum Likelihood or Multiple Imputation approach, and include in the model the other measured variables that explain missingness
21Missing Not at Random (MNAR)
(Non-technical) Definition: The fact that Y is missing cannot be explained by other observed values of Y, or by other measured variables
Therefore: The observed data cannot be used to account for the missing data, and outside information is needed
In simple English: We have a problem
22In Summary
Missingness (i.e., whether the data are missing or not):
        is related to       is not related to
MCAR    -                   observed or unobserved data
MAR     observed data       unobserved data
MNAR    unobserved data     -
Based on Graham 2009
23Bottom Line
- MCAR: No big deal
- MAR: Use available collected data to explain missing mechanism, and use existing statistical methods
- MNAR: Need outside information to explain missing mechanism
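A small simulation can make the three mechanisms concrete. This is a minimal sketch under assumptions that are not from the workshop: a single always-observed covariate x, an outcome y = x + noise with true mean 0, and arbitrary missingness probabilities (0.3, 0.6, 0.1). The function name `simulate_mechanisms` is hypothetical. It compares the complete-case mean of y under each mechanism and, for MAR, a regression-adjusted mean that uses the observed x.

```python
import random


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def simulate_mechanisms(n=20000, seed=1):
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]   # covariate, always observed
    ys = [x + rng.gauss(0, 1) for x in xs]     # outcome, true mean is 0

    def keep(prob_missing):
        # Return the (x, y) pairs whose outcome remains observed.
        return [(x, y) for x, y in zip(xs, ys)
                if rng.random() >= prob_missing(x, y)]

    mcar = keep(lambda x, y: 0.3)                    # unrelated to anything
    mar = keep(lambda x, y: 0.6 if x > 0 else 0.1)   # depends on observed x
    mnar = keep(lambda x, y: 0.6 if y > 0 else 0.1)  # depends on y itself

    # MAR: regress y on x among the observed, then predict for everyone.
    ox = [x for x, _ in mar]
    oy = [y for _, y in mar]
    mx, my = mean(ox), mean(oy)
    slope = (sum((x - mx) * (y - my) for x, y in mar)
             / sum((x - mx) ** 2 for x in ox))
    intercept = my - slope * mx

    return {
        "MCAR complete-case mean": mean(y for _, y in mcar),
        "MAR complete-case mean": mean(oy),
        "MAR regression-adjusted mean": mean(intercept + slope * x for x in xs),
        "MNAR complete-case mean": mean(y for _, y in mnar),
    }


results = simulate_mechanisms()
for label, value in results.items():
    print(f"{label}: {value:+.3f}")
```

In this setup the MCAR complete-case mean and the MAR adjusted mean should land near the true value of 0, while the MAR and MNAR complete-case means should be clearly negative, because high values are preferentially lost.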
24Ignorable vs. Non-Ignorable (roughly speaking)
- Ignorable (available data are sufficient)
- Missing Completely At Random (MCAR)
- Missing At Random (MAR)
- Non-Ignorable (need outside information)
- Missing Not At Random (MNAR)
25Missing Data Analysis Methods
26Complete Case and Pairwise Deletion
(Correlation illustration: grids of Y1, Y2, Y3 values under complete case (CC) and pairwise deletion (PD); X = observed, - = missing)
- Simple, default in statistical software
- Potential loss of information and precision
- Biased when data are not MCAR
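The difference between the two deletion schemes can be sketched for a single correlation. The data below are invented for illustration, and `pearson` is a hypothetical helper. Complete-case analysis drops any row with a missing value in any of Y1, Y2, Y3; pairwise deletion keeps every row in which the two variables being correlated are both present.

```python
from math import sqrt


def pearson(pairs):
    """Pearson correlation for a list of (a, b) pairs."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / sqrt(var_a * var_b)


# Hypothetical rows of (Y1, Y2, Y3); None marks a missing value.
rows = [
    (1.0, 2.1, 3.2),
    (2.0, 2.9, 4.1),
    (3.0, 4.2, None),
    (4.0, 4.8, None),
    (5.0, None, 6.9),
    (6.0, 7.1, 8.0),
]

# Complete case: only rows with Y1, Y2 and Y3 all present.
cc_pairs = [(r[0], r[1]) for r in rows if None not in r]
# Pairwise deletion: rows where Y1 and Y2 are present, whatever Y3 is.
pd_pairs = [(r[0], r[1]) for r in rows
            if r[0] is not None and r[1] is not None]

print("CC rows used for corr(Y1, Y2):", len(cc_pairs))
print("PD rows used for corr(Y1, Y2):", len(pd_pairs))
print("CC correlation:", round(pearson(cc_pairs), 3))
print("PD correlation:", round(pearson(pd_pairs), 3))
```

Pairwise deletion uses more rows here, but each correlation in a matrix may then rest on a different subset of participants.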
27Single Imputation
- Impute a single value, e.g., mean, BOCF, LOCF, imputing missing as positive
- Simple, artificially increases sample size
- Underestimates SE and gives incorrect p-values
- Most SI methods require MCAR assumptions to hold, while some, such as LOCF, even require very strong and often unrealistic assumptions
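Why single imputation understates the standard error can be seen with mean imputation. The values below are invented for illustration. Filling in the mean adds observations that carry no new information, so the naive SE shrinks both because n grows and because the imputed values have no spread.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical observed outcomes; 4 further participants are missing.
observed = [4.0, 5.5, 6.1, 3.8, 7.2, 5.0]
n_missing = 4

# Mean imputation: every missing value is replaced by the observed mean.
imputed = observed + [mean(observed)] * n_missing

se_observed = stdev(observed) / sqrt(len(observed))
se_imputed = stdev(imputed) / sqrt(len(imputed))

print(f"SE from observed data only:   {se_observed:.3f}")
print(f"Naive SE after mean imputing: {se_imputed:.3f}")
```

The point estimate is unchanged, but the second SE is smaller even though no additional outcomes were actually measured.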
28Multiple Imputation (MI)
(Figure: observed data with missing values, marked ?, filled in by imputations 1, 2, ..., m)
- A simulation-based approach to missing data
29The General Idea
Incomplete Data -> (1) IMPUTATION -> Imputed Data -> (2) ANALYSIS -> Analysis Results -> (3) POOLING -> Final Results
30(1) IMPUTATION Models
- The imputation model should include primary predictive variables and other variables associated with missingness
- Multiple Imputation method is robust even with approximate imputation models
31(2) ANALYSIS Models
- Regression Model
- General Linear Model
- Generalized Linear Model (Logistic Regression,
Poisson Regression)
32(3) Rules for POOLING
Estimate 1, Variance 1
Estimate 2, Variance 2
Estimate 3, Variance 3
...
Estimate m, Variance m
- Pooled quantities: Mean of Estimate, Within Variance, Between Variance, Total Variance
- Confidence Interval for Parameter of Interest is given by
Mean of Estimate ± t_df × √(Total Variance)
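The pooling step can be sketched with the standard combining rules usually attributed to Rubin: the total variance is the within-imputation variance plus (1 + 1/m) times the between-imputation variance. The slide does not spell out these formulas, so treat them as an assumption about what it depicts. The function name `pool_rubin` and the example numbers are hypothetical.

```python
from math import inf
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m completed-data estimates and their variances."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    pooled = mean(estimates)           # mean of the m estimates
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # sample variance across imputations
    total = within + (1 + 1 / m) * between
    # Degrees of freedom for the t reference distribution.
    if between == 0:
        df = inf
    else:
        df = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    return pooled, within, between, total, df


# Hypothetical results from m = 3 imputed data sets.
pooled, within, between, total, df = pool_rubin(
    estimates=[1.0, 1.2, 0.8],
    variances=[0.04, 0.05, 0.03],
)
print(pooled, within, between, total, df)
```

The confidence interval is then the pooled estimate plus or minus a t quantile with `df` degrees of freedom times the square root of `total`; the quantile itself would come from a statistics package such as `scipy.stats.t.ppf`.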
33Desirable Features
- MI gives approximately unbiased estimates of all parameters
- MI provides good estimates of the standard errors
- MI can be used with many kinds of data and analyses without specialized software
- Requires MAR assumption
34Maximum likelihood
- Basic idea
- Given some data,
- Try to guess the parameter(s) of the probability distribution that generated the data
- MLE of a parameter is the value that maximizes the probability of the data you already have
35Example
- Flip a coin, get 45 heads, 36 tails
- We don't know p, but whatever it is
- Pr(45 H in 81 tosses) = K p^45 (1-p)^36
- How to guess p?
- Pick the value of p that maximizes the probability of what already happened
- Pick p to maximize L = p^45 (1-p)^36
- Best guess turns out to be 45/81
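The coin example can be checked numerically. This is a minimal sketch that maximizes the log of L over a grid of candidate p values; the grid resolution is an arbitrary choice, and the constant K is dropped because it does not depend on p.

```python
from math import log


def log_likelihood(p, heads=45, tails=36):
    # log of p^heads * (1 - p)^tails
    return heads * log(p) + tails * log(1 - p)


# Candidate values of p strictly between 0 and 1.
grid = [i / 10000 for i in range(1, 10000)]
p_hat = max(grid, key=log_likelihood)

print("grid maximizer:", p_hat)
print("45/81         :", 45 / 81)
```

The grid maximizer agrees with the closed-form answer 45/81 up to the grid spacing.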
36Maximum likelihood estimates have nice properties
- Consistent
- Asymptotically normal
- Asymptotically unbiased
- Asymptotically minimum variance
- etc.
37New problem
- H = 45
- T = 36
- ? = 19
- Now how to guess p?
- If we knew how many missing were H and how many T, we would know what to do.
- But we don't.
- What to do?
38A solution
- If data are MAR,
- you can get MLEs by
- maximizing the (conditional) likelihood for the nonmissing data
- ignoring the missing data mechanism.
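For the coin with 19 missing tosses, the idea can be sketched as follows. The assumption here, which is not stated on the slides, is that each toss goes missing with some probability q that does not depend on whether it was H or T. The likelihood of everything observed then factors into a part involving p and a part involving q, so the value of p that maximizes it is the same whatever q is.

```python
from math import log


def observed_data_loglik(p, q, heads=45, tails=36, missing=19):
    """Log-likelihood of 45 H, 36 T and 19 missing tosses.

    q is the (assumed) probability that a toss is missing,
    the same for heads and tails.
    """
    seen = heads + tails
    return (heads * log(p) + tails * log(1 - p)      # part involving p
            + seen * log(1 - q) + missing * log(q))  # part involving q only


grid = [i / 1000 for i in range(1, 1000)]
estimates = {
    q: max(grid, key=lambda p: observed_data_loglik(p, q))
    for q in (0.05, 0.19, 0.50)
}
print(estimates)
```

Every choice of q gives the same maximizer for p, the grid value closest to 45/81, which is why the missing-data mechanism can be ignored in this setting.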
39Important Application
- Longitudinal analysis
- Participant 1, visit 1, 2, 3, ...
- Participant 2, visit 1, 2, 3, ...
- For each visit, y = a + b1 x1 + b2 x2
- First approach
- Treat all visits as independent
- Do the regression on all visits together
- Wrong, because visits from a single participant are related, not independent
40Important Application (cont'd)
- Second approach
- The visits from a single participant have covariance
- Use a mixed model
- It used to be that you had to have all visits nonmissing for this analysis
- But modern software (SAS MIXED, GLIMMIX) ignores the missing-data mechanism and gets MLEs from only the nonmissing data, even if some visits are missing.
- If data are MAR, this is fine!
41Modern longitudinal ML software uses more data
Neither old nor new method can use this visit
Older CC analysis would use only these cases
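The data-usage point can be illustrated by simply counting visits. This sketch fits no model; the participants and values are invented. It only shows which observations an older complete-case analysis would keep versus which a likelihood-based mixed model can draw on.

```python
# Hypothetical outcomes at visits 1-3; None marks a missed visit.
visits = {
    "P1": [3.1, 2.8, 2.5],
    "P2": [3.4, None, 2.9],
    "P3": [2.9, 2.7, None],
    "P4": [3.8, None, None],
}

# Older complete-case analysis: only participants with every visit present.
cc_participants = [pid for pid, ys in visits.items() if None not in ys]
cc_visits_used = sum(len(visits[pid]) for pid in cc_participants)

# Likelihood-based mixed model: every nonmissing visit contributes.
ml_visits_used = sum(y is not None for ys in visits.values() for y in ys)

# Visits that neither approach can use, because nothing was recorded.
unusable_visits = sum(y is None for ys in visits.values() for y in ys)

print("complete-case participants:", cc_participants)
print("visits used by complete-case:", cc_visits_used)
print("visits used by ML mixed model:", ml_visits_used)
print("missing visits:", unusable_visits)
```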
42Another application
- Survival analysis
- Example: time to relapse
- For some people, you have the time
- For others, you don't because
- Study ended
- People died
- People dropped out
- etc.
- People without relapse times are said to be
CENSORED
43Another application (cont'd)
- For censored people, you don't know the relapse time, but you know it is after the censor time
- Survival analysis handles censored data, but
- You have to make the assumption that censoring is noninformative.
- If people drop out because they know they are going to relapse the next day, the censoring is informative.
- Informative censoring gives biased survival time estimates
- The noninformative censoring assumption is basically an MAR assumption.
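How a survival analysis uses censored participants can be sketched with a Kaplan-Meier estimator. The slides do not name a specific estimator, so this is one common choice, and the relapse times below are invented. Censored participants stay in the risk set until their censor time and then leave without counting as a relapse, which is only valid if censoring is noninformative.

```python
def kaplan_meier(times, relapsed):
    """Return [(time, survival probability)] at each observed relapse time.

    relapsed[i] is True if a relapse was observed at times[i],
    False if the participant was censored at times[i].
    """
    data = sorted(zip(times, relapsed))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        leaving = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            events += int(data[i][1])
            leaving += 1
            i += 1
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving   # relapsed and censored both leave the risk set
    return curve


# Hypothetical weeks to relapse; False means censored at that week.
weeks = [2, 4, 4, 6, 7, 9]
relapsed = [True, True, False, True, False, False]
curve = kaplan_meier(weeks, relapsed)
for t, s in curve:
    print(f"week {t}: estimated relapse-free proportion {s:.3f}")
```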
44What if data are not MAR?
- When the missing data are nonignorable (i.e., MNAR), standard statistical models can yield badly biased results
- Cannot test MAR versus MNAR
45Sensitivity Analysis
- The missing data mechanism is not identifiable from observed data
- We don't know what we don't know
- One or more analyses can be performed using different assumptions
- Example: Worst Case Analysis
- (won't work with a lot of missing data)
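A worst-case analysis for a binary outcome can be sketched as below. The counts and the function name `success_difference` are invented for illustration. In the worst case for the treatment arm, every missing treatment outcome is counted as a failure and every missing control outcome as a success; the best case reverses this.

```python
def success_difference(treat, control, treat_missing_as, control_missing_as):
    """Difference in success proportions (treatment minus control).

    Each arm is a dict with counts 'success', 'failure', 'missing'.
    *_missing_as is 'success', 'failure', or 'drop' (observed data only).
    """
    def proportion(arm, missing_as):
        successes = arm["success"]
        total = arm["success"] + arm["failure"]
        if missing_as != "drop":
            total += arm["missing"]
            if missing_as == "success":
                successes += arm["missing"]
        return successes / total

    return (proportion(treat, treat_missing_as)
            - proportion(control, control_missing_as))


# Hypothetical trial with 80 participants per arm.
treat = {"success": 40, "failure": 30, "missing": 10}
control = {"success": 30, "failure": 40, "missing": 10}

observed_only = success_difference(treat, control, "drop", "drop")
worst_case = success_difference(treat, control, "failure", "success")
best_case = success_difference(treat, control, "success", "failure")

print(f"observed only: {observed_only:+.3f}")
print(f"worst case:    {worst_case:+.3f}")
print(f"best case:     {best_case:+.3f}")
```

With 10 missing per arm the apparent benefit disappears entirely in the worst case, which shows why this approach becomes uninformative when a lot of data are missing.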
46Goals of Sensitivity Analysis
- Consider a range of potential associations between missingness and response
- Assess the degree to which the conclusion can be influenced by the missingness mechanism
- If the conclusion is largely unchanged, the result may be considered robust
- Otherwise, the conclusion should be interpreted cautiously and may be misleading
47MNAR models
- Use of non-ignorable models can be helpful in conducting a sensitivity analysis
- Not necessarily a good idea to rely on a single MNAR model, because the assumptions about the missing data are impossible to assess with the observed data
- One should use MNAR models sensibly, possibly examining several types of such models for a given dataset
48Two general classes of MNAR models
- Selection Models: use model for the full data response and a selection mechanism
- Pattern Mixture Models: use mixture of missing data pattern information in the model
49Case Study: CTN0010 - BUP for Adolescents
- Two groups: Bup/Nal detoxification over 2 weeks vs. Bup/Nal maintenance over 12 weeks
- N (analyzed) = 152 at 6 community treatment programs
- Main outcome measure: Opioid-positive urine test result at weeks 4, 8 & 12
- Evaluation: weekly for 12 weeks, comprehensive at 4, 8, 12, 24, 36 & 52 weeks
50Woody, JAMA 2008
51Missingness in CTN0010 (from Paul Allison's analysis)
20 participants had missing outcome for all 12 weeks (effective sample size: N - 20)
Available Data (after removing the 20 cases)
Week      1   2   3   4   5   6   7   8   9  10  11  12
present  90  74  60  78  48  45  44  69  40  37  37  67
52Paul Allison's Analysis
- Included in the model each of Weeks 1 to 12
- Used Maximum Likelihood Estimation (MLE) and Multiple Imputation (MI) approaches (MLE is preferred over MI)
- Used random effects (mixed) logit model with SAS PROC GLIMMIX
58Take-Home Messages
- Model all the available outcome data at all time points, including outcome at baseline (t = 0), and then test the time points (contrasts) of interest
- There are good data analytic methods for dealing with missing data in repeated-measures designs (under MAR assumption): use random effects (mixed) models estimated by maximum likelihood
- Allow for a linear and quadratic time trend (saves degrees of freedom), or spline model (broken line)
- If no time-related pattern, use time as a class variable, i.e., each time point is a category (not continuous)
59Take-Home Messages (cont'd)
- Imputing missing outcomes as positive is a crude approach; one can often do better
- Incorporation of covariates and auxiliary variables
- Sensitivity analysis is absolutely vital
60References
Allison, Missing Data, Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-136, Thousand Oaks, CA: Sage, 2001.
Fitzmaurice, Laird & Ware, Applied Longitudinal Analysis, Wiley, 2004.
Graham, Missing Data Analysis: Making It Work in the Real World, Annual Review of Psychology, 2009, 60: 549-576.
Liang & Zeger, Longitudinal Data Analysis of Continuous and Discrete Responses for Pre-Post Designs, Sankhya, 2000, 62(B): 134-148.
Weiss, An Introduction to Modeling Longitudinal Data, presentation at UCLA CALDAR Summer Institute on Longitudinal Research, August 2010.
Woody et al., Extended vs Short-term Buprenorphine-Naloxone for Treatment of Opioid-Addicted Youth: A Randomized Trial, JAMA, 2008, 300(17): 2003-2011.
61Contact Information
Neal Oden: noden_at_emmes.com
Gaurav Sharma: gsharma_at_emmes.com
Paul Van Veldhuisen: pvanveldhuisen_at_emmes.com
Paul Wakim: pwakim_at_nida.nih.gov
62Questions & Comments