Title: Greg Stoddard
1 Introduction to Multiple Imputation
- Greg Stoddard
- September 29, 2011
- University of Utah School of Medicine
2 Outline
- The missing value problem
- Types of missing data
- Simple approaches
- Multiple imputation: state of the art
- When to use each approach
3 Consider the following dataset.
- N = 6
- When a linear regression, Y = a + b1(X1) + b2(X2), is fitted, the sample size drops to N = 3, due to listwise deletion of missing values.
ID  Y     X1    X2
1   11    1     2
2   10    miss  5
3   miss  3     2
4   9     miss  miss
5   12    5     7
6   7     6     3
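The drop from N = 6 to N = 3 can be checked with a few lines of code. This is only a sketch in plain Python: the function name `listwise_delete` and the use of `None` to mark a missing value are illustrative assumptions, not part of the lecture.

```python
# Toy dataset from the slide; None marks a missing value.
rows = [
    {"ID": 1, "Y": 11, "X1": 1, "X2": 2},
    {"ID": 2, "Y": 10, "X1": None, "X2": 5},
    {"ID": 3, "Y": None, "X1": 3, "X2": 2},
    {"ID": 4, "Y": 9, "X1": None, "X2": None},
    {"ID": 5, "Y": 12, "X1": 5, "X2": 7},
    {"ID": 6, "Y": 7, "X1": 6, "X2": 3},
]


def listwise_delete(data, variables):
    """Keep only the rows in which every model variable is observed."""
    return [r for r in data if all(r[v] is not None for v in variables)]


complete = listwise_delete(rows, ["Y", "X1", "X2"])
print(len(rows), "->", len(complete))       # 6 -> 3
print([r["ID"] for r in complete])          # [1, 5, 6]
```

Only subjects 1, 5, and 6 have all three variables observed, which is why the regression is fitted on half the sample.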
4 To get our sample size back to 6, we fill in the missing values with imputed values.
ID  Y     X1    X2
1   11    1     2
2   10    3     5
3   11    3     2
4   9     6     3
5   12    5     7
6   7     6     3
5 To offer a precise definition: missing data imputation is substituting values for missing values, but it is not fabricating data because we do it statistically.
6 Every study is going to have some missing values.
You rarely see any mention of it in a published
article, however. Should you mention it?
7 In the guidance article,
Vandenbroucke JP, von Elm E, Altman DG, et al. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE): explanation and elaboration. Ann Intern Med 2007;147(8):W-163 to W-194.
on page W-176, it advises, "12(c) Explain how missing data were addressed."
8 This guidance article then gives the following example, used in a paper by Chandola et al. (2006), of how you might say this:
"Our missing data analysis procedures used missing at random (MAR) assumptions. We used the MICE (multivariate imputation by chained equations) method of multiple multivariate imputation in STATA. We independently analyzed 10 copies of the data, each with missing values suitably imputed, in the multivariate logistic regression analyses. We average estimates of the variables to give a single mean estimate and adjusted standard errors according to Rubin's rules."
------- Chandola T, Brunner E, Marmot M. Chronic stress at work and the metabolic syndrome: prospective study. BMJ 2006;332:521-5. PMID 16428252
9 Although missing value imputation is too complex a subject to include in introductory statistics textbooks, entire chapters are devoted to it in specialized applied texts. Some examples are:
- Harrell Jr FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York, Springer-Verlag, 2001, pp.41-52.
- Twisk JWR. Applied Longitudinal Data Analysis for Epidemiology: A Practical Guide. Cambridge, Cambridge University Press, 2003, pp.202-224.
- Fleiss JL, Levin B, Paik MC. Statistical Methods for Rates and Proportions, 3rd ed. Hoboken NJ, John Wiley & Sons, 2003, pp.491-560.
10 A popular classification scheme for missing data is (Harrell, 2001, pp.41-52):
1. Missing completely at random (MCAR)
2. Missing at random (MAR)
3. Informative missing (IM)
------ Harrell Jr FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York, Springer-Verlag, 2001, pp.41-52.
11 Missing completely at random (MCAR)
Data are missing for reasons unrelated to any characteristics or responses of the subject, including the value of the missing value, were it to be known. An example is the accidental dropping of a test tube, resulting in missing laboratory measurements. (Here, the best guess of the missing variable is simply the sample median.)
------ Harrell Jr FE, 2001, pp.41-52.
12 Missing at random (MAR)
Data elements are not missing completely at random, but the probability that a value is missing depends only on values of variables that were actually measured. For example, suppose males are less likely to respond to their income question in general, but the likelihood of responding is independent of their actual income. In this case, unbiased sex-specific income estimates can be made if we have data on the sex variable (by replacing the missing value with the sex-specific median income, for example).
------ Harrell Jr FE, 2001, pp.41-52.
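The sex-specific median idea in the MAR example can be sketched as follows. The income figures, the record layout, and the function name `impute_group_median` are hypothetical and chosen only for illustration. The sketch also assumes every group has at least one observed value.

```python
from statistics import median

# Hypothetical survey records; income None means the question was not answered.
records = [
    {"sex": "M", "income": 40}, {"sex": "M", "income": None},
    {"sex": "M", "income": 50}, {"sex": "M", "income": 60},
    {"sex": "F", "income": 30}, {"sex": "F", "income": None},
    {"sex": "F", "income": 35}, {"sex": "F", "income": 70},
]


def impute_group_median(data, group_key, value_key):
    """Replace each missing value with the median of its own group."""
    observed = {}
    for r in data:
        if r[value_key] is not None:
            observed.setdefault(r[group_key], []).append(r[value_key])
    medians = {g: median(vals) for g, vals in observed.items()}

    out = []
    for r in data:
        filled = dict(r)                      # copy, so the original is preserved
        if filled[value_key] is None:
            filled[value_key] = medians[filled[group_key]]
        out.append(filled)
    return out


imputed = impute_group_median(records, "sex", "income")
print([r["income"] for r in imputed])
```

The male nonrespondent receives the male median and the female nonrespondent receives the female median, which is what makes the measured sex variable useful under MAR.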
13 Informative missing (IM)
Data elements are more likely to be missing if the true values of the variable in question are systematically higher or lower. For example, this occurs if lower income subjects, or high income subjects, or both, are less likely to answer the income question in a survey. This is the most difficult type of missing data to handle, and in many cases there is no good value to substitute for the missing value. Furthermore, if you analyze your data by just dropping these subjects, your results will be biased, so that does not work either.
------ Harrell Jr FE, 2001, pp.41-52.
14 Missing Comorbidities in the Patient Medical Record
A special case of missing data is a comorbidity not listed in the patient's medical record. For example, if no mention of diabetes was ever made and a diagnostic code for diabetes was never entered for any clinic visit, the fact that it is missing suggests that the patient does not have diabetes. Defining a coding rule to replace this missing value with 0, or absent, will most likely produce the least amount of misclassification, or mis-imputation, error.
15 Steyerberg (2009) mentions this coding rule approach:
"An alternative in such a situation might be to change the definition of the predictor, i.e., by assuming that if no value is available from a patient chart, the characteristic is absent rather than missing."
------- Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. New York, Springer, 2009, pp.130-131.
16 Replacing Missing Values with Mean, Median, or Mode
Before the more sophisticated imputation schemes were developed, it was common practice to replace the missing value with a likely value, such as the mean, median, or mode. One criticism of this approach is that you artificially shrink the variance, since so many observations will then have the average value.
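The variance shrinkage is easy to see numerically. The values below are hypothetical and the number of missing observations is an assumption made for the illustration.

```python
from statistics import mean, variance

# Hypothetical nonmissing values of a variable, plus five missing observations.
observed = [7, 9, 10, 11, 12]
n_missing = 5

# Mean imputation: every missing observation receives the observed mean.
filled = observed + [mean(observed)] * n_missing

var_before = variance(observed)   # sample variance of the observed values
var_after = variance(filled)      # sample variance after mean imputation

print("mean before/after:", mean(observed), mean(filled))
print("variance before/after:", var_before, var_after)
```

The mean is unchanged, but the variance falls because the imputed observations add nothing to the sum of squared deviations while increasing the sample size in the denominator.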
17 Royston (2004) makes this criticism:
"Old-fashioned imputation typically replaced missing values with the mean or mode of the nonmissing values for that variable. That approach is now regarded as inadequate. For subsequent statistical inference to be valid, it is essential to inject the correct degree of randomness into the imputations and to incorporate that uncertainty when computing standard errors and confidence intervals for parameters of interest."
------ Royston P. Multiple imputation of missing values. The Stata Journal 2004;4(3):227-241.
18 One possible approach is imputing the missing value with a likely value, such as the median. Then, add a random residual back to the median imputed value to maintain the correct standard error (Harrell, 2001).
--- Harrell Jr FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York, Springer-Verlag, 2001, pp.45-46.
19 Such a direct approach is not usually done, however, since the more widely accepted approaches accomplish the same thing. Furthermore, if there are a lot of missing data, imputing with a likely value might adversely affect the regression coefficient. The imputation methods of multiple imputation and maximum likelihood not only provide the best standard errors, but also the best regression coefficients.
20 What About Imputing the Outcome Variable?
- At first, it seems like you must at least have the outcome variable measured or the subject should be excluded.
ID  Y     X1    X2
1   11    1     2
2   10    miss  5
3   miss  3     2
4   9     miss  miss
5   12    5     7
6   7     6     3
21 It is common to discard subjects with a missing outcome variable, but imputing missing values of the outcome variable frequently leads to more efficient estimates of the regression coefficients when the imputation is based on the nonmissing predictor variables (Harrell, 2001).
----- Harrell Jr FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York, Springer-Verlag, 2001, p.43.
22 Missing Value Indicator Approach
A historically popular approach in epidemiologic research was to use a missing value indicator, which has a value of 1 if the variable is missing and 0 otherwise. For example, given the following variable for gender,
1. male (n = 50)
2. female (n = 40)
.  missing (n = 10)
23 we would recode this to two indicators, male and malemissing, and then include both indicator variables in the regression model. With this approach, the missing value indicator is not interpreted, or reported in an article, but simply acts as a placeholder so the subjects with missing values are not dropped out of the analysis.
Original gender variable   Male indicator   Malemissing indicator
1. male (n = 50)           1 (n = 50)       0 (n = 50)
2. female (n = 40)         0 (n = 40)       0 (n = 40)
.  missing (n = 10)        0 (n = 10)       1 (n = 10)
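The recoding in the table can be written out as a small function. This sketch shows the mechanics of the coding only and is not a recommendation of the method. The function name `missing_indicator_coding` and the use of `None` for a missing gender are assumptions made for the example.

```python
def missing_indicator_coding(gender):
    """Return (male, malemissing) for one subject.

    gender is 'male', 'female', or None when the value is missing.
    """
    if gender is None:
        return (0, 1)          # placeholder row: male = 0, malemissing = 1
    return (1 if gender == "male" else 0, 0)


# Counts taken from the slide: 50 males, 40 females, 10 missing.
genders = ["male"] * 50 + ["female"] * 40 + [None] * 10
coded = [missing_indicator_coding(g) for g in genders]

n_male = sum(male for male, _ in coded)
n_malemissing = sum(flag for _, flag in coded)
print(n_male, n_malemissing, len(coded))   # 50 10 100
```

All 100 subjects keep a complete pair of indicator values, so none are dropped by listwise deletion.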
24 Greenland and Finkle (1995) suggest not using the missing value indicator approach:
"The method based on missing-data indicators can exhibit severe bias even when the data are missing completely at random." and "In general, the authors recommend that epidemiologists avoid using the missing-indicator method and use more sophisticated methods whenever a large proportion of data are missing."
------ Greenland S, Finkle WD. A critical look at methods for handling missing covariates in epidemiologic regression analysis. Am J Epidemiol 1995;142(12):1255-64.
25 Steyerberg (2009, pp.130-131) likewise advises not using the missing value indicator approach:
"...such a procedure ignores correlation of the values of predictors among each other. Simulations have shown that the procedure may lead to severe bias in estimated regression coefficients [155,295]. The missing indicator should hence generally not be used."
------- Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. New York, Springer, 2009, pp.130-131.
26 Hotdeck Imputation
In this method, the missing values are replaced by randomly selected values from the nonmissing values of the same variable.
ID  Y     X1    X2
1   11    1     2
2   10    3     5
3   11    3     2
4   9     6     3
5   12    5     7
6   7     6     3
27 Hotdeck imputation has the advantages that it is simple to use, preserves the distributional characteristics of the variable, and performs nearly as well as the more sophisticated imputation approaches (Roth, 1994).
------- Roth P. Missing data: A conceptual review for applied psychologists. Personnel Psychology 1994;47:537-560.
28 In hotdeck imputation, you set a random number seed before the imputation, so you can replicate the imputation and subsequent analysis. You then create imputed variables, using different variable names, such as male → male_imp. Then you use ordinary statistical methods, such as linear regression, on the imputed variables. The original variables, with missing values, are preserved for final analyses using multiple imputation, if you choose to do so.
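A minimal sketch of hotdeck imputation with a fixed seed follows. It uses the X1 column from the example dataset. The function name `hotdeck`, the seed value, and the `_imp` naming are illustrative assumptions.

```python
import random


def hotdeck(values, rng):
    """Replace each missing value (None) with a randomly drawn observed value."""
    donors = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(donors) for v in values]


# X1 from the example dataset; the original variable is left untouched.
x1 = [1, None, 3, None, 5, 6]

# Setting the seed makes the imputation reproducible.
x1_imp = hotdeck(x1, random.Random(12345))
x1_imp_again = hotdeck(x1, random.Random(12345))

print(x1_imp)
print(x1_imp == x1_imp_again)   # True: same seed, same imputed values
```

Running it with a different seed will generally give different imputed values.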
29 What you may discover, however, is that you do not get the answer you want in your regression model. So, you become curious and choose a different random number seed, and this time you get the desired answer using your second round of imputed variables. If you pre-specified your random number seed, this will be very unsatisfying because you are actually stuck with your first model, but which model is the right one? It seems the only correct thing to do, then, is to do hotdeck several times and then average the results to arrive at a stable right answer. This is the basic idea of multiple imputation, which we will get to later.
30
ID  Y     X1    X2
1   11    1     2
2   10    3     5
3   11    3     2
4   9     6     3
5   12    5     7
6   7     6     3
- Recall, with hotdeck imputation, we simply use a random value from the nonmissing values. This has the advantage that an actual possible value is used, but it does not take into account that a more likely value could be obtained by using information from the other variables the imputed variable is correlated with. For example, body weight will be different between teenagers and adults.
31 Regression Approach
In this method, you impute the missing value with the predicted value from an appropriate regression model. For example, you could use a linear regression, with age and gender as the predictors, to impute the missing body weight values. This is similar to imputing with means, but more like using subgroup-specific means. A disadvantage is that it does not preserve the variability of the data, because every imputed value falls exactly on the regression line.
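The regression approach can be sketched with a one-predictor least-squares fit. To keep the example short it uses age only, rather than age and gender. The ages and weights are hypothetical and were chosen to lie exactly on a line so the result is easy to check by hand. The function names are assumptions.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Least-squares intercept and slope for y regressed on one predictor x."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def regression_impute(x, y):
    """Fill missing y (None) with the value predicted from x."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    intercept, slope = fit_simple_ols([p[0] for p in pairs], [p[1] for p in pairs])
    return [yi if yi is not None else intercept + slope * xi for xi, yi in zip(x, y)]


# Hypothetical ages (years) and body weights (kg); two weights are missing.
age = [10, 20, 30, 40, 15, 50]
weight = [50, 60, 70, 80, None, None]

weight_imp = regression_impute(age, weight)
print(weight_imp)
```

Each imputed weight sits exactly on the fitted line, which illustrates the loss of variability noted on the slide.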
32 Multiple Imputation
The idea of multiple imputation is that instead of filling in missing values to create a single imputed dataset, several (or more) imputed data sets are created, each of which contains different imputed values. The analysis of a statistical model is then done on each of the imputed data sets. The multiple analyses are then combined to yield a single set of results. The major advantage of multiple imputation over single imputation is that it produces standard errors that reflect the degree of uncertainty due to the imputation of missing values. In general, multiple imputation techniques require that missing observations are missing at random (MAR).
------ http://www.ats.ucla.edu/stat/stata/library/ice.htm
33 There are two major approaches to creating multiply imputed datasets:
1) multivariate normal
2) imputation by chained equations
------ http://www.ats.ucla.edu/stat/stata/library/ice.htm
34 1) Multivariate normal
This approach is based on the joint distribution of all the variables in the imputation model, including variables to be imputed and variables to be used only for the purpose of imputing other variables. In this approach, the joint distribution of all variables in the imputation model is assumed to be multivariate normal.
------ http://www.ats.ucla.edu/stat/stata/library/ice.htm
35 2) Imputation by chained equations
This method is based on each conditional density of a variable given other variables.
------ http://www.ats.ucla.edu/stat/stata/library/ice.htm
36 Imputation by chained equations
The user selects how many imputed datasets to combine. Usually 3 is enough, with little if any advantage to using more than 10 sets.
37 To create one imputed dataset,
1) Drop any subject that is missing every variable that will be considered in the regression model, as these subjects are impossible to impute.
2) All missing values for each specific variable are filled in with randomly selected nonmissing values from the same variable (hotdeck approach).
3) For each variable, in turn, ignore the filled-in value and instead impute the missing value by predicting it from the remaining variables, where the remaining variables were filled in if needed in step 2 (regression approach).
------ http://www.ats.ucla.edu/stat/stata/library/ice.htm
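The three steps above can be sketched for the simplest possible case of two variables, each predicted from the other by a one-predictor linear regression. This is a simplified illustration and not the algorithm as implemented in Stata's ice. The function names, the number of cycles, and the choice to add a randomly drawn observed residual to each prediction (to inject randomness, in the spirit of the Royston quote) are all assumptions.

```python
import random
from statistics import mean


def fit_line(x, y):
    """Least-squares intercept and slope for y on a single predictor x."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def chained_impute(a, b, rng, cycles=5):
    """Create one imputed dataset for two variables a and b (None = missing)."""
    # Step 1: drop subjects that are missing on every variable.
    keep = [i for i in range(len(a)) if a[i] is not None or b[i] is not None]
    cols = {"a": [a[i] for i in keep], "b": [b[i] for i in keep]}
    miss = {k: [v is None for v in col] for k, col in cols.items()}

    # Step 2: fill each missing value with a random observed value (hotdeck).
    for k in ("a", "b"):
        donors = [v for v in cols[k] if v is not None]
        cols[k] = [v if v is not None else rng.choice(donors) for v in cols[k]]

    # Step 3: for each variable in turn, discard its filled-in values and
    # re-impute them from a regression on the other (filled-in) variable.
    for _ in range(cycles):
        for target, other in (("a", "b"), ("b", "a")):
            obs = [i for i, m in enumerate(miss[target]) if not m]
            x = [cols[other][i] for i in obs]
            y = [cols[target][i] for i in obs]
            intercept, slope = fit_line(x, y)
            resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
            for i, m in enumerate(miss[target]):
                if m:  # assumption: prediction plus a resampled residual
                    pred = intercept + slope * cols[other][i]
                    cols[target][i] = pred + rng.choice(resid)
    return cols["a"], cols["b"]


# Hypothetical data; the last subject is missing both variables and is dropped.
x1 = [1, None, 3, 6, 5, 6, None]
x2 = [2, 5, 2, None, 7, 3, None]
x1_imp, x2_imp = chained_impute(x1, x2, random.Random(2011))
print(x1_imp)
print(x2_imp)
```

Calling `chained_impute` several times with different random number generators would give the several imputed datasets that multiple imputation requires.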
38 Next, each imputed dataset is analyzed independently with the desired regression model. Then,
"Estimates of parameters of interest are averaged across the copies to give a single estimate. Standard errors are computed according to the Rubin rules, devised to allow for the between- and within-imputation components of variation in the parameter estimates."
------ Royston P. Multiple imputation of missing values. The Stata Journal 2004;4(3):227-241.
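The pooling step can be sketched for a single regression coefficient. Under Rubin's rules the pooled estimate is the average of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The estimates and standard errors below are hypothetical, and the function name `pool_rubin` is an assumption.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Combine one coefficient across m imputed datasets using Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)                       # average of the m estimates
    within = mean([se ** 2 for se in std_errors])  # average squared standard error
    between = variance(estimates)                  # variance of the estimates (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical results for one coefficient from m = 3 imputed datasets.
estimates = [1.0, 1.2, 0.8]
std_errors = [0.5, 0.5, 0.5]

pooled_estimate, pooled_se = pool_rubin(estimates, std_errors)
print(pooled_estimate, pooled_se)
```

The pooled standard error is larger than any of the individual standard errors because it also carries the between-imputation variation.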
39 Do we really need to go to all this trouble? It depends on how much missing data you have. Harrell (2001) provides some crude guidelines for:
- Proportion of missings ≤ 0.05
- Proportion of missings 0.05 to 0.15
- Proportion of missings > 0.15
------ Harrell Jr FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York, Springer-Verlag, 2001, p.49.
40 Proportion of missings ≤ 0.05
It doesn't matter very much how you impute missings or whether you adjust the variance of regression coefficient estimates for having imputed data in this case. For continuous variables, imputing missings with the median nonmissing value is adequate; for categorical predictors, the most frequent category can be used. Complete case analysis is an option here.
41 Proportion of missings 0.05 to 0.15
If a predictor is unrelated to all of the other predictors, imputations can be done the same as the above (i.e., impute a reasonable constant value). If the predictor is correlated with other predictors, develop a customized model (or have the transcan function, available for S-Plus from Harrell's website, do it for you) to predict the predictor from all of the other predictors. Then impute missings with predicted values. For categorical variables, classification trees are good methods for developing customized imputation models. For continuous variables, ordinary regression can be used if the variable in question does not require a nonmonotonic transformation to be predicted from the other variables. For either the related or unrelated predictor case, variances may need to be adjusted for imputation. Single imputation is probably OK here, but multiple imputation doesn't hurt.
42 Proportion of missings > 0.15
This situation requires the same considerations as in the previous case, and adjusting variances for imputation is even more important. To estimate the strength of the effect of a predictor that is frequently missing, it may be necessary to refit the model on the subset of observations for which that predictor is not missing, if Y is not used for imputation. Multiple imputation is preferred for most models.
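The three bands can be summarized as a simple lookup. The cut points are the ones quoted from Harrell (2001) in the slides above. The function name `harrell_guideline` and the wording of the returned summaries are paraphrases made for illustration, so treat this as a memory aid and not a decision rule.

```python
def harrell_guideline(proportion_missing):
    """Map a proportion of missing values to the crude guideline band."""
    if not 0 <= proportion_missing <= 1:
        raise ValueError("proportion_missing must be between 0 and 1")
    if proportion_missing <= 0.05:
        return "simple imputation (median or most frequent category) or complete case analysis"
    if proportion_missing <= 0.15:
        return "single imputation from a customized model is probably OK; multiple imputation doesn't hurt"
    return "multiple imputation preferred for most models"


# X1 in the example dataset is missing for 2 of 6 subjects.
x1 = [1, None, 3, None, 5, 6]
prop = sum(v is None for v in x1) / len(x1)
print(round(prop, 3), "->", harrell_guideline(prop))
```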
43 That ends the lecture. As a discussion topic, however, let's return to the informative missing situation.
44 Informative missing (IM)
Data elements are more likely to be missing if the true values of the variable in question are systematically higher or lower. For example, this occurs if lower income subjects, or high income subjects, or both, are less likely to answer the income question in a survey. This is the most difficult type of missing data to handle, and in many cases there is no good value to substitute for the missing value. Furthermore, if you analyze your data by just dropping these subjects, your results will be biased, so that does not work either.
------ Harrell Jr FE, 2001, pp.41-52.
45 Discussion Question
Suppose you want to predict opioid abuse in patients receiving prescription opioid pain medications. Your primary predictor variable is an opioid abuse potential scale that is scored as 1) at risk for abuse, 0) not at risk.
Your primary outcome variable is a question on your survey, "Do you currently or have you in the past year used opioid prescription drugs simply to get high?" (Yes or No)
You told your subjects that they had the option to skip any question they wanted to. About 20 of your subjects chose to not answer that question (many of them, perhaps, because they would be admitting to doing something illegal). How are you going to analyze these data?