Transcript and Presenter's Notes

Title: Missing Data


1
Missing Data
  • Paul D. Allison
  • 2004
  • www.ssc.upenn.edu/allison
  • allison@ssc.upenn.edu

2
Assumptions
  • Missing completely at random (MCAR)
  • Suppose some data are missing on Y. These data
    are said to be MCAR if the probability that Y is
    missing is unrelated to Y or other variables X
    (where X is a vector of variables).
  • Pr(Y is missing | X, Y) = Pr(Y is missing)
  • MCAR is the best situation to be in.
  • If data are MCAR, complete data sample is a
    random subsample of original target sample.
  • MCAR allows for the possibility that missingness
    on one variable may be related to missingness on
    another
  • e.g., sets of variables may always be missing
    together

3
Assumptions
  • Missing at random (MAR)
  • Data on Y are missing at random if the
    probability that Y is missing does not depend on
    the value of Y, after controlling for other
    observed variables
  • Pr(Y is missing | X, Y) = Pr(Y is missing | X)
  • E.g., the probability of missing income depends
    on marital status, but within each marital
    status, the probability of missing income does
    not depend on income.
  • Considerably weaker assumption than MCAR
  • Can test whether missingness on Y depends on X
    (see the sketch below)
  • Cannot test whether missingness on Y depends on Y
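
  A minimal sketch of the missingness-on-X test
  mentioned above, reusing the data set my.college
  and variables from the college example later in
  the deck: regress a missingness indicator on the
  observed covariates.

    DATA test;
      SET my.college;
      /* indicator = 1 if graduation rate missing */
      miss_y = (gradrat = .);
    RUN;

    PROC LOGISTIC DATA=test DESCENDING;
      /* significant coefficients mean missingness
         on Y is related to these X variables */
      MODEL miss_y = csat lenroll stufac;
    RUN;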

4
Assumptions
  • Not missing at random (NMAR)
  • If the MAR assumption is violated, the missing
    data mechanism must be modeled to get good
    parameter estimates.
  • Heckman's regression model for sample selection
    bias is a good example.
  • Effective estimation for NMAR missing data
    requires very good prior knowledge about missing
    data mechanism.
  • Data contain no information about what models
    would be appropriate
  • No way to test goodness of fit of missing data
    model
  • Results often very sensitive to choice of model
  • Listwise deletion can handle one important kind
    of NMAR

5
Multiple Imputation
  • Upside
  • Properties similar to ML
  • Consistent, asymptotically efficient (almost),
    asymptotically normal
  • Can be used with any kind of data or model
  • Analysis can be done with conventional software
  • Downside
  • Get a different result every time you use it
  • Implementation may be complex, many different
    approaches

6
Software
  • NORM: freeware from J.L. Schafer
  • http://www.stat.psu.edu/jls/
  • PROC MI (SAS 8.1 and later) (produces
    imputations)
  • PROC MIANALYZE (combines analyses based on MI)
  • Both packages assume multivariate normality for
    producing regression imputations.
  • Harmless assumption for variables with no missing
    data.
  • Works well even if assumption is violated.

7
Steps for MI
  • 1. Choose an appropriate set of variables
  • All variables in the intended model (including
    the dependent variable).
  • Other variables that may be associated with
    variables that have missing data or with their
    probability of being missing.
  • Better to err on the inclusive side.
  • 2. Where necessary, transform variables to
    achieve approximate normality.
  • 3. Run PROC MI on the specified set of variables
    to produce multiple imputed data sets.

8
Steps for MI (continued)
  • 4. Back transform any normalized variables and
    round imputations for discrete variables.
  • 5. Use standard software to estimate desired
    model on each imputed data set.
  • 6. Use PROC MIANALYZE to combine results into a
    single set of parameter estimates, standard
    errors and test statistics.
  • When generating imputed data sets, you may want
    to produce an extra set for exploratory analysis.
    Once you've decided on the model, then apply
    these six steps.
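
  A minimal sketch of step 4, assuming a
  hypothetical variable lrmbrd that was
  log-transformed by hand in a DATA step before
  imputation (PROC MI's TRANSFORM statement, shown
  later, does this back-transformation
  automatically) and a 0-1 variable private whose
  imputations must be rounded for a categorical
  analysis:

    DATA miout2;
      SET miout;
      /* back-transform the hypothetical log-scale
         imputation to the original scale */
      rmbrd = EXP(lrmbrd);
      /* round imputed 0-1 indicator; see the
         caveat about rounding on slide 17 */
      private = ROUND(private);
    RUN;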

9
Imputation with the Dependent Variable
  • For multiple imputation, the dependent variable
    in a regression analysis should always be
    included. This means that the dependent variable
    is used to impute missing values of the
    independent variables.
  • Won't this create bias?
  • Yes, for conventional deterministic imputation.
  • No, for imputation with a random component. In
    fact, leaving out the dependent variable will
    cause bias.
  • Goal of multiple imputation is to reproduce all
    the relationships in the data as closely as
    possible. This can only be accomplished if the
    dependent variable is included in the imputation
    process.

10
Should Missing Data on the Dependent Variable Be
Imputed?
  • If there's no missing data on predictors and no
    auxiliary variables, the answer is NO.
  • In this case, ML is the same as listwise
    deletion; imputation only increases sampling
    variability.
  • If there are auxiliary variables that are
    strongly correlated with the dependent variable,
    YES.
  • Auxiliary variables can yield much better
    imputations for the dependent variable.
  • If there are no auxiliary variables and there
    are cases with missing data on predictors, the
    answer is maybe, but probably not.

11
College Example
  • 1994 U.S. News Guide to Best Colleges
  • 1302 four-year colleges in U.S.
  • Goal: estimate a regression model predicting
    graduation rate (number graduating / number
    enrolled 4 years earlier × 100)
  • 98 colleges have missing data on graduation rate
  • Independent variables
  • 1st year enrollment (logged, 5 cases missing)
  • Room & board fees (40 missing)
  • Student/Faculty Ratio (2 cases missing)
  • Private=1, Public=0
  • Mean Combined SAT Score (40 missing)
  • Auxiliary variable: Mean ACT scores (45 missing)

12
SAS Program (using defaults)
  • PROC MI DATA=my.college OUT=miout;
  • VAR gradrat csat lenroll stufac private rmbrd
    act;
  • RUN;
  • PROC REG DATA=miout OUTEST=a COVOUT;
  • MODEL gradrat = csat lenroll stufac private
    rmbrd;
  • BY _IMPUTATION_;
  • RUN;
  • PROC MIANALYZE DATA=a;
  • VAR INTERCEPT csat lenroll stufac private
    rmbrd;
  • RUN;
  • (See Output 3)

13
Why do multiple imputations?
  • Introduction of random error avoids biases
    endemic to conventional imputation
  • Doing it multiple times
  • Produces more efficient estimates
  • Makes it possible to get good standard error
    estimates

14
Formula for Standard Error
  • b_k is the parameter estimate from imputed data
    set k
  • s_k is the standard error of b_k
  • M is the number of replications
  • This formula is extremely general. It's used
    with virtually every application of multiple
    imputation.
  • Applying this formula to the correlation example,
    we get .042062, noticeably higher than the
    reported standard errors.
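
  The displayed formula did not survive this
  transcript. Consistent with the definitions
  above, the standard Rubin's rules expression is:

    \bar{b} = \frac{1}{M}\sum_{k=1}^{M} b_k

    SE = \sqrt{\frac{1}{M}\sum_{k=1}^{M} s_k^2
         + \left(1+\frac{1}{M}\right)\frac{1}{M-1}
           \sum_{k=1}^{M}\left(b_k-\bar{b}\right)^2}

  The first term is the average within-imputation
  variance; the second is the between-imputation
  variance, inflated to reflect the finite number
  of imputations.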

15
Results
  Multiple Imputation Parameter Estimates

  Parameter     Estimate   Std Error       DF       t   Pr > |t|
  csat          0.065450    0.005656   15.069   11.57     <.0001
  lenroll       2.043879    0.621364   65.985    3.29     0.0016
  private      12.718801    1.354096   49.591    9.39     <.0001
  stufac       -0.217541    0.099291   47.842   -2.19     0.0334
  rmbrd         2.512032    0.000684    9.308    3.67     0.0048

16
PROC MI Options
  • Change number of imputed data sets
  • PROC MI DATA=my.college OUT=miout NIMPUTE=7;
  • The more the better: more data sets give more
    stable parameter estimates and better standard
    error estimates.
  • But there are rapidly diminishing returns. With
    moderate amounts of missing data, 5 is
    sufficient. But with more missing data, you
    should have more data sets.

17
Categorical Variables
  • When imputing 2-category variables, like gender
    or alive/dead (coded as 0-1 variables), imputed
    values can be any real number, usually between 0
    and 1.
  • If the variable is a predictor variable in a
    regression analysis, leave the imputed values as
    they are.
  • If the analysis method requires that the variable
    be a dichotomy (e.g., a dependent variable in a
    logistic regression), use a different multiple
    imputation method (e.g., sequential generalized
    regression).
  • Simply rounding the imputed values is inadequate
    (Horton et al., The American Statistician, Nov.
    2003).

18
Categorical Variables (cont.)
  • The same principles apply to nominal variables
    with more than two categories
  • If analysis method requires categorical data, use
    a different method.
  • If analysis method does NOT require categorical
    data, create a set of dummy variables (one less
    than the number of categories) and impute them
    like any other variable (see the sketch below).
  • In the analysis, don't modify the imputed values.
  • Version 9 of PROC MI has a CLASS statement for
    nominal variables. Can only be used with monotone
    missing data. Imputation is based on discriminant
    function or logistic regression.
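
  A minimal sketch of the dummy-variable approach,
  using a hypothetical three-category variable
  region (coded 1-3) that is not part of the
  college example:

    DATA college2;
      SET my.college;
      /* region is hypothetical; keep missing */
      IF region = . THEN DO;
        reg1 = .; reg2 = .;
      END;
      ELSE DO;
        reg1 = (region = 1);
        reg2 = (region = 2);
      END;
    RUN;

  The dummies reg1 and reg2 then go on the VAR
  statement of PROC MI like any other variables.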

19
Transformations for Normality
  • Imputations can be improved by transforming
    variables to achieve approximate normality before
    imputing, then reversing the transformation after
    imputation.
  • In SAS, this can be done in DATA steps, but PROC
    MI can do many transformations more easily.
  • For example, RMBRD is somewhat skewed to the
    right. A logarithmic transformation removes the
    skewness.
  • PROC MI DATA=my.college OUT=miout;
  • VAR gradrat csat lenroll stufac private rmbrd
    act;
  • TRANSFORM LOG(rmbrd);
  • RUN;
  • This applies the transformation, imputes, and
    back-transforms. Other available transforms:
    BOXCOX, EXP, LOGIT, POWER

20
Output from Other SAS PROCS
  • MIANALYZE can be used in the same way with the
    following regression PROCs that use the OUTEST
    and COVOUT options:
  • REG, LOGISTIC, PROBIT, LIFEREG, PHREG
  • For other PROCs, you must use ODS (Output
    Delivery System) to produce data sets containing
    the estimates and their covariance matrix, as in
    the sketch below.
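
  A minimal sketch, assuming PROC GENMOD; the ODS
  table name ParameterEstimates and the PARMS=
  route into MIANALYZE follow SAS's documented
  pattern, but check the exact names for your PROC
  and SAS version:

    PROC GENMOD DATA=miout;
      MODEL gradrat = csat lenroll stufac private
            rmbrd;
      BY _IMPUTATION_;
      /* one set of estimates per imputation */
      ODS OUTPUT ParameterEstimates=gmparms;
    RUN;

    PROC MIANALYZE PARMS=gmparms;
      VAR Intercept csat lenroll stufac private
          rmbrd;
    RUN;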

21
Summary and Review
  • Among conventional methods, listwise deletion is
    the least problematic.
  • Unbiased if MCAR
  • Standard errors are good estimates of the true
    standard errors
  • Resistant to NMAR for independent variables in
    regression
  • All other conventional methods introduce bias
    into parameter estimates or standard error
    estimates
  • By contrast, ML and MI have optimal properties
    under MAR, or under a correctly specified model
    for missingness
  • Parameter estimates approximately unbiased and
    efficient
  • Good estimates of standard errors and test
    statistics.

22
Summary and Review
  • ML attractive for linear or loglinear models
  • Widely available software
  • Simple decision process
  • Always produces the same results
  • For other estimation tasks, consider MI
  • Works for any kind of model or data
  • May be more robust than ML
  • But does not produce a deterministic result
  • There are many different ways to do it, leading
    to uncertainty and confusion.
  • Can also use ML and MI for nonignorable missing
    data, but
  • Requires very good knowledge of missing data
    process
  • Should always be accompanied by a sensitivity
    analysis