Transcript and Presenter's Notes

Title: Missing Data


1
Missing Data
  • Paul D. Allison
  • 2004
  • www.ssc.upenn.edu/allison
  • allison@ssc.upenn.edu

2
Assumptions
  • Missing completely at random (MCAR)
  • Suppose some data are missing on Y. These data
    are said to be MCAR if the probability that Y is
    missing is unrelated to Y or other variables X
    (where X is a vector of variables).
  • Pr(Y is missing | X, Y) = Pr(Y is missing)
  • MCAR is the best situation to be in.
  • If data are MCAR, complete data sample is a
    random subsample of original target sample.
  • MCAR allows for the possibility that missingness
    on one variable may be related to missingness on
    another
  • e.g., sets of variables may always be missing
    together

3
Assumptions
  • Missing at random (MAR)
  • Data on Y are missing at random if the
    probability that Y is missing does not depend on
    the value of Y, after controlling for other
    observed variables
  • Pr(Y is missing | X, Y) = Pr(Y is missing | X)
  • E.g., the probability of missing income depends
    on marital status, but within each marital
    status, the probability of missing income does
    not depend on income.
  • Considerably weaker assumption than MCAR
  • Can test whether missingness on Y depends on X
    (see the sketch below)
  • Cannot test whether missingness on Y depends on Y
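
  A minimal sketch of the missingness-on-X test
  mentioned above, reusing the data set my.college
  and variables from the college example later in
  the deck: regress a missingness indicator on the
  observed covariates.

    DATA test;
      SET my.college;
      /* indicator = 1 if graduation rate missing */
      miss_y = (gradrat = .);
    RUN;

    PROC LOGISTIC DATA=test DESCENDING;
      /* significant coefficients mean missingness
         on Y is related to these X variables */
      MODEL miss_y = csat lenroll stufac;
    RUN;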

4
Assumptions
  • Not missing at random (NMAR)
  • If the MAR assumption is violated, the missing
    data mechanism must be modeled to get good
    parameter estimates.
  • Heckman's regression model for sample selection
    bias is a good example.
  • Effective estimation for NMAR missing data
    requires very good prior knowledge about missing
    data mechanism.
  • Data contain no information about what models
    would be appropriate
  • No way to test goodness of fit of missing data
    model
  • Results often very sensitive to choice of model
  • Listwise deletion can handle one important kind
    of NMAR

5
Multiple Imputation
  • Upside
  • Properties similar to ML
  • Consistent, asymptotically efficient (almost),
    asymptotically normal
  • Can be used with any kind of data or model
  • Analysis can be done with conventional software
  • Downside
  • Get a different result every time you use it
  • Implementation may be complex, many different
    approaches

6
Software
  • NORM: freeware from J.L. Schafer
  • http://www.stat.psu.edu/jls/
  • PROC MI (SAS 8.1 and later) (produces
    imputations)
  • PROC MIANALYZE (combines analyses based on MI)
  • Both packages assume multivariate normality for
    producing regression imputations.
  • Harmless assumption for variables with no missing
    data.
  • Works well even if assumption is violated.

7
Steps for MI
  • 1. Choose an appropriate set of variables
  • All variables in the intended model (including
    the dependent variable).
  • Other variables that may be associated with
    variables that have missing data or with their
    probability of being missing.
  • Better to err on the inclusive side.
  • 2. Where necessary, transform variables to
    achieve approximate normality.
  • 3. Run PROC MI on the specified set of variables
    to produce multiple imputed data sets.

8
Steps for MI (continued)
  • 4. Back transform any normalized variables and
    round imputations for discrete variables.
  • 5. Use standard software to estimate desired
    model on each imputed data set.
  • 6. Use PROC MIANALYZE to combine results into a
    single set of parameter estimates, standard
    errors and test statistics.
  • When generating imputed data sets, you may want
    to produce an extra set for exploratory analysis.
    Once you've decided on the model, then apply
    these six steps.
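
  A minimal sketch of step 4, assuming a
  hypothetical variable lrmbrd that was
  log-transformed by hand in a DATA step before
  imputation (PROC MI's TRANSFORM statement, shown
  later, does this back-transformation
  automatically) and a 0-1 variable private whose
  imputations must be rounded for a categorical
  analysis:

    DATA miout2;
      SET miout;
      /* back-transform the hypothetical log-scale
         imputation to the original scale */
      rmbrd = EXP(lrmbrd);
      /* round imputed 0-1 indicator; see the
         caveat about rounding on slide 17 */
      private = ROUND(private);
    RUN;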

9
Imputation with the Dependent Variable
  • For multiple imputation, the dependent variable
    in a regression analysis should always be
    included. This means that the dependent variable
    is used to impute missing values of the
    independent variables.
  • Won't this create bias?
  • Yes, for conventional deterministic imputation.
  • No, for imputation with a random component. In
    fact, leaving out the dependent variable will
    cause bias.
  • Goal of multiple imputation is to reproduce all
    the relationships in the data as closely as
    possible. This can only be accomplished if the
    dependent variable is included in the imputation
    process.

10
Should Missing Data on the Dependent Variable Be
Imputed?
  • If there's no missing data on predictors and no
    auxiliary variables, the answer is NO.
  • In this case, ML is the same as listwise
    deletion; imputation only increases sampling
    variability.
  • If there are auxiliary variables that are
    strongly correlated with the dependent variable,
    YES.
  • Auxiliary variables can yield much better
    imputations for the dependent variable.
  • If there are no auxiliary variables and there
    are cases with missing data on predictors, the
    answer is maybe, but probably not.

11
College Example
  • 1994 U.S. News Guide to Best Colleges
  • 1302 four-year colleges in U.S.
  • Goal: estimate a regression model predicting
    graduation rate (number graduating / number
    enrolled 4 years earlier × 100)
  • 98 colleges have missing data on graduation rate
  • Independent variables
  • 1st year enrollment (logged, 5 cases missing)
  • Room & board fees (40 missing)
  • Student/Faculty Ratio (2 cases missing)
  • Private=1, Public=0
  • Mean Combined SAT Score (40 missing)
  • Auxiliary variable: Mean ACT scores (45 missing)

12
SAS Program (using defaults)
  • PROC MI DATA=my.college OUT=miout;
  • VAR gradrat csat lenroll stufac private rmbrd
    act;
  • RUN;
  • PROC REG DATA=miout OUTEST=a COVOUT;
  • MODEL gradrat = csat lenroll stufac private
    rmbrd;
  • BY _IMPUTATION_;
  • RUN;
  • PROC MIANALYZE DATA=a;
  • VAR INTERCEPT csat lenroll stufac private
    rmbrd;
  • RUN;
  • (See Output 3)

13
Why do multiple imputations?
  • Introduction of random error avoids biases
    endemic to conventional imputation
  • Doing it multiple times
  • Produces more efficient estimates
  • Makes it possible to get good standard error
    estimates

14
Formula for Standard Error
  • b_k is the parameter estimate from imputed data
    set k
  • s_k is the standard error of b_k
  • M is the number of replications
  • This formula is extremely general. It's used
    with virtually every application of multiple
    imputation.
  • Applying this formula to the correlation example,
    we get .042062, noticeably higher than the
    reported standard errors.
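
  The displayed formula did not survive this
  transcript. Consistent with the definitions
  above, the standard Rubin's rules expression is:

    \bar{b} = \frac{1}{M}\sum_{k=1}^{M} b_k

    SE = \sqrt{\frac{1}{M}\sum_{k=1}^{M} s_k^2
         + \left(1+\frac{1}{M}\right)\frac{1}{M-1}
           \sum_{k=1}^{M}\left(b_k-\bar{b}\right)^2}

  The first term is the average within-imputation
  variance; the second is the between-imputation
  variance, inflated to reflect the finite number
  of imputations.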

15
Results
  Multiple Imputation Parameter Estimates

  Parameter     Estimate   Std Error       DF       t   Pr > |t|
  csat          0.065450    0.005656   15.069   11.57     <.0001
  lenroll       2.043879    0.621364   65.985    3.29     0.0016
  private      12.718801    1.354096   49.591    9.39     <.0001
  stufac       -0.217541    0.099291   47.842   -2.19     0.0334
  rmbrd         2.512032    0.000684    9.308    3.67     0.0048

16
PROC MI Options
  • Change number of imputed data sets
  • PROC MI DATA=my.college OUT=miout NIMPUTE=7;
  • The more the better: more data sets give more
    stable parameter estimates and better standard
    error estimates.
  • But there are rapidly diminishing returns. With
    moderate amounts of missing data, 5 is
    sufficient. But with more missing data, you
    should have more data sets.

17
Categorical Variables
  • When imputing 2-category variables, like gender
    or alive/dead (coded as 0-1 variables), imputed
    values can be any real number, usually between 0
    and 1.
  • If the variable is a predictor variable in a
    regression analysis, leave the imputed values as
    they are.
  • If the analysis method requires that the variable
    be a dichotomy (e.g., a dependent variable in a
    logistic regression), use a different multiple
    imputation method (e.g., sequential generalized
    regression).
  • Simply rounding the imputed values is inadequate
    (Horton et al., The American Statistician, Nov.
    2003).

18
Categorical Variables (cont.)
  • The same principles apply to nominal variables
    with more than two categories
  • If analysis method requires categorical data, use
    a different method.
  • If analysis method does NOT require categorical
    data, create a set of dummy variables (one less
    than the number of categories) and impute them
    like any other variable (see the sketch below).
  • In the analysis, don't modify the imputed values.
  • Version 9 of PROC MI has a CLASS statement for
    nominal variables. Can only be used with monotone
    missing data. Imputation is based on discriminant
    function or logistic regression.
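
  A minimal sketch of the dummy-variable approach,
  using a hypothetical three-category variable
  region (coded 1-3) that is not part of the
  college example:

    DATA college2;
      SET my.college;
      /* region is hypothetical; keep missing */
      IF region = . THEN DO;
        reg1 = .; reg2 = .;
      END;
      ELSE DO;
        reg1 = (region = 1);
        reg2 = (region = 2);
      END;
    RUN;

  The dummies reg1 and reg2 then go on the VAR
  statement of PROC MI like any other variables.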

19
Transformations for Normality
  • Imputations can be improved by transforming
    variables to achieve approximate normality before
    imputing, then reversing the transformation after
    imputation.
  • In SAS, this can be done in DATA steps, but PROC
    MI can do many transformations more easily.
  • For example, RMBRD is somewhat skewed to the
    right. A logarithmic transformation removes the
    skewness.
  • PROC MI DATA=my.college OUT=miout;
  • VAR gradrat csat lenroll stufac private rmbrd
    act;
  • TRANSFORM LOG(rmbrd);
  • RUN;
  • This applies the transformation, imputes, and
    back-transforms. Other available transforms:
    BOXCOX, EXP, LOGIT, POWER

20
Output from Other SAS PROCS
  • MIANALYZE can be used in the same way with the
    following regression PROCs that use the OUTEST
    and COVOUT options:
  • REG, LOGISTIC, PROBIT, LIFEREG, PHREG
  • For other PROCs, you must use ODS (Output
    Delivery System) to produce data sets containing
    the estimates and their covariance matrix, as in
    the sketch below.
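
  A minimal sketch, assuming PROC GENMOD; the ODS
  table name ParameterEstimates and the PARMS=
  route into MIANALYZE follow SAS's documented
  pattern, but check the exact names for your PROC
  and SAS version:

    PROC GENMOD DATA=miout;
      MODEL gradrat = csat lenroll stufac private
            rmbrd;
      BY _IMPUTATION_;
      /* one set of estimates per imputation */
      ODS OUTPUT ParameterEstimates=gmparms;
    RUN;

    PROC MIANALYZE PARMS=gmparms;
      VAR Intercept csat lenroll stufac private
          rmbrd;
    RUN;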

21
Summary and Review
  • Among conventional methods, listwise deletion is
    the least problematic.
  • Unbiased if MCAR
  • Standard errors are good estimates of the true
    standard errors
  • Resistant to NMAR for independent variables in
    regression
  • All other conventional methods introduce bias
    into parameter estimates or standard error
    estimates
  • By contrast, ML and MI have optimal properties
    under MAR, or under a correctly specified model
    for missingness
  • Parameter estimates approximately unbiased and
    efficient
  • Good estimates of standard errors and test
    statistics.

22
Summary and Review
  • ML attractive for linear or loglinear models
  • Widely available software
  • Simple decision process
  • Always produces the same results
  • For other estimation tasks, consider MI
  • Works for any kind of model or data
  • May be more robust than ML
  • But does not produce a deterministic result
  • There are many different ways to do it, leading
    to uncertainty and confusion.
  • Can also use ML and MI for nonignorable missing
    data, but
  • Requires very good knowledge of missing data
    process
  • Should always be accompanied by a sensitivity
    analysis