Data Analysis With Missing Values May 18 - PowerPoint PPT Presentation

1
Data Analysis With Missing Values May 18 May
23
  • Dingcai Cao
  • d-cao_at_uchicago.edu

2
Missing Values
  • Missing data are observations that we intended to
    make but did not.
  • For example, an individual may respond only to
    certain questions in a survey, or may not respond
    at all to a particular wave of a longitudinal
    survey.
  • Goal: make inferences that apply to the
    population targeted by the complete sample, i.e.
    the goal remains what it would have been had we
    seen the complete data.
  • Why do we care?
  • Missing data are common.
  • However, they are usually inadequately handled in
    both epidemiological and experimental research.
  • For example, Wood et al. (2004) reviewed 71
    recently published BMJ, JAMA, Lancet and NEJM
    papers. 89% had partly missing outcome data. In
    37 trials with repeated outcome measures, 46%
    performed complete case analysis.

3
Missing Pattern
  • In practice the data consist of (a) the
    observations actually made (where '?' denotes a
    missing observation)
  • and (b) the pattern of missing values

4
Missing Pattern: Univariate Pattern

5
Missing Pattern: Monotonic Pattern

6
Missing Pattern: Arbitrary Pattern

7
Missing mechanisms
  • Missing Completely At Random (MCAR)
  • Missing At Random (MAR)
  • Missing Not At Random (MNAR)
  • Knowing the missing data mechanism is critical
    for data analysis.

8
Missing Completely at Random (MCAR)
  • Suppose some data are missing on Y.
  • These data are said to be MCAR if the probability
    that Y is missing is unrelated to the value of Y
    or to other variables X.
  • e.g. a simple random sample of the Y values is
    missing
  • Pr(Y is missing | X, Y) = Pr(Y is missing)
  • MCAR is the best situation to be in. Under MCAR,
    the analysis of only those units with complete
    data gives valid inferences.

9
Missing At Random (MAR)
  • Data on Y are Missing At Random (MAR) if the
    probability that Y is missing does not depend on
    the value of Y, after controlling for other
    observed variables X.
  • e.g. MCAR within strata
  • Pr(Y is missing | X, Y) = Pr(Y is missing | X)
  • Much weaker assumption than MCAR
  • Can test whether missingness on Y depends on X
  • Practically the same as ignorable missingness,
    i.e. no need to model the missing data mechanism
    in the analysis of the data

10
Missing Not At Random (MNAR)
  • When neither MCAR nor MAR holds, we say the data
    are Missing Not At Random, abbreviated MNAR. In
    the likelihood setting (see end of previous
    section) the missingness mechanism is termed
    non-ignorable.
  • What this means:
  • Under MNAR, the missing data mechanism must be
    modeled to get good parameter estimates
  • Requires good prior knowledge about causes of
    missingness
  • The data contain no information on which missing
    data model would be appropriate
  • No way to test goodness of fit of the missing
    data model
  • Results may be very sensitive to the choice of
    model
  • Sensitivity analysis is critical to the data
    analysis process

11
Missing Mechanisms: Example
  • X = age, Y = income
  • If the probability of recording income is the
    same for all individuals, regardless of their age
    or income, then the data are missing completely
    at random (MCAR).
  • If the probability of recording income varies
    according to the age of the respondent but does
    not vary according to the income of respondents
    within an age group, then the data are missing at
    random (MAR).
  • If the probability of recording income varies
    according to income within each age group, then
    the data are neither MCAR nor MAR (MNAR).
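
The age and income example can be sketched as a small simulation. This is only an illustration in Python (the slides themselves use SAS); the income model, the recording probabilities and the function name recorded_income_bias are all assumptions made up for the sketch. It reports how far the mean of the recorded incomes drifts from the mean of all incomes, overall and within one age group (under 40).

```python
import random
import statistics

def recorded_income_bias(mechanism, n=20000, seed=0):
    """Return (overall bias, bias among under-40s) of the mean recorded income."""
    rng = random.Random(seed)
    rows = []  # (age, income, recorded?)
    for _ in range(n):
        age = rng.uniform(20, 65)                  # X: always observed
        expected = 20000 + 800 * (age - 20)        # hypothetical age effect
        income = expected + rng.gauss(0, 10000)    # Y: may go unrecorded
        if mechanism == "MCAR":
            p_record = 0.7                         # same for everyone
        elif mechanism == "MAR":
            p_record = 0.9 if age < 40 else 0.5    # depends on age only
        elif mechanism == "MNAR":
            # depends on income itself, even among people of the same age
            p_record = 0.9 if income < expected else 0.5
        else:
            raise ValueError(mechanism)
        rows.append((age, income, rng.random() < p_record))

    def bias(subset):
        everyone = statistics.fmean([y for _, y, _ in subset])
        recorded = statistics.fmean([y for _, y, seen in subset if seen])
        return recorded - everyone

    young = [row for row in rows if row[0] < 40]
    return bias(rows), bias(young)

for mechanism in ("MCAR", "MAR", "MNAR"):
    overall, within = recorded_income_bias(mechanism)
    print(f"{mechanism}: overall bias {overall:9.1f}, bias among under-40s {within:9.1f}")
```

Under these assumptions the MAR run is biased overall but roughly unbiased within the age group, while the MNAR run stays biased even within the age group.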

12
Methods: Preliminary Analysis
  • Missing data pattern
  • Amount missing on each variable
  • Number of missing values per subject
  • Variables that are missing together
  • Correlations between variables
13
Methods: Data Analysis with Missing Values
  • Procedures based on completely recorded units
  • Listwise deletion (analyze only complete units)
  • Imputation-based procedures
  • Single imputation
  • Hot deck imputation: using recorded units to
    substitute for missing records
  • Mean imputation: using the mean of the recorded
    values to substitute for missing values
  • Regression imputation: missing variables for a
    unit are estimated by predicted values from a
    regression on the known variables for that unit
  • Multiple imputation
  • For each missing value, generate multiple values
    to impute based on regression or some other
    statistical model
  • Maximum likelihood-based procedures
  • Inference is based on maximum likelihood (not
    covered in this class)

14
Simulated data
  • Population
  • Gender: X1 = 1 if male, 0 if female
  • Age: X2 ~ N(50, 10^2)
  • Weight: Y = b0 + X1 b1 + (X2 - 20) b2 + e
  • e ~ N(0, s^2)
  • b0 = 125, b1 = 40, b2 = 1, s = 15
  • Samples
  • small (n = 20) to illustrate procedures
  • large (N = 10,000) to check bias and efficiency
  • Various patterns of missingness
  • X1 is completely observed
  • X2 and/or Y may have missing values
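
The population model can be sketched in a few lines of Python (not the SAS used for the analyses in these slides). The coefficients are the ones listed above; the slide does not say what share of the population is male, so p_male = 0.5 is an assumption, as are the function name and the seed.

```python
import random
import statistics

B0, B1, B2, S = 125, 40, 1, 15   # b0, b1, b2 and s from the slide

def simulate(n, seed=0, p_male=0.5):
    """Draw n complete cases (X1, X2, Y) from the population model."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        x1 = 1 if rng.random() < p_male else 0    # gender; p_male is assumed
        x2 = rng.gauss(50, 10)                    # age ~ N(50, 10^2)
        e = rng.gauss(0, S)                       # e ~ N(0, s^2)
        y = B0 + x1 * B1 + (x2 - 20) * B2 + e     # weight
        sample.append((x1, x2, y))
    return sample

small = simulate(20)        # to illustrate procedures
large = simulate(10_000)    # to check bias and efficiency

mean_weight = statistics.fmean([y for _, _, y in large])
print(round(mean_weight, 1))
```

With half the population male, the expected weight is 125 + 40(0.5) + (50 - 20) = 175, which the large sample should approximate.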

15
Simulated sample
  • n = 20
  • Women (gray) less likely to disclose
  • weight (Y)
  • age (X2)

16
Method 1: Listwise deletion
  • delete all cases with missing values for Y, X1,
    or X2
  • analyze remaining (complete) cases
  • common software default

/* In SAS */
proc reg data=missing_weight_age;           /* Use data on right */
  model weight = maleness years_over_20;    /* Regress weight on sex and age */
run;

                      Parameter    Standard
Variable         DF    Estimate       Error
Intercept         1   160.98328    22.48391
maleness          1    70.24778    18.70752
years_over_20     1    -0.47579     0.69354

17
Assumption: listwise deletion
  • Listwise deletion (LD) assumes deletion does not
    depend on e
  • Otherwise the e's for complete cases won't have a
    mean of 0
  • Y = b0 + X1 b1 + (X2 - 20) b2 + e, where e ~ N(0, s^2)
  • Assumption satisfied
  • women (X1 = 0) less likely to disclose weight Y or
    age X2
  • deletion depends on X1
  • Assumption violated
  • overweight (e > 0) less likely to disclose weight Y
    or age X2
  • delete mostly positive e's, leaving negative e's
  • complete cases are mostly underweight
  • Results are biased
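
The two cases can be checked with a short simulation, sketched here in Python rather than SAS. The deletion probabilities (0.2 and 0.6), the 50/50 gender split and the function name are assumptions chosen for illustration; s = 15 comes from the simulated-data slide. The sketch reports the mean residual e among the cases that survive deletion.

```python
import random
import statistics

def mean_residual_of_complete_cases(rule, n=10_000, seed=1):
    """Mean of e among cases kept after listwise deletion."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n):
        x1 = rng.randint(0, 1)        # 1 = male, 0 = female (assumed 50/50)
        e = rng.gauss(0, 15)          # e ~ N(0, s^2) with s = 15
        if rule == "depends_on_x1":   # women less likely to disclose
            p_delete = 0.2 if x1 == 1 else 0.6
        elif rule == "depends_on_e":  # overweight (e > 0) less likely to disclose
            p_delete = 0.6 if e > 0 else 0.2
        else:
            raise ValueError(rule)
        if rng.random() >= p_delete:
            kept.append(e)
    return statistics.fmean(kept)

satisfied = mean_residual_of_complete_cases("depends_on_x1")
violated = mean_residual_of_complete_cases("depends_on_e")
print(f"deletion depends on X1: mean e = {satisfied:6.2f}")
print(f"deletion depends on e:  mean e = {violated:6.2f}")
```

When deletion depends only on X1 the surviving residuals still average about 0; when it depends on e they average clearly below 0, which is the source of the bias.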

18
Mean imputation
  • Technique
  • Calculate mean over cases that have values for Y
  • Impute this mean where Y is missing
  • Same procedure for X1, X2, etc.
  • Implicit models (m denotes a mean)
  • Y = mY
  • X1 = m1
  • X2 = m2
  • Problems
  • ignores relationships among X and Y
  • underestimates covariances
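
A rough Python sketch (the slides use SAS) of why mean imputation underestimates covariances. It uses only age and weight, drops the gender term for brevity, and makes about 40% of the weights missing completely at random; those choices, the seed and the helper name covariance are assumptions.

```python
import random
import statistics

def covariance(xs, ys):
    """Sample covariance with divisor n - 1."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

rng = random.Random(2)
ages = [rng.gauss(50, 10) for _ in range(5000)]
weights = [125 + (a - 20) + rng.gauss(0, 15) for a in ages]   # slope b2 = 1

# Make about 40% of the weights missing completely at random.
observed = [w if rng.random() > 0.4 else None for w in weights]

# Mean imputation: fill every gap with the mean of the recorded weights.
mean_observed = statistics.fmean([w for w in observed if w is not None])
imputed = [mean_observed if w is None else w for w in observed]

cov_full = covariance(ages, weights)
cov_imputed = covariance(ages, imputed)
print(f"covariance, complete data: {cov_full:6.1f}")
print(f"covariance, mean-imputed:  {cov_imputed:6.1f}")
```

Every imputed case sits exactly at the mean of Y, so it contributes nothing to the covariance, and the estimate shrinks roughly in proportion to the fraction of values imputed.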

19
Regression imputation
  • Technique and implicit models
  • If Y is missing
  • impute mean of cases with similar values for X1,
    X2
  • Y = b0 + X1 b1 + X2 b2
  • Likewise, if X2 is missing
  • impute mean of cases with similar values for X1,
    Y
  • X2 = g0 + X1 g1 + Y g2
  • If both Y and X2 are missing
  • impute means of cases with similar values for X1
  • Y = d0 + X1 d1
  • X2 = f0 + X1 f1
  • Problem
  • Ignores random components (no e)
  • So it underestimates variances and standard
    errors (SEs)
20
Single random imputation
  • Like regression imputation
  • but imputed value includes a random residual
  • Implicit models
  • If Y is missing
  • Y = b0 + X1 b1 + X2 b2 + eY.12
  • Likewise, if X2 is missing
  • X2 = g0 + X1 g1 + Y g2 + e2.1Y
  • If both Y and X2 are missing
  • Y = d0 + X1 d1 + eY.1
  • X2 = f0 + X1 f1 + e2.1
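
The difference between regression imputation and single random imputation can be sketched as follows. This is a simplified Python illustration, not the SAS workflow of the slides: it regresses weight on age alone (no gender term), makes about 40% of weights missing completely at random, and estimates the residual standard deviation with divisor n - 1. All of those simplifications are assumptions of the sketch.

```python
import random
import statistics

rng = random.Random(3)
n = 5000
ages = [rng.gauss(50, 10) for _ in range(n)]
weights = [125 + (a - 20) + rng.gauss(0, 15) for a in ages]
is_missing = [rng.random() < 0.4 for _ in range(n)]   # weight not recorded

# Fit weight = b0 + b2 * age by least squares on the complete cases.
xs = [a for a, m in zip(ages, is_missing) if not m]
ys = [w for w, m in zip(weights, is_missing) if not m]
mx, my = statistics.fmean(xs), statistics.fmean(ys)
b2 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b0 = my - b2 * mx
residual_sd = statistics.stdev([y - (b0 + b2 * x) for x, y in zip(xs, ys)])

# Regression imputation: predicted value only (no e).
regression_imputed = [
    (b0 + b2 * a) if m else w for a, w, m in zip(ages, weights, is_missing)
]
# Single random imputation: predicted value plus a random residual.
random_imputed = [
    (b0 + b2 * a + rng.gauss(0, residual_sd)) if m else w
    for a, w, m in zip(ages, weights, is_missing)
]

var_true = statistics.variance(weights)
var_regression = statistics.variance(regression_imputed)
var_random = statistics.variance(random_imputed)
print(f"variance of weight, complete data:      {var_true:6.1f}")
print(f"variance after regression imputation:   {var_regression:6.1f}")
print(f"variance after random imputation:       {var_random:6.1f}")
```

Regression imputation places the imputed weights exactly on the fitted line and so shrinks the variance of Y; adding a random residual restores it approximately.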

21
Problem with single imputation
  • Still underestimates standard errors (SEs)!
  • treats imputed values like observed values
  • when they are actually less certain
  • ignores imputation variation

22
Imputation variation
  • Sampling variation
  • If you take a different sample
  • you get different parameter estimates
  • Standard errors reflect this
  • One way to estimate sampling variation
  • measure variation across multiple samples
  • called bootstrapping
  • Imputation variation
  • If you impute different values
  • you get different parameter estimates
  • Standard errors should reflect this, too
  • One way to estimate imputation variation
  • measure variation across multiple imputed data
    sets
  • called multiple imputation
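
As a small illustration of estimating sampling variation by bootstrapping, the Python sketch below resamples a sample with replacement and compares the spread of the resampled means with the usual formula for the standard error of a mean. The sample itself (200 hypothetical weights), the number of replicates and the function name are assumptions.

```python
import random
import statistics

rng = random.Random(4)
sample = [rng.gauss(175, 27) for _ in range(200)]   # hypothetical weights

def bootstrap_se(data, statistic, replicates=1000):
    """Standard deviation of a statistic across bootstrap resamples."""
    estimates = []
    for _ in range(replicates):
        resample = rng.choices(data, k=len(data))   # draw with replacement
        estimates.append(statistic(resample))
    return statistics.stdev(estimates)

se_boot = bootstrap_se(sample, statistics.fmean)
se_formula = statistics.stdev(sample) / len(sample) ** 0.5
print(f"bootstrap SE of the mean: {se_boot:.2f}")
print(f"formula SE of the mean:   {se_formula:.2f}")
```

Multiple imputation applies the same idea to imputation: the spread of an estimate across several imputed data sets measures the imputation variation.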

23
Multiple imputation
  • Case 1 is missing weight
  • Given case 1's sex and age
  • and relationships in other cases
  • generate a plausible distribution for case 1's
    weight
  • At random, sample 5 (or more) plausible weights
    for case 1
  • Yes! Impute Y!
  • For case 6, sample from the conditional
    distribution of age.
  • Yes! Use Y to impute X!
  • For case 7, sample from the conditional bivariate
    distribution of age and weight
24
Multiple imputation
  • We impute these plausible values, creating 5
    versions of the data set: multiple imputations

proc mi data=missing_weight_age      /* Input data missing_weight_age */
        out=weight_age_mi;           /* Output 5 imputed data sets into weight_age_mi (left) */
  var years_over_20 weight maleness; /* These variables have missing values or can help impute them. */
run;
http://support.sas.com/rnd/app/papers/mi.pdf

25
Multiple imputation
  • For each imputed data set, estimate
  • parameters (white)
  • sampling variances and covariances (gray)

proc reg                                  /* Regression */
    data=weight_age_mi                    /* Input MI data */
    outest=parameters covout;             /* Output data set called parameters (right) */
  model weight = maleness years_over_20;  /* Regress weight on sex and age */
  by _imputation_;                        /* Separate analysis for each imputed data set */
run;

26
Sampling variation vs. imputation variation
  • Over the 5 analyses,
  • Mean(b0) estimates b0
  • Mean(s^2_b0) estimates the variance in b0 due to
    sampling
  • Var(b0) estimates the variance in b0 due to
    imputation

27
MI standard errors
  • Total variance in b0
  • Variation due to sampling + variation due to
    imputation
  • Mean(s^2_b0) + Var(b0)
  • Actually, there's a correction factor of (1 + 1/M)
  • for the number of imputations M. (Here M = 5.)
  • So total variance in estimating b0 is
  • Mean(s^2_b0) + (1 + 1/M) Var(b0)
    = 179.53 + (1.2)(511.59) = 793.44
  • Standard error is sqrt(793.44) = 28.17
28
MI estimates in SAS
Multiple Imputation Parameter Estimates

Parameter          Estimate   Std Error   95% Confidence Limits       DF
intercept        178.564526   28.168160    70.58553    286.5435   2.2804
maleness          67.110037   14.801696    21.52721    112.6929   3.1866
years_over_20     -0.960283    0.819559    -3.57294      1.6524   2.991

proc MIAnalyze data=parameters;         /* Synthesize parameters from 5 separate regressions */
  var intercept maleness years_over_20; /* Variables with slopes of interest */
run;
http://support.sas.com/rnd/app/papers/mianalyze.pdf

29
(No Transcript)
30
Methods: Data Analysis with Missing Values
  • Procedures based on completely recorded units
  • Listwise deletion (analyze only complete units)
  • Good for MCAR
  • Imputation-based procedures
  • Single imputation
  • Hot deck imputation: Bad
  • Mean imputation: Bad
  • Regression imputation: Better
  • Multiple imputation
  • Best
  • Maximum likelihood-based procedures
  • Best