Title: Data Analysis With Missing Values May 18
1. Data Analysis With Missing Values (May 18 and May 23)
- Dingcai Cao
- d-cao_at_uchicago.edu
2. Missing Values
- Missing data are observations that we intended to make but did not.
- For example, an individual may respond only to certain questions in a survey, or may not respond at all to a particular wave of a longitudinal survey.
- Goal: make inferences that apply to the population targeted by the complete sample, i.e. the goal remains what it would have been had we seen the complete data.
- Why do we care?
- Missing data are common.
- However, they are usually inadequately handled in both epidemiological and experimental research.
- For example, Wood et al. (2004) reviewed 71 recently published BMJ, JAMA, Lancet and NEJM papers. 89% had partly missing outcome data. In 37 trials with repeated outcome measures, 46% performed complete case analysis.
3. Missing Pattern
- In practice the data consist of (a) the observations actually made (where '?' denotes a missing observation)
- and (b) the pattern of missing values
4. Missing Pattern: Univariate Pattern
5. Missing Pattern: Monotonic Pattern
6. Missing Pattern: Arbitrary Pattern
7. Missing mechanisms
- Missing Completely At Random (MCAR)
- Missing At Random (MAR)
- Missing Not At Random (MNAR)
- Knowing the missing-data mechanism is critical for data analysis.
8. Missing Completely at Random (MCAR)
- Suppose some data are missing on Y.
- These data are said to be MCAR if the probability that Y is missing is unrelated to the value of Y itself or to other variables X.
- e.g. a simple random sample of the Ys is missing
- Pr(Y is missing | X, Y) = Pr(Y is missing)
- MCAR is the best situation to be in. Under MCAR, the analysis of only those units with complete data gives valid inferences.
9. Missing At Random (MAR)
- Data on Y are Missing At Random (MAR) if the probability that Y is missing does not depend on the value of Y, after controlling for other observed variables X.
- e.g. MCAR within strata
- Pr(Y is missing | X, Y) = Pr(Y is missing | X)
- Much weaker assumption than MCAR
- Can test whether missingness on Y depends on X
- Practically the same as ignorable missingness, i.e. no need to model the missing data mechanism in the analysis of the data
10. Missing Not At Random (MNAR)
- When neither MCAR nor MAR holds, we say the data are Missing Not At Random, abbreviated MNAR. In the likelihood setting (see end of previous section) the missingness mechanism is termed non-ignorable.
- What this means is:
  - Under MNAR, the missing data mechanism must be modeled to get good parameter estimates
  - This requires good prior knowledge about the causes of missingness
  - The data contain no information about which model would be appropriate
  - There is no way to test the goodness of fit of the missing data model
  - Results may be very sensitive to the choice of model
- Sensitivity analysis is critical to the data analysis process
11. Missing Mechanisms: Example
- X = age, Y = income
- If the probability of recording income is the same for all individuals, regardless of their age or income, then the data are missing completely at random (MCAR).
- If the probability of recording income varies according to the age of the respondent but does not vary according to the income of respondents within an age group, then the data are missing at random (MAR).
- If the probability of recording income varies according to income within each age group, then the data are neither MCAR nor MAR (MNAR).
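The three mechanisms in the age/income example can be illustrated with a small simulation. This is only a sketch in Python (not the SAS used on the slides): the income model, the recording probabilities, and the cut-offs at age 45 and at the median income are all assumptions made up for illustration.

```python
import random
from statistics import mean

rng = random.Random(0)

# Hypothetical population: X = age, Y = income (the income model is an assumption).
people = []
for _ in range(50000):
    age = rng.uniform(20, 70)
    income = 20000 + 600 * age + rng.gauss(0, 10000)
    people.append((age, income))

median_income = sorted(y for _, y in people)[len(people) // 2]

def p_recorded(mechanism, age, income):
    """Probability that income is recorded (illustrative values only)."""
    if mechanism == "MCAR":
        return 0.7                                      # same for everyone
    if mechanism == "MAR":
        return 0.9 if age < 45 else 0.5                 # depends on age only
    if mechanism == "MNAR":
        return 0.9 if income < median_income else 0.5   # depends on income itself
    raise ValueError(mechanism)

def recorded(mechanism):
    """Keep only the people whose income ends up recorded."""
    return [(a, y) for a, y in people if rng.random() < p_recorded(mechanism, a, y)]

def young_mean(rows):
    """Mean income within one age stratum (age < 45)."""
    return mean(y for a, y in rows if a < 45)

true_mean = mean(y for _, y in people)
true_young = young_mean(people)

results = {}
for mech in ("MCAR", "MAR", "MNAR"):
    rows = recorded(mech)
    results[mech] = (mean(y for _, y in rows), young_mean(rows))
    print(mech, round(results[mech][0]), round(results[mech][1]))
print("truth", round(true_mean), round(true_young))
```

Under these assumptions the recorded cases reproduce the overall mean only under MCAR. Under MAR the overall mean of the recorded incomes is off, but the mean within an age group is still close to the truth, which is the "MCAR within strata" idea. Under MNAR even the within-group mean is off.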
12. Methods: Preliminary Analysis
- Missing data pattern
  - Number missing on each variable
  - Number of missing values per subject
  - Missings together (which variables tend to be missing for the same subjects)
- Correlations between variables
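The pattern summaries listed above can be tabulated in a few lines. The sketch below uses Python rather than SAS, and the five records are invented purely for illustration (None marks a missing value); the variable names are borrowed from the SAS examples later in the slides.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
rows = [
    {"maleness": 1, "years_over_20": 30,   "weight": 190},
    {"maleness": 0, "years_over_20": None, "weight": 150},
    {"maleness": 0, "years_over_20": 25,   "weight": None},
    {"maleness": 1, "years_over_20": 41,   "weight": 205},
    {"maleness": 0, "years_over_20": None, "weight": None},
]
variables = ["maleness", "years_over_20", "weight"]

# Number missing on each variable.
missing_per_variable = {v: sum(r[v] is None for r in rows) for v in variables}

# Number of missing values per subject.
missing_per_subject = [sum(r[v] is None for v in variables) for r in rows]

# Which variables are missing together: count each observed/missing pattern,
# written with 'o' for observed and '?' for missing, in the order of `variables`.
pattern_counts = Counter(
    "".join("?" if r[v] is None else "o" for v in variables) for r in rows
)

print(missing_per_variable)
print(missing_per_subject)
print(dict(pattern_counts))
```

The pattern strings also show at a glance whether the missingness is univariate, monotone, or arbitrary.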
13. Methods: Data Analysis with Missing Values
- Procedures based on completely recorded units
  - Listwise deletion (analyze only complete units)
- Imputation-based procedures
  - Single imputation
    - Hot deck imputation: use values from recorded units to substitute for missing records
    - Mean imputation: use the mean of the recorded values to substitute for missing values
    - Regression imputation: missing variables for a unit are estimated by predicted values from a regression on the known variables for that unit
  - Multiple imputation
    - For each missing value, generate multiple values to impute, based on regression or some other statistical model
- Maximum likelihood-based procedures
  - Inference is based on maximum likelihood (not covered in this class)
14. Simulated data
- Population
  - Gender: X1 = 1 if male, 0 if female
  - Age: X2 ~ N(50, 10^2)
  - Weight: Y = b0 + X1 b1 + (X2 - 20) b2 + e
  - e ~ N(0, s^2)
  - b0 = 125, b1 = 40, b2 = 1, s = 15
- Samples
  - small: n = 20, to illustrate procedures
  - large: N = 10,000, to check bias and efficiency
- Various patterns of missingness
  - X1 is completely observed
  - X2 and/or Y may have missing values
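A population like the one described above can be generated as follows. This is a sketch in Python rather than SAS. The parameter values come from the slide, but the 50/50 split between men and women and the function name `simulate` are assumptions, since the slide does not say how X1 is distributed.

```python
import random
from statistics import mean

def simulate(n, seed=0, b0=125.0, b1=40.0, b2=1.0, s=15.0):
    """Draw n cases (x1, x2, y) from Y = b0 + X1*b1 + (X2 - 20)*b2 + e."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1 = 1 if rng.random() < 0.5 else 0   # assumption: half the population is male
        x2 = rng.gauss(50, 10)                # age ~ N(50, 10^2)
        e = rng.gauss(0, s)                   # residual ~ N(0, s^2)
        y = b0 + x1 * b1 + (x2 - 20) * b2 + e
        data.append((x1, x2, y))
    return data

small = simulate(20)        # to illustrate procedures
large = simulate(10000)     # to check bias and efficiency

mean_weight_women = mean(y for x1, _, y in large if x1 == 0)
mean_weight_men = mean(y for x1, _, y in large if x1 == 1)
print(round(mean_weight_women, 1), round(mean_weight_men, 1))
```

With these parameters the expected weight is 125 + 30 = 155 for women and 195 for men at the mean age of 50, so the two printed means should land near those values.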
15. Simulated sample
- n = 20
- Women (gray) less likely to disclose
  - weight (Y)
  - age (X2)
16. Method 1: Listwise deletion
- delete all cases with missing values for Y, X1, or X2
- analyze remaining (complete) cases
- common software default

    /* In SAS */
    proc reg data=missing_weight_age;          /* Use data on right */
      model weight = maleness years_over_20;   /* Regress weight on sex and age */
    run;

    Variable        DF   Parameter Estimate   Standard Error
    Intercept        1            160.98328         22.48391
    maleness         1             70.24778         18.70752
    years_over_20    1             -0.47579          0.69354
17. Assumption: listwise deletion
- LD assumes deletion does not depend on e
  - Otherwise the e's for complete cases won't have a mean of 0
  - Y = b0 + X1 b1 + (X2 - 20) b2 + e, where e ~ N(0, s^2)
- Assumption satisfied
  - women (X1 = 0) less likely to disclose weight Y or age X2
  - deletion depends on X1
- Assumption violated
  - overweight (e > 0) less likely to disclose weight Y or age X2
  - delete mostly positive e's, leaving negative e's
  - complete cases are mostly underweight
  - Results are biased
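The two scenarios can be checked directly by looking at the mean residual among the complete cases. The following Python sketch (not SAS) makes up its own disclosure probabilities of 0.9 and 0.5; only the residual standard deviation of 15 is taken from the simulated-data slide.

```python
import random
from statistics import mean

rng = random.Random(1)

# Each case carries its sex X1 and its regression residual e ~ N(0, 15^2).
cases = []
for _ in range(10000):
    x1 = 1 if rng.random() < 0.5 else 0
    e = rng.gauss(0, 15)
    cases.append((x1, e))

def complete_case_residual_mean(p_observed):
    """Mean residual among cases that survive listwise deletion."""
    kept = [e for x1, e in cases if rng.random() < p_observed(x1, e)]
    return mean(kept)

# Assumption satisfied: women (X1 = 0) disclose less often; deletion depends on X1 only.
mean_e_depends_on_x1 = complete_case_residual_mean(
    lambda x1, e: 0.9 if x1 == 1 else 0.5
)

# Assumption violated: overweight cases (e > 0) disclose less often.
mean_e_depends_on_e = complete_case_residual_mean(
    lambda x1, e: 0.5 if e > 0 else 0.9
)

print(round(mean_e_depends_on_x1, 2), round(mean_e_depends_on_e, 2))
```

When deletion depends only on X1 the surviving residuals still average about zero. When it depends on e the average turns clearly negative, which is the bias described above.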
18. Mean imputation
- Technique
  - Calculate mean over cases that have values for Y
  - Impute this mean where Y is missing
  - Same procedure for X1, X2, etc.
- Implicit models
  - Y = mY
  - X1 = m1
  - X2 = m2
- Problems
  - ignores relationships among X and Y
  - underestimates covariances
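The shrinkage caused by mean imputation shows up in a quick simulation. This Python sketch (not SAS) uses a single predictor, a 40% MCAR missingness rate, and a hand-written `covariance` helper; all three are illustrative assumptions.

```python
import random
from statistics import mean, variance

rng = random.Random(2)
n = 5000

# One predictor (age) and the outcome (weight), loosely following the simulated-data slide.
x = [rng.gauss(50, 10) for _ in range(n)]
y = [125 + (xi - 20) + rng.gauss(0, 15) for xi in x]

# Make 40% of Y missing completely at random (assumed rate).
y_obs = [None if rng.random() < 0.4 else yi for yi in y]

# Mean imputation: fill every hole with the mean of the recorded values.
y_mean = mean(v for v in y_obs if v is not None)
y_imputed = [y_mean if v is None else v for v in y_obs]

def covariance(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

var_true, var_imp = variance(y), variance(y_imputed)
cov_true, cov_imp = covariance(x, y), covariance(x, y_imputed)
print(round(var_true), round(var_imp), round(cov_true), round(cov_imp))
```

Every imputed case sits exactly at the mean, so it contributes nothing to the variance of Y or to its covariance with X. Both are shrunk by roughly the fraction of values that were imputed.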
19. Regression imputation
- Technique and implicit models
  - If Y is missing
    - impute mean of cases with similar values for X1, X2
    - Y = b0 + X1 b1 + X2 b2
  - Likewise, if X2 is missing
    - impute mean of cases with similar values for X1, Y
    - X2 = g0 + X1 g1 + Y g2
  - If both Y and X2 are missing
    - impute means of cases with similar values for X1
    - Y = d0 + X1 d1
    - X2 = f0 + X1 f1
- Problem
  - Ignores random components (no e)
  - So it underestimates variances and standard errors (SEs)
20. Single random imputation
- Like regression imputation
  - but imputed value includes a random residual
- Implicit models
  - If Y is missing
    - Y = b0 + X1 b1 + X2 b2 + eY.12
  - Likewise, if X2 is missing
    - X2 = g0 + X1 g1 + Y g2 + e2.1Y
  - If both Y and X2 are missing
    - Y = d0 + X1 d1 + eY.1
    - X2 = f0 + X1 f1 + e2.1
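The difference between plain regression imputation and single random imputation can be seen by comparing the variance of the completed Y. This is a simplified Python sketch (not SAS): it uses one predictor instead of X1 and X2, a 40% MCAR rate, and a hand-written least-squares helper `fit_line`, all of which are assumptions for illustration.

```python
import random
from statistics import mean, variance

rng = random.Random(3)
n = 5000
x = [rng.gauss(50, 10) for _ in range(n)]
y = [125 + (xi - 20) + rng.gauss(0, 15) for xi in x]
missing = [rng.random() < 0.4 for _ in range(n)]   # MCAR, assumed 40% rate

x_obs = [xi for xi, m in zip(x, missing) if not m]
y_obs = [yi for yi, m in zip(y, missing) if not m]

def fit_line(xs, ys):
    """Least-squares intercept, slope, and residual SD for y = a + b*x."""
    mx, my = mean(xs), mean(ys)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sum(
        (a - mx) ** 2 for a in xs
    )
    intercept = my - slope * mx
    ss_resid = sum((b - intercept - slope * a) ** 2 for a, b in zip(xs, ys))
    return intercept, slope, (ss_resid / (len(xs) - 2)) ** 0.5

b0, b1, resid_sd = fit_line(x_obs, y_obs)

# Regression imputation: predicted value only (no e).
y_reg = [b0 + b1 * xi if m else yi for xi, yi, m in zip(x, y, missing)]

# Single random imputation: predicted value plus a random residual.
y_rand = [
    b0 + b1 * xi + rng.gauss(0, resid_sd) if m else yi
    for xi, yi, m in zip(x, y, missing)
]

print(round(variance(y)), round(variance(y_reg)), round(variance(y_rand)))
```

Imputing predictions alone pulls the variance of Y down, while adding a random residual restores it to about the right level. That fixes the variance of the data, but not yet the standard errors of the estimates.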
21. Problem with single imputation
- Still underestimates SEs!
- treats imputed values like observed values
- when they are actually less certain
- ignores imputation variation
22. Imputation variation
- Sampling variation
  - If you take a different sample
  - you get different parameter estimates
  - Standard errors reflect this
- One way to estimate sampling variation
  - measure variation across multiple samples
  - called bootstrapping
- Imputation variation
  - If you impute different values
  - you get different parameter estimates
  - Standard errors should reflect this, too
- One way to estimate imputation variation
  - measure variation across multiple imputed data sets
  - called multiple imputation
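As a reminder of how bootstrapping measures sampling variation, here is a minimal Python sketch (not SAS). The sample of 200 weights, the 1000 resamples, and the function name `bootstrap_se` are all illustrative assumptions.

```python
import random
from statistics import mean, stdev

rng = random.Random(5)

# Hypothetical sample of 200 weights.
sample = [rng.gauss(155, 15) for _ in range(200)]

def bootstrap_se(data, n_resamples=1000):
    """Standard error of the mean, estimated from resamples drawn with replacement."""
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(data) for _ in data]
        means.append(mean(resample))
    return stdev(means)

se_boot = bootstrap_se(sample)
se_formula = stdev(sample) / len(sample) ** 0.5   # textbook SE of a mean
print(round(se_boot, 2), round(se_formula, 2))
```

The spread of the estimate across resampled data sets approximates its standard error. Multiple imputation applies the same idea to the spread of the estimate across imputed data sets.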
23. Multiple imputation
- Case 1 is missing weight
  - Given case 1's sex and age
  - and relationships in other cases
  - generate a plausible distribution for case 1's weight
- At random, sample 5 (or more) plausible weights for case 1
  - Yes! Impute Y!
- For case 6, sample from the conditional distribution of age.
  - Yes! Use Y to impute X!
- For case 7, sample from the conditional bivariate distribution of age and weight
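The idea of drawing several plausible values for each missing weight can be sketched as follows. This is Python, not SAS, and it is not what PROC MI does internally: here the uncertainty in the regression coefficients is approximated by refitting on a bootstrap resample of the observed cases before each imputation, which is an assumption of this sketch. The single predictor, the 30% MCAR rate, and the helper names are also illustrative.

```python
import random
from statistics import mean, variance

rng = random.Random(4)
n = 200
x = [rng.gauss(50, 10) for _ in range(n)]
y = [125 + (xi - 20) + rng.gauss(0, 15) for xi in x]
y_obs = [None if rng.random() < 0.3 else yi for yi in y]   # MCAR, assumed 30% rate

observed = [(xi, yi) for xi, yi in zip(x, y_obs) if yi is not None]

def fit_line(pairs):
    """Least-squares intercept, slope, and residual SD for y = a + b*x."""
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    mx, my = mean(xs), mean(ys)
    slope = sum((a - mx) * (b - my) for a, b in pairs) / sum((a - mx) ** 2 for a in xs)
    intercept = my - slope * mx
    ss_resid = sum((b - intercept - slope * a) ** 2 for a, b in pairs)
    return intercept, slope, (ss_resid / (len(pairs) - 2)) ** 0.5

def impute_once():
    """One completed data set: refit on a bootstrap resample, then add random residuals."""
    boot = [rng.choice(observed) for _ in observed]
    b0, b1, sd = fit_line(boot)
    return [
        b0 + b1 * xi + rng.gauss(0, sd) if yi is None else yi
        for xi, yi in zip(x, y_obs)
    ]

M = 5
imputed_sets = [impute_once() for _ in range(M)]

# Analyze each completed data set separately (here: the mean weight and its squared SE).
estimates = [mean(d) for d in imputed_sets]
sampling_vars = [variance(d) / n for d in imputed_sets]
print([round(v, 1) for v in estimates])
```

Observed values are identical across the M data sets; only the imputed ones differ, and the spread of the M estimates reflects that.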
24. Multiple imputation
- We impute these plausible values, creating 5 versions of the data set: multiple imputations

    proc mi data=missing_weight_age        /* Input data missing_weight_age */
            out=weight_age_mi;             /* Output 5 imputed data sets into weight_age_mi (left) */
      var years_over_20 weight maleness;   /* These variables have missing values or can help impute them. */
    run;

http://support.sas.com/rnd/app/papers/mi.pdf
25. Multiple imputation
- For each imputed data set, estimate
  - parameters (white)
  - sampling variances and covariances (gray)

    proc reg                                   /* Regression */
        data=weight_age_mi                     /* Input MI data */
        outest=parameters covout;              /* Output data set called parameters (right) */
      model weight = maleness years_over_20;   /* Regress weight on sex and age */
      by _imputation_;                         /* Separate analysis for each imputed data set */
    run;
26. Sampling variation vs. imputation variation
- Over the 5 analyses,
  - Mean(b0) estimates b0
  - Mean(s^2_b0) estimates the variance in b0 due to sampling, where s^2_b0 is the squared standard error of b0 in each analysis
  - Var(b0) estimates the variance in b0 due to imputation
27. MI standard errors
- Total variance in b0
  - = variation due to sampling + variation due to imputation
  - = Mean(s^2_b0) + Var(b0)
- Actually, there's a correction factor of (1 + 1/M) for the number of imputations M. (Here M = 5.)
- So total variance in estimating b0 is
  - Mean(s^2_b0) + (1 + 1/M) Var(b0) = 179.53 + (1.2)(511.59) = 793.44
- Standard error is sqrt(793.44) = 28.17
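The arithmetic above can be reproduced in a few lines. This Python sketch (not SAS) plugs in the two variance components quoted on the slide. The general-purpose helper `pool` is a hypothetical name, and it assumes the per-imputation estimates and squared standard errors are available as lists.

```python
from math import sqrt
from statistics import mean, variance

def pool(estimates, squared_ses):
    """Combine M per-imputation results into one estimate, total variance, and SE."""
    m = len(estimates)
    within = mean(squared_ses)       # variance due to sampling
    between = variance(estimates)    # variance due to imputation
    total = within + (1 + 1 / m) * between
    return mean(estimates), total, sqrt(total)

# Numbers quoted on the slide for the intercept b0, with M = 5 imputations.
M = 5
within = 179.53     # Mean(s^2_b0)
between = 511.59    # Var(b0)
total = within + (1 + 1 / M) * between
se = sqrt(total)
print(round(total, 2), round(se, 2))   # prints 793.44 28.17
```

Given the five intercept estimates and their squared standard errors, `pool(estimates, squared_ses)` would return the pooled estimate together with the same total variance and standard error.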
28. MI estimates in SAS

    Multiple Imputation Parameter Estimates

    Parameter         Estimate   Std Error   95% Confidence Limits        DF
    intercept       178.564526   28.168160    70.58553    286.5435    2.2804
    maleness         67.110037   14.801696    21.52721    112.6929    3.1866
    years_over_20    -0.960283    0.819559    -3.57294      1.6524    2.991

    proc mianalyze data=parameters;          /* Synthesize parameters from 5 separate regressions */
      var intercept maleness years_over_20;  /* Variables with slopes of interest */
    run;

http://support.sas.com/rnd/app/papers/mianalyze.pdf
30. Methods: Data Analysis with Missing Values
- Procedures based on completely recorded units
  - Listwise deletion (analyze only complete units)
    - Good for MCAR
- Imputation-based procedures
  - Single imputation
    - Hot deck imputation: Bad
    - Mean imputation: Bad
    - Regression imputation: Better
  - Multiple imputation
    - Best
- Maximum likelihood-based procedures
  - Best