Data Analysis With Missing Values May 18

1
Data Analysis With Missing Values: May 18, May 23

Dingcai Cao
d-cao_at_uchicago.edu

2
Missing Values

Missing data are observations that we intended to
make but did not.
For example, an individual may respond to only
certain questions in a survey, or may not respond
at all to a particular wave of a longitudinal
survey.
Goal: make inferences that apply to the
population targeted by the complete sample, i.e.
the goal remains what it would have been had we
seen the complete data.
Why do we care?
Missing data are common.
However, they are usually inadequately handled in
both epidemiological and experimental research.
For example, Wood et al. (2004) reviewed 71
recently published BMJ, JAMA, Lancet and NEJM
papers. 89% had partly missing outcome data. In
37 trials with repeated outcome measures, 46%
performed complete case analysis.

3
Missing Pattern

In practice the data consist of (a) the
observations actually made (where '?' denotes a
missing observation)
and (b) the pattern of missing values

4
Missing Pattern: Univariate Pattern
5
Missing Pattern: Monotonic Pattern
6
Missing Pattern: Arbitrary Pattern
7
Missing mechanisms

Missing Completely At Random (MCAR)
Missing At Random (MAR)
Missing Not At Random (MNAR)
Knowing the missing-data mechanism is critical for
data analysis.

8
Missing Completely at Random (MCAR)

Suppose some data are missing on Y.
These data are said to be MCAR if the probability
that Y is missing is unrelated to the value of Y
or to other variables X.
e.g. a simple random sample of the Ys is missing
Pr(Y is missing | X, Y) = Pr(Y is missing)
MCAR is the best situation to be in. Under MCAR, the
analysis of only those units with complete data
gives valid inferences.

9
Missing At Random (MAR)

Data on Y are Missing At Random (MAR) if the
probability that Y is missing does not depend on
the value of Y, after controlling for other
observed variables X.
e.g. MCAR within strata
Pr(Y is missing | X, Y) = Pr(Y is missing | X)
Much weaker assumption than MCAR
Can test whether missingness on Y depends on X
Practically the same as ignorable missingness,
i.e. no need to model the missing data mechanism
in the analysis of the data

10
Missing Not At Random (MNAR)

When neither MCAR nor MAR holds, we say the data
are Missing Not At Random, abbreviated MNAR. In
the likelihood setting (see end of previous
section) the missingness mechanism is termed
non-ignorable.
What this means:
Under MNAR, the missing data mechanism must be
modeled to get good parameter estimates
This requires good prior knowledge about the causes
of missingness
The data contain no information on which model
would be appropriate
There is no way to test the goodness of fit of the
missing data model
Results may be very sensitive to the choice of model
Sensitivity analysis is critical to the data
analysis process

11
Missing Mechanisms: Example

X = age, Y = income
If the probability of recording income is the
same for all individuals, regardless of their age
or income, then the data are missing completely
at random (MCAR).
If the probability of recording income varies
according to the age of the respondent but does
not vary according to the income of respondents
within an age group, then the data are missing at
random (MAR).
If the probability of recording income varies
according to income within each age group, then
the data are neither MCAR nor MAR (MNAR).
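
The age and income example can be checked with a small simulation. The sketch below is in Python rather than the SAS used on the slides, and every number in it (two age groups, group mean incomes of 40 and 60, recording probabilities of 0.5, 0.7 and 0.9) is a made-up assumption chosen only to make the three mechanisms visible. The function `recorded_income_bias` is hypothetical.

```python
import random
import statistics

def recorded_income_bias(mechanism, n=20000, seed=0):
    """Return (overall bias, bias within the older group) of mean recorded income.

    All distributions and probabilities are illustrative assumptions.
    """
    rng = random.Random(seed)
    all_y, rec_y, all_old, rec_old = [], [], [], []
    for _ in range(n):
        old = rng.random() < 0.5                 # X: age group
        group_mean = 60.0 if old else 40.0
        income = rng.gauss(group_mean, 10.0)     # Y: income
        if mechanism == "MCAR":
            p = 0.7                              # same for everyone
        elif mechanism == "MAR":
            p = 0.9 if old else 0.5              # depends on age only
        elif mechanism == "MNAR":
            p = 0.9 if income < group_mean else 0.5  # depends on income within age group
        else:
            raise ValueError(mechanism)
        recorded = rng.random() < p
        all_y.append(income)
        if old:
            all_old.append(income)
        if recorded:
            rec_y.append(income)
            if old:
                rec_old.append(income)
    return (statistics.mean(rec_y) - statistics.mean(all_y),
            statistics.mean(rec_old) - statistics.mean(all_old))

for name in ("MCAR", "MAR", "MNAR"):
    overall, within_old = recorded_income_bias(name)
    print(name, round(overall, 2), round(within_old, 2))
```

Under these assumptions the recorded mean should be close to the full-sample mean under MCAR, off overall but close within an age group under MAR, and off even within an age group under MNAR.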

12
Methods: Preliminary Analysis

Missing data pattern
Amount missing on each variable
Number of missing values per subject
Variables that are missing together
Correlations between variables
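
The first three checks above can be tabulated in a few lines. This is a minimal Python sketch, not part of the original slides; it assumes each subject is a dictionary with `None` marking a missing value, and the helper name `missing_summary` and the toy rows are hypothetical.

```python
from collections import Counter

def missing_summary(rows):
    """Count missing values per variable, per subject, and by joint pattern."""
    variables = list(rows[0].keys())
    per_variable = {v: sum(r[v] is None for r in rows) for v in variables}
    per_subject = [sum(r[v] is None for v in variables) for r in rows]
    # A pattern is the tuple of variables that are missing together in one row.
    patterns = Counter(tuple(v for v in variables if r[v] is None) for r in rows)
    return per_variable, per_subject, patterns

rows = [
    {"Y": 150.0, "X1": 0, "X2": 45.0},
    {"Y": None,  "X1": 0, "X2": 51.0},
    {"Y": 190.0, "X1": 1, "X2": None},
    {"Y": None,  "X1": 0, "X2": None},
]
per_variable, per_subject, patterns = missing_summary(rows)
print(per_variable)   # missing count for each variable
print(per_subject)    # missing count for each subject
print(patterns)       # which variables are missing together, and how often
```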

13
Methods: Data Analysis with Missing Values

Procedures based on completely recorded units
Listwise deletion (analyze only complete units)
Imputation-based procedures
Single imputation
Hot deck imputation: use recorded units to
substitute for missing records
Mean imputation: use the mean of the recorded
values to substitute for missing values
Regression imputation: missing variables for a
unit are estimated by predicted values from a
regression on the known variables for that unit
Multiple imputation
For each missing value, generate multiple values
to impute, based on regression or some other
statistical model
Maximum likelihood-based procedures
Inference is based on maximum likelihood (not
covered in this class)

14
Simulated data

Population
Gender: X1 = 1 if male, 0 if female
Age: X2 ~ N(50, 10^2)
Weight: Y = b0 + X1 b1 + (X2 - 20) b2 + e
e ~ N(0, s^2)
b0 = 125, b1 = 40, b2 = 1, s = 15
Samples
small: n = 20, to illustrate procedures
large: N = 10,000, to check bias and efficiency
Various patterns of missingness
X1 is completely observed
X2 and/or Y may have missing values
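
For readers who want to regenerate data of this kind, here is a minimal Python sketch of the population model (the slides themselves use SAS). The parameter values are those listed above; the 50/50 split between men and women is an assumption, since the slides do not state it, and the function name `simulate_population` is hypothetical. The output column names follow the SAS variable names used later in the slides.

```python
import random

def simulate_population(n, seed=0, b0=125.0, b1=40.0, b2=1.0, s=15.0):
    """Draw n complete cases from Y = b0 + X1 b1 + (X2 - 20) b2 + e."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = 1 if rng.random() < 0.5 else 0   # assumed: half the population is male
        x2 = rng.gauss(50, 10)                # age ~ N(50, 10^2)
        e = rng.gauss(0, s)                   # e ~ N(0, s^2)
        y = b0 + x1 * b1 + (x2 - 20) * b2 + e
        rows.append({"maleness": x1, "years_over_20": x2 - 20, "weight": y})
    return rows

small = simulate_population(20)               # to illustrate procedures
large = simulate_population(10000, seed=1)    # to check bias and efficiency
mean_weight = sum(r["weight"] for r in large) / len(large)
print(round(mean_weight, 1))
```

With these assumptions the expected weight is 125 + 0.5(40) + 30(1) = 175, so the large-sample mean should land near that value.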

15
Simulated sample

n = 20
Women (gray) are less likely to disclose
weight (Y)
age (X2)

16
Method 1: Listwise deletion

delete all cases with missing values for Y, X1,
or X2
analyze remaining (complete) cases
common software default

/* In SAS */
proc reg data=missing_weight_age;          /* Use data on right */
  model weight = maleness years_over_20;   /* Regress weight on sex and age */
run;

                      Parameter    Standard
Variable         DF    Estimate       Error
Intercept         1   160.98328    22.48391
maleness          1    70.24778    18.70752
years_over_20     1    -0.47579     0.69354
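
Outside SAS, the deletion step itself is a one-line filter. A minimal Python sketch, assuming rows are dictionaries with `None` for missing values; the helper `listwise_delete` and the four toy rows are hypothetical and are not the slide's n = 20 sample.

```python
def listwise_delete(rows):
    """Keep only rows in which every variable is recorded (not None)."""
    return [row for row in rows if all(v is not None for v in row.values())]

sample = [
    {"weight": 150.0, "maleness": 0, "years_over_20": 25.0},
    {"weight": None,  "maleness": 0, "years_over_20": 31.0},
    {"weight": 190.0, "maleness": 1, "years_over_20": None},
    {"weight": 205.0, "maleness": 1, "years_over_20": 40.0},
]
complete = listwise_delete(sample)
print(len(complete), "of", len(sample), "cases are complete")
```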
17
Assumption: listwise deletion

LD assumes deletion does not depend on e
Otherwise the e's for complete cases won't have a
mean of 0
Y = b0 + X1 b1 + (X2 - 20) b2 + e, where e ~ N(0, s^2)
Assumption satisfied
women (X1 = 0) less likely to disclose weight Y or
age X2
deletion depends on X1
Assumption violated
overweight (e > 0) less likely to disclose weight Y
or age X2
delete mostly positive e's, leaving negative e's
complete cases are mostly underweight
Results are biased
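
The two scenarios can be compared by looking at the mean of e among the complete cases. This Python sketch is an illustration only: s = 15 comes from the simulated-data slide, while the deletion probabilities (0.2 and 0.6) and the 50/50 sex split are assumptions, and the function name is hypothetical.

```python
import random
import statistics

def mean_e_of_complete_cases(rule, n=20000, seed=1):
    """Mean of the residual e among cases that survive listwise deletion."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n):
        male = rng.random() < 0.5          # X1, assumed 50/50
        e = rng.gauss(0, 15)               # e ~ N(0, s^2) with s = 15
        if rule == "depends_on_x1":
            p_delete = 0.2 if male else 0.6    # women disclose less often
        elif rule == "depends_on_e":
            p_delete = 0.6 if e > 0 else 0.2   # overweight disclose less often
        else:
            raise ValueError(rule)
        if rng.random() >= p_delete:
            kept.append(e)
    return statistics.mean(kept)

print(round(mean_e_of_complete_cases("depends_on_x1"), 2))  # should be near 0
print(round(mean_e_of_complete_cases("depends_on_e"), 2))   # should be clearly negative
```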

18
Mean imputation

Technique
Calculate mean over cases that have values for Y
Impute this mean where Y is missing
Same procedure for X1, X2, etc.
Implicit models
Y = mY
X1 = m1
X2 = m2
Problems
ignores relationships among X and Y
underestimates covariances
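
A tiny worked case makes the covariance problem concrete. This is a Python sketch with made-up numbers; `mean_impute` and `covariance` are hypothetical helpers written for this illustration.

```python
import statistics

def mean_impute(values):
    """Replace each None with the mean of the recorded values."""
    recorded = [v for v in values if v is not None]
    m = statistics.mean(recorded)
    return [m if v is None else v for v in values]

def covariance(xs, ys):
    """Sample covariance of two equally long lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, None, 3.0, None, 5.0]     # two values of Y are missing

pairs = [(a, b) for a, b in zip(x, y) if b is not None]
cov_recorded = covariance([a for a, _ in pairs], [b for _, b in pairs])
cov_imputed = covariance(x, mean_impute(y))
print(cov_recorded, cov_imputed)
```

In this toy data the covariance among recorded pairs is 4, and it drops to 2 once the two missing Y values are replaced by the mean, because the imputed cases contribute nothing to the cross-product.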

19
Regression imputation

Technique and implicit models
If Y is missing
impute mean of cases with similar values for X1,
X2
Y = b0 + X1 b1 + X2 b2
Likewise, if X2 is missing
impute mean of cases with similar values for X1,
Y
X2 = g0 + X1 g1 + Y g2
If both Y and X2 are missing
impute means of cases with similar values for X1
Y = d0 + X1 d1
X2 = f0 + X1 f1
Problem
Ignores random components (no e)
So it underestimates variances and SEs

20
Single random imputation

Like regression imputation
but imputed value includes a random residual
Implicit models
If Y is missing
Y = b0 + X1 b1 + X2 b2 + eY.12
Likewise, if X2 is missing
X2 = g0 + X1 g1 + Y g2 + e2.1Y
If both Y and X2 are missing
Y = d0 + X1 d1 + eY.1
X2 = f0 + X1 f1 + e2.1
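
The difference between regression imputation and single random imputation is only whether a residual is added to the fitted value. The Python sketch below shows both; as a simplifying assumption it uses a single predictor x instead of X1 and X2, and the names `fit_line` and `regression_impute` are hypothetical.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for y = a + b*x."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return a, b, sd

def regression_impute(xs, ys, rng=None):
    """Fill None entries of ys from the regression of y on x.

    rng=None imputes the fitted value (regression imputation);
    passing a random.Random adds a random residual (single random imputation).
    """
    obs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b, sd = fit_line([x for x, _ in obs], [y for _, y in obs])
    out = []
    for x, y in zip(xs, ys):
        if y is None:
            noise = rng.gauss(0, sd) if rng is not None else 0.0
            y = a + b * x + noise
        out.append(y)
    return out

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, None, None]
deterministic = regression_impute(xs, ys)
stochastic = regression_impute(xs, ys, rng=random.Random(0))
print(deterministic)
print(stochastic)
```

The deterministic version places every imputed value exactly on the fitted line, which is why it understates the spread of Y; the stochastic version restores that spread but, as the next slide notes, still treats the imputed values as if they had been observed.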

21
Problem with single imputation

Still underestimates SEs!
treats imputed values like observed values
when they are actually less certain
ignores imputation variation

22
Imputation variation

Sampling variation
If you take a different sample
you get different parameter estimates
Standard errors reflect this
One way to estimate sampling variation
measure variation across multiple samples
called bootstrapping
Imputation variation
If you impute different values
you get different parameter estimates
Standard errors should reflect this, too
One way to estimate imputation variation
measure variation across multiple imputed data
sets
called multiple imputation
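
As a point of comparison for the sampling-variation half of this slide, here is a minimal Python sketch of a bootstrap standard error for a sample mean. The data are made up, and the function `bootstrap_se_of_mean` is hypothetical; the slides do not prescribe a bootstrap procedure.

```python
import random
import statistics

def bootstrap_se_of_mean(data, n_boot=2000, seed=0):
    """SD of the mean across resamples drawn with replacement from data."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in data]
        means.append(statistics.mean(resample))
    return statistics.stdev(means)

data = [float(i) for i in range(1, 51)]             # made-up sample of 50 values
boot_se = bootstrap_se_of_mean(data)
formula_se = statistics.stdev(data) / len(data) ** 0.5
print(round(boot_se, 2), round(formula_se, 2))
```

The two numbers should be close: variation across resamples estimates the same quantity as the usual standard-error formula. Multiple imputation applies the same idea to variation across imputed data sets.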

23
Multiple imputation

Case 1 is missing weight
Given case 1's sex and age
and relationships in other cases
generate a plausible distribution for case 1's weight

At random, sample 5 (or more) plausible weights
for case 1
Yes! Impute Y!
For case 6, sample from the conditional distribution
of age.
Yes! Use Y to impute X!
For case 7, sample from the conditional bivariate
distribution of age and weight

24
Multiple imputation

We impute these plausible values, creating 5
versions of the data set: multiple imputations

proc mi data=missing_weight_age      /* Input data missing_weight_age */
        out=weight_age_mi;           /* Output 5 imputed data sets into weight_age_mi (left) */
  var years_over_20 weight maleness; /* These variables have missing values or can help impute them. */
run;
http://support.sas.com/rnd/app/papers/mi.pdf
25
Multiple imputation

For each imputed data set, estimate
parameters (white)
sampling variances and covariances (gray)

proc reg                                  /* Regression */
    data=weight_age_mi                    /* Input MI data */
    outest=parameters covout;             /* Output data set called parameters (right) */
  model weight = maleness years_over_20;  /* Regress weight on sex and age */
  by _imputation_;                        /* Separate analysis for each imputed data set */
run;
26
Sampling variation vs. imputation variation

Over the 5 analyses,
Mean(b0) estimates b0
Mean(s^2_b0) estimates the variance in b0 due to
sampling
Var(b0) estimates the variance in b0 due to
imputation
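
The three quantities can be computed by hand on a toy problem. The Python sketch below is a deliberate simplification and not the PROC MI algorithm: the parameter of interest is the mean of y instead of a regression intercept, and each missing value is drawn from a normal distribution with the recorded mean and SD, ignoring predictors and parameter uncertainty. The function name and the data are hypothetical.

```python
import random
import statistics

def multiply_impute_mean(y, m=5, seed=0):
    """Return (Mean(b), Mean(s^2_b), Var(b)) where b is the mean of y."""
    rng = random.Random(seed)
    recorded = [v for v in y if v is not None]
    mu, sd = statistics.mean(recorded), statistics.stdev(recorded)
    estimates, sampling_variances = [], []
    for _ in range(m):
        # One imputed data set: fill each gap with a random plausible value.
        completed = [rng.gauss(mu, sd) if v is None else v for v in y]
        estimates.append(statistics.mean(completed))
        sampling_variances.append(statistics.variance(completed) / len(completed))
    return (statistics.mean(estimates),           # point estimate
            statistics.mean(sampling_variances),  # variance due to sampling
            statistics.variance(estimates))       # variance due to imputation

y = [150.0, None, 172.0, 181.0, None, 160.0, 199.0, None, 175.0, 168.0]
point, within, between = multiply_impute_mean(y)
print(round(point, 1), round(within, 1), round(between, 1))
```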

27
MI standard errors

Total variance in b0
= variation due to sampling + variation due to
imputation
= Mean(s^2_b0) + Var(b0)
Actually, there's a correction factor of (1 + 1/M)
for the number of imputations M. (Here M = 5.)
So total variance in estimating b0 is
Mean(s^2_b0) + (1 + 1/M) Var(b0) = 179.53 + (1.2)(511.59)
= 793.44
Standard error is sqrt(793.44) = 28.17
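
The arithmetic on this slide can be reproduced directly. A minimal Python sketch: `total_variance` applies the formula to the slide's two summary numbers, and `pool` does the same starting from per-imputation estimates and variances. Both function names are hypothetical, and the inputs to `pool` in the example are made up.

```python
import math
import statistics

def total_variance(within, between, m):
    """Mean(s^2_b0) + (1 + 1/M) Var(b0)."""
    return within + (1 + 1 / m) * between

def pool(estimates, sampling_variances):
    """Combine M per-imputation results into (estimate, total variance, SE)."""
    m = len(estimates)
    within = statistics.mean(sampling_variances)   # variance due to sampling
    between = statistics.variance(estimates)       # variance due to imputation
    total = total_variance(within, between, m)
    return statistics.mean(estimates), total, math.sqrt(total)

# Numbers from the slide: Mean(s^2_b0) = 179.53, Var(b0) = 511.59, M = 5.
t = total_variance(179.53, 511.59, 5)
print(round(t, 2), round(math.sqrt(t), 2))

# Made-up per-imputation inputs, only to show the calling convention.
print(pool([1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))
```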

28
MI estimates in SAS
Multiple Imputation Parameter Estimates

Parameter          Estimate   Std Error   95% Confidence Limits       DF
intercept        178.564526   28.168160    70.58553    286.5435   2.2804
maleness          67.110037   14.801696    21.52721    112.6929   3.1866
years_over_20     -0.960283    0.819559    -3.57294      1.6524   2.991

proc mianalyze data=parameters;          /* Synthesize parameters from 5 separate regressions */
  var intercept maleness years_over_20;  /* Variables with slopes of interest */
run;
http://support.sas.com/rnd/app/papers/mianalyze.pdf
30
Methods: Data Analysis with Missing Values