1
Data analysis with missing values
http://www.sociology.ohio-state.edu/people/ptv/faq/missing/missing.ppt

Ohio State University
Department of Sociology brownbag
Paul T. von Hippel
May 2, 2003

2
Missing values

Common in social research
nonresponse, loss to follow-up
lack of overlap between linked data sets
social processes
dropping out of school, graduation, etc.
survey design
skip patterns between respondents

3
Methods

Always bad methods
Mean (median, mode) imputation
Pairwise deletion a.k.a. available case analysis
Dummy variable adjustment
Often good methods
Listwise deletion (LD) a.k.a. complete case analysis
Multiple imputation (MI)
(Full information) maximum likelihood (ML)

4
Simulated data

Population
maleness X1 = 1 if male, 0 if female
age X2 ~ N(50, 10^2)
weight Y = b0 + X1 b1 + (X2 - 20) b2 + e
e ~ N(0, s^2)
b0 = 125, b1 = 40, b2 = 1, s = 15
Samples
small n = 20 to illustrate procedures
large N = 10,000 to check bias and efficiency
Various patterns of missingness
X1 is completely observed
X2 and/or Y may have missing values
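
The population above can be sketched in a few lines of code. This is an illustration in Python rather than the SAS used on the slides; the function name simulate, the dictionary keys, and the 50/50 sex split are assumptions, since the slide does not say what share of the population is male.

```python
import random
import statistics

def simulate(n, seed=0):
    """Draw n cases from the population described on the slide."""
    rng = random.Random(seed)
    b0, b1, b2, s = 125.0, 40.0, 1.0, 15.0    # parameter values from the slide
    rows = []
    for _ in range(n):
        x1 = 1 if rng.random() < 0.5 else 0   # assumption: half the population is male
        x2 = rng.gauss(50.0, 10.0)            # age X2 ~ N(50, 10^2)
        e = rng.gauss(0.0, s)                 # e ~ N(0, s^2)
        y = b0 + x1 * b1 + (x2 - 20.0) * b2 + e
        rows.append({"maleness": x1, "age": x2, "weight": y})
    return rows

small = simulate(20, seed=1)       # n = 20, to illustrate procedures
large = simulate(10_000, seed=2)   # N = 10,000, to check bias and efficiency

men = [r["weight"] for r in large if r["maleness"] == 1]
women = [r["weight"] for r in large if r["maleness"] == 0]
# The gap between the two group means should be close to b1 = 40.
print(round(statistics.mean(men) - statistics.mean(women), 1))
```

No values are missing yet; the missingness patterns would be applied to these complete rows afterwards.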

5
Simulated sample

n = 20
Women (gray) less likely to disclose weight (Y) and age (X2)

6
Method 1: Listwise deletion

delete all cases with missing values for Y, X1, or X2
analyze remaining (complete) cases
common software default

/* In SAS */
proc reg data=missing_weight_age;          /* Use data on right */
  model weight = maleness years_over_20;   /* Regress weight on sex and age */
run;

Variable         DF   Parameter Estimate   Standard Error
Intercept         1            160.98328         22.48391
maleness          1             70.24778         18.70752
years_over_20     1             -0.47579          0.69354

7
Myths about listwise deletion

Myth
LD always inefficient
Fact
LD efficient if only Y is missing
Myth
LD biased unless cases are deleted at random
Fact
LD unbiased unless deletion depends on e

8
Assumption: listwise deletion

LD assumes deletion does not depend on e
Otherwise the e's for complete cases won't have a mean of 0
Y = b0 + X1 b1 + (X2 - 20) b2 + e, where e ~ N(0, s^2)
Assumption satisfied
women (X1 = 0) less likely to disclose weight Y or age X2
deletion depends on X1
Assumption violated
overweight (e > 0) less likely to disclose weight Y or age X2
delete mostly positive e's, leaving negative e's
complete cases are mostly underweight
Results are biased
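
A small simulation can make the two cases concrete. The sketch below is in Python, not the SAS of the slides, and the deletion probabilities (0.6 and 0.2) are illustrative assumptions. It compares the mean of e among the complete cases when deletion depends on X1 and when it depends on e.

```python
import random
import statistics

rng = random.Random(1)
cases = []
for _ in range(10_000):
    x1 = 1 if rng.random() < 0.5 else 0   # assumption: half the cases are male
    e = rng.gauss(0.0, 15.0)              # e ~ N(0, s^2) with s = 15
    cases.append((x1, e))

# Assumption satisfied: women (X1 = 0) are deleted more often; e plays no role.
# The probabilities 0.6 and 0.2 are illustrative, not taken from the slides.
kept_x1 = [e for x1, e in cases if rng.random() > (0.6 if x1 == 0 else 0.2)]

# Assumption violated: the overweight (e > 0) are deleted more often.
kept_e = [e for x1, e in cases if rng.random() > (0.6 if e > 0 else 0.2)]

mean_e_x1 = statistics.mean(kept_x1)   # should stay close to 0
mean_e_e = statistics.mean(kept_e)     # should be clearly negative
print(round(mean_e_x1, 2), round(mean_e_e, 2))
```

In the second case the complete cases carry mostly negative e's, which is the source of the bias described above.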

9
N.B. Assumptions relate to model

If model neglects sex and age
Y = m + e, where e ~ N(0, s^2)
then sex X1 is in e, and women's nondisclosure causes bias
More simply
Complete cases mostly men (e > 0)
m will be overestimated

10
Method 2: Multiple imputation

Steps to multiple imputation (bottom step first)
a. mean imputation
b. conditional mean imputation
c. random imputation
d. MULTIPLE IMPUTATION (best)
(Image: Thomas Kinkade, Stairway to Paradise)

11
a. Mean imputation

Technique
Calculate mean over cases that have values for Y
Impute this mean where Y is missing
Ditto for X1, X2, etc.
Implicit models
Y = mY
X1 = m1
X2 = m2
Problems
ignores relationships among X and Y
underestimates covariances
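
The shrinkage can be seen on a toy example. The sketch below uses Python and six made-up cases (the ages and weights are hypothetical, not from the presentation). It imputes the observed mean of Y and compares the variance and covariance before and after.

```python
def mean(values):
    return sum(values) / len(values)

def cov(x, y):
    """Sample covariance with an n - 1 divisor; cov(v, v) is the variance."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

# Hypothetical toy data: weight is missing (None) for two of six cases.
age = [25, 30, 35, 40, 45, 50]
weight = [130, None, 150, 155, None, 175]

observed = [(a, w) for a, w in zip(age, weight) if w is not None]
obs_age = [a for a, _ in observed]
obs_weight = [w for _, w in observed]

fill = mean(obs_weight)   # mean over the cases that have values for Y
imputed = [fill if w is None else w for w in weight]

var_before = cov(obs_weight, obs_weight)   # complete cases only
var_after = cov(imputed, imputed)          # after mean imputation
cov_before = cov(obs_age, obs_weight)
cov_after = cov(age, imputed)
print(fill, var_before, var_after, cov_before, cov_after)
```

Each imputed case sits exactly at the mean of Y, so it adds nothing to the sums of squares or cross-products while still increasing the divisor. That is why both the variance and the covariance shrink.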

12
b. Conditional mean imputation

Technique and implicit models
If Y is missing
impute mean of cases with similar values for X1, X2
Y = b0 + X1 b1 + X2 b2
Likewise, if X2 is missing
impute mean of cases with similar values for X1, Y
X2 = g0 + X1 g1 + Y g2
If both Y and X2 are missing
impute means of cases with similar values for X1
Y = d0 + X1 d1
X2 = f0 + X1 f1
Problem
Ignores random components (no e)
So it underestimates variances and SEs
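
A minimal sketch of the technique, in Python with hypothetical toy data. To keep it short it uses a single predictor (age) instead of both X1 and X2, which is a simplifying assumption; fit_line is an illustrative helper, not part of any package named in the slides.

```python
def mean(values):
    return sum(values) / len(values)

def fit_line(x, y):
    """Least-squares intercept and slope for regressing y on x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

# Hypothetical toy data: weight is missing (None) for two of six cases.
age = [25, 30, 35, 40, 45, 50]
weight = [130, None, 150, 155, None, 175]

observed = [(a, w) for a, w in zip(age, weight) if w is not None]
b0, b2 = fit_line([a for a, _ in observed], [w for _, w in observed])

# Impute the predicted (conditional mean) weight; there is no residual e.
imputed = [b0 + b2 * a if w is None else w for a, w in zip(age, weight)]
print([round(v, 1) for v in imputed])
```

The imputed weights fall exactly on the fitted line, so they show no scatter around it.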

13
c. Single random imputation

Implemented in SPSS MVA module
available in SRL
http://www.spss.com/PDFs/SMV115SPClr.pdf
Like conditional mean imputation
but imputed value includes a random residual
Implicit models
If Y is missing
Y = b0 + X1 b1 + X2 b2 + eY.12
Likewise, if X2 is missing
X2 = g0 + X1 g1 + Y g2 + e2.1Y
If both Y and X2 are missing
Y = d0 + X1 d1 + eY.1
X2 = f0 + X1 f1 + e2.1
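
The idea can be sketched in Python on hypothetical toy data. This is not the SPSS MVA implementation. It assumes one predictor (age) and a normal residual whose standard deviation is estimated from the complete cases.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def fit_line(x, y):
    """Least-squares intercept and slope for regressing y on x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

# Hypothetical toy data: weight is missing (None) for two of six cases.
age = [25, 30, 35, 40, 45, 50]
weight = [130, None, 150, 155, None, 175]

observed = [(a, w) for a, w in zip(age, weight) if w is not None]
b0, b2 = fit_line([a for a, _ in observed], [w for _, w in observed])

# Residual standard deviation around the line (n - 2 degrees of freedom).
sse = sum((w - (b0 + b2 * a)) ** 2 for a, w in observed)
resid_sd = math.sqrt(sse / (len(observed) - 2))

rng = random.Random(0)
# Imputed value = conditional mean + a randomly drawn residual.
imputed = [b0 + b2 * a + rng.gauss(0.0, resid_sd) if w is None else w
           for a, w in zip(age, weight)]
print([round(v, 1) for v in imputed])
```

The random residual restores scatter around the line, but each missing value is still filled in only once.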

14
Problem with single imputation

Still underestimates SEs!
treats imputed values like observed values
when they are actually less certain
ignores imputation variation

15
Imputation variation

Sampling variation
If you take a different sample
you get different parameter estimates
Standard errors reflect this
One way to estimate sampling variation
measure variation across multiple samples
called bootstrapping
Imputation variation
If you impute different values
you get different parameter estimates
Standard errors should reflect this, too
One way to estimate imputation variation
measure variation across multiple imputed data sets
called multiple imputation
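
To illustrate the sampling-variation half of this comparison, here is a minimal bootstrap sketch in Python. The sample of 200 weights is simulated and hypothetical. In practice the "multiple samples" are resamples drawn with replacement from the one sample in hand.

```python
import random
import statistics

rng = random.Random(0)
# Hypothetical sample of 200 weights.
sample = [rng.gauss(155.0, 18.0) for _ in range(200)]

# Resample with replacement many times and re-estimate the mean each time.
boot_means = []
for _ in range(2000):
    resample = rng.choices(sample, k=len(sample))
    boot_means.append(statistics.mean(resample))

boot_se = statistics.stdev(boot_means)                        # variation across resamples
formula_se = statistics.stdev(sample) / len(sample) ** 0.5    # usual SE of a mean
print(round(boot_se, 2), round(formula_se, 2))
```

The two standard errors should be close. Multiple imputation applies the same logic to imputed values: repeat the imputation and measure how much the estimates move.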

16
d. Multiple imputation

Case 1 is missing weight
Given case 1's sex and age
and relationships in other cases
generate a plausible distribution for case 1's weight

At random, sample 5 (or more) plausible weights for case 1
Yes! Impute Y!
For case 6, sample from conditional distribution of age.
Yes! Use Y to impute X!
For case 7, sample from conditional bivariate distribution of age and weight
17
d. Multiple imputation

We impute these plausible values, creating 5 versions of the data set: multiple imputations

proc mi data=missing_weight_age        /* Input data missing_weight_age */
        out=weight_age_mi;             /* Output 5 imputed data sets into weight_age_mi (left) */
  var years_over_20 weight maleness;   /* These variables have missing values or can help impute them. */
run;
http://support.sas.com/rnd/app/papers/mi.pdf

18
d. Multiple imputation

For each imputed data set, estimate
parameters (white)
sampling variances and covariances (gray)

proc reg                                   /* Regression */
    data=weight_age_mi                     /* Input MI data */
    outest=parameters covout;              /* Output data set called parameters (right) */
  model weight = maleness years_over_20;   /* Regress weight on sex and age */
  by _imputation_;                         /* Separate analysis for each imputed data set */
run;

19
Sampling variation vs. imputation variation

Over the 5 analyses,
Mean(b0) estimates b0
Mean(s^2_b0) estimates the variance in b0 due to sampling
Var(b0) estimates the variance in b0 due to imputation

20
MI standard errors

Total variance in b0
= variation due to sampling + variation due to imputation
= Mean(s^2_b0) + Var(b0)
Actually, there's a correction factor of (1 + 1/M) for the number of imputations M. (Here M = 5.)
So total variance in estimating b0 is
Mean(s^2_b0) + (1 + 1/M) Var(b0) = 179.53 + (1.2)(511.59) = 793.44
Standard error is sqrt(793.44) = 28.17
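
The arithmetic above can be checked with a short Python sketch. The function names mi_total_variance and combine are illustrative, not part of SAS or any other package. The inputs 179.53 and 511.59 are the slide's values for Mean(s^2_b0) and Var(b0).

```python
import math
import statistics

def mi_total_variance(within, between, m):
    """Within-imputation variance plus (1 + 1/M) times between-imputation variance."""
    return within + (1 + 1 / m) * between

def combine(estimates, variances):
    """Pool M point estimates and their squared standard errors."""
    m = len(estimates)
    within = statistics.mean(variances)        # Mean(s^2_b0): sampling variation
    between = statistics.variance(estimates)   # Var(b0): imputation variation
    total = mi_total_variance(within, between, m)
    return statistics.mean(estimates), math.sqrt(total)

# Check against the numbers on the slide (M = 5).
total = mi_total_variance(179.53, 511.59, 5)
print(round(total, 2), round(math.sqrt(total), 2))
```

The combine helper takes the per-imputation estimates and squared standard errors directly. The five individual values are not listed on the slide, so only the summary check is run here.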

21
MI estimates in SAS
Multiple Imputation Parameter Estimates
Parameter          Estimate    Std Error    95% Confidence Limits      DF
intercept        178.564526    28.168160    70.58553    286.5435   2.2804
maleness          67.110037    14.801696    21.52721    112.6929   3.1866
years_over_20     -0.960283     0.819559    -3.57294      1.6524   2.991

proc MIAnalyze data=parameters;           /* Synthesize parameters from 5 separate regressions */
  var intercept maleness years_over_20;   /* Variables with slopes of interest */
run;
http://support.sas.com/rnd/app/papers/mianalyze.pdf

22
(Full information) maximum likelihood (ML)

Suppose Y has a missing value
We estimate the distribution of possible Y values

In MI
impute as few as 5 values from this distribution

In ML,
integrate across the full distribution of possible Y values
Like MI with an infinite number of imputations
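
One way to write this idea down, as a sketch in notation that does not appear on the slides: let theta collect the model parameters, let f be the joint density of a case's variables, and split case i into an observed part and a missing part. Each case then contributes to the likelihood with its missing values integrated out:

```latex
L(\theta) \;=\; \prod_{i=1}^{n} \int f\!\left(y_i^{\mathrm{obs}},\, y_i^{\mathrm{mis}} \mid \theta\right) \, dy_i^{\mathrm{mis}}
```

For a case with nothing missing there is nothing to integrate, and its contribution is the ordinary density of the observed values.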

23
ML in AMOS
24
ML vs. MI: Example

Multiple Imputation Parameter Estimates (from SAS)
Parameter          Estimate    Std Error    95% Confidence Limits      DF
intercept        178.564526    28.168160    70.58553    286.5435   2.2804
maleness          67.110037    14.801696    21.52721    112.6929   3.1866
years_over_20     -0.960283     0.819559    -3.57294      1.6524   2.991
MSE (not provided) 282.9894081

Maximum Likelihood Parameter Estimates (from AMOS)
Regression Weights            Estimate      S.E.     C.R.   Label
weight <--- maleness            68.569    12.860    5.332   b1
weight <--- years_over_20       -0.712     0.682   -1.044   b2
Intercepts                    Estimate      S.E.     C.R.   Label
weight                         166.675    20.542    8.114
Variances                     Estimate      S.E.     C.R.   Label
e                              243.169   126.993    1.915   s2

25
Assumption: MI and ML

Remember: LD assumes deletion independent of e
MI and ML have a less restrictive assumption
Values are missing at random (MAR)
The probability that a value is missing
depends only on values that are not missing
e.g., women X1 (complete) are more likely to withhold weight Y and age X2

26
MAR with deletion independent of e

Women (X1 = 0) less likely to disclose weight Y and age X2
Data MAR
Deletion independent of e
All methods approximately unbiased
LD slightly less efficient

27
MAR with deletion dependent on e

Overweight (e > 0) less likely to disclose age X2
LD biased because deletion depends on e
bias evident in b2, s, SEs
MI and ML approximately unbiased because values are MAR

28
Summary

MI and ML
more efficient than LD
unless only Y is missing
unbiased under less restrictive assumptions
MI and ML require MAR
LD requires deletion independent of e
But there's a fly in the ointment

29
Values not missing at random (NMAR)

Probability that values are missing depends on the missing values themselves
e.g., the probability that weight Y is missing
is higher for the overweight (depends on Y)
is higher for women (depends on X1)
and sometimes X1 is missing, too.
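
The difference from MAR can be sketched with a small simulation. The code is Python, the missingness probabilities and the 155 threshold are illustrative assumptions, and the comparison is made among women only so that sex is held fixed.

```python
import random
import statistics

rng = random.Random(2)
people = []
for _ in range(10_000):
    x1 = 1 if rng.random() < 0.5 else 0   # assumption: half the cases are male
    y = 125 + 40 * x1 + (rng.gauss(50, 10) - 20) + rng.gauss(0, 15)
    people.append((x1, y))

women = [y for x1, y in people if x1 == 0]
true_mean = statistics.mean(women)   # mean weight of all women, before any deletion

# MAR: the chance that weight is missing depends only on sex, which is observed.
# Every woman has the same illustrative probability, 0.6.
mar_obs = [y for y in women if rng.random() > 0.6]

# NMAR: the chance also depends on the missing weight itself.
# Illustrative assumption: 0.8 above 155, 0.4 otherwise.
nmar_obs = [y for y in women if rng.random() > (0.8 if y > 155 else 0.4)]

mar_bias = statistics.mean(mar_obs) - true_mean     # should be near 0
nmar_bias = statistics.mean(nmar_obs) - true_mean   # should be clearly negative
print(round(mar_bias, 2), round(nmar_bias, 2))
```

Under MAR the observed women are representative of all women. Under NMAR the observed weights are systematically too low, and the observed data alone give no way to detect this.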

30
NMAR

If values are NMAR,
e.g., overweight less likely to disclose weight
all of today's methods are biased

Approaches
Selection models (e.g., Heckman correction)
Pattern mixture (Rubin 1987)
Beyond scope of this brownbag

31
Software

Both AMOS ML and SAS PROC MI assume
missing values are multivariate normal
But your data may be
nonnormal
categorical
clustered or nested
Consider ad hoc adjustments (Allison 2002)
Or use different software
MI
www.stat.psu.edu/jls/misoftwa.html
www.multiple-imputation.com
review in Horton & Lipsitz (2001)
ML for categorical data (links from Allison 2002)
http://www.kub.nl/faculteiten/fsw/organisatie/departementen/mto/software2.html (Lem)
www2.qimr.edu.au/davidD (LOGLIN)

32
Concise reference works

Allison, P. (2002). Missing data. Thousand Oaks, CA: Sage (greenback).
Horton, N.J., & Lipsitz, S.R. (2001). Multiple imputation in practice: Comparison of software packages for regression models with missing variables. The American Statistician, 55(3), 244-254.
Little, R.J.A. (1992). Regression with missing X's: A review. Journal of the American Statistical Association, 87(420), 1227-1237.

33
Further references

Mostly ML
Anderson, T.W. (1956). Maximum likelihood estimates for a multivariate normal distribution when some observations are missing.
Little, R.J.A., & Rubin, D.B. (1st ed. 1990, 2nd ed. 2002). Statistical analysis with missing data. New York: Wiley.
Mostly MI
Schafer, J.L. (1997a). Analysis of Incomplete Multivariate Data. London: Chapman & Hall.
Rubin, D.B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.