Title: Data analysis with missing values
1. Data analysis with missing values
http://www.sociology.ohio-state.edu/people/ptv/faq/missing/missing.ppt
- Ohio State University
- Department of Sociology brownbag
- Paul T. von Hippel
- May 2, 2003
2. Missing values
- Common in social research
- nonresponse, loss to follow-up
- lack of overlap between linked data sets
- social processes
- dropping out of school, graduation, etc.
- survey design
- skip patterns between respondents
3. Methods
- Always bad methods
- Mean (median, mode) imputation
- Pairwise deletion a.k.a. available case analysis
- Dummy variable adjustment
- Often good methods
- Listwise deletion (LD) a.k.a. complete case analysis
- Multiple imputation (MI)
- (Full information) maximum likelihood (ML)
4. Simulated data
- Population
- maleness: $X_1 = 1$ if male, 0 if female
- age: $X_2 \sim N(50, 10^2)$
- weight: $Y = b_0 + X_1 b_1 + (X_2 - 20) b_2 + e$
- $e \sim N(0, s^2)$
- $b_0 = 125$, $b_1 = 40$, $b_2 = 1$, $s = 15$
- Samples
- small ($n = 20$) to illustrate procedures
- large ($N = 10{,}000$) to check bias and efficiency
- Various patterns of missingness
- X1 is completely observed
- X2 and/or Y may have missing values
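The population model on this slide can be sketched in a few lines of Python. This is only an illustration, not the code behind the original slides: the function name `simulate`, the seed, and the 50/50 split between men and women are assumptions, since the slide does not state the sex ratio.

```python
import random

def simulate(n, seed=0, b0=125.0, b1=40.0, b2=1.0, s=15.0):
    """Draw n cases (maleness, age, weight) from the slide's population model."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n):
        x1 = rng.randint(0, 1)      # assumption: P(male) = 0.5 (not given on the slide)
        x2 = rng.gauss(50, 10)      # age ~ N(50, 10^2)
        e = rng.gauss(0, s)         # residual ~ N(0, s^2)
        y = b0 + b1 * x1 + b2 * (x2 - 20) + e
        cases.append((x1, x2, y))
    return cases

small = simulate(20)         # small sample, to illustrate procedures
large = simulate(10_000)     # large sample, to check bias and efficiency
```

Missing values would then be introduced by blanking age or weight for selected cases, according to whichever missingness pattern is being studied.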
5. Simulated sample
- $n = 20$
- Women (gray) less likely to disclose
- weight (Y)
- age (X2)
6. Method 1: Listwise deletion
- delete all cases with missing values for Y, X1, or X2
- analyze remaining (complete) cases
- common software default

/* In SAS */
proc reg data=missing_weight_age;          /* Use data on right */
  model weight = maleness years_over_20;   /* Regress weight on sex and age */
run;

                      Parameter    Standard
Variable         DF    Estimate       Error
Intercept         1   160.98328    22.48391
maleness          1    70.24778    18.70752
years_over_20     1    -0.47579     0.69354
7. Myths about listwise deletion
- Myth
- LD always inefficient
- Fact
- LD efficient if only Y is missing
- Myth
- LD biased unless cases are deleted at random
- Fact
- LD unbiased unless deletion depends on e
8. Assumption: listwise deletion
- LD assumes deletion does not depend on e
- Otherwise the e's for complete cases won't have a mean of 0
- $Y = b_0 + X_1 b_1 + (X_2 - 20) b_2 + e$, where $e \sim N(0, s^2)$
- Assumption satisfied
- women ($X_1 = 0$) less likely to disclose weight Y or age X2
- deletion depends on X1
- Assumption violated
- overweight ($e > 0$) less likely to disclose weight Y or age X2
- delete mostly positive e's, leaving negative e's
- complete cases are mostly underweight
- Results are biased
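A small simulation can make the two scenarios concrete. The sketch below is illustrative only: the nondisclosure probabilities (0.5 versus 0.1) and the seed are assumptions, not figures from the slides. It compares the mean residual e among complete cases when deletion depends on sex (X1) and when it depends on e itself.

```python
import random
import statistics

rng = random.Random(1)
N = 10_000
s = 15.0

kept_by_sex = []      # residuals of complete cases when deletion depends on X1
kept_by_resid = []    # residuals of complete cases when deletion depends on e

for _ in range(N):
    x1 = rng.randint(0, 1)
    e = rng.gauss(0, s)
    # Scenario A (assumed rates): women withhold with probability 0.5, men with 0.1
    if rng.random() > (0.1 if x1 == 1 else 0.5):
        kept_by_sex.append(e)
    # Scenario B (assumed rates): overweight (e > 0) withhold with probability 0.5, others 0.1
    if rng.random() > (0.5 if e > 0 else 0.1):
        kept_by_resid.append(e)

mean_a = statistics.fmean(kept_by_sex)      # close to 0: assumption satisfied
mean_b = statistics.fmean(kept_by_resid)    # clearly negative: assumption violated
print(f"deletion depends on X1: mean e = {mean_a:.2f}")
print(f"deletion depends on e:  mean e = {mean_b:.2f}")
```

In scenario A the complete cases still have residuals centered on zero, so regression on the complete cases is not biased. In scenario B the remaining residuals are shifted downward, which is the source of the bias.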
9. N.B. Assumptions relate to model
- If model neglects sex and age
- $Y = m + e$, where $e \sim N(0, s^2)$
- then sex X1 is in e, and women's nondisclosure causes bias
- More simply
- Complete cases mostly men ($e > 0$)
- m will be overestimated
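The overestimate of m is easy to reproduce. The following sketch uses the population model from the simulated-data slide; the nondisclosure rates (0.5 for women, 0.1 for men), the even sex ratio, and the seed are assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(4)
weights, disclosed = [], []

for _ in range(10_000):
    x1 = rng.randint(0, 1)                       # assumption: half the population is male
    y = 125 + 40 * x1 + (rng.gauss(50, 10) - 20) + rng.gauss(0, 15)
    weights.append(y)
    if rng.random() > (0.1 if x1 else 0.5):      # assumed nondisclosure rates
        disclosed.append(y)

m_all = statistics.fmean(weights)          # mean weight if everyone disclosed
m_complete = statistics.fmean(disclosed)   # listwise-deletion estimate of m
print(f"all cases: {m_all:.1f}, complete cases: {m_complete:.1f}")
```

Because men are heavier on average and over-represented among complete cases, the complete-case mean sits above the full-sample mean, even though the same deletion mechanism causes no bias in the regression that includes sex.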
10. Method 2: Multiple imputation
[Figure: Thomas Kinkade, Stairway to Paradise, captioned "Steps to multiple imputation". Steps from bottom to top: mean imputation, conditional mean imputation, random imputation, multiple imputation (best).]
11. a. Mean imputation
- Technique
- Calculate mean over cases that have values for Y
- Impute this mean where Y is missing
- Ditto for X1, X2, etc.
- Implicit models
- $Y = m_Y$
- $X_1 = m_1$
- $X_2 = m_2$
- Problems
- ignores relationships among X and Y
- underestimates covariances
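The shrinkage in covariances can be seen in a short simulation. This is a sketch under stated assumptions: only women are simulated (so sex drops out of the model), 40% of weights are deleted completely at random, and the helper `covariance` is written out by hand rather than taken from a library.

```python
import random
import statistics

rng = random.Random(2)
ages = [rng.gauss(50, 10) for _ in range(2000)]
weights = [125 + (a - 20) + rng.gauss(0, 15) for a in ages]   # women only, for simplicity

def covariance(xs, ys):
    """Sample covariance of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Assumption for this demo: 40% of weights are missing completely at random.
missing = [rng.random() < 0.4 for _ in weights]
observed = [w for w, m in zip(weights, missing) if not m]
mean_w = statistics.fmean(observed)
imputed = [mean_w if m else w for w, m in zip(weights, missing)]

cov_full = covariance(ages, weights)
cov_imputed = covariance(ages, imputed)
var_full = statistics.variance(weights)
var_imputed = statistics.variance(imputed)
print(f"cov(age, weight): full {cov_full:.1f}, mean-imputed {cov_imputed:.1f}")
print(f"var(weight):      full {var_full:.1f}, mean-imputed {var_imputed:.1f}")
```

Every imputed case sits exactly at the mean of Y, so it contributes nothing to the covariance with age or to the variance of weight, and both shrink roughly in proportion to the share of imputed cases.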
12. b. Conditional mean imputation
- Technique & implicit models
- If Y is missing
- impute mean of cases with similar values for X1, X2
- $Y = b_0 + X_1 b_1 + X_2 b_2$
- Likewise, if X2 is missing
- impute mean of cases with similar values for X1, Y
- $X_2 = g_0 + X_1 g_1 + Y g_2$
- If both Y and X2 are missing
- impute means of cases with similar values for X1
- $Y = d_0 + X_1 d_1$
- $X_2 = f_0 + X_1 f_1$
- Problem
- Ignores random components (no e)
- Consequently underestimates variances and standard errors (SEs)
13. c. Single random imputation
- Implemented in SPSS MVA module
- available in SRL
- http://www.spss.com/PDFs/SMV115SPClr.pdf
- Like conditional mean imputation
- but imputed value includes a random residual
- Implicit models
- If Y is missing
- $Y = b_0 + X_1 b_1 + X_2 b_2 + e_{Y.12}$
- Likewise, if X2 is missing
- $X_2 = g_0 + X_1 g_1 + Y g_2 + e_{2.1Y}$
- If both Y and X2 are missing
- $Y = d_0 + X_1 d_1 + e_{Y.1}$
- $X_2 = f_0 + X_1 f_1 + e_{2.1}$
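The difference between conditional mean imputation and single random imputation can be sketched for the simplest case above, where weight is imputed from sex alone. This is an illustration, not the SPSS MVA procedure: the nondisclosure rates, the seed, and the helper names `impute` and `sd_women` are assumptions.

```python
import math
import random
import statistics

rng = random.Random(3)

# Simulated cases as (maleness, weight); weight is None when undisclosed.
cases = []
for _ in range(4000):
    x1 = rng.randint(0, 1)
    y = 125 + 40 * x1 + (rng.gauss(50, 10) - 20) + rng.gauss(0, 15)
    if rng.random() < (0.2 if x1 else 0.5):    # assumed nondisclosure rates
        y = None
    cases.append((x1, y))

# Fit Y = d0 + X1 d1 on the observed cases. With a binary X1 the fitted
# values are simply the two group means.
observed = [(x1, y) for x1, y in cases if y is not None]
group_mean = {
    sex: statistics.fmean([y for x1, y in observed if x1 == sex])
    for sex in (0, 1)
}
residuals = [y - group_mean[x1] for x1, y in observed]
resid_sd = math.sqrt(sum(r * r for r in residuals) / (len(residuals) - 2))

def impute(add_residual):
    """Fill in missing weights, with or without a random residual."""
    filled = []
    for x1, y in cases:
        if y is None:
            y = group_mean[x1]
            if add_residual:
                y += rng.gauss(0, resid_sd)
        filled.append((x1, y))
    return filled

def sd_women(data):
    return statistics.stdev([y for x1, y in data if x1 == 0])

sd_cond_mean = sd_women(impute(add_residual=False))   # conditional mean imputation
sd_random = sd_women(impute(add_residual=True))       # single random imputation
print(f"SD of women's weight: conditional mean {sd_cond_mean:.1f}, random {sd_random:.1f}")
```

Adding the residual restores the spread of the imputed variable to roughly its true value, whereas conditional mean imputation piles the imputed cases onto the group mean.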
14. Problem with single imputation
- Still underestimates SEs!
- treats imputed values like observed values
- when they are actually less certain
- ignores imputation variation
15. Imputation variation
- Sampling variation
- If you take a different sample
- you get different parameter estimates
- Standard errors reflect this
- One way to estimate sampling variation
- measure variation across multiple samples
- called bootstrapping
- Imputation variation
- If you impute different values
- you get different parameter estimates
- Standard errors should reflect this, too
- One way to estimate imputation variation
- measure variation across multiple imputed data sets
- called multiple imputation
16. d. Multiple imputation
- Case 1 is missing weight
- Given case 1's sex and age
- and relationships in other cases
- generate a plausible distribution for case 1's weight
- At random, sample 5 (or more) plausible weights for case 1
- Yes! Impute Y!
- For case 6, sample from conditional distribution of age.
- Yes! Use Y to impute X!
- For case 7, sample from conditional bivariate distribution of age & weight
17. d. Multiple imputation
- We impute these plausible values, creating 5 versions of the data set: multiple imputations

proc mi data=missing_weight_age        /* Input data missing_weight_age */
        out=weight_age_mi;             /* Output 5 imputed data sets into weight_age_mi (left) */
  var years_over_20 weight maleness;   /* These variables have missing values or can help impute them. */
run;
http://support.sas.com/rnd/app/papers/mi.pdf
18. d. Multiple imputation
- For each imputed data set, estimate
- parameters (white)
- sampling variances and covariances (gray)

proc reg                                   /* Regression */
    data=weight_age_mi                     /* Input MI data */
    outest=parameters covout;              /* Output data set called parameters (right) */
  model weight = maleness years_over_20;   /* Regress weight on sex and age */
  by _imputation_;                         /* Separate analysis for each imputed data set */
run;
19. Sampling variation vs. imputation variation
- Over the 5 analyses,
- $\mathrm{Mean}(\hat{b}_0)$ estimates $b_0$
- $\mathrm{Mean}(s^2_{\hat{b}_0})$ estimates the variance in $\hat{b}_0$ due to sampling
- $\mathrm{Var}(\hat{b}_0)$ estimates the variance in $\hat{b}_0$ due to imputation
20. MI standard errors
- Total variance in $\hat{b}_0$
- = variation due to sampling + variation due to imputation
- = $\mathrm{Mean}(s^2_{\hat{b}_0}) + \mathrm{Var}(\hat{b}_0)$
- Actually, there's a correction factor of $(1 + 1/M)$
- for the number of imputations M. (Here $M = 5$.)
- So total variance in estimating $b_0$ is
- $\mathrm{Mean}(s^2_{\hat{b}_0}) + (1 + 1/M)\,\mathrm{Var}(\hat{b}_0) = 179.53 + (1.2)(511.59) = 793.44$
- Standard error is $\sqrt{793.44} = 28.17$
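The combination rule on this slide can be written as a small function. The sketch below is not SAS's implementation; the names `total_variance` and `pool` are hypothetical. It reproduces the slide's arithmetic from the two summary figures given there (179.53 and 511.59) and shows how the same rule would apply to a list of per-imputation results.

```python
import math
import statistics

def total_variance(within, between, m):
    """Within-imputation variance plus (1 + 1/M) times between-imputation variance."""
    return within + (1 + 1 / m) * between

def pool(estimates, variances):
    """Combine M per-imputation estimates and their squared standard errors."""
    m = len(estimates)
    within = statistics.fmean(variances)        # Mean(s^2): sampling variation
    between = statistics.variance(estimates)    # Var(b-hat): imputation variation
    return statistics.fmean(estimates), math.sqrt(total_variance(within, between, m))

# Figures from the slide: Mean(s^2) = 179.53, Var(b0-hat) = 511.59, M = 5
total = total_variance(179.53, 511.59, 5)
se = math.sqrt(total)
print(round(total, 2), round(se, 2))   # prints: 793.44 28.17
```

The resulting standard error matches the 28.17 reported for the intercept by PROC MIANALYZE.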
21. MI estimates in SAS
Multiple Imputation Parameter Estimates
Parameter        Estimate     Std Error    95% Confidence Limits    DF
intercept        178.564526   28.168160    70.58553    286.5435     2.2804
maleness          67.110037   14.801696    21.52721    112.6929     3.1866
years_over_20     -0.960283    0.819559    -3.57294      1.6524     2.991

proc MIAnalyze data=parameters;           /* Synthesize parameters from 5 separate regressions */
  var intercept maleness years_over_20;   /* Variables with slopes of interest */
run;
http://support.sas.com/rnd/app/papers/mianalyze.pdf
22. (Full information) maximum likelihood (ML)
- Suppose Y has a missing value
- We estimate the distribution of possible Y values
- In MI
- impute as few as 5 values from this distribution
- In ML,
- integrate across the full distribution of possible Y values
- Like MI with an infinite number of imputations
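In symbols, the integration can be sketched as follows. The notation is an assumption introduced here, not taken from the slides: $z_{i,\text{obs}}$ and $z_{i,\text{mis}}$ denote the observed and missing values for case $i$, $f$ is the model's joint density, and $\theta$ collects the parameters. Under MAR, the likelihood that ML maximizes is

```latex
L(\theta) = \prod_{i=1}^{n} \int f\left(z_{i,\text{obs}},\, z_{i,\text{mis}};\, \theta\right) \, dz_{i,\text{mis}}
```

For a complete case there is nothing to integrate over, so its factor is the ordinary density. For an incomplete case, every possible value of the missing variables contributes, weighted by how plausible it is under the model.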
23. ML in AMOS
24. ML vs. MI: Example
Multiple Imputation Parameter Estimates (from SAS)
Parameter        Estimate      Std Error        95% Confidence Limits    DF
intercept        178.564526    28.168160        70.58553    286.5435     2.2804
maleness          67.110037    14.801696        21.52721    112.6929     3.1866
years_over_20     -0.960283     0.819559        -3.57294      1.6524     2.991
MSE              282.9894081   (not provided)
Maximum Likelihood Parameter Estimates (from AMOS)
Regression Weights           Estimate    S.E.       C.R.      Label
weight <--- maleness           68.569    12.860      5.332    b1
weight <--- years_over_20      -0.712     0.682     -1.044    b2
Intercepts                   Estimate    S.E.       C.R.      Label
weight                        166.675    20.542      8.114
Variances                    Estimate    S.E.       C.R.      Label
e                             243.169   126.993      1.915    s2
25. Assumption: MI and ML
- Remember: LD assumes deletion independent of e
- MI and ML have a less restrictive assumption
- Values are missing at random (MAR)
- The probability that a value is missing
- depends only on values that are not missing
- e.g., women X1 (complete) are more likely to withhold weight Y and age X2
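The MAR condition can be stated compactly. The notation here is an assumption added for illustration: $R$ is the set of indicators recording which values are missing, and $Z_{\text{obs}}$ and $Z_{\text{mis}}$ are the observed and missing parts of the data.

```latex
P\left(R \mid Z_{\text{obs}},\, Z_{\text{mis}}\right) = P\left(R \mid Z_{\text{obs}}\right)
```

In the example, sex X1 is always observed and so belongs to $Z_{\text{obs}}$. Nondisclosure that depends only on X1 therefore satisfies the condition.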
26. MAR with deletion independent of e
- Women ($X_1 = 0$) less likely to disclose weight Y and age X2
- Data MAR
- Deletion independent of e
- All methods approximately unbiased
- LD slightly less efficient
27. MAR with deletion dependent on e
- Overweight ($e > 0$) less likely to disclose age X2
- LD biased because deletion depends on e
- bias evident in b2, s, SEs
- MI & ML approximately unbiased because values are MAR
28. Summary
- MI & ML
- more efficient than LD
- unless only Y is missing
- unbiased under less restrictive assumptions
- MI & ML require MAR
- LD requires deletion independent of e
- But there's a fly in the ointment
29. Values not missing at random (NMAR)
- Probability that values are missing depends on the missing values themselves
- e.g., the probability that weight Y is missing
- is higher for the overweight (depends on Y)
- is higher for women (depends on X1)
- and sometimes X1 is missing, too.
30. NMAR
- If values are NMAR,
- e.g., overweight less likely to disclose weight
- all today's methods are biased
- Approaches
- Selection models (e.g., Heckman correction)
- Pattern mixture (Rubin 1987)
- Beyond scope of this brownbag
31. Software
- Both AMOS ML and SAS PROC MI assume
- missing values are multivariate normal
- But your data may be
- nonnormal
- categorical
- clustered or nested
- Consider ad hoc adjustments (Allison 2002)
- Or use different software
- MI
- www.stat.psu.edu/jls/misoftwa.html
- www.multiple-imputation.com
- review in Horton & Lipsitz (2001)
- ML for categorical data (links from Allison 2002)
- http://www.kub.nl/faculteiten/fsw/organisatie/departementen/mto/software2.html (Lem)
- www2.qimr.edu.au/davidD (LOGLIN)
32. Concise reference works
- Allison, P. (2002). Missing data. Thousand Oaks, CA: Sage greenback.
- Horton, NJ & Lipsitz, SR. (2001). Multiple imputation in practice: Comparison of software packages for regression models with missing variables. The American Statistician 55(3): 244-254.
- Little, R.J.A. (1992). Regression with missing X's: A review. Journal of the American Statistical Association 87(420): 1227-1237.
33. Further references
- Mostly ML
- Anderson, T.W. (1956). Maximum likelihood estimates for a multivariate normal distribution when some observations are missing.
- Little, RJA & Rubin, DB. (1st ed. 1990, 2nd ed. 2002). Statistical analysis with missing data. New York: Wiley.
- Mostly MI
- Schafer, JL. (1997a). Analysis of Incomplete Multivariate Data. London: Chapman & Hall.
- Rubin, DB. (1987). Multiple imputation for survey nonresponse. New York: Wiley.