1
Data analysis with missing values
http://www.sociology.ohio-state.edu/people/ptv/faq/missing/missing.ppt
  • Ohio State University
  • Department of Sociology brownbag
  • Paul T. von Hippel
  • May 2, 2003

2
Missing values
  • Common in social research
    • nonresponse, loss to follow-up
    • lack of overlap between linked data sets
    • social processes
      • dropping out of school, graduation, etc.
    • survey design
      • skip patterns between respondents

3
Methods
  • Always bad methods
    • Mean (median, mode) imputation
    • Pairwise deletion, a.k.a. available case analysis
    • Dummy variable adjustment
  • Often good methods
    • Listwise deletion (LD), a.k.a. complete case analysis
    • Multiple imputation (MI)
    • (Full information) maximum likelihood (ML)

4
Simulated data
  • Population
    • maleness: $X_1 = 1$ if male, 0 if female
    • age: $X_2 \sim N(50, 10^2)$
    • weight: $Y = b_0 + X_1 b_1 + (X_2 - 20) b_2 + e$
    • $e \sim N(0, s^2)$
    • $b_0 = 125$, $b_1 = 40$, $b_2 = 1$, $s = 15$
  • Samples
    • small ($n = 20$) to illustrate procedures
    • large ($N = 10{,}000$) to check bias and efficiency
  • Various patterns of missingness
    • $X_1$ is completely observed
    • $X_2$ and/or $Y$ may have missing values
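
A minimal Python sketch of how such a population could be simulated, using only the standard library. The 50/50 split between men and women, the seed, and the function and key names are assumptions made for illustration; the slides do not state them.

```python
import random
import statistics

def simulate(n, seed=0, b0=125.0, b1=40.0, b2=1.0, s=15.0):
    """Draw n cases from the population described on the slide."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.randint(0, 1)   # maleness: 1 if male, 0 if female (50/50 split assumed)
        x2 = rng.gauss(50, 10)   # age ~ N(50, 10^2)
        e = rng.gauss(0, s)      # residual ~ N(0, s^2)
        y = b0 + x1 * b1 + (x2 - 20) * b2 + e
        rows.append({"maleness": x1, "age": x2, "weight": y})
    return rows

small = simulate(20)        # small sample, to illustrate procedures
large = simulate(10_000)    # large sample, to check bias and efficiency

# With half the cases male and a mean age of 50, the expected weight is
# 125 + 40 * 0.5 + (50 - 20) * 1 = 175.
mean_weight = statistics.mean(r["weight"] for r in large)
print(round(mean_weight, 1))
```

Missing values can then be introduced by setting `weight` or `age` to `None` according to whichever missingness pattern is being studied.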

5
Simulated sample
  • $n = 20$
  • Women (gray) less likely to disclose
    • weight ($Y$)
    • age ($X_2$)

6
Method 1: Listwise deletion
  • delete all cases with missing values for Y, X1,
    or X2
  • analyze remaining (complete) cases
  • common software default

/* In SAS */
proc reg data=missing_weight_age;          /* Use data on right */
  model weight = maleness years_over_20;   /* Regress weight on sex and age */
run;

                     Parameter     Standard
Variable        DF    Estimate        Error
Intercept        1   160.98328     22.48391
maleness         1    70.24778     18.70752
years_over_20    1    -0.47579      0.69354
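
For readers without SAS, the same two steps (drop incomplete cases, then regress) can be sketched in plain Python. The six cases below are hypothetical and were constructed so that weight = 100 + 50 × maleness + 2 × years_over_20 exactly; they are not the slide's data, and the helper names are illustrative.

```python
def listwise_delete(rows, columns):
    """Keep only the cases with no missing (None) value in any listed column."""
    return [r for r in rows if all(r[c] is not None for c in columns)]

def ols(rows, y, xs):
    """Least-squares coefficients [intercept, slopes...] from the normal equations."""
    X = [[1.0] + [float(r[c]) for c in xs] for r in rows]
    Y = [float(r[y]) for r in rows]
    k = len(xs) + 1
    # Augmented matrix [X'X | X'Y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yv for row, yv in zip(X, Y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical cases; None marks a missing value.
data = [
    {"maleness": 0, "years_over_20": 10, "weight": 120},
    {"maleness": 1, "years_over_20": 20, "weight": 190},
    {"maleness": 0, "years_over_20": 30, "weight": 160},
    {"maleness": 1, "years_over_20": 40, "weight": 230},
    {"maleness": 1, "years_over_20": None, "weight": 200},  # age missing
    {"maleness": 0, "years_over_20": 25, "weight": None},   # weight missing
]

complete = listwise_delete(data, ["weight", "maleness", "years_over_20"])
coef = ols(complete, "weight", ["maleness", "years_over_20"])
print(len(complete), [round(c, 3) for c in coef])
```

Only the four complete cases enter the regression, which is what `proc reg` does by default.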
7
Myths about listwise deletion
  • Myth
    • LD is always inefficient
  • Fact
    • LD is efficient if only Y is missing
  • Myth
    • LD is biased unless cases are deleted at random
  • Fact
    • LD is unbiased unless deletion depends on e

8
Assumption: listwise deletion
  • LD assumes deletion does not depend on e
    • Otherwise the e's for complete cases won't have a mean of 0
    • $Y = b_0 + X_1 b_1 + (X_2 - 20) b_2 + e$, where $e \sim N(0, s^2)$
  • Assumption satisfied
    • women ($X_1 = 0$) less likely to disclose weight Y or age X2
    • deletion depends on $X_1$
  • Assumption violated
    • overweight ($e > 0$) less likely to disclose weight Y or age X2
    • delete mostly positive e's, leaving negative e's
    • complete cases are mostly underweight
    • Results are biased
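
The contrast between the two cases can be checked with a small simulation. This is a sketch under simplifying assumptions: the age term is dropped so that the regression reduces to a comparison of group means, and the disclosure probabilities (0.9, 0.5, 0.3) are invented for illustration.

```python
import random
import statistics

rng = random.Random(1)
b0, b1, s = 125.0, 40.0, 15.0   # age term dropped for brevity (assumption)

cases = []
for _ in range(20_000):
    x1 = rng.randint(0, 1)      # 1 = male, 0 = female
    e = rng.gauss(0, s)
    cases.append((x1, e, b0 + x1 * b1 + e))

def estimate(kept):
    """OLS on one dummy: intercept = mean for women, slope = difference in means."""
    women = [y for x1, _, y in kept if x1 == 0]
    men = [y for x1, _, y in kept if x1 == 1]
    return statistics.mean(women), statistics.mean(men) - statistics.mean(women)

# Assumption satisfied: deletion depends only on X1 (women disclose less often).
by_sex = [c for c in cases if rng.random() < (0.9 if c[0] == 1 else 0.5)]
# Assumption violated: deletion depends on e (the overweight disclose less often).
by_e = [c for c in cases if rng.random() < (0.3 if c[1] > 0 else 0.9)]

b0_sex, b1_sex = estimate(by_sex)
b0_e, b1_e = estimate(by_e)
print(round(b0_sex, 1), round(b1_sex, 1))   # both close to the true 125 and 40
print(round(b0_e, 1), round(b1_e, 1))       # intercept pulled below 125
```

In this simplified setup the damage shows up in the intercept: the complete cases are mostly underweight, so the estimated baseline weight is too low.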

9
N.B. Assumptions relate to model
  • If model neglects sex and age
    • $Y = m + e$, where $e \sim N(0, s^2)$
    • then sex $X_1$ is in e, and women's nondisclosure causes bias
  • More simply
    • Complete cases are mostly men ($e > 0$)
    • $m$ will be overestimated

10
Method 2: Multiple imputation
Steps to multiple imputation (staircase, bottom to top)
  • a. mean imputation
  • b. conditional mean imputation
  • c. random imputation
  • d. MULTIPLE IMPUTATION (best)
[Image: Thomas Kinkade, Stairway to Paradise]

11
a. Mean imputation
  • Technique
    • Calculate mean over cases that have values for Y
    • Impute this mean where Y is missing
    • Ditto for X1, X2, etc.
  • Implicit models
    • $Y = m_Y$
    • $X_1 = m_1$
    • $X_2 = m_2$
  • Problems
    • ignores relationships among X and Y
    • underestimates covariances
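
A quick way to see the covariance problem is to simulate it. The sketch below assumes a women-only sample (so maleness drops out) and deletes half of the weights completely at random; both choices are illustrative and not taken from the slides.

```python
import random
import statistics

def covariance(xs, ys):
    """Sample covariance of two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

rng = random.Random(2)
age = [rng.gauss(50, 10) for _ in range(10_000)]
weight = [125 + (a - 20) + rng.gauss(0, 15) for a in age]   # women only (assumption)

# Delete half of the weights completely at random, then impute the observed mean.
observed = [w if rng.random() < 0.5 else None for w in weight]
mean_w = statistics.mean(w for w in observed if w is not None)
imputed = [mean_w if w is None else w for w in observed]

cov_full = covariance(age, weight)      # true value is Var(age) * b2 = 100
cov_imputed = covariance(age, imputed)  # roughly halved
print(round(cov_full, 1), round(cov_imputed, 1))
```

Every imputed case sits exactly at the mean of Y regardless of its age, so it contributes nothing to the covariance.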

12
b. Conditional mean imputation
  • Technique & implicit models
    • If Y is missing
      • impute mean of cases with similar values for X1, X2
      • $Y = b_0 + X_1 b_1 + X_2 b_2$
    • Likewise, if X2 is missing
      • impute mean of cases with similar values for X1, Y
      • $X_2 = g_0 + X_1 g_1 + Y g_2$
    • If both Y and X2 are missing
      • impute means of cases with similar values for X1
      • $Y = d_0 + X_1 d_1$
      • $X_2 = f_0 + X_1 f_1$
  • Problem
    • Ignores random components (no e)
    • ⇒ Underestimates variances and standard errors

13
c. Single random imputation
  • Implemented in SPSS MVA module
    • available in SRL
    • http://www.spss.com/PDFs/SMV115SPClr.pdf
  • Like conditional mean imputation
    • but imputed value includes a random residual
  • Implicit models
    • If Y is missing
      • $Y = b_0 + X_1 b_1 + X_2 b_2 + e_{Y.12}$
    • Likewise, if X2 is missing
      • $X_2 = g_0 + X_1 g_1 + Y g_2 + e_{2.1Y}$
    • If both Y and X2 are missing
      • $Y = d_0 + X_1 d_1 + e_{Y.1}$
      • $X_2 = f_0 + X_1 f_1 + e_{2.1}$
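
The difference between conditional mean imputation and single random imputation can be illustrated in Python. This is not the SPSS MVA procedure, only a sketch of the idea; it assumes a women-only sample with weight regressed on age alone, and half of the weights missing completely at random.

```python
import random
import statistics

rng = random.Random(3)
age = [rng.gauss(50, 10) for _ in range(10_000)]
weight = [125 + (a - 20) + rng.gauss(0, 15) for a in age]
missing = [rng.random() < 0.5 for _ in age]   # True where Y is treated as missing

# Fit Y = b0 + X2 * b2 on the cases where Y is observed.
xs = [a for a, m in zip(age, missing) if not m]
ys = [w for w, m in zip(weight, missing) if not m]
mx, my = statistics.mean(xs), statistics.mean(ys)
b2 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b0 = my - b2 * mx
# Residual standard deviation (n - 1 divisor; close enough for a sketch)
resid_sd = statistics.stdev([y - (b0 + b2 * x) for x, y in zip(xs, ys)])

# Conditional mean imputation: predicted value only, no residual.
cond_mean = [b0 + b2 * a if m else w for a, w, m in zip(age, weight, missing)]
# Single random imputation: predicted value plus a random residual.
random_imp = [b0 + b2 * a + rng.gauss(0, resid_sd) if m else w
              for a, w, m in zip(age, weight, missing)]

var_true = statistics.variance(weight)
var_cond = statistics.variance(cond_mean)
var_rand = statistics.variance(random_imp)
print(round(var_true), round(var_cond), round(var_rand))
```

Adding the residual restores the variance of Y, which conditional mean imputation shrinks.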

14
Problem with single imputation
  • Still underestimates standard errors!
    • treats imputed values like observed values
    • when they are actually less certain
    • ignores imputation variation

15
Imputation variation
  • Sampling variation
  • If you take a different sample
  • you get different parameter estimates
  • Standard errors reflect this
  • One way to estimate sampling variation
  • measure variation across multiple samples
  • called bootstrapping
  • Imputation variation
  • If you impute different values
  • you get different parameter estimates
  • Standard errors should reflect this, too
  • One way to estimate imputation variation
  • measure variation across multiple imputed data
    sets
  • called multiple imputation
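
As a small illustration of the bootstrapping idea, the sketch below resamples one sample with replacement and measures how much the estimated mean varies. The sample itself is hypothetical (200 weights drawn from a normal distribution), and the number of resamples is an arbitrary choice.

```python
import random
import statistics

rng = random.Random(4)
sample = [rng.gauss(175, 27) for _ in range(200)]   # hypothetical observed weights

# Resample the sample with replacement and re-estimate the mean each time.
boot_means = []
for _ in range(1000):
    resample = [rng.choice(sample) for _ in sample]
    boot_means.append(statistics.mean(resample))

boot_se = statistics.stdev(boot_means)                        # variation across resamples
formula_se = statistics.stdev(sample) / len(sample) ** 0.5    # textbook standard error
print(round(boot_se, 2), round(formula_se, 2))
```

The two standard errors should agree closely. Multiple imputation applies the same logic to imputed values instead of resampled cases.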

16
d. Multiple imputation
  • Case 1 is missing weight
    • Given case 1's sex and age
    • and relationships in other cases
    • generate a plausible distribution for case 1's weight
    • At random, sample 5 (or more) plausible weights for case 1
    • Yes! Impute Y!
  • For case 6, sample from conditional distribution of age.
    • Yes! Use Y to impute X!
  • For case 7, sample from conditional bivariate distribution of age and weight
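
Sampling plausible weights for one case might look like the sketch below. The coefficients, residual standard deviation, and case values are hypothetical. Note the simplification: proper multiple imputation also draws the regression parameters afresh for each imputation, whereas here they are held fixed.

```python
import random

rng = random.Random(5)

# Hypothetical fitted regression of weight on maleness and years_over_20
b0, b1, b2, s = 160.0, 70.0, -0.5, 20.0

case1 = {"maleness": 1, "years_over_20": 30, "weight": None}   # weight is missing

# Center of the conditional distribution of weight given sex and age
center = b0 + b1 * case1["maleness"] + b2 * case1["years_over_20"]

# Sample 5 plausible weights from N(center, s^2), one per imputed data set
plausible = [center + rng.gauss(0, s) for _ in range(5)]
print(center, [round(w, 1) for w in plausible])
```

Each of the five draws goes into a different copy of the data set.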

17
d. Multiple imputation
  • We impute these plausible values, creating 5 versions of the data set: multiple imputations

proc mi data=missing_weight_age        /* Input data missing_weight_age */
        out=weight_age_mi;             /* Output 5 imputed data sets into weight_age_mi (left) */
  var years_over_20 weight maleness;   /* These variables have missing values or can help impute them. */
run;
http://support.sas.com/rnd/app/papers/mi.pdf
18
d. Multiple imputation
  • For each imputed data set, estimate
    • parameters (white)
    • sampling variances and covariances (gray)

proc reg                                   /* Regression */
     data=weight_age_mi                    /* Input MI data */
     outest=parameters covout;             /* Output data set called parameters (right) */
  model weight = maleness years_over_20;   /* Regress weight on sex and age */
  by _imputation_;                         /* Separate analysis for each imputed data set */
run;
19
Sampling variation vs. imputation variation
  • Over the 5 analyses,
    • Mean($\hat{b}_0$) estimates $b_0$
    • Mean($s^2_{\hat{b}_0}$) estimates the variance in $\hat{b}_0$ due to sampling
    • Var($\hat{b}_0$) estimates the variance in $\hat{b}_0$ due to imputation

20
MI standard errors
  • Total variance in $\hat{b}_0$
    • = variation due to sampling + variation due to imputation
    • = Mean($s^2_{\hat{b}_0}$) + Var($\hat{b}_0$)
  • Actually, there's a correction factor of $(1 + 1/M)$
    • for the number of imputations $M$. (Here $M = 5$.)
  • So total variance in estimating $b_0$ is
    • Mean($s^2_{\hat{b}_0}$) + $(1 + 1/M)$ Var($\hat{b}_0$) = 179.53 + (1.2)(511.59) = 793.44
  • Standard error is $\sqrt{793.44} = 28.17$
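
The combining rule can be written as a short Python function. The function name is illustrative. Because the slide reports only the two summary quantities (179.53 and 511.59) and not the five individual estimates, the arithmetic check uses those directly.

```python
import math
import statistics

def combine(estimates, variances):
    """Pool M point estimates and their sampling variances across imputations."""
    m = len(estimates)
    within = statistics.mean(variances)         # variance due to sampling
    between = statistics.variance(estimates)    # variance due to imputation (M - 1 divisor)
    total = within + (1 + 1 / m) * between
    return statistics.mean(estimates), total, math.sqrt(total)

# Check of the slide's arithmetic for the intercept, with M = 5
total_b0 = 179.53 + (1 + 1 / 5) * 511.59
se_b0 = math.sqrt(total_b0)
print(round(total_b0, 2), round(se_b0, 2))   # 793.44 28.17

# Usage with made-up inputs: three imputations, each with sampling variance 1
pooled, total, se = combine([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

The correction factor shrinks toward 1 as M grows, so extra imputations mainly stabilize the between-imputation term.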

21
MI estimates in SAS
Multiple Imputation Parameter Estimates
Parameter         Estimate    Std Error    95% Confidence Limits        DF
intercept       178.564526    28.168160    70.58553     286.5435    2.2804
maleness         67.110037    14.801696    21.52721     112.6929    3.1866
years_over_20    -0.960283     0.819559    -3.57294       1.6524    2.991

proc MIAnalyze data=parameters;          /* Synthesize parameters from 5 separate regressions */
  var intercept maleness years_over_20;  /* Variables with slopes of interest */
run;
http://support.sas.com/rnd/app/papers/mianalyze.pdf
22
(Full information) maximum likelihood (ML)
  • Suppose Y has a missing value
  • We estimate the distribution of possible Y values
  • In MI
  • impute as few as 5 values from this distribution
  • In ML,
  • integrate across the full distribution of
    possible Y values
  • Like MI with an infinite number of imputations
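
In symbols, "integrating across the full distribution" can be sketched as the observed-data likelihood below. The notation is an assumption and not from the slides: $\theta$ stands for the model parameters, and $y_{i,\mathrm{obs}}$ and $y_{i,\mathrm{mis}}$ for the observed and missing parts of case $i$.

```latex
L(\theta) \;=\; \prod_{i=1}^{n} \int f\bigl(y_{i,\mathrm{obs}},\, y_{i,\mathrm{mis}} \mid \theta\bigr)\, dy_{i,\mathrm{mis}}
```

For a complete case there is nothing to integrate out, and its factor is the ordinary density. ML chooses the $\theta$ that maximizes this product.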

23
ML in AMOS
24
ML vs. MI: Example
Multiple Imputation Parameter Estimates (from SAS)
Parameter         Estimate    Std Error    95% Confidence Limits        DF
intercept       178.564526    28.168160    70.58553     286.5435    2.2804
maleness         67.110037    14.801696    21.52721     112.6929    3.1866
years_over_20    -0.960283     0.819559    -3.57294       1.6524    2.991
MSE (not provided)  282.9894081

Maximum Likelihood Parameter Estimates (from AMOS)
Regression Weights            Estimate      S.E.     C.R.   Label
weight <--- maleness            68.569    12.860    5.332   b1
weight <--- years_over_20       -0.712     0.682   -1.044   b2

Intercepts                    Estimate      S.E.     C.R.   Label
weight                         166.675    20.542    8.114

Variances                     Estimate      S.E.     C.R.   Label
e                              243.169   126.993    1.915   s2

25
Assumption: MI and ML
  • Remember: LD assumes deletion independent of e
  • MI and ML have a less restrictive assumption
    • Values are missing at random (MAR)
    • The probability that a value is missing
    • depends only on values that are not missing
    • e.g., women X1 (complete) are more likely to withhold weight Y and age X2
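
Stated as a formula, MAR is commonly written as below. The symbols are assumptions introduced here: $R$ is the set of indicators marking which values are missing, and $Y_{\mathrm{obs}}$ and $Y_{\mathrm{mis}}$ are the observed and missing parts of the data, covariates included.

```latex
\Pr\bigl(R \mid Y_{\mathrm{obs}},\, Y_{\mathrm{mis}}\bigr) \;=\; \Pr\bigl(R \mid Y_{\mathrm{obs}}\bigr)
```

In the example, whether weight and age are withheld depends on sex, which is always observed, so the condition holds.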

26
MAR with deletion independent of e
  • Women ($X_1 = 0$) less likely to disclose
    • weight Y and age X2
  • Data MAR
  • Deletion independent of e
  • All methods approximately unbiased
    • LD slightly less efficient

27
MAR with deletion dependent on e
  • Overweight ($e > 0$) less likely to disclose age X2
  • LD biased because deletion depends on e
    • bias evident in $b_2$, $s$, and standard errors
  • MI & ML approximately unbiased because values are MAR

28
Summary
  • MI & ML
    • more efficient than LD
      • unless only Y is missing
    • unbiased under less restrictive assumptions
      • MI & ML require MAR
      • LD requires deletion independent of e
  • But there's a fly in the ointment

29
Values not missing at random (NMAR)
  • Probability that values are missing depends on
    the missing values themselves
  • e.g., the probability that weight Y is missing
  • is higher for the overweight (depends on Y)
  • is higher for women (depends on X1)
  • and sometimes X1 is missing, too.

30
NMAR
  • If values are NMAR,
    • e.g., overweight less likely to disclose weight
    • all today's methods are biased
  • Approaches
    • Selection models (e.g., Heckman correction)
    • Pattern mixture (Rubin 1987)
  • Beyond scope of this brownbag

31
Software
  • Both AMOS ML and SAS PROC MI assume
    • missing values are multivariate normal
  • But your data may be
    • nonnormal
    • categorical
    • clustered or nested
  • Consider ad hoc adjustments (Allison 2002)
  • Or use different software
    • MI
      • www.stat.psu.edu/jls/misoftwa.html
      • www.multiple-imputation.com
      • review in Horton & Lipsitz (2001)
    • ML for categorical data (links from Allison 2002)
      • http://www.kub.nl/faculteiten/fsw/organisatie/departementen/mto/software2.html (Lem)
      • www2.qimr.edu.au/davidD (LOGLIN)

32
Concise reference works
  • Allison, P. (2002). Missing data. Thousand Oaks, CA: Sage (greenback).
  • Horton, N.J. & Lipsitz, S.R. (2001). Multiple imputation in practice: Comparison of software packages for regression models with missing variables. The American Statistician 55(3): 244-254.
  • Little, R.J.A. (1992). Regression with missing X's: A review. Journal of the American Statistical Association 87(420): 1227-1237.

33
Further references
  • Mostly ML
    • Anderson, T.W. (1956). Maximum likelihood estimates for a multivariate normal distribution when some observations are missing.
    • Little, R.J.A. & Rubin, D.B. (1st ed. 1987, 2nd ed. 2002). Statistical analysis with missing data. New York: Wiley.
  • Mostly MI
    • Schafer, J.L. (1997a). Analysis of Incomplete Multivariate Data. London: Chapman & Hall.
    • Rubin, D.B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.