European Conference on Quality - PowerPoint PPT Presentation

1 / 22
About This Presentation
Title:

European Conference on Quality

Description:

Comparing Fully and Partially Synthetic Data Sets for Statistical Disclosure ... Probit regression to explain, why firms offer vocational training ... – PowerPoint PPT presentation

Number of Views:46
Avg rating:3.0/5.0
Slides: 23
Provided by: Wag53
Category:

less

Transcript and Presenter's Notes

Title: European Conference on Quality


1
Comparing Fully and Partially Synthetic Data Sets
for Statistical Disclosure Control in the German
IAB Establishment Panel
Jörg Drechsler, Stefan Bender (Institute for
Employment Research, Germany) Susanne
Rässler (University of Bamberg)
  • European Conference on Quality
  • in Official Statistics 2008
  • Rome, 08.-11. July 2008

2
Overview
  • Multiple Imputation for Statistical Disclosure
    Control
  • The IAB Establishment Panel
  • Application of The Two Approaches
  • Comparison of The Results
  • Conclusion

3
Fully synthetic data sets (Rubin 1993)
X
Ynot observed
Ysynthetisch
Ysynthetisch
Ysynthetisch
Ysynthetisch
Ysynthetic
Yobserved
  • advantages - data are fully synthetic
  • - re-identification of single units almost
    impossible
  • - all variables are still fully available
  • disadvantages - strong dependence on the
    imputation model
  • - setting up a model might be difficult/impossibl
    e

4
Partially synthetic data sets (Little 1993)
  • only potentially identifying or sensitive
    variables are replaced

5
Partially synthetic data sets (Little 1993)
  • only potentially identifying or sensitive
    variables are replaced

6
Partially synthetic data sets (Little 1993)
  • only potentially identifying or sensitive
    variables are replaced
  • advantages - model dependence decreases
  • - models are easier to set up
  • disadvantages - true values remain in the data
    set
  • - disclosure might still be possible

7
Overview
  • Multiple Imputation for Statistical Disclosure
    Control
  • The IAB Establishment Panel
  • Application of The Two Approaches
  • Comparison of the Results
  • Conclusions

8
The IAB Establishment Panel
  • Annually conducted Establishment Survey
  • Since 1993 in Western Germany, since 1996 in
    Eastern Germany
  • Population All establishments with at least one
    employee covered by social security
  • Source Official Employment Statistics
  • Response rate of repeatedly interviewed
    establishments more than 80
  • Sample of more than 16.000 establishments in the
    last wave
  • Contents employment structure, changes in
    employment, business policies, investment,
    training, remuneration, working hours,
    collective wage agreements, works councils

9
Overview
  • Multiple Imputation for Statistical Disclosure
    Control
  • The IAB Establishment Panel
  • Application of the Two Approaches
  • Comparison of the Results
  • Conclusions

10
Generating fully synthetic data sets for the IAB
Establishment Panel
  • Create a synthetic data set for selected
    variables from the wave 1997 from the
    Establishment Panel
  • Draw 10 new sample from the Official Employment
    Statistics using the same sampling design as for
    the Establishment Panel (Stratification by
    industry, size, and region)
  • The number of observations in each sample equals
    the number of observations in the panel
    nsnp7332
  • Every sample is imputed ten times using
    sequential regression
  • Number of variables from the establishment panel
    48
  • Imputations are generated using IVEware by
    Raghunathan, Solenberger and Hoewyk (2001)

11
Imputation procedure for partially synthetic data
  • Only two variables are synthesized - number of
    employees
  • - industry (16 categories)
  • Same variables for the imputation models
  • Imputation by sequential regression
  • Imputation model - multinomial logit for the
    industry
  • - linear model for the cubic root of the nb of
    employees
  • - 4 independent linear models defined by
    quartiles for the establishment size
  • Imputations based on own coding in R.

12
Overview
  • Multiple Imputation for Statistical Disclosure
    Control
  • The IAB Establishment Panel
  • Application of The Two Approaches
  • Comparison of the Results
  • Conclusion

13
Analytical validity
  • Compare regression results from the original data
    with results from the synthetic data
  • First regression
  • Zwick (2005) analyses the productivity effects of
    different continuing vocational training forms in
    Germany
  • Probit regression to explain, why firms offer
    vocational training
  • 13 Explanatory variables including Share of
    qualified employees, establishment size,
    industry, collective wage agreement, high
    qualification needs expected
  • Second regression
  • Log(number of employees) on 15 industry dummies
  • Two data utility measures
  • - Comparison of the beta coefficients from the
    original data set and the synthetic data
    sets
  • - confidence interval overlap

14
Confidence interval overlap
  • Suggested by Karr et al. (2006)
  • Measure the overlap of CIs from the original data
    and CIs from the synthetic data
  • The higher the overlap, the higher the data
    utility
  • Compute the average relative CI overlap for any

CI for the synthetic data
CI for the original data
15
Results from the first regression (Zwick 2005)
16
Average confidence interval (CI) overlap for the
estimates from the first regression
0,808
0,926
Average overlap
17
Results from the second regression (log(nb. of
employees) on industry)
Significant at the 0,1 level
Significant at the 1 level
Significant at the 5 level
insignificant
18
Average confidence interval (CI) overlap for the
estimates from the second regression
0,699
0,839
Average overlap
19
Disclosure risk
  • Difficult to compare between partially and fully
    synthetic data sets
  • Disclosure risk is low for fully synthetic data
    sets, although not zero
  • DR is higher for partially synthetic data sets,
    because
  • True values remain in the data set
  • Only survey respondents are included
  • For partially synthetic data sets a careful
    disclosure risk evaluation is necessary

20
Overview
  • Multiple Imputation for Statistical Disclosure
    Control
  • The IAB Establishment Panel
  • Application of The Two Approaches
  • Comparison of the Results
  • Conclusions

21
Conclusions
  • Generating synthetic data sets can be a useful
    method for SDC
  • Advantages for partially synthetic data sets
  • Higher data validity
  • Imputation models easier to set up
  • Lower risk of biased imputations
  • Disadvantages for partially synthetic data sets
  • Higher risk of disclosure
  • Careful disclosure risk evaluation necessary
  • Agencies will have to decide depending on the
    complexity of the survey and the expected risk
    of disclosure

22
Thank you for your attention
Write a Comment
User Comments (0)
About PowerShow.com