European Conference on Quality - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: European Conference on Quality

1
Comparing Fully and Partially Synthetic Data Sets
for Statistical Disclosure Control in the German
IAB Establishment Panel
Jörg Drechsler, Stefan Bender (Institute for Employment Research, Germany)
Susanne Rässler (University of Bamberg)

European Conference on Quality in Official Statistics 2008
Rome, 8-11 July 2008

2
Overview

Multiple Imputation for Statistical Disclosure
Control
The IAB Establishment Panel
Application of the Two Approaches
Comparison of the Results
Conclusions

3
Fully synthetic data sets (Rubin 1993)
[Figure: schematic with blocks labelled X, Y observed, Y not observed, and several Y synthetic]

Advantages:
- data are fully synthetic
- re-identification of single units almost impossible
- all variables are still fully available
Disadvantages:
- strong dependence on the imputation model
- setting up a model might be difficult/impossible

4
Partially synthetic data sets (Little 1993)

only potentially identifying or sensitive
variables are replaced

6
Partially synthetic data sets (Little 1993)

only potentially identifying or sensitive
variables are replaced

Advantages:
- model dependence decreases
- models are easier to set up
Disadvantages:
- true values remain in the data set
- disclosure might still be possible

7
Overview

Multiple Imputation for Statistical Disclosure
Control
The IAB Establishment Panel
Application of the Two Approaches
Comparison of the Results
Conclusions

8
The IAB Establishment Panel

Annually conducted establishment survey
Since 1993 in Western Germany, since 1996 in Eastern Germany
Population: all establishments with at least one employee covered by social security
Source: Official Employment Statistics
Response rate of repeatedly interviewed establishments: more than 80%
Sample of more than 16,000 establishments in the last wave
Contents: employment structure, changes in employment, business policies, investment, training, remuneration, working hours, collective wage agreements, works councils

9
Overview

Multiple Imputation for Statistical Disclosure
Control
The IAB Establishment Panel
Application of the Two Approaches
Comparison of the Results
Conclusions

10
Generating fully synthetic data sets for the IAB
Establishment Panel

Create a synthetic data set for selected variables from the 1997 wave of the Establishment Panel
Draw 10 new samples from the Official Employment Statistics using the same sampling design as for the Establishment Panel (stratification by industry, size, and region)
The number of observations in each sample equals the number of observations in the panel: n_s = n_p = 7332
Every sample is imputed ten times using sequential regression
Number of variables from the Establishment Panel: 48
Imputations are generated using IVEware by Raghunathan, Solenberger and Van Hoewyk (2001)
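The sampling step on this slide can be sketched as follows. This is a minimal illustration in Python, not the authors' procedure: the toy frame, the field names (`industry`, `size_class`, `region`), the helpers `stratum_key` and `stratified_sample`, and the per-stratum counts are all hypothetical, and the imputation itself (done with IVEware in the presentation) is not reproduced. The sketch only shows drawing 10 samples with the panel's per-stratum counts; with 10 imputations per sample this implies 10 x 10 = 100 synthetic data sets.

```python
import random
from collections import Counter, defaultdict


def stratum_key(unit):
    """Stratum = (industry, size class, region); field names are hypothetical."""
    return (unit["industry"], unit["size_class"], unit["region"])


def stratified_sample(frame, sizes, rng):
    """Draw a simple random sample without replacement inside each stratum."""
    by_stratum = defaultdict(list)
    for unit in frame:
        by_stratum[stratum_key(unit)].append(unit)
    sample = []
    for key, n in sizes.items():
        sample.extend(rng.sample(by_stratum[key], n))
    return sample


rng = random.Random(2008)

# Toy sampling frame: 400 made-up units spread evenly over 8 strata.
frame = [
    {"id": i, "industry": i % 2, "size_class": (i // 2) % 2, "region": (i // 4) % 2}
    for i in range(400)
]

# Assumed per-stratum counts of the "original panel" (5 units in each stratum).
panel_counts = Counter({key: 5 for key in {stratum_key(u) for u in frame}})
n_p = sum(panel_counts.values())

# Draw 10 new samples with the same design, so each has n_s = n_p units.
n_samples, n_imputations = 10, 10
new_samples = [stratified_sample(frame, panel_counts, rng) for _ in range(n_samples)]

# Each sample would then be imputed 10 times (imputation step not shown).
n_synthetic_datasets = n_samples * n_imputations
print(n_p, len(new_samples[0]), n_synthetic_datasets)
```

Reusing the panel's stratum counts is what makes every new sample the same size as the panel (n_s = n_p).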

11
Imputation procedure for partially synthetic data

Only two variables are synthesized:
- number of employees
- industry (16 categories)
Same variables for the imputation models
Imputation by sequential regression
Imputation models:
- multinomial logit for the industry
- linear model for the cubic root of the number of employees
- 4 independent linear models defined by quartiles for the establishment size
Imputations based on the authors' own code in R.
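As a rough illustration of the size model above, the following Python sketch fits a separate linear model for the cubic root of the number of employees within each size quartile and replaces the observed counts with draws from the fitted models. It is a simplified sketch under stated assumptions, not the authors' R code: it uses a single hypothetical predictor `x`, made-up toy data, and plain least squares, and it ignores parameter uncertainty, which a proper multiple-imputation routine would also draw. The multinomial logit for industry is not covered.

```python
import random
import statistics


def fit_ols(xs, ys):
    """Simple linear regression y = a + b*x; returns (a, b, residual sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return a, b, sd


def synthesize_size(xs, employees, rng):
    """Replace employee counts with draws from quartile-specific models."""
    cuts = statistics.quantiles(employees, n=4)  # three quartile cut points
    group = [sum(e > q for q in cuts) for e in employees]  # quartile 0..3
    synthetic = [None] * len(employees)
    for g in range(4):
        idx = [i for i, gi in enumerate(group) if gi == g]
        # Model the cubic root of the count, as on the slide.
        a, b, sd = fit_ols(
            [xs[i] for i in idx],
            [employees[i] ** (1 / 3) for i in idx],
        )
        for i in idx:
            root = rng.gauss(a + b * xs[i], sd)
            # Back-transform; keep at least one employee (panel population).
            synthetic[i] = max(1, round(max(root, 0.0) ** 3))
    return synthetic


# Toy data (made up): one predictor and a skewed employee count.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(200)]
employees = [max(1, round((1 + 0.5 * x + rng.gauss(0, 0.3)) ** 3)) for x in xs]

synthetic = synthesize_size(xs, employees, rng)
print(employees[:5], synthetic[:5])
```

Fitting within quartiles lets the residual variance differ between small and large establishments, which is one plausible reason for splitting the model by size.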

12
Overview

Multiple Imputation for Statistical Disclosure
Control
The IAB Establishment Panel
Application of the Two Approaches
Comparison of the Results
Conclusions

13
Analytical validity

Compare regression results from the original data with results from the synthetic data
First regression:
- Zwick (2005) analyses the productivity effects of different forms of continuing vocational training in Germany
- Probit regression to explain why firms offer vocational training
- 13 explanatory variables, including share of qualified employees, establishment size, industry, collective wage agreement, high qualification needs expected
Second regression:
- log(number of employees) on 15 industry dummies
Two data utility measures:
- comparison of the beta coefficients from the original data set and the synthetic data sets
- confidence interval overlap

14
Confidence interval overlap

Suggested by Karr et al. (2006)
Measure the overlap of CIs from the original data
and CIs from the synthetic data
The higher the overlap, the higher the data
utility
Compute the average relative CI overlap for any

[Figure: CI for the synthetic data compared with CI for the original data]

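A minimal sketch of the overlap measure, as we understand the definition in Karr et al. (2006): for one estimate, let (L_orig, U_orig) be the CI from the original data, (L_syn, U_syn) the CI from the synthetic data, and (L_over, U_over) their intersection. The overlap is then J = 1/2 x [(U_over - L_over)/(U_orig - L_orig) + (U_over - L_over)/(U_syn - L_syn)], averaged over all estimates. The function names are hypothetical, and setting J to 0 for disjoint intervals is an assumption of this sketch.

```python
def ci_overlap(orig, syn):
    """Relative overlap of two confidence intervals given as (lower, upper)."""
    lo_orig, up_orig = orig
    lo_syn, up_syn = syn
    # Intersection of the two intervals.
    lo_over, up_over = max(lo_orig, lo_syn), min(up_orig, up_syn)
    width = up_over - lo_over
    if width <= 0:
        return 0.0  # assumption: disjoint intervals count as zero overlap
    # Average of the overlap relative to each interval's own length.
    return 0.5 * (width / (up_orig - lo_orig) + width / (up_syn - lo_syn))


def average_overlap(orig_cis, syn_cis):
    """Mean overlap across all regression coefficients."""
    pairs = list(zip(orig_cis, syn_cis))
    return sum(ci_overlap(o, s) for o, s in pairs) / len(pairs)


# Made-up intervals for two coefficients.
original = [(0.0, 2.0), (0.0, 2.0)]
synthetic = [(0.0, 2.0), (1.0, 3.0)]
print(average_overlap(original, synthetic))
```

Identical intervals give 1, so values close to 1 indicate that an analyst would draw nearly the same inference from the synthetic data as from the original data.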
15
Results from the first regression (Zwick 2005)

16
Average confidence interval (CI) overlap for the estimates from the first regression
Average overlap: 0.808, 0.926

17
Results from the second regression (log(number of employees) on industry)
Legend: significant at the 0.1% level, significant at the 1% level, significant at the 5% level, not significant

18
Average confidence interval (CI) overlap for the estimates from the second regression
Average overlap: 0.699, 0.839

19
Disclosure risk

Difficult to compare between partially and fully synthetic data sets
Disclosure risk (DR) is low for fully synthetic data sets, although not zero
DR is higher for partially synthetic data sets, because:
- true values remain in the data set
- only survey respondents are included
For partially synthetic data sets, a careful disclosure risk evaluation is necessary

20
Overview

Multiple Imputation for Statistical Disclosure
Control
The IAB Establishment Panel
Application of the Two Approaches
Comparison of the Results
Conclusions

21
Conclusions

Generating synthetic data sets can be a useful method for statistical disclosure control (SDC)
Advantages of partially synthetic data sets:
- higher data validity
- imputation models easier to set up
- lower risk of biased imputations
Disadvantages of partially synthetic data sets:
- higher risk of disclosure
- careful disclosure risk evaluation necessary
Agencies will have to decide depending on the complexity of the survey and the expected risk of disclosure

22
Thank you for your attention
