Title: Methods of Geographical Perturbation for Disclosure Control
1Methods of Geographical Perturbation for
Disclosure Control
POPFEST June 2006, Liverpool
- Division of Social Statistics
- And Department of Geography
Caroline Young
Supervised jointly by Prof. Chris Skinner
(Statistics) and Prof. David Martin (Geography)
2Overview of Presentation
- Part I: Description of Disclosure Control
- Introduction to PhD topic: disclosure by differencing
- Part II: Methodology to protect against Differencing
- Conclusions and Future Work
3What is Disclosure Control?
- Protecting confidentiality of statistical data, particularly the Census
- UK Census: a promise given to respondents to protect confidentiality (also legal obligations)
- Disclosure control procedures are necessary to ensure confidentiality
4How can Disclosure Occur?
5What is Statistical Disclosure Control?
- Statistical Disclosure Control refers to
statistical methods which modify the data to
control the disclosure risk
6Disclosure by Differencing
Disclosure by geographical differencing occurs
when multiple geographies can be linked to reveal
new information
7Differencing from two geographies
Census User A wants Geography A.
8Differencing from two geographies
Census User B wants Geography B.
9Differencing from two geographies
Differenced area
Nested geography
Ref: Duke-Williams and Rees (1998)
10Disclosure by Geographical Differencing
Fictitious Table 1: Claimants in Small Area (to larger boundary)
Fictitious Table 2: Claimants in Small Area A (to smaller boundary)
11Disclosure by Geographical Differencing
Calculated Table 2: Claimants in Differenced Area
Differenced area in yellow
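The subtraction behind the differenced table can be sketched in a few lines. This is only an illustration: the age bands and every count below are invented (the values of the fictitious tables are not reproduced here), and the function names are hypothetical.

```python
# Hypothetical claimant counts by age band; all values are invented.
larger_boundary = {"16-24": 12, "25-44": 30, "45-64": 19, "65+": 8}
smaller_boundary = {"16-24": 12, "25-44": 29, "45-64": 19, "65+": 8}


def difference_tables(outer, inner):
    """Subtract the nested (inner) table from the outer table, cell by cell."""
    return {cell: outer[cell] - inner[cell] for cell in outer}


def disclosive_cells(table):
    """Return cells with a count of one: a single person in the sliver."""
    return [cell for cell, count in table.items() if count == 1]


differenced = difference_tables(larger_boundary, smaller_boundary)
print(differenced)
print(disclosive_cells(differenced))
```

Neither published table contains a count of one, yet the differenced area does, which is the new information a user holding both geographies could derive.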
12 Demand for Multiple Geographies
- Increased user demand for flexible or
non-standard geographies
13- Part II
- Methodology to protect against Differencing
14Random Record Swapping (UK Census 2001)
- Introduce uncertainty into the true geographical location of a subset of households
- Basic idea: swap the location of household A with the location of similar household B
- A unique household in an area (cell value of one) may not be the true household, since it may have been swapped. An intruder cannot disclose information with any certainty.
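The basic swap can be sketched as below. This is a minimal illustration, not the Census procedure: the `Household` record, the area codes and the similarity rule (same household size) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Household:
    hh_id: str
    area: str  # geographical location, e.g. a small-area code
    size: int  # the only matching variable used here (an assumption)


def are_similar(a, b):
    """Hypothetical similarity rule: same household size, different area."""
    return a.size == b.size and a.area != b.area


def swap_locations(a, b):
    """Exchange the geographical locations of two households in place."""
    a.area, b.area = b.area, a.area


hh_a = Household("A", area="area_1", size=3)
hh_b = Household("B", area="area_2", size=3)

if are_similar(hh_a, hh_b):
    swap_locations(hh_a, hh_b)

print(hh_a.area, hh_b.area)
```

Because only the location is exchanged, every other attribute of each household is unchanged. Table totals on the matching variable (here, household size) are preserved in both areas.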
15Assessing Performance of a Swapping Method
- Risk-Utility concept: finding a balance
- MAXIMISE UTILITY
- Measure of damage/utility: Average Absolute Deviation (AAD) per cell (averaged over all tables)
- MINIMISE RISK
- Measure of risk: proportion of true uniques in table (averaged over all tables)
- Identification Rate: proportion of cell counts of one which relate to the same household as in the original data
Each measure is computed over the cells of a table, using the number of cells in that table.
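The two measures might be computed along the following lines. This is a sketch under stated assumptions: each table is held as a mapping from cell to the set of household identifiers in that cell. A perturbed unique counts as "true" only if the same household is the sole occupant of that cell in the original data. All names and data are hypothetical.

```python
def cell_counts(cells):
    """Turn a cell -> household-ids mapping into a cell -> count mapping."""
    return {cell: len(ids) for cell, ids in cells.items()}


def average_absolute_deviation(original, perturbed):
    """Mean absolute difference between original and perturbed cell counts."""
    n_cells = len(original)
    return sum(abs(perturbed[c] - original[c]) for c in original) / n_cells


def identification_rate(original_ids, perturbed_ids):
    """Share of perturbed cells of size one holding the same household
    as the corresponding original cell (assumed definition)."""
    uniques = [c for c, ids in perturbed_ids.items() if len(ids) == 1]
    if not uniques:
        return 0.0
    same = sum(1 for c in uniques if perturbed_ids[c] == original_ids.get(c))
    return same / len(uniques)


# Invented example: four cells of one table, before and after swapping.
original = {"c1": {"h1"}, "c2": {"h2", "h3"}, "c3": {"h4"}, "c4": set()}
perturbed = {"c1": {"h1"}, "c2": {"h2"}, "c3": {"h3"}, "c4": {"h4"}}

aad = average_absolute_deviation(cell_counts(original), cell_counts(perturbed))
rate = identification_rate(original, perturbed)
print(aad, rate)
```

Averaging both quantities over all tables of interest would give the overall damage and risk figures for a swapping method.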
16Experiments
- Performed simulations on a synthetic census dataset
- Random record swapping method (UK Census 2001) used as benchmark to assess new approaches
- Examine disclosure risk at small area level (postcodes) since the aim is to protect slivers produced by differencing
- Some simplified results here
17Simulating Census Swaps
- Full details of methods are unknown as they are confidential
- MAKE A GUESS...
- (1) UK Random Record Swap
- Swap a random sample (10%) of households between Enumeration Districts (EDs) but not out of Local Authority district. Pair similar households (plus other constraints)
- (2) US Targeted Record Swap
- Swap 10% of risky households only (households that are unique)
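The two simulated schemes differ mainly in how the households to be swapped are chosen, which can be sketched as follows. Everything here is an assumption for illustration: the record layout, the matching `key`, and the reading that the targeted swap draws the same number of households as the random swap but only from those unique within their area. Pairing with a similar household in another ED is not shown.

```python
import random
from collections import Counter


def random_swap_sample(households, rate, rng):
    """Select a simple random sample of households to swap."""
    k = round(rate * len(households))
    return rng.sample(households, k)


def targeted_swap_sample(households, rate, rng):
    """Select households to swap from the risky ones only, i.e. those
    unique on (area, key) in this hypothetical layout."""
    counts = Counter((h["area"], h["key"]) for h in households)
    risky = [h for h in households if counts[(h["area"], h["key"])] == 1]
    k = min(len(risky), round(rate * len(households)))
    return rng.sample(risky, k)


# Invented data: 20 households in two EDs; ids ending in 7, 8 or 9 are unique.
households = [
    {
        "id": i,
        "area": "ED1" if i < 10 else "ED2",
        "key": "common" if i % 10 < 7 else f"type{i}",
    }
    for i in range(20)
]

rng = random.Random(42)
random_sample = random_swap_sample(households, 0.10, rng)
targeted_sample = targeted_swap_sample(households, 0.10, rng)
print([h["id"] for h in random_sample])
print([h["id"] for h in targeted_sample])
```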
18Disclosure Risk
In practice, other post-tabulation methods were also used (small cell adjustment) to offer more protection at small area level. But we need a pre-tabulation method: one method that protects data before aggregation.
19 100% swapping
- Reduce disclosure risk: swap ALL households
- Maximise utility: swap shorter distances (between adjacent postcodes instead of EDs)
- Disclosure risk is much reduced at small area level
- Too much damage at higher levels of aggregation
20Distance Swap
- Current swapping distances are dependent on pre-set geographies which have different shapes and population distributions. Plus boundaries often change
- New Distance swap: sample swapping distances from a distribution equivalent to 100% random swap (truncated normal with same mean and std)
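Sampling swap distances from a truncated normal could look like the sketch below. Rejection sampling is used only because it needs nothing beyond the standard library. The mean, standard deviation and truncation bounds are invented placeholders; in the experiments they would be taken from the distances observed when all households are randomly swapped.

```python
import random


def sample_truncated_normal(mean, std, lower, upper, rng):
    """Draw from normal(mean, std) restricted to [lower, upper] by rejection."""
    while True:
        x = rng.gauss(mean, std)
        if lower <= x <= upper:
            return x


rng = random.Random(0)

# Hypothetical parameters, in kilometres.
MEAN_KM, STD_KM, MIN_KM, MAX_KM = 2.0, 1.5, 0.0, 6.0

distances = [
    sample_truncated_normal(MEAN_KM, STD_KM, MIN_KM, MAX_KM, rng)
    for _ in range(1000)
]
print(min(distances), max(distances))
```

Each sampled distance would then be used to look for a similar household roughly that far away, independently of any ED or postcode boundary.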
21Density Swap
- How to improve distance swap?
- Want more control over damage and risk.
- Solution
- Low density areas are more vulnerable to disclosure attacks: fewer people living there. These households require greater perturbation.
- Households in high density areas are less risky and require perturbing smaller distances (also reduces damage).
22Density Swap
- Change sampling distribution: sample number of households
- Takes into account local population density
- Distance is not Euclidean but in terms of number of households
Urban area
Rural area
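Measuring distance in households rather than kilometres can be illustrated with a toy one-dimensional example. The positions, the spacing of urban and rural households and the fixed offset are all invented. In practice the offset would itself be sampled from a chosen distribution, and the ordering of households would need more care than a single sorted line.

```python
def partner_index(i, offset, n):
    """Index of the household `offset` places along the sorted list,
    clamped to the valid range."""
    return max(0, min(n - 1, i + offset))


# Hypothetical 1-D positions (km) of households, already sorted:
# ten urban households 0.1 km apart, then ten rural households 2 km apart.
urban = [round(0.1 * i, 1) for i in range(10)]
rural = [10.0 + 2.0 * i for i in range(10)]
positions = urban + rural

offset = 3  # swap with the household three places along the ordering

urban_move = positions[partner_index(2, offset, len(positions))] - positions[2]
rural_move = positions[partner_index(12, offset, len(positions))] - positions[12]
print(urban_move, rural_move)
```

The same offset of three households moves the rural household much further on the ground than the urban one, so the perturbation adapts to local population density.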
23Effectiveness of Density Swap
- Choice of sampling distribution is very important (normal, exponential, etc.)
- Sort households appropriately to control pairing of households
- Match households appropriately: definition of similar households
24Results of all 100% swaps
25Conclusions and Further Work
- Density Swap appears to be a good solution BUT need to examine other measures of damage and risk
- Is the density swap better than the combination of methods used on the 2001 Census? (swapping plus small cell adjustment)
- Discriminate between local-area uniques and wide-area uniques
26References
- Brown, D. (2003) Different Approaches to Disclosure Control Problems Associated with Geography. Joint ECE/Eurostat work session on statistical data confidentiality, Luxembourg.
- Duke-Williams, O. and Rees, P. (1998) Can Census Offices publish statistics for more than one small area geography? An analysis of the differencing problem in statistical disclosure. International Journal of Geographical Information Science, 12, 579-605.
- Elliot, M. J. (2005) An overview of Statistical Disclosure Control. Paper presented to RSS Social Statistics Committee conference on Linking survey and administrative data and statistical disclosure control, London, May 2005.
- Willenborg, L. and de Waal, T. (1996) Statistical Disclosure Control in Practice. Springer-Verlag, New York.
- Voas, D. and Williamson, P. (2000) An evaluation of the combinatorial optimisation approach to the creation of synthetic microdata. International Journal of Population Geography, 6, 349-366.