A Common Measure of Identity and Value Disclosure Risk (presentation transcript)


1
A Common Measure of Identity and Value Disclosure
Risk
  • Krish Muralidhar
  • University of Kentucky
  • krishm@uky.edu
  • Rathin Sarathy
  • Oklahoma State University
  • sarathy@okstate.edu

2
Context
  • This study presents developments in the context
    of numerical data that have been masked and
    released
  • We assume that the categorical data (if any) have
    not been masked
  • This assumption can be relaxed

3
Empirical Assessment of Disclosure Risk
  • Is there a link between identity and value disclosure that would allow us to use a common measure?

4
Basis for Disclosure
  • The strength of the relationship, in a
    multivariate sense, between the two datasets
    (original and masked) accounts for disclosure
    risk

5
Value Disclosure
  • Value disclosure is based on the strength of the relationship between the datasets:
  • Palley & Simonoff (1987): $R^2$ measure for individual variables
  • Tendick (1992): $R^2$ for linear combinations
  • Muralidhar & Sarathy (2002): canonical correlation
  • Implicit assumption: the snooper can use linear models to improve their prediction of confidential values (Palley & Simonoff 1987; Fuller 1993; Tendick 1992; Muralidhar & Sarathy 1999, 2001)

6
Identity Disclosure
  • Assessment of identity disclosure is often empirical in nature, e.g., Winkler's software (Census Bureau), based on a modified Fellegi-Sunter algorithm
  • The number (or proportion) of observations correctly re-identified represents an assessment of identity disclosure risk
  • Theoretical attempts for numerical data:
  • Fuller (1993): linear model
  • Tendick (1992): linear model
  • Fienberg, Makov, and Sanil (1997): Bayesian approach

7
Fuller's Measure
  • Given the masked dataset Y and the original dataset X, and assuming normality, the probability that the jth released record corresponds to any particular record the intruder may possess is $P_j = (\sum_t k_t)^{-1} k_j$
  • The intruder chooses the record j that maximizes $k_j$, given by
  • $k_j = \exp\{-0.5\,(X - YH)\,A^{-1}\,(X - YH)'\}$,
  • where $A = \Sigma_{XX} - \Sigma_{XY}(\Sigma_{YY})^{-1}\Sigma_{YX}$ and $H = (\Sigma_{YY})^{-1}\Sigma_{YX}$
  • $P_j$ may be treated as the identification probability (identity risk) of any particular record; averaging over every record gives a mean identification probability, or mean identity disclosure risk, for the whole masked dataset
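
A minimal numpy sketch of this computation, assuming the covariance blocks $\Sigma_{XX}$, $\Sigma_{XY}$, $\Sigma_{YY}$ are known or have been estimated; the function and variable names are ours, for illustration only:

import numpy as np

def fuller_probabilities(x, Y, Sxx, Sxy, Syy):
    # Fuller-style identification probabilities for one intruder record x
    # (length-k vector) against all n masked records Y (n x k array).
    H = np.linalg.solve(Syy, Sxy.T)              # H = Syy^{-1} Syx
    A = Sxx - Sxy @ H                            # A = Sxx - Sxy Syy^{-1} Syx
    resid = x - Y @ H                            # row j holds x - y_j H
    d2 = np.einsum('ij,ij->i', resid @ np.linalg.inv(A), resid)
    kj = np.exp(-0.5 * d2)                       # k_j for every released record
    return kj / kj.sum()                         # P_j = k_j / sum_t k_t

# The intruder links x to the released record with the largest P_j:
# j_star = np.argmax(fuller_probabilities(x, Y, Sxx, Sxy, Syy))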

8
Fuller's Distance Measure
  • Based on best conditional densities
  • While restricted to normal datasets, it relates identity risk to the association between the two datasets (though somewhat indirectly), as indicated by $k_j$, which contains $\Sigma_{XY}$
  • Shows the connection between distance-based
    measures and probability-based measures

9
Our Goal
  • To show that both value disclosure and identity
    disclosure are determined by the degree of
    association between the masked and original
    datasets. This must be true, since both are based
    on best predictors
  • When the best predictors are linear (e.g.,
    multivariate normal datasets) canonical
    correlation can capture the association, and both
    value disclosure and identity disclosure risk
    must be expressible in terms of canonical
    correlations
  • Already shown for value disclosure (Muralidhar et al. 1999; Sarathy et al. 2002); here we show the relationship between identity disclosure and canonical correlation

10
Canonical Correlation Version of Fuller's Distance Measure
  • $(X - YH)\,A^{-1}\,(X - YH)' = (U - V\Lambda^{0.5})\,C^{-1}\,(U - V\Lambda^{0.5})'$,
  • where
  • $U = X(\Sigma_{XX})^{-0.5}e$ (the canonical variates for the X variables)
  • $V = Y(\Sigma_{YY})^{-0.5}f$ (the canonical variates for the Y variables)
  • $C = (I - \Lambda)$
  • $e$ holds the eigenvectors of $(\Sigma_{XX})^{-0.5}\Sigma_{XY}(\Sigma_{YY})^{-1}\Sigma_{YX}(\Sigma_{XX})^{-0.5}$
  • $f$ holds the eigenvectors of $(\Sigma_{YY})^{-0.5}\Sigma_{YX}(\Sigma_{XX})^{-1}\Sigma_{XY}(\Sigma_{YY})^{-0.5}$
  • $\Lambda$ is the diagonal matrix of eigenvalues, whose diagonal entries are the squared canonical correlations
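
The decomposition on this slide can be computed directly. Below is a numpy sketch with our own helper names; it centers the data, sorts eigenvalues in descending order, and aligns signs so each canonical pair is positively correlated:

import numpy as np

def inv_sqrt(S):
    # symmetric inverse square root S^{-1/2} of a positive-definite matrix
    w, Q = np.linalg.eigh(S)
    return Q @ np.diag(w ** -0.5) @ Q.T

def canonical_variates(X, Y, Sxx, Sxy, Syy):
    # Canonical variates U, V and squared canonical correlations (the lambdas).
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    Rx, Ry = inv_sqrt(Sxx), inv_sqrt(Syy)
    lam, e = np.linalg.eigh(Rx @ Sxy @ np.linalg.inv(Syy) @ Sxy.T @ Rx)
    lam, e = lam[::-1], e[:, ::-1]               # descending eigenvalues
    _, f = np.linalg.eigh(Ry @ Sxy.T @ np.linalg.inv(Sxx) @ Sxy @ Ry)
    f = f[:, ::-1]
    U, V = Xc @ Rx @ e, Yc @ Ry @ f              # columns are the canonical variates
    V = V * np.sign((U * V).sum(axis=0))         # make each pair positively correlated
    return U, V, np.clip(lam, 0.0, 1.0 - 1e-12)  # keep (I - Lambda) invertible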

11
Therefore
  • Identity disclosure risk is a function of the (linear) association between the two datasets (the lambdas, which are the squares of the canonical correlations)
  • $(U - V\Lambda^{0.5})\,(I - \Lambda)^{-1}\,(U - V\Lambda^{0.5})'$ relates this association to identity disclosure and also provides an operational way to assess this risk
  • Compute this distance measure and match each original record to the masked record that minimizes the expression; the number of re-identified records then gives an overall empirical assessment of identity disclosure risk for a masked data release, as sketched below (empirical results shown later)
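
A sketch of that matching procedure, reusing canonical_variates from the previous slide's sketch:

import numpy as np

def reidentification_rate(U, V, lam):
    # Match every original record i to the masked record j minimizing
    # (u_i - v_j L^0.5)(I - L)^{-1}(u_i - v_j L^0.5)', then report the
    # fraction of records matched back to their true counterparts.
    w = 1.0 / (1.0 - lam)                        # diagonal of (I - Lambda)^{-1}
    Vs = V * np.sqrt(lam)                        # row j holds v_j Lambda^{0.5}
    d2 = (((U[:, None, :] - Vs[None, :, :]) ** 2) * w).sum(axis=2)
    return (np.argmin(d2, axis=1) == np.arange(U.shape[0])).mean()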

12
Mean Identification Probability (MIDP)
  • Tendick computed bounds on identification
    probabilities for correlated additive noise
    methods
  • His expressions are specific to that method and do not cover the general case
  • We show a lower bound on MIDP for the general
    case (regardless of masking technique) that is
    based on canonical correlations

13
Bound on MIDP
  • For a data set (size n) with k confidential
    variables X, masked using any procedure to result
    in Y, the mean identification probability is
    given by

14
Identification Probability (IDP)
  • For any given observation i in the original data
    set, the probability that it will be
    re-identified is given by
  • where $U_{ij}$ is the canonical variate for $X_{ij}$

15
An Example
  • Consider a data set with 10 variables and a
    specified covariance matrix
  • Assume that the data is to be perturbed using
    simple noise addition with different levels of
    variance
  • Compute MIDP for different sample sizes and
    different noise variances

16
Covariance Matrix of X
17
MIDP
18
Additive (Correlated) Noise
  • Kim (1986) suggested that the covariance structure of the noise term should be the same as that of the original confidential variables ($d\Sigma_{XX}$, where d is a constant representing the level of noise)
  • In this case, the canonical correlation for each (masked, original) variable pair is $1/(1+d)^{0.5}$; the simulation sketch below checks this
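
A quick simulation check of this value; the covariance matrix, dimensions, and noise level below are arbitrary choices for illustration, and canonical_variates is the sketch from slide 10:

import numpy as np

rng = np.random.default_rng(0)
n, k, d = 50_000, 5, 0.5                         # sample size, variables, noise level
B = rng.standard_normal((k, k))
Sxx = B @ B.T + k * np.eye(k)                    # an arbitrary positive-definite covariance
X = rng.multivariate_normal(np.zeros(k), Sxx, size=n)
Y = X + rng.multivariate_normal(np.zeros(k), d * Sxx, size=n)  # Kim-style correlated noise
S = np.cov(np.hstack([X, Y]).T)                  # sample covariance of the stacked data
U, V, lam = canonical_variates(X, Y, S[:k, :k], S[:k, k:], S[k:, k:])
print(np.sqrt(lam))                              # every entry near 1/(1 + d)**0.5 = 0.8165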

19
MIDP
20
Comparison of Simple Additive and Correlated Noise
  • For the same noise level:
  • Correlated noise results in higher identity disclosure risk; Tendick (1993) also observed this
  • Correlated noise results in lower value disclosure risk (Tendick and Matloff 1994; Muralidhar et al. 1999)

21
Other Procedures
  • For some other procedures (micro-aggregation, data swapping, etc.), it may be necessary to perform the masking and then compute the canonical correlations from the resulting data, as in the sketch below
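
For instance, given a paired original file X and masked file Y (however Y was produced), the lambdas can be estimated from the data; a numpy sketch, assuming n is large relative to the number of variables:

import numpy as np

def squared_canonical_correlations(X, Y):
    # Sample squared canonical correlations between the original file X
    # and the masked file Y (both n x k, rows paired).
    k = X.shape[1]
    S = np.cov(np.hstack([X, Y]).T)
    Sxx, Sxy, Syy = S[:k, :k], S[:k, k:], S[k:, k:]
    M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)  # Sxx^{-1} Sxy Syy^{-1} Syx
    return np.sort(np.linalg.eigvals(M).real)[::-1]              # descending lambdas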

22
Data sets with Categorical non-confidential
Variables
  • MIDP can be computed for subsets as well
  • Example:
  • Data set with 2000 observations
  • Six numerical variables
  • Three categorical (non-confidential) variables: gender, marital status, age group (1 to 6)
  • Masking procedure is a rank-based proximity swap

23
MIDP
24
Using IDP
  • We can use the IDP bound to implement a record re-identification procedure by choosing the masked record with the highest IDP value

25
An IDP Example
  • Data set consisting of 25 observations from a MVN(0,1) distribution
  • Perturbed using independent noise with variance 0.45
  • MIDP = 0.2375
  • Approximately 6 observations should be re-identified using this criterion
  • Re-identification by chance = 1/n = 0.04
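
This experiment is straightforward to replay empirically with the sketches from slides 10 and 11. The slide does not state the dimension of the MVN(0,1) data, so the number of variables k below is our assumption, and the resulting rate will differ accordingly:

import numpy as np

rng = np.random.default_rng(7)
n, k, reps = 25, 5, 2000                         # k = 5 is assumed, not from the slide
rate = 0.0
for _ in range(reps):
    X = rng.standard_normal((n, k))              # 25 observations from MVN(0, I)
    Y = X + rng.normal(0.0, 0.45 ** 0.5, (n, k)) # independent noise, variance 0.45
    # population blocks for this scheme: Sxx = I, Sxy = I, Syy = 1.45 I
    U, V, lam = canonical_variates(X, Y, np.eye(k), np.eye(k), 1.45 * np.eye(k))
    rate += reidentification_rate(U, V, lam) / reps
print(rate, 1 / n)                               # empirical rate vs. the 0.04 chance level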

26
An IDP Example
27
Advantages
  • Possible to compute MIDP with just aggregate information
  • Possible to use IDP as a record-linkage tool for assessing the disclosure risk characteristics of a masking technique
  • Computationally easier than existing alternatives

28
Disadvantages
  • Assumes that the data has a multivariate normal
    distribution
  • For large n, the lower bound is weak: MIDP appears to be overly pessimistic. We are working on finding out why this is so, and possibly on modifying the bound

29
Weak Bound?
  • Sample result: n = 50, simple noise addition

Noise variance   MIDP (lower bound)   Actual re-identification rate
0.10             0.990408             1.00
0.20             0.787552             0.94
0.30             0.034811             0.88
0.40             0.000000             0.72
0.50             0.000000             0.62
0.75             0.000000             0.46
1.00             0.000000             0.36
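
The "Actual" column comes from empirical matching of the kind described on slide 11. A harness like the sketch below reproduces the shape of this comparison; the covariance matrix here is an illustrative stand-in (the one actually used appears only as an image on slide 16), so the numbers will not match the table exactly:

import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 10
Sxx = 0.5 * np.eye(k) + 0.5 * np.ones((k, k))    # assumed equicorrelated covariance
for d in (0.10, 0.20, 0.30, 0.40, 0.50, 0.75, 1.00):
    X = rng.multivariate_normal(np.zeros(k), Sxx, size=n)
    Y = X + rng.multivariate_normal(np.zeros(k), d * np.eye(k), size=n)
    # population blocks for simple additive noise: Cov(X, Y) = Sxx, Cov(Y) = Sxx + d I
    U, V, lam = canonical_variates(X, Y, Sxx, Sxx, Sxx + d * np.eye(k))
    print(f"noise {d:.2f}: actual rate {reidentification_rate(U, V, lam):.2f}")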
30
Conclusion
  • Canonical correlation analysis can be used to
    assess both identity and value disclosure
  • For normal data, this provides the best measure
    of both identity and value disclosure

31
Further Research
  • Sensitivity to normality assumption
  • Comparison with Fellegi-Sunter based record
    linkage procedures
  • Refining the bounds

32
Our Research
  • You can find the details of our current and prior
    research at
  • http://gatton.uky.edu/faculty/muralidhar