A Common Measure of Identity and Value Disclosure Risk - PowerPoint PPT Presentation

A Common Measure of Identity and Value Disclosure Risk Krish Muralidhar University of Kentucky krishm_at_uky.edu Rathin Sarathy Oklahoma State University – PowerPoint PPT presentation

Learn more at: https://unece.org

1
A Common Measure of Identity and Value Disclosure
Risk
  • Krish Muralidhar
  • University of Kentucky
  • krishm_at_uky.edu
  • Rathin Sarathy
  • Oklahoma State University
  • sarathy_at_okstate.edu

2
Context
  • This study presents developments in the context
    of numerical data that have been masked and
    released
  • We assume that the categorical data (if any) have
    not been masked
  • This assumption can be relaxed

3
Empirical Assessment of Disclosure Risk
  • Is there a link between identity disclosure and
    value disclosure that would allow us to use a
    common measure for both?

4
Basis for Disclosure
  • The strength of the relationship, in a
    multivariate sense, between the two datasets
    (original and masked) accounts for disclosure
    risk

5
Value Disclosure
  • Value disclosure based on strength of
    relationship
  • Palley and Simonoff (1987) (R^2 measure for
    individual variables)
  • Tendick (1992) (R^2 for linear combinations)
  • Muralidhar and Sarathy (2002) (canonical
    correlation)
  • Implicit assumption: a snooper can use linear
    models to improve the prediction of
    confidential values (Palley and Simonoff 1987;
    Fuller 1993; Tendick 1992; Muralidhar and
    Sarathy 1999, 2001)
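
As a rough illustration of the per-variable R^2 idea listed above, the sketch below regresses each original variable on all masked variables and reports the share of variance explained. It is a minimal sketch, not the cited authors' exact procedure; the function name `value_disclosure_r2` and the row-per-record array layout are assumptions made for the example.

```python
import numpy as np

def value_disclosure_r2(X, Y):
    """R^2 from an ordinary least-squares fit of each column of X on all of Y.

    X: (n, k) original confidential values; Y: (n, m) masked values.
    Centering both arrays is equivalent to fitting an intercept.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    coef, *_ = np.linalg.lstsq(Yc, Xc, rcond=None)   # (m, k) regression weights
    resid = Xc - Yc @ coef
    return 1.0 - (resid ** 2).sum(axis=0) / (Xc ** 2).sum(axis=0)
```

A value near 1 for a variable means a snooper using a linear model could predict it almost exactly from the released data; a value near 0 means the released data adds little.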

6
Identity Disclosure
  • Assessment of identity disclosure is often
    empirical in nature, e.g., Winkler's software
    (Census Bureau), based on a modified
    Fellegi-Sunter algorithm
  • The number (or proportion) of observations
    correctly re-identified represents an assessment
    of identity disclosure risk
  • Theoretical attempts for numerical data:
  • Fuller (1993) (linear model)
  • Tendick (1992) (linear model)
  • Fienberg, Makov, and Sanil (1997) (Bayesian)

7
Fuller's Measure
  • Given the masked dataset Y and the original
    dataset X, and assuming normality, the
    probability that the jth released record
    corresponds to any particular record that the
    intruder may possess is given by
    $P_j = \left(\sum_t k_t\right)^{-1} k_j$.
  • The intruder chooses the record j that maximizes
    $k_j$, given by
  • $k_j = \exp\{-0.5\,(X - YH)\,A^{-1}\,(X - YH)'\}$,
  • where $A = \Sigma_{XX} - \Sigma_{XY}(\Sigma_{YY})^{-1}\Sigma_{YX}$ and
    $H = (\Sigma_{YY})^{-1}\Sigma_{YX}$
  • $P_j$ may be treated as the identification
    probability (identity risk) of any particular
    record; averaging over all records gives a
    mean identification probability, or mean identity
    disclosure risk, for the whole masked dataset
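
The computation on this slide can be sketched as follows for a single original record. This is a minimal sketch under stated assumptions: records are row vectors with zero mean, the covariance blocks are known, and the names `fuller_match_probabilities`, `S_xx`, `S_xy`, and `S_yy` are hypothetical labels for the example rather than anything defined in the presentation.

```python
import numpy as np

def fuller_match_probabilities(x, Y, S_xx, S_xy, S_yy):
    """P_j = k_j / sum_t k_t for one original record x against every masked row of Y.

    x: (k,) original record; Y: (n, k) masked records (zero-mean assumed).
    S_xx, S_xy, S_yy: covariance blocks of the original and masked variables.
    """
    H = np.linalg.solve(S_yy, S_xy.T)        # H = S_yy^{-1} S_yx
    A = S_xx - S_xy @ H                      # covariance of X given Y
    resid = x - Y @ H                        # row j holds x - y_j H
    d2 = np.einsum("ij,ij->i", resid @ np.linalg.inv(A), resid)
    k = np.exp(-0.5 * (d2 - d2.min()))       # common shift cancels in the ratio
    return k / k.sum()
```

The intruder's best guess is the index returned by `argmax` on the result. Subtracting the smallest distance before exponentiating does not change the probabilities; it only avoids underflow when the distances are large.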

8
Fuller's Distance Measure
  • Based on best conditional densities
  • While restricted to normal datasets, it relates
    identity risk to the association between the two
    datasets (though somewhat indirectly), as
    indicated by $k_j$, which contains $\Sigma_{XY}$.
  • Shows the connection between distance-based
    measures and probability-based measures

9
Our Goal
  • To show that both value disclosure and identity
    disclosure are determined by the degree of
    association between the masked and original
    datasets. This must be true, since both are based
    on best predictors
  • When the best predictors are linear (e.g.,
    multivariate normal datasets) canonical
    correlation can capture the association, and both
    value disclosure and identity disclosure risk
    must be expressible in terms of canonical
    correlations
  • Already shown for value disclosure (Muralidhar et
    al. 1999, and Sarathy et al. 2002). We will show
    here the relationship between identity disclosure
    and canonical correlation

10
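Canonical Correlation Version of Fuller's
Distance Measure
  • $(X - YH)\,A^{-1}\,(X - YH)' = (U - V\Lambda^{0.5})\,C^{-1}\,(U - V\Lambda^{0.5})'$,
  • where
  • $U = X(\Sigma_{XX})^{-0.5}e$ (the canonical variates for the X
    variables)
  • $V = Y(\Sigma_{YY})^{-0.5}f$ (the canonical variates for the Y
    variables)
  • $C = (I - \Lambda)$
  • e is the matrix of eigenvectors of
    $(\Sigma_{XX})^{-0.5}\Sigma_{XY}(\Sigma_{YY})^{-1}\Sigma_{YX}(\Sigma_{XX})^{-0.5}$
  • f is the matrix of eigenvectors of
    $(\Sigma_{YY})^{-0.5}\Sigma_{YX}(\Sigma_{XX})^{-1}\Sigma_{XY}(\Sigma_{YY})^{-0.5}$
  • $\Lambda$ is the diagonal matrix of eigenvalues; its
    diagonal entries are the squared canonical
    correlations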
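
The equality between the two forms of the distance can be checked numerically. The sketch below is one way to do that, assuming X and Y have the same number of variables and a nonsingular joint covariance; the function names are hypothetical. It takes e and f from a single singular value decomposition of the whitened cross-covariance, which is an implementation choice made here so that the signs of the paired eigenvectors agree.

```python
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root of a positive definite matrix."""
    w, v = np.linalg.eigh(S)
    return v @ np.diag(w ** -0.5) @ v.T

def canonical_form(S_xx, S_xy, S_yy):
    """Return (Tx, Ty, lam) with U = X @ Tx, V = Y @ Ty, lam = squared canonical correlations."""
    Sx, Sy = inv_sqrt(S_xx), inv_sqrt(S_yy)
    e, s, f_t = np.linalg.svd(Sx @ S_xy @ Sy)   # whitened cross-covariance = e diag(s) f'
    return Sx @ e, Sy @ f_t.T, s ** 2

def fuller_distance(x, y, S_xx, S_xy, S_yy):
    """(x - yH) A^{-1} (x - yH)' in the original coordinates."""
    H = np.linalg.solve(S_yy, S_xy.T)
    A = S_xx - S_xy @ H
    r = x - y @ H
    return r @ np.linalg.solve(A, r)

def canonical_distance(x, y, S_xx, S_xy, S_yy):
    """(u - v L^0.5) (I - L)^{-1} (u - v L^0.5)' in canonical coordinates."""
    Tx, Ty, lam = canonical_form(S_xx, S_xy, S_yy)
    r = x @ Tx - (y @ Ty) * np.sqrt(lam)
    return np.sum(r ** 2 / (1.0 - lam))
```

Because I - Lambda is diagonal, the canonical form reduces the quadratic form to a weighted sum of squares, one term per canonical pair.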

11
Therefore
  • Identity disclosure risk is a function of the
    (linear) association between the two datasets
    (the lambdas, which are the squares of the
    canonical correlations)
  • $(U - V\Lambda^{0.5})\,(I - \Lambda)^{-1}\,(U - V\Lambda^{0.5})'$ relates this
    association to identity disclosure and
    provides an operational way to assess this
    risk
  • Compute this distance measure and match each
    original record to the masked record that
    minimizes the expression. The number of
    re-identified records then gives an overall
    empirical assessment of identity disclosure risk
    for a masked data release. (Empirical results
    are shown later.)
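
The matching procedure described on this slide might look like the following when the covariances are estimated from the data. It is a sketch under assumptions not stated in the slides: row i of Y is the masked version of row i of X, both datasets have the same number of variables, and no canonical correlation is exactly 1. The name `count_reidentified` is hypothetical.

```python
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root of a positive definite matrix."""
    w, v = np.linalg.eigh(S)
    return v @ np.diag(w ** -0.5) @ v.T

def count_reidentified(X, Y):
    """Match each original record to its nearest masked record in canonical space.

    Returns the number of original records i whose closest masked record is row i.
    """
    n, k = X.shape
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    S = np.cov(np.hstack([Xc, Yc]), rowvar=False)
    Sx, Sy = inv_sqrt(S[:k, :k]), inv_sqrt(S[k:, k:])
    e, s, f_t = np.linalg.svd(Sx @ S[:k, k:] @ Sy)
    lam = s ** 2                                    # squared canonical correlations
    U = Xc @ Sx @ e                                 # canonical variates of X
    V = Yc @ Sy @ f_t.T                             # canonical variates of Y
    diff = U[:, None, :] - (V * np.sqrt(lam))[None, :, :]
    d2 = (diff ** 2 / (1.0 - lam)).sum(axis=2)      # d2[i, j]: original i vs masked j
    return int(np.sum(d2.argmin(axis=1) == np.arange(n)))
```

Dividing the count by n gives the empirical proportion of re-identified records. The pairwise distance array has n x n x k entries, so this form is only practical for modest n.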

12
Mean Identification Probability (MIDP)
  • Tendick computed bounds on identification
    probabilities for correlated additive noise
    methods
  • His expressions are specific to the method and
    not for the general case
  • We show a lower bound on MIDP for the general
    case (regardless of masking technique) that is
    based on canonical correlations

13
Bound on MIDP
  • For a data set (size n) with k confidential
    variables X, masked using any procedure to result
    in Y, the mean identification probability is
    given by

14
Identification Probability (IDP)
  • For any given observation i in the original data
    set, the probability that it will be
    re-identified is given by
  • where Uij is the canonical variate for Xij

15
An Example
  • Consider a data set with 10 variables and a
    specified covariance matrix
  • Assume that the data is to be perturbed using
    simple noise addition with different levels of
    variance
  • Compute MIDP for different sample sizes and
    different noise variances

16
Covariance Matrix of X

17
MIDP

18
Additive (Correlated) Noise
  • Kim (1986) suggested that the covariance structure
    of the noise term should be the same as that of
    the original confidential variables ($d\Sigma_{XX}$),
    where d is a constant representing the level of
    noise
  • In this case, the canonical correlation for each
    (masked, original) variable pair is $1/(1+d)^{0.5}$
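
The canonical correlation stated on this slide follows in a few steps. The derivation below is a sketch that assumes the noise term (written here as $\varepsilon$, a symbol not used in the slides) is independent of X.

```latex
\begin{align*}
Y &= X + \varepsilon, \qquad \operatorname{Cov}(\varepsilon) = d\,\Sigma_{XX} \\
\Sigma_{YY} &= \Sigma_{XX} + d\,\Sigma_{XX} = (1+d)\,\Sigma_{XX}, \qquad
\Sigma_{XY} = \Sigma_{YX} = \Sigma_{XX} \\
(\Sigma_{XX})^{-0.5}\,\Sigma_{XY}\,(\Sigma_{YY})^{-1}\,\Sigma_{YX}\,(\Sigma_{XX})^{-0.5}
  &= \frac{1}{1+d}\,(\Sigma_{XX})^{-0.5}\,\Sigma_{XX}\,(\Sigma_{XX})^{-1}\,\Sigma_{XX}\,(\Sigma_{XX})^{-0.5}
   = \frac{1}{1+d}\,I
\end{align*}
```

Every eigenvalue therefore equals $1/(1+d)$, so each squared canonical correlation is $1/(1+d)$ and each canonical correlation is $1/(1+d)^{0.5}$.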

19
MIDP

20
Comparison of Simple Additive and Correlated Noise
  • For the same noise level:
  • Correlated noise results in higher identity
    disclosure risk (Tendick (1993) also observed
    this)
  • Correlated noise results in lower value
    disclosure risk (Tendick and Matloff 1994;
    Muralidhar et al. 1999)

21
Other Procedures
  • For some other procedures (micro-aggregation,
    data swapping, etc.), it may be necessary to
    perform the masking and use the data to compute
    the canonical correlations
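
When the canonical correlations have to be computed from the masked and original data themselves, one standard numerical route is a QR decomposition of each centered dataset followed by a singular value decomposition. The sketch below uses that route; it is not taken from the presentation, and the function name is hypothetical. It assumes each dataset has full column rank and more records than variables.

```python
import numpy as np

def sample_canonical_correlations(X, Y):
    """Sample canonical correlations between the columns of X and the columns of Y.

    X: (n, k) original data; Y: (n, m) masked data, rows aligned by record.
    """
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))   # orthonormal basis for centered X
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))   # orthonormal basis for centered Y
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)                # guard against rounding just outside [0, 1]
```

The function only needs the two data arrays, so it can be applied after any masking procedure, including ones with no closed-form covariance such as swapping or micro-aggregation.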

22
Data sets with Categorical non-confidential
Variables
  • MIDP can be computed for subsets as well
  • Example
  • Data set with 2000 observations
  • Six numerical variables
  • Three categorical (non-confidential) variables
  • Gender
  • Marital status
  • Age group (1-6)
  • Masking procedure is Rank Based Proximity Swap

23
MIDP

24
Using IDP
  • We can use the IDP bound to implement a record
    re-identification procedure by choosing the
    masked record with the highest IDP value

25
An IDP Example
  • Data set consisting of 25 observations from an
    MVN(0,1) distribution
  • Perturbed using independent noise with variance
    0.45
  • MIDP = 0.2375
  • Approximately 6 observations should be
    re-identified using this criterion
  • Re-identification by chance = 1/n = 0.04
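
The two figures on this slide follow from simple arithmetic, treating the expected number of re-identified records as the sum of the per-record identification probabilities:

```latex
\[
  n \times \text{MIDP} = 25 \times 0.2375 = 5.9375 \approx 6,
  \qquad
  \frac{1}{n} = \frac{1}{25} = 0.04
\]
```

On these numbers the MIDP is roughly six times the chance rate.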

26
An IDP Example

27
Advantages
  • Possible to compute MIDP with just aggregate
    information
  • Possible to use IDP as record-linkage tool for
    assessing disclosure risk characteristics of a
    masking technique
  • Computationally easier than alternative existing
    methods

28
Disadvantages
  • Assumes that the data have a multivariate normal
    distribution
  • For large n, the lower bound is weak. MIDP
    appears to be overly pessimistic; we are working
    to find out why and, possibly, to modify the
    bound.

29
Weak Bound?
  • Sample result
  • n = 50
  • Simple noise addition

Noise   MIDP       Actual
0.10    0.990408   1.00
0.20    0.787552   0.94
0.30    0.034811   0.88
0.40    0.000000   0.72
0.50    0.000000   0.62
0.75    0.000000   0.46
1.00    0.000000   0.36

30
Conclusion
  • Canonical correlation analysis can be used to
    assess both identity and value disclosure
  • For normal data, this provides the best measure
    of both identity and value disclosure

31
Further Research
  • Sensitivity to normality assumption
  • Comparison with Fellegi-Sunter based record
    linkage procedures
  • Refining the bounds

32
Our Research
  • You can find the details of our current and prior
    research at
  • http://gatton.uky.edu/faculty/muralidhar