
Title: Anatomy: Simple and Effective Privacy Preservation

1
Anatomy: Simple and Effective Privacy Preservation

Xiaokui Xiao, Yufei Tao
Chinese University of Hong Kong

2
Privacy preserving data publishing

Purposes
Allow researchers to effectively study the
correlation between various attributes
Protect the privacy of every patient

Microdata
Name   Age  Sex  Zipcode  Disease
Bob    23   M    11000    pneumonia
Ken    27   M    13000    dyspepsia
Peter  35   M    59000    dyspepsia
Sam    59   M    12000    pneumonia
Jane   61   F    54000    flu
Linda  65   F    25000    gastritis
Alice  65   F    25000    flu
Mandy  70   F    30000    bronchitis

Published table (the microdata with the Name column removed)
Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis

Simply removing the names does not work. See the next slide.
4
Inference attack
An adversary knows that Bob
has been hospitalized before
is 23 years old
lives in an area with zipcode 11000

Published table
Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis

Age, Sex and Zipcode are quasi-identifier (QI) attributes.
Only the first row matches what the adversary knows, so the adversary learns that Bob has pneumonia.
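
The linking step behind this attack can be sketched in a few lines of Python. The rows are copied from the published table on the slide; the function name `link` and the tuple layout are illustrative assumptions, not part of the presentation.

```python
# Published table from the slide: (Age, Sex, Zipcode, Disease), names removed.
PUBLISHED = [
    (23, "M", 11000, "pneumonia"),
    (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"),
    (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"),
    (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"),
    (70, "F", 30000, "bronchitis"),
]


def link(published, age, zipcode):
    """Return the diseases of all published rows matching the known QI values."""
    return [disease for (a, _sex, z, disease) in published
            if a == age and z == zipcode]


# The adversary knows that Bob is 23 and lives in zipcode 11000.
bob_candidates = link(PUBLISHED, age=23, zipcode=11000)
print(bob_candidates)  # a single candidate means Bob's disease is disclosed
```

When several rows share the same QI values (for example age 65, zipcode 25000), the adversary is left with more than one candidate disease, which is the intuition that generalization builds on.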
5
Generalization

Transform each QI value into a less specific form

A generalized table
Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

What the adversary knows
Name  Age  Sex  Zipcode
Bob   23   M    11000

How much generalization do we need?
6
l-diversity

A QI-group with m tuples is l-diverse, iff each
sensitive value appears no more than m / l times
in the QI-group.
A table is l-diverse, iff all of its QI-groups
are l-diverse.
The table below is 2-diverse. It has 2 QI-groups.

Quasi-identifier (QI) attributes: Age, Sex, Zipcode
Sensitive attribute: Disease

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis
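
The definition on this slide translates directly into a check. The sketch below is a minimal illustration; the function names and the representation of a QI-group as a list of sensitive values are assumptions made here.

```python
from collections import Counter


def is_l_diverse(group, l):
    """A QI-group with m tuples is l-diverse iff no sensitive value
    appears more than m / l times in it."""
    m = len(group)
    # count <= m / l, written with integers to avoid floating point.
    return all(count * l <= m for count in Counter(group).values())


def table_is_l_diverse(groups, l):
    """A table is l-diverse iff all of its QI-groups are l-diverse."""
    return all(is_l_diverse(group, l) for group in groups)


# Sensitive values of the two QI-groups in the generalized table.
GROUPS = [
    ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],
    ["flu", "gastritis", "flu", "bronchitis"],
]

print(table_is_l_diverse(GROUPS, 2))
print(table_is_l_diverse(GROUPS, 3))
```

The first call succeeds because every disease appears at most 4 / 2 = 2 times in its group; the second fails because 2 occurrences exceed 4 / 3.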
7
What l-diversity guarantees

From an l-diverse generalized table, an adversary
(without any prior knowledge) can infer the
sensitive value of each individual with
confidence at most 1/l

A 2-diverse generalized table
Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

What the adversary knows
Name  Age  Sex  Zipcode
Bob   23   M    11000

A. Machanavajjhala et al. l-Diversity: Privacy
Beyond k-Anonymity. ICDE 2006.
8
Defect of generalization

Query A: SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = 'pneumonia' AND Age IN [0, 30]
AND Zipcode IN [10001, 20000]

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

Estimated answer: 2p, where p is the
probability that each of the two pneumonia tuples satisfies
the query conditions

9
Defect of generalization (cont.)

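Query A: SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = 'pneumonia' AND Age IN [0, 30]
AND Zipcode IN [10001, 20000]

The two pneumonia tuples in the generalized table
Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  pneumonia

p = Area(R1 ∩ Q) / Area(R1) = 0.05, where R1 is the (Age, Zipcode)
rectangle [21, 60] × [10001, 60000] of these tuples and Q is the
query rectangle [0, 30] × [10001, 20000]
Estimated answer for query A: 2p = 0.1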
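
The value p = 0.05 can be reproduced with a short calculation. The sketch below assumes that a generalized tuple is equally likely to take any integer Age and Zipcode inside its intervals, and it counts integer values rather than measuring continuous area; the helper name `overlap_fraction` is hypothetical.

```python
from fractions import Fraction


def overlap_fraction(group_range, query_range):
    """Fraction of the integer values in group_range that also lie in query_range."""
    g_lo, g_hi = group_range
    q_lo, q_hi = query_range
    lo, hi = max(g_lo, q_lo), min(g_hi, q_hi)
    covered = max(0, hi - lo + 1)
    return Fraction(covered, g_hi - g_lo + 1)


# Generalized intervals of the pneumonia tuples versus the ranges of query A.
p_age = overlap_fraction((21, 60), (0, 30))            # ages 21..30 out of 21..60
p_zip = overlap_fraction((10001, 60000), (10001, 20000))
p = p_age * p_zip

# Two generalized tuples carry Disease = pneumonia.
estimate = 2 * p
print(p, estimate)
```

Under these assumptions p is 1/4 × 1/5 = 1/20, so the estimate is 1/10, far from the exact answer of 1.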
10
Defect of generalization (cont.)

Query A: SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = 'pneumonia' AND Age IN [0, 30]
AND Zipcode IN [10001, 20000]
Estimated answer from the generalized table: 0.1

The exact answer should be 1

V. S. Iyengar. Transforming data to satisfy privacy constraints. KDD 2002.
K. Wang, P. S. Yu and S. Chakraborty. Bottom-Up Generalization: A Data Mining Solution to Privacy Protection. ICDM 2004.
R. J. Bayardo Jr. and R. Agrawal. Data Privacy through Optimal k-Anonymization. ICDE 2005.
B. C. M. Fung, K. Wang and P. S. Yu. Top-Down Specialization for Information and Privacy Preservation. ICDE 2005.
K. LeFevre, D. J. DeWitt and R. Ramakrishnan. Incognito: Efficient Full-Domain K-Anonymity. SIGMOD 2005.
K. LeFevre, D. J. DeWitt and R. Ramakrishnan. Mondrian Multidimensional K-Anonymity. ICDE 2006.
D. Kifer and J. Gehrke. Injecting utility into anonymized datasets. SIGMOD 2006.
X. Xiao and Y. Tao. Personalized privacy preservation. SIGMOD 2006.
K. Wang and B. C. M. Fung. Anonymization for Sequential Releases. KDD 2006.
K. LeFevre, D. J. DeWitt and R. Ramakrishnan. Workload-Aware Anonymization. KDD 2006.
J. Xu, W. Wang, J. Pei et al. Utility-Based Anonymization Using Local Recodings. KDD 2006.

12
Contributions

We propose Anatomy, an alternative technique to
generalization, which allows much
more accurate data analysis while still
preserving privacy.
We develop an algorithm for computing anatomized
tables that
runs in linear I/O cost
(nearly) minimizes information loss

13
Outline

Basic Idea of Anatomy
Preserving Correlation
Algorithm for Anatomy
Experimental Results

14
Basic Idea of Anatomy

For a given microdata table, Anatomy releases a
quasi-identifier table (QIT) and a sensitive
table (ST)

microdata
Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis

Quasi-identifier Table (QIT)
Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

Sensitive Table (ST)
Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1
15
Basic Idea of Anatomy (cont.)

1. Select a partition of the tuples

a 2-diverse partition
Age  Sex  Zipcode  Disease

QI group 1
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia

QI group 2
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis
16
Basic Idea of Anatomy (cont.)

2. Generate a quasi-identifier table (QIT) and a
sensitive table (ST) based on the selected
partition

quasi-identifier table (QIT)
Age  Sex  Zipcode

group 1
23   M    11000
27   M    13000
35   M    59000
59   M    12000

group 2
61   F    54000
65   F    25000
65   F    25000
70   F    30000

sensitive table (ST)
Disease

group 1
pneumonia
dyspepsia
dyspepsia
pneumonia

group 2
flu
gastritis
flu
bronchitis
17
Basic Idea of Anatomy (cont.)

2. Generate a quasi-identifier table (QIT) and a
sensitive table (ST) based on the selected
partition

quasi-identifier table (QIT)
Age  Sex  Zipcode  Group-ID

23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1

61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

sensitive table (ST)
Group-ID  Disease

1         pneumonia
1         dyspepsia
1         dyspepsia
1         pneumonia

2         flu
2         gastritis
2         flu
2         bronchitis
18
Basic Idea of Anatomy (cont.)

2. Generate a quasi-identifier table (QIT) and a
sensitive table (ST) based on the selected
partition

quasi-identifier table (QIT)
Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

sensitive table (ST)
Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1
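
The two steps above (select a partition, then split each group into QIT rows and ST counts) can be sketched as follows. This is a minimal illustration on the running example; the function name `anatomize_partition` and the use of row indices to describe the partition are assumptions made here.

```python
from collections import Counter

# Microdata from the running example: (Age, Sex, Zipcode, Disease).
MICRODATA = [
    (23, "M", 11000, "pneumonia"),
    (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"),
    (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"),
    (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"),
    (70, "F", 30000, "bronchitis"),
]

# Step 1: the 2-diverse partition from the slides, given as row indices.
PARTITION = [[0, 1, 2, 3], [4, 5, 6, 7]]


def anatomize_partition(rows, partition):
    """Step 2: build the QIT (exact QI values plus Group-ID) and the
    ST (per-group counts of each sensitive value)."""
    qit, st = [], []
    for group_id, members in enumerate(partition, start=1):
        for i in members:
            age, sex, zipcode, _disease = rows[i]
            qit.append((age, sex, zipcode, group_id))
        counts = Counter(rows[i][3] for i in members)
        for disease in sorted(counts):
            st.append((group_id, disease, counts[disease]))
    return qit, st


QIT, ST = anatomize_partition(MICRODATA, PARTITION)
for row in ST:
    print(row)
```

Note that the QIT keeps every QI value exact; only the link between an individual row and its disease is hidden behind the Group-ID.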
19
Privacy Preservation

From a pair of QIT and ST generated from an
l-diverse partition, the adversary can infer the
sensitive value of each individual with
confidence at most 1/l

What the adversary knows
Name  Age  Sex  Zipcode
Bob   23   M    11000

quasi-identifier table (QIT)
Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

sensitive table (ST)
Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1
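
The 1/l bound can be checked on the example ST. The sketch below assumes that the adversary has located the individual's Group-ID in the QIT and considers every tuple of that group equally likely to be the individual's; the function name `max_confidence` is hypothetical.

```python
from fractions import Fraction

# Sensitive table from the slide: (Group-ID, Disease, Count).
ST = [
    (1, "dyspepsia", 2),
    (1, "pneumonia", 2),
    (2, "bronchitis", 1),
    (2, "flu", 2),
    (2, "gastritis", 1),
]


def max_confidence(st, group_id):
    """Highest probability the adversary can assign to any single disease
    for an individual known to be in the given group."""
    counts = [count for (g, _disease, count) in st if g == group_id]
    return Fraction(max(counts), sum(counts))


# Bob's QI values (23, M, 11000) appear in group 1 of the QIT.
print(max_confidence(ST, 1))
print(max_confidence(ST, 2))
```

Both groups give 1/2, which matches the bound 1/l for the 2-diverse partition.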
20
Accuracy of Data Analysis

Query A: SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = 'pneumonia' AND Age IN [0, 30]
AND Zipcode IN [10001, 20000]

QIT rows of group 1
Tuple  Age  Sex  Zipcode  Group-ID
t1     23   M    11000    1
t2     27   M    13000    1
t3     35   M    59000    1
t4     59   M    12000    1

2 patients in group 1 have contracted pneumonia (from the ST)
2 out of 4 patients (t1 and t2) satisfy the query condition
on Age and Zipcode
Estimated answer for query A: 2 × 2/4 = 1,
which is also the actual result from the original
microdata
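
The estimate for query A from the anatomized tables can be sketched as below. For each group, the ST count of the queried disease is scaled by the fraction of QIT rows in that group that satisfy the QI conditions; the function name `estimate_count` is an assumption made for this illustration.

```python
from fractions import Fraction

# QIT rows: (Age, Sex, Zipcode, Group-ID); ST rows: (Group-ID, Disease, Count).
QIT = [
    (23, "M", 11000, 1), (27, "M", 13000, 1),
    (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2),
    (65, "F", 25000, 2), (70, "F", 30000, 2),
]
ST = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]


def estimate_count(qit, st, disease, age_range, zip_range):
    """Estimate COUNT(*) for a query on Disease, Age and Zipcode."""
    total = Fraction(0)
    for gid in sorted({g for (_a, _s, _z, g) in qit}):
        members = [(a, z) for (a, _s, z, g) in qit if g == gid]
        matching = sum(
            1 for (a, z) in members
            if age_range[0] <= a <= age_range[1]
            and zip_range[0] <= z <= zip_range[1]
        )
        with_disease = sum(c for (g, d, c) in st if g == gid and d == disease)
        total += with_disease * Fraction(matching, len(members))
    return total


answer = estimate_count(QIT, ST, "pneumonia", (0, 30), (10001, 20000))
print(answer)
```

Group 1 contributes 2 × 2/4 and group 2 contributes nothing, since its ST rows contain no pneumonia.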
22
Outline

Rationale of Anatomy
Preserving Correlation
Algorithm for Anatomy
Experimental Results

23
Preserving Correlation

Let us first examine the correlation between Age
and Disease in our running example
Each tuple in the microdata can be mapped to a
point in the (Age, Disease) domain
For example, tuple t1 below can be mapped to (23, pneumonia).

Tuple  Age  Sex  Zipcode  Disease
t1     23   M    11000    pneumonia
....
24
Preserving Correlation (cont.)

We model this tuple using a probability density
function (pdf)

25
Preserving Correlation (cont.)

In the generalized table, the tuple becomes
Its corresponding pdf becomes

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia

26
Preserving Correlation (cont.)

In the anatomized tables, the tuple becomes
Its corresponding pdf becomes

Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2

Age Sex Zipcode Group-ID
23 M 11000 1
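
The pdf figures from these slides are not part of the transcript. As a hedged sketch, the three pdfs of tuple t1 over the (Age, Disease) domain could be written as follows. The symbols G, G^gen and G^ana are notation introduced here, and the generalized pdf assumes that Age is spread uniformly over the 40 integer values in [21, 60].

```latex
\begin{align*}
G_{t_1}(a, d) &=
  \begin{cases}
    1 & \text{if } (a, d) = (23, \text{pneumonia}) \\
    0 & \text{otherwise}
  \end{cases} \\[4pt]
G^{gen}_{t_1}(a, d) &=
  \begin{cases}
    1/40 & \text{if } a \in [21, 60] \text{ and } d = \text{pneumonia} \\
    0 & \text{otherwise}
  \end{cases} \\[4pt]
G^{ana}_{t_1}(a, d) &=
  \begin{cases}
    2/4 & \text{if } a = 23 \text{ and } d \in \{\text{pneumonia}, \text{dyspepsia}\} \\
    0 & \text{otherwise}
  \end{cases}
\end{align*}
```

Under this reading, generalization blurs the Age coordinate while keeping the disease exact, whereas anatomy keeps Age exact and spreads the probability over the diseases recorded in the ST for group 1.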

27
Preserving Correlation (cont.)

28
Outline

Rationale of Anatomy
Preserving Correlation
Algorithm for Anatomy
Experimental Results

29
Quality Metric
For each approximated pdf, we measure its
error from the original pdf by their L2
distance
We aim at obtaining anatomized tables that
minimize the re-construction error (RCE),
the sum of these errors over all tuples
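
A small sketch of this metric on the running example follows. It rests on assumptions not stated on the slide: each tuple's original pdf puts probability 1 on its true disease, the anatomized pdf assigns each disease its ST count divided by the group size, and the per-tuple error is the squared L2 distance between the two, summed over diseases. The function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def tuple_error(true_value, group_values):
    """Squared L2 distance between a tuple's original and anatomized pdf."""
    m = len(group_values)
    error = Fraction(0)
    for value, count in Counter(group_values).items():
        original = 1 if value == true_value else 0
        error += (Fraction(count, m) - original) ** 2
    return error


def rce(groups):
    """Re-construction error: the per-tuple errors summed over all tuples."""
    return sum(tuple_error(value, group) for group in groups for value in group)


# Sensitive values of the two groups in the 2-diverse partition.
GROUPS = [
    ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],
    ["flu", "gastritis", "flu", "bronchitis"],
]

print(tuple_error("pneumonia", GROUPS[0]))
print(rce(GROUPS))
```

For t1 the error is (1/2 - 1)^2 + (1/2)^2 = 1/2. Groups whose diseases are more evenly spread give each tuple a flatter pdf, so the choice of partition directly affects the total.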

30
Anatomize

An algorithm for computing anatomized tables that
runs in I/O cost linear in the cardinality n of
the microdata table
minimizes the RCE when n is a multiple of l, and
otherwise achieves an RCE that is higher than the
lower bound by a factor of at most 1 + 1/n
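
The slide does not spell out the algorithm. The sketch below shows one greedy, bucket-based way to form an l-diverse partition: tuples are hashed by sensitive value, each new group takes one tuple from each of the l largest buckets, and leftovers join a group that lacks their value. It is an illustration only and is not claimed to reproduce the authors' Anatomize algorithm or its I/O and RCE guarantees; the function name is hypothetical.

```python
from collections import defaultdict


def greedy_l_diverse_groups(sensitive_values, l):
    """Partition tuple indices into groups of pairwise distinct sensitive values."""
    buckets = defaultdict(list)
    for idx, value in enumerate(sensitive_values):
        buckets[value].append(idx)

    groups = []
    while sum(1 for b in buckets.values() if b) >= l:
        # Take one tuple from each of the l currently largest buckets
        # (ties broken alphabetically to keep the result deterministic).
        largest = sorted((v for v in buckets if buckets[v]),
                         key=lambda v: (-len(buckets[v]), v))[:l]
        groups.append([buckets[v].pop() for v in largest])

    # Leftover tuples join any existing group that lacks their sensitive value.
    for value, bucket in buckets.items():
        for idx in bucket:
            for group in groups:
                if all(sensitive_values[i] != value for i in group):
                    group.append(idx)
                    break
            else:
                raise ValueError("this greedy sketch found no l-diverse partition")
    return groups


DISEASES = ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia",
            "flu", "gastritis", "flu", "bronchitis"]
for group in greedy_l_diverse_groups(DISEASES, 2):
    print([DISEASES[i] for i in group])
```

Every group produced this way holds at least l tuples with distinct sensitive values, so each value appears once and the l-diversity condition holds for the group.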

31
Outline

Rationale of Anatomy
Preserving Correlation
Algorithm for Anatomy
Experimental Results

32
Experimental Settings

Goal: to compare the accuracy of data analysis on
the generalized / anatomized tables.
Real dataset with 9 attributes:
Age, Gender, Education, Marital-status, Race,
Work-class, Country,
Occupation, Salary-class
Tables OCC-d and SAL-d (d = 3, 4, 5, 6, 7), for example
OCC-3: Age, Gender, Education, Occupation
SAL-4: Age, Gender, Education, Marital-status, Salary-class
Cardinality: 100k, 200k, 300k, 400k, 500k
33
Experimental Settings (cont.)

competitor: multi-dimensional generalization
l = 10
avg. relative error for 10000 aggregate queries:
|act - est| / act
qd = 1, 2, ..., d
s = 1, ..., 5, ..., 10
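
The error measure can be sketched as below, reading act as the actual query answer and est as the estimated one. Skipping queries whose actual answer is 0 is an assumption made here, since the ratio is undefined in that case; the function name is hypothetical.

```python
def average_relative_error(pairs):
    """Average of |act - est| / act over (act, est) pairs with act != 0."""
    errors = [abs(act - est) / act for act, est in pairs if act != 0]
    return sum(errors) / len(errors)


# Query A from the running example: the actual answer is 1.
generalization_error = average_relative_error([(1, 0.1)])
anatomy_error = average_relative_error([(1, 1)])
print(generalization_error, anatomy_error)
```

On query A alone, the generalized table is off by 90% while the anatomized tables give the exact answer.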

34
Accuracy of Data Analysis (cont.)
C.C. Aggarwal. On k-anonymity and the curse of
dimensionality. VLDB 2005
35
Accuracy of Data Analysis (cont.)
36
Accuracy of Data Analysis (cont.)
37
Computation Overhead
38
Summary

Anatomy outperforms generalization by allowing
much more accurate data analysis on the published
data.
Anatomized tables (with a nearly optimal quality
guarantee) can be computed in I/O cost linear in
the database cardinality.

39
Thank you!

Datasets and implementation are available for
download at
http://www.cse.cuhk.edu.hk/taoyf

40
Anatomy vs. Generalization Revisit

Sometimes the adversary is not sure whether an
individual appears in the microdata or not

A 2-diverse generalized table
Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

A Voter Registration List
Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Mark   40   M    30000
Ric    50   M    40000
Sam    59   M    12000
41
Anatomy vs. Generalization Revisit

From the adversary's perspective
Bob has 4 / 6 probability to be in the microdata
If Bob indeed appears in the microdata, there is 2 /
4 probability that he has contracted pneumonia
So Bob has 4/6 × 2/4 = 1/3 probability to have
contracted pneumonia

A 2-diverse generalized table
Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia

A Voter Registration List
Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Mark   40   M    30000
Ric    50   M    40000
Sam    59   M    12000
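
The 1/3 figure can be reproduced from the two tables. The sketch below assumes that every voter whose values fall inside the generalized QI-group is equally likely to be one of its 4 tuples; the function name `breach_probability` and the dictionary layout of the group are illustrative assumptions.

```python
from fractions import Fraction

# Voter registration list: (Name, Age, Sex, Zipcode).
VOTERS = [
    ("Bob", 23, "M", 11000), ("Ken", 27, "M", 13000),
    ("Peter", 35, "M", 59000), ("Mark", 40, "M", 30000),
    ("Ric", 50, "M", 40000), ("Sam", 59, "M", 12000),
]

# First QI-group of the 2-diverse generalized table.
GROUP = {
    "age": (21, 60),
    "sex": "M",
    "zipcode": (10001, 60000),
    "diseases": ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],
}


def breach_probability(voters, group, disease):
    """P(in microdata) * P(disease | in microdata) for a voter matching the group."""
    candidates = [
        name for (name, age, sex, zipcode) in voters
        if group["age"][0] <= age <= group["age"][1]
        and sex == group["sex"]
        and group["zipcode"][0] <= zipcode <= group["zipcode"][1]
    ]
    size = len(group["diseases"])
    p_member = Fraction(size, len(candidates))
    p_disease = Fraction(group["diseases"].count(disease), size)
    return p_member * p_disease


print(breach_probability(VOTERS, GROUP, "pneumonia"))
```

All 6 voters match the generalized intervals while the group holds only 4 tuples, which is where the 4/6 factor comes from.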

42
Anatomy vs. Generalization Revisit
A Voter Registration List
Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Mark   40   M    30000
Ric    50   M    40000
Sam    59   M    12000

2-diverse QIT
Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1

2-diverse ST
Group-ID  Disease    Count
1         dyspepsia  2
1         pneumonia  2

The adversary knows that
Bob must appear in the microdata, since the QIT
publishes his exact QI values
There is 1/2 probability that Bob has contracted
pneumonia
43
Anatomy vs. Generalization Revisit

For a given value of l, l-diverse generalization
may lead to higher privacy protection than
l-diverse anatomy does.
But this is not always the case, since
the external database may not contain any
irrelevant individuals
the adversary may know that some individuals
indeed appear in the microdata

Name Age Sex Zipcode
Bob 23 M 11000
Ken 27 M 13000
Peter 35 M 59000
Mark 40 M 30000
Ric 50 M 40000
Sam 59 M 12000
