Title: Anatomy: Simple and Effective Privacy Preservation
1Anatomy: Simple and Effective Privacy Preservation
- Xiaokui Xiao, Yufei Tao
- Chinese University of Hong Kong
2Privacy preserving data publishing
- Microdata
- Purposes
- Allow researchers to effectively study the correlation between various attributes
- Protect the privacy of every patient
Name Age Sex Zipcode Disease
Bob 23 M 11000 pneumonia
Ken 27 M 13000 dyspepsia
Peter 35 M 59000 dyspepsia
Sam 59 M 12000 pneumonia
Jane 61 F 54000 flu
Linda 65 F 25000 gastritis
Alice 65 F 25000 flu
Mandy 70 F 30000 bronchitis
3A naïve solution
- Publish the table after removing the Name column
- It does not work. See next.
Name Age Sex Zipcode Disease
Bob 23 M 11000 pneumonia
Ken 27 M 13000 dyspepsia
Peter 35 M 59000 dyspepsia
Sam 59 M 12000 pneumonia
Jane 61 F 54000 flu
Linda 65 F 25000 gastritis
Alice 65 F 25000 flu
Mandy 70 F 30000 bronchitis
Age Sex Zipcode Disease
23 M 11000 pneumonia
27 M 13000 dyspepsia
35 M 59000 dyspepsia
59 M 12000 pneumonia
61 F 54000 flu
65 F 25000 gastritis
65 F 25000 flu
70 F 30000 bronchitis
publish
4Inference attack
Published table
- An adversary knows that Bob
- has been hospitalized before
- is 23 years old
- lives in an area with zipcode 11000
Age Sex Zipcode Disease
23 M 11000 pneumonia
27 M 13000 dyspepsia
35 M 59000 dyspepsia
59 M 12000 pneumonia
61 F 54000 flu
65 F 25000 gastritis
65 F 25000 flu
70 F 30000 bronchitis
Quasi-identifier (QI) attributes
5Generalization
- Transform each QI value into a less specific form
A generalized table
Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] pneumonia
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] gastritis
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] bronchitis
Name Age Sex Zipcode
Bob 23 M 11000
How much generalization do we need?
6l-diversity
- A QI-group with m tuples is l-diverse iff each sensitive value appears no more than m / l times in the QI-group.
- A table is l-diverse iff all of its QI-groups are l-diverse.
- The table below is 2-diverse.
Quasi-identifier (QI) attributes
Sensitive attribute
Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] pneumonia
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] gastritis
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] bronchitis
2 QI-groups
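The l-diversity condition can be checked mechanically. The following is a minimal sketch, assuming each QI-group is given simply as the list of its sensitive values; the function name `is_l_diverse` is hypothetical and not part of the talk.

```python
from collections import Counter

def is_l_diverse(groups, l):
    """Return True iff every QI-group satisfies the condition on the slide:
    each sensitive value appears at most m / l times among the group's m tuples."""
    for sensitive_values in groups:
        m = len(sensitive_values)
        counts = Counter(sensitive_values)
        # Compare count * l <= m to avoid floating-point division.
        if any(count * l > m for count in counts.values()):
            return False
    return True

# Sensitive values of the two QI-groups in the generalized table.
groups = [
    ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],
    ["flu", "gastritis", "flu", "bronchitis"],
]

print(is_l_diverse(groups, 2))  # True
print(is_l_diverse(groups, 3))  # False
```

In group 1 each disease appears 2 times out of 4, so the table is 2-diverse but not 3-diverse.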
7What l-diversity guarantees
- From an l-diverse generalized table, an adversary
(without any prior knowledge) can infer the
sensitive value of each individual with
confidence at most 1/l
A 2-diverse generalized table
Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] pneumonia
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] gastritis
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] bronchitis
Name Age Sex Zipcode
Bob 23 M 11000
A. Machanavajjhala et al. l-Diversity: Privacy
Beyond k-Anonymity. ICDE 2006.
8Defect of generalization
- Query A: SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age IN [0, 30]
  AND Zipcode IN [10001, 20000]
Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] pneumonia
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] gastritis
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] bronchitis
- Estimated answer: 2p, where p is the
probability that each of the two pneumonia tuples
satisfies the query conditions on Age and Zipcode
9Defect of generalization (cont.)
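- Query A: SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age IN [0, 30]
  AND Zipcode IN [10001, 20000]
- p = Area(R1 ∩ Q) / Area(R1) = 0.05, where R1 is the (Age, Zipcode) region covered by the first QI-group and Q is the query region
- Estimated answer for query A: 2p = 0.1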
Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] pneumonia
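The value 0.05 can be reproduced with a short calculation. This is a minimal sketch, assuming Age and Zipcode are integer-valued and that tuples are uniformly distributed inside a generalized region; the helper names are hypothetical.

```python
def interval_size(lo, hi):
    """Number of integer values in the closed interval [lo, hi] (0 if empty)."""
    return max(0, hi - lo + 1)

def overlap_fraction(region, query):
    """Fraction of a generalized region that falls inside the query region.

    Both arguments are lists of (lo, hi) pairs, one pair per QI attribute.
    """
    overlap = 1
    total = 1
    for (r_lo, r_hi), (q_lo, q_hi) in zip(region, query):
        overlap *= interval_size(max(r_lo, q_lo), min(r_hi, q_hi))
        total *= interval_size(r_lo, r_hi)
    return overlap / total

r1 = [(21, 60), (10001, 60000)]  # Age and Zipcode ranges of QI-group 1
q = [(0, 30), (10001, 20000)]    # Age and Zipcode ranges of query A

p = overlap_fraction(r1, q)
estimate = 2 * p  # two pneumonia tuples in QI-group 1
print(p, estimate)
```

Under these assumptions the overlap covers 10 of 40 ages and 10000 of 50000 zipcodes, which gives p = 0.05 and an estimate of 0.1.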
10Defect of generalization (cont.)
- Query A: SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age IN [0, 30]
  AND Zipcode IN [10001, 20000]
- Estimated answer from the generalized table: 0.1
- The exact answer should be 1
Name Age Sex Zipcode Disease
Bob 23 M 11000 pneumonia
Ken 27 M 13000 dyspepsia
Peter 35 M 59000 dyspepsia
Sam 59 M 12000 pneumonia
Jane 61 F 54000 flu
Linda 65 F 25000 gastritis
Alice 65 F 25000 flu
Mandy 70 F 30000 bronchitis
11Research Works on Generalization
- V. S. Iyengar. Transforming data to satisfy privacy constraints. KDD 2002.
- K. Wang, P. S. Yu and S. Chakraborty. Bottom-Up Generalization: A Data Mining Solution to Privacy Protection. ICDM 2004.
- R. J. Bayardo Jr. and R. Agrawal. Data Privacy through Optimal k-Anonymization. ICDE 2005.
- B. C. M. Fung, K. Wang and P. S. Yu. Top-Down Specialization for Information and Privacy Preservation. ICDE 2005.
- K. LeFevre, D. J. DeWitt and R. Ramakrishnan. Incognito: Efficient Full-Domain K-Anonymity. SIGMOD 2005.
- K. LeFevre, D. J. DeWitt and R. Ramakrishnan. Mondrian Multidimensional K-Anonymity. ICDE 2006.
- D. Kifer and J. Gehrke. Injecting utility into anonymized datasets. SIGMOD 2006.
- X. Xiao and Y. Tao. Personalized privacy preservation. SIGMOD 2006.
- K. Wang and B. C. M. Fung. Anonymization for Sequential Releases. KDD 2006.
- K. LeFevre, D. J. DeWitt and R. Ramakrishnan. Workload-Aware Anonymization. KDD 2006.
- J. Xu, W. Wang, J. Pei, et al. Utility-Based Anonymization Using Local Recodings. KDD 2006.
12Contributions
- We propose an alternative to generalization, called Anatomy, which allows much more accurate data analysis while still preserving privacy.
- We develop an algorithm for computing anatomized tables that
- runs in linear I/Os
- (nearly) minimizes information loss
13Outline
- Basic Idea of Anatomy
- Preserving Correlation
- Algorithm for Anatomy
- Experimental Results
14Basic Idea of Anatomy
- For a given microdata table, Anatomy releases a
quasi-identifier table (QIT) and a sensitive
table (ST)
Age Sex Zipcode Disease
23 M 11000 pneumonia
27 M 13000 dyspepsia
35 M 59000 dyspepsia
59 M 12000 pneumonia
61 F 54000 flu
65 F 25000 gastritis
65 F 25000 flu
70 F 30000 bronchitis
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
Sensitive Table (ST)
Quasi-identifier Table (QIT)
microdata
15Basic Idea of Anatomy (cont.)
- 1. Select a partition of the tuples
Age Sex Zipcode Disease
23 M 11000 pneumonia
27 M 13000 dyspepsia
35 M 59000 dyspepsia
59 M 12000 pneumonia
61 F 54000 flu
65 F 25000 gastritis
65 F 25000 flu
70 F 30000 bronchitis
QI group 1
QI group 2
a 2-diverse partition
16Basic Idea of Anatomy (cont.)
- 2. Generate a quasi-identifier table (QIT) and a
sensitive table (ST) based on the selected
partition
Disease
pneumonia
dyspepsia
dyspepsia
pneumonia
flu
gastritis
flu
bronchitis
Age Sex Zipcode
23 M 11000
27 M 13000
35 M 59000
59 M 12000
61 F 54000
65 F 25000
65 F 25000
70 F 30000
group 1
group 2
quasi-identifier table (QIT)
sensitive table (ST)
17Basic Idea of Anatomy (cont.)
- 2. Generate a quasi-identifier table (QIT) and a
sensitive table (ST) based on the selected
partition
Group-ID Disease
1 pneumonia
1 dyspepsia
1 dyspepsia
1 pneumonia
2 flu
2 gastritis
2 flu
2 bronchitis
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
quasi-identifier table (QIT)
sensitive table (ST)
18Basic Idea of Anatomy (cont.)
- 2. Generate a quasi-identifier table (QIT) and a
sensitive table (ST) based on the selected
partition
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
sensitive table (ST)
quasi-identifier table (QIT)
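The two steps (pick a partition, then split it into a QIT and an ST) can be sketched in a few lines. This is an illustrative sketch only: the function name `anatomize` and the tuple layout (age, sex, zipcode, disease) are assumptions, and the partition is taken as given rather than computed.

```python
from collections import Counter

def anatomize(groups):
    """Build (QIT, ST) from a partition.

    groups: list of QI-groups, each a list of (age, sex, zipcode, disease).
    QIT rows are (age, sex, zipcode, group_id);
    ST rows are (group_id, disease, count).
    """
    qit, st = [], []
    for gid, group in enumerate(groups, start=1):
        for age, sex, zipcode, _disease in group:
            qit.append((age, sex, zipcode, gid))
        counts = Counter(disease for *_, disease in group)
        for disease in sorted(counts):
            st.append((gid, disease, counts[disease]))
    return qit, st

microdata = [
    (23, "M", 11000, "pneumonia"),
    (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"),
    (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"),
    (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"),
    (70, "F", 30000, "bronchitis"),
]
# The 2-diverse partition used on the slides: first four tuples, last four tuples.
partition = [microdata[:4], microdata[4:]]
qit, st = anatomize(partition)
for row in st:
    print(row)
```

The QI values are published exactly as they are; only the link between a tuple and its sensitive value is blurred to the group level.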
19Privacy Preservation
- From a pair of QIT and ST generated from an
l-diverse partition, the adversary can infer the
sensitive value of each individual with
confidence at most 1/l
Name Age Sex Zipcode
Bob 23 M 11000
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
sensitive table (ST)
quasi-identifier table (QIT)
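The 1/l bound can be illustrated on the running example. The sketch below assumes the adversary knows the target's exact QI values and that the target is in the microdata; the function name `breach_probability` is hypothetical, and averaging over several matching QIT rows is a simplifying assumption.

```python
from fractions import Fraction

qit = [
    (23, "M", 11000, 1), (27, "M", 13000, 1),
    (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2),
    (65, "F", 25000, 2), (70, "F", 30000, 2),
]
st = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]

def breach_probability(qit, st, qi_values, disease):
    """Adversary's confidence that the person with these QI values has `disease`."""
    matches = [gid for *qi, gid in qit if tuple(qi) == tuple(qi_values)]
    if not matches:
        return Fraction(0)
    total = Fraction(0)
    for gid in matches:
        group_size = sum(count for g, _, count in st if g == gid)
        hits = sum(count for g, d, count in st if g == gid and d == disease)
        total += Fraction(hits, group_size)
    return total / len(matches)

# Bob: 23, M, 11000 falls in group 1, where 2 of 4 tuples are pneumonia.
print(breach_probability(qit, st, (23, "M", 11000), "pneumonia"))  # 1/2
```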
20Accuracy of Data Analysis
- Query A: SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age IN [0, 30]
  AND Zipcode IN [10001, 20000]
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
sensitive table (ST)
quasi-identifier table (QIT)
21Accuracy of Data Analysis (cont.)
- Query A: SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age IN [0, 30]
  AND Zipcode IN [10001, 20000]
- 2 patients in group 1 have contracted pneumonia
- 2 out of the 4 patients in group 1 satisfy the query conditions on Age and Zipcode
- Estimated answer for query A: 2 × 2/4 = 1, which is also the actual result from the original microdata
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
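The same estimate can be computed from a QIT/ST pair. This is a minimal sketch, assuming the row layouts (age, sex, zipcode, group_id) and (group_id, disease, count); the function name `estimate_count` is hypothetical.

```python
qit = [
    (23, "M", 11000, 1), (27, "M", 13000, 1),
    (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2),
    (65, "F", 25000, 2), (70, "F", 30000, 2),
]
st = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]

def estimate_count(qit, st, disease, age_range, zip_range):
    """Estimate COUNT(*) for a query on Disease, Age and Zipcode.

    Per group: (tuples with the disease) * (fraction of the group's QIT rows
    that satisfy the Age and Zipcode conditions).
    """
    total = 0.0
    for gid in sorted({row[3] for row in qit}):
        rows = [row for row in qit if row[3] == gid]
        qualifying = sum(
            1 for age, _sex, zipcode, _gid in rows
            if age_range[0] <= age <= age_range[1]
            and zip_range[0] <= zipcode <= zip_range[1]
        )
        with_disease = sum(c for g, d, c in st if g == gid and d == disease)
        total += with_disease * qualifying / len(rows)
    return total

print(estimate_count(qit, st, "pneumonia", (0, 30), (10001, 20000)))  # 1.0
```

Because the QIT keeps exact Age and Zipcode values, the fraction 2/4 is measured rather than guessed from a generalized region.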
22Outline
- Rationale of Anatomy
- Preserving Correlation
- Algorithm for Anatomy
- Experimental Results
23Preserving Correlation
- Let us first examine the correlation between Age and Disease in our running example
- Each tuple in the microdata can be mapped to a point in the (Age, Disease) domain
- The tuple t1 below maps to (23, pneumonia).
Age Sex Zipcode Disease
23 M 11000 pneumonia
....
t1
24Preserving Correlation (cont.)
- We model this tuple using a probability density
function (pdf)
25Preserving Correlation (cont.)
- In the generalized table, the tuple becomes
- Its corresponding pdf becomes
Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
26Preserving Correlation (cont.)
- In the anatomized tables, the tuple becomes
- Its corresponding pdf becomes
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
Age Sex Zipcode Group-ID
23 M 11000 1
27Preserving Correlation (cont.)
28Outline
- Rationale of Anatomy
- Preserving Correlation
- Algorithm for Anatomy
- Experimental Results
29Quality Metric
the original pdf
the approximated pdf
- For each approximated pdf, we measure its error from the original pdf by their L2 distance
- We aim at obtaining anatomized tables that minimize the resulting reconstruction error (RCE)
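The formulas on these slides did not survive extraction, so the following is only a sketch of how such an error could be computed for tuple t1. The pdf shapes are assumptions: a single spike at (23, pneumonia) for the original, a uniform spread over ages 21 to 60 for generalization, and a split according to the ST counts for anatomy. Using the squared L2 distance is likewise an assumption.

```python
def squared_l2_error(approx, exact):
    """Sum of squared differences between two discrete pdfs.

    Each pdf is a dict mapping an (age, disease) point to its probability.
    """
    points = set(approx) | set(exact)
    return sum((approx.get(pt, 0.0) - exact.get(pt, 0.0)) ** 2 for pt in points)

# Original pdf of t1: all mass on its true (Age, Disease) point.
exact = {(23, "pneumonia"): 1.0}

# Generalization (assumed): Age is only known to lie in [21, 60], Disease is exact.
generalized = {(age, "pneumonia"): 1 / 40 for age in range(21, 61)}

# Anatomy (assumed): Age is exact, Disease follows the ST counts of group 1.
anatomized = {(23, "pneumonia"): 0.5, (23, "dyspepsia"): 0.5}

err_gen = squared_l2_error(generalized, exact)
err_ana = squared_l2_error(anatomized, exact)
print(err_gen, err_ana)
```

Under these assumptions the error for t1 is about 0.975 with generalization and 0.5 with anatomy; summing such errors over all tuples would give one overall reconstruction error.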
30Anatomize
- An algorithm for computing anatomized tables that
- runs in I/O cost linear in the cardinality n of the microdata table
- minimizes the RCE when n is a multiple of l, and otherwise achieves an RCE that is higher than the lower bound by a factor of at most 1 + 1/n
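The slides state the properties of Anatomize but not its steps. The sketch below shows one simple bucket-based greedy way to obtain an l-diverse partition; it is an illustration under that assumption, not necessarily the authors' procedure, and it makes no claim about I/O cost or RCE optimality. The function name is hypothetical.

```python
from collections import Counter, defaultdict

def group_creation_sketch(tuples, l):
    """Partition tuples (sensitive value in the last field) into l-diverse groups."""
    # Hash tuples into buckets, one bucket per sensitive value.
    buckets = defaultdict(list)
    for t in tuples:
        buckets[t[-1]].append(t)

    groups = []
    while True:
        non_empty = [v for v in buckets if buckets[v]]
        if len(non_empty) < l:
            break
        # Draw one tuple from each of the l currently largest buckets,
        # so every new group holds l distinct sensitive values.
        largest = sorted(non_empty, key=lambda v: len(buckets[v]), reverse=True)[:l]
        groups.append([buckets[v].pop() for v in largest])

    # Leftover tuples join a group that does not yet hold their sensitive value.
    for value, bucket in buckets.items():
        for t in bucket:
            target = next(
                (g for g in groups if all(u[-1] != value for u in g)), None
            )
            if target is None:
                raise ValueError("this sketch found no l-diverse partition")
            target.append(t)
    return groups

microdata = [
    (23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"), (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"), (70, "F", 30000, "bronchitis"),
]
groups = group_creation_sketch(microdata, l=2)
for g in groups:
    print(g)
```

Since every group keeps its sensitive values distinct, each value appears once among at least l tuples, which satisfies the m / l condition.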
31Outline
- Rationale of Anatomy
- Preserving Correlation
- Algorithm for Anatomy
- Experimental Results
32Experimental Settings
- Goal: to compare the accuracy of data analysis on the generalized and anatomized tables.
- Real dataset with 9 attributes
- Age, Gender, Education, Marital-status, Race, Work-class, Country, Occupation, Salary-class
- OCC-d, SAL-d (d = 3, 4, 5, 6, 7)
- OCC-3: Age, Gender, Education, Occupation
- SAL-4: Age, Gender, Education, Marital-status, Salary-class
- Cardinality: 100k, 200k, 300k, 400k, 500k
33Experimental Settings (cont.)
- Competitor: multi-dimensional generalization
- l = 10
- Average relative error over 10000 aggregate queries
- Relative error = |act - est| / act
- qd = 1, 2, ..., d
- s = 1, ..., 5, ..., 10
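The error metric can be sketched as follows. The function name is hypothetical, and skipping queries whose actual answer is 0 (where the relative error is undefined) is an assumption, not something the slides state.

```python
def average_relative_error(actual, estimated):
    """Mean of |act - est| / act over a workload of aggregate queries."""
    pairs = [(act, est) for act, est in zip(actual, estimated) if act != 0]
    if not pairs:
        raise ValueError("no query with a non-zero actual answer")
    return sum(abs(act - est) / act for act, est in pairs) / len(pairs)

# Query A from the running example: actual answer 1,
# estimate 0.1 from the generalized table and 1 from the anatomized tables.
print(average_relative_error([1], [0.1]))
print(average_relative_error([1], [1]))
```

On query A alone, the generalized estimate is off by about 90%, while the anatomized estimate has no error.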
34Accuracy of Data Analysis (cont.)
C.C. Aggarwal. On k-anonymity and the curse of
dimensionality. VLDB 2005
35Accuracy of Data Analysis (cont.)
36Accuracy of Data Analysis (cont.)
37Computation Overhead
38Summary
- Anatomy outperforms generalization by allowing much more accurate data analysis on the published data.
- Anatomized tables (with nearly optimal quality guarantee) can be computed in I/O cost linear in the database cardinality.
39Thank you!
- Datasets and implementation are available for download at http://www.cse.cuhk.edu.hk/taoyf
40Anatomy vs. Generalization Revisit
- Sometimes the adversary is not sure whether an
individual appears in the microdata or not
A 2-diverse generalized table
A Voter Registration List
Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] pneumonia
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] gastritis
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] bronchitis
Name Age Sex Zipcode
Bob 23 M 11000
Ken 27 M 13000
Peter 35 M 59000
Mark 40 M 30000
Ric 50 M 40000
Sam 59 M 12000
41Anatomy vs. Generalization Revisit
- From the adversary's perspective
- Bob has 4/6 probability to be in the microdata
- If Bob indeed appears in the microdata, there is 2/4 probability that he has contracted pneumonia
- So Bob has 4/6 × 2/4 = 1/3 probability to have contracted pneumonia
A 2-diverse generalized table
A Voter Registration List
Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] pneumonia
Name Age Sex Zipcode
Bob 23 M 11000
Ken 27 M 13000
Peter 35 M 59000
Mark 40 M 30000
Ric 50 M 40000
Sam 59 M 12000
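The 1/3 figure can be checked with exact fractions. This is a small sketch of the adversary's reasoning, assuming every voter who fits the generalized QI values is equally likely to be one of the 4 published tuples; the variable names are hypothetical.

```python
from fractions import Fraction

voters = [
    ("Bob", 23, "M", 11000), ("Ken", 27, "M", 13000),
    ("Peter", 35, "M", 59000), ("Mark", 40, "M", 30000),
    ("Ric", 50, "M", 40000), ("Sam", 59, "M", 12000),
]

# Voters whose QI values fit the first generalized QI-group.
candidates = [
    name for name, age, sex, zipcode in voters
    if 21 <= age <= 60 and sex == "M" and 10001 <= zipcode <= 60000
]

group_size = 4        # tuples in the first QI-group
pneumonia_count = 2   # pneumonia tuples in that group

p_in_microdata = Fraction(group_size, len(candidates))       # 4/6
p_pneumonia_if_present = Fraction(pneumonia_count, group_size)  # 2/4
p_pneumonia = p_in_microdata * p_pneumonia_if_present

print(len(candidates), p_pneumonia)  # 6 1/3
```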
42Anatomy vs. Generalization Revisit
Name Age Sex Zipcode
Bob 23 M 11000
Ken 27 M 13000
Peter 35 M 59000
Mark 40 M 30000
Ric 50 M 40000
Sam 59 M 12000
- The adversary knows that
- Bob must appear in the microdata
- There is 1/2 probability that Bob has contracted
pneumonia
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
2-diverse ST
2-diverse QIT
43Anatomy vs. Generalization Revisit
- For a given value of l, l-diverse generalization may lead to higher privacy protection than l-diverse anatomy does.
- But this is not always the case, since
- the external database may not contain any irrelevant individuals
- the adversary may know that some individuals indeed appear in the microdata
Name Age Sex Zipcode
Bob 23 M 11000
Ken 27 M 13000
Peter 35 M 59000
Mark 40 M 30000
Ric 50 M 40000
Sam 59 M 12000