Anatomy: Simple and Effective Privacy Preservation

1
Anatomy: Simple and Effective Privacy Preservation
  • Xiaokui Xiao, Yufei Tao
  • Chinese University of Hong Kong

2
Privacy preserving data publishing
  • Microdata
  • Purposes
  • Allow researchers to effectively study the
    correlation between various attributes
  • Protect the privacy of every patient

Name Age Sex Zipcode Disease
Bob 23 M 11000 pneumonia
Ken 27 M 13000 dyspepsia
Peter 35 M 59000 dyspepsia
Sam 59 M 12000 pneumonia
Jane 61 F 54000 flu
Linda 65 F 25000 gastritis
Alice 65 F 25000 flu
Mandy 70 F 30000 bronchitis
3
A naïve solution
  • It does not work. See next.

Name Age Sex Zipcode Disease
Bob 23 M 11000 pneumonia
Ken 27 M 13000 dyspepsia
Peter 35 M 59000 dyspepsia
Sam 59 M 12000 pneumonia
Jane 61 F 54000 flu
Linda 65 F 25000 gastritis
Alice 65 F 25000 flu
Mandy 70 F 30000 bronchitis
Age Sex Zipcode Disease
23 M 11000 pneumonia
27 M 13000 dyspepsia
35 M 59000 dyspepsia
59 M 12000 pneumonia
61 F 54000 flu
65 F 25000 gastritis
65 F 25000 flu
70 F 30000 bronchitis
publish
4
Inference attack
Published table
  • An adversary knows that Bob
  • has been hospitalized before
  • is 23 years old
  • lives in an area with zipcode 11000

Age Sex Zipcode Disease
23 M 11000 pneumonia
27 M 13000 dyspepsia
35 M 59000 dyspepsia
59 M 12000 pneumonia
61 F 54000 flu
65 F 25000 gastritis
65 F 25000 flu
70 F 30000 bronchitis
Quasi-identifier (QI) attributes
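The attack on this slide amounts to a join between what the adversary already knows and the published table. A minimal Python sketch, assuming the published rows are held as tuples; the names `published` and `link` are illustrative and not part of the presentation:

```python
# Published table from the slide: (Age, Sex, Zipcode, Disease).
published = [
    (23, "M", 11000, "pneumonia"),
    (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"),
    (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"),
    (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"),
    (70, "F", 30000, "bronchitis"),
]

def link(age, zipcode, table):
    """Return the diseases of all rows whose QI values match the known ones."""
    return [disease for a, _sex, z, disease in table
            if a == age and z == zipcode]

# The adversary knows that Bob is 23 and lives in zipcode 11000.
bob_candidates = link(23, 11000, published)
print(bob_candidates)  # a single match means Bob's disease is disclosed
```

Only one row matches Bob's age and zipcode, so removing the Name column alone does not protect him.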
5
Generalization
  • Transform each QI value into a less specific form

A generalized table
Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] pneumonia
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] gastritis
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] bronchitis
Name Age Sex Zipcode
Bob 23 M 11000
How much generalization do we need?
6
l-diversity
  • A QI-group with m tuples is l-diverse, iff each
    sensitive value appears no more than m / l times
    in the QI-group.
  • A table is l-diverse, iff all of its QI-groups
    are l-diverse.
  • The table below is 2-diverse.

Quasi-identifier (QI) attributes
Sensitive attribute
Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] pneumonia
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] gastritis
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] bronchitis
2 QI-groups
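The definition on this slide can be checked mechanically. A small Python sketch, assuming each QI-group is given as a list of its sensitive values; `is_l_diverse` is an illustrative helper name, not something defined in the presentation:

```python
from collections import Counter

def is_l_diverse(sensitive_values, l):
    """A group of m tuples is l-diverse iff no value occurs more than m / l times."""
    m = len(sensitive_values)
    # count <= m / l, written with integers to avoid floating-point division
    return all(count * l <= m for count in Counter(sensitive_values).values())

# Sensitive values of the two QI-groups in the generalized table.
groups = {
    1: ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],
    2: ["flu", "gastritis", "flu", "bronchitis"],
}

table_is_2_diverse = all(is_l_diverse(g, 2) for g in groups.values())
table_is_3_diverse = all(is_l_diverse(g, 3) for g in groups.values())
print(table_is_2_diverse, table_is_3_diverse)
```

In each group the most frequent disease appears 2 times out of 4, which satisfies the bound for l = 2 but not for l = 3.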
7
What l-diversity guarantees
  • From an l-diverse generalized table, an adversary
    (without any prior knowledge) can infer the
    sensitive value of each individual with
    confidence at most 1/l

A 2-diverse generalized table
Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] pneumonia
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] gastritis
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] bronchitis
Name Age Sex Zipcode
Bob 23 M 11000
A. Machanavajjhala et al. l-Diversity: Privacy
Beyond k-Anonymity. ICDE 2006.
8
Defect of generalization
  • Query A: SELECT COUNT(*) FROM Unknown-Microdata
    WHERE Disease = 'pneumonia' AND Age IN [0, 30]
    AND Zipcode IN [10001, 20000]

Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] pneumonia
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] gastritis
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] bronchitis
  • Estimated answer: 2 · p, where p is the
    probability that each of the two pneumonia tuples
    satisfies the query conditions on Age and Zipcode

9
Defect of generalization (cont.)
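  • Query A: SELECT COUNT(*) FROM Unknown-Microdata
    WHERE Disease = 'pneumonia' AND Age IN [0, 30]
    AND Zipcode IN [10001, 20000]
  • p = Area(R1 ∩ Q) / Area(R1) = 0.05, where R1 is
    the (Age, Zipcode) rectangle of the first QI-group
    and Q is the rectangle of the query
  • Estimated answer for query A: 2 · p = 0.1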

Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] pneumonia
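The value p = 0.05 can be reproduced with a few lines of Python. This is a sketch under one assumption that the slide does not state: Age and Zipcode are treated as integer-valued, so an interval's "area" is the number of integer points it contains. The helper name `overlap_fraction` is illustrative.

```python
from fractions import Fraction

def overlap_fraction(cell, query):
    """Fraction of the integer points of cell = (lo, hi) that lie inside query = (lo, hi)."""
    lo, hi = max(cell[0], query[0]), min(cell[1], query[1])
    covered = max(0, hi - lo + 1)
    return Fraction(covered, cell[1] - cell[0] + 1)

r1 = {"age": (21, 60), "zip": (10001, 60000)}  # generalized QI-group 1
q = {"age": (0, 30), "zip": (10001, 20000)}    # ranges of query A

# Under a uniform assumption the two dimensions are independent.
p = overlap_fraction(r1["age"], q["age"]) * overlap_fraction(r1["zip"], q["zip"])

pneumonia_tuples = 2  # pneumonia rows in QI-group 1
estimate = pneumonia_tuples * p
print(p, estimate)
```

With these counts the Age overlap is 10/40 and the Zipcode overlap is 10000/50000, which gives the slide's p = 0.05 and an estimate of 0.1.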
10
Defect of generalization (cont.)
  • Query A: SELECT COUNT(*) FROM Unknown-Microdata
    WHERE Disease = 'pneumonia' AND Age IN [0, 30]
    AND Zipcode IN [10001, 20000]
  • Estimated answer from the generalized table: 0.1
  • The exact answer should be 1

Name Age Sex Zipcode Disease
Bob 23 M 11000 pneumonia
Ken 27 M 13000 dyspepsia
Peter 35 M 59000 dyspepsia
Sam 59 M 12000 pneumonia
Jane 61 F 54000 flu
Linda 65 F 25000 gastritis
Alice 65 F 25000 flu
Mandy 70 F 30000 bronchitis
11
Research Works on Generalization
  1. V. S. Iyengar. Transforming data to satisfy
    privacy constraints. KDD 2002.
  2. K. Wang, P. S. Yu and S. Chakraborty. Bottom-Up
    Generalization: A Data Mining Solution to Privacy
    Protection. ICDM 2004.
  3. R. J. Bayardo Jr. and R. Agrawal. Data Privacy
    through Optimal k-Anonymization. ICDE 2005.
  4. B. C. M. Fung, K. Wang and P. S. Yu. Top-Down
    Specialization for Information and Privacy
    Preservation. ICDE 2005.
  5. K. LeFevre, D. J. DeWitt and R. Ramakrishnan.
    Incognito: Efficient Full-Domain K-Anonymity.
    SIGMOD 2005.
  6. K. LeFevre, D. J. DeWitt and R. Ramakrishnan.
    Mondrian Multidimensional K-Anonymity. ICDE 2006.
  7. D. Kifer and J. Gehrke. Injecting utility into
    anonymized datasets. SIGMOD 2006.
  8. X. Xiao and Y. Tao. Personalized privacy
    preservation. SIGMOD 2006.
  9. K. Wang and B. C. M. Fung. Anonymization for
    Sequential Releases. KDD 2006.
  10. K. LeFevre, D. J. DeWitt and R. Ramakrishnan.
    Workload-Aware Anonymization. KDD 2006.
  11. J. Xu, W. Wang, J. Pei, et al. Utility-Based
    Anonymization Using Local Recoding. KDD 2006.

12
Contributions
  • We propose an alternative to generalization,
    called Anatomy, which allows much more accurate
    data analysis while still preserving privacy.
  • We develop an algorithm for computing anatomized
    tables that
  • runs in linear I/Os
  • (nearly) minimizes information loss

13
Outline
  • Basic Idea of Anatomy
  • Preserving Correlation
  • Algorithm for Anatomy
  • Experimental Results

14
Basic Idea of Anatomy
  • For a given microdata table, Anatomy releases a
    quasi-identifier table (QIT) and a sensitive
    table (ST)

Age Sex Zipcode Disease
23 M 11000 pneumonia
27 M 13000 dyspepsia
35 M 59000 dyspepsia
59 M 12000 pneumonia
61 F 54000 flu
65 F 25000 gastritis
65 F 25000 flu
70 F 30000 bronchitis
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
Sensitive Table (ST)
Quasi-identifier Table (QIT)
microdata
15
Basic Idea of Anatomy (cont.)
  • 1. Select a partition of the tuples

Age Sex Zipcode Disease

23 M 11000 pneumonia
27 M 13000 dyspepsia
35 M 59000 dyspepsia
59 M 12000 pneumonia

61 F 54000 flu
65 F 25000 gastritis
65 F 25000 flu
70 F 30000 bronchitis
QI group 1
QI group 2
a 2-diverse partition
16
Basic Idea of Anatomy (cont.)
  • 2. Generate a quasi-identifier table (QIT) and a
    sensitive table (ST) based on the selected
    partition

Disease

pneumonia
dyspepsia
dyspepsia
pneumonia

flu
gastritis
flu
bronchitis
Age Sex Zipcode

23 M 11000
27 M 13000
35 M 59000
59 M 12000

61 F 54000
65 F 25000
65 F 25000
70 F 30000
group 1
group 2
quasi-identifier table (QIT)
sensitive table (ST)
17
Basic Idea of Anatomy (cont.)
  • 2. Generate a quasi-identifier table (QIT) and a
    sensitive table (ST) based on the selected
    partition

Group-ID Disease

1 pneumonia
1 dyspepsia
1 dyspepsia
1 pneumonia

2 flu
2 gastritis
2 flu
2 bronchitis
Age Sex Zipcode Group-ID

23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1

61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
quasi-identifier table (QIT)
sensitive table (ST)
18
Basic Idea of Anatomy (cont.)
  • 2. Generate a quasi-identifier table (QIT) and a
    sensitive table (ST) based on the selected
    partition

Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
sensitive table (ST)
quasi-identifier table (QIT)
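The two steps shown on the preceding slides (choose a partition, then split it into a QIT and an ST) can be sketched in Python as follows. The function name `build_qit_st` and the representation of a partition as lists of row indices are assumptions made for illustration.

```python
from collections import Counter

# Microdata without names: (Age, Sex, Zipcode, Disease).
microdata = [
    (23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"),       (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"),       (70, "F", 30000, "bronchitis"),
]

# The 2-diverse partition from the slides, as row indices per QI-group.
partition = [[0, 1, 2, 3], [4, 5, 6, 7]]

def build_qit_st(rows, partition):
    """Split a partitioned table into QIT rows and ST rows."""
    qit, st = [], []
    for group_id, members in enumerate(partition, start=1):
        # QIT: exact QI values plus the group id, without the disease.
        for i in members:
            age, sex, zipcode, _disease = rows[i]
            qit.append((age, sex, zipcode, group_id))
        # ST: one row per distinct disease in the group, with its count.
        counts = Counter(rows[i][3] for i in members)
        for disease in sorted(counts):
            st.append((group_id, disease, counts[disease]))
    return qit, st

qit, st = build_qit_st(microdata, partition)
for row in st:
    print(row)
```

The QI values are released unchanged; the only link between a person and a disease is the group id shared by the two tables.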
19
Privacy Preservation
  • From a pair of QIT and ST generated from an
    l-diverse partition, the adversary can infer the
    sensitive value of each individual with
    confidence at most 1/l

Name Age Sex Zipcode
Bob 23 M 11000
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
sensitive table (ST)
quasi-identifier table (QIT)
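The 1/l bound can be read directly off the ST: within a group, the adversary's confidence in a disease is its count divided by the group size. A minimal Python sketch, with `confidence` and `max_confidence` as illustrative helper names:

```python
from fractions import Fraction

# Sensitive table from the slide: (Group-ID, Disease, Count).
st = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]

def group_sizes(st):
    """Total number of tuples in each group, summed from the ST counts."""
    sizes = {}
    for group_id, _disease, count in st:
        sizes[group_id] = sizes.get(group_id, 0) + count
    return sizes

def confidence(st, group_id, disease):
    """Adversary's confidence that a member of group_id has the given disease."""
    size = group_sizes(st)[group_id]
    hits = sum(c for g, d, c in st if g == group_id and d == disease)
    return Fraction(hits, size)

def max_confidence(st):
    """Largest confidence over all groups and diseases."""
    sizes = group_sizes(st)
    return max(Fraction(c, sizes[g]) for g, _d, c in st)

# Bob's QI values (23, M, 11000) place him in group 1 of the QIT.
bob_pneumonia = confidence(st, 1, "pneumonia")
print(bob_pneumonia, max_confidence(st))
```

Both values are 1/2, matching the bound 1/l for l = 2.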
20
Accuracy of Data Analysis
  • Query A: SELECT COUNT(*) FROM Unknown-Microdata
    WHERE Disease = 'pneumonia' AND Age IN [0, 30]
    AND Zipcode IN [10001, 20000]

Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
sensitive table (ST)
quasi-identifier table (QIT)
21
Accuracy of Data Analysis (cont.)
  • Query A: SELECT COUNT(*) FROM Unknown-Microdata
    WHERE Disease = 'pneumonia' AND Age IN [0, 30]
    AND Zipcode IN [10001, 20000]
  • 2 patients in group 1 have contracted pneumonia
  • 2 out of the 4 patients in group 1 satisfy the
    query conditions on Age and Zipcode
  • Estimated answer for query A: 2 × 2/4 = 1,
    which is also the actual result from the original
    microdata

Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
t1 t2 t3 t4
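The estimate on this slide generalizes to any group: multiply the ST count of the queried disease by the fraction of the group's QIT rows that satisfy the QI conditions, then sum over groups. A Python sketch of that calculation; the function name `estimate_count` is an assumption for illustration.

```python
from fractions import Fraction

# QIT rows: (Age, Sex, Zipcode, Group-ID); ST rows: (Group-ID, Disease, Count).
qit = [
    (23, "M", 11000, 1), (27, "M", 13000, 1), (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2), (65, "F", 25000, 2), (70, "F", 30000, 2),
]
st = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]

def estimate_count(qit, st, disease, age_range, zip_range):
    """Estimate COUNT(*) for a disease and range conditions on Age and Zipcode."""
    total = Fraction(0)
    for group_id in sorted({row[3] for row in qit}):
        members = [row for row in qit if row[3] == group_id]
        # Exact QI values let us count matching rows instead of guessing.
        matching = sum(
            1 for age, _sex, zipcode, _g in members
            if age_range[0] <= age <= age_range[1]
            and zip_range[0] <= zipcode <= zip_range[1]
        )
        with_disease = sum(c for g, d, c in st if g == group_id and d == disease)
        total += with_disease * Fraction(matching, len(members))
    return total

answer_a = estimate_count(qit, st, "pneumonia", (0, 30), (10001, 20000))
print(answer_a)
```

For query A only group 1 contributes, giving 2 × 2/4 = 1.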
22
Outline
  • Rationale of Anatomy
  • Preserving Correlation
  • Algorithm for Anatomy
  • Experimental Results

23
Preserving Correlation
  • Let us first examine the correlation between Age
    and Disease in our running example
  • Each tuple in the microdata can be mapped to a
    point in the (Age, Disease) domain
  • The tuple t1 below maps to the point (23, pneumonia).

Age Sex Zipcode Disease
23 M 11000 pneumonia
....
t1
24
Preserving Correlation (cont.)
  • We model this tuple using a probability density
    function (pdf)

25
Preserving Correlation (cont.)
  • In the generalized table, the tuple becomes
  • Its corresponding pdf becomes

Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia

26
Preserving Correlation (cont.)
  • In the anatomized tables, the tuple becomes
  • Its corresponding pdf becomes

Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2

Age Sex Zipcode Group-ID
23 M 11000 1

27
Preserving Correlation (cont.)

28
Outline
  • Rationale of Anatomy
  • Preserving Correlation
  • Algorithm for Anatomy
  • Experimental Results

29
Quality Metric
the original pdf
the approximated pdf
  • For each approximated pdf, we measure its
    error from the original pdf by their L2
    distance
  • We aim at obtaining anatomized tables that
    minimize the re-construction error (RCE), i.e.
    the sum of these errors over all tuples
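To make the metric concrete, the error for tuple t1 = (23, pneumonia) can be computed for both publication methods. This Python sketch rests on two assumptions that the slides do not state: Age is treated as integer-valued, so the pdfs become discrete and the distance becomes a sum, and the error is taken as the squared L2 distance. The dictionaries `generalized` and `anatomized` encode the approximated pdfs described on the "Preserving Correlation" slides.

```python
def squared_l2(original, approx):
    """Squared L2 distance between two discrete pdfs given as {point: probability}."""
    points = set(original) | set(approx)
    return sum((approx.get(x, 0.0) - original.get(x, 0.0)) ** 2 for x in points)

# Original pdf of t1: all mass on the point (23, pneumonia).
original = {(23, "pneumonia"): 1.0}

# Generalization: disease is exact, age is spread uniformly over [21, 60].
generalized = {(age, "pneumonia"): 1 / 40 for age in range(21, 61)}

# Anatomy: age is exact, disease follows the ST of group 1 (2 of 4 each).
anatomized = {(23, "pneumonia"): 0.5, (23, "dyspepsia"): 0.5}

err_generalized = squared_l2(original, generalized)
err_anatomized = squared_l2(original, anatomized)
print(err_generalized, err_anatomized)
```

Under these assumptions the anatomized pdf is noticeably closer to the original than the generalized one, which is the intuition behind minimizing the RCE.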

30
Anatomize
  • An algorithm for computing anatomized tables that
  • runs in I/O cost linear in the cardinality n of
    the microdata table
  • minimizes the RCE when n is a multiple of l,
    otherwise achieves an RCE that is higher than the
    lower-bound by a factor of at most 1 + 1/n
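The slide does not show the algorithm's steps. The Python sketch below illustrates one bucket-based strategy for forming l-diverse groups: hash tuples by sensitive value, repeatedly draw one tuple from each of the l largest buckets, then place leftover tuples in groups that do not yet contain their value. It is an in-memory illustration only, makes no claim to the linear I/O cost or the RCE guarantee stated above, and the function name `anatomize_groups` is hypothetical.

```python
import random
from collections import defaultdict

def anatomize_groups(rows, l, seed=0):
    """Partition rows (sensitive value last) into groups with distinct sensitive values."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[-1]].append(row)

    groups = []
    # Group creation: while at least l buckets are non-empty, take one
    # tuple from each of the l currently largest buckets.
    while sum(1 for b in buckets.values() if b) >= l:
        largest = sorted((v for v in buckets if buckets[v]),
                         key=lambda v: len(buckets[v]), reverse=True)[:l]
        groups.append([buckets[v].pop() for v in largest])

    # Residue assignment: put each leftover tuple into a randomly chosen
    # group that does not already contain its sensitive value.
    for value, bucket in buckets.items():
        for row in bucket:
            eligible = [g for g in groups if all(t[-1] != value for t in g)]
            if not eligible:
                raise ValueError("no eligible group for a residue tuple")
            rng.choice(eligible).append(row)
    return groups

microdata = [
    (23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"),       (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"),       (70, "F", 30000, "bronchitis"),
]
groups = anatomize_groups(microdata, l=2)
for g in groups:
    print([t[-1] for t in g])
```

On the running example this sketch yields four groups of two tuples rather than the two groups of four used on the slides; both partitions are 2-diverse.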

31
Outline
  • Rationale of Anatomy
  • Preserving Correlation
  • Algorithm for Anatomy
  • Experimental Results

32
Experimental Settings
  • Goal: to compare the accuracy of data analysis on
    the generalized and anatomized tables.
  • Real dataset with 9 attributes:
    Age, Gender, Education, Marital-status, Race,
    Work-class, Country, Occupation, Salary-class
  • Tables OCC-d and SAL-d (d = 3, 4, 5, 6, 7)
  • OCC-3: Age, Gender, Education, Occupation
  • SAL-4: Age, Gender, Education, Marital-status,
    Salary-class
  • Cardinality: 100k, 200k, 300k, 400k, 500k
33
Experimental Settings (cont.)
  • Competitor: multi-dimensional generalization
  • l = 10
  • Metric: average relative error over 10000
    aggregate queries
  • Relative error: |act − est| / act
  • qd = 1, 2, ..., d
  • s = 1, ..., 5, ..., 10
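The error metric is simple to state in code. A short Python sketch, where `act` is the exact query answer on the microdata and `est` is the estimate from the published tables; the function names are illustrative, and queries with an exact answer of zero are assumed to be excluded.

```python
def relative_error(act, est):
    """Relative error |act - est| / act of one query (act must be non-zero)."""
    return abs(act - est) / act

def average_relative_error(pairs):
    """Mean relative error over a workload of (act, est) pairs."""
    return sum(relative_error(act, est) for act, est in pairs) / len(pairs)

# Query A from the earlier slides: exact answer 1,
# estimate 0.1 from generalization and 1 from anatomy.
err_generalization = relative_error(1, 0.1)
err_anatomy = relative_error(1, 1)
print(err_generalization, err_anatomy)
```

For query A alone, generalization is off by 90% while anatomy is exact.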

34
Accuracy of Data Analysis (cont.)
C.C. Aggarwal. On k-anonymity and the curse of
dimensionality. VLDB 2005
35
Accuracy of Data Analysis (cont.)
36
Accuracy of Data Analysis (cont.)
37
Computation Overhead
38
Summary
  • Anatomy outperforms generalization by allowing
    much more accurate data analysis on the published
    data.
  • Anatomized tables (with nearly optimal quality
    guarantee) can be computed in I/O cost linear in
    the database cardinality.

39
Thank you!
  • Datasets and implementation are available for
    download at
  • http://www.cse.cuhk.edu.hk/taoyf

40
Anatomy vs. Generalization Revisit
  • Sometimes the adversary is not sure whether an
    individual appears in the microdata or not

A 2-diverse generalized table
A Voter Registration List
Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] pneumonia
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] gastritis
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] bronchitis
Name Age Sex Zipcode
Bob 23 M 11000
Ken 27 M 13000
Peter 35 M 59000
Mark 40 M 30000
Ric 50 M 40000
Sam 59 M 12000

41
Anatomy vs. Generalization Revisit
  • From the adversary's perspective:
  • Bob has 4/6 probability to be in the microdata
  • If Bob indeed appears in the microdata, there is
    2/4 probability that he has contracted pneumonia
  • So Bob has 4/6 × 2/4 = 1/3 probability to have
    contracted pneumonia

A 2-diverse generalized table
A Voter Registration List
Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] pneumonia

Name Age Sex Zipcode
Bob 23 M 11000
Ken 27 M 13000
Peter 35 M 59000
Mark 40 M 30000
Ric 50 M 40000
Sam 59 M 12000
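The 1/3 figure follows from counting how many voters fit the generalized QI-group. A Python sketch of the arithmetic, assuming the adversary treats every matching voter as equally likely to be in the microdata; the variable names are illustrative.

```python
from fractions import Fraction

# Voter registration list: (Name, Age, Sex, Zipcode).
voters = [
    ("Bob", 23, "M", 11000), ("Ken", 27, "M", 13000), ("Peter", 35, "M", 59000),
    ("Mark", 40, "M", 30000), ("Ric", 50, "M", 40000), ("Sam", 59, "M", 12000),
]

# Generalized QI-group 1: Age in [21, 60], Sex M, Zipcode in [10001, 60000].
candidates = [
    name for name, age, sex, zipcode in voters
    if sex == "M" and 21 <= age <= 60 and 10001 <= zipcode <= 60000
]

group_size = 4        # tuples in the generalized QI-group
pneumonia_count = 2   # of which pneumonia

p_in_microdata = Fraction(group_size, len(candidates))
p_pneumonia_given_in = Fraction(pneumonia_count, group_size)
p_pneumonia = p_in_microdata * p_pneumonia_given_in
print(len(candidates), p_in_microdata, p_pneumonia)
```

All six voters fit the generalized ranges, so the adversary cannot tell which four are in the microdata, and the overall confidence drops from 1/2 to 1/3.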

42
Anatomy vs. Generalization Revisit
Name Age Sex Zipcode
Bob 23 M 11000
Ken 27 M 13000
Peter 35 M 59000
Mark 40 M 30000
Ric 50 M 40000
Sam 59 M 12000
  • The adversary knows that
  • Bob must appear in the microdata
  • There is 1/2 probability that Bob has contracted
    pneumonia

Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2

Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1

2-diverse ST
2-diverse QIT
43
Anatomy vs. Generalization Revisit
  • For a given value of l, l-diverse generalization
    may lead to higher privacy protection than
    l-diverse anatomy does.
  • But this is not always the case, since
  • the external database may not contain any
    irrelevant individuals
  • the adversary may know that some individuals
    indeed appear in the microdata

Name Age Sex Zipcode
Bob 23 M 11000
Ken 27 M 13000
Peter 35 M 59000
Mark 40 M 30000
Ric 50 M 40000
Sam 59 M 12000