Title: Privacy Preserving Data Publication
1Privacy Preserving Data Publication
- Yufei Tao
- Department of Computer Science and Engineering
- Chinese University of Hong Kong
2Centralized publication
- Assume that a hospital wants to publish the
following table, called the microdata.
- The publication must preserve the privacy of
patients.
- Prevent an adversary from knowing
who-contracted-what.
Microdata
3Centralized publication (cont.)
- A simple solution: remove the column Name.
- It does not work. See next.
publish
4Linking attacks
A voter registration list
The published table
Quasi-identifier (QI) attributes
An adversary
5These are real threats
- Fact: 87% of Americans can be uniquely identified
by Zipcode, gender, and date of birth.
- A famous experiment by Sweeney [International
Journal on Uncertainty, Fuzziness and
Knowledge-based Systems, 2002] found the medical
record of an ex-governor of Massachusetts.
6Objectives
- Publish a distorted version of the dataset so
that
- Privacy: the privacy of all individuals is
adequately protected
- Utility: the dataset is useful for analyzing the
characteristics of the microdata.
- Paradox: privacy protection ↑, utility ↓.
7Issues
- Privacy principle
- What is adequate privacy protection?
- Distortion approach
- How to achieve the privacy principle?
- The literature has discussed other issues as
well.
- Complexities, improving the utility of the
published data, etc.
8Principle 1: k-anonymity
Sweeney, International Journal on Uncertainty,
Fuzziness and Knowledge-based Systems, 2002
Sensitive attribute
- 2-anonymous generalization
QI attributes
A voter registration list
4 QI groups
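A minimal sketch of how the k-anonymity requirement could be checked, assuming the usual definition (every combination of QI values in the published table is shared by at least k tuples). The table contents, column names, and the function name `is_k_anonymous` are hypothetical and only serve as illustration.

```python
from collections import Counter


def is_k_anonymous(rows, qi_columns, k):
    """Return True if every QI group (same QI values) has at least k rows."""
    group_sizes = Counter(tuple(row[c] for c in qi_columns) for row in rows)
    return all(size >= k for size in group_sizes.values())


# Hypothetical generalized table; all values are invented for illustration.
published = [
    {"Age": "[21, 30]", "Zipcode": "[10001, 20000]", "Disease": "flu"},
    {"Age": "[21, 30]", "Zipcode": "[10001, 20000]", "Disease": "HIV"},
    {"Age": "[31, 60]", "Zipcode": "[20001, 60000]", "Disease": "bronchitis"},
    {"Age": "[31, 60]", "Zipcode": "[20001, 60000]", "Disease": "bronchitis"},
]

two_anonymous = is_k_anonymous(published, ["Age", "Zipcode"], 2)
three_anonymous = is_k_anonymous(published, ["Age", "Zipcode"], 3)
print(two_anonymous, three_anonymous)
```

Note that the second QI group in this toy table is 2-anonymous yet contains a single disease, which is the kind of weakness the next slide points out.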
9Defects of k-anonymity
- What is the disease of Joe?
No diversity in this QI group.
A voter registration list
10Principle 2: l-diversity
Machanavajjhala et al., ICDE, 2006
- Each QI group should have at least l
well-represented sensitive values.
- Different ways to interpret well-represented.
11Naive interpretation
- Each QI-group has l different sensitive values.
A 2-diverse table
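The naive interpretation above can be sketched as a check that each QI group holds at least l distinct sensitive values. The rows, column names, and the function name `is_distinct_l_diverse` are assumptions made for this example only.

```python
from collections import defaultdict


def is_distinct_l_diverse(rows, qi_columns, sensitive_column, ell):
    """Return True if every QI group has at least `ell` distinct sensitive values."""
    values_per_group = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in qi_columns)
        values_per_group[key].add(row[sensitive_column])
    return all(len(values) >= ell for values in values_per_group.values())


# Hypothetical published table; all values are invented for illustration.
table = [
    {"Age": "[21, 30]", "Disease": "flu"},
    {"Age": "[21, 30]", "Disease": "HIV"},
    {"Age": "[31, 60]", "Disease": "bronchitis"},
    {"Age": "[31, 60]", "Disease": "bronchitis"},
]

diverse = is_distinct_l_diverse(table, ["Age"], "Disease", 2)
print(diverse)  # the second group has only one distinct disease
```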
12Defects of the naive interpretation
- Assume that Joe is identified in the QI group.
What is the probability that he contracted HIV?
- Implication: The most frequent sensitive value in
a QI group cannot be too frequent.
- But accomplishing only this still leaves the table
vulnerable to attacks with background knowledge.
98 tuples
A QI group with 100 tuples
13Background knowledge attack
- Let Joe be an individual in the QI group having
HIV.
- A friend of Joe has the background knowledge
"Joe does not have pneumonia".
- How likely would this friend assume that Joe had
HIV?
50 tuples
A QI group with 100 tuples
49 tuples
14Controlling also the 2nd most frequent value
- Even if an adversary can eliminate pneumonia,
s/he can only assume that Joe has HIV with 40 /
70 probability.
40 tuples
A QI group with 100 tuples
30 tuples
30 tuples
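The 40 / 70 figure can be reproduced with a short calculation: remove the tuples carrying the eliminated value and renormalize. In the sketch below the name of the third disease ("bronchitis") and the function name are assumptions; only the counts 40, 30, 30 and the elimination of pneumonia come from the slide.

```python
from fractions import Fraction


def posterior_after_elimination(counts, target, eliminated):
    """Probability of `target` once the adversary rules out `eliminated` values."""
    remaining = {v: n for v, n in counts.items() if v not in eliminated}
    if target not in remaining:
        raise ValueError("the target value was eliminated")
    return Fraction(remaining[target], sum(remaining.values()))


# Counts from the slide; "bronchitis" is a hypothetical label for the third value.
group = {"HIV": 40, "pneumonia": 30, "bronchitis": 30}

before = posterior_after_elimination(group, "HIV", set())
after = posterior_after_elimination(group, "HIV", {"pneumonia"})
print(before, after)  # 2/5 4/7
```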
15An example of 5-diversity
The most frequent value
The 2nd most frequent value
A QI group
The 3rd most frequent value
The 4th most frequent value
The other values
16An example of 5-diversity (cont.)
The most frequent value
A QI group
Same cardinality
The other values
17An example of 5-diversity (cont.)
- Assume that Joe is a person in the QI group.
- Property: If an adversary can eliminate at most 3
diseases, s/he can correctly guess the disease of
Joe with probability at most 50%.
HIV
pneumonia
A QI group
bronchitis
cancer
The other values
18l-diversity
- Consider a QI group.
- m is the number of sensitive values in the group.
- $r_1$ is the number of tuples having the most
frequent sensitive value.
- $r_2$ is the number of tuples having the 2nd most
frequent sensitive value.
- ...
- $r_m$ is the number of tuples having the m-th most
frequent sensitive value.
- Then, $r_1 \le c \, (r_l + r_{l+1} + \dots + r_m)$, where c is a
constant.
- If an adversary can eliminate only $l - 2$
sensitive values, s/he can infer the disease of a
person with probability at most $c / (c + 1)$, since
$r_1 / (r_1 + r_l + \dots + r_m) \le c / (c + 1)$.
- Called (c, l)-diversity precisely.
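A minimal sketch of the (c, l)-diversity test for one QI group, assuming the condition reads $r_1 \le c \, (r_l + \dots + r_m)$ with the counts sorted in descending order. The function name and the sample counts are illustrative assumptions; a group with fewer than l distinct values is treated as failing.

```python
def satisfies_c_l_diversity(counts, c, ell):
    """Check r_1 <= c * (r_l + ... + r_m) for one QI group.

    `counts` holds the number of tuples per sensitive value, in any order.
    """
    r = sorted(counts, reverse=True)
    if len(r) < ell:
        return False  # fewer than l distinct sensitive values
    return r[0] <= c * sum(r[ell - 1:])


# Counts 40 / 30 / 30 as in the earlier example of a 100-tuple QI group.
ok_c2_l3 = satisfies_c_l_diversity([40, 30, 30], c=2, ell=3)  # 40 <= 2 * 30
ok_c1_l3 = satisfies_c_l_diversity([40, 30, 30], c=1, ell=3)  # 40 <= 1 * 30
skewed = satisfies_c_l_diversity([98, 1, 1], c=1, ell=2)      # 98 <= 1 * (1 + 1)
print(ok_c2_l3, ok_c1_l3, skewed)
```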
19Defects of l-diversity
- Andy does not want anyone to know that he had a
stomach problem.
- Sarah does not mind at all if others find out
that she had flu.
A 2-diverse table
A voter registration list
20Defects of l-diversity (cont.)
- Does not work if an individual can have multiple
tuples in the microdata.
Microdata
21Defects of l-diversity (cont.)
A 2-diverse table
A voter registration list
22Principle 3: Personalized anonymity
Xiao and Tao, SIGMOD, 2006
- Key ideas: Guarding node + sensitive attribute
(SA) generalization
- Assume a publicly-known hierarchy on the
sensitive attribute.
23Guarding node
- Andy does not want anyone to know that he had a
stomach problem.
- He can specify stomach disease as the guarding
node for his tuple.
- Protect Andy from being conjectured to have any
disease in the subtree of the guarding node.
24Guarding node (cont.)
- Sarah is willing to disclose her exact symptom.
- She can specify Ø as the guarding node for her
tuple.
25Guarding node (cont.)
- Bill does not have any special preference.
- He sets the guarding node of his tuple to be the
same as his sensitive value.
26A personalized approach
27Personalized anonymity
- No adversary should be able to breach the privacy
requirement of any guarding node with a
probability above $p_{breach}$.
- If $p_{breach} = 0.3$, then no adversary can have more
than 30% probability of finding out that
- Andy had a stomach disease
- Bill had dyspepsia
28Why SA generalization?
- How many female patients are there with age above
30?
- Estimated answer: $4 \cdot (60 - 30 + 1) / (60 - 21 + 1) \approx 3$
- Real answer: 1
Pure QI generalization
Microdata
29SA generalization (cont.)
With SA generalization
Pure QI generalization
30Evaluation of disclosure risk
- What is the probability that the adversary can
find out that Andy had a stomach disease?
A voter registration list
The published data
31Combinatorial reconstruction (cont.)
- Can each individual appear more than once?
- No: the primary case
- Yes: the non-primary case
- Some possible reconstructions
The primary case
The non-primary case
33Breach probability (primary)
- In total, 120 possible reconstructions
- If Andy is associated with a stomach disease in
$n_b$ reconstructions
- The probability that the adversary should
associate Andy with some stomach problem is $n_b$ /
120
- Andy is associated with
- gastric ulcer in 24 reconstructions
- dyspepsia in 24 reconstructions
- gastritis in 0 reconstructions
- $n_b = 48$
- The breach probability for Andy's tuple is 48 /
120 = 2 / 5.
34Breach probability (non-primary)
- In total, 625 possible reconstructions
- Andy is associated with gastric ulcer or
dyspepsia or gastritis in 225 reconstructions.
- $n_b = 225$
- The breach probability for Andy's tuple is
225 / 625 = 9 / 25.
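The counts on the two breach-probability slides can be reproduced by brute-force enumeration. The sketch below rests on a setup that is an assumption, not something stated in the text: 4 published tuples whose owners are drawn from 5 candidate individuals, with two of the tuples carrying stomach diseases (gastric ulcer and dyspepsia). The other two disease values and the function name are hypothetical.

```python
from fractions import Fraction
from itertools import permutations, product


def breach_probability(num_people, tuple_values, guarded_values, primary):
    """Enumerate reconstructions; person 0 plays the role of Andy.

    A reconstruction assigns an owner to every published tuple. In the
    primary case owners are distinct; in the non-primary case they may repeat.
    """
    people = range(num_people)
    n_tuples = len(tuple_values)
    if primary:
        assignments = permutations(people, n_tuples)
    else:
        assignments = product(people, repeat=n_tuples)
    total = breaches = 0
    for owners in assignments:
        total += 1
        if any(owner == 0 and value in guarded_values
               for owner, value in zip(owners, tuple_values)):
            breaches += 1
    return breaches, total, Fraction(breaches, total)


# Assumed setup: "flu" and "pneumonia" are hypothetical filler values.
values = ["gastric ulcer", "dyspepsia", "flu", "pneumonia"]
stomach = {"gastric ulcer", "dyspepsia", "gastritis"}

primary_result = breach_probability(5, values, stomach, primary=True)
non_primary_result = breach_probability(5, values, stomach, primary=False)
print(primary_result, non_primary_result)
```

Under these assumptions the enumeration yields 48 of 120 reconstructions in the primary case and 225 of 625 in the non-primary case, matching the figures on the slides.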
35A defect of personalized anonymity
- Does not guard against background knowledge.
- Recall that l-diversity can achieve this purpose.
- But it seems possible to adapt the personalized
approach to tackle background knowledge.
- Future work?
36Other privacy principles
- k-gather.
- Due to Aggarwal et al., PODS, 2006
- Suffers from the problems of k-anonymity.
- (α, k)-anonymity
- Due to Wong et al., KDD, 2006
- t-closeness.
- Recently proposed by Li and Li, ICDE, 2007
37Issues
- Privacy principle
- What is adequate privacy protection?
- Distortion approach
- How to achieve the privacy principle?
38Three approaches
- Suppression
- We do not discuss it because
- the utility of the resulting table is low
- it can be regarded as a special case of
generalization.
- Generalization
- Due to Sweeney, International Journal on
Uncertainty, Fuzziness and Knowledge-based
Systems, 2002
- Anatomy (also called bucketization)
- Due to Xiao and Tao, VLDB, 2006
- Each of the above approaches can be integrated
with all the privacy principles discussed
earlier.
39A multidimensional view of generalization
40Taxonomy of generalization
LeFevre et al. SIGMOD, 2005
- Local recoding
- (Generalized) rectangles may overlap.
- Suppression is a special case of local recoding.
- Global recoding
- All rectangles are disjoint.
41Taxonomy of generalization (cont.)
- Global recoding can be further divided.
- Single-dimension recoding
- Rectangles form a grid.
- Multi-dimension recoding
- The opposite of single-dimension recoding:
rectangles need not form a grid.
42Taxonomy of generalization (cont.)
- Single-dimension recoding can be further divided.
- Full-domain recoding
- Full-subtree recoding
- Both assume a hierarchy on each QI attribute.
- Example: a hierarchy on Age
43Taxonomy of generalization (cont.)
- Full-domain recoding
- All age values must be generalized to the same
level of the hierarchy.
44Taxonomy of generalization (cont.)
- Full-subtree recoding
- The subtrees of all generalized values must be
disjoint.
- Permissible generalization
- [1, 30], [31, 40], [41, 50], [51, 60], [61, 90].
- Illegal generalization
- [1, 10], [1, 30], [31, 60], [61, 90].
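The disjointness requirement can be sketched as a simple interval test, reading the two value lists above as closed Age intervals such as [1, 30] (an assumption about the original notation). The function name is hypothetical, and the check covers only disjointness, not whether each interval is an actual node of the Age hierarchy.

```python
def intervals_are_disjoint(intervals):
    """Return True if no two closed intervals (lo, hi) share a value."""
    ordered = sorted(intervals)
    # After sorting, it is enough to compare each interval with its successor.
    return all(prev_hi < lo
               for (_, prev_hi), (lo, _) in zip(ordered, ordered[1:]))


permissible = [(1, 30), (31, 40), (41, 50), (51, 60), (61, 90)]
illegal = [(1, 10), (1, 30), (31, 60), (61, 90)]

permissible_ok = intervals_are_disjoint(permissible)
illegal_ok = intervals_are_disjoint(illegal)
print(permissible_ok, illegal_ok)
```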
45Why all these generalization types?
- Reason 1: If a dataset is generalized in a more
restricted manner, less preprocessing is required
before it can be analyzed by a standard
statistical tool (such as SAS).
46Why all these generalization types?
- Reason 2: More restrictive generalization is
usually faster to compute and easier to analyze.
47Why all these generalization types?
- Reason 3: Less restrictive generalization
promises more accurate data analysis, provided
that a sophisticated analytical method is used.
48Generalization algorithms
- Operate on a quality metric. Examples:
- The generalization level (for full-domain
recoding)
- Total rectangle size (for local recoding)
- Mostly heuristics-based.
- Finding the optimal generalization is often
NP-hard.
49Defect of generalization
- Query A: SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = 'pneumonia' AND Age in [0, 30]
AND Zipcode in [10001, 20000]
- Estimated answer: 2p, where p is the probability
that each of the two tuples satisfies the query
conditions on the Age and Zipcode.
50Defect of generalization (cont.)
- Query A: SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = 'pneumonia' AND Age in [0, 30]
AND Zipcode in [10001, 20000]
- $p = \mathrm{Area}(R_1 \cap Q) / \mathrm{Area}(R_1) = 0.05$
- Estimated answer for Query A: $2p = 0.1$
51Defect of generalization (cont.)
- Query A: SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = 'pneumonia' AND Age in [0, 30]
AND Zipcode in [10001, 20000]
- Estimated answer: 0.1
52Defect of generalization (cont.)
- Cause of inaccuracy: the QI distribution inside
each QI group is lost!
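The uniform-distribution estimate used for Query A can be sketched as an area ratio. The bounds chosen for the generalized rectangle R1 below are hypothetical (the text does not give them); they were picked so that p comes out close to the 0.05 quoted above. The function name is also an assumption.

```python
def overlap_fraction(rect, query):
    """Fraction of `rect` covered by `query`, assuming a uniform distribution.

    Both arguments are lists of (lo, hi) ranges, one per QI attribute,
    and every range of `rect` must have hi > lo.
    """
    fraction = 1.0
    for (lo, hi), (q_lo, q_hi) in zip(rect, query):
        overlap = max(0.0, min(hi, q_hi) - max(lo, q_lo))
        fraction *= overlap / (hi - lo)
    return fraction


# Hypothetical rectangle R1 over (Age, Zipcode); the query ranges are from Query A.
r1 = [(21, 60), (10001, 60000)]
query_a = [(0, 30), (10001, 20000)]

p = overlap_fraction(r1, query_a)
estimate = 2 * p  # two tuples in the group have pneumonia
print(round(p, 2), round(estimate, 1))
```

The estimate stays far below the count of pneumonia tuples because the rectangle says nothing about where inside it the tuples actually lie.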
53Anatomy
- Releases a quasi-identifier table (QIT) and a
sensitive table (ST).
Sensitive table (ST)
Quasi-identifier table (QIT)
Microdata
54Anatomy (cont.)
- 1. Decide an l-diverse partition of the tuples.
QI group 1
QI group 2
A 2-diverse partition
55Anatomy (cont.)
- 2. Generate a quasi-identifier table (QIT) and a
sensitive table (ST) based on the selected
partition.
group 1
group 2
quasi-identifier table (QIT)
sensitive table (ST)
56Anatomy (cont.)
- 2. Generate a quasi-identifier table (QIT) and a
sensitive table (ST) based on the decided
partition.
quasi-identifier table (QIT)
sensitive table (ST)
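The two anatomy steps can be sketched as follows: given a partition (assumed here to be already l-diverse), emit a QIT that keeps exact QI values plus a group ID, and an ST that keeps per-group counts of sensitive values. The microdata rows, the column names `GroupID` and `Count`, and the function name `anatomize` are assumptions for illustration, not the authors' implementation.

```python
from collections import Counter


def anatomize(rows, group_ids, qi_columns, sensitive_column):
    """Split rows into a quasi-identifier table and a sensitive table."""
    qit = []
    counts = Counter()
    for row, gid in zip(rows, group_ids):
        qi_part = {c: row[c] for c in qi_columns}
        qit.append({**qi_part, "GroupID": gid})
        counts[(gid, row[sensitive_column])] += 1
    st = [{"GroupID": gid, sensitive_column: value, "Count": n}
          for (gid, value), n in sorted(counts.items())]
    return qit, st


# Hypothetical microdata and a hypothetical 2-diverse partition.
microdata = [
    {"Age": 23, "Zipcode": 11000, "Disease": "pneumonia"},
    {"Age": 27, "Zipcode": 13000, "Disease": "dyspepsia"},
    {"Age": 35, "Zipcode": 59000, "Disease": "dyspepsia"},
    {"Age": 59, "Zipcode": 12000, "Disease": "pneumonia"},
    {"Age": 61, "Zipcode": 54000, "Disease": "flu"},
    {"Age": 65, "Zipcode": 25000, "Disease": "gastritis"},
]
partition = [1, 1, 1, 1, 2, 2]

qit, st = anatomize(microdata, partition, ["Age", "Zipcode"], "Disease")
print(qit[0])
print(st[0])
```

Unlike generalization, the QIT keeps the exact QI values; the link to the sensitive attribute is only available at the group level through the ST.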
57Privacy preservation
- Given a pair of QIT and ST generated from an
l-diverse partition, an adversary can infer the
sensitive value of each individual with
confidence at most 1 / l.
sensitive table (ST)
quasi-identifier table (QIT)
58Data analysis
- Query A: SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = 'pneumonia' AND Age in [0, 30]
AND Zipcode in [10001, 20000]
Sensitive table (ST)
Quasi-identifier table (QIT)
59Data analysis (cont.)
- Query A: SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = 'pneumonia' AND Age in [0, 30]
AND Zipcode in [10001, 20000]
- 2 patients contracted pneumonia
- 2 out of 4 patients satisfy the query conditions
on Age and Zipcode
- Estimated answer: $2 \cdot 2 / 4 = 1$.
t1 t2 t3 t4
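The estimate for Query A on anatomized data can be sketched as: for each ST entry with the queried disease, multiply its count by the fraction of QIT tuples in the same group that satisfy the QI conditions. The QIT and ST contents below are hypothetical values chosen to match the "2 patients" and "2 out of 4" figures above; the function name is an assumption.

```python
def estimate_count(qit, st, disease, age_range, zip_range):
    """Estimate COUNT(*) for a query on Disease, Age and Zipcode from QIT + ST."""
    estimate = 0.0
    for entry in st:
        if entry["Disease"] != disease:
            continue
        members = [t for t in qit if t["GroupID"] == entry["GroupID"]]
        matching = [t for t in members
                    if age_range[0] <= t["Age"] <= age_range[1]
                    and zip_range[0] <= t["Zipcode"] <= zip_range[1]]
        estimate += entry["Count"] * len(matching) / len(members)
    return estimate


# Hypothetical QIT / ST for a single QI group of four tuples.
qit = [
    {"Age": 23, "Zipcode": 11000, "GroupID": 1},
    {"Age": 27, "Zipcode": 13000, "GroupID": 1},
    {"Age": 35, "Zipcode": 59000, "GroupID": 1},
    {"Age": 59, "Zipcode": 12000, "GroupID": 1},
]
st = [
    {"GroupID": 1, "Disease": "dyspepsia", "Count": 2},
    {"GroupID": 1, "Disease": "pneumonia", "Count": 2},
]

answer = estimate_count(qit, st, "pneumonia", (0, 30), (10001, 20000))
print(answer)  # 1.0
```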
60A defect of anatomy
- Existence breach: Does an individual exist in the
microdata?
61Future work
- Re-publication
- Tackle stronger background knowledge
- Recent work: Martin et al., ICDE, 2007
- Improving utility
- Pioneering work: Kifer and Gehrke, SIGMOD, 2006
- Application to specific (non-trivial)
applications
- Location privacy
- Pioneering work: Mokbel et al., VLDB, 2006