Title: Privacy Protection
1 Privacy Protection
- This presentation was prepared by Yufei Tao. http://www.cse.cuhk.edu.hk/taoyf
2 Motivation
- A hospital has a table, the microdata, to publish.
3 Motivation
4 Linking attack
[Figure: an adversary links the published table with a voter registration list through the quasi-identifier (QI) attributes.]
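A linking attack is essentially a join on the QI attributes. The sketch below is illustrative only: the rows, the attribute names (`zipcode`, `gender`, `birth_year`), and the helper `linking_attack` are hypothetical and do not come from the slides.

```python
# Hypothetical toy data; none of these rows appear in the slides.
published = [
    {"zipcode": "11000", "gender": "M", "birth_year": 1975, "disease": "pneumonia"},
    {"zipcode": "25000", "gender": "F", "birth_year": 1982, "disease": "flu"},
    {"zipcode": "25000", "gender": "F", "birth_year": 1990, "disease": "bronchitis"},
]
voters = [
    {"name": "Joe", "zipcode": "11000", "gender": "M", "birth_year": 1975},
    {"name": "Ann", "zipcode": "25000", "gender": "F", "birth_year": 1982},
]
QI = ("zipcode", "gender", "birth_year")


def linking_attack(published, voters, qi):
    """Map each voter name to the sensitive values of published rows with the same QI values."""
    index = {}
    for row in published:
        key = tuple(row[a] for a in qi)
        index.setdefault(key, []).append(row["disease"])
    return {v["name"]: index.get(tuple(v[a] for a in qi), []) for v in voters}


matches = linking_attack(published, voters, QI)
```

A voter whose QI values match exactly one published row is re-identified, even though the published table contains no names.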
5 Real threats
- Fact: 87% of Americans can be uniquely identified by ZIP code, gender, and date of birth.
- Sweeney [International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002] shows that she can identify the medical record of an ex-governor of Massachusetts from a real publication.
6 Real threats (cont.)
- Banking data
- Tax records
- Exam scores
- Credit card transactions
7 Privacy protection
- Distort the dataset before releasing it.
- Concerns
- Privacy
- Utility: the dataset must be useful for research.
- Paradox: privacy ↑, utility ↓.
8 Main issues
- Privacy principle
- What do we mean by adequate privacy protection?
- Distortion algorithm
- How to achieve the above principle?
9 Generalization
- Replace a QI-value with a fuzzier form.
[Figure: a generalized table with QI attributes and a sensitive attribute, partitioned into 4 QI groups.]
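As a minimal sketch of generalization, the helpers below replace an exact age with an interval and a ZIP code with a masked prefix. Both function names, the interval width, and the masking scheme are assumptions made for illustration; the slides do not prescribe a particular generalization scheme.

```python
def generalize_age(age, width=10):
    """Replace an exact age with the closed interval of the given width that contains it."""
    lo = (age // width) * width
    return f"[{lo}, {lo + width - 1}]"


def generalize_zipcode(zipcode, keep=2):
    """Keep the first `keep` characters of the ZIP code and mask the rest with '*'."""
    return zipcode[:keep] + "*" * (len(zipcode) - keep)


# Hypothetical values, chosen only to show the effect.
fuzzy_age = generalize_age(23)
fuzzy_zip = generalize_zipcode("11000")
```

Tuples whose generalized QI values coincide fall into the same QI group.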
10 k-anonymity [Sweeney, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002]
- Each QI-group has at least k tuples.
- 2-anonymous generalization
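The k-anonymity condition can be checked by counting the tuples in each QI group. This is a sketch under assumptions: the table is a list of dictionaries, and the rows and attribute names are hypothetical.

```python
from collections import Counter


def is_k_anonymous(rows, qi, k):
    """Return True if every QI group (rows sharing the same QI values) has at least k tuples."""
    group_sizes = Counter(tuple(row[a] for a in qi) for row in rows)
    return all(size >= k for size in group_sizes.values())


# Hypothetical generalized table with two QI groups of two tuples each.
rows = [
    {"age": "[20, 29]", "zipcode": "11***", "disease": "pneumonia"},
    {"age": "[20, 29]", "zipcode": "11***", "disease": "flu"},
    {"age": "[30, 39]", "zipcode": "25***", "disease": "flu"},
    {"age": "[30, 39]", "zipcode": "25***", "disease": "bronchitis"},
]
two_anonymous = is_k_anonymous(rows, ("age", "zipcode"), 2)
three_anonymous = is_k_anonymous(rows, ("age", "zipcode"), 3)
```

Here the table is 2-anonymous but not 3-anonymous, because each group holds exactly two tuples.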
11 Defects of k-anonymity
- What is the disease of Joe?
No diversity in this QI group.
A voter registration list
12 l-diversity [Machanavajjhala et al., ICDE 2006]
- Each QI group should have at least l well-represented sensitive values.
- There are different ways to define well-representativeness.
- Naive interpretation: l different values.
l = 2
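The naive interpretation only counts distinct sensitive values per QI group. A minimal sketch, assuming a list-of-dictionaries table with hypothetical rows and attribute names (the parameter is spelled `ell` to avoid confusion with the digit 1):

```python
from collections import defaultdict


def is_distinct_l_diverse(rows, qi, sensitive, ell):
    """Return True if every QI group contains at least `ell` different sensitive values."""
    values = defaultdict(set)
    for row in rows:
        values[tuple(row[a] for a in qi)].add(row[sensitive])
    return all(len(group_values) >= ell for group_values in values.values())


# Hypothetical tables: the first has two diseases per group, the second has a uniform group.
diverse = [
    {"age": "[20, 29]", "disease": "pneumonia"},
    {"age": "[20, 29]", "disease": "flu"},
    {"age": "[30, 39]", "disease": "flu"},
    {"age": "[30, 39]", "disease": "bronchitis"},
]
uniform = [
    {"age": "[20, 29]", "disease": "pneumonia"},
    {"age": "[20, 29]", "disease": "pneumonia"},
]
diverse_ok = is_distinct_l_diverse(diverse, ("age",), "disease", 2)
uniform_ok = is_distinct_l_diverse(uniform, ("age",), "disease", 2)
```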
13 Defects of the naive interpretation
- Assume that Joe is identified in the following QI group. What is the probability that he contracted HIV?
- Implication: the frequency of the most frequent sensitive value in a QI group should be bounded by 1 / l.
- A very popular definition of l-diversity.
[Figure: a QI group with 100 tuples, in which one block of 98 tuples is marked.]
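The frequency bound can be checked directly. The sketch below assumes that the 98 tuples in the figure carry HIV and that the remaining 2 carry a single other disease; this assignment is a guess made for illustration, not something the surviving labels confirm.

```python
from collections import Counter


def max_sensitive_fraction(values):
    """Fraction of the group taken up by its most frequent sensitive value."""
    counts = Counter(values)
    return max(counts.values()) / len(values)


def satisfies_frequency_l_diversity(values, ell):
    """True if the most frequent value covers at most 1 / ell of the group (integer comparison)."""
    counts = Counter(values)
    return max(counts.values()) * ell <= len(values)


# Assumed composition of the 100-tuple group (see the lead-in).
group = ["HIV"] * 98 + ["pneumonia"] * 2
distinct_values = len(set(group))
fraction = max_sensitive_fraction(group)
bounded = satisfies_frequency_l_diversity(group, 2)
```

Under these assumptions the group has 2 different values, so it passes the naive test for l = 2, yet an adversary who places Joe in it is 98% sure of his disease; the frequency-based test rejects the group.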
14 Exclusive-value attacks
- A friend of Joe has the knowledge "Joe does not have pneumonia".
- How likely would this friend assume that Joe had HIV?
[Figure: a QI group with 100 tuples, including one block of 50 tuples and another of 49 tuples.]
15 Battling exclusive-value attacks
- Even if an adversary can eliminate pneumonia,
s/he can infer that Joe has HIV only with 40 / 70
probability.
[Figure: a QI group with 100 tuples, split into blocks of 40, 30 and 30 tuples.]
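The 40 / 70 figure follows from conditioning on the values the adversary cannot rule out. In the sketch below, the assignment of 40 tuples to HIV and 30 to pneumonia is inferred from that ratio, and the name of the third disease is an assumption.

```python
from collections import Counter
from fractions import Fraction


def posterior_after_exclusion(values, target, excluded):
    """Probability that a group member has `target`, given that the `excluded` values are ruled out."""
    if target in excluded:
        raise ValueError("target value has been ruled out")
    counts = Counter(values)
    remaining = sum(n for value, n in counts.items() if value not in excluded)
    if remaining == 0:
        raise ValueError("no tuples remain after exclusion")
    return Fraction(counts[target], remaining)


# 100-tuple group; "bronchitis" is a hypothetical label for the third block.
group = ["HIV"] * 40 + ["pneumonia"] * 30 + ["bronchitis"] * 30
prior = posterior_after_exclusion(group, "HIV", set())
posterior = posterior_after_exclusion(group, "HIV", {"pneumonia"})
```

With nothing excluded the probability is 40 / 100; after pneumonia is excluded it rises to 40 / 70, which `Fraction` reduces to 4 / 7.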
16 Battling 3-exclusive-values attacks
[Figure: a QI group partitioned by sensitive-value frequency into the most frequent value, the 2nd, 3rd and 4th most frequent values, and the other values.]
17 Battling 3-exclusive-values attacks
[Figure: a QI group showing the most frequent value and the other values, with the annotation "As many as the red box".]
18 Battling 3-exclusive-values attacks
- Assume that Joe is a person in the QI group.
- Property: if an adversary can eliminate at most 3 diseases, s/he can correctly guess the disease of Joe with at most 50% probability.
[Figure: a QI group containing HIV, pneumonia, bronchitis, cancer, and the other values.]
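The 50% property can be tested by brute force: for every possible target value, let the adversary rule out the m most frequent other values and record the best resulting guess. This is a sketch under assumptions; the function name, the tuple counts, and the disease names "flu" and "gastritis" are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def worst_case_confidence(values, m):
    """Highest probability of a correct guess when the adversary can rule out at most m values."""
    counts = Counter(values)
    total = len(values)
    best = Fraction(0)
    for target, n in counts.items():
        # Ruling out the m most frequent other values helps the adversary the most.
        others = sorted((c for v, c in counts.items() if v != target), reverse=True)
        remaining = total - sum(others[:m])
        best = max(best, Fraction(n, remaining))
    return best


# Hypothetical group: the values outside the top four (flu, gastritis) together
# hold as many tuples as the most frequent value (HIV).
group = (["HIV"] * 20 + ["pneumonia"] * 15 + ["bronchitis"] * 15
         + ["cancer"] * 10 + ["flu"] * 10 + ["gastritis"] * 10)
confidence = worst_case_confidence(group, 3)
```

In this example the worst case arises when pneumonia, bronchitis and cancer are ruled out: 20 HIV tuples remain against 20 others, so the adversary's confidence is exactly 1 / 2.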
19 A short summary
- Why data privacy?
- How to protect it?
- A very active research topic with urgent
applications.