An Zhu

About This Presentation

Title:

An Zhu


Slides: 77

Provided by: anz2

Learn more at: http://www-cs-students.stanford.edu



1
Towards Achieving Anonymity

An Zhu

2
Introduction

Collect and analyze personal data
Infer trends and patterns
Making the personal data public
Joining multiple sources
Third party involvement
Privacy concerns
Q: How to share such data?

3
Example: Medical Records
Identifiers: SSN, Name. Sensitive info: Disease.
SSN  Name   Age  Race   Zipcode  Disease
614  Sara   31   Cauc   94305    Flu
615  Joan   34   Cauc   94307    Cold
629  Kelly  27   Cauc   94301    Diabetes
710  Mike   41   Afr-A  94305    Flu
840  Carl   41   Afr-A  94059    Arthritis
780  Joe    65   Hisp   94042    Heart problem
616  Rob    46   Hisp   94042    Arthritis
4
De-identified Records
Sensitive Info
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
5
Not Sufficient! [Sweeney 00]
Sensitive Info
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
Unique Identifiers!
Public Database
6
Not Sufficient! [Sweeney 00]
Quasi-identifiers: Age, Race, Zipcode. Sensitive info: Disease.
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
Unique Identifiers!
Public Database
7
Anonymize the Quasi-Identifiers!
Quasi-identifiers: Age, Race, Zipcode. Sensitive info: Disease.
Age Race Zipcode Disease
Flu
Cold
Diabetes
Flu
Arthritis
Heart problem
Arthritis
Unique Identifiers!
Public Database
8
Q: How to share such data?

Anonymize the quasi-identifiers
Suppress information
Privacy guarantee: anonymity
Quality: the amount of suppressed information
Clustering
Privacy guarantee: cluster size
Quality: various clustering measures

10
k-anonymized Table [Samarati 01]
Quasi-identifiers: Age, Race, Zipcode. Sensitive info: Disease.
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
11
k-anonymized Table [Samarati 01]
Quasi-identifiers: Age, Race, Zipcode. Sensitive info: Disease. Suppressed entries are shown as *.
Age  Race   Zipcode  Disease
*    Cauc   *        Flu
*    Cauc   *        Cold
*    Cauc   *        Diabetes
41   Afr-A  *        Flu
41   Afr-A  *        Arthritis
*    Hisp   94042    Heart problem
*    Hisp   94042    Arthritis
Each row is identical to at least k-1 other rows
12
Definition: k-anonymity

Input: a table consisting of n rows, each with m attributes (quasi-identifiers)
Output: suppress some entries such that each row is identical to at least k-1 other rows
Objective: minimize the number of suppressed entries
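The definition can be checked mechanically. The sketch below is only an illustration: the `*` marker for a suppressed entry, the function names, and the sample rows (taken from the k-anonymized medical table shown earlier, with blank cells read as suppressed) are assumptions, not part of the original slides.

```python
from collections import Counter

SUPPRESSED = "*"  # assumed marker for a suppressed entry


def is_k_anonymous(rows, k):
    """Return True if every row is identical to at least k-1 other rows."""
    counts = Counter(tuple(r) for r in rows)
    return all(c >= k for c in counts.values())


def suppression_cost(rows):
    """Objective value: the number of suppressed entries in the table."""
    return sum(entry == SUPPRESSED for r in rows for entry in r)


# Quasi-identifier columns (Age, Race, Zipcode) of the anonymized table.
table = [
    ("*", "Cauc", "*"),
    ("*", "Cauc", "*"),
    ("*", "Cauc", "*"),
    ("41", "Afr-A", "*"),
    ("41", "Afr-A", "*"),
    ("*", "Hisp", "94042"),
    ("*", "Hisp", "94042"),
]

print(is_k_anonymous(table, 2), suppression_cost(table))
```

The sensitive column (Disease) is left out on purpose, since only the quasi-identifiers have to match.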

13
Past Work and New Results

[MW 04]
NP-hardness for a large size alphabet
O(k log k)-approximation
[AFKMPTZ 05]
NP-hardness even for ternary alphabet
O(k)-approximation
1.5-approximation for 2-anonymity
2-approximation for 3-anonymity

15
Graph Representation
Rows of the table (one vertex per row):
A  0 0 1 0 0 0
B  1 0 0 1 0 1
C  0 1 0 1 0 1
D  0 0 1 0 0 0
E  1 1 0 1 1 1
F  0 1 1 0 1 1
[Figure: graph on vertices A-F; the edge weights shown are 4, 2, 3, 3, 4, 6]
W(e) = Hamming distance between the two rows
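As a small illustration of the graph representation, the following sketch computes the weight of every edge for the six example rows. The row labels A-F follow the slide; the dictionary layout and the function name are assumptions made for this example.

```python
from itertools import combinations

# The six rows from the slide, keyed by their vertex label.
rows = {
    "A": "001000",
    "B": "100101",
    "C": "010101",
    "D": "001000",
    "E": "110111",
    "F": "011011",
}


def hamming(u, v):
    """Number of positions in which two equal-length rows differ."""
    return sum(a != b for a, b in zip(u, v))


# Complete graph: one weighted edge per unordered pair of rows.
weights = {
    (u, v): hamming(rows[u], rows[v])
    for u, v in combinations(sorted(rows), 2)
}

for edge, w in sorted(weights.items()):
    print(edge, w)
```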
16
Edge Selection I
[Figure: the same table and graph on vertices A-F; the selected edges have weights 2, 2, 0, 3, 2]
Each node selects its lightest-weight edge
k = 3
17
Edge Selection II
[Figure: the same table and graph on vertices A-F; the selected edges have weights 3, 2, 0, 3, 2]
For components with < k vertices, add more edges
k = 3
18
Lemma

Total weight of edges selected is no more than OPT
In the optimal solution, each vertex pays at least the weight of its (k-1)st lightest edge
Forest: at most one edge per vertex
By construction, each selected edge weighs no more than the (k-1)st lightest edge of its vertex
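The lower bound used in the lemma can be evaluated directly: sum, over all vertices, the weight of the (k-1)st lightest incident edge. The sketch below does this for the six example rows with k = 3; the function name and list layout are assumptions for illustration, and k >= 2 is assumed.

```python
def hamming(u, v):
    """Number of positions in which two equal-length rows differ."""
    return sum(a != b for a, b in zip(u, v))


# Rows A-F from the Graph Representation slide, in that order.
rows = ["001000", "100101", "010101", "001000", "110111", "011011"]


def lower_bound(rows, k):
    """Sum over vertices of the (k-1)st lightest incident edge weight."""
    total = 0
    for i, r in enumerate(rows):
        dists = sorted(hamming(r, s) for j, s in enumerate(rows) if j != i)
        total += dists[k - 2]  # (k-1)st smallest, 0-indexed
    return total


print(lower_bound(rows, 3))
```

For these rows the bound comes out below the optimal cost of 3 × 3 + 3 × 3 = 18 reported on a later slide, as the lemma requires.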

19
Grouping

Ideally, each connected component forms a group
Anonymize vertices within a group
Total cost of a group:
(total edge weights) × (number of nodes)
(2 + 2 + 3 + 3) × 6

[Figure: tree on vertices A-F with edge weights 2, 3, 0, 3, 2]
Small groups: O(k)
20
Dividing a Component

Root tree arbitrarily
Divide if sub-tree and rest are both ≥ k
Aim: all sub-trees < k

[Figure: rooted tree with sub-trees labeled < k and ≥ k]
21
Dividing a Component

Root tree arbitrarily
Divide if sub-tree and rest are both ≥ k
Rotate the tree if necessary

[Figure: tree with sub-trees labeled ≥ k]
22
Dividing a Component

Root tree arbitrarily
Divide if sub-tree and rest are both ≥ k
T. condition: max(2k-1, 3k-5)

[Figure: tree whose sub-trees are all < k]
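The division rule can be sketched naively: keep cutting any tree edge whose removal leaves at least k vertices on both sides. This is a simplified illustration only. It ignores the rooting and rotation steps and does not enforce the max(2k-1, 3k-5) condition from the slides, and the edge list used below is one tree assumed to be consistent with the weights 2, 3, 0, 3, 2 shown in the figures.

```python
from collections import defaultdict


def components(vertices, edges):
    """Connected components of an undirected graph, as a list of sets."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for s in vertices:
        if s in seen:
            continue
        stack, comp = [s], set()
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(adj[x] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def divide(vertices, edges, k):
    """Cut forest edges while both resulting pieces keep >= k vertices."""
    edges = list(edges)
    changed = True
    while changed:
        changed = False
        before = components(vertices, edges)
        for e in edges:
            rest = [f for f in edges if f != e]
            # In a forest, removing one edge creates exactly two new pieces.
            new = [c for c in components(vertices, rest) if c not in before]
            if len(new) == 2 and all(len(c) >= k for c in new):
                edges, changed = rest, True
                break
    return sorted(sorted(c) for c in components(vertices, edges))


# Assumed tree on A-F: A-D (0), A-F (3), F-C (3), C-B (2), C-E (2).
tree = [("A", "D"), ("A", "F"), ("F", "C"), ("C", "B"), ("C", "E")]
print(divide("ABCDEF", tree, 3))
```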
23
An Example
[Figure: the table rows A-F from the Graph Representation slide and the selected tree, with edge weights 2, 3, 0, 3, 2]
24
An Example
[Figure: the same table; the tree redrawn with C at the root; edge weights 2, 2, 3, 3, 0]
25
An Example
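Rows after suppression (suppressed entries shown as *):
A  0 * 1 0 * *
B  * * 0 1 * 1
C  * * 0 1 * 1
D  0 * 1 0 * *
E  * * 0 1 * 1
F  0 * 1 0 * *
[Figure: the tree split into two groups; remaining edge weights 2, 2, 3, 0]
Estimated cost: 4 × 3 + 3 × 3
Optimal cost: 3 × 3 + 3 × 3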
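The two cost figures can be reproduced with a few lines. In this sketch the grouping {A, D, F} and {B, C, E} is inferred from the suppressed table rather than stated on the slide, and the helper name is hypothetical. The actual cost of a group is the number of columns on which its rows disagree times the group size; the estimate is the total tree-edge weight inside the group times the group size.

```python
rows = {
    "A": "001000",
    "B": "100101",
    "C": "010101",
    "D": "001000",
    "E": "110111",
    "F": "011011",
}


def group_cost(names):
    """Suppressed entries needed to make all rows in the group identical."""
    columns = zip(*(rows[n] for n in names))
    disagreeing = sum(len(set(col)) > 1 for col in columns)
    return disagreeing * len(names)


# Actual cost of the two groups (assumed grouping, see lead-in).
actual = group_cost("BCE") + group_cost("ADF")

# Estimate from edge weights: (2 + 2) inside {B, C, E}, (3 + 0) inside {A, D, F}.
estimated = (2 + 2) * 3 + (3 + 0) * 3

print(actual, estimated, group_cost("ABCDEF"))
```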
27
1.5-approximation
Rows of the table (one vertex per row):
A  0 0 1 0 0 0
B  0 0 0 0 0 0
C  1 1 1 1 1 1
D  0 0 1 0 0 0
E  1 1 0 1 1 1
F  1 1 0 1 1 1
[Figure: graph on vertices A-F; the edge weights shown are 1, 6, 6, 0, 5, 6]
W(e) = Hamming distance between the two rows
28
Minimum 1,2-matching
[Figure: the same table and graph on vertices A-F; the matched edges have weights 1, 0, 0, 1]
Each vertex is matched to 1 or 2 other vertices
29
Properties

Each component has ≤ 3 nodes

Components with > 3 nodes:
Not possible (degree ≤ 2)
Not optimal
30
Qualities

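Cost ≤ 2 OPT
For binary alphabet: 1.5 OPT

[Figure: a pair joined by an edge of weight a; a triple with edge weights p, q, r]
r ≥ p, q
Pair: OPT pays 2a, we pay 2a
Triple: OPT pays ≥ p + q + r, we pay ≤ 3(p + q) ≤ 2(p + q + r)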
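Reading the slide's relation as r ≥ p and r ≥ q, the step from 3(p + q) to 2(p + q + r) can be spelled out as follows (same symbols as on the slide; this is a worked restatement, not an additional result):

```latex
r \ge p,\quad r \ge q
\;\Longrightarrow\; p + q \le 2r
\;\Longrightarrow\; 3(p+q) = 2(p+q) + (p+q) \le 2(p+q) + 2r = 2(p+q+r).
```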
32
Open Problems

Can we improve O(k)?
Ω(k) for graph representation

33
Open Problems

Can we improve O(k)?
Ω(k) for graph representation
1111111100000000000000000000000000000000
0000000011111111000000000000000000000000
0000000000000000111111110000000000000000
0000000000000000000000001111111100000000
0000000000000000000000000000000011111111
k = 5, d = 16, c = k × d / 2

35
Open Problems

Can we improve O(k)?
Ω(k) for graph representation
10101010101010101010101010101010
11001100110011001100110011001100
11110000111100001111000011110000
11111111000000001111111100000000
11111111111111110000000000000000
k = 5, d = 16, c = 2 × d
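Both instances in this open-problem discussion have the same pairwise Hamming distance d = 16 between every two rows, so a graph that records only pairwise distances cannot tell them apart. The sketch below regenerates the two row sets and checks that property. The generator expressions and names are assumptions for illustration, and the snippet does not attempt to reproduce the slide's cost c.

```python
from itertools import combinations


def hamming(u, v):
    """Number of positions in which two equal-length rows differ."""
    return sum(a != b for a, b in zip(u, v))


k, block = 5, 8

# Instance 1: row i has a block of eight 1s at block position i (40 columns).
inst1 = ["0" * (block * i) + "1" * block + "0" * (block * (k - 1 - i))
         for i in range(k)]

# Instance 2: row i alternates runs of 2**i ones and 2**i zeros (32 columns).
inst2 = [("1" * 2**i + "0" * 2**i) * (16 // 2**i) for i in range(k)]


def pairwise(rows):
    """Set of all pairwise Hamming distances among the rows."""
    return {hamming(u, v) for u, v in combinations(rows, 2)}


print(pairwise(inst1), pairwise(inst2))
```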

38
Clustering Approach [AFKKPTZ 06]
Quasi-identifiers: Age, Race, Zipcode. Sensitive info: Disease.
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
39
Transfers into a Metric
Quasi-identifiers: Age, Race, Zipcode. Sensitive info: Disease.
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
40
Clusters and Centers
Quasi-identifiers: Age, Race, Zipcode. Sensitive info: Disease.
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
41
Clusters and Centers
Quasi-identifiers: Age, Race, Zipcode. Sensitive info: Disease.
Age Race Zipcode Disease
31 Cauc 94305 Flu
Cold
Diabetes
Flu
41 Afr-A 94059 Arthritis
Heart problem
46 Hisp 94042 Arthritis
42
Measure

How good are the clusters?
Tight clusters are better
Minimize max radius: Gather-k
Minimize max distortion error: Cellular-k
Σ radius × num_nodes

Cost: Gather-k = 10, Cellular-k = 624
43
Measure

How good are the clusters
Tight clusters are better
Minimize max radius: Gather-k
Minimize max distortion error: Cellular-k
Σ radius × num_nodes
Handle outliers
Constant approximations!

44
Comparison

k = 5
5-anonymity
Suppress all entries
More distortion
Clustering
Can pick R5 as the center
Less distortion
Distortion is directly related to pair-wise distances

R1  0 1 1 1
R2  1 0 1 1
R3  1 1 0 1
R4  1 1 1 0
R5  1 1 1 1
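The comparison can be made concrete for the five rows R1-R5. In this sketch, suppression cost is counted as in the k-anonymity definition, and cluster quality is measured by the radius around a chosen center and by radius × num_nodes. The function names are hypothetical and the code only illustrates the arithmetic behind the slide.

```python
rows = {
    "R1": "0111",
    "R2": "1011",
    "R3": "1101",
    "R4": "1110",
    "R5": "1111",
}


def hamming(u, v):
    """Number of positions in which two equal-length rows differ."""
    return sum(a != b for a, b in zip(u, v))


def suppression_cost(names):
    """Entries to suppress so that all rows in the group become identical."""
    columns = zip(*(rows[n] for n in names))
    return sum(len(set(col)) > 1 for col in columns) * len(names)


def radius(center, names):
    """Largest distance from the chosen center to any row in the cluster."""
    return max(hamming(rows[center], rows[n]) for n in names)


group = sorted(rows)
print(suppression_cost(group))             # every entry is suppressed
print(radius("R5", group))                 # tight cluster around R5
print(radius("R5", group) * len(group))    # radius × num_nodes
```

Picking a different center, for example R1, gives a larger radius, which is why R5 is the natural choice here.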
45
Results [AFKKPTZ 06]

Gather-k
Tight 2-approximation
Extension to outliers: 4-approximation
Cellular-k
Primal-dual constant approximation
Extensions as well

47
2-approximation

Assume an optimal value R
Make sure each node has at least k-1 neighbors within distance 2R.

[Figure: node A with circles of radius R and 2R]
48
2-approximation

Assume an optimal value R
Make sure each node has at least k-1 neighbors within distance 2R.
Pick an arbitrary node as a center and remove all remaining nodes within distance 2R. Repeat until all nodes are gone.
Make sure we can reassign nodes to the selected centers.
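The first two steps of the procedure can be sketched as below for a guessed value R. This is a partial illustration under stated assumptions: the points and the distance function are made up for the example, and the final reassignment step (which the slides handle with a degree-constrained matching) is not implemented.

```python
def select_centers(points, dist, k, R):
    """Greedy center selection for a guessed radius R.

    Returns the list of chosen centers, or None if some node has fewer
    than k-1 other nodes within distance 2R.
    """
    # Step 1: every node needs at least k-1 neighbors within 2R.
    for i, p in enumerate(points):
        near = sum(1 for j, q in enumerate(points)
                   if j != i and dist(p, q) <= 2 * R)
        if near < k - 1:
            return None

    # Step 2: pick a remaining node as center, drop everything within 2R.
    remaining = list(points)
    centers = []
    while remaining:
        c = remaining[0]
        centers.append(c)
        remaining = [q for q in remaining if dist(c, q) > 2 * R]
    return centers


# Hypothetical data: six points on a line, two natural clusters.
points = [0, 1, 2, 10, 11, 12]
dist = lambda a, b: abs(a - b)
print(select_centers(points, dist, k=3, R=1))
```

Any node may be picked as the next center; taking the first remaining one just keeps the sketch deterministic.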

49
Example: k = 5
50
Optimal Solution
R
1
2
51
Center Selection
52
Center Selection
1
53
Center Selection
2R
1
54
Center Selection
2R
1
55
Center Selection
2R
2
1
56
Center Selection
2R
2
1
57
Reassignment
2
1
58
Degree Constrained Matching
k-1
1
1
2
1
1
k-1
1
1
1
1
1
1
59
Actual Clustering
2
1
60
Optimal Clustering
1
2
61
Our guarantees

Return clusters of radius no more than 2R
If R is guessed correctly, then reassignment is possible
Each cluster has at least k nodes
A binary search on the value of R suffices

64
Binary Search on R

Assume an optimal value R
Make sure each node has at least k-1 neighbors within distance 2R.
Not necessary, but is useful for quick pruning
Pick an arbitrary node as a center and remove all remaining nodes within distance 2R. Repeat until all nodes are gone.
Make sure we can reassign nodes to the selected centers.
If successful, R could be smaller
Otherwise, R should be larger
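The search over R can be sketched as a standard binary search over sorted candidate radii. Several things here are assumptions for illustration: the candidates are taken to be the pairwise distances, the points are made up, and the feasibility predicate is a stand-in that only performs the pruning check (k-1 neighbors within 2R) rather than the full center selection and reassignment test from the slides.

```python
from itertools import combinations


def smallest_feasible(candidates, feasible):
    """Smallest candidate accepted by a monotone predicate, or None."""
    lo, hi = 0, len(candidates) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if feasible(candidates[mid]):
            best = candidates[mid]   # success: R could be smaller
            hi = mid - 1
        else:
            lo = mid + 1             # failure: R should be larger
    return best


# Hypothetical data: six points on a line, k = 3.
points = [0, 1, 5, 20, 21, 25]
k = 3
candidates = sorted({abs(a - b) for a, b in combinations(points, 2)})


def pruning_check(R):
    """Stand-in predicate: every point has >= k-1 others within 2R."""
    return all(
        sum(1 for j, q in enumerate(points)
            if j != i and abs(p - q) <= 2 * R) >= k - 1
        for i, p in enumerate(points)
    )


print(smallest_feasible(candidates, pruning_check))
```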

65
Results [AFKKPTZ 06]

Gather-k
Tight 2-approximation
Extension to outliers: 4-approximation
Cellular-k
Primal-dual constant approximation
Extensions

66
Ignore Cluster Size Constraint

Similar to Facility Location
Σ radius × num_nodes vs. Σ individual_distance_to_center
Caveat:
Assigning one distant node to an existing cluster will increase the cost in proportion to the number of nodes in that cluster
Each cluster is a (center, radius) pair

67
Intermediate Step I

Primal-dual constant approximation for
Σ radius × num_nodes
No cluster size constraint
Arbitrary cluster setup cost
We want
Σ radius × num_nodes
Cluster size constraint
No cluster setup cost

68
Enforce Cluster Size

Introduce extra cluster setup cost
Setup cost pays for k nodes to join a particular cluster, i.e., c_setup = k × r
This at most doubles the actual cost of any size-constrained cluster solution
Each cluster's total cost is at least k × r

69
Intermediate Step II

Shared solution!
For each cluster with fewer than k nodes, additional nodes can join the cluster
At no additional cost, paid for by the cluster setup cost
Now nodes could be shared among multiple clusters
Key: convert a shared solution to a disjoint solution

70
Separation

Starting from small radius clusters
Open as long as there are enough nodes
The left-over points in clusters attach to the intersecting smaller-radius (open) clusters

[Figure: clusters labeled Open or Attached]
71
Regroup (k = 5)