Title: An Zhu
1Towards Achieving Anonymity
2Introduction
- Collect and analyze personal data
- Infer trends and patterns
- Making the personal data public
- Joining multiple sources
- Third party involvement
- Privacy concerns
- Q: How to share such data?
3Example Medical Records
Identifiers: SSN, Name; Sensitive Info: Disease
SSN Name Age Race Zipcode Disease
614 Sara 31 Cauc 94305 Flu
615 Joan 34 Cauc 94307 Cold
629 Kelly 27 Cauc 94301 Diabetes
710 Mike 41 Afr-A 94305 Flu
840 Carl 41 Afr-A 94059 Arthritis
780 Joe 65 Hisp 94042 Heart problem
616 Rob 46 Hisp 94042 Arthritis
4De-identified Records
Sensitive Info
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
5Not Sufficient! Sweeney 00
Sensitive Info
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
Unique Identifiers!
Public Database
6Not Sufficient! Sweeney 00
Quasi-Identifiers Quasi-Identifiers Quasi-Identifiers Sensitive Info
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
Unique Identifiers!
Public Database
7Anonymize the Quasi-Identifiers!
Quasi-Identifiers Quasi-Identifiers Quasi-Identifiers Sensitive Info
Age Race Zipcode Disease
* * * Flu
* * * Cold
* * * Diabetes
* * * Flu
* * * Arthritis
* * * Heart problem
* * * Arthritis
(* = suppressed entry)
Unique Identifiers!
Public Database
8Q How to share such data?
- Anonymize the quasi-identifiers
- Suppress information
- Privacy guarantee: anonymity
- Quality: the amount of suppressed information
- Clustering
- Privacy guarantee: cluster size
- Quality: various clustering measures
10k-anonymized Table Samarati 01
Quasi-Identifiers Quasi-Identifiers Quasi-Identifiers Sensitive Info
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
11k-anonymized Table Samarati 01
Quasi-Identifiers Quasi-Identifiers Quasi-Identifiers Sensitive Info
Age Race Zipcode Disease
* Cauc * Flu
* Cauc * Cold
* Cauc * Diabetes
41 Afr-A * Flu
41 Afr-A * Arthritis
* Hisp 94042 Heart problem
* Hisp 94042 Arthritis
(* = suppressed entry)
Each row is identical to at least k-1 other rows
12Definition k-anonymity
- Input: a table consisting of n rows, each with m attributes (quasi-identifiers)
- Output: suppress some entries such that each row is identical to at least k-1 other rows
- Objective: minimize the number of suppressed entries
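The definition can be checked mechanically. The sketch below is only an illustration: the marker "*" for a suppressed entry and the function names `is_k_anonymous` and `suppressed_entries` are assumptions, not notation from the slides. The sample data is the quasi-identifier part of the medical table after suppression.

```python
from collections import Counter

# Quasi-identifier columns (Age, Race, Zipcode) after suppression.
# "*" is an assumed marker for a suppressed entry.
table = [
    ("*", "Cauc", "*"),
    ("*", "Cauc", "*"),
    ("*", "Cauc", "*"),
    ("41", "Afr-A", "*"),
    ("41", "Afr-A", "*"),
    ("*", "Hisp", "94042"),
    ("*", "Hisp", "94042"),
]

def is_k_anonymous(rows, k):
    """True if every row is identical to at least k-1 other rows."""
    counts = Counter(rows)
    return all(counts[row] >= k for row in rows)

def suppressed_entries(rows):
    """Objective value: the number of suppressed entries."""
    return sum(row.count("*") for row in rows)
```

With this data, the table is 2-anonymous but not 3-anonymous, and the objective value is the count of "*" entries.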
13Past Work and New Results
- MW 04
- NP-hardness for a large size alphabet
- O(k log k)-approximation
- AFKMPTZ 05
- NP-hardness even for ternary alphabet
- O(k)-approximation
- 1.5-approximation for 2-anonymity
- 2-approximation for 3-anonymity
15Graph Representation
A: 0 0 1 0 0 0
B: 1 0 0 1 0 1
C: 0 1 0 1 0 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 0 1 1 0 1 1
[Figure: graph on vertices A-F, one vertex per row; edge weights shown: 4, 2, 3, 3, 4, 6]
W(e) = Hamming distance between the two rows
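The edge weights can be computed directly from the table. This is a minimal sketch that assumes the six table rows correspond to vertices A through F in order; the names `rows`, `hamming` and `weights` are illustrative.

```python
from itertools import combinations

# Assumed mapping of the table rows to the vertices A-F.
rows = {
    "A": (0, 0, 1, 0, 0, 0),
    "B": (1, 0, 0, 1, 0, 1),
    "C": (0, 1, 0, 1, 0, 1),
    "D": (0, 0, 1, 0, 0, 0),
    "E": (1, 1, 0, 1, 1, 1),
    "F": (0, 1, 1, 0, 1, 1),
}

def hamming(a, b):
    """Number of positions in which two rows differ."""
    return sum(x != y for x, y in zip(a, b))

# W(e) for every pair of vertices.
weights = {(u, v): hamming(rows[u], rows[v]) for u, v in combinations(rows, 2)}
```

Rows A and D are identical, so the edge between them has weight 0, while A and E differ in every column.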
16Edge Selection I
[Figure: the same table and graph on vertices A-F; edge weights shown: 2, 2, 0, 3, 2]
Each node selects the lightest weight edge
k = 3
17Edge Selection II
[Figure: the same table and graph on vertices A-F; edge weights shown: 3, 2, 0, 3, 2]
For components with < k vertices, add more edges
k = 3
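The two edge-selection phases can be sketched as follows. This is a simplified reading of the slides, not necessarily the exact rule used in the paper: in phase II it adds the lightest edge leaving a small component, and ties are broken by row order. The function name `build_forest` and the union-find bookkeeping are assumptions made for the illustration.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def build_forest(rows, k):
    """Return edges (i, j) of a forest in which every component has >= k vertices."""
    n = len(rows)
    if k > n:
        raise ValueError("need at least k rows")
    parent = list(range(n))   # union-find forest over row indices
    size = [1] * n
    edges = set()

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def add(u, v):
        ru, rv = find(u), find(v)
        if ru != rv:              # skip edges that would close a cycle
            parent[ru] = rv
            size[rv] += size[ru]
            edges.add((min(u, v), max(u, v)))

    # Phase I: each node selects its lightest weight edge.
    for u in range(n):
        v = min((j for j in range(n) if j != u),
                key=lambda j: hamming(rows[u], rows[j]))
        add(u, v)

    # Phase II: for components with < k vertices, add more edges.
    while True:
        small = [u for u in range(n) if size[find(u)] < k]
        if not small:
            return edges
        comp = find(small[0])
        members = [u for u in range(n) if find(u) == comp]
        u, v = min(((u, j) for u in members for j in range(n) if find(j) != comp),
                   key=lambda e: hamming(rows[e[0]], rows[e[1]]))
        add(u, v)

rows = [
    (0, 0, 1, 0, 0, 0),  # A
    (1, 0, 0, 1, 0, 1),  # B
    (0, 1, 0, 1, 0, 1),  # C
    (0, 0, 1, 0, 0, 0),  # D
    (1, 1, 0, 1, 1, 1),  # E
    (0, 1, 1, 0, 1, 1),  # F
]
forest_k3 = build_forest(rows, 3)
```

Because of the tie-breaking choice, the forest produced here may differ from the one drawn on the slides while still satisfying the component-size requirement.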
18Lemma
- Total weight of edges selected is no more than OPT
- In the optimal solution, each vertex pays at least the weight of the (k-1)st lightest weight edge
- Forest: at most one edge per vertex
- By construction, the edge weight is no more than the (k-1)st lightest weight edge per vertex
19Grouping
- Ideally, each connected component forms a group
- Anonymize vertices within a group
- Total cost of a group
- (total edge weights) × (number of nodes)
- (2+2+3+3) × 6
[Figure: tree on vertices A-F with edge weights 2, 3, 0, 3, 2]
Small groups: O(k)
20Dividing a Component
- Root tree arbitrarily
- Divide if sub-trees and rest ≥ k
- Aim: all sub-trees < k
[Figure: rooted tree whose sub-trees are labeled < k or ≥ k]
21Dividing a Component
- Root tree arbitrarily
- Divide if sub-trees and rest ≥ k
- Rotate the tree if necessary
[Figure: tree with three parts labeled ≥ k]
22Dividing a Component
- Root tree arbitrarily
- Divide if sub-trees and rest ≥ k
- Termination condition: max(2k-1, 3k-5)
[Figure: tree with all sub-trees labeled < k]
23An Example
[Figure: the table with rows A-F and the selected tree; edge weights shown: 2, 3, 0, 3, 2]
24An Example
[Figure: the same tree rooted at C, above F, E, B, A and D; edge weights shown: 2, 2, 3, 3, 0]
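25An Example
A: 0 * 1 0 * *
B: * * 0 1 * 1
C: * * 0 1 * 1
D: 0 * 1 0 * *
E: * * 0 1 * 1
F: 0 * 1 0 * *
(* = suppressed entry)
[Figure: the tree rooted at C split into two groups; remaining edge weights: 2, 2, 3, 0]
Estimated cost: 4×3 + 3×3
Optimal cost: 3×3 + 3×3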
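The cost of a grouping can be recomputed from the table. The sketch below assumes the two groups are {A, D, F} and {B, C, E}, which is inferred from the kept values and the cost figures on the slide; `anonymize_groups`, `suppression_cost` and the "*" marker are illustrative names.

```python
def anonymize_groups(rows, groups):
    """Suppress (with '*') every column on which the rows of a group disagree."""
    out = [None] * len(rows)
    for group in groups:
        first = rows[group[0]]
        keep = [all(rows[i][c] == first[c] for i in group)
                for c in range(len(first))]
        for i in group:
            out[i] = [rows[i][c] if keep[c] else "*" for c in range(len(first))]
    return out

def suppression_cost(table):
    """Total number of suppressed entries."""
    return sum(row.count("*") for row in table)

rows = [
    [0, 0, 1, 0, 0, 0],  # A
    [1, 0, 0, 1, 0, 1],  # B
    [0, 1, 0, 1, 0, 1],  # C
    [0, 0, 1, 0, 0, 0],  # D
    [1, 1, 0, 1, 1, 1],  # E
    [0, 1, 1, 0, 1, 1],  # F
]
groups = [[0, 3, 5], [1, 2, 4]]  # assumed groups {A, D, F} and {B, C, E}
anonymized = anonymize_groups(rows, groups)
cost = suppression_cost(anonymized)
```

Each group suppresses three columns in each of its three rows, which matches the optimal cost 3×3 + 3×3 stated on the slide.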
271.5-approximation
A: 0 0 1 0 0 0
B: 0 0 0 0 0 0
C: 1 1 1 1 1 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 1 1 0 1 1 1
[Figure: graph on vertices A-F; edge weights shown: 1, 6, 6, 0, 5, 6]
W(e) = Hamming distance between the two rows
28Minimum {1,2}-matching
[Figure: the same table and graph on vertices A-F; selected edges have weights 0, 0, 1]
Each vertex is matched to 1 or 2 other vertices
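For a table this small, a minimum-weight matching of this kind can be found by brute force. The sketch below is only an illustration of the definition (every vertex ends up with degree 1 or 2); it enumerates all edge subsets, so it is not the polynomial-time method a real implementation would need. The name `min_12_matching` is an assumption.

```python
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def min_12_matching(rows):
    """Brute force: cheapest edge set in which every vertex has degree 1 or 2."""
    n = len(rows)
    edges = list(combinations(range(n), 2))
    weight = {e: hamming(rows[e[0]], rows[e[1]]) for e in edges}
    best_weight, best_edges = None, None
    for mask in range(1, 1 << len(edges)):
        chosen = [e for b, e in enumerate(edges) if (mask >> b) & 1]
        degree = [0] * n
        for u, v in chosen:
            degree[u] += 1
            degree[v] += 1
        if all(d in (1, 2) for d in degree):
            total = sum(weight[e] for e in chosen)
            if best_weight is None or total < best_weight:
                best_weight, best_edges = total, chosen
    return best_weight, best_edges

rows = [
    (0, 0, 1, 0, 0, 0),  # A
    (0, 0, 0, 0, 0, 0),  # B
    (1, 1, 1, 1, 1, 1),  # C
    (0, 0, 1, 0, 0, 0),  # D
    (1, 1, 0, 1, 1, 1),  # E
    (1, 1, 0, 1, 1, 1),  # F
]
best_weight, best_edges = min_12_matching(rows)
```

On this table, B and C each need an edge of weight at least 1, and the pairs A-D and E-F cost nothing, so the minimum total weight is 2.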
29Properties
- Each component has ≤ 3 nodes
[Figure: components with > 3 nodes are either not possible (degree ≤ 2) or not optimal]
30Qualities
- Cost ≤ 2 OPT
- For binary alphabet: ≤ 1.5 OPT
[Figure: a matched pair at distance a; a matched triple with distances p, q, r, where r ≥ p, q]
OPT pays 2a; we pay 2a
OPT pays ≥ p+q+r; we pay ≤ 3(p+q) ≤ 2(p+q+r)
32Open Problems
- Can we improve O(k)?
- Ω(k) for graph representation
33Open Problems
- Can we improve O(k)?
- Ω(k) for graph representation
- 1111111100000000000000000000000000000000
- 0000000011111111000000000000000000000000
- 0000000000000000111111110000000000000000
- 0000000000000000000000001111111100000000
- 0000000000000000000000000000000011111111
- k = 5, d = 16, c = k × d / 2
35Open Problems
- Can we improve O(k)?
- Ω(k) for graph representation
- 10101010101010101010101010101010
- 11001100110011001100110011001100
- 11110000111100001111000011110000
- 11111111000000001111111100000000
- 11111111111111110000000000000000
- k = 5, d = 16, c = 2 × d
38Clustering Approach AFKKPTZ 06
Quasi-Identifiers Quasi-Identifiers Quasi-Identifiers Sensitive Info
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
39Transfers into a Metric
Quasi-Identifiers Quasi-Identifiers Quasi-Identifiers Sensitive Info
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
40Clusters and Centers
Quasi-Identifiers Quasi-Identifiers Quasi-Identifiers Sensitive Info
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
41Clusters and Centers
Quasi-Identifiers Quasi-Identifiers Quasi-Identifiers Sensitive Info
Age Race Zipcode Disease
31 Cauc 94305 Flu
- - - Cold
- - - Diabetes
- - - Flu
41 Afr-A 94059 Arthritis
- - - Heart problem
46 Hisp 94042 Arthritis
(- = cell left blank on the slide)
42Measure
- How good are the clusters
- Tight clusters are better
- Minimize max radius: Gather-k
- Minimize max distortion error: Cellular-k
- Σ radius × num_nodes
Cost: Gather-k 10, Cellular-k 624
43Measure
- How good are the clusters
- Tight clusters are better
- Minimize max radius: Gather-k
- Minimize max distortion error: Cellular-k
- Σ radius × num_nodes
- Handle outliers
- Constant approximations!
44Comparison
- k = 5
- 5-anonymity
- Suppress all entries
- More distortion
- Clustering
- Can pick R5 as the center
- Less distortion
- Distortion is directly related with pair-wise distances
R1 0 1 1 1
R2 1 0 1 1
R3 1 1 0 1
R4 1 1 1 0
R5 1 1 1 1
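The comparison can be made concrete with a few lines of arithmetic. This is a sketch under stated assumptions: distance is taken to be Hamming distance, the Gather-k cost is the maximum cluster radius, and the Cellular-k cost is Σ radius × num_nodes. The helper names are illustrative.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

rows = {
    "R1": (0, 1, 1, 1),
    "R2": (1, 0, 1, 1),
    "R3": (1, 1, 0, 1),
    "R4": (1, 1, 1, 0),
    "R5": (1, 1, 1, 1),
}

def full_group_suppression(rows):
    """Entries suppressed when all rows must become identical (one group)."""
    table = list(rows.values())
    differing = sum(len(set(column)) > 1 for column in zip(*table))
    return differing * len(table)

def radius(center, members, rows):
    """Largest distance from the center to any member of the cluster."""
    return max(hamming(rows[center], rows[m]) for m in members)

# One cluster containing every row, with R5 as its center.
clusters = [("R5", ["R1", "R2", "R3", "R4", "R5"])]
gather_cost = max(radius(c, m, rows) for c, m in clusters)
cellular_cost = sum(radius(c, m, rows) * len(m) for c, m in clusters)
suppressed = full_group_suppression(rows)
```

Every column differs somewhere, so 5-anonymity suppresses all 20 entries, whereas the cluster around R5 has radius 1.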
45Results AFKKPTZ 06
- Gather-k
- Tight 2-approximation
- Extension to outlier 4-approximation
- Cellular-k
- Primal-dual const. approximation
- Extensions as well
472-approximation
- Assume an optimal value R
- Make sure each node has at least k - 1 neighbors within distance 2R.
[Figure: node A with balls of radius R and 2R]
482-approximation
- Assume an optimal value R
- Make sure each node has at least k - 1 neighbors within distance 2R.
- Pick an arbitrary node as a center and remove all remaining nodes within distance 2R. Repeat until all nodes are gone.
- Make sure we can reassign nodes to the selected centers.
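The steps above can be sketched for a fixed guess of R. This is an illustrative implementation under several assumptions: `dist` is a symmetric metric supplied by the caller, the "arbitrary" center is the remaining node with the lowest index, and the reassignment step is done with simple augmenting paths in which each center keeps itself and must receive k - 1 further nodes within distance 2R. The function name `gather_k_for_radius` is hypothetical.

```python
def gather_k_for_radius(points, k, R, dist):
    """Return {center: members} with radius <= 2R and size >= k, or None on failure."""
    n = len(points)
    near = [[j for j in range(n) if dist(points[i], points[j]) <= 2 * R]
            for i in range(n)]
    # Pruning: each node needs at least k - 1 other nodes within 2R.
    if any(len(near[i]) < k for i in range(n)):
        return None
    # Center selection: pick a node, remove everything within 2R, repeat.
    remaining, centers = set(range(n)), []
    for i in range(n):
        if i in remaining:
            centers.append(i)
            remaining -= set(near[i])
    center_set = set(centers)
    # Reassignment: one slot per extra node a center still needs.
    slots = [c for c in centers for _ in range(k - 1)]
    owner = {}  # node index -> slot index

    def augment(slot, seen):
        for v in near[slots[slot]]:
            if v in center_set or v in seen:
                continue
            seen.add(v)
            if v not in owner or augment(owner[v], seen):
                owner[v] = slot
                return True
        return False

    for slot in range(len(slots)):
        if not augment(slot, set()):
            return None  # some center cannot reach k nodes
    clusters = {c: [c] for c in centers}
    for v in range(n):
        if v in center_set:
            continue
        if v in owner:
            clusters[slots[owner[v]]].append(v)
        else:
            # Leftover node: any center within 2R will do (one always exists).
            clusters[next(c for c in centers if v in near[c])].append(v)
    return clusters

points = [0, 1, 2, 10, 11, 12]
line_dist = lambda a, b: abs(a - b)
result = gather_k_for_radius(points, 3, 1, line_dist)
too_small = gather_k_for_radius(points, 3, 0.4, line_dist)
```

In the usage example, six points on a line fall into two clusters of three for R = 1, and the pruning step rejects R = 0.4.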
49Example k = 5
50Optimal Solution
[Figure: optimal clusters 1 and 2, with radius R marked]
51Center Selection
[Figure sequence, slides 51-56: center 1 is selected and its ball of radius 2R is marked, then center 2 is selected]
57Reassignment
[Figure: centers 1 and 2]
58Degree Constrained Matching
[Figure: degree-constrained matching between centers and nodes; edge labels k-1 and 1]
59Actual Clustering
[Figure: clusters around centers 2 and 1]
60Optimal Clustering
[Figure: optimal clusters 1 and 2]
61Our guarantees
- Return clusters of radius no more than 2R
- If R is guessed correctly, then reassignment is possible
- Each cluster has at least k nodes
- A binary search on the value of R suffices
64Binary Search on R
- Assume an optimal value R
- Make sure each node has at least k - 1 neighbors within distance 2R.
- Not necessary, but is useful for quick pruning
- Pick an arbitrary node as a center and remove all remaining nodes within distance 2R. Repeat until all nodes are gone.
- Make sure we can reassign nodes to the selected centers.
- If successful, R could be smaller
- Otherwise, R should be larger
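The search over R can be sketched as an ordinary binary search. In this illustration `candidates` is an assumed list of radii to try (for instance, values derived from the pairwise distances) and `feasible` is a hypothetical callable that runs the center selection and reassignment for a given R and reports success. The sketch also assumes that success for one value implies success for every larger candidate.

```python
def smallest_feasible_radius(candidates, feasible):
    """Binary search for the smallest candidate R that passes the test."""
    values = sorted(set(candidates))
    lo, hi = 0, len(values) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if feasible(values[mid]):
            best = values[mid]
            hi = mid - 1      # successful: R could be smaller
        else:
            lo = mid + 1      # otherwise: R should be larger
    return best

# Stand-in test for illustration only: pretend every R >= 3 succeeds.
found = smallest_feasible_radius([5, 1, 4, 2, 3], lambda R: R >= 3)
none_found = smallest_feasible_radius([1, 2], lambda R: False)
```

The function returns None when no candidate passes, which signals that the candidate list did not contain a large enough radius.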
65Results AFKKPTZ 06
- Gather-k
- Tight 2-approximation
- Extension to outlier 4-approximation
- Cellular-k
- Primal-dual const. approximation
- Extensions
66Ignore Cluster Size Constraint
- Similar to Facility Location
- Σ radius × num_nodes vs.
- Σ individual_distance_to_center
- Caveat
- Assigning one distant node to an existing cluster will increase cost proportional to number of nodes in that cluster
- Each cluster is a (center, radius) pair
67Intermediate Step I
- Primal-dual constant approximation for
- Σ radius × num_nodes
- No cluster size constraint
- Arbitrary cluster setup cost
- We want
- Σ radius × num_nodes
- Cluster size constraint
- No cluster setup cost
68Enforce Cluster Size
- Introduce extra cluster setup cost
- Setup cost pays for k nodes to join a particular cluster, i.e., c_setup = k × r
- This at most doubles the actual cost of any size constrained cluster solution
- Each cluster's total cost is at least k × r
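The doubling claim follows from one inequality. The derivation below is a sketch that assumes a cluster with radius r and n nodes costs r × n, and that a size-constrained cluster has n ≥ k; the symbol n is introduced here for illustration.

```latex
\[
  c_{\text{setup}} = k \cdot r \;\le\; n \cdot r
  \quad\Longrightarrow\quad
  n \cdot r + c_{\text{setup}} \;\le\; 2\, n \cdot r .
\]
```

Summing over all clusters gives a total, including setup costs, of at most twice the cost of the size-constrained solution.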
69Intermediate Step II
- Shared solution!
- For each cluster with less than k nodes, additional nodes can join the cluster
- At no additional cost, paid for by the cluster setup cost
- Now nodes could be shared among multiple clusters
- Key: convert a shared solution to a disjoint solution
70Separation
- Starting from small radius clusters
- Open as long as there are enough nodes
- The leftover points in clusters attach to the intersecting smaller radius (open) clusters
[Figure: one cluster labeled Open and three clusters labeled Attached]
71Regroup (k = 5)
- Open cluster has k nodes
- Attached cluster has < k nodes
- Group clusters to create bigger ones
- Choose the fat cluster's center as the new center
[Figure: clusters with 6, 3, 2 and 4 nodes]
74What About Cluster Cost?
- These clusters intersect with the open cluster
- Routing cost is only a constant blowup w.r.t. the fat radius
- Need to make sure the merged cluster is of reasonable size
75Recap
- Anonymize the quasi-identifiers
- Suppress information
- Privacy guarantee: anonymity
- Quality: the amount of suppressed information
- Clustering
- Privacy guarantee: cluster size
- Quality: various clustering measures
76Thanks!