1
Towards Achieving Anonymity
  • An Zhu

2
Introduction
  • Collect and analyze personal data
  • Infer trends and patterns
  • Making the personal data public
  • Joining multiple sources
  • Third party involvement
  • Privacy concerns
  • Q: How to share such data?

3
Example Medical Records
Identifiers (SSN, Name) | Sensitive Info (Disease)
SSN Name Age Race Zipcode Disease
614 Sara 31 Cauc 94305 Flu
615 Joan 34 Cauc 94307 Cold
629 Kelly 27 Cauc 94301 Diabetes
710 Mike 41 Afr-A 94305 Flu
840 Carl 41 Afr-A 94059 Arthritis
780 Joe 65 Hisp 94042 Heart problem
616 Rob 46 Hisp 94042 Arthritis
4
De-identified Records
Sensitive Info
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
5
Not Sufficient! [Sweeney 00]
Sensitive Info
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
Unique Identifiers!
Public Database
6
Not Sufficient! [Sweeney 00]
Quasi-Identifiers (Age, Race, Zipcode) | Sensitive Info (Disease)
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
Unique Identifiers!
Public Database
7
Anonymize the Quasi-Identifiers!
Quasi-Identifiers (Age, Race, Zipcode) | Sensitive Info (Disease)
Age Race Zipcode Disease
* * * Flu
* * * Cold
* * * Diabetes
* * * Flu
* * * Arthritis
* * * Heart problem
* * * Arthritis
Unique Identifiers!
Public Database
8
Q: How to share such data?
  • Anonymize the quasi-identifiers
  • Suppress information
    • Privacy guarantee: anonymity
    • Quality: the amount of suppressed information
  • Clustering
    • Privacy guarantee: cluster size
    • Quality: various clustering measures

10
k-anonymized Table [Samarati 01]
Quasi-Identifiers (Age, Race, Zipcode) | Sensitive Info (Disease)
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
11
k-anonymized Table [Samarati 01]
Quasi-Identifiers (Age, Race, Zipcode) | Sensitive Info (Disease)
Age Race Zipcode Disease
* Cauc * Flu
* Cauc * Cold
* Cauc * Diabetes
41 Afr-A * Flu
41 Afr-A * Arthritis
* Hisp 94042 Heart problem
* Hisp 94042 Arthritis
Each row is identical to at least k-1 other rows
12
Definition: k-anonymity
  • Input: a table consisting of n rows, each with m
    attributes (quasi-identifiers)
  • Output: suppress some entries such that each row
    is identical to at least k-1 other rows
  • Objective: minimize the number of suppressed
    entries
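
The definition can be checked mechanically. The sketch below is a minimal illustration, not the presenter's code: it assumes suppressed entries are written as `*`, and it uses the quasi-identifier columns of the k-anonymized medical table as sample data. The names `is_k_anonymous` and `suppression_cost` are hypothetical.

```python
from collections import Counter

STAR = "*"  # marker for a suppressed entry (an assumption of this sketch)

def is_k_anonymous(rows, k):
    """True if every row is identical to at least k-1 other rows."""
    counts = Counter(tuple(r) for r in rows)
    return all(c >= k for c in counts.values())

def suppression_cost(rows):
    """Objective value: the number of suppressed entries in the table."""
    return sum(cell == STAR for row in rows for cell in row)

# Quasi-identifiers (Age, Race, Zipcode) after suppression.
table = [
    (STAR, "Cauc", STAR),
    (STAR, "Cauc", STAR),
    (STAR, "Cauc", STAR),
    ("41", "Afr-A", STAR),
    ("41", "Afr-A", STAR),
    (STAR, "Hisp", "94042"),
    (STAR, "Hisp", "94042"),
]
```

With this data the table is 2-anonymous but not 3-anonymous, because the Afr-A and Hisp groups each contain only two rows.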

13
Past Work and New Results
  • [MW 04]
    • NP-hardness for a large size alphabet
    • O(k log k)-approximation
  • [AFKMPTZ 05]
    • NP-hardness even for ternary alphabet
    • O(k)-approximation
    • 1.5-approximation for 2-anonymity
    • 2-approximation for 3-anonymity

15
Graph Representation
A: 0 0 1 0 0 0
B: 1 0 0 1 0 1
C: 0 1 0 1 0 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 0 1 1 0 1 1
[Figure: graph on vertices A-F, one vertex per row; edge weights shown: 2, 3, 3, 4, 4, 6]
W(e) = Hamming distance between the two rows
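
A small sketch of the graph construction, assuming the six table rows are labeled A through F in order (the labeling is read off the slide, not stated explicitly). The helper `hamming` and the dictionary `weights` are illustrative names.

```python
from itertools import combinations

# The six binary rows from the slide, labeled A-F (labeling assumed).
rows = {
    "A": "001000",
    "B": "100101",
    "C": "010101",
    "D": "001000",
    "E": "110111",
    "F": "011011",
}

def hamming(u, v):
    """Number of positions in which two equal-length rows differ."""
    return sum(a != b for a, b in zip(u, v))

# Complete graph: one weighted edge per unordered pair of rows.
weights = {(p, q): hamming(rows[p], rows[q]) for p, q in combinations(rows, 2)}
```

For example, `weights[("A", "D")]` is 0 because rows A and D are identical, and `weights[("A", "E")]` is 6 because they differ in every column.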
16
Edge Selection I
A: 0 0 1 0 0 0
B: 1 0 0 1 0 1
C: 0 1 0 1 0 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 0 1 1 0 1 1
[Figure: graph on vertices A-F with the selected edges; weights shown: 0, 2, 2, 2, 3]
Each node selects the lightest weight edge
k = 3
17
Edge Selection II
A: 0 0 1 0 0 0
B: 1 0 0 1 0 1
C: 0 1 0 1 0 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 0 1 1 0 1 1
[Figure: graph on vertices A-F with the selected edges; weights shown: 0, 2, 2, 3, 3]
For components with < k vertices, add more edges
k = 3
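
The two edge-selection steps can be sketched as follows. This is a simplified reading of the slides, assuming at least k rows and k ≥ 2; ties are broken by iteration order, so the forest it returns may differ from the one drawn on the slide. The function name `build_forest` and the row labels are hypothetical.

```python
def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def build_forest(rows, k):
    """Return (edges, components); every component ends with >= k vertices."""
    names = list(rows)
    comp = {v: {v} for v in names}  # vertex -> set of vertices in its component
    edges = []

    def join(u, v):
        merged = comp[u] | comp[v]
        for x in merged:
            comp[x] = merged
        edges.append((u, v, hamming(rows[u], rows[v])))

    # Step I: each vertex selects its lightest incident edge.
    # An edge inside an existing component is skipped to keep a forest.
    for u in names:
        v = min((x for x in names if x != u),
                key=lambda x: hamming(rows[u], rows[x]))
        if v not in comp[u]:
            join(u, v)

    # Step II: while a component has fewer than k vertices,
    # add the lightest edge leaving that component.
    while True:
        small = next((c for c in comp.values() if len(c) < k), None)
        if small is None:
            break
        u, v = min(((a, b) for a in small for b in names if b not in small),
                   key=lambda e: hamming(rows[e[0]], rows[e[1]]))
        join(u, v)

    return edges, {frozenset(c) for c in comp.values()}

rows = {"A": "001000", "B": "100101", "C": "010101",
        "D": "001000", "E": "110111", "F": "011011"}
edges, components = build_forest(rows, k=3)
```

Because each added edge always joins two different components, the result is a forest: the number of edges equals the number of vertices minus the number of components.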
18
Lemma
  • Total weight of edges selected is no more than
    OPT
  • In the optimal solution, each vertex pays at
    least the weight of the (k-1)st lightest weight
    edge
  • Forest: at most one edge per vertex
  • By construction, the edge weight is no more than
    the (k-1)st lightest weight edge per vertex

19
Grouping
  • Ideally, each connected component forms a group
  • Anonymize vertices within a group
  • Total cost of a group
    • (total edge weights) × (number of nodes)
    • (2 + 2 + 3 + 3) × 6

[Figure: the selected edges on vertices A-F; weights shown: 0, 2, 2, 3, 3]
Small groups: O(k)
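
To make the group cost concrete, the sketch below anonymizes a group by suppressing every column on which its rows disagree and counts the suppressed entries. It assumes the six rows A-F of the running example form one group; `anonymize_group` and `group_cost` are illustrative names. The product (total edge weights) × (number of nodes) then serves as an upper estimate of this cost.

```python
def anonymize_group(group):
    """Suppress (with '*') every column on which the rows of the group disagree."""
    same = [len(set(col)) == 1 for col in zip(*group)]
    return ["".join(c if s else "*" for c, s in zip(row, same)) for row in group]

def group_cost(group):
    """Number of suppressed entries after anonymizing the group."""
    return sum(row.count("*") for row in anonymize_group(group))

all_six = ["001000", "100101", "010101", "001000", "110111", "011011"]
actual = group_cost(all_six)        # every column differs somewhere
estimate = (2 + 2 + 3 + 3) * 6      # bound from the slide
```

Here the six rows disagree on all six columns, so the actual cost is 36, below the estimate of 60.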
20
Dividing a Component
  • Root tree arbitrarily
  • Divide if sub-tree and rest both ≥ k
  • Aim: all sub-trees < k

[Figure: rooted tree; sub-trees labeled < k or ≥ k]
21
Dividing a Component
  • Root tree arbitrarily
  • Divide if sub-tree and rest both ≥ k
  • Rotate the tree if necessary

[Figure: rooted tree; sub-trees labeled ≥ k]
22
Dividing a Component
  • Root tree arbitrarily
  • Divide if sub-tree and rest both ≥ k
  • Termination condition: size ≤ max(2k-1, 3k-5)

[Figure: rooted tree; all sub-trees labeled < k]
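
The "divide" test can be sketched on a rooted tree as below. This covers only the rule that a split is allowed when both the sub-tree and the rest keep at least k vertices; the rotation step from the slides is not implemented. The adjacency-list input format and the name `find_split_edge` are assumptions of this sketch.

```python
def find_split_edge(adj, root, k):
    """Return a (parent, child) edge whose removal leaves >= k vertices
    on both sides, or None if no such edge exists."""
    n = len(adj)
    parent = {root: None}
    order = [root]
    for u in order:                 # breadth-first; the list grows as we iterate
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                order.append(v)
    size = {u: 1 for u in adj}
    for u in reversed(order):       # children are visited before their parents
        if parent[u] is not None:
            size[parent[u]] += size[u]
    for u in order[1:]:
        if size[u] >= k and n - size[u] >= k:
            return parent[u], u
    return None

# A path A-B-C-D-E-F and a star with five leaves, as adjacency lists.
path = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
        "D": ["C", "E"], "E": ["D", "F"], "F": ["E"]}
star = {"X": ["a", "b", "c", "d", "e"],
        "a": ["X"], "b": ["X"], "c": ["X"], "d": ["X"], "e": ["X"]}
```

With k = 3 the path splits between C and D, while the star cannot be split at all, which is the kind of case the termination bound has to allow for.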
23
An Example
A: 0 0 1 0 0 0
B: 1 0 0 1 0 1
C: 0 1 0 1 0 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 0 1 1 0 1 1
[Figure: connected component on vertices A-F; edge weights shown: 0, 2, 2, 3, 3]
24
An Example
A: 0 0 1 0 0 0
B: 1 0 0 1 0 1
C: 0 1 0 1 0 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 0 1 1 0 1 1
[Figure: the component redrawn as a tree rooted at C; edge weights shown: 0, 2, 2, 3, 3]
25
An Example
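A: 0 * 1 0 * *
B: * * 0 1 * 1
C: * * 0 1 * 1
D: 0 * 1 0 * *
E: * * 0 1 * 1
F: 0 * 1 0 * *
[Figure: the tree rooted at C after division; remaining edge weights shown: 0, 2, 2, 3]
Estimated cost: 4 × 3 + 3 × 3
Optimal cost: 3 × 3 + 3 × 3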
26
Past Work and New Results
  • [MW 04]
    • NP-hardness for a large size alphabet
    • O(k log k)-approximation
  • [AFKMPTZ 05]
    • NP-hardness even for ternary alphabet
    • O(k)-approximation
    • 1.5-approximation for 2-anonymity
    • 2-approximation for 3-anonymity

27
1.5-approximation
A: 0 0 1 0 0 0
B: 0 0 0 0 0 0
C: 1 1 1 1 1 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 1 1 0 1 1 1
[Figure: graph on vertices A-F; edge weights shown: 0, 1, 5, 6, 6, 6]
W(e) = Hamming distance between the two rows
28
Minimum {1,2}-matching
A: 0 0 1 0 0 0
B: 0 0 0 0 0 0
C: 1 1 1 1 1 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 1 1 0 1 1 1
[Figure: matching on vertices A-F; selected edge weights shown: 0, 0, 1, 1]
Each vertex is matched to 1 or 2 other vertices
29
Properties
  • Each component has ≤ 3 nodes

[Figure labels: > 3; Not possible (degree ≤ 2); Not optimal]
30
Qualities
  • Cost ≤ 2 × OPT
  • For binary alphabet: 1.5 × OPT

[Figure labels: a; p, q, r with r ≥ p, q]
OPT pays 2a; we pay 2a
OPT pays ≥ p + q + r; we pay ≤ 3(p + q) ≤ 2(p + q + r)
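
The last inequality follows directly from r ≥ p, q. A short derivation, reading p and q as the weights of the two matched edges in a three-node component and r as the distance between the remaining pair (this reading of the figure labels is an assumption):

```latex
\begin{align*}
r \ge p,\ r \ge q \;&\Longrightarrow\; p + q \le 2r \\
&\Longrightarrow\; 3(p+q) = 2(p+q) + (p+q) \le 2(p+q) + 2r = 2(p+q+r).
\end{align*}
```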
31
Past Work and New Results
  • [MW 04]
    • NP-hardness for a large size alphabet
    • O(k log k)-approximation
  • [AFKMPTZ 05]
    • NP-hardness even for ternary alphabet
    • O(k)-approximation
    • 1.5-approximation for 2-anonymity
    • 2-approximation for 3-anonymity

32
Open Problems
  • Can we improve O(k)?
  • Ω(k) for graph representation

33
Open Problems
  • Can we improve O(k)?
  • Ω(k) for graph representation
    1111111100000000000000000000000000000000
    0000000011111111000000000000000000000000
    0000000000000000111111110000000000000000
    0000000000000000000000001111111100000000
    0000000000000000000000000000000011111111
  • k = 5, d = 16, c = k × d / 2

35
Open Problems
  • Can we improve O(k)?
  • Ω(k) for graph representation
    10101010101010101010101010101010
    11001100110011001100110011001100
    11110000111100001111000011110000
    11111111000000001111111100000000
    11111111111111110000000000000000
  • k = 5, d = 16, c = 2 × d
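
The two bit tables on the Open Problems slides can be compared numerically. The sketch below rebuilds both tables, checks that all pairwise Hamming distances are equal (so both give the same graph), and counts the columns on which the five rows disagree. Treating that count as the number of entries a single group of five would suppress per row is an interpretation, and the names `blocks`, `walsh` and `non_constant_columns` are illustrative.

```python
from itertools import combinations

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def non_constant_columns(rows):
    """Number of columns on which the rows do not all agree."""
    return sum(len(set(col)) > 1 for col in zip(*rows))

# First table: five 40-bit rows, each with one block of eight 1s.
blocks = ["0" * (8 * i) + "1" * 8 + "0" * (8 * (4 - i)) for i in range(5)]

# Second table: five 32-bit rows with alternating runs of length 1, 2, 4, 8, 16.
walsh = [("1" * w + "0" * w) * (16 // w) for w in (1, 2, 4, 8, 16)]

dist_blocks = {hamming(u, v) for u, v in combinations(blocks, 2)}
dist_walsh = {hamming(u, v) for u, v in combinations(walsh, 2)}
```

Both distance sets contain only 16, yet the two tables disagree on different numbers of columns, so the graph alone does not determine the suppression cost.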

37
Q: How to share such data?
  • Anonymize the quasi-identifiers
  • Suppress information
    • Privacy guarantee: anonymity
    • Quality: the amount of suppressed information
  • Clustering
    • Privacy guarantee: cluster size
    • Quality: various clustering measures

38
Clustering Approach [AFKKPTZ 06]
Quasi-Identifiers (Age, Race, Zipcode) | Sensitive Info (Disease)
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
39
Transfers into a Metric
Quasi-Identifiers (Age, Race, Zipcode) | Sensitive Info (Disease)
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
40
Clusters and Centers
Quasi-Identifiers (Age, Race, Zipcode) | Sensitive Info (Disease)
Age Race Zipcode Disease
31 Cauc 94305 Flu
34 Cauc 94307 Cold
27 Cauc 94301 Diabetes
41 Afr-A 94305 Flu
41 Afr-A 94059 Arthritis
65 Hisp 94042 Heart problem
46 Hisp 94042 Arthritis
41
Clusters and Centers
Quasi-Identifiers (Age, Race, Zipcode) | Sensitive Info (Disease)
Age Race Zipcode Disease
31 Cauc 94305 Flu
Cold
Diabetes
Flu
41 Afr-A 94059 Arthritis
Heart problem
46 Hisp 94042 Arthritis
42
Measure
  • How good are the clusters?
  • Tight clusters are better
  • Minimize max radius: Gather-k
  • Minimize max distortion error: Cellular-k
    • Σ radius × num_nodes

Cost: Gather-k = 10, Cellular-k = 624
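
The two measures can be written down directly. The sketch below assumes each cluster is given as a (center, members) pair and that `dist` is any metric; the function names and the one-dimensional toy data are illustrative and are not the data behind the costs on the slide.

```python
def radius(center, members, dist):
    """Largest distance from the center to a member of the cluster."""
    return max(dist(center, m) for m in members)

def gather_cost(clusters, dist):
    """Gather-k objective: the maximum cluster radius."""
    return max(radius(c, ms, dist) for c, ms in clusters)

def cellular_cost(clusters, dist):
    """Cellular-k objective: sum over clusters of radius x number of nodes."""
    return sum(radius(c, ms, dist) * len(ms) for c, ms in clusters)

# Toy example on the number line with absolute difference as the metric.
clusters = [(2, [1, 2, 3]), (10, [8, 10, 13])]
dist = lambda a, b: abs(a - b)
```

For this toy input the radii are 1 and 3, so the Gather-k cost is 3 and the Cellular-k cost is 1 × 3 + 3 × 3 = 12.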
43
Measure
  • How good are the clusters?
  • Tight clusters are better
  • Minimize max radius: Gather-k
  • Minimize max distortion error: Cellular-k
    • Σ radius × num_nodes
  • Handle outliers
  • Constant approximations!

44
Comparison
  • k = 5
  • 5-anonymity
    • Suppress all entries
    • More distortion
  • Clustering
    • Can pick R5 as the center
    • Less distortion
    • Distortion is directly related to pairwise
      distances

R1 0 1 1 1
R2 1 0 1 1
R3 1 1 0 1
R4 1 1 1 0
R5 1 1 1 1
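
The comparison can be reproduced on the five rows R1-R5. This sketch counts the entries that 5-anonymity by suppression would remove and the radius obtained when a given row is used as the cluster center, with Hamming distance as the metric (the choice of metric is an assumption here).

```python
rows = {"R1": "0111", "R2": "1011", "R3": "1101", "R4": "1110", "R5": "1111"}

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

# 5-anonymity: every column on which the rows disagree is suppressed in all rows.
differing = sum(len(set(col)) > 1 for col in zip(*rows.values()))
suppressed_entries = differing * len(rows)

def radius_with_center(name):
    """Largest Hamming distance from the chosen center row to any row."""
    return max(hamming(rows[name], r) for r in rows.values())
```

All 20 entries are suppressed under 5-anonymity, while a cluster centered at R5 has radius 1 (centering at R1 instead would give radius 2).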
45
Results [AFKKPTZ 06]
  • Gather-k
    • Tight 2-approximation
    • Extension to outliers: 4-approximation
  • Cellular-k
    • Primal-dual constant approximation
    • Extensions as well

47
2-approximation
  • Assume an optimal value R
  • Make sure each node has at least k-1 neighbors
    within distance 2R.

[Figure: node A with balls of radius R and 2R]
48
2-approximation
  • Assume an optimal value R
  • Make sure each node has at least k-1 neighbors
    within distance 2R.
  • Pick an arbitrary node as a center and remove all
    remaining nodes within distance 2R. Repeat until
    all nodes are gone.
  • Make sure we can reassign nodes to the selected
    centers.
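
The steps above can be sketched for a fixed guess R. This is an illustrative reading, not the authors' implementation: the reassignment step is done here with a plain augmenting-path bipartite matching that gives each center k distinct nodes within 2R, and all names (`cluster_for_radius`, `near`, `fill`) are hypothetical.

```python
def cluster_for_radius(points, k, R, dist):
    """Try to build clusters of radius <= 2R with >= k nodes each.
    Returns {center index: [member indices]} or None on failure."""
    n = len(points)

    def near(i, j):
        return dist(points[i], points[j]) <= 2 * R

    # Pruning: every node needs at least k-1 other nodes within 2R.
    if any(sum(near(i, j) for j in range(n) if j != i) < k - 1 for i in range(n)):
        return None

    # Center selection: pick a remaining node, remove everything within 2R of it.
    remaining, centers = list(range(n)), []
    while remaining:
        c = remaining[0]
        centers.append(c)
        remaining = [j for j in remaining if not near(c, j)]

    # Reassignment: k slots per center, each slot needs its own node within 2R.
    slots = [c for c in centers for _ in range(k)]
    owner = {}  # node index -> slot index

    def fill(s, seen):
        for j in range(n):
            if near(slots[s], j) and j not in seen:
                seen.add(j)
                if j not in owner or fill(owner[j], seen):
                    owner[j] = s
                    return True
        return False

    if not all(fill(s, set()) for s in range(len(slots))):
        return None

    # Nodes not used by any slot join any center within 2R (one always exists).
    clusters = {c: [] for c in centers}
    for j in range(n):
        c = slots[owner[j]] if j in owner else next(c for c in centers if near(c, j))
        clusters[c].append(j)
    return clusters

points = [0, 1, 2, 10, 11, 12]
dist = lambda a, b: abs(a - b)
result = cluster_for_radius(points, k=3, R=1, dist=dist)
```

On this toy input the guess R = 1 succeeds with two clusters of three nodes each, while a guess such as R = 0.4 is rejected by the pruning check.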

49
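Example (k = 5)
50
Optimal Solution
[Figure: optimal clusters 1 and 2, radius R]
51
Center Selection
52
Center Selection
[Figure labels: 1]
53
Center Selection
[Figure labels: 2R, 1]
54
Center Selection
[Figure labels: 2R, 1]
55
Center Selection
[Figure labels: 2R, 2, 1]
56
Center Selection
[Figure labels: 2R, 2, 1]
57
Reassignment
[Figure labels: 2, 1]
58
Degree Constrained Matching
[Figure: matching graph; labels shown: k-1 and 1]
59
Actual Clustering
[Figure labels: 2, 1]
60
Optimal Clustering
[Figure labels: 1, 2]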
61
Our guarantees
  • Return clusters of radius no more than 2R
  • If R is guessed correctly, then reassignment is
    possible
  • Each cluster has at least k nodes
  • A binary search on the value of R suffices

62
Binary Search on R
  • Assume an optimal value R
  • Make sure each node has at least k-1 neighbors
    within distance 2R.
  • Pick an arbitrary node as a center and remove all
    remaining nodes within distance 2R. Repeat until
    all nodes are gone.
  • Make sure we can reassign nodes to the selected
    centers.

63
Binary Search on R
  • Assume an optimal value R
  • Make sure each node has at least k-1 neighbors
    within distance 2R.
    • Not necessary, but useful for quick pruning
  • Pick an arbitrary node as a center and remove all
    remaining nodes within distance 2R. Repeat until
    all nodes are gone.
  • Make sure we can reassign nodes to the selected
    centers.

64
Binary Search on R
  • Assume an optimal value R
  • Make sure each node has at least k-1 neighbors
    within distance 2R.
    • Not necessary, but useful for quick pruning
  • Pick an arbitrary node as a center and remove all
    remaining nodes within distance 2R. Repeat until
    all nodes are gone.
  • Make sure we can reassign nodes to the selected
    centers.
    • If successful, R could be smaller
    • Otherwise, R should be larger
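
The search over R can be sketched generically. The sketch assumes a finite list of candidate radii (for example the pairwise distances, which is an assumption and not stated on the slides) and a feasibility test that is monotone over the sorted candidates; `smallest_feasible` and `feasible` are hypothetical names.

```python
def smallest_feasible(candidates, feasible):
    """Binary search for the smallest candidate R with feasible(R) true.

    Assumes feasibility is monotone over the sorted candidates
    (False ... False True ... True). Returns None if none is feasible.
    """
    cands = sorted(set(candidates))
    lo, hi, best = 0, len(cands) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if feasible(cands[mid]):
            best, hi = cands[mid], mid - 1   # success: R could be smaller
        else:
            lo = mid + 1                     # failure: R should be larger
    return best
```

A call such as `smallest_feasible([5, 1, 3, 2, 8], lambda R: R >= 3)` returns 3; in the clustering setting the predicate would be the center-selection and reassignment test for the guessed R.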

65
Results [AFKKPTZ 06]
  • Gather-k
    • Tight 2-approximation
    • Extension to outliers: 4-approximation
  • Cellular-k
    • Primal-dual constant approximation
    • Extensions

66
Ignore Cluster Size Constraint
  • Similar to Facility Location
    • Σ radius × num_nodes vs.
    • Σ individual_distance_to_center
  • Caveat
    • Assigning one distant node to an existing cluster
      will increase the cost in proportion to the number
      of nodes in that cluster
  • Each cluster is a (center, radius) pair

67
Intermediate Step I
  • Primal-dual constant approximation for
    • Σ radius × num_nodes
    • No cluster size constraint
    • Arbitrary cluster setup cost
  • We want
    • Σ radius × num_nodes
    • Cluster size constraint
    • No cluster setup cost

68
Enforce Cluster Size
  • Introduce extra cluster setup cost
  • Setup cost pays for k nodes to join a particular
    cluster, i.e., c_setup = k × r
  • This at most doubles the actual cost of any
    size-constrained cluster solution
  • Each cluster's total cost is at least k × r
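
The doubling claim follows in one line. Writing |C| for the number of nodes in a cluster of radius r (notation introduced here for illustration), the size constraint |C| ≥ k gives:

```latex
\[
|C| \ge k \;\Longrightarrow\;
\underbrace{r\,|C|}_{\text{cluster cost}} + \underbrace{k\,r}_{c_{\text{setup}}}
\;\le\; r\,|C| + r\,|C| \;=\; 2\,r\,|C| .
\]
```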

69
Intermediate Step II
  • Shared solution!
  • For each cluster with less than k nodes,
    additional nodes can join the cluster
  • At no additional cost, paid for by the cluster
    setup cost
  • Now nodes could be shared among multiple clusters
  • Key: convert a shared solution to a disjoint
    solution

70
Separation
  • Starting from small radius clusters
  • Open as long as there are enough nodes
  • The leftover points in clusters attach to the
    intersecting smaller radius (open) clusters

[Figure: one cluster labeled Open and three labeled Attached]
71
Regroup (k = 5)
  • Open cluster has k nodes
  • Attached cluster has < k nodes
  • Group clusters to create bigger ones
  • Choose the fat cluster's center as the new
    center

[Figure labels: 6, 3, 2, 4]
72
What About Cluster Cost?
  • These clusters intersect the open cluster

73
What About Cluster Cost?
  • These clusters intersect the open cluster
  • Routing cost is only a constant blowup w.r.t. the
    fat radius

74
What About Cluster Cost?
  • These clusters intersect the open cluster
  • Routing cost is only a constant blowup w.r.t. the
    fat radius
  • Need to make sure the merged cluster is of
    reasonable size

75
Recap
  • Anonymize the quasi-identifiers
  • Suppress information
    • Privacy guarantee: anonymity
    • Quality: the amount of suppressed information
  • Clustering
    • Privacy guarantee: cluster size
    • Quality: various clustering measures

76
Thanks!