Title: A non-interactive approach to database privacy
1. A non-interactive approach to database privacy
- Shuchi Chawla
- Stanford / UW Madison
- TCC05: Chawla, Dwork, McSherry, Smith, Wee
- UAI05: Chawla, Dwork, McSherry, Talwar
2. The privacy problem
- Many database applications require data that is considered private by the contributors:
  - disease outbreak detection
  - correlating symptoms and diseases
  - targeted advertising
- The privacy dilemma: protect privacy, but also allow legitimate applications
  - Alfred Kinsey: better privacy means better data
- Why is this possible? Because many applications require only macroscopic features of the data
3. The privacy problem (contd.)
- The basic questions:
  - Given that privacy must be preserved, what applications can we allow?
  - How do we store data and use it to ensure privacy protection?
  - What do we mean by privacy? What is private information?
4. A road map
- How to protect data
  - Interactive approaches
  - Non-interactive approaches
- Defining privacy
- What is not possible? Some impossibility results
- What is possible? Techniques for an abstract version of the problem
- What next?
5. How to protect sensitive data
Take 1: The census (non-interactive) approach
- The data collector sanitizes the dataset and releases it publicly
- Applications can freely use this sanitized data
- Suppression of sensitive records, aggregation of attributes: Gusfield88, Dobra Fienberg Trottini00, Chawla Dwork McSherry Talwar05, ...
- Imputation (learn the underlying distribution): Rubin93
- Data perturbation (add random noise to each record, independently or dependent on other records): Agrawal Srikant00, Agrawal Aggarwal01, Roque03, Winkler04, Evfimievski Gehrke Srikant03, Chawla Dwork McSherry Smith Wee05, ...
6. How to protect sensitive data
Take 2: The trusted-party (interactive) approach
- Data resides with a trusted party (government?)
- Applications query the trusted party
- The trusted party reveals as much information as possible without compromising privacy
- Query auditing (answer truthfully or answer nothing): Kleinberg Papadimitriou Raghavan03, Dinur Nissim03
- Output perturbation (answer with some noise): Denning80, Beck80, Adam Wortmann89, Dinur Nissim03, Dwork Nissim04, Blum Dwork McSherry Nissim05, Dwork McSherry Nissim Smith06, ...
7. How to protect sensitive data
Take 1: The census (non-interactive) approach
- Practical: sanitize once and then forget
- Potentially worse performance: the disadvantage of not knowing the future applications that will use the data
- Irreversible
Take 2: The trusted-party (interactive) approach
- Potentially improved performance: tailor the noise to the query
- Impractical: lifelong monitoring; can we really trust anyone with our private info?
- Inconsistent answers to related queries
8. What is a breach of privacy?
- Informally: an adversary learns sensitive information about an individual by looking at the sanitized database or query answers.
- Note: the adversary may (and should) learn other, non-sensitive info
- What constitutes sensitive information?
  - Values of sensitive attributes: the adversary's belief about an individual's attribute values should remain nearly the same as before (Evfimievski Gehrke Srikant03, Dinur Nissim03, Dwork Nissim04, Blum Dwork McSherry Nissim05)
  - Related concept, indistinguishability: Dwork McSherry Nissim Smith06
  - Weaker notions:
    - K-anonymity: any information that the adversary learns about an individual applies to K other people (Sweeney02)
    - Isolation: any information of the adversary's that approximately matches the actual attributes of an individual also matches the attributes of many (K) other individuals nearly as well (Aggarwal Feder Kenthapadi Motwani Panigrahy Thomas Zhu05)
9. Isolation
- Formalizing approximation:
  - U: the space of all possible individual records
  - d: a distance function on U
- The adversary ADV takes as arguments the sanitized database S and auxiliary information X, and produces a guess g ∈ U.
- For a real person p ∈ U, the distance d(p,g) signifies how well the adversary approximates p.
- Goal: we want d(p, ADV(S,X)) to be large (relative to other points).
10. Isolation
- Formalizing approximation:
  - U: the space of all possible individual records
  - d: a distance function on U
- Goal: we want d(p, ADV(S,X)) to be large.
- A real person p is (c,K)-isolated by a point g = ADV(S,X) if d(p,g) = r and |B(g, cr) ∩ RDB| < K
[Figure: points of the RDB around a guess g, with radii r and cr marked; one configuration is (3,7)-isolated, another is not (3,7)-isolated]
- B(v,r): the ball of radius r around v w.r.t. the distance d
- c: a small constant, say 8 or 10
- K: a privacy parameter, say 100 or 1000
11. Isolation
- Formalizing approximation:
  - U: the space of all possible individual records
  - d: a distance function on U
- Goal: we want d(p, ADV(S,X)) to be large.
- A real person p is (c,K)-isolated by a point g = ADV(S,X) if d(p,g) = r and |B(g, cr) ∩ RDB| < K
- Some examples (a checking sketch follows below):
  - d(p,g) = 0 if p = g, 1 otherwise (K-anonymity)
  - d(p,g) = min over attributes a of |p_a − g_a|
  - d(p,g) = the Hamming distance between p and g
  - d(p,g) = the ℓ_p distance between p and g
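To make the definition concrete, here is a minimal Python sketch of the isolation check for the Euclidean metric. The function name and test data are mine, not the papers'; c = 10 and K = 100 echo the "say 8 or 10" and "say 100 or 1000" values from the previous slide.

```python
import numpy as np

def is_isolated(g, rdb, c=10, K=100):
    """Return True if guess g (c,K)-isolates its nearest database point.

    g is the adversary's guess; rdb is an (n, d) array of real records.
    With r = distance from g to the nearest real point p, the guess
    isolates p exactly when the ball B(g, c*r) holds fewer than K records.
    """
    dists = np.linalg.norm(rdb - g, axis=1)  # d(g, x) for every record x
    r = dists.min()                          # distance to the nearest point p
    return np.sum(dists <= c * r) < K        # is |B(g, c*r) ∩ RDB| < K ?

# Illustrative use: 1000 uniform records in 50 dimensions
rng = np.random.default_rng(0)
rdb = rng.uniform(0.0, 1.0, size=(1000, 50))
g = np.full(50, 0.5)                         # a guess at the cube's center
print(is_isolated(g, rdb))                   # in high dimension, rarely True
```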
12. A good sanitizing algorithm
- A cryptographic approach: the possibility of a privacy breach does not increase significantly by releasing the sanitized data.

∀ D, ∀ ADV, ∃ ADV-Sim such that ∀ RDB, ∀ aux. X:
  Pr[ADV(S,X) compromises RDB] ≈ Pr[ADV-Sim(X) compromises RDB], i.e.,
  | Pr[ADV(S,X) compromises RDB] − Pr[ADV-Sim(X) compromises RDB] | ≤ ε

- Compromise: is there a p ∈ RDB such that the guess g isolates p?
13. Is this possible?
- Cannot allow arbitrary auxiliary information X (Dwork Naor)
- Basic idea: X encodes compromising information about RDB, and S carries the key to this information
[Diagram: two worlds. In each, RDB is drawn from D and an auxiliary generator produces X. In the real world the sanitizer releases S and ADV(S,X) outputs a compromise bit (0/1); in the simulation, ADV-Sim(X) must output the same bit without seeing S.]
14. Is this possible?
- Cannot allow arbitrary auxiliary information X
- Cannot preserve arbitrary utility functions
  - e.g. the exact nearest-neighbor function
  - e.g. exact histograms on discrete distributions
- Cannot preserve all well-behaved utility functions in the non-interactive setting under the indistinguishability framework (Dwork McSherry Nissim Smith06)
15. A good sanitizing algorithm: the indistinguishability framework

Pr[ADV(S,X) compromises RDB] ≈ Pr[ADV-Sim(S′,X) compromises RDB]

A stronger requirement: S ≈ S′, where S′ is the sanitization of a database differing from RDB in one record.
16. Is this possible? (contd.)
- Cannot preserve all utility functions: we must decide on the important ones before sanitizing
- Unclear what this implies for the isolation framework
17. What can we achieve?
- An abstract model: records lie in ℝ^d, and d(·,·) is the Euclidean distance
- Our results for privacy (aux. info X = all but a subset of the points):
  - Randomized (recursive) histograms
    - D: uniform over a hypercube
    - Isolation prob. < 2^(-Ω(d)), c ≈ 15, K = 2^(o(d))
  - Recursive density maps
    - D: uniform over a well-rounded region
    - Isolation prob. < 2^(-Ω(d)), c a large constant, K = 2^(o(d))
  - Density-based Gaussian noise
    - D: uniform on a hypercube or hypersphere
    - Isolation prob. < 2^(-Ω(d)), c ≈ 15, K = 2^(o(d)), n = 2^(o(d))
18. What can we achieve?
- An abstract model: records lie in ℝ^d, and d(·,·) is the Euclidean distance
- Our results for utility:
  - Recursive histograms / density maps
    - A popular summarization technique
    - Benefit of providing more detail where required
    - No noise!
    - Randomized subdivision ⇒ large distances are preserved approximately
  - Density-based perturbation
    - Large aggregates are preserved
    - Preserves clusterings via spectral and diameter-based techniques
19. What can we achieve?
- An abstract model: records lie in ℝ^d, and d(·,·) is the Euclidean distance
- Some key points:
  - Unconditional results: the adversary is allowed unlimited computational power
  - The distribution D is used in a limited way: it primarily describes the adversary's prior view of the individual to be isolated
  - The real-valued data assumption: extends to the discrete-valued case when the granularity is sufficiently small, but does not capture everything, e.g. binary data
  - The low-dimensional case creates a problem (more later)
20. Histogram-based sanitization
- Recursively subdivide space into 2^d cells until each cell has fewer than K points (a sketch follows below)
[Figure: recursive subdivision of the plane, with K = 3 and d = 2]
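A minimal Python sketch of this recursion. The half-open cell convention, the released (bounds, count) representation, and the uniform test data are my assumptions; since every split enumerates 2^d subcells, the sketch is only practical for small d, such as the figure's d = 2.

```python
import numpy as np

def histogram_sanitize(points, lo, hi, K):
    """Recursively halve every axis (2^d subcells per split) until a cell
    holds fewer than K points; return (cell bounds, count) pairs.
    Assumes continuous data: no K coincident points, so recursion ends."""
    if len(points) < K:
        return [(lo, hi, len(points))]
    d = len(lo)
    mid = (lo + hi) / 2.0
    cells = []
    for corner in range(2 ** d):                 # enumerate the 2^d subcells
        bits = np.array([(corner >> i) & 1 for i in range(d)])
        sub_lo = np.where(bits == 1, mid, lo)
        sub_hi = np.where(bits == 1, hi, mid)
        mask = np.all((points >= sub_lo) & (points < sub_hi), axis=1)
        cells += histogram_sanitize(points[mask], sub_lo, sub_hi, K)
    return cells

# The K = 3, d = 2 setting from the figure
rng = np.random.default_rng(1)
pts = rng.uniform(0.0, 1.0, size=(40, 2))
cells = histogram_sanitize(pts, np.zeros(2), np.ones(2), K=3)
print(len(cells), "cells released; total count", sum(c for _, _, c in cells))
```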
21. Histograms: a brief proof of privacy
- Adversary's goal: produce a point g such that, if r is the distance between g and the closest real point p, then |B(g,cr) ∩ RDB| < K
- Intuition:
  - The input distribution is uniform over the hypercube
  - The adversary's view is a product of uniform distributions over the cells
  - Within any cell, the adversary can't conclusively single out a position for an unrevealed point
22. Histograms: a brief proof of privacy
- Case 1: a sparse cell
  - The expected distance |g − p| is proportional to the diameter of the cell
  - c times this distance is larger than the diameter of the parent cell
  - ⇒ B(g,cr) captures the entire parent cell and contains ≥ K points
- Case 2: a dense cell
  - Consider B(g,r) and B(g,cr) for some radius r
  - The adversary wins if Pr[|B(g,r) ∩ RDB| ≥ 1] is large and Pr[|B(g,cr) ∩ RDB| ≥ K] is small
  - We show that Pr[x ∈ B(g,cr)] ≫ Pr[x ∈ B(g,r)]
[Figure: a guess g near a real point p, with the ball B(g,cr)]
23. Histograms: a brief proof of privacy
- Lemma: let c be a large enough constant. For any cell and any r < diam(cell)/c,
  Pr[x ∈ B(q,cr) ∩ cell] ≥ 2^d · Pr[x ∈ B(q,r) ∩ cell]
- Proof idea (spelled out below):
  - Pr[x ∈ B(q,r) ∩ cell] ∝ Vol(B(q,r) ∩ cell)
  - Vol(B(q,cr) ∩ cell) > 2^d · Vol(B(q,r) ∩ cell)
  - Key quantity: Frac(q,r) = the fraction of B(q,r) inside the cell
  - How does Frac(q,r) decrease with r? As long as cr is small w.r.t. the diameter, the decrease is less than a factor of c^d
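The arithmetic behind the proof idea, as a sketch: the first equality uses Pr ∝ Vol, the second uses the fact that ball volumes in ℝ^d scale as the d-th power of the radius, and the final condition is what the bound on the decrease of Frac guarantees.

```latex
\[
\frac{\Pr[x \in B(q,cr) \cap \text{cell}]}
     {\Pr[x \in B(q,r)  \cap \text{cell}]}
  = \frac{\mathrm{Frac}(q,cr)\,\mathrm{Vol}(B(q,cr))}
         {\mathrm{Frac}(q,r)\,\mathrm{Vol}(B(q,r))}
  = c^{d}\,\frac{\mathrm{Frac}(q,cr)}{\mathrm{Frac}(q,r)}
  \;\ge\; 2^{d}
\quad\text{whenever}\quad
\frac{\mathrm{Frac}(q,cr)}{\mathrm{Frac}(q,r)} \ge (2/c)^{d}.
\]
```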
- Technical tool: study intersections of Gaussians with the hypercube
- Corollary: the probability of success for the adversary is < 2^(-Ω(d))
24. Randomized histograms
- Do histograms represent the distribution well?
- A desirable property: closer pairs of points are separated at higher levels of the recursion (w.h.p.)
- Estimate the distance between points from the histogram:
  d(x,y) ≤ d_H(x,y) ≤ d(x,y) + Δ_x + Δ_y, where Δ_x is the diameter of the final cell containing x
- Want Δ_x to be roughly the K-radius of x
- A standard technique to achieve this: translate the hypercube by a random length in each direction (see the sketch below)
- Need to be careful about the aspect ratios of the resulting cells: undo the slicing of the corner cells
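A sketch of the random-translation step, reusing histogram_sanitize from the slide-20 sketch. Wrapping modulo 1 is my simplification: it is exactly the move that slices the corner cells, which the construction on the slide then undoes to keep aspect ratios bounded.

```python
import numpy as np

def randomized_histogram(points, K, rng):
    """Translate the data by an independent random offset in each
    coordinate before building the recursive histogram, so that any
    fixed pair of nearby points is unlikely to be cut at a coarse level."""
    d = points.shape[1]
    shift = rng.uniform(0.0, 1.0, size=d)    # the random translation
    shifted = (points + shift) % 1.0         # wrap around the unit cube
    cells = histogram_sanitize(shifted, np.zeros(d), np.ones(d), K)
    return cells, shift                      # shift is needed to read the cells
```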
25. Recursive density maps
- Can we extend the histogram construction to other distributions?
- Key component of the previous analysis: intersections of balls with cells expand exponentially with the radius
- Does this hold for arbitrarily shaped cells?
- Yes, if the cells are well-rounded, i.e. convex and with bounded aspect ratio
26. Partitioning into well-rounded cells
- Basic procedure:
  - Pick a set of points forming a net over the parent cell
  - The next-level cells are given by the Voronoi partition formed by these points
- Picking the points (a sketch of the greedy method follows below):
  - Greedy method: successively pick points that are not yet covered
  - Random method: pick a random set of appropriate size
    - Works with high probability (1 − exp(−d))
    - Provides guarantees on cut probabilities and preserves large distances
- Sanitizing algorithm, recursive density maps: divide and subdivide the region into well-rounded cells until each cell contains at most K points; release the count of points in each cell
- Net conditions:
  - Well-spaced: ‖p_i − p_j‖ ≥ r_1 for all i, j
  - Covering: ∪_i B(p_i, r_2) covers the parent cell
  - r_2/r_1 is small
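A minimal sketch of the greedy net and the resulting Voronoi assignment. The candidate set (a dense sample of the parent cell), the random scan order, and the function names are my assumptions; this greedy variant achieves r_2 = r_1, hence r_2/r_1 = 1.

```python
import numpy as np

def greedy_net(candidates, r1, rng):
    """Scan candidates in random order, keeping a point only if it is not
    within r1 of an already chosen center. On exit, centers are pairwise
    >= r1 apart (well-spaced) and every candidate lies within r1 of some
    center (covering): a rejected point was within r1 of a kept one."""
    centers = []
    for i in rng.permutation(len(candidates)):
        p = candidates[i]
        if all(np.linalg.norm(p - q) >= r1 for q in centers):
            centers.append(p)
    return np.array(centers)

def voronoi_assign(points, centers):
    """Next-level cells: the index of the nearest center for each point."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```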
27. Summarizing
- Isolation: a new formalism for describing non-interactive approaches to preserving privacy
- Some examples of good sanitizers w.r.t. isolation
- Key approaches:
  - Recursive histograms
  - Density-based Gaussian noise
- A small amount of noise in each count makes them secure w.r.t. indistinguishability! (See the sketch below.)
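A hedged sketch of that count-noising step: Laplace noise with scale 1/ε is the standard calibration for counts, since adding or removing one record changes each count by at most 1. The exact noise and accounting in the indistinguishability result may differ, and a fully data-dependent partition would itself need care; only the noising of the released counts is shown.

```python
import numpy as np

def noisy_counts(cells, epsilon, rng):
    """Release each (lo, hi, count) cell with Laplace noise of scale
    1/epsilon added to the count, masking any single individual's
    presence or absence in that cell."""
    return [(lo, hi, count + rng.laplace(scale=1.0 / epsilon))
            for lo, hi, count in cells]
```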
28. Summarizing the downsides
- The arguments are specific to the Euclidean metric; it is tricky to understand isolation in other metrics
- Dependence on a large number of dimensions
  - Inherently doomed if the adversary can (approximately) re-create the underlying distribution for low-dimensional data
  - Change the definition of privacy? Limit utility?
- Low-entropy (binary) attributes
  - Current techniques should work as long as a large number of attributes have high entropy
  - May be doable through new techniques
- Auxiliary information is poorly understood
29. Future challenges
- The non-interactive approach is important, but we have strong negative results
  - Try weaker notions of privacy
  - Use computational assumptions
  - Restrict the functionality of the sanitized data
- We still haven't answered the first question: what applications can we allow while preserving privacy?
  - Indistinguishability suggests an answer: insensitive functions
- Repeated application and composability
  - How do different privacy-preserving techniques interact?
  - How do entries to and exits from a database affect its privacy?
30. Questions?
31. Positive results in other formalisms
- K-anonymity (Aggarwal Feder Kenthapadi Motwani Panigrahy Thomas Zhu05)
  - NP-hard to find the smallest set of attribute values to delete
  - Can be approximated to within O(k)
- Belief-based (Dwork Nissim04, Blum Dwork McSherry Nissim05)
  - Add noise to each query proportional to O(T log² T) for T queries
  - O(√T) noise is sufficient for sum queries
- Indistinguishability (Dwork McSherry Nissim Smith06)
  - Add noise to each query proportional to its sensitivity (a sketch follows below)
  - Noise increases linearly with the number of queries
  - Better results in special cases
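The sensitivity-calibrated mechanism in the last group can be sketched as follows; the ε value, the example query, and the function name are illustrative, not taken from the slides.

```python
import numpy as np

def laplace_mechanism(database, query, sensitivity, epsilon, rng):
    """Answer query(database) plus Laplace noise scaled to the query's
    sensitivity: the maximum change in the answer when one record changes."""
    return query(database) + rng.laplace(scale=sensitivity / epsilon)

# A sum query over values in [0, 1] has sensitivity 1
rng = np.random.default_rng(2)
db = rng.uniform(0.0, 1.0, size=1000)
print(laplace_mechanism(db, np.sum, sensitivity=1.0, epsilon=0.1, rng=rng))
```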