1
A non-interactive approach to database privacy
  • Shuchi Chawla
  • Stanford / UW Madison

[TCC 05] Chawla, Dwork, McSherry, Smith, Wee
[UAI 05] Chawla, Dwork, McSherry, Talwar
2
The privacy problem
  • Many database applications require data that is
    considered private by the contributors
  • disease outbreak detection
  • correlating symptoms and diseases
  • targeted advertising
  • The privacy dilemma: protect privacy, but also
    allow legitimate applications
  • Alfred Kinsey: better privacy means better
    data
  • Why is this possible?
  • Because many applications require only
    macroscopic features of data

3
The privacy problem
  • Many database applications require data that is
    considered private by the contributors
  • disease outbreak detection
  • correlating symptoms and diseases
  • targeted advertising
  • The basic questions:
  • Given that privacy must be preserved, what
    applications can we allow?
  • How do we store data and use it to ensure privacy
    protection?
  • What do we mean by privacy? What is private
    information?

4
A road-map
  • How to protect data
  • Interactive approaches
  • Non-interactive approaches
  • Defining privacy
  • What is not possible? Some impossibility
    results
  • What is possible? Techniques for an abstract
    version of the problem
  • What next?

5
How to protect sensitive data
Take 1: The census (non-interactive) approach
  • The data-collector sanitizes the dataset and
    releases it publicly
  • Applications can freely use this sanitized data
  • Suppression of sensitive records; aggregation of
    attributes
  • [Gusfield 88], [Dobra Fienberg Trottini 00],
    [Chawla Dwork McSherry Talwar 05], ...
  • Imputation: learn the underlying distribution
    [Rubin 93]
  • Data perturbation: add random noise to each
    record, independently or dependent on other
    records
  • [Agrawal Srikant 00], [Agrawal Aggarwal 01],
    [Roque 03], [Winkler 04], ...
  • [Evfimievski Gehrke Srikant 03], [Chawla Dwork
    McSherry Smith Wee 05], ...

6
How to protect sensitive data
Take 2: The trusted-party (interactive) approach
  • Data resides with a trusted party (government?)
  • Applications query the trusted party
  • Trusted party reveals as much info as possible
    without compromising privacy
  • Query auditing: answer truthfully or answer
    nothing
  • [Kleinberg Papadimitriou Raghavan 03], [Dinur
    Nissim 03]
  • Output perturbation: answer with some noise
  • [Denning 80], [Beck 80], [Adam Wortmann 89], [Dinur
    Nissim 03], ...
  • [Dwork Nissim 04], [Blum Dwork McSherry
    Nissim 05], ...
  • [Dwork McSherry Nissim Smith 06], ...

7
How to protect sensitive data
Take 1: The census (non-interactive) approach
  • Practical: sanitize once and then forget
  • Potentially worse performance: disadvantage of
    not knowing the future applications that will use
    the data
  • Irreversible

Take 2: The trusted-party (interactive) approach
  • Potentially improved performance: tailor noise to
    the query
  • Impractical: lifelong monitoring; can we really
    trust anyone with our private info?
  • Inconsistent answers to related queries
8
What is a breach of privacy?
  • Informally: when an adversary learns sensitive
    information about an individual by looking at the
    sanitized database or query answers.
  • Note: the adversary may (and should) learn other,
    non-sensitive info
  • What constitutes sensitive information?
  • Values of sensitive attributes: the adversary's
    belief about the individual's attribute values
    should remain nearly the same as before
  • [Evfimievski Gehrke Srikant 03], [Dinur
    Nissim 03], [Dwork Nissim 04],
  • [Blum Dwork McSherry Nissim 05]
  • Related concept: indistinguishability [Dwork
    McSherry Nissim Smith 06]

Weaker notions than "values of sensitive attributes" and
indistinguishability:
  • K-Anonymity: any information that the adversary
    learns about an individual applies to K other
    people
  • Isolation: any information of the adversary that
    approximately matches actual attributes of an
    individual also matches attributes of many (K)
    other individuals nearly as well
[Sweeney 02], [Aggarwal Feder Kenthapadi Motwani
Panigrahy Thomas Zhu 05]
9
Isolation
  • Formalizing approximation
  • U: the space of all possible individual records
  • d: a distance function on U
  • Adversary ADV takes as arguments the sanitized
    database S and auxiliary information X and
    produces a guess g ∈ U.
  • For a real person p ∈ U, the distance d(p,g)
    signifies how well the adversary approximates p.
  • Goal: want d( p, ADV(S,X) ) to be large
    (relative to other points).

10
Isolation
  • Formalizing approximation
  • U: the space of all possible individual records
  • d: a distance function on U
  • Goal: want d( p, ADV(S,X) ) to be large.
  • A real person p is (c,K)-isolated by a point g =
    ADV(S,X) if d(p,g) = r and |B(g, cr)| < K

B(v,r): ball of radius r around v w.r.t. distance d
c: small constant, say 8 or 10
K: privacy parameter, say 100 or 1000
[Figure: a guess g at distance r from the nearest RDB
point, with the balls of radius r and cr; one configuration
is (3,7)-isolated, the other is not]
11
Isolation
  • Formalizing approximation
  • U: the space of all possible individual records
  • d: a distance function on U
  • Goal: want d( p, ADV(S,X) ) to be large.
  • A real person p is (c,K)-isolated by a point g =
    ADV(S,X) if d(p,g) = r and |B(g, cr)| < K
  • Some examples (a sketch of checking isolation
    follows below):
  • d( p, g ) = 0 if p = g; 1 otherwise (K-anonymity)
  • d( p, g ) = min over attributes a of |p_a − g_a|
  • d( p, g ) = Hamming distance between p and g
  • d( p, g ) = ℓp distance between p and g
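To make the definition concrete, here is a minimal Python sketch of the
isolation check (illustrative only: the helper name and the arguments
records, guess, c, K are hypothetical, and Euclidean distance is assumed,
as in the abstract model later in the talk).

import numpy as np

def isolates_someone(records, guess, c=10, K=100):
    """Does the guess g (c,K)-isolate some real record p?
    For a record p, let r = d(p, g); g isolates p if the ball B(g, c*r)
    contains fewer than K records.  It suffices to check the record
    nearest to g: any farther record yields a larger ball, which can
    only contain at least as many records."""
    dists = np.linalg.norm(records - guess, axis=1)  # d(p_i, g) for every record
    r = dists.min()                                  # distance to the nearest record
    return np.sum(dists <= c * r) < K                # |B(g, c*r)| < K ?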

12
A good sanitizing algorithm
  • A cryptographic approach: the possibility of a
    privacy breach does not increase significantly by
    releasing the sanitized data

∀ D, ∀ ADV, ∃ ADV-Sim such that ∀ RDB, ∀ aux. X:
Pr[ ADV(S,X) compromises RDB ] ≈
Pr[ ADV-Sim(X) compromises RDB ]
i.e. Pr[ ADV(S,X) compromises RDB ] −
Pr[ ADV-Sim(X) compromises RDB ] ≤ ε
(Compromise: is there a p ∈ RDB such that g isolates p?)
13
Is this possible?
  • Cannot allow arbitrary auxiliary information X
    [Dwork, Naor]
  • Basic idea: X encodes compromising information
    about RDB
  • S carries the key to this information

[Diagram: two experiments run side by side. In each, an
auxiliary-information generator (Aux. Gen.) produces X from
D and RDB. In the first, the sanitizer produces S from RDB
and ADV(S, X) outputs a compromise bit (0/1); in the
second, ADV-Sim(X) outputs a compromise bit (0/1) without
seeing S.]
14
Is this possible?
  • Cannot allow arbitrary auxiliary information X
  • Cannot preserve arbitrary utility functions
  • e.g. the exact nearest neighbor function
  • exact histograms on discrete
    distributions
  • Cannot preserve all well-behaved utility
    functions in the non-interactive setting under
    the indistinguishability framework
  • [Dwork McSherry Nissim Smith 06]

15
A good sanitizing algorithm:
the indistinguishability framework
Pr[ ADV(S,X) compromises RDB ] ≈
Pr[ ADV-Sim(S′,X) compromises RDB ]
A stronger requirement: S ≈ S′
(here S′ denotes a sanitization of a database differing
from RDB in one record)
16
Is this possible?
  • Cannot allow arbitrary auxiliary information X
  • Cannot preserve arbitrary utility functions
  • e.g. the exact nearest neighbor function
  • exact histograms on discrete
    distributions
  • Cannot preserve all well-behaved utility
    functions in the non-interactive setting under
    the indistinguishability framework
  • [Dwork McSherry Nissim Smith 06]
  • Cannot preserve all utility functions: must
    decide on the important ones before sanitizing
  • Unclear what this implies for the isolation
    framework

17
What can we achieve?
  • An abstract model
  • Records lie in ℝ^d; d is the Euclidean
    distance
  • Our results for privacy (aux. X: all but a
    subset of points)
  • Randomized (recursive) histograms
  • D: uniform over a hypercube
  • Isolation prob. < 2^{-Ω(d)}, c ≈ 15, K = 2^{o(d)}
  • Recursive density-maps
  • D: uniform over a well-rounded region
  • Isolation prob. < 2^{-Ω(d)}, c a large constant,
    K = 2^{o(d)}
  • Density-based Gaussian noise (sketched below)
  • D: uniform on a hypercube or hypersphere
  • Isolation prob. < 2^{-Ω(d)}, c ≈ 15, K = 2^{o(d)},
    n = 2^{o(d)}
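For flavour, a minimal sketch of a density-based Gaussian perturbation,
as referenced above (an illustration of the idea only, not the exact
procedure from the paper; the helper name and the scaling by the
K-radius are assumptions).

import numpy as np

def density_based_gaussian_noise(records, K=100, rng=None):
    """Illustrative density-based perturbation (assumes K < number of
    records): add spherical Gaussian noise to each record, scaled by its
    K-radius, i.e. the distance to its K-th nearest neighbour.  Records
    in dense regions move little; records in sparse regions move a lot,
    blending each record into a crowd of roughly K others."""
    rng = rng or np.random.default_rng()
    n, d = records.shape
    # Pairwise distances (fine for a sketch; use a KD-tree for large n).
    dists = np.linalg.norm(records[:, None, :] - records[None, :, :], axis=2)
    k_radius = np.sort(dists, axis=1)[:, K]   # distance to the K-th nearest neighbour
    noise = rng.normal(size=(n, d)) * (k_radius[:, None] / np.sqrt(d))
    return records + noise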

18
What can we achieve?
  • An abstract model
  • Records lie in ℝ^d; d is the Euclidean
    distance
  • Our results for utility
  • Recursive histograms, density-maps
  • Popular summarization technique
  • Benefit of providing more detail where required
  • No noise!
  • Randomized subdivision ⇒ large distances
    preserved approximately
  • Density-based perturbation
  • Large aggregates are preserved
  • Preserves clusterings via spectral and
    diameter-based techniques

19
What can we achieve?
  • An abstract model
  • Records lie in ℝ^d; d is the Euclidean
    distance
  • Some key points
  • Unconditional results
  • Adversary is allowed unlimited computational
    power
  • Distribution D used in a limited way
  • Primarily describes the adversary's prior view
    of the individual to be isolated
  • The real-valued data assumption
  • Extends to the discrete-valued case when granularity
    is sufficiently small; does not capture
    everything, e.g. binary data
  • The low-dimensional case
  • Creates a problem (more later)

20
Histogram-based sanitization
  • Recursively sub-divide the space into 2^d cells, until
    each cell has fewer than K points (sketched below)

[Figure: quadtree-style subdivision with K = 3, d = 2]
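A minimal sketch of this subdivision, as mentioned above (the helper name
is hypothetical; records are assumed to be points inside an axis-aligned
box [lo, hi] in ℝ^d; with d = 2 and K = 3 it reproduces the quadtree-style
picture above).

import numpy as np

def recursive_histogram(points, lo, hi, K=100, depth=0, max_depth=30):
    """Recursive histogram sanitization sketch: split the cell [lo, hi]
    into 2^d equal sub-cells and recurse until a cell holds fewer than K
    points; release only (cell bounds, count) triples, never the points.
    (max_depth guards against duplicate points; points lying exactly on
    the outer upper boundary are ignored in this sketch.)"""
    if len(points) < K or depth == max_depth:
        return [(lo, hi, len(points))]        # leaf cell: release the count only
    d = len(lo)
    mid = (lo + hi) / 2.0
    cells = []
    for corner in range(2 ** d):              # enumerate the 2^d sub-cells
        upper = np.array([(corner >> i) & 1 for i in range(d)], dtype=bool)
        sub_lo = np.where(upper, mid, lo)
        sub_hi = np.where(upper, hi, mid)
        inside = np.all((points >= sub_lo) & (points < sub_hi), axis=1)
        cells += recursive_histogram(points[inside], sub_lo, sub_hi,
                                     K, depth + 1, max_depth)
    return cells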
21
Histograms: a brief proof of privacy
  • Adversary's goal: produce a point g such that if
    r is the distance between g and the closest real
    point p, then |B(g,cr)| < K
  • Intuition
  • Input distribution is uniform over the hypercube
  • The adversary's view: a product of uniform
    distributions over cells
  • Within any cell, the adversary can't
    conclusively single out a position for an
    unrevealed point

22
Histograms: a brief proof of privacy
  • Case 1: Sparse cell
  • Expected distance ‖g − p‖ proportional to the
    diameter of the cell
  • c times this distance is larger than the diameter
    of the parent cell
  • ⇒ B(g,cr) captures the entire parent cell and
    contains ≥ K points
  • Case 2: Dense cell
  • Consider B(g,r) and B(g,cr) for some radius r
  • Adversary wins if
  • Pr[ |B(g,r)| ≥ 1 ] is large, and
  • Pr[ |B(g,cr)| ≥ K ] is small
  • We show that Pr[ x ∈ B(g,cr) ] ≫ Pr[ x ∈
    B(g,r) ] for an unrevealed point x

[Figure: the balls B(g,r) and B(g,cr) around the guess g,
with the nearest real point p]
23
Histograms: a brief proof of privacy
  • Lemma: Let c be a large enough constant.
    For any cell and any r < diam(cell)/c,
    Pr[ x ∈ B(q,cr) ∩ cell ] ≥ 2^d · Pr[ x ∈
    B(q,r) ∩ cell ]
  • Proof idea
  • Pr[ x ∈ B(q,r) ∩ cell ]
    ∝ Vol( B(q,r) ∩ cell )
  • Vol( B(q,cr) ∩ cell ) > 2^d · Vol( B(q,r) ∩ cell )
  • Key quantity: Frac(q,r) = fraction of B(q,r) inside
    the cell
  • How does Frac(q,r) decrease with r?
  • As long as cr is small w.r.t. the diameter, the
    decrease is at most a factor polynomial in c and d,
    so it cannot cancel the c^d gain in ball volume

(Technique: study intersections of Gaussians with the
hypercube)

Corollary: probability of success for the
adversary < 2^{-Ω(d)}
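A quick numerical check of the key volume claim (the cell, the radii, and
the guess position below are arbitrary illustrative choices): estimating
Frac(q, r) by Monte Carlo for a guess near a corner of the unit-cube cell
shows Vol(B(q,cr) ∩ cell) exceeding Vol(B(q,r) ∩ cell) by much more than 2^d.

import numpy as np

rng = np.random.default_rng(0)

def sample_ball(center, radius, n):
    """Sample n points uniformly from the Euclidean ball B(center, radius)."""
    d = len(center)
    dirs = rng.normal(size=(n, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # uniform directions
    radii = radius * rng.random(n) ** (1.0 / d)           # uniform radial density
    return center + dirs * radii[:, None]

def frac_in_cell(q, r, n=200_000):
    """Monte-Carlo estimate of Frac(q, r): fraction of B(q, r) in the unit cube."""
    pts = sample_ball(q, r, n)
    return np.all((pts >= 0.0) & (pts <= 1.0), axis=1).mean()

d, c, r = 8, 3.0, 0.05
q = np.full(d, 0.1)                        # a guess near a corner of the cell
ratio = c ** d * frac_in_cell(q, c * r) / max(frac_in_cell(q, r), 1e-12)
print(f"Vol ratio of the two ball-cell intersections: {ratio:.0f}  (2^d = {2 ** d})")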
24
Randomized Histograms
  • Do histograms represent the distribution well?
  • A desirable property: closer pairs of points are
    separated at higher levels of recursion (w.h.p.)
  • Estimate the distance between points from the
    histogram (sketched below):
    d(x,y) ≤ dH(x,y) ≤ d(x,y) + Dx + Dy
  • Dx: diameter of the final cell containing x
  • Want Dx to be roughly the K-radius of x
  • Standard technique to achieve this: translate the
    hypercube through a random length in each
    direction
  • Need to be careful about the aspect ratios of the
    resulting cells; undo the slicing of corner
    cells
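Building on the recursive_histogram sketch from slide 20, one way to
realize both ideas follows; the particular estimator dH below (the
farthest-point distance between the two leaf cells) is an illustrative
choice that satisfies the stated inequality, not necessarily the one in
the paper.

import numpy as np

def randomly_shifted_histogram(points, lo, hi, K=100, rng=None):
    """Translate the bounding box by a random offset in each direction
    before subdividing (so that, w.h.p., close pairs of points are not
    split near the root), then recurse as in the earlier sketch."""
    rng = rng or np.random.default_rng()
    shift = rng.uniform(0.0, hi - lo)            # random offset per coordinate
    # Enlarge the box so the shifted grid still covers all the points.
    return recursive_histogram(points, lo - shift, hi + (hi - lo) - shift, K)

def estimated_distance(cell_x, cell_y):
    """dH(x, y): the farthest-point distance between the leaf cells
    containing x and y.  By the triangle inequality,
    d(x,y) <= dH(x,y) <= d(x,y) + Dx + Dy."""
    (lo_x, hi_x, _), (lo_y, hi_y, _) = cell_x, cell_y
    per_coord = np.maximum(np.abs(hi_x - lo_y), np.abs(hi_y - lo_x))
    return np.linalg.norm(per_coord)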

25
Recursive density-maps
  • Can we extend the histogram construction to other
    distributions?
  • Key component of the previous analysis:
  • Intersections of balls with cells expand
    exponentially with the radius
  • Does this hold for arbitrarily shaped cells?
  • Yes, if the cells are well-rounded
  • i.e. convex and with bounded aspect ratio

26
Partitioning into well-rounded cells
  • Basic procedure
  • Pick a set of points forming a net over the
    parent cell
  • The next-level cells are given by the Voronoi
    partition formed by these points
  • Picking the points
  • Greedy method: successively pick points that are
    not covered yet (sketched below)
  • Random method: pick a random set of appropriate
    size
  • Works with high probability (1 − exp(−d))
  • Provides guarantees on cut probabilities and
    preserves large distances
  • Sanitizing algorithm: recursive density-maps
  • Divide and subdivide the region into well-rounded
    cells until each cell contains at most K points;
    release the count of points in each cell

Well-spaced: ‖pi − pj‖ ≥ r1 for all i, j
Covering: ∪i B(pi, r2) covers the parent cell
r2/r1 is small
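A minimal sketch of the greedy net construction and the induced Voronoi
partition, as referenced above (helper names are hypothetical; candidates
would be a fine discretization of, or a dense sample from, the parent cell).

import numpy as np

def greedy_net(candidates, r, rng=None):
    """Greedy r-net over the parent cell: repeatedly pick a candidate
    that is not yet within distance r of a chosen centre.  The centres
    end up r-separated (well-spaced) and every candidate is within r of
    some centre (covering), so each induced Voronoi cell lies inside
    B(centre, r) and contains B(centre, r/2) clipped to the parent cell:
    a bounded aspect ratio."""
    rng = rng or np.random.default_rng()
    centres = []
    for p in candidates[rng.permutation(len(candidates))]:
        if all(np.linalg.norm(p - c) > r for c in centres):
            centres.append(p)
    return np.array(centres)

def voronoi_assignment(points, centres):
    """Next-level cells: assign each point to its nearest centre."""
    dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
    return dists.argmin(axis=1)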
27
Summarizing
  • Isolation: a new formalism for describing
    non-interactive approaches to preserving privacy
  • Some examples of good sanitizers w.r.t. isolation
  • Key approaches
  • Recursive histograms
  • Density-based Gaussian noise

A small amount of noise in each count makes them
secure w.r.t. indistinguishability!
28
Summarizing the downsides
  • Arguments are specific to the Euclidean metric
  • Tricky to understand isolation in other metrics
  • Dependence on a large number of dimensions
  • Inherently doomed if the adversary can re-create
    (approximately) the underlying distribution for
    low-dimensional data
  • Change the definition of privacy? Limit utility?
  • Low-entropy (binary) attributes
  • Current techniques should work as long as a large
    number of attributes have high entropy
  • May be doable through new techniques
  • Aux. info. is poorly understood

29
Future challenges
  • Non-interactive approach is important, but we
    have strong negative results
  • Try weaker notions of privacy
  • Use computational assumptions
  • Restrict the functionality of the sanitized data
  • Still haven't answered the first question: what
    applications can we allow while preserving
    privacy?
  • Indistinguishability suggests an answer:
    insensitive functions
  • Repeated application and composability
  • How do different privacy-preserving techniques
    interact?
  • How do entries to and exits from a database affect
    its privacy?

30
Questions?
31
Positive results in other formalisms
  • K-Anonymity [Aggarwal Feder Kenthapadi Motwani
    Panigrahy Thomas Zhu 05]
  • NP-hard to find the smallest set of attribute
    values to delete
  • Can be approximated to within O(k)
  • Belief-based [Dwork Nissim 04], [Blum Dwork
    McSherry Nissim 05]
  • Add noise to each query proportional to O(T
    log² T), where T is the number of queries
  • O(√T) noise sufficient for sum queries
  • Indistinguishability [Dwork McSherry
    Nissim Smith 06]
  • Add noise to each query proportional to the
    query's sensitivity
  • Noise increases linearly with the number of queries
  • Better results in special cases
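For concreteness, a minimal sketch of the "noise proportional to
sensitivity" idea for a single count query, in the style of the
output-perturbation line of work (the Laplace distribution and the
function name are illustrative assumptions, not a quote from any of the
cited papers).

import numpy as np

def noisy_count(records, predicate, epsilon, rng=None):
    """Answer a count query with noise scaled to the query's sensitivity.
    A count changes by at most 1 when a single record changes, so the
    sensitivity is 1 and the noise scale is 1/epsilon; answering T
    queries this way degrades the guarantee roughly linearly in T."""
    rng = rng or np.random.default_rng()
    sensitivity = 1.0
    true_answer = sum(1 for rec in records if predicate(rec))
    return true_answer + rng.laplace(scale=sensitivity / epsilon)

# Hypothetical usage: how many records satisfy "age > 40"?
# answer = noisy_count(records, lambda rec: rec["age"] > 40, epsilon=0.1)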