1
Privacy Preserving Data Mining, Lecture 3: Non-Cryptographic Approaches for Preserving Privacy
(Based on Slides of Kobbi Nissim)
Benny Pinkas, HP Labs, Israel
2
Why not use cryptographic methods?
  • Many users contribute data. Cannot require them
    to participate in a cryptographic protocol.
  • In particular, cannot require p2p communication
    between users.
  • Cryptographic protocols incur considerable
    overhead.

3
Data Privacy
[Diagram: users access the data only through an access mechanism; the question is whether this access can breach privacy.]
4
Easy, Tempting Solution (a Bad Solution)
Idea: a. Remove identifying information (name, SSN, ...); b. Publish the data.
  • But harmless attributes uniquely identify many patients (gender, age, approximate weight, ethnicity, marital status, ...)
  • Recall: DOB + gender + zip code identify people whp.
  • Worse: rare attributes (e.g. a disease with probability ≈ 1/3000)

5
What is Privacy?
  • Something should not be computable from query answers
  • E.g. π_Joe = Joe's private data
  • The definition should take into account the adversary's power (computational power, # of queries, prior knowledge, ...)
  • Quite often it is much easier to say what is surely non-private
  • E.g. strong breaking of privacy: the adversary is able to retrieve (almost) everybody's private data

Intuition: privacy is breached if it is possible to compute someone's private information from his identity.
6
The Data Privacy Game: an Information-Privacy Tradeoff
  • Private functions:
  • want to hide π_x(DB) = d_x
  • Information functions:
  • want to reveal f(q, DB) for queries q
  • Here: explicit definition of private functions.
  • The question: which information functions may be allowed?
  • Different from crypto (secure function evaluation):
  • There, want to reveal f() (explicit definition of the information function)
  • want to hide all functions π() not computable from f()
  • Implicit definition of private functions
  • The question whether f() should be revealed is not asked
7
A Simplistic Model: Statistical Database (SDB)
[Diagram: the database holds n bits d_1,...,d_n; a query is a subset q ⊆ [n]; the answer is the sum of the bits indexed by q.]
8
Approaches to SDB Privacy
  • Studied extensively since the 70s
  • Perturbation
  • Add randomness. Give noisy or approximate answers
  • Techniques:
  • Data perturbation (perturb the data and then answer queries as usual): Reiss 84; Liew, Choi, Liew 85; Traub, Yemini, Wozniakowski 84
  • Output perturbation (perturb answers to queries): Denning 80; Beck 80; Achugbue, Chin 79; Fellegi, Phillips 74
  • Recent interest: Agrawal, Srikant 00; Agrawal, Aggarwal 01; ...
  • Query Restriction
  • Answer queries accurately but sometimes disallow queries
  • Require queries to obey some structure: Dobkin, Jones, Lipton 79
  • Restrict the number of queries
  • Auditing: Chin, Ozsoyoglu 82; Kleinberg, Papadimitriou, Raghavan 01

9
Some Recent Privacy Definitions
  • X = data, Y = (noisy) observation of X
  • Agrawal, Srikant 00: interval of confidence
  • Let Y = X + noise (e.g. uniform noise in [-100,100]).
  • Perturb the input data. Can still estimate the underlying distribution.
  • Tradeoff: more noise ⇒ less accuracy but more privacy.
  • Intuition: large possible interval ⇒ privacy preserved
  • Given Y, we know that with c% confidence X is in [a1,a2]. For example, for Y = 200, with 50% confidence X is in [150,250] (see the sketch below).
  • a2 - a1 defines the amount of privacy at c% confidence
  • Problem: there might be some a-priori information about X
  • X = someone's age, Y = -97 (since age ≥ 0, X is localized to a much smaller interval)
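A minimal numerical sketch of this interval-of-confidence notion (the age range, sample size, and the single-record printout are illustrative assumptions, not from the slides):

```python
import numpy as np

# Perturb data with uniform noise in [-100, 100] and report a 50%-confidence
# interval for X given an observed Y (as in the Y = 200 -> [150, 250] example).
rng = np.random.default_rng(0)
ages = rng.integers(20, 80, size=10_000)            # hypothetical true data X
noise = rng.uniform(-100, 100, size=ages.shape)     # uniform perturbation
released = ages + noise                             # published values Y = X + noise

y = released[0]
# X is certainly in [Y-100, Y+100]; the middle half of that range is a
# 50%-confidence interval of width 100, i.e. [Y-50, Y+50].
print(f"Y = {y:.1f}; 50%-confidence interval for X: [{y - 50:.1f}, {y + 50:.1f}]")
```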

10
The AS scheme can be turned against itself
  • Assume that N is large
  • Even if the data-miner doesn't have a-priori information about X, it can estimate it given the randomized data Y.
  • The perturbation is uniform in [-1,1]
  • AS privacy: an interval of width 2 with 100% confidence
  • Let f_X(x) = 0.5 for x ∈ [0,1], and 0.5 for x ∈ [4,5].
  • But after learning f_X, the value of X can easily be localized within an interval of size at most 1 (see the sketch below).
  • Problem: aggregate information provides information that can be used to attack individual data
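A small sketch of this attack, assuming the bimodal density from the slide (half the mass on [0,1], half on [4,5]) and uniform noise in [-1,1]:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
cluster = rng.integers(0, 2, size=n)            # 0 -> X in [0,1], 1 -> X in [4,5]
x = rng.uniform(0, 1, size=n) + 4 * cluster     # true data, density f_X from the slide
y = x + rng.uniform(-1, 1, size=n)              # released values, noise uniform in [-1,1]

# If X is in [0,1] then Y <= 2; if X is in [4,5] then Y >= 3. The two ranges do not
# overlap, so every Y reveals which interval of length 1 contains X, despite the
# nominal "interval of width 2 at 100% confidence".
guessed_cluster = (y >= 3).astype(int)
print("fraction of rows localized to an interval of size 1:",
      np.mean(guessed_cluster == cluster))      # prints 1.0
```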

11
Some Recent Privacy Definitions
  • X = data, Y = (noisy) observation of X
  • Agrawal, Aggarwal 01: mutual information
  • Intuition:
  • High entropy is good. I(X;Y) = H(X) - H(X|Y) (mutual information)
  • Small I(X;Y) ⇒ privacy preserved (Y provides little information about X).
  • Problem [EGS]:
  • An average notion. Privacy loss can happen with low but significant probability, without affecting I(X;Y).
  • Sometimes I(X;Y) seems good but privacy is breached

12
Output Perturbation (Randomization Approach)
  • Exact answer to query q:
  • a_q = Σ_{i∈q} d_i
  • Actual SDB answer: â_q
  • Perturbation E:
  • For all q: |â_q - a_q| ≤ E
  • Questions:
  • Does perturbation give any privacy?
  • How much perturbation is needed for privacy?
  • Usability

13
Privacy Preserved by Perturbation E ≈ √n
  • Database d ∈_R {0,1}^n (uniform input distribution!)
  • Algorithm: on query q,
  • Let a_q = Σ_{i∈q} d_i
  • If |a_q - |q|/2| < E, return â_q = |q|/2
  • Otherwise return â_q = a_q
  • E ≥ √n·(lg n)² ⇒ privacy is preserved
  • Assume poly(n) queries
  • If E ≥ √n·(lg n)², whp the answer is always |q|/2
  • No information about d is given!
  • (but the database is completely useless)
  • Shows that a perturbation of magnitude ≈ √n is sometimes enough for privacy. Can we do better? (A sketch of this mechanism follows below.)
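A sketch of this mechanism on a random database (the perturbation bound E = √n·(lg n)² is taken from the slide; the remaining parameters are illustrative assumptions):

```python
import math
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
d = rng.integers(0, 2, size=n)              # database d in {0,1}^n, uniform
E = math.sqrt(n) * math.log2(n) ** 2        # allowed perturbation

def answer(q):
    """q is a boolean mask selecting the queried subset of rows."""
    a_q = int(d[q].sum())
    half = q.sum() / 2
    # Return |q|/2 when the true answer is within E of it, otherwise the true answer.
    return half if abs(a_q - half) < E else a_q

q = rng.integers(0, 2, size=n).astype(bool)  # a random subset query
print("true answer:", int(d[q].sum()), "released answer:", answer(q))
# For E >= sqrt(n)*(lg n)^2 the released answer is |q|/2 whp, revealing nothing about d.
```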

14
Perturbation E << √n Implies no Privacy
  • The previous useless database achieves the best possible perturbation.
  • Theorem [Dinur-Nissim]: Given any DB and any DB response algorithm with perturbation E = o(√n), there is a poly-time reconstruction algorithm that outputs a database d', s.t. dist(d,d') ≤ o(n).

strong breaking of privacy
15
The Adversary as a Decoding Algorithm

16
Proof of Theorem [DN03]: The Adversary's Reconstruction Algorithm
  • Query phase: get â_{q_j} for t random subsets q_1,...,q_t
  • Weeding phase: solve the linear program (over ℝ):
  • 0 ≤ x_i ≤ 1
  • |Σ_{i∈q_j} x_i - â_{q_j}| ≤ E
  • Rounding: let c_i = round(x_i); output c

Observation: a solution always exists, e.g. x = d. (A small-scale sketch of this attack follows below.)
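A small-scale sketch of the reconstruction (the parameters n, t, E and the use of scipy's LP solver are illustrative assumptions; with E well below √n it recovers most entries):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
n, t, E = 64, 640, 2                                # database size, #queries, perturbation << sqrt(n)
d = rng.integers(0, 2, size=n)                      # secret database
Q = rng.integers(0, 2, size=(t, n))                 # random subset queries as 0/1 masks
answers = Q @ d + rng.integers(-E, E + 1, size=t)   # perturbed answers, |error| <= E

# Weeding phase: find any x with 0 <= x_i <= 1 and |sum_{i in q_j} x_i - a_j| <= E.
A_ub = np.vstack([Q, -Q])
b_ub = np.concatenate([answers + E, -(answers - E)])
res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * n, method="highs")

# Rounding phase: round each coordinate and compare with the secret database.
c = np.rint(res.x).astype(int)
print("fraction of entries recovered:", np.mean(c == d))
```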
17
Why does the Reconstruction Algorithm Work?
  • Consider x ∈ {0,1}^n s.t. dist(x,d) = cn = Ω(n)
  • Observation:
  • A random q contains Θ(cn) coordinates on which x ≠ d
  • The difference in the sum over these coordinates is, with constant probability, at least Ω(√n) (> E = o(√n)).
  • Such a q disqualifies x as a solution of the LP
  • Since polynomially many random queries are asked, and each disqualifies a far-away x with constant probability, all such vectors x are disqualified with overwhelming probability.

18
Summary of Results (statistical database)
  • Dinur, Nissim 03:
  • Unlimited adversary:
  • perturbation of magnitude Ω(n) is required
  • Polynomial-time adversary:
  • perturbation of magnitude Ω(√n) is required (shown above)
  • In both cases, with any smaller perturbation the adversary may reconstruct a good approximation of the database
  • This disallows even very weak notions of privacy
  • Bounded adversary, restricted to T << n queries (SuLQ):
  • There is a privacy-preserving access mechanism with perturbation ≈ √T, i.e. << √n
  • A chance for usability
  • A reasonable model as databases grow larger and larger

19
SuLQ for a Multi-Attribute Statistical Database (SDB)
Query (q, f): q ⊆ [n], f: {0,1}^k → {0,1}
Answer: a_{q,f} = Σ_{i∈q} f(d_i)
Database entries: d_{i,j}
Row distribution: D = (D1, D2, ..., Dn)
20
Privacy and Usability Concerns for the Multi-Attribute Model [DN]
  • Rich set of queries: subset sums over any property of the k attributes
  • Obviously increases usability, but how is privacy affected?
  • More to protect: functions of the k attributes
  • Relevant factors:
  • What is the adversary's goal?
  • Row dependency
  • Vertically split data (between k or fewer databases):
  • Can privacy still be maintained with independently operating databases?

21
Privacy Definition - Intuition
  • A 3-phase adversary:
  • Phase 0: defines a target set G of poly(n) functions g: {0,1}^k → {0,1}
  • It will try to learn some of this information about someone
  • Phase 1: adaptively queries the database T = o(n) times
  • Phase 2: chooses an index i of a row it intends to attack and a function g ∈ G
  • The attack:
  • given d_{-i} (all rows except row i),
  • try to guess g(d_{i,1},...,d_{i,k})

22
The Privacy Definition
  • p⁰_{i,g} = a-priori probability that g(d_{i,1},...,d_{i,k}) = 1
  • p^T_{i,g} = a-posteriori probability that g(d_{i,1},...,d_{i,k}) = 1,
  • given the answers to the T queries and d_{-i}
  • Define conf(p) = log(p / (1-p))  (see the snippet below)
  • 1-1 relationship between p and conf(p)
  • conf(1/2) = 0, conf(2/3) = 1, conf(1) = ∞
  • Δconf_{i,g} = conf(p^T_{i,g}) - conf(p⁰_{i,g})
  • (ε,T)-privacy (relative privacy):
  • For all distributions D1,...,Dn, all rows i, all functions g, and any adversary making at most T queries:
  • Pr[Δconf_{i,g} > ε] = neg(n)
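The confidence function written out (a base-2 logarithm, which matches conf(2/3) = 1 on the slide):

```python
import math

def conf(p):
    # conf(p) = log2(p / (1 - p)); conf(1/2) = 0, conf(2/3) = 1, conf(p) -> inf as p -> 1.
    return math.log2(p / (1 - p))

print(conf(1 / 2), conf(2 / 3))   # 0.0 1.0
```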

23
The SuLQ Database
  • Adversary restricted to T << n queries
  • On query (q, f):
  • q ⊆ [n]
  • f: {0,1}^k → {0,1} (a binary function)
  • Let a_{q,f} = Σ_{i∈q} f(d_{i,1},...,d_{i,k})
  • Let N be binomial noise centered at 0, of magnitude ≈ √T
  • Return a_{q,f} + N

SuLQ = Sub-Linear Queries. (A minimal sketch of a noisy SuLQ answer follows below.)
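A minimal sketch of such a noisy answer. The exact noise distribution on the slide is unclear, so a centered binomial with standard deviation √T is used here as an assumption; the toy parameters and the example property f are also assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, T = 10_000, 3, 100                       # rows, attributes, query budget T << n
db = rng.integers(0, 2, size=(n, k))           # multi-attribute database d_{i,j}

def sulq_answer(q, f):
    """q: boolean row mask; f: property {0,1}^k -> {0,1}, applied row by row."""
    a_qf = int(sum(f(row) for row in db[q]))
    noise = int(rng.binomial(4 * T, 0.5)) - 2 * T   # centered binomial, std = sqrt(T)
    return a_qf + noise

q = rng.integers(0, 2, size=n).astype(bool)                 # subset query
f = lambda row: int(row[0] == 1 and row[2] == 0)            # some property of the k attributes
print("noisy SuLQ answer:", sulq_answer(q, f))
```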
24
Privacy Analysis of the SuLQ Database
  • p^m_{i,g} = a-posteriori probability that g(d_{i,1},...,d_{i,k}) = 1,
  • given d_{-i} and the answers to the first m queries
  • conf(p^m_{i,g}) describes a random walk on the line, with
  • starting point conf(p⁰_{i,g})
  • Compromise: conf(p^m_{i,g}) - conf(p⁰_{i,g}) > ε
  • W.h.p. more than T steps are needed to reach a compromise

[Figure: random walk on the confidence line, starting at conf(p⁰_{i,g}), with the compromise threshold at conf(p⁰_{i,g}) + ε.]
25
Usability: One Multi-Attribute SuLQ DB
  • Statistics of any property f of the k attributes
  • I.e., for what fraction of the (sub)population does f(d_1,...,d_k) hold?
  • Easy: just put f in the query
  • Other applications:
  • k independent multi-attribute SuLQ DBs
  • Vertically partitioned SuLQ DBs
  • Testing whether Pr[α|β] > Pr[α]
  • Caveat: we hide g() about a specific row (not about multiple rows)

26
Overview of Methods
  • Input Perturbation
  • Output Perturbation
  • Query Restriction

27
Query restriction
  • The decision whether to answer or deny a query:
  • can be based on the content of the query and on the answers to previous queries,
  • or can be based on the above and on the contents of the database

28
Auditing
  • [AW89] classifies auditing as a query restriction method:
  • Auditing of an SDB involves keeping up-to-date logs of all queries made by each user (not the data involved) and constantly checking for possible compromise whenever a new query is issued.
  • Partial motivation: may allow more queries to be posed, if no privacy threat occurs.
  • Early work: Hofmann 1977; Schlorer 1976; Chin, Ozsoyoglu 1981, 1986
  • Recent interest: Kleinberg, Papadimitriou, Raghavan 2000; Li, Wang, Wang, Jajodia 2002; Jonsson, Krokhin 2003

29
How Auditors may Inadvertently Compromise Privacy
30
The Setting
Statistical database
  • Dataset: d = (d_1,...,d_n)
  • Entries d_i: real, integer, or Boolean
  • Query: q = (f, i_1,...,i_k)
  • f ∈ {Min, Max, Median, Sum, Average, Count}
  • Bad users will try to breach the privacy of individuals
  • Compromise = uniquely determine some d_i (a very weak definition)

31
Auditing
[Diagram: the user sends a new query q_{i+1}; the auditor, holding the query log q_1,...,q_i, replies either "Here's the answer" or "Query denied" (as the answer would cause privacy loss).]
32
Example 1: Sum/Max Auditing
d_i real; sum/max queries; privacy is breached if some d_i is learned
q1 = sum(d1,d2,d3)
Answer: sum(d1,d2,d3) = 15
q2 = max(d1,d2,d3)
Denied (the answer would cause privacy loss)
There must be a reason for the denial...
q2 is denied iff d1 = d2 = d3 = 5. I win!
Oh well...
33
Sounds Familiar?
David Duncan, Former auditor for Enron and
partner in Andersen
Mr. Chairman, I would like to answer the
committee's questions, but on the advice of my
counsel I respectfully decline to answer the
question based on the protection afforded me
under the Constitution of the United States.
34
Max Auditing
d_i real
q1 = max(d1,d2,d3,d4) → answered: M1234
q2 = max(d1,d2,d3) → M123 / denied
If denied: d4 = M1234
q3 = max(d1,d2) → M12 / denied
If denied: d3 = M123
The attacker learns an item with probability ½ (see the sketch below)
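A sketch of the first step of this attack against a hypothetical value-based max auditor (the auditor below denies exactly when answering would pin down a value, which is the behaviour the slide exploits):

```python
import random

def value_based_max_auditor(db, idx, known_max_all):
    """Answer max over db[idx], but deny if the answer, combined with the
    previously released maximum over all entries, would determine some d_i."""
    m = max(db[i] for i in idx)
    # If max(d1,d2,d3) < M1234 then d4 must equal M1234, so answering compromises d4.
    return ("denied", None) if m < known_max_all else ("answered", m)

random.seed(5)
db = [random.random() for _ in range(4)]
m1234 = max(db)                                       # q1 = max(d1,d2,d3,d4), answered
status, m123 = value_based_max_auditor(db, [0, 1, 2], m1234)
if status == "denied":
    print("q2 denied, hence d4 =", m1234)             # the denial itself reveals d4
else:
    print("q2 answered:", m123, "- recurse with max(d1,d2) as on the slide")
```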
35
Boolean Auditing?
d_i Boolean
Sum queries on pairs of adjacent entries: answer 1 / denied
q_i is denied iff d_i = d_{i+1} ⇒ the user learns the database or its complement
36
The Problem
  • The problem:
  • Query denials leak (potentially sensitive) information
  • Users cannot decide denials by themselves

[Figure: within the set of possible assignments to d_1,...,d_n, the denial of q_{i+1} reveals that the true database lies in a particular subset of the assignments consistent with (q_1,...,q_i, a_1,...,a_i).]
37
Solution to the problem: Simulatable Auditing
  • An auditor is simulatable if there exists a simulator that, from the queries and answers alone (without access to the database), produces the same deny/answer decisions.
Simulation ⇒ denials do not leak information
38
Why Simulatable Auditors do not Leak Information?
[Figure: possible assignments to d_1,...,d_n.]
39
  • Simulatable auditing

40
Query Restriction for Sum Queries
  • Given:
  • a dataset D = x_1,...,x_n, with x_i ∈ ℝ
  • a subset S of the dataset; query: Σ_{x_i∈S} x_i
  • Is it possible to compromise D?
  • Here compromise means uniquely determining some x_i from the queries
  • Compromise is possible if query subsets can be arbitrarily small:
  • sum(x_9) = x_9

41
Query Set Size Control
  • Do not permit queries that involve a small subset of the database.
  • Compromise is still possible:
  • want to discover x:
  • sum(x, y1,..,yk) - sum(y1,..,yk) = x (see the snippet below)
  • Issue: overlap
  • In general, restricting overlap is not enough.
  • Need to also restrict the number of queries
  • Note that restricting overlap itself sometimes restricts the number of queries (e.g. if query size = cn and overlap = O(1), there are only about 1/c possible queries)
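A toy illustration of this overlap-based attack (the values are made up):

```python
# Two large queries that differ only in x isolate x exactly.
data = {"x": 7, "y1": 3, "y2": 5, "y3": 2}                 # hypothetical dataset
q1 = data["x"] + data["y1"] + data["y2"] + data["y3"]      # sum(x, y1, ..., yk)
q2 = data["y1"] + data["y2"] + data["y3"]                  # sum(y1, ..., yk)
print("recovered x =", q1 - q2)                            # prints 7
```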

42
Restricting Set-Sum Queries
  • Restricting the sum queries based on
  • Number of database elements in the sum
  • Overlap with previous sum queries
  • Total number of queries
  • Note that the criteria are known to the user
  • They do not depend on the contents of the
    database
  • Therefore, the user can simulate the
    denial/no-denial answer given by the DB
  • Simulatable auditing

43
Restricting Overlap and Number of Queries
  • Assume:
  • each query satisfies |Q_i| ≥ k
  • any two queries satisfy |Q_i ∩ Q_j| ≤ r
  • the adversary knows at most L values a-priori, L + 1 < k
  • Claim: the data cannot be compromised with fewer than 1 + (2k-L)/r sum queries.

44
Overlap & Number of Queries
  • Claim: the data cannot be compromised with fewer than 1 + (2k-L)/r sum queries [Dobkin, Jones, Lipton; Reiss]
  • k ≤ query size, r ≥ overlap, L = # of a-priori known items
  • Suppose x_c is compromised after t queries, where each query is represented by
  • Q_i = x_{i_1} + x_{i_2} + ... + x_{i_k}, for i = 1, ..., t
  • This implies that
  • x_c = Σ_{i=1..t} α_i Q_i = Σ_{i=1..t} α_i Σ_{j=1..k} x_{i_j}
  • Let η_{iℓ} = 1 if x_ℓ is in query i, and 0 otherwise
  • x_c = Σ_{i=1..t} α_i Σ_{ℓ=1..n} η_{iℓ} x_ℓ = Σ_{ℓ=1..n} (Σ_{i=1..t} α_i η_{iℓ}) x_ℓ (written out in LaTeX below)
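The chain of equalities on this slide, written out (with η_{iℓ} the indicator defined above):

```latex
x_c \;=\; \sum_{i=1}^{t} \alpha_i Q_i
    \;=\; \sum_{i=1}^{t} \alpha_i \sum_{j=1}^{k} x_{i_j}
    \;=\; \sum_{i=1}^{t} \alpha_i \sum_{\ell=1}^{n} \eta_{i\ell}\, x_\ell
    \;=\; \sum_{\ell=1}^{n} \Big( \sum_{i=1}^{t} \alpha_i \eta_{i\ell} \Big) x_\ell,
\qquad
\eta_{i\ell} =
\begin{cases}
1 & \text{if } x_\ell \text{ appears in query } i,\\
0 & \text{otherwise.}
\end{cases}
```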

45
Overlap & Number of Queries
  • We have:
  • x_c = Σ_{ℓ=1..n} (Σ_{i=1..t} α_i η_{iℓ}) x_ℓ
  • In the above sum, (Σ_{i=1..t} α_i η_{iℓ}) must be 0 for every x_ℓ except x_c (in order for x_c to be compromised)
  • This happens iff η_{iℓ} = 0 for all i, or if η_{iℓ} = η_{jℓ} = 1 and α_i, α_j have opposite signs,
  • or α_i = 0, in which case the i-th query didn't matter

46
Overlap & Number of Queries
  • WLOG, the first query contains x_c, and the second query appears with the opposite sign.
  • In the first query, k elements are probed
  • The second query adds at least k - r new elements
  • Elements from the first and second queries cannot be canceled within the same (additional) query, since that would require opposite signs
  • Therefore each new query cancels items from the first or from the second query, but not from both.
  • Need to cancel 2k - r - L elements.
  • Need 2 + (2k-r-L)/r queries in total, i.e. 1 + (2k-L)/r.

47
Notes
  • The number of queries satisfying |Q_i| ≥ k and |Q_i ∩ Q_j| ≤ r is small
  • If k = n/c for some constant c and r = O(1), then there are only about c queries in which no two overlap by more than r.
  • Hence, the allowed query sequence may be uncomfortably short.
  • Or, if r = k/c (the overlap is a constant fraction of the query size), then the number of queries, 1 + (2k-L)/r, is O(c).

48
Conclusions
  • Privacy should be defined and analyzed rigorously
  • In particular, assuming that randomization ⇒ privacy is dangerous
  • High perturbation is needed for privacy against polynomial-time adversaries
  • Threshold phenomenon: above √n perturbation, total privacy; below √n, no privacy (for a poly-time adversary)
  • Main tool: a reconstruction algorithm
  • Careless auditing might leak private information
  • Self auditing (simulatable auditors) is safe:
  • the decision whether to allow a query is based on previous good queries and their answers,
  • without access to the DB contents
  • Users may apply the decision procedure by themselves

49
ToDo
  • Come up with a good model and requirements for database privacy
  • Learn from crypto:
  • Protect against more general loss of privacy
  • Simulatable auditors are a starting point for designing more reasonable audit mechanisms

50
References
  • Course web page: A Study of Perturbation Techniques for Data Privacy, Cynthia Dwork, Nina Mishra, and Kobbi Nissim, http://theory.stanford.edu/nmishra/cs369-2004.html
  • Privacy and Databases: http://theory.stanford.edu/rajeev/privacy.html

51
Foundations of CS at the Weizmann Institute
  • Uri Feige
  • Oded Goldreich
  • Shafi Goldwasser
  • David Harel
  • Moni Naor
  • David Peleg
  • Amir Pnueli
  • Ran Raz
  • Omer Reingold
  • Adi Shamir

Yellow = crypto
  • All students receive a fellowship
  • Language of instruction: English