Title: Private Matching
1 Privacy Preserving Data Mining, Lecture 3: Non-Cryptographic Approaches for Preserving Privacy (Based on Slides of Kobbi Nissim)
Benny Pinkas, HP Labs, Israel
2 Why not use cryptographic methods?
- Many users contribute data. We cannot require them to participate in a cryptographic protocol.
- In particular, we cannot require peer-to-peer communication between users.
- Cryptographic protocols incur considerable overhead.
3 Data Privacy
[Diagram: users interact with the data only through an access mechanism; privacy is breached if that mechanism reveals too much.]
4 An Easy, Tempting Solution (a Bad Solution)
- Idea: (a) remove identifying information (name, SSN, ...); (b) publish the data
- But "harmless" attributes uniquely identify many patients (gender, age, approximate weight, ethnicity, marital status)
- Recall: DOB + gender + zip code identify people w.h.p.
- Worse: rare attributes (e.g., a disease with probability ≈ 1/3000)
5 What is Privacy?
- Something should not be computable from the query answers
  - E.g., Joe → Joe's private data
- The definition should take into account the adversary's power (computational power, number of queries, prior knowledge, ...)
- Quite often it is much easier to say what is surely non-private
  - E.g., strong breaking of privacy: the adversary is able to retrieve (almost) everybody's private data
- Intuition: privacy is breached if it is possible to compute someone's private information from his identity
6 The Data Privacy Game: an Information-Privacy Tradeoff
- Private functions:
  - want to hide π_x(DB) = d_x
- Information functions:
  - want to reveal f(q, DB) for queries q
- Here: an explicit definition of the private functions.
  - The question: which information functions may be allowed?
- Different from crypto (secure function evaluation):
  - There, one wants to reveal f() (an explicit definition of the information function)
  - and to hide all functions π() not computable from f()
  - An implicit definition of the private functions
  - The question whether f() should be revealed is not asked
7 A Simplistic Model: Statistical Database (SDB)
[Diagram: a database d ∈ {0,1}^n; a query is a subset q ⊆ [n]; the answer is the sum of the bits indexed by q.]
8 Approaches to SDB Privacy
- Studied extensively since the 70s
- Perturbation
  - Add randomness. Give noisy or approximate answers.
  - Techniques:
    - Data perturbation (perturb the data, then answer queries as usual) [Reiss 84] [Liew, Choi, Liew 85] [Traub, Yemini, Wozniakowski 84]
    - Output perturbation (perturb the answers to queries) [Denning 80] [Beck 80] [Achugbue, Chin 79] [Fellegi, Phillips 74]
  - Recent interest: [Agrawal, Srikant 00] [Agrawal, Aggarwal 01], ...
- Query Restriction
  - Answer queries accurately, but sometimes disallow queries
  - Require queries to obey some structure [Dobkin, Jones, Lipton 79]
  - Restrict the number of queries
  - Auditing [Chin, Ozsoyoglu 82] [Kleinberg, Papadimitriou, Raghavan 01]
9 Some Recent Privacy Definitions
- X = data, Y = (noisy) observation of X
- [Agrawal, Srikant 00]: interval of confidence
  - Let Y = X + noise (e.g., uniform noise in [-100, 100])
  - Perturb the input data. One can still estimate the underlying distribution.
  - Tradeoff: more noise → less accuracy, but more privacy.
  - Intuition: a large possible interval → privacy is preserved
  - Given Y, we know that with c% confidence X is in [a1, a2]. For example, for Y = 200, with 50% confidence X is in [150, 250].
  - a2 - a1 defines the amount of privacy at c% confidence
- Problem: there might be a-priori information about X
  - X = someone's age, and Y = -97
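A minimal sketch of the interval-of-confidence idea (the function names and the flat-prior assumption are mine, not from [AS 00]):

```python
import random

def as_randomize(x, w=100):
    """[AS 00]-style input perturbation: release Y = X + Uniform(-w, w)."""
    return x + random.uniform(-w, w)

def interval_at_confidence(y, w=100, c=0.5):
    """Interval containing X with confidence c given Y = X + U(-w, w):
    absent prior knowledge, X is uniform over [y-w, y+w], so a centered
    c-confidence interval has width 2*w*c."""
    return (y - w * c, y + w * c)

# The slide's example: Y = 200 -> with 50% confidence X is in [150, 250].
print(interval_at_confidence(200, w=100, c=0.5))   # (150.0, 250.0)
```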
10 The AS Scheme Can Be Turned Against Itself
- Assume that the number of data points is large
- Even if the data miner doesn't have a-priori information about X, it can estimate the distribution of X from the randomized data Y
- Suppose the perturbation is uniform in [-1, 1]
  - [AS] privacy: an interval of size 2 with 100% confidence
- Let the estimated density be f_X = 50% on x ∈ [0,1] and 50% on x ∈ [4,5]
- But after learning f_X, the value of X can easily be localized within an interval of size at most 1
- Problem: aggregate information provides information that can be used to attack individual data
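A sketch of this attack (the two-cluster density and the [-1, 1] noise are from the slide; the sampling harness is mine):

```python
import random

def sample_x():
    """X is 50% uniform on [0,1] and 50% uniform on [4,5] (the slide's f_X)."""
    return random.uniform(0, 1) if random.random() < 0.5 else random.uniform(4, 5)

def localize(y):
    """Given Y = X + U(-1,1): Y < 2.5 forces X in [0,1], else X in [4,5].
    Intersecting that cluster with [y-1, y+1] leaves an interval of
    length at most 1, despite the nominal privacy interval of size 2."""
    lo, hi = (0, 1) if y < 2.5 else (4, 5)
    return max(lo, y - 1), min(hi, y + 1)

for _ in range(1000):
    x = sample_x()
    y = x + random.uniform(-1, 1)
    a, b = localize(y)
    assert a <= x <= b and b - a <= 1
```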
11 Some Recent Privacy Definitions
- X = data, Y = (noisy) observation of X
- [Agrawal, Aggarwal 01]: mutual information
  - Intuition: high entropy is good. I(X;Y) = H(X) - H(X|Y) (mutual information)
  - Small I(X;Y) → privacy is preserved (Y provides little information about X)
- Problem [EGS]:
  - This is an average notion. Privacy loss can happen with low but significant probability, without noticeably affecting I(X;Y).
  - Sometimes I(X;Y) seems good but privacy is breached
12 Output Perturbation (Randomization Approach)
- Exact answer to query q:
  - a_q = Σ_{i∈q} d_i
- Actual SDB answer: â_q
- Perturbation E:
  - For all q: |â_q - a_q| ≤ E
- Questions:
  - Does perturbation give any privacy?
  - How much perturbation is needed for privacy?
  - Usability?
13 Privacy Preserved by Perturbation E ≈ √n
- Database: d ∈_R {0,1}^n (uniform input distribution!)
- Algorithm: on query q,
  1. Let a_q = Σ_{i∈q} d_i
  2. If |a_q - |q|/2| < E, return â_q = |q|/2
  3. Otherwise return â_q = a_q
- E ≈ √n·(lg n)² → privacy is preserved:
  - Assume poly(n) queries
  - If E ≈ √n·(lg n)², w.h.p. rule 2 is always used
  - No information about d is given!
  - (but the database is completely useless)
- Shows that a perturbation of E ≈ √n is sometimes enough for privacy. Can we do better? (The rule is transcribed in code below.)
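The answering rule above, transcribed as a sketch (the concrete E is the slide's √n·(lg n)²):

```python
import math, random

def make_db(n):
    """d chosen uniformly from {0,1}^n, as the slide assumes."""
    return [random.randint(0, 1) for _ in range(n)]

def answer(d, q):
    """Slide 13's rule: if the true sum is within E of |q|/2, hide it
    behind |q|/2; otherwise answer exactly. For E ~ sqrt(n)*(lg n)^2 and
    poly(n) queries, w.h.p. the hiding branch always fires."""
    n = len(d)
    E = math.sqrt(n) * math.log2(n) ** 2
    a_q = sum(d[i] for i in q)
    return len(q) / 2 if abs(a_q - len(q) / 2) < E else a_q
```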
14 Perturbation E << √n Implies No Privacy
- The previous useless database achieves the best possible perturbation.
- Theorem [Dinur-Nissim]: for any database d and any DB response algorithm with perturbation E = o(√n), there is a poly-time reconstruction algorithm that outputs a database d' s.t. dist(d, d') = o(n).
- This is a strong breaking of privacy.
15 The Adversary as a Decoding Algorithm
16 Proof of Theorem [DN03]: the Adversary's Reconstruction Algorithm
- Query phase: get â_{q_j} for t random subsets q_1, ..., q_t
- Weeding phase: solve the following linear program over ℝ (see the sketch below):
  - 0 ≤ x_i ≤ 1
  - |Σ_{i∈q_j} x_i - â_{q_j}| ≤ E for all j
- Rounding: let c_i = round(x_i); output c
- Observation: a solution always exists, e.g., x = d.
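A sketch of the weeding and rounding phases; using scipy's LP solver is my choice, not the paper's:

```python
import numpy as np
from scipy.optimize import linprog

def reconstruct(n, queries, answers, E):
    """Dinur-Nissim-style reconstruction (a sketch).
    queries: list of index subsets q_j; answers: perturbed sums for them.
    Finds x with 0 <= x_i <= 1 and |sum_{i in q_j} x_i - answer_j| <= E,
    then rounds each coordinate. The LP is feasible since x = d solves it."""
    A = np.zeros((len(queries), n))
    for j, q in enumerate(queries):
        A[j, list(q)] = 1.0
    b = np.asarray(answers, dtype=float)
    # Feasibility via two one-sided constraints: A x <= b + E, -A x <= -(b - E)
    A_ub = np.vstack([A, -A])
    b_ub = np.concatenate([b + E, -(b - E)])
    res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, 1)] * n, method="highs")
    return np.round(res.x).astype(int)
```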
17 Why Does the Reconstruction Algorithm Work?
- Consider any x ∈ {0,1}^n s.t. dist(x, d) = cn = Ω(n)
- Observation:
  - A random q hits a constant fraction of the cn coordinates in which x ≠ d
  - On these coordinates the sums of x and of d differ, with constant probability, by at least Ω(√n) (> E = o(√n))
  - Such a q disqualifies x as a solution of the LP
- Since the total number of queries q is polynomial, all such vectors x are disqualified with overwhelming probability.
18 Summary of Results (statistical database)
- [Dinur, Nissim 03]:
  - Unlimited adversary: perturbation of magnitude Ω(n) is required
  - Polynomial-time adversary: perturbation of magnitude Ω(√n) is required (shown above)
  - In both cases the adversary may reconstruct a good approximation of the database
    - This disallows even very weak notions of privacy
- Bounded adversary, restricted to T << n queries (SuLQ):
  - There is a privacy-preserving access mechanism with perturbation ≈ √T (<< √n)
  - A chance for usability
  - A reasonable model, as databases grow larger and larger
19 SuLQ for a Multi-Attribute Statistical Database (SDB)
- Query (q, f): q ⊆ [n], f: {0,1}^k → {0,1}
- Answer: a_{q,f} = Σ_{i∈q} f(d_i)
- Database: {d_{i,j}}, n rows of k binary attributes
- Row distribution: D = (D1, D2, ..., Dn)
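The query model in code (a minimal sketch; representing rows as k-tuples of bits is my choice):

```python
def answer(db, q, f):
    """Multi-attribute SDB query (q, f): q selects row indices, f maps a
    k-bit row to {0,1}; the answer is the subset-sum of f over q."""
    return sum(f(db[i]) for i in q)

# Example: count rows satisfying "attribute 0 AND attribute 2".
db = [(1, 0, 1), (0, 1, 1), (1, 1, 1)]
print(answer(db, {0, 1, 2}, lambda row: row[0] & row[2]))   # -> 2
```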
20 Privacy and Usability Concerns for the Multi-Attribute Model [DN]
- Rich set of queries: subset sums over any property of the k attributes
- This obviously increases usability, but how is privacy affected?
- There is more to protect: functions of the k attributes
- Relevant factors:
  - What is the adversary's goal?
  - Row dependency
  - Vertically split data (between k or fewer databases):
    - Can privacy still be maintained with independently operating databases?
21 Privacy Definition - Intuition
- 3-phase adversary:
  - Phase 0: defines a target set G of poly(n) functions g: {0,1}^k → {0,1}
    - It will try to learn some of this information about someone
  - Phase 1: adaptively queries the database T = o(n) times
  - Phase 2: chooses an index i of a row it intends to attack, and a function g ∈ G
- The attack: given d_{-i} (all rows except row i), try to guess g(d_{i,1}, ..., d_{i,k})
22 The Privacy Definition
- p^0_{i,g}: the a-priori probability that g(d_{i,1}, ..., d_{i,k}) = 1
- p^T_{i,g}: the a-posteriori probability that g(d_{i,1}, ..., d_{i,k}) = 1,
  given the answers to the T queries and d_{-i}
- Define conf(p) = log₂(p / (1-p)) (base 2, matching the examples; see the snippet below)
  - 1-1 relationship between p and conf(p)
  - conf(1/2) = 0; conf(2/3) = 1; conf(1) = ∞
- Δconf_{i,g} = conf(p^T_{i,g}) - conf(p^0_{i,g})
- (ε, T)-privacy (relative privacy):
  - for all distributions D1, ..., Dn, every row i, every function g, and any adversary making at most T queries,
  - Pr[Δconf_{i,g} > ε] = neg(n)
23 The SuLQ Database
- The adversary is restricted to T << n queries
- On query (q, f):
  - q ⊆ [n]
  - f: {0,1}^k → {0,1} (a binary function)
  - Let a_{q,f} = Σ_{i∈q} f(d_{i,1}, ..., d_{i,k})
  - Let N be binomial noise of magnitude ≈ √T
  - Return a_{q,f} + N
- SuLQ = Sub-Linear Queries
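A sketch of the mechanism; reading the slide's garbled "N ∈ Binomial(0, √T)" as a centered binomial with standard deviation on the order of √T is my assumption:

```python
import random

def sulq_answer(db, q, f, T):
    """Subset-sum of f over the selected rows, plus centered binomial
    noise: Binomial(T, 1/2) - T/2, whose standard deviation is sqrt(T)/2."""
    a = sum(f(db[i]) for i in q)
    noise = sum(random.randint(0, 1) for _ in range(T)) - T / 2
    return a + noise
```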
24 Privacy Analysis of the SuLQ Database
- p^m_{i,g}: the a-posteriori probability that g(d_{i,1}, ..., d_{i,k}) = 1,
  given d_{-i} and the answers to the first m queries
- conf(p^m_{i,g}) describes a random walk on the line, with:
  - starting point conf(p^0_{i,g})
  - compromise when conf(p^m_{i,g}) - conf(p^0_{i,g}) > ε
- W.h.p. more than T steps are needed to reach compromise
[Diagram: a random walk starting at conf(p^0_{i,g}), with the compromise threshold at conf(p^0_{i,g}) + ε.]
25 Usability: One Multi-Attribute SuLQ DB
- Statistics of any property f of the k attributes
  - I.e., for what fraction of the (sub)population does f(d_1, ..., d_k) hold?
  - Easy: just put f in the query
- Other applications:
  - k independent multi-attribute SuLQ DBs
  - Vertically partitioned SuLQ DBs
  - Testing whether Pr[β|α] > Pr[β]
- Caveat: we hide g() about a specific row (not about multiple rows)
26 Overview of Methods
- Input Perturbation
- Output Perturbation
- Query Restriction
27 Query Restriction
- The decision whether to answer or deny a query:
  - can be based on the content of the query and on the answers to previous queries,
  - or can be based on the above plus the content of the database
28 Auditing
- [AW89] classify auditing as a query restriction method:
  - Auditing of an SDB involves keeping up-to-date logs of all queries made by each user (not the data involved), and constantly checking for possible compromise whenever a new query is issued
- Partial motivation: auditing may allow more queries to be posed, as long as no privacy threat arises
- Early work: [Hofmann 1977] [Schlorer 1976] [Chin, Ozsoyoglu 1981, 1986]
- Recent interest: [Kleinberg, Papadimitriou, Raghavan 2000] [Li, Wang, Wang, Jajodia 2002] [Jonsson, Krokhin 2003]
29 How Auditors May Inadvertently Compromise Privacy
30 The Setting
- Statistical database: a dataset d = {d_1, ..., d_n}
  - Entries d_i: real, integer, or Boolean
- Query q = (f, i_1, ..., i_k)
  - f ∈ {min, max, median, sum, average, count}
- Bad users will try to breach the privacy of individuals
- Compromise: uniquely determining some d_i (a very weak definition)
31 Auditing
[Diagram: the user submits a new query q_{i+1}; the auditor, keeping the query log q_1, ..., q_i, either replies "Here's the answer" or "Query denied" (when the answer would cause privacy loss).]
32 Example 1: Sum/Max Auditing
d_i real; sum/max queries; privacy is breached if some d_i is learned.
- q1 = sum(d1, d2, d3). Answer: 15
- q2 = max(d1, d2, d3). Denied (the answer would cause privacy loss)
- But there must be a reason for the denial: answering q2 breaches privacy exactly when the max is 5, since a max of 5 together with a sum of 15 forces d1 = d2 = d3 = 5. So q2 is denied iff d1 = d2 = d3 = 5. I win!
33 Sounds Familiar?
David Duncan, former auditor for Enron and partner in Andersen:
"Mr. Chairman, I would like to answer the committee's questions, but on the advice of my counsel I respectfully decline to answer the question based on the protection afforded me under the Constitution of the United States."
34 Max Auditing
d_i real.
- q1 = max(d1, d2, d3, d4). Answer: M1234
- q2 = max(d1, d2, d3). Answer M123, or denied; if denied, then d4 = M1234
- q3 = max(d1, d2). Answer M12, or denied; if denied, then d3 = M123
- The attacker learns an item with probability ½ (see the sketch below)
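A sketch of this attack; the `query_max` oracle (returning None on a denial) is a hypothetical interface:

```python
def max_attack(query_max):
    """query_max(k) asks max(d1..dk) and returns it, or None on denial.
    A denial of max(d1..dk) right after learning M = max(d1..d_{k+1})
    reveals d_{k+1} = M. Returns (index, value) on success, else None."""
    M = query_max(4)          # max(d1..d4); answering this is safe
    for k in (3, 2):
        m = query_max(k)
        if m is None:         # denied: d_{k+1} is the unique maximum M
            return k + 1, M
        M = m
    return None               # the attack learned nothing this time
```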
35 Boolean Auditing?
- d_i Boolean; each query (apparently the sum of two adjacent elements) is answered with 1, or denied
- q_i is denied iff d_i = d_{i+1} (an exact answer of 0 or 2 would reveal both bits) → the denial pattern reveals the database or its complement
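A sketch of the reconstruction from the denial pattern alone (the interface is hypothetical):

```python
def recover_up_to_complement(denied):
    """denied[i] is True iff the query sum(d_{i+1}, d_{i+2}) was denied.
    Denial means the two bits are equal, so the pattern determines the
    whole database up to a global complement."""
    d = [0]                           # guess d1 = 0; the complement covers d1 = 1
    for eq in denied:
        d.append(d[-1] if eq else 1 - d[-1])
    return d
```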
36 The Problem
- Query denials leak (potentially sensitive) information
- Users cannot decide denials by themselves
[Diagram: within the set of possible assignments to d_1, ..., d_n, the assignments consistent with (q_1, ..., q_i, a_1, ..., a_i) shrink further when q_{i+1} is denied.]
37 Solution to the Problem: Simulatable Auditing
- An auditor is simulatable if there exists a simulator that makes the same deny/answer decisions given only the queries and the previous answers, without access to the database.
- Simulation → denials do not leak information
38 Why Do Simulatable Auditors Not Leak Information?
[Diagram: the set of possible assignments to d_1, ..., d_n is not shrunk by a denial, since the denial decision does not depend on the data.]
40 Query Restriction for Sum Queries
- Given:
  - a dataset D = {x_1, ..., x_n}, x_i ∈ ℝ
  - a subset S of D; the query asks for Σ_{x_i∈S} x_i
- Is it possible to compromise D?
  - Here compromise means uniquely determining some x_i from the queries
- Compromise is trivial if subsets may be arbitrarily small: sum(x_9) = x_9
41 Query Set Size Control
- Do not permit queries that involve a small subset of the database.
- Compromise is still possible:
  - Want to discover x: sum(x, y_1, ..., y_k) - sum(y_1, ..., y_k) = x (a "tracker"; see the sketch below)
- Issue: overlap
  - In general, restricting overlap alone is not enough; the number of queries must also be restricted
  - Note that an overlap restriction itself sometimes bounds the number of queries (e.g., query size ≥ cn and constant overlap allow only about 1/c queries)
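The tracker from the bullet above as a two-query sketch; `sum_query` is a hypothetical interface:

```python
def tracker(sum_query, target, padding):
    """Defeats query-set-size control: pad the target with k other records,
    then subtract their sum:  sum(x, y1..yk) - sum(y1..yk) = x."""
    return sum_query([target] + padding) - sum_query(padding)
```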
42 Restricting Set-Sum Queries
- Restrict sum queries based on:
  - the number of database elements in the sum
  - the overlap with previous sum queries
  - the total number of queries
- Note that these criteria are known to the user:
  - They do not depend on the contents of the database
  - Therefore the user can simulate the deny/answer decision of the DB
  - This is simulatable auditing (see the sketch below)
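A sketch of such a simulatable decision rule (the thresholds are illustrative parameters, not values from the slides):

```python
def deny(new_q, past_qs, min_size, max_overlap, max_queries):
    """Deny decision that depends only on the query sequence, never on
    the data, so any user can reproduce it. Queries are sets of indices:
    enforce a minimum query size, a maximum pairwise overlap, and a cap
    on the total number of queries."""
    if len(new_q) < min_size:
        return True
    if any(len(new_q & q) > max_overlap for q in past_qs):
        return True
    return len(past_qs) + 1 > max_queries
```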
43 Restricting Overlap and Number of Queries
- Assume:
  - each query has size |Q_i| ≥ k
  - each pairwise overlap satisfies |Q_i ∩ Q_j| ≤ r
  - the adversary knows at most L values a-priori, with L + 1 < k
- Claim: the data cannot be compromised with fewer than 1 + (2k - L)/r sum queries.
44 Overlap + Number of Queries
- Claim [Dobkin, Jones, Lipton] [Reiss]: the data cannot be compromised with fewer than 1 + (2k - L)/r sum queries
  - k ≤ query size, r ≥ overlap, L = number of a-priori known items
- Suppose x_c is compromised after t queries, where each query is represented by Q_i = x_{i1} + x_{i2} + ... + x_{ik}, for i = 1, ..., t
- This implies x_c = Σ_{i=1..t} α_i·Q_i = Σ_{i=1..t} α_i·Σ_{j=1..k} x_{ij} for some coefficients α_i
- Let η_{iℓ} = 1 if x_ℓ appears in query i, and 0 otherwise. Then
  x_c = Σ_{i=1..t} α_i·Σ_{ℓ=1..n} η_{iℓ}·x_ℓ = Σ_{ℓ=1..n} (Σ_{i=1..t} α_i·η_{iℓ})·x_ℓ
45 Overlap + Number of Queries
- We have x_c = Σ_{ℓ=1..n} (Σ_{i=1..t} α_i·η_{iℓ})·x_ℓ
- In this sum, the coefficient Σ_{i=1..t} α_i·η_{iℓ} must be 0 for every x_ℓ except x_c (in order for x_c to be compromised)
- For a given ℓ this happens iff η_{iℓ} = 0 for all i, or if η_{iℓ} = η_{jℓ} = 1 with α_i and α_j of opposite signs
  - or α_i = 0, in which case the i-th query didn't matter
46 Overlap + Number of Queries
- W.l.o.g. the first query contains x_c, and the second query appears with the opposite sign.
- The first query probes k elements
- The second query adds at least k - r new elements
- Elements from the first and the second query cannot be canceled within the same additional query (that would require opposite signs simultaneously)
- Therefore each new query cancels items from the first query or from the second, but not from both
- 2k - r - L elements need to be canceled, and each additional query cancels at most r of them (by the overlap bound)
- Hence at least 2 + (2k - r - L)/r = 1 + (2k - L)/r queries are needed.
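A quick numeric check of the bound, under hypothetical parameters:

```python
# k = minimum query size, r = maximum overlap, L = a-priori known items.
k, r, L = 100, 10, 20
print(1 + (2 * k - L) / r)   # -> 19.0: at least 19 sum queries to compromise
```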
47 Notes
- The number of queries satisfying |Q_i| ≥ k and |Q_i ∩ Q_j| ≤ r is small:
  - If k = n/c for some constant c and r = const, then there are only about c queries in which no two overlap by more than one element. Hence the query sequence may be uncomfortably short.
  - Alternatively, if r = k/c (the overlap is a constant fraction of the query size), then the bound on the number of queries, 1 + (2k - L)/r, is O(c).
48 Conclusions
- Privacy should be defined and analyzed rigorously
  - In particular, assuming that randomization → privacy is dangerous
- High perturbation is needed for privacy against polynomial adversaries
  - Threshold phenomenon: above √n there is total privacy, below √n there is no privacy (for a poly-time adversary)
  - Main tool: a reconstruction algorithm
- Careless auditing might leak private information
- Self-auditing (simulatable auditors) is safe
  - The decision whether to allow a query is based on previous good queries and their answers
  - Without access to the DB contents
  - Users may apply the decision procedure by themselves
49 To Do
- Come up with a good model and requirements for database privacy
- Learn from crypto:
  - Protect against more general loss of privacy
- Simulatable auditors are a starting point for designing more reasonable audit mechanisms
50 References
- Course web page: "A Study of Perturbation Techniques for Data Privacy", Cynthia Dwork, Nina Mishra, and Kobbi Nissim, http://theory.stanford.edu/~nmishra/cs369-2004.html
- Privacy and Databases: http://theory.stanford.edu/~rajeev/privacy.html
51 Foundations of CS at the Weizmann Institute
- Uri Feige
- Oded Goldreich
- Shafi Goldwasser
- David Harel
- Moni Naor
- David Peleg
- Amir Pnueli
- Ran Raz
- Omer Reingold
- Adi Shamir
(Yellow = crypto)
- All students receive a fellowship
- Language of instruction: English