1
Privacy Preserving Data Mining, Lecture 3: Non-Cryptographic Approaches for Preserving Privacy
(Based on Slides of Kobbi Nissim)
Benny Pinkas, HP Labs, Israel
2
Why not use cryptographic methods?
  • Many users contribute data. Cannot require them
    to participate in a cryptographic protocol.
  • In particular, cannot require p2p communication
    between users.
  • Cryptographic protocols incur considerable
    overhead.

3
Data Privacy
[Diagram: users access the data only through an access mechanism; the question is whether this access can breach privacy.]
4
Easy, Tempting Solution (a Bad Solution)
Idea: a. Remove identifying information (name, SSN, ...); b. Publish the data.
  • But harmless attributes uniquely identify many patients (gender, age, approximate weight, ethnicity, marital status, ...)
  • Recall: DOB + gender + zip code identify people whp.
  • Worse: rare attributes (e.g. a disease with probability ≈ 1/3000)

5
What is Privacy?
  • Something should not be computable from query answers
  • E.g. π_Joe = Joe's private data
  • The definition should take into account the adversary's power (computational power, # of queries, prior knowledge, ...)
  • Quite often it is much easier to say what is surely non-private
  • E.g. strong breaking of privacy: the adversary is able to retrieve (almost) everybody's private data

Intuition: privacy is breached if it is possible to compute someone's private information from his identity.
6
The Data Privacy Game: an Information-Privacy Tradeoff
  • Private functions:
  • want to hide π_x(DB) = d_x
  • Information functions:
  • want to reveal f(q, DB) for queries q
  • Here: explicit definition of private functions.
  • The question: which information functions may be allowed?
  • Different from crypto (secure function evaluation):
  • There, want to reveal f() (explicit definition of the information function)
  • want to hide all functions π() not computable from f()
  • Implicit definition of private functions
  • The question whether f() should be revealed is not asked
7
A Simplistic Model: Statistical Database (SDB)
[Diagram: the database holds n bits d_1,...,d_n; a query is a subset q ⊆ [n]; the answer is the sum of the bits indexed by q.]
8
Approaches to SDB Privacy
  • Studied extensively since the 70s
  • Perturbation
  • Add randomness. Give noisy or approximate answers
  • Techniques:
  • Data perturbation (perturb the data and then answer queries as usual): Reiss 84; Liew, Choi, Liew 85; Traub, Yemini, Wozniakowski 84
  • Output perturbation (perturb answers to queries): Denning 80; Beck 80; Achugbue, Chin 79; Fellegi, Phillips 74
  • Recent interest: Agrawal, Srikant 00; Agrawal, Aggarwal 01; ...
  • Query Restriction
  • Answer queries accurately but sometimes disallow queries
  • Require queries to obey some structure: Dobkin, Jones, Lipton 79
  • Restrict the number of queries
  • Auditing: Chin, Ozsoyoglu 82; Kleinberg, Papadimitriou, Raghavan 01

9
Some Recent Privacy Definitions
  • X = data, Y = (noisy) observation of X
  • Agrawal, Srikant 00: interval of confidence
  • Let Y = X + noise (e.g. uniform noise in [-100,100]).
  • Perturb the input data. Can still estimate the underlying distribution.
  • Tradeoff: more noise ⇒ less accuracy but more privacy.
  • Intuition: large possible interval ⇒ privacy preserved
  • Given Y, we know that with c% confidence X is in [a1,a2]. For example, for Y = 200, with 50% confidence X is in [150,250] (see the sketch below).
  • a2 - a1 defines the amount of privacy at c% confidence
  • Problem: there might be some a-priori information about X
  • X = someone's age, Y = -97 (since age ≥ 0, X is localized to a much smaller interval)
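A minimal numerical sketch of this interval-of-confidence notion (the age range, sample size, and the single-record printout are illustrative assumptions, not from the slides):

```python
import numpy as np

# Perturb data with uniform noise in [-100, 100] and report a 50%-confidence
# interval for X given an observed Y (as in the Y = 200 -> [150, 250] example).
rng = np.random.default_rng(0)
ages = rng.integers(20, 80, size=10_000)            # hypothetical true data X
noise = rng.uniform(-100, 100, size=ages.shape)     # uniform perturbation
released = ages + noise                             # published values Y = X + noise

y = released[0]
# X is certainly in [Y-100, Y+100]; the middle half of that range is a
# 50%-confidence interval of width 100, i.e. [Y-50, Y+50].
print(f"Y = {y:.1f}; 50%-confidence interval for X: [{y - 50:.1f}, {y + 50:.1f}]")
```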

10
The AS scheme can be turned against itself
  • Assume that N is large
  • Even if the data-miner doesn't have a-priori information about X, it can estimate it given the randomized data Y.
  • The perturbation is uniform in [-1,1]
  • AS privacy: an interval of width 2 with 100% confidence
  • Let f_X(x) = 0.5 for x ∈ [0,1], and 0.5 for x ∈ [4,5].
  • But after learning f_X, the value of X can easily be localized within an interval of size at most 1 (see the sketch below).
  • Problem: aggregate information provides information that can be used to attack individual data
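A small sketch of this attack, assuming the bimodal density from the slide (half the mass on [0,1], half on [4,5]) and uniform noise in [-1,1]:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
cluster = rng.integers(0, 2, size=n)            # 0 -> X in [0,1], 1 -> X in [4,5]
x = rng.uniform(0, 1, size=n) + 4 * cluster     # true data, density f_X from the slide
y = x + rng.uniform(-1, 1, size=n)              # released values, noise uniform in [-1,1]

# If X is in [0,1] then Y <= 2; if X is in [4,5] then Y >= 3. The two ranges do not
# overlap, so every Y reveals which interval of length 1 contains X, despite the
# nominal "interval of width 2 at 100% confidence".
guessed_cluster = (y >= 3).astype(int)
print("fraction of rows localized to an interval of size 1:",
      np.mean(guessed_cluster == cluster))      # prints 1.0
```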

11
Some Recent Privacy Definitions
  • X = data, Y = (noisy) observation of X
  • Agrawal, Aggarwal 01: mutual information
  • Intuition:
  • High entropy is good. I(X;Y) = H(X) - H(X|Y) (mutual information)
  • Small I(X;Y) ⇒ privacy preserved (Y provides little information about X).
  • Problem [EGS]:
  • An average notion. Privacy loss can happen with low but significant probability, without affecting I(X;Y).
  • Sometimes I(X;Y) seems good but privacy is breached

12
Output Perturbation (Randomization Approach)
  • Exact answer to query q:
  • a_q = Σ_{i∈q} d_i
  • Actual SDB answer: â_q
  • Perturbation E:
  • For all q: |â_q - a_q| ≤ E
  • Questions:
  • Does perturbation give any privacy?
  • How much perturbation is needed for privacy?
  • Usability

13
Privacy Preserved by Perturbation E ≈ √n
  • Database d ∈_R {0,1}^n (uniform input distribution!)
  • Algorithm: on query q,
  • Let a_q = Σ_{i∈q} d_i
  • If |a_q - |q|/2| < E, return â_q = |q|/2
  • Otherwise return â_q = a_q
  • E ≥ √n·(lg n)² ⇒ privacy is preserved
  • Assume poly(n) queries
  • If E ≥ √n·(lg n)², whp the answer is always |q|/2
  • No information about d is given!
  • (but the database is completely useless)
  • Shows that a perturbation of magnitude ≈ √n is sometimes enough for privacy. Can we do better? (A sketch of this mechanism follows below.)
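A sketch of this mechanism on a random database (the perturbation bound E = √n·(lg n)² is taken from the slide; the remaining parameters are illustrative assumptions):

```python
import math
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
d = rng.integers(0, 2, size=n)              # database d in {0,1}^n, uniform
E = math.sqrt(n) * math.log2(n) ** 2        # allowed perturbation

def answer(q):
    """q is a boolean mask selecting the queried subset of rows."""
    a_q = int(d[q].sum())
    half = q.sum() / 2
    # Return |q|/2 when the true answer is within E of it, otherwise the true answer.
    return half if abs(a_q - half) < E else a_q

q = rng.integers(0, 2, size=n).astype(bool)  # a random subset query
print("true answer:", int(d[q].sum()), "released answer:", answer(q))
# For E >= sqrt(n)*(lg n)^2 the released answer is |q|/2 whp, revealing nothing about d.
```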

14
Perturbation E << √n Implies no Privacy
  • The previous useless database achieves the best possible perturbation.
  • Theorem [Dinur-Nissim]: Given any DB and any DB response algorithm with perturbation E = o(√n), there is a poly-time reconstruction algorithm that outputs a database d', s.t. dist(d,d') ≤ o(n).

strong breaking of privacy
15
The Adversary as a Decoding Algorithm

16
Proof of Theorem [DN03]: The Adversary's Reconstruction Algorithm
  • Query phase: get â_{q_j} for t random subsets q_1,...,q_t
  • Weeding phase: solve the linear program (over ℝ):
  • 0 ≤ x_i ≤ 1
  • |Σ_{i∈q_j} x_i - â_{q_j}| ≤ E
  • Rounding: let c_i = round(x_i); output c

Observation: a solution always exists, e.g. x = d. (A small-scale sketch of this attack follows below.)
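A small-scale sketch of the reconstruction (the parameters n, t, E and the use of scipy's LP solver are illustrative assumptions; with E well below √n it recovers most entries):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
n, t, E = 64, 640, 2                                # database size, #queries, perturbation << sqrt(n)
d = rng.integers(0, 2, size=n)                      # secret database
Q = rng.integers(0, 2, size=(t, n))                 # random subset queries as 0/1 masks
answers = Q @ d + rng.integers(-E, E + 1, size=t)   # perturbed answers, |error| <= E

# Weeding phase: find any x with 0 <= x_i <= 1 and |sum_{i in q_j} x_i - a_j| <= E.
A_ub = np.vstack([Q, -Q])
b_ub = np.concatenate([answers + E, -(answers - E)])
res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * n, method="highs")

# Rounding phase: round each coordinate and compare with the secret database.
c = np.rint(res.x).astype(int)
print("fraction of entries recovered:", np.mean(c == d))
```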
17
Why does the Reconstruction Algorithm Work?
  • Consider x ∈ {0,1}^n s.t. dist(x,d) = cn = Ω(n)
  • Observation:
  • A random q contains Θ(cn) coordinates on which x ≠ d
  • The difference in the sum over these coordinates is, with constant probability, at least Ω(√n) (> E = o(√n)).
  • Such a q disqualifies x as a solution of the LP
  • Since polynomially many random queries are asked, and each disqualifies a far-away x with constant probability, all such vectors x are disqualified with overwhelming probability.

18
Summary of Results (statistical database)
  • Dinur, Nissim 03:
  • Unlimited adversary:
  • perturbation of magnitude Ω(n) is required
  • Polynomial-time adversary:
  • perturbation of magnitude Ω(√n) is required (shown above)
  • In both cases, with any smaller perturbation the adversary may reconstruct a good approximation of the database
  • This disallows even very weak notions of privacy
  • Bounded adversary, restricted to T << n queries (SuLQ):
  • There is a privacy-preserving access mechanism with perturbation ≈ √T, i.e. << √n
  • A chance for usability
  • A reasonable model as databases grow larger and larger

19
SuLQ for a Multi-Attribute Statistical Database (SDB)
Query (q, f): q ⊆ [n], f: {0,1}^k → {0,1}
Answer: a_{q,f} = Σ_{i∈q} f(d_i)
Database entries: d_{i,j}
Row distribution: D = (D1, D2, ..., Dn)
20
Privacy and Usability Concerns for the Multi-Attribute Model [DN]
  • Rich set of queries: subset sums over any property of the k attributes
  • Obviously increases usability, but how is privacy affected?
  • More to protect: functions of the k attributes
  • Relevant factors:
  • What is the adversary's goal?
  • Row dependency
  • Vertically split data (between k or fewer databases):
  • Can privacy still be maintained with independently operating databases?

21
Privacy Definition - Intuition
  • A 3-phase adversary:
  • Phase 0: defines a target set G of poly(n) functions g: {0,1}^k → {0,1}
  • It will try to learn some of this information about someone
  • Phase 1: adaptively queries the database T = o(n) times
  • Phase 2: chooses an index i of a row it intends to attack and a function g ∈ G
  • The attack:
  • given d_{-i} (all rows except row i),
  • try to guess g(d_{i,1},...,d_{i,k})

22
The Privacy Definition
  • p⁰_{i,g} = a-priori probability that g(d_{i,1},...,d_{i,k}) = 1
  • p^T_{i,g} = a-posteriori probability that g(d_{i,1},...,d_{i,k}) = 1,
  • given the answers to the T queries and d_{-i}
  • Define conf(p) = log(p / (1-p))  (see the snippet below)
  • 1-1 relationship between p and conf(p)
  • conf(1/2) = 0, conf(2/3) = 1, conf(1) = ∞
  • Δconf_{i,g} = conf(p^T_{i,g}) - conf(p⁰_{i,g})
  • (ε,T)-privacy (relative privacy):
  • For all distributions D1,...,Dn, all rows i, all functions g, and any adversary making at most T queries:
  • Pr[Δconf_{i,g} > ε] = neg(n)
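The confidence function written out (a base-2 logarithm, which matches conf(2/3) = 1 on the slide):

```python
import math

def conf(p):
    # conf(p) = log2(p / (1 - p)); conf(1/2) = 0, conf(2/3) = 1, conf(p) -> inf as p -> 1.
    return math.log2(p / (1 - p))

print(conf(1 / 2), conf(2 / 3))   # 0.0 1.0
```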

23
The SuLQ Database
  • Adversary restricted to T << n queries
  • On query (q, f):
  • q ⊆ [n]
  • f: {0,1}^k → {0,1} (a binary function)
  • Let a_{q,f} = Σ_{i∈q} f(d_{i,1},...,d_{i,k})
  • Let N be binomial noise centered at 0, of magnitude ≈ √T
  • Return a_{q,f} + N

SuLQ = Sub-Linear Queries. (A minimal sketch of a noisy SuLQ answer follows below.)
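A minimal sketch of such a noisy answer. The exact noise distribution on the slide is unclear, so a centered binomial with standard deviation √T is used here as an assumption; the toy parameters and the example property f are also assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, T = 10_000, 3, 100                       # rows, attributes, query budget T << n
db = rng.integers(0, 2, size=(n, k))           # multi-attribute database d_{i,j}

def sulq_answer(q, f):
    """q: boolean row mask; f: property {0,1}^k -> {0,1}, applied row by row."""
    a_qf = int(sum(f(row) for row in db[q]))
    noise = int(rng.binomial(4 * T, 0.5)) - 2 * T   # centered binomial, std = sqrt(T)
    return a_qf + noise

q = rng.integers(0, 2, size=n).astype(bool)                 # subset query
f = lambda row: int(row[0] == 1 and row[2] == 0)            # some property of the k attributes
print("noisy SuLQ answer:", sulq_answer(q, f))
```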
24
Privacy Analysis of the SuLQ Database
  • p^m_{i,g} = a-posteriori probability that g(d_{i,1},...,d_{i,k}) = 1,
  • given d_{-i} and the answers to the first m queries
  • conf(p^m_{i,g}) describes a random walk on the line, with
  • starting point conf(p⁰_{i,g})
  • Compromise: conf(p^m_{i,g}) - conf(p⁰_{i,g}) > ε
  • W.h.p. more than T steps are needed to reach a compromise

[Figure: random walk on the confidence line, starting at conf(p⁰_{i,g}), with the compromise threshold at conf(p⁰_{i,g}) + ε.]
25
Usability: One Multi-Attribute SuLQ DB
  • Statistics of any property f of the k attributes
  • I.e., for what fraction of the (sub)population does f(d_1,...,d_k) hold?
  • Easy: just put f in the query
  • Other applications:
  • k independent multi-attribute SuLQ DBs
  • Vertically partitioned SuLQ DBs
  • Testing whether Pr[α|β] > Pr[α]
  • Caveat: we hide g() about a specific row (not about multiple rows)

26
Overview of Methods
  • Input Perturbation
  • Output Perturbation
  • Query Restriction

27
Query restriction
  • The decision whether to answer or deny a query:
  • can be based on the content of the query and on the answers to previous queries,
  • or can be based on the above and on the contents of the database

28
Auditing
  • [AW89] classifies auditing as a query restriction method:
  • Auditing of an SDB involves keeping up-to-date logs of all queries made by each user (not the data involved) and constantly checking for possible compromise whenever a new query is issued.
  • Partial motivation: may allow more queries to be posed, if no privacy threat occurs.
  • Early work: Hofmann 1977; Schlorer 1976; Chin, Ozsoyoglu 1981, 1986
  • Recent interest: Kleinberg, Papadimitriou, Raghavan 2000; Li, Wang, Wang, Jajodia 2002; Jonsson, Krokhin 2003

29
How Auditors may Inadvertently Compromise Privacy
30
The Setting
Statistical database
  • Dataset: d = (d_1,...,d_n)
  • Entries d_i: real, integer, or Boolean
  • Query: q = (f, i_1,...,i_k)
  • f ∈ {Min, Max, Median, Sum, Average, Count}
  • Bad users will try to breach the privacy of individuals
  • Compromise = uniquely determine some d_i (a very weak definition)

31
Auditing
[Diagram: the user sends a new query q_{i+1}; the auditor, holding the query log q_1,...,q_i, replies either "Here's the answer" or "Query denied" (as the answer would cause privacy loss).]
32
Example 1: Sum/Max Auditing
d_i real; sum/max queries; privacy is breached if some d_i is learned
q1 = sum(d1,d2,d3)
Answer: sum(d1,d2,d3) = 15
q2 = max(d1,d2,d3)
Denied (the answer would cause privacy loss)
There must be a reason for the denial...
q2 is denied iff d1 = d2 = d3 = 5. I win!
Oh well...
33
Sounds Familiar?
David Duncan, Former auditor for Enron and
partner in Andersen
Mr. Chairman, I would like to answer the
committee's questions, but on the advice of my
counsel I respectfully decline to answer the
question based on the protection afforded me
under the Constitution of the United States.
34
Max Auditing
d_i real
q1 = max(d1,d2,d3,d4) → answered: M1234
q2 = max(d1,d2,d3) → M123 / denied
If denied: d4 = M1234
q3 = max(d1,d2) → M12 / denied
If denied: d3 = M123
The attacker learns an item with probability ½ (see the sketch below)
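A sketch of the first step of this attack against a hypothetical value-based max auditor (the auditor below denies exactly when answering would pin down a value, which is the behaviour the slide exploits):

```python
import random

def value_based_max_auditor(db, idx, known_max_all):
    """Answer max over db[idx], but deny if the answer, combined with the
    previously released maximum over all entries, would determine some d_i."""
    m = max(db[i] for i in idx)
    # If max(d1,d2,d3) < M1234 then d4 must equal M1234, so answering compromises d4.
    return ("denied", None) if m < known_max_all else ("answered", m)

random.seed(5)
db = [random.random() for _ in range(4)]
m1234 = max(db)                                       # q1 = max(d1,d2,d3,d4), answered
status, m123 = value_based_max_auditor(db, [0, 1, 2], m1234)
if status == "denied":
    print("q2 denied, hence d4 =", m1234)             # the denial itself reveals d4
else:
    print("q2 answered:", m123, "- recurse with max(d1,d2) as on the slide")
```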
35
Boolean Auditing?
d_i Boolean
Sum queries on pairs of adjacent entries: answer 1 / denied
q_i is denied iff d_i = d_{i+1} ⇒ the user learns the database or its complement
36
The Problem
  • The problem:
  • Query denials leak (potentially sensitive) information
  • Users cannot decide denials by themselves

[Figure: within the set of possible assignments to d_1,...,d_n, the denial of q_{i+1} reveals that the true database lies in a particular subset of the assignments consistent with (q_1,...,q_i, a_1,...,a_i).]
37
Solution to the problem: Simulatable Auditing
  • An auditor is simulatable if there exists a simulator that, from the queries and answers alone (without access to the database), produces the same deny/answer decisions.
Simulation ⇒ denials do not leak information
38
Why Simulatable Auditors do not Leak Information?
[Figure: possible assignments to d_1,...,d_n.]
39
  • Simulatable auditing

40
Query Restriction for Sum Queries
  • Given:
  • a dataset D = x_1,...,x_n, with x_i ∈ ℝ
  • a subset S of the dataset; query: Σ_{x_i∈S} x_i
  • Is it possible to compromise D?
  • Here compromise means uniquely determining some x_i from the queries
  • Compromise is possible if query subsets can be arbitrarily small:
  • sum(x_9) = x_9

41
Query Set Size Control
  • Do not permit queries that involve a small subset of the database.
  • Compromise is still possible:
  • want to discover x:
  • sum(x, y1,..,yk) - sum(y1,..,yk) = x (see the snippet below)
  • Issue: overlap
  • In general, restricting overlap is not enough.
  • Need to also restrict the number of queries
  • Note that restricting overlap itself sometimes restricts the number of queries (e.g. if query size = cn and overlap = O(1), there are only about 1/c possible queries)
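A toy illustration of this overlap-based attack (the values are made up):

```python
# Two large queries that differ only in x isolate x exactly.
data = {"x": 7, "y1": 3, "y2": 5, "y3": 2}                 # hypothetical dataset
q1 = data["x"] + data["y1"] + data["y2"] + data["y3"]      # sum(x, y1, ..., yk)
q2 = data["y1"] + data["y2"] + data["y3"]                  # sum(y1, ..., yk)
print("recovered x =", q1 - q2)                            # prints 7
```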

42
Restricting Set-Sum Queries
  • Restricting the sum queries based on
  • Number of database elements in the sum
  • Overlap with previous sum queries
  • Total number of queries
  • Note that the criteria are known to the user
  • They do not depend on the contents of the
    database
  • Therefore, the user can simulate the
    denial/no-denial answer given by the DB
  • Simulatable auditing

43
Restricting Overlap and Number of Queries
  • Assume:
  • each query satisfies |Q_i| ≥ k
  • any two queries satisfy |Q_i ∩ Q_j| ≤ r
  • the adversary knows at most L values a-priori, L + 1 < k
  • Claim: the data cannot be compromised with fewer than 1 + (2k-L)/r sum queries.

44
Overlap & Number of Queries
  • Claim: the data cannot be compromised with fewer than 1 + (2k-L)/r sum queries [Dobkin, Jones, Lipton; Reiss]
  • k ≤ query size, r ≥ overlap, L = # of a-priori known items
  • Suppose x_c is compromised after t queries, where each query is represented by
  • Q_i = x_{i_1} + x_{i_2} + ... + x_{i_k}, for i = 1, ..., t
  • This implies that
  • x_c = Σ_{i=1..t} α_i Q_i = Σ_{i=1..t} α_i Σ_{j=1..k} x_{i_j}
  • Let η_{iℓ} = 1 if x_ℓ is in query i, and 0 otherwise
  • x_c = Σ_{i=1..t} α_i Σ_{ℓ=1..n} η_{iℓ} x_ℓ = Σ_{ℓ=1..n} (Σ_{i=1..t} α_i η_{iℓ}) x_ℓ (written out in LaTeX below)
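The chain of equalities on this slide, written out (with η_{iℓ} the indicator defined above):

```latex
x_c \;=\; \sum_{i=1}^{t} \alpha_i Q_i
    \;=\; \sum_{i=1}^{t} \alpha_i \sum_{j=1}^{k} x_{i_j}
    \;=\; \sum_{i=1}^{t} \alpha_i \sum_{\ell=1}^{n} \eta_{i\ell}\, x_\ell
    \;=\; \sum_{\ell=1}^{n} \Big( \sum_{i=1}^{t} \alpha_i \eta_{i\ell} \Big) x_\ell,
\qquad
\eta_{i\ell} =
\begin{cases}
1 & \text{if } x_\ell \text{ appears in query } i,\\
0 & \text{otherwise.}
\end{cases}
```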

45
Overlap & Number of Queries
  • We have:
  • x_c = Σ_{ℓ=1..n} (Σ_{i=1..t} α_i η_{iℓ}) x_ℓ
  • In the above sum, (Σ_{i=1..t} α_i η_{iℓ}) must be 0 for every x_ℓ except x_c (in order for x_c to be compromised)
  • This happens iff η_{iℓ} = 0 for all i, or if η_{iℓ} = η_{jℓ} = 1 and α_i, α_j have opposite signs,
  • or α_i = 0, in which case the i-th query didn't matter

46
Overlap & Number of Queries
  • WLOG, the first query contains x_c, and the second query appears with the opposite sign.
  • In the first query, k elements are probed
  • The second query adds at least k - r new elements
  • Elements from the first and second queries cannot be canceled within the same (additional) query, since that would require opposite signs
  • Therefore each new query cancels items from the first or from the second query, but not from both.
  • Need to cancel 2k - r - L elements.
  • Need 2 + (2k-r-L)/r queries in total, i.e. 1 + (2k-L)/r.

47
Notes
  • The number of queries satisfying |Q_i| ≥ k and |Q_i ∩ Q_j| ≤ r is small
  • If k = n/c for some constant c and r = O(1), then there are only about c queries in which no two overlap by more than r.
  • Hence, the allowed query sequence may be uncomfortably short.
  • Or, if r = k/c (the overlap is a constant fraction of the query size), then the number of queries, 1 + (2k-L)/r, is O(c).

48
Conclusions
  • Privacy should be defined and analyzed rigorously
  • In particular, assuming that randomization ⇒ privacy is dangerous
  • High perturbation is needed for privacy against polynomial-time adversaries
  • Threshold phenomenon: above √n perturbation, total privacy; below √n, no privacy (for a poly-time adversary)
  • Main tool: a reconstruction algorithm
  • Careless auditing might leak private information
  • Self auditing (simulatable auditors) is safe:
  • the decision whether to allow a query is based on previous good queries and their answers,
  • without access to the DB contents
  • Users may apply the decision procedure by themselves

49
ToDo
  • Come up with a good model and requirements for database privacy
  • Learn from crypto:
  • Protect against more general loss of privacy
  • Simulatable auditors are a starting point for designing more reasonable audit mechanisms

50
References
  • Course web page: A Study of Perturbation Techniques for Data Privacy, Cynthia Dwork, Nina Mishra, and Kobbi Nissim, http://theory.stanford.edu/nmishra/cs369-2004.html
  • Privacy and Databases: http://theory.stanford.edu/rajeev/privacy.html

51
Foundations of CS at the Weizmann Institute
  • Uri Feige
  • Oded Goldreich
  • Shafi Goldwasser
  • David Harel
  • Moni Naor
  • David Peleg
  • Amir Pnueli
  • Ran Raz
  • Omer Reingold
  • Adi Shamir

Yellow = crypto
  • All students receive a fellowship
  • Language of instruction: English