Title: Private Matching
1Privacy Preserving Data Mining, Lecture 1: Motivating privacy research, Introducing Crypto
Benny Pinkas, HP Labs, Israel
2Course structure
- Lecture 1
- Introduction to privacy
- Introduction to cryptography, in particular, to rigorous cryptographic analysis.
- Definitions
- Proofs of security
- Lecture 2
- Cryptographic tools for privacy preserving data mining.
- Lecture 3
- Non-cryptographic tools for privacy preserving data mining
- In particular, answer perturbation.
3Privacy-Preserving Data Mining
- Allow multiple data holders to collaborate in order to compute important information while protecting the privacy of other information.
- Security-related information
- Public health information
- Marketing information
- Advantages of privacy protection
- protection of personal information
- protection of proprietary or sensitive information
- enables collaboration between different data owners (since they may be more willing or able to collaborate if they need not reveal their information)
- compliance with the law
4Privacy Preserving Data Mining
- Two papers appeared in 2000
- Privacy preserving data mining, Agrawal and Srikant, SIGMOD 2000. (statistical approach)
- Privacy preserving data mining, Lindell and Pinkas, Crypto 2000. (cryptographic approach)
- Why privacy now?
- Technological changes erode privacy: ubiquitous computing, cheap storage.
- Public awareness: health coverage, employment, personal relationships.
- Historical changes: small towns vs. cities vs. connected society.
- Privacy is a real problem that needs to be solved
5Some data privacy cases: hospital data
- Hospital data contains
- Identifying information: name, id, address
- General information: age, marital status
- Medical information
- Billing information
- Database access issues
- Your doctor should get all the information that is required to take care of you
- Emergency rooms should get all medical information that is required to take care of whoever comes there
- Billing department should only get information relevant to billing
- Problem: how to stop employees from getting information about family, neighbors, celebrities?
6Some data privacy cases: Medical Research
- Medical research
- Trying to learn patterns in the data, in aggregate form.
- Problem: how to enable learning aggregate data without revealing personal medical information?
- Hiding names is not enough, since there are many ways to uniquely identify a person
- A single hospital/medical researcher might not have enough data
- How can different organizations share research data without revealing personal data?
7Public Data
- Many public records are available in electronic form: birth records, property records, voter registration
- Your information serves as an error correcting code of your identity
- Latanya Sweeney
- Date of birth uniquely identifies 12% of the population of Cambridge, MA.
- Date of birth + gender: 29%
- Date of birth + gender + (9 digit) zip code: 95%
- Sweeney was therefore able to get her medical information from an anonymized database
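The kind of measurement behind these percentages can be sketched as follows. This is a minimal illustration, not Sweeney's actual procedure: the four records and the field names `dob`, `gender`, `zip` are invented for the example. It counts the fraction of records that are unique under a chosen combination of attributes.

```python
from collections import Counter

# Invented toy records; real studies use voter lists or census data.
records = [
    {"dob": "1970-01-01", "gender": "F", "zip": "02138"},
    {"dob": "1970-01-01", "gender": "M", "zip": "02138"},
    {"dob": "1980-05-05", "gender": "F", "zip": "02139"},
    {"dob": "1970-01-01", "gender": "F", "zip": "02139"},
]


def unique_fraction(rows, attributes):
    """Fraction of rows whose values on `attributes` appear exactly once."""
    keys = [tuple(row[a] for a in attributes) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


by_dob = unique_fraction(records, ["dob"])
by_dob_gender = unique_fraction(records, ["dob", "gender"])
by_all = unique_fraction(records, ["dob", "gender", "zip"])
```

Adding attributes can only keep or increase the unique fraction, which is why removing names alone does not anonymize a database.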
8Census data
- A trusted party (the census bureau) collects information about individuals
- Collected data
- Explicitly identifying data (names, address..)
- Implicitly identifying data (combination of several attributes)
- Private data
- The data is collected to help decision making
- Partial or aggregate data should therefore be made public
9Total Information Awareness (TIA)
- Collects information about transactions (credit card purchases, magazine subscriptions, bank deposits, flights)
- Early detection of terrorist activity
- Check a chemistry book in the library, buy something at a hardware store and something in a pharmacy
- Early detection of epidemic outbreaks
- Early symptoms of Anthrax are similar to the flu
- Check non-traditional data sources: grocery and pharmacy data, school attendance records, etc.
- Such systems are developed and used
- Could the collection of data be done in a privacy preserving manner (without learning about individuals)?
10Basic Scenarios
- Single (centralized) database, e.g., census data
- This is often a simple abstraction of a more complicated scenario, so we had better solve this one
- Need to collect data and present it in a privacy preserving way
- Published data (e.g., on a CD)
- A trusted party collects data and then publishes a sanitized version
- Users can do any computation they wish with the sanitized data
- For example, statistical tabulations.
11Basic Scenarios
- Multi database scenarios
- Two or more parties with private data want to cooperate.
- Horizontally split: Each party has a large database. Databases have the same attributes but are about different subjects. For example, the parties are banks which each have information about their customers.
- Vertically split: Each party has some information about the same set of subjects. For example, the participating parties are government agencies each with some data about every citizen.
[Figure: horizontally split data (bank 1, bank 2) and vertically split data (bank, houses, taxes) over users u1 ... un]
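The two ways of splitting data can be illustrated with a small sketch. All records, field names and values here are invented; the helper names `union_rows` and `join_columns` are hypothetical and only describe the logical database that the parties would like to compute on.

```python
# Toy illustration only: the records, field names and values are invented.

# Horizontally split: same attributes, different subjects.
bank1 = [{"user": "u1", "balance": 100}, {"user": "u2", "balance": 250}]
bank2 = [{"user": "u3", "balance": 75}, {"user": "u4", "balance": 900}]

# Vertically split: same subjects, different attributes.
bank = {"u1": {"balance": 100}, "u2": {"balance": 250}}
taxes = {"u1": {"tax_paid": 30}, "u2": {"tax_paid": 80}}


def union_rows(*tables):
    """Logical database for a horizontal split: stack the rows."""
    return [row for table in tables for row in table]


def join_columns(*tables):
    """Logical database for a vertical split: join on the subject id."""
    subjects = set.intersection(*(set(t) for t in tables))
    return {
        s: {k: v for t in tables for k, v in t[s].items()}
        for s in sorted(subjects)
    }


horizontal = union_rows(bank1, bank2)
vertical = join_columns(bank, taxes)
```

In the privacy preserving setting the parties do not actually build `horizontal` or `vertical`; the goal is to compute a function of that combined table without any party revealing its own part.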
12Issues and Tools
- Best privacy can be achieved by not giving any data, but..
- Privacy tools: cryptography [LP00]
- Encryption: data is hidden unless you have the decryption key. However, we also want to use the data.
- Secure function evaluation: two or more parties with private inputs can compute any function they wish without revealing anything else.
- Strong theory. Starts to be relevant to real applications.
- Non-cryptographic tools [AS00]
- Query restriction: prevent certain queries from being answered.
- Data/input/output perturbation: add errors to inputs to hide personal data while keeping aggregates accurate (randomization, rounding, data swapping).
- Can these be understood as well as we understand Crypto? Provide the same level of security as Crypto?
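One simple form of input perturbation is randomized response, sketched below. This is an illustrative choice, not necessarily the method of [AS00]; the parameter `p_truth`, the population size and the 30% base rate are assumptions made for the example. Each person reports a private bit truthfully with probability `p_truth` and flips it otherwise, so a single report says little about that person, yet the aggregate can still be estimated.

```python
import random


def randomize(bit, p_truth, rng):
    """Report the true bit with probability p_truth, else the flipped bit."""
    return bit if rng.random() < p_truth else 1 - bit


def estimate_fraction(reports, p_truth):
    """Invert the expected bias: E[report] = p*f + (1 - p)*(1 - f)."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth)) / (2 * p_truth - 1)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
true_bits = [1 if rng.random() < 0.3 else 0 for _ in range(100_000)]
reports = [randomize(b, 0.75, rng) for b in true_bits]

true_fraction = sum(true_bits) / len(true_bits)
estimate = estimate_fraction(reports, 0.75)
```

The estimate is accurate only for large populations, and the closer `p_truth` is to 1/2 the more privacy each individual gets and the noisier the aggregate becomes.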
13Introduction to Cryptography
14Why learn/use crypto to solve privacy issues?
- Why are we referring to crypto?
- Cryptography is one of the tools we can use for preserving privacy
- A mature research area
- many useful results/tools
- Can reflect on our thinking: how is security defined in cryptography? How should we define privacy?
15What is Cryptography?
Traditionally: how to maintain secrecy in communication
Alice and Bob talk while Eve tries to listen
[Figure: Alice and Bob communicate; Eve listens on the channel]
16History of Cryptography
- Very ancient occupation
- Up to the mid 70s: mostly classified military work
- Exceptions: Shannon, Turing
- Since then: explosive growth
- Commercial applications
- Scientific work: tight relationship with Computational Complexity Theory
- Major works: Diffie-Hellman; Rivest, Shamir and Adleman (RSA)
- Recently: more involved models for more diverse tasks.
- Scope: how to maintain the secrecy, integrity and functionality of computer and communication systems.
17Relation to computational hardness
- Cryptography uses problems that are infeasible to solve.
- Uses the intractability of some problems in order to construct secure systems.
- Feasible: computable in probabilistic polynomial time (PPT)
- Infeasible: no probabilistic polynomial time algorithm
- Usually average case hardness is needed
- For example, the discrete log problem
18The Discrete Log Problem
- Let G be a group and g an element in G.
- Given $y \in G$, let x be the minimal non-negative integer satisfying the equation $y = g^x$.
- x is called the discrete log of y to base g.
- Example: $y = g^x \bmod p$ in the multiplicative group of $\mathbb{Z}_p^*$ (p is prime). (For example, $p = 7$, $g = 3$, $y = 4$ $\Rightarrow$ $x = 4$.)
- In general, it is easy to exponentiate (using repeated squaring and the binary representation of x)
- Computing the discrete log is believed to be hard in $\mathbb{Z}_p^*$ if p is large. (E.g., p is a prime, $|p| > 768$ bits, $p = 2q + 1$ and q is also a prime.)
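The asymmetry between the two directions can be seen in a short sketch. The function names are hypothetical. `mod_exp` uses repeated squaring over the binary representation of x (Python's built-in three-argument `pow` does the same job), while `discrete_log_brute_force` simply tries every exponent, which is only feasible for tiny p such as the slide's example.

```python
def mod_exp(g, x, p):
    """Compute g**x mod p by repeated squaring (about log2(x) steps)."""
    result = 1
    base = g % p
    while x > 0:
        if x & 1:                  # current binary digit of x is 1
            result = (result * base) % p
        base = (base * base) % p   # square for the next binary digit
        x >>= 1
    return result


def discrete_log_brute_force(y, g, p):
    """Smallest x >= 0 with g**x = y (mod p), found by trying all x.

    Takes up to p - 1 steps, so it is hopeless for a 768-bit p.
    """
    value = 1
    for x in range(p - 1):
        if value == y % p:
            return x
        value = (value * g) % p
    return None


y = mod_exp(3, 4, 7)                     # the slide's example: 3^4 mod 7
x = discrete_log_brute_force(4, 3, 7)    # recovers the exponent
```

Exponentiation needs a number of steps proportional to the bit length of x, while the exhaustive search needs a number of steps proportional to p itself.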
19Encryption
- Alice wants to send a message $m \in \{0,1\}^n$ to Bob
- Set-up phase is secret
- Symmetric encryption: Alice and Bob share a secret key k
- They want to prevent Eve from learning anything about the message
[Figure: Alice sends $E_k(m)$ to Bob; Alice and Bob both hold k; Eve observes the channel]
20Public key encryption
- Alice generates a private/public key pair (SK,PK)
- Only Alice knows the secret key SK
- Everyone (even Eve) knows the public key PK, and can encrypt messages to Alice
- Only Alice can decrypt (using SK)
[Figure: Bob and Charlie each send $E_{PK}(m)$ to Alice, who holds SK; PK is public and also known to Eve]
21Rigorous Specification of Security
- To define the security of a system we must specify
- What constitutes a failure of the system
- The power of the adversary
- computational
- access to the system
- what it means to break the system.
22What does "learn" mean?
- Even if Eve has some prior knowledge of m, she should not have any advantage in
- Probability of guessing m, or probability of guessing whether m is m0 or m1, or prob. of computing any other function f of m, or even computing m
- Ideally: the message sent is independent of the message m
- Implies all the above
- Achievable: one-time pad (symmetric encryption)
- Let $r \in_R \{0,1\}^n$ be the shared key.
- Let $m \in \{0,1\}^n$
- To encrypt m send $z = r \oplus m$
- To decrypt z compute $m = z \oplus r$
- Shannon: achievable only if the entropy of the shared secret is at least as large as that of m. Therefore must use a long key.
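The one-time pad is short enough to write out in full. This is a minimal sketch working on bytes rather than single bits; the function names are hypothetical, and the key is drawn with Python's `secrets` module.

```python
import secrets


def keygen(n):
    """Shared key r: n uniformly random bytes, to be used only once."""
    return secrets.token_bytes(n)


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("key and message must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt(r, m):
    return xor_bytes(r, m)   # z = r XOR m


def decrypt(r, z):
    return xor_bytes(r, z)   # m = z XOR r


message = b"attack at dawn"
key = keygen(len(message))
ciphertext = encrypt(key, message)
recovered = decrypt(key, ciphertext)
```

The length check reflects Shannon's bound from the slide: the key must be as long as the message, and reusing it for a second message breaks the independence argument.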
23Defining security
- The power of the adversary
- Computational: probabilistic polynomial time machine (PPTM)
- Access to the system: e.g. can it change messages?
- Passive adversary, (adaptive) chosen plaintext attack, chosen ciphertext attack
- What constitutes a failure of the system?
- Recovering plaintext from ciphertext: not enough
- Allows for the leakage of partial information
- In general, hard to answer which partial information may/should not be leaked. Application dependent.
- How would partial information the adversary already holds be combined with what he learns to affect privacy?
- Better: prevent learning anything about an encrypted message
- There are two common, equivalent, definitions
24Security of Encryption, Definition 1: Indistinguishability of Encryptions
- Adversary A chooses any $X_0, X_1 \in \{0,1\}^n$
- Receives encryption of $X_b$ for $b \in_R \{0,1\}$
- Has to decide whether $b = 0$ or $b = 1$.
- For every PPTM A, choosing a pair $X_0, X_1 \in \{0,1\}^n$:
- $|\Pr[A(E(X_0)) = 1] - \Pr[A(E(X_1)) = 1]| \le \mathrm{neg}(n)$
- (Probability is over the choice of keys, randomization in the encryption and A's coins)
- Note that a proof of security must be rigorous
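The quantity in this definition can be estimated empirically for one fixed adversary. The sketch below is a toy built on assumptions: `otp_encrypt` and `leaky_encrypt` are hypothetical schemes invented for the illustration, the message pair is fixed, and `random.Random` is used for reproducibility even though it is not a cryptographic generator.

```python
import random


def otp_encrypt(message, rng):
    """One-time pad with a fresh random key for every message."""
    key = bytes(rng.randrange(256) for _ in message)
    return bytes(k ^ m for k, m in zip(key, message))


def leaky_encrypt(message, rng):
    """Broken on purpose: the last byte is sent in the clear."""
    return otp_encrypt(message[:-1], rng) + message[-1:]


# The adversary's chosen pair X0, X1 (same length, different last byte).
X0, X1 = b"\x00" * 4, b"\xff" * 4


def adversary_guess(ciphertext):
    """Output 1 if the last ciphertext byte looks like X1's last byte."""
    return 1 if ciphertext[-1] == 0xFF else 0


def estimate_advantage(encrypt, guess, trials=2000, seed=0):
    """Estimate |Pr[A(E(X0)) = 1] - Pr[A(E(X1)) = 1]| by sampling."""
    rng = random.Random(seed)
    ones = {0: 0, 1: 0}
    for b in (0, 1):
        for _ in range(trials):
            ones[b] += guess(encrypt((X0, X1)[b], rng))
    return abs(ones[1] - ones[0]) / trials


adv_leaky = estimate_advantage(leaky_encrypt, adversary_guess)
adv_otp = estimate_advantage(otp_encrypt, adversary_guess)
```

A large estimate for a single adversary is enough to show a scheme is insecure. A small estimate shows nothing by itself, since the definition quantifies over every PPTM adversary; that is why a rigorous proof is required.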
25Computational Indistinguishability
- Definition: two sequences of distributions $\{D_n\}$ and $\{D'_n\}$ on $\{0,1\}^n$ are computationally indistinguishable if
- for every polynomial p(n) and sufficiently large n, for every probabilistic polynomial time adversary A that receives input $y \in \{0,1\}^n$ and tries to decide whether y was sampled from $D_n$ or $D'_n$:
- $|\mathrm{Prob}[A = 0 \mid D_n] - \mathrm{Prob}[A = 0 \mid D'_n]| < 1/p(n)$
26Security of Encryption, Definition 2: Semantic Security
- Simulation: Whatever adversary A can compute given an encryption of $X \in \{0,1\}^n$, so can a simulator S that does not get to see the encryption of X.
- A selects a distribution $D_n$ on $\{0,1\}^n$ and a relation R(X,Y) computable in PPT (e.g. R(X,Y) = 1 iff Y is the last bit of X).
- $X \in_R D_n$ is sampled
- Given E(X), A outputs Y trying to satisfy R(X,Y)
- The simulator S does the same without access to E(X)
- Simulation is successful if A and S have the same success probability
- Successful simulation $\Rightarrow$ semantic security
27Security of Encryption (2): Semantic Security
- More formally
- For every PPTM A there is a PPTM S so that
- for all PPTM relations R
- for $X \in_R D_n$
- $|\Pr[R(X, A(E(X)))] - \Pr[R(X, S(\cdot))]|$
- is negligible.
- In other words: The outputs of A and S are indistinguishable even for a test that is aware of X.
28Which is the Right Definition?
- Semantic security seems to convey that the message is protected
- But it is usually easier to prove indistinguishability of encryptions
- Would like to argue that the two definitions are equivalent
- Must define the attack: chosen plaintext attack
- Adversary can obtain the encryption for any message it chooses, in an adaptive manner
- More severe attacks: chosen ciphertext
- The Equivalence Theorem
- A cryptosystem is semantically secure if and only if it has the indistinguishability of encryptions property
29Equivalence Proof (informal)
- Semantic security $\Rightarrow$ Indistinguishability of encryptions
- Suppose no indistinguishability
- A chooses a pair $X_0, X_1 \in \{0,1\}^n$ for which it can distinguish encryptions with non-negligible advantage $\varepsilon$
- Choose
- Distribution $D_n = \{X_0, X_1\}$
- Relation R which is equality with X
- For every S that doesn't get E(X) and outputs Y, we have
- $\mathrm{Prob}[R(X, Y)] \le 1/2$
- Given $E(X_b)$, run $A(E(X_b))$, get output $b' \in \{0,1\}$, set $Y = X_{b'}$
- Now, $\Pr[A(E(X_b)) = 1 \mid b = 1] - \Pr[A(E(X_b)) = 1 \mid b = 0] > \varepsilon$
- Therefore, $\Pr[R(X, A(E(X)))] - \Pr[R(X, S(\cdot))] > \varepsilon / 2$
30Equivalence Proof (informal)
- Indistinguishability of encryptions $\Rightarrow$ Semantic security
- Suppose no semantic security: A chooses some distribution $D_n$ and some relation R
- Choose $X_0, X_1 \in_R D_n$, choose $b \in_R \{0,1\}$, compute $E(X_b)$.
- Give $E(X_b)$ to A, ask A to compute $Y_b = A(E(X_b))$
- For $X_0, X_1 \in_R D_n$ let
- $\alpha_0 = \mathrm{Prob}[R(X_0, Y_b)]$, $\alpha_1 = \mathrm{Prob}[R(X_1, Y_b)]$
- With noticeable probability $|\alpha_0 - \alpha_1|$ is non-negligible, since otherwise $Y_b$ can be computed without the encryption.
- If $|\alpha_0 - \alpha_1|$ is non-negligible, then we can distinguish between an encryption of $X_0$ and $X_1$
31Lessons learned?
- Rigorous approach to cryptography
- Defining security
- Proving security
32References
- Books
- O. Goldreich, Foundations of Cryptography, Vol 1: Basic Tools, Cambridge, 2001
- Pseudo-randomness, zero-knowledge
- Vol 2: Basic Applications (to be available May 2004)
- Encryption, Secure Function Evaluation
- Other volumes in www.wisdom.weizmann.ac.il/oded/books.html
- Web material/courses
- S. Goldwasser and M. Bellare, Lecture Notes on Cryptography, http://www-cse.ucsd.edu/mihir/papers/gb.html
- M. Naor, 9th EWSCS, http://www.cs.ioc.ee/yik/schools/win2004/naor.php
33Secure Function Evaluation
- A major topic of cryptographic research
- How to let n parties, P1,..,Pn compute a function f(x1,..,xn)
- Where input xi is known to party Pi
- Parties learn the final output and nothing else
34The Millionaires' Problem [Yao]
[Figure: Alice holds x, Bob holds y]
Whose value is greater?
Leak no other information!
35Comparing Information without Leaking it
[Figure: Alice holds x, Bob holds y]
- Output: Is x = y?
- The following solution is insecure
- Use a one-way hash function H()
- Alice publishes H(x), Bob publishes H(y)
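The slide does not spell out why publishing hashes is insecure. One standard reason, assumed here to be the intended one, is that when x comes from a small or guessable domain, anyone who sees H(x) can hash every candidate and recover x. The sketch below uses SHA-256 as a stand-in for H; the domain size and Alice's value are invented for the example.

```python
import hashlib


def H(value):
    """Stand-in one-way hash: SHA-256 of the decimal string of value."""
    return hashlib.sha256(str(value).encode()).hexdigest()


def recover_input(published_hash, candidates):
    """Dictionary attack: hash every candidate until one matches."""
    for candidate in candidates:
        if H(candidate) == published_hash:
            return candidate
    return None


alice_x = 41_000                       # hypothetical private input
published = H(alice_x)                 # what Alice publishes
recovered = recover_input(published, range(100_000))
```

One-wayness only protects inputs with enough entropy, so the published hash leaks far more than the single bit "is x = y?" that the parties agreed to reveal.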
36Secure two-party computation - definition
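[Figure: Alice inputs x and Bob inputs y; the output is F(x,y) and nothing else. This should be as if Alice and Bob hand x and y to a trusted third party, which returns F(x,y) to each of them.]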
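The ideal solution that a secure protocol must emulate is easy to state in code. The sketch below is only the ideal world, not a secure protocol: `trusted_party` is a hypothetical name, and the two example functions correspond to the Millionaires' problem and to the equality test.

```python
def trusted_party(f, x, y):
    """Ideal world: receive both private inputs, return only F(x, y)."""
    result = f(x, y)
    return result, result   # (Alice's output, Bob's output)


def alice_is_richer(x, y):
    """Millionaires' problem: is Alice's value greater than Bob's?"""
    return x > y


def is_equal(x, y):
    """Private comparison: do the two inputs match?"""
    return x == y


alice_out, bob_out = trusted_party(alice_is_richer, 5_000_000, 3_000_000)
same_alice, same_bob = trusted_party(is_equal, 17, 42)
```

Each party sees only its own input and the returned value. A real protocol has no such party, and its security is judged by whether it reveals anything beyond what this function returns.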
37Leak no other information
- A protocol is secure if it emulates the ideal solution
- Alice learns F(x,y), and therefore can compute everything that is implied by x, her prior knowledge of y, and F(x,y).
- Alice should not be able to compute anything else
- Simulation
- A protocol is considered secure if
- For every adversary in the real world
- There exists a simulator in the ideal world, which outputs an indistinguishable transcript, given access to the information that the adversary is allowed to learn