Title: Defining and Achieving Differential Privacy
1 Defining and Achieving Differential Privacy
2 Meaningful Privacy Guarantees
- Statistical databases
- Medical
- Government Agency
- Social Science
- Searching / click stream
- Learn non-trivial trends while protecting privacy
of individuals and fine-grained structure
3 Linkage Attacks
- Using innocuous data in one dataset to identify a record in a different dataset containing both innocuous and sensitive data
- At the heart of the voluminous research on hiding small cell counts in tabular data
5 The Netflix Prize
- Netflix recommends movies to its subscribers
- Offers $1,000,000 for a 10% improvement in its recommendation system
- Not concerned here with how this is measured
- Publishes training data
- Nearly 500,000 records, 18,000 movie titles
- "The ratings are on a scale from 1 to 5 (integral) stars. To protect customer privacy, all personal information identifying individual customers has been removed and all customer ids have been replaced by randomly-assigned ids. The date of each rating and the title and year of release for each movie are provided."
- Some ratings not sensitive, some may be sensitive
- OK for Netflix to know, not OK for public to know
6 A Publicly Available Set of Movie Rankings
- Internet Movie Database (IMDb)
- Individuals may register for an account and rate movies
- Need not be anonymous
- Visible material includes ratings, dates, comments
- By definition, these ratings are not sensitive
7 The Fiction of Non-PII [Narayanan & Shmatikov 2006]
- Movie ratings and dates are PII
- With 8 movie ratings (of which we allow 2 to be completely wrong) and dates that may have a 3-day error, 96% of the Netflix subscribers whose records have been released can be uniquely identified in the dataset.
- Linkage attack prosecuted using the IMDb
- Link ratings in IMDb to (non-sensitive) ratings in Netflix, revealing sensitive ratings in Netflix
- N&S draw conclusions about the user
- May be wrong, may be right; the user is harmed either way.
8 What Went Wrong?
- What is Personally Identifiable Information?
- Typically syntactic, not semantic
- E.g., a genome sequence is not considered PII ??
- Suppressing PII doesn't rule out linkage attacks
- Famously observed by Sweeney, circa 1998
- AOL debacle
- Need a more semantic approach to privacy
9 Semantic Security Against an Eavesdropper [Goldwasser & Micali 1982]
- Vocabulary
- Plaintext: the message to be transmitted
- Ciphertext: the encryption of the plaintext
- Auxiliary information: anything else known to the attacker
- The ciphertext leaks no information about the plaintext.
- Formalization
- Compare the ability of someone seeing aux and ciphertext to guess (anything about) the plaintext, to the ability of someone seeing only aux to do the same thing. The difference should be tiny.
10 Semantic Security for Statistical Databases?
- Dalenius, 1977
- Anything that can be learned about a respondent from the statistical database can be learned without access to the database
- An ad omnia guarantee
- Happily, formalizes to semantic security
- Recall: anything about the plaintext that can be learned from the ciphertext can be learned without the ciphertext
- Popular intuition: prior and posterior views about an individual shouldn't change too much
- Clearly silly
- My (incorrect) prior is that everyone has 2 left feet.
- Very popular in the literature nevertheless
- Definitional awkwardness even when used correctly
11 Semantic Security for Statistical Databases?
- Unhappily, unachievable
- Can't achieve cryptographically small levels of "tiny"
- Intuition: the (adversarial) user is supposed to learn unpredictable things about the DB, which translates to learning more than a cryptographically tiny amount about a respondent
- Relax "tiny"?
12 Relaxed Semantic Security for Statistical Databases?
- Relaxing tininess doesn't help
- Dwork & Naor 2006
- Database teaches average heights of population subgroups
- "Terry Gross is two inches shorter than the average Lithuanian woman"
- Access to the DB teaches Terry's height
- Terry's height learnable from the DB, not learnable otherwise
- Formal proof extends to essentially any notion of privacy compromise; uses randomness extracted from the SDB as a one-time pad
- Bad news for k-, l-, m-, etc.
- Attack works even if Terry is not in the DB!
- Suggests a new notion of privacy: the risk incurred by joining the DB
- Differential Privacy
- Privacy, when existence of the DB is stipulated
- Before/after interacting vs. risk when in / not in the DB
13 Differential Privacy
- K gives ε-differential privacy if for all values of DB and Me and all transcripts t:
Pr[ K(DB + Me) = t ] / Pr[ K(DB − Me) = t ] ≤ e^ε ≈ 1 + ε
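Stated in more standard notation (a hedged restatement, not taken verbatim from the slides): for any two databases D and D' that differ in the data of a single individual, and any set S of possible transcripts,

\Pr[\, K(D) \in S \,] \;\le\; e^{\varepsilon} \cdot \Pr[\, K(D') \in S \,]

so no single person's presence or absence changes the probability of any outcome by more than a factor of e^ε.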
14 Differential Privacy is an Ad Omnia Guarantee
- No perceptible risk is incurred by joining the DB.
- Anything the adversary can do to me, it could do without Me (my data).
15 An Interactive Sanitizer K [Dwork, McSherry, Nissim, Smith 2006]
(Figure: the analyst's query f is passed to the curator K, which consults the DB and returns the true answer plus noise.)
f: DB → R
K(f, DB) = f(DB) + Noise
E.g., Count(P, DB) = # rows in DB with property P
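A minimal Python sketch of such an interactive sanitizer for counting queries. The choice of Laplace noise with scale 1/ε is an assumption here, foreshadowing the calibration argument on the following slides; the function and variable names are my own illustration.

import numpy as np

def count(db, has_property):
    # f(DB): number of rows in DB with property P
    return sum(1 for row in db if has_property(row))

def K(db, has_property, epsilon, rng=None):
    # K(f, DB) = f(DB) + noise; a count has sensitivity 1, so Laplace
    # noise with scale 1/epsilon is enough (see the calibration slides).
    rng = rng or np.random.default_rng()
    return count(db, has_property) + rng.laplace(scale=1.0 / epsilon)

# Noisy answer to "how many rows have age >= 40?" with epsilon = 0.1
db = [{"age": 34}, {"age": 51}, {"age": 47}, {"age": 29}]
print(K(db, lambda row: row["age"] >= 40, epsilon=0.1))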
16 Sensitivity of a Function f
- How much can f(DB + Me) exceed f(DB − Me)?
- Recall K(f, DB) = f(DB) + noise
- The question asks: what difference must the noise obscure?
- Δf = max over DB, Me of | f(DB + Me) − f(DB − Me) |
- E.g., ΔCount = 1
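As a small illustration (my own, not from the slides), one can measure how much a query changes when a single record is removed from one particular database; the slide's Δf is the maximum of this quantity over all databases and all individuals, not just one database.

def change_from_one_record(f, db):
    # Largest change in f(db) when any single record ("Me") is removed.
    # This is the local version of the slide's Δf, evaluated on one db.
    full = f(db)
    return max(abs(full - f(db[:i] + db[i + 1:])) for i in range(len(db)))

ages = [34, 51, 47, 29]
print(change_from_one_record(len, ages))   # a count: changes by exactly 1
print(change_from_one_record(sum, ages))   # a sum: changes by up to 51 here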
17 Calibrate Noise to Sensitivity
Δf = max over DB, Me of | f(DB + Me) − f(DB − Me) |
Theorem: To achieve ε-differential privacy, use scaled symmetric (Laplace) noise with density proportional to exp(−|x|/R), where R = Δf/ε.
(Figure: the Laplace noise density, plotted over multiples of R from −4R to 5R.)
Pr[x] is proportional to exp(−|x|/R). Increasing R flattens the curve: more privacy. The noise depends on f and ε, not on the database.
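A small numerical check (my own illustration, assuming a count query of sensitivity 1) that Laplace noise with R = Δf/ε meets the e^ε ratio bound from the definition:

import numpy as np

def laplace_density(x, R):
    # Density of the symmetric noise: proportional to exp(-|x|/R)
    return np.exp(-abs(x) / R) / (2 * R)

epsilon, sensitivity = 0.5, 1.0      # a count query: delta_f = 1
R = sensitivity / epsilon            # theorem: R = delta_f / epsilon
# If the true counts on DB + Me and DB - Me are 43 and 42, then for every
# output t the densities differ by a factor of at most exp(epsilon).
for t in [40.0, 42.5, 45.0]:
    ratio = laplace_density(t - 43, R) / laplace_density(t - 42, R)
    assert ratio <= np.exp(epsilon) + 1e-12
    print(t, ratio)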
19 Multiple Queries
- For a query sequence f1, ..., fd, ε-privacy is achieved with noise generation parameter Ri ≥ Σj Δfj / ε for each response.
- Can sometimes do better.
- Noise must increase with the sensitivity of the query sequence. Naively, more queries means noisier answers.
- Dinur and Nissim 2003, et sequelae
- Speaks to the non-interactive setting
- Any non-interactive solution permitting too-accurate answers to too many questions is vulnerable to attack.
- The privacy mechanism is at an even greater disadvantage than in the interactive case; this can be exploited.
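A minimal sketch (my own illustration) of answering a fixed sequence of counting queries under one overall budget ε, with every response's noise scaled to the total sensitivity of the sequence so that the per-response privacy losses sum to at most ε:

import numpy as np

def answer_sequence(db, queries, sensitivities, epsilon, rng=None):
    # Each response uses Laplace scale R = (sum of sensitivities) / epsilon,
    # so the losses delta_f_i / R across responses add up to at most epsilon.
    rng = rng or np.random.default_rng()
    R = sum(sensitivities) / epsilon
    return [q(db) + rng.laplace(scale=R) for q in queries]

ages = [34, 51, 47, 29]
queries = [lambda d: sum(a >= 40 for a in d),   # count of ages >= 40
           lambda d: sum(a < 30 for a in d)]    # count of ages < 30
print(answer_sequence(ages, queries, sensitivities=[1, 1], epsilon=0.5))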
20 Future Work
- Investigate techniques from robust statistics
- An area of statistics devoted to coping with:
- Small amounts of wild data entry errors
- Rounding errors
- Limited dependence among samples
- Problem: the statistical setting makes strong assumptions about the existence and nature of an underlying distribution
- Differential privacy for social networks, graphs
- What are the utility questions of interest?
- Definitional and algorithmic work for other settings
- Differential approach more broadly useful
- Several results discussed in the next few hours
- Porous boundary between inside and outside?
- Outsourcing, bug reporting, combating D-DoS attacks and terror
21 "Privacy is a natural resource. It's non-renewable, and it's not yours. Conserve it."