Title: I'm in the Database, But Nobody Knows
1I'm in the Database, But Nobody Knows
- Cynthia Dwork, Microsoft Research
2Many Threats to Privacy of Electronic Data
- Theft
- Phishing
- Viruses
- Cryptanalysis
- Changing Privacy Policies
3This Talk: Privacy-Preserving Data Analysis
- First Tier Motivating Examples
- Analysis of Census Data, Medical Outcomes Data, GWAS data, Epidemiology, Analysis of Vehicle Braking Records
- Second Tier Examples
- Training an advertising classifier, Recommendation System, Netflix Challenge
4Pure Privacy Problem
- Difficult Even if
- Curator is Angel
- Data are in Vault
5Typical Suggestions
- Large Set Queries
- How many MSFT employees have Sickle Cell Trait (CST)?
- How many MSFT employees who are not female Distinguished Scientists with very curly hair have the CST?
- Add Random Noise to True Answer
- Average of responses to repeated queries converges to true answer
- Can't simply detect repetition (undecidable)
- Detect When Answering is Unsafe
- Refusal can be disclosive
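The failure modes of the first two suggestions can be sketched in a few lines of Python. The employee table, the `is_target` flag, and the noise scale are hypothetical and chosen only for illustration: two large-set counts whose difference isolates one person, and a noisy answer whose average over repeated queries converges to the true count.

```python
import random
import statistics

# Hypothetical toy table (not real data): (name, has_trait, is_target).
# "is_target" marks the single person excluded by the second query.
employees = [("emp%d" % i, i % 7 == 0, False) for i in range(1000)]
employees.append(("target", True, True))

def count(predicate):
    """Exact counting query over the toy table."""
    return sum(1 for row in employees if predicate(row))

# Differencing: both queries cover large sets, yet their difference
# reveals the target's trait bit.
everyone = count(lambda r: r[1])
all_but_target = count(lambda r: r[1] and not r[2])
target_has_trait = (everyone - all_but_target) == 1

def noisy_count(predicate, rng, scale=10.0):
    """True count plus fresh zero-mean Gaussian noise on every call."""
    return count(predicate) + rng.gauss(0.0, scale)

# Averaging: fresh independent noise cancels out over repeated queries.
rng = random.Random(0)
answers = [noisy_count(lambda r: r[1], rng) for _ in range(10000)]
estimate = statistics.mean(answers)
```

With 10,000 repetitions the standard error of `estimate` is about 0.1, so the noise no longer hides the exact count.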
6A Litany
7AOL Search History Release (2006)
Name: Thelma Arnold. Age: 62. Widow. Residence: Lilburn, GA
8William Weld's Medical Record [Sweeney02]
- HMO data only: ethnicity, visit date, diagnosis, procedure, medication, total charge
- Voter registration data only: name, address, date reg., party affiliation, last voted
- In both: ZIP, birth date, sex
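The linkage behind this slide amounts to a join on the attributes the two data sets share. The sketch below uses invented records and a hypothetical `link` helper; it is not Sweeney's code, only an illustration of matching on ZIP, birth date, and sex.

```python
# Hypothetical toy records; every value below is invented for illustration.
hmo_records = [
    {"zip": "00001", "birth_date": "1950-01-01", "sex": "M", "diagnosis": "diagnosis A"},
    {"zip": "00002", "birth_date": "1962-05-17", "sex": "F", "diagnosis": "diagnosis B"},
    {"zip": "00001", "birth_date": "1971-11-30", "sex": "F", "diagnosis": "diagnosis C"},
]
voter_records = [
    {"name": "Voter One", "zip": "00001", "birth_date": "1950-01-01", "sex": "M"},
    {"name": "Voter Two", "zip": "00002", "birth_date": "1962-05-17", "sex": "F"},
    {"name": "Voter Three", "zip": "00003", "birth_date": "1980-03-09", "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")

def link(medical, voters, keys=QUASI_IDENTIFIERS):
    """Join two tables on the shared quasi-identifiers.

    Returns (name, diagnosis) pairs for medical rows matching exactly one voter.
    """
    index = {}
    for v in voters:
        index.setdefault(tuple(v[k] for k in keys), []).append(v)
    matches = []
    for m in medical:
        candidates = index.get(tuple(m[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((candidates[0]["name"], m["diagnosis"]))
    return matches

reidentified = link(hmo_records, voter_records)
```

Neither table contains both a name and a diagnosis, yet the join produces such pairs for every record whose quasi-identifier combination is unique.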
10GWAS Membership [Homer et al. 08]
- SNP: Single Nucleotide (A, C, G, T) Polymorphism
- Reference Population: Major Allele (C) 94%, Minor Allele (T) 6%
- Genome-Wide Association Study: Allele frequencies for many thousands of SNPs
11Anonymized Social Networks [BackstromDK07]
- Magic Step
- Isolate lightly linked-in subgraphs from rest of graph
- Special structure of subgraph permits finding A, B
[Figure: graph with nodes labeled A, S, B, J]
12Definitional Failures
- Failure to Cope with Auxiliary Information
- Existing and future databases, newspaper reports, Flickr, literature, etc.
- Definitions are Syntactic and Ad Hoc
- Dalenius's Ad Omnia Guarantee (1977)
- Anything that can be learned about a respondent from the statistical database can be learned without access to the database
- Unachievable
13Parable: How Tall is Pamela Jones (Groklaw)?
- An Admittedly Unreasonable Impossibility Proof
- Database teaches average heights of population subgroups
- PJ is 2 inches shorter than avg Swedish woman
- PJ's height learnable with the DB, not learnable without.
- PJ loses privacy whether or not she is in the database
- Suggests new notion of privacy: risk incurred by joining DB
- The outcome of any analysis is essentially equally likely, independent of whether any individual joins or refrains from joining the dataset. (The likelihood is over the choices made by the algorithm.)
14Differential Privacy [Dwork-McSherry-Nissim-Smith 2006]
M gives $\varepsilon$-differential privacy if for all adjacent $D_1$ and $D_2$, and all $C \subseteq \mathrm{range}(M)$:
$\Pr[M(D_1) \in C] \le e^{\varepsilon} \Pr[M(D_2) \in C]$
Neutralizes all linkage attacks. Composes unconditionally and automatically: $\sum_i \varepsilon_i$
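One standard way to satisfy this definition for a counting query is the Laplace mechanism. The slide does not name a mechanism, so the sketch below is an illustration under that assumption. The function names, the toy databases, and the value of `eps` are hypothetical.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(rows, predicate, epsilon, rng):
    """Counting query with Laplace noise of scale 1/epsilon.

    Adding or removing one row changes a count by at most 1 (sensitivity 1),
    so this noise scale gives epsilon-differential privacy for the count.
    """
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon, rng)

def output_density(x, true_count, epsilon):
    """Density of the noisy answer at x when the exact count is true_count."""
    return (epsilon / 2.0) * math.exp(-epsilon * abs(x - true_count))

rng = random.Random(0)
d1 = [True] * 40 + [False] * 60   # hypothetical database
d2 = d1 + [True]                  # adjacent database: one extra row
eps = 0.5
a1 = private_count(d1, bool, eps, rng)
a2 = private_count(d2, bool, eps, rng)
```

For any output `x`, `output_density(x, 40, eps)` and `output_density(x, 41, eps)` differ by a factor of at most `math.exp(eps)`, which is the inequality in the definition applied pointwise to the two adjacent databases.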
15Differential Privacy
- Resilience to All Auxiliary Information
- Past, present, future data sources and algorithms
- Low-error high-privacy DP techniques exist for many problems
- datamining tasks (association rules, decision trees, clustering, ...), contingency tables, histograms, synthetic data sets for query logs, machine learning (boosting, statistical queries learning model, SVMs, logistic regression), various statistical estimators, network trace analysis, recommendation systems, ...
- Programming Platforms
- http://research.microsoft.com/en-us/projects/PINQ/
- http://userweb.cs.utexas.edu/shmat/shmat_nsdi10.pdf
Download today!
18Can we store and share your information with
health officials and researchers? This
information can be very helpful in monitoring
regional health conditions, plan flu response,
and conduct health research. By allowing the
responses to the survey questions to be used for
public health, education and research purposes,
you can help your community.
19Snow 1854
Cholera cases
Suspected pump
20https://h1n1.cloudapp.net/Privacy.aspx
Microsoft may also disclose information if
required to do so by law or in the good faith
belief that such action is necessary to: (a)
conform to the edicts of the law or comply with
legal process served on Microsoft or the Site;
(b) protect and defend the rights or property of
Microsoft and our family of Web sites; or (c) act
in urgent circumstances to protect the personal
safety of users of Microsoft products or members
of the public.
21Mission Creep
Think of the children!
Never store the data!
22Pan-Private Streaming Algorithms [DNPRY10]
- Private inside and out
- Completely hide the pattern of appearances of any individual
- Presence, absence, frequency, etc.
- Protect against mission creep, subpoena, intrusion
23DiffeP Limitations and Challenges
- Can't study outliers
- Privacy erosion over multiple analyses is cumulative
- Privacy erosion over multiple databases is cumulative
- Compare real world to one in which my info is everywhere deleted, looking at a lifetime of exposure against worst-case adversary/information/collection of databases
- Formally capture reasonable worlds?
- What are the right questions to ask about social networks?
- Removing one person can affect data of many other people
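The cumulative erosion in the bullets above is often handled by tracking a running total of the epsilon values, following the additive composition stated with the definition. The `PrivacyBudget` class below is a hypothetical sketch of such bookkeeping, not an interface from the talk or from any particular platform.

```python
class PrivacyBudget:
    """Tracks cumulative epsilon under basic (additive) composition."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        """Record one analysis; refuse if it would exceed the total budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        return self.total_epsilon - self.spent

# Four analyses at epsilon = 0.25 each use up a total budget of 1.0.
budget = PrivacyBudget(total_epsilon=1.0)
remaining = [budget.charge(0.25) for _ in range(4)]
```

The refusal here depends only on the epsilon values already charged, not on the data, so refusing does not itself disclose anything about individuals. Accounting across multiple databases would require all of them to charge one shared budget, which this sketch does not attempt.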
24Utility Implies Exposure to Harm
- Database teaches that smoking causes cancer.
- Smoker S's insurance premiums rise.
- But learning that smoking causes cancer is the whole point.
- Smoker S enrolls in a smoking cessation program.
- May be fine for First-Tier Uses, but what about Second Tier?
- Who decides, and how?
25Pause
- Ad Omnia definition of privacy
- Composes automatically and obliviously, independent of aux info
- Achievable, frequently with great accuracy
- Usable
- Can program using a privacy-preserving interface
- Many questions remain
- Is there a weaker ad omnia definition than differential privacy that also composes automatically?
26Which Ad(s) Am I Charged For?
27More Subtle Attack
- A potentially embarrassing interest
In a Long Leg Cast Anna models her cast on a stoop, wiggling/rubbing her casted toes. Length: 90 minutes
- User clicks on ad targeted to wealthy, whale-loving LLC enthusiasts ⇒ user is wealthy, whale-loving LLC enthusiast
- User understands that interest in LLC is communicated to seller, but does not understand that wealth and love of whales are also communicated
- Should privacy of these attributes be protected? What about race?
- Fairness: should poor children see different ads than rich children?
28Wall Street Journal 4/4/2010
- User visits capitalone.com
- Capital One uses tracking information via tracking network [x+1] to personalize offers
- Danger: Steering minorities into higher rates
- In principle, law seems to allow credit outcomes based on browsing history, which may encode race
- How can an ad network prevent steering?
- What is fairness in classification, and how can we achieve it?
29Work in Progress
D., Fiat, Hardt, Pitassi, Reingold, Zemel
30Thank You!