Title: Anonymity and Privacy Issues --- re-identification
1. Anonymity and Privacy Issues --- re-identification
2. Index
- Views on Privacy of Social Media
- Overview of Re-identification
- You Are What You Say: Privacy Risks of Public Mentions, Frankowski et al., SIGIR 06
3. Improper Use of Personal Information Online
4. Top Privacy Concerns
5. Remaining Anonymous
6. True Information Provided While Registering
7. Ability to Remain Anonymous
8. Importance of Controlling Personal Information
9. Specifying Who Can View Personal Information
10. Conclusion
- Around 40% of people would like to remain anonymous on social media or social networking sites
- Most people provide their true personal information while registering
- Most people think it is important to have control of their personal information online
Re-identification techniques can identify the users of an anonymous dataset
11. Privacy Loss through Re-identification
- Re-identification: linking datasets that have explicit identifiers with datasets that lack them, through common attributes
- Datasets without explicit identifiers
  - Public data made anonymous by users
  - Public data released by research groups (after suitable anonymization)
  - Public data from government agencies (census)
These datasets contain information that people wish to keep private
12. Example of Re-identification
The Massachusetts voter registration list was purchased for only $20
87% of the U.S. population in 1990 is likely to be uniquely identified based only on ZIP code, birth date, and sex
Sweeney, 2002
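The linkage idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`zip`, `birth`, `sex`, `name`), the helper `link_records`, and all records are made up for the example and are not taken from Sweeney's study.

```python
from collections import defaultdict

def link_records(identified, anonymous, keys=("zip", "birth", "sex")):
    """Return {index of anonymous record: name} for unambiguous matches.

    Hypothetical helper: field names and record layout are assumptions.
    """
    # Index the identified dataset by its quasi-identifier tuple.
    index = defaultdict(list)
    for rec in identified:
        index[tuple(rec[k] for k in keys)].append(rec["name"])

    matches = {}
    for i, rec in enumerate(anonymous):
        candidates = index.get(tuple(rec[k] for k in keys), [])
        # Only a single candidate counts as a re-identification.
        if len(candidates) == 1:
            matches[i] = candidates[0]
    return matches

# Made-up toy records, not real data.
voters = [
    {"name": "Alice", "zip": "02138", "birth": "1950-01-02", "sex": "F"},
    {"name": "Bob", "zip": "02138", "birth": "1961-07-30", "sex": "M"},
    {"name": "Carol", "zip": "02139", "birth": "1961-07-30", "sex": "F"},
]
medical = [
    {"zip": "02138", "birth": "1961-07-30", "sex": "M", "diagnosis": "A"},
    {"zip": "02140", "birth": "1970-03-03", "sex": "F", "diagnosis": "B"},
]
linked = link_records(voters, medical)
print(linked)
```

In this toy data only the first medical record shares all three attributes with exactly one voter, so it is the only one linked to a name.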
13. The Rebus Form
The governor's medical records!
From Frankowski et al., SIGIR 06
14. Example of Face Identification
[Figure: a face recognizer matches Friendster profiles (without explicit identifiers) against Facebook profiles (with explicit identifiers). Result: identity violation!]
Gross and Acquisti, WPES 05
15. You Are What You Say: Privacy Risks of Public Mentions
- Dan Frankowski, Dan Cosley, Shilad Sen, Loren Terveen, John Riedl
- University of Minnesota
- SIGIR 2006
16. Main Idea
- People can be identified by their preferences and what they talk about
  - Reviews of books, movies, songs
  - Mentions on forums or blogs
  - Friend list on Facebook
  - Wish or purchase list on Amazon
- Method for re-identification
  - Datasets are represented in sparse relation spaces
  - Re-identification can be done by matching two sparse relation spaces
17. Sparse Relation Space
- Relates people to items
- Sparse: few relationships are recorded per person
- A dataset that can be represented in a sparse relation space is vulnerable
i1 i2 i3
p1 X
p2 X
p3 X
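One possible way to hold such a space in memory is a mapping from each person to the set of items they are related to. The sketch below is an assumption for illustration: the helper names `build_relation_space` and `density` are hypothetical, and the person-item pairs are placed arbitrarily rather than copied from the grid above.

```python
def build_relation_space(pairs):
    """Build {person: set of items} from (person, item) pairs."""
    space = {}
    for person, item in pairs:
        space.setdefault(person, set()).add(item)
    return space

def density(space, n_items):
    """Fraction of all possible person-item cells that are recorded."""
    recorded = sum(len(items) for items in space.values())
    return recorded / (len(space) * n_items)

# Arbitrary toy relationships: one recorded item per person.
pairs = [("p1", "i2"), ("p2", "i1"), ("p3", "i3")]
space = build_relation_space(pairs)
fill = density(space, n_items=3)
print(space, fill)
```

A low density value is what "sparse" means here: most person-item cells are empty, so the few recorded cells say a lot about each person.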
18. Research Questions
- Risks of dataset release
  - What are the risks to user privacy when releasing a dataset?
- Altering the dataset
  - How can dataset owners alter the dataset to preserve user privacy?
- Self defense
  - How can users protect their own privacy?
19. Experiment Dataset: MovieLens
Dataset 1: movie ratings. Users have not allowed them to be revealed; released for research use as an anonymous dataset.
Dataset 2: movie reviews. Public.
20. Features of the Dataset
- Both ratings and mentions follow a power law
- This is an important feature of real-world sparse relation spaces
Frankowski et al., SIGIR 06
21. Evaluation Measure
[Diagram: the mentions by user t and the ratings dataset are fed to the re-identification algorithm, which returns the top k ratings users ranked by the likelihood that they are user t]
- k-identified: t is among the k users returned by the algorithm
- k-identification rate: the fraction of users who are k-identified
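The measure can be sketched as follows. This is a simplified reading: it assumes each target user comes with a ranked candidate list (most likely first) and ignores ties; the function names and the toy rankings are hypothetical.

```python
def is_k_identified(ranked_users, target, k):
    """True if the target appears among the first k ranked candidates."""
    return target in ranked_users[:k]

def k_identification_rate(rankings, k):
    """rankings: {target user: candidate users, most likely first}."""
    if not rankings:
        return 0.0
    hits = sum(is_k_identified(r, t, k) for t, r in rankings.items())
    return hits / len(rankings)

# Made-up output of some re-identification algorithm for three targets.
rankings = {
    "u1": ["u1", "u7", "u3"],
    "u2": ["u5", "u2", "u9"],
    "u3": ["u4", "u8", "u6"],
}
rate_1 = k_identification_rate(rankings, 1)
rate_2 = k_identification_rate(rankings, 2)
print(rate_1, rate_2)
```

Here u1 is 1-identified, u2 is only 2-identified, and u3 is never found, so the rate rises from one third to two thirds as k grows from 1 to 2.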
22. Set Intersection Algorithm for Re-identification
- Likely list: users in the ratings database who have rated every movie mentioned by user t
- Problem
  - Users sometimes mention movies they have not rated
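A minimal sketch of the set intersection idea, assuming the ratings dataset is available as a mapping from user to the set of movies they rated. The function name and the toy data are assumptions for illustration.

```python
def set_intersection_candidates(mentions, ratings):
    """Return users whose rated movies include every mentioned movie."""
    mentioned = set(mentions)
    return sorted(u for u, rated in ratings.items() if mentioned <= rated)

# Made-up ratings data: user -> set of rated movies.
ratings = {
    "a": {"m1", "m2", "m3"},
    "b": {"m1", "m2"},
    "c": {"m2", "m4"},
}
# Every mentioned movie was rated by "a" and "b".
likely = set_intersection_candidates(["m1", "m2"], ratings)
# If "b" also mentions "m4" without having rated it, "b" drops out.
missed = set_intersection_candidates(["m1", "m4"], ratings)
print(likely, missed)
```

The second call illustrates the problem noted above: one mention of an unrated movie is enough to exclude the true user from the likely list.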
23. TF-IDF Algorithm
- Mentions of a user: a vector of the movies the user mentioned
- Ratings of a user: a vector of the movies the user rated
- Likelihood: TF-IDF cosine similarity between the two vectors
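A rough sketch of this ranking, under assumptions the slides do not spell out: vectors are binary (a movie is present or absent), each movie is weighted by log(N / number of users who rated it), and candidates are ranked by cosine similarity. The function names and toy data are hypothetical, and the paper's exact weighting may differ.

```python
import math

def idf_weights(ratings):
    """Assumed weighting: log(number of users / number of raters)."""
    counts = {}
    for rated in ratings.values():
        for movie in rated:
            counts[movie] = counts.get(movie, 0) + 1
    return {m: math.log(len(ratings) / c) for m, c in counts.items()}

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b[m] for m, w in a.items() if m in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def tfidf_rank(mentions, ratings):
    """Rank ratings users by similarity to the mention vector."""
    idf = idf_weights(ratings)
    mention_vec = {m: idf[m] for m in mentions if m in idf}
    scores = {
        user: cosine(mention_vec, {m: idf[m] for m in rated})
        for user, rated in ratings.items()
    }
    return sorted(scores, key=lambda u: (-scores[u], u))

# Made-up ratings data: user -> set of rated movies.
ratings = {
    "a": {"hit", "rare1"},
    "b": {"hit", "rare2"},
    "c": {"hit"},
    "d": {"rare2", "other"},
}
ranking = tfidf_rank(["hit", "rare1"], ratings)
print(ranking)
```

Unlike set intersection, a candidate is not excluded for missing one mentioned movie; it simply receives a lower similarity.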
24. Scoring Algorithm
- Scoring
  - Emphasize the mentions of rarely rated movies
  - De-emphasize the number of ratings a user has
- Score for one mention m of a user: the fraction of users who have not rated m
- Score for a user: the product of the scores for all mentions of this user
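The scoring rule can be sketched as below. The slides give the score only for a mentioned movie that the candidate has rated; the small constant `MISS_SCORE` used when the candidate has not rated it is an assumption of this sketch, as are the function names and toy data.

```python
MISS_SCORE = 0.05  # assumed score for a mentioned movie the candidate has not rated

def rarity(movie, ratings):
    """Fraction of users who have not rated the movie."""
    n_raters = sum(1 for rated in ratings.values() if movie in rated)
    return 1.0 - n_raters / len(ratings)

def user_score(mentions, candidate, ratings):
    """Product of per-mention scores for one candidate user."""
    score = 1.0
    for movie in mentions:
        if movie in ratings[candidate]:
            score *= rarity(movie, ratings)
        else:
            score *= MISS_SCORE
    return score

def rank_candidates(mentions, ratings):
    """Rank ratings users from most to least likely."""
    scores = {u: user_score(mentions, u, ratings) for u in ratings}
    return sorted(scores, key=lambda u: (-scores[u], u))

# Made-up ratings data: user -> set of rated movies.
ratings = {
    "a": {"hit", "rare"},
    "b": {"hit"},
    "c": {"hit", "x"},
    "d": {"y"},
}
ranking = rank_candidates(["hit", "rare"], ratings)
score_a = user_score(["hit", "rare"], "a", ratings)
print(ranking, score_a)
```

Because the score depends only on the mentioned movies, a candidate gains nothing from having rated many unrelated movies, and a match on a rarely rated movie counts for more than a match on a popular one.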
25. Scoring Algorithm with Ratings
- Suppose we have a magic analyzer that can guess the rating of a movie from the mention
  - E.g., using the context of that mention
- Algorithms
  - ExactRating: the analyzer can perfectly determine the rating
  - FuzzyRating: the analyzer can guess the rating value within ±1
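One way to picture the two variants is as a single scoring function with a tolerance on the rating: 0 for the exact variant and 1 for the within ±1 variant. This is a sketch under assumptions: the per-movie `rarity` values are taken as given, the miss constant is assumed, and all names and numbers are made up.

```python
def rating_matches(guessed, actual, tolerance):
    """True if the candidate rated the movie within the tolerance."""
    return actual is not None and abs(guessed - actual) <= tolerance

def score_with_ratings(mentions, candidate_ratings, rarity, tolerance, miss=0.05):
    """mentions: {movie: guessed rating}; rarity: {movie: fraction not rated}.

    The miss constant is an assumption of this sketch.
    """
    score = 1.0
    for movie, guessed in mentions.items():
        if rating_matches(guessed, candidate_ratings.get(movie), tolerance):
            score *= rarity[movie]
        else:
            score *= miss
    return score

# Made-up values for one candidate user.
mentions = {"m1": 4, "m2": 2}
candidate = {"m1": 5, "m2": 2}
rarity = {"m1": 0.5, "m2": 0.9}
exact = score_with_ratings(mentions, candidate, rarity, tolerance=0)
fuzzy = score_with_ratings(mentions, candidate, rarity, tolerance=1)
print(exact, fuzzy)
```

With tolerance 0 the off-by-one rating on m1 counts as a miss, while tolerance 1 accepts it, so the same candidate scores higher under the fuzzy variant.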
26. Percent of Users Identified by Different Algorithms
27. 1-identification Rate
28. RQ2: Altering the Dataset
- How can dataset owners alter the dataset they release to preserve user privacy?
- Data suppression
  - Algorithm: drop rarely rated movies
  - Not a big problem for industry, but harmful for research
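Dropping rarely rated movies before release might look like the sketch below. The threshold parameter `min_raters`, the function name, and the toy data are assumptions; the slides do not state how "rarely rated" is defined.

```python
from collections import Counter

def suppress_rare_movies(ratings, min_raters):
    """Remove movies rated by fewer than min_raters users."""
    counts = Counter(m for rated in ratings.values() for m in rated)
    keep = {m for m, c in counts.items() if c >= min_raters}
    return {user: rated & keep for user, rated in ratings.items()}

# Made-up ratings data: user -> set of rated movies.
ratings = {
    "a": {"hit", "rare"},
    "b": {"hit"},
    "c": {"hit", "niche"},
}
released = suppress_rare_movies(ratings, min_raters=2)
print(released)
```

The released data keeps only the popular movie, which shows why suppression hurts research use: the rarely rated items are exactly what gets removed.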
29. Dataset-level Suppression
Does not work!
30. RQ3: Self Defense
- How can users protect their own privacy?
- Suppression
  - Do not mention rarely rated movies
- Misdirection
  - Mention items they have not rated
31. User-level Suppression
Does not work!
32. Misdirection
Works when users mention popular items
33. Conclusion
- Simple data mining algorithms can identify users who make mentions in a sparse relation space and think they are anonymous
  - The algorithms have other uses, e.g. finding paper reviewers (future work of Frankowski)
- Privacy risks for users on social media sites
  - Privacy is hard to preserve
  - Do not reveal private information, even if the setting seems anonymous