Title: HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION
1HIDING EMERGING PATTERNS WITH LOCAL RECODING
GENERALIZATION
- Presented by Michael Cheng
- Supervisor: Dr. William Cheung
- Co-Supervisor: Dr. Byron Choi
2Presentation Flow
- Privacy-Preserving Data Publishing
- Introduction to Emerging Patterns (EPs)
- Introduction to Equivalence Class
- Introduction to Generalization
- Proposed Problem and Motivation
- Heuristic for the Problem
- Experimental Results
- Future research plan
3Privacy Preserving Data Publishing- Introduction
- Organizations often need to publish or share
their data for legitimate reasons
- Sensitive information (e.g. personal identities,
restrictive patterns) may be inferred from the
published data
4Privacy Preserving Data Publishing- Objective
- Transform the dataset before publishing, such
that
- Sensitive information is hidden
- In our case: Emerging Patterns (EPs)
- Subsequent analysis is still supported
- In our case: Frequent Itemset (FIS) Mining
5Introduction to Emerging Patterns (EPs)
- Emerging Patterns (EPs) are itemsets, defined over a
pair of datasets, whose supports are significant
in one dataset but insignificant in the other
{MSE, Exec} is an Emerging Pattern
Edu   Occup     Marital
MSE   Exec      Married
MSE   Exec      Married
BA    Exec      Married
BA    Manager   Married
BA    Repair    Never
The tuples are divided into two classes: Income > 50k and Income < 50k
6Introduction to Emerging Patterns (EPs)
- Formally, EPs are defined through the growth rate
of an itemset's support between the two datasets
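The formula on this slide did not survive extraction. As a hedged reconstruction, the definition commonly used in the EP literature is given below; the symbols D_1, D_2, X and the threshold \rho are notation assumed here, not taken from the slide.

```latex
% Growth rate of itemset X from dataset D_1 to dataset D_2
GR_{D_1 \to D_2}(X) =
\begin{cases}
0, & \text{if } supp_{D_1}(X) = 0 \text{ and } supp_{D_2}(X) = 0 \\
\infty, & \text{if } supp_{D_1}(X) = 0 \text{ and } supp_{D_2}(X) > 0 \\
\dfrac{supp_{D_2}(X)}{supp_{D_1}(X)}, & \text{otherwise}
\end{cases}

% X is an Emerging Pattern from D_1 to D_2 for a threshold \rho > 1 if
GR_{D_1 \to D_2}(X) \geq \rho
```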
7Introduction to Equivalence Class
- Tuples are said to be in the same Equivalence
Class w.r.t. a set of attributes A if they take
the same values on A
Tuples 1, 2, 3 are in the same Equivalence Class
w.r.t. {Occup, Marital}
ID   Edu   Occup     Marital
1    MSE   Exec      Married
2    MSE   Exec      Married
3    BA    Exec      Married
4    BA    Manager   Married
5    BA    Repair    Never
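The grouping described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `equivalence_classes` and the dictionary layout are assumptions, while the five tuples are the ones shown on the slide.

```python
from collections import defaultdict

# Toy table from the slide, keyed by tuple ID.
ROWS = {
    1: {"Edu": "MSE", "Occup": "Exec", "Marital": "Married"},
    2: {"Edu": "MSE", "Occup": "Exec", "Marital": "Married"},
    3: {"Edu": "BA", "Occup": "Exec", "Marital": "Married"},
    4: {"Edu": "BA", "Occup": "Manager", "Marital": "Married"},
    5: {"Edu": "BA", "Occup": "Repair", "Marital": "Never"},
}


def equivalence_classes(rows, attrs):
    """Group tuple IDs by the values they take on the attribute set `attrs`."""
    classes = defaultdict(list)
    for tid, row in rows.items():
        key = tuple(row[a] for a in attrs)
        classes[key].append(tid)
    return dict(classes)


classes = equivalence_classes(ROWS, ["Occup", "Marital"])
for key, ids in classes.items():
    print(key, ids)
```

With the attribute set {Occup, Marital}, tuples 1, 2 and 3 fall into one class, and tuples 4 and 5 each form a class of their own.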
8Introduction to Generalization
- Extensively studied for achieving k-Anonymity
- Not studied before for hiding itemsets
- Modify the original values in the dataset into more
general values according to a user-given
hierarchy, such that more tuples share the
same set of attribute values
- Example
- In Adult, BA and MSE may be generalized to
Degree Holder
9Types of Generalization
- Single Dimensional Global Recoding
- Multi Dimensional Global Recoding
- Multi Dimensional Local Recoding
10Single Dimensional Global Recoding
- If we decide to generalize some values to a
single value, all tuples which contain these
values will be affected
11Multi Dimensional Global Recoding
- If we decide to generalize some values to a
single value, all tuples in the same equivalence
class which contain those values will be
affected
[Figure: Multi Dimensional Global Recoding of the Occup column; values shown after recoding: Occupation, Occupation, Manager, Repair]
12Multi Dimensional Local Recoding
- Same as Multi Dimensional Global Recoding,
except without the Equivalence Class constraint
[Figure: Multi Dimensional Local Recoding of the Occup column; values shown after recoding: Occupation, Exec, Manager, Repair]
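The difference between the three recoding types can be made concrete with a small sketch. This is one possible reading of the slides rather than the authors' implementation: the function names, the choice of equivalence class, and the choice of which tuple to recode locally are all assumptions made for illustration.

```python
# Toy table from the slides.
ROWS = [
    {"Edu": "MSE", "Occup": "Exec", "Marital": "Married"},
    {"Edu": "MSE", "Occup": "Exec", "Marital": "Married"},
    {"Edu": "BA", "Occup": "Exec", "Marital": "Married"},
    {"Edu": "BA", "Occup": "Manager", "Marital": "Married"},
    {"Edu": "BA", "Occup": "Repair", "Marital": "Never"},
]


def single_dim_global(rows, attr, values, general):
    """Generalize `attr` in every tuple that holds one of `values`."""
    return [{**r, attr: general} if r[attr] in values else dict(r) for r in rows]


def multi_dim_global(rows, class_attrs, class_key, attr, general):
    """Generalize `attr` in every tuple of one equivalence class."""
    return [
        {**r, attr: general}
        if tuple(r[a] for a in class_attrs) == class_key
        else dict(r)
        for r in rows
    ]


def local_recode(rows, positions, attr, general):
    """Generalize `attr` only in the tuples at the chosen positions."""
    return [
        {**r, attr: general} if i in positions else dict(r)
        for i, r in enumerate(rows)
    ]


def occup(rows):
    return [r["Occup"] for r in rows]


a = single_dim_global(ROWS, "Occup", {"Exec"}, "Occupation")
b = multi_dim_global(ROWS, ("Edu", "Occup", "Marital"),
                     ("MSE", "Exec", "Married"), "Occup", "Occupation")
c = local_recode(ROWS, {0}, "Occup", "Occupation")
print(occup(a))  # all three Exec tuples change
print(occup(b))  # only the (MSE, Exec, Married) class changes
print(occup(c))  # only the first tuple changes
```

Local recoding is the least constrained of the three, which is why it can hide a pattern while touching fewer tuples.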
13Proposed Problem- Why EP and FIS ?
- Emerging Patterns may reveal sensitive information
- E.g. in the Adult dataset from the UCI Repository, we
found that
- {Never-Married, Own-Child} is an EP from the
class Income < 50k to the class Income > 50k
- Growth Rate: 35
- Frequent Itemset mining is a popular data mining task
and is supported by commercial data-mining software
14Proposed Problem-Why Generalization ?
- Other methods studied in PPDP
- For example
- Adding unknowns, removing tuples, adding fake
tuples randomly
- These produce either
- Incomplete information
- Fake information
- In some applications, completeness and
truthfulness of data are important
- By using generalization, we can preserve the
completeness and truthfulness of the data
15Proposed problem- Problem Illustration
[Figure: a dataset D, with its Emerging Patterns and Frequent Itemsets, is transformed by local recoding into a published dataset, again shown with its Emerging Patterns and Frequent Itemsets]
16Intuition of Local Recoding
- Support threshold of FIS: 40%; growth rate threshold of EP: 3
- Frequent Itemset: {Exec, Married}
- Emerging Pattern: {MSE, Exec}
Income > 50k                   Income < 50k
Edu   Occup     Marital        Edu   Occup     Marital
MSE   Exec      Married        BA    Exec      Married
MSE   Exec      Married        BA    Exec      Married
BA    Exec      Married        BA    Exec      Married
BA    Manager   Married        BA    Worker    Married
BA    Repair    Never          MSE   Manager   Never
17Intuition of Local Recoding
Before local recoding
Income > 50k                   Income < 50k
Edu   Occup     Marital        Edu   Occup     Marital
MSE   Exec      Married        BA    Exec      Married
MSE   Exec      Married        BA    Exec      Married
BA    Exec      Married        BA    Exec      Married
BA    Manager   Married        BA    Worker    Married
BA    Repair    Never          MSE   Manager   Never
After local recoding
Income > 50k                   Income < 50k
Edu   Occup       Marital      Edu   Occup       Marital
MSE   White col   Married      BA    Exec        Married
MSE   White col   Married      BA    Exec        Married
BA    Exec        Married      BA    Exec        Married
BA    Manager     Married      BA    Worker      Married
BA    Repair      Never        MSE   White col   Never
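The effect of this recoding can be checked numerically. The sketch below is an illustration under stated assumptions: the tuples are reconstructed from the slide (left table taken as Income > 50k, right table as Income < 50k), items are treated as bare attribute values, and `support` and `growth_rate` are hypothetical helper names using the usual support-ratio definition.

```python
import math

# (Edu, Occup, Marital) tuples before recoding, as reconstructed from the slide.
HIGH_BEFORE = [("MSE", "Exec", "Married"), ("MSE", "Exec", "Married"),
               ("BA", "Exec", "Married"), ("BA", "Manager", "Married"),
               ("BA", "Repair", "Never")]
LOW_BEFORE = [("BA", "Exec", "Married")] * 3 + [("BA", "Worker", "Married"),
                                                 ("MSE", "Manager", "Never")]

# After recoding: three chosen tuples have Occup generalized to "White col".
HIGH_AFTER = [("MSE", "White col", "Married")] * 2 + HIGH_BEFORE[2:]
LOW_AFTER = LOW_BEFORE[:4] + [("MSE", "White col", "Never")]


def support(rows, itemset):
    """Fraction of tuples that contain every item of `itemset`."""
    return sum(itemset <= set(r) for r in rows) / len(rows)


def growth_rate(src, dst, itemset):
    """Support ratio from class `src` to class `dst`."""
    s, d = support(src, itemset), support(dst, itemset)
    if s == 0:
        return 0.0 if d == 0 else math.inf
    return d / s


gr_before = growth_rate(LOW_BEFORE, HIGH_BEFORE, {"MSE", "Exec"})
gr_after = growth_rate(LOW_AFTER, HIGH_AFTER, {"MSE", "White col"})
fis_before = support(HIGH_BEFORE + LOW_BEFORE, {"Exec", "Married"})
fis_after = support(HIGH_AFTER + LOW_AFTER, {"Exec", "Married"})
print(gr_before, gr_after)    # inf 2.0
print(fis_before, fis_after)  # 0.6 0.4
```

Under these assumptions the growth rate of the sensitive pattern drops from infinity to 2, below the threshold of 3, while {Exec, Married} stays at the 40% support threshold and so remains frequent.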
18Heuristic for the Problem- Greedy Approach
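Repeat
- Mine the Emerging Patterns (EP 1, EP 2, EP 3, EP 4) from D
- Compute the utility gain of each equivalence class
Class     Utility Gain
Class 1   40
Class 2   90
Class 3   60
Class 4   20
Class 5   15
- Generalize the class with the highest utility gain
Until
All Emerging Patterns are removed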
19Heuristic for the Problem-Greedy Approach
- Drawbacks
- May get trapped in local minima
- Solution
- Simulated Annealing Style Approach for choosing
the equivalence class
20Heuristic for the Problem- Simulated Annealing
Style Approach
- Choose the Equivalence Class probabilistically
- Two parameters
- Initial temperature (T0)
- Cooling rate (a)
- Acceptance probability
- exp(Utility Gain / Temperature)
- Temperature updating
- Tn = a x Tn-1
[Figure: acceptance probability for different utility gains and temperatures]
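The temperature update is a geometric cooling schedule, which the following sketch illustrates. It covers only the update rule Tn = a x Tn-1; the acceptance formula is left out because its exact form is not recoverable from the slide text. The function name `cooling_schedule` is an assumption, and the values T0 = 10 and a = 0.4 are the ones listed in the experimental setup.

```python
def cooling_schedule(t0, a, steps):
    """Return [T0, T1, ..., T_steps], where each Tn = a * Tn-1."""
    temps = [t0]
    for _ in range(steps):
        temps.append(a * temps[-1])
    return temps


temps = cooling_schedule(10.0, 0.4, 4)
print([round(t, 3) for t in temps])  # [10.0, 4.0, 1.6, 0.64, 0.256]
```

With a cooling rate this small the temperature falls quickly, so the choice of equivalence class becomes close to greedy after only a few iterations.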
21Heuristic for the Problem- Simulated Annealing
Style Approach
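Repeat
- Mine the Emerging Patterns (EP 1, EP 2, EP 3, EP 4) from D
- Assign a selection probability to each equivalence class
Class     Probability
Class 1   0.2
Class 2   0.4
Class 3   0.1
Class 4   0.25
Class 5   0.05
- Choose the class to generalize according to these probabilities
Until
All Emerging Patterns are removed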
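The two selection rules differ only in how a class is picked from the table. A minimal sketch, using the illustrative numbers shown on the slides, is given below; the function names are assumptions, and how the probabilities are derived from utility gain and temperature is not modelled here.

```python
import random

# Illustrative values from the two heuristic slides.
UTILITY_GAIN = {"Class 1": 40, "Class 2": 90, "Class 3": 60,
                "Class 4": 20, "Class 5": 15}
PROBABILITY = {"Class 1": 0.2, "Class 2": 0.4, "Class 3": 0.1,
               "Class 4": 0.25, "Class 5": 0.05}


def greedy_pick(gains):
    """Greedy approach: always take the class with the largest utility gain."""
    return max(gains, key=gains.get)


def probabilistic_pick(probs, rng):
    """Simulated annealing style: sample a class according to its probability."""
    names = list(probs)
    weights = [probs[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so that runs are repeatable
print(greedy_pick(UTILITY_GAIN))
print([probabilistic_pick(PROBABILITY, rng) for _ in range(5)])
```

Because a class with a lower score can still be sampled, the probabilistic rule can move away from a choice that would leave the greedy approach stuck.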
22Two questions
- How to choose an EP for generalization?
- How to calculate the utility gain?
23How to choose an EP for generalization?
- Choose the EP which overlaps with the remaining
EPs the most
- More likely to hide other EPs simultaneously
Emerging Patterns
{MSE, Never-Married}
{BA, Divorced}
{BA, Divorced, Worker}
{BA, Divorced, Repairman}
{BA, Divorced, Own-Child}
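One way to make "overlaps the most" concrete is to score each EP by its summed similarity to the other EPs. The slides do not state the overlap measure, so the Jaccard similarity used below is an assumption, as are the function names; the five patterns are read off the slide's figure.

```python
def jaccard(a, b):
    """Shared items divided by all items of the two itemsets."""
    return len(a & b) / len(a | b)


def most_overlapping(eps):
    """Return the EP with the largest total Jaccard overlap with the others."""
    def score(ep):
        return sum(jaccard(ep, other) for other in eps if other is not ep)
    return max(eps, key=score)


EPS = [
    frozenset({"MSE", "Never-Married"}),
    frozenset({"BA", "Divorced"}),
    frozenset({"BA", "Divorced", "Worker"}),
    frozenset({"BA", "Divorced", "Repairman"}),
    frozenset({"BA", "Divorced", "Own-Child"}),
]

chosen = most_overlapping(EPS)
print(sorted(chosen))  # ['BA', 'Divorced']
```

Here {BA, Divorced} is contained in three other EPs, so generalizing its items is likely to lower the growth rate of those patterns as well.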
24How to calculate utility gain?
- Utility gain is a function of
- Recoding Distance (RD)
- Reduction of Growth Rate (RG)
25How to calculate utility gain ?- Recoding
Distance (RD)
- The detailed derivation is given in the paper
- Intuitively, it measures
- How many FIS have been generalized, and by how much?
- How many FIS have disappeared?
- High-level definition of RD
- θq x (generalized FIS) + (1 - θq) x
(disappeared FIS)
- where θq is a user-defined parameter
The larger the value of RD, the greater the
distortion of the Frequent Itemsets
26How to calculate utility gain ?- Reduction of
Growth Rate(RG)
- After a local recoding is applied, RG is defined as
- The reduction in growth rate over all EPs
Example: a local recoding lowers the growth rate from 60 to 25, so
RG = 60 - 25 = 35
27How to calculate utility gain?
- Putting all these together, utility gain is
defined as
- θp x RG - (1 - θp) x RD
- where θp is a user-defined parameter
- It favors
- Local recodings which greatly reduce the growth
rate
- It penalizes
- Local recodings which generate large distortion on
FIS
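The two weighted formulas can be put together in a short sketch. The operators are an assumption read from the surrounding text: the two RD terms are summed, and RD is subtracted in the utility gain because the slide says distortion is penalized. The function and parameter names are hypothetical, and the inputs for the two RD terms are made-up numbers; only RG = 60 - 25 = 35 comes from the slides.

```python
def recoding_distance(generalized, disappeared, theta_q):
    """High-level RD: weighted sum of the two kinds of FIS distortion."""
    return theta_q * generalized + (1 - theta_q) * disappeared


def utility_gain(rg, rd, theta_p):
    """Reward growth-rate reduction (RG), penalize FIS distortion (RD)."""
    return theta_p * rg - (1 - theta_p) * rd


rg = 60 - 25  # growth-rate reduction from the RG slide
# Hypothetical distortion values, chosen only to exercise the formula.
rd = recoding_distance(generalized=4.0, disappeared=2.0, theta_q=0.5)
gain = utility_gain(rg, rd, theta_p=0.5)
print(rd, gain)  # 3.0 16.0
```

Raising the weight on RG makes the heuristic more aggressive about hiding EPs, while lowering it protects the Frequent Itemsets at the cost of more iterations.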
28Experimental Setup
- Dataset: Adult dataset from UCI Repository
- Popular benchmark dataset used for generalization
- Total number of records: 30162
- Income > 50k: 7508
- Income < 50k: 22654
- Use only 8 categorical attributes for the experiment
- A well-accepted hierarchy is defined
- Parameters
- Support of FIS: 40%
- Growth rate of EP: 5
- Initial Temperature: 10
- Cooling Rate: 0.4
29Performance
[Figure: RD / No. of FIS disappeared for the Greedy Approach]
[Figure: RD / No. of FIS disappeared for the Simulated Annealing Style Approach (Best of 5)]
30Runtime (in minutes)
[Figure: runtime of the Greedy Approach]
[Figure: runtime of the Simulated Annealing Style Approach (Best of 5)]
31Future Research Plan
- Hide EPs in temporal datasets
- Consider multi-level FIS
- Hiding a group of emerging patterns at a time
32Q & A