HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

1
HIDING EMERGING PATTERNS WITH LOCAL RECODING
GENERALIZATION
  • Presented by Michael Cheng
  • Supervisor: Dr. William Cheung
  • Co-Supervisor: Dr. Byron Choi

2
Presentation Flow
  • Privacy-Preserving Data Publishing
  • Introduction to Emerging Patterns (EPs)
  • Introduction to Equivalence Class
  • Introduction to Generalization
  • Proposed Problem and Motivation
  • Heuristic for the Problem
  • Experimental Results
  • Future research plan

3
Privacy-Preserving Data Publishing - Introduction
  • Organizations often need to publish or share
    their data for legitimate reasons
  • Sensitive information (e.g. personal identities,
    restrictive patterns) may be inferred from the
    published data

4
Privacy-Preserving Data Publishing - Objective
  • Transform the dataset before publishing, such
    that
  • Sensitive information is hidden
  • In our case, Emerging Patterns (EPs)
  • Subsequent analysis remains possible
  • In our case, Frequent Itemset (FIS) mining

5
Introduction to Emerging Patterns (EPs)
  • Emerging Patterns (EPs) are itemsets, mined from a
    pair of datasets, whose supports are significant
    in one dataset but insignificant in the other

{MSE, Exec} is an Emerging Pattern

Edu  Occup    Marital
MSE  Exec     Married
MSE  Exec     Married
BA   Exec     Married
BA   Manager  Married
BA   Repair   Never

Classes: Income > 50k, Income < 50k

6
Introduction to Emerging Patterns (EPs)
  • Formally, growth rate and EPs are defined as
    follows
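
The formula on this slide did not survive in the transcript. As a hedged illustration, the sketch below uses the standard definition from the emerging-pattern literature: the growth rate of an itemset from one dataset to another is the ratio of its supports, and an itemset counts as an EP when that ratio reaches a threshold. The function names, the threshold name `rho`, and the toy rows are assumptions made for this example; they are not taken from the slides.

```python
import math

def support(itemset, dataset):
    """Fraction of rows in `dataset` that contain every item of `itemset`."""
    if not dataset:
        return 0.0
    return sum(1 for row in dataset if itemset <= row) / len(dataset)

def growth_rate(itemset, d1, d2):
    """Growth rate of `itemset` from d1 to d2 (standard EP definition)."""
    s1, s2 = support(itemset, d1), support(itemset, d2)
    if s1 == 0:
        # Absent from both datasets: 0. Absent only from d1: infinite growth.
        return 0.0 if s2 == 0 else math.inf
    return s2 / s1

def is_emerging_pattern(itemset, d1, d2, rho):
    """An itemset is an EP from d1 to d2 if its growth rate is at least rho."""
    return growth_rate(itemset, d1, d2) >= rho

# Hypothetical toy data: each row is a set of attribute values.
d1 = [{"BA", "Exec"}, {"BA", "Manager"}, {"MSE", "Exec"}, {"BA", "Repair"}]
d2 = [{"MSE", "Exec"}, {"MSE", "Exec"}, {"BA", "Exec"}, {"MSE", "Exec"}]

pattern = {"MSE", "Exec"}
print(growth_rate(pattern, d1, d2))   # supports are 1/4 in d1 and 3/4 in d2
print(is_emerging_pattern(pattern, d1, d2, rho=2))
```

With these toy rows the pattern is three times as frequent in `d2` as in `d1`, so it qualifies as an EP for any threshold up to 3.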

7
Introduction to Equivalence Class
  • Tuples are said to be in the same Equivalence
    Class w.r.t. a set of attributes A if they take
    the same values on A

Tuples 1, 2, 3 are in the same Equivalence Class
w.r.t. {Occup, Marital}

ID  Edu  Occup    Marital
1   MSE  Exec     Married
2   MSE  Exec     Married
3   BA   Exec     Married
4   BA   Manager  Married
5   BA   Repair   Never

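
The grouping described on this slide can be sketched as follows, using the five tuples from the slide. The function name `equivalence_classes` and the dictionary layout are assumptions made for illustration.

```python
from collections import defaultdict

# The five tuples shown on this slide.
ROWS = [
    {"ID": 1, "Edu": "MSE", "Occup": "Exec", "Marital": "Married"},
    {"ID": 2, "Edu": "MSE", "Occup": "Exec", "Marital": "Married"},
    {"ID": 3, "Edu": "BA", "Occup": "Exec", "Marital": "Married"},
    {"ID": 4, "Edu": "BA", "Occup": "Manager", "Marital": "Married"},
    {"ID": 5, "Edu": "BA", "Occup": "Repair", "Marital": "Never"},
]

def equivalence_classes(rows, attrs):
    """Group tuple IDs by the values they take on the attributes `attrs`."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[a] for a in attrs)].append(row["ID"])
    return dict(classes)

classes = equivalence_classes(ROWS, ["Occup", "Marital"])
for key, ids in classes.items():
    print(key, ids)
```

Grouping on {Occup, Marital} puts tuples 1, 2 and 3 together; grouping on all three attributes would split tuple 3 from tuples 1 and 2.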
8
Introduction to Generalization
  • Extensively studied for achieving k-Anonymity
  • Not studied before for hiding itemsets
  • Modifies the original values in the dataset into
    more general values according to a user-given
    hierarchy, so that more tuples share the
    same set of attribute values
  • Example
  • In Adult, BA and MSE may be generalized to
    Degree Holder
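
A minimal sketch of generalization along a hierarchy, assuming the hierarchy is stored as a child-to-parent map. Only the BA/MSE to Degree Holder step comes from the slide; the map name `PARENT`, the function name, and the sample column are hypothetical.

```python
from collections import Counter

# Assumed fragment of a user-given hierarchy: each value maps to its parent.
# Only BA/MSE -> Degree Holder is taken from the slide.
PARENT = {"BA": "Degree Holder", "MSE": "Degree Holder"}

def generalize(value, parent=PARENT):
    """Return the parent of `value`, or `value` itself if it has no parent."""
    return parent.get(value, value)

edu = ["MSE", "MSE", "BA", "BA", "BA"]          # hypothetical Edu column
generalized = [generalize(v) for v in edu]

print(Counter(edu))          # two distinct values before generalization
print(Counter(generalized))  # one shared value afterwards
```

After generalization all five tuples share the same Edu value, which is the effect the slide describes.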

9
Types of Generalization
  • Single Dimensional Global Recoding
  • Multi Dimensional Global Recoding
  • Multi Dimensional Local Recoding

10
Single Dimensional Global Recoding
  • If we decide to generalize some values to a
    single value, all tuples that contain these
    values are affected

[Figure: Single Dimensional Global Recoding]
11
Multi Dimensional Global Recoding
  • If we decide to generalize some values to a
    single value, all tuples in the same equivalence
    class that contain those values are
    affected

[Figure: Multi Dimensional Global Recoding of the Occup column. Recoded values: Occupation, Occupation, Occupation, Manager, Repair]
12
Multi Dimensional Local Recoding
  • Same as Multi Dimensional Global Recoding,
    except without the Equivalence Class constraint

[Figure: Multi Dimensional Local Recoding of the Occup column. Recoded values: Occupation, Occupation, Exec, Manager, Repair]
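
The difference between global and local recoding can be sketched on the five-tuple table from the Equivalence Class slide. This is an illustrative sketch only: the function names, and the choice of tuples 1 and 2 for the local case, are assumptions made to reproduce the Occup values listed for the two recoding figures.

```python
ROWS = [
    {"ID": 1, "Edu": "MSE", "Occup": "Exec", "Marital": "Married"},
    {"ID": 2, "Edu": "MSE", "Occup": "Exec", "Marital": "Married"},
    {"ID": 3, "Edu": "BA", "Occup": "Exec", "Marital": "Married"},
    {"ID": 4, "Edu": "BA", "Occup": "Manager", "Marital": "Married"},
    {"ID": 5, "Edu": "BA", "Occup": "Repair", "Marital": "Never"},
]

def recode_global(rows, attr, general, class_attrs, class_values):
    """Multi-dimensional global recoding: generalize `attr` in every tuple
    of the equivalence class that takes `class_values` on `class_attrs`."""
    return [
        {**r, attr: general}
        if tuple(r[a] for a in class_attrs) == class_values else dict(r)
        for r in rows
    ]

def recode_local(rows, attr, general, ids):
    """Local recoding: generalize `attr` only in the tuples listed in `ids`."""
    return [{**r, attr: general} if r["ID"] in ids else dict(r) for r in rows]

global_out = recode_global(
    ROWS, "Occup", "Occupation", ("Occup", "Marital"), ("Exec", "Married")
)
local_out = recode_local(ROWS, "Occup", "Occupation", {1, 2})

print([r["Occup"] for r in global_out])
print([r["Occup"] for r in local_out])
```

Global recoding must change tuples 1, 2 and 3 together because they form one equivalence class on {Occup, Marital}; local recoding is free to leave tuple 3 as Exec.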
13
Proposed Problem - Why EP and FIS?
  • Emerging Patterns may reveal sensitive information
  • E.g. in the Adult dataset from the UCI Repository, we
    found that
  • {Never-Married, Own-Child} is an EP from the
    class Income < 50k to the class Income > 50k
  • Growth rate: 35
  • Frequent Itemset mining is a popular data mining task
    and is supported by commercial data-mining software

14
Proposed Problem - Why Generalization?
  • Other methods studied in PPDP
  • For example
  • Adding unknowns, removing tuples, adding fake
    tuples randomly
  • These produce either
  • Incomplete information
  • Fake information
  • In some applications, completeness and
    truthfulness of data are important
  • By using generalization, we can preserve the
    completeness and truthfulness of the data

15
Proposed Problem - Problem Illustration
[Figure: a transformation (local recoding) is applied to dataset D; Emerging Patterns and Frequent Itemsets are shown for the dataset before and after the transformation]
16
Intuition of Local Recoding
  • Support of FIS: 40
  • Growth Rate of EP: 3
  • Frequent Itemset: {Exec, Married}
  • Emerging Pattern: {MSE, Exec}

Income > 50k               Income < 50k
Edu  Occup    Marital      Edu  Occup    Marital
MSE  Exec     Married      BA   Exec     Married
MSE  Exec     Married      BA   Exec     Married
BA   Exec     Married      BA   Exec     Married
BA   Manager  Married      BA   Worker   Married
BA   Repair   Never        MSE  Manager  Never

17
Intuition of Local Recoding
Before local recoding

Income > 50k               Income < 50k
Edu  Occup    Marital      Edu  Occup    Marital
MSE  Exec     Married      BA   Exec     Married
MSE  Exec     Married      BA   Exec     Married
BA   Exec     Married      BA   Exec     Married
BA   Manager  Married      BA   Worker   Married
BA   Repair   Never        MSE  Manager  Never

After local recoding

Income > 50k                 Income < 50k
Edu  Occup      Marital      Edu  Occup      Marital
MSE  White col  Married      BA   Exec       Married
MSE  White col  Married      BA   Exec       Married
BA   Exec       Married      BA   Exec       Married
BA   Manager    Married      BA   Worker     Married
BA   Repair     Never        MSE  White col  Never

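
The effect of this recoding can be checked numerically. The sketch below copies the rows from the two pairs of tables and makes three assumptions that the slides do not state outright: "Support of FIS 40" is read as a 40% support threshold, FIS support is measured over both classes together, and "Growth Rate of EP 3" is read as the EP threshold. The helper names are hypothetical.

```python
import math

# Rows are (Edu, Occup, Marital), copied from the tables on the slides.
HIGH_BEFORE = [("MSE", "Exec", "Married"), ("MSE", "Exec", "Married"),
               ("BA", "Exec", "Married"), ("BA", "Manager", "Married"),
               ("BA", "Repair", "Never")]
LOW_BEFORE = [("BA", "Exec", "Married")] * 3 + [
    ("BA", "Worker", "Married"), ("MSE", "Manager", "Never")]
HIGH_AFTER = [("MSE", "White col", "Married"), ("MSE", "White col", "Married"),
              ("BA", "Exec", "Married"), ("BA", "Manager", "Married"),
              ("BA", "Repair", "Never")]
LOW_AFTER = [("BA", "Exec", "Married")] * 3 + [
    ("BA", "Worker", "Married"), ("MSE", "White col", "Never")]

def support(itemset, rows):
    """Fraction of rows containing every value in `itemset`."""
    return sum(1 for r in rows if itemset <= set(r)) / len(rows)

def growth_rate(itemset, d1, d2):
    """Support ratio from d1 to d2 (standard EP definition)."""
    s1, s2 = support(itemset, d1), support(itemset, d2)
    if s1 == 0:
        return 0.0 if s2 == 0 else math.inf
    return s2 / s1

fis = {"Exec", "Married"}
print(support(fis, HIGH_BEFORE + LOW_BEFORE))   # FIS support before recoding
print(support(fis, HIGH_AFTER + LOW_AFTER))     # FIS support after recoding
print(growth_rate({"MSE", "Exec"}, LOW_BEFORE, HIGH_BEFORE))
print(growth_rate({"MSE", "White col"}, LOW_AFTER, HIGH_AFTER))
```

Under these assumptions {Exec, Married} stays at the 40% threshold after recoding, while {MSE, Exec} disappears and its generalized form {MSE, White col} has a growth rate below 3, so the EP is hidden and the FIS is preserved.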
18
Heuristic for the Problem - Greedy Approach
Repeat
  • Mine Emerging Patterns from D (EP 1, EP 2, EP 3, EP 4)
  • Compute the Utility Gain of each Equivalence Class

Equivalence Class  Utility Gain
Class 1            40
Class 2            90
Class 3            60
Class 4            20
Class 5            15

Until all Emerging Patterns are removed
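
A minimal sketch of the greedy loop, assuming that "greedy" means recoding the equivalence class with the highest utility gain in each round (the slide does not spell this out). The callables `mine_eps`, `gains_for` and `apply_recoding` are hypothetical placeholders for the mining, scoring and recoding steps; `max_rounds` is an added safety bound.

```python
# Utility gains shown on the slide.
SLIDE_GAINS = {"Class 1": 40, "Class 2": 90, "Class 3": 60,
               "Class 4": 20, "Class 5": 15}

def pick_greedy(gains):
    """Return the equivalence class with the largest utility gain."""
    return max(gains, key=gains.get)

def greedy_hide(data, mine_eps, gains_for, apply_recoding, max_rounds=100):
    """Repeat: mine EPs, score classes, recode the best one, until no EP is left."""
    for _ in range(max_rounds):
        eps = mine_eps(data)
        if not eps:
            break                      # all Emerging Patterns are removed
        gains = gains_for(data, eps)   # {equivalence class: utility gain}
        data = apply_recoding(data, pick_greedy(gains))
    return data

print(pick_greedy(SLIDE_GAINS))
```

With the numbers on the slide, the first round would select Class 2.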
19
Heuristic for the Problem - Greedy Approach
  • Drawbacks
  • May become trapped in local minima
  • Solution
  • Simulated Annealing Style Approach for choosing
    an equivalence class

20
Heuristic for the Problem - Simulated Annealing
Style Approach
  • Choose Equivalence Class probabilistically
  • Two parameters
  • Initial temperature ($T_0$)
  • Cooling Rate ($a$)
  • Acceptance Probability
  • $\exp(\text{Utility Gain} / \text{Temperature})$
  • Temperature updating
  • $T_n = a \cdot T_{n-1}$
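
The temperature update can be sketched as a geometric schedule. The example uses the initial temperature 10 and cooling rate 0.4 reported in the experimental setup; the function name and the number of steps are assumptions, and the acceptance probability is left out because its exact form is not recoverable from the transcript.

```python
def temperatures(t0, a, steps):
    """Yield T_0, T_1, ..., T_steps, where T_n = a * T_(n-1)."""
    t = t0
    for _ in range(steps + 1):
        yield t
        t *= a

# Parameters from the experimental setup: T0 = 10, cooling rate a = 0.4.
schedule = list(temperatures(t0=10, a=0.4, steps=4))
print([round(t, 3) for t in schedule])
```

With a cooling rate below 1 the temperature shrinks quickly, so the selection becomes less random in later rounds.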

[Figure: Acceptance probability for different utility gains
and temperatures]
21
Heuristic for the Problem - Simulated Annealing
Style Approach
Repeat
  • Mine Emerging Patterns from D (EP 1, EP 2, EP 3, EP 4)
  • Choose an Equivalence Class according to its probability

Equivalence Class  Probability
Class 1            0.2
Class 2            0.4
Class 3            0.1
Class 4            0.25
Class 5            0.05

Until all Emerging Patterns are removed
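
Choosing a class probabilistically, rather than always taking the best one, can be sketched with a weighted random draw. The probabilities are the ones shown on the slide; the function name, the seed, and the number of draws are assumptions for illustration.

```python
import random

# Selection probabilities shown on the slide (they sum to 1).
PROBS = {"Class 1": 0.2, "Class 2": 0.4, "Class 3": 0.1,
         "Class 4": 0.25, "Class 5": 0.05}

def pick_class(probs, rng=random):
    """Draw one equivalence class with probability proportional to its weight."""
    classes = list(probs)
    return rng.choices(classes, weights=[probs[c] for c in classes], k=1)[0]

rng = random.Random(0)   # fixed seed so the run is repeatable
draws = [pick_class(PROBS, rng) for _ in range(10_000)]
print({c: draws.count(c) / len(draws) for c in PROBS})
```

Class 2 is still chosen most often, but every class has a chance of being selected, which is what lets the search leave a local optimum.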
22
Two questions
  • How to choose an EP for generalization?
  • How to calculate the utility gain?

23
How to choose an EP for generalization?
  • Choose the EP that overlaps most with the
    remaining EPs
  • It is more likely to hide other EPs simultaneously

Emerging Patterns
  • {MSE, Never Married}
  • {BA, Divorced}
  • {BA, Divorced, Worker}
  • {BA, Divorced, Repairman}
  • {BA, Divorced, Own-Child}
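
One way to make "overlaps the most" concrete is to count the items an EP shares with every other EP. This scoring rule is an assumption, since the slide does not define the overlap measure; the EPs are the ones listed on the slide, and the function names are hypothetical.

```python
# Emerging patterns listed on the slide.
EPS = [
    frozenset({"MSE", "Never Married"}),
    frozenset({"BA", "Divorced"}),
    frozenset({"BA", "Divorced", "Worker"}),
    frozenset({"BA", "Divorced", "Repairman"}),
    frozenset({"BA", "Divorced", "Own-Child"}),
]

def overlap_score(ep, eps):
    """Total number of items `ep` shares with the other patterns (assumed measure)."""
    return sum(len(ep & other) for other in eps if other is not ep)

def pick_ep(eps):
    """Return the pattern with the highest overlap score (first one on ties)."""
    return max(eps, key=lambda ep: overlap_score(ep, eps))

for ep in EPS:
    print(sorted(ep), overlap_score(ep, EPS))
print(sorted(pick_ep(EPS)))
```

Under this measure the four patterns containing BA and Divorced tie, and {MSE, Never Married} scores zero, so generalizing a BA/Divorced pattern is the choice most likely to affect several EPs at once.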
24
How to calculate utility gain?
  • Utility gain is a function of
  • Recoding Distance (RD)
  • Reduction of Growth Rate (RG)

25
How to calculate utility gain? - Recoding
Distance (RD)
  • The detailed derivation is given in the paper
  • Intuitively, it measures
  • How many FIS have been generalized, and by how much?
  • How many FIS have disappeared?
  • High-level definition of RD
  • $\theta_q \times (\text{generalized FIS}) + (1 - \theta_q) \times (\text{disappeared FIS})$
  • where $\theta_q$ is a user-defined parameter

The larger the value of RD, the greater the
distortion of the Frequent Itemsets
26
How to calculate utility gain? - Reduction of
Growth Rate (RG)
  • After a local recoding is applied, RG is defined as
  • The reduction in the growth rate of all EPs

Example: a local recoding that lowers the growth rate from 60 to 25 gives
RG = 60 - 25 = 35
27
How to calculate utility gain?
  • Putting all these together, utility gain is
    defined as
  • $\theta_p \times RG - (1 - \theta_p) \times RD$
  • where $\theta_p$ is a user-defined parameter
  • It favors
  • Local recodings that greatly reduce the growth
    rate
  • It penalizes
  • Local recodings that generate large distortion of
    the FIS
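
A hedged sketch of the utility gain as a weighted trade-off. Two things are assumptions: the operator between the two terms was lost in the transcript and is taken to be a subtraction because RD is described as a penalty, and the weight (called `theta_p` here) is taken to lie in [0, 1]. The RG value reuses the 60 and 25 from the RG slide; the RD value of 10 is made up for the example.

```python
def reduction_of_growth_rate(before, after):
    """RG: how much the growth rate of the EPs dropped after a local recoding."""
    return before - after

def utility_gain(rg, rd, theta_p):
    """Weighted trade-off: reward RG, penalize RD (assumed sign, see lead-in)."""
    if not 0 <= theta_p <= 1:
        raise ValueError("theta_p is assumed to lie in [0, 1]")
    return theta_p * rg - (1 - theta_p) * rd

rg = reduction_of_growth_rate(60, 25)   # 35, as on the RG slide
print(utility_gain(rg, rd=10, theta_p=0.5))
```

Raising `theta_p` puts more weight on hiding EPs; lowering it puts more weight on keeping the frequent itemsets intact.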

28
Experimental Setup
  • Dataset: Adult dataset from the UCI Repository
  • Popular benchmark dataset used for generalization
  • Total number of records: 30162
  • Income > 50k: 7508
  • Income < 50k: 22654
  • Only 8 categorical attributes are used in the experiments
  • A well-accepted hierarchy is defined
  • Parameters
  • Support of FIS: 40
  • Growth rate of EP: 5
  • Initial Temperature: 10
  • Cooling Rate: 0.4

29
Performance
  • Maximum RD: 623.1

[Figure: RD / No. of FIS disappeared for the Greedy Approach]
[Figure: RD / No. of FIS disappeared for the Simulated Annealing Style Approach (Best of 5)]
30
Runtime (in minutes)
[Figure: runtime of the Greedy Approach and of the Simulated Annealing Style Approach (Best of 5)]
31
Future Research Plan
  • Hide EPs in temporal datasets
  • Consider multi-level FIS
  • Hide a group of emerging patterns at a time

32
Q & A
  • Any Questions?