Title: Privacy preserving data mining
1. Privacy Preserving Data Mining: Randomized Response and Association Rule Hiding
Li Xiong, CS573 Data Privacy and Anonymity
Partial slides credit: W. Du, Syracuse University; Y. Gao, Peking University
2. Privacy Preserving Data Mining Techniques
- Protecting sensitive raw data
- Randomization (additive noise)
- Geometric perturbation and projection (multiplicative noise)
- Randomized response technique
- Categorical data perturbation in the data collection model
- Protecting sensitive knowledge (knowledge hiding)
3. Data Collection Model
Data cannot be shared directly because of privacy concerns.
4. Background: Randomized Response
Example: "Do you smoke?" The respondent flips a biased coin in private. Heads: answer truthfully. Tails: give the opposite answer. If the true answer is Yes, heads yields "Yes" and tails yields "No", so a single response reveals little about the individual.
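The coin-flip scheme above can be simulated, and the aggregate proportion recovered from the noisy answers. A minimal sketch, assuming a coin bias p = 0.7 and a true smoking rate of 30% (both demo values, not from the slides):

```python
import random

def randomize(true_answer: bool, p: float) -> bool:
    """Report the truth with probability p, the opposite otherwise."""
    return true_answer if random.random() < p else not true_answer

def estimate_pi(responses, p):
    """Unbiased estimate of the true 'yes' proportion pi.
    P(yes reported) = p*pi + (1-p)*(1-pi)  =>  pi = (lam + p - 1) / (2p - 1)."""
    lam = sum(responses) / len(responses)
    return (lam + p - 1) / (2 * p - 1)

random.seed(0)
p = 0.7            # coin bias: answer truthfully with probability 0.7
true_pi = 0.3      # assumed true fraction of smokers (demo only)
truths = [random.random() < true_pi for _ in range(100_000)]
reports = [randomize(t, p) for t in truths]
print(round(estimate_pi(reports, p), 2))  # close to 0.3
```

No individual report is trustworthy, yet the population estimate converges to the true rate as the sample grows.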
5. Decision Tree Mining Using Randomized Response
- Multiple attributes encoded in bits
- Column distributions can be estimated for learning a decision tree!
Using Randomized Response Techniques for Privacy-Preserving Data Mining, Du, 2003
6. Accuracy of Decision Tree Built on Randomized Response
7. Generalization for Multi-Valued Categorical Data
[Diagram: a randomization matrix M maps the true value Si to a reported value; with probabilities q1, q2, q3, q4 the report is Si, Si1, Si2, or Si3.]
8. A Generalization
- RR matrices: Warner '65, R. Agrawal '05, S. Agrawal '05
- The RR matrix can be arbitrary
- Can we find optimal RR matrices?
OptRR: Optimizing Randomized Response Schemes for Privacy-Preserving Data Mining, Huang, 2008
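For any invertible RR matrix M, the original distribution can be recovered from the reported one. A hedged sketch with an illustrative 3-value matrix (the values are assumptions for the demo, not a matrix from the paper):

```python
import numpy as np

# Hypothetical 3-value RR matrix: column i gives the probabilities of
# reporting each value when the true value is i (columns sum to 1).
M = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

true_dist = np.array([0.5, 0.3, 0.2])    # assumed true distribution
observed = M @ true_dist                 # expected reported distribution
estimate = np.linalg.solve(M, observed)  # recover the true distribution
print(np.round(estimate, 2))             # [0.5 0.3 0.2]
```

In practice `observed` is the empirical distribution of the randomized reports, so the recovered estimate carries sampling noise; the closer M is to singular, the more that noise is amplified.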
9. What Is an Optimal Matrix?
- Which of the following is better?
- Privacy: M2 is better. Utility: M1 is better.
So, what is an optimal matrix?
10. Optimal RR Matrix
- An RR matrix M is optimal if no other RR matrix's privacy and utility are both better than M's (i.e., no other matrix dominates M).
- Privacy quantification
- Utility quantification
- A number of privacy and utility metrics have been proposed.
- Privacy: how accurately one can estimate individual info.
- Utility: how accurately we can estimate aggregate info.
11. Metrics
- Privacy: accuracy of the estimate of individual values
- Utility: difference between the original probability and the estimated probability
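The dominance notion behind "optimal" can be made concrete. A small sketch using hypothetical (privacy, utility) scores, assuming higher is better on both axes:

```python
def dominates(a, b):
    """a dominates b if a is at least as good on both metrics
    and not identical to b (higher = better for both)."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b

def pareto_front(matrices):
    """Keep only the candidates not dominated by any other."""
    return [m for m in matrices
            if not any(dominates(o, m) for o in matrices if o is not m)]

# hypothetical (privacy, utility) scores for five RR matrices
scores = [(0.9, 0.2), (0.7, 0.5), (0.5, 0.7), (0.6, 0.4), (0.2, 0.9)]
print(pareto_front(scores))  # (0.6, 0.4) is dominated by (0.7, 0.5)
```

The surviving points are exactly the Pareto-optimal set: no candidate on the front can be improved on one metric without losing on the other.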
12. Optimization Methods
- Approach 1: weighted sum, w1 * Privacy + w2 * Utility
- Approach 2:
- Fix privacy, find M with the optimal utility.
- Fix utility, find M with the optimal privacy.
- Challenge: it is difficult to generate M with a fixed privacy or utility.
- Proposed approach: multi-objective optimization
13. Optimization Algorithm
- Evolutionary Multi-Objective Optimization (EMOO)
- The algorithm:
- Start with a set of initial RR matrices
- Repeat the following steps in each iteration:
- Mating: select two RR matrices from the pool
- Crossover: exchange several columns between the two RR matrices
- Mutation: change some values in an RR matrix
- Meet the privacy bound: filter the resultant matrices
- Evaluate the fitness values for the new RR matrices
- Note: the fitness value is defined in terms of the privacy and utility metrics
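The loop above can be sketched as follows. This is a toy: the crossover, mutation, fitness, and privacy-bound operators are caller-supplied stand-ins, not the paper's actual operators on RR matrices.

```python
import random

def dominated(f, fits):
    """f is dominated if some other fitness beats it on both objectives."""
    return any(o[0] >= f[0] and o[1] >= f[1] and o != f for o in fits)

def emoo(pool, fitness, crossover, mutate, meets_privacy_bound, iters=50):
    """Skeleton of the EMOO loop from the slide."""
    for _ in range(iters):
        if len(pool) < 2:
            break
        a, b = random.sample(pool, 2)        # mating: pick two parents
        child = mutate(crossover(a, b))      # crossover, then mutation
        if meets_privacy_bound(child):       # privacy-bound filter
            pool = pool + [child]
            fits = [fitness(m) for m in pool]  # evaluate fitness
            # keep only the non-dominated candidates (current Pareto set)
            pool = [m for m, f in zip(pool, fits) if not dominated(f, fits)]
    return pool

random.seed(1)
# toy stand-in: a "matrix" is just a (privacy, utility) pair here
pool = [(random.random(), random.random()) for _ in range(10)]
front = emoo(pool,
             fitness=lambda m: m,
             crossover=lambda a, b: (a[0], b[1]),
             mutate=lambda m: (m[0] + random.uniform(-0.05, 0.05),
                               m[1] + random.uniform(-0.05, 0.05)),
             meets_privacy_bound=lambda m: m[0] >= 0.2)
# every survivor is non-dominated within the final pool
```

The output pool approximates the Pareto front shown on the next slides; in the real algorithm each candidate is a full RR matrix and fitness comes from the privacy and utility metrics.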
14. Illustration
15. Output of Optimization
The optimal set is often plotted in the objective space as a Pareto front.
[Figure: candidate matrices M1-M8 in the privacy-utility plane; points toward higher privacy and utility are better, points toward the origin are worse; the non-dominated points form the Pareto front.]
16. For the First Attribute of the Adult Data
17. Privacy Preserving Data Mining Techniques
- Protecting sensitive raw data
- Randomization (additive noise)
- Geometric perturbation and projection (multiplicative noise)
- Randomized response technique
- Protecting sensitive knowledge (knowledge hiding)
- Frequent itemset and association rule hiding
- Downgrading classifier effectiveness
18. Frequent Itemset Mining and Association Rule Mining
- Frequent itemset mining: frequent sets of items in a transaction data set
- Association rules: associations between items
19. Frequent Itemset Mining and Association Rule Mining
- First proposed by Agrawal, Imielinski, and Swami in SIGMOD 1993
- SIGMOD Test of Time Award 2003: "This paper started a field of research. In addition to containing an innovative algorithm, its subject matter brought data mining to the attention of the database community. It even led, several years ago, to an IBM commercial, featuring supermodels, that touted the importance of work such as that contained in this paper."
- Apriori algorithm in VLDB 1994
- Ranked #4 among the top 10 data mining algorithms (ICDM 2006)
- R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. In SIGMOD '93.
- Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.
20. Basic Concepts: Frequent Patterns and Association Rules
- Itemset X = {x1, ..., xk} (k-itemset)
- Frequent itemset: X with minimum support count
- Support count (absolute support): count of transactions containing X
- Association rule: A → B with minimum support and confidence
- Support: probability that a transaction contains A ∪ B; s = P(A ∪ B)
- Confidence: conditional probability that a transaction having A also contains B; c = P(B | A)
- Association rule mining process:
- Find all frequent patterns (the more costly step)
- Generate strong association rules
Transaction-id Items bought
10 A, B, D
20 A, C, D
30 A, D, E
40 B, E, F
50 B, C, D, E, F
21. Illustration of Frequent Itemsets and Association Rules
Transaction-id Items bought
10 A, B, D
20 A, C, D
30 A, D, E
40 B, E, F
50 B, C, D, E, F
- Frequent itemsets (minimum support count 3)? A:3, B:3, D:4, E:3, AD:3
- Association rules (minimum support 50%, minimum confidence 50%)? A → D (60%, 100%); D → A (60%, 75%)
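The counts on this slide can be verified directly from the transaction table. A brute-force check (fine at this scale; Apriori would prune the search on larger data):

```python
from itertools import combinations

# transaction table from the slide
transactions = {
    10: {'A', 'B', 'D'},
    20: {'A', 'C', 'D'},
    30: {'A', 'D', 'E'},
    40: {'B', 'E', 'F'},
    50: {'B', 'C', 'D', 'E', 'F'},
}

def support_count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions.values() if itemset <= t)

# frequent itemsets with minimum support count 3
items = sorted(set().union(*transactions.values()))
frequent = {''.join(s): support_count(set(s))
            for k in range(1, len(items) + 1)
            for s in combinations(items, k)
            if support_count(set(s)) >= 3}
print(frequent)  # {'A': 3, 'B': 3, 'D': 4, 'E': 3, 'AD': 3}

# rule A -> D: support = P(A and D), confidence = P(D | A)
n = len(transactions)
print(support_count({'A', 'D'}) / n)                     # 0.6
print(support_count({'A', 'D'}) / support_count({'A'}))  # 1.0
```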
22. Association Rule Hiding: What? Why?
- Problem: hide sensitive association rules in data without losing non-sensitive rules
- Motivation: confidential rules may have serious adverse effects
23. Problem Statement
- Given
- a database D to be released
- minimum thresholds MST, MCT
- a set of association rules R mined from D
- a set of sensitive rules Rh ⊆ R to be hidden
- Find a new database D' such that
- the rules in Rh cannot be mined from D'
- as many of the rules in R − Rh as possible can still be mined from D'
24. Solutions
- Data modification approaches
- Basic idea: data sanitization, D → D'
- Approaches: distortion, blocking
- Drawbacks
- Hiding effects cannot be controlled intuitively; lots of I/O
- Data reconstruction approaches
- Basic idea: knowledge sanitization, D → K → D'
- Potential advantages
- Can easily control the availability of rules and the hiding effects directly, intuitively, and handily
25. Distortion-Based Techniques
Sample database:
A B C D
1 1 1 0
1 0 1 1
0 0 0 1
1 1 1 0
1 0 1 1
Distorted database:
A B C D
1 1 1 0
1 0 0 1
0 0 0 1
1 1 1 0
1 0 0 1
Rule A → C had support(A → C) = 80% and confidence(A → C) = 100%. Rule A → C now has support(A → C) = 40% and confidence(A → C) = 50%.
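The before/after figures can be checked against the two tables above:

```python
# binary matrices from the slide: rows are transactions, columns A, B, C, D
original = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
]
distorted = [row[:] for row in original]
distorted[1][2] = 0   # flip C in transaction 2 (1 -> 0)
distorted[4][2] = 0   # flip C in transaction 5 (1 -> 0)

def rule_stats(db, a_col, c_col):
    """(support, confidence) of the rule a_col -> c_col."""
    n = len(db)
    a = sum(1 for r in db if r[a_col])
    ac = sum(1 for r in db if r[a_col] and r[c_col])
    return ac / n, ac / a

print(rule_stats(original, 0, 2))   # (0.8, 1.0)
print(rule_stats(distorted, 0, 2))  # (0.4, 0.5)
```

Flipping just two 1s to 0s drops the rule below typical thresholds, which is exactly the distortion idea.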
26. Side Effects
Before hiding process | After hiding process | Side effect
Rule Ri had conf(Ri) > MCT | Rule Ri now has conf(Ri) < MCT | Rule eliminated (undesirable side effect)
Rule Ri had conf(Ri) < MCT | Rule Ri now has conf(Ri) > MCT | Ghost rule (undesirable side effect)
Large itemset I had sup(I) > MST | Itemset I now has sup(I) < MST | Itemset eliminated (undesirable side effect)
27. Distortion-Based Techniques
- Challenges/goals:
- Minimize the undesirable side effects that the hiding process causes to non-sensitive rules.
- Minimize the number of 1s that must be deleted in the database.
- Algorithms must be linear in time as the database increases in size.
28. Sensitive Itemset: ABC
29. Data Distortion (Atallah '99)
- Hardness result
- The distortion problem is NP-hard
- Heuristic search
- Find items to remove and transactions to remove the items from
Disclosure Limitation of Sensitive Rules, M. Atallah, A. Elmagarmid, M. Ibrahim, E. Bertino, V. Verykios, 1999
30. (No transcript)
31. Heuristic Approach
- Greedy bottom-up search through the ancestors (subsets) of the sensitive itemset for the parent with maximum support (why?)
- At the end of the search, a 1-itemset is selected
- Search through the common transactions containing the item and the sensitive itemset for the transaction that affects the minimum number of 2-itemsets
- Delete the selected item from the identified transaction
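A toy sketch of the greedy steps above, run on the six-transaction example database that appears later in the deck. This is an illustration in the spirit of the Atallah et al. heuristic, not the paper's exact algorithm; in particular, "affects the fewest 2-itemsets" is approximated by picking the shortest supporting transaction.

```python
def support(db, itemset):
    return sum(1 for t in db if itemset <= t)

def hide_one(db, sensitive):
    """One greedy step: walk down from the sensitive itemset through
    maximum-support subsets to a single item, then delete that item from
    the shortest transaction supporting the sensitive itemset."""
    current = set(sensitive)
    while len(current) > 1:
        # move to the (k-1)-subset with maximum support
        current = max((current - {i} for i in sorted(current)),
                      key=lambda s: support(db, s))
    (item,) = current
    candidates = [t for t in db if set(sensitive) <= t]
    victim = min(candidates, key=len)   # fewest other items affected
    victim.discard(item)

# transactions T1..T6 from the running example
db = [{'A', 'B', 'C', 'E'}, {'A', 'B', 'C'}, {'A', 'B', 'C', 'D'},
      {'A', 'B', 'D'}, {'A', 'D'}, {'A', 'C', 'D'}]
while support(db, {'A', 'B', 'C'}) >= 3:   # push ABC below support 3
    hide_one(db, {'A', 'B', 'C'})
print(support(db, {'A', 'B', 'C'}))        # 2
```

Walking toward the highest-support subsets steers the deletion away from itemsets that are barely frequent, which is why the slide asks "why?" about the max-support choice.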
32. (No transcript)
33. Results Comparison
34. Blocking-Based Techniques
Initial database:
A B C D
1 1 1 0
1 0 1 1
0 0 0 1
1 1 1 0
1 0 1 1
New database:
A B C D
1 1 1 0
1 0 ? 1
? 0 0 1
1 1 1 0
1 0 1 1
Support and confidence become intervals. In the new database: 60% ≤ conf(A → C) ≤ 100%.
35. Data Reconstruction Approach
[Diagram: the original database D, its frequent sets FS, an FP-tree, and the database reconstructed from the sanitized knowledge.]
36. The First Two Phases
- 1. Frequent set mining
- Generate all frequent itemsets, with their supports and support counts, FS, from the original database D
- 2. Perform the sanitization algorithm
- Input: FS (the output of phase 1), R, Rh
- Output: sanitized frequent itemsets FS'
- Process:
- Select a hiding strategy
- Identify sensitive frequent sets
- Perform sanitization
In the best case, the sanitization algorithm can ensure that from FS' we get exactly the non-sensitive rule set R − Rh.
37. Example: The First Two Phases
1. Frequent set mining
Original database D:
TID | Items
T1 | A, B, C, E
T2 | A, B, C
T3 | A, B, C, D
T4 | A, B, D
T5 | A, D
T6 | A, C, D
2. Perform the sanitization algorithm
38. Open Research Questions
- Optimal solutions
- Itemset sanitization
- The support and confidence of the rules in R − Rh should remain as unchanged as possible
- Integrating data protection and knowledge (rule) protection
39. Coming Up
- Cryptographic protocols for privacy-preserving distributed data mining
40. Classification of Current Algorithms
41. Weight-Based Sorting Distortion Algorithm (WSDA), Pontikakis '03
- High-level description
- Input:
- Initial database
- Set of sensitive rules
- Safety margin (for example, 10%)
- Output:
- Sanitized database
- Sensitive rules no longer hold in the database
42. WSDA Algorithm
- High-level description
- Step 1:
- Retrieve the set of transactions that support the sensitive rule RS
- For each sensitive rule RS, find the number N1 of transactions in which one item supporting the rule will be deleted
43. WSDA Algorithm
- High-level description
- Step 2:
- For each rule Ri in the database with items in common with RS, compute a weight w that denotes how strong Ri is
- For each transaction that supports RS, compute a priority Pi that denotes how many strong rules the transaction supports
44. WSDA Algorithm
- High-level description
- Step 3:
- Sort the N1 transactions in ascending order according to their priority values Pi
- Step 4:
- For the first N1 transactions, hide an item that is contained in RS
45. WSDA Algorithm
- High-level description
- Step 5:
- Update the confidence and support values for the other rules in the database
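The WSDA steps above can be sketched end to end. The weight and priority formulas here are simple stand-ins (the slides do not give the paper's exact definitions), and the example database and rule weights are hypothetical:

```python
def wsda_sketch(db, sensitive_rule, n1, rule_weights):
    """Sketch of WSDA steps 1-4; weights/priorities are illustrative."""
    lhs, rhs = sensitive_rule
    # step 1: transactions supporting the sensitive rule
    supporting = [t for t in db if lhs <= t and rhs <= t]
    # step 2: priority = total weight of the rules a transaction supports
    def priority(t):
        return sum(w for (a, b), w in rule_weights.items()
                   if a <= t and b <= t)
    # step 3: sort ascending, so the weakest transactions come first
    supporting.sort(key=priority)
    # step 4: hide an item of the rule in the first n1 transactions
    for t in supporting[:n1]:
        t.discard(min(rhs))
    return db

# hypothetical database, rows as item sets; goal: hide A -> C
db = [{'A', 'B', 'C'}, {'A', 'C', 'D'}, {'D'},
      {'A', 'B', 'C'}, {'A', 'C', 'D'}]
sensitive = (frozenset('A'), frozenset('C'))
weights = {(frozenset('A'), frozenset('C')): 1.0,
           (frozenset('A'), frozenset('D')): 0.5}   # assumed rule weights
wsda_sketch(db, sensitive, n1=2, rule_weights=weights)
conf = (sum(1 for t in db if {'A', 'C'} <= t)
        / sum(1 for t in db if 'A' in t))
print(conf)  # 0.5
```

Sorting ascending by priority means items are removed from the transactions that support the fewest strong rules, which is how the algorithm limits side effects on non-sensitive rules (step 5 then refreshes the remaining rules' statistics).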
46. Discussion
Proposed solution:
- Sanitization algorithm
- Compared with earlier popular data sanitization, it performs sanitization directly on the knowledge level of the data
- Inverse frequent set mining algorithm
- Deals with frequent items and infrequent items separately; more efficient; a large number of outputs
Our solution provides the user with a knowledge-level window to perform sanitization handily and generates a number of secure databases.