Title: Privacy preserving data mining
1. Privacy Preserving Data Mining: Randomized Response and Association Rule Hiding
Li Xiong, CS573 Data Privacy and Anonymity
Partial slides credit: W. Du, Syracuse University; Y. Gao, Peking University
2. Privacy Preserving Data Mining Techniques
- Protecting sensitive raw data
- Randomization (additive noise)
- Geometric perturbation and projection (multiplicative noise)
- Randomized response technique
- Categorical data perturbation in the data collection model
- Protecting sensitive knowledge (knowledge hiding)
3. Data Collection Model
Data cannot be shared directly because of privacy concerns.
4. Background: Randomized Response
Example: "Do you smoke?" The respondent flips a biased coin in private. Heads: answer truthfully. Tails: give the opposite answer. If the true answer is Yes, heads yields "Yes" and tails yields "No", so a single response reveals little about the individual.
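The coin-flip scheme above can be simulated, and the aggregate proportion recovered from the noisy answers. A minimal sketch, assuming a coin bias p = 0.7 and a true smoking rate of 30% (both demo values, not from the slides):

```python
import random

def randomize(true_answer: bool, p: float) -> bool:
    """Report the truth with probability p, the opposite otherwise."""
    return true_answer if random.random() < p else not true_answer

def estimate_pi(responses, p):
    """Unbiased estimate of the true 'yes' proportion pi.
    P(yes reported) = p*pi + (1-p)*(1-pi)  =>  pi = (lam + p - 1) / (2p - 1)."""
    lam = sum(responses) / len(responses)
    return (lam + p - 1) / (2 * p - 1)

random.seed(0)
p = 0.7            # coin bias: answer truthfully with probability 0.7
true_pi = 0.3      # assumed true fraction of smokers (demo only)
truths = [random.random() < true_pi for _ in range(100_000)]
reports = [randomize(t, p) for t in truths]
print(round(estimate_pi(reports, p), 2))  # close to 0.3
```

No individual report is trustworthy, yet the population estimate converges to the true rate as the sample grows.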
5. Decision Tree Mining Using Randomized Response
- Multiple attributes encoded in bits
- Column distributions can be estimated for learning a decision tree!
Using Randomized Response Techniques for Privacy-Preserving Data Mining, Du, 2003
6. Accuracy of Decision Tree Built on Randomized Response
7. Generalization for Multi-Valued Categorical Data
[Diagram: a randomization matrix M maps the true value Si to a reported value; with probabilities q1, q2, q3, q4 the report is Si, Si1, Si2, or Si3.]
8. A Generalization
- RR matrices: Warner '65, R. Agrawal '05, S. Agrawal '05
- The RR matrix can be arbitrary
- Can we find optimal RR matrices?
OptRR: Optimizing Randomized Response Schemes for Privacy-Preserving Data Mining, Huang, 2008
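For any invertible RR matrix M, the original distribution can be recovered from the reported one. A hedged sketch with an illustrative 3-value matrix (the values are assumptions for the demo, not a matrix from the paper):

```python
import numpy as np

# Hypothetical 3-value RR matrix: column i gives the probabilities of
# reporting each value when the true value is i (columns sum to 1).
M = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

true_dist = np.array([0.5, 0.3, 0.2])    # assumed true distribution
observed = M @ true_dist                 # expected reported distribution
estimate = np.linalg.solve(M, observed)  # recover the true distribution
print(np.round(estimate, 2))             # [0.5 0.3 0.2]
```

In practice `observed` is the empirical distribution of the randomized reports, so the recovered estimate carries sampling noise; the closer M is to singular, the more that noise is amplified.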
9. What Is an Optimal Matrix?
- Which of the following is better?
- Privacy: M2 is better. Utility: M1 is better.
So, what is an optimal matrix?
10. Optimal RR Matrix
- An RR matrix M is optimal if no other RR matrix's privacy and utility are both better than M's (i.e., no other matrix dominates M).
- Privacy quantification
- Utility quantification
- A number of privacy and utility metrics have been proposed.
- Privacy: how accurately one can estimate individual info.
- Utility: how accurately we can estimate aggregate info.
11. Metrics
- Privacy: accuracy of the estimate of individual values
- Utility: difference between the original probability and the estimated probability
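The dominance notion behind "optimal" can be made concrete. A small sketch using hypothetical (privacy, utility) scores, assuming higher is better on both axes:

```python
def dominates(a, b):
    """a dominates b if a is at least as good on both metrics
    and not identical to b (higher = better for both)."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b

def pareto_front(matrices):
    """Keep only the candidates not dominated by any other."""
    return [m for m in matrices
            if not any(dominates(o, m) for o in matrices if o is not m)]

# hypothetical (privacy, utility) scores for five RR matrices
scores = [(0.9, 0.2), (0.7, 0.5), (0.5, 0.7), (0.6, 0.4), (0.2, 0.9)]
print(pareto_front(scores))  # (0.6, 0.4) is dominated by (0.7, 0.5)
```

The surviving points are exactly the Pareto-optimal set: no candidate on the front can be improved on one metric without losing on the other.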
12. Optimization Methods
- Approach 1: weighted sum, w1 * Privacy + w2 * Utility
- Approach 2:
- Fix privacy, find M with the optimal utility.
- Fix utility, find M with the optimal privacy.
- Challenge: it is difficult to generate M with a fixed privacy or utility.
- Proposed approach: multi-objective optimization
13. Optimization Algorithm
- Evolutionary Multi-Objective Optimization (EMOO)
- The algorithm:
- Start with a set of initial RR matrices
- Repeat the following steps in each iteration:
- Mating: select two RR matrices from the pool
- Crossover: exchange several columns between the two RR matrices
- Mutation: change some values in an RR matrix
- Meet the privacy bound: filter the resultant matrices
- Evaluate the fitness values for the new RR matrices
- Note: the fitness value is defined in terms of the privacy and utility metrics
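The loop above can be sketched as follows. This is a toy: the crossover, mutation, fitness, and privacy-bound operators are caller-supplied stand-ins, not the paper's actual operators on RR matrices.

```python
import random

def dominated(f, fits):
    """f is dominated if some other fitness beats it on both objectives."""
    return any(o[0] >= f[0] and o[1] >= f[1] and o != f for o in fits)

def emoo(pool, fitness, crossover, mutate, meets_privacy_bound, iters=50):
    """Skeleton of the EMOO loop from the slide."""
    for _ in range(iters):
        if len(pool) < 2:
            break
        a, b = random.sample(pool, 2)        # mating: pick two parents
        child = mutate(crossover(a, b))      # crossover, then mutation
        if meets_privacy_bound(child):       # privacy-bound filter
            pool = pool + [child]
            fits = [fitness(m) for m in pool]  # evaluate fitness
            # keep only the non-dominated candidates (current Pareto set)
            pool = [m for m, f in zip(pool, fits) if not dominated(f, fits)]
    return pool

random.seed(1)
# toy stand-in: a "matrix" is just a (privacy, utility) pair here
pool = [(random.random(), random.random()) for _ in range(10)]
front = emoo(pool,
             fitness=lambda m: m,
             crossover=lambda a, b: (a[0], b[1]),
             mutate=lambda m: (m[0] + random.uniform(-0.05, 0.05),
                               m[1] + random.uniform(-0.05, 0.05)),
             meets_privacy_bound=lambda m: m[0] >= 0.2)
# every survivor is non-dominated within the final pool
```

The output pool approximates the Pareto front shown on the next slides; in the real algorithm each candidate is a full RR matrix and fitness comes from the privacy and utility metrics.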
14. Illustration
15. Output of Optimization
The optimal set is often plotted in the objective space as a Pareto front.
[Figure: candidate matrices M1-M8 in the privacy-utility plane; points toward higher privacy and utility are better, points toward the origin are worse; the non-dominated points form the Pareto front.]
16. For the First Attribute of the Adult Data
17. Privacy Preserving Data Mining Techniques
- Protecting sensitive raw data
- Randomization (additive noise)
- Geometric perturbation and projection (multiplicative noise)
- Randomized response technique
- Protecting sensitive knowledge (knowledge hiding)
- Frequent itemset and association rule hiding
- Downgrading classifier effectiveness
18. Frequent Itemset Mining and Association Rule Mining
- Frequent itemset mining: frequent sets of items in a transaction data set
- Association rules: associations between items
19. Frequent Itemset Mining and Association Rule Mining
- First proposed by Agrawal, Imielinski, and Swami in SIGMOD 1993
- SIGMOD Test of Time Award 2003: "This paper started a field of research. In addition to containing an innovative algorithm, its subject matter brought data mining to the attention of the database community. It even led, several years ago, to an IBM commercial, featuring supermodels, that touted the importance of work such as that contained in this paper."
- Apriori algorithm in VLDB 1994
- Ranked #4 among the top 10 data mining algorithms (ICDM 2006)
- R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. In SIGMOD '93.
- Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.
20. Basic Concepts: Frequent Patterns and Association Rules
- Itemset X = {x1, ..., xk} (k-itemset)
- Frequent itemset: X with minimum support count
- Support count (absolute support): count of transactions containing X
- Association rule: A → B with minimum support and confidence
- Support: probability that a transaction contains A ∪ B; s = P(A ∪ B)
- Confidence: conditional probability that a transaction having A also contains B; c = P(B | A)
- Association rule mining process:
- Find all frequent patterns (the more costly step)
- Generate strong association rules
Transaction-id Items bought
10 A, B, D
20 A, C, D
30 A, D, E
40 B, E, F
50 B, C, D, E, F
21. Illustration of Frequent Itemsets and Association Rules
Transaction-id Items bought
10 A, B, D
20 A, C, D
30 A, D, E
40 B, E, F
50 B, C, D, E, F
- Frequent itemsets (minimum support count 3)? A:3, B:3, D:4, E:3, AD:3
- Association rules (minimum support 50%, minimum confidence 50%)? A → D (60%, 100%); D → A (60%, 75%)
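The counts on this slide can be verified directly from the transaction table. A brute-force check (fine at this scale; Apriori would prune the search on larger data):

```python
from itertools import combinations

# transaction table from the slide
transactions = {
    10: {'A', 'B', 'D'},
    20: {'A', 'C', 'D'},
    30: {'A', 'D', 'E'},
    40: {'B', 'E', 'F'},
    50: {'B', 'C', 'D', 'E', 'F'},
}

def support_count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions.values() if itemset <= t)

# frequent itemsets with minimum support count 3
items = sorted(set().union(*transactions.values()))
frequent = {''.join(s): support_count(set(s))
            for k in range(1, len(items) + 1)
            for s in combinations(items, k)
            if support_count(set(s)) >= 3}
print(frequent)  # {'A': 3, 'B': 3, 'D': 4, 'E': 3, 'AD': 3}

# rule A -> D: support = P(A and D), confidence = P(D | A)
n = len(transactions)
print(support_count({'A', 'D'}) / n)                     # 0.6
print(support_count({'A', 'D'}) / support_count({'A'}))  # 1.0
```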
22. Association Rule Hiding: What? Why?
- Problem: hide sensitive association rules in data without losing non-sensitive rules
- Motivation: confidential rules may have serious adverse effects
23. Problem Statement
- Given
- a database D to be released
- minimum thresholds MST, MCT
- a set of association rules R mined from D
- a set of sensitive rules Rh ⊆ R to be hidden
- Find a new database D' such that
- the rules in Rh cannot be mined from D'
- as many of the rules in R − Rh as possible can still be mined from D'
24. Solutions
- Data modification approaches
- Basic idea: data sanitization, D → D'
- Approaches: distortion, blocking
- Drawbacks
- Hiding effects cannot be controlled intuitively; lots of I/O
- Data reconstruction approaches
- Basic idea: knowledge sanitization, D → K → D'
- Potential advantages
- Can easily control the availability of rules and the hiding effects directly, intuitively, and handily
25. Distortion-Based Techniques
Sample database:
A B C D
1 1 1 0
1 0 1 1
0 0 0 1
1 1 1 0
1 0 1 1
Distorted database:
A B C D
1 1 1 0
1 0 0 1
0 0 0 1
1 1 1 0
1 0 0 1
Rule A → C had support(A → C) = 80% and confidence(A → C) = 100%. Rule A → C now has support(A → C) = 40% and confidence(A → C) = 50%.
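The before/after figures can be checked against the two tables above:

```python
# binary matrices from the slide: rows are transactions, columns A, B, C, D
original = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
]
distorted = [row[:] for row in original]
distorted[1][2] = 0   # flip C in transaction 2 (1 -> 0)
distorted[4][2] = 0   # flip C in transaction 5 (1 -> 0)

def rule_stats(db, a_col, c_col):
    """(support, confidence) of the rule a_col -> c_col."""
    n = len(db)
    a = sum(1 for r in db if r[a_col])
    ac = sum(1 for r in db if r[a_col] and r[c_col])
    return ac / n, ac / a

print(rule_stats(original, 0, 2))   # (0.8, 1.0)
print(rule_stats(distorted, 0, 2))  # (0.4, 0.5)
```

Flipping just two 1s to 0s drops the rule below typical thresholds, which is exactly the distortion idea.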
26. Side Effects
Before hiding process | After hiding process | Side effect
Rule Ri had conf(Ri) > MCT | Rule Ri now has conf(Ri) < MCT | Rule eliminated (undesirable side effect)
Rule Ri had conf(Ri) < MCT | Rule Ri now has conf(Ri) > MCT | Ghost rule (undesirable side effect)
Large itemset I had sup(I) > MST | Itemset I now has sup(I) < MST | Itemset eliminated (undesirable side effect)
27. Distortion-Based Techniques
- Challenges/goals:
- Minimize the undesirable side effects that the hiding process causes to non-sensitive rules.
- Minimize the number of 1s that must be deleted in the database.
- Algorithms must be linear in time as the database increases in size.
28. Sensitive Itemset: ABC
29. Data Distortion (Atallah '99)
- Hardness result
- The distortion problem is NP-hard
- Heuristic search
- Find items to remove and transactions to remove the items from
Disclosure Limitation of Sensitive Rules, M. Atallah, A. Elmagarmid, M. Ibrahim, E. Bertino, V. Verykios, 1999
30. (No transcript)
31. Heuristic Approach
- Greedy bottom-up search through the ancestors (subsets) of the sensitive itemset for the parent with maximum support (why?)
- At the end of the search, a 1-itemset is selected
- Search through the common transactions containing the item and the sensitive itemset for the transaction that affects the minimum number of 2-itemsets
- Delete the selected item from the identified transaction
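A toy sketch of the greedy steps above, run on the six-transaction example database that appears later in the deck. This is an illustration in the spirit of the Atallah et al. heuristic, not the paper's exact algorithm; in particular, "affects the fewest 2-itemsets" is approximated by picking the shortest supporting transaction.

```python
def support(db, itemset):
    return sum(1 for t in db if itemset <= t)

def hide_one(db, sensitive):
    """One greedy step: walk down from the sensitive itemset through
    maximum-support subsets to a single item, then delete that item from
    the shortest transaction supporting the sensitive itemset."""
    current = set(sensitive)
    while len(current) > 1:
        # move to the (k-1)-subset with maximum support
        current = max((current - {i} for i in sorted(current)),
                      key=lambda s: support(db, s))
    (item,) = current
    candidates = [t for t in db if set(sensitive) <= t]
    victim = min(candidates, key=len)   # fewest other items affected
    victim.discard(item)

# transactions T1..T6 from the running example
db = [{'A', 'B', 'C', 'E'}, {'A', 'B', 'C'}, {'A', 'B', 'C', 'D'},
      {'A', 'B', 'D'}, {'A', 'D'}, {'A', 'C', 'D'}]
while support(db, {'A', 'B', 'C'}) >= 3:   # push ABC below support 3
    hide_one(db, {'A', 'B', 'C'})
print(support(db, {'A', 'B', 'C'}))        # 2
```

Walking toward the highest-support subsets steers the deletion away from itemsets that are barely frequent, which is why the slide asks "why?" about the max-support choice.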
32. (No transcript)
33. Results Comparison
34. Blocking-Based Techniques
Initial database:
A B C D
1 1 1 0
1 0 1 1
0 0 0 1
1 1 1 0
1 0 1 1
New database:
A B C D
1 1 1 0
1 0 ? 1
? 0 0 1
1 1 1 0
1 0 1 1
Support and confidence become intervals. In the new database: 60% ≤ conf(A → C) ≤ 100%.
35. Data Reconstruction Approach
[Diagram: the original database D, its frequent sets FS, an FP-tree, and the database reconstructed from the sanitized knowledge.]
36. The First Two Phases
- 1. Frequent set mining
- Generate all frequent itemsets, with their supports and support counts, FS, from the original database D
- 2. Perform the sanitization algorithm
- Input: FS (the output of phase 1), R, Rh
- Output: sanitized frequent itemsets FS'
- Process:
- Select a hiding strategy
- Identify sensitive frequent sets
- Perform sanitization
In the best case, the sanitization algorithm can ensure that from FS' we get exactly the non-sensitive rule set R − Rh.
37. Example: The First Two Phases
1. Frequent set mining
Original database D:
TID | Items
T1 | A, B, C, E
T2 | A, B, C
T3 | A, B, C, D
T4 | A, B, D
T5 | A, D
T6 | A, C, D
2. Perform the sanitization algorithm
38. Open Research Questions
- Optimal solutions
- Itemset sanitization
- The support and confidence of the rules in R − Rh should remain as unchanged as possible
- Integrating data protection and knowledge (rule) protection
39. Coming Up
- Cryptographic protocols for privacy-preserving distributed data mining
40. Classification of Current Algorithms
41. Weight-Based Sorting Distortion Algorithm (WSDA), Pontikakis '03
- High-level description
- Input:
- Initial database
- Set of sensitive rules
- Safety margin (for example, 10%)
- Output:
- Sanitized database
- Sensitive rules no longer hold in the database
42. WSDA Algorithm
- High-level description
- Step 1:
- Retrieve the set of transactions that support the sensitive rule RS
- For each sensitive rule RS, find the number N1 of transactions in which one item supporting the rule will be deleted
43. WSDA Algorithm
- High-level description
- Step 2:
- For each rule Ri in the database with items in common with RS, compute a weight w that denotes how strong Ri is
- For each transaction that supports RS, compute a priority Pi that denotes how many strong rules the transaction supports
44. WSDA Algorithm
- High-level description
- Step 3:
- Sort the N1 transactions in ascending order according to their priority values Pi
- Step 4:
- For the first N1 transactions, hide an item that is contained in RS
45. WSDA Algorithm
- High-level description
- Step 5:
- Update the confidence and support values for the other rules in the database
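The WSDA steps above can be sketched end to end. The weight and priority formulas here are simple stand-ins (the slides do not give the paper's exact definitions), and the example database and rule weights are hypothetical:

```python
def wsda_sketch(db, sensitive_rule, n1, rule_weights):
    """Sketch of WSDA steps 1-4; weights/priorities are illustrative."""
    lhs, rhs = sensitive_rule
    # step 1: transactions supporting the sensitive rule
    supporting = [t for t in db if lhs <= t and rhs <= t]
    # step 2: priority = total weight of the rules a transaction supports
    def priority(t):
        return sum(w for (a, b), w in rule_weights.items()
                   if a <= t and b <= t)
    # step 3: sort ascending, so the weakest transactions come first
    supporting.sort(key=priority)
    # step 4: hide an item of the rule in the first n1 transactions
    for t in supporting[:n1]:
        t.discard(min(rhs))
    return db

# hypothetical database, rows as item sets; goal: hide A -> C
db = [{'A', 'B', 'C'}, {'A', 'C', 'D'}, {'D'},
      {'A', 'B', 'C'}, {'A', 'C', 'D'}]
sensitive = (frozenset('A'), frozenset('C'))
weights = {(frozenset('A'), frozenset('C')): 1.0,
           (frozenset('A'), frozenset('D')): 0.5}   # assumed rule weights
wsda_sketch(db, sensitive, n1=2, rule_weights=weights)
conf = (sum(1 for t in db if {'A', 'C'} <= t)
        / sum(1 for t in db if 'A' in t))
print(conf)  # 0.5
```

Sorting ascending by priority means items are removed from the transactions that support the fewest strong rules, which is how the algorithm limits side effects on non-sensitive rules (step 5 then refreshes the remaining rules' statistics).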
46. Discussion
Proposed solution:
- Sanitization algorithm
- Compared with earlier popular data sanitization, it performs sanitization directly on the knowledge level of the data
- Inverse frequent set mining algorithm
- Deals with frequent items and infrequent items separately; more efficient; a large number of outputs
Our solution provides the user with a knowledge-level window to perform sanitization handily and generates a number of secure databases.