Privacy preserving data mining - PowerPoint PPT Presentation

1
Privacy preserving data mining: randomized
response and association rule hiding
Li Xiong, CS573 Data Privacy and Anonymity
Partial slides credit: W. Du, Syracuse
University; Y. Gao, Peking University
2
Privacy Preserving Data Mining Techniques
  • Protecting sensitive raw data
  • Randomization (additive noise)
  • Geometric perturbation and projection
    (multiplicative noise)
  • Randomized response technique
  • Categorical data perturbation in data collection
    model
  • Protecting sensitive knowledge (knowledge hiding)

3
Data Collection Model
Data cannot be shared directly because of privacy
concerns
4
Background: Randomized Response
Example: "Do you smoke?" (the true answer is Yes).
The respondent flips a biased coin in private: on
heads, answer truthfully ("Yes"); on tails, give the
opposite answer ("No").
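A minimal sketch of this scheme in Python (the 70/30 coin bias, sample size, and function names are illustrative, not from the slides): with truth-telling probability p, the observed yes-rate is lambda = p*pi + (1-p)*(1-pi), which inverts to pi = (lambda - (1-p)) / (2p - 1).

```python
import random

def randomized_response(truth, p_truth, rng):
    """Answer truthfully on 'heads' (prob. p_truth), lie on 'tails'."""
    return truth if rng.random() < p_truth else not truth

def estimate_true_rate(observed_yes_rate, p_truth):
    """Invert P(yes) = p*pi + (1-p)*(1-pi) to recover the true rate pi."""
    return (observed_yes_rate - (1 - p_truth)) / (2 * p_truth - 1)

# Simulate 100,000 respondents, 30% of whom truly smoke.
rng = random.Random(0)
answers = [randomized_response(rng.random() < 0.3, 0.7, rng)
           for _ in range(100_000)]
observed = sum(answers) / len(answers)
estimate = estimate_true_rate(observed, 0.7)   # close to 0.3
```

Each individual answer is deniable (it may be a coin-forced lie), yet the aggregate rate is still recoverable.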
5
Decision Tree Mining using Randomized Response
  • Multiple attributes encoded in bits
  • Column distribution can be estimated for learning
    a decision tree!

Using Randomized Response Techniques for
Privacy-Preserving Data Mining, Du, 2003
6
Accuracy of Decision tree built on randomized
response
7
Generalization for Multi-Valued Categorical Data
The true value S_i is reported as one of S_i, S_i+1,
S_i+2, S_i+3 with probabilities q1, q2, q3, q4; these
reporting distributions form the columns of an RR
matrix M.
8
A Generalization
  • RR matrices: Warner '65, R. Agrawal '05, S.
    Agrawal '05
  • The RR matrix can be arbitrary
  • Can we find optimal RR matrices?

OptRR: Optimizing Randomized Response Schemes for
Privacy-Preserving Data Mining, Huang, 2008
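To illustrate how an arbitrary RR matrix is used, here is a small sketch: the observed distribution satisfies observed = M * true, so the aggregate distribution is estimated by solving that linear system. The 3x3 matrix, the true distribution, and the hand-rolled solver below are all illustrative, not from the paper.

```python
import random

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# RR matrix: entry M[j][i] = P(report value j | true value i).
M = [[0.8, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]
true_dist = [0.5, 0.3, 0.2]

# Each respondent perturbs their true value through M's columns.
rng = random.Random(1)
counts = [0, 0, 0]
for _ in range(100_000):
    i = rng.choices(range(3), weights=true_dist)[0]
    j = rng.choices(range(3), weights=[M[r][i] for r in range(3)])[0]
    counts[j] += 1
observed = [c / 100_000 for c in counts]

# Aggregate estimation: observed = M * true, so invert the system.
estimate = solve(M, observed)   # close to [0.5, 0.3, 0.2]
```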
9
What is an optimal matrix?
  • Which of the two matrices shown on the slide is
    better?
  • Privacy: M2 is better. Utility: M1 is better.
So, what is an optimal matrix?
10
Optimal RR Matrix
  • An RR matrix M is optimal if no other RR matrix's
    privacy and utility are both better than M (i.e.,
    no other matrix dominates M).
  • Privacy quantification
  • Utility quantification
  • A number of privacy and utility metrics have been
    proposed.
  • Privacy: how accurately one can estimate
    individual info.
  • Utility: how accurately we can estimate aggregate
    info.

11
Metrics
  • Privacy: accuracy of the estimate of individual
    values
  • Utility: difference between the original
    probability and the estimated probability

12
Optimization Methods
  • Approach 1: weighted sum, w1 · Privacy + w2 · Utility
  • Approach 2:
  • Fix privacy, find M with the optimal utility.
  • Fix utility, find M with the optimal privacy.
  • Challenge: difficult to generate M with a fixed
    privacy or utility.
  • Proposed approach: multi-objective optimization

13
Optimization algorithm
  • Evolutionary Multi-Objective Optimization (EMOO)
  • The algorithm:
  • Start with a set of initial RR matrices
  • Repeat the following steps in each iteration:
  • Mating: select two RR matrices in the pool
  • Crossover: exchange several columns between the
    two RR matrices
  • Mutation: change some values in an RR matrix
  • Meet the privacy bound: filter the resultant
    matrices
  • Evaluate the fitness value for the new RR
    matrices.
  • Note: the fitness value is defined in terms of
    privacy and utility metrics
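The loop above can be sketched as follows. The privacy and utility functions here are simple stand-ins (distance from the peak column probability, and diagonal mass), not the paper's metrics, and the 0.1 privacy bound, pool size, and iteration count are arbitrary.

```python
import random

def random_rr_matrix(n, rng):
    """A random RR matrix, stored as a list of reporting-distribution columns."""
    cols = []
    for _ in range(n):
        w = [rng.random() for _ in range(n)]
        s = sum(w)
        cols.append([x / s for x in w])
    return cols

# Illustrative stand-in metrics (the paper's actual metrics differ):
def privacy(m):      # low peak probability => harder to guess true value
    return 1 - max(max(col) for col in m)

def utility(m):      # diagonal mass => reports resemble true values
    return sum(col[i] for i, col in enumerate(m)) / len(m)

def dominates(a, b):
    """True if score tuple a is at least as good everywhere, better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and a != b

def pareto_front(pool):
    scored = [(privacy(m), utility(m), m) for m in pool]
    return [m for p, u, m in scored
            if not any(dominates((p2, u2), (p, u)) for p2, u2, _ in scored)]

def crossover(a, b, rng):
    """Exchange one randomly chosen column between two matrices."""
    i = rng.randrange(len(a))
    a2, b2 = [c[:] for c in a], [c[:] for c in b]
    a2[i], b2[i] = b2[i], a2[i]
    return a2, b2

def mutate(m, rng):
    """Perturb one column and renormalise it."""
    m2 = [c[:] for c in m]
    i = rng.randrange(len(m2))
    col = [max(1e-9, x + rng.gauss(0, 0.1)) for x in m2[i]]
    s = sum(col)
    m2[i] = [x / s for x in col]
    return m2

rng = random.Random(0)
pool = [random_rr_matrix(3, rng) for _ in range(20)]      # initial matrices
for _ in range(50):
    if len(pool) < 2:
        break
    a, b = rng.sample(pool, 2)                            # mating
    c, d = crossover(a, b, rng)                           # crossover
    children = [mutate(c, rng), mutate(d, rng)]           # mutation
    children = [m for m in children if privacy(m) > 0.1]  # privacy bound
    pool = pareto_front(pool + children)                  # keep non-dominated
```

The surviving pool is exactly the non-dominated set, i.e. the Pareto front plotted on the next slides.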

14
Illustration
15
Output of Optimization
The optimal set is often plotted in the objective
space as Pareto front.
(Figure: candidate matrices M1-M8 plotted in the
objective space with a privacy axis and a utility axis;
arrows mark the better and worse directions, and the
non-dominated matrices form the Pareto front.)
16
Results for the first attribute of the Adult data set
17
Privacy Preserving Data Mining Techniques
  • Protecting sensitive raw data
  • Randomization (additive noise)
  • Geometric perturbation and projection
    (multiplicative noise)
  • Randomized response technique
  • Protecting sensitive knowledge (knowledge hiding)
  • Frequent itemset and association rule hiding
  • Downgrading classifier effectiveness

18
Frequent Itemset Mining and Association Rule
Mining
  • Frequent itemset mining: find frequent sets of
    items in a transaction data set
  • Association rules: associations between items

19
Frequent Itemset Mining and Association Rule
Mining
  • First proposed by Agrawal, Imielinski, and Swami
    in SIGMOD 1993
  • SIGMOD Test of Time Award 2003
  • This paper started a field of research.
    In addition to containing an innovative
    algorithm, its subject matter brought data mining
    to the attention of the database community; it
    even led several years ago to an IBM commercial,
    featuring supermodels, that touted the importance
    of work such as that contained in this paper.
  • Apriori algorithm in VLDB 1994
  • #4 in the top 10 data mining algorithms in ICDM
    2006
  • R. Agrawal, T. Imielinski, and A. Swami. Mining
    Association Rules between Sets of Items in Large
    Databases. In SIGMOD '93.
  • Apriori: Rakesh Agrawal and Ramakrishnan
    Srikant. Fast Algorithms for Mining Association
    Rules. In VLDB '94.

20
Basic Concepts Frequent Patterns and Association
Rules
  • Itemset X = {x1, ..., xk} (k-itemset)
  • Frequent itemset: X with minimum support count
  • Support count (absolute support): count of
    transactions containing X
  • Association rule: A ⇒ B with minimum support and
    confidence
  • Support: probability that a transaction contains
    A ∪ B
  • s = P(A ∪ B)
  • Confidence: conditional probability that a
    transaction having A also contains B
  • c = P(B | A)
  • Association rule mining process:
  • Find all frequent patterns (the more costly step)
  • Generate strong association rules

Transaction-id Items bought
10 A, B, D
20 A, C, D
30 A, D, E
40 B, E, F
50 B, C, D, E, F
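As a quick check on these definitions, here is a short Python sketch computing support and confidence over the five transactions above.

```python
transactions = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def support_count(itemset):
    """Absolute support: number of transactions containing the itemset."""
    return sum(1 for t in transactions.values() if itemset <= t)

def support(itemset):
    return support_count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) over the transactions."""
    return support_count(antecedent | consequent) / support_count(antecedent)

print(support_count({"A", "D"}))     # 3
print(support({"A", "D"}))           # 0.6
print(confidence({"A"}, {"D"}))      # 1.0
print(confidence({"D"}, {"A"}))      # 0.75
```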
21
Illustration of Frequent Itemsets and Association
Rules
Transaction-id Items bought
10 A, B, D
20 A, C, D
30 A, D, E
40 B, E, F
50 B, C, D, E, F
  • Frequent itemsets (minimum support count 3)?
  • Association rules (minimum support 50%, minimum
    confidence 50%)?

A:3, B:3, D:4, E:3, AD:3
A ⇒ D (60%, 100%), D ⇒ A (60%, 75%)
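Brute-force enumeration over the same five transactions reproduces the frequent itemsets listed above; a real miner would use Apriori's candidate pruning, but exhaustive search suffices at this size.

```python
from itertools import combinations

transactions = [
    {"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
    {"B", "E", "F"}, {"B", "C", "D", "E", "F"},
]
items = sorted(set().union(*transactions))
MIN_COUNT = 3

def count(itemset):
    return sum(1 for t in transactions if itemset <= t)

frequent = {}
for k in range(1, len(items) + 1):
    found = False
    for combo in combinations(items, k):
        c = count(set(combo))
        if c >= MIN_COUNT:
            frequent[frozenset(combo)] = c
            found = True
    if not found:   # no frequent k-itemset => none larger (Apriori property)
        break

print(sorted(("".join(sorted(s)), c) for s, c in frequent.items()))
# [('A', 3), ('AD', 3), ('B', 3), ('D', 4), ('E', 3)]
```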
22
Association Rule Hiding: what? why?
  • Problem: hide sensitive association rules in data
    without losing non-sensitive rules
  • Motivation: confidential rules may have serious
    adverse effects

23
Problem statement
  • Given:
  • a database D to be released
  • minimum thresholds MST, MCT
  • a set of association rules R mined from D
  • a set of sensitive rules Rh ⊆ R to be hidden
  • Find a new database D' such that:
  • the rules in Rh cannot be mined from D'
  • as many as possible of the rules in R − Rh can
    still be mined

24
Solutions
  • Data modification approaches
  • Basic idea: data sanitization, D → D'
  • Approaches: distortion, blocking
  • Drawbacks:
  • cannot control hiding effects intuitively; lots
    of I/O
  • Data reconstruction approaches
  • Basic idea: knowledge sanitization, D → K → D'
  • Potential advantages:
  • can easily control the availability of rules and
    control the hiding effects directly, intuitively,
    handily

25
Distortion-based Techniques
Sample Database (columns A B C D)
1 1 1 0
1 0 1 1
0 0 0 1
1 1 1 0
1 0 1 1

Distorted Database (columns A B C D)
1 1 1 0
1 0 0 1
0 0 0 1
1 1 1 0
1 0 0 1

Rule A ⇒ C originally has Support(A⇒C) = 80% and
Confidence(A⇒C) = 100%; in the distorted database it
has Support(A⇒C) = 40% and Confidence(A⇒C) = 50%.
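The distortion on this slide can be replayed in code; the rows below are the sample database, and turning off C in the second and fifth transactions reproduces the quoted numbers.

```python
rows = [            # columns: A, B, C, D (the sample database above)
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
]
A, C = 0, 2

def rule_stats(db, ante, cons):
    """(support %, confidence %) of the rule ante -> cons."""
    both = sum(1 for r in db if r[ante] and r[cons])
    return 100 * both // len(db), 100 * both // sum(1 for r in db if r[ante])

before = rule_stats(rows, A, C)      # (80, 100)

# Distortion: turn off C in the 2nd and 5th transactions.
for i in (1, 4):
    rows[i][C] = 0
after = rule_stats(rows, A, C)       # (40, 50)
```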
26
Side Effects
Before Hiding Process | After Hiding Process | Side Effect
Rule Ri had conf(Ri) > MCT | Rule Ri now has conf(Ri) < MCT | Rule eliminated (undesirable side effect)
Rule Ri had conf(Ri) < MCT | Rule Ri now has conf(Ri) > MCT | Ghost rule (undesirable side effect)
Large itemset I had sup(I) > MST | Itemset I now has sup(I) < MST | Itemset eliminated (undesirable side effect)
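A small checker for the first two side effects might look like this; the rule names, confidence values, and the >= threshold convention are illustrative.

```python
MCT = 0.5   # minimum confidence threshold (illustrative value)

def side_effects(conf_before, conf_after, sensitive):
    """Classify non-sensitive rules whose status flipped across hiding."""
    effects = {}
    for rule in conf_before:
        if rule in sensitive:
            continue                      # hiding these was the goal
        was = conf_before[rule] >= MCT
        now = conf_after[rule] >= MCT
        if was and not now:
            effects[rule] = "rule eliminated"
        elif not was and now:
            effects[rule] = "ghost rule"
    return effects

before = {"A->C": 1.0, "B->D": 0.6, "C->D": 0.4}
after = {"A->C": 0.3, "B->D": 0.45, "C->D": 0.55}
print(side_effects(before, after, sensitive={"A->C"}))
# {'B->D': 'rule eliminated', 'C->D': 'ghost rule'}
```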
27
Distortion-based Techniques
  • Challenges/Goals
  • To minimize the undesirable Side Effects that the
    hiding process causes to non-sensitive rules.
  • To minimize the number of 1s that must be
    deleted in the database.
  • Algorithms must be linear in time as the database
    increases in size.

28
Sensitive itemset: ABC
29
Data distortion (Atallah '99)
  • Hardness result:
  • the distortion problem is NP-hard
  • Heuristic search:
  • find items to remove and transactions to remove
    the items from

Disclosure Limitation of Sensitive Rules, M.
Atallah, A. Elmagarmid, M. Ibrahim, E. Bertino,
V. Verykios, 1999
30
(No Transcript)
31
Heuristic Approach
  • A greedy bottom-up search through the ancestors
    (subsets) of the sensitive itemset for the parent
    with maximum support (why?)
  • At the end of the search, a 1-itemset is selected
  • Search through the common transactions containing
    the item and the sensitive itemset for the
    transaction that affects the minimum number of
    2-itemsets
  • Delete the selected item from the identified
    transaction
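The heuristic steps above might be sketched as follows; the toy database, the tie-breaking, and the "affected 2-itemsets" count are simplified stand-ins for the paper's definitions.

```python
from itertools import combinations

def hide_step(db, sensitive):
    """One greedy hiding step in the spirit of the heuristic above (a sketch)."""
    def support(s):
        return sum(1 for t in db if s <= t)
    # Walk down from the sensitive itemset, always keeping the
    # subset ("parent") with maximum support, until one item remains.
    current = set(sensitive)
    while len(current) > 1:
        current = max((current - {x} for x in current), key=support)
    item = next(iter(current))
    # Among transactions supporting the full sensitive itemset, pick the
    # one where deleting `item` touches the fewest 2-itemsets.
    def affected(t):
        return sum(1 for pair in combinations(t, 2) if item in pair)
    victim = min((t for t in db if set(sensitive) <= t), key=affected)
    victim.discard(item)
    return item

db = [{"A", "B", "C"}, {"A", "B", "C", "D"}, {"A", "B"}, {"B", "C"}]
removed = hide_step(db, {"A", "B", "C"})
# ABC is now supported by one transaction instead of two
```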

32
(No Transcript)
33
Results comparison
34
Blocking-based Techniques
Initial Database (columns A B C D)
1 1 1 0
1 0 1 1
0 0 0 1
1 1 1 0
1 0 1 1

New Database (columns A B C D)
1 1 1 0
1 0 ? 1
? 0 0 1
1 1 1 0
1 0 1 1

Support and confidence become marginal
(interval-valued). In the new database,
60% ≤ conf(A ⇒ C) ≤ 100%.
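With unknowns in place, confidence can only be bounded. The sketch below computes a (possibly loose) interval by resolving each '?' against the rule for the lower bound and in its favour for the upper bound; None stands for '?', and the function name is illustrative.

```python
def conf_interval(db, ante, cons):
    """Bound the confidence of ante -> cons when some entries are unknown."""
    def v(x, assume):                 # resolve an unknown entry
        return assume if x is None else x
    # Lower bound: unknowns count against the rule.
    lo = (sum(1 for r in db if v(r[ante], 0) and v(r[cons], 0))
          / sum(1 for r in db if v(r[ante], 1)))
    # Upper bound: unknowns count in favour of the rule.
    hi = (sum(1 for r in db if v(r[ante], 1) and v(r[cons], 1))
          / sum(1 for r in db if v(r[ante], 0)))
    return lo, hi

db = [               # columns: A, B, C, D; None marks a '?'
    [1, 1, 1, 0],
    [1, 0, None, 1],
    [None, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
]
print(conf_interval(db, 0, 2))   # (0.6, 1.0)
```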
35
Data reconstruction approach
(Diagram: original database D → frequent sets FS →
sanitized frequent sets FS' → FP-tree-based inverse
mining → released database D'.)
36
The first two phases
  • 1. Frequent set mining
  • Generate all frequent itemsets with their
    supports and support counts, FS, from the original
    database D
  • 2. Perform sanitization algorithm
  • Input: FS (the output of phase 1), R, Rh
  • Output: sanitized frequent itemsets FS'
  • Process:
  • select hiding strategy
  • identify sensitive frequent sets
  • perform sanitization

In the best case, the sanitization algorithm can
ensure that from FS' we get exactly the
non-sensitive rule set R − Rh
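The core of knowledge-level sanitization, removing a sensitive itemset together with all of its supersets from FS, can be sketched as follows; the frequent-set contents are illustrative.

```python
def sanitize(fs, sensitive):
    """Drop each sensitive itemset and all of its supersets from FS."""
    return {s: c for s, c in fs.items()
            if not any(h <= s for h in sensitive)}

fs = {frozenset({"A"}): 5, frozenset({"B"}): 4,
      frozenset({"A", "B"}): 3, frozenset({"A", "B", "C"}): 2}
fs_prime = sanitize(fs, sensitive=[frozenset({"A", "B"})])
# fs_prime keeps only {A} and {B}
```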
37
Example: the first two phases

Original Database D:

TID  Items
T1   A B C E
T2   A B C
T3   A B C D
T4   A B D
T5   A D
T6   A C D

1. Frequent set mining
2. Perform sanitization algorithm
38
Open research questions
  • Optimal solutions
  • Itemset sanitization:
  • the support and confidence of the rules in R − Rh
    should remain unchanged as much as possible
  • Integrating data protection and knowledge (rule)
    protection

39
Coming up
  • Cryptographic protocols for privacy preserving
    distributed data mining

40
Classification of current algorithms
41
Weight-based Sorting Distortion Algorithm (WSDA),
Pontikakis '03
  • High-level description
  • Input:
  • initial database
  • set of sensitive rules
  • safety margin (for example 10%)
  • Output:
  • sanitized database
  • sensitive rules no longer hold in the database

42
WSDA Algorithm
  • High Level Description
  • 1st step:
  • Retrieve the set of transactions that support
    sensitive rule RS
  • For each sensitive rule RS, find the number N1 of
    transactions in which one item that supports the
    rule will be deleted

43
WSDA Algorithm
  • High Level Description
  • 2nd step:
  • For each rule Ri in the database with common
    items with RS, compute a weight w that denotes how
    strong Ri is
  • For each transaction that supports RS, compute a
    priority Pi that denotes how many strong rules
    this transaction supports

44
WSDA Algorithm
  • High Level Description
  • 3rd step:
  • Sort the N1 transactions in ascending order
    according to their priority value Pi
  • 4th step:
  • For the first N1 transactions, hide an item that
    is contained in RS

45
WSDA Algorithm
  • High Level Description
  • 5th step:
  • Update confidence and support values for the
    other rules in the database
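Steps 1-4 above can be condensed into one sketch; the weight/priority definitions and the choice of which item to hide are simplifications of the algorithm's actual formulas, and step 5 (updating the remaining rules' statistics) is omitted.

```python
def wsda_sketch(db, sensitive_rule, rule_strengths, n1):
    """Condensed sketch of WSDA steps 1-4 (weights/priorities simplified)."""
    lhs, rhs = sensitive_rule
    supporting = [t for t in db if lhs | rhs <= t]        # step 1
    def priority(t):                                      # step 2: total weight
        return sum(w for r, w in rule_strengths.items()   # of strong rules this
                   if r <= t)                             # transaction supports
    supporting.sort(key=priority)                         # step 3: ascending
    hidden_item = next(iter(rhs))
    for t in supporting[:n1]:                             # step 4: hide an item
        t.discard(hidden_item)
    # step 5 (updating other rules' support/confidence) is omitted here

db = [{"A", "C", "D"}, {"A", "C"}, {"A", "B", "C"}, {"B", "D"}]
strengths = {frozenset({"A", "D"}): 1.0, frozenset({"B", "D"}): 2.0}
wsda_sketch(db, ({"A"}, {"C"}), strengths, n1=2)
# the rule A -> C is now supported by a single transaction
```

Sorting ascending by priority means the transactions carrying the fewest strong non-sensitive rules are sacrificed first, which is how WSDA limits side effects.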

46
Discussion
Proposed Solution
  • Sanitization algorithm
  • Compared with earlier popular data sanitization,
    it performs sanitization directly at the knowledge
    level of the data
  • Inverse frequent set mining algorithm
  • Deals with frequent items and infrequent items
    separately; more efficient; a large number of
    outputs

Our solution provides the user with a
knowledge-level window to perform sanitization
handily and generates a number of secure databases