Title: Privacy-preserving data mining (2)
1 Privacy-preserving data mining (2)
2 Outline
- A brief introduction to association rule mining
- Privacy-preserving rule mining
- Single party
- Perturbation
- Encryption
- Distributed multiparty
- Cryptographic protocols
- Hiding sensitive rules
3 Association Rule Mining
- Transactional datasets
- A transaction t = {a, b, c, ...}
- a, b, c, ... are called items
- The length of a transaction = # of items it contains
- Transaction length can vary
- Equivalent representation
- The set of all items: I
- A transaction t can be transformed into a boolean vector of length |I|.
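The boolean-vector representation can be sketched as follows. This is a minimal illustration, assuming a fixed ordering of the item set I; the function name `to_boolean_vector` and the sample items are hypothetical.

```python
def to_boolean_vector(transaction, all_items):
    """Encode a transaction as a 0/1 vector over a fixed item order."""
    present = set(transaction)
    return [1 if item in present else 0 for item in all_items]

# Hypothetical item universe I and two transactions of different lengths.
I = ["a", "b", "c", "d", "e"]
t1 = {"a", "b", "c"}
t2 = {"b", "e"}

v1 = to_boolean_vector(t1, I)  # one bit per item of I, in order
v2 = to_boolean_vector(t2, I)
```

Both vectors have length |I| regardless of how many items the transaction holds, which is what makes bitwise perturbation possible later.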
4 Association Rule Mining
- Rule mining
- Goal: find the frequent itemsets
- Some itemset, e.g. {a, b, c}, appears frequently, i.e. with support higher than a certain threshold.
- Rules can be derived from the itemset
- If {a, b, c} is frequent, then candidate rules are a,b → c, a → b,c, ...
- Metrics
- Support = # of occurrences of itemset / total # of transactions
- Confidence = # of occurrences of itemset / # of occurrences of left(rule)
- I.e. the conditional probability Pr(right | left)
5 Example
- E.g. {a, b} appears 100 times together, while {a, b, c} appears 50 times together, in a total of 5000 transactions
- Support of abc = 50/5000 = 0.01
- Confidence of ab → c = 50/100 = 0.5
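The two metrics can be checked with a short sketch. The dataset below is synthetic, built only to reproduce the counts in the example (50 transactions with a, b, c; 50 with a, b only; 4900 with neither); the function names are illustrative.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(left, right, transactions):
    """Estimate Pr(right | left) as support(left ∪ right) / support(left)."""
    both = support(set(left) | set(right), transactions)
    return both / support(left, transactions)

# Synthetic data (an assumption for illustration) matching the example counts.
transactions = (
    [{"a", "b", "c"}] * 50
    + [{"a", "b"}] * 50
    + [{"d"}] * 4900
)

sup_abc = support({"a", "b", "c"}, transactions)         # 50 / 5000
conf_ab_c = confidence({"a", "b"}, {"c"}, transactions)  # 50 / 100
```

Note that `confidence` divides by the support of the left side, so it is only defined for rules whose left side occurs at least once.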
6 Algorithms
- Apriori
- Observation: if itemset A is a subset of B, then support(A) ≥ support(B)
- Steps in finding frequent itemsets
- Start from single-item sets, pruning the itemsets that have support < threshold
- Given the set of frequent k-itemsets, expand it to candidate (k+1)-itemsets and check their supports.
- Use data structures such as a hash tree to speed up the counting process
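The level-wise search described above can be sketched as follows. This is a simplified illustration, not an optimized implementation: it counts support by scanning every transaction rather than using a hash tree, and the sample transactions are made up.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset: support} for all itemsets with support >= min_support."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = sup(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        # Join step: union pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = set()
        for a, b in combinations(current, 2):
            c = a | b
            if len(c) == k + 1:
                # Prune step: every k-subset of a candidate must be frequent.
                if all(frozenset(sub) in current for sub in combinations(c, k)):
                    candidates.add(c)
        # Count step: keep the candidates that meet the threshold.
        current = {}
        for c in candidates:
            s = sup(c)
            if s >= min_support:
                current[c] = s
        k += 1
    return frequent

# Hypothetical toy data: every pair is frequent at 0.6, the triple is not.
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(toy, min_support=0.6)
```

The prune step is where the observation support(A) ≥ support(B) pays off: a candidate with any infrequent subset is discarded without counting.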
7 Algorithm
- Generating rules with a confidence threshold
- Confidence(A → B) = P(B | A) = support(A ∪ B) / support(A)
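Rule generation from a table of frequent itemsets can be sketched as below. The sketch assumes the table contains every non-empty subset of each itemset (which holds for Apriori output); the supports used in the example are hypothetical.

```python
from itertools import combinations

def generate_rules(frequent, min_confidence):
    """Return a list of (left, right, confidence) for rules left -> right.

    `frequent` maps frozenset itemsets to their support and must contain
    every non-empty subset of each itemset.
    """
    rules = []
    for itemset, sup_ab in frequent.items():
        if len(itemset) < 2:
            continue
        # Try every non-empty proper subset as the left side.
        for r in range(1, len(itemset)):
            for left in combinations(sorted(itemset), r):
                left = frozenset(left)
                right = itemset - left
                conf = sup_ab / frequent[left]
                if conf >= min_confidence:
                    rules.append((left, right, conf))
    return rules

# Hypothetical supports for illustration.
frequent = {
    frozenset("a"): 0.03,
    frozenset("b"): 0.04,
    frozenset("ab"): 0.02,
}
rules = generate_rules(frequent, min_confidence=0.6)
```

With these numbers only a → b passes the threshold (0.02 / 0.03), while b → a does not (0.02 / 0.04), which shows that confidence is not symmetric.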
8 Single-party PPRM (privacy-preserving rule mining)
- Two methods
- (Categorical data) Perturbation
- Encryption
9 Perturbation
- Papers 111, 112, 113
- Basic ideas
- Consider a transaction as a boolean bit vector
- Perturb each bit according to a certain rule
- Paper 111: randomly select j items from t; then, for the rest of all items, each is selected with probability p
- Paper 112: each bit keeps its original value with probability p and is flipped with probability 1-p
- Paper 113: unifies the methods with a perturbation matrix
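The bit-flipping rule described for paper 112 can be sketched as follows. This is an illustration of the rule as stated on the slide, not code from the paper; the function name and the sample vector are assumptions.

```python
import random

def perturb(bits, p, rng):
    """Keep each bit with probability p; flip it with probability 1 - p."""
    return [b if rng.random() < p else 1 - b for b in bits]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
t = [1, 1, 1, 0, 0]     # hypothetical transaction over 5 items

unchanged = perturb(t, 1.0, rng)  # p = 1: nothing flips
inverted = perturb(t, 0.0, rng)   # p = 0: every bit flips
noisy = perturb(t, 0.9, rng)      # typical case: most bits survive
```

The two extreme settings show the range of the parameter: p = 1 gives no privacy, and values near 0.5 make the output nearly independent of the input.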
10 The key is
- After you perturb the data, you should still be able to find the supported rules correctly.
- Accuracy is traded off against the intensity of perturbation (p)
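For the simplest case, a single item under the bit-flip scheme (each bit kept with probability p), the recovery step can be sketched directly. If the true support is s, the expected observed support is s·p + (1 − s)(1 − p), which can be inverted whenever p ≠ 0.5. This is the standard randomized-response inversion, given here as an illustration rather than as the estimator used in the cited papers; the numbers are hypothetical.

```python
def estimate_item_support(observed_support, p):
    """Invert E[s'] = s*p + (1 - s)*(1 - p) for the true support s."""
    if abs(2 * p - 1) < 1e-12:
        raise ValueError("p = 0.5 destroys all information about the bit")
    return (observed_support - (1 - p)) / (2 * p - 1)

# Hypothetical numbers: true support 0.30, keep-probability p = 0.9.
true_s, p = 0.30, 0.9
expected_observed = true_s * p + (1 - true_s) * (1 - p)
recovered = estimate_item_support(expected_observed, p)
```

The division by (2p − 1) makes the tradeoff visible: as p approaches 0.5 the perturbation is stronger and any sampling noise in the observed support is amplified in the estimate.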
11 Methods for discovering the original support
- Paper 111: uses the correlation between partial supports to find the original support
- Concept of partial support
- Probability of a change in the length of the matched part
Note: we actually want sup for l = k
The size of t is m; the size of itemset A is k
12 Some results
- Let s_i be sup_i(A) and s'_i be sup'_i(A)
1. 2.
The matrices P and D are defined using only p[l → l']
From 1, we can estimate the original support.
From 2, we can estimate the reliability (variance) of the support estimate (which is related to the perturbation rate p)
13 Privacy
- Given an itemset A in a perturbed transaction t'
- What is the probability that an item a in A was really in the original transaction?
14 Tradeoff between utility and privacy
Lowest discoverable support: the smallest support distinguishable from zero (taking into account the variance of the support estimate)
15 Encryption method (paper 118)
- Substitution encryption
- 1-1 substitution: a → 1, b → 2, ...
- 1-n substitution: a → {1, 10}, b → {2, 11, 12}, ...
- Problem
- 1-1 substitution is weak
- Arbitrary 1-n substitution does not work
- The original rules cannot be recovered from the rules mined on the substituted items.
16 The basic idea
- Fake items
- Original n items, plus m additional fake items
- Define admissible 1-n mapping
- An arbitrary 1-n mapping may give irreversible results
- E.g., a → {1, 2}, b → {2}, c → {3}
- If we find the frequent itemset {1, 2, 3} in the substituted set, which is the right original itemset: ac or abc?
- Admissible 1-n mapping
- Each item's mapped result must contain at least one unique substitute item that does not appear in any other item's mapping
- E.g., a → {1, 2}, b → {2}, c → {3} breaks the definition
- while a → {1, 2}, b → {2, 4}, c → {3} is admissible
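The admissibility condition can be checked mechanically. The sketch below tests the two example mappings; the function name `is_admissible` and the dictionary representation of a mapping are assumptions made for illustration.

```python
def is_admissible(mapping):
    """True if every item has a substitute that appears in no other item's set."""
    for item, subs in mapping.items():
        # Collect the substitutes used by all the other items.
        others = set()
        for other, other_subs in mapping.items():
            if other != item:
                others |= set(other_subs)
        # The item needs at least one substitute of its own.
        if not set(subs) - others:
            return False
    return True

bad = {"a": {1, 2}, "b": {2}, "c": {3}}      # b has no unique substitute
good = {"a": {1, 2}, "b": {2, 4}, "c": {3}}  # 1, 4 and 3 are unique
```

In the first mapping the only substitute of b is 2, which a also uses, so b can never be told apart from a.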
17 Recovering rules
- When we use admissible mappings
- We are able to reverse the rules discovered on the substituted set.
- E.g., if we find that {1, 2, 4} is a frequent set
- check all mappings
- {1, 2} → a, {2, 4} → b ⇒ {1, 2, 4} → ab
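The "check all mappings" step can be sketched as follows. Two assumptions are made that the slides do not state: a transaction is encoded by replacing each item with all of its substitutes, and an original item is reported only when its whole substitute set appears in the frequent itemset. Fake items are left out to keep the sketch short.

```python
def encode_transaction(transaction, mapping):
    """Replace each original item by all of its substitute items (assumed scheme)."""
    out = set()
    for item in transaction:
        out |= mapping[item]
    return out

def decode_itemset(substituted_itemset, mapping):
    """Return the original items whose whole substitute set was found."""
    found = set(substituted_itemset)
    return {item for item, subs in mapping.items() if subs <= found}

# The admissible mapping from the example.
mapping = {"a": {1, 2}, "b": {2, 4}, "c": {3}}
encoded = encode_transaction({"a", "b"}, mapping)
decoded = decode_itemset({1, 2, 4}, mapping)
```

Here the frequent set {1, 2, 4} decodes to {a, b}, as in the example; c is excluded because its substitute 3 is absent.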
18 Cost
- Additional cost
- Generating item mapping
- Generating transaction transformation
- significant
- Cost of rule mining
- Both the # of items and the average transaction length increase, so the total cost of rule mining increases
19 Features of encryption method
- Rules can be accurately recovered
- A tradeoff between cost and privacy
- Privacy is better preserved with more fake items
- More fake items will result in higher additional
cost.
20 Distributed datasets
- Perturbation also works, but encryption does not work
- might need some protocols to make the encryption method applicable
- Horizontally or vertically partitioned
- Papers 114 and 115
- Using cryptographic methods to construct protocols
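The slides do not say which protocols papers 114 and 115 construct. As a purely illustrative stand-in, the sketch below simulates a ring-based secure sum, a commonly described building block for adding up local support counts over horizontally partitioned data. All names and numbers are hypothetical, the whole exchange runs in one process, and the scheme offers no protection if neighboring sites collude.

```python
import random

def secure_sum(local_counts, modulus, rng):
    """Simulate a ring-based secure sum in one process.

    Site 0 masks its count with a random value, each later site adds its
    own count modulo `modulus`, and site 0 removes the mask at the end.
    Each site only ever sees a masked running total.
    The modulus must be larger than the true total.
    """
    mask = rng.randrange(modulus)
    running = (mask + local_counts[0]) % modulus
    for count in local_counts[1:]:
        running = (running + count) % modulus
    return (running - mask) % modulus

# Hypothetical support counts of one itemset at three sites.
counts = [120, 45, 300]
total = secure_sum(counts, modulus=10**6, rng=random.Random(1))
```

Dividing the total by the combined number of transactions would give the global support without any site revealing its own count.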
21 Hiding sensitive rules (paper 122)
- When publishing data for rule mining, the rules themselves can be sensitive too.
- Basic methods
- Decrease the support
- Support = # of occurrences of itemset / total # of transactions
- Decrease the confidence
- Confidence = # of occurrences of itemset / # of occurrences of left(rule)
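The "decrease the support" approach can be sketched with a naive sanitizer: remove one item of the sensitive itemset from supporting transactions until the support falls below the mining threshold. This is a simplified illustration, not the algorithm of paper 122; the function name, the choice of `victim` item, and the data are assumptions.

```python
def hide_by_support(transactions, itemset, victim, min_support):
    """Remove `victim` from supporting transactions until support < min_support.

    Returns a new list of sets; the input is not modified.
    """
    itemset = set(itemset)
    if victim not in itemset:
        raise ValueError("victim must be an item of the sensitive itemset")
    result = [set(t) for t in transactions]
    n = len(result)
    count = sum(1 for t in result if itemset <= t)
    for t in result:
        if count / n < min_support:
            break  # the itemset is no longer frequent
        if itemset <= t:
            t.discard(victim)
            count -= 1
    return result

# Hypothetical data: {a, b, c} has support 2/4 before sanitizing.
data = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"c"}]
sanitized = hide_by_support(data, {"a", "b", "c"}, victim="c", min_support=0.5)
```

Removing c from one transaction is enough to hide the itemset here, but it also lowers the support of c on its own, a side effect on non-sensitive itemsets.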
22 Discussion on rule hiding
- Requires a significant amount of computation on the data owner's side
- You should know in advance which rules are sensitive!
- So it is only necessary when you have to share the data
- When hiding sensitive rules, other rules might be damaged
23 Summary
- Two methods for single-party data publishing
- Perturbation and encryption
- Distributed multiparty settings can use protocols and the perturbation method
- Hiding sensitive rules is also important in some cases