Privacy-preserving data mining 2 - PowerPoint PPT Presentation

Provided by: keke9

Transcript and Presenter's Notes

1
Privacy-preserving data mining (2)
2
Outline
  • A brief introduction to association rule mining
  • Privacy preserving rule mining
  • Single party
  • Perturbation
  • Encryption
  • Distributed multiparty
  • Cryptographic protocols
  • Hiding sensitive rules

3
Association Rule Mining
  • Transactional datasets
  • A transaction t = {a, b, c, ...}
  • a, b, c, ... are called items
  • The length of a transaction is its number of items
  • Transaction length can vary
  • Equivalent representation
  • The set of all items is I
  • A transaction t can be transformed into a boolean
    vector of length |I|.
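
The equivalent representation can be sketched in a few lines of Python. The item universe `I`, the transaction `t`, and the helper name `to_boolean_vector` are made up for illustration and do not come from the slides.

```python
def to_boolean_vector(transaction, all_items):
    """Return a 0/1 list with one position per item in all_items."""
    present = set(transaction)
    return [1 if item in present else 0 for item in all_items]


# Hypothetical item universe I and transaction t.
I = ["a", "b", "c", "d", "e"]
t = {"a", "c", "d"}

vector = to_boolean_vector(t, I)
print(vector)  # [1, 0, 1, 1, 0]
```

Fixing the order of `I` once is what makes vectors from different transactions comparable position by position.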

4
Association Rule Mining
  • Rule mining
  • Goal: find the frequent itemsets
  • Some itemsets, e.g. {a, b, c}, appear frequently,
    i.e. with support higher than a given threshold.
  • Rules can be derived from the itemset
  • If {a, b, c} is frequent, candidate rules include {a, b} → c, a → {b, c}, ...
  • Metrics
  • Support: # of occurrences of itemset / total # of
    transactions
  • Confidence: # of occurrences of itemset / # of
    occurrences of left(rule)
  • I.e. the conditional probability Pr(right | left)

5
  • Example
  • E.g. {a, b} appears 100 times together, while {a, b, c}
    appears 50 times together, in a total of 5000
    transactions
  • Support of {a, b, c} = 50/5000 = 0.01
  • Confidence of {a, b} → c = 50/100 = 0.5
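
The arithmetic of this example can be checked with a small Python sketch. The dataset below is synthetic: it is built only to reproduce the counts on the slide (50 transactions with a, b, c; 50 more with only a, b; 4900 unrelated ones), and the function names are illustrative.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(left, right, transactions):
    """support(left and right together) / support(left)."""
    both = set(left) | set(right)
    return support(both, transactions) / support(left, transactions)


# Synthetic data matching the slide's counts (an assumption, not real data).
transactions = [{"a", "b", "c"}] * 50 + [{"a", "b"}] * 50 + [{"d"}] * 4900

sup_abc = support({"a", "b", "c"}, transactions)
conf_ab_c = confidence({"a", "b"}, {"c"}, transactions)
print(sup_abc, conf_ab_c)
```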

6
Algorithms
  • Apriori
  • Observation: if itemset A is a part of B, then
    support(A) ≥ support(B)
  • Steps in finding frequent itemsets
  • Starting from single-item sets, prune the
    itemsets that have support < threshold
  • When we have a set of k-itemsets, we expand it to
    (k+1)-itemsets, and check their supports.
  • Using data structures like hash tree to speed up
    the counting process
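
The level-wise search above can be sketched as follows. This is a minimal illustration, not an optimized implementation: it counts support by scanning all transactions instead of using a hash tree, and the toy dataset and threshold are assumptions chosen for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frequent itemset: support} using level-wise expansion."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def sup(itemset):
        # Plain scan; a hash tree would speed up this counting step.
        return sum(1 for t in transactions if itemset <= t) / n

    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items}
    level = {c for c in level if sup(c) >= min_support}

    frequent = {}
    k = 1
    while level:
        for c in level:
            frequent[c] = sup(c)
        # Join frequent k-itemsets into (k+1)-itemset candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune candidates that have an infrequent k-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k))
        }
        level = {c for c in candidates if sup(c) >= min_support}
        k += 1
    return frequent


# Toy data (hypothetical).
data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent = apriori(data, min_support=0.6)
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])
```

The prune step is where the observation support(A) ≥ support(B) is used: a candidate with any infrequent subset cannot be frequent, so its support is never counted.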

7
  • Algorithm
  • Generating rules with confidence threshold
  • Confidence(A → B) = P(B | A)
    = support(AB) / support(A)
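
Rule generation from already-mined supports might look like the sketch below. It assumes a dictionary that holds the support of every frequent itemset and of all its subsets (which Apriori guarantees); the support values and the function name `generate_rules` are hypothetical.

```python
from itertools import combinations


def generate_rules(supports, min_confidence):
    """Return (left, right, confidence) for rules meeting the threshold.

    supports maps frozenset itemsets to their support and must contain
    every non-empty subset of each itemset it lists.
    """
    rules = []
    for itemset, sup_ab in supports.items():
        for r in range(1, len(itemset)):
            for left in combinations(sorted(itemset), r):
                left = frozenset(left)
                conf = sup_ab / supports[left]
                if conf >= min_confidence:
                    rules.append((left, itemset - left, conf))
    return rules


# Hypothetical supports for a two-item example.
supports = {
    frozenset({"a"}): 0.8,
    frozenset({"b"}): 0.8,
    frozenset({"a", "b"}): 0.6,
}
rules = generate_rules(supports, min_confidence=0.7)
for left, right, conf in rules:
    print(sorted(left), "->", sorted(right), round(conf, 2))
```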

8
Single party PPRM
  • Two methods
  • (Categorical data) Perturbation
  • Encryption

9
Perturbation
  • Papers 111, 112, 113
  • Basic ideas
  • Consider a transaction as a boolean bit vector
  • Perturb each bit according to a certain rule
  • Paper 111: randomly select j items from t, then
    select each of the remaining items with probability p
  • Paper 112: each bit keeps its original value with
    probability p and is flipped with probability 1-p
  • Paper 113: unifies the methods with a perturbation
    matrix
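
The bit-flipping idea attributed to paper 112 can be sketched as below. This is only an illustration of "keep with probability p, flip with probability 1-p"; the function name, the seed, and the sample vector are assumptions, and the paper's actual procedure may differ in detail.

```python
import random


def perturb(vector, p, rng):
    """Keep each bit with probability p, flip it with probability 1 - p."""
    return [bit if rng.random() < p else 1 - bit for bit in vector]


rng = random.Random(0)          # fixed seed so runs are repeatable
original = [1, 0, 1, 1, 0]      # hypothetical boolean transaction vector
released = perturb(original, p=0.9, rng=rng)
print(released)
```

With p close to 1 most bits survive; with p close to 0.5 the released vector carries almost no information about the original.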

10
The key is
  • After you perturb the data, you should still be
    able to find the supported rules correctly.
  • Accuracy is traded off against the intensity of
    perturbation (p)
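
For the simplest case, a single item under independent bit flipping, the original support can be estimated by inverting the expected observed support s' = s·p + (1 − s)·(1 − p). The sketch below simulates this; the sample size, true support, and p are arbitrary assumptions, and itemsets with more than one item need the more general machinery discussed in the papers.

```python
import random


def estimate_support(observed_support, p):
    """Invert s' = s*p + (1 - s)*(1 - p) to estimate the original s."""
    if p == 0.5:
        raise ValueError("p = 0.5 leaves no information to invert")
    return (observed_support - (1 - p)) / (2 * p - 1)


rng = random.Random(42)
n, true_support, p = 100_000, 0.2, 0.9   # hypothetical settings

# One bit per transaction: does it contain the item?
bits = [1 if rng.random() < true_support else 0 for _ in range(n)]
flipped = [b if rng.random() < p else 1 - b for b in bits]

observed = sum(flipped) / n
estimate = estimate_support(observed, p)
print(round(observed, 3), round(estimate, 3))
```

The division by (2p − 1) shows the tradeoff directly: as p approaches 0.5, sampling noise in the observed support is amplified in the estimate.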

11
Methods discovering the original support
  • Paper 111: uses the correlation between partial
    supports to find the original support
  • Concept of partial support
  • Probability of the length change of matched parts

Note: we actually want sup for l = k
The size of t is m; the size of itemset A is k
12
Some results
  • Let s_i be sup_i(A) on the original data and s'_i be
    sup'_i(A) on the perturbed data

[Equations 1 and 2 are missing from this transcript]
The matrices P and D are defined using only p[l → l']
From 1, we can estimate the original support. From
2, we can estimate the reliability (variance) of
the support estimation (which is related to the
perturbation rate p)
13
Privacy
  • Given an itemset A in a perturbed transaction t
  • What is the probability that an item a in the
    itemset A is really in the original transaction, i.e.,

14
Tradeoff between utility and privacy
Lowest discoverable support: the lowest support
distinguishable from zero (considering the variance
of the support estimation)
15
Encryption method (paper 118)
  • Substitution encryption
  • 1-1 substitution: a → 1, b → 2, ...
  • 1-n substitution: a → {1, 10}, b → {2, 11, 12}, ...
  • Problem
  • 1-1 substitution is weak
  • Arbitrary 1-n substitution does not work
  • The original rules cannot be recovered from the rules
    mined on the substituted items.

16
The basic idea
  • Fake items
  • Original n items, additional m fake items
  • Define admissible 1-n mapping
  • Arbitrary 1-n mapping may result in irreversible
    results
  • E.g., a → {1, 2}, b → {2}, c → {3}
  • If we find the frequent itemset {1, 2, 3} in the
    substituted set, which is the right original
    itemset: {a, c} or {a, b, c}?
  • Admissible 1-n mapping
  • For each mapping, there should be at least one
    unique substitute item in the mapped result,
    which does not appear in any other mapping
  • E.g., a → {1, 2}, b → {2}, c → {3} breaks the definition
  • while a → {1, 2}, b → {2, 4}, c → {3} is admissible
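
The admissibility condition can be checked mechanically. The sketch below tests the two mappings from this slide; the function name `is_admissible` and the dictionary representation are illustrative choices, not taken from the paper.

```python
def is_admissible(mapping):
    """True if every item has a substitute that no other item uses."""
    for item, substitutes in mapping.items():
        used_by_others = set()
        for other, other_subs in mapping.items():
            if other != item:
                used_by_others |= set(other_subs)
        if not (set(substitutes) - used_by_others):
            return False  # this item has no unique substitute
    return True


bad = {"a": {1, 2}, "b": {2}, "c": {3}}
good = {"a": {1, 2}, "b": {2, 4}, "c": {3}}
print(is_admissible(bad), is_admissible(good))  # False True
```

In the first mapping b has only the substitute 2, which a also uses, so b can never be told apart from a.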

17
Recovering rules
  • When we use admissible mappings
  • We are able to reverse the rules discovered on the
    substituted set.
  • E.g., if we find that {1, 2, 4} is a frequent set
  • check all mappings
  • {1, 2} → a, {2, 4} → b ⇒ {1, 2, 4} → {a, b}
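
The "check all mappings" step might be sketched as follows. It assumes that each original item is always replaced by its complete set of substitutes, so an item is recovered exactly when all of its substitutes occur in the frequent substituted itemset; that assumption and the function name are mine, not confirmed by the slides.

```python
def recover_itemset(substituted, mapping):
    """Return the original items whose substitutes all appear in substituted."""
    substituted = set(substituted)
    return {item for item, subs in mapping.items() if set(subs) <= substituted}


mapping = {"a": {1, 2}, "b": {2, 4}, "c": {3}}
recovered = recover_itemset({1, 2, 4}, mapping)
print(sorted(recovered))  # ['a', 'b']
```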

18
Cost
  • Additional cost
  • Generating item mapping
  • Generating transaction transformation
  • Significant
  • Cost of rule mining
  • Both the # of items and the average length of
    transactions are increased, thus the total cost
    will be increased

19
Features of encryption method
  • Rules can be accurately recovered
  • A tradeoff between cost and privacy
  • Privacy is better preserved with more fake items
  • More fake items will result in higher additional
    cost.

20
Distributed datasets
  • Perturbation also works, but encryption does not
    work
  • Might need some protocols to make the encryption
    method applicable
  • Horizontally or vertically partitioned
  • Papers 114 and 115
  • Using cryptographic methods to construct
    protocols

21
Hiding sensitive rules (paper 122)
  • When publishing data for rule mining, the rules
    themselves can be sensitive too.
  • Basic methods
  • Decrease the support
  • Support: # of occurrences of itemset / total # of
    transactions
  • Decrease the confidence
  • Confidence: # of occurrences of itemset / # of
    occurrences of left(rule)
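
One simple way to decrease support is to delete an item of the sensitive rule from some of the transactions that support it. The sketch below does this greedily until the itemset's support falls below the mining threshold. It is an assumed, simplified strategy for illustration, not the algorithm from paper 122, and the data and names are hypothetical.

```python
def hide_by_lowering_support(transactions, left, right, min_support):
    """Remove one right-hand item from supporting transactions until the
    support of left + right drops below min_support."""
    left, right = set(left), set(right)
    itemset = left | right
    victim = sorted(right)[0]          # arbitrary but deterministic choice
    result = [set(t) for t in transactions]
    n = len(result)
    count = sum(1 for t in result if itemset <= t)
    for t in result:
        if count / n < min_support:
            break                      # rule is no longer discoverable
        if itemset <= t:
            t.discard(victim)
            count -= 1
    return result


# Hypothetical data: rule {a, b} -> c has support 3/5 before hiding.
data = [{"a", "b", "c"}] * 3 + [{"a", "b"}] + [{"d"}]
released = hide_by_lowering_support(data, {"a", "b"}, {"c"}, min_support=0.4)
after = sum(1 for t in released if {"a", "b", "c"} <= t) / len(released)
print(after)
```

Because the left-hand side is left untouched, the same deletions also lower the rule's confidence. Any other rule that involves the deleted item loses support as well, which is one way non-sensitive rules get damaged.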

22
Discussion on rule hiding
  • Requires a substantial amount of computation at
    the data owner side
  • You should know which rules are sensitive in
    advance!
  • So it is only necessary when you have to share
    the data
  • When hiding sensitive rules, other rules might be
    damaged

23
Summary
  • Two methods for single party data publishing
  • Perturbation and encryption
  • Distributed multiparty settings can use protocols and
    the perturbation method
  • Hiding sensitive rules is also important in some
    cases