Title: Fault-tolerant Frequent Patterns Mining
1Fault-tolerant Frequent Patterns Mining
2Introduction
- Traditional association rule mining
- Extracts only exactly matching patterns
- Example: head cold
- Symptoms: coughing, nose tearing, headache, throat hurt, fever, palpitations, vomiting
- Treatment: Vit-C, ...
- Fever over 38°C, throat hurt, headache
3Introduction
- Fault-tolerant mining
- Allows limited inexactness
- Example: head cold
- Symptoms: coughing, nose tearing, headache, throat hurt, fever, palpitations, vomiting
- Treatment: Vit-C, ...
- Fever over 38°C, throat hurt, headache
4Introduction
- Previous work
- [YFB01]: discovers groups of similar transactions that share most items.
- Focuses on transactions, not items
- Sparse pattern problem
tid   item: 1  2  3  4  5  6
010         1  1  1  1  0  0
020         1  1  1  1  0  0
030         1  1  1  1  0  0
040         1  1  1  1  0  1
050         0  0  0  0  1  0
060         0  0  0  0  1  0
δ = 0.8, min_sup = 4
5Introduction
- Previous work
- [PTH01], [WL02]: mine patterns that tolerate δ mismatched items within the pattern.
- Unfair pattern problem: a fixed number of items is tolerated no matter how long the itemset is
6Problem description
- Proportional fault-tolerant pattern mining
- Find patterns X such that the items in each sub-pattern of X with length ⌈|X|·δ⌉ frequently occur together.
- For example
- X = {a, b, c, d}, δ = 0.75
- X is an FT-pattern ⇒ {a, b, c}, {a, b, d}, {a, c, d}, {b, c, d} frequently occur
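A minimal sketch of how the sub-patterns in the example above could be enumerated, assuming the required sub-pattern length is ceil(|X| * delta). The function names `required_length` and `core_subpatterns` are illustrative and do not come from the slides.

```python
from fractions import Fraction
from itertools import combinations
from math import ceil


def required_length(pattern_len, delta):
    """Assumed sub-pattern length: ceil(|X| * delta)."""
    # Fraction avoids floating-point surprises inside ceil().
    return ceil(Fraction(str(delta)) * pattern_len)


def core_subpatterns(pattern, delta):
    """All sub-patterns of `pattern` that have the required length."""
    k = required_length(len(pattern), delta)
    return ["".join(c) for c in combinations(sorted(pattern), k)]


print(core_subpatterns("abcd", 0.75))  # ['abc', 'abd', 'acd', 'bcd']
```

With X = abcd and delta = 0.75 the required length is 3, which gives the four sub-patterns listed on the slide.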
7Problem description-definition
- A transaction t FT-contains pattern X iff t contains x, where x is a sub-pattern of X and |x| ≥ ⌈|X|·δ⌉ (δ is the fault-tolerance parameter)
- sup_FT(X): # of transactions that FT-contain X.
- sup_item^B(X)(x): # of transactions that contain x among the transactions that FT-contain X.
tid   items
010   c d e
020   b d e
030   a d e
040   a b c
050   a b c
X = abcd, δ = 0.75
sup_FT(X) = |{040, 050}| = 2
sup_item^B(X)(a) = |{040, 050}| = 2
sup_item^B(X)(d) = 0
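The two support measures can be sketched in a few lines of Python. This assumes that a transaction FT-contains X when it shares at least ceil(|X| * delta) items with X; the helper names `ft_contains` and `ft_supports` are hypothetical.

```python
from fractions import Fraction
from math import ceil


def ft_contains(transaction, pattern, delta):
    """True if the transaction shares at least ceil(|X| * delta) items with X."""
    need = ceil(Fraction(str(delta)) * len(pattern))
    return len(set(transaction) & set(pattern)) >= need


def ft_supports(tdb, pattern, delta):
    """Return (tids that FT-contain the pattern, per-item support inside them)."""
    covering = [tid for tid, items in tdb.items()
                if ft_contains(items, pattern, delta)]
    sup_item = {x: sum(x in tdb[tid] for tid in covering) for x in pattern}
    return covering, sup_item


# Transaction database from the slide.
tdb = {"010": "cde", "020": "bde", "030": "ade", "040": "abc", "050": "abc"}
covering, sup_item = ft_supports(tdb, "abcd", 0.75)
print(covering)   # ['040', '050']
print(sup_item)   # {'a': 2, 'b': 2, 'c': 2, 'd': 0}
```

The FT-support of X is `len(covering)`, which is 2 here, and item d has item support 0 because neither covering transaction contains it.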
8Problem description-definition
- A pattern X is an FT-pattern iff
- 1. sup_FT(X) ≥ min_sup_FT
- 2. For each item x in X, sup_item^B(X)(x) ≥ min_sup_item
9Problem description-observation
10Problem description-challenge
- The sets of patterns separated by the gap are independent, i.e., the anti-monotonic property does not hold across the gap.
δ = 0.6, min_sup_FT = 5, min_sup_item = 2, fault(3) = 1, fault(4) = 1, fault(5) = 2
C5: abcde 5, (3, 2, 3, 3, 3)
C4: abcd 2, ----; abce 2, ----; abde 2, ----; acde 2, ----; bcde 2, ----
C3: abc 2, ---; ade 3, ---; abd 4, ---; bcd 4, ---; abe 4, ---; bce 4, ---; acd 4, ---; bde 3, ---; ace 4, ---; cde 3, ---
TDB:
c d e
b d e
a d e
a b c
a b c
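The FT-support counts in this example can be checked with a short sketch. It assumes fault(n) = n - ceil(n * delta), which reproduces fault(3) = 1, fault(4) = 1 and fault(5) = 2 for delta = 0.6; the function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations
from math import ceil


def fault(n, delta):
    """Assumed number of tolerated mismatches for a pattern of length n."""
    return n - ceil(Fraction(str(delta)) * n)


def sup_ft(tdb, pattern, delta):
    """Number of transactions missing at most fault(|P|) items of the pattern."""
    need = len(pattern) - fault(len(pattern), delta)
    return sum(len(set(t) & set(pattern)) >= need for t in tdb)


tdb = ["cde", "bde", "ade", "abc", "abc"]
delta, min_sup_ft = 0.6, 5
for k in (3, 4, 5):
    level = {"".join(c): sup_ft(tdb, c, delta) for c in combinations("abcde", k)}
    frequent = [p for p, s in level.items() if s >= min_sup_ft]
    print(k, fault(k, delta), level, frequent)
```

No candidate of length 3 or 4 reaches min_sup_FT = 5, yet abcde does, so pruning a superset only because its subsets failed would lose abcde.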
11Approaches
- Lemma 1 (Extended Fault-tolerant Apriori) If X is not an FT-pattern, then none of its supersets with the same number of faults can be an FT-pattern.
12Approaches
- Lemma 2 Given a pattern X and the set of its sub-patterns set(X_subpattern), where |P| = |X| - 1 for every pattern P in X_subpattern. Moreover, let fault(|X|) - 1 = fault(|X| - 1) (i.e., X and the considered sub-patterns are separated by the gap). If X is not a frequent FT-pattern, then one of the following two cases applies
- Case 1: if sup_FT(X) < min_sup_FT, then no pattern P in X_subpattern can be an FT-pattern.
- Case 2: otherwise, if sup_item^B(X)(x_j) < min_sup_item, where x_j denotes an item contained in X, then no pattern in set(X_subpattern) that contains item x_j can be an FT-pattern
abcde (2 faults)
abcd, abce, abde, acde, bcde (1 fault each)
abc, ade, abd, bcd, abe, bce, acd, bde, ace, cde (1 fault each)
13FT-LevelWise
14Observation
- Let δ = 0.5; the pattern {milk, bread, pencil, eraser}, which seems meaningless, would be mined
- It is hard to understand the relationships between items by observing TDB directly
- Items that never appear in the same transaction might still be combined into an FT-pattern
15FT-association graph
- FT-association graph
- The transactions in TDB are scanned one by one
- When an item x is first scanned, a node is constructed for x and the field that records the support count of x is set to 1
- Otherwise, 1 is added to the support count of x
- The first time an item y appears with x in the same transaction, an edge e_xy is constructed between x and y
- Every time item y appears with x, the edge weight w_xy is increased by 1
- Item y is called a neighbor of x
- The nodes that are not frequent are pruned
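The construction steps above can be sketched as a single pass over the transactions. This is an illustration only: the function name `build_ft_association_graph`, the dictionary layout, and the use of a `min_sup_item` threshold for pruning are assumptions.

```python
from collections import Counter
from itertools import combinations


def build_ft_association_graph(tdb, min_sup_item):
    """One scan of TDB: node support counts, edge weights, then pruning."""
    support = Counter()  # node field: support count of each item
    weight = Counter()   # edge weight w_xy, keyed by the sorted pair (x, y)
    for transaction in tdb:
        items = sorted(set(transaction))
        support.update(items)                  # create a node or add 1
        weight.update(combinations(items, 2))  # create an edge or add 1
    frequent = {x for x, count in support.items() if count >= min_sup_item}
    nodes = {x: support[x] for x in frequent}
    edges = {(x, y): w for (x, y), w in weight.items()
             if x in frequent and y in frequent}
    neighbors = {x: set() for x in frequent}
    for x, y in edges:
        neighbors[x].add(y)
        neighbors[y].add(x)
    return nodes, edges, neighbors


tdb = ["cde", "bde", "ade", "abc", "abc"]
nodes, edges, neighbors = build_ft_association_graph(tdb, 2)
print(nodes["a"], edges[("d", "e")], sorted(neighbors["a"]))
# 3 3 ['b', 'c', 'd', 'e']
```

Keying each edge by the sorted item pair keeps the graph undirected, so w_xy and w_yx share one counter.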
16Example
17Property
- Lemma 2.1 If the distance between items x and y in the FT-association graph is greater than 2, then a pattern P that contains both x and y cannot be a frequent FT-pattern.
- or
- The transactions which contain Py will never FT-contain P ⇒ sup_item^B(P)(y) = 0
18Property
- Lemma 2.2 If P is a frequent FT-pattern, then for each item x of P there must exist [...] items in P that are neighbors of x in the FT-association graph.
- max_sup(P) = min{ w_xy | x, y ∈ P }
- Lemma 2.3 Given a pattern P, the upper bound of sup_FT(P), denoted as max_sup_FT(P), is equal to [...].
- If max_sup_FT(P) < min_sup_FT, then P cannot be a frequent FT-pattern.
19Proportional FT Frequent pattern Mining
- Data preprocessing
- Candidate generation and pruning
- Checking candidates
20Data Preprocessing
- In order to avoid scanning the whole database
when checking candidates, the original database
is transformed into a bitmap
21Data Preprocessing
- Constructs the FT-association graph
- The support counts of each item and each co-appearing itemset are calculated while the FT-association graph is constructed
- Prunes items whose supports are less than min_sup_item from both the bitmap and the FT-association graph
22Candidate generation and pruning
- The data structure of FT-association graph
23Checking candidates
- Extract bitmap(P) for a candidate P
- Calculate sup_FT of P and sup_item^B(P)(i) for each item i of P
- Example on the slide: candidate P = abcde with its bitmap(P)
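A minimal sketch of bitmap-based candidate checking, assuming one bit vector per item (stored here as a Python integer, bit r set when transaction r contains the item) and the ceil(|P| * delta) containment rule. The layout and the names `build_bitmap` and `check_candidate` are assumptions; the slides do not specify the bitmap format.

```python
from fractions import Fraction
from math import ceil


def build_bitmap(tdb):
    """Map each item to an int whose bit r is 1 iff transaction r has the item."""
    bitmap = {}
    for row, items in enumerate(tdb):
        for item in items:
            bitmap[item] = bitmap.get(item, 0) | (1 << row)
    return bitmap


def check_candidate(bitmap, n_rows, pattern, delta):
    """Compute sup_FT and per-item supports using only the columns of P."""
    need = ceil(Fraction(str(delta)) * len(pattern))
    ft_mask = 0  # bit r is 1 iff transaction r FT-contains the pattern
    for row in range(n_rows):
        hits = sum((bitmap.get(x, 0) >> row) & 1 for x in pattern)
        if hits >= need:
            ft_mask |= 1 << row
    sup_ft = bin(ft_mask).count("1")
    sup_item = {x: bin(bitmap.get(x, 0) & ft_mask).count("1") for x in pattern}
    return sup_ft, sup_item


tdb = ["cde", "bde", "ade", "abc", "abc"]
bitmap = build_bitmap(tdb)
print(check_candidate(bitmap, len(tdb), "abcd", 0.75))
# (2, {'a': 2, 'b': 2, 'c': 2, 'd': 0})
```

Only the bit vectors of the items in P are read, which is the sense in which a candidate check touches a small part of the bitmap instead of rescanning the database.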
24Proportional FT-pattern Mining, PFM
25Fixed FT-pattern Mining, FFM
- The FT parameter δ is redefined as a fixed number
- Patterns of different lengths tolerate the same number of faults, δ
- MinPattern is set to [...]
- MaxPattern is no longer necessary because the FT-Apriori heuristic is adopted
- For an item x of an FT-pattern P, x must have [...] neighbors
26Conclusions
- The presented framework can be used to solve both problems: mining proportional FT-patterns and mining fixed FT-patterns.
- The proposed lemmas filter out impossible candidates with high efficiency
- Instead of scanning the whole database once for the candidates, as traditional approaches do, our method checks only a small part of the bitmap transformed from the original database
27Privacy Preserving Data Mining
28Introduction
- Why privacy preserving data mining?
- Data privacy vs. information privacy
- Data privacy
- Information privacy
29Data Privacy Randomization Approach Overview
[Figure: original records (50 | 40K | ...), (30 | 70K | ...), ... → Randomizer → randomized records (65 | 20K | ...), (25 | 60K | ...), ... → Reconstruct distribution of Age, Reconstruct distribution of Salary, ... → Data Mining Algorithms → Model]
30Reconstruction Problem
- Original values x1, x2, ..., xn
- from probability distribution X (unknown)
- To hide these values, we use y1, y2, ..., yn
- from probability distribution Y
- Given
- x1 + y1, x2 + y2, ..., xn + yn
- the probability distribution of Y
- Estimate the probability distribution of X.
31Reconstruction Bootstrapping
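- f_X^0 := uniform distribution
- j := 0 // iteration number
- repeat
- f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X^{j}(a)}{\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X^{j}(z)\, dz}, where w_i = x_i + y_i (Bayes' rule)
- j := j + 1
- until (stopping criterion met)
- Converges to the maximum likelihood estimate.
- D. Agrawal and C. C. Aggarwal, PODS 2001.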
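The bootstrapping loop can be sketched on a discrete grid. This is an assumption-laden illustration, not the authors' implementation: the density of X is replaced by a probability mass function over `grid`, the integral becomes a sum, the stopping criterion is a fixed iteration count, and the names `reconstruct`, `noise_pdf` and `grid` are hypothetical.

```python
def reconstruct(observed, noise_pdf, grid, iterations=50):
    """Iterative Bayesian estimate of the distribution of X on `grid`.

    observed  -- the randomized values w_i = x_i + y_i
    noise_pdf -- density f_Y of the added noise (assumed known)
    grid      -- candidate values a for X (discretization assumption)
    """
    estimate = [1.0 / len(grid)] * len(grid)  # f_X^0: uniform
    for _ in range(iterations):
        updated = [0.0] * len(grid)
        for w in observed:
            # Bayes' rule: posterior of X = a given the observation w.
            joint = [noise_pdf(w - a) * p for a, p in zip(grid, estimate)]
            total = sum(joint)
            if total == 0.0:
                continue  # observation not explainable on this grid
            for k, value in enumerate(joint):
                updated[k] += value / total
        norm = sum(updated)
        estimate = [value / norm for value in updated]
    return estimate


def uniform_noise(d):
    """Noise density: uniform on [-1, 1]."""
    return 0.5 if -1.0 <= d <= 1.0 else 0.0


observed = [2.5, 1.5, 2.0, 8.0, 7.5, 8.5]
grid = list(range(11))
estimate = reconstruct(observed, uniform_noise, grid)
print([round(p, 3) for p in estimate])
```

In this toy input the estimated mass ends up only on grid points within noise range of the observations, split evenly between the cluster near 2 and the cluster near 8.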
32Information privacy (1)
- Oliveira and Zaïane proposed 6 algorithms
- Step 1: Find sensitive transactions
- Step 2: Choose victim items
- Step 3: Compute how many sensitive transactions should be changed
- Step 4: Select victim transactions
- SWA of Oliveira and Zaïane
- Almost the same as the others, but each step is applied to K transactions, where K is the window size
- Has the best performance
33Related Work (cont.)
- Published pattern sets
- Forward-Inference Attack
34Preliminary
- Represent TDB as a binary matrix
- P: frequent patterns
- P_H: frequent patterns with security policies
- ~P_H: frequent patterns without security policies
- P_H ∪ ~P_H = P
- Pair-Subset
- e.g., if {1, 2, 3} is frequent, then {{1, 2}, {1, 3}, {2, 3}} is the Pair-Subset of {1, 2, 3}, and {1, 2} is a pair-subpattern of {1, 2, 3}
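As a small illustration of the Pair-Subset notion, the 2-item subsets of a pattern can be enumerated with `itertools.combinations`. The function name `pair_subsets` is an assumption for this sketch.

```python
from itertools import combinations


def pair_subsets(pattern):
    """All 2-item sub-patterns (pair-subpatterns) of a pattern."""
    return list(combinations(sorted(pattern), 2))


print(pair_subsets({1, 2, 3}))  # [(1, 2), (1, 3), (2, 3)]
```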
35Problem Definition
- Transform D into D' such that the patterns in P_H are hidden, the patterns in ~P_H can still be mined in D', and the Forward-Inference Attack is avoided
- Kernel idea
- D is multiplied by a sanitization matrix S
- The problem is transformed into how to define S
36Matrix Multiplication
- If D_ij = 0, D'_ij is set to 0 directly
- If Σ_k D_ik·S_kj ≥ 1, D'_ij is set to 1
- If Σ_k D_ik·S_kj ≤ 0, D'_ij is set to 0
37Matrix Observation
- Setting of -1
- If S_ij is set to -1, then for every row t in which D_ti and D_tj are both equal to 1, D'_tj becomes 0
38Matrix Observation (cont.)
- Setting of 1
- Setting S_ij to 1 keeps the relation between item i and item j by enhancing the strength of item j
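A deterministic sketch of the modified matrix multiplication. It rests on assumptions about the rules above: the hiding entries of S are -1, an entry of D' is 1 when the product sum is at least 1 and 0 when it is at most 0, and entries that are 0 in D stay 0. The probability policies are left out, and the name `sanitize` is hypothetical.

```python
def sanitize(D, S):
    """Thresholded product D' = D x S for a binary transaction matrix D."""
    n_items = len(S)
    result = []
    for row in D:
        new_row = []
        for j in range(n_items):
            if row[j] == 0:
                new_row.append(0)  # never introduce new items
                continue
            total = sum(row[k] * S[k][j] for k in range(n_items))
            new_row.append(1 if total >= 1 else 0)
        result.append(new_row)
    return result


# Items 0, 1, 2. S[0][1] = -1 weakens item 1 wherever item 0 co-occurs.
S_hide = [[1, -1, 0],
          [0, 1, 0],
          [0, 0, 1]]
D = [[1, 1, 0],
     [0, 1, 1],
     [1, 1, 1]]
print(sanitize(D, S_hide))  # [[1, 0, 0], [0, 1, 1], [1, 0, 1]]

# Adding S[2][1] = 1 strengthens item 1 in rows that also contain item 2.
S_keep = [[1, -1, 0],
          [0, 1, 0],
          [0, 1, 1]]
print(sanitize(D, S_keep))  # [[1, 0, 0], [0, 1, 1], [1, 1, 1]]
```

In the second run the last row keeps item 1 because the +1 from item 2 offsets the -1 from item 0, which matches the observation that a setting of 1 preserves a relation.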
39Sanitization Process
40Marked-Set Generation
- 1. Put the patterns of length 2 in P_H into Marked-Set directly
- 2. for all remaining P in P_H do
  - if (P has no Pair-Subsets included in Marked-Set)
    - Generate k groups, k = # of all Pair-Subsets of P
    - The class label of each group is one of the Pair-Subsets of P
    - P is stored in each group
- 3. Merge the groups with the same class label
41Marked-Set Generation (cont.)
- 4. for all NP in ~P_H do
  - Generate all their Pair-Subsets
  - Count the frequencies of all Pair-Subsets
- 5. for all groups do
  - If the class label of the group ≠ any Pair-Subset generated in Step 4
    - The frequency of the group = 0
  - If the class label of the group = one Pair-Subset generated in Step 4
    - The frequency of the group = the frequency of the Pair-Subset
42Marked-Set Generation (cont.)
- 6. Sort the groups by frequency in increasing order
- 7. for (i = 1 to number of groups - 1)
  - for (j = i + 1 to number of groups)
    - Compare groups pair-wise, G_i, G_j
    - for all overlap in G_i ∩ G_j do
      - If the size of G_i ≠ the size of G_j
        - Remove the overlap from the smaller one
      - else
        - if the frequencies differ (check the frequency)
          - Remove the overlap from the larger one
        - else
          - Remove the overlap from a group chosen randomly
- 8. for all groups do
  - If number of patterns stored in the group > 0
    - Put the class label into Marked-Set
43An overall example
46Sanitization Matrix Setting
- 1. S_ii = 1
- 2. for all {i, j} in Marked-Set do
  - if (# of i in PH < # of j in PH)
    - S_ji = -1
  - else if (# of i in PH > # of j in PH)
    - S_ij = -1
  - else
    - if (# of i in Marked-Set > # of j in Marked-Set)
      - S_ji = -1
    - else if (# of i in Marked-Set < # of j in Marked-Set)
      - S_ij = -1
    - else
      - S_ji = -1 or S_ij = -1, chosen randomly
47Sanitization Matrix Setting (cont.)
- 3. for all {i, j} in (large2 - Marked-Set) do
  - Set S_ij = 1, S_ji = 1
- 4. S_ij = 0, otherwise
49Probability Policies
- Distortion probability ρ_j
- Used when there is only one 1 in column j
- and works if [...]: D'_ij has probability ρ_j of being set to 1 and probability 1 - ρ_j of being set to 0
50Probability Policies (cont.)
- Lemma 1 Given a minimum support s and a level of confidence c. Let {i, j} be a pattern in Marked-Set, n_ij be the support count of {i, j}, and ρ_j be the distortion probability of column j. W.L.O.G., we assume that S_ij = -1. If ρ_j satisfies
- [...] and
- [...]
- where |D| is the number of transactions in D, we can say that we are c confident that {i, j} isn't frequent in D'
51Probability Policies (cont.)
- Conformity probability μ
- Used when column j of S contains at least two 1s; works if [...] and at least one 1 in column j is multiplied by 1 in D
- D'_ij is set to 1 with probability μ and to 0 with probability 1 - μ
52Probability Policies (cont.)
- Lemma 2 Given a minimum support s and a level of confidence c. Let {i, j} be a pattern in Marked-Set, {k, j} be a pattern that belongs to large2 - Marked-Set, and n_ikj be the support count of {i, k, j}. W.L.O.G., we assume that S_ij = -1. μ is the conformity probability of column j. If μ is set according to the following rule,
- [...]
- we can say that we are c confident that {i, j} isn't frequent in D'.
53Conclusion
- A probability-based approach to the sensitive knowledge problem is proposed
- Under some conditions, the miss cost and the dissimilarity are slightly higher than those of SWA, but overall the approach performs better than SWA and does not suffer from the Forward-Inference Attack
54Reference
- [LCC04] Guanling Lee, Chien-Yu Chang and Arbee L. P. Chen. Hiding sensitive patterns in association rules mining. The 28th Annual International Computer Software and Applications Conference (COMPSAC 2004).
- [OZ02] S. R. M. Oliveira and O. R. Zaïane. Privacy Preserving Frequent Itemset Mining. In Proc. of the IEEE ICDM Workshop on Privacy, Security, and Data Mining, Japan, December 2002.
- [OZ03a] S. R. M. Oliveira and O. R. Zaïane. Algorithms for Balancing Privacy and Knowledge Discovery in Association Rule Mining. In Proc. of the 7th International Database Engineering and Applications Symposium (IDEAS03), Hong Kong, China, July 2003.
- [OZ03b] S. R. M. Oliveira and O. R. Zaïane. Protecting Sensitive Knowledge By Data Sanitization. In Proc. of the 3rd IEEE International Conference on Data Mining (ICDM03).
- [OZS04] S. R. M. Oliveira, O. R. Zaïane and Yücel Saygin. Secure Association Rule Sharing. The 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-04), 2004.
- [VAE04] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin and E. Dasseni. Association rule hiding. IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 4, April 2004.