Title: Top Down FP-Growth for Association Rule Mining
Slide 2: Introduction

- Classically, for a rule A → B:
  - support is computed as count(AB)
  - the rule is frequent if its support passes the minimum support threshold
  - confidence is computed as count(AB) / count(A)
  - the rule is confident if its confidence passes the minimum confidence threshold
- How to mine association rules?
  - find all frequent patterns
  - generate rules from the frequent patterns
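The two-phase recipe above can be sketched with a deliberately naive miner (brute-force enumeration, not the algorithm proposed in this work); the function name `mine_rules` and the list-of-lists transaction encoding are illustrative assumptions:

```python
from itertools import combinations

def mine_rules(transactions, minsup, minconf):
    """Naive two-phase miner: find all frequent itemsets, then emit
    confident rules A -> B.  Exponential; for illustration only."""
    items = sorted({i for t in transactions for i in t})
    count = {}
    # Phase 1: count every candidate itemset by brute force.
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            c = sum(1 for t in transactions if set(cand) <= set(t))
            if c >= minsup:
                count[frozenset(cand)] = c
    # Phase 2: split each frequent itemset into A -> B and keep
    # the rules whose confidence count(AB)/count(A) passes minconf.
    rules = []
    for itemset, c in count.items():
        for r in range(1, len(itemset)):
            for a in combinations(sorted(itemset), r):
                A = frozenset(a)
                if A in count and c / count[A] >= minconf:
                    rules.append((set(A), set(itemset - A)))
    return count, rules
```

On the toy database used later in the deck, phase 1 finds 9 frequent itemsets and phase 2 includes rules such as b → e (confidence 3/3).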
Slide 3: Introduction

- Limitations of current research
  - uses a uniform minimum support threshold
  - uses only support as the pruning measure
- Our contributions
  - improve efficiency
  - adopt multiple minimum supports
  - introduce confidence pruning
Slide 4: Related work -- frequent pattern mining

- Apriori algorithm
  - method: use the anti-monotone property of support for pruning, i.e. if a length-k pattern is infrequent, its length-(k+1) super-patterns can never be frequent
- FP-growth algorithm -- better than Apriori
  - method:
    - build an FP-tree to store the database compactly
    - mine the FP-tree in bottom-up order
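Apriori's anti-monotone pruning can be sketched as follows; this is a minimal illustrative implementation (the function name and the set-based encoding are assumptions, not code from this work):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise Apriori: a length-(k+1) candidate survives only if
    all of its length-k subsets are frequent (anti-monotone pruning)."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # L1: frequent single items
    freq = {frozenset([i]) for i in items
            if sum(1 for t in transactions if i in t) >= minsup}
    all_freq = set(freq)
    k = 1
    while freq:
        # join step: union pairs of frequent k-itemsets into (k+1)-candidates
        cands = {a | b for a in freq for b in freq if len(a | b) == k + 1}
        # prune step: every k-subset must itself be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, k))}
        # count step: scan the database only for surviving candidates
        freq = {c for c in cands
                if sum(1 for t in transactions if c <= t) >= minsup}
        all_freq |= freq
        k += 1
    return all_freq
```

For example, the candidate {a, b, c} is pruned without a database scan because its subset {a, b} is infrequent in the toy database used later in the deck.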
Slide 5: Related work -- association rule mining

- Fast algorithms that try to guarantee completeness of the frequent patterns
- Parallel algorithms; association-rule-based query languages
- Various association rule mining problems
  - multi-level, multi-dimensional rules
  - constraints on specific items
Slide 6: TD-FP-Growth for frequent pattern mining

- Similar tree structure to FP-growth
  - a compressed tree stores the database
  - nodes on each path of the tree are globally ordered
- Different mining method vs. FP-growth
  - FP-growth: bottom-up tree mining
  - TD-FP-Growth: top-down tree mining
Slide 7: TD-FP-Growth for frequent pattern mining

Example database (minsup = 2): {b, e}, {a, b, c, e}, {b, c, e}, {a, c, d}, {a}

Construct an FP-tree, with header table:

  entry value | count | side-link
  a           | 3     |
  b           | 3     |
  c           | 3     |
  e           | 3     |
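The tree construction on this slide can be sketched roughly as below, assuming a frequency-descending global item order and a side-link chain per header entry; the class `Node` and function `build_tree` are hypothetical names, not from this work:

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}   # item -> Node
        self.link = None     # side-link to the next node for the same item

def build_tree(transactions, minsup):
    """Insert each transaction along one path, keeping only frequent
    items in a fixed global order; the header table records each
    frequent item's count and the head of its side-link chain."""
    counts = {}
    for t in transactions:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    order = sorted((i for i in counts if counts[i] >= minsup),
                   key=lambda i: (-counts[i], i))
    root = Node(None, None)
    header = {i: {'count': counts[i], 'side_link': None} for i in order}
    for t in transactions:
        node = root
        for i in (x for x in order if x in t):  # frequent items, globally ordered
            if i not in node.children:
                child = Node(i, node)
                # thread the new node onto the front of i's side-link chain
                child.link = header[i]['side_link']
                header[i]['side_link'] = child
                node.children[i] = child
            else:
                node.children[i].count += 1
            node = node.children[i]
    return root, header
```

On the example database, the infrequent item d is dropped and the header table comes out as a:3, b:3, c:3, e:3, matching the slide.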
Slide 8: TD-FP-Growth for frequent pattern mining

Same database (minsup = 2): {b, e}, {a, b, c, e}, {b, c, e}, {a, c, d}, {a}

FP-growth: bottom-up mining
- mining order: e, c, b, a
- the header table links each item (a, b, c, e) to its nodes in the tree through the head of its node-link
Slide 9: TD-FP-Growth for frequent pattern mining

- FP-growth bottom-up mining
  - mining item e first requires building e's conditional sub-tree, with its own header table over items b and c
  - drawback: extra conditional databases and sub-trees must be built!
Slide 10: TD-FP-Growth for frequent pattern mining

- TD-FP-Growth adopts a top-down mining strategy
  - motivation: avoid building the extra databases and sub-trees that FP-growth builds
  - method: process nodes on an upper level before those on a lower level
  - result: modifications to upper-level nodes never affect lower-level nodes
- See the example →
Slide 11: TD-FP-Growth for frequent pattern mining

Same database (minsup = 2): {b, e}, {a, b, c, e}, {b, c, e}, {a, c, d}, {a}

Mining order: a, b, c, e

CT-tree and header table H:

  entry value | count | side-link
  a           | 3     |
  b           | 3     |
  c           | 3     |
  e           | 3     |
Slide 12: CT-tree for frequent pattern mining

Same database (minsup = 2)

Sub-header table built while mining item c (over c's prefix items):

  entry value | count | side-link
  a           | 2     |
  b           | 2     |

Global header table H:

  entry value | count | side-link
  a           | 3     |
  b           | 3     |
  c           | 3     |
  e           | 3     |
Slide 13: CT-tree for frequent pattern mining

- Completeness
  - for each entry i in H, we mine exactly the frequent patterns that end with item i -- no more and no less
- Complete set of frequent patterns:
  - {a}
  - {b}
  - {c}, {b, c}, {a, c}
  - {e}, {b, e}, {c, e}, {b, c, e}
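The completeness claim is easy to check by brute force on the toy database: enumerating all frequent itemsets and grouping each under the entry it ends with (in the global order a, b, c, e) reproduces the list above. The variable names below are illustrative:

```python
from itertools import combinations

transactions = [{'b','e'}, {'a','b','c','e'}, {'b','c','e'}, {'a','c','d'}, {'a'}]
minsup, order = 2, 'abce'  # 'abce' is the global order over the frequent items

# enumerate every itemset over the frequent items; keep those with support >= minsup
frequent = []
for r in range(1, len(order) + 1):
    for cand in combinations(order, r):
        if sum(1 for t in transactions if set(cand) <= t) >= minsup:
            frequent.append(cand)

# bucket each frequent pattern under the entry it "ends with" in the global order
by_end = {i: [] for i in order}
for pat in frequent:
    by_end[pat[-1]].append(set(pat))
```

This yields 9 frequent patterns in total, with e's bucket holding {e}, {b, e}, {c, e}, {b, c, e}, exactly as listed on the slide.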
Slide 14: TD-FP-Growth for frequent pattern mining

- Compared to FP-growth, TD-FP-Growth is
  - space-saving
    - only one tree and a few header tables
    - no extra databases or sub-trees
  - time-saving
    - does not build extra databases and sub-trees
    - walks up each path only once to update count information for nodes on the tree and to build sub-header-tables
Slide 15: TD-FP-Growth for association rule mining

- Assumptions
  - there is a class attribute in the database
  - items in the class attribute are called class-items; the others are non-class-items
  - each transaction is associated with a class-item
  - only a class-item appears on the right-hand side of a rule

  Transaction ID | non-class attribute | class attribute
  1              | a, b                | C1
  2              | d                   | C2
  3              | e, d, f             | C3

Example rule: a, b → Ci
Slide 16: TD-FP-Growth for association rule mining -- multiple minimum supports

- Why?
  - with a uniform minimum support, the computation of counts considers only the number of appearances
  - a uniform minimum support is unfair to items that appear less often but are worth more
  - e.g. responders vs. non-responders
- How?
  - use a different support threshold for each class
Slide 17: TD-FP-Growth for association rule mining -- multiple minimum supports

- Multiple vs. uniform minimum supports
  - example database (class counts: C1 = 4, C2 = 2):
    {c, f, C1}, {b, e, C2}, {b, e, f, C1}, {a, c, f, C1}, {c, e, C2}, {b, c, d, C1}
  - mine rules with a relative minsup of 50%, proportional to each class (the multiplier used in the performance study)
  - uniform minimum support: absolute minsup = 1
    - 11-node tree, 23 rules
  - multiple minimum supports: absolute minsup1 = 2, absolute minsup2 = 1
    - 7-node tree, 9 rules
  - more effective and space-saving
  - time-saving -- shown in the performance study
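The per-class absolute thresholds in this example follow from the single relative minsup. A minimal sketch (the helper name `class_thresholds` and storing the class-item last in each tuple are assumptions):

```python
import math
from collections import Counter

def class_thresholds(transactions, rel_minsup):
    """Convert one relative minsup (a fraction) into a per-class
    absolute threshold proportional to each class's size."""
    sizes = Counter(t[-1] for t in transactions)  # class-item stored last
    return {c: math.ceil(n * rel_minsup) for c, n in sizes.items()}
```

With the slide's database (4 transactions in C1, 2 in C2) and a relative minsup of 50%, this gives minsup1 = 2 and minsup2 = 1, as stated above.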
Slide 18: TD-FP-Growth for association rule mining -- confidence pruning

- Motivation
  - use the other association rule constraint, confidence, to speed up mining
- Method
  - confidence itself is not anti-monotone
  - introduce an acting constraint of confidence, which is anti-monotone
  - push it inside the mining process
Slide 19: TD-FP-Growth for association rule mining -- confidence pruning

conf(A → B) = count(AB) / count(A) ≥ minconf
⇔ count(AB) ≥ count(A) × minconf
⇒ count(AB) ≥ minsup × minconf   (anti-monotone, but weaker)

--- the last condition is the acting constraint of confidence for the original confidence constraint of rule A → B

- the support of the rule is computed as count(A)
- count(AB) is the class-count of itemset A with respect to class B
Slide 20: TD-FP-Growth for association rule mining -- confidence pruning

Database (minsup = 2, minconf = 60%):
{c, f, C1}, {b, e, C2}, {b, e, f, C1}, {a, c, f, C1}, {a, c, d, C2}

Header table H:

  entry value i | count(i) | count(i, C1) | count(i, C2) | side-link
  a             | 2        | 1            | 1            |
  b             | 2        | 1            | 1            |
  c             | 3        | 2            | 1            |
  e             | 2        | 1            | 1            |
  f             | 3        | 3            | 0            |

count(e) ≥ minsup; however, both count(e, C1) and count(e, C2) < minsup × minconf ⇒ terminate mining for e!

Sub-header table for e (never needs to be built):

  entry value i | count(i) | count(i, C1) | count(i, C2) | side-link
  b             | 2        | 1            | 1            |
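The pruning decision for e can be reproduced with a few counters; `prune` is a hypothetical helper implementing the acting constraint, not code from this work:

```python
from collections import Counter

transactions = [({'c','f'}, 'C1'), ({'b','e'}, 'C2'), ({'b','e','f'}, 'C1'),
                ({'a','c','f'}, 'C1'), ({'a','c','d'}, 'C2')]
minsup, minconf = 2, 0.6

count = Counter()        # count(i): transactions containing item i
class_count = Counter()  # count(i, C): those transactions labeled with class C
for items, cls in transactions:
    for i in items:
        count[i] += 1
        class_count[(i, cls)] += 1

def prune(item):
    """Acting constraint: stop growing patterns ending with `item` when
    no class-count reaches minsup * minconf (the anti-monotone bound)."""
    bound = minsup * minconf
    return all(class_count[(item, c)] < bound for c in ('C1', 'C2'))
```

Here minsup × minconf = 1.2, so e (class-counts 1 and 1) is pruned even though count(e) = 2 passes minsup, while f (count(f, C1) = 3) is not.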
Slide 21: Performance

- Datasets chosen from the UC Irvine Machine Learning Repository
  (http://www.ics.uci.edu/~mlearn/MLRepository.html)

  dataset   | # of transactions | # of items per transaction | class distribution (%)                     | # of distinct items
  Dna-train | 2000              | 61                         | 23.2, 24.25, 52.55                         | 240
  Connect-4 | 67557             | 43                         | 9.55, 24.62, 65.83                         | 126
  Forest    | 581012            | 13                         | 0.47, 1.63, 2.99, 3.53, 6.15, 36.36, 48.76 | 15916
Slide 22: Performance -- frequent pattern mining
Slide 23: Performance -- mining rules with multiple minimum supports

- FP-growth is for frequent pattern mining only
- relative minsup, proportional to each class
Slide 24: Performance -- mining rules with confidence pruning
Slide 25: Conclusions and future work

- Conclusions about the TD-FP-Growth algorithm
  - more efficient at finding both frequent patterns and association rules
  - more effective at mining rules through multiple minimum supports
  - introduces a new pruning method, confidence pruning, and pushes it inside the mining process, further speeding up mining
Slide 26: Conclusions and future work

- Future work
  - explore other constraint-based association rule mining methods
  - mine association rules with an item concept hierarchy
  - apply TD-FP-Growth to applications based on association rule mining
    - clustering
    - classification
Slide 27: References

- (1) R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. Proc. 1993 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'93), pages 207-216, Washington, D.C., May 1993.
- (2) U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.). Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
- (3) H. Toivonen. Sampling large databases for association rules. Proc. 1996 Int. Conf. Very Large Data Bases (VLDB'96), pages 134-145, Bombay, India, September 1996.
- (4) R. Agrawal and S. Srikant. Mining sequential patterns. Proc. 1995 Int. Conf. Data Engineering (ICDE'95), pages 3-14, Taipei, Taiwan, March 1995.
- (5) J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'00), pages 1-12, Dallas, TX, May 2000.
- (6) J. Han, J. Pei, G. Dong, and K. Wang. Efficient computation of iceberg cubes with complex measures. Proc. 2001 ACM-SIGMOD Int. Conf., Santa Barbara, CA, May 2001.
- And more!