Title: Mining High-Utility Itemsets from Databases
1. Mining High-Utility Itemsets from Databases
- Raymond Chan,
- Qiang Yang
- Hong Kong University of Science and Technology
- and
- Yi-Dong Shen
- Institute of Software, Chinese Academy of Sciences
2. Background: Frequent Patterns and Association Rules
- Itemset X = {x1, …, xk}
- Find all the rules X → Y with minimum confidence and support
  - support, s: probability that a transaction contains X ∪ Y
  - confidence, c: conditional probability that a transaction containing X also contains Y
- Let min_support = 50%, min_conf = 50%:
  - A → C (50%, 66.7%)
  - C → A (50%, 100%)
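The numbers above can be reproduced with a small sketch; the transaction contents below are an assumption (the classic four-transaction Apriori example, which yields exactly these support and confidence values):

```python
# Hypothetical transaction database; not shown on the slide, but chosen so
# that A -> C has (support 50%, confidence 66.7%) and C -> A has (50%, 100%).
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset, db):
    # s: fraction of transactions containing every item of `itemset`.
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    # c: P(rhs ⊆ t | lhs ⊆ t) = support(lhs ∪ rhs) / support(lhs).
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"A", "C"}, transactions))       # 0.5
print(confidence({"A"}, {"C"}, transactions))  # 0.666...
print(confidence({"C"}, {"A"}, transactions))  # 1.0
```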
3. Background
- Existing approaches to association mining are itemset-correlation-oriented: I1, …, Im → Im+1 (s, c)
- But:
  - There are too many such rules!
  - People are more interested in finding out how useful the association rules are
- We introduce the concepts of objective and utility [Shen et al., ICDM 2002]: I1, …, Im → Obj (s, c, u)
4. Utility as a Function
- Support
- Confidence
- Utility of an objective item: u(A = v), a user-assigned value for the objective attribute A taking value v
5. (No transcript)
6. (No transcript)
7. (No transcript)
8. Definitions (cont'd)
- Utility of a record r in DB
- Total utility of I over DB
- Expected utility of an OOA rule I1, …, Im → Obj (s, c, u)
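The formulas themselves were on images, but under the natural reading (a record's utility sums the utilities of the objective items it contains, and the total utility of I sums record utilities over records satisfying I) they can be sketched as follows; the utility table and records are invented:

```python
# Hypothetical utility table for an objective attribute "Obj":
# each (attribute, value) pair carries a user-assigned utility.
utility = {("Obj", "yes"): 5, ("Obj", "no"): -2}

# Invented records: attribute -> value mappings.
db = [
    {"A": 1, "B": 0, "Obj": "yes"},
    {"A": 1, "B": 1, "Obj": "no"},
    {"A": 0, "B": 1, "Obj": "yes"},
]

def record_utility(r):
    # u(r): sum of the utilities of the objective items appearing in r.
    return sum(u for (attr, val), u in utility.items() if r.get(attr) == val)

def total_utility(itemset, db):
    # u(I): sum of u(r) over all records r satisfying itemset I.
    return sum(record_utility(r) for r in db
               if all(r.get(a) == v for a, v in itemset.items()))

print(record_utility(db[0]))        # 5
print(total_utility({"A": 1}, db))  # 5 + (-2) = 3
```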
9. (No transcript)
10. Han et al.
11. (No transcript)
12. Mining OOA Rules
- A pruning strategy by estimating upper bounds
- The records satisfying I contain both positive- and negative-utility items; summing the positive items alone is not a sufficient test against mu, so the negative items must also be included
- Utility upper bound for u(I): sum of all positive items
- A better upper bound for u(I): sum of all positive items + a lower bound for some negative items
- Lower bound for some negative items: height(negative) × lower bound of all single negative items
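As a rough sketch of the two bounds above (the utility values are invented, and `height(negative)` is taken to be the number of negative items assumed present):

```python
# Invented per-item utilities for the records satisfying some itemset I.
positives = [5, 3, 4]    # utilities of the positive items
negatives = [-2, -1]     # utilities of the single negative items

# Loose upper bound on u(I): count only the positive items.
loose_upper = sum(positives)

# Tighter upper bound: also charge a lower bound for the negative items that
# must appear: height(negative) * (lower bound over all single negative items).
height = len(negatives)
tight_upper = loose_upper + height * min(negatives)

print(loose_upper)  # 12
print(tight_upper)  # 12 + 2 * (-2) = 8
```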
13. (No transcript)
14. Problems with OOA Mining
- Two problems:
  - The minimum utility mu is not easy to set
    - Solution: mine the top-K utility patterns
  - Still too many patterns, due to the downward closure property
    - Solution: mine frequent closed patterns
    - This removes redundancy (see later)
- Let's look at them one at a time
15. Related Work: Top-K Patterns
- Liu et al. (2000) and Silberschatz et al. (1996) focused on finding interesting patterns by matching them against a given set of user beliefs
- Sese et al. (2002) studied mining the N most correlated association rules
- Han et al. (2002) proposed a new mining task: mining the top-k frequent closed patterns of length no less than min_l
16. Top-K with Upper and Lower Bound Utilities
- Utility can be a positive or a negative value
- The utility constraint therefore becomes neither monotone nor anti-monotone
- Strategy: look for an upper bound and a lower bound of utility that satisfy the anti-monotone restriction
- Pruning opportunity: when upper1 < lower2, the pattern at utility value1 can never beat the pattern at value2 and can be pruned
- (Figure: two utility values, value1 and value2, with their upper/lower bounds on a utility axis)
17. Top-n Utility and Bottom-n Utility
- Top-n utility
- Bottom-n utility
- where n = 1, …, N, and u(·) is sorted so that u(r1) ≥ u(r2) ≥ … ≥ u(rN)
- Choosing n = ms × |DB| makes them the tightest upper bound and lower bound
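The definitions were on an image, but under the natural reading (top-n utility sums the n largest record utilities, bottom-n the n smallest) they can be sketched as:

```python
# Invented record utilities u(r1), ..., u(rN) for records satisfying a pattern.
utilities = [7, -3, 5, 2, -1, 4]

def top_n(us, n):
    # Sum of the n largest record utilities: an upper bound on the pattern's utility.
    return sum(sorted(us, reverse=True)[:n])

def bottom_n(us, n):
    # Sum of the n smallest record utilities: a lower bound.
    return sum(sorted(us)[:n])

# With n = ms * |DB| (the fewest records a frequent pattern can cover),
# these become the tightest bounds that are still anti-monotone.
ms = 0.5
n = int(ms * len(utilities))   # 3
print(top_n(utilities, n))     # 7 + 5 + 4 = 16
print(bottom_n(utilities, n))  # -3 + -1 + 2 = -2
```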
18. Mining Top-K Utility Frequent Closed Patterns
- L: the set of top-K utilities in {u | I → Obj (s, c, u) is an OOA rule}
- A threshold variable stores the minimum utility in L
- Apriori-based, with enhancements to handle top-K objective utility and closed patterns (see next page)
19. Closed Patterns
- A closed itemset X has no superset X′ such that every transaction containing X also contains X′
- {a, b}, {a, b, d}, {a, b, c} are frequent closed patterns
- A closed pattern may not be a max pattern (e.g. {a, b} is not a max pattern)
- Concise representation of frequent patterns
- Reduces the number of patterns and rules
- N. Pasquier et al., ICDT '99
- min_sup = 2
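A brute-force sketch of the closedness check on a toy database (the transactions are assumed, chosen so that with min_sup = 2 exactly the three patterns named above come out closed):

```python
from itertools import combinations

# Toy database (invented); min_sup = 2 as on the slide.
db = [{"a", "b", "d"}, {"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "d"}]
min_sup = 2
items = sorted(set().union(*db))

def sup(X):
    # Number of transactions containing every item of X.
    return sum(X <= t for t in db)

# All frequent itemsets, by brute-force enumeration.
frequent = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if sup(set(c)) >= min_sup]

# X is closed iff no proper superset has the same support.
closed = [X for X in frequent
          if not any(X < Y and sup(Y) == sup(X) for Y in frequent)]

print([sorted(X) for X in closed])  # [['a','b'], ['a','b','c'], ['a','b','d']]
```

Note that {a, b} is closed but not maximal here: it has frequent supersets, just none with the same support.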
20. Four Pruning Strategies: I
- Strategy 1: similar to the Apriori pruning strategy,
  - for every (i + 1)-generator I in Gi+1,
  - if there exists an i-sub-itemset J ⊂ I such that J ∉ Gi,
  - I is pruned from Gi+1.
- This strategy prunes all supersets of infrequent generators, because Gk only contains frequent generators.
21. Pruning Strategies II
- Strategy 2:
  - For all the remaining (i + 1)-generators in Gi+1,
  - we prune those generators that do not satisfy the user-specified minimum support constraint.
- This strategy removes all infrequent generators.
- Infrequent generators cannot produce frequent closed itemsets, because of the anti-monotone property of the support constraint.
22. Pruning Strategies Using Upper and Lower Bounds
- Strategy 3:
  - For the remaining frequent (i + 1)-generators in Gi+1,
  - we prune those generators whose top-K utility (utility upper bound) is less than the current threshold (the temporary Kth highest utility value during the computation).
- This strategy applies the anti-monotone property of the utility upper-bound constraint, removing all generators that cannot produce closed itemsets with a utility high enough to be in the top-K list.
23. Pruning Strategy 4: Closed Itemsets
- Strategy 4: for every remaining (i + 1)-generator I in Gi+1,
  - if there exists an i-sub-itemset J ⊂ I such that I and J have the same support,
  - I is pruned from Gi+1.
- This strategy removes redundant generators, since the closed itemset from I has already been generated from J.
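The four strategies can be combined into one candidate filter. The sketch below uses invented supports and bounds; the names (`prune_candidates`, `top_k_bound`) are ours, not from the paper:

```python
def prune_candidates(candidates, G_i, sup, min_sup, top_k_bound, threshold):
    """Apply Strategies 1-4 to candidate (i+1)-generators; return survivors."""
    survivors = []
    for I in candidates:
        subsets = [I - {x} for x in I]
        # Strategy 1 (Apriori): every i-sub-itemset must itself be a generator.
        if any(J not in G_i for J in subsets):
            continue
        # Strategy 2: drop infrequent generators.
        if sup(I) < min_sup:
            continue
        # Strategy 3: drop generators whose utility upper bound cannot reach
        # the temporary Kth highest utility seen so far.
        if top_k_bound(I) < threshold:
            continue
        # Strategy 4: a sub-itemset with equal support makes I redundant,
        # since its closure was already generated from that subset.
        if any(sup(J) == sup(I) for J in subsets):
            continue
        survivors.append(I)
    return survivors

# Invented example data: {a,c} is pruned because {c} has the same support.
G1 = {frozenset({"a"}), frozenset({"b"}), frozenset({"c"})}
sups = {frozenset({"a"}): 3, frozenset({"b"}): 2, frozenset({"c"}): 1,
        frozenset({"a", "b"}): 1, frozenset({"a", "c"}): 1}
out = prune_candidates([frozenset({"a", "b"}), frozenset({"a", "c"})],
                       G1, sups.__getitem__, 1, lambda I: 5, 0)
print(out)  # only {a, b} survives
```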
24. Generate Closed Patterns
- One scan of DB is required to generate the closed itemsets from the generators.
- For each database record r of DB and for each generator I in Gk+1, the corresponding closed itemset I.closure is updated:
  - If r is the first record that contains I, I.closure is empty, so we put all non-objective items of r into I.closure.
  - If r is not the first record that contains I, I.closure is non-empty, so we intersect I.closure with r (i.e. I.closure ∩ r) and put the resulting itemset back into I.closure.
- At the end of the database scan, I.closure contains the closed itemset generated from generator I.
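The one-scan closure computation above, as a sketch (records are reduced to plain item sets, and the helper name `closures` is ours):

```python
def closures(generators, db):
    """One scan of DB: intersect each generator's closure with every record containing it."""
    closure = {}
    for r in db:                      # single database scan
        for I in generators:
            if I <= r:                # record r contains generator I
                if I not in closure:  # first record containing I
                    closure[I] = set(r)
                else:                 # later records: I.closure = I.closure ∩ r
                    closure[I] &= r
    return closure

# Invented toy data.
db = [{"a", "b", "d"}, {"a", "b", "c"}, {"a", "b", "d"}]
gens = [frozenset({"a"}), frozenset({"d"})]
result = closures(gens, db)
print(result[frozenset({"a"})])  # {'a', 'b'}
print(result[frozenset({"d"})])  # {'a', 'b', 'd'}
```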
25. Apriori Closed Algorithm
26. Experimental Evaluation
- We expect more useful and fewer patterns than standard Apriori
- Real datasets from the UCI Machine Learning Archive
  - German Credit dataset (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/statlog/german/)
    - 1000 customer records, 21 attributes
  - Heart Disease dataset (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/statlog/heart)
    - 270 patient records, 14 attributes
27. Performance by Varying ms
(a) German credit dataset. (b) Heart disease dataset.
28. Performance by Varying K
(a) German credit dataset. (b) Heart disease dataset.
29. Effect of Pruning Strategies 1 to 4
(a) German credit dataset. (b) Heart disease dataset.
30. Frequent Itemsets (German)
31. Frequent Itemsets (Heart)
32. Conclusions and Future Work
- We developed a new approach to modeling association mining, OOA mining, which is objective-oriented
- We developed an algorithm to mine the OOA frequent closed patterns and the top-K utility OOA rules
- We found a weaker but anti-monotone condition based on utility that helped us prune the search space
- Our algorithm produces the desired results without much overhead, with the added advantage of letting the user specify the number of rules
- Future work:
  - More pruning strategies
  - More sophisticated knowledge than rules