Title: Frequent itemset mining and temporal extensions
1. Frequent itemset mining and temporal extensions
Sunita Sarawagi (sunita_at_it.iitb.ac.in, http://www.it.iitb.ac.in/sunita)
2. Association rules
- Given several sets of items, for example:
  - Set of items purchased
  - Set of pages visited on a website
  - Set of doctors visited
- Find all rules that correlate the presence of one set of items with another
- Rules are of the form X → Y, where X and Y are sets of items
- E.g., purchase of books A and B → purchase of C
3. Parameters: Support and Confidence
- Every rule X → Z has two parameters
  - Support: probability that a transaction contains both X and Z
  - Confidence: conditional probability that a transaction containing X also contains Z
- Two parameters to association rule mining:
  - Minimum support s
  - Minimum confidence c
- Example with s = 50% and c = 50%:
  - A → C (support 50%, confidence 66.6%)
  - C → A (support 50%, confidence 100%)
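To make the two parameters concrete, the sketch below computes support and confidence over a tiny transaction list. The four transactions are my own toy example, chosen so the numbers come out to the (50%, 66.6%) and (50%, 100%) figures above.

# A toy database consistent with the numbers on this slide
# (the concrete transactions are my illustration, not from the slides):
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, z, transactions):
    """Conditional probability that a transaction containing X also contains Z."""
    return support(x | z, transactions) / support(x, transactions)

print(support({"A", "C"}, transactions))        # 0.5    -> 50% support for A -> C
print(confidence({"A"}, {"C"}, transactions))   # 0.666  -> 66.6% confidence for A -> C
print(confidence({"C"}, {"A"}, transactions))   # 1.0    -> 100% confidence for C -> A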
4. Applications of fast itemset counting
- Cross-selling in retail, banking
- Catalog design and store layout
- Applications in medicine: find redundant tests
- Improve the predictive capability of classifiers that assume attribute independence
- Improved clustering of categorical attributes
5. Finding association rules in large databases
- Number of transactions: in the millions
- Number of distinct items: tens of thousands
- Lots of work on scalable algorithms
- Typically two parts to the algorithm:
  - Finding all frequent itemsets with support > S
  - Finding rules with confidence greater than C
- Frequent itemset search is the more expensive part
  - Apriori algorithm, FP-tree algorithm
6. The Apriori Algorithm
- L1 = frequent items of size one
- for (k = 1; Lk != ∅; k++)
  - Ck+1 = candidates generated from Lk by:
    - Joining Lk with itself
    - Pruning any (k+1)-itemset that has a k-subset not in Lk
  - for each transaction t in the database do
    - increment the count of every candidate in Ck+1 that is contained in t
  - Lk+1 = candidates in Ck+1 with support ≥ min_support
- return ∪k Lk
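A minimal in-memory Python sketch of this loop; the function name, the simple candidate-generation step, and the toy call at the end are my illustration rather than the slides' exact pseudocode.

from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset (as a frozenset) with its count."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # L1: frequent items of size one
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c / n >= min_support}
    frequent = dict(current)
    k = 1
    while current:
        # Candidate generation: extend each frequent k-itemset by one item,
        # then prune candidates that have an infrequent k-subset
        items = sorted({i for s in current for i in s})
        candidates = set()
        for s in current:
            for i in items:
                if i not in s:
                    cand = s | {i}
                    if all(frozenset(sub) in current for sub in combinations(cand, k)):
                        candidates.add(cand)
        # Count the candidates in one scan of the database
        cand_counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    cand_counts[c] += 1
        current = {c: cnt for c, cnt in cand_counts.items() if cnt / n >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# e.g. apriori([{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}], 0.5)
# returns the frequent itemsets {A}:3, {B}:2, {C}:2 and {A, C}:2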
7. How to Generate Candidates?
- Suppose the items in Lk-1 are listed in an order
- Step 1: self-joining Lk-1
  - insert into Ck
  - select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
  - from Lk-1 p, Lk-1 q
  - where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
- Step 2: pruning
  - forall itemsets c in Ck do
    - forall (k-1)-subsets s of c do
      - if (s is not in Lk-1) then delete c from Ck
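The same join-and-prune step in Python, as a sketch assuming each itemset in Lk-1 is kept as a sorted tuple (function and variable names are mine):

from itertools import combinations

def generate_candidates(l_prev, k):
    """Self-join L(k-1) with itself, then prune candidates that have an
    infrequent (k-1)-subset. Each itemset is a sorted tuple of items."""
    prev = set(l_prev)
    candidates = []
    for p in l_prev:
        for q in l_prev:
            # Join condition: first k-2 items equal, last item of p < last item of q
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune: every (k-1)-subset of c must be in L(k-1)
                if all(sub in prev for sub in combinations(c, k - 1)):
                    candidates.append(c)
    return candidates

# e.g. generate_candidates([('A','B'), ('A','C'), ('B','C')], 3) -> [('A','B','C')]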
8. The Apriori Algorithm: An Example
(Figure: a trace on a small database D. Scan D to count the candidates C1 and keep the frequent itemsets L1; self-join L1 to get C2, scan D, keep L2; generate C3, scan D, keep L3.)
9. Improvements to Apriori
- Apriori with well-designed data structures works well in practice when frequent itemsets are not too long (the common case)
- Lots of enhancements proposed:
  - Sampling: count in two passes
  - Invert the database to column-major instead of row-major layout and count by intersection (sketched below)
  - Count itemsets of multiple lengths in one pass
- Reducing passes is not that useful, since I/O is not the bottleneck
- Main bottleneck: candidate generation and counting, which is not optimized for long itemsets
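The column-major (vertical) enhancement can be illustrated as follows; this is my sketch of the idea, reusing the earlier toy transactions, not code from the talk. Each item maps to the set of transaction ids that contain it, and an itemset's support is the size of the intersection of those tid-sets.

def vertical_layout(transactions):
    """Invert a row-major transaction list into item -> set of transaction ids."""
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets

def support_by_intersection(itemset, tidsets, n):
    """Support of an itemset = |intersection of its items' tid-sets| / n."""
    tids = set.intersection(*(tidsets[i] for i in itemset))
    return len(tids) / n

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
tidsets = vertical_layout(transactions)
print(support_by_intersection({"A", "C"}, tidsets, len(transactions)))  # 0.5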
10. Mining Frequent Patterns Without Candidate Generation
- Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
  - Highly condensed, but complete for frequent pattern mining
- Develop an efficient, FP-tree-based frequent pattern mining method
  - A divide-and-conquer methodology: decompose mining tasks into smaller ones
  - Avoid candidate generation
11. Construct FP-tree from the Database
min_support = 0.5

TID | Items bought           | (Ordered) frequent items
100 | f, a, c, d, g, i, m, p | f, c, a, m, p
200 | a, b, c, f, l, m, o    | f, c, a, b, m
300 | b, f, h, j, o          | f, b
400 | b, c, k, s, p          | c, b, p
500 | a, f, c, e, l, p, m, n | f, c, a, m, p

- Scan the DB once, find the frequent 1-itemsets
- Order the frequent items by decreasing frequency: f:4, c:4, a:3, b:3, m:3, p:3
- Scan the DB again, construct the FP-tree by inserting each transaction's ordered frequent items
(Figure: the resulting FP-tree, with a main path f:4, c:3, a:3, m:2, p:2 and side branches b:1, m:1 under a, b:1 under f, and c:1, b:1, p:1 off the root.)
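A compact Python sketch of the construction, assuming the transactions fit in memory; the class and helper names are mine, and the optional item_order argument is only there so ties among equally frequent items can be broken the same way as on the slide.

from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fptree(transactions, min_support, item_order=None):
    """Build an FP-tree and a header table (item -> list of its tree nodes).
    `item_order`, if given, fixes the ordering of the frequent items."""
    n = len(transactions)
    freq = Counter(i for t in transactions for i in t)
    frequent = {i: c for i, c in freq.items() if c / n >= min_support}
    order = item_order or sorted(frequent, key=lambda i: -frequent[i])
    rank = {item: r for r, item in enumerate(order) if item in frequent}

    root, header = FPNode(None, None), defaultdict(list)
    for t in transactions:
        # Keep only frequent items, in decreasing-frequency order
        items = sorted((i for i in t if i in rank), key=lambda i: rank[i])
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

transactions = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
                list("bcksp"), list("afcelpmn")]
root, header = build_fptree(transactions, 0.5, item_order="fcabmp")
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
# -> {'f': 4, 'c': 4, 'a': 3, 'm': 3, 'p': 3, 'b': 3}, matching the header table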
12. Step 1: FP-tree to Conditional Pattern Base
- Starting at the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node links of each frequent item
- Accumulate all the transformed prefix paths of that item to form its conditional pattern base

Conditional pattern bases:
item | conditional pattern base
c    | f:3
a    | fc:3
b    | fca:1, f:1, c:1
m    | fca:2, fcab:1
p    | fcam:2, cb:1
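Continuing the build_fptree sketch above (so this assumes the FPNode class and the header table defined there), the conditional pattern base of an item can be collected by walking parent pointers:

def conditional_pattern_base(item, header):
    """Collect (prefix path, count) pairs for every occurrence of `item`
    by walking parent pointers up to the root."""
    base = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((tuple(reversed(path)), node.count))
    return base

print(conditional_pattern_base("m", header))
# -> [(('f', 'c', 'a'), 2), (('f', 'c', 'a', 'b'), 1)], i.e. fca:2, fcab:1 as on the slide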
13. Step 2: Construct the Conditional FP-tree
- For each pattern base:
  - Accumulate the count for each item in the base
  - Construct the FP-tree for the frequent items of the pattern base
- m's conditional pattern base: fca:2, fcab:1
- All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
14. Mining Frequent Patterns by Creating Conditional Pattern Bases
- Repeat this recursively for higher items
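The recursion can be sketched without the tree machinery by representing each conditional pattern base as a plain list of (prefix items, count) pairs; this keeps the divide-and-conquer structure of FP-growth but is a simplification for illustration, and the names are mine.

from collections import Counter

def mine(patterns, min_count, suffix=()):
    """Divide-and-conquer mining in the spirit of FP-growth.
    `patterns` is a conditional pattern base: a list of (items, count)
    pairs. Returns {frozenset(itemset): support count}."""
    counts = Counter()
    for items, c in patterns:
        for item in set(items):
            counts[item] += c
    results = {}
    for item, c in counts.items():
        if c < min_count:
            continue
        results[frozenset((item,) + suffix)] = c
        # Conditional pattern base of `item`: for every pattern containing it,
        # keep the items that precede it in a fixed global order (here '<'),
        # so that each itemset is generated exactly once.
        cond = [(tuple(i for i in items if i < item), cnt)
                for items, cnt in patterns if item in items]
        cond = [(p, cnt) for p, cnt in cond if p]
        results.update(mine(cond, min_count, (item,) + suffix))
    return results

transactions = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
                list("bcksp"), list("afcelpmn")]
result = mine([(tuple(t), 1) for t in transactions], min_count=3)
print(result[frozenset("fcam")])  # 3: {f, c, a, m} appears in 3 of the 5 baskets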
15. FP-growth vs. Apriori: Scalability with the Support Threshold
(Figure: FP-growth vs. Apriori run time as the support threshold varies, on data set T25I20D10K.)
16. Criticism of Support and Confidence
- Example: X and Y are positively correlated and X and Z are negatively related, yet the support and confidence of X → Z dominate
- Need to measure departure from expected support
  - For two items: compare the observed support Pr[X, Y] with the support Pr[X] · Pr[Y] expected under independence (illustrated below)
  - For k items: expected support is derived from the supports of the (k-1)-itemsets using iterative scaling methods
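For the two-item case, the ratio of observed to expected support (the same ratio that later slides call ρ) can be computed directly; the function and the toy baskets below are my illustration, not data from the talk.

def interest(transactions, x, y):
    """Ratio of observed joint support to the support expected under
    independence, Pr[x, y] / (Pr[x] * Pr[y]); values near 1 mean no
    surprise, values far from 1 mean positive or negative correlation."""
    n = len(transactions)
    px = sum(1 for t in transactions if x in t) / n
    py = sum(1 for t in transactions if y in t) / n
    pxy = sum(1 for t in transactions if x in t and y in t) / n
    return pxy / (px * py)

# Illustrative data only: X and Y co-occur more often than independence
# predicts, X and Z less often.
baskets = [{"X", "Y"}, {"X", "Y", "Z"}, {"X", "Z"}, {"Z"}, {"X", "Y"}]
print(interest(baskets, "X", "Y"))  # 1.25  -> positively correlated
print(interest(baskets, "X", "Z"))  # 0.83  -> negatively correlated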
17. Prevalent correlations are not interesting
- Analysts already know about prevalent rules
- Interesting rules are those that deviate from prior expectation
- Mining's payoff is in finding surprising phenomena
(Figure: the same rule, "bedsheets and pillow covers sell together!", is discovered in 1995 and rediscovered unchanged later; a prevalent rule stops being surprising.)
18. What makes a rule surprising?
- Does not match prior expectation
  - The correlation between milk and cereal remains roughly constant over time
- Cannot be trivially derived from simpler rules
  - Milk 10%, cereal 10%
  - Milk and cereal 10%: surprising
  - Eggs 10%
  - Milk, cereal and eggs 0.1%: surprising!
  - Expected: 1%
19. Finding surprising temporal patterns
- Algorithms to mine for surprising patterns
- Encode itemsets into bit streams using two models
  - Mopt: the optimal model, which allows change along time
  - Mcons: the constrained model, which does not allow change along time
- Surprise = difference in the number of bits needed under Mopt and Mcons
20. One-item optimal model
- Milk-buying habits are modeled by a biased coin
- The customer tosses this coin to decide whether to buy milk
  - Head (or 1) denotes that the basket contains milk
  - Coin bias is Pr[milk]
- The analyst wants to study Pr[milk] along time
  - A single coin with fixed bias is not interesting
  - Changes in bias are interesting
21. The coin segmentation problem
- Players A and B
- A has a set of coins with different biases
- A repeatedly:
  - Picks an arbitrary coin
  - Tosses it an arbitrary number of times
- B observes the H/T sequence and guesses the transition points and biases
(Figure: A picks a coin, tosses it, and returns it; B watches the toss outcomes.)
22. How to explain the data
- Given n head/tail observations
- Can assume n different coins, each with bias 0 or 1
  - Data fits perfectly (with probability one)
  - But many coins are needed
- Or assume one coin
  - May fit the data poorly
- The best explanation is a compromise
(Figure: the observed sequence split into three segments with estimated biases 5/7, 1/3, and 1/4.)
23. Coding examples
- Sequence of k zeroes
  - Naïve encoding takes k bits
  - Run-length encoding takes about log k bits
- 1000 bits, 10 randomly placed 1s, rest 0s
  - Posit a coin with bias 0.01
  - Data encoding cost (by Shannon's theorem) is the number of bits needed under that bias, about 1000 · H(0.01) ≈ 81 bits
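The arithmetic can be checked with a couple of lines; the formula is Shannon's code length for a binary sequence under a fixed-bias coin (the function name is mine).

import math

def data_cost_bits(num_ones, num_zeros, p):
    """Shannon code length of a binary sequence under a coin with bias p:
    each 1 costs -log2(p) bits and each 0 costs -log2(1 - p) bits."""
    return -num_ones * math.log2(p) - num_zeros * math.log2(1 - p)

# 1000 bits with 10 ones, encoded with a posited bias of 0.01
print(round(data_cost_bits(10, 990, 0.01)))  # about 81 bits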
24. How to find optimal segments
(Figure: a sequence of 17 tosses and the derived graph with 18 nodes; each edge spans a candidate segment, e.g. a segment with 5 heads and 2 tails is coded with Pr[head] = 5/7.)
- Edge cost = model cost + data cost
- Model cost = one node ID + one Pr[head] value
- The optimal segmentation is a shortest path through this graph
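A minimal sketch of the shortest-path segmentation, assuming a flat per-segment model cost as a stand-in for "one node ID + one Pr[head]"; the cost value, the function names, and the toy stream are my choices, not the paper's exact model.

import math

def data_cost(heads, tails):
    """Shannon coding cost of a segment encoded with its own empirical bias."""
    n, cost = heads + tails, 0.0
    if heads:
        cost -= heads * math.log2(heads / n)
    if tails:
        cost -= tails * math.log2(tails / n)
    return cost

def optimal_segments(tosses, model_cost_per_segment=4.0):
    """Shortest path over the segment graph: node i is the boundary before
    toss i, and edge (i, j) is one segment covering tosses[i:j] with cost
    model cost + data cost. The graph is a left-to-right DAG, so the
    shortest path can be found by dynamic programming."""
    n = len(tosses)
    best = [math.inf] * (n + 1)   # best[j] = cheapest encoding of tosses[:j]
    back = [0] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(j):
            seg = tosses[i:j]
            heads = sum(seg)
            cost = best[i] + model_cost_per_segment + data_cost(heads, len(seg) - heads)
            if cost < best[j]:
                best[j], back[j] = cost, i
    # Walk the back-pointers to recover the segment boundaries
    cuts, j = [], n
    while j > 0:
        cuts.append((back[j], j))
        j = back[j]
    return best[n], list(reversed(cuts))

# A stream whose bias shifts roughly halfway through; the returned
# segmentation splits near the shift rather than using one coin throughout.
stream = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
print(optimal_segments(stream))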
25. Two or more items
- Unconstrained segmentation
  - k items induce a 2^k-sided coin
  - E.g., milk and cereal = 11, milk but not cereal = 10, neither = 00, etc.
  - Shortest path finds a significant shift in any of the coin-face probabilities
- Problem: some of these shifts may be completely explained by the marginals
26. Example
- A drop in the joint sale of milk and cereal is completely explained by a drop in the sale of milk
- Pr[milk, cereal] / (Pr[milk] × Pr[cereal]) remains constant over time
- Call this ratio ρ
27. Constant-ρ segmentation
- ρ = observed support / support expected under independence
- Compute a global ρ over all time
- All coins must have this common value of ρ
- Segment as before
- Compare with the unconstrained coding cost
28. Is all this really needed?
- Simpler alternative:
  - Aggregate data into suitable time windows
  - Compute support, correlation, ρ, etc. in each window
  - Use a variance threshold to choose itemsets
- Pitfalls:
  - Choices of windows and thresholds
  - May miss fine detail
  - Over-sensitive to outliers
29. Experiments
- Millions of baskets over several years
- Two algorithms:
  - Complete MDL approach
  - MDL segmentation + statistical tests (MStat)
- Data set:
  - 2.8 million transactions
  - 7 years, 1987 to 1993
  - 15,800 items
  - Average of 2.62 items per basket
30. Little agreement in itemset ranks
- Simpler methods do not approximate MDL
31. MDL has high selectivity
- The scores of the best itemsets stand out from the rest under MDL
32. Three anecdotes
(Figure: ρ against time for three itemsets.)
- High MStat score, small marginals: polo shirt & shorts
- High correlation, small variation: bedsheets & pillow cases
- High MDL score, significant gradual drift: men's & women's shorts
33. Conclusion
- New notion of surprising patterns based on:
  - Joint support expected from marginals
  - Variation of joint support along time
- Robust MDL formulation
- Efficient algorithms:
  - Near-optimal segmentation using shortest path
  - Pruning criteria
- Successful application to real data
34. References
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94, 487-499, Santiago, Chile.
- S. Chakrabarti, S. Sarawagi and B. Dom. Mining surprising patterns using temporal description length. Proc. of the 24th Int'l Conference on Very Large Databases (VLDB), 1998.
- J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00, 1-12, Dallas, TX, May 2000.
- Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. (Some of the slides in the talk are taken from this book.)
- H. Toivonen. Sampling large databases for association rules. VLDB'96, 134-145, Bombay, India, Sept. 1996.