1
Frequent itemset mining and temporal extensions
Sunita Sarawagi, sunita@it.iitb.ac.in, http://www.it.iitb.ac.in/~sunita
2
Association rules
  • Given several sets of items, for example:
  • Set of items purchased
  • Set of pages visited on a website
  • Set of doctors visited
  • Find all rules that correlate the presence of one
    set of items with another
  • Rules are of the form X → Y, where X and Y are sets of items
  • E.g.: purchase of books A and B → purchase of C

3
Parameters Support and Confidence
  • All rules X → Z have two parameters
  • Support: probability that a transaction has both X and Z
  • Confidence: conditional probability that a transaction having X also contains Z
  • Two parameters to association rule mining:
  • Minimum support s
  • Minimum confidence c
  • With s = 50% and c = 50%:
  • A → C (50%, 66.6%)
  • C → A (50%, 100%)
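As an illustrative sketch (the toy transactions below are my own, chosen so the numbers reproduce the slide's rules), support and confidence can be computed directly:

```python
# Hypothetical toy transaction database, chosen to match the slide's numbers
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y):
    """Conditional probability that a transaction containing x also contains y."""
    return support(x | y) / support(x)

print(support({"A", "C"}))                 # 0.5   -> both rules have support 50%
print(round(confidence({"A"}, {"C"}), 3))  # 0.667 -> A → C has confidence 66.6%
print(confidence({"C"}, {"A"}))            # 1.0   -> C → A has confidence 100%
```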

4
Applications of fast itemset counting
  • Cross selling in retail, banking
  • Catalog design and store layout
  • Applications in medicine: find redundant tests
  • Improve predictive capability of classifiers that
    assume attribute independence
  • Improved clustering of categorical attributes

5
Finding association rules in large databases
  • Number of transactions: in the millions
  • Number of distinct items: tens of thousands
  • Lots of work on scalable algorithms
  • Typically two parts to the algorithm:
  • Finding all frequent itemsets with support > s
  • Finding rules with confidence greater than c
  • The frequent itemset search is the more expensive part
  • Apriori algorithm, FP-tree algorithm

6
The Apriori Algorithm
  • L1 = frequent items of size one
  • for (k = 1; Lk ≠ ∅; k++)
  • Ck+1 = candidates generated from Lk by:
  • Join Lk with itself
  • Prune any (k+1)-itemset that has a subset not in Lk
  • for each transaction t in database do
  • increment the count of all candidates in Ck+1 that are contained in t
  • Lk+1 = candidates in Ck+1 with support ≥ min_support
  • return ∪k Lk
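The loop above can be sketched as a minimal runnable program (assuming transactions are sets of items; helper names are mine, not from the original algorithm description):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search: compute L1, then repeatedly join, prune, and count."""
    n = len(transactions)
    support = lambda s: sum(1 for t in transactions if s <= t) / n
    # L1: frequent items of size one
    items = {i for t in transactions for i in t}
    L = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent, k = list(L), 1
    while L:
        prev = set(L)
        # join Lk with itself to form (k+1)-item candidates ...
        C = {a | b for a in L for b in L if len(a | b) == k + 1}
        # ... and prune any candidate with an infrequent k-subset
        C = [c for c in C if all(frozenset(s) in prev for s in combinations(c, k))]
        # one database pass: keep candidates meeting min_support
        L = [c for c in C if support(c) >= min_support]
        frequent, k = frequent + L, k + 1
    return frequent  # the union of all Lk

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(sorted(sorted(s) for s in apriori(transactions, 0.5)))
# [['A'], ['A', 'C'], ['B'], ['C']]
```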

7
How to Generate Candidates?
  • Suppose the items in Lk-1 are listed in an order
  • Step 1: self-joining Lk-1
  • insert into Ck
  • select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  • from Lk-1 p, Lk-1 q
  • where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
  • Step 2: pruning
  • forall itemsets c in Ck do
  • forall (k-1)-subsets s of c do
  • if (s is not in Lk-1) then delete c from Ck
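The same join-and-prune step, sketched in Python over sorted tuples (the tuple representation is my assumption) instead of SQL:

```python
from itertools import combinations

def generate_candidates(L_prev, k):
    """Build Ck from Lk-1: join pairs sharing their first k-2 items,
    then delete any candidate with a (k-1)-subset not in Lk-1."""
    L_set = set(L_prev)
    Ck = []
    for p in L_prev:
        for q in L_prev:
            # join condition: equal prefixes, p's last item < q's last item
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # pruning: every (k-1)-subset of c must be frequent
                if all(s in L_set for s in combinations(c, k - 1)):
                    Ck.append(c)
    return Ck

L2 = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
print(generate_candidates(L2, 3))
# [('A', 'B', 'C')] -- ('B', 'C', 'D') is pruned because ('C', 'D') is not frequent
```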

8
The Apriori Algorithm Example
[Figure: worked example. Scan database D to count C1 and keep L1; self-join L1 into C2, scan D, keep L2; self-join L2 into C3, scan D, keep L3.]
9
Improvements to Apriori
  • Apriori with well-designed data structures works
    well in practice when frequent itemsets not too
    long (common case)
  • Lots of enhancements proposed
  • Sampling: count in two passes
  • Invert the database to column-major instead of row-major layout and count by intersection
  • Count itemsets of multiple lengths in one pass
  • Reducing passes is not that useful, since I/O is not the bottleneck
  • Main bottleneck: candidate generation and counting, which are not optimized for long itemsets

10
Mining Frequent Patterns Without Candidate
Generation
  • Compress a large database into a compact,
    Frequent-Pattern tree (FP-tree) structure
  • highly condensed, but complete for frequent
    pattern mining
  • Develop an efficient, FP-tree-based frequent
    pattern mining method
  • A divide-and-conquer methodology decompose
    mining tasks into smaller ones
  • Avoid candidate generation

11
Construct FP-tree from Database
min_support 0.5
TID Items bought (ordered) frequent
items 100 f, a, c, d, g, i, m, p f, c, a, m,
p 200 a, b, c, f, l, m, o f, c, a, b,
m 300 b, f, h, j, o f, b 400 b, c, k,
s, p c, b, p 500 a, f, c, e, l, p, m,
n f, c, a, m, p

Scan DB once, find frequent 1-itemset Order
frequent items by decreasing frequency Scan DB
again, construct FP-tree
Item frequency f 4 c 4 a 3 b 3 m 3 p 3
f4
c1
b1
b1
c3
p1
a3
b1
m2
p2
m1
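The two-scan construction can be sketched as follows (class and function names are my own, not from the paper):

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_count):
    # scan 1: count items, keep only the frequent ones
    freq = {i: c for i, c in Counter(i for t in transactions for i in t).items()
            if c >= min_count}
    root, header = Node(None, None), defaultdict(list)
    # scan 2: insert each transaction's frequent items as a path, ordered by
    # decreasing global frequency (ties broken alphabetically in this sketch)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i)):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])  # header-table link
            node = node.children[item]
    return root, header

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"), list("bcksp"), list("afcelpmn")]
root, header = build_fp_tree(db, 3)
print({i: sum(n.count for n in header[i]) for i in "fcabmp"})
# {'f': 4, 'c': 4, 'a': 3, 'b': 3, 'm': 3, 'p': 3}
```

Note that the slide breaks the f/c frequency tie with f first, while this sketch breaks ties alphabetically; that changes the tree's shape but not the total count stored per item.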
12
Step 1: FP-tree to Conditional Pattern Base
  • Starting at the frequent-item header table in the FP-tree
  • Traverse the FP-tree by following the links of each frequent item
  • Accumulate all transformed prefix paths of that item to form its conditional pattern base

Conditional pattern bases
Item  Conditional pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1
13
Step 2: Construct Conditional FP-tree
  • For each pattern base:
  • Accumulate the count for each item in the base
  • Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1
All frequent patterns containing m: m, fm, cm, am, fcm, fam, cam, fcam
14
Mining Frequent Patterns by Creating Conditional
Pattern-Bases
Repeat this recursively for higher items
15
FP-growth vs. Apriori Scalability With the
Support Threshold
Data set T25I20D10K
16
Criticism of Support and Confidence
  • X and Y positively correlated,
  • X and Z negatively correlated,
  • yet the support and confidence of X → Z dominate
  • Need to measure departure from expected.
  • For two items: compare the observed joint support with the product of the marginals
  • For k items, expected support derived from
    support of k-1 itemsets using iterative scaling
    methods
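For the two-item case, one common departure measure (interest, or lift) divides the observed joint support by the support expected under independence; a minimal sketch with hypothetical numbers:

```python
def lift(support_xy, support_x, support_y):
    """Observed joint support over the support expected if X and Y were independent."""
    return support_xy / (support_x * support_y)

# hypothetical numbers: each item has 10% support, and so does the pair
print(round(lift(0.10, 0.10, 0.10), 3))  # 10.0 -> far above 1: strong positive correlation
```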

17
Prevalent correlations are not interesting
  • Analysts already know about prevalent rules
  • Interesting rules are those that deviate from
    prior expectation
  • Mining's payoff is in finding surprising phenomena

1995: bedsheets and pillow covers sell together!
18
What makes a rule surprising?
  • Does not match prior expectation
  • Correlation between milk and cereal remains
    roughly constant over time
  • Cannot be trivially derived from simpler rules
  • Milk: 10%, cereal: 10%
  • Milk and cereal: 10% → surprising
  • Eggs: 10%
  • Milk, cereal and eggs: 0.1% → surprising!
  • Expected: 1%

19
Finding surprising temporal patterns
  • Algorithms to mine for surprising patterns
  • Encode itemsets into bit streams using two models
  • Mopt: the optimal model, which allows change along time
  • Mcons: the constrained model, which does not allow change along time
  • Surprise = difference in the number of bits between Mopt and Mcons

20
One item optimal model
  • Milk-buying habits modeled by biased coin
  • Customer tosses this coin to decide whether to
    buy milk
  • Head or 1 denotes basket contains milk
  • The coin bias is Pr[milk]
  • The analyst wants to study Pr[milk] along time
  • Single coin with fixed bias is not interesting
  • Changes in bias are interesting

21
The coin segmentation problem
  • Players A and B
  • A has a set of coins with different biases
  • A repeatedly
  • Picks arbitrary coin
  • Tosses it arbitrary number of times
  • B observes H/T
  • Guesses transition points and biases

22
How to explain the data
  • Given n head/tail observations
  • Can assume n different coins with bias 0 or 1
  • Data fits perfectly (with probability one)
  • Many coins needed
  • Or assume one coin
  • May fit data poorly
  • Best explanation is a compromise

[Figure: the toss sequence segmented into three runs with biases 5/7, 1/3, and 1/4.]
23
Coding examples
  • Sequence of k zeroes
  • Naïve encoding takes k bits
  • Run length takes about log k bits
  • 1000 bits, 10 randomly placed 1s, rest 0s
  • Posit a coin with bias 0.01
  • Data encoding cost (by Shannon's theorem) is −10 log2(0.01) − 990 log2(0.99) ≈ 81 bits
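The Shannon cost can be checked numerically; a sketch of the standard −Σ nᵢ log2 pᵢ formula:

```python
import math

def encoding_cost(n_ones, n_zeros, p):
    """Bits needed to encode the sequence under a coin with bias p (Shannon cost)."""
    return -n_ones * math.log2(p) - n_zeros * math.log2(1 - p)

# 1000 bits with 10 ones, encoded with a bias-0.01 coin
print(round(encoding_cost(10, 990, 0.01), 1))  # 80.8 -> far below the naive 1000 bits
```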

24
How to find optimal segments
  • Sequence of 17 tosses; derived graph with 18 nodes
  • Edge cost = model cost + data cost
  • Model cost = one node ID + one Pr[head]
  • Data cost for Pr[head] = 5/7: 5 heads, 2 tails
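A self-contained sketch of the shortest-path formulation, solved by dynamic programming over the derived graph (the function names and the fixed per-segment model cost are my simplifications):

```python
import math

def data_cost(seq):
    """Shannon cost of one segment under its own empirical head probability."""
    h = sum(seq)
    t = len(seq) - h
    p = h / len(seq)
    cost = 0.0
    if h:
        cost -= h * math.log2(p)
    if t:
        cost -= t * math.log2(1 - p)
    return cost

def segment(seq, model_cost):
    """Shortest path from node 0 to node n, where edge (i, j) means
    'one coin for tosses i..j-1' at cost model_cost + data_cost."""
    n = len(seq)
    best, back = [0.0] + [math.inf] * n, [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            c = best[i] + model_cost + data_cost(seq[i:j])
            if c < best[j]:
                best[j], back[j] = c, i
    cuts, j = [], n
    while j > 0:  # recover segment boundaries
        cuts.append(j)
        j = back[j]
    return best[n], sorted(cuts)

cost, cuts = segment([1] * 8 + [0] * 8, model_cost=6.0)
print(cost, cuts)  # 12.0 [8, 16] -> one all-heads coin, then one all-tails coin
```

Raising the model cost makes extra coins more expensive, so the optimum collapses back toward a single coin, which is exactly the compromise the previous slide describes.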
25
Two or more items
  • Unconstrained segmentation
  • k items induce a 2k sided coin
  • milk and cereal 11, milk, not cereal 10,
    neither 00, etc.
  • Shortest path finds significant shift in any of
    the coin face probabilities
  • Problem some of these shifts may be completely
    explained by marginals

26
Example
  • Drop in joint sale of milk and cereal is
    completely explained by drop in sale of milk
  • Pr[milk ∧ cereal] / (Pr[milk] × Pr[cereal]) remains constant over time
  • Call this ratio ρ

27
Constant-ρ segmentation
  • ρ = observed support / support expected under independence
  • Compute a global ρ over all time
  • All coins must have this common value of ρ
  • Segment as before
  • Compare with the unconstrained coding cost

28
Is all this really needed?
  • Simpler alternative
  • Aggregate data into suitable time windows
  • Compute support, correlation, ρ, etc. in each window
  • Use a variance threshold to choose itemsets
  • Pitfalls
  • Choices: windows, thresholds
  • May miss fine detail
  • Over-sensitive to outliers

29
Experiments
  • Millions of baskets over several years
  • Two algorithms
  • Complete MDL approach
  • MDL segmentation plus statistical tests (MStat)
  • Data set
  • 2.8 million transactions
  • 7 years, 1987 to 1993
  • 15800 items
  • Average 2.62 items per basket

30
Little agreement in itemset ranks
  • Simpler methods do not approximate MDL

31
MDL has high selectivity
  • The scores of the best itemsets stand out from the rest under MDL

32
Three anecdotes
  • [Plots: ρ against time for each pair]
  • High MStat score, small marginals: polo shirts and shorts
  • High correlation, small variation: bedsheets and pillow cases
  • High MDL score, significant gradual drift: men's and women's shorts

33
Conclusion
  • New notion of surprising patterns based on
  • Joint support expected from marginals
  • Variation of joint support along time
  • Robust MDL formulation
  • Efficient algorithms
  • Near-optimal segmentation using shortest path
  • Pruning criteria
  • Successful application to real data

34
References
  • R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB '94, 487-499, Santiago, Chile, 1994.
  • S. Chakrabarti, S. Sarawagi, and B. Dom. Mining surprising patterns using temporal description length. Proc. of the 24th Int'l Conference on Very Large Databases (VLDB), 1998.
  • J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD '00, 1-12, Dallas, TX, May 2000.
  • J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. (Some of the slides in the talk are taken from this book.)
  • H. Toivonen. Sampling large databases for association rules. VLDB '96, 134-145, Bombay, India, Sept. 1996.