1
Frequent itemset mining and temporal extensions
Sunita Sarawagi, sunita@it.iitb.ac.in, http://www.it.iitb.ac.in/~sunita
2
Association rules
  • Given several sets of items, for example:
  • Set of items purchased
  • Set of pages visited on a website
  • Set of doctors visited
  • Find all rules that correlate the presence of one
    set of items with another
  • Rules are of the form X → Y, where X and Y are sets of items
  • E.g.: purchase of books A and B → purchase of C

3
Parameters Support and Confidence
  • All rules X → Z have two parameters
  • Support: probability that a transaction has both X and Z
  • Confidence: conditional probability that a transaction having X also contains Z
  • Two parameters to association rule mining:
  • Minimum support s
  • Minimum confidence c
  • With s = 50% and c = 50%:
  • A → C (50%, 66.6%)
  • C → A (50%, 100%)
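As an illustrative sketch (the toy transactions below are my own, chosen so the numbers reproduce the slide's rules), support and confidence can be computed directly:

```python
# Hypothetical toy transaction database, chosen to match the slide's numbers
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y):
    """Conditional probability that a transaction containing x also contains y."""
    return support(x | y) / support(x)

print(support({"A", "C"}))                 # 0.5   -> both rules have support 50%
print(round(confidence({"A"}, {"C"}), 3))  # 0.667 -> A → C has confidence 66.6%
print(confidence({"C"}, {"A"}))            # 1.0   -> C → A has confidence 100%
```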

4
Applications of fast itemset counting
  • Cross selling in retail, banking
  • Catalog design and store layout
  • Applications in medicine: find redundant tests
  • Improve predictive capability of classifiers that
    assume attribute independence
  • Improved clustering of categorical attributes

5
Finding association rules in large databases
  • Number of transactions: in the millions
  • Number of distinct items: tens of thousands
  • Lots of work on scalable algorithms
  • Typically two parts to the algorithm:
  • Finding all frequent itemsets with support > s
  • Finding rules with confidence greater than c
  • The frequent itemset search is the more expensive part
  • Apriori algorithm, FP-tree algorithm

6
The Apriori Algorithm
  • L1 = frequent items of size one
  • for (k = 1; Lk ≠ ∅; k++)
  • Ck+1 = candidates generated from Lk by:
  • Join Lk with itself
  • Prune any (k+1)-itemset that has a subset not in Lk
  • for each transaction t in database do
  • increment the count of all candidates in Ck+1 that are contained in t
  • Lk+1 = candidates in Ck+1 with support ≥ min_support
  • return ∪k Lk
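The loop above can be sketched as a minimal runnable program (assuming transactions are sets of items; helper names are mine, not from the original algorithm description):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search: compute L1, then repeatedly join, prune, and count."""
    n = len(transactions)
    support = lambda s: sum(1 for t in transactions if s <= t) / n
    # L1: frequent items of size one
    items = {i for t in transactions for i in t}
    L = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent, k = list(L), 1
    while L:
        prev = set(L)
        # join Lk with itself to form (k+1)-item candidates ...
        C = {a | b for a in L for b in L if len(a | b) == k + 1}
        # ... and prune any candidate with an infrequent k-subset
        C = [c for c in C if all(frozenset(s) in prev for s in combinations(c, k))]
        # one database pass: keep candidates meeting min_support
        L = [c for c in C if support(c) >= min_support]
        frequent, k = frequent + L, k + 1
    return frequent  # the union of all Lk

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(sorted(sorted(s) for s in apriori(transactions, 0.5)))
# [['A'], ['A', 'C'], ['B'], ['C']]
```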

7
How to Generate Candidates?
  • Suppose the items in Lk-1 are listed in an order
  • Step 1: self-joining Lk-1
  • insert into Ck
  • select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  • from Lk-1 p, Lk-1 q
  • where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
  • Step 2: pruning
  • forall itemsets c in Ck do
  • forall (k-1)-subsets s of c do
  • if (s is not in Lk-1) then delete c from Ck
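The same join-and-prune step, sketched in Python over sorted tuples (the tuple representation is my assumption) instead of SQL:

```python
from itertools import combinations

def generate_candidates(L_prev, k):
    """Build Ck from Lk-1: join pairs sharing their first k-2 items,
    then delete any candidate with a (k-1)-subset not in Lk-1."""
    L_set = set(L_prev)
    Ck = []
    for p in L_prev:
        for q in L_prev:
            # join condition: equal prefixes, p's last item < q's last item
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # pruning: every (k-1)-subset of c must be frequent
                if all(s in L_set for s in combinations(c, k - 1)):
                    Ck.append(c)
    return Ck

L2 = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
print(generate_candidates(L2, 3))
# [('A', 'B', 'C')] -- ('B', 'C', 'D') is pruned because ('C', 'D') is not frequent
```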

8
The Apriori Algorithm Example
[Figure: worked example. Scan database D to count C1 and keep L1; self-join L1 into C2, scan D, keep L2; self-join L2 into C3, scan D, keep L3.]
9
Improvements to Apriori
  • Apriori with well-designed data structures works
    well in practice when frequent itemsets not too
    long (common case)
  • Lots of enhancements proposed
  • Sampling: count in two passes
  • Invert the database to column-major instead of row-major layout and count by intersection
  • Count itemsets of multiple lengths in one pass
  • Reducing passes is not that useful, since I/O is not the bottleneck
  • Main bottleneck: candidate generation and counting, which are not optimized for long itemsets

10
Mining Frequent Patterns Without Candidate
Generation
  • Compress a large database into a compact,
    Frequent-Pattern tree (FP-tree) structure
  • highly condensed, but complete for frequent
    pattern mining
  • Develop an efficient, FP-tree-based frequent
    pattern mining method
  • A divide-and-conquer methodology decompose
    mining tasks into smaller ones
  • Avoid candidate generation

11
Construct FP-tree from Database
min_support 0.5
TID Items bought (ordered) frequent
items 100 f, a, c, d, g, i, m, p f, c, a, m,
p 200 a, b, c, f, l, m, o f, c, a, b,
m 300 b, f, h, j, o f, b 400 b, c, k,
s, p c, b, p 500 a, f, c, e, l, p, m,
n f, c, a, m, p

Scan DB once, find frequent 1-itemset Order
frequent items by decreasing frequency Scan DB
again, construct FP-tree
Item frequency f 4 c 4 a 3 b 3 m 3 p 3
f4
c1
b1
b1
c3
p1
a3
b1
m2
p2
m1
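The two-scan construction can be sketched as follows (class and function names are my own, not from the paper):

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_count):
    # scan 1: count items, keep only the frequent ones
    freq = {i: c for i, c in Counter(i for t in transactions for i in t).items()
            if c >= min_count}
    root, header = Node(None, None), defaultdict(list)
    # scan 2: insert each transaction's frequent items as a path, ordered by
    # decreasing global frequency (ties broken alphabetically in this sketch)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i)):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])  # header-table link
            node = node.children[item]
    return root, header

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"), list("bcksp"), list("afcelpmn")]
root, header = build_fp_tree(db, 3)
print({i: sum(n.count for n in header[i]) for i in "fcabmp"})
# {'f': 4, 'c': 4, 'a': 3, 'b': 3, 'm': 3, 'p': 3}
```

Note that the slide breaks the f/c frequency tie with f first, while this sketch breaks ties alphabetically; that changes the tree's shape but not the total count stored per item.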
12
Step 1: FP-tree to Conditional Pattern Base
  • Starting at the frequent-item header table in the FP-tree
  • Traverse the FP-tree by following the links of each frequent item
  • Accumulate all transformed prefix paths of that item to form its conditional pattern base

Conditional pattern bases
Item  Conditional pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1
13
Step 2: Construct Conditional FP-tree
  • For each pattern base:
  • Accumulate the count for each item in the base
  • Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1
All frequent patterns containing m: m, fm, cm, am, fcm, fam, cam, fcam
14
Mining Frequent Patterns by Creating Conditional
Pattern-Bases
Repeat this recursively for higher items
15
FP-growth vs. Apriori Scalability With the
Support Threshold
Data set T25I20D10K
16
Criticism of Support and Confidence
  • X and Y positively correlated,
  • X and Z negatively correlated,
  • yet the support and confidence of X → Z dominate
  • Need to measure departure from expected.
  • For two items: compare the observed joint support with the product of the marginals
  • For k items, expected support derived from
    support of k-1 itemsets using iterative scaling
    methods
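For the two-item case, one common departure measure (interest, or lift) divides the observed joint support by the support expected under independence; a minimal sketch with hypothetical numbers:

```python
def lift(support_xy, support_x, support_y):
    """Observed joint support over the support expected if X and Y were independent."""
    return support_xy / (support_x * support_y)

# hypothetical numbers: each item has 10% support, and so does the pair
print(round(lift(0.10, 0.10, 0.10), 3))  # 10.0 -> far above 1: strong positive correlation
```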

17
Prevalent correlations are not interesting
  • Analysts already know about prevalent rules
  • Interesting rules are those that deviate from
    prior expectation
  • Mining's payoff is in finding surprising phenomena

1995: bedsheets and pillow covers sell together!
18
What makes a rule surprising?
  • Does not match prior expectation
  • Correlation between milk and cereal remains
    roughly constant over time
  • Cannot be trivially derived from simpler rules
  • Milk: 10%, cereal: 10%
  • Milk and cereal: 10% → surprising
  • Eggs: 10%
  • Milk, cereal and eggs: 0.1% → surprising!
  • Expected: 1%

19
Finding surprising temporal patterns
  • Algorithms to mine for surprising patterns
  • Encode itemsets into bit streams using two models
  • Mopt: the optimal model, which allows change along time
  • Mcons: the constrained model, which does not allow change along time
  • Surprise = difference in the number of bits between Mopt and Mcons

20
One item optimal model
  • Milk-buying habits modeled by biased coin
  • Customer tosses this coin to decide whether to
    buy milk
  • Head or 1 denotes basket contains milk
  • The coin bias is Pr[milk]
  • The analyst wants to study Pr[milk] along time
  • Single coin with fixed bias is not interesting
  • Changes in bias are interesting

21
The coin segmentation problem
  • Players A and B
  • A has a set of coins with different biases
  • A repeatedly
  • Picks arbitrary coin
  • Tosses it arbitrary number of times
  • B observes H/T
  • Guesses transition points and biases

22
How to explain the data
  • Given n head/tail observations
  • Can assume n different coins with bias 0 or 1
  • Data fits perfectly (with probability one)
  • Many coins needed
  • Or assume one coin
  • May fit data poorly
  • Best explanation is a compromise

[Figure: the toss sequence segmented into three runs with biases 5/7, 1/3, and 1/4.]
23
Coding examples
  • Sequence of k zeroes
  • Naïve encoding takes k bits
  • Run length takes about log k bits
  • 1000 bits, 10 randomly placed 1s, rest 0s
  • Posit a coin with bias 0.01
  • Data encoding cost (by Shannon's theorem) is −10 log2(0.01) − 990 log2(0.99) ≈ 81 bits
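The Shannon cost can be checked numerically; a sketch of the standard −Σ nᵢ log2 pᵢ formula:

```python
import math

def encoding_cost(n_ones, n_zeros, p):
    """Bits needed to encode the sequence under a coin with bias p (Shannon cost)."""
    return -n_ones * math.log2(p) - n_zeros * math.log2(1 - p)

# 1000 bits with 10 ones, encoded with a bias-0.01 coin
print(round(encoding_cost(10, 990, 0.01), 1))  # 80.8 -> far below the naive 1000 bits
```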

24
How to find optimal segments
  • Sequence of 17 tosses; derived graph with 18 nodes
  • Edge cost = model cost + data cost
  • Model cost = one node ID + one Pr[head]
  • Data cost for Pr[head] = 5/7: 5 heads, 2 tails
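A self-contained sketch of the shortest-path formulation, solved by dynamic programming over the derived graph (the function names and the fixed per-segment model cost are my simplifications):

```python
import math

def data_cost(seq):
    """Shannon cost of one segment under its own empirical head probability."""
    h = sum(seq)
    t = len(seq) - h
    p = h / len(seq)
    cost = 0.0
    if h:
        cost -= h * math.log2(p)
    if t:
        cost -= t * math.log2(1 - p)
    return cost

def segment(seq, model_cost):
    """Shortest path from node 0 to node n, where edge (i, j) means
    'one coin for tosses i..j-1' at cost model_cost + data_cost."""
    n = len(seq)
    best, back = [0.0] + [math.inf] * n, [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            c = best[i] + model_cost + data_cost(seq[i:j])
            if c < best[j]:
                best[j], back[j] = c, i
    cuts, j = [], n
    while j > 0:  # recover segment boundaries
        cuts.append(j)
        j = back[j]
    return best[n], sorted(cuts)

cost, cuts = segment([1] * 8 + [0] * 8, model_cost=6.0)
print(cost, cuts)  # 12.0 [8, 16] -> one all-heads coin, then one all-tails coin
```

Raising the model cost makes extra coins more expensive, so the optimum collapses back toward a single coin, which is exactly the compromise the previous slide describes.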
25
Two or more items
  • Unconstrained segmentation
  • k items induce a 2k sided coin
  • milk and cereal 11, milk, not cereal 10,
    neither 00, etc.
  • Shortest path finds significant shift in any of
    the coin face probabilities
  • Problem some of these shifts may be completely
    explained by marginals

26
Example
  • Drop in joint sale of milk and cereal is
    completely explained by drop in sale of milk
  • Pr[milk ∧ cereal] / (Pr[milk] × Pr[cereal]) remains constant over time
  • Call this ratio ρ

27
Constant-ρ segmentation
  • ρ = observed support / support expected under independence
  • Compute a global ρ over all time
  • All coins must have this common value of ρ
  • Segment as before
  • Compare with the unconstrained coding cost

28
Is all this really needed?
  • Simpler alternative
  • Aggregate data into suitable time windows
  • Compute support, correlation, ρ, etc. in each window
  • Use a variance threshold to choose itemsets
  • Pitfalls
  • Choices: windows, thresholds
  • May miss fine detail
  • Over-sensitive to outliers

29
Experiments
  • Millions of baskets over several years
  • Two algorithms
  • Complete MDL approach
  • MDL segmentation plus statistical tests (MStat)
  • Data set
  • 2.8 million transactions
  • 7 years, 1987 to 1993
  • 15800 items
  • Average 2.62 items per basket

30
Little agreement in itemset ranks
  • Simpler methods do not approximate MDL

31
MDL has high selectivity
  • The scores of the best itemsets stand out from the rest under MDL

32
Three anecdotes
  • [Plots: ρ against time for each pair]
  • High MStat score, small marginals: polo shirts and shorts
  • High correlation, small variation: bedsheets and pillow cases
  • High MDL score, significant gradual drift: men's and women's shorts

33
Conclusion
  • New notion of surprising patterns based on
  • Joint support expected from marginals
  • Variation of joint support along time
  • Robust MDL formulation
  • Efficient algorithms
  • Near-optimal segmentation using shortest path
  • Pruning criteria
  • Successful application to real data

34
References
  • R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB '94, 487-499, Santiago, Chile, 1994.
  • S. Chakrabarti, S. Sarawagi, and B. Dom. Mining surprising patterns using temporal description length. Proc. of the 24th Int'l Conference on Very Large Databases (VLDB), 1998.
  • J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD '00, 1-12, Dallas, TX, May 2000.
  • J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. (Some of the slides in the talk are taken from this book.)
  • H. Toivonen. Sampling large databases for association rules. VLDB '96, 134-145, Bombay, India, Sept. 1996.