Title: Association Rules
1. Association Rules
- Market Baskets
- Frequent Itemsets
- A-Priori Algorithm
2. The Market-Basket Model
- A large set of items, e.g., things sold in a supermarket.
- A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on one day.
3. Market-Baskets (2)
- Really a general many-many mapping (association) between two kinds of things.
- But we ask about connections among items, not baskets.
- The technology focuses on common events, not rare events (the long tail).
4. Support
- Simplest question: find sets of items that appear frequently in the baskets.
- Support for itemset I = the number of baskets containing all items in I.
- Sometimes given as a percentage.
- Given a support threshold s, sets of items that appear in at least s baskets are called frequent itemsets.
5. Example: Frequent Itemsets
- Items = {milk, coke, pepsi, beer, juice}.
- Support threshold = 3 baskets.
- B1 = {m, c, b}    B2 = {m, p, j}
- B3 = {m, b}       B4 = {c, j}
- B5 = {m, p, b}    B6 = {m, c, b, j}
- B7 = {c, b, j}    B8 = {b, c}
- Frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {c, b}, {c, j} (verified by the sketch below).
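As a sanity check, a tiny brute-force scan (a sketch, not part of the original slides; the basket contents and threshold come from the example above) enumerates every itemset's support directly:

    from itertools import combinations

    # Baskets from the example (m=milk, c=coke, p=pepsi, b=beer, j=juice).
    baskets = [{'m','c','b'}, {'m','p','j'}, {'m','b'}, {'c','j'},
               {'m','p','b'}, {'m','c','b','j'}, {'c','b','j'}, {'b','c'}]
    s = 3  # support threshold

    items = sorted(set().union(*baskets))
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            support = sum(1 for basket in baskets if set(cand) <= basket)
            if support >= s:
                print(set(cand), support)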
6. Applications (1)
- Items = products; baskets = sets of products someone bought in one trip to the store.
- Example application: given that many people buy beer and diapers together, run a sale on diapers and raise the price of beer.
- Only useful if many buy diapers & beer.
7. Applications (2)
- Baskets = sentences; items = documents containing those sentences.
- Items that appear together too often could represent plagiarism.
- Notice items do not have to be "in" baskets.
8. Applications (3)
- Baskets = Web pages; items = words.
- Unusual words appearing together in a large number of documents, e.g., "Brad" and "Angelina," may indicate an interesting relationship.
9. Aside: Words on the Web
- Many Web-mining applications involve words.
- Cluster pages by their topic, e.g., sports.
- Find useful blogs, versus nonsense.
- Determine the sentiment (positive or negative) of comments.
- Partition pages retrieved from an ambiguous query, e.g., "jaguar."
10. Words (2)
- Very common words are stop words.
- They rarely help determine meaning, and they block from view interesting events, so ignore them.
- The TF/IDF measure distinguishes important words from those that are usually not meaningful.
11. Words (3)
- TF/IDF (term frequency / inverse document frequency) relates the number of times a word appears to the number of documents in which it appears.
- Low values are words like "also" that appear at random.
- High values are words like "computer" that may be the topic of the documents in which they appear (one concrete form of the measure is sketched below).
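For concreteness, one standard form of the score (this particular variant is an assumption; the slide gives no formula): for word w and document d, with N total documents of which n_w contain w,

    \mathrm{TF.IDF}(w, d) = f_{w,d} \cdot \log\frac{N}{n_w}

where f_{w,d} is the number of occurrences of w in d. A word like "also" has n_w close to N, so the log factor is near 0; a word like "computer" can have large f_{w,d} in the few documents where it is the topic, giving a high score.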
12. Scale of the Problem
- WalMart sells 100,000 items and can store billions of baskets.
- The Web has billions of words and many billions of pages.
13. Association Rules
- If-then rules about the contents of baskets.
- {i1, i2, ..., ik} → j means: "if a basket contains all of i1, ..., ik, then it is likely to contain j."
- Confidence of this association rule is the probability of j given i1, ..., ik.
14. Example: Confidence
- B1 = {m, c, b}    B2 = {m, p, j}
- B3 = {m, b}       B4 = {c, j}
- B5 = {m, p, b}    B6 = {m, c, b, j}
- B7 = {c, b, j}    B8 = {b, c}
- An association rule: {m, b} → c.
- Confidence = 2/4 = 50%.
15. Finding Association Rules
- Question: find all association rules with support ≥ s and confidence ≥ c.
- Note: the support of an association rule is the support of the set of items on the left.
- Hard part: finding the frequent itemsets.
- Note: if {i1, i2, ..., ik} → j has high support and confidence, then both {i1, i2, ..., ik} and {i1, i2, ..., ik, j} will be frequent.
16. Computation Model
- Typically, data is kept in flat files rather than in a database system.
- Stored on disk.
- Stored basket-by-basket.
- Expand baskets into pairs, triples, etc. as you read baskets.
- Use k nested loops to generate all sets of size k.
17. File Organization
[Figure: a flat file laid out basket-by-basket; each basket (Basket 1, Basket 2, Basket 3, ...) is a run of items.]
- Example: items are positive integers, and boundaries between baskets are -1.
18. Computation Model (2)
- The true cost of mining disk-resident data is usually the number of disk I/Os.
- In practice, association-rule algorithms read the data in passes: all baskets read in turn.
- Thus, we measure the cost by the number of passes an algorithm takes.
19. Main-Memory Bottleneck
- For many frequent-itemset algorithms, main memory is the critical resource.
- As we read baskets, we need to count something, e.g., occurrences of pairs.
- The number of different things we can count is limited by main memory.
- Swapping counts in/out is a disaster (why?).
20. Finding Frequent Pairs
- The hardest problem often turns out to be finding the frequent pairs.
- Why? Often frequent pairs are common, frequent triples are rare.
- Why? Probability of being frequent drops exponentially with size; the number of sets grows more slowly with size.
- We'll concentrate on pairs, then extend to larger sets.
21. Naïve Algorithm
- Read file once, counting in main memory the occurrences of each pair (sketch below).
- From each basket of n items, generate its n(n-1)/2 pairs by two nested loops.
- Fails if (#items)² exceeds main memory.
- Remember: #items can be 100K (Wal-Mart) or 10B (Web pages).
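A minimal sketch of the naïve pass (assuming baskets are available as an iterable of item collections; names here are illustrative):

    from itertools import combinations
    from collections import defaultdict

    def count_pairs_naive(baskets):
        """One pass over the data, keeping a count for every pair seen.

        Fails once the number of distinct pairs outgrows main memory."""
        counts = defaultdict(int)
        for basket in baskets:
            # Two nested loops in effect: n(n-1)/2 pairs per basket of n items.
            for pair in combinations(sorted(set(basket)), 2):
                counts[pair] += 1
        return counts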
22. Example: Counting Pairs
- Suppose 10^5 items.
- Suppose counts are 4-byte integers.
- Number of pairs of items: 10^5 (10^5 - 1)/2 ≈ 5 × 10^9.
- Therefore, 2 × 10^10 bytes (20 gigabytes) of main memory needed.
23. Details of Main-Memory Counting
- Two approaches:
- (1) Count all pairs, using a triangular matrix.
- (2) Keep a table of triples [i, j, c] = "the count of the pair of items {i, j} is c."
- (1) requires only 4 bytes/pair.
- Note: always assume integers are 4 bytes.
- (2) requires 12 bytes, but only for those pairs with count > 0.
24. [Figure: memory layouts. Method (1), triangular matrix: 4 bytes per pair. Method (2), table of triples: 12 bytes per occurring pair.]
25. Triangular-Matrix Approach (1)
- Number items 1, 2, ...
- Requires a table of size O(n) to convert item names to consecutive integers.
- Count {i, j} only if i < j.
- Keep pairs in the order {1,2}, {1,3}, ..., {1,n}, {2,3}, {2,4}, ..., {2,n}, {3,4}, ..., {3,n}, ..., {n-1,n}.
26. Triangular-Matrix Approach (2)
- Find pair {i, j} at the position (i - 1)(n - i/2) + j - i (see the indexing sketch below).
- Total number of pairs n(n - 1)/2; total bytes about 2n².
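The position formula translates directly into an index into a flat count array (a sketch; 1-based item numbers as on the slide, converted to a 0-based Python list):

    def pair_pos(i, j, n):
        """1-based position of pair {i, j} (i < j) in the triangular order
        {1,2}, {1,3}, ..., {1,n}, {2,3}, ..., {n-1,n}."""
        assert 1 <= i < j <= n
        return int((i - 1) * (n - i / 2)) + j - i

    n = 5
    counts = [0] * (n * (n - 1) // 2)   # one count slot per pair
    counts[pair_pos(2, 4, n) - 1] += 1  # increment count of pair {2, 4}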
27. Details of Approach #2
- Total bytes used is about 12p, where p is the number of pairs that actually occur.
- Beats the triangular matrix if at most 1/3 of possible pairs actually occur.
- May require extra space for a retrieval structure, e.g., a hash table.
28. A-Priori Algorithm for Pairs
- A two-pass approach called a-priori limits the need for main memory.
- Key idea: monotonicity: if a set of items appears at least s times, so does every subset.
- Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets.
29. Apriori Algorithm - General (min_sup = 2)
- Database D:
  TID  Items
  10   a, c, d
  20   b, c, e
  30   a, b, c, e
  40   b, e
- Scan D for 1-candidates:       a:2, b:3, c:3, d:1, e:3
- Frequent 1-itemsets:           a:2, b:3, c:3, e:3
- 2-candidates:                  ab, ac, ae, bc, be, ce
- Scan D, counting:              ab:1, ac:2, ae:1, bc:2, be:3, ce:2
- Frequent 2-itemsets:           ac:2, bc:2, be:3, ce:2
- 3-candidates:                  bce
- Scan D; frequent 3-itemsets:   bce:2
30. Important Details of Apriori
- How to generate candidates?
- Step 1: self-joining Lk
- Step 2: pruning
- How to count supports of candidates?
31. How to Generate Candidates?
- Suppose the items in Lk-1 are listed in an order (Python sketch below).
- Step 1: self-join Lk-1
    INSERT INTO Ck
    SELECT p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    FROM Lk-1 p, Lk-1 q
    WHERE p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
- Step 2: pruning
    For each itemset c in Ck do
      For each (k-1)-subset s of c do
        if (s is not in Lk-1) then delete c from Ck
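A sketch of the same two steps in Python (assuming itemsets are kept as sorted tuples; the function name is illustrative):

    from itertools import combinations

    def gen_candidates(L_prev, k):
        """Ck from Lk-1: self-join on the first k-2 items, then prune."""
        L_set = set(L_prev)
        candidates = set()
        for p in L_prev:
            for q in L_prev:
                # Self-join: agree on items 1..k-2, and p's last item < q's last.
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    c = p + (q[-1],)
                    # Prune: every (k-1)-subset of c must be in Lk-1.
                    if all(s in L_set for s in combinations(c, k - 1)):
                        candidates.add(c)
        return candidates

    # The example on the next slide: L3 = {abc, abd, acd, ace, bcd}  ->  C4 = {abcd}
    L3 = [('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')]
    print(gen_candidates(L3, 4))   # {('a', 'b', 'c', 'd')}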
32. Example of Candidate Generation
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
- abcd from abc and abd
- acde from acd and ace
- Pruning:
- acde is removed because ade is not in L3
- C4 = {abcd}
33. How to Count Supports of Candidates?
- Why is counting supports of candidates a problem?
- The total number of candidates can be very large.
- One transaction may contain many candidates.
- Method:
- Candidate itemsets are stored in a hash-tree.
- A leaf node of the hash-tree contains a list of itemsets and counts.
- An interior node contains a hash table.
- The subset function finds all the candidates contained in a transaction.
34. Apriori: Candidate Generation-and-Test
- Any subset of a frequent itemset must also be frequent: an anti-monotone property.
- A transaction containing {beer, diaper, nuts} also contains {beer, diaper}.
- {beer, diaper, nuts} is frequent → {beer, diaper} must also be frequent.
- No superset of any infrequent itemset should be generated or tested.
- Many item combinations can be pruned.
35. The Apriori Algorithm
- Ck: candidate itemsets of size k
- Lk: frequent itemsets of size k
- L1 = {frequent items}
- for (k = 1; Lk != ∅; k++) do
-   Ck+1 = candidates generated from Lk
-   for each transaction t in the database do: increment the count of all candidates in Ck+1 that are contained in t
-   Lk+1 = candidates in Ck+1 with min_support
- return ∪k Lk
- (A runnable sketch of the whole loop follows.)
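Putting the pieces together, a compact sketch of the whole loop (an illustration under the assumptions above: baskets as lists of items, itemsets as sorted tuples; the function name is illustrative):

    from itertools import combinations
    from collections import defaultdict

    def apriori(baskets, min_sup):
        """Return {itemset: support} for all frequent itemsets."""
        baskets = [frozenset(b) for b in baskets]
        # L1: count items, keep those meeting min_sup.
        item_counts = defaultdict(int)
        for b in baskets:
            for item in b:
                item_counts[(item,)] += 1
        L = {t: c for t, c in item_counts.items() if c >= min_sup}
        frequent = dict(L)
        k = 1
        while L:
            # Ck+1: self-join Lk, then prune by the k-subset test.
            L_keys = sorted(L)
            C = {p + (q[-1],) for p in L_keys for q in L_keys
                 if p[:-1] == q[:-1] and p[-1] < q[-1]}
            C = {c for c in C if all(s in L for s in combinations(c, k))}
            # One pass over the data, counting candidates contained in each t.
            counts = defaultdict(int)
            for b in baskets:
                for c in C:
                    if b.issuperset(c):
                        counts[c] += 1
            L = {c: n for c, n in counts.items() if n >= min_sup}
            frequent.update(L)
            k += 1
        return frequent

    # The slide-29 example, min_sup = 2: ends with ('b','c','e'): 2.
    D = [['a','c','d'], ['b','c','e'], ['a','b','c','e'], ['b','e']]
    print(apriori(D, 2))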
36. A-Priori Algorithm (2)
- Pass 1: Read baskets and count in main memory the occurrences of each item.
- Requires only memory proportional to #items.
- Items that appear at least s times are the frequent items.
37. A-Priori Algorithm (3)
- Pass 2: Read baskets again and count in main memory only those pairs both of whose items were found in Pass 1 to be frequent (two-pass sketch below).
- Requires memory proportional to the square of the number of frequent items (for counts), plus a list of the frequent items (so you know what must be counted).
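The two passes, sketched (assuming the file can be re-read, as in the disk-based model; a dict stands in for the count structure here, see the triangular-matrix variant earlier):

    from itertools import combinations
    from collections import defaultdict

    def apriori_pairs(read_baskets, s):
        """Two passes over the data; read_baskets() re-reads the file each call."""
        # Pass 1: count individual items; memory ~ #items.
        item_counts = defaultdict(int)
        for basket in read_baskets():
            for item in basket:
                item_counts[item] += 1
        frequent = {i for i, c in item_counts.items() if c >= s}
        # Pass 2: count only pairs of two frequent items; memory ~ (#frequent)^2.
        pair_counts = defaultdict(int)
        for basket in read_baskets():
            kept = sorted(i for i in set(basket) if i in frequent)
            for pair in combinations(kept, 2):
                pair_counts[pair] += 1
        return {p: c for p, c in pair_counts.items() if c >= s}

    # The baskets of slide 5, support threshold 3: bc:4, bm:4, cj:3.
    B = [['m','c','b'], ['m','p','j'], ['m','b'], ['c','j'],
         ['m','p','b'], ['m','c','b','j'], ['c','b','j'], ['b','c']]
    print(apriori_pairs(lambda: iter(B), 3))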
38. Picture of A-Priori
[Figure: memory layout per pass. Pass 1: item counts. Pass 2: the frequent items from Pass 1, plus counts of pairs of frequent items.]
39. Detail for A-Priori
- You can use the triangular-matrix method with n = number of frequent items.
- May save space compared with storing triples.
- Trick: number the frequent items 1, 2, ... and keep a table relating new numbers to original item numbers.
40. A-Priori Using Triangular Matrix for Counts
[Figure: Pass 1 keeps item counts; Pass 2 keeps (1) the frequent items, (2) a table from new numbers to old item numbers, and a triangular matrix of counts of pairs of frequent items.]
41. Frequent Triples, Etc.
- For each k, we construct two sets of k-sets (sets of size k):
- Ck = candidate k-sets = those that might be frequent sets (support > s) based on information from the pass for k-1.
- Lk = the set of truly frequent k-sets.
42. [Figure: the filter/construct pipeline. C1 --filter--> L1 --construct--> C2 --filter--> L2 --construct--> C3 --> ...; the first pass turns C1 into L1, the second pass turns C2 into L2, and so on.]
43. A-Priori for All Frequent Itemsets
- One pass for each k.
- Needs room in main memory to count each candidate k-set.
- For typical market-basket data and reasonable support (e.g., 1%), k = 2 requires the most memory.
44. Frequent Itemsets (2)
- C1 = all items.
- In general, Lk = members of Ck with support ≥ s.
- Ck+1 = (k+1)-sets, each of whose size-k subsets is in Lk.
45. Challenges of FPM
- Challenges:
- Multiple scans of the transaction database
- Huge number of candidates
- Tedious workload of support counting for candidates
- Improving Apriori: general ideas
- Reduce the number of transaction-database scans
- Shrink the number of candidates
- Facilitate support counting of candidates
46. DIC: Reduce Scans
- DIC (Dynamic Itemset Counting) starts counting an itemset as soon as all its subsets are known frequent, instead of waiting for the next full scan:
- Once both A and D are determined frequent, the counting of AD can begin.
- Once all length-2 subsets of BCD are determined frequent, the counting of BCD can begin.
[Figure: the itemset lattice over {A, B, C, D} (pairs AB ... CD, triples ABC ... BCD, and ABCD) above the stream of transactions; Apriori finishes all 2-itemsets before starting 3-itemsets, while DIC overlaps the two.]
- S. Brin, R. Motwani, J. Ullman, and S. Tsur, 1997.
47. DHP: Reduce the Number of Candidates
- A hashing bucket count < min_sup → every candidate in the bucket is infrequent.
- Candidates: a, b, c, d, e
- Hash entries: {ab, ad, ae}, {bd, be, de}, ...
- Large 1-itemsets: a, b, d, e
- The sum of counts of {ab, ad, ae} < min_sup → ab should not be a candidate 2-itemset.
- J. Park, M. Chen, and P. Yu, 1995.
48. Partition: Scan Database Only Twice
- Partition the database into n partitions.
- Itemset X is frequent → X is frequent in at least one partition (sketch below).
- Scan 1: partition the database and find local frequent patterns.
- Scan 2: consolidate global frequent patterns.
- A. Savasere, E. Omiecinski, and S. Navathe, 1995.
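A simplified in-memory sketch of the two scans (the local miner here is a brute-force counter, and the local threshold is the partition's share of min_sup; any itemset frequent overall must be locally frequent in some partition, so the union of local results is a safe candidate set):

    from itertools import combinations
    from collections import defaultdict

    def local_frequent(part, local_sup):
        """Brute-force local miner: itemsets meeting local_sup in this partition."""
        seen = {c for b in part
                  for k in range(1, len(b) + 1)
                  for c in combinations(sorted(b), k)}
        return {c for c in seen
                if sum(1 for b in part if set(c) <= set(b)) >= local_sup}

    def partition_mine(baskets, min_sup, n_parts=2):
        size = -(-len(baskets) // n_parts)   # ceiling division
        parts = [baskets[i:i + size] for i in range(0, len(baskets), size)]
        # Scan 1: union of locally frequent itemsets = global candidates.
        candidates = set()
        for part in parts:
            candidates |= local_frequent(
                part, max(1, min_sup * len(part) // len(baskets)))
        # Scan 2: count the candidates globally.
        counts = defaultdict(int)
        for b in baskets:
            bs = set(b)
            for c in candidates:
                if set(c) <= bs:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_sup}

    D = [['a','c','d'], ['b','c','e'], ['a','b','c','e'], ['b','e']]
    print(partition_mine(D, 2))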
49. Sampling for Frequent Patterns
- Select a sample of the original database; mine frequent patterns within the sample using Apriori.
- Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of the frequent patterns are checked.
- Example: check abcd instead of ab, ac, ..., etc.
- Scan the database again to find missed frequent patterns.
- H. Toivonen, 1996.
50. Bottleneck of Frequent-Pattern Mining
- Multiple database scans are costly.
- Mining long patterns needs many passes of scanning and generates lots of candidates.
- To find the frequent itemset i1 i2 ... i100:
- # of scans: 100
- # of candidates: 2^100 - 1 ≈ 1.27 × 10^30
- Bottleneck: candidate generation and test.
- Can we avoid candidate generation?
51. Set Enumeration Tree
- Subsets of I can be enumerated systematically (enumeration sketch below).
- I = {a, b, c, d}
[Figure: the set-enumeration tree rooted at ∅, with levels a, b, c, d; ab, ac, ad, bc, bd, cd; abc, abd, acd, bcd; and abcd.]
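The tree's nodes are exactly the subsets, so they can be generated with a short recursion (a sketch, shown in depth-first rather than level order; each node extends its itemset only with items that come later, which is what makes the enumeration systematic and duplicate-free):

    def enumerate_sets(items, prefix=()):
        """Yield all non-empty subsets of items in set-enumeration-tree (DFS) order."""
        for i, item in enumerate(items):
            node = prefix + (item,)
            yield node
            # Children extend the node only with items that come after it.
            yield from enumerate_sets(items[i + 1:], node)

    print(list(enumerate_sets(('a', 'b', 'c', 'd'))))
    # ('a',), ('a','b'), ('a','b','c'), ('a','b','c','d'), ('a','b','d'), ...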
52. Borders of Frequent Itemsets
- Connected:
- X and Y are frequent and X is an ancestor of Y → all patterns between X and Y are frequent.
53. Projected Databases
- To find a child Xy of X, only the X-projected database is needed:
- The sub-database of transactions containing X.
- Item y is frequent in the X-projected database.
54. Tree-Projection Method
- Find frequent 2-itemsets.
- For each frequent 2-itemset xy, form a projected database:
- The sub-database of transactions containing xy.
- Recursive mining (projection sketch below):
- If x'y' is frequent in the xy-projected database, then xyx'y' is a frequent pattern.
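A minimal sketch of projection (an illustrative helper; the X-projected database keeps just the transactions containing X, trimmed here to items ordered after max(X), i.e., the possible extensions):

    def project(baskets, X):
        """X-projected database: transactions containing all items of X,
        reduced to the items ordered after max(X) (possible extensions)."""
        X = set(X)
        pivot = max(X)
        return [sorted(i for i in b if i > pivot)
                for b in baskets if X <= set(b)]

    D = [['a','c','d'], ['b','c','e'], ['a','b','c','e'], ['b','e']]
    print(project(D, {'b', 'c'}))  # [['e'], ['e']] -> 'e' frequent, so bce is frequent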
55. Borders and Max-patterns
- Max-patterns: the borders of the frequent patterns.
- Any subset of a max-pattern is frequent.
- Any superset of a max-pattern is infrequent.
56. MaxMiner: Mining Max-patterns (min_sup = 2)
- Tid  Items
  10   A, B, C, D, E
  20   B, C, D, E
  30   A, C, D, F
- 1st scan: find frequent items: A, B, C, D, E.
- 2nd scan: find support for the potential max-patterns:
- AB, AC, AD, AE, ABCDE
- BC, BD, BE, BCDE
- CD, CE, CDE, DE
- Since BCDE is a max-pattern, no need to check BCD, BDE, CDE in a later scan.
- Bayardo, 1998.
57. Frequent Closed Patterns (min_sup = 2)
- For frequent itemset X, if there exists no item y s.t. every transaction containing X also contains y, then X is a frequent closed pattern.
- acdf is a frequent closed pattern (checked in the sketch below).
- Concise representation of the frequent patterns.
- Reduces the number of patterns and rules.
- N. Pasquier et al., ICDT'99.
- TID  Items
  10   a, c, d, e, f
  20   a, b, e
  30   c, e, f
  40   a, c, d, f
  50   c, e, f
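A direct check of the definition (brute-force sketch over the table above: X is closed iff adding any other item drops its support):

    D = [{'a','c','d','e','f'}, {'a','b','e'}, {'c','e','f'},
         {'a','c','d','f'}, {'c','e','f'}]

    def support(X):
        return sum(1 for t in D if set(X) <= t)

    def is_closed(X):
        """X is closed iff no proper superset of X has the same support."""
        others = set().union(*D) - set(X)
        return all(support(set(X) | {y}) < support(X) for y in others)

    print(support('acdf'), is_closed('acdf'))  # 2 True
    print(support('acd'),  is_closed('acd'))   # 2 False ('f' is in both transactions)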
58. CLOSET: Mining Frequent Closed Patterns (min_sup = 2)
- F-list: list of all frequent items in support-ascending order.
- F-list = d-a-f-e-c
- Divide the search space:
- Patterns having d
- Patterns having d but no a, etc.
- Find frequent closed patterns recursively:
- Every transaction having d also has cfa → cfad is a frequent closed pattern.
- J. Pei, J. Han, and R. Mao, 2000.
- TID  Items
  10   a, c, d, e, f
  20   a, b, e
  30   c, e, f
  40   a, c, d, f
  50   c, e, f
59. Closed and Max-patterns
- Closed-pattern mining algorithms can be adapted to mine max-patterns:
- A max-pattern must be closed.
- Depth-first search methods have advantages over breadth-first search ones.
60. Multiple-level Association Rules
- Items often form hierarchies.
- Flexible support settings: items at a lower level are expected to have lower support.
- The transaction database can be encoded based on dimensions and levels.
- Explore shared multi-level mining.
61. Multi-dimensional Association Rules
- Single-dimensional rules:
- buys(X, "milk") → buys(X, "bread")
- MD rules: ≥ 2 dimensions or predicates
- Inter-dimension assoc. rules (no repeated predicates):
- age(X, "19-25") ∧ occupation(X, "student") → buys(X, "coke")
- Hybrid-dimension assoc. rules (repeated predicates):
- age(X, "19-25") ∧ buys(X, "popcorn") → buys(X, "coke")
- Categorical attributes: finite number of possible values, no order among values.
- Quantitative attributes: numeric, implicit order.
62. Quantitative/Weighted Association Rules
- Numeric attributes are dynamically discretized to maximize the confidence or compactness of the rules.
- 2-D quantitative association rules: Aquan1 ∧ Aquan2 → Acat.
- Cluster adjacent association rules to form general rules, using a 2-D grid.
[Figure: a 2-D grid of age (32-38) versus income (<20k up through 70-80k); adjacent cells are merged into the rule below.]
- age(X, "33-34") ∧ income(X, "30K-50K") → buys(X, "high resolution TV")