Association Rules - PowerPoint PPT Presentation

Title: Association Rules
Description: Association Rules, Market Baskets, Frequent Itemsets, A-Priori Algorithm
Slides: 51
Provided by: Jeff548

Transcript and Presenter's Notes
1
Association Rules
  • Market Baskets
  • Frequent Itemsets
  • A-Priori Algorithm

2
The Market-Basket Model
  • A large set of items, e.g., things sold in a
    supermarket.
  • A large set of baskets, each of which is a small
    set of the items, e.g., the things one customer
    buys on one day.

3
Market-Baskets (2)
  • Really a general many-many mapping (association)
    between two kinds of things.
  • But we ask about connections among items, not
    baskets.
  • The technology focuses on common events, not rare
    events (long tail).

4
Support
  • Simplest question: find sets of items that appear
    frequently in the baskets.
  • Support for itemset I = the number of baskets
    containing all items in I.
  • Sometimes given as a percentage.
  • Given a support threshold s, sets of items that
    appear in at least s baskets are called frequent
    itemsets.

5
Example Frequent Itemsets
  • Items = {milk, coke, pepsi, beer, juice}.
  • Support threshold = 3 baskets.
  • B1 = {m, c, b}    B2 = {m, p, j}
  • B3 = {m, b}       B4 = {c, j}
  • B5 = {m, p, b}    B6 = {m, c, b, j}
  • B7 = {c, b, j}    B8 = {b, c}
  • Frequent itemsets: {m}, {c}, {b}, {j}, {m, b},
    {b, c}, {c, j}.
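This example can be recomputed with a minimal Python sketch (not from the slides): count the support of every itemset of a given size over the eight baskets and keep those meeting the threshold s = 3. The function name `frequent_itemsets` is my own.

```python
from itertools import combinations
from collections import Counter

# The eight example baskets from the slide.
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"},      {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]

def frequent_itemsets(baskets, size, s):
    """Itemsets of the given size appearing in at least s baskets."""
    counts = Counter()
    for basket in baskets:
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    return {iset: c for iset, c in counts.items() if c >= s}

print(frequent_itemsets(baskets, 1, 3))  # {m}, {c}, {b}, {j}
print(frequent_itemsets(baskets, 2, 3))  # {m,b}, {b,c}, {c,j}
```

With s = 3 this reproduces the slide's answer: four frequent singletons and three frequent pairs.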

6
Applications (1)
  • Items = products; baskets = sets of products
    someone bought in one trip to the store.
  • Example application: given that many people buy
    beer and diapers together:
  • Run a sale on diapers; raise the price of beer.
  • Only useful if many buy diapers & beer.

7
Applications (2)
  • Baskets = sentences; items = documents containing
    those sentences.
  • Items that appear together too often could
    represent plagiarism.
  • Notice that items do not have to be "in" baskets.

8
Applications (3)
  • Baskets = Web pages; items = words.
  • Unusual words appearing together in a large
    number of documents, e.g., "Brad" and "Angelina,"
    may indicate an interesting relationship.

9
Aside Words on the Web
  • Many Web-mining applications involve words.
  • Cluster pages by their topic, e.g., sports.
  • Find useful blogs, versus nonsense.
  • Determine the sentiment (positive or negative) of
    comments.
  • Partition pages retrieved from an ambiguous
    query, e.g., jaguar.

10
Words (2)
  • Very common words are stop words.
  • They rarely help determine meaning, and they
    block from view interesting events, so ignore
    them.
  • The TF/IDF measure distinguishes important
    words from those that are usually not meaningful.

11
Words (3)
  • TF/IDF (term frequency / inverse document
    frequency) relates the number of times a word
    appears to the number of documents in which it
    appears.
  • Low values are words like "also" that appear at
    random.
  • High values are words like "computer" that may be
    the topic of any document in which it appears at
    all.

12
Scale of the Problem
  • WalMart sells 100,000 items and can store
    billions of baskets.
  • The Web has billions of words and many billions
    of pages.

13
Association Rules
  • If-then rules about the contents of baskets.
  • {i1, i2, ..., ik} -> j means: if a basket contains
    all of i1, ..., ik then it is likely to contain j.
  • The confidence of this association rule is the
    probability of j given i1, ..., ik.

14
Example Confidence
  • B1 = {m, c, b}    B2 = {m, p, j}
  • B3 = {m, b}       B4 = {c, j}
  • B5 = {m, p, b}    B6 = {m, c, b, j}
  • B7 = {c, b, j}    B8 = {b, c}
  • An association rule: {m, b} -> c.
  • Confidence = 2/4 = 50%.
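A hypothetical helper (my own naming, not from the slides) that computes this confidence directly: the fraction of baskets containing the left side that also contain the right side.

```python
# Confidence of the rule lhs -> rhs over the example baskets:
# support(lhs plus rhs) / support(lhs).
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"},      {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]

def confidence(baskets, lhs, rhs):
    """Estimate P(rhs | lhs) from basket counts."""
    lhs = set(lhs)
    both = lhs | {rhs}
    n_lhs = sum(1 for b in baskets if lhs <= b)
    n_both = sum(1 for b in baskets if both <= b)
    return n_both / n_lhs

print(confidence(baskets, {"m", "b"}, "c"))  # 2/4 = 0.5
```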

15
Finding Association Rules
  • Question: find all association rules with
    support >= s and confidence >= c.
  • Note: the support of an association rule is the
    support of the set of items on the left.
  • Hard part: finding the frequent itemsets.
  • Note: if {i1, i2, ..., ik} -> j has high support
    and confidence, then both {i1, i2, ..., ik} and
    {i1, i2, ..., ik, j} will be frequent.

16
Computation Model
  • Typically, data is kept in flat files rather than
    in a database system.
  • Stored on disk.
  • Stored basket-by-basket.
  • Expand baskets into pairs, triples, etc. as you
    read baskets.
  • Use k nested loops to generate all sets of size
    k.

17
File Organization
[Diagram: a flat file stored basket-by-basket; a run of item
entries for Basket 1, then for Basket 2, then Basket 3, etc.
Example: items are positive integers, and boundaries between
baskets are -1.]
18
Computation Model (2)
  • The true cost of mining disk-resident data is
    usually the number of disk I/Os.
  • In practice, association-rule algorithms read the
    data in passes: all baskets read in turn.
  • Thus, we measure the cost by the number of passes
    an algorithm takes.

19
Main-Memory Bottleneck
  • For many frequent-itemset algorithms, main memory
    is the critical resource.
  • As we read baskets, we need to count something,
    e.g., occurrences of pairs.
  • The number of different things we can count is
    limited by main memory.
  • Swapping counts in/out is a disaster (why?).

20
Finding Frequent Pairs
  • The hardest problem often turns out to be finding
    the frequent pairs.
  • Why? Often frequent pairs are common, frequent
    triples are rare.
  • Why? The probability of being frequent drops
    exponentially with size, while the number of sets
    grows more slowly with size.
  • We'll concentrate on pairs, then extend to larger
    sets.

21
Naïve Algorithm
  • Read file once, counting in main memory the
    occurrences of each pair.
  • From each basket of n items, generate its
    n (n -1)/2 pairs by two nested loops.
  • Fails if (#items)^2 exceeds main memory.
  • Remember: #items can be 100K (Wal-Mart) or 10B
    (Web pages).
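The naïve algorithm can be sketched in a few lines of Python (names are my own): one pass over the baskets, counting every pair in a main-memory table.

```python
from itertools import combinations
from collections import Counter

def count_pairs_naive(baskets):
    """One pass over the file, counting every pair in main memory.
    Each basket of n items contributes n(n-1)/2 pairs."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"},      {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]
counts = count_pairs_naive(baskets)
print(counts[("b", "m")])  # 4
```

The failure mode the slide describes is exactly this `Counter`: with 10^5 items it could hold up to roughly 5x10^9 entries.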

22
Example Counting Pairs
  • Suppose 10^5 items.
  • Suppose counts are 4-byte integers.
  • Number of pairs of items: 10^5(10^5 - 1)/2, about
    5×10^9.
  • Therefore, 2×10^10 bytes (20 gigabytes) of main
    memory needed.

23
Details of Main-Memory Counting
  • Two approaches:
  • (1) Count all pairs, using a triangular matrix.
  • (2) Keep a table of triples [i, j, c] = "the count
    of the pair of items {i, j} is c."
  • (1) requires only 4 bytes/pair.
  • Note: always assume integers are 4 bytes.
  • (2) requires 12 bytes, but only for those pairs
    with count > 0.

24
[Diagram: Method (1) uses 4 bytes per pair; Method (2) uses 12
bytes per occurring pair.]
25
Triangular-Matrix Approach (1)
  • Number items 1, 2, ...
  • Requires table of size O(n) to convert item names
    to consecutive integers.
  • Count {i, j} only if i < j.
  • Keep pairs in the order {1,2}, {1,3}, ..., {1,n},
    {2,3}, {2,4}, ..., {2,n}, {3,4}, ..., {3,n}, ...,
    {n-1,n}.

26
Triangular-Matrix Approach (2)
  • Find pair {i, j} at the position
    (i - 1)(n - i/2) + j - i.
  • Total number of pairs: n(n - 1)/2; total bytes
    about 2n^2.
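The position formula can be checked with a small sketch; the version below rewrites (i-1)(n - i/2) + j - i with exact integer arithmetic (the product (i-1)(2n-i) is always even), keeping the slide's 1-based positions.

```python
def pair_index(i, j, n):
    """Position of pair {i, j}, 1 <= i < j <= n, in the linear
    array holding the triangular matrix: (i-1)(n - i/2) + j - i,
    written with exact integer arithmetic."""
    assert 1 <= i < j <= n
    return (i - 1) * (2 * n - i) // 2 + (j - i)

# The n(n-1)/2 pairs map onto positions 1 .. n(n-1)/2 exactly once.
n = 4
print([pair_index(i, j, n)
       for i in range(1, n) for j in range(i + 1, n + 1)])
# [1, 2, 3, 4, 5, 6]
```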

27
Details of Approach 2
  • Total bytes used is about 12p, where p is the
    number of pairs that actually occur.
  • Beats triangular matrix if at most 1/3 of
    possible pairs actually occur.
  • May require extra space for retrieval structure,
    e.g., a hash table.

28
A-Priori Algorithm for Pairs
  • A two-pass approach called a-priori limits the
    need for main memory.
  • Key idea: monotonicity. If a set of items appears
    at least s times, so does every subset.
  • Contrapositive for pairs: if item i does not
    appear in s baskets, then no pair including i
    can appear in s baskets.

29
Apriori Algorithm - General
  • Agrawal & Srikant, 1994
  • Min_sup = 2

Database D:
  TID | Items
  ----|------------
  10  | a, c, d
  20  | b, c, e
  30  | a, b, c, e
  40  | b, e

Scan D -> 1-candidate counts: a:2, b:3, c:3, d:1, e:3
Frequent 1-itemsets: a:2, b:3, c:3, e:3
2-candidates: ab, ac, ae, bc, be, ce
Scan D -> counts: ab:1, ac:2, ae:1, bc:2, be:3, ce:2
Frequent 2-itemsets: ac:2, bc:2, be:3, ce:2
3-candidates: bce
Scan D -> frequent 3-itemsets: bce:2
30
Important Details of Apriori
  • How to generate candidates?
  • Step 1: self-joining L_k
  • Step 2: pruning
  • How to count supports of candidates?

31
How to Generate Candidates?
  • Suppose the items in L_{k-1} are listed in an
    order.
  • Step 1: self-join L_{k-1}
    INSERT INTO C_k
    SELECT p.item_1, p.item_2, ..., p.item_{k-1},
           q.item_{k-1}
    FROM L_{k-1} p, L_{k-1} q
    WHERE p.item_1 = q.item_1, ...,
          p.item_{k-2} = q.item_{k-2},
          p.item_{k-1} < q.item_{k-1}
  • Step 2: pruning
    For each itemset c in C_k do
      For each (k-1)-subset s of c do
        if (s is not in L_{k-1}) then delete c from C_k
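The join-and-prune steps above can be sketched in Python (a minimal sketch; the function name is my own):

```python
from itertools import combinations

def generate_candidates(L_prev, k):
    """C_k from L_{k-1}: self-join itemsets that agree on their
    first k-2 items, then prune any candidate that has an
    infrequent (k-1)-subset."""
    L_prev = {tuple(sorted(s)) for s in L_prev}
    C = set()
    for p in L_prev:
        for q in L_prev:
            # join condition: equal prefixes, p's last item < q's
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # prune: every (k-1)-subset must be in L_{k-1}
                if all(sub in L_prev
                       for sub in combinations(c, k - 1)):
                    C.add(c)
    return C

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(generate_candidates(L3, 4))  # {('a', 'b', 'c', 'd')}
```

Run on the next slide's example L3, the join produces abcd and acde, and the prune step removes acde because ade is not in L3.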

32
Example of Candidate-generation
  • L3 = {abc, abd, acd, ace, bcd}
  • Self-joining: L3 * L3
  • abcd from abc and abd
  • acde from acd and ace
  • Pruning:
  • acde is removed because ade is not in L3
  • C4 = {abcd}

33
How to Count Supports of Candidates?
  • Why is counting supports of candidates a problem?
  • The total number of candidates can be very large
  • One transaction may contain many candidates
  • Method
  • Candidate itemsets are stored in a hash-tree
  • Leaf node of hash-tree contains a list of
    itemsets and counts
  • Interior node contains a hash table
  • Subset function finds all the candidates
    contained in a transaction

34
Apriori Candidate Generation-and-test
  • Any subset of a frequent itemset must also be
    frequent (an anti-monotone property)
  • A transaction containing {beer, diaper, nuts}
    also contains {beer, diaper}
  • {beer, diaper, nuts} is frequent -> {beer, diaper}
    must also be frequent
  • No superset of any infrequent itemset should be
    generated or tested
  • Many item combinations can be pruned

35
The Apriori Algorithm
  • C_k: candidate itemsets of size k
  • L_k: frequent itemsets of size k
  • L_1 = frequent items
  • for (k = 1; L_k is not empty; k++) do
  •   C_{k+1} = candidates generated from L_k
  •   for each transaction t in the database do:
      increment the count of all candidates in C_{k+1}
      that are contained in t
  •   L_{k+1} = candidates in C_{k+1} with min_support
  • return the union of all L_k
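The whole loop above fits in a short Python sketch (not the paper's implementation; candidate counting here is a simple scan rather than a hash-tree):

```python
from itertools import combinations
from collections import Counter

def apriori(baskets, min_sup):
    """All frequent itemsets, one pass over the baskets per size k."""
    # Pass 1: L1 = frequent single items.
    item_counts = Counter(i for b in baskets for i in b)
    L = {(i,) for i, c in item_counts.items() if c >= min_sup}
    result = {iset: item_counts[iset[0]] for iset in L}
    k = 2
    while L:
        # Generate C_k from L_{k-1} by self-join, then prune.
        C = set()
        for p in L:
            for q in L:
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    c = p + (q[-1],)
                    if all(s in L for s in combinations(c, k - 1)):
                        C.add(c)
        # Pass k: count the candidates contained in each basket.
        counts = Counter()
        for b in baskets:
            for c in C:
                if set(c) <= b:
                    counts[c] += 1
        L = {c for c in C if counts[c] >= min_sup}
        result.update({c: counts[c] for c in L})
        k += 1
    return result

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"},      {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]
print(apriori(baskets, 3))
```

On the running example with min_sup = 3 this returns the four frequent singletons and the three frequent pairs, then stops because no candidate triple survives the prune.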

36
A-Priori Algorithm (2)
  • Pass 1 Read baskets and count in main memory the
    occurrences of each item.
  • Requires only memory proportional to the number
    of items.
  • Items that appear at least s times are the
    frequent items.

37
A-Priori Algorithm (3)
  • Pass 2 Read baskets again and count in main
    memory only those pairs both of which were found
    in Pass 1 to be frequent.
  • Requires memory proportional to the square of the
    number of frequent items only (for counts), plus a
    list of the frequent items (so you know what must
    be counted).

38
Picture of A-Priori

[Diagram: in Pass 1, main memory holds the item counts; in Pass
2, it holds the frequent-items list plus counts of pairs of
frequent items.]
39
Detail for A-Priori
  • You can use the triangular-matrix method with n =
    number of frequent items.
  • May save space compared with storing triples.
  • Trick: renumber the frequent items 1, 2, ..., and
    keep a table relating the new numbers to the
    original item numbers.

40
A-Priori Using Triangular Matrix for Counts
[Diagram: in Pass 1, main memory holds the item counts; between
passes, the frequent items are renumbered 1, 2, ..., with a table
mapping new numbers to the old item numbers; in Pass 2, main
memory holds the counts of pairs of frequent items.]
41
Frequent Triples, Etc.
  • For each k, we construct two sets of k-sets
    (sets of size k):
  • C_k = candidate k-sets = those that might be
    frequent sets (support >= s) based on information
    from the pass for k-1.
  • L_k = the set of truly frequent k-sets.

42
[Diagram: C1 -Filter-> L1 -Construct-> C2 -Filter-> L2
-Construct-> C3 ... The first pass filters C1 down to L1; the
second pass filters C2 down to L2.]
43
A-Priori for All Frequent Itemsets
  • One pass for each k.
  • Needs room in main memory to count each candidate
    k -set.
  • For typical market-basket data and reasonable
    support (e.g., 1%), k = 2 requires the most
    memory.

44
Frequent Itemsets (2)
  • C1 = all items
  • In general, L_k = members of C_k with support >= s.
  • C_{k+1} = (k+1)-sets, each size-k subset of which
    is in L_k.

45
Challenges of FPM
  • Challenges
  • Multiple scans of transaction database
  • Huge number of candidates
  • Tedious workload of support counting for
    candidates
  • Improving Apriori general ideas
  • Reduce number of transaction database scans
  • Shrink number of candidates
  • Facilitate support counting of candidates

46
DIC Reduce Scans
  • Once both A and D are determined frequent, the
    counting of AD can begin
  • Once all length-2 subsets of BCD are determined
    frequent, the counting of BCD can begin
  • S. Brin, R. Motwani, J. Ullman, and S. Tsur, 1997

[Diagram: the itemset lattice over {A, B, C, D}, from single
items up to ABCD. Apriori counts 2-item sets and then 3-item
sets in separate full scans; DIC starts counting an itemset
partway through a scan, as soon as all its subsets are known
to be frequent.]
47
DHP Reduce the Number of Candidates
  • A hashing bucket count < min_sup -> every
    candidate in the bucket is infrequent
  • Candidates: a, b, c, d, e
  • Hash entries: {ab, ad, ae}, {bd, be, de}, ...
  • Large 1-itemsets: a, b, d, e
  • The sum of counts of ab, ad, ae < min_sup -> ab
    should not be a candidate 2-itemset
  • J. Park, M. Chen, and P. Yu, 1995
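A first-pass sketch in the spirit of DHP (names and the `n_buckets` parameter are my own, not from the paper): count single items while hashing every pair into a bucket, then keep as candidates only pairs of frequent items whose bucket count reaches the threshold.

```python
from itertools import combinations
from collections import Counter

def dhp_candidate_pairs(baskets, min_sup, n_buckets):
    """Pairs that survive both the frequent-items filter and the
    hash-bucket filter.  A truly frequent pair always survives,
    since its bucket count is at least its own support."""
    item_counts = Counter()
    buckets = [0] * n_buckets
    for b in baskets:
        for i in b:
            item_counts[i] += 1
        for pair in combinations(sorted(b), 2):
            buckets[hash(pair) % n_buckets] += 1
    freq = sorted(i for i, c in item_counts.items() if c >= min_sup)
    return {p for p in combinations(freq, 2)
            if buckets[hash(p) % n_buckets] >= min_sup}

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"},      {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]
cands = dhp_candidate_pairs(baskets, 3, 1000)
```

Some infrequent pairs may survive due to bucket collisions; the second pass then counts only the surviving candidates.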

48
Partition Scan Database Only Twice
  • Partition the database into n partitions
  • Itemset X is frequent -> X is frequent in at
    least one partition
  • Scan 1: partition the database and find local
    frequent patterns
  • Scan 2: consolidate global frequent patterns
  • A. Savasere, E. Omiecinski, and S. Navathe, 1995
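A two-scan sketch of the Partition idea, restricted to pairs for brevity (function names are my own): if an itemset reaches a support fraction f globally, by pigeonhole it reaches fraction f in at least one partition, so the union of local frequent sets is a safe candidate set.

```python
import math
from itertools import combinations
from collections import Counter

def frequent_pairs(baskets, min_count):
    """Naive in-memory frequent-pair mining for one partition."""
    counts = Counter()
    for b in baskets:
        for pair in combinations(sorted(b), 2):
            counts[pair] += 1
    return {p for p, c in counts.items() if c >= min_count}

def partition_frequent_pairs(baskets, min_frac, n_parts):
    """Scan 1: local frequent pairs per partition (their union is
    the global candidate set).  Scan 2: exact global counts for
    the candidates only."""
    size = math.ceil(len(baskets) / n_parts)
    candidates = set()
    for start in range(0, len(baskets), size):
        chunk = baskets[start:start + size]
        local_min = math.ceil(min_frac * len(chunk))
        candidates |= frequent_pairs(chunk, local_min)
    counts = Counter()
    for b in baskets:
        for pair in combinations(sorted(b), 2):
            if pair in candidates:
                counts[pair] += 1
    global_min = math.ceil(min_frac * len(baskets))
    return {p: c for p, c in counts.items() if c >= global_min}

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"},      {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]
print(partition_frequent_pairs(baskets, 3 / 8, 2))
```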

49
Sampling for Frequent Patterns
  • Select a sample of the original database; mine
    frequent patterns within the sample using Apriori
  • Scan the database once to verify frequent itemsets
    found in the sample; only borders of the closure of
    frequent patterns are checked
  • Example: check abcd instead of ab, ac, ..., etc.
  • Scan the database again to find missed frequent
    patterns
  • H. Toivonen, 1996

50
Bottleneck of Frequent-pattern Mining
  • Multiple database scans are costly
  • Mining long patterns needs many passes of
    scanning and generates lots of candidates
  • To find the frequent itemset i1 i2 ... i100:
  • # of scans: 100
  • # of candidates: C(100,1) + C(100,2) + ... +
    C(100,100) = 2^100 - 1, about 1.27×10^30
  • Bottleneck: candidate generation and test
  • Can we avoid candidate generation?

51
Set Enumeration Tree
  • Subsets of I can be enumerated systematically
  • I = {a, b, c, d}

[Diagram: set-enumeration tree rooted at the empty set; level 1:
a, b, c, d; level 2: ab, ac, ad, bc, bd, cd; level 3: abc, abd,
acd, bcd; level 4: abcd.]
52
Borders of Frequent Itemsets
  • Connected:
  • X and Y are frequent and X is an ancestor of Y ->
    all patterns between X and Y are frequent

53
Projected Databases
  • To find a child X + {y} of X, only the X-projected
    database is needed
  • The sub-database of transactions containing X
  • Item y is frequent in the X-projected database

54
Tree-Projection Method
  • Find frequent 2-itemsets
  • For each frequent 2-itemset xy, form a projected
    database
  • The sub-database of transactions containing xy
  • Recursive mining
  • If x'y' is frequent in the xy-projected database,
    then xyx'y' is a frequent pattern

55
Borders and Max-patterns
  • Max-patterns: borders of frequent patterns
  • Any subset of a max-pattern is frequent
  • Any proper superset of a max-pattern is infrequent

56
MaxMiner Mining Max-patterns
  • Min_sup = 2

  Tid | Items
  ----|---------------
  10  | A, B, C, D, E
  20  | B, C, D, E
  30  | A, C, D, F

  • 1st scan: find frequent items
  • A, B, C, D, E
  • 2nd scan: find support for the potential
    max-patterns
  • AB, AC, AD, AE, ABCDE
  • BC, BD, BE, BCDE
  • CD, CE, CDE, DE
  • Since BCDE is a max-pattern, there is no need to
    check BCD, BDE, CDE in a later scan
  • Bayardo, 1998
57
Frequent Closed Patterns
  • For frequent itemset X, if there exists no item y
    s.t. every transaction containing X also contains
    y, then X is a frequent closed pattern
  • acdf is a frequent closed pattern
  • Concise representation of frequent patterns
  • Reduces the number of patterns and rules
  • N. Pasquier et al., ICDT'99
  • Min_sup = 2

  TID | Items
  ----|---------------
  10  | a, c, d, e, f
  20  | a, b, e
  30  | c, e, f
  40  | a, c, d, f
  50  | c, e, f
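The closed patterns of this small database can be found with a brute-force sketch (only suitable for tiny data; the function name is my own). It uses the equivalent formulation: a frequent itemset X is closed iff no frequent proper superset of X has the same support.

```python
from itertools import combinations

def closed_patterns(baskets, min_sup):
    """Enumerate every frequent itemset, then keep those with no
    equal-support frequent proper superset (i.e. the closed ones)."""
    items = sorted({i for b in baskets for i in b})
    support = {}
    for k in range(1, len(items) + 1):
        for iset in combinations(items, k):
            s = sum(1 for b in baskets if set(iset) <= b)
            if s >= min_sup:
                support[frozenset(iset)] = s
    return {x: s for x, s in support.items()
            if not any(x < y and s == support[y] for y in support)}

# The five transactions from the slide, min_sup = 2.
baskets = [set("acdef"), set("abe"), set("cef"),
           set("acdf"), set("cef")]
print(closed_patterns(baskets, 2)[frozenset("acdf")])  # 2
```

For instance, cd (support 2) is not closed because acd also has support 2, while acdf (support 2) is closed, matching the slide.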
58
CLOSET Mining Frequent Closed Patterns
  • F-list: list of all frequent items in ascending
    order of support
  • F-list: d-a-f-e-c
  • Divide the search space:
  • Patterns having d
  • Patterns having a but no d, etc.
  • Find frequent closed patterns recursively
  • Every transaction having d also has c, f, a ->
    cfad is a frequent closed pattern
  • Pei, Han & Mao, 2000
  • Min_sup = 2

  TID | Items
  ----|---------------
  10  | a, c, d, e, f
  20  | a, b, e
  30  | c, e, f
  40  | a, c, d, f
  50  | c, e, f
59
Closed and Max-patterns
  • Closed pattern mining algorithms can be adapted
    to mine max-patterns
  • A max-pattern must be closed
  • Depth-first search methods have advantages over
    breadth-first search ones

60
Multiple-level Association Rules
  • Items often form a hierarchy
  • Flexible support settings: items at the lower
    level are expected to have lower support
  • Transaction databases can be encoded based on
    dimensions and levels
  • Explore shared multi-level mining

61
Multi-dimensional Association Rules
  • Single-dimensional rules:
  • buys(X, "milk") -> buys(X, "bread")
  • MD rules: >= 2 dimensions or predicates
  • Inter-dimension assoc. rules (no repeated
    predicates):
  • age(X, "19-25") AND occupation(X, "student") ->
    buys(X, "coke")
  • Hybrid-dimension assoc. rules (repeated
    predicates):
  • age(X, "19-25") AND buys(X, "popcorn") ->
    buys(X, "coke")
  • Categorical attributes: finite number of possible
    values, no order among values
  • Quantitative attributes: numeric, implicit order

62
Quantitative/Weighted Association Rules
Numeric attributes are dynamically discretized so as to
maximize the confidence or compactness of the rules.
2-D quantitative association rules: A_quan1 AND A_quan2 -> A_cat.
Cluster adjacent association rules to form general rules
using a 2-D grid.

[Diagram: a 2-D grid of Age (32 to 38) against Income (<20k up
to 70-80k); the clustered cells correspond to the rule
age(X, "33-34") AND income(X, "30K-50K") ->
buys(X, "high resolution TV").]