Provided by: weiw6

Learn more at: http://protocols.netlab.uky.edu

1
Association Rule Mining II

CS 685 Special Topics in Data Mining
Spring 2008

2
Frequent Pattern Analysis

Finding inherent regularities in data
What products were often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
What are the commonly occurring subsequences in a group of genes?
What are the shared substructures in a group of effective drugs?

3
What Is Frequent Pattern Analysis?

Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
Applications
Identify motifs in bio-molecules
DNA sequence analysis, protein structure analysis
Identify patterns in micro-arrays
Business applications
Market basket analysis, cross-marketing, catalog design, sale campaign analysis, etc.

4
Data

An item is an element (a literal, a variable, a symbol, a descriptor, an attribute, a measurement, etc.)
A transaction is a set of items
A data set is a set of transactions
A database is a data set

Transaction-id  Items bought
100             f, a, c, d, g, i, m, p
200             a, b, c, f, l, m, o
300             b, f, h, j, o
400             b, c, k, s, p
500             a, f, c, e, l, p, m, n
5
Association Rules
Transaction-id  Items bought
100             f, a, c, d, g, i, m, p
200             a, b, c, f, l, m, o
300             b, f, h, j, o
400             b, c, k, s, p
500             a, f, c, e, l, p, m, n

Itemset X = {x1, ..., xk}
Find all the rules X → Y with minimum support and confidence
support, s, is the probability that a transaction contains X ∪ Y
confidence, c, is the conditional probability that a transaction having X also contains Y

Let sup_min = 50%, conf_min = 50%
Association rules: A → C (60%, 100%), C → A (60%, 75%)
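The support and confidence figures for the rules between a and c can be checked directly against the five transactions. The following is a minimal sketch; the function names `count`, `support`, and `confidence` are illustrative choices, not part of the slides.

```python
# Transactions from the table on this slide (TID -> items bought).
transactions = {
    100: set("facdgimp"),
    200: set("abcflmo"),
    300: set("bfhjo"),
    400: set("bcksp"),
    500: set("afcelpmn"),
}

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for items in transactions.values() if itemset <= items)

def support(itemset):
    """Fraction of transactions that contain `itemset`."""
    return count(itemset) / len(transactions)

def confidence(lhs, rhs):
    """Estimate P(rhs | lhs) as count(lhs ∪ rhs) / count(lhs)."""
    return count(set(lhs) | set(rhs)) / count(lhs)

print(support("ac"))          # 0.6
print(confidence("a", "c"))   # 1.0
print(confidence("c", "a"))   # 0.75
```

Both rules clear the 50% thresholds, which matches the (60%, 100%) and (60%, 75%) pairs quoted on the slide.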
6
Apriori-based Mining

Generate length-(k+1) candidate itemsets from length-k frequent itemsets, and
Test the candidates against the DB

7
Apriori Algorithm

A level-wise, candidate-generation-and-test approach (Agrawal & Srikant, 1994)

Database D (Min_sup = 2)
TID  Items
10   a, c, d
20   b, c, e
30   a, b, c, e
40   b, e

Scan D: 1-candidates
Itemset  Sup
a        2
b        3
c        3
d        1
e        3

Freq 1-itemsets
Itemset  Sup
a        2
b        3
c        3
e        3

2-candidates: ab, ac, ae, bc, be, ce

Scan D: counting
Itemset  Sup
ab       1
ac       2
ae       1
bc       2
be       3
ce       2

Freq 2-itemsets
Itemset  Sup
ac       2
bc       2
be       3
ce       2

3-candidates: bce

Scan D: Freq 3-itemsets
Itemset  Sup
bce      2
8
Important Details of Apriori

How to generate candidates?
Step 1: self-joining L_k
Step 2: pruning
How to count supports of candidates?

9
How to Generate Candidates?

Suppose the items in L_{k-1} are listed in an order
Step 1: self-join L_{k-1}
INSERT INTO C_k
SELECT p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
FROM L_{k-1} p, L_{k-1} q
WHERE p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
Step 2: pruning
For each itemset c in C_k do
    For each (k-1)-subset s of c do
        if (s is not in L_{k-1}) then delete c from C_k

10
Example of Candidate-generation

L_3 = {abc, abd, acd, ace, bcd}
Self-joining: L_3 * L_3
abcd from abc and abd
acde from acd and ace
Pruning
acde is removed because ade is not in L_3
C_4 = {abcd}
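The join and prune steps can be sketched in a few lines and run on this example. This is an illustrative version that assumes itemsets are stored as sorted tuples; the name `apriori_gen` is a label chosen here, not something defined in the slides.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Generate size-k candidates from frequent (k-1)-itemsets.

    `prev_frequent` is a collection of sorted tuples, all of length k-1.
    """
    prev = set(prev_frequent)
    candidates = set()
    # Step 1: self-join. Two itemsets join when they agree on the first
    # k-2 items and the last item of p precedes the last item of q.
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # Step 2: prune. Drop any candidate that has an infrequent (k-1)-subset.
    return {
        c for c in candidates
        if all(s in prev for s in combinations(c, len(c) - 1))
    }

L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
C4 = apriori_gen(L3)
print(C4)  # {('a', 'b', 'c', 'd')}
```

The join produces abcd and acde; acde is dropped in the prune step because ade (and cde) are missing from the input.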

11
How to Count Supports of Candidates?

Why is counting supports of candidates a problem?
The total number of candidates can be huge
One transaction may contain many candidates
Method:
Candidate itemsets are stored in a hash-tree
A leaf node of the hash-tree contains a list of itemsets and counts
An interior node contains a hash table
The subset function finds all the candidates contained in a transaction
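To illustrate what the subset function has to compute, here is a simplified sketch that replaces the hash-tree with a single flat hash table (a Python dict). It is not the hash-tree structure described on the slide, only the same counting task in its simplest form; `count_supports` is a hypothetical name, and the data is the four-transaction database from the Apriori Algorithm slide.

```python
from itertools import combinations

def count_supports(transactions, candidates, k):
    """Count how many transactions contain each size-k candidate.

    Candidates are sorted tuples stored as dict keys, so each lookup
    is a single hash probe.
    """
    counts = {c: 0 for c in candidates}
    for t in transactions:
        # Enumerate the k-subsets of the transaction and probe the table.
        for subset in combinations(sorted(t), k):
            if subset in counts:
                counts[subset] += 1
    return counts

db = [set("acd"), set("bce"), set("abce"), set("be")]
C2 = [tuple(s) for s in ("ab", "ac", "ae", "bc", "be", "ce")]
counts = count_supports(db, C2, 2)
print(counts[("b", "e")])  # 3
```

Enumerating every k-subset of a long transaction is expensive; a hash-tree avoids much of that work by only descending into branches that can still match a stored candidate.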

12
Apriori: Candidate Generation-and-test

Any subset of a frequent itemset must also be frequent: an anti-monotone property
A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
{beer, diaper, nuts} is frequent → {beer, diaper} must also be frequent
No superset of any infrequent itemset should be generated or tested
Many item combinations can be pruned

13
The Apriori Algorithm

C_k: candidate itemsets of size k
L_k: frequent itemsets of size k
L_1 = {frequent items}
for (k = 1; L_k != ∅; k++) do
    C_{k+1} = candidates generated from L_k
    for each transaction t in database do
        increment the count of all candidates in C_{k+1} that are contained in t
    L_{k+1} = candidates in C_{k+1} with min_support
return ∪_k L_k
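The loop above translates fairly directly into a runnable sketch. This version assumes itemsets are sorted tuples and `min_support` is an absolute count; it counts candidates with a plain subset test rather than a hash-tree, and it is run on the four-transaction database from the earlier Apriori Algorithm slide.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # L_1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    level = {c: n for c, n in counts.items() if n >= min_support}
    frequent = dict(level)
    while level:
        prev = set(level)
        # C_{k+1}: self-join L_k, then prune by the anti-monotone property.
        candidates = {
            p + (q[-1],)
            for p in prev for q in prev
            if p[:-1] == q[:-1] and p[-1] < q[-1]
        }
        candidates = {
            c for c in candidates
            if all(s in prev for s in combinations(c, len(c) - 1))
        }
        # One database scan counts every surviving candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
    return frequent

db = [set("acd"), set("bce"), set("abce"), set("be")]
result = apriori(db, min_support=2)
print(result[("b", "c", "e")])  # 2
```

With Min_sup = 2 this yields four frequent 1-itemsets, four frequent 2-itemsets, and the single 3-itemset bce, the same levels as in the worked example.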

14
Challenges of Frequent Pattern Mining

Challenges
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Reduce number of transaction database scans
Shrink number of candidates
Facilitate support counting of candidates

15
DIC: Reduce Number of Scans

Once both A and D are determined frequent, the counting of AD can begin
Once all length-2 subsets of BCD are determined frequent, the counting of BCD can begin

Itemset lattice:
ABCD
ABC  ABD  ACD  BCD
AB  AC  BC  AD  BD  CD
A  B  C  D

[Figure: timeline over the transactions comparing when Apriori and DIC start counting 2-items and 3-items]
S. Brin, R. Motwani, J. Ullman, and S. Tsur, 1997.
16
DHP: Reduce the Number of Candidates

A hash bucket count < min_sup → every candidate in the bucket is infrequent
Candidates: a, b, c, d, e
Hash entries: {ab, ad, ae}, {bd, be, de}
Large 1-itemsets: a, b, d, e
The sum of counts of ab, ad, ae < min_sup → ab should not be a candidate 2-itemset
J. Park, M. Chen, and P. Yu, 1995
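A small sketch of the bucket-count filter, under stated assumptions: the hash function `bucket` and the table size `NUM_BUCKETS` are toy choices made up for this illustration (the slides do not specify a hash function), items are single characters, and the data is the four-transaction database from the Apriori Algorithm slide.

```python
from itertools import combinations

NUM_BUCKETS = 7  # hypothetical table size chosen for this illustration

def bucket(pair):
    """Toy deterministic hash for a sorted pair of single-character items."""
    x, y = pair
    return (ord(x) * 10 + ord(y)) % NUM_BUCKETS

def dhp_candidate_pairs(transactions, min_support):
    """Candidate 2-itemsets that survive the hash-bucket filter."""
    item_counts = {}
    bucket_counts = [0] * NUM_BUCKETS
    # Single scan: count items and hash every 2-subset of each transaction.
    for t in transactions:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(t), 2):
            bucket_counts[bucket(pair)] += 1
    frequent_items = sorted(i for i, n in item_counts.items() if n >= min_support)
    # A pair can only be frequent if its whole bucket reaches min_support.
    return {
        pair for pair in combinations(frequent_items, 2)
        if bucket_counts[bucket(pair)] >= min_support
    }

db = [set("acd"), set("bce"), set("abce"), set("be")]
pairs = dhp_candidate_pairs(db, min_support=2)
print(sorted("".join(p) for p in pairs))  # ['ac', 'bc', 'be', 'ce']
```

Plain Apriori would generate six candidate pairs from the frequent items a, b, c, e; here ab and ae land in buckets whose totals stay below 2 and are never counted. The filter never discards a truly frequent pair, because a bucket count is an upper bound on the count of every pair hashed into it.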

17
Partition: Scan Database Only Twice

Partition the database into n partitions
Itemset X is frequent → X is frequent in at least one partition
Scan 1: partition the database and find local frequent patterns
Scan 2: consolidate global frequent patterns
A. Savasere, E. Omiecinski, and S. Navathe, 1995
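The two scans can be sketched as follows. This is an illustration under assumptions: the local miner is a brute-force subset counter standing in for whatever in-memory algorithm would be used on a partition, support is given as a fraction (`min_fraction`), and all function names are hypothetical. The data is the four-transaction database from the Apriori Algorithm slide.

```python
from itertools import combinations

def local_frequent(partition, min_fraction):
    """Brute-force miner for one small partition (exponential; demo only)."""
    threshold = min_fraction * len(partition)
    counts = {}
    for t in partition:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for subset in combinations(items, k):
                counts[subset] = counts.get(subset, 0) + 1
    return {s for s, n in counts.items() if n >= threshold}

def partition_mine(transactions, min_fraction, n_partitions):
    """Two-scan mining: local patterns first, then one global verification."""
    size = -(-len(transactions) // n_partitions)  # ceiling division
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    # Scan 1: the union of locally frequent itemsets forms the global candidates.
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, min_fraction)
    # Scan 2: count each candidate over the whole database.
    threshold = min_fraction * len(transactions)
    result = {}
    for c in candidates:
        n = sum(1 for t in transactions if set(c) <= t)
        if n >= threshold:
            result[c] = n
    return result

db = [set("acd"), set("bce"), set("abce"), set("be")]
result = partition_mine(db, min_fraction=0.5, n_partitions=2)
print(result[("b", "c", "e")])  # 2
```

The second scan is needed because an itemset that is frequent in one partition may still fall short globally; the first scan guarantees only that no globally frequent itemset is missed.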

18
Sampling for Frequent Patterns

Select a sample of the original database and mine frequent patterns within the sample using Apriori
Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of frequent patterns are checked
Example: check abcd instead of ab, ac, ..., etc.
Scan the database again to find missed frequent patterns
H. Toivonen, 1996

19
Bottleneck of Frequent-pattern Mining

Multiple database scans are costly
Mining long patterns needs many passes of scanning and generates lots of candidates
To find frequent itemset i1 i2 ... i100
# of scans: 100
# of candidates: C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27 × 10^30
Bottleneck: candidate-generation-and-test
Can we avoid candidate generation?
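Every non-empty subset of a frequent 100-itemset is itself frequent, so a level-wise miner has to generate at least that many candidates. A quick arithmetic check of the count, as a sketch:

```python
from math import comb

n = 100
# One candidate per non-empty subset: sum of C(n, k) for k = 1..n.
total = sum(comb(n, k) for k in range(1, n + 1))

print(total == 2**n - 1)   # True
print(f"{total:.3e}")      # 1.268e+30
```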

20
Set Enumeration Tree

Subsets of I can be enumerated systematically
I = {a, b, c, d}

Tree levels, root first:
∅
a  b  c  d
ab  ac  ad  bc  bd  cd
abc  abd  acd  bcd
abcd
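One way to walk this tree is a depth-first traversal in which each node is extended only with items that come later in the item order. The sketch below assumes that convention; `set_enumeration` is an illustrative name.

```python
def set_enumeration(items, prefix=()):
    """Yield every subset of `items` in depth-first set-enumeration order.

    Each node is extended only with items that come after its last item,
    so every subset is produced exactly once.
    """
    yield prefix
    for i, item in enumerate(items):
        yield from set_enumeration(items[i + 1:], prefix + (item,))

nodes = ["".join(n) for n in set_enumeration("abcd")]
print(len(nodes))  # 16
print(nodes[:6])   # ['', 'a', 'ab', 'abc', 'abcd', 'abd']
```

The empty string stands for the root ∅. Depth-first miners follow this order and stop descending as soon as a node turns out to be infrequent.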
21
Borders of Frequent Itemsets

Connected: if X and Y are frequent and X is an ancestor of Y, then all patterns between X and Y are frequent

22
Projected Databases

To find a child Xy of X, only the X-projected database is needed
The X-projected database is the sub-database of transactions containing X
Xy is frequent if item y is frequent in the X-projected database
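A minimal sketch of projection, assuming transactions are Python sets and `min_support` is an absolute count. Removing the items of X from each projected transaction is a convenience of this sketch rather than something the slide specifies; the function names are hypothetical, and the data is the four-transaction database from the Apriori Algorithm slide.

```python
def project(transactions, x):
    """X-projected database: transactions containing X, with X's items removed."""
    x = set(x)
    return [t - x for t in transactions if x <= t]

def frequent_extensions(transactions, x, min_support):
    """Items y such that Xy is frequent, found by scanning only the projection."""
    counts = {}
    for t in project(transactions, x):
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    return {y: n for y, n in counts.items() if n >= min_support}

db = [set("acd"), set("bce"), set("abce"), set("be")]
ext = frequent_extensions(db, "b", min_support=2)
print(sorted(ext.items()))  # [('c', 2), ('e', 3)]
```

Only the three transactions containing b are scanned, and the result says that bc and be are frequent while ab is not.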

23
Tree-Projection Method

Find frequent 2-itemsets
For each frequent 2-itemset xy, form a projected database
The sub-database containing xy
Recursive mining
If x'y' is frequent in the xy-projected database, then xyx'y' is a frequent pattern

24
Borders and Max-patterns

Max-patterns: borders of frequent patterns
Any subset of a max-pattern is frequent
Any proper superset of a max-pattern is infrequent

25
MaxMiner: Mining Max-patterns

Min_sup = 2
Tid  Items
10   A, B, C, D, E
20   B, C, D, E
30   A, C, D, F

1st scan: find frequent items
A, B, C, D, E
2nd scan: find support for
AB, AC, AD, AE, ABCDE
BC, BD, BE, BCDE
CD, CE, CDE, DE
Potential max-patterns: ABCDE, BCDE, CDE
Since BCDE is a max-pattern, no need to check BCD, BDE, CDE in later scans
[Baya98]
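The max-patterns of this three-transaction example can be confirmed by brute force. The sketch below is not MaxMiner; it enumerates every frequent itemset and keeps those with no frequent proper superset, which is only practical for tiny inputs. Function names are illustrative.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Brute force: count every subset of every transaction (demo only)."""
    counts = {}
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for s in combinations(items, k):
                key = frozenset(s)
                counts[key] = counts.get(key, 0) + 1
    return {s: n for s, n in counts.items() if n >= min_support}

def max_patterns(frequent):
    """Frequent itemsets that have no frequent proper superset."""
    return {s for s in frequent if not any(s < other for other in frequent)}

db = [set("ABCDE"), set("BCDE"), set("ACDF")]
maximal = max_patterns(frequent_itemsets(db, 2))
labels = sorted("".join(sorted(s)) for s in maximal)
print(labels)  # ['ACD', 'BCDE']
```

BCDE comes out as a max-pattern, in line with the slide, and ACD is the only other one; every other frequent itemset is a subset of one of these two.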
26
Frequent Closed Patterns

For a frequent itemset X, if there exists no item y outside X s.t. every transaction containing X also contains y, then X is a frequent closed pattern
acdf is a frequent closed pattern
Concise representation of frequent patterns
Reduce # of patterns and rules
N. Pasquier et al., ICDT'99

Min_sup = 2
TID  Items
10   a, c, d, e, f
20   a, b, e
30   c, e, f
40   a, c, d, f
50   c, e, f
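The definition can be tested mechanically: X is closed exactly when the items shared by all transactions containing X are X itself. The sketch below applies that check by brute force to the five transactions on this slide; it is an illustration of the definition, not an efficient closed-pattern miner, and the function names are hypothetical.

```python
from itertools import combinations

def closure(itemset, transactions):
    """Items shared by every transaction containing `itemset`, plus its support."""
    covering = [t for t in transactions if itemset <= t]
    return frozenset.intersection(*covering), len(covering)

def closed_patterns(transactions, min_support):
    """Return {closed frequent itemset: support} by checking closure(X) == X."""
    transactions = [frozenset(t) for t in transactions]
    closed = {}
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for subset in combinations(items, k):
                x = frozenset(subset)
                full, support = closure(x, transactions)
                # X is closed when no extra item y occurs in every
                # transaction that contains X.
                if support >= min_support and full == x:
                    closed[x] = support
    return closed

db = [set("acdef"), set("abe"), set("cef"), set("acdf"), set("cef")]
closed = closed_patterns(db, min_support=2)
labels = {"".join(sorted(s)): n for s, n in closed.items()}
print(labels["acdf"])  # 2
```

On this data the check finds six closed patterns (a, e, ae, cf, cef, acdf). An itemset such as d is frequent but not closed, since every transaction containing d also contains a, c, and f.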
27
CLOSET: Mining Frequent Closed Patterns

Flist: list of all frequent items in support-ascending order
Flist: d-a-f-e-c
Divide search space
Patterns having d
Patterns having a but no d, etc.
Find frequent closed patterns recursively
Every transaction having d also has cfa → cfad is a frequent closed pattern
[PHM00]

Min_sup = 2
TID  Items
10   a, c, d, e, f
20   a, b, e
30   c, e, f
40   a, c, d, f
50   c, e, f
28
Closed and Max-patterns

Closed pattern mining algorithms can be adapted to mine max-patterns
A max-pattern must be closed
Depth-first search methods have advantages over breadth-first search ones

29
Multiple-level Association Rules

Items often form a hierarchy
Flexible support settings: items at the lower level are expected to have lower support
Transaction database can be encoded based on dimensions and levels
Explore shared multi-level mining

30
Multi-dimensional Association Rules

Single-dimensional rules
buys(X, "milk") → buys(X, "bread")
MD rules: ≥ 2 dimensions or predicates
Inter-dimension assoc. rules (no repeated predicates)
age(X, "19-25") ∧ occupation(X, "student") → buys(X, "coke")
Hybrid-dimension assoc. rules (repeated predicates)
age(X, "19-25") ∧ buys(X, "popcorn") → buys(X, "coke")
Categorical attributes: finite number of possible values, no order among values
Quantitative attributes: numeric, implicit order
31
Quantitative/Weighted Association Rules
Numeric attributes are dynamically discretized to maximize the confidence or compactness of the rules
2-D quantitative association rules: A_quan1 ∧ A_quan2 → A_cat
Cluster adjacent association rules to form general rules using a 2-D grid
[Figure: 2-D grid of rules; x-axis: Age (32 to 38), y-axis: Income (<20k to 70-80k, in 10k bands)]
Example: age(X, "33-34") ∧ income(X, "30K - 50K") → buys(X, "high resolution TV")
32
Constraint-based Data Mining