Association Rule Mining I - PowerPoint PPT Presentation

Provided by: WeiW8

Transcript and Presenter's Notes

Title: Association Rule Mining I

1
Association Rule Mining I

COMP 790-90 Seminar
BCB 713 Module
Spring 2009

2
Outline

What is association rule mining?
Methods for association rule mining
Extensions of association rules

3
What Is Association Rule Mining?

Frequent patterns: patterns (sets of items,
sequences, etc.) that occur frequently in a
database [AIS93]
Frequent pattern mining: finding regularities in
data
What products were often purchased together?
Beer and diapers?!
What are the subsequent purchases after buying a
car?
Can we automatically profile customers?

4
Why Essential?

Foundation for many data mining tasks:
association rules, correlation, causality,
sequential patterns, structural patterns, spatial
and multimedia patterns, associative
classification, cluster analysis, iceberg cube, ...
Broad applications:
basket data analysis, cross-marketing, catalog
design, sale campaign analysis, web log (click
stream) analysis, ...

5
Basics

Itemset: a set of items
E.g., acm = {a, c, m}
Support of itemsets
Sup(acm) = 3
Given min_sup = 3, acm is a frequent pattern
Frequent pattern mining: find all frequent
patterns in a database

[Table: transaction database TDB]

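The support definition above can be sketched in a few lines of Python. The slide's TDB table is not part of this transcript, so the five transactions below are a made-up stand-in, chosen only so that Sup(acm) comes out as 3. The names `tdb` and `support` are illustrative.

```python
# Hypothetical transaction database (NOT the slide's TDB, which is missing here).
tdb = [
    {"f", "a", "c", "d", "m", "p"},
    {"a", "b", "c", "f", "m"},
    {"b", "f", "h"},
    {"b", "c", "p"},
    {"a", "f", "c", "e", "m", "p"},
]

def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)

min_sup = 3
sup_acm = support("acm", tdb)      # itemset written as a string of one-letter items
is_frequent = sup_acm >= min_sup   # acm is frequent when its support reaches min_sup
```

With this stand-in data, acm is contained in the first, second, and fifth transactions.
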
6
Frequent Pattern Mining: A Road Map

Boolean vs. quantitative associations
age(x, 30..39) ∧ income(x, 42..48K) → buys(x,
car) [support 1%, confidence 75%]
Single dimension vs. multiple dimensional
associations
Single level vs. multiple-level analysis
What brands of beers are associated with what
brands of diapers?

7
Extensions & Applications

Correlation, causality analysis & mining
interesting rules
Max-patterns and frequent closed itemsets
Constraint-based mining
Sequential patterns
Periodic patterns
Structural Patterns
Computing iceberg cubes

8
Frequent Pattern Mining Methods

Apriori and its variations/improvements
Mining frequent-patterns without candidate
generation
Mining max-patterns and closed itemsets
Mining multi-dimensional, multi-level frequent
patterns with flexible support constraints
Interestingness: correlation and causality

9
Apriori: Candidate Generation-and-test

Any subset of a frequent itemset must also be
frequent: an anti-monotone property
A transaction containing {beer, diaper, nuts}
also contains {beer, diaper}
{beer, diaper, nuts} is frequent → {beer, diaper}
must also be frequent
No superset of any infrequent itemset should be
generated or tested
Many item combinations can be pruned

10
Apriori-based Mining

Generate length (k+1) candidate itemsets from
length k frequent itemsets, and
Test the candidates against DB

11
Apriori Algorithm

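A level-wise, candidate-generation-and-test
approach (Agrawal & Srikant, 1994)

[Figure: with min_sup = 2, scan database D to count the 1-candidates and obtain the frequent 1-itemsets; generate the 2-candidates and scan D to obtain the frequent 2-itemsets; generate the 3-candidates and scan D to obtain the frequent 3-itemsets]
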
12
The Apriori Algorithm

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items}
for (k = 1; Lk != ∅; k++) do
    Ck+1 = candidates generated from Lk
    for each transaction t in database do
        increment the count of all candidates in Ck+1
        that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
return ∪k Lk
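The level-wise loop above might look as follows in Python. This is a minimal sketch, not the authors' implementation: candidates are built by taking unions of pairs of frequent k-itemsets, which is simpler than the ordered self-join described on the later slides, and the four-transaction database `db` is a hypothetical example.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset: support} for every itemset with support >= min_sup."""
    transactions = [frozenset(t) for t in transactions]

    # L1: count single items in one scan.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(level)

    k = 1
    while level:
        # C(k+1): unions of two frequent k-itemsets that have size k+1 and
        # whose k-subsets are all frequent (the Apriori pruning rule).
        candidates = set()
        for a, b in combinations(level, 2):
            c = a | b
            if len(c) == k + 1 and all(
                frozenset(s) in level for s in combinations(c, k)
            ):
                candidates.add(c)

        # One scan of the database to count the surviving candidates.
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        level = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(level)
        k += 1
    return frequent

# Hypothetical database, min_sup = 2.
db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
result = apriori(db, 2)
```

On this data the loop stops after the third level, where bce is the only frequent 3-itemset.
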

13
Important Details of Apriori

How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
How to count supports of candidates?

14
How to Generate Candidates?

Suppose the items in Lk-1 are listed in an order
Step 1: self-join Lk-1
    INSERT INTO Ck
    SELECT p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    FROM Lk-1 p, Lk-1 q
    WHERE p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2,
          p.itemk-1 < q.itemk-1
Step 2: pruning
    For each itemset c in Ck do
        For each (k-1)-subset s of c do
            if (s is not in Lk-1) then delete c from Ck

15
Example of Candidate-generation

L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3 * L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4 = {abcd}
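The join and prune steps can be replayed on this example in Python. This is a sketch under one assumption: each itemset is stored as a tuple of items in sorted order, which plays the role of the item order the self-join relies on. The function name `apriori_gen` is illustrative.

```python
from itertools import combinations

def apriori_gen(prev_level):
    """Build C(k) from L(k-1). Returns (joined, pruned) lists of sorted tuples."""
    prev = sorted(prev_level)
    prev_set = set(prev)

    # Step 1: self-join. p and q must agree on their first k-2 items,
    # and the last item of p must precede the last item of q.
    joined = []
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.append(p + (q[-1],))

    # Step 2: prune. Drop any candidate with a (k-1)-subset missing from L(k-1).
    pruned = [
        c for c in joined
        if all(s in prev_set for s in combinations(c, len(c) - 1))
    ]
    return joined, pruned

L3 = [tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")]
joined, C4 = apriori_gen(L3)
```

Here `joined` holds abcd and acde, and `C4` keeps only abcd because ade is not in L3.
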

16
How to Count Supports of Candidates?

Why is counting supports of candidates a problem?
The total number of candidates can be huge
One transaction may contain many candidates
Method:
Candidate itemsets are stored in a hash-tree
Leaf node of hash-tree contains a list of
itemsets and counts
Interior node contains a hash table
Subset function finds all the candidates
contained in a transaction
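As a rough illustration of what the subset function has to deliver, the sketch below swaps the hash-tree for a plain Python dictionary. It enumerates every k-subset of a transaction and looks each one up directly. This is not the hash-tree method itself; a hash-tree avoids enumerating subsets that cannot match any candidate. The candidate list and the transactions are hypothetical.

```python
from itertools import combinations

def count_supports(candidates, transactions):
    """Count each size-k candidate by enumerating every transaction's k-subsets."""
    counts = {tuple(sorted(c)): 0 for c in candidates}
    if not counts:
        return counts
    k = len(next(iter(counts)))  # all candidates are assumed to have the same size
    for t in transactions:
        for subset in combinations(sorted(t), k):
            if subset in counts:  # dictionary lookup stands in for the hash-tree
                counts[subset] += 1
    return counts

candidates = [(1, 2, 3), (1, 3, 5), (3, 5, 6), (2, 4, 6)]
transactions = [{1, 2, 3, 5, 6}, {1, 3, 5}, {2, 4, 6}]
counts = count_supports(candidates, transactions)
```
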

17
Example: Counting Supports of Candidates

Transaction: 1 2 3 5 6
[Figure: hash-tree traversal by the subset function for this transaction]

18
Challenges of Frequent Pattern Mining

Challenges
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for
candidates
Improving Apriori: general ideas
Reduce number of transaction database scans
Shrink number of candidates
Facilitate support counting of candidates

19
DIC: Reduce Number of Scans

Once both A and D are determined frequent, the
counting of AD can begin
Once all length-2 subsets of BCD are determined
frequent, the counting of BCD can begin

Itemset lattice:
ABCD
ABC, ABD, ACD, BCD
AB, AC, BC, AD, BD, CD
A, B, C, D

[Figure: timeline over the transactions showing when Apriori and DIC start counting 2-items and 3-items]

S. Brin, R. Motwani, J. Ullman, and S. Tsur, 1997.

20
DHP: Reduce the Number of Candidates

A hashing bucket count < min_sup → every candidate
in the bucket is infrequent
Candidates: a, b, c, d, e
Hash entries: {ab, ad, ae}, {bd, be, de}
Large 1-itemsets: a, b, d, e
The sum of counts of ab, ad, ae < min_sup → ab
should not be a candidate 2-itemset
J. Park, M. Chen, and P. Yu, 1995
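The bucket filter can be sketched in Python as below. The hash function, the table size `NUM_BUCKETS`, and the four-transaction database are all assumptions made for illustration; the slide does not specify them. The key property is that a bucket's count is an upper bound on the support of every pair hashed into it.

```python
from itertools import combinations

NUM_BUCKETS = 7  # assumed hash-table size

def bucket(pair, order):
    """Hypothetical hash function mapping an ordered item pair to a bucket."""
    x, y = pair
    return (order[x] * 10 + order[y]) % NUM_BUCKETS

def dhp_candidate_pairs(transactions, min_sup):
    """Return candidate 2-itemsets that survive the bucket-count filter."""
    items = sorted({i for t in transactions for i in t})
    order = {item: n for n, item in enumerate(items, start=1)}
    item_count = dict.fromkeys(items, 0)
    bucket_count = [0] * NUM_BUCKETS

    # Single scan: count items and hash every 2-subset into a bucket.
    for t in transactions:
        for i in t:
            item_count[i] += 1
        for pair in combinations(sorted(t), 2):
            bucket_count[bucket(pair, order)] += 1

    frequent_items = [i for i in items if item_count[i] >= min_sup]
    # A pair whose bucket count is below min_sup cannot be frequent.
    return [
        p for p in combinations(frequent_items, 2)
        if bucket_count[bucket(p, order)] >= min_sup
    ]

db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
pairs = dhp_candidate_pairs(db, 2)
```

With this data and hash function, ab and ae fall into buckets whose counts stay below min_sup, so they are dropped before any support counting.
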

21
Partition: Scan Database Only Twice

Partition the database into n partitions
Itemset X is frequent → X is frequent in at least
one partition
Scan 1: partition database and find local
frequent patterns
Scan 2: consolidate global frequent patterns
A. Savasere, E. Omiecinski, and S. Navathe, 1995
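A minimal Python sketch of the two scans follows. It assumes the minimum support is given as a fraction `min_frac` of the transactions, so that the same threshold scales to each partition. The local miner is a brute-force enumeration suitable only for tiny examples, and the database is hypothetical.

```python
from itertools import combinations

def local_frequent(part, min_count):
    """Brute-force all itemsets with support >= min_count inside one partition."""
    counts = {}
    for t in part:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for s in combinations(items, k):
                counts[s] = counts.get(s, 0) + 1
    return {s for s, c in counts.items() if c >= min_count}

def partition_mine(transactions, min_frac, n_parts):
    """Return {itemset tuple: global support} using two passes over the data."""
    size = max(1, -(-len(transactions) // n_parts))  # ceiling division
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    # Scan 1: the union of local frequent itemsets forms the global candidates.
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, min_frac * len(part))

    # Scan 2: count every candidate over the whole database.
    global_min = min_frac * len(transactions)
    result = {}
    for c in candidates:
        sup = sum(1 for t in transactions if set(c) <= set(t))
        if sup >= global_min:
            result[c] = sup
    return result

db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
result = partition_mine(db, 0.5, 2)
```

No globally frequent itemset is missed, because an itemset that is infrequent in every partition cannot reach the global threshold.
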

22
Sampling for Frequent Patterns

Select a sample of the original database and mine
frequent patterns within the sample using Apriori
Scan the database once to verify the frequent itemsets
found in the sample; only the borders of the closure of
frequent patterns are checked
Example: check abcd instead of ab, ac, ..., etc.
Scan the database again to find missed frequent
patterns
H. Toivonen, 1996

23
Eclat/MaxEclat and VIPER: Vertical Data Format

Tid-list: the list of transaction-ids containing
an itemset
Major operation: intersection of tid-lists
Compression of tid-lists
Itemset A: t1, t2, t3; sup(A) = 3
Itemset B: t2, t3, t4; sup(B) = 3
Itemset AB: t2, t3; sup(AB) = 2
M. Zaki et al., 1997
P. Shenoy et al., 2000
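The tid-list example on this slide can be reproduced with Python sets. This is only a sketch of the intersection step: transaction ids are written as the integers 1 to 4, the helper name `tidlist_of` is illustrative, and the compression used by VIPER is not shown.

```python
# Vertical format: each item maps to the set of transaction ids containing it.
tidlists = {"A": {1, 2, 3}, "B": {2, 3, 4}}

def tidlist_of(itemset, tidlists):
    """Intersect the tid-lists of the single items in a non-empty `itemset`."""
    lists = [tidlists[item] for item in itemset]
    return set.intersection(*lists)

tid_ab = tidlist_of("AB", tidlists)  # transactions containing both A and B
sup_ab = len(tid_ab)                 # support is the length of the tid-list
```

The support of AB is obtained without rescanning the transactions, since it is just the size of the intersected tid-list.
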

24
Bottleneck of Frequent-pattern Mining

Multiple database scans are costly
Mining long patterns needs many passes of
scanning and generates lots of candidates
To find frequent itemset i1 i2 ... i100:
# of scans: 100
# of candidates: C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27 × 10^30
Bottleneck: candidate-generation-and-test
Can we avoid candidate generation?
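To get a feel for the scale, the short check below assumes that the candidate count on this slide refers to every non-empty sub-itemset of a 100-item frequent pattern, each of which must itself be frequent and therefore counted. The variable names are illustrative.

```python
from math import comb

n = 100  # length of the frequent pattern i1 i2 ... i100

# Number of non-empty subsets, summed level by level (one level per scan).
total_candidates = sum(comb(n, k) for k in range(1, n + 1))

# The same number in closed form: all subsets except the empty one.
closed_form = 2 ** n - 1
```

Both expressions give the same integer, which is roughly 1.27 × 10^30.
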
