1
Association Rule Mining I
  • COMP 790-90 Seminar
  • BCB 713 Module
  • Spring 2009

2
Outline
  • What is association rule mining?
  • Methods for association rule mining
  • Extensions of association rule

3
What Is Association Rule Mining?
  • Frequent patterns: patterns (sets of items,
    sequences, etc.) that occur frequently in a
    database [AIS93]
  • Frequent pattern mining: finding regularities in
    data
  • What products were often purchased together?
  • Beer and diapers?!
  • What are the subsequent purchases after buying a
    car?
  • Can we automatically profile customers?

4
Why Essential?
  • Foundation for many data mining tasks
  • Association rules, correlation, causality,
    sequential patterns, structural patterns, spatial
    and multimedia patterns, associative
    classification, cluster analysis, iceberg cubes, ...
  • Broad applications
  • Basket data analysis, cross-marketing, catalog
    design, sales campaign analysis, web log (click
    stream) analysis, ...

5
Basics
  • Itemset: a set of items
  • E.g., acm = {a, c, m}
  • Support of an itemset: the number of transactions
    containing it
  • Sup(acm) = 3
  • Given min_sup = 3, acm is a frequent pattern
  • Frequent pattern mining: find all frequent
    patterns in a database

Transaction database TDB
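As a sketch, the support computation above can be written directly in Python; the transaction database below is a hypothetical stand-in for the TDB on the slide, not the original data.

```python
# Hypothetical stand-in for the transaction database TDB.
transactions = [
    {"a", "c", "d", "f", "m"},
    {"a", "b", "c", "f", "m"},
    {"a", "c", "m", "p"},
    {"b", "f"},
]

def support(itemset, db):
    """Support of an itemset: the number of transactions containing it."""
    return sum(1 for t in db if itemset <= t)

print(support({"a", "c", "m"}, transactions))        # sup(acm) = 3 here
print(support({"a", "c", "m"}, transactions) >= 3)   # frequent at min_sup = 3
```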
6
Frequent Pattern Mining: A Road Map
  • Boolean vs. quantitative associations
  • age(x, "30..39") ∧ income(x, "42..48K") ⇒
    buys(x, "car") [support = 1%, confidence = 75%]
  • Single-dimensional vs. multi-dimensional
    associations
  • Single-level vs. multi-level analysis
  • What brands of beer are associated with what
    brands of diapers?

7
Extensions and Applications
  • Correlation and causality analysis: mining
    interesting rules
  • Maxpatterns and frequent closed itemsets
  • Constraint-based mining
  • Sequential patterns
  • Periodic patterns
  • Structural Patterns
  • Computing iceberg cubes

8
Frequent Pattern Mining Methods
  • Apriori and its variations/improvements
  • Mining frequent-patterns without candidate
    generation
  • Mining max-patterns and closed itemsets
  • Mining multi-dimensional, multi-level frequent
    patterns with flexible support constraints
  • Interestingness: correlation and causality

9
Apriori: Candidate Generation-and-Test
  • Any subset of a frequent itemset must also be
    frequent (an anti-monotone property)
  • A transaction containing {beer, diaper, nuts}
    also contains {beer, diaper}
  • {beer, diaper, nuts} is frequent ⇒ {beer, diaper}
    must also be frequent
  • No superset of any infrequent itemset should be
    generated or tested
  • Many item combinations can be pruned

10
Apriori-based Mining
  • Generate length-(k+1) candidate itemsets from
    length-k frequent itemsets, and
  • Test the candidates against DB

11
Apriori Algorithm
  • A level-wise, candidate-generation-and-test
    approach (Agrawal & Srikant, 1994)

(Figure: level-wise flow with min_sup = 2. Scan database D to count
1-candidates, yielding the frequent 1-itemsets; generate 2-candidates,
scan D and count, yielding the frequent 2-itemsets; generate
3-candidates, scan D again, yielding the frequent 3-itemsets.)
12
The Apriori Algorithm
  • Ck: candidate itemsets of size k
  • Lk: frequent itemsets of size k
  • L1 = {frequent items};
  • for (k = 1; Lk != ∅; k++) do
  •   Ck+1 = candidates generated from Lk;
  •   for each transaction t in database do
  •     increment the count of all candidates in Ck+1
        that are contained in t
  •   Lk+1 = candidates in Ck+1 with min_support
  • return ∪k Lk;
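A minimal Python version of this level-wise loop (the toy database below is illustrative, not from the slides):

```python
from itertools import combinations

def apriori(db, min_sup):
    """Level-wise candidate-generation-and-test (sketch)."""
    # L1: frequent 1-itemsets.
    items = sorted({i for t in db for i in t})
    Lk = {frozenset([i]) for i in items
          if sum(1 for t in db if i in t) >= min_sup}
    frequent = set(Lk)
    k = 1
    while Lk:
        # Join frequent k-itemsets to form (k+1)-candidates...
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # ...and prune any candidate with an infrequent k-subset.
        Ck1 = {c for c in Ck1
               if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Scan the database to count candidates and keep the frequent ones.
        Lk = {c for c in Ck1
              if sum(1 for t in db if c <= t) >= min_sup}
        frequent |= Lk
        k += 1
    return frequent

db = [{"beer", "diaper", "nuts"}, {"beer", "diaper"},
      {"beer", "milk"}, {"diaper", "milk"}]
print(apriori(db, min_sup=2))  # {beer}, {diaper}, {milk}, {beer, diaper}
```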

13
Important Details of Apriori
  • How to generate candidates?
  • Step 1: self-joining Lk
  • Step 2: pruning
  • How to count supports of candidates?

14
How to Generate Candidates?
  • Suppose the items in Lk-1 are listed in an order
  • Step 1: self-join Lk-1
    INSERT INTO Ck
    SELECT p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    FROM Lk-1 p, Lk-1 q
    WHERE p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2,
      p.itemk-1 < q.itemk-1
  • Step 2: pruning
    For each itemset c in Ck do
      For each (k-1)-subset s of c do
        if (s is not in Lk-1) then delete c from Ck
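The self-join and prune steps can be sketched in Python over sorted tuples, mirroring the SQL above (with `<` playing the role of the item order):

```python
from itertools import combinations

def gen_candidates(Lk_1, k):
    """Generate k-candidates from frequent (k-1)-itemsets
    given as sorted tuples (sketch)."""
    Lset = set(Lk_1)
    Ck = []
    # Step 1: self-join. Merge two (k-1)-itemsets that agree on their
    # first k-2 items, with p's last item < q's last item.
    for p in Lk_1:
        for q in Lk_1:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                Ck.append(p + (q[-1],))
    # Step 2: prune. Drop any candidate with an infrequent (k-1)-subset.
    return [c for c in Ck
            if all(s in Lset for s in combinations(c, k - 1))]

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(gen_candidates(L3, 4))  # [('a', 'b', 'c', 'd')]; acde pruned (ade not in L3)
```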

15
Example of Candidate-generation
  • L3 = {abc, abd, acd, ace, bcd}
  • Self-joining: L3 with L3
  • abcd from abc and abd
  • acde from acd and ace
  • Pruning
  • acde is removed because ade is not in L3
  • C4 = {abcd}

16
How to Count Supports of Candidates?
  • Why is counting supports of candidates a problem?
  • The total number of candidates can be huge
  • One transaction may contain many candidates
  • Method
  • Candidate itemsets are stored in a hash-tree
  • Leaf node of hash-tree contains a list of
    itemsets and counts
  • Interior node contains a hash table
  • Subset function finds all the candidates
    contained in a transaction
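A full hash tree is beyond a slide-sized sketch; the dictionary-based stand-in below shows only the bookkeeping of the subset function, enumerating each transaction's k-subsets and incrementing the matching candidate counts (a real hash tree avoids enumerating all subsets of long transactions):

```python
from itertools import combinations

def count_supports(candidates, db, k):
    """Count, for each k-candidate (a sorted tuple), how many
    transactions contain it. Simplified stand-in for the
    hash-tree subset function."""
    counts = {c: 0 for c in candidates}
    for t in db:
        # Enumerate every k-subset of the transaction and
        # bump the count of those that are candidates.
        for s in combinations(sorted(t), k):
            if s in counts:
                counts[s] += 1
    return counts

C2 = [("1", "2"), ("1", "3"), ("3", "5")]
db = [{"1", "2", "3", "5", "6"}, {"1", "3", "5", "6"}]
print(count_supports(C2, db, 2))
```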

17
Example: Counting Supports of Candidates
(Figure: hash tree of candidate 3-itemsets; the subset function
matches transaction {1, 2, 3, 5, 6} against the tree to find the
candidates it contains.)
18
Challenges of Frequent Pattern Mining
  • Challenges
  • Multiple scans of transaction database
  • Huge number of candidates
  • Tedious workload of support counting for
    candidates
  • Improving Apriori: general ideas
  • Reduce number of transaction database scans
  • Shrink number of candidates
  • Facilitate support counting of candidates

19
DIC: Reduce the Number of Scans
  • Once both A and D are determined frequent, the
    counting of AD can begin
  • Once all length-2 subsets of BCD are determined
    frequent, the counting of BCD can begin

(Figure: itemset lattice over {A, B, C, D}, comparing when Apriori
and DIC begin counting 2-itemsets and 3-itemsets as transactions
are read.)

  • S. Brin, R. Motwani, J. Ullman, and S. Tsur, 1997
20
DHP: Reduce the Number of Candidates
  • If a hash bucket's count < min_sup, every candidate
    hashed to that bucket is infrequent
  • Candidates: a, b, c, d, e
  • Hash entries: {ab, ad, ae}, {bd, be, de}, ...
  • Large 1-itemsets: a, b, d, e
  • The sum of counts of ab, ad, and ae < min_sup ⇒ ab
    should not be a candidate 2-itemset
  • J. Park, M. Chen, and P. Yu, 1995
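A small Python sketch of the DHP idea; the bucket count and the use of Python's built-in `hash` are illustrative choices, not from the paper. Because collisions only inflate bucket counts, pruning never discards a truly frequent pair:

```python
from itertools import combinations
from collections import Counter

def dhp_prune(db, L1, min_sup, n_buckets=7):
    """While scanning, hash every 2-subset of each transaction into
    a bucket; a 2-candidate whose bucket count is below min_sup
    cannot be frequent and is pruned (DHP sketch)."""
    buckets = Counter()
    for t in db:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    # Keep only 2-candidates (from frequent items) whose bucket
    # count reaches min_sup.
    return [p for p in combinations(sorted(L1), 2)
            if buckets[hash(p) % n_buckets] >= min_sup]

db = [{"a", "b"}, {"a", "b"}, {"a", "c"}]
print(dhp_prune(db, {"a", "b", "c"}, min_sup=2))  # always keeps ("a", "b")
```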

21
Partition: Scan the Database Only Twice
  • Partition the database into n partitions
  • If itemset X is frequent, then X is frequent in at
    least one partition
  • Scan 1: partition the database and find local
    frequent patterns
  • Scan 2: consolidate global frequent patterns
  • A. Savasere, E. Omiecinski, and S. Navathe, 1995
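A two-scan sketch of this idea, with a brute-force miner (an assumed helper, fine for tiny partitions) standing in for Apriori on each partition:

```python
from itertools import combinations

def brute_frequent(db, min_sup):
    """Brute-force local miner (stand-in for Apriori on a partition)."""
    items = sorted({i for t in db for i in t})
    out = set()
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            if sum(1 for t in db if set(c) <= t) >= min_sup:
                out.add(frozenset(c))
    return out

def partition_mine(db, min_sup, n_parts=2):
    """Any globally frequent itemset is locally frequent in at
    least one partition, so two scans suffice."""
    step = -(-len(db) // n_parts)  # partition size (ceiling division)
    # Scan 1: collect local frequent patterns from each partition,
    # scaling min_sup down proportionally (flooring only adds
    # candidates, never loses a frequent itemset).
    candidates = set()
    for i in range(0, len(db), step):
        part = db[i:i + step]
        local_min = max(1, min_sup * len(part) // len(db))
        candidates |= brute_frequent(part, local_min)
    # Scan 2: verify all candidates against the full database.
    return {c for c in candidates
            if sum(1 for t in db if c <= t) >= min_sup}

db = [{"a", "b"}, {"a", "b"}, {"a"}, {"b"}]
print(partition_mine(db, min_sup=3))  # {a} and {b}; ab is only locally frequent
```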

22
Sampling for Frequent Patterns
  • Select a sample of original database, mine
    frequent patterns within sample using Apriori
  • Scan database once to verify frequent itemsets
    found in sample, only borders of closure of
    frequent patterns are checked
  • Example check abcd instead of ab, ac, , etc.
  • Scan database again to find missed frequent
    patterns
  • H. Toivonen, 1996

23
Eclat/MaxEclat and VIPER: Vertical Data Format
  • Tid-list: the list of transaction-ids containing
    an itemset
  • Major operation: intersection of tid-lists
  • Compression of tid-lists
  • Itemset A: {t1, t2, t3}, sup(A) = 3
  • Itemset B: {t2, t3, t4}, sup(B) = 3
  • Itemset AB: {t2, t3}, sup(AB) = 2
  • M. Zaki et al., 1997
  • P. Shenoy et al., 2000
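The tid-list arithmetic on this slide is a one-liner in Python's set algebra:

```python
# Vertical data format: each itemset maps to its tid-list.
# Support is the tid-list length; extending an itemset is
# just a tid-list intersection.
tidlists = {
    "A": {1, 2, 3},   # A occurs in transactions t1, t2, t3
    "B": {2, 3, 4},   # B occurs in transactions t2, t3, t4
}

AB = tidlists["A"] & tidlists["B"]   # tid-list of itemset AB
print(sorted(AB), len(AB))           # [2, 3] 2, i.e. sup(AB) = 2
```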

24
Bottleneck of Frequent-pattern Mining
  • Multiple database scans are costly
  • Mining long patterns needs many passes of
    scanning and generates lots of candidates
  • To find the frequent itemset i1i2...i100
  • # of scans: 100
  • # of candidates: (100 choose 1) + (100 choose 2) +
    ... + (100 choose 100) = 2^100 - 1 ≈ 1.27 × 10^30
  • Bottleneck: candidate generation-and-test
  • Can we avoid candidate generation?