Association Rule Mining II - PowerPoint PPT Presentation

Slides: 34
Provided by: weiw6
1
Association Rule Mining II
  • CS 685 Special Topics in Data Mining
  • Spring 2008

2
Frequent Pattern Analysis
  • Finding inherent regularities in data
  • What products were often purchased together?
    Beer and diapers?!
  • What are the subsequent purchases after buying a
    PC?
  • What are the commonly occurring subsequences in a
    group of genes?
  • What are the shared substructures in a group of
    effective drugs?

3
What Is Frequent Pattern Analysis?
  • Frequent pattern: a pattern (a set of items,
    subsequences, substructures, etc.) that occurs
    frequently in a data set
  • Applications
  • Identify motifs in bio-molecules
  • DNA sequence analysis, protein structure analysis
  • Identify patterns in micro-arrays
  • Business applications
  • Market basket analysis, cross-marketing, catalog
    design, sale campaign analysis, etc.

4
Data
  • An item is an element (a literal, a variable, a
    symbol, a descriptor, an attribute, a
    measurement, etc)
  • A transaction is a set of items
  • A data set is a set of transactions
  • A database is a data set

Transaction-id   Items bought
100              f, a, c, d, g, i, m, p
200              a, b, c, f, l, m, o
300              b, f, h, j, o
400              b, c, k, s, p
500              a, f, c, e, l, p, m, n

5
Association Rules
Transaction-id   Items bought
100              f, a, c, d, g, i, m, p
200              a, b, c, f, l, m, o
300              b, f, h, j, o
400              b, c, k, s, p
500              a, f, c, e, l, p, m, n
  • Itemset X = {x1, ..., xk}
  • Find all the rules X ⇒ Y with minimum support and
    confidence
  • support, s, is the probability that a transaction
    contains X ∪ Y
  • confidence, c, is the conditional probability
    that a transaction having X also contains Y

Let supmin = 50%, confmin = 50%. Association
rules: A ⇒ C (60%, 100%), C ⇒ A (60%, 75%)
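The support and confidence figures quoted for A ⇒ C and C ⇒ A can be checked against the transaction table. The following is a minimal sketch; the helper names `support_count`, `support` and `confidence` are illustrative and not part of the slides, and item "I" in transaction 100 is read as lowercase "i".

```python
# Transactions from the slide's table (item "I" in TID 100 read as "i").
transactions = {
    100: set("facdgimp"),
    200: set("abcflmo"),
    300: set("bfhjo"),
    400: set("bcksp"),
    500: set("afcelpmn"),
}

def support_count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for items in transactions.values() if itemset <= items)

def support(itemset):
    """Fraction of transactions that contain `itemset`."""
    return support_count(itemset) / len(transactions)

def confidence(lhs, rhs):
    """Share of the transactions containing `lhs` that also contain `rhs`."""
    return support_count(set(lhs) | set(rhs)) / support_count(lhs)

print(support("ac"))          # support of both A => C and C => A
print(confidence("a", "c"))   # confidence of A => C
print(confidence("c", "a"))   # confidence of C => A
```

The three printed values correspond to the 60% support, 100% confidence and 75% confidence given on the slide.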
6
Apriori-based Mining
  • Generate length-(k+1) candidate itemsets from
    length-k frequent itemsets, and
  • Test the candidates against the DB

7
Apriori Algorithm
  • A level-wise, candidate-generation-and-test
    approach (Agrawal & Srikant, 1994)

Min_sup = 2

Database D
TID   Items
10    a, c, d
20    b, c, e
30    a, b, c, e
40    b, e

Scan D: 1-candidates
Itemset   Sup
a         2
b         3
c         3
d         1
e         3

Freq 1-itemsets
Itemset   Sup
a         2
b         3
c         3
e         3

2-candidates: ab, ac, ae, bc, be, ce

Scan D: counting
Itemset   Sup
ab        1
ac        2
ae        1
bc        2
be        3
ce        2

Freq 2-itemsets
Itemset   Sup
ac        2
bc        2
be        3
ce        2

3-candidates: bce

Scan D: Freq 3-itemsets
Itemset   Sup
bce       2

8
Important Details of Apriori
  • How to generate candidates?
  • Step 1: self-joining Lk
  • Step 2: pruning
  • How to count supports of candidates?

9
How to Generate Candidates?
  • Suppose the items in Lk-1 are listed in an order
  • Step 1: self-join Lk-1
  • INSERT INTO Ck
  • SELECT p.item1, p.item2, ..., p.itemk-1, q.itemk-1
  • FROM Lk-1 p, Lk-1 q
  • WHERE p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2,
    p.itemk-1 < q.itemk-1
  • Step 2: pruning
  • For each itemset c in Ck do
  • For each (k-1)-subset s of c do: if (s is not in
    Lk-1) then delete c from Ck

10
Example of Candidate-generation
  • L3 = {abc, abd, acd, ace, bcd}
  • Self-joining: L3 * L3
  • abcd from abc and abd
  • acde from acd and ace
  • Pruning
  • acde is removed because ade is not in L3
  • C4 = {abcd}
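The self-join and pruning steps can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the slides' own code: the function name `apriori_gen` and the representation of itemsets as sorted tuples are choices made here.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Build size-k candidates from frequent (k-1)-itemsets.

    `prev_frequent` is a collection of itemsets, each a sorted tuple.
    """
    prev = sorted(set(prev_frequent))
    prev_set = set(prev)
    candidates = set()
    # Step 1: self-join. Pair itemsets that agree on all but the last item.
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # Step 2: prune. Drop a candidate if any (k-1)-subset is not frequent.
    return {
        c for c in candidates
        if all(s in prev_set for s in combinations(c, len(c) - 1))
    }

L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
C4 = apriori_gen(L3)
print(sorted("".join(c) for c in C4))
```

On the slide's L3, the join produces abcd and acde, and the prune step removes acde because ade is missing, leaving abcd as the only candidate.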

11
How to Count Supports of Candidates?
  • Why is counting supports of candidates a problem?
  • The total number of candidates can be huge
  • One transaction may contain many candidates
  • Method
  • Candidate itemsets are stored in a hash-tree
  • Leaf node of hash-tree contains a list of
    itemsets and counts
  • Interior node contains a hash table
  • Subset function finds all the candidates
    contained in a transaction
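A full hash-tree is longer than fits here, so the sketch below swaps in a plain dictionary: the "subset function" becomes an enumeration of each transaction's k-subsets followed by a dictionary lookup. The function name `count_supports` is hypothetical, and the data is the four-transaction database from the Apriori Algorithm slide.

```python
from itertools import combinations

def count_supports(transactions, candidates, k):
    """Count how many transactions contain each size-k candidate.

    A dict replaces the hash-tree: every k-subset of a transaction is
    looked up, and matching candidates have their count incremented.
    """
    counts = {c: 0 for c in candidates}
    for items in transactions:
        for subset in combinations(sorted(items), k):
            if subset in counts:
                counts[subset] += 1
    return counts

db = [set("acd"), set("bce"), set("abce"), set("be")]
C2 = [tuple(s) for s in ("ab", "ac", "ae", "bc", "be", "ce")]
counts = count_supports(db, C2, 2)
for itemset, n in sorted(counts.items()):
    print("".join(itemset), n)
```

A hash-tree serves the same purpose while avoiding the enumeration of every k-subset of a long transaction, which is the cost this simplified version pays.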

12
Apriori Candidate Generation-and-test
  • Any subset of a frequent itemset must also be
    frequent: an anti-monotone property
  • A transaction containing {beer, diaper, nuts}
    also contains {beer, diaper}
  • {beer, diaper, nuts} is frequent ⇒ {beer, diaper}
    must also be frequent
  • No superset of any infrequent itemset should be
    generated or tested
  • Many item combinations can be pruned

13
The Apriori Algorithm
  • Ck: candidate itemsets of size k
  • Lk: frequent itemsets of size k
  • L1 = {frequent items}
  • for (k = 1; Lk != ∅; k++) do
  • Ck+1 = candidates generated from Lk
  • for each transaction t in database do increment
    the count of all candidates in Ck+1 that are
    contained in t
  • Lk+1 = candidates in Ck+1 with min_support
  • return ∪k Lk
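The loop above translates fairly directly into Python. The following is a small sketch, not a tuned implementation: the function name `apriori` and the tuple representation are assumptions, and candidates are counted with a simple subset test instead of a hash-tree. It is run on the four-transaction database from the Apriori Algorithm slide with min_sup = 2.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset (sorted tuple): support count} for frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent single items.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    level = {c: n for c, n in counts.items() if n >= min_sup}
    frequent = dict(level)
    while level:
        # C(k+1): self-join Lk, keeping only candidates whose k-subsets
        # are all frequent.
        prev = sorted(level)
        candidates = set()
        for p in prev:
            for q in prev:
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    c = p + (q[-1],)
                    if all(s in level for s in combinations(c, len(c) - 1)):
                        candidates.add(c)
        # One database scan: count the candidates contained in each t.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        # L(k+1): candidates that reach min_sup.
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
    return frequent

db = [set("acd"), set("bce"), set("abce"), set("be")]
result = apriori(db, min_sup=2)
for itemset in sorted(result, key=lambda s: (len(s), s)):
    print("".join(itemset), result[itemset])
```

The run reproduces the frequent 1-, 2- and 3-itemsets of the worked example, ending with bce at support 2.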

14
Challenges of Frequent Pattern Mining
  • Challenges
  • Multiple scans of transaction database
  • Huge number of candidates
  • Tedious workload of support counting for
    candidates
  • Improving Apriori: general ideas
  • Reduce number of transaction database scans
  • Shrink number of candidates
  • Facilitate support counting of candidates

15
DIC: Reduce Number of Scans
  • Once both A and D are determined frequent, the
    counting of AD can begin
  • Once all length-2 subsets of BCD are determined
    frequent, the counting of BCD can begin

[Figure: itemset lattice over A, B, C, D (single items; AB, AC, BC, AD, BD, CD; ABC, ABD, ACD, BCD; ABCD), with a timeline over the transactions showing when Apriori and DIC start counting 2-itemsets and 3-itemsets]

S. Brin, R. Motwani, J. Ullman, and S. Tsur, 1997.
16
DHP: Reduce the Number of Candidates
  • A hashing bucket count < min_sup ⇒ every candidate
    in the bucket is infrequent
  • Candidates: a, b, c, d, e
  • Hash entries: {ab, ad, ae}, {bd, be, de}
  • Large 1-itemsets: a, b, d, e
  • The sum of counts of ab, ad, ae < min_sup ⇒ ab
    should not be a candidate 2-itemset
  • J. Park, M. Chen, and P. Yu, 1995
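The bucket test can be illustrated with a short sketch. Everything named here is an assumption for illustration: the function `dhp_pair_candidates`, the bucket count of 7, and the hash function (which assumes single-character items) are not taken from the DHP paper. The data is the four-transaction database from the Apriori Algorithm slide.

```python
from itertools import combinations

def dhp_pair_candidates(transactions, min_sup, n_buckets=7):
    """Sketch of DHP-style filtering of candidate 2-itemsets."""
    def bucket(pair):
        # Hypothetical hash function, chosen only to be deterministic.
        # It assumes every item is a single character.
        return (ord(pair[0]) * 31 + ord(pair[1])) % n_buckets

    item_counts = {}
    bucket_counts = [0] * n_buckets
    # First scan: count single items and hash every 2-subset into a bucket.
    for t in transactions:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(t), 2):
            bucket_counts[bucket(pair)] += 1
    frequent_items = sorted(i for i, n in item_counts.items() if n >= min_sup)
    # A pair of frequent items stays a candidate only if its bucket count
    # reaches min_sup; a lower count proves the pair is infrequent.
    return [
        pair for pair in combinations(frequent_items, 2)
        if bucket_counts[bucket(pair)] >= min_sup
    ]

db = [set("acd"), set("bce"), set("abce"), set("be")]
pairs = dhp_pair_candidates(db, min_sup=2)
print(pairs)
```

Because a bucket count is never smaller than the support of any pair hashed into it, the filter can discard infrequent pairs but never a frequent one. How many it discards depends on the hash function and the number of buckets.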

17
Partition: Scan Database Only Twice
  • Partition the database into n partitions
  • Itemset X is frequent ⇒ X is frequent in at least
    one partition
  • Scan 1: partition database and find local
    frequent patterns
  • Scan 2: consolidate global frequent patterns
  • A. Savasere, E. Omiecinski, and S. Navathe, 1995
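The two scans can be sketched as follows. This is a toy illustration with assumed names (`local_frequent`, `partition_mine`): the local miner is brute force and only suitable for tiny partitions, and the minimum support is given as a fraction so it can be scaled to each partition's size.

```python
from itertools import combinations
from math import ceil

def local_frequent(part, min_count):
    """Brute-force local miner; fine only for tiny partitions."""
    items = sorted(set().union(*part))
    found = set()
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            if sum(1 for t in part if t.issuperset(c)) >= min_count:
                found.add(c)
    return found

def partition_mine(transactions, min_frac, n_parts):
    """Two-scan mining: local candidates first, then one global count."""
    transactions = [frozenset(t) for t in transactions]
    size = ceil(len(transactions) / n_parts)
    parts = [transactions[i:i + size]
             for i in range(0, len(transactions), size)]
    # Scan 1: locally frequent itemsets of every partition become candidates.
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, ceil(min_frac * len(part)))
    # Scan 2: count all candidates over the whole database.
    min_count = ceil(min_frac * len(transactions))
    result = {}
    for c in candidates:
        n = sum(1 for t in transactions if t.issuperset(c))
        if n >= min_count:
            result[c] = n
    return result

db = [set("acd"), set("bce"), set("abce"), set("be")]
found = partition_mine(db, min_frac=0.5, n_parts=2)
for itemset in sorted(found, key=lambda s: (len(s), s)):
    print("".join(itemset), found[itemset])
```

With a 50% threshold and two partitions, the result on the Apriori example database matches the frequent itemsets found there with min_sup = 2.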

18
Sampling for Frequent Patterns
  • Select a sample of the original database, mine
    frequent patterns within the sample using Apriori
  • Scan database once to verify frequent itemsets
    found in sample; only borders of closure of
    frequent patterns are checked
  • Example: check abcd instead of ab, ac, ..., etc.
  • Scan database again to find missed frequent
    patterns
  • H. Toivonen, 1996

19
Bottleneck of Frequent-pattern Mining
  • Multiple database scans are costly
  • Mining long patterns needs many passes of
    scanning and generates lots of candidates
  • To find frequent itemset i1i2...i100
  • # of scans: 100
  • # of candidates: C(100,1) + C(100,2) + ... +
    C(100,100) = 2^100 - 1 ≈ 1.27 × 10^30
  • Bottleneck: candidate-generation-and-test
  • Can we avoid candidate generation?
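The size of the candidate space for a 100-item frequent itemset can be checked with a few lines of arithmetic. The sketch assumes that every non-empty subset of the 100 items is eventually generated as a candidate, since every such subset is itself frequent.

```python
from math import comb

n_items = 100
# One candidate per non-empty subset: C(100,1) + C(100,2) + ... + C(100,100).
total = sum(comb(n_items, k) for k in range(1, n_items + 1))
print(total == 2**n_items - 1)
print(f"{total:.2e}")
```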

20
Set Enumeration Tree
  • Subsets of I can be enumerated systematically
  • I = {a, b, c, d}

[Figure: set enumeration tree, listed level by level]
∅
a    b    c    d
ab   ac   ad   bc   bd   cd
abc  abd  acd  bcd
abcd
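One way to enumerate the subsets systematically is to extend each node only with items that come later in the item order, so every subset is reached exactly once. The function names `children` and `enumerate_tree` below are illustrative.

```python
def children(node, items):
    """Children of `node` in the set enumeration tree over ordered `items`.

    A node is extended only with items that come after its last item.
    """
    start = items.index(node[-1]) + 1 if node else 0
    return [node + (x,) for x in items[start:]]

def enumerate_tree(items, node=()):
    """Yield every subset once, depth-first, starting from the empty set."""
    yield node
    for child in children(node, items):
        yield from enumerate_tree(items, child)

universe = ("a", "b", "c", "d")
print(["".join(c) for c in children(("a",), universe)])
print(sum(1 for _ in enumerate_tree(universe)))
```

The first line printed lists the children of a (ab, ac, ad); the second counts all nodes of the tree, which for four items is 2^4 = 16 including the empty set.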
21
Borders of Frequent Itemsets
  • Connected
  • X and Y are frequent and X is an ancestor of Y ⇒
    all patterns between X and Y are frequent

22
Projected Databases
  • To find a child Xy of X, only the X-projected
    database is needed
  • The sub-database of transactions containing X
  • Item y is frequent in the X-projected database
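A projected database is simply a filter over the transactions. The sketch below uses assumed helper names (`project`, `frequent_extensions`) and borrows the four-transaction database from the Apriori Algorithm slide. For simplicity it counts every item outside X, without restricting y to items that follow X in the set enumeration order.

```python
def project(transactions, x):
    """X-projected database: the transactions that contain every item of X."""
    x = set(x)
    return [t for t in transactions if x <= t]

def frequent_extensions(transactions, x, min_sup):
    """Items y outside X that are frequent in the X-projected database."""
    counts = {}
    for t in project(transactions, x):
        for y in t - set(x):
            counts[y] = counts.get(y, 0) + 1
    return {y: n for y, n in counts.items() if n >= min_sup}

db = [set("acd"), set("bce"), set("abce"), set("be")]
extensions = frequent_extensions(db, "b", min_sup=2)
print(sorted(extensions.items()))
```

Here the b-projected database holds three transactions, and the items c and e are frequent in it, so bc and be are the frequent children of b.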

23
Tree-Projection Method
  • Find frequent 2-itemsets
  • For each frequent 2-itemset xy, form a projected
    database
  • The sub-database containing xy
  • Recursive mining
  • If x'y' is frequent in the xy-proj db, then xyx'y' is
    a frequent pattern

24
Borders and Max-patterns
  • Max-patterns: borders of frequent patterns
  • Any subset of a max-pattern is frequent
  • Any proper superset of a max-pattern is infrequent

25
MaxMiner: Mining Max-patterns
Min_sup = 2
Tid   Items
10    A, B, C, D, E
20    B, C, D, E
30    A, C, D, F
  • 1st scan: find frequent items
  • A, B, C, D, E
  • 2nd scan: find support for
  • AB, AC, AD, AE, ABCDE
  • BC, BD, BE, BCDE
  • CD, CE, CDE, DE
  • Potential max-patterns: ABCDE, BCDE, CDE
  • Since BCDE is a max-pattern, no need to check
    BCD, BDE, CDE in later scan
  • Bayardo, 1998 (Baya98)
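The definition of a max-pattern can be checked by brute force on the three-transaction example. This is not the MaxMiner algorithm, which avoids enumerating all frequent itemsets; the function `max_patterns` is a hypothetical helper that applies the definition directly.

```python
from itertools import combinations

def max_patterns(transactions, min_sup):
    """Brute force: frequent itemsets with no frequent proper superset."""
    items = sorted(set().union(*transactions))
    frequent = [
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
        if sum(1 for t in transactions if t.issuperset(c)) >= min_sup
    ]
    return [x for x in frequent if not any(x < y for y in frequent)]

db = [set("ABCDE"), set("BCDE"), set("ACDF")]
found = max_patterns(db, min_sup=2)
print(sorted("".join(sorted(p)) for p in found))
```

On this data the check confirms BCDE as a max-pattern and also reports ACD, which is contained in transactions 10 and 30.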
26
Frequent Closed Patterns
  • For frequent itemset X, if there exists no item y
    outside X s.t. every transaction containing X also
    contains y, then X is a frequent closed pattern
  • acdf is a frequent closed pattern
  • Concise rep. of freq pats
  • Reduce # of patterns and rules
  • N. Pasquier et al., ICDT'99

Min_sup = 2
TID   Items
10    a, c, d, e, f
20    a, b, e
30    c, e, f
40    a, c, d, f
50    c, e, f
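The closedness test amounts to comparing an itemset with the intersection of all transactions that contain it. The brute-force sketch below applies that test to the five-transaction table; `closed_patterns` is an illustrative name, and the approach is only practical for very small item sets. It assumes min_sup is at least 1.

```python
from itertools import combinations

def closed_patterns(transactions, min_sup):
    """Brute force: frequent itemsets X whose closure is X itself."""
    items = sorted(set().union(*transactions))
    closed = {}
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            covering = [t for t in transactions if t.issuperset(c)]
            if len(covering) < min_sup:
                continue
            # Closure: items shared by every transaction that contains X.
            closure = set.intersection(*covering)
            if closure == set(c):
                closed["".join(c)] = len(covering)
    return closed

db = [set("acdef"), set("abe"), set("cef"), set("acdf"), set("cef")]
closed = closed_patterns(db, min_sup=2)
print(closed)
```

The output includes acdf with support 2, as stated on the slide. An itemset such as c is frequent but not closed, because every transaction containing c also contains f.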
27
CLOSET: Mining Frequent Closed Patterns
  • Flist: list of all freq items in support asc.
    order
  • Flist: d-a-f-e-c
  • Divide search space
  • Patterns having d
  • Patterns having d but no a, etc.
  • Find frequent closed pattern recursively
  • Every transaction having d also has cfa ⇒ cfad is
    a frequent closed pattern
  • Pei, Han, and Mao, 2000 (PHM00)

Min_sup = 2
TID   Items
10    a, c, d, e, f
20    a, b, e
30    c, e, f
40    a, c, d, f
50    c, e, f
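The F-list and the first step of the search (patterns having d) can be reproduced on the table. This is only the opening move of the divide-and-conquer search, not the CLOSET algorithm itself. Items c, e and f all have support 4, so their relative order is arbitrary: the slide lists them as f-e-c, while this sketch breaks ties alphabetically.

```python
from collections import Counter

db = [set("acdef"), set("abe"), set("cef"), set("acdf"), set("cef")]
min_sup = 2

# F-list: frequent items in ascending order of support
# (ties broken alphabetically, an arbitrary choice made here).
counts = Counter(item for t in db for item in t)
f_list = sorted((i for i in counts if counts[i] >= min_sup),
                key=lambda i: (counts[i], i))
print(f_list)

# Patterns having d: look only at the transactions that contain d.
d_projected = [t for t in db if "d" in t]
# Items present in every such transaction belong to the closed pattern.
shared = set.intersection(*d_projected)
print("".join(sorted(shared)), len(d_projected))
```

Both transactions that contain d also contain a, c and f, which gives the closed pattern written as cfad on the slide, with support 2.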
28
Closed and Max-patterns
  • Closed pattern mining algorithms can be adapted
    to mine max-patterns
  • A max-pattern must be closed
  • Depth-first search methods have advantages over
    breadth-first search ones

29
Multiple-level Association Rules
  • Items often form a hierarchy
  • Flexible support settings: items at the lower
    level are expected to have lower support
  • Transaction database can be encoded based on
    dimensions and levels
  • Explore shared multi-level mining

30
Multi-dimensional Association Rules
  • Single-dimensional rules
  • buys(X, milk) ⇒ buys(X, bread)
  • MD rules: ≥ 2 dimensions or predicates
  • Inter-dimension assoc. rules (no repeated
    predicates)
  • age(X,19-25) ∧ occupation(X,student) ⇒
    buys(X,coke)
  • Hybrid-dimension assoc. rules (repeated
    predicates)
  • age(X,19-25) ∧ buys(X, popcorn) ⇒ buys(X,
    coke)
  • Categorical attributes: finite number of possible
    values, no order among values
  • Quantitative attributes: numeric, implicit order

31
Quantitative/Weighted Association Rules
Numeric attributes are dynamically
discretized to maximize the confidence or
compactness of the rules.
2-D quantitative association rules: Aquan1 ∧ Aquan2 ⇒ Acat.
Cluster adjacent association rules to form general
rules using a 2-D grid.
Example: age(X,33-34) ∧ income(X,30K - 50K) ⇒
buys(X,high resolution TV)

[Figure: 2-D grid of Age (32 to 38) against Income (bins <20k, 20-30k, 30-40k, 40-50k, 50-60k, 60-70k, 70-80k)]
32
Constraint-based Data Mining
  • Find all the patterns in a database autonomously?
  • The patterns could be too many but not focused!
  • Data mining should be interactive
  • User directs what to be mined
  • Constraint-based mining
  • User flexibility: provides constraints on what is to
    be mined
  • System optimization: push constraints for
    efficient mining

33
Constraints in Data Mining
  • Knowledge type constraint
  • classification, association, etc.
  • Data constraint: using SQL-like queries
  • find product pairs sold together in stores in New
    York
  • Dimension/level constraint
  • in relevance to region, price, brand, customer
    category
  • Rule (or pattern) constraint
  • small sales (price < 10) triggers big sales (sum
    > 200)
  • Interestingness constraint
  • strong rules: support and confidence