Association Rules - PowerPoint PPT Presentation

Title: Association Rules
Description: Association Rules, Market Baskets, Frequent Itemsets, A-Priori Algorithm
Slides: 51
Provided by: Jeff548

Transcript and Presenter's Notes
1
Association Rules
  • Market Baskets
  • Frequent Itemsets
  • A-Priori Algorithm

2
The Market-Basket Model
  • A large set of items, e.g., things sold in a
    supermarket.
  • A large set of baskets, each of which is a small
    set of the items, e.g., the things one customer
    buys on one day.

3
Market-Baskets (2)
  • Really a general many-many mapping (association)
    between two kinds of things.
  • But we ask about connections among items, not
    baskets.
  • The technology focuses on common events, not rare
    events (long tail).

4
Support
  • Simplest question: find sets of items that appear
    frequently in the baskets.
  • Support for itemset I = the number of baskets
    containing all items in I.
  • Sometimes given as a percentage.
  • Given a support threshold s, sets of items that
    appear in at least s baskets are called frequent
    itemsets.

5
Example Frequent Itemsets
  • Items = {milk, coke, pepsi, beer, juice}.
  • Support threshold = 3 baskets.
  • B1 = {m, c, b}    B2 = {m, p, j}
  • B3 = {m, b}       B4 = {c, j}
  • B5 = {m, p, b}    B6 = {m, c, b, j}
  • B7 = {c, b, j}    B8 = {b, c}
  • Frequent itemsets: {m}, {c}, {b}, {j}, {m, b},
    {b, c}, {c, j}.
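This example can be recomputed with a minimal Python sketch (not from the slides): count the support of every itemset of a given size over the eight baskets and keep those meeting the threshold s = 3. The function name `frequent_itemsets` is my own.

```python
from itertools import combinations
from collections import Counter

# The eight example baskets from the slide.
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"},      {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]

def frequent_itemsets(baskets, size, s):
    """Itemsets of the given size appearing in at least s baskets."""
    counts = Counter()
    for basket in baskets:
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    return {iset: c for iset, c in counts.items() if c >= s}

print(frequent_itemsets(baskets, 1, 3))  # {m}, {c}, {b}, {j}
print(frequent_itemsets(baskets, 2, 3))  # {m,b}, {b,c}, {c,j}
```

With s = 3 this reproduces the slide's answer: four frequent singletons and three frequent pairs.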

6
Applications (1)
  • Items = products; baskets = sets of products
    someone bought in one trip to the store.
  • Example application: given that many people buy
    beer and diapers together:
  • Run a sale on diapers; raise the price of beer.
  • Only useful if many buy diapers & beer.

7
Applications (2)
  • Baskets = sentences; items = documents containing
    those sentences.
  • Items that appear together too often could
    represent plagiarism.
  • Notice that items do not have to be "in" baskets.

8
Applications (3)
  • Baskets = Web pages; items = words.
  • Unusual words appearing together in a large
    number of documents, e.g., "Brad" and "Angelina,"
    may indicate an interesting relationship.

9
Aside Words on the Web
  • Many Web-mining applications involve words.
  • Cluster pages by their topic, e.g., sports.
  • Find useful blogs, versus nonsense.
  • Determine the sentiment (positive or negative) of
    comments.
  • Partition pages retrieved from an ambiguous
    query, e.g., jaguar.

10
Words (2)
  • Very common words are stop words.
  • They rarely help determine meaning, and they
    block from view interesting events, so ignore
    them.
  • The TF/IDF measure distinguishes important
    words from those that are usually not meaningful.

11
Words (3)
  • TF/IDF (term frequency / inverse document
    frequency) relates the number of times a word
    appears to the number of documents in which it
    appears.
  • Low values are words like "also" that appear at
    random.
  • High values are words like "computer" that may be
    the topic of any document in which it appears at
    all.

12
Scale of the Problem
  • WalMart sells 100,000 items and can store
    billions of baskets.
  • The Web has billions of words and many billions
    of pages.

13
Association Rules
  • If-then rules about the contents of baskets.
  • {i1, i2, ..., ik} -> j means: if a basket contains
    all of i1, ..., ik then it is likely to contain j.
  • The confidence of this association rule is the
    probability of j given i1, ..., ik.

14
Example Confidence
  • B1 = {m, c, b}    B2 = {m, p, j}
  • B3 = {m, b}       B4 = {c, j}
  • B5 = {m, p, b}    B6 = {m, c, b, j}
  • B7 = {c, b, j}    B8 = {b, c}
  • An association rule: {m, b} -> c.
  • Confidence = 2/4 = 50%.
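A hypothetical helper (my own naming, not from the slides) that computes this confidence directly: the fraction of baskets containing the left side that also contain the right side.

```python
# Confidence of the rule lhs -> rhs over the example baskets:
# support(lhs plus rhs) / support(lhs).
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"},      {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]

def confidence(baskets, lhs, rhs):
    """Estimate P(rhs | lhs) from basket counts."""
    lhs = set(lhs)
    both = lhs | {rhs}
    n_lhs = sum(1 for b in baskets if lhs <= b)
    n_both = sum(1 for b in baskets if both <= b)
    return n_both / n_lhs

print(confidence(baskets, {"m", "b"}, "c"))  # 2/4 = 0.5
```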

15
Finding Association Rules
  • Question: find all association rules with
    support >= s and confidence >= c.
  • Note: the support of an association rule is the
    support of the set of items on the left.
  • Hard part: finding the frequent itemsets.
  • Note: if {i1, i2, ..., ik} -> j has high support
    and confidence, then both {i1, i2, ..., ik} and
    {i1, i2, ..., ik, j} will be frequent.

16
Computation Model
  • Typically, data is kept in flat files rather than
    in a database system.
  • Stored on disk.
  • Stored basket-by-basket.
  • Expand baskets into pairs, triples, etc. as you
    read baskets.
  • Use k nested loops to generate all sets of size
    k.

17
File Organization
[Diagram: a flat file stored basket-by-basket; a run of item
entries for Basket 1, then for Basket 2, then Basket 3, etc.
Example: items are positive integers, and boundaries between
baskets are -1.]
18
Computation Model (2)
  • The true cost of mining disk-resident data is
    usually the number of disk I/Os.
  • In practice, association-rule algorithms read the
    data in passes: all baskets read in turn.
  • Thus, we measure the cost by the number of passes
    an algorithm takes.

19
Main-Memory Bottleneck
  • For many frequent-itemset algorithms, main memory
    is the critical resource.
  • As we read baskets, we need to count something,
    e.g., occurrences of pairs.
  • The number of different things we can count is
    limited by main memory.
  • Swapping counts in/out is a disaster (why?).

20
Finding Frequent Pairs
  • The hardest problem often turns out to be finding
    the frequent pairs.
  • Why? Often frequent pairs are common, frequent
    triples are rare.
  • Why? The probability of being frequent drops
    exponentially with size, while the number of sets
    grows more slowly with size.
  • We'll concentrate on pairs, then extend to larger
    sets.

21
Naïve Algorithm
  • Read file once, counting in main memory the
    occurrences of each pair.
  • From each basket of n items, generate its
    n (n -1)/2 pairs by two nested loops.
  • Fails if (#items)^2 exceeds main memory.
  • Remember: #items can be 100K (Wal-Mart) or 10B
    (Web pages).
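The naïve algorithm can be sketched in a few lines of Python (names are my own): one pass over the baskets, counting every pair in a main-memory table.

```python
from itertools import combinations
from collections import Counter

def count_pairs_naive(baskets):
    """One pass over the file, counting every pair in main memory.
    Each basket of n items contributes n(n-1)/2 pairs."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"},      {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]
counts = count_pairs_naive(baskets)
print(counts[("b", "m")])  # 4
```

The failure mode the slide describes is exactly this `Counter`: with 10^5 items it could hold up to roughly 5x10^9 entries.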

22
Example Counting Pairs
  • Suppose 10^5 items.
  • Suppose counts are 4-byte integers.
  • Number of pairs of items: 10^5(10^5 - 1)/2, about
    5×10^9.
  • Therefore, 2×10^10 bytes (20 gigabytes) of main
    memory needed.

23
Details of Main-Memory Counting
  • Two approaches:
  • (1) Count all pairs, using a triangular matrix.
  • (2) Keep a table of triples [i, j, c] = "the count
    of the pair of items {i, j} is c."
  • (1) requires only 4 bytes/pair.
  • Note: always assume integers are 4 bytes.
  • (2) requires 12 bytes, but only for those pairs
    with count > 0.

24
[Diagram: Method (1) uses 4 bytes per pair; Method (2) uses 12
bytes per occurring pair.]
25
Triangular-Matrix Approach (1)
  • Number items 1, 2, ...
  • Requires table of size O(n) to convert item names
    to consecutive integers.
  • Count {i, j} only if i < j.
  • Keep pairs in the order {1,2}, {1,3}, ..., {1,n},
    {2,3}, {2,4}, ..., {2,n}, {3,4}, ..., {3,n}, ...,
    {n-1,n}.

26
Triangular-Matrix Approach (2)
  • Find pair {i, j} at the position
    (i - 1)(n - i/2) + j - i.
  • Total number of pairs: n(n - 1)/2; total bytes
    about 2n^2.
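The position formula can be checked with a small sketch; the version below rewrites (i-1)(n - i/2) + j - i with exact integer arithmetic (the product (i-1)(2n-i) is always even), keeping the slide's 1-based positions.

```python
def pair_index(i, j, n):
    """Position of pair {i, j}, 1 <= i < j <= n, in the linear
    array holding the triangular matrix: (i-1)(n - i/2) + j - i,
    written with exact integer arithmetic."""
    assert 1 <= i < j <= n
    return (i - 1) * (2 * n - i) // 2 + (j - i)

# The n(n-1)/2 pairs map onto positions 1 .. n(n-1)/2 exactly once.
n = 4
print([pair_index(i, j, n)
       for i in range(1, n) for j in range(i + 1, n + 1)])
# [1, 2, 3, 4, 5, 6]
```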

27
Details of Approach 2
  • Total bytes used is about 12p, where p is the
    number of pairs that actually occur.
  • Beats triangular matrix if at most 1/3 of
    possible pairs actually occur.
  • May require extra space for retrieval structure,
    e.g., a hash table.

28
A-Priori Algorithm for Pairs
  • A two-pass approach called a-priori limits the
    need for main memory.
  • Key idea: monotonicity. If a set of items appears
    at least s times, so does every subset.
  • Contrapositive for pairs: if item i does not
    appear in s baskets, then no pair including i
    can appear in s baskets.

29
Apriori Algorithm - General
  • Agrawal & Srikant, 1994
  • Min_sup = 2

Database D:
  TID | Items
  ----|------------
  10  | a, c, d
  20  | b, c, e
  30  | a, b, c, e
  40  | b, e

Scan D -> 1-candidate counts: a:2, b:3, c:3, d:1, e:3
Frequent 1-itemsets: a:2, b:3, c:3, e:3
2-candidates: ab, ac, ae, bc, be, ce
Scan D -> counts: ab:1, ac:2, ae:1, bc:2, be:3, ce:2
Frequent 2-itemsets: ac:2, bc:2, be:3, ce:2
3-candidates: bce
Scan D -> frequent 3-itemsets: bce:2
30
Important Details of Apriori
  • How to generate candidates?
  • Step 1: self-joining L_k
  • Step 2: pruning
  • How to count supports of candidates?

31
How to Generate Candidates?
  • Suppose the items in L_{k-1} are listed in an
    order.
  • Step 1: self-join L_{k-1}
    INSERT INTO C_k
    SELECT p.item_1, p.item_2, ..., p.item_{k-1},
           q.item_{k-1}
    FROM L_{k-1} p, L_{k-1} q
    WHERE p.item_1 = q.item_1, ...,
          p.item_{k-2} = q.item_{k-2},
          p.item_{k-1} < q.item_{k-1}
  • Step 2: pruning
    For each itemset c in C_k do
      For each (k-1)-subset s of c do
        if (s is not in L_{k-1}) then delete c from C_k
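The join-and-prune steps above can be sketched in Python (a minimal sketch; the function name is my own):

```python
from itertools import combinations

def generate_candidates(L_prev, k):
    """C_k from L_{k-1}: self-join itemsets that agree on their
    first k-2 items, then prune any candidate that has an
    infrequent (k-1)-subset."""
    L_prev = {tuple(sorted(s)) for s in L_prev}
    C = set()
    for p in L_prev:
        for q in L_prev:
            # join condition: equal prefixes, p's last item < q's
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # prune: every (k-1)-subset must be in L_{k-1}
                if all(sub in L_prev
                       for sub in combinations(c, k - 1)):
                    C.add(c)
    return C

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(generate_candidates(L3, 4))  # {('a', 'b', 'c', 'd')}
```

Run on the next slide's example L3, the join produces abcd and acde, and the prune step removes acde because ade is not in L3.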

32
Example of Candidate-generation
  • L3 = {abc, abd, acd, ace, bcd}
  • Self-joining: L3 * L3
  • abcd from abc and abd
  • acde from acd and ace
  • Pruning:
  • acde is removed because ade is not in L3
  • C4 = {abcd}

33
How to Count Supports of Candidates?
  • Why is counting supports of candidates a problem?
  • The total number of candidates can be very large
  • One transaction may contain many candidates
  • Method
  • Candidate itemsets are stored in a hash-tree
  • Leaf node of hash-tree contains a list of
    itemsets and counts
  • Interior node contains a hash table
  • Subset function finds all the candidates
    contained in a transaction

34
Apriori Candidate Generation-and-test
  • Any subset of a frequent itemset must also be
    frequent (an anti-monotone property)
  • A transaction containing {beer, diaper, nuts}
    also contains {beer, diaper}
  • {beer, diaper, nuts} is frequent -> {beer, diaper}
    must also be frequent
  • No superset of any infrequent itemset should be
    generated or tested
  • Many item combinations can be pruned

35
The Apriori Algorithm
  • C_k: candidate itemsets of size k
  • L_k: frequent itemsets of size k
  • L_1 = frequent items
  • for (k = 1; L_k is not empty; k++) do
  •   C_{k+1} = candidates generated from L_k
  •   for each transaction t in the database do:
      increment the count of all candidates in C_{k+1}
      that are contained in t
  •   L_{k+1} = candidates in C_{k+1} with min_support
  • return the union of all L_k
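The whole loop above fits in a short Python sketch (not the paper's implementation; candidate counting here is a simple scan rather than a hash-tree):

```python
from itertools import combinations
from collections import Counter

def apriori(baskets, min_sup):
    """All frequent itemsets, one pass over the baskets per size k."""
    # Pass 1: L1 = frequent single items.
    item_counts = Counter(i for b in baskets for i in b)
    L = {(i,) for i, c in item_counts.items() if c >= min_sup}
    result = {iset: item_counts[iset[0]] for iset in L}
    k = 2
    while L:
        # Generate C_k from L_{k-1} by self-join, then prune.
        C = set()
        for p in L:
            for q in L:
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    c = p + (q[-1],)
                    if all(s in L for s in combinations(c, k - 1)):
                        C.add(c)
        # Pass k: count the candidates contained in each basket.
        counts = Counter()
        for b in baskets:
            for c in C:
                if set(c) <= b:
                    counts[c] += 1
        L = {c for c in C if counts[c] >= min_sup}
        result.update({c: counts[c] for c in L})
        k += 1
    return result

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"},      {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]
print(apriori(baskets, 3))
```

On the running example with min_sup = 3 this returns the four frequent singletons and the three frequent pairs, then stops because no candidate triple survives the prune.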

36
A-Priori Algorithm (2)
  • Pass 1 Read baskets and count in main memory the
    occurrences of each item.
  • Requires only memory proportional to the number
    of items.
  • Items that appear at least s times are the
    frequent items.

37
A-Priori Algorithm (3)
  • Pass 2 Read baskets again and count in main
    memory only those pairs both of which were found
    in Pass 1 to be frequent.
  • Requires memory proportional to the square of the
    number of frequent items only (for counts), plus a
    list of the frequent items (so you know what must
    be counted).

38
Picture of A-Priori

[Diagram: in Pass 1, main memory holds the item counts; in Pass
2, it holds the frequent-items list plus counts of pairs of
frequent items.]
39
Detail for A-Priori
  • You can use the triangular-matrix method with n =
    number of frequent items.
  • May save space compared with storing triples.
  • Trick: renumber the frequent items 1, 2, ..., and
    keep a table relating the new numbers to the
    original item numbers.

40
A-Priori Using Triangular Matrix for Counts
[Diagram: in Pass 1, main memory holds the item counts; between
passes, the frequent items are renumbered 1, 2, ..., with a table
mapping new numbers to the old item numbers; in Pass 2, main
memory holds the counts of pairs of frequent items.]
41
Frequent Triples, Etc.
  • For each k, we construct two sets of k-sets
    (sets of size k):
  • C_k = candidate k-sets = those that might be
    frequent sets (support >= s) based on information
    from the pass for k-1.
  • L_k = the set of truly frequent k-sets.

42
[Diagram: C1 -Filter-> L1 -Construct-> C2 -Filter-> L2
-Construct-> C3 ... The first pass filters C1 down to L1; the
second pass filters C2 down to L2.]
43
A-Priori for All Frequent Itemsets
  • One pass for each k.
  • Needs room in main memory to count each candidate
    k -set.
  • For typical market-basket data and reasonable
    support (e.g., 1%), k = 2 requires the most
    memory.

44
Frequent Itemsets (2)
  • C1 = all items
  • In general, L_k = members of C_k with support >= s.
  • C_{k+1} = (k+1)-sets, each size-k subset of which
    is in L_k.

45
Challenges of FPM
  • Challenges
  • Multiple scans of transaction database
  • Huge number of candidates
  • Tedious workload of support counting for
    candidates
  • Improving Apriori general ideas
  • Reduce number of transaction database scans
  • Shrink number of candidates
  • Facilitate support counting of candidates

46
DIC Reduce Scans
  • Once both A and D are determined frequent, the
    counting of AD can begin
  • Once all length-2 subsets of BCD are determined
    frequent, the counting of BCD can begin
  • S. Brin, R. Motwani, J. Ullman, and S. Tsur, 1997

[Diagram: the itemset lattice over {A, B, C, D}, from single
items up to ABCD. Apriori counts 2-item sets and then 3-item
sets in separate full scans; DIC starts counting an itemset
partway through a scan, as soon as all its subsets are known
to be frequent.]
47
DHP Reduce the Number of Candidates
  • A hashing bucket count < min_sup -> every
    candidate in the bucket is infrequent
  • Candidates: a, b, c, d, e
  • Hash entries: {ab, ad, ae}, {bd, be, de}, ...
  • Large 1-itemsets: a, b, d, e
  • The sum of counts of ab, ad, ae < min_sup -> ab
    should not be a candidate 2-itemset
  • J. Park, M. Chen, and P. Yu, 1995
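A first-pass sketch in the spirit of DHP (names and the `n_buckets` parameter are my own, not from the paper): count single items while hashing every pair into a bucket, then keep as candidates only pairs of frequent items whose bucket count reaches the threshold.

```python
from itertools import combinations
from collections import Counter

def dhp_candidate_pairs(baskets, min_sup, n_buckets):
    """Pairs that survive both the frequent-items filter and the
    hash-bucket filter.  A truly frequent pair always survives,
    since its bucket count is at least its own support."""
    item_counts = Counter()
    buckets = [0] * n_buckets
    for b in baskets:
        for i in b:
            item_counts[i] += 1
        for pair in combinations(sorted(b), 2):
            buckets[hash(pair) % n_buckets] += 1
    freq = sorted(i for i, c in item_counts.items() if c >= min_sup)
    return {p for p in combinations(freq, 2)
            if buckets[hash(p) % n_buckets] >= min_sup}

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"},      {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]
cands = dhp_candidate_pairs(baskets, 3, 1000)
```

Some infrequent pairs may survive due to bucket collisions; the second pass then counts only the surviving candidates.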

48
Partition Scan Database Only Twice
  • Partition the database into n partitions
  • Itemset X is frequent -> X is frequent in at
    least one partition
  • Scan 1: partition the database and find local
    frequent patterns
  • Scan 2: consolidate global frequent patterns
  • A. Savasere, E. Omiecinski, and S. Navathe, 1995
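A two-scan sketch of the Partition idea, restricted to pairs for brevity (function names are my own): if an itemset reaches a support fraction f globally, by pigeonhole it reaches fraction f in at least one partition, so the union of local frequent sets is a safe candidate set.

```python
import math
from itertools import combinations
from collections import Counter

def frequent_pairs(baskets, min_count):
    """Naive in-memory frequent-pair mining for one partition."""
    counts = Counter()
    for b in baskets:
        for pair in combinations(sorted(b), 2):
            counts[pair] += 1
    return {p for p, c in counts.items() if c >= min_count}

def partition_frequent_pairs(baskets, min_frac, n_parts):
    """Scan 1: local frequent pairs per partition (their union is
    the global candidate set).  Scan 2: exact global counts for
    the candidates only."""
    size = math.ceil(len(baskets) / n_parts)
    candidates = set()
    for start in range(0, len(baskets), size):
        chunk = baskets[start:start + size]
        local_min = math.ceil(min_frac * len(chunk))
        candidates |= frequent_pairs(chunk, local_min)
    counts = Counter()
    for b in baskets:
        for pair in combinations(sorted(b), 2):
            if pair in candidates:
                counts[pair] += 1
    global_min = math.ceil(min_frac * len(baskets))
    return {p: c for p, c in counts.items() if c >= global_min}

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"},      {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]
print(partition_frequent_pairs(baskets, 3 / 8, 2))
```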

49
Sampling for Frequent Patterns
  • Select a sample of the original database; mine
    frequent patterns within the sample using Apriori
  • Scan the database once to verify frequent itemsets
    found in the sample; only borders of the closure of
    frequent patterns are checked
  • Example: check abcd instead of ab, ac, ..., etc.
  • Scan the database again to find missed frequent
    patterns
  • H. Toivonen, 1996

50
Bottleneck of Frequent-pattern Mining
  • Multiple database scans are costly
  • Mining long patterns needs many passes of
    scanning and generates lots of candidates
  • To find the frequent itemset i1 i2 ... i100:
  • # of scans: 100
  • # of candidates: C(100,1) + C(100,2) + ... +
    C(100,100) = 2^100 - 1, about 1.27×10^30
  • Bottleneck: candidate generation and test
  • Can we avoid candidate generation?

51
Set Enumeration Tree
  • Subsets of I can be enumerated systematically
  • I = {a, b, c, d}

[Diagram: set-enumeration tree rooted at the empty set; level 1:
a, b, c, d; level 2: ab, ac, ad, bc, bd, cd; level 3: abc, abd,
acd, bcd; level 4: abcd.]
52
Borders of Frequent Itemsets
  • Connected:
  • X and Y are frequent and X is an ancestor of Y ->
    all patterns between X and Y are frequent

53
Projected Databases
  • To find a child X + {y} of X, only the X-projected
    database is needed
  • The sub-database of transactions containing X
  • Item y is frequent in the X-projected database

54
Tree-Projection Method
  • Find frequent 2-itemsets
  • For each frequent 2-itemset xy, form a projected
    database
  • The sub-database of transactions containing xy
  • Recursive mining
  • If x'y' is frequent in the xy-projected database,
    then xyx'y' is a frequent pattern

55
Borders and Max-patterns
  • Max-patterns: borders of frequent patterns
  • Any subset of a max-pattern is frequent
  • Any proper superset of a max-pattern is infrequent

56
MaxMiner Mining Max-patterns
  • Min_sup = 2

  Tid | Items
  ----|---------------
  10  | A, B, C, D, E
  20  | B, C, D, E
  30  | A, C, D, F

  • 1st scan: find frequent items
  • A, B, C, D, E
  • 2nd scan: find support for the potential
    max-patterns
  • AB, AC, AD, AE, ABCDE
  • BC, BD, BE, BCDE
  • CD, CE, CDE, DE
  • Since BCDE is a max-pattern, there is no need to
    check BCD, BDE, CDE in a later scan
  • Bayardo, 1998
57
Frequent Closed Patterns
  • For frequent itemset X, if there exists no item y
    s.t. every transaction containing X also contains
    y, then X is a frequent closed pattern
  • acdf is a frequent closed pattern
  • Concise representation of frequent patterns
  • Reduces the number of patterns and rules
  • N. Pasquier et al., ICDT'99
  • Min_sup = 2

  TID | Items
  ----|---------------
  10  | a, c, d, e, f
  20  | a, b, e
  30  | c, e, f
  40  | a, c, d, f
  50  | c, e, f
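The closed patterns of this small database can be found with a brute-force sketch (only suitable for tiny data; the function name is my own). It uses the equivalent formulation: a frequent itemset X is closed iff no frequent proper superset of X has the same support.

```python
from itertools import combinations

def closed_patterns(baskets, min_sup):
    """Enumerate every frequent itemset, then keep those with no
    equal-support frequent proper superset (i.e. the closed ones)."""
    items = sorted({i for b in baskets for i in b})
    support = {}
    for k in range(1, len(items) + 1):
        for iset in combinations(items, k):
            s = sum(1 for b in baskets if set(iset) <= b)
            if s >= min_sup:
                support[frozenset(iset)] = s
    return {x: s for x, s in support.items()
            if not any(x < y and s == support[y] for y in support)}

# The five transactions from the slide, min_sup = 2.
baskets = [set("acdef"), set("abe"), set("cef"),
           set("acdf"), set("cef")]
print(closed_patterns(baskets, 2)[frozenset("acdf")])  # 2
```

For instance, cd (support 2) is not closed because acd also has support 2, while acdf (support 2) is closed, matching the slide.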
58
CLOSET Mining Frequent Closed Patterns
  • F-list: list of all frequent items in ascending
    order of support
  • F-list: d-a-f-e-c
  • Divide the search space:
  • Patterns having d
  • Patterns having a but no d, etc.
  • Find frequent closed patterns recursively
  • Every transaction having d also has c, f, a ->
    cfad is a frequent closed pattern
  • Pei, Han & Mao, 2000
  • Min_sup = 2

  TID | Items
  ----|---------------
  10  | a, c, d, e, f
  20  | a, b, e
  30  | c, e, f
  40  | a, c, d, f
  50  | c, e, f
59
Closed and Max-patterns
  • Closed pattern mining algorithms can be adapted
    to mine max-patterns
  • A max-pattern must be closed
  • Depth-first search methods have advantages over
    breadth-first search ones

60
Multiple-level Association Rules
  • Items often form a hierarchy
  • Flexible support settings: items at the lower
    level are expected to have lower support
  • Transaction databases can be encoded based on
    dimensions and levels
  • Explore shared multi-level mining

61
Multi-dimensional Association Rules
  • Single-dimensional rules:
  • buys(X, "milk") -> buys(X, "bread")
  • MD rules: >= 2 dimensions or predicates
  • Inter-dimension assoc. rules (no repeated
    predicates):
  • age(X, "19-25") AND occupation(X, "student") ->
    buys(X, "coke")
  • Hybrid-dimension assoc. rules (repeated
    predicates):
  • age(X, "19-25") AND buys(X, "popcorn") ->
    buys(X, "coke")
  • Categorical attributes: finite number of possible
    values, no order among values
  • Quantitative attributes: numeric, implicit order

62
Quantitative/Weighted Association Rules
Numeric attributes are dynamically discretized so as to
maximize the confidence or compactness of the rules.
2-D quantitative association rules: A_quan1 AND A_quan2 -> A_cat.
Cluster adjacent association rules to form general rules
using a 2-D grid.

[Diagram: a 2-D grid of Age (32 to 38) against Income (<20k up
to 70-80k); the clustered cells correspond to the rule
age(X, "33-34") AND income(X, "30K-50K") ->
buys(X, "high resolution TV").]