Association Rules - PowerPoint PPT Presentation

Provided by: RAS137

Transcript and Presenter's Notes

Title: Association Rules


1
Association Rules
  • slides by Z. W. Ras

2
Market Basket Analysis (MBA)
Analyzes customer buying habits by finding associations
and correlations between the different items that
customers place in their shopping baskets
Customer1: Milk, eggs, sugar, bread
Customer2: Milk, eggs, cereal, bread
Customer3: Eggs, sugar
3
Market Basket Analysis
  • Given a database of customer transactions, where
    each transaction is a set of items
  • Find groups of items which are frequently
    purchased together

4
Goal of MBA
  • Extract information on purchasing behavior
  • Actionable information can suggest:
  • new store layouts
  • new product assortments
  • which products to put on promotion

MBA is applicable whenever a customer purchases
multiple things in proximity
5
Association Rules
  • Express how products/services relate to each
    other, and tend to group together
  • Example: if a customer purchases three-way calling, then
    the customer will also purchase call-waiting
  • Simple to understand
  • Actionable information: bundle three-way calling
    and call-waiting in a single package

6
Basic Concepts
Transactions:
Relational format <Tid, item>      Compact format <Tid, itemset>
<1, item1>                         <1, {item1, item2}>
<1, item2>                         <2, {item3}>
<2, item3>
Item: single element. Itemset: set of items.
Support of an itemset I, denoted by sup(I): the number of
transactions that contain I.
Threshold for minimum support: σ. Itemset I is frequent if
sup(I) ≥ σ. A frequent itemset represents a set of
items which are positively correlated.
7
Frequent Itemsets
sup({dairy}) = 3, sup({fruit}) = 3,
sup({dairy, fruit}) = 2. If σ = 3, then
{dairy} and {fruit} are frequent while
{dairy, fruit} is not.
(Figure: baskets of Customer 1 and Customer 2)
8
Association Rules AR(s,c)
  • A, B: a partition of a set of items
  • r: A ⇒ B
  • Support of r: sup(r) = sup(A ∪ B)
  • Confidence of r: conf(r) = sup(A ∪ B)/sup(A)
  • Thresholds:
  • minimum support: s
  • minimum confidence: c
  • r ∈ AR(s, c) if sup(r) ≥ s and conf(r) ≥ c
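
The definitions of support, confidence and membership in AR(s, c) can be sketched in a few lines of Python. This is only an illustration: the function names `sup`, `conf` and `in_ar` are chosen here, and the data are the three baskets from the Market Basket Analysis slide.

```python
# Baskets of Customer1..Customer3 from the Market Basket Analysis slide.
transactions = [
    {"milk", "eggs", "sugar", "bread"},
    {"milk", "eggs", "cereal", "bread"},
    {"eggs", "sugar"},
]

def sup(itemset, db):
    """Support: number of transactions in db containing every item of itemset."""
    return sum(1 for t in db if set(itemset) <= t)

def conf(a, b, db):
    """Confidence of the rule a => b, i.e. sup(a ∪ b) / sup(a).

    Assumes sup(a) > 0; a premise that never occurs raises ZeroDivisionError.
    """
    return sup(set(a) | set(b), db) / sup(a, db)

def in_ar(a, b, db, s, c):
    """True if the rule a => b meets minimum support s and minimum confidence c."""
    return sup(set(a) | set(b), db) >= s and conf(a, b, db) >= c

print(sup({"milk", "bread"}, transactions))       # support of {milk, bread}
print(conf({"eggs"}, {"sugar"}, transactions))    # confidence of eggs => sugar
```

Here support is an absolute count, as on the slides; dividing by `len(db)` would give the relative form.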

9
Association Rules - Example
Min. support: 2 (50%); min. confidence: 50%
  • For rule A ⇒ C:
  • sup(A ⇒ C) = 2
  • conf(A ⇒ C) = sup({A, C})/sup({A}) = 2/3
  • The Apriori principle:
  • Any subset of a frequent itemset must be frequent

10
The Apriori algorithm [Agrawal]
  • Fk: set of frequent itemsets of size k
  • Ck: set of candidate itemsets of size k
  • F1 = {frequent items}; k = 1
  • while card(Fk) ≥ 1 do begin
  •   Ck+1 = new candidates generated from Fk
  •   for each transaction t in the database do
  •     increment the count of all candidates in Ck+1
        that are contained in t
  •   Fk+1 = candidates in Ck+1 with minimum support
  •   k = k + 1
  • end
  • Answer = ∪ { Fk : k ≥ 1 and card(Fk) ≥ 1 }
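
The loop above might be written in Python roughly as follows. This is a minimal sketch, not a tuned implementation: the function name `apriori` and the join-then-prune candidate generation are choices made here, and the sample database is the one from the "Algorithm Apriori Illustration" slide.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    db = [frozenset(t) for t in transactions]

    def count(candidates):
        # One scan of the database: count each candidate, keep the frequent ones.
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_sup}

    items = {item for t in db for item in t}
    frequent = count({frozenset([i]) for i in items})   # F1
    answer = dict(frequent)
    k = 1
    while frequent:                                      # card(Fk) >= 1
        # C(k+1): unions of two sets from Fk that have size k+1 and whose
        # k-subsets are all frequent (the Apriori principle).
        candidates = set()
        for a, b in combinations(frequent, 2):
            u = a | b
            if len(u) == k + 1 and all(
                frozenset(s) in frequent for s in combinations(u, k)
            ):
                candidates.add(u)
        frequent = count(candidates)                     # F(k+1)
        answer.update(frequent)
        k += 1
    return answer

# Database from the illustration slide, minimum support 2.
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(D, 2)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```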

11
Apriori - Example
{a,d} is not frequent, so the 3-itemsets
{a,b,d}, {a,c,d} and the 4-itemset {a,b,c,d} are
not generated.
12
Algorithm Apriori Illustration
  • The task of mining association rules is mainly to
    discover strong association rules (high
    confidence and strong support) in large
    databases.
  • Mining association rules is composed of two
    steps:
  • 1. Discover the large itemsets, i.e., the
    itemsets that have transaction support above
    a predetermined minimum support s.
  • 2. Use the large itemsets to generate the
    association rules.

TID    Items
1000   A, B, C
2000   A, C
3000   A, D
4000   B, E, F
MinSup = 2
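
The two steps can be tried on the four transactions above. The sketch below is illustrative only: step 1 is done by brute-force enumeration (workable for a toy database, whereas Apriori prunes candidates), and the 50% confidence threshold and all names are assumptions made for the example.

```python
from itertools import combinations

db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
MIN_SUP = 2       # MinSup from the slide
MIN_CONF = 0.5    # assumed confidence threshold for this example

def sup(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for t in db if itemset <= t)

# Step 1: find the large (frequent) itemsets by enumerating every itemset.
items = sorted(set().union(*db))
frequent = [
    frozenset(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if sup(frozenset(c)) >= MIN_SUP
]

# Step 2: split each large itemset into premise => consequent and keep
# the rules whose confidence reaches the threshold.
rules = []
for itemset in frequent:
    for k in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), k):
            lhs = frozenset(lhs)
            confidence = sup(itemset) / sup(lhs)
            if confidence >= MIN_CONF:
                rules.append((lhs, itemset - lhs, sup(itemset), confidence))

for lhs, rhs, s, c in rules:
    print(sorted(lhs), "=>", sorted(rhs), "support", s, "confidence", round(c, 2))
```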
13
Algorithm Apriori Illustration
Database D (minimum support s = 2)
TID   Items
100   A, C, D
200   B, C, E
300   A, B, C, E
400   B, E

Scan D to count C1:
Itemset   Count
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

F1:
Itemset   Count
{A}       2
{B}       3
{C}       3
{E}       3

C2 (generated from F1):
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

Scan D to count C2:
Itemset   Count
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

F2:
Itemset   Count
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3 (generated from F2):
{B, C, E}

Scan D to count C3:
Itemset     Count
{B, C, E}   2

F3:
Itemset     Count
{B, C, E}   2
14
Representative Association Rules
  • Definition 1.
  • Cover C of a rule X ⇒ Y is denoted by C(X ⇒ Y)
  • and defined as follows:
  • C(X ⇒ Y) = { X ∪ Z ⇒ V : Z, V are disjoint
    subsets of Y }
  • Definition 2.
  • Set RR(s, c) of Representative Association
    Rules
  • is defined as follows:
  • RR(s, c) =
  • { r ∈ AR(s, c) : ¬(∃ r1 ∈ AR(s, c)) [r1 ≠ r ∧ r ∈ C(r1)] }
  • s: threshold for minimum support
  • c: threshold for minimum confidence
  • Representative Rules (informal description):
  • [as short as possible] ⇒ [as long as
    possible]
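
The cover from Definition 1 can be enumerated directly for a small rule. The sketch below assumes, in addition to what the slide states, that the consequent V must be non-empty (otherwise the result would not be a rule); the helper names are hypothetical.

```python
from itertools import combinations

def subsets(s):
    """All subsets of s, including the empty set, as frozensets."""
    s = sorted(s)
    return [frozenset(c) for k in range(len(s) + 1) for c in combinations(s, k)]

def cover(x, y):
    """Cover of X => Y: all rules (X ∪ Z) => V with Z, V disjoint subsets of Y.

    Assumption made here: V is required to be non-empty.
    Each rule is returned as a (premise, consequent) pair of frozensets.
    """
    x, y = frozenset(x), frozenset(y)
    return {(x | z, v) for v in subsets(y) if v for z in subsets(y - v)}

rules = cover({"A"}, {"B", "C"})
for premise, consequent in sorted(rules, key=lambda r: (sorted(r[0]), sorted(r[1]))):
    print(sorted(premise), "=>", sorted(consequent))
```

A rule always belongs to its own cover (take Z empty and V = Y), which is why Definition 2 excludes the case r1 = r.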

15
Representative Association Rules
Transactions:
{A,B,C,D,E}
{A,B,C,D,E,F}
{A,B,C,D,E,H,I}
{A,B,E}
{B,C,D,E,H,I}
Find RR(2, 80%)
Representative Rules:
From (BCDEHI): H ⇒ B,C,D,E,I;  I ⇒ B,C,D,E,H
From (ABCDE): A,C ⇒ B,D,E;  A,D ⇒ B,C,E
Last set: (BCDEHI, 2)
16
Frequent Pattern (FP) Growth Strategy
Minimum Support: 2
Transactions: abcde, abc, acde, bcde, bc, bde, cde
Transactions ordered: cbdea, cba, cdea, cbde, cb, bde, cde
(Figure: FP-tree)
Frequent Items: c 6, b 5, d 5, e 5, a 3
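
The first pass of the FP-growth strategy (count the items, drop the infrequent ones, sort each transaction by descending support) can be sketched as below. Breaking ties alphabetically is an assumption made here; it yields the order c, b, d, e, a used in the prefix paths of the following slides.

```python
from collections import Counter

transactions = ["abcde", "abc", "acde", "bcde", "bc", "bde", "cde"]
MIN_SUP = 2

# Support of each single item (one count per transaction containing it).
counts = Counter(item for t in transactions for item in set(t))
frequent_items = {i: n for i, n in counts.items() if n >= MIN_SUP}

def order(t):
    """Keep frequent items only, sorted by descending support, ties by name."""
    kept = [i for i in set(t) if i in frequent_items]
    return "".join(sorted(kept, key=lambda i: (-frequent_items[i], i)))

ordered = [order(t) for t in transactions]
print(dict(sorted(frequent_items.items(), key=lambda kv: (-kv[1], kv[0]))))
print(ordered)
```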
17
Frequent Pattern (FP) Growth Strategy
Mining the FP-tree for frequent itemsets: start
from each item and construct a subdatabase of
transactions (prefix paths) with that item listed
at the end. Reorder the prefix paths in support
descending order. Build a conditional FP-tree.
a 3
Prefix paths: (c b d e a, 1), (c b a, 1), (c d e a, 1)
Correct order: c 3, b 2, d 2, e 2
18
Frequent Pattern (FP) Growth Strategy
a 2
Prefix paths: (c b d e a, 1), (c b a, 1), (c d e a, 1)
Frequent Itemsets: (c a, 3), (c b a, 2), (c d a, 2),
(c d e a, 2), (c e a, 2)
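
The prefix paths for item a and the supports of the itemsets listed above can be checked with a short script. Note the simplification: instead of building a conditional FP-tree, this sketch enumerates the subsets of each prefix path, which is only practical for tiny examples. The transactions are written in descending-support item order (c, b, d, e, a).

```python
from collections import Counter
from itertools import combinations

# Transactions with items in descending-support order.
ordered = ["cbdea", "cba", "cdea", "cbde", "cb", "bde", "cde"]
MIN_SUP = 2
item = "a"

# Prefix paths: the part of each transaction that precedes the item.
prefix_paths = [t[: t.index(item)] for t in ordered if item in t]

# Item counts inside the conditional pattern base of `item`.
cond_counts = Counter(i for p in prefix_paths for i in p)

# Count every itemset of the form (subset of a prefix path) + item.
itemsets = Counter()
for p in prefix_paths:
    for k in range(1, len(p) + 1):
        for combo in combinations(p, k):
            itemsets["".join(combo) + item] += 1

frequent = {s: n for s, n in itemsets.items() if n >= MIN_SUP}
print(prefix_paths)
print(dict(cond_counts))
print(frequent)
```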
19
Multidimensional AR
Associations between values of different
attributes
RULES:
nationality = French ⇒ income = high [50%, 100%]
income = high ⇒ nationality = French [50%, 75%]
age = 50 ⇒ nationality = Italian [33%, 100%]
20
Single-dimensional AR vs Multi-dimensional
Multi-dimensional              Single-dimensional
<1, Italian, 50, low>          <1, {nat/Ita, age/50, inc/low}>
<2, French, 45, high>          <2, {nat/Fre, age/45, inc/high}>

Schema: <ID, a?, b?, c?, d?>
<1, yes, yes, no, no>          <1, {a, b}>
<2, yes, no, yes, no>          <2, {a, c}>
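
Turning a multi-dimensional record into a single-dimensional transaction amounts to encoding each attribute/value pair as one item. A minimal sketch follows; the field names and the helper `to_transaction` are hypothetical, and the values are spelled out in full rather than abbreviated as on the slide.

```python
records = [
    {"id": 1, "nat": "Italian", "age": 50, "inc": "low"},
    {"id": 2, "nat": "French", "age": 45, "inc": "high"},
]

def to_transaction(record, id_field="id"):
    """Encode every attribute=value pair (except the id) as an 'attr/value' item."""
    items = {f"{attr}/{value}" for attr, value in record.items() if attr != id_field}
    return record[id_field], items

for r in records:
    print(to_transaction(r))
```

After this encoding, any single-dimensional miner such as Apriori can be applied unchanged.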
21
Quantitative Attributes
  • Quantitative attributes (e.g. age, income)
  • Categorical attributes (e.g. color of car)

Problem: too many distinct values.
Solution: transform quantitative attributes into
categorical ones via discretization.
22
Discretization of quantitative attributes
  • Quantitative attributes are statically
    discretized by
  • using predefined concept hierarchies
  • elementary use of background knowledge
  • Loose interaction between Apriori and
    Discretizer
  • Quantitative attributes are dynamically
    discretized
  • into bins based on the distribution of the
    data.
  • considering the distance between data points.
  • Tighter interaction between Apriori and
    Discretizer
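
The difference between the two styles can be illustrated with a small sketch. The age cut points in `static_bin` stand in for a predefined concept hierarchy and are purely hypothetical; `equal_frequency_bins` is one simple distribution-based scheme among many, not the method of any particular paper.

```python
def static_bin(age):
    """Static discretization with fixed, predefined ranges (hypothetical cut points)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle"
    return "senior"

def equal_frequency_bins(values, n_bins):
    """Dynamic discretization: bin indices chosen so that each bin holds
    roughly the same number of data points (equal values may be split)."""
    ranked = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(ranked):
        bins[i] = rank * n_bins // len(values)
    return bins

ages = [18, 25, 41, 50, 63]
incomes = [10, 12, 15, 18, 20, 200]   # skewed: one very large value

print([static_bin(a) for a in ages])
print(equal_frequency_bins(incomes, 3))
```

With the skewed incomes, fixed equal-width ranges would put almost every value in the first bin, while the distribution-based bins stay balanced.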

23
Constraint-based AR
  • Preprocessing: use constraints to focus on a
    subset of transactions
  • Example: find association rules where the
    prices of all items are at most 200 Euro
  • Optimizations: use constraints to optimize
    the Apriori algorithm
  • Anti-monotonicity: when a set violates the
    constraint, so does any of its supersets.
  • The Apriori algorithm uses this property for
    pruning
  • Push constraints as deep as possible inside the
    frequent set computation

24
Apriori property revisited
  • Anti-monotonicity: if a set S violates the
    constraint, any superset of S violates the
    constraint.
  • Examples:
  • Price(S) ≤ v is anti-monotone
  • Price(S) ≥ v is not anti-monotone
  • Price(S) = v is partly anti-monotone
  • Application:
  • Push Price(S) ≤ 1000 deeply into the iterative
    frequent set computation.
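
Pushing an anti-monotone constraint into the level-wise computation might look like the sketch below. It assumes Price(S) means the sum of non-negative item prices; the function name, the products and the prices are all invented for illustration.

```python
from itertools import combinations

def frequent_within_budget(transactions, price, min_sup, max_price):
    """Level-wise search for itemsets that are frequent and satisfy
    Price(S) <= max_price, where Price(S) is the sum of item prices.

    Assumes non-negative prices, which makes the constraint anti-monotone:
    a set that fails can be dropped before any of its supersets is generated.
    """
    db = [frozenset(t) for t in transactions]

    def ok(itemset):
        # Check the cheap price constraint first, then count support.
        if sum(price[i] for i in itemset) > max_price:
            return False
        return sum(1 for t in db if itemset <= t) >= min_sup

    items = sorted({i for t in db for i in t})
    level = {frozenset([i]) for i in items if ok(frozenset([i]))}
    result = set(level)
    k = 1
    while level:
        # Candidates are built only from sets that passed both tests.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
        level = {c for c in candidates if ok(c)}
        result |= level
        k += 1
    return result

# Hypothetical data: product names and prices are made up.
price = {"tv": 900, "cable": 50, "stand": 120}
baskets = [{"tv", "cable", "stand"}, {"tv", "cable"}, {"tv", "stand"}, {"cable", "stand"}]
found = frequent_within_budget(baskets, price, min_sup=2, max_price=1000)
print(sorted(sorted(s) for s in found))
```

The pair {tv, stand} is frequent but costs 1020, so it is discarded at level 2 and never contributes to a larger candidate.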

25
Mining Association Rules with Constraints
  • Post processing:
  • A naive solution: apply Apriori to find all
    frequent sets, and then test them for
    constraint satisfaction one by one.
  • Optimization:
  • Han's approach: a comprehensive analysis of the
    properties of constraints, trying to push them as
    deeply as possible inside the frequent set
    computation.

26
Multilevel AR
  • It is difficult to find interesting patterns at a
    too primitive level:
  • high support: too few rules
  • low support: too many rules, most
    uninteresting
  • Approach: reason at a suitable level of
    abstraction
  • A common form of background knowledge is that an
    attribute may be generalized or specialized
    according to a hierarchy of concepts
  • Dimensions and levels can be efficiently encoded
    in transactions
  • Multilevel Association Rules: rules which
    combine associations with a hierarchy of concepts

27
Hierarchy of concepts
28
Multilevel AR
  • Fresh ⇒ Bakery [20%, 60%]
  • Dairy ⇒ Bread [6%, 50%]
  • Fruit ⇒ Bread [1%, 50%] is not valid

29
Support and Confidence of Multilevel Association
Rules
  • Generalizing/specializing values of attributes
    affects support and confidence
  • from specialized to general: support of rules
    increases (new rules may become valid)
  • from general to specialized: support of rules
    decreases (rules may become invalid when their
    support falls under the threshold)
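
The effect of generalization on support can be seen by replacing each item with its parent concept and recounting. In this sketch the two-level hierarchy and the transactions are hypothetical, chosen only to echo the fresh/bakery examples.

```python
# Hypothetical two-level hierarchy: item -> parent concept.
parent = {"fruit": "fresh", "dairy": "fresh", "bread": "bakery", "cake": "bakery"}

transactions = [
    {"fruit", "bread"},
    {"dairy", "bread"},
    {"dairy", "cake"},
    {"fruit"},
    {"bread"},
]

def generalize(t):
    """Replace every item by its parent concept (items without a parent stay)."""
    return {parent.get(i, i) for i in t}

def sup(itemset, db):
    """Number of transactions in db containing the itemset."""
    return sum(1 for t in db if itemset <= t)

generalized = [generalize(t) for t in transactions]

low = sup({"fruit", "bread"}, transactions)    # specialized level
high = sup({"fresh", "bakery"}, generalized)   # general level
print(low, high)
```

Every transaction that supports the specialized itemset also supports its generalization, so support can only grow when moving up the hierarchy.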

30
Mining Multilevel AR
Hierarchical attributes: age, salary
Association Rule: (age, young) ⇒ (salary, 40k)
Candidate Association Rules:
(age, 18) ⇒ (salary, 40k),
(age, young) ⇒ (salary, low),
(age, 18) ⇒ (salary, low)
31
Mining Multilevel AR
  • Calculate frequent itemsets at each concept
    level,
  • until no more frequent itemsets can be found
  • For each level use Apriori
  • A top-down, progressive deepening approach:
  • First find high-level strong rules:
  • fresh ⇒ bakery [20%, 60%].
  • Then find their lower-level weaker rules:
  • fruit ⇒ bread [6%, 50%].
  • Variations in mining multiple-level association
    rules:
  • Level-crossed association rules:
  • fruit ⇒ wheat bread

32
Multi-level Association: Uniform Support vs.
Reduced Support
  • Uniform Support: the same minimum support for
    all levels
  • One minimum support threshold. No need to
    examine itemsets containing any item whose
    ancestors do not have minimum support.
  • If the support threshold is
  • too high ⇒ miss low level associations.
  • too low ⇒ generate too many high level
    associations.
  • Reduced Support: reduced minimum support at
    lower levels; different strategies are possible.

33
Beyond Support and Confidence
  • Example 1 (Aggarwal & Yu)
  • tea ⇒ coffee has high support (20%) and
    confidence (80%)
  • However, the a priori probability that a customer
    buys coffee is 90%
  • A customer who is known to buy tea is less likely
    to buy coffee (by 10%)
  • There is a negative correlation between buying
    tea and buying coffee
  • ¬tea ⇒ coffee has higher confidence (93%)

34
Correlation and Interest
  • Two events are independent
  • if P(A ∧ B) = P(A)·P(B); otherwise they are
    correlated.
  • Interest = P(A ∧ B)/(P(B)·P(A))
  • Interest expresses a measure of correlation. If it is
  • equal to 1 ⇒ A and B are independent events,
  • less than 1 ⇒ A and B are negatively correlated,
  • greater than 1 ⇒ A and B are positively correlated.
  • In our example,
  • I(drink tea ∧ drink coffee) = 0.89, i.e.
    they are negatively correlated.
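
The interest value for the tea and coffee example follows from the three figures given earlier (support 20%, confidence 80%, coffee probability 90%). The short calculation below reproduces it; the variable names are chosen here for illustration.

```python
def interest(p_ab, p_a, p_b):
    """Interest of A and B: P(A ∧ B) / (P(A) · P(B))."""
    return p_ab / (p_a * p_b)

p_tea_and_coffee = 0.20   # support of tea => coffee
conf_tea_coffee = 0.80    # confidence of tea => coffee
p_coffee = 0.90           # a priori probability of buying coffee

# confidence = P(tea ∧ coffee) / P(tea), hence P(tea) = 0.20 / 0.80
p_tea = p_tea_and_coffee / conf_tea_coffee

i = interest(p_tea_and_coffee, p_tea, p_coffee)

# Confidence of coffee among customers who do NOT buy tea.
conf_no_tea_coffee = (p_coffee - p_tea_and_coffee) / (1 - p_tea)

print(round(p_tea, 2), round(i, 2), round(conf_no_tea_coffee, 2))
```

An interest below 1 confirms the negative correlation even though the rule itself has high support and confidence.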

35
Domain dependent measures
  • Together with support, confidence, interest, ...,
    also use (in post-processing) domain-dependent
    measures
  • e.g., use rule constraints on rules
  • Example: take only rules which are significant
    with respect to their economic value
  • sup(LHS) + sup(RHS) > 100

36
A brief history of AR mining research
  • Apriori (Agrawal et al., SIGMOD '93)
  • Optimizations of Apriori
  • Fast algorithm (Agrawal et al.)
  • Representative Rules (Kryszkiewicz, Agrawal)
  • Direct Itemset Counting (Brin et al.)
  • Problem extensions
  • Generalized AR (Srikant et al.; Han et al.)
  • Quantitative AR (Srikant et al.)
  • N-dimensional AR (Lu et al.)
  • Temporal AR (Ozden et al.)
  • Parallel mining (Agrawal et al.)
  • Distributed mining (Cheung et al.)

37
Questions?
Thank You