Title: Business Intelligence Technologies
1Business Intelligence Technologies Data Mining
- Market Basket Analysis, Association Rules
- Dr. Oualid (Walid) Ben Ali
2Agenda
- Market basket analysis and association rules
- Software demo
- Exercise
4Barbie → Candy
- Put them closer together in the store.
- Put them far apart in the store.
- Package candy bars with the dolls.
- Package Barbie + candy + poorly selling items.
- Raise the price on one, lower it on the other.
- Barbie accessories for proofs of purchase.
- Do not advertise candy and Barbie together.
- Offer candies in the shape of a Barbie Doll.
5Market Basket Analysis (MBA)
- MBA in retail setting
  - Find out which items are bought together
  - Cross-selling
  - Optimize shelf layout
  - Product bundling
  - Timing promotions
  - Discount planning (avoid double discounts)
  - Product selection under limited space
  - Targeted advertisement, personalized coupons, item recommendations
- Usage beyond Market Basket
  - Medical (one symptom after another)
  - Financial (customers with a mortgage account also have a savings account)
8What the data contains
Transaction No. | Item 1 | Item 2 | Item 3 | Item 4
100 | Beer | Diaper | Chocolate | Cheese
101 | Milk | Chocolate | Shampoo |
102 | Beer | Wine | Vodka |
103 | Beer | Cheese | Diaper | Chocolate
104 | Ice Cream | Diaper | Beer |

Customer No. | Age | Income | Saving_acct | Children | Mortgage
100 | >50 | High | Yes | Yes | Yes
101 | 35-50 | Mid | No | No | No
102 | <35 | High | Yes | No | Yes
103 | >50 | Mid | Yes | No | Yes
104 | <35 | Low | No | Yes | No
9Rules Discovered from MBA
- Actionable Rules
  - Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of three types of candy bars
- Trivial Rules
  - Customers who purchase large appliances are very likely to purchase maintenance agreements
- Inexplicable Rules
  - When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaner
10Learning Frequent Itemsets and Association Rules from Data
A descriptive approach for discovering relevant and valid associations among items in the data.
If buy Diapers Then buy Beer
- The itemset corresponding to this rule is {Diaper, Beer}
- Itemset: a collection of items.
- Frequent itemset: an itemset that occurs often in the data.
- Oftentimes, finding frequent itemsets is enough.
11Market Basket Analysis
Transaction No. | Item 1 | Item 2 | Item 3 | Item 4
100 | Beer | Diaper | Chocolate | Cheese
101 | Milk | Chocolate | Shampoo |
102 | Beer | Wine | Vodka |
103 | Beer | Cheese | Diaper | Chocolate
104 | Ice Cream | Diaper | Beer |
Examples
- Shoppers who buy Diaper are very likely to buy Beer.
  If buy {Diaper} Then buy {Beer}
- Shoppers who buy Beer and Diaper are likely to buy Cheese and Chocolate.
  If buy {Beer, Diaper} Then buy {Cheese, Chocolate}
12Association Rules
- Rule format
  - If {set of items} → Then {set of items}
  - LHS implies RHS
- Example, with LHS = {Diaper, Baby Food} and RHS = {Beer, Wine}
  If {Diaper, Baby Food} Then {Beer, Wine}
13Evaluation of Association Rules
- What rules should be considered valid?
- An association rule is valid if it satisfies some evaluation measures
- Example, with LHS = {Diaper} and RHS = {Beer}
  If {Diaper} Then {Beer}
14Rule Evaluation
- Milk and Wine co-occur
- But only 2 out of 200K transactions contain these items
Transaction No. | Item 1 | Item 2 | Item 3
100 | Beer | Diaper | Chocolate
101 | Milk | Chocolate | Wine
102 | Beer | Wine | Vodka
103 | Beer | Cheese | Diaper
104 | Ice Cream | Diaper | Beer
...
15Rule Evaluation - Support
- Support
  - The frequency with which the items in LHS and RHS co-occur.
  - E.g., the support of the {Diaper} → {Beer} rule is 3/5: 60% of the transactions contain both items.
Support = (No. of transactions containing items in LHS and RHS) / (Total No. of transactions in the dataset)
Transaction No. | Item 1 | Item 2 | Item 3
100 | Beer | Diaper | Chocolate
101 | Milk | Chocolate | Shampoo
102 | Beer | Wine | Vodka
103 | Beer | Cheese | Diaper
104 | Ice Cream | Diaper | Beer
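
The support calculation on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `support` and the choice to store each transaction as a set are assumptions made here, not part of the original slides.

```python
# The five transactions from the slide, one set of items per basket.
transactions = [
    {"Beer", "Diaper", "Chocolate"},
    {"Milk", "Chocolate", "Shampoo"},
    {"Beer", "Wine", "Vodka"},
    {"Beer", "Cheese", "Diaper"},
    {"Ice Cream", "Diaper", "Beer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)  # subset test
    return hits / len(transactions)

# Support of the rule {Diaper} -> {Beer} is the support of {Diaper, Beer}.
print(support({"Diaper", "Beer"}, transactions))  # 0.6
```

Support is symmetric: {Diaper} → {Beer} and {Beer} → {Diaper} get the same value, because both are computed from the itemset {Diaper, Beer}.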
16Support evaluation is not enough?
- My friend Bill, an 85-year-old man, told me a joke at a party last Friday:
- An old man is celebrating his 103rd birthday.
- "I will hold my 104th birthday party next year. You are all welcome to join me," he announces to his guests proudly.
- "How do you know you will still be alive then?" one of his guests asks.
- "Because very few people die between the ages of 103 and 104," he replies.
- Explain the logic of the old man and provide your comments.
17Rule Evaluation - Confidence
- Is Beer leading to Diaper purchase or Diaper leading to Beer purchase?
  - Among the transactions with Diaper, 100% have Beer.
  - Among the transactions with Beer, 75% have Diaper.
Transaction No. | Item 1 | Item 2 | Item 3
100 | Beer | Diaper | Chocolate
101 | Milk | Chocolate | Shampoo
102 | Beer | Wine | Vodka
103 | Beer | Cheese | Diaper
104 | Ice Cream | Diaper | Beer
Confidence = (No. of transactions containing both LHS and RHS) / (No. of transactions containing LHS)
- Confidence for {Diaper} → {Beer}: 3/3
  - When Diaper is purchased, the likelihood of Beer purchase is 100%
- Confidence for {Beer} → {Diaper}: 3/4
  - When Beer is purchased, the likelihood of Diaper purchase is 75%
- So, {Diaper} → {Beer} is a more important rule according to confidence.
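
As a rough sketch, the two confidence values on this slide can be reproduced as follows. The helper names `count` and `confidence` are illustrative choices, not something the slides define.

```python
# The five transactions from the slide, one set of items per basket.
transactions = [
    {"Beer", "Diaper", "Chocolate"},
    {"Milk", "Chocolate", "Shampoo"},
    {"Beer", "Wine", "Vodka"},
    {"Beer", "Cheese", "Diaper"},
    {"Ice Cream", "Diaper", "Beer"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def confidence(lhs, rhs, transactions):
    """Share of the transactions containing LHS that also contain RHS."""
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = count(lhs, transactions)
    if n_lhs == 0:
        raise ValueError("LHS never occurs, so confidence is undefined")
    return count(lhs | rhs, transactions) / n_lhs

print(confidence({"Diaper"}, {"Beer"}, transactions))  # 1.0
print(confidence({"Beer"}, {"Diaper"}, transactions))  # 0.75
```

Unlike support, confidence depends on the direction of the rule, because the denominator counts only the transactions that contain the LHS.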
18Rule Evaluation - Lift
Transaction No. | Item 1 | Item 2 | Item 3 | Item 4
100 | Beer | Diaper | Chocolate |
101 | Milk | Chocolate | Shampoo |
102 | Beer | Milk | Vodka | Chocolate
103 | Beer | Milk | Diaper | Chocolate
104 | Milk | Diaper | Beer |
What are the support and confidence for the rule {Chocolate} → {Milk}?
Support = 3/5
Confidence = 3/4
Very high support and confidence. Does Chocolate really lead to Milk purchase?
No! Milk occurs in 4 out of 5 transactions. Chocolate is even decreasing the chance of Milk purchase (3/4 < 4/5).
Lift = (3/4)/(4/5) = 0.9375 < 1
19Rule Evaluation - Lift (cont.)
- Measures how much more likely the RHS is given the LHS than merely the RHS
- Lift = confidence of the rule / frequency of the RHS
- Example: {Diaper} → {Beer}
  - Total number of customers in database: 1000
  - No. of customers buying Diaper: 200
  - No. of customers buying Beer: 50
  - No. of customers buying Diaper and Beer: 20
  - Frequency of Beer = 50/1000 (5%)
  - Confidence = 20/200 (10%)
  - Lift = 10%/5% = 2
- Lift higher than 1 implies people have a higher chance to buy Beer when they buy Diaper. Lift lower than 1 implies people have a lower chance to buy Milk when they buy Chocolate.
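
The lift arithmetic above can be checked with a small helper working directly on the counts. The function name `lift` and its argument names are illustrative assumptions. The ratio is computed as (n_both × n_total) / (n_lhs × n_rhs), which is algebraically the same as confidence divided by the frequency of the RHS but avoids intermediate rounding.

```python
def lift(n_total, n_lhs, n_rhs, n_both):
    """Lift of LHS -> RHS from raw counts.

    confidence = n_both / n_lhs
    frequency of RHS = n_rhs / n_total
    lift = confidence / frequency of RHS
         = (n_both * n_total) / (n_lhs * n_rhs)
    """
    return (n_both * n_total) / (n_lhs * n_rhs)

# {Diaper} -> {Beer}: 1000 customers, 200 buy Diaper, 50 buy Beer, 20 buy both.
print(lift(n_total=1000, n_lhs=200, n_rhs=50, n_both=20))  # 2.0

# {Chocolate} -> {Milk} from the previous slide: 5 transactions,
# 4 with Chocolate, 4 with Milk, 3 with both.
print(lift(n_total=5, n_lhs=4, n_rhs=4, n_both=3))  # 0.9375
```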
20Rule Evaluation - Practical Impact
- Most methods for extracting association rules find too many trivial rules. Most are either obvious or uninteresting.
  - Example: If Maternity Ward then patient is a woman. Confidence 100%, support 100%
- Need to screen for rules that are of particular interest and significance.
  - Actionable: keep only rules that can be acted upon.
  - Interestingness: various measures for how surprising or unexpected a rule is.
    - Example: a rule is interesting if it contradicts what is currently known (e.g., it contradicts a rule that was previously discovered).
21Algorithm to Extract Association Rules (1)
- Given a set of transactions T, the goal of association rule mining is to find all rules having
  - support ≥ minsup threshold
  - confidence ≥ minconf threshold
- Brute-force approach
  - List all possible association rules
  - Compute the support and confidence for each rule
  - Prune rules that fail the minsup and minconf thresholds
  - ⇒ Computationally prohibitive!
22Frequent Itemset Generation
- Brute-force approach
  - Each itemset in the lattice is a candidate frequent itemset
  - Count the support of each candidate by scanning the database
  - Match each transaction against every candidate
  - Complexity ~ O(NMw) ⇒ expensive, since M = 2^d!
    (N = number of transactions, M = number of candidate itemsets, w = maximum transaction width, d = number of distinct items)
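
To get a feel for why M = 2^d makes the brute-force approach impractical, the following sketch enumerates every non-empty candidate itemset for a handful of items and then prints how the count grows with d. The four item names are taken from earlier slides purely for illustration, and `all_candidate_itemsets` is a hypothetical helper name.

```python
from itertools import combinations

def all_candidate_itemsets(items):
    """Yield every non-empty subset of `items` (the full itemset lattice)."""
    items = sorted(items)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

items = ["Beer", "Diaper", "Milk", "Chocolate"]
candidates = list(all_candidate_itemsets(items))
print(len(candidates))  # 15, i.e. 2**4 - 1 non-empty itemsets

# The number of candidates doubles with every additional item.
for d in (10, 20, 40):
    print(d, 2**d - 1)
```

Each of these candidates would have to be matched against every transaction, which is where the O(NMw) cost comes from.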
23Mining Association Rules
Example of rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
- Observations
  - All the above rules are binary partitions of the same itemset {Milk, Diaper, Beer}
  - Rules originating from the same itemset have identical support but can have different confidence
  - Thus, we may decouple the support and confidence requirements
24Mining Association Rules
- Two-step approach
  - 1. Frequent Itemset Generation
    - Generate all itemsets whose support ≥ minsup
  - 2. Rule Generation
    - Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
- Frequent itemset generation is still computationally expensive
25Algorithm to Extract Association Rules (2)
- The standard algorithm: Apriori
  - Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499
- The Association Rules problem was defined as:
  - Generate all association rules that have
    - support greater than the user-specified minimum support
    - and confidence greater than the user-specified minimum confidence
- The base algorithm uses support and confidence, but we can also use lift to rank the rules discovered by Apriori.
- The algorithm performs an efficient search over the data to find all such rules.
26Finding Association Rules from Data
- The association rule discovery problem is decomposed into two sub-problems
  - Find all sets of items (itemsets) whose support is above minimum support, called frequent itemsets or large itemsets
  - From each frequent itemset, generate rules whose confidence is above minimum confidence.
    - Given a large itemset Y, and X a subset of Y
    - Calculate the confidence of the rule X → (Y - X)
    - If its confidence is above the minimum confidence, then X → (Y - X) is an association rule we are looking for.
27Example
Transaction No. | Item 1 | Item 2 | Item 3
100 | Beer | Diaper | Chocolate
101 | Milk | Chocolate | Shampoo
102 | Beer | Wine | Vodka
103 | Beer | Cheese | Diaper
104 | Ice Cream | Diaper | Beer
- A data set with 5 transactions
- Minimum support 40%, minimum confidence 80%
- Phase 1: Find all frequent itemsets
  - {Beer} (support = 80%)
  - {Diaper} (60%)
  - {Chocolate} (40%)
  - {Beer, Diaper} (60%)
- Phase 2: Generate rules from the frequent itemsets
  - {Beer} → {Diaper} (conf. 60%/80% = 75%, below the 80% minimum)
  - {Diaper} → {Beer} (conf. 60%/60% = 100%, meets the 80% minimum)
28Phase 1: Finding all frequent itemsets. How to perform an efficient search of all frequent itemsets?
- Note: frequent itemsets of size n contain itemsets of size n-1 that also must be frequent
  - Example: if {diaper, beer} is frequent then {diaper} and {beer} are each frequent as well
- This means that
  - If an itemset is not frequent (e.g., {wine}) then no itemset that includes wine can be frequent either, such as {wine, beer}.
  - We therefore first find all itemsets of size 1 that are frequent.
  - Then try to expand these by counting the frequency of all itemsets of size 2 that include frequent itemsets of size 1.
- Example
  - If {wine} is not frequent we need not try to find out whether {wine, beer} is frequent. But if both {wine} and {beer} were frequent then it is possible (though not guaranteed) that {wine, beer} is also frequent.
- Then take only itemsets of size 2 that are frequent, and try to expand those, etc.
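
The level-by-level search described on this slide can be sketched as below. It is a minimal, unoptimized illustration of the idea rather than the published Apriori implementation: the function name `apriori_frequent_itemsets` and the representation of transactions as Python sets are assumptions, and support is compared with "at least the minimum" so that {Chocolate} at exactly 40% counts as frequent, as in the earlier example slide. The data are the five transactions from that example.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    n = len(transactions)

    # Size 1: count every single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c / n for s, c in counts.items() if c / n >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        prev = list(current)
        # Expand: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Skip any candidate that has an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Count the surviving candidates against the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: m / n for c, m in counts.items() if m / n >= min_support}
        frequent.update(current)
        k += 1
    return frequent

transactions = [
    {"Beer", "Diaper", "Chocolate"},
    {"Milk", "Chocolate", "Shampoo"},
    {"Beer", "Wine", "Vodka"},
    {"Beer", "Cheese", "Diaper"},
    {"Ice Cream", "Diaper", "Beer"},
]
frequent = apriori_frequent_itemsets(transactions, min_support=0.4)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With a 40% minimum support this finds {Beer}, {Chocolate}, {Diaper} and {Beer, Diaper}. Itemsets such as {Wine, Beer} are never counted, because {Wine} was already dropped at size 1.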
29Phase 2: Generating Association Rules
- Assume {Milk, Bread, Butter} is a frequent itemset.
- Using items contained in the itemset, list all possible rules
  - {Milk} → {Bread, Butter}
  - {Bread} → {Milk, Butter}
  - {Butter} → {Milk, Bread}
  - {Milk, Bread} → {Butter}
  - {Milk, Butter} → {Bread}
  - {Bread, Butter} → {Milk}
- Calculate the confidence of each rule
- Pick the rules with confidence above the minimum confidence
Confidence of {Milk} → {Bread, Butter}
= Support{Milk, Bread, Butter} / Support{Milk}
= (No. of transactions that support {Milk, Bread, Butter}) / (No. of transactions that support {Milk})
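
A minimal sketch of this rule-generation phase follows. The slide gives no transaction data for {Milk, Bread, Butter}, so the five baskets below are made up purely for illustration, as are the helper names `count` and `generate_rules` and the 0.6 confidence threshold.

```python
from itertools import combinations

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def generate_rules(itemset, transactions, min_confidence):
    """List (lhs, rhs, confidence) for every binary partition of `itemset`
    whose confidence is at least `min_confidence`.
    Assumes `itemset` occurs in at least one transaction."""
    items = sorted(itemset)
    n_all = count(items, transactions)
    rules = []
    for k in range(1, len(items)):            # size of the LHS
        for lhs in combinations(items, k):
            rhs = tuple(i for i in items if i not in lhs)
            conf = n_all / count(lhs, transactions)
            if conf >= min_confidence:
                rules.append((lhs, rhs, conf))
    return rules

# Hypothetical baskets, invented for this example only.
baskets = [
    {"Milk", "Bread", "Butter"},
    {"Milk", "Bread", "Butter"},
    {"Milk", "Bread"},
    {"Milk"},
    {"Bread", "Butter"},
]
rules = generate_rules({"Milk", "Bread", "Butter"}, baskets, min_confidence=0.6)
for lhs, rhs, conf in rules:
    print(lhs, "->", rhs, round(conf, 2))
```

All six candidate rules share the same numerator (the count of the full itemset); only the denominator, the count of the LHS, changes from rule to rule.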
30Association
- If the rule {Bread, Butter} → {Yogurt} is found to have minimum confidence,
- does it mean the rule {Yogurt} → {Bread, Butter} also has minimum confidence?
- No.
- Example
  - Support of {Yogurt} is 20%,
  - {Bread, Butter} is 50%
  - {Yogurt, Bread, Butter} is 10%
  - Confidence of {Bread, Butter} → {Yogurt} is 10%/50% = 20%
  - Confidence of {Yogurt} → {Bread, Butter} is 10%/20% = 50%
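31Agrawal (94)'s Apriori Algorithm: An Example
Transactions
T-ID | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E

1st scan → C1
Itemset | sup
{A} | 2
{B} | 3
{C} | 3
{D} | 1
{E} | 3

L1
Itemset | sup
{A} | 2
{B} | 3
{C} | 3
{E} | 3

C2 (candidates generated from L1)
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

2nd scan → C2 with counts
Itemset | sup
{A, B} | 1
{A, C} | 2
{A, E} | 1
{B, C} | 2
{B, E} | 3
{C, E} | 2

L2
Itemset | sup
{A, C} | 2
{B, C} | 2
{B, E} | 3
{C, E} | 2

C3
Itemset
{B, C, E}

3rd scan → L3
Itemset | sup
{B, C, E} | 2

What about {A, B, C}? It is not a candidate, because its subset {A, B} is not in L2.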
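
The step from L2 to C3 in this example, including why {A, B, C} never becomes a candidate, can be sketched as a join followed by a prune. The helper names `join` and `prune` are illustrative. The minimum support count of 2 is inferred from the tables (items with a count of 1 are dropped, items with a count of 2 are kept) and is not stated in the transcript.

```python
from itertools import combinations

transactions = {
    10: {"A", "C", "D"},
    20: {"B", "C", "E"},
    30: {"A", "B", "C", "E"},
    40: {"B", "E"},
}
MIN_COUNT = 2  # assumption inferred from the example tables

# L2 as shown in the example.
L2 = {frozenset(p) for p in [("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")]}

def join(frequent_prev, k):
    """Combine frequent (k-1)-itemsets into candidate k-itemsets."""
    return {a | b for a in frequent_prev for b in frequent_prev if len(a | b) == k}

def prune(candidates, frequent_prev, k):
    """Keep only candidates whose (k-1)-subsets are all frequent."""
    return {
        c for c in candidates
        if all(frozenset(s) in frequent_prev for s in combinations(c, k - 1))
    }

joined = join(L2, 3)
C3 = prune(joined, L2, 3)

# 3rd scan: count the remaining candidates in the data.
L3 = {}
for c in C3:
    n = sum(1 for items in transactions.values() if c <= items)
    if n >= MIN_COUNT:
        L3[c] = n

print(sorted("".join(sorted(c)) for c in joined))  # ['ABC', 'ACE', 'BCE']
print(sorted("".join(sorted(c)) for c in C3))      # ['BCE']
print({"".join(sorted(c)): n for c, n in L3.items()})  # {'BCE': 2}
```

{A, B, C} and {A, C, E} come out of the join but are pruned without touching the data, because {A, B} and {A, E} are not in L2.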
32Sequential Patterns
- Instead of finding associations between items in a single transaction, find associations between items across related transactions over time.
Customer ID | Transaction Date | Item 1 | Item 2
AA | 2/2/2001 | Laptop | Case
AA | 1/13/2002 | Wireless network card | Router
BB | 4/5/2002 | Laptop | iPaq
BB | 8/10/2002 | Wireless network card | Router
- Sequence: Laptop, Wireless Card, Router
- A sequence has to satisfy some predetermined minimum support
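
One way to read "support" for a sequence is the fraction of customers whose purchase history contains the sequence in order. The sketch below follows that reading. It is an assumption about the intended definition, as are the helper names and the interpretation of the slide's sequence as "a Laptop purchase followed later by a purchase of a wireless network card and a router".

```python
from datetime import date

# The four transactions from the slide: (customer, date, items).
purchases = [
    ("AA", date(2001, 2, 2), {"Laptop", "Case"}),
    ("AA", date(2002, 1, 13), {"Wireless network card", "Router"}),
    ("BB", date(2002, 4, 5), {"Laptop", "iPaq"}),
    ("BB", date(2002, 8, 10), {"Wireless network card", "Router"}),
]

def histories_by_customer(purchases):
    """Group item sets per customer, ordered by transaction date."""
    histories = {}
    for customer, day, items in sorted(purchases, key=lambda p: (p[0], p[1])):
        histories.setdefault(customer, []).append(items)
    return histories

def contains_sequence(history, pattern):
    """True if the pattern's elements appear, in order, within the history."""
    position = 0
    for basket in history:
        if position < len(pattern) and pattern[position] <= basket:
            position += 1
    return position == len(pattern)

def sequence_support(pattern, purchases):
    """Fraction of customers whose history contains the pattern."""
    histories = histories_by_customer(purchases)
    hits = sum(1 for h in histories.values() if contains_sequence(h, pattern))
    return hits / len(histories)

pattern = [{"Laptop"}, {"Wireless network card", "Router"}]
print(sequence_support(pattern, purchases))  # 1.0
```

Both customers bought a laptop first and the network equipment later, so the pattern is supported by 2 of 2 customers. Reversing the order of the two elements gives a support of 0.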
33Examples of Sequence Data
Sequence Database | Sequence | Element (Transaction) | Event (Item)
Customer | Purchase history of a given customer | A set of items bought by a customer at time t | Books, dairy products, CDs, etc.
Web Data | Browsing activity of a particular Web visitor | A collection of files viewed by a Web visitor after a single mouse click | Home page, index page, contact info, etc.
Event data | History of events generated by a given sensor | Events triggered by a sensor at time t | Types of alarms generated by sensors
Genome sequences | DNA sequence of a particular species | An element of the DNA sequence | Bases A, T, G, C
[Figure: a sequence drawn as an ordered series of elements (transactions), each containing one or more events (items): {E1, E2}, {E1, E3}, {E2}, {E3, E4}, {E2}.]
34Examples of Sequence
- Web sequence:
  <{Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping}>
- Sequence of books checked out at a library:
  <{Fellowship of the Ring} {The Two Towers} {Return of the King}>
35Applications of Association Rules
- Market-Basket Analysis
  - e.g., product assortment optimization (see next slide)
- Recommendations: determine which books are frequently purchased together and recommend associated books or products to people who express interest in an item.
- Healthcare: by studying the side effects in patients with multiple prescriptions, we can discover previously unknown interactions and warn patients about them.
- Fraud detection: finding in insurance data that a certain doctor often works with a certain lawyer may indicate potential fraudulent activity (virtual items).
- Sequence discovery: looks for associations between items bought over time. E.g., we may notice that people who buy chili tend to buy antacid within a month. Knowledge like this can be used to plan inventory levels.
36Product Assortment Optimization
Graphs of expected sales (e.g., derived from association rules) and costs (e.g., of purchasing and holding inventory) can allow us to optimize the number and selection (choice) of items in a product category.
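[Figure: two plots with Dollars on the vertical axis and Products in Category on the horizontal axis. Labels in the first plot: Revenues, Costs, Margin. Labels in the second plot: Max Profit, Margin = Revenues - Costs.]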
37Agenda
- Market basket analysis and association rules
- Software demo
- Exercise
39Exercise
Transaction No. | Item 1 | Item 2 | Item 3 | Item 4
100 | Beer | Diaper | Chocolate |
101 | Milk | Chocolate | Shampoo |
102 | Beer | Soap | Vodka |
103 | Beer | Cheese | Wine |
104 | Milk | Diaper | Beer | Chocolate
- Given the above list of transactions, do the following:
  - 1) Find all the frequent itemsets (minimum support 40%)
  - 2) Find all the association rules (minimum confidence 70%)
  - 3) For the discovered association rules, calculate the lift