Title: Fast Algorithms for Association Rule Mining
1. Fast Algorithms for Association Rule Mining
Paper by R. Agrawal and R. Srikant
- Presented by
- Muhammad Aurangzeb Ahmad
- Nupur Bhatnagar
2. Outline
- Background and Motivation
- Problem Definition
- Major Contribution
- Key Concepts
- Validation
- Assumptions
- Possible Revisions
3. Background and Motivation
- Basket Data
- A collection of records, each consisting of a transaction identifier and the items bought in that transaction.
- Goal: mine associations among items in a large database of sales transactions, to predict the occurrence of an item based on the occurrences of other items in the same transaction.
- For example: {Diaper} → {Beer}
4. Terms and Notations
- Items: I = {i1, i2, …, im}, the set of all items.
- Transaction T: a set of items such that T ⊆ I.
- Items within a transaction are sorted lexicographically.
- TID: unique identifier for each transaction.
- Association Rule: X → Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅.
5. Terms and Notations
- Confidence
- A rule X → Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y.
- Support
- A rule X → Y has support s if s% of the transactions in D contain X ∪ Y.
- Large Itemset
- Itemsets whose support is at least the minimum support are called large itemsets; otherwise they are called small itemsets.
- Candidate Itemsets
- Itemsets generated from a seed of itemsets that were found to be large in the previous pass; candidates whose support meets the minsup threshold become the large itemsets of the current pass.
6. Problem Definition
- INPUT
- A set of transactions D.
- Objective: given a set of transactions D, generate all association rules that have support and confidence greater than the user-specified minimum support and minimum confidence. Minimize computation time by pruning.
- Constraint: items should be in lexicographical order.
- Example association rules: {Diaper} → {Beer}; {Milk, Bread} → {Eggs, Coke}; {Beer, Bread} → {Milk}
- Real-world applications: NCR (Teradata) does association rule mining for more than 20 large retail organizations, including Walmart. Also used for pattern discovery in biological databases.
7. Major Contribution
- Proposed two new algorithms for fast association rule mining, Apriori and AprioriTid, along with a hybrid of the two (AprioriHybrid).
- Empirical evaluation of the performance of the proposed algorithms against contemporary algorithms (AIS and SETM).
- Completeness: the algorithms find all rules that satisfy the thresholds.
8. Related Work: SETM and AIS
- The major difference lies in candidate itemset generation:
- In pass k, read a database transaction t.
- Determine which of the large itemsets in Lk-1 are present in t.
- Each of these large itemsets l is then extended with all those large items that are present in t and occur later in the lexicographic ordering than any of the items in l.
- Result: many candidate itemsets are generated that are later discarded.
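To see why this over-generates, here is a small illustrative Python sketch of the AIS/SETM-style extension step (the function name and data layout are made up for this example, not the original code):

def ais_extend(transaction, large_prev, large_items):
    """Extend each (k-1)-itemset from the previous pass that occurs in
    the transaction with every large item of the transaction that is
    lexicographically later than all of the itemset's items."""
    t = sorted(transaction)
    candidates = set()
    for itemset in large_prev:            # itemsets: sorted tuples
        if set(itemset) <= set(t):
            for item in t:
                if item in large_items and item > itemset[-1]:
                    candidates.add(itemset + (item,))
    return candidates

# One transaction spawns many candidates, most of which later turn
# out to be small; Apriori's join + prune avoids this.
print(ais_extend({1, 2, 3, 5}, [(1, 2), (2, 3)], {1, 2, 3, 5}))
# -> {(1, 2, 3), (1, 2, 5), (2, 3, 5)} (printed in arbitrary order)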
9. Key Concepts: Support and Confidence
- Why do we need support and confidence?
- Given a rule X → Y:
- Support determines how often the rule is applicable to a given data set.
- Confidence determines how frequently items in Y appear in transactions that contain X.
- A rule with low support may occur simply by chance, and tends to be uninteresting from a business perspective.
- Confidence measures the reliability of the inference made by the rule.
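As a concrete illustration, here is a minimal Python sketch computing both measures over a toy basket dataset (the data and helper names are made up for this example):

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """support(X ∪ Y) / support(X)."""
    return support(x | y, transactions) / support(x, transactions)

transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "coke"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "coke"},
]
print(support({"diaper", "beer"}, transactions))       # 0.6
print(confidence({"diaper"}, {"beer"}, transactions))  # ~0.75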
10. Key Concepts: The Association Rule Mining Problem
- Problem
- Given a set of transactions D, find all rules having support > minsup and confidence > minconf.
- Decomposition of the problem:
- 1. Frequent Itemset Generation
- Find all itemsets having transaction support above the minimum support. These itemsets are called frequent (large) itemsets.
- 2. Rule Generation
- Use the large itemsets to generate rules. These are high-confidence rules extracted from the frequent itemsets found in the previous step (see the sketch below).
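To make the rule-generation step concrete, here is a minimal Python sketch (a simple baseline, not necessarily the paper's exact procedure; the supports table and names are illustrative): every non-empty proper subset of a frequent itemset is tried as an antecedent, and the rule is kept if its confidence clears minconf.

from itertools import combinations

def gen_rules(itemset, supports, minconf):
    """Emit rules X → (itemset − X) whose confidence meets minconf.
    `supports` maps frozensets to their support fractions."""
    rules = []
    items = frozenset(itemset)
    for r in range(1, len(items)):
        for antecedent in combinations(sorted(items), r):
            x = frozenset(antecedent)
            conf = supports[items] / supports[x]
            if conf >= minconf:
                rules.append((set(x), set(items - x), conf))
    return rules

supports = {
    frozenset({"diaper"}): 0.8,
    frozenset({"beer"}): 0.6,
    frozenset({"diaper", "beer"}): 0.6,
}
for x, y, c in gen_rules({"diaper", "beer"}, supports, 0.7):
    print(x, "→", y, f"confidence={c:.2f}")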
11. Frequent Itemset Generation: Apriori
- Apriori Principle
- Given an itemset I = {a, b, c, d, e}:
- If an itemset is frequent, then all of its subsets must also be frequent; conversely, if an itemset is infrequent, then all of its supersets must be infrequent.
12. Frequent Itemset Generation: Apriori
If {c, d, e} is frequent, then all of its subsets must also be frequent.
13. Frequent Itemset Generation: Apriori
- Apriori Principle: Candidate Pruning
If {a, b} is infrequent, then all of its supersets are infrequent and can be pruned. (A code sketch of this check follows.)
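A minimal sketch of this downward-closure check (illustrative; it is the same test used by the prune step of candidate generation):

from itertools import combinations

def has_infrequent_subset(candidate, large_prev):
    """True if any (k-1)-subset of `candidate` is not in L(k-1),
    in which case the candidate itself cannot be large."""
    k = len(candidate)
    return any(
        frozenset(sub) not in large_prev
        for sub in combinations(candidate, k - 1)
    )

L2 = {frozenset(s) for s in [{1, 3}, {2, 3}, {2, 5}, {3, 5}]}
print(has_infrequent_subset((2, 3, 5), L2))  # False: keep {2,3,5}
print(has_infrequent_subset((1, 3, 5), L2))  # True: {1,5} missing, prune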
14. Key Concepts: Frequent Itemset Generation (the Apriori Algorithm)
- Input
- The market basket transaction dataset and the minimum support threshold.
- Process
- Determine the large 1-itemsets.
- Repeat until no new large itemsets are identified:
- Generate (k+1)-length candidate itemsets from the length-k large itemsets.
- Prune candidate itemsets that contain a small subset.
- Count the support of each remaining candidate itemset.
- Eliminate candidate itemsets that are small.
- Output
- All large itemsets, i.e., those that meet the minimum support threshold. (A compact code sketch of this loop follows.)
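Below is a compact, self-contained Python sketch of this loop (illustrative code, not the paper's pseudocode; in particular, the naive nested-loop support counting stands in for the paper's hash-tree-based counting):

from itertools import combinations
from collections import defaultdict

def apriori_gen(large_prev, k):
    """Candidate generation: join L(k-1) with itself, then prune
    candidates that have a small (k-1)-subset."""
    joined = {a | b for a in large_prev for b in large_prev
              if len(a | b) == k}
    return {c for c in joined
            if all(frozenset(s) in large_prev
                   for s in combinations(c, k - 1))}

def apriori(transactions, minsup_count):
    """Return all large itemsets as frozensets."""
    items = {frozenset([i]) for t in transactions for i in t}
    large = {c for c in items
             if sum(c <= t for t in transactions) >= minsup_count}
    all_large, k = set(large), 2
    while large:
        candidates = apriori_gen(large, k)
        counts = defaultdict(int)
        for t in transactions:            # one pass over the database
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        large = {c for c, n in counts.items() if n >= minsup_count}
        all_large |= large
        k += 1
    return all_large

print(sorted(map(sorted, apriori(
    [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], 2))))
# includes [2, 3, 5] as the only large 3-itemset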
15. Apriori Example (minimum support = 2 transactions)
[Figure: worked example showing the candidate 1-itemsets, pruning, candidate 2-itemsets, pruning, and the resulting 3-itemset]
16. Apriori Candidate Generation
- Given the large k-itemsets, generate the candidate (k+1)-itemsets in two steps:
- JOIN STEP: join Lk with itself, with the join condition that the first k-1 items must be the same.
- PRUNE STEP: delete all candidates having a non-frequent subset.
- Example: from candidates {1,3,5} and {2,3,5}, the prune step deletes {1,3,5} if one of its 2-item subsets (e.g., {1,5}) is not large, leaving {2,3,5}. (A code sketch follows.)
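Here is a minimal Python sketch of the join and prune steps, assuming each itemset is kept as a lexicographically sorted tuple (consistent with the paper's sorted-items convention; the code itself is illustrative):

from itertools import combinations

def apriori_gen(large_k):
    """Generate candidate (k+1)-itemsets from the large k-itemsets."""
    large_k = sorted(large_k)
    large_set = set(large_k)
    candidates = []
    # JOIN: p and q agree on the first k-1 items, p[-1] < q[-1]
    for i, p in enumerate(large_k):
        for q in large_k[i + 1:]:
            if p[:-1] != q[:-1]:
                break
            candidates.append(p + (q[-1],))
    # PRUNE: drop candidates with a k-subset that is not large
    return [c for c in candidates
            if all(s in large_set for s in combinations(c, len(c) - 1))]

L2 = [(1, 3), (2, 3), (2, 5), (3, 5)]
print(apriori_gen(L2))  # [(2, 3, 5)]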
17. AprioriTid
- Uses the same candidate generation function as Apriori.
- Does not use the database for counting support after the first pass.
- Instead, it uses an encoding Ck of the candidate itemsets contained in each transaction, built in the previous pass.
- This saves database reading effort. (A sketch of the encoding follows.)
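The following simplified Python sketch shows one AprioriTid pass: support is counted against an in-memory encoding (here, a list of (TID, set of candidates present)) instead of the database. The pair-of-generators membership test mirrors the paper's construction, but the data layout and names here are illustrative:

from collections import defaultdict

def aprioritid_pass(ck_bar_prev, candidates_k, minsup_count):
    """`ck_bar_prev`: list of (tid, set of (k-1)-itemset tuples)."""
    counts = defaultdict(int)
    ck_bar = []
    for tid, prev_sets in ck_bar_prev:
        present = set()
        for c in candidates_k:
            # c is in the transaction iff both (k-1)-itemsets that
            # generated it (c minus its last item, c minus its
            # second-to-last item) were present in the previous pass.
            if c[:-1] in prev_sets and c[:-2] + c[-1:] in prev_sets:
                present.add(c)
                counts[c] += 1
        if present:                       # transactions containing no
            ck_bar.append((tid, present)) # candidates drop out of Ck
    large_k = {c for c in candidates_k if counts[c] >= minsup_count}
    return ck_bar, large_k

# Pass 2 on a toy database: C1 holds the large 1-itemsets per TID.
c1_bar = [(100, {(1,), (3,)}), (200, {(2,), (3,), (5,)}),
          (300, {(1,), (2,), (3,), (5,)}), (400, {(2,), (5,)})]
c2 = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]
c2_bar, l2 = aprioritid_pass(c1_bar, c2, 2)
print(sorted(l2))   # [(1, 3), (2, 3), (2, 5), (3, 5)]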
18. AprioriTid Example (support count = 2)
[Figure: worked example tracing C1, L1, C2, L2, C3, and L3, including the per-transaction candidate encodings]
19. AprioriTid Analysis
- Advantages
- If a transaction does not contain any k-itemset candidates, then Ck will not have an entry for this transaction.
- For large k, each entry may be smaller than the corresponding transaction, because very few candidates may be present in the transaction.
- Disadvantages
- For small k, each entry may be larger than the corresponding transaction, since an entry includes all k-itemset candidates contained in the transaction.
20. AprioriHybrid
- Uses Apriori in the initial passes and switches to AprioriTid when it expects that the candidate encoding Ck at the end of the pass will fit in memory. (A sketch of the switch heuristic follows.)
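A rough Python sketch of the switch test, paraphrasing the paper's heuristic (the size estimate counts one entry per candidate occurrence plus per-transaction overhead; the parameter names are made up):

def should_switch(candidate_support_counts, num_transactions,
                  memory_budget_entries, lk_shrinking):
    """Estimate the size of the Ck encoding that AprioriTid would
    build this pass: every occurrence of a candidate in a transaction
    becomes an entry, plus per-transaction overhead. Switch only if
    the estimate fits in memory and the large itemsets have started
    shrinking (to avoid switching too early)."""
    est = sum(candidate_support_counts.values()) + num_transactions
    return est < memory_budget_entries and lk_shrinking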
21. Validation: Computer Experiments
- Parameters for synthetic data generation:
- |D|: number of transactions
- |T|: average size of the transactions
- |I|: average size of the maximal potentially large itemsets
- |L|: number of maximal potentially large itemsets
- N: number of items
- Parameter settings: 6 synthetic data sets. (A simplified generator sketch follows.)
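For intuition only, here is a greatly simplified Python sketch of such a generator. The paper's generator draws sizes from Poisson distributions and models correlation and corruption between itemsets; this sketch keeps only the basic shape and uses exponential stand-ins to avoid extra dependencies:

import random

def generate(num_txns, avg_txn_size, avg_pattern_size,
             num_patterns, num_items, seed=42):
    rng = random.Random(seed)
    # Maximal potentially large itemsets of roughly avg_pattern_size.
    patterns = []
    for _ in range(num_patterns):
        size = min(num_items,
                   max(1, round(rng.expovariate(1 / avg_pattern_size))))
        patterns.append(rng.sample(range(num_items), size))
    txns = []
    for _ in range(num_txns):
        target = max(1, round(rng.expovariate(1 / avg_txn_size)))
        t = set()
        for _ in range(4 * target):         # cap to avoid rare stalls
            if len(t) >= target:
                break
            t |= set(rng.choice(patterns))  # fill with whole patterns
        txns.append(t)
    return txns

data = generate(1000, 10, 4, 100, 500)      # roughly a T10.I4.D1K set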
22. Results: Execution Time
Apriori is always better than AIS and SETM. SETM's execution times were too large to plot for the bigger datasets.
Apriori is better than AprioriTid on the larger problem sizes.
23. Results: Analysis
- AprioriTid uses Ck instead of the database. If Ck fits in memory, AprioriTid is faster than Apriori.
- When Ck is too big to fit in memory, the computation time is much longer; thus Apriori is faster than AprioriTid.
24. Results: Execution Time of AprioriHybrid
AprioriHybrid performs better than Apriori in almost all cases.
25. Scale-Up Experiments
AprioriHybrid scales up as the number of transactions is increased from 100,000 to 10 million (minimum support 0.75%).
AprioriHybrid also scales up as the average transaction size is increased. This experiment was done to see the effect on the data structures independent of the physical database size and the number of large itemsets.
26. Results
- The Apriori algorithms are better than SETM and AIS.
- The algorithms perform best when combined (AprioriHybrid).
- The algorithms show good results in the scale-up experiments.
27. Validation Methodology: Weaknesses and Strengths
- Strength
- The authors use substantial basket data to guide the design of fast algorithms for association rule mining.
- Weakness
- Only synthetic data sets are used for validation. The data might be too synthetic to give reliable information about real-world datasets.
28. Assumptions
- A synthetic dataset is used; it is assumed that the performance of the algorithms on synthetic data is indicative of their performance on real-world datasets.
- All the items in the data are in lexicographical order.
- All data is assumed to be categorical.
- It is assumed that all the data is present at the same site or in the same table, so there are no cases in which joins would be required.
29. Possible Revisions
- Some real-world datasets should be used to perform the experiments.
- The number of large itemsets can grow exponentially with large databases. A modification of the representation structure is required that captures just a subset of the candidate large itemsets.
- Limitations of the support-confidence framework:
- Support: potentially interesting patterns involving low-support items might be eliminated.
- Confidence: confidence ignores the support of the itemset in the rule consequent.
- Improvement: an interestingness measure (lift) that computes the ratio between the rule's confidence and the support of the itemset in the rule consequent: lift(A → B) = s(A ∪ B) / (s(A) · s(B)). (A code sketch follows.)
- Effect of skewed support distributions.
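A minimal Python sketch of this interest (lift) measure (illustrative data; the support helper repeats the definition from the earlier slide):

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def lift(x, y, transactions):
    """s(X ∪ Y) / (s(X) * s(Y)); 1.0 means X and Y are independent,
    above 1.0 positive correlation, below 1.0 negative correlation."""
    return support(x | y, transactions) / (
        support(x, transactions) * support(y, transactions))

transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "coke"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "coke"},
]
print(lift({"diaper"}, {"beer"}, transactions))  # ~1.25: positive correlation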
30. Questions?