Title: Fast Algorithms For Mining Association Rules
1. Fast Algorithms For Mining Association Rules
- By Rakesh Agrawal and R. Srikant
- Presented By Chirayu Modi
2. What is Data Mining?
- Data mining is a set of techniques used in an automated approach to exhaustively explore and bring to the surface complex relationships in very large datasets.
- It is the process of discovering interesting knowledge...from large amounts of data stored in databases, data warehouses, and other information repositories.
- Some methods include:
- Classification and Clustering
- Association
- Sequencing
3. Association Rules
- Association rules are (local) patterns which model dependencies between attributes.
- Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Applications:
- Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
- Rule form: Body ⇒ Head [support, confidence]
- Examples:
- buys(x, beers) ⇒ buys(x, chips) [0.5%, 60%]
- major(x, CS) and takes(x, DB) ⇒ grade(x, A) [1%, 75%]
4. Association Rules: Basic Concepts
- Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
- Find: all rules that correlate the presence of one set of items with that of another set of items
- E.g., 98% of people who purchase tires and auto accessories also get automotive services done
- Applications:
- ? ⇒ Maintenance Agreement (what should the store do to boost Maintenance Agreement sales?)
- Home Electronics ⇒ ? (what other products should the store stock up on?)
- Attached mailing in direct marketing
5. Mining Association Rules: An Example
Min. support: 50%. Min. confidence: 50%.
For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6% (see the sketch below)
The Apriori principle: any subset of a frequent itemset must be frequent.
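To make the two measures concrete, here is a minimal Python sketch. The slide's transaction table did not survive extraction, so the four baskets below are an assumed reconstruction (the usual four-transaction example) chosen to reproduce the stated 50% support and 66.6% confidence.

```python
# Minimal sketch of support and confidence, assuming transactions are sets.
# The four baskets are an assumed reconstruction of the slide's lost table.
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset, db):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)

sup = support({"A", "C"}, transactions)        # support({A, C}) = 0.50
conf = sup / support({"A"}, transactions)      # 0.50 / 0.75 = 0.666...
print(f"support = {sup:.0%}, confidence = {conf:.1%}")
```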
6. Mining Frequent Itemsets: the Key Step
- Find the frequent itemsets: the sets of items that have minimum support
- A subset of a frequent itemset must also be a frequent itemset
- i.e., if AB is a frequent itemset, both A and B must be frequent itemsets
- Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate association rules.
7. Mining Algorithm
- IDEA: it relies on the a priori, or downward closure, property: if an itemset has minimum support (is a frequent itemset), then every subset of that itemset also has minimum support.
- In the first pass, the support for each item is counted and the large itemsets are obtained.
- In the second pass, the large itemsets obtained from the first pass are extended to generate new itemsets, called candidate itemsets.
- The support of the candidate itemsets is measured and large itemsets are obtained.
- This is repeated until no new large itemsets can be formed.
8. AIS Algorithm
- Two key concepts are:
- Extension of an itemset.
- Determining what should be in the candidate itemset.
- In the case of AIS, candidate sets are generated on the fly: new candidate itemsets are generated by extending the large itemsets of the previous pass with other items in the transaction.
9. SETM
- The implementation is based on expressing the algorithm in the form of SQL queries.
- The two steps are:
- Generating the candidate itemsets using join operations.
- Generating the support counts and determining the large itemsets.
- As with AIS, the candidate sets are generated on the fly: new candidate itemsets are generated by extending the large itemsets of the previous pass with other items in the transaction.
10. Drawbacks of AIS and SETM
- They were very slow.
- They generated a large number of itemsets with support/confidence lower than the user-specified minimum support/confidence.
- They make a number of passes over the database.
- Not all aspects of data mining can be represented using the SETM algorithm.
11. Apriori Algorithm
- Candidate itemsets are generated from the large itemsets of the previous pass without considering the database.
- The large itemsets of the previous pass are extended to get the new candidate itemsets.
- Pruning is done using the fact that any subset of a frequent itemset must be frequent.
- Step 1: discover all frequent itemsets that have support above the required minimum support.
- Step 2: use the set of frequent itemsets to generate the association rules that have high enough confidence (a sketch of this step follows below).
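As a companion to Step 2, here is a hedged Python sketch of rule generation from the frequent itemsets; the function name and input format are mine, not the paper's. The paper's ap-genrules procedure additionally prunes consequents, whereas the brute-force loop below simply checks every split.

```python
from itertools import combinations

def generate_rules(freq, min_conf):
    """For each frequent itemset X, emit body -> (X - body) rules whose
    confidence support(X) / support(body) meets min_conf."""
    rules = []
    for itemset, sup in freq.items():
        for r in range(1, len(itemset)):
            for body in map(frozenset, combinations(itemset, r)):
                conf = sup / freq[body]   # body is frequent by monotonicity
                if conf >= min_conf:
                    rules.append((body, itemset - body, sup, conf))
    return rules

# Toy input: supports assumed to come from the frequent-itemset pass.
freq = {frozenset("A"): 0.75, frozenset("C"): 0.50, frozenset("AC"): 0.50}
for body, head, sup, conf in generate_rules(freq, min_conf=0.5):
    print(set(body), "->", set(head), f"[{sup:.0%}, {conf:.0%}]")
```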
12. Apriori Candidate Generation
- Monotonicity property: all subsets of a frequent set are frequent.
- Given Lk-1, Ck can be generated in two steps:
- Join: join Lk-1 with Lk-1, with the join condition that the first k-2 items must be the same.
- Prune: delete all candidates that have a (k-1)-subset that is not frequent; support-based filtering happens later, when the candidates are counted (see the sketch below).
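In Python, the two steps might look like this sketch, where each (k-1)-itemset is kept as a lexicographically sorted tuple (the function and variable names are mine, not the paper's):

```python
from itertools import combinations

def apriori_gen(L_prev):
    """Given frequent (k-1)-itemsets as sorted tuples, return candidate
    k-itemsets: join on the first k-2 items, then prune by monotonicity."""
    prev = set(L_prev)
    k = len(next(iter(prev))) + 1
    candidates = set()
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:      # join condition
                c = p + (q[-1],)
                # prune: every (k-1)-subset of c must itself be frequent
                if all(s in prev for s in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates

L2 = {("A", "B"), ("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")}
print(sorted(apriori_gen(L2)))   # [('A', 'B', 'C'), ('B', 'C', 'E')]
```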
13. The Apriori Algorithm
- Pseudo-code:
- Ck: candidate itemsets of size k
- Lk: frequent itemsets of size k
- L1 = {frequent items};
- for (k = 1; Lk != ∅; k++) do begin
-   Ck+1 = candidates generated from Lk;
-   for each transaction t in database do
-     increment the count of all candidates in Ck+1 that are contained in t;
-   Lk+1 = candidates in Ck+1 with min_support;
- end
- return ∪k Lk;
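A self-contained, runnable rendering of the whole loop might look like the sketch below, with min_support taken as an absolute count and the join-and-prune candidate generation from the previous slide inlined:

```python
from itertools import combinations
from collections import defaultdict

def apriori(db, min_sup):
    """Level-wise Apriori sketch: itemsets are sorted tuples, min_sup a count."""
    db = [frozenset(t) for t in db]
    counts = defaultdict(int)
    for t in db:
        for item in t:
            counts[(item,)] += 1
    L = {c for c, n in counts.items() if n >= min_sup}   # L1
    answer, k = set(L), 2
    while L:
        # candidate generation from L(k-1): join, then monotonicity prune
        C = {p + (q[-1],) for p in L for q in L
             if p[:-1] == q[:-1] and p[-1] < q[-1]
             and all(s in L for s in combinations(p + (q[-1],), k - 1))}
        counts = defaultdict(int)
        for t in db:                 # one full database scan per pass
            for c in C:
                if t.issuperset(c):
                    counts[c] += 1
        L = {c for c, n in counts.items() if n >= min_sup}
        answer |= L
        k += 1
    return answer                    # union over k of Lk

db = [["A", "C", "D"], ["B", "C", "E"], ["A", "B", "C", "E"], ["B", "E"]]
print(sorted(apriori(db, min_sup=2)))
```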
14. Apriori Candidate Generation (Join Step)
- The apriori-gen function takes as argument F(k-1), the set of all frequent (k-1)-itemsets. It returns a superset of the set of all frequent k-itemsets. The function works as follows. First, in the join step, we join F(k-1) with F(k-1):
- insert into C(k)
  select p.item(1), p.item(2), ..., p.item(k-1), q.item(k-1)
  from F(k-1) as p, F(k-1) as q
  where p.item(1) = q.item(1), ..., p.item(k-2) = q.item(k-2), p.item(k-1) < q.item(k-1)
15. The Prune Step
- We delete all itemsets c in C(k) such that some (k-1)-subset of c is not in F(k-1):
- for each itemset c in C(k) do
-   for each (k-1)-subset s of c do
-     if (s not in F(k-1)) then
-       delete c from C(k)
- Any subset of a frequent itemset must be frequent
- Lexicographic order of items is assumed!
16. The Apriori Algorithm: Example
17. Buffer Management
- In the candidate generation phase of pass k, we need storage for the large itemsets Lk-1 and the candidate itemsets Ck. In the counting phase, we need storage for Ck and at least one page to buffer the database transactions. (Ct is a subset of Ck.)
- If Lk-1 fits in memory but Ck does not: generate as many candidates of Ck as possible, scan the database and count their support, and write the resulting Fk to disk, deleting small itemsets. Repeat until all of Fk has been generated for that pass.
- If Lk-1 does not fit in memory: externally sort Lk-1, bring into memory the Lk-1 itemsets in which the first k-2 items are the same, generate candidate itemsets, then scan the data and generate Fk. Unfortunately, pruning cannot be done in this case.
18. Correctness
- Ck is a superset of Fk.
- The join step guarantees this by the way Ck is generated from the frequent (k-1)-itemsets.
- Subset pruning is based on the monotonicity property, so every itemset pruned is guaranteed not to be large.
- Hence, Ck is always a superset of Fk.
19. AprioriTid Algorithm
- It is similar to the Apriori algorithm and uses the apriori-gen function to determine the candidate sets.
- The basic difference is that for determining the support, the database is not used after the first pass.
- Rather, a set C̄k is used for this purpose.
- Each member of C̄k is of the form <TID, {Xk}>, where each Xk is a potentially large k-itemset present in the transaction with identifier TID.
- C̄1 corresponds to the database D.
20. Algorithm AprioriTid
- L1 = {large 1-itemsets};
- C̄1 = database D;
- for (k = 2; Lk-1 ≠ ∅; k++) do begin
-   Ck = apriori-gen(Lk-1); // new candidates
-   C̄k = ∅;
-   forall entries t ∈ C̄k-1 do begin
-     // determine candidate itemsets in Ck contained in the transaction with identifier t.TID
-     Ct = {c ∈ Ck | (c − c[k]) ∈ t.set-of-itemsets ∧ (c − c[k−1]) ∈ t.set-of-itemsets};
-     forall candidates c ∈ Ct do
-       c.count++;
-     if (Ct ≠ ∅) then C̄k += <t.TID, Ct>;
-   end
-   Lk = {c ∈ Ck | c.count ≥ minsup};
- end
- Answer = ∪k Lk;
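A hedged Python sketch of the same loop (the names are mine; C̄k is kept as a list of (TID, itemsets) pairs, and apriori-gen is inlined as on slide 12):

```python
from itertools import combinations
from collections import defaultdict

def apriori_tid(db, min_sup):
    """AprioriTid sketch: after pass 1 the raw database is never reread;
    support is counted against C̄k = [(TID, candidate itemsets in t)]."""
    cbar = [(tid, {(i,) for i in t}) for tid, t in enumerate(db)]  # C̄1 = D
    counts = defaultdict(int)
    for _, items in cbar:
        for c in items:
            counts[c] += 1
    L = {c for c, n in counts.items() if n >= min_sup}             # L1
    answer, k = set(L), 2
    while L:
        C = {p + (q[-1],) for p in L for q in L                    # apriori-gen
             if p[:-1] == q[:-1] and p[-1] < q[-1]
             and all(s in L for s in combinations(p + (q[-1],), k - 1))}
        new_cbar, counts = [], defaultdict(int)
        for tid, sets in cbar:
            # c ⊆ t iff both generating (k-1)-subsets of c are in t's entry
            Ct = {c for c in C if c[:-1] in sets and c[:-2] + c[-1:] in sets}
            for c in Ct:
                counts[c] += 1
            if Ct:
                new_cbar.append((tid, Ct))     # C̄k shrinks as k grows
        cbar = new_cbar
        L = {c for c, n in counts.items() if n >= min_sup}
        answer |= L
        k += 1
    return answer

db = [["A", "C", "D"], ["B", "C", "E"], ["A", "B", "C", "E"], ["B", "E"]]
print(sorted(apriori_tid(db, min_sup=2)))
```

Note how the membership test never touches the raw transactions after pass 1: a candidate c is contained in transaction t exactly when the two (k-1)-itemsets that joined to form c both appear in t's entry of C̄k-1.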
21. Buffer Management
- In the kth pass, AprioriTid needs memory for Lk-1 and Ck during candidate generation.
- During the counting phase, it needs memory for Ck, plus a page each for C̄k-1 and C̄k. Entries in C̄k-1 are needed sequentially, but C̄k can be written out as it is generated.
22. Drawbacks of Apriori and AprioriTid
- Apriori:
- Takes a long time to calculate the support of candidate itemsets.
- For determining the support of the candidate sets, the algorithm always looks at every transaction in the database, so it takes longer (more passes over the data).
- AprioriTid:
- During the initial passes, the candidate itemsets generated are very large, equivalent to the size of the database, so the time taken is about equal to that of Apriori. It may also incur an additional cost if C̄k cannot completely fit into memory.
23. Experimental Results
- As the minimum support decreases, the execution times of all the algorithms increase because of increases in the total number of candidate and large itemsets.
- Apriori beats SETM by more than an order of magnitude for large datasets.
- Apriori beats AIS for all problem sizes, by factors ranging from 2 for high minimum support to more than an order of magnitude for low levels of support.
- For small problems, AprioriTid did about as well as Apriori, but performance degraded to about twice as slow for large problems.
24. Apriori Hybrid
- Initial passes: Apriori performs better.
- Later passes: AprioriTid performs better.
- Apriori Hybrid:
- Uses Apriori in the initial passes and later switches to AprioriTid.
- Drawback:
- An extra cost is incurred when switching from Apriori to AprioriTid.
- Suppose at the end of the kth pass we decide to switch from Apriori to AprioriTid. Then in the (k+1)th pass, after having generated the candidate sets, we also have to build C̄k+1 by adding the TIDs.
25. Is Apriori Fast Enough? Performance Bottlenecks
- The core of the Apriori algorithm:
- Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
- Use database scans and pattern matching to collect counts for the candidate itemsets
- The bottleneck of Apriori: candidate generation
- Huge candidate sets:
- 10^4 frequent 1-itemsets will generate roughly 10^7 candidate 2-itemsets
- To discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates (see the arithmetic check below)
- Multiple scans of the database:
- Needs (n+1) scans, where n is the length of the longest pattern
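A quick arithmetic check of those two counts (my sketch, not from the paper):

```python
from math import comb

# 10^4 frequent 1-itemsets: every unordered pair is a candidate 2-itemset
print(comb(10**4, 2))   # 49995000, i.e. on the order of 10^7
# a frequent pattern of size 100 makes every nonempty subset frequent,
# so a bottom-up generator must visit about 2^100 of them
print(2**100)           # 1267650600228229401496703205376 ≈ 1.27 * 10^30
```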
26. Conclusions
- Experimental results show that the proposed algorithms outperform AIS and SETM. The performance gap increases with the problem size, ranging from a factor of three for small problems to more than an order of magnitude for large problems.
- The best features of the two proposed algorithms can be combined into a hybrid algorithm, which then becomes the algorithm of choice. Experiments demonstrate the feasibility of using AprioriHybrid in real applications involving very large databases.
27. Future Scope
- In future work the authors plan to extend this work along the following dimensions:
- Multiple taxonomies (is-a hierarchies) over items are often available. An example of such a hierarchy: a dishwasher is a kitchen appliance is a heavy electric appliance. The authors are interested in discovering association rules that use such hierarchies.
- The quantities of the items bought in a transaction were not considered, although they are useful for some applications. Finding such rules needs further work.
28. Questions & Answers