Mining Association Rules

Provided by: csPu

Learn more at: https://www.cs.purdue.edu

1
Mining Association Rules

2
Introduction

Data mining is the discovery of knowledge and
useful information from the large amounts of data
stored in databases.
Association rules describe association
relationships among the attributes in the set of
relevant data.

3
Rules

Body => Consequent [Support, Confidence]
Body represents the examined data.
Consequent represents a discovered property for
the examined data.
Support represents the percentage of the records
satisfying both the body and the consequent.
Confidence represents the ratio of the records
satisfying both the body and the consequent to
those satisfying the body.
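
The two measures can be illustrated with a minimal sketch. The toy database and the function names support and confidence are assumptions made for illustration; they do not come from the slides.

```python
from typing import FrozenSet, List

# Hypothetical toy database: each transaction is a set of item ids.
TRANSACTIONS: List[FrozenSet[int]] = [
    frozenset({1, 3, 4}),
    frozenset({2, 3, 5}),
    frozenset({1, 2, 3, 5}),
    frozenset({2, 5}),
]


def support(itemset: FrozenSet[int], transactions: List[FrozenSet[int]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(body: FrozenSet[int], consequent: FrozenSet[int],
               transactions: List[FrozenSet[int]]) -> float:
    """support(body and consequent together) divided by support(body)."""
    both = support(body | consequent, transactions)
    return both / support(body, transactions)


# Rule {2} => {5} on the toy database.
rule_support = support(frozenset({2, 5}), TRANSACTIONS)
rule_confidence = confidence(frozenset({2}), frozenset({5}), TRANSACTIONS)
```

Here every transaction that contains item 2 also contains item 5, so the rule {2} => {5} holds in three of the four records.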

4
Association Rules Examples

5
Topics of Discussion

6
Formal Statement of the Problem

I = {i1, i2, ..., im} is a set of items
D is a set of transactions T
Each transaction T is a set of items (subset of
I)
TID is a unique identifier that is associated
with each transaction
The problem is to generate all association rules
that have support and confidence greater than the
user-specified minimum support and minimum
confidence

7
Problem Decomposition

The problem can be decomposed into two
subproblems:
Find all sets of items (itemsets) whose support
(the number of transactions containing the
itemset) is greater than the minimum support.
These are called large itemsets.
Use the large itemsets to generate the desired
rules.
For each large itemset l, find all non-empty
proper subsets, and for each such subset a
generate the rule a => (l - a) if its confidence,
support(l) / support(a), is greater than the
minimum confidence.
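
The second subproblem can be sketched as follows. The function name generate_rules and the support counts are hypothetical, and treating the confidence threshold as inclusive is an assumption of this sketch.

```python
from itertools import combinations

# Hypothetical support counts of the large itemsets of a toy database.
SUPPORT_COUNTS = {
    frozenset({1}): 2, frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
    frozenset({1, 3}): 2, frozenset({2, 3}): 2,
    frozenset({2, 5}): 3, frozenset({3, 5}): 2,
    frozenset({2, 3, 5}): 2,
}


def generate_rules(support_counts, min_confidence):
    """Return (body, consequent, confidence) for each rule a => (l - a)."""
    rules = []
    for l, l_count in support_counts.items():
        # Every non-empty proper subset a of l is a possible rule body.
        for size in range(1, len(l)):
            for a in map(frozenset, combinations(sorted(l), size)):
                conf = l_count / support_counts[a]
                # Assumption: the threshold is inclusive.
                if conf >= min_confidence:
                    rules.append((a, l - a, conf))
    return rules


rules = generate_rules(SUPPORT_COUNTS, min_confidence=1.0)
```

Looking up support_counts[a] is safe because every subset of a large itemset is itself large and therefore already counted.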

8
General Algorithm

In the first pass, the support of each individual
item is counted, and the large ones are
determined.
In each subsequent pass, the large itemsets
determined in the previous pass are used to
generate new itemsets called candidate itemsets.
The support of each candidate itemset is counted,
and the large ones are determined.
This process continues until no new large
itemsets are found.
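
The pass structure above can be sketched as a loop. This is a minimal illustration, not any one of the specific algorithms: candidates are formed naively by uniting pairs of large itemsets, the toy database is hypothetical, and the support threshold is treated as an inclusive count.

```python
# Hypothetical toy database.
TRANSACTIONS = [
    frozenset({1, 3, 4}),
    frozenset({2, 3, 5}),
    frozenset({1, 2, 3, 5}),
    frozenset({2, 5}),
]


def count_support(candidates, transactions):
    """Count, for each candidate, the transactions that contain it."""
    return {c: sum(1 for t in transactions if c <= t) for c in candidates}


def levelwise_large_itemsets(transactions, min_count):
    """Return {itemset: support count} for all large itemsets."""
    # First pass: count individual items.
    items = {frozenset({i}) for t in transactions for i in t}
    counts = count_support(items, transactions)
    large = {c: n for c, n in counts.items() if n >= min_count}
    all_large = dict(large)
    k = 2
    # Subsequent passes: stop when a pass finds no new large itemsets.
    while large:
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        counts = count_support(candidates, transactions)
        large = {c: n for c, n in counts.items() if n >= min_count}
        all_large.update(large)
        k += 1
    return all_large


large_itemsets = levelwise_large_itemsets(TRANSACTIONS, min_count=2)
```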

9
AIS Algorithm

Candidate itemsets are generated and counted
on-the-fly as the database is scanned.
For each transaction, it is determined which of
the large itemsets of the previous pass are
contained in this transaction.
New candidate itemsets are generated by extending
these large itemsets with other items in this
transaction.
The disadvantage is that this results in
unnecessarily generating and counting too many
candidate itemsets that turn out to be small.
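
A single AIS-style pass might look like the sketch below. The function name ais_pass and the toy database are hypothetical, and extending only with items larger than the largest item of the itemset (so that each candidate is counted once per transaction) is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical toy database.
TRANSACTIONS = [
    frozenset({1, 3, 4}),
    frozenset({2, 3, 5}),
    frozenset({1, 2, 3, 5}),
    frozenset({2, 5}),
]


def ais_pass(transactions, prev_large):
    """Generate and count candidates on-the-fly during one scan."""
    counts = Counter()
    for t in transactions:
        for l in prev_large:
            if l <= t:  # large itemset of the previous pass is in t
                for item in t:
                    # Assumption: extend only with later items to avoid
                    # counting the same candidate twice per transaction.
                    if item > max(l):
                        counts[l | {item}] += 1
    return counts


L1 = {frozenset({1}), frozenset({2}), frozenset({3}), frozenset({5})}
c2_counts = ais_pass(TRANSACTIONS, L1)
```

On this toy database the pass counts candidates such as {1, 4} and {3, 4}, which contain the small item 4 and so can never be large.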

10
Example
[Tables not preserved in this transcript: Database, L1, C2, C3]

11
SETM Algorithm

Candidate itemsets are generated on-the-fly as
the database is scanned, but counted at the end
of the pass.
New candidate itemsets are generated the same way
as in the AIS algorithm, but the TID of the
generating transaction is saved with the
candidate itemset in a sequential structure.
At the end of the pass, the support count of
candidate itemsets is determined by aggregating
this sequential structure.
It has the same disadvantage as the AIS
algorithm.
Another disadvantage is that for each candidate
itemset, there are as many entries as its support
value.
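
A simplified in-memory sketch of one SETM-style pass follows. The function name setm_pass, the TIDs, and the toy database are hypothetical, and a plain Python list stands in for the sequential structure.

```python
from collections import Counter

# Hypothetical toy database keyed by TID.
TRANSACTIONS = {
    100: frozenset({1, 3, 4}),
    200: frozenset({2, 3, 5}),
    300: frozenset({1, 2, 3, 5}),
    400: frozenset({2, 5}),
}


def setm_pass(transactions, prev_large):
    """Collect (TID, candidate) entries during the scan, count at the end."""
    entries = []  # stand-in for the sequential structure
    for tid, t in transactions.items():
        for l in prev_large:
            if l <= t:
                for item in sorted(t):
                    if item > max(l):  # same extension rule as the AIS sketch
                        entries.append((tid, l | {item}))
    # End of pass: aggregate the entries into support counts.
    counts = Counter(itemset for _, itemset in entries)
    return entries, counts


L1 = {frozenset({1}), frozenset({2}), frozenset({3}), frozenset({5})}
entries, c2_counts = setm_pass(TRANSACTIONS, L1)
```

The length of entries equals the sum of all candidate support counts, which illustrates the extra storage cost noted above.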

12
Example
[Tables not preserved in this transcript: Database, L1, C2, C3]

13
Apriori Algorithm

Candidate itemsets are generated using only the
large itemsets of the previous pass without
considering the transactions in the database.
The set of large itemsets of the previous pass is
joined with itself to generate all itemsets whose
size is larger by 1.
Each generated itemset that has a subset which is
not large is deleted. The remaining itemsets are
the candidate ones.
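
The join and prune steps can be sketched as follows. The function name apriori_gen is chosen here for illustration, and joining two itemsets only when they agree on all but their last item (in sorted order) is an assumption about how the self-join is carried out.

```python
from itertools import combinations


def apriori_gen(prev_large):
    """Build candidate k-itemsets from the set of large (k-1)-itemsets."""
    if not prev_large:
        return set()
    k_minus_1 = len(next(iter(prev_large)))
    ordered = sorted(tuple(sorted(s)) for s in prev_large)
    # Join step: merge itemsets that share their first k-2 items.
    joined = set()
    for a, b in combinations(ordered, 2):
        if a[:-1] == b[:-1]:
            joined.add(frozenset(a + b[-1:]))
    # Prune step: drop any itemset with a (k-1)-subset that is not large.
    return {
        c for c in joined
        if all(frozenset(s) in prev_large for s in combinations(c, k_minus_1))
    }


L2 = {frozenset(s) for s in [(1, 3), (2, 3), (2, 5), (3, 5)]}
C3 = apriori_gen(L2)

L3 = {frozenset(s) for s in [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]}
C4 = apriori_gen(L3)
```

In the second call the join produces {1, 2, 3, 4} and {1, 3, 4, 5}; the latter is pruned because its subset {1, 4, 5} is not large. No transaction is read at any point.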

14
Example
[Tables not preserved in this transcript: Database, L1, C2, C3]

15
AprioriTid Algorithm

The database is not used at all for counting the
support of candidate itemsets after the first
pass.
The candidate itemsets are generated the same way
as in the Apriori algorithm.
Another set C is generated, each member of which
holds the TID of a transaction and the
potentially large (candidate) itemsets present in
that transaction. This set is used to count the
support of each candidate itemset.
The advantage is that the number of entries in C
may be smaller than the number of transactions in
the database, especially in the later passes.
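
A minimal sketch of the idea follows. The names apriori_tid, c_bar, and the toy database are hypothetical, and the test for whether a candidate is present in a transaction (all of its (k-1)-subsets appear in that transaction's entry) is a simplification assumed for this sketch.

```python
from itertools import combinations

# Hypothetical toy database keyed by TID.
TRANSACTIONS = {
    100: frozenset({1, 3, 4}),
    200: frozenset({2, 3, 5}),
    300: frozenset({1, 2, 3, 5}),
    400: frozenset({2, 5}),
}


def apriori_gen(prev_large):
    """Join and prune, as in the Apriori candidate generation."""
    size = len(next(iter(prev_large)))
    ordered = sorted(tuple(sorted(s)) for s in prev_large)
    joined = {frozenset(a + b[-1:])
              for a, b in combinations(ordered, 2) if a[:-1] == b[:-1]}
    return {c for c in joined
            if all(frozenset(s) in prev_large for s in combinations(c, size))}


def apriori_tid(transactions, min_count):
    """Return all large itemsets and the size of the set C after each pass."""
    # First pass: the only pass that reads the database itself.
    c_bar = {tid: {frozenset({i}) for i in t} for tid, t in transactions.items()}
    counts = {}
    for itemsets in c_bar.values():
        for s in itemsets:
            counts[s] = counts.get(s, 0) + 1
    large = {s: n for s, n in counts.items() if n >= min_count}
    all_large = dict(large)
    c_bar_sizes = [len(c_bar)]
    k = 2
    while large:
        candidates = apriori_gen(set(large))
        counts = dict.fromkeys(candidates, 0)
        next_c_bar = {}
        for tid, present in c_bar.items():
            # A candidate is in the transaction if all its (k-1)-subsets were.
            found = {c for c in candidates
                     if all(frozenset(s) in present
                            for s in combinations(c, k - 1))}
            for c in found:
                counts[c] += 1
            if found:  # transactions with no candidates drop out of C
                next_c_bar[tid] = found
        c_bar = next_c_bar
        c_bar_sizes.append(len(c_bar))
        large = {c: n for c, n in counts.items() if n >= min_count}
        all_large.update(large)
        k += 1
    return all_large, c_bar_sizes


all_large, c_bar_sizes = apriori_tid(TRANSACTIONS, min_count=2)
```

The recorded sizes show the set C shrinking in later passes as transactions that contain no candidates are dropped.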

16
Example
[Tables not preserved in this transcript: C2, Database, L1, C2, C3, C3]

17
Performance Analysis

18
AprioriHybrid Algorithm

Performance Analysis shows that
Apriori does better than AprioriTid in the
earlier passes.
AprioriTid does better than Apriori in the later
passes.
Hence, a hybrid algorithm can be designed that
uses Apriori in the initial passes and switches
to AprioriTid when it expects that the set C
will fit in memory.
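
The switching decision might be sketched as below. The slides do not say how the size of C is estimated; this sketch assumes one plausible estimate, the sum of the candidate support counts of the current pass plus the number of transactions. Both function names and the memory_budget parameter are hypothetical.

```python
def estimate_c_bar_size(candidate_counts, num_transactions):
    """Assumed estimate: one entry per (transaction, candidate) occurrence,
    plus one TID per transaction."""
    return sum(candidate_counts.values()) + num_transactions


def should_switch_to_apriori_tid(candidate_counts, num_transactions, memory_budget):
    """Switch once the estimated set C is expected to fit in memory."""
    return estimate_c_bar_size(candidate_counts, num_transactions) <= memory_budget


# Hypothetical support counts of the candidate 2-itemsets of a toy database
# with four transactions.
C2_COUNTS = {
    frozenset({1, 2}): 1, frozenset({1, 3}): 2, frozenset({1, 5}): 1,
    frozenset({2, 3}): 2, frozenset({2, 5}): 3, frozenset({3, 5}): 2,
}
estimated = estimate_c_bar_size(C2_COUNTS, num_transactions=4)
```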
