Mining Association Rules - PowerPoint PPT Presentation

About This Presentation
Title:

Mining Association Rules

Description:

In each subsequent pass, the large itemsets determined in the previous pass is ... of each candidate itemset is counted, and the large ones are determined. ... – PowerPoint PPT presentation

Number of Views:78
Avg rating:3.0/5.0
Slides: 19
Provided by: csPu
Category:

less

Transcript and Presenter's Notes

Title: Mining Association Rules


1
Mining Association Rules
  • Mohamed G. Elfeky

2
Introduction
  • Data mining is the discovery of knowledge and
    useful information from the large amounts of data
    stored in databases.
  • Association Rules describing association
    relationships among the attributes in the set of
    relevant data.

3
Rules
  • Body Consequent Support , Confidence
  • Body represents the examined data.
  • Consequent represents a discovered property for
    the examined data.
  • Support represents the percentage of the records
    satisfying the body or the consequent.
  • Confidence represents the percentage of the
    records satisfying both the body and the
    consequent to those satisfying only the body.

4
Association Rules Examples
  • Basket Data
  • Tea Milk Sugar 0.3 , 0.9
  • Relational Data
  • x.diagnosis Heart x.sex Male x.age 50
    0.4 , 0.7
  • Object-Oriented Data
  • s.hobbies sport , art s.age() Young
    0.5 , 0.8

5
Topics of Discussion
  • Formal Statement of the Problem
  • Different Algorithms
  • AIS
  • SETM
  • Apriori
  • AprioriTid
  • AprioriHybrid
  • Performance Analysis

6
Formal Statement of the Problem
  • I i1 , i2 , , im is a set of items
  • D is a set of transactions T
  • Each transaction T is a set of items (subset of
    I)
  • TID is a unique identifier that is associated
    with each transaction
  • The problem is to generate all association rules
    that have support and confidence greater than the
    user-specified minimum support and minimum
    confidence

7
Problem Decomposition
  • The problem can be decomposed into two
    subproblems
  • Find all sets of items (itemsets) that have
    support (number of transactions) greater than the
    minimum support (large itemsets).
  • Use the large itemsets to generate the desired
    rules.
  • For each large itemset l, find all non-empty
    subsets, and for each subset a generate a rule a
    (l-a) if its confidence is greater than the
    minimum confidence.

8
General Algorithm
  • In the first pass, the support of each individual
    item is counted, and the large ones are
    determined
  • In each subsequent pass, the large itemsets
    determined in the previous pass is used to
    generate new itemsets called candidate itemsets.
  • The support of each candidate itemset is counted,
    and the large ones are determined.
  • This process continues until no new large
    itemsets are found.

9
AIS Algorithm
  • Candidate itemsets are generated and counted
    on-the-fly as the database is scanned.
  • For each transaction, it is determined which of
    the large itemsets of the previous pass are
    contained in this transaction.
  • New candidate itemsets are generated by extending
    these large itemsets with other items in this
    transaction.
  • The disadvantage is that this results in
    unnecessarily generating and counting too many
    candidate itemsets that turn out to be small.

10
Example
Database
L1
C2
C3
11
SETM Algorithm
  • Candidate itemsets are generated on-the-fly as
    the database is scanned, but counted at the end
    of the pass.
  • New candidate itemsets are generated the same way
    as in AIS algorithm, but the TID of the
    generating transaction is saved with the
    candidate itemset in a sequential structure.
  • At the end of the pass, the support count of
    candidate itemsets is determined by aggregating
    this sequential structure
  • It has the same disadvantage of the AIS
    algorithm.
  • Another disadvantage is that for each candidate
    itemset, there are as many entries as its support
    value.

12
Example
Database
L1
C2
C3
13
Apriori Algorithm
  • Candidate itemsets are generated using only the
    large itemsets of the previous pass without
    considering the transactions in the database.
  • The large itemset of the previous pass is joined
    with itself to generate all itemsets whose size
    is higher by 1.
  • Each generated itemset, that has a subset which
    is not large, is deleted. The remaining itemsets
    are the candidate ones.

14
Example
Database
L1
C2
C3
15
AprioriTid Algorithm
  • The database is not used at all for counting the
    support of candidate itemsets after the first
    pass.
  • The candidate itemsets are generated the same way
    as in Apriori algorithm.
  • Another set C is generated of which each member
    has the TID of each transaction and the large
    itemsets present in this transaction. This set is
    used to count the support of each candidate
    itemset.
  • The advantage is that the number of entries in C
    may be smaller than the number of transactions in
    the database, especially in the later passes.

16
Example
C2
Database
L1
C2
C3
C3
17
Performance Analysis
18
AprioriHybrid Algorithm
  • Performance Analysis shows that
  • Apriori does better than AprioriTid in the
    earlier passes.
  • AprioriTid does better than Apriori in the later
    passes.
  • Hence, a hybrid algorithm can be designed that
    uses Apriori in the initial passes and switches
    to AprioriTid when it expects that the set C
    will fit in memory.
Write a Comment
User Comments (0)
About PowerShow.com