Fast Algorithms for Association Rule Mining - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Fast Algorithms for Association Rule Mining
  • Presented by
  • Muhammad Aurangzeb Ahmad
  • Nupur Bhatnagar

R. Agrawal and R. Srikant
2
Outline
  • Background and Motivation
  • Problem Definition
  • Major Contribution
  • Key Concepts
  • Validation
  • Assumptions
  • Possible Revision

3
Background and Motivation
  • Basket Data
  • A collection of records, each consisting of a
    transaction identifier and the items bought in
    that transaction.
  • Mining for associations among items in a large
    database of sales transactions, to predict the
    occurrence of an item based on the occurrences
    of other items in the same transaction.
  • For example: customers who buy diapers often
    also buy beer.

4
Terms and Notations
  • Items: I = {i1, i2, ..., im}
  • Transaction T: a set of items such that T ⊆ I
  • Items within a transaction are sorted
    lexicographically
  • TID: a unique identifier for each transaction
  • Association rule: an implication X ⇒ Y, where
    X ⊂ I, Y ⊂ I, and X ∩ Y = ∅

5
Terms and Notations
  • Confidence
  • A rule X ⇒ Y holds in the transaction set D
    with confidence c if c% of the transactions in D
    that contain X also contain Y.
  • Support
  • A rule X ⇒ Y has support s if s% of the
    transactions in D contain X ∪ Y.
  • Large Itemset
  • An itemset whose support is at least the
    user-specified minimum support; otherwise the
    itemset is called small.
  • Candidate Itemsets
  • Itemsets generated from a seed of itemsets
    found to be large in the previous pass; the
    rules eventually derived from them must satisfy
  • support ≥ the minsup threshold
  • confidence ≥ the minconf threshold
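To make these definitions concrete, here is a small Python sketch; the transactions are toy data invented for illustration, not data from the paper:

```python
# Toy transaction database (each transaction is a set of items).
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "coke"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "coke"},
]

def support(itemset, db):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y, db):
    """Fraction of transactions containing X that also contain Y,
    i.e. s(X ∪ Y) / s(X)."""
    return support(x | y, db) / support(x, db)

s = support({"diaper", "beer"}, transactions)      # 3/5 = 0.6
c = confidence({"diaper"}, {"beer"}, transactions)  # 3/4 = 0.75
```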

6
Problem Definition
  • INPUT
  • A set of transactions


Objective: given a set of transactions D,
generate all association rules that have support
and confidence greater than the user-specified
minimum support and minimum confidence, and
minimize computation time by pruning.
Constraint: items are kept in lexicographic order.
Example association rules: {Diaper} ⇒ {Beer};
{Milk, Bread} ⇒ {Eggs, Coke}; {Beer, Bread} ⇒
{Milk}.
Real-world applications: NCR (Teradata) performs
association rule mining for more than 20 large
retail organizations, including Walmart. It is
also used for pattern discovery in biological
databases.
7
Major Contribution
  • Proposed two new algorithms for fast association
    rule mining
  • Apriori and AprioriTid, along with a hybrid of
    the two algorithms (AprioriHybrid).
  • Empirical evaluations of the performance of the
    proposed algorithms as compared with the
    contemporary algorithms.
  • Completeness Find all rules.

8
Related Work -SETM and AIS
  • The major difference lies in candidate itemset
    generation:
  • In pass k, read a database transaction t.
  • Determine which of the large itemsets in Lk-1
    are present in t.
  • Each of these large itemsets l is then extended
    with all those large items that are present in t
    and occur later in the lexicographic ordering
    than any of the items in l.
  • Result: many candidate itemsets are generated
    that are later discarded.

9
Key Concepts Support and Confidence
  • Why do we need Support and Confidence?
  • Given a rule X ⇒ Y
  • Support determines how often the rule is
    applicable to a given data set.
  • Confidence determines how frequently items in Y
    appear in transactions that contain X.
  • A rule with low support may occur simply by
    chance, and tends to be uninteresting from a
    business perspective.
  • Confidence measures the reliability of the
    inference made by a rule.

10
Key Concepts Association Rule Mining Problem
  • Problem
  • Given a set of transactions T, find all rules
    having support > minsup and
    confidence > minconf.
  • Decomposition of Problem
  • 1. Frequent Itemset Generation
  • Find all itemsets having transaction
    support above minimum support.
  • These itemsets are called frequent
    itemsets.
  • 2. Rule Generation
  • Use the large itemsets to generate rules. These
    rules are high-confidence rules extracted from
    the frequent itemsets found in the previous step.
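The rule-generation step can be sketched in Python; the support values below are hypothetical numbers chosen for illustration, not results from the paper:

```python
from itertools import combinations

def rules_from_itemset(itemset, supp, minconf):
    """Emit rules X => (itemset - X) whose confidence
    s(itemset) / s(X) meets minconf. `supp` maps frozensets
    to their (already computed) supports."""
    rules = []
    items = sorted(itemset)
    whole = frozenset(itemset)
    for r in range(1, len(items)):
        for lhs in combinations(items, r):
            x = frozenset(lhs)
            conf = supp[whole] / supp[x]
            if conf >= minconf:
                rules.append((x, whole - x, conf))
    return rules

# Hypothetical supports for a two-item frequent itemset:
supp = {frozenset({"diaper"}): 0.5, frozenset({"beer"}): 0.45,
        frozenset({"diaper", "beer"}): 0.4}
rules = rules_from_itemset({"diaper", "beer"}, supp, minconf=0.8)
# diaper => beer (confidence 0.8) and beer => diaper (about 0.89) both pass
```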

11
Frequent Itemset Generation Apriori
  • Apriori Principle
  • Given an itemset I = {a, b, c, d, e}.
  • If an itemset is frequent, then all of its
    subsets must also be frequent.

12
Frequent Itemset Generation Apriori
  • Apriori Principle

If {c, d, e} is frequent, then all of its subsets
must also be frequent.
13
Frequent Itemset Generation Apriori
  • Apriori Principle Candidate Pruning

If {a, b} is infrequent, then all its supersets are
infrequent.
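The pruning side of the principle can be sketched as a simple subset check; this is a minimal illustration, not the paper's implementation:

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_k_minus_1):
    """A k-itemset can be pruned if any of its (k-1)-subsets
    is not frequent (contrapositive of the Apriori principle)."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_k_minus_1
               for sub in combinations(sorted(candidate), k - 1))

# {a,b} is missing from the frequent 2-itemsets, so any 3-itemset
# containing both a and b is pruned without counting its support:
frequent_2 = {frozenset(p) for p in [("a", "c"), ("b", "c"), ("c", "d")]}
has_infrequent_subset(frozenset({"a", "b", "c"}), frequent_2)  # True
```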
14
Key Concepts Frequent Itemset Generation
Apriori Algorithm
  • Input
  • The market basket transaction dataset.
  • Process
  • Determine the large 1-itemsets.
  • Repeat until no new large itemsets are
    identified
  • Generate (k+1)-length candidate itemsets from
    the length-k large itemsets.
  • Prune candidate itemsets containing a subset
    that is not large.
  • Count the support of each remaining candidate.
  • Eliminate candidates that are small.
  • Output
  • All itemsets that meet the minimum support
    threshold.
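The steps above can be sketched as a minimal in-memory Apriori. This is illustrative only; the paper's version adds buffer management and specialized structures such as hash trees for subset counting:

```python
from collections import defaultdict
from itertools import combinations

def apriori(db, minsup):
    """Minimal Apriori sketch (minsup is an absolute count).
    Returns all large itemsets as {frozenset: support_count}."""
    db = [frozenset(t) for t in db]
    # Pass 1: find the large 1-itemsets.
    counts = defaultdict(int)
    for t in db:
        for item in t:
            counts[frozenset([item])] += 1
    large = {s: c for s, c in counts.items() if c >= minsup}
    all_large, k = dict(large), 2
    while large:
        # Generate k-candidates by joining (k-1)-itemsets sharing
        # their first k-2 items, then prune by the Apriori principle.
        prev = sorted(tuple(sorted(s)) for s in large)
        cands = set()
        for i, a in enumerate(prev):
            for b in prev[i + 1:]:
                if a[:-1] == b[:-1]:                       # join step
                    c = frozenset(a) | frozenset(b)
                    if all(frozenset(sub) in large         # prune step
                           for sub in combinations(sorted(c), k - 1)):
                        cands.add(c)
        # One pass over the database to count candidate supports.
        counts = defaultdict(int)
        for t in db:
            for c in cands:
                if c <= t:
                    counts[c] += 1
        large = {s: n for s, n in counts.items() if n >= minsup}
        all_large.update(large)
        k += 1
    return all_large
```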

15
Apriori Example (minimum support = 2 transactions)
[Figure: candidate 1-itemsets, pruning, candidate
2-itemsets, pruning, candidate 3-itemsets]
16
Apriori Candidate Generation
  • Given the large k-itemsets, generate the
    candidate (k+1)-itemsets in two steps:
  • Join step: join the large k-itemsets with
    themselves, with the join condition that the
    first k-1 items must be the same.
  • Prune step: delete all candidates having a
    non-frequent subset.
  • Example from the slide: the join step produces
    candidates 135 and 235; after pruning, 235
    remains.
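A sketch of the join and prune steps. The example itemsets here are chosen so that both candidates survive pruning; they are not the slide's exact data:

```python
from itertools import combinations

def apriori_gen(large_prev):
    """Candidate generation: join L(k-1) with itself on the first
    k-2 items, then prune candidates with an infrequent subset."""
    prev = sorted(large_prev)          # each itemset as a sorted tuple
    k = len(prev[0]) + 1
    candidates = []
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:-1] == b[:-1]:                          # join step
                cand = tuple(sorted(set(a) | set(b)))
                if all(sub in large_prev                  # prune step
                       for sub in combinations(cand, k - 1)):
                    candidates.append(cand)
    return candidates

L2 = {(1, 3), (1, 5), (2, 3), (2, 5), (3, 5)}
apriori_gen(L2)   # [(1, 3, 5), (2, 3, 5)]
```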
17
AprioriTID
  • AprioriTID
  • Uses the same candidate generation function as
    Apriori.
  • Does not use the database for counting support
    after the first pass.
  • Instead, an encoding of the candidate itemsets
    used in the previous pass is carried forward.
  • This saves the effort of rereading the database.
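A simplified sketch of the AprioriTid idea: after each pass, a transaction is re-encoded as the set of candidates it contains, and the next pass counts support from that encoding instead of the raw database. This toy version reconstructs a transaction's items by unioning its candidate entries, which is a simplification of the paper's encoding:

```python
def encode_pass(encoded_db, candidates):
    """encoded_db: list of (TID, set of itemsets, each a sorted tuple,
    present in that transaction). Returns the support counts for
    `candidates` and the new encoding for the next pass."""
    counts = {c: 0 for c in candidates}
    new_db = []
    for tid, itemsets in encoded_db:
        # Simplification: recover the usable items from the entry.
        items = set().union(*itemsets) if itemsets else set()
        present = set()
        for c in candidates:
            if set(c) <= items:
                counts[c] += 1
                present.add(c)
        if present:            # transactions with no candidates drop out
            new_db.append((tid, present))
    return counts, new_db

# Toy database encoded as 1-itemsets, then counted against C2.
encoded = [(100, {(1,), (3,), (4,)}), (200, {(2,), (3,), (5,)}),
           (300, {(1,), (2,), (3,), (5,)}), (400, {(2,), (5,)})]
C2 = [(1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]
counts, new_db = encode_pass(encoded, C2)
```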

18
Apriori Tid Example (support count = 2)
[Figure: the database and the successive candidate
sets C1, C2, C3 with the large itemsets L1, L2, L3]
19
Apriori Tid Analysis
  • Advantages
  • If a transaction does not contain any candidate
    k-itemsets, then the encoded set Ck will not
    have an entry for this transaction.
  • For large k, each entry may be smaller than the
    transaction because very few candidates may be
    present in the transaction.
  • Disadvantages
  • For small k, each entry may be larger than the
    corresponding transaction.
  • An entry includes all k-itemsets contained in the
    transaction.

20
Apriori Hybrid
  • Apriori Hybrid
  • It uses Apriori in the initial passes and
    switches to AprioriTid when it expects that the
    candidate itemsets at the end of the pass will be
    in memory.

21
Validation Computer Experiments
  • Parameters for data generation
  • D Number of transactions
  • T Average size of the transaction
  • I Average size of the maximal potentially large
    itemsets
  • L Number of maximal potentially large itemsets
  • N Number of Items.
  • Parameter Settings 6 synthetic data sets

22
Results Execution Time
Apriori is consistently better than AIS and SETM;
the SETM execution times were too large to plot.
Apriori is better than AprioriTid when transactions
are large.
23
Results Analysis
  • AprioriTid uses Ck instead of the database. If
    Ck fits in memory AprioriTid is faster than
    Apriori.
  • When Ck is too big to fit in memory, the
    computation time is much longer. Thus Apriori is
    faster than AprioriTid.

24
Results Execution time Apriori Hybrid
  • [Graphs: execution times of Apriori vs.
    AprioriHybrid]

Apriori Hybrid performs better than Apriori in
almost all cases.
25
Scale Up - Experiments
AprioriHybrid scales up as the number of
transactions is increased from 100,000 to 10
million (minimum support 0.75).
AprioriHybrid also scales up when the average
transaction size is increased. This was done to
see the effect on the data structures independent
of the physical database size and the number of
large itemsets.
26
Results
  • The Apriori algorithms are better than SETM
    and AIS.
  • The algorithms perform best when combined
    (AprioriHybrid).
  • The algorithms show good results in the
    scale-up experiments.

27
Validation Methodology - Strengths and Weaknesses
  • Strength
  • The authors use substantial basket data to
    guide the design of fast algorithms for
    association rule mining.
  • Weakness
  • Only synthetic data sets are used for
    validation. The data might be too artificial to
    give reliable information about real-world
    datasets.

28
Assumptions
  • Synthetic dataset is used. It is assumed that
    performance of the algorithm in the synthetic
    dataset is indicative of its performance on a
    real world dataset.
  • All the items in the data are in a
    lexicographical order.
  • Assume that all data is categorical.
  • It is assumed that all the data resides at the
    same site, in a single table, so there are no
    cases that would require joins.

29
Possible Revision
  • Some real-world datasets should be used to
    perform the experiments.
  • The number of large itemsets can grow
    exponentially with large databases. A modified
    representation structure is needed that
    captures just a subset of the candidate large
    itemsets.
  • Limitations of the support and confidence
    framework
  • Support: potentially interesting patterns
    involving low-support items might be
    eliminated.
  • Confidence: confidence ignores the support of
    the itemset in the rule consequent.
  • Improvement: an interestingness (lift) measure
  • Computes the ratio between the rule's
    confidence and the support of the itemset in
    the rule consequent:
    lift(A ⇒ B) = s(A ∪ B) / (s(A) s(B)).
  • Effect of skewed support distributions
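The lift measure is a one-liner; the support values below are hypothetical numbers for illustration:

```python
def lift(s_xy, s_x, s_y):
    """Lift of a rule X => Y: confidence divided by s(Y), i.e.
    s(X ∪ Y) / (s(X) * s(Y)). Values above 1 indicate that X and Y
    co-occur more often than independence would predict."""
    return s_xy / (s_x * s_y)

lift(0.4, 0.5, 0.45)   # about 1.78: positive correlation
```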

30
Questions?