Business Systems Intelligence: 4. Mining Association Rules
Transcript and Presenter's Notes
1
Business Systems Intelligence: 4. Mining
Association Rules
Dr. Brian Mac Namee (www.comp.dit.ie/bmacnamee)
2
Acknowledgments
  • These notes are based (heavily) on those
    provided by the authors to accompany Data
    Mining: Concepts and Techniques by Jiawei Han
    and Micheline Kamber
  • Some slides are also based on trainers' kits
    provided by SAS

More information about the book is available at
www-sal.cs.uiuc.edu/hanj/bk2/ and information on
SAS is available at www.sas.com
3
Mining Association Rules
  • Today we will look at
  • Association rule mining
  • Algorithms for scalable mining of
    (single-dimensional Boolean) association rules in
    transactional databases
  • Sequential pattern mining
  • Applications/extensions of frequent pattern
    mining
  • Summary

4
What Is Association Mining?
  • Association rule mining
  • Finding frequent patterns, associations,
    correlations, or causal structures among sets of
    items or objects in transaction databases,
    relational databases, and other information
    repositories

5
Motivations For Association Mining
  • Motivation: finding regularities in data
  • What products were often purchased together?
  • Beer and nappies!
  • What are the subsequent purchases after buying a
    PC?
  • What kinds of DNA are sensitive to this new drug?
  • Can we automatically classify web documents?

6
Motivations For Association Mining (cont)
  • Foundation for many essential data mining tasks
  • Association, correlation, causality
  • Sequential patterns, temporal or cyclic
    association, partial periodicity, spatial and
    multimedia association
  • Associative classification, cluster analysis,
    iceberg cube, fascicles (semantic data
    compression)

7
Motivations For Association Mining (cont)
  • Broad applications
  • Basket data analysis, cross-marketing, catalog
    design, sale campaign analysis
  • Web log (click stream) analysis, DNA sequence
    analysis, etc.

8
Market Basket Analysis
  • Market basket analysis is a typical example of
    frequent itemset mining
  • Customers' buying habits are divined by finding
    associations between different items that
    customers place in their shopping baskets
  • This information can be used to develop marketing
    strategies

9
Market Basket Analysis (cont)
10
Association Rule Basic Concepts
  • Let I be a set of items {I1, I2, I3, …, Im}
  • Let D be a database of transactions where each
    transaction T is a set of items such that T ⊆ I
  • So, if A is a set of items, a transaction T is
    said to contain A if and only if A ⊆ T
  • An association rule is an implication A ⇒ B
    where A ⊂ I, B ⊂ I, and A ∩ B = ∅

11
Association Rule Support & Confidence
  • We say that an association rule A ⇒ B holds in
    the transaction set D with support, s, and
    confidence, c
  • The support of the association rule is given as
    the percentage of transactions in D that contain
    both A and B (i.e. A ∪ B)
  • So, the support can be considered the probability
    P(A ∪ B)

12
Association Rule Support & Confidence (cont)
  • The confidence of the association rule is given
    as the percentage of transactions in D containing
    A that also contain B
  • So, the confidence can be considered the
    conditional probability P(B|A)
  • Association rules that satisfy minimum support
    and confidence values are said to be strong

13
Itemsets & Frequent Itemsets
  • An itemset is a set of items
  • A k-itemset is an itemset that contains k items
  • The occurrence frequency of an itemset is the
    number of transactions that contain the itemset
  • This is also known more simply as the frequency,
    support count or count
  • An itemset is said to be frequent if the support
    count satisfies a minimum support count threshold
  • The set of frequent k-itemsets is denoted Lk

14
Support & Confidence Again
  • Support and confidence values can be calculated
    as follows (see also the sketch below):
  • support(A ⇒ B) = P(A ∪ B) = count(A ∪ B) / |D|
  • confidence(A ⇒ B) = P(B|A)
    = support_count(A ∪ B) / support_count(A)

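As a concrete illustration (not part of the original slides), here is a minimal Python sketch of both calculations over a hypothetical toy transaction database; all names and data are illustrative.

```python
def support(itemset, transactions):
    """Fraction of transactions in D that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(A, B, transactions):
    """P(B|A): the support of A ∪ B divided by the support of A."""
    return support(A | B, transactions) / support(A, transactions)

# Toy database of four transactions (hypothetical data)
D = [
    {"beer", "nappy", "nuts"},
    {"beer", "nappy"},
    {"beer", "cola"},
    {"nappy", "milk"},
]

A, B = {"beer"}, {"nappy"}
print(support(A | B, D))    # 0.5: support of the rule beer => nappy
print(confidence(A, B, D))  # 0.666...: confidence of beer => nappy
```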
15
Mining Association Rules: An Example
16
Mining Association Rules: An Example (cont)
17
Association Rule Mining
  • So, in general association rule mining can be
    reduced to the following two steps
  • Find all frequent itemsets
  • Each itemset will occur at least as frequently
    as a minimum support count
  • Generate strong association rules from the
    frequent itemsets
  • These rules will satisfy minimum support and
    confidence measures

18
Combinatorial Explosion!
  • A major challenge in mining frequent itemsets is
    that the number of frequent itemsets generated
    can be massive
  • For example, a long frequent itemset will contain
    a combinatorial number of shorter frequent
    sub-itemsets
  • A frequent itemset of length 100 will contain
    the following number of frequent sub-itemsets:
    C(100,1) + C(100,2) + … + C(100,100)
    = 2^100 − 1 ≈ 1.27×10^30

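A quick Python check of this count (an illustrative aside):

```python
import math

# C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1
n = sum(math.comb(100, k) for k in range(1, 101))
assert n == 2**100 - 1
print(format(n, ".3g"))  # about 1.27e+30
```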
19
The Apriori Algorithm
  • Any subset of a frequent itemset must be frequent
  • If {beer, nappy, nuts} is frequent, so is {beer,
    nappy}
  • Every transaction having {beer, nappy, nuts} also
    contains {beer, nappy}
  • Apriori pruning principle: if there is any
    itemset which is infrequent, its superset should
    not be generated/tested!

20
The Apriori Algorithm (cont)
  • The Apriori algorithm is known as a candidate
    generation-and-test approach
  • Method
  • Generate length (k+1) candidate itemsets from
    length k frequent itemsets
  • Test the candidates against the DB
  • Performance studies show the algorithm's
    efficiency and scalability

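The generate-and-test loop can be sketched in a few lines of Python (a minimal sketch with my own naming; transactions are assumed to be Python sets, and generate_candidates is sketched under slide 23 below):

```python
def apriori(transactions, min_count):
    """Level-wise candidate generation-and-test (sketch)."""
    items = {i for t in transactions for i in t}
    # L1: the frequent 1-itemsets
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_count}
    frequent, k = set(Lk), 1
    while Lk:
        Ck = generate_candidates(Lk, k)  # self-join + prune (slide 23)
        # test the candidates against the DB (one scan per level)
        Lk = {c for c in Ck
              if sum(c <= t for t in transactions) >= min_count}
        frequent |= Lk
        k += 1
    return frequent
```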
21
The Apriori Algorithm An Example
(Figure: the database TDB is scanned three times; the 1st scan
yields C1 and then L1, the 2nd scan C2 and L2, and the 3rd scan
C3 and L3)
22
Important Details Of The Apriori Algorithm
  • There are two crucial questions in implementing
    the Apriori algorithm
  • How to generate candidates?
  • How to count supports of candidates?

23
Generating Candidates
  • There are 2 steps to generating candidates
  • Step 1: Self-joining Lk
  • Step 2: Pruning
  • Example of candidate generation
  • L3 = {abc, abd, acd, ace, bcd}
  • Self-joining: L3 ⋈ L3
  • abcd from abc and abd
  • acde from acd and ace
  • Pruning
  • acde is removed because ade is not in L3
  • C4 = {abcd}

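Both steps can be sketched in Python (a simpler, quadratic variant of the textbook sorted-prefix join; naming is mine). Applied to the slide's L3 it reproduces C4 = {abcd}:

```python
from itertools import combinations

def generate_candidates(Lk, k):
    """Self-join Lk with itself, then prune by the Apriori principle."""
    Ck1 = set()
    for a in Lk:
        for b in Lk:
            u = a | b
            if len(u) == k + 1:  # join step: a and b differ by one item
                # prune step: every k-subset of u must itself be in Lk
                if all(frozenset(s) in Lk for s in combinations(u, k)):
                    Ck1.add(u)
    return Ck1

L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
print(generate_candidates(L3, 3))  # {frozenset({'a', 'b', 'c', 'd'})}
```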
24
How to Count Supports Of Candidates?
  • Why is counting supports of candidates a problem?
  • The total number of candidates can be huge
  • One transaction may contain many candidates
  • Method
  • Candidate itemsets are stored in a hash-tree
  • Leaf node of hash-tree contains a list of
    itemsets and counts
  • Interior node contains a hash table
  • Subset function finds all the candidates
    contained in a transaction

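The hash-tree itself is more involved; the following dictionary-based stand-in (a deliberate simplification, not the hash-tree) illustrates the same subset-counting job:

```python
from itertools import combinations
from collections import defaultdict

def count_supports(transactions, Ck, k):
    """For each transaction, enumerate its k-subsets and count those
    that are candidates. A real hash-tree locates the matching
    candidates without materialising every subset."""
    counts = defaultdict(int)
    for t in transactions:
        for s in combinations(sorted(t), k):
            fs = frozenset(s)
            if fs in Ck:
                counts[fs] += 1
    return counts
```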
25
Generating Association Rules
  • Once all frequent itemsets have been found,
    association rules can be generated
  • Strong association rules from a frequent itemset
    are generated by calculating the confidence in
    each possible rule arising from that itemset and
    testing it against a minimum confidence threshold

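A sketch of this rule-generation step (naming is mine; counts is assumed to map each frequent itemset to its support count, and by the Apriori property every subset of a frequent itemset is itself frequent, so its count is available):

```python
from itertools import combinations

def generate_rules(frequent, counts, min_conf):
    """Emit every rule A => (l - A) from each frequent itemset l whose
    confidence, counts[l] / counts[A], meets the threshold."""
    rules = []
    for l in frequent:
        if len(l) < 2:
            continue
        for r in range(1, len(l)):
            for A in map(frozenset, combinations(l, r)):
                conf = counts[l] / counts[A]
                if conf >= min_conf:
                    rules.append((A, l - A, conf))
    return rules
```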
26
Example
27
Example
28
Challenges Of Frequent Pattern Mining
  • Challenges
  • Multiple scans of transaction database
  • Huge number of candidates
  • Tedious workload of support counting for
    candidates
  • Improving Apriori: general ideas
  • Reduce passes of transaction database scans
  • Shrink number of candidates
  • Facilitate support counting of candidates

29
Bottleneck Of Frequent-Pattern Mining
  • Multiple database scans are costly
  • Mining long patterns needs many passes of
    scanning and generates lots of candidates
  • To find frequent itemset i1i2…i100
  • # of scans: 100
  • # of candidates: 2^100 − 1 ≈ 1.27×10^30
  • Bottleneck: candidate generation-and-test

30
Mining Frequent Patterns Without Candidate
Generation
  • Techniques for mining frequent itemsets which
    avoid candidate generation include
  • FP-growth
  • Grow long patterns from short ones using local
    frequent items
  • ECLAT (Equivalence CLASS Transformation)
    algorithm
  • Uses a data representation in which transactions
    are associated with items, rather than the other
    way around (vertical data format)
  • These methods can be much faster than the Apriori
    algorithm

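To make the vertical data format concrete, here is a compact ECLAT-style sketch (hypothetical toy data and naming; real implementations add many refinements). Each item carries its tid-list, and the support of a union of items is the size of the intersection of their tid-lists, so no further database scans are needed:

```python
def eclat(prefix, items, min_count, out):
    """Depth-first search over the vertical format: `items` is a list
    of (item, tid-list) pairs; intersecting tid-lists gives supports."""
    while items:
        item, tids = items.pop()
        if len(tids) >= min_count:
            out.append((prefix | {item}, len(tids)))
            # extend the current prefix with each remaining item
            suffix = [(i, tids & t) for i, t in items
                      if len(tids & t) >= min_count]
            eclat(prefix | {item}, suffix, min_count, out)
    return out

# Vertical layout: item -> set of ids of transactions containing it
vertical = [("beer", {1, 2, 3}), ("nappy", {1, 2, 4}), ("nuts", {1, 3})]
print(eclat(frozenset(), vertical, 2, []))
```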
31
Sequence Databases and Sequential Pattern Analysis
  • Frequent patterns vs. (frequent) sequential
    patterns
  • Applications of sequential pattern mining
  • Customer shopping sequences
  • First buy computer, then CD-ROM, and then digital
    camera, within 3 months.
  • Medical treatment, natural disasters (e.g.,
    earthquakes), science engineering processes,
    stocks and markets, etc.
  • Telephone calling patterns, Weblog click streams
  • DNA sequences and gene structures

32
What Is Sequential Pattern Mining?
  • Given a set of sequences, find the complete set
    of frequent subsequences

A sequence: <(ef)(ab)(df)cb>
A sequence database (shown as a table on the original slide)
An element may contain a set of items. Items
within an element are unordered and we list them
alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a
sequential pattern
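The containment test used in this definition can be written directly (a minimal sketch; sequences are represented as lists of item sets):

```python
def is_subsequence(sub, seq):
    """True if each element (itemset) of `sub` is contained, in order,
    in some element of `seq`; greedy left-to-right matching suffices."""
    i = 0
    for element in seq:
        if i < len(sub) and sub[i] <= element:
            i += 1
    return i == len(sub)

# The slide's example: <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
sub = [{"a"}, {"b", "c"}, {"d"}, {"c"}]
seq = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]
print(is_subsequence(sub, seq))  # True
```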
33
Challenges On Sequential Pattern Mining
  • A huge number of possible sequential patterns are
    hidden in databases
  • A mining algorithm should
  • Find the complete set of patterns, when possible,
    satisfying the minimum support (frequency)
    threshold
  • Be highly efficient, scalable, involving only a
    small number of database scans
  • Be able to incorporate various kinds of
    user-specific constraints

34
A Basic Property Of Sequential Patterns: Apriori
  • A basic property: Apriori
  • If a sequence S is not frequent
  • Then none of the super-sequences of S is frequent
  • E.g., <hb> is infrequent ⇒ so are <hab> and <(ah)b>

Given support threshold min_sup = 2
35
GSP: A Generalized Sequential Pattern Mining
Algorithm
  • GSP (Generalized Sequential Pattern) mining
    algorithm proposed in 1996
  • Outline of the method
  • Initially, every item in DB is a candidate of
    length 1
  • For each level (i.e., sequences of length k)
  • Scan database to collect support count for each
    candidate sequence
  • Generate candidate length (k+1) sequences from
    length k frequent sequences using Apriori
  • Repeat until no frequent sequence or no candidate
    can be found
  • Major strength: candidate pruning by Apriori

36
Finding Length 1 Sequential Patterns
  • Examine GSP using an example
  • Initial candidates all singleton sequences
  • <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
  • Scan database once, count support for candidates

37
Generating Length 2 Candidates
51 length-2 candidates
Without the Apriori property, 8×8 + (8×7)/2 = 92
candidates would be generated
Apriori prunes 44.57% of candidates
38
Finding Length 2 Sequential Patterns
  • Scan database one more time, collect support
    count for each length 2 candidate
  • There are 19 length 2 candidates which pass the
    minimum support threshold
  • They are length 2 sequential patterns

39
Generating Length 3 Candidates And Finding Length
3 Patterns
  • Generate length 3 candidates
  • Self-join length 2 sequential patterns
  • Based on the Apriori property
  • <ab>, <aa> and <ba> are all length 2 sequential
    patterns ⇒ <aba> is a length 3 candidate
  • <(bd)>, <bb> and <db> are all length 2 sequential
    patterns ⇒ <(bd)b> is a length 3 candidate
  • 46 candidates are generated
  • Find length 3 sequential patterns
  • Scan database once more, collect support counts
    for candidates
  • 19 out of 46 candidates pass support threshold

40
The GSP Mining Process
min_sup = 2
41
The GSP Algorithm
  • Take sequences of the form <x> as length 1
    candidates
  • Scan database once, find F1, the set of length 1
    sequential patterns
  • Let k = 1; while Fk is not empty do
  • Form Ck+1, the set of length (k+1) candidates
    from Fk
  • If Ck+1 is not empty, scan database once, find
    Fk+1, the set of length (k+1) sequential patterns
  • Let k = k + 1

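The loop above, sketched in Python (assumptions: sequences are lists of item sets, is_subsequence is the sketch from slide 32, and gsp_join stands in for the Apriori-style join/prune step, which is not shown):

```python
def gsp(sequences, min_sup):
    """Level-wise GSP loop (sketch)."""
    # C1: every item occurring in the database, as a length-1 sequence
    items = {i for s in sequences for element in s for i in element}
    Ck = [[{i}] for i in items]
    patterns, k = [], 1
    while Ck:
        # one database scan per level to count candidate supports
        Fk = [c for c in Ck
              if sum(is_subsequence(c, s) for s in sequences) >= min_sup]
        patterns += Fk
        Ck = gsp_join(Fk, k)  # hypothetical join/prune step (not shown)
        k += 1
    return patterns
```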
42
Bottlenecks of GSP
  • A huge set of candidates could be generated
  • 1,000 frequent length 1 sequences generate
    1000×1000 + (1000×999)/2 = 1,499,500 length 2
    candidates!
  • Multiple scans of database in mining
  • Real challenge mining long sequential patterns
  • An exponential number of short candidates
  • A length-100 sequential pattern needs 10^30
    candidate sequences!

43
Improvements On GSP
  • FreeSpan
  • Projection-based: no candidate sequence needs to
    be generated
  • But projection can be performed at any point in
    the sequence, and the projected sequences will
    not shrink much
  • PrefixSpan
  • Projection-based
  • But only prefix-based projection: fewer
    projections and quickly shrinking sequences

44
Frequent-Pattern Mining Achievements
  • Frequent pattern mining: an important task in
    data mining
  • Frequent pattern mining methodology
  • Candidate generation-and-test vs. projection-based
    (frequent-pattern growth)
  • Various optimization methods: database partition,
    scan reduction, hash tree, sampling, border
    computation, clustering, etc.
  • Extensions of the scope of frequent-pattern
    mining algorithms
  • Mining closed frequent itemsets and max-patterns
    (e.g., MaxMiner, CLOSET, CHARM, etc.)
  • Mining multi-level, multi-dimensional frequent
    patterns with flexible support constraints
  • Constraint pushing for mining optimization
  • From frequent patterns to correlation and
    causality

45
Frequent-Pattern Mining Research Problems
  • Multi-dimensional gradient analysis: patterns
    regarding changes and differences
  • Not just counts: other measures, e.g., avg(profit)
  • Mining top-k frequent patterns without support
    constraint
  • Mining fault-tolerant associations
  • "3 out of 4 courses excellent" leads to "A in
    data mining"
  • Fascicles and database compression by frequent
    pattern mining
  • Partial periodic patterns
  • DNA sequence analysis and pattern classification

46
Questions?