Title: Business Systems Intelligence: 4. Mining Association Rules
1 Business Systems Intelligence
4. Mining Association Rules
Dr. Brian Mac Namee (www.comp.dit.ie/bmacnamee)
2 Acknowledgments
- These notes are based (heavily) on those provided by the authors to accompany Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber
- Some slides are also based on trainers' kits provided by SAS
- More information about the book is available at www-sal.cs.uiuc.edu/hanj/bk2/ and information on SAS is available at www.sas.com
3 Mining Association Rules
- Today we will look at
- Association rule mining
- Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases
- Sequential pattern mining
- Applications/extensions of frequent pattern mining
- Summary
4 What Is Association Mining?
- Association rule mining
- Finding frequent patterns, associations,
correlations, or causal structures among sets of
items or objects in transaction databases,
relational databases, and other information
repositories
5 Motivations For Association Mining
- Motivation: finding regularities in data
- What products were often purchased together?
- Beer and nappies!
- What are the subsequent purchases after buying a PC?
- What kinds of DNA are sensitive to this new drug?
- Can we automatically classify web documents?
6 Motivations For Association Mining (cont)
- Foundation for many essential data mining tasks
- Association, correlation, causality
- Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
- Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)
7 Motivations For Association Mining (cont)
- Broad applications
- Basket data analysis, cross-marketing, catalog design, sales campaign analysis
- Web log (click stream) analysis, DNA sequence analysis, etc.
8 Market Basket Analysis
- Market basket analysis is a typical example of frequent itemset mining
- Customers' buying habits are discovered by finding associations between the different items that customers place in their shopping baskets
- This information can be used to develop marketing strategies
9 Market Basket Analysis (cont)
10 Association Rule Basic Concepts
- Let I = {I1, I2, ..., Im} be a set of items
- Let D be a database of transactions, where each transaction T is a set of items such that T ⊆ I
- So, if A is a set of items, a transaction T is said to contain A if and only if A ⊆ T
- An association rule is an implication A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅
11 Association Rule Support & Confidence
- We say that an association rule A ⇒ B holds in the transaction set D with support, s, and confidence, c
- The support of the association rule is given as the percentage of transactions in D that contain both A and B (i.e. A ∪ B)
- So, the support can be considered the probability P(A ∪ B)
12 Association Rule Support & Confidence (cont)
- The confidence of the association rule is given as the percentage of transactions in D containing A that also contain B
- So, the confidence can be considered the conditional probability P(B|A)
- Association rules that satisfy minimum support and confidence values are said to be strong
13 Itemsets & Frequent Itemsets
- An itemset is a set of items
- A k-itemset is an itemset that contains k items
- The occurrence frequency of an itemset is the number of transactions that contain the itemset
- This is also known more simply as the frequency, support count or count
- An itemset is said to be frequent if its support count satisfies a minimum support count threshold
- The set of frequent k-itemsets is denoted Lk
14 Support & Confidence Again
- Support and confidence values can be calculated as follows
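Consistent with the definitions on the two preceding slides (with |D| the number of transactions in D):

\mathrm{support}(A \Rightarrow B) = P(A \cup B) = \frac{\mathrm{support\_count}(A \cup B)}{|D|}

\mathrm{confidence}(A \Rightarrow B) = P(B \mid A) = \frac{\mathrm{support\_count}(A \cup B)}{\mathrm{support\_count}(A)}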
15 Mining Association Rules: An Example
16 Mining Association Rules: An Example (cont)
17 Association Rule Mining
- So, in general, association rule mining can be reduced to the following two steps:
- Find all frequent itemsets
- Each itemset will occur at least as frequently as a minimum support count
- Generate strong association rules from the frequent itemsets
- These rules will satisfy minimum support and confidence measures
18 Combinatorial Explosion!
- A major challenge in mining frequent itemsets is that the number of frequent itemsets generated can be massive
- For example, a long frequent itemset will contain a combinatorial number of shorter frequent sub-itemsets
- A frequent itemset of length 100 will contain the following number of frequent sub-itemsets
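That number is:

\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}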
19 The Apriori Algorithm
- Any subset of a frequent itemset must be frequent
- If {beer, nappy, nuts} is frequent, so is {beer, nappy}
- Every transaction containing {beer, nappy, nuts} also contains {beer, nappy}
- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated/tested!
20 The Apriori Algorithm (cont)
- The Apriori algorithm is known as a candidate generation-and-test approach
- Method:
- Generate length (k+1) candidate itemsets from length k frequent itemsets
- Test the candidates against the DB (a sketch of this loop follows below)
- Performance studies show the algorithm's efficiency and scalability
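A minimal Python sketch of this generate-and-test loop (illustrative code, not from the slides; minimum support is given here as an absolute count):

from itertools import combinations

def apriori(transactions, min_support_count):
    # First database scan: count 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s for s, c in counts.items() if c >= min_support_count}
    all_frequent = set(frequent)

    k = 1
    while frequent:
        # Generate length (k+1) candidates by joining frequent k-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # Apriori pruning: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent for sub in combinations(c, k))}
        # Test the candidates against the database (one scan per level)
        frequent = {c for c in candidates
                    if sum(1 for t in transactions if c <= set(t)) >= min_support_count}
        all_frequent |= frequent
        k += 1
    return all_frequent

For example, apriori([{'beer', 'nappies'}, {'beer', 'nuts'}, {'beer', 'nappies', 'nuts'}], 2) returns the five frequent itemsets {beer}, {nappies}, {nuts}, {beer, nappies} and {beer, nuts}.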
21 The Apriori Algorithm: An Example
- [Worked example diagram: a 1st scan of the transaction database TDB produces candidate set C1 and frequent set L1; self-joining L1 gives C2 and a 2nd scan gives L2; C3 is generated from L2 and a 3rd scan gives L3]
22 Important Details Of The Apriori Algorithm
- There are two crucial questions in implementing the Apriori algorithm:
- How to generate candidates?
- How to count the supports of candidates?
23 Generating Candidates
- There are 2 steps to generating candidates:
- Step 1: self-joining Lk
- Step 2: pruning
- Example of candidate generation (also sketched in code below):
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
- abcd from abc and abd
- acde from acd and ace
- Pruning:
- acde is removed because ade is not in L3
- C4 = {abcd}
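A small Python sketch of these two steps, reproducing the example above (itemsets represented as sorted tuples; the function name is just illustrative):

from itertools import combinations

def apriori_gen(Lk, k):
    Lk = set(Lk)
    # Step 1: self-join -- merge pairs of k-itemsets that agree on their first k-1 items
    candidates = {a[:k - 1] + (min(a[-1], b[-1]), max(a[-1], b[-1]))
                  for a in Lk for b in Lk
                  if a[:k - 1] == b[:k - 1] and a[-1] != b[-1]}
    # Step 2: prune -- every k-subset of a candidate must itself be in Lk
    return {c for c in candidates
            if all(sub in Lk for sub in combinations(c, k))}

# The slide's example: L3 = {abc, abd, acd, ace, bcd}
L3 = {('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')}
print(apriori_gen(L3, 3))   # {('a','b','c','d')} -- acde is pruned because ade is not in L3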
24 How To Count Supports Of Candidates?
- Why is counting supports of candidates a problem?
- The total number of candidates can be huge
- One transaction may contain many candidates
- Method:
- Candidate itemsets are stored in a hash-tree
- A leaf node of the hash-tree contains a list of itemsets and counts
- An interior node contains a hash table
- A subset function finds all the candidates contained in a transaction (see the simplified sketch below)
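As a simplification (a plain dictionary in place of the hash-tree, which is only an efficiency device), counting supports might look like this in Python:

from itertools import combinations

def count_supports(transactions, candidates, k):
    # Candidates are sorted tuples; the dict stands in for the hash-tree
    counts = {c: 0 for c in candidates}
    for t in transactions:
        # Enumerating the k-subsets of a transaction and looking each one up
        # plays the role of the subset function on the hash-tree
        for sub in combinations(sorted(t), k):
            if sub in counts:
                counts[sub] += 1
    return counts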
25 Generating Association Rules
- Once all frequent itemsets have been found, association rules can be generated
- Strong association rules are generated from a frequent itemset by calculating the confidence of each possible rule arising from that itemset and testing it against a minimum confidence threshold (see the sketch below)
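A sketch of this step in Python (hypothetical names; it assumes the support counts of all frequent itemsets and their subsets were kept during mining):

from itertools import combinations

def gen_rules(freq_itemset, support_count, min_conf):
    # support_count maps frozensets to their counts from the mining phase
    rules = []
    items = frozenset(freq_itemset)
    for r in range(1, len(items)):
        for antecedent in map(frozenset, combinations(items, r)):
            consequent = items - antecedent
            # confidence(A => B) = count(A u B) / count(A)
            conf = support_count[items] / support_count[antecedent]
            if conf >= min_conf:
                rules.append((set(antecedent), set(consequent), conf))
    return rules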
26 Example
27 Example
28 Challenges Of Frequent Pattern Mining
- Challenges:
- Multiple scans of the transaction database
- Huge number of candidates
- Tedious workload of support counting for candidates
- Improving Apriori: general ideas
- Reduce the number of transaction database scans
- Shrink the number of candidates
- Facilitate support counting of candidates
29 Bottleneck Of Frequent-Pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
- To find the frequent itemset i1 i2 ... i100:
- # of scans: 100
- # of candidates: 2^100 - 1 ≈ 1.27 × 10^30
- Bottleneck: candidate generation-and-test
30 Mining Frequent Patterns Without Candidate Generation
- Techniques for mining frequent itemsets which avoid candidate generation include:
- FP-growth
- Grows long patterns from short ones using local frequent items
- ECLAT (Equivalence CLASS Transformation) algorithm
- Uses a data representation in which transactions are associated with items, rather than the other way around (the vertical data format, illustrated below)
- These methods can be much faster than the Apriori algorithm
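A minimal sketch of the vertical data format (made-up example data; only the representation is shown, not the full ECLAT algorithm):

def vertical_format(transactions):
    # Map each item to the set of transaction IDs that contain it
    tidlists = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

# The support of an itemset is the size of the intersection of its tid-lists,
# so once the tid-lists are built no further database scans are needed
tidlists = vertical_format([{'beer', 'nappies'}, {'beer', 'nuts'}, {'beer', 'nappies', 'nuts'}])
support = len(tidlists['beer'] & tidlists['nappies'])  # 2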
31 Sequence Databases And Sequential Pattern Analysis
- Frequent patterns vs. (frequent) sequential patterns
- Applications of sequential pattern mining:
- Customer shopping sequences
- First buy a computer, then a CD-ROM, and then a digital camera, within 3 months
- Medical treatment, natural disasters (e.g., earthquakes), science and engineering processes, stocks and markets, etc.
- Telephone calling patterns, Weblog click streams
- DNA sequences and gene structures
32 What Is Sequential Pattern Mining?
- Given a set of sequences, find the complete set of frequent subsequences
- A sequence: <(ef) (ab) (df) c b>
- A sequence database
- An element may contain a set of items. Items within an element are unordered and we list them alphabetically.
- <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)> (a containment check is sketched below)
- Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
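The containment test above can be sketched in Python as follows (sequences represented as lists of item sets; an illustrative helper, not from the slides):

def is_subsequence(sub, seq):
    # Each element of sub must be a subset of a distinct element of seq,
    # with the matches occurring in order; greedy matching suffices
    i = 0
    for element in seq:
        if i < len(sub) and sub[i] <= element:
            i += 1
    return i == len(sub)

# The example above: <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
print(is_subsequence([{'a'}, {'b', 'c'}, {'d'}, {'c'}],
                     [{'a'}, {'a', 'b', 'c'}, {'a', 'c'}, {'d'}, {'c', 'f'}]))  # True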
33 Challenges On Sequential Pattern Mining
- A huge number of possible sequential patterns are hidden in databases
- A mining algorithm should:
- Find the complete set of patterns satisfying the minimum support (frequency) threshold, when possible
- Be highly efficient and scalable, involving only a small number of database scans
- Be able to incorporate various kinds of user-specific constraints
34 A Basic Property Of Sequential Patterns: Apriori
- A basic property: Apriori
- If a sequence S is not frequent
- Then none of the super-sequences of S is frequent
- E.g., if <hb> is infrequent, so are <hab> and <(ah)b>
- Given support threshold min_sup = 2
35 GSP: A Generalized Sequential Pattern Mining Algorithm
- GSP (Generalized Sequential Pattern) mining algorithm, proposed in 1996
- Outline of the method:
- Initially, every item in the DB is a candidate of length 1
- For each level (i.e., sequences of length k):
- Scan the database to collect the support count for each candidate sequence
- Generate candidate length (k+1) sequences from length k frequent sequences using Apriori
- Repeat until no frequent sequence or no candidate can be found
- Major strength: candidate pruning by Apriori
36 Finding Length 1 Sequential Patterns
- Examine GSP using an example
- Initial candidates: all singleton sequences
- <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
- Scan the database once, counting the support for each candidate
37 Generating Length 2 Candidates
- 51 length-2 candidates
- Without the Apriori property there would be 8×8 + (8×7)/2 = 92 candidates
- Apriori prunes 44.57% of the candidates (see the arithmetic below)
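The arithmetic behind these figures (8 frequent length 1 items, giving ordered pairs <x y> plus unordered element pairs <(x y)>):

8 \times 8 + \frac{8 \times 7}{2} = 64 + 28 = 92, \qquad 1 - \frac{51}{92} \approx 0.4457 \approx 44.57\%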
38 Finding Length 2 Sequential Patterns
- Scan the database one more time, collecting the support count for each length 2 candidate
- There are 19 length 2 candidates which pass the minimum support threshold
- They are the length 2 sequential patterns
39 Generating Length 3 Candidates And Finding Length 3 Patterns
- Generate length 3 candidates:
- Self-join length 2 sequential patterns, based on the Apriori property
- <ab>, <aa> and <ba> are all length 2 sequential patterns, so <aba> is a length 3 candidate
- <(bd)>, <bb> and <db> are all length 2 sequential patterns, so <(bd)b> is a length 3 candidate
- 46 candidates are generated
- Find length 3 sequential patterns:
- Scan the database once more, collecting support counts for the candidates
- 19 out of 46 candidates pass the support threshold
40 The GSP Mining Process
- [Diagram: the level-wise GSP process of candidate generation and database scans, with min_sup = 2]
41 The GSP Algorithm
- Take sequences of the form <x> as length 1 candidates
- Scan the database once to find F1, the set of length 1 sequential patterns
- Let k = 1; while Fk is not empty do:
- Form Ck+1, the set of length (k+1) candidates, from Fk
- If Ck+1 is not empty, scan the database once to find Fk+1, the set of length (k+1) sequential patterns
- Let k = k + 1
42 Bottlenecks Of GSP
- A huge set of candidates could be generated
- 1,000 frequent length 1 sequences generate 1,000 × 1,000 + (1,000 × 999)/2 = 1,499,500 length 2 candidates!
- Multiple scans of the database during mining
- The real challenge: mining long sequential patterns
- An exponential number of short candidates
- A length-100 sequential pattern needs 10^30 candidate sequences!
43 Improvements On GSP
- FreeSpan:
- Projection-based: no candidate sequence needs to be generated
- But projection can be performed at any point in the sequence, and the projected sequences will not shrink much
- PrefixSpan:
- Also projection-based
- But only prefix-based projection: fewer projections and quickly shrinking sequences
44 Frequent-Pattern Mining: Achievements
- Frequent pattern mining: an important task in data mining
- Frequent pattern mining methodology:
- Candidate generation-and-test vs. projection-based (frequent-pattern growth)
- Various optimization methods: database partitioning, scan reduction, hash trees, sampling, border computation, clustering, etc.
- Related frequent-pattern mining algorithm scope extensions:
- Mining closed frequent itemsets and max-patterns (e.g., MaxMiner, CLOSET, CHARM, etc.)
- Mining multi-level, multi-dimensional frequent patterns with flexible support constraints
- Constraint pushing for mining optimization
- From frequent patterns to correlation and causality
45 Frequent-Pattern Mining: Research Problems
- Multi-dimensional gradient analysis: patterns regarding changes and differences
- Not just counts, but other measures too, e.g., avg(profit)
- Mining top-k frequent patterns without a support constraint
- Mining fault-tolerant associations
- E.g., "3 out of 4 courses excellent" leads to an A in data mining
- Fascicles and database compression by frequent pattern mining
- Partial periodic patterns
- DNA sequence analysis and pattern classification
46 Questions?