Title: Business Systems Intelligence: 4. Mining Association Rules
1 Business Systems Intelligence
4. Mining Association Rules
Dr. Brian Mac Namee (www.comp.dit.ie/bmacnamee)
2 Acknowledgments
- These notes are based (heavily) on those provided by the authors to accompany Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber
- Some slides are also based on trainers' kits provided by SAS
- More information about the book is available at www-sal.cs.uiuc.edu/hanj/bk2/ and information on SAS is available at www.sas.com
3 Mining Association Rules
- Today we will look at
- Association rule mining
- Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases
- Sequential pattern mining
- Applications/extensions of frequent pattern mining
- Summary
4 What Is Association Mining?
- Association rule mining
- Finding frequent patterns, associations,
correlations, or causal structures among sets of
items or objects in transaction databases,
relational databases, and other information
repositories
5 Motivations For Association Mining
- Motivation: finding regularities in data
- What products were often purchased together?
- Beer and nappies!
- What are the subsequent purchases after buying a PC?
- What kinds of DNA are sensitive to this new drug?
- Can we automatically classify web documents?
6 Motivations For Association Mining (cont)
- Foundation for many essential data mining tasks
- Association, correlation, causality
- Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
- Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)
7 Motivations For Association Mining (cont)
- Broad applications
- Basket data analysis, cross-marketing, catalog design, sales campaign analysis
- Web log (click stream) analysis, DNA sequence analysis, etc.
8 Market Basket Analysis
- Market basket analysis is a typical example of frequent itemset mining
- Customers' buying habits are discovered by finding associations between the different items that customers place in their shopping baskets
- This information can be used to develop marketing strategies
9 Market Basket Analysis (cont)
10 Association Rule Basic Concepts
- Let I = {I1, I2, ..., Im} be a set of items
- Let D be a database of transactions, where each transaction T is a set of items such that T ⊆ I
- So, if A is a set of items, a transaction T is said to contain A if and only if A ⊆ T
- An association rule is an implication A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅
11 Association Rule Support & Confidence
- We say that an association rule A ⇒ B holds in the transaction set D with support, s, and confidence, c
- The support of the association rule is given as the percentage of transactions in D that contain both A and B (i.e. A ∪ B)
- So, the support can be considered the probability P(A ∪ B)
12 Association Rule Support & Confidence (cont)
- The confidence of the association rule is given as the percentage of transactions in D containing A that also contain B
- So, the confidence can be considered the conditional probability P(B|A)
- Association rules that satisfy minimum support and confidence values are said to be strong
13 Itemsets & Frequent Itemsets
- An itemset is a set of items
- A k-itemset is an itemset that contains k items
- The occurrence frequency of an itemset is the number of transactions that contain the itemset
- This is also known more simply as the frequency, support count or count
- An itemset is said to be frequent if its support count satisfies a minimum support count threshold
- The set of frequent k-itemsets is denoted Lk
14 Support & Confidence Again
- Support and confidence values can be calculated as follows
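Consistent with the definitions on the two preceding slides (with |D| the number of transactions in D):

\mathrm{support}(A \Rightarrow B) = P(A \cup B) = \frac{\mathrm{support\_count}(A \cup B)}{|D|}

\mathrm{confidence}(A \Rightarrow B) = P(B \mid A) = \frac{\mathrm{support\_count}(A \cup B)}{\mathrm{support\_count}(A)}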
15 Mining Association Rules: An Example
16 Mining Association Rules: An Example (cont)
17 Association Rule Mining
- So, in general, association rule mining can be reduced to the following two steps:
- Find all frequent itemsets
- Each itemset will occur at least as frequently as a minimum support count
- Generate strong association rules from the frequent itemsets
- These rules will satisfy minimum support and confidence measures
18 Combinatorial Explosion!
- A major challenge in mining frequent itemsets is that the number of frequent itemsets generated can be massive
- For example, a long frequent itemset will contain a combinatorial number of shorter frequent sub-itemsets
- A frequent itemset of length 100 will contain the following number of frequent sub-itemsets
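That number is:

\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}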
19 The Apriori Algorithm
- Any subset of a frequent itemset must be frequent
- If {beer, nappy, nuts} is frequent, so is {beer, nappy}
- Every transaction containing {beer, nappy, nuts} also contains {beer, nappy}
- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated/tested!
20 The Apriori Algorithm (cont)
- The Apriori algorithm is known as a candidate generation-and-test approach
- Method:
- Generate length (k+1) candidate itemsets from length k frequent itemsets
- Test the candidates against the DB (a sketch of this loop follows below)
- Performance studies show the algorithm's efficiency and scalability
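A minimal Python sketch of this generate-and-test loop (illustrative code, not from the slides; minimum support is given here as an absolute count):

from itertools import combinations

def apriori(transactions, min_support_count):
    # First database scan: count 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s for s, c in counts.items() if c >= min_support_count}
    all_frequent = set(frequent)

    k = 1
    while frequent:
        # Generate length (k+1) candidates by joining frequent k-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # Apriori pruning: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent for sub in combinations(c, k))}
        # Test the candidates against the database (one scan per level)
        frequent = {c for c in candidates
                    if sum(1 for t in transactions if c <= set(t)) >= min_support_count}
        all_frequent |= frequent
        k += 1
    return all_frequent

For example, apriori([{'beer', 'nappies'}, {'beer', 'nuts'}, {'beer', 'nappies', 'nuts'}], 2) returns the five frequent itemsets {beer}, {nappies}, {nuts}, {beer, nappies} and {beer, nuts}.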
21 The Apriori Algorithm: An Example
- [Worked example diagram: a 1st scan of the transaction database TDB produces candidate set C1 and frequent set L1; self-joining L1 gives C2 and a 2nd scan gives L2; C3 is generated from L2 and a 3rd scan gives L3]
22 Important Details Of The Apriori Algorithm
- There are two crucial questions in implementing the Apriori algorithm:
- How to generate candidates?
- How to count the supports of candidates?
23 Generating Candidates
- There are 2 steps to generating candidates:
- Step 1: self-joining Lk
- Step 2: pruning
- Example of candidate generation (also sketched in code below):
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
- abcd from abc and abd
- acde from acd and ace
- Pruning:
- acde is removed because ade is not in L3
- C4 = {abcd}
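A small Python sketch of these two steps, reproducing the example above (itemsets represented as sorted tuples; the function name is just illustrative):

from itertools import combinations

def apriori_gen(Lk, k):
    Lk = set(Lk)
    # Step 1: self-join -- merge pairs of k-itemsets that agree on their first k-1 items
    candidates = {a[:k - 1] + (min(a[-1], b[-1]), max(a[-1], b[-1]))
                  for a in Lk for b in Lk
                  if a[:k - 1] == b[:k - 1] and a[-1] != b[-1]}
    # Step 2: prune -- every k-subset of a candidate must itself be in Lk
    return {c for c in candidates
            if all(sub in Lk for sub in combinations(c, k))}

# The slide's example: L3 = {abc, abd, acd, ace, bcd}
L3 = {('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')}
print(apriori_gen(L3, 3))   # {('a','b','c','d')} -- acde is pruned because ade is not in L3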
24 How To Count Supports Of Candidates?
- Why is counting supports of candidates a problem?
- The total number of candidates can be huge
- One transaction may contain many candidates
- Method:
- Candidate itemsets are stored in a hash-tree
- A leaf node of the hash-tree contains a list of itemsets and counts
- An interior node contains a hash table
- A subset function finds all the candidates contained in a transaction (see the simplified sketch below)
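As a simplification (a plain dictionary in place of the hash-tree, which is only an efficiency device), counting supports might look like this in Python:

from itertools import combinations

def count_supports(transactions, candidates, k):
    # Candidates are sorted tuples; the dict stands in for the hash-tree
    counts = {c: 0 for c in candidates}
    for t in transactions:
        # Enumerating the k-subsets of a transaction and looking each one up
        # plays the role of the subset function on the hash-tree
        for sub in combinations(sorted(t), k):
            if sub in counts:
                counts[sub] += 1
    return counts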
25 Generating Association Rules
- Once all frequent itemsets have been found, association rules can be generated
- Strong association rules are generated from a frequent itemset by calculating the confidence of each possible rule arising from that itemset and testing it against a minimum confidence threshold (see the sketch below)
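A sketch of this step in Python (hypothetical names; it assumes the support counts of all frequent itemsets and their subsets were kept during mining):

from itertools import combinations

def gen_rules(freq_itemset, support_count, min_conf):
    # support_count maps frozensets to their counts from the mining phase
    rules = []
    items = frozenset(freq_itemset)
    for r in range(1, len(items)):
        for antecedent in map(frozenset, combinations(items, r)):
            consequent = items - antecedent
            # confidence(A => B) = count(A u B) / count(A)
            conf = support_count[items] / support_count[antecedent]
            if conf >= min_conf:
                rules.append((set(antecedent), set(consequent), conf))
    return rules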
26 Example
27 Example
28 Challenges Of Frequent Pattern Mining
- Challenges:
- Multiple scans of the transaction database
- Huge number of candidates
- Tedious workload of support counting for candidates
- Improving Apriori: general ideas
- Reduce the number of transaction database scans
- Shrink the number of candidates
- Facilitate support counting of candidates
29 Bottleneck Of Frequent-Pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
- To find the frequent itemset i1 i2 ... i100:
- # of scans: 100
- # of candidates: 2^100 - 1 ≈ 1.27 × 10^30
- Bottleneck: candidate generation-and-test
30 Mining Frequent Patterns Without Candidate Generation
- Techniques for mining frequent itemsets which avoid candidate generation include:
- FP-growth
- Grows long patterns from short ones using local frequent items
- ECLAT (Equivalence CLASS Transformation) algorithm
- Uses a data representation in which transactions are associated with items, rather than the other way around (the vertical data format, illustrated below)
- These methods can be much faster than the Apriori algorithm
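A minimal sketch of the vertical data format (made-up example data; only the representation is shown, not the full ECLAT algorithm):

def vertical_format(transactions):
    # Map each item to the set of transaction IDs that contain it
    tidlists = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

# The support of an itemset is the size of the intersection of its tid-lists,
# so once the tid-lists are built no further database scans are needed
tidlists = vertical_format([{'beer', 'nappies'}, {'beer', 'nuts'}, {'beer', 'nappies', 'nuts'}])
support = len(tidlists['beer'] & tidlists['nappies'])  # 2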
31 Sequence Databases And Sequential Pattern Analysis
- Frequent patterns vs. (frequent) sequential patterns
- Applications of sequential pattern mining:
- Customer shopping sequences
- First buy a computer, then a CD-ROM, and then a digital camera, within 3 months
- Medical treatment, natural disasters (e.g., earthquakes), science and engineering processes, stocks and markets, etc.
- Telephone calling patterns, Weblog click streams
- DNA sequences and gene structures
32 What Is Sequential Pattern Mining?
- Given a set of sequences, find the complete set of frequent subsequences
- A sequence: <(ef) (ab) (df) c b>
- A sequence database
- An element may contain a set of items. Items within an element are unordered and we list them alphabetically.
- <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)> (a containment check is sketched below)
- Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
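The containment test above can be sketched in Python as follows (sequences represented as lists of item sets; an illustrative helper, not from the slides):

def is_subsequence(sub, seq):
    # Each element of sub must be a subset of a distinct element of seq,
    # with the matches occurring in order; greedy matching suffices
    i = 0
    for element in seq:
        if i < len(sub) and sub[i] <= element:
            i += 1
    return i == len(sub)

# The example above: <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
print(is_subsequence([{'a'}, {'b', 'c'}, {'d'}, {'c'}],
                     [{'a'}, {'a', 'b', 'c'}, {'a', 'c'}, {'d'}, {'c', 'f'}]))  # True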
33 Challenges On Sequential Pattern Mining
- A huge number of possible sequential patterns are hidden in databases
- A mining algorithm should:
- Find the complete set of patterns satisfying the minimum support (frequency) threshold, when possible
- Be highly efficient and scalable, involving only a small number of database scans
- Be able to incorporate various kinds of user-specific constraints
34 A Basic Property Of Sequential Patterns: Apriori
- A basic property: Apriori
- If a sequence S is not frequent
- Then none of the super-sequences of S is frequent
- E.g., if <hb> is infrequent, so are <hab> and <(ah)b>
- Given support threshold min_sup = 2
35 GSP: A Generalized Sequential Pattern Mining Algorithm
- GSP (Generalized Sequential Pattern) mining algorithm, proposed in 1996
- Outline of the method:
- Initially, every item in the DB is a candidate of length 1
- For each level (i.e., sequences of length k):
- Scan the database to collect the support count for each candidate sequence
- Generate candidate length (k+1) sequences from length k frequent sequences using Apriori
- Repeat until no frequent sequence or no candidate can be found
- Major strength: candidate pruning by Apriori
36 Finding Length 1 Sequential Patterns
- Examine GSP using an example
- Initial candidates: all singleton sequences
- <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
- Scan the database once, counting the support for each candidate
37 Generating Length 2 Candidates
- 51 length-2 candidates
- Without the Apriori property there would be 8×8 + (8×7)/2 = 92 candidates
- Apriori prunes 44.57% of the candidates (see the arithmetic below)
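The arithmetic behind these figures (8 frequent length 1 items, giving ordered pairs <x y> plus unordered element pairs <(x y)>):

8 \times 8 + \frac{8 \times 7}{2} = 64 + 28 = 92, \qquad 1 - \frac{51}{92} \approx 0.4457 \approx 44.57\%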
38 Finding Length 2 Sequential Patterns
- Scan the database one more time, collecting the support count for each length 2 candidate
- There are 19 length 2 candidates which pass the minimum support threshold
- They are the length 2 sequential patterns
39 Generating Length 3 Candidates And Finding Length 3 Patterns
- Generate length 3 candidates:
- Self-join length 2 sequential patterns, based on the Apriori property
- <ab>, <aa> and <ba> are all length 2 sequential patterns, so <aba> is a length 3 candidate
- <(bd)>, <bb> and <db> are all length 2 sequential patterns, so <(bd)b> is a length 3 candidate
- 46 candidates are generated
- Find length 3 sequential patterns:
- Scan the database once more, collecting support counts for the candidates
- 19 out of 46 candidates pass the support threshold
40 The GSP Mining Process
- [Diagram: the level-wise GSP process of candidate generation and database scans, with min_sup = 2]
41 The GSP Algorithm
- Take sequences of the form <x> as length 1 candidates
- Scan the database once to find F1, the set of length 1 sequential patterns
- Let k = 1; while Fk is not empty do:
- Form Ck+1, the set of length (k+1) candidates, from Fk
- If Ck+1 is not empty, scan the database once to find Fk+1, the set of length (k+1) sequential patterns
- Let k = k + 1
42 Bottlenecks Of GSP
- A huge set of candidates could be generated
- 1,000 frequent length 1 sequences generate 1,000 × 1,000 + (1,000 × 999)/2 = 1,499,500 length 2 candidates!
- Multiple scans of the database during mining
- The real challenge: mining long sequential patterns
- An exponential number of short candidates
- A length-100 sequential pattern needs 10^30 candidate sequences!
43 Improvements On GSP
- FreeSpan:
- Projection-based: no candidate sequence needs to be generated
- But projection can be performed at any point in the sequence, and the projected sequences will not shrink much
- PrefixSpan:
- Also projection-based
- But only prefix-based projection: fewer projections and quickly shrinking sequences
44 Frequent-Pattern Mining: Achievements
- Frequent pattern mining: an important task in data mining
- Frequent pattern mining methodology:
- Candidate generation-and-test vs. projection-based (frequent-pattern growth)
- Various optimization methods: database partitioning, scan reduction, hash trees, sampling, border computation, clustering, etc.
- Related frequent-pattern mining algorithm scope extensions:
- Mining closed frequent itemsets and max-patterns (e.g., MaxMiner, CLOSET, CHARM, etc.)
- Mining multi-level, multi-dimensional frequent patterns with flexible support constraints
- Constraint pushing for mining optimization
- From frequent patterns to correlation and causality
45 Frequent-Pattern Mining: Research Problems
- Multi-dimensional gradient analysis: patterns regarding changes and differences
- Not just counts, but other measures too, e.g., avg(profit)
- Mining top-k frequent patterns without a support constraint
- Mining fault-tolerant associations
- E.g., "3 out of 4 courses excellent" leads to an A in data mining
- Fascicles and database compression by frequent pattern mining
- Partial periodic patterns
- DNA sequence analysis and pattern classification
46 Questions?