Title: Association Rule Mining: Apriori Algorithm
1 Association Rule Mining: Apriori Algorithm
- CIT365 Data Mining and Data Warehousing
- Bajuna Salehe
- The Institute of Finance Management, Computing and IT Dept.
2 Brief About Association Rule Mining
- The results of Market Basket Analysis allowed
companies to more fully understand purchasing
behaviour and, as a result, better target their
market audiences.
- Association mining is user-centric: the
objective is the elicitation of useful (or
interesting) rules from which new knowledge can
be derived.
3 Brief About Association Rule Mining
- Association mining has been applied to many
different domains, including market basket and
risk analysis in commercial environments,
epidemiology, clinical medicine, fluid dynamics,
astrophysics, crime prevention, and
counter-terrorism, all areas in which the
relationships between objects can provide useful
knowledge.
4 Example of an Association Rule
- For example, an insurance company, by finding a
strong correlation between two policies A and B,
of the form A → B, indicating that customers who
held policy A were also likely to hold policy B,
could more efficiently market policy B to those
clients that held policy A but not B.
5 Brief About Association Rule Mining
- Association mining analysis is a two-part process.
- First, the identification of sets of items, or
itemsets, within the dataset.
- Second, the subsequent derivation of inferences
from these itemsets.
6 Why Use Support and Confidence?
- Support reflects the statistical significance of
a rule. Rules that have very low support are
rarely observed and are thus more likely to
occur by chance. For example, the rule A → B may
not be significant if both items are present
together in only one transaction in the table
from last week's lecture.
7 Why Use Support and Confidence?
- Additionally, low-support rules may not be
actionable from a marketing perspective, because
it is not profitable to promote items that are
seldom bought together by customers.
- For these reasons, support is often used as a
filter to eliminate uninteresting rules.
8 Why Use Support and Confidence?
- Confidence is another useful metric because it
measures how reliable the inference made by a
rule is.
- For a given rule A → B, the higher the
confidence, the more likely it is for itemset B
to be present in transactions that contain A. In
a sense, confidence provides an estimate of the
conditional probability of B given A.
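As a concrete illustration, the sketch below computes the support of an itemset and the confidence of a rule over a small list of transactions; the helper names and the sample basket data are made up for this example.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A ∪ B) / support(A)."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

# Hypothetical basket data, just to show the calls
transactions = [
    {"Bread", "Cheese", "Milk"},
    {"Bread", "Milk"},
    {"Cheese", "Milk"},
    {"Bread", "Cheese"},
]
print(support({"Bread", "Milk"}, transactions))       # 0.5
print(confidence({"Bread"}, {"Milk"}, transactions))   # ~0.67
```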
9 Causality and Association Rules
- Finally, it is worth noting that the inference
made by an association rule does not necessarily
imply causality.
- Instead, the implication indicates a strong
co-occurrence relationship between items in the
antecedent and consequent of the rule.
10 Causality and Association Rules
- Causality, on the other hand, requires a
distinction between the causal and effect
attributes of the data and typically involves
relationships occurring over time (e.g., ozone
depletion leads to global warming).
11 More About Support and Confidence
- The support values for the following candidate rules
- {Bread, Cheese} → {Milk}, {Bread, Milk} → {Cheese},
{Cheese, Milk} → {Bread}, {Bread} → {Cheese, Milk},
{Milk} → {Bread, Cheese}, {Cheese} → {Bread, Milk}
- are identical, since they all correspond to the
same itemset, {Bread, Cheese, Milk}.
- If the itemset is infrequent, then all six
candidate rules can be immediately pruned without
having to compute their confidence values.
12 More About Support and Confidence
- Therefore, a common strategy adopted by many
association rule mining algorithms is to
decompose the problem into two major subtasks:
- Frequent Itemset Generation: find all itemsets
that satisfy the minsup threshold. These itemsets
are called frequent itemsets.
- Rule Generation: extract high-confidence
association rules from the frequent itemsets
found in the previous step. These rules are
called strong rules.
13 Frequent Itemset Generation
- A lattice structure can be used to enumerate the
list of possible itemsets.
- For example, the figure below illustrates all
itemsets derivable from the set {A, B, C, D}.
14 Frequent Itemset Generation
(Figure: the lattice of all itemsets derivable from {A, B, C, D})
15 Frequent Itemset Generation
- In general, a data set that contains d items may
generate up to 2^d − 1 possible itemsets,
excluding the null set.
- Because d can be very large in many commercial
databases, frequent itemset generation is an
exponentially expensive task.
16 Frequent Itemset Generation
- A naive approach for finding frequent itemsets is
to determine the support count for every
candidate itemset in the lattice structure.
- To do this, we need to match each candidate
against every transaction.
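A minimal sketch of this brute-force approach is given below; it enumerates all 2^d − 1 non-empty itemsets over a small item universe and counts support by scanning every transaction. The function name and the sample data are illustrative only.

```python
from itertools import combinations

def brute_force_frequent_itemsets(transactions, minsup):
    """Enumerate every non-empty itemset and count its support with a full scan."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    # 2^d - 1 candidates for d items: every non-empty subset of the item universe
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / n >= minsup:
                frequent[candidate] = count
    return frequent

# Illustrative data (the same transactions as the worked example later in the lecture)
transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(brute_force_frequent_itemsets(transactions, minsup=0.5))
```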
17 Apriori Algorithm
- This algorithm belongs to the group of candidate
generation algorithms, which identify candidate
itemsets and then test them.
- The common data structures used in the Apriori
algorithm are trees.
- Two common types of tree data structure used in
Apriori are:
- Set Enumeration Tree
- Prefix Tree
18 Data Structures for the Apriori Algorithm
19 Apriori Algorithm
- Frequent itemsets (also called large itemsets)
are those itemsets whose support is at least
minSupp (the minimum support).
- The Apriori property (downward closure property)
says that all subsets of a frequent itemset are
also frequent itemsets.
- The use of support for pruning candidate itemsets
is guided by the following principle (the Apriori
Principle):
- If an itemset is frequent, then all of its subsets
must also be frequent.
20 Reminder: Steps of Association Rule Mining
- The major steps in association rule mining are:
- Frequent itemset generation
- Rule derivation
21 Apriori Algorithm
- Any subset of a frequent itemset must be frequent.
- If {beer, nappy, nuts} is frequent, so is
{beer, nappy}.
- Every transaction containing {beer, nappy, nuts}
also contains {beer, nappy}.
- Apriori pruning principle: if any itemset is
infrequent, its supersets should not be
generated or tested!
22 Apriori Algorithm
- The Apriori algorithm uses the downward closure
property to prune unnecessary branches from
further consideration. It needs two parameters,
minSupp and minConf. minSupp is used for
generating frequent itemsets and minConf is used
for rule derivation.
23 The Apriori Algorithm: An Example
Database TDB (minimum support count = 2)
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan -> C1 (candidate 1-itemsets)
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1 (frequent 1-itemsets)
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2 (candidate 2-itemsets)
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan -> C2 with support counts
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2 (frequent 2-itemsets)
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

C3 (candidate 3-itemsets)
{B, C, E}

3rd scan -> L3 (frequent 3-itemsets)
Itemset    sup
{B, C, E}  2
24 Important Details of the Apriori Algorithm
- There are two crucial questions in implementing
the Apriori algorithm:
- How to generate candidates?
- How to count the supports of candidates?
25 Generating Candidates
- There are two steps to generating candidates
(see the sketch after this example):
- Step 1: self-joining Lk
- Step 2: pruning
- Example of candidate generation:
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 ⋈ L3
- abcd from abc and abd
- acde from acd and ace
- Pruning:
- acde is removed because ade is not in L3
- C4 = {abcd}
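A minimal sketch of this candidate-generation step (self-join followed by pruning) is given below; the function name apriori_gen and the sorted-tuple representation of itemsets are choices made for this illustration.

```python
from itertools import combinations

def apriori_gen(frequent_k_minus_1):
    """Generate candidate k-itemsets from frequent (k-1)-itemsets: self-join, then prune."""
    prev = set(frequent_k_minus_1)            # frequent (k-1)-itemsets as sorted tuples
    k = len(next(iter(prev))) + 1
    candidates = set()
    # Step 1: self-join -- merge pairs that agree on their first k-2 items
    for a in prev:
        for b in prev:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(a + (b[-1],))
    # Step 2: prune -- drop any candidate with an infrequent (k-1)-subset
    return {c for c in candidates
            if all(sub in prev for sub in combinations(c, k - 1))}

# Reproduces the slide's example: L3 = {abc, abd, acd, ace, bcd}
L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}
print(apriori_gen(L3))  # {('a', 'b', 'c', 'd')}; acde is pruned because ade is not in L3
```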
26 Apriori Algorithm
- k = 1.
- Fk = { i | i ∈ I ∧ s(i)/N ≥ minsup }.  (Find all frequent 1-itemsets)
- repeat
- k = k + 1.
- Ck = apriori-gen(Fk−1).  (Generate candidate itemsets)
- for each transaction t ∈ T do
- Ct = subset(Ck, t).  (Identify all candidates that belong to t)
- for each candidate itemset c ∈ Ct do
- s(c) = s(c) + 1.  (Increment support count)
- end for
- end for
- Fk = { c | c ∈ Ck ∧ s(c)/N ≥ minsup }.  (Extract the frequent k-itemsets)
- until Fk = Ø
- Result = ∪k Fk
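The pseudocode above maps fairly directly onto Python. The sketch below is one possible rendering, assuming itemsets are represented as sorted tuples; it reuses the apriori_gen helper from the candidate-generation slide and a plain subset check in place of the subset(Ck, t) function.

```python
def apriori(transactions, minsup):
    """Return {frequent itemset (sorted tuple): support count}; minsup is a fraction."""
    n = len(transactions)
    # F1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {c: s for c, s in counts.items() if s / n >= minsup}
    result = dict(frequent)
    while frequent:
        candidates = apriori_gen(set(frequent))   # Ck = apriori-gen(Fk-1)
        counts = {c: 0 for c in candidates}
        for t in transactions:                    # one database scan per level
            for c in candidates:
                if set(c) <= t:                   # plain stand-in for subset(Ck, t)
                    counts[c] += 1
        frequent = {c: s for c, s in counts.items() if s / n >= minsup}
        result.update(frequent)
    return result
```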
27 How to Count Supports of Candidates?
- Why is counting the supports of candidates a problem?
- The total number of candidates can be huge
- One transaction may contain many candidates
- Method:
- Candidate itemsets are stored in a hash tree
- A leaf node of the hash tree contains a list of
itemsets and counts
- An interior node contains a hash table
- A subset function finds all the candidates
contained in a transaction (a simplified sketch
follows below)
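The slide describes a hash tree; as a much simpler stand-in, the sketch below uses Python's built-in hashing (a dict keyed by candidate tuples) and enumerates the k-subsets of each transaction to find which candidates it contains. It illustrates the subset-counting idea only and is not a hash-tree implementation.

```python
from itertools import combinations

def count_supports(candidates, transactions):
    """Count candidate supports by hashing the k-subsets of every transaction."""
    k = len(next(iter(candidates)))
    counts = {c: 0 for c in candidates}        # candidate tuples act as hash keys
    for t in transactions:
        # enumerate only the k-subsets of this transaction and probe the table
        for sub in combinations(sorted(t), k):
            if sub in counts:
                counts[sub] += 1
    return counts
```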
28 Generating Association Rules
- Once all frequent itemsets have been found,
association rules can be generated.
- Strong association rules from a frequent itemset
are generated by calculating the confidence of
each possible rule arising from that itemset and
testing it against a minimum confidence threshold
(as sketched below).
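One simple way to do this, assuming the support counts produced by the apriori sketch above, is to split each frequent itemset into every non-empty antecedent/consequent pair and keep the rules whose confidence reaches minConf:

```python
from itertools import combinations

def generate_rules(freq_counts, minconf):
    """Return (antecedent, consequent, confidence) for every strong rule.

    freq_counts maps frequent itemsets (sorted tuples) to their support counts.
    """
    rules = []
    for itemset, count in freq_counts.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                consequent = tuple(i for i in itemset if i not in antecedent)
                conf = count / freq_counts[antecedent]   # s(A ∪ B) / s(A)
                if conf >= minconf:
                    rules.append((antecedent, consequent, conf))
    return rules
```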
29 Example
TID List of item_IDs
T100 Beer, Crisps, Milk
T200 Crisps, Bread
T300 Crisps, Nappies
T400 Beer, Crisps, Bread
T500 Beer, Nappies
T600 Crisps, Nappies
T700 Beer, Nappies
T800 Beer, Crisps, Nappies, Milk
T900 Beer, Crisps, Nappies
ID Item
I1 Beer
I2 Crisps
I3 Nappies
I4 Bread
I5 Milk
30 Example
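As an illustration, the dataset on the previous slide can be fed to the apriori and generate_rules sketches from the earlier slides roughly as follows; the minsup and minconf values here are arbitrary choices for the example.

```python
transactions = [
    {"Beer", "Crisps", "Milk"},              # T100
    {"Crisps", "Bread"},                     # T200
    {"Crisps", "Nappies"},                   # T300
    {"Beer", "Crisps", "Bread"},             # T400
    {"Beer", "Nappies"},                     # T500
    {"Crisps", "Nappies"},                   # T600
    {"Beer", "Nappies"},                     # T700
    {"Beer", "Crisps", "Nappies", "Milk"},   # T800
    {"Beer", "Crisps", "Nappies"},           # T900
]

freq = apriori(transactions, minsup=2 / 9)   # keep itemsets found in at least 2 of 9 transactions
for antecedent, consequent, conf in generate_rules(freq, minconf=0.7):
    print(antecedent, "->", consequent, f"(confidence {conf:.2f})")
```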
31 Challenges of Frequent Pattern Mining
- Challenges:
- Multiple scans of the transaction database
- Huge number of candidates
- Tedious workload of support counting for
candidates
- Improving Apriori: general ideas
- Reduce the number of passes over the transaction
database
- Shrink the number of candidates
- Facilitate support counting of candidates
32 Bottleneck of Frequent-Pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of
scanning and generates lots of candidates
- To find the frequent itemset i1 i2 ... i100:
- Number of scans: 100
- Number of candidates: 2^100 − 1 ≈ 1.27 × 10^30
- Bottleneck: candidate generation and test
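For reference, the candidate count quoted above is simply the number of non-empty subsets of a 100-item itemset:

```latex
\sum_{k=1}^{100} \binom{100}{k} = 2^{100} - 1 \approx 1.27 \times 10^{30}
```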
33 Mining Frequent Patterns Without Candidate Generation
- Techniques for mining frequent itemsets which
avoid candidate generation include:
- FP-growth
- Grows long patterns from short ones using local
frequent items
- ECLAT (Equivalence CLASS Transformation) algorithm
- Uses a data representation in which transactions
are associated with items, rather than the other
way around (the vertical data format)
- These methods can be much faster than the Apriori
algorithm