Title: Association Rule Mining (II)
1Association Rule Mining (II)
- Instructor: Qiang Yang
- Thanks: J. Han and J. Pei
2Bottleneck of Frequent-pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
- To find the frequent itemset $i_1 i_2 \ldots i_{100}$:
- # of scans: 100
- # of candidates: $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ !
- Bottleneck: candidate generation and test
- Can we avoid candidate generation?
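The candidate count above can be checked with a few lines of Python. This is only an arithmetic sketch; the function name `candidate_count` is an assumption introduced for illustration.

```python
from math import comb

def candidate_count(n: int) -> int:
    """Number of non-empty subsets of an n-item pattern: sum of C(n, k) for k = 1..n."""
    return sum(comb(n, k) for k in range(1, n + 1))

total = candidate_count(100)
# The sum of binomial coefficients collapses to 2^n - 1.
print(total == 2**100 - 1)
print(f"{total:.2e}")
```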
3FP-growth Frequent-pattern Mining Without
Candidate Generation
- Heuristic: let P be a frequent itemset, S be the set of transactions that contain P, and x be an item. If x is a frequent item in S, then {x} ∪ P must be a frequent itemset
- No candidate generation!
- A compact data structure, the FP-tree, stores the information needed for frequent-pattern mining
- A recursive mining algorithm finds the complete set of frequent patterns
4Example
Items Bought
f,a,c,d,g,i,m,p
a,b,c,f,l,m,o
b,f,h,j,o
b,c,k,s,p
a,f,c,e,l,p,m,n
Min support = 3
5Scan the database
- List of frequent items, sorted as (item: support)
- <(f:4), (c:4), (a:3), (b:3), (m:3), (p:3)>
- The root of the tree is created and labeled "root"
- Scan the database again
- Scanning the first transaction leads to the first branch of the tree: <(f:1), (c:1), (a:1), (m:1), (p:1)>
- Items in each transaction are ordered according to frequency
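The first scan can be sketched as follows, assuming the five example transactions and a minimum support of 3. The tie order among equally frequent items (f before c, then a, b, m, p) is copied from the slide rather than computed, and the names `item_supports`, `ordered_frequent_items`, and `F1_ORDER` are hypothetical.

```python
from collections import Counter

TRANSACTIONS = {
    100: list("facdgimp"),
    200: list("abcflmo"),
    300: list("bfhjo"),
    400: list("bcksp"),
    500: list("afcelpmn"),
}
MIN_SUPPORT = 3
F1_ORDER = list("fcabmp")  # assumption: tie order taken from the slide

def item_supports(transactions):
    """Count, for each item, the number of transactions that contain it."""
    counts = Counter()
    for items in transactions.values():
        counts.update(set(items))
    return counts

def ordered_frequent_items(items, order):
    """Keep only frequent items and list them in the given frequency order."""
    present = set(items)
    return [item for item in order if item in present]

supports = item_supports(TRANSACTIONS)
frequent = {item: n for item, n in supports.items() if n >= MIN_SUPPORT}
ordered = {tid: ordered_frequent_items(items, F1_ORDER)
           for tid, items in TRANSACTIONS.items()}
print(frequent)
print(ordered)
```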
6Scanning TID100
Transaction database: TID 100: f,a,c,d,g,i,m,p
Header table (item: node count): f:1, c:1, a:1, m:1, p:1
Tree:
root
  f:1
    c:1
      a:1
        m:1
          p:1
7Scanning TID200
Items Bought
f,a,c,d,g,i,m,p
a,b,c,f,l,m,o
b,f,h,j,o
b,c,k,s,p
a,f,c,e,l,p,m,n
- Frequent single items:
- F1 = <f,c,a,b,m,p>
- TID 200:
- Possible frequent items:
- Intersect with F1:
- f,c,a,b,m
- Along the first branch <f,c,a,m,p>, the shared prefix is:
- <f,c,a>
- Generate two new nodes:
- <b> (child of a), <m> (child of b)
8Scanning TID200
Transaction database: TID 200: f,c,a,b,m
Header table (item: node count): f:1, c:1, a:1, b:1, m:2, p:1
Tree:
root
  f:2
    c:2
      a:2
        m:1
          p:1
        b:1
          m:1
9The final FP-tree
Transaction database (min support = 3):
TID   Items
100   f,a,c,d,g,i,m,p
200   a,b,c,f,l,m,o
300   b,f,h,j,o
400   b,c,k,s,p
500   a,f,c,e,l,p,m,n
Header table (item: node count): f:1, c:2, a:1, b:3, m:2, p:2
Tree:
root
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
Frequent 1-items in descending frequency order: f,c,a,b,m,p
10FP-Tree Construction
- Scans the database only twice
- Subsequent mining based on the FP-tree
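The second scan, which inserts each frequency-ordered transaction into the tree, might look like the sketch below. The class `FPNode`, the function `build_fp_tree`, and the use of a plain list per item in place of node-links are illustrative assumptions; the ordered transactions are hard-coded from the example.

```python
class FPNode:
    """One node of an FP-tree: an item, its count, and its children."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(ordered_transactions):
    """Insert each ordered transaction as a path, sharing common prefixes."""
    root = FPNode("root")
    header = {}  # item -> list of nodes carrying it (stands in for node-links)
    for items in ordered_transactions:
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1
            node = child
    return root, header

# Frequent items of TIDs 100-500, already in f,c,a,b,m,p order.
ORDERED = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
root, header = build_fp_tree(ORDERED)
node_counts = {item: len(nodes) for item, nodes in header.items()}
print(node_counts)
```

The per-item node counts correspond to the numbers listed in the header table of the final FP-tree.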
11How to Mine an FP-tree?
- Step 1: form the conditional pattern base
- Step 2: construct the conditional FP-tree
- Step 3: recursively mine the conditional FP-trees
12Conditional Pattern Base
- Let I be a frequent item
- I's conditional pattern base is a sub-database which
- consists of the set of prefix paths in the FP-tree
- with item I as a co-occurring suffix pattern
- Example:
- m is a frequent item
- m's conditional pattern base:
- <f,c,a>: support = 2
- <f,c,a,b>: support = 1
- Mine recursively on such databases
(Figure: the final FP-tree, as on slide 9.)
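A conditional pattern base can be sketched without materializing the tree: each frequency-ordered transaction that contains the item contributes the prefix preceding it, and identical prefixes are merged. This shortcut yields the same prefix paths and counts as walking the tree; `conditional_pattern_base` is a hypothetical helper name.

```python
from collections import Counter

# Frequency-ordered transactions (TIDs 100-500) from the running example.
ORDERED = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]

def conditional_pattern_base(ordered_transactions, item):
    """Map each prefix path that precedes `item` to its support."""
    base = Counter()
    for items in ordered_transactions:
        if item in items:
            prefix = tuple(items[: items.index(item)])
            if prefix:  # an empty prefix carries no pattern information
                base[prefix] += 1
    return dict(base)

m_base = conditional_pattern_base(ORDERED, "m")
p_base = conditional_pattern_base(ORDERED, "p")
print(m_base)
print(p_base)
```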
13Conditional Pattern Tree
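- Let I be a suffix item and DB_I be its conditional pattern base
- The frequent-pattern tree Tree_I built from DB_I is known as the conditional pattern tree
- Example:
- m is a frequent item
- m's conditional pattern base:
- <f,c,a>: support = 2
- <f,c,a,b>: support = 1
- m's conditional pattern tree:
(Figure: the path f:4 - c:3 - a:3 - m:2 in the FP-tree. Counting only the transactions that contain m gives the conditional tree f:3 - c:3 - a:3; b is dropped because its support in the base, 1, is below the minimum of 3.)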
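The step from m's conditional pattern base to its conditional tree can be sketched as counting items within the base, dropping those below the minimum support, and merging the remaining paths. The function name `conditional_tree_paths` is an assumption; the result is represented as paths with counts rather than as linked nodes.

```python
from collections import Counter

MIN_SUPPORT = 3
M_BASE = {("f", "c", "a"): 2, ("f", "c", "a", "b"): 1}  # m's conditional pattern base

def conditional_tree_paths(base, min_support):
    """Return item supports within the base and the pruned, merged paths."""
    item_counts = Counter()
    for path, count in base.items():
        for item in path:
            item_counts[item] += count
    pruned = Counter()
    for path, count in base.items():
        kept = tuple(i for i in path if item_counts[i] >= min_support)
        if kept:
            pruned[kept] += count
    return dict(item_counts), dict(pruned)

item_counts, m_tree_paths = conditional_tree_paths(M_BASE, MIN_SUPPORT)
print(item_counts)
print(m_tree_paths)
```

Here both base paths collapse into a single path once b is removed, so m's conditional tree is a single-path tree.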
14Composition of patterns α and β
- Let α be a frequent item in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is frequent in DB if and only if β is frequent in B.
- Example:
- Start with α = p
- p's conditional pattern base (from the tree) is B:
- (f,c,a,m): 2, (c,b): 1
- Let β be c.
- Then α ∪ β = {p,c}, with support 3.
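The composition property can be checked on the example by counting support two ways: directly in the full database, and inside p's conditional pattern base. The helper names `support_in_db` and `support_in_base` are hypothetical.

```python
DB = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
P_BASE = {("f", "c", "a", "m"): 2, ("c", "b"): 1}  # p's conditional pattern base

def support_in_db(itemset):
    """Number of transactions in DB that contain every item of `itemset`."""
    return sum(1 for t in DB if itemset <= t)

def support_in_base(itemset, base):
    """Support of `itemset` inside a conditional pattern base."""
    return sum(count for path, count in base.items() if itemset <= set(path))

# The itemset {c} inside p's base has the same support as {p, c} in DB.
pc_in_db = support_in_db({"p", "c"})
c_in_base = support_in_base({"c"}, P_BASE)
# An infrequent case for contrast: {f} in the base versus {p, f} in DB.
pf_in_db = support_in_db({"p", "f"})
f_in_base = support_in_base({"f"}, P_BASE)
print(pc_in_db, c_in_base, pf_in_db, f_in_base)
```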
15Single path tree
- Let P be a single-path FP-tree
- Let {I1, I2, ..., Ik} be an itemset in the tree
- Let Ij have the lowest support
- Then support({I1, I2, ..., Ik}) = support(Ij)
- Example:
(Figure: the final FP-tree, as on slide 9.)
16FP_growth Algorithm Fig 6.10
- Recursive algorithm
- Input: a transaction database, min_supp
- Output: the complete set of frequent patterns
- 1. FP-tree construction
- 2. Mine the FP-tree by calling FP_growth(FP_tree, null)
- Key idea: consider single-path and multi-path FP-trees separately
- Continue to split until a single-path FP-tree is obtained
17FP_growth(tree, α)
- If tree contains a single path P, then
- For each combination (denoted as β) of the nodes in the path P, do
- Generate pattern β ∪ α with support = the minimum support of the nodes in β
- Else, for each a_i in the header of tree, do
- Generate pattern β = a_i ∪ α with support = a_i.support
- Construct
- (1) β's conditional pattern base and
- (2) β's conditional FP-tree Tree_β
- If Tree_β is not empty, then
- Call FP_growth(Tree_β, β)
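The recursion can be sketched in Python in a simplified form. This is not the procedure exactly as stated: it recurses on conditional pattern bases held as lists of (items, count) pairs, skips the explicit tree and the single-path shortcut, and breaks frequency ties alphabetically. The function name `fp_growth` and its signature are assumptions.

```python
from collections import Counter

def fp_growth(pattern_base, min_support, suffix=()):
    """Return {frozenset(pattern): support} for all frequent patterns.

    pattern_base is a list of (items, count) pairs; suffix is the pattern
    (alpha) that every entry of the base is conditioned on.
    """
    counts = Counter()
    for items, count in pattern_base:
        for item in set(items):
            counts[item] += count
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    # Descending support; ties broken alphabetically (an arbitrary choice).
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}

    patterns = {}
    for item in order:
        beta = (item,) + suffix  # beta = a_i united with alpha
        patterns[frozenset(beta)] = frequent[item]
        # beta's conditional pattern base: frequent items ranked before `item`.
        conditional = []
        for items, count in pattern_base:
            if item in items:
                kept = sorted((i for i in items if i in rank), key=rank.get)
                prefix = kept[: kept.index(item)]
                if prefix:
                    conditional.append((prefix, count))
        patterns.update(fp_growth(conditional, min_support, beta))
    return patterns

DB = [list("facdgimp"), list("abcflmo"), list("bfhjo"), list("bcksp"), list("afcelpmn")]
patterns = fp_growth([(t, 1) for t in DB], min_support=3)
for pattern, support in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(pattern)), support)
```

Passing the raw database as a pattern base with count 1 per transaction plays the role of the initial call with an empty suffix.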
18FP-Growth vs. Apriori Scalability With the
Support Threshold
Data set T25I20D10K
19FP-Growth vs. Tree-Projection Scalability with
the Support Threshold
Data set T25I20D100K
20Why Is FP-Growth the Winner?
- Divide-and-conquer:
- decompose both the mining task and the DB according to the frequent patterns obtained so far
- leads to focused search of smaller databases
- Other factors:
- no candidate generation, no candidate test
- compressed database: the FP-tree structure
- no repeated scan of the entire database
- basic operations are counting and FP-tree building, not pattern search and matching
21Implications of the Methodology Papers by Han,
et al.
- Mining closed frequent itemsets and max-patterns
- CLOSET (DMKD'00)
- Mining sequential patterns
- FreeSpan (KDD'00), PrefixSpan (ICDE'01)
- Constraint-based mining of frequent patterns
- Convertible constraints (KDD'00, ICDE'01)
- Computing iceberg data cubes with complex measures
- H-tree and H-cubing algorithm (SIGMOD'01)
22Visualization of Association Rules Pane Graph
23Visualization of Association Rules Rule Graph
24Mining Various Kinds of Rules or Regularities
- Multi-level, quantitative association rules, correlation and causality, ratio rules, sequential patterns, emerging patterns, temporal associations, partial periodicity
- Classification, clustering, iceberg cubes, etc.
25Multiple-level Association Rules
- Items often form a hierarchy
- Flexible support settings: items at the lower level are expected to have lower support.
- The transaction database can be encoded based on dimensions and levels
- Explore shared multi-level mining
26Quantitative Association Rules
- Numeric attributes are dynamically discretized
- such that the confidence or compactness of the rules mined is maximized.
- 2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
- Cluster adjacent association rules to form general rules using a 2-D grid.
- Example:
age(X, "34-35") ∧ income(X, "30K - 50K") ⇒ buys(X, "high resolution TV")
27Redundant Rules SA95
- Which rule is redundant?
- milk ⇒ wheat bread, support 8%, confidence 70%
- skim milk ⇒ wheat bread, support 2%, confidence 72%
- The first rule is more general than the second rule.
- A rule is redundant if its support is close to the expected value, based on a general rule, and its confidence is close to that of the general rule.
28INCREMENTAL MINING CHNW96
- Rules in DB were found, and a set of new tuples db is added to DB.
- Task: find the new rules in DB ∪ db.
- Usually, DB is much larger than db.
- Properties of itemsets:
- frequent in DB ∪ db if frequent in both DB and db.
- infrequent in DB ∪ db if infrequent in both DB and db.
- frequent only in DB: merge with the counts in db.
- No DB scan is needed!
- frequent only in db: scan DB once to update their itemset counts.
- The same principle applies to distributed/parallel mining.
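The case analysis above can be sketched for single items, assuming a relative support threshold and that the counts of items frequent in DB were kept from the earlier run. The toy data, the threshold `MIN_FRAC`, and the names `count_items` and `incremental_update` are hypothetical; a full version would handle larger itemsets as well.

```python
from collections import Counter

def count_items(transactions):
    """Count the number of transactions containing each item."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))
    return counts

def incremental_update(old_frequent, old_db, new_db, min_frac):
    """old_frequent: item -> count in old_db, for items frequent in old_db."""
    n_old, n_new = len(old_db), len(new_db)
    new_counts = count_items(new_db)  # scanning the small increment is cheap
    # Frequent in DB: merge the stored count with the count in db (no DB scan).
    merged = {item: count + new_counts.get(item, 0)
              for item, count in old_frequent.items()}
    # Frequent only in db: one scan of DB recovers the missing counts.
    only_new = {item for item, count in new_counts.items()
                if count >= min_frac * n_new and item not in old_frequent}
    if only_new:
        for item in only_new:
            merged[item] = new_counts[item]
        for t in old_db:
            for item in only_new & set(t):
                merged[item] += 1
    # Items infrequent in both parts are never examined.
    threshold = min_frac * (n_old + n_new)
    return {item: count for item, count in merged.items() if count >= threshold}

OLD_DB = [list("ab"), list("ab"), list("ac"), list("a")]
NEW_DB = [list("c"), list("c")]
MIN_FRAC = 0.5
old_counts = count_items(OLD_DB)
old_frequent = {i: c for i, c in old_counts.items() if c >= MIN_FRAC * len(OLD_DB)}
updated = incremental_update(old_frequent, OLD_DB, NEW_DB, MIN_FRAC)
print(updated)
```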
29CORRELATION RULES
- Association does not measure correlation [BMS97, AY98].
- Among 5000 students:
- 3000 play basketball, 3750 eat cereal, 2000 do both
- play basketball ⇒ eat cereal [support 40%, confidence 66.7%]
- The conclusion "basketball and cereal are correlated" is misleading
- because the overall percentage of students eating cereal is 75%, higher than 66.7%.
- Confidence does not always give the correct picture!
30Correlation Rules
- $P(B \mid A)/P(B) = P(A \wedge B)/(P(A)\,P(B))$ is known as the lift of the rule B ⇒ A
- If it is less than one, then B and A are negatively correlated.
- Basketball ⇒ Cereal:
- $2000/(3000 \cdot 3750/5000) = 2000 \cdot 5000/(3000 \cdot 3750) \approx 0.89 < 1$
- $P(A \wedge B) = P(A)\,P(B)$ if A and B are independent events
- A and B are negatively correlated if the lift is less than 1
- A and B are positively correlated if it is greater than 1.
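The basketball and cereal numbers can be worked through in a short sketch that computes support, confidence, and lift from the raw counts. The function name `rule_measures` is an assumption.

```python
def rule_measures(n_total, n_a, n_b, n_both):
    """Support, confidence, and lift of the rule A => B from raw counts."""
    support = n_both / n_total
    confidence = n_both / n_a
    lift = support / ((n_a / n_total) * (n_b / n_total))
    return support, confidence, lift

# A = plays basketball, B = eats cereal, among 5000 students.
support, confidence, lift = rule_measures(5000, 3000, 3750, 2000)
print(f"support={support:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

A lift below 1 here reflects that the confidence (about 66.7%) is lower than the overall share of cereal eaters (75%).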
31Chi-square Correlation BMS97
- The cutoff value at the 95% significance level is 3.84 > 0.9
- Thus, we do not reject the independence assumption.
32Constraint-based Data Mining
- Finding all the patterns in a database autonomously? Unrealistic!
- The patterns could be too many but not focused!
- Data mining should be an interactive process
- The user directs what is to be mined using a data mining query language (or a graphical user interface)
- Constraint-based mining
- User flexibility: the user provides constraints on what is to be mined
- System optimization: the system exploits such constraints for efficient mining (constraint-based mining)
33Constraints in Data Mining
- Knowledge type constraint:
- classification, association, etc.
- Data constraint, using SQL-like queries:
- find product pairs sold together in stores in Vancouver in Dec. '00
- Dimension/level constraint:
- in relevance to region, price, brand, customer category
- Rule (or pattern) constraint:
- small sales (price < 10) trigger big sales (sum > 200)
- Interestingness constraint:
- strong rules: min_support ≥ 3%, min_confidence ≥ 60%
34Constrained Mining vs. Constraint-Based Search
- Constrained mining vs. constraint-based search/reasoning
- Both aim at reducing the search space
- Finding all patterns satisfying the constraints vs. finding some (or one) answer in constraint-based search in AI
- Constraint-pushing vs. heuristic search
- How to integrate them is an interesting research problem
- Constrained mining vs. query processing in DBMS
- Database query processing requires finding all answers
- Constrained pattern mining shares a similar philosophy with pushing selections deeply in query processing
35Constrained Frequent Pattern Mining A Mining
Query Optimization Problem
- Given a frequent pattern mining query with a set of constraints C, the algorithm should be
- sound: it only finds frequent sets that satisfy the given constraints C
- complete: all frequent sets satisfying the given constraints C are found
- A naïve solution:
- First find all frequent sets, and then test them for constraint satisfaction
- More efficient approaches:
- Analyze the properties of constraints comprehensively
- Push them as deeply as possible inside the frequent pattern computation.
36Anti-Monotonicity in Constraint-Based Mining
TDB (min_sup = 2)
TID Transaction
10 a, b, c, d, f
20 b, c, d, f, g, h
30 a, c, d, e, f
40 c, e, f, g
- Anti-monotonicity:
- when an itemset S satisfies the constraint, so does any of its subsets
- sum(S.Price) ≤ v is anti-monotone
- sum(S.Price) ≥ v is not anti-monotone
- Example: C: range(S.profit) ≤ 15 is anti-monotone
- Itemset ab violates C
- So does every superset of ab
Item Profit
a 40
b 0
c -20
d 10
e -30
f 30
g 20
h -10
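The anti-monotone example can be verified by brute force over the eight items of the profit table. This is a sketch only; the names `profit_range`, `satisfies`, and `nonempty_subsets` are hypothetical, and real miners exploit the property to prune rather than enumerate.

```python
from itertools import combinations

PROFIT = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}

def profit_range(itemset):
    """range(S.profit): maximum profit minus minimum profit over the itemset."""
    values = [PROFIT[i] for i in itemset]
    return max(values) - min(values)

def satisfies(itemset):
    """Constraint C: range(S.profit) <= 15."""
    return profit_range(itemset) <= 15

def nonempty_subsets(items):
    """Yield every non-empty subset of `items` as a frozenset."""
    items = sorted(items)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

all_itemsets = list(nonempty_subsets(PROFIT))
supersets_of_ab = [s for s in all_itemsets if {"a", "b"} <= s]
ab_violates = not satisfies({"a", "b"})
every_superset_violates = all(not satisfies(s) for s in supersets_of_ab)
# Anti-monotone: once an itemset violates C, every superset violates it too.
anti_monotone = all(not satisfies(t)
                    for s in all_itemsets if not satisfies(s)
                    for t in all_itemsets if s <= t)
print(ab_violates, every_superset_violates, anti_monotone)
```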
37Which Constraints Are Anti-Monotone?
Constraint                        Anti-monotone
v ∈ S                             no
S ⊇ V                             no
S ⊆ V                             yes
min(S) ≤ v                        no
min(S) ≥ v                        yes
max(S) ≤ v                        yes
max(S) ≥ v                        no
count(S) ≤ v                      yes
count(S) ≥ v                      no
sum(S) ≤ v (∀a ∈ S, a ≥ 0)        yes
sum(S) ≥ v (∀a ∈ S, a ≥ 0)        no
range(S) ≤ v                      yes
range(S) ≥ v                      no
avg(S) θ v, θ ∈ {=, ≤, ≥}         convertible
support(S) ≥ ξ                    yes
support(S) ≤ ξ                    no
38Monotonicity in Constraint-Based Mining
TDB (min_sup = 2)
TID Transaction
10 a, b, c, d, f
20 b, c, d, f, g, h
30 a, c, d, e, f
40 c, e, f, g
- Monotonicity:
- when an itemset S satisfies the constraint, so does any of its supersets
- sum(S.Price) ≥ v is monotone
- min(S.Price) ≤ v is monotone
- Example: C: range(S.profit) ≥ 15
- Itemset ab satisfies C
- So does every superset of ab
Item Profit
a 40
b 0
c -20
d 10
e -30
f 30
g 20
h -10
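The monotone example can be checked the same way against the profit table. The sketch below also shows that this constraint is not anti-monotone, since a single item has range 0. The names `satisfies` and `all_itemsets` are hypothetical.

```python
from itertools import combinations

PROFIT = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}

def satisfies(itemset):
    """Constraint C: range(S.profit) >= 15."""
    values = [PROFIT[i] for i in itemset]
    return max(values) - min(values) >= 15

items = sorted(PROFIT)
all_itemsets = [frozenset(c) for k in range(1, len(items) + 1)
                for c in combinations(items, k)]

ab_satisfies = satisfies({"a", "b"})
# Monotone: once an itemset satisfies C, every superset satisfies it too.
monotone = all(satisfies(t)
               for s in all_itemsets if satisfies(s)
               for t in all_itemsets if s <= t)
# Not anti-monotone: ab satisfies C, but its subset {a} does not.
subset_fails = not satisfies({"a"})
print(ab_satisfies, monotone, subset_fails)
```

A monotone constraint need not be re-tested on supersets once it has been satisfied, which is where the saving comes from.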
39Which Constraints Are Monotone?
Constraint                        Monotone
v ∈ S                             yes
S ⊇ V                             yes
S ⊆ V                             no
min(S) ≤ v                        yes
min(S) ≥ v                        no
max(S) ≤ v                        no
max(S) ≥ v                        yes
count(S) ≤ v                      no
count(S) ≥ v                      yes
sum(S) ≤ v (∀a ∈ S, a ≥ 0)        no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)        yes
range(S) ≤ v                      no
range(S) ≥ v                      yes
avg(S) θ v, θ ∈ {=, ≤, ≥}         convertible
support(S) ≥ ξ                    no
support(S) ≤ ξ                    yes
40Succinctness, Convertible, Inconvertible Constraints in Book
- We will not consider these in this course.
41Associative Classification
- Mine possible association rules of the form itemset ⇒ class
- Itemset: a set of attribute-value pairs
- Class: a class label
- Build the classifier
- Organize rules in decreasing order of precedence, based on confidence and support
- B. Liu, W. Hsu & Y. Ma. Integrating classification and association rule mining. In KDD'98
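The rule-ordering idea can be sketched as follows, assuming precedence means higher confidence first and, on ties, higher support, with a record classified by the first rule whose itemset it contains. The rules, their numbers, and the names `Rule`, `build_classifier`, and `classify` are made up for illustration and are not taken from the cited paper.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    itemset: frozenset   # set of (attribute, value) pairs
    label: str           # predicted class
    confidence: float
    support: float

def build_classifier(rules):
    """Order rules by decreasing confidence, then decreasing support."""
    return sorted(rules, key=lambda r: (-r.confidence, -r.support))

def classify(ranked_rules, record, default):
    """Return the label of the first rule whose itemset the record contains."""
    for rule in ranked_rules:
        if rule.itemset <= record:
            return rule.label
    return default

# Hypothetical rules, invented for this sketch.
RULES = [
    Rule(frozenset({("age", "<30")}), "yes", 0.70, 0.20),
    Rule(frozenset({("age", "<30"), ("student", "no")}), "no", 0.90, 0.05),
    Rule(frozenset({("income", "high")}), "yes", 0.70, 0.30),
]
ranked = build_classifier(RULES)
young_non_student = frozenset({("age", "<30"), ("student", "no")})
young_student = frozenset({("age", "<30"), ("student", "yes")})
print(classify(ranked, young_non_student, default="no"))
print(classify(ranked, young_student, default="no"))
```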
42Classification by Aggregating Emerging Patterns
- Emerging pattern (EP): a pattern frequent in one class of data but infrequent in others.
- age < 30 is frequent in class buys_computer = yes and infrequent in class buys_computer = no
- Rule: (age < 30) ⇒ buys computer
- G. Dong & J. Li. Efficient mining of emerging patterns: discovering trends and differences. In KDD'99