Association Rule Mining (II)

Slides: 43

Provided by: Jiawe7

1
Association Rule Mining (II)

Instructor: Qiang Yang
Thanks: J. Han and J. Pei

2
Bottleneck of Frequent-pattern Mining

Multiple database scans are costly
Mining long patterns needs many passes of
scanning and generates lots of candidates
To find the frequent itemset i1 i2 ... i100:
Number of scans: 100
Number of candidates: $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ !
Bottleneck: candidate generation and test
Can we avoid candidate generation?
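The candidate count above can be checked with a few lines of Python. This is only an arithmetic sketch; the variable names are illustrative and not part of the slides.

```python
import math

n = 100  # length of the long frequent itemset i1 i2 ... i100

# Every non-empty subset of a frequent itemset is itself frequent, so a
# candidate-generation method has to produce all of them.
num_candidates = sum(math.comb(n, k) for k in range(1, n + 1))

assert num_candidates == 2**n - 1
print(f"{num_candidates:.3e}")  # prints 1.268e+30
```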

3
FP-growth: Frequent-pattern Mining Without
Candidate Generation

Heuristic: let P be a frequent itemset, S be the
set of transactions that contain P, and x be an item.
If x is a frequent item in S, then {x} ∪ P must be a
frequent itemset
No candidate generation!
A compact data structure, FP-tree, to store
information for frequent pattern mining
Recursive mining algorithm for mining complete
set of frequent patterns

4
Example

TID   Items Bought
100   f, a, c, d, g, i, m, p
200   a, b, c, f, l, m, o
300   b, f, h, j, o
400   b, c, k, s, p
500   a, f, c, e, l, p, m, n

Min support = 3

5
Scan the database

List of frequent items, sorted as (item:support):
<(f:4), (c:4), (a:3), (b:3), (m:3), (p:3)>
The root of the tree is created and labeled with "null"
Scan the database a second time
Scanning the first transaction leads to the first
branch of the tree: <(f:1), (c:1), (a:1), (m:1), (p:1)>
Items within each transaction are ordered according to frequency
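A minimal sketch of this first scan, using the five example transactions. The tie order among equally frequent items (f before c, then a, b, m, p) is copied from the slide; any fixed order should work as long as it is applied consistently. The names `transactions`, `f_list` and `ordered` are illustrative.

```python
from collections import Counter

transactions = {
    100: "facdgimp",
    200: "abcflmo",
    300: "bfhjo",
    400: "bcksp",
    500: "afcelpmn",
}
MIN_SUPPORT = 3

# First scan: count every item and keep the frequent ones.
counts = Counter(item for items in transactions.values() for item in items)
frequent = {item: n for item, n in counts.items() if n >= MIN_SUPPORT}

# Order taken from the slide (assumption: ties may be broken arbitrarily).
f_list = list("fcabmp")
assert sorted(f_list) == sorted(frequent)
assert all(frequent[x] >= frequent[y] for x, y in zip(f_list, f_list[1:]))

# Drop infrequent items and sort the rest of each transaction by f_list.
ordered = {
    tid: [item for item in f_list if item in items]
    for tid, items in transactions.items()
}

for tid, items in ordered.items():
    print(tid, items)
```

Transaction 100 becomes f, c, a, m, p, which is the first branch inserted into the tree.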

6
Scanning TID = 100

Transaction Database
TID   Items
100   f, a, c, d, g, i, m, p

Header Table (Item: count)
f: 1, c: 1, a: 1, m: 1, p: 1

FP-tree
root
  f:1
    c:1
      a:1
        m:1
          p:1

7
Scanning TID = 200

TID   Items Bought
100   f, a, c, d, g, i, m, p
200   a, b, c, f, l, m, o
300   b, f, h, j, o
400   b, c, k, s, p
500   a, f, c, e, l, p, m, n

Frequent single items:
F1 = <f, c, a, b, m, p>
TID = 200
Possible frequent items: intersect with F1
f, c, a, b, m
Along the first branch <f, c, a, m, p>, the intersection is the prefix
<f, c, a>
Generate two new nodes
<b>, <m>

8
Scanning TID = 200

Transaction Database
TID   Items
200   f, c, a, b, m

Header Table (Item: count)
f: 1, c: 1, a: 1, b: 1, m: 2, p: 1

FP-tree
root
  f:2
    c:2
      a:2
        m:1
          p:1
        b:1
          m:1

9
The final FP-tree

Transaction Database
TID   Items
100   f, a, c, d, g, i, m, p
200   a, b, c, f, l, m, o
300   b, f, h, j, o
400   b, c, k, s, p
500   a, f, c, e, l, p, m, n
Min support = 3

Header Table (Item: count)
f: 1, c: 2, a: 1, b: 3, m: 2, p: 2

FP-tree
root
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1

Frequent 1-items in frequency descending order:
f, c, a, b, m, p
10
FP-Tree Construction

Scans the database only twice
Subsequent mining based on the FP-tree
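The second scan can be sketched as follows. This is an illustrative implementation, not the authors' code: the `Node` class, `build_fp_tree` and `show` are hypothetical names, and the input is assumed to be the transactions already filtered and ordered by the first scan.

```python
class Node:
    """One FP-tree node: an item, a count, and child nodes keyed by item."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(ordered_transactions):
    root = Node(None)
    header = {}  # item -> list of nodes carrying that item (the node-links)
    for items in ordered_transactions:
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                # No shared prefix any more: start a new branch here.
                child = Node(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1
            node = child
    return root, header


def show(node, depth=0):
    """Return the tree as indented 'item:count' lines."""
    lines = []
    for child in node.children.values():
        lines.append("  " * depth + f"{child.item}:{child.count}")
        lines.extend(show(child, depth + 1))
    return lines


# Frequent items of TID 100..500, ordered f, c, a, b, m, p.
ordered = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]
root, header = build_fp_tree(ordered)
print("\n".join(show(root)))
```

On the example data this yields the branches f:4, c:3, a:3, m:2, p:2, then a:3, b:1, m:1, then f:4, b:1, and c:1, b:1, p:1. The number of nodes per item in `header` is f 1, c 2, a 1, b 3, m 2, p 2.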

11
How to Mine an FP-tree?

Step 1: form the conditional pattern base
Step 2: construct the conditional FP-tree
Step 3: recursively mine conditional FP-trees

12
Conditional Pattern Base

Let I be a frequent item
Its conditional pattern base is a sub-database which
consists of the set of prefix paths in the
FP-tree
with item I as a co-occurring suffix pattern
Example
m is a frequent item
m's conditional pattern base:
<f, c, a>: support 2
<f, c, a, b>: support 1
Mine recursively on such databases

[Figure: the final FP-tree]
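A small sketch of how a conditional pattern base can be collected. For brevity it reads the prefixes from the frequency-ordered transactions instead of following node-links in the tree; this gives the same prefix paths and counts, since every tree path is a merge of such transactions. The function name is hypothetical.

```python
from collections import Counter

# Frequent items of TID 100..500, ordered f, c, a, b, m, p.
ordered = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]


def conditional_pattern_base(item, ordered_transactions):
    """Map each prefix path that precedes `item` to its count."""
    base = Counter()
    for items in ordered_transactions:
        if item in items:
            prefix = items[: items.index(item)]
            if prefix:  # an empty prefix carries no pattern
                base[prefix] += 1
    return dict(base)


print(conditional_pattern_base("m", ordered))
print(conditional_pattern_base("p", ordered))
```

For m this returns <f, c, a> with count 2 and <f, c, a, b> with count 1, matching the example.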

13
Conditional Pattern Tree

Let I be a suffix item and DB_I be
the conditional pattern base of I
The frequent pattern tree Tree_I built from DB_I is known as the
conditional pattern tree
Example
m is a frequent item
m's conditional pattern base:
<f, c, a>: support 2
<f, c, a, b>: support 1
m's conditional pattern tree:

[Figure: path f:4, c:3, a:3, m:2 of the FP-tree]
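As a sketch of the construction step, the conditional tree for m can be derived by counting items inside m's conditional pattern base and dropping those below the minimum support of 3. Variable names are illustrative.

```python
from collections import Counter

MIN_SUPPORT = 3
m_base = {"fca": 2, "fcab": 1}  # m's conditional pattern base

# Count each item, weighted by the support of the prefix path it occurs in.
item_counts = Counter()
for prefix, count in m_base.items():
    for item in prefix:
        item_counts[item] += count

kept = {item: n for item, n in item_counts.items() if n >= MIN_SUPPORT}

# Remove infrequent items from every path and merge identical paths.
pruned = Counter()
for prefix, count in m_base.items():
    path = "".join(item for item in prefix if item in kept)
    pruned[path] += count

print(kept)          # item counts inside the conditional base
print(dict(pruned))  # the remaining path(s)
```

Item b occurs only once in the base and is dropped, so under these assumptions m's conditional tree is the single path f:3, c:3, a:3.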

14
Composition of patterns α and β

Let α be a frequent item in DB, B be α's
conditional pattern base, and β be an itemset in
B. Then α ∪ β is frequent in DB if and only if β
is frequent in B.
Example
Starting with α = p
p's conditional pattern base (from the tree): B =
(f, c, a, m): 2, (c, b): 1
Let β be c.
Then α ∪ β = {p, c}, with support 3.
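The example can be verified by comparing the support of an itemset inside p's conditional pattern base with the support of the combined pattern in the full database. This is a brute-force check for illustration only; the helper names are hypothetical.

```python
db = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
p_base = {"fcam": 2, "cb": 1}  # p's conditional pattern base


def support_in_db(itemset):
    return sum(itemset <= transaction for transaction in db)


def support_in_base(itemset, base):
    return sum(count for path, count in base.items() if itemset <= set(path))


# The support of an itemset in the base equals the support of that itemset
# together with p in the whole database.
for beta in ({"c"}, {"f"}, {"f", "c"}, {"b"}):
    assert support_in_base(beta, p_base) == support_in_db(beta | {"p"})

print(support_in_base({"c"}, p_base))  # support of {p, c}
```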

15
Single path tree

Let P be a single-path FP-tree
Let {I1, I2, ..., Ik} be an itemset in the tree
Let Ij have the lowest support
Then support({I1, I2, ..., Ik}) = support(Ij)
Example

[Figure: the final FP-tree]
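A sketch of the single-path case, using the path f:3, c:3, a:3 that results from m's conditional pattern base (<f, c, a>: 2 and <f, c, a, b>: 1) once the infrequent b is dropped. Every combination of nodes on the path, joined with the suffix m, is a frequent pattern whose support is the smallest count among the chosen nodes.

```python
from itertools import combinations

path = [("f", 3), ("c", 3), ("a", 3)]  # assumed single-path conditional tree of m
suffix = "m"

patterns = {}
for r in range(1, len(path) + 1):
    for combo in combinations(path, r):
        items = "".join(item for item, _ in combo) + suffix
        # Support of the combination = minimum count along the chosen nodes.
        patterns[items] = min(count for _, count in combo)

for items, support in patterns.items():
    print(items, support)
```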

16
FP_growth Algorithm (Fig. 6.10)

Recursive Algorithm
Input: a transaction database, min_supp
Output: the complete set of frequent patterns
1. FP-tree construction
2. Mining the FP-tree by calling FP_growth(FP_tree,
null)
Key idea: consider single-path FP-trees and
multi-path FP-trees separately
Continue to split until a single-path FP-tree is obtained

17
FP_growth(Tree, α)

If Tree contains a single path P, then
  For each combination (denoted as β) of the nodes
  in the path P:
    Generate pattern β ∪ α with support = minimum support of the
    nodes in β
Else for each a_i in the header of Tree, do
  Generate pattern β = a_i ∪ α with support =
  a_i.support
  Construct
  (1) β's conditional pattern base and
  (2) β's conditional FP-tree Tree_β
  If Tree_β is not empty, then
    Call FP_growth(Tree_β, β)
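The recursion can be sketched in runnable form. This is a simplified variant, not the procedure exactly as given: it keeps each conditional pattern base as a list of weighted prefix paths instead of compressing it into a linked FP-tree, and it omits the single-path shortcut. The pattern-growth logic (pick an item, emit the grown pattern, recurse on its conditional pattern base) is the same. All function names are hypothetical.

```python
from collections import Counter


def pattern_growth(base, suffix, min_support, out):
    """base: list of (ordered item tuple, count); suffix: pattern grown so far."""
    counts = Counter()
    for path, n in base:
        for item in path:
            counts[item] += n
    for item, support in counts.items():
        if support < min_support:
            continue
        pattern = suffix | {item}
        out[pattern] = support
        # Conditional pattern base of `item`: the prefix before it in each path.
        cond = [(path[: path.index(item)], n) for path, n in base if item in path]
        cond = [(prefix, n) for prefix, n in cond if prefix]
        if cond:
            pattern_growth(cond, pattern, min_support, out)


def fp_growth(transactions, min_support):
    counts = Counter(item for t in transactions for item in set(t))
    # Frequency-descending order; ties broken alphabetically (an assumption).
    order = sorted(
        (item for item in counts if counts[item] >= min_support),
        key=lambda item: (-counts[item], item),
    )
    base = [(tuple(item for item in order if item in t), 1) for t in transactions]
    out = {}
    pattern_growth(base, frozenset(), min_support, out)
    return out


db = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
result = fp_growth(db, min_support=3)
for pattern, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(pattern)), support)
```

On the example database with minimum support 3 this finds 18 frequent patterns, the longest being {f, c, a, m} with support 3.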

18
FP-Growth vs. Apriori: Scalability With the
Support Threshold

Data set: T25I20D10K

19
FP-Growth vs. Tree-Projection: Scalability with
the Support Threshold

Data set: T25I20D100K

20
Why Is FP-Growth the Winner?

Divide-and-conquer:
decompose both the mining task and the DB according
to the frequent patterns obtained so far
leads to focused search of smaller databases
Other factors:
no candidate generation, no candidate test
compressed database: the FP-tree structure
no repeated scan of the entire database
basic operations are counting and FP-tree building, not
pattern search and matching

21
Implications of the Methodology: Papers by Han
et al.

Mining closed frequent itemsets and max-patterns
CLOSET (DMKD'00)
Mining sequential patterns
FreeSpan (KDD'00), PrefixSpan (ICDE'01)
Constraint-based mining of frequent patterns
Convertible constraints (KDD'00, ICDE'01)
Computing iceberg data cubes with complex
measures
H-tree and H-cubing algorithm (SIGMOD'01)

22
Visualization of Association Rules: Pane Graph

23
Visualization of Association Rules: Rule Graph

24
Mining Various Kinds of Rules or Regularities

Multi-level, quantitative association rules,
correlation and causality, ratio rules,
sequential patterns, emerging patterns, temporal
associations, partial periodicity
Classification, clustering, iceberg cubes, etc.

25
Multiple-level Association Rules

Items often form a hierarchy
Flexible support settings: items at the lower
level are expected to have lower support.
The transaction database can be encoded based on
dimensions and levels
Explore shared multi-level mining

26
Quantitative Association Rules

Numeric attributes are dynamically discretized
such that the confidence or compactness of the
rules mined is maximized.
2-D quantitative association rules: A_quan1 ∧
A_quan2 ⇒ A_cat
Cluster adjacent association rules to form general
rules using a 2-D grid.
Example

age(X, "34-35") ∧ income(X, "30K - 50K") ⇒
buys(X, "high resolution TV")

27
Redundant Rules [SA95]

Which rule is redundant?
milk ⇒ wheat bread [support = 8%, confidence =
70%]
skim milk ⇒ wheat bread [support = 2%,
confidence = 72%]
The first rule is more general than the second
rule.
A rule is redundant if its support is close to
the expected value, based on a general rule,
and its confidence is close to that of the
general rule.
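The redundancy test can be sketched numerically. The slide does not state what fraction of milk purchases are skim milk, so the value `share=0.25` below is a purely hypothetical assumption, as are the two tolerance parameters and the function name.

```python
def is_redundant(general, specific, share, support_tol=0.25, conf_tol=0.05):
    """Return True if the specific rule adds little beyond the general one.

    share: assumed fraction of the general item's sales made up by the
    specific item (hypothetical input, not given in the slides).
    """
    expected_support = general["support"] * share
    support_close = (
        abs(specific["support"] - expected_support) <= support_tol * expected_support
    )
    conf_close = abs(specific["confidence"] - general["confidence"]) <= conf_tol
    return support_close and conf_close


milk_rule = {"support": 0.08, "confidence": 0.70}
skim_milk_rule = {"support": 0.02, "confidence": 0.72}

# Assumption: skim milk accounts for a quarter of milk purchases, so the
# expected support of the specific rule is 8% * 0.25 = 2%.
print(is_redundant(milk_rule, skim_milk_rule, share=0.25))
```

Under that assumption the second rule has exactly the expected support and nearly the same confidence, so it would be flagged as redundant.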

28
INCREMENTAL MINING [CHNW96]

Rules in DB were found, and a set of new tuples db
is added to DB.
Task: find the new rules in DB ∪ db.
Usually, DB is much larger than db.
Properties of itemsets:
frequent in DB ∪ db if frequent in both DB and
db.
infrequent in DB ∪ db if infrequent in both DB and db.
frequent only in DB: merge with counts in
db.
No DB scan is needed!
frequent only in db: scan DB once to update
their itemset counts.
The same principle is applicable to distributed/parallel
mining.
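The case analysis above can be written as a small decision function. It is a sketch under one assumption that the slide implies but does not state: counts were stored only for itemsets that were frequent in DB, so an unknown count is represented by `None`. The function and parameter names are hypothetical.

```python
def update_status(count_in_DB, count_in_db, size_DB, size_db, min_sup):
    """Decide the status of one itemset in DB ∪ db.

    count_in_DB is None when the itemset was infrequent in DB, because
    only the counts of frequent itemsets are assumed to have been kept.
    min_sup is a relative threshold, e.g. 0.1 for 10%.
    """
    if count_in_DB is not None:
        # Frequent in DB: the old count is known, so add the count from db.
        total = count_in_DB + count_in_db
        threshold = min_sup * (size_DB + size_db)
        return "frequent" if total >= threshold else "infrequent"
    if count_in_db < min_sup * size_db:
        # Infrequent in both parts, hence infrequent in the union.
        return "infrequent"
    # Frequent only in db: its count in DB is unknown.
    return "scan DB once"


# DB has 1000 transactions, db has 100, minimum support is 10%.
print(update_status(150, 5, 1000, 100, 0.1))
print(update_status(100, 2, 1000, 100, 0.1))
print(update_status(None, 5, 1000, 100, 0.1))
print(update_status(None, 30, 1000, 100, 0.1))
```

Only the last case requires touching the old database, which is why the approach pays off when DB is much larger than db.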

29
CORRELATION RULES

Association does not measure correlation [BMS97,
AY98].
Among 5000 students:
3000 play basketball, 3750 eat cereal, 2000 do
both
play basketball ⇒ eat cereal [40%, 66.7%]
The conclusion "basketball and cereal are
correlated" is misleading
because the overall percentage of students eating
cereal is 75%, higher than 66.7%.
Confidence does not always give the correct picture!

30
Correlation Rules

P(B|A)/P(B) is known as the lift of rule B ⇒ A
If less than one, then B and A are negatively
correlated.
Basketball ⇒ Cereal:
$\frac{2000}{3000 \times 3750 / 5000} = \frac{2000 \times 5000}{3000 \times 3750} < 1$

P(A∧B) = P(A) P(B), if A and B are independent
events
A and B are negatively correlated if P(A∧B) / (P(A) P(B)) is less
than 1
If the value is greater than 1, A and B are positively correlated.
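The figures for the basketball and cereal example can be reproduced directly from the student counts. This is plain arithmetic; the variable names are illustrative.

```python
n_students = 5000
n_basketball = 3000
n_cereal = 3750
n_both = 2000

support = n_both / n_students       # P(basketball and cereal)
confidence = n_both / n_basketball  # P(cereal | basketball)
p_cereal = n_cereal / n_students    # P(cereal)
lift = confidence / p_cereal        # equals P(A and B) / (P(A) * P(B))

print(f"support    = {support:.1%}")
print(f"confidence = {confidence:.1%}")
print(f"P(cereal)  = {p_cereal:.1%}")
print(f"lift       = {lift:.3f}")
```

The lift comes out at about 0.889, below 1, so playing basketball and eating cereal are negatively correlated even though the rule has 66.7% confidence.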

31
Chi-square Correlation [BMS97]

The cutoff value at the 95% significance level is
3.84 > 0.9
Thus, we do not reject the independence
assumption.

32
Constraint-based Data Mining

Finding all the patterns in a database
autonomously? Unrealistic!
The patterns could be too many but not focused!
Data mining should be an interactive process
The user directs what is to be mined using a data mining
query language (or a graphical user interface)
Constraint-based mining
User flexibility: provides constraints on what is to
be mined
System optimization: explores such constraints
for efficient mining (constraint-based mining)

33
Constraints in Data Mining

Knowledge type constraint
classification, association, etc.
Data constraint: using SQL-like queries
find product pairs sold together in stores in
Vancouver in Dec. '00
Dimension/level constraint
in relevance to region, price, brand, customer
category
Rule (or pattern) constraint
small sales (price < 10) trigger big sales
(sum > 200)
Interestingness constraint
strong rules: min_support ≥ 3%, min_confidence
≥ 60%

34
Constrained Mining vs. Constraint-Based Search

Constrained mining vs. constraint-based
search/reasoning
Both are aimed at reducing the search space
Finding all patterns satisfying constraints vs.
finding some (or one) answer in constraint-based
search in AI
Constraint-pushing vs. heuristic search
How to integrate them is an interesting research
problem
Constrained mining vs. query processing in DBMS
Database query processing is required to find all answers
Constrained pattern mining shares a similar
philosophy with pushing selections deeply into query
processing

35
Constrained Frequent Pattern Mining: A Mining
Query Optimization Problem

Given a frequent pattern mining query with a set
of constraints C, the algorithm should be
sound: it only finds frequent sets that satisfy
the given constraints C
complete: all frequent sets satisfying the given
constraints C are found
A naïve solution
First find all frequent sets, and then test them
for constraint satisfaction
More efficient approaches
Analyze the properties of constraints
comprehensively
Push them as deeply as possible inside the
frequent pattern computation.

36
Anti-Monotonicity in Constraint-Based Mining

TDB (min_sup = 2)
TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g

Anti-monotonicity
When an itemset S satisfies the constraint, so does any
of its subsets (equivalently, when S violates the
constraint, so does every superset of S)
sum(S.Price) ≤ v is anti-monotone
sum(S.Price) ≥ v is not anti-monotone
Example. C: range(S.profit) ≤ 15 is anti-monotone
Itemset ab violates C
So does every superset of ab

Item   Profit
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10
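The example can be checked by brute force against the profit table: ab violates range(S.profit) ≤ 15, and so does every one of its supersets, which is what lets a miner stop extending ab. The helper names are illustrative.

```python
from itertools import combinations

profit = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}


def profit_range(itemset):
    values = [profit[item] for item in itemset]
    return max(values) - min(values)


def satisfies(itemset):
    return profit_range(itemset) <= 15


ab = {"a", "b"}
others = [item for item in profit if item not in ab]

# All supersets of ab, including ab itself (r = 0).
supersets = [
    ab | set(extra)
    for r in range(len(others) + 1)
    for extra in combinations(others, r)
]

print(profit_range(ab))                            # range of ab
print(len(supersets))                              # number of supersets checked
print(any(satisfies(s) for s in supersets))        # does any superset satisfy C?
```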

37
Which Constraints Are Anti-Monotone?

Constraint                        Anti-monotone
v ∈ S                             no
S ⊇ V                             no
S ⊆ V                             yes
min(S) ≤ v                        no
min(S) ≥ v                        yes
max(S) ≤ v                        yes
max(S) ≥ v                        no
count(S) ≤ v                      yes
count(S) ≥ v                      no
sum(S) ≤ v (∀a ∈ S, a ≥ 0)        yes
sum(S) ≥ v (∀a ∈ S, a ≥ 0)        no
range(S) ≤ v                      yes
range(S) ≥ v                      no
avg(S) θ v, θ ∈ {=, ≤, ≥}         convertible
support(S) ≥ ξ                    yes
support(S) ≤ ξ                    no

38
Monotonicity in Constraint-Based Mining

TDB (min_sup = 2)
TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g

Monotonicity
When an itemset S satisfies the constraint, so
does any of its supersets
sum(S.Price) ≥ v is monotone
min(S.Price) ≤ v is monotone
Example. C: range(S.profit) ≥ 15
Itemset ab satisfies C
So does every superset of ab

Item   Profit
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10

39
Which Constraints Are Monotone?

Constraint                        Monotone
v ∈ S                             yes
S ⊇ V                             yes
S ⊆ V                             no
min(S) ≤ v                        yes
min(S) ≥ v                        no
max(S) ≤ v                        no
max(S) ≥ v                        yes
count(S) ≤ v                      no
count(S) ≥ v                      yes
sum(S) ≤ v (∀a ∈ S, a ≥ 0)        no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)        yes
range(S) ≤ v                      no
range(S) ≥ v                      yes
avg(S) θ v, θ ∈ {=, ≤, ≥}         convertible
support(S) ≥ ξ                    no
support(S) ≤ ξ                    yes
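Some rows of the two tables can be spot-checked by brute force over all itemsets drawn from the eight-item profit table. This only tests the properties on that one data set, so a True here is evidence rather than proof, while a False is a genuine counterexample. The thresholds and helper names are illustrative choices.

```python
from itertools import combinations

profit = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}
items = sorted(profit)
itemsets = [
    frozenset(combo)
    for r in range(1, len(items) + 1)
    for combo in combinations(items, r)
]


def vals(s):
    return [profit[item] for item in s]


constraints = {
    "range(S) <= 15": lambda s: max(vals(s)) - min(vals(s)) <= 15,
    "range(S) >= 15": lambda s: max(vals(s)) - min(vals(s)) >= 15,
    "min(S) >= 0": lambda s: min(vals(s)) >= 0,
    "max(S) >= 30": lambda s: max(vals(s)) >= 30,
    "count(S) <= 3": lambda s: len(s) <= 3,
    "sum(S) >= 20": lambda s: sum(vals(s)) >= 20,
}


def is_anti_monotone(c):
    # Removing any single item from a satisfying itemset must keep it satisfying.
    return all(c(s - {i}) for s in itemsets if len(s) > 1 and c(s) for i in s)


def is_monotone(c):
    # Adding any single item to a satisfying itemset must keep it satisfying.
    return all(c(s | {i}) for s in itemsets if c(s) for i in items)


table = {name: (is_anti_monotone(c), is_monotone(c)) for name, c in constraints.items()}
for name, (anti, mono) in table.items():
    print(f"{name:16} anti-monotone={anti!s:5} monotone={mono}")
```

Because the profit table contains negative values, sum(S) ≥ 20 comes out as neither monotone nor anti-monotone here, which illustrates why the sum rows carry the side condition that all values are non-negative.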
40
Succinctness, Convertible, Inconvertible
Constraints in Book

We will not consider these in this course.

41
Associative Classification

Mine possible association rules in the form of
itemset ⇒ class
Itemset: a set of attribute-value pairs
Class: class label
Build classifier
Organize rules according to decreasing precedence
based on confidence and support
B. Liu, W. Hsu & Y. Ma. Integrating
classification and association rule mining. In
KDD'98
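A rough sketch of the classifier-building step: rules are ranked by confidence, then support, and a record gets the class of the first rule whose itemset it contains. The rules, their numbers, the default class, and all names below are invented for illustration and are not taken from the cited paper.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    itemset: frozenset  # attribute-value pairs on the left-hand side
    label: str          # class label on the right-hand side
    confidence: float
    support: float


# Hypothetical mined rules.
rules = [
    Rule(frozenset({"age=young"}), "buys=yes", 0.70, 0.30),
    Rule(frozenset({"age=young", "student=no"}), "buys=no", 0.85, 0.10),
    Rule(frozenset({"income=high"}), "buys=yes", 0.85, 0.20),
]

# Decreasing precedence: higher confidence first, ties broken by higher support.
ranked = sorted(rules, key=lambda r: (-r.confidence, -r.support))


def classify(record, default="buys=no"):
    """Return the label of the highest-precedence rule that matches the record."""
    for rule in ranked:
        if rule.itemset <= record:
            return rule.label
    return default


print(classify({"age=young", "student=no", "income=low"}))
print(classify({"age=young", "student=yes"}))
print(classify({"age=old"}))
```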

42
Classification by Aggregating Emerging Patterns

Emerging pattern (EP): a pattern frequent in one
class of data but infrequent in others.
Age <= 30 is frequent in class buys_computer = yes
and infrequent in class buys_computer = no
Rule: age <= 30 ⇒ buys computer
G. Dong & J. Li. Efficient mining of emerging
patterns: discovering trends and differences. In
KDD'99
