1
Association Rule Mining (II)
  • Instructor Qiang Yang
  • Thanks J.Han and J. Pei

2
Bottleneck of Frequent-pattern Mining
  • Multiple database scans are costly
  • Mining long patterns needs many passes of
    scanning and generates lots of candidates
  • To find a frequent itemset i1 i2 ... i100
  • # of scans: 100
  • # of candidates: C(100,1) + C(100,2) + ... + C(100,100)
    = 2^100 - 1 ≈ 1.27 × 10^30 !
  • Bottleneck: candidate-generation-and-test
  • Can we avoid candidate generation?
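
The candidate count on this slide can be checked directly. The sketch below sums the binomial coefficients for 100 items; the function name `total_candidates` is only illustrative.

```python
import math

def total_candidates(n: int) -> int:
    """Number of non-empty itemsets over n items: C(n,1) + ... + C(n,n)."""
    return sum(math.comb(n, k) for k in range(1, n + 1))

count = total_candidates(100)
print(count == 2**100 - 1)  # True
print(f"{count:.3e}")       # 1.268e+30
```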

3
FP-growth: Frequent-pattern Mining Without
Candidate Generation
  • Heuristic: let P be a frequent itemset, S be the
    set of transactions that contain P, and x be an item.
    If x is a frequent item in S, {x} ∪ P must be a
    frequent itemset
  • No candidate generation!
  • A compact data structure, FP-tree, to store
    information for frequent pattern mining
  • Recursive mining algorithm for mining complete
    set of frequent patterns

4
Example
TID   Items Bought
100   f,a,c,d,g,i,m,p
200   a,b,c,f,l,m,o
300   b,f,h,j,o
400   b,c,k,s,p
500   a,f,c,e,l,p,m,n
Min Support = 3

5
Scan the database
  • First scan: list of frequent items, sorted
    (item:support)
  • <(f:4), (c:4), (a:3), (b:3), (m:3), (p:3)>
  • The root of the tree is created and labeled
    (shown as "root" on the following slides)
  • Second scan of the database
  • Scanning the first transaction leads to the first
    branch of the tree <(f:1), (c:1), (a:1), (m:1),
    (p:1)>
  • Items in each transaction are ordered according
    to frequency
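
The first scan and the reordering step can be sketched as follows, using the five transactions of the example. Ties between equally frequent items are broken with the order shown on the slide; the `SLIDE_ORDER` list and the function names are assumptions made for this illustration.

```python
from collections import Counter

TRANSACTIONS = {
    100: ["f", "a", "c", "d", "g", "i", "m", "p"],
    200: ["a", "b", "c", "f", "l", "m", "o"],
    300: ["b", "f", "h", "j", "o"],
    400: ["b", "c", "k", "s", "p"],
    500: ["a", "f", "c", "e", "l", "p", "m", "n"],
}
MIN_SUPPORT = 3
# Tie-break among equally frequent items, copied from the slide's list.
SLIDE_ORDER = ["f", "c", "a", "b", "m", "p"]

def frequent_items(transactions, min_support):
    """First scan: count each item once per transaction, keep the frequent ones."""
    counts = Counter(item for items in transactions.values() for item in set(items))
    return {item: n for item, n in counts.items() if n >= min_support}

def order_transaction(items, ranking):
    """Keep only frequent items, in descending-frequency order."""
    return [item for item in ranking if item in items]

freq = frequent_items(TRANSACTIONS, MIN_SUPPORT)
ranking = sorted(freq, key=lambda i: (-freq[i], SLIDE_ORDER.index(i)))
ordered = {tid: order_transaction(items, ranking)
           for tid, items in TRANSACTIONS.items()}

print(ranking)       # ['f', 'c', 'a', 'b', 'm', 'p']
print(ordered[100])  # ['f', 'c', 'a', 'm', 'p']
```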

6
Scanning TID100
Transaction database: TID 100 = f,a,c,d,g,i,m,p
Header table (item: number of nodes): f:1, c:1, a:1, m:1, p:1
Tree after TID 100: root - f:1 - c:1 - a:1 - m:1 - p:1

7
Scanning TID200
Items Bought
f,a,c,d,g,i,m,p
a,b,c,f,l,m,o
b,f,h,j,o
b,c,k,s,p
a,f,c,e,l,p,m,n
  • Frequent single items: F1 = <f,c,a,b,m,p>
  • TID 200: possible frequent items, intersect with
    F1: f,c,a,b,m
  • Along the first branch <f,c,a,m,p>, intersect:
    <f,c,a>
  • Generate two children: <b>, <m>

8
Scanning TID200
Transaction database: TID 200 = f,c,a,b,m
Header table (item: number of nodes): f:1, c:1, a:1, b:1, m:2, p:1
Tree after TID 200:
root
  f:2
    c:2
      a:2
        m:1
          p:1
        b:1
          m:1

9
The final FP-tree

Transaction database (min support = 3):
TID   Items
100   f,a,c,d,g,i,m,p
200   a,b,c,f,l,m,o
300   b,f,h,j,o
400   b,c,k,s,p
500   a,f,c,e,l,p,m,n
Header table (item: number of nodes): f:1, c:2, a:1, b:3, m:2, p:2
Tree:
root
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
Frequent 1-items in frequency descending order: f,c,a,b,m,p

10
FP-Tree Construction
  • Scans the database only twice
  • Subsequent mining based on the FP-tree
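
A minimal sketch of the second scan, which inserts the frequency-ordered transactions into a prefix tree. The `Node` class, `build_fp_tree` and `show` are names chosen for this illustration, and the ordered transactions are hard-coded from the running example rather than recomputed.

```python
class Node:
    """One FP-tree node: an item, a count, a parent link and child links."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fp_tree(ordered_transactions):
    root = Node(None)
    header = {}  # item -> list of tree nodes carrying that item
    for items in ordered_transactions:
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:  # start a new branch below the shared prefix
                child = Node(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1
            node = child
    return root, header

# Transactions 100..500 after dropping infrequent items and reordering.
ORDERED = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
root, header = build_fp_tree(ORDERED)

def show(node, depth=0):
    """Print the tree with one level of indentation per tree level."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

show(root)
```

The header lists end up with 1, 2, 1, 3, 2 and 2 nodes for f, c, a, b, m and p, which matches the header table of the final FP-tree slide.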

11
How to Mine an FP-tree?
  • Step 1: form conditional pattern base
  • Step 2: construct conditional FP-tree
  • Step 3: recursively mine conditional FP-trees

12
Conditional Pattern Base
  • Let I be a frequent item
  • I's conditional pattern base is a sub-database
    which consists of the set of prefix paths in the
    FP-tree with item I as a co-occurring suffix
    pattern
  • Example
  • m is a frequent item
  • m's conditional pattern base:
  • <f,c,a>, support 2
  • <f,c,a,b>, support 1
  • Mine recursively on such databases


(Figure: the final FP-tree, as on slide 9)

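
A conditional pattern base can be sketched without walking node links: each prefix path that ends just before item I in the tree corresponds to the part of a frequency-ordered transaction that precedes I. The shortcut below reads those prefixes straight from the ordered transactions, which gives the same prefixes and counts as following the tree. The function name is an assumption.

```python
from collections import Counter

ORDERED = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]

def conditional_pattern_base(ordered_transactions, item):
    """Map each non-empty prefix that precedes `item` to how often it occurs."""
    base = Counter()
    for items in ordered_transactions:
        if item in items:
            prefix = tuple(items[: items.index(item)])
            if prefix:
                base[prefix] += 1
    return dict(base)

print(conditional_pattern_base(ORDERED, "m"))
# {('f', 'c', 'a'): 2, ('f', 'c', 'a', 'b'): 1}
```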
13
Conditional Pattern Tree
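  • Let I be a suffix item and DB_I be its
    conditional pattern base
  • The frequent pattern tree Tree_I built from DB_I
    is known as the conditional pattern tree
  • Example
  • m is a frequent item
  • m's conditional pattern base:
  • <f,c,a>, support 2
  • <f,c,a,b>, support 1
  • m's conditional pattern tree: <f:3, c:3, a:3>
    (b is dropped because its support in the base,
    1, is below the minimum support of 3)

(Figure: FP-tree path f:4 - c:3 - a:3 - m:2)
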
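
Building a conditional tree from a pattern base amounts to counting items inside the base, dropping those below the minimum support, and merging the prefixes that become identical. The sketch below returns the merged paths as a dictionary instead of a linked tree; that representation and the function name are assumptions.

```python
from collections import Counter

def conditional_tree_paths(pattern_base, min_support):
    """Prune infrequent items from each prefix, then merge equal prefixes."""
    item_counts = Counter()
    for prefix, count in pattern_base.items():
        for item in prefix:
            item_counts[item] += count
    frequent = {i for i, n in item_counts.items() if n >= min_support}
    merged = Counter()
    for prefix, count in pattern_base.items():
        kept = tuple(i for i in prefix if i in frequent)
        if kept:
            merged[kept] += count
    return dict(merged)

m_base = {("f", "c", "a"): 2, ("f", "c", "a", "b"): 1}
print(conditional_tree_paths(m_base, 3))  # {('f', 'c', 'a'): 3}
```

For m, item b occurs only once in the base, so it is pruned and the two prefixes collapse into the single path f, c, a with count 3.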
14
Composition of patterns α and β
  • Let α be a frequent item in DB, B be α's
    conditional pattern base, and β be an itemset in
    B. Then α ∪ β is frequent in DB if and only if β
    is frequent in B.
  • Example
  • Starting with α = p
  • p's conditional pattern base (from the tree) B:
  • (f,c,a,m): 2, (c,b): 1
  • Let β be {c}.
  • Then α ∪ β = {p,c}, with support 3.
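
The example can be checked by counting on both sides: the support of {c} inside p's conditional pattern base should equal the support of {p,c} in the original database. The helper names are illustrative.

```python
DB = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
B = {("f", "c", "a", "m"): 2, ("c", "b"): 1}  # p's conditional pattern base

def support_in_db(itemset):
    """Number of transactions of DB that contain every item of `itemset`."""
    return sum(1 for t in DB if itemset <= t)

def support_in_base(itemset):
    """Weighted number of prefixes of B that contain every item of `itemset`."""
    return sum(n for prefix, n in B.items() if itemset <= set(prefix))

print(support_in_base({"c"}), support_in_db({"p", "c"}))  # 3 3
print(support_in_base({"f"}), support_in_db({"p", "f"}))  # 2 2
```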

15
Single path tree
  • Let P be a single-path FP-tree
  • Let {I1, I2, ..., Ik} be an itemset in the tree
  • Let Ij have the lowest support
  • Then support({I1, I2, ..., Ik}) = support(Ij)
  • Example

(Figure: the final FP-tree, as on slide 9)

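
For a single-path tree, every combination of nodes is a pattern whose support is the smallest count among the chosen nodes. The sketch below applies this to the path f:3, c:3, a:3, which is what remains of m's conditional pattern base (<f,c,a>: 2 and <f,c,a,b>: 1) once b is dropped for having support 1. The function name and the `suffix` parameter are assumptions.

```python
from itertools import combinations

def single_path_patterns(path, suffix=()):
    """path: list of (item, count) from root to leaf.
    Returns {pattern: support} for every non-empty combination of nodes."""
    patterns = {}
    for r in range(1, len(path) + 1):
        for combo in combinations(path, r):
            items = tuple(i for i, _ in combo) + tuple(suffix)
            patterns[items] = min(c for _, c in combo)
    return patterns

m_tree = [("f", 3), ("c", 3), ("a", 3)]
patterns = single_path_patterns(m_tree, suffix=("m",))
for pattern, support in patterns.items():
    print("".join(pattern), support)
```

This yields the seven patterns fm, cm, am, fcm, fam, cam and fcam, each with support 3.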
16
FP_growth Algorithm (Fig. 6.10)
  • Recursive algorithm
  • Input: a transaction database, min_supp
  • Output: the complete set of frequent patterns
  • 1. FP-tree construction
  • 2. Mining the FP-tree by calling FP_growth(FP_tree,
    null)
  • Key idea: consider single-path FP-tree and
    multi-path FP-tree separately
  • Continue to split until a single-path FP-tree is
    obtained

17
FP_growth(Tree, α)
  • If Tree contains a single path P, then
  • for each combination (denoted as β) of the nodes
    in the path P,
  • generate pattern β ∪ α with support = minimum
    support of the nodes in β
  • Else, for each item a_i in the header of Tree, do
  • generate pattern β = a_i ∪ α with support =
    a_i.support
  • construct
  • (1) β's conditional pattern base and
  • (2) β's conditional FP-tree Tree_β
  • if Tree_β is not empty, then
  • call FP_growth(Tree_β, β)
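
The recursion can be sketched in a simplified form. This is not the procedure of Fig. 6.10 itself: each tree is represented by its pattern base (a mapping from an ordered item tuple to a count) instead of a compressed, linked tree, and the single-path shortcut is omitted, so only the "else" branch is modeled. All names are assumptions.

```python
from collections import Counter

def fp_growth(pattern_base, min_support, suffix=()):
    """Yield (pattern, support) pairs by recursing on conditional pattern bases.
    Every tuple in `pattern_base` must list its items in the same global order."""
    counts = Counter()
    for path, n in pattern_base.items():
        for item in path:
            counts[item] += n
    for item, support in counts.items():
        if support < min_support:
            continue
        pattern = (item,) + suffix
        yield pattern, support
        # Conditional pattern base of `item`: the prefixes that precede it.
        conditional = Counter()
        for path, n in pattern_base.items():
            if item in path:
                prefix = path[: path.index(item)]
                if prefix:
                    conditional[prefix] += n
        yield from fp_growth(conditional, min_support, pattern)

ORDERED = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
base = Counter(tuple(t) for t in ORDERED)
patterns = {frozenset(p): s for p, s in fp_growth(base, 3)}
print(len(patterns))                # 18
print(patterns[frozenset("fcam")])  # 3
```

On the running example with minimum support 3 this finds 18 frequent patterns, including the six frequent single items, {c,p}, and the seven patterns that end in m.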

18
FP-Growth vs. Apriori: Scalability With the
Support Threshold
Data set: T25I20D10K
19
FP-Growth vs. Tree-Projection: Scalability with
the Support Threshold
Data set: T25I20D100K
20
Why Is FP-Growth the Winner?
  • Divide-and-conquer
  • decompose both the mining task and DB according
    to the frequent patterns obtained so far
  • leads to focused search of smaller databases
  • Other factors
  • no candidate generation, no candidate test
  • compressed database FP-tree structure
  • no repeated scan of entire database
  • basic operations: counting and FP-tree building,
    not pattern search and matching

21
Implications of the Methodology: Papers by Han,
et al.
  • Mining closed frequent itemsets and max-patterns
  • CLOSET (DMKD'00)
  • Mining sequential patterns
  • FreeSpan (KDD'00), PrefixSpan (ICDE'01)
  • Constraint-based mining of frequent patterns
  • Convertible constraints (KDD'00, ICDE'01)
  • Computing iceberg data cubes with complex
    measures
  • H-tree and H-cubing algorithm (SIGMOD'01)

22
Visualization of Association Rules: Pane Graph
23
Visualization of Association Rules: Rule Graph
24
Mining Various Kinds of Rules or Regularities
  • Multi-level, quantitative association rules,
    correlation and causality, ratio rules,
    sequential patterns, emerging patterns, temporal
    associations, partial periodicity
  • Classification, clustering, iceberg cubes, etc.

25
Multiple-level Association Rules
  • Items often form hierarchy
  • Flexible support settings: items at the lower
    level are expected to have lower support.
  • Transaction database can be encoded based on
    dimensions and levels
  • Explore shared multi-level mining

26
Quantitative Association Rules
  • Numeric attributes are dynamically discretized
    such that the confidence or compactness of the
    rules mined is maximized.
  • 2-D quantitative association rules: A_quan1 ∧
    A_quan2 ⇒ A_cat
  • Cluster adjacent association rules to form
    general rules using a 2-D grid.
  • Example

age(X, 34-35) ∧ income(X, 30K - 50K) ⇒
buys(X, high resolution TV)
27
Redundant Rules [SA95]
  • Which rule is redundant?
  • milk ⇒ wheat bread, [support = 8%, confidence =
    70%]
  • skim milk ⇒ wheat bread, [support = 2%,
    confidence = 72%]
  • The first rule is more general than the second
    rule.
  • A rule is redundant if its support is close to
    the expected value, based on a general rule,
    and its confidence is close to that of the
    general rule.
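
The redundancy test can be sketched as below. Two things are assumptions that the slide does not give: the share of milk sales that is skim milk (taken here as one quarter, which makes the expected support 2%) and the 10% relative tolerance used for "close to".

```python
def is_redundant(general, specific, share, tolerance=0.1):
    """general, specific: (support, confidence) as fractions.
    share: assumed fraction of the general item's sales due to the specific item.
    tolerance: assumed relative tolerance for "close to"."""
    expected_support = general[0] * share
    support_close = abs(specific[0] - expected_support) <= tolerance * expected_support
    confidence_close = abs(specific[1] - general[1]) <= tolerance * general[1]
    return support_close and confidence_close

milk = (0.08, 0.70)       # milk => wheat bread
skim_milk = (0.02, 0.72)  # skim milk => wheat bread
print(is_redundant(milk, skim_milk, share=0.25))  # True
print(is_redundant(milk, skim_milk, share=0.50))  # False
```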

28
INCREMENTAL MINING [CHNW96]
  • Rules in DB were found and a set of new tuples db
    is added to DB.
  • Task: to find new rules in DB ∪ db.
  • Usually, DB is much larger than db.
  • Properties of itemsets
  • Frequent in DB ∪ db if frequent in both DB and
    db.
  • Infrequent in DB ∪ db if infrequent in both DB
    and db.
  • Frequent only in DB: merge with counts in
    db.
  • No DB scan is needed!
  • Frequent only in db: scan DB once to update
    their itemset counts.
  • Same principle applicable to distributed/parallel
    mining.
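
These cases can be sketched for single items. The code is a simplified illustration rather than the algorithm of [CHNW96]: the function name, the relative support threshold `min_ratio`, and the toy data are all assumptions.

```python
from collections import Counter

def incremental_update(db_old, old_counts, db_new, min_ratio):
    """old_counts: item -> count, for the items already known to be frequent in db_old.
    Returns item -> count for the items frequent in db_old + db_new."""
    threshold = min_ratio * (len(db_old) + len(db_new))
    new_counts = Counter(item for t in db_new for item in set(t))
    result = {}
    # Frequent in DB: merge with the counts from db, no scan of DB needed.
    for item, count in old_counts.items():
        merged = count + new_counts.get(item, 0)
        if merged >= threshold:
            result[item] = merged
    # Frequent only in db: one scan of DB to recover the missing counts.
    candidates = {i for i, c in new_counts.items()
                  if i not in old_counts and c >= min_ratio * len(db_new)}
    if candidates:
        rescanned = Counter(i for t in db_old for i in set(t) if i in candidates)
        for item in candidates:
            merged = rescanned[item] + new_counts[item]
            if merged >= threshold:
                result[item] = merged
    # Items infrequent in both DB and db are never examined.
    return result

DB_OLD = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"a", "d"}]  # hypothetical data
DB_NEW = [{"c", "d"}, {"c", "a"}]                          # hypothetical data
OLD_FREQUENT = {"a": 4, "b": 2}                            # at min_ratio 0.5
print(incremental_update(DB_OLD, OLD_FREQUENT, DB_NEW, 0.5))
```

With these toy inputs the result keeps a (count 5) and adds c (count 3), while b falls below the new threshold of 3.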

29
CORRELATION RULES
  • Association does not measure correlation [BMS97,
    AY98].
  • Among 5000 students
  • 3000 play basketball, 3750 eat cereal, 2000 do
    both
  • play basketball ⇒ eat cereal [support 40%,
    confidence 66.7%]
  • The conclusion "basketball and cereal are
    correlated" is misleading
  • because the overall percentage of students eating
    cereal is 75%, higher than 66.7%.
  • Confidence does not always give the correct
    picture!

30
Correlation Rules
  • P(B|A)/P(B) is known as the lift of rule B ⇒ A
  • If less than one, then B and A are negatively
    correlated.
  • Basketball ⇒ Cereal
  • 2000/(3000 × 3750/5000) = (2000 × 5000)/(3000 ×
    3750) < 1
  • P(A ∧ B) = P(B) × P(A) if A and B are independent
    events
  • A and B are negatively correlated if the value is
    less than 1
  • If the value is greater than 1, A and B are
    positively correlated.
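
The basketball and cereal figures can be recomputed from the raw counts. The function name and argument names are illustrative.

```python
def rule_metrics(n_total, n_a, n_b, n_both):
    """Support, confidence and lift of the rule A => B from raw counts."""
    support = n_both / n_total
    confidence = n_both / n_a
    lift = support / ((n_a / n_total) * (n_b / n_total))
    return support, confidence, lift

support, confidence, lift = rule_metrics(5000, 3000, 3750, 2000)
print(f"support={support:.1%} confidence={confidence:.1%} lift={lift:.3f}")
# support=40.0% confidence=66.7% lift=0.889
```

The lift of about 0.889 is below 1, which is the negative correlation the slide points out.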

31
Chi-square Correlation [BMS97]
  • The cutoff value at the 95% significance level is
    3.84, which is greater than the chi-square value
    of 0.9
  • Thus, we do not reject the independence
    assumption.
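
The contingency table behind the value 0.9 is not part of this transcript, so the sketch below applies the same test to a different table: the basketball and cereal counts from the correlation slide. It is a plain-Python chi-square statistic for a table of observed counts; the function name is an assumption.

```python
def chi_square(table):
    """Pearson chi-square statistic for a table of observed counts."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Rows: plays basketball yes/no; columns: eats cereal yes/no.
observed = [[2000, 1000],
            [1750, 250]]
chi2 = chi_square(observed)
print(round(chi2, 2))  # 277.78
print(chi2 > 3.84)     # True
```

For this table the statistic is far above the 3.84 cutoff, so, unlike the example on the slide, independence would be rejected here.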

32
Constraint-based Data Mining
  • Finding all the patterns in a database
    autonomously? Unrealistic!
  • The patterns could be too many but not focused!
  • Data mining should be an interactive process
  • User directs what to be mined using a data mining
    query language (or a graphical user interface)
  • Constraint-based mining
  • User flexibility: provides constraints on what to
    be mined
  • System optimization: explores such constraints
    for efficient mining (constraint-based mining)

33
Constraints in Data Mining
  • Knowledge type constraint
  • classification, association, etc.
  • Data constraint: using SQL-like queries
  • find product pairs sold together in stores in
    Vancouver in Dec.'00
  • Dimension/level constraint
  • in relevance to region, price, brand, customer
    category
  • Rule (or pattern) constraint
  • small sales (price < 10) triggers big sales
    (sum > 200)
  • Interestingness constraint
  • strong rules: min_support ≥ 3%, min_confidence
    ≥ 60%

34
Constrained Mining vs. Constraint-Based Search
  • Constrained mining vs. constraint-based
    search/reasoning
  • Both are aimed at reducing search space
  • Finding all patterns satisfying constraints vs.
    finding some (or one) answer in constraint-based
    search in AI
  • Constraint-pushing vs. heuristic search
  • It is an interesting research problem on how to
    integrate them
  • Constrained mining vs. query processing in DBMS
  • Database query processing requires finding all
    answers
  • Constrained pattern mining shares a similar
    philosophy with pushing selections deeply in query
    processing

35
Constrained Frequent Pattern Mining: A Mining
Query Optimization Problem
  • Given a frequent pattern mining query with a set
    of constraints C, the algorithm should be
  • sound: it only finds frequent sets that satisfy
    the given constraints C
  • complete: all frequent sets satisfying the given
    constraints C are found
  • A naïve solution
  • First find all frequent sets, and then test them
    for constraint satisfaction
  • More efficient approaches
  • Analyze the properties of constraints
    comprehensively
  • Push them as deeply as possible inside the
    frequent pattern computation.

36
Anti-Monotonicity in Constraint-Based Mining
TDB (min_sup = 2)
TID Transaction
10 a, b, c, d, f
20 b, c, d, f, g, h
30 a, c, d, e, f
40 c, e, f, g
  • Anti-monotonicity
  • When an itemset S satisfies the constraint, so
    does any of its subsets
  • sum(S.Price) ≤ v is anti-monotone
  • sum(S.Price) ≥ v is not anti-monotone
  • Example. C: range(S.profit) ≤ 15 is anti-monotone
  • Itemset ab violates C
  • So does every superset of ab

Item Profit
a 40
b 0
c -20
d 10
e -30
f 30
g 20
h -10
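
Pushing an anti-monotone constraint into level-wise mining can be sketched on this TDB and profit table. A candidate that violates range(S.profit) ≤ 15 is discarded before it is extended, because none of its supersets can satisfy the constraint. This is an illustrative level-wise search, not an algorithm from the slides; the classical Apriori subset-pruning step is left out for brevity, and all names are assumptions.

```python
TDB = {10: set("abcdf"), 20: set("bcdfgh"), 30: set("acdef"), 40: set("cefg")}
PROFIT = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}
MIN_SUP = 2

def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in TDB.values() if itemset <= t)

def satisfies(itemset):
    """Constraint C: range(S.profit) <= 15."""
    values = [PROFIT[i] for i in itemset]
    return max(values) - min(values) <= 15

def constrained_frequent_itemsets():
    level = [frozenset([i]) for i in sorted(PROFIT)]
    level = [s for s in level if satisfies(s) and support(s) >= MIN_SUP]
    found = []
    while level:
        found.extend(level)
        # Join surviving k-itemsets into (k+1)-candidates; itemsets that
        # violate C or are infrequent were dropped, so they are never extended.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates if satisfies(c) and support(c) >= MIN_SUP]
    return found

result = constrained_frequent_itemsets()
print(sorted("".join(sorted(s)) for s in result))
```

On this data the search returns the seven frequent single items and the pairs af, bd, ce and fg; ab never becomes a candidate that survives, in line with the example on the slide.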
37
Which Constraints Are Anti-Monotone?
Constraint                        Antimonotone
v ∈ S                             no
S ⊇ V                             no
S ⊆ V                             yes
min(S) ≤ v                        no
min(S) ≥ v                        yes
max(S) ≤ v                        yes
max(S) ≥ v                        no
count(S) ≤ v                      yes
count(S) ≥ v                      no
sum(S) ≤ v (∀a ∈ S, a ≥ 0)        yes
sum(S) ≥ v (∀a ∈ S, a ≥ 0)        no
range(S) ≤ v                      yes
range(S) ≥ v                      no
avg(S) θ v, θ ∈ {=, ≤, ≥}         convertible
support(S) ≥ ξ                    yes
support(S) ≤ ξ                    no
38
Monotonicity in Constraint-Based Mining
TDB (min_sup = 2)
TID Transaction
10 a, b, c, d, f
20 b, c, d, f, g, h
30 a, c, d, e, f
40 c, e, f, g
  • Monotonicity
  • When an itemset S satisfies the constraint, so
    does any of its supersets
  • sum(S.Price) ≥ v is monotone
  • min(S.Price) ≤ v is monotone
  • Example. C: range(S.profit) ≥ 15
  • Itemset ab satisfies C
  • So does every superset of ab

Item Profit
a 40
b 0
c -20
d 10
e -30
f 30
g 20
h -10
39
Which Constraints Are Monotone?
Constraint                        Monotone
v ∈ S                             yes
S ⊇ V                             yes
S ⊆ V                             no
min(S) ≤ v                        yes
min(S) ≥ v                        no
max(S) ≤ v                        no
max(S) ≥ v                        yes
count(S) ≤ v                      no
count(S) ≥ v                      yes
sum(S) ≤ v (∀a ∈ S, a ≥ 0)        no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)        yes
range(S) ≤ v                      no
range(S) ≥ v                      yes
avg(S) θ v, θ ∈ {=, ≤, ≥}         convertible
support(S) ≥ ξ                    no
support(S) ≤ ξ                    yes
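
A few rows of the two tables can be sanity-checked by brute force on the item profit table used in the examples. Checking one data set can only refute a property (by finding a counterexample), not prove it in general, so a result of True below means "no counterexample on this table". The thresholds 15, 0 and 30 and all function names are assumptions.

```python
from itertools import combinations

PROFIT = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}
ITEMS = sorted(PROFIT)
ALL_ITEMSETS = [frozenset(c) for r in range(1, len(ITEMS) + 1)
                for c in combinations(ITEMS, r)]

def values(s):
    return [PROFIT[i] for i in s]

def is_anti_monotone(constraint):
    """No satisfying itemset has a non-empty one-item-smaller subset that fails."""
    return all(constraint(s - {i})
               for s in ALL_ITEMSETS if len(s) > 1 and constraint(s)
               for i in s)

def is_monotone(constraint):
    """No satisfying itemset has a one-item-larger superset that fails."""
    return all(constraint(s | {i})
               for s in ALL_ITEMSETS if constraint(s)
               for i in ITEMS if i not in s)

range_le_15 = lambda s: max(values(s)) - min(values(s)) <= 15
range_ge_15 = lambda s: max(values(s)) - min(values(s)) >= 15
min_ge_0 = lambda s: min(values(s)) >= 0
max_ge_30 = lambda s: max(values(s)) >= 30

print(is_anti_monotone(range_le_15), is_monotone(range_le_15))  # True False
print(is_anti_monotone(range_ge_15), is_monotone(range_ge_15))  # False True
print(is_anti_monotone(min_ge_0), is_monotone(min_ge_0))        # True False
print(is_anti_monotone(max_ge_30), is_monotone(max_ge_30))      # False True
```

Checking single-item removals and additions is enough here, since any subset or superset can be reached one item at a time.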
40
Succinctness, Convertible, Inconvertible
Constraints in Book
  • We will not consider these in this course.

41
Associative Classification
  • Mine possible association rules of the form
    itemset ⇒ class
  • Itemset: a set of attribute-value pairs
  • Class: class label
  • Build classifier
  • Organize rules according to decreasing precedence
    based on confidence and support
  • B. Liu, W. Hsu & Y. Ma. Integrating
    classification and association rule mining. In
    KDD'98
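
A toy sketch of the two steps: mine rules of the form itemset ⇒ class, then classify with the first matching rule in decreasing order of confidence and support. It is not the algorithm of the cited paper; the training data, thresholds and function names are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical training data: (set of attribute=value pairs, class label).
TRAIN = [
    ({"age=young", "student=yes"}, "buys"),
    ({"age=young", "student=no"}, "no"),
    ({"age=old", "student=no"}, "no"),
    ({"age=young", "student=yes"}, "buys"),
    ({"age=old", "student=yes"}, "buys"),
    ({"age=old", "student=no"}, "no"),
]

def mine_rules(data, min_support, min_confidence, max_len=2):
    """Return rules (confidence, support, itemset, label), best rule first."""
    itemset_counts, rule_counts = Counter(), Counter()
    for items, label in data:
        for r in range(1, max_len + 1):
            for combo in combinations(sorted(items), r):
                itemset_counts[combo] += 1
                rule_counts[(combo, label)] += 1
    rules = []
    for (combo, label), n in rule_counts.items():
        support, confidence = n / len(data), n / itemset_counts[combo]
        if support >= min_support and confidence >= min_confidence:
            rules.append((confidence, support, combo, label))
    # Decreasing precedence: higher confidence first, then higher support.
    rules.sort(key=lambda rule: (-rule[0], -rule[1]))
    return rules

def classify(rules, items, default=None):
    """Label of the highest-precedence rule whose itemset is contained in `items`."""
    for _, _, combo, label in rules:
        if set(combo) <= items:
            return label
    return default

rules = mine_rules(TRAIN, min_support=0.3, min_confidence=0.6)
print(classify(rules, {"age=old", "student=yes"}))   # buys
print(classify(rules, {"age=young", "student=no"}))  # no
```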

42
Classification by Aggregating Emerging Patterns
  • Emerging pattern (EP): a pattern frequent in one
    class of data but infrequent in others.
  • Age < 30 is frequent in class buys_computer = yes
    and infrequent in class buys_computer = no
  • Rule: age < 30 ⇒ buys computer
  • G. Dong & J. Li. Efficient mining of emerging
    patterns: discovering trends and differences. In
    KDD'99