Title: Association Rules
1. Association Rules & Correlations
- Basic concepts
- Efficient and scalable frequent itemset mining methods
  - Apriori and improvements
  - FP-growth
- Rule post-mining: visualization and validation
- Interesting association rules
2. Rule Validation
- Only a small subset of derived rules might be meaningful/useful
- A domain expert must validate the rules
- Useful tools:
  - Visualization
  - Correlation analysis
3. Visualization of Association Rules: Plane Graph
4. Visualization of Association Rules (SGI/MineSet 3.0)
5. Pattern Evaluation
- Association rule algorithms tend to produce too many rules; many of them are uninteresting or redundant
- confidence(A → B) = P(B|A) = P(A ∧ B) / P(A)
- Confidence alone is not a discriminative enough criterion
- Beyond the original support/confidence framework:
  - Interestingness measures can be used to prune/rank the derived patterns
6. Application of Interestingness Measures
7. Computing an Interestingness Measure
- Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table

Contingency table for X → Y:

            Y      ¬Y
   X       f11    f10    f1+
  ¬X       f01    f00    f0+
           f+1    f+0    |T|

  (f11 = number of transactions containing both X and Y, etc.)
- Used to define various measures: support, confidence, lift, Gini index, J-measure, etc.
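A minimal sketch of how the basic measures fall out of the contingency counts; the variable names mirror the f-notation in the table above, and the example counts are the Tea/Coffee numbers used on the following slides.

```python
# Support and confidence for a rule X -> Y computed straight from the
# 2x2 contingency counts (f11 = transactions with both X and Y, etc.).
def rule_measures(f11, f10, f01, f00):
    total = f11 + f10 + f01 + f00      # |T|: number of transactions
    support = f11 / total              # P(X, Y)
    confidence = f11 / (f11 + f10)     # P(Y | X)
    return support, confidence

# Counts from the Tea -> Coffee table: f11=15, f10=5, f01=75, f00=5
s, c = rule_measures(15, 5, 75, 5)
print(s, c)   # 0.15 0.75
```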
8. Drawback of Confidence

          Coffee   ¬Coffee   total
  Tea       15        5        20
  ¬Tea      75        5        80
  total     90       10       100
9. Statistical-Based Measures
- Measures that take statistical dependence into account
- Does X lift the probability of Y? I.e., the probability of Y given X over the probability of Y. This is the same as the interest factor: I = 1 means independence, I > 1 a positive association, I < 1 a negative association.

  Lift = P(Y|X) / P(Y)

  Interest = P(X, Y) / (P(X) P(Y))

  PS = P(X, Y) − P(X) P(Y)

- Many other measures exist (PS = Piatetsky-Shapiro)
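A short sketch of the three measures above as functions of the probabilities; under independence lift is 1 and PS is 0, which the example values illustrate.

```python
# Lift/Interest and Piatetsky-Shapiro, written directly from the formulas.
def lift(p_xy, p_x, p_y):
    # Lift = P(Y|X)/P(Y) = P(X,Y)/(P(X) P(Y)); identical to Interest
    return p_xy / (p_x * p_y)

def piatetsky_shapiro(p_xy, p_x, p_y):
    # PS = P(X,Y) - P(X) P(Y); equals 0 under independence
    return p_xy - p_x * p_y

# Independent case: P(X,Y) = P(X) P(Y) = 0.5 * 0.5
print(lift(0.25, 0.5, 0.5), piatetsky_shapiro(0.25, 0.5, 0.5))   # 1.0 0.0
```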
10. Example: Lift/Interest

          Coffee   ¬Coffee   total
  Tea       15        5        20
  ¬Tea      75        5        80
  total     90       10       100

- Association rule: Tea → Coffee
- Confidence = P(Coffee | Tea) = 15/20 = 0.75
- but P(Coffee) = 90/100 = 0.9
- Lift = 0.75 / 0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)
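Reproducing the slide's arithmetic, so the conclusion can be checked directly from the counts:

```python
# Tea -> Coffee: high confidence (0.75), yet lift < 1 because Coffee is
# even more common among non-Tea drinkers.
confidence = 15 / 20          # P(Coffee | Tea) = 0.75
p_coffee = 90 / 100           # 0.9
lift = confidence / p_coffee
print(round(lift, 4))         # 0.8333 -> below 1: negative association
```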
11. Drawback of Lift/Interest
- Statistical independence: if P(X,Y) = P(X) P(Y), then Lift = 1

          Y    ¬Y   total
   X     10     0     10
  ¬X      0    90     90
 total   10    90    100        Lift = 0.1 / (0.1 × 0.1) = 10

          Y    ¬Y   total
   X     90     0     90
  ¬X      0    10     10
 total   90    10    100        Lift = 0.9 / (0.9 × 0.9) = 1.11

- Lift favors infrequent items
- Other criteria have been proposed: Gini index, J-measure, etc.
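A sketch showing why lift favors infrequent items: both tables above are perfectly associated (X and Y always co-occur), yet the rare pair scores a much higher lift than the common one.

```python
# Lift computed from a 2x2 contingency table.
def lift_from_table(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    p_x = (f11 + f10) / n
    p_y = (f11 + f01) / n
    return (f11 / n) / (p_x * p_y)

print(round(lift_from_table(10, 0, 0, 90), 2))   # 10.0  (rare pair)
print(round(lift_from_table(90, 0, 0, 10), 2))   # 1.11  (common pair)
```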
12. There are lots of measures proposed in the literature. Some measures are good for certain applications, but not for others. What criteria should we use to determine whether a measure is good or bad? What about Apriori-style support-based pruning? How does it affect these measures?
13. Association Rules & Correlations
- Basic concepts
- Efficient and scalable frequent itemset mining methods
  - Apriori and improvements
  - FP-growth
- Rule derivation, visualization and validation
- Multi-level associations
- Summary
14. Multiple-Level Association Rules
- Items often form a hierarchy
- Items at the lower levels are expected to have lower support
- Rules regarding itemsets at appropriate levels could be quite useful
- The transaction database can be encoded based on dimensions and levels
- We can explore shared multi-level mining
15. Mining Multi-Level Associations
- A top-down, progressive deepening approach:
  - First find high-level strong rules: milk → bread [20%, 60%]
  - Then find their lower-level "weaker" rules: 2% milk → wheat bread [6%, 50%]
- Variations in mining multiple-level association rules:
  - Level-crossed association rules: 2% milk → Wonder wheat bread
  - Association rules with multiple, alternative hierarchies: 2% milk → Wonder bread
16. Multi-Level Association: Uniform Support vs. Reduced Support
- Uniform support: the same minimum support for all levels
  - One minimum support threshold; no need to examine itemsets containing any item whose ancestors do not have minimum support
  - Lower-level items do not occur as frequently. If the support threshold is
    - too high → we miss low-level associations
    - too low → we generate too many high-level associations
- Reduced support: reduced minimum support at lower levels
  - There are 4 search strategies:
    - Level-by-level independent
    - Level-cross filtering by k-itemset
    - Level-cross filtering by single item
    - Controlled level-cross filtering by single item
17. Uniform Support
Multi-level mining with uniform support:
  Level 1 (min_sup = 5%): milk [support = 10%]
  Level 2 (min_sup = 5%): 2% milk [support = 6%], skim milk [support = 4%]
18. Reduced Support
Multi-level mining with reduced support:
  Level 1 (min_sup = 5%): milk [support = 10%]
  Level 2 (min_sup = 3%): 2% milk [support = 6%], skim milk [support = 4%]
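A sketch contrasting the two slides above: the same item supports, but the lower level-2 threshold under reduced support keeps skim milk alive.

```python
# Item supports (in %) and hierarchy levels from the milk example.
supports = {"milk": 10, "2% milk": 6, "skim milk": 4}
levels = {"milk": 1, "2% milk": 2, "skim milk": 2}

def frequent_items(min_sup_by_level):
    # Keep each item whose support meets its level's threshold.
    return [i for i, s in supports.items() if s >= min_sup_by_level[levels[i]]]

print(frequent_items({1: 5, 2: 5}))   # uniform: ['milk', '2% milk']
print(frequent_items({1: 5, 2: 3}))   # reduced: ['milk', '2% milk', 'skim milk']
```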
19. Multi-Level Association: Redundancy Filtering
- Some rules may be redundant due to ancestor relationships between items. Example:
  - milk → wheat bread [support = 8%, confidence = 70%]
  - 2% milk → wheat bread [support = 2%, confidence = 72%]
- Say that 2% milk is 25% of milk sales; then the expected support of the second rule is 8% × 25% = 2%, exactly what we observe.
- We say the first rule is an ancestor of the second rule.
- A rule is redundant if its support is close to the expected value based on the rule's ancestor.
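A sketch of the redundancy test above; the tolerance is an assumption, since the slide only says the support must be "close to" the expected value.

```python
# Redundancy check: a specialized rule is redundant if its support is
# close to ancestor_support * sales_share (the expected value).
def is_redundant(child_support, ancestor_support, sales_share, tol=0.5):
    expected = ancestor_support * sales_share     # e.g. 8% * 0.25 = 2%
    return abs(child_support - expected) <= tol   # tol is an assumed slack

print(is_redundant(2.0, 8.0, 0.25))   # True: the 2% milk rule adds no information
```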
20. Multi-Level Mining: Progressive Deepening
- A top-down, progressive deepening approach:
  - First mine high-level frequent items: milk (15%), bread (10%)
  - Then mine their lower-level "weaker" frequent itemsets: 2% milk (5%), wheat bread (4%)
- Different min_support thresholds across multi-levels lead to different algorithms:
  - If adopting the same min_support across multi-levels, then toss itemset t if any of t's ancestors is infrequent.
  - If adopting reduced min_support at lower levels, then examine only those descendants whose ancestors' support is frequent/non-negligible.
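A sketch of the uniform-min_support pruning rule: toss itemset t if any of t's ancestors is infrequent. The hierarchy map here is an assumption, reusing the milk example from earlier slides.

```python
# Assumed item hierarchy: child -> parent (None marks a top-level item).
parent = {"2% milk": "milk", "skim milk": "milk", "milk": None}

def prune_candidates(candidates, frequent):
    # Keep a candidate only if its ancestor is frequent (or it has none).
    kept = []
    for t in candidates:
        anc = parent.get(t)
        if anc is None or anc in frequent:
            kept.append(t)
    return kept

print(prune_candidates(["2% milk", "skim milk"], {"milk"}))
# ['2% milk', 'skim milk'] -- both survive because milk is frequent
```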
21. Association Rules & Correlations
- Basic concepts
- Efficient and scalable frequent itemset mining methods
  - Apriori and improvements
  - FP-growth
- Rule derivation, visualization and validation
- Multi-level associations
- Temporal associations and frequent sequences (covered later)
- Other association mining methods
- Summary
22. Other Association Mining Methods
- CHARM: mining frequent itemsets by a vertical data format
- Mining frequent closed patterns
- Mining max-patterns
- Mining quantitative associations (e.g., what is the implication between age and income?)
- Constraint-based (query-directed) mining
- Frequent patterns in data streams: a very difficult problem; performance is a real issue
- Mining sequential and structured patterns
23. Summary
- Association rule mining is probably the most significant contribution from the database community in KDD
- New interesting research directions:
  - Association analysis in other types of data: spatial data, multimedia data, time series data
  - Association rule mining for data streams: a very difficult challenge
24. Statistical Independence
- Population of 1000 students
  - 600 students know how to swim (S)
  - 700 students know how to bike (B)
  - 420 students know how to swim and bike (S, B)
- P(S ∧ B) = 420/1000 = 0.42
- P(S) × P(B) = 0.6 × 0.7 = 0.42
- P(S ∧ B) = P(S) × P(B) ⇒ statistical independence
- P(S ∧ B) > P(S) × P(B) ⇒ positively correlated
- P(S ∧ B) < P(S) × P(B) ⇒ negatively correlated
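Checking the swim/bike numbers above for statistical independence (floating-point comparison done with a tolerance):

```python
# 600 swimmers, 700 bikers, 420 both, out of 1000 students.
p_s, p_b = 600 / 1000, 700 / 1000
p_s_and_b = 420 / 1000
print(p_s_and_b, p_s * p_b)                  # 0.42 vs. ~0.42
print(abs(p_s_and_b - p_s * p_b) < 1e-9)     # True: S and B are independent
```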