Title: Fundamentos de Miner
1 Fundamentos de Minería de Datos (Fundamentals of Data Mining)
Fernando Berzal, fberzal_at_decsai.ugr.es
2 Motivation
- Association mining searches for interesting relationships among items in a given data set.
- EXAMPLES
- Diapers and six-packs are bought together, especially on Thursday evenings (a myth?).
- A sequence such as buying a digital camera first and then a memory card is a frequent (sequential) pattern.
- Motivation
- Definition
- Discovery
- Variations
- Visualization
- Extensions
- Applications
- ART
- ATBAR
3 Motivation
MARKET BASKET ANALYSIS: The earliest form of association rule mining.
Applications: Catalog design, store layout, cross-marketing.
4 Definition
- Item
- In transactional databases: any of the items included in a transaction.
- In relational databases: an (attribute, value) pair.
- k-itemset
- Set of k items
- Itemset support: $\mathrm{support}(I) = P(I)$, i.e. the fraction of transactions that contain every item in I
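5 Definition
- Association rule
- $X \Rightarrow Y$
- Support
- $\mathrm{support}(X \Rightarrow Y) = \mathrm{support}(X \cup Y) = P(X \cup Y)$
- Confidence
- $\mathrm{confidence}(X \Rightarrow Y) = \mathrm{support}(X \cup Y) / \mathrm{support}(X) = P(Y \mid X)$
- NOTE: Both support and confidence are relative measures.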
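The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function names and the toy `baskets` data are hypothetical, and transactions are assumed to be plain sets of item names.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Relative support: fraction of transactions containing every item."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[Transaction]) -> float:
    """confidence(X => Y) = support(X u Y) / support(X)."""
    supp_x = support(antecedent, transactions)
    if supp_x == 0.0:
        return 0.0  # the rule never applies, so report zero confidence
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / supp_x


# Hypothetical toy data, for illustration only.
baskets = [
    frozenset({"diapers", "beer", "milk"}),
    frozenset({"diapers", "beer"}),
    frozenset({"diapers", "bread"}),
    frozenset({"milk", "bread"}),
]

print(support({"diapers", "beer"}, baskets))
print(confidence({"diapers"}, {"beer"}, baskets))
```

Here two of the four baskets contain both items, and three contain diapers, so the rule "diapers implies beer" has support 2/4 and confidence 2/3.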
6 Discovery
- Association rule mining
- Find all frequent itemsets.
- Generate strong association rules from the frequent itemsets. Strong association rules are those that satisfy both a minimum support threshold and a minimum confidence threshold.
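The second step can be sketched as follows. This is an illustrative sketch under stated assumptions: the frequent itemsets and their supports are given as a dictionary that also contains every subset of each itemset, and the names `generate_rules` and `frequent` and the numbers are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

# (antecedent, consequent, support, confidence)
Rule = Tuple[FrozenSet[str], FrozenSet[str], float, float]


def generate_rules(supports: Dict[FrozenSet[str], float],
                   min_conf: float) -> List[Rule]:
    """Split each frequent itemset into antecedent and consequent and keep
    the rules whose confidence reaches min_conf.

    Every itemset in `supports` already satisfies the minimum support
    threshold, so only confidence has to be checked here.
    """
    rules: List[Rule] = []
    for itemset, supp in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                x = frozenset(antecedent)
                conf = supp / supports[x]  # support(X u Y) / support(X)
                if conf >= min_conf:
                    rules.append((x, itemset - x, supp, conf))
    return rules


# Hypothetical supports, for illustration only.
frequent = {
    frozenset({"A"}): 0.7,
    frozenset({"B"}): 0.6,
    frozenset({"A", "B"}): 0.5,
}
strong = generate_rules(frequent, min_conf=0.8)
for x, y, supp, conf in strong:
    print(sorted(x), "=>", sorted(y), supp, round(conf, 3))
```

With these numbers only B => A survives (confidence 0.5/0.6), while A => B is discarded (confidence 0.5/0.7).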
7 Discovery
Apriori
Observation: All non-empty subsets of a frequent itemset must also be frequent.
Algorithm: Frequent k-itemsets are used to explore potentially frequent (k+1)-itemsets (i.e. candidates).
Agrawal & Srikant, "Fast Algorithms for Mining Association Rules", VLDB'94
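A compact level-wise sketch of this idea is shown below. It is a simplified illustration rather than the algorithm as published: candidates are generated by joining any two frequent k-itemsets, supports are counted by scanning an in-memory list, and the `db` data set is hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List


def apriori(transactions: List[FrozenSet[str]],
            min_supp: float) -> Dict[FrozenSet[str], float]:
    """Return every frequent itemset together with its relative support."""
    n = len(transactions)

    def frequent_only(candidates: Iterable[FrozenSet[str]]) -> Dict[FrozenSet[str], float]:
        result = {}
        for c in candidates:
            supp = sum(1 for t in transactions if c <= t) / n
            if supp >= min_supp:
                result[c] = supp
        return result

    items = {item for t in transactions for item in t}
    level = frequent_only({frozenset({i}) for i in items})
    frequent = dict(level)
    k = 1
    while level:
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        level = frequent_only(candidates)
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical toy database, for illustration only.
db = [frozenset("ABC"), frozenset("AB"), frozenset("AC"),
      frozenset("BC"), frozenset("ABC")]
result = apriori(db, min_supp=0.5)
for itemset, supp in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
```

In this toy database all single items and all pairs are frequent, but {A, B, C} appears in only two of five transactions and is rejected after being counted.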
8 Discovery
- Apriori improvements (I)
- Reducing the number of candidates: Park, Chen & Yu, "An Effective Hash-Based Algorithm for Mining Association Rules", SIGMOD'95
- Sampling: Toivonen, "Sampling Large Databases for Association Rules", VLDB'96; Park, Yu & Chen, "Mining Association Rules with Adjustable Accuracy", CIKM'97
- Partitioning: Savasere, Omiecinski & Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases", VLDB'95
9 Discovery
- Apriori improvements (II)
- Transaction reduction: Agrawal & Srikant, "Fast Algorithms for Mining Association Rules", VLDB'94 (AprioriTID)
- Dynamic itemset counting: Brin, Motwani, Ullman & Tsur, "Dynamic Itemset Counting and Implication Rules for Market Basket Data", SIGMOD'97 (DIC); Hidber, "Online Association Rule Mining", SIGMOD'99 (CARMA)
10 Discovery
Apriori-like algorithm: TBAR (Tree-based association rule mining)
Berzal, Cubero, Sánchez & Serrano, "TBAR: An efficient method for association rule mining in relational databases", Data & Knowledge Engineering, 2001
11 Discovery: TBAR
[Figure: TBAR itemset tree with levels L1, L2 and L3. Node annotations: 7 instances with A (L1); 6 instances with AB, 5 instances with AD, 6 instances with BC (L2); 5 instances with ABD (L3).]
12 Discovery
An alternative to Apriori: Compress the database representing frequent items into a frequent-pattern tree (FP-tree).
Han, Pei & Yin, "Mining Frequent Patterns without Candidate Generation", SIGMOD'2000
13 Discovery
- A challenge
- When an itemset is frequent, all its subsets are also frequent, so a single frequent k-itemset implies $2^k - 1$ frequent itemsets.
- Closed itemset C: There exists no proper super-itemset S such that $\mathrm{support}(S) = \mathrm{support}(C)$.
- Maximal (frequent) itemset M: M is frequent and there exists no super-itemset Y such that $M \subset Y$ and Y is frequent.
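Both condensed representations can be computed directly from the definitions. The sketch below assumes the frequent itemsets are already known and stored with absolute support counts (integers, to avoid comparing floats); the function names and the `counts` example are hypothetical.

```python
from typing import Dict, FrozenSet, Set


def closed_itemsets(supports: Dict[FrozenSet[str], int]) -> Set[FrozenSet[str]]:
    """Frequent itemsets with no proper superset of identical support."""
    return {
        c for c, s in supports.items()
        if not any(c < other and s_other == s
                   for other, s_other in supports.items())
    }


def maximal_itemsets(supports: Dict[FrozenSet[str], int]) -> Set[FrozenSet[str]]:
    """Frequent itemsets with no frequent proper superset."""
    return {m for m in supports if not any(m < other for other in supports)}


# Hypothetical frequent itemsets with absolute support counts.
counts = {
    frozenset({"A"}): 4,
    frozenset({"B"}): 3,
    frozenset({"A", "B"}): 3,
}
print(closed_itemsets(counts))   # {A} and {A, B}
print(maximal_itemsets(counts))  # {A, B} only
```

{B} is not closed because {A, B} has the same support, so B never occurs without A. Every maximal itemset is also closed, which is why the maximal set is the smaller of the two.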
14 Variations
- Based on the kinds of patterns to be mined
- Frequent itemset mining (transactional and relational data)
- Sequential pattern mining (sequence data sets, e.g. bioinformatics)
- Structured pattern mining (structured data, e.g. graphs)
15 Variations
- Based on the types of values handled
- Boolean association rules
- Quantitative association rules
- Fuzzy association rules
Delgado, Marín, Sánchez & Vila, "Fuzzy association rules: General model and applications", IEEE Transactions on Fuzzy Systems, 2003
16 Variations
- More options
- Generalized association rules (a.k.a. multilevel association rules)
- Constraint-based association rule mining
- Incremental algorithms
- Top-k algorithms
ICDM FIMI: Workshop on Frequent Itemset Mining Implementations, http://fimi.cs.helsinki.fi/
17 Visualization
- Integrated into data mining tools to help users understand data mining results
- Table-based approach, e.g. SAS Enterprise Miner, DBMiner
- 2D matrix-based approach, e.g. SGI MineSet, DBMiner
- Graph-based techniques, e.g. DBMiner ball graphs
18 Visualization: Tables
19 Visualization: Visual aids
20 Visualization: 2D Matrix
21 Visualization: Graphs
22 Visualization: VisAR
Based on parallel coordinates (Techapichetvanich & Datta, ADMA 2005)
23 Extensions
Confidence is not the best possible interestingness measure for rules, e.g. a very frequent item will always appear in rule consequents, regardless of its true relationship with the rule antecedent.
"X went to war" $\Rightarrow$ "X did not serve in Vietnam" (from the US Census)
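24 Extensions
- Desirable properties for interestingness measures [Piatetsky-Shapiro, 1991]
- P1: $\mathrm{ACC}(A \Rightarrow C) = 0$ when $\mathrm{supp}(A \Rightarrow C) = \mathrm{supp}(A)\,\mathrm{supp}(C)$
- P2: $\mathrm{ACC}(A \Rightarrow C)$ monotonically increases with $\mathrm{supp}(A \Rightarrow C)$
- P3: $\mathrm{ACC}(A \Rightarrow C)$ monotonically decreases with $\mathrm{supp}(A)$ (or $\mathrm{supp}(C)$)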
25 Extensions
- Certainty factors
- satisfy Piatetsky-Shapiro's properties
- are widely used in expert systems
- are not symmetric (unlike interest/lift)
- can substitute conviction when $CF > 0$
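The slides do not spell out the formulas, so the sketch below assumes the commonly used definitions: lift = conf / supp(C), conviction = (1 - supp(C)) / (1 - conf), and the certainty factor as the relative change from supp(C) to conf. The function names and the numeric inputs are hypothetical.

```python
import math


def lift(supp_a: float, supp_c: float, supp_ac: float) -> float:
    """Symmetric measure: supp(A u C) / (supp(A) * supp(C))."""
    return supp_ac / (supp_a * supp_c)


def conviction(supp_a: float, supp_c: float, supp_ac: float) -> float:
    conf = supp_ac / supp_a
    return math.inf if conf == 1.0 else (1.0 - supp_c) / (1.0 - conf)


def certainty_factor(supp_a: float, supp_c: float, supp_ac: float) -> float:
    """Positive when A raises our belief in C, negative when it lowers it."""
    conf = supp_ac / supp_a
    if conf > supp_c:
        return (conf - supp_c) / (1.0 - supp_c)
    if conf < supp_c:
        return (conf - supp_c) / supp_c
    return 0.0  # A and C are independent (property P1)


# Hypothetical numbers: a very frequent consequent gives a confident rule
# (0.85) even though the antecedent slightly lowers the chance of C.
print(certainty_factor(0.2, 0.9, 0.17), lift(0.2, 0.9, 0.17))

# Hypothetical positive association: here CF equals 1 - 1/conviction.
cf = certainty_factor(0.4, 0.5, 0.3)
conv = conviction(0.4, 0.5, 0.3)
print(cf, 1.0 - 1.0 / conv)
```

Under these definitions a positive certainty factor is a monotone transformation of conviction, which is one way to read the claim that it can substitute conviction when CF > 0.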
Berzal, Blanco, Sánchez & Vila, "Measuring the accuracy and interest of association rules: A new framework", Intelligent Data Analysis, 2002
26 Extensions
References
Hilderman & Hamilton, "Evaluation of interestingness measures for ranking discovered knowledge", PAKDD, 2001
Tan, Kumar & Srivastava, "Selecting the right objective measure for association analysis", Information Systems, vol. 29, pp. 293-313, 2004
Berzal, Cubero, Marín, Sánchez, Serrano & Vila, "Association rule evaluation for classification purposes", TAMIDA 2005
27 Applications
- Two sample applications where association rules have been successful
- Classification (ART)
- Anomaly detection (ATBAR)
Berzal, Cubero, Sánchez & Serrano, "ART: A hybrid classification model", Machine Learning Journal, 2004
Balderas, Berzal, Cubero, Eisman & Marín, "Discovering Hidden Association Rules", KDD 2005, Chicago, Illinois, USA
28 Classification
- Classification models based on association rules
- Partial classification models
- e.g. Bayardo
- Associative classification models
- e.g. CBA (Liu et al.)
- Bayesian classifiers
- e.g. LB (Meretakis et al.)
- Emerging patterns
- e.g. CAEP (Dong et al.)
- Rule trees
- e.g. Wang et al.
- Rules with exceptions
- e.g. Liu et al.
29 Classification
GOAL: Simple, intelligible, and robust classification models obtained in an efficient and scalable way
MEANS:
Decision Tree Induction + Association Rule Mining = ART (Association Rule Trees)
30 ART Classification Model
IDEA: Make use of efficient association rule mining algorithms to build a decision-tree-shaped classification model.
ART = Association Rule Tree
KEY: Association rules + "else" branches
Hybrid between decision trees and decision lists
31 ART Classification Model
SPLICE
32 Construction
ART classification model
33 Construction
ART classification model
- Rule mining: Candidate hypotheses
- MinSupp: Minimum support threshold
- MinConf: Minimum confidence threshold
- Fixed threshold
- Automatic selection
34 Construction
ART classification model
- Rule selection
- Rules grouped by sets of attributes.
- Preference criterion.
35 Example: Dataset
ART classification model
36 Example: Level 1, K = 1
ART classification model
- LEVEL 1: Association rule mining
- Minimum support threshold = 20%
- Automatic confidence threshold selection
S1: if (Y=0) then C=0 with confidence 75%; if (Y=1) then C=1 with confidence 75%
S2: if (Z=0) then C=0 with confidence 75%; if (Z=1) then C=1 with confidence 75%
37 Example: Level 1, K = 2
ART classification model
- LEVEL 1: Association rule mining
- Minimum support threshold = 20%
- Automatic confidence threshold selection
S1: if (X=0 and Y=0) then C=0 (100%); if (X=0 and Y=1) then C=1 (100%)
S2: if (X=1 and Z=0) then C=0 (100%); if (X=1 and Z=1) then C=1 (100%)
S3: if (Y=0 and Z=0) then C=0 (100%); if (Y=1 and Z=1) then C=1 (100%)
38 Example: Level 1
ART classification model
LEVEL 1: Best rule set selection, e.g. S1
S1: if (X=0 and Y=0) then C=0 (100%); if (X=0 and Y=1) then C=1 (100%)
X=0 and Y=0: C=0 (2)
X=0 and Y=1: C=1 (2)
else ...
39 Example: Level 1 → Level 2
ART classification model
40 Example: Level 2
ART classification model
LEVEL 2: Rule mining
S1: if (Z=0) then C=0 with confidence 100%; if (Z=1) then C=1 with confidence 100%
RESULT
X=0 and Y=0: C=0 (2)
X=0 and Y=1: C=1 (2)
else
Z=0: C=0 (2)
Z=1: C=1 (2)
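The resulting model reads like a decision list with one group of rules per tree level. The sketch below encodes it as data, assuming the attribute tokens mean X=0, Y=1, and so on; the names `ART_MODEL` and `classify` are hypothetical and this is not the authors' implementation.

```python
from typing import Dict, List, Optional, Tuple

# One (conditions, class label) pair per rule.
Rule = Tuple[Dict[str, int], int]

# Each inner list is one tree level; falling through a level corresponds
# to following its "else" branch.
ART_MODEL: List[List[Rule]] = [
    [({"X": 0, "Y": 0}, 0), ({"X": 0, "Y": 1}, 1)],  # level 1 (rule set S1)
    [({"Z": 0}, 0), ({"Z": 1}, 1)],                  # level 2 (else branch)
]


def classify(record: Dict[str, int],
             model: List[List[Rule]] = ART_MODEL,
             default: Optional[int] = None) -> Optional[int]:
    """Return the class of the first rule whose conditions all hold."""
    for level in model:
        for conditions, label in level:
            if all(record.get(attr) == value for attr, value in conditions.items()):
                return label
    return default  # no rule at any level covers the record


print(classify({"X": 0, "Y": 1, "Z": 0}))  # covered at level 1
print(classify({"X": 1, "Y": 0, "Z": 1}))  # falls through to level 2
```

Records with X=0 are decided by Y alone at the first level; all other records follow the else branch and are decided by Z.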
41 Example: ART vs. TDIDT
ART classification model
[Figure: the ART model and the TDIDT model for the example, side by side]
42 Classifier accuracy
ART classification model > Experimental results
43 Classifier complexity
ART classification model > Experimental results
44 Training time
ART classification model > Experimental results
45 I/O Operations - Scans
ART classification model > Experimental results
46 I/O Operations - Records
ART classification model > Experimental results
47 I/O Operations - Pages
ART classification model > Experimental results
48 Final comments
ART classification model
- Classification models
- Acceptable accuracy
- Reduced complexity
- Attribute interactions
- Robustness (noise & primary keys)
- Classifier building method
- Efficient algorithm
- Good scalability properties
- Automatic parameter selection
49 Anomaly detection
- It is often more interesting to find surprising non-frequent events than frequent ones
- EXAMPLES
- Abnormal network activity patterns in intrusion detection systems.
- Exceptions to common rules in medicine (useful for diagnosis, drug evaluation, detection of conflicting therapies).
50 Anomaly detection
Anomalous association rule: Confident rule representing homogeneous deviations from common behavior.
51 Anomaly detection
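When X does not imply Y, then it usually implies A (the Anomaly).
[Diagram: X usually leads to Y; when Y does not hold, X leads to A with high confidence.]
Anomalous association rule: $X \wedge \neg Y \Rightarrow A$ is confident.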
52 Anomaly detection
X Y A1 Z1
X Y A1 Z2
X Y A2 Z3
X Y A2 Z1
X Y A3 Z2
X Y A3 Z3
X Y A Z
X Y3 A Z3
X Y3 A Z
X Y4 A Z
$X \Rightarrow Y$ is the dominant rule
$X \Rightarrow A$ when $\neg Y$ is the anomalous rule
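A rough way to test these two conditions on data is sketched below. It checks only what the slides state (a dominant rule X => Y, plus a confident consequent A among the records where X holds but Y does not); the published definition may impose further conditions. The record layout, the function name, the use of one confidence threshold for both rules, and the data are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Tuple

# (does X hold?, value of the Y attribute, value of the A attribute)
Record = Tuple[bool, str, str]


def anomalous_rules(records: List[Record], y: str,
                    min_supp: float, min_conf: float) -> List[Tuple[str, float]]:
    """Return (A, confidence) pairs such that X => A when not Y is confident,
    provided X => Y is itself frequent and confident (the dominant rule)."""
    n = len(records)
    with_x = [r for r in records if r[0]]
    with_xy = [r for r in with_x if r[1] == y]
    if (not with_x or len(with_xy) / n < min_supp
            or len(with_xy) / len(with_x) < min_conf):
        return []  # X => Y is not a dominant rule
    deviations = [r for r in with_x if r[1] != y]  # X holds, Y does not
    counts = Counter(r[2] for r in deviations)
    return [(a, c / len(deviations)) for a, c in counts.items()
            if c / len(deviations) >= min_conf]


# Hypothetical data: six records follow X => Y with varied A values, and
# the four deviating records all share the same value "A".
data = [
    (True, "Y", "A1"), (True, "Y", "A1"), (True, "Y", "A2"),
    (True, "Y", "A2"), (True, "Y", "A3"), (True, "Y", "A3"),
    (True, "Y1", "A"), (True, "Y2", "A"), (True, "Y3", "A"), (True, "Y4", "A"),
]
found = anomalous_rules(data, y="Y", min_supp=0.5, min_conf=0.5)
print(found)
```

The deviations are homogeneous: whatever replaces Y, the same value A shows up, which is what makes the rule anomalous rather than mere noise.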
53 Anomaly detection
Suzuki et al.'s Exception Rules
$X \Rightarrow Y$ is an association rule
$X \wedge I \Rightarrow \neg Y$ is the exception rule
I is the interacting itemset
$X \Rightarrow I$ is the reference rule
- Too many exceptions
- The cause needs to be present
54 Anomaly detection: ATBAR
Anomalous association rules
First scan
Second scan
55 Anomaly detection: ATBAR
Anomalous association rules
First scan
Second scan
56 Anomaly detection: ATBAR
Anomalous association rules
Rule generation is immediate from the frequent and extended itemsets obtained by ATBAR.
57 Anomaly detection: Results
- Experiments on health-related datasets from the UCI Machine Learning Repository
- Relatively small set of anomalous rules (typically, >90% reduction with respect to standard association rules)
- Reasonable overhead needed to obtain anomalous association rules (about 20% in ATBAR w.r.t. TBAR)
58 Anomaly detection: Results
An example from the Census dataset
if WORKCLASS: Local-gov then CAPGAIN: [99999.0, 99999.0] (7 out of 7) when not CAPGAIN: [0.0, 20051.0]
59 Anomaly detection: Results
- Anomalous association rules (novel characterization of potentially interesting knowledge)
- An efficient algorithm for discovering anomalous association rules: ATBAR
- Some heuristics for filtering the discovered anomalous association rules
60 Anomaly detection: Future
- Additional heuristics for focusing on interesting anomalies (maybe domain- or even application-specific).
- Alternative measures for the evaluation and ranking of anomalous association rules
- Certainty factors / Conviction