Transcript and Presenter's Notes

Title: Association Rule Clustering


1
Association Rule Clustering
  • Term Project Presentation 04-May-1999
  • Gunjan K. Gupta
  • Alexander Strehl

2
Presentation Overview
  • Introduction to Data Mining and Association Rules
  • Association Rule Clustering
    • Motivation
    • Approach
  • Distance Metrics Based on
    • Rule Features
    • Original Transactions
    • Transaction Probabilities
  • Clustering Using
    • Agglomerative Clustering using Chaining
    • Multi-dimensional Scaling and SOM
  • Results on Real Data
    • Intuitive Results of Each Clustering Technique
    • Quantitative Comparison of Techniques

3
Data Mining on Relational DBS
  • Taxonomy
    • Item (Attribute): I
    • Set of all Items I (Relation Schema): R
    • Transaction (Row): t
    • Database of all Transactions t (Relation): r
    • Particular Item-set: X
    • All Transactions t Containing X: m(X)

4
Association Rules
  • Obtained using the Apriori algorithm
  • Left-hand-side (LHS)
  • Right-hand-side (RHS)
  • Both Sides (BS)
  • Support
  • Confidence (both measures are sketched in code below)
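
As a concrete illustration of support and confidence, here is a minimal sketch; the items and transactions are invented for illustration and are not data from the presentation.

```python
# Minimal sketch: support and confidence of an association rule LHS -> RHS.
# Items and transactions below are invented for illustration only.
transactions = [
    {"light-fixture", "bulb-incandescent"},
    {"light-fixture", "bulb-fluorescent"},
    {"light-fixture", "bulb-incandescent", "switch"},
    {"switch", "wall-plate"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Support of LHS-and-RHS relative to the support of the LHS alone."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

lhs, rhs = {"light-fixture"}, {"bulb-incandescent"}
print(support(lhs | rhs, transactions))    # 0.5
print(confidence(lhs, rhs, transactions))  # 0.666...
```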

5
Clustering Association Rules
  • Scenario
  • We use data provided by Knowledge Discovery One
  • Discovering association rules is a standard and
    very popular data mining technique
  • The set of rules discovered may be very large
    (> 10,000)
  • Problem
  • Too many rules for (manual) user
    interpretation
  • Clustering can help in browsing, visualizing,
    ordering, pruning, merging of rules
  • There is no intuitive distance metric for rules
  • Approach
  • Rules are similar when they hold in similar
    settings (such as transactions or customers)
  • Find good distance metrics and clustering
    techniques

6
Direct Distance Metrics
  • Association Rule Features
  • Confidence
  • Support
  • Lift
  • LHS, RHS, BS bit-vectors
  • LHS, RHS, BS counts
  • Domain specific features such as average
    revenue/margin per covered transaction
  • Distance Defined in Terms of Features
  • Bit-vector Hamming is too coarse (discrete)
  • Most others have little relevance
  • Do not capture the interaction on the data

7
Indirect Distance Metrics
  • Rules About Same Items are Considered Equal
  • Transaction Distance
  • Database size dependent, strong correlation to
    support and confidence
  • Conditional Probability Distance (see the sketch below)
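
The transcript does not reproduce the formulas for these two metrics. The sketch below shows one plausible transaction-based distance, a Jaccard-style overlap of the transactions covered by the two rules, purely as an illustrative assumption.

```python
# Illustrative assumption, not the slide's exact metric: a transaction-based
# distance between two rules, defined via the overlap of the transactions
# that each rule's item set covers.
def covered(itemset, transactions):
    """Indices of transactions containing every item of the rule's item set."""
    return {i for i, t in enumerate(transactions) if itemset <= t}

def transaction_distance(items_a, items_b, transactions):
    """0 -> both rules hold in exactly the same transactions,
    1 -> they never co-occur; values in between are the interesting ones."""
    ca, cb = covered(items_a, transactions), covered(items_b, transactions)
    if not ca or not cb:
        return 1.0
    return 1.0 - len(ca & cb) / len(ca | cb)
```

A conditional-probability variant would normalize the overlap by the coverage of one rule, e.g. |ca ∩ cb| / |ca|, rather than by the union.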

8
Good Neighbors
  • Rules at an Interesting Distance are Good
    Neighbors
  • Distance 0: the rules apply to the same transactions
  • Distance 1: no common occurrence
  • Interesting distances are neither 0 nor 1 (see the
    counting sketch below)
  • Subset relationship in the rules' item-sets
  • Meta-association rules
  • Histogram over 50 bins (highest bin with 1,650,000)
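
A small sketch of how good neighbors could be counted from the pairwise distance matrix, assuming distances normalized to [0, 1] as in the bullets above (illustrative, not the authors' code):

```python
import numpy as np

# Count each rule's good neighbors: entries of the distance matrix D that are
# neither (numerically) 0 nor 1.  D is an (n, n) symmetric matrix in [0, 1].
def good_neighbor_counts(D, eps=1e-9):
    interesting = (D > eps) & (D < 1.0 - eps)
    np.fill_diagonal(interesting, False)   # a rule is not its own neighbor
    return interesting.sum(axis=1)
```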

9
Results
  • Home Improvement Data Set
  • 172,000 cash register transactions
  • 2831 frequent item sets
  • 4782 association rules
  • 1311 association rules after clustering by BS
  • Distance Statistics
  • 1311x1311 matrix
  • Median distance is 1
  • Mean distance is 0.9943 (0.9992 maximum)
  • 92.97% of distances are 1 (99.92% maximum)
  • Most rules have between 1 and 20 good neighbors

10
DiVis Examples I
  • Transaction Distance vs. Conditional
    Probability Distance
  • Rule 738 (278533): LIGHTING-FIXTURES-FLUORES.-UTILITY,
    LIGHT-BULBS-INCANDESCENT
  • Rule 606 (277617): LIGHTING-FIXTURES-FLUORES.-UTILITY,
    LIGHT-BULBS-FLUORESCENT

11
DiVis Examples II
Rule 761 has 273 (20.8238%) good neighbors:
  17050 ELECTRICAL-OUTLETS/SWITCHES, 17140 ELECTRICAL-FITTINGS
Rule 282:
  17050 ELECTRICAL-OUTLETS/SWITCHES, 17140 ELECTRICAL-FITTINGS,
  17090 ELECTRICAL-BOXES--COVERS
Rule 144 has 130 (9.9161%) good neighbors:
  18408 FASTENERS---PACKAGED, 17090 ELECTRICAL-BOXES--COVERS
Rule 330:
  17030 ELECTRICAL-WALL-PLATES, 18408 FASTENERS---PACKAGED,
  17090 ELECTRICAL-BOXES--COVERS
12
Agglomerative Clustering
  • A Hierarchical Clustering Technique which
    generates a tree structure.
  • Many different variations available
  • Chaining: O(N²)
  • Single Link: O(N²)
  • Vertical length of branches is defined, giving
    height information for clusters
  • Height represents cluster compactness
  • Splitting along a particular height results in
    clusters of approximately equal compactness.
  • Multiple resolutions of clusters available.

13
Agglomerative Chaining
  • Centroid used as cluster center.
  • Centroids of clusters at level N used for forming
    clusters of level N+1
  • Nearest neighbor linking.
  • Merging two clusters with a link between them.
  • Unique clusters at each level.
  • Cluster Width used as the height of the node.
  • Cluster width increases when the clusters are
    merged to the next level.
  • An example - (a code sketch follows below)
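
The example figure itself is not reproduced in the transcript. As a stand-in, here is a minimal sketch of one chaining level (not the authors' code); it assumes the cluster centers have explicit coordinates, whereas the next two slides show how to estimate centroid distances when only a distance matrix is available.

```python
import numpy as np

# Minimal sketch of one level of agglomerative "chaining": link every cluster
# to its nearest neighbor, merge the connected groups, and use centroids as
# the new cluster centers for level N+1.
def one_chaining_level(centroids):
    """centroids: (n, d) numpy array of level-N cluster centers."""
    n = len(centroids)
    D = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    nearest = D.argmin(axis=1)          # nearest-neighbor link per cluster

    # Merge clusters connected by links (union-find over the link graph).
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in enumerate(nearest):
        parent[find(i)] = find(int(j))

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    new_centroids = np.array([centroids[g].mean(axis=0) for g in groups.values()])
    return new_centroids, list(groups.values())
```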

14
Agglomerative Chaining continued...
15
Centroid estimation without the coordinates.
  • Generic equation for a center

d_k(i,j) = α_i d_ki + α_j d_kj + β d_ij + γ |d_ki - d_kj|

For the centroid: α_i = n_i/(n_i + n_j), α_j = n_j/(n_i + n_j),
β = -α_i α_j, γ = 0

For three points:
Centroid_k(i,j) = 0.5 (d_ki + d_kj) - 0.25 d_ij
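
A quick numerical check (illustrative, not from the presentation): with squared Euclidean distances the three-point formula is exact; the slides apply the same form directly to the rule distances as an estimate.

```python
import numpy as np

# Illustrative check: with *squared* Euclidean distances the three-point
# centroid formula above is exact,
#   d(k, centroid(i, j))^2 = 0.5*d_ki^2 + 0.5*d_kj^2 - 0.25*d_ij^2.
rng = np.random.default_rng(0)
i, j, k = rng.normal(size=(3, 4))            # three random points in 4-D

d2 = lambda a, b: np.sum((a - b) ** 2)       # squared distance
estimate = 0.5 * (d2(k, i) + d2(k, j)) - 0.25 * d2(i, j)
exact = d2(k, (i + j) / 2.0)
print(np.isclose(estimate, exact))           # True
```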
16
Centroid estimation continued ...
  • Approximating the above and applying it to our
    problem, we can estimate distances from the
    centroids of clusters at level N+1 using centroid
    distances of level N

d_ab = C_ab - 0.5 (A_a + A_b)
C_ab = 1/(n_a n_b) Σ_{i=1..n_a} Σ_{j=1..n_b} d_ij
A_a = 1/n_a² Σ_{i=1..n_a} Σ_{j=1..n_a} d_ij
A_b = 1/n_b² Σ_{i=1..n_b} Σ_{j=1..n_b} d_ij
  • As we can see, for two identical (superimposed)
    clusters the distance is zero.
  • An example - (see the estimator sketch below)
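
A sketch of the estimator above, assuming only the pairwise distance matrix D between the level-N members is available (illustrative code, not the authors'):

```python
import numpy as np

# Estimate the distance between the centroids of two clusters a and b,
# given only the pairwise distance matrix D over their members
# (index lists a and b into D), following d_ab = C_ab - 0.5 (A_a + A_b).
def centroid_distance_estimate(D, a, b):
    C_ab = D[np.ix_(a, b)].mean()   # average cross-cluster distance
    A_a = D[np.ix_(a, a)].mean()    # average within-cluster distance of a
    A_b = D[np.ix_(b, b)].mean()    # average within-cluster distance of b
    return C_ab - 0.5 * (A_a + A_b)
```

For two identical index sets the three averages coincide, so the estimate is exactly zero, matching the bullet above.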

17
Performance of the centroid distance estimator on
simulation
One example of the runs, on 10 points (see Slide
14).
  • Order preserved.
  • Zero distance for distance of the cluster on
    itself.
  • Almost identical clusters for up to 100 points.
  • Centroid estimator works better for higher
    dimensions.
  • Clustering results 99% identical even for 1000
    points.

18
Tree Splitting for Agglomerative Clustering
  • Height of a node in the tree is defined as the
    average cluster width; variations like minimum or
    maximum are possible.
  • Height increases for successive levels, but not
    equally for all clusters.
  • Splitting at any height gives the final clusters.
  • Useful splitting points are only the unique
    heights over all the nodes together.
  • Useful splitting points are calculated and stored
    in an array.
  • For a given required number of clusters K, the
    split with a cluster count closest to K is
    returned without compromising on cluster quality
    (see the sketch below).
  • Splitting closer to root results in low
    resolution, near leaf results in high resolution.
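
A sketch of the splitting mechanics described above (an illustrative reading of the bullets, not the authors' code): each tree node contributes its height and the number of clusters it merged, and the requested K picks the nearest useful split.

```python
# Illustrative assumption: each internal node of the agglomerative tree has a
# height (its cluster width) and a fan-in (how many clusters it merged).
# Cutting the tree at height h yields
#   n_leaves - sum(fan_in - 1 for nodes with height <= h)
# clusters, so the unique node heights are the only useful splitting points.
def useful_splits(nodes, n_leaves):
    """nodes: list of (height, fan_in).  Returns (height, cluster count) pairs."""
    splits = []
    for h in sorted({height for height, _ in nodes}):
        merged = sum(fan_in - 1 for height, fan_in in nodes if height <= h)
        splits.append((h, n_leaves - merged))
    return splits

def choose_split(nodes, n_leaves, k):
    """Pick the useful split whose cluster count is closest to the requested K."""
    return min(useful_splits(nodes, n_leaves), key=lambda hc: abs(hc[1] - k))
```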

19
Agglomerative Tree results on a small rule set of
genuine market data -
  • 19 major categories for 1311 itemsets at the top
    level.
  • 953 clusters at the lowest level of split.
  • 289 unique splitting points in between.
  • Many good clusters by visual inspection with one
    kind of rule class in it.
  • Some examples - painting raw-material,
    construction tools, electrical tools, electrical
    supply, christmas-tree construction material,
    plumbing-materials.
  • Separation of rules into categories helps in
    visualization.

20
SOM Training with Distance Matrix
  • Multidimensional scaling: conversion of the N×N
    distance matrix into an N×M vector input space
    (N points of M dimensions each).
  • Each M-dimensional point is input to a SOM.
  • Choosing a 2-D SOM grid of size K×K and mapping
    the N points onto the SOM.

Multi-Dimensional Scaling
  • From matrix M, find the first N singular values
    and vectors (SVD). They represent the N most
    important dimensions.
  • Choose the first L dimensions.
  • Regenerate the distance matrix M'.
  • Calculate the error between M and M'. If the
    error is less than a threshold, stop; else try a
    higher number of dimensions.

21
Multi-Dimensional Scaling Continued...
  • Error Measure - Stress

Stress = Σ (M - M')² / Σ M²   (elementwise over the matrices)
  • For simulated N-d data, the Stress drops
    exponentially and suddenly goes very close to 0
    for L = N.
  • For our data, Stress drops below 5% only for a
    very large value of L.
  • A shortened binary search finds L = 750 in 3
    steps; plain binary search finds L = 757 in more
    than double the steps (see the sketch below).

[Plot: Stress vs. number of dimensions; 2.3% error at 750 dimensions]
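
The transcript omits the scaling code itself. Below is a minimal sketch under the assumption that the N×N distance matrix is reduced with a truncated SVD, the matrix is regenerated from the leading components, and the stress above drives a binary search over L; this is an illustrative reading of the slides, not the authors' implementation.

```python
import numpy as np

# Minimal sketch: keep the L leading components of an SVD of the N x N
# distance matrix M, regenerate M' from them, and compute the stress.
def truncated_svd(M, L):
    U, s, Vt = np.linalg.svd(M)
    M_prime = U[:, :L] @ np.diag(s[:L]) @ Vt[:L, :]   # rank-L reconstruction
    points = U[:, :L] * s[:L]                         # N points of L dims (SOM input)
    return M_prime, points

def stress(M, M_prime):
    """Stress = sum((M - M')^2) / sum(M^2), elementwise."""
    return np.sum((M - M_prime) ** 2) / np.sum(M ** 2)

def smallest_good_L(M, threshold=0.05):
    """Plain binary search for the smallest L whose stress is below the
    threshold; the slide's shortened variant reaches L = 750 in fewer steps."""
    lo, hi = 1, M.shape[0]
    while lo < hi:
        mid = (lo + hi) // 2
        M_prime, _ = truncated_svd(M, mid)
        if stress(M, M_prime) < threshold:
            hi = mid
        else:
            lo = mid + 1
    return lo
```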
22
Combining SOM Training Results with Agglomerative
Clustering
  • Overlap might be taken as a sign of good clusters
    since SOM and agglomerative clustering should
    have different biases.
  • 715 clusters for one splitting of Agglomerative
    Clustering tree.
  • Training 1000 points, 750 dimensions.
  • 1000 Input points mapped to a 2-D SOM of 27x27
    (729 output classes) would give best comparison.
  • Preliminary comparison for 10x10 SOM.
  • 40,000 epochs.
  • Coloring SOM-classified points with the class
    labels of the agglomerative clustering provides
    easy visualization (a tallying sketch follows
    below). See example -

23
SOM Training Results, 1000 to 100 mapping
24
A randomly picked cluster ..
Rule 278932 (278932): 16846,PAINT BRUSHES (new);
  16926,PAINT-INTERIOR-ONE ONLY (new);
  16844,PAINTING DROPCLOTHS (new)
Rule 278899 (278899): 12328,TRIM-A-TREE-ORNAMENTS-GLASS SATIN (new);
  12331,TRIM-A-TREE-IMPORT THEME ORNAMENTS (new);
  12336,TRIM-A-TREE-MISC. CHRISTMAS ITEMS (new)
Rule 278963 (278963): 17131,ELECTRICAL-WIRE/CABLE NM/UF RET CTN (new);
  17090,ELECTRICAL-BOXES COVERS (new);
  17152,ELECTRICAL-CONNECTORS/TERMINALS (new);
  17050,ELECTRICAL-OUTLETS/SWITCHES (new)
25
A Cluster from Agglomerative Clustering
Rule 278932 (278932): 16846,PAINT BRUSHES (new);
  16926,PAINT-INTERIOR-ONE ONLY (new);
  16844,PAINTING DROPCLOTHS (new)
Rule 278773 (278773): 16846,PAINT BRUSHES (new);
  16871,PAINTING ACC. - SHURLINE (new);
  16730,PAINT-MISC SUNDRY ITEMS (new)
Rule 277258 (277258): 16840,PAINTING ACCESSORY ITEMS (new);
  16926,PAINT-INTERIOR-ONE ONLY (new);
  16844,PAINTING DROPCLOTHS (new)
Rule 277269 (277269): 16871,PAINTING ACC. - SHURLINE (new);
  16926,PAINT-INTERIOR-ONE ONLY (new);
  16844,PAINTING DROPCLOTHS (new)
Rule 277433 (277433): 16871,PAINTING ACC. - SHURLINE (new);
  16730,PAINT-MISC SUNDRY ITEMS (new)
Rule 277435 (277435): ... and so on (insufficient space to show here).
26
A Cluster (3,7) from SOM results
27
Conclusions and Future Work
  • Conclusions
  • Sparse and high-dimensional data
  • Future Work
  • Complexity and scalability issues
  • Sub-sampling for distance computation
  • Merge similar rules
  • Incorporate meta-data for validation or to
    support clustering and merging
  • Explore other distance measures (log-likelihood
    functions instead of probabilities)
  • More work on SOM coloring - use hierarchical
    coloring. Also interactive visualization of rules
    in SOM result.
  • Reference
  • Cluster Analysis, Brian Everitt
  • Applied Multivariate Statistical Analysis,
    Johnson & Wichern