Title: MOSAIC: A Proximity Graph Approach for Agglomerative Clustering
Jiyeon Choo, Rachsuda Jiamthapthaksin, Chun-sheng Chen, Ulvi Celepcikay, Christian Giusti, and Christoph F. Eick
Department of Computer Science, University of Houston

Organization
- Motivation
- Scope of the research
- Region Discovery
- Traditional Clustering
- Clustering with Plug-In Fitness Functions
- Shape-aware Clustering Algorithms
- Ideas of MOSAIC
- Background
- The MOSAIC Algorithm
- Experimental Evaluation
- Related Work
- Conclusion and Future Work
1.1 Motivation: Examples of Region Discovery
- Application 1: Hot-spot Discovery [EVDW06]
- Application 2: Find Interesting Regions with respect to a Continuous Variable
- Application 3: Find Representative Regions (Sampling)
- Application 4: Regional Co-location Mining
- Application 5: Regional Association Rule Mining [DEWY06]
- Application 6: Regional Association Rule Scoping [EDYKN07]
[Figure: RD-Algorithm results for β = 1.01 and β = 1.04 on wells in Texas; green = well safe with respect to arsenic, red = unsafe well]
Region Discovery Framework
- The algorithms we currently investigate solve the following problem:
- Given:
  - A dataset O with a schema R
  - A distance function d defined on instances of R
  - A fitness function q(X) that evaluates a clustering X = {c1, …, ck} as follows (see the sketch after this list):
    q(X) = Σ_{c∈X} reward(c) = Σ_{c∈X} interestingness(c) · size(c)^β, with β > 1
- Objective:
  - Find c1, …, ck ⊆ O such that:
    - ci ∩ cj = ∅ for i ≠ j
    - X = {c1, …, ck} maximizes q(X)
    - All clusters ci ∈ X are contiguous
    - c1 ∪ … ∪ ck ⊆ O
  - c1, …, ck are usually ranked based on the reward each cluster receives, and low-reward clusters are frequently not reported
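A minimal Python sketch of this additive fitness function; treating interestingness as a caller-supplied function and the β default of 1.01 are our illustrative assumptions, not part of the framework itself.

```python
from typing import Callable, Sequence

def reward(cluster: Sequence, interestingness: Callable[[Sequence], float],
           beta: float = 1.01) -> float:
    # reward(c) = interestingness(c) * size(c)^beta, with beta > 1:
    # larger clusters earn disproportionately more reward.
    return interestingness(cluster) * len(cluster) ** beta

def q(clustering, interestingness: Callable[[Sequence], float],
      beta: float = 1.01) -> float:
    # q(X) = sum of reward(c) over all clusters c in X (additive).
    return sum(reward(c, interestingness, beta) for c in clustering)
```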
1.2 Clustering with Plug-In Fitness Functions
Clustering algorithms, by their use of fitness functions:
- No fitness function: DBSCAN, Hierarchical Clustering
- Fixed fitness function: K-Means, PAM
- Implicit fitness function: CHAMELEON
- Plug-in fitness function: MOSAIC
1.3 Shape-aware Clustering
- Shape is a significant characteristic in traditional clustering and region discovery.
- Examples:
[Fig. 1: chain-like patterns in the Volcano dataset]
[Fig. 2: arbitrarily shaped regions of high (low) arsenic concentration in Texas wells]
1.4 Ideas Underlying MOSAIC
- MOSAIC provides a generic framework that integrates representative-based clustering, agglomerative clustering, and proximity graphs, and approximates arbitrarily shaped clusters using unions of small convex polygons.
[Fig. 6: An illustration of MOSAIC's approach; (a) input, (b) output]
Talk Organization
- Motivation
- Background
- Representative-based clustering
- Agglomerative clustering
- Proximity Graphs
- The MOSAIC Algorithm
- Experimental Evaluation
- Related Work
- Conclusion and Future Work
2.1 Representative-based Clustering
[Figure: four clusters (1-4) in a 2-D dataset with axes Attribute1 and Attribute2]
- Objective: Find a set of objects O_R such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X). (A sketch of the assignment step follows below.)
- Properties: Cluster shapes are convex polygons.
- Popular Algorithms: K-means, K-medoids, SCEC
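The assignment step shared by these algorithms explains the convex shapes: every object joins its closest representative, so each cluster is a Voronoi cell. A small sketch (function name is ours):

```python
import numpy as np

def assign_to_representatives(objects: np.ndarray, reps: np.ndarray) -> np.ndarray:
    # Assign each object to its nearest representative (Euclidean distance).
    # The induced clusters are the Voronoi cells of the representatives,
    # which is why representative-based cluster shapes are convex polygons.
    dists = np.linalg.norm(objects[:, None, :] - reps[None, :, :], axis=2)
    return dists.argmin(axis=1)  # cluster index per object
```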
2.2 MOSAIC and Agglomerative Clustering
- Advantages of MOSAIC over traditional agglomerative clustering:
  - Wider search: considers all neighboring clusters
  - Plug-in fitness function
  - Clusters are always contiguous
  - The expensive algorithm is only run for 20-1000 iterations
  - Highly generic algorithm
2.3 Proximity Graphs
- How do we identify neighbouring clusters for representative-based clustering algorithms?
- Proximity graphs provide various definitions of "neighbour":
  - NNG: Nearest Neighbour Graph
  - MST: Minimum Spanning Tree
  - RNG: Relative Neighbourhood Graph
  - GG: Gabriel Graph
  - DT: Delaunay Triangulation (neighbours of a 1NN-classifier)
Proximity Graphs: Delaunay
- The Delaunay Triangulation is the dual of the Voronoi diagram.
- Three points are each other's neighbours if their tangent sphere contains no other points.
- Complete: captures all neighbouring clusters.
- Expensive to compute in high dimensions (a SciPy sketch for the low-dimensional case follows below).
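In low dimensions the Delaunay neighbour relation can be computed directly, e.g. with SciPy; a sketch (function name is ours) that turns cluster representatives into neighbouring-cluster pairs:

```python
import numpy as np
from scipy.spatial import Delaunay

def delaunay_neighbors(reps: np.ndarray) -> set:
    # Each edge (i, j) of the Delaunay triangulation over the cluster
    # representatives marks clusters i and j as neighbours.
    tri = Delaunay(reps)
    edges = set()
    for simplex in tri.simplices:        # triangles when reps are 2-D
        for a in range(len(simplex)):
            for b in range(a + 1, len(simplex)):
                edges.add(tuple(sorted((int(simplex[a]), int(simplex[b])))))
    return edges
```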
Proximity Graphs: Gabriel
- The Gabriel graph is a subset of the Delaunay Triangulation (so some decision boundaries might be missed).
- Points are neighbours only if their (diametral) sphere of influence is empty.
- Can be computed more efficiently, in O(k³), as in the sketch below.
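A direct O(k³) construction, using the fact that a point r lies inside the diametral sphere of (i, j) exactly when the angle at r is obtuse; this is a sketch under our own naming, not the paper's implementation:

```python
import numpy as np
from itertools import combinations

def gabriel_edges(reps: np.ndarray) -> list:
    # i and j are Gabriel neighbours iff no third point r lies strictly
    # inside the sphere whose diameter is the segment (i, j), i.e.
    # d(i,r)^2 + d(r,j)^2 >= d(i,j)^2 for every other point r.
    d2 = ((reps[:, None, :] - reps[None, :, :]) ** 2).sum(axis=2)
    k = len(reps)
    return [(i, j) for i, j in combinations(range(k), 2)
            if all(d2[i, r] + d2[r, j] >= d2[i, j]
                   for r in range(k) if r not in (i, j))]
```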
3. MOSAIC
[Fig. 10: Gabriel graph for the clusters generated by a representative-based clustering algorithm]
Pseudo Code: MOSAIC
1. Run a representative-based clustering algorithm to create a large number of clusters.
2. Read the representatives of the obtained clusters.
3. Create a merge-candidate relation using proximity graphs.
4. WHILE there are merge-candidates (Ci, Cj) left BEGIN
     Merge the pair of merge-candidates (Ci, Cj) that enhances the fitness function q the most into a new cluster C'
     Update merge-candidates:
       ∀C: Merge-Candidate(C', C) ⇔ Merge-Candidate(Ci, C) ∨ Merge-Candidate(Cj, C)
   END
   RETURN the best clustering X found.
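A simplified Python rendering of this loop. Clusters are stored as sets of object indices, merge candidates as sorted index pairs, and q is the plug-in fitness over a collection of clusters; the data layout and names are ours, and fitness is recomputed from scratch for clarity (see the complexity remarks on the next slide).

```python
def mosaic_merge(clusters, candidates, q):
    # clusters:   dict {cluster_id: set of object indices}
    # candidates: set of (i, j) pairs of neighbouring clusters, i < j
    # q:          plug-in fitness over a collection of clusters
    def merge(cl, i, j):
        out = {k: v for k, v in cl.items() if k != j}
        out[i] = cl[i] | cl[j]            # cluster i absorbs cluster j
        return out

    best, best_q = clusters, q(clusters.values())
    while candidates:
        # Merge the candidate pair that enhances the fitness q the most.
        i, j = max(candidates, key=lambda p: q(merge(clusters, *p).values()))
        clusters = merge(clusters, i, j)
        # Rewire: every pair that touched j now touches i instead.
        candidates = {tuple(sorted(i if c == j else c for c in p))
                      for p in candidates if p != (i, j)}
        candidates = {p for p in candidates if p[0] != p[1]}
        if q(clusters.values()) > best_q:
            best, best_q = clusters, q(clusters.values())
    return best                            # best clustering X found
```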
Complexity: MOSAIC
- Let
  - n be the number of objects in the dataset
  - k be the number of clusters returned by the representative-based algorithm
- Complexity of MOSAIC: O(k³ + k² · O(q(X)))
- Remarks:
  - The above formula assumes that fitness is computed from scratch whenever a new clustering is obtained.
  - Lower complexities can be obtained by incrementally reusing the results of previous fitness computations; a sketch follows below.
  - Our current implementation assumes that only additive fitness functions are used.
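For an additive q, the reuse is simple: a merge only changes the rewards of the two clusters involved, so the fitness can be updated with one fresh reward evaluation instead of being recomputed over all clusters. A sketch under that assumption (names are ours):

```python
def q_after_merge(q_old, reward, ci, cj):
    # Additive fitness: q(X') = q(X) - reward(ci) - reward(cj) + reward(ci ∪ cj).
    # Only the merged cluster's reward must be evaluated afresh.
    return q_old - reward(ci) - reward(cj) + reward(ci | cj)
```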
4. Experimental Evaluation for Traditional Clustering
- Compared MOSAIC with DBSCAN and K-means.
- Used silhouette as q(X) when running MOSAIC; silhouette considers cohesion and separation (measured as the distance to the nearest cluster). A sketch follows below.
- Used the 9-Diamonds, Volcano, Diabetes, Ionosphere, and Vehicle datasets in the experimental evaluation.
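As a sketch of this setup (dataset handling omitted), scikit-learn's silhouette_score implements the measure used as q(X) here: for each object o, s(o) = (b(o) - a(o)) / max(a(o), b(o)), where a(o) is the mean intra-cluster distance (cohesion) and b(o) is the mean distance to the nearest other cluster (separation).

```python
import numpy as np
from sklearn.metrics import silhouette_score

def silhouette_fitness(data: np.ndarray, labels: np.ndarray) -> float:
    # Mean silhouette over all objects, in [-1, 1]; higher is better.
    return silhouette_score(data, labels)
```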
Experimental Results
- Finding good parameter settings for DBSCAN turned out to be problematic for the 9-Diamonds and Volcano spatial datasets.
- Neither DBSCAN nor MOSAIC was able to identify all chain-like patterns in the Volcano dataset.
- We compared MOSAIC and K-means on the Ionosphere, Diabetes, and Vehicle high-dimensional datasets. Cluster quality was measured using silhouette. MOSAIC outperformed K-means on these datasets.
Volcano Dataset Result: MOSAIC
[Figure: MOSAIC clustering of the Volcano dataset]
Volcano Dataset Result: DBSCAN
[Figure: DBSCAN clustering of the Volcano dataset]
Open Issues: What is a Good Fitness Function for Traditional Clustering?
- The use of plug-in fitness functions within traditional clustering algorithms is not very common.
- Using existing cluster evaluation measures as fitness functions, such as cohesion, separation, and silhouette, does not lead to very good clusterings when confronted with arbitrarily shaped clusters [Choo07].
- Question: Can we find better cluster evaluation measures, or is finding good evaluation measures for traditional clustering a hopeless project?
5. Related Work
- CURE integrates a partitioning algorithm with an agglomerative hierarchical algorithm [GRS98].
- CHAMELEON [KHK99] provides a sophisticated two-phase clustering algorithm: a multilevel graph partitioning algorithm followed by an agglomerative clustering algorithm on a k-nearest-neighbour sparse graph.
Related Work Continued
- Lin and Zhong [LC02, ZG03] propose hybrid clustering algorithms that combine representative-based clustering and agglomerative clustering methods.
- Surdeanu [STA05] proposes a hybrid clustering approach that combines an agglomerative clustering algorithm with the Expectation Maximization (EM) algorithm.
6. Conclusion
- A new clustering algorithm was introduced that approximates arbitrarily shaped clusters through unions of convex polygons.
- The algorithm performs a wider search by considering all neighboring clusters as merge candidates. Gabriel graphs are used to determine neighboring clusters.
- The algorithm is generic in that it can be used with any initial merge-candidate relation, any fitness function, and any representative-based algorithm.
- MOSAIC can also be seen as a generalization of agglomerative grid-based clustering algorithms.
- We mainly use MOSAIC in the region discovery project mentioned earlier.
Future Work: Learn Fitness Functions Based on Feedback
- Idea: employ machine learning techniques to learn a fitness function from the feedback of a domain expert.
- Pros:
  - It provides a more adaptive approach, allowing the fitness function to be tailored to the domain expert's requirements as they change.
  - The process of finding an appropriate fitness function is automatic.
- Cons:
  - Feature selection is non-trivial.
  - Learning the function is a difficult machine learning task.