Title: MOSAIC: A Proximity Graph Approach for Agglomerative Clustering
Jiyeon Choo, Rachsuda Jiamthapthaksin, Chun-sheng Chen, Ulvi Celepcikay, Christian Giusti, and Christoph F. Eick
Department of Computer Science, University of Houston

Organization
- Motivation
- Scope of the research
- Region Discovery
- Traditional Clustering
- Clustering with Plug-In Fitness Functions
- Shape-aware Clustering Algorithms
- Ideas of MOSAIC
- Background
- The MOSAIC Algorithm
- Experimental Evaluation
- Related Work
- Conclusion and Future Work
1.1 Motivation: Examples of Region Discovery
- Application 1: Hot-spot Discovery [EVDW06]
- Application 2: Find Interesting Regions with respect to a Continuous Variable
- Application 3: Find Representative Regions (Sampling)
- Application 4: Regional Co-location Mining
- Application 5: Regional Association Rule Mining [DEWY06]
- Application 6: Regional Association Rule Scoping [EDYKN07]
[Figure: RD-Algorithm results for β = 1.01 and β = 1.04 on wells in Texas; green = well safe with respect to arsenic, red = unsafe well]
Region Discovery Framework
- The algorithms we currently investigate solve the following problem:
- Given:
  - A dataset O with a schema R
  - A distance function d defined on instances of R
  - A fitness function q(X) that evaluates a clustering X = {c1, …, ck} as follows (see the sketch after this list):
    q(X) = Σ_{c∈X} reward(c) = Σ_{c∈X} interestingness(c) · size(c)^β, with β > 1
- Objective:
  - Find c1, …, ck ⊆ O such that:
    - ci ∩ cj = ∅ for i ≠ j
    - X = {c1, …, ck} maximizes q(X)
    - All clusters ci ∈ X are contiguous
    - c1 ∪ … ∪ ck ⊆ O
  - c1, …, ck are usually ranked based on the reward each cluster receives, and low-reward clusters are frequently not reported
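A minimal Python sketch of this additive fitness function; treating interestingness as a caller-supplied function and the β default of 1.01 are our illustrative assumptions, not part of the framework itself.

```python
from typing import Callable, Sequence

def reward(cluster: Sequence, interestingness: Callable[[Sequence], float],
           beta: float = 1.01) -> float:
    # reward(c) = interestingness(c) * size(c)^beta, with beta > 1:
    # larger clusters earn disproportionately more reward.
    return interestingness(cluster) * len(cluster) ** beta

def q(clustering, interestingness: Callable[[Sequence], float],
      beta: float = 1.01) -> float:
    # q(X) = sum of reward(c) over all clusters c in X (additive).
    return sum(reward(c, interestingness, beta) for c in clustering)
```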
1.2 Clustering with Plug-In Fitness Functions
Clustering algorithms, by their use of fitness functions:
- No fitness function: DBSCAN, Hierarchical Clustering
- Fixed fitness function: K-Means, PAM
- Implicit fitness function: CHAMELEON
- Plug-in fitness function: MOSAIC
1.3 Shape-aware Clustering
- Shape is a significant characteristic in traditional clustering and region discovery.
- Examples:
[Fig. 1: chain-like patterns in the Volcano dataset]
[Fig. 2: arbitrarily shaped regions of high (low) arsenic concentration in Texas wells]
1.4 Ideas Underlying MOSAIC
- MOSAIC provides a generic framework that integrates representative-based clustering, agglomerative clustering, and proximity graphs, and approximates arbitrarily shaped clusters using unions of small convex polygons.
[Fig. 6: An illustration of MOSAIC's approach; (a) input, (b) output]
Talk Organization
- Motivation
- Background
- Representative-based clustering
- Agglomerative clustering
- Proximity Graphs
- The MOSAIC Algorithm
- Experimental Evaluation
- Related Work
- Conclusion and Future Work
2.1 Representative-based Clustering
[Figure: four clusters (1-4) in a 2-D dataset with axes Attribute1 and Attribute2]
- Objective: Find a set of objects O_R such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X). (A sketch of the assignment step follows below.)
- Properties: Cluster shapes are convex polygons.
- Popular Algorithms: K-means, K-medoids, SCEC
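The assignment step shared by these algorithms explains the convex shapes: every object joins its closest representative, so each cluster is a Voronoi cell. A small sketch (function name is ours):

```python
import numpy as np

def assign_to_representatives(objects: np.ndarray, reps: np.ndarray) -> np.ndarray:
    # Assign each object to its nearest representative (Euclidean distance).
    # The induced clusters are the Voronoi cells of the representatives,
    # which is why representative-based cluster shapes are convex polygons.
    dists = np.linalg.norm(objects[:, None, :] - reps[None, :, :], axis=2)
    return dists.argmin(axis=1)  # cluster index per object
```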
2.2 MOSAIC and Agglomerative Clustering
- Advantages of MOSAIC over traditional agglomerative clustering:
  - Wider search: considers all neighboring clusters
  - Plug-in fitness function
  - Clusters are always contiguous
  - The expensive algorithm is only run for 20-1000 iterations
  - Highly generic algorithm
2.3 Proximity Graphs
- How do we identify neighbouring clusters for representative-based clustering algorithms?
- Proximity graphs provide various definitions of "neighbour":
  - NNG: Nearest Neighbour Graph
  - MST: Minimum Spanning Tree
  - RNG: Relative Neighbourhood Graph
  - GG: Gabriel Graph
  - DT: Delaunay Triangulation (neighbours of a 1NN-classifier)
Proximity Graphs: Delaunay
- The Delaunay Triangulation is the dual of the Voronoi diagram.
- Three points are each other's neighbours if their tangent sphere contains no other points.
- Complete: captures all neighbouring clusters.
- Expensive to compute in high dimensions (a SciPy sketch for the low-dimensional case follows below).
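In low dimensions the Delaunay neighbour relation can be computed directly, e.g. with SciPy; a sketch (function name is ours) that turns cluster representatives into neighbouring-cluster pairs:

```python
import numpy as np
from scipy.spatial import Delaunay

def delaunay_neighbors(reps: np.ndarray) -> set:
    # Each edge (i, j) of the Delaunay triangulation over the cluster
    # representatives marks clusters i and j as neighbours.
    tri = Delaunay(reps)
    edges = set()
    for simplex in tri.simplices:        # triangles when reps are 2-D
        for a in range(len(simplex)):
            for b in range(a + 1, len(simplex)):
                edges.add(tuple(sorted((int(simplex[a]), int(simplex[b])))))
    return edges
```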
Proximity Graphs: Gabriel
- The Gabriel graph is a subset of the Delaunay Triangulation (so some decision boundaries might be missed).
- Points are neighbours only if their (diametral) sphere of influence is empty.
- Can be computed more efficiently, in O(k³), as in the sketch below.
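A direct O(k³) construction, using the fact that a point r lies inside the diametral sphere of (i, j) exactly when the angle at r is obtuse; this is a sketch under our own naming, not the paper's implementation:

```python
import numpy as np
from itertools import combinations

def gabriel_edges(reps: np.ndarray) -> list:
    # i and j are Gabriel neighbours iff no third point r lies strictly
    # inside the sphere whose diameter is the segment (i, j), i.e.
    # d(i,r)^2 + d(r,j)^2 >= d(i,j)^2 for every other point r.
    d2 = ((reps[:, None, :] - reps[None, :, :]) ** 2).sum(axis=2)
    k = len(reps)
    return [(i, j) for i, j in combinations(range(k), 2)
            if all(d2[i, r] + d2[r, j] >= d2[i, j]
                   for r in range(k) if r not in (i, j))]
```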
3. MOSAIC
[Fig. 10: Gabriel graph for the clusters generated by a representative-based clustering algorithm]
Pseudo Code: MOSAIC
1. Run a representative-based clustering algorithm to create a large number of clusters.
2. Read the representatives of the obtained clusters.
3. Create a merge-candidate relation using proximity graphs.
4. WHILE there are merge-candidates (Ci, Cj) left BEGIN
     Merge the pair of merge-candidates (Ci, Cj) that enhances the fitness function q the most into a new cluster C'
     Update merge-candidates:
       ∀C: Merge-Candidate(C', C) ⇔ Merge-Candidate(Ci, C) ∨ Merge-Candidate(Cj, C)
   END
   RETURN the best clustering X found.
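A simplified Python rendering of this loop. Clusters are stored as sets of object indices, merge candidates as sorted index pairs, and q is the plug-in fitness over a collection of clusters; the data layout and names are ours, and fitness is recomputed from scratch for clarity (see the complexity remarks on the next slide).

```python
def mosaic_merge(clusters, candidates, q):
    # clusters:   dict {cluster_id: set of object indices}
    # candidates: set of (i, j) pairs of neighbouring clusters, i < j
    # q:          plug-in fitness over a collection of clusters
    def merge(cl, i, j):
        out = {k: v for k, v in cl.items() if k != j}
        out[i] = cl[i] | cl[j]            # cluster i absorbs cluster j
        return out

    best, best_q = clusters, q(clusters.values())
    while candidates:
        # Merge the candidate pair that enhances the fitness q the most.
        i, j = max(candidates, key=lambda p: q(merge(clusters, *p).values()))
        clusters = merge(clusters, i, j)
        # Rewire: every pair that touched j now touches i instead.
        candidates = {tuple(sorted(i if c == j else c for c in p))
                      for p in candidates if p != (i, j)}
        candidates = {p for p in candidates if p[0] != p[1]}
        if q(clusters.values()) > best_q:
            best, best_q = clusters, q(clusters.values())
    return best                            # best clustering X found
```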
Complexity: MOSAIC
- Let
  - n be the number of objects in the dataset
  - k be the number of clusters returned by the representative-based algorithm
- Complexity of MOSAIC: O(k³ + k² · O(q(X)))
- Remarks:
  - The above formula assumes that fitness is computed from scratch whenever a new clustering is obtained.
  - Lower complexities can be obtained by incrementally reusing the results of previous fitness computations; a sketch follows below.
  - Our current implementation assumes that only additive fitness functions are used.
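For an additive q, the reuse is simple: a merge only changes the rewards of the two clusters involved, so the fitness can be updated with one fresh reward evaluation instead of being recomputed over all clusters. A sketch under that assumption (names are ours):

```python
def q_after_merge(q_old, reward, ci, cj):
    # Additive fitness: q(X') = q(X) - reward(ci) - reward(cj) + reward(ci ∪ cj).
    # Only the merged cluster's reward must be evaluated afresh.
    return q_old - reward(ci) - reward(cj) + reward(ci | cj)
```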
4. Experimental Evaluation for Traditional Clustering
- Compared MOSAIC with DBSCAN and K-means.
- Used silhouette as q(X) when running MOSAIC; silhouette considers cohesion and separation (measured as the distance to the nearest cluster). A sketch follows below.
- Used the 9-Diamonds, Volcano, Diabetes, Ionosphere, and Vehicle datasets in the experimental evaluation.
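As a sketch of this setup (dataset handling omitted), scikit-learn's silhouette_score implements the measure used as q(X) here: for each object o, s(o) = (b(o) - a(o)) / max(a(o), b(o)), where a(o) is the mean intra-cluster distance (cohesion) and b(o) is the mean distance to the nearest other cluster (separation).

```python
import numpy as np
from sklearn.metrics import silhouette_score

def silhouette_fitness(data: np.ndarray, labels: np.ndarray) -> float:
    # Mean silhouette over all objects, in [-1, 1]; higher is better.
    return silhouette_score(data, labels)
```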
Experimental Results
- Finding good parameter settings for DBSCAN turned out to be problematic for the 9-Diamonds and Volcano spatial datasets.
- Neither DBSCAN nor MOSAIC was able to identify all chain-like patterns in the Volcano dataset.
- We compared MOSAIC and K-means on the Ionosphere, Diabetes, and Vehicle high-dimensional datasets. Cluster quality was measured using silhouette. MOSAIC outperformed K-means on these datasets.
Volcano Dataset Result: MOSAIC
[Figure: MOSAIC clustering of the Volcano dataset]
Volcano Dataset Result: DBSCAN
[Figure: DBSCAN clustering of the Volcano dataset]
Open Issues: What is a Good Fitness Function for Traditional Clustering?
- The use of plug-in fitness functions within traditional clustering algorithms is not very common.
- Using existing cluster evaluation measures as fitness functions, such as cohesion, separation, and silhouette, does not lead to very good clusterings when confronted with arbitrarily shaped clusters [Choo07].
- Question: Can we find better cluster evaluation measures, or is finding good evaluation measures for traditional clustering a hopeless project?
5. Related Work
- CURE integrates a partitioning algorithm with an agglomerative hierarchical algorithm [GRS98].
- CHAMELEON [KHK99] provides a sophisticated two-phase clustering algorithm: a multilevel graph partitioning algorithm followed by an agglomerative clustering algorithm on a k-nearest-neighbour sparse graph.
Related Work Continued
- Lin and Zhong [LC02, ZG03] propose hybrid clustering algorithms that combine representative-based clustering and agglomerative clustering methods.
- Surdeanu [STA05] proposes a hybrid clustering approach that combines an agglomerative clustering algorithm with the Expectation Maximization (EM) algorithm.
6. Conclusion
- A new clustering algorithm was introduced that approximates arbitrarily shaped clusters through unions of convex polygons.
- The algorithm performs a wider search by considering all neighboring clusters as merge candidates. Gabriel graphs are used to determine neighboring clusters.
- The algorithm is generic in that it can be used with any initial merge-candidate relation, any fitness function, and any representative-based algorithm.
- MOSAIC can also be seen as a generalization of agglomerative grid-based clustering algorithms.
- We mainly use MOSAIC in the region discovery project mentioned earlier.
Future Work: Learn Fitness Functions Based on Feedback
- Idea: employ machine learning techniques to learn a fitness function from the feedback of a domain expert.
- Pros:
  - It provides a more adaptive approach, allowing the fitness function to be tailored to the domain expert's requirements as they change.
  - The process of finding an appropriate fitness function is automatic.
- Cons:
  - Feature selection is non-trivial.
  - Learning the function is a difficult machine learning task.