Title: Project Presentation CPSC 695
1 Project Presentation: CPSC 695
- Prepared by Priyadarshi Bhattacharya
2 Outline of Talk
- Introduction to clustering and its relevance to my research interests.
- Discussion of existing clustering techniques and their shortcomings.
- Introduction to a new Delaunay-based clustering algorithm.
- Experimental results and comparison with other methods.
- Directions of future research.
3 Clustering: Definition
- Automatic identification of groups of similar objects.
- A method of grouping data such that intra-cluster similarity is maximized and inter-cluster similarity is minimized.
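To make the definition concrete, here is a minimal sketch (my own illustration, assuming Euclidean distance as the dissimilarity measure; the function and toy data are not part of the presented work):

```python
import numpy as np

def intra_inter_distance(points, labels):
    """Mean pairwise distance within clusters vs. between clusters: a good
    clustering keeps the first small (high intra-cluster similarity) and the
    second large (low inter-cluster similarity)."""
    pts = np.asarray(points, dtype=float)
    lab = np.asarray(labels)
    diff = pts[:, None, :] - pts[None, :, :]       # all pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))       # Euclidean distance matrix
    iu = np.triu_indices(len(pts), k=1)            # each unordered pair once
    same = (lab[:, None] == lab[None, :])[iu]
    return dist[iu][same].mean(), dist[iu][~same].mean()

# Two well-separated groups: the intra-cluster mean is far below the inter-cluster mean.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(intra_inter_distance(pts, [0, 0, 0, 1, 1, 1]))
```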
4 Properties of clustering
- Scalability: clustering time should grow no more than linearly as the data size increases.
- Ability to detect clusters of different shapes.
- Minimal input parameters.
- Robustness with regard to noise.
- Insensitivity to data input order.
- Scalability to higher dimensions.
- (Properties adapted, with minor modifications, from "On Data Clustering Analysis: Scalability, Constraints and Validation".)
5 Relevance to my research
- Identification of high-risk areas in the sea based on incident data from the Maritime Activity and Risk Investigation System (MARIS), maintained primarily by the University of Halifax.
(Diagram: incident data (ESRI shape file) → clustering algorithm → high-risk areas, used for marine route planning and locating SAR bases)
6 Existing clustering algorithms
- Partitioning: K-Means, K-Medoid
- Hierarchical: BIRCH, CURE, ROCK, CHAMELEON
- Density-based: DBSCAN, TURN
- Grid-based: WaveCluster [1], CLIQUE
[1] WaveCluster: a novel clustering approach based on wavelet transforms. It applies a multi-resolution grid structure to the data space. For more details, see "WaveCluster: a multi-resolution clustering approach for very large spatial databases", Proc. 24th Conf. on Very Large Databases.
7 Shortcomings of existing methods
- They require a large number of parameters to be supplied by the user, e.g. the number of clusters, a threshold to quantify similarity, a stopping condition, the number of nearest neighbors, etc.
- Sensitivity to user-supplied parameters.
- The ability to identify clusters degrades as noise increases.
- Inability to identify clusters of widely varying shapes and sizes; most methods detect only spherical clusters.
- Identifying dense clusters in the presence of sparse ones, clusters connected by multiple bridges, and closely lying dense clusters remains elusive.
8 CRYSTAL: a new Delaunay-based clustering algorithm
- The algorithm has three stages:
- Triangulation phase: forms the Delaunay triangulation of the data points and sorts the vertices in order of decreasing average length of adjacent edges.
- Grow-cluster phase: scans the sorted vertex list and grows clusters from the vertices in that order, first encompassing first-order neighbors, then second-order neighbors, and so on. Growth stops when the boundary of the cluster is determined.
- Noise-removal phase: the algorithm identifies noise as sparse clusters, which can easily be eliminated by removing clusters that are very small in size or have very low density. (A simplified sketch of the whole pipeline follows this list.)
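The three stages map onto the following simplified, self-contained Python sketch. It is an illustration only, not the CRYSTAL implementation: SciPy's Delaunay triangulation stands in for the incremental construction, and a single tolerance (boundary_factor) stands in for the exact membership and boundary thresholds.

```python
import numpy as np
from collections import deque
from scipy.spatial import Delaunay

def crystal_sketch(points, boundary_factor=2.0, min_size=10):
    """Simplified sketch of the three stages outlined above (illustrative only)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)

    # Stage I: triangulate, collect Delaunay neighbours, compute each vertex's
    # average adjacent edge length, and sort vertices by decreasing length.
    tri = Delaunay(pts)
    nbrs = [set() for _ in range(n)]
    for simplex in tri.simplices:
        for a in simplex:
            for b in simplex:
                if a != b:
                    nbrs[a].add(b)
    avg_edge = np.array([np.mean([np.linalg.norm(pts[v] - pts[u]) for u in nbrs[v]])
                         for v in range(n)])
    order = np.argsort(-avg_edge)

    # Stage II: grow a cluster from each unassigned vertex, breadth-first over
    # Delaunay neighbours; boundary points join the cluster but are not expanded.
    labels = -np.ones(n, dtype=int)
    clusters = []
    for seed in order:
        if labels[seed] >= 0:
            continue
        cid = len(clusters)
        labels[seed] = cid
        members, queue = [seed], deque([seed])
        edge_sum, edge_cnt = avg_edge[seed], 1          # bootstrap the running mean
        while queue:
            v = queue.popleft()
            cluster_avg = edge_sum / edge_cnt           # average edge length of the cluster
            for u in nbrs[v]:
                if labels[u] >= 0:
                    continue
                e = np.linalg.norm(pts[v] - pts[u])
                if e > boundary_factor * cluster_avg:
                    continue                            # edge too long: not a member
                labels[u] = cid
                members.append(u)
                edge_sum, edge_cnt = edge_sum + e, edge_cnt + 1
                if avg_edge[u] <= boundary_factor * cluster_avg:
                    queue.append(u)                     # interior point: keep growing
        clusters.append(members)

    # Stage III: drop trivial clusters; drop the sparsest cluster (highest average
    # edge length) if it is clearly sparser than the rest, treating it as noise.
    clusters = [c for c in clusters if len(c) >= min_size]
    if len(clusters) > 1:
        sparseness = np.array([avg_edge[c].mean() for c in clusters])
        if sparseness.max() > 2.0 * np.median(sparseness):
            clusters.pop(int(np.argmax(sparseness)))
    return clusters

# Toy data: two well-separated uniform-density groups of 200 points each.
rng = np.random.default_rng(0)
data = np.vstack([rng.uniform(0, 2, (200, 2)), rng.uniform(8, 10, (200, 2))])
print([len(c) for c in crystal_sketch(data, boundary_factor=3.0)])
```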
9 Description of Stage I
- Triangulation phase
- The triangulation is built in O(n log n) time using the incremental algorithm.
- An auxiliary grid structure (O(n) in size) is used to speed up the point-location problem in the Delaunay triangulation; it considerably reduces the length of the walk in the graph needed to locate the triangle containing the data point.
- The well-known winged-edge data structure is used to represent the Delaunay triangulation because of its efficiency in answering proximity queries.
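The auxiliary grid can be pictured with a short sketch (my own illustration; the incremental insertion, the walk itself and the winged-edge structure are not shown):

```python
import math

class PointLocationGrid:
    """Uniform bucket grid used to pick a good starting vertex for the
    point-location walk during incremental insertion (illustrative sketch)."""

    def __init__(self, cell_size):
        self.cell = cell_size
        self.buckets = {}                          # (i, j) cell -> points in that cell

    def _key(self, p):
        return (math.floor(p[0] / self.cell), math.floor(p[1] / self.cell))

    def insert(self, p):
        self.buckets.setdefault(self._key(p), []).append(p)

    def nearest_seed(self, p):
        """Return an already-inserted point near p by scanning growing rings of
        cells around p's cell; the walk to the containing triangle starts there."""
        if not self.buckets:
            return None
        ci, cj = self._key(p)
        ring = 0
        while True:
            found = [q for i in range(ci - ring, ci + ring + 1)
                       for j in range(cj - ring, cj + ring + 1)
                       for q in self.buckets.get((i, j), [])]
            if found:
                return min(found, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)
            ring += 1

# Each triangulated point goes into the grid; before inserting the next point,
# a nearby seed vertex is looked up so the walk in the triangulation stays short.
grid = PointLocationGrid(cell_size=1.0)
grid.insert((0.2, 0.3))
grid.insert((5.1, 4.8))
print(grid.nearest_seed((4.7, 5.0)))               # -> (5.1, 4.8)
```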
10 Description of Stage II
- Grow-cluster phase
- A queue maintains, in order, the vertices from which the cluster is grown. Only vertices that are not boundary points are inserted into the queue.
- To decide whether a point belongs to the cluster, the connecting edge length is compared with the average edge length of the cluster. To decide whether a point lies on the boundary of the cluster, the average adjacent edge length of the point is compared with the average edge length of the cluster.
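The two decision rules can be written as small predicates (the tolerance factor below is an assumed parameter for illustration, not a value given in the presentation):

```python
def is_member(edge_len, cluster_avg_edge, factor=2.0):
    """Membership test: the connecting edge must be comparable to the
    cluster's average edge length (factor is an assumed tolerance)."""
    return edge_len <= factor * cluster_avg_edge

def is_boundary(point_avg_edge, cluster_avg_edge, factor=2.0):
    """Boundary test: a point whose own average adjacent edge length is much
    larger than the cluster's average closes off the cluster; such a point
    joins the cluster but is never enqueued for further growth."""
    return point_avg_edge > factor * cluster_avg_edge

# A point connected by an edge of 0.3 to a cluster whose edges average 0.25 is
# accepted, but if its neighbourhood is sparse (avg adjacent edge 0.9) it only
# becomes a boundary point, so the breadth-first growth stops there.
print(is_member(0.3, 0.25), is_boundary(0.9, 0.25))
```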
11 Description of Stage III
- Noise-removal phase
- Noise in the data may take the form of isolated data points or may be scattered throughout the data. In the former case, clusters based at these data points are not able to grow.
- However, if the noise is scattered uniformly throughout the data, our algorithm identifies it as a single sparse cluster. This phase simply gets rid of such noise by eliminating the cluster with the highest average edge length. Any trivial clusters (size less than an acceptable number) are also removed in this phase.
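A sketch of this phase with assumed thresholds (min_size and the sparsity ratio are illustrative choices of mine, not values from the presentation):

```python
import statistics

def remove_noise(clusters, min_size=10, sparsity_ratio=2.0):
    """Drop trivial clusters, then drop the sparsest cluster (highest average
    edge length) when it is clearly sparser than the rest, treating it as
    uniformly scattered noise.

    clusters: list of (member_count, avg_edge_length) pairs."""
    kept = [c for c in clusters if c[0] >= min_size]       # remove trivial clusters
    if len(kept) > 1:
        median_edge = statistics.median(c[1] for c in kept)
        sparsest = max(kept, key=lambda c: c[1])
        if sparsest[1] > sparsity_ratio * median_edge:      # scattered-noise cluster
            kept.remove(sparsest)
    return kept

# Two dense clusters, one scattered "cluster" of noise, one trivial fragment.
print(remove_noise([(250, 0.21), (180, 0.25), (90, 1.40), (4, 0.30)]))
```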
12 Complexity Analysis
- The algorithm runs in O(n log n) time.
- The Delaunay triangulation is generated in O(n log n) time. Since a vertex, once assigned to a cluster, is not considered again, the clustering itself is done in O(n) time.
(Chart: cluster size (in thousands) vs. time consumed (ms))
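A rough way to reproduce such a size-vs-time curve (an illustration assuming SciPy's Delaunay triangulation as a stand-in for the dominant O(n log n) step; not the original measurement):

```python
import time
import numpy as np
from scipy.spatial import Delaunay

# Time the triangulation for growing input sizes; the clustering pass is O(n)
# and adds only a small constant-per-point overhead on top of this.
rng = np.random.default_rng(1)
for n in [10_000, 20_000, 40_000, 80_000]:
    pts = rng.random((n, 2))
    t0 = time.perf_counter()
    Delaunay(pts)
    ms = (time.perf_counter() - t0) * 1000
    print(f"n = {n:6d}: {ms:8.1f} ms")
```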
13 Clustering in action
14 Experimental Results
- Comparison with K-Means based approaches
15 Experimental Results (contd.)
1. Clusters of different shapes
2. Closely lying dense clusters
16 Experimental Results (contd.)
1. Clusters connected by multiple bridges
2. Clusters of widely varying density
17 Experimental Results (contd.)
(Figure: clustering results on several data sets, comparing K-Means, GEM and CRYSTAL)
18 Experimental Results (contd.)
Results on t7.10k.dat (originally used in "CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling")
19 Conclusion and Future Work
- CRYSTAL is a fast O(n log n) clustering algorithm that automatically identifies clusters of widely varying shapes, sizes and densities without requiring any input from the user.
- Future work will involve:
- Application of the clustering algorithm to the identification of high-risk areas in the sea using the MARIS database.
- Extension of the algorithm to 3D.
- Consideration of physical constraints in clustering. In GIS, physical constraints such as rivers, highways and mountain ranges can hinder or alter the clustering result.
20 References
- G. Papari, N. Petkov: Algorithm That Mimics Human Perceptual Grouping of Dot Patterns. Lecture Notes in Computer Science (2005) 497-506
- Vladimir Estivill-Castro, Ickjai Lee: AUTOCLUST: Automatic Clustering via Boundary Extraction for Mining Massive Point-Data Sets. Fifth International Conference on Geocomputation (2000)
- Osmar R. Zaiane, Andrew Foss, Chi-Hoon Lee, Weinan Wang: On Data Clustering Analysis: Scalability, Constraints and Validation. Advances in Knowledge Discovery and Data Mining, Springer-Verlag (2002)
- Z.S.H. Chan, N. Kasabov: Efficient global clustering using the Greedy Elimination Method. Electronics Letters 40(25) (2004)
- Aristidis Likas, Nikos Vlassis, Jakob J. Verbeek: The global k-means clustering algorithm. Pattern Recognition 36(2) (2003) 451-461
- Ying Xu, Victor Olman, Dong Xu: Minimum Spanning Trees for Gene Expression Data Clustering. Computational Protein Structure Group, Life Sciences Division, Oak Ridge National Laboratory, USA
- C. Eldershaw, M. Hegland: Cluster Analysis using Triangulation. Computational Techniques and Applications CTAC97, 201-208. World Scientific, Singapore (1997)
- Mir Abolfazl Mostafavi, Christopher Gold, Maciej Dakowicz: Delete and insert operations in Voronoi/Delaunay methods and applications. Computers & Geosciences 29(4) (2003) 523-530
- Atsuyuki Okabe, Barry Boots, Kokichi Sugihara: Spatial Tessellations: Concepts and Applications of Voronoi Diagrams.
21 Thank You!
All 11 identified by CRYSTAL! Questions?