Title: Project Presentation CPSC 695
1 Project Presentation: CPSC 695
- Prepared by Priyadarshi Bhattacharya
2 Outline of Talk
- Introduction to clustering and its relevance to my research interests.
- Discussion of existing clustering techniques and their shortcomings.
- Introduction to a new Delaunay-based clustering algorithm.
- Experimental results and comparison with other methods.
- Directions of future research.
3 Clustering: Definition
- Automatic identification of groups of similar objects.
- A method of grouping data such that intra-cluster similarity is maximized and inter-cluster similarity is minimized.
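To make the definition concrete, here is a minimal sketch (my own illustration, assuming Euclidean distance as the dissimilarity measure; the function and toy data are not part of the presented work):

```python
import numpy as np

def intra_inter_distance(points, labels):
    """Mean pairwise distance within clusters vs. between clusters: a good
    clustering keeps the first small (high intra-cluster similarity) and the
    second large (low inter-cluster similarity)."""
    pts = np.asarray(points, dtype=float)
    lab = np.asarray(labels)
    diff = pts[:, None, :] - pts[None, :, :]       # all pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))       # Euclidean distance matrix
    iu = np.triu_indices(len(pts), k=1)            # each unordered pair once
    same = (lab[:, None] == lab[None, :])[iu]
    return dist[iu][same].mean(), dist[iu][~same].mean()

# Two well-separated groups: the intra-cluster mean is far below the inter-cluster mean.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(intra_inter_distance(pts, [0, 0, 0, 1, 1, 1]))
```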
4 Properties of clustering
- Scalability: clustering time should grow no more than linearly as the data size increases.
- Ability to detect clusters of different shapes.
- Minimal input parameters.
- Robustness with regard to noise.
- Insensitivity to data input order.
- Scalability to higher dimensions.
- (Properties adapted, with minor modifications, from "On Data Clustering Analysis: Scalability, Constraints and Validation".)
5 Relevance to my research
- Identification of high-risk areas in the sea based on incident data from the Maritime Activity and Risk Investigation System (MARIS), maintained primarily by the University of Halifax.
(Diagram: incident data (ESRI shape file) → clustering algorithm → high-risk areas, used for marine route planning and locating SAR bases)
6 Existing clustering algorithms
- Partitioning: K-Means, K-Medoid
- Hierarchical: BIRCH, CURE, ROCK, CHAMELEON
- Density-based: DBSCAN, TURN
- Grid-based: WaveCluster [1], CLIQUE
[1] WaveCluster: a novel clustering approach based on wavelet transforms. It applies a multi-resolution grid structure to the data space. For more details, see "WaveCluster: a multi-resolution clustering approach for very large spatial databases", Proc. 24th Conf. on Very Large Databases.
7 Shortcomings of existing methods
- They require a large number of parameters to be supplied by the user, e.g. the number of clusters, a threshold to quantify similarity, a stopping condition, the number of nearest neighbors, etc.
- Sensitivity to user-supplied parameters.
- The ability to identify clusters degrades as noise increases.
- Inability to identify clusters of widely varying shapes and sizes; most methods detect only spherical clusters.
- Identifying dense clusters in the presence of sparse ones, clusters connected by multiple bridges, and closely lying dense clusters remains elusive.
8 CRYSTAL: a new Delaunay-based clustering algorithm
- The algorithm has three stages:
- Triangulation phase: forms the Delaunay triangulation of the data points and sorts the vertices in order of decreasing average length of adjacent edges.
- Grow-cluster phase: scans the sorted vertex list and grows clusters from the vertices in that order, first encompassing first-order neighbors, then second-order neighbors, and so on. Growth stops when the boundary of the cluster is determined.
- Noise-removal phase: the algorithm identifies noise as sparse clusters, which can easily be eliminated by removing clusters that are very small in size or have very low density. (A simplified sketch of the whole pipeline follows this list.)
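The three stages map onto the following simplified, self-contained Python sketch. It is an illustration only, not the CRYSTAL implementation: SciPy's Delaunay triangulation stands in for the incremental construction, and a single tolerance (boundary_factor) stands in for the exact membership and boundary thresholds.

```python
import numpy as np
from collections import deque
from scipy.spatial import Delaunay

def crystal_sketch(points, boundary_factor=2.0, min_size=10):
    """Simplified sketch of the three stages outlined above (illustrative only)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)

    # Stage I: triangulate, collect Delaunay neighbours, compute each vertex's
    # average adjacent edge length, and sort vertices by decreasing length.
    tri = Delaunay(pts)
    nbrs = [set() for _ in range(n)]
    for simplex in tri.simplices:
        for a in simplex:
            for b in simplex:
                if a != b:
                    nbrs[a].add(b)
    avg_edge = np.array([np.mean([np.linalg.norm(pts[v] - pts[u]) for u in nbrs[v]])
                         for v in range(n)])
    order = np.argsort(-avg_edge)

    # Stage II: grow a cluster from each unassigned vertex, breadth-first over
    # Delaunay neighbours; boundary points join the cluster but are not expanded.
    labels = -np.ones(n, dtype=int)
    clusters = []
    for seed in order:
        if labels[seed] >= 0:
            continue
        cid = len(clusters)
        labels[seed] = cid
        members, queue = [seed], deque([seed])
        edge_sum, edge_cnt = avg_edge[seed], 1          # bootstrap the running mean
        while queue:
            v = queue.popleft()
            cluster_avg = edge_sum / edge_cnt           # average edge length of the cluster
            for u in nbrs[v]:
                if labels[u] >= 0:
                    continue
                e = np.linalg.norm(pts[v] - pts[u])
                if e > boundary_factor * cluster_avg:
                    continue                            # edge too long: not a member
                labels[u] = cid
                members.append(u)
                edge_sum, edge_cnt = edge_sum + e, edge_cnt + 1
                if avg_edge[u] <= boundary_factor * cluster_avg:
                    queue.append(u)                     # interior point: keep growing
        clusters.append(members)

    # Stage III: drop trivial clusters; drop the sparsest cluster (highest average
    # edge length) if it is clearly sparser than the rest, treating it as noise.
    clusters = [c for c in clusters if len(c) >= min_size]
    if len(clusters) > 1:
        sparseness = np.array([avg_edge[c].mean() for c in clusters])
        if sparseness.max() > 2.0 * np.median(sparseness):
            clusters.pop(int(np.argmax(sparseness)))
    return clusters

# Toy data: two well-separated uniform-density groups of 200 points each.
rng = np.random.default_rng(0)
data = np.vstack([rng.uniform(0, 2, (200, 2)), rng.uniform(8, 10, (200, 2))])
print([len(c) for c in crystal_sketch(data, boundary_factor=3.0)])
```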
9 Description of Stage I
- Triangulation phase
- The triangulation is built in O(n log n) time using the incremental algorithm.
- An auxiliary grid structure (O(n) in size) is used to speed up the point-location problem in the Delaunay triangulation; it considerably reduces the length of the walk in the graph needed to locate the triangle containing the data point.
- The well-known winged-edge data structure is used to represent the Delaunay triangulation because of its efficiency in answering proximity queries.
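The auxiliary grid can be pictured with a short sketch (my own illustration; the incremental insertion, the walk itself and the winged-edge structure are not shown):

```python
import math

class PointLocationGrid:
    """Uniform bucket grid used to pick a good starting vertex for the
    point-location walk during incremental insertion (illustrative sketch)."""

    def __init__(self, cell_size):
        self.cell = cell_size
        self.buckets = {}                          # (i, j) cell -> points in that cell

    def _key(self, p):
        return (math.floor(p[0] / self.cell), math.floor(p[1] / self.cell))

    def insert(self, p):
        self.buckets.setdefault(self._key(p), []).append(p)

    def nearest_seed(self, p):
        """Return an already-inserted point near p by scanning growing rings of
        cells around p's cell; the walk to the containing triangle starts there."""
        if not self.buckets:
            return None
        ci, cj = self._key(p)
        ring = 0
        while True:
            found = [q for i in range(ci - ring, ci + ring + 1)
                       for j in range(cj - ring, cj + ring + 1)
                       for q in self.buckets.get((i, j), [])]
            if found:
                return min(found, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)
            ring += 1

# Each triangulated point goes into the grid; before inserting the next point,
# a nearby seed vertex is looked up so the walk in the triangulation stays short.
grid = PointLocationGrid(cell_size=1.0)
grid.insert((0.2, 0.3))
grid.insert((5.1, 4.8))
print(grid.nearest_seed((4.7, 5.0)))               # -> (5.1, 4.8)
```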
10 Description of Stage II
- Grow-cluster phase
- A queue maintains, in order, the vertices from which the cluster is grown. Only vertices that are not boundary points are inserted into the queue.
- To decide whether a point belongs to the cluster, the connecting edge length is compared with the average edge length of the cluster. To decide whether a point lies on the boundary of the cluster, the average adjacent edge length of the point is compared with the average edge length of the cluster.
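The two decision rules can be written as small predicates (the tolerance factor below is an assumed parameter for illustration, not a value given in the presentation):

```python
def is_member(edge_len, cluster_avg_edge, factor=2.0):
    """Membership test: the connecting edge must be comparable to the
    cluster's average edge length (factor is an assumed tolerance)."""
    return edge_len <= factor * cluster_avg_edge

def is_boundary(point_avg_edge, cluster_avg_edge, factor=2.0):
    """Boundary test: a point whose own average adjacent edge length is much
    larger than the cluster's average closes off the cluster; such a point
    joins the cluster but is never enqueued for further growth."""
    return point_avg_edge > factor * cluster_avg_edge

# A point connected by an edge of 0.3 to a cluster whose edges average 0.25 is
# accepted, but if its neighbourhood is sparse (avg adjacent edge 0.9) it only
# becomes a boundary point, so the breadth-first growth stops there.
print(is_member(0.3, 0.25), is_boundary(0.9, 0.25))
```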
11 Description of Stage III
- Noise-removal phase
- Noise in the data may take the form of isolated data points or may be scattered throughout the data. In the former case, clusters based at these data points are not able to grow.
- However, if the noise is scattered uniformly throughout the data, our algorithm identifies it as a single sparse cluster. This phase simply gets rid of such noise by eliminating the cluster with the highest average edge length. Any trivial clusters (size less than an acceptable number) are also removed in this phase.
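A sketch of this phase with assumed thresholds (min_size and the sparsity ratio are illustrative choices of mine, not values from the presentation):

```python
import statistics

def remove_noise(clusters, min_size=10, sparsity_ratio=2.0):
    """Drop trivial clusters, then drop the sparsest cluster (highest average
    edge length) when it is clearly sparser than the rest, treating it as
    uniformly scattered noise.

    clusters: list of (member_count, avg_edge_length) pairs."""
    kept = [c for c in clusters if c[0] >= min_size]       # remove trivial clusters
    if len(kept) > 1:
        median_edge = statistics.median(c[1] for c in kept)
        sparsest = max(kept, key=lambda c: c[1])
        if sparsest[1] > sparsity_ratio * median_edge:      # scattered-noise cluster
            kept.remove(sparsest)
    return kept

# Two dense clusters, one scattered "cluster" of noise, one trivial fragment.
print(remove_noise([(250, 0.21), (180, 0.25), (90, 1.40), (4, 0.30)]))
```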
12 Complexity Analysis
- The algorithm runs in O(n log n) time.
- The Delaunay triangulation is generated in O(n log n) time. Since a vertex, once assigned to a cluster, is not considered again, the clustering itself is done in O(n) time.
(Chart: cluster size (in thousands) vs. time consumed (ms))
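A rough way to reproduce such a size-vs-time curve (an illustration assuming SciPy's Delaunay triangulation as a stand-in for the dominant O(n log n) step; not the original measurement):

```python
import time
import numpy as np
from scipy.spatial import Delaunay

# Time the triangulation for growing input sizes; the clustering pass is O(n)
# and adds only a small constant-per-point overhead on top of this.
rng = np.random.default_rng(1)
for n in [10_000, 20_000, 40_000, 80_000]:
    pts = rng.random((n, 2))
    t0 = time.perf_counter()
    Delaunay(pts)
    ms = (time.perf_counter() - t0) * 1000
    print(f"n = {n:6d}: {ms:8.1f} ms")
```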
13 Clustering in action
14 Experimental Results
- Comparison with K-Means based approaches
15 Experimental Results (contd.)
1. Clusters of different shapes
2. Closely lying dense clusters
16 Experimental Results (contd.)
1. Clusters connected by multiple bridges
2. Clusters of widely varying density
17 Experimental Results (contd.)
(Figure: clustering results on several data sets, comparing K-Means, GEM and CRYSTAL)
18 Experimental Results (contd.)
Results on t7.10k.dat (originally used in "CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling")
19 Conclusion and Future Work
- CRYSTAL is a fast O(n log n) clustering algorithm that automatically identifies clusters of widely varying shapes, sizes and densities without requiring any input from the user.
- Future work will involve:
- Application of the clustering algorithm to the identification of high-risk areas in the sea using the MARIS database.
- Extension of the algorithm to 3D.
- Consideration of physical constraints in clustering. In GIS, physical constraints such as rivers, highways and mountain ranges can hinder or alter the clustering result.
20 References
- G. Papari, N. Petkov: Algorithm That Mimics Human Perceptual Grouping of Dot Patterns. Lecture Notes in Computer Science (2005) 497-506
- Vladimir Estivill-Castro, Ickjai Lee: AUTOCLUST: Automatic Clustering via Boundary Extraction for Mining Massive Point-Data Sets. Fifth International Conference on Geocomputation (2000)
- Osmar R. Zaiane, Andrew Foss, Chi-Hoon Lee, Weinan Wang: On Data Clustering Analysis: Scalability, Constraints and Validation. Advances in Knowledge Discovery and Data Mining, Springer-Verlag (2002)
- Z.S.H. Chan, N. Kasabov: Efficient global clustering using the Greedy Elimination Method. Electronics Letters 40(25) (2004)
- Aristidis Likas, Nikos Vlassis, Jakob J. Verbeek: The global k-means clustering algorithm. Pattern Recognition 36(2) (2003) 451-461
- Ying Xu, Victor Olman, Dong Xu: Minimum Spanning Trees for Gene Expression Data Clustering. Computational Protein Structure Group, Life Sciences Division, Oak Ridge National Laboratory, USA
- C. Eldershaw, M. Hegland: Cluster Analysis using Triangulation. Computational Techniques and Applications CTAC97, 201-208. World Scientific, Singapore (1997)
- Mir Abolfazl Mostafavi, Christopher Gold, Maciej Dakowicz: Delete and insert operations in Voronoi/Delaunay methods and applications. Computers & Geosciences 29(4) (2003) 523-530
- Atsuyuki Okabe, Barry Boots, Kokichi Sugihara: Spatial Tessellations: Concepts and Applications of Voronoi Diagrams.
21 Thank You!
All 11 identified by CRYSTAL! Questions?