Title: CSE634: DATA CLUSTERING METHODS
1 CSE634 DATA CLUSTERING METHODS
Group 9
- Karthik Anandh Govindaraj (105845335)
- Shashank Viswanadha (105955553)
- Praveen Durairaj (105948340)
- Ravikanth Pulavarthy (105227609)
2 CSE634 DATA CLUSTERING METHODS
Group 9
- Karthik Anandh Govindaraj
- (karthikanandh@gmail.com)
3 References
- Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proc. 2nd International Conference on Knowledge Discovery and Data Mining (KDD'96), pages 226-231, 1996.
- Ankerst, M., Breunig, M. M., Kriegel, H.-P., and Sander, J. OPTICS: Ordering Points to Identify the Clustering Structure. In Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'99), pages 49-60, 1999.
- Hinneburg, A., and Keim, D. A. An Efficient Approach to Clustering in Large Multimedia Databases with Noise. In Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining (KDD'98), AAAI Press, pages 58-65, 1998.
4 What is Cluster Analysis?
- Cluster: a collection of data objects
- Similar to the objects in the same cluster (intraclass similarity)
- Dissimilar to the objects in other clusters (interclass dissimilarity)
- Cluster analysis
- Statistical method for grouping a set of data objects into clusters
- A good clustering method produces high-quality clusters with high intraclass similarity and low interclass similarity
- Clustering is unsupervised classification
5 Clustering Methods
- Partitioning methods
- Hierarchical methods
- Density-based methods
- Grid-based methods
6 Issues: Large Spatial Databases
- Minimal requirements of domain knowledge to determine the input parameters
- Discovery of clusters with arbitrary shape
- Good efficiency on large databases
7 Density-Based Clustering Methods
- Clustering based on density (a local cluster criterion), such as density-connected points
- Features
- Discover clusters of arbitrary shape
- Handle noise
- One scan
- Need density parameters as termination condition
- Studies
- DBSCAN: Ester et al. (KDD'96)
- OPTICS: Ankerst et al. (SIGMOD'99)
- DENCLUE: Hinneburg & Keim (KDD'98)
8 Density-Based Clustering: Definitions
- Parameters
- Eps: maximum radius of the neighbourhood
- MinPts: minimum number of points in an Eps-neighbourhood of that point
- NEps(p) = {q ∈ D | dist(p, q) ≤ Eps}
9 Definitions
- Directly density-reachable: a point p is directly density-reachable from a point q wrt. Eps, MinPts if
- 1) p belongs to NEps(q)
- 2) core point condition: |NEps(q)| ≥ MinPts
10 Contd.
- Density-reachable
- A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, ..., pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi
- Density-connected
- A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
(Figure: points q, p1, p illustrating a density-reachability chain)
11 Density-Based Cluster Definition
- Cluster: a maximal set of density-connected points
- A cluster C is a subset of D satisfying:
- For all p, q: if p is in C, and q is density-reachable from p, then q is also in C
- For all p, q in C: p is density-connected to q
12 Contd.
- Lemma 1: If p is a core point, and O is the set of points density-reachable from p, then O is a cluster
- Lemma 2: Let C be a cluster and p be any core point of C; then C equals the set of points density-reachable from p
- Implication: finding the points density-reachable from an arbitrary core point generates a cluster. A cluster is uniquely determined by any of its core points
13 DBSCAN: The Algorithm
- Arbitrarily select a point p
- Retrieve all points density-reachable from p wrt Eps and MinPts
- If p is a core point, a cluster is formed
- If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database
- Continue the process until all of the points have been processed (a Python sketch follows)
- Complexity: O(N²) with a brute-force region query; O(N log N) expected with a spatial index such as an R*-tree
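To make these steps concrete, here is a minimal illustrative Python sketch of DBSCAN (the function names, the brute-force Eps-neighbourhood query, and the noise label -1 are our own choices, not the paper's code):

```python
# Minimal DBSCAN sketch (illustrative; brute-force neighbourhood queries).
import numpy as np

def region_query(X, i, eps):
    """Indices of all points within eps of point i (its Eps-neighbourhood)."""
    return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

def dbscan(X, eps, min_pts):
    """Label each point with a cluster id; -1 marks noise."""
    labels = np.full(len(X), -1)            # -1 = noise / unassigned
    visited = np.zeros(len(X), dtype=bool)
    cluster_id = 0
    for i in range(len(X)):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = list(region_query(X, i, eps))
        if len(neighbors) < min_pts:
            continue                        # border or noise point: move on
        labels[i] = cluster_id              # i is a core point: grow a new cluster
        j = 0
        while j < len(neighbors):           # expand the cluster via density-reachability
            q = neighbors[j]
            if not visited[q]:
                visited[q] = True
                q_neighbors = region_query(X, q, eps)
                if len(q_neighbors) >= min_pts:   # q is also a core point
                    neighbors.extend(n for n in q_neighbors if n not in neighbors)
            if labels[q] == -1:
                labels[q] = cluster_id
            j += 1
        cluster_id += 1
    return labels
```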
14 OPTICS: A Cluster-Ordering Method
- OPTICS: Ordering Points To Identify the Clustering Structure
- Ankerst, Breunig, Kriegel, and Sander (SIGMOD'99)
- Produces a special order of the database wrt its density-based clustering structure
- This cluster-ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings
- Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure
- Can be represented graphically or using visualization techniques
15 Core- and Reachability-Distance
- Parameters: generating distance ε, fixed value MinPts
- core-distance_{ε,MinPts}(o)
- smallest distance such that o is a core object
- (if that distance is ≤ ε; undefined ('?') otherwise)
- reachability-distance_{ε,MinPts}(p, o)
- smallest distance such that p is directly density-reachable from o
- (if that distance is ≤ ε; undefined ('?') otherwise; sketched in code below)
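A hedged Python sketch of these two distances (helper names and the brute-force nearest-neighbour search are assumptions for illustration; None stands in for "undefined"):

```python
# Illustrative core- and reachability-distance computations.
import numpy as np

def core_distance(X, o, eps, min_pts):
    """Distance from o to its MinPts-th nearest object, or None if o is not a core object wrt eps."""
    d = np.sort(np.linalg.norm(X - X[o], axis=1))   # d[0] is o itself (distance 0)
    return d[min_pts - 1] if d[min_pts - 1] <= eps else None

def reachability_distance(X, p, o, eps, min_pts):
    """max(core-distance(o), dist(o, p)), or None (undefined) if o is not a core object."""
    cd = core_distance(X, o, eps, min_pts)
    if cd is None:
        return None
    return max(cd, np.linalg.norm(X[p] - X[o]))
```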
16 The Algorithm OPTICS
- foreach o ∈ Database
- // initially, o.processed = false for all objects o
- if o.processed = false
- insert (o, ∞) into ControlList
- while ControlList is not empty
- select first element (o, r_dist) from ControlList
- retrieve Nε(o) and determine c_dist = core-distance(o)
- set o.processed = true
- write (o, r_dist, c_dist) to file
- if o is a core object at any distance ≤ ε
17 Contd.
- foreach p ∈ Nε(o) not yet processed
- determine r_dist_p = reachability-distance(p, o)
- if (p, _) ∉ ControlList
- insert (p, r_dist_p) into ControlList
- else if (p, old_r_dist) ∈ ControlList and r_dist_p < old_r_dist
- update (p, r_dist_p) in ControlList
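Putting the pseudocode together, a compact illustrative OPTICS loop might look like the following. It reuses core_distance and region_query from the sketches above; the heap-based ControlList with lazy updates is our simplification, not the paper's data structure:

```python
# Compact OPTICS sketch (illustrative; reuses core_distance / region_query above).
import heapq
import numpy as np

def optics(X, eps, min_pts):
    INF = float('inf')
    n = len(X)
    processed = np.zeros(n, dtype=bool)
    order = []                       # the cluster-ordering written "to file" in the slides
    reach = np.full(n, INF)          # reachability-distance; INF = undefined
    for start in range(n):
        if processed[start]:
            continue
        seeds = [(INF, start)]       # the ControlList, kept as a min-heap on r_dist
        while seeds:
            r, o = heapq.heappop(seeds)
            if processed[o]:
                continue             # stale heap entry: skip (lazy update)
            processed[o] = True
            order.append(o)
            cd = core_distance(X, o, eps, min_pts)
            if cd is None:
                continue             # o is not a core object: expand nothing
            for p in region_query(X, o, eps):
                if processed[p]:
                    continue
                new_r = max(cd, np.linalg.norm(X[p] - X[o]))
                if new_r < reach[p]: # insert or decrease p's reachability
                    reach[p] = new_r
                    heapq.heappush(seeds, (new_r, p))
    return order, reach
```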
18 Reachability-Distance Plot
(Figure: reachability-distance, with undefined values, plotted against the cluster-order of the objects)
19 DENCLUE: Clustering Using Density Functions
- DENsity-based CLUstEring, by Hinneburg & Keim (KDD'98)
- Major features
- Solid mathematical foundation
- Good for data sets with large amounts of noise
- Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
- Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
- But needs a large number of parameters
20 DENCLUE: Technical Essence
- Uses grid cells, but only keeps information about grid cells that actually contain data points, and manages these cells in a tree-based access structure
- Influence function: describes the impact of a data point within its neighborhood
- The overall density of the data space can be calculated as the sum of the influence functions of all data points
- Clusters can be determined mathematically by identifying density attractors
- Density attractors are local maxima of the overall density function
21 Gradient: The Steepness of a Slope
22 Example: Density Computation
D = {x1, x2, x3, x4}
f_D^Gaussian(x) = influence(x1) + influence(x2) + influence(x3) + influence(x4) = 0.04 + 0.06 + 0.08 + 0.6 = 0.78
(Figure: points x1, x2, x3, x4 around locations x and y, with individual influence values 0.04, 0.06, 0.08, and 0.6)
Remark: the density value of y would be larger than the one for x (the computation is reproduced in the sketch below)
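The example density above can be reproduced with a short sketch. The Gaussian influence function follows DENCLUE's definition; the smoothing width sigma is an assumed parameter:

```python
# Gaussian influence and overall density, as in the example above.
import numpy as np

def gaussian_influence(x, xi, sigma=1.0):
    """Influence of data point xi on location x."""
    return np.exp(-np.linalg.norm(x - xi) ** 2 / (2 * sigma ** 2))

def density(x, data, sigma=1.0):
    """Overall density at x: the sum of the influences of all data points."""
    return sum(gaussian_influence(x, xi, sigma) for xi in data)
```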
23 Density Attractor
24 Basic Steps: DENCLUE Algorithm
- Determine density attractors (see the hill-climbing sketch below)
- Associate data objects with density attractors (→ initial clustering)
- Merge the initial clusters further, relying on a hierarchical clustering approach
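A hedged sketch of the first step: finding a density attractor by gradient ascent (hill climbing) on the Gaussian density, reusing gaussian_influence and density from above. The step size delta and the stopping tolerance are illustrative assumptions:

```python
# Hill climbing toward a density attractor (illustrative parameters).
import numpy as np

def density_gradient(x, data, sigma=1.0):
    """Gradient of the Gaussian density: sum of (xi - x) * influence(x, xi) / sigma^2."""
    return sum((xi - x) * gaussian_influence(x, xi, sigma) for xi in data) / sigma ** 2

def find_attractor(x, data, sigma=1.0, delta=0.1, tol=1e-5, max_iter=1000):
    """Follow the gradient from x until the density stops increasing."""
    for _ in range(max_iter):
        g = density_gradient(x, data, sigma)
        x_new = x + delta * g / (np.linalg.norm(g) + 1e-12)  # normalized ascent step
        if density(x_new, data, sigma) <= density(x, data, sigma) + tol:
            return x            # local maximum (density attractor) reached
        x = x_new
    return x
```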
25 CSE634 DATA CLUSTERING METHODS
Group 9
- Shashank Viswanadha
- (sviswana@cs.sunysb.edu)
26 Sources and References
- Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber (Second Edition)
- Data Clustering, by A. K. Jain (Michigan State University), M. N. Murty (Indian Institute of Science), and P. J. Flynn (The Ohio State University)
- http://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf
- STING: A Statistical Information Grid Approach to Spatial Data Mining, by Wei Wang, Jiong Yang, and Richard Muntz (University of California, Los Angeles)
- http://www.sigmod.org/vldb/conf/1997/P186.PDF
- WaveCluster: a wavelet-based clustering approach for spatial data in very large databases, by Gholamhosein Sheikholeslami, Surojit Chatterjee, and Aidong Zhang (The VLDB Journal (2000) 8: 289-304)
- http://www.cs.uiuc.edu/homes/hanj/refs/papers/sheikholeslami98.pdf
27 Overview
- Grid-Based Methods
- STING
- WaveCluster
- Clustering High-Dimensional Data
- CLIQUE
- PROCLUS
28 Grid-Based Methods
- Use a multiresolution grid data structure
- Operations are performed on a finite number of cells which form a grid
- Fast processing
- Examples
- STING: explores statistical information stored in grid cells
- WaveCluster: clusters objects using wavelet transform methods
29 STING: STatistical INformation Grid
- Grid-based multiresolution clustering technique in which the spatial area is divided into rectangular cells
- Different levels of rectangular cells correspond to different levels of resolution, forming a hierarchical structure
- Statistical parameters of higher-level cells can be computed from the parameters of the lower-level cells (see the sketch below)
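As an illustration of this bottom-up computation, the sketch below derives a parent cell's count, mean, standard deviation, min, and max from its child cells (the Cell fields are assumed names, not STING's actual data structures):

```python
# Aggregating a parent cell's statistics from its lower-level cells.
from dataclasses import dataclass

@dataclass
class Cell:
    n: int        # number of objects in the cell
    mean: float   # mean of the attribute value
    std: float    # standard deviation of the attribute value
    lo: float     # minimum
    hi: float     # maximum

def aggregate(children):
    """Derive the parent cell's parameters from its child cells."""
    n = sum(c.n for c in children)
    mean = sum(c.n * c.mean for c in children) / n
    # E[X^2] per child is std^2 + mean^2; pool, then re-centre on the parent mean.
    ex2 = sum(c.n * (c.std ** 2 + c.mean ** 2) for c in children) / n
    std = max(ex2 - mean ** 2, 0.0) ** 0.5
    return Cell(n, mean, std, min(c.lo for c in children), max(c.hi for c in children))
```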
30 Contd.
31 Contd.
32 Contd.
- Types of parameters
- Attribute-independent: number of objects in a cell (count)
- Attribute-dependent: mean, stdev, min, and max
- Types of distribution that the attribute value can follow:
- Normal
- Uniform
- Exponential
- None (if the distribution is unknown)
33 Contd.
- Mainly used for query answering
- Advantages
- Query-independent
- Facilitates parallel processing and incremental updating
- Efficiency
- Disadvantage
- All the cluster boundaries are either horizontal or vertical; no diagonal boundary is detected
- Time complexity for query processing is O(g), where g is the total number of grid cells at the lowest level, which is much smaller than the number of objects
34 WaveCluster: Clustering Using Wavelet Transformation
- WaveCluster, a multiresolution clustering algorithm, involves two steps:
- Summarize the data by imposing a multidimensional grid structure onto the data space
- Transform the original feature space, finding dense regions in the transformed space
- Handles large data sets efficiently, discovers clusters of arbitrary shape, handles outliers, and is insensitive to the order of input
35 Contd.
- Why is wavelet transformation useful for clustering?
- Unsupervised clustering: it uses hat-shaped filters to emphasize regions where points cluster, while simultaneously suppressing weaker information at their boundaries
36 Contd.
- Effective removal of outliers
37 Contd.
- Multiresolution: the multiresolution property of the wavelet transform can help in detecting clusters at different levels of detail. The wavelet transform provides multiple levels of decomposition, which results in clusters at different scales, from fine to coarse
- Cost efficiency: since applying the wavelet transform is very fast, it makes the approach cost-effective. As shown in the experiments, clustering very large datasets takes only a few seconds; using parallel processing, even faster responses can be obtained (a toy filtering example follows)
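As a toy illustration of the hat-shaped filtering idea (not WaveCluster itself), convolving a 1-D histogram of grid-cell counts with a crude hat kernel boosts dense interiors and suppresses their boundaries:

```python
# Toy hat-shaped filtering of a 1-D grid-cell count histogram.
import numpy as np

counts = np.array([0, 1, 1, 8, 9, 8, 1, 0, 0, 7, 8, 7, 1, 0], dtype=float)
hat = np.array([-1, 2, -1], dtype=float)           # crude hat-shaped kernel
response = np.convolve(counts, hat, mode='same')   # dense interiors score high,
dense = response > 0                               # boundary cells score low
print(dense.astype(int))                           # candidate cluster cells
```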
38 Clustering High-Dimensional Data
- Introduces clustering methods designed for high-dimensional data, generally over 10 dimensions, or even thousands of dimensions for some tasks
- Major difficulties when finding clusters in high-dimensional data:
- Noise produced by irrelevant dimensions
- Sparse data
- Data points in different dimensions becoming almost equally distant from one another
39 Contd.
- Techniques used
- Feature transformation methods: transform the data onto a smaller space while generally preserving the original distances between objects. Examples: principal component analysis (PCA) and singular value decomposition (SVD); see the PCA sketch below
- Feature selection methods: commonly used for data reduction by removing irrelevant or redundant dimensions (attributes)
- Subspace clustering: an extension of feature selection; searches for groups of clusters within different subspaces of the same data set
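For instance, a minimal PCA-style feature transformation can be written via the SVD (an illustrative sketch; in practice a library implementation would be used):

```python
# Minimal PCA via SVD: project the data onto its top-k principal components.
import numpy as np

def pca(X, k):
    """Return a k-dimensional representation of X."""
    Xc = X - X.mean(axis=0)                          # centre the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                             # coordinates along top-k components
```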
40 Contd.
- Examples
- CLIQUE: dimension-growth subspace clustering
- PROCLUS: dimension-reduction projected clustering
41 CLIQUE
- Outline of CLIQUE's clustering algorithm:
- Identify the sparse and crowded areas in space, thereby discovering the overall distribution
- A cluster is defined as a maximal set of connected dense units
- Performs multidimensional clustering in two steps:
- Partition the d-dimensional data space into nonoverlapping rectangular units, identifying the dense units among them (see the sketch below)
- Generate a minimal description for each cluster
42 Contd.
- Insensitive to the order of input objects
- Doesn't presume any canonical data distribution
- Scales linearly with the size of the input and hence has good scalability
- Clustering results are dependent on proper tuning of the grid size
- Also difficult to find clusters of rather different density within different dimensional subspaces
43 PROCLUS
- Typical dimension-reduction subspace clustering method
- Consists of three phases:
- Initialization
- Iteration
- Cluster refinement
- Initialization: select a set of initial medoids that are far apart from each other, so as to ensure that each cluster is represented by at least one object in the selected set (see the greedy sketch below)
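A common way to realize this "far apart" selection is a greedy farthest-point heuristic, sketched below (a simplification of PROCLUS's initialization phase, not its exact procedure):

```python
# Greedy farthest-point selection of m well-separated initial medoids.
import numpy as np

def greedy_medoids(X, m):
    """Pick m points, each as far as possible from those already chosen."""
    chosen = [0]                                     # start from an arbitrary point
    d = np.linalg.norm(X - X[0], axis=1)             # distance to the chosen set
    for _ in range(m - 1):
        nxt = int(np.argmax(d))                      # farthest from the chosen set
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return chosen
```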
44 Contd.
- Iteration: selects a random set of k medoids from the reduced set and replaces bad medoids with randomly chosen new medoids if the clustering is improved
- Refinement: computes new dimensions for each medoid based on the clusters found, reassigns points to medoids, and removes outliers
45 CSE634 DATA CLUSTERING METHODS
Group 9
- Praveen Durairaj
- (praveend@cs.sunysb.edu)
46 Sources and References
- Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber (Second Edition)
- Data Clustering, by A. K. Jain (Michigan State University), M. N. Murty (Indian Institute of Science), and P. J. Flynn (The Ohio State University)
- http://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf
- Clustering Through Decision Tree Construction (2000), by Bing Liu, Yiyuan Xia, and Philip S. Yu
47 Constraint-Based Cluster Analysis
- Used when the clustering task involves a very high-dimensional space
- User preferences
- Constraints while clustering
- Examples:
- Expected number of clusters
- Minimal/maximal cluster size
48 Categories of Constraint-Based Clustering
- Constraints on individual objects
- Constraints on the selection of clustering parameters
- Constraints on distance or similarity functions
- Clustering with obstacle objects
- User-specified constraints on properties of individual clusters
- Semi-supervised clustering based on partial supervision
49 Clustering with Obstacle Objects
- Considers obstacle objects during clustering
- Partitioning clustering method
- k-medoids method
- Uses a triangulation method to compute the distance between two objects
- Computational cost is very high if a large number of objects and obstacles are present
50 Solving Clustering with Obstacles: Visibility Graphs
- A visibility graph is a graph VG = (V, E) such that each vertex of the obstacles has a corresponding node in V, and two nodes v1 and v2 in V are joined by an edge in E if and only if the corresponding vertices they represent are visible to each other
- An example visibility graph
51 Visibility Graphs
- Consider another visibility graph VG' = (V', E') created from VG by adding two points p and q to V'
- The shortest path between the two points p and q will be a sub-path of VG' (see the Dijkstra sketch below)
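Given such a graph, the obstacle-avoiding distance between p and q reduces to a shortest-path query. Below is a sketch using Dijkstra's algorithm; the adjacency-dict representation is an assumption, and the construction of the visibility graph itself is omitted:

```python
# Shortest obstacle-avoiding distance via Dijkstra on a visibility graph.
# graph: dict mapping a node to a list of (neighbor, edge_length) pairs.
import heapq

def shortest_path_length(graph, p, q):
    """Length of the shortest p-q path; edge weights are straight-line lengths."""
    dist, heap = {p: 0.0}, [(0.0, p)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == q:
            return d
        if d > dist.get(u, float('inf')):
            continue                     # stale heap entry: skip
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float('inf')                  # q not reachable from p
```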
52 Cost of Distance Computation
- Preprocessing and optimization techniques are used:
- Triangulating the region into triangles
- Grouping nearby points to form micro-clusters
- Uses two types of indices for optimization:
- VV indices, for any pair of obstacle vertices
- MV indices, for any pair of micro-cluster and obstacle vertex
53 User-Constrained Cluster Analysis
- Constrained optimization problem
- Example from the package delivery industry: n customers and k service stations
- Customer classification:
- High-value customers
- Ordinary customers
54 Micro-clustering
- Partition the data set into k clusters satisfying user-specified constraints
- Iterative refinement of the solution:
- Move m surplus objects from cluster Ci to Cj
- The total sum of the distances of the objects to their corresponding cluster centers is reduced
55 Computational Efficiency
- Should handle deadlock situations
- A constraint may be impossible to satisfy
- Data is preprocessed to form micro-clusters
- Object movement
- Deadlock detection
- Constraint satisfaction
- Advantage
- Micro-clustering reduces the problem size, improving scalability
56 Semi-supervised Cluster Analysis
- Clustering process based on user feedback or guidance constraints
- Pair-wise constraints:
- Objects are labeled as belonging to the same cluster or to different clusters
- Generates highly desirable clusters
57 Methods
- Constraint-based semi-supervised clustering
- Relies on user-provided labels or constraints
- Example: CLTree (based on decision trees)
- Distance-based semi-supervised clustering
- Adaptive distance measures:
- String-edit distance trained using Expectation-Maximization
- Euclidean distance
58 Clustering Using Decision Trees
- Converts the clustering problem into a classification problem
- Considers the set of points to be clustered as one class, Y
- Adds a set of relatively uniformly distributed non-existing points with label N
- Does not physically add the points, but only assumes their existence
59 Clustering Using Decision Trees
(Figure) a) Set of data points (Y) to be clustered; b) addition of uniformly distributed N points; c) the resulting clustering, shown with Y points only
60 Clustering Using Decision Trees
- Works efficiently because the decision tree only needs the number of N points
- The number of N points for the current node E is determined by the following rule, transcribed in the sketch below (note that at the root node, the number of inherited N points is 0):
- If the number of N points inherited from the parent node of E is less than the number of Y points in E, then the number of N points for E is increased to the number of Y points in E
- Else, the number of inherited N points is used for E
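The rule transcribes directly into code (a sketch; CLTree's actual implementation differs in the surrounding tree construction):

```python
# CLTree's rule for the number of assumed non-existing (N) points at a node.
def n_points_for_node(inherited_n, y_count):
    """Number of N points for the current node E."""
    if inherited_n < y_count:
        return y_count        # increase N to match the Y points in E
    return inherited_n        # otherwise keep the inherited count
```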
61 Clustering in Data Mining
- Searching for useful information in large volumes of data
- Current real-world data mining systems:
- Detecting trends and patterns of play for NBA players
- Categorizing patterns of children in the foster care system
- Data mining approaches that use clustering:
- Segmentation
- Predictive modeling
- Visualization of large databases
62 Segmentation
- Clusters form homogeneous groups
- Example: clustering pixels in Landsat images
- Each pixel has 7 values, one from each satellite band
- The 7-value vectors are clustered into 256 groups using a k-means algorithm
- The image is then displayed with the spatial information
63 Predictive Modeling
- Clusters group items
- Rules are inferred to characterize the groups and suggest models
- Example: magazine subscribers
- Clustered based on age, sex, income, etc.
- Groups are clustered further to predict whether the subscribers will renew their subscriptions
64 Visualization
- Aids human analysts in identifying groups that have similar characteristics
- WinViz tool:
- Exports derived clusters as new attributes and characterizes them
- Example: cereals can be clustered based on calories, carbohydrates, sugar, etc.
- Milk cereals can be characterized by high potassium content
65 Mining Large Unstructured Databases
- Classifying web documents using words or functions of words
- Problems:
- Very high dimensionality of the data sets
- Relatively small sets of labeled samples
- Approach: cluster the words from a small collection in the document space
66 CSE634 DATA CLUSTERING METHODS
Group 9
- Ravikanth Pulavarthy
- (ravi.ingr@gmail.com)
67 Sources and References
- Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber (Second Edition)
- Data Clustering, by A. K. Jain (Michigan State University), M. N. Murty (Indian Institute of Science), and P. J. Flynn (The Ohio State University)
- http://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf
- Parsing Images of Architectural Scenes, by A. Berg, M. Agrawala, and J. Malik
- http://www.cs.berkeley.edu/~asimma/294-fall06/projects/reports/grabler.pdf
68 What Defines an Object?
"I stand at the window and see a house, trees, sky. Theoretically I might say there were 327 brightnesses and nuances of colour. Do I have '327'? No. I have sky, house, and trees." --Max Wertheimer
69 Segmentation and Grouping
- To recognize objects, rather than dealing with too many pixels, we need a compact/summary representation
- Obtain this representation from an image, a motion sequence, or a set of tokens
- What is interesting and what is not depends on the application
70 Image Segmentation
- Segmentation: splitting an image into regions based on some criteria (intensity, color, texture, orientation energy, ...)
71 Segmentation Algorithms
- Simple Segmentation Algorithms
- Thresholding
- Segmentation by Clustering
- Agglomerative clustering
- Divisive clustering
- K-means
72 Thresholding
- Gray-level thresholding is the simplest segmentation process
- Simple thresholding: g(i, j) = 1 for f(i, j) ≥ T (object); g(i, j) = 0 otherwise (background)
- Multilevel thresholding: a separate label is assigned to each band of gray levels
73 Thresholding
- Thresholding is computationally inexpensive and fast
- Correct threshold selection is crucial for successful threshold segmentation (a toy sketch follows)
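A toy sketch of gray-level thresholding; the threshold T is fixed by hand here, whereas in practice it would be chosen from the image histogram:

```python
# Toy gray-level thresholding: 1 = object, 0 = background.
import numpy as np

def threshold(image, T):
    """Return a binary mask from a gray-level image array."""
    return (image >= T).astype(np.uint8)
```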
74 Thresholding: Example
75 Simple Clustering Methods
- Two natural algorithms:
- Agglomerative clustering (bottom-up)
- attach the closest point to the cluster it is closest to
- repeat
- Divisive clustering (top-down)
- split the cluster along the best boundary
- repeat
76 Agglomerative Methods
- Make each point a separate cluster
- Until the clustering is satisfactory:
- Merge the two clusters with the smallest inter-cluster distance
77 Divisive Methods
- Construct a single cluster containing all points
- Until the clustering is satisfactory:
- Split the cluster that yields the two components with the largest inter-cluster distance
78 Agglomerative Versus Divisive Clustering
- The user can specify the desired number of
clusters as a termination condition
79 Measures of Distance Used
- Min distance: d_min(Ci, Cj) = min_{p ∈ Ci, p' ∈ Cj} |p − p'|
- Nearest-neighbor clustering algorithm
- Max distance: d_max(Ci, Cj) = max_{p ∈ Ci, p' ∈ Cj} |p − p'|
- Farthest-neighbor clustering algorithm
- Mean distance: d_mean(Ci, Cj) = |mi − mj|, where mi is the mean for Ci
- Average distance: d_avg(Ci, Cj) = (1 / (ni nj)) Σ_{p ∈ Ci} Σ_{p' ∈ Cj} |p − p'| (these measures are sketched in code below)
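The four measures translate directly into short NumPy sketches (Ci and Cj are assumed to be arrays of points, one row per point):

```python
# The four inter-cluster distance measures from the slide above.
import numpy as np

def d_min(Ci, Cj):   # min distance (single linkage)
    return min(np.linalg.norm(p - q) for p in Ci for q in Cj)

def d_max(Ci, Cj):   # max distance (complete linkage)
    return max(np.linalg.norm(p - q) for p in Ci for q in Cj)

def d_mean(Ci, Cj):  # mean distance (centroid linkage)
    return np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))

def d_avg(Ci, Cj):   # average distance (average linkage)
    return sum(np.linalg.norm(p - q) for p in Ci for q in Cj) / (len(Ci) * len(Cj))
```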
80 Single Linkage
- The distance between clusters is based on the two points, one in each cluster, that are nearest to each other.
81 Complete Linkage Method
- The distance between clusters is based on the
points in each cluster that are farthest apart.
82 Centroid Linkage Method
- The distance between clusters is defined as the
distance between cluster centroids.
83 Average Linkage Method
- The distance between clusters is the average
distance between all pairs of observations.
84 Optimality
- Neither agglomerative clustering nor divisive clustering is optimal
- In other words, the set of centroids they give is not guaranteed to minimise distortion
85 Contd.
- For example:
- In agglomerative clustering, a dense cluster of data points will be combined into a single centroid
- But to minimise distortion, several centroids are needed in a region with many data points
- A single outlier may get its own cluster
- Agglomerative clustering provides a useful starting point, but further refinement is needed
86 K-means Clustering
- Choose a fixed number of clusters
- Choose cluster centers and point-cluster allocations to minimize error
87 K-means Algorithm
- Choose k data points to act as cluster centers
- Until the clustering is satisfactory:
- Assign each data point to the cluster that has the nearest cluster center
- Ensure each cluster has at least one data point (by splitting clusters, etc.)
- Replace the cluster centers with the means of the elements in the clusters (a sketch follows)
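A minimal k-means sketch following these steps; the empty-cluster handling by re-seeding is our simplification of the "splitting, etc." step:

```python
# Minimal k-means following the slide's steps.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # k data points as initial centers
    for _ in range(max_iter):
        # assign each point to the nearest cluster center
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j)
            else X[rng.integers(len(X))]                 # re-seed an empty cluster
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):            # converged
            break
        centers = new_centers
    return labels, centers
```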
88 Example
(Figure: k-means clustering of an image using intensity alone and color alone; clusters on intensity vs. clusters on color)
89 Conclusion
- The approaches for clustering high-dimensional spatial data have been addressed
- Some applications of data clustering in data mining and image segmentation were also discussed; these are vital because huge amounts of spatial data are obtained in real life from satellite images, medical equipment, geographic information systems (GIS), image database exploration, etc.
90 THANK YOU