Title: Clustering
1. Clustering
- Dr. János Abonyi
- University of Veszprém
- abonyij_at_fmt.vein.hu
- www.fmt.vein.hu/softcomp/dw
- www.fmt.vein.hu/ai_phd
2. What is Cluster Analysis?
- Cluster: a collection of data objects
  - Similar to one another within the same cluster
  - Dissimilar to the objects in other clusters
- Cluster analysis: grouping a set of data objects into clusters
- Clustering is unsupervised classification: no predefined classes
- Typical applications
  - As a stand-alone tool to get insight into the data distribution
  - As a preprocessing step for other algorithms
3. Input Data for Clustering
- A set of N points in an M-dimensional space, OR
- A proximity matrix that gives the pairwise distance or similarity between points.
- Can be viewed as a weighted graph.
4. Measures of Similarity
- The first step in clustering raw data is to define some measure of similarity between two data items
- That is, we need to know when two data items are close enough to be considered members of the same class
- Different measures may produce entirely different clusters, so the measure selected must reflect the nature of the data
5. Similarity and Dissimilarity Between Objects
- Distances are normally used to measure the similarity or dissimilarity between two data objects
- One popular family is the Minkowski distance:
  d(i,j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)
- where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer
- If q = 1, d is the Manhattan distance
6. Similarity and Dissimilarity Between Objects (Cont.)
- If q = 2, d is the Euclidean distance
- Properties
  - d(i,j) ≥ 0
  - d(i,i) = 0
  - d(i,j) = d(j,i)
  - d(i,j) ≤ d(i,k) + d(k,j)
- One can also use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.
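As a quick illustration (not part of the original slides), the Minkowski distance above can be computed directly from its definition; q = 1 gives the Manhattan distance and q = 2 the Euclidean distance. A minimal Python sketch, assuming NumPy is available:

```python
import numpy as np

def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional points (q=1: Manhattan, q=2: Euclidean)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

i = [1.0, 2.0, 3.0]
j = [4.0, 0.0, 3.0]
print(minkowski(i, j, q=1))  # Manhattan distance: 5.0
print(minkowski(i, j, q=2))  # Euclidean distance: ~3.61
```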
7. Dissimilarity Between Binary Variables
8. Binary Variables
- A contingency table for binary data: for objects i and j, let a = number of variables where both are 1, b = where i is 1 and j is 0, c = where i is 0 and j is 1, d = where both are 0
- Simple matching coefficient (dissimilarity): d(i,j) = (b + c) / (a + b + c + d)
- Jaccard coefficient (dissimilarity): d(i,j) = (b + c) / (a + b + c)
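Both coefficients are easy to compute from the contingency-table counts a, b, c, d above; a small illustrative sketch (not from the slides):

```python
import numpy as np

def binary_dissimilarities(i, j):
    """Simple matching and Jaccard dissimilarities between two binary vectors."""
    i, j = np.asarray(i, dtype=bool), np.asarray(j, dtype=bool)
    a = np.sum(i & j)      # both 1
    b = np.sum(i & ~j)     # 1 in i, 0 in j
    c = np.sum(~i & j)     # 0 in i, 1 in j
    d = np.sum(~i & ~j)    # both 0
    simple_matching = (b + c) / (a + b + c + d)
    jaccard = (b + c) / (a + b + c) if (a + b + c) > 0 else 0.0
    return simple_matching, jaccard

print(binary_dissimilarities([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))  # (0.4, 0.5)
```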
9. Similarity Coefficients
10. Nominal Variables
- A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
- Method 1: simple matching
  - d(i,j) = (p - m) / p, where m = number of matches and p = total number of variables
- Method 2: use a large number of binary variables
  - create a new binary variable for each of the M nominal states
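Both methods can be sketched in a few lines (an illustrative example, not from the slides): Method 1 computes d(i,j) = (p - m)/p directly, and Method 2 replaces a nominal variable by one binary indicator per state:

```python
def nominal_dissimilarity(i, j):
    """Method 1: simple matching dissimilarity (p - m) / p for nominal vectors."""
    p = len(i)
    m = sum(a == b for a, b in zip(i, j))   # number of matching variables
    return (p - m) / p

def one_hot(value, states):
    """Method 2: one binary variable per nominal state."""
    return [1 if value == s else 0 for s in states]

print(nominal_dissimilarity(["red", "round"], ["blue", "round"]))   # 0.5
print(one_hot("red", ["red", "yellow", "blue", "green"]))           # [1, 0, 0, 0]
```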
11. Major Clustering Approaches
- Partitioning algorithms: construct various partitions and then evaluate them by some criterion
- Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
- Density-based: based on connectivity and density functions
- Grid-based: based on a multiple-level granularity structure
- Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given models
12. Types of Clustering: Partitional and Hierarchical
- Partitional clustering (K-means and K-medoid): finds a one-level partitioning of the data into K disjoint groups.
- Hierarchical clustering: finds a hierarchy of nested clusters (a dendrogram).
  - May proceed either bottom-up (agglomerative) or top-down (divisive).
  - Uses a proximity matrix.
  - Can be viewed as operating on a proximity graph.
13. Hierarchical Clustering
- Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it does need a termination condition.
14. AGNES (Agglomerative Nesting)
- Implemented in statistical analysis packages
- Uses the single-link method and the dissimilarity matrix
- Merges the nodes that have the least dissimilarity
- Eventually all nodes belong to the same cluster
15. A Dendrogram Shows How the Clusters Are Merged Hierarchically
Decompose the data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram. A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
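As a hedged sketch of this workflow (assuming SciPy is available), one can build the dendrogram with agglomerative linkage and then cut it at the desired level:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two visually separated groups (illustrative values only)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])

# Agglomerative (AGNES-style) clustering with the single-link criterion
Z = linkage(X, method="single")

# Cut the dendrogram so that two clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]
```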
16. Cluster Similarity: MIN or Single Link
- The similarity of two clusters is based on the two most similar (closest) points in the different clusters (see the sketch below).
- Determined by one pair of points, i.e., by one link in the proximity graph.
- Can handle non-elliptical shapes.
- Sensitive to noise and outliers.
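The MIN criterion itself is simply the distance between the two closest points of the two clusters; a minimal sketch, assuming SciPy's cdist:

```python
import numpy as np
from scipy.spatial.distance import cdist

def single_link_distance(A, B):
    """MIN / single link: distance between the two closest points of clusters A and B."""
    return cdist(A, B).min()

A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [4.0, 0.0]])
print(single_link_distance(A, B))   # 3.0
```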
18. Hierarchical Clustering: Problems and Limitations
- Once a decision is made to combine two clusters, it cannot be undone.
- Does not scale well.
- No objective function is directly minimized.
- Different schemes have problems with one or more of the following:
  - Sensitivity to noise and outliers.
  - Difficulty handling clusters of different sizes and convex shapes.
  - Breaking large clusters.
19. Partitioning Algorithms: Basic Concept
- Partitioning method: construct a partition of a database D of n objects into a set of k clusters
- Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimum: exhaustively enumerate all partitions
  - Heuristic methods: the k-means and k-medoids algorithms
  - k-means (MacQueen, 1967): each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster
20. Clustering Heuristic
- Our objective is to look for k representative points for the clusters.
- These points will be the cluster centers or means; they may not be part of the data set.
- This gives rise to the famous k-means algorithm.
21. K-means Algorithm
- Given a set of d-dimensional points, find k points (centers) c_1, ..., c_k which minimize
  SSE = Σ_{j=1..k} Σ_{x ∈ P_j} ||x − c_j||²
- where the sets P_j are disjoint and their union covers the entire data set.
22. K-means Algorithm
- Notice that once the k centers are picked, they give rise to a natural partition of the entire data set: namely, associate each data point with its nearest center.
23. K-means Clustering
- Find a single partition of the data into K clusters such that the within-cluster (squared) error is minimized.
- Basic K-means algorithm (sketched in code below):
- 1. Select K points as the initial centroids.
- 2. Assign all points to the closest centroid.
- 3. Recompute the centroids.
- 4. Repeat steps 2 and 3 until the centroids don't change.
- K-means is a gradient-descent algorithm that always converges, perhaps only to a local minimum. (Cluster Analysis for Applications, Anderberg)
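The four steps above translate almost directly into code. A minimal NumPy sketch (illustrative only, not the exact implementation referenced in the slides):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means: select K initial centroids, assign points, recompute, repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]        # 1. initial centroids
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                                # 2. assign to closest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0)       # 3. recompute cluster means
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):                    # 4. stop when centroids are stable
            break
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(4.0, 0.5, (20, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)
```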
24. Example
25. Example II.
Initial Data and Seeds
Final Clustering
26. Example III.
Initial Data and Seeds
Final Clustering
27. K-means: Initial Point Selection
- A bad set of initial points gives a poor solution.
- Random selection
  - Simple and efficient.
  - The initial points may not cover all clusters with high probability.
  - Many runs may be needed to obtain an optimal solution.
- Choose the initial points from dense regions, so that the points are well-separated.
28. K-means: How to Update Centroids
- Depends on the exact error criterion used.
- If trying to minimize the squared error, Σ_j Σ_{x ∈ P_j} ||x − c_j||², then the new centroid is the mean of the points in a cluster.
- If trying to minimize the sum of (Manhattan / L1) distances, then the new centroid is the coordinate-wise median of the points in a cluster.
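A small numerical illustration of the difference (not from the slides): with an outlier in the cluster, the mean and the coordinate-wise median give different centroids:

```python
import numpy as np

cluster = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])   # one outlying point

centroid_sse = cluster.mean(axis=0)        # minimizes the summed squared (Euclidean) error
centroid_l1 = np.median(cluster, axis=0)   # minimizes the summed Manhattan (L1) error

print(centroid_sse)   # [3.667 0.   ] - pulled toward the outlier
print(centroid_l1)    # [1. 0.]       - robust to the outlier
```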
29. K-means: When to Update Centroids
- Update the centroids only after all points are assigned to centers, or
- Update the centroids after each individual point assignment.
  - One may adjust the relative weight of the point being added and the current center to speed convergence.
  - Possibility of better accuracy and faster convergence, at the cost of more work.
  - The update issues are similar to those of updating weights for neural nets using back-propagation. (Artificial Intelligence, Winston)
30. K-means: Pre- and Post-Processing
- Outliers can dominate the clustering and, in some cases, are eliminated by preprocessing.
- Post-processing attempts to "fix up" the clustering produced by the K-means algorithm:
  - Merge clusters that are close to each other.
  - Split "loose" clusters that contribute most to the error.
  - Permanently eliminate small clusters, since they may represent groups of outliers.
- These approaches are based on heuristics and require the user to choose parameter values.
31. K-means: Time and Space Requirements
- O(MN) space, since it uses just the data vectors, not the proximity matrix.
  - M is the number of attributes.
  - N is the number of points.
  - We also keep track of which cluster each point belongs to and of the K cluster centers.
- Time for basic K-means is O(TKMN),
  - where T is the number of iterations. (T is often small, 5-10, and can easily be bounded, as few changes occur after the first few iterations.)
32. Example 1: Relay Stations for Mobile Phones
- Optimal placement of relay stations = optimal k-clustering!
- Complications:
  - points correspond to phones
  - positions are not fixed
  - the number of patterns is not fixed
  - how to choose k?
  - the distance function is complicated: a 3D geographic model with mountains and buildings, shadowing, ...
33. Example 2: Placement of Warehouses for Goods
- points correspond to customer locations
- centroids correspond to locations of warehouses
- the distance function is the delivery time from the warehouse multiplied by the number of trips, i.e., related to the volume of delivered goods
- multilevel clustering, e.g., for post offices, train companies, airlines (which airports to choose as hubs), etc.
34. K-means: Determining the Number of Clusters
- Mostly heuristic and domain-dependent approaches.
- Plot the error for 2, 3, ... clusters and find the "knee" in the curve (see the sketch below).
- Use domain-specific knowledge and inspect the clusters for the desired characteristics.
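A sketch of the "knee" heuristic, assuming scikit-learn is available for the repeated K-means runs:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with three well-separated groups (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.4, (30, 2)) for m in (0.0, 3.0, 6.0)])

# Print the within-cluster error (SSE) for k = 2, 3, ...; the "knee" suggests k = 3
for k in range(2, 8):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(sse, 1))
```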
35. K-means: Problems and Limitations
- Based on minimizing the within-cluster error, a criterion that is not appropriate for many situations.
- Unsuitable when clusters have widely different sizes or non-convex shapes.
- Restricted to data in Euclidean spaces, but variants of K-means can be used for other types of data.
36. Feature Extraction
- (Nonlinear) mapping of the input space into a lower-dimensional one
- Reduction of the number of inputs
- Useful for visualisation
- Non-parametric (Sammon projection) or model-based (principal curves, NN, Gaussian mixtures, SOM)
37. The Brain's Self-Organization
The brain maps the external multidimensional representation of the world into a similar 1- or 2-dimensional internal representation. That is, the brain processes the external signals in a topology-preserving way. Mimicking the way the brain learns, our system should be able to do the same thing.
38. Sensorimotor Map
Visual signals are analyzed by maps coupled with motor maps, providing sensorimotor responses.
Figure from P. S. Churchland, T. J. Sejnowski, The Computational Brain. MIT Press, 1992.
39. Somatosensory and Motor Maps
40. Representation of Fingers
(Figure labels: Hand, Face)
41. Models of Self-Organization
- SOM or SOFM (Self-Organizing Feature Map), one of the simplest models.
How can such maps develop spontaneously? Local neural connections: neurons interact strongly with those nearby, but weakly with those that are far away (in addition inhibiting some intermediate neurons).
History: von der Malsburg and Willshaw (1976), competitive learning, Hebb mechanisms, "Mexican hat" interactions, models of visual systems. Amari (1980): models of continuous neural tissue. Kohonen (1981): simplification, no inhibition; leaving two essential factors, competition and cooperation.
42. Self-Organizing Map: Idea
Data vectors X^T = (X_1, ..., X_d) from a d-dimensional space. A grid of nodes, with a local processor (called a neuron) in each node. Local processor j has d adaptive parameters W(j). Goal: change the W(j) parameters to recover the data clusters in X space.
43. SOM Algorithm: Competition
- Nodes calculate the similarity of the input data to their parameters.
- The input vector X is compared to the node parameters W.
- Similarity: minimal distance or maximal scalar product.
- Competition: find the node c with W most similar to X.
The node c most similar to the input vector X is the winner, and it will learn to be more similar to X; hence this is a competitive learning procedure. Brain: those neurons that react to some signals pick them up and learn.
44. SOM Algorithm: Cooperation
Cooperation: nodes on the grid close to the winner c should behave similarly. Define the neighborhood function O(c), typically of Gaussian form:
  h(r, r_c, t) = h_0(t) · exp(−||r − r_c||² / σ_c(t)²)
- t = iteration number (or time)
- r_c = position of the winning node c (in physical space, usually 2D)
- ||r − r_c|| = distance from the winning node, scaled by σ_c(t)
- h_0(t) = slowly decreasing multiplicative factor
The neighborhood function determines how strongly the parameters of the winning node and of the nodes in its neighborhood will be changed, making them more similar to the data X.
45. SOM Algorithm: Dynamics
Adaptation rule: take the winner node c and those in its neighborhood O(r_c), and change their parameters to make them more similar to the data X, i.e.
  W(i)(t+1) = W(i)(t) + h(r_i, r_c, t) · (X − W(i)(t))
- Select randomly a new sample vector X, and repeat.
- Decrease h_0(t) slowly until there are no more changes.
- Result:
  - the W(i) point to the centers of local clusters in the X feature space
  - nodes in the neighborhood point to adjacent areas in X space
46. SOM Algorithm
- X^T = (X_1, X_2, ..., X_d), samples from the feature space.
- Create a grid with nodes i = 1 .. K in 1D, 2D or 3D, each node with a d-dimensional vector W(i)^T = (W_1(i), W_2(i), ..., W_d(i)), W(i) = W(i)(t), changing with discrete time t.
- Initialize: random small W(i)(0) for all i = 1...K. Define the parameters of the neighborhood function h(|r_i − r_c|/σ(t), t).
- Iterate: select randomly an input vector X.
- Calculate the distances d(X, W(i)) and find the winner node W(c), the one most similar (closest) to X.
- Update the weights of all neurons in the neighborhood O(r_c).
- Decrease the influence h_0(t) and shrink the neighborhood σ(t).
- If in the last T steps all W(i) changed by less than ε, stop. (A code sketch of this loop follows below.)
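The whole loop fits in a few dozen lines of NumPy. The following is a simplified sketch (linear decay of h_0(t) and σ(t), Gaussian neighborhood, fixed iteration count instead of the ε-based stopping rule); all parameter values are illustrative assumptions, not those used in the lecture:

```python
import numpy as np

def train_som(X, grid=(10, 10), n_iter=2000, h0=0.5, sigma0=3.0, seed=0):
    """Minimal SOM: competition (find the winner) + cooperation (update its neighborhood)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = rng.normal(0.0, 0.01, size=(rows, cols, X.shape[1]))     # small random W(i)(0)
    # physical (grid) coordinates r_i of every node
    r = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for t in range(n_iter):
        x = X[rng.integers(len(X))]                              # random input vector X
        d = np.linalg.norm(W - x, axis=2)                        # competition: distances d(X, W(i))
        c = np.unravel_index(d.argmin(), d.shape)                # winner node c
        frac = t / n_iter
        h = h0 * (1.0 - frac)                                    # decreasing influence h0(t)
        sigma = sigma0 * (1.0 - frac) + 0.5                      # shrinking neighborhood sigma(t)
        dist2 = np.sum((r - np.array(c)) ** 2, axis=2)
        theta = np.exp(-dist2 / (2.0 * sigma ** 2))              # Gaussian neighborhood O(c)
        W += h * theta[..., None] * (x - W)                      # cooperation: move toward X
    return W

X = np.random.default_rng(1).random((500, 3))   # e.g. 3-D data mapped onto a 2-D grid
W = train_som(X)
print(W.shape)   # (10, 10, 3): one code vector per grid node
```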
47. 1D Network, 2D Data
Position in the feature space
Processors in 1D array
48. 2D Network, 3D Data
49. Training Process
Java demos: http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/DemoGNG/GNG.html
50. 2D → 2D, Square
Initially all W ≈ 0, but over time they learn to point to adjacent positions.
51. 2D → 1D in a Triangle
The line in the data space forms a Peano curve,
an example of a fractal.
52. Map Distortions
Initial distortions may slowly disappear or may get "frozen".
53. Italian Olive Oil
An example of a SOM application:
- 572 samples of olive oil were collected from 9 Italian provinces.
- The content of 8 fats was determined for each oil.
- SOM: a 20 x 20 network,
- mapping 8D → 2D.
- Classification accuracy was around 95-97%.
Note that the topographical relations are preserved; region 3 is the most diverse.
54. Similarity of Faces
300 faces; the similarity matrix was evaluated and Sammon mapping applied (from Klock & Buhmann, 1997).
55. Some Examples of Real-Life Applications
- The Helsinki University of Technology web site http://www.cis.hut.fi/research/refs/ has a list of > 5000 papers on SOM and its applications!
- Brain research: modeling of the formation of various topographical maps in motor, auditory, visual and somatotopic areas.
- AI and robotics: analysis of data from sensors, control of robot movement (motor maps), spatial orientation maps.
- Information retrieval and text categorization.
- Clustering of genes, protein properties, chemical compounds, speech phonemes, sounds of birds and insects, astronomical objects, economic data, business and financial data...
- Data compression (images and audio), information filtering.
- Medical and technical diagnostics.
56. More Examples
- Natural language processing: linguistic analysis, parsing, learning languages, hyphenation patterns.
- Optimization: configuration of telephone connections, VLSI design, time series prediction, scheduling algorithms.
- Signal processing: adaptive filters, real-time signal analysis, radar, sonar, seismic, USG, EKG, EEG and other medical signals...
- Image recognition and processing: segmentation, object recognition, texture recognition...
- Content-based retrieval: examples are WebSOM, Cartia, Visier.
- PicSOM: similarity-based image retrieval.
- http://www.ntu.edu.sg/home/aswduch/CI.html#SOM
57. Quality of Life Data
- World Bank data (1992), 39 quality-of-life indicators.
- SOM map, with the same colors used on the world map.
- More examples of business applications at http://www.eudaptics.com/
58. Semantic Maps
- How to capture the meaning of words and semantic relations?
- 16 animals: pigeon, chicken, duck, goose, owl, hawk, eagle, fox, dog, wolf, cat, tiger, lion, horse, zebra, cow.
- Use 13 binary features: is small, medium, large; has 2 legs, 4 legs, hair, hoofs, mane, feathers; hunts, runs, flies, swims.
- Form 76 sentences that describe the 16 animals using the 13 features: horse runs, horse has 4 legs, horse is big, ... eagle flies, fox hunts, ...
- Assign a vector of properties to each animal:
  V(horse) = (small=0, medium=0, large=1, has 2 legs=0, 4 legs=1, ...) = (0,0,1,0,1,1,1,1,0,0,1,0,0)
- Map these 13D vectors into 2D.
59. Semantic Maps: MDS and SOM
60. SOM Software
- A number of free programs for SOM have been written.
- The best visualization is offered by the free Viscovery viewer: http://www.eudaptics.com/. It can be used with the free SOM_PAK software from http://www.cis.hut.fi/research/som_lvq_pak.shtml
61. Concept of the SOM I.
(Figure: input space / input layer and reduced feature space / map layer, with example variables Ba, Mn, Sr and nodes s1, s2)
Cluster centers (code vectors) are placed in the reduced space: clustering and ordering of the cluster centers on a two-dimensional grid.
62. Concept of the SOM II.
We can use it for visualization, for classification, and for clustering.
(Figure: component planes labelled Ba, Mn, Sr, Mg, SA3)
63. World Map of Clinkers I.
We can use it for visualization and for correlation hunting.
64. World Map of Clinkers II.
65. Density-Based Clustering Methods
- Clustering based on density (a local cluster criterion), such as density-connected points
- Major features:
  - Discover clusters of arbitrary shape
  - Handle noise
  - One scan
  - Need density parameters as a termination condition
- Several interesting studies (a DBSCAN sketch follows below):
  - DBSCAN: Ester et al. (KDD'96)
  - OPTICS: Ankerst et al. (SIGMOD'99)
  - DENCLUE: Hinneburg & Keim (KDD'98)
  - CLIQUE: Agrawal et al. (SIGMOD'98)
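A brief example of the density-based approach using scikit-learn's DBSCAN (assumed available); eps and min_samples are the density parameters the user must supply:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus a few scattered noise points (illustrative data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(5.0, 0.3, (50, 2)),
               rng.uniform(-2.0, 8.0, (10, 2))])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))   # cluster ids; -1 marks points treated as noise
```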
66. Requirements of Clustering in Data Mining
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- High dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability
67. Clustering Summary
- Clustering is an old and multidisciplinary area.
- New challenges are related to new or newly important kinds of data:
  - Noisy
  - Large
  - High-dimensional
  - New kinds of similarity measures (non-metric)
  - Clusters of variable size and density
  - Arbitrary cluster shapes (non-globular)
  - Many and mixed attribute types (temporal, continuous, categorical)
- New data mining approaches and algorithms are being developed that may be more suitable for these problems.