Title: Clustering
1. Clustering
- Dr. János Abonyi
- University of Veszprém
- abonyij_at_fmt.vein.hu
- www.fmt.vein.hu/softcomp/dw
- www.fmt.vein.hu/ai_phd
2. What is Cluster Analysis?
- Cluster: a collection of data objects
  - Similar to one another within the same cluster
  - Dissimilar to the objects in other clusters
- Cluster analysis: grouping a set of data objects into clusters
- Clustering is unsupervised classification: no predefined classes
- Typical applications
  - As a stand-alone tool to get insight into the data distribution
  - As a preprocessing step for other algorithms
3. Input Data for Clustering
- A set of N points in an M-dimensional space, OR
- A proximity matrix that gives the pairwise distance or similarity between points.
- Can be viewed as a weighted graph.
4. Measures of Similarity
- The first step in clustering raw data is to define some measure of similarity between two data items
- That is, we need to know when two data items are close enough to be considered members of the same class
- Different measures may produce entirely different clusters, so the measure selected must reflect the nature of the data
5. Similarity and Dissimilarity Between Objects
- Distances are normally used to measure the similarity or dissimilarity between two data objects
- One popular family is the Minkowski distance:
  d(i,j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)
- where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer
- If q = 1, d is the Manhattan distance
6. Similarity and Dissimilarity Between Objects (Cont.)
- If q = 2, d is the Euclidean distance
- Properties
  - d(i,j) ≥ 0
  - d(i,i) = 0
  - d(i,j) = d(j,i)
  - d(i,j) ≤ d(i,k) + d(k,j)
- One can also use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.
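As a quick illustration (not part of the original slides), the Minkowski distance above can be computed directly from its definition; q = 1 gives the Manhattan distance and q = 2 the Euclidean distance. A minimal Python sketch, assuming NumPy is available:

```python
import numpy as np

def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional points (q=1: Manhattan, q=2: Euclidean)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

i = [1.0, 2.0, 3.0]
j = [4.0, 0.0, 3.0]
print(minkowski(i, j, q=1))  # Manhattan distance: 5.0
print(minkowski(i, j, q=2))  # Euclidean distance: ~3.61
```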
7. Dissimilarity Between Binary Variables
8. Binary Variables
- A contingency table for binary data: for objects i and j, let a = number of variables where both are 1, b = where i is 1 and j is 0, c = where i is 0 and j is 1, d = where both are 0
- Simple matching coefficient (dissimilarity): d(i,j) = (b + c) / (a + b + c + d)
- Jaccard coefficient (dissimilarity): d(i,j) = (b + c) / (a + b + c)
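Both coefficients are easy to compute from the contingency-table counts a, b, c, d above; a small illustrative sketch (not from the slides):

```python
import numpy as np

def binary_dissimilarities(i, j):
    """Simple matching and Jaccard dissimilarities between two binary vectors."""
    i, j = np.asarray(i, dtype=bool), np.asarray(j, dtype=bool)
    a = np.sum(i & j)      # both 1
    b = np.sum(i & ~j)     # 1 in i, 0 in j
    c = np.sum(~i & j)     # 0 in i, 1 in j
    d = np.sum(~i & ~j)    # both 0
    simple_matching = (b + c) / (a + b + c + d)
    jaccard = (b + c) / (a + b + c) if (a + b + c) > 0 else 0.0
    return simple_matching, jaccard

print(binary_dissimilarities([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))  # (0.4, 0.5)
```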
9. Similarity Coefficients
10. Nominal Variables
- A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
- Method 1: simple matching
  - d(i,j) = (p - m) / p, where m = number of matches and p = total number of variables
- Method 2: use a large number of binary variables
  - create a new binary variable for each of the M nominal states
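Both methods can be sketched in a few lines (an illustrative example, not from the slides): Method 1 computes d(i,j) = (p - m)/p directly, and Method 2 replaces a nominal variable by one binary indicator per state:

```python
def nominal_dissimilarity(i, j):
    """Method 1: simple matching dissimilarity (p - m) / p for nominal vectors."""
    p = len(i)
    m = sum(a == b for a, b in zip(i, j))   # number of matching variables
    return (p - m) / p

def one_hot(value, states):
    """Method 2: one binary variable per nominal state."""
    return [1 if value == s else 0 for s in states]

print(nominal_dissimilarity(["red", "round"], ["blue", "round"]))   # 0.5
print(one_hot("red", ["red", "yellow", "blue", "green"]))           # [1, 0, 0, 0]
```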
11. Major Clustering Approaches
- Partitioning algorithms: construct various partitions and then evaluate them by some criterion
- Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
- Density-based: based on connectivity and density functions
- Grid-based: based on a multiple-level granularity structure
- Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given models
12. Types of Clustering: Partitional and Hierarchical
- Partitional clustering (K-means and K-medoid): finds a one-level partitioning of the data into K disjoint groups.
- Hierarchical clustering: finds a hierarchy of nested clusters (a dendrogram).
  - May proceed either bottom-up (agglomerative) or top-down (divisive).
  - Uses a proximity matrix.
  - Can be viewed as operating on a proximity graph.
13. Hierarchical Clustering
- Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it does need a termination condition.
14. AGNES (Agglomerative Nesting)
- Implemented in statistical analysis packages
- Uses the single-link method and the dissimilarity matrix
- Merges the nodes that have the least dissimilarity
- Eventually all nodes belong to the same cluster
15. A Dendrogram Shows How the Clusters Are Merged Hierarchically
Decompose the data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram. A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
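As a hedged sketch of this workflow (assuming SciPy is available), one can build the dendrogram with agglomerative linkage and then cut it at the desired level:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two visually separated groups (illustrative values only)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])

# Agglomerative (AGNES-style) clustering with the single-link criterion
Z = linkage(X, method="single")

# Cut the dendrogram so that two clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]
```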
16. Cluster Similarity: MIN or Single Link
- The similarity of two clusters is based on the two most similar (closest) points in the different clusters (see the sketch below).
- Determined by one pair of points, i.e., by one link in the proximity graph.
- Can handle non-elliptical shapes.
- Sensitive to noise and outliers.
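The MIN criterion itself is simply the distance between the two closest points of the two clusters; a minimal sketch, assuming SciPy's cdist:

```python
import numpy as np
from scipy.spatial.distance import cdist

def single_link_distance(A, B):
    """MIN / single link: distance between the two closest points of clusters A and B."""
    return cdist(A, B).min()

A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [4.0, 0.0]])
print(single_link_distance(A, B))   # 3.0
```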
18. Hierarchical Clustering: Problems and Limitations
- Once a decision is made to combine two clusters, it cannot be undone.
- Does not scale well.
- No objective function is directly minimized.
- Different schemes have problems with one or more of the following:
  - Sensitivity to noise and outliers.
  - Difficulty handling clusters of different sizes and convex shapes.
  - Breaking large clusters.
19. Partitioning Algorithms: Basic Concept
- Partitioning method: construct a partition of a database D of n objects into a set of k clusters
- Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimum: exhaustively enumerate all partitions
  - Heuristic methods: the k-means and k-medoids algorithms
  - k-means (MacQueen, 1967): each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster
20. Clustering Heuristic
- Our objective is to look for k representative points for the clusters.
- These points will be the cluster centers or means; they may not be part of the data set.
- This gives rise to the famous k-means algorithm.
21. K-means Algorithm
- Given a set of d-dimensional points, find k points (centers) c_1, ..., c_k which minimize
  SSE = Σ_{j=1..k} Σ_{x ∈ P_j} ||x − c_j||²
- where the sets P_j are disjoint and their union covers the entire data set.
22. K-means Algorithm
- Notice that once the k centers are picked, they give rise to a natural partition of the entire data set: namely, associate each data point with its nearest center.
23. K-means Clustering
- Find a single partition of the data into K clusters such that the within-cluster (squared) error is minimized.
- Basic K-means algorithm (sketched in code below):
- 1. Select K points as the initial centroids.
- 2. Assign all points to the closest centroid.
- 3. Recompute the centroids.
- 4. Repeat steps 2 and 3 until the centroids don't change.
- K-means is a gradient-descent algorithm that always converges, perhaps only to a local minimum. (Cluster Analysis for Applications, Anderberg)
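The four steps above translate almost directly into code. A minimal NumPy sketch (illustrative only, not the exact implementation referenced in the slides):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means: select K initial centroids, assign points, recompute, repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]        # 1. initial centroids
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                                # 2. assign to closest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0)       # 3. recompute cluster means
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):                    # 4. stop when centroids are stable
            break
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(4.0, 0.5, (20, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)
```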
24. Example
25. Example II.
Initial Data and Seeds
Final Clustering
26. Example III.
Initial Data and Seeds
Final Clustering
27. K-means: Initial Point Selection
- A bad set of initial points gives a poor solution.
- Random selection
  - Simple and efficient.
  - The initial points may not cover all clusters with high probability.
  - Many runs may be needed to obtain an optimal solution.
- Choose the initial points from dense regions, so that the points are well-separated.
28. K-means: How to Update Centroids
- Depends on the exact error criterion used.
- If trying to minimize the squared error, Σ_j Σ_{x ∈ P_j} ||x − c_j||², then the new centroid is the mean of the points in a cluster.
- If trying to minimize the sum of (Manhattan / L1) distances, then the new centroid is the coordinate-wise median of the points in a cluster.
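A small numerical illustration of the difference (not from the slides): with an outlier in the cluster, the mean and the coordinate-wise median give different centroids:

```python
import numpy as np

cluster = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])   # one outlying point

centroid_sse = cluster.mean(axis=0)        # minimizes the summed squared (Euclidean) error
centroid_l1 = np.median(cluster, axis=0)   # minimizes the summed Manhattan (L1) error

print(centroid_sse)   # [3.667 0.   ] - pulled toward the outlier
print(centroid_l1)    # [1. 0.]       - robust to the outlier
```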
29. K-means: When to Update Centroids
- Update the centroids only after all points are assigned to centers, or
- Update the centroids after each individual point assignment.
  - One may adjust the relative weight of the point being added and the current center to speed convergence.
  - Possibility of better accuracy and faster convergence, at the cost of more work.
  - The update issues are similar to those of updating weights for neural nets using back-propagation. (Artificial Intelligence, Winston)
30. K-means: Pre- and Post-Processing
- Outliers can dominate the clustering and, in some cases, are eliminated by preprocessing.
- Post-processing attempts to "fix up" the clustering produced by the K-means algorithm:
  - Merge clusters that are close to each other.
  - Split "loose" clusters that contribute most to the error.
  - Permanently eliminate small clusters, since they may represent groups of outliers.
- These approaches are based on heuristics and require the user to choose parameter values.
31. K-means: Time and Space Requirements
- O(MN) space, since it uses just the data vectors, not the proximity matrix.
  - M is the number of attributes.
  - N is the number of points.
  - We also keep track of which cluster each point belongs to and of the K cluster centers.
- Time for basic K-means is O(TKMN),
  - where T is the number of iterations. (T is often small, 5-10, and can easily be bounded, as few changes occur after the first few iterations.)
32. Example 1: Relay Stations for Mobile Phones
- Optimal placement of relay stations = optimal k-clustering!
- Complications:
  - points correspond to phones
  - positions are not fixed
  - the number of patterns is not fixed
  - how to choose k?
  - the distance function is complicated: a 3D geographic model with mountains and buildings, shadowing, ...
33. Example 2: Placement of Warehouses for Goods
- points correspond to customer locations
- centroids correspond to locations of warehouses
- the distance function is the delivery time from the warehouse multiplied by the number of trips, i.e., related to the volume of delivered goods
- multilevel clustering, e.g., for post offices, train companies, airlines (which airports to choose as hubs), etc.
34. K-means: Determining the Number of Clusters
- Mostly heuristic and domain-dependent approaches.
- Plot the error for 2, 3, ... clusters and find the "knee" in the curve (see the sketch below).
- Use domain-specific knowledge and inspect the clusters for the desired characteristics.
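A sketch of the "knee" heuristic, assuming scikit-learn is available for the repeated K-means runs:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with three well-separated groups (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.4, (30, 2)) for m in (0.0, 3.0, 6.0)])

# Print the within-cluster error (SSE) for k = 2, 3, ...; the "knee" suggests k = 3
for k in range(2, 8):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(sse, 1))
```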
35. K-means: Problems and Limitations
- Based on minimizing the within-cluster error, a criterion that is not appropriate for many situations.
- Unsuitable when clusters have widely different sizes or non-convex shapes.
- Restricted to data in Euclidean spaces, but variants of K-means can be used for other types of data.
36. Feature Extraction
- (Nonlinear) mapping of the input space into a lower-dimensional one
- Reduction of the number of inputs
- Useful for visualisation
- Non-parametric (Sammon projection) or model-based (principal curves, NN, Gaussian mixtures, SOM)
37. The Brain's Self-Organization
The brain maps the external multidimensional representation of the world into a similar 1- or 2-dimensional internal representation. That is, the brain processes the external signals in a topology-preserving way. Mimicking the way the brain learns, our system should be able to do the same thing.
38. Sensorimotor Map
Visual signals are analyzed by maps coupled with motor maps, providing sensorimotor responses.
Figure from P. S. Churchland, T. J. Sejnowski, The Computational Brain. MIT Press, 1992.
39. Somatosensory and Motor Maps
40. Representation of Fingers
(Figure labels: Hand, Face)
41. Models of Self-Organization
- SOM or SOFM (Self-Organizing Feature Map), one of the simplest models.
How can such maps develop spontaneously? Local neural connections: neurons interact strongly with those nearby, but weakly with those that are far away (in addition inhibiting some intermediate neurons).
History: von der Malsburg and Willshaw (1976), competitive learning, Hebb mechanisms, "Mexican hat" interactions, models of visual systems. Amari (1980): models of continuous neural tissue. Kohonen (1981): simplification, no inhibition; leaving two essential factors, competition and cooperation.
42. Self-Organizing Map: Idea
Data vectors X^T = (X_1, ..., X_d) from a d-dimensional space. A grid of nodes, with a local processor (called a neuron) in each node. Local processor j has d adaptive parameters W(j). Goal: change the W(j) parameters to recover the data clusters in X space.
43. SOM Algorithm: Competition
- Nodes calculate the similarity of the input data to their parameters.
- The input vector X is compared to the node parameters W.
- Similarity: minimal distance or maximal scalar product.
- Competition: find the node c with W most similar to X.
The node c most similar to the input vector X is the winner, and it will learn to be more similar to X; hence this is a competitive learning procedure. Brain: those neurons that react to some signals pick them up and learn.
44. SOM Algorithm: Cooperation
Cooperation: nodes on the grid close to the winner c should behave similarly. Define the neighborhood function O(c), typically of Gaussian form:
  h(r, r_c, t) = h_0(t) · exp(−||r − r_c||² / σ_c(t)²)
- t = iteration number (or time)
- r_c = position of the winning node c (in physical space, usually 2D)
- ||r − r_c|| = distance from the winning node, scaled by σ_c(t)
- h_0(t) = slowly decreasing multiplicative factor
The neighborhood function determines how strongly the parameters of the winning node and of the nodes in its neighborhood will be changed, making them more similar to the data X.
45. SOM Algorithm: Dynamics
Adaptation rule: take the winner node c and those in its neighborhood O(r_c), and change their parameters to make them more similar to the data X, i.e.
  W(i)(t+1) = W(i)(t) + h(r_i, r_c, t) · (X − W(i)(t))
- Select randomly a new sample vector X, and repeat.
- Decrease h_0(t) slowly until there are no more changes.
- Result:
  - the W(i) point to the centers of local clusters in the X feature space
  - nodes in the neighborhood point to adjacent areas in X space
46. SOM Algorithm
- X^T = (X_1, X_2, ..., X_d), samples from the feature space.
- Create a grid with nodes i = 1 .. K in 1D, 2D or 3D, each node with a d-dimensional vector W(i)^T = (W_1(i), W_2(i), ..., W_d(i)), W(i) = W(i)(t), changing with discrete time t.
- Initialize: random small W(i)(0) for all i = 1...K. Define the parameters of the neighborhood function h(|r_i − r_c|/σ(t), t).
- Iterate: select randomly an input vector X.
- Calculate the distances d(X, W(i)) and find the winner node W(c), the one most similar (closest) to X.
- Update the weights of all neurons in the neighborhood O(r_c).
- Decrease the influence h_0(t) and shrink the neighborhood σ(t).
- If in the last T steps all W(i) changed by less than ε, stop. (A code sketch of this loop follows below.)
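The whole loop fits in a few dozen lines of NumPy. The following is a simplified sketch (linear decay of h_0(t) and σ(t), Gaussian neighborhood, fixed iteration count instead of the ε-based stopping rule); all parameter values are illustrative assumptions, not those used in the lecture:

```python
import numpy as np

def train_som(X, grid=(10, 10), n_iter=2000, h0=0.5, sigma0=3.0, seed=0):
    """Minimal SOM: competition (find the winner) + cooperation (update its neighborhood)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = rng.normal(0.0, 0.01, size=(rows, cols, X.shape[1]))     # small random W(i)(0)
    # physical (grid) coordinates r_i of every node
    r = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for t in range(n_iter):
        x = X[rng.integers(len(X))]                              # random input vector X
        d = np.linalg.norm(W - x, axis=2)                        # competition: distances d(X, W(i))
        c = np.unravel_index(d.argmin(), d.shape)                # winner node c
        frac = t / n_iter
        h = h0 * (1.0 - frac)                                    # decreasing influence h0(t)
        sigma = sigma0 * (1.0 - frac) + 0.5                      # shrinking neighborhood sigma(t)
        dist2 = np.sum((r - np.array(c)) ** 2, axis=2)
        theta = np.exp(-dist2 / (2.0 * sigma ** 2))              # Gaussian neighborhood O(c)
        W += h * theta[..., None] * (x - W)                      # cooperation: move toward X
    return W

X = np.random.default_rng(1).random((500, 3))   # e.g. 3-D data mapped onto a 2-D grid
W = train_som(X)
print(W.shape)   # (10, 10, 3): one code vector per grid node
```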
47. 1D Network, 2D Data
Position in the feature space
Processors in 1D array
48. 2D Network, 3D Data
49. Training Process
Java demos: http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/DemoGNG/GNG.html
50. 2D → 2D, Square
Initially all W ≈ 0, but over time they learn to point to adjacent positions.
51. 2D → 1D in a Triangle
The line in the data space forms a Peano curve,
an example of a fractal.
52. Map Distortions
Initial distortions may slowly disappear or may get "frozen".
53. Italian Olive Oil
An example of a SOM application:
- 572 samples of olive oil were collected from 9 Italian provinces.
- The content of 8 fats was determined for each oil.
- SOM: a 20 x 20 network,
- mapping 8D → 2D.
- Classification accuracy was around 95-97%.
Note that the topographical relations are preserved; region 3 is the most diverse.
54. Similarity of Faces
300 faces; the similarity matrix was evaluated and Sammon mapping applied (from Klock & Buhmann, 1997).
55. Some Examples of Real-Life Applications
- The Helsinki University of Technology web site http://www.cis.hut.fi/research/refs/ has a list of > 5000 papers on SOM and its applications!
- Brain research: modeling of the formation of various topographical maps in motor, auditory, visual and somatotopic areas.
- AI and robotics: analysis of data from sensors, control of robot movement (motor maps), spatial orientation maps.
- Information retrieval and text categorization.
- Clustering of genes, protein properties, chemical compounds, speech phonemes, sounds of birds and insects, astronomical objects, economic data, business and financial data...
- Data compression (images and audio), information filtering.
- Medical and technical diagnostics.
56. More Examples
- Natural language processing: linguistic analysis, parsing, learning languages, hyphenation patterns.
- Optimization: configuration of telephone connections, VLSI design, time series prediction, scheduling algorithms.
- Signal processing: adaptive filters, real-time signal analysis, radar, sonar, seismic, USG, EKG, EEG and other medical signals...
- Image recognition and processing: segmentation, object recognition, texture recognition...
- Content-based retrieval: examples are WebSOM, Cartia, Visier.
- PicSOM: similarity-based image retrieval.
- http://www.ntu.edu.sg/home/aswduch/CI.html#SOM
57. Quality of Life Data
- World Bank data (1992), 39 quality-of-life indicators.
- SOM map, with the same colors used on the world map.
- More examples of business applications at http://www.eudaptics.com/
58. Semantic Maps
- How to capture the meaning of words and semantic relations?
- 16 animals: pigeon, chicken, duck, goose, owl, hawk, eagle, fox, dog, wolf, cat, tiger, lion, horse, zebra, cow.
- Use 13 binary features: is small, medium, large; has 2 legs, 4 legs, hair, hoofs, mane, feathers; hunts, runs, flies, swims.
- Form 76 sentences that describe the 16 animals using the 13 features: horse runs, horse has 4 legs, horse is big, ... eagle flies, fox hunts, ...
- Assign a vector of properties to each animal:
  V(horse) = (small=0, medium=0, large=1, has 2 legs=0, 4 legs=1, ...) = (0,0,1,0,1,1,1,1,0,0,1,0,0)
- Map these 13D vectors into 2D.
59. Semantic Maps: MDS and SOM
60. SOM Software
- A number of free programs for SOM have been written.
- The best visualization is offered by the free Viscovery viewer: http://www.eudaptics.com/. It can be used with the free SOM_PAK software from http://www.cis.hut.fi/research/som_lvq_pak.shtml
61. Concept of the SOM I.
(Figure: input space / input layer and reduced feature space / map layer, with example variables Ba, Mn, Sr and nodes s1, s2)
Cluster centers (code vectors) are placed in the reduced space: clustering and ordering of the cluster centers on a two-dimensional grid.
62. Concept of the SOM II.
We can use it for visualization, for classification, and for clustering.
(Figure: component planes labelled Ba, Mn, Sr, Mg, SA3)
63. World Map of Clinkers I.
We can use it for visualization and for correlation hunting.
64. World Map of Clinkers II.
65. Density-Based Clustering Methods
- Clustering based on density (a local cluster criterion), such as density-connected points
- Major features:
  - Discover clusters of arbitrary shape
  - Handle noise
  - One scan
  - Need density parameters as a termination condition
- Several interesting studies (a DBSCAN sketch follows below):
  - DBSCAN: Ester et al. (KDD'96)
  - OPTICS: Ankerst et al. (SIGMOD'99)
  - DENCLUE: Hinneburg & Keim (KDD'98)
  - CLIQUE: Agrawal et al. (SIGMOD'98)
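A brief example of the density-based approach using scikit-learn's DBSCAN (assumed available); eps and min_samples are the density parameters the user must supply:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus a few scattered noise points (illustrative data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(5.0, 0.3, (50, 2)),
               rng.uniform(-2.0, 8.0, (10, 2))])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))   # cluster ids; -1 marks points treated as noise
```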
66. Requirements of Clustering in Data Mining
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- High dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability
67. Clustering Summary
- Clustering is an old and multidisciplinary area.
- New challenges are related to new or newly important kinds of data:
  - Noisy
  - Large
  - High-dimensional
  - New kinds of similarity measures (non-metric)
  - Clusters of variable size and density
  - Arbitrary cluster shapes (non-globular)
  - Many and mixed attribute types (temporal, continuous, categorical)
- New data mining approaches and algorithms are being developed that may be more suitable for these problems.