Title: CLUSTERING
1. CLUSTERING
2. Overview
- Definition of Clustering
- Existing clustering methods
- Clustering examples
3. Definition
- Clustering can be considered the most important unsupervised learning technique; like every other problem of this kind, it deals with finding a structure in a collection of unlabeled data.
- Clustering is the process of organizing objects into groups whose members are similar in some way.
- A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.
5. Why clustering?
- A few good reasons ...
- Simplifications
- Pattern detection
- Useful in data concept construction
- Unsupervised learning process
6. Where to use clustering?
- Data mining
- Information retrieval
- Text mining
- Web analysis
- Medical diagnostics
7. Major existing clustering methods
- Distance-based
- Hierarchical
- Partitioning
- Probabilistic
8. Measuring Similarity
- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, which is typically a metric d(i, j).
- There is a separate quality function that measures the goodness of a cluster.
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, and ratio variables.
- Weights should be associated with different variables based on the application and the data semantics.
- It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective. Two common distance functions are sketched below.
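As a quick illustration of such distance functions, here is a minimal Python sketch computing the Euclidean and Manhattan (city-block) distances; the three-attribute records are made up for illustration and echo the (Age, Height, Weight) example used later for K-Means:

```python
import math

def euclidean(p, q):
    # d(i, j) = square root of the sum of squared attribute differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # d(i, j) = sum of absolute attribute differences (city-block)
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical interval-scaled records: (age, height, weight)
john = (20, 170, 80)
henry = (30, 160, 120)
print(euclidean(john, henry))  # ~42.43
print(manhattan(john, henry))  # 60
```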
9. Hierarchical clustering
- Agglomerative (bottom-up)
- Start with each point as its own cluster (a singleton)
- Recursively merge the two (or more) most appropriate clusters
- Stop when the desired number of clusters, k, is reached
- Divisive (top-down)
- Start with one big cluster
- Recursively divide it into smaller clusters
- Stop when the desired number of clusters, k, is reached
10. General steps of hierarchical clustering
- Given a set of N items to be clustered and an N×N distance (or similarity) matrix, the basic process of hierarchical clustering (defined by S.C. Johnson in 1967) is this:
1. Start by assigning each item to a cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster less.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into K clusters. A sketch of this loop follows below.
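These steps translate almost directly into code. Below is a minimal sketch, assuming a precomputed N×N distance matrix and a pluggable linkage rule; the function and variable names are illustrative, not from the slides:

```python
def agglomerative(dist, K, linkage=min):
    """Johnson's basic agglomerative procedure.

    dist:    N x N symmetric distance matrix (list of lists).
    K:       desired number of clusters.
    linkage: rule combining distances when clusters merge
             (min -> single linkage, max -> complete linkage).
    """
    # Step 1: each item starts in its own cluster.
    clusters = [[i] for i in range(len(dist))]

    # Distance between two clusters under the chosen linkage rule.
    def d(c1, c2):
        return linkage(dist[i][j] for i in c1 for j in c2)

    while len(clusters) > K:
        # Step 2: find the closest pair of clusters and merge them.
        r, s = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda p: d(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[r] + clusters[s]
        # Step 3: distances to the new cluster are recomputed by d()
        # on the next pass, so we only rebuild the cluster list here.
        clusters = [c for k, c in enumerate(clusters) if k not in (r, s)]
        clusters.append(merged)
    return clusters
```

With `linkage=min` this is exactly the single-linkage rule described on the later slides; `max` would give complete linkage instead.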
11. Exclusive vs. non-exclusive clustering
- In the first case, data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster it cannot be included in another cluster. A simple example is the separation of points achieved by a straight line on a two-dimensional plane.
- On the contrary, the second type, overlapping clustering, uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership.
12. Partitioning clustering
- Divide the data into proper subsets.
- Recursively go through each subset and relocate points between clusters (as opposed to the visit-once approach of hierarchical clustering).
- This recursive relocation yields higher-quality clusters.
13. Probabilistic clustering
- Data are assumed to be drawn from a mixture of probability distributions.
- The mean and variance of each distribution serve as the parameters of a cluster.
- Each point ends up with a single cluster membership.
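A common concrete instance of this idea is a Gaussian mixture fitted by expectation-maximization. A minimal sketch, assuming scikit-learn is available (the data and parameters below are made up):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Made-up 2-D data drawn from two Gaussian components.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=1.0, size=(100, 2)),
    rng.normal(loc=(5, 5), scale=1.0, size=(100, 2)),
])

# Each cluster is parameterized by the mean and (co)variance
# of one mixture component.
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gm.predict(X)          # hard (single) cluster membership
print(gm.means_)                # per-cluster means
print(gm.covariances_.shape)    # per-cluster covariance matrices
```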
14. Single-Linkage Clustering (hierarchical)
- The N×N proximity matrix is D = [d(i,j)].
- The clusterings are assigned sequence numbers 0, 1, ..., (n-1).
- L(k) is the level of the kth clustering.
- A cluster with sequence number m is denoted (m).
- The proximity between clusters (r) and (s) is denoted d[(r),(s)].
15. The algorithm is composed of the following steps
1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
2. Find the least dissimilar pair of clusters in the current clustering, say pair (r), (s), according to d[(r),(s)] = min d[(i),(j)], where the minimum is taken over all pairs of clusters in the current clustering.
16. The algorithm is composed of the following steps (cont.)
3. Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form the next clustering m. Set the level of this clustering to L(m) = d[(r),(s)].
4. Update the proximity matrix D by deleting the rows and columns corresponding to clusters (r) and (s), and adding a row and column corresponding to the newly formed cluster. The proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined as d[(k),(r,s)] = min{ d[(k),(r)], d[(k),(s)] } (a sketch of this update follows after these steps).
5. If all objects are in one cluster, stop. Else, go to step 2.
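The matrix update in step 4 is the only part specific to single linkage. A small sketch of it (names illustrative):

```python
def update_proximity(D, r, s):
    """Single-linkage update: replace rows/columns r and s of the
    proximity matrix D with one row/column for the merged cluster,
    using d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)])."""
    keep = [k for k in range(len(D)) if k not in (r, s)]
    # Distances from every old cluster k to the new cluster (r,s).
    new_row = [min(D[k][r], D[k][s]) for k in keep]
    # Rebuild D over the kept clusters plus the merged one.
    D2 = [[D[i][j] for j in keep] for i in keep]
    for i, dist in enumerate(new_row):
        D2[i].append(dist)
    D2.append(new_row + [0.0])
    return D2
```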
17. Hierarchical clustering example
- Let's now see a simple example: a hierarchical clustering of distances in kilometers between some Italian cities. The method used is single-linkage.
- Input distance matrix (L = 0 for all the clusters); the matrix is reproduced in the code sketch after slide 23.
18.
- The nearest pair of cities is MI and TO, at distance 138. These are merged into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new sequence number is m = 1. Then we compute the distance from this new compound object to all other objects. In single-link clustering the rule is that the distance from the compound object to another object is equal to the shortest distance from any member of the cluster to the outside object. So the distance from "MI/TO" to RM is chosen to be 564, which is the distance from MI to RM, and so on.
19.
- After merging MI with TO we obtain the following matrix:
20.
- min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called NA/RM; L(NA/RM) = 219, m = 2
21.
- min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new cluster called BA/NA/RM; L(BA/NA/RM) = 255, m = 3
22.
- min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM; L(BA/FI/NA/RM) = 268, m = 4
23.
- Finally, we merge the last two clusters at level 295.
- The process is summarized by the following hierarchical tree (dendrogram); it can be reproduced with the sketch below.
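The whole example can be reproduced with SciPy's single-linkage routine. Only the merge levels 138, 219, 255, 268, 295 and the distance d(MI,RM) = 564 are stated on the slides; the remaining matrix entries below are assumed from the commonly used version of this exercise, so treat them as illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
# Pairwise distances in km; values assumed from the classic version
# of this exercise, rows/columns ordered to match `cities`.
D = np.array([
    [  0, 662, 877, 255, 412, 996],
    [662,   0, 295, 468, 268, 400],
    [877, 295,   0, 754, 564, 138],
    [255, 468, 754,   0, 219, 869],
    [412, 268, 564, 219,   0, 669],
    [996, 400, 138, 869, 669,   0],
])

# SciPy expects the condensed (upper-triangular) form.
Z = linkage(squareform(D), method="single")
# The third column holds the merge levels L(m): 138, 219, 255, 268, 295.
print(Z[:, 2])
```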
24. K-means algorithm
1. It accepts the number of clusters to group the data into, and the dataset to cluster, as input values.
2. It then creates the first K initial clusters (K = number of clusters needed) by choosing K rows of data randomly from the dataset. For example, if there are 10,000 rows in the dataset and 3 clusters need to be formed, then the first K = 3 initial clusters will be created by selecting 3 records randomly from the dataset as the initial clusters. Each of the 3 initial clusters formed will have just one row of data.
25.
3. The K-Means algorithm calculates the arithmetic mean of each cluster formed in the dataset. The arithmetic mean of a cluster is the mean of all the individual records in the cluster. In each of the first K initial clusters there is only one record, so the arithmetic mean of a cluster with one record is simply the set of values that make up that record. For example, if the dataset we are discussing is a set of Height, Weight and Age measurements for students at a university, where a record P in the dataset S is represented as P = (Age, Height, Weight), then a record containing the measurements of a student John would be represented as John = (20, 170, 80), where John's Age = 20 years, Height = 170 centimeters, and Weight = 80 pounds. Since there is only one record in each initial cluster, the arithmetic mean of a cluster with only the record for John as a member is (20, 170, 80).
26.
4. Next, K-Means assigns each record in the dataset to only one of the initial clusters. Each record is assigned to the nearest cluster (the cluster it is most similar to) using a measure of distance or similarity such as the Euclidean distance or the Manhattan/city-block distance.
5. K-Means re-assigns each record in the dataset to the most similar cluster and re-calculates the arithmetic mean of all the clusters in the dataset. The arithmetic mean of a cluster is the arithmetic mean of all the records in that cluster. For example, if a cluster contains two records, John = (20, 170, 80) and Henry = (30, 160, 120), then the arithmetic mean Pmean is represented as Pmean = (Agemean, Heightmean, Weightmean), with Agemean = (20 + 30)/2 = 25, Heightmean = (170 + 160)/2 = 165, and Weightmean = (80 + 120)/2 = 100. The arithmetic mean of this cluster is therefore (25, 165, 100). This new arithmetic mean becomes the center of this new cluster. Following the same procedure, new cluster centers are formed for all the existing clusters.
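The mean computation above is just a per-attribute average; a quick check of the worked arithmetic (names illustrative, using NumPy):

```python
import numpy as np

john = np.array([20, 170, 80])    # (age, height, weight)
henry = np.array([30, 160, 120])
# New cluster center = per-attribute arithmetic mean.
center = np.mean([john, henry], axis=0)
print(center)  # [ 25. 165. 100.]
```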
27.
6. K-Means re-assigns each record in the dataset to only one of the new clusters formed. A record or data point is assigned to the nearest cluster (the cluster it is most similar to) using a measure of distance or similarity.
7. The preceding steps are repeated until stable clusters are formed and the K-Means clustering procedure is completed. Stable clusters are formed when new iterations of the K-Means algorithm do not create new clusters, i.e. the cluster center (arithmetic mean) of each cluster is the same as the old cluster center. There are different techniques for determining when a stable cluster is formed or when the K-Means procedure is completed. A complete sketch of these steps follows below.
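Putting steps 1-7 together, here is a minimal K-Means sketch in plain NumPy, assuming Euclidean distance and the "centers stopped moving" stopping rule described above; the extra data rows are made up for illustration:

```python
import numpy as np

def k_means(data, k, rng=np.random.default_rng(0)):
    # Steps 1-2: pick k random rows of the dataset as initial centers.
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    while True:
        # Steps 4/6: assign each record to the nearest center
        # (Euclidean distance).
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Steps 3/5: recompute each center as the arithmetic mean
        # of the records assigned to it (keep old center if empty).
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 7: stable clusters -> centers no longer change.
        if np.allclose(new_centers, centers):
            return labels, new_centers
        centers = new_centers

# The two records from the slides plus two made-up ones.
data = np.array([[20, 170, 80], [30, 160, 120], [22, 175, 85], [28, 158, 115]])
labels, centers = k_means(data, k=2)
print(labels, centers, sep="\n")
```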