Title: Clustering Algorithms
KI2 - 7
Clustering Algorithms
Johan Everts
Artificial Intelligence / RuG
What is Clustering?
Find K clusters (or a classification that
consists of K clusters) so that the objects of
one cluster are similar to each other whereas
objects of different clusters are dissimilar.
(Bacher 1996)
The Goals of Clustering
- Determine the intrinsic grouping in a set of unlabeled data.
- What constitutes a good clustering?
- All clustering algorithms will produce clusters, regardless of whether the data contains them.
- There is no gold standard; it depends on the goal:
  - data reduction
  - natural clusters
  - useful clusters
  - outlier detection
Stages in clustering
Taxonomy of Clustering Approaches
Hierarchical Clustering
- Agglomerative clustering treats each data point
as a singleton cluster, and then successively
merges clusters until all points have been merged
into a single remaining cluster. Divisive
clustering works the other way around.
Agglomerative Clustering: Single link
In single-link hierarchical clustering, we merge
in each step the two clusters whose two closest
members have the smallest distance.
Agglomerative Clustering: Complete link
In complete-link hierarchical clustering, we
merge in each step the two clusters whose merger
has the smallest diameter.
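The two merge criteria above can be compared side by side in code. This is a minimal sketch, not taken from the slides: the helper names `single_link_distance`, `merged_diameter` and `gap` are hypothetical, and the points are assumed to be plain numbers on a line.

```python
from itertools import combinations


def single_link_distance(a, b, dist):
    """Single link: distance between the two closest members of a and b."""
    return min(dist(x, y) for x in a for y in b)


def merged_diameter(a, b, dist):
    """Complete link: diameter (largest pairwise distance) of the merger of a and b."""
    merged = list(a) + list(b)
    return max(dist(x, y) for x, y in combinations(merged, 2))


def gap(x, y):
    """Illustrative distance for 1-D points (an assumption for this sketch)."""
    return abs(x - y)


cluster_a = [0, 1]
cluster_b = [3, 7]
print(single_link_distance(cluster_a, cluster_b, gap))  # 2 (between 1 and 3)
print(merged_diameter(cluster_a, cluster_b, gap))       # 7 (between 0 and 7)
```

In each agglomerative step, the pair of clusters with the smallest value of the chosen criterion would be merged.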
Example Single Link AC
       BA   FI   MI   NA   RM   TO
BA      0  662  877  255  412  996
FI    662    0  295  468  268  400
MI    877  295    0  754  564  138
NA    255  468  754    0  219  869
RM    412  268  564  219    0  669
TO    996  400  138  869  669    0
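Step 1: MI and TO are the closest pair (138), so they are merged. The distance from MI/TO to every other cluster is the smaller of the two old distances:
            BA     FI  MI/TO     NA     RM
BA           0    662    877    255    412
FI         662      0    295    468    268
MI/TO      877    295      0    754    564
NA         255    468    754      0    219
RM         412    268    564    219      0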
Step 2: NA and RM are now the closest pair (219) and are merged:
            BA     FI  MI/TO  NA/RM
BA           0    662    877    255
FI         662      0    295    268
MI/TO      877    295      0    564
NA/RM      255    268    564      0
Step 3: BA is closest to NA/RM (255) and joins it:
            BA/NA/RM        FI     MI/TO
BA/NA/RM           0       268       564
FI               268         0       295
MI/TO            564       295         0
Step 4: FI is closest to BA/NA/RM (268) and joins it:
               BA/FI/NA/RM        MI/TO
BA/FI/NA/RM              0          295
MI/TO                  295            0
Step 5: The two remaining clusters, BA/FI/NA/RM and MI/TO, are merged at distance 295. All points are now in a single cluster.
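The merge sequence of this example can be reproduced with a short script. This is a minimal, unoptimized sketch: the function name `single_link` and the way merged clusters are labelled are assumptions made for illustration, while the distance matrix is the one from the first table of the example.

```python
from itertools import combinations

CITIES = ["BA", "FI", "MI", "NA", "RM", "TO"]
DIST = [
    [0, 662, 877, 255, 412, 996],
    [662, 0, 295, 468, 268, 400],
    [877, 295, 0, 754, 564, 138],
    [255, 468, 754, 0, 219, 869],
    [412, 268, 564, 219, 0, 669],
    [996, 400, 138, 869, 669, 0],
]


def single_link(labels, dist):
    """Return the list of (merged cluster label, merge distance) in merge order."""
    clusters = [frozenset([i]) for i in range(len(labels))]  # singleton clusters
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Single link: distance between the closest members of a and b.
            d = min(dist[i][j] for i in a for j in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append(("/".join(sorted(labels[i] for i in a | b)), d))
    return merges


merges = single_link(CITIES, DIST)
for name, d in merges:
    print(name, d)
# MI/TO 138
# NA/RM 219
# BA/NA/RM 255
# BA/FI/NA/RM 268
# BA/FI/MI/NA/RM/TO 295
```

Recomputing every pairwise cluster distance in each round is wasteful but keeps the sketch close to the definition; updating the matrix with row-wise minima, as in the tables above, gives the same merges.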
Square error
- Step 0: Start with a random partition into K clusters.
- Step 1: Generate a new partition by assigning each pattern to its closest cluster center.
- Step 2: Compute new cluster centers as the centroids of the clusters.
- Step 3: Repeat steps 1 and 2 until there is no change in the membership (the cluster centers then also remain the same).
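The steps above can be sketched in plain Python as follows. The names `k_means` and `square_error`, the seeded random partition, and the rule that an emptied cluster keeps its previous center are assumptions of this sketch, not part of the slides. Because the starting point is a partition rather than a set of centers, the loop computes centroids (Step 2) before reassigning patterns (Step 1).

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(members):
    return tuple(sum(coords) / len(members) for coords in zip(*members))


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 0: random partition; cycling through labels keeps every cluster non-empty.
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    centers = [None] * k
    for _ in range(max_iter):
        # Step 2: cluster centers are the centroids of the current clusters.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # assumption: an emptied cluster keeps its previous center
                centers[c] = centroid(members)
        # Step 1: assign every pattern to its closest cluster center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Step 3: stop when the membership no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


def square_error(points, labels, centers):
    """Sum of squared distances from each pattern to its cluster center."""
    return sum(squared_distance(p, centers[l]) for p, l in zip(points, labels))


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = k_means(points, k=2)
print(labels, square_error(points, labels, centers))
```

On this toy data the two well-separated groups end up in different clusters; which group receives label 0 depends on the random partition.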
K-Means: How many Ks?
Locating the knee
The knee of a curve is defined as the point of
maximum curvature.
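One way to apply this definition to a sampled error curve is to estimate the curvature |y''| / (1 + y'^2)^(3/2) with finite differences. The sketch below makes several assumptions that are not in the slides: the K values are evenly spaced, both axes are first rescaled to [0, 1] (curvature depends on the units of the axes), and the error values are made-up numbers used purely for illustration.

```python
def knee_index(xs, ys):
    """Index of the interior point with the largest discrete curvature."""
    # Rescale both axes to [0, 1]; curvature is not invariant to axis units.
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)
    u = [(x - x0) / (x1 - x0) for x in xs]
    v = [(y - y0) / (y1 - y0) for y in ys]
    best_i, best_kappa = None, -1.0
    for i in range(1, len(u) - 1):
        h = u[i + 1] - u[i]  # assumes evenly spaced x values
        d1 = (v[i + 1] - v[i - 1]) / (2 * h)            # central first difference
        d2 = (v[i + 1] - 2 * v[i] + v[i - 1]) / h ** 2  # second difference
        kappa = abs(d2) / (1 + d1 ** 2) ** 1.5
        if kappa > best_kappa:
            best_i, best_kappa = i, kappa
    return best_i


ks = [1, 2, 3, 4, 5, 6]
errors = [100, 40, 10, 8, 7, 6.5]  # hypothetical square error for each K
print(ks[knee_index(ks, errors)])  # 3
```

Without the rescaling step the same data would point to a different K, so the result should be read as a heuristic rather than a definitive choice.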
Leader - Follower
- Online
- Specify threshold distance
- For each new instance, find the closest cluster center
- Distance above threshold (distance > threshold)? Create a new cluster
- Or else (distance < threshold), add the instance to the cluster and update the cluster center
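The procedure above can be sketched as a single pass over a stream of instances. The slides do not say how the cluster center is updated; this sketch assumes a running mean of the members, and the function name `leader_follower` is hypothetical.

```python
import math


def leader_follower(stream, threshold):
    """Cluster instances one at a time; returns (centers, labels)."""
    centers, counts, labels = [], [], []
    for x in stream:
        nearest, nearest_dist = None, math.inf
        # Find the closest cluster center (if any cluster exists yet).
        for j, center in enumerate(centers):
            d = math.dist(x, center)
            if d < nearest_dist:
                nearest, nearest_dist = j, d
        if nearest is None or nearest_dist > threshold:
            # Distance above threshold: the instance starts a new cluster.
            centers.append(tuple(x))
            counts.append(1)
            labels.append(len(centers) - 1)
        else:
            # Otherwise add the instance and move the center to the new mean
            # (running-mean update is an assumption of this sketch).
            counts[nearest] += 1
            n = counts[nearest]
            centers[nearest] = tuple(
                c + (xi - c) / n for c, xi in zip(centers[nearest], x)
            )
            labels.append(nearest)
    return centers, labels


stream = [(0, 0), (1, 0), (10, 10), (11, 10), (0, 1)]
centers, labels = leader_follower(stream, threshold=3.0)
print(labels)  # [0, 0, 1, 1, 0]
```

Because each instance is processed once and never revisited, the result depends on the order of the stream and on the chosen threshold.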
Kohonen SOMs
The Self-Organizing Map (SOM) is an unsupervised
artificial neural network algorithm. It is a
compromise between biological modeling and
statistical data processing.
Kohonen SOMs
- Each weight is representative of a certain input.
- Input patterns are shown to all neurons simultaneously.
- Competitive learning: the neuron with the largest response is chosen.
Kohonen SOMs
- Initialize weights
- Repeat until convergence
  - Select next input pattern
  - Find Best Matching Unit
  - Update weights of winner and neighbours
  - Decrease learning rate and neighbourhood size
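The training loop above might look as follows in plain Python. Everything specific here is an assumption of the sketch rather than something stated in the slides: the Best Matching Unit is taken to be the neuron whose weight vector is closest to the input in Euclidean distance, the neighbourhood is Gaussian over grid distance, the learning rate and neighbourhood size decay exponentially, and a fixed number of epochs stands in for a convergence test.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid position of the neuron whose weight vector is closest to x."""
    return min(weights, key=lambda node: math.dist(weights[node], x))


def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = max(rows, cols) / 2  # initial neighbourhood size (assumption)
    # Initialize weights randomly in [0, 1).
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data:  # select next input pattern
            # Decrease learning rate and neighbourhood size over time.
            decay = math.exp(-3.0 * step / total_steps)
            lr, sigma = lr0 * decay, sigma0 * decay
            bmu = best_matching_unit(weights, x)
            # Update the winner and its neighbours; the update shrinks with
            # grid distance from the winner.
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                influence = math.exp(-grid_d2 / (2 * sigma ** 2))
                for i in range(dim):
                    w[i] += lr * influence * (x[i] - w[i])
            step += 1
    return weights


# Usage: map a few RGB colours (values in [0, 1]) onto a 4 x 4 grid.
colours = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0)]
som = train_som(colours, rows=4, cols=4)
print(best_matching_unit(som, (1, 0, 0)))
```

Each update moves a weight vector part of the way toward the current input, so with inputs in [0, 1] the weights stay in that range.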
Kohonen SOMs
Distance related learning
Some nice illustrations
- Kohonen SOM Demo (from ai-junkie.com)
- mapping a 3D colorspace on a 2D Kohonen map
Performance Analysis
- K-Means
  - Depends a lot on a priori knowledge (K)
  - Very stable
- Leader Follower
  - Depends a lot on a priori knowledge (threshold)
  - Faster but unstable
- Self Organizing Map
  - Stability and convergence assured
  - Principle of self-ordering
  - Slow; many iterations needed for convergence
  - Computationally intensive
- No Free Lunch theorem
  - Any elevated performance over one class is exactly paid for in performance over another class.
- Ensemble clustering?
  - Use SOM and Basic Leader Follower to identify clusters and then use k-means clustering to
Any questions?