Title: K-medoid-style Clustering Algorithms for Supervised Summary Generation
Slide 1: K-medoid-style Clustering Algorithms for Supervised Summary Generation
- Nidal Zeidat, Christoph F. Eick
- Dept. of Computer Science
- University of Houston
Slide 2: Talk Outline
- What is Supervised Clustering?
- Representative-based Clustering Algorithms
- Benefits of Supervised Clustering
- Algorithms for Supervised Clustering
- Empirical Results
- Conclusion and Areas of Future Work
Slide 3: 1. (Traditional) Clustering
- Partition a set of objects into groups of similar objects. Each group is called a cluster.
- Clustering is used to detect classes in a data set (unsupervised learning).
- Clustering is based on a fitness function that relies on a distance measure and usually tries to minimize the distance between objects within a cluster.
Slide 4: (Traditional) Clustering (continued)
[Figure: example dataset in Attribute1/Attribute2 space partitioned into three clusters A, B, and C]
Slide 5: Supervised Clustering
- Assumes that clustering is applied to classified examples.
- The goal of supervised clustering is to identify class-uniform clusters that have a high probability density; it prefers clusters whose members belong to a single class (low impurity).
- We would also like to keep the number of clusters low.
Slide 6: Supervised Clustering (continued)
[Figure: the same dataset in Attribute 1/Attribute 2 space clustered two ways: Traditional Clustering vs. Supervised Clustering]
Slide 7: A Fitness Function for Supervised Clustering
- q(X) = Impurity(X) + β * Penalty(k)
  where:
  k = number of clusters used
  n = number of examples in the dataset
  c = number of classes in the dataset
  β = weight for Penalty(k), 0 < β ≤ 2.0
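To make the fitness function concrete, here is a minimal Python sketch. The impurity term is computed as the fraction of minority examples (examples whose class is not the majority class of their cluster); note that the slide does not spell out Penalty(k), so the form sqrt((k - c)/n) for k > c (and 0 otherwise) used below is an assumption:

```python
import math
from collections import Counter, defaultdict

def impurity(assignments, labels):
    """Fraction of minority examples: examples whose class is not the
    majority class of the cluster they were assigned to."""
    clusters = defaultdict(list)
    for cluster_id, label in zip(assignments, labels):
        clusters[cluster_id].append(label)
    minority = sum(len(ys) - Counter(ys).most_common(1)[0][1]
                   for ys in clusters.values())
    return minority / len(labels)

def q(assignments, labels, c, beta=0.1):
    """q(X) = Impurity(X) + beta * Penalty(k), with k clusters and
    c classes over n examples.  The Penalty(k) form used here is an
    assumption: it charges for using more clusters than classes."""
    n = len(labels)
    k = len(set(assignments))
    penalty = math.sqrt((k - c) / n) if k > c else 0.0
    return impurity(assignments, labels) + beta * penalty
```

For two pure clusters over two classes, q is 0; every minority example raises the impurity term, and every cluster beyond c raises the penalty term.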
Slide 8: 2. Representative-Based Supervised Clustering (RSC)
- Aims at finding a set of objects (called representatives) among all objects in the data set that best represent the objects in the data set. Each representative corresponds to a cluster.
- The remaining objects in the data set are then clustered around these representatives by assigning each object to the cluster of the closest representative.
- Remark: The popular k-medoid algorithm, also called PAM, is a representative-based clustering algorithm.
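The assignment step of RSC (clustering the remaining objects around the chosen representatives) can be sketched as follows; `dist` is any dissimilarity measure, which is what lets representative-based methods handle nominal attributes:

```python
def assign_to_representatives(objects, rep_indices, dist):
    """Assign every object to the cluster of its closest representative.
    Returns, for each object, the index of the representative whose
    cluster it joins (ties go to the first representative listed)."""
    return [min(rep_indices, key=lambda r: dist(obj, objects[r]))
            for obj in objects]
```

For example, with 1-D objects [0.0, 1.0, 10.0, 11.0] and representatives at indices 0 and 2, the first two objects cluster around index 0 and the last two around index 2.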
Slide 9: Representative-Based Supervised Clustering (continued)
[Figure: dataset in Attribute1/Attribute2 space]
Slide 10: Representative-Based Supervised Clustering (continued)
[Figure: the dataset partitioned into four clusters (1-4) around chosen representatives]
Slide 11: Representative-Based Supervised Clustering (continued)
[Figure: the dataset partitioned into four clusters (1-4) around chosen representatives]
Objective of RSC: Find a subset O_R of O such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X).
Slide 12: Why Do We Use Representative-Based Clustering Algorithms?
- Representatives themselves are useful:
  - can be used for summarization
  - can be used for dataset compression
- Smaller search space compared with algorithms such as k-means.
- Less sensitive to outliers.
- Can be applied to datasets that contain nominal attributes (for which it is not feasible to compute means).
Slide 13: 3. Applications of Supervised Clustering
- Enhance classification algorithms:
  - Use supervised clustering for dataset editing to enhance NN-classifiers [ICDM04]
  - Improve simple classifiers [ICDM03]
- Learn sub-classes / summary generation
- Distance function learning
- Dataset compression/reduction
- Measuring the difficulty of a classification task
Slide 14: Representative-Based Supervised Clustering for Dataset Editing
[Figure: (a) dataset clustered using supervised clustering into clusters A-F; (b) dataset edited using the cluster representatives]
Slide 15: Representative-Based Supervised Clustering for Enhancing Simple Classifiers
[Figure: dataset in Attribute1/Attribute2 space]
Slide 16: Representative-Based Supervised Clustering for Learning Sub-classes
[Figure: Ford and GMC examples in Attribute1/Attribute2 space, with clusters revealing sub-classes such as Ford Trucks, Ford Vans, GMC Trucks, and GMC Vans]
Slide 17: 4. Clustering Algorithms Currently Investigated
- Partitioning Around Medoids (PAM), a traditional clustering algorithm.
- Supervised Partitioning Around Medoids (SPAM).
- Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR).
- Top-Down Splitting Algorithm (TDS).
- Supervised Clustering using Evolutionary Computing (SCEC).
Slide 18: Algorithm SRIDHCR
Slide 19: Example SRIDHCR Run

Trials in the first part (add a non-medoid):

Set of medoids after adding one non-medoid | q(X)
{8, 42, 62, 148} (initial solution)        | 0.086
{8, 42, 62, 148, 1}                        | 0.091
{8, 42, 62, 148, 2}                        | 0.091
...                                        | ...
{8, 42, 62, 148, 52}                       | 0.065
...                                        | ...
{8, 42, 62, 148, 150}                      | 0.0715

Trials in the second part (drop a medoid):

Set of medoids after removing a medoid | q(X)
{42, 62, 148}                          | 0.086
{8, 62, 148}                           | 0.073
{8, 42, 148}                           | 0.313
{8, 42, 62}                            | 0.333
Run | Set of medoids producing lowest q(X) in the run | q(X)  | Purity
0   | {8, 42, 62, 148} (initial solution)             | 0.086 | 0.947
1   | {8, 42, 62, 148, 52}                            | 0.065 | 0.947
2   | {8, 42, 62, 148, 52, 122}                       | 0.041 | 0.973
3   | {42, 62, 148, 52, 122, 117}                     | 0.030 | 0.987
4   | {8, 62, 148, 52, 122, 117}                      | 0.021 | 0.993
5   | {8, 62, 148, 52, 122, 117, 87}                  | 0.016 | 1.000
6   | {8, 62, 52, 122, 117, 87}                       | 0.014 | 1.000
7   | {8, 62, 122, 117, 87}                           | 0.012 | 1.000
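The search traced in the tables above (evaluate every insertion of a non-medoid and every deletion of a medoid, take the best neighbor while q improves) can be sketched as one steepest-descent run. Here `q` is any function scoring a set of representative indices, lower being better; the randomized-restart wrapper that launches r such runs from fresh initial solutions is omitted:

```python
import random

def sridhcr_once(n, q, init_size=4, seed=None):
    """One SRIDHCR-style run over n objects: from a random initial set
    of representatives, evaluate every single insertion of a
    non-representative and every single deletion of a representative,
    move to the best neighbor while that improves q, and stop at a
    local minimum."""
    rng = random.Random(seed)
    reps = frozenset(rng.sample(range(n), init_size))
    best = q(reps)
    while True:
        candidates = [reps | {i} for i in range(n) if i not in reps]
        if len(reps) > 1:
            candidates += [reps - {i} for i in reps]
        neighbor = min(candidates, key=q)
        if q(neighbor) < best:
            reps, best = neighbor, q(neighbor)
        else:
            return reps, best
```

With a toy scorer such as `len`, the run simply deletes down to a single representative; with the real q(X) it trades impurity against the cluster-count penalty, so the number of representatives can grow or shrink between iterations, as in the trace above.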
Slide 20: Algorithm SPAM
Slide 21: Differences between SPAM and SRIDHCR
- SPAM tries to improve the current solution by replacing a representative with a non-representative, whereas SRIDHCR improves the current solution by removing a representative or by inserting a non-representative.
- SPAM is run keeping the number of clusters k fixed, whereas SRIDHCR searches for a good value of k, therefore exploring a larger solution space. However, in the case of SRIDHCR, which choices for k are good is somewhat restricted by the selection of the parameter β.
- SRIDHCR is run r times starting from a random initial solution; SPAM is only run once.
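A minimal sketch of SPAM's swap search under these rules (k stays fixed; only representative-for-non-representative replacements are considered) might look like this; `q` is any function scoring a set of representative indices, lower being better:

```python
import random

def best_swap(reps, n, q):
    """Return the best neighbor reachable by swapping one
    representative for one non-representative, and its score."""
    best, best_q = reps, q(reps)
    for r in reps:
        for o in range(n):
            if o not in reps:
                cand = (reps - {r}) | {o}
                if q(cand) < best_q:
                    best, best_q = cand, q(cand)
    return best, best_q

def spam(n, k, q, seed=None):
    """SPAM-style search over n objects: start from k random
    representatives and keep applying the best improving swap; stop at
    a swap-local minimum.  The cluster count k never changes."""
    rng = random.Random(seed)
    reps = frozenset(rng.sample(range(n), k))
    cur = q(reps)
    while True:
        nxt, nxt_q = best_swap(reps, n, q)
        if nxt_q < cur:
            reps, cur = nxt, nxt_q
        else:
            return reps, cur
```

Because only swaps are available, the search never leaves the space of size-k solutions; SRIDHCR's insert/delete moves can pass through solutions of other sizes on the way to a better one.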
Slide 22: 5. Performance Measures for the Experimental Evaluation
The investigated algorithms were evaluated based on the following performance measures:
- Cluster purity (majority percentage).
- Value of the fitness function q(X).
- Average dissimilarity between all objects and their representatives (cluster tightness).
- Wall-clock time (WCT): the actual time, in seconds, that the algorithm took to finish the clustering task.
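The first and third measures can be sketched directly (purity as the majority-class fraction per cluster, tightness as the average object-to-representative dissimilarity); the function names below are ours:

```python
from collections import Counter, defaultdict

def purity(assignments, labels):
    """Fraction of examples that belong to the majority class of their
    cluster (higher is better)."""
    groups = defaultdict(list)
    for cluster_id, label in zip(assignments, labels):
        groups[cluster_id].append(label)
    hits = sum(Counter(ys).most_common(1)[0][1] for ys in groups.values())
    return hits / len(labels)

def tightness(objects, assignments, representatives, dist):
    """Average dissimilarity between each object and the representative
    of its cluster; `representatives` maps cluster id -> representative
    object (lower is tighter)."""
    return sum(dist(obj, representatives[cid])
               for obj, cid in zip(objects, assignments)) / len(objects)
```

Note the tension the tables below make visible: supervised algorithms optimize purity and q(X), so their tightness is usually worse than PAM's, which optimizes tightness alone.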
Slide 23:

Algorithm | Purity | q(X)   | Tightness(X)
Iris-Plants data set, clusters = 3
PAM       | 0.907  | 0.0933 | 0.081
SRIDHCR   | 0.981  | 0.0200 | 0.093
SPAM      | 0.973  | 0.0267 | 0.133
Vehicle data set, clusters = 65
PAM       | 0.701  | 0.326  | 0.044
SRIDHCR   | 0.835  | 0.192  | 0.072
SPAM      | 0.764  | 0.263  | 0.097
Image-Segment data set, clusters = 53
PAM       | 0.880  | 0.135  | 0.027
SRIDHCR   | 0.980  | 0.035  | 0.050
SPAM      | 0.944  | 0.071  | 0.061
Pima-Indian Diabetes data set, clusters = 45
PAM       | 0.763  | 0.237  | 0.056
SRIDHCR   | 0.859  | 0.164  | 0.093
SPAM      | 0.822  | 0.202  | 0.086

Table 4: Traditional vs. Supervised Clustering (β = 0.1)
Slide 24:

Algorithm | q(X)   | Purity | Tightness(X) | WCT (sec)
IRIS-Flowers dataset, clusters = 3
PAM       | 0.0933 | 0.907  | 0.081 | 0.06
SRIDHCR   | 0.0200 | 0.980  | 0.093 | 11.00
SPAM      | 0.0267 | 0.973  | 0.133 | 0.32
Vehicle dataset, clusters = 65
PAM       | 0.326  | 0.701  | 0.044 | 372.00
SRIDHCR   | 0.192  | 0.835  | 0.072 | 1715.00
SPAM      | 0.263  | 0.764  | 0.097 | 1090.00
Segmentation dataset, clusters = 53
PAM       | 0.135  | 0.880  | 0.027 | 4073.00
SRIDHCR   | 0.035  | 0.980  | 0.050 | 11250.00
SPAM      | 0.071  | 0.944  | 0.061 | 1422.00
Pima-Indians-Diabetes dataset, clusters = 45
PAM       | 0.237  | 0.763  | 0.056 | 186.00
SRIDHCR   | 0.164  | 0.859  | 0.093 | 660.00
SPAM      | 0.202  | 0.822  | 0.086 | 58.00

Table 5: Comparative Performance of the Different Algorithms (β = 0.1)
Slide 25:

Algorithm | Avg. Purity | Tightness(X) | Avg. WCT (sec)
IRIS-Flowers dataset, clusters = 3
PAM       | 0.907 | 0.081 | 0.06
SRIDHCR   | 0.959 | 0.104 | 0.18
SPAM      | 0.973 | 0.133 | 0.33
Vehicle dataset, clusters = 56
PAM       | 0.681 | 0.046 | 505.00
SRIDHCR   | 0.762 | 0.081 | 22.58
SPAM      | 0.754 | 0.100 | 681.00
Segmentation dataset, clusters = 32
PAM       | 0.875 | 0.032 | 1529.00
SRIDHCR   | 0.946 | 0.054 | 169.39
SPAM      | 0.940 | 0.065 | 1053.00
Pima-Indians-Diabetes dataset, clusters = 2
PAM       | 0.656 | 0.104 | 0.97
SRIDHCR   | 0.795 | 0.109 | 5.08
SPAM      | 0.772 | 0.125 | 2.70

Table 6: Average Comparative Performance of the Different Algorithms (β = 0.4)
Slide 26: Why is SRIDHCR Performing So Much Better than SPAM?
- SPAM is relatively slow compared with a single run of SRIDHCR, allowing for 5-30 restarts of SRIDHCR using the same resources. This enables SRIDHCR to conduct a more balanced exploration of the search space.
- The fitness landscape induced by q(X) contains many plateau-like structures (q(X1) = q(X2)) and many local minima, and SPAM seems to get stuck more easily.
- The fact that SPAM uses a fixed k-value does not seem beneficial for finding good solutions. For example, SRIDHCR might explore {u1, u2, u3, u4} → {u1, u2, u3, u4, v1, v2} → {u3, u4, v1, v2}, whereas SPAM might terminate with the sub-optimal solution {u1, u2, u3, u4} if neither the replacement of u1 by v1 nor the replacement of u2 by v2 enhances q(X).
Slide 27:

Dataset      | k  | β       | Ties using q(X) | Ties using Tightness(X)
Iris-Plants  | 10 | 0.00001 | 5.8   | 0.0004
Iris-Plants  | 10 | 0.4     | 5.7   | 0.0004
Iris-Plants  | 50 | 0.00001 | 20.5  | 0.0019
Iris-Plants  | 50 | 0.4     | 20.9  | 0.0018
Vehicle      | 10 | 0.00001 | 1.04  | 0.000001
Vehicle      | 10 | 0.4     | 1.06  | 0.000001
Vehicle      | 50 | 0.00001 | 1.78  | 0.000001
Vehicle      | 50 | 0.4     | 1.84  | 0.000001
Segmentation | 10 | 0.00001 | 0.220 | 0.000000
Segmentation | 10 | 0.4     | 0.225 | 0.000001
Segmentation | 50 | 0.00001 | 0.626 | 0.000001
Segmentation | 50 | 0.4     | 0.638 | 0.000000
Diabetes     | 10 | 0.00001 | 2.06  | 0.0
Diabetes     | 10 | 0.4     | 2.05  | 0.0
Diabetes     | 50 | 0.00001 | 3.43  | 0.0002
Diabetes     | 50 | 0.4     | 3.45  | 0.0002

Table 7: Ties distribution
Slide 28: Figure 2: How Purity and k Change as β Increases
Slide 29: 6. Conclusions
- As expected, supervised clustering algorithms produced significantly better cluster purity than traditional clustering. Improvements range between 7% and 19% for the different data sets.
- Algorithms that explore the search space too greedily, such as SPAM, do not seem to be very suitable for supervised clustering. In general, algorithms that explore the search space more randomly seem to be more suitable for supervised clustering.
- Supervised clustering can be used to enhance classifiers, to summarize datasets, and to generate better distance functions.
Slide 30: Future Work
- Continue work on supervised clustering algorithms:
  - find better solutions
  - make them faster
  - explain some observations
- Use supervised clustering for summary generation / learning subclasses.
- Use supervised clustering to find compressed nearest neighbor classifiers.
- Use supervised clustering to enhance simple classifiers.
- Distance function learning.
Slide 31: K-Means Algorithm
[Figure: a k-means iteration on four clusters (1-4) in Attribute1/Attribute2 space]
Slide 32: K-Means Algorithm
[Figure: a subsequent k-means iteration on the same four clusters]
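For completeness, the iterations illustrated on these backup slides follow the standard k-means loop: assign each point to the nearest centroid, then recompute each centroid as the mean of its cluster. A minimal 1-D sketch, assuming Euclidean distance and random initial centroids:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd-style k-means on 1-D points: alternate nearest-centroid
    assignment and mean recomputation until the centroids stop moving
    (or the iteration cap is hit)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # an empty cluster keeps its old centroid
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids
```

Unlike the representative-based algorithms discussed in this talk, the centroids here are means rather than actual data objects, which is why k-means requires numeric attributes.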