1
K-medoid-style Clustering Algorithms for
Supervised Summary Generation
  • Nidal Zeidat and Christoph F. Eick
  • Dept. of Computer Science
  • University of Houston

2
Talk Outline
  1. What is Supervised Clustering?
  2. Representative-based Clustering Algorithms
  3. Benefits of Supervised Clustering
  4. Algorithms for Supervised Clustering
  5. Empirical Results
  6. Conclusion and Areas of Future Work

3
1. (Traditional) Clustering
  • Partition a set of objects into groups of similar
    objects. Each group is called a cluster.
  • Clustering is used to detect classes in a data
    set (unsupervised learning).
  • Clustering is based on a fitness function that
    relies on a distance measure and usually tries to
    minimize the distance between objects within a
    cluster.

4
(Traditional) Clustering (continued)
[Figure: three clusters A, B, and C in Attribute1 vs. Attribute2 space]
5
Supervised Clustering
  • Assumes that clustering is applied to classified
    examples.
  • The goal of supervised clustering is to identify
    class-uniform clusters that have a high
    probability density → it prefers clusters whose
    members belong to a single class (low impurity).
  • We would also like to keep the number of
    clusters low.

6
Supervised Clustering (continued)
[Figure: two panels, Traditional Clustering and Supervised Clustering, of the same data in Attribute 1 vs. Attribute 2 space]
7
A Fitness Function for Supervised Clustering
  • q(X) = Impurity(X) + β·Penalty(k)

    where
    k = number of clusters used
    n = number of examples in the dataset
    c = number of classes in the dataset
    β = weight for Penalty(k), 0 < β ≤ 2.0
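A minimal Python sketch of this fitness function, assuming Impurity(X) is the fraction of examples that fall outside the majority class of their cluster; the concrete form of Penalty(k) is not shown on this slide, so sqrt((k−c)/n) for k > c (and 0 otherwise) is used below purely as an illustrative assumption.

```python
from collections import Counter
from math import sqrt

def impurity(clusters):
    """Fraction of examples that do not belong to the majority class of
    their cluster. `clusters` is a list of clusters, each a list of class labels."""
    n = sum(len(cl) for cl in clusters)
    minority = sum(len(cl) - Counter(cl).most_common(1)[0][1]
                   for cl in clusters if cl)
    return minority / n

def penalty(k, c, n):
    """Assumed penalty term: grows with the number of clusters k beyond the
    number of classes c, and is zero for k <= c (illustrative choice only)."""
    return sqrt((k - c) / n) if k > c else 0.0

def q(clusters, c, beta):
    """Fitness function q(X) = Impurity(X) + beta * Penalty(k)."""
    n = sum(len(cl) for cl in clusters)
    return impurity(clusters) + beta * penalty(len(clusters), c, n)

# Toy example: two clusters over a 2-class dataset, beta = 0.1
print(q([["A", "A", "B"], ["B", "B"]], c=2, beta=0.1))   # 0.2 (one minority example)
```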
8
2. Representative-Based Supervised Clustering
(RSC)
  • Aims at finding a set of objects (called
    representatives) among all objects in the data set
    that best represent the objects in the data set.
    Each representative corresponds to a cluster.
  • The remaining objects in the data set are then
    clustered around these representatives by
    assigning each object to the cluster of its closest
    representative (see the sketch below).
  • Remark: The popular k-medoids algorithm, also
    called PAM, is a representative-based clustering
    algorithm.
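A minimal sketch of this assignment step; the function name, the index-based cluster representation, and the pluggable dist function are illustrative assumptions, not code from the paper.

```python
def assign_to_representatives(objects, representatives, dist):
    """Cluster objects around representatives: each object is assigned to the
    cluster of its closest representative. Returns one list of object indices
    per representative."""
    clusters = [[] for _ in representatives]
    for i, obj in enumerate(objects):
        nearest = min(range(len(representatives)),
                      key=lambda r: dist(obj, representatives[r]))
        clusters[nearest].append(i)
    return clusters
```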

9
Representative Based Supervised Clustering
(Continued)
[Figure: example dataset in Attribute1 vs. Attribute2 space]
10
Representative Based Supervised Clustering
(Continued)
[Figure: the dataset with four representatives (1-4) marked in Attribute1 vs. Attribute2 space]
11
Representative Based Supervised Clustering
(Continued)
[Figure: the clustering induced by the four representatives (1-4) in Attribute1 vs. Attribute2 space]
Objective of RSC: Find a subset O_R of O such that
the clustering X obtained by using the objects
in O_R as representatives minimizes q(X).
12
Why do we use Representative-Based Clustering
Algorithms?
  • Representatives themselves are useful:
  • can be used for summarization
  • can be used for dataset compression
  • Smaller search space compared with algorithms
    such as k-means
  • Less sensitive to outliers
  • Can be applied to datasets that contain nominal
    attributes (where computing a mean is not feasible)

13
3. Applications of Supervised Clustering
  • Enhance classification algorithms:
  • Use SC for dataset editing to enhance
    NN-classifiers [ICDM04]
  • Improve simple classifiers [ICDM03]
  • Learn sub-classes / summary generation
  • Distance function learning
  • Dataset compression/reduction
  • Measuring the difficulty of a classification
    task

14
Representative Based Supervised Clustering →
Dataset Editing
[Figure: two panels in Attribute1 vs. Attribute2 space with clusters A-F]
a. Dataset clustered using supervised clustering.
b. Dataset edited using cluster representatives.
15
Representative Based Supervised Clustering →
Enhance Simple Classifiers
[Figure: dataset in Attribute1 vs. Attribute2 space]
16
Representative Based Supervised Clustering →
Learning Sub-classes
[Figure: a dataset with classes Ford and GMC in Attribute1 vs. Attribute2 space; supervised clustering reveals the sub-classes Ford Trucks, Ford Vans, GMC Trucks, and GMC Van]
17
4. Clustering Algorithms Currently Investigated
  1. Partitioning Around Medoids (PAM) → traditional.
  2. Supervised Partitioning Around Medoids (SPAM).
  3. Single Representative Insertion/Deletion Steepest
    Descent Hill Climbing with Randomized Restart
    (SRIDHCR).
  4. Top Down Splitting Algorithm (TDS).
  5. Supervised Clustering using Evolutionary
    Computing (SCEC).

18
Algorithm SRIDHCR
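The SRIDHCR pseudocode from this slide did not survive the transcript. The sketch below reconstructs the idea described on the surrounding slides: starting from a random initial solution, evaluate all single insertions of a non-representative and all single deletions of a representative, take the best move as long as it lowers q(X), and restart several times. It reuses the q() and assign_to_representatives() sketches from earlier slides; the evaluate_q() helper, the initial solution size, and the lower bound of c representatives are assumptions.

```python
import random

def evaluate_q(objects, labels, rep_indices, c, beta, dist):
    """Build the clustering induced by a set of representatives and score it
    with q(X); reuses the assign_to_representatives() and q() sketches above."""
    reps = [objects[i] for i in rep_indices]
    clusters = assign_to_representatives(objects, reps, dist)
    return q([[labels[i] for i in cl] for cl in clusters], c, beta)

def sridhcr(objects, labels, c, beta, dist, restarts=10):
    """SRIDHCR sketch: steepest-descent hill climbing over single insertions
    and deletions of representatives, restarted from random initial solutions."""
    n = len(objects)
    best_reps, best_q = None, float("inf")
    for _ in range(restarts):
        reps = frozenset(random.sample(range(n), c))   # assumed: start with c random medoids
        cur_q = evaluate_q(objects, labels, reps, c, beta, dist)
        while True:
            # First part: try adding each non-medoid; second part: try dropping
            # each medoid (never going below c representatives, an assumption).
            candidates = [reps | {i} for i in range(n) if i not in reps]
            candidates += [reps - {i} for i in reps if len(reps) > c]
            scored = [(evaluate_q(objects, labels, s, c, beta, dist), s)
                      for s in candidates]
            new_q, new_reps = min(scored, key=lambda t: t[0])
            if new_q >= cur_q:
                break                                  # no single move improves q(X)
            reps, cur_q = new_reps, new_q
        if cur_q < best_q:
            best_reps, best_q = reps, cur_q
    return best_reps, best_q
```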
19
Trials in the first part (add a non-medoid)         Trials in the second part (drop a medoid)
Set of medoids                             q(X)     Set of medoids            q(X)
{8, 42, 62, 148} (initial solution)        0.086    {42, 62, 148}             0.086
{8, 42, 62, 148, 1}                        0.091    {8, 62, 148}              0.073
{8, 42, 62, 148, 2}                        0.091    {8, 42, 148}              0.313
...                                                 {8, 42, 62}               0.333
{8, 42, 62, 148, 52}                       0.065
...
{8, 42, 62, 148, 150}                      0.0715

Run   Set of medoids producing lowest q(X) in the run   q(X)    Purity
0     {8, 42, 62, 148} (initial solution)               0.086   0.947
1     {8, 42, 62, 148, 52}                              0.065   0.947
2     {8, 42, 62, 148, 52, 122}                         0.041   0.973
3     {42, 62, 148, 52, 122, 117}                       0.030   0.987
4     {8, 62, 148, 52, 122, 117}                        0.021   0.993
5     {8, 62, 148, 52, 122, 117, 87}                    0.016   1.000
6     {8, 62, 52, 122, 117, 87}                         0.014   1.000
7     {8, 62, 122, 117, 87}                             0.012   1.000
20
Algorithm SPAM
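SPAM's pseudocode is likewise missing from the transcript. Based on the comparison on the next slide, it performs PAM-style swaps of one representative with one non-representative, keeps the number of clusters k fixed, and uses q(X) as its objective. The random initialization shown here is an assumption, and the evaluate_q() helper comes from the SRIDHCR sketch above.

```python
import random

def spam(objects, labels, c, beta, dist, k):
    """SPAM sketch: PAM-style swaps scored by q(X) with the number of
    clusters k kept fixed; stops when no swap improves the solution."""
    n = len(objects)
    reps = frozenset(random.sample(range(n), k))       # assumed random initial medoids
    cur_q = evaluate_q(objects, labels, reps, c, beta, dist)
    while True:
        # Try every swap of one representative r with one non-representative o.
        swaps = [(reps - {r}) | {o}
                 for r in reps for o in range(n) if o not in reps]
        scored = [(evaluate_q(objects, labels, s, c, beta, dist), s) for s in swaps]
        new_q, new_reps = min(scored, key=lambda t: t[0])
        if new_q >= cur_q:
            return reps, cur_q                         # no swap lowers q(X)
        reps, cur_q = new_reps, new_q
```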
21
Differences between SPAM and SRIDHCR
  1. SPAM tries to improve the current solution by
    replacing a representative with a
    non-representative, whereas SRIDHCR improves the
    current solution by removing a representative or by
    inserting a non-representative.
  2. SPAM is run keeping the number of clusters k
    fixed, whereas SRIDHCR searches for a good
    value of k, thereby exploring a larger solution
    space. However, in the case of SRIDHCR, which
    choices of k are good is somewhat restricted by
    the selection of the parameter β.
  3. SRIDHCR is run r times starting from random
    initial solutions; SPAM is only run once
    (a toy usage comparison of the two sketches follows below).
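A toy usage comparison of the two sketches above, using hypothetical 2-D data, illustrative class labels, and Euclidean distance via math.dist:

```python
import math

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]  # two spatial groups
labels = ["A", "A", "A", "B", "B", "B"]                    # toy class labels
euclid = lambda a, b: math.dist(a, b)

# SRIDHCR searches over the number of representatives; SPAM keeps k fixed.
reps_hc, q_hc = sridhcr(points, labels, c=2, beta=0.1, dist=euclid, restarts=5)
reps_sp, q_sp = spam(points, labels, c=2, beta=0.1, dist=euclid, k=2)
print(sorted(reps_hc), q_hc)
print(sorted(reps_sp), q_sp)
```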

22
5. Performance Measures for the Experimental
Evaluation
  • The investigated algorithms were evaluated based
    on the following performance measures
    (purity and tightness are sketched below):
  • Cluster purity (majority %).
  • Value of the fitness function q(X).
  • Average dissimilarity between all objects and
    their representatives (cluster tightness).
  • Wall-clock time (WCT): actual time, in seconds,
    that the algorithm took to finish the clustering
    task.
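Minimal sketches of the purity and tightness measures, following the same data conventions as the earlier q(X) and assignment sketches (the function names and argument layout are assumptions):

```python
from collections import Counter

def cluster_purity(clusters):
    """Majority-based purity: fraction of all examples that belong to the
    majority class of their cluster. `clusters` is a list of label lists."""
    n = sum(len(cl) for cl in clusters)
    majority = sum(Counter(cl).most_common(1)[0][1] for cl in clusters if cl)
    return majority / n

def tightness(clusters, representatives, dist):
    """Average dissimilarity between every object and the representative of
    its cluster. `clusters` holds the objects themselves, aligned with
    `representatives`."""
    n = sum(len(cl) for cl in clusters)
    total = sum(dist(o, rep)
                for cl, rep in zip(clusters, representatives) for o in cl)
    return total / n
```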

23
Algorithm   Purity   q(X)     Tightness(X)
Iris-Plants data set, clusters = 3
PAM         0.907    0.0933   0.081
SRIDHCR     0.981    0.0200   0.093
SPAM        0.973    0.0267   0.133
Vehicle data set, clusters = 65
PAM         0.701    0.326    0.044
SRIDHCR     0.835    0.192    0.072
SPAM        0.764    0.263    0.097
Image-Segment data set, clusters = 53
PAM         0.880    0.135    0.027
SRIDHCR     0.980    0.035    0.050
SPAM        0.944    0.071    0.061
Pima-Indian Diabetes data set, clusters = 45
PAM         0.763    0.237    0.056
SRIDHCR     0.859    0.164    0.093
SPAM        0.822    0.202    0.086
Table 4: Traditional vs. Supervised Clustering (β = 0.1)
24
Algorithm   q(X)     Purity   Tightness(X)   WCT (sec.)
IRIS-Flowers dataset, clusters = 3
PAM         0.0933   0.907    0.081          0.06
SRIDHCR     0.0200   0.980    0.093          11.00
SPAM        0.0267   0.973    0.133          0.32
Vehicle dataset, clusters = 65
PAM         0.326    0.701    0.044          372.00
SRIDHCR     0.192    0.835    0.072          1715.00
SPAM        0.263    0.764    0.097          1090.00
Segmentation dataset, clusters = 53
PAM         0.135    0.880    0.027          4073.00
SRIDHCR     0.035    0.980    0.050          11250.00
SPAM        0.071    0.944    0.061          1422.00
Pima-Indians-Diabetes dataset, clusters = 45
PAM         0.237    0.763    0.056          186.00
SRIDHCR     0.164    0.859    0.093          660.00
SPAM        0.202    0.822    0.086          58.00

Table 5: Comparative Performance of the Different Algorithms, β = 0.1
25
Algorithm   Avg. Purity   Tightness(X)   Avg. WCT (sec.)
IRIS-Flowers dataset, clusters = 3
PAM         0.907         0.081          0.06
SRIDHCR     0.959         0.104          0.18
SPAM        0.973         0.133          0.33
Vehicle dataset, clusters = 56
PAM         0.681         0.046          505.00
SRIDHCR     0.762         0.081          22.58
SPAM        0.754         0.100          681.00
Segmentation dataset, clusters = 32
PAM         0.875         0.032          1529.00
SRIDHCR     0.946         0.054          169.39
SPAM        0.940         0.065          1053.00
Pima-Indians-Diabetes dataset, clusters = 2
PAM         0.656         0.104          0.97
SRIDHCR     0.795         0.109          5.08
SPAM        0.772         0.125          2.70

Table 6: Average Comparative Performance of the Different Algorithms, β = 0.4
26
Why is SRIDHCR performing so much better than
SPAM?
  • SPAM is relatively slow compared with a single
    run of SRIDHCR, allowing for 5-30 restarts of
    SRIDHCR using the same resources. This enables
    SRIDHCR to conduct a more balanced exploration of
    the search space.
  • The fitness landscape induced by q(X) contains many
    plateau-like structures (q(X1) = q(X2)) and many
    local minima, and SPAM seems to get stuck more
    easily.
  • The fact that SPAM uses a fixed value of k does not
    seem beneficial for finding good solutions;
    e.g. SRIDHCR might explore {u1,u2,u3,u4} →
    {u1,u2,u3,u4,v1,v2} → {u3,u4,v1,v2}, whereas SPAM
    might terminate with the sub-optimal solution
    {u1,u2,u3,u4} if neither the replacement of u1
    by v1 nor the replacement of u2 by v2
    enhances q(X).

27
Dataset        k    β         Ties using q(X)   Ties using Tightness(X)
Iris-Plants 10 0.00001 5.8 0.0004
Iris-Plants 10 0.4 5.7 0.0004
Iris-Plants 50 0.00001 20.5 0.0019
Iris-Plants 50 0.4 20.9 0.0018

Vehicle 10 0.00001 1.04 0.000001
Vehicle 10 0.4 1.06 0.000001
Vehicle 50 0.00001 1.78 0.000001
Vehicle 50 0.4 1.84 0.000001

Segmentation 10 0.00001 0.220 0.000000
Segmentation 10 0.4 0.225 0.000001
Segmentation 50 0.00001 0.626 0.000001
Segmentation 50 0.4 0.638 0.000000

Diabetes 10 0.00001 2.06 0.0
Diabetes 10 0.4 2.05 0.0
Diabetes 50 0.00001 3.43 0.0002
Diabetes 50 0.4 3.45 0.0002
Table 7: Ties distribution
28
Figure 2: How Purity and k Change as β Increases
29
6. Conclusions
  1. As expected, supervised clustering algorithms
    produced significantly better cluster purity than
    traditional clustering. Improvements range
    between 7% and 19% for the different data sets.
  2. Algorithms that explore the search space too
    greedily, such as SPAM, do not seem to be very
    suitable for supervised clustering. In general,
    algorithms that explore the search space more
    randomly seem to be more suitable for supervised
    clustering.
  3. Supervised clustering can be used to enhance
    classifiers, to summarize datasets, and to learn
    better distance functions.

30
Future Work
  • Continue work on supervised clustering algorithms:
  • Find better solutions
  • Make them faster
  • Explain some observations
  • Use supervised clustering for summary
    generation / learning subclasses
  • Use supervised clustering to find compressed
    nearest-neighbor classifiers
  • Use supervised clustering to enhance simple
    classifiers
  • Distance function learning

31
K-Means Algorithm
[Figure: k-means illustration with four clusters (1-4) in Attribute1 vs. Attribute2 space]
32
K-Means Algorithm
[Figure: k-means illustration with four clusters (1-4) in Attribute1 vs. Attribute2 space]