Adapting the Right Measures for K-means Clustering


Provided by: malte

1
Adapting the Right Measures for K-means Clustering

Junjie Wu (wujj_at_buaa.edu.cn)
Beihang University

Joint work with Hui Xiong (Rutgers Univ.) and Jian
Chen (Tsinghua Univ.)
2
Outline

Introduction
Defective Validation Measures
Measure Normalization
Measure Properties
Concluding Remarks

3
Clustering and Cluster Validation

Cluster analysis provides insight into the data
by dividing the objects into groups (clusters) of
objects, such that objects in a cluster are more
similar to each other than to objects in other
clusters.
Cluster validation refers to procedures that
evaluate the results of clustering in a
quantitative and objective fashion [Jain & Dubes,
1988].
How to be quantitative: employ the measures.
How to be objective: validate the measures!

4
Cluster Validation Measures

A Typical View of Cluster Validation Measures
External measures
Match a cluster structure to prior information,
e.g., class labels.
E.g., Rand index, Γ statistic, F-measure, mutual
information
Internal measures
Assess the fit between the structure and the data
themselves only.
E.g., Silhouette index, CPCC, Γ statistic
Relative measures
Decide which of two structures is better; often
used for selecting the right clustering
parameters, e.g., the number of clusters.
E.g., Dunn's indices, Davies-Bouldin index,
partition coefficient
Other Views
Partitional Indices vs. Hierarchical Indices
Fuzzy Indices vs. Non-Fuzzy Indices
Statistics-based Indices vs. Information-based
Indices

5
Research Motivations

There is little work on evaluating the
effectiveness of cluster validation measures in a
systematic way. Many questions remain!
Which measures are widely used?
Are these measures objective?
Why and how should these measures be normalized?
What are the properties and interrelationships of
these measures?
How can we adapt the right measures to a specific
clustering algorithm?
The answers to the above questions are essential
to the success of cluster analysis!

6
The Scope of this Study

To provide an organized study of external
validation measures for K-means clustering.
K-means is a well-known, widely used, and
successful clustering method.
16 external measures were studied; 13 remained.

7
Workflow Towards Right Measures
8
Main Contributions

In general, we provided an organized study of
selecting the right measures for K-means
clustering. Specifically, we
Reviewed 16 well-known external validation
measures
Identified some defective measures
Established the importance of measure
normalization and designed normalization
solutions for several validation measures
Revealed some major properties of these external
measures, such as consistency, sensitivity, and
symmetry properties.
Provided final guidance for adapting the right
measures for K-means clustering

9
Outline

Introduction
Defective Validation Measures
Measure Normalization
Measure Properties
Concluding Remarks

10
K-means: The Uniform Effect

For data sets in skewed distributions, K-means
tends to produce clusters with relatively uniform
sizes.

On the document data set sports
11
A Necessary Selection Criterion

Two Clustering Results for a Sample Data Set

[Figure: two clustering results for the sample data set, annotated with CV1 = 0, CV1 = 1.125, "far away", and CV0 = 1.166]
12
Identifying Defective Measures: An Example

The cluster validation results
Now only 10 measures remain.

13
Exploring the Defectiveness

Entropy and Purity
Mutual Information

$\sum_j \max_i n_{ij} / n$
H(PC)
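A minimal sketch of how purity and the entropy measure can be computed from a contingency table may help here. It reads purity as the sum over j of max over i of n_ij, divided by n, and treats the entropy measure as the size-weighted entropy of the class mix inside each cluster. The orientation (rows i as classes, columns j as clusters), the function names, and the three example tables are assumptions made for illustration and do not come from the slides.

```python
from math import log2

def purity(table):
    """Purity = sum_j max_i n_ij / n, where table[i][j] = n_ij."""
    n = sum(sum(row) for row in table)
    # zip(*table) iterates over columns j; take the largest count in each.
    return sum(max(col) for col in zip(*table)) / n

def entropy_measure(table):
    """Size-weighted entropy of the class mix inside each cluster (column)."""
    n = sum(sum(row) for row in table)
    total = 0.0
    for col in zip(*table):
        size = sum(col)
        for n_ij in col:
            if n_ij > 0:  # 0 * log(0) is treated as 0
                total -= (n_ij / n) * log2(n_ij / size)
    return total

# Hypothetical tables: rows are classes (sizes 3 and 2), columns are clusters.
perfect = [[3, 0], [0, 2]]
over_split = [[1, 1, 1, 0, 0], [0, 0, 0, 1, 1]]  # every object is its own cluster
mixed = [[2, 1], [1, 2]]

for name, t in [("perfect", perfect), ("over_split", over_split), ("mixed", mixed)]:
    print(name, round(purity(t), 3), round(entropy_measure(t), 3))
```

On these toy tables the perfect clustering and the over-split clustering receive the same scores (purity 1, entropy 0), which is one well-known way in which the two measures fail to discriminate between clusterings. Whether this is the exact defect illustrated on the slide is not confirmed by the transcript.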
14
Improving the Defective Measures

Variation of Information (VI) vs. Entropy (E)
van Dongen criterion (VD) vs. Purity (P)
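The transcript does not preserve the formulas for VI and VD, so the sketch below uses commonly cited forms as an assumption: VI as H(C|P) + H(P|C), computed as 2·H(joint) − H(rows) − H(columns), and VD as (2n − Σ_i max_j n_ij − Σ_j max_i n_ij) / (2n). The function names and toy tables are hypothetical.

```python
from math import log2

def _entropy(counts, n):
    """Shannon entropy (bits) of the distribution counts / n."""
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def variation_of_information(table):
    """VI = H(rows | cols) + H(cols | rows) = 2*H(joint) - H(rows) - H(cols)."""
    n = sum(map(sum, table))
    h_rows = _entropy([sum(r) for r in table], n)
    h_cols = _entropy([sum(c) for c in zip(*table)], n)
    h_joint = _entropy([x for r in table for x in r], n)
    return 2 * h_joint - h_rows - h_cols

def van_dongen(table):
    """VD = (2n - sum_i max_j n_ij - sum_j max_i n_ij) / (2n)."""
    n = sum(map(sum, table))
    row_max = sum(max(r) for r in table)
    col_max = sum(max(c) for c in zip(*table))
    return (2 * n - row_max - col_max) / (2 * n)

# Hypothetical tables: rows are classes, columns are clusters.
perfect = [[3, 0], [0, 2]]
over_split = [[1, 1, 1, 0, 0], [0, 0, 0, 1, 1]]  # every object is its own cluster

for name, t in [("perfect", perfect), ("over_split", over_split)]:
    print(name, round(variation_of_information(t), 3), round(van_dongen(t), 3))
```

Both measures look at the table from the row side and the column side, so the over-split clustering, which purity and entropy would score as perfect, gets a nonzero penalty here.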

15
Outline

Introduction
Defective Validation Measures
Measure Normalization
Measure Properties
Concluding Remarks

16
Two Normalization Methods

Normalization enables the use of measures for the
comparisons of clustering results of different
data sets.
Two types of normalization schemes
Statistics-based normalization
Extreme value-based normalization
Basic Assumptions
Multivariate hypergeometric distribution

fixed
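The two schemes can be sketched as follows. The slides' own normalized formulas are not preserved in the transcript, so the general forms used here, (S − E[S]) / (max S − E[S]) for the statistics-based scheme and (S − min S) / (max S − min S) for the extreme value-based scheme, are the usual textbook forms and are an assumption. As a concrete instance, the sketch computes the well-known adjusted Rand index of Hubert and Arabie, which takes the expectation under a hypergeometric model with fixed row and column totals. Whether it coincides exactly with the Rn of this talk is not confirmed by the transcript.

```python
from math import comb

def statistics_based(value, expected, maximum):
    """(S - E[S]) / (max S - E[S]); undefined when maximum == expected."""
    return (value - expected) / (maximum - expected)

def extreme_value_based(value, minimum, maximum):
    """(S - min S) / (max S - min S); undefined when maximum == minimum."""
    return (value - minimum) / (maximum - minimum)

def adjusted_rand(table):
    """Pair-counting index normalized by its expectation under fixed marginals."""
    n = sum(map(sum, table))
    same_in_both = sum(comb(x, 2) for row in table for x in row)
    same_in_rows = sum(comb(sum(row), 2) for row in table)
    same_in_cols = sum(comb(sum(col), 2) for col in zip(*table))
    expected = same_in_rows * same_in_cols / comb(n, 2)
    maximum = (same_in_rows + same_in_cols) / 2
    return statistics_based(same_in_both, expected, maximum)

# Hypothetical contingency tables (rows: classes, columns: clusters).
perfect = [[3, 0], [0, 2]]
uninformative = [[2, 2], [2, 2]]

print(adjusted_rand(perfect), adjusted_rand(uninformative))
print(extreme_value_based(0.75, 0.5, 1.0))
```

After the statistics-based correction, a clustering that is no better than chance scores near or below zero, and a perfect one scores 1, which is what makes values comparable across data sets.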
17
Normalization Solutions

The Normalized Measures

Type I
Type II
18
Test Normalizations: The DCV Criterion and the
Settings

The DCV Criterion
DCV = CV1 - CV0
As the DCV values go down, the clustering results
produced by K-means tend to move away from the
true class distributions.
As the DCV values go down, good measures are
expected to report worse clustering performance.
The Experimental Setup
Data sets: simulated and sampled, with increasing
DCV.
Tools:
MATLAB 7.1
CLUTO 2.1.1
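A small worked example of the DCV computation, under stated assumptions: CV is taken to be the coefficient of variation (sample standard deviation divided by the mean) of the class or cluster sizes, and DCV = CV1 − CV0 as on the slide. The class sizes below are hypothetical; they were chosen so that CV0 rounds to 1.166, the CV0 value quoted for the sample data set earlier in the deck, and the uniform cluster sizes give CV1 = 0.

```python
from statistics import mean, stdev

def coefficient_of_variation(sizes):
    """Sample standard deviation of the sizes divided by their mean."""
    return stdev(sizes) / mean(sizes)

def dcv(cluster_sizes, class_sizes):
    """DCV = CV1 - CV0: CV of the cluster sizes minus CV of the true class sizes."""
    return coefficient_of_variation(cluster_sizes) - coefficient_of_variation(class_sizes)

# Hypothetical sizes (an assumption, not taken from the slides).
class_sizes = [30, 2, 6, 10, 2]            # skewed true classes
uniform_clusters = [10, 10, 10, 10, 10]    # K-means "uniform effect" outcome

print(round(coefficient_of_variation(class_sizes), 3))
print(round(dcv(uniform_clusters, class_sizes), 3))
```

A strongly negative DCV means the cluster sizes are much more uniform than the true class sizes, which is the situation in which a sound measure should report a poor clustering.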

19
Normalization Experiments: The Results

Remark
If we use the unnormalized measures for cluster
validation, only three measures, namely R, Γ, and
Γ', show strong consistency with DCV.
All the normalized measures show perfect
consistency with DCV except for Fn and εn.
The value ranges are wider after normalization.

Kendall's rank correlation
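Consistency with DCV can be checked with Kendall's rank correlation, sketched below in its simplest form (tau-a, no tie correction). Which variant the authors used is not stated in the transcript, and the DCV values and measure scores here are made-up numbers for illustration only.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical data: four data sets with decreasing DCV.
dcv_values = [-0.2, -0.5, -0.9, -1.3]
consistent_measure = [0.80, 0.65, 0.40, 0.10]    # score drops as DCV drops
inconsistent_measure = [0.70, 0.72, 0.69, 0.71]  # score ignores DCV

print(kendall_tau(dcv_values, consistent_measure))
print(kendall_tau(dcv_values, inconsistent_measure))
```

A tau close to 1 indicates that a measure ranks the clustering results in the same order as DCV does; a tau near 0 indicates no consistent relationship.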
20
The Impact of the Number of Clusters

Remark
The values of all the measures change as the
number of clusters increases.
The normalized measures all identify the same
optimal cluster number, 5.

The data set la2
21
Outline

Introduction
Defective Validation Measures
Measure Normalization
Measure Properties
Concluding Remarks

22
The Consistency

The Experiment Setup
Data sets: 29 benchmark document data sets.
Tools: CLUTO.
Measure: Kendall's rank correlation.
Result: Correlations of the Measures
The normalized measures have much stronger
consistency than the unnormalized measures.

23
The Consistency, Cont'd

Hierarchical Clustering on the Normalized
Measures
[measure names not recovered] are equivalent.
[measure names not recovered] are more similar to one another.
[measure names not recovered] show inconsistency in varying degrees.
Only 7 normalized measures remain!

24
The Sensitivity

Remarks
All the measures show different validation
results for the two clusterings except for VDn
and Fn.
VIn is the most sensitive measure.

25
Math Properties
26
Math Properties, Cont'd
27
Math Properties, Cont'd
28
The Selection Process: An Overview

The Way to the Right Measures
Step I: Discard M, MAP, and GK. 13 measures
remain.
Step II: Filter out E, P, and MI. 10 measures
remain.
Step III: Normalize the measures. 10 normalized
measures remain.
Step IV: Discard [measure names not recovered]. 7
normalized measures remain.
Step V: Filter out Fn and εn. 5 normalized
measures remain.
Step VI: Discard FMn and Γn. 3 normalized
measures remain.
The Three Right Measures for K-means Clustering
Normalized van Dongen criterion (VDn)
Normalized variation of information (VIn)
Normalized Rand index (Rn)

29
Insights

Guidance for K-means Clustering Validation
VDn is the most suitable choice, since it has a
simple computational form, satisfies all the
mathematical properties examined, and performs
well on data with imbalanced class
distributions.
When clustering results are hard to
distinguish, we may use VIn instead,
since VIn is highly sensitive to changes in the
clustering.
Rn can also be used as a complement to the
above two measures.

30
Outline

Introduction
Defective Validation Measures
Measure Normalization
Measure Properties
Concluding Remarks

31
Conclusions

In this study, we compared and contrasted
external validation measures for K-means
clustering.
It is necessary to normalize validation measures
before they can be employed for clustering
validation.
We provided normalization solutions for the
measures for which none were available.
We summarized the key properties of these measures.
These properties should be considered before
deciding which measure to use in
practice.
We investigated the relationships among these
validation measures.