Title: Adapting the Right Measures for K-means Clustering
1. Adapting the Right Measures for K-means Clustering
- Junjie Wu (wujj_at_buaa.edu.cn)
- Beihang University
Joint work with Hui Xiong (Rutgers Univ.) and Jian Chen (Tsinghua Univ.)
2. Outline
- Introduction
- Defective Validation Measures
- Measure Normalization
- Measure Properties
- Concluding Remarks
3. Clustering and Cluster Validation
- Cluster analysis provides insight into the data by dividing the objects into groups (clusters), such that objects in a cluster are more similar to each other than to objects in other clusters.
- Cluster validation refers to procedures that evaluate the results of clustering in a quantitative and objective fashion [Jain & Dubes, 1988].
- How to be quantitative? Employ the measures.
- How to be objective? Validate the measures!
4. Cluster Validation Measures
- A Typical View of Cluster Validation Measures
  - External measures
    - Match a cluster structure to prior information, e.g., class labels.
    - E.g., Rand index, Γ statistics, F-measure, Mutual Information
  - Internal measures
    - Assess the fit between the structure and the data themselves only.
    - E.g., Silhouette index, CPCC, Γ statistics
  - Relative measures
    - Decide which of two structures is better; often used for selecting the right clustering parameters, e.g., the cluster number.
    - E.g., Dunn's indices, Davies-Bouldin index, partition coefficient
- Other Views
  - Partitional Indices vs. Hierarchical Indices
  - Fuzzy Indices vs. Non-Fuzzy Indices
  - Statistics-based Indices vs. Information-based Indices
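As a concrete instance of an external measure, the Rand index can be computed by counting the object pairs on which a clustering and the class labels agree. The sketch below is a minimal illustration only; the function name `rand_index` and the toy labels are hypothetical and are not taken from the slides.

```python
from itertools import combinations


def rand_index(classes, clusters):
    """Fraction of object pairs on which the two labelings agree.

    A pair agrees when both labelings put the two objects together,
    or both put them apart.
    """
    if len(classes) != len(clusters):
        raise ValueError("labelings must have the same length")
    agree = 0
    total = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        agree += same_class == same_cluster
        total += 1
    if total == 0:
        raise ValueError("need at least two objects")
    return agree / total


# Toy labels (illustrative only).
true_classes = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]  # same partition, different label names
crossed = [0, 1, 0, 1]    # splits every class across both clusters
print(rand_index(true_classes, relabeled))
print(rand_index(true_classes, crossed))
```

Because only pair co-membership is compared, renaming the cluster labels does not change the score.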
5. Research Motivations
- There is little work on evaluating the effectiveness of cluster validation measures in a systematic way. Many questions remain!
- What are the widely used measures?
- Are these measures objective?
- Why and how should these measures be normalized?
- What are the properties and interrelationships of these measures?
- How to adapt the right measures for a specific clustering algorithm?
- The answers to the above questions are essential to the success of cluster analysis!
6. The Scope of this Study
- To provide an organized study of external validation measures for K-means clustering.
- K-means is a well-known, widely used, and successful clustering method.
- 16 external measures were studied; 13 remained.
7. Workflow Towards Right Measures
8. Main Contributions
- In general, we provided an organized study of selecting the right measures for K-means clustering. Specifically, we
  - Reviewed 16 well-known external validation measures
  - Identified some defective measures
  - Established the importance of measure normalization and designed normalization solutions for several validation measures
  - Revealed some major properties of these external measures, such as consistency, sensitivity, and symmetry
  - Provided the final guidance for adapting the right measures for K-means clustering
9. Outline
- Introduction
- Defective Validation Measures
- Measure Normalization
- Measure Properties
- Concluding Remarks
10. K-means: The Uniform Effect
- For data sets with skewed distributions, K-means tends to produce clusters with relatively uniform sizes.
On the document data set sports
11. A Necessary Selection Criterion
- Two Clustering Results for a Sample Data Set
[Figure labels: CV1 = 0; CV1 = 1.125; "far away"; CV0 = 1.166]
12. Identifying Defective Measures: An Example
- The cluster validation results
- Now only 10 measures remain.
13. Exploring the Defectiveness
- Entropy and Purity
- Mutual Information
[Formula fragments: $\sum_j \max_i n_{ij} / n$; H(PC)]
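To make the purity fragment concrete, the sketch below computes purity as the sum over clusters j of max_i n_ij, divided by n, from a class-by-cluster contingency table, alongside the usual cluster-size-weighted entropy. The table layout (rows are classes, columns are clusters), the function names, the toy tables, and the base-2 logarithm are assumptions made for illustration; the slides do not fix them.

```python
from math import log2


def purity(table):
    """Sum over clusters (columns) of the largest class count, divided by n."""
    n = sum(sum(row) for row in table)
    return sum(max(col) for col in zip(*table)) / n


def entropy(table):
    """Size-weighted average of the class entropy inside each cluster, in bits."""
    n = sum(sum(row) for row in table)
    total = 0.0
    for col in zip(*table):
        size = sum(col)
        if size == 0:
            continue  # an empty cluster contributes nothing
        h = -sum((c / size) * log2(c / size) for c in col if c > 0)
        total += (size / n) * h
    return total


# Toy contingency tables (illustrative only): rows = classes, columns = clusters.
perfect = [[10, 0], [0, 10]]
mixed = [[5, 5], [5, 5]]
print(purity(perfect), entropy(perfect))
print(purity(mixed), entropy(mixed))
```

Higher purity and lower entropy both indicate clusters dominated by a single class.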
14. Improving the Defective Measures
- Variation of Information (VI) vs. Entropy (E)
- van Dongen criterion (VD) vs. Purity (P)
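For comparison with entropy and purity, the sketch below computes the variation of information as H(C|P) + H(P|C) and a van Dongen criterion scaled by 2n, both from a class-by-cluster contingency table. These are commonly used forms; the exact variants shown on the slide are not recoverable, so the scaling, the base-2 logarithm, the table layout, and the function names are all assumptions.

```python
from math import log2


def _conditional_entropy(table):
    """H(rows | columns) in bits for a table of non-negative counts."""
    n = sum(sum(row) for row in table)
    total = 0.0
    for col in zip(*table):
        size = sum(col)
        for c in col:
            if c > 0:
                total -= (c / n) * log2(c / size)
    return total


def variation_of_information(table):
    """VI = H(classes | clusters) + H(clusters | classes)."""
    transposed = [list(col) for col in zip(*table)]
    return _conditional_entropy(table) + _conditional_entropy(transposed)


def van_dongen(table):
    """(2n - sum of row maxima - sum of column maxima) / (2n)."""
    n = sum(sum(row) for row in table)
    row_max = sum(max(row) for row in table)
    col_max = sum(max(col) for col in zip(*table))
    return (2 * n - row_max - col_max) / (2 * n)


# Toy contingency tables (illustrative only): rows = classes, columns = clusters.
perfect = [[10, 0], [0, 10]]
mixed = [[5, 5], [5, 5]]
print(variation_of_information(perfect), van_dongen(perfect))
print(variation_of_information(mixed), van_dongen(mixed))
```

Both quantities are symmetric in the two partitions, and lower values indicate closer agreement; unlike entropy and purity, each uses the row direction as well as the column direction of the table.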
15. Outline
- Introduction
- Defective Validation Measures
- Measure Normalization
- Measure Properties
- Concluding Remarks
16. Two Normalization Methods
- Normalization enables the use of measures for comparing clustering results across different data sets.
- Two types of normalization schemes
  - Statistics-based normalization
  - Extreme value-based normalization
- Basic Assumptions
  - Multivariate hypergeometric distribution
fixed
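A well-known instance of statistics-based normalization is the adjusted Rand index of Hubert and Arabie, which rescales a pair-counting score S as (S - E[S]) / (max S - E[S]), with the expectation taken under a hypergeometric model in which class sizes and cluster sizes are held fixed. The sketch below is illustrative only; whether the normalized Rand index in these slides has exactly this form is an assumption, and the function name and toy tables are hypothetical.

```python
from math import comb


def adjusted_rand_index(table):
    """Pair-counting agreement rescaled as (S - E[S]) / (max S - E[S]).

    `table` is a contingency table of counts (rows = classes, columns = clusters).
    """
    n = sum(sum(row) for row in table)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        raise ValueError("need at least two objects")
    # Pairs placed together by both partitions.
    same_both = sum(comb(c, 2) for row in table for c in row)
    # Pairs placed together by each partition on its own.
    same_class = sum(comb(sum(row), 2) for row in table)
    same_cluster = sum(comb(sum(col), 2) for col in zip(*table))
    expected = same_class * same_cluster / total_pairs
    maximum = (same_class + same_cluster) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial and identical
    return (same_both - expected) / (maximum - expected)


# Toy contingency tables (illustrative only).
print(adjusted_rand_index([[10, 0], [0, 10]]))
print(adjusted_rand_index([[5, 5], [5, 5]]))
```

After this rescaling, a clustering that agrees with the classes no better than chance scores near zero regardless of the data set, which is what makes scores comparable across data sets.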
17. Normalization Solutions
Type I
Type II
18. Test Normalizations: The DCV Criterion and the Settings
- The DCV Criterion
  - DCV = CV1 - CV0
  - As the DCV values go down, the clustering results by K-means tend to move away from the true class distributions.
  - As the DCV values go down, good measures are expected to indicate worse clustering performance.
- The Experimental Setup
  - Data sets: simulated and sampled, with increasing DCV.
  - Tools
    - Matlab 7.1
    - Cluto 2.1.1
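The DCV criterion above can be sketched in a few lines. The slides do not define CV, so the code assumes it is the coefficient of variation of group sizes (sample standard deviation divided by mean), with CV0 computed from the true class sizes and CV1 from the resulting cluster sizes. The function names and the sizes used are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(sizes):
    """Sample standard deviation of the sizes divided by their mean (assumed definition)."""
    return stdev(sizes) / mean(sizes)


def dcv(class_sizes, cluster_sizes):
    """DCV = CV1 - CV0, with CV0 from class sizes and CV1 from cluster sizes."""
    return coefficient_of_variation(cluster_sizes) - coefficient_of_variation(class_sizes)


# Hypothetical sizes (illustrative only): skewed classes, perfectly uniform clusters.
class_sizes = [30, 2, 6, 10, 2]
cluster_sizes = [10, 10, 10, 10, 10]
print(round(coefficient_of_variation(class_sizes), 3))
print(round(dcv(class_sizes, cluster_sizes), 3))
```

Under these assumptions, a strongly negative DCV means the clusters are far more uniform in size than the true classes.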
19. Normalization Experiments: The Results
- Remark
  - If we use the unnormalized measures for cluster validation, only three measures, namely R, Γ, and Γ', have strong consistency with DCV.
  - All the normalized measures show perfect consistency with DCV except for Fn and εn.
  - The value ranges are wider after normalization.
Kendall's rank correlation
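Consistency between a validation measure and DCV can be quantified with Kendall's rank correlation over a series of clustering results. The sketch below implements the simple tau-a variant, in which tied pairs count as neither concordant nor discordant; which variant the authors used is not stated, so this choice, the function name, and the numbers are assumptions for illustration.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / all pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical values (illustrative only): DCV per data set and a measure's score.
dcv_values = [-0.1, -0.4, -0.7, -1.0]
measure_values = [0.9, 0.7, 0.75, 0.3]
print(kendall_tau(dcv_values, measure_values))
```

A value of 1 means the measure ranks the clustering results in exactly the same order as DCV, and -1 means exactly the reverse order.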
20. The Impact of the Number of Clusters
- Remark
  - The values of all the measures change as the number of clusters increases.
  - The normalized measures can capture the same optimal cluster number, 5.
The data set la2
21. Outline
- Introduction
- Defective Validation Measures
- Measure Normalization
- Measure Properties
- Concluding Remarks
22. The Consistency
- The Experiment Setup
  - Data sets: 29 benchmark document data sets.
  - Tools: CLUTO.
  - Measures: Kendall's rank correlation.
- Result: Correlations of the Measures
  - The normalized measures have much stronger consistency than the unnormalized measures.
23. The Consistency, Cont'd
- Hierarchical Clustering on the Normalized Measures
  - [symbols missing in source] are equivalent.
  - [symbols missing in source] are more similar to one another.
  - [symbols missing in source] show inconsistency in varying degrees.
- Only 7 normalized measures remained!
24. The Sensitivity
- Remarks
  - All the measures show different validation results for the two clusterings except for VDn and Fn.
  - VIn is the most sensitive measure.
25. Math Properties
26. Math Properties, Cont'd
27. Math Properties, Cont'd
28. The Selection Process: An Overview
- The Way to the Right Measures
  - Step I: Discard M, MAP, and GK. 13 measures remained.
  - Step II: Filter out E, P, and MI. 10 measures remained.
  - Step III: Normalize the measures. 10 normalized measures remained.
  - Step IV: Discard three measures [symbols missing in source]. 7 normalized measures remained.
  - Step V: Filter out Fn and εn. 5 normalized measures remained.
  - Step VI: Discard FMn and Γn. 3 normalized measures remained.
- The Three Right Measures for K-means Clustering
  - Normalized van Dongen criterion (VDn)
  - Normalized variation of information (VIn)
  - Normalized Rand index (Rn)
29. Insights
- Guidance for K-means Clustering Validation
  - VDn is the most suitable, since it has a simple computational form, satisfies all the mathematically sound properties, and measures well on data with imbalanced class distributions.
  - When clustering performances are hard to distinguish, we may use VIn instead, since VIn is highly sensitive to clustering changes.
  - Rn can also be used as a complement to the above two measures.
30. Outline
- Introduction
- Defective Validation Measures
- Measure Normalization
- Measure Properties
- Concluding Remarks
31. Conclusions
- In this study, we compared and contrasted external validation measures for K-means clustering.
- It is necessary to normalize validation measures before they can be employed for clustering validation.
- We provided normalization solutions for the measures whose normalized forms were not available.
- We summarized the key properties of these measures. These properties should be considered before deciding which measure to use in practice.
- We investigated the relationships among these validation measures.
32. Thank You!
http://datamining.buaa.edu.cn