1
Adapting the Right Measures for K-means Clustering
  • Junjie Wu (wujj_at_buaa.edu.cn)
  • Beihang University

Joint work with Hui Xiong (Rutgers Univ.) and Jian
Chen (Tsinghua Univ.)
2
Outline
  • Introduction
  • Defective Validation Measures
  • Measure Normalization
  • Measure Properties
  • Concluding Remarks

3
Clustering and Cluster Validation
  • Cluster analysis provides insight into the data
    by dividing the objects into groups (clusters) of
    objects, such that objects in a cluster are more
    similar to each other than to objects in other
    clusters.
  • Cluster validation refers to procedures that
    evaluate the results of clustering in a
    quantitative and objective fashion [Jain & Dubes,
    1988].
  • How to be quantitative? Employ the measures.
  • How to be objective? Validate the measures!

4
Cluster Validation Measures
  • A Typical View of Cluster Validation Measures
      • External measures
          • Match a cluster structure to prior information,
            e.g., class labels.
          • E.g., Rand index, Γ statistics, F-measure, mutual
            information
      • Internal measures
          • Assess the fit between the structure and the data
            themselves only.
          • E.g., Silhouette index, CPCC, Γ statistics
      • Relative measures
          • Decide which of two structures is better; often
            used for selecting the right clustering
            parameters, e.g., the cluster number.
          • E.g., Dunn's indices, Davies-Bouldin index,
            partition coefficient
  • Other Views
      • Partitional Indices vs. Hierarchical Indices
      • Fuzzy Indices vs. Non-Fuzzy Indices
      • Statistics-based Indices vs. Information-based
        Indices

5
Research Motivations
  • There is little work on evaluating the
    effectiveness of cluster validation measures in a
    systematic way. Many questions remain!
  • Which measures are widely used?
  • Are these measures objective?
  • Why and how should these measures be normalized?
  • What are the properties and interrelationships of
    these measures?
  • How should the right measures be adapted for a
    specific clustering algorithm?
  • The answers to the above questions are essential
    to the success of cluster analysis!

6
The Scope of this Study
  • To provide an organized study of external
    validation measures for K-means clustering.
  • K-means is a well-known, widely used, and
    successful clustering method.
  • 16 external measures studied, 13 remained.

7
Workflow Towards Right Measures
8
Main Contributions
  • In general, we provided an organized study of
    selecting the right measures for K-means
    clustering. Specifically, we
  • Reviewed 16 well-known external validation
    measures
  • Identified some defective measures
  • Established the importance of measure
    normalization and designed normalization
    solutions for several validation measures
  • Revealed some major properties of these external
    measures, such as consistency, sensitivity, and
    symmetry properties.
  • Provided the final guidance for adapting the right
    measures for K-means clustering

9
Outline
  • Introduction
  • Defective Validation Measures
  • Measure Normalization
  • Measure Properties
  • Concluding Remarks

10
K-means: The Uniform Effect
  • For data sets with skewed distributions, K-means
    tends to produce clusters with relatively uniform
    sizes.

(Figure: results on the document data set sports)
11
A Necessary Selection Criterion
  • Two Clustering Results for a Sample Data Set

(Figure labels: CV1 = 0; CV1 = 1.125; "far away"; CV0 = 1.166)
12
Identifying Defective Measures: An Example
  • The cluster validation results
  • Now only 10 measures remain.

13
Exploring the Defectiveness
  • Entropy and Purity
  • Mutual Information

(Figure labels: $\sum_j \max_i n_{ij} / n$; $H(P|C)$)
14
Improving the Defective Measures
  • Variation of Information (VI) vs. Entropy (E)
  • van Dongen criterion (VD) vs. Purity (P)
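
The contrast between the defective measures and their replacements can be illustrated with a small sketch. It is not taken from the slides: the function name `external_measures` is hypothetical, the contingency table is assumed to have clusters as rows and classes as columns, and the formulas are the commonly used definitions (E as the conditional entropy H(C|P), P as purity, VI = H(C|P) + H(P|C), and VD = (2n - Σ_i max_j n_ij - Σ_j max_i n_ij) / (2n)), which are assumed here to match the ones in the talk.

```python
from math import log2


def _entropy(counts, n):
    """Shannon entropy (in bits) of a list of counts that sum to n."""
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)


def external_measures(table):
    """Compute E, P, VI and VD from a contingency table.

    Assumption: table[i][j] is the number of objects in cluster i
    that belong to class j.
    """
    row_sums = [sum(row) for row in table]          # cluster sizes
    col_sums = [sum(col) for col in zip(*table)]    # class sizes
    n = sum(row_sums)

    h_joint = _entropy([v for row in table for v in row], n)
    h_c_given_p = h_joint - _entropy(row_sums, n)   # H(C|P): entropy measure E
    h_p_given_c = h_joint - _entropy(col_sums, n)   # H(P|C): the term E ignores

    row_max = sum(max(row) for row in table)        # numerator of purity
    col_max = sum(max(col) for col in zip(*table))  # the term purity ignores

    return {
        "E": h_c_given_p,
        "P": row_max / n,
        "VI": h_c_given_p + h_p_given_c,
        "VD": (2 * n - row_max - col_max) / (2 * n),
    }


# Two classes of two objects each, shattered into four singleton clusters.
shattered = [[1, 0], [1, 0], [0, 1], [0, 1]]
print(external_measures(shattered))
```

On the shattered example, E and P report a perfect result (E = 0, P = 1) although every class has been split, whereas VI and VD are both greater than zero because they also account for the class-to-cluster direction.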

15
Outline
  • Introduction
  • Defective Validation Measures
  • Measure Normalization
  • Measure Properties
  • Concluding Remarks

16
Two Normalization Methods
  • Normalization makes measure values comparable across
    the clustering results of different data sets.
  • Two types of normalization schemes
      • Statistics-based normalization
      • Extreme value-based normalization
  • Basic Assumptions
      • Multivariate hypergeometric distribution

(Figure label: "fixed")
17
Normalization Solutions
  • The Normalized Measures

Type I
Type II
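
As an illustration of statistics-based normalization, the sketch below computes the well-known adjusted Rand index of Hubert and Arabie, which rescales the pair-counting index as (value - expected value) / (maximum - expected value), with the expectation taken under a hypergeometric model with fixed marginals. The transcript does not show the formulas from this slide, so treating this index as an example of the scheme is an assumption, and the function name `adjusted_rand_index` is hypothetical.

```python
from math import comb


def adjusted_rand_index(table):
    """Adjusted Rand index from a contingency table.

    Assumption: table[i][j] is the number of objects in cluster i
    that belong to class j.
    """
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)

    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two objects: nothing to disagree on

    # Pairs placed together by both the clustering and the class labels.
    together = sum(comb(v, 2) for row in table for v in row)
    cluster_pairs = sum(comb(v, 2) for v in row_sums)
    class_pairs = sum(comb(v, 2) for v in col_sums)

    expected = cluster_pairs * class_pairs / total_pairs
    maximum = (cluster_pairs + class_pairs) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster and a single class
    return (together - expected) / (maximum - expected)


print(adjusted_rand_index([[5, 0], [0, 5]]))                  # perfect match
print(adjusted_rand_index([[1, 0], [1, 0], [0, 1], [0, 1]]))  # singleton clusters
```

An extreme value-based scheme would instead rescale a measure by its minimum and maximum attainable values, without reference to an expected value.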
18
Testing Normalizations: The DCV Criterion and the
Settings
  • The DCV Criterion
      • DCV = CV1 - CV0
      • As the DCV values go down, the clustering results
        by K-means tend to deviate further from the true
        class distributions.
      • As the DCV values go down, good measures are
        expected to indicate worse clustering performance.
  • The Experimental Setup
      • Data sets: simulated and sampled, with increased
        DCV.
      • Tools
          • MATLAB 7.1
          • CLUTO 2.1.1
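
The DCV criterion can be sketched in a few lines. The transcript does not spell out CV, so this sketch assumes it is the coefficient of variation of the class or cluster sizes (sample standard deviation divided by the mean), with CV0 computed on the true classes and CV1 on the K-means clusters. The function names and the example sizes are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(sizes):
    """CV of a list of group sizes (assumption: sample standard deviation / mean)."""
    return stdev(sizes) / mean(sizes)


def dcv(cluster_sizes, class_sizes):
    """DCV = CV1 - CV0, with CV1 from the clusters and CV0 from the true classes."""
    return coefficient_of_variation(cluster_sizes) - coefficient_of_variation(class_sizes)


# Hypothetical example: skewed true classes, perfectly uniform clusters.
class_sizes = [2, 4, 6]
cluster_sizes = [4, 4, 4]
print(dcv(cluster_sizes, class_sizes))
```

A negative DCV means the cluster sizes are more uniform than the class sizes, which is the direction expected from the uniform effect of K-means.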

19
Normalization Experiments: The Results
  • Remark
      • If the unnormalized measures are used for cluster
        validation, only three measures, namely R, Γ, and
        Γ', have strong consistency with DCV.
      • All the normalized measures show perfect
        consistency with DCV except for Fn and εn.
      • The value ranges are wider after normalization.

Kendall's rank correlation
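
Consistency with DCV is assessed through Kendall's rank correlation. The sketch below is a plain implementation of the tau-a variant (concordant minus discordant pairs, divided by the total number of pairs); which variant the authors used is not stated in the transcript, and the function name and the sample values are hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equally long sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")

    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # tied pairs count as neither concordant nor discordant

    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical values: DCV per data set and one measure's score on each.
dcv_values = [-0.1, -0.4, -0.7, -1.0]
measure_values = [0.9, 0.7, 0.5, 0.2]
print(kendall_tau(dcv_values, measure_values))
```

A value close to 1 or -1 (depending on whether larger values of the measure mean better or worse clusterings) indicates that the measure ranks the clustering results in the same order as DCV does.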
20
The Impact of the Number of Clusters
  • Remark
      • The values of all the measures change as the
        number of clusters increases.
      • The normalized measures can capture the same
        optimal cluster number: 5.

(Figure: the data set la2)
21
Outline
  • Introduction
  • Defective Validation Measures
  • Measure Normalization
  • Measure Properties
  • Concluding Remarks

22
The Consistency
  • The Experiment Setup
      • Data sets: 29 benchmark document data sets.
      • Tools: CLUTO.
      • Measure: Kendall's rank correlation.
  • Result: Correlations of the Measures
      • The normalized measures have much stronger
        consistency than the unnormalized measures.

23
The Consistency, Cont'd
  • Hierarchical Clustering on the Normalized
    Measures
      • […] are equivalent.
      • […] are more similar to one another.
      • […] show inconsistency in varying degrees.
  • Only 7 normalized measures remained!

24
The Sensitivity
  • Remarks
  • All the measures show different validation
    results for the two clusterings except for VDn
    and Fn.
  • VIn is the most sensitive measure.

25
Math Properties
26
Math Properties, Cont'd
27
Math Properties, Cont'd
28
The Selection Process: An Overview
  • The Way to the Right Measures
      • Step I: Discard M, MAP, and GK. 13 measures
        remained.
      • Step II: Filter out E, P, and MI. 10 measures
        remained.
      • Step III: Normalize the measures. 10 normalized
        measures remained.
      • Step IV: Discard […]. 7 normalized measures
        remained.
      • Step V: Filter out Fn and εn. 5 normalized
        measures remained.
      • Step VI: Discard FMn and Γn. 3 normalized
        measures remained.
  • The Three Right Measures for K-means Clustering
  • Normalized van Dongen criterion (VDn)
  • Normalized variation of information (VIn)
  • Normalized Rand index (Rn)

29
Insights
  • Guidance for K-means Clustering Validation
  • VDn is the most suitable choice, since it has a
    simple computational form, satisfies all the
    mathematically sound properties, and performs
    well on data with imbalanced class
    distributions.
  • When the clustering performances are hard to
    distinguish, VIn may be used instead, since it
    is highly sensitive to changes in the
    clustering.
  • Rn can also be used as a complement to the
    above two measures.

30
Outline
  • Introduction
  • Defective Validation Measures
  • Measure Normalization
  • Measure Properties
  • Concluding Remarks

31
Conclusions
  • In this study, we compared and contrasted
    external validation measures for K-means
    clustering.
  • It is necessary to normalize validation measures
    before they can be employed for clustering
    validation.
  • We provided normalization solutions for the
    measures whose normalized forms were not
    available.
  • We summarized the key properties of these
    measures. These properties should be considered
    before deciding which measure is the right one to
    use in practice.
  • We investigated the relationships among these
    validation measures.

32
Thank You!
http://datamining.buaa.edu.cn