A Consensus-Based Clustering Method for Summarizing Diverse Data Categorizations
Hanan G. Ayad and Mohamed S. Kamel
Pattern Analysis and Machine Intelligence Lab, University of Waterloo, Canada
LORNET Theme 4 - Object Mining and Knowledge Discovery
Contributions
  • Introduction of the idea of cumulative voting for transforming a clustering into a probabilistic representation with respect to a common reference of the ensemble.
  • Definition of criteria for estimating an optimal representation for a clustering ensemble with maximum information content.
  • Building upon the Information Bottleneck principle, an optimally compressed summary of the estimated stochastic data is extracted such that maximum relevant information about the data is preserved.
  • Effectiveness of the developed cumulative voting method is demonstrated as follows:
  • Diverse cluster structures for a collection of text documents are generated with arbitrary coarse-to-fine resolutions, and a consensus solution is obtained (see the sketch after this list).
  • Comparison with equally efficient state-of-the-art consensus methods.
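For concreteness, the following is a minimal Python sketch of how such a diverse ensemble could be generated for a toy text collection. The use of scikit-learn's TfidfVectorizer and KMeans, the toy documents, and the range of cluster counts are illustrative assumptions, not the authors' experimental setup.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy corpus; in the experiments this would be a large document collection.
documents = [
    "consensus clustering of text documents",
    "voting methods for cluster ensembles",
    "information bottleneck and data compression",
    "summarizing diverse data categorizations",
]

# Represent each document by a vector of weights over the corpus vocabulary.
X = TfidfVectorizer().fit_transform(documents)

# Generate several clusterings whose diversity is induced by randomly varying
# the number of clusters (coarse-to-fine resolutions).
rng = np.random.default_rng(0)
ensemble = []
for i in range(5):
    k = int(rng.integers(2, 4))  # random resolution, kept small for the toy corpus
    ensemble.append(KMeans(n_clusters=k, n_init=10, random_state=i).fit_predict(X))

Each element of ensemble is then one clustering Ci of the collection, with its own number of clusters ki.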

Introduction
  • We seek to discover the complex categorization structure inherent in a collection of data objects by obtaining a consensus among a set of diverse cluster structures of the collection.
  • We aim to achieve this objective by developing a computationally efficient consensus method. A competitive consensus method has been demonstrated in the recent literature, but it is computationally expensive, making it unattractive to apply to large collections of data objects.

Cumulative Voting Method
  • A text document is represented by a vector of numeric weights corresponding to the words of the corpus vocabulary.
  • For a set X of n objects, a clustering Ci assigns each object to one of ki clusters, each denoted by a symbolic label.
  • Multiple clusterings C1, ..., Cb of the dataset are generated with induced diversity by randomly varying the number of clusters, obtaining k1, ..., kb clusters, respectively.
  • The clustering of the ensemble with maximum information content is selected as the initial reference clustering.
  • An iterative voting procedure is implemented as follows (a sketch is given after this list). For each clustering Ci:
  • Cumulative voting is applied, whereby each current cluster votes for each current reference cluster according to estimated probabilities.
  • Each clustering Ci is transformed to a stochastic representation with respect to the reference clustering.
  • The reference clustering is updated to represent the current estimates based on the clusterings processed so far.
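As an illustration only, here is a minimal Python sketch of one plausible reading of this voting loop. The function name cumulative_voting, the use of co-occurrence counts to estimate the vote probabilities, and the running-average update of the votes are assumptions, not the authors' exact formulation.

import numpy as np

def cumulative_voting(clusterings, reference):
    """Hypothetical sketch: accumulate per-object votes over reference clusters.

    clusterings : list of 1-D integer label arrays, one per ensemble member
    reference   : 1-D integer label array serving as the common reference
    Returns an (n_objects, k_ref) matrix of averaged vote probabilities.
    """
    n = len(reference)
    k_ref = int(reference.max()) + 1
    votes = np.zeros((n, k_ref))

    for b, labels in enumerate(clusterings, start=1):
        k_i = int(labels.max()) + 1
        # Estimate P(reference cluster | cluster of C_i) from co-occurrence counts.
        co = np.zeros((k_i, k_ref))
        for obj in range(n):
            co[labels[obj], reference[obj]] += 1
        p_ref_given_cluster = co / co.sum(axis=1, keepdims=True).clip(min=1)

        # Each object inherits its cluster's vote distribution: the stochastic
        # representation of C_i with respect to the reference clustering.
        member_votes = p_ref_given_cluster[labels]

        # Running average of the votes over the clusterings processed so far
        # (an adaptive variant would also re-derive the reference here).
        votes += (member_votes - votes) / b

    return votes

A consensus assignment can then be read off as votes.argmax(axis=1), i.e. each object goes to the reference cluster with the largest accumulated vote.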

Experimental Results
  • Based on the cumulative voting method, three variant algorithms with different properties and weighting schemes were developed:
  • Un-normalized fixed-Reference Cumulative Voting (URCV), fixed-Reference Cumulative Voting (RCV), and Adaptive Cumulative Voting (ACV). The last two use a normalized weighting scheme. The latter applies an iterative voting procedure, whereas the first two use a fixed reference.
  • The following performance measures are used, which assess the quality of the obtained consensus solution against the human categorization of the data (see the sketch after this list):
  • Adjusted Rand Index
  • Normalized Mutual Information
  • Comparison with the following consensus algorithms is shown:
  • Hyper-Graph Partitioning Algorithm (HGPA) and Meta-Clustering Algorithm (MCLA) (Strehl et al., 2002)
  • Quadratic Mutual Information Algorithm (QMI) (Topchy et al., 2005)
  • Each generated ensemble consists of 25 clusterings. Boxplots show the distribution of the performance measures over 10 runs.
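For reference only, both quality measures are available in standard libraries; the scikit-learn calls below are an assumption of convenience and not the authors' evaluation code.

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Toy example: ground-truth human categorization vs. a consensus clustering.
human_labels = [0, 0, 0, 1, 1, 2, 2, 2]
consensus_labels = [1, 1, 1, 0, 0, 2, 2, 0]

ari = adjusted_rand_score(human_labels, consensus_labels)
nmi = normalized_mutual_info_score(human_labels, consensus_labels)
print(f"Adjusted Rand Index: {ari:.3f}")
print(f"Normalized Mutual Information: {nmi:.3f}")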

The Voting Process
  • The joint statistics P(C, X) of two categorical random variables, representing the set of categories C and the set of objects X, are estimated.
  • An agglomerative information-theoretic algorithm, derived from the information bottleneck principle, is developed to extract an optimal compressed summary of the estimated probability distribution so that maximum relevant information about the data is preserved (a sketch follows this list).
  • Based on the summary, each object is assigned to its most likely category.
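As background only, the following is a minimal Python sketch of a standard agglomerative Information Bottleneck merge step in the style of Slonim and Tishby, not the authors' exact algorithm. Categories are greedily merged in order of smallest information loss, measured by the prior-weighted Jensen-Shannon divergence between their conditional distributions over objects; the names kl, merge_cost, and agglomerate are hypothetical.

import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence with the 0 log 0 = 0 convention.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def merge_cost(p_c, p_x_given_c, i, j):
    # Information loss of merging categories i and j: prior-weighted
    # Jensen-Shannon divergence between their conditionals over objects.
    w_i, w_j = p_c[i], p_c[j]
    w = w_i + w_j
    merged = (w_i * p_x_given_c[i] + w_j * p_x_given_c[j]) / w
    js = (w_i / w) * kl(p_x_given_c[i], merged) + (w_j / w) * kl(p_x_given_c[j], merged)
    return w * js

def agglomerate(p_cx, n_final):
    """Greedily compress the categories of a joint distribution P(C, X)."""
    p_c = p_cx.sum(axis=1)                    # category priors P(C)
    p_x_given_c = p_cx / p_c[:, None]         # conditionals P(X | C)
    groups = [[c] for c in range(len(p_c))]   # original categories behind each row
    while len(groups) > n_final:
        costs = [(merge_cost(p_c, p_x_given_c, i, j), i, j)
                 for i in range(len(groups)) for j in range(i + 1, len(groups))]
        _, i, j = min(costs)                  # cheapest merge loses the least information
        w = p_c[i] + p_c[j]
        p_x_given_c[i] = (p_c[i] * p_x_given_c[i] + p_c[j] * p_x_given_c[j]) / w
        p_c[i] = w
        groups[i] += groups[j]
        p_c = np.delete(p_c, j)
        p_x_given_c = np.delete(p_x_given_c, j, axis=0)
        del groups[j]
    return groups, p_c, p_x_given_c

With the compressed summary in hand, assigning each object to its most likely category amounts to an argmax over the surviving categories of P(C | X = x).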
Conclusion
  • Based on the idea of cumulative voting and the information bottleneck principle, efficient consensus clustering algorithms were developed to derive a meaningful consensus clustering from diverse clusterings of the data objects.
  • Superior accuracy compared to recent consensus algorithms is obtained.
  • Computational complexity is linear in the number of data objects.

Further Reading
  • Hanan G. Ayad and Mohamed S. Kamel. Cumulative
    Voting Consensus Method for Partitions with a
    Variable Number of Clusters. IEEE Transactions on
    Pattern Analysis and Machine Intelligence. To
    Appear.

Fourth Annual Scientific Conference, LORNET Research Network, November 4-7, 2007, Montreal, Canada.