A Consensus-Based Clustering Method for Summarizing Diverse Data Categorizations
Hanan G. Ayad and Mohamed S. Kamel
Pattern Analysis and Machine Intelligence Lab, University of Waterloo, Canada
LORNET Theme 4 - Object Mining and Knowledge Discovery
Contributions
  • Introduction of the idea of cumulative voting for transforming a clustering into a probabilistic representation with respect to a common reference of the ensemble.
  • Definition of criteria for estimating an optimal representation for a clustering ensemble with maximum information content.
  • Building upon the Information Bottleneck principle, an optimally compressed summary of the estimated stochastic data is extracted such that maximum relevant information about the data is preserved.
  • Effectiveness of the developed cumulative voting method is demonstrated as follows:
  • Diverse cluster structures for a collection of text documents are generated with arbitrary coarse-to-fine resolutions, and a consensus solution is obtained (see the sketch after this list).
  • Comparison with equally efficient state-of-the-art consensus methods.
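For concreteness, the following is a minimal Python sketch of how such a diverse ensemble could be generated for a toy text collection. The use of scikit-learn's TfidfVectorizer and KMeans, the toy documents, and the range of cluster counts are illustrative assumptions, not the authors' experimental setup.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy corpus; in the experiments this would be a large document collection.
documents = [
    "consensus clustering of text documents",
    "voting methods for cluster ensembles",
    "information bottleneck and data compression",
    "summarizing diverse data categorizations",
]

# Represent each document by a vector of weights over the corpus vocabulary.
X = TfidfVectorizer().fit_transform(documents)

# Generate several clusterings whose diversity is induced by randomly varying
# the number of clusters (coarse-to-fine resolutions).
rng = np.random.default_rng(0)
ensemble = []
for i in range(5):
    k = int(rng.integers(2, 4))  # random resolution, kept small for the toy corpus
    ensemble.append(KMeans(n_clusters=k, n_init=10, random_state=i).fit_predict(X))

Each element of ensemble is then one clustering Ci of the collection, with its own number of clusters ki.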

Introduction
  • We seek to discover the complex categorization structure inherent in a collection of data objects by obtaining a consensus among a set of diverse cluster structures of the collection.
  • We aim to achieve this objective by developing a computationally efficient consensus method. A competitive consensus method has been demonstrated in the recent literature, but it is computationally expensive, making it unattractive to apply to large collections of data objects.

Cumulative Voting Method
  • A text document is represented by a vector of numeric weights corresponding to the words of the corpus vocabulary.
  • For a set X of n objects, a clustering Ci assigns each object to one of ki clusters, each denoted by a symbolic label.
  • Multiple clusterings C1, ..., Cb of the dataset are generated with induced diversity by randomly varying the number of clusters, obtaining k1, ..., kb clusters, respectively.
  • The clustering of the ensemble with maximum information content is selected as the initial reference clustering.
  • An iterative voting procedure is implemented as follows (a sketch is given after this list). For each clustering Ci:
  • Cumulative voting is applied, whereby each current cluster votes for each current reference cluster according to estimated probabilities.
  • Each clustering Ci is transformed to a stochastic representation with respect to the reference clustering.
  • The reference clustering is updated to represent the current estimates based on the clusterings processed so far.
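As an illustration only, here is a minimal Python sketch of one plausible reading of this voting loop. The function name cumulative_voting, the use of co-occurrence counts to estimate the vote probabilities, and the running-average update of the votes are assumptions, not the authors' exact formulation.

import numpy as np

def cumulative_voting(clusterings, reference):
    """Hypothetical sketch: accumulate per-object votes over reference clusters.

    clusterings : list of 1-D integer label arrays, one per ensemble member
    reference   : 1-D integer label array serving as the common reference
    Returns an (n_objects, k_ref) matrix of averaged vote probabilities.
    """
    n = len(reference)
    k_ref = int(reference.max()) + 1
    votes = np.zeros((n, k_ref))

    for b, labels in enumerate(clusterings, start=1):
        k_i = int(labels.max()) + 1
        # Estimate P(reference cluster | cluster of C_i) from co-occurrence counts.
        co = np.zeros((k_i, k_ref))
        for obj in range(n):
            co[labels[obj], reference[obj]] += 1
        p_ref_given_cluster = co / co.sum(axis=1, keepdims=True).clip(min=1)

        # Each object inherits its cluster's vote distribution: the stochastic
        # representation of C_i with respect to the reference clustering.
        member_votes = p_ref_given_cluster[labels]

        # Running average of the votes over the clusterings processed so far
        # (an adaptive variant would also re-derive the reference here).
        votes += (member_votes - votes) / b

    return votes

A consensus assignment can then be read off as votes.argmax(axis=1), i.e. each object goes to the reference cluster with the largest accumulated vote.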

Experimental Results
  • Based on the cumulative voting method, three variant algorithms with different properties and weighting schemes were developed:
  • Un-normalized fixed-Reference Cumulative Voting (URCV), fixed-Reference Cumulative Voting (RCV), and Adaptive Cumulative Voting (ACV). The last two use a normalized weighting scheme. The latter applies an iterative voting procedure, whereas the first two use a fixed reference.
  • The following performance measures are used, which assess the quality of the obtained consensus solution against the human categorization of the data (see the sketch after this list):
  • Adjusted Rand Index
  • Normalized Mutual Information
  • Comparison with the following consensus algorithms is shown:
  • Hyper-Graph Partitioning Algorithm (HGPA) and Meta-Clustering Algorithm (MCLA) (Strehl et al., 2002)
  • Quadratic Mutual Information Algorithm (QMI) (Topchy et al., 2005)
  • Each generated ensemble consists of 25 clusterings. Boxplots show the distribution of the performance measures over 10 runs.
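For reference only, both quality measures are available in standard libraries; the scikit-learn calls below are an assumption of convenience and not the authors' evaluation code.

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Toy example: ground-truth human categorization vs. a consensus clustering.
human_labels = [0, 0, 0, 1, 1, 2, 2, 2]
consensus_labels = [1, 1, 1, 0, 0, 2, 2, 0]

ari = adjusted_rand_score(human_labels, consensus_labels)
nmi = normalized_mutual_info_score(human_labels, consensus_labels)
print(f"Adjusted Rand Index: {ari:.3f}")
print(f"Normalized Mutual Information: {nmi:.3f}")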

The Voting Process
  • The joint statistics P(C, X) of two categorical random variables, representing the set of categories C and the set of objects X, are estimated.
  • An agglomerative information-theoretic algorithm, derived from the information bottleneck principle, is developed to extract an optimal compressed summary of the estimated probability distribution so that maximum relevant information about the data is preserved (a sketch follows this list).
  • Based on the summary, each object is assigned to its most likely category.
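As background only, the following is a minimal Python sketch of a standard agglomerative Information Bottleneck merge step in the style of Slonim and Tishby, not the authors' exact algorithm. Categories are greedily merged in order of smallest information loss, measured by the prior-weighted Jensen-Shannon divergence between their conditional distributions over objects; the names kl, merge_cost, and agglomerate are hypothetical.

import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence with the 0 log 0 = 0 convention.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def merge_cost(p_c, p_x_given_c, i, j):
    # Information loss of merging categories i and j: prior-weighted
    # Jensen-Shannon divergence between their conditionals over objects.
    w_i, w_j = p_c[i], p_c[j]
    w = w_i + w_j
    merged = (w_i * p_x_given_c[i] + w_j * p_x_given_c[j]) / w
    js = (w_i / w) * kl(p_x_given_c[i], merged) + (w_j / w) * kl(p_x_given_c[j], merged)
    return w * js

def agglomerate(p_cx, n_final):
    """Greedily compress the categories of a joint distribution P(C, X)."""
    p_c = p_cx.sum(axis=1)                    # category priors P(C)
    p_x_given_c = p_cx / p_c[:, None]         # conditionals P(X | C)
    groups = [[c] for c in range(len(p_c))]   # original categories behind each row
    while len(groups) > n_final:
        costs = [(merge_cost(p_c, p_x_given_c, i, j), i, j)
                 for i in range(len(groups)) for j in range(i + 1, len(groups))]
        _, i, j = min(costs)                  # cheapest merge loses the least information
        w = p_c[i] + p_c[j]
        p_x_given_c[i] = (p_c[i] * p_x_given_c[i] + p_c[j] * p_x_given_c[j]) / w
        p_c[i] = w
        groups[i] += groups[j]
        p_c = np.delete(p_c, j)
        p_x_given_c = np.delete(p_x_given_c, j, axis=0)
        del groups[j]
    return groups, p_c, p_x_given_c

With the compressed summary in hand, assigning each object to its most likely category amounts to an argmax over the surviving categories of P(C | X = x).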
Conclusion
  • Based on the idea of cumulative voting and the information bottleneck principle, efficient consensus clustering algorithms were developed to derive a meaningful consensus clustering from diverse clusterings of the data objects.
  • Superior accuracy compared to recent consensus algorithms is obtained.
  • Computational complexity is linear in the number of data objects.

Further Reading
  • Hanan G. Ayad and Mohamed S. Kamel. Cumulative
    Voting Consensus Method for Partitions with a
    Variable Number of Clusters. IEEE Transactions on
    Pattern Analysis and Machine Intelligence. To
    Appear.

Fourth Annual Scientific Conference, LORNET Research Network, November 4-7, 2007, Montreal, Canada.