Standardized evaluation method for web clustering results

Description:

... method for web clustering results. Daniel Crabtree, Xiaoying Gao, Peter Andreae. ... The 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI' ...

1
Standardized evaluation method for web clustering
results
  • Daniel Crabtree, Xiaoying Gao, Peter Andreae.
    Standardized evaluation method for web clustering
    results. The 2005 IEEE/WIC/ACM International
    Conference on Web Intelligence (WI'05), pp.
    280-283. September 2005.
  • http://www.danielcrabtree.com/site/Research_Papers
  • Presenter: Suhan Yu

2
Introduction
  • Problem 1
  • Current search engines allow a user to retrieve
    pages that match a search query, but the number
    of results returned is often huge.
  • Many of the results may be irrelevant to the
    user's goal.
  • Solution
  • Organize the result set into clusters of
    semantically related pages.
  • Benefits
  • Get a quick overview of the entire result set.
  • Use the clusters themselves to filter the result
    set or refine the query.
  • Methods
  • K-means, Hierarchical Agglomerative Clustering,
    Link and Contents Clustering, Suffix Tree
    Clustering.

3
Introduction
  • Problem 2
  • There are many different methods for evaluating web
    clustering algorithms, but their results are often
    not comparable.
  • How can we set up a standard evaluation method for
    comparing web clusterings?
  • Three difficulties
  • The clustering may be coarse-grained or fine-grained.
  • The clusters may be disjoint or may overlap.
  • The clustering may be flat, so that all clusters
    are at the same level, or hierarchical.
  • Goal
  • Allow a fair comparison of all web clustering
    algorithms.

4
Varying Cluster Structure
  • The structure can vary
  • Flat or hierarchical
  • Coarse-grained or fine-grained
  • Disjoint or overlapping

5
Previous methods
  • There are two broad methodologies for evaluating
    clusters
  • Internal quality: evaluates a clustering only in
    terms of a function of the clusters themselves.
  • External quality: evaluates a clustering using
    external information, such as an ideal clustering.

6
Previous methods
  • Internal quality (model based, unsupervised)
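  • Example: the sum of squared errors.
  • Each cluster is represented by its centroid.
  • The sum of squared errors adds up the squared
    distance from each object to the centroid of its
    cluster.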
  • Compared clusterings are well-balanced
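
As an illustration of an internal quality measure, the sum of squared errors can be sketched as below. This is a minimal sketch using the standard definition, not code from the paper; the function names and the toy 2-D points are assumptions made for the example.

```python
from typing import Dict, List, Sequence


def centroid(points: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of the points in one cluster."""
    n = len(points)
    dims = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dims)]


def sum_squared_errors(clusters: Dict[str, List[Sequence[float]]]) -> float:
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for points in clusters.values():
        c = centroid(points)
        for p in points:
            total += sum((x - m) ** 2 for x, m in zip(p, c))
    return total


# Hypothetical toy data: two clusters of 2-D points.
clusters = {
    "a": [(0.0, 0.0), (2.0, 0.0)],
    "b": [(5.0, 5.0), (5.0, 7.0)],
}
sse = sum_squared_errors(clusters)
print(sse)
```

Note that the measure depends only on the clusters themselves; no ideal clustering is consulted.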

7
Previous methods
  • External quality (model-free, semi-supervised)
  • Purity: the ratio of the number of objects in the
    dominant category of a cluster to the total number
    of objects in that cluster.
  • Entropy
  • Mutual information
  • Precision and recall
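
Purity and entropy for a single cluster can be sketched as follows. This is a minimal sketch using the usual textbook definitions and assuming each page carries exactly one category label (the disjoint case); the function names and the labels are hypothetical.

```python
import math
from collections import Counter
from typing import List


def purity(labels: List[str]) -> float:
    """Fraction of the cluster's pages that belong to its dominant category."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def entropy(labels: List[str]) -> float:
    """Shannon entropy (base 2) of the category distribution in the cluster."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical cluster: category label of each page it contains.
cluster_labels = ["jaguar-car", "jaguar-car", "jaguar-car", "jaguar-animal"]
cluster_purity = purity(cluster_labels)
cluster_entropy = entropy(cluster_labels)
print(cluster_purity, cluster_entropy)
```

Higher purity and lower entropy both indicate a cluster dominated by one category.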

8
Measurements
  • A clustering can be less than perfect in two
    ways
  • Quality: some clusters may be of poor quality
    because they do not match any topic well.
  • Coverage: the clustering may not cover all the pages
    in the ideal clustering.
  • Methods
  • F-measure
  • Purity
  • Entropy
  • Mutual Information

9
A New Ideal Clustering
  • [Figure: ideal clustering as a hierarchy of topics,
    with levels ranging from coarse to fine; pages can
    overlap]

10
A New Ideal Clustering
  • All pages must be contained in at least one topic
    at each level.
  • Dealing with outliers
  • The hierarchy may contain a single outlier topic
    (present at every level) that contains all the
    outliers.
  • The outlier topic must be disjoint from the other
    topics.
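
The constraints above can be checked mechanically. The following is a minimal sketch, assuming the ideal clustering is stored as a list of levels, each mapping a topic name to a set of page identifiers, and that the outlier topic is named "outliers"; this representation and all names are assumptions for illustration, not taken from the paper.

```python
from typing import Dict, List, Set

Level = Dict[str, Set[str]]


def check_ideal_clustering(
    levels: List[Level], pages: Set[str], outlier: str = "outliers"
) -> List[str]:
    """Return a list of violated constraints (empty if the hierarchy is valid)."""
    problems = []
    for i, level in enumerate(levels):
        # Every page must be in at least one topic at each level.
        covered = set().union(*level.values())
        missing = pages - covered
        if missing:
            problems.append(f"level {i}: pages not in any topic: {sorted(missing)}")
        # The outlier topic must be disjoint from the other topics.
        if outlier in level:
            for name, members in level.items():
                if name != outlier and members & level[outlier]:
                    problems.append(f"level {i}: outlier topic overlaps {name}")
    # If an outlier topic exists, it must be present at every level.
    has_outlier = [outlier in level for level in levels]
    if any(has_outlier) and not all(has_outlier):
        problems.append("outlier topic is not present at every level")
    return problems


pages = {"p1", "p2", "p3", "p4"}
levels = [
    {"animals": {"p1", "p2"}, "cars": {"p2", "p3"}, "outliers": {"p4"}},
    {"big cats": {"p1"}, "zoo": {"p2"}, "cars": {"p2", "p3"}, "outliers": {"p4"}},
]
bad_levels = [{"animals": {"p1", "p2"}, "outliers": {"p2", "p4"}}]
print(check_ideal_clustering(levels, pages))
print(check_ideal_clustering(bad_levels, pages))
```

In the valid example, page p2 sits in two ordinary topics at once, which is allowed; only the outlier topic has to stay disjoint.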

11
New method
  • QC4 (quality, coverage, and 4 overall
    measurements)
  • Cluster quality
  • Topic coverage

12
Cluster quality
  • Cluster quality is a measure of how closely a
    cluster matches a single topic.
  • Entropy: the average over the topics of log
    precision.
  • Problem 1: entropy does not work with overlapping
    topics, since pages in multiple topics are
    overcounted.
  • Topics can overlap at different levels or at the
    same level.
  • F-measure
  • Use the F-measure to choose the level containing
    the topic that is most similar to the cluster.
  • Disjoint and overlapping topics are handled fairly.
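
The level-selection step can be sketched with the standard balanced F-measure. This is a minimal sketch under assumptions: the hierarchy representation (a list of levels, each mapping topic names to page sets), the function names, and the toy data are hypothetical, and the exact weighting used by QC4 is not given in these slides.

```python
from typing import Dict, List, Set, Tuple


def f_measure(cluster: Set[str], topic: Set[str]) -> float:
    """Harmonic mean of precision and recall of a cluster against a topic."""
    overlap = len(cluster & topic)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(topic)
    return 2 * precision * recall / (precision + recall)


def best_matching_topic(
    cluster: Set[str], levels: List[Dict[str, Set[str]]]
) -> Tuple[int, str, float]:
    """Return (level index, topic name, F-measure) of the most similar topic."""
    best = (-1, "", 0.0)
    for i, level in enumerate(levels):
        for name, topic in level.items():
            score = f_measure(cluster, topic)
            if score > best[2]:
                best = (i, name, score)
    return best


# Hypothetical two-level ideal clustering: coarse level first, fine level second.
levels = [
    {"animals": {"p1", "p2", "p3", "p4"}, "cars": {"p5", "p6"}},
    {"big cats": {"p1", "p2"}, "birds": {"p3", "p4"}, "cars": {"p5", "p6"}},
]
cluster = {"p1", "p2", "p5"}
level_index, topic_name, score = best_matching_topic(cluster, levels)
print(level_index, topic_name, round(score, 3))
```

Here the cluster matches the fine-grained topic better than its coarse parent, so it is judged against the fine level rather than being penalized for the granularity it happens to use.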

13
Cluster quality
  • Problem 2: cluster size.
  • Very small clusters, even if they are highly
    focused, are not very useful to users.
  • A cluster should also have good recall.
  • Problem 3: overlapping
  • Mutual information (MI)
  • MI can extract the intersections of topics
    into distinct topics.

14
Topic Coverage
  • Topic coverage is a measure of how well the pages
    in a topic are covered by the clusters.
  • A page is better covered if it is contained
    in a cluster that matches a topic that the page
    is in.
  • Precision can be a good measure of how well the
    page is covered.
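
One way to read the bullets above is sketched below: a page's coverage is the best precision, with respect to the topic, among the clusters that contain the page, and a topic's coverage is the average over its pages. This aggregation is an assumption made for illustration, not the exact QC4 formula, and the names and data are hypothetical.

```python
from typing import List, Set


def precision(cluster: Set[str], topic: Set[str]) -> float:
    """Fraction of the cluster's pages that belong to the topic."""
    return len(cluster & topic) / len(cluster)


def page_coverage(page: str, topic: Set[str], clusters: List[Set[str]]) -> float:
    """Best precision among the clusters containing the page (0 if none does)."""
    scores = [precision(c, topic) for c in clusters if page in c]
    return max(scores, default=0.0)


def topic_coverage(topic: Set[str], clusters: List[Set[str]]) -> float:
    """Average page coverage over all pages of the topic."""
    return sum(page_coverage(p, topic, clusters) for p in topic) / len(topic)


# Hypothetical topic and clustering; clusters are assumed to be non-empty.
topic = {"p1", "p2", "p3", "p4"}
clusters = [{"p1", "p2"}, {"p3", "p9"}]
coverage = topic_coverage(topic, clusters)
print(coverage)
```

Pages p1 and p2 are fully covered, p3 is half covered because its cluster is mixed, and p4 is not covered at all.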

15
Evaluation of QC4
  • Analyze the output of QC4 on a variety of
    synthetic data sets, which contain:
  • Perfect clustering
  • Near-perfect clustering
  • Set of all singleton clusters
  • Mid-sized clusters
  • Overlapping clusters
  • Clusters at different levels of granularity
  • QC4 can also be applied to real data sets
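
Two of the synthetic cases listed above, the perfect clustering and the set of all singleton clusters, can be generated from a flat ideal clustering as sketched below. This is a minimal sketch with hypothetical topics and helper names; it reports plain mean precision and recall rather than the QC4 measures, which these slides do not define in full.

```python
from typing import Dict, List, Set, Tuple


def perfect_clustering(topics: Dict[str, Set[str]]) -> List[Set[str]]:
    """One cluster per topic, containing exactly that topic's pages."""
    return [set(t) for t in topics.values()]


def singleton_clustering(topics: Dict[str, Set[str]]) -> List[Set[str]]:
    """One cluster per page."""
    return [{p} for t in topics.values() for p in sorted(t)]


def mean_precision_recall(
    clusters: List[Set[str]], topics: Dict[str, Set[str]]
) -> Tuple[float, float]:
    """Average precision and recall of each cluster against its best-overlap topic."""
    precisions, recalls = [], []
    for cluster in clusters:
        topic = max(topics.values(), key=lambda t: len(cluster & t))
        overlap = len(cluster & topic)
        precisions.append(overlap / len(cluster))
        recalls.append(overlap / len(topic))
    n = len(clusters)
    return sum(precisions) / n, sum(recalls) / n


# Hypothetical flat, disjoint ideal clustering.
topics = {"cars": {"c1", "c2", "c3", "c4"}, "animals": {"a1", "a2"}}
perfect = mean_precision_recall(perfect_clustering(topics), topics)
singletons = mean_precision_recall(singleton_clustering(topics), topics)
print(perfect, singletons)
```

The singleton case keeps perfect precision while recall collapses, which is why a measure that rewards precision alone would rate it as highly as the perfect clustering.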

16
Conclusion
  • This paper introduced QC4, a new standardized web
    clustering evaluation method.
  • QC4 minimizes method bias and applies universally
    to clusterings with different characteristics:
  • Cluster granularity
  • Clustering structure
  • Clustering size
  • Future work is to analyze the complexity of QC4.