Title: Standardized evaluation method for web clustering results

1
Standardized evaluation method for web clustering
results

Daniel Crabtree, Xiaoying Gao, Peter Andreae.
Standardized evaluation method for web clustering
results. The 2005 IEEE/WIC/ACM International
Conference on Web Intelligence (WI'05), pp.
280-283. September 2005.
http://www.danielcrabtree.com/site/Research_Papers
Presenter: Suhan Yu

2
Introduction

Problem 1
Current search engines allow a user to retrieve
pages that match a search query, but the number
of results returned is often huge.
Many of the results may be irrelevant to the
user's goal.
Solution
Organize the result set into clusters of
semantically related pages.
Benefits
Get a quick overview of the entire result set.
Use the clusters themselves to filter the result
set or refine the query.
Methods
K-means, Hierarchical Agglomerative Clustering,
Link and Contents Clustering, Suffix Tree
Clustering.

3
Introduction

Problem 2
There are many different methods for evaluating
web clustering algorithms, but their results are
often not comparable.
How can we set up a standard evaluation method
for comparing web clusterings?
Three difficulties
Clusters may be coarse or fine.
Clusters may be disjoint or may overlap.
The clustering may be flat, so that all clusters
are at the same level, or hierarchical.
Goal
Allow a fair comparison of all web clustering
algorithms.

4
Varying Cluster Structure

5
Previous methods

There are two broad methodologies for evaluating
clusters:
Internal quality: evaluates a clustering only in
terms of a function of the clusters themselves.
External quality: evaluates a clustering using
external information, such as an ideal clustering.

6
Previous methods

7
Previous methods

External quality (model-free, semi-supervised)
Purity: the ratio of the number of objects in the
dominant category to the total number of objects.
Entropy
Mutual information
Precision and recall
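
The purity definition above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper; the function name purity and the example category labels are assumptions made for the sketch.

```python
from collections import Counter

def purity(cluster_labels):
    """Fraction of objects in a cluster that belong to its dominant category.

    cluster_labels holds one (hypothetical) category label per object
    in the cluster.
    """
    if not cluster_labels:
        raise ValueError("cluster must not be empty")
    counts = Counter(cluster_labels)
    # most_common(1) returns [(label, count)] for the dominant category.
    dominant_count = counts.most_common(1)[0][1]
    return dominant_count / len(cluster_labels)

# Three of the four pages share the dominant category.
cluster = ["jaguar-car", "jaguar-car", "jaguar-animal", "jaguar-car"]
print(purity(cluster))
```

A cluster whose objects all share one category has purity 1.0; the value falls as the cluster mixes categories.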

8
Measurements

A clustering can be less than perfect in two
ways:
Quality: some clusters may be of poor quality
because they do not match any topic well.
Coverage: the clustering may not cover all the
pages in the ideal clustering.
Methods
F-measure
Purity
Entropy
Mutual Information

9
A New Ideal Clustering

(Figure: topic hierarchy with levels ranging from
coarse to fine; pages can overlap between topics.)

10
A New Ideal Clustering

All pages must be contained in at least one topic
at each level.
Dealing with outliers
The hierarchy may contain a single outlier topic
(present at every level) that contains all the
outliers.
The outlier topic must be disjoint from the other
topics.

11
New method

12
Cluster quality

Cluster quality is a measure of how closely a
cluster matches a single topic.
Entropy: the average over the topics of log
precision.
Problem 1: entropy does not work with overlapping
topics, since pages in multiple topics are
overcounted.
Topics may overlap at different levels or at the
same level.
F-measure
Use the F-measure to choose the level containing
the topic that is most similar to the cluster.
Disjoint and overlapping topics are then handled
fairly.
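
The level selection described above can be sketched as follows. This is a hedged illustration rather than the QC4 implementation: the representation of the hierarchy as a list of levels, the function names, and the example topics are all assumptions.

```python
def f_measure(cluster, topic):
    """Harmonic mean of precision and recall between two sets of pages."""
    overlap = len(cluster & topic)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(topic)
    return 2 * precision * recall / (precision + recall)

def best_matching_topic(cluster, hierarchy):
    """Return (level_index, topic_name, score) for the most similar topic.

    hierarchy is assumed to be a list of levels, coarse to fine; each
    level maps a topic name to the set of pages in that topic.
    """
    best = (None, None, 0.0)
    for level_index, level in enumerate(hierarchy):
        for name, pages in level.items():
            score = f_measure(cluster, pages)
            if score > best[2]:
                best = (level_index, name, score)
    return best

# Hypothetical two-level ideal clustering over pages numbered 1..10.
hierarchy = [
    {"animals": {1, 2, 3, 4, 5, 6}, "cars": {7, 8, 9, 10}},
    {"big cats": {1, 2, 3}, "birds": {4, 5, 6},
     "sports cars": {7, 8}, "sedans": {9, 10}},
]
cluster = {1, 2, 3, 7}
level, name, score = best_matching_topic(cluster, hierarchy)
print(level, name, round(score, 3))
```

In this example the cluster matches the fine-level topic "big cats" better than the coarse-level topic "animals", so a fine-grained cluster is compared against a fine-grained topic rather than being penalized for low recall against a coarse one.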

13
Cluster quality

Problem 2: cluster size.
Very small clusters, even if they are highly
focused, are not very useful to users.
A cluster should also have good recall.
Problem 3: overlapping topics.
MI (mutual information)
MI can handle overlap by extracting the
intersections of topics into distinct topics.
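
One way to picture the extraction of intersections is to relabel every page by the exact set of topics that contain it, so that each intersection becomes its own distinct topic, and then to compute mutual information against that disjoint labeling. The sketch below is an assumption-laden illustration, not the procedure from the paper: the function names are hypothetical, and the clustering is assumed to be disjoint (one cluster per page) to keep the MI formula simple.

```python
import math
from collections import Counter

def split_into_disjoint_topics(topics):
    """Label each page with the frozenset of topic names containing it.

    Pages in the intersection of overlapping topics get their own
    distinct label, so the resulting labeling is disjoint.
    """
    membership = {}
    for name, pages in topics.items():
        for page in pages:
            membership.setdefault(page, set()).add(name)
    return {page: frozenset(names) for page, names in membership.items()}

def mutual_information(cluster_of, topic_of):
    """Mutual information in bits between two disjoint labelings."""
    pages = cluster_of.keys() & topic_of.keys()
    n = len(pages)
    joint = Counter((cluster_of[p], topic_of[p]) for p in pages)
    cluster_counts = Counter(cluster_of[p] for p in pages)
    topic_counts = Counter(topic_of[p] for p in pages)
    mi = 0.0
    for (c, t), n_ct in joint.items():
        mi += (n_ct / n) * math.log2(n_ct * n / (cluster_counts[c] * topic_counts[t]))
    return mi

# Hypothetical overlapping topics: page 3 belongs to both.
topics = {"cars": {1, 2, 3}, "racing": {3, 4, 5}}
topic_of = split_into_disjoint_topics(topics)

# Hypothetical disjoint clustering that isolates the shared page.
cluster_of = {1: "A", 2: "A", 3: "B", 4: "C", 5: "C"}
print(mutual_information(cluster_of, topic_of))
```

Because page 3 receives its own label, it is counted once rather than once per topic, which avoids the overcounting that affects entropy on overlapping topics.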

14
Topic Coverage

Topic coverage is a measure of how well the pages
in a topic are covered by the cluster.
A page will be better covered if it is contained
in a cluster that matches a topic that the page
is in.
Precision can be a good measure of how well the
page is covered.
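
The idea above can be sketched in Python under some stated assumptions. The slide only says that precision can measure how well a page is covered; taking the best precision among the matching clusters, and averaging over the pages of the topic, are choices made for this illustration. All names are hypothetical.

```python
def precision(cluster, topic):
    """Fraction of the cluster's pages that belong to the topic."""
    return len(cluster & topic) / len(cluster) if cluster else 0.0

def topic_coverage(topic_name, topics, clusters, matched_topic):
    """Average page coverage for one topic (averaging is an assumption).

    A page counts as covered by a cluster only if the cluster contains
    the page and is matched to this topic; its coverage is the best
    precision among such clusters, or 0.0 if there are none.
    """
    topic = topics[topic_name]
    if not topic:
        return 0.0
    total = 0.0
    for page in topic:
        scores = [
            precision(pages, topic)
            for name, pages in clusters.items()
            if page in pages and matched_topic.get(name) == topic_name
        ]
        total += max(scores, default=0.0)
    return total / len(topic)

# Hypothetical data: page 4 is in the topic but in no cluster.
topics = {"big cats": {1, 2, 3, 4}}
clusters = {"A": {1, 2, 9}, "B": {3}}
matched_topic = {"A": "big cats", "B": "big cats"}
coverage = topic_coverage("big cats", topics, clusters, matched_topic)
print(round(coverage, 3))
```

Pages 1 and 2 are covered with precision 2/3 by cluster A, page 3 with precision 1 by cluster B, and page 4 is not covered at all, so the uncovered page pulls the topic's coverage down.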

15
Evaluation of QC4

16
Conclusion

This paper introduced QC4, a new standardized web
clustering evaluation method.
QC4 minimizes method bias and applies uniformly
to clusterings with different characteristics:
Cluster granularity
Clustering structure
Cluster size
Future work is to determine the complexity of
QC4.