Title: Improving Web Clustering by Cluster Selection
1 Improving Web Clustering by Cluster Selection
Daniel Crabtree, Xiaoying Gao, Peter Andreae
Victoria University of Wellington, New Zealand
2 Web Search
- Iterative Process
- Problems with Standard Web Search
  - Many Irrelevant Results
  - Single Long List
- Solution
  - Identify and Present Implicit Clusters
3 Web Clustering
Search Results for Jaguar: 1-6 of 70,000,000
1. Jaguar: Official worldwide web site of Jaguar Cars.
2. Apple - Mac OS X: The Apple Mac OS X product page.
3. Jaguar UK - R is for Racing: The essence of the Jaguar breed.
4. Jaguar: General information from Big Cats Online.
5. Jaguar AU - Jaguar Cars: Services and news.
6. Jaguar -- Defenders of Wildlife: Size, appearance, life span and diet.
Clusters: 1. Car 2. Animal 3. Mac OS 4. Other
Results shown for the Animal cluster:
4. Jaguar: General information from Big Cats Online.
6. Jaguar -- Defenders of Wildlife: Size, appearance, life span and diet.
4 Web Clustering Algorithms
- Many standard clustering algorithms
- Text-oriented clustering algorithms
  - STC - Suffix Tree Clustering
  - ESTC - Improvement on STC
5 Suffix Tree Clustering
Reference: Zamir and Etzioni
6 STC: Identify Base Clusters
7 STC: Combining Base Clusters
Merge Clusters Based On Overlap
[Figure: base clusters merged by overlap; values shown in the diagram: 30, 18, 7, 12, 6]
Merged Cluster Score is the sum of the base cluster scores
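
The merge step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `(score, documents)` representation and the function name are hypothetical, and the 0.5 overlap threshold applied in both directions is assumed here, following the usual description of STC. Base clusters that overlap sufficiently are linked, and each connected group becomes one merged cluster whose score is the plain sum of its base cluster scores.

```python
from itertools import combinations


def merge_base_clusters(base_clusters, threshold=0.5):
    """Merge base clusters by overlap (illustrative sketch).

    base_clusters: list of (score, set_of_doc_ids) pairs (hypothetical layout).
    Returns a list of (summed_score, union_of_doc_ids) pairs.
    """
    n = len(base_clusters)
    parent = list(range(n))  # union-find forest over base cluster indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Link two base clusters when each shares more than `threshold`
    # of its documents with the other (assumed similarity rule).
    for i, j in combinations(range(n), 2):
        a, b = base_clusters[i][1], base_clusters[j][1]
        shared = len(a & b)
        if shared > threshold * len(a) and shared > threshold * len(b):
            parent[find(i)] = find(j)

    # Each connected group becomes one merged cluster; its score is the
    # sum of the base cluster scores, as stated on the slide.
    merged = {}
    for i, (score, docs) in enumerate(base_clusters):
        root = find(i)
        total, union = merged.get(root, (0, set()))
        merged[root] = (total + score, union | docs)
    return list(merged.values())
```

With made-up input, `merge_base_clusters([(5, {1, 2, 3, 4}), (3, {2, 3, 4, 5}), (2, {8, 9})])` merges the first two base clusters into one cluster scoring 8 and leaves the third on its own.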
8 STC: Rank/Select Clusters
- Sort Clusters by Score
- Select Best N
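
As a small sketch of this ranking step, assuming each cluster is a hypothetical `(score, documents)` pair, selection amounts to a sort followed by a slice:

```python
def select_top_n(clusters, n):
    """Return the n highest-scoring clusters (illustrative sketch).

    clusters: list of (score, set_of_doc_ids) pairs (hypothetical layout).
    """
    # Sort by score, highest first, and keep the first n entries.
    return sorted(clusters, key=lambda cluster: cluster[0], reverse=True)[:n]
```

Because only the score is consulted, nothing prevents the selected clusters from covering largely the same documents.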
9 Problems with STC
- STC is better than many other algorithms
- BUT not good enough
- Scores
  - Poor Cluster Quality Measure
- Selection
  - Poor Coverage
  - Excessive Overlap
10 ESTC: Better Cluster Scoring
- Base Cluster Scores OK
- Combined Cluster Scores BAD
  - Overlap between clusters is over-counted in the sum
  - Example - Particularly Similar Pages
11 ESTC: Scoring Solution
- Solution
  - Eliminate the over-counting of the overlap
- Merged Cluster Score
  - Sum over document scores
- Document Score
  - Average phrase score of the base clusters in the merged cluster that contain the document
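
The scoring rule above can be sketched as follows. The slides do not define how a phrase score is computed, so this sketch simply assumes that each base cluster carries one numeric phrase score supplied by the caller; the function name and data layout are hypothetical.

```python
from collections import defaultdict


def estc_merged_score(base_clusters):
    """Score one merged cluster by summing per-document scores (sketch).

    base_clusters: list of (phrase_score, set_of_doc_ids) pairs for the
    base clusters that make up a single merged cluster (assumed layout).
    """
    # Collect, for every document, the phrase scores of the base
    # clusters in this merged cluster that contain it.
    scores_per_doc = defaultdict(list)
    for phrase_score, docs in base_clusters:
        for doc in docs:
            scores_per_doc[doc].append(phrase_score)

    # Document score = average of those phrase scores;
    # merged cluster score = sum over the document scores.
    return sum(sum(s) / len(s) for s in scores_per_doc.values())
```

For two base clusters with phrase scores 2.0 and 4.0 over documents `{1, 2, 3}` and `{2, 3, 4}`, the document scores are 2.0, 3.0, 3.0 and 4.0, so the merged score is 12.0. Each document contributes once, however many base clusters it appears in.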
12 ESTC: Better Cluster Selection
- Top N Clusters BAD
  - Dominant Topic over-represented
13 ESTC: Smarter Selection, The Search
- ESTC Smarter selection
- Heuristic
  - Minimize Overlap
  - Maximize Coverage
14 ESTC: The Search
- Incremental
- Greedy
- Look-ahead Protection
- Sophisticated Branch and Bound Pruning
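
A minimal sketch of the incremental, greedy part of such a search is given below. It is not the ESTC heuristic itself, which these slides do not spell out: the gain function (newly covered documents minus a weighted count of already covered ones), the `overlap_weight` parameter, and the function name are all assumptions made for illustration. Look-ahead and branch and bound pruning are omitted.

```python
def greedy_select(clusters, max_clusters, overlap_weight=1.0):
    """Greedily pick clusters that add coverage with little overlap (sketch).

    clusters: dict mapping a cluster name to its set of doc ids.
    Returns the list of chosen cluster names, in selection order.
    """
    chosen, covered = [], set()
    remaining = dict(clusters)

    while remaining and len(chosen) < max_clusters:
        def gain(name):
            docs = remaining[name]
            # Reward new documents, penalise documents already covered
            # (hypothetical heuristic, not taken from the slides).
            return len(docs - covered) - overlap_weight * len(docs & covered)

        # Sorting the names makes tie-breaking deterministic.
        best = max(sorted(remaining), key=gain)
        if gain(best) <= 0:
            break  # no remaining cluster improves the selection
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen
```

With clusters `car`, `car dealers`, `animal` and `mac os`, where `car dealers` is a subset of `car`, the sketch picks `car`, `animal` and `mac os`, skipping the redundant cluster that a plain top-N ranking by size would have kept.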
15 Evaluation Method
- Gold Standard - Ideal Clustering
- 2 Searches and 2 Types of Input Data
  - Jaguar and Salsa
  - Snippets and Full Text
- Precision
  - Cluster accuracy against the best matching ideal cluster
- Recall
  - Coverage of ideal cluster in matched clusters
- F-measure
  - Combination of precision and recall
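
One way these measures could be computed is sketched below. Several details are assumptions rather than the authors' stated method: each cluster is matched to the ideal cluster it shares the most documents with, precision and recall are averaged without weighting, and the F-measure is taken as the harmonic mean of the two. The function name is hypothetical.

```python
def evaluate(clusters, ideals):
    """Precision, recall and F-measure against a gold standard (sketch).

    clusters, ideals: non-empty lists of non-empty sets of doc ids.
    Returns (precision, recall, f_measure).
    """
    # Match each cluster to the ideal cluster it overlaps most
    # (assumed matching rule).
    matches = [
        max(range(len(ideals)), key=lambda i: len(c & ideals[i]))
        for c in clusters
    ]

    # Precision: accuracy of each cluster against its matched ideal cluster.
    precisions = [len(c & ideals[m]) / len(c) for c, m in zip(clusters, matches)]
    precision = sum(precisions) / len(precisions)

    # Recall: how much of each ideal cluster its matched clusters cover.
    recalls = []
    for i, ideal in enumerate(ideals):
        covered = set().union(*[c for c, m in zip(clusters, matches) if m == i])
        recalls.append(len(covered & ideal) / len(ideal))
    recall = sum(recalls) / len(recalls)

    # F-measure: harmonic mean of precision and recall (assumed form).
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For ideal clusters `{1, 2, 3, 4}` and `{5, 6}` and produced clusters `{1, 2, 5}`, `{3, 4}` and `{6}`, this gives a precision of 8/9, a recall of 0.75 and an F-measure of 48/59.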
16 Results: STC, STC-NS, ESTC
Jaguar Full Text Clustering Results
17 Results: ESTC vs Grokker
- Similar performance without page titles
- Page titles are often very useful

Algorithm  Input                     F-measure
ESTC       Snippets                  58
Grokker    Snippets and Page Titles  62
ESTC       Full Text                 74
18 Conclusions
- ESTC has
  - A new cluster scoring function
  - A new cluster selection algorithm
- ESTC is better than STC, and compares favourably with Grokker.
- The ESTC scoring function is applicable to any agglomerative clustering algorithm.
- The ESTC cluster selection algorithm is more widely applicable.
19 Future Work
- Make improvements to other stages of STC
  - Particularly Combining Base Clusters
- Apply cluster selection method to other algorithms
- Improve cluster selection heuristic