Web Document Clustering

Provided by: sangche

Learn more at: http://homepage.divms.uiowa.edu

1
Web Document Clustering

By Sang-Cheol Seok

2
1. Introduction: What is Web document clustering, and why use it?

Two results for the same query, "amazon":
Google: currently the most powerful search engine.
Metacrawler: a search engine which clusters the retrieved Web documents.

3
2. Approaches

Using contents of documents
Using users' usage logs
Using current search engines
Using hyperlinks
Other classical methods

4
(1) Using Contents of Documents

Create clusters based on the snippets returned by Web search engines.
Clusters based on snippets are almost as good as clusters created using the full text of Web documents.
Suffix Tree Clustering (STC): an incremental, O(n)-time algorithm.
STC has three logical steps: (1) document cleaning, (2) identifying base clusters using a suffix tree, and (3) combining these base clusters into clusters.
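
The three STC steps can be sketched roughly as follows. This is a simplified illustration, not the STC implementation itself: it enumerates short phrases directly instead of building a suffix tree (so it is not O(n)), and the merge rule (link two base clusters when more than half of each one's documents are shared) and the names `clean`, `base_clusters`, and `merge_clusters` are assumptions made for the example.

```python
import re
from collections import defaultdict

def clean(snippet):
    # Step 1 (document cleaning): lowercase and keep word tokens only.
    return re.findall(r"[a-z0-9]+", snippet.lower())

def base_clusters(snippets, max_len=3, min_docs=2):
    # Step 2 (base clusters): map every phrase of up to max_len words
    # to the set of snippets that contain it. A suffix tree does this
    # in linear time; plain enumeration is used here for brevity.
    phrase_docs = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        words = clean(snippet)
        for i in range(len(words)):
            for n in range(1, max_len + 1):
                if i + n <= len(words):
                    phrase_docs[" ".join(words[i:i + n])].add(doc_id)
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}

def merge_clusters(base, threshold=0.5):
    # Step 3 (combining): link base clusters whose document sets overlap
    # heavily in both directions, then take connected components.
    phrases = list(base)
    parent = {p: p for p in phrases}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for i, a in enumerate(phrases):
        for b in phrases[i + 1:]:
            common = len(base[a] & base[b])
            if common / len(base[a]) > threshold and common / len(base[b]) > threshold:
                parent[find(a)] = find(b)

    merged = defaultdict(set)
    for p in phrases:
        merged[find(p)] |= base[p]
    return sorted(sorted(docs) for docs in merged.values())
```

With the made-up snippets "amazon rainforest river", "amazon river basin", "amazon books online", and "buy books online", this sketch returns `[[0, 1, 2], [2, 3]]`: the third snippet lands in both clusters, which illustrates the overlap that snippet-based clustering allows.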

5
(2) Using users' usage logs
Cluster 1: /shuttle/missions/41-c/news, /shuttle/missions/61-b
Cluster 2: /history/apollo/sa-2/news/, /history/apollo/sa-2/images
Cluster 3: /software/winvn/userguide/3_3_2.htm, /software/winvn/userguide/3_3_4.htm
.

Advantage: relevancy information is objectively reflected by the usage logs.
An experimental result on www.nasa.gov/
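
The slide does not say how the log-based clusters were computed. One plausible reading, sketched here purely as an assumption, is that pages visited in the same user sessions are grouped together. The function name `cluster_by_co_access`, the Jaccard threshold, and the session format are all hypothetical.

```python
from itertools import combinations

def cluster_by_co_access(sessions, threshold=0.5):
    # Record which sessions visited each page.
    visitors = {}
    for session_id, pages in enumerate(sessions):
        for page in set(pages):
            visitors.setdefault(page, set()).add(session_id)

    parent = {page: page for page in visitors}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Link two pages when the Jaccard similarity of their visitor
    # sets reaches the threshold.
    for a, b in combinations(sorted(visitors), 2):
        shared = len(visitors[a] & visitors[b])
        total = len(visitors[a] | visitors[b])
        if shared / total >= threshold:
            parent[find(a)] = find(b)

    # Connected pages form one cluster.
    clusters = {}
    for page in visitors:
        clusters.setdefault(find(page), []).append(page)
    return sorted(sorted(c) for c in clusters.values())
```

Because the similarity comes only from what users actually visited together, no page content is inspected, which matches the stated advantage of letting the logs reflect relevancy.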

6
(3) Using current web search engines: Metacrawler

Step 1: When MetaCrawler receives a query, it posts the query to multiple search engines in parallel.
Step 2: It performs sophisticated pruning on the responses returned (it can prune 75% of the returned responses as irrelevant, outdated, or unavailable).
MetaCrawler was developed at the University of Washington.
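
The two steps can be sketched as below. This is only an illustration of the idea, not MetaCrawler's code: the `engines` callables (each returning `(url, text)` pairs), the `is_available` check, and the simple "text must mention the query" relevance test are hypothetical stand-ins for the real search back ends and pruning rules.

```python
from concurrent.futures import ThreadPoolExecutor

def metasearch(query, engines, is_available):
    # Step 1: post the query to every engine in parallel.
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))

    # Step 2: prune duplicates, results whose text does not mention
    # the query, and pages that are no longer available.
    seen, kept = set(), []
    for results in result_lists:
        for url, text in results:
            if url in seen:
                continue
            seen.add(url)
            if query.lower() in text.lower() and is_available(url):
                kept.append(url)
    return kept
```

In practice the availability check would fetch each page, which is slow; running it in the same thread pool would be a natural extension of this sketch.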

7
(4) Using hyperlinks

Consider Web documents as vertices and the hyperlinks as directed edges in a directed graph.
A similarity-based clustering method has been used successfully in image segmentation.
Kleinberg's HITS algorithm:
is based purely on hyperlink information.
finds authority and hub documents for a user query.
covers only the most popular topics and leaves out the less popular ones.
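
As a minimal sketch of the hub and authority computation in HITS: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to, repeated until the scores settle. The graph representation (a dict from each page to its outgoing links) and the fixed iteration count are assumptions made for the example.

```python
import math

def hits(links, iterations=50):
    # links maps each page to the list of pages it points to.
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of all in-linking pages.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub update: sum the authority scores of all linked-to pages.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score vectors to unit length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub
```

For a toy graph such as `{"h1": ["a", "b"], "h2": ["a"], "a": [], "b": []}`, page `a` ends up with the highest authority score and `h1` with the highest hub score.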

8
(4) Using hyperlinks (continued)

Cluster Web documents based on both the textual and the hyperlink information.
The hyperlink structure is used as the dominant factor in the similarity metric.
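
The slide does not give the similarity metric itself. One simple way to make the hyperlink structure dominant, shown here only as an assumed illustration, is a weighted sum in which the link term carries more than half of the weight. The weight of 0.7, the Jaccard measure over link sets, and the cosine measure over word counts are all choices made for this sketch.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    # Textual similarity: cosine of the two word-count vectors.
    ca, cb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def jaccard(links_a, links_b):
    # Hyperlink similarity: share of linked pages the two documents have in common.
    a, b = set(links_a), set(links_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def combined_similarity(text_a, text_b, links_a, links_b, link_weight=0.7):
    # link_weight > 0.5 makes the hyperlink term the dominant factor.
    return link_weight * jaccard(links_a, links_b) + (1 - link_weight) * cosine(text_a, text_b)
```

Under this weighting, two documents with identical links but unrelated text score 0.7, while two documents with identical text but unrelated links score only 0.3.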

9
(5) Other classical clustering methods

K-means method
HAC (hierarchical agglomerative clustering)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Single-link, group-average, complete-link, and single-pass methods, as well as Buckshot and Fractionation, have also been used.
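
To make the first of these methods concrete, here is a minimal K-means sketch on numeric vectors. For Web documents the vectors would typically be term-weight vectors; the choice of the first k points as starting centroids and the fixed iteration count are assumptions that keep the example short and deterministic.

```python
def kmeans(points, k, iterations=20):
    # Start from the first k points as centroids (a simplification;
    # random or smarter seeding is common in practice).
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)

    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment
```

Note that K-means puts every document in exactly one cluster, so it cannot produce the overlapping clusters that some Web clustering methods allow.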

10
3. Key requirements and future challenges

(1) key requirements for Web document clustering
methods
Relevance
Browsable Summaries
Overlap
Speed
Incrementality for some methods.

11
3. Key requirements and future challenges (continued)

(2) Concerns about current methods
Each method has pros and cons.
Using hyperlinks: the best accuracy, with some room still to improve, but it does not produce overlapping clusters.
STC: best for browsing and for incrementality.
Metacrawler: best at pruning.

12
3. Key requirements and future challenges (continued)