Title: Web Document Clustering
1. Introduction: Web document clustering? Why?
- Two results for the same query, "amazon":
- Google: currently the most powerful search engine.
- MetaCrawler: a search engine which clusters retrieved web documents.
2. Approaches
- Using contents of documents
- Using users' usage logs
- Using current search engines
- Using hyperlinks
- Other classical methods
(1) Using Contents of Documents
- Create clusters based on the snippets returned by web search engines.
- Clusters based on snippets are almost as good as clusters created using the full text of Web documents.
- Suffix Tree Clustering (STC): an incremental, O(n)-time algorithm.
- Three logical steps: (1) document cleaning, (2) identifying base clusters using a suffix tree, and (3) combining these base clusters into clusters.
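The three steps can be illustrated with a minimal sketch. It is not the STC implementation: it replaces the suffix tree with a plain dictionary of word n-grams (so it is not O(n)), and the stopword list, the `max_len` and `min_docs` parameters, and the 0.5 overlap rule for merging are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Assumed, deliberately tiny stopword list.
STOPWORDS = {"a", "an", "the", "of", "and", "to", "in", "for", "on"}

def clean(snippet):
    """Step 1: lowercase, strip punctuation, drop stopwords."""
    words = re.findall(r"[a-z0-9]+", snippet.lower())
    return [w for w in words if w not in STOPWORDS]

def base_clusters(snippets, max_len=3, min_docs=2):
    """Step 2 (simplified): map each phrase shared by several snippets
    to the set of snippets containing it. A real STC uses a suffix tree."""
    phrase_docs = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        words = clean(snippet)
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase_docs[tuple(words[i:i + n])].add(doc_id)
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}

def merge_clusters(base, overlap=0.5):
    """Step 3: merge base clusters whose document sets mostly overlap
    (assumed rule: shared documents exceed `overlap` of both sets)."""
    phrases = list(base)
    parent = list(range(len(phrases)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            a, b = base[phrases[i]], base[phrases[j]]
            shared = len(a & b)
            if shared / len(a) > overlap and shared / len(b) > overlap:
                parent[find(i)] = find(j)

    merged = defaultdict(set)
    for i, phrase in enumerate(phrases):
        merged[find(i)] |= base[phrase]
    return sorted(sorted(docs) for docs in merged.values())
```

Calling `merge_clusters(base_clusters(snippets))` on a list of snippet strings returns lists of snippet indices. A snippet may appear in more than one cluster.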
(2) Using users' usage logs
- Advantage: relevancy information is objectively reflected by the usage logs.
- An experimental result on www.nasa.gov/:
Cluster 1: /shuttle/missions/41-c/news, /shuttle/missions/61-b
Cluster 2: /history/apollo/sa-2/news/, /history/apollo/sa-2/images
Cluster 3: /software/winvn/userguide/3_3_2.htm, /software/winvn/userguide/3_3_4.htm
...
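One way clusters like these could be derived from a log is to group pages that tend to be visited in the same sessions. The slide does not say which algorithm produced the result, so the sketch below is only an assumed approach: Jaccard similarity over visiting sessions with single-link merging. The function name, the `threshold` value, and the session format are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def cluster_by_covisits(sessions, threshold=0.5):
    """Group URLs whose sets of visiting sessions have a Jaccard
    similarity of at least `threshold` (single-link merging)."""
    # Record which sessions visited each URL.
    visitors = defaultdict(set)
    for session_id, urls in enumerate(sessions):
        for url in urls:
            visitors[url].add(session_id)

    # Union-find over URLs.
    parent = {url: url for url in visitors}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for u, v in combinations(sorted(visitors), 2):
        a, b = visitors[u], visitors[v]
        if len(a & b) / len(a | b) >= threshold:
            parent[find(u)] = find(v)

    groups = defaultdict(list)
    for url in sorted(visitors):
        groups[find(url)].append(url)
    return sorted(groups.values())
```

Each element of `sessions` is the list of paths requested in one user session. Real logs would first need to be split into sessions, which this sketch does not attempt.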
(3) Using current web search engines: MetaCrawler
- Step 1: When MetaCrawler receives a query, it posts the query to multiple search engines in parallel.
- Step 2: It performs sophisticated pruning on the responses returned (pruning 75% of the returned responses as irrelevant, outdated, or unavailable).
- MetaCrawler was developed at the U. of Washington.
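The two steps can be sketched as a parallel fan-out followed by a merge-and-prune pass. This is not MetaCrawler's code: `engine_a` and `engine_b` are hypothetical stand-ins for real search engines, the result format is assumed, and a simple score threshold takes the place of the "sophisticated pruning" the slide mentions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical engines returning canned results; a real system would
# issue network requests here.
def engine_a(query):
    return [{"url": "http://example.com/a", "score": 0.9},
            {"url": "http://example.com/b", "score": 0.2}]

def engine_b(query):
    return [{"url": "http://example.com/a", "score": 0.7},
            {"url": "http://example.com/c", "score": 0.6}]

def metasearch(query, engines, min_score=0.5):
    # Step 1: post the query to every engine in parallel.
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))

    # Step 2 (simplified): collapse duplicate URLs, keeping the best
    # score, then prune anything scoring below `min_score`.
    best = {}
    for results in result_lists:
        for hit in results:
            url = hit["url"]
            best[url] = max(best.get(url, 0.0), hit["score"])
    kept = [(url, score) for url, score in best.items() if score >= min_score]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))
```

A call such as `metasearch("amazon", [engine_a, engine_b])` returns the surviving `(url, score)` pairs, best first.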
(4) Using hyperlinks
- Consider web documents as vertices and the hyperlinks as directed edges in a directed graph.
- A similarity-based clustering method was successfully used in image segmentation.
- Kleinberg's HITS algorithm:
- Based purely on hyperlink information.
- Finds authority and hub documents for a user query.
- Covers only the most popular topics and leaves out the less popular ones.
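The hub and authority scores in HITS come from a simple iteration on the link graph: a page's authority is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to. The sketch below assumes the graph is given as a dictionary of out-links. The function name and the fixed iteration count are illustrative choices, not part of the original slides.

```python
import math

def hits(links, iterations=50):
    """Compute authority and hub scores for a directed graph given as
    {page: [pages it links to]}."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority update: sum the hub scores of pages that link in.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}

        # Hub update: sum the authority scores of pages linked to.
        hub = {n: sum(auth[t] for t in links.get(n, ())) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}

    return auth, hub
```

In a graph where most pages point at one document, that document receives the top authority score. This also illustrates why the method favours the most popular topic in a result set.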
(4) Using hyperlinks, continued
- Cluster web documents based on both textual and hyperlink information.
- The hyperlink structure is used as the dominant factor in the similarity metric.
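A combined metric of this kind might look like the following sketch. The slides do not give the actual formula, so everything here is an assumption: cosine similarity for text, Jaccard overlap of out-links for hyperlinks, and a weighted sum in which `link_weight` above 0.5 makes the hyperlink term dominant.

```python
import math

def text_similarity(terms_a, terms_b):
    """Cosine similarity between two term-count dictionaries."""
    dot = sum(count * terms_b.get(term, 0) for term, count in terms_a.items())
    norm_a = math.sqrt(sum(c * c for c in terms_a.values()))
    norm_b = math.sqrt(sum(c * c for c in terms_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def link_similarity(links_a, links_b):
    """Jaccard overlap between the sets of pages two documents link to."""
    union = links_a | links_b
    return len(links_a & links_b) / len(union) if union else 0.0

def combined_similarity(doc_a, doc_b, link_weight=0.7):
    """Weighted sum in which hyperlinks dominate (link_weight > 0.5).
    Documents are dicts with hypothetical keys "terms" and "links"."""
    if not 0.5 < link_weight <= 1.0:
        raise ValueError("link_weight must be in (0.5, 1.0] so links dominate")
    text = text_similarity(doc_a["terms"], doc_b["terms"])
    link = link_similarity(doc_a["links"], doc_b["links"])
    return link_weight * link + (1.0 - link_weight) * text
```

The resulting pairwise scores could then be fed to any similarity-based clustering method.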
(5) Other classical clustering methods
- K-means method
- HAC (hierarchical agglomerative clustering)
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- Single-link, group-average, and complete-link methods, single-pass methods, and Buckshot and Fractionation have also been used.
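As an illustration of the first method in this list, here is a minimal k-means sketch over numeric vectors (for web documents these would typically be term-weight vectors). It assumes squared Euclidean distance and takes the first `k` points as initial centroids so that the result is deterministic. Practical implementations usually pick the initial centroids at random.

```python
def kmeans(points, k, iterations=20):
    """Assign each point to the nearest of k centroids, then move each
    centroid to the mean of its points; repeat a fixed number of times."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)

    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for each point.
        assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    return assignment, centroids
```

Unlike STC, this assigns every document to exactly one cluster, so it cannot produce overlapping clusters.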
3. Key requirements and future challenges
- (1) Key requirements for Web document clustering methods:
- Relevance
- Browsable summaries
- Overlap
- Speed
- Incrementality, for some methods.
3. Key requirements and future challenges, continued
- (2) Concerns about current methods:
- Each method has pros and cons.
- Using hyperlinks: the best accuracy, though there is still some room to improve, and it does not produce overlapping clusters.
- STC: best for browsing and for incrementality.
- MetaCrawler: best at pruning.
3. Key requirements and future challenges, continued
- Future challenges:
- We cannot take advantage of all the pros of each method.
- Some pros work against other pros.
- So, we have to make trade-offs.
- Moreover, we need to find improvements.