Title: Web Document Clustering


1
Web Document Clustering
  • By Sang-Cheol Seok

2
1. Introduction: What is Web document clustering, and why?
  • Two results for the same query, "amazon":
  • Google: currently the most powerful search
    engine.
  • MetaCrawler: a search engine that clusters the
    retrieved web documents.

3
2. Approaches
  • Using contents of documents
  • Using users' usage logs
  • Using current search engines
  • Using hyperlinks
  • Other classical methods

4
(1) Using Contents of Documents
  • Create clusters based on the snippets returned by
    web search engines.
  • Clusters based on snippets are almost as good as
    clusters created using the full text of Web
    documents.
  • Suffix Tree Clustering (STC): an incremental,
    O(n)-time algorithm.
  • Three logical steps: (1) document cleaning, (2)
    identifying base clusters using a suffix tree,
    and (3) combining these base clusters into
    clusters.
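
The three steps above can be sketched roughly as follows. This is a simplified illustration, not the STC algorithm itself: a plain word n-gram index stands in for the suffix tree, so it is neither incremental nor O(n). The stopword list, the 0.5 overlap threshold, and all function names are assumptions made for the example.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "the", "of", "and", "in", "to", "for", "is"}  # illustrative only

def clean(snippet):
    """Step 1: lowercase, strip punctuation, drop stopwords."""
    words = re.findall(r"[a-z0-9]+", snippet.lower())
    return [w for w in words if w not in STOPWORDS]

def base_clusters(docs, max_len=3, min_docs=2):
    """Step 2: map each shared phrase to the set of documents containing it.
    (An n-gram index replaces the suffix tree in this sketch.)"""
    index = defaultdict(set)
    for doc_id, words in enumerate(docs):
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                index[tuple(words[i:i + n])].add(doc_id)
    return {p: ids for p, ids in index.items() if len(ids) >= min_docs}

def merge(bases, threshold=0.5):
    """Step 3: link base clusters whose document sets mostly overlap,
    then return the connected components as final clusters."""
    phrases = list(bases)
    parent = list(range(len(phrases)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            a, b = bases[phrases[i]], bases[phrases[j]]
            common = len(a & b)
            if common / len(a) > threshold and common / len(b) > threshold:
                parent[find(i)] = find(j)

    groups = defaultdict(set)
    for i, phrase in enumerate(phrases):
        groups[find(i)] |= bases[phrase]
    return sorted(sorted(g) for g in groups.values())

snippets = [
    "Cheap flights to Paris",
    "Paris cheap flights and hotels",
    "Python suffix tree tutorial",
    "Suffix tree construction in Python",
]
docs = [clean(s) for s in snippets]
clusters = merge(base_clusters(docs))
```

Here `clusters` holds lists of snippet indices. Because a document can belong to several base clusters, this style of merging can in principle yield overlapping clusters.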

5
(2) Using users' usage logs
  • Advantage: relevancy information is objectively
    reflected by the usage logs.
  • An experimental result on www.nasa.gov/:
    Cluster 1: /shuttle/missions/41-c/news, /shuttle/missions/61-b
    Cluster 2: /history/apollo/sa-2/news/, /history/apollo/sa-2/images
    Cluster 3: /software/winvn/userguide/3_3_2.htm, /software/winvn/userguide/3_3_4.htm
    ...
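
The slide does not say how the log-based clusters were computed. One possible approach, shown here only as a sketch, groups URLs that were visited by largely the same sessions. The session data is invented for illustration, and the Jaccard measure, the 0.5 threshold, and the function names are assumptions.

```python
from collections import defaultdict

# Toy access log of (session_id, url) pairs; the sessions are made up.
log = [
    ("s1", "/shuttle/missions/41-c/news"), ("s1", "/shuttle/missions/61-b"),
    ("s2", "/shuttle/missions/41-c/news"), ("s2", "/shuttle/missions/61-b"),
    ("s3", "/history/apollo/sa-2/news/"), ("s3", "/history/apollo/sa-2/images"),
    ("s4", "/history/apollo/sa-2/news/"), ("s4", "/history/apollo/sa-2/images"),
    ("s4", "/shuttle/missions/61-b"),
]

def jaccard(a, b):
    """Share of sessions two URLs have in common."""
    return len(a & b) / len(a | b)

def cluster_by_usage(log, threshold=0.5):
    # Collect the set of sessions that visited each URL.
    visitors = defaultdict(set)
    for session, url in log:
        visitors[url].add(session)
    # Greedy pass: join the first cluster containing a similar enough URL.
    clusters = []
    for url in sorted(visitors):
        for cluster in clusters:
            if any(jaccard(visitors[url], visitors[other]) >= threshold
                   for other in cluster):
                cluster.append(url)
                break
        else:
            clusters.append([url])
    return clusters

usage_clusters = cluster_by_usage(log)
```

No page content is inspected at all; the grouping comes entirely from who visited what, which is the sense in which relevance is reflected by the logs.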

6
(3) Using current web search engines: MetaCrawler
  • Step 1: When MetaCrawler receives a query, it
    posts the query to multiple search engines in
    parallel.
  • Step 2: It performs sophisticated pruning on the
    responses returned (pruning up to 75% of the
    returned responses as irrelevant, outdated, or
    unavailable).
  • MetaCrawler was developed at the U. of Washington.
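
The two steps can be illustrated with a small sketch. The engine functions below are hypothetical stubs rather than real search APIs, and the pruning shown (dropping duplicates and unreachable pages) is much simpler than what MetaCrawler is described as doing.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real search engines.
# Each returns (url, is_reachable) pairs for a query.
def engine_a(query):
    return [("http://example.com/a", True), ("http://example.com/dead", False)]

def engine_b(query):
    return [("http://example.com/a", True), ("http://example.com/b", True)]

def metasearch(query, engines):
    # Step 1: post the query to every engine in parallel.
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))
    # Step 2: prune duplicates and unavailable pages, keeping first-seen order.
    seen, merged = set(), []
    for results in result_lists:
        for url, reachable in results:
            if reachable and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

hits = metasearch("amazon", [engine_a, engine_b])
```

A fuller version would also need a relevance check and a per-engine timeout, both omitted here.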

7
(4) Using hyperlinks
  • Consider web documents as vertices and the
    hyperlinks as directed edges in a directed graph.
  • A similarity-based clustering method that was
    used successfully in image segmentation.
  • Kleinberg's HITS algorithm:
  • It is based purely on hyperlink information.
  • It finds authority and hub documents for a user
    query.
  • It covers only the most popular topics and leaves
    out the less popular ones.
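
As an illustration of the hub and authority idea, the following is a minimal sketch of the HITS iteration on a small directed graph. The page names, the fixed iteration count, and the function name are assumptions; a real system would first build the graph from the pages relevant to a query.

```python
import math

def hits_scores(graph, iterations=50):
    """graph maps each page to the list of pages it links to."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for u, targets in graph.items():
            for v in targets:
                auth[v] += hub[u]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {n: sum(auth[v] for v in graph.get(n, [])) for n in nodes}
        # Normalize both score vectors to unit length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

graph = {"p1": ["a", "b"], "p2": ["a", "b"], "p3": ["a"]}
hub, auth = hits_scores(graph)
```

In this toy graph, page "a" ends up with the highest authority score because every hub links to it, which hints at why heavily linked topics dominate the result.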

8
(4) Using hyperlinks (continued)
  • Cluster web documents based on both textual
    and hyperlink information.
  • The hyperlink structure is used as the dominant
    factor in the similarity metric.

9
(5) Other classical clustering methods
  • K-means method
  • HAC (hierarchical agglomerative clustering)
  • DBSCAN (Density-Based Spatial Clustering of
    Applications with Noise)
  • Single-link, complete-link, group-average, and
    single-pass methods, as well as Buckshot and
    Fractionation, have also been used.
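
To make one of these concrete, here is a minimal sketch of HAC with single-link distance on 2-D points. Real document clustering would use term vectors and a similarity such as cosine; the points, the stopping rule (a fixed number of clusters k), and the function name are assumptions for the example.

```python
import math

def single_link_hac(points, k):
    """Merge the two closest clusters until only k remain.
    Single-link: cluster distance is the distance between the closest pair."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # j > i, so popping j leaves index i valid.
        clusters[i].extend(clusters.pop(j))
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
groups = single_link_hac(points, k=2)
```

This naive version recomputes all pairwise distances on every merge, so it is far too slow for large result sets, which relates to the speed requirement listed later.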

10
3. Key requirements and future challenges
  • (1) Key requirements for Web document clustering
    methods:
  • Relevance
  • Browsable summaries
  • Overlap
  • Speed
  • Incrementality (for some methods)

11
3. Key requirements and future challenges
(continued)
  • (2) Concerns about current methods
  • Each method has pros and cons.
  • Using hyperlinks: the best accuracy, with still
    some room to improve, but its clusters do not
    overlap.
  • STC: best for browsing and for incrementality.
  • MetaCrawler: best at pruning.

12
3. Key requirements and future challenges
(continued)
  • Future challenges
  • We cannot take advantage of all the pros of every
    method at once.
  • Some pros work against other pros.
  • So we have to make trade-offs.
  • Moreover, we need to find further improvements.