Title: Clustering Web Pages: a critical literature review
1Clustering Web Pages: a critical literature review
Weizheng Gao 2003/06/20
2Outline
- Introduction
- STC (suffix tree) Algorithm
- Grouper-A Clustering Engine
- Reference
31. Introduction
- Problems of conventional document retrieval systems
  - Low precision
  - Ranked-list presentation
- How about off-line clustering?
4An alternative Model
5Requirements for Web document clustering methods
- Relevance
- Browsable summaries
- Overlap
- Snippet-tolerance
- Speed
- Incrementality
62. STC (Suffix Tree Clustering)
- A novel, incremental, O(n) time algorithm
- Treats a document as a string
- Relies on Suffix Tree to identify common phrases
- Uses the common information to create clusters
- Also uses this information to summarize the
contents of clusters
7What is Suffix Tree?
- A suffix tree is a rooted, directed tree
- Each internal node has at least 2 children
- Each edge is labeled with a non-empty sub-string of S.
- The label of a node is the concatenation of the edge-labels on the path from the root to that node.
- No two edges out of the same node can have edge-labels that begin with the same word.
- For each suffix s of S, there exists a suffix-node whose label equals s
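The definition above can be illustrated with a minimal Python sketch that builds a word-level suffix tree for the sentence "I know you know I know". It is only an illustration under stated assumptions: the tree is built naively in quadratic time rather than with a linear-time construction, and the end marker "$" and the function names `build_suffix_trie` and `compact` are hypothetical, not taken from the source.

```python
def build_suffix_trie(words):
    """Insert every suffix of the word list into a nested-dict trie."""
    root = {}
    for start in range(len(words)):
        node = root
        # "$" is an assumed end marker so that every suffix ends at a leaf.
        for word in words[start:] + ["$"]:
            node = node.setdefault(word, {})
    return root


def compact(node):
    """Merge single-child chains so each internal node has >= 2 children."""
    out = {}
    for word, child in node.items():
        label = [word]
        while len(child) == 1:
            (next_word, grandchild), = child.items()
            label.append(next_word)
            child = grandchild
        out[" ".join(label)] = compact(child)
    return out


tree = compact(build_suffix_trie("I know you know I know".split()))
for edge, subtree in tree.items():
    print(edge, "->", sorted(subtree))
```

Each key is an edge label, and no two edges leaving the same node start with the same word, as the definition requires.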
8An Example
I know you know I know
Trimming
9Logical Steps
- Step-1 Document Cleaning
- Step-2 Identifying Base Clusters
- Step-3 Combining Base Clusters
- Step-4 Score clusters
10Step-1 Document Cleaning
- Use a light stemming algorithm
- Mark sentence boundaries
- Strip non-word tokens
The original document strings are kept, as well
as pointers from the beginning of each word in
the transformed string to its position in the
original strings.
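The cleaning step might be sketched as follows. This is an assumption-laden illustration, not the authors' implementation: the "light" stemmer here only lowercases and strips a plural "s", sentence boundaries are taken to be ".", "!" and "?", and the names `light_stem` and `clean` are hypothetical. Each word keeps a pointer (character offset) back into the original string.

```python
import re


def light_stem(word):
    """Assumed light stemming: lowercase and drop a trailing plural 's'."""
    w = word.lower()
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w


def clean(text):
    """Return sentences as lists of (stem, offset_in_original_text) pairs."""
    sentences, current = [], []
    # Only alphabetic words and sentence-ending punctuation are matched,
    # so digits and other non-word tokens are stripped implicitly.
    for match in re.finditer(r"[A-Za-z]+|[.!?]", text):
        token = match.group()
        if token in ".!?":
            if current:
                sentences.append(current)
                current = []
        else:
            current.append((light_stem(token), match.start()))
    if current:
        sentences.append(current)
    return sentences


text = "Cats ate 3 cheeses. The mouse ate too!"
for sentence in clean(text):
    print(sentence)
```

The offsets make it possible to show the original wording of a phrase even though clustering runs on the transformed string.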
11Step-2 Identifying Base Clusters
- Strings
  1. cat ate cheese
  2. mouse ate cheese too
  3. cat ate mouse too
The first number designates the string of
origin. The second number designates which suffix
of that string labels that suffix-node.
12The suffix tree of the strings "cat ate cheese", "mouse ate cheese too" and "cat ate mouse too"
13Each node represents a base cluster
Table 1 Six nodes and their corresponding base
clusters
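A small Python sketch can recover the six base clusters for the three example strings without building the tree explicitly. It assumes that a phrase corresponds to a suffix-tree node when it is followed by at least two different continuations (a next word or the end of a document), and that a base cluster needs at least two documents. The function name `base_clusters` and the end marker are hypothetical.

```python
from collections import defaultdict


def base_clusters(docs):
    """Map each shared, branching phrase to the set of documents containing it."""
    docs_of = defaultdict(set)   # phrase -> ids of documents containing it
    next_of = defaultdict(set)   # phrase -> distinct continuations seen
    for doc_id, text in enumerate(docs, start=1):
        words = text.split()
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                phrase = " ".join(words[i:j])
                docs_of[phrase].add(doc_id)
                # Assumed end marker, unique per document.
                nxt = words[j] if j < len(words) else ("$", doc_id)
                next_of[phrase].add(nxt)
    return {
        phrase: ids
        for phrase, ids in docs_of.items()
        if len(ids) >= 2 and len(next_of[phrase]) >= 2
    }


docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
for phrase, ids in base_clusters(docs).items():
    print(phrase, sorted(ids))
```

The phrase "cat" is shared by two strings but is always followed by "ate", so it does not form a node of its own; "cat ate" does.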
14Each base cluster is assigned a score
The score s(B) of base cluster B with phrase P is given by
$$s(B) = |B| \cdot f(|P|) \cdot \sum_{w_i \in P} \mathrm{tfidf}(w_i)$$
|B| is the number of documents in base cluster B, and |P| is the number of words in P. The function f penalizes single-word phrases, is linear for phrases that are two to six words long, and becomes constant for longer phrases. The sum runs over the standard term frequency-inverse document frequency (tf-idf) ranking factors of all terms $w_i$ in phrase P.
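The scoring rule can be sketched in Python as below. The slides only describe the shape of f, so the concrete values used here (0.5 for a single word, the word count for two to six words, 6 beyond that) are assumptions, as are the function names and the tf-idf weights in the usage example.

```python
def phrase_length_factor(n_words):
    """Assumed shape of f: penalty for one word, linear for 2-6, then constant."""
    if n_words <= 1:
        return 0.5  # assumed penalty value, not given in the source
    return float(min(n_words, 6))


def base_cluster_score(n_docs, phrase_words, tfidf):
    """Score = |B| * f(|P|) * sum of tf-idf weights of the words in P."""
    weight_sum = sum(tfidf.get(word, 0.0) for word in phrase_words)
    return n_docs * phrase_length_factor(len(phrase_words)) * weight_sum


# Hypothetical tf-idf weights; a stoplist word is given weight 0.
tfidf = {"cat": 1.0, "ate": 0.5, "the": 0.0}
print(base_cluster_score(2, ["cat", "ate"], tfidf))
print(base_cluster_score(3, ["ate"], tfidf))
print(base_cluster_score(3, ["the"], tfidf))
```

A phrase made only of zero-weight words scores 0 under this sketch, whatever its length or document count.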
15Step-3 Combining Base Clusters
Binary similarity measure
The similarity of $B_m$ and $B_n$ is defined to be 1 iff $|B_m \cap B_n| / |B_m| > \alpha$ and $|B_m \cap B_n| / |B_n| > \alpha$. Otherwise, their similarity is 0.
The base cluster graph for $\alpha = 0.5$
16The phrase cluster graph
(a) For $\alpha = 0.7$ there are four connected components in the graph, representing four merged clusters.
(b) For $\alpha = 0.6$ there is a single connected component in the graph, representing one merged cluster.
(c) If the word "ate" had been in the stoplist, the phrase cluster b would have been discarded as it would have had a score of 0, and for $\alpha = 0.6$ we would have had three connected components in the graph, representing three merged clusters.
17Merged clusters as connected components in the
phrase cluster graph
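The merge step can be checked with a short Python sketch: connect two base clusters when the binary similarity is 1, then read off connected components. The document sets below are the ones implied by the three example strings; the function names and the union-find bookkeeping are illustrative assumptions, not the authors' code.

```python
def similar(a, b, alpha):
    """Binary similarity: both overlap ratios must exceed alpha."""
    overlap = len(a & b)
    return overlap / len(a) > alpha and overlap / len(b) > alpha


def merge_base_clusters(clusters, alpha):
    """Return connected components of the base cluster graph."""
    names = list(clusters)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the component representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, m in enumerate(names):
        for n in names[i + 1:]:
            if similar(clusters[m], clusters[n], alpha):
                parent[find(m)] = find(n)

    groups = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(group) for group in groups.values())


clusters = {
    "cat ate": {1, 3}, "ate": {1, 2, 3}, "cheese": {1, 2},
    "mouse": {2, 3}, "too": {2, 3}, "ate cheese": {1, 2},
}
print(merge_base_clusters(clusters, 0.7))
print(merge_base_clusters(clusters, 0.6))
```

Removing the "ate" cluster from the input reproduces the stoplist case, in which the remaining clusters are no longer bridged into one component.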
18Step-4 Score Clusters
$N_c$ is the number of documents in cluster C. Only consider labels $l_0$ to $l_n$ that are in C and are not subsets of any other label.
$p(l) = \sum_{w \in l} p(w)$
$p(w) = \log(1/f_w)$ if $f_w > 0$, and $p(w) = \log(1/0.5)$ if $f_w = 0$
19The main advantages of STC
- It is phrase-based
- It does not adhere to any model of the data
- STC uses a simple cluster definition
- STC allows overlapping clusters
- STC is a fast, incremental, linear-time algorithm
204. Grouper- A Clustering Engine
- Grouper is a clustering interface to the HuskySearch meta-search service.
- Grouper clusters the results as they arrive using the STC algorithm.
21User interface
Grouper's query interface. Users need not enter any parameters for the clustering algorithm.
22The main result page
The main results page in Grouper for the query "israel"
23Reference
- Oren Zamir, Oren Etzioni, Omid Madani, Richard M. Karp, Fast and Intuitive Clustering of Web Documents, KDD, 1997
- Oren Zamir, Oren Etzioni, Web Document Clustering: A Feasibility Demonstration, In Proc. ACM SIGIR'98, 1998
- Oren Zamir, Oren Etzioni, Grouper: A Dynamic Clustering Interface to Web Search Results, WWW8, 1999
- Steve Branson, Ari Greenberg, Clustering Web Search Results Using Suffix Tree Methods, Stanford University
24Thanks!