Title: Semantic Smoothing of Document Models for Agglomerative Clustering
1. Semantic Smoothing of Document Models for Agglomerative Clustering
- Xiaohua Zhou, Xiaodan Zhang, Tony Hu
- College of Information Science and Technology
Drexel University, USA
2. Agglomerative Clustering
- Algorithm Overview
- Initially assign each document to its own cluster, then repeatedly merge the most similar pair of clusters until only one cluster is left.
- The core of the algorithm is computing pairwise document similarities.
- Cosine similarity and Euclidean distance are frequently used.
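The merge loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name, the list-of-lists distance matrix, and the toy distances are all assumptions, and the linkage rule shown is complete linkage, the criterion named in the experiment settings.

```python
from itertools import combinations


def complete_linkage_clustering(dist, n_clusters=1):
    """Merge clusters bottom-up until n_clusters remain.

    dist is a symmetric matrix (list of lists) of pairwise document
    distances; the name and signature are illustrative only.
    Returns the final clusters and the merge history.
    """
    clusters = [[i] for i in range(len(dist))]  # one cluster per document
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Complete linkage: the distance between two clusters is the
            # largest distance between any two of their members.
            d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a stays valid
    return [sorted(c) for c in clusters], merges


# Toy distances: documents 0 and 1 are close, as are 2 and 3.
toy = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
final_clusters, history = complete_linkage_clustering(toy, n_clusters=2)
```

The naive search over all cluster pairs keeps the sketch short; it is cubic or worse in the number of documents and is meant only to make the procedure concrete.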
3. Where is the problem?
- Density of Topic-free General Words
- An extreme example is stop words.
- Those words are assigned a high probability (or a high score) but contribute nothing to the clustering task.
- Any document pair could be considered similar for clustering because the documents share many common words (Steinbach et al., 2000).
- The effect of those general words needs to be discounted, the same idea as the TF-IDF weighting scheme.
4. Where is the problem?
- Sparsity of Topic-specific Words
I am looking for any information about the space
program. This includes NASA, the shuttles,
history, anything! I would like to know if
anyone could suggest books, periodicals, even ftp
sites for a novice who is interested in the space
program.
The Phobos mission did return some useful data
including images of Phobos itself. By the way, the
new book entitled "Mars" (Kieffer et al, 1992,
University of Arizona Press) has a great chapter
on spacecraft exploration of the planet. The
chapter is co-authored by V.I. Moroz of the Space
Research Institute in Moscow, and includes
details never before published in the West.
(From 20-Newsgroup)
5. Existing Solutions
- Density of general words
- Removing stop words
- Using TF-IDF score
- Term reweighting techniques
- Sparsity of topic-specific words
- Ontology-based term similarity
- Problems of existing solutions
- All these approaches are heuristic
- The ontology is either unavailable or very limited.
6. Language Modeling Approach
- Agglomerative Clustering
- Assume each document is generated by a language model.
- The pairwise document similarity is defined as the similarity (i.e., KL-divergence) of the corresponding document models.
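A minimal sketch of comparing two document models with KL-divergence follows. The function names and the three-word toy models are hypothetical, and summing both directions to get a symmetric distance is an assumption of this sketch; the slide states only that KL-divergence is used.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two word distributions given as dicts.

    Assumes q[w] > 0 wherever p[w] > 0, which holds when both
    models are smoothed over a shared vocabulary.
    """
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)


def symmetric_kl(p, q):
    """Symmetrized divergence; the symmetrization is an assumption here."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Hypothetical smoothed document models over a three-word vocabulary.
doc_a = {"space": 0.5, "nasa": 0.3, "the": 0.2}
doc_b = {"space": 0.4, "nasa": 0.4, "the": 0.2}
doc_c = {"space": 0.1, "nasa": 0.1, "the": 0.8}

dist_ab = symmetric_kl(doc_a, doc_b)  # similar topical mass: small distance
dist_ac = symmetric_kl(doc_a, doc_c)  # doc_c is mostly general words
```

Because the divergence is undefined when the second model assigns zero probability to a word, the document models have to be smoothed before they are compared.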
7. Jelinek-Mercer Smoothing
- The document model is smoothed by the corpus model (simple language model).
- Discounts general words
- Partially solves the data sparsity problem
c(w, d) is the count of word w in document d. C denotes the corpus (Zhai and Lafferty, 2001).
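Jelinek-Mercer smoothing is usually written as a linear interpolation, p(w|d) = (1 - alpha) * p_ml(w|d) + alpha * p(w|C), where p_ml(w|d) is c(w, d) divided by the document length. The sketch below assumes that standard form; the name alpha, its value, and the two toy documents are illustrative.

```python
from collections import Counter


def jelinek_mercer(doc_tokens, corpus_counts, alpha=0.5):
    """Return p(w|d) = (1 - alpha) * p_ml(w|d) + alpha * p(w|C).

    corpus_counts is a Counter of word counts over the whole corpus;
    alpha (the weight of the corpus model) is an illustrative value.
    """
    doc_counts = Counter(doc_tokens)
    doc_len = sum(doc_counts.values())
    corpus_len = sum(corpus_counts.values())
    model = {}
    for w, cw in corpus_counts.items():
        p_ml = doc_counts[w] / doc_len  # c(w, d) / document length
        p_c = cw / corpus_len           # corpus model p(w|C)
        model[w] = (1 - alpha) * p_ml + alpha * p_c
    return model


docs = [
    "the space program and the shuttle".split(),
    "the phobos mission and the mars book".split(),
]
corpus = Counter(w for d in docs for w in d)
model = jelinek_mercer(docs[0], corpus, alpha=0.5)
```

In this toy run the word "mars" never occurs in the first document, yet it receives a non-zero probability from the corpus model, which is the sense in which the sparsity problem is partially addressed.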
8. Semantic Smoothing
- Descriptions
- Like the statistical translation model (Berger and Lafferty, 1999), term semantic relationships are used for model smoothing.
- Unlike the statistical translation model, contextual and sense information is considered.
- A document is decomposed into a set of context-sensitive multiword phrases, and the phrases are then statistically translated into individual words.
9. Semantic Smoothing Model
- Linearly interpolate the phrase-based translation model with a simple language model.
Where the translation coefficient (λ) controls the influence of the translation component in the mixture model.
c(ti, d) is the frequency of topic signature ti in document d.
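One way to read the interpolation is p_bt(w|d) = (1 - lam) * p_b(w|d) + lam * sum_i p(w|ti) * p_ml(ti|d), with p_ml(ti|d) taken as c(ti, d) normalized over the phrases of d. The sketch below assumes exactly that form; the function name, the coefficient name lam, its value, and all probabilities in the example are hypothetical.

```python
def semantic_smoothing(p_b, phrase_counts, translation, lam=0.4):
    """Mix a simple language model with a phrase translation model.

    p_b:           dict word -> p_b(w|d), the simple language model
    phrase_counts: dict phrase ti -> c(ti, d)
    translation:   dict phrase ti -> {word: p(w|ti)}
    lam:           translation coefficient; 0.4 is an arbitrary value
    """
    total = sum(phrase_counts.values())
    p_t = {w: 0.0 for w in p_b}
    if total > 0:
        for t, c in phrase_counts.items():
            p_ml_t = c / total  # p_ml(ti|d)
            for w, p_wt in translation[t].items():
                p_t[w] = p_t.get(w, 0.0) + p_wt * p_ml_t
    else:
        lam = 0.0  # no phrases in the document: fall back to p_b alone
    return {w: (1 - lam) * p_b.get(w, 0.0) + lam * p_t[w] for w in p_t}


# Hypothetical inputs for a document about the space program.
p_b = {"space": 0.5, "program": 0.3, "nasa": 0.1, "shuttle": 0.1}
phrase_counts = {"space program": 2}
translation = {
    "space program": {"space": 0.4, "program": 0.2, "nasa": 0.3, "shuttle": 0.1}
}
p_bt = semantic_smoothing(p_b, phrase_counts, translation, lam=0.4)
```

With these numbers the probability of "nasa" rises from 0.1 to 0.6 * 0.1 + 0.4 * 0.3 = 0.18, because the phrase "space program" translates to it with high probability.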
10. Semantic Translation Example
11. Semantic Smoothing Example
- Doc2
- The Phobos mission did return some useful data
including images of Phobos itself. By the way, the
new book entitled "Mars" (Kieffer et al, 1992,
University of Arizona Press) has a great chapter
on spacecraft exploration of the planet. The
chapter is co-authored by V.I. Moroz of the Space
Research Institute in Moscow, and includes
details never before published in the West.
- Doc1
- I am looking for any information about the space
program. This includes NASA, the shuttles,
history, anything! I would like to know if
anyone could suggest books, periodicals, even ftp
sites for a novice who is interested in the space
program.
- Doc3
- ROCKET LAUNCH OBSERVED! A bright light phenomenon
was observed in the Eastern Finland on April 21.
I don't know if there were satellite launches in
Plesetsk Cosmodrome near Arkhangelsk, but this
may be a rocket experiment too.
12. Translation Probability Estimate
- Method
- Use co-occurrence counts (multiword phrases and individual words).
- Use a mixture model to remove noise from topic-free general words.
Figure 1. Illustration of document indexing. Vt, Vd and Vw are the phrase set, document set and word set, respectively.
Dk denotes the set of documents containing the phrase tk. The parameter α is the coefficient controlling the influence of the corpus model in the mixture model.
13. Translation Probability Estimate
- Log likelihood of generating Dk
Where c(w, Dk) is the document frequency of term w in Dk, i.e., the co-occurrence count of w and tk in the whole collection.
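A hedged sketch of the estimation step: assume each word in Dk is drawn from the two-component mixture (1 - alpha) * p(w|tk) + alpha * p(w|C), with the corpus model fixed, and maximize the log likelihood of Dk with EM. The mixture form, the EM updates, the function names, and the toy counts below are assumptions made for illustration; only the use of co-occurrence counts and a corpus-weighted mixture comes from the slides.

```python
import math


def estimate_translation(cooc, p_corpus, alpha=0.5, iters=50):
    """Estimate p(w|tk) from co-occurrence counts with EM.

    cooc:     dict word -> co-occurrence count of the word with phrase tk
    p_corpus: dict word -> p(w|C), the fixed corpus model (must be > 0)
    alpha:    weight of the corpus model in the mixture (illustrative)
    """
    total = sum(cooc.values())
    p_t = {w: c / total for w, c in cooc.items()}  # start from relative counts
    for _ in range(iters):
        # E-step: probability that an occurrence of w came from the phrase model
        z = {
            w: (1 - alpha) * p_t[w]
            / ((1 - alpha) * p_t[w] + alpha * p_corpus[w])
            for w in cooc
        }
        # M-step: re-estimate p(w|tk) from the expected counts
        norm = sum(cooc[w] * z[w] for w in cooc)
        p_t = {w: cooc[w] * z[w] / norm for w in cooc}
    return p_t


def log_likelihood(cooc, p_t, p_corpus, alpha=0.5):
    """Log likelihood of Dk under the two-component mixture."""
    return sum(
        c * math.log((1 - alpha) * p_t[w] + alpha * p_corpus[w])
        for w, c in cooc.items()
    )


# Toy counts: "the" co-occurs often with the phrase but is common everywhere.
cooc = {"the": 50, "space": 30, "nasa": 20}
p_corpus = {"the": 0.6, "space": 0.02, "nasa": 0.01}
initial = {w: c / 100 for w, c in cooc.items()}
p_t = estimate_translation(cooc, p_corpus)
```

In this toy run the general word "the" ends up with a lower translation probability than its raw relative count of 0.5, while the topical words gain mass, which is the noise-removal effect the mixture model is meant to provide.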
14. Phrase Extraction
- Phrase Dictionary
- Use Xtract (Smadja, 1993) to learn a phrase dictionary.
- Phrase Extraction
- Extract phrases from documents using exact string matching.
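Exact matching against a phrase dictionary can be sketched as below. The dictionary contents are hypothetical (a real one would be learned from the corpus), and the greedy longest-match strategy and the max_len cap are assumptions of this sketch rather than details given in the slides.

```python
def extract_phrases(tokens, dictionary, max_len=4):
    """Greedy longest-match lookup of dictionary phrases in a token list.

    dictionary is a set of phrases, each a tuple of lowercase tokens.
    Longest-first matching and max_len are assumptions of this sketch.
    """
    found = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first; only multiword phrases (n >= 2).
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in dictionary:
                match = candidate
                break
        if match:
            found.append(" ".join(match))
            i += len(match)
        else:
            i += 1
    return found


# Hypothetical dictionary; a real one would be learned from the corpus.
phrase_dict = {("space", "program"), ("space", "research", "institute")}
text = "the space research institute studies the space program"
phrases = extract_phrases(text.split(), phrase_dict)
```

Counting the returned phrases per document gives the phrase frequencies that the semantic smoothing model takes as input.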
15. Experiment Settings
- Agglomerative clustering
- Complete linkage
- Evaluation criteria
- Normalized mutual information (NMI; Banerjee and Ghosh, 2002)
- Entropy (Steinbach et al., 2000)
- Purity (Zhao and Karypis, 2001)
- Experiment Design
- Randomly create testing collections. 100 documents are randomly selected for each class.
- Execute 5 runs for each collection and average the results.
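Two of the evaluation measures can be sketched directly from class labels and cluster assignments. The NMI normalization by the geometric mean of the two entropies is an assumption (other normalizations exist), and the function names and toy labels are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def nmi(classes, clusters):
    """I(X;Y) / sqrt(H(X) * H(Y)); the normalization is an assumption."""
    n = len(classes)
    joint = Counter(zip(classes, clusters))
    pc, pk = Counter(classes), Counter(clusters)
    mi = sum(
        c / n * math.log(c * n / (pc[x] * pk[y]))
        for (x, y), c in joint.items()
    )
    denom = math.sqrt(entropy(classes) * entropy(clusters))
    return mi / denom if denom > 0 else 0.0


def purity(classes, clusters):
    """Fraction of documents that fall in their cluster's majority class."""
    by_cluster = {}
    for x, y in zip(classes, clusters):
        by_cluster.setdefault(y, Counter())[x] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(classes)


classes = ["a", "a", "b", "b"]
perfect = [0, 0, 1, 1]  # clusters match the classes exactly
mixed = [0, 1, 0, 1]    # clusters carry no class information
```

Averaging such scores over the 5 runs per collection gives one number per scheme and dataset.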
16. Statistics of Three Datasets
Table 1. Statistics of three datasets
17. Agglomerative Clustering
Table 2. NMI results of the agglomerative
hierarchical clustering with complete linkage
criterion. JM and Semantic denote
Jelinek-Mercer smoothing and semantic smoothing,
respectively. means stop words are not removed.
The translation coefficient λ is trained from
TDT2.
18. Effect of Document Smoothing
Figure 2. The variation of the cluster quality with the translation coefficient (λ), which controls the influence of semantic smoothing.
19. Comparison to K-Means
Table 3. means stop words are not removed. The
agglomerative clustering with semantic smoothing
is comparable to the standard K-Means clustering.
20. Summary
- Proposed a context-sensitive semantic smoothing method that statistically translates multiword phrases into individual terms.
- Semantic smoothing not only discounts general words but also handles the data sparsity problem very well.
- Semantic smoothing is much more effective than other schemes on agglomerative clustering, where data sparsity is the major problem.
- Whether or not stop words are removed has no effect on TF-IDF, background smoothing, and semantic smoothing, but has a significant effect on other schemes.
21. Future Work
- How to optimize the translation coefficient
- Alternative translation intermediates (e.g., word pairs, concept pairs)
- Apply semantic document smoothing to other applications such as text retrieval, text summarization, and text classification.
22. Dragon Toolkit
- Descriptions
- Text retrieval and mining toolkit
- Written in Java
- Used for this work
- Phrase extraction
- Phrase-word translation probability estimates
- Clustering
- Download
- http://www.ischool.drexel.edu/dmbio/dragontool
- Search Google with the keywords "dragon toolkit"
23. Questions/Comments?