Semantic Smoothing of Document Models for Agglomerative Clustering

1
Semantic Smoothing of Document Models for
Agglomerative Clustering

Xiaohua Zhou, Xiaodan Zhang, Tony Hu
College of Information Science Technology
Drexel University, USA

2
Agglomerative Clustering

Algorithm Overview
Initially assign each document to its own
cluster, then repeatedly merge the pair of most
similar clusters until only one cluster is left.
The core of this algorithm is computing
pairwise document similarities.
Cosine similarity and Euclidean distance are
frequently used.
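
The merge loop described above can be sketched as follows. This is a minimal illustration, assuming raw term-count vectors, cosine similarity, and complete linkage (a cluster pair is scored by its least similar pair of members); the function names are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors (dicts)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def agglomerate(docs, similarity=cosine_similarity):
    """Merge clusters until one is left; return the list of merges."""
    vectors = [Counter(d.lower().split()) for d in docs]
    clusters = [[i] for i in range(len(docs))]  # one cluster per document
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: score a pair by its least similar members.
                link = min(similarity(vectors[a], vectors[b])
                           for a in clusters[i] for b in clusters[j])
                if best is None or link > best[0]:
                    best = (link, i, j)
        link, i, j = best
        history.append((clusters[i], clusters[j], link))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

docs = ["space shuttle launch", "shuttle launch nasa", "stock market price"]
for left, right, score in agglomerate(docs):
    print(left, right, round(score, 3))
```

Every pairwise similarity is recomputed on each pass, which keeps the sketch short but is quadratic per merge; a real implementation would cache the similarity matrix.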

3
Where is the problem?

Density of Topic-free General Words
An extreme example is stop words.
Those words are assigned a high probability (or
a high score) but contribute nothing to the
clustering task.
Any document pair could be considered similar for
clustering because they share lots of common
words (Steinbach et al., 2000).
Need to discount the effect of those general
words. This is the same idea as the TF-IDF
weighting scheme.

4
Where is the problem?

Sparsity of Topic-specific Words

I am looking for any information about the space
program. This includes NASA, the shuttles,
history, anything! I would like to know if
anyone could suggest books, periodicals, even ftp
sites for a novice who is interested in the space
program.
The Phobos mission did return some useful data
including images of Phobos itself. By the way, the
new book entitled "Mars" (Kieffer et al, 1992,
University of Arizona Press) has a great chapter
on spacecraft exploration of the planet. The
chapter is co-authored by V.I. Moroz of the Space
Research Institute in Moscow, and includes
details never before published in the West.
(From 20 Newsgroups)
5
Existing Solutions

Density of general words
Removing stop words
Using TF-IDF score
Term reweighting techniques
Sparsity of topic-specific words
Ontology-based term similarity
Problems of existing solutions
All these approaches are heuristic.
An ontology is either not available or
very limited.

6
Language Modeling Approach

Agglomerative Clustering
Assume each document is generated by a language
model.
The pairwise document similarity is defined by
the divergence (i.e., KL-divergence) between the
corresponding document models.
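
A minimal sketch of comparing two document models with KL-divergence, assuming each model is a dict from word to probability and that the second model is nonzero wherever the first is (which smoothing is meant to guarantee). The function name and the toy distributions are illustrative.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two word distributions given as dicts.

    q must assign nonzero probability to every word that p does.
    """
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

doc_a = {"space": 0.5, "nasa": 0.5}
doc_b = {"space": 0.9, "nasa": 0.1}
print(kl_divergence(doc_a, doc_b))  # larger value = less similar models
print(kl_divergence(doc_a, doc_a))  # identical models give 0.0
```

KL-divergence is asymmetric and a smaller value means more similar models, so a clustering routine that expects a similarity would need to negate or otherwise invert it.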

7
Jelinek-Mercer Smoothing

The document model is smoothed by the corpus
model (simple language model)
Discounting general words
Partially solve the data sparsity problem

c(w, d) is the count of word w in document d. C
denotes the corpus (Zhai and Lafferty 2001).
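
A minimal sketch of Jelinek-Mercer smoothing in its usual form, p(w|d) = (1 - b) c(w, d)/|d| + b p(w|C), where p(w|C) is the relative frequency of w in the corpus. The function name, the parameter name background_weight (b above), and its default value are illustrative assumptions, not values from the slides. The sketch also assumes every document word occurs in the corpus.

```python
from collections import Counter

def jelinek_mercer(doc_tokens, corpus_tokens, background_weight=0.5):
    """Interpolate the document's maximum-likelihood model with the corpus model."""
    doc_counts = Counter(doc_tokens)        # c(w, d)
    corpus_counts = Counter(corpus_tokens)  # counts over the corpus C
    doc_len = sum(doc_counts.values())
    corpus_len = sum(corpus_counts.values())
    return {
        w: (1 - background_weight) * doc_counts[w] / doc_len
           + background_weight * corpus_counts[w] / corpus_len
        for w in corpus_counts
    }

corpus = "the space program the rocket launch".split()
model = jelinek_mercer("the space program".split(), corpus)
print(model["rocket"])  # unseen in the document, but nonzero after smoothing
```

Words absent from the document receive a share of the corpus probability, which is why the slide says this only partially addresses sparsity: the added mass does not depend on the document's topic.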
8
Semantic Smoothing

Descriptions
Like the statistical translation model (Berger
and Lafferty 1999), term semantic relationships
are used for model smoothing.
Unlike the statistical translation model,
contextual and sense information is considered.
Decompose a document into a set of
context-sensitive multiword phrases and then
statistically translate phrases into individual
words.

9
Semantic Smoothing Model

Linearly interpolate the phrase-based translation
model with a simple language model

Where the translation coefficient (λ) controls
the influence of the translation component in the
mixture model.
c(ti, d) is the frequency of topic signature ti
in document d.
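
A minimal sketch of the interpolation described above, assuming the mixture has the form (1 - λ) p_simple(w|d) + λ Σ_i p(w|ti) c(ti, d) / Σ_j c(tj, d). The function and argument names, the default λ, and the toy translation table are all hypothetical.

```python
def semantic_smoothing(simple_model, phrase_counts, translation, lam=0.4):
    """Mix a simple language model with a phrase-based translation model.

    simple_model:  dict word -> p(w|d) from the simple language model
    phrase_counts: dict phrase ti -> c(ti, d)
    translation:   dict phrase ti -> dict word -> p(w|ti)
    lam:           translation coefficient
    """
    total = sum(phrase_counts.values())
    if total == 0:
        # No phrases found in the document: fall back to the simple model.
        return dict(simple_model)
    mixed = {}
    for w, p_simple in simple_model.items():
        p_trans = sum(translation[t].get(w, 0.0) * c / total
                      for t, c in phrase_counts.items())
        mixed[w] = (1 - lam) * p_simple + lam * p_trans
    return mixed

simple = {"space": 0.5, "nasa": 0.25, "orbit": 0.25}
phrases = {"space program": 2}
table = {"space program": {"space": 0.4, "nasa": 0.3, "orbit": 0.3}}
print(semantic_smoothing(simple, phrases, table, lam=0.5))
```

Setting lam to 0 recovers the simple language model, and setting it to 1 uses only the translation component.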
10
Semantic Translation Example
11
Semantic Smoothing Example

Doc2
The Phobos mission did return some useful data
including images of Phobos itself. By the way, the
new book entitled "Mars" (Kieffer et al, 1992,
University of Arizona Press) has a great chapter
on spacecraft exploration of the planet. The
chapter is co-authored by V.I. Moroz of the Space
Research Institute in Moscow, and includes
details never before published in the West.

Doc1
I am looking for any information about the space
program. This includes NASA, the shuttles,
history, anything! I would like to know if
anyone could suggest books, periodicals, even ftp
sites for a novice who is interested in the space
program.

Doc3
ROCKET LAUNCH OBSERVED! A bright light phenomenon
was observed in the Eastern Finland on April 21.
I don't know if there were satellite launches in
Plesetsk Cosmodrome near Arkhangelsk, but this
may be a rocket experiment too.

12
Translation Probability Estimate

Method
Use co-occurrence counts (multiword phrase and
individual words)
Use a mixture model to remove noise from
topic-free general words

Figure 1. Illustration of document indexing. Vt,
Vd and Vw are the phrase set, document set and
word set, respectively.
Dk denotes the set of documents containing the
phrase tk. The parameter α is the coefficient
controlling the influence of the corpus model in
the mixture model.
13
Translation Probability Estimate

Log likelihood of generating Dk

EM for estimation

Where c(w, Dk) is the document frequency of term
w in Dk, i.e., the co-occurrence count of w and
tk in the whole collection.
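
A minimal sketch of an EM estimate for p(w|tk), assuming the documents in Dk are generated by the mixture (1 - α) p(w|tk) + α p(w|C). The update rules are the standard ones for a two-component mixture with a fixed background model and may differ in detail from those on the slide; the function name, default α, iteration count, and toy numbers are hypothetical.

```python
def estimate_translation(cooccurrence, corpus_model, alpha=0.5, iterations=50):
    """Estimate p(w|tk) from co-occurrence counts with a fixed corpus model.

    cooccurrence: dict word -> co-occurrence count of w with phrase tk
    corpus_model: dict word -> p(w|C), nonzero for every counted word
    alpha:        weight of the corpus model in the mixture
    """
    total = sum(cooccurrence.values())
    p = {w: c / total for w, c in cooccurrence.items()}  # start from raw frequencies
    for _ in range(iterations):
        # E-step: chance that an occurrence of w came from the phrase model.
        resp = {w: (1 - alpha) * p[w]
                   / ((1 - alpha) * p[w] + alpha * corpus_model[w])
                for w in p}
        # M-step: re-estimate p(w|tk) from the counts credited to the phrase.
        norm = sum(cooccurrence[w] * resp[w] for w in p)
        p = {w: cooccurrence[w] * resp[w] / norm for w in p}
    return p

counts = {"the": 10, "space": 8, "shuttle": 6}
corpus = {"the": 0.6, "space": 0.05, "shuttle": 0.02}
print(estimate_translation(counts, corpus))
```

In this toy run the general word "the" ends up with a lower probability than its raw share of the counts, because most of its occurrences are credited to the corpus model.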
14
Phrase Extraction

Phrase Dictionary
Use Xtract (Smadja 1993) to learn a phrase
dictionary.
Phrase Extraction
Extract phrases from documents using exact string
matching.
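
A minimal sketch of the exact-string-matching step, assuming a phrase dictionary has already been learned. The dictionary below is invented for illustration and is not output from Xtract; preferring the longest match at each position is also an assumption.

```python
import re
from collections import Counter

def extract_phrases(text, phrase_dictionary, max_len=4):
    """Count dictionary phrases in text, preferring the longest match."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    i = 0
    while i < len(tokens):
        # Try the longest candidate starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in phrase_dictionary:
                counts[candidate] += 1
                i += n
                break
        else:
            i += 1  # no dictionary phrase starts here
    return counts

dictionary = {"space program", "space", "nasa"}
text = "I am looking for any information about the space program. This includes NASA."
print(extract_phrases(text, dictionary))
```

The resulting counts play the role of c(ti, d) when building the translation component of a document model.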

15
Experiment Settings

Agglomerative clustering
Complete linkage
Evaluation criterion
Normalized mutual information (NMI, Banerjee and
Ghosh, 2002)
Entropy (Steinbach et al., 2000)
Purity (Zhao and Karypis, 2001)
Experiment Design
Randomly create testing collections. 100
documents are randomly selected for each class.
Execute 5 runs for each collection and average
the results
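
Two of the evaluation criteria listed above, purity and NMI, can be sketched as follows. This assumes NMI is normalized by the geometric mean of the class and cluster entropies, which is one common definition and may not match the cited one exactly; the function names are hypothetical.

```python
import math
from collections import Counter

def purity(labels, clusters):
    """Fraction of documents that belong to the majority class of their cluster."""
    by_cluster = {}
    for lab, clu in zip(labels, clusters):
        by_cluster.setdefault(clu, Counter())[lab] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(labels)

def _shannon_entropy(assignments):
    n = len(assignments)
    return -sum(c / n * math.log(c / n) for c in Counter(assignments).values())

def nmi(labels, clusters):
    """Mutual information between classes and clusters, normalized to [0, 1]."""
    n = len(labels)
    joint = Counter(zip(labels, clusters))
    lab_counts, clu_counts = Counter(labels), Counter(clusters)
    mutual = sum(c / n * math.log(c * n / (lab_counts[l] * clu_counts[k]))
                 for (l, k), c in joint.items())
    denom = math.sqrt(_shannon_entropy(labels) * _shannon_entropy(clusters))
    return mutual / denom if denom else 0.0

labels = ["space", "space", "finance", "finance"]
print(purity(labels, [0, 0, 1, 1]), nmi(labels, [0, 0, 1, 1]))
print(purity(labels, [0, 1, 0, 1]), nmi(labels, [0, 1, 0, 1]))
```

Averaging over the 5 runs per collection then amounts to taking the mean of these scores across runs.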

16
Statistics of Three Datasets
Table 1. Statistics of three datasets
17
Agglomerative Clustering
Table 2. NMI results of the agglomerative
hierarchical clustering with complete linkage
criterion. JM and Semantic denote
Jelinek-Mercer smoothing and semantic smoothing,
respectively. means stop words are not removed.
The translation coefficient λ is trained from
TDT2.
18
Effect of Document Smoothing
Figure 2. The variation of the cluster quality
with the translation coefficient (λ), which
controls the influence of semantic smoothing.
19
Comparison to the K-Means
Table 3. means stop words are not removed. The
agglomerative clustering with semantic smoothing
is comparable to the standard K-Means clustering.
20
Summary

Proposed a context-sensitive semantic smoothing
method which statistically translates multiword
phrases into individual terms.
Semantic smoothing not only discounts general
words but also addresses the data sparsity
problem well.
Semantic smoothing is much more effective than
other schemes on agglomerative clustering, where
data sparsity is the major problem.
Whether or not stop words are removed has no
effect on TF-IDF, background smoothing, and
semantic smoothing, but has a significant effect
on other schemes.

21
Future Work

How to optimize the translation coefficient
Alternative translation intermediates (e.g. word
pair, concept pair)
Apply semantic document smoothing to other
applications such as text retrieval, text
summarization, and text classification.

22
Dragon Toolkit