Semantic-based Language Models for Text Retrieval and Clustering

Provided by: NanZ1

1
Semantic-based Language Models for Text Retrieval
and Clustering

Xiaohua (Davis) Zhou
College of Information Science and Technology
Drexel University

2
Summary of Research Work

Publications in the last three years
2 journal papers, 13 conference papers, and 2
workshop papers, 10 of which are first-authored.
Venues include IJCAI07 (best conference in AI and
Knowledge Base), SIGIR06 (best conference in
Information Retrieval), and ICDM06 (one of the
three primary conferences in Data Mining)
Topic Distribution
Information Retrieval (4), Information Extraction
(4), Text Mining (7), Other (2)
Main Research Direction
Statistical Language Models for Text Retrieval
and Mining

3
Selected Publications

Zhou, X., Zhang, X., and Hu, X., Semantic
Smoothing of Document Models for Agglomerative
Clustering, to appear in IJCAI 2007 (15.7%)
Zhou, X., Hu, X., Zhang, X., Lin, X., and Song,
I.-Y., Context-Sensitive Semantic Smoothing for
the Language Modeling Approach to Genomic IR, ACM
SIGIR 2006 (18.5%)
Zhou, X., Hu, X., Lin, X., and Zhang, X.,
Relation-based Document Retrieval for Biomedical
IR, Transactions on Computational Systems
Biology, 2006, Vol. 4
Zhou, X., Zhang, X., and Hu, X., Using
Concept-based Indexing to Improve Language
Modeling Approach to Genomic IR, ECIR 2006 (21%)
Zhang, X., Zhou, X., and Hu, X., Semantic
Smoothing for Model-based Document Clustering, to
appear in ICDM 2006 (short paper, 20%)

4
Statistical Language Models

Statistical Language Models (LM)
A language model is a probability distribution
over words.
Text is randomly generated according to a given
language model.
Basic Questions
Text generation: given the model, compute the
probability of generating a text.
Inference: given a text, infer the underlying
model that generated the text.
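The two basic questions can be illustrated with a minimal Python sketch. It assumes a unigram model (words drawn independently); the model values, function names, and text length below are illustrative assumptions, not taken from the slides.

```python
import random
from collections import Counter

def generate_text(model, length, seed=0):
    """Generation: draw words independently according to the model."""
    rng = random.Random(seed)
    words = list(model)
    weights = [model[w] for w in words]
    return rng.choices(words, weights=weights, k=length)

def infer_model(tokens):
    """Inference: maximum-likelihood estimate of the word distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

# Hypothetical three-word language model.
true_model = {"A": 0.3, "B": 0.5, "C": 0.2}
text = generate_text(true_model, length=10000)
estimated = infer_model(text)
print(estimated)  # should be close to true_model for a long enough text
```

With a long generated text, the inferred distribution approaches the one that produced it, which is the sense in which inference "recovers" the model.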

5
Example of Language Modeling
Figure 1. Illustration of the generative process
and the problem of statistical inference
underlying topic models.
6
Applications of LM

Text Prediction
Compute the probability of generating a sequence
of words according to the trained model.
Applications: Information Extraction, Text
Retrieval, Text Clustering, and Text
Classification
Model Inference
Infer the underlying model from the generated
texts.
Applications: Text Decoding, Topic Models

7
LM for Text Retrieval

Relevance
The probability of generating the query by the
document (model), i.e. p(q|d)
Example
Document 1: (A, 3), (B, 5), (C, 2)
Document 2: (A, 4), (B, 1), (C, 5)
Query: A, B
Which document is more relevant to the query?

Doc 1: 0.3 × 0.5 = 0.15. Doc 2: 0.4 × 0.1 = 0.04.
Doc 1 is more relevant to the query than Doc 2.
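The worked example on this slide can be reproduced with a short sketch. It assumes maximum-likelihood unigram document models and independent query terms; the function names are illustrative.

```python
def document_model(counts):
    """Maximum-likelihood unigram model: p(w|d) = c(w, d) / |d|."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def query_likelihood(query, model):
    """p(q|d) as the product of p(w|d) over the query terms."""
    p = 1.0
    for w in query:
        p *= model.get(w, 0.0)  # a word absent from the document gets 0
    return p

doc1 = document_model({"A": 3, "B": 5, "C": 2})
doc2 = document_model({"A": 4, "B": 1, "C": 5})
query = ["A", "B"]
score1 = query_likelihood(query, doc1)  # 0.3 * 0.5
score2 = query_likelihood(query, doc2)  # 0.4 * 0.1
print(score1 > score2)  # Doc 1 ranks above Doc 2
```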
8
LM for Text Clustering

Agglomerative Clustering
The pairwise document similarity is defined by
the closeness (i.e., KL-divergence) of the two
document models
Partitional Clustering
The similarity between a document and a cluster
is defined as the generative probability of the
document by the cluster model, i.e. p(d|cj)
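Both similarity notions can be sketched in a few lines of Python. This is a minimal illustration assuming unigram models with no zero entries; the example distributions and function names are assumptions. Note that a lower KL-divergence means the two models are closer.

```python
import math

def kl_divergence(p, q):
    """KL(p || q); q must be nonzero wherever p is nonzero."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

def log_generative_probability(doc_counts, cluster_model):
    """log p(d|cj) = sum over words of c(w, d) * log p(w|cj)."""
    return sum(c * math.log(cluster_model[w]) for w, c in doc_counts.items())

# Two hypothetical models over the same three words.
model1 = {"A": 0.3, "B": 0.5, "C": 0.2}
model2 = {"A": 0.4, "B": 0.1, "C": 0.5}
print(kl_divergence(model1, model2))  # positive: the models differ
print(kl_divergence(model1, model1))  # 0.0: identical models

# A document with counts (A, 3), (B, 5), (C, 2) is better explained by model1.
doc_counts = {"A": 3, "B": 5, "C": 2}
ll1 = log_generative_probability(doc_counts, model1)
ll2 = log_generative_probability(doc_counts, model2)
print(ll1 > ll2)
```

Working in log space avoids numerical underflow when documents are long.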

9
Where Is the Problem?

Sparsity of Topical Words
Document 1: (A, 3), (B, 5), (C, 2)
Document 2: (A, 4), (B, 1), (C, 5)
Query: A, D
Which document is more relevant to the query?

Doc 1: 0.3 × 0 = 0. Doc 2: 0.4 × 0 = 0. Obviously,
this result is not reasonable. Text clustering has
the same problem.
10
Where Is the Problem?

Density of Topic-free General Words
An extreme example is stop words.
Those words are assigned high probability but
make no contribution to the retrieval or
clustering task.
Any document pair could be considered similar for
clustering because the documents share many
common words.
Need to discount the effect of those general
words. The same idea as the TF-IDF weighting
scheme.

11
Summary of the LM Problems

Need to assign reasonable probability (count) to
words unseen in the training data.
Technically, avoid zero probability
Account for the semantic relationship between
training words and testing words, e.g. the query
"car" for a document containing "auto".
Need to discount topic-free general words.
Remove noise
Concentrate on topic-related words
These two issues are exactly the goals of
statistical model smoothing.

12
Laplacian Smoothing

Basic Idea
Simply assume the prior count of each word is 1.
This only prevents zero probability; it does not
make much sense for real applications.
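A minimal sketch of this idea, applied to the earlier query (A, D) example. The four-word vocabulary is an assumption made for illustration, and the prior count is exposed as a parameter (add-one smoothing corresponds to prior=1).

```python
def laplace_model(counts, vocabulary, prior=1):
    """p(w|d) = (c(w, d) + prior) / (|d| + prior * |V|)."""
    total = sum(counts.values()) + prior * len(vocabulary)
    return {w: (counts.get(w, 0) + prior) / total for w in vocabulary}

# Assumed vocabulary: the three document words plus the unseen query word D.
vocab = ["A", "B", "C", "D"]
doc1 = laplace_model({"A": 3, "B": 5, "C": 2}, vocab)
doc2 = laplace_model({"A": 4, "B": 1, "C": 5}, vocab)

# Query (A, D): both scores are now nonzero, so the documents can be ranked.
score1 = doc1["A"] * doc1["D"]  # (4/14) * (1/14)
score2 = doc2["A"] * doc2["D"]  # (5/14) * (1/14)
print(score1, score2)
```

Every unseen word receives the same probability regardless of meaning, which is why this scheme only avoids zeros and does not capture any relationship between D and the document's topic.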

13
Background Smoothing

The document model is smoothed by the corpus
model.

c(w, d) is the count of word w in document d. C
denotes the corpus (Zhai and Lafferty 2001)
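The formula on this slide did not survive extraction. A common form of background smoothing is linear interpolation of the document's maximum-likelihood model with the corpus model; the sketch below assumes that form. The corpus counts are hypothetical, and the default coefficient 0.05 mirrors the background coefficient listed in the retrieval experiment settings.

```python
def background_model(doc_counts, corpus_model, alpha=0.05):
    """Assumed form: p_b(w|d) = (1 - alpha) * c(w, d) / |d| + alpha * p(w|C)."""
    doc_len = sum(doc_counts.values())
    return {
        w: (1 - alpha) * doc_counts.get(w, 0) / doc_len + alpha * p_wc
        for w, p_wc in corpus_model.items()
    }

# Hypothetical corpus counts, used only to build p(w|C).
corpus_counts = {"A": 7, "B": 6, "C": 7, "D": 5}
corpus_total = sum(corpus_counts.values())
corpus_model = {w: c / corpus_total for w, c in corpus_counts.items()}

doc1 = background_model({"A": 3, "B": 5, "C": 2}, corpus_model)
print(doc1["D"])  # unseen word D now has a small nonzero probability
```

Because general words have high p(w|C), matching them adds little beyond what every document already receives, which is how this scheme discounts them.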
14
Two-Stage Smoothing

Basic Idea
The first stage smooths the document model using
the corpus model, and the second stage computes
the likelihood of the query according to a query
model.

(Zhai and Lafferty 2002)
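The slide's equation is missing from the transcript. As a hedged reconstruction, the two-stage estimate is usually written as a Dirichlet-smoothed document model interpolated with a query background model. The symbol names below are assumptions: \mu is the Dirichlet prior, \lambda_q the second-stage coefficient, |d| the document length, and p(w \mid U) the query background model, which is often approximated by the corpus model p(w \mid C).

```latex
p(w \mid d) \;=\; (1-\lambda_q)\,
  \frac{c(w, d) + \mu\, p(w \mid C)}{|d| + \mu}
  \;+\; \lambda_q\, p(w \mid U)
```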
15
Cluster Language Model

Motivation
Have more similar documents to estimate a more
accurate and smoothed model (Liu and Croft 2004)
Weakness
Time-consuming and not scalable to cluster a
large collection
The assumption that one document is associated
with only one cluster does not hold very well.

16
Statistical Translation Model

Motivating Example
A document containing "auto" should be returned
for the query "car"
Statistical Translation Model
Semantic relationships between document terms and
query terms are considered (Berger and Lafferty
1999)
Follow-up work: Jin, Hauptmann and Zhai 2002;
Lafferty and Zhai 2001; Cao et al. 2005
Unable to incorporate contextual and sense
information into the translation procedure.

17
Context-Sensitive Semantic Smoothing (Our
Approach)

Definition
Like the statistical translation model, term
semantic relationships are used for model
smoothing.
Unlike the statistical translation model,
contextual and sense information are considered
Method
Decompose a document into a set of
context-sensitive topic signatures and then
statistically translate topic signatures into
individual words.

18
Topic Signatures

Concept Pairs
A pair of concepts that are semantically and
syntactically related to each other
Example: computer and mouse, hypertension and
obesity
Extraction: ontology-based approach (Zhou et al.
2006, SIGIR)
Multiword Phrases
Example: Space Program, Star War, White House
Extraction: Xtract (Smadja 1993)

19
Translation Probability Estimate

Method
Use cooccurrence counts (topic signature and
individual words)
Use a mixture model to remove noise from
topic-free general words

Figure 2. Illustration of document indexing. Vt,
Vd and Vw are the topic signature set, document
set and word set, respectively.
Dk denotes the set of documents containing the
topic signature tk. The parameter α is the
coefficient controlling the influence of the
corpus model in the mixture model.
20
Translation Probability Estimate

Log likelihood of generating Dk

EM for estimation

Here c(w, Dk) is the document frequency of term w
in Dk, i.e., the cooccurrence count of w and tk in
the whole collection.
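The likelihood and update formulas were lost in extraction. The sketch below assumes the standard setup for a two-component mixture with a fixed background: each word occurrence in Dk is generated by the topic signature model p(w|tk) with weight (1 - alpha) or by the corpus model p(w|C) with weight alpha, and EM re-estimates p(w|tk). The function name, the value of alpha, the iteration count, and the example numbers are illustrative assumptions.

```python
def estimate_translation(cooc_counts, corpus_model, alpha=0.5, iterations=50):
    """EM estimate of p(w|tk) from cooccurrence counts c(w, Dk)."""
    total = sum(cooc_counts.values())
    p_wt = {w: c / total for w, c in cooc_counts.items()}  # start from the MLE
    for _ in range(iterations):
        # E-step: probability that w was generated by the topic signature
        # rather than by the corpus (background) model.
        post = {
            w: (1 - alpha) * p_wt[w]
               / ((1 - alpha) * p_wt[w] + alpha * corpus_model[w])
            for w in cooc_counts
        }
        # M-step: renormalize the counts attributed to the topic signature.
        weighted = {w: cooc_counts[w] * post[w] for w in cooc_counts}
        norm = sum(weighted.values())
        p_wt = {w: v / norm for w, v in weighted.items()}
    return p_wt

# Hypothetical counts for one topic signature; "the" is a general word.
cooc = {"orbit": 30, "the": 60, "mission": 10}
# Corpus probabilities of these words only (the rest of the vocabulary is omitted).
corpus = {"orbit": 0.01, "the": 0.6, "mission": 0.01}
translation = estimate_translation(cooc, corpus)
print(translation)
```

In this example the general word loses probability mass relative to its raw cooccurrence share, while the topic-related words gain it, which is the noise-removal effect the mixture model is meant to provide.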
21
Contrasting Translation Example
22
Topic Signature LM

Basic Idea
Linearly interpolate the topic signature based
translation model with a simple language model.
The document expansions based on
context-sensitive semantic smoothing will be very
specific.
The simple language model can capture the points
the topic signatures miss.

where the translation coefficient (λ) controls
the influence of the translation component in the
mixture model.
23
Topic Signature LM

The Simple Language Model

The Topic Signature Translation Model

c(ti, d) is the frequency of topic signature ti
in document d.
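The equations for this slide are missing, so the following sketch rests on stated assumptions: the simple language model is taken to be a background-smoothed unigram model, the translation model is taken to be the sum over topic signatures of p(w|tk) weighted by the signature's relative frequency in the document, and the two are linearly interpolated with a translation coefficient (called lam here). All numbers, names, and the example signature are hypothetical.

```python
def topic_signature_model(word_counts, signature_counts, translation,
                          corpus_model, alpha=0.05, lam=0.3):
    """Interpolate a simple language model with a topic signature translation model."""
    doc_len = sum(word_counts.values())
    sig_total = sum(signature_counts.values())
    model = {}
    for w, p_wc in corpus_model.items():
        # Simple language model (assumed background-smoothed).
        p_b = (1 - alpha) * word_counts.get(w, 0) / doc_len + alpha * p_wc
        # Translation model: sum_k p(w|tk) * c(tk, d) / sum_i c(ti, d).
        p_t = sum(translation[t].get(w, 0.0) * c / sig_total
                  for t, c in signature_counts.items())
        model[w] = (1 - lam) * p_b + lam * p_t
    return model

corpus_model = {"auto": 0.1, "car": 0.1, "engine": 0.1, "the": 0.7}
word_counts = {"auto": 2, "the": 3}
signature_counts = {"auto engine": 1}
translation = {"auto engine": {"auto": 0.4, "car": 0.4, "engine": 0.2}}

model = topic_signature_model(word_counts, signature_counts,
                              translation, corpus_model)
print(model["car"])  # "car" never occurs in the document but gets sizable mass
```

Setting lam to 0 recovers the simple language model alone, so the coefficient directly controls how much the document is expanded through its topic signatures.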
24
Text Retrieval Experiments

Collections
TREC Genomics Track 2004 and 2005
Use sub-collections
2004: 48,753 documents
2005: 41,018 documents
Measures
Mean Average Precision (MAP), Recall
Settings
Simple language model as the baseline
Use concept pairs as topic signatures
Background coefficient: 0.05
Pseudo-relevance feedback: top 50 documents,
expand 10 terms
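For reference, mean average precision can be computed as in the sketch below. It uses the common uninterpolated definition (precision at each relevant document, averaged over all relevant documents, then averaged over queries); the document IDs are hypothetical and this is not the official TREC evaluation tool.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@rank taken at each relevant document retrieved."""
    relevant_ids = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d2", "d1"], {"d1"}),                    # AP = 1/2
]
map_score = mean_average_precision(runs)
print(map_score)
```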

25
IR Experiment Results
Table 1. Comparison of the baseline language
model with the topic signature document model and
the topic signature query model. The two model
parameters are trained on the TREC04 dataset.
26
Effect of Document Smoothing
Figure 3. The variation of MAP with the
translation coefficient (λ), which controls the
influence of the translation component in the
topic signature language model.
27
Effect of Document Smoothing
Figure 4. The variation of MAP with the
translation coefficient (λ), which controls the
influence of the translation component in the
topic signature language model.
28
vs. Context-Insensitive Model

Context-Insensitive Semantic Smoothing

c(t(w, wk)) is the frequency count of topic
signature t(w, wk) in the whole collection
29
vs. Context-insensitive Model

Experiment Results

Table 2. Comparison of the context-sensitive
semantic smoothing (Sensitive) to the
context-insensitive semantic smoothing
(Insensitive) on MAP. The rightmost column is the
change of Sensitive over Insensitive.
30
vs. Other Approaches

Other Approaches
Simple language model with words as indexing units
Local Information Flow (Song and Bruza 2003):
context-sensitive semantic smoothing, but cannot
incorporate domain knowledge
Model-based Feedback (Zhai and Lafferty 2001)
Findings
Our approach achieved the best results for both
2004 and 2005.
The incorporation of domain knowledge did not
help much when using the simple language model.

31
vs. Other Approaches

Experiment Results

Table 3. Comparison of the retrieval performance
of six approaches on the TREC Genomics Track 2004
and 2005. The concept-based indexing is based on
the UMLS Metathesaurus. All approaches were
implemented by us.
32
Text Clustering Experiments

Using multiword phrases as topic signatures
The meaning is unambiguous in most cases
There are many statistical approaches to phrase
extraction
Applicable to any domain
Testing Collections
20-newsgroups, LA Times, TDT2
Evaluation Criterion
Normalized mutual information (NMI, Banerjee and
Ghosh, 2002)
Entropy (Steinbach et al., 2000)
Purity (Zhao and Karypis, 2001)
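Two of these criteria can be sketched as follows. This is an illustrative implementation, not the evaluation code behind the reported tables: NMI has several normalizations, and the one assumed here divides the mutual information by the square root of the product of the cluster and class entropies. The helper label_entropy is the entropy of a labeling used inside NMI, not the separate Entropy criterion listed above.

```python
import math
from collections import Counter

def purity(clusters, classes):
    """Fraction of documents that belong to the majority class of their cluster."""
    joint = Counter(zip(clusters, classes))
    best = {}
    for (k, _), c in joint.items():
        best[k] = max(best.get(k, 0), c)
    return sum(best.values()) / len(classes)

def label_entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def nmi(clusters, classes):
    """Mutual information normalized by sqrt(H(clusters) * H(classes))."""
    n = len(classes)
    joint = Counter(zip(clusters, classes))
    pk, pc = Counter(clusters), Counter(classes)
    mi = sum(c / n * math.log(c * n / (pk[k] * pc[j]))
             for (k, j), c in joint.items())
    denom = math.sqrt(label_entropy(clusters) * label_entropy(classes))
    return mi / denom if denom > 0 else 0.0

classes = ["x", "x", "y", "y"]
print(purity([0, 0, 1, 1], classes), nmi([0, 0, 1, 1], classes))  # perfect clustering
print(purity([0, 1, 0, 1], classes), nmi([0, 1, 0, 1], classes))  # uninformative clustering
```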

33
Statistics of Three Datasets
Table 4. Statistics of three datasets
Notes: In the testing phase, we create both small
and large testing collections. For small
collections, 100 documents are randomly selected
for each class. For large collections, 1000
documents are randomly selected for each class.
Agglomerative clustering is only evaluated on
small collections while partitional clustering
tests both.
34
Agglomerative Clustering
Table 5. NMI results of the agglomerative
hierarchical clustering with complete linkage
criterion. Bkg and Semantic denote simple
background smoothing and semantic smoothing,
respectively. means stop words are not removed.
The translation coefficient λ is trained on
TDT2.
35
Effect of Document Smoothing
Figure 5. The variation of the cluster quality
with the translation coefficient (λ), which
controls the influence of semantic smoothing
36
Partitional Clustering
Table 6. NMI results of partitional clustering on
large datasets and small datasets. Lap, Bkg,
and Semantic denote Laplacian smoothing,
background smoothing, and semantic smoothing,
respectively. means stop words are not
removed. The translation coefficient λ is trained
on TDT2.
37
Effect of Cluster Smoothing
Figure 6. The variation of the cluster quality on
small datasets with the translation coefficient
(λ), which controls the influence of semantic
smoothing. Stop words are removed.
38
Effect of Cluster Smoothing
Figure 7. The variation of the cluster quality on
large datasets with the translation coefficient
(λ), which controls the influence of semantic
smoothing. Stop words are removed.
39
Clustering Result Summary

Semantic smoothing is much more effective than
other schemes on agglomerative clustering, where
data sparsity is the major problem.
For partitional clustering, when the dataset is
small and data sparsity is the major problem,
semantic smoothing is very effective; otherwise,
it performs the same as background smoothing.
Although both semantic smoothing and background
smoothing can weaken the effect of general words,
they are less effective than TF-IDF, which is
more aggressive in discounting general words.
Laplacian smoothing is the worst among all tested
schemes.
Removing stop words or not has no effect on
TF-IDF, background smoothing, and semantic
smoothing, but has a significant effect on the
other schemes.

40
Conclusions and Future Work

The topic signature language model is very
effective in discounting general words and
smoothing unseen topic-related words.
Depending on how topic signatures are implemented,
the model can incorporate domain knowledge or not,
and can be applied to either general or specific
domains.
Future work
The optimization of the translation coefficient
can be improved.
Apply the model to other applications such as text
summarization and text classification.

41
Software and Papers

Download Dragon Toolkit: http://www.ischool.drexel.edu/dmbio/dragontool
Related Papers and Slides: http://www.pages.drexel.edu/xz37 or http://www.daviszhou.net

42
Acknowledgements

I thank my advisor Dr. Tony Hu, as well as thesis
committee members Dr. Song and Dr. Lin, for their
advice on my research work.
I thank Dr. Han and Dr. Weber for their help when
I was working with them as an RA.
I thank IST for the generous support of my travels
to academic conferences.
My research work is supported in part by an NSF
Career grant (NSF IIS 0448023), NSF CCF
0514679, and PA Dept of Health Grants (No. 240205,
No. 240196, No. 239667).

43
Questions/Comments?
