Title: Semantic-based Language Models for Text Retrieval and Clustering
1Semantic-based Language Models for Text Retrieval
and Clustering
- Xiaohua (Davis) Zhou
- College of Information Science and Technology
Drexel University
2Summary of Research Work
- Publications in the last three years
- 2 Journal Papers, 13 Conference Papers, 2
Workshop Papers, 10 of which are first-authored.
- IJCAI07 (best conference in AI and Knowledge
Base), SIGIR06 (best conference in Information
Retrieval), and ICDM06 (one of the three primary
conferences in Data Mining)
- Topic Distribution
- Information Retrieval (4), Information Extraction
(4), Text Mining (7), Other (2)
- Main Research Direction
- Statistical Language Models for Text Retrieval
and Mining
3Selected Publications
- Zhou, X., Zhang, X., and Hu, X., Semantic
Smoothing of Document Models for Agglomerative
Clustering, to appear in IJCAI 2007 (15.7)
- Zhou, X., Hu, X., Zhang, X., Lin, X., and Song,
I.-Y. Context-Sensitive Semantic Smoothing for
the Language Modeling Approach to Genomic IR, ACM
SIGIR 2006 (18.5)
- Zhou, X., Hu, X., Lin, X., and Zhang, X.
Relation-based Document Retrieval for Biomedical
IR, Transactions on Computational Systems
Biology, 2006 Vol. 4
- Zhou, X., Zhang, X., and Hu, X., Using
Concept-based Indexing to Improve Language
Modeling Approach to Genomic IR, ECIR 2006 (21)
- Zhang, X., Zhou, X., and Hu, X., Semantic
Smoothing for Model-based Document Clustering, to
appear in ICDM 2006 (short paper, 20)
4Statistical Language Models
- Statistical Language Models (LM)
- A language model is a probability distribution over
words
- Text is randomly generated according to a given
language model.
- Basic Questions
- Text Generation: given the model, compute the
probability of generating a text
- Inference: given a text, infer the underlying model
that generated the text.
5Example of Language Modeling
Figure 1. Illustration of the generative process
and the problem of statistical inference
underlying topic models.
6Applications of LM
- Text Prediction
- Computing the probability of generating a
sequence of words according to the trained model.
- Applications: Information Extraction, Text
Retrieval, Text Clustering, and Text
Classification
- Model Inference
- Infer the underlying model according to the
generated texts.
- Applications: Text Decoding, Topic Models
7LM for Text Retrieval
- Relevance
- The probability of generating the query by the
document (model), i.e. p(q|d)
- Example
- Document 1: (A, 3), (B, 5), (C, 2)
- Document 2: (A, 4), (B, 1), (C, 5)
- Query: A, B
- Which document is more relevant to the query?
Doc 1: 0.3 × 0.5 = 0.15
Doc 2: 0.4 × 0.1 = 0.04
Doc 1 is more relevant to the query than Doc 2.
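The arithmetic in this example can be reproduced with a minimal sketch, assuming each document model is a maximum-likelihood unigram model (word count divided by document length). The function names are illustrative and do not come from the slides.

```python
def unigram_model(counts):
    """Maximum-likelihood unigram model: p(w|d) = c(w, d) / document length."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def query_likelihood(query, model):
    """p(q|d) as the product of p(w|d) over the query words; unseen words get 0."""
    p = 1.0
    for w in query:
        p *= model.get(w, 0.0)
    return p

doc1 = unigram_model({"A": 3, "B": 5, "C": 2})
doc2 = unigram_model({"A": 4, "B": 1, "C": 5})
query = ["A", "B"]

score1 = query_likelihood(query, doc1)  # 0.3 * 0.5
score2 = query_likelihood(query, doc2)  # 0.4 * 0.1
print(score1 > score2)  # Doc 1 ranks above Doc 2
```

Any query word that is missing from a document drives the whole product to zero, which is why unsmoothed models behave poorly on sparse text.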
8LM for Text Clustering
- Agglomerative Clustering
- The pairwise document similarity is defined as
the similarity (i.e., KL-divergence) of the two
document models
- Partitional Clustering
- The similarity between a document and a
cluster is defined as the generative probability
of the document by the cluster model, i.e. p(d|c_j)
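Both similarity notions can be sketched in a few lines. This is an illustrative sketch, not the author's implementation: it assumes models are plain dictionaries of word probabilities that have already been smoothed so that no needed probability is zero, and the helper names are hypothetical. Note that KL-divergence is a distance, so a smaller value means more similar models.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_w p(w) * log(p(w) / q(w)); needs q(w) > 0 wherever p(w) > 0."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

def symmetric_kl(p, q):
    """Average of the two directions, so the measure does not depend on argument order."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

def log_generative_prob(doc_counts, cluster_model):
    """log p(d|c_j) = sum_w c(w, d) * log p(w|c_j)."""
    return sum(c * math.log(cluster_model[w]) for w, c in doc_counts.items())

model1 = {"A": 0.3, "B": 0.5, "C": 0.2}
model2 = {"A": 0.4, "B": 0.1, "C": 0.5}

pairwise_distance = symmetric_kl(model1, model2)
doc_log_prob = log_generative_prob({"A": 1, "B": 1}, model1)
```

Working in log space avoids numerical underflow when documents contain many words.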
9Where Is the Problem?
- Sparsity of Topical Words
- Document 1: (A, 3), (B, 5), (C, 2)
- Document 2: (A, 4), (B, 1), (C, 5)
- Query: A, D
- Which document is more relevant to the query?
Doc 1: 0.3 × 0 = 0
Doc 2: 0.4 × 0 = 0
Obviously, this result is not reasonable. Text clustering has
the same problem.
10Where Is the Problem?
- Density of Topic-free General Words
- An extreme example is stop words.
- Those words are assigned high
probability but contribute nothing to the retrieval
or clustering task.
- Any document pair could be considered similar for
clustering because they share many common
words
- Need to discount the effect of those general
words. The same idea as the TF-IDF weighting scheme.
11Summary of the LM Problems
- Need to assign reasonable probability (count) to
words unseen in the training data.
- Technically, avoid zero probability
- Account for the semantic relationship between
training words and testing words, e.g. the query
"car" for a document containing "auto".
- Need to discount topic-free general words.
- Remove noise
- Concentrate on topic-related words
- These two issues are exactly the goals of
statistical model smoothing.
12Laplacian Smoothing
- Basic Idea
- Simply assume the prior count of each word is 1.
- It only prevents zero probabilities and does not
make much sense for real applications.
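A minimal sketch of this idea, assuming a fixed vocabulary and treating the prior count as a parameter (`prior_count`, which defaults here to the conventional add-one value). The function name and the four-word vocabulary are illustrative only.

```python
def laplace_smoothed_model(counts, vocabulary, prior_count=1.0):
    """p(w|d) = (c(w, d) + prior_count) / (document length + prior_count * |V|)."""
    total = sum(counts.get(w, 0) for w in vocabulary) + prior_count * len(vocabulary)
    return {w: (counts.get(w, 0) + prior_count) / total for w in vocabulary}

vocabulary = ["A", "B", "C", "D"]
doc1 = laplace_smoothed_model({"A": 3, "B": 5, "C": 2}, vocabulary)
# The unseen word D now gets (0 + 1) / (10 + 4) instead of zero.
```

Every unseen word receives the same probability regardless of its meaning or its frequency in the collection, which is why this scheme only avoids zeros.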
13Background Smoothing
- The document model is smoothed by the corpus
model.
c(w, d) is the count of word w in document d, and C
denotes the corpus (Zhai and Lafferty 2001).
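The formula on this slide did not survive extraction. The sketch below assumes the usual linear interpolation form, p_b(w|d) = (1 - alpha) * p_ml(w|d) + alpha * p(w|C), where p_ml is the maximum-likelihood document model. The default alpha of 0.05 mirrors the background coefficient listed in the retrieval settings. The third document is hypothetical and exists only so that the word D occurs somewhere in the corpus.

```python
def ml_model(counts):
    """Maximum-likelihood model: p_ml(w|d) = c(w, d) / sum of all counts in d."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def background_smoothed(doc_counts, corpus_model, alpha=0.05):
    """Assumed form: p_b(w|d) = (1 - alpha) * p_ml(w|d) + alpha * p(w|C)."""
    p_ml = ml_model(doc_counts)
    return {w: (1 - alpha) * p_ml.get(w, 0.0) + alpha * p_c
            for w, p_c in corpus_model.items()}

def query_likelihood(query, model):
    """p(q|d) as the product of p(w|d) over the query words."""
    p = 1.0
    for w in query:
        p *= model.get(w, 0.0)
    return p

doc1_counts = {"A": 3, "B": 5, "C": 2}
doc2_counts = {"A": 4, "B": 1, "C": 5}
doc3_counts = {"A": 1, "D": 4}  # hypothetical document, added so D appears in the corpus

# Corpus model: pooled word counts over all documents, normalized.
corpus_counts = {}
for counts in (doc1_counts, doc2_counts, doc3_counts):
    for w, c in counts.items():
        corpus_counts[w] = corpus_counts.get(w, 0) + c
corpus_model = ml_model(corpus_counts)

doc1 = background_smoothed(doc1_counts, corpus_model)
doc2 = background_smoothed(doc2_counts, corpus_model)
score1 = query_likelihood(["A", "D"], doc1)
score2 = query_likelihood(["A", "D"], doc2)
```

For the query A, D both documents now receive a non-zero score, and Doc 2 ranks higher because it gives A the larger probability.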
14Two-Stage Smoothing
- Basic Idea
- The first stage smooths the document model using
the corpus model and the second stage involves
the computation of the likelihood of the query
according to a query model.
(Zhai and Lafferty 2002)
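The formula for this slide is missing from the transcript. As a hedged reconstruction of the two-stage form commonly attributed to Zhai and Lafferty (2002), and not taken from the slides themselves, the smoothed probability can be written as below. Here \mu is a Dirichlet prior parameter for the first stage, \lambda is the interpolation weight of the second stage (unrelated to the translation coefficient used later in this talk), |d| is the document length, and p(w | U) is a query background model; all four symbols are assumptions introduced for this sketch.

```latex
p(w \mid d) \;=\; (1 - \lambda)\,
  \frac{c(w, d) + \mu \, p(w \mid C)}{|d| + \mu}
  \;+\; \lambda \, p(w \mid U)
```

The first term smooths the document counts with the corpus model, and the second term accounts for noise words in the query.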
15Cluster Language Model
- Motivation
- Use more similar documents to estimate a more
accurate and smoothed model (Liu and Croft 2004)
- Weakness
- Clustering a large collection is time-consuming
and not scalable
- The assumption that one document is associated
with only one cluster does not hold very well.
16Statistical Translation Model
- Motivating Example
- A document containing "auto" should be returned for
the query "car"
- Statistical Translation Model
- Semantic relationships between document terms and
query terms are considered (Berger and Lafferty
1999)
- Follow-up: Jin, Hauptmann and Zhai 2002; Lafferty
and Zhai 2001; Cao et al. 2005
- Unable to incorporate contextual and sense
information into the translation procedure.
17Context-Sensitive Semantic Smoothing (Our
Approach)
- Definition
- Like the statistical translation model, term
semantic relationships are used for model
smoothing.
- Unlike the statistical translation model,
contextual and sense information is considered
- Method
- Decompose a document into a set of
context-sensitive topic signatures and then
statistically translate topic signatures into
individual words.
18Topic Signatures
- Concept Pairs
- A pair of concepts that are semantically and
syntactically related to each other
- Example: computer and mouse, hypertension and
obesity
- Extraction: ontology-based approach (Zhou et al.
2006, SIGIR)
- Multiword Phrases
- Example: Space Program, Star War, White House
- Extraction: Xtract (Smadja 1993)
19Translation Probability Estimate
- Method
- Use cooccurrence counts (topic signature and
individual words)
- Use a mixture model to remove noise from
topic-free general words
Figure 2. Illustration of document indexing. Vt,
Vd and Vw are the topic signature set, document set
and word set, respectively.
Dk denotes the set of documents containing the
topic signature tk. The parameter α is the
coefficient controlling the influence of the
corpus model in the mixture model.
20Translation Probability Estimate
- Log likelihood of generating Dk
Here the count of each word w is its document frequency in Dk,
i.e., the cooccurrence count of w and tk in the
whole collection.
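The estimation formulas for this step are not preserved in the transcript. The sketch below shows one standard way to fit such a two-component mixture with EM, assuming each word in Dk is generated by (1 - alpha) * p(w|tk) + alpha * p(w|C) with the corpus model held fixed. The function name, the alpha default, the iteration count, and all numbers are illustrative assumptions, not values from the talk.

```python
def estimate_translation_probs(cooc_counts, corpus_model, alpha=0.5, iterations=50):
    """EM sketch for p(w|tk) in the mixture (1 - alpha) * p(w|tk) + alpha * p(w|C)."""
    total = sum(cooc_counts.values())
    # Start from the relative cooccurrence frequencies.
    p_t = {w: c / total for w, c in cooc_counts.items()}
    for _ in range(iterations):
        # E-step: probability that an occurrence of w came from the topic signature
        # component rather than from the corpus (background) component.
        resp = {}
        for w, p in p_t.items():
            topic_part = (1 - alpha) * p
            denom = topic_part + alpha * corpus_model[w]
            resp[w] = topic_part / denom if denom > 0 else 0.0
        # M-step: renormalize the counts credited to the topic signature component.
        norm = sum(cooc_counts[w] * resp[w] for w in cooc_counts)
        p_t = {w: cooc_counts[w] * resp[w] / norm for w in cooc_counts}
    return p_t

# Hypothetical cooccurrence counts for one topic signature, plus corpus probabilities
# for the same words (the rest of the corpus mass sits on words not listed here).
cooc_counts = {"the": 50, "of": 40, "mouse": 8, "keyboard": 6}
corpus_model = {"the": 0.06, "of": 0.04, "mouse": 0.0005, "keyboard": 0.0004}

translation = estimate_translation_probs(cooc_counts, corpus_model)
```

Words that are frequent in the corpus as a whole lose probability mass to the background component, so topic-specific words such as "mouse" end up with a larger share than their raw cooccurrence frequency.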
21Contrasting Translation Example
22Topic Signature LM
- Basic Idea
- Linearly interpolate the topic signature based
translation model with a simple language model.
- The document expansions based on
context-sensitive semantic smoothing will be very
specific.
- The simple language model can capture the points
the topic signatures miss.
The translation coefficient (λ) controls
the influence of the translation component in the
mixture model.
23Topic Signature LM
- The Simple Language Model
- The Topic Signature Translation Model
c(ti, d) is the frequency of topic signature ti
in document d.
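The two component formulas are missing from the transcript, so the sketch below is a hedged reading of the description: the translation model sums p(w|ti) over the topic signatures of a document, weighted by their relative frequency c(ti, d) / sum_j c(tj, d), and the final model linearly interpolates it with the simple language model using the translation coefficient (called `lam` here). All names and numbers are illustrative assumptions.

```python
def translation_model(sig_counts, translation_probs):
    """p_t(w|d) = sum_i p(w|ti) * c(ti, d) / sum_j c(tj, d)."""
    total = sum(sig_counts.values())
    p_t = {}
    for t, c in sig_counts.items():
        for w, p in translation_probs[t].items():
            p_t[w] = p_t.get(w, 0.0) + p * c / total
    return p_t

def topic_signature_lm(simple_model, p_t, lam):
    """p(w|d) = (1 - lam) * p_simple(w|d) + lam * p_t(w|d)."""
    words = set(simple_model) | set(p_t)
    return {w: (1 - lam) * simple_model.get(w, 0.0) + lam * p_t.get(w, 0.0)
            for w in words}

# Hypothetical document: two topic signatures with their counts c(ti, d).
sig_counts = {"space program": 2, "white house": 1}
# Hypothetical translation probabilities p(w|ti).
translation_probs = {
    "space program": {"nasa": 0.6, "shuttle": 0.4},
    "white house": {"president": 0.7, "nasa": 0.3},
}
# Hypothetical simple language model for the same document.
simple_model = {"space": 0.5, "program": 0.3, "nasa": 0.2}

p_t = translation_model(sig_counts, translation_probs)
model = topic_signature_lm(simple_model, p_t, lam=0.3)
```

Words such as "shuttle" that never occur in the document still receive probability through the topic signatures, while the simple model keeps the words the signatures miss.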
24Text Retrieval Experiments
- Collections
- TREC Genomics Track 2004 and 2005
- Use sub-collections
- 2004: 48,753 documents
- 2005: 41,018 documents
- Measures
- Mean Average Precision (MAP), Recall
- Settings
- Simple language model as the baseline
- Use concept pairs as topic signatures
- Background coefficient: 0.05
- Pseudo-relevance feedback: top 50 documents,
expand 10 terms
25IR Experiment Results
Table 1. Comparison of the baseline language
model with the topic signature document model and
the topic signature query model. The model
parameters are trained on the TREC04 dataset.
26Effect of Document Smoothing
Figure 3. The variation of MAP with the
translation coefficient (λ), which controls the
influence of the translation component in the
topic signature language model.
27Effect of Document Smoothing
Figure 4. The variation of MAP with the
translation coefficient (λ), which controls the
influence of the translation component in the
topic signature language model.
28vs. Context-Insensitive Model
- Context-Insensitive Semantic Smoothing
c(t(w, wk)) is the frequency count of topic
signature t(w, wk) in the whole collection
29vs. Context-insensitive Model
Table 2. Comparison of the context-sensitive
semantic smoothing (Sensitive) to the
context-insensitive semantic smoothing
(Insensitive) on MAP. The rightmost column is the
change of Sensitive over Insensitive.
30vs. Other Approaches
- Other Approaches
- Simple language model with word as indexing unit
- Local Information Flow (Song and Bruza 2003)
- Context-sensitive semantic smoothing
- Cannot incorporate domain knowledge
- Model-based Feedback (Zhai and Lafferty 2001)
- Findings
- Our approach achieved the best result for both
2004 and 2005.
- The incorporation of domain knowledge did not
help much when using the simple language model
31vs. Other Approaches
Table 3. Comparison of the retrieval performance
of six approaches on the TREC Genomics Track 2004 and
2005. The concept-based indexing is based on the
UMLS Metathesaurus. All approaches are
implemented by us.
32Text Clustering Experiments
- Using multiword phrases as topic signatures
- The meaning is unambiguous in most cases
- There are many statistical approaches to phrase
extraction
- Applicable to any domain
- Testing Collections
- 20-newsgroups, LA Times, TDT2
- Evaluation Criteria
- Normalized mutual information (NMI, Banerjee and
Ghosh, 2002)
- Entropy (Steinbach et al., 2000)
- Purity (Zhao and Karypis, 2001)
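Two of these criteria can be computed directly from the class labels and the cluster assignments. The sketch below assumes the common definition of NMI as mutual information divided by the geometric mean of the class and cluster entropies, and purity as the fraction of documents that belong to the majority class of their cluster. The function names are illustrative, and the exact normalization used in the cited papers should be checked before comparing numbers.

```python
import math
from collections import Counter

def nmi(labels_true, labels_pred):
    """Mutual information normalized by sqrt(H(classes) * H(clusters))."""
    n = len(labels_true)
    class_counts = Counter(labels_true)
    cluster_counts = Counter(labels_pred)
    joint_counts = Counter(zip(labels_true, labels_pred))
    mutual_info = sum(
        c / n * math.log(n * c / (class_counts[h] * cluster_counts[l]))
        for (h, l), c in joint_counts.items()
    )
    h_class = -sum(c / n * math.log(c / n) for c in class_counts.values())
    h_cluster = -sum(c / n * math.log(c / n) for c in cluster_counts.values())
    if h_class == 0 or h_cluster == 0:
        return 0.0  # degenerate case: a single class or a single cluster
    return mutual_info / math.sqrt(h_class * h_cluster)

def purity(labels_true, labels_pred):
    """Fraction of documents that fall in the majority class of their cluster."""
    joint_counts = Counter(zip(labels_pred, labels_true))
    best = {}
    for (cluster, _), c in joint_counts.items():
        best[cluster] = max(best.get(cluster, 0), c)
    return sum(best.values()) / len(labels_true)

classes = [0, 0, 1, 1]
perfect = [1, 1, 0, 0]      # same partition, different cluster ids
unrelated = [0, 1, 0, 1]    # each cluster mixes both classes evenly
```

NMI ignores how clusters are numbered, so the `perfect` assignment scores as a full match even though its ids are swapped.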
33Statistics of Three Datasets
Table 4. Statistics of three datasets
Notes: In the testing phase, we create both small
and large testing collections. For small
collections, 100 documents are randomly selected
for each class. For large collections, 1000
documents are randomly selected for each class.
Agglomerative clustering is only evaluated on
small collections while partitional clustering
tests both.
34Agglomerative Clustering
Table 5. NMI results of the agglomerative
hierarchical clustering with complete linkage
criterion. Bkg and Semantic denote simple
background smoothing and semantic smoothing,
respectively. means stop words are not removed.
The translation coefficient λ is trained on
TDT2.
35Effect of Document Smoothing
Figure 5. The variation of the cluster quality
with the translation coefficient (λ), which
controls the influence of semantic smoothing.
36Partitional Clustering
Table 6. NMI results of partitional clustering on
large datasets and small datasets. Lap, Bkg,
and Semantic denote Laplacian smoothing,
background smoothing, and semantic smoothing,
respectively. means stop words are not
removed. The translation coefficient λ is trained
from TDT2.
37Effect of Cluster Smoothing
Figure 6. The variation of the cluster quality on
small datasets with the translation coefficient
(λ), which controls the influence of semantic
smoothing. Stop words are removed.
38Effect of Cluster Smoothing
Figure 7. The variation of the cluster quality on
large datasets with the translation coefficient
(λ), which controls the influence of semantic
smoothing. Stop words are removed.
39Clustering Result Summary
- Semantic smoothing is much more effective than
other schemes on agglomerative clustering, where
data sparsity is the major problem.
- For partitional clustering, when the dataset is small
and data sparsity is the major problem, semantic
smoothing is very effective; otherwise, it is on par
with background smoothing.
- Although both semantic smoothing and background
smoothing can weaken the effect of general words,
they are less effective than TF-IDF, which is more
aggressive in discounting general words.
- Laplacian smoothing is the worst among all tested
schemes.
- Removing stop words or not has no effect on TF-IDF,
background smoothing, and semantic smoothing, but
has a significant effect on other schemes.
40Conclusions and Future Work
- The topic signature language model is very
effective in discounting general words and
smoothing unseen topic-related words
- Depending on the implementation of topic
signatures, the model can incorporate domain
knowledge or not, and can be applied to either
general domains or specific domains.
- Future work
- The optimization of the translation coefficient can
be improved.
- Apply the model to other applications such as text
summarization and text classification.
41Software and Papers
- Download Dragon Toolkit: http://www.ischool.drexel.edu/dmbio/dragontool
- Related Papers and Slides: http://www.pages.drexel.edu/xz37 or http://www.daviszhou.net
42Acknowledgements
- Thank my advisor Dr. Tony Hu as well as thesis
committee members Dr. Song and Dr. Lin for their
advice on my research work
- Thank Dr. Han and Dr. Weber for their help when I
was working with them as an RA
- Thank IST for the generous support of my travels
to academic conferences
- My research work is supported in part by NSF
Career grant (NSF IIS 0448023), NSF CCF
0514679, PA Dept of Health Grants (No. 240205,
No. 240196, No. 239667).
43Questions/Comments?