Semantic-based Language Models for Text Retrieval and Clustering

1
Semantic-based Language Models for Text Retrieval
and Clustering
  • Xiaohua (Davis) Zhou
  • College of Information Science & Technology,
    Drexel University

2
Summary of Research Work
  • Publications in the last three years
  • 2 journal papers, 13 conference papers, and 2
    workshop papers, 10 of which are first-authored
  • IJCAI07 (the best conference in AI and knowledge
    bases), SIGIR06 (the best conference in
    information retrieval), and ICDM06 (one of the
    three primary conferences in data mining)
  • Topic Distribution
  • Information Retrieval (4), Information Extraction
    (4), Text Mining (7), Other (2)
  • Main Research Direction
  • Statistical Language Models for Text Retrieval
    and Mining

3
Selected Publications
  • Zhou, X., Zhang, X., and Hu, X. Semantic
    Smoothing of Document Models for Agglomerative
    Clustering, to appear in IJCAI 2007 (15.7%
    acceptance)
  • Zhou, X., Hu, X., Zhang, X., Lin, X., and Song,
    I.-Y. Context-Sensitive Semantic Smoothing for
    the Language Modeling Approach to Genomic IR, ACM
    SIGIR 2006 (18.5% acceptance)
  • Zhou, X., Hu, X., Lin, X., and Zhang, X.
    Relation-based Document Retrieval for Biomedical
    IR, Transactions on Computational Systems
    Biology, 2006, Vol. 4
  • Zhou, X., Zhang, X., and Hu, X. Using
    Concept-based Indexing to Improve Language
    Modeling Approach to Genomic IR, ECIR 2006 (21%
    acceptance)
  • Zhang, X., Zhou, X., and Hu, X. Semantic
    Smoothing for Model-based Document Clustering, to
    appear in ICDM 2006 (short paper, 20% acceptance)

4
Statistical Language Models
  • Statistical Language Models (LM)
  • A language model is a probability distribution
    over words.
  • Text is randomly generated according to a given
    language model.
  • Basic Questions
  • Text generation: given the model, compute the
    probability of generating a text.
  • Inference: given a text, infer the underlying
    model that generated it.
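A minimal sketch of these two questions in Python
(illustrative code, not from the talk): infer a unigram
model from a text by maximum likelihood, then compute the
probability of generating a new text from it.

  from collections import Counter

  def train_unigram(text):
      """Inference: estimate a unigram model (MLE) from a text."""
      counts = Counter(text.split())
      total = sum(counts.values())
      return {w: c / total for w, c in counts.items()}

  def generation_prob(model, text):
      """Generation: probability of producing `text` word by word."""
      p = 1.0
      for w in text.split():
          p *= model.get(w, 0.0)  # unseen words get zero probability
      return p

  model = train_unigram("the rocket reached orbit the moon mission")
  print(generation_prob(model, "rocket orbit"))  # (1/7) * (1/7)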

5
Example of Language Modeling
Figure 1. Illustration of the generative process
and the problem of statistical inference
underlying topic models.
6
Applications of LM
  • Text Prediction
  • Compute the probability of generating a sequence
    of words according to the trained model.
  • Applications: Information Extraction, Text
    Retrieval, Text Clustering, and Text
    Classification
  • Model Inference
  • Infer the underlying model from the generated
    texts.
  • Applications: Text Decoding, Topic Models

7
LM for Text Retrieval
  • Relevance
  • The probability of generating the query from the
    document (model), i.e., p(q|d)
  • Example
  • Document 1: (A, 3), (B, 5), (C, 2)
  • Document 2: (A, 4), (B, 1), (C, 5)
  • Query: A, B
  • Which document is more relevant to the query?

Doc 1: 0.3 × 0.5 = 0.15; Doc 2: 0.4 × 0.1 = 0.04.
Doc 1 is more relevant to the query than Doc 2.
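A small sketch of this query-likelihood scoring
(hypothetical helper, not the talk's code); it reproduces
the numbers above under maximum-likelihood document models.

  def p_query_given_doc(query, doc_counts):
      """Query likelihood p(q|d) under the MLE unigram document model."""
      total = sum(doc_counts.values())
      p = 1.0
      for term in query:
          p *= doc_counts.get(term, 0) / total
      return p

  doc1 = {"A": 3, "B": 5, "C": 2}
  doc2 = {"A": 4, "B": 1, "C": 5}
  print(p_query_given_doc(["A", "B"], doc1))  # 0.3 * 0.5 = 0.15
  print(p_query_given_doc(["A", "B"], doc2))  # 0.4 * 0.1 = 0.04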
8
LM for Text Clustering
  • Agglomerative Clustering
  • The pairwise document similarity is defined via
    the divergence (i.e., KL-divergence) between the
    two document models.
  • Partitional Clustering
  • The similarity between a document and a cluster
    is defined as the generative probability of the
    document under the cluster model, i.e., p(d|c_j).
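A rough sketch of the KL-divergence comparison for the
agglomerative case (illustrative code; the epsilon floor is
an assumption to keep the divergence finite on mismatched
vocabularies).

  import math

  def kl_divergence(p, q, vocab, eps=1e-9):
      """KL(p || q) over a shared vocabulary; lower = more similar."""
      return sum(p.get(w, eps) * math.log(p.get(w, eps) / q.get(w, eps))
                 for w in vocab)

  d1 = {"A": 0.3, "B": 0.5, "C": 0.2}
  d2 = {"A": 0.4, "B": 0.1, "C": 0.5}
  print(kl_divergence(d1, d2, {"A", "B", "C"}))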

9
Where Is the Problem?
  • Sparsity of Topical Words
  • Document 1: (A, 3), (B, 5), (C, 2)
  • Document 2: (A, 4), (B, 1), (C, 5)
  • Query: A, D
  • Which document is more relevant to the query?

Doc 1: 0.3 × 0 = 0; Doc 2: 0.4 × 0 = 0.
Obviously, this result is not reasonable. Text
clustering has the same problem.
10
Where Is the Problem?
  • Density of Topic-free General Words
  • An extreme example is stop words.
  • These words are assigned high probability but
    contribute nothing to the retrieval or clustering
    task.
  • Any document pair could be considered similar for
    clustering because they share many common words.
  • We need to discount the effect of these general
    words, the same idea as the TF-IDF weighting
    scheme.

11
Summary of the LM Problems
  • Need to assign reasonable probability (count) to
    words unseen in the training data
  • Technically: avoid zero probabilities
  • Account for the semantic relationship between
    training words and testing words, e.g., the query
    "car" against a document containing "auto"
  • Need to discount topic-free general words
  • Remove noise
  • Concentrate on topic-related words
  • These two issues are exactly the goals of
    statistical model smoothing.

12
Laplacian Smoothing
  • Basic Idea
  • Simply assume the prior count of each word is 2.
  • This merely prevents zero probabilities; it does
    not make much sense for real applications.
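A sketch of this additive smoothing (illustrative code),
assuming the prior count from the slide is simply added to
every vocabulary word before normalizing.

  def laplace_smooth(doc_counts, vocab, prior=2.0):
      """p(w|d) with a fixed prior count added to every vocabulary word."""
      total = sum(doc_counts.values()) + prior * len(vocab)
      return {w: (doc_counts.get(w, 0) + prior) / total for w in vocab}

  print(laplace_smooth({"A": 3, "B": 5, "C": 2}, ["A", "B", "C", "D"]))
  # the unseen word "D" now gets 2/18 instead of zero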

13
Background Smoothing
  • The document model is smoothed by the corpus
    model.

c(w, d) is the count of word w in document d; C
denotes the corpus (Zhai and Lafferty 2001).
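A sketch of this smoothing, assuming the standard
Jelinek-Mercer interpolation from Zhai and Lafferty (2001),
p(w|d) = (1 - α) p_ml(w|d) + α p(w|C); the default
coefficient below matches the 0.05 used in the retrieval
experiments later in the deck.

  def background_smooth(doc_counts, corpus_model, alpha=0.05):
      """p(w|d) = (1 - alpha) * p_ml(w|d) + alpha * p(w|C)."""
      total = sum(doc_counts.values())
      return {w: (1 - alpha) * doc_counts.get(w, 0) / total + alpha * p_c
              for w, p_c in corpus_model.items()}

  corpus = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}
  print(background_smooth({"A": 3, "B": 5, "C": 2}, corpus))
  # the unseen word "D" now gets alpha * 0.1 instead of zero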
14
Two-Stage Smoothing
  • Basic Idea
  • The first stage smooths the document model with
    the corpus model; the second stage computes the
    likelihood of the query according to a query
    model.

(Zhai and Lafferty 2002)
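A sketch assuming the usual two-stage form from that paper,
p(w|d) = (1 - λ)(c(w, d) + μ p(w|C)) / (|d| + μ) + λ p(w|U),
where μ is a Dirichlet prior and U is a query background
model; the parameter values below are placeholders.

  def two_stage_smooth(w, doc_counts, corpus_model, query_bg,
                       mu=1000.0, lam=0.5):
      """Stage 1: Dirichlet smoothing with the corpus model.
      Stage 2: interpolation with a query background model."""
      doc_len = sum(doc_counts.values())
      stage1 = (doc_counts.get(w, 0)
                + mu * corpus_model.get(w, 0.0)) / (doc_len + mu)
      return (1 - lam) * stage1 + lam * query_bg.get(w, 0.0)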
15
Cluster Language Model
  • Motivation
  • Pool similar documents to estimate a more
    accurate and smoother model (Liu and Croft 2004)
  • Weaknesses
  • Time-consuming and not scalable to clustering a
    large collection
  • The assumption that one document is associated
    with only one cluster does not hold very well.

16
Statistical Translation Model
  • Motivating Example
  • A document containing "auto" should be returned
    for the query "car".
  • Statistical Translation Model
  • Semantic relationships between document terms and
    query terms are considered (Berger and Lafferty
    1999); a sketch follows below.
  • Follow-up work: Jin, Hauptmann, and Zhai 2002;
    Lafferty and Zhai 2001; Cao et al. 2005
  • Unable to incorporate contextual and sense
    information into the translation procedure
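A sketch of this style of scoring (illustrative code),
assuming a word-to-word translation table t(q|w) is given:
each document word may "translate" to a query word, so
"auto" can contribute to the query "car".

  def translation_prob(query, doc_counts, trans):
      """p(q|d) = prod_i sum_w t(q_i|w) * p_ml(w|d);
      trans[w][q] is an assumed translation probability table."""
      doc_len = sum(doc_counts.values())
      score = 1.0
      for q in query:
          score *= sum(trans.get(w, {}).get(q, 0.0) * c / doc_len
                       for w, c in doc_counts.items())
      return score

  doc = {"auto": 2, "engine": 1}
  trans = {"auto": {"auto": 0.7, "car": 0.3}, "engine": {"engine": 1.0}}
  print(translation_prob(["car"], doc, trans))  # 0.3 * 2/3 = 0.2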

17
Context-Sensitive Semantic Smoothing (Our
Approach)
  • Definition
  • Like the statistical translation model, term
    semantic relationships are used for model
    smoothing.
  • Unlike the statistical translation model,
    contextual and sense information is considered.
  • Method
  • Decompose a document into a set of
    context-sensitive topic signatures and then
    statistically translate topic signatures into
    individual words.

18
Topic Signatures
  • Concept Pairs
  • A pair of concepts that are semantically and
    syntactically related to each other
  • Examples: (computer, mouse), (hypertension,
    obesity)
  • Extraction: ontology-based approach (Zhou et al.
    2006, SIGIR)
  • Multiword Phrases
  • Examples: Space Program, Star Wars, White House
  • Extraction: Xtract (Smadja 1993)

19
Translation Probability Estimate
  • Method
  • Use co-occurrence counts of topic signatures and
    individual words
  • Use a mixture model to remove noise from
    topic-free general words

Figure 2. Illustration of document indexing. V_t,
V_d, and V_w are the topic signature set, document
set, and word set, respectively.
D_k denotes the set of documents containing the
topic signature t_k. The parameter α is the
coefficient controlling the influence of the
corpus model in the mixture model.
20
Translation Probability Estimate
  • Log-likelihood of generating D_k
  • EM for estimation

where c(w, D_k) is the document frequency of term
w in D_k, i.e., the co-occurrence count of w and
t_k in the whole collection.
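A hedged EM sketch for this mixture, assuming the usual
setup: each word occurrence in D_k is generated either by
the signature's translation model (with probability 1 - α)
or by the corpus model (with probability α), and EM
re-estimates p(w|t_k). Names and the value of α are
illustrative.

  def estimate_translation(cooc, corpus_model, alpha=0.3, iters=30):
      """EM for p(w|t_k); cooc[w] is the co-occurrence count of w
      with the topic signature t_k (its document frequency in D_k)."""
      vocab = list(cooc)
      p_t = {w: 1.0 / len(vocab) for w in vocab}  # uniform start
      for _ in range(iters):
          # E-step: posterior that w came from the signature model
          post = {w: (1 - alpha) * p_t[w] /
                     ((1 - alpha) * p_t[w] + alpha * corpus_model.get(w, 1e-12))
                  for w in vocab}
          # M-step: renormalize the posterior-weighted counts
          norm = sum(cooc[w] * post[w] for w in vocab)
          p_t = {w: cooc[w] * post[w] / norm for w in vocab}
      return p_t

  corpus = {"the": 0.5, "orbit": 0.01, "rocket": 0.01}
  cooc = {"the": 90, "orbit": 40, "rocket": 35}
  print(estimate_translation(cooc, corpus))
  # "the" is largely absorbed by the corpus model, as intended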
21
Contrasting Translation Example
22
Topic Signature LM
  • Basic Idea
  • Linearly interpolate the topic-signature-based
    translation model with a simple language model.
  • The document expansions based on
    context-sensitive semantic smoothing will be very
    specific.
  • The simple language model can capture the points
    the topic signatures miss.

where the translation coefficient (λ) controls
the influence of the translation component in the
mixture model.
23
Topic Signature LM
  • The Simple Language Model
  • The Topic Signature Translation Model

c(t_i, d) is the frequency of topic signature t_i
in document d.
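Putting these definitions together, a sketch of the final
smoothed document model (helper names and the value of λ
are illustrative): p_t(w|d) mixes the translations of each
signature in d, and the final estimate interpolates it with
the simple model.

  def topic_signature_model(sig_counts, translations, w):
      """p_t(w|d) = sum_k p(w|t_k) * p_ml(t_k|d)."""
      total = sum(sig_counts.values())
      return sum(translations[t].get(w, 0.0) * c / total
                 for t, c in sig_counts.items())

  def smoothed_doc_model(w, p_simple, sig_counts, translations, lam=0.3):
      """p(w|d) = (1 - lam) * p_b(w|d) + lam * p_t(w|d)."""
      return ((1 - lam) * p_simple
              + lam * topic_signature_model(sig_counts, translations, w))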
24
Text Retrieval Experiments
  • Collections
  • TREC Genomics Track 2004 and 2005
  • Subcollections used
  • 2004: 48,753 documents
  • 2005: 41,018 documents
  • Measures
  • Mean Average Precision (MAP), Recall
  • Settings
  • Simple language model as the baseline
  • Concept pairs as topic signatures
  • Background coefficient: 0.05
  • Pseudo-relevance feedback: top 50 documents,
    expand 10 terms

25
IR Experiment Results
Table 1. Comparison of the baseline language
model with the topic signature document model and
the topic signature query model. The parameters
of both models are trained on the TREC04 dataset.
26
Effect of Document Smoothing
Figure 3. The variation of MAP with the
translation coefficient (λ), which controls the
influence of the translation component in the
topic signature language model.
27
Effect of Document Smoothing
Figure 4. The variation of MAP with the
translation coefficient (λ), which controls the
influence of the translation component in the
topic signature language model.
28
vs. Context-Insensitive Model
  • Context-Insensitive Semantic Smoothing

c(t(w, w_k)) is the frequency count of the topic
signature t(w, w_k) in the whole collection.
29
vs. Context-Insensitive Model
  • Experiment Results

Table 2. Comparison of context-sensitive
semantic smoothing (Sensitive) with
context-insensitive semantic smoothing
(Insensitive) on MAP. The rightmost column is the
change of Sensitive over Insensitive.
30
vs. Other Approaches
  • Other Approaches
  • Simple language model with words as the indexing
    unit
  • Local Information Flow (Song and Bruza 2003)
  • Context-sensitive semantic smoothing
  • Cannot incorporate domain knowledge
  • Model-based feedback (Zhai and Lafferty 2001)
  • Findings
  • Our approach achieved the best results on both
    the 2004 and 2005 collections.
  • Incorporating domain knowledge did not help much
    when using the simple language model.

31
vs. Other Approaches
  • Experiment Results

Table 3. Comparison of the retrieval performance
of six approaches on the TREC Genomics Track 2004
and 2005 collections. The concept-based indexing
is based on the UMLS Metathesaurus. All approaches
were implemented by us.
32
Text Clustering Experiments
  • Using multiword phrases as topic signatures
  • Their meaning is unambiguous in most cases
  • There are many statistical approaches to phrase
    extraction
  • Applicable to any domain
  • Testing Collections
  • 20-newsgroups, LA Times, TDT2
  • Evaluation Criteria
  • Normalized mutual information (NMI; Banerjee and
    Ghosh, 2002)
  • Entropy (Steinbach et al., 2000)
  • Purity (Zhao and Karypis, 2001)
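For reference, NMI can be computed directly from cluster
assignments; a quick sketch with scikit-learn, assuming it
is installed (the talk's exact NMI variant may differ
slightly).

  from sklearn.metrics import normalized_mutual_info_score

  true_labels = [0, 0, 1, 1, 2, 2]    # gold classes
  found_labels = [0, 0, 1, 2, 2, 2]   # clustering output
  print(normalized_mutual_info_score(true_labels, found_labels))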

33
Statistics of Three Datasets
Table 4. Statistics of the three datasets.
Notes: In the testing phase, we create both small
and large testing collections. For the small
collections, 100 documents are randomly selected
from each class; for the large collections, 1,000
documents are randomly selected from each class.
Agglomerative clustering is evaluated only on the
small collections, while partitional clustering is
tested on both.
34
Agglomerative Clustering
Table 5. NMI results of agglomerative
hierarchical clustering with the complete-linkage
criterion. Bkg and Semantic denote simple
background smoothing and semantic smoothing,
respectively; * means stop words are not removed.
The translation coefficient λ is trained on TDT2.
35
Effect of Document Smoothing
Figure 5. The variation of cluster quality with
the translation coefficient (λ), which controls
the influence of semantic smoothing.
36
Partitional Clustering
Table 6. NMI results of partitional clustering on
the large and small datasets. Lap, Bkg, and
Semantic denote Laplacian smoothing, background
smoothing, and semantic smoothing, respectively;
* means stop words are not removed. The
translation coefficient λ is trained on TDT2.
37
Effect of Cluster Smoothing
Figure 6. The variation of cluster quality on the
small datasets with the translation coefficient
(λ), which controls the influence of semantic
smoothing. Stop words are removed.
38
Effect of Cluster Smoothing
Figure 7. The variation of cluster quality on the
large datasets with the translation coefficient
(λ), which controls the influence of semantic
smoothing. Stop words are removed.
39
Clustering Result Summary
  • Semantic smoothing is much more effective than
    the other schemes for agglomerative clustering,
    where data sparsity is the major problem.
  • For partitional clustering, semantic smoothing is
    very effective when the dataset is small and data
    sparsity is the major problem; otherwise, it
    performs about the same as background smoothing.
  • Although both semantic smoothing and background
    smoothing can weaken the effect of general words,
    they are less effective than TF-IDF, which
    discounts general words more aggressively.
  • Laplacian smoothing is the worst of all tested
    schemes.
  • Whether stop words are removed has no effect on
    TF-IDF, background smoothing, or semantic
    smoothing, but a significant effect on the other
    schemes.

40
Conclusions and Future Work
  • The topic signature language model is very
    effective at discounting general words and
    smoothing unseen topic-related words.
  • Depending on the implementation of topic
    signatures, the model can incorporate domain
    knowledge or not, and can be applied to either
    general or specific domains.
  • Future Work
  • Improve the optimization of the translation
    coefficient.
  • Apply the model to other applications such as
    text summarization and text classification.

41
Software and Papers
  • Download the Dragon Toolkit:
    http://www.ischool.drexel.edu/dmbio/dragontool
  • Related papers and slides:
    http://www.pages.drexel.edu/xz37 or
    http://www.daviszhou.net

42
Acknowledgements
  • Thanks to my advisor Dr. Tony Hu and my thesis
    committee members Dr. Song and Dr. Lin for their
    advice on my research work
  • Thanks to Dr. Han and Dr. Weber for their help
    while I was working with them as an RA
  • Thanks to IST for its generous support of my
    travel to academic conferences
  • My research work is supported in part by an NSF
    CAREER grant (NSF IIS 0448023), NSF CCF 0514679,
    and PA Dept. of Health grants (No. 240205,
    No. 240196, No. 239667).

43
Questions/Comments?