XRCE at CLEF 07 Domain-specific Track
1
XRCE at CLEF 07 Domain-specific Track
  • Stephane Clinchant and Jean-Michel Renders
  • (presented by Gabriela Csurka)
  • Xerox Research Centre Europe
  • France

2
Outline
  • Introduction
  • Mono-lingual Domain-Specific Information
    Retrieval
  • Query Language Model Refinement with PRF
  • Lexical Entailment
  • Results
  • Cross-lingual Domain-Specific Information
    Retrieval
  • Machine translation (Matrax)
  • Dictionary Adaptation
  • Results
  • Conclusion

3
CLEF domain-specific track
  • Domain-Specific Information Retrieval
  • Leveraging the structure of data in collections
    (i.e. controlled vocabularies and other metadata)
    to improve search.
  • Multilingual database in social science domain
  • German Social Science Information Centres
    databases
  • Social Science Research Projects databases
  • Tasks
  • Mono-lingual retrieval: queries and documents are
    in the same language
  • Cross-lingual retrieval: queries in one language
    are used with a collection in a different
    language.

4
Information retrieval
  • Simple keyword matching is not enough to retrieve
    the best documents for a query.

[Diagram: an information need is expressed as a query, which is matched against documents d1 … dn.]
Drawing borrowed from C. Manning and P. Raghavan's lectures at Stanford University
5
IR with Language Modeling
  • Treat each document as a multinomial distribution
    model θ_d

[Diagram: the query is viewed as being generated from the document language models θ_d1 … θ_dn.]
  • Then the documents d in the corpus can be ranked
  • either by the query likelihood P(q | θ_d)
  • or by computing a cross-entropy similarity between
    θ_d and θ_q

6
How we estimate θ_d
  • A simple language model can be obtained by Maximum
    Likelihood, considering the frequency of words in
    the document: P_ML(w | d) = c(w; d) / |d|
  • The probabilities are smoothed with the corpus
    language model P(w | C)
  • We used Jelinek-Mercer interpolation:
    P(w | θ_d) = λ P_ML(w | d) + (1 - λ) P(w | C)
  • The role of smoothing:
  • it makes the LM more accurate (a query word can be
    absent from a document)
  • the corpus term has an IDF effect (it renormalizes
    the frequency of a word with respect to its
    occurrence in the corpus C)
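As a toy illustration of this estimation (a two-document corpus and a λ value invented for the example, not the parameters of the actual runs), the Maximum Likelihood estimate, the Jelinek-Mercer interpolation and query-likelihood ranking can be sketched as:

```python
# Toy sketch of Jelinek-Mercer smoothing and query-likelihood ranking.
# The documents and the lambda value are illustrative assumptions.
from collections import Counter
import math

docs = {
    "d1": "social science research methods survey".split(),
    "d2": "survey data analysis statistics modeling".split(),
}
corpus = [w for d in docs.values() for w in d]
corpus_counts = Counter(corpus)
LAMBDA = 0.7  # interpolation weight (assumed, not the paper's value)

def p_w_d(w, d):
    """Smoothed P(w | theta_d) = lambda * P_ML(w | d) + (1 - lambda) * P(w | C)."""
    p_ml = Counter(docs[d])[w] / len(docs[d])  # Maximum Likelihood estimate
    p_c = corpus_counts[w] / len(corpus)       # corpus language model
    return LAMBDA * p_ml + (1 - LAMBDA) * p_c

def score(query, d):
    """Query log-likelihood under the smoothed document language model."""
    return sum(math.log(p_w_d(w, d)) for w in query)

query = "social survey".split()
ranking = sorted(docs, key=lambda d: score(query, d), reverse=True)
print(ranking)  # d1 contains both query words, so it ranks first
```

Without the corpus term, d2 would get probability zero for "social"; the smoothing both avoids that and down-weights words that are frequent everywhere, which is the IDF effect mentioned above.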

7
Query LM Refinement with PRF
  • Aim: adapt (refine) the LM of a particular
    query
  • How: by detecting implicit relevant concepts
    present in the retrieved documents, using
    pseudo-relevance feedback (PRF)

[Diagram: the query model θ_q produces a first ranking; from the top-N documents (pseudo-relevance feedback) a feedback model θ_F is estimated; the final ranking re-ranks documents with the refined query model α θ_q + (1 - α) θ_F.]
8
How to estimate θ_F
  • Let F(q) = {d1, d2, …, dN} be the N most relevant
    documents for query q
  • Each di is drawn following a mixture of θ_F and
    the corpus model:
    P(di) = Π_w [β P(w | θ_F) + (1 - β) P(w | C)]^c(w; di)
  • with θ_F assumed to be multinomial (peaked at
    relevant terms).
  • θ_F is then estimated by the EM algorithm from the
    global likelihood,
  • where P(w | C) is the word probability built upon
    the corpus and β (= 0.5) is a fixed parameter

Zhai and Lafferty, SIGIR 2001.
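The estimation above can be sketched as follows (toy feedback documents, an invented corpus model P(w|C), β = 0.5 as on the slide, and an assumed α = 0.5 for the final interpolation of the previous slide):

```python
# Sketch of the Zhai-Lafferty mixture-model feedback estimate: words in
# the feedback documents come either from theta_F or from the corpus
# model; theta_F is fit with EM. All data here is invented.
from collections import Counter

feedback_docs = [
    "unemployment labour market policy".split(),
    "labour market reform policy".split(),
]
counts = Counter(w for d in feedback_docs for w in d)
vocab = list(counts)

# Stand-in corpus model P(w|C): "reform" and "policy" assumed common
p_c = {"unemployment": 0.05, "labour": 0.05, "market": 0.1,
       "policy": 0.3, "reform": 0.5}

BETA = 0.5  # fixed mixture weight, as on the slide
theta_f = {w: 1 / len(vocab) for w in vocab}  # uniform initialization

for _ in range(50):
    # E-step: probability each occurrence of w was drawn from theta_F
    t = {w: BETA * theta_f[w] / (BETA * theta_f[w] + (1 - BETA) * p_c[w])
         for w in vocab}
    # M-step: re-estimate theta_F from the expected counts
    norm = sum(counts[w] * t[w] for w in vocab)
    theta_f = {w: counts[w] * t[w] / norm for w in vocab}

# Refined query model: alpha * theta_q + (1 - alpha) * theta_F
ALPHA = 0.5
theta_q = {"unemployment": 1.0}
refined = {w: ALPHA * theta_q.get(w, 0.0) + (1 - ALPHA) * theta_f[w]
           for w in vocab}
print(theta_f)  # feedback mass moves away from corpus-frequent "reform"
```

The EM pushes the feedback model's mass onto words that are frequent in the feedback documents but rare in the corpus, which is exactly the "peaked at relevant terms" behaviour described above.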
9
Lexical Entailment
  • Lexical Entailment: a thesaurus built
    automatically on a given corpus,
  • given by the probabilities P(u | v) that one term
    entails another term, based on the corpus,
  • which is filtered using the information gain; an
    additional parameter enables us to increase the
    weight given to the self-entailment P(u | u).
  • Applied to IR: words from the document are
    translated into the different query terms.
  • If we add background smoothing (interpolation with
    the corpus model), we obtain the final document
    model.
  • Pros: finds relations between terms that feedback
    cannot.
  • Cons: heavier to compute, and queries get longer.

S. Clinchant, C. Goutte, E. Gaussier: Lexical
Entailment for Information Retrieval, ECIR 06
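A minimal sketch of the retrieval use of entailment, with an invented entailment table P(u | v) (including self-entailment) and a tiny probability floor standing in for the background smoothing:

```python
# Sketch of entailment-based scoring: document terms are "translated"
# into query terms through P(u | v). Table values and documents are
# invented for illustration only.
import math

entail = {  # (query term u, doc term v) -> P(u | v), incl. P(u | u)
    ("unemployment", "unemployment"): 0.8,
    ("unemployment", "jobless"): 0.6,
    ("jobless", "jobless"): 0.4,
}

def score(query, doc_terms):
    """log P(q | d), each doc term allowed to entail each query term."""
    s = 0.0
    for u in query:
        # P(u | theta_d) = sum_v P(u | v) * P_ML(v | d)
        p = sum(entail.get((u, v), 0.0) for v in doc_terms) / len(doc_terms)
        s += math.log(p + 1e-9)  # tiny floor instead of background smoothing
    return s

d1 = ["jobless", "rate", "statistics"]
d2 = ["weather", "rate", "statistics"]
q = ["unemployment"]
print(score(q, d1) > score(q, d2))  # True: "jobless" entails the query term
```

Note that no pseudo-feedback is involved here: the relation between "jobless" and "unemployment" comes from the corpus-built thesaurus, which is what feedback alone cannot find when the top documents lack the query term.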
10
Other results for comparison
11
DS 07- Monolingual Official Runs
  • PRF + Lexical Entailment is a double Lexical
    Entailment model: a first LE model provides the
    system with an initial set of top-N documents,
    from which a mixture model for pseudo-feedback is
    built; a second retrieval is then performed, once
    again based on the LE model applied to the
    enriched query.

12
Cross-Lingual IR
[Diagram: an information need is expressed as a query in the source language and matched against documents d1 … dn in the target language; the translation step can be applied on either side: query translation or document translation.]
13
What to translate?
  • Document translation: translate the documents
    into the query language
  • Pros: translation may be (theoretically) more
    precise, and documents become readable by the
    user
  • Cons: huge volume to be translated
  • Query translation: translate the query into the
    document language
  • Pros: flexibility (translation on demand) and
    less text to translate
  • Cons: less precise, and the retrieved documents
    need to be translated to be readable.

14
How to translate?
  • Statistical Machine Translation: Matrax
  • Alignment model learnt on a parallel corpus
    (JRC-AC, with Giza word alignment)
  • Language model (N-gram) learnt on the GIRT corpus
  • Translates the source sentence into the K target
    sentences
  • Use them as a mono-lingual query
  • Dictionary-based approach, with or without
    adaptation
  • Extract a probabilistic bilingual dictionary from
    different resources (standard and domain-specific
    thesauri, JRC-AC)
  • Use the translated query with mono-lingual
    retrieval approaches
  • Adapt the dictionary to a particular (query
    (feedback), target corpus) pair
  • Use the adapted query with mono-lingual retrieval
    approaches

15
Dictionary-based CLIR without Adaptation
  • Idea: rank the documents
  • according to the cross-entropy similarity
    between the language model of the query and the
    language model of the document,
  • using a probabilistic bilingual dictionary given
    by P(wt | ws), the probability that the source
    word ws is translated into the target word wt
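Concretely, the query-side use of such a dictionary can be sketched as below (German source words and translation probabilities invented for illustration):

```python
# Sketch of dictionary-based query translation for CLIR: the target
# query model is P(wt | theta_q) = sum_ws P(wt | ws) * P_ML(ws | q).
# Dictionary entries and probabilities are invented.
dictionary = {  # source word -> {target word: P(wt | ws)}
    "arbeit": {"work": 0.6, "labour": 0.4},
    "markt": {"market": 1.0},
}

def translate_query_lm(source_query):
    """Build the target-language query model from the dictionary."""
    p_ws = 1.0 / len(source_query)  # uniform P_ML(ws | q) over query words
    target_lm = {}
    for ws in source_query:
        for wt, p in dictionary.get(ws, {}).items():
            target_lm[wt] = target_lm.get(wt, 0.0) + p * p_ws
    return target_lm

lm = translate_query_lm(["arbeit", "markt"])
print(lm)  # {'work': 0.3, 'labour': 0.2, 'market': 0.5}
```

The resulting model plugs directly into the mono-lingual cross-entropy ranking, which is why the same retrieval machinery can be reused after translation.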

16
Dictionary Adaptation
  • Aim: adapt the dictionary to a particular query
  • How: by detecting implicit coherence present in
    the relevant documents, using PRF
  • The first IR step (CL-LM) can also be seen as a
    dictionary disambiguation process.

[Diagram: the source query model θ_qs is translated with the general dictionary P(wt | ws) and used for a first CL-LM retrieval; from the top-N pseudo-feedback documents an adapted dictionary θ_st is estimated, and a second CL-LM retrieval with the adapted dictionary produces the final ranking of target documents.]
17
How to estimate θ_st
  • Let F(q) be the relevant documents retrieved by
    the translated query using CL-LM.
  • The global model likelihood becomes a mixture
    involving the adapted translation model θ_st, as
    in the mono-lingual PRF case.
  • The estimation of θ_st is done by EM, initializing
    it with the general dictionary P(wt | ws).
  • Finally, we apply CL-LM again, but with the
    adapted query language model.
  • Notes:
  • This is an extension of the Query LM Refinement
    with PRF to the multi-lingual case.
  • This algorithm realizes both query enrichment
    and dictionary adaptation.
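As a deliberately simplified illustration of the effect (a single reweighting step standing in for the full EM estimation of θ_st; dictionary and feedback text invented): an ambiguous source word with two candidate translations is disambiguated by the pseudo-feedback documents.

```python
# Simplified sketch of dictionary adaptation: translation probabilities
# P(wt | ws) are re-weighted by how well each candidate translation is
# supported by the pseudo-feedback documents, then renormalized.
# One reweighting step stands in for the full EM; all data is invented.
from collections import Counter

dictionary = {
    "bank": {"bank": 0.5, "bench": 0.5},  # ambiguous source word
}
feedback = "bank credit bank loan interest".split()  # top-N doc words
tf = Counter(feedback)
total = len(feedback)
EPS = 1e-6  # small floor so unseen translations keep nonzero mass

adapted = {}
for ws, translations in dictionary.items():
    # prior P(wt | ws) times the evidence from the feedback documents
    weights = {wt: p * (tf[wt] / total + EPS)
               for wt, p in translations.items()}
    norm = sum(weights.values())
    adapted[ws] = {wt: w / norm for wt, w in weights.items()}

print(adapted["bank"])  # nearly all mass moves to the "bank" sense
```

In the real system the adapted dictionary then feeds a second CL-LM retrieval, so the adaptation acts both as query enrichment and as disambiguation.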

18
Other results for comparison
19
DS 07- Bilingual Official Runs
20
Conclusion
  • Mono-lingual Domain-Specific Information
    Retrieval
  • The Query LM refinement with PRF (LMPRF) gives
    better performance than Lexical Entailment
    (Table 7), but unlike the latter it requires 2
    retrieval steps.
  • Both outperform the non-adapted LM case (Table 9).
  • Combining them allowed further improvements
    (Table 7).
  • Cross-lingual Domain-Specific Information
    Retrieval
  • Combining query translation with further
    adaptation was beneficial (Table 9).
  • Matrax is better than dictionary-based IR
    without adaptation (Table 9).
  • The dictionary adaptation method gave better
    results than query translation with Matrax,
    independently of the further mono-lingual
    adaptation (Table 8).
  • Combining Lexical Entailment with LMPRF was
    beneficial in the cross-lingual case too
    (Table 8).

21
Thank you for your attention!
  • Not satisfied with the answer?
  • You can always get an answer directly from the
    authors:
  • Stephane.Clinchant@xrce.xerox.com or
  • Jean-Michel.Renders@xrce.xerox.com

22
Back-up
23
MatraX
[Diagram, Matrax training pipeline: the JRC-AC parallel corpus is pre-processed and a bi-phrase library is constructed from the training set; a language model is built with the SRI LM toolkit on the GIRT corpus; model parameters are optimized on a development set; the decoder combines the bi-phrase library, the language model and the model parameters.]