Title: XRCE at CLEF 07 Domain-specific Track
1 XRCE at CLEF 07 Domain-specific Track
- Stephane Clinchant and Jean-Michel Renders
- (presented by Gabriela Csurka)
- Xerox Research Centre Europe
- France
2 Outline
- Introduction
- Mono-lingual Domain-Specific Information Retrieval
  - Query Language Model Refinement with PRF
  - Lexical Entailment
  - Results
- Cross-lingual Domain-Specific Information Retrieval
  - Machine translation (Matrax)
  - Dictionary Adaptation
  - Results
- Conclusion
3 CLEF domain-specific track
- Domain-Specific Information Retrieval
  - Leveraging the structure of data in collections (i.e. controlled vocabularies and other metadata) to improve search.
- Multilingual database in social science domain
  - German Social Science Information Centre's databases
  - Social Science Research Projects databases
- Tasks
  - Mono-lingual retrieval: queries and documents are in the same language
  - Cross-lingual retrieval: queries in one language are used with a collection in a different language.
4 Information retrieval
- Simple keyword matching is not enough to retrieve the best documents for a query.
[Figure: an information need is expressed as a query, which is matched against the documents d1, d2, ..., dn]
Courtesy: drawing borrowed from C. Manning and P. Raghavan lectures at Stanford University
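5 IR with Language Modeling
- Treat each document as a multinomial distribution model $\theta_d$
[Figure: an information need is expressed as a query, which is seen as generated from the documents d1, d2, ..., dn]
- Then the documents d in the corpus can be ranked
  - either by the probability $P(q \mid \theta_d)$ that the document model generates the query
  - or by computing a cross-entropy similarity between $\theta_d$ and $\theta_q$
Courtesy: drawing borrowed from C. Manning and P. Raghavan lectures at Stanford University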
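The cross-entropy ranking mentioned on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy probabilities, and the convention that a higher (less negative) score means a better match are all assumptions. It also assumes the document models are already smoothed, so every query word has non-zero probability.

```python
import math

def cross_entropy_score(query_lm, doc_lm):
    """Sum over query words of P(w | theta_q) * log P(w | theta_d).

    Higher (closer to zero) means the document model explains the query better.
    Assumes doc_lm gives a non-zero probability to every query word.
    """
    return sum(p_q * math.log(doc_lm[w]) for w, p_q in query_lm.items())

def rank(query_lm, doc_lms):
    """Return document ids sorted from best to worst score."""
    return sorted(
        doc_lms,
        key=lambda d: cross_entropy_score(query_lm, doc_lms[d]),
        reverse=True,
    )

# Toy, hand-made language models (hypothetical numbers).
query_lm = {"social": 0.5, "science": 0.5}
doc_lms = {
    "d1": {"social": 0.4, "science": 0.4, "other": 0.2},
    "d2": {"social": 0.1, "science": 0.1, "other": 0.8},
}
ranking = rank(query_lm, doc_lms)
```

When the query model is the plain relative frequency of the query words, this score is the query log-likelihood divided by the query length, so both options on the slide give the same ranking in that case.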
6 How we estimate $\theta_d$
- A simple language model could be obtained (Maximum Likelihood) by considering the frequency of words in the document d
- The probabilities are smoothed by the corpus language model
  - We used Jelinek-Mercer interpolation
- The role of smoothing the language model is
  - To make the LM more accurate (a query word can be absent from a document)
  - It has an IDF effect (it renormalizes the frequency of a word with respect to its occurrence in the corpus C)
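A minimal sketch of maximum-likelihood estimation followed by Jelinek-Mercer interpolation is given below. The slide's formulas did not survive extraction, so the exact form used here, $P(w \mid \theta_d) = (1-\lambda) P_{ML}(w \mid d) + \lambda P(w \mid \theta_C)$, the side on which $\lambda$ sits, and the value `lam=0.5` are assumptions, as are the helper names and the toy documents.

```python
from collections import Counter

def ml_lm(tokens):
    """Maximum-likelihood model: relative frequency of each word."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def jm_smooth(doc_tokens, corpus_lm, lam=0.5):
    """Jelinek-Mercer: (1 - lam) * P_ML(w | d) + lam * P(w | theta_C).

    The result is defined over the whole corpus vocabulary, so words absent
    from the document still receive a non-zero probability.
    """
    doc_ml = ml_lm(doc_tokens)
    return {
        w: (1 - lam) * doc_ml.get(w, 0.0) + lam * p_c
        for w, p_c in corpus_lm.items()
    }

# Toy corpus of two tiny documents (hypothetical data).
docs = {
    "d1": "social science research".split(),
    "d2": "science policy".split(),
}
corpus_lm = ml_lm([w for toks in docs.values() for w in toks])
theta_d2 = jm_smooth(docs["d2"], corpus_lm, lam=0.5)
```

In this toy case the word "social" never occurs in d2 but still gets probability 0.1 from the corpus model, which is the first role of smoothing listed on the slide.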
7 Query LM Refinement with PRF
- Aim: adapt (refine) the LM of a particular query
- How: detect implicit relevant concepts present in the retrieved documents using pseudo-relevance feedback (PRF)
[Figure: the query model $\theta_q$ retrieves the top N ranked texts based on query similarity (pseudo-relevance feedback); a feedback model $\theta_F$ is estimated from them and combined as $\alpha \theta_q + (1-\alpha) \theta_F$; the final rank is the documents re-ranked based on refined query similarity]
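8 How to estimate $\theta_F$
- Let $F(q) = \{d_1, d_2, \ldots, d_N\}$ be the N most relevant documents for query q
- Draw each $d_i$ from the mixture $\lambda \theta_F + (1-\lambda) P(\cdot \mid \theta_C)$
  - with $\theta_F$ assumed to be multinomial (peaked at relevant terms).
- Then $\theta_F$ is estimated by the EM algorithm from the global likelihood
  $P(F \mid \theta_F) = \prod_i \prod_w \left( \lambda P(w \mid \theta_F) + (1-\lambda) P(w \mid \theta_C) \right)^{c(w, d_i)}$
  - where $P(w \mid \theta_C)$ is the word probability built upon the corpus, $c(w, d_i)$ is the count of w in $d_i$, and $\lambda$ (0.5) is a fixed parameter
Zhai and Lafferty, SIGIR 2001.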
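The EM estimation of the feedback model can be sketched as below, in the spirit of the two-component mixture of Zhai and Lafferty. This is an illustrative sketch and not the authors' implementation: the symbol names `lam` (mixture weight, 0.5 on the slide) and `alpha` (interpolation weight), the uniform initialization, the number of iterations, and the toy data are assumptions.

```python
from collections import Counter

def estimate_feedback_lm(feedback_docs, corpus_lm, lam=0.5, iters=50):
    """EM for a mixture lam * theta_F + (1 - lam) * P(. | theta_C)."""
    counts = Counter(w for doc in feedback_docs for w in doc)
    vocab = list(counts)
    theta_f = {w: 1.0 / len(vocab) for w in vocab}  # uniform start (assumption)
    for _ in range(iters):
        # E-step: probability that an occurrence of w came from theta_F
        # rather than from the corpus (background) model.
        t = {
            w: lam * theta_f[w] / (lam * theta_f[w] + (1 - lam) * corpus_lm[w])
            for w in vocab
        }
        # M-step: re-estimate theta_F from the expected counts.
        norm = sum(counts[w] * t[w] for w in vocab)
        theta_f = {w: counts[w] * t[w] / norm for w in vocab}
    return theta_f

def interpolate(theta_q, theta_f, alpha=0.5):
    """Refined query model: alpha * theta_q + (1 - alpha) * theta_F."""
    words = set(theta_q) | set(theta_f)
    return {
        w: alpha * theta_q.get(w, 0.0) + (1 - alpha) * theta_f.get(w, 0.0)
        for w in words
    }

# Toy data (hypothetical): "the" is frequent in the corpus, "retrieval" is not.
corpus_lm = {"the": 0.6, "retrieval": 0.1, "entailment": 0.1, "query": 0.2}
feedback_docs = [["the", "retrieval", "retrieval"], ["the", "entailment", "retrieval"]]
theta_f = estimate_feedback_lm(feedback_docs, corpus_lm)
refined = interpolate({"retrieval": 1.0}, theta_f, alpha=0.5)
```

Because the corpus model already explains common words such as "the", the estimated feedback model concentrates its mass on terms that are specific to the feedback documents.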
9 Lexical Entailment
- Lexical Entailment: a thesaurus built automatically on a given corpus
  - Given by the probabilities that one term entails another term, based on the corpus
  - It is filtered using the information gain, and an additional parameter enables us to increase the weight given to the self-entailment $P(u \mid u)$.
- Applied to IR: words from the document are translated into different query terms
  - Background smoothing with the corpus language model can be added on top
- Pros: finds relations between terms that feedback cannot.
- Cons: heavier to compute, and queries get longer.
S. Clinchant, C. Goutte, E. Gaussier. Lexical Entailment for Information Retrieval. ECIR 06.
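One way to read "words from the document are translated into different query terms" is sketched below: the probability of a query term v under a document is a sum over document words u of $P(v \mid u) P(u \mid d)$, interpolated with a background model. The slide's own formula is missing from the source, so this form, the weight `lam`, the nested-dictionary layout `entail[u][v]`, and the toy numbers are assumptions. The information-gain filtering and the self-entailment parameter are not modelled.

```python
import math

def entailment_doc_prob(v, doc_lm, entail, corpus_lm, lam=0.5):
    """Smoothed probability of query term v given a document.

    entail[u][v] is read as P(v | u): document word u entails query term v.
    """
    translated = sum(
        entail.get(u, {}).get(v, 0.0) * p_u for u, p_u in doc_lm.items()
    )
    return (1 - lam) * translated + lam * corpus_lm.get(v, 0.0)

def score(query_terms, doc_lm, entail, corpus_lm, lam=0.5):
    """Query log-likelihood under the entailment-based document model."""
    return sum(
        math.log(entailment_doc_prob(v, doc_lm, entail, corpus_lm, lam))
        for v in query_terms
    )

# Toy entailment table and models (hypothetical numbers).
entail = {
    "car": {"car": 0.8, "automobile": 0.2},
    "tree": {"tree": 1.0},
}
corpus_lm = {"car": 0.4, "automobile": 0.1, "tree": 0.5}
doc_car = {"car": 1.0}
doc_tree = {"tree": 1.0}
p = entailment_doc_prob("automobile", doc_car, entail, corpus_lm)
s_car = score(["automobile"], doc_car, entail, corpus_lm)
s_tree = score(["automobile"], doc_tree, entail, corpus_lm)
```

In this toy case the document about "car" outranks the one about "tree" for the query "automobile" even though neither contains the query word, which is the kind of term relation listed under the pros.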
10 Other results for comparison
11 DS 07 - Monolingual Official Runs
- PRF Lexical Entailment is a double Lexical Entailment model: a first LE model provides the system with an initial set of top-n documents, from which a mixture model for pseudo-feedback is built; a second retrieval is then performed, again based on the LE model, applied to the enriched query.
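12 Cross-Lingual IR
[Figure: an information need is expressed as a query in the source language, while the documents d1, d2, ..., dn are in the target language; the two options shown for bridging the gap are query translation and document translation]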
13 What to translate?
- Document translation: translate documents into the query language
  - Pros: translation may be (theoretically) more precise, and documents become readable by the user
  - Cons: huge volume to be translated
- Query translation: translate the query into the document language
  - Pros: flexibility (translation on demand) and less text to translate
  - Cons: less precise, and the retrieved documents need to be translated to be readable.
14 How to translate?
- Statistical Machine Translation: MatraX
  - Alignment model learnt on a parallel corpus (JRC-AC with Giza word alignment)
  - Language model (N-gram) learnt on the GIRT corpus
  - Translates the source sentence into K target sentences
  - Use them as mono-lingual queries
- Dictionary-based approach, with or without adaptation
  - Extract a probabilistic bilingual dictionary from different resources (standard, domain-specific thesaurus, JRC-AC)
  - Without adaptation: use the translated query with mono-lingual retrieval approaches
  - With adaptation: adapt the dictionary to a particular (query, target corpus) pair through feedback, then use the adapted query with mono-lingual retrieval approaches
15 Dictionary-based CLIR without Adaptation
- Idea: rank the documents
  - according to the cross-entropy similarity between the language model of the query and the language model of the document
  - using a probabilistic bilingual dictionary given by $P(w_t \mid w_s)$, the probability that the word $w_s$ is translated into $w_t$
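A minimal sketch of this idea is shown below: the source query model is pushed through the dictionary to obtain a target-language query model, which is then compared with each (smoothed) target document model by cross-entropy. The function names, the English-to-German toy dictionary, and all probabilities are hypothetical; the exact scoring formula of the slide is not available in the source.

```python
import math

def translate_query_lm(source_query_lm, dictionary):
    """P(wt | q) = sum over ws of P(wt | ws) * P(ws | q).

    dictionary[ws][wt] is read as P(wt | ws).
    """
    target = {}
    for ws, p_s in source_query_lm.items():
        for wt, p_t in dictionary.get(ws, {}).items():
            target[wt] = target.get(wt, 0.0) + p_t * p_s
    return target

def cl_score(source_query_lm, dictionary, doc_lm):
    """Cross-entropy between the translated query model and a document model.

    Assumes doc_lm is smoothed, so every translation has non-zero probability.
    """
    target_q = translate_query_lm(source_query_lm, dictionary)
    return sum(p * math.log(doc_lm[wt]) for wt, p in target_q.items())

# Toy example (hypothetical numbers): English "bank" is ambiguous in German.
dictionary = {"bank": {"Bank": 0.7, "Ufer": 0.3}}
query_lm = {"bank": 1.0}
doc_finance = {"Bank": 0.6, "Ufer": 0.05, "Geld": 0.35}
doc_river = {"Bank": 0.05, "Ufer": 0.6, "Fluss": 0.35}
target_q = translate_query_lm(query_lm, dictionary)
s_finance = cl_score(query_lm, dictionary, doc_finance)
s_river = cl_score(query_lm, dictionary, doc_river)
```

With a general-purpose dictionary, every candidate translation contributes to the score in proportion to its dictionary probability, whatever the query context.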
16 Dictionary Adaptation
- Aim: adapt the dictionary to a particular query
- How: detect implicit coherence present in the relevant documents using PRF
- The first IR (CL-LM) can also be seen as a dictionary disambiguation process.
[Figure: the source query model $\theta_{q_s}$ and the dictionary $P(w_t \mid w_s)$ feed a first CL-LM retrieval; PRF on the top N ranked texts yields the adapted dictionary $\theta_{st}$; a second CL-LM retrieval gives the final rank (re-ranked target documents)]
17 How to estimate $\theta_{st}$
- Let F(q) be the relevant documents retrieved by the translated query using (CL-LM).
- The global model likelihood becomes a function of $\theta_{st}$
- The estimation of $\theta_{st}$ is done by EM, initializing it with the general dictionary $P(w_t \mid w_s)$.
- Finally, we apply (CL-LM) again, but with the adapted query language model
- Note
  - This is an extension of the Query LM Refinement with PRF to the multi-lingual case
  - This algorithm realizes both the query enrichment and the dictionary adaptation
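The EM procedure described on this slide can be sketched as follows. The likelihood formula itself is missing from the source, so this sketch assumes a mixture in which each target word of a feedback document is generated either by translating a source query word (weight `lam`) or by the target corpus model (weight `1 - lam`). The function name, `lam`, the iteration count, and the toy data are hypothetical.

```python
from collections import Counter

def adapt_dictionary(query_lm, dictionary, feedback_docs, corpus_lm,
                     lam=0.5, iters=20):
    """Re-estimate P(wt | ws) for the query's source words by EM.

    dictionary[ws][wt] is the general dictionary used for initialization.
    """
    counts = Counter(w for doc in feedback_docs for w in doc)
    theta = {ws: dict(dictionary.get(ws, {})) for ws in query_lm}
    for _ in range(iters):
        expected = {ws: {} for ws in theta}
        for wt, c in counts.items():
            # E-step: share of wt explained by each source word vs. background.
            parts = {
                ws: lam * theta[ws].get(wt, 0.0) * p_s
                for ws, p_s in query_lm.items()
            }
            denom = sum(parts.values()) + (1 - lam) * corpus_lm.get(wt, 0.0)
            for ws, part in parts.items():
                if part > 0.0:
                    expected[ws][wt] = c * part / denom
        # M-step: renormalize expected counts per source word.
        for ws, row in expected.items():
            total = sum(row.values())
            if total > 0.0:
                theta[ws] = {wt: v / total for wt, v in row.items()}
    return theta

# Toy data (hypothetical): feedback documents are about finance.
dictionary = {"bank": {"Bank": 0.5, "Ufer": 0.5}}
query_lm = {"bank": 1.0}
corpus_lm = {"Bank": 0.2, "Ufer": 0.2, "Geld": 0.3, "Kredit": 0.3}
feedback_docs = [["Bank", "Geld", "Bank"], ["Bank", "Kredit", "Ufer"]]
adapted = adapt_dictionary(query_lm, dictionary, feedback_docs, corpus_lm)
```

In this toy run the translation "Bank" gains weight over "Ufer" because it is better supported by the feedback documents, which illustrates the disambiguation effect of the adaptation.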
18 Other results for comparison
19 DS 07 - Bilingual Official Runs
20 Conclusion
- Mono-lingual Domain-Specific Information Retrieval
  - The Query LM refinement with PRF (LMPRF) gives better performance than the Lexical Entailment (Table 7), but unlike the latter it requires 2 retrieval steps.
  - Both outperform the non-adapted LM case (Table 9)
  - Combining them allowed for further improvements (Table 7)
- Cross-lingual Domain-Specific Information Retrieval
  - Combining the QT with further adaptation was beneficial (Table 9)
  - Matrax is better than dictionary-based IR without adaptation (Table 9)
  - The dictionary adaptation method gave better results than the query translation with Matrax, regardless of the further mono-lingual adaptation (Table 8)
  - Combining the Lexical Entailment with LMPRF was beneficial in the cross-lingual case too (Table 8)
21 Thank you for your attention!
- Not satisfied with the answer?
- You can always get an answer directly from the authors:
  - Stephane.Clinchant_at_xrce.xerox.com or
  - Jean-Michel.Render_at_xrce.xerox.com
22 Back-up
23 MatraX
[Figure: MatraX architecture, with the following components: pre-processing, bi-phrase library construction (JRC-AC), bi-phrase library, language modeling (SRI LM-toolkit, GIRT), language model, training set, development set, model parameter optimization, model params, decoder]