Title: Simultaneous Multilingual Search for Translingual Information Retrieval
1. Simultaneous Multilingual Search for Translingual Information Retrieval
- Kristen Parton
- Kathleen McKeown
- James Allan
- Enrique Henestroza
2. Motivation: Cross-Lingual IR
- User needs to search documents in other languages
[Diagram: Query in User Language ("stereotypes of Arabs") → Documents → Search Results in Document Language(s)]
3. Task Redefinition: Translingual IR
- User needs to search documents in other languages and get back translated results
[Diagram: Query in User Language ("stereotypes of Arabs") → Documents → Search Results in User Language ("Queen Rania Al-Abdullah discusses stereotypes of Arabs")]
4. Task Redefinition: Translingual IR
- User needs to search documents in other languages and get back translated results
- For translingual applications, integrating CLIR and result translation can improve both relevance and translation quality
5. Outline
- Approaches to CLIR
- SMLIR for Translingual IR
- Query-Directed MT Post-Editing
- System Evaluation
- Conclusions and Future Work
6. Approaches to CLIR
- Map query and/or documents to common representation
[Diagram: the query "Schwarzenegger" and the source-language documents Doc1, Doc2, Doc3]
7. Approaches to CLIR
- Map query and/or documents to common representation
- Document translation (DT), with pre-translation query expansion
[Diagram: the expanded query "Schwarzenegger Schwarznegger Schwartzenegger ..." is matched against English translations of Doc1, Doc2, Doc3. Translation excerpts shown:]
- "The failure of all proposals made by Schwarzenegger in a referendum"
- "It should be mentioned that wArznjr is also a nasseer of the Olympic Movement ..."
- "... besides the star and the governor of the state of California Arnold Schwarznegger ."
8. Approaches to CLIR
- Map query and/or documents to common representation
- Document translation (DT), with pre-translation query expansion
- Query translation (QT), with post-translation query expansion
[Diagram: the expanded query "Schwarzenegger Schwarznegger Schwartzenegger ..." is translated into several source-language variants, which are matched against the source-language documents Doc1, Doc2, Doc3]
10. Query Translation vs. Document Translation
- Trade-offs
- Translation resources
- Approximate DT (Oard 00, Chen 04)
- Translation quality
- Handling synonymy
- Hybrid methods
- McCarley 99, Chen & Gey 04: Run QT and DT searches, merge results and rerank
- Wang & Oard 06: Use bidirectional word alignments to capture information from QT and DT
11. Hybrid: Merged Method
- Merge and re-rank results of two searches (McCarley 99)
- DT: query run against indexed document translations
- QT: translated query run against indexed source documents
- Problems
- Different document lengths, query lengths
- Raw IR scores not comparable across queries
- Many ways to re-rank, merge searches
[Diagram: Merged Results: Doc2, Doc3, Doc1]
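One possible way to merge the DT and QT result lists is sketched below. It is an illustration only: the slides cite McCarley 99 but do not give a merging formula, so the min-max normalization, the summed score, and the raw scores in the example are all assumptions.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one result list's raw scores to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally good.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def merge_results(dt_scores: Dict[str, float],
                  qt_scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Normalize each list separately, sum per document, and re-rank."""
    dt_norm = min_max_normalize(dt_scores)
    qt_norm = min_max_normalize(qt_scores)
    merged = {
        doc: dt_norm.get(doc, 0.0) + qt_norm.get(doc, 0.0)
        for doc in set(dt_norm) | set(qt_norm)
    }
    # Highest merged score first; ties broken by document id.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical raw scores (e.g. log-probabilities) from the two searches.
dt = {"Doc1": -5.2, "Doc2": -3.1, "Doc3": -4.0}
qt = {"Doc2": -6.0, "Doc3": -7.5}
ranking = merge_results(dt, qt)
print([doc for doc, _ in ranking])
```

Normalization is needed because the two searches use different queries and document representations, so their raw scores are not on the same scale. Other choices (rank-based fusion, weighted sums) are equally plausible.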
13. Simultaneous Multilingual IR (SMLIR)
- Indexed document = source document + document translation
- Query = original query + query translations (expansions)
[Diagram: a single query that combines source-language name variants with "Schwarzenegger Schwarznegger" is matched against Doc1, Doc2, Doc3, each indexed as its source text together with its English translation]
14. Simultaneous Multilingual IR (SMLIR)
- Multilingual (probabilistic) structured queries
- Treat query term and its translations as synonyms
- SMLIR Hybrid vs. Merged Hybrid
- No need for re-ranking or raw score normalization
- Single index, one search
- Query time comparable to Merged in practice
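A toy sketch of the single-index, one-search idea follows. It assumes each indexed document is the bag of its source tokens plus its translation tokens, and that a query concept is a set of synonyms whose counts are pooled. The `log1p` scoring is a stand-in, not Indri's retrieval model. The `SRC_*` tokens are placeholders for source-language words, and the assignment of translation excerpts to Doc1, Doc2, Doc3 is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple


def build_index(sources: Dict[str, List[str]],
                translations: Dict[str, List[str]]) -> Dict[str, Counter]:
    """Index each document as source tokens + translation tokens."""
    return {
        doc_id: Counter(src) + Counter(translations.get(doc_id, []))
        for doc_id, src in sources.items()
    }


def score(doc_terms: Counter, query: List[Set[str]]) -> float:
    """Pool counts over each synonym set, then sum a damped score."""
    total = 0.0
    for synonyms in query:
        pooled_tf = sum(doc_terms[t] for t in synonyms)
        total += math.log1p(pooled_tf)
    return total


def search(index: Dict[str, Counter],
           query: List[Set[str]]) -> List[Tuple[str, float]]:
    """One pass over one index; no merging of separate result lists."""
    hits = [(d, score(terms, query)) for d, terms in index.items()]
    hits = [(d, s) for d, s in hits if s > 0]
    return sorted(hits, key=lambda kv: (-kv[1], kv[0]))


sources = {
    "Doc1": ["SRC_NAME", "SRC_W1"],
    "Doc2": ["SRC_NAME", "SRC_W2"],
    "Doc3": ["SRC_NAME_ALT", "SRC_W3"],
}
translations = {
    "Doc1": "the failure of all proposals made by schwarzenegger".split(),
    "Doc2": "it should be mentioned that wArznjr is also a nasseer".split(),
    "Doc3": "governor of the state of california arnold schwarznegger".split(),
}
index = build_index(sources, translations)
# One concept: the English spellings plus one source-language translation.
query = [{"schwarzenegger", "schwarznegger", "SRC_NAME"}]
results = search(index, query)
print([doc for doc, _ in results])
```

In this toy data, Doc2 is found only through its source text (the translation garbled the name) and Doc3 only through its translation, while Doc1 matches on both sides and ranks first.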
16. Relevance Lost in Translation
- Statistical MT makes mistakes
- Bad translations of relevant documents may be perceived as irrelevant
- Detection: IR match in source language but not in document translation → Bad translation?
- Correction: Replace bad translation with query term
[Diagram: the query "Sajida al-Rishawi" matches the source-language text, but the document translation reads "It was the Iraqi sajidah AlryAwy had stopped"]
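The detection rule can be sketched as a simple check, assuming tokenized text and exact, case-insensitive phrase matching. The function names are illustrative, and the `SRC_*` tokens are placeholders for the source-language name.

```python
from typing import List, Sequence


def contains_phrase(tokens: Sequence[str], phrase: Sequence[str]) -> bool:
    """True if `phrase` occurs as a contiguous, case-insensitive token run."""
    toks = [t.lower() for t in tokens]
    target = [p.lower() for p in phrase]
    n = len(target)
    if n == 0:
        return False
    return any(toks[i:i + n] == target for i in range(len(toks) - n + 1))


def looks_mistranslated(source_tokens: Sequence[str],
                        translation_tokens: Sequence[str],
                        source_variants: List[Sequence[str]],
                        english_variants: List[Sequence[str]]) -> bool:
    """Flag a document whose source matches the query but whose
    translation contains none of the English query variants."""
    source_hit = any(contains_phrase(source_tokens, v) for v in source_variants)
    english_hit = any(contains_phrase(translation_tokens, v) for v in english_variants)
    return source_hit and not english_hit


source = ["SRC_W0", "SRC_SAJIDA", "SRC_RISHAWI", "SRC_W3"]
translation = "It was the Iraqi sajidah AlryAwy had stopped".split()
suspect = looks_mistranslated(
    source,
    translation,
    source_variants=[["SRC_SAJIDA", "SRC_RISHAWI"]],
    english_variants=[["Sajida", "al-Rishawi"]],
)
print(suspect)  # True: source matches, translation does not
```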
17. Query-Directed MT Post-Editing
- Use query translation and word alignments to rewrite incorrect machine translation (MT)
- Considerations: errors in query translation, incorrect word alignments
[Diagram: the query "Sajida al-Rishawi" and its source-language translation are matched, through word alignments, to the translated document]
- Translated document with word alignments: "It was the Iraqi sajidah AlryAwy had stopped"
- Edited translation: "It was the Iraqi Sajida al-Rishawi had stopped"
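A minimal sketch of the rewrite step is shown below. It assumes the alignment is a mapping from source token index to target token indices and that only the first matching name is edited. The function name and the `SRC_*` placeholder tokens are hypothetical. In keeping with the considerations above, it declines to edit when the name is unaligned or aligned to a non-contiguous span.

```python
from typing import Dict, List, Sequence, Set


def post_edit_name(source_tokens: Sequence[str],
                   translation_tokens: Sequence[str],
                   alignment: Dict[int, List[int]],
                   source_name: Sequence[str],
                   english_name: Sequence[str],
                   accepted: Set[str]) -> List[str]:
    """Replace the MT output aligned to `source_name` with `english_name`
    unless it is already an accepted variant. Edits the first match only."""
    n = len(source_name)
    accepted_lower = {a.lower() for a in accepted}
    for start in range(len(source_tokens) - n + 1):
        if list(source_tokens[start:start + n]) != list(source_name):
            continue
        targets = sorted({j for i in range(start, start + n)
                          for j in alignment.get(i, [])})
        if not targets:
            continue  # name was dropped by MT: deletions are not handled
        lo, hi = targets[0], targets[-1]
        if targets != list(range(lo, hi + 1)):
            continue  # non-contiguous alignment: too risky to rewrite
        current = " ".join(translation_tokens[lo:hi + 1]).lower()
        if current in accepted_lower:
            continue  # translation is already an accepted variant
        return (list(translation_tokens[:lo]) + list(english_name)
                + list(translation_tokens[hi + 1:]))
    return list(translation_tokens)


source = ["SRC_W0", "SRC_W1", "SRC_SAJIDA", "SRC_RISHAWI", "SRC_W4"]
translation = "It was the Iraqi sajidah AlryAwy had stopped".split()
alignment = {2: [4], 3: [5]}  # source index -> target indices
edited = post_edit_name(
    source, translation, alignment,
    source_name=["SRC_SAJIDA", "SRC_RISHAWI"],
    english_name=["Sajida", "al-Rishawi"],
    accepted={"Sajida al-Rishawi"},
)
print(" ".join(edited))
```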
19. Experiment Setup
- Part of DARPA GALE question-answering task
- WHERE HAS UN Secretary General Kofi Annan BEEN AND WHEN?
- Multilingual: English, Chinese, Arabic
- Multimodal: speech, text
- Multigenre: formal, informal
- Evaluation Corpus
- 102,859 Chinese documents
- Translated into English using RWTH statistical machine translation system
- Searches run using Indri (Lemur) IR system
- Relevance judgments
- 145 queries, 8,785 documents judged
- A document is Relevant or Not Relevant for a query
- Judgments based on Chinese text, by Chinese native speakers
20. Evaluation Points
- Query Translation Strategies
- English query → Chinese query
- Run SMLIR searches, evaluate results
- Cross-lingual IR Approaches
- Using Chinese and/or English query, search over Chinese and/or translated documents
- Machine Translation Post-Editing
- Detect errors in result translations
- Rewrite translations
21. Query Translation for SMLIR
- GALE queries are name-centric
- Statistical machine translation (SMT) failed to translate many names in corpus
- Wikipedia for name translation (Ferrandez et al. 07)
- Generated by humans, edited by humans
- Contains slang, name variations, common misspellings
- Noisy, some intentional spam
- Large variation in quantity/quality by language
22. User-Generated Synonyms and Translations
23. Query Translation Strategies for SMLIR
- MT dictionary: probabilistic translation dictionary derived from word alignments
- Wikipedia: for name translations, not probabilistic
- Combination did not help?
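The two strategies could be expressed as structured queries roughly as follows. This is a sketch under assumptions: the slides do not show the actual queries, so the use of Indri's `#syn`, `#wsyn`, and `#combine` operators, the 0.05 probability cutoff, the weight of 1.00 on the original term, and the `ZH_*` placeholder translations are all illustrative.

```python
from typing import Callable, Dict, List


def wikipedia_syn(term: str, wikipedia: Dict[str, List[str]]) -> str:
    """Unweighted synonyms: Wikipedia translations carry no probabilities."""
    variants = wikipedia.get(term, [])
    if not variants:
        return term
    return "#syn( " + " ".join([term] + variants) + " )"


def mt_dictionary_wsyn(term: str,
                       dictionary: Dict[str, Dict[str, float]],
                       min_prob: float = 0.05) -> str:
    """Weighted synonyms: keep dictionary translations above a cutoff."""
    candidates = [(t, p) for t, p in dictionary.get(term, {}).items()
                  if p >= min_prob]
    if not candidates:
        return term
    candidates.sort(key=lambda tp: (-tp[1], tp[0]))
    parts = ["1.00 " + term] + [f"{p:.2f} {t}" for t, p in candidates]
    return "#wsyn( " + " ".join(parts) + " )"


def build_query(terms: List[str], expand: Callable[[str], str]) -> str:
    """Wrap the expanded terms in a single #combine query."""
    return "#combine( " + " ".join(expand(t) for t in terms) + " )"


wiki = {"annan": ["ZH_ANNAN_1", "ZH_ANNAN_2"]}
mt = {"visit": {"ZH_VISIT_A": 0.7, "ZH_VISIT_B": 0.2, "ZH_VISIT_C": 0.01}}

name_query = build_query(["annan"], lambda t: wikipedia_syn(t, wiki))
dict_query = build_query(["visit"], lambda t: mt_dictionary_wsyn(t, mt))
print(name_query)
print(dict_query)
```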
24. CLIR Evaluation
- SMLIR significantly outperforms all other approaches (DT, QT, Merged)
- DT significantly better than QT
- Poor performance of QT degrades Merged
25. Results: Query-Directed SMT Post-Editing
- Post-Editing
- Detect possible incorrect name translations
- If translated name is not a synonym of query, rewrite name
- Very conservative algorithm, does not handle deletions
- Experiment
- 127 queries, top 10 documents
- 28 queries triggered post-editing
- 15 of name matches were rewritten
- Evaluation
- 101 rewrites examined: 93 Acceptable, 6 Not Acceptable
26. Conclusions
- SMLIR: Novel and effective approach for integrating document and query translation in CLIR
- Query-directed SMT post-editing shows promise
- More sophisticated editing possible, beyond just names
- Future work: evaluating whole system for end-to-end question answering
- Combining CLIR and machine translation can improve both search relevance and translation accuracy
27. Thank you!
- This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract number HR0011-06-C-0023, in part by an NSF Graduate Research Fellowship, and in part by the Center for Intelligent Information Retrieval at the University of Massachusetts.
- Thanks very much to Bob Armstrong for making the annotation happen. Thanks also to Mark Smucker and Giridhar Kumaran for help with the INDRI interface and corpus issues, and Ben Carterette for help with estimated MAP. We would also like to thank the members of the NIGHTINGALE machine translation team for translation data, especially Nizar Habash and Mahmoud Ghoneim.
Questions? kristen_at_cs.columbia.edu