Title: Recupera
1. A Brief Survey on Cross-language Information Retrieval (CLIR) - Text Retrieval Perspective
by Ying Alvarado (24401693)
CSE 8337, Lecturer: Dr. Margaret Dunham
April 26, 2007
2. Outline
- Introduction
- Concept
- Why important
- Approach
- CLIR problems
- Resource
- Approaches
- Example Techniques
- A CLIR application system
- CLIR effectiveness
- CLIR future tasks
- CLIR communities
- References
3. Cross Language IR
- Definition: users enter their query in one language and the system retrieves relevant documents in other languages.
- For example, a user may pose their query in English but retrieve relevant documents written in French.
- Example CLIR applications
  - Cross-language retrieval from texts
  - Cross-language retrieval from audio and images
In this presentation, we focus on text IR only!
Sources: [1], [2]
4. Monolingual vs. Bilingual vs. Multilingual
- Monolingual IR: documents and user requests are in the same language
- Cross-language IR
  - Documents and user requests are in different languages (bilingual IR)
[Figure: Cross-language IR (CLIR) system. A request in L1 (source language) is run against documents in L2 (target language); results are returned in L2.]
Source: [2]
5. Monolingual vs. Bilingual vs. Multilingual (cont.)
- Multilingual IR
  - Documents in the collection are in different languages; search requests may be in any language
[Figure: Multilingual IR (MLIR) system. A request in any language (L?) is run against documents in L2, L3 and L4, e.g. the Web; results are returned in L2, L3 or L4.]
6. Why CLIR?
Top Ten Languages Used in the Web (Number of Internet Users by Language), Mar. 10, 2007

Top ten languages in the Internet | % of all Internet users | Internet users by language | Internet penetration by language (%) | Internet growth for language, 2000-2007 (%) | World population for the language (2007 estimate)
English                 |  29.5 |   328,666,386 | 28.7 | 139.6 | 1,143,218,916
Chinese                 |  14.3 |   159,001,513 | 11.8 | 392.2 | 1,351,737,925
Spanish                 |   8.0 |    88,920,232 | 20.2 | 260.3 |   439,284,783
Japanese                |   7.7 |    86,300,000 | 67.1 |  83.3 |   128,646,345
German                  |   5.3 |    58,711,687 | 61.1 | 113.2 |    96,025,053
French                  |   5.0 |    55,521,294 | 14.3 | 355.2 |   387,820,873
Portuguese              |   3.6 |    40,216,760 | 17.2 | 430.8 |   234,099,347
Korean                  |   3.1 |    34,120,000 | 45.6 |  79.2 |    74,811,368
Italian                 |   2.8 |    30,763,940 | 51.7 | 133.1 |    59,546,696
Arabic                  |   2.6 |    28,540,700 |  8.4 | 931.8 |   340,548,157
TOP TEN LANGUAGES       |  81.7 |   910,762,512 | 21.4 | 181.4 | 4,255,739,462
Rest of World Languages |  18.3 |   203,511,914 |  8.8 | 444.5 | 2,318,926,955
WORLD TOTAL             | 100.0 | 1,114,274,426 | 16.9 | 208.7 | 6,574,666,417

Source: [3]
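The percentage columns in the table can be recomputed from the two count columns. The sketch below assumes that "share" means users of a language divided by the world total of Internet users, and that "penetration" means users of a language divided by the estimated population for that language. These definitions are inferred from the figures, not stated on the slide, and the function names are illustrative.

```python
# Counts copied from the table (Internet World Stats, Mar. 10, 2007):
# language -> (Internet users, estimated world population for the language)
ROWS = {
    "English": (328_666_386, 1_143_218_916),
    "Chinese": (159_001_513, 1_351_737_925),
    "Arabic": (28_540_700, 340_548_157),
}
WORLD_USERS = 1_114_274_426  # WORLD TOTAL row


def share_of_users(users):
    """Percentage of all Internet users (assumed definition)."""
    return 100.0 * users / WORLD_USERS


def penetration(users, population):
    """Percentage of the language's speakers who are online (assumed definition)."""
    return 100.0 * users / population


for language, (users, population) in ROWS.items():
    print(language,
          round(share_of_users(users), 1),
          round(penetration(users, population), 1))
```

Under these assumptions the recomputed values match the table after rounding to one decimal place.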
7. Why CLIR? (cont.)
- A collection may contain documents in many different languages, e.g. the Web. It would be impractical to form a query in each language.
- The documents may be expressed in more than one language. For example:
  - Technical documents in which English jargon appears intermixed with narrative text in another language.
  - Academic works which cite the titles of documents in different languages.
- The user is not sufficiently fluent to express a query in a language, but is able to make use of the documents that are identified.
- The user is monolingual and wants to query in their native language, because they
  - can judge relevance even if the results are not translated
  - have access to document translation
Sources: [2], [4]
8. CLIR problems
- Handling non-ASCII character sets
- Untranslatable search keys (out-of-vocabulary, OOV), e.g. compound words, proper names, special terms
- Multi-word concepts, e.g. phrases and idioms
- Ambiguity, e.g. homonymy and polysemy
- Word inflections, e.g. plurals and gender
Sources: [2], [5]
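For the first problem in the list, a minimal sketch of handling non-ASCII text with Python's standard library is shown here. It assumes the encoding of each document is already known (real collections often need encoding detection as well), and the helper names decode_document and normalize_term are illustrative.

```python
import unicodedata


def decode_document(raw_bytes, declared_encoding):
    """Decode raw bytes using the encoding declared for the document.

    Undecodable bytes are replaced rather than raising, so that one bad
    byte does not make the whole document unsearchable.
    """
    return raw_bytes.decode(declared_encoding, errors="replace")


def normalize_term(term):
    """Map canonically equivalent Unicode strings to one form, ignoring case."""
    return unicodedata.normalize("NFC", term).casefold()


# The same word typed with a combining accent and with a precomposed letter
# should match after normalization.
decomposed = "Cafe\u0301"   # 'e' followed by a combining acute accent
precomposed = "caf\u00e9"   # single precomposed character
print(normalize_term(decomposed) == normalize_term(precomposed))
```

Normalizing both index terms and query terms the same way keeps accented and non-Latin terms comparable before any translation step.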
9. Resources for Translation
- Ontology
  - Representation of concepts and relationships
- Thesaurus
  - More commonly means a listing of words with similar, related, or opposite meanings
  - Does not include the definitions of words
- Bilingual dictionary
  - A list of words together with additional word-specific information
- Bilingual controlled vocabulary
  - A carefully selected list of words and phrases, which are used to tag units of information (document or work) so that they may be more easily retrieved by a search
- Corpora
  - The document collection itself
Sources: [1] (related pages), [4], [6], [7]
10. An example of controlled vocabulary
- The hierarchical relationships (BT = broader term, NT = narrower term)
- The equivalence relationship
Women's Pants
  BT Pants
  NT Casual Pants
  NT Dress Pants
  NT Sports Pants
Source: [14]
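The hierarchical entry above can be held in a small data structure. The sketch below assumes a plain dictionary keyed by term with "BT" and "NT" lists. The layout and the function name expand_with_narrower are illustrative choices, not something prescribed by the source.

```python
# The entry from the slide, stored as broader-term (BT) / narrower-term (NT) lists.
VOCABULARY = {
    "Women's Pants": {
        "BT": ["Pants"],
        "NT": ["Casual Pants", "Dress Pants", "Sports Pants"],
    },
}


def expand_with_narrower(term, vocabulary):
    """Return the term followed by every narrower term reachable from it."""
    seen = []
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        narrower = vocabulary.get(current, {}).get("NT", [])
        # Push in reverse so narrower terms come out in their listed order.
        stack.extend(reversed(narrower))
    return seen


print(expand_with_narrower("Women's Pants", VOCABULARY))
```

Expanding a tag with its narrower terms lets a search for the broader concept also retrieve items tagged with the more specific ones.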
11. What to translate?
- Document translation
  - Text translation
    - E.g., translate entire document collection into English -> search collection in English
  - Vector translation
- Query translation
  - E.g., translate English query into Chinese query -> search Chinese document collection
Source: [6]
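Query translation can be sketched with a word-by-word dictionary lookup followed by an ordinary monolingual search. Everything concrete below is an assumption made for illustration: the tiny English-French dictionary, the three toy documents, the pass-through of unknown terms, and the overlap-count ranking.

```python
# Toy English->French dictionary (illustrative entries only).
BILINGUAL_DICT = {
    "bank": ["banque", "rive"],   # ambiguous: financial bank vs. river bank
    "loan": ["prêt"],
    "river": ["rivière", "fleuve"],
}

DOCUMENTS = {
    "d1": "la banque accorde un prêt",
    "d2": "la rive du fleuve",
    "d3": "le chat dort",
}


def translate_query(query, dictionary):
    """Replace each query term by all of its dictionary translations.

    Terms missing from the dictionary (OOV) are passed through unchanged.
    """
    translated = []
    for term in query.lower().split():
        translated.extend(dictionary.get(term, [term]))
    return translated


def search(query_terms, documents):
    """Rank documents by the number of distinct query terms they contain."""
    scored = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        score = len(set(query_terms) & words)
        if score:
            scored.append((score, doc_id))
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


french_query = translate_query("bank loan Paris", BILINGUAL_DICT)
print(french_query)
print(search(french_query, DOCUMENTS))
```

Keeping every translation of "bank" lets the unrelated river-bank document into the results, which is the ambiguity problem that short queries make hard to resolve.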
12. Tradeoffs
- Document translation
  - Documents can be translated and stored offline
  - Dependent on a high-quality automatic machine translation (MT) system
  - Does not easily deal with changing document sets
- Query translation
  - Often easier
  - Disambiguation of query terms may be difficult with short queries
Sources: [4], [6]
13. Approaches to query translation
- Knowledge-based: several aspects of domain knowledge are manually encoded into a lexicon.
  - Ontology-based (concept driven)
  - Thesaurus-based
  - Dictionary-based
  - Expensive to construct lexicons
  - Lag behind the common use of terminology
- Corpus-based: directly exploit statistical information about term usage in corpora to automatically construct a lexicon.
  - Parallel corpora: document pairs, sentence pairs, term pairs
  - Comparable corpora: document pairs, similar content
  - Unaligned corpora: documents from the same domain, not translations of one another, not linked in any other way
Sources: [4], [8], [9]
14. Applying monolingual IR techniques
- Query expansion
- Relevance feedback
- Stemming
- Latent semantic analysis
- Parsing
- Part of speech tagging
Source: [4]
15. Multilingual Thesauri
- Three construction techniques
  - Build it from scratch
  - Translate an existing thesaurus
  - Merge monolingual thesauri
- For example, EuroWordNet
  - 7 languages
  - Built from existing lexical resources
  - Has the same structure as Princeton WordNet
Sources: [8], [9]
16. Pseudo-Relevance Feedback
- Also called blind feedback
- Assume that the top n documents in the result set actually are relevant.
- Enter query terms in French
- Find top French documents in parallel corpus
- Construct a query from English translations
- Perform a monolingual free text search
[Figure: French query terms -> French text retrieval system over the parallel corpus -> top ranked French documents -> English translations -> AltaVista -> English Web pages]
Source: [9]
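The four steps above can be sketched on a toy parallel corpus. The corpus, the stop list, the overlap-based ranking and the choice of the most frequent English terms are all assumptions made for illustration. The slide does not specify how the English query is constructed from the translations.

```python
from collections import Counter

# Toy parallel corpus: (French document, its English translation).
PARALLEL_CORPUS = [
    ("le chat noir dort", "the black cat sleeps"),
    ("le chat mange le poisson", "the cat eats the fish"),
    ("la bourse de paris monte", "the paris stock exchange rises"),
]
ENGLISH_STOPWORDS = {"the", "a", "of"}


def tokens(text):
    return text.lower().split()


def build_english_query(french_query, corpus, top_n=2, num_terms=3):
    """Blind feedback through a parallel corpus.

    1. Rank French documents by overlap with the French query.
    2. Assume the top_n matching documents are relevant.
    3. Build an English query from the most frequent terms of their translations.
    """
    query_terms = set(tokens(french_query))
    scored = [(len(query_terms & set(tokens(fr))), i)
              for i, (fr, _) in enumerate(corpus)]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    top_docs = [i for score, i in ranked[:top_n] if score > 0]

    counts = Counter(
        term
        for i in top_docs
        for term in tokens(corpus[i][1])
        if term not in ENGLISH_STOPWORDS
    )
    return [term for term, _ in counts.most_common(num_terms)]


english_query = build_english_query("chat", PARALLEL_CORPUS)
print(english_query)  # "cat" ranks first: it occurs in both top translations
```

The resulting English terms would then be submitted to a monolingual English search engine, which is the role AltaVista plays in the figure.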
17. Different levels of alignment in parallel corpora
- Document alignment
  - Already exists
  - Collected from existing corpora
  - Examine document external features
  - Examine document internal features
- Sentence alignment
  - Easily constructed from aligned documents
  - Match pattern of relative sentence lengths
  - Good first step for term alignment
- Term alignment
  - Using co-occurrence-based translation
Source: [9]
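Matching the pattern of relative sentence lengths can be sketched as a small dynamic program. This is a simplified stand-in, not the method used in the cited slides: it allows only 1-1, 1-2 and 2-1 groupings and scores each grouping by the relative difference in character length. Both the cost function and the allowed groupings are assumptions.

```python
def align_by_length(source_sentences, target_sentences):
    """Align two sentence lists using character lengths only.

    Returns a list of (source_indices, target_indices) groups, or an empty
    list if no alignment exists with the allowed group shapes.
    """
    inf = float("inf")
    n, m = len(source_sentences), len(target_sentences)
    src_len = [len(s) for s in source_sentences]
    tgt_len = [len(t) for t in target_sentences]

    # cost[i][j]: best cost of aligning the first i source and j target sentences
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    shapes = [(1, 1), (1, 2), (2, 1)]

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in shapes:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                a = sum(src_len[i:ni])
                b = sum(tgt_len[j:nj])
                step = abs(a - b) / max(a, b, 1)  # relative length mismatch
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)

    if cost[n][m] == inf:
        return []

    # Walk the back-pointers from the end to recover the groups.
    groups = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return groups[::-1]


english = ["Hello.", "How are you today my friend?"]
french = ["Bonjour.", "Comment vas-tu", "aujourd'hui mon ami ?"]
print(align_by_length(english, french))
```

In the toy example the second English sentence is grouped with the last two French fragments because their combined length is the closest match.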
18. Example of term alignment
19. Co-occurrence-based translation
- Align terms using co-occurrence statistics
  - Assumes that the correct translations of query terms tend to co-occur in target-language documents
  - How often does a term pair occur in sentence pairs?
  - Weighted by relative position in the sentences
  - Retain term pairs that occur unusually often
Source: [9]
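A minimal sketch of co-occurrence-based term alignment over aligned sentence pairs follows. It scores each term pair with the Dice coefficient and keeps pairs above a threshold. The Dice score, the threshold value and the toy sentence pairs are assumptions, and the position weighting mentioned on the slide is left out for brevity.

```python
from collections import Counter
from itertools import product

SENTENCE_PAIRS = [
    ("the cat sleeps", "le chat dort"),
    ("the dog sleeps", "le chien dort"),
    ("the cat eats", "le chat mange"),
]


def cooccurrence_scores(sentence_pairs):
    """Dice coefficient for every (source term, target term) pair seen together."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for source, target in sentence_pairs:
        src_terms = set(source.lower().split())
        tgt_terms = set(target.lower().split())
        src_count.update(src_terms)
        tgt_count.update(tgt_terms)
        pair_count.update(product(src_terms, tgt_terms))
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
    }


def retain_pairs(scores, threshold=0.9):
    """Keep only term pairs that co-occur unusually often."""
    return {pair for pair, score in scores.items() if score >= threshold}


def best_translation(term, scores):
    """Highest-scoring target term for a source term, or None if unseen."""
    candidates = {t: v for (s, t), v in scores.items() if s == term}
    return max(candidates, key=candidates.get) if candidates else None


SCORES = cooccurrence_scores(SENTENCE_PAIRS)
print(best_translation("cat", SCORES))
print(sorted(retain_pairs(SCORES)))
```

With so few sentences, frequent function words still score fairly high against content words, which is why a threshold (or more data) is needed before a pair is retained.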
20. Exploiting Unaligned Corpora
- Example approach: category-based translation
  - Extract a large number of terms from unaligned corpora of the first and second languages
  - Assign a category to each extracted term by accessing monolingual thesauri of the first and second languages
  - Estimate category-to-category translation probabilities
  - Estimate term-to-term translation probabilities using the category-to-category translation probabilities
Source: [15]
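The last step can be illustrated with a heavily simplified sketch. It assumes each term has exactly one category, that the category-to-category probabilities are already given, and that a target term's score is its category probability normalized over all candidates. These are illustrative assumptions and should not be read as the procedure claimed in the patent. All category names and numbers are hypothetical.

```python
# Hypothetical category assignments from monolingual thesauri.
SOURCE_CATEGORY = {"bank": "FINANCE"}
TARGET_CATEGORY = {
    "banque": "FINANCE_FR",
    "crédit": "FINANCE_FR",
    "rivière": "GEOGRAPHY_FR",
}

# Hypothetical category-to-category translation probabilities.
CATEGORY_PROBS = {
    ("FINANCE", "FINANCE_FR"): 0.9,
    ("FINANCE", "GEOGRAPHY_FR"): 0.1,
}


def term_translation_probs(source_term, source_cat, target_cat, cat_probs):
    """Score target terms by the probability of their category, then normalize."""
    category = source_cat[source_term]
    raw = {
        term: cat_probs.get((category, target_category), 0.0)
        for term, target_category in target_cat.items()
    }
    total = sum(raw.values())
    if total == 0:
        return {}
    return {term: p / total for term, p in raw.items() if p > 0}


PROBS = term_translation_probs("bank", SOURCE_CATEGORY, TARGET_CATEGORY, CATEGORY_PROBS)
print(PROBS)
```

Category evidence alone cannot separate "banque" from "crédit" here, since both fall in the same category. It only pushes candidates from unlikely categories down the list.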
21. In Summary
Source: [8]
22. An experimental system
Automatic construction of parallel English-Chinese corpus for CLIR
- A parallel text mining system: PTMiner
  - Finds parallel text from the Web
- Parallel text mining algorithm
  - Search for candidate sites: using existing Web search engines, search for the candidate sites that may contain parallel pages (by using anchor text)
  - File name fetching: for each candidate site, fetch the URLs of Web pages that are indexed by the search engines
  - Host crawling: starting from the URLs collected in the previous step, search through each candidate site separately for more URLs
  - Pair scan: from the obtained URLs of each site, scan for possible parallel pairs (by analyzing document external features)
  - Download and verifying: download the parallel pages, determine file size, language and character set, text length, HTML structure, and filter out non-parallel pairs
Source: [10]
23. The workflow of the mining process
- Sample anchor texts: "english version", "in english"
- Sample document external features: file-ch.html vs. file-en.html; /chinese//file.html vs. /english/file.html
- Sample document internal features: character set, HTML structure
Source: [10]
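The pair-scan step can be sketched using the external features shown above: two URLs are treated as a candidate pair when they differ only in a language marker. The substitution list, the example.org URLs and the function name pair_scan are assumptions for illustration, not PTMiner's actual rules.

```python
# (English marker, Chinese marker) substitutions, modeled on the sample
# external features: file-en.html vs. file-ch.html, /english/ vs. /chinese/.
MARKER_SUBSTITUTIONS = [
    ("-en.", "-ch."),
    ("/english/", "/chinese/"),
]


def pair_scan(urls):
    """Return (english_url, chinese_url) candidates found among the given URLs."""
    url_set = set(urls)
    pairs = []
    for url in sorted(url_set):
        for english_marker, chinese_marker in MARKER_SUBSTITUTIONS:
            if english_marker not in url:
                continue
            candidate = url.replace(english_marker, chinese_marker)
            if candidate in url_set:
                pairs.append((url, candidate))
    return pairs


SITE_URLS = [
    "http://example.org/news/file-en.html",
    "http://example.org/news/file-ch.html",
    "http://example.org/english/a/file.html",
    "http://example.org/chinese/a/file.html",
    "http://example.org/about.html",
]
print(pair_scan(SITE_URLS))
```

Pairs found this way are only candidates. The download-and-verify step would still check internal features such as character set, text length and HTML structure before accepting them.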
24. An alignment example
Source: [10]
25. Part of the lexicons
Other techniques and tools used
- Encoding scheme transformation (for Chinese)
- Sentence level segmentation
- Chinese word segmentation
- English expression extraction
- SILC language and encoding identification system
Source: [10]
26. Results
- 14820 pairs of texts (lexicon)
  - C-E has a precision of 77%
  - E-C has a precision of 81.5%
- CLIR results
  - Test corpus: TREC5 and TREC6 Chinese track
Source: [10]
27. Does CLIR work?
- Best systems at TREC-6 (1997)
  - English-French: 49% of highest French monolingual
  - English-German: 64% of highest German monolingual
- Best systems at CLEF (2002)
  - English-French: 83% of highest French monolingual
  - English-German: 86% of highest German monolingual
- Best systems at CLEF (2006)
  - English-French: 93.82% of best French monolingual
  - English-Portuguese: 90.91% of best Portuguese monolingual
Sources: [2], [16]
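Each figure above expresses the best cross-language run as a percentage of the best monolingual run on the same target collection. A minimal sketch of that ratio follows. The slide does not name the underlying effectiveness measure, so the scores are treated as generic (mean average precision would be a typical choice), and the example values are hypothetical.

```python
def relative_effectiveness(clir_score, monolingual_score):
    """Cross-language score as a percentage of the monolingual baseline."""
    if monolingual_score <= 0:
        raise ValueError("monolingual baseline must be positive")
    return 100.0 * clir_score / monolingual_score


# Hypothetical scores, for illustration only (not taken from TREC or CLEF).
EXAMPLE = relative_effectiveness(clir_score=0.30, monolingual_score=0.40)
print(round(EXAMPLE, 2))
```

Reporting results relative to the monolingual baseline makes different language pairs and evaluation years comparable even when the absolute scores differ.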
28. Future tasks
- Extend study scope
  - Web pages, medical literature, USENET newsgroup articles, records of legislative and legal proceedings
- Lower cost, improve efficiency
  - Pay more attention to indexing-time optimizations to improve query-time efficiency
- Consider the user's perspective
  - Improve the utility of ranked lists
- Define suitable criteria for the construction of a valid multilingual Web corpus
- Get resources for resource-poor languages
Sources: [11], [12]
29. CLIR Communities
- TREC Cross Language Track: currently focuses on the Arabic language
- Cross-Language Evaluation Forum (CLEF): a spinoff from TREC, covering many European languages
- NTCIR Asian Language Evaluation (covering Chinese, Japanese and Korean)
Source: [12]
30. CLEF
- In CLEF 2006, eight tracks were offered to evaluate the performance of systems:
  - multilingual document retrieval on news collections (Ad-hoc)
  - cross-language structured scientific data (Domain-specific)
  - interactive cross-language retrieval
  - multiple language question answering
  - cross-language retrieval on image collections
  - cross-language speech retrieval
  - multilingual web retrieval
  - cross-language geographic retrieval
Source: [13]
31. References
[1] Wikipedia, http://en.wikipedia.org/wiki/Cross-language_information_retrieval
[2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005
[3] Internet World Stats, http://www.internetworldstats.com/stats7.htm
[4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series Vol. CS-TR-3615. 1996
[5] Ari Pirkola, et al. Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings. Information Retrieval, Vol. 4. 2001
[6] Jimmy Lin, Cross-Language and Multimedia Information Retrieval. Slides for LBSC 796/INFM 718R. 2006
[7] Metamodel.com. What are the differences between a vocabulary, a taxonomy, a thesaurus, an ontology, and a meta-model? http://www.metamodel.com/article.php?story20030115211223271. 2004
[8] Miguel E. Ruiz, CLIR. Slides for school seminars. 2001
[9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007
[10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000
[11] D.W. Oard, When You Come to a Fork in the Road, Take It: Multiple Futures for CLIR Research. SIGIR 2002 CLIR
[12] Fredric Gey, et al. Cross Language Information Retrieval: A Research Roadmap. SIGIR 2002 CLIR
[13] Carol Peters, Cross-Language Evaluation Forum - CLEF 2006. D-Lib Magazine, October 2006
[14] Boxes and Arrows, http://www.boxesandarrows.com/view/what_is_a_controlled_vocabulary
[15] David Hull, Terminology translation for unaligned comparable corpora using category based translation probabilities. United States Patent 6885985. Filing date Dec 18, 2000. Issue date Apr 26, 2005
[16] Giorgio M. Di Nunzio, CLEF 2006 Ad Hoc Track Overview. 2006
32. Thank you!