Recupera


Provided by: bert193

Learn more at: https://s2.smu.edu


1
A Brief Survey on Cross-language Information Retrieval (CLIR) - Text Retrieval Perspective
by Ying Alvarado (24401693)
CSE 8337, Lecturer: Dr. Margaret Dunham
April 26, 2007
2
Outline

Introduction
Concept
Why important
Approach
CLIR problems
Resource
Approaches
Example Techniques
A CLIR application system
CLIR effectiveness
CLIR future tasks
CLIR communities
References

3
Cross Language IR

Definition: Users enter their query in one language and the system retrieves relevant documents in other languages.
For example, a user may pose a query in English but retrieve relevant documents written in French.
Example CLIR applications:
Cross-language retrieval from texts
Cross-language retrieval from audio and images

In this presentation, we focus on text IR only!
[1] Wikipedia, http://en.wikipedia.org/wiki/Cross-language_information_retrieval
[2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005
4
Monolingual vs. Bilingual vs. Multilingual

Monolingual IR: documents and user requests are in the same language.

Cross-language IR: documents and user requests are in different languages (bilingual IR).

[Figure: Cross-language IR (CLIR) system. Labels: Request (L1, source language), Documents (L2, target language), Results (L2)]
[2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005
5
Monolingual vs. Bilingual vs. Multilingual (cont.)

Multilingual IR: documents in the collection are in different languages, and search requests may be in any language, e.g. the Web.

[Figure: Multilingual IR (MLIR) system. Labels: Request (L?), Documents (L2), Documents (L3), Documents (L4), Results (L2, L3 or L4)]
6
Why CLIR?

Top Ten Languages Used in the Web (Number of Internet Users by Language), Mar. 10, 2007

Language                   % of all users   Internet users  Penetration (%)  Growth 2000-2007 (%)  2007 est. population
English                              29.5      328,666,386             28.7                 139.6         1,143,218,916
Chinese                              14.3      159,001,513             11.8                 392.2         1,351,737,925
Spanish                               8.0       88,920,232             20.2                 260.3           439,284,783
Japanese                              7.7       86,300,000             67.1                  83.3           128,646,345
German                                5.3       58,711,687             61.1                 113.2            96,025,053
French                                5.0       55,521,294             14.3                 355.2           387,820,873
Portuguese                            3.6       40,216,760             17.2                 430.8           234,099,347
Korean                                3.1       34,120,000             45.6                  79.2            74,811,368
Italian                               2.8       30,763,940             51.7                 133.1            59,546,696
Arabic                                2.6       28,540,700              8.4                 931.8           340,548,157
TOP TEN LANGUAGES                    81.7      910,762,512             21.4                 181.4         4,255,739,462
Rest of World Languages              18.3      203,511,914              8.8                 444.5         2,318,926,955
WORLD TOTAL                         100.0    1,114,274,426             16.9                 208.7         6,574,666,417

[3] Internet World Stats, http://www.internetworldstats.com/stats7.htm
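The percentage columns of the table can be reproduced from the raw counts. The sketch below uses the English row and the world total as printed. The function names are illustrative, and reading "penetration" as users divided by population is an assumption, although it matches the printed figures.

```python
def share_of_all_users(users: int, world_users: int) -> float:
    """Percentage of all Internet users who use this language."""
    return round(100.0 * users / world_users, 1)


def penetration(users: int, population: int) -> float:
    """Percentage of the language's population that is online (assumed definition)."""
    return round(100.0 * users / population, 1)


# Figures copied from the English and WORLD TOTAL rows of the table.
ENGLISH_USERS = 328_666_386
ENGLISH_POPULATION = 1_143_218_916
WORLD_USERS = 1_114_274_426

print(share_of_all_users(ENGLISH_USERS, WORLD_USERS))  # 29.5
print(penetration(ENGLISH_USERS, ENGLISH_POPULATION))  # 28.7
```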
7
Why CLIR? (cont.)

A collection may contain documents in many different languages, e.g. the Web. It would be impractical to form a query in each language.
The documents may be expressed in more than one language. For example:
Technical documents in which English jargon appears intermixed with narrative text in another language.
Academic works which cite the titles of documents in different languages.
The user is not sufficiently fluent to express a query in a language, but is able to make use of the documents that are identified.
The user is monolingual and wants to query in their native language, because they
can judge relevance even if the results are not translated
have access to document translation

[2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005
[4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series, Vol. CS-TR-3615. 1996
8
CLIR problems

Handling non-ASCII character sets
Untranslatable search keys (out of vocabulary, OOV), e.g. compound words, proper names, special terms
Multi-word concepts, e.g. phrases and idioms
Ambiguity, e.g. homonymy and polysemy
Word inflections, e.g. plurals and gender

[2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005
[5] Ari Pirkola, et al. Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings. Information Retrieval, Vol. 4. 2001
9
Resources for Translation

Ontology
Representation of concepts and relationships
Thesaurus
A listing of words with similar, related, or opposite meanings
It does not include the definitions of words
Bilingual dictionary
A list of words together with additional word-specific information
Bilingual controlled vocabulary
A carefully selected list of words and phrases, which are used to tag units of information (documents or works) so that they may be more easily retrieved by a search
Corpora
The document collection itself

[4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series, Vol. CS-TR-3615. 1996
[6] Jimmy Lin, Cross-Language and Multimedia Information Retrieval. Slides for LBSC 796/INFM 718R. 2006
[1] Wikipedia. Related pages.
[7] Metamodel.com. What are the differences between a vocabulary, a taxonomy, a thesaurus, an ontology, and a meta-model? http://www.metamodel.com/article.php?story20030115211223271. 2004
10
An example of controlled vocabulary
The hierarchical relationships
The equivalence relationship
Women's Pants
  BT Pants
  NT Casual Pants
  NT Dress Pants
  NT Sports Pants
(BT = broader term, NT = narrower term)
[14] Boxes and Arrows, http://www.boxesandarrows.com/view/what_is_a_controlled_vocabulary
11
What to translate?

Document translation
Text translation
E.g., translate the entire document collection into English → search the collection in English
Vector translation
Query translation
E.g., translate an English query into a Chinese query → search the Chinese document collection

[6] Jimmy Lin, Cross-Language and Multimedia Information Retrieval. Slides for LBSC 796/INFM 718R. 2006
12
Tradeoffs

Document Translation
Documents can be translated and stored offline
Depends on a high-quality automatic machine translation (MT) system
Does not easily deal with changing document sets
Query Translation
Often easier
Disambiguation of query terms may be difficult with short queries
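A minimal sketch of dictionary-based query translation, showing why short queries are hard to disambiguate. The dictionary is a toy English-to-French one invented for illustration (the slides mention English-to-Chinese), and the function name is hypothetical.

```python
from typing import Dict, List

# Toy English-to-French dictionary; entries are illustrative only.
TOY_DICTIONARY: Dict[str, List[str]] = {
    "bank": ["banque", "rive"],  # ambiguous: financial institution vs. river bank
    "river": ["rivière", "fleuve"],
    "loan": ["prêt"],
}


def translate_query(query: str, dictionary: Dict[str, List[str]]) -> List[str]:
    """Replace each query term with all of its dictionary translations.

    Terms missing from the dictionary (out of vocabulary) are passed through
    unchanged, a simple fallback for proper names.
    """
    translated: List[str] = []
    for term in query.lower().split():
        translated.extend(dictionary.get(term, [term]))
    return translated


print(translate_query("bank loan Dunham", TOY_DICTIONARY))
# ['banque', 'rive', 'prêt', 'dunham']
```

With only two or three query terms there is little context to decide between "banque" and "rive", so both translations end up in the target-language query.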

Knowledge-based: several aspects of domain knowledge are manually encoded into a lexicon.
Ontology-based (concept driven)
Thesaurus-based
Dictionary-based
Lexicons are expensive to construct
Lexicons lag behind the common use of terminology
Corpus-based: directly exploit statistical information about term usage in a corpus to automatically construct a lexicon.
Parallel corpora: document pairs, sentence pairs, term pairs
Comparable corpora: document pairs, similar content
Unaligned corpora: documents from the same domain, not translations of one another, not linked in any other way

[4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series, Vol. CS-TR-3615. 1996
[8] Miguel E. Ruiz, CLIR. Slides for school seminars. 2001
[9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007
14
Applying monolingual IR techniques

Query expansion
Relevance feedback
Stemming
Latent semantic analysis
Parsing
Part of speech tagging

[4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series, Vol. CS-TR-3615. 1996
15
Multilingual Thesauri

Three construction techniques:
Build it from scratch
Translate an existing thesaurus
Merge monolingual thesauri
For example, EuroWordNet:
7 languages
Built from existing lexical resources
Has the same structure as Princeton WordNet

[8] Miguel E. Ruiz, CLIR. Slides for school seminars. 2001
[9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007
16
Pseudo-Relevance Feedback

Also called blind feedback
Assume that the top n documents in the result set actually are relevant.
Enter query terms in French
Find top French documents in parallel corpus
Construct a query from English translations
Perform a monolingual free text search

[Figure labels: French Query Terms, French Text Retrieval System, Parallel Corpus, Top ranked French Documents, English Translations, AltaVista, English Web Pages]
[9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007
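The four steps on this slide can be sketched as follows. This is a toy illustration, not the system in the figure: the parallel corpus, the stopword list and the term-overlap ranker are all invented stand-ins, and the final monolingual search is left out.

```python
from collections import Counter
from typing import List, Tuple

# Toy parallel corpus of (French text, English translation) pairs; hypothetical data.
PARALLEL_CORPUS: List[Tuple[str, str]] = [
    ("le chat dort sur le canapé", "the cat sleeps on the sofa"),
    ("le chat noir chasse la souris", "the black cat hunts the mouse"),
    ("la bourse chute aujourd'hui", "the stock market falls today"),
]

ENGLISH_STOPWORDS = {"the", "on", "a", "of"}


def overlap_score(query_terms: List[str], text: str) -> int:
    """Count query terms that occur in the text (stand-in for a real ranker)."""
    words = set(text.split())
    return sum(1 for term in query_terms if term in words)


def build_english_query(french_query: str, top_n: int = 2, num_terms: int = 3) -> List[str]:
    """Blind feedback: treat the top-n French matches as relevant and use the
    most frequent terms of their English translations as the new query."""
    query_terms = french_query.lower().split()
    ranked = sorted(
        PARALLEL_CORPUS,
        key=lambda pair: overlap_score(query_terms, pair[0]),
        reverse=True,
    )
    counts: Counter = Counter()
    for _, english in ranked[:top_n]:
        counts.update(w for w in english.split() if w not in ENGLISH_STOPWORDS)
    return [term for term, _ in counts.most_common(num_terms)]


# The returned terms would then be submitted to an English search engine.
print(build_english_query("chat noir"))
```

Because "cat" appears in both top-ranked translations, it is the strongest term of the constructed English query.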
17
Different levels of alignment in parallel corpora

Document alignment
Already exists
Collected from existing corpora
Examine document external features
Examine document internal features
Sentence alignment
Easily constructed from aligned documents
Match pattern of relative sentence lengths
Good first step for term alignment
Term alignment
Using co-occurrence-based translation

[9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007
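A rough sketch of sentence alignment by relative sentence lengths. It assumes a deliberately simple cost (normalised difference in character length) and only 1-1, 1-2 and 2-1 groupings; published length-based aligners use a probabilistic cost instead. All names and the sample sentences are illustrative.

```python
from typing import List, Tuple


def length_cost(src_len: int, tgt_len: int) -> float:
    """Penalty that grows as the two character lengths diverge."""
    return abs(src_len - tgt_len) / max(src_len + tgt_len, 1)


def align_by_length(src: List[str], tgt: List[str]) -> List[Tuple[List[int], List[int]]]:
    """Dynamic-programming alignment over 1-1, 1-2 and 2-1 sentence groupings."""
    inf = float("inf")
    n, m = len(src), len(tgt)
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for di, dj in ((1, 1), (1, 2), (2, 1)):
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                cost = best[i][j] + length_cost(s_len, t_len)
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)
    if best[n][m] == inf:
        return []  # no alignment exists with the allowed groupings
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]


english = ["Hello.", "How are you today? I am fine."]
french = ["Bonjour.", "Comment allez-vous aujourd'hui ?", "Je vais bien."]
print(align_by_length(english, french))
# [([0], [0]), ([1], [1, 2])]
```

Each result pair lists the indices of the source sentences and the target sentences grouped together; here the long second English sentence is matched with two French sentences.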
18
Example of term alignment
19
Co-occurrence-based translation

Align terms using co-occurrence statistics
It is assumed that the correct translations of query terms tend to co-occur in target language documents
How often does a term pair occur in sentence pairs?
Weighted by relative position in the sentences
Retain term pairs that occur unusually often

[9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007
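A small sketch of co-occurrence-based term alignment over aligned sentence pairs. It scores term pairs with the Dice coefficient, which is one common association measure chosen here for illustration; the slide does not name a measure. The position weighting mentioned above is omitted, and the sentence pairs and the 0.9 threshold are invented.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Set, Tuple

TermPair = Tuple[str, str]


def dice_scores(sentence_pairs: List[Tuple[str, str]]) -> Dict[TermPair, float]:
    """Dice coefficient for every source/target term pair that co-occurs
    in at least one aligned sentence pair."""
    src_count: Counter = Counter()
    tgt_count: Counter = Counter()
    pair_count: Counter = Counter()
    for src, tgt in sentence_pairs:
        src_terms, tgt_terms = set(src.split()), set(tgt.split())
        src_count.update(src_terms)
        tgt_count.update(tgt_terms)
        pair_count.update(product(src_terms, tgt_terms))
    return {
        (s, t): 2.0 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
    }


def retain_frequent_pairs(scores: Dict[TermPair, float], threshold: float) -> Set[TermPair]:
    """Keep only the term pairs that co-occur unusually often."""
    return {pair for pair, score in scores.items() if score >= threshold}


SENTENCE_PAIRS = [
    ("chat noir", "black cat"),
    ("chat blanc", "white cat"),
    ("chien noir", "black dog"),
]

print(sorted(retain_frequent_pairs(dice_scores(SENTENCE_PAIRS), threshold=0.9)))
# [('blanc', 'white'), ('chat', 'cat'), ('chien', 'dog'), ('noir', 'black')]
```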
20
Exploiting Unaligned Corpora

Example approach: category-based translation
Extract a large number of terms from unaligned corpora of the first and second languages
Assign a category to each extracted term by accessing monolingual thesauri of the first and second languages
Estimate category-to-category translation probabilities
Estimate term-to-term translation probabilities using the category-to-category translation probabilities

[15] David Hull, Terminology translation for unaligned comparable corpora using category based translation probabilities. United States Patent 6885985. Filing date Dec 18, 2000. Issue date Apr 26, 2005
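One possible reading of the last step, as a toy sketch only. It assumes each term has a single category and spreads each category-to-category probability evenly over the target terms in that category; this is an illustrative simplification, not the formulation in the cited patent. All tables and names below are hypothetical.

```python
from typing import Dict

# Hypothetical monolingual thesaurus lookups: term -> category.
SOURCE_CATEGORY = {"banque": "FINANCE", "rive": "GEOGRAPHY"}
TARGET_CATEGORY = {"bank": "MONEY", "loan": "MONEY", "shore": "LANDFORM"}

# Hypothetical category-to-category translation probabilities.
CATEGORY_TRANSLATION = {
    ("FINANCE", "MONEY"): 0.9,
    ("FINANCE", "LANDFORM"): 0.1,
    ("GEOGRAPHY", "MONEY"): 0.2,
    ("GEOGRAPHY", "LANDFORM"): 0.8,
}


def term_translation_probs(source_term: str) -> Dict[str, float]:
    """Derive term-to-term scores from category-to-category probabilities,
    sharing each category's probability equally among its target terms."""
    src_cat = SOURCE_CATEGORY[source_term]
    category_size: Dict[str, int] = {}
    for cat in TARGET_CATEGORY.values():
        category_size[cat] = category_size.get(cat, 0) + 1
    raw = {
        term: CATEGORY_TRANSLATION.get((src_cat, cat), 0.0) / category_size[cat]
        for term, cat in TARGET_CATEGORY.items()
    }
    total = sum(raw.values())
    # Normalise so the scores for one source term sum to 1.
    return {term: score / total for term, score in raw.items()} if total else raw


print(term_translation_probs("rive"))
```

Category evidence alone cannot separate two target terms in the same category ("bank" and "loan" get equal scores here), so in practice it would be combined with other evidence.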
21
In Summary
[8] Miguel E. Ruiz, CLIR. Slides for school seminars. 2001
22
An experimental system
Automatic construction of parallel
English-Chinese corpus for CLIR

A parallel text mining system: PTMiner
Finds parallel text from the Web
Parallel Text Mining Algorithm
Search for candidate sites: using existing Web search engines, search for candidate sites that may contain parallel pages (by using anchor text)
File name fetching: for each candidate site, fetch the URLs of Web pages that are indexed by the search engines
Host crawling: starting from the URLs collected in the previous step, search through each candidate site separately for more URLs
Pair scan: from the obtained URLs of each site, scan for possible parallel pairs (by analyzing document external features)
Download and verifying: download the parallel pages, determine file size, language and character set, text length, and HTML structure, and filter out non-parallel pairs

Sample anchor texts: "english version", "in english"
Sample document external features: file-ch.html vs. file-en.html, /chinese//file.html vs. /english/file.html
Sample document internal features: character set, HTML structure

[10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000
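The pair scan step can be sketched with URL rewriting rules based on the sample external features above. This is an illustration, not PTMiner's actual code: the rule list, the function name and the example.org URLs are all hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical rules that turn a Chinese-page URL into its likely English counterpart.
LANGUAGE_MARKERS: List[Tuple[str, str]] = [
    (r"-ch(?=\.html?$)", "-en"),   # file-ch.html -> file-en.html
    (r"/chinese/", "/english/"),   # /chinese/... -> /english/...
]


def pair_scan(urls: List[str]) -> List[Tuple[str, str]]:
    """Pair URLs from one site that differ only by a language marker."""
    known = set(urls)
    pairs: List[Tuple[str, str]] = []
    for url in urls:
        for pattern, replacement in LANGUAGE_MARKERS:
            candidate, n_subs = re.subn(pattern, replacement, url)
            if n_subs and candidate in known:
                pairs.append((url, candidate))
                break
    return pairs


SITE_URLS = [
    "http://example.org/news/file-ch.html",
    "http://example.org/news/file-en.html",
    "http://example.org/chinese/about.html",
    "http://example.org/english/about.html",
    "http://example.org/contact.html",
]

for chinese_url, english_url in pair_scan(SITE_URLS):
    print(chinese_url, "<->", english_url)
```

Candidate pairs found this way would still go through the download and verification step, since a matching URL pattern does not guarantee that two pages are translations of each other.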
24
An alignment example
[10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000
25
Part of the lexicons

t: true
f: false

Other techniques and tools used

Encoding scheme transformation (for Chinese)
Sentence level segmentation
Chinese word segmentation
English expression extraction
SILC language and encoding identification system

14,820 pairs of texts (lexicon)
C-E has a precision of 77%
E-C has a precision of 81.5%
CLIR results
Test corpus: TREC5 and TREC6 Chinese track

Best systems at TREC-6 (1997)
English-French: 49% of highest French monolingual
English-German: 64% of highest German monolingual
Best systems at CLEF (2002)
English-French: 83% of highest French monolingual
English-German: 86% of highest German monolingual
Best systems at CLEF (2006)
English-French: 93.82% of best French monolingual
English-Portuguese: 90.91% of best Portuguese monolingual

[2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005
[16] Giorgio M. Di Nunzio, CLEF 2006 Ad Hoc Track Overview. 2006
28
Future tasks

Extend study scope
Web pages, medical literature, USENET newsgroup articles, records of legislative and legal proceedings
Lower cost, improve efficiency
Pay more attention to indexing-time optimizations to improve query-time efficiency
Consider the user's perspective
Improve the utility of ranked lists
Define suitable criteria for the construction of a valid multilingual Web corpus
Get resources for resource-poor languages

[11] D.W. Oard, When You Come to a Fork in the Road, Take It: Multiple Futures for CLIR Research. SIGIR 2002 CLIR
[12] Fredric Gey, et al, CROSS LANGUAGE INFORMATION RETRIEVAL: A RESEARCH ROADMAP. SIGIR 2002 CLIR
29
CLIR Communities

TREC Cross Language Track: currently focuses on the Arabic language
Cross-Language Evaluation Forum (CLEF): a spinoff from TREC, covering many European languages
NTCIR: Asian language evaluation (covering Chinese, Japanese and Korean)

[12] Fredric Gey, et al, CROSS LANGUAGE INFORMATION RETRIEVAL: A RESEARCH ROADMAP. SIGIR 2002 CLIR
30
CLEF

In CLEF 2006, eight tracks were offered to evaluate the performance of systems:
multilingual document retrieval on news collections (Ad-hoc)
cross-language structured scientific data (Domain-specific)
interactive cross-language retrieval
multiple language question answering
cross-language retrieval on image collections
cross-language speech retrieval
multilingual web retrieval
cross-language geographic retrieval

[13] Carol Peters, Cross-Language Evaluation Forum - CLEF 2006. D-Lib Magazine, October 2006
31
References
[1] Wikipedia, http://en.wikipedia.org/wiki/Cross-language_information_retrieval
[2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005
[3] Internet World Stats, http://www.internetworldstats.com/stats7.htm
[4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series, Vol. CS-TR-3615. 1996
[5] Ari Pirkola, et al. Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings. Information Retrieval, Vol. 4. 2001
[6] Jimmy Lin, Cross-Language and Multimedia Information Retrieval. Slides for LBSC 796/INFM 718R. 2006
[7] Metamodel.com. What are the differences between a vocabulary, a taxonomy, a thesaurus, an ontology, and a meta-model? http://www.metamodel.com/article.php?story20030115211223271. 2004
[8] Miguel E. Ruiz, CLIR. Slides for school seminars. 2001
[9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007
[10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000
[11] D.W. Oard, When You Come to a Fork in the Road, Take It: Multiple Futures for CLIR Research. SIGIR 2002 CLIR
[12] Fredric Gey, et al, CROSS LANGUAGE INFORMATION RETRIEVAL: A RESEARCH ROADMAP. SIGIR 2002 CLIR
[13] Carol Peters, Cross-Language Evaluation Forum - CLEF 2006. D-Lib Magazine, October 2006
[14] Boxes and Arrows, http://www.boxesandarrows.com/view/what_is_a_controlled_vocabulary
[15] David Hull, Terminology translation for unaligned comparable corpora using category based translation probabilities. United States Patent 6885985. Filing date Dec 18, 2000. Issue date Apr 26, 2005
[16] Giorgio M. Di Nunzio, CLEF 2006 Ad Hoc Track Overview. 2006
32
Thank you!
