Recupera - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
A Brief Survey on Cross-language Information
Retrieval (CLIR) - Text Retrieval Perspective
by Ying Alvarado (24401693)
CSE 8337, Lecturer: Dr. Margaret Dunham, April 26, 2007
2
Outline
  • Introduction
  • Concept
  • Why important
  • Approach
  • CLIR problems
  • Resource
  • Approaches
  • Example Techniques
  • A CLIR application system
  • CLIR effectiveness
  • CLIR future tasks
  • CLIR communities
  • References

3
Cross Language IR
  • Definition: Users enter their query in one
    language and the system retrieves relevant
    documents in other languages.
  • For example, a user may pose their query in
    English but retrieve relevant documents written
    in French.
  • Example CLIR applications:
    • Cross-language retrieval from texts
    • Cross-language retrieval from audio and images

In this presentation, we focus on text IR only!
1 Wikipedia,
http://en.wikipedia.org/wiki/Cross-language_information_retrieval
2 Paul Clough, Bridging the language gap: making
digital collections available to a multilingual
society, presentation, 2005
4
Monolingual vs. Bilingual vs. Multilingual
  • Monolingual IR: documents and user
    requests are in the same language
  • Cross-language IR
    • Documents and user requests are in different
      languages (bilingual IR)

[Figure: Cross-language IR (CLIR) system. A request in the source language (L1) is run against documents in the target language (L2); results are returned in L2.]
2 Paul Clough, Bridging the language gap:
making digital collections available to a
multilingual society, presentation, 2005
5
Monolingual vs. Bilingual vs. Multilingual (cont.)
  • Multilingual IR
    • Documents in the collection are in different
      languages; search requests may be in any language

[Figure: Multilingual IR (MLIR) system. A request in any language (L?) is run against document collections in L2, L3 and L4 (e.g. the Web); results are returned in L2, L3 or L4.]
6
Why CLIR?

Top ten languages on the Internet

Language                  % of all Internet users  Internet users by language  Internet penetration (%)  Growth 2000-2007 (%)  World population 2007 (est.)
English                                      29.5                 328,666,386                      28.7                 139.6                 1,143,218,916
Chinese                                      14.3                 159,001,513                      11.8                 392.2                 1,351,737,925
Spanish                                       8.0                  88,920,232                      20.2                 260.3                   439,284,783
Japanese                                      7.7                  86,300,000                      67.1                  83.3                   128,646,345
German                                        5.3                  58,711,687                      61.1                 113.2                    96,025,053
French                                        5.0                  55,521,294                      14.3                 355.2                   387,820,873
Portuguese                                    3.6                  40,216,760                      17.2                 430.8                   234,099,347
Korean                                        3.1                  34,120,000                      45.6                  79.2                    74,811,368
Italian                                       2.8                  30,763,940                      51.7                 133.1                    59,546,696
Arabic                                        2.6                  28,540,700                       8.4                 931.8                   340,548,157
Top ten languages                            81.7                 910,762,512                      21.4                 181.4                 4,255,739,462
Rest of world languages                      18.3                 203,511,914                       8.8                 444.5                 2,318,926,955
World total                                 100.0               1,114,274,426                      16.9                 208.7                 6,574,666,417

Top Ten Languages Used in the Web (Number of Internet Users by Language), Mar. 10, 2007
3 Internet World Stats,
http://www.internetworldstats.com/stats7.htm
7
Why CLIR? (cont.)
  • A collection may contain documents in many
    different languages, e.g. the Web. It would be
    impractical to form a query in each language.
  • A document may be written in more than one
    language. For example:
    • Technical documents in which English jargon
      appears intermixed with narrative text in another
      language.
    • Academic works which cite the titles of documents
      in different languages.
  • The user is not sufficiently fluent to express a
    query in a language, but is able to make use of
    the documents that are identified.
  • The user is monolingual and wants to query in
    their native language. This is still useful
    because the user:
    • can judge relevance even if the results are not
      translated
    • has access to document translation

2 Paul Clough, Bridging the language gap:
making digital collections available to a
multilingual society, presentation, 2005
4 D.W. Oard, A Survey of Multilingual Text
Retrieval. Computer Science Technical Report
Series Vol. CS-TR-3615. 1996
8
CLIR problems
  • Handling non-ASCII character sets
  • Untranslatable search keys (out-of-vocabulary,
    OOV), e.g. compound words, proper names, special
    terms
  • Multi-word concepts, e.g. phrases and idioms
  • Ambiguity, e.g. homonymy and polysemy
  • Word inflections, e.g. plurals and gender

2 Paul Clough, Bridging the language gap:
making digital collections available to a
multilingual society, presentation, 2005
5 Ari Pirkola, et al. Dictionary-Based
Cross-Language Information Retrieval: Problems,
Methods, and Research Findings. Information
Retrieval, Vol. 4. 2001
9
Resources for Translation
  • Ontology
    • A representation of concepts and relationships
  • Thesaurus
    • Most commonly, a listing of words with
      similar, related, or opposite meanings
    • It does not include the definitions of words
  • Bilingual dictionary
    • A list of words together with additional
      word-specific information
  • Bilingual controlled vocabulary
    • A carefully selected list of words and phrases
      used to tag units of information
      (documents or works) so that they may be more
      easily retrieved by a search
  • Corpora
    • The document collection itself

4 D.W. Oard, A Survey of Multilingual Text
Retrieval. Computer Science Technical Report
Series Vol. CS-TR-3615. 1996
6 Jimmy Lin, Cross-Language and Multimedia
Information Retrieval. Slides for LBSC 796/INFM
718R. 2006
1 Wikipedia. Related pages.
7 Metamodel.com. What are the differences between
a vocabulary, a taxonomy, a thesaurus, an
ontology, and a meta-model?
http://www.metamodel.com/article.php?story=20030115211223271.
2004
10
An example of controlled vocabulary
The hierarchical relationships (BT: broader term, NT: narrower term)
Women's Pants
  BT Pants
  NT Casual Pants
  NT Dress Pants
  NT Sports Pants
[Figure: example of the equivalence relationship]
14 Boxes and Arrows,
http://www.boxesandarrows.com/view/what_is_a_controlled_vocabulary
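
The BT/NT entry on this slide can be held in a small lookup structure. The sketch below is only an illustration: the `VOCAB` layout, the function names, and the `UF` ("used for") synonym "Women's Trousers" are assumptions that do not come from the slide.

```python
# Toy controlled vocabulary. The BT/NT entries follow the slide's example;
# the UF (used-for) synonym is a hypothetical addition for illustration.
VOCAB = {
    "Women's Pants": {
        "BT": ["Pants"],
        "NT": ["Casual Pants", "Dress Pants", "Sports Pants"],
        "UF": ["Women's Trousers"],
    },
}


def preferred_term(label):
    """Map a label, or one of its equivalents, to the preferred term."""
    for term, relations in VOCAB.items():
        if label == term or label in relations.get("UF", []):
            return term
    return None


def expand(label):
    """Return the preferred term plus its narrower terms for searching."""
    term = preferred_term(label)
    if term is None:
        return [label]  # unknown labels are passed through unchanged
    return [term] + VOCAB[term]["NT"]


expanded = expand("Women's Trousers")
```

Here `expanded` starts with the preferred term and then lists the three narrower terms, so a search on a non-preferred label still reaches documents tagged with any of them.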
11
What to translate?
  • Document translation
    • Text translation
      • E.g., translate entire document collection into
        English → search collection in English
    • Vector translation
  • Query translation
    • E.g., translate English query into Chinese query
      → search Chinese document collection

6 Jimmy Lin, Cross-Language and Multimedia
Information Retrieval. Slides for LBSC 796/INFM
718R. 2006
12
Tradeoffs
  • Document translation
    • Documents can be translated and stored offline
    • Depends on a high-quality automatic machine
      translation (MT) system
    • Does not easily deal with changing document sets
  • Query translation
    • Often easier
    • Disambiguation of query terms may be difficult
      with short queries

4 D.W. Oard, A Survey of Multilingual Text
Retrieval. Computer Science Technical Report
Series Vol. CS-TR-3615. 1996
6 Jimmy Lin, Cross-Language and Multimedia
Information Retrieval. Slides for LBSC 796/INFM
718R. 2006
13
Approaches to query translation
  • Knowledge-based: several aspects of domain
    knowledge are manually encoded into a lexicon.
    • Ontology-based (concept driven)
    • Thesaurus-based
    • Dictionary-based
    • Lexicons are expensive to construct
    • Lexicons lag behind the common use of terminology
  • Corpus-based: directly exploit statistical
    information about term usage in corpora to
    construct a lexicon automatically.
    • Parallel corpora: document pairs, sentence pairs,
      term pairs
    • Comparable corpora: document pairs, similar
      content
    • Unaligned corpora: documents from the same
      domain, not translations of one another, not
      linked in any other way

4 D.W. Oard, A Survey of Multilingual Text
Retrieval. Computer Science Technical Report
Series Vol. CS-TR-3615. 1996
8 Miguel E. Ruiz, CLIR. Slides for school
seminars. 2001
9 Rada Mihalcea, Information Retrieval and Web
Search. Class slides. 2007
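
Dictionary-based query translation can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a method taken from the cited sources: the toy dictionary `EN_FR`, the term-count scoring, and the pass-through of out-of-vocabulary terms are all hypothetical choices.

```python
from collections import Counter

# Hypothetical toy English-to-French dictionary (illustration only).
EN_FR = {
    "bank": ["banque", "rive"],  # ambiguous: financial bank vs. river bank
    "river": ["rivière", "fleuve"],
    "loan": ["prêt"],
}


def translate_query(query, dictionary):
    """Replace each source term with all of its dictionary translations.

    Terms missing from the dictionary (OOV) are passed through unchanged.
    """
    translated = []
    for term in query.lower().split():
        translated.extend(dictionary.get(term, [term]))
    return translated


def score(document, query_terms):
    """Count occurrences of the query terms in the document."""
    counts = Counter(document.lower().split())
    return sum(counts[term] for term in query_terms)


french_docs = [
    "la banque accorde un prêt",
    "la rive du fleuve",
]

query_fr = translate_query("bank loan", EN_FR)
ranked = sorted(french_docs, key=lambda d: score(d, query_fr), reverse=True)
```

Keeping every translation of "bank" lets the river-bank sense leak into the query, which is the short-query disambiguation problem in miniature; here the second query term still pushes the banking document to the top.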
14
Applying monolingual IR techniques
  • Query expansion
  • Relevance feedback
  • Stemming
  • Latent semantic analysis
  • Parsing
  • Part of speech tagging

4 D.W. Oard, A Survey of Multilingual Text
Retrieval. Computer Science Technical Report
Series Vol. CS-TR-3615. 1996
15
Multilingual Thesauri
  • Three construction techniques
    • Build it from scratch
    • Translate an existing thesaurus
    • Merge monolingual thesauri
  • Example: EuroWordNet
    • 7 languages
    • Built from existing lexical resources
    • Has the same structure as Princeton WordNet

8 Miguel E. Ruiz, CLIR. Slides for school
seminars. 2001
9 Rada Mihalcea, Information Retrieval and Web
Search. Class slides. 2007
16
Pseudo-Relevance Feedback
  • Also called blind feedback
  • Assume that the top n documents in the result set
    actually are relevant.
  • Enter query terms in French
  • Find top French documents in parallel corpus
  • Construct a query from English translations
  • Perform a monolingual free text search

[Figure: French query terms → French text retrieval system over a parallel corpus → top-ranked French documents → English translations → AltaVista → English Web pages]
9 Rada Mihalcea, Information Retrieval and Web
Search. Class slides. 2007
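
The four steps on this slide can be sketched as follows. This is a toy illustration, not the system from the cited slides: the `PARALLEL` corpus, the stopword list, the overlap scoring, and the parameters `n_docs` and `n_terms` are all assumptions.

```python
from collections import Counter

# Hypothetical toy parallel corpus: (French text, English translation) pairs.
PARALLEL = [
    ("le chat dort sur le canapé", "the cat sleeps on the sofa"),
    ("le chat mange une souris", "the cat eats a mouse"),
    ("la bourse chute à paris", "the stock market falls in paris"),
]
STOPWORDS = {"the", "a", "on", "in"}


def overlap(text, terms):
    """Number of tokens in `text` that are query terms."""
    return sum(1 for token in text.split() if token in terms)


def cross_language_prf(french_query, n_docs=2, n_terms=3):
    terms = set(french_query.lower().split())
    # 1. Rank the French side of the parallel corpus against the French query.
    ranked = sorted(PARALLEL, key=lambda pair: overlap(pair[0], terms), reverse=True)
    # 2. Blind feedback: assume the top n_docs are relevant and pool the
    #    tokens of their English sides.
    counts = Counter(
        token
        for _, english in ranked[:n_docs]
        for token in english.split()
        if token not in STOPWORDS
    )
    # 3. Keep the most frequent English terms as the new query.
    return [term for term, _ in counts.most_common(n_terms)]


english_query = cross_language_prf("chat souris")
```

The resulting `english_query` would then be sent to an ordinary monolingual English search engine, which is the role AltaVista plays on the slide.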
17
Different levels of alignment in parallel corpora
  • Document alignment
    • Already exists
    • Collected from existing corpora
    • Examine document external features
    • Examine document internal features
  • Sentence alignment
    • Easily constructed from aligned documents
    • Match pattern of relative sentence lengths
    • Good first step for term alignment
  • Term alignment
    • Using co-occurrence-based translation

9 Rada Mihalcea, Information Retrieval and Web
Search. Class slides. 2007
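
Matching the pattern of relative sentence lengths can be illustrated with a small dynamic program. This is a simplified sketch and an assumption on my part, not the aligner used in the cited slides: it scores a match by the absolute difference in character length and charges a fixed, hypothetical `skip_penalty` for unmatched sentences, whereas published length-based aligners use a probabilistic cost.

```python
def align_by_length(src, tgt, skip_penalty=10.0):
    """Align two sentence lists using only their character lengths.

    Allowed moves: 1-1 match, 2-1 and 1-2 merges, and 1-0 / 0-1 skips.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    moves = [(1, 1), (2, 1), (1, 2), (1, 0), (0, 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                if di == 0 or dj == 0:
                    step = skip_penalty
                else:
                    len_src = sum(len(s) for s in src[i:ni])
                    len_tgt = sum(len(t) for t in tgt[j:nj])
                    step = abs(len_src - len_tgt)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Trace back from the end to recover the aligned groups.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return pairs[::-1]


src = ["Hi.", "It is raining and I forgot my umbrella.", "Bye."]
tgt = ["Salut.", "Il pleut.", "J'ai oublié mon parapluie.", "Au revoir."]
aligned = align_by_length(src, tgt)
```

In this example the long English sentence is paired with the two French sentences that together have a similar length, giving a 1-1, 1-2, 1-1 alignment.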
18
Example of term alignment
19
Co-occurrence-based translation
  • Align terms using co-occurrence statistics
    • It is assumed that the correct translations of
      query terms tend to co-occur in target language
      documents
    • How often does a term pair occur in sentence pairs?
    • Weighted by relative position in the sentences
    • Retain term pairs that occur unusually often

9 Rada Mihalcea, Information Retrieval and Web
Search. Class slides. 2007
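
A minimal sketch of co-occurrence-based term alignment over sentence pairs follows. The toy corpus `PAIRS` and the choice of the Dice coefficient as the "unusually often" measure are assumptions for illustration; the position weighting mentioned on the slide is left out.

```python
from collections import Counter
from itertools import product

# Hypothetical sentence-aligned toy corpus: (English, French).
PAIRS = [
    ("the house is red", "la maison est rouge"),
    ("the house is big", "la maison est grande"),
    ("the car is red", "la voiture est rouge"),
]


def dice_table(pairs):
    """Score every (source, target) term pair with the Dice coefficient."""
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_terms, tgt_terms = set(src.split()), set(tgt.split())
        src_freq.update(src_terms)
        tgt_freq.update(tgt_terms)
        # Every source term co-occurs with every target term of the pair.
        joint.update(product(src_terms, tgt_terms))
    return {
        (s, t): 2 * count / (src_freq[s] + tgt_freq[t])
        for (s, t), count in joint.items()
    }


scores = dice_table(PAIRS)


def best_translation(term):
    """Return the target term with the highest Dice score for `term`."""
    candidates = {t: value for (s, t), value in scores.items() if s == term}
    return max(candidates, key=candidates.get)
```

With only three sentence pairs, content words such as "house" and "red" are already separated from their neighbours, while function words like "the" remain tied between several candidates, which is why larger corpora and a retention threshold are needed in practice.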
20
Exploiting Unaligned Corpora
  • Example approach: category-based translation
    • Extract a large number of terms from unaligned
      corpora of the first and second languages
    • Assign a category to each extracted term by
      accessing monolingual thesauri of the first and
      second languages
    • Estimate category-to-category translation
      probabilities
    • Estimate term-to-term translation probabilities
      using said category-to-category translation
      probabilities

15 David Hull, Terminology translation for
unaligned comparable corpora using category based
translation probabilities. United States Patent
6885985. Filing date: Dec 18, 2000. Issue date:
Apr 26, 2005
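
One plausible way to read the last step is as a sum over categories. The formula below is an assumption offered for illustration, not the formula of the cited patent: $s$ is a source term, $t$ a target term, and $c_s$, $c_t$ range over the thesaurus categories of the first and second language.

```latex
P(t \mid s) \;\approx\; \sum_{c_s} \sum_{c_t}
    P(c_s \mid s)\, P(c_t \mid c_s)\, P(t \mid c_t)
```

Here $P(c_s \mid s)$ comes from the category assignment, $P(c_t \mid c_s)$ is the category-to-category translation probability, and $P(t \mid c_t)$ spreads the probability of a target category over its member terms.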
21
In Summary
8 Miguel E. Ruiz, CLIR. Slides for school
seminars. 2001
22
An experimental system
Automatic construction of parallel
English-Chinese corpus for CLIR
  • A parallel text mining system: PTMiner
    • Finds parallel text from the Web
  • Parallel text mining algorithm
    • Search for candidate sites - Using existing Web
      search engines, search for the candidate sites
      that may contain parallel pages (by using anchor
      text)
    • File name fetching - For each candidate site,
      fetch the URLs of Web pages that are indexed by
      the search engines
    • Host crawling - Starting from the URLs collected
      in the previous step, search through each
      candidate site separately for more URLs
    • Pair scan - From the obtained URLs of each site,
      scan for possible parallel pairs (by analyzing
      document external features)
    • Download and verification - Download the parallel
      pages, determine file size, language and
      character set, text length, and HTML structure,
      and filter out non-parallel pairs.

10 Jiang Chen, et al. Automatic construction of
parallel English-Chinese corpus for
cross-language information retrieval. Proceedings
of the sixth conference on Applied natural
language processing. 2000
23
The workflow of the mining process
  • Sample anchor texts: "english version", "in
    english"
  • Sample document external features:
    • file-ch.html vs. file-en.html
    • /chinese//file.html vs. /english/file.html
  • Sample document internal features: character set,
    HTML structure

10 Jiang Chen, et al. Automatic construction of
parallel English-Chinese corpus for
cross-language information retrieval. Proceedings
of the sixth conference on Applied natural
language processing. 2000
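
The pair-scan step can be illustrated with a small URL matcher built on the external features listed on this slide. This is a sketch under assumptions, not PTMiner's actual code: the `MARKERS` list, the function name, and the example.org URLs are hypothetical.

```python
# Hypothetical language markers, modelled on the slide's examples
# ("-en" vs. "-ch" in file names, "/english/" vs. "/chinese/" in paths).
MARKERS = [("-en", "-ch"), ("/english/", "/chinese/")]


def candidate_pairs(urls):
    """Pair URLs that differ only by a language marker."""
    url_set = set(urls)
    pairs = []
    for url in sorted(url_set):
        for en_marker, ch_marker in MARKERS:
            if en_marker in url:
                twin = url.replace(en_marker, ch_marker)
                if twin in url_set:
                    pairs.append((url, twin))
    return pairs


urls = [
    "http://example.org/file-en.html",
    "http://example.org/file-ch.html",
    "http://example.org/english/news.html",
    "http://example.org/chinese/news.html",
    "http://example.org/about.html",
]
pairs = candidate_pairs(urls)
```

Pairs found this way are only candidates; the download-and-verify step would still compare internal features such as character set, text length, and HTML structure before accepting them.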
24
An alignment example
10 Jiang Chen, et al. Automatic construction of
parallel English-Chinese corpus for
cross-language information retrieval. Proceedings
of the sixth conference on Applied natural
language processing. 2000
25
Part of the lexicons
  • t: true
  • f: false

Other techniques and tools used
  • Encoding scheme transformation (for Chinese)
  • Sentence level segmentation
  • Chinese word segmentation
  • English expression extraction
  • SILC language and encoding identification system

10 Jiang Chen, et al. Automatic construction of
parallel English-Chinese corpus for
cross-language information retrieval. Proceedings
of the sixth conference on Applied natural
language processing. 2000
26
Results
  • 14,820 pairs of texts (lexicon)
  • C-E has a precision of 77%
  • E-C has a precision of 81.5%
  • CLIR results
  • Test corpus: TREC5 and TREC6 Chinese track

10 Jiang Chen, et al. Automatic construction of
parallel English-Chinese corpus for
cross-language information retrieval. Proceedings
of the sixth conference on Applied natural
language processing. 2000
27
Does CLIR work?
  • Best systems at TREC-6 (1997)
    • English-French: 49% of highest French monolingual
    • English-German: 64% of highest German monolingual
  • Best systems at CLEF (2002)
    • English-French: 83% of highest French monolingual
    • English-German: 86% of highest German monolingual
  • Best systems at CLEF (2006)
    • English-French: 93.82% of best French monolingual
    • English-Portuguese: 90.91% of best Portuguese
      monolingual

2 Paul Clough, Bridging the language gap: making
digital collections available to a multilingual
society, presentation, 2005
16 Giorgio M. Di Nunzio, CLEF 2006 Ad Hoc
Track Overview. 2006
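
The figures on this slide express cross-language effectiveness relative to a monolingual baseline. As a sketch, with $E$ standing for whichever effectiveness score the evaluation uses (mean average precision is a common choice, but the slide does not name the measure, so that is an assumption):

```latex
\text{relative effectiveness} \;=\; 100 \times \frac{E_{\text{CLIR}}}{E_{\text{mono}}}\ \%
```

For example, with hypothetical scores $E_{\text{mono}} = 0.40$ and $E_{\text{CLIR}} = 0.332$, the cross-language run reaches $100 \times 0.332 / 0.40 = 83\%$ of monolingual effectiveness.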
28
Future tasks
  • Extend study scope
    • Web pages, medical literature, USENET newsgroup
      articles, records of legislative and legal
      proceedings
  • Lower cost, improve efficiency
    • Pay more attention to indexing-time optimizations
      to improve query-time efficiency
  • Consider the user's perspective
    • Improve the utility of ranked lists
  • Define suitable criteria for the construction of
    a valid multilingual Web corpus
  • Get resources for resource-poor languages

11 D.W. Oard, When You Come to a Fork in the
Road, Take It: Multiple Futures for CLIR
Research. SIGIR 2002 CLIR
12 Fredric Gey, et al, Cross Language Information
Retrieval: A Research Roadmap. SIGIR 2002 CLIR
29
CLIR Communities
  • TREC Cross Language Track: currently focuses on
    the Arabic language
  • Cross-Language Evaluation Forum (CLEF): a
    spinoff from TREC, covering many European
    languages
  • NTCIR Asian Language Evaluation: covering
    Chinese, Japanese and Korean

12 Fredric Gey, et al, Cross Language Information
Retrieval: A Research Roadmap. SIGIR 2002 CLIR
30
CLEF
  • In CLEF 2006, eight tracks were offered to
    evaluate the performance of systems for:
    • multilingual document retrieval on news
      collections (Ad-hoc)
    • cross-language structured scientific data
      (Domain-specific)
    • interactive cross-language retrieval
    • multiple language question answering
    • cross-language retrieval on image collections
    • cross-language speech retrieval
    • multilingual web retrieval
    • cross-language geographic retrieval

13 Carol Peters, Cross-Language Evaluation
Forum - CLEF 2006. D-Lib Magazine, October 2006
31
References
1 Wikipedia, http://en.wikipedia.org/wiki/Cross-language_information_retrieval
2 Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005
3 Internet World Stats, http://www.internetworldstats.com/stats7.htm
4 D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series Vol. CS-TR-3615. 1996
5 Ari Pirkola, et al. Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings. Information Retrieval, Vol. 4. 2001
6 Jimmy Lin, Cross-Language and Multimedia Information Retrieval. Slides for LBSC 796/INFM 718R. 2006
7 Metamodel.com. What are the differences between a vocabulary, a taxonomy, a thesaurus, an ontology, and a meta-model? http://www.metamodel.com/article.php?story=20030115211223271. 2004
8 Miguel E. Ruiz, CLIR. Slides for school seminars. 2001
9 Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007
10 Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000
11 D.W. Oard, When You Come to a Fork in the Road, Take It: Multiple Futures for CLIR Research. SIGIR 2002 CLIR
12 Fredric Gey, et al, Cross Language Information Retrieval: A Research Roadmap. SIGIR 2002 CLIR
13 Carol Peters, Cross-Language Evaluation Forum - CLEF 2006. D-Lib Magazine, October 2006
14 Boxes and Arrows, http://www.boxesandarrows.com/view/what_is_a_controlled_vocabulary
15 David Hull, Terminology translation for unaligned comparable corpora using category based translation probabilities. United States Patent 6885985. Filing date: Dec 18, 2000. Issue date: Apr 26, 2005
16 Giorgio M. Di Nunzio, CLEF 2006 Ad Hoc Track Overview. 2006
32
Thank you!