Cross Language Information Retrieval

Slides: 43

Provided by: tdilM

1
Cross Language Information Retrieval

Proposal made by a consortium of CLIR research
groups
Vasudeva Varma, IIIT, Hyderabad
Pushpak Bhattacharyya, IIT, Bombay
Sobha L, AU-KBC
Anupam Basu, Sudeshna Sarkar, Pabitra Mitra,
IIT, Kharagpur
Mandar Mitra, ISI, Calcutta

2
Definition - CLIR

Query from Language L1
Retrieve relevant documents from L1 and/or L2 (L2
is English)
In this task, we are including English and Indian
language IR (Multiple Language IR)

3
CLIR and IR, MT

CLIR is an application of IR
IR techniques (query processing, indexing,
TF-IDF, LSI, ranking etc.) may be useful
in CLIR
Unfortunately, CLIR is not simply MT + IR
MT expects syntactically proper input, while IR
queries may not fall into this category
MT is computationally intensive compared to IR.
It would be ideal to do MT only on the output of
CLIR.
CLIR is tolerant of ambiguities in translation.
CLIR then deals with ranking in case of multiple
translations.

4
CLIR challenges

How can a query term in L1 be expressed in L2?
What mechanisms determine which of the possible
translations of text from L1 to L2 should be
retained?
In cases where more than one translation is
retained, how can the different translation
alternatives be weighted?
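
The three questions above can be made concrete with a small dictionary-based sketch. It is only one possible reading: the lexicon entries, the probabilities, and the `keep_top` cut-off are all made-up assumptions, not part of the proposal.

```python
from collections import defaultdict

# Hypothetical bilingual lexicon: each L1 term maps to candidate L2
# translations with illustrative (made-up) translation probabilities.
LEXICON = {
    "shikshaa": {"education": 0.6, "teaching": 0.3, "training": 0.1},
    "niti": {"policy": 0.8, "ethics": 0.2},
}

def translate_query(terms, lexicon, keep_top=2):
    """Return {L2 term: weight}, keeping the best `keep_top` translations per L1 term."""
    weights = defaultdict(float)
    for term in terms:
        candidates = lexicon.get(term)
        if not candidates:
            # Out-of-vocabulary term: pass it through unchanged with full weight.
            weights[term] += 1.0
            continue
        best = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:keep_top]
        total = sum(p for _, p in best)
        for word, p in best:
            # Renormalise so that each source term contributes a total weight of 1.
            weights[word] += p / total
    return dict(weights)

query = translate_query(["shikshaa", "niti", "Hyderabad"], LEXICON)
```

Here the lexicon answers "how can a term be expressed in L2", `keep_top` decides which translations are retained, and the renormalised probabilities weight the retained alternatives.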

5
CLIR Technique Classification

Broadly Corpus based, Knowledge based or a
combination of multiple techniques

6
CLIR Techniques - Issues

A lot of effort is needed to build and evaluate
corpora for corpus based techniques.
Dictionary based approaches have a major problem
with OOVs (Out of Vocabulary words).
OOVs could be Named Entities, words
transliterated from other languages etc.

7
Other Issues

For Indian languages, even basic resources are
yet to mature.
Resources such as stemmers and lexica of good
quality and quantity (coverage) are needed.

8
CLIR Evaluation

Based on TREC, CLEF models
Inputs should be a set of topics (queries) in an
Indian language (IL) and a set of English
documents marked with relevance judgements for
those queries.
Can pick existing English judgements to begin
with and translate English queries into IL to
reduce effort.
This also helps in using the existing evaluation
tools and guidelines of TREC, CLEF.
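
TREC-style evaluation scores a ranked result list against relevance judgements, topic by topic. A minimal sketch of uninterpolated average precision and its mean over topics follows; the topic and document identifiers are invented for illustration.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Uninterpolated average precision for a single topic."""
    relevant_ids = set(relevant_ids)
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            # Precision at the rank of each relevant document retrieved.
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)

def mean_average_precision(runs, qrels):
    """Mean of per-topic average precision over every judged topic."""
    scores = [average_precision(runs.get(topic, []), rel) for topic, rel in qrels.items()]
    return sum(scores) / len(scores)

# Hypothetical relevance judgements (qrels) and system output (runs).
qrels = {"T1": {"d1", "d3"}, "T2": {"d2"}}
runs = {"T1": ["d1", "d2", "d3"], "T2": ["d1", "d2"]}
map_score = mean_average_precision(runs, qrels)
```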

9
International Research Programs

Major ones are
TREC (US, DARPA under the TIDES program),
CLEF (EU) and
NTCIR (Japan)
Programs were initially designed for ad-hoc
cross-language text retrieval, then extended to
multi-lingual, multimedia, domain specific and
other dimensions.

10
Ideal CLIR Goals

Stable performance across all topics rather than
a high average, and/or
a guaranteed minimum performance across all
topics.
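
The difference between a high average and stable performance can be seen by summarising per-topic scores. The sketch below uses invented per-topic average precision values for two hypothetical systems.

```python
from statistics import mean, pstdev

def topic_stability(per_topic_scores):
    """Summarise per-topic scores: average, worst topic, and spread."""
    scores = list(per_topic_scores.values())
    return {"mean": mean(scores), "min": min(scores), "stdev": pstdev(scores)}

# Made-up scores: system A has the higher average, system B is more stable.
system_a = topic_stability({"T1": 0.9, "T2": 0.1, "T3": 0.8})
system_b = topic_stability({"T1": 0.6, "T2": 0.5, "T3": 0.55})
```

Under the stated goal, system B would be preferred despite its lower mean, because its worst topic and its spread are both better.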

11
Possible Approaches

Corpus based vs. dictionary based vs. combination
approach
Root dictionary vs. stem dictionary in the query
translation component
Role of WSD in CLIR (precise WSD is not needed;
take an elimination approach)
Recall is more important than precision
Language modeling: CL-LSI, core IR algorithms.
Third party components: Lucene, Nutch, Google?

12
Evaluation

English corpus, relevance judgments and tools
are already available
Need to build these resources for Indian
languages

13
Architecture

1. Components specific to a language
2. Components independent of language (For
English)

14
Components Specific to a Language

IL-E Dictionary (stem/root)
POS
WSD
Chunking
NER, multi word expressions
Language modelling
Each is optional for a group, depending on the
approach it takes

15
Components Independent (?) of Language

Indexing
Relevance computation (ranking)
Evaluation
Summarization (?)

16
Architecture

17
Alternative Architecture

18
Members of Proposed Consortium

Proposed (initial) task division by the members
of the Consortium

19
References

Multi-lingual information management: Current
Levels and Future Abilities, US NSF Report.
Editors: Eduard Hovy, Nancy Ide, Robert
Frederking, Joseph Mariani, Antonio Zampolli.
Alternative approaches for Cross Language Text
Retrieval, Douglas Oard.
Using linguistic tools and resources in
Cross-Language Retrieval, Carol Peters, Eugenio
Picchi.

20
IIIT Hyderabad

Title: Cross Lingual Information Retrieval
Proposer: Dr. Vasudeva Varma
Institution: IIIT Hyderabad
Language: Telugu
Name of Components that will be implemented:
Telugu English Bilingual Lexicon with light
weight WSD module
Telugu monolingual news corpus
Telugu Stemmer / Stop word list
Telugu Named Entity Recognizer and MWE
Telugu English Transliteration module
Proposed Domain:
General domain Telugu News corpus. Other
specific domains like Entertainment, disaster
management.
Proposed Domain:
Named Entity Recognition and Multiword
Expressions

21
IIIT - Hyderabad

The Search and Information Extraction Group at
LTRC has expertise in building:
Indian language search engines
Personalized search engines
Domain specific search engines
Information Extraction from specific domains
Summarization systems
Collaborative Search
Cross Language Information Retrieval Systems

22
Some Funded Projects

Indian Language Search Engine (DST)
Personalized Search Engine for mobile phones
(Nokia Research Centre, Helsinki)
Information Extraction from Disaster Management
Domain (ADRIN, Department of Space)
Information Extraction from Financial Domain (TCS)

23
Telugu Specific Components

Telugu English Bilingual Lexicon with light
weight WSD module
Telugu monolingual news corpus
Telugu Stemmer / Stop word list
Telugu Named Entity Recognizer and MWE
Telugu English Transliteration module

24
Telugu Specific Components

Telugu English Bilingual Lexicon with light
weight WSD module
The current lexicon is based on a dictionary more
than 150 years old.
Many new words are in use in Telugu newspapers.

25
Telugu Specific Components

Telugu monolingual news corpus
A large number of documents will enable language
modeling algorithms useful in CLIR.
It was shown that such monolingual corpora, even
without parallel or comparable corpora, could
help CLIR (Lisa Ballesteros, 1997).

26
Telugu Specific Components

Telugu Stemmer / Stop word list
We propose to build a statistical stemmer for
Telugu.
We propose to come up with a manually cleaned
stop word list for Telugu.
Telugu being highly agglutinative, the role of a
stemmer in CLIR is very important.
Unlike a traditional stemmer, this stemmer will
return multiple stems as output, since IL words
participate in 'sandhi' formation or compound
word agglutination.
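
The idea of returning several stems instead of one can be sketched with simple suffix stripping. This is not the proposed statistical stemmer: the romanised suffix list and the example token are illustrative placeholders, not validated Telugu morphology, and a real system would learn its suffixes from the monolingual corpus.

```python
# Hypothetical romanised suffixes, used only to illustrate the mechanism.
SUFFIXES = ["lu", "lo", "ki", "ni"]

def candidate_stems(word, suffixes=SUFFIXES, min_stem_len=3):
    """Return the word plus every stem reachable by repeatedly stripping known suffixes."""
    stems = [word]
    frontier = [word]
    while frontier:
        current = frontier.pop()
        for suffix in suffixes:
            stem = current[: -len(suffix)]
            if current.endswith(suffix) and len(stem) >= min_stem_len and stem not in stems:
                stems.append(stem)
                # A stripped stem may itself carry another suffix.
                frontier.append(stem)
    return stems

stems = candidate_stems("pustakalulo")  # made-up agglutinated token
```

All candidates can then be looked up in the bilingual lexicon, leaving the choice among them to the ranking stage.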

27
Telugu Specific Components

Telugu English Transliteration module
Transliteration helps in handling OOVs.
Both rule based and probabilistic techniques have
been tried with English being the target
language.
We propose to study both rule based and
statistical techniques for this purpose.

28
Named Entity Recognition

A linguistics based NER framework has been built
and is available
CRF based and Bayesian NER are being built
Some good work is being done in MWE

29
Format for Component Builders

Title: Cross Lingual Information Retrieval
Proposers: Prof. Anupam Basu,
Prof. Sudeshna Sarkar,
Dr. Pabitra Mitra
Institution: IIT Kharagpur
Language: Bengali
Name of Components that will be implemented:
Bengali-English dictionary (stemmer/stop word
list)
Bengali POS tagger (common with MT)
Bengali Local Word Grouper (chunker) (common with
MT)
Named entity Recognizer for Bengali
Bengali WSD
Bengali language modeling
Proposed Domain:
General domain Bengali News corpus. Other
specific domains like medical and tourism.

30
IIT Bombay

Indexing, Multiway Lexicon and Ancillaries

31
IITB architecture for CLIR as given in the EoI

[Figure: flowchart titled "Failsafe Search
Strategy". Labels: Query, aAQUA Threads, HTML
Corpus, WSD, Query Expansion, Enconverter, UNL,
Stemmers, AgroExplorer, UNL Index, Index,
Complete UNL Match (Yes/No), Partial UNL Match
(Yes/No), UW Match (Yes/No), Retrieved UNL
Documents, Lucene, Deconverter, Search Results.]

32
Indexing Experience

The scheme shown needs 4-level indexing:
Complete meaning expression
Partial meaning expression
Concepts (i.e., disambiguated words)
Ordinary keywords
The system's performance crucially depends on the
richness and exhaustiveness of indexing
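
One way to read the four index levels is as a fallback cascade: try the most specific level first and drop to a coarser one only when nothing matches. The sketch below assumes that reading; the level names, the dict layout, and the simplified UNL-like strings are all invented for illustration.

```python
# Assumed level names, ordered from most to least specific.
LEVELS = ["complete_expression", "partial_expression", "concept", "keyword"]

def failsafe_search(query_forms, indexes):
    """Return (level, doc_ids) for the first level that yields any match."""
    for level in LEVELS:
        hits = []
        for key in query_forms.get(level, []):
            for doc_id in indexes.get(level, {}).get(key, []):
                if doc_id not in hits:
                    hits.append(doc_id)
        if hits:
            return level, hits
    return None, []

# Toy indexes, one per level (contents are made up).
indexes = {
    "complete_expression": {"agt(grow, farmer) obj(grow, rice)": ["doc1"]},
    "partial_expression": {"obj(grow, rice)": ["doc1", "doc2"]},
    "concept": {"rice(icl>cereal)": ["doc1", "doc2", "doc3"]},
    "keyword": {"rice": ["doc1", "doc2", "doc3", "doc4"]},
}
# The same query expressed at each level.
query_forms = {
    "complete_expression": ["agt(eat, farmer) obj(eat, rice)"],
    "partial_expression": ["obj(eat, rice)"],
    "concept": ["rice(icl>cereal)"],
    "keyword": ["rice"],
}
level, docs = failsafe_search(query_forms, indexes)
```

In this toy run the complete and partial expressions find nothing, so the search falls through to the concept level.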

33
In the consortium indexing on words

Stemmed or root words (the former preferred in IR
engines) of multiple languages
Keeping English at the center, words of multiple
languages would link to one another
Disambiguation is necessary, but a light one,
sacrificing precision
The challenge is to avoid demanding that the
multilingual dictionary remain in memory

34
Indexing on Multi-words

Challenge: multiword detection
Multi-words are stored in the English-pivoted
multi-way lexicons
Efficient storage is a concern
Multiword parts too are indexed
High precision retrieval would demand high ranks
for multi-words and not their components
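
Indexing both a multiword and its parts, while ranking the multiword match higher, might look like the sketch below. The boost value, the example phrase, and the documents are assumptions; the slides give no weights.

```python
from collections import defaultdict

MULTIWORD_WEIGHT = 2.0  # assumed boost for a whole multiword match
PART_WEIGHT = 1.0       # assumed weight for a single-word match

def index_document(index, doc_id, tokens, multiwords):
    """Index every token, plus each known multiword found as consecutive tokens."""
    for token in tokens:
        index[token][doc_id] = PART_WEIGHT
    for size in {len(mw) for mw in multiwords}:
        for i in range(len(tokens) - size + 1):
            phrase = tuple(tokens[i:i + size])
            if phrase in multiwords:
                index[" ".join(phrase)][doc_id] = MULTIWORD_WEIGHT

def score(index, query_tokens, multiwords):
    """Sum the weights of the matching parts and, if known, the whole multiword."""
    keys = list(query_tokens)
    if tuple(query_tokens) in multiwords:
        keys.append(" ".join(query_tokens))
    scores = defaultdict(float)
    for key in keys:
        for doc_id, weight in index.get(key, {}).items():
            scores[doc_id] += weight
    return dict(scores)

multiwords = {("prime", "minister")}  # hypothetical lexicon entry
index = defaultdict(dict)
index_document(index, "d1", ["prime", "minister", "visits"], multiwords)
index_document(index, "d2", ["prime", "number"], multiwords)
scores = score(index, ["prime", "minister"], multiwords)
```

The document containing the full multiword outranks the one that only shares a component word.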

35
CLIR with ILs too as Documents

Elaborate and sophisticated indexing needed for
catering to multiple languages
Inverted indices of different languages should
link to each other.
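
A minimal sketch of linked inverted indices, with English as the pivot, is given below. It uses the four terms from the multilingual indexing scenario slide; pairing each term with a language code and a document (for example "shikkhaa" with Bengali and DOCb) is an assumption, as are the class and method names.

```python
from collections import defaultdict

class PivotedIndex:
    """Per-language postings that share an English pivot term as the common link."""

    def __init__(self, to_pivot):
        # (language, term) -> English pivot term; assumed to come from a multi-way lexicon.
        self.to_pivot = to_pivot
        # pivot term -> language -> set of document ids
        self.postings = defaultdict(lambda: defaultdict(set))

    def add(self, lang, term, doc_id):
        pivot = self.to_pivot.get((lang, term), term)
        self.postings[pivot][lang].add(doc_id)

    def search(self, lang, term):
        """Return matching documents in every language, grouped by language."""
        pivot = self.to_pivot.get((lang, term), term)
        by_lang = self.postings.get(pivot, {})
        return {l: sorted(docs) for l, docs in by_lang.items()}

# Assumed language assignments for the terms on the slide.
to_pivot = {
    ("mr", "shikshan"): "education",
    ("bn", "shikkhaa"): "education",
    ("hi", "shikshaa"): "education",
    ("en", "education"): "education",
}
idx = PivotedIndex(to_pivot)
idx.add("mr", "shikshan", "DOCm")
idx.add("bn", "shikkhaa", "DOCb")
idx.add("en", "education", "DOCe")
idx.add("hi", "shikshaa", "DOCh")
results = idx.search("hi", "shikshaa")
```

A query in any one language reaches the postings of all the others through the shared pivot.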

36
A multilingual indexing scenario

[Figure: inverted index in which the terms
shikshan, shikkhaa, education and shikshaa share
a common link; document labels DOCm, DOCb, DOCe
and DOCh.]

37
Bengali-English Dictionary and Stemmer

Language: Bengali
Name of Component: Bengali-English dictionary and
stemmer
Techniques Used:
Probabilistic dictionaries from parallel corpora
Structured query translation
Transliteration for handling OOV
Statistical stemmers
Performance of Techniques in other Languages:
Probabilistic dictionary for English-Hindi
provides 43 average precision on a 41,000
document collection (Doug Oard, TALIP 2003)
Estimate of expected Performance (PERT Chart)
Evaluation metrics:
Ratio of precision and recall between the
monolingual and cross-lingual engines
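
One reading of this metric is the cross-lingual score expressed as a fraction of the monolingual score on the same topics. The sketch below assumes that reading; the precision and recall figures are invented.

```python
def cross_to_mono_ratio(cross_lingual, monolingual):
    """Return each cross-lingual score as a fraction of the monolingual score."""
    return {
        metric: cross_lingual[metric] / monolingual[metric]
        for metric in ("precision", "recall")
        if monolingual[metric] > 0  # skip metrics with no monolingual baseline
    }

# Made-up scores for the two engines on the same topic set.
ratios = cross_to_mono_ratio(
    {"precision": 0.30, "recall": 0.45},
    {"precision": 0.40, "recall": 0.50},
)
```

A ratio close to 1 means the cross-lingual engine loses little relative to the monolingual baseline.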