Title: Cross Language Information Retrieval
1Cross Language Information Retrieval
- Proposal made by a consortium of CLIR research groups
- Vasudeva Varma, IIIT, Hyderabad
- Pushpak Bhattacharyya, IIT, Bombay
- Sobha L, AU-KBC
- Anupam Basu, Sudeshna Sarkar, Pabitra Mitra, IIT, Kharagpur
- Mandar Mitra, ISI, Calcutta
2Definition - CLIR
- Query from Language L1
- Retrieve relevant documents from L1 and/or L2 (L2 is English)
- In this task, we are including English and Indian language IR (Multiple Language IR)
3CLIR and IR, MT
- CLIR is an application of IR
- IR techniques of query processing, indexing, retrieval models (TF-IDF, LSI, etc.) and ranking may be useful in CLIR
- Unfortunately, CLIR is not simply MT + IR
- MT expects syntactically proper input, while IR queries may not fall into this category
- MT is computationally intensive compared to IR.
- It would be ideal to do MT only on the output of CLIR.
- CLIR is tolerant to ambiguities in translation.
- CLIR then deals with ranking in case of multiple translations.
4CLIR challenges
- How can a query term in L1 be expressed in L2?
- What mechanisms determine which of the possible translations of text from L1 to L2 should be retained?
- In cases where more than one translation is retained, how can the different translation alternatives be weighed?
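The three questions above can be illustrated with a small dictionary-based sketch. Everything in it is an assumption made for illustration: the toy romanized lexicon, its translation probabilities, the `min_prob` cut-off and the function name `translate_query` do not come from the proposal. Each L1 term is expanded into its L2 candidates, unlikely candidates are dropped, and the retained ones are reweighted so that the weights for one source term sum to 1.

```python
from typing import Dict, List, Tuple

# Toy L1 -> L2 lexicon with made-up translation probabilities (assumption).
LEXICON: Dict[str, Dict[str, float]] = {
    "shiksha": {"education": 0.80, "teaching": 0.15, "punishment": 0.05},
    "kal": {"yesterday": 0.5, "tomorrow": 0.5},
}


def translate_query(terms: List[str], min_prob: float = 0.1) -> List[Tuple[str, float]]:
    """Return weighted L2 terms for a list of L1 query terms."""
    weighted: List[Tuple[str, float]] = []
    for term in terms:
        candidates = LEXICON.get(term)
        if candidates is None:
            # Out-of-vocabulary term: pass it through unchanged with full weight.
            weighted.append((term, 1.0))
            continue
        # Retain only translations that are likely enough.
        kept = {t: p for t, p in candidates.items() if p >= min_prob}
        if not kept:
            # Nothing passed the cut-off, so fall back to all candidates.
            kept = dict(candidates)
        total = sum(kept.values())
        # Renormalize so the retained alternatives of one term sum to 1.
        weighted.extend((t, p / total) for t, p in kept.items())
    return weighted
```

The weighted terms could then be passed to any ranking function that accepts per-term weights; ambiguous terms such as the second lexicon entry simply keep both alternatives at equal weight.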
5CLIR Technique Classification
- Broadly: corpus based, knowledge based, or a combination of multiple techniques
6CLIR Techniques - Issues
- A lot of effort is needed to build and evaluate corpora for corpus based techniques.
- Dictionary based approaches have a major problem with OOVs (Out of Vocabulary words).
- OOVs could be named entities, words transliterated from other languages, etc.
7Other Issues
- With respect to Indian languages, even basic resources are yet to mature.
- Resources such as stemmers and lexica of good quality and quantity (coverage) are needed.
8CLIR Evaluation
- Based on the TREC and CLEF models
- Inputs should be a set of topics (queries) in an Indian language (IL) and a set of English documents marked with relevance judgements for those queries.
- Can pick existing English judgements to begin with and translate the English queries into IL to reduce effort.
- This also helps in using the existing evaluation tools and guidelines of TREC and CLEF.
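As a rough illustration of how topics, ranked results and relevance judgements fit together, the sketch below computes average precision per topic and the mean over topics (MAP), the usual summary measure in TREC-style evaluation. The data layout (plain dictionaries keyed by topic id) and the function names are assumptions for this example; real evaluations would normally use the existing TREC tools mentioned above.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average of the precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Mean of the per-topic average precision over all topics in the run."""
    if not runs:
        return 0.0
    per_topic = [average_precision(ranked, qrels.get(topic, set()))
                 for topic, ranked in runs.items()]
    return sum(per_topic) / len(per_topic)
```

Keeping the per-topic values, rather than only their mean, also makes it possible to look at the worst-performing topics.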
9International Research Programs
- Major ones are
- TREC (US DARPA under TIDES program),
- CLEF (EU) and
- NTCIR (Japan)
- Programs were initially designed for ad-hoc
cross-language text retrieval, then extended to
multi-lingual, multimedia, domain specific and
other dimensions.
10Ideal CLIR Goals
- Stable performance across all topics rather than a high average, and/or
- A minimum level of performance across all topics.
11Possible Approaches
- Corpus based vs. dictionary based vs. combination approach
- Root dictionary vs. stem dictionary in the query translation component
- Role of WSD in CLIR (precise WSD is not needed; take an elimination approach)
- Recall is more important than precision
- Language modeling, CL-LSI, core IR algorithms.
- Third party components: Lucene, Nutch, Google?
12Evaluation
- English corpus, relevance judgments and tools are already available
- Need to build these resources for Indian languages
13Architecture
- 1. Components specific to a language
- 2. Components independent of language (For
English)
14Components Specific to a Language
- IL-E Dictionary (stem/root)
- POS
- WSD
- Chunking
- NER, multi word expressions
- Language modelling
- Optional to a group based on the approach they
take
15Components Independent (?) of Language
- Indexing
- Relevance computation (ranking)
- Evaluation
- Summarization (?)
16Architecture
17Alternative Architecture
18Members of Proposed Consortium
Proposed (initial) task division by the members
of the Consortium
19References
- Multi-lingual information management: Current Levels and Future Abilities, US NSF Report. Editors: Eduard Hovy, Nancy Ide, Robert Frederking, Joseph Mariani, Antonio Zampolli.
- Alternative approaches for Cross Language Text Retrieval, Douglas Oard.
- Using linguistic tools and resources in Cross-Language Retrieval, Carol Peters, Eugenio Picchi
20IIIT Hyderabad
- Title: Cross Lingual Information Retrieval
- Proposer: Dr. Vasudeva Varma
- Institution: IIIT Hyderabad
- Language: Telugu
- Name of Components that will be implemented:
- Telugu English Bilingual Lexicon with light weight WSD module
- Telugu monolingual news corpus
- Telugu Stemmer / Stop word list
- Telugu Named Entity Recognizer and MWE
- Telugu English Transliteration module
- Proposed Domain:
- General domain Telugu News corpus. Other specific domains like entertainment and disaster management.
- Named Entity Recognition and Multiword Expressions
21IIIT - Hyderabad
- Search and Information Extraction Group at LTRC has expertise in building
- Indian language search engines
- Personalized search engines
- Domain specific search engines
- Information Extraction from specific domains
- Summarization systems
- Collaborative Search
- Cross Language Information Retrieval Systems
22Some Funded Projects
- Indian Language Search Engine: DST
- Personalized Search Engine for mobile phones: Nokia Research Centre, Helsinki
- Information Extraction from Disaster Management Domain: ADRIN, Department of Space
- Information Extraction from Financial Domain: TCS
23Telugu Specific Components
- Telugu English Bilingual Lexicon with light weight WSD module
- Telugu monolingual news corpus
- Telugu Stemmer / Stop word list
- Telugu Named Entity Recognizer and MWE
- Telugu English Transliteration module
24Telugu Specific Components
- Telugu English Bilingual Lexicon with light weight WSD module
- The current lexicon is based on a dictionary more than 150 years old.
- Many new words are in use in Telugu newspapers.
25Telugu Specific Components
- Telugu monolingual news corpus
- A large number of documents will enable language modeling algorithms useful in CLIR.
- It was shown that such a corpus, even without parallel or comparable corpora, could help CLIR (Lisa Ballesteros, 1997).
26Telugu Specific Components
- Telugu Stemmer / Stop word list
- Propose to build a statistical stemmer for Telugu.
- Propose to come up with a manually cleaned stop word list for Telugu.
- Telugu being highly agglutinative, the role of a stemmer in CLIR is very important.
- Unlike a traditional stemmer, this stemmer will return multiple stems as output, since IL words participate in 'sandhi' formation or compound word agglutination.
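The idea of a stemmer that returns several stems can be sketched as below. This is not the proposed statistical stemmer: the suffix list is a hand-written, hypothetical set of romanized endings (a statistical stemmer would learn such units and their likelihoods from the news corpus), and `candidate_stems` is an illustrative name. The point is only that every intermediate form is kept as a candidate, so the dictionary lookup can try each of them.

```python
from typing import List

# Hypothetical romanized suffixes, hand-written for illustration only.
SUFFIXES = ["lu", "la", "ni", "lo", "ki", "tho"]


def candidate_stems(word: str, min_stem_len: int = 3, max_strips: int = 2) -> List[str]:
    """Return the word plus every stem reachable by stripping up to max_strips suffixes."""
    stems = [word]
    frontier = [word]
    for _ in range(max_strips):
        next_frontier = []
        for form in frontier:
            for suffix in SUFFIXES:
                if not form.endswith(suffix):
                    continue
                stem = form[: -len(suffix)]
                # Skip stems that are too short or already collected.
                if len(stem) >= min_stem_len and stem not in stems:
                    stems.append(stem)
                    next_frontier.append(stem)
        frontier = next_frontier
    return stems
```

With this suffix list, `candidate_stems("pustakalalo")` yields the surface form followed by two progressively shorter candidates, all of which can be looked up in the bilingual lexicon.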
27Telugu Specific Components
- Telugu English Transliteration module
- Transliteration helps in handling OOVs.
- Both rule based and probabilistic techniques have been tried with English being the target language.
- We propose to study both rule based and statistical techniques for this purpose.
28Named Entity Recognition
- Built a linguistics based NER framework, which is available
- Building CRF based and Bayesian NER
- Some good work being done in MWE
29Format for Component Builders
- Title: Cross Lingual Information Retrieval
- Proposers: Prof. Anupam Basu, Prof. Sudeshna Sarkar, Dr. Pabitra Mitra
- Institution: IIT Kharagpur
- Language: Bengali
- Name of Components that will be implemented:
- Bengali-English dictionary (stemmer/stop word list)
- Bengali POS tagger (common with MT)
- Bengali Local Word Grouper (chunker) (common with MT)
- Named Entity Recognizer for Bengali
- Bengali WSD
- Bengali language modeling
- Proposed Domain:
- General domain Bengali News corpus. Other specific domains like medical and tourism.
30IIT Bombay
- Indexing, Multiway Lexicon and Ancillaries
31IITB architecture for CLIR as given in the EoI
[Figure: flowchart titled "Failsafe Search Strategy". Boxes: Query, WSD, Query Expansion, Stemmers, Enconverter, UNL, aAQUA Threads HTML Corpus, AgroExplorer, UNL Index, Index, Lucene, Deconverter, Retrieved UNL Documents, Search Results. Decision points (Yes/No): Complete UNL Match, Partial UNL Match, UW Match.]
32Indexing Experience
- The scheme shown needs 4-level indexing:
- Complete meaning expression
- Partial meaning expression
- Concepts (i.e., disambiguated words)
- Ordinary key words
- The system's performance crucially depends on the richness and exhaustiveness of indexing
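One way to read the four index levels is as a fallback cascade: try the most precise index first and drop to a coarser one only when nothing matches. The sketch below shows that control flow only. The level names, the opaque string keys and the function `failsafe_search` are assumptions for illustration; how the keys are derived from a query (enconversion, disambiguation, stemming) is outside its scope.

```python
from typing import Dict, List, Set, Tuple

Index = Dict[str, Set[str]]  # index key -> ids of documents containing that key

# Most precise level first, plain keywords last (assumed ordering).
LEVELS = ["complete_expression", "partial_expression", "concept", "keyword"]


def failsafe_search(keys_by_level: Dict[str, List[str]],
                    indices: Dict[str, Index]) -> Tuple[str, Set[str]]:
    """Try each index level in order and return the first level that matches anything."""
    for level in LEVELS:
        index = indices.get(level, {})
        docs: Set[str] = set()
        for key in keys_by_level.get(level, []):
            docs |= index.get(key, set())
        if docs:
            return level, docs
    # No level produced a match.
    return "none", set()
```

Returning the level along with the documents lets the caller rank results from a precise level above results that were only found by keyword.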
33In the consortium indexing on words
- Stemmed or root words (the former preferred in IR engines) of multiple languages
- Keeping English at the center, words of multiple languages would link to one another
- Disambiguation is necessary, but a light one, sacrificing precision
- The challenge is to avoid demanding that the multilingual dictionary remain in memory
34Indexing on Multi-words
- Challenge: multiword detection
- Multi-words stored in the English-pivoted multi-way lexicons
- Efficient storage is a concern
- Multiword parts too are indexed
- High precision retrieval would demand high ranks for multi-words and not their components
35CLIR with ILs too as Documents
- Elaborate and sophisticated indexing is needed to cater to multiple languages
- Inverted indices of different languages should link to each other.
36A multilingual indexing scenario
[Figure: inverted index entries shikshan (DOCm), shikkhaa (DOCb), education (DOCe) and shikshaa (DOCh) joined through a common link.]
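The linked-index scenario on this slide can be sketched as one inverted index per language plus a shared table keyed by the English pivot term. The class name `LinkedIndex`, the single-letter language codes (taken from the document labels DOCm, DOCb, DOCe, DOCh) and the in-memory dictionaries are assumptions for illustration; a real system would need the disk-based, English-pivoted multi-way lexicon discussed earlier.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

# Toy pivot lexicon: (language code, term) -> English pivot term (assumption).
PIVOT: Dict[Tuple[str, str], str] = {
    ("m", "shikshan"): "education",
    ("b", "shikkhaa"): "education",
    ("h", "shikshaa"): "education",
    ("e", "education"): "education",
}


class LinkedIndex:
    def __init__(self) -> None:
        # One inverted index per language: language -> term -> document ids.
        self.postings: Dict[str, Dict[str, Set[str]]] = defaultdict(lambda: defaultdict(set))
        # Common link: pivot term -> (language, term) pairs that share it.
        self.links: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)

    def add(self, lang: str, term: str, doc_id: str) -> None:
        self.postings[lang][term].add(doc_id)
        pivot = PIVOT.get((lang, term))
        if pivot is not None:
            self.links[pivot].add((lang, term))

    def search(self, lang: str, term: str) -> Set[str]:
        """Return documents in any language that contain the term or a linked term."""
        docs = set(self.postings[lang].get(term, set()))
        pivot = PIVOT.get((lang, term))
        if pivot is not None:
            for other_lang, other_term in self.links[pivot]:
                docs |= self.postings[other_lang][other_term]
        return docs
```

A query term in any one language then reaches the postings of its counterparts in the other languages through the shared pivot entry, while terms without a pivot entry fall back to monolingual lookup.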
37Bengali-English Dictionary and Stemmer
- Language: Bengali
- Name of Component: Bengali-English dictionary and stemmer
- Techniques Used:
- Probabilistic dictionaries from parallel corpora
- Structured query translation
- Transliteration for handling OOV
- Statistical stemmers
- Performance of Techniques in other Languages:
- Probabilistic dictionary for English-Hindi provides 43% average precision on a 41,000 document collection (Doug Oard, TALIP 2003)
- Estimate of expected Performance (PERT Chart)
- Evaluation metrics:
- Ratio of Precision and Recall in monolingual and cross-lingual engine
38POS Tagger for Bengali (common with MT)
- Language: Bengali
- Name of Component: Bengali Part-of-Speech Tagger
- Techniques Used:
- Bi-gram Hidden Markov Model
- Semi-Supervised Learning.
- Morphology driven transformation based learning for unknown word handling.
- Performance of Techniques in other Languages:
- Bi-gram Hidden Markov Model: 97-98% for English
- Estimate of expected Performance (PERT Chart)
- Evaluation metrics:
- Sentence/word level Accuracy.
- Known/Unknown word Accuracy
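For readers unfamiliar with bi-gram HMM tagging, the sketch below shows Viterbi decoding in log space. It is a generic textbook formulation, not the proposed tagger: the probability tables are supplied by the caller, and unknown words simply receive a small flat emission probability `unk`, which stands in for the morphology-driven unknown word handling listed above.

```python
import math
from typing import Dict, List


def viterbi(words: List[str], tags: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]],
            unk: float = 1e-6) -> List[str]:
    """Most likely tag sequence for the words under a bi-gram HMM."""
    if not words:
        return []

    def lp(p: float) -> float:
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t]: best log-probability of tagging words[:i + 1] with last tag t.
    best = [{t: lp(start_p.get(t, 0.0)) + lp(emit_p[t].get(words[0], unk)) for t in tags}]
    back: List[Dict[str, str]] = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # Pick the previous tag that maximizes the path score into t.
            prev = max(tags, key=lambda p: best[i - 1][p] + lp(trans_p[p].get(t, 0.0)))
            score = best[i - 1][prev] + lp(trans_p[prev].get(t, 0.0))
            best[i][t] = score + lp(emit_p[t].get(words[i], unk))
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]
```

In practice the three probability tables would be estimated from a tagged corpus, with the semi-supervised step used to extend them from untagged text.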
39Named Entity Recognizer for Bengali
- Language: Bengali
- Name of Component: Bengali Named-entity recognizer
- Techniques Used:
- Maximum Entropy model.
- Performance of Techniques in other Languages:
- Precision: 90-95% for English
- Estimate of expected Performance (PERT Chart)
- Evaluation metrics:
- Precision and Recall
- F-measure
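The evaluation metrics named here can be computed as in the following sketch, which scores predicted entities against gold annotations using exact match on span and type. The tuple layout `(start, end, type)` and the function name are assumptions; shared-task scorers sometimes also give credit for partial matches, which this sketch does not.

```python
from typing import Set, Tuple

Entity = Tuple[int, int, str]  # (start token, end token, entity type)


def precision_recall_f1(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F-measure (harmonic mean of the two)."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```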
40Bengali Local Word Grouper (common with MT)
- Language: Bengali
- Name of Component: Bengali Local Word Grouper
- Techniques Used:
- Feature Structure Unification using greedy Algorithm
- Statistical Chunking
- MWE handling
- Performance of Techniques in other Languages:
- 20% improvement for English on a medical document collection
- Estimate of expected Performance (PERT Chart)
- Evaluation metrics:
- F-Score: Harmonic mean of Precision and Recall
41Bengali Language Modeling
- Language: Bengali
- Name of Component: Bengali Language Modeling
- Techniques Used:
- n-gram
- Lemur will be ported to Bengali
- MWE handling
- Performance of Techniques in other Languages:
- On par with the vector space model
- Estimate of expected Performance (PERT Chart)
- Evaluation metrics:
- F-Score: Harmonic mean of Precision and Recall
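To make the language-modeling approach concrete, here is a minimal query-likelihood sketch in plain Python. It does not use the Lemur API. It uses a unigram model with Jelinek-Mercer smoothing, which is a choice made for this example rather than something stated in the slides, and the mixing weight `lam` is an arbitrary default.

```python
import math
from collections import Counter
from typing import Dict, List


def score_documents(query: List[str], docs: Dict[str, List[str]],
                    lam: float = 0.5) -> Dict[str, float]:
    """Log-probability of the query under each document's smoothed unigram model."""
    # Collection-wide term counts, used as the background model.
    collection = Counter(term for tokens in docs.values() for term in tokens)
    collection_len = sum(collection.values())
    scores: Dict[str, float] = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        log_p = 0.0
        for term in query:
            p_doc = tf[term] / len(tokens) if tokens else 0.0
            p_coll = collection[term] / collection_len if collection_len else 0.0
            # Mix the document model with the collection model.
            p = lam * p_doc + (1 - lam) * p_coll
            if p == 0.0:
                # Term unseen in the whole collection: the query is impossible.
                log_p = float("-inf")
                break
            log_p += math.log(p)
        scores[doc_id] = log_p
    return scores
```

Documents are ranked by descending score; smoothing keeps a document from being ruled out just because it lacks one query term that occurs elsewhere in the collection.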
42Bengali Word Sense Disambiguator
- Language: Bengali
- Name of Component: Bengali Local Word Grouper
- Techniques Used:
- Feature Structure Unification using greedy Algorithm
- Statistical Chunking
- MWE handling
- Performance of Techniques in other Languages:
- LWG accuracy: 90-95% for English
- Estimate of expected Performance (PERT Chart)
- Evaluation metrics:
- F-Score: Harmonic mean of Precision and Recall