Cross Language Information Retrieval - PowerPoint PPT Presentation

Slides: 43
Provided by: tdilM

Transcript and Presenter's Notes


1
Cross Language Information Retrieval
  • Proposal made by a consortium of CLIR research
    groups
  • Vasudeva Varma, IIIT, Hyderabad
  • Pushpak Bhattacharyya, IIT, Bombay
  • Sobha L, AU-KBC
  • Anupam Basu, Sudeshna Sarkar, Pabitra Mitra,
    IIT, Kharagpur
  • Mandar Mitra, ISI, Calcutta

2
Definition - CLIR
  • Query from Language L1
  • Retrieve relevant documents from L1 and/or L2 (L2
    is English)
  • In this task, we are including English and Indian
    language IR (Multiple Language IR)

3
CLIR and IR, MT
  • CLIR is an application of IR
  • IR techniques for query processing, indexing
    (TF-IDF, LSI, etc.) and ranking may be useful
    in CLIR
  • Unfortunately, CLIR is not equal to MT + IR
  • MT expects syntactically proper input, while IR
    queries may not fall into this category
  • MT is computationally intensive compared to IR
  • It would be ideal to do MT only on the output of
    CLIR
  • CLIR is tolerant of ambiguities in translation
  • CLIR then deals with ranking in case of multiple
    translations

4
CLIR challenges
  • How can a query term in L1 be expressed in L2?
  • What mechanisms determine which of the possible
    translations of text from L1 to L2 should be
    retained?
  • In cases where more than one translation is
    retained, how can the different translation
    alternatives be weighted?
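
A minimal sketch of how the three questions above might fit together in a dictionary-based setting: look up each L1 term, retain only the most probable translations, and renormalise their weights. The lexicon entries, the probabilities and the `keep_top` parameter are invented for illustration and are not part of the proposal.

```python
from collections import defaultdict

# Hypothetical toy lexicon: L1 term -> {L2 translation: probability}.
# Entries and probabilities are made up for this example.
TOY_LEXICON = {
    "shikshaa": {"education": 0.8, "teaching": 0.2},
    "niti": {"policy": 0.7, "ethics": 0.3},
}

def translate_query(terms, lexicon, keep_top=2):
    """Map each L1 term to weighted L2 terms; unknown terms pass through."""
    weighted = defaultdict(float)
    for term in terms:
        candidates = lexicon.get(term)
        if not candidates:
            # Out-of-vocabulary term: keep it unchanged with full weight.
            weighted[term] += 1.0
            continue
        # Retain only the keep_top most probable translations.
        top = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:keep_top]
        total = sum(p for _, p in top)
        # Renormalise so each source term contributes a total weight of 1.
        for word, p in top:
            weighted[word] += p / total
    return dict(weighted)

if __name__ == "__main__":
    print(translate_query(["shikshaa", "niti", "xyz"], TOY_LEXICON))
```

Renormalising per source term keeps a term with many translations from dominating the translated query; other weighting schemes are equally plausible.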

5
CLIR Technique Classification
  • Broadly: corpus based, knowledge based, or a
    combination of multiple techniques

6
CLIR Techniques - Issues
  • A lot of effort is needed to build and evaluate
    corpora for corpus-based techniques.
  • Dictionary-based approaches have a major problem
    with OOVs (out-of-vocabulary words).
  • OOVs could be named entities, words
    transliterated from other languages, etc.

7
Other Issues
  • With respect to Indian languages, even basic
    resources are yet to mature.
  • Resources such as stemmers and lexica of good
    quality and quantity (coverage) are needed.

8
CLIR Evaluation
  • Based on the TREC and CLEF models
  • Inputs should be a set of topics (queries) in an
    Indian language (IL) and a set of English
    documents marked with relevance judgements for
    those queries.
  • To reduce effort, we can begin with existing
    English judgements and translate the English
    queries into the IL.
  • This also helps in using the existing evaluation
    tools and guidelines of TREC and CLEF.
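
As an illustration of the kind of measure TREC-style evaluation tools report, the sketch below computes uninterpolated average precision per topic and its mean over topics. The data layout (`runs` and `qrels` as plain dictionaries) is an assumption made for this example, not a format prescribed by the proposal.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Uninterpolated average precision for a single topic."""
    relevant_ids = set(relevant_ids)
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            # Precision at the rank of each relevant document.
            precision_sum += hits / rank
    # Relevant documents never retrieved contribute zero.
    return precision_sum / len(relevant_ids)

def mean_average_precision(runs, qrels):
    """runs: topic -> ranked doc ids; qrels: topic -> relevant doc ids."""
    scores = [average_precision(runs.get(t, []), rel) for t, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0

if __name__ == "__main__":
    qrels = {"t1": {"d1", "d3"}, "t2": {"d9"}}
    runs = {"t1": ["d1", "d2", "d3"], "t2": ["d4", "d9"]}
    print(mean_average_precision(runs, qrels))
```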

9
International Research Programs
  • Major ones are:
    • TREC (US DARPA under the TIDES program),
    • CLEF (EU) and
    • NTCIR (Japan)
  • Programs were initially designed for ad-hoc
    cross-language text retrieval, then extended to
    multi-lingual, multimedia, domain specific and
    other dimensions.

10
Ideal CLIR Goals
  • Stable performance across all topics rather than
    a high average, and/or
  • A guaranteed minimum performance across all
    topics.
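
One way to make these goals measurable, sketched under the assumption that a per-topic score (for example average precision) is already available: report the worst topic and the spread alongside the mean. The function name and the choice of population standard deviation are illustrative.

```python
import statistics

def robustness_summary(per_topic_scores):
    """Summarise per-topic scores by mean, worst case and spread."""
    scores = list(per_topic_scores.values())
    return {
        "mean": statistics.mean(scores),
        "min": min(scores),                   # worst-performing topic
        "stdev": statistics.pstdev(scores),   # lower means more stable
    }

if __name__ == "__main__":
    print(robustness_summary({"t1": 0.2, "t2": 0.4, "t3": 0.3}))
```

Two systems with the same mean can then be compared on `min` and `stdev`, which is the distinction the goals above draw.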

11
Possible Approaches
  • Corpus-based vs. dictionary-based vs. combined
    approach
  • Root dictionary vs. stem dictionary in the query
    translation component
  • Role of WSD in CLIR (precise WSD is not needed;
    take an elimination approach)
  • Recall is more important than precision
  • Language modeling, CL-LSI, core IR algorithms
  • Third-party components: Lucene, Nutch, Google?

12
Evaluation
  • English corpus, relevance judgments and tools
    are already available
  • Need to build these resources for Indian
    languages

13
Architecture
  • 1. Components specific to a language
  • 2. Components independent of language (For
    English)

14
Components Specific to a Language
  • IL-E Dictionary (stem/root)
  • POS
  • WSD
  • Chunking
  • NER, multi word expressions
  • Language modelling
  • Each component is optional for a group, depending
    on the approach it takes

15
Components Independent (?) of Language
  • Indexing
  • Relevance computation (ranking)
  • Evaluation
  • Summarization (?)

16
Architecture

17
Alternative Architecture

18
Members of Proposed Consortium
Proposed (initial) task division by the members
of the Consortium

19
References
  • Multi-lingual Information Management: Current
    Levels and Future Abilities, US NSF Report.
    Editors: Eduard Hovy, Nancy Ide, Robert
    Frederking, Joseph Mariani, Antonio Zampolli.
  • Alternative Approaches for Cross-Language Text
    Retrieval, Douglas Oard.
  • Using Linguistic Tools and Resources in
    Cross-Language Retrieval, Carol Peters, Eugenio
    Picchi.

20
IIIT Hyderabad
  • Title: Cross Lingual Information Retrieval
  • Proposer: Dr. Vasudeva Varma
  • Institution: IIIT Hyderabad
  • Language: Telugu
  • Name of Components that will be implemented:
    • Telugu English Bilingual Lexicon with light
      weight WSD module
    • Telugu monolingual news corpus
    • Telugu Stemmer / Stop word list
    • Telugu Named Entity Recognizer and MWE
    • Telugu English Transliteration module
  • Proposed Domain:
    • General domain Telugu News corpus. Other
      specific domains like entertainment and
      disaster management.
    • Named Entity Recognition and Multiword
      Expressions

21
IIIT - Hyderabad
  • The Search and Information Extraction Group at
    LTRC has expertise in building:
    • Indian language search engines
    • Personalized search engines
    • Domain specific search engines
    • Information Extraction from specific domains
    • Summarization systems
    • Collaborative Search
    • Cross Language Information Retrieval Systems

22
Some Funded Projects
  • Indian Language Search Engine (DST)
  • Personalized Search Engine for mobile phones
    (Nokia Research Centre, Helsinki)
  • Information Extraction from Disaster Management
    Domain (ADRIN, Department of Space)
  • Information Extraction from Financial Domain
    (TCS)

23
Telugu Specific Components
  • Telugu English Bilingual Lexicon with light
    weight WSD module
  • Telugu monolingual news corpus
  • Telugu Stemmer / Stop word list
  • Telugu Named Entity Recognizer and MWE
  • Telugu English Transliteration module

24
Telugu Specific Components
  • Telugu English Bilingual Lexicon with light
    weight WSD module
  • The current lexicon is based on a dictionary more
    than 150 years old.
  • Many new words are in use in Telugu newspapers.

25
Telugu Specific Components
  • Telugu monolingual news corpus
  • A large number of documents will enable language
    modeling algorithms useful in CLIR.
  • It has been shown that such monolingual corpora,
    even without parallel or comparable corpora, can
    help CLIR (Lisa Ballesteros, 1997).

26
Telugu Specific Components
  • Telugu Stemmer / Stop word list
  • We propose to build a statistical stemmer for
    Telugu.
  • We propose to come up with a manually cleaned
    stop word list for Telugu.
  • Telugu being highly agglutinative, the role of a
    stemmer in CLIR is very important.
  • Unlike a traditional stemmer, this stemmer will
    return multiple stems as output, since IL words
    participate in 'sandhi' formation or compound
    word agglutination.
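
To make the "multiple stems" idea concrete, here is a minimal sketch that returns every stem obtainable by stripping one known suffix, instead of committing to a single answer. The suffix list and the romanized example word are invented for illustration; the proposed stemmer would be statistical and would learn its segmentations from a corpus.

```python
# Hypothetical romanized suffixes, invented purely for illustration.
TOY_SUFFIXES = ["lu", "ni", "lo", "ku", "luni"]

def candidate_stems(word, suffixes=TOY_SUFFIXES, min_stem_len=3):
    """Return all candidate stems for a word, shortest first.

    The unmodified word is always included, so downstream lookup can
    still succeed when no suffix applies.
    """
    stems = {word}
    for suffix in suffixes:
        # Strip the suffix only if a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            stems.add(word[: -len(suffix)])
    return sorted(stems, key=len)

if __name__ == "__main__":
    print(candidate_stems("pustakaluni"))
```

Each candidate can then be looked up in the bilingual lexicon, and the lookups that succeed decide which segmentation was useful.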

27
Telugu Specific Components
  • Telugu English Transliteration module
  • Transliteration helps in handling OOVs.
  • Both rule based and probabilistic techniques have
    been tried with English being the target
    language.
  • We propose to study both rule based and
    statistical techniques for this purpose.
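
A rough sketch of the rule-based side only: a character table plus the usual Indic-script rule that a consonant carries an inherent "a" unless a vowel sign or virama follows. The table covers just a handful of Telugu code points and the romanization choices are assumptions for this example; it is nowhere near a complete transliterator.

```python
# Toy table covering a few Telugu code points (illustrative, incomplete).
CONSONANTS = {
    "\u0C15": "k",  # TELUGU LETTER KA
    "\u0C28": "n",  # TELUGU LETTER NA
    "\u0C2E": "m",  # TELUGU LETTER MA
    "\u0C30": "r",  # TELUGU LETTER RA
    "\u0C32": "l",  # TELUGU LETTER LA
}
VOWEL_SIGNS = {
    "\u0C3E": "aa",  # TELUGU VOWEL SIGN AA
    "\u0C3F": "i",   # TELUGU VOWEL SIGN I
    "\u0C41": "u",   # TELUGU VOWEL SIGN U
}
VIRAMA = "\u0C4D"    # suppresses the inherent vowel

def transliterate(text):
    """Romanize text using the toy table; unknown characters pass through."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt in VOWEL_SIGNS:
                out.append(VOWEL_SIGNS[nxt])  # explicit vowel replaces "a"
                i += 1
            elif nxt == VIRAMA:
                i += 1                        # no vowel at all
            else:
                out.append("a")               # inherent vowel
        else:
            out.append(ch)
        i += 1
    return "".join(out)

if __name__ == "__main__":
    print(transliterate("\u0C30\u0C3E\u0C2E"))
```

A statistical approach would instead learn character or syllable alignments from name pairs; the table above only shows where hand-written rules would sit.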

28
Named Entity Recognition
  • A linguistics-based NER framework has been built
    and is available
  • CRF-based and Bayesian NER systems are being
    built
  • Some good work is being done on MWE
29
Format for Component Builders
  • Title: Cross Lingual Information Retrieval
  • Proposers:
    • Prof. Anupam Basu
    • Prof. Sudeshna Sarkar
    • Dr. Pabitra Mitra
  • Institution: IIT Kharagpur
  • Language: Bengali
  • Name of Components that will be implemented:
    • Bengali-English dictionary (stemmer/stop word
      list)
    • Bengali POS tagger (common with MT)
    • Bengali Local Word Grouper (chunker) (common
      with MT)
    • Named Entity Recognizer for Bengali
    • Bengali WSD
    • Bengali language modeling
  • Proposed Domain:
    • General domain Bengali News corpus. Other
      specific domains like medical and tourism.

30
IIT Bombay
  • Indexing, Multiway Lexicon and Ancillaries

31
IITB architecture for CLIR as given in the EoI
[Figure: flowchart of the IITB CLIR pipeline. Labels
recoverable from the slide: Query, aAQUA Threads HTML
Corpus, WSD, Query Expansion, Enconverter, UNL,
Stemmers, AgroExplorer, UNL Index, Index, Complete
UNL Match (Yes/No), Partial UNL Match (Yes/No), UW
Match (Yes/No), Retrieved UNL Documents, Lucene,
Deconverter, Search Results, Failsafe Search
Strategy.]

32
Indexing Experience
  • The scheme shown needs four levels of indexing:
    • Complete meaning expression
    • Partial meaning expression
    • Concepts (i.e., disambiguated words)
    • Ordinary key words
  • The system's performance crucially depends on the
    richness and exhaustiveness of indexing
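
The four levels suggest a cascade in which the most precise index is tried first and the search falls back level by level, in the spirit of the failsafe search strategy named on the architecture slide. The sketch below is an assumption about how such a cascade could be wired; the level names follow the slide, while the dictionary-backed toy indices are hypothetical stand-ins for real ones.

```python
def failsafe_search(query, levels):
    """Try each (name, search_fn) in order; return the first non-empty hit list."""
    for name, search_fn in levels:
        results = search_fn(query)
        if results:
            return name, results
    return None, []

def make_lookup(index):
    """Wrap a plain dict as a search function (toy stand-in for a real index)."""
    return lambda query: index.get(query, [])

if __name__ == "__main__":
    # Hypothetical toy indices, most precise first.
    levels = [
        ("complete meaning expression", make_lookup({})),
        ("partial meaning expression", make_lookup({})),
        ("concepts", make_lookup({"rice pest": ["doc7"]})),
        ("key words", make_lookup({"rice pest": ["doc7", "doc9"]})),
    ]
    print(failsafe_search("rice pest", levels))
```

Because every level has to be populated for the fallback to be useful, the cascade also shows why the richness of indexing matters so much.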

33
In the Consortium: Indexing on Words
  • Stemmed or root words (the former preferred in IR
    engines) of multiple languages
  • Keeping English at the center, words of multiple
    languages would link to one another
  • Disambiguation is necessary, but a light one,
    sacrificing precision
  • The challenge is to avoid requiring that the
    multilingual dictionary remain in memory

34
Indexing on Multi-words
  • Challenge: multiword detection
  • Multi-words are stored in the English-pivoted
    multi-way lexicons
  • Efficient storage is a concern
  • Multiword parts are indexed too
  • High-precision retrieval would demand higher
    ranks for multi-words than for their components
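
A small sketch of the last two points: both the multiword and its parts become index terms, and a match on the multiword counts for more than a match on a component. The two-word lexicon, the underscore joining convention and the boost value are assumptions chosen for illustration.

```python
def index_terms(tokens, multiwords):
    """Return single tokens plus any adjacent pair found in the lexicon."""
    terms = list(tokens)  # multiword parts are indexed too
    for first, second in zip(tokens, tokens[1:]):
        if (first, second) in multiwords:
            terms.append(f"{first}_{second}")
    return terms

def score(query_terms, doc_terms, multiword_boost=2.0):
    """Count matching terms, weighting multiword matches more heavily."""
    doc = set(doc_terms)
    return sum(
        multiword_boost if "_" in term else 1.0
        for term in query_terms
        if term in doc
    )

if __name__ == "__main__":
    lexicon = {("prime", "minister")}  # hypothetical lexicon entry
    query = index_terms(["prime", "minister"], lexicon)
    doc_a = index_terms(["prime", "minister", "visit"], lexicon)
    doc_b = index_terms(["minister", "of", "prime", "importance"], lexicon)
    print(score(query, doc_a), score(query, doc_b))
```

Only adjacent two-word expressions are detected here; longer or discontinuous multiwords would need a real detector, which is the challenge named above.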

35
CLIR with ILs too as Documents
  • Elaborate and sophisticated indexing needed for
    catering to multiple languages
  • Inverted indices of different languages should
    link to each other.
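
One possible reading of "linked" inverted indices, sketched with English as the pivot: every language-specific term is mapped to an English key, and postings are stored under that key. The terms come from the deck's own multilingual indexing illustration; the language codes attached to them and the class design are assumptions made for this example.

```python
from collections import defaultdict

# (language code, term) -> English pivot term. Language codes are assumed.
PIVOT = {
    ("mr", "shikshan"): "education",
    ("bn", "shikkhaa"): "education",
    ("hi", "shikshaa"): "education",
    ("en", "education"): "education",
}

class PivotIndex:
    """Inverted index keyed by English pivot terms."""

    def __init__(self, pivot):
        self.pivot = pivot
        self.postings = defaultdict(set)  # pivot term -> {(lang, doc_id)}

    def add(self, lang, doc_id, terms):
        for term in terms:
            key = self.pivot.get((lang, term))
            if key is not None:
                self.postings[key].add((lang, doc_id))

    def search(self, lang, term):
        """Return documents in any language that share the pivot term."""
        key = self.pivot.get((lang, term))
        if key is None:
            return []
        return sorted(self.postings.get(key, set()))

if __name__ == "__main__":
    index = PivotIndex(PIVOT)
    index.add("mr", "DOCm", ["shikshan"])
    index.add("bn", "DOCb", ["shikkhaa"])
    index.add("en", "DOCe", ["education"])
    index.add("hi", "DOCh", ["shikshaa"])
    print(index.search("bn", "shikkhaa"))
```

Terms missing from the pivot lexicon are silently skipped in this sketch; a real system would route them to transliteration or a monolingual index.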

36
A multilingual indexing scenario
[Figure: an inverted index with a common link joining
the terms shikshan, shikkhaa, education and shikshaa;
document labels DOCm, DOCb, DOCe and DOCh.]

37
Bengali-English Dictionary and Stemmer
  • Language: Bengali
  • Name of Component: Bengali-English dictionary and
    stemmer
  • Techniques Used:
    • Probabilistic dictionaries from parallel corpora
    • Structured query translation
    • Transliteration for handling OOV
    • Statistical stemmers
  • Performance of Techniques in other Languages:
    • Probabilistic dictionary for English-Hindi
      provides 43% average precision on a 41,000
      document collection (Doug Oard, TALIP 2003)
  • Estimate of expected Performance (PERT Chart)
  • Evaluation metrics:
    • Precision and recall of the cross-lingual
      engine relative to the monolingual engine
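
Structured query translation is commonly understood as treating all translations of one source term as a single synonym set, so that term frequency is summed over the set and document frequency counts documents containing any member. The sketch below follows that reading with a plain TF-IDF style weight; the exact weighting formula and the toy data are assumptions, not something the proposal specifies.

```python
import math

def structured_score(doc_tokens, translation_sets, collection):
    """Score a document against a query given as synonym sets of translations."""
    n_docs = len(collection)
    score = 0.0
    for synonyms in translation_sets:
        # Term frequency: occurrences of any translation in this document.
        tf = sum(doc_tokens.count(word) for word in synonyms)
        # Document frequency: documents containing any translation.
        df = sum(1 for doc in collection if any(word in doc for word in synonyms))
        if tf and df:
            score += tf * math.log((n_docs + 1) / df)
    return score

if __name__ == "__main__":
    collection = [
        ["education", "policy"],
        ["teaching", "staff"],
        ["weather", "report"],
    ]
    query = [{"education", "teaching"}]  # translations of one source term
    for doc in collection:
        print(doc, structured_score(doc, query, collection))
```

Grouping translations this way keeps a rare but wrong translation from receiving a high weight of its own.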

38
POS Tagger for Bengali (common with MT)
  • Language: Bengali
  • Name of Component: Bengali Part-of-Speech Tagger
  • Techniques Used:
    • Bi-gram Hidden Markov Model
    • Semi-supervised learning
    • Morphology-driven transformation-based learning
      for unknown word handling
  • Performance of Techniques in other Languages:
    • Bi-gram Hidden Markov Model: 97-98% for English
  • Estimate of expected Performance (PERT Chart)
  • Evaluation metrics:
    • Sentence/word level accuracy
    • Known/unknown word accuracy
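
For readers unfamiliar with bigram HMM tagging, the following is a compact Viterbi decoder in log space. The two-tag model and all probabilities are invented toy values, and unknown words get a small constant emission probability, which is a simplification of the morphology-driven handling proposed above. The sketch assumes every start and transition probability is non-zero.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, unk=1e-6):
    """Most likely tag sequence under a bigram HMM."""
    if not words:
        return []

    def emit(tag, word):
        # Unknown words receive a small constant probability.
        return math.log(emit_p[tag].get(word, unk))

    # best[i][tag] = (log score of best path ending in tag, previous tag)
    best = [{t: (math.log(start_p[t]) + emit(t, words[0]), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][t])) for p in tags),
                key=lambda item: item[1],
            )
            column[t] = (score + emit(t, word), prev)
        best.append(column)

    # Follow back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

if __name__ == "__main__":
    tags = ["N", "V"]
    start_p = {"N": 0.8, "V": 0.2}
    trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
    emit_p = {"N": {"dogs": 0.6, "bark": 0.1}, "V": {"dogs": 0.05, "bark": 0.7}}
    print(viterbi(["dogs", "bark"], tags, start_p, trans_p, emit_p))
```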

39
Named Entity Recognizer for Bengali
  • Language: Bengali
  • Name of Component: Bengali Named-entity
    recognizer
  • Techniques Used:
    • Maximum Entropy model
  • Performance of Techniques in other Languages:
    • Precision: 90-95% for English
  • Estimate of expected Performance (PERT Chart)
  • Evaluation metrics:
    • Precision and Recall
    • F-measure
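
The three metrics can be computed from the sets of predicted and gold entities, as in this short sketch. Representing an entity as a `(start, end, type)` tuple and requiring an exact match are assumptions made for the example; evaluation campaigns differ on partial-match credit.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 over two collections of entities."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

if __name__ == "__main__":
    gold = {(0, 2, "PER"), (3, 4, "ORG")}
    predicted = {(0, 2, "PER"), (5, 6, "LOC")}
    print(precision_recall_f1(predicted, gold))
```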

40
Bengali Local Word Grouper (common with MT)
  • Language: Bengali
  • Name of Component: Bengali Local Word Grouper
  • Techniques Used:
    • Feature structure unification using a greedy
      algorithm
    • Statistical chunking
    • MWE handling
  • Performance of Techniques in other Languages:
    • 20% improvement for English on a medical
      document collection
  • Estimate of expected Performance (PERT Chart)
  • Evaluation metrics:
    • F-score: harmonic mean of precision and recall

41
Bengali Language Modeling
  • Language: Bengali
  • Name of Component: Bengali Language Modeling
  • Techniques Used:
    • n-gram
    • Lemur will be ported to Bengali
    • MWE handling
  • Performance of Techniques in other Languages:
    • On par with the vector space model
  • Estimate of expected Performance (PERT Chart)
  • Evaluation metrics:
    • F-score: harmonic mean of precision and recall
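
As a rough picture of language-model retrieval, the sketch below ranks documents by unigram query likelihood with Jelinek-Mercer smoothing against the collection model. It is written from scratch for illustration and does not use or imitate the Lemur API; the smoothing weight `lam` and the toy collection are assumptions.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """log P(query | doc) under a smoothed unigram document model."""
    doc_tf = Counter(doc)
    coll_tf = Counter(word for d in collection for word in d)
    coll_len = sum(coll_tf.values())
    score = 0.0
    for word in query:
        p_doc = doc_tf[word] / len(doc) if doc else 0.0
        p_coll = coll_tf[word] / coll_len if coll_len else 0.0
        # Interpolate the document model with the collection model.
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # word unseen anywhere in the collection
        score += math.log(p)
    return score

if __name__ == "__main__":
    collection = [["a", "b"], ["b", "c"]]
    for doc in collection:
        print(doc, query_log_likelihood(["a"], doc, collection))
```

Higher-order n-gram models follow the same pattern with longer histories, at the cost of needing far more text to estimate reliably.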

42
Bengali Word Sense Disambiguator
  • Language: Bengali
  • Name of Component: Bengali Local Word Grouper
  • Techniques Used:
    • Feature structure unification using a greedy
      algorithm
    • Statistical chunking
    • MWE handling
  • Performance of Techniques in other Languages:
    • LWG accuracy: 90-95% for English
  • Estimate of expected Performance (PERT Chart)
  • Evaluation metrics:
    • F-score: harmonic mean of precision and recall