IIT Bombay - PowerPoint PPT Presentation

About This Presentation
Title:

IIT Bombay

Description:

IIT Bombay. Indexing, Multiway Lexicon and Ancillaries: Horizontal Component ... 3-way lexicon for Marathi, Hindi and English in the agricultural domain ...

Slides: 12
Provided by: tdilM

Transcript and Presenter's Notes


1
IIT Bombay
  • Indexing, Multiway Lexicon and Ancillaries
    Horizontal Component
  • Marathi and Hindi Vertical Component

2
IITB architecture for CLIR as given in the EoI
[Architecture diagram: "Failsafe Search Strategy". Components shown: Query;
aAQUA Threads HTML Corpus; WSD; Query Expansion; Enconverter; UNL; Stemmers;
AgroExplorer; UNL Index; Index; Lucene; Retrieved UNL Documents; Deconverter;
Search Results. Decision nodes, each with Yes/No branches: Complete UNL
Match; Partial UNL Match; UW Match.]

3
Indexing Experience
  • The scheme shown needs four-level indexing:
      • Complete meaning expressions
      • Partial meaning expressions
      • Concepts (i.e., disambiguated words)
      • Ordinary keywords
  • The system's performance depends crucially on the
    richness and exhaustiveness of the indexing
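
The four index levels, read alongside the "Failsafe Search Strategy" label on the architecture slide, suggest a cascade that tries the most specific index first and falls back to the next level when nothing matches. The sketch below is only an illustration of that idea, not the AgroExplorer implementation. The index contents, the `SENSES` table, the universal-word string and all function names are hypothetical, and plain strings stand in for UNL expressions.

```python
from typing import Callable, Dict, List, Tuple

Index = Dict[str, List[str]]

# Toy stand-ins for the four index levels (all contents are made up).
COMPLETE: Index = {"rice needs water": ["doc1"]}
PARTIAL: Index = {"rice needs": ["doc1", "doc2"]}
CONCEPT: Index = {"rice(icl>cereal)": ["doc1", "doc2", "doc3"]}
KEYWORD: Index = {"water": ["doc1", "doc4"], "rice": ["doc1", "doc2", "doc3"]}

# Hypothetical word-to-concept table standing in for disambiguation.
SENSES: Dict[str, str] = {"rice": "rice(icl>cereal)"}


def collect(index: Index, keys: List[str]) -> List[str]:
    """Union of the postings for the given keys, in first-seen order."""
    hits: List[str] = []
    for key in keys:
        for doc in index.get(key, []):
            if doc not in hits:
                hits.append(doc)
    return hits


def pairs(words: List[str]) -> List[str]:
    """Adjacent word pairs, used here as toy 'partial expressions'."""
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


# Levels are ordered from most to least specific.
LEVELS: List[Tuple[str, Callable[[str], List[str]]]] = [
    ("complete", lambda q: collect(COMPLETE, [q])),
    ("partial", lambda q: collect(PARTIAL, pairs(q.split()))),
    ("concept", lambda q: collect(
        CONCEPT, [SENSES[w] for w in q.split() if w in SENSES])),
    ("keyword", lambda q: collect(KEYWORD, q.split())),
]


def failsafe_search(query: str) -> Tuple[str, List[str]]:
    """Return the first level that yields hits, together with those hits."""
    for name, search in LEVELS:
        hits = search(query)
        if hits:
            return name, hits
    return "none", []


print(failsafe_search("rice needs water"))
print(failsafe_search("water pump"))
```

Returning the level name with the hits lets a caller tell a precise match from a fallback, which matters if results from the lower levels are to be ranked or labelled differently.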

4
In the consortium: indexing on words
  • Stemmed or root words (the former preferred in IR
    engines) of multiple languages
  • With English at the center as the pivot, words of
    multiple languages link to one another
  • Disambiguation is necessary, but only a light one,
    at some cost in precision
  • Challenge: not requiring the multilingual
    dictionary to remain in memory
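
One way to keep an English-pivoted dictionary out of main memory is to store it in an on-disk table and query it per word. The sketch below assumes SQLite as the store, which the slides do not mention. The table layout, the function names and the three sample entries are hypothetical.

```python
import os
import sqlite3
import tempfile


def build_lexicon(path, entries):
    """Create an on-disk lexicon of (lang, word, English pivot) rows."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lexicon (lang TEXT, word TEXT, pivot TEXT)")
    # Indexes let lookups run against the file without loading the table.
    conn.execute("CREATE INDEX IF NOT EXISTS by_word ON lexicon (lang, word)")
    conn.execute("CREATE INDEX IF NOT EXISTS by_pivot ON lexicon (pivot)")
    conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?)", entries)
    conn.commit()
    return conn


def translations(conn, lang, word):
    """Words in other languages that share an English pivot with `word`."""
    return conn.execute(
        "SELECT DISTINCT other.lang, other.word "
        "FROM lexicon AS src JOIN lexicon AS other ON other.pivot = src.pivot "
        "WHERE src.lang = ? AND src.word = ? AND other.lang != ? "
        "ORDER BY other.lang, other.word",
        (lang, word, lang),
    ).fetchall()


# Illustrative entries only: (language code, word, English pivot).
ENTRIES = [
    ("en", "education", "education"),
    ("hi", "shikshaa", "education"),
    ("mr", "shikshan", "education"),
]

path = os.path.join(tempfile.mkdtemp(), "lexicon.db")
conn = build_lexicon(path, ENTRIES)
print(translations(conn, "mr", "shikshan"))
```

An ambiguous word would simply carry several rows with different pivots. Following every pivot without choosing among them is one form the "light" disambiguation could take, and it trades precision for simplicity.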

5
Indexing on Multi-words
  • Challenge: multiword detection
  • Multi-words are stored in the English-pivoted
    multi-way lexicons
  • Efficient storage is a concern
  • The parts of a multi-word are indexed too
  • High-precision retrieval demands that multi-word
    matches rank above matches on their components
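
The points above can be sketched as a greedy longest-match detector over a multiword lexicon, an indexer that emits both the multiword and its parts, and a scorer that weights multiword matches more heavily. This is a minimal illustration under stated assumptions: the lexicon entries, the `MULTIWORD_WEIGHT` value and the function names are all hypothetical, and the slides do not specify a detection or ranking method.

```python
from typing import List, Set, Tuple

# Hypothetical multiword lexicon, one tuple of words per entry.
MULTIWORDS: Set[Tuple[str, ...]] = {
    ("green", "manure"),
    ("drip", "irrigation"),
}
MAX_LEN = max(len(m) for m in MULTIWORDS)

# Assumed weight: one multiword match outweighs matches on its parts.
MULTIWORD_WEIGHT = 3.0


def index_terms(text: str) -> List[str]:
    """Emit each detected multiword (joined by '_') plus its parts,
    and every remaining word as an ordinary term."""
    words = text.lower().split()
    terms: List[str] = []
    i = 0
    while i < len(words):
        matched = False
        # Try the longest candidate first, down to length 2.
        for n in range(min(MAX_LEN, len(words) - i), 1, -1):
            candidate = tuple(words[i:i + n])
            if candidate in MULTIWORDS:
                terms.append("_".join(candidate))
                terms.extend(candidate)  # the parts are indexed too
                i += n
                matched = True
                break
        if not matched:
            terms.append(words[i])
            i += 1
    return terms


def score(query: str, document: str) -> float:
    """Sum of weights of the query terms that also occur in the document."""
    doc_terms = set(index_terms(document))
    total = 0.0
    for term in set(index_terms(query)):
        if term in doc_terms:
            total += MULTIWORD_WEIGHT if "_" in term else 1.0
    return total


print(index_terms("Drip irrigation saves water"))
print(score("green manure", "green manure improves soil"))
print(score("green manure", "green leaves and cow manure"))
```

A document that contains the multiword itself scores higher than one that merely contains both component words, which is the ranking behaviour the last bullet asks for.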

6
CLIR with documents in Indian languages (ILs) too
  • Elaborate and sophisticated indexing needed for
    catering to multiple languages
  • Inverted indices of different languages should
    link to each other.

7
A multilingual indexing scenario
[Diagram: inverted-index entries for the words shikshan, shikkhaa, education
and shikshaa are joined by a common link; the documents shown are DOCm, DOCb,
DOCe and DOCh.]
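
The scenario on this slide can be sketched as per-language inverted indices whose entries share a common link, so that a query word in one language reaches documents in all of them. The pairing of words with languages and documents below (Marathi shikshan with DOCm, Bengali shikkhaa with DOCb, English education with DOCe, Hindi shikshaa with DOCh) is an assumption read off the labels, and the class and method names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Entry = Tuple[str, str]  # (language code, word)


class MultilingualIndex:
    def __init__(self) -> None:
        # language -> word -> set of document ids
        self.postings: Dict[str, Dict[str, Set[str]]] = defaultdict(
            lambda: defaultdict(set))
        self.link_of: Dict[Entry, str] = {}              # entry -> link id
        self.members: Dict[str, Set[Entry]] = defaultdict(set)

    def add_document(self, lang: str, doc_id: str,
                     words: Iterable[str]) -> None:
        for word in words:
            self.postings[lang][word].add(doc_id)

    def link_words(self, link_id: str, entries: Iterable[Entry]) -> None:
        """Attach index entries of different languages to one common link."""
        for entry in entries:
            self.link_of[entry] = link_id
            self.members[link_id].add(entry)

    def search(self, lang: str, word: str) -> Dict[str, List[str]]:
        """Documents per language for the word and everything linked to it."""
        key = (lang, word)
        linked = self.members.get(self.link_of.get(key), {key})
        results: Dict[str, List[str]] = {}
        for other_lang, other_word in linked:
            docs = self.postings.get(other_lang, {}).get(other_word, set())
            if docs:
                results[other_lang] = sorted(docs)
        return results


index = MultilingualIndex()
index.add_document("mr", "DOCm", ["shikshan"])
index.add_document("bn", "DOCb", ["shikkhaa"])
index.add_document("en", "DOCe", ["education"])
index.add_document("hi", "DOCh", ["shikshaa"])
index.link_words("education", [
    ("mr", "shikshan"), ("bn", "shikkhaa"),
    ("en", "education"), ("hi", "shikshaa"),
])

print(index.search("hi", "shikshaa"))
```

Keeping the link table separate from the postings means each language's inverted index can be built and updated on its own, with only the link table spanning languages.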
8
Existing Capability
  • MCIT-funded Media Lab Asia project: Meaning-Based
    Multilingual Search Engine in the Agricultural
    Domain (AgroExplorer)
  • MCIT-funded project on Technology Development in
    Indian Languages (first phase completed)
  • TCS-funded project on Laboratory for Intelligent
    Internet Research
  • World Bank-funded project on Development Gateway
    Foundation: Language Technology part

9
Relevant Language Resources and Tools
(developed / under development)
  • Morph Analysers for Marathi and Hindi
  • 3-way lexicon for Marathi, Hindi and English in
    the agricultural domain
  • Wordnets for Hindi and Marathi

10
Existing Team
  • 2 PhD scholars
  • 4 M.Tech students
  • 6 B.Tech/B.E. students
  • 2 linguists
  • 4 lexicographers/lexiconists

11
Websites for Publications and Resources
  • www.mlasia.iitb.ac.in
  • www.cfilt.iitb.ac.in
  • www.cse.iitb.ac.in/laiir
  • www.cse.iitb.ac.in/pb

(please also see the publications list in the
Expression of Interest Document of 15th
February, 2006)