Multilingual Information Access in a Digital Library


1
Multilingual Information Access in a Digital Library
  • Vamshi Ambati, Rohini U, Pramod, N Balakrishnan and Raj Reddy
  • International Institute of Information Technology
  • Hyderabad, India

2
Context
  • Digital Library of India
    • 155,000 English books
    • 145,000 books in other languages
  • Population of literates
    • 20% of India understands English
    • 80% cannot

3
Multilingual Access to Information
  • Retrieve a book
    • By metadata
    • By keyword / content
    • Cross Lingual Information Retrieval
  • Read a book
    • Help understand sentences in a language
    • Help understand sentences across languages
    • Machine Translation

4
Approaches to Multilingual Access
  • Cross Lingual Retrieval
    • Translate Query to Document Language
    • Translate Document to Query Language
  • Machine Translation
    • Knowledge Based Approaches
    • Corpus Based Approaches
    • Hybrid Approaches
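
As an illustration of the "Translate Query to Document Language" option, here is a minimal sketch of dictionary-based query translation followed by a simple term-overlap search. The dictionary entries (romanized Hindi to English), the document collection, and the function names `translate_query` and `search` are all hypothetical and chosen only for this example; they do not describe the system presented in these slides.

```python
from typing import Dict, List

# Hypothetical toy bilingual dictionary (romanized Hindi -> English).
HI_EN: Dict[str, List[str]] = {
    "kitab": ["book"],
    "itihas": ["history"],
    "bharat": ["india"],
}

# Hypothetical document collection in the target (document) language.
DOCS: Dict[str, str] = {
    "b1": "A history of India",
    "b2": "Cooking book",
    "b3": "Modern physics",
}


def translate_query(query: str, dictionary: Dict[str, List[str]]) -> List[str]:
    """Replace each query word with its dictionary translations.

    Words missing from the dictionary are kept unchanged, so names and
    untranslatable terms can still match.
    """
    terms: List[str] = []
    for word in query.lower().split():
        terms.extend(dictionary.get(word, [word]))
    return terms


def search(terms: List[str], documents: Dict[str, str]) -> List[str]:
    """Return ids of documents that contain at least one term, best match first."""
    scores: Dict[str, int] = {}
    for doc_id, text in documents.items():
        tokens = set(text.lower().split())
        score = sum(1 for term in terms if term in tokens)
        if score:
            scores[doc_id] = score
    # Sort by descending score, then by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    translated = translate_query("bharat itihas kitab", HI_EN)
    print(translated)
    print(search(translated, DOCS))
```

Translating the query rather than the documents keeps the index monolingual and only requires a dictionary lookup at query time, at the cost of ambiguity when a query word has several translations.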

5
Challenges in Multilingual Access
  • Corpus Based Approaches
    • Unavailability of Parallel Corpus for pairs of languages
    • Unavailability of Computational Linguistics Resources
  • Dictionary Based Approaches
    • Unavailability of multiple bilingual dictionaries

6
Resources
  • Universal Dictionary
    • Conceived and implemented by Michael Shamos at CMU, USA
  • ITRANS
    • A transcription scheme and associated tool built by IISc, IIIT and CMU
  • Corpus
    • Data Entry by TTD and DLI project
    • TIDES project

7
Universal Dictionary

8
How are we doing it?
  • Cross Lingual Search (Identify Information)
    • Dictionary lookup
    • User feedback based
    • Lucene Search Engine
  • Machine Translation (Understand Information)
    • Corpus based technique (EBMT)
    • Dictionary-based word-to-word lookup
    • Good-enough translation vs perfect translation
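
The combination of a corpus-based technique with word-to-word dictionary lookup can be pictured with a heavily simplified sketch: match the longest known phrase from a small example base first, fall back to a word dictionary, and pass unknown words through in brackets. This is a toy in the spirit of EBMT, not the authors' system; the example base, the dictionary (romanized Hindi to English), and the function name `gloss` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical example base of aligned phrase pairs (romanized Hindi -> English).
EXAMPLES: Dict[Tuple[str, ...], str] = {
    ("namaste",): "hello",
    ("aap", "kaise", "hain"): "how are you",
}

# Hypothetical word-level dictionary used when no example phrase matches.
WORDS: Dict[str, str] = {"aap": "you", "meri": "my", "kitab": "book"}


def gloss(sentence: str) -> str:
    """Produce a rough, "good-enough" gloss of the sentence.

    Greedy strategy: at each position try the longest example phrase,
    then a single-word dictionary lookup, then pass the word through
    in brackets so the reader can see it was not translated.
    """
    tokens = sentence.lower().split()
    max_len = max(len(phrase) for phrase in EXAMPLES)
    out: List[str] = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in EXAMPLES:
                out.append(EXAMPLES[phrase])
                i += n
                break
        else:
            # No example phrase starts here: fall back to word lookup.
            word = tokens[i]
            out.append(WORDS.get(word, f"[{word}]"))
            i += 1
    return " ".join(out)


if __name__ == "__main__":
    print(gloss("namaste aap kaise hain"))
    print(gloss("meri kitab xyz"))
```

Marking untranslated words instead of dropping them fits the "good-enough" goal: the reader still gets the gist and can see where the resources ran out.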

9
Cross Lingual Retrieval

10
Cross Lingual Retrieval

11
Reading Assistant System

12
Reading Assistant

13
Status Today
  • CLIR for 6 languages
  • MT for 3 languages
  • Shakti (a knowledge based MT system)
  • Parallel Corpus for Hindi-English
  • UDICT
    • About 40 Foreign Languages
    • 6 Indian Languages

14
What more is needed?
  • UDICT
    • Improving coverage of existing languages
    • Adding new languages
  • Machine Translation
    • Corpus acquisition
    • State-of-the-art techniques applied to Indian languages
    • Multi-way parallel corpus development
  • Textual format for the books
    • Books are currently in image formats
    • OCR should be developed for textual content

15
Thank You
  • Questions?