Efforts in Language

Provided by: auk
Learn more at: http://www.au-kbc.org


1
  • Efforts in Language & Speech Technology
  • Natural Language Processing Lab
  • Centre for Development of Advanced Computing
  • (Ministry of Communications & Information
    Technology)
  • Anusandhan Bhawan,
  • C 56/1 Sector 62, Noida 201 307, India
  • karunesharora_at_cdacnoida.com

2
Translation Support System
  • Technology: Angla Bharati (rule-based), developed
    by IIT Kanpur. System developed jointly by
    IIT Kanpur and CDAC Noida
  • Operating system support: Linux / Windows
  • Performance: 85% correct parsing, 60% correct
    translation
  • Embedded text editor, pre-processor and
    post-editor
  • Lexicon: 25,000 root words

3
Translation Support System (English to Hindi)
Diagram components: Pattern Directed Parsing, Morphological Analyzer, English Sentence, Lexical Dictionary, Corpus, Rule Base, Pseudo Language Output, Hindi Text Generator, Post Editor
4
(No Transcript)
5
(No Transcript)
6
Test suite for Translation Support Systems
7
  • Knowledge Management
  • Parallel Corpus Tools

8
Gyan Nidhi Parallel Corpus
  • GyanNidhi, which stands for "Knowledge
    Resource", is a parallel corpus in 12 Indian
    languages, a project sponsored by TDIL, DIT,
    MC&IT, Govt. of India

9
Gyan Nidhi Multi-Lingual Aligned Parallel Corpus
  • What is it? A multilingual parallel text corpus
    contains the same text translated into more than
    one language.
  • What does Gyan Nidhi contain? The GyanNidhi
    corpus consists of text in English and 11 Indian
    languages (Hindi, Punjabi, Marathi, Bengali,
    Oriya, Gujarati, Telugu, Tamil, Kannada,
    Malayalam, Assamese). It aims to digitize 1
    million pages altogether, with at least
    50,000 pages in each Indian language and English.

Source for Parallel Corpus
  • National Book Trust India
  • Sahitya Akademi
  • Navjivan Publishing House
  • Publications Division
  • SABDA, Pondicherry

10
GyanNidhi Block Diagram
11
Gyan Nidhi Multi-Lingual Aligned Parallel Corpus
  • Platform: Windows
  • Data encoding: XML, Unicode
  • Portability of data: data in XML format
    supports various platforms

Applications of GyanNidhi
  • Automatic dictionary extraction
  • Creation of translation memory
  • Example Based Machine Translation (EBMT)
  • Language research, study and analysis
  • Language modeling
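To illustrate how paragraph-aligned text stored as Unicode XML could feed an application such as a translation memory, here is a minimal sketch. The element and attribute names (`corpus`, `unit`, `para`, `lang`) and the sample sentences are assumptions made for this example; the slides do not show GyanNidhi's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema (not GyanNidhi's real one): one <unit> per aligned
# paragraph, holding one <para> per language. The sample text is made up.
SAMPLE_XML = """<corpus>
  <unit id="1">
    <para lang="en">Knowledge is wealth.</para>
    <para lang="hi">ज्ञान ही धन है।</para>
  </unit>
  <unit id="2">
    <para lang="en">The book is on the table.</para>
    <para lang="hi">किताब मेज़ पर है।</para>
  </unit>
</corpus>"""


def load_aligned_units(xml_text):
    """Return one {language code: paragraph text} dict per aligned unit."""
    root = ET.fromstring(xml_text)
    units = []
    for unit in root.findall("unit"):
        units.append(
            {p.get("lang"): (p.text or "").strip() for p in unit.findall("para")}
        )
    return units


def translation_pairs(units, source, target):
    """Yield (source, target) paragraph pairs for units that have both languages."""
    for unit in units:
        if source in unit and target in unit:
            yield unit[source], unit[target]


units = load_aligned_units(SAMPLE_XML)
pairs = list(translation_pairs(units, "en", "hi"))
print(pairs[0])
```

Because the data is plain Unicode XML, the same file can be read by any standard XML parser on any platform, which is the portability point made on the slide.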
12
Tools: Prabandhika Corpus Manager
  • Categorisation of corpus data in various
    user-defined domains
  • Addition/Deletion/Modification of any Indian
    Language data files in HTML / RTF / TXT / XML
    format.
  • Selection of languages for viewing parallel
    corpus with data aligned up to paragraph level
  • Automatic selection and viewing of parallel
    paragraphs in multiple languages
  • Abstract and Metadata
  • Printing and saving parallel data in Unicode
    format

13
Sample Screen Shot Prabandhika
14
Tools: Vishleshika Statistical Text Analyzer
  • Vishleshika is a tool for statistical text
    analysis of Hindi, extensible to text in other
    Indian languages
  • It examines input text and generates various
    statistics, e.g.
  • Sentence statistics
  • Word statistics
  • Character statistics
  • The Text Analyzer presents its analysis in
    textual as well as graphical form.
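The kinds of statistics listed above can be sketched in a few lines. This is only an illustration, not Vishleshika's implementation: the function name `text_statistics`, the sentence-boundary rule, and the whitespace tokenization are all assumptions, and the sample sentence is made up.

```python
import re
from collections import Counter

# Assumption: a sentence ends at the Devanagari danda (।), '?', '!' or '.'.
SENTENCE_END = re.compile(r"[।?!.]+")


def text_statistics(text):
    """Compute simple sentence, word and character statistics for a text."""
    sentences = [s.strip() for s in SENTENCE_END.split(text) if s.strip()]
    # Assumption: words are whitespace-separated; other punctuation is not stripped.
    words = [w for s in sentences for w in s.split()]
    chars = [c for w in words for c in w]  # counts Unicode code points
    return {
        "sentence_count": len(sentences),
        "word_count": len(words),
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "word_frequencies": Counter(words),
        "char_frequencies": Counter(chars),
    }


sample = "राम घर गया। सीता घर आई।"
stats = text_statistics(sample)
print(stats["sentence_count"], stats["word_count"])
print(stats["word_frequencies"].most_common(3))
```

The frequency counters returned here could be handed to any plotting library to produce the graphical view the slide mentions.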

15
Sample output: character statistics
The graph shows that the distribution is almost
equal in Hindi and Nepali in the sample text.
Graph panels: Most frequent consonants in Hindi; Most
frequent consonants in Nepali
The results also show that these six consonants
constitute more than 50% of consonant usage.
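A figure like the share of the six most frequent consonants can be computed as follows. This is a hedged sketch, not the tool's actual method: `top_consonant_share` is a hypothetical helper, and it counts only the basic Devanagari consonant letters (U+0915 to U+0939), ignoring the precomposed nukta forms. Since Hindi and Nepali both use Devanagari, the same function applies to either language.

```python
from collections import Counter


def is_devanagari_consonant(ch):
    """True for the basic Devanagari consonants, U+0915 (क) through U+0939 (ह)."""
    return "\u0915" <= ch <= "\u0939"


def top_consonant_share(text, top_n=6):
    """Return the top_n most frequent consonants and their combined share.

    The share is the fraction of all consonant occurrences in the text
    that belong to the top_n consonants.
    """
    counts = Counter(ch for ch in text if is_devanagari_consonant(ch))
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    top = counts.most_common(top_n)
    share = sum(n for _, n in top) / total
    return top, share


# Made-up sample text; real results need a representative corpus.
top, share = top_consonant_share("राम घर गया। सीता घर आई।")
print(top)
print(f"{share:.0%} of consonant usage")
```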
16
Vishleshika: word and sentence statistics
17
  • Speech Technology and tools

18
Annotated Speech Corpora for Hindi, Punjabi and
Marathi languages
Vishleshika Statistical Analysis Tool
Gyan Nidhi Corpus
Phonetically Rich sentence set
Manual Verification and Editing
Studio Recording by Professionals
Segmentation and labeling using Praat / Emulabel
XML Meta Data Creation
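The slide does not say how the phonetically rich sentence set is chosen from the corpus. One common approach is greedy coverage: repeatedly pick the sentence that adds the most units not yet covered. The sketch below assumes that approach and, as a further simplification, approximates phonetic units by distinct characters; the function names are hypothetical.

```python
def units_of(sentence):
    """Approximate a sentence's phonetic units by its distinct non-space characters.

    Assumption: a real system would use phonemes, diphones or syllables instead.
    """
    return {ch for ch in sentence if not ch.isspace()}


def select_rich_sentences(sentences, max_sentences):
    """Greedily choose sentences that each add the most uncovered units."""
    covered = set()
    chosen = []
    remaining = list(sentences)
    while remaining and len(chosen) < max_sentences:
        # Pick the sentence contributing the most new units (first wins on ties).
        best = max(remaining, key=lambda s: len(units_of(s) - covered))
        gain = units_of(best) - covered
        if not gain:
            break  # no remaining sentence adds anything new
        chosen.append(best)
        covered |= gain
        remaining.remove(best)
    return chosen, covered


candidates = ["ab", "abc", "cd", "a"]
chosen, covered = select_rich_sentences(candidates, max_sentences=3)
print(chosen, sorted(covered))
```

A selection like this would still go through the manual verification and editing step listed on the slide before recording.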
19
(No Transcript)
20
Modules under TTS
21
Other Areas of expertise
  • OCR for Devanagari script
  • Digital Library for Indian languages
  • Word Processing tools like Spell Checker,
    Transliteration, Terminology Development,
    Document analysis, Font converters
  • Indian Language eContent Creation

22
Areas for future work
Machine Translation
  • Standardization: Lexware database design
  • Working on the global approach BhashaSetu,
    which is an amalgamation of different approaches
    that takes the best of each
  • Development of a translation system test bed
Knowledge Management
  • Automatic text summarization tool for Hindi and
    other Indian languages
  • Standardization of a part-of-speech tagset for
    Hindi, extensible to other Indian languages
  • Part-of-speech tagger development for Indian
    languages
  • Automated terminology development tools
  • Sentence alignment tool for Indian languages
  • Development of a manually tagged parallel corpus
    up to word level
Speech Technology
  • Speech-to-speech translation system
  • Development of semi-automated speech annotation
    tools

23
Thank You