Localization and Language Technology Standards

1
Localization and Language Technology Standards
  • Kavi Narayana Murthy
  • University of Hyderabad
  • ELITEX - 2007
  • New Delhi, 10-11 January 2007

2
Outline
  • Character Encoding Standards
  • Fonts, Glyphs, Mapping Standards
  • OS/Browser Support, Drivers
  • Transliteration, Romanization
  • Translation, Linguistic Resources
  • Speech and OCR Technologies
  • Enforcement

3
Goals
  • Functionality
  • Whatever we can do with English, we must be able
    to do with our own languages and scripts with
    equal ease
  • Inter-operability, Platform Independence
  • All Applications must work seamlessly on all
    hardware and software platforms
  • Language and Script Independence
  • Multi-lingual, Multi-Script Support

4
Standards
  • Even a poor standard is better than no standard
  • Standards save us a lot in the long run
  • Commercial forces promoting non-standard,
    proprietary, secret systems must not be allowed
    to succeed
  • Let us not say "Let the Market Decide"!

5
Character Encoding Standards
  • ISCII and Unicode
  • ISCII is a BIS Standard, Unicode is not
  • Unicode is based on ISCII
  • In some sense, Unicode is a step in the backward
    direction
  • Let us understand ISCII first

6
Language and Script
  • Do not confuse one for the other
  • Many-to-Many
  • Script is neither language nor font
  • Script and SuperScript
  • Phonetic Basis
  • Common SuperScript for all ILs
  • Script Grammar

7
Language and Script
  • Sanskrit is written in Devanagari, Telugu,
    Kannada, Bangla etc. scripts
  • Devanagari is used for writing Sanskrit, Hindi,
    Marathi, etc.
  • English words are often written (transliterated)
    in local language scripts

8
Phonetic Basis
  • Words: Meanings, Sounds, Written Symbols
  • Meanings are supreme but difficult to quantify
    and encode
  • Sounds are the next best
  • A "ka" sound is a "ka" sound, whatever the
    language; hence universal
  • No need for spellings
  • What we write is what we speak, directly

9
Orthography
  • Written symbols correspond to phonemes, the basic
    sound units
  • Minor variations in sounds (allophones,
    co-articulation effects etc.) are not depicted in
    orthography
  • Example: the "t" in mountain, tea, truck, spilt, little
  • Special symbols are not to be confused with basic
    characters

10
What is a Character?
  • Indian Languages
  • No alphabet, no letters, no spellings
  • Phoneme-based
  • Units are syllable-like, called akshara-s
  • akshara-s very large in number
  • Corpus studies not sufficient
  • Made up of vowels, consonants etc.
  • Not all sequences valid

11
Script Grammar
  • A Grammar for Scripts
  • Allows all valid sequences, only valid sequences
  • No need to code all possible akshara-s
  • Script grammar must be part of the standard: ISCII
    includes one. Unicode?
  • Script Grammar to be enforced by s/w
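
As a rough illustration of a script grammar that accepts valid sequences and only valid sequences, the sketch below checks a single Devanagari akshara with a regular expression, which a finite state machine can recognize. The character classes use Unicode code points rather than ISCII codes, and the grammar is a simplified assumption made for illustration, not the one defined in the ISCII standard.

```python
import re

# Simplified Devanagari character classes (Unicode code points).
# Assumption for illustration only; not the grammar from the ISCII standard.
VOWEL = "[\u0905-\u0914]"       # independent vowels
CONSONANT = "[\u0915-\u0939]"   # consonants
MATRA = "[\u093E-\u094C]"       # dependent vowel signs
VIRAMA = "\u094D"               # halant: suppresses the inherent vowel
MODIFIER = "[\u0901-\u0903]"    # candrabindu, anusvara, visarga

# An akshara is either a vowel with an optional modifier, or a consonant
# cluster (consonants joined by virama) ending in a bare virama or in an
# optional vowel sign followed by an optional modifier.
AKSHARA = re.compile(
    f"(?:{VOWEL}{MODIFIER}?"
    f"|(?:{CONSONANT}{VIRAMA})*{CONSONANT}(?:{VIRAMA}|{MATRA}?{MODIFIER}?))"
)

def is_valid_akshara(text: str) -> bool:
    """Return True if the whole string is one akshara under the toy grammar."""
    return AKSHARA.fullmatch(text) is not None

print(is_valid_akshara("\u0915\u094D\u0937"))  # consonant + virama + consonant
print(is_valid_akshara("\u093F"))              # a vowel sign on its own
```

Because the pattern uses no backreferences or lookarounds, it stays within the regular languages, so an input driver could enforce the same rule with a small state table.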

12
SuperScript
  • ILs: 10 scripts with a nearly common sound system,
    all derived from the ancient braahmi script
  • => SuperScript
  • Super set of all phonemes
  • Common encoding: ISCII
  • Extendable to all languages of the world

13
ISCII (BIS 1991, IS 13194)
  • 128 codes are more than sufficient
  • Uses the upper half of the 8-bit code space; the ASCII
    half is untouched, which allows mixing with English
  • SuperScript: transliteration built-in
  • Long standing: ISCII 1988, 1991
  • Well thought and well designed
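
To illustrate why leaving the ASCII half untouched makes mixing with English easy, the sketch below splits an 8-bit byte stream into ASCII runs and upper-half runs using nothing but the high bit. The function name, the labels, and the sample bytes are hypothetical; no actual ISCII code assignments are assumed.

```python
from itertools import groupby

def split_runs(data: bytes):
    """Split a byte string into runs of ASCII (< 0x80) and upper-half (>= 0x80) bytes."""
    runs = []
    # Iterating over bytes yields integers, so the high bit decides the run type.
    for is_high, group in groupby(data, key=lambda b: b >= 0x80):
        label = "indic" if is_high else "ascii"
        runs.append((label, bytes(group)))
    return runs

# The two high bytes here are arbitrary placeholders, not specific ISCII characters.
print(split_runs(b"abc\xb3\xda xyz"))
```

No escape sequences or mode switches are needed: a single comparison per byte tells English text apart from Indian-language text.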

14
Why did ISCII fail to catch on?
  • Silent on Character-to-Font mapping
  • A complex many-to-many mapping
  • Fonts not standardized, fonts not available
  • Not registered, no OS/Browser Support
  • (BIS 1991 IS 13194)
  • Rationale not explained
  • Not publicized, not enforced

15
History
  • Proprietary, non-standard, secret font based
    encoding schemes
  • Promoted by commercial companies
  • Near Zero Inter-operability
  • Ad-hoc ISCII-to-font mapping schemes
  • Mapping schemes not made public
  • To be made Illegal and Punishable
  • Put India back by at least a decade!

16
Improving ISCII
  • Register - To get OS/Browser Support
  • Remove encoding of allophones, allographs
  • Script Grammar: FSM is enough, CFG is not needed
  • Include Rationale, explanatory notes
  • Remove Attribute/Extension codes
  • Standardize ISCII-to-Font Mapping Scheme
  • Promote, Enforce

17
Character-to-Font Mapping
  • Complex scripts are not linear
  • Glyphs: shape units convenient for rendering
  • Poor correspondence with sound units
  • Many-to-Many mappings
  • Glyph selection, scaling, positioning
  • No Glyph Encoding Standard

18
From Character to Font
  • Must be provably complete and 100% consistent
  • Current systems are all ad hoc, neither complete
    nor consistent
  • Finite State Transducers
  • Necessary and Sufficient
  • Without restricting Creativity and Flexibility
  • Simple, Efficient, Re-Usable
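
A very small sketch of the finite state transducer idea follows. It handles one well-known non-linear case in Devanagari: the short "i" vowel sign is stored after its consonant but drawn before it. Unicode character names stand in for glyph identifiers, which is an assumption made to avoid inventing a glyph encoding; a real mapping would need many more states (conjuncts, reph, and so on).

```python
import unicodedata

I_MATRA = "\u093F"  # DEVANAGARI VOWEL SIGN I, drawn to the left of its consonant

def is_consonant(ch: str) -> bool:
    return "\u0915" <= ch <= "\u0939"

def glyph(ch: str) -> str:
    # Stand-in glyph identifier: the Unicode character name.
    return unicodedata.name(ch, "UNKNOWN")

def to_glyphs(text: str):
    """Map a character sequence to a glyph sequence in visual order."""
    output = []
    pending = None  # transducer state: at most one buffered consonant
    for ch in text:
        if pending is not None:
            if ch == I_MATRA:
                # Emit the vowel sign first, then the buffered consonant.
                output.append(glyph(ch))
                output.append(glyph(pending))
                pending = None
                continue
            output.append(glyph(pending))
            pending = None
        if is_consonant(ch):
            pending = ch
        else:
            output.append(glyph(ch))
    if pending is not None:
        output.append(glyph(pending))
    return output

print(to_glyphs("\u0915\u093F"))  # KA followed by VOWEL SIGN I in storage order
```

The machine has two states (buffer empty, buffer holding a consonant), and every input symbol has a defined transition, which is the sense in which such a mapping can be checked for completeness.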

19
Encoding Standards: Unicode
  • For Language/Script/SuperScript?
  • CJK. Why not for ILs?
  • Script Grammar?
  • Character-to-Font
  • relegated to font level
  • font effects
  • ISCII-88 Based, Has Errors
  • Once added, cannot be deleted!

20
ISCII or Unicode?
  • Unicode
  • To be with the World, to know and be known
  • Correcting Mistakes, Improving Standards
  • Support (OS, Fonts, etc.), Education, Training
  • Converting Legacy Data: A Huge Task
  • ISCII-to-Unicode is not trivial
  • Ignore BIS Standard and embrace what is not yet
    standardized?
  • Why not co-exist? Internal and External Views

21
Keyboard Layouts, Drivers
  • Several de-facto standards and many variations in
    use
  • To select a few and standardize
  • So called Roman Phonetic Typing
  • ILs through English!
  • OK for oldies, not for future!
  • INSCRIPT: ISCII standard, good for newcomers
  • To strictly enforce Script Grammar

22
Document Encoding Standards
  • Plain Text: pure ISCII/Unicode
  • Mono-lingual Plain Text?
  • Annotated Text (Ex. Word Processors)
  • XML Style, Open, Readable formats to be
    encouraged
  • Proprietary, secret, non-standard encodings must
    be discouraged

23
Transliteration
  • Widely used, part of our Tradition
  • Sanskrit texts in local scripts
  • English, Hindi, Urdu words in local scripts
  • Music Compositions
  • Automatic in ISCII. Unicode?
  • Quality of transliteration
  • To and From English?
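
On the "Unicode?" question: the Unicode blocks for the major Indian scripts largely keep the parallel layout inherited from ISCII, so a first approximation of script-to-script transliteration is plain offset arithmetic. The sketch below assumes that parallelism holds for every character, which it does not: gaps, script-specific characters, and shared punctuation inside the Devanagari block are ignored here.

```python
# Start of each 128-code-point Unicode block for nine Indian scripts.
BLOCK_START = {
    "devanagari": 0x0900, "bengali": 0x0980, "gurmukhi": 0x0A00,
    "gujarati": 0x0A80, "oriya": 0x0B00, "tamil": 0x0B80,
    "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80

def transliterate(text: str, source: str, target: str) -> str:
    """Shift characters from the source script block into the target block."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            # Same position within the block, different script.
            out.append(chr(cp - src + dst))
        else:
            out.append(ch)  # leave everything else (ASCII, spaces) unchanged
    return "".join(out)

word = "\u0915\u092E\u0932"  # ka, ma, la in Devanagari
print(transliterate(word, "devanagari", "telugu"))
```

A production converter would add a per-script exception table on top of this, which is one place where the quality of transliteration is decided.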

24
Romanization
  • Need
  • Where there is no support for local languages
  • English dailies, posters, advertisements etc.
  • Lack of support OS/Browser/Fonts etc.
  • Where users prefer Roman
  • A variety of ad-hoc schemes in use
  • iTRANS, RTS, W-X, etc.
  • Standards badly wanted

25
Romanization
  • Multi-dimensional optimization problem
  • Case Mix-up
  • 26 Letters not sufficient
  • 52 nearly sufficient
  • Not always supported
  • Storage space, Ease of Typing, Aesthetics
  • Scientific/Logical Design/Naturalness
  • English-like for the oldies: a, ee, oo, a, oa
    ???
  • Futuristic: aa/ii/uu/ee/oo

26
Romanization
  • Clashes: au/au, kh/kh, s
  • Two-way conversion, cyclic check
  • Ex. Long Vowels
  • a: clashes with colon
  • diacritic: not supported
  • IPA: not understood, not supported
  • A: a single char. saves space, but ugly, difficult to
    type, case mix-up
  • aa: logical (like ee), easy to type

27
Romanization An Example
  • a aa i ii u uu R RR e ee ai o oo au M H
  • k kh g gh n
  • c ch j jh n
  • T TH D DH N
  • t th d dh n
  • p ph b bh m
  • y r l v s S s h L
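
The clashes mentioned earlier (au/au, kh/kh) can be made concrete with the symbols listed above. The sketch below enumerates every way a Roman string can be split into symbols of the scheme; more than one split means the string is ambiguous and would fail a cyclic check. The symbol inventory is taken from this slide as transcribed, with repeated entries collapsed into one, which is an assumption since the transcript appears to have lost some distinguishing marks.

```python
VOWELS = "a aa i ii u uu R RR e ee ai o oo au M H".split()
CONSONANTS = ("k kh g gh c ch j jh T TH D DH N "
              "t th d dh n p ph b bh m y r l v s S h L").split()
SYMBOLS = set(VOWELS) | set(CONSONANTS)
MAX_LEN = max(len(s) for s in SYMBOLS)

def segmentations(text: str):
    """Return every way to split text into a sequence of scheme symbols."""
    if not text:
        return [[]]
    results = []
    for size in range(1, min(MAX_LEN, len(text)) + 1):
        head = text[:size]
        if head in SYMBOLS:
            for rest in segmentations(text[size:]):
                results.append([head] + rest)
    return results

for sample in ("kh", "au", "ka"):
    print(sample, segmentations(sample))
```

Strings such as "kh" and "au" have two readings each (one symbol or two), so a scheme built from these symbols needs either an explicit separator or a longest-match convention to make two-way conversion reliable.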

28
Translation
  • Create Material Afresh
  • Translate by Hand
  • Automatic/Machine Translation
  • Machine Aided Translation
  • English to Local Language Translation
  • Local to Local Language Translation

29
Translation
  • Resource Intensive
  • Manpower, Time, Cost
  • Quality/Uniformity
  • Standards, Bench-Mark Data, Testing and
    Evaluation Procedures
  • Dictionaries, Terminology Databases
  • Pan-Indian Terms/Sanskritize/Localize

30
Linguistic Resources
  • Dictionaries: General, Domain Specific
  • Terminological Databases
  • Thesauri, WordNets, Ontologies
  • Morphological Analyzers, Generators
  • Spell/Grammar/Style Checkers
  • Annotated Text and Speech Corpora

31
India: Future is in Speech
  • One Billion People, A Sixth of the World
  • More than 150 Languages, 22 Recognized
  • 95% not comfortable with English
  • Computers, Current, Connectivity
  • Info Revolution benefits: majority deprived
  • 10 M Computers, 100 M Phones
  • Future is in Speech

32
Speech
  • Natural
  • Easy, Fast
  • Hands-Free
  • No need to Learn
  • Technology
  • Language
  • Available to all

33
Text and Speech
  • Speech is Natural
  • Reading/Writing is learnt, Artificial
  • Some never learn: illiterates
  • Oral Tradition
  • Speech is more permanent than Text!
  • I did not steal that ring of gold
  • Trust Yourself!

34
Speech Technologies
  • Speech Recognition: Speech to Text
  • Speech Synthesis: Text to Speech
  • Speaker Recognition, Verification, ID
  • Speech Coding/Decoding, Compression
  • Slow down, Speed up
  • Speech as Evidence

35
Applications
  • Telephone Dialing
  • Form Filling
  • Dictation Machine
  • Command and Control
  • Voice enabled Web
  • OCR -> WP -> TTS
  • MT, Cross-Lingual IR, S2S

36
OCR
  • OCR in Local Scripts Needed
  • To digitize and save legacy data
  • To compile/process/edit/refine data
  • For Printed Texts/Manuscripts
  • Old Data
  • deterioration of paper
  • old type fonts, problems of type-setting

37
Multi-Modal Interfaces
  • To Reach out to 1 Billion People, we must get the
    best of many worlds
  • Speech Recognition and Synthesis
  • Graphics and iconic Interfaces
  • OCR Technologies
  • Translation, CLIR
  • Camera, Gestures, Touch Screen

38
Balance
  • Between Backward Compatibility and Future-Proof
    Designs
  • Quick Fix Solutions and Long Haul
  • One Standard or Several?
  • Economics and Business Sense versus Social
    Responsibilities
  • Acceptance versus Enforcement

39
The 3 Most Important Things
  • 1. Develop/Refine/Update Standards
  • Detailed Documentation
  • Including rationale, issues, evaluation, etc.
  • 2. Education and Training
  • 3. Enforcement
  • Make the use of non-standard methods illegal and
    punishable under law
  • Technical Workshops for detailing

40
Thank You!
  • Visit
  • www.LanguageTechnologies.ac.in