Lao in Papillon - Current status

1
Lao in Papillon - Current status
  • Papillon 2003, Sapporo
  • July 3rd-5th, 2003

Vincent BERMENT, GETA-CLIPS, Joseph Fourier
University, Grenoble, France
Vincent.Berment_at_imag.fr
2
Laos and Lao Language
  • Lao PDR, South-East Asia
  • 5 million inhabitants
  • 2,500 Internet subscribers (via ISPs)
  • 10,000 computers
  • 30,000 telephone lines
  • Lao: a Tai-Kadai language
  • Lao has its own alphabet
  • Unicode range: 0E80-0EFF
  • 8-bit fonts: no standard
  • Unsegmented writing: ???????????????
  • Limited resources for the Lao language
  • Lao is not included in the main operating systems
  • Computerization and standardization are not done by
    Lao citizens
  • Technology transfer and standardization projects
    (UN, ...)

3
Text sample in Lao language
  • [Lao text sample: the Lao characters were lost in
    extraction and cannot be recovered]

4
Lao Dictionary Development
  • Lao dictionary building principle
  • French-speaking students learning Lao
  • Lao-French translation support tool
  • On-line service called LaoLex
  • Current translations: bi-texts (on the server)
  • Word-for-word translations
  • Missing word → add it to a personal dictionary
    (on the server)
  • Transfer from the personal dictionaries to the
    main dictionary is done by linguists

5
Principles of LaoLex
[Diagram, elements in processing order]
  • Lao text (text to translate)
  • Syllabic segmentation
  • Word-for-word translation (with word segmentation
    using longest matching), drawing on the main
    dictionary and the personal dictionaries
  • List of French words
  • Human
  • French translation
  • Transfer from the personal dictionaries to the
    main one is done by linguists
6
Some Algorithms in LaoLex (1): Segmentation and
Word-for-Word Translation
  • Syllabic segmentation of the Lao text to
    translate (automaton built from a set of
    derivation rules)
  • ??????????????? → ??-???-??-???-?-????
  • Compare groups of syllables with the words in the
    dictionary (longest matching algorithm, no
    backtracking)
  • ??-???-??-???-?-???? → ???????-???-?-????
  • If no word matches, the first syllable is taken
    as a word
  • Output the list of words
  • ??????? (bonjour) ??? (tout, tous) ? (?) ????
    (personne)
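
The longest-matching step above can be sketched as follows. This is a minimal illustration, not the LaoLex C code: the function name `longest_match`, the `max_len` limit, the toy dictionary and the pre-segmented syllables are all assumptions made for the example, and the syllabic segmentation itself (the automaton) is taken as already done.

```python
def longest_match(syllables, dictionary, max_len=4):
    """Group syllables into words by greedy longest matching (no backtracking).

    `syllables` is the output of a prior syllabic segmentation.
    `dictionary` is any container of known words.
    `max_len` is an assumed upper bound on syllables per word.
    """
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest group first, then shorter ones.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            if candidate in dictionary:
                break
        else:
            # No word matches: the first syllable is taken as a word.
            n = 1
            candidate = syllables[i]
        words.append(candidate)
        i += n
    return words


# Illustrative data only (not taken from LaoDict).
toy_dictionary = {"ສະບາຍດີ": "bonjour", "ທຸກ": "tout, tous", "ຄົນ": "personne"}
toy_syllables = ["ສະ", "ບາຍ", "ດີ", "ທຸກ", "ໆ", "ຄົນ"]

words = longest_match(toy_syllables, toy_dictionary)
glosses = [(w, toy_dictionary.get(w, "?")) for w in words]
```

Because the match is greedy and never backtracks, an early long match can prevent a better segmentation later in the text; the fallback to a single syllable keeps the process moving when the dictionary has a gap.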

7
Some Algorithms in LaoLex (2): Transformation
into a Canonical Form
  • Lao needs another process: transformation into a
    canonical form
  • Zero-width character repetition
  • Zero-width character inversion
  • ? ? ? ?
  • Applied before storing the entries of the dictionaries
  • Applied before comparing the groups of syllables with
    the dictionary words
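
A possible shape for such a canonical form is sketched below. The exact rules used in LaoLex are not given on the slide, so everything here is an assumption: the choice of which code points count as zero-width marks, the rule "drop repeated marks within a run", and the ordering "vowel signs before tone marks".

```python
# Assumed classification of Lao zero-width (combining) characters.
VOWEL_MARKS = set("\u0eb1\u0eb4\u0eb5\u0eb6\u0eb7\u0eb8\u0eb9\u0ebb")
TONE_MARKS = set("\u0ec8\u0ec9\u0eca\u0ecb")


def canonical_form(text):
    """Return text with each run of zero-width marks deduplicated and ordered."""
    out = []
    marks = []  # zero-width marks collected after the current base character

    def flush():
        unique = []
        for m in marks:
            if m not in unique:  # repetition: keep one copy of each mark
                unique.append(m)
        # inversion: vowel signs first, then tone marks (stable sort)
        unique.sort(key=lambda m: 0 if m in VOWEL_MARKS else 1)
        out.extend(unique)
        marks.clear()

    for ch in text:
        if ch in VOWEL_MARKS or ch in TONE_MARKS:
            marks.append(ch)
        else:
            flush()
            out.append(ch)
    flush()
    return "".join(out)
```

Applying the same function to dictionary entries when they are stored and to candidate syllable groups before lookup means that two visually identical strings compare equal even if they were typed differently.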

8
Some Algorithms in LaoLex (3): Two Input Methods
  • JavaScript
  • Pros: platform independent (web client), no
    download required
  • Cons: inserts characters at the end of the input
    field, JavaScript must be activated
  • LaoMonoKey, a tool dedicated to Windows platforms
    (a hook intercepting the Windows messages coming
    from the keyboard)
  • Pros: inserts characters at the cursor's location
  • Cons: download required, limited to Windows

9
Some Algorithms in LaoLex (4): Sorting the
Dictionary
  • Split the two words into syllables
  • Transform the character strings (the vocalic
    graphemes are not in the phonologic order)
  • Compare the words using their phonologic elements
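
The idea of comparing words in phonologic rather than written order can be sketched with a sort key. This is a deliberately simplified illustration and not the LaoLex algorithm: it skips the syllable split and assumes that only the five preposed vowels U+0EC0 to U+0EC4, which are written before the consonant they follow in speech, need to be moved. The name `phonologic_key` is hypothetical.

```python
# Lao vowels written before the consonant but pronounced after it.
PREPOSED_VOWELS = set("\u0ec0\u0ec1\u0ec2\u0ec3\u0ec4")


def phonologic_key(word):
    """Return a sort key in which each preposed vowel follows its consonant."""
    chars = list(word)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in PREPOSED_VOWELS:
            # Swap the vowel with the consonant that follows it.
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    return "".join(chars)


# Illustrative words: U+0EC4 U+0E81, U+0E81 U+0EB2, U+0E82 U+0EB2.
sample = ["\u0ec4\u0e81", "\u0e81\u0eb2", "\u0e82\u0eb2"]
by_code_point = sorted(sample)
by_phonologic_order = sorted(sample, key=phonologic_key)
```

With a plain code-point sort, the word starting with a preposed vowel lands after every consonant-initial word; with the key, it is filed under its initial consonant.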

[Figure: example Lao graphemes labelled C and V
(presumably consonant and vowel); the glyphs are not
recoverable]
10
Programming Languages used in LaoLex
  • Server side
  • PHP/HTML/SHTML
  • C CGIs (segmentation and sorting algorithms)
  • MySQL (lexical and miscellaneous information
    database)
  • Browser side
  • HTML
  • JavaScript (input of Lao text)
  • Text input in TextArea and Input fields (HTML)

11
LaoDict: A Dictionary Built Using LaoLex
(with translations in French)
12
57 Parts of Speech
  • Reinhorn's classification, based on the 7
    traditional categories (for Thai: NECTEC 47, KU
    61)
  • nouns (?????): 8 categories
  • pronouns (??????????): 11 categories
  • verbs (????????): 10 categories
  • predicatives (???????): 22 categories
  • prepositions (??????????): 1 category
  • conjunctions (????????): 2 categories
  • interjections (???????): 3 categories

13
11 Levels of Language
  • Derived from discussions inside INALCO's Lao
    department (foreign language university in
    Paris)
  • General use
  • Respectful
  • Colloquial
  • Slangy
  • Specialty
  • Refined
  • Monk
  • Royal
  • Literary
  • Spoken
  • Archaic

14
From LaoDict to Papillon
  • LaoDict main dictionary: 28 lexies
  • LaoDict personal dictionaries (total): 978 lexies
  • XML Lao add-on (schema redefinition) to
    http://www-clips.imag.fr/geta/services/dml/papillon.xsd
  • Parts of speech
  • Language level
  • Export from LaoDict to Papillon (not done yet)
  • XMLization
  • Unicodization

15
  • Complement

16
GMSLex
  • The method used to develop LaoDict can be
    extended to a number of languages, in particular
    those using unsegmented writing
  • About 20 weakly computerized Indic-derived
    scripts (except Thai) from languages of the
    Greater Mekong Subregion (from Michel Ferlus)
  • Khmer script family
  • Sub-families: Cham, Khmer, Lao-Thai (Lao, Thai),
    North-East Tai (Tai Dam, Tai Khao, Tai Deng, Tai
    Yo, Lai Pao)
  • Mon script family
  • Sub-families: Mon-Burmese (Burmese, Mon),
    North-West Tai (Dehong Dai, Khamti, Shan, Tai
    Mau), Tham (Tham Lao and Isan, Lü, Lanna, Khün)
  • These scripts share the same difficulties due to
    their common script history
  • No space between words
  • Order of the letters different from the
    phonologic order
  • Consequently, text segmentation and sorting
    are difficult

17
Some GMS Indic-derived Scripts
18
Sylla: A Tool for Designing Syllabic Models
  • Developed to generate segmentation automata for
    Indic-derived writing systems
  • First versions of syllabic models for the Khmer and
    Burmese scripts

19
  • Thank You
  • www.LaoSoftware.com
  • Sabaidi.imag.fr