Title: Lao in Papillon - Current status
1. Lao in Papillon - Current status
- Papillon 2003, Sapporo
- July 3rd-5th, 2003
Vincent BERMENT, GETA-CLIPS, Joseph Fourier University, Grenoble, France
Vincent.Berment_at_imag.fr
2. Laos and Lao Language
- Lao PDR, South-East Asia
- 5 million inhabitants
- 2,500 Internet subscribers (via ISPs)
- 10,000 computers
- 30,000 telephone lines
- Lao: a Tai-Kadai language
- Lao has its own alphabet
- Unicode range: U+0E80-U+0EFF
- 8-bit fonts: no standard
- Unsegmented writing: ສະບາຍດີທຸກໆທ່ານ
- Limited resources for Lao language
- Lao not included in the main OSs
- Computerization and standardization not done by Lao citizens
- Technology transfer and standardization projects (UN, ...)
3. Text sample in Lao language
- [Lao text sample: the Lao characters are not recoverable from this copy]
4. Lao Dictionary Development
- Lao dictionary building principle
- French-speaking students learning Lao
- Lao-French translation support tool
- On-line service called LaoLex
- Current translations: bi-texts (on the server)
- Word-for-word translations
- Missing word -> add it to a personal dictionary (on the server)
- Personal-to-main dictionary transfer done by linguists
5. Principles of LaoLex
- Dictionaries: Main Dictionary, Personal Dictionaries
- Text to translate: Lao text
- Syllabic segmentation
- Word-for-word translation (with word segmentation using longest matching)
- Output: list of French words
- French translation: done by a human
- Transfer from the personal dictionaries to the main one: done by linguists
6. Some Algorithms in LaoLex (1): Segmentation and Word-for-Word Translation
- Syllabic segmentation of the Lao text to translate (automaton built from a set of derivation rules)
- ສະບາຍດີທຸກໆທ່ານ -> ສະ-ບາຍ-ດີ-ທຸກ-ໆ-ທ່ານ
- Compare groups of syllables with the words in the dictionary (longest matching algorithm, no backtracking)
- ສະ-ບາຍ-ດີ-ທຸກ-ໆ-ທ່ານ -> ສະບາຍດີ-ທຸກ-ໆ-ທ່ານ
- If no word matches, the first syllable is taken as a word
- Output the list of words
- ສະບາຍດີ (bonjour) ທຸກ (tout, tous) ໆ (?) ທ່ານ (personne)
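The longest-matching step described on this slide can be sketched as follows. This is a minimal illustration, not the LaoLex implementation (which the slides say is written in C): the function name `longest_match`, the `max_len` limit, and the three-entry dictionary are assumptions, and the Lao strings are an assumed sample (the greeting "hello everyone") chosen to fit the French glosses above.

```python
def longest_match(syllables, dictionary, max_len=4):
    """Group syllables into words by greedy longest matching, without backtracking."""
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest group of syllables first, then shorter ones.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            # n == 1 is the fallback: if no word matches,
            # the first syllable is taken as a word.
            if candidate in dictionary or n == 1:
                words.append(candidate)
                i += n
                break
    return words


# Hypothetical miniature dictionary (Lao word -> French gloss).
dictionary = {"ສະບາຍດີ": "bonjour", "ທຸກ": "tout, tous", "ທ່ານ": "personne"}
# Output of the syllabic segmentation step, assumed to be given.
syllables = ["ສະ", "ບາຍ", "ດີ", "ທຸກ", "ໆ", "ທ່ານ"]

words = longest_match(syllables, dictionary)
glosses = [(w, dictionary.get(w, "?")) for w in words]
```

Because the search never backtracks, an early long match can block a better split later in the text; that is the trade-off of this approach, and the reason unknown words are routed to a personal dictionary.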
7. Some Algorithms in LaoLex (2): Transformation into a Canonical Form
- Lao needs another process: a transformation into a canonical form
- Zero-width character repetition
- Zero-width character inversion
- Applied before storing the dictionary entries
- Applied before comparing the groups of syllables with the dictionary words
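One way to picture the canonical form is the sketch below. It is an assumption-laden illustration rather than the LaoLex code: it treats every Unicode nonspacing mark (category `Mn`) as a "zero-width character", removes repeats within a run of marks, and orders tone marks (U+0EC8 to U+0ECB) after the other marks. The function name `canonical_form` and that ordering rule are hypothetical.

```python
import unicodedata

# Assumption: tone marks are placed after any other zero-width mark.
TONE_MARKS = {"\u0ec8", "\u0ec9", "\u0eca", "\u0ecb"}


def canonical_form(text):
    """Drop repeated zero-width marks and put each run of marks in a fixed order."""
    out, marks = [], []

    def flush():
        unique = list(dict.fromkeys(marks))         # remove repeats, keep first-seen order
        unique.sort(key=lambda m: m in TONE_MARKS)  # stable sort: tone marks go last
        out.extend(unique)
        marks.clear()

    for ch in text:
        if unicodedata.category(ch) == "Mn":  # nonspacing (zero-width) mark
            marks.append(ch)
        else:
            flush()
            out.append(ch)
    flush()
    return "".join(out)


# A doubled tone mark (repetition) collapses to a single one.
fixed_repetition = canonical_form("\u0e97\u0ec8\u0ec8\u0eb2\u0e99")
# Tone mark typed before the vowel sign (inversion) is reordered.
fixed_inversion = canonical_form("\u0e81\u0ec9\u0eb4")
```

Applying the same function to dictionary entries and to the text being translated makes strings that look identical on screen also compare as equal.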
8. Some Algorithms in LaoLex (3): Two Input Methods
- JavaScript
- Pros
- Platform independent (web client)
- No download required
- Cons
- Inserts characters at the end of the input field
- JavaScript must be enabled
- LaoMonoKey, a tool dedicated to Windows platforms
- Hook intercepting the Windows messages coming from the keyboard
- Pros
- Inserts characters at the cursor's location
- Cons
- Download required
- Limited to Windows
9. Some Algorithms in LaoLex (4): Sorting the Dictionary
- Split the two words into syllables
- Transform the character strings (the vocalic graphemes are not in phonological order)
- Compare the words using their phonological elements
- [Figure: example word in an 8-bit Lao font, with its graphemes labelled C (consonant) and V (vowel)]
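The sorting idea can be sketched as below. This is a simplified illustration under stated assumptions, not the LaoLex algorithm: it handles only the five vowels written before their consonant (U+0EC0 to U+0EC4), moves each one after the first consonant of its syllable, and then compares by code point. The name `phonological_key` is hypothetical, and words are assumed to be already split into syllables.

```python
# Lao vowels written before the consonant they follow in pronunciation.
PREPOSED_VOWELS = set("\u0ec0\u0ec1\u0ec2\u0ec3\u0ec4")


def phonological_key(syllables):
    """Build a sort key in which each preposed vowel follows the initial consonant."""
    key = []
    for syl in syllables:
        if len(syl) > 1 and syl[0] in PREPOSED_VOWELS:
            syl = syl[1] + syl[0] + syl[2:]
        key.append(syl)
    return key


# Each word is a list of syllables (one syllable per word here).
words = [
    ["\u0ec4\u0e81"],  # vowel sign AI written before consonant KO
    ["\u0e82\u0eb2"],  # consonant KHO SUNG followed by vowel AA
    ["\u0e81\u0eb2"],  # consonant KO followed by vowel AA
]

by_code_point = sorted(words)                      # KO+AA, KHO+AA, AI+KO
by_phonology = sorted(words, key=phonological_key)  # KO+AA, AI+KO, KHO+AA
```

With the key, both words beginning with the consonant KO sort together, whereas a plain code-point sort pushes the word written with a leading vowel to the end.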
10. Programming Languages Used in LaoLex
- Server side
- PHP/HTML/SHTML
- C CGIs (segmentation and sorting algorithms)
- MySQL (lexical and miscellaneous information database)
- Browser side
- HTML
- JavaScript (input of Lao text)
- Text input in TextArea and Input fields (HTML)
11. LaoDict: A Dictionary Built Using LaoLex
- With translations in French
12. 57 Parts of Speech
- Reinhorn's classification, based on the 7 traditional categories (for Thai: NECTEC 47, KU 61)
- nouns (ຄຳນາມ): 8 categories
- pronouns (ຄຳສັບພະນາມ): 11 categories
- verbs (ຄຳກິລິຍາ): 10 categories
- predicatives (ຄຳວິເສດ): 22 categories
- prepositions (ຄຳບຸບພະບົດ): 1 category
- conjunctions (ຄຳສັນທານ): 2 categories
- interjections (ຄຳອຸທານ): 3 categories
13. 11 Levels of Language
- Derived from discussions within INALCO's Lao department (foreign language university in Paris)
- General use
- Respectful
- Colloquial
- Slangy
- Specialty
- Refined
- Monk
- Royal
- Literary
- Spoken
- Archaic
14. From LaoDict to Papillon
- LaoDict main dictionary: 28 lexies
- LaoDict personal dictionaries (total): 978 lexies
- XML Lao add-on (schema redefinition) to http://www-clips.imag.fr/geta/services/dml/papillon.xsd
- Parts of speech
- Language level
- Export from LaoDict to Papillon (not done yet)
- XMLization
- Unicodization
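As a rough picture of what the export step might involve, the sketch below serializes one dictionary entry to UTF-8 XML with Python's standard library. Everything about the structure is an assumption: the element names (`entry`, `headword`, `pos`, `level`, `translation`) and the `lang` attribute are placeholders and are not taken from papillon.xsd or from the Lao add-on, and the sample values are illustrative only.

```python
import xml.etree.ElementTree as ET


def entry_to_xml(headword, pos, level, french):
    """Serialize one entry to UTF-8 XML (hypothetical element names)."""
    entry = ET.Element("entry")
    ET.SubElement(entry, "headword").text = headword  # Lao headword in Unicode
    ET.SubElement(entry, "pos").text = pos            # part of speech
    ET.SubElement(entry, "level").text = level        # level of language
    ET.SubElement(entry, "translation", {"lang": "fra"}).text = french
    return ET.tostring(entry, encoding="utf-8")


# Illustrative values; the part of speech chosen here is an assumption.
xml_bytes = entry_to_xml("ສະບາຍດີ", "interjection", "general use", "bonjour")
```

A real export would have to follow the element names and constraints of the Papillon schema and its Lao add-on, and would first convert any text stored in an 8-bit font encoding to Unicode.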
16. GMSLex
- The method used to develop LaoDict can be extended to a number of languages, in particular those using unsegmented writing systems
- About 20 weakly computerized Indic-derived scripts (except Thai) from languages of the Greater Mekong Subregion (from Michel Ferlus)
- Khmer script family
- sub-families: Cham, Khmer, Lao-Thai (Lao, Thai), North-East Tai (Tai Dam, Tai Khao, Tai Deng, Tai Yo, Lai Pao), ...
- Mon script family
- sub-families: Mon-Burmese (Burmese, Mon), North-West Tai (Dehong Dai, Khamti, Shan, Tai Mau), Tham (Tham Lao and Isan, Lü, Lanna, Khün)
- They share the same difficulties due to their scripts' common history
- No space between words
- Order of the letters different from the phonological order
- Consequently, text segmentation and sorting difficulties
17. Some GMS Indic-derived Scripts
18. Sylla: A Tool for Designing Syllabic Models
- Developed to generate segmentation automata for Indic-derived scripts
- First versions of syllabic models for the Khmer and Burmese scripts
19. Thank You
- www.LaoSoftware.com
- Sabaidi.imag.fr