Title: Human Language Technology in Thailand
1. Human Language Technology in Thailand
- Virach Sornlertlamvanich
- Information R&D Division
- National Electronics and Computer Technology Center
- virach_at_nectec.or.th
- 22 March 2002
- IEEE Colloquium on Signal Processing
2. Knowledge - Information - Data
- Knowledge: Ability in understanding, reasoning and problem solving
- Information: Entity to inform the others
- Data: Entity observed by intelligent agents
3. Artificial Intelligent Agent
4. Research Theme
- Human Language Technology
- Text
- Speech
- Multimedia and Multimodality
- Intelligent Content
- Knowledge Discovery
- Datamining
- Visualization
- Natural Interaction
5. Text
ADLTSUGINFORMATIONBWGZKTILA [Thai text]
6. Term Candidate Extraction
- Virach Sornlertlamvanich et al. (COLING 2000)
- Automatic Corpus-Based Thai Word Extraction with the C4.5 Learning Algorithm
- C4.5-trained decision tree for determining potential word boundaries from MI, entropy and linguistic information
- Capable of discovering new words in a document without assistance from a static dictionary
7. Mutual Information
\[ Lm(xyz) = \frac{p(xyz)}{p(x)\,p(yz)} \qquad Rm(xyz) = \frac{p(xyz)}{p(xy)\,p(z)} \]
where
- x is the leftmost character of string xyz
- y is the middle substring of xyz
- z is the rightmost character of string xyz
- p( ) is the probability function.
High mutual information implies that xyz co-occurs more than expected by chance. If xyz is a word then its Lm and Rm must be high.
Example: Efunction vs ...Function...
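A minimal sketch of how the two scores could be estimated from raw text, assuming the left and right mutual information are Lm(xyz) = p(xyz) / (p(x) p(yz)) and Rm(xyz) = p(xyz) / (p(xy) p(z)), and that p( ) is a relative substring frequency. The function names and the toy English corpus are illustrative assumptions, not taken from the slides.

```python
from collections import Counter


def substring_counts(corpus, max_len):
    """Count every substring of length 1..max_len in the corpus."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(corpus) - n + 1):
            counts[corpus[i:i + n]] += 1
    return counts


def prob(s, counts, corpus_len):
    """Relative frequency of s among all substrings of the same length."""
    positions = corpus_len - len(s) + 1
    return counts[s] / positions if positions > 0 else 0.0


def left_right_mi(s, counts, corpus_len):
    """Return (Lm, Rm) for a candidate string s = xyz with len(s) >= 2."""
    p_s = prob(s, counts, corpus_len)
    # Left MI: how strongly the first character x binds to the rest yz.
    left_denominator = prob(s[0], counts, corpus_len) * prob(s[1:], counts, corpus_len)
    # Right MI: how strongly the prefix xy binds to the last character z.
    right_denominator = prob(s[:-1], counts, corpus_len) * prob(s[-1], counts, corpus_len)
    lm = p_s / left_denominator if left_denominator > 0 else 0.0
    rm = p_s / right_denominator if right_denominator > 0 else 0.0
    return lm, rm


# Toy corpus without spaces, standing in for unsegmented text.
corpus = "afunctionisafunctionofafunction"
counts = substring_counts(corpus, max_len=8)
lm, rm = left_right_mi("function", counts, len(corpus))
print(f"Lm = {lm:.2f}, Rm = {rm:.2f}")
```

Both scores come out above 1 for "function" in this toy corpus, which is the "co-occurs more than expected by chance" condition; strings never seen in the corpus score 0 in this sketch.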
8. Entropy
\[ Le(y) = -\sum_{x \in A} p(xy \mid y)\,\log p(xy \mid y) \qquad Re(y) = -\sum_{z \in A} p(yz \mid y)\,\log p(yz \mid y) \]
where
- A is the set of characters
- x is the leftmost character of string xyz
- y is the middle substring of xyz
- z is the rightmost character of string xyz
- p( ) is the probability function.
Entropy shows the variety of characters before and after a word. If y is a word then its left and right entropy must be high.
Example: ...?function... vs ...?unction...
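A minimal sketch of left and right entropy, assuming Le(y) and Re(y) are the entropies of the characters seen directly before and directly after each occurrence of y, with p(xy|y) estimated by counting. The base-2 logarithm, the function names and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter


def _entropy(counter):
    """Shannon entropy (base 2) of the distribution given by raw counts."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def left_right_entropy(corpus, y):
    """Return (Le, Re) for candidate string y in an unsegmented corpus."""
    left, right = Counter(), Counter()
    start = corpus.find(y)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1          # character x before y
        end = start + len(y)
        if end < len(corpus):
            right[corpus[end]] += 1               # character z after y
        start = corpus.find(y, start + 1)         # also finds overlapping hits
    return _entropy(left), _entropy(right)


# "function" is preceded by a, b and c; "unction" is always preceded by f.
corpus = "afunctionbfunctioncfunctiond"
le_word, re_word = left_right_entropy(corpus, "function")
le_part, re_part = left_right_entropy(corpus, "unction")
print(le_word > le_part)
```

In this toy corpus the left entropy of "unction" is zero because only "f" ever precedes it, while "function" is preceded by three different characters, which mirrors the function/unction contrast on the slide.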
9. Other Features
- Frequency: Words tend to be used more often than non-word string sequences.
- Length: Short strings are likely to happen by chance. Long and short strings should be treated differently.
- Functional Words: Functional words are used mostly in phrases. They are useful for disambiguating words and phrases.
Result of subjective test: word precision 85%, word recall 56%
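For reference, word precision and recall can be computed as below. This is a sketch under the assumption that a set of accepted words (for example, the judges' decisions in the subjective test) is available as the reference; the word lists are invented toy data, not the evaluation data behind the figures above.

```python
def precision_recall(extracted, reference):
    """Precision and recall of an extracted word set against a reference set."""
    correct = len(extracted & reference)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical toy data: "tionre" is a non-word string wrongly extracted.
extracted = {"information", "retrieval", "tionre"}
reference = {"information", "retrieval", "system", "search"}
precision, recall = precision_recall(extracted, reference)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```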
10. Evaluation Result of Word Extraction
RID: Royal Institute Dictionary (a Thai-Thai dictionary of 30,000 words)
11. Dictionary-less Search Engine
[Thai query] (common noun in common noun)
- ...[Thai text]... family
- ...[Thai text]... kitchen
[Thai query] (proper noun in common noun)
- ...[Thai text]... PR
- ...[Thai text]... proper noun
[Thai query] (common noun in proper noun)
- ...[Thai text]... proper noun
- ...[Thai text]... element
12. Sansarn
13. Speech
Input String
Sentence Segmentation
Word Segmentation
Grapheme to Phoneme
Text processing
Prosody / Tone Generation
Signal processing
Demi-syllable Concatenation
14. Sentence Segmentation
Input paragraph
Training POS tagged corpus
Word segmentation and POS tagging
Winnow (Feature-based ML)
Word sequence with tagged POS
Winnow
Trained network
Paragraph with sentence break
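The slide names Winnow as the learner but does not give its configuration. The following is a minimal sketch of the basic Winnow algorithm (multiplicative weight updates over binary features) applied to a break/no-break decision. The feature names, the threshold, the promotion factor and the training examples are all hypothetical.

```python
class Winnow:
    """Basic Winnow over sparse binary features (weights start at 1.0)."""

    def __init__(self, alpha=2.0, threshold=3.0):
        self.alpha = alpha            # promotion / demotion factor
        self.threshold = threshold    # decision threshold on the weight sum
        self.weights = {}

    def predict(self, features):
        score = sum(self.weights.get(f, 1.0) for f in features)
        return score >= self.threshold

    def update(self, features, label):
        """Mistake-driven update: promote on a miss, demote on a false alarm."""
        if self.predict(features) == label:
            return
        factor = self.alpha if label else 1.0 / self.alpha
        for f in features:
            self.weights[f] = self.weights.get(f, 1.0) * factor


# Hypothetical examples: POS of the word before and after a space,
# labelled True when the space is a sentence break.
data = [
    ({"prev=VERB", "next=PRON"}, True),
    ({"prev=NOUN", "next=CONJ"}, True),
    ({"prev=VERB", "next=NOUN"}, False),
    ({"prev=NOUN", "next=NUM"}, False),
]

model = Winnow()
for _ in range(5):                    # a few passes over the toy data
    for features, label in data:
        model.update(features, label)

predictions = [model.predict(features) for features, _ in data]
print(predictions)
```

Winnow only changes weights when it makes a mistake, which keeps training cheap when most of the many possible context features are irrelevant to the decision.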
15. Accuracy in Word/Sentence Segmentation
- Word Segmentation
- Longest matching (92%)
- Maximal matching (93%)
- POS tri-gram (96%)
- Machine learning (97%)
- Sentence Segmentation
- POS tri-gram (85%)
- Machine learning (89%)
Supervised approaches
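The two dictionary-based baselines can be sketched as follows, assuming that longest matching greedily takes the longest dictionary word at each position and that maximal matching picks the segmentation with the fewest words. The English toy dictionary and input stand in for Thai data and are purely illustrative.

```python
def longest_matching(text, dictionary):
    """Greedy left-to-right segmentation using the longest dictionary word."""
    max_len = max(map(len, dictionary))
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in dictionary:
                words.append(text[i:i + n])
                i += n
                break
        else:
            words.append(text[i])     # unknown character becomes its own token
            i += 1
    return words


def maximal_matching(text, dictionary):
    """Segmentation with the fewest tokens, found by dynamic programming."""
    max_len = max(map(len, dictionary))
    best = [None] * (len(text) + 1)   # best[i]: fewest-token split of text[:i]
    best[0] = []
    for i in range(1, len(text) + 1):
        for n in range(1, min(max_len, i) + 1):
            piece = text[i - n:i]
            # Single characters are always allowed so a split always exists.
            if piece in dictionary or n == 1:
                candidate = best[i - n] + [piece]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[len(text)]


dictionary = {"the", "them", "men", "mend", "end", "dine"}
text = "themendine"
greedy = longest_matching(text, dictionary)
fewest = maximal_matching(text, dictionary)
print(greedy)
print(fewest)
```

On this input the greedy method commits to "them" and leaves unmatched characters at the end, while the fewest-words search recovers "the / men / dine". Neither method uses context, which is consistent with the POS tri-gram and machine learning approaches scoring higher.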
16. Thai Grapheme-to-Phoneme
PGLR Table
/som/chaj/
/som4/chaj0/
สมชาย
PGLR parser
Most probable parse tree
G-P Mapping
Tone Generation
CFG Rule
G-P Table
17. PGLR Approach
- Probabilistic Generalized LR parsing
- Advantage in context-sensitivity
- Two levels of context
- Global context: over structures from the CFG rules (probability in reduce action)
- Local n-gram context (probability in shift action)
18. Prosody Generation
- F0 contour
- Intonation
- Downdrift
- Pitch Range
- Upper limit
- Lower limit
19. Prosody Generation
- Tone concatenation
- Tone Location
- Coarticulation
20. Demi-syllable Concatenation
- Sound Units
- Demi-syllable
- Use original tone: 1,505 units
- Tone modification technique: 605 units
21. Demi-syllable Concatenation
- Speech Signal Processing
- Tonal modification technique
- Pitch-Synchronous Overlap-Add (PSOLA)
- Spectral smoothing technique
- Line Spectrum Pair (LSP) parameter smoothing
22. Demi-syllable Concatenation
- Cross-syllable coarticulation smoothing
- Tone coarticulation: PSOLA
- Waveform interaction: LSP parameter smoothing
- Concatenation smoothing
- Intra-syllable smoothing
- Inter-syllable smoothing
23. Text to Speech
Input text: ผมขอขอบคุณทุกท่านที่มาเยี่ยมชมงาน
[Thai text]
- Text Segmentation: Sentence Extraction, Phrasing, Word Segmentation
ผม / ขอ / ขอบคุณ / ทุกท่าน / PhraseBreak / ที่ / มา / เยี่ยมชม / งาน
- Grapheme-to-Phoneme conversion
/phom4/kh@4/kh@p1/khun0/thuk3/than2/PAUSE/thi2/ma0/jiam2/chom0/ngan0/
- Prosody Generation: Syllable Duration, F0 Contour
- Speech Signal Synthesis: Boundary Smoothing, Prosodic waveform modification
Speech Signal
24. Ontologies
- EDR
- Approach: Word description as employed in dictionaries
- Problem: Ambiguities and incomputability
- WordNet
- Approach: Synonym sets and simple semantic relations to other words
- Problem: Ambiguities
- UW
- Approach: Headwords and semantic restrictions
- Advantage: Computability and no ambiguity
25. Ontologies
Representation of the concept "tired" in different schemes
EDR
- having or displaying a need for rest
- having lost interest
- lack of imagination
WordNet 1.5
- A1: tired (vs. rested)
- A2: bromidic, commonplace, hackneyed, ...
- V1: tire, pall, grow weary, fatigue
- V2: tire, wear upon, fag out
- V3: run down, exhaust, sap, ...
- V4: bore, tire, ...
UW
- tired
- tired(icl>physical)
- tired(icl>mental)
26. Universal Word (UW)
- UW format: <headword> ( <list of restrictions> ), e.g. book(icl>do, obj>room)
- Headword: An English word that roughly describes the UW sense.
- Restrictions
- Inclusion (icl): indicates the class of the sense, e.g. car(icl>movable thing)
27. Universal Word (UW)
- Restrictions (continued)
- UNL semantic relations, e.g. eat(agt>volitional thing, obj>food). The agent of this UW is restricted to be a volitional thing. The object of this UW is restricted to be food.
UW Class Hierarchy
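A minimal sketch of reading a UW into its headword and restrictions, assuming the written form headword(relation>target, ...) with ">" as the separator inside each restriction. The function name is hypothetical, and nested restrictions are deliberately not handled in this simplified version.

```python
import re

# headword, optionally followed by a parenthesised restriction list
UW_PATTERN = re.compile(r"^\s*(?P<head>[^()]+?)\s*(?:\((?P<restr>.*)\))?\s*$")


def parse_uw(uw):
    """Split a UW string into (headword, [(relation, target), ...])."""
    match = UW_PATTERN.match(uw)
    if match is None:
        raise ValueError(f"not a well-formed UW: {uw!r}")
    restrictions = []
    if match.group("restr"):
        for item in match.group("restr").split(","):
            relation, _, target = item.partition(">")
            restrictions.append((relation.strip(), target.strip()))
    return match.group("head"), restrictions


book = parse_uw("book(icl>do, obj>room)")
eat = parse_uw("eat(agt>volitional thing, obj>food)")
bare = parse_uw("tired")
print(book)
print(eat)
print(bare)
```

Keeping restrictions as (relation, target) pairs makes it straightforward to check, for example, that the agent of "eat" is constrained to a volitional thing.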
28. Milestone
29. Thailand Knowledge-based Economy
- eContent
- Open Source distributed architecture
- Language resources
- Digital Divide
- Speech-based Internet access
- Language Divide
- Multilingual access
30. Technology Solution
- Intelligent Terminal
- Knowledge Exchange
- Intelligent System
- Natural Interaction
- eContent