Title: PRONUNCIATION DICTIONARIES
1 PRONUNCIATION DICTIONARIES
- Dr. Bali RANAIVO-MALANÇON
- Unit Terjemahan Melalui Komputer
- Universiti Sains Malaysia
2 Definition: What is a pronunciation dictionary?
- A pronunciation dictionary (or phonetic dictionary) is a list of words followed by their phonetic transcriptions.
- Phonetic transcriptions
- Canonical pronunciation
- Variant pronunciations
- Phonological rules to generate variant pronunciations
3 Some basic linguistic knowledge
Notation
- < > orthographic representation: <buah>
- character representation: b, u, a, h
- / / phonemic representation: /buah/
- [ ] phonetic representation: [buwah]
PHONOLOGY (or phonemics): the study of distinctive sound units, the patterns they form, and the rules which regulate their use. Phonemes / phones: /r/
PHONETICS: the study of the inventory and structure of the sounds of language. Allophones: r, R, ?
4 Examples of pronunciation dictionaries
Verbmobil
"Ubernachtungen Qyb6n'axtUN@n
"Ubernachtungskosten Qyb6n'axtUNsk"Ost@n
"Ubernachtungsm"oglichk Qyb6n'axtUNsm"2klICk
PHONOLEX
"Ubernachtungsgeldes CLnom ORsb TPptra Qyb6naxtUNsgEld@s
"Ubernachtungskosten ORvm TPmanu Qyb6n'axtUNsk"Ost@n
yb6naxtUNskOst@n 1 VM MAUS
yb6naxtUNskOsn 1 VM MAUS
CMUdict (Carnegie Mellon Pronouncing Dictionary)
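To make the CMUdict entry concrete, here is a minimal Python sketch of reading entries written in the CMUdict plain-text style: one word per line followed by its ARPAbet phones, `;;;` for comment lines, and a suffix such as `(1)` marking a variant pronunciation. The function name `parse_cmudict_lines` and the sample lines are illustrative assumptions, not part of the slides.

```python
import re
from collections import defaultdict

# A head word such as "HELLO(1)" marks a variant pronunciation of "HELLO".
VARIANT = re.compile(r"^(?P<word>.+?)\((?P<n>\d+)\)$")

def parse_cmudict_lines(lines):
    """Map each word to the list of its pronunciations (lists of phones)."""
    lexicon = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):  # skip blanks and comments
            continue
        head, *phones = line.split()
        match = VARIANT.match(head)
        word = match.group("word") if match else head
        lexicon[word.lower()].append(phones)
    return dict(lexicon)

# Sample lines in the CMUdict style (illustrative input).
sample = [
    ";;; comment line",
    "HELLO  HH AH0 L OW1",
    "HELLO(1)  HH EH0 L OW1",
]
lexicon = parse_cmudict_lines(sample)
print(lexicon["hello"])
```

The first pronunciation listed for a word can be treated as the canonical one and the remaining ones as variants, which mirrors the canonical/variant distinction made in the definition slide.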
5 Applications: Why do we need pronunciation dictionaries?
- Speech technologies: to help phonetic labeling
- Automatic Speech Recognition (ASR): Tan Tien Pieng
- Text-To-Speech (TTS): Nur Hana Samsudin
- Pronunciation can be added to the Malay dictionary
6 Simplified Speech Recognition Architecture
Jurafsky D., Martin J. H. (2000) Speech and Language Processing, Prentice-Hall, Inc.
[Figure: simplified speech recognition architecture (label: Speech Waveform)]
7 MBROLA Malay Diphone Database
- Diphones
- Speech units that begin in the middle of a phone and end in the middle of the following one.
- Concatenative synthesis
- Minimize concatenation problems
- Require an affordable amount of memory
- MBROLA (Multi Band Resynthesis OverLap Add)
- Speech synthesizer based on the concatenation of diphones
- Faculté Polytechnique de Mons, Belgium, 1996
- Synthesizers for many languages, e.g. Indonesian, British English, American English, Arabic
- Synthesizers and diphone databases: free for non-commercial applications, available online
- As MBROLA provides all the facilities (programs, guidelines, assistance, etc.) needed to build a synthesizer, we can focus our research on preparing the diphone data to build the Malay synthesizer
8 Building diphone database
Pronunciation dictionary: saya, saja
List of phones: a, j, s, ...
Combine two phones
List of diphones (attested in the dictionary): aj, ja, sa, ...
List of diphones (all combinations of two phones): aa, aj, as, ja, jj, js, sa, sj, ss, ...
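The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the one-entry lexicon is taken from the slide (saya transcribed as s a j a), the function names are hypothetical, and transitions to and from silence, which a real diphone database also needs, are left out.

```python
from itertools import product

def phones_of(lexicon):
    """Sorted inventory of phones used anywhere in the dictionary."""
    return sorted({p for phones in lexicon.values() for p in phones})

def attested_diphones(lexicon):
    """Diphones that actually occur inside the transcriptions."""
    return sorted({a + b
                   for phones in lexicon.values()
                   for a, b in zip(phones, phones[1:])})

def all_diphones(phones):
    """Every ordered combination of two phones from the inventory."""
    return [a + b for a, b in product(phones, repeat=2)]

# One-entry pronunciation dictionary from the slide: <saya> -> s a j a
lexicon = {"saya": ["s", "a", "j", "a"]}

phones = phones_of(lexicon)            # ['a', 'j', 's']
attested = attested_diphones(lexicon)  # ['aj', 'ja', 'sa']
candidates = all_diphones(phones)      # ['aa', 'aj', 'as', 'ja', 'jj', 'js', 'sa', 'sj', 'ss']
print(phones, attested, candidates)
```

Comparing `attested` with `candidates` shows which diphones the current word list already covers and which ones still need carrier words before recording.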
9 Resources: What do we have today to build the Malay pronunciation dictionary?
- Linguistic resources
- List of Malay words ? 60,000 words or tokens
- List of Malay abbreviations and their expansions
- List of Malay proper names
- Malay corpus: novels, academic texts
- Phonological rules (Dr Tajul)
- Programs, Techniques, Algorithms
- Grapheme-to-phoneme converter
- Statistical techniques
UTMK's future research on speech
Applications of the Malay pronunciation dictionary
- From readings (books, reports, etc.)
- Knowledge about pronunciation dictionaries
- applications,
- needs,
- techniques, algorithms, implementation
10 Building the pronunciation dictionary
- Define phoneme inventories and use machine-readable phonetic alphabets (ASCII-IPA alphabets), e.g. SAMPA, TIMIT, etc.
IPA   SAMPA   TIMIT   Example
ʃ     S       sh      she
dʒ    dZ      jh      joke
ŋ     N       ng      sing
- Define phonological rules in a form adapted to computation
- Etymology information
- Arabic: <maaf> [ma?af]
- Malay: <gunaan> [guna?an]
- Morphological analysis
- <pakai> [paka?]
- <diketuai> [dik?tuwaji]
- Rewriting rules: ordered rules
- Two-level morphology without rule ordering
- Implementation using finite-state transducers
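As an illustration of ordered rewriting rules, the sketch below applies a short rule list from top to bottom with Python regular expressions. It is a stand-in, not the finite-state transducer implementation the slide has in mind, and the two rules are toy assumptions chosen only to reproduce the slide's examples: they are not the phonological rules referred to in the resources slide. The output uses SAMPA-style symbols (`N` for the velar nasal, `?` for the glottal stop).

```python
import re

# Hypothetical toy rules, applied strictly in this order.
ORDERED_RULES = [
    # The digraph <ng> becomes the velar nasal (SAMPA N).
    (re.compile(r"ng"), "N"),
    # Two identical adjacent vowels are separated by a glottal stop (SAMPA ?).
    (re.compile(r"([aeiou])\1"), r"\1?\1"),
]

def apply_rules(word, rules=ORDERED_RULES):
    """Rewrite an orthographic form by applying each rule in turn."""
    form = word.lower()
    for pattern, replacement in rules:
        form = pattern.sub(replacement, form)
    return form

print(apply_rules("maaf"))    # ma?af
print(apply_rules("gunaan"))  # guna?an
print(apply_rules("bunga"))   # buNa
```

Order starts to matter as soon as rules overlap: a single-letter rule for <g> or <n>, for example, would have to come after the digraph rule for <ng>. Two-level morphology avoids this dependency by stating rules as parallel constraints rather than as a sequence.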
11 Building
- Differentiate homographs
- semak_Noun [s?ma?]
- semak_Verb [sema?]
- Pronunciation of
- proper names
- abbreviations, e.g. Proton
- numbers, e.g. Boeing 747
- some characters, e.g. @ and . in ranaivo@cs.usm.my
- Grapheme-to-phoneme converter
- Expert checking
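One way to keep homographs apart is to key the dictionary on the word together with its lexical category, and to fall back on the grapheme-to-phoneme converter only for unlisted words, flagging that output for expert checking. The sketch below assumes this design; the function names are hypothetical, the two transcriptions are copied verbatim from the slide, and `naive_g2p` is a placeholder rather than a real converter.

```python
# Transcriptions copied verbatim from the slide; keys are (word, lexical category).
LEXICON = {
    ("semak", "Noun"): "s?ma?",
    ("semak", "Verb"): "sema?",
}

def naive_g2p(word):
    """Placeholder converter (assumption): returns the spelling unchanged."""
    return word

def transcribe(word, lexcat, lexicon=LEXICON, g2p=naive_g2p):
    """Return (transcription, needs_expert_check) for a tagged word."""
    key = (word.lower(), lexcat)
    if key in lexicon:
        return lexicon[key], False
    # Unlisted words go through the converter and are flagged for review.
    return g2p(word.lower()), True

print(transcribe("semak", "Noun"))
print(transcribe("semak", "Verb"))
print(transcribe("buah", "Noun"))
```

The same lookup-then-fallback pattern would apply to proper names, abbreviations and numbers, each with its own expansion table placed in front of the converter.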
12 Conclusion
- Structure of the Malay pronunciation dictionary
- word, lexcat, etym, pht, nbph
- lexcat: lexical category
- etym: etymology
- MAL(ay), IND(onesian), ENG(lish), AR(a)B, OTH(er)
- pht: phonetic transcription
- using one of the ASCII-IPA alphabets (not defined yet)
- nbph: number of phones
- Set of phonological rules to derive variant pronunciations
- THANK YOU
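The entry structure listed in the conclusion (word, lexcat, etym, pht, nbph) could be represented as a small record type. The sketch below is one possible layout, not a format defined in the slides: the class name, the tab-separated output and the lexical category value "Noun" are assumptions, while the etymology codes and the transcription of <maaf> come from the slides. Here nbph is derived from pht rather than stored, so the two cannot disagree.

```python
from dataclasses import dataclass
from typing import List

# Etymology codes listed on the conclusion slide.
ETYM_CODES = {"MAL", "IND", "ENG", "ARB", "OTH"}

@dataclass
class Entry:
    word: str        # orthographic form
    lexcat: str      # lexical category
    etym: str        # etymology code
    pht: List[str]   # phonetic transcription, one symbol per phone

    def __post_init__(self):
        if self.etym not in ETYM_CODES:
            raise ValueError(f"unknown etymology code: {self.etym}")

    @property
    def nbph(self) -> int:
        """Number of phones, computed from the transcription."""
        return len(self.pht)

    def to_line(self) -> str:
        """Serialize as one tab-separated line (hypothetical file layout)."""
        return "\t".join(
            [self.word, self.lexcat, self.etym, " ".join(self.pht), str(self.nbph)]
        )

# <maaf>, Arabic origin, transcribed ma?af on the slides; "Noun" is an assumed tag.
entry = Entry(word="maaf", lexcat="Noun", etym="ARB", pht=["m", "a", "?", "a", "f"])
print(entry.to_line())
```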