Title: Indonesian English Parallel Texts for Statistical Machine Translation
1. Indonesian English Parallel Texts for Statistical Machine Translation
(Hammam Riza, Adiansya Prasetya, Henky Mulyadi)
2. Background
- The Republic of Indonesia is an archipelago of 13,000 islands spread over an area of 1,900,000 square kilometers
- Population of 245,000,000 (July 2006 estimate)
- GDP growth of 7% per year was recorded
- Indonesian economic and political conditions are gradually stabilizing
- Indonesia is back on track to become an industrialized nation
- Bahasa Indonesia became the formal language of the country, uniting its citizens who speak different languages
- Bahasa Indonesia has become the language that bridges the language barrier among Indonesians who have different mother tongues
- The vocabulary of Bahasa Indonesia has been extensively influenced by outside languages, especially Sanskrit, Arabic, Chinese, Dutch, and English, as well as local languages such as Javanese and Batavian
3. Research Topics
- Tourism in Asia
- Social Service in Asia
- Safety and Security in Asia
- Archiving of Asian Languages
- Education in Asia
- Business in Asia
- Multi-lingual Speech and Language Transcription and Formats
- Multi-lingual Speech Translation
- Multi-lingual Speech Transcription
- Multi-lingual Speech and Text Archive
[Figure: Parallel Corpus (synonymous speech and text) for the English and Indonesian languages, each with speech and text; Parallel Corpus Format; Dictionary]
4. Corpus Collection and Processing
[Figure: corpus collection and processing flow; labels in reading order]
- Antara News Agency (Oracle DB): Selection and Transformation, Selected DB (SQL 2000), Article Alignment, Sentence Alignment, Indonesian-English Corpus
- Web News: Collection, Sentence Alignment, Indonesian-English Corpus
- Toggle Cleaning, Conversion to Text, Clean Corpus (Indonesian Text, English Text)
5. Translation of SMT System (1)
- A. Language model
  - SRI Language Modeling Toolkit (SRILM), which extracts a 3-gram language model from the data. Besides the SRILM distribution, you will also need the following freely available tools: an ANSI-C/C++ compiler (gcc version 3.4.3 or higher), GNU make, GNU gawk, GNU gzip, Tcl, and the CYGWIN porting layer to build SRILM on a Microsoft Windows system.
  - Functionalities of SRILM
    - Generate the n-gram count file from the corpus
    - Train the language model from the n-gram count file
    - Calculate the test data perplexity using the trained language model
[Figure: SRILM workflow. Training corpus -> ngram-count -> Count file; Count file + Lexicon -> ngram-count -> LM; LM + Test data -> ngram -> ppl]
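The three SRILM functions listed on this slide (counting n-grams, training a language model, computing perplexity) can be illustrated with a toy Python sketch. This is not SRILM and does not reproduce its output: the function names, the add-one smoothing, and the sample sentences are assumptions made only for illustration, and SRILM itself uses more refined discounting and backoff.

```python
import math
from collections import Counter


def ngram_counts(sentences, n=3):
    """Count n-grams and their (n-1)-word histories over whitespace-tokenised sentences."""
    ngrams, histories = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] * (n - 1) + sent.split() + ["</s>"]
        for i in range(n - 1, len(tokens)):
            history = tuple(tokens[i - n + 1:i])
            ngrams[history + (tokens[i],)] += 1
            histories[history] += 1
    return ngrams, histories


def train_lm(sentences, n=3):
    """Bundle the counts and vocabulary into a toy language model (assumed structure)."""
    ngrams, histories = ngram_counts(sentences, n)
    vocab = {w for s in sentences for w in s.split()} | {"</s>", "<unk>"}
    return {"n": n, "ngrams": ngrams, "histories": histories, "vocab": vocab}


def prob(lm, history, word):
    """Add-one smoothed P(word | history); unknown words map to <unk>."""
    vocab = lm["vocab"]
    if word not in vocab:
        word = "<unk>"
    history = tuple(h if h in vocab or h == "<s>" else "<unk>" for h in history)
    numerator = lm["ngrams"][history + (word,)] + 1
    denominator = lm["histories"][history] + len(vocab)
    return numerator / denominator


def perplexity(lm, sentences):
    """Perplexity of test sentences: exp of the negative mean log-probability per token."""
    n = lm["n"]
    log_prob, count = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] * (n - 1) + sent.split() + ["</s>"]
        for i in range(n - 1, len(tokens)):
            log_prob += math.log(prob(lm, tuple(tokens[i - n + 1:i]), tokens[i]))
            count += 1
    return math.exp(-log_prob / count)


# Toy training data (hypothetical, not from the corpus described in the slides).
train = ["can i check in now", "can i check out now"]
lm = train_lm(train)
seen_ppl = perplexity(lm, ["can i check in now"])
unseen_ppl = perplexity(lm, ["now out check i can"])
print(seen_ppl < unseen_ppl)
```

A sentence that appeared in training receives a lower perplexity than a scrambled one, which is the behaviour the perplexity step on the slide is meant to measure.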
6. Translation of SMT System (2)
- B. Translation model
  - bin contains GIZA++, which is an implementation based on the IBM models, and mkcls, which divides words into probabilistically based classes.
  - In order to compile GIZA++ you may need
    - a recent version of the GNU compiler (2.95 or higher)
    - a recent version of assembler and linker which do not have restrictions with respect to the length of symbol names
  - corpus is where the data should be placed when training the translation model.
[Figure: Pharaoh SMT Generation System. Source and target data -> Data preparations -> Train Phrase Model; inputs: Translation Model, Program (SRILM), Compiler]
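To give an idea of what "based on the IBM models" means, the sketch below estimates word translation probabilities with the EM procedure of IBM Model 1, the simplest model in that family. It is a simplified illustration, not GIZA++ itself: there is no NULL word, the function name `ibm_model1` is hypothetical, and the three sentence pairs are toy data rather than sentences from the corpus.

```python
from collections import defaultdict


def ibm_model1(pairs, iterations=10):
    """Estimate t(target_word | source_word) with EM (IBM Model 1, no NULL word)."""
    pairs = [(src.split(), tgt.split()) for src, tgt in pairs]
    tgt_vocab = {w for _, tgt in pairs for w in tgt}

    # Uniform initialisation over all co-occurring (target, source) word pairs.
    t = {(tw, sw): 1.0 / len(tgt_vocab)
         for src, tgt in pairs for tw in tgt for sw in src}

    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (target, source) links
        total = defaultdict(float)   # expected count of each source word
        # E-step: distribute each target word over the source words of its sentence.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalise the expected counts into probabilities.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t


# Toy English-Indonesian pairs (hypothetical examples).
pairs = [
    ("the house", "rumah itu"),
    ("the book", "buku itu"),
    ("a book", "sebuah buku"),
]
t = ibm_model1(pairs)
for (tw, sw), p in sorted(t.items(), key=lambda kv: -kv[1])[:4]:
    print(f"t({tw} | {sw}) = {p:.3f}")
```

Even on three sentence pairs, the probabilities for `itu | the` and `buku | book` rise above the alternatives, because those words co-occur consistently across sentences. GIZA++ goes on to train the higher IBM models on top of such estimates.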
7. Testing System Performance (1)
- Sentences are translated with the Pharaoh decoder
- Files used for translation:
  - pharaoh (executable)
  - pharaoh.ini
  - xkalimat.lm
  - phrase-table
- Example
  - Type the following command:
    echo "Can I check in now" | ./pharaoh -f ./pharaoh.ini > OUT
  - The process writes the file OUT; to see the result, type:
    cat OUT
  - Result: Dapatkah saya check in sekarang
8. Testing System Performance (2)
- BLEU score: on a sample of 275,000 sentences, the BLEU score is 0.878
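For readers unfamiliar with the metric, the following Python sketch shows how a corpus-level BLEU score on the 0 to 1 scale is computed from clipped n-gram matches and a brevity penalty. It assumes one reference per sentence, whitespace tokenisation, and no smoothing. The function names and the example sentences are illustrative, and this is not the script used to obtain the 0.878 figure.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with a single reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # unsmoothed BLEU is zero if any n-gram order has no match
    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


reference = ["dapatkah saya check in sekarang"]
print(corpus_bleu(["dapatkah saya check in sekarang"], reference))
print(corpus_bleu(["dapatkah saya check in besok"], reference))
```

An exact match scores 1.0, and replacing one word lowers the score because every n-gram containing that word no longer matches.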
9. Thank you...
sunset in Kuta, Bali