Indonesian English Parallel Texts for Statistical Machine Translation

1
Indonesian English Parallel Texts for
Statistical Machine Translation
(Hammam Riza, Adiansya Prasetya, Henky Mulyadi)
2
Background
  • The Republic of Indonesia is an archipelago of
    13,000 islands spread over an area of 1,900,000
    square kilometers
  • Population of 245,000,000 (July 2006 estimate)
  • GDP growth of 7% per year has been recorded
  • Indonesian economy and political conditions are
    gradually stabilizing
  • Indonesia is back on the track to become an
    industrialized nation
  • Bahasa Indonesia became the formal language of
    the country, uniting its citizens who speak
    different languages
  • Bahasa Indonesia has become the language that
    bridges the language barrier among Indonesians
    who have different mother-tongues
  • The vocabulary of Bahasa Indonesia has been
    extensively influenced by outside languages,
    especially Sanskrit, Arabic, Chinese, Dutch, and
    English, as well as local languages such as
    Javanese and Batavian

3
Research Topics
  • Tourism in Asia
  • Social Service in Asia
  • Safety and Security in Asia
  • Archiving of Asian Languages
  • Education in Asia
  • Business in Asia
  • Multi-lingual Speech and Language Transcription
    and formats
  • Multi-lingual Speech Translation
  • Multi-lingual Speech Transcription
  • Multi-lingual Speech and Text Archive
  • Parallel Corpus (Synonymous Speech and Text):
    English Language (Speech, Text), Indonesian
    Language (Speech, Text)
  • Parallel Corpus Format
  • Dictionary
4
Corpus Collection and Processing
  • Data collection schema

  • Antara News Agency (Oracle DB): selection and
    transformation, selected DB (SQL 2000), article
    alignment, sentence alignment, Indonesian-English
    corpus
  • Web news: collection, sentence alignment,
    Indonesian-English corpus
  • Toggle cleaning, conversion to text
  • Indonesian text, English text, clean corpus
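
The cleaning and conversion-to-text steps in the schema can be illustrated with a minimal sketch. The slides do not say how cleaning was actually done, so everything here is an assumption: the function names `clean_text` and `clean_corpus` are hypothetical, markup is matched with a naive regular expression, and a pair is dropped when either side is empty after cleaning.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # naive markup matcher (an assumption)
SPACE_RE = re.compile(r"\s+")


def clean_text(line):
    """Strip markup, decode HTML entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", line)
    decoded = html.unescape(no_tags)
    return SPACE_RE.sub(" ", decoded).strip()


def clean_corpus(pairs):
    """Clean aligned (Indonesian, English) pairs; drop pairs with an empty side."""
    cleaned = []
    for indonesian, english in pairs:
        id_clean, en_clean = clean_text(indonesian), clean_text(english)
        if id_clean and en_clean:
            cleaned.append((id_clean, en_clean))
    return cleaned


# Toy input, not taken from the real corpus.
sample = [
    ("<p>Dapatkah saya  check in sekarang</p>", "<p>Can I check in now</p>"),
    ("<br/>", "Orphan English line"),
]
cleaned_sample = clean_corpus(sample)
```

Dropping a pair when one side is empty keeps the Indonesian and English files line-aligned, which phrase-model training expects.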
5
Translation of SMT System (1)
  • A. Language model
  • The SRI Language Modeling Toolkit (SRILM)
    extracts a 3-gram language model from the data.
    Besides the SRILM distribution, you will also
    need the following freely available tools: an
    ANSI-C/C++ compiler (gcc version 3.4.3 or
    higher), GNU make, GNU gawk, GNU gzip, Tcl, and
    the CYGWIN porting layer to build SRILM on a
    Microsoft Windows system.
  • Functionalities of SRILM
  • Generate the n-gram count file from the corpus
  • Train the language model from the n-gram count
    file
  • Calculate the test data perplexity using the
    trained language model

  • Training corpus -> ngram-count -> count file
  • Count file + lexicon -> ngram-count -> LM
  • LM + test data -> ngram -> ppl
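
The three SRILM functions listed on this slide (count n-grams, train a model, compute test perplexity) can be sketched in plain Python. This is not SRILM and does not reproduce its smoothing. It uses add-one smoothing as a deliberately simple stand-in, and the function names and toy sentences are assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a sentence padded with <s> and </s>."""
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


def count_ngrams(sentences, n=3):
    """Step 1: collect n-gram counts and their (n-1)-gram history counts."""
    counts, history = Counter(), Counter()
    for sentence in sentences:
        for gram in ngrams(sentence.split(), n):
            counts[gram] += 1
            history[gram[:-1]] += 1
    return counts, history


def train_lm(counts, history):
    """Step 2: build an add-one smoothed conditional probability function."""
    vocab = {gram[-1] for gram in counts}
    vocab_size = len(vocab) + 1  # +1 reserves mass for unseen words

    def prob(gram):
        return (counts[gram] + 1) / (history[gram[:-1]] + vocab_size)

    return prob


def perplexity(prob, sentences, n=3):
    """Step 3: perplexity = exp(-average log-probability per predicted token)."""
    log_sum, token_count = 0.0, 0
    for sentence in sentences:
        for gram in ngrams(sentence.split(), n):
            log_sum += math.log(prob(gram))
            token_count += 1
    return math.exp(-log_sum / token_count)


# Toy data, not the corpus described in the slides.
train = ["dapatkah saya check in sekarang", "saya check in sekarang"]
counts, history = count_ngrams(train)
lm = train_lm(counts, history)
ppl_seen = perplexity(lm, ["saya check in sekarang"])
ppl_unseen = perplexity(lm, ["sekarang in check saya"])
```

A sentence whose trigrams appear in the training data gets a lower perplexity than a scrambled one, which is the behaviour the perplexity check is meant to expose.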
6
Translation of SMT System (2)
  • B. Translation model
  • bin contains GIZA++, which is an implementation
    based on the IBM models, and mkcls, which divides
    words into probabilistically based classes.
  • In order to compile GIZA++ you may need
  • a recent version of the GNU compiler (2.95 or
    higher)
  • a recent version of assembler and linker which do
    not have restrictions with respect to the length
    of symbol names
  • corpus is where the data should be placed when
    training the translation model.

  • Diagram labels: source, Translation Model,
    Program (SRILM), Compiler, Data preparations,
    Train Phrase Model, Pharaoh SMT Generation
    System, target
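
Since the word alignment tool is described as based on the IBM models, a minimal sketch of IBM Model 1 training with EM may help show what such a tool estimates. This is a toy illustration, not the tool's implementation: the function name `train_ibm_model1`, the iteration count, and the three sentence pairs are all assumptions.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t(target_word | source_word) with IBM Model 1 EM."""
    # A NULL source word lets target words align to nothing.
    corpus = [(["NULL"] + src.split(), tgt.split()) for src, tgt in pairs]
    target_vocab = {word for _, tgt in corpus for word in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # keyed by (target_word, source_word)

    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # expected counts per source word
        for src, tgt in corpus:
            for tgt_word in tgt:
                # E-step: share one unit of count among the source words.
                norm = sum(t[(tgt_word, src_word)] for src_word in src)
                for src_word in src:
                    share = t[(tgt_word, src_word)] / norm
                    count[(tgt_word, src_word)] += share
                    total[src_word] += share
        # M-step: renormalise expected counts into probabilities.
        t = defaultdict(float, {
            (tgt_word, src_word): value / total[src_word]
            for (tgt_word, src_word), value in count.items()
        })
    return t


# Toy English-Indonesian pairs, not taken from the real corpus.
toy_pairs = [
    ("the house", "rumah itu"),
    ("the book", "buku itu"),
    ("a book", "sebuah buku"),
]
table = train_ibm_model1(toy_pairs)
```

After a few iterations the probability mass for "book" concentrates on "buku", because that is the only target word that co-occurs with it in both sentence pairs.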
7
Testing System Performance (1)
  • Sentences are translated with the Pharaoh decoder
  • Files used for translation are
  • pharaoh (executable)
  • pharaoh.ini
  • xkalimat.lm
  • phrase-table
  • Example
  • Type the following command
  • echo "Can I check in now" | ./pharaoh -f
    ./pharaoh.ini > OUT
  • The process writes the file OUT; to see the
    result, type
  • cat OUT
  • Presented result: Dapatkah saya check in
    sekarang
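
The same pipe-in, capture-out pattern can be scripted when many sentences need translating. The sketch below assumes a decoder that reads one sentence on standard input and writes the translation to standard output. The helper name `translate` is hypothetical, and `ECHO_STAND_IN` is a placeholder command that simply echoes its input so the sketch runs without Pharaoh installed.

```python
import subprocess
import sys


def translate(sentence, decoder_cmd):
    """Pipe one sentence to a decoder command and return its standard output."""
    result = subprocess.run(
        decoder_cmd,
        input=sentence + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise if the decoder exits with an error
    )
    return result.stdout.strip()


# Hypothetical stand-in that echoes its input instead of translating it.
ECHO_STAND_IN = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read())",
]

echoed = translate("Can I check in now", ECHO_STAND_IN)
```

With the files from this slide in the working directory, the decoder command would presumably be `["./pharaoh", "-f", "./pharaoh.ini"]` in place of the stand-in.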

8
Testing System Performance (2)
Testing Performance
BLEU score: for a sample of 275,000 sentences, the
BLEU score is 0.878
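
For readers unfamiliar with the metric, here is a minimal sketch of corpus-level BLEU with one reference per sentence, clipped n-gram precision up to 4-grams, and the brevity penalty. The slides do not say which scoring tool or settings produced the reported score, so the function name `corpus_bleu`, the whitespace tokenisation, and the lack of smoothing are assumptions.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with one reference per candidate sentence."""
    matches = [0] * max_n
    totals = [0] * max_n
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        cand_tokens, ref_tokens = cand.split(), ref.split()
        cand_len += len(cand_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            cand_counts = ngram_counts(cand_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clip each candidate n-gram count by its count in the reference.
            matches[n - 1] += sum((cand_counts & ref_counts).values())
            totals[n - 1] += sum(cand_counts.values())
    if cand_len == 0 or min(matches) == 0:
        return 0.0  # unsmoothed BLEU is zero when any precision is zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity_penalty * math.exp(log_precision)


reference = ["Dapatkah saya check in sekarang"]
exact = corpus_bleu(["Dapatkah saya check in sekarang"], reference)
shortened = corpus_bleu(["Dapatkah saya check in"], reference)
```

An exact match scores 1.0, while the shortened candidate keeps perfect n-gram precision but is reduced by the brevity penalty.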
9
Thank you...
sunset in Kuta, Bali