Title: Statistically Motivated Example-based Machine Translation using Translation Memory
1. Statistically Motivated Example-based Machine Translation using Translation Memory
- Sandipan Dandapat, Sara Morrissey, Sudip Kumar Naskar, Harold Somers (CNGL, School of Computing, DCU)
2. Introduction
- Machine Translation is the automatic transfer of information (syntactic and semantic) from one language to another.
- RBMT is characterized by linguistic rules.
- SMT uses a mathematical model based on probability distributions estimated from a parallel corpus.
- EBMT integrates both rule-based and data-driven techniques.
- EBMT is often linked to a related technique, Translation Memory (TM), which stores past translations in a database.
- Both EBMT and TM share the idea of reusing existing translations.
- But EBMT is an automated technique for translation, whereas TM is an interactive tool for human translators.
3. Introduction
SMT vs. EBMT

SMT | EBMT
Works well when a significant amount of training data is available | Can be developed with a limited example base
Good for open-domain translation | Good for restricted domains; works well when the test and training sets are close
Has shown difficulties with free word-order languages | Reuses the segments of a test sentence that can be found in the source side of the example base
4. Our Attempt
- We try to use EBMT and TM to tackle the English-Bangla language pair.
- This pair has proved troublesome, with low BLEU scores for various SMT approaches (Islam et al., 2010).
- We attempt to translate medical-receptionist dialogues, primarily for appointment scheduling.
- Our goal:
  - Integrate EBMT and TM for better translation in a restricted domain.
  - EBMT helps to find the closest match, and TM is good for translating segments of a sentence.
5. Creating an English-Bangla Parallel Corpus
- Task: manually create an English-Bangla parallel corpus for training.
- Points to consider: use of native speakers; translation challenges (literal vs. explicit).
- Methodology:
  - Manual translation by a native speaker
  - Discussions on translation conventions
- Corpus example:
  - English: Hello, can I get an appointment sometime later this week?
  - Bangla: [Bangla translation; the script did not survive transcription]
6. Notes on Translation Challenges
- Non-alteration of the source text
- Literal translation of the source
  - Which doctor would you prefer?
  - I don't mind.
  - Bangla: [Bangla translation; the script did not survive transcription]
7. Size and Type of the Corpora
- Owing to the stages described above, it is time-consuming to collect a large amount of medical-receptionist dialogue.
- Thus, our corpus comprises 380 dialogue turns.
- In transcription, this works out at just under 3,000 words (about 8 words per dialogue turn).
- A very small corpus by any standard.
8. Note on the Size of the Corpora
- How many examples are needed to adopt a data-driven MT system?

System | Language Pair | Size
TTL | English → Turkish | 488
TDMT | English → Japanese | 350
EDGAR | German → English | 303
ReVerb | English → German | 214
ReVerb | Irish → English | 120
METLA-1 | English → French | 29
METLA | English → Urdu | 7
- No SMT system has been developed with only 380 parallel sentences.
- But many EBMT systems have been developed with such a small corpus.
9. Structure of the Corpus
- Medical-receptionist dialogue comprises very similarly structured sentences.
- Example:
  - (1) a. I need a medical for my insurance company.
        b. I need a medical for my new job.
  - (2) a. The doctor told me to come back for a follow-up appointment.
        b. The doctor told me to call back in a week.
- Thus, it might be helpful to reuse the translations of the common parts while translating new sentences.
- This leads us to use EBMT.
10. Main Idea
- Input:
  - Ok, I have booked you in for eleven fifteen on Friday with Dr. Thomas.
- Fuzzy match in the example base:
  - Ok, I have booked you in for three thirty on Thursday with Dr. Kelly.
- Part of the translation comes from the example-base fuzzy match.
- Part of the translation comes from the Translation Memory or SMT:
  - Ok, I have booked you in for eleven fifteen on Friday with Dr. Thomas.
11. Building the Translation Memory (TM)
- We build the TM automatically from our small patient-dialogue corpus.
- We use Moses to build two TMs:
  - Aligned phrase pairs from the Moses phrase table (phrase table, PT)
  - Aligned word pairs based on GIZA (lexical table, LT)

LT examples:
  hello → [Bangla]
  eleven → [Bangla]
PT examples:
  come in on friday instead → [Bangla]
  , but dr finn → [Bangla]

- We keep all the target equivalents of a source phrase in the TM, stored in sorted order by phrase translation probability.
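For reference, here is a minimal sketch of how such a TM might be loaded from a Moses-style phrase table into a probability-sorted lookup. The `|||` field layout and the choice of score column follow the standard Moses format; the exact column used by the authors is an assumption.

```python
from collections import defaultdict

def load_phrase_table(path, score_column=2):
    """Parse a Moses-style phrase table ('src ||| tgt ||| scores ...')
    into a dict mapping each source phrase to its target phrases,
    sorted by descending translation probability."""
    table = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = [x.strip() for x in line.split("|||")]
            if len(fields) < 3:
                continue  # skip malformed lines
            src, tgt = fields[0], fields[1]
            # assumption: score_column selects p(tgt|src) from the scores field
            prob = float(fields[2].split()[score_column])
            table[src].append((prob, tgt))
    # keep all target equivalents of a source phrase, best-scoring first
    return {src: [tgt for _, tgt in sorted(pairs, reverse=True)]
            for src, pairs in table.items()}
```

The same shape (source string → probability-sorted list of target strings) works for the lexical table, with single words as keys.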
13. Our Approach
- Our EBMT system, like most, has three stages:
  - Matching: find the closest match to the input sentence
  - Adaptability: find the translations of the desired segments
  - Recombination: combine the translations of the desired segments
14. Matching
- We find the closest sentence (Sc) in the example base for the input sentence (S) to be translated.
- We use a word-based edit distance metric to find this closest-match sentence in the example base.

S:  Ok, I have booked you in for eleven fifteen on Friday with Dr. Thomas.
Sc: Ok, I have booked you in for three thirty on Thursday with Dr. Kelly.
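A minimal sketch of word-based edit-distance matching as described above; whitespace tokenization and tie-breaking via `min` are implementation assumptions, not details from the slides.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between token lists a and b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # cost of deleting i tokens
    for j in range(n + 1):
        d[0][j] = j  # cost of inserting j tokens
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def closest_match(s, example_base):
    """Return the example-base sentence Sc with the minimum
    word-based edit distance to the input sentence s."""
    return min(example_base, key=lambda sc: edit_distance(s.split(), sc.split()))
```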
15. Matching (continued)
- We consider the associated translation (Sct) of Sc as the skeleton translation of the input sentence S.

S:   Ok, I have booked you in for eleven fifteen on Friday with Dr. Thomas.
Sc:  Ok, I have booked you in for three thirty on Thursday with Dr. Kelly.
Sct: [Bangla; the script did not survive transcription] (transliterated: AchchhA, Ami ApanAra janya bRRihaspatibAra tinaTe tirishe DAH kelira sAthe buk karechhi.)

- We will use some segments of Sct to produce a new translation.
16. Adaptability
- We extract the translations of the inappropriate fragments of the input sentence (S).
- To do this, we align three sentences: the input (S), the closest source-side match (Sc), and its target equivalent (Sct).
- Mark the mismatched portions between the input sentence (S) and the closest source-side match (Sc) using edit distance:

S:  ok , i've booked you in for <eleven fifteen> on <friday> with dr <thomas> .
Sc: ok , i've booked you in for <three thirty> on <thursday> with dr <kelly> .
17. Adaptability (continued)
- Further, we align the mismatched portions of Sc with its associated translation Sct using our TMs (PT and LT):

S:   ok , i've booked you in for <eleven fifteen> on <friday> with dr <thomas> .
Sc:  ok , i've booked you in for <three thirty> on <thursday> with dr <kelly> .
Sct: [Bangla skeleton with numbered fragments <1:...> <0:...> <2:...>; the script did not survive transcription]

- The number in the angle brackets keeps track of the order of the appropriate fragments.
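A minimal sketch of the fragment-marking step, assuming whitespace tokenization; it uses Python's difflib (an edit-distance-style aligner) rather than whatever alignment routine the authors actually implemented.

```python
from difflib import SequenceMatcher

def mark_mismatches(s, sc):
    """Wrap the fragments where the input sentence s and its closest
    match sc disagree in numbered angle brackets, mirroring the
    <k:...> notation on the slides."""
    s_toks, sc_toks = s.split(), sc.split()
    sm = SequenceMatcher(a=s_toks, b=sc_toks)
    s_out, sc_out, k = [], [], 0
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            s_out += s_toks[i1:i2]
            sc_out += sc_toks[j1:j2]
        else:  # replace / insert / delete = a mismatched fragment
            if i2 > i1:
                s_out.append("<%d:%s>" % (k, " ".join(s_toks[i1:i2])))
            if j2 > j1:
                sc_out.append("<%d:%s>" % (k, " ".join(sc_toks[j1:j2])))
            k += 1
    return " ".join(s_out), " ".join(sc_out)

# the slide's example: marks <eleven fifteen>/<three thirty>, <friday>/<thursday>, <thomas>/<kelly>
s_marked, sc_marked = mark_mismatches(
    "ok , i've booked you in for eleven fifteen on friday with dr thomas .",
    "ok , i've booked you in for three thirty on thursday with dr kelly .")
```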
18. Recombination
- Substitute, add, or delete segments of the input sentence (S) in the skeleton translation equivalent (Sct):

S:   ok , i've booked you in for <eleven fifteen> on <friday> with dr <thomas> .
Sc:  ok , i've booked you in for <three thirty> on <thursday> with dr <kelly> .
Sct: [Bangla skeleton with fragments <1:...> <0:...> <2:...>]
>>   [Bangla skeleton with <1:Friday> <0:eleven fifteen> <2:Thomas> substituted in]

- Possible ways of obtaining the translation Tx of a fragment x:
  - Tx = SMT(x)
  - Tx = PT(x)
19. Recombination Algorithm
[Algorithm figure not recoverable from the transcript]
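Since the algorithm figure is missing, here is a minimal sketch of a recombination procedure consistent with the preceding slides. The PT-then-LT-then-SMT fallback order and the `<k>` placeholder format are assumptions, not the authors' exact algorithm; `pt` and `lt` are probability-sorted lookups as in the earlier sketch.

```python
def translate_fragment(x, pt, lt, smt=None):
    """Translate one mismatched fragment x: try the phrase table (PT)
    first, then word by word via the lexical table (LT), then an
    optional SMT engine; otherwise leave the fragment untranslated."""
    if x in pt:
        return pt[x][0]  # best-scoring target phrase
    words = [lt[w][0] for w in x.split() if w in lt]
    if words:
        return " ".join(words)
    return smt(x) if smt else x

def recombine(skeleton, fragments, pt, lt, smt=None):
    """Substitute each numbered placeholder <k> in the skeleton
    translation Sct with the translation of the k-th mismatched
    fragment taken from the input sentence S."""
    for k, x in enumerate(fragments):
        skeleton = skeleton.replace("<%d>" % k, translate_fragment(x, pt, lt, smt))
    return skeleton
```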
20. Experiments
- We conduct 5 different experiments.
- Baselines:
  - SMT: uses OpenMaTrEx (http://www.openmatrex.org)
  - EBMT: based on the matching step alone; we take the skeleton translation as the desired output
- Our approach:
  - EBMT + TM(PT): uses only the phrase table during recombination
  - EBMT + TM(PT,LT): uses both the phrase table and the lexical table during recombination
  - EBMT + SMT: untranslated segments are translated using SMT
21. Results
- Data used for the experiments:
  - Training data: 381 parallel sentences
  - Test data: 41 sentences, disjoint from the training set
- We use BLEU and NIST scores for automatic evaluation.
System | BLEU | NIST
SMT | 39.32 | 4.84
EBMT | 50.38 | 5.32
EBMT + TM(PT) | 57.47 | 5.92
EBMT + TM(PT,LT) | 57.56 | 6.00
EBMT + SMT | 52.01 | 5.51
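For reference, corpus-level BLEU of the kind reported above can be computed with NLTK; the toy hypothesis/reference lists below are illustrative stand-ins, not the paper's test data.

```python
from nltk.translate.bleu_score import corpus_bleu

# toy stand-ins for the 41-sentence test set and one system's outputs
system_outputs = ["ok , i have booked you in for eleven fifteen ."]
reference_translations = ["ok , i have booked you in for eleven fifteen ."]

hypotheses = [h.split() for h in system_outputs]
references = [[r.split()] for r in reference_translations]  # one reference each
print("BLEU = %.2f" % (100 * corpus_bleu(references, hypotheses)))
```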
22. Results (continued)
- Manual evaluation: 4 different native speakers were asked to rate the translations on two scales.
Fluency | Adequacy
5 = Flawless Bangla | 5 = All
4 = Good Bangla | 4 = Most
3 = Non-native Bangla | 3 = Much
2 = Disfluent Bangla | 2 = Little
1 = Incomprehensible | 1 = None

System | Fluency | Adequacy
SMT | 3.00 | 3.16
EBMT + TM(PT) | 3.50 | 3.55
EBMT + TM(PT,LT) | 3.50 | 3.70
EBMT + SMT | 3.44 | 3.52
23. Example Translations
[Example translations not recoverable from the transcript]
24. Assessment of Error Types
- Wrong source-target alignments in the phrase table and the lexical table result in incorrect alignment of fragments during adaptation.
25. Assessment of Error Types (continued)
- Erroneous translations can be generated during recombination:

  in a few minutes = "in" + "a few minutes"
  in → a. [Bangla] (niYe)  b. [Bangla] (niYe Asate)  c. [Bangla] (Asuna)
  a few minutes → [Bangla] (kaYeka miniTa derite)
  in a few minutes → [Bangla] (niYe kaYeka miniTa derite), which combines an inappropriate sense of "in"
26. Observations
- The baseline EBMT system has higher accuracy on all metrics than the baseline SMT system.
- The combination of EBMT and TM has better accuracy than both the baseline SMT and EBMT systems.
- The combination of SMT with EBMT shows some improvement over baseline EBMT, but has lower accuracy than the combination of TM with EBMT.
27. Conclusion and Future Work
- We have presented initial investigations into combining TM within an EBMT framework.
- The integration of TM with EBMT has improved translation quality.
- The error analysis shows that syntax-based matching and adaptation might help to reduce false-positive adaptations.
- The use of morpho-syntactic information during recombination might improve translation quality.
28. Thank you!
Questions?