Statistically Motivated Example-based Machine Translation using Translation Memory

1
Statistically Motivated Example-based Machine
Translation using Translation Memory
  • Sandipan Dandapat, Sara Morrissey, Sudip Kumar
    Naskar, Harold Somers
  • CNGL, School of Computing, DCU

2
Introduction
  • Machine Translation (MT) is the automatic transfer
    of information (syntactic and semantic) from one
    language to another
  • RBMT is characterized by linguistic rules
  • SMT is a mathematical model based on probability
    distributions estimated from a parallel corpus
  • EBMT integrates both rule-based and
    data-driven techniques
  • EBMT is often linked to a related technique,
    Translation Memory (TM), which stores past
    translations in a database
  • Both EBMT and TM rely on the idea of reusing
    existing translations
  • However, EBMT is an automated translation
    technique, whereas TM is an interactive tool for
    human translators

3
Introduction
SMT | EBMT
Works well when a significant amount of training data is available | Can be developed with a limited example base
Good for open-domain translation | Good for restricted domains; works well when the test and training sets are close
Has shown difficulties with free word order languages | Reuses the segments of a test sentence that can be found on the source side of the example base

4
Our Attempt
  • We use EBMT and TM to tackle the
    English-Bangla language pair
  • This pair has proved troublesome, with low BLEU
    scores for various SMT approaches (Islam et al., 2010)
  • We attempt to translate medical-receptionist
    dialogues, primarily for appointment scheduling
  • Our Goal
  • Integrate EBMT and TM for better translation in a
    restricted domain
  • EBMT helps to find the closest match, and TM is
    good for translating segments of a sentence

5
Creating English-Bangla Parallel Corpus
  • Task: create a manually translated
    English-Bangla parallel corpus for training
  • Points to consider: native speakers; translation
    challenges (literal vs. explicit)
  • Methodology
  • Manual translation by a native speaker
  • Discussions on translation conventions
  • Corpus example
  • English
  • Hello, can I get an appointment sometime later
    this week?
  • Bangla

???????, ?? ???????? ????? ????, ??? ??? ???? ????????????? ????? ???? ???
6
Notes on Translation Challenges
  • Non-alteration of source text
  • Literal translation of source
  • Which doctor would you prefer?
  • I don't mind
  • Bangla

7
Size and Type of the Corpora
  • Because of the stages described above, collecting
    a large amount of medical-receptionist dialogue
    is time-consuming
  • Thus, our corpus comprises 380 dialogue turns
  • In transcription, this comes to just under
    3000 words (about 8 words per dialogue turn)
  • A very small corpus by any standard

8
Note on Size of the Corpora
  • How many examples are needed to build a
    data-driven MT system?

System | Language pair | Size (examples)
TTL | English → Turkish | 488
TDMT | English → Japanese | 350
EDGAR | German → English | 303
ReVerb | English → German | 214
ReVerb | Irish → English | 120
METLA-1 | English → French | 29
METLA | English → Urdu | 7
  • No SMT system has been developed with only 380
    parallel sentences
  • But many EBMT systems have been developed with
    corpora this small

9
Structure of the Corpus
  • Medical-receptionist dialogue consists of
    very similarly structured sentences
  • Example
  • (1) a. I need a medical for my insurance company.
  • b. I need a medical for my new job.
  • (2) a. The doctor told me to come back for a
    follow-up appointment.
  • b. The doctor told me to call back in a week.
  • Thus, it can help to reuse the translation of the
    common parts when translating new sentences
  • This leads us to use EBMT

10
Main Idea
  • Input: Ok, I have booked you in for eleven fifteen
    on Friday with Dr. Thomas.
  • Fuzzy match in the example base: Ok, I have booked
    you in for three thirty on Thursday with Dr. Kelly.
  • => Part of the translation comes from the
    example-base fuzzy match
  • Part of the translation comes from Translation
    Memory or SMT
  • Ok, I have booked you in for eleven fifteen on
    Friday with Dr. Thomas.

11
Building Translation Memory (TM)
  • We build the TM automatically from our small
    patient-dialogue corpus
  • We use Moses to build two TMs
  • Aligned phrase pairs from the Moses phrase table
    (phrase table - PT)
  • Aligned word pairs based on GIZA++ (lexical
    table - LT)

LT LT
hello ?????? ???????
eleven ?????? ??????
PT PT
come in on friday instead ? ???????? ???????? ???? ?????? ?
, but dr finn , ?????? ??? ???
  • We keep all the target equivalents of a source
    phrase in the TM, sorted by phrase translation
    probability
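A minimal sketch of how such a TM could be built from phrase-table lines. Several things here are assumptions rather than details from the slides: the `source ||| target ||| scores` line layout is the commonly documented Moses phrase-table format, the direct phrase translation probability is taken to be the third score (`prob_index=2`), and the function name `build_tm` is hypothetical. The three target strings are the transliterations shown later in the talk for "in"; the probability values are invented purely for illustration.

```python
from collections import defaultdict


def build_tm(phrase_table_lines, prob_index=2):
    """Map each source phrase to its target phrases, most probable first.

    Assumes lines of the form 'source ||| target ||| s1 s2 s3 s4 ...' and that
    the score at `prob_index` is the phrase translation probability to sort by.
    """
    candidates = defaultdict(list)
    for line in phrase_table_lines:
        fields = [field.strip() for field in line.split("|||")]
        source, target = fields[0], fields[1]
        prob = float(fields[2].split()[prob_index])
        candidates[source].append((prob, target))
    return {
        source: [target for _, target in
                 sorted(pairs, key=lambda pair: pair[0], reverse=True)]
        for source, pairs in candidates.items()
    }


# Toy lines: the scores are made up, not taken from the authors' tables.
lines = [
    "in ||| niYe Asate ||| 0.2 0.1 0.3 0.1",
    "in ||| niYe ||| 0.5 0.2 0.6 0.3",
    "in ||| Asuna ||| 0.1 0.1 0.1 0.1",
]
tm = build_tm(lines)
print(tm["in"])
```

Keeping every candidate in ranked order, rather than only the best one, leaves room for a later step to fall back to a lower-ranked translation.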

13
Our Approach
  • Our EBMT system, like most, has three stages
  • Matching: find the closest match to the input
    sentence
  • Adaptability: find the translation of the desired
    segments
  • Recombination: combine the translations of the
    desired segments

14
Matching
  • We find the closest sentence (Sc) in the
    example base for the input sentence (S) to be
    translated
  • We use a word-based edit distance metric to
    find this closest-match sentence in the
    example base

S: Ok, I have booked you in for eleven fifteen on Friday with Dr. Thomas.
Sc: Ok, I have booked you in for three thirty on Thursday with Dr. Kelly.
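The matching step can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes a plain Levenshtein distance over whitespace-separated, lowercased tokens with unit costs, and the names `word_edit_distance` and `closest_match` are hypothetical. The target sides are placeholders because the Bangla text is not reproduced here.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two token lists, with unit costs."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete word_a
                            curr[j - 1] + 1,      # insert word_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def closest_match(sentence, example_base):
    """Return the (source, target) pair whose source is nearest to `sentence`."""
    tokens = sentence.lower().split()
    return min(
        example_base,
        key=lambda pair: word_edit_distance(tokens, pair[0].lower().split()),
    )


# Target sides are placeholders standing in for the Bangla translations.
example_base = [
    ("I need a medical for my new job .", "TARGET_1"),
    ("Ok , I have booked you in for three thirty on Thursday with Dr. Kelly .",
     "TARGET_2"),
]
s = "Ok , I have booked you in for eleven fifteen on Friday with Dr. Thomas ."
sc, sct = closest_match(s, example_base)
print(sc)
```

For the example on this slide, S and Sc differ in four tokens (eleven, fifteen, Friday, Thomas), so their word-level distance is 4.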
15
Matching
  • We take the translation (Sct) associated with
    Sc as the skeleton translation of the input
    sentence S

S: Ok, I have booked you in for eleven fifteen on Friday with Dr. Thomas.
Sc: Ok, I have booked you in for three thirty on Thursday with Dr. Kelly.
Sct: ????? , ??? ????? ???? ??????????? ????? ?????? ??? ????? ???? ???? ????? ?
Sct (transliterated): AchchhA, Ami ApanAra janya bRRihaspatibAra tinaTe tirishe DAH kelira sAthe buk karechhi.
We will reuse some segments of Sct to produce the new translation.
16
Adaptability
  • We extract the translation of the mismatched
    fragments of the input sentence (S)
  • To do this, we align three sentences: the input
    (S), the closest source-side match (Sc) and its
    target equivalent (Sct)
  • Mark the mismatched portions between the input
    sentence (S) and the closest source-side match
    (Sc) using edit distance
  • S: ok , i've booked you in for <eleven fifteen>
    on <friday> with dr <thomas> .
  • Sc: ok , i've booked you in for <three thirty>
    on <thursday> with dr <kelly> .
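The marking step can be illustrated with a short sketch. Note the substitution: instead of backtracing an edit-distance table, it uses Python's standard `difflib.SequenceMatcher`, which aligns sequences by longest matching blocks. On this example both approaches yield the same mismatched spans, but that is not guaranteed in general. The function name `mark_mismatches` is hypothetical.

```python
from difflib import SequenceMatcher


def mark_mismatches(s_tokens, sc_tokens):
    """Return (S fragment, Sc fragment) pairs for every non-matching span."""
    matcher = SequenceMatcher(a=s_tokens, b=sc_tokens, autojunk=False)
    return [
        (" ".join(s_tokens[i1:i2]), " ".join(sc_tokens[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep replace / insert / delete spans only
    ]


s = "ok , i've booked you in for eleven fifteen on friday with dr thomas .".split()
sc = "ok , i've booked you in for three thirty on thursday with dr kelly .".split()
pairs = mark_mismatches(s, sc)
for s_fragment, sc_fragment in pairs:
    print(f"{s_fragment!r} <-> {sc_fragment!r}")
```

Each pair lines up a fragment of S with the fragment of Sc it replaces. An insertion or deletion shows up as a pair with one empty side.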

17
Adaptability
  • We extract the translation of the mismatched
    fragments of the input sentence (S)
  • To do this, we align three sentences: the input
    (S), the closest source-side match (Sc) and its
    target equivalent (Sct)
  • Further, we align the mismatched portions of Sc
    with the associated translation Sct using our TMs
    (PT and LT)
  • S: ok , i've booked you in for <eleven fifteen>
    on <friday> with dr <thomas> .
  • Sc: ok , i've booked you in for <three thirty>
    on <thursday> with dr <kelly> .
  • Sct: ?????, ??? ????? ???? <1 ???????????>
    <0 ????? ?????>? ??? <2 ????>? ???? ???? ??????
  • The number inside each pair of angle brackets
    records which source-side fragment the target
    fragment corresponds to

18
Recombination
Substitute, add or delete segments from the input sentence (S) in the skeleton translation (Sct)

S: ok , i've booked you in for <eleven fifteen> on <friday> with dr <thomas> .
Sc: ok , i've booked you in for <three thirty> on <thursday> with dr <kelly> .
Sct: ?????, ??? ????? ???? <1 ???????????> <0 ????? ?????>? ??? <2 ????>? ???? ???? ??????
=> ?????, ??? ????? ???? <1 Friday> <0 eleven fifteen>? ??? <2 Thomas>? ???? ???? ????? ?
  • Possible ways of obtaining Tx, the translation of
    a segment x
  • Tx = SMT(x)
  • Tx = PT(x)
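A hedged sketch of the substitution case of recombination. The skeleton string is the transliterated Sct from the matching slide with its three mismatched fragments replaced by numbered slots. That slot placement is an assumption read off the bracketed example, and the suffixes that follow the brackets there are ignored. The `translate` argument stands in for whichever way Tx is obtained (PT lookup or SMT), and the function name `recombine` is hypothetical.

```python
import re


def recombine(skeleton, new_segments, translate):
    """Fill each numbered slot <k> with translate(new_segments[k])."""
    def fill(match):
        return translate(new_segments[int(match.group(1))])
    return re.sub(r"<(\d+)>", fill, skeleton)


# Slot k marks where the translation of the k-th mismatched fragment of S goes.
skeleton = "AchchhA , Ami ApanAra janya <1> <0> DAH <2> sAthe buk karechhi ."
new_segments = ["eleven fifteen", "friday", "thomas"]

# Stand-in for Tx: wrap the source segment instead of really translating it.
output = recombine(skeleton, new_segments, lambda x: f"T({x})")
print(output)
```

Because the slots carry indices, the target-side order (day before time) is preserved even though it differs from the source-side order.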

19
Recombination Algorithm
20
Experiments
  • We conduct 5 different experiments
  • Baseline
  • SMT: uses OpenMaTrEx (http://www.openmatrex.org)
  • EBMT: based on the matching step only; the
    skeleton translation is taken as the output
  • Our Approach
  • EBMT + TM(PT): uses only the phrase table during
    recombination
  • EBMT + TM(PT,LT): uses both the phrase table and
    the lexical table during recombination
  • EBMT + SMT: untranslated segments are translated
    using SMT

21
Results
  • Data used for the experiment
  • Training data: 381 parallel sentences
  • Test data: 41 sentences, disjoint from the
    training set
  • We use BLEU and NIST scores for automatic
    evaluation

System | BLEU | NIST
SMT | 39.32 | 4.84
EBMT | 50.38 | 5.32
EBMT + TM(PT) | 57.47 | 5.92
EBMT + TM(PT,LT) | 57.56 | 6.00
EBMT + SMT | 52.01 | 5.51
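To make the size of the differences easier to read, the snippet below computes each system's absolute and relative BLEU gain over the SMT baseline. It only rearranges the BLEU figures reported in the table; the system names are written with a "+" for readability.

```python
bleu = {
    "SMT": 39.32,
    "EBMT": 50.38,
    "EBMT + TM(PT)": 57.47,
    "EBMT + TM(PT,LT)": 57.56,
    "EBMT + SMT": 52.01,
}

baseline = bleu["SMT"]
# For each system: (absolute gain in BLEU points, relative gain in percent).
gains = {
    name: (score - baseline, 100 * (score - baseline) / baseline)
    for name, score in bleu.items()
    if name != "SMT"
}

for name, (absolute, relative) in gains.items():
    print(f"{name}: +{absolute:.2f} BLEU ({relative:.1f}% over SMT)")
```

With only 41 test sentences, small differences such as 57.47 vs. 57.56 should be read cautiously.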
22
Results
  • Manual evaluation: four native speakers
    were asked to rate the translations on two
    scales

Score | Fluency | Adequacy
5 | Flawless Bangla | All
4 | Good Bangla | Most
3 | Non-native Bangla | Much
2 | Disfluent Bangla | Little
1 | Incomprehensible | None

System | Fluency | Adequacy
SMT | 3.00 | 3.16
EBMT + TM(PT) | 3.50 | 3.55
EBMT + TM(PT,LT) | 3.50 | 3.70
EBMT + SMT | 3.44 | 3.52
23
Example Translations
24
Assessment of Error Types
  • Wrong source-target alignments in the phrase table
    and lexical table lead to incorrect alignment of
    the mismatched segments

25
Assessment of Error Types
  • Erroneous translations are generated during
    recombination

Segment: in a few minutes
in - a. ???? (niYe) b. ???? ???? (niYe Asate) c. ???? (Asuna)
a few minutes - ???? ????? ?????? (kaYeka miniTa derite)
in a few minutes - ???? ???? ????? ?????? (niYe kaYeka miniTa derite)
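The error above can be reproduced with a small sketch of segment-by-segment lookup. The greedy longest-match strategy, the single merged `tm` dictionary, and the assumption that candidate (a) is the top-ranked entry are all illustrative choices, not details confirmed by the slides. The function name `translate_segment` is hypothetical, and the target strings are the transliterations from the slide.

```python
def translate_segment(segment, tm):
    """Greedy longest-match lookup; `tm` maps source phrases to ranked targets."""
    words = segment.split()
    output, i = [], 0
    while i < len(words):
        # Try the longest phrase starting at position i first.
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in tm:
                output.append(tm[phrase][0])  # take the top-ranked candidate
                i = j
                break
        else:
            output.append(words[i])  # unknown word: leave it untranslated
            i += 1
    return " ".join(output)


tm = {
    "in": ["niYe", "niYe Asate", "Asuna"],
    "a few minutes": ["kaYeka miniTa derite"],
}
result = translate_segment("in a few minutes", tm)
print(result)
```

Each piece is translated in isolation and the results are concatenated, so nothing checks whether the combined phrase is well-formed. This is the kind of error the slide illustrates.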
26
Observations
  • Baseline EBMT scores higher than the baseline SMT
    system on both automatic metrics (BLEU 50.38 vs.
    39.32, NIST 5.32 vs. 4.84)
  • The combination of EBMT and TM scores higher
    than both the baseline SMT and EBMT systems
  • The combination of SMT with EBMT improves
    somewhat over baseline EBMT but scores lower
    than the combination of TM with EBMT

27
Conclusion and Future Work
  • We have presented initial investigations into
    combining TM within an EBMT framework
  • Integrating TM with EBMT improved
    translation quality
  • The error analysis suggests that syntax-based
    matching and adaptation might help to reduce
    false-positive adaptations
  • Using morpho-syntactic information during
    recombination might improve translation
    quality

28
Thank you!
Questions?