1. Comparing Example-Based and Statistical Machine Translation
- Andy Way
- Nano Gough, Declan Groves
- National Centre for Language Technology
- School of Computing, Dublin City University
- away,ngough,dgroves_at_computing.dcu.ie
- To appear in the Journal of Natural Language Engineering, June 2005
- To appear in the Workshop on Building and Using Parallel Texts: Data-Driven MT and Beyond, ACL-05, June 2005
2. Plan of the Talk
- Basic Situation in MT today.
- Statistical MT (SMT).
- Example-Based MT (EBMT).
- Differences between Phrase-based SMT and EBMT.
- Our Marker-based EBMT system.
- Testing EBMT vs. word- and phrase-based SMT.
- Results and Observations.
- Concluding Remarks.
- Future Research Avenues.
3. What is the Situation today in MT?
- Most MT research undertaken today is corpus-based (compared with rule-based methods).
- Two main data-driven approaches:
- Example-Based MT (EBMT)
- Statistical MT (SMT)
- SMT is by far the more dominant paradigm.
4. How does EBMT work?
[Diagram: an input sentence EX is searched against the example base; matching fragments (F2, F4) are retrieved and recombined into the output FX.]
5. A (much simplified) Example
- Given in corpus:
  - John went to school → Jean est allé à l'école.
  - The butcher's is next to the baker's → La boucherie est à côté de la boulangerie.
- Isolate useful fragments:
  - John went to → Jean est allé à
  - the baker's → la boulangerie
- We can now translate (see the sketch below):
  - John went to the baker's → Jean est allé à la boulangerie.
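As an illustration of the recombination step, here is a minimal sketch, assuming a toy fragment store and greedy longest-match-first lookup (the marker-based system described later is considerably more sophisticated):

```python
# Toy EBMT recombination: greedily cover the input with the longest known
# source fragments and concatenate their stored translations.
# The fragment store below is just the two fragments isolated above.
fragments = {
    ("John", "went", "to"): "Jean est allé à",
    ("the", "baker's"): "la boulangerie",
}

def translate(tokens):
    output, i = [], 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):      # longest match first
            if tuple(tokens[i:j]) in fragments:
                output.append(fragments[tuple(tokens[i:j])])
                i = j
                break
        else:
            output.append(tokens[i])             # pass unknown words through
            i += 1
    return " ".join(output)

print(translate("John went to the baker's".split()))
# -> Jean est allé à la boulangerie
```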
6. How does SMT work?
- SMT deduces language and translation models from huge quantities of monolingual and bilingual data, using a range of theoretical approaches to probability distribution and estimation.
- The translation model establishes the set of target-language words (and, more recently, phrases) which are most likely to be useful in translating the source string.
  - It takes into account source and target word (and phrase) co-occurrence frequencies, sentence lengths and the relative sentence positions of source and target words.
- The language model tries to assemble these words (and phrases) in the best possible order.
  - It is trained by determining all bigram and/or trigram frequency distributions occurring in the training data (see the sketch below).
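As a concrete illustration of the language-model side, a bigram model can be trained by simple relative-frequency counting (a bare-bones sketch; real systems add smoothing and trigrams):

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Relative-frequency bigram estimates P(w2 | w1); no smoothing."""
    bigrams, history = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        history.update(tokens[:-1])              # counts of the conditioning word
        bigrams.update(zip(tokens, tokens[1:]))  # counts of adjacent word pairs
    return {bg: c / history[bg[0]] for bg, c in bigrams.items()}

lm = train_bigram_lm(["jean est allé à l' école",
                      "la boucherie est à côté de la boulangerie"])
print(lm[("est", "allé")])   # 0.5: 'est' occurs twice, once followed by 'allé'
```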
7. The Paradigms are Converging
- It is harder than it has ever been to describe the differences between the two methods.
- This used to be easy:
  - From the beginning, EBMT has sought to translate new texts by means of a range of sub-sentential data, both lexical and phrasal, stored in the system's memory.
  - Until quite recently, SMT models of translation were based on the simple IBM word alignment models of [Brown et al., 1990].
8. From word- to phrase-based SMT
- SMT systems now learn phrasal as well as lexical alignments, e.g. [Koehn, Och & Marcu, 2003; Och, 2003].
- Unsurprisingly, the quality of today's phrase-based SMT systems is considerably better than that of the poorer word-based models.
- Despite the fact that EBMT models have been modelling lexical and phrasal correspondences for 20 years, no papers on SMT acknowledge this debt to EBMT, nor describe their approach as "example-based".
9. Differences between EBMT and Phrase-Based SMT?
- EBMT alignments remain available for reuse in the system, whereas (similar) SMT alignments disappear into the probability models.
- SMT systems never learn from previously encountered data: when SMT sees a string it has seen before, it processes it in the same way as unseen data; EBMT will simply look up such strings in its databases and output the translation quite straightforwardly.
- Depending on the model, EBMT builds in (some) syntax at its core; most SMT systems only use models of syntax in a post hoc reranking process, and even here, [Koehn et al., JHU Workshop 2003] demonstrated that bolting on syntax in this manner did not improve translation quality.
- Given (3), phrase-based SMT systems are likely to learn (some) chunks that EBMT systems would not.
10. SMT chunks are different from EBMT chunks
- En: Mary did not slap the green witch →
- Sp: Maria no dió una bofetada a la bruja verde.
- (Lit.: Mary not gave a slap to the witch green.)
- From this aligned example, an SMT system would potentially learn the following phrases, along with many others (see the sketch below):
  - slap the → dió una bofetada a
  - slap the → dió una bofetada a la
  - the green witch → a la bruja verde
- NB: SMT essentially learns n-gram sequences, rather than phrases per se.
- [Koehn & Knight, AMTA-04 SMT Tutorial Notes]
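Phrases like these fall out of alignment-consistent phrase extraction. Here is a compact sketch of that idea (after Koehn, Och & Marcu, 2003); the word alignment below is written out by hand for this example, whereas a real system would induce it automatically with a tool such as Giza:

```python
def extract_phrases(src, trg, links, max_len=5):
    """Collect all source spans whose aligned target span contains no word
    linked outside the source span (the 'consistency' condition)."""
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            ts = [t for s, t in links if i1 <= s <= i2]
            if not ts:
                continue
            j1, j2 = min(ts), max(ts)
            if all(i1 <= s <= i2 for s, t in links if j1 <= t <= j2):
                pairs.add((" ".join(src[i1:i2 + 1]),
                           " ".join(trg[j1:j2 + 1])))
    return pairs

src = "Mary did not slap the green witch".split()
trg = "Maria no dió una bofetada a la bruja verde".split()
links = [(0, 0), (1, 1), (2, 1), (3, 2), (3, 3), (3, 4),
         (4, 6), (5, 8), (6, 7)]            # 'a' (index 5) left unaligned
for pair in sorted(extract_phrases(src, trg, links)):
    print(pair)
```

This minimal version anchors each target span at its outermost aligned words, so it yields e.g. "slap the → dió una bofetada a la" and "the green witch → la bruja verde"; the full algorithm additionally extends spans over unaligned words such as "a", which is how the remaining variants on this slide arise.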
11. Our Marker-Based EBMT System
- The Marker Hypothesis states that "all natural languages have a closed set of specific words or morphemes which appear in a limited set of grammatical contexts and which signal that context" [Green, 1979].
- [Table of marker words for English (and French) omitted.]
12. An Example
- En: you click apply to view the effect of the selection →
- Fr: vous cliquez sur appliquer pour visualiser l'effet de la sélection
- Source/target aligned sentences are traversed word by word and automatically tagged with their marker categories (sketched below):
  - <PRON>you click apply <PREP>to view <DET>the effect <PREP>of <DET>the selection →
  - <PRON>vous cliquez <PREP>sur appliquer <PREP>pour visualiser <DET>l'effet <PREP>de <DET>la sélection
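A minimal sketch of this tagging step; the closed-class word lists here are illustrative and far from complete (the real system uses fuller marker sets for English and French, after Green, 1979):

```python
# Hypothetical marker sets for the example above.
MARKERS = {
    "PRON": {"you", "vous"},
    "DET":  {"the", "le", "la", "l'"},
    "PREP": {"to", "of", "sur", "pour", "de", "à"},
}

def tag_markers(tokens):
    """Prefix each marker word with its marker category tag."""
    tagged = []
    for tok in tokens:
        for cat, words in MARKERS.items():
            if tok.lower() in words:
                tagged.append(f"<{cat}>{tok}")
                break
        else:
            tagged.append(tok)       # non-marker word, left untouched
    return tagged

print(" ".join(tag_markers(
    "you click apply to view the effect of the selection".split())))
# <PRON>you click apply <PREP>to view <DET>the effect <PREP>of <DET>the selection
```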
13. Deriving Sub-Sentential Source/Target Chunks
- From these tagged strings, we generate the following aligned marker chunks:
  - <PRON> you click apply : vous cliquez sur appliquer
  - <PREP> to view : pour visualiser
  - <DET> the effect : l'effet
  - <PREP> of the selection : de la sélection
- New source and target (not necessarily source/target!) fragments begin where marker words are met and end at the next marker word; cognates, mutual information (MI) etc. → source/target sub-sentential alignments.
- One further constraint: each chunk must contain at least one non-marker word (cf. the 4th marker chunk); see the sketch below.
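A sketch of the monolingual chunking step (pairing source and target chunks via cognates and MI is a separate process, not shown). Note how the constraint keeps "<PREP>of" open until a non-marker word arrives:

```python
def marker_chunks(tagged):
    """Open a new chunk at each marker word; never close a chunk that does
    not yet contain a non-marker word (the constraint above, cf. 'of')."""
    chunks, current = [], []
    for tok in tagged:
        if tok.startswith("<") and any(not t.startswith("<") for t in current):
            chunks.append(current)
            current = []
        current.append(tok)
    if current:
        chunks.append(current)
    return chunks

tagged = ("<PRON>you click apply <PREP>to view "
          "<DET>the effect <PREP>of <DET>the selection").split()
for chunk in marker_chunks(tagged):
    print(" ".join(chunk))
# <PRON>you click apply
# <PREP>to view
# <DET>the effect
# <PREP>of <DET>the selection
```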
14. Deriving Lexical Mappings
- Where chunks contain just one non-marker word in both source and target, we assume these words are translations.
- Thus we can extract the following word-level translations (see the sketch below):
  - <PREP> to : pour
  - <LEX> view : visualiser
  - <LEX> effect : effet
  - <PRON> you : vous
  - <DET> the : l'
  - <PREP> of : de
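A sketch of this extraction rule, assuming aligned chunk pairs represented as token lists with fused tags as above:

```python
def word_translations(aligned_chunks):
    """Where a chunk pair has exactly one non-marker word on each side,
    read that word pair off as a lexical translation. (The marker words
    themselves, e.g. to : pour, pair off in the same way.)"""
    lexicon = []
    for src, trg in aligned_chunks:
        s = [t for t in src if not t.startswith("<")]
        f = [t for t in trg if not t.startswith("<")]
        if len(s) == 1 and len(f) == 1:
            lexicon.append((s[0], f[0]))
    return lexicon

chunks = [("<PREP>to view".split(), "<PREP>pour visualiser".split()),
          ("<DET>the effect".split(), "<DET>l' effet".split())]
print(word_translations(chunks))  # [('view', 'visualiser'), ('effect', 'effet')]
```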
15. Deriving Generalised Templates
- In a final pre-processing stage, we produce a set of generalised marker templates by replacing marker words with their tags:
  - <PRON> click apply : <PRON> cliquez sur appliquer
  - <PREP> view : <PREP> visualiser
  - <DET> effect : <DET> effet
  - <PREP> the selection : <PREP> la sélection
- Any marker-tag word pair can now be inserted at the appropriate tag location (see the sketch below).
- More general examples add flexibility to the matching process and improve coverage (and quality).
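A sketch of generalisation and template instantiation (monolingual here; in the system the same marker-tag word pair is inserted on both the source and target sides):

```python
import re

TAG = re.compile(r"^<[A-Z]+>")

def generalise(chunk):
    """Replace each tagged marker word by its bare tag."""
    return [TAG.match(tok).group(0) if TAG.match(tok) else tok
            for tok in chunk]

def instantiate(template, fillers):
    """Fill each tag slot with a word of the matching marker category."""
    return [fillers.get(tok, tok) for tok in template]

template = generalise("<PREP>to view".split())   # ['<PREP>', 'view']
print(instantiate(template, {"<PREP>": "of"}))   # ['of', 'view']
```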
16. Summary of Knowledge Sources
- The original sententially-aligned source/target pairs
- The marker-aligned chunks
- The generalised marker chunks
- The word-level lexicon
- New strings are segmented into all possible n-grams that might be retrieved from the system's memories.
- Resources are searched in the order given here, from maximal context (specific source/target sentence pairs) to minimal context (word-for-word translation); see the sketch below.
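A schematic sketch of the retrieval cascade (recombining partial matches to cover the whole input, as the system proper does, is omitted):

```python
def ngrams_longest_first(tokens):
    """All n-grams of the input, longest first."""
    for n in range(len(tokens), 0, -1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def retrieve(tokens, memories):
    """Consult the four knowledge sources in order of decreasing context."""
    for source in ("sentence_pairs", "marker_chunks",
                   "generalised_chunks", "lexicon"):
        for gram in ngrams_longest_first(tokens):
            if gram in memories[source]:
                return memories[source][gram], source
    return None, None

memories = {
    "sentence_pairs": {},
    "marker_chunks": {("to", "view"): "pour visualiser"},
    "generalised_chunks": {},
    "lexicon": {("view",): "visualiser"},
}
print(retrieve("to view".split(), memories))
# ('pour visualiser', 'marker_chunks')
```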
17. Application Areas for our EBMT System
- Seeding system memories with Penn-II Treebank phrases and translations [AMTA-02].
- Controlled Language EBMT [MT Summit-03, EAMT-04, MT Journal-05].
- Integration with web-based MT systems [CL Journal-03].
- Using the Web for translation validation (and correction, if required).
- Scalable EBMT [TMI-04, NLE Journal-05, ACL-05]:
  - largest English→French EBMT system
  - robust, wide-coverage, good quality
  - outperforms good on-line MT systems
18. What are we interested in finding out?
- Whether our marker-based EBMT system can outperform (1) word-based and (2) phrase-based SMT systems compiled from generally available tools.
- Whether such SMT systems outperform our EBMT system when given enough training text.
- Whether seeding SMT (and EBMT) systems with SMT and/or EBMT data improves translation quality.
- NB: (astonishingly) there is no previous published research comparing EBMT and SMT.
19. What have we done vs. what are we doing?
- Done:
  - WBSMT vs. EBMT
  - PBSMT seeded with:
    - SMT chunks
    - EBMT chunks
    - both knowledge sources (Hybrid Example-Based SMT)
  - PBSMT vs. EBMT
- Ongoing work:
  - EBMT seeded with:
    - SMT chunks
    - EBMT chunks
    - merged knowledge sources (Hybrid Statistical EBMT)
20. Word-Based SMT vs. EBMT
- Marker-based EBMT system [Gough & Way, TMI-04].
- To develop language and translation models for the WBSMT system, we used:
  - Giza (for word alignment)
  - the CMU-Cambridge statistical toolkit (for computing the language and translation models)
  - the ISI ReWrite Decoder (for deriving translations)
21. Experiment 1: Set-Up
- 207K English-French Sun Translation Memory (TM).
- Randomly extracted a 4K-sentence test set.
- Split the remaining sentences into three training sets (TS1-TS3) of roughly 50K (1.1M words), 100K, and 203K (4.8M words) sentence pairs, to test the impact of training set size.
- Translation performed at each stage from English→French and from French→English.
- Resulting translations evaluated using a range of automatic metrics (see the sketch below).
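Of the metrics used below (BLEU, Precision, Recall, WER, SER), the two error rates admit very short reference implementations. A minimal sketch (the experiments themselves used standard evaluation tooling):

```python
def wer(hyp, ref):
    """Word error rate: word-level edit distance / reference length."""
    h, r = hyp.split(), ref.split()
    d = [[i + j if i * j == 0 else 0 for j in range(len(r) + 1)]
         for i in range(len(h) + 1)]
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + (h[i - 1] != r[j - 1]))
    return d[-1][-1] / len(r)

def ser(hyps, refs):
    """Sentence error rate: fraction of outputs not identical to the reference."""
    return sum(h != r for h, r in zip(hyps, refs)) / len(refs)

print(wer("jean est allé à la", "jean est allé à la boulangerie"))  # ~0.167
```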
22. WBSMT vs. EBMT: English→French
- All metrics bar one suggest that EBMT outperforms WBSMT for English→French.
- The only exception is TS1, where WBSMT outperforms EBMT in terms of Precision (.674 compared to .653).
23. WBSMT vs. EBMT: English→French (contd.)
- In general, scores improve incrementally as the training data increases.
- But apart from SER, the metrics suggest that training on just over 100K sentence pairs yields better results than training on just over 200K.
- Why? Maybe due to overfitting or odd data.
- This is surprising: it is generally assumed that increasing the training data in Machine Learning approaches will improve the quality of the output translations (variance analysis, bootstrap resampling on the test set [Koehn, EMNLP-04], different test sets).
- Note especially the similarity of the WER scores and the difference in SER values: a much more significant improvement for EBMT (20.6%) than for WBSMT (0.1%).
24. WBSMT vs. EBMT: French→English
- All WBSMT scores are higher than for English→French.
- For EBMT, translations are better from French→English for BLEU, Recall and SER, but worse for WER (FR-EN .508, EN-FR .448) and Precision (FR-EN .678, EN-FR .736).
25. WBSMT vs. EBMT: French→English (contd.)
- For TS1, EBMT does not outperform WBSMT from French→English on any of the five metrics.
- For TS2, EBMT beats WBSMT in terms of BLEU, Recall and SER (66.5% compared to 81.3% for WBSMT), while WBSMT obtains higher scores for Precision and WER (46.2% compared to 55.2%).
- For TS3, WBSMT again beats EBMT in terms of Precision (2.5%) and WER (4%; both less significant differences than for TS1 and TS2), but EBMT wins out according to the other three metrics, notably by a huge 29.6% for SER.
- BLEU: WBSMT obtains significantly higher scores for French→English compared to English→French: 8% higher for TS1, 6% higher for TS2, and 12% higher for TS3. Apart from TS1, the EBMT scores for the two language directions are much more in line, indicating perhaps that EBMT may be more consistent for the same language pair in different directions.
26. Summary of Results
- Both EBMT and WBSMT achieve better translation quality from French→English than from English→French. For French→English, over the five automatic evaluation metrics for each of the three training sets, WBSMT wins out over our EBMT system in 9/15 cases.
- For English→French, EBMT beats WBSMT in 14/15 cases.
- Summing these results together, EBMT outperforms WBSMT in 20 tests, while WBSMT does better in 10 experiments.
- Assuming all of these tests to be of equal importance, EBMT appears to outperform WBSMT by a factor of two to one.
- While the results are a little mixed, it is clear that EBMT tends to outperform WBSMT on this sublanguage and on these training sets.
27. Experiment 2: Phrase-Based SMT vs. EBMT
- Same EBMT system as in the WBSMT experiment.
- To develop language and translation models for the SMT system, we:
  - used Giza to extract word alignments
  - refined these to extract Giza phrase alignments
  - constructed probability tables
  - passed these to the CMU-SRI statistical toolkit and the Pharaoh Decoder to derive translations
- Same translation pairs, training sets and test sets.
- Resulting translations evaluated using a range of automatic metrics.
28. PBSMT vs. EBMT: English→French
- PBSMT with Giza sub-sentential alignments wins out over PBSMT with EBMT data, but cf. the size of the data sets:
  - EBMT: 403,317
  - PBSMT: 1.73M
- PBSMT beats WBSMT, notably for BLEU, but is 5% worse for WER; SER is still (disappointingly) high.
- EBMT beats PBSMT, especially for BLEU, Recall, WER and SER.
29. PBSMT vs. EBMT: French→English
- PBSMT with Giza sub-sentential alignments wins out over PBSMT with EBMT data (with the same caveat about data size).
- PBSMT with both knowledge sources does better for F→E than for E→F.
- PBSMT doesn't beat WBSMT ??
- EBMT beats PBSMT.
30. Experiment 3a: Seeding Pharaoh with Giza Words and EBMT Phrases, English→French
- The hybrid PBSMT system beats the baseline PBSMT for BLEU, Precision, Recall and SER; slightly worse for WER.
- Data size: 430K (cf. PBSMT 1.73M, EBMT 403K).
- Still worse than the EBMT scores.
31. Experiment 3b: Seeding Pharaoh with Giza Words and EBMT Phrases, French→English
- The hybrid PBSMT system beats the baseline PBSMT for BLEU; slightly worse for Precision, Recall and SER; quite a bit worse for WER.
- Still shy of the results for EBMT.
32. Experiment 4a: Seeding Pharaoh with All Data, English→French
- The hybrid system beats the semi-hybrid system on all metrics.
- It loses out to the EBMT system, except for Precision.
- The data set is now >2M items.
33. Experiment 4b: Seeding Pharaoh with All Data, French→English
- The hybrid system beats the semi-hybrid system on all metrics.
- The hybrid system beats EBMT on BLEU and Precision; EBMT stays ahead for Recall and WER, and remains well ahead for SER.
34. Summary of Results: WBSMT vs. EBMT
- None of these are bad systems: for TS3, the worst BLEU score is for WBSMT, E→F, at .322.
- WBSMT loses out to EBMT 2:1 (but is better overall for F→E).
- For TS3, the WBSMT BLEU score of .446 and the EBMT score of .461 are high scores.
- For the WBSMT vs. EBMT experiments, an odd finding: higher scores for the 100K training set; to be investigated in future work.
35. Summary of Results: PBSMT vs. EBMT
- PBSMT scores better than WBSMT, but an odd result for F→E ?!
- Best PBSMT BLEU scores (with Giza data only): .375 (E→F), .420 (F→E).
- Seeding PBSMT with EBMT data gets good BLEU scores: .364 (E→F), .395 (F→E); note the differences in data size (1.73M vs. 403K).
- PBSMT loses out to EBMT.
- PBSMT SER is still very high (83-87%).
36. Summary of Results: Semi-Hybrid Systems
- Seeding Pharaoh with SMT words and EBMT phrases improves over the baseline Giza-seeded system.
- Data size diminishes considerably (430K vs. 1.73M).
- Still a worse result for the semi-hybrid system for F→E than for WBSMT ?!
- Still worse results than for EBMT.
37. Summary of Results: Fully Hybrid Systems
- Better results than for the semi-hybrid systems: E→F .426 (vs. .396), F→E .489 (vs. .427).
- Data size increases.
- For F→E, the hybrid system beats EBMT on BLEU (.461) and Precision; EBMT stays ahead for Recall and WER, and remains well ahead (27%) for SER.
38. Concluding Remarks
- Despite the convergence between EBMT and SMT, there are further gains to be made.
- Merging Giza and EBMT-induced data leads to an improved Hybrid Example-Based SMT system.
- → Lesson for the SMT community: don't disregard the large body of work on EBMT!
- We expect in further work that adding SMT sub-sentential data to our EBMT system will also lead to improvements.
- → Lesson for EBMT-ers: SMT data can help you too!
39. Future Work
- Carry out significance tests on these results.
- Investigate what's going on in the 2nd (100K) training set.
- Develop the Statistical EBMT system as described.
- Other issues in hybridity:
  - use a target LM in EBMT
  - replace the EBMT recombination process with an SMT decoder
  - try different decoders, LMs and TMs
  - factor Marker tags into the SMT probability tables
- Experiment with other training data in other sublanguage domains, especially those where larger corpora are available (e.g. Canadian Hansards, European Parliament ...).
- Try other language pairs.