Title: Confidence Estimation for Machine Translation
1. Confidence Estimation for Machine Translation
- Lucia Specia
- Xerox Research Centre Europe Grenoble
- (collaboration with Marco Turchi and Nello Cristianini, U. Bristol)
2. Outline
- The task of Confidence Estimation for Machine Translation
- Our approach
- Method
- Features
- Algorithms
- Experiments
- On-going and future work
3. The task of CE for MT
- Goal: given the output of a Machine Translation (MT) system for a given input, provide an estimate of its quality.
- Motivation: assessing the quality of translations is
- Time consuming: reading the translation takes time
- "Los piratas se han incautado un gigante petrolero saudita llevando su carga completa de 2m de barriles - más de una cuarta parte de la Arabia Saudita de la producción diaria - el sábado frente a la costa de Kenya (unas 450 millas náuticas al sudeste de Mombasa) y la dirección hacia la puerto somalí de AEL, la Marina de los EE.UU. dice."
- Not possible if the user does not know the source language
- [The same example shown in a non-Latin script; the characters were lost in extraction (only "450" and the port name "Eyl" survive).]
4. The task of CE for MT
- Uses:
- Filter out bad translations to avoid professional translators wasting time reading / post-editing them.
- Make end-users aware of the translation quality.
Is it worth providing this translation as a suggestion to the professional translator?
Should this translation be highlighted as suspect to the reader?
5-6. General approach
- Different from MT evaluation (BLEU/NIST): reference translations are NOT available
- Unit: word, phrase or sentence
- Embedded in the SMT system (word or phrase probabilities) or a dedicated layer (machine learning problem)
- Binary problem: distinguish between good and bad translations
7. Related work (sentence-level)
- Workshop at JHU (Blatz et al., Coling-04); MSR (Quirk, LREC-04)
- Automatic MT metrics or few manually assessed cases
- Poor analysis of the contribution of different features
- Predictions did not have a positive effect on practical tasks
- MSR (Gamon et al., EAMT-05)
- Human-likeness classification
- Resource-dependent features
- Poor performance compared to BLEU; little correlation with human judgments
- One MT system, one domain, one language pair
- Only good / bad estimates (binary task)
8. Our approach
- Sentence level: a natural scenario for MT
- Many resource- and language-independent features
- Take the contribution of features into account
- MT-system-dependent vs. -independent features
- Machine learning problem: regression
- Any continuous score
- Human annotation as training data
- Several MT systems, text domains, language pairs and quality scores
- Results useful in practical applications
9. Method
- Identify and extract information sources.
- Refine the set of information sources to keep only the relevant ones:
- Increase performance.
- Decrease extraction cost (time).
- Learn a model to produce quality scores for new translations.
- Use the CE score in some application (a sketch of the pipeline follows).
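A minimal sketch of steps 2-4 of this pipeline in Python; the function names, toy data and hyper-parameters are illustrative, not taken from the system described in these slides.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def ce_pipeline(X_train, y_train, X_new, selected, n_components=2):
    """Keep only the selected features, learn a quality model,
    and score new, unseen translations."""
    model = PLSRegression(n_components=n_components)
    model.fit(X_train[:, selected], y_train)
    return model.predict(X_new[:, selected]).ravel()

# Toy usage: 100 training sentences, 77 features, scores in [1, 4].
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 77)), rng.uniform(1, 4, size=100)
scores = ce_pipeline(X, y, rng.normal(size=(5, 77)), selected=np.arange(20))
```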
10. Features
- Most of those identified in previous work, plus new ones
- Black-box (77): from the input and translation sentences, monolingual or parallel corpora, e.g.:
- Source and target sentence lengths and their ratios
- Language model and other statistics in the corpus
- Shallow syntax checking (target, and target against source)
- Average number of possible translations per source word (SMT)
- Practical scenario:
- Useful when it is not possible to access the internal features of the MT systems (e.g. commercial systems).
- Provides a way to perform the task of CE across different MT systems, which may use different frameworks (a sketch of a few such features follows).
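To illustrate the black-box idea, the sketch below computes a handful of the features named above (lengths, their ratio, number/punctuation mismatches) from nothing but the source and target strings; the function name and feature keys are invented for the example.

```python
import re

def black_box_features(src, tgt):
    """An illustrative subset of the 77 black-box features."""
    src_toks, tgt_toks = src.split(), tgt.split()
    feats = {
        "src_len": len(src_toks),
        "tgt_len": len(tgt_toks),
        "len_ratio": len(tgt_toks) / max(len(src_toks), 1),
    }
    # Mismatch in numbers and punctuation between source and target.
    for name, pat in [("num", r"\d+"), ("punct", r"[^\w\s]")]:
        s, t = len(re.findall(pat, src)), len(re.findall(pat, tgt))
        feats[f"{name}_mismatch"] = abs(s - t)
    return feats

print(black_box_features("2m barrels were seized.", "2m de barriles incautados."))
```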
11. Features
- Glass-box (54): depend on some aspect of the translation process, e.g.:
- Language model (target) using the n-best list, word- or phrase-based
- Proximity to other hypotheses in the n-best list
- MT base model features
- Distortion count, gap count, (compositional) bi-phrase probability
- Search nodes in the graph (aborted, pruned)
- Proportion of unknown words in the source
- Richer scenario:
- When it is possible to access the internal features of the MT systems (a sketch follows).
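A comparable sketch for glass-box features, assuming access to the decoder's n-best list and source-side vocabulary; the names are again illustrative.

```python
def glass_box_features(nbest, src_tokens, vocab):
    """An illustrative subset of glass-box features drawn from the decoder."""
    sizes = [len(hyp.split()) for hyp in nbest]
    return {
        # size of the n-best list and average hypothesis length
        "nbest_size": len(nbest),
        "avg_hyp_len": sum(sizes) / max(len(sizes), 1),
        # proportion of source words unknown to the translation model
        "unk_ratio": sum(w not in vocab for w in src_tokens) / max(len(src_tokens), 1),
    }

feats = glass_box_features(["the cat sat", "a cat sat down"],
                           ["le", "chat", "xyzzy"], {"le", "chat"})
```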
12. Learning methods
- Feature selection: Partial Least Squares (PLS)
- Regression: PLS, SVM
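Both learners are available off the shelf, e.g. in scikit-learn; a sketch on toy data (the real experiments use the extracted CE features, and the hyper-parameters here are placeholders):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.svm import SVR

# Toy stand-in for a CE dataset: 200 sentences x 77 features, scores 1-4.
rng = np.random.default_rng(1)
X_train, y_train = rng.normal(size=(200, 77)), rng.uniform(1, 4, 200)
X_test = rng.normal(size=(50, 77))

pls = PLSRegression(n_components=10).fit(X_train, y_train)
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X_train, y_train)
pls_scores = pls.predict(X_test).ravel()   # continuous quality estimates
svr_scores = svr.predict(X_test)
```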
13. Partial Least Squares Regression
- Given two matrices X (input variables) and Y (response variables), predict Y from X and describe their common structure.
- Projects the original data onto a different space of latent variables (or components).
- Provides, as a by-product, an ordering of the original features according to their importance.
- Particularly indicated when the features in X are strongly correlated (multicollinearity), which is the case for CE datasets.
- Widely applied in other fields, but not yet for NLP.
14. Partial Least Squares Regression
- Ordinary multiple regression problem:
  $Y = X B_w + F$
- where:
- $B_w$ is the regression matrix, computed directly using an optimal number of components.
- $F$ is the residual matrix.
- When X is standardized, an element of $B_w$ with large absolute value indicates an important X-variable.
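Under these definitions, ranking features by |Bw| after standardizing X can be sketched as follows; the toy data and variable names are illustrative.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X, y = rng.normal(size=(200, 77)), rng.uniform(1, 4, 200)  # toy data

X_std = StandardScaler().fit_transform(X)          # standardize X first
pls = PLSRegression(n_components=5).fit(X_std, y)
bw = np.abs(pls.coef_).ravel()                     # |Bw|, one value per feature
ranking = np.argsort(bw)[::-1]                     # most to least important
```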
15. Feature Selection with PLS
- Method:
- Compute the $B_w$ matrix on some training data for different numbers of components (all possible).
- Sort the absolute values of the bw-coefficients; this produces a list of features from the most important to the least important (Lb).
- Select the n best features, training and testing on a validation set according to some objective criterion.
- Train and test these n features on a test set.
- Evaluate predictions using appropriate metrics (a sketch of the selection loop follows).
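One plausible rendering of the selection step: walk down the ranked list Lb, fit on the top-n features, and keep the n with the lowest validation RMSPE. Data and grid are toy placeholders.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
X_tr, y_tr = rng.normal(size=(150, 77)), rng.uniform(1, 4, 150)
X_val, y_val = rng.normal(size=(50, 77)), rng.uniform(1, 4, 50)
ranking = rng.permutation(77)                  # stands in for the list Lb

best_n, best_err = None, np.inf
for n in range(5, 78, 5):                      # candidate feature-set sizes
    cols = ranking[:n]                         # top-n features from Lb
    m = PLSRegression(n_components=min(5, n)).fit(X_tr[:, cols], y_tr)
    pred = m.predict(X_val[:, cols]).ravel()
    err = np.sqrt(np.mean((pred - y_val) ** 2))   # RMSPE (see slide 21)
    if err < best_err:
        best_n, best_err = n, err
```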
16. Feature Selection with PLS
- Method (detail of the sorting step):
- The ranking is computed for each i-th training subsample, obtaining several lists Lb(i).
- The final list L is obtained by picking the most voted feature for each column (mode), e.g. L = (66, 56, ..., 35, 10); see the sketch below.
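A possible reading of the mode-voting step, with tiny toy rankings; `mode_vote` is a hypothetical helper, and ties are broken by first occurrence.

```python
from collections import Counter

def mode_vote(rankings):
    """Combine per-subsample feature rankings Lb(i) into one list L by
    taking, at each rank, the most voted feature not yet chosen."""
    final, used = [], set()
    for rank in range(len(rankings[0])):
        votes = Counter(r[rank] for r in rankings)
        for feat, _ in votes.most_common():
            if feat not in used:
                final.append(feat)
                used.add(feat)
                break
    return final

L = mode_vote([[66, 56, 35], [66, 10, 35], [56, 66, 10]])  # toy Lb(i) lists
print(L)  # -> [66, 56, 35]
```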
17. Feature Selection with PLS
- Method (detail of the selection step):
- The n features are selected by training and testing on a validation set according to an objective criterion.
- Objective criterion: RMSPE.
- Analyze learning curves to select the top n features.
18. Feature Selection with PLS
- [Figure: learning curves used to select the top n features; not preserved in extraction]
19. Feature Selection with PLS
- Method (detail of the final training step):
- Train and test these n features on a test set,
- using SVM or PLS with the optimal number of components.
20. Feature Selection with PLS
- Method (detail of the evaluation step): evaluate predictions.
- Metric: Root Mean Squared Prediction Error (RMSPE).
21. Feature Selection with PLS
- Root Mean Squared Prediction Error (RMSPE):
  $\mathrm{RMSPE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}$
- where:
- N = number of test cases
- TP = true positives, FP = false positives (used when scores are thresholded)
- $y$ = expected value, $\hat{y}$ = the prediction obtained
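The metric itself is one line of NumPy:

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root Mean Squared Prediction Error over N test cases."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

print(rmspe([4, 2, 3], [3.5, 2.4, 3.1]))  # -> ~0.37
```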
22. Experiments
- Datasets:
- Europarl data with quality scores from automatic metrics.
- News data with manually assigned quality scores (1-5).
- Europarl with manually assigned quality scores (1-4).
- Technical documents with manually assigned quality scores (1-4).
- Technical documents with post-edition time annotation.
23. GKLS 1-4 en-es dataset
- WMT-2008 Europarl English-Spanish dev/test data
- 4K translations; SMT systems: Matrax, Portage, Sinuhe and MMR (P-ES-1, P-ES-2, P-ES-3 and P-ES-4)
- Quality score: 1-4
- Features: P-ES-1gb: 131; others: 77 black-box
- Little gain with glass-box features
- Good from a practical point of view
24. GKLS 1-4 en-dk dataset
- Automotive documents, English-Danish
- En-Es is a reasonably close language pair, so try En-Dk
- 2K translations; SMT system trained on 170K parallel sentences (Matrax)
- Quality score: 1-4
- Features: black-box + glass-box
- Results (RMSPE): considerable gain in using glass-box features
- Expected with more distant language pairs
25. GKLS post-edition time dataset
- Automotive documents, English-Russian
- 3K translations; SMT systems trained on 250K sentences: Matrax, Portage and MMR (P-ER-1, P-ER-2 and P-ER-3)
- Quality score: post-edition time in seconds, normalized by sentence length (sketch below)
- Given a source sentence in English and its translation into Russian, a professional translator post-edited the translation to make it into a good-quality sentence, while the time was recorded.
- Features: 77 black-box
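The normalization described above amounts to seconds per word; a trivial sketch:

```python
def pe_time_score(seconds, translation):
    """Post-edition time normalized by sentence length (seconds per word)."""
    return seconds / max(len(translation.split()), 1)

# A translator taking 45 s to fix a 15-word sentence scores 3.0 s/word.
print(pe_time_score(45.0, " ".join(["word"] * 15)))  # -> 3.0
```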
26. Discussion
- Results for a subset of features outperform all features.
- GKLS 1-4: CE models deviate by 0.6-0.7. E.g., a sentence that should be classified as "fit for purpose" would never be classified as "requires complete retranslation".
- Post-edition times vary considerably from system to system. E.g., P-ES-1 (2): CE models deviate by up to 2 seconds/word.
27. Manually annotated datasets
- Best features (BB):
- source and target sentence 3-gram language model probability
- source and target sentence lengths
- percentage of, and mismatch in, numbers and punctuation symbols in the source and target
28. Manually annotated datasets
- Best features (GB):
- size of the n-best list
- sentence n-gram log-probabilities using the n-best list for training a LM (using words or phrases)
- bi-phrase count
- distortion count
- bi-phrase probability
- translation model
- average size of hypotheses in the n-best list
- number of search graph nodes in the final decoder's graph
29. More results: GKLS 1-4 en-es
- System combination (sketch below):
- Produce CE predictions for a test set decoded by 4 systems.
- Sort the four CE predictions for each test instance.
- Select the translation with the highest score.
- Evaluate whether this was the best sentence according to the human annotation.
- Matches for the top-scored translation: 659 / 802 (82.1%).
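The combination scheme reduces to a per-instance argmax over the four CE predictions; a toy sketch with invented shapes and values:

```python
import numpy as np

def combine_by_ce(ce_scores):
    """ce_scores: (n_instances, n_systems) CE predictions; returns, per
    test instance, the index of the system whose output scored highest."""
    return np.argmax(ce_scores, axis=1)

# Toy example with 3 instances x 4 systems:
choice = combine_by_ce(np.array([[3.1, 2.2, 2.9, 3.4],
                                 [1.8, 2.5, 2.0, 1.9],
                                 [3.9, 3.2, 3.3, 3.0]]))
print(choice)  # -> [3 1 0]
```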
30. More results: GKLS 1-4 en-es
- Predictive power of features: Pearson's correlation [table not preserved in extraction]
31. More results: GKLS 1-4 en-es
- CE score vs. MT metrics: Pearson's correlation [table not preserved in extraction]
32-33. More results: GKLS 1-4 en-es
- Filter out bad translations for Language Service Providers [figures not preserved in extraction]
34. More results: GKLS 1-4 en-dk
- Predictive power of features: Pearson's correlation (sketch below); the strongest features include:
- Source LM log-probability
- Target sentence 1-gram perplexity, using the n-best list for training a LM
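Pearson's correlation between a feature column and the human scores is a one-liner with SciPy; the data below are invented stand-ins:

```python
from scipy.stats import pearsonr

# Toy stand-ins for one feature column and the human quality scores.
feature = [7.1, 5.3, 8.0, 4.2, 6.6]
scores = [3, 2, 4, 1, 3]
r, p = pearsonr(feature, scores)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```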
35. More results: GKLS 1-4 en-dk
- Filter out bad translations for Language Service Providers; the aim is to do better than sentence length.
- Rank (300) test sentences according to true score, CE score or sentence length, and take the average true score for the top-n (sketch below).
- CE score > 2.3 (~3 true score) vs. sentence length < 12.
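The ranking experiment can be sketched as follows; `avg_true_score_of_top_n` is a hypothetical helper, and the arrays are toy values.

```python
import numpy as np

def avg_true_score_of_top_n(key, true_scores, n):
    """Rank sentences by `key` (CE score, sentence length, ...) descending
    and return the mean human score of the top-n kept sentences."""
    top = np.argsort(np.asarray(key))[::-1][:n]
    return float(np.mean(np.asarray(true_scores)[top]))

ce = [3.4, 1.2, 2.8, 3.9]          # toy CE predictions
human = [4, 1, 3, 4]               # toy human scores
print(avg_true_score_of_top_n(ce, human, n=2))  # -> 4.0
```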
36. More results: GKLS 1-4 en-dk
- Using thresholds on the CE score or sentence length we select:
- 86% of the good sentences (human scores 3 or 4)
- bad sentences (human scores 1 or 2)
- When sentence length and CE score disagree: [results not preserved in extraction]
37. Discussion
- Results are considered satisfactory (except for post-edition time).
- Prediction errors are similar across different language pairs.
- The quality of the MT system has some influence.
- Predictions correlate better with human scores than metrics using reference translations.
- Prediction error would yield uncertainty only in the boundaries between two adjacent categories.
- Estimating continuous scores is more appropriate than binary classification, even for a binary application.
- Use of Inductive Confidence Machines to threshold the predicted score.
38. Discussion
- The most relevant features include many that have not been used before:
- average size of the phrases in the target,
- several mismatches between the source and target,
- proportion of aborted search nodes, etc.
- Future work: further investigate uses for these most relevant features:
- In SMT models, to improve translation quality.
- To complement existing features in SMT models.
- To rerank n-best lists produced by SMT systems.
39. Discussion
- In MT evaluation:
- To provide additional features to reference-based metrics based on ML algorithms.
- To provide a score to be combined with other MT evaluation metrics.
- To provide a new evaluation metric in itself, with some function to optimize the correlation with human annotations, without the need for reference translations.
40. Discussion
- Uses of the CE score for other applications:
- Cross-language information retrieval
- Finding parallel data in comparable corpora
- ...
- Whenever it is important to identify whether a target sentence is a GOOD translation of a source sentence.
41. Thanks!
- Lucia Specia
- lucia.specia_at_xrce.xerox.com
42. The source
- Pirates have seized a giant Saudi oil tanker
carrying its full load of 2m barrels - more than
one-quarter of Saudi Arabia's daily output - on
Saturday off the Kenyan coast (some 450 nautical
miles south-east of Mombasa) and are steering it
towards the Somali port of Eyl, the US Navy says.
BBC News, 17/11/08