Title: Confidence Estimation for Machine Translation
1. Confidence Estimation for Machine Translation
- Lucia Specia
- Xerox Research Centre Europe Grenoble
- (collaboration with Marco Turchi and Nello Cristianini, U. Bristol)
2. Outline
- The task of Confidence Estimation for Machine Translation
- Our approach
- Method
- Features
- Algorithms
- Experiments
- On-going and future work
3. The task of CE for MT
- Goal: given the output of a Machine Translation (MT) system for a given input, provide an estimate of its quality.
- Motivation: assessing the quality of translations is
- Time consuming: reading the translation takes time
- "Los piratas se han incautado un gigante petrolero saudita llevando su carga completa de 2m de barriles - más de una cuarta parte de la Arabia Saudita de la producción diaria - el sábado frente a la costa de Kenya (unas 450 millas náuticas al sudeste de Mombasa) y la dirección hacia la puerto somalí de AEL, la Marina de los EE.UU. dice."
- Not possible if the user does not know the source language
- [The same example shown in a non-Latin script; the characters were lost in extraction (only "450" and the port name "Eyl" survive).]
4. The task of CE for MT
- Uses:
- Filter out bad translations to avoid professional translators wasting time reading / post-editing them.
- Make end-users aware of the translation quality.
Is it worth providing this translation as a suggestion to the professional translator?
Should this translation be highlighted as suspect to the reader?
5-6. General approach
- Different from MT evaluation (BLEU/NIST): reference translations are NOT available
- Unit: word, phrase or sentence
- Embedded in the SMT system (word or phrase probabilities) or a dedicated layer (machine learning problem)
- Binary problem: distinguish between good and bad translations
7. Related work (sentence-level)
- Workshop at JHU (Blatz et al., Coling-04); MSR (Quirk, LREC-04)
- Automatic MT metrics or few manually assessed cases
- Poor analysis of the contribution of different features
- Predictions did not have a positive effect on practical tasks
- MSR (Gamon et al., EAMT-05)
- Human-likeness classification
- Resource-dependent features
- Poor performance compared to BLEU; little correlation with human judgments
- One MT system, one domain, one language pair
- Only good / bad estimates (binary task)
8. Our approach
- Sentence level: a natural scenario for MT
- Many resource- and language-independent features
- Take the contribution of features into account
- MT-system-dependent vs. -independent features
- Machine learning problem: regression
- Any continuous score
- Human annotation as training data
- Several MT systems, text domains, language pairs and quality scores
- Results useful in practical applications
9. Method
- Identify and extract information sources.
- Refine the set of information sources to keep only the relevant ones:
- Increase performance.
- Decrease extraction cost (time).
- Learn a model to produce quality scores for new translations.
- Use the CE score in some application (a sketch of the pipeline follows).
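A minimal sketch of steps 2-4 of this pipeline in Python; the function names, toy data and hyper-parameters are illustrative, not taken from the system described in these slides.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def ce_pipeline(X_train, y_train, X_new, selected, n_components=2):
    """Keep only the selected features, learn a quality model,
    and score new, unseen translations."""
    model = PLSRegression(n_components=n_components)
    model.fit(X_train[:, selected], y_train)
    return model.predict(X_new[:, selected]).ravel()

# Toy usage: 100 training sentences, 77 features, scores in [1, 4].
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 77)), rng.uniform(1, 4, size=100)
scores = ce_pipeline(X, y, rng.normal(size=(5, 77)), selected=np.arange(20))
```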
10. Features
- Most of those identified in previous work, plus new ones
- Black-box (77): from the input and translation sentences, monolingual or parallel corpora, e.g.:
- Source and target sentence lengths and their ratios
- Language model and other statistics in the corpus
- Shallow syntax checking (target, and target against source)
- Average number of possible translations per source word (SMT)
- Practical scenario:
- Useful when it is not possible to access the internal features of the MT systems (e.g. commercial systems).
- Provides a way to perform the task of CE across different MT systems, which may use different frameworks (a sketch of a few such features follows).
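To illustrate the black-box idea, the sketch below computes a handful of the features named above (lengths, their ratio, number/punctuation mismatches) from nothing but the source and target strings; the function name and feature keys are invented for the example.

```python
import re

def black_box_features(src, tgt):
    """An illustrative subset of the 77 black-box features."""
    src_toks, tgt_toks = src.split(), tgt.split()
    feats = {
        "src_len": len(src_toks),
        "tgt_len": len(tgt_toks),
        "len_ratio": len(tgt_toks) / max(len(src_toks), 1),
    }
    # Mismatch in numbers and punctuation between source and target.
    for name, pat in [("num", r"\d+"), ("punct", r"[^\w\s]")]:
        s, t = len(re.findall(pat, src)), len(re.findall(pat, tgt))
        feats[f"{name}_mismatch"] = abs(s - t)
    return feats

print(black_box_features("2m barrels were seized.", "2m de barriles incautados."))
```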
11. Features
- Glass-box (54): depend on some aspect of the translation process, e.g.:
- Language model (target) using the n-best list, word- or phrase-based
- Proximity to other hypotheses in the n-best list
- MT base model features
- Distortion count, gap count, (compositional) bi-phrase probability
- Search nodes in the graph (aborted, pruned)
- Proportion of unknown words in the source
- Richer scenario:
- When it is possible to access the internal features of the MT systems (a sketch follows).
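A comparable sketch for glass-box features, assuming access to the decoder's n-best list and source-side vocabulary; the names are again illustrative.

```python
def glass_box_features(nbest, src_tokens, vocab):
    """An illustrative subset of glass-box features drawn from the decoder."""
    sizes = [len(hyp.split()) for hyp in nbest]
    return {
        # size of the n-best list and average hypothesis length
        "nbest_size": len(nbest),
        "avg_hyp_len": sum(sizes) / max(len(sizes), 1),
        # proportion of source words unknown to the translation model
        "unk_ratio": sum(w not in vocab for w in src_tokens) / max(len(src_tokens), 1),
    }

feats = glass_box_features(["the cat sat", "a cat sat down"],
                           ["le", "chat", "xyzzy"], {"le", "chat"})
```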
12. Learning methods
- Feature selection: Partial Least Squares (PLS)
- Regression: PLS, SVM
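Both learners are available off the shelf, e.g. in scikit-learn; a sketch on toy data (the real experiments use the extracted CE features, and the hyper-parameters here are placeholders):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.svm import SVR

# Toy stand-in for a CE dataset: 200 sentences x 77 features, scores 1-4.
rng = np.random.default_rng(1)
X_train, y_train = rng.normal(size=(200, 77)), rng.uniform(1, 4, 200)
X_test = rng.normal(size=(50, 77))

pls = PLSRegression(n_components=10).fit(X_train, y_train)
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X_train, y_train)
pls_scores = pls.predict(X_test).ravel()   # continuous quality estimates
svr_scores = svr.predict(X_test)
```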
13. Partial Least Squares Regression
- Given two matrices X (input variables) and Y (response variables), predict Y from X and describe their common structure.
- Projects the original data onto a different space of latent variables (or components).
- Provides, as a by-product, an ordering of the original features according to their importance.
- Particularly indicated when the features in X are strongly correlated (multicollinearity), which is the case for CE datasets.
- Widely applied in other fields, but not yet for NLP.
14. Partial Least Squares Regression
- Ordinary multiple regression problem:
  $Y = X B_w + F$
- where:
- $B_w$ is the regression matrix, computed directly using an optimal number of components.
- $F$ is the residual matrix.
- When X is standardized, an element of $B_w$ with large absolute value indicates an important X-variable.
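Under these definitions, ranking features by |Bw| after standardizing X can be sketched as follows; the toy data and variable names are illustrative.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X, y = rng.normal(size=(200, 77)), rng.uniform(1, 4, 200)  # toy data

X_std = StandardScaler().fit_transform(X)          # standardize X first
pls = PLSRegression(n_components=5).fit(X_std, y)
bw = np.abs(pls.coef_).ravel()                     # |Bw|, one value per feature
ranking = np.argsort(bw)[::-1]                     # most to least important
```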
15. Feature Selection with PLS
- Method:
- Compute the $B_w$ matrix on some training data for different numbers of components (all possible).
- Sort the absolute values of the bw-coefficients; this produces a list of features from the most important to the least important (Lb).
- Select the n best features, training and testing on a validation set according to some objective criterion.
- Train and test these n features on a test set.
- Evaluate predictions using appropriate metrics (a sketch of the selection loop follows).
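One plausible rendering of the selection step: walk down the ranked list Lb, fit on the top-n features, and keep the n with the lowest validation RMSPE. Data and grid are toy placeholders.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
X_tr, y_tr = rng.normal(size=(150, 77)), rng.uniform(1, 4, 150)
X_val, y_val = rng.normal(size=(50, 77)), rng.uniform(1, 4, 50)
ranking = rng.permutation(77)                  # stands in for the list Lb

best_n, best_err = None, np.inf
for n in range(5, 78, 5):                      # candidate feature-set sizes
    cols = ranking[:n]                         # top-n features from Lb
    m = PLSRegression(n_components=min(5, n)).fit(X_tr[:, cols], y_tr)
    pred = m.predict(X_val[:, cols]).ravel()
    err = np.sqrt(np.mean((pred - y_val) ** 2))   # RMSPE (see slide 21)
    if err < best_err:
        best_n, best_err = n, err
```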
16. Feature Selection with PLS
- Method (detail of the sorting step):
- The ranking is computed for each i-th training subsample, obtaining several lists Lb(i).
- The final list L is obtained by picking the most voted feature for each column (mode), e.g. L = (66, 56, ..., 35, 10); see the sketch below.
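A possible reading of the mode-voting step, with tiny toy rankings; `mode_vote` is a hypothetical helper, and ties are broken by first occurrence.

```python
from collections import Counter

def mode_vote(rankings):
    """Combine per-subsample feature rankings Lb(i) into one list L by
    taking, at each rank, the most voted feature not yet chosen."""
    final, used = [], set()
    for rank in range(len(rankings[0])):
        votes = Counter(r[rank] for r in rankings)
        for feat, _ in votes.most_common():
            if feat not in used:
                final.append(feat)
                used.add(feat)
                break
    return final

L = mode_vote([[66, 56, 35], [66, 10, 35], [56, 66, 10]])  # toy Lb(i) lists
print(L)  # -> [66, 56, 35]
```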
17. Feature Selection with PLS
- Method (detail of the selection step):
- The n features are selected by training and testing on a validation set according to an objective criterion.
- Objective criterion: RMSPE.
- Analyze learning curves to select the top n features.
18. Feature Selection with PLS
- [Figure: learning curves used to select the top n features; not preserved in extraction]
19. Feature Selection with PLS
- Method (detail of the final training step):
- Train and test these n features on a test set,
- using SVM or PLS with the optimal number of components.
20. Feature Selection with PLS
- Method (detail of the evaluation step): evaluate predictions.
- Metric: Root Mean Squared Prediction Error (RMSPE).
21. Feature Selection with PLS
- Root Mean Squared Prediction Error (RMSPE):
  $\mathrm{RMSPE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}$
- where:
- N = number of test cases
- TP = true positives, FP = false positives (used when scores are thresholded)
- $y$ = expected value, $\hat{y}$ = the prediction obtained
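The metric itself is one line of NumPy:

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root Mean Squared Prediction Error over N test cases."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

print(rmspe([4, 2, 3], [3.5, 2.4, 3.1]))  # -> ~0.37
```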
22. Experiments
- Datasets:
- Europarl data with quality scores from automatic metrics.
- News data with manually assigned quality scores (1-5).
- Europarl with manually assigned quality scores (1-4).
- Technical documents with manually assigned quality scores (1-4).
- Technical documents with post-edition time annotation.
23. GKLS 1-4 en-es dataset
- WMT-2008 Europarl English-Spanish dev/test data
- 4K translations; SMT systems: Matrax, Portage, Sinuhe and MMR (P-ES-1, P-ES-2, P-ES-3 and P-ES-4)
- Quality score: 1-4
- Features: P-ES-1gb: 131; others: 77 black-box
- Little gain with glass-box features
- Good from a practical point of view
24. GKLS 1-4 en-dk dataset
- Automotive documents, English-Danish
- En-Es is a reasonably close language pair, so try En-Dk
- 2K translations; SMT system trained on 170K parallel sentences (Matrax)
- Quality score: 1-4
- Features: black-box + glass-box
- Results (RMSPE): considerable gain in using glass-box features
- Expected with more distant language pairs
25. GKLS post-edition time dataset
- Automotive documents, English-Russian
- 3K translations; SMT systems trained on 250K sentences: Matrax, Portage and MMR (P-ER-1, P-ER-2 and P-ER-3)
- Quality score: post-edition time in seconds, normalized by sentence length (sketch below)
- Given a source sentence in English and its translation into Russian, a professional translator post-edited the translation to make it into a good-quality sentence, while the time was recorded.
- Features: 77 black-box
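The normalization described above amounts to seconds per word; a trivial sketch:

```python
def pe_time_score(seconds, translation):
    """Post-edition time normalized by sentence length (seconds per word)."""
    return seconds / max(len(translation.split()), 1)

# A translator taking 45 s to fix a 15-word sentence scores 3.0 s/word.
print(pe_time_score(45.0, " ".join(["word"] * 15)))  # -> 3.0
```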
26. Discussion
- Results for a subset of features outperform all features.
- GKLS 1-4: CE models deviate by 0.6-0.7. E.g., a sentence that should be classified as "fit for purpose" would never be classified as "requires complete retranslation".
- Post-edition times vary considerably from system to system. E.g., P-ES-1 (2): CE models deviate by up to 2 seconds/word.
27. Manually annotated datasets
- Best features (BB):
- source and target sentence 3-gram language model probability
- source and target sentence lengths
- percentage of, and mismatch in, numbers and punctuation symbols in the source and target
28. Manually annotated datasets
- Best features (GB):
- size of the n-best list
- sentence n-gram log-probabilities using the n-best list for training a LM (using words or phrases)
- bi-phrase count
- distortion count
- bi-phrase probability
- translation model
- average size of hypotheses in the n-best list
- number of search graph nodes in the final decoder's graph
29. More results: GKLS 1-4 en-es
- System combination (sketch below):
- Produce CE predictions for a test set decoded by 4 systems.
- Sort the four CE predictions for each test instance.
- Select the translation with the highest score.
- Evaluate whether this was the best sentence according to the human annotation.
- Matches for the top-scored translation: 659 / 802 (82.1%).
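The combination scheme reduces to a per-instance argmax over the four CE predictions; a toy sketch with invented shapes and values:

```python
import numpy as np

def combine_by_ce(ce_scores):
    """ce_scores: (n_instances, n_systems) CE predictions; returns, per
    test instance, the index of the system whose output scored highest."""
    return np.argmax(ce_scores, axis=1)

# Toy example with 3 instances x 4 systems:
choice = combine_by_ce(np.array([[3.1, 2.2, 2.9, 3.4],
                                 [1.8, 2.5, 2.0, 1.9],
                                 [3.9, 3.2, 3.3, 3.0]]))
print(choice)  # -> [3 1 0]
```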
30. More results: GKLS 1-4 en-es
- Predictive power of features: Pearson's correlation [table not preserved in extraction]
31. More results: GKLS 1-4 en-es
- CE score vs. MT metrics: Pearson's correlation [table not preserved in extraction]
32-33. More results: GKLS 1-4 en-es
- Filter out bad translations for Language Service Providers [figures not preserved in extraction]
34. More results: GKLS 1-4 en-dk
- Predictive power of features: Pearson's correlation (sketch below); the strongest features include:
- Source LM log-probability
- Target sentence 1-gram perplexity, using the n-best list for training a LM
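Pearson's correlation between a feature column and the human scores is a one-liner with SciPy; the data below are invented stand-ins:

```python
from scipy.stats import pearsonr

# Toy stand-ins for one feature column and the human quality scores.
feature = [7.1, 5.3, 8.0, 4.2, 6.6]
scores = [3, 2, 4, 1, 3]
r, p = pearsonr(feature, scores)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```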
35. More results: GKLS 1-4 en-dk
- Filter out bad translations for Language Service Providers; the aim is to do better than sentence length.
- Rank (300) test sentences according to true score, CE score or sentence length, and take the average true score for the top-n (sketch below).
- CE score > 2.3 (~3 true score) vs. sentence length < 12.
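The ranking experiment can be sketched as follows; `avg_true_score_of_top_n` is a hypothetical helper, and the arrays are toy values.

```python
import numpy as np

def avg_true_score_of_top_n(key, true_scores, n):
    """Rank sentences by `key` (CE score, sentence length, ...) descending
    and return the mean human score of the top-n kept sentences."""
    top = np.argsort(np.asarray(key))[::-1][:n]
    return float(np.mean(np.asarray(true_scores)[top]))

ce = [3.4, 1.2, 2.8, 3.9]          # toy CE predictions
human = [4, 1, 3, 4]               # toy human scores
print(avg_true_score_of_top_n(ce, human, n=2))  # -> 4.0
```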
36. More results: GKLS 1-4 en-dk
- Using thresholds on the CE score or sentence length we select:
- 86% of the good sentences (human scores 3 or 4)
- bad sentences (human scores 1 or 2)
- When sentence length and CE score disagree: [results not preserved in extraction]
37. Discussion
- Results are considered satisfactory (except for post-edition time).
- Prediction errors are similar across different language pairs.
- The quality of the MT system has some influence.
- Predictions correlate better with human scores than metrics using reference translations.
- Prediction error would yield uncertainty only in the boundaries between two adjacent categories.
- Estimating continuous scores is more appropriate than binary classification, even for a binary application.
- Use of Inductive Confidence Machines to threshold the predicted score.
38. Discussion
- The most relevant features include many that have not been used before:
- average size of the phrases in the target,
- several mismatches between the source and target,
- proportion of aborted search nodes, etc.
- Future work: further investigate uses for these most relevant features:
- In SMT models, to improve translation quality.
- To complement existing features in SMT models.
- To rerank n-best lists produced by SMT systems.
39. Discussion
- In MT evaluation:
- To provide additional features to reference-based metrics based on ML algorithms.
- To provide a score to be combined with other MT evaluation metrics.
- To provide a new evaluation metric in itself, with some function to optimize the correlation with human annotations, without the need for reference translations.
40. Discussion
- Uses of the CE score for other applications:
- Cross-language information retrieval
- Finding parallel data in comparable corpora
- ...
- Whenever it is important to identify whether a target sentence is a GOOD translation of a source sentence.
41. Thanks!
- Lucia Specia
- lucia.specia_at_xrce.xerox.com
42. The source
- Pirates have seized a giant Saudi oil tanker
carrying its full load of 2m barrels - more than
one-quarter of Saudi Arabia's daily output - on
Saturday off the Kenyan coast (some 450 nautical
miles south-east of Mombasa) and are steering it
towards the Somali port of Eyl, the US Navy says.
BBC News, 17/11/08