MT Evaluation
Transcript and Presenter's Notes
1
MT Evaluation
  • 11-682/15-492
  • Introduction to IR, NLP, MT and Speech
  • November 4, 2003

2
Need for MT Evaluation
  • MT Evaluation is important
  • MT systems are becoming widespread, embedded in
    more complex systems
  • How well do they work in practice?
  • Are they reliable enough?
  • MT is a technology still in research stages
  • How can we tell if we are making progress?
  • Metrics that can drive experimental development
  • MT Evaluation is difficult
  • Human evaluation is subjective
  • How good is good enough? Depends on
    application
  • Is system A better than system B? Depends on
    specific criteria
  • MT Evaluation is a research topic in itself! How
    do we assess whether an evaluation method is good?

3
Dimensions of MT Evaluation
  • Human evaluation vs. automatic metrics
  • Quality assessment at sentence (segment) level
    vs. task-based evaluation
  • Black-box evaluation vs. Glass-box evaluation
  • Adequacy (is the meaning translated correctly?)
    vs. Fluency (is the output grammatical and
    fluent?)

4
Example Approaches
  • We will survey three evaluation methodologies as
    representative examples
  • BLEU and BUBBLE: automatic metrics for MT evaluation
  • Evaluation of the NESPOLE! speech-to-speech
    translation system (human, sentence-level,
    quality-based, end-to-end and individual components)
  • Task-based evaluation of speech-to-speech
    translation

5
Automatic Metrics for MT Evaluation
  • Idea: compare the output of an MT system to a
    reference "good" (usually human) translation:
    how close is the MT output to the reference
    translation?
  • Advantages
  • Fast and cheap, minimal human labor, no need for
    bilingual speakers
  • Can be used on an on-going basis during system
    development to test changes
  • Disadvantages
  • Current metrics are very crude, cannot
    distinguish well between subtle differences in
    systems
  • Individual sentence scores are not very
    meaningful; aggregate scores for a system on a
    large test set are meaningful
  • Automatic metrics for MT evaluation are a very
    active area of current research

6
Automatic Metrics for MT Evaluation
  • Example
  • Reference: "the Iraqi weapons are to be handed
    over to the army within two weeks"
  • MT output: "in two weeks Iraq's weapons will give
    army"
  • Possible metric components (a small sketch follows this slide)
  • Precision: correct words / total words in the MT output
  • Recall: correct words / total words in the reference
  • Combination of P/R (i.e., F1)
  • Levenshtein edit distance: the number of insertions,
    deletions, and substitutions required to transform the MT
    output into the reference
  • Important issues
  • Perfect word matches are too harsh: synonyms and
    inflections ("Iraq's" vs. "Iraqi", "give" vs.
    "handed over")

7
The BLEU Metric
  • Proposed by IBM [Papineni et al., 2002]
  • Main ideas
  • Exact matches of words
  • Match against a set of reference translations for
    greater variety of expressions
  • Account for adequacy by looking at word precision
  • Account for fluency by calculating n-gram
    precisions for n = 1, 2, 3, 4
  • No recall (difficult to compute with multiple references)
  • To compensate for the lack of recall, introduce a Brevity
    Penalty
  • Final score is a weighted geometric average of the
    n-gram scores (formula below)
  • Calculate the aggregate score over a large test set
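In the notation of Papineni et al. (2002), with p_n the modified n-gram precisions, c the total candidate length and r the effective reference length (symbols not defined on the slide), the score can be written as:

    \mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\Big(\sum_{n=1}^{N} w_n \log p_n\Big),
    \qquad
    \mathrm{BP} =
    \begin{cases}
      1 & \text{if } c > r \\
      e^{\,1 - r/c} & \text{if } c \le r
    \end{cases}

with N = 4 and uniform weights w_n = 1/4, so the exponential term is exactly the weighted geometric average of the n-gram precisions mentioned above.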

8
The BLEU Metric
  • Example
  • Reference: "the Iraqi weapons are to be handed
    over to the army within two weeks"
  • MT output: "in two weeks Iraq's weapons will give
    army"
  • BLEU metric
  • 1-gram precision: 4/8
  • 2-gram precision: 1/7
  • 3-gram precision: 0/6
  • 4-gram precision: 0/5
  • BLEU score = 0 (weighted geometric average; see the sketch below)
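A minimal sketch of the n-gram precisions for this example. A single reference and whitespace tokenization are assumptions made here for illustration.

    # BLEU-style modified n-gram precision for the example on this slide.
    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def modified_precision(mt_tokens, ref_tokens, n):
        mt_counts = Counter(ngrams(mt_tokens, n))
        ref_counts = Counter(ngrams(ref_tokens, n))
        clipped = {g: min(c, ref_counts[g]) for g, c in mt_counts.items()}
        return sum(clipped.values()), sum(mt_counts.values())

    ref = "the Iraqi weapons are to be handed over to the army within two weeks".split()
    mt = "in two weeks Iraq's weapons will give army".split()
    for n in range(1, 5):
        hits, total = modified_precision(mt, ref, n)
        print(f"{n}-gram precision: {hits}/{total}")
    # Prints 4/8, 1/7, 0/6, 0/5 as on the slide; the geometric average is 0
    # because the 3-gram and 4-gram precisions are 0.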

9
The BLEU Metric
  • Clipping precision counts
  • Reference 1: "the Iraqi weapons are to be handed
    over to the army within two weeks"
  • Reference 2: "the Iraqi weapons will be
    surrendered to the army in two weeks"
  • MT output: "the the the the"
  • The precision count for "the" should be clipped at
    two: the maximum count of the word in any single reference
  • The modified unigram precision is then 2/4, not 4/4
    (see the sketch below)
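A sketch of clipping against multiple references: each word's count in the MT output is clipped at its maximum count in any single reference. Whitespace tokenization is assumed.

    # Clipped unigram precision with multiple references.
    from collections import Counter

    def clipped_unigram_precision(mt_tokens, references):
        mt_counts = Counter(mt_tokens)
        max_ref = Counter()
        for ref in references:
            for w, c in Counter(ref).items():
                max_ref[w] = max(max_ref[w], c)
        clipped = sum(min(c, max_ref[w]) for w, c in mt_counts.items())
        return clipped, len(mt_tokens)

    refs = [
        "the Iraqi weapons are to be handed over to the army within two weeks".split(),
        "the Iraqi weapons will be surrendered to the army in two weeks".split(),
    ]
    print(clipped_unigram_precision("the the the the".split(), refs))  # (2, 4)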

10
The BLEU Metric
  • Brevity Penalty
  • Reference 1: "the Iraqi weapons are to be handed
    over to the army within two weeks"
  • Reference 2: "the Iraqi weapons will be
    surrendered to the army in two weeks"
  • MT output: "the the"
  • Precision scores: unigram 2/2, bigram 1/1, so BLEU = 1.0
  • The MT output is much too short, which boosts
    precision, and BLEU has no recall to catch this
  • An exponential Brevity Penalty reduces the score; it is
    calculated from the aggregate length of the test set,
    not individual sentences (see the sketch below)
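A sketch of the brevity penalty under the standard BLEU formulation (exp(1 - r/c) when the candidate is shorter than the reference), computed over aggregate lengths as the slide notes.

    # Exponential brevity penalty over aggregate test-set lengths.
    import math

    def brevity_penalty(total_mt_len, total_ref_len):
        if total_mt_len > total_ref_len:
            return 1.0
        return math.exp(1.0 - total_ref_len / total_mt_len)

    # The two-word output against a 14-word reference gets a heavy penalty:
    print(brevity_penalty(2, 14))   # ~0.0025, so BLEU drops from 1.0 to ~0.0025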

11
Some Problems with BLEU
  • No recall: recall is very important for MT
    quality, and the Brevity Penalty does not adequately
    compensate for its absence
  • No explicit alignment of words between the MT
    output and the reference
  • Dependence on n-grams to account for fluency (how
    well ordered the output is)
  • Geometric averaging is harsh: a single zero n-gram
    precision zeroes the entire score

12
The BUBBLE Metric
  • New metric under development at CMU
  • Main new ideas
  • Reintroduce recall, with F1 as a balanced
    precision/recall combination
  • Look only at unigram precision/recall/F1
  • Align the MT output with each reference and take the
    score of the best pairing
  • Assess fluency via a direct word-order penalty:
    how out-of-order is the MT output?
  • Calculated as a bubble-sort metric: the number of
    flips required to correctly order the matching
    words, as a fraction of the worst possible word
    ordering (see the sketch below)
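A sketch of the bubble-sort penalty described above. How matched words are found and tokenized is not specified on the slide, so their reference-order positions are supplied directly here.

    # Count the adjacent swaps (bubble-sort flips) needed to sort the matched
    # MT words into reference order, and normalize by the worst case n*(n-1)/2.
    def flip_count(positions):
        """Number of adjacent swaps (= inversions) to sort the list."""
        flips = 0
        order = list(positions)
        for i in range(len(order)):
            for j in range(len(order) - 1 - i):
                if order[j] > order[j + 1]:
                    order[j], order[j + 1] = order[j + 1], order[j]
                    flips += 1
        return flips

    def flip_penalty(positions):
        n = len(positions)
        worst = n * (n - 1) // 2
        return flip_count(positions) / worst if worst else 0.0

    # Matched words in MT order: two, weeks, weapons, army.
    # Their positions in reference order (weapons=0, army=1, two=2, weeks=3):
    print(flip_count([2, 3, 0, 1]))    # 4 flips
    print(flip_penalty([2, 3, 0, 1]))  # 4/6 = 0.67, as on the next slide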

13
The BUBBLE Metric
  • Example
  • Reference: "the Iraqi weapons are to be handed
    over to the army within two weeks"
  • MT output: "in two weeks Iraq's weapons will give
    army"
  • Matching words in reference order: weapons, army, two, weeks
  • Matching words in MT order: two, weeks, weapons, army
  • P = 4/8, R = 4/14, F1 = 2PR/(P+R) = 0.36
  • Flips required: 4; max flips: 4·3/2 = 6
  • Flip penalty FP = 4/6 = 0.67
  • Raw BUBBLE score = F1·(1 − FP) = 0.12
  • With grouping and an exponential penalty:
  • Flips = 1, max flips = 6, FP = 1/6 = 0.167
  • Modified BUBBLE score = F1·(1/2^FP) = 0.32 (worked arithmetic below)
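Spelled out, the arithmetic on this slide is (reading the modified penalty as multiplying F1 by 2^{-FP}, which reproduces the 0.32 shown; that reading is a reconstruction):

    F_1 = \frac{2PR}{P+R}
        = \frac{2 \cdot \tfrac{4}{8} \cdot \tfrac{4}{14}}{\tfrac{4}{8} + \tfrac{4}{14}}
        \approx 0.36,
    \qquad
    \mathrm{BUBBLE}_{\mathrm{raw}} = F_1(1 - \mathrm{FP}) = 0.36 \cdot (1 - 0.67) \approx 0.12,
    \qquad
    \mathrm{BUBBLE}_{\mathrm{mod}} = F_1 \cdot 2^{-\mathrm{FP}} = 0.36 \cdot 2^{-0.167} \approx 0.32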

14
BLEU vs BUBBLE
  • How do we know if a metric is better?
  • Better correlation with human judgments of MT
    output (a small correlation sketch follows this list)
  • Reduced score variability on MT outputs that are
    ranked equivalent by humans
  • Higher and less variable scores on scoring human
    translations against the reference translations
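The first criterion is typically checked by correlating metric scores with human judgments over a set of outputs. The score lists below are made-up placeholders, not data from this talk.

    # Pearson correlation between human judgments and an automatic metric.
    from statistics import mean

    def pearson(xs, ys):
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        varx = sum((x - mx) ** 2 for x in xs) ** 0.5
        vary = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (varx * vary)

    human  = [3.2, 2.1, 4.0, 3.7, 1.5]       # hypothetical human adequacy scores
    metric = [0.31, 0.18, 0.45, 0.40, 0.12]  # hypothetical automatic scores
    print(pearson(human, metric))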

15
BLEU vs BUBBLE
16
BLEU vs BUBBLE
17
BLEU vs BUBBLE
18
Further Issues
  • Words are not created equal: some are more
    important for effective translation
  • More effective matching with synonyms and
    inflected forms
  • Stemming
  • Use a synonym knowledge base (WordNet)
  • How can such information be incorporated within the
    metric?
  • Train weights for word matches
  • The target goal is to optimize correlation with human
    judgments

19
NESPOLE! System Overview
  • Human-to-human spoken language translation for
    e-commerce applications (e.g., travel and tourism)
    (Lavie et al., 2002)
  • English, German, Italian, and French
  • Translation via an interlingua
  • Translation servers for each language exchange
    interlingua to perform translation
  • Speech recognition (Speech → Text)
  • Analysis (Text → Interlingua)
  • Generation (Interlingua → Text)
  • Synthesis (Text → Speech)

20
NESPOLE! User Interfaces
21
NESPOLE! Translation Monitor
22
NESPOLE! Architecture
23
Interchange Format
  • Interchange Format (IF) is a shallow semantic
    interlingua for task-oriented domains
  • Utterances are represented as sequences of semantic
    dialogue units (SDUs)
  • The IF representation consists of four parts
  • Speaker
  • Speech Act
  • Concepts
  • Arguments
  • speaker : speech act + concepts (arguments)
  • The speech act together with the concepts forms the
    Domain Action (a small parsing sketch follows)
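A hedged sketch of how an IF string decomposes into the four parts listed above. The exact surface syntax and the "c:" speaker prefix are assumptions, modeled on the give-information+accomodation (room-type=single) example later in the talk.

    # Split an IF string into speaker, speech act, concepts, and arguments.
    # The input string and surface syntax here are illustrative assumptions.
    import re

    def parse_if(if_string):
        speaker, rest = if_string.split(":", 1)
        m = re.match(r"([^(]+)(?:\((.*)\))?", rest.strip())
        domain_action = m.group(1).strip()            # speech act + concepts
        args = m.group(2) or ""
        speech_act, *concepts = domain_action.split("+")
        arguments = dict(a.strip().split("=", 1) for a in args.split(",") if "=" in a)
        return speaker, speech_act, concepts, arguments

    print(parse_if("c:give-information+accomodation (room-type=single)"))
    # ('c', 'give-information', ['accomodation'], {'room-type': 'single'})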
24
Evaluation Types and Methods
  • Individual evaluation of components
  • Speech Recognizers
  • Analysis and generation engines
  • Synthesizers
  • IF (intercoder agreement, effectiveness)
  • End-to-End translation quality
  • From speech input to speech/text output
  • From transcribed input to text output
  • Architecture effectiveness: network effects
  • Task-based evaluation
  • User studies: what works and how well?
  • Evaluating multi-modal interfaces

25
Single Component Evaluations
  • Speech Recognizers
  • Measure Word Error Rate (WER) against a
    transcription of the input (see the sketch below)
  • Analysis Modules (from speech or text input)
  • Compare output from analyzer with manually
    annotated IF representations for the input
  • Generation Modules (from IFs)
  • Compare the generated output from IFs with the
    input utterance and assess quality of output
  • Synthesizers: does the output sound good?
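A sketch of WER as word-level edit distance divided by the length of the reference transcription. The example utterance is hypothetical, and whitespace tokenization is assumed.

    # Word Error Rate: edit distance between hypothesis and reference words,
    # normalized by the reference length.
    def wer(hyp, ref):
        h, r = hyp.split(), ref.split()
        d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
        for i in range(len(h) + 1):
            d[i][0] = i
        for j in range(len(r) + 1):
            d[0][j] = j
        for i in range(1, len(h) + 1):
            for j in range(1, len(r) + 1):
                cost = 0 if h[i - 1] == r[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
        return d[-1][-1] / len(r)

    print(wer("i want reserve a room", "i want to reserve a single room"))  # 2/7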

26
Accuracy-based End-to-End Evaluation
  • Example (Italian input; MT output graded per SDU by three graders)
  • Italian input: "Si un attimo che vedo le voglio mandare ... attimo"
    (roughly: "Yes, just a moment while I check, I want to send you ... moment")
  • MT output: "Yes I'm indicating moment"
  • G1: P B K K
  • G2: P B P K
  • G3: P B K P
  • Three-point grading scheme: perfect / ok / bad
  • OK: all meaning is translated, but the output is somewhat
    disfluent
  • Perfect: OK, and the output is fluent
  • Bad: meaning is lost in translation
  • Acceptable = Perfect + OK

27
End-to-End Evaluation Methodology
  • Language pairs: ENG/ITA, GER/ITA, FR/ITA
  • 4 unseen dialogues for each language pair: 2 from
    winter vacations, 2 from summer resorts; 2
    collected monolingually, 2 bilingually
  • Monolingual and cross-lingual evaluations
  • To Italian on the client side, from Italian on the agent side
  • Evaluate translation from speech and from
    transcriptions, to text
  • ASR output also graded as a paraphrase
  • 3-4 human graders per language pair
  • Accuracy-based evaluation at the Semantic
    Dialogue Unit (SDU) level: one grader segmented,
    all graders used that segmentation
  • Calculate percent P/K/B and Acceptable for each
    grader, then average the results across graders (see the sketch below)
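A sketch of the per-grader bookkeeping described above, using the grade sequences from the earlier example slide; the segmentation into SDUs is taken as given.

    # Percent Perfect/OK/Bad per grader, Acceptable = Perfect + OK,
    # then averaged across graders.
    from collections import Counter

    def grader_stats(grades):
        counts = Counter(grades)
        n = len(grades)
        pct = {g: 100.0 * counts[g] / n for g in "PKB"}
        pct["Acceptable"] = pct["P"] + pct["K"]
        return pct

    graders = {"G1": "PBKK", "G2": "PBPK", "G3": "PBKP"}
    per_grader = {name: grader_stats(list(g)) for name, g in graders.items()}
    avg_acceptable = sum(s["Acceptable"] for s in per_grader.values()) / len(per_grader)
    print(per_grader)
    print(f"Average Acceptable: {avg_acceptable:.1f}%")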

28
Evaluation Results
29
Task-based Evaluation Motivation
  • Accuracy-based evaluation can be very harsh (to
    get an OK grade, ALL meaning must be preserved)
  • User studies indicate that with 65% accuracy we
    achieve almost 100% task completion
  • But this is highly dependent on the definition
    and complexity of the task!

30
Task-based Evaluation
  • Evaluates the ability of users to perform
    (complete) an overall task using an MT system,
    rather than scoring the MT system directly
  • The MT system is used to mediate a human conversation:
    the task to be evaluated is the communication of
    goals, not the human actions
  • Better MT system → better task completion rate
  • Involves analysis and breakdown of the task:
    goals, sub-goals, prioritization of goals
  • Were goals completed? Were multiple
    attempts/repairs required? Were goals abandoned?

31
Designing a Task-based Evaluation
  • Idea: evaluate the effectiveness of communication,
    not human execution of a task
  • Decide on goals and sub-goals for the task
  • Example: conveying to the agent that you need a hotel
    room (when, for how long, type of room)
  • Annotation scheme: was each communication goal
    accomplished immediately? eventually?
  • Scoring scheme: assign an overall score from the
    individual goals and their accomplishment level
  • References: LREC-2000, ACL-99 Student Session

32
Task-based Evaluation of JANUS Speech-to-Speech
Translation
  • The interlingua representation is used to identify
    communication goals
  • A dialogue consists of a sequence of SDUs, each
    consisting of one main goal and zero or more
    sub-goals
  • Example: "I would like to reserve a single room"
  • give-information+accomodation (room-type=single)
  • Main goal: give-information+accomodation
  • Sub-goal: room-type=single

33
Task-based Evaluation of Speech-to-Speech
Translation
  • Scoring scheme for goals (see the sketch below)
  • Important goals will be attempted multiple times
    before being abandoned
  • For successful goals, the score should decay with the
    number of retries required
  • For abandoned goals, the penalty should increase with
    the number of retries attempted
  • t = number of communication attempts (initial + retries)
  • Score(goal) = 1/t if the goal was successful;
    = −(1 − 1/t) if the goal was abandoned
  • Score(dialogue) = average over all goal scores
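A direct transcription of the scoring scheme above into code; the example dialogue is hypothetical.

    # A goal attempted t times scores 1/t if it eventually succeeded and
    # -(1 - 1/t) if it was abandoned; the dialogue score is the average.
    def goal_score(attempts, successful):
        return 1.0 / attempts if successful else -(1.0 - 1.0 / attempts)

    def dialogue_score(goals):
        """goals: list of (attempts, successful) pairs."""
        return sum(goal_score(t, ok) for t, ok in goals) / len(goals)

    # Hypothetical dialogue: one goal succeeded immediately, one after a retry,
    # one abandoned after three attempts.
    print(dialogue_score([(1, True), (2, True), (3, False)]))  # (1 + 0.5 - 0.667)/3 ≈ 0.28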

34
Task-based Evaluation of Speech-to-Speech
Translation
  • How to account for sub-goals?
  • Same as goals, but with a smaller weight
  • The same information can be conveyed differently by
    different speakers: one main goal with many
    sub-goals vs. multiple main goals
  • Identify the complexity of a goal (domain action)
    based on the number/complexity of its sub-goals
    (arguments)
  • Scale the score as a function of the goal
    complexity

35
Task-based Evaluation of Speech-to-Speech
Translation
  • Main issues
  • Focus on goal accomplishment rather than
    translation quality at the sentence/phrase level
  • Performing the evaluation requires human coding,
    based on a transcript of the conversation;
    inter-coder agreement is a concern
  • The score of a dialogue is very sensitive to the flow of
    the dialogue and the actions of the participants
  • Level of granularity of goals and sub-goals
  • In this case, determined by the interlingua

36
Summary
  • MT Evaluation is important for driving system
    development and the technology as a whole
  • Different aspects need to be evaluated: not just the
    translation quality of individual sentences
  • Human evaluations are costly, but are the most
    meaningful
  • New automatic metrics are becoming popular, but they
    are still rather crude; they can drive system progress
    and rank systems
  • New metrics that achieve better correlation with
    human judgments are being developed

37
References
  • Papineni, K., S. Roukos, T. Ward and W.-J. Zhu,
    "BLEU: a Method for Automatic Evaluation of
    Machine Translation", in Proceedings of the 40th
    Annual Meeting of the Association for
    Computational Linguistics (ACL-2002),
    Philadelphia, PA, July 2002.
  • Gates, D., A. Lavie, L. Levin, A. Waibel,
    M. Gavalda, M. Woszczyna and P. Zhan, "End-to-end
    Evaluation in JANUS: a Speech-to-speech
    Translation System", in Dialogue Processing in
    Spoken Language Systems: Revised Papers from the
    ECAI-96 Workshop, E. Maier, M. Mast and S.
    LuperFoy (eds.), LNCS series, Springer Verlag,
    June 1997.
  • Levin, L., B. Bartlog, A. Font-Llitjos, D.
    Gates, A. Lavie, D. Wallace, T. Watanabe and M.
    Woszczyna, "Lessons Learned from a Task-Based
    Evaluation of Speech-to-Speech Machine
    Translation", in Proceedings of the 2nd International
    Conference on Language Resources and Evaluation
    (LREC-2000), Athens, Greece, June 2000.