Title: MT Evaluation
1. MT Evaluation
- 11-682/15-492
- Introduction to IR, NLP, MT and Speech
- November 4, 2003
2. Need for MT Evaluation
- MT Evaluation is important
- MT systems are becoming widespread and embedded in more complex systems
- How well do they work in practice?
- Are they reliable enough?
- MT is a technology still in research stages
- How can we tell if we are making progress?
- Metrics that can drive experimental development
- MT Evaluation is difficult
- Human evaluation is subjective
- How good is good enough? Depends on the application
- Is system A better than system B? Depends on the specific criteria
- MT Evaluation is a research topic in itself! How do we assess whether an evaluation method is good?
3. Dimensions of MT Evaluation
- Human evaluation vs. automatic metrics
- Quality assessment at the sentence (segment) level vs. task-based evaluation
- Black-box evaluation vs. glass-box evaluation
- Adequacy (is the meaning translated correctly?)
vs. Fluency (is the output grammatical and
fluent?)
4. Example Approaches
- We will survey three evaluation methodologies as representative examples
- BLEU and BUBBLE: automatic metrics for MT evaluation
- Evaluation of the NESPOLE! speech-to-speech translation system (human, sentence-level, quality-based; end-to-end and some components)
- Task-based evaluation of speech-to-speech translation
5. Automatic Metrics for MT Evaluation
- Idea: compare the output of an MT system to a reference (usually human) translation and ask how close the MT output is to the reference translation
- Advantages
- Fast and cheap, minimal human labor, no need for bilingual speakers
- Can be used on an ongoing basis during system development to test changes
- Disadvantages
- Current metrics are very crude and cannot distinguish well between subtle differences in systems
- Individual sentence scores are not very meaningful; aggregate scores for a system over a large test set are meaningful
- Automatic metrics for MT evaluation are a very active area of current research
6. Automatic Metrics for MT Evaluation
- Example
- Reference: the Iraqi weapons are to be handed over to the army within two weeks
- MT output: in two weeks Iraq's weapons will give army
- Possible metric components (a small sketch of these follows below)
- Precision: correct words / total words in MT output
- Recall: correct words / total words in reference
- Combination of P and R (i.e. F1)
- Levenshtein edit distance: number of insertions, deletions, and substitutions required to transform the MT output into the reference
- Important issues
- Perfect word matches are too harsh: synonyms and inflections are missed (Iraq's vs. Iraqi, give vs. handed over)
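To make these components concrete, here is a minimal Python sketch (not from the original slides) that computes unigram precision, recall, F1, and word-level Levenshtein distance for the example above; whitespace tokenization and exact word matching are simplifying assumptions.

    from collections import Counter

    reference = "the Iraqi weapons are to be handed over to the army within two weeks".split()
    mt_output = "in two weeks Iraq's weapons will give army".split()

    # Each MT word is credited at most as often as it appears in the reference.
    overlap = Counter(mt_output) & Counter(reference)
    correct = sum(overlap.values())

    precision = correct / len(mt_output)     # correct words / total words in MT output
    recall = correct / len(reference)        # correct words / total words in reference
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    def edit_distance(hyp, ref):
        """Word-level Levenshtein distance: insertions, deletions, substitutions."""
        prev = list(range(len(ref) + 1))
        for i, h in enumerate(hyp, 1):
            cur = [i]
            for j, r in enumerate(ref, 1):
                cur.append(min(prev[j] + 1,              # delete a hypothesis word
                               cur[j - 1] + 1,           # insert a reference word
                               prev[j - 1] + (h != r)))  # substitute (0 cost if equal)
            prev = cur
        return prev[-1]

    print(precision, recall, f1, edit_distance(mt_output, reference))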
7. The BLEU Metric
- Proposed by IBM (Papineni et al., 2002)
- Main ideas
- Exact matches of words
- Match against a set of reference translations for greater variety of expressions
- Account for adequacy by looking at word precision
- Account for fluency by calculating n-gram precisions for n = 1, 2, 3, 4
- No recall (difficult to define with multiple references)
- To compensate for the lack of recall, introduce a Brevity Penalty
- Final score is a weighted geometric average of the n-gram scores
- Calculate the aggregate score over a large test set
8. The BLEU Metric
- Example
- Reference: the Iraqi weapons are to be handed over to the army within two weeks
- MT output: in two weeks Iraq's weapons will give army
- BLEU metric
- 1-gram precision: 4/8
- 2-gram precision: 1/7
- 3-gram precision: 0/6
- 4-gram precision: 0/5
- BLEU score = 0 (weighted geometric average; a sketch of this computation follows below)
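A small sketch (not from the original slides) of the per-n-gram precision counts behind the numbers above, against the single reference; no clipping is needed for this example.

    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    reference = "the Iraqi weapons are to be handed over to the army within two weeks".split()
    mt_output = "in two weeks Iraq's weapons will give army".split()

    for n in range(1, 5):
        mt_counts = Counter(ngrams(mt_output, n))
        matched = sum((mt_counts & Counter(ngrams(reference, n))).values())
        print(f"{n}-gram precision: {matched}/{sum(mt_counts.values())}")
    # Any zero n-gram precision drives the geometric average, and hence the BLEU score, to 0.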
9. The BLEU Metric
- Clipping precision counts
- Reference 1: the Iraqi weapons are to be handed over to the army within two weeks
- Reference 2: the Iraqi weapons will be surrendered to the army in two weeks
- MT output: the the the the
- The precision count for "the" should be clipped at two: the maximum count of the word in any single reference
- The modified unigram precision will be 2/4 (not 4/4), as sketched below
10. The BLEU Metric
- Brevity Penalty
- Reference 1: the Iraqi weapons are to be handed over to the army within two weeks
- Reference 2: the Iraqi weapons will be surrendered to the army in two weeks
- MT output: the the
- Precision scores: unigram 2/2, bigram 1/1, BLEU 1.0
- The MT output is much too short, which boosts precision, and BLEU doesn't have recall
- An exponential Brevity Penalty reduces the score; it is calculated from the aggregate lengths over the test set (not individual sentences), as sketched below
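A sketch of how the brevity penalty and the geometric average combine into the final score, following the formulation in Papineni et al. (2002); the corpus-level bookkeeping is simplified here to a single segment, so the lengths are illustrative.

    import math

    def bleu(precisions, mt_length, ref_length, weights):
        # Weighted geometric average of the n-gram precisions (no smoothing: any zero precision gives 0).
        if min(precisions) == 0:
            geo_avg = 0.0
        else:
            geo_avg = math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))
        # Exponential brevity penalty, computed from aggregate output/reference lengths over the test set.
        bp = 1.0 if mt_length > ref_length else math.exp(1.0 - ref_length / mt_length)
        return bp * geo_avg

    # The "the the" example above: unigram 2/2 and bigram 1/1, but the output is far too short.
    print(bleu([1.0, 1.0], mt_length=2, ref_length=14, weights=[0.5, 0.5]))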
11. Some Problems with BLEU
- No recall: recall is very important for MT quality, and the Brevity Penalty does not adequately compensate for its absence
- No explicit alignment of words between the MT output and the reference
- Dependence on n-grams to account for fluency (how well ordered is the output?)
- Geometric averaging is harsh
12. The BUBBLE Metric
- New metric under development at CMU
- Main new ideas
- Reintroduce Recall, and F1 as a balanced precision/recall combination
- Look only at unigram precision/recall/F1
- Align the MT output with each reference and take the score of the best pairing
- Assess fluency via a direct word-order penalty: how out-of-order is the MT output?
- Calculated as a bubble-sort metric: how many flips are required to correctly order the matching words, as a fraction of the worst possible word ordering
13. The BUBBLE Metric
- Example
- Reference: the Iraqi weapons are to be handed over to the army within two weeks
- MT output: in two weeks Iraq's weapons will give army
- Matching words: weapons, army, two, weeks (reference order); two, weeks, weapons, army (MT output order)
- P = 4/8, R = 4/14, F1 = 2PR/(P+R) = 0.36
- Flips required = 4, max flips = 4*3/2 = 6
- Flip Penalty (FP) = 4/6 = 0.67
- Raw BUBBLE score = F1 * (1 - FP) = 0.12 (a sketch of this computation follows below)
- With grouping and an exponential penalty:
- Flips = 1, max flips = 6, FP = 1/6 = 0.167
- Modified BUBBLE score = F1 * (1/2^FP) = 0.32
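A sketch (not from the original slides) of the raw BUBBLE computation: unigram F1 times the flip-based word-order penalty. The grouping and exponential variant shown on the slide is not implemented, and the handling of repeated matched words is simplified.

    from collections import Counter

    reference = "the Iraqi weapons are to be handed over to the army within two weeks".split()
    mt_output = "in two weeks Iraq's weapons will give army".split()

    # Unigram matches and F1 (exact matching only).
    matched = Counter(mt_output) & Counter(reference)
    m = sum(matched.values())
    precision, recall = m / len(mt_output), m / len(reference)
    f1 = 2 * precision * recall / (precision + recall)

    # Read off the reference position of each matched word, in MT output order.
    ref_position = {w: i for i, w in enumerate(reference) if w in matched}
    mt_order = [ref_position[w] for w in mt_output if w in matched]

    # Bubble-sort flips = number of out-of-order pairs, as a fraction of the worst case.
    flips = sum(1 for i in range(len(mt_order)) for j in range(i + 1, len(mt_order))
                if mt_order[i] > mt_order[j])
    max_flips = len(mt_order) * (len(mt_order) - 1) // 2
    flip_penalty = flips / max_flips                      # 4/6 for this example

    print(f1 * (1 - flip_penalty))                        # raw BUBBLE score, about 0.12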
14. BLEU vs. BUBBLE
- How do we know if a metric is better?
- Better correlation with human judgments of MT output (a simple correlation sketch follows below)
- Reduced score variability on MT outputs that are ranked equivalent by humans
- Higher and less variable scores when scoring human translations against the reference translations
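As an illustration of the first criterion, a small sketch comparing two metrics by their Pearson correlation with human judgments; every number here is a hypothetical placeholder, not data from the slides.

    import statistics

    def pearson(xs, ys):
        mx, my = statistics.mean(xs), statistics.mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        return cov / (len(xs) * statistics.pstdev(xs) * statistics.pstdev(ys))

    human_scores    = [2.1, 2.8, 3.0, 3.6, 4.2]       # hypothetical per-system human judgments
    metric_a_scores = [0.18, 0.22, 0.25, 0.31, 0.35]  # hypothetical scores from metric A
    metric_b_scores = [0.20, 0.19, 0.27, 0.24, 0.33]  # hypothetical scores from metric B

    print(pearson(human_scores, metric_a_scores), pearson(human_scores, metric_b_scores))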
15. BLEU vs. BUBBLE
16. BLEU vs. BUBBLE
17. BLEU vs. BUBBLE
18. Further Issues
- Words are not created equal: some are more important for effective translation
- More effective matching with synonyms and inflected forms:
- Stemming
- Use a synonym knowledge base (WordNet); a matching sketch follows below
- How can such information be incorporated within the metric?
- Train weights for word matches
- The target goal is to optimize correlation with human judgments
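A sketch of how stemming and WordNet synonymy could relax exact word matching. It assumes NLTK and its WordNet data are installed; the function names are illustrative and not part of BLEU or BUBBLE.

    from nltk.stem import PorterStemmer      # pip install nltk
    from nltk.corpus import wordnet as wn    # requires the NLTK 'wordnet' data package

    stemmer = PorterStemmer()

    def wordnet_lemmas(word):
        return {lemma.name().lower() for synset in wn.synsets(word) for lemma in synset.lemmas()}

    def words_match(mt_word, ref_word):
        if mt_word == ref_word:                                  # exact match
            return True
        if stemmer.stem(mt_word) == stemmer.stem(ref_word):      # same stem, e.g. "weapons"/"weapon"
            return True
        return ref_word.lower() in wordnet_lemmas(mt_word)       # WordNet synonym/lemma overlap

    print(words_match("weapons", "weapon"), words_match("gave", "give"))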
19. NESPOLE! System Overview
- Human-to-human spoken language translation for e-commerce applications (e.g. travel and tourism) (Lavie et al., 2002)
- English, German, Italian, and French
- Translation via interlingua
- Translation servers for each language exchange interlingua to perform translation:
- Speech recognition (speech → text)
- Analysis (text → interlingua)
- Generation (interlingua → text)
- Synthesis (text → speech)
20. NESPOLE! User Interfaces
21. NESPOLE! Translation Monitor
22. NESPOLE! Architecture
23. Interchange Format
- Interchange Format (IF) is a shallow semantic interlingua for task-oriented domains
- Utterances are represented as sequences of semantic dialog units (SDUs)
- The IF representation consists of four parts:
- Speaker
- Speech act
- Concepts
- Arguments
- speaker : speech act + concepts + arguments, where the speech act plus concepts form the Domain Action (an illustrative example follows below)
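As an illustration, borrowing the utterance from the JANUS example later in these slides; the c: client speaker tag and the exact surface punctuation are assumptions about the IF notation, shown only to make the four parts concrete.

    Utterance:  I would like to reserve a single room
    IF:         c: give-information+accommodation (room-type=single)
                (speaker = c (client); speech act = give-information;
                 concept = accommodation; argument = room-type=single)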
24. Evaluation Types and Methods
- Individual evaluation of components
- Speech recognizers
- Analysis and generation engines
- Synthesizers
- IF (intercoder agreement, effectiveness)
- End-to-end translation quality
- From speech input to speech/text output
- From transcribed input to text output
- Architecture effectiveness: network effects
- Task-based evaluation
- User studies: what works and how well?
- Evaluating multi-modal interfaces
25. Single Component Evaluations
- Speech recognizers
- Measure Word Error Rates (WERs) compared to a transcription of the input (a small WER sketch follows below)
- Analysis modules (from speech or text input)
- Compare the output from the analyzer with manually annotated IF representations for the input
- Generation modules (from IFs)
- Compare the generated output from IFs with the input utterance and assess the quality of the output
- Synthesizers: does the output sound good?
26. Accuracy-based End-to-End Evaluation
- Example
- Italian input: Si un attimo che vedo le voglio mandare ... attimo
- English MT output: Yes I'm indicating moment
- Grades for the four SDUs, from three graders: G1: P B K K, G2: P B P K, G3: P B K P
- Three-point grading scheme: Perfect / OK / Bad
- OK: all meaning is translated, but the output is somewhat disfluent
- Perfect: OK and the output is also fluent
- Bad: meaning is lost in translation
- Acceptable = Perfect + OK
27. End-to-End Evaluation Methodology
- ENG/ITA, GER/ITA, FR/ITA
- 4 unseen dialogues for each language pair: 2 from winter vacations, 2 from summer resorts; 2 collected monolingually, 2 bilingually
- Monolingual and crosslingual evaluations
- To Italian on the client side, from Italian on the agent side
- Evaluate translation from speech and from transcriptions, to text
- ASR output also graded as a paraphrase
- 3-4 human graders per language pair
- Accuracy-based evaluation at the Semantic Dialogue Unit (SDU) level: one grader segmented the data, and all graders used that segmentation
- Calculate the percentage of P/K/B and Acceptable for each grader, then average the results across graders (a small aggregation sketch follows below)
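A sketch (not from the original slides) of the per-grader aggregation, reusing the grades from the accuracy-based evaluation example above (P = perfect, K = OK, B = bad).

    graders = {"G1": ["P", "B", "K", "K"],
               "G2": ["P", "B", "P", "K"],
               "G3": ["P", "B", "K", "P"]}

    def percentages(grades):
        pct = {g: 100.0 * grades.count(g) / len(grades) for g in "PKB"}
        pct["Acceptable"] = pct["P"] + pct["K"]              # Acceptable = Perfect + OK
        return pct

    per_grader = {name: percentages(g) for name, g in graders.items()}
    averaged = {key: sum(p[key] for p in per_grader.values()) / len(per_grader)
                for key in ("P", "K", "B", "Acceptable")}
    print(averaged)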
28. Evaluation Results
29. Task-based Evaluation: Motivation
- Accuracy-based evaluation can be very harsh (to get a K, ALL meaning must be preserved)
- User studies indicate that with 65% accuracy we achieve almost 100% task completion
- But this is highly dependent on the definition and complexity of the task!
30. Task-based Evaluation
- Evaluates the ability of users to perform (complete) an overall task using an MT system, rather than scoring the MT system directly
- The MT system is used to mediate human conversation; the task to be evaluated is the communication of goals, not the human actions
- Better MT system → better task completion rate
- Involves analysis and breakdown of the task: goals, sub-goals, and prioritization of goals
- Were goals completed? Were multiple attempts/repairs required? Were goals abandoned?
31. Designing a Task-based Evaluation
- Idea: evaluate the effectiveness of communication, not the human execution of a task
- Decide on goals and sub-goals for the task
- Example: conveying to the agent that you need a hotel room (when, for how long, what type of room)
- Annotation scheme: was each communication goal accomplished immediately? eventually?
- Scoring scheme: assign an overall score based on the individual goals and their accomplishment level
- References: LREC-2000, ACL-99 Student Session
32. Task-based Evaluation of JANUS Speech-to-Speech Translation
- The interlingua representation is used to identify communication goals
- A dialogue consists of a sequence of SDUs, each consisting of one main goal and zero or more sub-goals
- Example: I would like to reserve a single room
- give-information+accommodation (room-type=single)
- Main goal: give-information+accommodation
- Sub-goal: room-type=single
33. Task-based Evaluation of Speech-to-Speech Translation
- Scoring scheme for goals
- Important goals will be attempted multiple times before being abandoned
- For successful goals, the score should decay with the number of retries required
- For abandoned goals, the penalty should increase with the number of retries attempted
- t = number of communication attempts (initial + retries)
- Score(goal) = 1/t if the goal was successful
- Score(goal) = -(1 - 1/t) if the goal was abandoned
- Score(dialogue) = average over all goal scores (a small sketch follows below)
34. Task-based Evaluation of Speech-to-Speech Translation
- How to account for sub-goals?
- Same as goals, but with a smaller weight
- The same information can be conveyed differently by different speakers: one main goal with many sub-goals vs. multiple main goals
- Identify the complexity of a goal (domain action) based on the number and complexity of its sub-goals (arguments)
- Scale the score as a function of the goal complexity
35. Task-based Evaluation of Speech-to-Speech Translation
- Main issues
- Focus on goal accomplishment rather than translation quality at the sentence/phrase level
- Performing the evaluation requires human coding based on a transcript of the conversation; inter-coder agreement is an issue
- The score of a dialogue is very sensitive to the flow of the dialogue and the actions of the participants
- Level of granularity of goals and sub-goals
- In this case, determined by the interlingua
36. Summary
- MT evaluation is important for driving system development and the technology as a whole
- Different aspects need to be evaluated, not just the translation quality of individual sentences
- Human evaluations are costly, but are the most meaningful
- New automatic metrics are becoming popular, but they are still rather crude; they can drive system progress and rank systems
- New metrics that achieve better correlation with human judgments are being developed
37. References
- Papineni, K., S. Roukos, T. Ward and W.-J. Zhu, "BLEU: a Method for Automatic Evaluation of Machine Translation". In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-2002), Philadelphia, PA, July 2002.
- Gates, D., A. Lavie, L. Levin, A. Waibel, M. Gavalda, M. Woszczyna and P. Zhan, "End-to-end Evaluation in JANUS: a Speech-to-speech Translation System". In Dialogue Processing in Spoken Language Systems: Revised Papers from ECAI-96 Workshop, E. Maier, M. Mast and S. LuperFoy (eds.), LNCS series, Springer Verlag, June 1997.
- Levin, L., B. Bartlog, A. Font-Llitjos, D. Gates, A. Lavie, D. Wallace, T. Watanabe and M. Woszczyna, "Lessons Learned from a Task-Based Evaluation of Speech-to-Speech Machine Translation". In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-2000), Athens, Greece, June 2000.