MT Evaluation
Transcript and Presenter's Notes
1
MT Evaluation
  • 11-682/15-492
  • Introduction to IR, NLP, MT and Speech
  • November 4, 2003

2
Need for MT Evaluation
  • MT Evaluation is important
  • MT systems are becoming widespread, embedded in
    more complex systems
  • How well do they work in practice?
  • Are they reliable enough?
  • MT is a technology still in research stages
  • How can we tell if we are making progress?
  • Metrics that can drive experimental development
  • MT Evaluation is difficult
  • Human evaluation is subjective
  • How good is good enough? Depends on
    application
  • Is system A better than system B? Depends on
    specific criteria
  • MT Evaluation is a research topic in itself! How
    do we assess whether an evaluation method is good?

3
Dimensions of MT Evaluation
  • Human evaluation vs. automatic metrics
  • Quality assessment at sentence (segment) level
    vs. task-based evaluation
  • Black-box evaluation vs. Glass-box evaluation
  • Adequacy (is the meaning translated correctly?)
    vs. Fluency (is the output grammatical and
    fluent?)

4
Example Approaches
  • We will survey three evaluation methodologies as
    representative examples
  • BLEU and BUBBLE: automatic metrics for MT evaluation
  • Evaluation of the NESPOLE! speech-to-speech
    translation system (human, sentence-level,
    quality-based, end-to-end and individual components)
  • Task-based evaluation of speech-to-speech
    translation

5
Automatic Metrics for MT Evaluation
  • Idea: compare the output of an MT system to a
    reference "good" (usually human) translation:
    how close is the MT output to the reference
    translation?
  • Advantages
  • Fast and cheap, minimal human labor, no need for
    bilingual speakers
  • Can be used on an on-going basis during system
    development to test changes
  • Disadvantages
  • Current metrics are very crude, cannot
    distinguish well between subtle differences in
    systems
  • Individual sentence scores are not very
    meaningful; aggregate scores for a system on a
    large test set are meaningful
  • Automatic metrics for MT evaluation are a very
    active area of current research

6
Automatic Metrics for MT Evaluation
  • Example
  • Reference: "the Iraqi weapons are to be handed
    over to the army within two weeks"
  • MT output: "in two weeks Iraq's weapons will give
    army"
  • Possible metric components (a small sketch follows this slide)
  • Precision: correct words / total words in the MT output
  • Recall: correct words / total words in the reference
  • Combination of P/R (i.e., F1)
  • Levenshtein edit distance: the number of insertions,
    deletions, and substitutions required to transform the MT
    output into the reference
  • Important issues
  • Perfect word matches are too harsh: synonyms and
    inflections ("Iraq's" vs. "Iraqi", "give" vs.
    "handed over")

7
The BLEU Metric
  • Proposed by IBM [Papineni et al., 2002]
  • Main ideas
  • Exact matches of words
  • Match against a set of reference translations for
    greater variety of expressions
  • Account for adequacy by looking at word precision
  • Account for fluency by calculating n-gram
    precisions for n = 1, 2, 3, 4
  • No recall (difficult to compute with multiple references)
  • To compensate for the lack of recall, introduce a Brevity
    Penalty
  • Final score is a weighted geometric average of the
    n-gram scores (formula below)
  • Calculate the aggregate score over a large test set
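In the notation of Papineni et al. (2002), with p_n the modified n-gram precisions, c the total candidate length and r the effective reference length (symbols not defined on the slide), the score can be written as:

    \mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\Big(\sum_{n=1}^{N} w_n \log p_n\Big),
    \qquad
    \mathrm{BP} =
    \begin{cases}
      1 & \text{if } c > r \\
      e^{\,1 - r/c} & \text{if } c \le r
    \end{cases}

with N = 4 and uniform weights w_n = 1/4, so the exponential term is exactly the weighted geometric average of the n-gram precisions mentioned above.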

8
The BLEU Metric
  • Example
  • Reference: "the Iraqi weapons are to be handed
    over to the army within two weeks"
  • MT output: "in two weeks Iraq's weapons will give
    army"
  • BLEU metric
  • 1-gram precision: 4/8
  • 2-gram precision: 1/7
  • 3-gram precision: 0/6
  • 4-gram precision: 0/5
  • BLEU score = 0 (weighted geometric average; see the sketch below)
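A minimal sketch of the n-gram precisions for this example. A single reference and whitespace tokenization are assumptions made here for illustration.

    # BLEU-style modified n-gram precision for the example on this slide.
    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def modified_precision(mt_tokens, ref_tokens, n):
        mt_counts = Counter(ngrams(mt_tokens, n))
        ref_counts = Counter(ngrams(ref_tokens, n))
        clipped = {g: min(c, ref_counts[g]) for g, c in mt_counts.items()}
        return sum(clipped.values()), sum(mt_counts.values())

    ref = "the Iraqi weapons are to be handed over to the army within two weeks".split()
    mt = "in two weeks Iraq's weapons will give army".split()
    for n in range(1, 5):
        hits, total = modified_precision(mt, ref, n)
        print(f"{n}-gram precision: {hits}/{total}")
    # Prints 4/8, 1/7, 0/6, 0/5 as on the slide; the geometric average is 0
    # because the 3-gram and 4-gram precisions are 0.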

9
The BLEU Metric
  • Clipping precision counts
  • Reference 1: "the Iraqi weapons are to be handed
    over to the army within two weeks"
  • Reference 2: "the Iraqi weapons will be
    surrendered to the army in two weeks"
  • MT output: "the the the the"
  • The precision count for "the" should be clipped at
    two: the maximum count of the word in any single reference
  • The modified unigram precision is then 2/4, not 4/4
    (see the sketch below)
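A sketch of clipping against multiple references: each word's count in the MT output is clipped at its maximum count in any single reference. Whitespace tokenization is assumed.

    # Clipped unigram precision with multiple references.
    from collections import Counter

    def clipped_unigram_precision(mt_tokens, references):
        mt_counts = Counter(mt_tokens)
        max_ref = Counter()
        for ref in references:
            for w, c in Counter(ref).items():
                max_ref[w] = max(max_ref[w], c)
        clipped = sum(min(c, max_ref[w]) for w, c in mt_counts.items())
        return clipped, len(mt_tokens)

    refs = [
        "the Iraqi weapons are to be handed over to the army within two weeks".split(),
        "the Iraqi weapons will be surrendered to the army in two weeks".split(),
    ]
    print(clipped_unigram_precision("the the the the".split(), refs))  # (2, 4)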

10
The BLEU Metric
  • Brevity Penalty
  • Reference 1: "the Iraqi weapons are to be handed
    over to the army within two weeks"
  • Reference 2: "the Iraqi weapons will be
    surrendered to the army in two weeks"
  • MT output: "the the"
  • Precision scores: unigram 2/2, bigram 1/1, so BLEU = 1.0
  • The MT output is much too short, which boosts
    precision, and BLEU has no recall to catch this
  • An exponential Brevity Penalty reduces the score; it is
    calculated from the aggregate length of the test set,
    not individual sentences (see the sketch below)
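A sketch of the brevity penalty under the standard BLEU formulation (exp(1 - r/c) when the candidate is shorter than the reference), computed over aggregate lengths as the slide notes.

    # Exponential brevity penalty over aggregate test-set lengths.
    import math

    def brevity_penalty(total_mt_len, total_ref_len):
        if total_mt_len > total_ref_len:
            return 1.0
        return math.exp(1.0 - total_ref_len / total_mt_len)

    # The two-word output against a 14-word reference gets a heavy penalty:
    print(brevity_penalty(2, 14))   # ~0.0025, so BLEU drops from 1.0 to ~0.0025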

11
Some Problems with BLEU
  • No recall: recall is very important for MT
    quality, and the Brevity Penalty does not adequately
    compensate for its absence
  • No explicit alignment of words between the MT
    output and the reference
  • Dependence on n-grams to account for fluency (how
    well ordered the output is)
  • Geometric averaging is harsh: a single zero n-gram
    precision zeroes the entire score

12
The BUBBLE Metric
  • New metric under development at CMU
  • Main new ideas
  • Reintroduce recall, with F1 as a balanced
    precision/recall combination
  • Look only at unigram precision/recall/F1
  • Align the MT output with each reference and take the
    score of the best pairing
  • Assess fluency via a direct word-order penalty:
    how out-of-order is the MT output?
  • Calculated as a bubble-sort metric: the number of
    flips required to correctly order the matching
    words, as a fraction of the worst possible word
    ordering (see the sketch below)
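A sketch of the bubble-sort penalty described above. How matched words are found and tokenized is not specified on the slide, so their reference-order positions are supplied directly here.

    # Count the adjacent swaps (bubble-sort flips) needed to sort the matched
    # MT words into reference order, and normalize by the worst case n*(n-1)/2.
    def flip_count(positions):
        """Number of adjacent swaps (= inversions) to sort the list."""
        flips = 0
        order = list(positions)
        for i in range(len(order)):
            for j in range(len(order) - 1 - i):
                if order[j] > order[j + 1]:
                    order[j], order[j + 1] = order[j + 1], order[j]
                    flips += 1
        return flips

    def flip_penalty(positions):
        n = len(positions)
        worst = n * (n - 1) // 2
        return flip_count(positions) / worst if worst else 0.0

    # Matched words in MT order: two, weeks, weapons, army.
    # Their positions in reference order (weapons=0, army=1, two=2, weeks=3):
    print(flip_count([2, 3, 0, 1]))    # 4 flips
    print(flip_penalty([2, 3, 0, 1]))  # 4/6 = 0.67, as on the next slide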

13
The BUBBLE Metric
  • Example
  • Reference: "the Iraqi weapons are to be handed
    over to the army within two weeks"
  • MT output: "in two weeks Iraq's weapons will give
    army"
  • Matching words in reference order: weapons, army, two, weeks
  • Matching words in MT order: two, weeks, weapons, army
  • P = 4/8, R = 4/14, F1 = 2PR/(P+R) = 0.36
  • Flips required: 4; max flips: 4·3/2 = 6
  • Flip penalty FP = 4/6 = 0.67
  • Raw BUBBLE score = F1·(1 − FP) = 0.12
  • With grouping and an exponential penalty:
  • Flips = 1, max flips = 6, FP = 1/6 = 0.167
  • Modified BUBBLE score = F1·(1/2^FP) = 0.32 (worked arithmetic below)
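Spelled out, the arithmetic on this slide is (reading the modified penalty as multiplying F1 by 2^{-FP}, which reproduces the 0.32 shown; that reading is a reconstruction):

    F_1 = \frac{2PR}{P+R}
        = \frac{2 \cdot \tfrac{4}{8} \cdot \tfrac{4}{14}}{\tfrac{4}{8} + \tfrac{4}{14}}
        \approx 0.36,
    \qquad
    \mathrm{BUBBLE}_{\mathrm{raw}} = F_1(1 - \mathrm{FP}) = 0.36 \cdot (1 - 0.67) \approx 0.12,
    \qquad
    \mathrm{BUBBLE}_{\mathrm{mod}} = F_1 \cdot 2^{-\mathrm{FP}} = 0.36 \cdot 2^{-0.167} \approx 0.32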

14
BLEU vs BUBBLE
  • How do we know if a metric is better?
  • Better correlation with human judgments of MT
    output (a small correlation sketch follows this list)
  • Reduced score variability on MT outputs that are
    ranked equivalent by humans
  • Higher and less variable scores on scoring human
    translations against the reference translations
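The first criterion is typically checked by correlating metric scores with human judgments over a set of outputs. The score lists below are made-up placeholders, not data from this talk.

    # Pearson correlation between human judgments and an automatic metric.
    from statistics import mean

    def pearson(xs, ys):
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        varx = sum((x - mx) ** 2 for x in xs) ** 0.5
        vary = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (varx * vary)

    human  = [3.2, 2.1, 4.0, 3.7, 1.5]       # hypothetical human adequacy scores
    metric = [0.31, 0.18, 0.45, 0.40, 0.12]  # hypothetical automatic scores
    print(pearson(human, metric))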

15
BLEU vs BUBBLE
16
BLEU vs BUBBLE
17
BLEU vs BUBBLE
18
Further Issues
  • Words are not created equal: some are more
    important for effective translation
  • More effective matching with synonyms and
    inflected forms
  • Stemming
  • Use a synonym knowledge base (WordNet)
  • How can such information be incorporated within the
    metric?
  • Train weights for word matches
  • The target goal is to optimize correlation with human
    judgments

19
NESPOLE! System Overview
  • Human-to-human spoken language translation for
    e-commerce applications (e.g., travel and tourism)
    (Lavie et al., 2002)
  • English, German, Italian, and French
  • Translation via an interlingua
  • Translation servers for each language exchange
    interlingua to perform translation
  • Speech recognition (Speech → Text)
  • Analysis (Text → Interlingua)
  • Generation (Interlingua → Text)
  • Synthesis (Text → Speech)

20
NESPOLE! User Interfaces
21
NESPOLE! Translation Monitor
22
NESPOLE! Architecture
23
Interchange Format
  • Interchange Format (IF) is a shallow semantic
    interlingua for task-oriented domains
  • Utterances are represented as sequences of semantic
    dialogue units (SDUs)
  • The IF representation consists of four parts
  • Speaker
  • Speech Act
  • Concepts
  • Arguments
  • speaker : speech act + concepts (arguments)
  • The speech act together with the concepts forms the
    Domain Action (a small parsing sketch follows)
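A hedged sketch of how an IF string decomposes into the four parts listed above. The exact surface syntax and the "c:" speaker prefix are assumptions, modeled on the give-information+accomodation (room-type=single) example later in the talk.

    # Split an IF string into speaker, speech act, concepts, and arguments.
    # The input string and surface syntax here are illustrative assumptions.
    import re

    def parse_if(if_string):
        speaker, rest = if_string.split(":", 1)
        m = re.match(r"([^(]+)(?:\((.*)\))?", rest.strip())
        domain_action = m.group(1).strip()            # speech act + concepts
        args = m.group(2) or ""
        speech_act, *concepts = domain_action.split("+")
        arguments = dict(a.strip().split("=", 1) for a in args.split(",") if "=" in a)
        return speaker, speech_act, concepts, arguments

    print(parse_if("c:give-information+accomodation (room-type=single)"))
    # ('c', 'give-information', ['accomodation'], {'room-type': 'single'})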
24
Evaluation Types and Methods
  • Individual evaluation of components
  • Speech Recognizers
  • Analysis and generation engines
  • Synthesizers
  • IF (intercoder agreement, effectiveness)
  • End-to-End translation quality
  • From speech input to speech/text output
  • From transcribed input to text output
  • Architecture effectiveness: network effects
  • Task-based evaluation
  • User studies: what works and how well?
  • Evaluating multi-modal interfaces

25
Single Component Evaluations
  • Speech Recognizers
  • Measure Word Error Rate (WER) against a
    transcription of the input (see the sketch below)
  • Analysis Modules (from speech or text input)
  • Compare output from analyzer with manually
    annotated IF representations for the input
  • Generation Modules (from IFs)
  • Compare the generated output from IFs with the
    input utterance and assess quality of output
  • Synthesizers: does the output sound good?
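A sketch of WER as word-level edit distance divided by the length of the reference transcription. The example utterance is hypothetical, and whitespace tokenization is assumed.

    # Word Error Rate: edit distance between hypothesis and reference words,
    # normalized by the reference length.
    def wer(hyp, ref):
        h, r = hyp.split(), ref.split()
        d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
        for i in range(len(h) + 1):
            d[i][0] = i
        for j in range(len(r) + 1):
            d[0][j] = j
        for i in range(1, len(h) + 1):
            for j in range(1, len(r) + 1):
                cost = 0 if h[i - 1] == r[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
        return d[-1][-1] / len(r)

    print(wer("i want reserve a room", "i want to reserve a single room"))  # 2/7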

26
Accuracy-based End-to-End Evaluation
  • Example (Italian input; MT output graded per SDU by three graders)
  • Italian input: "Si un attimo che vedo le voglio mandare ... attimo"
    (roughly: "Yes, just a moment while I check, I want to send you ... moment")
  • MT output: "Yes I'm indicating moment"
  • G1: P B K K
  • G2: P B P K
  • G3: P B K P
  • Three-point grading scheme: perfect / ok / bad
  • OK: all meaning is translated, but the output is somewhat
    disfluent
  • Perfect: OK, and the output is fluent
  • Bad: meaning is lost in translation
  • Acceptable = Perfect + OK

27
End-to-End Evaluation Methodology
  • Language pairs: ENG/ITA, GER/ITA, FR/ITA
  • 4 unseen dialogues for each language pair: 2 from
    winter vacations, 2 from summer resorts; 2
    collected monolingually, 2 bilingually
  • Monolingual and cross-lingual evaluations
  • To Italian on the client side, from Italian on the agent side
  • Evaluate translation from speech and from
    transcriptions, to text
  • ASR output also graded as a paraphrase
  • 3-4 human graders per language pair
  • Accuracy-based evaluation at the Semantic
    Dialogue Unit (SDU) level: one grader segmented,
    all graders used that segmentation
  • Calculate percent P/K/B and Acceptable for each
    grader, then average the results across graders (see the sketch below)
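A sketch of the per-grader bookkeeping described above, using the grade sequences from the earlier example slide; the segmentation into SDUs is taken as given.

    # Percent Perfect/OK/Bad per grader, Acceptable = Perfect + OK,
    # then averaged across graders.
    from collections import Counter

    def grader_stats(grades):
        counts = Counter(grades)
        n = len(grades)
        pct = {g: 100.0 * counts[g] / n for g in "PKB"}
        pct["Acceptable"] = pct["P"] + pct["K"]
        return pct

    graders = {"G1": "PBKK", "G2": "PBPK", "G3": "PBKP"}
    per_grader = {name: grader_stats(list(g)) for name, g in graders.items()}
    avg_acceptable = sum(s["Acceptable"] for s in per_grader.values()) / len(per_grader)
    print(per_grader)
    print(f"Average Acceptable: {avg_acceptable:.1f}%")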

28
Evaluation Results
29
Task-based Evaluation Motivation
  • Accuracy-based evaluation can be very harsh (to
    get an OK grade, ALL meaning must be preserved)
  • User studies indicate that with 65% accuracy we
    achieve almost 100% task completion
  • But this is highly dependent on the definition
    and complexity of the task!

30
Task-based Evaluation
  • Evaluates the ability of users to perform
    (complete) an overall task using an MT system,
    rather than scoring the MT system directly
  • The MT system is used to mediate a human conversation:
    the task to be evaluated is the communication of
    goals, not the human actions
  • Better MT system → better task completion rate
  • Involves analysis and breakdown of the task:
    goals, sub-goals, prioritization of goals
  • Were goals completed? Were multiple
    attempts/repairs required? Were goals abandoned?

31
Designing a Task-based Evaluation
  • Idea: evaluate the effectiveness of communication,
    not human execution of a task
  • Decide on goals and sub-goals for the task
  • Example: conveying to the agent that you need a hotel
    room (when, for how long, type of room)
  • Annotation scheme: was each communication goal
    accomplished immediately? eventually?
  • Scoring scheme: assign an overall score from the
    individual goals and their accomplishment level
  • References: LREC-2000, ACL-99 Student Session

32
Task-based Evaluation of JANUS Speech-to-Speech
Translation
  • The interlingua representation is used to identify
    communication goals
  • A dialogue consists of a sequence of SDUs, each
    consisting of one main goal and zero or more
    sub-goals
  • Example: "I would like to reserve a single room"
  • give-information+accomodation (room-type=single)
  • Main goal: give-information+accomodation
  • Sub-goal: room-type=single

33
Task-based Evaluation of Speech-to-Speech
Translation
  • Scoring scheme for goals (see the sketch below)
  • Important goals will be attempted multiple times
    before being abandoned
  • For successful goals, the score should decay with the
    number of retries required
  • For abandoned goals, the penalty should increase with
    the number of retries attempted
  • t = number of communication attempts (initial + retries)
  • Score(goal) = 1/t if the goal was successful;
    = −(1 − 1/t) if the goal was abandoned
  • Score(dialogue) = average over all goal scores
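A direct transcription of the scoring scheme above into code; the example dialogue is hypothetical.

    # A goal attempted t times scores 1/t if it eventually succeeded and
    # -(1 - 1/t) if it was abandoned; the dialogue score is the average.
    def goal_score(attempts, successful):
        return 1.0 / attempts if successful else -(1.0 - 1.0 / attempts)

    def dialogue_score(goals):
        """goals: list of (attempts, successful) pairs."""
        return sum(goal_score(t, ok) for t, ok in goals) / len(goals)

    # Hypothetical dialogue: one goal succeeded immediately, one after a retry,
    # one abandoned after three attempts.
    print(dialogue_score([(1, True), (2, True), (3, False)]))  # (1 + 0.5 - 0.667)/3 ≈ 0.28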

34
Task-based Evaluation of Speech-to-Speech
Translation
  • How to account for sub-goals?
  • Same as goals, but with a smaller weight
  • The same information can be conveyed differently by
    different speakers: one main goal with many
    sub-goals vs. multiple main goals
  • Identify the complexity of a goal (domain action)
    based on the number/complexity of its sub-goals
    (arguments)
  • Scale the score as a function of the goal
    complexity

35
Task-based Evaluation of Speech-to-Speech
Translation
  • Main issues
  • Focus on goal accomplishment rather than
    translation quality at the sentence/phrase level
  • Performing the evaluation requires human coding,
    based on a transcript of the conversation;
    inter-coder agreement is a concern
  • The score of a dialogue is very sensitive to the flow of
    the dialogue and the actions of the participants
  • Level of granularity of goals and sub-goals
  • In this case, determined by the interlingua

36
Summary
  • MT Evaluation is important for driving system
    development and the technology as a whole
  • Different aspects need to be evaluated: not just the
    translation quality of individual sentences
  • Human evaluations are costly, but are the most
    meaningful
  • New automatic metrics are becoming popular, but they
    are still rather crude; they can drive system progress
    and rank systems
  • New metrics that achieve better correlation with
    human judgments are being developed

37
References
  • Papineni, K., S. Roukos, T. Ward and W.-J. Zhu,
    "BLEU: a Method for Automatic Evaluation of
    Machine Translation", in Proceedings of the 40th
    Annual Meeting of the Association for
    Computational Linguistics (ACL-2002),
    Philadelphia, PA, July 2002.
  • Gates, D., A. Lavie, L. Levin, A. Waibel,
    M. Gavalda, M. Woszczyna and P. Zhan, "End-to-end
    Evaluation in JANUS: a Speech-to-speech
    Translation System", in Dialogue Processing in
    Spoken Language Systems: Revised Papers from the
    ECAI-96 Workshop, E. Maier, M. Mast and S.
    LuperFoy (eds.), LNCS series, Springer Verlag,
    June 1997.
  • Levin, L., B. Bartlog, A. Font-Llitjos, D.
    Gates, A. Lavie, D. Wallace, T. Watanabe and M.
    Woszczyna, "Lessons Learned from a Task-Based
    Evaluation of Speech-to-Speech Machine
    Translation", in Proceedings of the 2nd International
    Conference on Language Resources and Evaluation
    (LREC-2000), Athens, Greece, June 2000.