Dependency-Based Automatic Evaluation for Machine Translation

1
Dependency-Based Automatic Evaluation for Machine
Translation
  • Karolina Owczarzak, Josef van Genabith, Andy Way
  • National Centre for Language Technology
  • School of Computing
  • Dublin City University

2
Overview
  • Automatic evaluation for Machine Translation (MT): BLEU, NIST, GTM, METEOR, TER
  • Lexical-Functional Grammar (LFG) in language processing: parsing to simple logical forms
  • LFG in MT evaluation
  • assessing level of parser noise: the adjunct attachment experiment
  • checking for bias: the Europarl experiment
  • correlation with human judgement: the MultiTrans experiment
  • Future work

3
Automatic MT evaluation
  • Automatic MT metrics: a fast and cheap way to evaluate your MT system
  • Basic and most popular: BLEU, NIST
  • John resigned yesterday vs. Yesterday, John quit
  • 1-grams: 2/3 (john, yesterday)
  • 2-grams: 0/2
  • 3-grams: 0/1
  • Total: 2/6 n-grams = 0.33 (counted in the sketch below)
  • String comparison - not sensitive to legitimate syntactic and lexical variation
  • Need large test sets and/or multiple references
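A minimal sketch (in Python, for illustration only) of the n-gram counting behind the 2/6 example above, assuming whitespace tokenisation and clipped counts; it is not the full BLEU formula (no brevity penalty, no per-n geometric mean), and the helper names are invented for this sketch.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_overlap(candidate, reference, max_n=3):
    """Clipped n-gram matches vs. total candidate n-grams, up to max_n."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    matched = total = 0
    for n in range(1, max_n + 1):
        cand_grams, ref_grams = ngrams(cand, n), ngrams(ref, n)
        matched += sum(min(c, ref_grams[g]) for g, c in cand_grams.items())
        total += max(len(cand) - n + 1, 0)
    return matched, total

matched, total = ngram_overlap("John resigned yesterday", "Yesterday John quit")
print(matched, total, round(matched / total, 2))  # 2 6 0.33
```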

4
Automatic MT evaluation
  • Other attempts to include more variation in evaluation:
  • General Text Matcher (GTM): precision and recall on translation-reference pairs, weighting contiguous matches more heavily
  • Translation Error Rate (TER): edit distance for a translation-reference pair, counting insertions, deletions, substitutions and shifts (a simplified edit-distance sketch follows this list)
  • METEOR: sum of n-gram matches on exact string forms, stemmed words, and WordNet synonyms
  • Kauchak and Barzilay (2006): WordNet synonyms used with BLEU
  • Owczarzak et al. (2006): paraphrases derived from the test set through word/phrase alignment, used with BLEU and NIST
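The sketch below illustrates only the insertion/deletion/substitution part of an edit-distance metric (plain word-level Levenshtein); TER itself additionally allows block shifts and normalises by reference length, so this is a simplification, not the metric.

```python
def word_edit_distance(hypothesis, reference):
    """Word-level Levenshtein distance: insertions, deletions, substitutions.
    (TER also counts block shifts, which this sketch omits.)"""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    # dp[i][j] = edits to turn the first i hypothesis words into the first j reference words
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i
    for j in range(len(ref) + 1):
        dp[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            substitution = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,                 # deletion
                           dp[i][j - 1] + 1,                 # insertion
                           dp[i - 1][j - 1] + substitution)  # substitution / match
    return dp[-1][-1]

edits = word_edit_distance("John resigned yesterday", "Yesterday John quit")
print(edits)  # 3 word edits; a TER-style score would divide by the 3-word reference
```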

5
Lexical-Functional Grammar
  • Sentence structure representation
  • c-structure (constituent): CFG trees, reflects surface word order and structural hierarchy
  • f-structure (functional): abstract grammatical (syntactic) relations

John resigned yesterday vs. Yesterday, John resigned

triples: SUBJ(resign, john), PERS(john, 3), NUM(john, sg), TENSE(resign, past), ADJ(resign, yesterday), PERS(yesterday, 3), NUM(yesterday, sg)
triples (preds only): SUBJ(resign, john), ADJ(resign, yesterday)
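As an illustration of the representation only: the tuple encoding and the split into "predicate" vs. "atomic feature" relations below are assumptions of this sketch, not the parser's actual output format. The triples themselves are taken from the slide.

```python
# LFG dependency triples for "John resigned yesterday", from the slide,
# represented here (as an assumption) as (relation, head, dependent) tuples.
TRIPLES = {
    ("SUBJ", "resign", "john"),
    ("PERS", "john", "3"),
    ("NUM", "john", "sg"),
    ("TENSE", "resign", "past"),
    ("ADJ", "resign", "yesterday"),
    ("PERS", "yesterday", "3"),
    ("NUM", "yesterday", "sg"),
}

# Hypothetical split between predicate-level relations and atomic features.
PRED_RELATIONS = {"SUBJ", "OBJ", "ADJ"}

preds_only = {t for t in TRIPLES if t[0] in PRED_RELATIONS}
print(sorted(preds_only))
# [('ADJ', 'resign', 'yesterday'), ('SUBJ', 'resign', 'john')]
```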
6
LFG Parser
  • Cahill et al. (2004): LFG parser based on the Penn-II Treebank (demo at http://lfg-demo.computing.dcu.ie/lfgparser.html)
  • Automatic annotation of Charniak's/Bikel's output parse with attribute-value equations, resolving to f-structures
  • Evaluation of parser quality: comparison of the dependencies produced by the parser with the set of dependencies in a human annotation of the same text, using precision and recall
  • Our LFG parser reaches high precision and recall scores

7
LFG in MT evaluation
  • Parse translation and reference into LFG f-structures rendered as dependency triples
  • Comparison of translation and reference text on the structural (dependency) level
  • Calculate precision and recall on the translation and reference dependency sets (a scoring sketch follows the example below)
  • Comparison of two automatically produced outputs: how much noise does the parser introduce?

John resigned yesterday: SUBJ(resign, john), PERS(john, 3), NUM(john, sg), TENSE(resign, past), ADJ(resign, yesterday), PERS(yesterday, 3), NUM(yesterday, sg)
vs.
Yesterday, John resigned: SUBJ(resign, john), PERS(john, 3), NUM(john, sg), TENSE(resign, past), ADJ(resign, yesterday), PERS(yesterday, 3), NUM(yesterday, sg)
100% match between the two triple sets
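A minimal sketch of the precision/recall comparison over triple sets described above; treating the triples as an unordered set and combining precision and recall with a harmonic mean are assumptions of this sketch, since the slides do not spell out the exact scoring details.

```python
def triple_score(translation_triples, reference_triples):
    """Precision, recall and F-score over two sets of dependency triples."""
    trans, ref = set(translation_triples), set(reference_triples)
    matches = len(trans & ref)
    precision = matches / len(trans) if trans else 0.0
    recall = matches / len(ref) if ref else 0.0
    f_score = (2 * precision * recall / (precision + recall)) if matches else 0.0
    return precision, recall, f_score

# Both word orders on the slide yield the same f-structure triples,
# so the dependency-based comparison is a perfect match.
reference = {("SUBJ", "resign", "john"), ("ADJ", "resign", "yesterday")}
translation = {("ADJ", "resign", "yesterday"), ("SUBJ", "resign", "john")}
print(triple_score(translation, reference))  # (1.0, 1.0, 1.0)
```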
8
The adjunct attachment experiment
  • 100 English Europarl sentences containing adjuncts or coordinated structures
  • Hand-modified to change the placement of the adjunct or the order of coordinated elements, with no change in meaning or grammaticality
  • Schengen, on the other hand, is not organic. <- original reference
  • On the other hand, Schengen is not organic. <- modified translation
  • Change limited to c-structure, no change in f-structure
  • A perfect parser should give both versions an identical set of dependencies

9
The adjunct attachment experiment - results
(Results table: parser comparison; content not preserved in the transcript.)
10
The Europarl experiment
  • N-gram-based metrics (BLEU, NIST) favour
    n-gram-based translation (statistical MT)
  • Owczarzak et al. (2006):
  • BLEU: Pharaoh > Logomedia (0.0349)
  • NIST: Pharaoh > Logomedia (0.6219)
  • Human: Pharaoh < Logomedia (0.19)
  • 4000 sentences from Spanish-English Europarl
  • Two translations:
  • Logomedia
  • Pharaoh
  • Evaluated with BLEU, NIST, GTM, TER, METEOR (±WordNet), and the dependency-based method (basic, predicate-only, ±WordNet, ±bitext-generated paraphrases)
  • WordNet paraphrases used to create a new best-matching reference for the translation, which is then evaluated with the dependency-based method (a toy synonym-expansion sketch follows this list)
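The synonym-matching idea can be sketched with NLTK's WordNet interface; this is an assumption of the illustration, as the slide does not describe the actual matching procedure. `relax_reference` and its swap-if-synonym rule are invented for this sketch.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def share_synset(w1, w2):
    """True if the two word forms share any WordNet synset (via base forms)."""
    s1 = set(wn.synsets(wn.morphy(w1) or w1))
    s2 = set(wn.synsets(wn.morphy(w2) or w2))
    return bool(s1 & s2)

def relax_reference(reference, translation):
    """Toy version of building a better-matching reference: replace a reference
    word by a translation word when the two are WordNet synonyms."""
    trans_words = translation.split()
    out = []
    for word in reference.split():
        if word in trans_words:
            out.append(word)
        else:
            swap = next((t for t in trans_words if share_synset(word, t)), None)
            out.append(swap or word)
    return " ".join(out)

print(relax_reference("john quit yesterday", "john resigned yesterday"))
# expected: "john resigned yesterday", provided 'quit' and 'resigned' share a synset
```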

11
The Europarl experiment - results
Europarl 4000: Logomedia vs. Pharaoh (results table not preserved in the transcript)
12
The MultiTrans experiment
  • Correlation of dependency-based method with human
    evaluation
  • Comparison with correlation of BLEU, NIST, GTM,
    METEOR, TER
  • Linguistic Data Consortium Multiple Translation
    Chinese Parts 2 and 4
  • multiple translations of Chinese newswire text
  • four human-produced references
  • segment-level human scores for a subset of the
    translations
  • total 16,800 translation-reference-human score
    segments
  • Pearson's correlation coefficient (a small computation sketch follows this list):
  • -1: negative correlation
  • 0: no correlation
  • 1: positive correlation
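For concreteness, Pearson's r between per-segment metric scores and human scores can be computed as below; the score values are made-up illustrations, not data from the experiment.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative values only: per-segment metric scores vs. human judgements.
metric_scores = [0.31, 0.45, 0.52, 0.60, 0.72]
human_scores = [2.0, 3.0, 3.5, 3.0, 4.5]
print(round(pearson(metric_scores, human_scores), 3))
```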

13
The MultiTrans experiment - results
Correlation with human judgement of translation
quality
  • Dependency-based method is sensitive to the grammatical structure of the sentence: a more grammatical translation is a more fluent translation
  • A different position of a word means a different local (and global) structure: the word appears in dependency triples that do not match the reference

14
Future work
  • Use n-best parses to reduce parser noise and
    increase number of matches
  • Generate a paraphrase set through word alignment
    from a large bitext (Europarl), use instead of
    WordNet
  • Create weights for individual dependency scores
    that contribute to segment-level score, train to
    maximize correlation with human judgement

15
Conclusions
  • New automatic method for evaluation of MT output
  • LFG dependency triples: a simple logical form
  • Evaluation on the structural level, not the surface string form
  • Allows legitimate syntactic variation
  • Allows legitimate lexical variation when used with WordNet or paraphrases
  • Correlates more highly than other metrics with human evaluation of fluency

16
References
  • Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization: 65-73.
  • Aoife Cahill, Michael Burke, Ruth O'Donovan, Josef van Genabith, and Andy Way. 2004. Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations. Proceedings of ACL 2004: 320-327.
  • George Doddington. 2002. Automatic Evaluation of MT Quality using N-gram Co-occurrence Statistics. Proceedings of HLT 2002: 138-145.
  • David Kauchak and Regina Barzilay. 2006. Paraphrasing for Automatic Evaluation. Proceedings of HLT-NAACL 2006: 455-462.
  • Philipp Koehn, Franz Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. Proceedings of HLT-NAACL 2003: 48-54.
  • Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. Proceedings of MT Summit 2005: 79-86.
  • Philipp Koehn. 2004. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. Proceedings of AMTA 2004 (Machine Translation: From Real Users to Research): 115-124.
  • Karolina Owczarzak, Declan Groves, Josef van Genabith, and Andy Way. 2006. Contextual Bitext-Derived Paraphrases in Automatic MT Evaluation. Proceedings of the HLT-NAACL 2006 Workshop on Statistical Machine Translation: 86-93.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. Proceedings of ACL 2002: 311-318.
  • Matthew Snover, Bonnie Dorr, Richard Schwartz, John Makhoul, and Linnea Micciulla. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. Proceedings of AMTA 2006: 223-231.
  • Joseph P. Turian, Luke Shen, and I. Dan Melamed. 2003. Evaluation of Machine Translation and Its Evaluation. Proceedings of MT Summit 2003: 386-393.