Measuring Confidence Intervals for MT Evaluation Metrics

Transcript and Presenter's Notes
1
Measuring Confidence Intervals for MT Evaluation
Metrics
  • Ying Zhang (Joy)
  • Stephan Vogel
  • Language Technologies Institute
  • School of Computer Science
  • Carnegie Mellon University

2
Outline
  • Automatic Machine Translation Evaluation
  • BLEU
  • Modified BLEU
  • NIST MTEval
  • Confidence Intervals based on Bootstrap
    Percentile
  • Algorithm
  • Comparing two MT systems
  • Implementation
  • Discussions
  • How much testing data is needed?
  • How many reference translations are needed?
  • How many bootstrap samples are needed?

3
Automatic Machine Translation Evaluation
  • Subjective MT evaluations
  • Fluency and Adequacy scored by human judges
  • Very expensive in time and money
  • Objective automatic MT evaluations
  • Inspired by the Word Error Rate metric used by
    ASR research
  • Measuring the closeness between the MT
    hypothesis and human reference translations
  • Precision: n-gram precision
  • Recall: measured against the best-matched reference, approximated by the brevity penalty
  • Cheap, fast
  • Highly correlated with subjective evaluations
  • MT research has greatly benefited from automatic evaluations
  • Typical metrics: IBM BLEU, CMU M-BLEU, CMU METEOR, NIST MTEval, NYU GTM

4
BLEU Metrics
  • Proposed by IBM's SMT group (Papineni et al., 2002)
  • Widely used in MT evaluations
  • DARPA TIDES MT evaluation
  • IWSLT evaluation
  • TC-Star
  • BLEU metric
  • pn: modified n-gram precision
  • Geometric mean of p1, p2, ..., pN
  • BP: brevity penalty
  • Usually, N = 4 and wn = 1/N

c: length of the MT hypothesis; r: effective reference length
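
For reference, the standard BLEU definition from Papineni et al. (2002), with uniform weights wn = 1/N:

    BLEU = BP * exp( sum_{n=1..N} wn * log pn )
    BP   = 1 if c > r, else exp(1 - r/c)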
5
BLEU Metric
  • Example
  • MT Hypothesis: the gunman was shot dead by police .
  • Reference 1: The gunman was shot to death by the police .
  • Reference 2: The gunman was shot to death by the police .
  • Reference 3: Police killed the gunman .
  • Reference 4: The gunman was shot dead by the police .
  • Precision: p1 = 1.0 (8/8), p2 = 0.86 (6/7), p3 = 0.67 (4/6), p4 = 0.6 (3/5)
  • Brevity penalty: c = 8, r = 9, BP = 0.8825
  • Final score: BLEU ≈ 0.68 (see the computation below)
  • Usually, n-gram precision and BP are calculated at the test set level
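
A minimal Python sketch of the score computation for this example, combining the n-gram precisions and the brevity penalty above:

    import math

    # n-gram precisions from the example above
    precisions = [8/8, 6/7, 4/6, 3/5]
    c, r = 8, 9                                   # hypothesis length, effective reference length
    bp = 1.0 if c > r else math.exp(1 - r / c)    # brevity penalty = 0.8825

    # BLEU = BP * geometric mean of p1..p4 (uniform weights wn = 1/4)
    log_avg = sum(math.log(p) for p in precisions) / len(precisions)
    bleu = bp * math.exp(log_avg)
    print(round(bp, 4), round(bleu, 4))           # 0.8825 0.6753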

6
Modified BLEU Metric
  • BLEU focuses heavily on long n-grams because of
    the geometric mean
  • Example
  • Modified BLEU Metric (Zhang, 2004)
  • Arithmetic mean of the n-gram precision
  • More balanced contribution from different n-grams

      p1     p2     p3     p4     BLEU
MT1   1.0    0.21   0.11   0.06   0.19
MT2   0.35   0.32   0.28   0.26   0.30
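
A small Python sketch contrasting the two ways of combining the n-gram precisions in the table above (brevity penalty omitted):

    import math

    def geo_mean(ps):            # BLEU-style combination
        return math.exp(sum(math.log(p) for p in ps) / len(ps))

    def arith_mean(ps):          # M-BLEU-style combination
        return sum(ps) / len(ps)

    mt1 = [1.0, 0.21, 0.11, 0.06]
    mt2 = [0.35, 0.32, 0.28, 0.26]
    print(geo_mean(mt1), geo_mean(mt2))      # ~0.19 vs ~0.30: the low p4 dominates MT1
    print(arith_mean(mt1), arith_mean(mt2))  # 0.345 vs 0.3025: more balanced contributions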
7
NIST MTEval Metric
  • Motivation
  • Weight more heavily those n-grams that are more
    informative (NIST 2002)
  • Uses an arithmetic mean of the information-weighted n-gram scores (rather than BLEU's geometric mean)
  • Pros: more sensitive than BLEU
  • Cons:
  • Info gain for 2-grams and up is not meaningful
  • 80% of the score comes from unigram matches
  • Most matched 5-grams have an info gain of 0!
  • Score increases when the testing set size
    increases
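
Roughly, the NIST 2002 report defines the information weight of a matched n-gram from reference-corpus counts as

    Info(w1 ... wn) = log2( count(w1 ... wn-1) / count(w1 ... wn) )

so a matched n-gram that occurs exactly as often as its (n-1)-word prefix contributes log2(1) = 0, which is why most matched 5-grams add nothing to the score.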

8
Questions Regarding MT Evaluation Metrics
  • Do they rank the MT systems in the same way as
    human judges?
  • IBM showed a strong correlation between BLEU and
    human judgments
  • How reliable are the automatic evaluation scores?
  • How sensitive is a metric?
  • Sensitivity: the metric should be able to distinguish between systems of similar performance
  • Is the metric consistent?
  • Consistency: the difference between systems is not affected by the selection of testing/reference data
  • How many reference translations are needed?
  • How much testing data is sufficient for
    evaluation?
  • If we can measure the confidence interval of the
    evaluation scores, we can answer the above
    questions

9
Outline
  • Overview of Automatic Machine Translation
    Evaluation
  • BLEU
  • Modified BLEU
  • NIST MTEval
  • Confidence Intervals based on Bootstrap
    Percentile
  • Algorithm
  • Comparing two MT systems
  • Implementation
  • Discussions
  • How much testing data is needed?
  • How many reference translations are needed?
  • How many bootstrap samples are needed?

10
Measuring the Confidence Intervals
  • One BLEU/M-BLEU/NIST score per test set
  • How accurate is this score?
  • To measure a confidence interval, a population is required
  • Building a test set with multiple human reference translations is expensive
  • Solution: bootstrapping (Efron, 1986)
  • Introduced in 1979 as a computer-based method for estimating the standard error of a statistical estimate
  • Resampling: creating an artificial population by sampling with replacement
  • Proposed by Franz Och (2003) to measure the
    confidence intervals for automatic MT evaluation
    metrics

11
A Schematic of the Bootstrapping Process
12
An Efficient Implementation
  • Translate and evaluate 2,000 test sets?
  • No Way!
  • Resample the n-gram precision information for the
    sentences
  • Most MT systems are context independent at the
    sentence level
  • MT evaluation metrics are based on information collected for each testing sentence
  • E.g., for BLEU/M-BLEU and NIST:
  • RefLen: 17 20 19 24
  • ClosestRefLen: 17
  • 1-gram: 15 10 89.34
  • 2-gram: 14 4 9.04
  • 3-gram: 13 3 3.65
  • 4-gram: 12 2 2.43
  • Similar for human judgment and other MT metrics
  • Approximation for NIST information gain
  • Scripts available at http://projectile.is.cs.cmu.edu/research/public/tools/bootStrap/tutorial.htm
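
A minimal Python sketch of the aggregation idea, assuming a hypothetical per-sentence record that stores the hypothesis length, the closest reference length, and per n-gram order the hypothesis n-gram count and the matched count (the actual file format of the linked scripts may differ). Because only these cached counts are resampled, no re-translation of the bootstrap test sets is needed:

    import math

    def corpus_bleu(records, max_n=4):
        """Combine per-sentence counts into one corpus-level BLEU score.

        records: list of dicts with keys 'hyp_len', 'closest_ref_len',
                 'total' and 'match' (the last two map n -> count).
        """
        hyp_len = sum(r["hyp_len"] for r in records)
        ref_len = sum(r["closest_ref_len"] for r in records)
        bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
        log_p = sum(math.log(sum(r["match"][n] for r in records) /
                             sum(r["total"][n] for r in records))
                    for n in range(1, max_n + 1)) / max_n
        return bp * math.exp(log_p)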

13
Algorithm
  • Original test suite T0 with N segments and R reference translations
  • Represent the i-th segment of T0 as a tuple:
  • T0[i] = <s_i, r_i1, r_i2, ..., r_iR>
  • for (b = 1; b <= B; b++)
  •   for (i = 1; i <= N; i++)
  •     s = random(1, N)
  •     Tb[i] = T0[s]
  •   Calculate BLEU/M-BLEU/NIST for Tb
  • Sort the B BLEU/M-BLEU/NIST scores
  • Output the scores ranked at the 2.5th and 97.5th percentiles
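
A runnable Python sketch of the same procedure, assuming a generic score(test_set) function (for instance the corpus_bleu above) that evaluates one resampled test suite:

    import random

    def bootstrap_interval(T0, score, B=2000, alpha=0.05):
        """Bootstrap percentile confidence interval for a corpus-level metric.

        T0:    list of segments (source, references, or cached per-sentence stats)
        score: function mapping a list of segments to one metric score
        """
        N = len(T0)
        scores = []
        for _ in range(B):
            Tb = [T0[random.randrange(N)] for _ in range(N)]  # sample with replacement
            scores.append(score(Tb))
        scores.sort()
        lower = scores[int(B * alpha / 2)]             # ~2.5th percentile
        upper = scores[int(B * (1 - alpha / 2)) - 1]   # ~97.5th percentile
        return lower, upper

Called, for example, as bootstrap_interval(records, corpus_bleu) on the per-sentence records described above.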

14
Confidence Intervals
15
Are Two MT Systems Different?
  • Comparing two MT systems' performance
  • Using a similar method as for a single system
  • E.g. Diff(Sys1 - Sys2): median = -1.7355, 95% interval [-1.9056, -1.5453]
  • If the confidence interval contains 0, the two systems are not significantly different
  • M-BLEU and NIST have more discriminative power than BLEU
  • Automatic metrics have pretty high correlations
    with the human ranking
  • Human judges like system E (Syntactic system)
    more than B (Statistical system), but automatic
    metrics do not
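
One common way to realize this comparison, sketched in Python under the assumption that the same resampled segments are scored for both systems (a paired bootstrap), reusing the resampling idea from bootstrap_interval above:

    import random

    def bootstrap_diff_interval(recs1, recs2, score, B=2000, alpha=0.05):
        """Percentile interval for score(system 1) - score(system 2).

        recs1 and recs2 are sentence-aligned per-sentence records for the two
        systems; each bootstrap replicate uses the same segment indices for both.
        """
        N = len(recs1)
        diffs = []
        for _ in range(B):
            idx = [random.randrange(N) for _ in range(N)]
            diffs.append(score([recs1[i] for i in idx]) -
                         score([recs2[i] for i in idx]))
        diffs.sort()
        return diffs[int(B * alpha / 2)], diffs[int(B * (1 - alpha / 2)) - 1]

    # If the returned interval does not contain 0, the difference is significant.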

16
Outline
  • Overview of Automatic Machine Translation
    Evaluation
  • BLEU
  • Modified BLEU
  • NIST MTEval
  • Confidence Intervals based on Bootstrap
    Percentile
  • Algorithm
  • Comparing two MT systems
  • Implementation
  • Discussions
  • How much testing data is needed?
  • How many reference translations are needed?
  • How many bootstrap samples are needed?
  • Non-parametric interval or normal/t-intervals?

17
How much testing data is needed
18
How much testing data is needed
  • NIST scores increase steadily with growing test set size
  • The distance between the scores of the different systems remains stable when using 40% or more of the test set
  • The confidence intervals become narrower for larger test sets
  • Rule of thumb: doubling the testing data size narrows the confidence interval by about 30% (theoretically justified: interval width scales with 1/sqrt(N), and 1/sqrt(2) ≈ 0.71)

System A (bootstrap size B = 2000)
19
Effects of Using Multiple References
  • A single reference from one translator may favor some systems
  • Increasing the number of references narrows the relative confidence interval

20
How Many Reference Translations are Sufficient?
System A (bootstrap size B = 2000)
  • Confidence intervals become narrower with more reference translations
  • Relative interval width: 100% (1 ref), 80-90% (2 refs), 70-80% (3 refs), 60-70% (4 refs)
  • One additional reference translation compensates for 10-15% of the testing data

21
Do We Really Need Multiple References?
  • Parallel multiple references
  • Single reference from multiple translators
  • Reduces bias from individual translators
  • Yields the same confidence interval / reliability as parallel multiple references
  • Costs only half the effort compared to building a parallel multiple-reference set

Originally proposed in IBM's BLEU report
22
Single Reference from Multiple Translators
  • Reduces bias by mixing references from different translators
  • Yields the same confidence intervals as parallel multiple references

23
Bootstrap-t Interval vs. Normal/t Interval
  • Normal distribution / t-distribution
  • Normal interval: mean ± z_(alpha/2) * sigma / sqrt(n), assuming the scores are normally distributed
  • Student's t-interval: mean ± t_(alpha/2, n-1) * s / sqrt(n) (when n is small and sigma is estimated from the sample)
  • Bootstrap-t interval
  • For each bootstrap sample b, calculate Z(b) = (Score(b) - Score0) / se(b), where Score0 is the score on the original test set
  • The alpha-th percentile is estimated by the value t(alpha) such that #{ b : Z(b) <= t(alpha) } / B = alpha
  • The bootstrap-t interval is then (Score0 - t(1 - alpha) * se, Score0 - t(alpha) * se)
  • E.g. if B = 1000, the 50th and the 950th largest Z(b) values give the endpoints of the bootstrap-t interval
24
Bootstrap-t interval vs. Normal/t interval (Cont.)
  • The bootstrap-t interval assumes no particular distribution, but
  • It can give erratic results
  • It can be heavily influenced by a few outlying data points
  • When B is large, the bootstrap sample scores are quite close to a normal distribution
  • Assuming a normal distribution gives more reliable intervals, e.g. for the BLEU relative confidence interval (B = 500):
  • STDEV = 0.27 for the bootstrap-t interval
  • STDEV = 0.14 for the normal/Student's t interval
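
A minimal Python sketch of the normal-approximation alternative: fit a normal distribution to the B bootstrap scores (e.g. the scores collected in bootstrap_interval above) instead of reading off raw percentiles.

    import statistics

    def normal_interval(bootstrap_scores, z=1.96):
        """Approximate 95% interval assuming the bootstrap scores are roughly normal."""
        m = statistics.mean(bootstrap_scores)
        s = statistics.stdev(bootstrap_scores)
        return m - z * s, m + z * s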

25
The Number of Bootstrap Replications B
  • The ideal bootstrap estimate of the confidence interval takes B → ∞
  • Computational time increases linearly with B
  • The greater B, the smaller the standard deviation of the estimated confidence intervals, e.g. for BLEU's relative confidence interval:
  • STDEV = 0.60 when B = 100; STDEV = 0.27 when B = 500
  • Two rules of thumb:
  • Even a small B, say B = 100, is usually informative
  • B > 1000 gives quite satisfactory results

26
Conclusions
  • Using the bootstrapping method to measure confidence intervals for MT evaluation metrics
  • Using confidence intervals to study the
    characteristics of an MT evaluation metric
  • Correlation with human judgments
  • Sensitivity
  • Consistency
  • Modified BLEU is a better metric than BLEU
  • Single reference from multiple translators is as
    good as parallel multiple references and costs
    only half the effort

27
References
  • Efron, B. and Tibshirani, R. 1986, 'Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy', Statistical Science 1, pp. 54-77.
  • Och, F. 2003, 'Minimum Error Rate Training in Statistical Machine Translation', In Proc. of ACL 2003, Sapporo, Japan.
  • Bisani, M. and Ney, H. 2004, 'Bootstrap Estimates for Confidence Intervals in ASR Performance Evaluation', In Proc. of ICASSP 2004, Montreal, Canada, Vol. 1, pp. 409-412.
  • Leusch, G., Ueffing, N. and Ney, H. 2003, 'A Novel String-to-String Distance Measure with Applications to Machine Translation Evaluation', In Proc. of the 9th MT Summit, New Orleans, LA, USA.
  • Melamed, I. D., Green, R. and Turian, J. P. 2003, 'Precision and Recall of Machine Translation', In Proc. of NAACL/HLT 2003, Edmonton, Canada.
  • King, M., Popescu-Belis, A. and Hovy, E. 2003, 'FEMTI: Creating and Using a Framework for MT Evaluation', In Proc. of the 9th Machine Translation Summit, New Orleans, LA, USA.
  • Nießen, S., Och, F. J., Leusch, G. and Ney, H. 2000, 'An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research', In Proc. of LREC 2000, Athens, Greece.
  • NIST Report 2002, 'Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics', http://www.nist.gov/speech/tests/mt/doc/ngram-study.pdf
  • Papineni, K., Roukos, S. et al. 2002, 'BLEU: a Method for Automatic Evaluation of Machine Translation', In Proc. of ACL 2002.
  • Zhang, Y., Vogel, S. and Waibel, A. 2004, 'Interpreting BLEU/NIST Scores: How Much Improvement Do We Need to Have a Better System?', In Proc. of LREC 2004, Lisbon, Portugal.

28
Questions and Comments?
29
N-gram Contributions to NIST Score