Title: Measuring Confidence Intervals for MT Evaluation Metrics
1. Measuring Confidence Intervals for MT Evaluation Metrics
- Ying Zhang (Joy)
- Stephan Vogel
- Language Technologies Institute
- School of Computer Science
- Carnegie Mellon University
2. Outline
- Automatic Machine Translation Evaluation
  - BLEU
  - Modified BLEU
  - NIST MTEval
- Confidence Intervals based on Bootstrap Percentile
  - Algorithm
  - Comparing two MT systems
  - Implementation
- Discussions
  - How much testing data is needed?
  - How many reference translations are needed?
  - How many bootstrap samples are needed?
3. Automatic Machine Translation Evaluation
- Subjective MT evaluations
  - Fluency and adequacy scored by human judges
  - Very expensive in time and money
- Objective automatic MT evaluations
  - Inspired by the Word Error Rate metric used in ASR research
  - Measure the closeness between the MT hypothesis and human reference translations
    - Precision: n-gram precision
    - Recall: against the best-matched reference, approximated by the brevity penalty
  - Cheap, fast
  - Highly correlated with subjective evaluations
- MT research has greatly benefited from automatic evaluations
- Typical metrics: IBM BLEU, CMU M-BLEU, CMU METEOR, NIST MTeval, NYU GTM
4. BLEU Metrics
- Proposed by IBM's SMT group (Papineni et al., 2002)
- Widely used in MT evaluations
  - DARPA TIDES MT evaluation
  - IWSLT evaluation
  - TC-STAR
- BLEU metric
  - p_n: modified n-gram precision
  - Geometric mean of p_1, p_2, ..., p_N
  - BP: brevity penalty
  - Usually N=4 and w_n=1/N
  - c: length of the MT hypothesis; r: effective reference length
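The components above combine into the standard definition from Papineni et al. (2002):

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\Big( \sum_{n=1}^{N} w_n \log p_n \Big),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```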
5. BLEU Metric
- Example
  - MT Hypothesis: the gunman was shot dead by police .
  - Reference 1: The gunman was shot to death by the police .
  - Reference 2: The gunman was shot to death by the police .
  - Reference 3: Police killed the gunman .
  - Reference 4: The gunman was shot dead by the police .
- Precision: p1=1.0 (8/8), p2=0.86 (6/7), p3=0.67 (4/6), p4=0.6 (3/5)
- Brevity penalty: c=8, r=9, BP=0.8825
- Final score (worked out below)
- Usually n-gram precision and BP are calculated at the test-set level
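Plugging the slide's numbers into the formula gives the final score (arithmetic reconstructed here, since the original value appeared only in a figure):

```latex
\mathrm{BLEU}
= 0.8825 \cdot \Big( \tfrac{8}{8} \cdot \tfrac{6}{7} \cdot \tfrac{4}{6} \cdot \tfrac{3}{5} \Big)^{1/4}
= 0.8825 \cdot (0.3429)^{1/4}
\approx 0.8825 \cdot 0.765
\approx 0.68
```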
6. Modified BLEU Metric
- BLEU focuses heavily on long n-grams because of the geometric mean
- Example (table below; a sketch contrasting the two means follows)
- Modified BLEU metric (Zhang, 2004)
  - Arithmetic mean of the n-gram precisions
  - More balanced contribution from different n-grams

       p1     p2     p3     p4     BLEU
  MT1  1.0    0.21   0.11   0.06   0.19
  MT2  0.35   0.32   0.28   0.26   0.30
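A minimal sketch of the contrast (brevity penalty omitted; the precision values are taken from the table above):

```python
import math

def geometric_mean(ps):
    # BLEU-style combination: exp of the average log precision
    return math.exp(sum(math.log(p) for p in ps) / len(ps))

def arithmetic_mean(ps):
    # M-BLEU-style combination: plain average of the precisions
    return sum(ps) / len(ps)

mt1 = [1.0, 0.21, 0.11, 0.06]   # high unigram, weak long n-grams
mt2 = [0.35, 0.32, 0.28, 0.26]  # balanced across n-gram orders

for name, ps in (("MT1", mt1), ("MT2", mt2)):
    print(name, round(geometric_mean(ps), 2), round(arithmetic_mean(ps), 2))
# Geometric mean: MT1 ~0.19 vs MT2 ~0.30 -- MT1's weak 4-gram precision dominates.
# Arithmetic mean: MT1 ~0.34 vs MT2 ~0.30 -- each n-gram order contributes equally.
```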
7. NIST MTEval Metric
- Motivation
  - Weight more heavily those n-grams that are more informative (NIST, 2002)
  - Uses an arithmetic mean of the information-weighted n-gram scores
- Pros: more sensitive than BLEU
- Cons
  - Info gain for 2-grams and up is not meaningful
  - 80% of the score comes from unigram matches
  - Most matched 5-grams have an info gain of 0!
  - The score increases when the testing set size increases
8. Questions Regarding MT Evaluation Metrics
- Do they rank MT systems the same way human judges do?
  - IBM showed a strong correlation between BLEU and human judgments
- How reliable are the automatic evaluation scores?
- How sensitive is a metric?
  - Sensitivity: the metric should be able to distinguish between systems of similar performance
- Is the metric consistent?
  - Consistency: the difference between systems is not affected by the selection of testing/reference data
- How many reference translations are needed?
- How much testing data is sufficient for evaluation?
- If we can measure the confidence interval of the evaluation scores, we can answer the above questions
9. Outline
- Overview of Automatic Machine Translation Evaluation
  - BLEU
  - Modified BLEU
  - NIST MTEval
- Confidence Intervals based on Bootstrap Percentile
  - Algorithm
  - Comparing two MT systems
  - Implementation
- Discussions
  - How much testing data is needed?
  - How many reference translations are needed?
  - How many bootstrap samples are needed?
10. Measuring the Confidence Intervals
- One BLEU/M-BLEU/NIST score per test set
  - How accurate is this score?
- To measure the confidence interval, a population is required
  - Building a test set with multiple human reference translations is expensive
- Solution: bootstrapping (Efron & Tibshirani, 1986)
  - Introduced in 1979 as a computer-based method for estimating the standard errors of a statistical estimate
  - Resampling: creating an artificial population by sampling with replacement
  - Proposed by Franz Och (2003) to measure the confidence intervals for automatic MT evaluation metrics
11. A Schematic of the Bootstrapping Process
(Figure: the original test set, with score Score0, is resampled with replacement into bootstrap test sets, each of which is scored in turn)
12. An Efficient Implementation
- Translate and evaluate 2,000 test sets?
  - No way!
- Resample the n-gram precision information for the sentences
  - Most MT systems are context-independent at the sentence level
  - MT evaluation metrics are based on information collected for each testing sentence
- E.g. for BLEU/M-BLEU and NIST, store per sentence (n-grams in the hypothesis, n-grams matched, NIST information of the matches):
  - RefLen: 17 20 19 24
  - ClosestRefLen: 17
  - 1-gram: 15 10 89.34
  - 2-gram: 14 4 9.04
  - 3-gram: 13 3 3.65
  - 4-gram: 12 2 2.43
- Similar for human judgment and other MT metrics
- Approximation for the NIST information gain
- Scripts available at http://projectile.is.cs.cmu.edu/research/public/tools/bootStrap/tutorial.htm
- A sketch of such a per-sentence record follows below
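A minimal sketch (hypothetical field names, not the authors' script format) of such a per-sentence record, and of how the test-set BLEU is recomputed by pooling the counts:

```python
from dataclasses import dataclass
from typing import List
import math

@dataclass
class SentenceStats:
    """Per-sentence sufficient statistics, mirroring the record above."""
    hyp_len: int              # length of the MT hypothesis
    closest_ref_len: int      # reference length closest to the hypothesis length
    ngram_total: List[int]    # n-grams in the hypothesis, for n = 1..4
    ngram_matched: List[int]  # n-grams also found in a reference, for n = 1..4

def corpus_bleu(stats: List[SentenceStats], N: int = 4) -> float:
    """Recompute the test-set BLEU score by pooling per-sentence counts."""
    c = sum(s.hyp_len for s in stats)
    r = sum(s.closest_ref_len for s in stats)
    log_p = 0.0
    for n in range(N):
        matched = sum(s.ngram_matched[n] for s in stats)
        total = sum(s.ngram_total[n] for s in stats)
        if matched == 0:          # no credit at this order: BLEU is 0
            return 0.0
        log_p += math.log(matched / total) / N
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(log_p)
```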
13. Algorithm
- Original test suite T0 with N segments and R reference translations
- Represent the i-th segment of T0 as a tuple:
  - T0[i] = <s_i, r_i1, r_i2, ..., r_iR>
- for (b = 1; b <= B; b++) {
    for (i = 1; i <= N; i++) {
      s = random(1, N)     // sample with replacement
      Tb[i] = T0[s]
    }
    calculate BLEU/M-BLEU/NIST for Tb
  }
- Sort the B BLEU/M-BLEU/NIST scores
- Output the scores ranked at the 2.5th and 97.5th percentiles (a runnable sketch follows below)
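A runnable sketch of the loop above, assuming per-sentence statistics as on slide 12 and a corpus-level score function passed in (e.g. the corpus_bleu sketched earlier):

```python
import random

def bootstrap_ci(stats, score_fn, B=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample sentences with replacement B times.

    stats:    list of per-sentence sufficient statistics
    score_fn: maps such a list to a corpus-level score (BLEU/M-BLEU/NIST)
    Returns the (alpha/2, 1 - alpha/2) percentile interval of the scores.
    """
    rng = random.Random(seed)
    n = len(stats)
    scores = sorted(
        score_fn([stats[rng.randrange(n)] for _ in range(n)])
        for _ in range(B)
    )
    lo = scores[int(B * alpha / 2)]             # 2.5th percentile for alpha=0.05
    hi = scores[int(B * (1 - alpha / 2)) - 1]   # 97.5th percentile
    return lo, hi
```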
14. Confidence Intervals
15. Are Two MT Systems Different?
- Comparing the performance of two MT systems
  - Use a method similar to that for a single system: resample the score difference (a sketch follows at the end of this slide)
  - E.g. Diff(Sys1 - Sys2): median -1.7355, interval [-1.9056, -1.5453]
  - If the confidence interval overlaps 0, the two systems are not significantly different
- M-BLEU and NIST have more discriminative power than BLEU
- Automatic metrics have fairly high correlations with the human ranking
  - Human judges like system E (syntactic system) more than B (statistical system), but the automatic metrics do not
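A sketch of the paired comparison, under the same assumptions as the previous block (per-sentence statistics for both systems collected on the same test set):

```python
import random

def bootstrap_diff_ci(stats1, stats2, score_fn, B=2000, alpha=0.05, seed=0):
    """Paired bootstrap: draw the SAME sentence indices for both systems,
    so each score difference is computed on an identical bootstrap test set."""
    assert len(stats1) == len(stats2)
    rng = random.Random(seed)
    n = len(stats1)
    diffs = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(score_fn([stats1[i] for i in idx]) -
                     score_fn([stats2[i] for i in idx]))
    diffs.sort()
    lo = diffs[int(B * alpha / 2)]
    hi = diffs[int(B * (1 - alpha / 2)) - 1]
    # If [lo, hi] contains 0, the two systems are not significantly different.
    return lo, hi
```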
16. Outline
- Overview of Automatic Machine Translation Evaluation
  - BLEU
  - Modified BLEU
  - NIST MTEval
- Confidence Intervals based on Bootstrap Percentile
  - Algorithm
  - Comparing two MT systems
  - Implementation
- Discussions
  - How much testing data is needed?
  - How many reference translations are needed?
  - How many bootstrap samples are needed?
  - Non-parametric interval or normal/t-intervals?
17. How much testing data is needed?
18. How much testing data is needed?
- NIST scores increase steadily with growing test set size
- The distance between the scores of the different systems remains stable when using 40% or more of the test set
- The confidence intervals become narrower for larger test sets
- Rule of thumb: doubling the testing data size narrows the confidence interval by about 30% (theoretically justified; see below)

System A (bootstrap size B=2000)
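The theoretical justification is the usual square-root scaling of the standard error with sample size:

```latex
\frac{\text{interval width at } 2n}{\text{interval width at } n}
\approx \frac{1/\sqrt{2n}}{1/\sqrt{n}} = \frac{1}{\sqrt{2}} \approx 0.71
```

i.e. the interval shrinks by roughly 30% when the test set doubles.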
19. Effects of Using Multiple References
- A single reference from one translator may favor some systems
- Increasing the number of references narrows the relative confidence interval
20. How Many Reference Translations are Sufficient?
System A (bootstrap size B=2000)
- Confidence intervals become narrower with more reference translations
  - Relative width: 100% (1 ref), 80-90% (2 refs), 70-80% (3 refs), 60-70% (4 refs)
- One additional reference translation compensates for 10-15% of the testing data
21. Do We Really Need Multiple References?
- Parallel multiple references
- Single reference from multiple translators
  - Reduced bias from different translators
  - Yields the same confidence interval/reliability as parallel multiple references
  - Costs only half the effort compared to building a parallel multiple-reference set
- Originally proposed in IBM's BLEU report
22. Single Reference from Multiple Translators
- Reduced bias by mixing references from different translators
- Yields the same confidence intervals
23. Bootstrap-t Interval vs. Normal/t Interval
- Normal interval: score ± z(1-α) · se, assuming the score is normally distributed around the true value
- Student's t-interval (when n is small): score ± t(1-α, n-1) · se, assuming the standardized score follows a t-distribution with n-1 degrees of freedom
- Bootstrap-t interval
  - For each bootstrap sample b, calculate t*(b) = (score*(b) - score) / se*(b)
  - The α-th percentile is estimated by the value t(α) such that #{ t*(b) <= t(α) } / B = α
  - The bootstrap-t interval is ( score - t(1-α)·se, score - t(α)·se )
  - E.g. if B=1000, the 50th largest and the 950th largest values of t* give the bootstrap-t interval
24. Bootstrap-t Interval vs. Normal/t Interval (Cont.)
- The bootstrap-t interval assumes no distribution, but
  - It can give erratic results
  - It can be heavily influenced by a few outlying data points
- When B is large, the bootstrap sample scores are quite close to a normal distribution
- Assuming a normal distribution gives more reliable intervals (a sketch follows below), e.g. for the BLEU relative confidence interval (B=500):
  - STDEV 0.27 for the bootstrap-t interval
  - STDEV 0.14 for the normal/Student's t interval
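A minimal sketch of the normal-approximation interval computed from the B bootstrap scores (my formulation; the slide only reports the resulting standard deviations):

```python
import statistics

def normal_ci(bootstrap_scores, z=1.96):
    """Normal-approximation interval from bootstrap scores (95% for z=1.96):
    treat the scores as normally distributed and use mean +/- z * sd."""
    m = statistics.mean(bootstrap_scores)
    sd = statistics.stdev(bootstrap_scores)
    return m - z * sd, m + z * sd
```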
25. The Number of Bootstrap Replications B
- The ideal bootstrap estimate of the confidence interval takes B -> infinity
- Computational time increases linearly with B
- The greater B, the smaller the standard deviation of the estimated confidence intervals, e.g. for BLEU's relative confidence interval:
  - STDEV 0.60 when B=100; STDEV 0.27 when B=500
- Two rules of thumb
  - Even a small B, say B=100, is usually informative
  - B > 1000 gives quite satisfactory results
26. Conclusions
- Using the bootstrapping method to measure confidence intervals for MT evaluation metrics
- Using confidence intervals to study the characteristics of an MT evaluation metric
  - Correlation with human judgments
  - Sensitivity
  - Consistency
- Modified BLEU is a better metric than BLEU
- A single reference from multiple translators is as good as parallel multiple references and costs only half the effort
27. References
- B. Efron and R. Tibshirani 1986, "Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy", Statistical Science 1, pp. 54-77.
- F. Och 2003, "Minimum Error Rate Training in Statistical Machine Translation", In Proc. of ACL, Sapporo, Japan.
- M. Bisani and H. Ney 2004, "Bootstrap Estimates for Confidence Intervals in ASR Performance Evaluation", In Proc. of ICASSP, Montreal, Canada, Vol. 1, pp. 409-412.
- G. Leusch, N. Ueffing, H. Ney 2003, "A Novel String-to-String Distance Measure with Applications to Machine Translation Evaluation", In Proc. of the 9th MT Summit, New Orleans, LA.
- I. Dan Melamed, Ryan Green and Joseph P. Turian 2003, "Precision and Recall of Machine Translation", In Proc. of NAACL/HLT 2003, Edmonton, Canada.
- M. King, A. Popescu-Belis, E. Hovy 2003, "FEMTI: creating and using a framework for MT evaluation", In Proc. of the 9th Machine Translation Summit, New Orleans, LA, USA.
- S. Nießen, F.J. Och, G. Leusch, H. Ney 2000, "An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research", In Proc. of LREC 2000, Athens, Greece.
- NIST Report 2002, "Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics", http://www.nist.gov/speech/tests/mt/doc/ngram-study.pdf
- K. Papineni, S. Roukos, et al. 2002, "BLEU: A Method for Automatic Evaluation of Machine Translation", In Proc. of the 40th ACL.
- Ying Zhang, Stephan Vogel, Alex Waibel 2004, "Interpreting BLEU/NIST Scores: How Much Improvement Do We Need to Have a Better System?", In Proc. of LREC 2004, Lisbon, Portugal.
28. Questions and Comments?
29. N-gram Contributions to NIST Score