Title: Measuring Confidence Intervals for MT Evaluation Metrics
1. Measuring Confidence Intervals for MT Evaluation Metrics
- Ying Zhang (Joy)
- Stephan Vogel
- Language Technologies Institute
- School of Computer Science
- Carnegie Mellon University
2. Outline
- Automatic Machine Translation Evaluation
  - BLEU
  - Modified BLEU
  - NIST MTEval
- Confidence Intervals based on Bootstrap Percentile
  - Algorithm
  - Comparing two MT systems
  - Implementation
- Discussions
  - How much testing data is needed?
  - How many reference translations are needed?
  - How many bootstrap samples are needed?
3. Automatic Machine Translation Evaluation
- Subjective MT evaluations
  - Fluency and adequacy scored by human judges
  - Very expensive in time and money
- Objective automatic MT evaluations
  - Inspired by the Word Error Rate metric used in ASR research
  - Measure the closeness between the MT hypothesis and human reference translations
    - Precision: n-gram precision
    - Recall: against the best-matched reference, approximated by the brevity penalty
  - Cheap, fast
  - Highly correlated with subjective evaluations
- MT research has greatly benefited from automatic evaluations
- Typical metrics: IBM BLEU, CMU M-BLEU, CMU METEOR, NIST MTeval, NYU GTM
4. BLEU Metrics
- Proposed by IBM's SMT group (Papineni et al., 2002)
- Widely used in MT evaluations
  - DARPA TIDES MT evaluation
  - IWSLT evaluation
  - TC-STAR
- BLEU metric
  - p_n: modified n-gram precision
  - Geometric mean of p_1, p_2, ..., p_N
  - BP: brevity penalty
  - Usually N=4 and w_n=1/N
  - c: length of the MT hypothesis; r: effective reference length
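The components above combine into the standard definition from Papineni et al. (2002):

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\Big( \sum_{n=1}^{N} w_n \log p_n \Big),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```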
5. BLEU Metric
- Example
  - MT Hypothesis: the gunman was shot dead by police .
  - Reference 1: The gunman was shot to death by the police .
  - Reference 2: The gunman was shot to death by the police .
  - Reference 3: Police killed the gunman .
  - Reference 4: The gunman was shot dead by the police .
- Precision: p1=1.0 (8/8), p2=0.86 (6/7), p3=0.67 (4/6), p4=0.6 (3/5)
- Brevity penalty: c=8, r=9, BP=0.8825
- Final score (worked out below)
- Usually n-gram precision and BP are calculated at the test-set level
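Plugging the slide's numbers into the formula gives the final score (arithmetic reconstructed here, since the original value appeared only in a figure):

```latex
\mathrm{BLEU}
= 0.8825 \cdot \Big( \tfrac{8}{8} \cdot \tfrac{6}{7} \cdot \tfrac{4}{6} \cdot \tfrac{3}{5} \Big)^{1/4}
= 0.8825 \cdot (0.3429)^{1/4}
\approx 0.8825 \cdot 0.765
\approx 0.68
```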
6. Modified BLEU Metric
- BLEU focuses heavily on long n-grams because of the geometric mean
- Example (table below; a sketch contrasting the two means follows)
- Modified BLEU metric (Zhang, 2004)
  - Arithmetic mean of the n-gram precisions
  - More balanced contribution from different n-grams

       p1     p2     p3     p4     BLEU
  MT1  1.0    0.21   0.11   0.06   0.19
  MT2  0.35   0.32   0.28   0.26   0.30
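A minimal sketch of the contrast (brevity penalty omitted; the precision values are taken from the table above):

```python
import math

def geometric_mean(ps):
    # BLEU-style combination: exp of the average log precision
    return math.exp(sum(math.log(p) for p in ps) / len(ps))

def arithmetic_mean(ps):
    # M-BLEU-style combination: plain average of the precisions
    return sum(ps) / len(ps)

mt1 = [1.0, 0.21, 0.11, 0.06]   # high unigram, weak long n-grams
mt2 = [0.35, 0.32, 0.28, 0.26]  # balanced across n-gram orders

for name, ps in (("MT1", mt1), ("MT2", mt2)):
    print(name, round(geometric_mean(ps), 2), round(arithmetic_mean(ps), 2))
# Geometric mean: MT1 ~0.19 vs MT2 ~0.30 -- MT1's weak 4-gram precision dominates.
# Arithmetic mean: MT1 ~0.34 vs MT2 ~0.30 -- each n-gram order contributes equally.
```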
7. NIST MTEval Metric
- Motivation
  - Weight more heavily those n-grams that are more informative (NIST, 2002)
  - Uses an arithmetic mean of the information-weighted n-gram scores
- Pros: more sensitive than BLEU
- Cons
  - Info gain for 2-grams and up is not meaningful
  - 80% of the score comes from unigram matches
  - Most matched 5-grams have an info gain of 0!
  - The score increases when the testing set size increases
8. Questions Regarding MT Evaluation Metrics
- Do they rank MT systems the same way human judges do?
  - IBM showed a strong correlation between BLEU and human judgments
- How reliable are the automatic evaluation scores?
- How sensitive is a metric?
  - Sensitivity: the metric should be able to distinguish between systems of similar performance
- Is the metric consistent?
  - Consistency: the difference between systems is not affected by the selection of testing/reference data
- How many reference translations are needed?
- How much testing data is sufficient for evaluation?
- If we can measure the confidence interval of the evaluation scores, we can answer the above questions
9. Outline
- Overview of Automatic Machine Translation Evaluation
  - BLEU
  - Modified BLEU
  - NIST MTEval
- Confidence Intervals based on Bootstrap Percentile
  - Algorithm
  - Comparing two MT systems
  - Implementation
- Discussions
  - How much testing data is needed?
  - How many reference translations are needed?
  - How many bootstrap samples are needed?
10. Measuring the Confidence Intervals
- One BLEU/M-BLEU/NIST score per test set
  - How accurate is this score?
- To measure the confidence interval, a population is required
  - Building a test set with multiple human reference translations is expensive
- Solution: bootstrapping (Efron & Tibshirani, 1986)
  - Introduced in 1979 as a computer-based method for estimating the standard errors of a statistical estimate
  - Resampling: creating an artificial population by sampling with replacement
  - Proposed by Franz Och (2003) to measure the confidence intervals for automatic MT evaluation metrics
11. A Schematic of the Bootstrapping Process
(Figure: the original test set, with score Score0, is resampled with replacement into bootstrap test sets, each of which is scored in turn)
12. An Efficient Implementation
- Translate and evaluate 2,000 test sets?
  - No way!
- Resample the n-gram precision information for the sentences
  - Most MT systems are context-independent at the sentence level
  - MT evaluation metrics are based on information collected for each testing sentence
- E.g. for BLEU/M-BLEU and NIST, store per sentence (n-grams in the hypothesis, n-grams matched, NIST information of the matches):
  - RefLen: 17 20 19 24
  - ClosestRefLen: 17
  - 1-gram: 15 10 89.34
  - 2-gram: 14 4 9.04
  - 3-gram: 13 3 3.65
  - 4-gram: 12 2 2.43
- Similar for human judgment and other MT metrics
- Approximation for the NIST information gain
- Scripts available at http://projectile.is.cs.cmu.edu/research/public/tools/bootStrap/tutorial.htm
- A sketch of such a per-sentence record follows below
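A minimal sketch (hypothetical field names, not the authors' script format) of such a per-sentence record, and of how the test-set BLEU is recomputed by pooling the counts:

```python
from dataclasses import dataclass
from typing import List
import math

@dataclass
class SentenceStats:
    """Per-sentence sufficient statistics, mirroring the record above."""
    hyp_len: int              # length of the MT hypothesis
    closest_ref_len: int      # reference length closest to the hypothesis length
    ngram_total: List[int]    # n-grams in the hypothesis, for n = 1..4
    ngram_matched: List[int]  # n-grams also found in a reference, for n = 1..4

def corpus_bleu(stats: List[SentenceStats], N: int = 4) -> float:
    """Recompute the test-set BLEU score by pooling per-sentence counts."""
    c = sum(s.hyp_len for s in stats)
    r = sum(s.closest_ref_len for s in stats)
    log_p = 0.0
    for n in range(N):
        matched = sum(s.ngram_matched[n] for s in stats)
        total = sum(s.ngram_total[n] for s in stats)
        if matched == 0:          # no credit at this order: BLEU is 0
            return 0.0
        log_p += math.log(matched / total) / N
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(log_p)
```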
13. Algorithm
- Original test suite T0 with N segments and R reference translations
- Represent the i-th segment of T0 as a tuple:
  - T0[i] = <s_i, r_i1, r_i2, ..., r_iR>
- for (b = 1; b <= B; b++) {
    for (i = 1; i <= N; i++) {
      s = random(1, N)     // sample with replacement
      Tb[i] = T0[s]
    }
    calculate BLEU/M-BLEU/NIST for Tb
  }
- Sort the B BLEU/M-BLEU/NIST scores
- Output the scores ranked at the 2.5th and 97.5th percentiles (a runnable sketch follows below)
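A runnable sketch of the loop above, assuming per-sentence statistics as on slide 12 and a corpus-level score function passed in (e.g. the corpus_bleu sketched earlier):

```python
import random

def bootstrap_ci(stats, score_fn, B=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample sentences with replacement B times.

    stats:    list of per-sentence sufficient statistics
    score_fn: maps such a list to a corpus-level score (BLEU/M-BLEU/NIST)
    Returns the (alpha/2, 1 - alpha/2) percentile interval of the scores.
    """
    rng = random.Random(seed)
    n = len(stats)
    scores = sorted(
        score_fn([stats[rng.randrange(n)] for _ in range(n)])
        for _ in range(B)
    )
    lo = scores[int(B * alpha / 2)]             # 2.5th percentile for alpha=0.05
    hi = scores[int(B * (1 - alpha / 2)) - 1]   # 97.5th percentile
    return lo, hi
```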
14. Confidence Intervals
15. Are Two MT Systems Different?
- Comparing the performance of two MT systems
  - Use a method similar to that for a single system: resample the score difference (a sketch follows at the end of this slide)
  - E.g. Diff(Sys1 - Sys2): median -1.7355, interval [-1.9056, -1.5453]
  - If the confidence interval overlaps 0, the two systems are not significantly different
- M-BLEU and NIST have more discriminative power than BLEU
- Automatic metrics have fairly high correlations with the human ranking
  - Human judges like system E (syntactic system) more than B (statistical system), but the automatic metrics do not
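A sketch of the paired comparison, under the same assumptions as the previous block (per-sentence statistics for both systems collected on the same test set):

```python
import random

def bootstrap_diff_ci(stats1, stats2, score_fn, B=2000, alpha=0.05, seed=0):
    """Paired bootstrap: draw the SAME sentence indices for both systems,
    so each score difference is computed on an identical bootstrap test set."""
    assert len(stats1) == len(stats2)
    rng = random.Random(seed)
    n = len(stats1)
    diffs = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(score_fn([stats1[i] for i in idx]) -
                     score_fn([stats2[i] for i in idx]))
    diffs.sort()
    lo = diffs[int(B * alpha / 2)]
    hi = diffs[int(B * (1 - alpha / 2)) - 1]
    # If [lo, hi] contains 0, the two systems are not significantly different.
    return lo, hi
```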
16. Outline
- Overview of Automatic Machine Translation Evaluation
  - BLEU
  - Modified BLEU
  - NIST MTEval
- Confidence Intervals based on Bootstrap Percentile
  - Algorithm
  - Comparing two MT systems
  - Implementation
- Discussions
  - How much testing data is needed?
  - How many reference translations are needed?
  - How many bootstrap samples are needed?
  - Non-parametric interval or normal/t-intervals?
17. How much testing data is needed?
18. How much testing data is needed?
- NIST scores increase steadily with growing test set size
- The distance between the scores of the different systems remains stable when using 40% or more of the test set
- The confidence intervals become narrower for larger test sets
- Rule of thumb: doubling the testing data size narrows the confidence interval by about 30% (theoretically justified; see below)

System A (bootstrap size B=2000)
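The theoretical justification is the usual square-root scaling of the standard error with sample size:

```latex
\frac{\text{interval width at } 2n}{\text{interval width at } n}
\approx \frac{1/\sqrt{2n}}{1/\sqrt{n}} = \frac{1}{\sqrt{2}} \approx 0.71
```

i.e. the interval shrinks by roughly 30% when the test set doubles.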
19. Effects of Using Multiple References
- A single reference from one translator may favor some systems
- Increasing the number of references narrows the relative confidence interval
20. How Many Reference Translations are Sufficient?
System A (bootstrap size B=2000)
- Confidence intervals become narrower with more reference translations
  - Relative width: 100% (1 ref), 80-90% (2 refs), 70-80% (3 refs), 60-70% (4 refs)
- One additional reference translation compensates for 10-15% of the testing data
21. Do We Really Need Multiple References?
- Parallel multiple references
- Single reference from multiple translators
  - Reduced bias from different translators
  - Yields the same confidence interval/reliability as parallel multiple references
  - Costs only half the effort compared to building a parallel multiple-reference set
- Originally proposed in IBM's BLEU report
22. Single Reference from Multiple Translators
- Reduced bias by mixing references from different translators
- Yields the same confidence intervals
23. Bootstrap-t Interval vs. Normal/t Interval
- Normal interval: score ± z(1-α) · se, assuming the score is normally distributed around the true value
- Student's t-interval (when n is small): score ± t(1-α, n-1) · se, assuming the standardized score follows a t-distribution with n-1 degrees of freedom
- Bootstrap-t interval
  - For each bootstrap sample b, calculate t*(b) = (score*(b) - score) / se*(b)
  - The α-th percentile is estimated by the value t(α) such that #{ t*(b) <= t(α) } / B = α
  - The bootstrap-t interval is ( score - t(1-α)·se, score - t(α)·se )
  - E.g. if B=1000, the 50th largest and the 950th largest values of t* give the bootstrap-t interval
24. Bootstrap-t Interval vs. Normal/t Interval (Cont.)
- The bootstrap-t interval assumes no distribution, but
  - It can give erratic results
  - It can be heavily influenced by a few outlying data points
- When B is large, the bootstrap sample scores are quite close to a normal distribution
- Assuming a normal distribution gives more reliable intervals (a sketch follows below), e.g. for the BLEU relative confidence interval (B=500):
  - STDEV 0.27 for the bootstrap-t interval
  - STDEV 0.14 for the normal/Student's t interval
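A minimal sketch of the normal-approximation interval computed from the B bootstrap scores (my formulation; the slide only reports the resulting standard deviations):

```python
import statistics

def normal_ci(bootstrap_scores, z=1.96):
    """Normal-approximation interval from bootstrap scores (95% for z=1.96):
    treat the scores as normally distributed and use mean +/- z * sd."""
    m = statistics.mean(bootstrap_scores)
    sd = statistics.stdev(bootstrap_scores)
    return m - z * sd, m + z * sd
```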
25. The Number of Bootstrap Replications B
- The ideal bootstrap estimate of the confidence interval takes B -> infinity
- Computational time increases linearly with B
- The greater B, the smaller the standard deviation of the estimated confidence intervals, e.g. for BLEU's relative confidence interval:
  - STDEV 0.60 when B=100; STDEV 0.27 when B=500
- Two rules of thumb
  - Even a small B, say B=100, is usually informative
  - B > 1000 gives quite satisfactory results
26. Conclusions
- Using the bootstrapping method to measure confidence intervals for MT evaluation metrics
- Using confidence intervals to study the characteristics of an MT evaluation metric
  - Correlation with human judgments
  - Sensitivity
  - Consistency
- Modified BLEU is a better metric than BLEU
- A single reference from multiple translators is as good as parallel multiple references and costs only half the effort
27. References
- B. Efron and R. Tibshirani 1986, "Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy", Statistical Science 1, pp. 54-77.
- F. Och 2003, "Minimum Error Rate Training in Statistical Machine Translation", In Proc. of ACL, Sapporo, Japan.
- M. Bisani and H. Ney 2004, "Bootstrap Estimates for Confidence Intervals in ASR Performance Evaluation", In Proc. of ICASSP, Montreal, Canada, Vol. 1, pp. 409-412.
- G. Leusch, N. Ueffing, H. Ney 2003, "A Novel String-to-String Distance Measure with Applications to Machine Translation Evaluation", In Proc. of the 9th MT Summit, New Orleans, LA.
- I. Dan Melamed, Ryan Green and Joseph P. Turian 2003, "Precision and Recall of Machine Translation", In Proc. of NAACL/HLT 2003, Edmonton, Canada.
- M. King, A. Popescu-Belis, E. Hovy 2003, "FEMTI: creating and using a framework for MT evaluation", In Proc. of the 9th Machine Translation Summit, New Orleans, LA, USA.
- S. Nießen, F.J. Och, G. Leusch, H. Ney 2000, "An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research", In Proc. of LREC 2000, Athens, Greece.
- NIST Report 2002, "Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics", http://www.nist.gov/speech/tests/mt/doc/ngram-study.pdf
- K. Papineni, S. Roukos, et al. 2002, "BLEU: A Method for Automatic Evaluation of Machine Translation", In Proc. of the 40th ACL.
- Ying Zhang, Stephan Vogel, Alex Waibel 2004, "Interpreting BLEU/NIST Scores: How Much Improvement Do We Need to Have a Better System?", In Proc. of LREC 2004, Lisbon, Portugal.
28. Questions and Comments?
29. N-gram Contributions to NIST Score