Automatic Evaluation

Learn more at: https://people.csail.mit.edu

Transcript and Presenter's Notes

1
Automatic Evaluation
Philipp Koehn
Computer Science and Artificial Intelligence Lab
Massachusetts Institute of Technology
2
Automatic Evaluation

Why automatic evaluation metrics?
Manual evaluation is too slow
Evaluation on large test sets reveals minor improvements
Automatic tuning to improve machine translation performance
History
Word Error Rate
BLEU since 2002
BLEU in short: overlap with reference translations

3
BLEU in Action
Reference translation: the gunman was shot to death by the police .
1. the gunman was police kill .
2. wounded police jaya of
3. the gunman was shot dead by the police .
4. the gunman arrested by police kill .
5. the gunmen were killed .
6. the gunman was shot to death by the police .
7. gunmen were killed by police ?SUB>0 ?SUB>0
8. al by the police .
9. the ringer is killed by the police .
10. police killed the gunman .
What is the best translation?
4
BLEU in Action
Reference translation: the gunman was shot to death by the police .
1. the gunman was police kill .
2. wounded police jaya of
3. the gunman was shot dead by the police .
4. the gunman arrested by police kill .
5. the gunmen were killed .
6. the gunman was shot to death by the police .
7. gunmen were killed by police ?SUB>0 ?SUB>0
8. al by the police .
9. the ringer is killed by the police .
10. police killed the gunman .
green: 4-gram match (good!)
cyan: 3-gram match
blue: 2-gram match
purple: 1-gram match
red: word not matched (bad!)
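The color coding on this slide can be approximated in a few lines of Python. The sketch below is an illustration, not the tool used to produce the slide: the function name `match_levels` is hypothetical, tokens are split on whitespace, and each word is labeled with the length of the longest n-gram (up to 4) that covers it and also occurs in the reference. A level of 4 corresponds to green, 3 to cyan, 2 to blue, 1 to purple, and 0 to red.

```python
def match_levels(candidate, reference, max_n=4):
    """Label each candidate token with the longest matching n-gram length.

    Hypothetical helper: a token gets level n if it is covered by an
    n-gram (n <= max_n) of the candidate that also occurs in the
    reference; level 0 means the word does not occur in the reference.
    """
    cand = candidate.split()
    ref = reference.split()
    # All reference n-grams for n = 1..max_n, stored as tuples.
    ref_ngrams = {
        tuple(ref[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(ref) - n + 1)
    }
    levels = [0] * len(cand)
    for n in range(1, max_n + 1):
        for i in range(len(cand) - n + 1):
            if tuple(cand[i:i + n]) in ref_ngrams:
                # Every token inside a matching n-gram is at least level n.
                for j in range(i, i + n):
                    levels[j] = max(levels[j], n)
    return list(zip(cand, levels))


reference = "the gunman was shot to death by the police ."
for token, level in match_levels("the gunman was shot dead by the police .", reference):
    print(f"{token:8s} {level}")
```

In this example only "dead" ends up at level 0; the words on either side of it sit inside 4-gram matches. Note that this is a presence check only. It does not clip repeated matches the way the BLEU precision counts do.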
6
DARPA MT Evaluation Corpus
11 Human Translations of 100 Chinese News Articles
At least 12 people were killed in the battle last week.
Last week 's fight took at least 12 lives.
The fighting last week killed at least 12.
The battle of last week killed at least 12 persons.
At least 12 people lost their lives in last week 's fighting.
At least 12 persons died in the fighting last week.
At least 12 died in the battle last week.
At least 12 people were killed in the fighting last week.
During last week 's fighting , at least 12 people died.
Last week at least twelve people died in the fighting.
Last week 's fighting took the lives of twelve people.
7
BLEU in Theory

How many n-grams in the output match n-grams in the reference?
Usually 1-grams to 4-grams
Length penalty to ensure that the output is of similar length
BLEU = BP · exp(w_1 log p_1 + ... + w_4 log p_4)
p_n = correct n-grams / count of n-grams in output
BP = min(1, exp(1 - length_reference / length_output))
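The formula on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: one reference per output, whitespace tokenization, uniform weights w_n = 1/4, "correct" n-grams counted with clipping (an n-gram is credited at most as often as it occurs in the reference), and the brevity penalty in its standard form min(1, exp(1 - length_reference / length_output)). The function names `ngram_counts` and `bleu` are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(outputs, references, max_n=4):
    """Corpus-level BLEU sketch: one reference per output, uniform weights."""
    correct = [0] * max_n  # clipped n-gram matches, summed over the test set
    total = [0] * max_n    # n-grams in the output, summed over the test set
    len_out = len_ref = 0
    for out, ref in zip(outputs, references):
        out_toks, ref_toks = out.split(), ref.split()
        len_out += len(out_toks)
        len_ref += len(ref_toks)
        for n in range(1, max_n + 1):
            out_counts = ngram_counts(out_toks, n)
            ref_counts = ngram_counts(ref_toks, n)
            # An output n-gram counts as correct at most as often as it
            # appears in the reference.
            correct[n - 1] += sum(min(c, ref_counts[g]) for g, c in out_counts.items())
            total[n - 1] += sum(out_counts.values())
    # If any n-gram order has no match, the geometric mean is zero.
    if len_out == 0 or min(correct) == 0:
        return 0.0
    log_avg = sum(math.log(c / t) for c, t in zip(correct, total)) / max_n
    bp = min(1.0, math.exp(1 - len_ref / len_out))
    return bp * math.exp(log_avg)


reference = "the gunman was shot to death by the police ."
print(bleu(["the gunman was shot dead by the police ."], [reference]))
```

Counts are accumulated over the whole test set before the precisions are formed, so the score is a property of the corpus rather than an average of per-sentence scores.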

8
BLEU Tends to Predict Human Judgments
(variant of BLEU)
slide from G. Doddington (NIST)
9
Developing with BLEU

Track improvements
Quit dead ends early

10
Optimize Systems for BLEU
Learning algorithm for directly reducing translation error → big improvements in quality.
11
Criticisms of BLEU

Not sensitive to global syntactic structure
Some words are more important than others (not vs. the)
Score by itself is not very meaningful (is 0.34 good?)
... but does this matter?
... can it be fixed?

12
Is BLEU perfect?

A very useful tool at this point
Some caveats
Only makes sense for large test sets (1000s of sentences)
BLEU does not work for single sentences
Problems with BLEU have to be demonstrated by lack of correlation with human judgments
Nobody cares about anecdotal criticism
Can BLEU be improved?
There is a lot of work in MT Evaluation...
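One way to see why BLEU does not work for single sentences is to look at the per-order n-gram precisions of one short output. The sketch below is illustrative only: `ngram_precisions` is a hypothetical helper, tokens are split on whitespace, and matches are clipped against a single reference. It uses translation 10 and the reference from the "BLEU in Action" slides.

```python
from collections import Counter


def ngram_precisions(output, reference, max_n=4):
    """Return (matched, total) n-gram counts for n = 1..max_n."""
    out, ref = output.split(), reference.split()
    result = []
    for n in range(1, max_n + 1):
        out_counts = Counter(tuple(out[i:i + n]) for i in range(len(out) - n + 1))
        ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip each match at the number of times it occurs in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
        result.append((matched, sum(out_counts.values())))
    return result


reference = "the gunman was shot to death by the police ."
print(ngram_precisions("police killed the gunman .", reference))
# prints [(4, 5), (1, 4), (0, 3), (0, 2)]
```

The output has no matching 3-gram or 4-gram, so the product of precisions, and with it the sentence-level score, is zero even though the sentence conveys the meaning of the reference. Over thousands of sentences the pooled counts are rarely zero, which is one reason the metric is reported on large test sets.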