Title: Automatic Evaluation
1Automatic Evaluation
Philipp Koehn
Computer Science and Artificial Intelligence Lab
Massachusetts Institute of Technology
2Automatic Evaluation
- Why automatic evaluation metrics?
- Manual evaluation is too slow
- Evaluation on large test sets reveals minor improvements
- Automatic tuning to improve machine translation performance
- History
- Word Error Rate
- BLEU since 2002
- BLEU in short: Overlap with reference translations
3BLEU in Action
Reference translation: the gunman was shot to death by the police .
1. the gunman was police kill .
2. wounded police jaya of
3. the gunman was shot dead by the police .
4. the gunman arrested by police kill .
5. the gunmen were killed .
6. the gunman was shot to death by the police .
7. gunmen were killed by police ?SUBgt0 ?SUBgt0
8. al by the police .
9. the ringer is killed by the police .
10. police killed the gunman .
What is the best translation?
4BLEU in Action
Reference translation: the gunman was shot to death by the police .
1. the gunman was police kill .
2. wounded police jaya of
3. the gunman was shot dead by the police .
4. the gunman arrested by police kill .
5. the gunmen were killed .
6. the gunman was shot to death by the police .
7. gunmen were killed by police ?SUBgt0 ?SUBgt0
8. al by the police .
9. the ringer is killed by the police .
10. police killed the gunman .
Color key:
- green: 4-gram match (good!)
- cyan: 3-gram match
- blue: 2-gram match
- purple: 1-gram match
- red: word not matched (bad!)
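The color coding on this slide can be approximated in a few lines of Python. This is only a sketch: it assumes each word is colored by the longest n-gram (up to 4) that covers it and also occurs in the reference, which may differ in detail from how the original slide was produced. The function name `match_levels` and the `COLORS` table are illustrative, not from the slides.

```python
def match_levels(output, reference, max_n=4):
    """For each output word, return the largest n (0..max_n) such that some
    n-gram of the output covering that word also occurs in the reference."""
    out = output.split()
    ref = reference.split()
    levels = [0] * len(out)
    for n in range(1, max_n + 1):
        # All n-grams of the reference, as a set for fast lookup.
        ref_ngrams = {tuple(ref[i:i + n]) for i in range(len(ref) - n + 1)}
        for i in range(len(out) - n + 1):
            if tuple(out[i:i + n]) in ref_ngrams:
                # Every word inside a matching n-gram gets at least level n.
                for j in range(i, i + n):
                    levels[j] = max(levels[j], n)
    return levels

# Assumed mapping from match level to the colors named in the slide's key.
COLORS = {4: "green", 3: "cyan", 2: "blue", 1: "purple", 0: "red"}

reference = "the gunman was shot to death by the police ."
output = "the gunman was shot dead by the police ."
for word, level in zip(output.split(), match_levels(output, reference)):
    print(f"{word:8s} {COLORS[level]}")
```

Under this assumption, every word of the third output except "dead" falls inside a 4-gram match, and "dead" is the only unmatched word.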
6DARPA MT Evaluation Corpus: 11 Human Translations of 100 Chinese News Articles
- At least 12 people were killed in the battle last week.
- Last week 's fight took at least 12 lives.
- The fighting last week killed at least 12.
- The battle of last week killed at least 12 persons.
- At least 12 people lost their lives in last week 's fighting.
- At least 12 persons died in the fighting last week.
- At least 12 died in the battle last week.
- At least 12 people were killed in the fighting last week.
- During last week 's fighting , at least 12 people died.
- Last week at least twelve people died in the fighting.
- Last week 's fighting took the lives of twelve people.
7BLEU in Theory
- How many n-grams in the output match n-grams in the reference?
- Usually 1-gram to 4-grams
- Length penalty to ensure that output is of similar length
- BLEU = BP * exp(w1 log p1 + ... + w4 log p4)
- pn = correct n-grams / count of n-grams in output
- BP = min(1, exp(1 - length_reference/length_output))
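The definition on this slide can be turned into a short Python sketch. Several choices here are assumptions rather than anything stated on the slide: uniform weights (each w_n = 1/4), a single reference per sentence, n-gram counts clipped to the reference counts, counts pooled over the whole test set, and the standard brevity penalty min(1, exp(1 - reference length / output length)). The function names are illustrative.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(outputs, references, max_n=4):
    """Corpus-level BLEU, one reference per output, uniform weights (assumed)."""
    correct = [0] * max_n   # matched n-grams, pooled over all sentences
    total = [0] * max_n     # n-grams in the output, pooled over all sentences
    out_len = ref_len = 0
    for out, ref in zip(outputs, references):
        out_tok, ref_tok = out.split(), ref.split()
        out_len += len(out_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            out_counts = ngram_counts(out_tok, n)
            ref_counts = ngram_counts(ref_tok, n)
            # Clip: an n-gram is credited at most as often as the reference has it.
            correct[n - 1] += sum(min(c, ref_counts[g]) for g, c in out_counts.items())
            total[n - 1] += sum(out_counts.values())
    if out_len == 0 or min(correct) == 0:
        return 0.0  # some p_n is zero, so the geometric mean is zero
    log_mean = sum(math.log(c / t) for c, t in zip(correct, total)) / max_n
    bp = min(1.0, math.exp(1 - ref_len / out_len))  # brevity penalty
    return bp * math.exp(log_mean)

reference = "the gunman was shot to death by the police ."
output = "the gunman was shot dead by the police ."
print(bleu([output], [reference]))
```

A real evaluation would pool counts over a full test set and typically use several references per sentence; the one-sentence call above only shows the mechanics.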
8BLEU Tends to Predict Human Judgments
(variant of BLEU)
slide from G. Doddington (NIST)
9Developing with BLEU
- Track improvements; quit dead ends early
10Optimize Systems for BLEU
Learning algorithm for directly reducing
translation error → big improvements in quality.
11Criticisms of BLEU
- Not sensitive to global syntactic structure
- Some words are more important than others (not vs. the)
- Score by itself is not very meaningful (is 0.34 good?)
- ... but does this matter?
- ... can it be fixed?
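The "not vs. the" criticism can be made concrete with a small Python sketch. The sentences below are invented for illustration and do not come from the slides; the helper `precisions` is likewise hypothetical and assumes clipped n-gram counts against a single reference. One output drops "not" (reversing the meaning), the other drops "the" (harmless), and both are the same length.

```python
from collections import Counter
from fractions import Fraction

def precisions(output, reference, max_n=4):
    """Clipped n-gram precisions p_1..p_max_n as exact fractions.
    Assumes the output has at least max_n tokens."""
    out, ref = output.split(), reference.split()
    result = []
    for n in range(1, max_n + 1):
        out_counts = Counter(tuple(out[i:i + n]) for i in range(len(out) - n + 1))
        ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        correct = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
        result.append(Fraction(correct, sum(out_counts.values())))
    return result

# Hypothetical example sentences (not from the slides).
reference = "he did not see the man ."
drops_not = "he did see the man ."      # meaning reversed
drops_the = "he did not see man ."      # meaning preserved

print(precisions(drops_not, reference))
print(precisions(drops_the, reference))
```

In this constructed case the two outputs have identical n-gram precisions and identical length, so any score built only from those quantities cannot tell them apart.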
12Is BLEU perfect?
- A very useful tool at this point
- Some caveats
- Only makes sense for large test sets (1000s of sentences)
- BLEU does not work for single sentences
- Problems with BLEU have to be demonstrated by lack of correlation with human judgments. Nobody cares about anecdotal criticism
- Can BLEU be improved? There is a lot of work in MT Evaluation...
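One way to see why BLEU is unreliable on single sentences is to look at the raw n-gram counts. The Python sketch below reuses two outputs and the reference from the "BLEU in Action" slide; the helper `clipped_matches` and the idea of pooling just two sentences are illustrative assumptions, since a real test set would contain thousands of sentences.

```python
from collections import Counter

def clipped_matches(output, reference, n):
    """Return (matched n-grams, n-grams in output) with clipped counts."""
    out, ref = output.split(), reference.split()
    out_counts = Counter(tuple(out[i:i + n]) for i in range(len(out) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    correct = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
    return correct, sum(out_counts.values())

reference = "the gunman was shot to death by the police ."
short = "police killed the gunman ."
close = "the gunman was shot dead by the police ."

for n in range(1, 5):
    single = clipped_matches(short, reference, n)
    other = clipped_matches(close, reference, n)
    # Pooling adds up matches and totals before dividing.
    pooled = (single[0] + other[0], single[1] + other[1])
    print(f"{n}-gram  single sentence: {single}  pooled: {pooled}")
```

Scored alone, "police killed the gunman ." has no matching 3-grams or 4-grams, so its geometric mean of precisions is zero even though the translation is acceptable. Once counts are pooled with other sentences, every precision is nonzero and the score becomes informative.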