Title: MT Parameter Estimation: Minimum Error Rate Training
1 MT Parameter Estimation: Minimum Error Rate Training
2 Overview
- Parameter Estimation / Tuning / Minimum Error Rate Training
- Tuning Set
- Difficulties
  - Computationally expensive to calculate the objective function
  - Error surface makes the search non-trivial
- N-Best Lists
- Powell Search
- Simplex Algorithm
- N-Best List Re-scoring
- Minimum Bayes Risk
3 System overview
4 System overview
[Diagram: the decoder translates preprocessed source language text into target language text by combining several models: translation model phrase tables (e → f, f → e), translation model lexicons (e → f, f → e), language model, POS LM, distortion model, word count, phrase count, and cohesion constraint; parameter estimation sets the weights of this combination.]
5Parameter Estimation / Tuning
- Need training data to optimize weights (?1, ,
?n) - Set of sentences with reference translation
- usually around 1000 sentences
- held out from training data for translation and
language models - is called tuning set or development set
- Tuning towards better translation
- needs automatic translation evaluation metric
(e.g. BLEU, TER, METEOR) - minimize translation error rate (maximize
translation score) - Minimum Error Rate Training (MERT)
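For reference, the weights λ1, ..., λn combine the feature scores h_i (phrase table, lexicon, language model, distortion, word count, ...) in the usual log-linear fashion, and the decoder outputs the highest-scoring translation:

  \hat{e}(f; \lambda) = \arg\max_{e} \sum_{i=1}^{n} \lambda_i \, h_i(e, f)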
6 Parameter Estimation / Tuning
- Find (λ1, ..., λn) so that the translation error rate is minimal
- To evaluate the objective function we need to
  - set the weights
  - run the decoder with these weights
  - evaluate the resulting translation
  - computationally expensive!
- Error surface is not nice
  - Not convex → many local minima
  - Not differentiable → gradient descent methods not readily applicable
[Figure: error surface over a single weight λi, with many local minima]
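Written out, the tuning objective over tuning sentences f_1, ..., f_S with references r_1, ..., r_S is the standard MERT criterion (Error being the chosen metric, e.g. TER or 1 - BLEU); every evaluation of it requires a decoder run with the candidate weights:

  \hat{\lambda} = \arg\min_{\lambda} \mathrm{Error}\big(\hat{e}(f_1; \lambda), \ldots, \hat{e}(f_S; \lambda);\; r_1, \ldots, r_S\big)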
7 N-Best Lists
- Optimize on an n-best list
  - output e.g. the 500 best translations with their feature scores
  - pre-calculate the error rate for each n-best list entry
  - optimize the weights so that the best translation (according to the error metric) gets the best total score
  - Powell search / Simplex Algorithm
- Re-run the decoder with the updated weights
  - add the new n-best list to the previous one (more stability)
  - run the optimizer over the larger n-best lists
- Repeat until no new translations, or improvement < epsilon, or just k times (typically 5-10 iterations); a sketch of this loop is shown below
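A minimal sketch of this outer loop, assuming hypothetical helpers decode_nbest(), optimize_weights() (Powell or simplex over the merged lists), and error_rate() standing in for the decoder, the optimizer, and the metric:

def mert(initial_weights, tuning_set, references, max_iterations=10, epsilon=1e-4):
    weights = initial_weights
    pool = {}                              # accumulated n-best entries per sentence
    previous_error = float("inf")
    for _ in range(max_iterations):
        # 1. Run the decoder with the current weights, keeping e.g. the 500 best.
        new_entries = 0
        for sent_id, source in enumerate(tuning_set):
            for hyp, features in decode_nbest(source, weights, n=500):
                if hyp not in pool.setdefault(sent_id, {}):
                    pool[sent_id][hyp] = features
                    new_entries += 1
        if new_entries == 0:               # no new translations: stop
            break
        # 2. Optimize the weights on the merged (larger) n-best lists.
        weights = optimize_weights(pool, references, weights)
        # 3. Stop early if the tuning-set error improves by less than epsilon.
        error = error_rate(pool, references, weights)
        if previous_error - error < epsilon:
            break
        previous_error = error
    return weights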
8 N-Best Lists Example
- 17 since october 9th , the dprk announced to conduct nuclear tests , japan's ruling liberal democratic party policy chief shoichi nakagawa repeatedly on different occasions claimed that discussions on whether japan should possess nuclear weapons . cost 12.271806
- 17 since october 9th , the dprk announced to conduct nuclear tests , japan's ruling liberal democratic party policy chief shoichi nakagawa repeatedly on different occasions that japan should discuss whether it possesses nuclear weapons . cost 12.488882
- 17 since october 9th north korea announced a nuclear test , japan's ruling liberal democratic party policy chief shoichi nakagawa repeatedly on different occasions claimed that discussions on whether japan should possess nuclear weapons . cost 12.599372
- 17 since october 9th , the dprk announced to conduct nuclear tests , discussion of japan's ruling liberal democratic party policy chief shoichi nakagawa repeatedly on different occasions that japan should have nuclear weapons . cost 12.612238
- 17 beginning october 9th , north korea announced a nuclear test , japan's ruling liberal democratic party policy chief shoichi nakagawa repeatedly on different occasions claimed that discussions on whether japan should possess nuclear weapons . cost 12.970050
- 17 since october 9th north korea announced a nuclear test , japan's ruling liberal democratic party policy chief shoichi nakagawa repeatedly on different occasions that japan should discuss whether it possesses nuclear weapons . cost 13.050649
- 17 from october 9th , the dprk announced to conduct nuclear tests , japan's ruling liberal democratic party policy chief shoichi nakagawa repeatedly on different occasions claimed that discussions on whether japan should possess nuclear weapons . cost 13.192306
9 N-Best Lists Example
- 16 there will be no ( removed ) . " cost 2.8804655652387 2.93105 15.7766 0.155555 1.18532 1.84075 2.38613 5.12687 2.43787 5 -9 4.44849
- 16 there will not be ( removed ) . " cost 3.0528793365197 3.39828 18.0113 0.155555 0.941883 1.635 1.84705 4.29711 2.43787 5 -9 3.95439
- 16 there will be no ( the dismissal ) . " cost 3.094871928852 3.51552 18.9155 0.04 2.26044 2.48211 1.08776 5.11773 1.90654 5 -10 4.56892
- 16 there will be no dismiss ( ) . " cost 3.2658643839427 3.26587 17.0452 0.155555 2.36578 2.33918 1.48557 5.72912 2.23892 5 -9 3.95801
- 16 there will be a ( removed ) . " cost 3.4441157853547 3.44412 15.6787 0.155555 2.06722 3.24385 5.58451 3.79602 2.43787 5 -9 6.31365
- 16 there will be a ( recall ) . " cost 3.4704808758737 3.47048 15.65 0.155555 2.98438 3.68957 5.63635 3.5788 2.23326 5 -9 6.104
- 16 ( the recall ) . " cost 3.49422747670162 3.49422 10.8817 0.733333 3.29764 4.00283 8.77178 1.37352 2.23326 5 -6 9.02326
- 16 there will be no ( the dismissal ) " . cost 3.515502727333 3.51552 18.9155 0.04 2.26044 2.48211 1.08776 5.11773 1.90654 5 -10 4.56892
10 N-Best Lists
- Optimize on an n-best list
  - output e.g. the 500 best translations with their feature scores
  - pre-calculate the error rate for each n-best list entry
  - optimize the weights so that the best translation (according to the error metric) gets the best total score
  - Powell search / Simplex Algorithm
- Re-run the decoder with the updated weights
  - add the new n-best list to the previous one (more stability)
  - run the optimizer over the larger n-best lists
- Repeat until no new translations, or improvement < epsilon, or just k times (typically 5-10 iterations)
11 Powell search
- Powell search is a line search
  - consider one parameter at a time
12 Powell search
- Change one parameter at a time to find the optimum for each dimension
  - Cannot go diagonally
- Not guaranteed to find the global optimum
  - End point depends on where the search starts
- What step size is good?
- High dimensionality
  - Computationally expensive: many points to evaluate
- Can we reduce the number of points that need to be evaluated? (see the sketch below)
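A minimal sketch of the naive coordinate-wise search described above, assuming a hypothetical evaluate_error(weights) that returns the tuning-set error for a weight vector; the fixed-step grid makes the step-size question and the number of evaluations explicit:

import numpy as np

def coordinate_search(weights, evaluate_error, step=0.1, lo=-5.0, hi=5.0, sweeps=3):
    weights = list(weights)
    best_error = evaluate_error(weights)
    for _ in range(sweeps):                        # repeated passes over all dimensions
        for k in range(len(weights)):              # one parameter at a time (no diagonal moves)
            for value in np.arange(lo, hi, step):  # every grid point costs one evaluation
                trial = weights[:k] + [value] + weights[k + 1:]
                error = evaluate_error(trial)
                if error < best_error:
                    best_error, weights[k] = error, value
    return weights, best_error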
13 Powell search
- Linear combination of models
  - Model cost
- Only look at one feature weight at a time
- Total cost for one hypothesis, when only one weight changes, is a linear function of that weight
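With all other weights held fixed, the model score of one hypothesis e is a line in the single weight λk, with slope h_k and an intercept collected from the remaining features:

  \mathrm{score}(e) = \sum_{j} \lambda_j \, h_j(e, f) = h_k(e, f)\,\lambda_k + \sum_{j \neq k} \lambda_j \, h_j(e, f)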
14 Powell search
- Model score for one hypothesis
- Changing one feature weight
[Figure: model score of hypothesis e12 (TER 5) plotted over λk: a straight line with slope hk]
15 Powell search
- Depending on the scaling factor λk, different hypotheses are in 1-best position
- Set λk so that the metric-best hypothesis is also model-best
[Figure: model scores of e11 (TER 8), e12 (TER 5), and e13 (TER 4) as lines over λk; which hypothesis is model-best, and hence the error (8, 5, or 4), changes at the intersection points]
16 Powell search
- Select the minimum number of evaluation points
  - Calculate the intersection points
  - Keep an intersection point only if the intersecting hypotheses are best at that point
  - Choose the evaluation points between the intersection points
[Figure: the same lines for e11 (TER 8), e12 (TER 5), and e13 (TER 4) over λk, with evaluation points placed between the intersection points of the best-hypothesis envelope]
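A brute-force sketch of this line search along one weight, assuming each n-best entry has been reduced to a (slope, intercept, error) triple with slope = h_k(e), intercept = the score contribution of the remaining features, and a higher model score meaning better; instead of tracing the envelope explicitly, it simply evaluates between all pairwise intersection points:

def line_search_one_weight(nbest_lists, lo=-5.0, hi=5.0):
    # Candidate values for lambda_k: the interval borders plus every intersection.
    candidates = [lo, hi]
    for entries in nbest_lists:                      # one list per tuning sentence
        for i, (a1, b1, _) in enumerate(entries):
            for a2, b2, _ in entries[i + 1:]:
                if a1 != a2:
                    x = (b2 - b1) / (a1 - a2)        # where the two lines intersect
                    if lo < x < hi:
                        candidates.append(x)
    candidates.sort()
    best_lambda, best_error = None, float("inf")
    # Inside each interval the 1-best hypothesis per sentence (and thus the total
    # error) is constant, so one evaluation point per interval suffices.
    for left, right in zip(candidates, candidates[1:]):
        mid = 0.5 * (left + right)
        total_error = sum(max(entries, key=lambda e: e[0] * mid + e[1])[2]
                          for entries in nbest_lists)
        if total_error < best_error:
            best_lambda, best_error = mid, total_error
    return best_lambda, best_error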
17 Powell search
- Different source sentence
- No matter which λk, e22 would never be 1-best
[Figure: model scores of e21 (TER 2), e22 (TER 0), and e23 (TER 5) as lines over λk; e22 never reaches the top, so only e21 (error 2) and e23 (error 5) can become model-best]
18 Powell search
[Figure: combining both sentences: the lines for e11/e12/e13 (TER 8/5/4) and e21/e22/e23 (TER 2/0/5) are overlaid, and the errors of the model-best hypotheses are summed per interval over λk, giving totals of 10, 7, 10, and 9; the interval with the lowest total (7) determines the new λk]
19Simplex Algorithm
- Downhill Simplex is essentially like a gradient
descend - Error function is not differentiable
- Looking at two dimensions
- Evaluate three points to find the direction in
which the error decreases - consider additional points to ensure
convergence
20 Simplex Algorithm
- Replace the worst point with one of the following (in this order):
  - R: reflection
  - E: expansion
  - C: contraction
  - S: shrinking (replace the worst point with S and the good point with M)
21 Simplex Algorithm
- For n dimensions
  - Start with n+1 random weight vectors
  - Evaluate the translation for each configuration → objective function
  - Sort the points xk according to the objective function: f(x1) ≤ f(x2) ≤ ... ≤ f(xn+1)
  - Calculate x0 as the center of gravity of x1 ... xn
  - Replace the worst point with a point reflected through the centroid: xr = x0 + r (x0 - xn+1)
  - Or consider additional points
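A minimal sketch using SciPy's Nelder-Mead implementation of the downhill simplex, assuming a hypothetical nbest_error(weights) that re-scores the merged n-best lists and returns the tuning-set error; the random restarts anticipate the next slide:

import numpy as np
from scipy.optimize import minimize

def tune_with_simplex(initial_weights, nbest_error, restarts=20, seed=0):
    rng = np.random.default_rng(seed)
    start0 = np.asarray(initial_weights, dtype=float)
    best = None
    for _ in range(restarts):                      # restarts help against local minima
        start = start0 + rng.normal(scale=0.5, size=start0.shape)
        result = minimize(nbest_error, start, method="Nelder-Mead",
                          options={"xatol": 1e-4, "fatol": 1e-4, "maxiter": 2000})
        if best is None or result.fun < best.fun:
            best = result
    return best.x, best.fun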
22 Random Restarts
- Comparison Simplex / Powell
  - (Alok Parlikar, unpublished) (Bing Zhao, unpublished)
- Alok: Simplex jumpier than Powell
- Bing: Simplex better than MER
- Both: you need many restarts
23 Notes on Tuning
- Optimization can get stuck in a local minimum
  - Restart multiple times, random seeds
- Models are not improved, only their combination
- Some parameters change the performance of the decoder but are not in the linear combination
  - Beam size
  - Word reordering restrictions
- Optimization towards different automatic metrics
- Optimization using different development data
24 N-Best List Re-Ranking
- N-best list re-ranking is a standard technique
  - use it for computationally expensive things
  - try new ideas with minimal implementation effort
  - apply additional models which are too large to be loaded
  - apply additional models which cannot be applied to a lattice
  - e.g. consensus over all translations in the n-best list
25 N-Best List Re-Ranking
- On the n-best list
  - add an additional feature score
  - optimize and re-rank
  - output the new first-best translation from the list
- Oracle score on the n-best list
  - pick the metric-best hypothesis for each source sentence
  - usually e.g. 8 to 12 points (in BLEU) better than the decoder first-best
  - this means our models are not strong enough to select the best hypothesis
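A minimal sketch of re-ranking with one additional feature, assuming each n-best entry is a (hypothesis, feature_vector) pair and new_feature() is a hypothetical model (e.g. a larger LM) scored only on the list; the oracle() helper illustrates the oracle-score computation:

def rerank(nbest, weights, new_feature, new_weight):
    rescored = []
    for hyp, features in nbest:
        extended = list(features) + [new_feature(hyp)]          # add the new feature score
        score = sum(w * f for w, f in zip(list(weights) + [new_weight], extended))
        rescored.append((score, hyp))
    rescored.sort(reverse=True)                                  # higher score is better
    return rescored[0][1]                                        # new first-best translation

def oracle(nbest, reference, error):
    # Oracle: pick the metric-best hypothesis (lowest error against the reference).
    return min(nbest, key=lambda entry: error(entry[0], reference))[0]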
26 Minimum Bayes Risk
- Maximum a posteriori solution
- Minimum Bayes risk solution
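The two decision rules in their standard form, with P(e | f) the model posterior and L(e, e') the loss between hypotheses:

  \hat{e}_{\mathrm{MAP}} = \arg\max_{e} P(e \mid f)

  \hat{e}_{\mathrm{MBR}} = \arg\min_{e'} \sum_{e} L(e, e') \, P(e \mid f)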
27 Minimum Bayes Risk
- Loss function L(e, e'): hypotheses that are similar are close together in the hypothesis space?
- Use the automatic evaluation metric we want to improve
  - e' is the hypothesis, e is treated as the reference
  - calculate the translation error rate as the distance measure
- Requires pairwise comparison of all hypotheses, O(n²)
  - Only consider e.g. the 1000 best hypotheses (n-best list)
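A minimal sketch of MBR decoding over an n-best list, assuming entries are (hypothesis, posterior) pairs with normalized posteriors P(e | f) and loss() is the metric-based distance (e.g. TER); the double loop is the O(n²) pairwise comparison mentioned above:

def mbr_decode(nbest, loss):
    best_hyp, best_risk = None, float("inf")
    for candidate, _ in nbest:                     # candidate e' under consideration
        # Expected loss of e' against every hypothesis e, weighted by P(e | f).
        risk = sum(prob * loss(candidate, other) for other, prob in nbest)
        if risk < best_risk:
            best_hyp, best_risk = candidate, risk
    return best_hyp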
28 Summary
- Parameter Estimation / Tuning / Minimum Error Rate Training
- Tuning Set
- Difficulties
  - Computationally expensive to calculate the objective function
  - Error surface makes the search non-trivial
- N-Best Lists
- Powell Search
- Simplex Algorithm
- N-Best List Re-scoring
- Minimum Bayes Risk