Title: Decoding Algorithms for Statistical Machine Translation
1. Decoding Algorithms for Statistical Machine Translation
- Dr. Joy Ying Zhang
- Carnegie Mellon University
2. Carnegie Mellon University Silicon Valley
3. CMU SV Campus
4. In a few years
5. Outline
- Overview
- Monotone decoder
- Decoding with reordering
- Jumping window
- Decoding with ITG
- Hierarchical decoder
- Decoder for mobile devices
6. Phrase-based SMT
7. Decoding is NP-Complete
- Even the simplest decoding algorithm is NP-complete: complexity is exponential in the sentence length, just like the Travelling Salesman Problem (TSP) (Knight, 1999)
8. Decoding as TSP
- In a word-to-word translation model:
- Choosing the order of the next source word to translate is like choosing the next city to visit
- Choosing the target translation is like choosing which hotel to stay in within a city
- The optimal translation corresponds to the optimal city/hotel choice
- We can only afford a suboptimal solution
- Let's start with the simplest one
9. Monotone Decoding
- No reordering is allowed; decoding proceeds from left to right
- Apply the translation model to the test sentence to build up a lattice
- Search the lattice for the best path given all knowledge sources (translation model, language model, sentence length model, ...)
10. Monotone Decoding
- Traverse the lattice from left to right
- Build partial translation hypotheses for each node (what are good translations up to this source position?)
- Output the best hypothesis that covers the complete sentence as the final translation
11. Probability/Score of a Partial Hyp
- Depends on the models used in the decoder
- Translation model scores under the independence assumption, e.g. P(e1 ... en | f1 ... fm) ≈ P(e1 .. e3 | f1 .. f4) · P(e4 .. e5 | f5 .. f6) · ...
- Language model: P(e1 ... en)
- Sentence length model: score(n | src length)
- Distortion model
- ... and be creative (a sketch of combining these scores follows below)
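The bullets above amount to a weighted sum of (negative log) model scores. A minimal sketch, assuming a hypothetical lm.logprob_sequence() interface and illustrative feature weights:

```python
import math

def partial_hyp_score(phrase_pairs, lm, weights):
    """Combine TM, LM and word-count scores into one cost (lower is better).

    phrase_pairs: list of (src_phrase, tgt_phrase, p_tm) used by the partial hyp.
    """
    tm_cost = -sum(math.log(p) for _, _, p in phrase_pairs)
    target_words = [w for _, tgt, _ in phrase_pairs for w in tgt.split()]
    lm_cost = -lm.logprob_sequence(target_words)   # assumed LM interface: log P(e1 ... en)
    word_bonus = len(target_words)                 # sentence length model: "the more the better"
    return (weights["tm"] * tm_cost
            + weights["lm"] * lm_cost
            - weights["length"] * word_bonus)
```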
12. Sentence Length Model
- Different languages have different levels of wordiness
- A histogram of source sentence length vs. target sentence length shows that the distribution is rather flat → p(J | I) is not very helpful
- Very simple sentence length model: the more the better, i.e. give a bonus for each word (not a probabilistic model)
- Balances the shortening effect of the LM
- Can be applied immediately, as the absolute length is not important
- However, this is insensitive to what is in the sentence
- Optimizes the length of translations for the entire test set, not for each sentence
- Long sentences are made longer to compensate for sentences which are too short
13. Partial Hypothesis Recombination
- For each source word and phrase, there are t translation alternatives.
- If we simply combine them, the final node will have t^J hyps to explore.
- However, many partial hyps are not distinguishable to the decoder models
- If using only the TM and a 3-gram LM:
- I will come to office
- I came to office
14. Recombination of Hypotheses
- Recombination: of two hypotheses, keep only the better one if no future information can switch their current ranking
- Notice that this depends on the models:
- Model score depends on the current partial translation and the extension, e.g. the LM
- Model score depends on global features known only at the sentence end, e.g. the sentence length model
- The models define equivalence classes for the hypotheses
- Expand only the best hypothesis in each equivalence class
15. Recombination of Hypotheses: Example
- TM and n-gram LM only
- Hypotheses:
- H1: I would like to go
- H2: I would not like to go
- Assume as possible expansions E: "to the movies", "to the cinema", "and watch a film"
- The LM score of the expansion is identical for H1 + E and for H2 + E with bi-, tri-, and four-gram LMs
- E.g. the 3-gram LM score of expansion 1 is -log p(to | to go) - log p(the | go to) - log p(movies | to the)
- Therefore Cost(H1) < Cost(H2) ⇒ Cost(H1 + E) < Cost(H2 + E) for all possible expansions E
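A minimal sketch of recombination under these assumptions (TM plus an n-gram LM), keying each hypothesis on its coverage and its last n-1 target words; field names are illustrative:

```python
def recombine(hyps, lm_order=3):
    """Keep only the cheapest hypothesis in each equivalence class.

    Assumes each hyp has .cost (lower is better), .coverage (hashable) and
    .target (list of target words); with an n-gram LM only the last n-1
    target words can influence future extensions.
    """
    best = {}
    for h in hyps:
        state = (h.coverage, tuple(h.target[-(lm_order - 1):]))
        if state not in best or h.cost < best[state].cost:
            best[state] = h
    return list(best.values())
```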
16. Beam Pruning
- Still a lot of partial hyps to explore, even after recombination, for each node in the lattice (src sentence up to this position)
- To a not-so-good partial hyp: "Sorry, we don't give you any more chances, since you failed this mid-term"
- Prune H if it is not among the top B hyps --- beam-size pruning
- Prune H if its score is lower than factor × best score --- beam-factor pruning
- Pruning reduces the number of partial hyps to explore → faster decoding
- But it may eliminate hyps that would become good translations later on
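A minimal sketch of the two pruning criteria above, assuming hypothesis costs are negative log scores (lower is better), so the multiplicative "factor × best score" test becomes an additive threshold on the cost:

```python
def beam_prune(hyps, beam_size=100, beam_threshold=3.0):
    """Beam-size pruning followed by beam-factor (threshold) pruning."""
    if not hyps:
        return hyps
    hyps = sorted(hyps, key=lambda h: h.cost)[:beam_size]     # keep only the top B hyps
    best_cost = hyps[0].cost
    return [h for h in hyps if h.cost <= best_cost + beam_threshold]
```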
17. Beam Pruning
18. Rest-Cost Estimation
- In pruning we compare hyps which are not strictly equivalent under the models
- Risk: we prefer hypotheses which have covered the easy parts
- Remedy: estimate the remaining cost for each hypothesis
- Want to know the minimum expected cost (similar to A* search)
- Gives a bound for pruning
- However, this is not possible with acceptable effort
- Want to include as many models as possible:
- Translation model costs, word count, phrase count
- Language model costs
- Distortion model costs
- Calculate the expected cost R(l, r) for each span (l, r)
19. Rest Cost for Translation Models
- Translation model, word count and phrase count features are local costs
- They depend only on the current phrase pair
- Strictly additive: R(l, m) + R(m, r) = R(l, r)
- Minimize over alternative translations
- For each source phrase span (l, r), initialize with the cost of the best translation
- Combine adjacent spans, take the best combination (sketch below)
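A minimal sketch of the span table for these local, additive costs, assuming a hypothetical dictionary best_local_cost[(l, r)] holding the cost of the best translation option for each source span:

```python
import math

def tm_rest_costs(best_local_cost, J):
    """R[(l, r)] = cheapest way to cover span (l, r), built bottom-up over span lengths."""
    R = {}
    for length in range(1, J + 1):
        for l in range(J - length + 1):
            r = l + length
            best = best_local_cost.get((l, r), math.inf)   # best direct translation, if any
            for m in range(l + 1, r):                      # or combine two adjacent spans
                best = min(best, R[(l, m)] + R[(m, r)])
            R[(l, r)] = best
    return R
```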
20. Rest Cost for Language Models
- We do not have the history → only an approximation
- For each span (l, r), calculate the LM score without history
- Combine the LM scores of adjacent spans
- Notice that p(e1 ... em) · p(em+1 ... en) ≠ p(e1 ... en) beyond a 1-gram LM
- Alternative: fast monotone decoding with the TM-best translations
- History is then available
- Then R(l, r) = R(1, r) - R(1, l)
21. Rest Cost for a Distance-Based DM
- The distance-based DM rest cost depends on the coverage pattern
- Too many different coverage patterns; cannot pre-calculate
- Estimate by jumping to the first gap, then filling the gaps in sequence
- Moore & Quirk (2007): DM cost plus rest cost
- [Figure: coverage pattern with the previous phrase S, the gap-free initial segment S', and the current phrase S'']
- L(.) = length of a phrase, D(.,.) = distance between phrases
- S'' adjacent to S: d = 0
- S'' left of S: d = 2 L(S'')
- S'' a subsequence of S': d = 2 (D(S, S'') + L(S''))
- Otherwise: d = 2 (D(S, S'') + L(S''))
22. Rest Cost for a Lexicalized DM
- Lexicalized DM: per phrase
- DM(f, e) scores: in-mon, in-swap, in-dist, out-mon, out-swap, out-dist
- Treat as a local cost for each span (l, r)
- Minimize over alternative translations and over the different orientations (in- and out-)
23. Effect of Rest-Cost Estimation
- From Richard Zens (2008)
- The LM is important, the DM is important
24. Output the Best Translation
- The optimal hypothesis is in the last node of the lattice
- We need to keep the back pointers
25. Monotone Decoding Algorithm
- Apply the TM to sentence f1 ... fJ
- For j = 1 to J
- For each incoming edge e that enters node j
- Edge e: i → j
- For each partial hyp h in node i
- Extend h with edge e
- Estimate the prob/score of the new hyp h + e
- Store <h + e, prob/score, back pointer to h> in node j
- Prune the partial hyps in node j
- In node J, find the best hyp
- Follow the back pointers and output the final translation (a runnable sketch follows below)
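A minimal, runnable sketch of this loop, with the TM and LM scores collapsed into a single edge cost and with recombination and pruning inlined; the edges data layout (incoming edges keyed by end node) and all names are illustrative:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Hyp:
    cost: float
    target: List[str]
    back: Optional["Hyp"] = None        # back pointer to the extended hypothesis

def monotone_decode(edges, J, beam_size=100):
    nodes = {0: [Hyp(0.0, [])]}
    for j in range(1, J + 1):
        new_hyps = []
        for (i, phrase, edge_cost) in edges.get(j, []):     # edge e: i -> j
            for h in nodes.get(i, []):                      # extend h with edge e
                new_hyps.append(Hyp(h.cost + edge_cost,
                                    h.target + phrase.split(), back=h))
        best = {}                                           # recombine on the 2-word LM state
        for h in new_hyps:
            state = tuple(h.target[-2:])
            if state not in best or h.cost < best[state].cost:
                best[state] = h
        nodes[j] = sorted(best.values(), key=lambda h: h.cost)[:beam_size]
    return min(nodes[J], key=lambda h: h.cost)              # best hyp in the last node

# Toy lattice over a 2-word source sentence, edges keyed by their end position.
edges = {1: [(0, "I", 1.0)], 2: [(1, "come", 0.5), (0, "I come", 1.2)]}
print(" ".join(monotone_decode(edges, 2).target))           # -> "I come"
```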
26. Output an N-best List
- When traversing back from the last node, the decoder can output the top N hyps for the whole sentence: an N-best list
- Model scores do not correlate well with external scores such as BLEU
- In a 1000-best list, the hyp with the highest BLEU ranks, on average, around 489th (489.38) according to its model score
27. N-Best List
28. N-Best Rescoring
- Generate an n-best list
- Use a different TM and/or LM to rescore each translation → re-ordering of the translations, i.e. a different best translation
- Different TMs:
- Use the IBM1 lexicon for the entire translation
- Use HMM-FB and IBM4 lexicons
- Forced alignment with the HMM alignment model
- Different LMs:
- Very large LM (distributed language model)
- Link grammar (too slow)
- Other syntax-based LMs, e.g. Charniak's parser?
29. Problems with N-Best Generation
- Duplicates from different transducers
- @Lex A B 0.5
- @ISA A B 0.7
- → Two identical translations with different scores, or even the same score (when rescoring all translations with the same lexicon)
- Spurious ambiguities
- us companies and other institutions
- us companies and other institutions
- us companies and other institutions
- us companies and other institutions
- . . .
- Example run: 1000 n-best → 400 different strings on average; extreme case: only 10 unique strings
- Possible solution: check uniqueness during backtracking (sketch below)
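A minimal sketch of the uniqueness check, assuming candidate hypotheses are produced in order of increasing model cost during backtracking:

```python
def unique_nbest(candidates, n):
    """Collect the n best translations, skipping string duplicates."""
    seen, nbest = set(), []
    for hyp in candidates:
        s = " ".join(hyp.target)
        if s in seen:
            continue                 # spurious duplicate: same string, different derivation
        seen.add(s)
        nbest.append(hyp)
        if len(nbest) == n:
            break
    return nbest
```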
30. Oracle Score of the N-best List
31. Using a Distributed LM for Reranking
- Large training data available
- Distributed computing clusters
- Distributed language modeling (Zhang and Vogel, 2006; Emami, 2007; Brants et al., 2007)
32. Rerank the N-Best List Using LM Features
33. Rerank the N-best List
34. Rerank the N-best List
35. Rerank the N-best List
- Considering long-distance dependencies
36. Reranking the N-best List
37. Tuning the SMT System
- We use different models in the SMT system
- Models have simplifications
- Trained on different amounts of data
- ⇒ Different levels of reliability
- ⇒ Give different weights to the different models: Q = c1·Q1 + c2·Q2 + ... + cn·Qn
- Find the optimal scaling factors c1 ... cn
- Optimal means: highest score for the chosen evaluation metric
38. Automatic Tuning
- Many algorithms are available to find (near-)optimal solutions:
- Simplex
- Maximum entropy
- Minimum error training
- Minimum Bayes risk training
- Genetic algorithms
- Note: the models themselves are not improved, only their combination
- A large number of full translations is required ⇒ still problematic when decoding is slow
39. Automatic Tuning on N-best Lists
- Generate n-best lists, e.g. 1000 translations for each of 500 source sentences
- Loop:
- Changing the scaling factors results in a re-ranking of the n-best lists (sketch below)
- Evaluate the new 1-best translations
- Apply any of the standard optimization techniques
- Advantage: much faster
- Can pre-calculate the counts (e.g. n-gram matches) for each translation to speed up evaluation
- For the Bleu or NIST metric with a global length penalty, do local hill climbing for each individual n-best list
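A minimal sketch of the inner step: re-ranking fixed n-best lists under new scaling factors c, assuming each n-best entry carries its vector of model costs; the outer optimizer (simplex, MERT, ...) would call this inside its loop and evaluate the returned 1-best translations:

```python
def rerank_nbest(nbest_lists, c):
    """nbest_lists: one list per source sentence of (feature_costs, translation)."""
    new_1best = []
    for nbest in nbest_lists:
        # Q = c1*Q1 + c2*Q2 + ... + cn*Qn, with costs where lower is better
        best = min(nbest, key=lambda entry: sum(ck * qk for ck, qk in zip(c, entry[0])))
        new_1best.append(best[1])
    return new_1best                 # score these with BLEU/NIST/... to evaluate c
```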
40. Minimum Error Training
- For each scaling factor we have Q = ck·Qk + QRest
- For different values of ck, different hyps have the lowest score
- Different hyps lead to different MT eval scores
41. Decoding with Reordering
- Languages have different word orders
- 1 澳洲/Australia 2 是/is 与/with 3 北韩/North Korea 4 有/has 5 邦交/diplomatic relationship 6 的/of 7 少数/a few 8 国家/countries 9 之一/one of
- Australia is one of the few countries that have a diplomatic relationship with North Korea
- To generate the right English translation, we need to translate the source in the order 1 2 9 6 7 8 4 5
- Reordering: either change the order in which the source is translated, or, equivalently, re-arrange the partial translations
- Knowledge sources:
- Reordering models
- Language models
- Syntax
42. Reordering Strategies
- All permutations
- Any re-ordering is possible
- Complexity of the traveling salesman problem → only possible for very short sentences
- Small jumps ahead, filling in the gaps pretty soon
- Only local word reordering
- Implemented in the current decoder
- Leaving a small number of gaps, to be filled in at any time
- Allows global but limited reordering
- Similar decoding complexity: exponential in the number of gaps
- IBM-style reordering (described in an IBM patent)
- Merging neighboring regions with a swap; no gaps at all
- Allows global reordering
- Complexity lower than 1, but higher than 2 and 3
43. Decoding with a Reordering Window
- Word and phrase reordering within a given window
- From the first untranslated source word: the next k positions
- Window length 1: monotone decoding
- Restrict the total number of reorderings (typically 3 per 10 words)
- Simple jump model:
- One reordering typically involves two jumps
- The jump distance D depends on the gap and also on the phrase length; the distance is measured from center of phrase to center of phrase
- Simple Gaussian distribution: p(D) ∝ exp(-|D - 1|)
- Lexicalized jump model
44. Jumping Ahead in the Lattice
- A hypothesis describes a partial translation
- Coverage information, back-trace information, score
- Expand the hypothesis over an uncovered position
- [Lattice figure: source sentence "ich komme morgen zu dir" with phrase translation edges such as "I", "come", "I come", "to", "you", "to your office", "tomorrow", "I will come"]
- h: c = 11000, t = "I will come"
- h: c = 11011, t = "I will come to your office"
- h: c = 11111, t = "I will come to your office tomorrow"
45. Word Order: Coverage Info
- Need to know which source words have already been translated
- Don't want to miss some words
- Don't want to translate words twice
- Can compare hypotheses which cover the same words
- Use a coverage vector to store this information
- Essentially a bit vector (sketch below)
- For small jumps ahead: position of the first gap plus a short bit vector
- For a small number of gaps: array of positions of the uncovered words
- For merging neighboring regions: left and right position
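A minimal sketch of a coverage bit vector, using a plain Python integer as the bit mask; function names are illustrative:

```python
def covers(coverage, pos):
    """True if source position pos has already been translated."""
    return (coverage >> pos) & 1 == 1

def add_phrase(coverage, l, r):
    """Mark positions l..r-1 as covered; return None if any would be covered twice."""
    mask = ((1 << (r - l)) - 1) << l
    if coverage & mask:
        return None
    return coverage | mask

def first_gap(coverage, J):
    """Position of the first untranslated source word (J if fully covered)."""
    for j in range(J):
        if not covers(coverage, j):
            return j
    return J
```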
46. Decoding with Inversion Transduction Grammar (ITG)
- Translation model: phrase-to-phrase translation
- May include lexicalized reordering probabilities
- Grammar: X → <F1 F2, E1 E2>, X → <F1 F2, E2 E1>, X → <f, e>
47. Combine Adjacent Edges
- Take adjacent edges el and er and create a new edge e
- e.FromNode = el.FromNode
- e.ToNode = er.ToNode
- e.Translation = el.Translation + er.Translation
- [Lattice figure: source "ich komme morgen zu dir" with edges such as "I", "come", "I come", "to", "you", "to your office", "tomorrow", "I will come", and the new combined edge "I will come tomorrow"]
- hl: c = (0, 2), t = "I will come"
- hr: c = (2, 3), t = "tomorrow"
- h: c = (0, 3), t = "I will come tomorrow"
48. ... And Allow for Reordering
- Create an additional edge:
- e.FromNode = el.FromNode
- e.ToNode = er.ToNode
- e.Translation = er.Translation + el.Translation
- [Same lattice figure, now with the additional swapped edge "tomorrow I will come"]
- hl: c = (0, 2), t = "I will come"
- hr: c = (2, 3), t = "tomorrow"
- h: c = (0, 3), t = "tomorrow I will come"
49. Chart Decoder for Simple ITG
- Recall: a simple ITG is a binary tree
- Word reordering: straight and inverted subtrees
- Allows long-distance reordering: first → last word, last → first word
- Generation of partial hypotheses:
- Initialize with phrase translations
- Combine adjacent areas into longer translations
- Allow for swaps
- Requires a different organization of the decoder (sketch below)
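A minimal sketch of the chart combination step for a simple ITG, without an LM and with a single swap penalty as the only reordering cost; the chart layout and costs are illustrative:

```python
def itg_combine(chart, J, swap_cost=1.0, beam=10):
    """chart[(l, r)]: list of (cost, translation) hypotheses for source span (l, r),
    initialized with the phrase translations of that span."""
    for length in range(2, J + 1):
        for l in range(J - length + 1):
            r = l + length
            hyps = list(chart.get((l, r), []))
            for m in range(l + 1, r):                       # combine adjacent sub-spans
                for cl, tl in chart.get((l, m), []):
                    for cr, tr in chart.get((m, r), []):
                        hyps.append((cl + cr, tl + " " + tr))               # straight
                        hyps.append((cl + cr + swap_cost, tr + " " + tl))   # inverted (swap)
            chart[(l, r)] = sorted(hyps)[:beam]             # keep the cheapest hypotheses
    return chart
```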
50. Chart Decoder
51. LM in Chart-Based Translation
- Language model states on both sides
- The history has not been seen
- Combine h(0,2) = "a b c" with h(2,5) = "d e f" to give h(0,5) = "a b c d e f"
- Calculated were p(a) · p(b|a) · p(c|a b) and p(d) · p(e|d) · p(f|d e)
- But now needed: p(d|b c) and p(e|c d)
- Partly undo the calculation:
- Subtract the wrong log probs: p(d), p(e|d)
- Add the correct log probs: p(d|b c), p(e|c d)
- For short extensions, just extend from the left hypothesis
- For long extensions, it is faster to correct the LM score (sketch below)
52. Effect of Reordering
- Arabic dev-test set (203 sentences)
- Chinese test set 2002 (878 sentences)
- Reordering mainly improves fluency, i.e. a stronger effect on Bleu
- Improvement for Arabic: 4.8% in NIST and 12.7% in Bleu
- Less improvement for Chinese: 5% in Bleu
53. Effect of Reordering
- Arabic/English translation
54. Effect of Reordering
55. Hierarchical Decoding
- Translation model: phrase pairs with holes (phrases of phrases)
- Consider hierarchical phrase pairs as translation rules
- Decoding is CYK parsing: find the optimal synchronous parse tree (sketch below)
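A minimal CYK-style sketch with hierarchical rules restricted to a single gap symbol "X", no LM and no glue rule; the rule format and costs are illustrative only:

```python
def hiero_decode(src, rules):
    """src: list of source tokens; rules: list of (f, e, cost), where f and e
    are strings that may contain one gap symbol "X" (e.g. a made-up rule "X 之一" -> "one of X")."""
    J = len(src)
    chart = {}                                   # (l, r) -> best (cost, translation)
    for length in range(1, J + 1):
        for l in range(J - length + 1):
            r, best = l + length, None
            for f, e, cost in rules:
                if "X" not in f:                                  # plain phrase pair
                    if src[l:r] == f.split():
                        cand = (cost, e)
                        best = cand if best is None else min(best, cand)
                else:                                             # rule with one gap
                    pre, post = (part.split() for part in f.split("X"))
                    gl, gr = l + len(pre), r - len(post)
                    if (gl < gr and src[l:gl] == pre and src[gr:r] == post
                            and (gl, gr) in chart):
                        sub_cost, sub_e = chart[(gl, gr)]
                        cand = (cost + sub_cost, e.replace("X", sub_e, 1))
                        best = cand if best is None else min(best, cand)
            if best is not None:
                chart[(l, r)] = best
    return chart.get((0, J))                     # None if no derivation covers the sentence
```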
56. Hierarchical Decoding (no LM)
57. Decoding as Parsing (Hiero)
58. SMT Decoder for Mobile Devices
- Mobile speech translators
- Fast (close to real-time) speech translation
- Domain-limited, but should not be limited to pre-recorded sentences
- Two-way translation
- Challenges:
- Weaker CPU (e.g. iPhone 3G S: 600 MHz)
- Tiny RAM: a few MB, up to 256 MB
- No numerical co-processors
- Pandora decoder:
- Minimum on-device computing
- Integerized computation
- Compact data structures
59. Summary
- Decoder:
- Generates the translation lattice
- Finds the best path
- Limited word reordering
- Generation of an N-best list
- Especially used for tuning the system
- May also be used for downstream NLP modules
- Tuning of the system:
- Find the optimal set of scaling factors
- Done on n-best lists for speed
- Direct minimization of any MT eval metric