Title: Decoding and Reordering
1 Decoding and Reordering
2 Outline
- A Probabilistic Approach to Syntax-based Reordering for Statistical Machine Translation
- Binarizing Syntax Trees to Improve Syntax-Based Machine Translation Accuracy
- Forest Rescoring: Faster Decoding with Integrated Language Models
- An Efficient Two-Pass Approach to Synchronous-CFG Driven Statistical MT
3 Syntax-based Reordering for Phrase-based Decoding
- Phrase-based decoding distinguishes local from global reordering according to the distortion limit (see the distance-based reordering model of Koehn et al., 2003, for details)
4 Syntax-based Reordering for Phrase-based Decoding
- Syntax, a potential solution to global reordering, gives the decoder a reordered input
- If a single reordered input works, how about an n-best list of reordered inputs?
5-8 Translation Models
- Two setups are compared: one reordered input vs. an n-best list of reordered inputs
(Figure, built up over slides 5-8: the source sentence S is expanded into an n-best list of reordered inputs S1, S2, ..., Sn; each Si is translated by phrase-based decoding into Ti; the best translation T is then selected from T1, T2, ..., Tn.)
9 Select the best translation
- P(C|E), P(E|C), and P(LM): the same feature functions as in phrase-based SMT
- Plus the probability of reordering S into S' (a sketch of the combined score is given below)
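A plausible reading of slides 9-10 is a log-linear combination of the standard phrase-based features with the reordering probability. The correspondence of P(C|E), P(E|C), and P(LM) to P(S'|T), P(T|S'), and P_LM(T), as well as the weights λ_i, are assumptions here, not the slide's original equation.

```latex
\hat{T} \;=\; \operatorname*{arg\,max}_{T,\,S'}
  \Big[\, \lambda_1 \log P(S' \mid T)
        + \lambda_2 \log P(T \mid S')
        + \lambda_3 \log P_{\mathrm{LM}}(T)
        + \lambda_4 \log \Pr(S \to S') \Big]
```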
10 Select the best translation
11 Acquisition of Reordering Knowledge
- Given a node N in the parse tree of a source-language sentence, reordering knowledge can be extracted from the relative order of its child phrases p_i and the corresponding target-language phrases T(p_i)
- For simplicity, only the case of binary nodes is considered
12 Acquisition of Reordering Knowledge
13 Two kinds of representations
- Reordering rules
- Z is the phrase label of a binary node
- X and Y are the phrase labels of Z's children
- Pr(IN-ORDER) and Pr(INVERTED) are the probabilities that X and Y keep or invert their order in the target language
- The probabilities are estimated by Maximum Likelihood Estimation (a counting sketch follows below)
14 Two kinds of representations
- Maximum Entropy Model (a binary classification problem: IN-ORDER vs. INVERTED)
- Features that may be used:
- Leftmost word
- Rightmost word
- Head word
- Context words
- POS tags
- All features above can be extracted from source phrases as well as from target phrases (a feature-extraction sketch follows below)
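A sketch of feature extraction for the maximum-entropy classifier, assuming each child phrase comes as a small record with its tokens, POS tags, and head word; the template names are illustrative rather than the paper's exact features.

```python
def extract_features(left, right):
    """Features for a binary node whose children are the phrases `left` and `right`.

    Each phrase is assumed to be a dict with keys "tokens", "pos", and "head"
    (hypothetical representation).  The same templates can be instantiated on
    the aligned target phrases as well.
    """
    feats = []
    for side, p in (("L", left), ("R", right)):
        feats.append(f"{side}_leftmost={p['tokens'][0]}")    # leftmost word
        feats.append(f"{side}_rightmost={p['tokens'][-1]}")  # rightmost word
        feats.append(f"{side}_head={p['head']}")             # head word
        feats.append(f"{side}_leftmost_pos={p['pos'][0]}")   # POS of leftmost word
        feats.append(f"{side}_rightmost_pos={p['pos'][-1]}") # POS of rightmost word
    return feats
```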
15 The Application of Reordering Knowledge
- Let Pr(p -> p') denote the probability of reordering a phrase p into p'
(Figure: the recursive definition of Pr(p -> p') for the unary-node and binary-node cases.)
16 The Application of Reordering Knowledge
- The number of reordered sentences S' grows exponentially; let R(N) be the number of reorderings of the phrase yielded by N
- Traverse the source-language tree bottom-up; at each node, keep only the n reordered phrases p' with the highest reordering probability (see the sketch below)
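A sketch of the bottom-up beam described on slide 16, assuming a parse tree of `Node` objects with `word`/`children` attributes and a `reorder_prob(node, inverted)` callback supplied by the rules or the MaxEnt model; all names are hypothetical.

```python
import heapq

def nbest_reorderings(node, reorder_prob, n=10):
    """Return up to n (probability, token list) pairs for the subtree at `node`."""
    if not node.children:                                   # leaf: one word, one ordering
        return [(1.0, [node.word])]
    if len(node.children) == 1:                             # unary node: pass through
        return nbest_reorderings(node.children[0], reorder_prob, n)

    left = nbest_reorderings(node.children[0], reorder_prob, n)
    right = nbest_reorderings(node.children[1], reorder_prob, n)

    candidates = []
    for pl, l in left:
        for pr, r in right:
            candidates.append((pl * pr * reorder_prob(node, False), l + r))  # keep order
            candidates.append((pl * pr * reorder_prob(node, True), r + l))   # invert
    # keep only the n reorderings with the highest probability at this node
    return heapq.nlargest(n, candidates, key=lambda c: c[0])
```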
17 Remedy for data sparseness
- If T(p1) and T(p2) overlap, the node N with children N1 and N2 is not taken as a training instance
- This greatly reduces the amount of training data
- Remedy: remove some of the less probable alignment points so as to minimize overlapping phrases
18 Decoding
- The greedy reordering algorithm above tends to focus on one particular clause of a long sentence: for many long sentences containing several clauses, only one of the clauses gets reordered. The fix is to split the sentence into clauses first, as the following slides illustrate.
19-26 Decoding
(Figure, built up incrementally over slides 19-26: the source sentence S is split into clauses C1, ..., Cn (clause splitting); clause reordering produces reordered sentences S1, ..., Sm, where Sj consists of clauses Cj1, Cj2, ..., Cjn; in-clause reordering expands a clause such as Cj2 into candidates Cj21, Cj22, ..., Cj2n; each candidate is translated, giving T(Cj21), ..., T(Cj2n), from which the best clause translation T(Cj2) is selected; the clause translations are merged into T(Sj), and the sentence translations are composed so that the best overall translation T(S) can be selected.)
27 Binarizing Syntax Trees for Syntax-Based MT
- Substructures of an n-ary tree cannot be reused
- A solution is to binarize the syntax trees
- Simple methods include left-, right-, and head-binarization, and their combinations
28-29 Left/Right binarization
30-31 Definition: Left Binarization
- The left binarization of node n factorizes the leftmost r-1 children by forming a new node n' to dominate them, leaving the last child n_r untouched, and then making the new node n' the left child of n
32-33 Definition: Right Binarization
- The right binarization of node n factorizes the rightmost r-1 children by forming a new node n' to dominate them, leaving the first child n_1 untouched, and then making the new node n' the right child of n
34 Definition: Head Binarization
- Left-binarize n if the head is the first child; otherwise right-binarize it. Right-binarization is preferred if both are applicable.
- Keep the head in the pushed-down part. (A sketch of left/right/head binarization follows below.)
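A sketch of left-, right-, and head-binarization under an assumed `Node` representation (label, children, index of the head child); only the spine introduced by the new n' nodes is binarized here, with child subtrees handled separately.

```python
class Node:
    def __init__(self, label, children=None, head=0):
        self.label = label              # phrase label, e.g. "NP"
        self.children = children or []  # list of Node
        self.head = head                # index of the head child

def left_binarize(n):
    """Factor the leftmost r-1 children under a new node n' (slides 30-31)."""
    if len(n.children) <= 2:
        return n
    n_prime = Node(n.label + "'", n.children[:-1])
    return Node(n.label, [left_binarize(n_prime), n.children[-1]])

def right_binarize(n):
    """Factor the rightmost r-1 children under a new node n' (slides 32-33)."""
    if len(n.children) <= 2:
        return n
    n_prime = Node(n.label + "'", n.children[1:])
    return Node(n.label, [n.children[0], right_binarize(n_prime)])

def head_binarize(n):
    """Left-binarize if the head is the first child, otherwise right-binarize,
    preferring right-binarization when both apply (slide 34)."""
    if len(n.children) <= 2:
        return n
    return left_binarize(n) if n.head == 0 else right_binarize(n)
```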
35 Parallel binarization
- Transform a parse tree into a packed binarization forest
- The packed forest is composed of additive forest nodes and multiplicative forest nodes
36 Procedure
- Given a tree node n that has children n1, ..., nr:
- Recursively parallel-binarize the children n1, ..., nr, producing binarization forest nodes
- Right-binarize n if the contiguous subset of children n2, ..., nr is factorizable: insert a new label n', recursively parallel-binarize n' to generate a binarization forest node, then form a multiplicative forest node as the parent of the resulting forest nodes
- Left-binarization is similar to right-binarization above, except that the subset is n1, ..., nr-1; it likewise forms a multiplicative forest node
- Form an additive forest node as the parent of the multiplicative nodes (a sketch follows below)
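A rough sketch of the parallel-binarization procedure of slide 36, reusing the `Node` class from the previous sketch. It produces a packed forest whose additive nodes act like OR nodes and whose multiplicative nodes act like AND nodes; the `factorizable` predicate, which in the paper consults the word alignment, is left as an assumed callback, and the fallback when neither span is factorizable is an assumption.

```python
def parallel_binarize(n, factorizable):
    """Return a packed-forest node for tree node `n` (simplified sketch).

    Forest nodes are plain tuples: ("AND", label, children) is multiplicative,
    ("OR", label, alternatives) is additive.
    """
    if len(n.children) <= 2:
        # nothing to factor: just recurse into the children
        return ("AND", n.label, [parallel_binarize(c, factorizable) for c in n.children])

    alternatives = []
    # right binarization: factor the span n2, ..., nr if it is factorizable
    if factorizable(n.children[1:]):
        n_prime = Node(n.label + "'", n.children[1:])
        alternatives.append(("AND", n.label,
                             [parallel_binarize(n.children[0], factorizable),
                              parallel_binarize(n_prime, factorizable)]))
    # left binarization: factor the span n1, ..., n(r-1) if it is factorizable
    if factorizable(n.children[:-1]):
        n_prime = Node(n.label + "'", n.children[:-1])
        alternatives.append(("AND", n.label,
                             [parallel_binarize(n_prime, factorizable),
                              parallel_binarize(n.children[-1], factorizable)]))
    if not alternatives:
        # assumed fallback: right-binarize even if the span is not factorizable
        n_prime = Node(n.label + "'", n.children[1:])
        alternatives.append(("AND", n.label,
                             [parallel_binarize(n.children[0], factorizable),
                              parallel_binarize(n_prime, factorizable)]))
    # additive ("OR") node over the alternative binarizations
    return ("OR", n.label, alternatives)
```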
37 Example
(Figure: an example packed binarization forest; additive forest nodes behave like OR nodes, multiplicative forest nodes like AND nodes.)
38-41 Extract translation rule: Condition 1
(Figure, built up over slides 38-41: under Condition 1, Procedure-1 is called first and Procedure-2 is then called recursively.)
42-45 Extract translation rule: Condition 2
(Figure, built up over slides 42-45: under Condition 2, Procedure-2 is called first and Procedure-1 is then called recursively.)
46 Extract translation rule
- In this way we can build a derivation forest; by traversing the forest top-down recursively, we can extract rules at admissible forest nodes
47 Learning how to binarize via the EM algorithm
- Perform a set of binarization operations β on a parse tree t
- Each binarization β is the sequence of binarizations applied to the necessary nodes of t in pre-order
- Each binarization β results in a restructured tree t_β
- Extract rules from (t_β, f, a), generating a translation model with parameters θ (i.e., rule probabilities)
- Obtain the β that satisfies the training objective (a hedged sketch follows below)
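The formula at the end of slide 47 did not survive extraction. The following is only a generic sketch of what choosing a binarization by EM-trained rule probabilities typically looks like, with θ denoting the rule probabilities; it should not be read as the paper's exact objective.

```latex
% EM alternates between re-estimating the rule probabilities \theta and
% re-scoring the candidate binarizations; the chosen binarization is the one
% that maximizes the likelihood of the aligned pair (f, a) given t_\beta:
\hat{\beta} \;=\; \operatorname*{arg\,max}_{\beta}\; P\bigl(f, a \mid t_{\beta};\, \theta\bigr)
```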
48 Using the EM algorithm to choose restructuring
49 Forest Rescoring: Faster Decoding with Integrated Language Models
- Efficient decoding for phrase-based and syntax-based MT models is a difficult problem
- If the language model is fully integrated into the decoder, there is an expensive overhead for maintaining target-language boundary words during decoding
50 Some alternative methods
- Rescoring: produce a k-best list of candidate translations without the LM, then rerank the k-best list using the LM
- Forest rescoring, in two variants:
- Cube pruning
- Cube growing
51-54 Cube pruning: some details
- Avoid duplicate deductions
(Figure, built up over slides 51-54: items of the cube are extracted in best-first order; the first, second, and third extractions are highlighted in turn.)
55 Cube pruning: some details
(Figure: several cubes Cube1, Cube2, ..., Cuben feed their candidates through a shared heap into the stack.)
56 Cube pruning: some details
- Suppose we are decoding with a hierarchical phrase-based model. The dimensionality of the cube is at most 3, because each rule has at most 2 variables: the rule itself forms one dimension, while the two variables form the other two.
(Figure: Dimension 1 enumerates rules of the form X1 ... X2; Dimensions 2 and 3 enumerate the candidate sub-derivations for the two variables.)
57 Cube pruning: some details
- When we extract the best derivation from the top of the heap, we push at most 3 of its neighbors into the heap as candidates (see the sketch below):
- Nb1 = (i+1, j, k), Nb2 = (i, j+1, k), Nb3 = (i, j, k+1)
(Figure: the cube with its three dimensions indexed by i, j, and k.)
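A sketch of the cube-pruning loop of slides 51-57 for a single three-dimensional cube. The `score(i, j, k)` callback, which would combine the i-th rule with the j-th and k-th antecedent items under the LM, is an assumption, and duplicate deductions are avoided with a `visited` set.

```python
import heapq

def cube_prune(score, dims, k_best=100):
    """Pop up to k_best cells of a 3-D cube in (approximately) best-first order.

    `score(i, j, k)` returns the LM-integrated cost of the cell (assumed
    callback); `dims` = (I, J, K) gives the extent of each dimension.
    """
    start = (0, 0, 0)
    heap = [(score(*start), start)]
    visited = {start}                       # avoid duplicate deductions (slides 51-54)
    results = []

    while heap and len(results) < k_best:
        cost, (i, j, k) = heapq.heappop(heap)
        results.append((cost, (i, j, k)))
        # push at most 3 neighbors, one along each dimension (slide 57)
        for neighbor in ((i + 1, j, k), (i, j + 1, k), (i, j, k + 1)):
            if all(x < d for x, d in zip(neighbor, dims)) and neighbor not in visited:
                visited.add(neighbor)
                heapq.heappush(heap, (score(*neighbor), neighbor))
    return results
```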
58-63 Cube growing
- LazyJthBest(n), built up over slides 58-63: the candidate list cand is initialized by firing the (1,1) items, LazyJthBest(1) is invoked recursively for the required sub-derivations, and then, while |D(v)| < n and cand is not empty, the minimum item is popped and its neighbors Nb1, ..., Nbm are fired into cand. (A sketch follows below.)
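A sketch of the LazyJthBest / Fire recursion summarized on slides 58-63, over an assumed hypergraph representation (`incoming[v]` lists hyperedges as (tail node ids, rule cost) pairs, and `weight` combines a rule cost with the costs of the chosen sub-derivations). The LM-specific parts of cube growing (heuristic scores, buffering) are omitted.

```python
import heapq

class LazyJthBest:
    """Lazily compute the j-th best derivations of each hypergraph node."""

    def __init__(self, incoming, weight):
        self.incoming = incoming   # incoming[v] -> list of (tails, rule_cost) hyperedges
        self.weight = weight       # weight(rule_cost, tail_costs) -> combined cost
        self.D = {}                # D[v]: derivations found so far, best first
        self.cand = {}             # cand[v]: candidate heap
        self.seen = {}             # seen[v]: (edge, ranks) pairs already fired

    def lazy_jth_best(self, v, j):
        if v not in self.cand:     # first visit: Fire(1, ..., 1, cand) for every edge
            self.cand[v], self.seen[v], self.D[v] = [], set(), []
            for e in self.incoming.get(v, []):
                self._fire(v, e, (1,) * len(e[0]))
        while len(self.D[v]) < j and self.cand[v]:
            cost, e, ranks = heapq.heappop(self.cand[v])     # Pop_Min
            self.D[v].append((cost, (e, ranks)))
            for i in range(len(ranks)):                      # Fire(Nb1..Nbm, cand)
                self._fire(v, e, ranks[:i] + (ranks[i] + 1,) + ranks[i + 1:])
        return self.D[v]

    def _fire(self, v, e, ranks):
        if (e, ranks) in self.seen[v]:
            return                                           # avoid duplicate deductions
        self.seen[v].add((e, ranks))
        tails, rule_cost = e
        tail_costs = []
        for u, r in zip(tails, ranks):
            derivs = self.lazy_jth_best(u, r)                # recursive LazyJthBest
            if len(derivs) < r:
                return                                       # r-th sub-derivation missing
            tail_costs.append(derivs[r - 1][0])
        heapq.heappush(self.cand[v], (self.weight(rule_cost, tail_costs), e, ranks))
```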
64 Two-Pass Approach to Synchronous-CFG Driven SMT
- The first pass, corresponding to a severe parameterization of cube pruning, considers only the first-best (LM-integrated) chart item in each cell, while maintaining unexplored alternatives for second-pass consideration
- In the second pass, the search is driven by integrating long-distance and flexible-history n-gram LMs, rather than simply using such models for hypothesis rescoring
65 Thanks!