Title: Shuffling Non-Constituents
1 Shuffling Non-Constituents
with David A. Smith and Roy Tromble
Syntactically-flavored reordering search methods
ACL SSST Workshop, June 2008
2 Starting point: Synchronous alignment
- Synchronous grammars are very pretty.
- But does parallel text actually have parallel structure?
  - Depends on what kind of parallel text
    - Free translations? Noisy translations?
    - Were the parsers trained on parallel annotation schemes?
  - Depends on what kind of parallel structure
    - What kinds of divergences can your synchronous grammar formalism capture?
    - E.g., wh-movement versus wh in situ
3 Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English.
beaucoup d'enfants donnent un baiser à Sam
kids kiss Sam quite often
4 Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.
[Figure: the two dependency trees with their node-to-node alignment]
beaucoup d'enfants donnent un baiser à Sam
kids kiss Sam quite often
5 Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange. A much worse alignment ...
[Figure: the two dependency trees with a much worse node-to-node alignment]
beaucoup d'enfants donnent un baiser à Sam
kids kiss Sam quite often
6 Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.
[Figure: the two dependency trees with their node-to-node alignment]
beaucoup d'enfants donnent un baiser à Sam
kids kiss Sam quite often
7 Grammar = Set of Elementary Trees
8 But many examples are harder
[Figure: aligned dependency trees for German "Auf diese Frage habe ich leider keine Antwort bekommen" (gloss: to this question have I alas no answer received) and English "I did not unfortunately receive an answer to this question", with a NULL node]
9 But many examples are harder
[Figure: the same aligned German-English dependency trees]
Displaced modifier (negation)
10 But many examples are harder
[Figure: the same aligned German-English dependency trees]
Displaced modifier (negation)
11 But many examples are harder
[Figure: the same aligned German-English dependency trees]
Displaced argument (here, because of the projective parser)
12 But many examples are harder
[Figure: the same aligned German-English dependency trees]
Head-swapping (here, due to different annotation conventions)
13 Free Translation
[Figure: aligned dependency trees for German "Tschernobyl könnte dann etwas später an die Reihe kommen" (gloss: Chernobyl could then something later on the queue come) and English "Then we could deal with Chernobyl some time later", with a NULL node]
14 Free Translation
[Figure: the same aligned German-English dependency trees]
Probably not systematic (but words are correctly aligned)
15 Free Translation
[Figure: the same aligned German-English dependency trees]
Erroneous parse
16 What to do?
- Current practice:
  - Don't try to model all systematic phenomena!
  - Just use non-syntactic alignments (Giza).
  - Only care about the fragments that recur often
    - Phrases or gappy phrases
    - Sometimes even syntactic constituents (can favor these, e.g., Marton & Resnik 2008)
  - Use these (gappy) phrases in a decoder
    - Phrase-based or hierarchical
17 What to do?
- Current practice:
  - Use non-syntactic alignments (Giza)
  - Keep frequent phrases for a decoder
- But could syntax give us better alignments?
  - Would have to be loose syntax
- Why do we want better alignments?
  - Throw away less of the parallel training data
  - Help learn a smarter, syntactic reordering model
    - Could help decoding: less reliance on the LM
  - Some applications care about full alignments
18 Quasi-synchronous grammar
- How do we handle loose syntax?
- Translation story:
  - Generate target English by a monolingual grammar
  - Any grammar formalism is okay
  - Pick a dependency grammar formalism for now
[Figure: English dependency tree for "I did not unfortunately receive an answer to this question", generated with factors such as P(I | did, PRP) and P(PRP | no previous left children of did); parsing is O(n^3)]
19 Quasi-synchronous grammar
- How do we handle loose syntax?
- Translation story:
  - Generate target English by a monolingual grammar
  - But probabilities are influenced by the source sentence
  - Each English node is aligned to some source node
  - Prefers to generate children aligned to nearby source nodes
[Figure: the same English dependency tree; parsing is O(n^3)]
20 QCFG Generative Story
[Figure: the observed German dependency tree (with a NULL node) and the generated English dependency tree, linked by an alignment; generation uses factors such as P(parent-child), P(breakage), P(I | did, PRP, ich), and P(PRP | no previous left children of did, habe); aligned parsing is O(m^2 n^3)]
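To make the generative story concrete, here is a minimal Python sketch (my own illustration with hypothetical inputs, not the actual model of Smith & Eisner): it scores an English dependency tree whose nodes are aligned to source nodes, combining a monolingual head-child factor with a factor for where the child's aligned source node sits relative to the parent's.

import math

def qg_log_score(en_words, en_head, align, de_head, p_child, p_align_config):
    """Log-score an English dependency tree under a quasi-synchronous story.

    en_words : list of English tokens (indices 0 .. n-1)
    en_head  : en_head[i] = index of i's parent, or None for the root
    align    : align[i] = index of the source (German) node that i is aligned to,
               or None (aligned to NULL)
    de_head  : de_head[j] = parent of source node j, or None for the source root
    p_child  : p_child[(head_word, child_word)] = monolingual generation prob
    p_align_config : prob of each source-side configuration (hypothetical labels)
    """
    total = 0.0
    for i, w in enumerate(en_words):
        h = en_head[i]
        if h is None:
            continue  # skip the root; a fuller model would also score it
        # Monolingual factor: how much does the English grammar like this attachment?
        total += math.log(p_child.get((en_words[h], w), 1e-6))
        # Source-side factor: where is the child's aligned node relative to
        # the parent's aligned node in the source tree?
        a_h, a_c = align[h], align[i]
        if a_c is None or a_h is None:
            config = 'NULL'
        elif de_head[a_c] == a_h:
            config = 'parent-child'        # the synchronous-grammar case
        elif de_head[a_h] == a_c:
            config = 'child-parent'        # head-swapping
        elif a_c == a_h:
            config = 'same'                # two English words from one source word
        else:
            config = 'none of the above'   # a "breakage"
        total += math.log(p_align_config.get(config, 1e-6))
    return total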
21 What's a nearby node?
- Given the parent's alignment, where might the child be aligned?
[Figure: possible configurations of the child's aligned source node relative to the parent's, from the synchronous-grammar case (parent-child) to "none of the above"]
22 Quasi-synchronous grammar
- How do we handle loose syntax?
- Translation story:
  - Generate target English by a monolingual grammar
  - But probabilities are influenced by the source sentence
- Useful analogies:
  - Generative grammar with latent word senses
  - MEMM
    - Generates a tag sequence (n-gram model),
    - but probabilities are influenced by the word sequence
23 Quasi-synchronous grammar
- How do we handle loose syntax?
- Translation story:
  - Generate target English by a monolingual grammar
  - But probabilities are influenced by the source sentence
- Useful analogies:
  - Generative grammar with latent word senses
  - MEMM
  - IBM Model 1
    - Source nodes can be freely reused or unused
    - Future work: enforce 1-to-1 to allow good decoding (NP-hard to do exactly)
24 Some results: Quasi-synchronous dependency grammar
- Alignment (D. Smith & Eisner 2006)
  - Quasi-synchronous much better than synchronous
  - Maybe also better than IBM Model 4
- Question answering (Wang et al. 2007)
  - Align question with potential answer
  - Mean average precision 43 → 48 → 60 (previous state of the art → QG → + lexical features)
- Bootstrapping a parser for a new language (D. Smith & Eisner 2007 & ongoing)
  - Learn how parsed parallel text influences target dependencies
    - Along with many other features! (cf. co-training)
  - Unsupervised: German 30 → 69, Spanish 26 → 65
25 Summary of part I
- Current practice:
  - Use non-syntactic alignments (Giza)
  - Some bits align nicely
  - Use the frequent bits in a decoder
- Suggestion: Let syntax influence alignments.
  - So far, loose syntax methods are like IBM Model 1.
  - NP-hard to enforce 1-to-1 in any interesting model.
- Rest of talk:
  - How to enforce 1-to-1 in interesting models?
  - Can we do something smarter than beam search?
26 Shuffling Non-Constituents
with David A. Smith and Roy Tromble
Syntactically-flavored reordering model
ACL SSST Workshop, June 2008
27 Motivation
- MT is really easy!
- Just use a finite-state transducer!
- Phrases, morphology, the works!
28 Permutation search in MT
[Figure: the tagged French words NNP Marie, NEG ne, PRP m', AUX a, NEG pas, VBN vu, shown in their initial (French) order and in the best order; from the best order, translation is an easy transduction]
29 Motivation
- MT is really easy!
  - Just use a finite-state transducer!
  - Phrases, morphology, the works!
  - Just have to fix that pesky word order.
Framing it this way lets us enforce 1-to-1 exactly at the permutation step. Deletion and fertility > 1 are still allowed in the subsequent transduction.
30 Often want to find an optimal permutation
- Machine translation: reorder French to French-prime (Brown et al. 1992), so it's easier to align or translate
- MT eval: how much do you need to rearrange MT output so it scores well under an LM derived from reference translations?
- Discourse generation, e.g., multi-doc summarization: order the output sentences (Lapata 2003) so they flow nicely
- Reconstruct temporal order of events after info extraction
- Learn rule ordering or constraint ranking for phonology?
- Multi-word anagrams that score well under an LM
31 Other applications (there are many ...)
- LOP (Linear Ordering Problem)
  - Maximum-weight acyclic subgraph (equivalent)
  - Graph drawing, task scheduling, archaeology, aggregating ranked ballots, ...
- TSP (Traveling Salesperson Problem)
  - Transportation scheduling (schoolbus, meals-on-wheels, service calls, ...)
  - Motion scheduling (drill head, space telescopes, ...)
  - Topology of a ring network
  - Genome assembly
32 Permutation search: The problem
initial order → best order according to some cost function
33 Traditional approach: Beam search
Approx. best path through a really big FSA: N! paths, one for each permutation, but only 2^N states
34 An alternative: Local search (hill climbing)
The SWAP neighborhood: from the current permutation, swap one adjacent pair.
1 2 3 4 5 6   cost=22  (current)
2 1 3 4 5 6   cost=26
1 3 2 4 5 6   cost=20
1 2 4 3 5 6   cost=19
1 2 3 5 4 6   cost=25
35 An alternative: Local search (hill-climbing)
The SWAP neighborhood
1 2 3 4 5 6   cost=22
1 2 4 3 5 6   cost=19  (the best swap)
36 An alternative: Local search (hill-climbing). Like the greedy decoder of Germann et al. 2001
The SWAP neighborhood
[Figure: current permutation 1 4 2 3 5 6, cost=22]
- Why are the costs always going down? Because we pick the best swap.
- How long does it take to pick the best swap? O(N) if you're careful.
- How many swaps might you need to reach the answer? O(N^2).
- What if you get stuck in a local min? Random restarts.
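For concreteness, a minimal Python sketch (my own illustration, not the decoder discussed above) of greedy hill-climbing over the SWAP neighborhood with random restarts; it calls a black-box cost function on each neighbor, whereas the O(N) figure above assumes the cost change of a single swap can be computed incrementally.

import random

def swap_hill_climb(perm, cost, restarts=10, rng=random):
    """Greedy local search over the SWAP neighborhood.

    perm : initial permutation (list of items)
    cost : function mapping a permutation (list) to a real-valued cost
    Returns the best permutation found over several random restarts.
    """
    def climb(p):
        p = list(p)
        current = cost(p)
        while True:
            # Evaluate all adjacent-transposition neighbors and keep the best.
            best_i, best_cost = None, current
            for i in range(len(p) - 1):
                p[i], p[i + 1] = p[i + 1], p[i]
                c = cost(p)
                p[i], p[i + 1] = p[i + 1], p[i]   # undo the swap
                if c < best_cost:
                    best_i, best_cost = i, c
            if best_i is None:                    # local minimum reached
                return p, current
            p[best_i], p[best_i + 1] = p[best_i + 1], p[best_i]
            current = best_cost

    best_p, best_c = climb(perm)
    for _ in range(restarts):                     # random restarts to escape local minima
        q = list(perm)
        rng.shuffle(q)
        p, c = climb(q)
        if c < best_c:
            best_p, best_c = p, c
    return best_p, best_c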
37 Larger neighborhood
1 2 3 4 5 6   cost=22  (current)
2 1 3 4 5 6   cost=26
1 3 2 4 5 6   cost=20
1 2 4 3 5 6   cost=19
1 2 3 5 4 6   cost=25
38 Larger neighborhood (well-known in the literature; reportedly works well)
The INSERT neighborhood: remove one element and reinsert it at another position.
[Figure: element 3 of "1 2 3 4 5 6" (cost=22) being reinserted further right]
- Fewer local minima? Yes: 3 can move past 4, and even past 5, in a single move.
- Graph diameter (max moves needed)? O(N) rather than O(N^2).
- How many neighbors? O(N^2) rather than O(N).
- How long to find the best neighbor? O(N^2) rather than O(N).
39 Even larger neighborhood
The BLOCK neighborhood: exchange two adjacent blocks.
[Figure: two adjacent blocks of "1 2 3 4 5 6" (cost=22) being exchanged]
- Fewer local minima? Yes: 2 can get past "4 5" without having to cross 3 or move 3 first.
- Graph diameter (max moves needed)? Still O(N).
- How many neighbors? O(N^3) rather than O(N) or O(N^2).
- How long to find the best neighbor? O(N^3) rather than O(N) or O(N^2).
40 Larger yet: via dynamic programming??
[Figure: "1 2 3 4 5 6" (cost=22) rearranged by many simultaneous block exchanges]
- Fewer local minima?
- Graph diameter (max moves needed)? Logarithmic.
- How many neighbors? Exponential.
- How long to find the best neighbor? Polynomial.
41 Unifying/generalizing the neighborhoods so far
[Figure: exchanging two adjacent blocks within the sequence]
Exchange two adjacent blocks, of max widths w and w':
- SWAP: w=1, w'=1
- INSERT: w=1, w'=N
- BLOCK: w=N, w'=N
Runtime / number of neighbors: O(w w' N)
Everything in this talk can be generalized to other values of w, w'.
42 Very large-scale neighborhoods
- What if we consider multiple simultaneous exchanges that are independent?
- The DYNASEARCH neighborhood (Potts & van de Velde 1995; Congram 2000)
[Figure: the swap lattice over positions 1-6]
Lowest-cost neighbor is lowest-cost path
43 Very large-scale neighborhoods
Lowest-cost neighbor is lowest-cost path
- Why would this be a good idea?
  - Help get out of bad local minima? No, they're still local minima.
  - Help avoid getting into bad local minima? Yes, less greedy.
[Figure: the swap lattice with example edge costs]
44 Very large-scale neighborhoods
Lowest-cost neighbor is lowest-cost path
- Why would this be a good idea?
  - Help get out of bad local minima? No, they're still local minima.
  - Help avoid getting into bad local minima? Yes, less greedy.
  - More efficient? Yes! A shortest-path algorithm finds the best set of swaps in O(N) time, as fast as the best single swap. Up to N moves as fast as 1 move: no penalty for parallelism! It globally optimizes over exponentially many neighbors (paths).
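A minimal sketch (my own illustration) of the DYNASEARCH idea, assuming LOP-style pair costs so that non-overlapping adjacent swaps contribute independently: a Viterbi-style pass over positions picks the best set of simultaneous swaps in O(N).

def best_dynasearch_move(perm, pair_cost):
    """Find the best set of non-overlapping adjacent swaps in one pass.

    perm      : current permutation (list of items)
    pair_cost : pair_cost[a][b] = cost incurred when item a precedes item b
                (LOP-style costs, so non-overlapping swaps are independent)
    Returns (new_perm, total_gain), where total_gain <= 0.
    """
    n = len(perm)
    # gain[i] = change in total cost if we swap positions i and i+1
    gain = [pair_cost[perm[i + 1]][perm[i]] - pair_cost[perm[i]][perm[i + 1]]
            for i in range(n - 1)]

    # Viterbi-style DP over positions: best[i] = best achievable gain using
    # only swaps among positions 0..i-1; choice[i] records whether we swap (i-2, i-1).
    best = [0.0] * (n + 1)
    choice = [False] * (n + 1)
    for i in range(2, n + 1):
        keep = best[i - 1]
        swap = best[i - 2] + gain[i - 2]
        best[i], choice[i] = (swap, True) if swap < keep else (keep, False)

    # Backtrack to apply the chosen swaps.
    new = list(perm)
    i = n
    while i >= 2:
        if choice[i]:
            new[i - 2], new[i - 1] = new[i - 1], new[i - 2]
            i -= 2
        else:
            i -= 1
    return new, best[n]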
45 Can we extend this idea (up to N moves in parallel, by dynamic programming) to neighborhoods beyond SWAP?
[Figure: exchanging two adjacent blocks within the sequence]
Exchange two adjacent blocks, of max widths w and w':
- SWAP: w=1, w'=1
- INSERT: w=1, w'=N
- BLOCK: w=N, w'=N
Runtime / number of neighbors: O(w w' N)
46 Let's define each neighbor by a colored tree. Just like ITG!
[Figure: an ITG-style binary tree over the permutation 1 4 2 3 5 6, with some nodes colored to mark block exchanges]
47 Let's define each neighbor by a colored tree. Just like ITG!
48 Let's define each neighbor by a colored tree. Just like ITG!
[Figure: a colored tree; each colored node exchanges its two children's blocks]
This is like the BLOCK neighborhood, but with multiple block exchanges, which may be nested.
49 If that was the optimal neighbor ...
now look for its optimal neighbor
new tree!
[Figure: a colored tree over the new permutation 1 4 5 6 2 3]
50 If that was the optimal neighbor ...
now look for its optimal neighbor
new tree!
[Figure: another colored tree]
51 If that was the optimal neighbor ...
now look for its optimal neighbor; repeat till you reach a local optimum
Each tree defines a neighbor. At each step, optimize over all possible trees by dynamic programming (CKY parsing).
[Figure: a colored tree over the permutation 5 6 1 4 2 3]
Use your favorite parsing speedups (pruning, best-first, ...)
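A minimal sketch (my own illustration, not the authors' implementation) of one step of this search under LOP-style pairwise costs: a CKY-style dynamic program over the current order decides, for every span, whether to keep or swap its two child blocks, and returns the best neighboring permutation.

def best_itg_neighbor(perm, pair_cost):
    """One step of very-large-scale local search over ITG-style colored trees.

    perm      : current permutation (list of items)
    pair_cost : pair_cost[a][b] = cost of item a preceding item b (LOP costs)
    Returns (new_perm, its_total_LOP_cost).

    Naive O(N^5); the bookkeeping trick of slide 73 brings this down to O(N^3).
    """
    n = len(perm)
    # best[i][j] = (cost of pairs internal to span [i, j), best ordering of that span)
    best = [[None] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        best[i][i + 1] = (0.0, [perm[i]])

    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            best_cost, best_order = None, None
            for k in range(i + 1, j):
                lcost, lorder = best[i][k]
                rcost, rorder = best[k][j]
                # Cross-span cost depends only on which block ends up first.
                keep = sum(pair_cost[a][b] for a in lorder for b in rorder)
                swap = sum(pair_cost[b][a] for a in lorder for b in rorder)
                if keep <= swap:
                    cost, order = lcost + rcost + keep, lorder + rorder
                else:  # "colored" node: exchange the two child blocks
                    cost, order = lcost + rcost + swap, rorder + lorder
                if best_cost is None or cost < best_cost:
                    best_cost, best_order = cost, order
            best[i][j] = (best_cost, best_order)

    return best[0][n][1], best[0][n][0]

Iterating this until the cost stops improving gives the local search loop sketched on these slides.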
52 Very-large-scale versions of SWAP, INSERT, and BLOCK: all by the algorithm we just saw
[Figure: exchanging two adjacent blocks within the sequence]
Exchange two adjacent blocks, of max widths w and w'.
The runtime of the algorithm we just saw was O(N^3) because we considered O(N^3) distinct (i,j,k) triples. More generally, restrict to only the O(w w' N) triples of interest to define a smaller neighborhood with runtime O(w w' N). (Yes, the dynamic programming recurrences go through.)
53 How many steps to get from here to there?
initial order: 8 4 6 2 5 3 7 1
best order:    1 4 5 2 3 6 7 8
One twisted-tree step? No. As you probably know, 3 1 4 2 → 1 2 3 4 is impossible.
54 Can you get to the answer in one step? German-English, Giza alignment
Not always (yay, local search), but often (yay, big neighborhood)
55 How many steps to the answer in the worst case? (What is the diameter of the search space?)
from: 8 4 6 2 5 3 7 1
to:   1 4 5 2 3 6 7 8
Claim: only log2(N) steps at worst (if you know where to step). Let's sketch the proof!
56 Quicksort anything into, e.g., 1 2 3 4 5 6 7 8
[Figure: a right-branching tree over the scrambled order 6 8 4 2 5 3 7 1; one step partitions the elements into ≤ 4 and ≥ 5]
57 Quicksort anything into, e.g., 1 2 3 4 5 6 7 8
Only log2(N) steps to get to 1 2 3 4 5 6 7 8, or to anywhere!
[Figure: successive quicksort-style partitions around pivots 4/5, then 2/3 and 6/7]
58 Defining "best order": What class of cost functions can we handle efficiently? How fast can we compute a subtree's cost from its child subtrees?
initial order → best order according to some cost function
59 Defining "best order": What class of cost functions?
A =
 [  0  15  22  80   5  -7 ]
 [-30   0 -76  24  63 -44 ]
 [ 15  28   0 -15  71 -99 ]
 [ 12   8 -31   0  54  -6 ]
 [  7  -9  41  24   0  82 ]
 [  6   5 -22   8  93   0 ]
Traveling Salesperson Problem (TSP): a[i][j] = cost of j immediately following i. The cost of an order is the sum of its adjacency costs, e.g., a25 + a56 + a63 + a42 + a14 + a31.
best order according to some cost function
60 Defining "best order": What class of cost functions?
B =
 [  0   5 -22  93   8   6 ]
 [ 12   0   8 -31  -6  54 ]
 [ -7  41   0  -9  24  82 ]
 [ 88  17  -6   0  12 -60 ]
 [ 11 -17  10 -59   0  23 ]
 [  5   4 -12   6  55   0 ]
Linear Ordering Problem (LOP): b26 = cost of 2 preceding 6 (anywhere before it, not just immediately). Add up the n(n-1)/2 such costs; any order will incur either b26 or b62.
best order according to some cost function
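For concreteness, a small Python sketch (illustration only; items are 0-indexed, and the TSP version scores a path rather than a closed tour) of the two cost functions applied to a permutation, given matrices A and B as above.

def tsp_cost(perm, A):
    """Sum of adjacency costs A[i][j] over each consecutive pair (i, j) in the order."""
    return sum(A[perm[k]][perm[k + 1]] for k in range(len(perm) - 1))

def lop_cost(perm, B):
    """Sum of B[i][j] over all pairs where i precedes j (anywhere) in the order."""
    return sum(B[perm[k]][perm[m]]
               for k in range(len(perm))
               for m in range(k + 1, len(perm)))

# Example usage with 6 items: perm = [0, 1, 2, 3, 4, 5]
# tsp_cost(perm, A); lop_cost(perm, B)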
61 Defining "best order": What class of cost functions?
- TSP and LOP are both NP-complete
  - In fact, believed to be inapproximable
  - Hard even to achieve C times the optimal cost (for any constant C > 1)
- Practical approaches:
  - Correct answer, typically fast: branch-and-bound, ILP, ...
  - Fast answer, typically close to correct: beam search, this talk, ...
62 Moving small blocks helps on LOP (experiment on LOLIB, a collection of 250-word problems from economics)
63 Defining "best order": What class of cost functions?
initial order → cost of this order
- Does my favorite WFSA like it as a string?
- Non-local pair order OK?
- Non-local triple order OK?
- Can add these all up
64 Costs are derived from source sentence features
initial order (French): NNP Marie, NEG ne, PRP m', AUX a, NEG pas, VBN vu
[Figure: the LOP cost matrix B and the TSP cost matrix A, whose entries are computed from features of the French sentence]
65 Costs are derived from source sentence features
initial order (French): NNP Marie, NEG ne, PRP m', AUX a, NEG pas, VBN vu
[Figure: the matrices B and A again, with one entry highlighted]
Can also include phrase boundary symbols in the input!
66 Costs are derived from source sentence features
initial order (French): NNP Marie, NEG ne, PRP m', AUX a, NEG pas, VBN vu
FSA costs: distortion model, language model (looks ahead to the next step: will this reordering allow a good finite-state translation into good English?)
[Figure: the matrices B and A again]
67 Dynamic program must pick the tree that leads to the lowest-cost permutation
initial order → cost of this order
- Does my favorite WFSA like it as a string?
68 Scoring with a weighted FSA
This particular WFSA implements TSP scoring for N=3: after you read 1, you're in state 1; after you read 2, you're in state 2; after you read 3, you're in state 3; and this state determines the cost of the next symbol you read.
- We'll handle a WFSA with Q states by using a fancier grammar, with nonterminals. (Now the runtime goes up to O(N^3 Q^3).)
69 Including WFSA costs via nonterminals
A possible preterminal for word 2 is an arc in A that's labeled with 2. The preterminal 4→2 rewrites as word 2, with a cost equal to the arc's cost.
[Figure: the words 1-6 with their possible preterminal arcs]
70 Including WFSA costs via nonterminals
71 Dynamic program must pick the tree that leads to the lowest-cost permutation
initial order → cost of this order
- Does my favorite WFSA like it as a string?
- Non-local pair order OK?
72 Incorporating the pairwise ordering costs
This move puts 5, 6, 7 before 1, 2, 3, 4. So this hypothesis must add the costs of 5 < 1, 5 < 2, 5 < 3, 5 < 4, 6 < 1, 6 < 2, 6 < 3, 6 < 4, 7 < 1, 7 < 2, 7 < 3, 7 < 4. Uh-oh! So now it takes O(N^2) time to combine two subtrees, instead of O(1) time? Nope: dynamic programming to the rescue again!
73 Computing the LOP cost of a block move
This move puts 5, 6, 7 before 1, 2, 3, 4. So we have to add O(N^2) costs just to consider this single neighbor! Instead, reuse the work from other, narrower block moves: the new cost is computed in O(1)!
[Figure: the cross-block cost of moving (5 6 7) before (1 2 3 4) computed by inclusion-exclusion from the costs of narrower block moves]
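A sketch (my own illustration) of the reuse trick for LOP costs: the cross-block cost of any block exchange is filled in O(1) from three slightly narrower exchanges by inclusion-exclusion, so the whole O(N^3) table costs O(1) amortized per neighbor.

def cross_block_costs(perm, B):
    """Precompute C[i][k][j] = total LOP cost incurred by putting block
    perm[k:j] before block perm[i:k] (the "swapped" order), for all i <= k <= j.
    Each entry is filled in O(1) from narrower blocks by inclusion-exclusion."""
    n = len(perm)
    C = [[[0.0] * (n + 1) for _ in range(n + 1)] for _ in range(n + 1)]
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                # Add the one new pair (perm[i], perm[j-1]); the overlap of the
                # two narrower tables is subtracted once.
                C[i][k][j] = (C[i + 1][k][j] + C[i][k][j - 1]
                              - C[i + 1][k][j - 1]
                              + B[perm[j - 1]][perm[i]])
    return C

These C values are exactly the "swap" sums of the CKY sketch above (the keep-order sums are analogous), which is what reduces that step from O(N^5) to O(N^3).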
74 Incorporating 3-way ordering costs
- See the initial paper (Eisner & Tromble 2006)
- A little tricky, but
  - comes for free if you're willing to accept a certain restriction on these costs
  - more expensive without that restriction, but possible
75 Another option: Markov chain Monte Carlo
- Random walk in the space of permutations
  - Interpret a permutation's cost as a log-probability
  - Sample a permutation from the neighborhood instead of always picking the most probable
- Why?
  - Simulated annealing might beat greedy-with-random-restarts
  - When learning the parameters of the distribution, can use sampling to compute the feature expectations
76 Another option: Markov chain Monte Carlo
- Random walk in the space of permutations
  - Interpret a permutation's cost as a log-probability
  - Sample a permutation from the neighborhood instead of always picking the most probable
- How?
  - Pitfall: sampling a tree is not the same as sampling a permutation
  - Spurious ambiguity: some permutations have many trees
  - Solution: exclude some trees, leaving 1 per permutation
    - A normal form has long been known for colored trees
    - For restricted colored trees (which limit the size of blocks to swap), we have devised a more complicated normal form
77 Sampling from permutation space: p(π) = exp(−cost(π)) / Z. Why is this useful?
- To train the weights that determine the cost matrix (as we saw earlier)
  - And to compute expectations of other quantities (e.g., how often does 2 precede 5?)
- A less greedy heuristic for finding the lowest-cost permutation
  - This is the mode of p, i.e., the highest-probability permutation.
  - Take a sample from p. If most of p's probability mass is on the mode, you have a good chance of getting the mode.
  - If not, boost the odds: sample instead from p^β, for β > 1
    - defined as p_β(π) = exp(−β·cost(π)) / Z_β (so p_2(π) is proportional to p(π)^2)
  - As β → ∞, the chance of getting the mode → 1
  - But as β → ∞, MCMC sampling gets slower and slower (no free lunch!)
  - → simulated annealing: gradually increase β during MCMC sampling
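A minimal sketch (my own illustration, using simple adjacent-swap proposals rather than the tree-shaped neighborhoods above) of Metropolis-Hastings sampling from p_β(π) proportional to exp(−β·cost(π)), with β increased over time as in simulated annealing.

import math
import random

def anneal(perm, cost, steps=10000, beta0=0.1, beta1=5.0, rng=random):
    """Metropolis-Hastings over permutations with adjacent-swap proposals.
    beta is raised from beta0 to beta1, so early steps explore and late
    steps concentrate on the mode (simulated annealing)."""
    p = list(perm)
    c = cost(p)
    best, best_c = list(p), c
    for t in range(steps):
        beta = beta0 + (beta1 - beta0) * t / max(1, steps - 1)
        i = rng.randrange(len(p) - 1)
        p[i], p[i + 1] = p[i + 1], p[i]          # propose an adjacent swap
        c_new = cost(p)
        # Symmetric proposal, so accept with prob min(1, exp(-beta * (c_new - c))).
        if c_new <= c or rng.random() < math.exp(-beta * (c_new - c)):
            c = c_new
            if c < best_c:
                best, best_c = list(p), c
        else:
            p[i], p[i + 1] = p[i + 1], p[i]      # reject: undo the swap
    return best, best_c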
78 Learning the costs
- Where do these costs come from?
- If we have some examples on which we know the true permutation, we could try to learn them
[Figure: the cost matrices B and A again]
79 Learning the costs
- Where do these costs come from?
- If we have some examples on which we know the true permutation, we could try to learn them
- More precisely, try to learn these weights θ (the knowledge that's reused across examples):
  50: a verb (e.g., vu) shouldn't precede its subject (e.g., Marie)
  27: words at a distance of 5 shouldn't swap order
  -2: words with PRP between them ought to swap
[Figure: the cost matrices B and A again]
80 Learning the costs
- Typical learning approach (details omitted): tune the weights θ to maximize the probability of the correct answer π*
  (actually, the log probability: a convex optimization with the same answer)
  - Probability??? We were just trying to minimize the cost.
  - But there's a standard way to convert costs to probabilities:
    - For every permutation π, define p(π) = exp(−cost(π)) / Z
    - where the partition function Z = Σ_π exp(−cost(π)), so Σ_π p(π) = 1
  - Search is now argmax_π p(π)
  - Learning is now argmax_θ log p(π*): increase log p(π*) by gradient ascent
  50: a verb (e.g., vu) shouldn't precede its subject (e.g., Marie)
  27: words at a distance of 5 shouldn't swap order
  -2: words with PRP between them ought to swap
81 Learning the costs
- Typical learning approach (details omitted): tune the weights θ to maximize the probability of the correct answer π*
  (actually, the log probability: a convex optimization with the same answer)
- For every permutation π, define p(π) = exp(−cost(π)) / Z
  - where the partition function Z = Σ_π exp(−cost(π)), so Σ_π p(π) = 1
- Search is now argmax_π p(π)
- Learning: increase log p(π*) by gradient ascent
  - Find the gradient of log p(π*) with respect to the weights θ we're trying to learn
  - Easy, since cost(π*) is typically just a sum of many weights
  - Slow: a sum over all permutations!
82 Learning the costs
- Typical learning approach (details omitted): tune the weights θ to maximize the probability of the correct answer π*
  (actually, the log probability: a convex optimization with the same answer)
- For every permutation π, define p(π) = exp(−cost(π)) / Z
  - where the partition function Z = Σ_π exp(−cost(π)), so Σ_π p(π) = 1
- Search is now argmax_π p(π)
- Learning: increase log p(π*) by gradient ascent
What is this gradient anyway?
∇ log p(π*) = ∇(−cost(π*) − log Z) = −∇cost(π*) − ∇Z/Z
Z = Σ_π exp(−cost(π)), so ∇Z = Σ_π exp(−cost(π)) · (−∇cost(π)),
hence −∇Z/Z = Σ_π p(π) ∇cost(π) = E_p[∇cost(π)]
aha! estimate this expectation by sampling from p (more about this later)
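A toy sketch (illustration only) of the resulting gradient estimate, assuming cost(π) = θ · f(π) for a sparse feature vector f and a set of MCMC samples drawn from p.

def log_linear_gradient(pi_star, samples, features):
    """Gradient of log p(pi*) when cost(pi) = dot(theta, features(pi)):
    d/dtheta log p(pi*) = -features(pi*) + E_p[features(pi)],
    with the expectation estimated from samples drawn from p."""
    grad = {k: -v for k, v in features(pi_star).items()}
    for pi in samples:
        for k, v in features(pi).items():
            grad[k] = grad.get(k, 0.0) + v / len(samples)
    return grad

A gradient-ascent step is then theta[k] += learning_rate * grad[k] for each feature k.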
83 Experimenting with training LOP params (LOP is quite fast: O(n^3) with no grammar constant)
PDS VMFIN PPER ADV APPR ART NN PTKNEG VVINF .
Das kann ich so aus dem Stand nicht sagen .
B[7,9] (e.g., the cost of word 7, "Stand", preceding word 9, "sagen")
84 LOP feature templates
85 LOP feature templates
- Only LOP features so far
- And they're unnecessarily simple (don't examine syntactic constituency)
- And the input sequence is only words (not interspersed with syntactic brackets)
86 Learning LOP Costs for MT
(interesting, if odd, to try to reorder with only the LOP costs)
[Figure: pipeline from German through German' to English, with the MOSES baseline]
- Define German' to be German in English word order
- To get German' for training data, use Giza to align all German positions to English positions (disallow NULL)
87 Learning LOP Costs for MT
(interesting, if odd, to try to reorder with only the LOP costs)
[Figure: pipeline from German through German' to English, with the MOSES baseline]
- Easy first try: Naïve Bayes
  - Treat each feature in θ as independent
  - Count and normalize over the training data
  - No real improvement over baseline
88 Learning LOP Costs for MT
(interesting, if odd, to try to reorder with only the LOP costs)
[Figure: pipeline from German through German' to English, with the MOSES baseline]
- Easy second try: Perceptron
[Figure: hill-climb from the initial permutation to a local optimum, then update the weights toward the gold standard]
Note: search error can be beneficial, e.g., just take 1 step from the identity permutation
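A sketch (my own illustration) of the structured-perceptron update implied here, for a cost of the form cost(π) = θ · f(π): lower the cost of the gold permutation and raise the cost of the model's (possibly search-errorful) prediction.

def perceptron_update(theta, gold, predicted, features, rate=1.0):
    """Structured perceptron step when cost(pi) = dot(theta, features(pi)):
    decrease the gold permutation's cost and increase the prediction's."""
    for k, v in features(gold).items():
        theta[k] = theta.get(k, 0.0) - rate * v
    for k, v in features(predicted).items():
        theta[k] = theta.get(k, 0.0) + rate * v
    return theta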
89 Search error vs. model error
[Plot omitted; warning: different data]
90 Benefit from reordering

Learning method               | BLEU vs. German' | BLEU vs. English
No reordering                 | 49.65            | 25.55
Naïve Bayes, POS              | 49.21            |
Naïve Bayes, POS + lexical    | 49.75            |
Perceptron, POS               | 50.05            | 25.92
Perceptron, POS + lexical     | 51.30            | 26.34

Obviously, not yet unscrambling German: need more features
91 Alternatively, work back from the gold standard
- Contrastive estimation (Smith & Eisner 2005)
  - Maximize the probability of the desired permutation relative to its ITG neighborhood
  - Requires summing over all permutations in a neighborhood
    - Must use normal-form trees here
  - Stochastic gradient descent
[Figure: the gold standard and its neighborhood]
92 Alternatively, work back from the gold standard
- k-best MIRA in the neighborhood
  - Make the gold standard beat its local competitors
  - Beat the bad ones by a bigger margin
    - "Good" = close to gold in swap distance?
    - "Good" = close to gold using BLEU?
    - "Good" = translates into English that's close to the reference?
[Figure: the gold standard and its neighborhood]
93 Alternatively, train each iterate
[Figure: at each step, compare the model's best permutation in the neighborhood of π(0) with the oracle permutation in that neighborhood, and update the weights; repeat along the search path]
- Or could do a k-best MIRA version of this, too, even using a loss measure based on lookahead to π(n)
94 Open Questions
- Search: Is there practical benefit to using larger neighborhoods (speed, quality of solution) for hill-climbing? For MCMC?
- Search: Are the large-scale versions worth the constant-factor runtime penalty? At some sizes?
- Learning: How should we learn the weights if we plan to use them in greedy search?
- Learning: Can we tune adaptive search methods that vary the neighborhood and the temperature dynamically from step to step?
- Theoretical: Can it be determined in polytime whether two permutations have a common neighbor (using the full colored-tree neighborhood)?
- Theoretical: Mixing time of MCMC with these neighborhoods?
- Algorithmic: Is there a master theorem for normal forms?
95 Summary of part II
- Local search is fun and easy
  - Popular elsewhere in AI
  - Closely related to MCMC sampling
  - Probably useful for translation
    - Maybe other NP-hard problems too
- Can efficiently use huge local neighborhoods
  - Algorithms are closely related to parsing and FSMs
  - Our community knows that stuff better than anyone!