Title: Shuffling Non-Constituents
1 Shuffling Non-Constituents
with David A. Smith and Roy Tromble
Syntactically-flavored reordering search methods
ACL SSST Workshop, June 2008
2 Starting point: Synchronous alignment
- Synchronous grammars are very pretty.
- But does parallel text actually have parallel structure?
  - Depends on what kind of parallel text
    - Free translations? Noisy translations?
    - Were the parsers trained on parallel annotation schemes?
  - Depends on what kind of parallel structure
    - What kinds of divergences can your synchronous grammar formalism capture?
    - E.g., wh-movement versus wh in situ
3 Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English.
beaucoup d'enfants donnent un baiser à Sam
kids kiss Sam quite often
4 Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.
[Figure: the two dependency trees with their node-to-node alignment]
beaucoup d'enfants donnent un baiser à Sam
kids kiss Sam quite often
5 Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange. A much worse alignment ...
[Figure: the two dependency trees with a much worse node-to-node alignment]
beaucoup d'enfants donnent un baiser à Sam
kids kiss Sam quite often
6 Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.
[Figure: the two dependency trees with their node-to-node alignment]
beaucoup d'enfants donnent un baiser à Sam
kids kiss Sam quite often
7 Grammar = Set of Elementary Trees
8 But many examples are harder
[Figure: aligned dependency trees for German "Auf diese Frage habe ich leider keine Antwort bekommen" (gloss: to this question have I alas no answer received) and English "I did not unfortunately receive an answer to this question", with a NULL node]
9 But many examples are harder
[Figure: the same aligned German-English dependency trees]
Displaced modifier (negation)
10 But many examples are harder
[Figure: the same aligned German-English dependency trees]
Displaced modifier (negation)
11 But many examples are harder
[Figure: the same aligned German-English dependency trees]
Displaced argument (here, because of the projective parser)
12 But many examples are harder
[Figure: the same aligned German-English dependency trees]
Head-swapping (here, due to different annotation conventions)
13 Free Translation
[Figure: aligned dependency trees for German "Tschernobyl könnte dann etwas später an die Reihe kommen" (gloss: Chernobyl could then something later on the queue come) and English "Then we could deal with Chernobyl some time later", with a NULL node]
14 Free Translation
[Figure: the same aligned German-English dependency trees]
Probably not systematic (but words are correctly aligned)
15 Free Translation
[Figure: the same aligned German-English dependency trees]
Erroneous parse
16 What to do?
- Current practice:
  - Don't try to model all systematic phenomena!
  - Just use non-syntactic alignments (Giza).
  - Only care about the fragments that recur often
    - Phrases or gappy phrases
    - Sometimes even syntactic constituents (can favor these, e.g., Marton & Resnik 2008)
  - Use these (gappy) phrases in a decoder
    - Phrase-based or hierarchical
17 What to do?
- Current practice:
  - Use non-syntactic alignments (Giza)
  - Keep frequent phrases for a decoder
- But could syntax give us better alignments?
  - Would have to be loose syntax
- Why do we want better alignments?
  - Throw away less of the parallel training data
  - Help learn a smarter, syntactic reordering model
    - Could help decoding: less reliance on the LM
  - Some applications care about full alignments
18 Quasi-synchronous grammar
- How do we handle loose syntax?
- Translation story:
  - Generate target English by a monolingual grammar
  - Any grammar formalism is okay
  - Pick a dependency grammar formalism for now
[Figure: English dependency tree for "I did not unfortunately receive an answer to this question", generated with factors such as P(I | did, PRP) and P(PRP | no previous left children of did); parsing is O(n^3)]
19 Quasi-synchronous grammar
- How do we handle loose syntax?
- Translation story:
  - Generate target English by a monolingual grammar
  - But probabilities are influenced by the source sentence
  - Each English node is aligned to some source node
  - Prefers to generate children aligned to nearby source nodes
[Figure: the same English dependency tree; parsing is O(n^3)]
20 QCFG Generative Story
[Figure: the observed German dependency tree (with a NULL node) and the generated English dependency tree, linked by an alignment; generation uses factors such as P(parent-child), P(breakage), P(I | did, PRP, ich), and P(PRP | no previous left children of did, habe); aligned parsing is O(m^2 n^3)]
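To make the generative story concrete, here is a minimal Python sketch (my own illustration with hypothetical inputs, not the actual model of Smith & Eisner): it scores an English dependency tree whose nodes are aligned to source nodes, combining a monolingual head-child factor with a factor for where the child's aligned source node sits relative to the parent's.

import math

def qg_log_score(en_words, en_head, align, de_head, p_child, p_align_config):
    """Log-score an English dependency tree under a quasi-synchronous story.

    en_words : list of English tokens (indices 0 .. n-1)
    en_head  : en_head[i] = index of i's parent, or None for the root
    align    : align[i] = index of the source (German) node that i is aligned to,
               or None (aligned to NULL)
    de_head  : de_head[j] = parent of source node j, or None for the source root
    p_child  : p_child[(head_word, child_word)] = monolingual generation prob
    p_align_config : prob of each source-side configuration (hypothetical labels)
    """
    total = 0.0
    for i, w in enumerate(en_words):
        h = en_head[i]
        if h is None:
            continue  # skip the root; a fuller model would also score it
        # Monolingual factor: how much does the English grammar like this attachment?
        total += math.log(p_child.get((en_words[h], w), 1e-6))
        # Source-side factor: where is the child's aligned node relative to
        # the parent's aligned node in the source tree?
        a_h, a_c = align[h], align[i]
        if a_c is None or a_h is None:
            config = 'NULL'
        elif de_head[a_c] == a_h:
            config = 'parent-child'        # the synchronous-grammar case
        elif de_head[a_h] == a_c:
            config = 'child-parent'        # head-swapping
        elif a_c == a_h:
            config = 'same'                # two English words from one source word
        else:
            config = 'none of the above'   # a "breakage"
        total += math.log(p_align_config.get(config, 1e-6))
    return total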
21 What's a nearby node?
- Given the parent's alignment, where might the child be aligned?
[Figure: possible configurations of the child's aligned source node relative to the parent's, from the synchronous-grammar case (parent-child) to "none of the above"]
22 Quasi-synchronous grammar
- How do we handle loose syntax?
- Translation story:
  - Generate target English by a monolingual grammar
  - But probabilities are influenced by the source sentence
- Useful analogies:
  - Generative grammar with latent word senses
  - MEMM
    - Generates a tag sequence (n-gram model),
    - but probabilities are influenced by the word sequence
23 Quasi-synchronous grammar
- How do we handle loose syntax?
- Translation story:
  - Generate target English by a monolingual grammar
  - But probabilities are influenced by the source sentence
- Useful analogies:
  - Generative grammar with latent word senses
  - MEMM
  - IBM Model 1
    - Source nodes can be freely reused or unused
    - Future work: enforce 1-to-1 to allow good decoding (NP-hard to do exactly)
24 Some results: Quasi-synchronous dependency grammar
- Alignment (D. Smith & Eisner 2006)
  - Quasi-synchronous much better than synchronous
  - Maybe also better than IBM Model 4
- Question answering (Wang et al. 2007)
  - Align question with potential answer
  - Mean average precision 43 → 48 → 60 (previous state of the art → QG → + lexical features)
- Bootstrapping a parser for a new language (D. Smith & Eisner 2007 & ongoing)
  - Learn how parsed parallel text influences target dependencies
    - Along with many other features! (cf. co-training)
  - Unsupervised: German 30 → 69, Spanish 26 → 65
25 Summary of part I
- Current practice:
  - Use non-syntactic alignments (Giza)
  - Some bits align nicely
  - Use the frequent bits in a decoder
- Suggestion: Let syntax influence alignments.
  - So far, loose syntax methods are like IBM Model 1.
  - NP-hard to enforce 1-to-1 in any interesting model.
- Rest of talk:
  - How to enforce 1-to-1 in interesting models?
  - Can we do something smarter than beam search?
26 Shuffling Non-Constituents
with David A. Smith and Roy Tromble
Syntactically-flavored reordering model
ACL SSST Workshop, June 2008
27 Motivation
- MT is really easy!
- Just use a finite-state transducer!
- Phrases, morphology, the works!
28 Permutation search in MT
[Figure: the tagged French words NNP Marie, NEG ne, PRP m', AUX a, NEG pas, VBN vu, shown in their initial (French) order and in the best order; from the best order, translation is an easy transduction]
29 Motivation
- MT is really easy!
  - Just use a finite-state transducer!
  - Phrases, morphology, the works!
  - Just have to fix that pesky word order.
Framing it this way lets us enforce 1-to-1 exactly at the permutation step. Deletion and fertility > 1 are still allowed in the subsequent transduction.
30 Often want to find an optimal permutation
- Machine translation: reorder French to French-prime (Brown et al. 1992), so it's easier to align or translate
- MT eval: how much do you need to rearrange MT output so it scores well under an LM derived from reference translations?
- Discourse generation, e.g., multi-doc summarization: order the output sentences (Lapata 2003) so they flow nicely
- Reconstruct temporal order of events after info extraction
- Learn rule ordering or constraint ranking for phonology?
- Multi-word anagrams that score well under an LM
31 Other applications (there are many ...)
- LOP (Linear Ordering Problem)
  - Maximum-weight acyclic subgraph (equivalent)
  - Graph drawing, task scheduling, archaeology, aggregating ranked ballots, ...
- TSP (Traveling Salesperson Problem)
  - Transportation scheduling (schoolbus, meals-on-wheels, service calls, ...)
  - Motion scheduling (drill head, space telescopes, ...)
  - Topology of a ring network
  - Genome assembly
32 Permutation search: The problem
initial order → best order according to some cost function
33 Traditional approach: Beam search
Approx. best path through a really big FSA: N! paths, one for each permutation, but only 2^N states
34 An alternative: Local search (hill climbing)
The SWAP neighborhood: from the current permutation, swap one adjacent pair.
1 2 3 4 5 6   cost=22  (current)
2 1 3 4 5 6   cost=26
1 3 2 4 5 6   cost=20
1 2 4 3 5 6   cost=19
1 2 3 5 4 6   cost=25
35 An alternative: Local search (hill-climbing)
The SWAP neighborhood
1 2 3 4 5 6   cost=22
1 2 4 3 5 6   cost=19  (the best swap)
36 An alternative: Local search (hill-climbing). Like the greedy decoder of Germann et al. 2001
The SWAP neighborhood
[Figure: current permutation 1 4 2 3 5 6, cost=22]
- Why are the costs always going down? Because we pick the best swap.
- How long does it take to pick the best swap? O(N) if you're careful.
- How many swaps might you need to reach the answer? O(N^2).
- What if you get stuck in a local min? Random restarts.
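For concreteness, a minimal Python sketch (my own illustration, not the decoder discussed above) of greedy hill-climbing over the SWAP neighborhood with random restarts; it calls a black-box cost function on each neighbor, whereas the O(N) figure above assumes the cost change of a single swap can be computed incrementally.

import random

def swap_hill_climb(perm, cost, restarts=10, rng=random):
    """Greedy local search over the SWAP neighborhood.

    perm : initial permutation (list of items)
    cost : function mapping a permutation (list) to a real-valued cost
    Returns the best permutation found over several random restarts.
    """
    def climb(p):
        p = list(p)
        current = cost(p)
        while True:
            # Evaluate all adjacent-transposition neighbors and keep the best.
            best_i, best_cost = None, current
            for i in range(len(p) - 1):
                p[i], p[i + 1] = p[i + 1], p[i]
                c = cost(p)
                p[i], p[i + 1] = p[i + 1], p[i]   # undo the swap
                if c < best_cost:
                    best_i, best_cost = i, c
            if best_i is None:                    # local minimum reached
                return p, current
            p[best_i], p[best_i + 1] = p[best_i + 1], p[best_i]
            current = best_cost

    best_p, best_c = climb(perm)
    for _ in range(restarts):                     # random restarts to escape local minima
        q = list(perm)
        rng.shuffle(q)
        p, c = climb(q)
        if c < best_c:
            best_p, best_c = p, c
    return best_p, best_c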
37 Larger neighborhood
1 2 3 4 5 6   cost=22  (current)
2 1 3 4 5 6   cost=26
1 3 2 4 5 6   cost=20
1 2 4 3 5 6   cost=19
1 2 3 5 4 6   cost=25
38 Larger neighborhood (well-known in the literature; reportedly works well)
The INSERT neighborhood: remove one element and reinsert it at another position.
[Figure: element 3 of "1 2 3 4 5 6" (cost=22) being reinserted further right]
- Fewer local minima? Yes: 3 can move past 4, and even past 5, in a single move.
- Graph diameter (max moves needed)? O(N) rather than O(N^2).
- How many neighbors? O(N^2) rather than O(N).
- How long to find the best neighbor? O(N^2) rather than O(N).
39 Even larger neighborhood
The BLOCK neighborhood: exchange two adjacent blocks.
[Figure: two adjacent blocks of "1 2 3 4 5 6" (cost=22) being exchanged]
- Fewer local minima? Yes: 2 can get past "4 5" without having to cross 3 or move 3 first.
- Graph diameter (max moves needed)? Still O(N).
- How many neighbors? O(N^3) rather than O(N) or O(N^2).
- How long to find the best neighbor? O(N^3) rather than O(N) or O(N^2).
40 Larger yet: via dynamic programming??
[Figure: "1 2 3 4 5 6" (cost=22) rearranged by many simultaneous block exchanges]
- Fewer local minima?
- Graph diameter (max moves needed)? Logarithmic.
- How many neighbors? Exponential.
- How long to find the best neighbor? Polynomial.
41 Unifying/generalizing the neighborhoods so far
[Figure: exchanging two adjacent blocks within the sequence]
Exchange two adjacent blocks, of max widths w and w':
- SWAP: w=1, w'=1
- INSERT: w=1, w'=N
- BLOCK: w=N, w'=N
Runtime / number of neighbors: O(w w' N)
Everything in this talk can be generalized to other values of w, w'.
42 Very large-scale neighborhoods
- What if we consider multiple simultaneous exchanges that are independent?
- The DYNASEARCH neighborhood (Potts & van de Velde 1995; Congram 2000)
[Figure: the swap lattice over positions 1-6]
Lowest-cost neighbor is lowest-cost path
43 Very large-scale neighborhoods
Lowest-cost neighbor is lowest-cost path
- Why would this be a good idea?
  - Help get out of bad local minima? No, they're still local minima.
  - Help avoid getting into bad local minima? Yes, less greedy.
[Figure: the swap lattice with example edge costs]
44 Very large-scale neighborhoods
Lowest-cost neighbor is lowest-cost path
- Why would this be a good idea?
  - Help get out of bad local minima? No, they're still local minima.
  - Help avoid getting into bad local minima? Yes, less greedy.
  - More efficient? Yes! A shortest-path algorithm finds the best set of swaps in O(N) time, as fast as the best single swap. Up to N moves as fast as 1 move: no penalty for parallelism! It globally optimizes over exponentially many neighbors (paths).
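A minimal sketch (my own illustration) of the DYNASEARCH idea, assuming LOP-style pair costs so that non-overlapping adjacent swaps contribute independently: a Viterbi-style pass over positions picks the best set of simultaneous swaps in O(N).

def best_dynasearch_move(perm, pair_cost):
    """Find the best set of non-overlapping adjacent swaps in one pass.

    perm      : current permutation (list of items)
    pair_cost : pair_cost[a][b] = cost incurred when item a precedes item b
                (LOP-style costs, so non-overlapping swaps are independent)
    Returns (new_perm, total_gain), where total_gain <= 0.
    """
    n = len(perm)
    # gain[i] = change in total cost if we swap positions i and i+1
    gain = [pair_cost[perm[i + 1]][perm[i]] - pair_cost[perm[i]][perm[i + 1]]
            for i in range(n - 1)]

    # Viterbi-style DP over positions: best[i] = best achievable gain using
    # only swaps among positions 0..i-1; choice[i] records whether we swap (i-2, i-1).
    best = [0.0] * (n + 1)
    choice = [False] * (n + 1)
    for i in range(2, n + 1):
        keep = best[i - 1]
        swap = best[i - 2] + gain[i - 2]
        best[i], choice[i] = (swap, True) if swap < keep else (keep, False)

    # Backtrack to apply the chosen swaps.
    new = list(perm)
    i = n
    while i >= 2:
        if choice[i]:
            new[i - 2], new[i - 1] = new[i - 1], new[i - 2]
            i -= 2
        else:
            i -= 1
    return new, best[n]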
45 Can we extend this idea (up to N moves in parallel, by dynamic programming) to neighborhoods beyond SWAP?
[Figure: exchanging two adjacent blocks within the sequence]
Exchange two adjacent blocks, of max widths w and w':
- SWAP: w=1, w'=1
- INSERT: w=1, w'=N
- BLOCK: w=N, w'=N
Runtime / number of neighbors: O(w w' N)
46 Let's define each neighbor by a colored tree. Just like ITG!
[Figure: an ITG-style binary tree over the permutation 1 4 2 3 5 6, with some nodes colored to mark block exchanges]
47 Let's define each neighbor by a colored tree. Just like ITG!
48 Let's define each neighbor by a colored tree. Just like ITG!
[Figure: a colored tree; each colored node exchanges its two children's blocks]
This is like the BLOCK neighborhood, but with multiple block exchanges, which may be nested.
49 If that was the optimal neighbor ...
now look for its optimal neighbor
new tree!
[Figure: a colored tree over the new permutation 1 4 5 6 2 3]
50 If that was the optimal neighbor ...
now look for its optimal neighbor
new tree!
[Figure: another colored tree]
51 If that was the optimal neighbor ...
now look for its optimal neighbor; repeat till you reach a local optimum
Each tree defines a neighbor. At each step, optimize over all possible trees by dynamic programming (CKY parsing).
[Figure: a colored tree over the permutation 5 6 1 4 2 3]
Use your favorite parsing speedups (pruning, best-first, ...)
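A minimal sketch (my own illustration, not the authors' implementation) of one step of this search under LOP-style pairwise costs: a CKY-style dynamic program over the current order decides, for every span, whether to keep or swap its two child blocks, and returns the best neighboring permutation.

def best_itg_neighbor(perm, pair_cost):
    """One step of very-large-scale local search over ITG-style colored trees.

    perm      : current permutation (list of items)
    pair_cost : pair_cost[a][b] = cost of item a preceding item b (LOP costs)
    Returns (new_perm, its_total_LOP_cost).

    Naive O(N^5); the bookkeeping trick of slide 73 brings this down to O(N^3).
    """
    n = len(perm)
    # best[i][j] = (cost of pairs internal to span [i, j), best ordering of that span)
    best = [[None] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        best[i][i + 1] = (0.0, [perm[i]])

    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            best_cost, best_order = None, None
            for k in range(i + 1, j):
                lcost, lorder = best[i][k]
                rcost, rorder = best[k][j]
                # Cross-span cost depends only on which block ends up first.
                keep = sum(pair_cost[a][b] for a in lorder for b in rorder)
                swap = sum(pair_cost[b][a] for a in lorder for b in rorder)
                if keep <= swap:
                    cost, order = lcost + rcost + keep, lorder + rorder
                else:  # "colored" node: exchange the two child blocks
                    cost, order = lcost + rcost + swap, rorder + lorder
                if best_cost is None or cost < best_cost:
                    best_cost, best_order = cost, order
            best[i][j] = (best_cost, best_order)

    return best[0][n][1], best[0][n][0]

Iterating this until the cost stops improving gives the local search loop sketched on these slides.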
52 Very-large-scale versions of SWAP, INSERT, and BLOCK: all by the algorithm we just saw
[Figure: exchanging two adjacent blocks within the sequence]
Exchange two adjacent blocks, of max widths w and w'.
The runtime of the algorithm we just saw was O(N^3) because we considered O(N^3) distinct (i,j,k) triples. More generally, restrict to only the O(w w' N) triples of interest to define a smaller neighborhood with runtime O(w w' N). (Yes, the dynamic programming recurrences go through.)
53 How many steps to get from here to there?
initial order: 8 4 6 2 5 3 7 1
best order:    1 4 5 2 3 6 7 8
One twisted-tree step? No. As you probably know, 3 1 4 2 → 1 2 3 4 is impossible.
54 Can you get to the answer in one step? German-English, Giza alignment
Not always (yay, local search), but often (yay, big neighborhood)
55 How many steps to the answer in the worst case? (What is the diameter of the search space?)
from: 8 4 6 2 5 3 7 1
to:   1 4 5 2 3 6 7 8
Claim: only log2(N) steps at worst (if you know where to step). Let's sketch the proof!
56 Quicksort anything into, e.g., 1 2 3 4 5 6 7 8
[Figure: a right-branching tree over the scrambled order 6 8 4 2 5 3 7 1; one step partitions the elements into ≤ 4 and ≥ 5]
57 Quicksort anything into, e.g., 1 2 3 4 5 6 7 8
Only log2(N) steps to get to 1 2 3 4 5 6 7 8, or to anywhere!
[Figure: successive quicksort-style partitions around pivots 4/5, then 2/3 and 6/7]
58 Defining "best order": What class of cost functions can we handle efficiently? How fast can we compute a subtree's cost from its child subtrees?
initial order → best order according to some cost function
59 Defining "best order": What class of cost functions?
A =
 [  0  15  22  80   5  -7 ]
 [-30   0 -76  24  63 -44 ]
 [ 15  28   0 -15  71 -99 ]
 [ 12   8 -31   0  54  -6 ]
 [  7  -9  41  24   0  82 ]
 [  6   5 -22   8  93   0 ]
Traveling Salesperson Problem (TSP): a[i][j] = cost of j immediately following i. The cost of an order is the sum of its adjacency costs, e.g., a25 + a56 + a63 + a42 + a14 + a31.
best order according to some cost function
60 Defining "best order": What class of cost functions?
B =
 [  0   5 -22  93   8   6 ]
 [ 12   0   8 -31  -6  54 ]
 [ -7  41   0  -9  24  82 ]
 [ 88  17  -6   0  12 -60 ]
 [ 11 -17  10 -59   0  23 ]
 [  5   4 -12   6  55   0 ]
Linear Ordering Problem (LOP): b26 = cost of 2 preceding 6 (anywhere before it, not just immediately). Add up the n(n-1)/2 such costs; any order will incur either b26 or b62.
best order according to some cost function
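For concreteness, a small Python sketch (illustration only; items are 0-indexed, and the TSP version scores a path rather than a closed tour) of the two cost functions applied to a permutation, given matrices A and B as above.

def tsp_cost(perm, A):
    """Sum of adjacency costs A[i][j] over each consecutive pair (i, j) in the order."""
    return sum(A[perm[k]][perm[k + 1]] for k in range(len(perm) - 1))

def lop_cost(perm, B):
    """Sum of B[i][j] over all pairs where i precedes j (anywhere) in the order."""
    return sum(B[perm[k]][perm[m]]
               for k in range(len(perm))
               for m in range(k + 1, len(perm)))

# Example usage with 6 items: perm = [0, 1, 2, 3, 4, 5]
# tsp_cost(perm, A); lop_cost(perm, B)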
61 Defining "best order": What class of cost functions?
- TSP and LOP are both NP-complete
  - In fact, believed to be inapproximable
  - Hard even to achieve C times the optimal cost (for any constant C > 1)
- Practical approaches:
  - Correct answer, typically fast: branch-and-bound, ILP, ...
  - Fast answer, typically close to correct: beam search, this talk, ...
62 Moving small blocks helps on LOP (experiment on LOLIB, a collection of 250-word problems from economics)
63 Defining "best order": What class of cost functions?
initial order → cost of this order
- Does my favorite WFSA like it as a string?
- Non-local pair order OK?
- Non-local triple order OK?
- Can add these all up
64 Costs are derived from source sentence features
initial order (French): NNP Marie, NEG ne, PRP m', AUX a, NEG pas, VBN vu
[Figure: the LOP cost matrix B and the TSP cost matrix A, whose entries are computed from features of the French sentence]
65 Costs are derived from source sentence features
initial order (French): NNP Marie, NEG ne, PRP m', AUX a, NEG pas, VBN vu
[Figure: the matrices B and A again, with one entry highlighted]
Can also include phrase boundary symbols in the input!
66 Costs are derived from source sentence features
initial order (French): NNP Marie, NEG ne, PRP m', AUX a, NEG pas, VBN vu
FSA costs: distortion model, language model (looks ahead to the next step: will this reordering allow a good finite-state translation into good English?)
[Figure: the matrices B and A again]
67 Dynamic program must pick the tree that leads to the lowest-cost permutation
initial order → cost of this order
- Does my favorite WFSA like it as a string?
68 Scoring with a weighted FSA
This particular WFSA implements TSP scoring for N=3: after you read 1, you're in state 1; after you read 2, you're in state 2; after you read 3, you're in state 3; and this state determines the cost of the next symbol you read.
- We'll handle a WFSA with Q states by using a fancier grammar, with nonterminals. (Now the runtime goes up to O(N^3 Q^3).)
69 Including WFSA costs via nonterminals
A possible preterminal for word 2 is an arc in A that's labeled with 2. The preterminal 4→2 rewrites as word 2, with a cost equal to the arc's cost.
[Figure: the words 1-6 with their possible preterminal arcs]
70 Including WFSA costs via nonterminals
71 Dynamic program must pick the tree that leads to the lowest-cost permutation
initial order → cost of this order
- Does my favorite WFSA like it as a string?
- Non-local pair order OK?
72 Incorporating the pairwise ordering costs
This move puts 5, 6, 7 before 1, 2, 3, 4. So this hypothesis must add the costs of 5 < 1, 5 < 2, 5 < 3, 5 < 4, 6 < 1, 6 < 2, 6 < 3, 6 < 4, 7 < 1, 7 < 2, 7 < 3, 7 < 4. Uh-oh! So now it takes O(N^2) time to combine two subtrees, instead of O(1) time? Nope: dynamic programming to the rescue again!
73 Computing the LOP cost of a block move
This move puts 5, 6, 7 before 1, 2, 3, 4. So we have to add O(N^2) costs just to consider this single neighbor! Instead, reuse the work from other, narrower block moves: the new cost is computed in O(1)!
[Figure: the cross-block cost of moving (5 6 7) before (1 2 3 4) computed by inclusion-exclusion from the costs of narrower block moves]
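A sketch (my own illustration) of the reuse trick for LOP costs: the cross-block cost of any block exchange is filled in O(1) from three slightly narrower exchanges by inclusion-exclusion, so the whole O(N^3) table costs O(1) amortized per neighbor.

def cross_block_costs(perm, B):
    """Precompute C[i][k][j] = total LOP cost incurred by putting block
    perm[k:j] before block perm[i:k] (the "swapped" order), for all i <= k <= j.
    Each entry is filled in O(1) from narrower blocks by inclusion-exclusion."""
    n = len(perm)
    C = [[[0.0] * (n + 1) for _ in range(n + 1)] for _ in range(n + 1)]
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                # Add the one new pair (perm[i], perm[j-1]); the overlap of the
                # two narrower tables is subtracted once.
                C[i][k][j] = (C[i + 1][k][j] + C[i][k][j - 1]
                              - C[i + 1][k][j - 1]
                              + B[perm[j - 1]][perm[i]])
    return C

These C values are exactly the "swap" sums of the CKY sketch above (the keep-order sums are analogous), which is what reduces that step from O(N^5) to O(N^3).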
74 Incorporating 3-way ordering costs
- See the initial paper (Eisner & Tromble 2006)
- A little tricky, but
  - comes for free if you're willing to accept a certain restriction on these costs
  - more expensive without that restriction, but possible
75 Another option: Markov chain Monte Carlo
- Random walk in the space of permutations
  - Interpret a permutation's cost as a log-probability
  - Sample a permutation from the neighborhood instead of always picking the most probable
- Why?
  - Simulated annealing might beat greedy-with-random-restarts
  - When learning the parameters of the distribution, can use sampling to compute the feature expectations
76 Another option: Markov chain Monte Carlo
- Random walk in the space of permutations
  - Interpret a permutation's cost as a log-probability
  - Sample a permutation from the neighborhood instead of always picking the most probable
- How?
  - Pitfall: sampling a tree is not the same as sampling a permutation
  - Spurious ambiguity: some permutations have many trees
  - Solution: exclude some trees, leaving 1 per permutation
    - A normal form has long been known for colored trees
    - For restricted colored trees (which limit the size of blocks to swap), we have devised a more complicated normal form
77 Sampling from permutation space: p(π) = exp(−cost(π)) / Z. Why is this useful?
- To train the weights that determine the cost matrix (as we saw earlier)
  - And to compute expectations of other quantities (e.g., how often does 2 precede 5?)
- A less greedy heuristic for finding the lowest-cost permutation
  - This is the mode of p, i.e., the highest-probability permutation.
  - Take a sample from p. If most of p's probability mass is on the mode, you have a good chance of getting the mode.
  - If not, boost the odds: sample instead from p^β, for β > 1
    - defined as p_β(π) = exp(−β·cost(π)) / Z_β (so p_2(π) is proportional to p(π)^2)
  - As β → ∞, the chance of getting the mode → 1
  - But as β → ∞, MCMC sampling gets slower and slower (no free lunch!)
  - → simulated annealing: gradually increase β during MCMC sampling
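A minimal sketch (my own illustration, using simple adjacent-swap proposals rather than the tree-shaped neighborhoods above) of Metropolis-Hastings sampling from p_β(π) proportional to exp(−β·cost(π)), with β increased over time as in simulated annealing.

import math
import random

def anneal(perm, cost, steps=10000, beta0=0.1, beta1=5.0, rng=random):
    """Metropolis-Hastings over permutations with adjacent-swap proposals.
    beta is raised from beta0 to beta1, so early steps explore and late
    steps concentrate on the mode (simulated annealing)."""
    p = list(perm)
    c = cost(p)
    best, best_c = list(p), c
    for t in range(steps):
        beta = beta0 + (beta1 - beta0) * t / max(1, steps - 1)
        i = rng.randrange(len(p) - 1)
        p[i], p[i + 1] = p[i + 1], p[i]          # propose an adjacent swap
        c_new = cost(p)
        # Symmetric proposal, so accept with prob min(1, exp(-beta * (c_new - c))).
        if c_new <= c or rng.random() < math.exp(-beta * (c_new - c)):
            c = c_new
            if c < best_c:
                best, best_c = list(p), c
        else:
            p[i], p[i + 1] = p[i + 1], p[i]      # reject: undo the swap
    return best, best_c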
78 Learning the costs
- Where do these costs come from?
- If we have some examples on which we know the true permutation, we could try to learn them
[Figure: the cost matrices B and A again]
79 Learning the costs
- Where do these costs come from?
- If we have some examples on which we know the true permutation, we could try to learn them
- More precisely, try to learn these weights θ (the knowledge that's reused across examples):
  50: a verb (e.g., vu) shouldn't precede its subject (e.g., Marie)
  27: words at a distance of 5 shouldn't swap order
  -2: words with PRP between them ought to swap
[Figure: the cost matrices B and A again]
80 Learning the costs
- Typical learning approach (details omitted): tune the weights θ to maximize the probability of the correct answer π*
  (actually, the log probability: a convex optimization with the same answer)
  - Probability??? We were just trying to minimize the cost.
  - But there's a standard way to convert costs to probabilities:
    - For every permutation π, define p(π) = exp(−cost(π)) / Z
    - where the partition function Z = Σ_π exp(−cost(π)), so Σ_π p(π) = 1
  - Search is now argmax_π p(π)
  - Learning is now argmax_θ log p(π*): increase log p(π*) by gradient ascent
  50: a verb (e.g., vu) shouldn't precede its subject (e.g., Marie)
  27: words at a distance of 5 shouldn't swap order
  -2: words with PRP between them ought to swap
81 Learning the costs
- Typical learning approach (details omitted): tune the weights θ to maximize the probability of the correct answer π*
  (actually, the log probability: a convex optimization with the same answer)
- For every permutation π, define p(π) = exp(−cost(π)) / Z
  - where the partition function Z = Σ_π exp(−cost(π)), so Σ_π p(π) = 1
- Search is now argmax_π p(π)
- Learning: increase log p(π*) by gradient ascent
  - Find the gradient of log p(π*) with respect to the weights θ we're trying to learn
  - Easy, since cost(π*) is typically just a sum of many weights
  - Slow: a sum over all permutations!
82 Learning the costs
- Typical learning approach (details omitted): tune the weights θ to maximize the probability of the correct answer π*
  (actually, the log probability: a convex optimization with the same answer)
- For every permutation π, define p(π) = exp(−cost(π)) / Z
  - where the partition function Z = Σ_π exp(−cost(π)), so Σ_π p(π) = 1
- Search is now argmax_π p(π)
- Learning: increase log p(π*) by gradient ascent
What is this gradient anyway?
∇ log p(π*) = ∇(−cost(π*) − log Z) = −∇cost(π*) − ∇Z/Z
Z = Σ_π exp(−cost(π)), so ∇Z = Σ_π exp(−cost(π)) · (−∇cost(π)),
hence −∇Z/Z = Σ_π p(π) ∇cost(π) = E_p[∇cost(π)]
aha! estimate this expectation by sampling from p (more about this later)
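A toy sketch (illustration only) of the resulting gradient estimate, assuming cost(π) = θ · f(π) for a sparse feature vector f and a set of MCMC samples drawn from p.

def log_linear_gradient(pi_star, samples, features):
    """Gradient of log p(pi*) when cost(pi) = dot(theta, features(pi)):
    d/dtheta log p(pi*) = -features(pi*) + E_p[features(pi)],
    with the expectation estimated from samples drawn from p."""
    grad = {k: -v for k, v in features(pi_star).items()}
    for pi in samples:
        for k, v in features(pi).items():
            grad[k] = grad.get(k, 0.0) + v / len(samples)
    return grad

A gradient-ascent step is then theta[k] += learning_rate * grad[k] for each feature k.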
83 Experimenting with training LOP params (LOP is quite fast: O(n^3) with no grammar constant)
PDS VMFIN PPER ADV APPR ART NN PTKNEG VVINF .
Das kann ich so aus dem Stand nicht sagen .
B[7,9] (e.g., the cost of word 7, "Stand", preceding word 9, "sagen")
84 LOP feature templates
85 LOP feature templates
- Only LOP features so far
- And they're unnecessarily simple (don't examine syntactic constituency)
- And the input sequence is only words (not interspersed with syntactic brackets)
86 Learning LOP Costs for MT
(interesting, if odd, to try to reorder with only the LOP costs)
[Figure: pipeline from German through German' to English, with the MOSES baseline]
- Define German' to be German in English word order
- To get German' for training data, use Giza to align all German positions to English positions (disallow NULL)
87 Learning LOP Costs for MT
(interesting, if odd, to try to reorder with only the LOP costs)
[Figure: pipeline from German through German' to English, with the MOSES baseline]
- Easy first try: Naïve Bayes
  - Treat each feature in θ as independent
  - Count and normalize over the training data
  - No real improvement over baseline
88 Learning LOP Costs for MT
(interesting, if odd, to try to reorder with only the LOP costs)
[Figure: pipeline from German through German' to English, with the MOSES baseline]
- Easy second try: Perceptron
[Figure: hill-climb from the initial permutation to a local optimum, then update the weights toward the gold standard]
Note: search error can be beneficial, e.g., just take 1 step from the identity permutation
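A sketch (my own illustration) of the structured-perceptron update implied here, for a cost of the form cost(π) = θ · f(π): lower the cost of the gold permutation and raise the cost of the model's (possibly search-errorful) prediction.

def perceptron_update(theta, gold, predicted, features, rate=1.0):
    """Structured perceptron step when cost(pi) = dot(theta, features(pi)):
    decrease the gold permutation's cost and increase the prediction's."""
    for k, v in features(gold).items():
        theta[k] = theta.get(k, 0.0) - rate * v
    for k, v in features(predicted).items():
        theta[k] = theta.get(k, 0.0) + rate * v
    return theta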
89 Search error vs. model error
[Plot omitted; warning: different data]
90 Benefit from reordering

Learning method               | BLEU vs. German' | BLEU vs. English
No reordering                 | 49.65            | 25.55
Naïve Bayes, POS              | 49.21            |
Naïve Bayes, POS + lexical    | 49.75            |
Perceptron, POS               | 50.05            | 25.92
Perceptron, POS + lexical     | 51.30            | 26.34

Obviously, not yet unscrambling German: need more features
91 Alternatively, work back from the gold standard
- Contrastive estimation (Smith & Eisner 2005)
  - Maximize the probability of the desired permutation relative to its ITG neighborhood
  - Requires summing over all permutations in a neighborhood
    - Must use normal-form trees here
  - Stochastic gradient descent
[Figure: the gold standard and its neighborhood]
92 Alternatively, work back from the gold standard
- k-best MIRA in the neighborhood
  - Make the gold standard beat its local competitors
  - Beat the bad ones by a bigger margin
    - "Good" = close to gold in swap distance?
    - "Good" = close to gold using BLEU?
    - "Good" = translates into English that's close to the reference?
[Figure: the gold standard and its neighborhood]
93 Alternatively, train each iterate
[Figure: at each step, compare the model's best permutation in the neighborhood of π(0) with the oracle permutation in that neighborhood, and update the weights; repeat along the search path]
- Or could do a k-best MIRA version of this, too, even using a loss measure based on lookahead to π(n)
94 Open Questions
- Search: Is there practical benefit to using larger neighborhoods (speed, quality of solution) for hill-climbing? For MCMC?
- Search: Are the large-scale versions worth the constant-factor runtime penalty? At some sizes?
- Learning: How should we learn the weights if we plan to use them in greedy search?
- Learning: Can we tune adaptive search methods that vary the neighborhood and the temperature dynamically from step to step?
- Theoretical: Can it be determined in polytime whether two permutations have a common neighbor (using the full colored-tree neighborhood)?
- Theoretical: Mixing time of MCMC with these neighborhoods?
- Algorithmic: Is there a master theorem for normal forms?
95 Summary of part II
- Local search is fun and easy
  - Popular elsewhere in AI
  - Closely related to MCMC sampling
  - Probably useful for translation
    - Maybe other NP-hard problems too
- Can efficiently use huge local neighborhoods
  - Algorithms are closely related to parsing and FSMs
  - Our community knows that stuff better than anyone!