Title: Declarative Specification of NLP Systems
1. Declarative Specification of NLP Systems
Student co-authors on various parts of this work: Eric Goldlust, Noah A. Smith, John Blatz, Roy Tromble
IBM, May 2006
2. An Anecdote from ACL'05
- Michael Jordan
3. An Anecdote from ACL'05
- Michael Jordan
4. Conclusions to draw from that talk
- Mike and his students are great.
- Graphical models are great (because they're flexible).
- Gibbs sampling is great (because it works with nearly any graphical model).
- Matlab is great (because it frees up Mike and his students to doodle all day and then execute their doodles).
5. Could NLP be this nice?
- Mike and his students are great.
- Graphical models are great (because they're flexible).
- Gibbs sampling is great (because it works with nearly any graphical model).
- Matlab is great (because it frees up Mike and his students to doodle all day and then execute their doodles).
6. Could NLP be this nice?
- Parts of it already are:
- Language modeling
- Binary classification (e.g., SVMs)
- Finite-state transductions
- Linear-chain graphical models
Toolkits available: you don't have to be an expert.
But other parts aren't: context-free and beyond; machine translation.
Efficient parsers and MT systems are complicated and painful to write.
7. Could NLP be this nice?
- This talk: a toolkit that's general enough for these cases.
- (stretches from finite-state to Turing machines)
- Dyna
But other parts aren't: context-free and beyond; machine translation.
Efficient parsers and MT systems are complicated and painful to write.
8. Warning
- Lots more beyond this talk
- see the EMNLP'05 and FG'06 papers
- see http://dyna.org
- (download, documentation)
- sign up for updates by email
- wait for the totally revamped next version
9. The case for little languages
- declarative programming
- small is beautiful
10. Sapir-Whorf hypothesis
- Language shapes thought
- At least, it shapes conversation
- Computer language shapes thought
- At least, it shapes experimental research
- Lots of cute ideas that we never pursue
- Or if we do pursue them, it takes 6-12 months to implement on large-scale data
- Have we turned into a lab science?
11. Declarative Specifications
- State what is to be done
- (How should the computer do it? Turn that over to a general solver that handles the specification language.)
- Hundreds of domain-specific little languages out there. Some have sophisticated solvers.
12. dot (www.graphviz.org)
digraph g {
  graph [rankdir = "LR"];
  node [fontsize = "16", shape = "ellipse"];
  edge [];
  "node0" [label = "<f0> 0x10ba8 | <f1>", shape = "record"];
  "node1" [label = "<f0> 0xf7fc4380 | <f1> | <f2> -1", shape = "record"];
  "node0":f0 -> "node1":f0 [id = 0];
  "node0":f1 -> "node2":f0 [id = 1];
  "node1":f0 -> "node3":f0 [id = 2];
}
You just declare the nodes and edges. What's the hard part? Making a nice layout! Actually, it's NP-hard.
13. dot (www.graphviz.org)
14. LilyPond (www.lilypond.org)
15. LilyPond (www.lilypond.org)
16. Declarative Specs in NLP
- Regular expression (for a FST toolkit)
- Grammar (for a parser)
- Feature set (for a maxent distribution, SVM, etc.)
- Graphical model (DBNs for ASR, IE, etc.)
Claim of this talk: Sometimes it's best to peek under the shiny surface. Declarative methods are still great, but should be layered; we need them one level lower, too.
17. Declarative Specs in NLP
- Regular expression (for a FST toolkit)
- Grammar (for a parser)
- Feature set (for a maxent distribution, SVM, etc.)
18. Declarative Specification of Algorithms
19. How you build a system (big picture slide)
cool model
practical equations (e.g., for a PCFG)
pseudocode (execution order): for width from 2 to n: for i from 0 to n-width: k = i+width; for j from i+1 to k-1: ...
tuned C++ implementation (data structures, etc.)
20. Wait a minute ...
Didn't I just implement something like this last month?
chart management / indexing; cache-conscious data structures; prioritization of partial solutions (best-first, A*); parameter management; inside-outside formulas; different algorithms for training and decoding; conjugate gradient, annealing, ...; parallelization?
I thought computers were supposed to automate drudgery.
21. How you build a system (big picture slide)
cool model
- Dyna language specifies these equations.
- Most programs just need to compute some values from other values. Any order is OK.
- Some programs also need to update the outputs if the inputs change:
- spreadsheets, makefiles, email readers
- dynamic graph algorithms
- EM and other iterative optimization
- leave-one-out training of smoothing params
practical equations (e.g., for a PCFG)
pseudocode (execution order): for width from 2 to n: for i from 0 to n-width: k = i+width; for j from i+1 to k-1: ...
tuned C++ implementation (data structures, etc.)
22. How you build a system (big picture slide)
cool model
practical equations (e.g., for a PCFG)
Compilation strategies (we'll come back to this)
pseudocode (execution order): for width from 2 to n: for i from 0 to n-width: k = i+width; for j from i+1 to k-1: ...
tuned C++ implementation (data structures, etc.)
23. Writing equations in Dyna
- int a.
- a = b * c.
- a will be kept up to date if b or c changes.
- b += x.  b += y.   (equivalent to b = x+y.)
- b is a sum of two variables. Also kept up to date.
- c += z(1).  c += z(2).  c += z(3).
- c += z(four).  c += z(foo(bar,5)).
- c += z(N).
- c is a sum of all nonzero z(...) values. At compile time, we don't know how many!
24. More interesting use of patterns
- a = b * c.   (scalar multiplication)
- a(I) = b(I) * c(I).   (pointwise multiplication)
- a += b(I) * c(I).   means a = sum over I of b(I)*c(I)   (dot product; could be sparse)
- a(I,K) += b(I,J) * c(J,K).   means a(I,K) = sum over J of b(I,J)*c(J,K)   (matrix multiplication; could be sparse)
- J is free on the right-hand side, so we sum over it
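To make that last rule concrete, here is a minimal Python sketch (not Dyna, with made-up toy values) of what a(I,K) += b(I,J) * c(J,K) computes: the shared variable J must match across the two factors and, being absent from the head a(I,K), is summed over, i.e., a sparse matrix multiplication.

    # Sparse "matrix multiplication" read directly off the Dyna rule.
    from collections import defaultdict

    b = {(0, "x"): 2.0, (0, "y"): 3.0, (1, "y"): 4.0}   # b(I,J), sparse
    c = {("x", 7): 5.0, ("y", 7): 1.0}                   # c(J,K), sparse

    a = defaultdict(float)                               # a(I,K), default 0
    for (i, j1), bv in b.items():
        for (j2, k), cv in c.items():
            if j1 == j2:                                 # J is shared, so it must unify
                a[(i, k)] += bv * cv                     # and gets summed over

    print(dict(a))   # {(0, 7): 13.0, (1, 7): 4.0}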
25. Dyna vs. Prolog
- By now you may see what we're up to!
- Prolog has Horn clauses:
- a(I,K) :- b(I,J), c(J,K).
- Dyna has Horn equations:
- a(I,K) += b(I,J) * c(J,K).
Like Prolog: allows nested terms; syntactic sugar for lists, etc.; Turing-complete.
Unlike Prolog: charts, not backtracking! Compiles to efficient C++ classes. Integrates with your C++ code.
26. The CKY inside algorithm in Dyna
:- double item = 0.
:- bool length = false.
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(s,0,N) if length(N).

using namespace cky;
chart c;
c[rewrite(s,np,vp)] = 0.7;
c[word(Pierre,0,1)] = 1;
c[length(30)] = true;   // 30-word sentence
cin >> c;               // get more axioms from stdin
cout << c[goal];        // print total weight of all parses
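For readers who want the computation spelled out, here is a rough Python rendering (mine, not produced by Dyna) of the inside computation that the program above denotes, on a hypothetical two-word sentence with toy rule weights.

    from collections import defaultdict

    sentence = ["Pierre", "talks"]
    unary  = {("np", "Pierre"): 1.0, ("vp", "talks"): 0.5}   # rewrite(X,W)
    binary = {("s", "np", "vp"): 0.7}                        # rewrite(X,Y,Z)

    n = len(sentence)
    constit = defaultdict(float)                             # constit(X,I,J)

    # constit(X,I,J) += word(W,I,J) * rewrite(X,W).
    for i, w in enumerate(sentence):
        for (x, word), p in unary.items():
            if word == w:
                constit[(x, i, i + 1)] += p

    # constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
    for width in range(2, n + 1):            # build wider spans from narrower ones
        for i in range(0, n - width + 1):
            j = i + width
            for mid in range(i + 1, j):
                for (x, y, z), p in binary.items():
                    constit[(x, i, j)] += constit[(y, i, mid)] * constit[(z, mid, j)] * p

    goal = constit[("s", 0, n)]              # goal += constit(s,0,N) if length(N).
    print(goal)                              # 0.35 = 0.7 * 1.0 * 0.5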
27. Visual debugger: browse the proof forest
(shows ambiguity and shared substructure)
28. Related algorithms in Dyna?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(s,0,N) if length(N).
- Viterbi parsing?
- Logarithmic domain?
- Lattice parsing?
- Incremental (left-to-right) parsing?
- Log-linear parsing?
- Lexicalized or synchronous parsing?
- Earley's algorithm?
- Binarized CKY?
29. Related algorithms in Dyna?
(Same Dyna program and checklist as on slide 28.)
Annotation over the three rules: max, max, max, i.e., for Viterbi parsing each += becomes max=.
30. Related algorithms in Dyna?
(Same Dyna program and checklist as on slide 28.)
Annotations over the three rules: max, max, max and log, log, log, i.e., switch the aggregation and values to the max / log domain.
31. Related algorithms in Dyna?
(Same Dyna program and checklist as on slide 28.)
For lattice parsing: instead of axioms like c[word(Pierre,0,1)] = 1, the word items come from a weighted lattice, e.g., arcs Pierre/0.2, P/0.5, air/0.3 between states 5, 8, and 9, so positions are lattice states such as state(5) and state(9).
32. Related algorithms in Dyna?
(Same Dyna program and checklist as on slide 28.)
Just add words one at a time to the chart; check at any time what can be derived from the words so far. Similarly, dynamic grammars.
33. Related algorithms in Dyna?
(Same Dyna program and checklist as on slide 28.)
Again, no change to the Dyna program.
34. Related algorithms in Dyna?
(Same Dyna program and checklist as on slide 28.)
Basically, just add extra arguments to the terms above.
35. Related algorithms in Dyna?
(Same Dyna program and checklist as on slide 28.)
36. Earley's algorithm in Dyna
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(s,0,N) if length(N).
magic templates transformation (as noted by Minnen 1996)
37. Program transformations
cool model
Blatz & Eisner (FG 2006): Lots of equivalent ways to write a system of equations! Transforming from one to another may improve efficiency. Many parsing tricks can be generalized into automatic transformations that help other programs, too!
practical equations (e.g., for a PCFG)
pseudocode (execution order): for width from 2 to n: for i from 0 to n-width: k = i+width; for j from i+1 to k-1: ...
tuned C++ implementation (data structures, etc.)
38. Related algorithms in Dyna?
(Same Dyna program and checklist as on slide 28.)
39. Rule binarization
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
[Figure: the rule combines a Y spanning I..Mid with a Z spanning Mid..J to build an X spanning I..J.]
40. Rule binarization
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
(cf. graphical models, constraint programming, multi-way database join)
41. More program transformations
- Examples that add new semantics:
- Compute gradient (e.g., derive outside algorithm from inside)
- Compute upper bounds for A* (e.g., Klein & Manning ACL'03)
- Coarse-to-fine (e.g., Johnson & Charniak NAACL'06)
- Examples that preserve semantics:
- On-demand computation, by analogy with Earley's algorithm
- On-the-fly composition of FSTs
- Left-corner filter for parsing
- Program specialization as unfolding, e.g., compile out the grammar
- Rearranging computations, by analogy with categorial grammar
- Folding reinterpreted as slashed categories
- Speculative computation using slashed categories
- abstract away repeated computation to do it once only, by analogy with unary rule closure or epsilon-closure
- derives the Eisner & Satta ACL'99 O(n^3) bilexical parser
42. How you build a system (big picture slide)
cool model
practical equations (e.g., for a PCFG)
Propagate updates from right-to-left through the equations, a.k.a. the agenda algorithm, forward chaining, bottom-up inference, semi-naive bottom-up. Use a general method.
pseudocode (execution order): for width from 2 to n: for i from 0 to n-width: k = i+width; for j from i+1 to k-1: ...
tuned C++ implementation (data structures, etc.)
43. Bottom-up inference
Rules of program:
  s(I,K) += np(I,J) * vp(J,K).
  pp(I,K) += prep(I,J) * np(J,K).
[Animation: an agenda of pending updates feeds a chart of derived items with current values. We pop the update np(3,5) += 0.3; the chart already held np(3,5) = 0.1 (if np(3,5) hadn't been in the chart already, we would have added it). We updated np(3,5): what else must therefore change? Matching the rules against the chart, the query vp(5,K)? finds vp(5,9) = 0.5 and vp(5,7) = 0.7, so the updates s(3,9) += 0.15 and s(3,7) += 0.21 go onto the agenda; the query prep(I,3)? finds prep(2,3) = 1.0, so pp(2,5) += 0.3 goes onto the agenda; then there are no more matches to this query.]
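A minimal Python sketch of the loop being animated, under my own simplifications (only the np update triggers matches here, and the chart is scanned linearly, whereas the real runtime indexes it):

    from collections import defaultdict

    chart = defaultdict(float)
    chart.update({("np", 3, 5): 0.1, ("vp", 5, 9): 0.5, ("vp", 5, 7): 0.7,
                  ("prep", 2, 3): 1.0})
    agenda = [(("np", 3, 5), 0.3)]          # pending update: np(3,5) += 0.3

    while agenda:
        (label, i, j), delta = agenda.pop()
        chart[(label, i, j)] += delta       # apply the popped update to the chart
        if label == "np":
            # s(I,K) += np(I,J) * vp(J,K): the popped np(i,j) is the first factor.
            for (lab2, j2, k), v in list(chart.items()):
                if lab2 == "vp" and j2 == j:
                    agenda.append((("s", i, k), delta * v))
            # pp(I,K) += prep(I,J) * np(J,K): the popped np(i,j) is the second factor.
            for (lab2, h, i2), v in list(chart.items()):
                if lab2 == "prep" and i2 == i:
                    agenda.append((("pp", h, j), v * delta))
        # (updates to vp or prep items would trigger the symmetric matches; omitted)

    print(chart[("s", 3, 9)], chart[("s", 3, 7)], chart[("pp", 2, 5)])
    # -> 0.15 0.21 0.3, the same updates shown in the figure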
44. How you build a system (big picture slide)
cool model
practical equations (e.g., for a PCFG)
What's going on under the hood?
pseudocode (execution order): for width from 2 to n: for i from 0 to n-width: k = i+width; for j from i+1 to k-1: ...
tuned C++ implementation (data structures, etc.)
45. Compiler provides ...
(same picture: agenda of pending updates, chart of derived items with current values, rules such as s(I,K) += np(I,J) * vp(J,K), update np(3,5) += 0.3)
- copy, compare, hash terms fast, via integerization (interning)
- efficient storage of terms (use native C types, symbiotic storage, garbage collection, serialization, ...)
46. Beware double-counting!
Rule of program: n(I,K) += n(I,J) * n(J,K).
[Animation: the chart holds an epsilon constituent n(5,5) = 0.2, and the agenda pops the update n(5,5) += 0.3. The query n(5,5)? matches the item itself, so the update combines with the item (and with another copy of itself) to make still more updates to n(5,5).]
47. Parameter training
Objective function as a theorem's value; e.g., the inside algorithm computes the likelihood of the sentence.
- Maximize some objective function.
- Use Dyna to compute the function.
- Then how do you differentiate it?
- for gradient ascent, conjugate gradient, etc.
- the gradient also tells us the expected counts for EM!
- Two approaches:
- Program transformation: automatically derive the outside formulas.
- Back-propagation: run the agenda algorithm backwards.
- works even with pruning, early stopping, etc.
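For concreteness, here is the standard identity (my notation, not from the slides) behind "the gradient also tells us the expected counts for EM", assuming the goal item's value is the inside score Z(theta) summed over derivations d, each weighted by the product of its rule weights:

    % Z(\theta) = \sum_d \prod_r \theta_r^{c_r(d)}, where c_r(d) counts uses of rule r in d.
    \[
      \mathbb{E}_{d \sim p}[c_r(d)]
      \;=\; \sum_d \frac{\prod_{r'} \theta_{r'}^{\,c_{r'}(d)}}{Z(\theta)}\, c_r(d)
      \;=\; \frac{\theta_r}{Z(\theta)}\,\frac{\partial Z(\theta)}{\partial \theta_r}
      \;=\; \frac{\partial \log Z(\theta)}{\partial \log \theta_r}
    \]

So the outside algorithm, or reverse-mode differentiation of the inside computation, yields exactly the counts EM needs.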
48. What can Dyna do beyond CKY?
49. Some examples from my lab
- Parsing using ...
- factored dependency models (Dreyer, Smith & Smith CoNLL'06)
- with annealed risk minimization (Smith & Eisner EMNLP'06)
- constraints on dependency length (Eisner & Smith IWPT'05)
- unsupervised learning of deep transformations (see Eisner EMNLP'02)
- lexicalized algorithms (see Eisner & Satta ACL'99, etc.)
- Grammar induction using ...
- partial supervision (Dreyer & Eisner EMNLP'06)
- structural annealing (Smith & Eisner ACL'06)
- contrastive estimation (Smith & Eisner GIA'05)
- deterministic annealing (Smith & Eisner ACL'04)
- Machine translation using ...
- very large neighborhood search of permutations (Eisner & Tromble NAACL-W'06)
- loosely syntax-based MT (Smith & Eisner, in prep.)
- synchronous cross-lingual parsing (Smith & Smith EMNLP'04)
- Finite-state methods for morphology, phonology, IE, even syntax
- unsupervised cognate discovery (Schafer & Yarowsky '05, '06)
- unsupervised log-linear models via contrastive estimation (Smith & Eisner ACL'05)
- context-based morphological disambiguation (Smith, Smith & Tromble EMNLP'05)
Easy to try stuff out! Programs are very short and easy to change!
- (see also Eisner ACL'03)
50. Can it express everything in NLP?
- Remember, it integrates tightly with C++, so you only have to use it where it's helpful, and write the rest in C++. Small is beautiful.
- We're currently extending the class of allowed formulas beyond the semiring
- cf. Goodman (1999)
- will be able to express smoothing, neural nets, etc.
- Of course, it is Turing complete.
51. Smoothing in Dyna
- mle_prob(X,Y,Z) = count(X,Y,Z) / count(X,Y).      % count(X,Y) is the context count
- smoothed_prob(X,Y,Z) = lambda*mle_prob(X,Y,Z) + (1-lambda)*mle_prob(Y,Z).
- for arbitrary n-grams, can use lists
- count_count(N) += 1 whenever N is count(Anything).
- updates automatically during leave-one-out jackknifing
52. Information retrieval in Dyna
- score(Doc) += tf(Doc,Word) * tf(Query,Word) * idf(Word).
- idf(Word) = 1/log(df(Word)).
- df(Word) += 1 whenever tf(Doc,Word) > 0.
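A small Python sketch of the same three rules on toy data (document names and counts are made up; counting the query itself toward document frequency is a simplification I chose here):

    import math
    from collections import defaultdict

    tf = {  # tf(Doc, Word), including the query "q"
        ("d1", "parsing"): 3, ("d1", "dyna"): 1,
        ("d2", "parsing"): 1, ("d2", "translation"): 2,
        ("q",  "parsing"): 1, ("q",  "dyna"): 2,
    }

    # df(Word) += 1 whenever tf(Doc, Word) > 0.
    df = defaultdict(int)
    for (doc, word), count in tf.items():
        if count > 0:
            df[word] += 1

    # idf(Word) = 1 / log(df(Word)).  (skip words with df = 1, where log is 0)
    idf = {w: 1.0 / math.log(n) for w, n in df.items() if n > 1}

    # score(Doc) += tf(Doc, Word) * tf(Query, Word) * idf(Word).
    score = defaultdict(float)
    for (doc, word), count in tf.items():
        if doc != "q" and word in idf and ("q", word) in tf:
            score[doc] += count * tf[("q", word)] * idf[word]

    print(dict(score))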
53. Neural networks in Dyna
- out(Node) = sigmoid(in(Node)).
- in(Node) += input(Node).
- in(Node) += weight(Node,Kid) * out(Kid).
- error += (out(Node) - target(Node))^2 if ?target(Node).
- Recurrent neural net is OK
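A tiny Python sketch of the same value propagation on a hypothetical three-layer network (node names, weights, and the explicit topological order are my own):

    import math

    inputs  = {"x1": 1.0, "x2": -2.0}                 # input(Node)
    weights = {("h", "x1"): 0.5, ("h", "x2"): 0.25,   # weight(Node, Kid)
               ("y", "h"): 2.0}
    targets = {"y": 1.0}                              # target(Node), where defined

    def sigmoid(z): return 1.0 / (1.0 + math.exp(-z))

    out = {}
    for node in ["x1", "x2", "h", "y"]:               # kids processed before parents
        in_val = inputs.get(node, 0.0)                # in(Node) += input(Node)
        for (parent, kid), w in weights.items():
            if parent == node:
                in_val += w * out[kid]                # in(Node) += weight(Node,Kid)*out(Kid)
        out[node] = sigmoid(in_val)                   # out(Node) = sigmoid(in(Node))

    # error += (out(Node) - target(Node))^2 wherever a target is defined.
    error = sum((out[n] - t) ** 2 for n, t in targets.items())
    print(out["y"], error)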
54. Game-tree analysis in Dyna
- goal = best(Board) if start(Board).
- best(Board) max= stop(player1, Board).
- best(Board) max= move(player1, Board, NewBoard) + worst(NewBoard).
- worst(Board) min= stop(player2, Board).
- worst(Board) min= move(player2, Board, NewBoard) + best(NewBoard).
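A compact Python analogue of the max/min recursion on a made-up toy game; memoizing the two mutually recursive functions plays the role of the chart:

    from functools import lru_cache

    # moves[player][board] -> boards reachable in one move;
    # stop[player][board]  -> payoff if that player stops there.
    moves = {1: {"start": ["a", "b"]}, 2: {"a": ["a2"], "b": []}}
    stop  = {1: {"a2": 3}, 2: {"a": 1, "b": 5}}

    @lru_cache(maxsize=None)
    def best(board):                       # player 1 maximizes
        options = [stop[1][board]] if board in stop[1] else []
        options += [worst(nb) for nb in moves[1].get(board, [])]
        return max(options)

    @lru_cache(maxsize=None)
    def worst(board):                      # player 2 minimizes
        options = [stop[2][board]] if board in stop[2] else []
        options += [best(nb) for nb in moves[2].get(board, [])]
        return min(options)

    print(best("start"))                   # goal = best(start) -> 5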
55. Weighted FST composition in Dyna (epsilon-free case)
- :- bool item = false.
- start(A o B, Q x R) |= start(A, Q) & start(B, R).
- stop(A o B, Q x R) |= stop(A, Q) & stop(B, R).
- arc(A o B, Q1 x R1, Q2 x R2, In, Out) |= arc(A, Q1, Q2, In, Match) & arc(B, R1, R2, Match, Out).
- Inefficient? How do we fix this?
56. Constraint programming (arc consistency)
- :- bool indomain = false.
- :- bool consistent = true.
- variable(Var) |= indomain(Var:Val).
- possible(Var:Val) &= indomain(Var:Val).
- possible(Var:Val) &= support(Var:Val, Var2) whenever variable(Var2).
- support(Var:Val, Var2) |= possible(Var2:Val2) & consistent(Var:Val, Var2:Val2).
57. Edit distance in Dyna, version 1
- letter1(c,0,1).  letter1(l,1,2).  letter1(a,2,3).  ...        % "clara"
- letter2(c,0,1).  letter2(a,1,2).  letter2(c,2,3).  ...        % "caca"
- end1(5).  end2(4).  delcost = 1.  inscost = 1.  substcost = 1.
- align(0,0) = 0.
- align(I1,J2) min= align(I1,I2) + inscost(L2) whenever letter2(L2,I2,J2).
- align(J1,I2) min= align(I1,I2) + delcost(L1) whenever letter1(L1,I1,J1).
- align(J1,J2) min= align(I1,I2) + substcost(L1,L2) whenever letter1(L1,I1,J1) & letter2(L2,I2,J2).
- align(J1,J2) min= align(I1,I2) whenever letter1(L,I1,J1) & letter2(L,I2,J2).
- goal = align(N1,N2) whenever end1(N1) & end2(N2).
58. Edit distance in Dyna, version 2
- input([c,l,a,r,a], [c,a,c,a]) = 0.
- delcost = 1.  inscost = 1.  substcost = 1.
- alignupto(Xs,Ys) min= input(Xs,Ys).
- alignupto(Xs,Ys) min= alignupto([X|Xs],Ys) + delcost.
- alignupto(Xs,Ys) min= alignupto(Xs,[Y|Ys]) + inscost.
- alignupto(Xs,Ys) min= alignupto([X|Xs],[Y|Ys]) + substcost.
- alignupto(Xs,Ys) min= alignupto([A|Xs],[A|Ys]).
- goal min= alignupto([], []).
How about different costs for different letters?
59. Edit distance in Dyna, version 2
- input([c,l,a,r,a], [c,a,c,a]) = 0.
- delcost = 1.  inscost = 1.  substcost = 1.
- alignupto(Xs,Ys) min= input(Xs,Ys).
- alignupto(Xs,Ys) min= alignupto([X|Xs],Ys) + delcost.
- alignupto(Xs,Ys) min= alignupto(Xs,[Y|Ys]) + inscost.
- alignupto(Xs,Ys) min= alignupto([X|Xs],[Y|Ys]) + substcost.
- alignupto(Xs,Ys) min= alignupto([L|Xs],[L|Ys]).
- goal min= alignupto([], []).
Annotations: make the costs letter-specific: delcost(X), inscost(Y), substcost(X,Y).
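The same min-cost computation in conventional bottom-up Python, indexing prefixes rather than list suffixes, with letter-specific cost functions as the (X), (Y), (X,Y) annotations suggest (the zero-cost copy rule becomes the x == y case):

    def delcost(x):      return 1
    def inscost(y):      return 1
    def substcost(x, y): return 0 if x == y else 1   # the "free copy" rule is the 0 case

    def edit_distance(s1, s2):
        n1, n2 = len(s1), len(s2)
        # align[i][j] = min cost of aligning s1[:i] with s2[:j]
        align = [[0.0] * (n2 + 1) for _ in range(n1 + 1)]
        for i in range(n1 + 1):
            for j in range(n2 + 1):
                if i == 0 and j == 0:
                    continue                                  # align(0,0) = 0
                best = float("inf")
                if i > 0:
                    best = min(best, align[i-1][j] + delcost(s1[i-1]))
                if j > 0:
                    best = min(best, align[i][j-1] + inscost(s2[j-1]))
                if i > 0 and j > 0:
                    best = min(best, align[i-1][j-1] + substcost(s1[i-1], s2[j-1]))
                align[i][j] = best
        return align[n1][n2]                                  # goal

    print(edit_distance("clara", "caca"))   # 2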
60. Is it fast enough? (sort of)
- Asymptotically efficient
- 4 times slower than Mark Johnson's inside-outside
- 4-11 times slower than Klein & Manning's Viterbi parser
61. Are you going to make it faster? (yup!)
- Currently rewriting the term classes to match hand-tuned code
- Will support mix-and-match implementation strategies:
- store X in an array
- store Y in a hash
- don't store Z (compute on demand)
- Eventually, choose strategies automatically by execution profiling
62. Synopsis: your idea -> experimental results, fast!
- Dyna is a language for computation (no I/O).
- Especially good for dynamic programming.
- It tries to encapsulate the black art of NLP.
- Much prior work in this vein:
- Deductive parsing schemata (preferably weighted): Goodman, Nederhof, Pereira, Warren, Shieber, Schabes, Sikkel
- Deductive databases (preferably with aggregation): Ramakrishnan, Zukowski, Freitag, Specht, Ross, Sagiv, ...
- Probabilistic programming languages (implemented): Zhao, Sato, Pfeffer (also efficient Prolog-ish languages)
63. Dyna contributors!
- Jason Eisner
- Eric Goldlust, Eric Northup, Johnny Graettinger (compiler backend)
- Noah A. Smith (parameter training)
- Markus Dreyer, David Smith (compiler frontend)
- Mike Kornbluh, George Shafer, Gordon Woodhull, Constantinos Michael, Ray Buse (visual debugger)
- John Blatz (program transformations)
- Asheesh Laroia (web services)
64. New examples of dynamic programming in NLP
65. Some examples from my lab
(Same list of examples as on slide 49.)
66. New examples of dynamic programming in NLP
- Parameterized finite-state machines
67. Parameterized FSMs
- An FSM whose arc probabilities depend on parameters: they are formulas.
68. Parameterized FSMs
- An FSM whose arc probabilities depend on parameters: they are formulas.
69. Parameterized FSMs
- An FSM whose arc probabilities depend on parameters: they are formulas.
The expert first constructs the FSM (topology and parameterization). Then the automatic part takes over: given training data, find parameter values that optimize the arc probs.
70. Parameterized FSMs
Knight & Graehl 1997: transliteration
71. Parameterized FSMs
Knight & Graehl 1997: transliteration
Would like to get some of that expert knowledge in here. Use probabilistic regexps like (a .7 b) .5 (ab .6). If the probabilities are variables, (a x b) y (ab z), then the arc weights of the compiled machine are nasty formulas. (Especially after minimization!)
72. Finite-State Operations
- Projection GIVES YOU marginal distribution: domain( p(x,y) )
73. Finite-State Operations
- Probabilistic union GIVES YOU mixture model: p(x) and q(x) mixed with weight 0.3
74. Finite-State Operations
- Probabilistic union GIVES YOU mixture model: p(x) and q(x) mixed with weight lambda
- Learn the mixture parameter lambda!
75. Finite-State Operations
- Composition GIVES YOU chain rule: p(x|y) o p(y|z)
- The most popular statistical FSM operation
- Cross-product construction
76. Finite-State Operations
- Concatenation, probabilistic closure HANDLE unsegmented text (e.g., p(x) q(x) concatenated, or p(x) under probabilistic closure with weight 0.3)
- Just glue together machines for the different segments, and let them figure out how to align with the text
77. Finite-State Operations
- Directed replacement MODELS noise or postprocessing: compose p(x,y) with the noise/postprocessing machine
- Resulting machine compensates for noise or postprocessing
78. Finite-State Operations
- Intersection GIVES YOU product models: p(x) & q(x)
- e.g., exponential / maxent, perceptron, Naive Bayes, ...
- Need a normalization op too: computes the sum over x of f(x), the pathsum or partition function
- Cross-product construction (like composition)
79. Finite-State Operations
- Conditionalization (a new operation): condit( p(x,y) )
- Resulting machine can be composed with other distributions: p(y | x) o q(x)
80. New examples of dynamic programming in NLP
- Parameterized infinite-state machines
81. Universal grammar as a parameterized FSA over an infinite state space
82. New examples of dynamic programming in NLP
- More abuses of finite-state machines
83. Huge-alphabet FSAs for OT phonology
Gen proposes all candidates that include this input.
[Figure: a candidate with underlying and surface tiers of C and V slots plus feature tiers such as voi and velar; Gen produces many such candidates, etc.]
84. Huge-alphabet FSAs for OT phonology
Encode this candidate as a string: at each moment, need to describe what's going on on many tiers.
[Figure: the same candidate read left to right across its C/V and feature (voi, velar) tiers.]
85. Directional Best Paths construction
- Keep best output string for each input string
- Yields a new transducer (size up to about 3^n)
- For input abc: abc, axc.  For input abd: axd.
- Must allow the red arc just if the next input is d.
86. Minimization of semiring-weighted FSAs
- New definition of lambda for pushing:
- lambda(q) = weight of the shortest path from q, breaking ties alphabetically on input symbols
- Computation is simple, well-defined, independent of the semiring (K, x)
- Breadth-first search back from final states: compute lambda(q) in O(1) time as soon as we visit q, via lambda(q) = k x lambda(r). Whole algorithm is linear.
- Faster than finding the min-weight path a la Mohri.
87. New examples of dynamic programming in NLP
88. Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English.
beaucoup d'enfants donnent un baiser a Sam -> kids kiss Sam quite often
89. Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.
[Figure: the two trees, glossing donnent (give), baiser (kiss), a (to), un (a), beaucoup (lots), d' (of), enfants (kids), aligned with kiss, Sam, kids, quite, often; NP nodes marked.]
beaucoup d'enfants donnent un baiser a Sam -> kids kiss Sam quite often
90. Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange. The alignment shows how the trees are generated synchronously from little trees ...
beaucoup d'enfants donnent un baiser a Sam -> kids kiss Sam quite often
91. New examples of dynamic programming in NLP
- Bilexical parsing in O(n^3)
- (with Giorgio Satta)
92. Lexicalized CKY
[Figure: lexicalized CKY chart over the words "Mary loves the girl outdoors", headed by loves.]
93. Lexicalized CKY is O(n^5), not O(n^3)
[Figure: attachment ambiguity, e.g., "... advocate visiting relatives" vs. "... hug visiting relatives"; combining a B spanning i..j with a C spanning j+1..k already gives O(n^3) combinations, and the two head words add further factors.]
94. Idea 1
- Combine B with what C?
- must try different-width C's (vary k)
- must try differently-headed C's (vary h)
- Separate these!
95. Idea 1
(the old CKY way)
96. Idea 2
97. Idea 2
- Combine what B and C?
- must try different-width C's (vary k)
- must try different midpoints j
- Separate these!
98. Idea 2
(the old CKY way)
99. Idea 2
(the old CKY way)
[Figure: intermediate items that track the head h and only one boundary (j or k) at a time, so widths and midpoints are varied separately.]
100. An O(n^3) algorithm (with G. Satta)
[Figure: the O(n^3) chart for "Mary loves the girl outdoors".]
101. (image-only slide)
102. New examples of dynamic programming in NLP
- O(n)-time partial parsing by limiting dependency length
- (with Noah A. Smith)
103. Short-Dependency Preference
- A word's dependents (adjuncts, arguments) tend to fall near it in the string.
104. Length of a dependency = surface distance
[Figure: an example dependency tree whose dependency lengths are 3, 1, 1, 1.]
105. 50% of English dependencies have length 1, another 20% have length 2, 10% have length 3, ...
[Plot: fraction of all dependencies vs. length.]
106. Related Ideas
- Score parses based on what's between a head and child (Collins, 1997; Zeman, 2004; McDonald et al., 2005)
- Assume short -> faster human processing (Church, 1980; Gibson, 1998)
- "Attach low" heuristic for PPs (English) (Frazier, 1979; Hobbs and Bear, 1990)
- Obligatory and optional re-orderings (English) (see paper)
107. Going to Extremes
Longer dependencies are less likely. What if we eliminate them completely?
108. Hard Constraints
- Disallow dependencies between words of distance > b ...
- Risk: best parse contrived, or no parse at all!
- Solution: allow fragments (partial parsing; Hindle, 1990, inter alia).
- Why not model the sequence of fragments?
109. Building a Vine SBG Parser
- Grammar generates a sequence of trees from the left wall
- Parser recognizes sequences of trees without long dependencies
- Need to modify training data so the model is consistent with the parser.
110. [Figure: a dependency tree from the Penn Treebank for "According to some estimates, the rule changes would cut insider filings by more than a third.", with each dependency labeled by its surface length (up to 9).]
111. [The same tree with b = 4: dependencies longer than 4 are removed, leaving fragments.]
112. [The same tree with b = 3.]
113. [The same tree with b = 2.]
114. [The same tree with b = 1.]
115. [The same tree with b = 0: every word becomes its own fragment.]
116. Vine Grammar is Regular
- Even for small b, bunches can grow to arbitrary size
- But arbitrary center embedding is out
117. Vine Grammar is Regular
- Could compile into an FSA and get O(n) parsing!
- Problem: what's the grammar constant? EXPONENTIAL
[Figure: an FSA reading "According to some estimates, the rule changes would cut insider ..." must remember, e.g., that insider has no parent yet and that cut and would can have more children.]
118. Alternative
- Instead, we adapt an SBG chart parser, which implicitly shares fragments of stack state, to the vine case, eliminating unnecessary work.
119. Limiting dependency length
- Linear-time partial parsing
[Figure: a finite-state model of the root sequence (e.g., NP S NP), with bounded dependency length within each chunk (but a chunk could be arbitrarily wide if right- or left-branching).]
- Natural-language dependencies tend to be short
- So even if you don't have enough data to model what the heads are, you might want to keep track of where they are.
120. Limiting dependency length
- Linear-time partial parsing
- Don't convert into an FSA!
- Less structure sharing
- Explosion of states for different stack configurations
- Hard to get your parse back
[Same figure: finite-state model of the root sequence, bounded dependency length within each chunk.]
121. Limiting dependency length
- Linear-time partial parsing
[Figure: NP S NP chunks.] Each piece is at most k words wide. No dependencies between pieces. Finite-state model of the sequence -> linear time! O(k^2 n)
122. Limiting dependency length
- Linear-time partial parsing
Each piece is at most k words wide. No dependencies between pieces. Finite-state model of the sequence -> linear time! O(k^2 n)
123. Quadratic Recognition/Parsing
[Figure: Eisner-Satta-style chart items building up to goal; only construct trapezoids such that k - i <= b, so the O(n^3) combinations become O(n^2 b) and O(n b^2).]
124. [Figure: the O(nb) vine construction for b = 4 on "According to some, the new changes would cut insider filings by more than a third." (from the Penn Treebank); all fragments have width at most 4.]
125. Parsing Algorithm
- Same grammar constant as Eisner and Satta (1999)
- O(n^3) -> O(n b^2) runtime
- Includes some overhead (low-order term) for constructing the vine
- Reality check ... is it worth it?
126. F-measure and runtime of a limited-dependency-length parser (POS sequences)
[Plot.]
127. Precision and recall of a limited-dependency-length parser (POS sequences)
[Plot.]
128. Results: Penn Treebank
[Plot: evaluation against the original ungrafted Treebank, non-punctuation only, for b = 1 up to b = 20.]
129. Results: Chinese Treebank
[Plot: evaluation against the original ungrafted Treebank, non-punctuation only, for b = 1 up to b = 20.]
130. Results: TIGER Corpus
[Plot: evaluation against the original ungrafted Treebank, non-punctuation only, for b = 1 up to b = 20.]
131. Type-Specific Bounds
- b can be specific to dependency type
- e.g., b(V-O) can be longer than b(S-V)
- b specific to parent, child, direction
- gradually tighten based on training data
132.
- English: 50% runtime, no loss
- Chinese: 55% runtime, no loss
- German: 44% runtime, 2% loss
133. Related Work
- Nederhof (2000) surveys finite-state approximation of context-free languages.
- CFG -> FSA
- We limit all dependency lengths (not just center-embedding), and derive weights from the Treebank (not by approximation).
- Chart parser -> reasonable grammar constant.
134. Softer Modeling of Dep. Length
When running the parsing algorithm, just multiply in these probabilities at the appropriate time.
[Figure: a parse with length probabilities such as p(3 | r, a, L), p(2 | r, b, L), p(1 | b, c, R), p(1 | r, d, R), p(1 | d, e, R), p(1 | e, f, R) multiplied in; the resulting model p is DEFICIENT.]
135. Generating with SBGs
- Start with left wall
- Generate root w0
- Generate left children w-1, w-2, ..., w-l from the FSA lambda(w0)
- Generate right children w1, w2, ..., wr from the FSA rho(w0)
- Recurse on each wi, for i in -l, ..., -1, 1, ..., r, sampling alpha_i (steps 2-4)
- Return alpha_-l ... alpha_-1 w0 alpha_1 ... alpha_r
[Figure: w0 with left children w-1, w-2, ..., w-l and right children w1, w2, ..., wr.]
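A schematic Python sketch of this generative story; the "automata" here are just one coin flip per possible child, which is cruder than the real lambda/rho FSAs, and the words and probabilities are invented:

    import random

    LEFT  = {"takes": [("It", 0.9)], "It": [], "two": [], "to": []}                 # stand-in for lambda_w
    RIGHT = {"takes": [("two", 0.8), ("to", 0.7)], "It": [], "two": [], "to": []}   # stand-in for rho_w

    def gen_children(head, table):
        subtrees = []
        for child, p in table[head]:          # crude: one independent coin flip per child
            if random.random() < p:
                subtrees.append(gen_tree(child))
        return subtrees

    def gen_tree(head):                        # returns the yield alpha_-l ... w0 ... alpha_r
        left, right = gen_children(head, LEFT), gen_children(head, RIGHT)
        words = []
        for sub in reversed(left):             # left children were generated inside-out
            words += sub
        words.append(head)
        for sub in right:
            words += sub
        return words

    random.seed(0)
    print(" ".join(gen_tree("takes")))         # e.g. "It takes two to"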
136. Very Simple Model for lambda_w and rho_w
We parse POS tag sequences, not words.
p(child | first, parent, direction), p(stop | first, parent, direction), p(child | not first, parent, direction), p(stop | not first, parent, direction)
[Figure: lambda(takes) and rho(takes) for "It takes two to ...".]
137. Baseline
Test-set recall (%): 73, 61, 77.  Test-set runtime (items/word): 90, 149, 49.  (Three test sets.)
138. Modeling Dependency Length
Baseline: recall (%) 73, 61, 77; runtime (items/word) 90, 149, 49.
With length modeling: recall (%) 76, 62, 75; runtime (items/word) 67, 103, 31.
Change: recall +4.1, +1.6, -2.6; runtime -26%, -31%, -37%.
139. Conclusion
- Modeling dependency length can cut runtime of simple models by 26-37%,
- with effects ranging from -3 to +4 on recall.
- (Loss on recall perhaps due to deficient/MLE estimation.)
140. Future Work
- apply to state-of-the-art parsing models
- better parameter estimation
- applications: MT, IE, grammar induction
141. This Talk in a Nutshell
[Figure: length of a dependency = surface distance; example lengths 3, 1, 1, 1.]
- Empirical results (English, Chinese, German):
- Hard constraints cut runtime in half or more with no accuracy loss (English, Chinese), or by 44% with -2.2 accuracy (German).
- Soft constraints affect accuracy of simple models by -3 to 24 and cut runtime by 25 to 40%.
- Formal results:
- A hard bound b on dependency length results in a regular language and allows O(n b^2) parsing.
142. New examples of dynamic programming in NLP
- Grammar induction by initially limiting dependency length
- (with Noah A. Smith)
143. Soft bias toward short dependencies
p(t, x_i) = Z(delta)^(-1) * p_T(t, x_i) * exp(-delta * S),  where S = sum over dependencies (j, k) in t of |j - k|
[Scale: delta runs from -infinity to +infinity; delta = 0 is the MLE baseline; one extreme is marked "linear structure preferred".]
144. Soft bias toward short dependencies
- Multiply parse probability by exp(-delta * S),
- where S is the total length of all dependencies
- Then renormalize probabilities
[Same delta scale as above.]
145. Structural Annealing
[Same delta scale.] Start here: train a model. Repeat: increase delta and retrain, until performance stops improving on a small validation dataset.
146. Grammar Induction
Other structural biases can be annealed. We tried annealing on connectivity (number of fragments), and got similar results.
147. A 6/9-Accurate Parse
These errors look like ones made by a supervised parser in 2000!
[Figure: the Treebank parse vs. the MLE-with-locality-bias parse of a sentence built from the words "can, gene, thus, the, prevent, plant, from, fertilizing, itself, a"; the errors are a verb instead of a modal as root, a preposition misattachment, and misattachment of the adverb "thus".]
148. Accuracy Improvements
language      random tree   Klein & Manning (2004)   Smith & Eisner (2006)   state-of-the-art, supervised
German        27.5          50.3                     70.0                    82.6 [1]
English       30.3          41.6                     61.8                    90.9 [2]
Bulgarian     30.4          45.6                     58.4                    85.9 [1]
Mandarin      22.6          50.1                     57.2                    84.6 [1]
Turkish       29.8          48.0                     62.4                    69.6 [1]
Portuguese    30.6          42.3                     71.8                    86.5 [1]
[1] CoNLL-X shared task, best system.  [2] McDonald et al., 2005.
149. Combining with Contrastive Estimation
- This generally gives us our best results
150. New examples of dynamic programming in NLP
- Contrastive estimation for HMM and grammar induction
- Uses lattice parsing
- (with Noah A. Smith)
151. Contrastive Estimation: Training Log-Linear Models on Unlabeled Data
- Noah A. Smith and Jason Eisner
- Department of Computer Science / Center for Language and Speech Processing
- Johns Hopkins University
- {nasmith,jason}@cs.jhu.edu
152. Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data
- Noah A. Smith and Jason Eisner
- Department of Computer Science / Center for Language and Speech Processing
- Johns Hopkins University
- {nasmith,jason}@cs.jhu.edu
153. Nutshell Version
[Diagram: unannotated text + max-ent features + sequence models -> tractable training, via contrastive estimation with lattice neighborhoods.]
Experiments on unlabeled data: POS tagging, 46% error rate reduction (relative to EM); max-ent features make it possible to survive damage to the tag dictionary; dependency parsing, 21% attachment error reduction (relative to EM).
154. Red leaves don't hide blue jays.
155. Maximum Likelihood Estimation (Supervised)
[Figure: raise p of the observed pair x = "red leaves don't hide blue jays", y = JJ NNS MD VB JJ NNS, at the expense of the denominator, which sums over all sentences and taggings (Sigma*).]
156. Maximum Likelihood Estimation (Unsupervised)
[Figure: the tags are hidden (? ? ? ? ? ?). Raise p of the observed sentence x = "red leaves don't hide blue jays", summed over all taggings, at the expense of the denominator over all of Sigma*. This is what EM does.]
157. Focusing Probability Mass
[Figure: probability mass is moved from the denominator onto the numerator.]
158. Conditional Estimation (Supervised)
[Figure: numerator: the observed tagging y = JJ NNS MD VB JJ NNS of x = "red leaves don't hide blue jays". A different denominator: all taggings (? ? ? ? ? ?) of that same observed sentence.]
159. Objective Functions
- MLE: optimized by Count & Normalize; numerator: tags and words; denominator: Sigma* (all word sequences and taggings).
- MLE with hidden variables: EM; numerator: words; denominator: Sigma*.
- Conditional Likelihood: Iterative Scaling; numerator: tags and words; denominator: the observed words (with all possible taggings).
- Perceptron: Backprop; numerator: tags and words; denominator: hypothesized tags and words.
- Contrastive Estimation: generic numerical solvers (in this talk, LMVM L-BFGS); numerator: observed data (in this talk, the raw word sequence, summed over all possible taggings); denominator: ?
(The first rows are for generative models.)
160.
- This talk is about denominators ...
- in the unsupervised case.
- A good denominator can improve accuracy and tractability.
161. Language Learning (Syntax)
"At last! My own language learning device!"
"Why did he pick that sequence for those words? Why not say 'leaves red ...' or '... hide don't ...' or ..."
"Why didn't he say 'birds fly' or 'dancing granola' or 'the wash dishes' or any other sequence of words?" (EM)
162.
- What is a syntax model supposed to explain?
- Each learning hypothesis corresponds to a denominator / neighborhood.
163. The Job of Syntax
- Explain why each word is necessary.
- -> DEL1WORD neighborhood
164. The Job of Syntax
- Explain the (local) order of the words.
- -> TRANS1 neighborhood
165. [Figure: numerator: p("red leaves don't hide blue jays"), summed over all taggings (? ? ? ? ? ?); denominator: p summed over the sentences in the TRANS1 neighborhood.]
166. [Figure: the denominator as a lattice compactly encoding all sentences in the TRANS1 neighborhood of "red leaves don't hide blue jays" (each obtained by transposing one adjacent pair of words), with any tagging. Built and summed with lattice parsing: www.dyna.org (shameless self-promotion).]
167. The New Modeling Imperative
A good sentence hints that a set of bad ones is nearby.
numerator / denominator (neighborhood)
Make the good sentence likely, at the expense of those bad neighbors.
168.
- This talk is about denominators ...
- in the unsupervised case.
- A good denominator can improve accuracy and tractability.
169. Log-Linear Models
[Figure: p is proportional to the score of (x, y), divided by the partition function Z. Computing Z is undesirable: it sums over all possible taggings of all possible sentences! Conditional estimation (supervised) instead normalizes over 1 sentence; contrastive estimation (unsupervised) normalizes over a few sentences.]
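One way to write the two cheap-denominator objectives side by side (my notation): with u_theta(x, y) = exp(theta . f(x, y)) scoring a (sentence, tagging) pair,

    \[
      \text{conditional (supervised):}\quad
      \prod_i \frac{u_\theta(x_i, y_i)}{\sum_{y} u_\theta(x_i, y)}
      \qquad
      \text{contrastive (unsupervised):}\quad
      \prod_i \frac{\sum_{y} u_\theta(x_i, y)}{\sum_{x' \in N(x_i)} \sum_{y} u_\theta(x', y)}
    \]

Each denominator sums over one sentence or over a small neighborhood N(x_i) of sentences, never over all of Sigma* as the full partition function Z would.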
170. A Big Picture: Sequence Model Estimation
[Diagram organizing estimators by whether they need only unannotated data, allow overlapping features, and have tractable sums: generative MLE p(x, y); generative EM p(x); log-linear MLE p(x, y); log-linear conditional estimation p(y | x); log-linear EM p(x); log-linear CE with lattice neighborhoods.]
171. Contrastive Neighborhoods
- Guide the learner toward models that do what syntax is supposed to do.
- Lattice representation -> efficient algorithms.
There is an art to choosing neighborhood functions.
172. Neighborhoods
neighborhood       size    lattice arcs   perturbations
DEL1WORD           n+1     O(n)           delete up to 1 word
TRANS1             n       O(n)           transpose any bigram
DELORTRANS1        O(n)    O(n)           DEL1WORD or TRANS1
DEL1SUBSEQUENCE    O(n^2)  O(n^2)         delete any contiguous subsequence
Sigma* (EM)        infinite   -           replace each word with anything
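A literal-minded Python sketch of two of these neighborhoods (the real system never enumerates them sentence by sentence; it encodes each neighborhood compactly as a lattice, and whether the original sentence is included follows the sizes in the table):

    def del1word(words):
        """DEL1WORD: the sentence itself plus every single-word deletion (n+1 strings)."""
        yield list(words)
        for i in range(len(words)):
            yield words[:i] + words[i+1:]

    def trans1(words):
        """TRANS1: the sentence itself plus every adjacent transposition (n strings)."""
        yield list(words)
        for i in range(len(words) - 1):
            swapped = list(words)
            swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
            yield swapped

    sent = "red leaves don't hide blue jays".split()
    print(sum(1 for _ in del1word(sent)))   # 7  (= n+1)
    print(sum(1 for _ in trans1(sent)))     # 6  (= n)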
173. The Merialdo (1994) Task
- Given unlabeled text
- and a POS dictionary (that tells all possible tags for each word type),
- learn to tag.
A form of supervision.
174. Trigram Tagging Model
[Figure: tags JJ NNS MD VB JJ NNS over the words "red leaves don't hide blue jays".]
Feature set: tag trigrams; tag/word pairs from a POS dictionary.
175. [Bar chart of tagging accuracy: labels include random; EM (Merialdo, 1994); DA (Smith & Eisner, 2004); CE with neighborhoods DEL1WORD, DEL1SUBSEQUENCE, TRANS1, DELORTRANS1, LENGTH; log-linear EM; supervised HMM and CRF; 10% data. Setup: 96K words, full POS dictionary, uninformative initializer, best of 8 smoothing conditions.]
176. What if we damage the POS dictionary?
- Dictionary includes ...
- all words
- words from 1st half of corpus
- words with count >= 2
- words with count >= 3
- Dictionary excludes OOV words, which can get any tag.
[Plot: accuracy of EM, random, LENGTH, DELORTRANS1 under each dictionary condition. Setup: 96K words, 17 coarse POS tags, uninformative initializer.]
177. Trigram Tagging Model + Spelling
[Figure: tags JJ NNS MD VB JJ NNS over the words "red leaves don't hide blue jays".]
Feature set: tag trigrams; tag/word pairs from a POS dictionary; 1- to 3-character suffixes; contains hyphen, digit.
178. Log-linear spelling features aided recovery ... but only with a smart neighborhood.
[Plot: EM, random, LENGTH, LENGTH + spelling, DELORTRANS1, DELORTRANS1 + spelling.]
179.
- The model need not be finite-state.
180. Unsupervised Dependency Parsing
[Plot: attachment accuracy of the Klein & Manning (2004) model trained by EM vs. CE with LENGTH and TRANS1 neighborhoods, for each initializer.]
181. To Sum Up ...
Contrastive Estimation means picking your own denominator, for tractability or for accuracy (or, as in our case, for both).
Now we can use the task to guide the unsupervised learner (like discriminative techniques do for supervised learners).
It's a particularly good fit for log-linear models
- with max-ent features
- unsupervised sequence models
- all in time for ACL 2006.
182. (image-only slide)