Title: Weighted Deduction as a Programming Language
1. Weighted Deduction as a Programming Language
Co-authors on various parts of this work: Eric Goldlust, Noah A. Smith, John Blatz, Wes Filardo, Wren Thornton
CMU and Google, May 2008
2-3. An Anecdote from ACL'05
- Michael Jordan
4. Conclusions to draw from that talk
- Mike & his students are great.
- Graphical models are great. (because they're flexible)
- Gibbs sampling is great. (because it works with nearly any graphical model)
- Matlab is great. (because it frees up Mike and his students to doodle all day and then execute their doodles)
5. Could NLP be this nice?
- Mike & his students are great.
- Graphical models are great. (because they're flexible)
- Gibbs sampling is great. (because it works with nearly any graphical model)
- Matlab is great. (because it frees up Mike and his students to doodle all day and then execute their doodles)
6. Systems are big! Large-scale noisy data, complex models, search approximations, software engineering
7. Systems are big! Large-scale noisy data, complex models, search approximations, software engineering
- Maybe a bit smaller outside NLP
- But still big and carefully engineered
- And they will get bigger, e.g., as machine vision systems do more scene analysis and compositional object modeling
8. Systems are big! Large-scale noisy data, complex models, search approximations, software engineering
- Consequences:
- Barriers to entry
  - Small number of players
  - Significant investment to be taken seriously
  - Need to know & implement the standard tricks
- Barriers to experimentation
  - Too painful to tear up and reengineer your old system to try a cute idea of unknown payoff
- Barriers to education and sharing
  - Hard to study or combine systems
  - Potentially general techniques are described and implemented only one context at a time
9. How to spend one's life?
Didn't I just implement something like this last month?
- chart management / indexing
- cache-conscious data structures
- memory layout, file formats, integerization
- prioritization of partial solutions (best-first, A*)
- lazy k-best, forest reranking
- parameter management
- inside-outside formulas, gradients
- different algorithms for training and decoding
- conjugate gradient, annealing, ...
- parallelization
I thought computers were supposed to automate drudgery.
10. Solution
- Presumably, we ought to add another layer of abstraction.
- After all, this is CS.
- Hope to convince you that a substantive new layer exists.
- But what would it look like?
- What's shared by many programs?
11. Can toolkits help?
12. Can toolkits help?
- Hmm, there are a lot of toolkits.
- And they're big too.
- Plus, they don't always cover what you want.
  - Which is why people keep writing them.
- E.g., I love & use OpenFST and have learned lots from its implementation! But sometimes I also want ...
  - automata with > 2 tapes
  - infinite alphabets
  - parameter training
  - A* decoding
  - automatic integerization
  - automata defined by policy
  - mixed sparse/dense implementation (per state)
  - parallel execution
  - hybrid models (90% finite-state)
- So what is common across toolkits?
13. The Dyna language
- A toolkit's job is to abstract away the semantics, operations, and algorithms for a particular domain.
- In contrast, Dyna is domain-independent.
  - (like MapReduce, Bigtable, etc.)
- Manages data & computations that you specify.
- Toolkits or applications can be built on top.
14. Warning
- Lots more beyond this talk
- See http://dyna.org
  - read our papers
  - download an earlier prototype
  - sign up for updates by email
  - wait for the totally revamped next version
15. A Quick Sketch of Dyna
16. How you build a system (big picture slide)
cool model
→ practical equations (e.g., PCFG)
→ pseudocode (execution order):
     for width from 2 to n
       for i from 0 to n-width
         k = i+width
         for j from i+1 to k-1 ...
→ tuned C++ implementation (data structures, etc.)
17. How you build a system (big picture slide)
Dyna language specifies these equations.
- Most programs just need to compute some values from other values. Any order is ok.
  - Feed-forward! Dynamic programming! Message passing! (including Gibbs)
- Must quickly figure out what influences what.
  - Compute Markov blanket
  - Compute transitions in state machine
(pipeline as on slide 16: cool model → practical equations (PCFG) → pseudocode → tuned C++ implementation)
18. How you build a system (big picture slide)
- Dyna language specifies these equations.
- Most programs just need to compute some values from other values. Any order is ok.
- Some programs also need to update the outputs if the inputs change:
  - spreadsheets, makefiles, email readers
  - dynamic graph algorithms
  - EM and other iterative optimization
  - energy of a proposed configuration for MCMC
  - leave-one-out training of smoothing parameters
(pipeline as on slide 16)
19. How you build a system (big picture slide)
Compilation strategies (we'll come back to this)
(pipeline as on slide 16: cool model → practical equations (PCFG) → pseudocode → tuned C++ implementation)
20. Writing equations in Dyna
- int a.
- a = b * c.
  - a will be kept up to date if b or c changes.
- b += x.  b += y.  (equivalent to b = x+y)
  - b is a sum of two variables. Also kept up to date.
- c += z(1).  c += z(2).  c += z(3).
  c += z("four").  c += z(foo(bar,5)).
  ... or simply:  c += z(N).
  - c is a sum of all nonzero z(...) values. At compile time, we don't know how many!
21. More interesting use of patterns
- a += b * c.
  - scalar multiplication
- a(I) += b(I) * c(I).
  - pointwise multiplication
- a += b(I) * c(I).  (means a = sum over I of b(I)*c(I))
  - dot product; could be sparse
- a(I,K) += b(I,J) * c(J,K).  (sums b(I,J)*c(J,K) over J)
  - matrix multiplication; could be sparse
  - J is free on the right-hand side, so we sum over it (see the sketch below)
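To make the pattern concrete, here is a minimal Python sketch (plain Python, not Dyna) of what the matrix-multiplication rule asks the runtime to compute; the sparse dictionaries b and c and their entries are made-up examples.

```python
from collections import defaultdict

b = {(0, 0): 2.0, (0, 1): 3.0, (1, 1): 4.0}   # sparse: only nonzero entries
c = {(0, 0): 1.0, (1, 0): 5.0}

a = defaultdict(float)
for (i, j), bv in b.items():
    for (j2, k), cv in c.items():
        if j == j2:                # unify the shared variable J
            a[(i, k)] += bv * cv   # aggregate with +=

print(dict(a))                     # {(0, 0): 17.0, (1, 0): 20.0}
```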
22. Dyna vs. Prolog
- By now you may see what we're up to!
- Prolog has Horn clauses:
  a(I,K) :- b(I,J), c(J,K).
- Dyna has Horn equations:
  a(I,K) += b(I,J) * c(J,K).
Like Prolog: Allow nested terms. Syntactic sugar for lists, etc. Turing-complete.
Unlike Prolog: Charts, not backtracking! Compile → efficient C++ classes. Terms have values.
23. Some connections and intellectual debts
- Deductive parsing schemata (preferably weighted)
  - Goodman, Nederhof, Pereira, McAllester, Warren, Shieber, Schabes, Sikkel
- Deductive databases (preferably with aggregation)
  - Ramakrishnan, Zukowski, Freitag, Specht, Ross, Sagiv, ...
- Query optimization
  - Usually limited to decidable fragments, e.g., Datalog
- Theorem proving
  - Theorem provers, term rewriting, etc.
  - Nonmonotonic reasoning
- Programming languages
  - Efficient Prologs (Mercury, XSB, ...)
  - Probabilistic programming languages (PRISM, IBAL, ...)
  - Declarative networking (P2)
  - XML processing languages (XTatic, CDuce)
  - Functional logic programming (Curry, ...)
  - Self-adjusting computation, adaptive memoization (Acar et al.)
Increasing interest in resurrecting declarative and logic-based system specifications.
24. Example: CKY and Variations
25. The CKY inside algorithm in Dyna
phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal += phrase(s,0,sentence_length).

using namespace cky;
chart c;
c[rewrite(s,np,vp)] = 0.7;
c[word("Pierre",0,1)] = 1;
c[sentence_length] = 30;
cin >> c;          // get more axioms from stdin
cout << c[goal];   // print total weight of all parses
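For readers who want the same computation spelled out procedurally, here is a small Python sketch of the inside algorithm these three rules specify; the toy grammar and sentence are invented for illustration.

```python
from collections import defaultdict

rewrite_unary = {("np", "Pierre"): 1.0, ("vp", "sleeps"): 1.0}
rewrite_binary = {("s", "np", "vp"): 0.7}
words = ["Pierre", "sleeps"]
n = len(words)

phrase = defaultdict(float)
for i, w in enumerate(words):          # phrase(X,I,J) += rewrite(X,W)*word(W,I,J)
    for (x, w2), p in rewrite_unary.items():
        if w2 == w:
            phrase[(x, i, i + 1)] += p

for width in range(2, n + 1):          # phrase(X,I,J) += rewrite(X,Y,Z)
    for i in range(0, n - width + 1):  #   * phrase(Y,I,Mid) * phrase(Z,Mid,J)
        j = i + width
        for mid in range(i + 1, j):
            for (x, y, z), p in rewrite_binary.items():
                phrase[(x, i, j)] += p * phrase[(y, i, mid)] * phrase[(z, mid, j)]

goal = phrase[("s", 0, n)]             # goal += phrase(s,0,sentence_length)
print(goal)                            # 0.7 for this toy example
```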
26-27. Visual debugger: Browse the proof forest
28. Parameterization (program as on slide 25)
- rewrite(X,Y,Z) doesn't have to be an atomic parameter:
  urewrite(X,Y,Z) *= weight1(X,Y).
  urewrite(X,Y,Z) *= weight2(X,Z).
  urewrite(X,Y,Z) *= weight3(Y,Z).
  urewrite(X,Same,Same) *= weight4.
  urewrite(X) += urewrite(X,Y,Z).                    % normalizing constant
  rewrite(X,Y,Z) = urewrite(X,Y,Z) / urewrite(X).    % normalize
29. Related algorithms in Dyna?
phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal += phrase(s,0,sentence_length).
- Viterbi parsing?
- Logarithmic domain?
- Lattice parsing?
- Incremental (left-to-right) parsing?
- Log-linear parsing?
- Lexicalized or synchronous parsing?
- Binarized CKY?
- Earley's algorithm?
30. Related algorithms in Dyna? (same program and question list as slide 29)
- Viterbi parsing: just change each += to max= in the three rules.
31. Related algorithms in Dyna? (same program as slide 29)
- Logarithmic domain: store log weights, so each * in the rules becomes + (and the aggregation becomes a log-sum, or stays max= for Viterbi).
32. Related algorithms in Dyna? (same program as slide 29)
- Lattice parsing: no change to the program; just assert word axioms over lattice states instead of string positions, e.g. c[word("Pierre", state(5), state(9))] = 0.2 in place of c[word("Pierre", 0, 1)] = 1.
(figure: a small word lattice over states 5, 8, 9 with arcs such as Pierre/0.2, P/0.5, air/0.3)
33. Related algorithms in Dyna? (same program as slide 29)
- Incremental (left-to-right) parsing: just add words one at a time to the chart, and check at any time what can be derived from the words so far. Similarly, dynamic grammars.
34. Related algorithms in Dyna? (same program as slide 29)
- Log-linear parsing: again, no change to the Dyna program (parameterize the rewrite weights as on slide 28).
35. Related algorithms in Dyna? (same program as slide 29)
- Lexicalized or synchronous parsing: basically, just add extra arguments to the terms above.
36. Related algorithms in Dyna? (same program and question list as slide 29; next up: binarized CKY)
37. Rule binarization
phrase(X,I,J) += phrase(Y,I,Mid) * phrase(Z,Mid,J) * rewrite(X,Y,Z).
(figure: the three-way combination of Y over [I,Mid], Z over [Mid,J], and the rule X → Y Z)
38. Rule binarization
phrase(X,I,J) += phrase(Y,I,Mid) * phrase(Z,Mid,J) * rewrite(X,Y,Z).
Folding out a two-way subproblem is the same trick as in graphical models, constraint programming, and multi-way database join.
39. Program transformations
Blatz & Eisner (FG 2007): Lots of equivalent ways to write a system of equations! Transforming from one to another may improve efficiency. Many parsing tricks can be generalized into automatic transformations that help other programs, too!
(pipeline as on slide 16)
40. Related algorithms in Dyna? (same program and question list as slide 29; next up: Earley's algorithm)
41. Earley's algorithm in Dyna
Start from the CKY program (slide 29); Earley's algorithm falls out via the magic templates transformation (as noted by Minnen 1996).
42. Related algorithms in Dyna? (same program as slide 29)
- Epsilon symbols?
  word(epsilon,I,I) = 1.   (i.e., epsilons are freely available everywhere)
43. Some examples from my lab (as of 2006, w/prototype)
- Parsing using
  - factored dependency models (Dreyer, Smith & Smith CONLL'06)
  - with annealed risk minimization (Smith & Eisner EMNLP'06)
  - constraints on dependency length (Eisner & Smith IWPT'05)
  - unsupervised learning of deep transformations (see Eisner EMNLP'02)
  - lexicalized algorithms (see Eisner & Satta ACL'99, etc.)
- Grammar induction using
  - partial supervision (Dreyer & Eisner EMNLP'06)
  - structural annealing (Smith & Eisner ACL'06)
  - contrastive estimation (Smith & Eisner GIA'05)
  - deterministic annealing (Smith & Eisner ACL'04)
- Machine translation using
  - very large neighborhood search of permutations (Eisner & Tromble, NAACL-W'06)
  - loosely syntax-based MT (Smith & Eisner in prep.)
  - synchronous cross-lingual parsing (Smith & Smith EMNLP'04)
- Finite-state methods for morphology, phonology, IE, even syntax (see also Eisner ACL'03)
  - unsupervised cognate discovery (Schafer & Yarowsky '05, '06)
  - unsupervised log-linear models via contrastive estimation (Smith & Eisner ACL'05)
  - context-based morphological disambiguation (Smith, Smith & Tromble EMNLP'05)
Easy to try stuff out! Programs are very short & easy to change!
44. Can it express everything in NLP?
- Remember, it integrates tightly with C++, so you only have to use it where it's helpful, and write the rest in C++. Small is beautiful.
- Of course, it is Turing complete.
45. One Execution Strategy (forward chaining)
46. How you build a system (big picture slide)
Propagate updates from right-to-left through the equations. a.k.a. agenda algorithm, forward chaining, bottom-up inference, semi-naïve bottom-up.
Use a general method to go from the equations to execution.
(pipeline as on slide 16)
47. Bottom-up inference
Rules of program:
  s(I,K) += np(I,J) * vp(J,K).
  pp(I,K) += prep(I,J) * np(J,K).
The chart holds derived items with current values; the agenda holds pending updates.
Example: we pop the update np(3,5) += 0.3. The chart already held np(3,5) = 0.1, so its value grows by 0.3. (If np(3,5) hadn't been in the chart already, we would have added it.) We updated np(3,5); what else must therefore change?
- Query vp(5,K): matches vp(5,9) = 0.5 and vp(5,7) = 0.7, pushing updates s(3,9) += 0.15 and s(3,7) += 0.21 onto the agenda.
- Query prep(I,3): matches prep(2,3) = 1.0, pushing pp(2,5) += 0.3. No more matches to this query.
(see the Python sketch below)
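A minimal Python sketch of this agenda loop, hard-coding the single rule s(I,K) += np(I,J) * vp(J,K) and the chart contents from the walkthrough above; a real implementation would index the chart rather than scan it.

```python
from collections import defaultdict

chart = defaultdict(float)               # derived items with current values
chart.update({("np", 3, 5): 0.1, ("vp", 5, 9): 0.5, ("vp", 5, 7): 0.7})

agenda = [(("np", 3, 5), 0.3)]           # pending update: np(3,5) += 0.3
while agenda:
    (functor, i, j), delta = agenda.pop()
    chart[(functor, i, j)] += delta      # apply the update to the chart
    if functor == "np":                  # match np(I,J) against the rule body
        for (f2, j2, k), v in list(chart.items()):
            if f2 == "vp" and j2 == j:   # found vp(J,K): drive the rule forward
                agenda.append((("s", i, k), delta * v))

print({k: round(v, 4) for k, v in chart.items() if k[0] == "s"})
# {('s', 3, 9): 0.15, ('s', 3, 7): 0.21}, matching the walkthrough above
```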
48. How you build a system (big picture slide)
What's going on under the hood?
(pipeline as on slide 16)
49. Compiler provides
- agenda of pending updates (e.g., np(3,5) += 0.3)
- rules of program (e.g., s(I,K) += np(I,J) * vp(J,K))
- chart of derived items with current values
- copy, compare, hash terms fast, via integerization (interning)
- efficient storage of terms (given static type info): implicit storage, symbiotic storage, various data structures, support for indices, stack vs. heap, ...
50. Beware double-counting!
Rule: n(I,K) += n(I,J) * n(J,K).
Suppose the update n(5,5) += 0.2 pops off the agenda (an epsilon constituent). Besides combining with the chart's existing n(5,5) (current value 0.3), it can combine with another copy of itself to make yet another copy of itself; the query n(5,5)? must be handled carefully so the implementation doesn't double-count.
51. More issues in implementing inference
- Handling non-distributive updates
  - Replacement: p max= q(X). What if the current max, q(0), is reduced?
  - Retraction: p max= q(X). What if q(0) becomes unprovable (no value)?
  - Non-distributive rules: p += 1/q(X). Adding Δ to q(0) doesn't simply add a corresponding amount to p.
- Backpointers (hyperedges in the derivation forest)
  - Efficient storage, or on-demand recomputation
- Information flow between f(3), f(int X), f(X)
52. More issues in implementing inference
- User-defined priorities
  - priority(phrase(X,I,J)) = -(J-I).   % CKY (narrow to wide)
  - priority(phrase(X,I,J)) = phrase(X,I,J).   % uniform-cost
  - priority(phrase(X,I,J)) = phrase(X,I,J) * heuristic(X,I,J).   % A*
  - Can we learn a good priority function? (can be dynamic)
- User-defined parallelization
  - host(phrase(X,I,J)) = J.
  - Can we learn a host-choosing function? (can be dynamic)
- User-defined convergence tests
53. More issues in implementing inference
- Time-space tradeoffs
  - Which queries to index, and how?
  - Selective or temporary memoization
  - Can we learn a policy?
- On-demand computation (backward chaining)
  - Prioritizing subgoals; query planning
  - Safely invalidating memos
- Mixing forward-chaining and backward-chaining
  - Can we choose a good mixed strategy?
54. Parameter training
Objective function as a theorem's value (e.g., the inside algorithm computes the likelihood of the sentence).
- Maximize some objective function.
- Use Dyna to compute the function.
- Then how do you differentiate it?
  - for gradient ascent, conjugate gradient, etc.
  - the gradient of the log-partition function also tells us the expected counts for EM
- Two approaches supported:
  - Tape algorithm: remember the agenda order and run it backwards.
  - Program transformation: automatically derive the outside formulas.
55. Automatic differentiation via the gradient transform
- a += b * c.  gives rise to:
  - g(b) += g(a) * c.
  - g(c) += g(a) * b.
- Now g(x) denotes ∂f/∂x, f being the objective function.
- Examples:
  - Backprop for neural networks
  - Backward algorithm for HMMs and CRFs
  - Outside algorithm for PCFGs
The Dyna implementation also supports tape-based differentiation.
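A tiny Python sketch of the transform applied to one instance of a += b*c, with made-up values; it just checks that the derived rules produce the familiar product-rule gradients.

```python
b, c = 2.0, 5.0
a = b * c            # a += b * c.   (forward pass)
f = a                # take the objective f to be a itself

g = {"a": 1.0}       # g(x) denotes df/dx; seed with g(f) = 1
g["b"] = g["a"] * c  # g(b) += g(a) * c.
g["c"] = g["a"] * b  # g(c) += g(a) * b.

print(g["b"], g["c"])   # 5.0 2.0, matching d(b*c)/db = c and d(b*c)/dc = b
```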
56. More on Program Transformations
57. Program transformations
- An optimizing compiler would like the freedom to radically rearrange your code.
- Easier in a declarative language than in C.
  - Don't need to reconstruct the source program's intended semantics.
  - Also, the source program is much shorter.
- Search problem (open): find a good sequence of transformations (on a given workload).
58. Variable elimination
- Dechter's bucket elimination for hard constraints
- But how do we do it for soft constraints?
- How do we join soft constraints?
Bucket E: E ≠ D, E ≠ C
Bucket D: D ≠ A
Bucket C: C ≠ B
Bucket B: B ≠ A
Bucket A:
Join all constraints in E's bucket, yielding a new constraint on D (and C); now join all constraints in D's bucket; ...
(figure thanks to Rina Dechter)
59. Variable elimination via a folding transform
goal max= f1(A,B)*f2(A,C)*f3(A,D)*f4(C,E)*f5(D,E).
To eliminate E, join the constraints mentioning E, and project E out:
  tempE(C,D) max= f4(C,E)*f5(D,E).
- Undirected graphical model
(figure thanks to Rina Dechter)
60. Variable elimination via a folding transform
goal max= f1(A,B)*f2(A,C)*f3(A,D)*tempE(C,D).
To eliminate D, join the constraints mentioning D, and project D out:
  tempD(A,C) max= f3(A,D)*tempE(C,D).
  tempE(C,D) max= f4(C,E)*f5(D,E).
- Undirected graphical model
(figure thanks to Rina Dechter)
61. Variable elimination via a folding transform
goal max= f1(A,B)*f2(A,C)*tempD(A,C).
  tempC(A) max= f2(A,C)*tempD(A,C).
  tempD(A,C) max= f3(A,D)*tempE(C,D).
  tempE(C,D) max= f4(C,E)*f5(D,E).
- Undirected graphical model
(figure thanks to Rina Dechter)
62. Variable elimination via a folding transform
goal max= tempC(A)*f1(A,B).
  tempB(A) max= f1(A,B).
  tempC(A) max= f2(A,C)*tempD(A,C).
  tempD(A,C) max= f3(A,D)*tempE(C,D).
  tempE(C,D) max= f4(C,E)*f5(D,E).
- Undirected graphical model
(figure thanks to Rina Dechter)
63. Variable elimination via a folding transform
goal max= tempC(A)*tempB(A).
  tempB(A) max= f1(A,B).
  tempC(A) max= f2(A,C)*tempD(A,C).
  tempD(A,C) max= f3(A,D)*tempE(C,D).
  tempE(C,D) max= f4(C,E)*f5(D,E).
- Undirected graphical model
Could replace max= with += throughout, to compute the partition function Z.
(figure thanks to Rina Dechter)
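A short Python sketch of a single elimination step (the tempE fold above), using small made-up factors over the domain {1, 2}.

```python
dom = [1, 2]
f4 = {(c, e): 0.1 * c + 0.2 * e for c in dom for e in dom}   # f4(C,E)
f5 = {(d, e): 0.3 * d * e for d in dom for e in dom}         # f5(D,E)

tempE = {}
for c in dom:
    for d in dom:
        # join the two factors mentioning E, then project E out with max
        tempE[(c, d)] = max(f4[(c, e)] * f5[(d, e)] for e in dom)

print(tempE)   # a new factor on (C,D); E no longer appears anywhere
```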
64. Grammar specialization as an unfolding transform
- phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
- rewrite(s,np,vp) = 0.7.
Unfolding gives:
- phrase(s,I,J) += 0.7 * phrase(np,I,Mid) * phrase(vp,Mid,J).
Term flattening then gives:
- s(I,J) += 0.7 * np(I,Mid) * vp(Mid,J).
(actually handled implicitly by subtype storage declarations)
65. On-demand computation via a magic templates transform
- a :- b, c.  becomes:
  - a :- magic(a), b, c.
  - magic(b) :- magic(a).
  - magic(c) :- magic(a), b.
- Examples:
  - Earley's algorithm for parsing
  - Left-corner filter for parsing
  - On-the-fly composition of FSTs
- The weighted generalization turns out to be the generalized A* algorithm (coarse-to-fine search).
66. Speculation transformation (generalization of folding)
- Perform some portion of the computation speculatively, before we have all the inputs we need
- Fill those inputs in later
- Examples from parsing:
  - Gap passing in categorial grammar
    - Build an S/NP (a sentence missing its direct object NP)
  - Transform a parser so that it preprocesses the grammar
    - E.g., unary rule closure or epsilon closure
    - Build phrase(np,I,K) from a phrase(s,I,K) we don't have yet (so we haven't yet chosen a particular I, K)
  - Transform lexicalized context-free parsing from O(n^5) to O(n^3)
    - Add left children to a constituent we don't have yet (without committing to its width)
    - Derives the Eisner & Satta (1999) algorithm
67. A few more language details
- So you'll understand the examples
68. Terms (generalized from Prolog)
- These are the objects of the language
- Primitives
  - 3, 3.14159, "myUnicodeString"
  - user-defined primitive types
- Variables
  - X
  - int X (a type-restricted variable; types are tree automata)
- Compound terms
  - atom
  - atom(subterm1, subterm2, ...) e.g., f(g(h(3),X,Y), Y)
- Adding support for keyword arguments (similar to R, but must support unification)
69. Fixpoint semantics
- A Dyna program is a finite rule set that defines a partial function (map)
- The map only defines values for ground terms
  - Variables (X, Y, ...) let us define values for infinitely many ground terms
- Compute a map that satisfies the equations in the program
  - Not guaranteed to halt (Dyna is Turing-complete, unlike Datalog)
  - Not guaranteed to be unique
70. Fixpoint semantics
- A Dyna program is a finite rule set that defines a partial function (map)
- The map only defines values for ground terms
- The map may accept modifications at runtime:
  - runtime input
  - adjustments to input (dynamic algorithms)
  - retraction (remove input), detachment (forget input but preserve output)
71. Object-oriented features
- Maps are terms, i.e., first-class objects
- Maps can appear as subterms or as values
  - Useful for encapsulating data and passing it around
  - fst3 = compose(fst1, fst2).   % value of fst3 is a chart
  - forest = parse(sentence).
- Typed by their public interface
  - fst4->edge(Q,R) = fst3->edge(R,Q).
- Maps can be stored in files and loaded from files
  - Human-readable format (looks like a Dyna program)
  - Binary format (mimics in-memory layout)
72. Functional features: Auto-evaluation
- Terms can have values.
- So by default, subterms are evaluated in place.
- Arranged by a simple desugaring transformation:
  foo( X ) = 3*bar(X).
  (2 things to evaluate here: bar and *) desugars to
  foo( X ) :- B is bar(X), Result is 3*B, Result.
- Possible to suppress evaluation (&f(x)) or force it (*f(x))
- Some contexts also suppress evaluation.
- Variables are replaced with their bindings but not otherwise evaluated.
73. Functional features: Auto-evaluation
- Terms can have values.
- So by default, subterms are evaluated in place.
- Arranged by a simple desugaring transformation:
  foo(f(X)) = 3*bar(g(X)).
  desugars to
  foo( F ) :- F is f(X), G is g(X), B is bar(G), Result is 3*B, Result.
- Possible to suppress evaluation (&f(x)) or force it (*f(x))
- Some contexts also suppress evaluation.
- Variables are replaced with their bindings but not otherwise evaluated.
74. Other handy features
- fact(0) = 1.
  fact(int N) = N*fact(N-1) whenever N > 0.
- With user-defined syntactic sugar (Unicode supported):
  0! = 1.
  (int N)! = N*(N-1)! if N >= 1.
75. Aggregation operators
- f(X) = 3.      % immutable
- f(X) += 3.     % can be incremented later
- f(X) min= 3.   % can be reduced later
- f(X) := 3.     % can be arbitrarily changed later
- f(X) >= 3.     % like = but can be overridden by a more specific rule
76. Aggregation operators
- f(X) := 1.   % can be arbitrarily changed later
- Non-monotonic reasoning:
  flies(bird X) := true.
  flies(bird X) := penguin(X), false.   % overrides
  flies(bigbird) := false.              % also overrides
- Iterative update algorithms (EM, Gibbs, BP):
  a := init_a.
  a := updated_a(b).   % will override once b is proved
  b := updated_b(a).
77. Declarations (ultimately, should be chosen automatically)
- At the term level:
  - lazy vs. eager computational strategies
  - memoization and flushing strategies
  - prioritization, parallelization, etc.
- At the class level:
  - class = an implementation of a type
  - type = some subset of the term universe
  - class specifies storage strategy
  - classes may implement overlapping types
78. Frozen variables
- Dyna map semantics concerns ground terms.
- But we want to be able to reason about non-ground terms, too:
  - Manipulate Dyna rules (which are non-ground terms)
  - Work with classes of ground terms (specified by non-ground terms)
  - Queries, memoized queries
  - Memoization, updating, prioritization of updates, ...
- So, allow ground terms that contain frozen variables
  - Treatment under unification is beyond the scope of this talk
  - priority(f(X)) = f(X).        % for each X
  - priority(f(X)) = infinity.    % here f(X) is a frozen non-ground term
79. Gensyms
80. Some More Examples
- Shortest paths
- Neural nets
- Vector-space IR
- FST composition
- Generalized A* parsing
- n-gram smoothing
- Arc consistency
- Game trees
- Edit distance
81. Path-finding in Prolog
- pathto(1).   % the start of all paths
  pathto(V) :- edge(U,V), pathto(U).
- When is the query pathto(14) really inefficient?
- What's wrong with this swapped version?
  pathto(V) :- pathto(U), edge(U,V).
82. Shortest paths in Dyna
- Single source:
  pathto(start) min= 0.
  pathto(W) min= pathto(V) + edge(V,W).
  (can change min= to += to sum over paths, e.g., PageRank)
- All pairs:
  path(U,U) min= 0.
  path(U,W) min= path(U,V) + edge(V,W).
- This hint gives Dijkstra's algorithm (pqueue):
  priority(pathto(V) min= Delta) = Delta.
  (adding + heuristic(V) gives A*)
- Must also declare that pathto(V) has converged as soon as it pops off the priority queue; this is true if the heuristic is admissible.
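A Python sketch of the execution strategy the hint requests: Dijkstra's algorithm, where the priority of an update to pathto(V) is its value Delta. The graph is made up.

```python
import heapq

edge = {("s", "a"): 1.0, ("s", "b"): 4.0, ("a", "b"): 2.0}  # made-up graph
pathto = {}
pq = [(0.0, "s")]                      # priority(pathto(V) min= Delta) = Delta
while pq:
    d, v = heapq.heappop(pq)
    if v in pathto:                    # already converged when first popped
        continue
    pathto[v] = d
    for (u, w), cost in edge.items():  # pathto(W) min= pathto(V) + edge(V,W)
        if u == v:
            heapq.heappush(pq, (d + cost, w))

print(pathto)                          # {'s': 0.0, 'a': 1.0, 'b': 3.0}
```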
83. Neural networks in Dyna
- out(Node) = sigmoid(in(Node)).
- sigmoid(X) = 1/(1+exp(-X)).
- in(Node) += weight(Node,Previous)*out(Previous).
- in(Node) += input(Node).
- error += (out(Node)-target(Node))**2.
A recurrent neural net is ok too.
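A Python sketch of the forward pass these equations describe, on a made-up two-layer net. The memoized recursion below only handles the feedforward case; the Dyna equations as written also cover recurrent nets, solved as a fixpoint.

```python
import math
from functools import lru_cache

weight = {("h", "x1"): 0.5, ("h", "x2"): -0.3, ("y", "h"): 1.2}
inputs = {"x1": 1.0, "x2": 2.0}
target = {"y": 1.0}

@lru_cache(maxsize=None)
def out(node):
    # in(Node) += input(Node).  in(Node) += weight(Node,Prev) * out(Prev).
    total = inputs.get(node, 0.0)
    total += sum(w * out(prev) for (n, prev), w in weight.items() if n == node)
    return 1.0 / (1.0 + math.exp(-total))    # out(Node) = sigmoid(in(Node)).

error = sum((out(n) - t) ** 2 for n, t in target.items())
print(error)
```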
84. Vector-space IR in Dyna
- bestscore(Query) max= score(Query,Doc).
- score(Query,Doc) += tf(Query,Word)*tf(Doc,Word)*idf(Word).
- idf(Word) = 1/log(df(Word)).
- df(Word) += 1 whenever tf(Doc,Word) > 0.
85. Weighted FST composition in Dyna (epsilon-free case)
- start(A o B) = start(A) x start(B).
- stop(A o B, Q x R) = stop(A, Q) * stop(B, R).
- arc(A o B, Q1 x R1, Q2 x R2, In, Out) += arc(A, Q1, Q2, In, Match) * arc(B, R1, R2, Match, Out).
- Computes the full cross-product.
- Use the magic templates transform to build only reachable states.
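A Python sketch of the cross-product construction, on made-up one-arc machines; states of the composed machine are pairs (Q, R).

```python
A = {"start": 0, "stop": {1: 1.0},
     "arcs": [(0, 1, "a", "x", 0.5)]}            # (q1, q2, in, out, weight)
B = {"start": 0, "stop": {1: 1.0},
     "arcs": [(0, 1, "x", "z", 0.4)]}

comp = {"start": (A["start"], B["start"]), "stop": {}, "arcs": []}
for qa, wa in A["stop"].items():                 # stop(AoB, QxR) = stop(A,Q)*stop(B,R)
    for qb, wb in B["stop"].items():
        comp["stop"][(qa, qb)] = wa * wb
for (q1, q2, i, m1, w1) in A["arcs"]:            # arc(AoB, ...) += arc(A,...)*arc(B,...)
    for (r1, r2, m2, o, w2) in B["arcs"]:
        if m1 == m2:                             # middle tapes must match
            comp["arcs"].append(((q1, r1), (q2, r2), i, o, w1 * w2))

print(comp["arcs"])   # [((0, 0), (1, 1), 'a', 'z', 0.2)]
```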
86. n-gram smoothing in Dyna
- These values all update automatically during leave-one-out jackknifing.
- mle_prob(X,Y,Z) = count(X,Y,Z)/count(X,Y).
- smoothed_prob(X,Y,Z) = λ*mle_prob(X,Y,Z) + (1-λ)*mle_prob(Y,Z).
  - (for arbitrary-length contexts, could use lists)
- count_of_count(X,Y,count(X,Y,Z)) += 1.
  - Used for Good-Turing and Kneser-Ney smoothing.
  - E.g., count_of_count("the","big",1) is the number of word types that appeared exactly once after "the big".
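A Python sketch of the interpolated estimate above (Jelinek-Mercer style); all counts and the value of λ are made up.

```python
from collections import Counter

trigrams = Counter({("the", "big", "dog"): 2, ("the", "big", "cat"): 1})
bigrams = Counter({("the", "big"): 3, ("big", "dog"): 2, ("big", "cat"): 2})
unigrams = Counter({("big",): 4})
lam = 0.8

def mle3(x, y, z):
    return trigrams[(x, y, z)] / bigrams[(x, y)]   # count(X,Y,Z)/count(X,Y)

def mle2(y, z):
    return bigrams[(y, z)] / unigrams[(y,)]

def smoothed(x, y, z):                              # lam*mle3 + (1-lam)*mle2
    return lam * mle3(x, y, z) + (1 - lam) * mle2(y, z)

print(smoothed("the", "big", "dog"))   # 0.8*(2/3) + 0.2*(1/2) = 0.633...
```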
87. Arc consistency (= 2-consistency): agenda algorithm
Variables X, Y, Z, T with domains 1..3 and constraints X < Y, Y = Z, T < Z, X < T.
- X=3 has no support in Y, so kill it off
- Y=1 has no support in X, so kill it off
- Z=1 just lost its only support in Y, so kill it off
Note: these steps can occur in somewhat arbitrary order.
(slide thanks to Rina Dechter, modified)
88. Arc consistency in Dyna (AC-4 algorithm)
- Axioms (alternatively, could define them by rule):
  - indomain(Var:Val)   % define some values true
  - consistent(Var:Val, Var2:Val2)
    - Define to be true or false if Var, Var2 are co-constrained.
    - Otherwise, leave undefined (or define as true).
- For Var:Val to be kept, Val must be in-domain and also not ruled out by any Var2 that cares:
  - possible(Var:Val) &= indomain(Var:Val).
  - possible(Var:Val) &= supported(Var:Val, Var2).
- Var2 cares if it's co-constrained with Var:Val:
  - supported(Var:Val, Var2) |= consistent(Var:Val, Var2:Val2) & possible(Var2:Val2).
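A Python sketch of the fixpoint these rules compute, on a made-up instance: X, Y in 1..3 with the single constraint X < Y. It iterates to quiescence instead of using an agenda.

```python
domains = {"X": {1, 2, 3}, "Y": {1, 2, 3}}
constraints = {("X", "Y"): lambda x, y: x < y,
               ("Y", "X"): lambda y, x: x < y}   # same constraint, both views

changed = True
while changed:
    changed = False
    for (v, w), ok in constraints.items():
        for val in list(domains[v]):
            # supported(V:val, W) |= consistent(V:val, W:val2) & possible(W:val2)
            if not any(ok(val, val2) for val2 in domains[w]):
                domains[v].discard(val)     # possible(V:val) becomes false
                changed = True

print(domains)   # {'X': {1, 2}, 'Y': {2, 3}}
```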
89. Propagating bounds consistency in Dyna
- E.g., suppose we have a constraint A <= B (as well as other constraints on A). Then:
  - maxval(a) min= maxval(b).   % if B's max is reduced, then A's should be too
  - minval(b) max= minval(a).   % by symmetry
- Similarly, if C+D = 10, then:
  - maxval(c) min= 10-minval(d).
  - maxval(d) min= 10-minval(c).
  - minval(c) max= 10-maxval(d).
  - minval(d) max= 10-maxval(c).
90. Game-tree analysis
- All values represent total advantage to player 1 starting at this board.
- How good is Board for player 1, if it's player 1's move?
  best(Board) max= stop(player1, Board).
  best(Board) max= move(player1, Board, NewBoard) + worst(NewBoard).
- How good is Board for player 1, if it's player 2's move? (player 2 is trying to make player 1 lose: zero-sum game)
  worst(Board) min= stop(player2, Board).
  worst(Board) min= move(player2, Board, NewBoard) + best(NewBoard).
- How good for player 1 is the starting board?
  goal = best(Board) if start(Board).
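A Python sketch of these mutually recursive equations on a made-up game tree; leaf payoffs play the role of the stop axioms.

```python
moves1 = {"root": ["L", "R"]}                # move(player1, Board, NewBoard)
moves2 = {"L": ["L1"], "R": ["R1", "R2"]}    # move(player2, Board, NewBoard)
stop_value = {"L1": 3, "R1": -1, "R2": 5}    # payoff to player 1 at leaves

def best(board):                             # player 1 to move: maximize
    if board in stop_value:
        return stop_value[board]
    return max(worst(nb) for nb in moves1[board])

def worst(board):                            # player 2 to move: minimize
    if board in stop_value:
        return stop_value[board]
    return min(best(nb) for nb in moves2[board])

print(best("root"))   # max(min(3), min(-1, 5)) = max(3, -1) = 3
```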
91. Edit distance between two strings
(traditional picture: the DP alignment grid)
92. Edit distance in Dyna
- dist([],[]) = 0.
- dist([X|Xs],Ys) min= dist(Xs,Ys) + delcost(X).
- dist(Xs,[Y|Ys]) min= dist(Xs,Ys) + inscost(Y).
- dist([X|Xs],[Y|Ys]) min= dist(Xs,Ys) + substcost(X,Y).
- substcost(L,L) = 0.
- result = dist([c,l,a,r,a], [c,a,c,a]).
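A memoized Python sketch of the same recurrences, assuming unit insertion, deletion, and substitution costs; the Dyna program allows arbitrary delcost/inscost/substcost.

```python
from functools import lru_cache

def edit_distance(s, t):
    @lru_cache(maxsize=None)
    def dist(i, j):                    # distance between s[i:] and t[j:]
        if i == len(s) and j == len(t):
            return 0                   # dist([],[]) = 0.
        best = float("inf")
        if i < len(s):
            best = min(best, dist(i + 1, j) + 1)      # delete s[i]
        if j < len(t):
            best = min(best, dist(i, j + 1) + 1)      # insert t[j]
        if i < len(s) and j < len(t):
            sub = 0 if s[i] == t[j] else 1            # substcost(L,L) = 0.
            best = min(best, dist(i + 1, j + 1) + sub)
        return best
    return dist(0, 0)

print(edit_distance("clara", "caca"))  # 2: delete l, substitute r -> c
```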
93. Edit distance in Dyna on input lattices
- dist(S,T) min= dist(S,T,Q,R) + S->final(Q) + T->final(R).
- dist(S,T, S->start, T->start) min= 0.
- dist(S,T, I2, J) min= dist(S,T, I, J) + S->arc(I,I2,X) + delcost(X).
- dist(S,T, I, J2) min= dist(S,T, I, J) + T->arc(J,J2,Y) + inscost(Y).
- dist(S,T, I2, J2) min= dist(S,T, I, J) + S->arc(I,I2,X) + T->arc(J,J2,Y) + substcost(X,Y).
- substcost(L,L) = 0.
- result = dist(lattice1, lattice2).
- lattice1:
  start = state(0).
  arc(state(0),state(1),c) = 0.3.
  arc(state(1),state(2),l) = 0.
  ...
  final(state(5)).
94. Generalized A* parsing (CKY)
- Get Viterbi outside probabilities.
  - Isomorphic to automatic differentiation (reverse mode).
- outside(goal) = 1.
- outside(Body) max= outside(Head) whenever rule(Head max= Body).
- outside(phrase B) max= (phrase A) * outside((A*B)).
- outside(phrase A) max= outside((A*B)) * (phrase B).
- Prioritize by outside estimates from a coarsened grammar:
  - priority(phrase P) = (P) * outside(coarsen(P)).
  - priority(phrase P) = 1 if P = coarsen(P).   % can't coarsen any further
95. Generalized A* parsing (CKY)
- Coarsen nonterminals:
  coa("PluralNoun") = "Noun".
  coa("Noun") = "Anything".
  coa("Anything") = "Anything".
- Coarsen phrases:
  coarsen(phrase(X,I,J)) = phrase(coa(X),I,J).
- Make successively coarser grammars; each is an admissible estimate for the next-finer one:
  coarsen(rewrite(X,Y,Z)) = rewrite(coa(X),coa(Y),coa(Z)).
  coarsen(rewrite(X,Word)) = rewrite(coa(X),Word).
  coarsen(Rule) max= Rule.
  (i.e., Coarse max= Rule whenever Coarse = coarsen(Rule).)
96. Lightweight information interchange?
- Easy for Dyna terms to represent:
  - XML data (Dyna types are analogous to DTDs)
  - RDF triples (semantic web)
  - annotated corpora
  - ontologies
  - graphs, automata, social networks
- Also provides facilities missing from the semantic web:
  - queries against these data
  - state generalizations (rules, defaults) using variables
  - aggregate data and draw conclusions
  - keep track of provenance (backpointers)
  - keep track of confidence (weights)
- Map = deductive database in a box
  - Like a spreadsheet, but more powerful, safer to maintain, and can communicate with the outside world
97. How fast was the prototype version?
- It used one-size-fits-all strategies
- Asymptotically optimal, but:
  - 4 times slower than Mark Johnson's inside-outside
  - 4-11 times slower than Klein & Manning's Viterbi parser
- A 5-6x speedup is not too hard to get
98. Are you going to make it faster? (yup!)
- Static analysis
- Mixed storage strategies
  - store X in an array
  - store Y in a hash
- Mixed inference strategies
  - don't store Z (compute on demand)
- Choose strategies by:
  - user declarations
  - automatically, by execution profiling
99. Summary
- AI systems are too hard to write and modify.
- We need a new layer of abstraction.
- Dyna is a language for computation (no I/O).
  - Simple, powerful idea: define values from other values by weighted logic.
  - Produces classes that interface with C++, etc.
- Compiler supports many implementation strategies.
  - Tries to abstract and generalize many tricks.
  - Fitting a strategy to the workload is a great opportunity for learning!
- Natural fit to fine-grained parallelization.
- Natural fit to web services.
100. Dyna contributors!
- Prototype (available)
  - Eric Goldlust (core compiler), Noah A. Smith (parameter training), Markus Dreyer (front-end processing), David A. Smith, Roy Tromble, Asheesh Laroia
- All-new version (under development)
  - Nathaniel Filardo (core compiler), Wren Ng Thornton (core compiler), Jay Van Der Wall (source language parser), John Blatz (transformations and inference), Johnny Graettinger (early design), Eric Northup (early design)
- Dynasty hypergraph browser (usable)
  - Michael Kornbluh (initial version), Gordon Woodhull (graph layout), Samuel Huang (latest version), George Shafer, Raymond Buse, Constantinos Michael
101. FIN
102. The case for Little Languages
- declarative programming
- small is beautiful
103. Sapir-Whorf hypothesis
- Language shapes thought
  - At least, it shapes conversation
- Computer language shapes thought
  - At least, it shapes experimental research
  - Lots of cute ideas that we never pursue
  - Or if we do pursue them, it takes 6-12 months to implement on large-scale data
  - Have we turned into a lab science?
104. Declarative Specifications
- State what is to be done
- (How should the computer do it? Turn that over to a general solver that handles the specification language.)
- Hundreds of domain-specific "little languages" out there. Some have sophisticated solvers.
105. dot (www.graphviz.org)
digraph g {
  graph [rankdir = "LR"];
  node [fontsize = "16", shape = "ellipse"];
  "node0" [label = "<f0> 0x10ba8 | <f1>", shape = "record"];
  "node1" [label = "<f0> 0xf7fc4380 | <f1> | <f2> -1", shape = "record"];
  "node0":f0 -> "node1":f0 [id = 0];
  "node0":f1 -> "node2":f0 [id = 1];
  "node1":f0 -> "node3":f0 [id = 2];
}
The spec just lists nodes and edges. What's the hard part? Making a nice layout! Actually, it's NP-hard.
106. dot (www.graphviz.org): the rendered layout
107-108. LilyPond (www.lilypond.org)
109. Declarative Specs in NLP
- Regular expression (for an FST toolkit)
- Grammar (for a parser)
- Feature set (for a maxent distribution, SVM, etc.)
- Graphical model (DBNs for ASR, IE, etc.)
Claim of this talk: Sometimes it's best to peek under the shiny surface. Declarative methods are still great, but should be layered; we need them one level lower, too.
110. Declarative Specs in NLP
- Regular expression (for an FST toolkit)
- Grammar (for a parser)
- Feature set (for a maxent distribution, SVM, etc.)
111. New examples of dynamic programming in NLP
- Parameterized finite-state machines
112-113. Parameterized FSMs
- An FSM whose arc probabilities depend on parameters: they are formulas.
114. Parameterized FSMs
- An FSM whose arc probabilities depend on parameters: they are formulas.
Expert first: Construct the FSM (topology & parameterization).
Automatic takes over: Given training data, find parameter values that optimize arc probs.
115. Parameterized FSMs
Knight & Graehl 1997 - transliteration
116. Parameterized FSMs
Knight & Graehl 1997 - transliteration
Would like to get some of that expert knowledge in here. Use probabilistic regexps like (a*.7 b) +.5 (ab*.6). If the probabilities are variables, (a*x b) +y (ab*z), then the arc weights of the compiled machine are nasty formulas. (Especially after minimization!)
117. Finite-State Operations
- Projection GIVES YOU marginal distribution:
  domain( p(x,y) )  yields  p(x)
118. Finite-State Operations
- Probabilistic union GIVES YOU mixture model:
  p(x) +0.3 q(x)   (i.e., 0.3 p(x) + 0.7 q(x))
119. Finite-State Operations
- Probabilistic union GIVES YOU mixture model:
  p(x) +λ q(x)
- Learn the mixture parameter λ!
120. Finite-State Operations
- Composition GIVES YOU chain rule:
  p(x|y) o p(y|z)
- The most popular statistical FSM operation
- Cross-product construction
121. Finite-State Operations
- Concatenation and probabilistic closure HANDLE unsegmented text:
  p(x) q(x)    and    p(x)*0.3
- Just glue together machines for the different segments, and let them figure out how to align with the text
122. Finite-State Operations
- Directed replacement MODELS noise or postprocessing:
  p(x,y) o ...
- Resulting machine compensates for noise or postprocessing
123. Finite-State Operations
- Intersection GIVES YOU product models:
  p(x) & q(x)
  - e.g., exponential / maxent, perceptron, Naïve Bayes, ...
- Need a normalization op too; it computes Σ_x f(x), the pathsum or partition function
- Cross-product construction (like composition)
124. Finite-State Operations
- Conditionalization (new operation):
  condit( p(x,y) )  yields  p(y | x)
- Resulting machine can be composed with other distributions: p(y | x) o q(x)
125. New examples of dynamic programming in NLP
- Parameterized infinite-state machines
126. Universal grammar as a parameterized FSA over an infinite state space
127. New examples of dynamic programming in NLP
- More abuses of finite-state machines
128. Huge-alphabet FSAs for OT phonology
Gen proposes all candidates that include this input.
(figure: candidates drawn on underlying and surface tiers of C and V slots, annotated with features such as voi and velar)
129. Huge-alphabet FSAs for OT phonology
Encode this candidate as a string: at each moment, need to describe what's going on on many tiers.
(figure: the same candidate, with each string position labeled by the C/V slots and features, e.g. voi, velar, active on every tier at that moment)
130. Directional Best Paths construction
- Keep best output string for each input string
- Yields a new transducer (size up to 3^n)
- For input abc: abc, axc. For input abd: axd.
- Must allow the red arc just if the next input is d (see figure)
131. Minimization of semiring-weighted FSAs
- New definition of λ for pushing:
  - λ(q) = weight of the shortest path from q, breaking ties alphabetically on input symbols
- Computation is simple, well-defined, independent of (K, ⊗)
- Breadth-first search back from final states: compute λ(q) in O(1) time as soon as we visit q. The whole algorithm is linear.
- Faster than finding the min-weight path à la Mohri.
(figure: an automaton with arcs labeled a, b, c, d at distance 2 from the final state; λ(q) = k ⊗ λ(r))
132. New examples of dynamic programming in NLP
133. Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English:
beaucoup d'enfants donnent un baiser à Sam → kids kiss Sam quite often
134. Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.
(figure: aligned tree pair; French words donnent (give), un (a), baiser (kiss), à (to), Sam, beaucoup (lots), d' (of), enfants (kids) aligned with English kiss, Sam, kids, quite, often, with NP nodes)
beaucoup d'enfants donnent un baiser à Sam → kids kiss Sam quite often
135. Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange. The alignment shows how the trees are generated synchronously from little trees ...
beaucoup d'enfants donnent un baiser à Sam → kids kiss Sam quite often
136. New examples of dynamic programming in NLP
- Bilexical parsing in O(n^3)
- (with Giorgio Satta)
137. Lexicalized CKY
(figure: lexicalized parse of "Mary loves the girl outdoors")
138. Lexicalized CKY is O(n^5) not O(n^3)
(figure: ambiguous attachments such as "... advocate visiting relatives" vs. "... hug visiting relatives"; combining constituents B over [i,j] and C over [j+1,k] gives O(n^3) span combinations, and choosing the two heads adds two more factors of n)
139. Idea 1
- Combine B with what C?
- must try different-width C's (vary k)
- must try differently-headed C's (vary h)
- Separate these!
140. Idea 1
(figure: the old CKY way)
141. Idea 2
142. Idea 2
- Combine what B and C?
- must try different-width C's (vary k)
- must try different midpoints j
- Separate these!
143. Idea 2
(figure: the old CKY way)
144. Idea 2
(figure: the old CKY way vs. the new decomposition of A into B and C around head h, midpoint j, and endpoint k)
145. An O(n^3) algorithm (with G. Satta)
(figure: parse of "Mary loves the girl outdoors")
147. New examples of dynamic programming in NLP
- O(n)-time partial parsing by limiting dependency length
- (with Noah A. Smith)
148. Short-Dependency Preference
- A word's dependents (adjuncts, arguments) tend to fall near it in the string.
149. Length of a dependency = surface distance
(figure: example sentence with dependency arcs of lengths 3, 1, 1, 1)
150. 50% of English dependencies have length 1, another 20% have length 2, 10% have length 3 ...
(figure: histogram of fraction of all dependencies vs. length)
151. Related Ideas
- Score parses based on what's between a head and child (Collins, 1997; Zeman, 2004; McDonald et al., 2005)
- Assume short → faster human processing (Church, 1980; Gibson, 1998)
- "Attach low" heuristic for PPs (English) (Frazier, 1979; Hobbs and Bear, 1990)
- Obligatory and optional re-orderings (English) (see paper)
152. Going to Extremes
- Longer dependencies are less likely.
- What if we eliminate them completely?
153. Hard Constraints
- Disallow dependencies between words of distance > b ...
- Risk: best parse contrived, or no parse at all!
- Solution: allow fragments (partial parsing; Hindle, 1990, inter alia).
- Why not model the sequence of fragments?
154. Building a Vine SBG Parser
- Grammar generates a sequence of trees hanging from the wall symbol $
- Parser recognizes sequences of trees without long dependencies
- Need to modify the training data so the model is consistent with the parser.
155. (figure: a Penn Treebank dependency tree for "According to some estimates, the rule changes would cut insider filings by more than a third." with each dependency labeled by its surface length, up to lengths 8 and 9)
156. (figure: the same tree with all dependencies of length > 4 cut; b = 4)
157. (figure: the same tree with all dependencies of length > 3 cut; b = 3)
158. (figure: the same tree with all dependencies of length > 2 cut; b = 2)
159. (figure: the same tree with all dependencies of length > 1 cut; b = 1)
160. (figure: b = 0; every dependency is cut, leaving each word as its own fragment)
161. Vine Grammar is Regular
- Even for small b, bunches can grow to arbitrary size
- But arbitrary center embedding is out
162. Vine Grammar is Regular
- Could compile into an FSA and get O(n) parsing!
- Problem: what's the grammar constant? EXPONENTIAL!
(figure: FSA states for "According to some estimates , the rule changes would cut insider ..." must track, e.g., that "insider" has no parent yet and that "cut" and "would" can have more children)
163. Alternative
- Instead, we adapt an SBG chart parser, which implicitly shares fragments of stack state, to the vine case, eliminating unnecessary work.
164. Limiting dependency length
- Linear-time partial parsing:
  a finite-state model of the root sequence (e.g., NP S NP), with bounded dependency length within each chunk (but a chunk could be arbitrarily wide if right- or left-branching)
- Natural-language dependencies tend to be short
- So even if you don't have enough data to model what the heads are, you might want to keep track of where they are.
165. Limiting dependency length
- Linear-time partial parsing: finite-state model of the root sequence, with bounded dependency length within each chunk (as above)
- Don't convert into an FSA!
  - Less structure sharing
  - Explosion of states for different stack configurations
  - Hard to get your parse back