Title: Weighted Deduction as a Programming Language
1. Weighted Deduction as a Programming Language
Co-authors on various parts of this work: Eric Goldlust, Noah A. Smith, John Blatz, Wes Filardo, Wren Thornton
CMU and Google, May 2008
2-3. An Anecdote from ACL'05
- Michael Jordan
4. Conclusions to draw from that talk
- Mike & his students are great.
- Graphical models are great. (because they're flexible)
- Gibbs sampling is great. (because it works with nearly any graphical model)
- Matlab is great. (because it frees up Mike and his students to doodle all day and then execute their doodles)
5. Could NLP be this nice?
- Mike & his students are great.
- Graphical models are great. (because they're flexible)
- Gibbs sampling is great. (because it works with nearly any graphical model)
- Matlab is great. (because it frees up Mike and his students to doodle all day and then execute their doodles)
6. Systems are big! Large-scale noisy data, complex models, search approximations, software engineering
7. Systems are big! Large-scale noisy data, complex models, search approximations, software engineering
- Maybe a bit smaller outside NLP
- But still big and carefully engineered
- And they will get bigger, e.g., as machine vision systems do more scene analysis and compositional object modeling
8. Systems are big! Large-scale noisy data, complex models, search approximations, software engineering
- Consequences:
- Barriers to entry
  - Small number of players
  - Significant investment to be taken seriously
  - Need to know & implement the standard tricks
- Barriers to experimentation
  - Too painful to tear up and reengineer your old system to try a cute idea of unknown payoff
- Barriers to education and sharing
  - Hard to study or combine systems
  - Potentially general techniques are described and implemented only one context at a time
9. How to spend one's life?
Didn't I just implement something like this last month?
- chart management / indexing
- cache-conscious data structures
- memory layout, file formats, integerization
- prioritization of partial solutions (best-first, A*)
- lazy k-best, forest reranking
- parameter management
- inside-outside formulas, gradients
- different algorithms for training and decoding
- conjugate gradient, annealing, ...
- parallelization
I thought computers were supposed to automate drudgery.
10. Solution
- Presumably, we ought to add another layer of abstraction.
- After all, this is CS.
- Hope to convince you that a substantive new layer exists.
- But what would it look like?
- What's shared by many programs?
11. Can toolkits help?
12. Can toolkits help?
- Hmm, there are a lot of toolkits.
- And they're big too.
- Plus, they don't always cover what you want.
  - Which is why people keep writing them.
- E.g., I love & use OpenFST and have learned lots from its implementation! But sometimes I also want ...
  - automata with > 2 tapes
  - infinite alphabets
  - parameter training
  - A* decoding
  - automatic integerization
  - automata defined by policy
  - mixed sparse/dense implementation (per state)
  - parallel execution
  - hybrid models (90% finite-state)
- So what is common across toolkits?
13. The Dyna language
- A toolkit's job is to abstract away the semantics, operations, and algorithms for a particular domain.
- In contrast, Dyna is domain-independent.
  - (like MapReduce, Bigtable, etc.)
- Manages data & computations that you specify.
- Toolkits or applications can be built on top.
14. Warning
- Lots more beyond this talk
- See http://dyna.org
  - read our papers
  - download an earlier prototype
  - sign up for updates by email
  - wait for the totally revamped next version
15. A Quick Sketch of Dyna
16. How you build a system (big picture slide)
cool model
→ practical equations (e.g., PCFG)
→ pseudocode (execution order):
     for width from 2 to n
       for i from 0 to n-width
         k = i+width
         for j from i+1 to k-1 ...
→ tuned C++ implementation (data structures, etc.)
17. How you build a system (big picture slide)
Dyna language specifies these equations.
- Most programs just need to compute some values from other values. Any order is ok.
  - Feed-forward! Dynamic programming! Message passing! (including Gibbs)
- Must quickly figure out what influences what.
  - Compute Markov blanket
  - Compute transitions in state machine
(pipeline as on slide 16: cool model → practical equations (PCFG) → pseudocode → tuned C++ implementation)
18. How you build a system (big picture slide)
- Dyna language specifies these equations.
- Most programs just need to compute some values from other values. Any order is ok.
- Some programs also need to update the outputs if the inputs change:
  - spreadsheets, makefiles, email readers
  - dynamic graph algorithms
  - EM and other iterative optimization
  - energy of a proposed configuration for MCMC
  - leave-one-out training of smoothing parameters
(pipeline as on slide 16)
19. How you build a system (big picture slide)
Compilation strategies (we'll come back to this)
(pipeline as on slide 16: cool model → practical equations (PCFG) → pseudocode → tuned C++ implementation)
20. Writing equations in Dyna
- int a.
- a = b * c.
  - a will be kept up to date if b or c changes.
- b += x.  b += y.  (equivalent to b = x+y)
  - b is a sum of two variables. Also kept up to date.
- c += z(1).  c += z(2).  c += z(3).
  c += z("four").  c += z(foo(bar,5)).
  ... or simply:  c += z(N).
  - c is a sum of all nonzero z(...) values. At compile time, we don't know how many!
21. More interesting use of patterns
- a += b * c.
  - scalar multiplication
- a(I) += b(I) * c(I).
  - pointwise multiplication
- a += b(I) * c(I).  (means a = sum over I of b(I)*c(I))
  - dot product; could be sparse
- a(I,K) += b(I,J) * c(J,K).  (sums b(I,J)*c(J,K) over J)
  - matrix multiplication; could be sparse
  - J is free on the right-hand side, so we sum over it (see the sketch below)
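To make the pattern concrete, here is a minimal Python sketch (plain Python, not Dyna) of what the matrix-multiplication rule asks the runtime to compute; the sparse dictionaries b and c and their entries are made-up examples.

```python
from collections import defaultdict

b = {(0, 0): 2.0, (0, 1): 3.0, (1, 1): 4.0}   # sparse: only nonzero entries
c = {(0, 0): 1.0, (1, 0): 5.0}

a = defaultdict(float)
for (i, j), bv in b.items():
    for (j2, k), cv in c.items():
        if j == j2:                # unify the shared variable J
            a[(i, k)] += bv * cv   # aggregate with +=

print(dict(a))                     # {(0, 0): 17.0, (1, 0): 20.0}
```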
22. Dyna vs. Prolog
- By now you may see what we're up to!
- Prolog has Horn clauses:
  a(I,K) :- b(I,J), c(J,K).
- Dyna has Horn equations:
  a(I,K) += b(I,J) * c(J,K).
Like Prolog: Allow nested terms. Syntactic sugar for lists, etc. Turing-complete.
Unlike Prolog: Charts, not backtracking! Compile → efficient C++ classes. Terms have values.
23. Some connections and intellectual debts
- Deductive parsing schemata (preferably weighted)
  - Goodman, Nederhof, Pereira, McAllester, Warren, Shieber, Schabes, Sikkel
- Deductive databases (preferably with aggregation)
  - Ramakrishnan, Zukowski, Freitag, Specht, Ross, Sagiv, ...
- Query optimization
  - Usually limited to decidable fragments, e.g., Datalog
- Theorem proving
  - Theorem provers, term rewriting, etc.
  - Nonmonotonic reasoning
- Programming languages
  - Efficient Prologs (Mercury, XSB, ...)
  - Probabilistic programming languages (PRISM, IBAL, ...)
  - Declarative networking (P2)
  - XML processing languages (XTatic, CDuce)
  - Functional logic programming (Curry, ...)
  - Self-adjusting computation, adaptive memoization (Acar et al.)
Increasing interest in resurrecting declarative and logic-based system specifications.
24. Example: CKY and Variations
25. The CKY inside algorithm in Dyna
phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal += phrase(s,0,sentence_length).

using namespace cky;
chart c;
c[rewrite(s,np,vp)] = 0.7;
c[word("Pierre",0,1)] = 1;
c[sentence_length] = 30;
cin >> c;          // get more axioms from stdin
cout << c[goal];   // print total weight of all parses
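For readers who want the same computation spelled out procedurally, here is a small Python sketch of the inside algorithm these three rules specify; the toy grammar and sentence are invented for illustration.

```python
from collections import defaultdict

rewrite_unary = {("np", "Pierre"): 1.0, ("vp", "sleeps"): 1.0}
rewrite_binary = {("s", "np", "vp"): 0.7}
words = ["Pierre", "sleeps"]
n = len(words)

phrase = defaultdict(float)
for i, w in enumerate(words):          # phrase(X,I,J) += rewrite(X,W)*word(W,I,J)
    for (x, w2), p in rewrite_unary.items():
        if w2 == w:
            phrase[(x, i, i + 1)] += p

for width in range(2, n + 1):          # phrase(X,I,J) += rewrite(X,Y,Z)
    for i in range(0, n - width + 1):  #   * phrase(Y,I,Mid) * phrase(Z,Mid,J)
        j = i + width
        for mid in range(i + 1, j):
            for (x, y, z), p in rewrite_binary.items():
                phrase[(x, i, j)] += p * phrase[(y, i, mid)] * phrase[(z, mid, j)]

goal = phrase[("s", 0, n)]             # goal += phrase(s,0,sentence_length)
print(goal)                            # 0.7 for this toy example
```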
26-27. Visual debugger: Browse the proof forest
28. Parameterization (program as on slide 25)
- rewrite(X,Y,Z) doesn't have to be an atomic parameter:
  urewrite(X,Y,Z) *= weight1(X,Y).
  urewrite(X,Y,Z) *= weight2(X,Z).
  urewrite(X,Y,Z) *= weight3(Y,Z).
  urewrite(X,Same,Same) *= weight4.
  urewrite(X) += urewrite(X,Y,Z).                    % normalizing constant
  rewrite(X,Y,Z) = urewrite(X,Y,Z) / urewrite(X).    % normalize
29. Related algorithms in Dyna?
phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal += phrase(s,0,sentence_length).
- Viterbi parsing?
- Logarithmic domain?
- Lattice parsing?
- Incremental (left-to-right) parsing?
- Log-linear parsing?
- Lexicalized or synchronous parsing?
- Binarized CKY?
- Earley's algorithm?
30. Related algorithms in Dyna? (same program and question list as slide 29)
- Viterbi parsing: just change each += to max= in the three rules.
31. Related algorithms in Dyna? (same program as slide 29)
- Logarithmic domain: store log weights, so each * in the rules becomes + (and the aggregation becomes a log-sum, or stays max= for Viterbi).
32. Related algorithms in Dyna? (same program as slide 29)
- Lattice parsing: no change to the program; just assert word axioms over lattice states instead of string positions, e.g. c[word("Pierre", state(5), state(9))] = 0.2 in place of c[word("Pierre", 0, 1)] = 1.
(figure: a small word lattice over states 5, 8, 9 with arcs such as Pierre/0.2, P/0.5, air/0.3)
33. Related algorithms in Dyna? (same program as slide 29)
- Incremental (left-to-right) parsing: just add words one at a time to the chart, and check at any time what can be derived from the words so far. Similarly, dynamic grammars.
34. Related algorithms in Dyna? (same program as slide 29)
- Log-linear parsing: again, no change to the Dyna program (parameterize the rewrite weights as on slide 28).
35. Related algorithms in Dyna? (same program as slide 29)
- Lexicalized or synchronous parsing: basically, just add extra arguments to the terms above.
36. Related algorithms in Dyna? (same program and question list as slide 29; next up: binarized CKY)
37. Rule binarization
phrase(X,I,J) += phrase(Y,I,Mid) * phrase(Z,Mid,J) * rewrite(X,Y,Z).
(figure: the three-way combination of Y over [I,Mid], Z over [Mid,J], and the rule X → Y Z)
38. Rule binarization
phrase(X,I,J) += phrase(Y,I,Mid) * phrase(Z,Mid,J) * rewrite(X,Y,Z).
Folding out a two-way subproblem is the same trick as in graphical models, constraint programming, and multi-way database join.
39. Program transformations
Blatz & Eisner (FG 2007): Lots of equivalent ways to write a system of equations! Transforming from one to another may improve efficiency. Many parsing tricks can be generalized into automatic transformations that help other programs, too!
(pipeline as on slide 16)
40. Related algorithms in Dyna? (same program and question list as slide 29; next up: Earley's algorithm)
41. Earley's algorithm in Dyna
Start from the CKY program (slide 29); Earley's algorithm falls out via the magic templates transformation (as noted by Minnen 1996).
42. Related algorithms in Dyna? (same program as slide 29)
- Epsilon symbols?
  word(epsilon,I,I) = 1.   (i.e., epsilons are freely available everywhere)
43. Some examples from my lab (as of 2006, w/prototype)
- Parsing using
  - factored dependency models (Dreyer, Smith & Smith CONLL'06)
  - with annealed risk minimization (Smith & Eisner EMNLP'06)
  - constraints on dependency length (Eisner & Smith IWPT'05)
  - unsupervised learning of deep transformations (see Eisner EMNLP'02)
  - lexicalized algorithms (see Eisner & Satta ACL'99, etc.)
- Grammar induction using
  - partial supervision (Dreyer & Eisner EMNLP'06)
  - structural annealing (Smith & Eisner ACL'06)
  - contrastive estimation (Smith & Eisner GIA'05)
  - deterministic annealing (Smith & Eisner ACL'04)
- Machine translation using
  - very large neighborhood search of permutations (Eisner & Tromble, NAACL-W'06)
  - loosely syntax-based MT (Smith & Eisner in prep.)
  - synchronous cross-lingual parsing (Smith & Smith EMNLP'04)
- Finite-state methods for morphology, phonology, IE, even syntax (see also Eisner ACL'03)
  - unsupervised cognate discovery (Schafer & Yarowsky '05, '06)
  - unsupervised log-linear models via contrastive estimation (Smith & Eisner ACL'05)
  - context-based morphological disambiguation (Smith, Smith & Tromble EMNLP'05)
Easy to try stuff out! Programs are very short & easy to change!
44. Can it express everything in NLP?
- Remember, it integrates tightly with C++, so you only have to use it where it's helpful, and write the rest in C++. Small is beautiful.
- Of course, it is Turing complete.
45. One Execution Strategy (forward chaining)
46. How you build a system (big picture slide)
Propagate updates from right-to-left through the equations. a.k.a. agenda algorithm, forward chaining, bottom-up inference, semi-naïve bottom-up.
Use a general method to go from the equations to execution.
(pipeline as on slide 16)
47. Bottom-up inference
Rules of program:
  s(I,K) += np(I,J) * vp(J,K).
  pp(I,K) += prep(I,J) * np(J,K).
The chart holds derived items with current values; the agenda holds pending updates.
Example: we pop the update np(3,5) += 0.3. The chart already held np(3,5) = 0.1, so its value grows by 0.3. (If np(3,5) hadn't been in the chart already, we would have added it.) We updated np(3,5); what else must therefore change?
- Query vp(5,K): matches vp(5,9) = 0.5 and vp(5,7) = 0.7, pushing updates s(3,9) += 0.15 and s(3,7) += 0.21 onto the agenda.
- Query prep(I,3): matches prep(2,3) = 1.0, pushing pp(2,5) += 0.3. No more matches to this query.
(see the Python sketch below)
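A minimal Python sketch of this agenda loop, hard-coding the single rule s(I,K) += np(I,J) * vp(J,K) and the chart contents from the walkthrough above; a real implementation would index the chart rather than scan it.

```python
from collections import defaultdict

chart = defaultdict(float)               # derived items with current values
chart.update({("np", 3, 5): 0.1, ("vp", 5, 9): 0.5, ("vp", 5, 7): 0.7})

agenda = [(("np", 3, 5), 0.3)]           # pending update: np(3,5) += 0.3
while agenda:
    (functor, i, j), delta = agenda.pop()
    chart[(functor, i, j)] += delta      # apply the update to the chart
    if functor == "np":                  # match np(I,J) against the rule body
        for (f2, j2, k), v in list(chart.items()):
            if f2 == "vp" and j2 == j:   # found vp(J,K): drive the rule forward
                agenda.append((("s", i, k), delta * v))

print({k: round(v, 4) for k, v in chart.items() if k[0] == "s"})
# {('s', 3, 9): 0.15, ('s', 3, 7): 0.21}, matching the walkthrough above
```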
48. How you build a system (big picture slide)
What's going on under the hood?
(pipeline as on slide 16)
49. Compiler provides
- agenda of pending updates (e.g., np(3,5) += 0.3)
- rules of program (e.g., s(I,K) += np(I,J) * vp(J,K))
- chart of derived items with current values
- copy, compare, hash terms fast, via integerization (interning)
- efficient storage of terms (given static type info): implicit storage, symbiotic storage, various data structures, support for indices, stack vs. heap, ...
50. Beware double-counting!
Rule: n(I,K) += n(I,J) * n(J,K).
Suppose the update n(5,5) += 0.2 pops off the agenda (an epsilon constituent). Besides combining with the chart's existing n(5,5) (current value 0.3), it can combine with another copy of itself to make yet another copy of itself; the query n(5,5)? must be handled carefully so the implementation doesn't double-count.
51. More issues in implementing inference
- Handling non-distributive updates
  - Replacement: p max= q(X). What if the current max, q(0), is reduced?
  - Retraction: p max= q(X). What if q(0) becomes unprovable (no value)?
  - Non-distributive rules: p += 1/q(X). Adding Δ to q(0) doesn't simply add a corresponding amount to p.
- Backpointers (hyperedges in the derivation forest)
  - Efficient storage, or on-demand recomputation
- Information flow between f(3), f(int X), f(X)
52. More issues in implementing inference
- User-defined priorities
  - priority(phrase(X,I,J)) = -(J-I).   % CKY (narrow to wide)
  - priority(phrase(X,I,J)) = phrase(X,I,J).   % uniform-cost
  - priority(phrase(X,I,J)) = phrase(X,I,J) * heuristic(X,I,J).   % A*
  - Can we learn a good priority function? (can be dynamic)
- User-defined parallelization
  - host(phrase(X,I,J)) = J.
  - Can we learn a host-choosing function? (can be dynamic)
- User-defined convergence tests
53. More issues in implementing inference
- Time-space tradeoffs
  - Which queries to index, and how?
  - Selective or temporary memoization
  - Can we learn a policy?
- On-demand computation (backward chaining)
  - Prioritizing subgoals; query planning
  - Safely invalidating memos
- Mixing forward-chaining and backward-chaining
  - Can we choose a good mixed strategy?
54. Parameter training
Objective function as a theorem's value (e.g., the inside algorithm computes the likelihood of the sentence).
- Maximize some objective function.
- Use Dyna to compute the function.
- Then how do you differentiate it?
  - for gradient ascent, conjugate gradient, etc.
  - the gradient of the log-partition function also tells us the expected counts for EM
- Two approaches supported:
  - Tape algorithm: remember the agenda order and run it backwards.
  - Program transformation: automatically derive the outside formulas.
55. Automatic differentiation via the gradient transform
- a += b * c.  gives rise to:
  - g(b) += g(a) * c.
  - g(c) += g(a) * b.
- Now g(x) denotes ∂f/∂x, f being the objective function.
- Examples:
  - Backprop for neural networks
  - Backward algorithm for HMMs and CRFs
  - Outside algorithm for PCFGs
The Dyna implementation also supports tape-based differentiation.
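A tiny Python sketch of the transform applied to one instance of a += b*c, with made-up values; it just checks that the derived rules produce the familiar product-rule gradients.

```python
b, c = 2.0, 5.0
a = b * c            # a += b * c.   (forward pass)
f = a                # take the objective f to be a itself

g = {"a": 1.0}       # g(x) denotes df/dx; seed with g(f) = 1
g["b"] = g["a"] * c  # g(b) += g(a) * c.
g["c"] = g["a"] * b  # g(c) += g(a) * b.

print(g["b"], g["c"])   # 5.0 2.0, matching d(b*c)/db = c and d(b*c)/dc = b
```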
56. More on Program Transformations
57. Program transformations
- An optimizing compiler would like the freedom to radically rearrange your code.
- Easier in a declarative language than in C.
  - Don't need to reconstruct the source program's intended semantics.
  - Also, the source program is much shorter.
- Search problem (open): find a good sequence of transformations (on a given workload).
58. Variable elimination
- Dechter's bucket elimination for hard constraints
- But how do we do it for soft constraints?
- How do we join soft constraints?
Bucket E: E ≠ D, E ≠ C
Bucket D: D ≠ A
Bucket C: C ≠ B
Bucket B: B ≠ A
Bucket A:
Join all constraints in E's bucket, yielding a new constraint on D (and C); now join all constraints in D's bucket; ...
(figure thanks to Rina Dechter)
59. Variable elimination via a folding transform
goal max= f1(A,B)*f2(A,C)*f3(A,D)*f4(C,E)*f5(D,E).
To eliminate E, join the constraints mentioning E, and project E out:
  tempE(C,D) max= f4(C,E)*f5(D,E).
- Undirected graphical model
(figure thanks to Rina Dechter)
60. Variable elimination via a folding transform
goal max= f1(A,B)*f2(A,C)*f3(A,D)*tempE(C,D).
To eliminate D, join the constraints mentioning D, and project D out:
  tempD(A,C) max= f3(A,D)*tempE(C,D).
  tempE(C,D) max= f4(C,E)*f5(D,E).
- Undirected graphical model
(figure thanks to Rina Dechter)
61. Variable elimination via a folding transform
goal max= f1(A,B)*f2(A,C)*tempD(A,C).
  tempC(A) max= f2(A,C)*tempD(A,C).
  tempD(A,C) max= f3(A,D)*tempE(C,D).
  tempE(C,D) max= f4(C,E)*f5(D,E).
- Undirected graphical model
(figure thanks to Rina Dechter)
62. Variable elimination via a folding transform
goal max= tempC(A)*f1(A,B).
  tempB(A) max= f1(A,B).
  tempC(A) max= f2(A,C)*tempD(A,C).
  tempD(A,C) max= f3(A,D)*tempE(C,D).
  tempE(C,D) max= f4(C,E)*f5(D,E).
- Undirected graphical model
(figure thanks to Rina Dechter)
63. Variable elimination via a folding transform
goal max= tempC(A)*tempB(A).
  tempB(A) max= f1(A,B).
  tempC(A) max= f2(A,C)*tempD(A,C).
  tempD(A,C) max= f3(A,D)*tempE(C,D).
  tempE(C,D) max= f4(C,E)*f5(D,E).
- Undirected graphical model
Could replace max= with += throughout, to compute the partition function Z.
(figure thanks to Rina Dechter)
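A short Python sketch of a single elimination step (the tempE fold above), using small made-up factors over the domain {1, 2}.

```python
dom = [1, 2]
f4 = {(c, e): 0.1 * c + 0.2 * e for c in dom for e in dom}   # f4(C,E)
f5 = {(d, e): 0.3 * d * e for d in dom for e in dom}         # f5(D,E)

tempE = {}
for c in dom:
    for d in dom:
        # join the two factors mentioning E, then project E out with max
        tempE[(c, d)] = max(f4[(c, e)] * f5[(d, e)] for e in dom)

print(tempE)   # a new factor on (C,D); E no longer appears anywhere
```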
64. Grammar specialization as an unfolding transform
- phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
- rewrite(s,np,vp) = 0.7.
Unfolding gives:
- phrase(s,I,J) += 0.7 * phrase(np,I,Mid) * phrase(vp,Mid,J).
Term flattening then gives:
- s(I,J) += 0.7 * np(I,Mid) * vp(Mid,J).
(actually handled implicitly by subtype storage declarations)
65. On-demand computation via a magic templates transform
- a :- b, c.  becomes:
  - a :- magic(a), b, c.
  - magic(b) :- magic(a).
  - magic(c) :- magic(a), b.
- Examples:
  - Earley's algorithm for parsing
  - Left-corner filter for parsing
  - On-the-fly composition of FSTs
- The weighted generalization turns out to be the generalized A* algorithm (coarse-to-fine search).
66. Speculation transformation (generalization of folding)
- Perform some portion of the computation speculatively, before we have all the inputs we need
- Fill those inputs in later
- Examples from parsing:
  - Gap passing in categorial grammar
    - Build an S/NP (a sentence missing its direct object NP)
  - Transform a parser so that it preprocesses the grammar
    - E.g., unary rule closure or epsilon closure
    - Build phrase(np,I,K) from a phrase(s,I,K) we don't have yet (so we haven't yet chosen a particular I, K)
  - Transform lexicalized context-free parsing from O(n^5) to O(n^3)
    - Add left children to a constituent we don't have yet (without committing to its width)
    - Derives the Eisner & Satta (1999) algorithm
67. A few more language details
- So you'll understand the examples
68. Terms (generalized from Prolog)
- These are the objects of the language
- Primitives
  - 3, 3.14159, "myUnicodeString"
  - user-defined primitive types
- Variables
  - X
  - int X (a type-restricted variable; types are tree automata)
- Compound terms
  - atom
  - atom(subterm1, subterm2, ...) e.g., f(g(h(3),X,Y), Y)
- Adding support for keyword arguments (similar to R, but must support unification)
69. Fixpoint semantics
- A Dyna program is a finite rule set that defines a partial function (map)
- The map only defines values for ground terms
  - Variables (X, Y, ...) let us define values for infinitely many ground terms
- Compute a map that satisfies the equations in the program
  - Not guaranteed to halt (Dyna is Turing-complete, unlike Datalog)
  - Not guaranteed to be unique
70. Fixpoint semantics
- A Dyna program is a finite rule set that defines a partial function (map)
- The map only defines values for ground terms
- The map may accept modifications at runtime:
  - runtime input
  - adjustments to input (dynamic algorithms)
  - retraction (remove input), detachment (forget input but preserve output)
71. Object-oriented features
- Maps are terms, i.e., first-class objects
- Maps can appear as subterms or as values
  - Useful for encapsulating data and passing it around
  - fst3 = compose(fst1, fst2).   % value of fst3 is a chart
  - forest = parse(sentence).
- Typed by their public interface
  - fst4->edge(Q,R) = fst3->edge(R,Q).
- Maps can be stored in files and loaded from files
  - Human-readable format (looks like a Dyna program)
  - Binary format (mimics in-memory layout)
72. Functional features: Auto-evaluation
- Terms can have values.
- So by default, subterms are evaluated in place.
- Arranged by a simple desugaring transformation:
  foo( X ) = 3*bar(X).
  (2 things to evaluate here: bar and *) desugars to
  foo( X ) :- B is bar(X), Result is 3*B, Result.
- Possible to suppress evaluation (&f(x)) or force it (*f(x))
- Some contexts also suppress evaluation.
- Variables are replaced with their bindings but not otherwise evaluated.
73. Functional features: Auto-evaluation
- Terms can have values.
- So by default, subterms are evaluated in place.
- Arranged by a simple desugaring transformation:
  foo(f(X)) = 3*bar(g(X)).
  desugars to
  foo( F ) :- F is f(X), G is g(X), B is bar(G), Result is 3*B, Result.
- Possible to suppress evaluation (&f(x)) or force it (*f(x))
- Some contexts also suppress evaluation.
- Variables are replaced with their bindings but not otherwise evaluated.
74. Other handy features
- fact(0) = 1.
  fact(int N) = N*fact(N-1) whenever N > 0.
- With user-defined syntactic sugar (Unicode supported):
  0! = 1.
  (int N)! = N*(N-1)! if N >= 1.
75. Aggregation operators
- f(X) = 3.      % immutable
- f(X) += 3.     % can be incremented later
- f(X) min= 3.   % can be reduced later
- f(X) := 3.     % can be arbitrarily changed later
- f(X) >= 3.     % like = but can be overridden by a more specific rule
76. Aggregation operators
- f(X) := 1.   % can be arbitrarily changed later
- Non-monotonic reasoning:
  flies(bird X) := true.
  flies(bird X) := penguin(X), false.   % overrides
  flies(bigbird) := false.              % also overrides
- Iterative update algorithms (EM, Gibbs, BP):
  a := init_a.
  a := updated_a(b).   % will override once b is proved
  b := updated_b(a).
77. Declarations (ultimately, should be chosen automatically)
- At the term level:
  - lazy vs. eager computational strategies
  - memoization and flushing strategies
  - prioritization, parallelization, etc.
- At the class level:
  - class = an implementation of a type
  - type = some subset of the term universe
  - class specifies storage strategy
  - classes may implement overlapping types
78. Frozen variables
- Dyna map semantics concerns ground terms.
- But we want to be able to reason about non-ground terms, too:
  - Manipulate Dyna rules (which are non-ground terms)
  - Work with classes of ground terms (specified by non-ground terms)
  - Queries, memoized queries
  - Memoization, updating, prioritization of updates, ...
- So, allow ground terms that contain frozen variables
  - Treatment under unification is beyond the scope of this talk
  - priority(f(X)) = f(X).        % for each X
  - priority(f(X)) = infinity.    % here f(X) is a frozen non-ground term
79. Gensyms
80. Some More Examples
- Shortest paths
- Neural nets
- Vector-space IR
- FST composition
- Generalized A* parsing
- n-gram smoothing
- Arc consistency
- Game trees
- Edit distance
81. Path-finding in Prolog
- pathto(1).   % the start of all paths
  pathto(V) :- edge(U,V), pathto(U).
- When is the query pathto(14) really inefficient?
- What's wrong with this swapped version?
  pathto(V) :- pathto(U), edge(U,V).
82. Shortest paths in Dyna
- Single source:
  pathto(start) min= 0.
  pathto(W) min= pathto(V) + edge(V,W).
  (can change min= to += to sum over paths, e.g., PageRank)
- All pairs:
  path(U,U) min= 0.
  path(U,W) min= path(U,V) + edge(V,W).
- This hint gives Dijkstra's algorithm (pqueue):
  priority(pathto(V) min= Delta) = Delta.
  (adding + heuristic(V) gives A*)
- Must also declare that pathto(V) has converged as soon as it pops off the priority queue; this is true if the heuristic is admissible.
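A Python sketch of the execution strategy the hint requests: Dijkstra's algorithm, where the priority of an update to pathto(V) is its value Delta. The graph is made up.

```python
import heapq

edge = {("s", "a"): 1.0, ("s", "b"): 4.0, ("a", "b"): 2.0}  # made-up graph
pathto = {}
pq = [(0.0, "s")]                      # priority(pathto(V) min= Delta) = Delta
while pq:
    d, v = heapq.heappop(pq)
    if v in pathto:                    # already converged when first popped
        continue
    pathto[v] = d
    for (u, w), cost in edge.items():  # pathto(W) min= pathto(V) + edge(V,W)
        if u == v:
            heapq.heappush(pq, (d + cost, w))

print(pathto)                          # {'s': 0.0, 'a': 1.0, 'b': 3.0}
```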
83. Neural networks in Dyna
- out(Node) = sigmoid(in(Node)).
- sigmoid(X) = 1/(1+exp(-X)).
- in(Node) += weight(Node,Previous)*out(Previous).
- in(Node) += input(Node).
- error += (out(Node)-target(Node))**2.
A recurrent neural net is ok too.
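A Python sketch of the forward pass these equations describe, on a made-up two-layer net. The memoized recursion below only handles the feedforward case; the Dyna equations as written also cover recurrent nets, solved as a fixpoint.

```python
import math
from functools import lru_cache

weight = {("h", "x1"): 0.5, ("h", "x2"): -0.3, ("y", "h"): 1.2}
inputs = {"x1": 1.0, "x2": 2.0}
target = {"y": 1.0}

@lru_cache(maxsize=None)
def out(node):
    # in(Node) += input(Node).  in(Node) += weight(Node,Prev) * out(Prev).
    total = inputs.get(node, 0.0)
    total += sum(w * out(prev) for (n, prev), w in weight.items() if n == node)
    return 1.0 / (1.0 + math.exp(-total))    # out(Node) = sigmoid(in(Node)).

error = sum((out(n) - t) ** 2 for n, t in target.items())
print(error)
```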
84. Vector-space IR in Dyna
- bestscore(Query) max= score(Query,Doc).
- score(Query,Doc) += tf(Query,Word)*tf(Doc,Word)*idf(Word).
- idf(Word) = 1/log(df(Word)).
- df(Word) += 1 whenever tf(Doc,Word) > 0.
85. Weighted FST composition in Dyna (epsilon-free case)
- start(A o B) = start(A) x start(B).
- stop(A o B, Q x R) = stop(A, Q) * stop(B, R).
- arc(A o B, Q1 x R1, Q2 x R2, In, Out) += arc(A, Q1, Q2, In, Match) * arc(B, R1, R2, Match, Out).
- Computes the full cross-product.
- Use the magic templates transform to build only reachable states.
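A Python sketch of the cross-product construction, on made-up one-arc machines; states of the composed machine are pairs (Q, R).

```python
A = {"start": 0, "stop": {1: 1.0},
     "arcs": [(0, 1, "a", "x", 0.5)]}            # (q1, q2, in, out, weight)
B = {"start": 0, "stop": {1: 1.0},
     "arcs": [(0, 1, "x", "z", 0.4)]}

comp = {"start": (A["start"], B["start"]), "stop": {}, "arcs": []}
for qa, wa in A["stop"].items():                 # stop(AoB, QxR) = stop(A,Q)*stop(B,R)
    for qb, wb in B["stop"].items():
        comp["stop"][(qa, qb)] = wa * wb
for (q1, q2, i, m1, w1) in A["arcs"]:            # arc(AoB, ...) += arc(A,...)*arc(B,...)
    for (r1, r2, m2, o, w2) in B["arcs"]:
        if m1 == m2:                             # middle tapes must match
            comp["arcs"].append(((q1, r1), (q2, r2), i, o, w1 * w2))

print(comp["arcs"])   # [((0, 0), (1, 1), 'a', 'z', 0.2)]
```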
86. n-gram smoothing in Dyna
- These values all update automatically during leave-one-out jackknifing.
- mle_prob(X,Y,Z) = count(X,Y,Z)/count(X,Y).
- smoothed_prob(X,Y,Z) = λ*mle_prob(X,Y,Z) + (1-λ)*mle_prob(Y,Z).
  - (for arbitrary-length contexts, could use lists)
- count_of_count(X,Y,count(X,Y,Z)) += 1.
  - Used for Good-Turing and Kneser-Ney smoothing.
  - E.g., count_of_count("the","big",1) is the number of word types that appeared exactly once after "the big".
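A Python sketch of the interpolated estimate above (Jelinek-Mercer style); all counts and the value of λ are made up.

```python
from collections import Counter

trigrams = Counter({("the", "big", "dog"): 2, ("the", "big", "cat"): 1})
bigrams = Counter({("the", "big"): 3, ("big", "dog"): 2, ("big", "cat"): 2})
unigrams = Counter({("big",): 4})
lam = 0.8

def mle3(x, y, z):
    return trigrams[(x, y, z)] / bigrams[(x, y)]   # count(X,Y,Z)/count(X,Y)

def mle2(y, z):
    return bigrams[(y, z)] / unigrams[(y,)]

def smoothed(x, y, z):                              # lam*mle3 + (1-lam)*mle2
    return lam * mle3(x, y, z) + (1 - lam) * mle2(y, z)

print(smoothed("the", "big", "dog"))   # 0.8*(2/3) + 0.2*(1/2) = 0.633...
```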
87. Arc consistency (= 2-consistency): agenda algorithm
Variables X, Y, Z, T with domains 1..3 and constraints X < Y, Y = Z, T < Z, X < T.
- X=3 has no support in Y, so kill it off
- Y=1 has no support in X, so kill it off
- Z=1 just lost its only support in Y, so kill it off
Note: these steps can occur in somewhat arbitrary order.
(slide thanks to Rina Dechter, modified)
88. Arc consistency in Dyna (AC-4 algorithm)
- Axioms (alternatively, could define them by rule):
  - indomain(Var:Val)   % define some values true
  - consistent(Var:Val, Var2:Val2)
    - Define to be true or false if Var, Var2 are co-constrained.
    - Otherwise, leave undefined (or define as true).
- For Var:Val to be kept, Val must be in-domain and also not ruled out by any Var2 that cares:
  - possible(Var:Val) &= indomain(Var:Val).
  - possible(Var:Val) &= supported(Var:Val, Var2).
- Var2 cares if it's co-constrained with Var:Val:
  - supported(Var:Val, Var2) |= consistent(Var:Val, Var2:Val2) & possible(Var2:Val2).
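A Python sketch of the fixpoint these rules compute, on a made-up instance: X, Y in 1..3 with the single constraint X < Y. It iterates to quiescence instead of using an agenda.

```python
domains = {"X": {1, 2, 3}, "Y": {1, 2, 3}}
constraints = {("X", "Y"): lambda x, y: x < y,
               ("Y", "X"): lambda y, x: x < y}   # same constraint, both views

changed = True
while changed:
    changed = False
    for (v, w), ok in constraints.items():
        for val in list(domains[v]):
            # supported(V:val, W) |= consistent(V:val, W:val2) & possible(W:val2)
            if not any(ok(val, val2) for val2 in domains[w]):
                domains[v].discard(val)     # possible(V:val) becomes false
                changed = True

print(domains)   # {'X': {1, 2}, 'Y': {2, 3}}
```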
89. Propagating bounds consistency in Dyna
- E.g., suppose we have a constraint A <= B (as well as other constraints on A). Then:
  - maxval(a) min= maxval(b).   % if B's max is reduced, then A's should be too
  - minval(b) max= minval(a).   % by symmetry
- Similarly, if C+D = 10, then:
  - maxval(c) min= 10-minval(d).
  - maxval(d) min= 10-minval(c).
  - minval(c) max= 10-maxval(d).
  - minval(d) max= 10-maxval(c).
90. Game-tree analysis
- All values represent total advantage to player 1 starting at this board.
- How good is Board for player 1, if it's player 1's move?
  best(Board) max= stop(player1, Board).
  best(Board) max= move(player1, Board, NewBoard) + worst(NewBoard).
- How good is Board for player 1, if it's player 2's move? (player 2 is trying to make player 1 lose: zero-sum game)
  worst(Board) min= stop(player2, Board).
  worst(Board) min= move(player2, Board, NewBoard) + best(NewBoard).
- How good for player 1 is the starting board?
  goal = best(Board) if start(Board).
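A Python sketch of these mutually recursive equations on a made-up game tree; leaf payoffs play the role of the stop axioms.

```python
moves1 = {"root": ["L", "R"]}                # move(player1, Board, NewBoard)
moves2 = {"L": ["L1"], "R": ["R1", "R2"]}    # move(player2, Board, NewBoard)
stop_value = {"L1": 3, "R1": -1, "R2": 5}    # payoff to player 1 at leaves

def best(board):                             # player 1 to move: maximize
    if board in stop_value:
        return stop_value[board]
    return max(worst(nb) for nb in moves1[board])

def worst(board):                            # player 2 to move: minimize
    if board in stop_value:
        return stop_value[board]
    return min(best(nb) for nb in moves2[board])

print(best("root"))   # max(min(3), min(-1, 5)) = max(3, -1) = 3
```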
91. Edit distance between two strings
(traditional picture: the DP alignment grid)
92. Edit distance in Dyna
- dist([],[]) = 0.
- dist([X|Xs],Ys) min= dist(Xs,Ys) + delcost(X).
- dist(Xs,[Y|Ys]) min= dist(Xs,Ys) + inscost(Y).
- dist([X|Xs],[Y|Ys]) min= dist(Xs,Ys) + substcost(X,Y).
- substcost(L,L) = 0.
- result = dist([c,l,a,r,a], [c,a,c,a]).
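A memoized Python sketch of the same recurrences, assuming unit insertion, deletion, and substitution costs; the Dyna program allows arbitrary delcost/inscost/substcost.

```python
from functools import lru_cache

def edit_distance(s, t):
    @lru_cache(maxsize=None)
    def dist(i, j):                    # distance between s[i:] and t[j:]
        if i == len(s) and j == len(t):
            return 0                   # dist([],[]) = 0.
        best = float("inf")
        if i < len(s):
            best = min(best, dist(i + 1, j) + 1)      # delete s[i]
        if j < len(t):
            best = min(best, dist(i, j + 1) + 1)      # insert t[j]
        if i < len(s) and j < len(t):
            sub = 0 if s[i] == t[j] else 1            # substcost(L,L) = 0.
            best = min(best, dist(i + 1, j + 1) + sub)
        return best
    return dist(0, 0)

print(edit_distance("clara", "caca"))  # 2: delete l, substitute r -> c
```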
93. Edit distance in Dyna on input lattices
- dist(S,T) min= dist(S,T,Q,R) + S->final(Q) + T->final(R).
- dist(S,T, S->start, T->start) min= 0.
- dist(S,T, I2, J) min= dist(S,T, I, J) + S->arc(I,I2,X) + delcost(X).
- dist(S,T, I, J2) min= dist(S,T, I, J) + T->arc(J,J2,Y) + inscost(Y).
- dist(S,T, I2, J2) min= dist(S,T, I, J) + S->arc(I,I2,X) + T->arc(J,J2,Y) + substcost(X,Y).
- substcost(L,L) = 0.
- result = dist(lattice1, lattice2).
- lattice1:
  start = state(0).
  arc(state(0),state(1),c) = 0.3.
  arc(state(1),state(2),l) = 0.
  ...
  final(state(5)).
94. Generalized A* parsing (CKY)
- Get Viterbi outside probabilities.
  - Isomorphic to automatic differentiation (reverse mode).
- outside(goal) = 1.
- outside(Body) max= outside(Head) whenever rule(Head max= Body).
- outside(phrase B) max= (phrase A) * outside((A*B)).
- outside(phrase A) max= outside((A*B)) * (phrase B).
- Prioritize by outside estimates from a coarsened grammar:
  - priority(phrase P) = (P) * outside(coarsen(P)).
  - priority(phrase P) = 1 if P = coarsen(P).   % can't coarsen any further
95. Generalized A* parsing (CKY)
- Coarsen nonterminals:
  coa("PluralNoun") = "Noun".
  coa("Noun") = "Anything".
  coa("Anything") = "Anything".
- Coarsen phrases:
  coarsen(phrase(X,I,J)) = phrase(coa(X),I,J).
- Make successively coarser grammars; each is an admissible estimate for the next-finer one:
  coarsen(rewrite(X,Y,Z)) = rewrite(coa(X),coa(Y),coa(Z)).
  coarsen(rewrite(X,Word)) = rewrite(coa(X),Word).
  coarsen(Rule) max= Rule.
  (i.e., Coarse max= Rule whenever Coarse = coarsen(Rule).)
96. Lightweight information interchange?
- Easy for Dyna terms to represent:
  - XML data (Dyna types are analogous to DTDs)
  - RDF triples (semantic web)
  - annotated corpora
  - ontologies
  - graphs, automata, social networks
- Also provides facilities missing from the semantic web:
  - queries against these data
  - state generalizations (rules, defaults) using variables
  - aggregate data and draw conclusions
  - keep track of provenance (backpointers)
  - keep track of confidence (weights)
- Map = deductive database in a box
  - Like a spreadsheet, but more powerful, safer to maintain, and can communicate with the outside world
97. How fast was the prototype version?
- It used one-size-fits-all strategies
- Asymptotically optimal, but:
  - 4 times slower than Mark Johnson's inside-outside
  - 4-11 times slower than Klein & Manning's Viterbi parser
- A 5-6x speedup is not too hard to get
98. Are you going to make it faster? (yup!)
- Static analysis
- Mixed storage strategies
  - store X in an array
  - store Y in a hash
- Mixed inference strategies
  - don't store Z (compute on demand)
- Choose strategies by:
  - user declarations
  - automatically, by execution profiling
99. Summary
- AI systems are too hard to write and modify.
- We need a new layer of abstraction.
- Dyna is a language for computation (no I/O).
  - Simple, powerful idea: define values from other values by weighted logic.
  - Produces classes that interface with C++, etc.
- Compiler supports many implementation strategies.
  - Tries to abstract and generalize many tricks.
  - Fitting a strategy to the workload is a great opportunity for learning!
- Natural fit to fine-grained parallelization.
- Natural fit to web services.
100. Dyna contributors!
- Prototype (available)
  - Eric Goldlust (core compiler), Noah A. Smith (parameter training), Markus Dreyer (front-end processing), David A. Smith, Roy Tromble, Asheesh Laroia
- All-new version (under development)
  - Nathaniel Filardo (core compiler), Wren Ng Thornton (core compiler), Jay Van Der Wall (source language parser), John Blatz (transformations and inference), Johnny Graettinger (early design), Eric Northup (early design)
- Dynasty hypergraph browser (usable)
  - Michael Kornbluh (initial version), Gordon Woodhull (graph layout), Samuel Huang (latest version), George Shafer, Raymond Buse, Constantinos Michael
101. FIN
102. The case for Little Languages
- declarative programming
- small is beautiful
103. Sapir-Whorf hypothesis
- Language shapes thought
  - At least, it shapes conversation
- Computer language shapes thought
  - At least, it shapes experimental research
  - Lots of cute ideas that we never pursue
  - Or if we do pursue them, it takes 6-12 months to implement on large-scale data
  - Have we turned into a lab science?
104. Declarative Specifications
- State what is to be done
- (How should the computer do it? Turn that over to a general solver that handles the specification language.)
- Hundreds of domain-specific "little languages" out there. Some have sophisticated solvers.
105. dot (www.graphviz.org)
digraph g {
  graph [rankdir = "LR"];
  node [fontsize = "16", shape = "ellipse"];
  "node0" [label = "<f0> 0x10ba8 | <f1>", shape = "record"];
  "node1" [label = "<f0> 0xf7fc4380 | <f1> | <f2> -1", shape = "record"];
  "node0":f0 -> "node1":f0 [id = 0];
  "node0":f1 -> "node2":f0 [id = 1];
  "node1":f0 -> "node3":f0 [id = 2];
}
The spec just lists nodes and edges. What's the hard part? Making a nice layout! Actually, it's NP-hard.
106. dot (www.graphviz.org): the rendered layout
107-108. LilyPond (www.lilypond.org)
109. Declarative Specs in NLP
- Regular expression (for an FST toolkit)
- Grammar (for a parser)
- Feature set (for a maxent distribution, SVM, etc.)
- Graphical model (DBNs for ASR, IE, etc.)
Claim of this talk: Sometimes it's best to peek under the shiny surface. Declarative methods are still great, but should be layered; we need them one level lower, too.
110. Declarative Specs in NLP
- Regular expression (for an FST toolkit)
- Grammar (for a parser)
- Feature set (for a maxent distribution, SVM, etc.)
111. New examples of dynamic programming in NLP
- Parameterized finite-state machines
112-113. Parameterized FSMs
- An FSM whose arc probabilities depend on parameters: they are formulas.
114. Parameterized FSMs
- An FSM whose arc probabilities depend on parameters: they are formulas.
Expert first: Construct the FSM (topology & parameterization).
Automatic takes over: Given training data, find parameter values that optimize arc probs.
115. Parameterized FSMs
Knight & Graehl 1997 - transliteration
116. Parameterized FSMs
Knight & Graehl 1997 - transliteration
Would like to get some of that expert knowledge in here. Use probabilistic regexps like (a*.7 b) +.5 (ab*.6). If the probabilities are variables, (a*x b) +y (ab*z), then the arc weights of the compiled machine are nasty formulas. (Especially after minimization!)
117. Finite-State Operations
- Projection GIVES YOU marginal distribution:
  domain( p(x,y) )  yields  p(x)
118. Finite-State Operations
- Probabilistic union GIVES YOU mixture model:
  p(x) +0.3 q(x)   (i.e., 0.3 p(x) + 0.7 q(x))
119. Finite-State Operations
- Probabilistic union GIVES YOU mixture model:
  p(x) +λ q(x)
- Learn the mixture parameter λ!
120. Finite-State Operations
- Composition GIVES YOU chain rule:
  p(x|y) o p(y|z)
- The most popular statistical FSM operation
- Cross-product construction
121. Finite-State Operations
- Concatenation and probabilistic closure HANDLE unsegmented text:
  p(x) q(x)    and    p(x)*0.3
- Just glue together machines for the different segments, and let them figure out how to align with the text
122. Finite-State Operations
- Directed replacement MODELS noise or postprocessing:
  p(x,y) o ...
- Resulting machine compensates for noise or postprocessing
123. Finite-State Operations
- Intersection GIVES YOU product models:
  p(x) & q(x)
  - e.g., exponential / maxent, perceptron, Naïve Bayes, ...
- Need a normalization op too; it computes Σ_x f(x), the pathsum or partition function
- Cross-product construction (like composition)
124. Finite-State Operations
- Conditionalization (new operation):
  condit( p(x,y) )  yields  p(y | x)
- Resulting machine can be composed with other distributions: p(y | x) o q(x)
125. New examples of dynamic programming in NLP
- Parameterized infinite-state machines
126. Universal grammar as a parameterized FSA over an infinite state space
127. New examples of dynamic programming in NLP
- More abuses of finite-state machines
128. Huge-alphabet FSAs for OT phonology
Gen proposes all candidates that include this input.
(figure: candidates drawn on underlying and surface tiers of C and V slots, annotated with features such as voi and velar)
129. Huge-alphabet FSAs for OT phonology
Encode this candidate as a string: at each moment, need to describe what's going on on many tiers.
(figure: the same candidate, with each string position labeled by the C/V slots and features, e.g. voi, velar, active on every tier at that moment)
130. Directional Best Paths construction
- Keep best output string for each input string
- Yields a new transducer (size up to 3^n)
- For input abc: abc, axc. For input abd: axd.
- Must allow the red arc just if the next input is d (see figure)
131. Minimization of semiring-weighted FSAs
- New definition of λ for pushing:
  - λ(q) = weight of the shortest path from q, breaking ties alphabetically on input symbols
- Computation is simple, well-defined, independent of (K, ⊗)
- Breadth-first search back from final states: compute λ(q) in O(1) time as soon as we visit q. The whole algorithm is linear.
- Faster than finding the min-weight path à la Mohri.
(figure: an automaton with arcs labeled a, b, c, d at distance 2 from the final state; λ(q) = k ⊗ λ(r))
132. New examples of dynamic programming in NLP
133. Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English:
beaucoup d'enfants donnent un baiser à Sam → kids kiss Sam quite often
134. Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.
(figure: aligned tree pair; French words donnent (give), un (a), baiser (kiss), à (to), Sam, beaucoup (lots), d' (of), enfants (kids) aligned with English kiss, Sam, kids, quite, often, with NP nodes)
beaucoup d'enfants donnent un baiser à Sam → kids kiss Sam quite often
135. Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange. The alignment shows how the trees are generated synchronously from little trees ...
beaucoup d'enfants donnent un baiser à Sam → kids kiss Sam quite often
136. New examples of dynamic programming in NLP
- Bilexical parsing in O(n^3)
- (with Giorgio Satta)
137. Lexicalized CKY
(figure: lexicalized parse of "Mary loves the girl outdoors")
138. Lexicalized CKY is O(n^5) not O(n^3)
(figure: ambiguous attachments such as "... advocate visiting relatives" vs. "... hug visiting relatives"; combining constituents B over [i,j] and C over [j+1,k] gives O(n^3) span combinations, and choosing the two heads adds two more factors of n)
139. Idea 1
- Combine B with what C?
- must try different-width C's (vary k)
- must try differently-headed C's (vary h)
- Separate these!
140. Idea 1
(figure: the old CKY way)
141. Idea 2
142. Idea 2
- Combine what B and C?
- must try different-width C's (vary k)
- must try different midpoints j
- Separate these!
143. Idea 2
(figure: the old CKY way)
144. Idea 2
(figure: the old CKY way vs. the new decomposition of A into B and C around head h, midpoint j, and endpoint k)
145. An O(n^3) algorithm (with G. Satta)
(figure: parse of "Mary loves the girl outdoors")
147. New examples of dynamic programming in NLP
- O(n)-time partial parsing by limiting dependency length
- (with Noah A. Smith)
148. Short-Dependency Preference
- A word's dependents (adjuncts, arguments) tend to fall near it in the string.
149. Length of a dependency = surface distance
(figure: example sentence with dependency arcs of lengths 3, 1, 1, 1)
150. 50% of English dependencies have length 1, another 20% have length 2, 10% have length 3 ...
(figure: histogram of fraction of all dependencies vs. length)
151. Related Ideas
- Score parses based on what's between a head and child (Collins, 1997; Zeman, 2004; McDonald et al., 2005)
- Assume short → faster human processing (Church, 1980; Gibson, 1998)
- "Attach low" heuristic for PPs (English) (Frazier, 1979; Hobbs and Bear, 1990)
- Obligatory and optional re-orderings (English) (see paper)
152. Going to Extremes
- Longer dependencies are less likely.
- What if we eliminate them completely?
153. Hard Constraints
- Disallow dependencies between words of distance > b ...
- Risk: best parse contrived, or no parse at all!
- Solution: allow fragments (partial parsing; Hindle, 1990, inter alia).
- Why not model the sequence of fragments?
154. Building a Vine SBG Parser
- Grammar generates a sequence of trees hanging from the wall symbol $
- Parser recognizes sequences of trees without long dependencies
- Need to modify the training data so the model is consistent with the parser.
155. (figure: a Penn Treebank dependency tree for "According to some estimates, the rule changes would cut insider filings by more than a third." with each dependency labeled by its surface length, up to lengths 8 and 9)
156. (figure: the same tree with all dependencies of length > 4 cut; b = 4)
157. (figure: the same tree with all dependencies of length > 3 cut; b = 3)
158. (figure: the same tree with all dependencies of length > 2 cut; b = 2)
159. (figure: the same tree with all dependencies of length > 1 cut; b = 1)
160. (figure: b = 0; every dependency is cut, leaving each word as its own fragment)
161. Vine Grammar is Regular
- Even for small b, bunches can grow to arbitrary size
- But arbitrary center embedding is out
162. Vine Grammar is Regular
- Could compile into an FSA and get O(n) parsing!
- Problem: what's the grammar constant? EXPONENTIAL!
(figure: FSA states for "According to some estimates , the rule changes would cut insider ..." must track, e.g., that "insider" has no parent yet and that "cut" and "would" can have more children)
163. Alternative
- Instead, we adapt an SBG chart parser, which implicitly shares fragments of stack state, to the vine case, eliminating unnecessary work.
164. Limiting dependency length
- Linear-time partial parsing:
  a finite-state model of the root sequence (e.g., NP S NP), with bounded dependency length within each chunk (but a chunk could be arbitrarily wide if right- or left-branching)
- Natural-language dependencies tend to be short
- So even if you don't have enough data to model what the heads are, you might want to keep track of where they are.
165. Limiting dependency length
- Linear-time partial parsing: finite-state model of the root sequence, with bounded dependency length within each chunk (as above)
- Don't convert into an FSA!
  - Less structure sharing
  - Explosion of states for different stack configurations
  - Hard to get your parse back