Title: Declarative Specification of NLP Systems
1. Declarative Specification of NLP Systems
Student co-authors on various parts of this work: Eric Goldlust, Noah A. Smith, John Blatz, Roy Tromble
IBM, May 2006
2. An Anecdote from ACL'05
- Michael Jordan
3. An Anecdote from ACL'05
- Michael Jordan
4. Conclusions to draw from that talk
- Mike and his students are great.
- Graphical models are great (because they're flexible).
- Gibbs sampling is great (because it works with nearly any graphical model).
- Matlab is great (because it frees up Mike and his students to doodle all day and then execute their doodles).
5. Could NLP be this nice?
- Mike and his students are great.
- Graphical models are great (because they're flexible).
- Gibbs sampling is great (because it works with nearly any graphical model).
- Matlab is great (because it frees up Mike and his students to doodle all day and then execute their doodles).
6. Could NLP be this nice?
- Parts of it already are:
- Language modeling
- Binary classification (e.g., SVMs)
- Finite-state transductions
- Linear-chain graphical models
Toolkits available: you don't have to be an expert.
But other parts aren't: context-free and beyond; machine translation.
Efficient parsers and MT systems are complicated and painful to write.
7. Could NLP be this nice?
- This talk: a toolkit that's general enough for these cases.
- (stretches from finite-state to Turing machines)
- Dyna
But other parts aren't: context-free and beyond; machine translation.
Efficient parsers and MT systems are complicated and painful to write.
8. Warning
- Lots more beyond this talk
- see the EMNLP'05 and FG'06 papers
- see http://dyna.org
- (download, documentation)
- sign up for updates by email
- wait for the totally revamped next version
9. The case for little languages
- declarative programming
- small is beautiful
10. Sapir-Whorf hypothesis
- Language shapes thought
- At least, it shapes conversation
- Computer language shapes thought
- At least, it shapes experimental research
- Lots of cute ideas that we never pursue
- Or if we do pursue them, it takes 6-12 months to implement on large-scale data
- Have we turned into a lab science?
11. Declarative Specifications
- State what is to be done
- (How should the computer do it? Turn that over to a general solver that handles the specification language.)
- Hundreds of domain-specific little languages out there. Some have sophisticated solvers.
12. dot (www.graphviz.org)
digraph g {
  graph [rankdir = "LR"];
  node [fontsize = "16", shape = "ellipse"];
  edge [];
  "node0" [label = "<f0> 0x10ba8 | <f1>", shape = "record"];
  "node1" [label = "<f0> 0xf7fc4380 | <f1> | <f2> -1", shape = "record"];
  "node0":f0 -> "node1":f0 [id = 0];
  "node0":f1 -> "node2":f0 [id = 1];
  "node1":f0 -> "node3":f0 [id = 2];
}
You just declare the nodes and edges. What's the hard part? Making a nice layout! Actually, it's NP-hard.
13. dot (www.graphviz.org)
14. LilyPond (www.lilypond.org)
15. LilyPond (www.lilypond.org)
16. Declarative Specs in NLP
- Regular expression (for a FST toolkit)
- Grammar (for a parser)
- Feature set (for a maxent distribution, SVM, etc.)
- Graphical model (DBNs for ASR, IE, etc.)
Claim of this talk: Sometimes it's best to peek under the shiny surface. Declarative methods are still great, but should be layered; we need them one level lower, too.
17. Declarative Specs in NLP
- Regular expression (for a FST toolkit)
- Grammar (for a parser)
- Feature set (for a maxent distribution, SVM, etc.)
18. Declarative Specification of Algorithms
19. How you build a system (big picture slide)
cool model
practical equations (e.g., for a PCFG)
pseudocode (execution order): for width from 2 to n: for i from 0 to n-width: k = i+width; for j from i+1 to k-1: ...
tuned C++ implementation (data structures, etc.)
20. Wait a minute ...
Didn't I just implement something like this last month?
chart management / indexing; cache-conscious data structures; prioritization of partial solutions (best-first, A*); parameter management; inside-outside formulas; different algorithms for training and decoding; conjugate gradient, annealing, ...; parallelization?
I thought computers were supposed to automate drudgery.
21. How you build a system (big picture slide)
cool model
- Dyna language specifies these equations.
- Most programs just need to compute some values from other values. Any order is OK.
- Some programs also need to update the outputs if the inputs change:
- spreadsheets, makefiles, email readers
- dynamic graph algorithms
- EM and other iterative optimization
- leave-one-out training of smoothing params
practical equations (e.g., for a PCFG)
pseudocode (execution order): for width from 2 to n: for i from 0 to n-width: k = i+width; for j from i+1 to k-1: ...
tuned C++ implementation (data structures, etc.)
22. How you build a system (big picture slide)
cool model
practical equations (e.g., for a PCFG)
Compilation strategies (we'll come back to this)
pseudocode (execution order): for width from 2 to n: for i from 0 to n-width: k = i+width; for j from i+1 to k-1: ...
tuned C++ implementation (data structures, etc.)
23. Writing equations in Dyna
- int a.
- a = b * c.
- a will be kept up to date if b or c changes.
- b += x.  b += y.   (equivalent to b = x+y.)
- b is a sum of two variables. Also kept up to date.
- c += z(1).  c += z(2).  c += z(3).
- c += z(four).  c += z(foo(bar,5)).
- c += z(N).
- c is a sum of all nonzero z(...) values. At compile time, we don't know how many!
24. More interesting use of patterns
- a = b * c.   (scalar multiplication)
- a(I) = b(I) * c(I).   (pointwise multiplication)
- a += b(I) * c(I).   means a = sum over I of b(I)*c(I)   (dot product; could be sparse)
- a(I,K) += b(I,J) * c(J,K).   means a(I,K) = sum over J of b(I,J)*c(J,K)   (matrix multiplication; could be sparse)
- J is free on the right-hand side, so we sum over it
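To make that last rule concrete, here is a minimal Python sketch (not Dyna, with made-up toy values) of what a(I,K) += b(I,J) * c(J,K) computes: the shared variable J must match across the two factors and, being absent from the head a(I,K), is summed over, i.e., a sparse matrix multiplication.

    # Sparse "matrix multiplication" read directly off the Dyna rule.
    from collections import defaultdict

    b = {(0, "x"): 2.0, (0, "y"): 3.0, (1, "y"): 4.0}   # b(I,J), sparse
    c = {("x", 7): 5.0, ("y", 7): 1.0}                   # c(J,K), sparse

    a = defaultdict(float)                               # a(I,K), default 0
    for (i, j1), bv in b.items():
        for (j2, k), cv in c.items():
            if j1 == j2:                                 # J is shared, so it must unify
                a[(i, k)] += bv * cv                     # and gets summed over

    print(dict(a))   # {(0, 7): 13.0, (1, 7): 4.0}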
25. Dyna vs. Prolog
- By now you may see what we're up to!
- Prolog has Horn clauses:
- a(I,K) :- b(I,J), c(J,K).
- Dyna has Horn equations:
- a(I,K) += b(I,J) * c(J,K).
Like Prolog: allows nested terms; syntactic sugar for lists, etc.; Turing-complete.
Unlike Prolog: charts, not backtracking! Compiles to efficient C++ classes. Integrates with your C++ code.
26. The CKY inside algorithm in Dyna
:- double item = 0.
:- bool length = false.
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(s,0,N) if length(N).

using namespace cky;
chart c;
c[rewrite(s,np,vp)] = 0.7;
c[word(Pierre,0,1)] = 1;
c[length(30)] = true;   // 30-word sentence
cin >> c;               // get more axioms from stdin
cout << c[goal];        // print total weight of all parses
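For readers who want the computation spelled out, here is a rough Python rendering (mine, not produced by Dyna) of the inside computation that the program above denotes, on a hypothetical two-word sentence with toy rule weights.

    from collections import defaultdict

    sentence = ["Pierre", "talks"]
    unary  = {("np", "Pierre"): 1.0, ("vp", "talks"): 0.5}   # rewrite(X,W)
    binary = {("s", "np", "vp"): 0.7}                        # rewrite(X,Y,Z)

    n = len(sentence)
    constit = defaultdict(float)                             # constit(X,I,J)

    # constit(X,I,J) += word(W,I,J) * rewrite(X,W).
    for i, w in enumerate(sentence):
        for (x, word), p in unary.items():
            if word == w:
                constit[(x, i, i + 1)] += p

    # constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
    for width in range(2, n + 1):            # build wider spans from narrower ones
        for i in range(0, n - width + 1):
            j = i + width
            for mid in range(i + 1, j):
                for (x, y, z), p in binary.items():
                    constit[(x, i, j)] += constit[(y, i, mid)] * constit[(z, mid, j)] * p

    goal = constit[("s", 0, n)]              # goal += constit(s,0,N) if length(N).
    print(goal)                              # 0.35 = 0.7 * 1.0 * 0.5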
27. Visual debugger: browse the proof forest
(shows ambiguity and shared substructure)
28. Related algorithms in Dyna?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(s,0,N) if length(N).
- Viterbi parsing?
- Logarithmic domain?
- Lattice parsing?
- Incremental (left-to-right) parsing?
- Log-linear parsing?
- Lexicalized or synchronous parsing?
- Earley's algorithm?
- Binarized CKY?
29. Related algorithms in Dyna?
(Same Dyna program and checklist as on slide 28.)
Annotation over the three rules: max, max, max, i.e., for Viterbi parsing each += becomes max=.
30. Related algorithms in Dyna?
(Same Dyna program and checklist as on slide 28.)
Annotations over the three rules: max, max, max and log, log, log, i.e., switch the aggregation and values to the max / log domain.
31. Related algorithms in Dyna?
(Same Dyna program and checklist as on slide 28.)
For lattice parsing: instead of axioms like c[word(Pierre,0,1)] = 1, the word items come from a weighted lattice, e.g., arcs Pierre/0.2, P/0.5, air/0.3 between states 5, 8, and 9, so positions are lattice states such as state(5) and state(9).
32. Related algorithms in Dyna?
(Same Dyna program and checklist as on slide 28.)
Just add words one at a time to the chart; check at any time what can be derived from the words so far. Similarly, dynamic grammars.
33. Related algorithms in Dyna?
(Same Dyna program and checklist as on slide 28.)
Again, no change to the Dyna program.
34. Related algorithms in Dyna?
(Same Dyna program and checklist as on slide 28.)
Basically, just add extra arguments to the terms above.
35. Related algorithms in Dyna?
(Same Dyna program and checklist as on slide 28.)
36. Earley's algorithm in Dyna
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(s,0,N) if length(N).
magic templates transformation (as noted by Minnen 1996)
37. Program transformations
cool model
Blatz & Eisner (FG 2006): Lots of equivalent ways to write a system of equations! Transforming from one to another may improve efficiency. Many parsing tricks can be generalized into automatic transformations that help other programs, too!
practical equations (e.g., for a PCFG)
pseudocode (execution order): for width from 2 to n: for i from 0 to n-width: k = i+width; for j from i+1 to k-1: ...
tuned C++ implementation (data structures, etc.)
38. Related algorithms in Dyna?
(Same Dyna program and checklist as on slide 28.)
39. Rule binarization
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
[Figure: the rule combines a Y spanning I..Mid with a Z spanning Mid..J to build an X spanning I..J.]
40. Rule binarization
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
(cf. graphical models, constraint programming, multi-way database join)
41. More program transformations
- Examples that add new semantics:
- Compute gradient (e.g., derive outside algorithm from inside)
- Compute upper bounds for A* (e.g., Klein & Manning ACL'03)
- Coarse-to-fine (e.g., Johnson & Charniak NAACL'06)
- Examples that preserve semantics:
- On-demand computation, by analogy with Earley's algorithm
- On-the-fly composition of FSTs
- Left-corner filter for parsing
- Program specialization as unfolding, e.g., compile out the grammar
- Rearranging computations, by analogy with categorial grammar
- Folding reinterpreted as slashed categories
- Speculative computation using slashed categories
- abstract away repeated computation to do it once only, by analogy with unary rule closure or epsilon-closure
- derives the Eisner & Satta ACL'99 O(n^3) bilexical parser
42. How you build a system (big picture slide)
cool model
practical equations (e.g., for a PCFG)
Propagate updates from right-to-left through the equations, a.k.a. the agenda algorithm, forward chaining, bottom-up inference, semi-naive bottom-up. Use a general method.
pseudocode (execution order): for width from 2 to n: for i from 0 to n-width: k = i+width; for j from i+1 to k-1: ...
tuned C++ implementation (data structures, etc.)
43. Bottom-up inference
Rules of program:
  s(I,K) += np(I,J) * vp(J,K).
  pp(I,K) += prep(I,J) * np(J,K).
[Animation: an agenda of pending updates feeds a chart of derived items with current values. We pop the update np(3,5) += 0.3; the chart already held np(3,5) = 0.1 (if np(3,5) hadn't been in the chart already, we would have added it). We updated np(3,5): what else must therefore change? Matching the rules against the chart, the query vp(5,K)? finds vp(5,9) = 0.5 and vp(5,7) = 0.7, so the updates s(3,9) += 0.15 and s(3,7) += 0.21 go onto the agenda; the query prep(I,3)? finds prep(2,3) = 1.0, so pp(2,5) += 0.3 goes onto the agenda; then there are no more matches to this query.]
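A minimal Python sketch of the loop being animated, under my own simplifications (only the np update triggers matches here, and the chart is scanned linearly, whereas the real runtime indexes it):

    from collections import defaultdict

    chart = defaultdict(float)
    chart.update({("np", 3, 5): 0.1, ("vp", 5, 9): 0.5, ("vp", 5, 7): 0.7,
                  ("prep", 2, 3): 1.0})
    agenda = [(("np", 3, 5), 0.3)]          # pending update: np(3,5) += 0.3

    while agenda:
        (label, i, j), delta = agenda.pop()
        chart[(label, i, j)] += delta       # apply the popped update to the chart
        if label == "np":
            # s(I,K) += np(I,J) * vp(J,K): the popped np(i,j) is the first factor.
            for (lab2, j2, k), v in list(chart.items()):
                if lab2 == "vp" and j2 == j:
                    agenda.append((("s", i, k), delta * v))
            # pp(I,K) += prep(I,J) * np(J,K): the popped np(i,j) is the second factor.
            for (lab2, h, i2), v in list(chart.items()):
                if lab2 == "prep" and i2 == i:
                    agenda.append((("pp", h, j), v * delta))
        # (updates to vp or prep items would trigger the symmetric matches; omitted)

    print(chart[("s", 3, 9)], chart[("s", 3, 7)], chart[("pp", 2, 5)])
    # -> 0.15 0.21 0.3, the same updates shown in the figure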
44. How you build a system (big picture slide)
cool model
practical equations (e.g., for a PCFG)
What's going on under the hood?
pseudocode (execution order): for width from 2 to n: for i from 0 to n-width: k = i+width; for j from i+1 to k-1: ...
tuned C++ implementation (data structures, etc.)
45. Compiler provides ...
(same picture: agenda of pending updates, chart of derived items with current values, rules such as s(I,K) += np(I,J) * vp(J,K), update np(3,5) += 0.3)
- copy, compare, hash terms fast, via integerization (interning)
- efficient storage of terms (use native C types, symbiotic storage, garbage collection, serialization, ...)
46. Beware double-counting!
Rule of program: n(I,K) += n(I,J) * n(J,K).
[Animation: the chart holds an epsilon constituent n(5,5) = 0.2, and the agenda pops the update n(5,5) += 0.3. The query n(5,5)? matches the item itself, so the update combines with the item (and with another copy of itself) to make still more updates to n(5,5).]
47. Parameter training
Objective function as a theorem's value; e.g., the inside algorithm computes the likelihood of the sentence.
- Maximize some objective function.
- Use Dyna to compute the function.
- Then how do you differentiate it?
- for gradient ascent, conjugate gradient, etc.
- the gradient also tells us the expected counts for EM!
- Two approaches:
- Program transformation: automatically derive the outside formulas.
- Back-propagation: run the agenda algorithm backwards.
- works even with pruning, early stopping, etc.
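For concreteness, here is the standard identity (my notation, not from the slides) behind "the gradient also tells us the expected counts for EM", assuming the goal item's value is the inside score Z(theta) summed over derivations d, each weighted by the product of its rule weights:

    % Z(\theta) = \sum_d \prod_r \theta_r^{c_r(d)}, where c_r(d) counts uses of rule r in d.
    \[
      \mathbb{E}_{d \sim p}[c_r(d)]
      \;=\; \sum_d \frac{\prod_{r'} \theta_{r'}^{\,c_{r'}(d)}}{Z(\theta)}\, c_r(d)
      \;=\; \frac{\theta_r}{Z(\theta)}\,\frac{\partial Z(\theta)}{\partial \theta_r}
      \;=\; \frac{\partial \log Z(\theta)}{\partial \log \theta_r}
    \]

So the outside algorithm, or reverse-mode differentiation of the inside computation, yields exactly the counts EM needs.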
48. What can Dyna do beyond CKY?
49. Some examples from my lab
- Parsing using ...
- factored dependency models (Dreyer, Smith & Smith CoNLL'06)
- with annealed risk minimization (Smith & Eisner EMNLP'06)
- constraints on dependency length (Eisner & Smith IWPT'05)
- unsupervised learning of deep transformations (see Eisner EMNLP'02)
- lexicalized algorithms (see Eisner & Satta ACL'99, etc.)
- Grammar induction using ...
- partial supervision (Dreyer & Eisner EMNLP'06)
- structural annealing (Smith & Eisner ACL'06)
- contrastive estimation (Smith & Eisner GIA'05)
- deterministic annealing (Smith & Eisner ACL'04)
- Machine translation using ...
- very large neighborhood search of permutations (Eisner & Tromble NAACL-W'06)
- loosely syntax-based MT (Smith & Eisner, in prep.)
- synchronous cross-lingual parsing (Smith & Smith EMNLP'04)
- Finite-state methods for morphology, phonology, IE, even syntax
- unsupervised cognate discovery (Schafer & Yarowsky '05, '06)
- unsupervised log-linear models via contrastive estimation (Smith & Eisner ACL'05)
- context-based morphological disambiguation (Smith, Smith & Tromble EMNLP'05)
Easy to try stuff out! Programs are very short and easy to change!
- (see also Eisner ACL'03)
50. Can it express everything in NLP?
- Remember, it integrates tightly with C++, so you only have to use it where it's helpful, and write the rest in C++. Small is beautiful.
- We're currently extending the class of allowed formulas beyond the semiring
- cf. Goodman (1999)
- will be able to express smoothing, neural nets, etc.
- Of course, it is Turing complete.
51. Smoothing in Dyna
- mle_prob(X,Y,Z) = count(X,Y,Z) / count(X,Y).      % count(X,Y) is the context count
- smoothed_prob(X,Y,Z) = lambda*mle_prob(X,Y,Z) + (1-lambda)*mle_prob(Y,Z).
- for arbitrary n-grams, can use lists
- count_count(N) += 1 whenever N is count(Anything).
- updates automatically during leave-one-out jackknifing
52. Information retrieval in Dyna
- score(Doc) += tf(Doc,Word) * tf(Query,Word) * idf(Word).
- idf(Word) = 1/log(df(Word)).
- df(Word) += 1 whenever tf(Doc,Word) > 0.
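A small Python sketch of the same three rules on toy data (document names and counts are made up; counting the query itself toward document frequency is a simplification I chose here):

    import math
    from collections import defaultdict

    tf = {  # tf(Doc, Word), including the query "q"
        ("d1", "parsing"): 3, ("d1", "dyna"): 1,
        ("d2", "parsing"): 1, ("d2", "translation"): 2,
        ("q",  "parsing"): 1, ("q",  "dyna"): 2,
    }

    # df(Word) += 1 whenever tf(Doc, Word) > 0.
    df = defaultdict(int)
    for (doc, word), count in tf.items():
        if count > 0:
            df[word] += 1

    # idf(Word) = 1 / log(df(Word)).  (skip words with df = 1, where log is 0)
    idf = {w: 1.0 / math.log(n) for w, n in df.items() if n > 1}

    # score(Doc) += tf(Doc, Word) * tf(Query, Word) * idf(Word).
    score = defaultdict(float)
    for (doc, word), count in tf.items():
        if doc != "q" and word in idf and ("q", word) in tf:
            score[doc] += count * tf[("q", word)] * idf[word]

    print(dict(score))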
53. Neural networks in Dyna
- out(Node) = sigmoid(in(Node)).
- in(Node) += input(Node).
- in(Node) += weight(Node,Kid) * out(Kid).
- error += (out(Node) - target(Node))^2 if ?target(Node).
- Recurrent neural net is OK
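A tiny Python sketch of the same value propagation on a hypothetical three-layer network (node names, weights, and the explicit topological order are my own):

    import math

    inputs  = {"x1": 1.0, "x2": -2.0}                 # input(Node)
    weights = {("h", "x1"): 0.5, ("h", "x2"): 0.25,   # weight(Node, Kid)
               ("y", "h"): 2.0}
    targets = {"y": 1.0}                              # target(Node), where defined

    def sigmoid(z): return 1.0 / (1.0 + math.exp(-z))

    out = {}
    for node in ["x1", "x2", "h", "y"]:               # kids processed before parents
        in_val = inputs.get(node, 0.0)                # in(Node) += input(Node)
        for (parent, kid), w in weights.items():
            if parent == node:
                in_val += w * out[kid]                # in(Node) += weight(Node,Kid)*out(Kid)
        out[node] = sigmoid(in_val)                   # out(Node) = sigmoid(in(Node))

    # error += (out(Node) - target(Node))^2 wherever a target is defined.
    error = sum((out[n] - t) ** 2 for n, t in targets.items())
    print(out["y"], error)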
54. Game-tree analysis in Dyna
- goal = best(Board) if start(Board).
- best(Board) max= stop(player1, Board).
- best(Board) max= move(player1, Board, NewBoard) + worst(NewBoard).
- worst(Board) min= stop(player2, Board).
- worst(Board) min= move(player2, Board, NewBoard) + best(NewBoard).
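A compact Python analogue of the max/min recursion on a made-up toy game; memoizing the two mutually recursive functions plays the role of the chart:

    from functools import lru_cache

    # moves[player][board] -> boards reachable in one move;
    # stop[player][board]  -> payoff if that player stops there.
    moves = {1: {"start": ["a", "b"]}, 2: {"a": ["a2"], "b": []}}
    stop  = {1: {"a2": 3}, 2: {"a": 1, "b": 5}}

    @lru_cache(maxsize=None)
    def best(board):                       # player 1 maximizes
        options = [stop[1][board]] if board in stop[1] else []
        options += [worst(nb) for nb in moves[1].get(board, [])]
        return max(options)

    @lru_cache(maxsize=None)
    def worst(board):                      # player 2 minimizes
        options = [stop[2][board]] if board in stop[2] else []
        options += [best(nb) for nb in moves[2].get(board, [])]
        return min(options)

    print(best("start"))                   # goal = best(start) -> 5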
55. Weighted FST composition in Dyna (epsilon-free case)
- :- bool item = false.
- start(A o B, Q x R) |= start(A, Q) & start(B, R).
- stop(A o B, Q x R) |= stop(A, Q) & stop(B, R).
- arc(A o B, Q1 x R1, Q2 x R2, In, Out) |= arc(A, Q1, Q2, In, Match) & arc(B, R1, R2, Match, Out).
- Inefficient? How do we fix this?
56. Constraint programming (arc consistency)
- :- bool indomain = false.
- :- bool consistent = true.
- variable(Var) |= indomain(Var:Val).
- possible(Var:Val) &= indomain(Var:Val).
- possible(Var:Val) &= support(Var:Val, Var2) whenever variable(Var2).
- support(Var:Val, Var2) |= possible(Var2:Val2) & consistent(Var:Val, Var2:Val2).
57. Edit distance in Dyna, version 1
- letter1(c,0,1).  letter1(l,1,2).  letter1(a,2,3).  ...        % "clara"
- letter2(c,0,1).  letter2(a,1,2).  letter2(c,2,3).  ...        % "caca"
- end1(5).  end2(4).  delcost = 1.  inscost = 1.  substcost = 1.
- align(0,0) = 0.
- align(I1,J2) min= align(I1,I2) + inscost(L2) whenever letter2(L2,I2,J2).
- align(J1,I2) min= align(I1,I2) + delcost(L1) whenever letter1(L1,I1,J1).
- align(J1,J2) min= align(I1,I2) + substcost(L1,L2) whenever letter1(L1,I1,J1) & letter2(L2,I2,J2).
- align(J1,J2) min= align(I1,I2) whenever letter1(L,I1,J1) & letter2(L,I2,J2).
- goal = align(N1,N2) whenever end1(N1) & end2(N2).
58. Edit distance in Dyna, version 2
- input([c,l,a,r,a], [c,a,c,a]) = 0.
- delcost = 1.  inscost = 1.  substcost = 1.
- alignupto(Xs,Ys) min= input(Xs,Ys).
- alignupto(Xs,Ys) min= alignupto([X|Xs],Ys) + delcost.
- alignupto(Xs,Ys) min= alignupto(Xs,[Y|Ys]) + inscost.
- alignupto(Xs,Ys) min= alignupto([X|Xs],[Y|Ys]) + substcost.
- alignupto(Xs,Ys) min= alignupto([A|Xs],[A|Ys]).
- goal min= alignupto([], []).
How about different costs for different letters?
59. Edit distance in Dyna, version 2
- input([c,l,a,r,a], [c,a,c,a]) = 0.
- delcost = 1.  inscost = 1.  substcost = 1.
- alignupto(Xs,Ys) min= input(Xs,Ys).
- alignupto(Xs,Ys) min= alignupto([X|Xs],Ys) + delcost.
- alignupto(Xs,Ys) min= alignupto(Xs,[Y|Ys]) + inscost.
- alignupto(Xs,Ys) min= alignupto([X|Xs],[Y|Ys]) + substcost.
- alignupto(Xs,Ys) min= alignupto([L|Xs],[L|Ys]).
- goal min= alignupto([], []).
Annotations: make the costs letter-specific: delcost(X), inscost(Y), substcost(X,Y).
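The same min-cost computation in conventional bottom-up Python, indexing prefixes rather than list suffixes, with letter-specific cost functions as the (X), (Y), (X,Y) annotations suggest (the zero-cost copy rule becomes the x == y case):

    def delcost(x):      return 1
    def inscost(y):      return 1
    def substcost(x, y): return 0 if x == y else 1   # the "free copy" rule is the 0 case

    def edit_distance(s1, s2):
        n1, n2 = len(s1), len(s2)
        # align[i][j] = min cost of aligning s1[:i] with s2[:j]
        align = [[0.0] * (n2 + 1) for _ in range(n1 + 1)]
        for i in range(n1 + 1):
            for j in range(n2 + 1):
                if i == 0 and j == 0:
                    continue                                  # align(0,0) = 0
                best = float("inf")
                if i > 0:
                    best = min(best, align[i-1][j] + delcost(s1[i-1]))
                if j > 0:
                    best = min(best, align[i][j-1] + inscost(s2[j-1]))
                if i > 0 and j > 0:
                    best = min(best, align[i-1][j-1] + substcost(s1[i-1], s2[j-1]))
                align[i][j] = best
        return align[n1][n2]                                  # goal

    print(edit_distance("clara", "caca"))   # 2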
60. Is it fast enough? (sort of)
- Asymptotically efficient
- 4 times slower than Mark Johnson's inside-outside
- 4-11 times slower than Klein & Manning's Viterbi parser
61. Are you going to make it faster? (yup!)
- Currently rewriting the term classes to match hand-tuned code
- Will support mix-and-match implementation strategies:
- store X in an array
- store Y in a hash
- don't store Z (compute on demand)
- Eventually, choose strategies automatically by execution profiling
62. Synopsis: your idea -> experimental results, fast!
- Dyna is a language for computation (no I/O).
- Especially good for dynamic programming.
- It tries to encapsulate the black art of NLP.
- Much prior work in this vein:
- Deductive parsing schemata (preferably weighted): Goodman, Nederhof, Pereira, Warren, Shieber, Schabes, Sikkel
- Deductive databases (preferably with aggregation): Ramakrishnan, Zukowski, Freitag, Specht, Ross, Sagiv, ...
- Probabilistic programming languages (implemented): Zhao, Sato, Pfeffer (also efficient Prolog-ish languages)
63. Dyna contributors!
- Jason Eisner
- Eric Goldlust, Eric Northup, Johnny Graettinger (compiler backend)
- Noah A. Smith (parameter training)
- Markus Dreyer, David Smith (compiler frontend)
- Mike Kornbluh, George Shafer, Gordon Woodhull, Constantinos Michael, Ray Buse (visual debugger)
- John Blatz (program transformations)
- Asheesh Laroia (web services)
64. New examples of dynamic programming in NLP
65. Some examples from my lab
(Same list of examples as on slide 49.)
66. New examples of dynamic programming in NLP
- Parameterized finite-state machines
67. Parameterized FSMs
- An FSM whose arc probabilities depend on parameters: they are formulas.
68. Parameterized FSMs
- An FSM whose arc probabilities depend on parameters: they are formulas.
69. Parameterized FSMs
- An FSM whose arc probabilities depend on parameters: they are formulas.
The expert first constructs the FSM (topology and parameterization). Then the automatic part takes over: given training data, find parameter values that optimize the arc probs.
70. Parameterized FSMs
Knight & Graehl 1997: transliteration
71. Parameterized FSMs
Knight & Graehl 1997: transliteration
Would like to get some of that expert knowledge in here. Use probabilistic regexps like (a .7 b) .5 (ab .6). If the probabilities are variables, (a x b) y (ab z), then the arc weights of the compiled machine are nasty formulas. (Especially after minimization!)
72. Finite-State Operations
- Projection GIVES YOU marginal distribution: domain( p(x,y) )
73. Finite-State Operations
- Probabilistic union GIVES YOU mixture model: p(x) and q(x) mixed with weight 0.3
74. Finite-State Operations
- Probabilistic union GIVES YOU mixture model: p(x) and q(x) mixed with weight lambda
- Learn the mixture parameter lambda!
75. Finite-State Operations
- Composition GIVES YOU chain rule: p(x|y) o p(y|z)
- The most popular statistical FSM operation
- Cross-product construction
76. Finite-State Operations
- Concatenation, probabilistic closure HANDLE unsegmented text (e.g., p(x) q(x) concatenated, or p(x) under probabilistic closure with weight 0.3)
- Just glue together machines for the different segments, and let them figure out how to align with the text
77. Finite-State Operations
- Directed replacement MODELS noise or postprocessing: compose p(x,y) with the noise/postprocessing machine
- Resulting machine compensates for noise or postprocessing
78. Finite-State Operations
- Intersection GIVES YOU product models: p(x) & q(x)
- e.g., exponential / maxent, perceptron, Naive Bayes, ...
- Need a normalization op too: computes the sum over x of f(x), the pathsum or partition function
- Cross-product construction (like composition)
79. Finite-State Operations
- Conditionalization (a new operation): condit( p(x,y) )
- Resulting machine can be composed with other distributions: p(y | x) o q(x)
80. New examples of dynamic programming in NLP
- Parameterized infinite-state machines
81. Universal grammar as a parameterized FSA over an infinite state space
82. New examples of dynamic programming in NLP
- More abuses of finite-state machines
83. Huge-alphabet FSAs for OT phonology
Gen proposes all candidates that include this input.
[Figure: a candidate with underlying and surface tiers of C and V slots plus feature tiers such as voi and velar; Gen produces many such candidates, etc.]
84. Huge-alphabet FSAs for OT phonology
Encode this candidate as a string: at each moment, need to describe what's going on on many tiers.
[Figure: the same candidate read left to right across its C/V and feature (voi, velar) tiers.]
85. Directional Best Paths construction
- Keep best output string for each input string
- Yields a new transducer (size up to about 3^n)
- For input abc: abc, axc.  For input abd: axd.
- Must allow the red arc just if the next input is d.
86. Minimization of semiring-weighted FSAs
- New definition of lambda for pushing:
- lambda(q) = weight of the shortest path from q, breaking ties alphabetically on input symbols
- Computation is simple, well-defined, independent of the semiring (K, x)
- Breadth-first search back from final states: compute lambda(q) in O(1) time as soon as we visit q, via lambda(q) = k x lambda(r). Whole algorithm is linear.
- Faster than finding the min-weight path a la Mohri.
87. New examples of dynamic programming in NLP
88. Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English.
beaucoup d'enfants donnent un baiser a Sam -> kids kiss Sam quite often
89. Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.
[Figure: the two trees, glossing donnent (give), baiser (kiss), a (to), un (a), beaucoup (lots), d' (of), enfants (kids), aligned with kiss, Sam, kids, quite, often; NP nodes marked.]
beaucoup d'enfants donnent un baiser a Sam -> kids kiss Sam quite often
90. Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange. The alignment shows how the trees are generated synchronously from little trees ...
beaucoup d'enfants donnent un baiser a Sam -> kids kiss Sam quite often
91. New examples of dynamic programming in NLP
- Bilexical parsing in O(n^3)
- (with Giorgio Satta)
92. Lexicalized CKY
[Figure: lexicalized CKY chart over the words "Mary loves the girl outdoors", headed by loves.]
93. Lexicalized CKY is O(n^5), not O(n^3)
[Figure: attachment ambiguity, e.g., "... advocate visiting relatives" vs. "... hug visiting relatives"; combining a B spanning i..j with a C spanning j+1..k already gives O(n^3) combinations, and the two head words add further factors.]
94. Idea 1
- Combine B with what C?
- must try different-width C's (vary k)
- must try differently-headed C's (vary h)
- Separate these!
95. Idea 1
(the old CKY way)
96. Idea 2
97. Idea 2
- Combine what B and C?
- must try different-width C's (vary k)
- must try different midpoints j
- Separate these!
98. Idea 2
(the old CKY way)
99. Idea 2
(the old CKY way)
[Figure: intermediate items that track the head h and only one boundary (j or k) at a time, so widths and midpoints are varied separately.]
100. An O(n^3) algorithm (with G. Satta)
[Figure: the O(n^3) chart for "Mary loves the girl outdoors".]
101. (image-only slide)
102. New examples of dynamic programming in NLP
- O(n)-time partial parsing by limiting dependency length
- (with Noah A. Smith)
103. Short-Dependency Preference
- A word's dependents (adjuncts, arguments) tend to fall near it in the string.
104. Length of a dependency = surface distance
[Figure: an example dependency tree whose dependency lengths are 3, 1, 1, 1.]
105. 50% of English dependencies have length 1, another 20% have length 2, 10% have length 3, ...
[Plot: fraction of all dependencies vs. length.]
106. Related Ideas
- Score parses based on what's between a head and child (Collins, 1997; Zeman, 2004; McDonald et al., 2005)
- Assume short -> faster human processing (Church, 1980; Gibson, 1998)
- "Attach low" heuristic for PPs (English) (Frazier, 1979; Hobbs and Bear, 1990)
- Obligatory and optional re-orderings (English) (see paper)
107. Going to Extremes
Longer dependencies are less likely. What if we eliminate them completely?
108. Hard Constraints
- Disallow dependencies between words of distance > b ...
- Risk: best parse contrived, or no parse at all!
- Solution: allow fragments (partial parsing; Hindle, 1990, inter alia).
- Why not model the sequence of fragments?
109. Building a Vine SBG Parser
- Grammar generates a sequence of trees from the left wall
- Parser recognizes sequences of trees without long dependencies
- Need to modify training data so the model is consistent with the parser.
110. [Figure: a dependency tree from the Penn Treebank for "According to some estimates, the rule changes would cut insider filings by more than a third.", with each dependency labeled by its surface length (up to 9).]
111. [The same tree with b = 4: dependencies longer than 4 are removed, leaving fragments.]
112. [The same tree with b = 3.]
113. [The same tree with b = 2.]
114. [The same tree with b = 1.]
115. [The same tree with b = 0: every word becomes its own fragment.]
116. Vine Grammar is Regular
- Even for small b, bunches can grow to arbitrary size
- But arbitrary center embedding is out
117. Vine Grammar is Regular
- Could compile into an FSA and get O(n) parsing!
- Problem: what's the grammar constant? EXPONENTIAL
[Figure: an FSA reading "According to some estimates, the rule changes would cut insider ..." must remember, e.g., that insider has no parent yet and that cut and would can have more children.]
118. Alternative
- Instead, we adapt an SBG chart parser, which implicitly shares fragments of stack state, to the vine case, eliminating unnecessary work.
119. Limiting dependency length
- Linear-time partial parsing
[Figure: a finite-state model of the root sequence (e.g., NP S NP), with bounded dependency length within each chunk (but a chunk could be arbitrarily wide if right- or left-branching).]
- Natural-language dependencies tend to be short
- So even if you don't have enough data to model what the heads are, you might want to keep track of where they are.
120. Limiting dependency length
- Linear-time partial parsing
- Don't convert into an FSA!
- Less structure sharing
- Explosion of states for different stack configurations
- Hard to get your parse back
[Same figure: finite-state model of the root sequence, bounded dependency length within each chunk.]
121. Limiting dependency length
- Linear-time partial parsing
[Figure: NP S NP chunks.] Each piece is at most k words wide. No dependencies between pieces. Finite-state model of the sequence -> linear time! O(k^2 n)
122. Limiting dependency length
- Linear-time partial parsing
Each piece is at most k words wide. No dependencies between pieces. Finite-state model of the sequence -> linear time! O(k^2 n)
123. Quadratic Recognition/Parsing
[Figure: Eisner-Satta-style chart items building up to goal; only construct trapezoids such that k - i <= b, so the O(n^3) combinations become O(n^2 b) and O(n b^2).]
124. [Figure: the O(nb) vine construction for b = 4 on "According to some, the new changes would cut insider filings by more than a third." (from the Penn Treebank); all fragments have width at most 4.]
125. Parsing Algorithm
- Same grammar constant as Eisner and Satta (1999)
- O(n^3) -> O(n b^2) runtime
- Includes some overhead (low-order term) for constructing the vine
- Reality check ... is it worth it?
126. F-measure and runtime of a limited-dependency-length parser (POS sequences)
[Plot.]
127. Precision and recall of a limited-dependency-length parser (POS sequences)
[Plot.]
128. Results: Penn Treebank
[Plot: evaluation against the original ungrafted Treebank, non-punctuation only, for b = 1 up to b = 20.]
129. Results: Chinese Treebank
[Plot: evaluation against the original ungrafted Treebank, non-punctuation only, for b = 1 up to b = 20.]
130. Results: TIGER Corpus
[Plot: evaluation against the original ungrafted Treebank, non-punctuation only, for b = 1 up to b = 20.]
131. Type-Specific Bounds
- b can be specific to dependency type
- e.g., b(V-O) can be longer than b(S-V)
- b specific to parent, child, direction
- gradually tighten based on training data
132.
- English: 50% runtime, no loss
- Chinese: 55% runtime, no loss
- German: 44% runtime, 2% loss
133. Related Work
- Nederhof (2000) surveys finite-state approximation of context-free languages.
- CFG -> FSA
- We limit all dependency lengths (not just center-embedding), and derive weights from the Treebank (not by approximation).
- Chart parser -> reasonable grammar constant.
134. Softer Modeling of Dep. Length
When running the parsing algorithm, just multiply in these probabilities at the appropriate time.
[Figure: a parse with length probabilities such as p(3 | r, a, L), p(2 | r, b, L), p(1 | b, c, R), p(1 | r, d, R), p(1 | d, e, R), p(1 | e, f, R) multiplied in; the resulting model p is DEFICIENT.]
135. Generating with SBGs
- Start with left wall
- Generate root w0
- Generate left children w-1, w-2, ..., w-l from the FSA lambda(w0)
- Generate right children w1, w2, ..., wr from the FSA rho(w0)
- Recurse on each wi, for i in -l, ..., -1, 1, ..., r, sampling alpha_i (steps 2-4)
- Return alpha_-l ... alpha_-1 w0 alpha_1 ... alpha_r
[Figure: w0 with left children w-1, w-2, ..., w-l and right children w1, w2, ..., wr.]
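A schematic Python sketch of this generative story; the "automata" here are just one coin flip per possible child, which is cruder than the real lambda/rho FSAs, and the words and probabilities are invented:

    import random

    LEFT  = {"takes": [("It", 0.9)], "It": [], "two": [], "to": []}                 # stand-in for lambda_w
    RIGHT = {"takes": [("two", 0.8), ("to", 0.7)], "It": [], "two": [], "to": []}   # stand-in for rho_w

    def gen_children(head, table):
        subtrees = []
        for child, p in table[head]:          # crude: one independent coin flip per child
            if random.random() < p:
                subtrees.append(gen_tree(child))
        return subtrees

    def gen_tree(head):                        # returns the yield alpha_-l ... w0 ... alpha_r
        left, right = gen_children(head, LEFT), gen_children(head, RIGHT)
        words = []
        for sub in reversed(left):             # left children were generated inside-out
            words += sub
        words.append(head)
        for sub in right:
            words += sub
        return words

    random.seed(0)
    print(" ".join(gen_tree("takes")))         # e.g. "It takes two to"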
136. Very Simple Model for lambda_w and rho_w
We parse POS tag sequences, not words.
p(child | first, parent, direction), p(stop | first, parent, direction), p(child | not first, parent, direction), p(stop | not first, parent, direction)
[Figure: lambda(takes) and rho(takes) for "It takes two to ...".]
137. Baseline
Test-set recall (%): 73, 61, 77.  Test-set runtime (items/word): 90, 149, 49.  (Three test sets.)
138. Modeling Dependency Length
Baseline: recall (%) 73, 61, 77; runtime (items/word) 90, 149, 49.
With length modeling: recall (%) 76, 62, 75; runtime (items/word) 67, 103, 31.
Change: recall +4.1, +1.6, -2.6; runtime -26%, -31%, -37%.
139. Conclusion
- Modeling dependency length can cut runtime of simple models by 26-37%,
- with effects ranging from -3 to +4 on recall.
- (Loss on recall perhaps due to deficient/MLE estimation.)
140. Future Work
- apply to state-of-the-art parsing models
- better parameter estimation
- applications: MT, IE, grammar induction
141. This Talk in a Nutshell
[Figure: length of a dependency = surface distance; example lengths 3, 1, 1, 1.]
- Empirical results (English, Chinese, German):
- Hard constraints cut runtime in half or more with no accuracy loss (English, Chinese), or by 44% with -2.2 accuracy (German).
- Soft constraints affect accuracy of simple models by -3 to 24 and cut runtime by 25 to 40%.
- Formal results:
- A hard bound b on dependency length results in a regular language and allows O(n b^2) parsing.
142. New examples of dynamic programming in NLP
- Grammar induction by initially limiting dependency length
- (with Noah A. Smith)
143. Soft bias toward short dependencies
p(t, x_i) = Z(delta)^(-1) * p_T(t, x_i) * exp(-delta * S),  where S = sum over dependencies (j, k) in t of |j - k|
[Scale: delta runs from -infinity to +infinity; delta = 0 is the MLE baseline; one extreme is marked "linear structure preferred".]
144. Soft bias toward short dependencies
- Multiply parse probability by exp(-delta * S),
- where S is the total length of all dependencies
- Then renormalize probabilities
[Same delta scale as above.]
145. Structural Annealing
[Same delta scale.] Start here: train a model. Repeat: increase delta and retrain, until performance stops improving on a small validation dataset.
146. Grammar Induction
Other structural biases can be annealed. We tried annealing on connectivity (number of fragments), and got similar results.
147. A 6/9-Accurate Parse
These errors look like ones made by a supervised parser in 2000!
[Figure: the Treebank parse vs. the MLE-with-locality-bias parse of a sentence built from the words "can, gene, thus, the, prevent, plant, from, fertilizing, itself, a"; the errors are a verb instead of a modal as root, a preposition misattachment, and misattachment of the adverb "thus".]
148. Accuracy Improvements
language      random tree   Klein & Manning (2004)   Smith & Eisner (2006)   state-of-the-art, supervised
German        27.5          50.3                     70.0                    82.6 [1]
English       30.3          41.6                     61.8                    90.9 [2]
Bulgarian     30.4          45.6                     58.4                    85.9 [1]
Mandarin      22.6          50.1                     57.2                    84.6 [1]
Turkish       29.8          48.0                     62.4                    69.6 [1]
Portuguese    30.6          42.3                     71.8                    86.5 [1]
[1] CoNLL-X shared task, best system.  [2] McDonald et al., 2005.
149. Combining with Contrastive Estimation
- This generally gives us our best results
150. New examples of dynamic programming in NLP
- Contrastive estimation for HMM and grammar induction
- Uses lattice parsing
- (with Noah A. Smith)
151. Contrastive Estimation: Training Log-Linear Models on Unlabeled Data
- Noah A. Smith and Jason Eisner
- Department of Computer Science / Center for Language and Speech Processing
- Johns Hopkins University
- {nasmith,jason}@cs.jhu.edu
152. Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data
- Noah A. Smith and Jason Eisner
- Department of Computer Science / Center for Language and Speech Processing
- Johns Hopkins University
- {nasmith,jason}@cs.jhu.edu
153. Nutshell Version
[Diagram: unannotated text + max-ent features + sequence models -> tractable training, via contrastive estimation with lattice neighborhoods.]
Experiments on unlabeled data: POS tagging, 46% error rate reduction (relative to EM); max-ent features make it possible to survive damage to the tag dictionary; dependency parsing, 21% attachment error reduction (relative to EM).
154. Red leaves don't hide blue jays.
155. Maximum Likelihood Estimation (Supervised)
[Figure: raise p of the observed pair x = "red leaves don't hide blue jays", y = JJ NNS MD VB JJ NNS, at the expense of the denominator, which sums over all sentences and taggings (Sigma*).]
156. Maximum Likelihood Estimation (Unsupervised)
[Figure: the tags are hidden (? ? ? ? ? ?). Raise p of the observed sentence x = "red leaves don't hide blue jays", summed over all taggings, at the expense of the denominator over all of Sigma*. This is what EM does.]
157. Focusing Probability Mass
[Figure: probability mass is moved from the denominator onto the numerator.]
158. Conditional Estimation (Supervised)
[Figure: numerator: the observed tagging y = JJ NNS MD VB JJ NNS of x = "red leaves don't hide blue jays". A different denominator: all taggings (? ? ? ? ? ?) of that same observed sentence.]
159. Objective Functions
- MLE: optimized by Count & Normalize; numerator: tags and words; denominator: Sigma* (all word sequences and taggings).
- MLE with hidden variables: EM; numerator: words; denominator: Sigma*.
- Conditional Likelihood: Iterative Scaling; numerator: tags and words; denominator: the observed words (with all possible taggings).
- Perceptron: Backprop; numerator: tags and words; denominator: hypothesized tags and words.
- Contrastive Estimation: generic numerical solvers (in this talk, LMVM L-BFGS); numerator: observed data (in this talk, the raw word sequence, summed over all possible taggings); denominator: ?
(The first rows are for generative models.)
160.
- This talk is about denominators ...
- in the unsupervised case.
- A good denominator can improve accuracy and tractability.
161. Language Learning (Syntax)
"At last! My own language learning device!"
"Why did he pick that sequence for those words? Why not say 'leaves red ...' or '... hide don't ...' or ..."
"Why didn't he say 'birds fly' or 'dancing granola' or 'the wash dishes' or any other sequence of words?" (EM)
162.
- What is a syntax model supposed to explain?
- Each learning hypothesis corresponds to a denominator / neighborhood.
163. The Job of Syntax
- Explain why each word is necessary.
- -> DEL1WORD neighborhood
164. The Job of Syntax
- Explain the (local) order of the words.
- -> TRANS1 neighborhood
165. [Figure: numerator: p("red leaves don't hide blue jays"), summed over all taggings (? ? ? ? ? ?); denominator: p summed over the sentences in the TRANS1 neighborhood.]
166. [Figure: the denominator as a lattice compactly encoding all sentences in the TRANS1 neighborhood of "red leaves don't hide blue jays" (each obtained by transposing one adjacent pair of words), with any tagging. Built and summed with lattice parsing: www.dyna.org (shameless self-promotion).]
167. The New Modeling Imperative
A good sentence hints that a set of bad ones is nearby.
numerator / denominator (neighborhood)
Make the good sentence likely, at the expense of those bad neighbors.
168.
- This talk is about denominators ...
- in the unsupervised case.
- A good denominator can improve accuracy and tractability.
169. Log-Linear Models
[Figure: p is proportional to the score of (x, y), divided by the partition function Z. Computing Z is undesirable: it sums over all possible taggings of all possible sentences! Conditional estimation (supervised) instead normalizes over 1 sentence; contrastive estimation (unsupervised) normalizes over a few sentences.]
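One way to write the two cheap-denominator objectives side by side (my notation): with u_theta(x, y) = exp(theta . f(x, y)) scoring a (sentence, tagging) pair,

    \[
      \text{conditional (supervised):}\quad
      \prod_i \frac{u_\theta(x_i, y_i)}{\sum_{y} u_\theta(x_i, y)}
      \qquad
      \text{contrastive (unsupervised):}\quad
      \prod_i \frac{\sum_{y} u_\theta(x_i, y)}{\sum_{x' \in N(x_i)} \sum_{y} u_\theta(x', y)}
    \]

Each denominator sums over one sentence or over a small neighborhood N(x_i) of sentences, never over all of Sigma* as the full partition function Z would.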
170. A Big Picture: Sequence Model Estimation
[Diagram organizing estimators by whether they need only unannotated data, allow overlapping features, and have tractable sums: generative MLE p(x, y); generative EM p(x); log-linear MLE p(x, y); log-linear conditional estimation p(y | x); log-linear EM p(x); log-linear CE with lattice neighborhoods.]
171. Contrastive Neighborhoods
- Guide the learner toward models that do what syntax is supposed to do.
- Lattice representation -> efficient algorithms.
There is an art to choosing neighborhood functions.
172. Neighborhoods
neighborhood       size    lattice arcs   perturbations
DEL1WORD           n+1     O(n)           delete up to 1 word
TRANS1             n       O(n)           transpose any bigram
DELORTRANS1        O(n)    O(n)           DEL1WORD or TRANS1
DEL1SUBSEQUENCE    O(n^2)  O(n^2)         delete any contiguous subsequence
Sigma* (EM)        infinite   -           replace each word with anything
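A literal-minded Python sketch of two of these neighborhoods (the real system never enumerates them sentence by sentence; it encodes each neighborhood compactly as a lattice, and whether the original sentence is included follows the sizes in the table):

    def del1word(words):
        """DEL1WORD: the sentence itself plus every single-word deletion (n+1 strings)."""
        yield list(words)
        for i in range(len(words)):
            yield words[:i] + words[i+1:]

    def trans1(words):
        """TRANS1: the sentence itself plus every adjacent transposition (n strings)."""
        yield list(words)
        for i in range(len(words) - 1):
            swapped = list(words)
            swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
            yield swapped

    sent = "red leaves don't hide blue jays".split()
    print(sum(1 for _ in del1word(sent)))   # 7  (= n+1)
    print(sum(1 for _ in trans1(sent)))     # 6  (= n)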
173. The Merialdo (1994) Task
- Given unlabeled text
- and a POS dictionary (that tells all possible tags for each word type),
- learn to tag.
A form of supervision.
174. Trigram Tagging Model
[Figure: tags JJ NNS MD VB JJ NNS over the words "red leaves don't hide blue jays".]
Feature set: tag trigrams; tag/word pairs from a POS dictionary.
175. [Bar chart of tagging accuracy: labels include random; EM (Merialdo, 1994); DA (Smith & Eisner, 2004); CE with neighborhoods DEL1WORD, DEL1SUBSEQUENCE, TRANS1, DELORTRANS1, LENGTH; log-linear EM; supervised HMM and CRF; 10% data. Setup: 96K words, full POS dictionary, uninformative initializer, best of 8 smoothing conditions.]
176. What if we damage the POS dictionary?
- Dictionary includes ...
- all words
- words from 1st half of corpus
- words with count >= 2
- words with count >= 3
- Dictionary excludes OOV words, which can get any tag.
[Plot: accuracy of EM, random, LENGTH, DELORTRANS1 under each dictionary condition. Setup: 96K words, 17 coarse POS tags, uninformative initializer.]
177. Trigram Tagging Model + Spelling
[Figure: tags JJ NNS MD VB JJ NNS over the words "red leaves don't hide blue jays".]
Feature set: tag trigrams; tag/word pairs from a POS dictionary; 1- to 3-character suffixes; contains hyphen, digit.
178. Log-linear spelling features aided recovery ... but only with a smart neighborhood.
[Plot: EM, random, LENGTH, LENGTH + spelling, DELORTRANS1, DELORTRANS1 + spelling.]
179.
- The model need not be finite-state.
180. Unsupervised Dependency Parsing
[Plot: attachment accuracy of the Klein & Manning (2004) model trained by EM vs. CE with LENGTH and TRANS1 neighborhoods, for each initializer.]
181. To Sum Up ...
Contrastive Estimation means picking your own denominator, for tractability or for accuracy (or, as in our case, for both).
Now we can use the task to guide the unsupervised learner (like discriminative techniques do for supervised learners).
It's a particularly good fit for log-linear models
- with max-ent features
- unsupervised sequence models
- all in time for ACL 2006.
182. (image-only slide)