Title: Parsing with Soft and Hard Constraints on Dependency Length
1. Parsing with Soft and Hard Constraints on Dependency Length
- Jason Eisner and Noah A. Smith
- Department of Computer Science / Center for Language and Speech Processing
- Johns Hopkins University
- jason,nasmith_at_cs.jhu.edu
2. Premise
(cf. other talks here at IWPT 2005: Burstein; Sagae & Lavie; Tsuruoka & Tsujii; Dzikovska; Rosé; ...)
- Many parsing consumers (IE, ASR, MT) will benefit more from fast, precise partial parsing than from full, deep parses that are slow to build.
3. Outline of the Talk
- The Short-Dependency Preference
- Soft constraints:
  - Review of split bilexical grammars (SBGs): O(n³) algorithm
  - Modeling dependency length
  - Experiments
- Hard constraints:
  - Constraining dependency length in a parser: O(n) algorithm, same grammar constant as SBG
  - Experiments
4. Short-Dependency Preference
- A word's dependents (adjuncts, arguments) tend to fall near it in the string.
5. Length of a dependency = surface distance
[diagram: example sentence with dependency arcs labeled by length (3, 1, 1, 1)]
6. About 50% of English dependencies have length 1, another 20% have length 2, 10% have length 3, ...
[histogram: fraction of all dependencies vs. length]
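The distribution above can be tallied directly from any dependency treebank. A minimal sketch, assuming a hypothetical head-index parse format (each parse maps a dependent's position to its head's position):

```python
from collections import Counter

def length_distribution(parses):
    """Each parse maps a dependent's index to its head's index;
    dependency length = surface distance |head - dependent|.
    Returns the fraction of all dependencies at each length."""
    counts = Counter(abs(head - dep)
                     for heads in parses
                     for dep, head in heads.items())
    total = sum(counts.values())
    return {length: n / total for length, n in sorted(counts.items())}

# Toy check on "It takes two to tango" (0-based indices: 'takes' = 1
# heads 'It', 'two', 'to'; 'to' = 3 heads 'tango'):
print(length_distribution([{0: 1, 2: 1, 3: 1, 4: 3}]))  # {1: 0.75, 2: 0.25}
```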
7. Related Ideas
- Score parses based on what's between a head and child (Collins, 1997; Zeman, 2004; McDonald et al., 2005)
- Assume short → faster human processing (Church, 1980; Gibson, 1998)
- "Attach low" heuristic for PPs (English) (Frazier, 1979; Hobbs and Bear, 1990)
- Obligatory and optional re-orderings (English) (see paper)
8. Split Bilexical Grammars (Eisner, 1996; 2000)
- Bilexical: capture relationships between two words using rules of the form
  - X_p → Y_p Z_c
  - X_p → Z_c Y_p
  - X_w → w
- grammar size: N³S²
- Split: left children conditionally independent of right children, given the parent (equivalent to split HAGs; Eisner and Satta, 1999)
9Generating with SBGs
?w0
?w0
- Start with left wall
- Generate root w0
- Generate left children w-1, w-2, ..., w-l from
the FSA ?w0 - Generate right children w1, w2, ..., wr from the
FSA ?w0 - Recurse on each wi for i in -l, ..., -1, 1,
..., r, sampling ai (steps 2-4) - Return al...a-1w0a1...ar
w0
w-1
w1
w-2
w2
...
...
?w-l
w-l
wr
w-l.-1
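The generative story above can be sketched as a toy sampler. The one-state "FSAs" here (just a stop probability and a child distribution per head word) and the example vocabulary are illustrative stand-ins for real weighted automata λ_w and ρ_w:

```python
import random

def sample_children(fsa, max_children=4):
    # fsa = (stop_prob, child_distribution): after each child we stop
    # with probability stop_prob -- a one-state stand-in for a real
    # weighted automaton.
    stop_p, dist = fsa
    children = []
    while dist and len(children) < max_children and random.random() >= stop_p:
        words, weights = zip(*dist.items())
        children.append(random.choices(words, weights)[0])
    return children

def generate(word, lam, rho, depth=0, max_depth=4):
    """Sample a tree under a toy split bilexical grammar: left children
    come from lam[word], right children independently from rho[word],
    then each child recurses.  Returns the surface yield as a list."""
    if depth >= max_depth:
        return [word]
    left = [generate(c, lam, rho, depth + 1) for c in sample_children(lam[word])]
    right = [generate(c, lam, rho, depth + 1) for c in sample_children(rho[word])]
    flat = lambda trees: [w for t in trees for w in t]
    # left children were generated inside-out (w-1, w-2, ...), so reverse
    return flat(left[::-1]) + [word] + flat(right)

stop = (1.0, {})
lam = {'takes': (0.5, {'It': 1.0}), 'It': stop, 'two': stop, 'to': stop, 'tango': stop}
rho = {'takes': (0.3, {'two': 0.4, 'to': 0.3, 'tango': 0.3}),
       'It': stop, 'two': stop, 'to': stop, 'tango': stop}
random.seed(0)
print(generate('takes', lam, rho))
```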
10Naïve Recognition/Parsing
p
goal
O(n5) combinations
O(n5N3) if N nonterminals
r
p
c
i
j
0
k
n
goal
takes
takes
It
to
takes
tango
It
takes
two
to
It
takes
two
to
tango
11. Cubic Recognition/Parsing (Eisner & Satta, 1999)
- A triangle is a head with some left (or right) subtrees.
- One trapezoid per dependency.
[diagram: triangles and trapezoids over "It takes two to tango", combining up to the goal item]
12. Cubic Recognition/Parsing (Eisner & Satta, 1999)
- goal: O(n) combinations
- triangle + triangle → trapezoid: O(n³) combinations
- triangle + trapezoid → triangle: O(n³) combinations
- O(n³g²N) if N nonterminals, polysemy g
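The triangles-and-trapezoids scheme above can be written down directly as a dynamic program. This is a sketch of the unlabeled, arc-factored special case only (no nonterminals or automaton states, so the grammar constant drops out); `score` is a hypothetical arc-weight matrix with word 0 acting as the wall/root:

```python
def eisner(score):
    """Eisner/Satta-style O(n^3) projective dependency parsing.
    score[h][m] = weight of the arc h -> m.  Returns the best total
    score.  'Incomplete' items are the trapezoids (one per dependency);
    'complete' items are the triangles (a head plus its left or right
    subtrees)."""
    n = len(score)
    NEG = float('-inf')
    # [i][j][d]: d = 0 means the head is on the right (at j),
    #            d = 1 means the head is on the left (at i)
    comp = [[[NEG, NEG] for _ in range(n)] for _ in range(n)]
    inc = [[[NEG, NEG] for _ in range(n)] for _ in range(n)]
    for i in range(n):
        comp[i][i][0] = comp[i][i][1] = 0.0
    for width in range(1, n):
        for i in range(n - width):
            j = i + width
            # trapezoid: join a right triangle and a left triangle, add an arc
            best = max(comp[i][k][1] + comp[k + 1][j][0] for k in range(i, j))
            inc[i][j][0] = best + score[j][i]   # j becomes head of i
            inc[i][j][1] = best + score[i][j]   # i becomes head of j
            # triangle: extend a trapezoid with a smaller triangle
            comp[i][j][0] = max(comp[i][k][0] + inc[k][j][0] for k in range(i, j))
            comp[i][j][1] = max(inc[i][k][1] + comp[k][j][1] for k in range(i + 1, j + 1))
    return comp[0][n - 1][1]

# Root 0 plus two words; best tree is 0 -> 1 -> 2 with score 1 + 2 = 3.
print(eisner([[0, 1, 0], [-9, 0, 2], [-9, 0, 0]]))  # 3.0
```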
13. Implementation
- Augment items with (Viterbi) weights; order by weight.
- Agenda-based, best-first algorithm.
- We use Dyna (see the HLT-EMNLP paper) to implement all parsers here.
- Count the number of items built → a measure of runtime.
14Very Simple Model for ?w and ?w
We parse POS tag sequences, not words.
p(child first, parent, direction) p(stop
first, parent, direction) p(child not first,
parent, direction) p(stop not first, parent,
direction)
?takes
?takes
It
takes
two
to
15Baseline
test set recall () test set recall () test set recall () test set runtime (items/word) test set runtime (items/word) test set runtime (items/word)
73 61 77 90 149 49
16. Improvements
Many ways to improve on the 73% baseline:
- smoothing / max-ent
- parse words, not tags
- bigger FSAs / more nonterminals
- LTAG, CCG, etc.
- special NP-treatment, punctuation
- train discriminatively
- ... or model dependency length?
17. Modeling Dependency Length
- When running the parsing algorithm, just multiply in these probabilities at the appropriate time.
- (DEFICIENT) p(3 | r, a, L) · p(2 | r, b, L) · p(1 | b, c, R) · p(1 | r, d, R) · p(1 | d, e, R) · p(1 | e, f, R)
[diagram: example tree over words a ... f with root r, each dependency contributing one length factor]
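Such length factors can be estimated by simple MLE over a treebank and then multiplied into the score whenever a dependency (trapezoid) is built. A sketch; the tag names and arc-tuple format are illustrative:

```python
from collections import Counter, defaultdict

def train_length_model(arcs):
    """MLE of p(length | parent, child, direction).  Each arc is
    (parent, child, head_index, child_index).  Note the resulting
    model is deficient: lengths are also constrained by the tree
    structure, so some probability mass leaks to impossible lengths."""
    counts = defaultdict(Counter)
    for parent, child, h, c in arcs:
        direction = 'R' if c > h else 'L'
        counts[parent, child, direction][abs(c - h)] += 1
    return {ctx: {length: n / sum(ctr.values()) for length, n in ctr.items()}
            for ctx, ctr in counts.items()}

# Hypothetical tagged arcs:
model = train_length_model([('V', 'N', 2, 1), ('V', 'N', 5, 4), ('V', 'P', 2, 5)])
print(model[('V', 'N', 'L')])  # {1: 1.0}
```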
18Modeling Dependency Length
test set recall () test set recall () test set recall () test set runtime (items/word) test set runtime (items/word) test set runtime (items/word)
73 61 77 90 149 49
76 62 75 67 103 31
4.1 1.6 -2.6 -26 -31 -37
length
19. Conclusion (I)
- Modeling dependency length can cut runtime of simple models by 26–37%, with effects ranging from −3% to +4% on recall.
- (Loss on recall perhaps due to deficient/MLE estimation.)
20. Going to Extremes
Longer dependencies are less likely.
What if we eliminate them completely?
21. Hard Constraints
- Disallow dependencies between words of distance > b ...
- Risk: best parse contrived, or no parse at all!
- Solution: allow fragments (partial parsing; Hindle, 1990, inter alia).
- Why not model the sequence of fragments?
22. From SBG to Vine SBG
- An SBG wall ($) has one child: L(ρ_$) = S, L(λ_$) = {ε}
- A vine SBG wall has a sequence of children: L(ρ_$) = S*, L(λ_$) = {ε}
23. Building a Vine SBG Parser
- Grammar generates a sequence of trees from the wall ($)
- Parser recognizes sequences of trees without long dependencies
- Need to modify training data so the model is consistent with the parser.
24. [Diagram: dependency tree for "According to some estimates , the rule changes would cut insider filings by more than a third ." (from the Penn Treebank), with each dependency labeled by its surface length.]
25. [The same tree with b = 4: dependencies longer than 4 are broken, and the resulting fragment roots are grafted to the wall.]
26. [b = 3: more dependencies are broken.]
27. [b = 2]
28. [b = 1]
29. [b = 0: every word becomes its own fragment.]
30. Observation
- Even for small b, bunches can grow to arbitrary size
- But arbitrary center-embedding is out
31. Vine SBG is Finite-State
- Could compile into an FSA and get O(n) parsing!
- Problem: what's the grammar constant? EXPONENTIAL
[diagram: FSA scanning "According to some estimates , the rule changes would cut insider ..."; its state must remember, e.g., that "insider" has no parent yet and that "cut" and "would" can have more children]
32. Alternative
- Instead, we adapt an SBG chart parser (which implicitly shares fragments of stack state) to the vine case, eliminating unnecessary work.
33. Quadratic Recognition/Parsing
- goal/vine items: O(n²b) combinations
- triangle + triangle → trapezoid: only construct trapezoids such that k − i ≤ b, so O(n³) combinations become O(nb²)
- triangle + trapezoid → triangle: O(n³) combinations likewise become O(nb²)
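The effect of the k − i ≤ b filter on the trapezoid step can be sanity-checked by brute-force counting of the (i, k, j) combinations with i ≤ k < j:

```python
def trapezoid_combinations(n, b=None):
    """Count the (i, k, j) combinations with i <= k < j that the
    trapezoid step considers over n positions; with a hard bound b,
    only spans of width j - i <= b are ever built."""
    return sum(j - i                       # number of split points k
               for i in range(n)
               for j in range(i + 1, n)
               if b is None or j - i <= b)

print(trapezoid_combinations(100))       # 166650, growing as n^3/6
print(trapezoid_combinations(100, 10))   # 5115, growing only as n*b^2/2
```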
34. [Diagram: O(nb) vine construction with b = 4 over "According to some , the new changes would cut insider filings by more than a third ."; the fragment roots ("According", ",", "changes", "would", "cut", ".") hang from the wall, and all dependencies have width ≤ 4.]
35. Parsing Algorithm
- Same grammar constant as Eisner and Satta (1999)
- O(n³) → O(nb²) runtime
- Includes some overhead (low-order term) for constructing the vine
- Reality check ... is it worth it?
36. Results: Penn Treebank
[graph: recall vs. runtime as b varies from 1 to 20; evaluation against the original ungrafted Treebank, non-punctuation only]
37. Results: Chinese Treebank
[graph: as above, b = 1 to 20]
38. Results: TIGER Corpus
[graph: as above, b = 1 to 20]
39. Type-Specific Bounds
- b can be specific to dependency type: e.g., b(V–O) can be longer than b(S–V)
- b specific to (parent, child, direction)
- gradually tighten based on training data
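One way to set per-type bounds from training data, as a hedged sketch (the tag names and arc-tuple format are made up): record dependency lengths by (parent, child, direction) and keep the smallest bound covering a chosen fraction of them.

```python
from collections import defaultdict

def learn_bounds(arcs, keep=1.0):
    """Per-type hard bounds b(parent, child, direction): the smallest
    bound keeping a `keep` fraction of the training dependencies of
    that type (keep=1.0 gives the max observed length; lowering it
    gradually tightens the bounds)."""
    lengths = defaultdict(list)
    for parent, child, h, c in arcs:
        direction = 'R' if c > h else 'L'
        lengths[parent, child, direction].append(abs(c - h))
    bounds = {}
    for key, ls in lengths.items():
        ls.sort()
        bounds[key] = ls[max(0, int(keep * len(ls)) - 1)]
    return bounds

# Hypothetical tagged arcs (parent, child, head_index, child_index):
arcs = [('V', 'O', 1, 4), ('V', 'O', 0, 2), ('V', 'S', 3, 2)]
print(learn_bounds(arcs))  # b(V,O,R) = 3 may exceed b(V,S,L) = 1
```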
40. Type-Specific Bounds: Results
- English: runtime cut by 50%, no loss
- Chinese: runtime cut by 55%, no loss
- German: runtime cut by 44%, 2% loss
41. Related Work
- Nederhof (2000) surveys finite-state approximation of context-free languages (CFG → FSA).
- We limit all dependency lengths (not just center-embedding), and derive weights from the Treebank (not by approximation).
- Chart parser → reasonable grammar constant.
42. Future Work
- apply to state-of-the-art parsing models
- better parameter estimation
- applications: MT, IE, grammar induction
43. Conclusion (II)
- Dependency length can be a helpful feature in improving the speed and accuracy (or trading off between them) of simple parsing models that consider dependencies.
44. This Talk in a Nutshell
- Length of a dependency = surface distance
[diagram: example sentence with dependency lengths 3, 1, 1, 1]
- Empirical results (English, Chinese, German):
  - Hard constraints cut runtime in half or more with no accuracy loss (English, Chinese) or by 44% with −2.2% accuracy (German).
  - Soft constraints affect accuracy of simple models by −3% to +4% and cut runtime by 25% to 40%.
- Formal results: a hard bound b on dependency length
  - results in a regular language.
  - allows O(nb²) parsing.