Title: Dependency Parsing by Belief Propagation
1 Dependency Parsing by Belief Propagation
- David A. Smith (JHU → UMass Amherst)
- Jason Eisner (Johns Hopkins University)
1
2 Outline
- Edge-factored parsing (Old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (New!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
4 Word Dependency Parsing

Raw sentence:
He reckons the current account deficit will narrow to only 1.8 billion in September.

POS-tagged sentence:
He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.

Word dependency parsed sentence:
[Figure: dependency tree over the sentence, with arc labels ROOT, SUBJ, S-COMP, SPEC, MOD, COMP]
slide adapted from Yuji Matsumoto
5 What does parsing have to do with belief propagation?

loopy belief propagation
6 Outline
- Edge-factored parsing (Old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (New!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
7 Great ideas in NLP: Log-linear models
(Berger, Della Pietra & Della Pietra 1996; Darroch & Ratcliff 1972)
- In the beginning, we used generative models:
  p(A) · p(B | A) · p(C | A, B) · p(D | A, B, C) · ...
  each choice depends on a limited part of the history
- but which dependencies to allow? what if they're all worthwhile?
7
8 Great ideas in NLP: Log-linear models
(Berger, Della Pietra & Della Pietra 1996; Darroch & Ratcliff 1972)
- In the beginning, we used generative models:
  p(A) · p(B | A) · p(C | A, B) · p(D | A, B, C) · ...
  which dependencies to allow? (given limited training data)
- Solution: Log-linear (max-entropy) modeling
  (1/Z) · F(A) · F(B,A) · F(C,A) · F(C,B) · F(D,A,B) · F(D,B,C) · F(D,A,C) · ...
  throw them all in!
- Features may interact in arbitrary ways
- Iterative scaling keeps adjusting the feature weights until the model agrees with the training data.
8
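To make the contrast concrete, here is a minimal sketch of a log-linear model over a handful of binary variables. The variable names, features, and weights are hypothetical, chosen only to show how arbitrary overlapping features get multiplied in and normalized by Z:

```python
import itertools
import math

# Hypothetical binary variables and feature weights, for illustration only.
VARS = ["A", "B", "C", "D"]
WEIGHTS = {                 # each feature inspects a small subset of variables
    ("A",): 0.5,            # fires when A = 1
    ("B", "A"): 1.2,        # fires when B = 1 and A = 1
    ("C", "B"): -0.7,
    ("D", "A", "C"): 0.3,
}

def unnormalized(assignment):
    """Product of exp(weight) over every feature whose variables are all 1."""
    return math.exp(sum(w for feat, w in WEIGHTS.items()
                        if all(assignment[v] == 1 for v in feat)))

# Z sums the unnormalized score of every joint assignment.
assignments = [dict(zip(VARS, bits))
               for bits in itertools.product([0, 1], repeat=len(VARS))]
Z = sum(unnormalized(a) for a in assignments)

example = {"A": 1, "B": 1, "C": 0, "D": 1}
print("p(example) =", unnormalized(example) / Z)
```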
9 How about structured outputs?
- Log-linear models are great for n-way classification
- Also good for predicting sequences
  - but to allow fast dynamic programming, only use n-gram features
- Also good for dependency parsing
  - but to allow fast dynamic programming or MST parsing, only use single-edge features
9
10 How about structured outputs?
  ... but to allow fast dynamic programming or MST parsing, only use single-edge features
10
11 Edge-Factored Parsers (McDonald et al. 2005)
yes, lots of green ...
[Figure: candidate dependency links over the Czech sentence "Byl jasný studený dubnový den a hodiny odbíjely třináctou" ("It was a bright cold day in April and the clocks were striking thirteen")]
12 Edge-Factored Parsers (McDonald et al. 2005)
jasný → den ("bright day")
13 Edge-Factored Parsers (McDonald et al. 2005)
jasný → N ("bright NOUN")
jasný → den ("bright day")
[The figure now also shows each word's POS tag: V A A A N J N V C]
14 Edge-Factored Parsers (McDonald et al. 2005)
jasný → N ("bright NOUN")
jasný → den ("bright day")
A → N
15 Edge-Factored Parsers (McDonald et al. 2005)
jasný → N ("bright NOUN")
jasný → den ("bright day")
A → N preceding conjunction
A → N
16 Edge-Factored Parsers (McDonald et al. 2005)
- How about this competing edge?
not as good, lots of red ...
17 Edge-Factored Parsers (McDonald et al. 2005)
- How about this competing edge?
jasný → hodiny ("bright clocks")
... undertrained ...
18 Edge-Factored Parsers (McDonald et al. 2005)
- How about this competing edge?
jasn → hodi ("bright clock", stems only)
[The figure now also shows each word's stem: byl jasn stud dubn den a hodi odbí trin]
19 Edge-Factored Parsers (McDonald et al. 2005)
- How about this competing edge?
jasn → hodi ("bright clock", stems only)
A (plural) → N (singular)
20 Edge-Factored Parsers (McDonald et al. 2005)
- How about this competing edge?
jasn → hodi ("bright clock", stems only)
A → N where N follows a conjunction
A (plural) → N (singular)
21 Edge-Factored Parsers (McDonald et al. 2005)
- Which edge is better?
- "bright day" or "bright clocks"?
22 Edge-Factored Parsers (McDonald et al. 2005)
- Which edge is better?
- Score of an edge e = θ · features(e)
- Standard algos → valid parse with max total score
23 Edge-Factored Parsers (McDonald et al. 2005)
- Which edge is better?
- Score of an edge e = θ · features(e)
- Standard algos → valid parse with max total score
can't have both (one parent per word)
Thus, an edge may lose (or win) because of a consensus of other edges.
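As a concrete (toy) illustration of edge-factored scoring, the sketch below scores each candidate edge as a dot product of a weight vector with that edge's features, and scores a parse as the sum over its edges. The features and weights are invented for illustration and are not the actual feature templates of McDonald et al. 2005:

```python
# Toy edge-factored scoring: score(edge) = w . features(edge), and a parse's
# score is the sum of its edge scores. Features and weights are invented.

WEIGHTS = {
    "head_tag=N child_tag=A": 1.4,            # adjectives often modify nouns
    "head_word=den child_word=jasný": 2.0,    # lexicalized "bright day"
    "head_word=hodiny child_word=jasný": -0.5,
}

def edge_score(head, child):
    feats = [
        "head_tag=%s child_tag=%s" % (head["tag"], child["tag"]),
        "head_word=%s child_word=%s" % (head["word"], child["word"]),
    ]
    return sum(WEIGHTS.get(f, 0.0) for f in feats)

def parse_score(words, heads):
    """heads[i] is the index of word i's single parent, or None for the root."""
    return sum(edge_score(words[h], words[i])
               for i, h in enumerate(heads) if h is not None)

words = [{"word": "jasný",  "tag": "A"},
         {"word": "den",    "tag": "N"},
         {"word": "hodiny", "tag": "N"}]
print(parse_score(words, heads=[1, None, 1]))   # one parent per non-root word
```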
24 Outline
- Edge-factored parsing (Old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (New!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
25 Finding Highest-Scoring Parse
- Convert to context-free grammar (CFG)
- Then use dynamic programming
  (each subtree is a linguistic constituent, here a noun phrase)
26 Finding Highest-Scoring Parse
- Convert to context-free grammar (CFG)
- Then use dynamic programming
  - CKY algorithm for CFG parsing is O(n³)
  - Unfortunately, O(n⁵) in this case
    - to score the "cat → wore" link, it's not enough to know this is an NP; we must know it's rooted at "cat"
    - so expand the nonterminal set by O(n): NP_the, NP_cat, NP_hat, ...
    - so CKY's "grammar constant" is no longer constant
  (each subtree is a linguistic constituent, here a noun phrase)
27 Finding Highest-Scoring Parse
- Convert to context-free grammar (CFG)
- Then use dynamic programming
  - CKY algorithm for CFG parsing is O(n³)
  - Unfortunately, O(n⁵) in this case
  - Solution: Use a different decomposition (Eisner 1996)
    - Back to O(n³)
  (each subtree is a linguistic constituent, here a noun phrase)
28 Spans vs. constituents
Two kinds of substring:
- Constituent of the tree: links to the rest only through its headword (root).
- Span of the tree: links to the rest only through its endwords.
The cat in the hat wore a stovepipe. ROOT
29 Decomposing a tree into spans
cat in the hat wore a stovepipe. ROOT
wore a stovepipe. ROOT
cat in the hat wore
in the hat wore
cat in
in the hat
30 Finding Highest-Scoring Parse
- Convert to context-free grammar (CFG)
- Then use dynamic programming
  - CKY algorithm for CFG parsing is O(n³)
  - Unfortunately, O(n⁵) in this case
  - Solution: Use a different decomposition (Eisner 1996)
    - Back to O(n³)
- Can play the usual tricks for dynamic programming parsing
  - Further refining the constituents or spans
  - Allow the probability model to keep track of even more internal information
  - A*, best-first, coarse-to-fine
  - Training by EM etc.
31 Hard Constraints on Valid Trees
- Score of an edge e = θ · features(e)
- Standard algos → valid parse with max total score
can't have both (one parent per word)
Thus, an edge may lose (or win) because of a consensus of other edges.
32 Non-Projective Parses
[Figure: dependency parse of "I 'll give a talk tomorrow on bootstrapping" (ROOT), in which the arc from "talk" to "on bootstrapping" crosses another arc]
subtree rooted at "talk" is a discontiguous noun phrase
The projectivity restriction. Do we really want it?
33 Non-Projective Parses
[Figure: "I 'll give a talk tomorrow on bootstrapping" (ROOT)]
occasional non-projectivity in English
[Figure: Latin "ista meam norit gloria canitiem" (ROOT), glossed that(NOM) my(ACC) may-know glory(NOM) going-gray(ACC)]
"That glory may-know my going-gray" (i.e., it shall last till I go gray)
frequent non-projectivity in Latin, etc.
34 Finding highest-scoring non-projective tree
- Consider the sentence "John saw Mary" (left).
- The Chu-Liu-Edmonds algorithm finds the maximum-weight spanning tree (right); it may be non-projective.
- Can be found in time O(n²).
- Every node selects its best parent; if there are cycles, contract them and repeat.
[Figure: weighted directed graph over root, John, saw, Mary (edge weights 9, 10, 30, 30, 20, 0, 11, 3, ...) and its maximum-weight spanning tree]
slide thanks to Dragomir Radev
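A quick way to experiment with this is to hand the edge scores to an off-the-shelf maximum spanning arborescence routine (networkx's implementation of Edmonds' algorithm plays the role of Chu-Liu-Edmonds here). The weights below loosely follow the figure and are otherwise illustrative:

```python
import networkx as nx

# Directed edge scores (head -> child); larger is better. Values loosely
# follow the figure and are otherwise illustrative.
scores = {
    ("root", "saw"): 10, ("root", "John"): 9, ("root", "Mary"): 9,
    ("saw", "John"): 30, ("saw", "Mary"): 30,
    ("John", "saw"): 20, ("Mary", "saw"): 0,
    ("John", "Mary"): 3, ("Mary", "John"): 11,
}

G = nx.DiGraph()
for (head, child), w in scores.items():
    G.add_edge(head, child, weight=w)

# Edmonds' algorithm (Chu-Liu-Edmonds) finds the maximum-weight spanning
# arborescence: every word gets exactly one parent, no cycles.
tree = nx.maximum_spanning_arborescence(G)
print(sorted(tree.edges(data="weight")))
# e.g. [('root', 'saw', 10), ('saw', 'John', 30), ('saw', 'Mary', 30)]
```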
35 Finding highest-scoring non-projective tree
Summing over all non-projective trees
- Consider the sentence "John saw Mary" (left).
- The Chu-Liu-Edmonds algorithm finds the maximum-weight spanning tree (right); it may be non-projective.
- Can be found in time O(n²).
- How about the total weight Z of all trees?
- How about outside probabilities or gradients?
- Can be found in time O(n³) by matrix determinants and inverses (Smith & Smith, 2007).
slide thanks to Dragomir Radev
36 Graph Theory to the Rescue!
Tutte's Matrix-Tree Theorem (1948)
The determinant of the Kirchhoff (aka Laplacian) adjacency matrix of a directed graph G, without row and column r, is equal to the sum of scores of all directed spanning trees of G rooted at node r.
O(n³) time!
Exactly the Z we need!
37 Building the Kirchhoff (Laplacian) Matrix
- Negate edge scores
- Sum columns (children)
- Strike root row/col.
- Take determinant
N.B. This allows multiple children of root, but see Koo et al. 2007.
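The sketch below builds the Kirchhoff matrix exactly as described on this slide for a tiny made-up score matrix, strikes the root row and column, and checks the resulting determinant Z against a brute-force sum over all spanning trees:

```python
import itertools
import numpy as np

# s[h, m] is the (multiplicative) score of the edge h -> m; index 0 is ROOT.
# The numbers are made up for illustration.
n = 3
s = np.array([[0.0, 2.0, 1.0, 1.5],
              [0.0, 0.0, 3.0, 0.5],
              [0.0, 1.0, 0.0, 2.0],
              [0.0, 0.5, 1.0, 0.0]])

# Kirchhoff / Laplacian matrix, as on the slide: negate the edge scores,
# then put each column's sum of incoming scores on the diagonal.
L = -s.copy()
np.fill_diagonal(L, s.sum(axis=0))

# Strike the root row and column and take the determinant: that's Z.
Z = np.linalg.det(L[1:, 1:])

# Brute-force check: sum the weight of every parent assignment that forms a
# tree rooted at ROOT (each word has one parent; following parents reaches 0).
def is_tree(parents):                      # parents[m-1] is the head of word m
    for m in range(1, n + 1):
        seen, cur = set(), m
        while cur != 0:
            if cur in seen:
                return False
            seen.add(cur)
            cur = parents[cur - 1]
    return True

brute = sum(
    np.prod([s[parents[m - 1], m] for m in range(1, n + 1)])
    for parents in itertools.product(range(n + 1), repeat=n)
    if all(parents[m - 1] != m for m in range(1, n + 1)) and is_tree(parents)
)
print(Z, brute)   # the two totals should agree (up to floating point)
```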
38 Why Should This Work?
- Clear for a 1×1 matrix; use induction
- Chu-Liu-Edmonds analogy: every node selects its best parent; if there are cycles, contract and recur
- Undirected case; special root cases for directed
39 Outline
- Edge-factored parsing (Old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (New!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
40 Exactly Finding the Best Parse
... but to allow fast dynamic programming or MST parsing, only use single-edge features
- With arbitrary features, runtime blows up
- Projective parsing: O(n³) by dynamic programming
- Non-projective: O(n²) by minimum spanning tree
40
41 Let's reclaim our freedom (again!)
This paper in a nutshell
- Output probability is a product of local factors:
  (1/Z) · F(A) · F(B,A) · F(C,A) · F(C,B) · F(D,A,B) · F(D,B,C) · F(D,A,C) · ...
- Throw in any factors we want! (log-linear model)
- How could we find the best parse?
  - Integer linear programming (Riedel et al., 2006)
    - doesn't give us probabilities when training or parsing
  - MCMC
    - slow to mix? high rejection rate because of the hard TREE constraint?
  - Greedy hill-climbing (McDonald & Pereira 2006)
41
42 Let's reclaim our freedom (again!)
This paper in a nutshell
- Output probability is a product of local factors:
  (1/Z) · F(A) · F(B,A) · F(C,A) · F(C,B) · F(D,A,B) · F(D,B,C) · F(D,A,C) · ...
- Throw in any factors we want! (log-linear model)
- Let local factors negotiate via belief propagation
  - Links (and tags) reinforce or suppress one another
- Each iteration takes total time O(n²) or O(n³)
- Converges to a pretty good (but approximate) global parse
42
43 Let's reclaim our freedom (again!)
This paper in a nutshell
Training with many features | Decoding with many features
Iterative scaling | Belief propagation
Each weight in turn is influenced by others | Each variable in turn is influenced by others
Iterate to achieve globally optimal weights | Iterate to achieve locally consistent beliefs
To train a distribution over trees, use dynamic programming to compute the normalizer Z | To decode a distribution over trees, use dynamic programming to compute messages
New!
44 Outline
- Edge-factored parsing (Old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (New!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
45 Local factors in a graphical model
- First, a familiar example
  - Conditional Random Field (CRF) for POS tagging
[Figure: chain CRF; the observed input sentence "find preferred tags" is shaded, with one tag variable per word]
Possible tagging (i.e., assignment to remaining variables): v v v
45
46 Local factors in a graphical model
- First, a familiar example
  - Conditional Random Field (CRF) for POS tagging
Possible tagging (i.e., assignment to remaining variables): v v v
Another possible tagging: v a n
46
47 Local factors in a graphical model
- First, a familiar example
  - Conditional Random Field (CRF) for POS tagging
Binary factor that measures compatibility of two adjacent tags:
        v  n  a
    v   0  2  1
    n   2  1  0
    a   0  3  1
The model reuses the same parameters at this position (the same table appears between every pair of adjacent tags).
47
48 Local factors in a graphical model
- First, a familiar example
  - Conditional Random Field (CRF) for POS tagging
Unary factor evaluates this tag; its values depend on the corresponding word:
    v  0.2
    n  0.2
    a  0      ("can't be adj")
48
49 Local factors in a graphical model
- First, a familiar example
  - Conditional Random Field (CRF) for POS tagging
Unary factor evaluates this tag; its values depend on the corresponding word:
    v  0.2
    n  0.2
    a  0
(could be made to depend on the entire observed sentence)
49
50 Local factors in a graphical model
- First, a familiar example
  - Conditional Random Field (CRF) for POS tagging
Unary factor evaluates this tag; there is a different unary factor at each position:
    v 0.2   n 0.2    a 0
    v 0.3   n 0.02   a 0
    v 0.3   n 0      a 0.1
50
51 Local factors in a graphical model
- First, a familiar example
  - Conditional Random Field (CRF) for POS tagging
p(v a n) is proportional to the product of all factors' values on "v a n":
  binary factors between adjacent tags:
        v  n  a
    v   0  2  1
    n   2  1  0
    a   0  3  1
  unary factors, one per word:
    v 0.3   n 0.02   a 0
    v 0.3   n 0      a 0.1
    v 0.2   n 0.2    a 0
51
52 Local factors in a graphical model
- First, a familiar example
  - Conditional Random Field (CRF) for POS tagging
p(v a n) is proportional to the product of all factors' values on "v a n":
  = 1 · 3 · 0.3 · 0.1 · 0.2
(the two binary factors contribute 1 and 3; the three unary factors contribute 0.3, 0.1, and 0.2)
52
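The product on this slide can be recomputed directly. The binary and unary tables below are copied from the preceding slides; assigning the three unary tables to the words "find", "preferred", and "tags" is my reading of the figure:

```python
# Binary factor between adjacent tags (row = left tag, column = right tag)
# and one unary factor per word, as shown on the slides.
BINARY = {
    ("v", "v"): 0, ("v", "n"): 2, ("v", "a"): 1,
    ("n", "v"): 2, ("n", "n"): 1, ("n", "a"): 0,
    ("a", "v"): 0, ("a", "n"): 3, ("a", "a"): 1,
}
UNARY = {
    "find":      {"v": 0.3, "n": 0.02, "a": 0.0},
    "preferred": {"v": 0.3, "n": 0.0,  "a": 0.1},
    "tags":      {"v": 0.2, "n": 0.2,  "a": 0.0},
}

def unnormalized(words, tags):
    score = 1.0
    for word, tag in zip(words, tags):
        score *= UNARY[word][tag]              # one unary factor per position
    for left, right in zip(tags, tags[1:]):
        score *= BINARY[(left, right)]         # one factor per adjacent tag pair
    return score

# 1 * 3 * 0.3 * 0.1 * 0.2 = 0.018, matching the slide's product
print(unnormalized(["find", "preferred", "tags"], ["v", "a", "n"]))
```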
53 Local factors in a graphical model
- First, a familiar example: CRF for POS tagging
- Now let's do dependency parsing!
  - O(n²) boolean variables for the possible links
[Figure: the sentence "find preferred links" with one boolean variable for each possible directed link]
53
54 Local factors in a graphical model
- First, a familiar example: CRF for POS tagging
- Now let's do dependency parsing!
  - O(n²) boolean variables for the possible links
Possible parse, encoded as an assignment to these variables
54
55 Local factors in a graphical model
- First, a familiar example: CRF for POS tagging
- Now let's do dependency parsing!
  - O(n²) boolean variables for the possible links
Possible parse, encoded as an assignment to these variables
Another possible parse: f f t t f f
55
56 Local factors in a graphical model
- First, a familiar example: CRF for POS tagging
- Now let's do dependency parsing!
  - O(n²) boolean variables for the possible links
Possible parse, encoded as an assignment to these variables
Another possible parse
An illegal parse: f t t t f f
56
57 Local factors in a graphical model
- First, a familiar example: CRF for POS tagging
- Now let's do dependency parsing!
  - O(n²) boolean variables for the possible links
Possible parse, encoded as an assignment to these variables
Another possible parse
An illegal parse
Another illegal parse: f t t t f t (multiple parents)
57
58 Local factors for parsing
- So what factors shall we multiply to define parse probability?
  - Unary factors to evaluate each link in isolation
    - as before, the goodness of this link can depend on the entire observed input context (e.g., t 2, f 1)
    - some other links aren't as good given this input sentence (e.g., t 1, f 2; t 1, f 8; t 1, f 3; t 1, f 6)
- But what if the best assignment isn't a tree??
58
59 Global factors for parsing
- So what factors shall we multiply to define parse probability?
  - Unary factors to evaluate each link in isolation
  - Global TREE factor to require that the links form a legal tree
    - this is a hard constraint: the factor is either 0 or 1
      ffffff 0
      ffffft 0
      fffftf 0
      ...
      fftfft 1
      ...
      tttttt 0
59
60 Global factors for parsing
- So what factors shall we multiply to define parse probability?
  - Unary factors to evaluate each link in isolation
  - Global TREE factor to require that the links form a legal tree
    - this is a hard constraint: the factor is either 0 or 1
    - 64 entries (0/1), one per assignment of the 6 link variables
      ffffff 0
      ffffft 0
      fffftf 0
      ...
      fftfft 1
      ...
      tttttt 0
    - the assignment highlighted in the figure is a legal tree: "we're legal!"
60
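For intuition, here is a brute-force sketch of what the hard TREE factor computes: 1 if the links set to true give every word exactly one parent and lead back to ROOT without cycles, 0 otherwise. The encoding of links as (head, child) pairs is just for this illustration; the paper never materializes this table:

```python
def tree_factor(n_words, true_links):
    """Return 1 if the links form a legal dependency tree, else 0.
    true_links is a set of (head, child) pairs; 0 is ROOT, words are 1..n_words."""
    parents = {}
    for head, child in true_links:
        if child in parents:                  # a word with two parents: illegal
            return 0
        parents[child] = head
    if set(parents) != set(range(1, n_words + 1)):
        return 0                              # some word has no parent
    for word in range(1, n_words + 1):
        seen, cur = set(), word               # walk up toward ROOT, watch for cycles
        while cur != 0:
            if cur in seen:
                return 0
            seen.add(cur)
            cur = parents[cur]
    return 1

print(tree_factor(3, {(0, 2), (2, 1), (2, 3)}))   # 1: a legal tree
print(tree_factor(3, {(0, 2), (2, 1), (1, 2)}))   # 0: word 2 has two parents
print(tree_factor(3, {(2, 1), (1, 2), (0, 3)}))   # 0: cycle between words 1 and 2
```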
61 Local factors for parsing
- So what factors shall we multiply to define parse probability?
  - Unary factors to evaluate each link in isolation
  - Global TREE factor to require that the links form a legal tree
    - this is a hard constraint: the factor is either 0 or 1
  - Second-order effects: factors on 2 variables
    - grandparent
          f   t
      f   1   1
      t   1   3
      (when both links in the grandparent chain are t, this factor contributes 3)
61
62 Local factors for parsing
- So what factors shall we multiply to define parse probability?
  - Unary factors to evaluate each link in isolation
  - Global TREE factor to require that the links form a legal tree
    - this is a hard constraint: the factor is either 0 or 1
  - Second-order effects: factors on 2 variables
    - grandparent
    - no-cross
          f   t
      f   1   1
      t   1   0.2
      (two links that would cross each other are discouraged: both t contributes only 0.2)
62
63 Local factors for parsing
- So what factors shall we multiply to define parse probability?
  - Unary factors to evaluate each link in isolation
  - Global TREE factor to require that the links form a legal tree
    - this is a hard constraint: the factor is either 0 or 1
  - Second-order effects: factors on 2 variables
    - grandparent
    - no-cross
    - siblings
    - hidden POS tags
    - subcategorization
    - ...
63
64 Outline
- Edge-factored parsing (Old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (New!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
65 Good to have lots of features, but ...
- Nice model :-)
- Shame about the NP-hardness :-(
- Can we approximate?
- Machine learning to the rescue!
  - The ML community has given a lot to NLP
  - In the 2000s, NLP has been giving back to ML
    - mainly techniques for joint prediction of structures
    - much earlier, speech recognition had HMMs, EM, smoothing ...
65
66 Great Ideas in ML: Message Passing
Count the soldiers
[Figure: soldiers standing in a line; each passes "1 before you", "2 before you", ..., "5 before you" down the line in one direction and "1 behind you", ..., "5 behind you" in the other]
adapted from MacKay (2003) textbook
66
67 Great Ideas in ML: Message Passing
Count the soldiers
- Incoming messages: "2 before you", "3 behind you"
- Belief: "Must be 2 + 1 + 3 = 6 of us"
- "only see my incoming messages"
adapted from MacKay (2003) textbook
67
68 Great Ideas in ML: Message Passing
Count the soldiers
- Incoming messages: "1 before you", "4 behind you"
- "only see my incoming messages"
adapted from MacKay (2003) textbook
68
69 Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree
- "3 here" and "7 here" arrive, so this soldier reports "11 here" (= 7 + 3 + 1)
adapted from MacKay (2003) textbook
69
70 Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree
- "3 here" and "3 here" arrive, so this soldier reports "7 here" (= 3 + 3 + 1)
adapted from MacKay (2003) textbook
70
71 Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree
- "7 here" and "3 here" arrive, so this soldier reports "11 here" (= 7 + 3 + 1)
adapted from MacKay (2003) textbook
71
72 Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree
- Incoming reports: "3 here", "7 here", "3 here"
- Belief: "Must be 14 of us"
adapted from MacKay (2003) textbook
72
73 Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree
- Incoming reports: "3 here", "7 here", "3 here"
- Belief: "Must be 14 of us"
- This wouldn't work correctly with a loopy (cyclic) graph
adapted from MacKay (2003) textbook
73
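The counting scheme is easy to write down for an arbitrary tree: the message a soldier sends across an edge is one (for itself) plus the messages it received from its other neighbors, and its belief adds up all incoming messages plus one. A small sketch on an invented five-node tree:

```python
import functools
from collections import defaultdict

edges = [("A", "B"), ("B", "C"), ("B", "D"), ("D", "E")]   # a 5-node tree
neighbors = defaultdict(set)
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

@functools.lru_cache(maxsize=None)
def message(u, v):
    """Count of soldiers on u's side of the u-v edge, as reported to v."""
    return 1 + sum(message(w, u) for w in neighbors[u] - {v})

for node in sorted(neighbors):
    belief = 1 + sum(message(w, node) for w in neighbors[node])
    print(node, belief)        # every node concludes there are 5 soldiers
```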
74 Great ideas in ML: Forward-Backward
- In the CRF, message passing = forward-backward
[Figure: the chain CRF for "find preferred tags", with forward messages (α) and backward messages (β) passing through the same binary tag-compatibility factors as before]
- At one tag variable, the incoming messages are α = (v 2, n 1, a 7) and β = (v 3, n 1, a 6), and that word's unary factor is (v 0.3, n 0, a 0.1)
- belief at that variable = α × unary × β, elementwise = (v 1.8, n 0, a 4.2)
- other messages shown in the figure: (v 7, n 2, a 1) and (v 3, n 6, a 1)
74
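The belief on this slide can be checked with one line of arithmetic: it is the elementwise product of the forward message, the word's unary factor, and the backward message (and can then be normalized into a posterior over tags). The numbers below are the slide's:

```python
# Belief at a tag variable = forward message * unary factor * backward message,
# computed elementwise over the tags (v, n, a). Values are from the slide.
alpha = {"v": 2, "n": 1, "a": 7}     # forward message into this variable
beta  = {"v": 3, "n": 1, "a": 6}     # backward message into this variable
unary = {"v": 0.3, "n": 0, "a": 0.1} # unary factor for this word

belief = {t: alpha[t] * unary[t] * beta[t] for t in "vna"}
print(belief)                         # v: 1.8, n: 0, a: 4.2 (up to floating point)

Z = sum(belief.values())
print({t: belief[t] / Z for t in "vna"})   # normalized posterior over tags
```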
75 Great ideas in ML: Forward-Backward
- Extend the CRF to a "skip chain" to capture a non-local factor
- More influences on the belief :-)
[Figure: an extra message (v 3, n 1, a 6) from the skip-chain factor now joins α = (v 2, n 1, a 7), β = (v 3, n 1, a 6), and the unary factor (v 0.3, n 0, a 0.1), so the belief becomes (v 5.4, n 0, a 25.2)]
75
76 Great ideas in ML: Forward-Backward
- Extend the CRF to a "skip chain" to capture a non-local factor
- More influences on the belief :-)
- But the graph becomes loopy :-(
- Red messages not independent? Pretend they are!
[Figure: same skip-chain CRF; the belief is again (v 5.4, n 0, a 25.2)]
76
77 Two great tastes that taste great together
"You got belief propagation in my dynamic programming!"
"You got dynamic programming in my belief propagation!"
78 Loopy Belief Propagation for Parsing
- The sentence tells word 3, "Please be a verb."
- Word 3 tells the 3 → 7 link, "Sorry, then you probably don't exist."
- The 3 → 7 link tells the TREE factor, "You'll have to find another parent for 7."
- The TREE factor tells the 10 → 7 link, "You're on!"
- The 10 → 7 link tells word 10, "Could you please be a noun?"
- ...
78
79 Loopy Belief Propagation for Parsing
- Higher-order factors (e.g., Grandparent) induce loops
- Let's watch a loop around one triangle
- Strong links are suppressing or promoting other links
79
80 Loopy Belief Propagation for Parsing
- Higher-order factors (e.g., Grandparent) induce loops
- Let's watch a loop around one triangle
- How did we compute the outgoing message to the green link?
  - Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?
[TREE factor table, as before: ffffff 0, ffffft 0, fffftf 0, ..., fftfft 1, ..., tttttt 0]
80
81 Loopy Belief Propagation for Parsing
- How did we compute the outgoing message to the green link?
  - Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?
[TREE factor table, as before: ffffff 0, ffffft 0, fffftf 0, ..., fftfft 1, ..., tttttt 0]
81
82 Loopy Belief Propagation for Parsing
- How did we compute the outgoing message to the green link?
  - Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?
- Belief propagation assumes that the incoming messages to TREE are independent. So the outgoing messages can be computed with first-order parsing algorithms (fast, no grammar constant).
82
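To see what the TREE factor's outgoing message is, the brute-force sketch below enumerates every legal tree of a tiny three-word sentence, weights each tree by the incoming messages of all links except the target link, and accumulates the mass of trees with and without that link. The incoming message values are made up; the paper computes the same quantity in O(n³) with matrix-tree / inside-outside machinery rather than by enumeration:

```python
import itertools

n = 3                                     # words 1..3, 0 is ROOT
# Incoming message to TREE for each link (head, child): (value at f, value at t).
msg_in = {(h, c): (1.0, 1.5 if h == 0 else 1.0 + 0.1 * h + 0.2 * c)
          for h in range(n + 1) for c in range(1, n + 1) if h != c}

def legal_trees():
    """Yield each legal parse as a dict child -> head (one parent each, no cycles)."""
    for heads in itertools.product(range(n + 1), repeat=n):
        parents = {c: heads[c - 1] for c in range(1, n + 1)}
        if any(parents[c] == c for c in parents):
            continue
        ok = True
        for c in parents:                 # walking up must reach ROOT (0)
            seen, cur = set(), c
            while cur != 0 and ok:
                ok = cur not in seen
                seen.add(cur)
                cur = parents[cur]
        if ok:
            yield parents

def outgoing_message(link):
    """Message TREE -> link: (mass of legal trees without the link, mass with it),
    weighting every OTHER link by its incoming message."""
    out_f = out_t = 0.0
    for parents in legal_trees():
        mass = 1.0
        for (h, c), (mf, mt) in msg_in.items():
            if (h, c) == link:
                continue
            mass *= mt if parents[c] == h else mf
        if parents[link[1]] == link[0]:
            out_t += mass
        else:
            out_f += mass
    return out_f, out_t

print(outgoing_message((0, 2)))   # how strongly TREE wants the link ROOT -> word 2
```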
83 Some connections
- Parser stacking (Nivre & McDonald 2008; Martins et al. 2008)
- Global constraints in arc consistency
  - ALLDIFFERENT constraint (Régin 1994)
- Matching constraint in max-product BP
  - For computer vision (Duchi et al., 2006)
  - Could be used for machine translation
- As far as we know, our parser is the first use of global constraints in sum-product BP.
84 Outline
- Edge-factored parsing (Old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (New!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
85 Runtimes for each factor type (see paper)

Factor type        degree   runtime   count    total
Tree               O(n²)    O(n³)     1        O(n³)
Proj. Tree         O(n²)    O(n³)     1        O(n³)
Individual links   1        O(1)      O(n²)    O(n²)
Grandparent        2        O(1)      O(n³)    O(n³)
Sibling pairs      2        O(1)      O(n³)    O(n³)
Sibling bigrams    O(n)     O(n²)     O(n)     O(n³)
NoCross            O(n)     O(n)      O(n²)    O(n³)
Tag                1        O(g)      O(n)     O(n)
TagLink            3        O(g²)     O(n²)    O(n²)
TagTrigram         O(n)     O(ng³)    1        O(n)
TOTAL                                          O(n³) per iteration

Additive, not multiplicative!
86 Runtimes for each factor type (see paper)

Factor type        degree   runtime   count    total
Tree               O(n²)    O(n³)     1        O(n³)
Proj. Tree         O(n²)    O(n³)     1        O(n³)
Individual links   1        O(1)      O(n²)    O(n²)
Grandparent        2        O(1)      O(n³)    O(n³)
Sibling pairs      2        O(1)      O(n³)    O(n³)
Sibling bigrams    O(n)     O(n²)     O(n)     O(n³)
NoCross            O(n)     O(n)      O(n²)    O(n³)
Tag                1        O(g)      O(n)     O(n)
TagLink            3        O(g²)     O(n²)    O(n²)
TagTrigram         O(n)     O(ng³)    1        O(n)
TOTAL                                          O(n³)

Each global factor coordinates an unbounded number of variables. Standard belief propagation would take exponential time to iterate over all configurations of those variables. See the paper for efficient propagators.
Additive, not multiplicative!
87 Experimental Details
- Decoding
  - Run several iterations of belief propagation
  - Get final beliefs at link variables
  - Feed them into a first-order parser
  - This gives the Min Bayes Risk tree (minimizes expected error)
- Training
  - BP computes beliefs about each factor, too ...
  - ... which gives us gradients for max conditional likelihood
  - (as in the forward-backward algorithm)
- Features used in experiments
  - First-order: Individual links, just as in McDonald et al. 2005
  - Higher-order: Grandparent, Sibling bigrams, NoCross
87
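A small sketch of the minimum-Bayes-risk decoding step under edge-level loss: pick the legal tree whose links have the highest total belief. Here a three-word sentence is brute-forced with made-up belief values; in practice this is just the usual first-order parser (MST or projective DP) run with the beliefs as edge scores:

```python
import itertools

n = 3                                  # words 1..3; 0 is ROOT
# Final beliefs p(link = t) from BP; values are made up for illustration.
belief = {(h, c): 0.5 for h in range(n + 1) for c in range(1, n + 1) if h != c}
belief.update({(0, 2): 0.9, (2, 1): 0.8, (2, 3): 0.7})

def is_tree(parents):                  # parents maps child -> head
    for c in parents:
        seen, cur = set(), c
        while cur != 0:
            if cur in seen:
                return False
            seen.add(cur)
            cur = parents[cur]
    return True

best_tree, best_total = None, float("-inf")
for heads in itertools.product(range(n + 1), repeat=n):
    parents = {c: heads[c - 1] for c in range(1, n + 1)}
    if any(h == c for c, h in parents.items()) or not is_tree(parents):
        continue
    total = sum(belief[(h, c)] for c, h in parents.items())
    if total > best_total:
        best_tree, best_total = parents, total

print(best_tree)   # {1: 2, 2: 0, 3: 2}: the tree with the highest total belief
```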
88 Dependency Accuracy: the extra, higher-order features help! (non-projective parsing)

               Danish   Dutch   English
TreeLink        85.5     87.3    88.6
NoCross         86.1     88.3    89.1
Grandparent     86.1     88.6    89.4
ChildSeq        86.5     88.5    90.1
89 Dependency Accuracy: the extra, higher-order features help! (non-projective parsing)

                                        Danish   Dutch   English
TreeLink                                 85.5     87.3    88.6
NoCross                                  86.1     88.3    89.1
Grandparent                              86.1     88.6    89.4
ChildSeq                                 86.5     88.5    90.1
Best projective parse with all factors   86.0     84.5    90.2    (exact, slow)
+ hill-climbing                          86.1     87.6    90.2    (doesn't fix enough edges)
90 Time vs. Projective Search Error
[Figure: projective search error vs. number of BP iterations, one panel compared with the O(n⁴) DP and one with the O(n⁵) DP; annotated "DP = 140 iterations"]
91 Runtime: BP vs. DP
[Figure: runtime of BP compared with the O(n⁴) DP and with the O(n⁵) DP]
91
92 Outline
- Edge-factored parsing (Old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (New!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
93 Freedom Regained
This paper in a nutshell
- Output probability defined as a product of local and global factors
  - Throw in any factors we want! (log-linear model)
  - Each factor must be fast, but they run independently
- Let local factors negotiate via belief propagation
  - Each bit of syntactic structure is influenced by others
  - Some factors need combinatorial algorithms to compute messages fast
    - e.g., existing parsing algorithms using dynamic programming
  - Each iteration takes total time O(n³) or even O(n²); see paper
  - Compare: reranking or stacking
- Converges to a pretty good (but approximate) global parse
  - Fast parsing for formerly intractable or slow models
  - Extra features of these models really do help accuracy
93
94 Future Opportunities
- Efficiently modeling more hidden structure
  - POS tags, link roles, secondary links (DAG-shaped parses)
- Beyond dependencies
  - Constituency parsing, traces, lattice parsing
- Beyond parsing
  - Alignment, translation
  - Bipartite matching and network flow
  - Joint decoding of parsing and other tasks (IE, MT, reasoning ...)
- Beyond text
  - Image tracking and retrieval
  - Social networks
95 Thank you!