Title: Dependency Parsing by Belief Propagation
1 Dependency Parsing by Belief Propagation
- David A. Smith (JHU → UMass Amherst)
- Jason Eisner (Johns Hopkins University)
1
2 Outline
- Edge-factored parsing (Old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (New!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
4 Word Dependency Parsing

Raw sentence:
He reckons the current account deficit will narrow to only 1.8 billion in September.

POS-tagged sentence:
He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.

Word dependency parsed sentence:
[Figure: dependency tree over the sentence, with arc labels ROOT, SUBJ, S-COMP, SPEC, MOD, COMP]
slide adapted from Yuji Matsumoto
5 What does parsing have to do with belief propagation?

loopy belief propagation
6 Outline
- Edge-factored parsing (Old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (New!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
7 Great ideas in NLP: Log-linear models
(Berger, Della Pietra & Della Pietra 1996; Darroch & Ratcliff 1972)
- In the beginning, we used generative models:
  p(A) · p(B | A) · p(C | A, B) · p(D | A, B, C) · ...
  each choice depends on a limited part of the history
- but which dependencies to allow? what if they're all worthwhile?
7
8 Great ideas in NLP: Log-linear models
(Berger, Della Pietra & Della Pietra 1996; Darroch & Ratcliff 1972)
- In the beginning, we used generative models:
  p(A) · p(B | A) · p(C | A, B) · p(D | A, B, C) · ...
  which dependencies to allow? (given limited training data)
- Solution: Log-linear (max-entropy) modeling
  (1/Z) · F(A) · F(B,A) · F(C,A) · F(C,B) · F(D,A,B) · F(D,B,C) · F(D,A,C) · ...
  throw them all in!
- Features may interact in arbitrary ways
- Iterative scaling keeps adjusting the feature weights until the model agrees with the training data.
8
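To make the contrast concrete, here is a minimal sketch of a log-linear model over a handful of binary variables. The variable names, features, and weights are hypothetical, chosen only to show how arbitrary overlapping features get multiplied in and normalized by Z:

```python
import itertools
import math

# Hypothetical binary variables and feature weights, for illustration only.
VARS = ["A", "B", "C", "D"]
WEIGHTS = {                 # each feature inspects a small subset of variables
    ("A",): 0.5,            # fires when A = 1
    ("B", "A"): 1.2,        # fires when B = 1 and A = 1
    ("C", "B"): -0.7,
    ("D", "A", "C"): 0.3,
}

def unnormalized(assignment):
    """Product of exp(weight) over every feature whose variables are all 1."""
    return math.exp(sum(w for feat, w in WEIGHTS.items()
                        if all(assignment[v] == 1 for v in feat)))

# Z sums the unnormalized score of every joint assignment.
assignments = [dict(zip(VARS, bits))
               for bits in itertools.product([0, 1], repeat=len(VARS))]
Z = sum(unnormalized(a) for a in assignments)

example = {"A": 1, "B": 1, "C": 0, "D": 1}
print("p(example) =", unnormalized(example) / Z)
```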
9 How about structured outputs?
- Log-linear models are great for n-way classification
- Also good for predicting sequences
  - but to allow fast dynamic programming, only use n-gram features
- Also good for dependency parsing
  - but to allow fast dynamic programming or MST parsing, only use single-edge features
9
10 How about structured outputs?
  ... but to allow fast dynamic programming or MST parsing, only use single-edge features
10
11 Edge-Factored Parsers (McDonald et al. 2005)
yes, lots of green ...
[Figure: candidate dependency links over the Czech sentence "Byl jasný studený dubnový den a hodiny odbíjely třináctou" ("It was a bright cold day in April and the clocks were striking thirteen")]
12 Edge-Factored Parsers (McDonald et al. 2005)
jasný → den ("bright day")
13 Edge-Factored Parsers (McDonald et al. 2005)
jasný → N ("bright NOUN")
jasný → den ("bright day")
[The figure now also shows each word's POS tag: V A A A N J N V C]
14 Edge-Factored Parsers (McDonald et al. 2005)
jasný → N ("bright NOUN")
jasný → den ("bright day")
A → N
15 Edge-Factored Parsers (McDonald et al. 2005)
jasný → N ("bright NOUN")
jasný → den ("bright day")
A → N preceding conjunction
A → N
16 Edge-Factored Parsers (McDonald et al. 2005)
- How about this competing edge?
not as good, lots of red ...
17 Edge-Factored Parsers (McDonald et al. 2005)
- How about this competing edge?
jasný → hodiny ("bright clocks")
... undertrained ...
18 Edge-Factored Parsers (McDonald et al. 2005)
- How about this competing edge?
jasn → hodi ("bright clock", stems only)
[The figure now also shows each word's stem: byl jasn stud dubn den a hodi odbí trin]
19 Edge-Factored Parsers (McDonald et al. 2005)
- How about this competing edge?
jasn → hodi ("bright clock", stems only)
A (plural) → N (singular)
20 Edge-Factored Parsers (McDonald et al. 2005)
- How about this competing edge?
jasn → hodi ("bright clock", stems only)
A → N where N follows a conjunction
A (plural) → N (singular)
21 Edge-Factored Parsers (McDonald et al. 2005)
- Which edge is better?
- "bright day" or "bright clocks"?
22 Edge-Factored Parsers (McDonald et al. 2005)
- Which edge is better?
- Score of an edge e = θ · features(e)
- Standard algos → valid parse with max total score
23 Edge-Factored Parsers (McDonald et al. 2005)
- Which edge is better?
- Score of an edge e = θ · features(e)
- Standard algos → valid parse with max total score
can't have both (one parent per word)
Thus, an edge may lose (or win) because of a consensus of other edges.
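As a concrete (toy) illustration of edge-factored scoring, the sketch below scores each candidate edge as a dot product of a weight vector with that edge's features, and scores a parse as the sum over its edges. The features and weights are invented for illustration and are not the actual feature templates of McDonald et al. 2005:

```python
# Toy edge-factored scoring: score(edge) = w . features(edge), and a parse's
# score is the sum of its edge scores. Features and weights are invented.

WEIGHTS = {
    "head_tag=N child_tag=A": 1.4,            # adjectives often modify nouns
    "head_word=den child_word=jasný": 2.0,    # lexicalized "bright day"
    "head_word=hodiny child_word=jasný": -0.5,
}

def edge_score(head, child):
    feats = [
        "head_tag=%s child_tag=%s" % (head["tag"], child["tag"]),
        "head_word=%s child_word=%s" % (head["word"], child["word"]),
    ]
    return sum(WEIGHTS.get(f, 0.0) for f in feats)

def parse_score(words, heads):
    """heads[i] is the index of word i's single parent, or None for the root."""
    return sum(edge_score(words[h], words[i])
               for i, h in enumerate(heads) if h is not None)

words = [{"word": "jasný",  "tag": "A"},
         {"word": "den",    "tag": "N"},
         {"word": "hodiny", "tag": "N"}]
print(parse_score(words, heads=[1, None, 1]))   # one parent per non-root word
```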
24 Outline
- Edge-factored parsing (Old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (New!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
25 Finding Highest-Scoring Parse
- Convert to context-free grammar (CFG)
- Then use dynamic programming
  (each subtree is a linguistic constituent, here a noun phrase)
26 Finding Highest-Scoring Parse
- Convert to context-free grammar (CFG)
- Then use dynamic programming
  - CKY algorithm for CFG parsing is O(n³)
  - Unfortunately, O(n⁵) in this case
    - to score the "cat → wore" link, it's not enough to know this is an NP; we must know it's rooted at "cat"
    - so expand the nonterminal set by O(n): NP_the, NP_cat, NP_hat, ...
    - so CKY's "grammar constant" is no longer constant
  (each subtree is a linguistic constituent, here a noun phrase)
27 Finding Highest-Scoring Parse
- Convert to context-free grammar (CFG)
- Then use dynamic programming
  - CKY algorithm for CFG parsing is O(n³)
  - Unfortunately, O(n⁵) in this case
  - Solution: Use a different decomposition (Eisner 1996)
    - Back to O(n³)
  (each subtree is a linguistic constituent, here a noun phrase)
28 Spans vs. constituents
Two kinds of substring:
- Constituent of the tree: links to the rest only through its headword (root).
- Span of the tree: links to the rest only through its endwords.
The cat in the hat wore a stovepipe. ROOT
29 Decomposing a tree into spans
cat in the hat wore a stovepipe. ROOT
wore a stovepipe. ROOT
cat in the hat wore
in the hat wore
cat in
in the hat
30 Finding Highest-Scoring Parse
- Convert to context-free grammar (CFG)
- Then use dynamic programming
  - CKY algorithm for CFG parsing is O(n³)
  - Unfortunately, O(n⁵) in this case
  - Solution: Use a different decomposition (Eisner 1996)
    - Back to O(n³)
- Can play the usual tricks for dynamic programming parsing
  - Further refining the constituents or spans
  - Allow the probability model to keep track of even more internal information
  - A*, best-first, coarse-to-fine
  - Training by EM etc.
31 Hard Constraints on Valid Trees
- Score of an edge e = θ · features(e)
- Standard algos → valid parse with max total score
can't have both (one parent per word)
Thus, an edge may lose (or win) because of a consensus of other edges.
32 Non-Projective Parses
[Figure: dependency parse of "I 'll give a talk tomorrow on bootstrapping" (ROOT), in which the arc from "talk" to "on bootstrapping" crosses another arc]
subtree rooted at "talk" is a discontiguous noun phrase
The projectivity restriction. Do we really want it?
33 Non-Projective Parses
[Figure: "I 'll give a talk tomorrow on bootstrapping" (ROOT)]
occasional non-projectivity in English
[Figure: Latin "ista meam norit gloria canitiem" (ROOT), glossed that(NOM) my(ACC) may-know glory(NOM) going-gray(ACC)]
"That glory may-know my going-gray" (i.e., it shall last till I go gray)
frequent non-projectivity in Latin, etc.
34 Finding highest-scoring non-projective tree
- Consider the sentence "John saw Mary" (left).
- The Chu-Liu-Edmonds algorithm finds the maximum-weight spanning tree (right); it may be non-projective.
- Can be found in time O(n²).
- Every node selects its best parent; if there are cycles, contract them and repeat.
[Figure: weighted directed graph over root, John, saw, Mary (edge weights 9, 10, 30, 30, 20, 0, 11, 3, ...) and its maximum-weight spanning tree]
slide thanks to Dragomir Radev
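A quick way to experiment with this is to hand the edge scores to an off-the-shelf maximum spanning arborescence routine (networkx's implementation of Edmonds' algorithm plays the role of Chu-Liu-Edmonds here). The weights below loosely follow the figure and are otherwise illustrative:

```python
import networkx as nx

# Directed edge scores (head -> child); larger is better. Values loosely
# follow the figure and are otherwise illustrative.
scores = {
    ("root", "saw"): 10, ("root", "John"): 9, ("root", "Mary"): 9,
    ("saw", "John"): 30, ("saw", "Mary"): 30,
    ("John", "saw"): 20, ("Mary", "saw"): 0,
    ("John", "Mary"): 3, ("Mary", "John"): 11,
}

G = nx.DiGraph()
for (head, child), w in scores.items():
    G.add_edge(head, child, weight=w)

# Edmonds' algorithm (Chu-Liu-Edmonds) finds the maximum-weight spanning
# arborescence: every word gets exactly one parent, no cycles.
tree = nx.maximum_spanning_arborescence(G)
print(sorted(tree.edges(data="weight")))
# e.g. [('root', 'saw', 10), ('saw', 'John', 30), ('saw', 'Mary', 30)]
```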
35 Finding highest-scoring non-projective tree
Summing over all non-projective trees
- Consider the sentence "John saw Mary" (left).
- The Chu-Liu-Edmonds algorithm finds the maximum-weight spanning tree (right); it may be non-projective.
- Can be found in time O(n²).
- How about the total weight Z of all trees?
- How about outside probabilities or gradients?
- Can be found in time O(n³) by matrix determinants and inverses (Smith & Smith, 2007).
slide thanks to Dragomir Radev
36 Graph Theory to the Rescue!
Tutte's Matrix-Tree Theorem (1948)
The determinant of the Kirchhoff (aka Laplacian) adjacency matrix of a directed graph G, without row and column r, is equal to the sum of scores of all directed spanning trees of G rooted at node r.
O(n³) time!
Exactly the Z we need!
37 Building the Kirchhoff (Laplacian) Matrix
- Negate edge scores
- Sum columns (children)
- Strike root row/col.
- Take determinant
N.B. This allows multiple children of root, but see Koo et al. 2007.
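The sketch below builds the Kirchhoff matrix exactly as described on this slide for a tiny made-up score matrix, strikes the root row and column, and checks the resulting determinant Z against a brute-force sum over all spanning trees:

```python
import itertools
import numpy as np

# s[h, m] is the (multiplicative) score of the edge h -> m; index 0 is ROOT.
# The numbers are made up for illustration.
n = 3
s = np.array([[0.0, 2.0, 1.0, 1.5],
              [0.0, 0.0, 3.0, 0.5],
              [0.0, 1.0, 0.0, 2.0],
              [0.0, 0.5, 1.0, 0.0]])

# Kirchhoff / Laplacian matrix, as on the slide: negate the edge scores,
# then put each column's sum of incoming scores on the diagonal.
L = -s.copy()
np.fill_diagonal(L, s.sum(axis=0))

# Strike the root row and column and take the determinant: that's Z.
Z = np.linalg.det(L[1:, 1:])

# Brute-force check: sum the weight of every parent assignment that forms a
# tree rooted at ROOT (each word has one parent; following parents reaches 0).
def is_tree(parents):                      # parents[m-1] is the head of word m
    for m in range(1, n + 1):
        seen, cur = set(), m
        while cur != 0:
            if cur in seen:
                return False
            seen.add(cur)
            cur = parents[cur - 1]
    return True

brute = sum(
    np.prod([s[parents[m - 1], m] for m in range(1, n + 1)])
    for parents in itertools.product(range(n + 1), repeat=n)
    if all(parents[m - 1] != m for m in range(1, n + 1)) and is_tree(parents)
)
print(Z, brute)   # the two totals should agree (up to floating point)
```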
38 Why Should This Work?
- Clear for a 1×1 matrix; use induction
- Chu-Liu-Edmonds analogy: every node selects its best parent; if there are cycles, contract and recur
- Undirected case; special root cases for directed
39 Outline
- Edge-factored parsing (Old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (New!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
40 Exactly Finding the Best Parse
... but to allow fast dynamic programming or MST parsing, only use single-edge features
- With arbitrary features, runtime blows up
- Projective parsing: O(n³) by dynamic programming
- Non-projective: O(n²) by minimum spanning tree
40
41 Let's reclaim our freedom (again!)
This paper in a nutshell
- Output probability is a product of local factors:
  (1/Z) · F(A) · F(B,A) · F(C,A) · F(C,B) · F(D,A,B) · F(D,B,C) · F(D,A,C) · ...
- Throw in any factors we want! (log-linear model)
- How could we find the best parse?
  - Integer linear programming (Riedel et al., 2006)
    - doesn't give us probabilities when training or parsing
  - MCMC
    - slow to mix? high rejection rate because of the hard TREE constraint?
  - Greedy hill-climbing (McDonald & Pereira 2006)
41
42 Let's reclaim our freedom (again!)
This paper in a nutshell
- Output probability is a product of local factors:
  (1/Z) · F(A) · F(B,A) · F(C,A) · F(C,B) · F(D,A,B) · F(D,B,C) · F(D,A,C) · ...
- Throw in any factors we want! (log-linear model)
- Let local factors negotiate via belief propagation
  - Links (and tags) reinforce or suppress one another
- Each iteration takes total time O(n²) or O(n³)
- Converges to a pretty good (but approximate) global parse
42
43 Let's reclaim our freedom (again!)
This paper in a nutshell
Training with many features | Decoding with many features
Iterative scaling | Belief propagation
Each weight in turn is influenced by others | Each variable in turn is influenced by others
Iterate to achieve globally optimal weights | Iterate to achieve locally consistent beliefs
To train a distribution over trees, use dynamic programming to compute the normalizer Z | To decode a distribution over trees, use dynamic programming to compute messages
New!
44 Outline
- Edge-factored parsing (Old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (New!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
45 Local factors in a graphical model
- First, a familiar example
  - Conditional Random Field (CRF) for POS tagging
[Figure: chain CRF; the observed input sentence "find preferred tags" is shaded, with one tag variable per word]
Possible tagging (i.e., assignment to remaining variables): v v v
45
46 Local factors in a graphical model
- First, a familiar example
  - Conditional Random Field (CRF) for POS tagging
Possible tagging (i.e., assignment to remaining variables): v v v
Another possible tagging: v a n
46
47 Local factors in a graphical model
- First, a familiar example
  - Conditional Random Field (CRF) for POS tagging
Binary factor that measures compatibility of two adjacent tags:
        v  n  a
    v   0  2  1
    n   2  1  0
    a   0  3  1
The model reuses the same parameters at this position (the same table appears between every pair of adjacent tags).
47
48 Local factors in a graphical model
- First, a familiar example
  - Conditional Random Field (CRF) for POS tagging
Unary factor evaluates this tag; its values depend on the corresponding word:
    v  0.2
    n  0.2
    a  0      ("can't be adj")
48
49 Local factors in a graphical model
- First, a familiar example
  - Conditional Random Field (CRF) for POS tagging
Unary factor evaluates this tag; its values depend on the corresponding word:
    v  0.2
    n  0.2
    a  0
(could be made to depend on the entire observed sentence)
49
50 Local factors in a graphical model
- First, a familiar example
  - Conditional Random Field (CRF) for POS tagging
Unary factor evaluates this tag; there is a different unary factor at each position:
    v 0.2   n 0.2    a 0
    v 0.3   n 0.02   a 0
    v 0.3   n 0      a 0.1
50
51 Local factors in a graphical model
- First, a familiar example
  - Conditional Random Field (CRF) for POS tagging
p(v a n) is proportional to the product of all factors' values on "v a n":
  binary factors between adjacent tags:
        v  n  a
    v   0  2  1
    n   2  1  0
    a   0  3  1
  unary factors, one per word:
    v 0.3   n 0.02   a 0
    v 0.3   n 0      a 0.1
    v 0.2   n 0.2    a 0
51
52 Local factors in a graphical model
- First, a familiar example
  - Conditional Random Field (CRF) for POS tagging
p(v a n) is proportional to the product of all factors' values on "v a n":
  = 1 · 3 · 0.3 · 0.1 · 0.2
(the two binary factors contribute 1 and 3; the three unary factors contribute 0.3, 0.1, and 0.2)
52
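The product on this slide can be recomputed directly. The binary and unary tables below are copied from the preceding slides; assigning the three unary tables to the words "find", "preferred", and "tags" is my reading of the figure:

```python
# Binary factor between adjacent tags (row = left tag, column = right tag)
# and one unary factor per word, as shown on the slides.
BINARY = {
    ("v", "v"): 0, ("v", "n"): 2, ("v", "a"): 1,
    ("n", "v"): 2, ("n", "n"): 1, ("n", "a"): 0,
    ("a", "v"): 0, ("a", "n"): 3, ("a", "a"): 1,
}
UNARY = {
    "find":      {"v": 0.3, "n": 0.02, "a": 0.0},
    "preferred": {"v": 0.3, "n": 0.0,  "a": 0.1},
    "tags":      {"v": 0.2, "n": 0.2,  "a": 0.0},
}

def unnormalized(words, tags):
    score = 1.0
    for word, tag in zip(words, tags):
        score *= UNARY[word][tag]              # one unary factor per position
    for left, right in zip(tags, tags[1:]):
        score *= BINARY[(left, right)]         # one factor per adjacent tag pair
    return score

# 1 * 3 * 0.3 * 0.1 * 0.2 = 0.018, matching the slide's product
print(unnormalized(["find", "preferred", "tags"], ["v", "a", "n"]))
```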
53 Local factors in a graphical model
- First, a familiar example: CRF for POS tagging
- Now let's do dependency parsing!
  - O(n²) boolean variables for the possible links
[Figure: the sentence "find preferred links" with one boolean variable for each possible directed link]
53
54 Local factors in a graphical model
- First, a familiar example: CRF for POS tagging
- Now let's do dependency parsing!
  - O(n²) boolean variables for the possible links
Possible parse, encoded as an assignment to these variables
54
55 Local factors in a graphical model
- First, a familiar example: CRF for POS tagging
- Now let's do dependency parsing!
  - O(n²) boolean variables for the possible links
Possible parse, encoded as an assignment to these variables
Another possible parse: f f t t f f
55
56 Local factors in a graphical model
- First, a familiar example: CRF for POS tagging
- Now let's do dependency parsing!
  - O(n²) boolean variables for the possible links
Possible parse, encoded as an assignment to these variables
Another possible parse
An illegal parse: f t t t f f
56
57 Local factors in a graphical model
- First, a familiar example: CRF for POS tagging
- Now let's do dependency parsing!
  - O(n²) boolean variables for the possible links
Possible parse, encoded as an assignment to these variables
Another possible parse
An illegal parse
Another illegal parse: f t t t f t (multiple parents)
57
58 Local factors for parsing
- So what factors shall we multiply to define parse probability?
  - Unary factors to evaluate each link in isolation
    - as before, the goodness of this link can depend on the entire observed input context (e.g., t 2, f 1)
    - some other links aren't as good given this input sentence (e.g., t 1, f 2; t 1, f 8; t 1, f 3; t 1, f 6)
- But what if the best assignment isn't a tree??
58
59 Global factors for parsing
- So what factors shall we multiply to define parse probability?
  - Unary factors to evaluate each link in isolation
  - Global TREE factor to require that the links form a legal tree
    - this is a hard constraint: the factor is either 0 or 1
      ffffff 0
      ffffft 0
      fffftf 0
      ...
      fftfft 1
      ...
      tttttt 0
59
60 Global factors for parsing
- So what factors shall we multiply to define parse probability?
  - Unary factors to evaluate each link in isolation
  - Global TREE factor to require that the links form a legal tree
    - this is a hard constraint: the factor is either 0 or 1
    - 64 entries (0/1), one per assignment of the 6 link variables
      ffffff 0
      ffffft 0
      fffftf 0
      ...
      fftfft 1
      ...
      tttttt 0
    - the assignment highlighted in the figure is a legal tree: "we're legal!"
60
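For intuition, here is a brute-force sketch of what the hard TREE factor computes: 1 if the links set to true give every word exactly one parent and lead back to ROOT without cycles, 0 otherwise. The encoding of links as (head, child) pairs is just for this illustration; the paper never materializes this table:

```python
def tree_factor(n_words, true_links):
    """Return 1 if the links form a legal dependency tree, else 0.
    true_links is a set of (head, child) pairs; 0 is ROOT, words are 1..n_words."""
    parents = {}
    for head, child in true_links:
        if child in parents:                  # a word with two parents: illegal
            return 0
        parents[child] = head
    if set(parents) != set(range(1, n_words + 1)):
        return 0                              # some word has no parent
    for word in range(1, n_words + 1):
        seen, cur = set(), word               # walk up toward ROOT, watch for cycles
        while cur != 0:
            if cur in seen:
                return 0
            seen.add(cur)
            cur = parents[cur]
    return 1

print(tree_factor(3, {(0, 2), (2, 1), (2, 3)}))   # 1: a legal tree
print(tree_factor(3, {(0, 2), (2, 1), (1, 2)}))   # 0: word 2 has two parents
print(tree_factor(3, {(2, 1), (1, 2), (0, 3)}))   # 0: cycle between words 1 and 2
```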
61 Local factors for parsing
- So what factors shall we multiply to define parse probability?
  - Unary factors to evaluate each link in isolation
  - Global TREE factor to require that the links form a legal tree
    - this is a hard constraint: the factor is either 0 or 1
  - Second-order effects: factors on 2 variables
    - grandparent
          f   t
      f   1   1
      t   1   3
      (when both links in the grandparent chain are t, this factor contributes 3)
61
62 Local factors for parsing
- So what factors shall we multiply to define parse probability?
  - Unary factors to evaluate each link in isolation
  - Global TREE factor to require that the links form a legal tree
    - this is a hard constraint: the factor is either 0 or 1
  - Second-order effects: factors on 2 variables
    - grandparent
    - no-cross
          f   t
      f   1   1
      t   1   0.2
      (two links that would cross each other are discouraged: both t contributes only 0.2)
62
63 Local factors for parsing
- So what factors shall we multiply to define parse probability?
  - Unary factors to evaluate each link in isolation
  - Global TREE factor to require that the links form a legal tree
    - this is a hard constraint: the factor is either 0 or 1
  - Second-order effects: factors on 2 variables
    - grandparent
    - no-cross
    - siblings
    - hidden POS tags
    - subcategorization
    - ...
63
64 Outline
- Edge-factored parsing (Old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (New!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
65 Good to have lots of features, but ...
- Nice model :-)
- Shame about the NP-hardness :-(
- Can we approximate?
- Machine learning to the rescue!
  - The ML community has given a lot to NLP
  - In the 2000s, NLP has been giving back to ML
    - mainly techniques for joint prediction of structures
    - much earlier, speech recognition had HMMs, EM, smoothing ...
65
66 Great Ideas in ML: Message Passing
Count the soldiers
[Figure: soldiers standing in a line; each passes "1 before you", "2 before you", ..., "5 before you" down the line in one direction and "1 behind you", ..., "5 behind you" in the other]
adapted from MacKay (2003) textbook
66
67 Great Ideas in ML: Message Passing
Count the soldiers
- Incoming messages: "2 before you", "3 behind you"
- Belief: "Must be 2 + 1 + 3 = 6 of us"
- "only see my incoming messages"
adapted from MacKay (2003) textbook
67
68 Great Ideas in ML: Message Passing
Count the soldiers
- Incoming messages: "1 before you", "4 behind you"
- "only see my incoming messages"
adapted from MacKay (2003) textbook
68
69 Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree
- "3 here" and "7 here" arrive, so this soldier reports "11 here" (= 7 + 3 + 1)
adapted from MacKay (2003) textbook
69
70 Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree
- "3 here" and "3 here" arrive, so this soldier reports "7 here" (= 3 + 3 + 1)
adapted from MacKay (2003) textbook
70
71 Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree
- "7 here" and "3 here" arrive, so this soldier reports "11 here" (= 7 + 3 + 1)
adapted from MacKay (2003) textbook
71
72 Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree
- Incoming reports: "3 here", "7 here", "3 here"
- Belief: "Must be 14 of us"
adapted from MacKay (2003) textbook
72
73 Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree
- Incoming reports: "3 here", "7 here", "3 here"
- Belief: "Must be 14 of us"
- This wouldn't work correctly with a loopy (cyclic) graph
adapted from MacKay (2003) textbook
73
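The counting scheme is easy to write down for an arbitrary tree: the message a soldier sends across an edge is one (for itself) plus the messages it received from its other neighbors, and its belief adds up all incoming messages plus one. A small sketch on an invented five-node tree:

```python
import functools
from collections import defaultdict

edges = [("A", "B"), ("B", "C"), ("B", "D"), ("D", "E")]   # a 5-node tree
neighbors = defaultdict(set)
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

@functools.lru_cache(maxsize=None)
def message(u, v):
    """Count of soldiers on u's side of the u-v edge, as reported to v."""
    return 1 + sum(message(w, u) for w in neighbors[u] - {v})

for node in sorted(neighbors):
    belief = 1 + sum(message(w, node) for w in neighbors[node])
    print(node, belief)        # every node concludes there are 5 soldiers
```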
74 Great ideas in ML: Forward-Backward
- In the CRF, message passing = forward-backward
[Figure: the chain CRF for "find preferred tags", with forward messages (α) and backward messages (β) passing through the same binary tag-compatibility factors as before]
- At one tag variable, the incoming messages are α = (v 2, n 1, a 7) and β = (v 3, n 1, a 6), and that word's unary factor is (v 0.3, n 0, a 0.1)
- belief at that variable = α × unary × β, elementwise = (v 1.8, n 0, a 4.2)
- other messages shown in the figure: (v 7, n 2, a 1) and (v 3, n 6, a 1)
74
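The belief on this slide can be checked with one line of arithmetic: it is the elementwise product of the forward message, the word's unary factor, and the backward message (and can then be normalized into a posterior over tags). The numbers below are the slide's:

```python
# Belief at a tag variable = forward message * unary factor * backward message,
# computed elementwise over the tags (v, n, a). Values are from the slide.
alpha = {"v": 2, "n": 1, "a": 7}     # forward message into this variable
beta  = {"v": 3, "n": 1, "a": 6}     # backward message into this variable
unary = {"v": 0.3, "n": 0, "a": 0.1} # unary factor for this word

belief = {t: alpha[t] * unary[t] * beta[t] for t in "vna"}
print(belief)                         # v: 1.8, n: 0, a: 4.2 (up to floating point)

Z = sum(belief.values())
print({t: belief[t] / Z for t in "vna"})   # normalized posterior over tags
```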
75 Great ideas in ML: Forward-Backward
- Extend the CRF to a "skip chain" to capture a non-local factor
- More influences on the belief :-)
[Figure: an extra message (v 3, n 1, a 6) from the skip-chain factor now joins α = (v 2, n 1, a 7), β = (v 3, n 1, a 6), and the unary factor (v 0.3, n 0, a 0.1), so the belief becomes (v 5.4, n 0, a 25.2)]
75
76 Great ideas in ML: Forward-Backward
- Extend the CRF to a "skip chain" to capture a non-local factor
- More influences on the belief :-)
- But the graph becomes loopy :-(
- Red messages not independent? Pretend they are!
[Figure: same skip-chain CRF; the belief is again (v 5.4, n 0, a 25.2)]
76
77 Two great tastes that taste great together
"You got belief propagation in my dynamic programming!"
"You got dynamic programming in my belief propagation!"
78 Loopy Belief Propagation for Parsing
- The sentence tells word 3, "Please be a verb."
- Word 3 tells the 3 → 7 link, "Sorry, then you probably don't exist."
- The 3 → 7 link tells the TREE factor, "You'll have to find another parent for 7."
- The TREE factor tells the 10 → 7 link, "You're on!"
- The 10 → 7 link tells word 10, "Could you please be a noun?"
- ...
78
79 Loopy Belief Propagation for Parsing
- Higher-order factors (e.g., Grandparent) induce loops
- Let's watch a loop around one triangle
- Strong links are suppressing or promoting other links
79
80 Loopy Belief Propagation for Parsing
- Higher-order factors (e.g., Grandparent) induce loops
- Let's watch a loop around one triangle
- How did we compute the outgoing message to the green link?
  - Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?
[TREE factor table, as before: ffffff 0, ffffft 0, fffftf 0, ..., fftfft 1, ..., tttttt 0]
80
81 Loopy Belief Propagation for Parsing
- How did we compute the outgoing message to the green link?
  - Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?
[TREE factor table, as before: ffffff 0, ffffft 0, fffftf 0, ..., fftfft 1, ..., tttttt 0]
81
82 Loopy Belief Propagation for Parsing
- How did we compute the outgoing message to the green link?
  - Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?
- Belief propagation assumes that the incoming messages to TREE are independent. So the outgoing messages can be computed with first-order parsing algorithms (fast, no grammar constant).
82
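To see what the TREE factor's outgoing message is, the brute-force sketch below enumerates every legal tree of a tiny three-word sentence, weights each tree by the incoming messages of all links except the target link, and accumulates the mass of trees with and without that link. The incoming message values are made up; the paper computes the same quantity in O(n³) with matrix-tree / inside-outside machinery rather than by enumeration:

```python
import itertools

n = 3                                     # words 1..3, 0 is ROOT
# Incoming message to TREE for each link (head, child): (value at f, value at t).
msg_in = {(h, c): (1.0, 1.5 if h == 0 else 1.0 + 0.1 * h + 0.2 * c)
          for h in range(n + 1) for c in range(1, n + 1) if h != c}

def legal_trees():
    """Yield each legal parse as a dict child -> head (one parent each, no cycles)."""
    for heads in itertools.product(range(n + 1), repeat=n):
        parents = {c: heads[c - 1] for c in range(1, n + 1)}
        if any(parents[c] == c for c in parents):
            continue
        ok = True
        for c in parents:                 # walking up must reach ROOT (0)
            seen, cur = set(), c
            while cur != 0 and ok:
                ok = cur not in seen
                seen.add(cur)
                cur = parents[cur]
        if ok:
            yield parents

def outgoing_message(link):
    """Message TREE -> link: (mass of legal trees without the link, mass with it),
    weighting every OTHER link by its incoming message."""
    out_f = out_t = 0.0
    for parents in legal_trees():
        mass = 1.0
        for (h, c), (mf, mt) in msg_in.items():
            if (h, c) == link:
                continue
            mass *= mt if parents[c] == h else mf
        if parents[link[1]] == link[0]:
            out_t += mass
        else:
            out_f += mass
    return out_f, out_t

print(outgoing_message((0, 2)))   # how strongly TREE wants the link ROOT -> word 2
```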
83 Some connections
- Parser stacking (Nivre & McDonald 2008; Martins et al. 2008)
- Global constraints in arc consistency
  - ALLDIFFERENT constraint (Régin 1994)
- Matching constraint in max-product BP
  - For computer vision (Duchi et al., 2006)
  - Could be used for machine translation
- As far as we know, our parser is the first use of global constraints in sum-product BP.
84 Outline
- Edge-factored parsing (Old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (New!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
85 Runtimes for each factor type (see paper)

Factor type        degree   runtime   count    total
Tree               O(n²)    O(n³)     1        O(n³)
Proj. Tree         O(n²)    O(n³)     1        O(n³)
Individual links   1        O(1)      O(n²)    O(n²)
Grandparent        2        O(1)      O(n³)    O(n³)
Sibling pairs      2        O(1)      O(n³)    O(n³)
Sibling bigrams    O(n)     O(n²)     O(n)     O(n³)
NoCross            O(n)     O(n)      O(n²)    O(n³)
Tag                1        O(g)      O(n)     O(n)
TagLink            3        O(g²)     O(n²)    O(n²)
TagTrigram         O(n)     O(ng³)    1        O(n)
TOTAL                                          O(n³) per iteration

Additive, not multiplicative!
86 Runtimes for each factor type (see paper)

Factor type        degree   runtime   count    total
Tree               O(n²)    O(n³)     1        O(n³)
Proj. Tree         O(n²)    O(n³)     1        O(n³)
Individual links   1        O(1)      O(n²)    O(n²)
Grandparent        2        O(1)      O(n³)    O(n³)
Sibling pairs      2        O(1)      O(n³)    O(n³)
Sibling bigrams    O(n)     O(n²)     O(n)     O(n³)
NoCross            O(n)     O(n)      O(n²)    O(n³)
Tag                1        O(g)      O(n)     O(n)
TagLink            3        O(g²)     O(n²)    O(n²)
TagTrigram         O(n)     O(ng³)    1        O(n)
TOTAL                                          O(n³)

Each global factor coordinates an unbounded number of variables. Standard belief propagation would take exponential time to iterate over all configurations of those variables. See the paper for efficient propagators.
Additive, not multiplicative!
87 Experimental Details
- Decoding
  - Run several iterations of belief propagation
  - Get final beliefs at link variables
  - Feed them into a first-order parser
  - This gives the Min Bayes Risk tree (minimizes expected error)
- Training
  - BP computes beliefs about each factor, too ...
  - ... which gives us gradients for max conditional likelihood
  - (as in the forward-backward algorithm)
- Features used in experiments
  - First-order: Individual links, just as in McDonald et al. 2005
  - Higher-order: Grandparent, Sibling bigrams, NoCross
87
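A small sketch of the minimum-Bayes-risk decoding step under edge-level loss: pick the legal tree whose links have the highest total belief. Here a three-word sentence is brute-forced with made-up belief values; in practice this is just the usual first-order parser (MST or projective DP) run with the beliefs as edge scores:

```python
import itertools

n = 3                                  # words 1..3; 0 is ROOT
# Final beliefs p(link = t) from BP; values are made up for illustration.
belief = {(h, c): 0.5 for h in range(n + 1) for c in range(1, n + 1) if h != c}
belief.update({(0, 2): 0.9, (2, 1): 0.8, (2, 3): 0.7})

def is_tree(parents):                  # parents maps child -> head
    for c in parents:
        seen, cur = set(), c
        while cur != 0:
            if cur in seen:
                return False
            seen.add(cur)
            cur = parents[cur]
    return True

best_tree, best_total = None, float("-inf")
for heads in itertools.product(range(n + 1), repeat=n):
    parents = {c: heads[c - 1] for c in range(1, n + 1)}
    if any(h == c for c, h in parents.items()) or not is_tree(parents):
        continue
    total = sum(belief[(h, c)] for c, h in parents.items())
    if total > best_total:
        best_tree, best_total = parents, total

print(best_tree)   # {1: 2, 2: 0, 3: 2}: the tree with the highest total belief
```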
88 Dependency Accuracy: the extra, higher-order features help! (non-projective parsing)

               Danish   Dutch   English
TreeLink        85.5     87.3    88.6
NoCross         86.1     88.3    89.1
Grandparent     86.1     88.6    89.4
ChildSeq        86.5     88.5    90.1
89 Dependency Accuracy: the extra, higher-order features help! (non-projective parsing)

                                        Danish   Dutch   English
TreeLink                                 85.5     87.3    88.6
NoCross                                  86.1     88.3    89.1
Grandparent                              86.1     88.6    89.4
ChildSeq                                 86.5     88.5    90.1
Best projective parse with all factors   86.0     84.5    90.2    (exact, slow)
+ hill-climbing                          86.1     87.6    90.2    (doesn't fix enough edges)
90 Time vs. Projective Search Error
[Figure: projective search error vs. number of BP iterations, one panel compared with the O(n⁴) DP and one with the O(n⁵) DP; annotated "DP = 140 iterations"]
91 Runtime: BP vs. DP
[Figure: runtime of BP compared with the O(n⁴) DP and with the O(n⁵) DP]
91
92 Outline
- Edge-factored parsing (Old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (New!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
93 Freedom Regained
This paper in a nutshell
- Output probability defined as a product of local and global factors
  - Throw in any factors we want! (log-linear model)
  - Each factor must be fast, but they run independently
- Let local factors negotiate via belief propagation
  - Each bit of syntactic structure is influenced by others
  - Some factors need combinatorial algorithms to compute messages fast
    - e.g., existing parsing algorithms using dynamic programming
  - Each iteration takes total time O(n³) or even O(n²); see paper
  - Compare: reranking or stacking
- Converges to a pretty good (but approximate) global parse
  - Fast parsing for formerly intractable or slow models
  - Extra features of these models really do help accuracy
93
94 Future Opportunities
- Efficiently modeling more hidden structure
  - POS tags, link roles, secondary links (DAG-shaped parses)
- Beyond dependencies
  - Constituency parsing, traces, lattice parsing
- Beyond parsing
  - Alignment, translation
  - Bipartite matching and network flow
  - Joint decoding of parsing and other tasks (IE, MT, reasoning ...)
- Beyond text
  - Image tracking and retrieval
  - Social networks
95 Thank you!