1
Dependency Parsing by Belief Propagation
  • David A. Smith (JHU → UMass Amherst)
  • Jason Eisner (Johns Hopkins University)

1
2
Outline
  • Edge-factored parsing
  • Dependency parses
  • Scoring the competing parses: Edge features
  • Finding the best parse
  • Higher-order parsing
  • Throwing in more features: Graphical models
  • Finding the best parse: Belief propagation
  • Experiments
  • Conclusions

Old
New!
3
Outline
  • Edge-factored parsing
  • Dependency parses
  • Scoring the competing parses: Edge features
  • Finding the best parse
  • Higher-order parsing
  • Throwing in more features: Graphical models
  • Finding the best parse: Belief propagation
  • Experiments
  • Conclusions

Old
New!
4
Word Dependency Parsing
Raw sentence
He reckons the current account deficit will
narrow to only 1.8 billion in September.
POS-tagged sentence
He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.
Word dependency parsed sentence
He reckons the current account deficit will narrow to only 1.8 billion in September .
(figure: dependency arcs labeled SUBJ, COMP, SPEC, S-COMP, MOD, ROOT)
slide adapted from Yuji Matsumoto
5
What does parsing have to do with belief
propagation?
loopy belief propagation
6
Outline
  • Edge-factored parsing
  • Dependency parses
  • Scoring the competing parses: Edge features
  • Finding the best parse
  • Higher-order parsing
  • Throwing in more features: Graphical models
  • Finding the best parse: Belief propagation
  • Experiments
  • Conclusions

Old
New!
7
Great ideas in NLP: Log-linear models
(Berger, della Pietra & della Pietra 1996; Darroch & Ratcliff 1972)
  • In the beginning, we used generative models.

p(A) p(B | A) p(C | A, B) p(D | A, B, C)
each choice depends on a limited part of the
history
but which dependencies to allow? what if they're
all worthwhile?
7
8
Great ideas in NLP: Log-linear models
(Berger, della Pietra & della Pietra 1996; Darroch & Ratcliff 1972)
  • In the beginning, we used generative models.
  • Solution: Log-linear (max-entropy) modeling
  • Features may interact in arbitrary ways
  • Iterative scaling keeps adjusting the feature
    weights until the model agrees with the training
    data.

p(A) p(B | A) p(C | A, B) p(D | A, B, C)
which dependencies to allow? (given limited
training data)
(1/Z) F(A) F(B,A) F(C,A) F(C,B)
F(D,A,B) F(D,B,C) F(D,A,C)
throw them all in!
8
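To make the "throw them all in" idea concrete, here is a minimal sketch (not from the slides) of a log-linear model over four binary variables A, B, C, D with arbitrarily overlapping features; the feature set and weights are hypothetical, and Z is computed by brute-force enumeration rather than by iterative scaling.

```python
import itertools
import math

# Hypothetical overlapping features over four binary variables A, B, C, D.
def features(a, b, c, d):
    return {
        "A": a,
        "B&A": b * a,
        "C&B": c * b,
        "D&A&C": d * a * c,   # features may interact in arbitrary ways
    }

weights = {"A": 0.5, "B&A": 1.2, "C&B": -0.7, "D&A&C": 2.0}   # hypothetical

def unnormalized(x):
    # product of factors = exp of the weighted feature sum
    return math.exp(sum(weights[k] * v for k, v in features(*x).items()))

space = list(itertools.product([0, 1], repeat=4))
Z = sum(unnormalized(x) for x in space)          # the normalizer
p = {x: unnormalized(x) / Z for x in space}      # p(x) = (1/Z) * product of factors
```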
9
How about structured outputs?
  • Log-linear models: great for n-way classification
  • Also good for predicting sequences
  • Also good for dependency parsing

but to allow fast dynamic programming, only use
n-gram features
but to allow fast dynamic programming or MST
parsing, only use single-edge features
9
10
How about structured outputs?
but to allow fast dynamic programming or MST
parsing, only use single-edge features
10
11
Edge-Factored Parsers (McDonald et al. 2005)
  • Is this a good edge?

yes, lots of green ...
(figure: the Czech sentence "Byl jasný studený dubnový den a hodiny odbíjely třináctou")
12
Edge-Factored Parsers (McDonald et al. 2005)
  • Is this a good edge?

jasný → den (bright day)
(figure: the Czech sentence "Byl jasný studený dubnový den a hodiny odbíjely třináctou")
13
Edge-Factored Parsers (McDonald et al. 2005)
  • Is this a good edge?

jasný → N (bright NOUN)
jasný → den (bright day)
(figure: the Czech sentence "Byl jasný studený dubnový den a hodiny odbíjely třináctou" with POS tags V A A A N J N V C)
14
Edge-Factored Parsers (McDonald et al. 2005)
  • Is this a good edge?

jasný → N (bright NOUN)
jasný → den (bright day)
A → N
(figure: the Czech sentence "Byl jasný studený dubnový den a hodiny odbíjely třináctou" with POS tags V A A A N J N V C)
15
Edge-Factored Parsers (McDonald et al. 2005)
  • Is this a good edge?

jasný → N (bright NOUN)
jasný → den (bright day)
A → N preceding conjunction
A → N
(figure: the Czech sentence "Byl jasný studený dubnový den a hodiny odbíjely třináctou" with POS tags V A A A N J N V C)
16
Edge-Factored Parsers (McDonald et al. 2005)
  • How about this competing edge?

not as good, lots of red ...
(figure: the Czech sentence "Byl jasný studený dubnový den a hodiny odbíjely třináctou" with POS tags V A A A N J N V C)
17
Edge-Factored Parsers (McDonald et al. 2005)
  • How about this competing edge?

jasný → hodiny (bright clocks)
... undertrained ...
(figure: the Czech sentence "Byl jasný studený dubnový den a hodiny odbíjely třináctou" with POS tags V A A A N J N V C)
18
Edge-Factored Parsers (McDonald et al. 2005)
  • How about this competing edge?

jasn → hodi (bright clock, stems only)
(figure: the Czech sentence "Byl jasný studený dubnový den a hodiny odbíjely třináctou" with POS tags V A A A N J N V C and stems byl jasn stud dubn den a hodi odbí trin)
19
Edge-Factored Parsers (McDonald et al. 2005)
  • How about this competing edge?

jasn → hodi (bright clock, stems only)
A(plural) → N(singular)
(figure: the Czech sentence "Byl jasný studený dubnový den a hodiny odbíjely třináctou" with POS tags V A A A N J N V C and stems byl jasn stud dubn den a hodi odbí trin)
20
Edge-Factored Parsers (McDonald et al. 2005)
  • How about this competing edge?

jasn → hodi (bright clock, stems only)
A → N where N follows a conjunction
A(plural) → N(singular)
(figure: the Czech sentence "Byl jasný studený dubnový den a hodiny odbíjely třináctou" with POS tags V A A A N J N V C and stems byl jasn stud dubn den a hodi odbí trin)
21
Edge-Factored Parsers (McDonald et al. 2005)
  • Which edge is better?
  • bright day or bright clocks?

(figure: the Czech sentence "Byl jasný studený dubnový den a hodiny odbíjely třináctou" with POS tags V A A A N J N V C and stems byl jasn stud dubn den a hodi odbí trin)
22
Edge-Factored Parsers (McDonald et al. 2005)
  • Which edge is better?
  • Score of an edge e = θ · features(e)
  • Standard algos → valid parse with max total score

(figure: the Czech sentence "Byl jasný studený dubnový den a hodiny odbíjely třináctou" with POS tags V A A A N J N V C and stems byl jasn stud dubn den a hodi odbí trin)
23
Edge-Factored Parsers (McDonald et al. 2005)
  • Which edge is better?
  • Score of an edge e = θ · features(e)
  • Standard algos → valid parse with max total score

can't have both (one parent per word)
Thus, an edge may lose (or win) because of a
consensus of other edges.
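As a concrete sketch of edge-factored scoring (not the paper's exact feature set): each candidate edge gets a sparse feature vector, and its score is the dot product with a learned weight vector θ. The feature templates below are hypothetical illustrations of the word-pair, tag-pair, and stem-pair features from the preceding slides, and they assume each token is a dict with 'form' and 'tag' fields.

```python
# Sketch only: feature templates and weights are illustrative, not the paper's.
def edge_features(head, child, sent):
    h, c = sent[head], sent[child]
    return {
        f"word:{h['form']}>{c['form']}": 1.0,          # e.g. jasný > den
        f"tag:{h['tag']}>{c['tag']}": 1.0,             # e.g. A > N
        f"stem:{h['form'][:4]}>{c['form'][:4]}": 1.0,  # e.g. jasn > hodi
    }

def edge_score(head, child, sent, theta):
    feats = edge_features(head, child, sent)
    return sum(theta.get(name, 0.0) * value for name, value in feats.items())

# A standard algorithm (projective DP or MST parsing) then returns the valid
# tree whose edges have the maximum total edge_score.
```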
24
Outline
  • Edge-factored parsing
  • Dependency parses
  • Scoring the competing parses: Edge features
  • Finding the best parse
  • Higher-order parsing
  • Throwing in more features: Graphical models
  • Finding the best parse: Belief propagation
  • Experiments
  • Conclusions

Old
New!
25
Finding Highest-Scoring Parse
  • Convert to context-free grammar (CFG)
  • Then use dynamic programming

each subtree is a linguistic constituent (here a
noun phrase)
26
Finding Highest-Scoring Parse
  • Convert to context-free grammar (CFG)
  • Then use dynamic programming
  • CKY algorithm for CFG parsing is O(n³)
  • Unfortunately, O(n⁵) in this case
  • to score the cat → wore link, it's not enough to know
    this is an NP
  • must know it's rooted at cat
  • so expand the nonterminal set by O(n): NP_the, NP_cat,
    NP_hat, ...
  • so CKY's grammar constant is no longer constant

each subtree is a linguistic constituent (here a
noun phrase)
27
Finding Highest-Scoring Parse
  • Convert to context-free grammar (CFG)
  • Then use dynamic programming
  • CKY algorithm for CFG parsing is O(n³)
  • Unfortunately, O(n⁵) in this case
  • Solution: Use a different decomposition (Eisner 1996)
  • Back to O(n³)

each subtree is a linguistic constituent (here a
noun phrase)
28
Spans vs. constituents
  • Two kinds of substring.
  • Constituent of the tree links to the rest only
    through its headword (root).
  • Span of the tree links to the rest only through
    its endwords.

The cat in the hat wore a stovepipe. ROOT
29
Decomposing a tree into spans

cat in the hat wore a stovepipe. ROOT

wore a stovepipe. ROOT
cat in the hat wore

in the hat wore
cat in

in the hat
30
Finding Highest-Scoring Parse
  • Convert to context-free grammar (CFG)
  • Then use dynamic programming
  • CKY algorithm for CFG parsing is O(n³)
  • Unfortunately, O(n⁵) in this case
  • Solution: Use a different decomposition (Eisner 1996)
  • Back to O(n³)
  • Can play usual tricks for dynamic programming
    parsing
  • Further refining the constituents or spans
  • Allow prob. model to keep track of even more
    internal information
  • A*, best-first, coarse-to-fine
  • Training by EM etc.

31
Hard Constraints on Valid Trees
  • Score of an edge e = θ · features(e)
  • Standard algos → valid parse with max total score

can't have both (one parent per word)
Thus, an edge may lose (or win) because of a
consensus of other edges.
32
Non-Projective Parses
(figure: non-projective parse of "I 'll give a talk on bootstrapping tomorrow", with ROOT)
The subtree rooted at "talk" is a discontiguous noun phrase.
The projectivity restriction. Do we really want
it?
33
Non-Projective Parses
(figure: non-projective English parse of "I 'll give a talk on bootstrapping tomorrow", with ROOT)
occasional non-projectivity in English
(figure: Latin example "ista meam norit gloria canitiem", glossed word by word: that-NOM, my-ACC, may-know, glory-NOM, going-gray-ACC)
That glory may-know my going-gray (i.e., it
shall last till I go gray)
frequent non-projectivity in Latin, etc.
34
Finding highest-scoring non-projective tree
  • Consider the sentence John saw Mary (left).
  • The Chu-Liu-Edmonds algorithm finds the
    maximum-weight spanning tree (right), which may be
    non-projective.
  • It can be found in time O(n²).

(figure: candidate head scores among ROOT, John, saw, and Mary, and the resulting maximum spanning tree)
Every node selects its best parent; if there are cycles, contract them and repeat.
slide thanks to Dragomir Radev
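A minimal sketch of the first step described on this slide: every non-root node picks its highest-scoring parent, and we then check whether the result contains a cycle. The contract-and-repeat part of Chu-Liu-Edmonds is omitted, and the weight matrix below is hypothetical rather than the slide's exact numbers.

```python
NEG = float("-inf")

def greedy_parents(W, root=0):
    """Each non-root node picks its highest-scoring parent (W[h][c] = score of
    edge h -> c); returns the parent map and whether it contains a cycle.
    Cycle contraction, the 'repeat' part of Chu-Liu-Edmonds, is not shown."""
    n = len(W)
    parent = {c: max((h for h in range(n) if h != c), key=lambda h: W[h][c])
              for c in range(n) if c != root}

    def reaches_root(c):
        seen = set()
        while c != root:
            if c in seen:
                return False          # walked into a cycle
            seen.add(c)
            c = parent[c]
        return True

    return parent, not all(reaches_root(c) for c in parent)

# Hypothetical weights over ROOT, John, saw, Mary (row = head, col = child).
W = [[NEG,   9,  10,   9],
     [NEG, NEG,  20,   3],
     [NEG,  30, NEG,  30],
     [NEG,  11,   0, NEG]]
print(greedy_parents(W))   # John and saw pick each other: a cycle to contract
```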
35
Finding highest-scoring non-projective tree → Summing over all non-projective trees
  • Consider the sentence John saw Mary (left).
  • The Chu-Liu-Edmonds algorithm finds the
    maximum-weight spanning tree (right), which may be
    non-projective.
  • It can be found in time O(n²).
  • How about total weight Z of all trees?
  • How about outside probabilities or gradients?
  • These can be found in time O(n³) by matrix determinants
    and inverses (Smith & Smith, 2007).

slide thanks to Dragomir Radev
36
Graph Theory to the Rescue!
O(n³) time!
Tutte's Matrix-Tree Theorem (1948)
The determinant of the Kirchhoff (aka Laplacian)
adjacency matrix of directed graph G, with row
and column r removed, is equal to the sum of scores of all
directed spanning trees of G rooted at node r.
Exactly the Z we need!
37
Building the Kirchhoff (Laplacian) Matrix
  • Negate edge scores
  • Sum columns (children)
  • Strike root row/col.
  • Take determinant

N.B. This allows multiple children of root, but
see Koo et al. 2007.
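A minimal NumPy sketch of the recipe on this slide, assuming scores[h, c] holds the multiplicative score of edge h → c (the diagonal is ignored): negate the edge scores, put each child's column sum on the diagonal, strike the root row and column, and take the determinant to get Z.

```python
import numpy as np

def tree_partition_function(scores, root=0):
    """Sum of the scores of all directed spanning trees rooted at `root`
    (the Matrix-Tree / Kirchhoff construction sketched on this slide)."""
    S = scores.astype(float).copy()
    np.fill_diagonal(S, 0.0)
    K = -S                                  # negate edge scores
    np.fill_diagonal(K, S.sum(axis=0))      # column sums = each child's total
    keep = [i for i in range(S.shape[0]) if i != root]
    return np.linalg.det(K[np.ix_(keep, keep)])

# Toy check with 3 nodes (0 = ROOT): the trees are {0->1, 0->2}, {0->1, 1->2},
# and {0->2, 2->1}, so Z should be 2*3 + 2*5 + 3*7 = 37.
s = np.array([[0., 2., 3.],
              [0., 0., 5.],
              [0., 7., 0.]])
print(tree_partition_function(s))   # 37.0
```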
38
Why Should This Work?
Clear for a 1×1 matrix; use induction.
Chu-Liu-Edmonds analogy: every node selects its best parent; if there are cycles, contract and recur.
Undirected case, plus special root cases for the directed version.
39
Outline
  • Edge-factored parsing
  • Dependency parses
  • Scoring the competing parses: Edge features
  • Finding the best parse
  • Higher-order parsing
  • Throwing in more features: Graphical models
  • Finding the best parse: Belief propagation
  • Experiments
  • Conclusions

Old
New!
40
Exactly Finding the Best Parse
but to allow fast dynamic programming or MST
parsing, only use single-edge features
  • With arbitrary features, runtime blows up
  • Projective parsing: O(n³) by dynamic programming
  • Non-projective: O(n²) by minimum spanning tree

40
41
Let's reclaim our freedom (again!)
This paper in a nutshell
  • Output probability is a product of local factors
  • Throw in any factors we want! (log-linear
    model)
  • How could we find best parse?
  • Integer linear programming (Riedel et al., 2006)
  • doesn't give us probabilities when training or
    parsing
  • MCMC
  • Slow to mix? High rejection rate because of hard
    TREE constraint?
  • Greedy hill-climbing (McDonald & Pereira 2006)

(1/Z) F(A) F(B,A) F(C,A) F(C,B)
F(D,A,B) F(D,B,C) F(D,A,C)
41
42
Let's reclaim our freedom (again!)
This paper in a nutshell
  • Output probability is a product of local factors
  • Throw in any factors we want! (log-linear
    model)
  • Let local factors negotiate via belief
    propagation
  • Links (and tags) reinforce or suppress one
    another
  • Each iteration takes total time O(n²) or O(n³)
  • Converges to a pretty good (but approx.) global
    parse

(1/Z) F(A) F(B,A) F(C,A) F(C,B)
F(D,A,B) F(D,B,C) F(D,A,C)
42
43
Lets reclaim our freedom (again!)
This paper in a nutshell
Training with many features | Decoding with many features
Iterative scaling | Belief propagation
Each weight in turn is influenced by others | Each variable in turn is influenced by others
Iterate to achieve globally optimal weights | Iterate to achieve locally consistent beliefs
To train distrib. over trees, use dynamic programming to compute normalizer Z | To decode distrib. over trees, use dynamic programming to compute messages
New!
44
Outline
  • Edge-factored parsing
  • Dependency parses
  • Scoring the competing parses: Edge features
  • Finding the best parse
  • Higher-order parsing
  • Throwing in more features: Graphical models
  • Finding the best parse: Belief propagation
  • Experiments
  • Conclusions

Old
New!
45
Local factors in a graphical model
  • First, a familiar example
  • Conditional Random Field (CRF) for POS tagging

Possible tagging (i.e., assignment to remaining
variables)


(figure: one possible tagging, v v v, of the observed words "find preferred tags")
Observed input sentence (shaded)
45
46
Local factors in a graphical model
  • First, a familiar example
  • Conditional Random Field (CRF) for POS tagging

Possible tagging (i.e., assignment to remaining
variables) Another possible tagging


(figure: another possible tagging, v a n, of the observed words "find preferred tags")
Observed input sentence (shaded)
46
47
Local factors in a graphical model
  • First, a familiar example
  • Conditional Random Field (CRF) for POS tagging

Binary factor that measures compatibility of 2
adjacent tags
Model reuses the same parameters at this position
v n a
v 0 2 1
n 2 1 0
a 0 3 1
v n a
v 0 2 1
n 2 1 0
a 0 3 1


preferred
find
tags
47
48
Local factors in a graphical model
  • First, a familiar example
  • Conditional Random Field (CRF) for POS tagging

Unary factor evaluates this tag; its values
depend on the corresponding word


v 0.2
n 0.2
a 0
can't be adj
preferred
find
tags
48
49
Local factors in a graphical model
  • First, a familiar example
  • Conditional Random Field (CRF) for POS tagging

Unary factor evaluates this tag; its values
depend on the corresponding word


v 0.2
n 0.2
a 0
preferred
find
tags
(could be made to depend on the entire observed
sentence)
49
50
Local factors in a graphical model
  • First, a familiar example
  • Conditional Random Field (CRF) for POS tagging

Unary factor evaluates this tag; a different unary
factor at each position


v 0.2
n 0.2
a 0
v 0.3
n 0.02
a 0
v 0.3
n 0
a 0.1
preferred
find
tags
50
51
Local factors in a graphical model
  • First, a familiar example
  • Conditional Random Field (CRF) for POS tagging

p(v a n) is proportional to the product of
all factors' values on v a n
v n a
v 0 2 1
n 2 1 0
a 0 3 1
v n a
v 0 2 1
n 2 1 0
a 0 3 1


v
a
n
v 0.3
n 0.02
a 0
v 0.3
n 0
a 0.1
v 0.2
n 0.2
a 0
preferred
find
tags
51
52
Local factors in a graphical model
  • First, a familiar example
  • Conditional Random Field (CRF) for POS tagging

p(v a n) is proportional to the product of
all factors' values on v a n
v n a
v 0 2 1
n 2 1 0
a 0 3 1
v n a
v 0 2 1
n 2 1 0
a 0 3 1
1 · 3 · 0.3 · 0.1 · 0.2


v
a
n
v 0.3
n 0.02
a 0
v 0.3
n 0
a 0.1
v 0.2
n 0.2
a 0
preferred
find
tags
52
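A tiny sketch of the product on this slide: assuming the words are "find preferred tags" in that order and the unary tables are assigned to them as shown (an assumption about the figure), the unnormalized score of the tagging (v, a, n) is the product of the two adjacent-tag factors and the three unary factors, matching 1 · 3 · 0.3 · 0.1 · 0.2.

```python
# Transition factor (the same table reused at both positions) and per-word
# unary factors, copied from the slide.
trans = {("v", "v"): 0, ("v", "n"): 2, ("v", "a"): 1,
         ("n", "v"): 2, ("n", "n"): 1, ("n", "a"): 0,
         ("a", "v"): 0, ("a", "n"): 3, ("a", "a"): 1}
unary = [{"v": 0.3, "n": 0.02, "a": 0.0},   # find
         {"v": 0.3, "n": 0.0,  "a": 0.1},   # preferred
         {"v": 0.2, "n": 0.2,  "a": 0.0}]   # tags

def unnormalized(tags):
    score = 1.0
    for table, t in zip(unary, tags):
        score *= table[t]                   # unary factors
    for t1, t2 in zip(tags, tags[1:]):
        score *= trans[(t1, t2)]            # adjacent-tag factors
    return score

print(unnormalized(("v", "a", "n")))        # 1 * 3 * 0.3 * 0.1 * 0.2 = 0.018
```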
53
Local factors in a graphical model
  • First, a familiar example
  • CRF for POS tagging
  • Now let's do dependency parsing!
  • O(n²) boolean variables for the possible links



preferred
find
links
53
54
Local factors in a graphical model
  • First, a familiar example
  • CRF for POS tagging
  • Now let's do dependency parsing!
  • O(n²) boolean variables for the possible links

Possible parse
encoded as an assignment to these vars


preferred
find
links
54
55
Local factors in a graphical model
  • First, a familiar example
  • CRF for POS tagging
  • Now let's do dependency parsing!
  • O(n²) boolean variables for the possible links

Possible parse
encoded as an assignment to these vars
Another possible parse
f
f
t
t
f
f


preferred
find
links
55
56
Local factors in a graphical model
  • First, a familiar example
  • CRF for POS tagging
  • Now let's do dependency parsing!
  • O(n²) boolean variables for the possible links

Possible parse
encoded as an assignment to these vars
Another possible parse
An illegal parse
f
t
t
t
f
f


preferred
find
links
56
57
Local factors in a graphical model
  • First, a familiar example
  • CRF for POS tagging
  • Now let's do dependency parsing!
  • O(n²) boolean variables for the possible links

Possible parse
encoded as an assignment to these vars
Another possible parse
An illegal parse Another illegal parse
f
t
t
t
f
t


preferred
find
links
(multiple parents)
57
58
Local factors for parsing
  • So what factors shall we multiply to define parse
    probability?
  • Unary factors to evaluate each link in isolation

But what if the best assignment isn't a tree??
as before, goodness of this link can depend on
entire observed input context
t 2
f 1
some other links aren't as good given this
input sentence
t 1
f 2
t 1
f 8
t 1
f 3


preferred
find
links
t 1
f 2
t 1
f 6
58
59
Global factors for parsing
  • So what factors shall we multiply to define parse
    probability?
  • Unary factors to evaluate each link in isolation
  • Global TREE factor to require that the links form
    a legal tree
  • this is a hard constraint: the factor is either 0
    or 1

ffffff 0
ffffft 0
fffftf 0

fftfft 1

tttttt 0


preferred
find
links
59
60
Global factors for parsing
  • So what factors shall we multiply to define parse
    probability?
  • Unary factors to evaluate each link in isolation
  • Global TREE factor to require that the links form
    a legal tree
  • this is a hard constraint: the factor is either 0
    or 1

ffffff 0
ffffft 0
fffftf 0

fftfft 1

tttttt 0
t
f
we're legal!
f
f
64 entries (0/1)
f
t


preferred
find
links
60
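To make the hard TREE factor concrete, here is a brute-force sketch for a toy 3-word sentence (word 0 is ROOT): each link has a unary factor value for being present (its value for being absent is taken as 1 here, for simplicity), and any joint assignment of the link variables that is not a tree gets factor value 0. The unary numbers are hypothetical.

```python
import itertools

unary = {(0, 1): 2.0, (0, 2): 1.0, (0, 3): 1.0,      # hypothetical link factors
         (1, 2): 6.0, (2, 1): 1.0, (1, 3): 1.0,
         (3, 1): 2.0, (2, 3): 8.0, (3, 2): 1.0}

def is_tree(parents):                 # parents[c] = chosen head of word c
    for c in parents:
        seen, v = set(), c
        while v != 0:                 # every word must reach ROOT ...
            if v in seen:
                return False          # ... without walking into a cycle
            seen.add(v)
            v = parents[v]
    return True

weights = {}
for heads in itertools.product(range(4), repeat=3):
    parents = dict(zip((1, 2, 3), heads))
    if any(h == c for c, h in parents.items()) or not is_tree(parents):
        continue                      # the TREE factor assigns 0
    w = 1.0
    for c, h in parents.items():
        w *= unary[(h, c)]            # product of the unary link factors
    weights[heads] = w

Z = sum(weights.values())
best = max(weights, key=weights.get)                               # MAP parse
p_link_1_2 = sum(w for t, w in weights.items() if t[1] == 1) / Z   # P(link 1->2)
```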
61
Local factors for parsing
  • So what factors shall we multiply to define parse
    probability?
  • Unary factors to evaluate each link in isolation
  • Global TREE factor to require that the links form
    a legal tree
  • this is a hard constraint: the factor is either 0
    or 1
  • Second-order effects factors on 2 variables
  • grandparent

f t
f 1 1
t 1 3
t
3
t


preferred
find
links
61
62
Local factors for parsing
  • So what factors shall we multiply to define parse
    probability?
  • Unary factors to evaluate each link in isolation
  • Global TREE factor to require that the links form
    a legal tree
  • this is a hard constraint: the factor is either 0
    or 1
  • Second-order effects factors on 2 variables
  • grandparent
  • no-cross

f t
f 1 1
t 1 0.2
t
t


preferred
find
links
by
62
63
Local factors for parsing
  • So what factors shall we multiply to define parse
    probability?
  • Unary factors to evaluate each link in isolation
  • Global TREE factor to require that the links form
    a legal tree
  • this is a hard constraint: the factor is either 0
    or 1
  • Second-order effects factors on 2 variables
  • grandparent
  • no-cross
  • siblings
  • hidden POS tags
  • subcategorization



preferred
find
links
by
63
64
Outline
  • Edge-factored parsing
  • Dependency parses
  • Scoring the competing parses: Edge features
  • Finding the best parse
  • Higher-order parsing
  • Throwing in more features: Graphical models
  • Finding the best parse: Belief propagation
  • Experiments
  • Conclusions

Old
New!
65
Good to have lots of features, but
  • Nice model
  • Shame about the NP-hardness
  • Can we approximate?
  • Machine learning to the rescue!
  • ML community has given a lot to NLP
  • In the 2000s, NLP has been giving back to ML
  • Mainly techniques for joint prediction of
    structures
  • Much earlier, speech recognition had HMMs, EM,
    smoothing

65
66
Great Ideas in ML: Message Passing
Count the soldiers
1 before you
2 before you
3 before you
4 before you
5 before you
3 behind you
2 behind you
1 behind you
4 behind you
5 behind you
adapted from MacKay (2003) textbook
66
67
Great Ideas in ML: Message Passing
Count the soldiers
Belief: Must be 2 + 1 + 3 = 6 of us
2 before you
3 behind you
only see my incoming messages
adapted from MacKay (2003) textbook
67
68
Great Ideas in ML: Message Passing
Count the soldiers
1 before you
4 behind you
only see my incoming messages
adapted from MacKay (2003) textbook
68
69
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches
of tree
3 here
7 here
11 here (= 7 + 3 + 1)
adapted from MacKay (2003) textbook
69
70
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches
of tree
3 here
7 here (= 3 + 3 + 1)
3 here
adapted from MacKay (2003) textbook
70
71
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches
of tree
11 here (= 7 + 3 + 1)
7 here
3 here
adapted from MacKay (2003) textbook
71
72
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches
of tree
3 here
7 here
Belief: Must be 3 + 7 + 3 + 1 = 14 of us
3 here
adapted from MacKay (2003) textbook
72
73
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches
of tree
3 here
7 here
Belief: Must be 3 + 7 + 3 + 1 = 14 of us
3 here
wouldn't work correctly with a loopy (cyclic)
graph
adapted from MacKay (2003) textbook
73
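The soldier-counting trick on these slides can be written as a few lines of message passing on a tree: the message a node sends across an edge is 1 (itself) plus everything it has heard from its other neighbors, and a node's belief is 1 plus everything it receives. The edges below are a hypothetical tree; on a loopy (cyclic) graph this recursion would never terminate, which is exactly the caveat on the last slide.

```python
from collections import defaultdict

edges = [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5)]      # a hypothetical tree
nbrs = defaultdict(set)
for u, v in edges:
    nbrs[u].add(v)
    nbrs[v].add(u)

def message(src, dst):
    """Number of soldiers on src's side of the edge src--dst."""
    return 1 + sum(message(other, src) for other in nbrs[src] - {dst})

def belief(node):
    """Total count seen from `node`: itself plus all incoming messages."""
    return 1 + sum(message(other, node) for other in nbrs[node])

print([belief(n) for n in nbrs])   # every node computes the same total, 6
```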
74
Great ideas in ML: Forward-Backward
  • In the CRF, message passing = forward-backward

belief
v 1.8
n 0
a 4.2
α and β messages
v n a
v 0 2 1
n 2 1 0
a 0 3 1
v n a
v 0 2 1
n 2 1 0
a 0 3 1
v 2
n 1
a 7
v 7
n 2
a 1
v 3
n 1
a 6
v 3
n 6
a 1


v 0.3
n 0
a 0.1
find
tags
preferred
74
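A minimal NumPy sketch of forward-backward on this three-word chain, using the slide's transition table and unary tables (tag order v, n, a; the word order and table assignment are assumed to be find, preferred, tags): α messages sweep left to right, β messages right to left, and their product gives the unnormalized belief at each position.

```python
import numpy as np

trans = np.array([[0, 2, 1],          # rows = previous tag, cols = next tag
                  [2, 1, 0],          # tag order: v, n, a
                  [0, 3, 1]], dtype=float)
unary = np.array([[0.3, 0.02, 0.0],   # find
                  [0.3, 0.0,  0.1],   # preferred
                  [0.2, 0.2,  0.0]])  # tags

n = unary.shape[0]
alpha = np.zeros_like(unary)
beta = np.ones_like(unary)
alpha[0] = unary[0]
for i in range(1, n):                             # forward pass
    alpha[i] = (alpha[i - 1] @ trans) * unary[i]
for i in range(n - 2, -1, -1):                    # backward pass
    beta[i] = trans @ (beta[i + 1] * unary[i + 1])

belief = alpha * beta                             # unnormalized marginals
marginals = belief / belief.sum(axis=1, keepdims=True)
```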
75
Great ideas in ML: Forward-Backward
  • Extend CRF to a skip chain to capture a non-local factor
  • More influences on belief

v 5.4
n 0
a 25.2
α
β
v 3
n 1
a 6
v 2
n 1
a 7


v 0.3
n 0
a 0.1
v 3
n 1
a 6
find
tags
preferred
75
76
Great ideas in ML: Forward-Backward
  • Extend CRF to a skip chain to capture a non-local factor
  • More influences on belief
  • Graph becomes loopy

Red messages not independent? Pretend they are!
v 5.4
n 0
a 25.2
α
β
v 3
n 1
a 6
v 2
n 1
a 7


v 0.3
n 0
a 0.1
v 3
n 1
a 6
find
tags
preferred
76
77
Two great tastes that taste great together
  • Upcoming attractions

You got belief propagation in my dynamic
programming!
You got dynamic programming in my belief
propagation!
78
Loopy Belief Propagation for Parsing
  • Sentence tells word 3, "Please be a verb."
  • Word 3 tells the 3 → 7 link, "Sorry, then you
    probably don't exist."
  • The 3 → 7 link tells the Tree factor, "You'll
    have to find another parent for 7."
  • The Tree factor tells the 10 → 7 link, "You're
    on!"
  • The 10 → 7 link tells 10, "Could you please be a
    noun?"



preferred
find
links
78
79
Loopy Belief Propagation for Parsing
  • Higher-order factors (e.g., Grandparent) induce
    loops
  • Let's watch a loop around one triangle
  • Strong links are suppressing or promoting other
    links



preferred
find
links
79
80
Loopy Belief Propagation for Parsing
  • Higher-order factors (e.g., Grandparent) induce
    loops
  • Let's watch a loop around one triangle
  • How did we compute the outgoing message to the green
    link?
  • Does the TREE factor think that the green link
    is probably t, given the messages it receives
    from all the other links?

?
TREE factor TREE factor
ffffff 0
ffffft 0
fffftf 0

fftfft 1

tttttt 0


preferred
find
links
80
81
Loopy Belief Propagation for Parsing
  • How did we compute outgoing message to green
    link?
  • Does the TREE factor think that the green link
    is probably t, given the messages it receives
    from all the other links?

TREE factor TREE factor
ffffff 0
ffffft 0
fffftf 0

fftfft 1

tttttt 0
?


preferred
find
links
81
82
Loopy Belief Propagation for Parsing
  • How did we compute outgoing message to green
    link?
  • Does the TREE factor think that the green link
    is probably t, given the messages it receives
    from all the other links?

Belief propagation assumes incoming messages to
TREE are independent. So outgoing messages can be
computed with first-order parsing algorithms
(fast, no grammar constant).
82
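For the non-projective TREE factor, the "first-order parsing algorithm" that computes all outgoing messages at once can be the Matrix-Tree construction from earlier: edge marginals under an edge-factored distribution come from the inverse of the Laplacian minor (Smith & Smith 2007; Koo et al. 2007). A NumPy sketch, assuming scores[h, c] is the multiplicative score of edge h → c and index 0 is ROOT:

```python
import numpy as np

def edge_marginals(scores):
    """P(edge h -> c is in the tree) under p(tree) ∝ product of edge scores."""
    S = scores.astype(float).copy()
    S[:, 0] = 0.0
    np.fill_diagonal(S, 0.0)
    L = -S[1:, 1:]                              # Laplacian minor (root struck)
    np.fill_diagonal(L, S[:, 1:].sum(axis=0))
    Linv = np.linalg.inv(L)
    M = np.zeros_like(S)
    n = S.shape[0] - 1
    for c in range(1, n + 1):
        M[0, c] = S[0, c] * Linv[c - 1, c - 1]  # root edges
        for h in range(1, n + 1):
            if h != c:
                M[h, c] = S[h, c] * (Linv[c - 1, c - 1] - Linv[c - 1, h - 1])
    return M

# Toy check (0 = ROOT): matches brute-force enumeration of the three trees,
# e.g. P(0 -> 1) = 2 * (3 + 5) / 37.
s = np.array([[0., 2., 3.],
              [0., 0., 5.],
              [0., 7., 0.]])
print(edge_marginals(s))
```

Roughly speaking, in BP the scores handed to this routine would be each link's edge factor multiplied by the odds from its incoming message, so one O(n³) call yields all of the TREE factor's outgoing messages at once.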
83
Some connections
  • Parser stacking (Nivre & McDonald 2008; Martins
    et al. 2008)
  • Global constraints in arc consistency
  • ALLDIFFERENT constraint (Régin 1994)
  • Matching constraint in max-product BP
  • For computer vision (Duchi et al., 2006)
  • Could be used for machine translation
  • As far as we know, our parser is the first use of
    global constraints in sum-product BP.

84
Outline
  • Edge-factored parsing
  • Dependency parses
  • Scoring the competing parses: Edge features
  • Finding the best parse
  • Higher-order parsing
  • Throwing in more features: Graphical models
  • Finding the best parse: Belief propagation
  • Experiments
  • Conclusions

Old
New!
85
Runtimes for each factor type (see paper)
Factor type       degree   runtime   count   total
Tree              O(n²)    O(n³)     1       O(n³)
Proj. Tree        O(n²)    O(n³)     1       O(n³)
Individual links  1        O(1)      O(n²)   O(n²)
Grandparent       2        O(1)      O(n³)   O(n³)
Sibling pairs     2        O(1)      O(n³)   O(n³)
Sibling bigrams   O(n)     O(n²)     O(n)    O(n³)
NoCross           O(n)     O(n)      O(n²)   O(n³)
Tag               1        O(g)      O(n)    O(n)
TagLink           3        O(g²)     O(n²)   O(n²)
TagTrigram        O(n)     O(ng³)    1       O(n)
TOTAL                                        O(n³)

per iteration
Additive, not multiplicative!
86
Runtimes for each factor type (see paper)
Factor type       degree   runtime   count   total
Tree              O(n²)    O(n³)     1       O(n³)
Proj. Tree        O(n²)    O(n³)     1       O(n³)
Individual links  1        O(1)      O(n²)   O(n²)
Grandparent       2        O(1)      O(n³)   O(n³)
Sibling pairs     2        O(1)      O(n³)   O(n³)
Sibling bigrams   O(n)     O(n²)     O(n)    O(n³)
NoCross           O(n)     O(n)      O(n²)   O(n³)
Tag               1        O(g)      O(n)    O(n)
TagLink           3        O(g²)     O(n²)   O(n²)
TagTrigram        O(n)     O(ng³)    1       O(n)
TOTAL                                        O(n³)

Each global factor coordinates an unbounded number
of variables. Standard belief propagation would
take exponential time to iterate over all
configurations of those variables. See the paper for
efficient propagators.
Additive, not multiplicative!
87
Experimental Details
  • Decoding
  • Run several iterations of belief propagation
  • Get final beliefs at link variables
  • Feed them into first-order parser
  • This gives the Min Bayes Risk tree (minimizes
    expected error)
  • Training
  • BP computes beliefs about each factor, too
  • which gives us gradients for max conditional
    likelihood.
  • (as in forward-backward algorithm)
  • Features used in experiments
  • First-order: individual links, just as in McDonald
    et al. 2005
  • Higher-order: Grandparent, Sibling bigrams,
    NoCross

87
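A small sketch of the decoding rule described above, under the standard dependency-accuracy loss: the expected number of wrong links of a candidate tree is the sum of (1 − belief) over its chosen links, so minimizing it is the same as handing the link beliefs to any first-order parser (projective DP or MST) as edge scores. The child → head encoding of `tree` below is an assumption for illustration.

```python
def expected_errors(tree, beliefs):
    """tree: dict mapping each child word to its chosen head.
    beliefs: dict mapping (head, child) to the BP marginal of that link.
    Each chosen link is wrong with probability 1 - its belief, so the tree
    minimizing this sum is the Min Bayes Risk parse."""
    return sum(1.0 - beliefs[(h, c)] for c, h in tree.items())
```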
88
Dependency Accuracy: The extra, higher-order features help! (non-projective parsing)
                 Danish   Dutch   English
Tree+Link         85.5     87.3    88.6
+NoCross          86.1     88.3    89.1
+Grandparent      86.1     88.6    89.4
+ChildSeq         86.5     88.5    90.1
89
Dependency Accuracy: The extra, higher-order features help! (non-projective parsing)
                                        Danish   Dutch   English
Tree+Link                                85.5     87.3    88.6
+NoCross                                 86.1     88.3    89.1
+Grandparent                             86.1     88.6    89.4
+ChildSeq                                86.5     88.5    90.1
Best projective parse with all factors   86.0     84.5    90.2   (exact, slow)
hill-climbing                            86.1     87.6    90.2   (doesn't fix enough edges)
90
Time vs. Projective Search Error
(figure: projective search error vs. runtime as BP iterations increase, compared with O(n⁴) and O(n⁵) DP baselines)
91
Runtime BP vs. DP
vs. O(n⁴) DP
vs. O(n⁵) DP
91
92
Outline
  • Edge-factored parsing
  • Dependency parses
  • Scoring the competing parses: Edge features
  • Finding the best parse
  • Higher-order parsing
  • Throwing in more features: Graphical models
  • Finding the best parse: Belief propagation
  • Experiments
  • Conclusions

Old
New!
93
Freedom Regained
This paper in a nutshell
  • Output probability defined as product of local
    and global factors
  • Throw in any factors we want! (log-linear
    model)
  • Each factor must be fast, but they run
    independently
  • Let local factors negotiate via belief
    propagation
  • Each bit of syntactic structure is influenced by
    others
  • Some factors need combinatorial algorithms to
    compute messages fast
  • e.g., existing parsing algorithms using dynamic
    programming
  • Each iteration takes total time O(n³) or even
    O(n²); see paper
  • Compare reranking or stacking
  • Converges to a pretty good (but approximate)
    global parse
  • Fast parsing for formerly intractable or slow
    models
  • Extra features of these models really do help
    accuracy

93
94
Future Opportunities
  • Efficiently modeling more hidden structure
  • POS tags, link roles, secondary links (DAG-shaped
    parses)
  • Beyond dependencies
  • Constituency parsing, traces, lattice parsing
  • Beyond parsing
  • Alignment, translation
  • Bipartite matching and network flow
  • Joint decoding of parsing and other tasks (IE,
    MT, reasoning ...)
  • Beyond text
  • Image tracking and retrieval
  • Social networks

95
thank you