Title: Bootstrapping Feature-Rich Dependency Parsers with Entropic Priors
1. Bootstrapping Feature-Rich Dependency Parsers with Entropic Priors
- David A. Smith
- Jason Eisner
- Johns Hopkins University
2. Only Connect
[Diagram: a trained (dependency) parser, learned from training trees and raw text, connects to many downstream uses: textual entailment, language modeling (LM), information extraction (Weischedel 2004), machine translation via parallel/comparable corpora (Quirk et al. 2005), lexical semantics (Pantel & Lin 2002), and out-of-domain text.]
3. Outline: Bootstrapping Parsers
- What kind of parser should we train?
- How should we train it semi-supervised?
- Does it work? (initial experiments)
- How can we incorporate other knowledge?
4. Re-estimation: EM or Viterbi EM
[Diagram: a trained parser parses raw text; the resulting parses are fed back as training data.]
5. Re-estimation: EM or Viterbi EM
[Diagram: the same pipeline, iterated.]
Oops! Not much supervised training, so most of these parses were bad. Retraining on all of them overwhelms the good supervised data.
6. Simple Bootstrapping: Self-Training
So only retrain on good parses ...
[Diagram: the trained parser selects which of its own parses to retrain on.]
7. Simple Bootstrapping: Self-Training
So only retrain on good parses ... at least, those the parser itself thinks are good. (Can we trust it? We'll see ...)
8. Why Might This Work?
- Sure, now we avoid harming the parser with bad training.
- But why do we learn anything new from the unsupervised data?
[Diagram: features of a candidate parse, colored by learned weight.]
- After training, training parses have
  - Many features with positive weights
  - Few features with negative weights
- But unsupervised parses have
  - Few positive or negative features
  - Mostly unknown features (words or situations not seen in training data)
Still, sometimes there are enough positive features to be sure it's the right parse.
9. Why Might This Work?
- Sure, we avoid bad guesses that harm the parser.
- But why do we learn anything new from the unsupervised data?
[Diagram: the same parse.]
Now, retraining the weights makes the gray (and red) features greener.
Still, sometimes there are enough positive features to be sure it's the right parse.
10. Why Might This Work?
- Sure, we avoid bad guesses that harm the parser.
- But why do we learn anything new from the unsupervised data?
[Diagram: the same parse.]
Now, retraining the weights makes the gray (and red) features greener ... and makes features redder for the losing parses of this sentence (not shown).
Still, sometimes there are enough positive features to be sure it's the right parse.
Learning!
11. This Story Requires Many Redundant Features!
More features → more chances to identify the correct parse even when we're undertrained.
- Bootstrapping for WSD (Yarowsky 1995)
  - Lots of contextual features → success
- Co-training for parsing (Steedman et al. 2003)
  - Feature-poor parsers → disappointment
- Self-training for parsing (McClosky et al. 2006)
  - Feature-poor parsers → disappointment
  - Reranker with more features → success
12. This Story Requires Many Redundant Features!
More features → more chances to identify the correct parse even when we're undertrained.
- So, let's bootstrap a feature-rich parser!
- In our experiments so far, we follow McDonald et al. (2005)
  - Our model has 450 million features (on Czech)
  - Prune down to 90 million frequent features
  - About 200 are considered per possible edge
Note: even more features are proposed at the end of the talk.
13. Edge-Factored Parsers (McDonald et al. 2005)
- No global features of a parse
- Each feature is attached to some edge
- Simple: allows fast O(n²) or O(n³) parsing
[Figure: example sentence "Byl jasný studený dubnový den a hodiny odbíjely trináctou" ("It was a bright cold day in April, and the clocks were striking thirteen").]
14. Edge-Factored Parsers (McDonald et al. 2005)
yes, lots of green ...
[Figure: the example sentence with a candidate edge and its features.]
15. Edge-Factored Parsers (McDonald et al. 2005)
jasný → den (bright day)
16. Edge-Factored Parsers (McDonald et al. 2005)
jasný → N (bright NOUN)
jasný → den (bright day)
[Figure: the sentence with POS tags V A A A N J N V C.]
17. Edge-Factored Parsers (McDonald et al. 2005)
jasný → N (bright NOUN)
jasný → den (bright day)
A → N
18. Edge-Factored Parsers (McDonald et al. 2005)
jasný → N (bright NOUN)
jasný → den (bright day)
A → N
A → N, preceding conjunction
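The feature templates on these slides can be sketched as code. This is an illustrative subset under assumed template names, not McDonald et al.'s actual feature set:

```python
def edge_features(words, tags, stems, head, dep):
    """Edge-factored feature templates in the spirit of the slides:
    a candidate edge head -> dep fires word-pair, word-tag, tag-pair,
    stem-pair, and conjunction-context features (illustrative only)."""
    feats = [
        f"word-pair:{words[head]}->{words[dep]}",   # e.g. jasný -> den
        f"word-tag:{words[head]}->{tags[dep]}",     # e.g. jasný -> N
        f"tag-pair:{tags[head]}->{tags[dep]}",      # e.g. A -> N
        f"stem-pair:{stems[head]}->{stems[dep]}",   # e.g. jasn- -> den-
    ]
    # context features: does the dependent sit next to a conjunction?
    if dep + 1 < len(tags) and tags[dep + 1] == "J":
        feats.append(f"tag-pair:{tags[head]}->{tags[dep]}:precedes-conj")
    if dep > 0 and tags[dep - 1] == "J":
        feats.append(f"tag-pair:{tags[head]}->{tags[dep]}:follows-conj")
    return feats

# The example sentence from the slides (1-based word indices in text;
# 0-based here), with its POS tags and stems.
words = ["Byl", "jasný", "studený", "dubnový", "den",
         "a", "hodiny", "odbíjely", "trináctou"]
tags  = ["V", "A", "A", "A", "N", "J", "N", "V", "C"]
stems = ["být-", "jasn-", "stud-", "dubn-", "den-",
         "a-", "hodi-", "odbí-", "trin-"]
```

Calling `edge_features(words, tags, stems, 1, 4)` (jasný → den) fires the "precedes conjunction" context feature, while `(1, 6)` (jasný → hodiny) fires "follows conjunction", matching the competing edges discussed on the next slides.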
19. Edge-Factored Parsers (McDonald et al. 2005)
- How about this competing edge?
not as good, lots of red ...
20. Edge-Factored Parsers (McDonald et al. 2005)
- How about this competing edge?
jasný → hodiny (bright clocks)
... undertrained ...
21. Edge-Factored Parsers (McDonald et al. 2005)
- How about this competing edge?
jasn- → hodi- (bright clock, stems only)
[Figure: the sentence with stems být- jasn- stud- dubn- den- a- hodi- odbí- trin-.]
22. Edge-Factored Parsers (McDonald et al. 2005)
- How about this competing edge?
jasn- → hodi- (bright clock, stems only)
A(plural) → N(singular)
23. Edge-Factored Parsers (McDonald et al. 2005)
- How about this competing edge?
jasn- → hodi- (bright clock, stems only)
A → N where N follows a conjunction
A(plural) → N(singular)
24. Edge-Factored Parsers (McDonald et al. 2005)
- Which edge is better?
- "bright day" or "bright clocks"?
25. Edge-Factored Parsers (McDonald et al. 2005)
- Which edge is better?
- Score of an edge e = θ · features(e)
- Standard algos → valid parse with max total score
26. Edge-Factored Parsers (McDonald et al. 2005)
- Which edge is better?
- Score of an edge e = θ · features(e)
- Standard algos → valid parse with max total score
can't have both (one parent per word)
Thus, an edge may lose (or win) because of a consensus of other edges. Retraining then learns to reduce (or increase) its score.
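The "consensus of other edges" effect can be made concrete with a brute-force sketch: score every parent assignment and keep the best valid tree. This is illustration only; the real parsers use O(n³) projective dynamic programming or the Chu-Liu/Edmonds MST algorithm, and the edge scores below are made-up numbers, not trained weights.

```python
from itertools import product

def best_parse(n, edge_score):
    """Exhaustive edge-factored parsing for a tiny n-word sentence:
    try every parent assignment (0 = artificial ROOT), keep the valid
    tree (no self-loops, every word reaches ROOT) whose edge scores
    sum highest."""
    best_score, best_parents = float("-inf"), None
    # parents[i] is the 1-based parent of word i+1
    for parents in product(range(n + 1), repeat=n):
        if any(parents[i] == i + 1 for i in range(n)):
            continue                    # a word can't head itself
        def reaches_root(i, seen=()):
            if parents[i] == 0:
                return True
            if i in seen:
                return False            # cycle: never reaches ROOT
            return reaches_root(parents[i] - 1, seen + (i,))
        if not all(reaches_root(i) for i in range(n)):
            continue
        score = sum(edge_score(parents[i], i + 1) for i in range(n))
        if score > best_score:          # one parent per word, so
            best_score, best_parents = score, parents   # edges compete
    return best_parents, best_score

# Hypothetical edge scores (theta . features(e), already summed) for a
# toy 3-word fragment: 0 = ROOT, 1 = Byl, 2 = jasný, 3 = den.
scores = {(0, 1): 2.0, (1, 3): 1.5, (3, 2): 1.0, (0, 2): -1.0, (2, 3): 0.2}
parents, total = best_parse(3, lambda h, d: scores.get((h, d), -5.0))
```

Here the winner attaches jasný under den (`parents == (0, 3, 1)`): the edge den → jasný wins partly because the rest of the tree is consistent with it, the consensus effect described above.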
27. Only Connect
[The application diagram from slide 2 again: the trained parser feeds textual entailment, LM, IE, MT, lexical semantics, and out-of-domain text.]
28. Can we recast this declaratively?
Only retrain on good parses ... at least, those the parser itself thinks are good.
29. Can we recast this declaratively?
[Diagram of the bootstrapping loop: seed set → train classifier → label examples → select examples w/ high confidence → new labeled set → retrain.]
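The loop in the diagram can be written out as a generic sketch. The `train_model` interface and the toy nearest-neighbor classifier below are illustrative placeholders, not the talk's parser:

```python
def self_train(train_model, seed, unlabeled, threshold):
    """Self-training: train on a seed set, label the pool, keep only
    high-confidence labels, retrain, and repeat until nothing new
    clears the threshold.  train_model(labeled_pairs) must return a
    model with predict(x) and confidence(x) methods."""
    labeled, pool = list(seed), list(unlabeled)
    while pool:
        model = train_model(labeled)
        confident = [x for x in pool if model.confidence(x) >= threshold]
        if not confident:
            break                       # nothing new to learn from
        labeled += [(x, model.predict(x)) for x in confident]
        pool = [x for x in pool if x not in confident]
    return train_model(labeled)


class NearestNeighbor:
    """Toy stand-in classifier: label by the nearest labeled point;
    confidence decays with distance to the nearest labeled point."""
    def __init__(self, labeled):
        self.labeled = labeled
    def predict(self, x):
        return min(self.labeled, key=lambda ex: abs(ex[0] - x))[1]
    def confidence(self, x):
        return 1.0 / (1.0 + min(abs(ex[0] - x) for ex in self.labeled))
```

With seed `[(-2.0, 0), (2.0, 1)]` and pool `[-1.5, -1.0, 1.0, 1.5]` at threshold 0.6, the first pass confidently labels only ±1.5; those new labels then make ±1.0 confident on the next pass, which is exactly the incremental spreading the slide describes.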
30. Bootstrapping as Optimization
Maximize a function on supervised and unsupervised data: supervised log-likelihood minus the entropy of the model's distribution over parses of the unsupervised data.
Entropy regularization (Brand 1999; Grandvalet & Bengio 2004; Jiao et al. 2006)
Yesterday's talk: how to compute these quantities for non-projective models.
See Hwa 2001 for projective tree entropy.
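The objective can be sketched numerically on a toy one-feature binary model standing in for the parser (an assumption for illustration, not the paper's model; the gradient is numerical rather than analytic):

```python
import math

def entropic_objective(theta, labeled, unlabeled, gamma):
    """Supervised log-likelihood minus gamma times the Shannon entropy
    of the model's posterior on unlabeled examples, for the toy model
    p_theta(y=1 | x) = sigmoid(theta * x)."""
    def p1(x):
        return 1.0 / (1.0 + math.exp(-theta * x))
    def entropy(q):
        return -sum(t * math.log(t) for t in (q, 1.0 - q) if t > 0)
    loglik = sum(math.log(p1(x) if y == 1 else 1.0 - p1(x))
                 for x, y in labeled)
    return loglik - gamma * sum(entropy(p1(x)) for x in unlabeled)

def train(labeled, unlabeled, gamma, steps=300, lr=0.1, eps=1e-4):
    """Gradient ascent on the objective with a central-difference
    numerical gradient (illustration only)."""
    theta = 0.5
    for _ in range(steps):
        grad = (entropic_objective(theta + eps, labeled, unlabeled, gamma)
                - entropic_objective(theta - eps, labeled, unlabeled, gamma)
                ) / (2 * eps)
        theta += lr * grad
    return theta
```

With gamma > 0 the unlabeled points push theta toward sharper, lower-entropy decisions than supervised likelihood alone, which is the semi-supervised effect the slide claims.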
31. Claim: Gradient descent on this objective function works like bootstrapping
- When we're pretty sure the true parse is A or B, we reduce entropy H by becoming even surer (≈ retraining θ on the example)
- When we're not sure, the example doesn't affect θ (≈ not retraining on the example)
[Figure: the not-sure case, H ≈ 1.]
32. Claim: Gradient descent on this objective function works like bootstrapping
In the paper, we generalize: replace Shannon entropy H(p) with Rényi entropy Hα(p)
- This gives us a tunable parameter α
- Connect to Abney's view of bootstrapping (α → 0)
- Obtain Viterbi variant (limit as α → ∞)
- Obtain Gini variant (α = 2)
- Still get Shannon entropy (limit as α → 1)
- Also easier to compute in some circumstances
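The family of variants above follows directly from the definition of Rényi entropy; a minimal sketch over a distribution on candidate parses:

```python
import math

def renyi_entropy(p, alpha):
    """Rényi entropy H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha).

    alpha -> 1 recovers Shannon entropy, alpha = 2 gives the Gini
    variant -log(sum_i p_i^2), and alpha -> infinity gives the
    Viterbi variant -log(max_i p_i)."""
    if alpha == 1:                      # Shannon limit
        return -sum(q * math.log(q) for q in p if q > 0)
    if math.isinf(alpha):               # Viterbi limit
        return -math.log(max(p))
    return math.log(sum(q ** alpha for q in p)) / (1.0 - alpha)

# A parser that is sure of its parse has low entropy under every
# alpha; one that spreads mass evenly has high entropy.
peaked = [0.9, 0.05, 0.05]
flat = [1 / 3, 1 / 3, 1 / 3]
```

A handy sanity check: for the uniform distribution over 3 parses, every Rényi order gives the same value, log 3.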
33. Experimental Questions
- Are confident parses (or edges) actually good for retraining?
- Does bootstrapping help accuracy?
- What is being learned?
34. Experimental Design
- Czech, German, and Spanish (some Bulgarian)
  - CoNLL-X dependency trees
- Non-projective (MST) parsing
  - Hundreds of millions of features
- Supervised training sets of 100-1000 trees
- Unparsed but tagged sets of 2k to 70k sentences
- Stochastic gradient descent
  - First optimize just likelihood on the seed set
  - Then optimize likelihood plus the confidence criterion on all data
  - Stop when accuracy peaks on development data
35. Are confident parses accurate? Correlation of entropy with accuracy
- Shannon entropy: -.32
- Viterbi self-training: -.26
- Gini, -log(expected 0/1 gain): -.27
- log(# of parses), favors short sentences (Abney's Yarowsky alg.): -.25
36. How Accurate Is Bootstrapping?
[Chart: accuracy vs. amount of raw text (2K, 37K, 71K sentences), against the 100-tree supervised baseline. Gains significant on a paired permutation test.]
37. How Does Bootstrapping Learn?
[Chart: precision and recall over bootstrapping.]
38. Bootstrapping vs. EM
Two ways to add unsupervised data.
Compare on a feature-poor model that EM can handle (DMV).
[Chart: accuracy (0-90) of bootstrapping vs. EM on Bulgarian, German, and Spanish.]
100 training trees, 100 dev trees for model selection.
39. There's No Data Like More Data
[The application diagram from slide 2 again.]
40. Token Projection
What if some sentences have parallel text?
- Project 1-best English dependencies (Hwa et al. 2004)?
  - Imperfect or free translation
  - Imperfect parse
  - Imperfect alignment
41. Token Projection
What if some sentences have parallel text?
Probably aligns to some English link A → N
[Figure: the Czech sentence aligned to its English translation "It was a bright cold day in April and the clocks were striking thirteen".]
42. Token Projection
What if some sentences have parallel text?
Probably aligns to some English path N → in → N
[Figure: the aligned sentence pair again.]
Cf. quasi-synchronous grammars (Smith & Eisner, 2006)
43. Type Projection
Can we use world knowledge, e.g., from comparable corpora?
Probably translate as English words that usually link as N → V when cosentential
44. Type Projection
Can we use world knowledge, e.g., from comparable corpora?
Probably translate as English words that usually link as N → V when cosentential
[Figure: the example sentence with another projected word pair.]
45. Conclusions
- Declarative view of bootstrapping as entropy minimization
- Improvements in parser accuracy with feature-rich models
- Easily added features from alternative data sources, e.g. comparable text
- In future: consider also the WSD decision-list learner; is it important for learning robust feature weights?
46. Thanks
Noah Smith, Keith Hall, the anonymous reviewers, and Ryan McDonald for making his code available.
47. Extra slides
48. Dependency Treebanks
49. A Supervised CoNLL-X System
What system was this?
50. How Does Bootstrapping Learn?
[Chart: supervised iteration 1, supervised iteration 10, bootstrapping w/ R₂, and bootstrapping w/ R∞.]
51. How Does Bootstrapping Learn?
Updated     M feat.  Acc.  |  Updated    M feat.  Acc.
all         15.5     64.3  |  none       0        60.9
seed        1.4      64.1  |  non-seed   14.1     44.7
non-lex.    3.5      64.4  |  lexical    12.0     59.9
non-bilex.  12.6     64.4  |  bilexical  2.9      61.0
52. Review: Yarowsky's bootstrapping algorithm
[Table taken from Yarowsky (1995): for the target word "plant", seed collocations "life" (1%) and "manufacturing" (1%) label the two senses; the remaining 98% start unlabeled.]
53. Review: Yarowsky's bootstrapping algorithm
figure taken from Yarowsky (1995)
Should be a good classifier, unless we accidentally learned some bad cues along the way that polluted the original sense distinction.
54. Review: Yarowsky's bootstrapping algorithm
figure taken from Yarowsky (1995)
55. Review: Yarowsky's bootstrapping algorithm
figure taken from Yarowsky (1995)
Repeat: train a classifier that confidently classifies some of the remaining examples.
56. Bootstrapping Pivot Features
[Examples of overlapping lexical features across sentences, e.g. "quick and sly fox" / "sly and crafty fox", "sat beside the river bank" / "sat on the bank".]
Lots of overlapping features, vs. PCFG (McClosky et al.)
57. Bootstrapping as Optimization
Given a labeling distribution p, the log likelihood to maximize is [equation lost in extraction]. Abney (2004)
On labeled data, p is 1 at the label and 0 elsewhere; thus, supervised training.
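The equation missing from this slide is presumably the expected log-likelihood under the labeling distribution, in the style of Abney (2004); this reconstruction is an assumption, not recovered from the slide itself:

```latex
\ell(\theta) \;=\; \sum_{x} \sum_{y} p(y \mid x)\, \log p_\theta(y \mid x)
```

When p(y | x) puts all its mass on the gold label, each inner sum keeps only the gold term and this reduces to ordinary supervised log-likelihood.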
58. Triangular Trade
[Diagram relating three corners:
- Data: words, tags, translations, ...
- Models: globally normalized LL; projective/non-projective
- Objectives: EM, Abney's K, entropy regularization, derivational (Rényi) entropy
with inference via parent prediction, inside/outside, and Matrix-Tree.]