Title: Simple Training of Dependency Parsers via Structured Boosting
1. Simple Training of Dependency Parsers via Structured Boosting
Qin Iris Wang, University of Alberta
Joint work with Dekang Lin and Dale Schuurmans
Hyderabad, India, Jan 11, 2007
2. Structured Boosting
- A simple variant of standard boosting algorithms
- Global optimization
- As cheap as local methods
- Can be easily applied to any local predictor
- Successfully applied to dependency parsing
3. Dependency Tree
- A dependency tree structure for a sentence
- Captures syntactic relationships between word pairs in the sentence
4. Increasing Interest
- Dependency parsing has been an active research area
- Dependency trees are much easier to understand and annotate than constituency trees
- Dependency relations have been widely used in:
  - Machine translation (Fox 2002, Cherry & Lin 2003, Ding & Palmer 2005)
  - Information extraction (Culotta & Sorensen 2004)
  - Question answering (Pinchak & Lin 2006)
  - Coreference resolution (Bergsma & Lin 2006)
5. Overview
- Dependency parsing model
- Local training methods for dependency parsing
- Global training methods
- Structured boosting
- Experimental results
- Conclusion
6. Dependency Parsing Model
- W = (w1, ..., wn): an input sentence
- T: a candidate dependency tree
- 𝒯(W): the set of possible dependency trees spanning W
- Scoring function (Eisner 1996, McDonald et al. 2005), which factors over the links in the tree:

  score(W, T) = Σ_{(i,j) ∈ T} θ · f(i, j)

  where θ is the vector of feature weights and f(i, j) is the vector of feature functions for the link between wi and wj (a short code sketch follows this slide)
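As a concrete illustration, here is a minimal Python sketch of this edge-factored scoring, assuming a sparse (dictionary) weight vector; the function names and the reduced feature set are illustrative assumptions, not code from the slides.

```python
# Minimal sketch of edge-factored scoring; feature set reduced for brevity.

def link_features(sentence, i, j):
    """Feature functions f(i, j) for a candidate link between w_i and w_j."""
    wi, wj = sentence[i], sentence[j]
    return [
        f"W1_{wi}", f"W2_{wj}", f"W1W2_{wi}_{wj}",  # word-pair indicators
        f"Dist_{abs(j - i)}",                        # distance between the words
    ]

def link_score(theta, sentence, i, j):
    """theta . f(i, j): sum the weights of the features that fire."""
    return sum(theta.get(feat, 0.0) for feat in link_features(sentence, i, j))

def tree_score(theta, sentence, tree):
    """score(W, T): total score of all links (i, j) in the tree T."""
    return sum(link_score(theta, sentence, i, j) for (i, j) in tree)
```

Because the score decomposes into a sum of independent link scores, finding the best tree is a search over link combinations, which is what makes dynamic-programming decoders such as Eisner (1996) applicable.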
7. Features for a Link
Lots!
- Word pair indicator
- POS tags of that word pair
- Pointwise Mutual Information (PMI) for that word pair
- Distance between the words
8. Static Features
- p-word / c-word: word of the parent / child
- p-pos / c-pos: POS of the parent / child
9. Dynamic Features
- Take into account the link labels of the surrounding components when predicting the label of a target
- Commonly used in sequential labeling tasks (McCallum et al. 2000, Toutanova et al. 2003)
- A simple but useful idea for improving structured predictors
- Can also be easily employed for dependency parsing (Wang et al. 2005, McDonald and Pereira 2006); see the sketch below
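A hedged sketch of what a dynamic feature might look like, reusing link_features from the scoring sketch above; the particular neighbor encoding here is an assumption for illustration, not the feature set used in the slides.

```python
def dynamic_features(sentence, i, j, decided):
    """Static features plus the labels ('L'/'R'/'N') already assigned to
    neighboring word pairs; undecided neighbors get a placeholder."""
    feats = link_features(sentence, i, j)                  # static features
    feats.append(f"Nbr_{decided.get((i, j - 1), 'NA')}")   # label of pair (i, j-1)
    feats.append(f"Nbr_{decided.get((i + 1, j), 'NA')}")   # label of pair (i+1, j)
    return feats
```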
10. Local Training Examples
- Given training examples (S, T), decompose them into local examples, one per word pair. For the sentence "The boy skipped school regularly":

  link label | features
  L          | W1_The, W2_boy, W1W2_The_boy, T1_DT, T2_NN, T1T2_DT_NN, Dist_1
  L          | W1_boy, W2_skipped, ...
  R          | W1_skipped, W2_school, ...
  R          | W1_skipped, W2_regularly, ...
  N          | W1_The, W2_skipped, ...
  N          | W1_The, W2_school, ...
  N          | W1_The, W2_regularly, ...

  (L = left link, R = right link, N = no link; a code sketch of this decomposition follows)
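A minimal sketch of this decomposition in Python, reusing link_features from above; the data structures (a set of (head, modifier) index pairs for the gold tree) are assumptions for illustration.

```python
def local_examples(sentence, tags, gold_links):
    """Yield (features, label) for every word pair (i, j), i < j.

    gold_links: set of (head, modifier) index pairs from the gold tree.
    Label is 'L' if the link points left (head on the right), 'R' if it
    points right, and 'N' if the pair is unlinked.
    """
    examples = []
    for i in range(len(sentence)):
        for j in range(i + 1, len(sentence)):
            if (j, i) in gold_links:
                label = "L"          # w_j is the head of w_i
            elif (i, j) in gold_links:
                label = "R"          # w_i is the head of w_j
            else:
                label = "N"          # no link between w_i and w_j
            feats = link_features(sentence, i, j) + [
                f"T1_{tags[i]}", f"T2_{tags[j]}", f"T1T2_{tags[i]}_{tags[j]}",
            ]
            examples.append((feats, label))
    return examples

# e.g. local_examples("The boy skipped school regularly".split(),
#                     ["DT", "NN", "VBD", "NN", "RB"],
#                     {(1, 0), (2, 1), (2, 3), (2, 4)})
```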
11. Local Training Methods
- Learn a local link predictor given feature inputs
- A purely local approach to the learning problem
- For each word pair in a sentence: no link, left link, or right link?
12. But, If We Only Use a Local Classifier
- The output may not be a tree
- Dependency parsing is a structured classification problem, i.e., the output has to be a dependency tree
- Need to satisfy the constraints between the classifications of different local links

How do we perform a structured prediction using a local prediction model? (See the decoding sketch below.)
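One answer, sketched below under stated assumptions: score every word pair with the local model, then search only over valid trees. The slides use the Eisner (1996) dynamic-programming algorithm for this search; the brute-force decoder here is just for illustration (it is exponential in sentence length and ignores projectivity).

```python
from itertools import product

def decode_tree(score):
    """score[h][m]: local score for making word h the head of word m.
    Index 0 is an artificial ROOT. Returns head[m] for m = 1..n-1."""
    n = len(score)
    best_val, best_heads = float("-inf"), None
    for heads in product(range(n), repeat=n - 1):  # heads[m-1] = head of word m
        if not _is_tree(heads, n):
            continue                                # enforce the tree constraint
        val = sum(score[heads[m - 1]][m] for m in range(1, n))
        if val > best_val:
            best_val, best_heads = val, heads
    return best_heads

def _is_tree(heads, n):
    """Every word must reach ROOT (index 0) by following heads: no cycles."""
    for m in range(1, n):
        seen, cur = set(), m
        while cur != 0:
            if cur in seen:
                return False
            seen.add(cur)
            cur = heads[cur - 1]
    return True
```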
13. Local Training & Parsing Algorithm
[Diagram: training sentences (S, T) → local training examples → local predictor → link scores → dependency parsing algorithm (Eisner 1996) → a projective spanning tree → dependency trees]
14. Parsing With a Local Link Predictor
- Support vector machines (Yamada and Matsumoto 2003)
- Logistic regression / maximum entropy models (Ratnaparkhi 1999 and Charniak 2000)
- But we can do better if we use global training
15. Global Training for Structured Prediction
- Recent global training algorithms for learning structured predictors:
  - CRFs (Lafferty et al. 2001)
  - Structured SVMs (Tsochantaridis et al. 2004, Altun et al. 2003)
  - Max-Margin Markov Networks (Taskar et al. 2003)
- These incorporate the effects of the structured predictor directly into the training algorithm
- They have been applied to parsing (Taskar et al. 2004, McDonald et al. 2005, Wang et al. 2006)
16. But, Drawbacks
- Unfortunately, for a complicated task like dependency parsing, these structured training techniques are:
  - Expensive
  - Specialized
  - Complex to implement
17. Our Idea: Structured Boosting
- Global optimization
- Almost as cheap as local methods
- Promising results
18. A Generic Boosting Algorithm
- Train a local classifier (weak hypothesis), h1
- Re-weight the local training examples
- Re-train the local link predictor on the re-weighted training examples, getting h2
- Repeat; finally we have h1, h2, ..., hk (a sketch follows below)
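A minimal sketch of this loop, assuming train and reweight callables; the function names are illustrative, not from the original slides.

```python
def boost(examples, train, reweight, rounds):
    """Generic boosting: fit a weak hypothesis, re-weight, repeat."""
    weights = [1.0 / len(examples)] * len(examples)   # start uniform
    hypotheses = []
    for _ in range(rounds):
        h = train(examples, weights)                  # weak hypothesis h_t
        hypotheses.append(h)
        weights = reweight(h, examples, weights)      # up-weight hard examples
    return hypotheses                                 # h_1, h_2, ..., h_k
```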
19. Standard Boosting for Classification
[Diagram: training sentences (S, T) → local training examples → local predictor → classification → re-weighting the local examples → back to the local predictor]
20. Structured Boosting
- Train a local link predictor, h1
- Re-parse the training data using h1
- Re-weight the local examples:
  - Compare the parser outputs with the gold-standard trees
  - Increase the weight of mis-parsed local examples
- Re-train the local link predictor, getting h2
- Repeat; finally we have h1, h2, ..., hk (see the sketch below)
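A hedged sketch of the structured boosting loop, contrasting with the generic loop above: the re-weighting step re-parses the training data and up-weights exactly the local examples the parser got wrong. The train and parse arguments stand for the local link predictor and the dependency parser (Eisner 1996 in the slides); the multiplicative boost_factor is an illustrative assumption.

```python
def structured_boost(data, train, parse, rounds, boost_factor=2.0):
    """data: list of (sentence, gold) where gold maps each word-index pair
    (i, j), i < j, to its link label 'L', 'R', or 'N'.
    parse(sentence, h) must return predicted links in the same format."""
    # one weight per (sentence, word pair) local example
    weights = {(s_id, pair): 1.0
               for s_id, (_, gold) in enumerate(data) for pair in gold}
    hypotheses = []
    for _ in range(rounds):
        h = train(data, weights)                      # fit local link predictor
        hypotheses.append(h)
        for s_id, (sentence, gold) in enumerate(data):
            predicted = parse(sentence, h)            # re-parse training data
            for pair, gold_label in gold.items():     # compare link by link
                if predicted.get(pair) != gold_label:
                    weights[(s_id, pair)] *= boost_factor  # up-weight errors
    return hypotheses
```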
21. Structured Boosting for Dependency Parsing
[Diagram: training sentences (S, T) → local training examples → local predictor → link scores → dependency parsing algorithm (Eisner 1996) → a projective spanning tree → dependency trees → re-parse the training data → re-weight the local examples → back to the local predictor]
22. Experimental Design
- Learning dependency parsers for English and Chinese
- Data sets
  - English: Penn Treebank 3.0 (PTB3)
  - Chinese: Chinese Treebank 4.0 and 5.0 (CTB4, CTB5)
- Features
  - Static
  - Dynamic
23. Experimental Design (Cont.)
- Local training algorithm
  - Logistic regression / maximum entropy model
- Parser
  - Standard bi-lexical CKY parser (O(n^5))
- Boosting method
  - A variant of AdaBoost.M1 (Freund & Schapire 1997); the standard update is shown below for reference
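For reference, the weight update of standard AdaBoost.M1 (Freund & Schapire 1997), of which the method here is a variant; the slides do not spell out how the variant differs, so this is background rather than the exact update used.

```latex
% With weighted training error \epsilon_t of weak hypothesis h_t:
\epsilon_t = \sum_{i \,:\, h_t(x_i) \neq y_i} w_t(i), \qquad
\beta_t = \frac{\epsilon_t}{1 - \epsilon_t}
% correctly classified examples are down-weighted, then weights renormalized:
w_{t+1}(i) \propto w_t(i) \cdot
  \begin{cases} \beta_t & \text{if } h_t(x_i) = y_i \\ 1 & \text{otherwise} \end{cases}
% and the final hypothesis is a weighted vote over the weak hypotheses:
H(x) = \operatorname*{arg\,max}_{y} \sum_{t \,:\, h_t(x) = y} \log \frac{1}{\beta_t}
```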
24. Experimental Results (1)
[Table 1: Boosting with static features]
25. Experimental Results (2)
[Table 2: Boosting with dynamic features]
26. Comparison With Others
[Table: accuracy compared with other parsers (IWPT 2005) on Chinese Treebank 4.0 and Chinese Treebank 5.0; some results via personal communication]
27. Conclusion
- Structured boosting is a simple and general way to coordinate a local link predictor with global optimization
- Successfully applied to natural language dependency parsing
28. Thanks!
Questions?
29. Advantages
- Structured boosting is a simple variant of standard boosting algorithms
- Local parameter optimization is coordinated with global structured prediction (parsing)
- Training (parameter estimation) is directly influenced by the resulting global accuracy of the parser
- Simpler, more general, and easily applied to any local predictor
30. Decomposition of Training Data
- Given training examples (S, T)
- Decompose into a set of local examples: arbitrary word pairs and their link label (none, left, right) in context (S, T)
- Learn a weight vector over a set of features defined on the local examples