Transcript and Presenter's Notes

Title: Learning Structured Classifiers for Statistical Dependency Parsing


1
Learning Structured Classifiers for Statistical
Dependency Parsing
  • Qin Iris Wang
  • Joint work with Dekang Lin and Dale Schuurmans
  • University of Alberta
  • September 11, 2007

2
Ambiguities In NLP
I saw her duck.
How about "I saw her duck with a telescope"?
3
Dependency Trees vs. Constituency Trees
[Figure: constituency tree for "Mike ate the cake" (S → NP VP; VP → V NP; NP → Dt N)]
A constituency tree
4
Dependency Trees
  • A dependency tree structure for a sentence
    represents
  • Syntactic relations between word pairs in the
    sentence

[Figure: dependency tree for "I saw her duck with a telescope", with links labeled subj, obj, gen, det, and mod]
Over 100 possible trees!
1M trees for a 20-word sentence
5
Dependency Parsing
  • An increasingly active research area (Yamada &
    Matsumoto 2003, McDonald et al. 2005, McDonald &
    Pereira 2006, Corston-Oliver et al. 2006, Smith &
    Eisner 2005, Smith & Eisner 2006)
  • Dependency trees are much easier to understand
    and annotate than other syntactic representations
  • Dependency relations have been widely used in
  • Machine translation (Fox 2002, Cherry & Lin 2003,
    Ding & Palmer 2005)
  • Information extraction (Culotta & Sorensen 2004)
  • Question answering (Pinchak & Lin 2006)
  • Coreference resolution (Bergsma & Lin 2006)

6
Overview
  • Dependency parsing model
  • Learning dependency parsers
  • Strictly lexicalized dependency parsing
  • Structured boosting
  • Experimental results
  • Conclusion

7
Dependency Parsing Model
  • W: an input sentence; T: a candidate
    dependency tree
  • (i, j): a dependency link from word i to
    word j
  • T(W): the set of possible dependency trees
    over W
  • Can be applied to both probabilistic and
    non-probabilistic models

Edge/link-based factorization
8
Scoring Functions
The score of a link is the dot product of a vector of
feature weights and a vector of features (see the
sketch below)
  • Feature weights can be learned either
    locally or globally
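
Spelling out the scoring model from the last two slides as formulas. This is a minimal reconstruction: the symbols w (feature weights), f (features), s (link score) and T(W) are assumed names rather than the slides' own notation.

```latex
% Link score: dot product of the feature-weight vector and the feature vector
s(i, j \mid W) = \mathbf{w} \cdot \mathbf{f}(i, j, W)

% Edge/link-based factorization: a tree scores the sum of its link scores
\mathrm{score}(T \mid W) = \sum_{(i, j) \in T} s(i, j \mid W)

% Parsing: return the highest-scoring tree among all candidates over W
T^{*}(W) = \operatorname*{argmax}_{T \in \mathcal{T}(W)} \mathrm{score}(T \mid W)
```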

9
Features for an arc
  • Word pair indicator
  • Part-of-Speech (POS) tags of the word pair
  • Pointwise Mutual Information (PMI) for that word
    pair
  • Distance between words

Lots of features, and they are sparse: abstraction or
smoothing is needed
10
Score of Each Link / Word-pair
  • The score of each link is based on the features
  • Considering the word pair (skipped, regularly)
  • POS(skipped, regularly) = (VBD, RB)
  • PMI(skipped, regularly) = 0.27
  • dist(skipped, regularly) = 2
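
A minimal sketch of how the arc features from the last two slides might be assembled. The function name arc_features, the feature-string format, and the pmi lookup table are illustrative assumptions, not the authors' implementation.

```python
def arc_features(sent, tags, pmi, i, j):
    """Build feature strings for a candidate link from word i to word j.

    sent: list of words; tags: list of POS tags (dropped in the strictly
    lexicalized model); pmi: dict mapping word pairs to PMI scores.
    """
    w1, w2 = sent[i], sent[j]
    feats = [
        f"W1W2_{w1}_{w2}",            # word-pair indicator
        f"T1T2_{tags[i]}_{tags[j]}",  # POS tags of the word pair
        f"Dist_{abs(i - j)}",         # distance between the words
    ]
    # Pointwise mutual information for the pair, bucketed so it stays discrete
    score = pmi.get((w1, w2), 0.0)
    feats.append(f"PMI_{round(score, 1)}")
    return feats

# e.g. arc_features(["The", "boy", "skipped", "school", "regularly"],
#                   ["DT", "NN", "VBD", "NN", "RB"],
#                   {("skipped", "regularly"): 0.27}, 2, 4)
```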

11
Score of A Tree
[Figure: dependency tree for "The boy skipped school regularly"]
  • Edge-based factorization

12
My Work on Dependency Parsing
  • 1. Strictly Lexicalized Dependency Parsing (Wang
    et al. 2005)
  • MLE + similarity-based smoothing
  • 2. Improving Large Margin Training (Wang et al.
    2006)
  • Local constraints: capture local errors in a
    parse tree
  • Laplacian regularization: enforces similar links
    to have similar weights (introduces the
    similarity-based smoothing technique into the
    large margin framework)
  • 3. Structured Boosting (Wang et al. 2007)
  • Global optimization, efficient and flexible

Focus of this talk
13
Overview
  • Dependency parsing model
  • Learning dependency parsers
  • Strictly lexicalized dependency parsing
  • Structured boosting
  • Experimental results
  • Conclusion

14
Strictly Lexicalized Dependency Parsing
-- IWPT 2005
15
POS Tags for Handling Sparseness in Parsing
  • All previous parsers use a POS lexicon
  • Natural language data is sparse
  • Bikel (2004) found that of all the needed bi-gram
    statistics, only 1.49% were observed in the Penn
    Treebank
  • Words belonging to the same POS are expected to
    have the same syntactic behavior

16
However,
  • POS tags are not part of natural text
  • Need to be annotated by human effort
  • Introduce more noise to training data
  • For some languages, POS tags are not clearly
    defined
  • Such as Chinese or Japanese
  • A single word is often combined with other words

Can we use another smoothing technique rather
than POS?
17
Strictly Lexicalized Dependency Parsing
  • All the features are based on word statistics;
    no POS tags are needed
  • Using similarity-based smoothing to deal with
    data sparseness

18
An Alternative Method to Deal with Sparseness
  • Distributional word similarities
  • Distributional Hypothesis --- Words that appear
    in similar contexts have similar meanings
    (Harris, 1968)
  • Soft clusters of words
  • Advantages of using similarity smoothing
  • Computed automatically from raw text
  • Makes treebank construction easier

(In contrast, POS tags are hard clusters of words)
19
Similarity Between Word Pairs
  • Similarity between two words: Sim(w1, w2) (Lin,
    1998)
  • Construct a feature vector for w that contains a
    set of words occurring within a small context
    window
  • Compute similarities between the two feature
    vectors using the cosine measure
  • e.g., Sim(skipped, missed) = 0.134966
  • Similarity between word pairs: geometric average
    of the word similarities (see the sketch below)
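
A sketch of the two similarity computations described above. The Counter-based raw-count context vectors and the function names are assumptions; Lin (1998) uses a more elaborate feature weighting, so this is a simplification.

```python
import math
from collections import Counter

def cosine(v1: Counter, v2: Counter) -> float:
    """Cosine similarity between two sparse context-count vectors."""
    dot = sum(v1[c] * v2[c] for c in v1 if c in v2)
    norm = math.sqrt(sum(x * x for x in v1.values())) * \
           math.sqrt(sum(x * x for x in v2.values()))
    return dot / norm if norm else 0.0

def pair_similarity(sim, pair_a, pair_b):
    """Similarity between word pairs (a1, a2) and (b1, b2):
    geometric average of the two word-level similarities."""
    return math.sqrt(sim(pair_a[0], pair_b[0]) * sim(pair_a[1], pair_b[1]))
```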

20
Similarity-based Smoothing
[Figure: the sentence "The kid skipped school regularly", with word positions The=1, kid=2, skipped=3, school=4, regularly=5]
21
Similarity-based Smoothing
Estimating the probability P( · | skipped, regularly) using similar words of the context:
S(regularly) = { frequently 0.365862, routinely 0.286178, periodically 0.273665, often 0.24077, constantly 0.234693, occasionally 0.226324, who 0.200348, continuously 0.194026, repeatedly 0.177434 }
S(skipped) = { skipping 0.229951, skip 0.197991, skips 0.169982, sprinted 0.140535, bounced 0.139547, missed 0.134966, cruised 0.133933, scooted 0.13387, jogged 0.133638 }
22
Similarity-based Smoothing
[Animation step: repeats the similar-word sets S(regularly) and S(skipped) from the previous slide]
23
Similarity-based Smoothing
Estimating P( · | skipped, regularly): the similar contexts are pairs seen in the training data: (skip, frequently), (skip, routinely), (skip, repeatedly), (bounced, often), (bounced, who), (bounced, repeatedly)
24
Similarity-based Smoothing
[Figure: the similarity-based probability for the context (skipped, regularly) is computed from MLE-based probabilities over its similar contexts, e.g. (skip, frequently), (skip, routinely), (skip, repeatedly), (bounced, often), (bounced, who), (bounced, repeatedly)]
25
Similarity-based Smoothing
  • Finally,

P(E|C) = α · PMLE(E|C) + (1 - α) · PSIM(E|C)
where the interpolation weight α depends on |C|, the
frequency count of the corresponding context C in the
training data, e.g. |(skipped, regularly)| = 1,
|(The, kid)| = 95 (a code sketch follows below)
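
A minimal sketch of this interpolation. Tying α to |C| through a constant K is an assumption (the slide only indicates that α depends on the context frequency), as are the function names.

```python
def smoothed_prob(event, context, p_mle, p_sim, count, K=5.0):
    """P(E|C) = alpha * P_MLE(E|C) + (1 - alpha) * P_SIM(E|C).

    p_mle, p_sim: functions returning the MLE and similarity-based estimates;
    count: frequency |C| of the context C in the training data.
    alpha -> 1 for frequent contexts (trust the MLE),
    alpha -> 0 for rare contexts (fall back on similar contexts).
    """
    alpha = count / (count + K)   # assumed form; not specified on the slide
    return alpha * p_mle(event, context) + (1 - alpha) * p_sim(event, context)
```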
26
Comparison With An Unlexicalized Model
  • In this model, the input to the parser is the
    sequence of POS tags, in contrast to our model
  • Using gold standard POS tags
  • Accuracy of the unlexicalized model: 71.1%
  • Our strictly lexicalized model: 79.9%

27
Overview
  • Dependency parsing model
  • Learning dependency parsers
  • Strictly lexicalized dependency parsing
  • Structured boosting
  • Experimental results
  • Conclusion

28
Simple Training of Dependency Parsers via
Structured Boosting
-- IJCAI 2007
29
Our Contributions
  • Structured Boosting: a simple approach to
    training structured classifiers by applying a
    boosting-like procedure to standard supervised
    training methods
  • Advantages
  • Simple
  • Inexpensive
  • General
  • Successfully applied to dependency parsing

30
Local Training Examples
  • Given training data (S, T)

local examples
Word-pair          Link-label  Weight  Features
The-boy            L           1       W1_The, W2_boy, W1W2_The_boy, T1_DT, T2_NN, T1T2_DT_NN, Dist_1, ...
boy-skipped        L           1       W1_boy, W2_skipped, ...
skipped-school     R           1       W1_skipped, W2_school, ...
skipped-regularly  R           1       W1_skipped, W2_regularly, ...
The-skipped        N           1       W1_The, W2_skipped, ...
The-school         N           1       W1_The, W2_school, ...
(L = left link, R = right link, N = no link)
31
Local Training Methods
  • Learn a local link classifier given a set of
    features defined on the local examples
  • For each word pair in a sentence
  • No link, left link or right link ?
  • 3-class classification
  • Any classifier can be used as a link classifier
    for parsing (a minimal sketch follows below)
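
A minimal sketch of such a local link classifier using scikit-learn's logistic regression (the deck's own experiments use a logistic regression / maximum entropy model). The toy examples and feature names below are illustrative assumptions.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# One local example per word pair: features -> {N, L, R}
examples = [
    ({"W1": "The", "W2": "boy", "T1T2": "DT_NN", "Dist": 1}, "L"),
    ({"W1": "skipped", "W2": "school", "T1T2": "VBD_NN", "Dist": 1}, "R"),
    ({"W1": "The", "W2": "school", "T1T2": "DT_NN", "Dist": 3}, "N"),
]

vec = DictVectorizer()
X = vec.fit_transform([feats for feats, _ in examples])
y = [label for _, label in examples]

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)                     # local training: no tree structure involved
print(clf.predict_proba(X[:1]))   # class probabilities become link scores
```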

32
Combining Local Training with a Parsing Algorithm
[Flowchart: training sentences (S, T) → local training examples → local link classifier h → link scores → dependency parsing algorithm → dependency trees; a standard application of ML]
33
Parsing With a Local Link Classifier
  • Learn the weight vector over a set of
    features defined on the local examples
  • Maximum entropy models (Ratnaparkhi 1999,
    Charniak 2000)
  • Support vector machines (Yamada and Matsumoto
    2003)
  • The parameters of the local model are not trained
    to consider global parsing accuracy
  • Global training can do better

34
Global Training for Parsing
  • Directly capture the relations between the links
    of an output tree
  • Incorporate the effects of the parser directly
    into the training algorithm
  • Structured SVMs (Tsochantaridis et al. 2004)
  • Max-Margin Parsing (Taskar et al. 2004)
  • Online large-margin training (McDonald et al.
    2005)
  • Improving large-margin training (Wang et al. 2006)

35
But, Drawbacks
  • Unfortunately, these structured training
    techniques are
  • Expensive
  • Specialized
  • Complex to implement
  • Require a great deal of refinement and
    computational resources to apply to parsing

Need efficient global training algorithms!
36
Structured Boosting
  • A simple variant of standard boosting algorithms,
    e.g. AdaBoost.M1 (Freund & Schapire 1997)
  • Global optimization
  • As efficient as local methods
  • General, can use any local classifier
  • Also, can be easily applied to other tasks

37
Standard Boosting for Classification
[Figure: the standard boosting loop: a local predictor h is trained on weighted training examples, the weights of mis-classified examples are increased, and the process repeats, yielding h1, h2, h3, ..., hk]
38
Structured Boosting (An Example)
[Figure: gold dependency tree for "I saw her duck with a telescope" vs. the parser's output at each iteration. At each iteration, the weights of mis-parsed local examples are increased ("try harder!"); by iteration T the output matches the gold tree ("good job!")]
39
Structured Boosting (An Example)
[Figure: the weights of local examples, e.g. for the word pairs saw-with and duck-with, change across iterations 1, 2, ..., T]
40
Structured Boosting for Dependency Parsing
Global training, and still efficient?
[Flowchart: training sentences (S, T) → local training examples → local link classifier h → link scores → dependency parsing algorithm → dependency trees; the parser output is compared with the gold standard trees and mis-parsed examples are re-weighted, yielding h1, h2, h3, ..., hk]
41
Overview
  • Dependency parsing model
  • Learning dependency parsers
  • Strictly lexicalized dependency parsing
  • Structured boosting
  • Experimental results
  • Conclusion

42
Experimental Design
  • Data sets (from the Linguistic Data Consortium),
    each split into training, development and test sets
  • PTB3: 50K sentences (standard split)
  • CTB4: 15K sentences (split as in Wang et al. 2005)
  • CTB5: 19K sentences (split as in Corston-Oliver et
    al. 2006)
  • Features
  • Word-pair indicator, POS-pair, PMI, context
    features, distance

43
Experimental Design Cont.
  • Local link classifier
  • Logistic regression model / maximum entropy
  • Boosting method
  • A variant of AdaBoost.M1 (Freund & Schapire 1997)

44
Results
[Table: accuracy on Chinese and English (%)]
Dependency accuracy = percentage of words that
have the correct head
45
Comparison with State of the Art
[Chart: comparison with the state of the art, on Chinese Treebank 4.0 (IWPT 2005 model) and Chinese Treebank 5.0 (IJCAI 2007 model)]
46
Overview
  • Dependency parsing model
  • Learning dependency parsers
  • Strictly lexicalized dependency parsing
  • Structured boosting
  • Experimental results
  • Conclusion

47
Conclusion
  • Similarity-based smoothing as an alternative to
    POS to deal with data sparseness
  • Structured boosting is an efficient and effective
    approach to coordinating local link classifiers
    with global parsing accuracy
  • Both of the above techniques have been
    successfully applied to dependency parsing

48
More on Structured Learning
  • Improved Estimation for Unsupervised
    Part-of-Speech Tagging (Wang & Schuurmans 2005)
  • Improved Large Margin Dependency Parsing via
    Local Constraints and Laplacian Regularization
    (Wang, Cherry, Lizotte and Schuurmans 2006)
  • Learning Noun Phrase Query Segmentation
    (Bergsma and Wang 2007)
  • Semi-supervised Topic Segmentation of Web Docs
    (in progress)

49
Thanks!
Questions?
50
[Diagram: NLP contributes features, linguistic intuitions, and models; machine learning contributes training criteria, regularization, and smoothing]
51
What Have I Worked On?
[Diagram: structured learning, where the output is a set of inter-dependent labels with a specific structure (e.g., a parse tree); applications include dependency parsing, part-of-speech tagging, and query segmentation]
52
Ambiguities In NLP
Courtesy of Aravind Joshi
I like eating sushi with tuna.
53
Dependency Tree
  • A dependency tree structure for a sentence
  • Syntactic relationships between word pairs in
    the sentence

[Figure: two dependency trees for "I like eating sushi with tuna", differing in whether "with tuna" attaches to "eating" (mod) or to "sushi" (obj); other links labeled subj and obj]
54
Structured Boosting
  • Train a local link predictor, h1
  • Re-parse the training data using h1
  • Re-weight local examples
  • Compare the parser outputs with the gold standard
    trees
  • Increase the weight of mis-parsed local examples
  • Re-train the local link predictor, getting h2
  • Finally we have h1, h2, ..., hk (see the sketch
    below)
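
A compact sketch of this loop. Here the local classifier is a binary link/no-link logistic regression, the "parser" simply attaches each word to its highest-scoring head, and the re-weighting is a plain multiplicative update; all three are simplifying assumptions made so the example runs, not the authors' exact algorithm (the real system uses a 3-class left/right/no-link classifier and a dependency parsing algorithm).

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def structured_boost(sentences, gold_heads, featurize, rounds=5, eta=2.0):
    """Structured boosting sketch: repeatedly train a weighted local link
    classifier, re-parse the training data, and up-weight the local
    examples involved in mis-parsed attachments."""
    # One local example per (sentence, candidate head h, dependent d) triple
    examples = [(s, h, d)
                for s in range(len(sentences))
                for d in range(1, len(sentences[s]))
                for h in range(len(sentences[s])) if h != d]
    labels = [1 if gold_heads[s][d] == h else 0 for (s, h, d) in examples]
    weights = [1.0] * len(examples)

    vec = DictVectorizer()
    X = vec.fit_transform([featurize(sentences[s], h, d)
                           for (s, h, d) in examples])

    classifiers = []
    for _ in range(rounds):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, labels, sample_weight=weights)   # weighted local training
        classifiers.append(clf)

        # "Parse": attach each dependent d to its highest-scoring head.
        # (A real system would run a dependency parsing algorithm here.)
        link_score = clf.predict_proba(X)[:, list(clf.classes_).index(1)]
        best_head = {}
        for k, (s, h, d) in enumerate(examples):
            if (s, d) not in best_head or link_score[k] > best_head[(s, d)][0]:
                best_head[(s, d)] = (link_score[k], h)

        # Up-weight local examples touched by a mis-parsed attachment
        for k, (s, h, d) in enumerate(examples):
            predicted = best_head[(s, d)][1]
            if predicted != gold_heads[s][d] and h in (predicted,
                                                       gold_heads[s][d]):
                weights[k] *= eta
    return classifiers
```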

55
However
  • Exponential number of constraints (number of
    incorrect trees)
  • Loss: ignores the local errors of the parse tree
  • Over-fitting the training corpus
  • Large number of bi-lexical/word-pair features:
    need a good smoothing (regularization) method

56
Variants of Structured Boosting
  • Using alternative boosting algorithms for
    structured boosting
  • AdaBoost.M2 (Freund & Schapire 1997)
  • Re-weighting class labels
  • Logistic regression form of boosting (Collins et
    al. 2002)

57
Improving Large Margin Training (Wang et al.
2006)
  • A margin is created between the correct
    dependency tree and each incorrect dependency
    tree, at least as large as the loss of the
    incorrect tree
  • Our contributions
  • Local constraints to capture local errors in a
    parse tree
  • Laplacian regularization to deal with data
    sparseness (introduces the similarity-based
    smoothing technique into the large margin
    framework)

58
(Existing) Large Margin Training
  • Has been used for parsing
  • Tsochantaridis et al. 2004
  • Taskar et al. 2004
  • State-of-the-art performance in dependency
    parsing
  • McDonald et al. 2005a, 2005b, 2006

59
Large Margin Training
  • Minimizing a regularized loss (Hastie et al.,
    2004)

i: the index of the training sentences; Ti: the target
tree; Li: a candidate tree; Δ(Ti, Li): the distance
between the two trees
(a reconstruction of the objective follows below)
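
A hedged reconstruction of the objective this slide refers to, in the standard structured large-margin form used by the cited work; the regularization constant C and the exact notation are assumptions, since the slide's own formula is not reproduced in the transcript.

```latex
\min_{\mathbf{w}} \;\; \frac{1}{2}\,\lVert \mathbf{w} \rVert^{2}
  \;+\; C \sum_{i} \max_{L_i \in \mathcal{T}(W_i)}
  \Big[\, \Delta(T_i, L_i) - \big( s(T_i; \mathbf{w}) - s(L_i; \mathbf{w}) \big) \,\Big]_{+}
```

Each training sentence i contributes one term: the correct tree Ti must outscore every candidate tree Li by a margin at least as large as their distance Δ(Ti, Li).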
60
Large Margin Training
McDonald 2005
  • Exponential number of constraints (number of
    incorrect trees)
  • Loss: ignores the local errors of the parse tree
  • Over-fitting the training corpus
  • Large number of bi-lexical/word-pair features:
    need a good smoothing (regularization) method

61
Local Constraints (an example)
[Figure: dependency tree for "The boy skipped school regularly" with numbered links; the incorrect links count toward the loss]
For each correct link and a competing incorrect link that share a node:
score(The, boy) > score(The, skipped) + 1
score(boy, skipped) > score(The, skipped) + 1
score(skipped, school) > score(school, regularly) + 1
score(skipped, regularly) > score(school, regularly) + 1
Only a polynomial number of constraints!
62
Local Constraints
[Figure: a correct link and a missing (incorrect) link sharing a common node w1; the correct link should score higher, and the resulting constraint is convex]
  • With slack variables
63
Objective with Local Constraints
  • The corresponding new quadratic program

Only a polynomial number of constraints!
(j indexes the constraints in the constraint set A;
a sketch follows below)
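
A hedged sketch of the shape of that quadratic program with the local constraints and slack variables; the indexing over the constraint set A follows the slide, but the exact form is an assumption since the slide's formula is not reproduced in the transcript.

```latex
\min_{\mathbf{w},\, \boldsymbol{\xi} \ge 0} \;\;
  \frac{1}{2}\,\lVert \mathbf{w} \rVert^{2} + C \sum_{j=1}^{|A|} \xi_{j}
\quad \text{s.t.} \quad
  s(\text{correct link}_j) \;\ge\; s(\text{competing link}_j) + 1 - \xi_{j},
  \qquad j = 1, \dots, |A|
```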
64
Laplacian Regularization
  • Enforce similar links (word pairs) to have
    similar weights

L(S) = D(S) - S, where D(S) is a diagonal matrix,
S is the similarity matrix of word pairs, and L(S) is
the Laplacian matrix of S (a sketch follows below)
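
A small sketch of the Laplacian construction and the penalty it induces. The penalty form w^T L(S) w is the standard way such a Laplacian enters an objective; treating it as exactly what the slides use is an assumption.

```python
import numpy as np

def laplacian(S: np.ndarray) -> np.ndarray:
    """L(S) = D(S) - S, where D(S) is diagonal with the row sums of S."""
    return np.diag(S.sum(axis=1)) - S

def laplacian_penalty(w: np.ndarray, S: np.ndarray) -> float:
    """w^T L(S) w = 0.5 * sum_{p,q} S[p,q] * (w[p] - w[q])^2 for symmetric S:
    large when similar word pairs (high S[p,q]) get dissimilar weights."""
    L = laplacian(S)
    return float(w @ L @ w)
```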
65
Similarity Between Word Pairs
  • Similarity between two words: Sim(w1, w2),
    using the cosine measure
  • Similarity between word pairs: the geometric
    average of the two word similarities
66
Refined Large Margin Objective
(the Laplacian regularizer applies only to the bi-lexical features; a sketch follows below)
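
Presumably the refined objective augments the quadratic program from slide 63 with the Laplacian penalty from slide 64, applied only to the bi-lexical block of the weight vector. A hedged sketch, with λ a placeholder trade-off constant; the slide's own formula is not reproduced in the transcript.

```latex
\min_{\mathbf{w},\, \boldsymbol{\xi} \ge 0} \;\;
  \frac{1}{2}\,\lVert \mathbf{w} \rVert^{2}
  \;+\; \frac{\lambda}{2}\, \mathbf{w}_{\mathrm{lex}}^{\top} L(S)\, \mathbf{w}_{\mathrm{lex}}
  \;+\; C \sum_{j} \xi_{j}
\qquad \text{subject to the local margin constraints above}
```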
67
Unsupervised POS Tagging (Wang & Schuurmans 2005)
  • Weaknesses of the transition model and emission
    model in HMM tagging
  • Poorly learned transition parameters
  • No form of parameter tying over the emission model
  • Our ideas
  • Transition model: marginally constrained HMMs
  • Emission model: similarity-based smoothing

68
Parameters of an HMM
[Figure: HMM graphical model, a chain of hidden tags T1, ..., Ti-1, Ti, Ti+1, ..., Tn, each emitting a word W1, ..., Wi-1, Wi, Wi+1, ..., Wn]
69
We Did Better
  • Improved Estimation for Unsupervised
    Part-of-Speech Tagging (Wang & Schuurmans,
    2005)
  • Full/unfiltered lexicon
  • 77.2% (Banko and Moore 2004)
  • 90.5% (our model)
  • Reduced/filtered lexicon
  • 95.9% (Banko and Moore 2004)
  • 94.7% (our model)

70
Query Segmentation (Bergsma & Wang 2007)
  • Input: a search engine query
  • Output: the query separated into phrases
  • Goal: improve information retrieval
  • Approach: supervised machine learning
  • A classifier makes segmentation decisions
  • Conclusion: richer features allow for large
    increases in segmentation performance

71
Query Segmentation
  • Example query
  • two man power saw
  • Output segmentations: the same words with
    different phrase boundaries (shown on the next
    slides), e.g.
  • [two man] [power saw]
  • [two] [man] [power saw]
  • etc.

72
Query Segmentation
  • Unsegmented
  • two man power saw
  • two
  • man
  • power
  • saw

73
Query Segmentation
  • two man
  • power saw

74
Query Segmentation
  • two
  • man
  • power saw

75
Semi-supervised Dependency Parsing
  • Unsupervised/semi-supervised dependency parsing
  • EM (expensive)
  • Using a discriminative, convex, unsupervised
    structured learning algorithm (Xu et al. 2006)
  • Combining a supervised structured large-margin
    loss with a cheap unsupervised least-squares loss
    on unlabeled data (much cheaper)
76
Topic Segmentation of Web Docs
  • A structured classification problem
  • Input: a document containing a sequence of k
    sentences
  • Output: a sequence of break decisions (each
    sentence boundary is a possible segmentation
    point)
  • Goal: segment a document into a few blocks
    according to subtopic
  • Approach: semi-supervised training (combining a
    supervised large-margin training loss with an
    unsupervised least-squares loss)

77
Experimental Results
Dependency accuracy on Chinese Treebank (CTB) 4.0
[Chart: accuracies for the undirected tree, the root, and the directed tree]
78
Dynamic Features
  • Also known as non-local features
  • Take into account the link labels of the
    surrounding word-pairs when predicting the label
    of current pair
  • Commonly used in sequential labeling (McCallum
    et al. 2000, Toutanova et al. 2003)
  • A simple but useful idea for improving parsing
    accuracy
  • Wang et al. 2005
  • McDonald and Pereira 2006

79
Dynamic Features
[Figure: two sentences, "I saw her duck with a telescope" and "I saw her duck with a spot", with competing attachments of the with-phrase to saw or to duck]
  • Define a canonical order so that a word's
    children are generated first, before it modifies
    another word
  • telescope/spot are the dynamic features for
    deciding whether to generate a link between
    saw-with or duck-with

80
Results - 1
Table 1 Boosting with static features
81
Results - 2
Table 2 Boosting with dynamic features
82
Feature Representation
  • Represent a word w by a feature vector
  • The value of a feature c is a function of
    P(w, c), the probability that w and c co-occur
    in a context window (see the sketch below)
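
The slide's formula for the feature value did not survive the transcript; in Lin-style distributional similarity the value is typically the pointwise mutual information between the word and the context, so the following is an assumption rather than a verbatim reconstruction.

```latex
\mathrm{value}(w, c) \;=\; \mathrm{PMI}(w, c)
  \;=\; \log \frac{P(w, c)}{P(w)\, P(c)}
```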
83
Similarity-based Smoothing
  • Similarity measure: cosine
  • As in (Dagan et al., 1999)