Title: Optimizing Local Probability Models for Statistical Parsing
1. Optimizing Local Probability Models for Statistical Parsing
- Kristina Toutanova, Mark Mitchell, Christopher Manning
- Computer Science Department
- Stanford University
2. Highlights
- Choosing a local probability model P(expansion(n) | history(n)) for statistical parsing: a comparison of commonly used models
- A new player: memory-based models and their relation to interpolated models
- Joint likelihood, conditional likelihood, and classification accuracy for models of this form
3. Motivation
- Many problems in natural language processing are disambiguation problems
  - word senses
    - jaguar: a big cat, a car, the name of a Java package
    - line: phone, queue, in mathematics, air line, etc.
  - part-of-speech tags (noun, verb, proper noun, etc.)
- Example: "Joy makes progress every day." Each word has several possible part-of-speech tags (e.g., Joy: NNP/NN, makes: VBZ/NNS, progress: NN/VB, every: DT, day: NN)
4. Parsing as Classification
- Input: a sentence, e.g., "I would like to meet with you again on Monday"
- Classify it into one of the possible parses
5. Motivation: Classification Problems
- There are two major differences from typical ML domains:
  - The number of classes can be very large or even infinite; the set of available classes for an input varies (and depends on a grammar)
  - Data is usually very sparse and the number of possible features is large (e.g., words)
6. Solutions
- The possible parse trees are broken down into small pieces, defining features
  - features are now functions of both the input and the class, not the input only
- Discriminative or generative models are built using these features
  - we concentrate on generative models here: when a huge number of analyses are possible, they are the only practical ones
7. History-Based Generative Parsing Models
- [Figure: parse tree, rooted at TOP, for the example sentence "Tuesday Marks bought Brooks."]
- The generative models:
  - learn a distribution P(S, T) on <sentence, parse tree> pairs
  - select the single most likely parse based on P(S, T) (see the sketch below)
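A minimal sketch of this scoring scheme, under the assumption (not from the slides) that each candidate parse is given as a list of nodes carrying their expansion and history features; log P(S, T) is the sum of the log local probabilities, and the highest-scoring candidate wins:

    import math

    def log_joint(tree_nodes, local_model):
        # log P(S, T) = sum over nodes n of log P(expansion(n) | history(n))
        return sum(math.log(local_model(n["expansion"], n["history"]))
                   for n in tree_nodes)

    def best_parse(candidate_trees, local_model):
        # Select the single most likely parse: argmax over candidates of P(S, T)
        return max(candidate_trees, key=lambda t: log_joint(t, local_model))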
8. Factors in the Performance of Generative History-Based Models
- The chosen decomposition of parse tree generation, including the representation of parse tree nodes and the independence assumptions
- The model family chosen for representing the local probability distributions
  - Decision Trees, Naïve Bayes, Log-linear Models
- The optimization method for fitting the major and smoothing parameters
  - Maximum likelihood, maximum conditional likelihood, minimum error rate, etc.
9. Previous Studies and This Work
- The influence of the previous three factors has not been isolated in previous work: authors presented specific choices for all components, and the importance of each was unclear.
- We assume the generative history-based model and the set of features (the representation of parse tree nodes) are fixed, and we carefully study the other two factors.
10. Deleted Interpolation
- Estimating the probability P(y | X) by interpolating relative frequency estimates for lower-order distributions (a standard recursive form is given below)
- Most commonly used: a linear order on the feature subsets
- Examples: Jelinek-Mercer with fixed weight, Witten-Bell with varying d, Decision Trees with path interpolation, Memory-Based Learning
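The slide's interpolation formula did not survive extraction; a standard recursive form for a linear order on the features, with \hat{P} the relative-frequency estimate, is roughly:

    P_{DI}(y \mid x_1, \ldots, x_k) =
        \lambda_k \, \hat{P}(y \mid x_1, \ldots, x_k)
        + (1 - \lambda_k) \, P_{DI}(y \mid x_1, \ldots, x_{k-1}),
    \qquad P_{DI}(y) = \hat{P}(y)

Jelinek-Mercer keeps \lambda_k fixed; Witten-Bell lets \lambda_k depend on the context counts.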
11. Memory-Based Learning as Deleted Interpolation
- In k-NN, the probability of a class given the features is estimated by a weighted vote over the k nearest neighbors (a typical form is given below)
- If the distance function depends only on the positions of the matching features, it is a case of deleted interpolation
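The k-NN estimate on this slide was also lost in extraction; a typical weighted form, where N_k(X) is the set of k nearest stored instances, d the distance, and w a weighting function, is roughly:

    P_{kNN}(y \mid X) =
        \frac{\sum_{X' \in N_k(X)} w\!\left(d(X, X')\right)\, \mathbb{1}[\mathrm{class}(X') = y]}
             {\sum_{X' \in N_k(X)} w\!\left(d(X, X')\right)}

If d depends only on which feature positions match, neighbors at the same distance can be grouped by the feature subset they share with X, and the estimate becomes a weighted combination of relative-frequency estimates over feature subsets, i.e., deleted interpolation.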
12. Memory-Based Learning as Deleted Interpolation
- Example: estimate P(eye-color = blue | hair-color = blond)
- We have N = 12 samples of people
- d = 1 or d = 0 (match); weights w(1) = w1, w(0) = w0; K = 12
- This is deleted interpolation where the interpolation weights depend on the counts and weights of the nearest neighbors at all accepted distances (a worked sketch follows)
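A worked sketch of this example in Python, with hypothetical counts (the actual counts on the slide did not survive extraction); it checks that the K = 12 weighted vote equals a deleted interpolation of Phat(blue | blond) and Phat(blue), with an interpolation weight built from the counts and the weights w1, w0:

    # Hypothetical counts for the N = 12 people (not the slide's numbers).
    n_blond, n_blond_blue = 4, 3      # samples with hair-color = blond (distance 0)
    n_other, n_other_blue = 8, 2      # remaining samples (distance 1)
    w1, w0 = 1.0, 0.2                 # weights for distance 0 and distance 1

    # k-NN with K = N = 12: weighted vote for eye-color = blue.
    p_knn = (w1 * n_blond_blue + w0 * n_other_blue) / (w1 * n_blond + w0 * n_other)

    # The same value as deleted interpolation of Phat(blue | blond) and Phat(blue).
    N, n_blue = n_blond + n_other, n_blond_blue + n_other_blue
    lam = (w1 - w0) * n_blond / ((w1 - w0) * n_blond + w0 * N)
    p_di = lam * (n_blond_blue / n_blond) + (1 - lam) * (n_blue / N)
    assert abs(p_knn - p_di) < 1e-12   # the two formulations agree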
13. The Task and Features Used
- Maximum ambiguity: 507; minimum: 2
14. Experiments
- Linear Feature Subsets Order
  - Jelinek-Mercer with fixed weight
  - Witten-Bell with varying d
  - Linear Memory-Based Learning
- Arbitrary Feature Subsets Order
  - Decision Trees
  - Memory-Based Learning
  - Log-linear Models
- Experiments on the connection among the likelihoods and accuracy
15. Experiments: Linear Sequence
- The features 1, 2, ..., 8 ordered by gain ratio: 1, 8, 2, 3, 5, 4, 7, 6
- [Figure: accuracy curves for Jelinek-Mercer with fixed weight and Witten-Bell with varying d; sketches of both schemes follow below]
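A minimal sketch of the two smoothing schemes behind these curves, assuming precomputed training counts are passed in (the count arguments and the `backoff` lower-order estimate are hypothetical names, not from the slides):

    def jelinek_mercer(count_xy, count_x, lam, backoff):
        # Fixed weight: lam * Phat(y | x) + (1 - lam) * lower-order estimate.
        mle = count_xy / count_x if count_x else 0.0
        return lam * mle + (1.0 - lam) * backoff

    def witten_bell(count_xy, count_x, distinct_y_in_x, d, backoff):
        # Varying d: the weight on the higher-order estimate grows with the
        # context count and shrinks with the number of distinct outcomes seen.
        lam = count_x / (count_x + d * distinct_y_in_x) if count_x else 0.0
        mle = count_xy / count_x if count_x else 0.0
        return lam * mle + (1.0 - lam) * backoff

Backing off along the gain-ratio order 1, 8, 2, 3, 5, 4, 7, 6 means the context at level k consists of the first k features in that order.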
16. Experiments: Linear Sequence
- Heavy smoothing is needed for the best results
17. MBL: Linear Subsets Sequence
- Restrict MBL to be an instance of the same linear-subsets-sequence deleted interpolation
- Weighting functions INV3 and INV4 performed best
  - LKNN3 is best at K = 3,000 (accuracy 79.94)
  - LKNN4 is best at K = 15,000 (accuracy 80.18)
  - LKNN4 is the best of all
18. Experiments
- Linear Subsets Feature Order
  - Jelinek-Mercer with fixed weight
  - Witten-Bell with varying d
  - Linear Memory-Based Learning
- Arbitrary Subsets Feature Order
  - Decision Trees
  - Memory-Based Learning
  - Log-linear Models
- Experiments on the connection among the likelihoods and accuracy
19. Model Implementations: Decision Trees (DecTreeWBd)
- n-ary decision trees: if we choose a feature f to split on, all its values form subtrees
- splitting criterion: gain ratio
- final probability estimates at the leaves are Witten-Bell d interpolations of the estimates on the path to the root (a sketch of this leaf estimate follows)
- [Figure: example decision tree splitting on feat1 and then feat2, with values such as HCOMP and NOPTCOMP]
- these are therefore instances of deleted interpolation models!
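A minimal sketch of the DecTreeWBd leaf estimate described above, assuming each node on the root-to-leaf path provides a hypothetical count triple (count of y at the node, total count at the node, number of distinct outcomes at the node):

    def dectree_wbd_estimate(path_counts, d, prior):
        # Interpolate relative-frequency estimates along the root-to-leaf path
        # with Witten-Bell d weights; `prior` is the base (e.g., uniform) estimate.
        estimate = prior
        for c_y, c_total, distinct in path_counts:   # root first, leaf last
            lam = c_total / (c_total + d * distinct) if c_total else 0.0
            mle = c_y / c_total if c_total else 0.0
            estimate = lam * mle + (1.0 - lam) * estimate
        return estimate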
20. Model Implementations: Log-linear Models
- Binary features formed by instantiating templates
- Three models with different allowable features:
  - single attributes only: LogLinSingle
  - pairs of attributes, but only pairs involving the most important feature (the node label): LogLinPairs
  - linear feature subsets, comparable to the previous models: LogLinBackoff
- Gaussian smoothing was used (the implied training objective is sketched below)
- Trained by conjugate gradient (Stanford Classify Package)
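A standard formulation of what Gaussian smoothing plus conjugate-gradient training typically amounts to (the variance \sigma^2 is a hypothetical name for the smoothing parameter; the slides do not give the exact objective):

    P_w(y \mid x) = \frac{\exp\left(\sum_j w_j f_j(x, y)\right)}
                         {\sum_{y'} \exp\left(\sum_j w_j f_j(x, y')\right)},
    \qquad
    \max_w \; \sum_i \log P_w(y_i \mid x_i) \; - \; \sum_j \frac{w_j^2}{2\sigma^2}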
21. Model Implementations: Memory-Based Learning
- Weighting functions INV3 and INV4
- KNN4 is better than DecTreeWBd and the Log-linear models
- KNN4 achieves a 5.8% error reduction over WBd (significant at the 0.01 level)
22. Accuracy Curves for MBL and Decision Trees
23. Experiments
- Linear Subsets Feature Order
  - Jelinek-Mercer with fixed weight
  - Witten-Bell with varying d
  - Linear Memory-Based Learning
- Arbitrary Subsets Feature Order
  - Decision Trees
  - Memory-Based Learning
  - Log-linear Models
- Experiments on the connection among the likelihoods and accuracy
24. Joint Likelihood, Conditional Likelihood, and Classification Accuracy
- Our aim is to maximize parsing accuracy, but:
  - smoothing parameters are usually fit on held-out data to maximize joint likelihood
  - sometimes conditional likelihood is optimized
- We look at the relationship among the maxima of these three scoring functions, depending on the amount of smoothing (see the sketch below), finding that:
  - much heavier smoothing is needed to maximize accuracy than joint likelihood
  - conditional likelihood also increases with smoothing, even long after the maximum for joint likelihood
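A minimal sketch of the experiment behind this comparison, assuming hypothetical model-access functions `joint_prob`, `cond_prob`, and `predict` that take a smoothing value; it tabulates held-out joint log-likelihood, conditional log-likelihood, and accuracy so their maxima over the smoothing grid can be compared:

    import math

    def evaluate_smoothing(smoothing_grid, heldout, joint_prob, cond_prob, predict):
        rows = []
        for s in smoothing_grid:
            joint = sum(math.log(joint_prob(x, y, s)) for x, y in heldout)
            cond = sum(math.log(cond_prob(y, x, s)) for x, y in heldout)
            acc = sum(predict(x, s) == y for x, y in heldout) / len(heldout)
            rows.append((s, joint, cond, acc))
        return rows

    # Finding on the following slides: accuracy (and conditional likelihood)
    # peak at much heavier smoothing than joint likelihood does.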
25. Test Set Performance versus Amount of Smoothing - I
26. Test Set Performance versus Amount of Smoothing
27. Test Set Performance versus Amount of Smoothing: PP Attachment
- [Figure: Witten-Bell with varying d]
28. Summary
- The problem of effectively estimating local probability distributions for compound decision models used for classification is under-explored
- We showed that the chosen local distribution model matters
- We showed the relationship between MBL and deleted interpolation models
- MBL with a large number of neighbors and appropriate weighting outperformed more expensive and popular algorithms: Decision Trees and Log-linear Models
- Fitting a small number of smoothing parameters to maximize classification accuracy is promising for improving performance
29. Future Work
- Compare MBL to other state-of-the-art smoothing methods
- Better ways of fitting MBL weight functions
- Theoretical investigation of bias-variance tradeoffs for compound decision systems with strong independence assumptions