Title: Optimizing Local Probability Models for Statistical Parsing
1. Optimizing Local Probability Models for Statistical Parsing
- Kristina Toutanova, Mark Mitchell, Christopher Manning
- Computer Science Department
- Stanford University
2. Highlights
- Choosing a local probability model P(expansion(n) | history(n)) for statistical parsing: a comparison of commonly used models
- A new player: memory-based models and their relation to interpolated models
- Joint likelihood, conditional likelihood, and classification accuracy for models of this form
3. Motivation
- Many problems in natural language processing are disambiguation problems
  - word senses
    - jaguar: a big cat, a car, the name of a Java package
    - line: phone, queue, in mathematics, air line, etc.
  - part-of-speech tags (noun, verb, proper noun, etc.)
- Example: "Joy makes progress every day." Each word has several possible part-of-speech tags (e.g., Joy: NNP/NN, makes: VBZ/NNS, progress: NN/VB, every: DT, day: NN)
4. Parsing as Classification
- Input: a sentence, e.g., "I would like to meet with you again on Monday"
- Classify it into one of the possible parses
5. Motivation: Classification Problems
- There are two major differences from typical ML domains:
  - The number of classes can be very large or even infinite; the set of available classes for an input varies (and depends on a grammar)
  - Data is usually very sparse and the number of possible features is large (e.g., words)
6. Solutions
- The possible parse trees are broken down into small pieces, defining features
  - features are now functions of both the input and the class, not the input only
- Discriminative or generative models are built using these features
  - we concentrate on generative models here: when a huge number of analyses are possible, they are the only practical ones
7. History-Based Generative Parsing Models
- [Figure: parse tree, rooted at TOP, for the example sentence "Tuesday Marks bought Brooks."]
- The generative models:
  - learn a distribution P(S, T) on <sentence, parse tree> pairs
  - select the single most likely parse based on P(S, T) (see the sketch below)
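A minimal sketch of this scoring scheme, under the assumption (not from the slides) that each candidate parse is given as a list of nodes carrying their expansion and history features; log P(S, T) is the sum of the log local probabilities, and the highest-scoring candidate wins:

    import math

    def log_joint(tree_nodes, local_model):
        # log P(S, T) = sum over nodes n of log P(expansion(n) | history(n))
        return sum(math.log(local_model(n["expansion"], n["history"]))
                   for n in tree_nodes)

    def best_parse(candidate_trees, local_model):
        # Select the single most likely parse: argmax over candidates of P(S, T)
        return max(candidate_trees, key=lambda t: log_joint(t, local_model))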
8. Factors in the Performance of Generative History-Based Models
- The chosen decomposition of parse tree generation, including the representation of parse tree nodes and the independence assumptions
- The model family chosen for representing the local probability distributions
  - Decision Trees, Naïve Bayes, Log-linear Models
- The optimization method for fitting the major and smoothing parameters
  - Maximum likelihood, maximum conditional likelihood, minimum error rate, etc.
9. Previous Studies and This Work
- The influence of the previous three factors has not been isolated in previous work: authors presented specific choices for all components, and the importance of each was unclear.
- We assume the generative history-based model and the set of features (the representation of parse tree nodes) are fixed, and we carefully study the other two factors.
10. Deleted Interpolation
- Estimating the probability P(y | X) by interpolating relative frequency estimates for lower-order distributions (a standard recursive form is given below)
- Most commonly used: a linear order on the feature subsets
- Examples: Jelinek-Mercer with fixed weight, Witten-Bell with varying d, Decision Trees with path interpolation, Memory-Based Learning
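The slide's interpolation formula did not survive extraction; a standard recursive form for a linear order on the features, with \hat{P} the relative-frequency estimate, is roughly:

    P_{DI}(y \mid x_1, \ldots, x_k) =
        \lambda_k \, \hat{P}(y \mid x_1, \ldots, x_k)
        + (1 - \lambda_k) \, P_{DI}(y \mid x_1, \ldots, x_{k-1}),
    \qquad P_{DI}(y) = \hat{P}(y)

Jelinek-Mercer keeps \lambda_k fixed; Witten-Bell lets \lambda_k depend on the context counts.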
11. Memory-Based Learning as Deleted Interpolation
- In k-NN, the probability of a class given the features is estimated by a weighted vote over the k nearest neighbors (a typical form is given below)
- If the distance function depends only on the positions of the matching features, it is a case of deleted interpolation
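The k-NN estimate on this slide was also lost in extraction; a typical weighted form, where N_k(X) is the set of k nearest stored instances, d the distance, and w a weighting function, is roughly:

    P_{kNN}(y \mid X) =
        \frac{\sum_{X' \in N_k(X)} w\!\left(d(X, X')\right)\, \mathbb{1}[\mathrm{class}(X') = y]}
             {\sum_{X' \in N_k(X)} w\!\left(d(X, X')\right)}

If d depends only on which feature positions match, neighbors at the same distance can be grouped by the feature subset they share with X, and the estimate becomes a weighted combination of relative-frequency estimates over feature subsets, i.e., deleted interpolation.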
12. Memory-Based Learning as Deleted Interpolation
- Example: estimate P(eye-color = blue | hair-color = blond)
- We have N = 12 samples of people
- d = 1 or d = 0 (match); weights w(1) = w1, w(0) = w0; K = 12
- This is deleted interpolation where the interpolation weights depend on the counts and weights of the nearest neighbors at all accepted distances (a worked sketch follows)
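A worked sketch of this example in Python, with hypothetical counts (the actual counts on the slide did not survive extraction); it checks that the K = 12 weighted vote equals a deleted interpolation of Phat(blue | blond) and Phat(blue), with an interpolation weight built from the counts and the weights w1, w0:

    # Hypothetical counts for the N = 12 people (not the slide's numbers).
    n_blond, n_blond_blue = 4, 3      # samples with hair-color = blond (distance 0)
    n_other, n_other_blue = 8, 2      # remaining samples (distance 1)
    w1, w0 = 1.0, 0.2                 # weights for distance 0 and distance 1

    # k-NN with K = N = 12: weighted vote for eye-color = blue.
    p_knn = (w1 * n_blond_blue + w0 * n_other_blue) / (w1 * n_blond + w0 * n_other)

    # The same value as deleted interpolation of Phat(blue | blond) and Phat(blue).
    N, n_blue = n_blond + n_other, n_blond_blue + n_other_blue
    lam = (w1 - w0) * n_blond / ((w1 - w0) * n_blond + w0 * N)
    p_di = lam * (n_blond_blue / n_blond) + (1 - lam) * (n_blue / N)
    assert abs(p_knn - p_di) < 1e-12   # the two formulations agree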
13. The Task and Features Used
- Maximum ambiguity: 507; minimum: 2
14. Experiments
- Linear Feature Subsets Order
  - Jelinek-Mercer with fixed weight
  - Witten-Bell with varying d
  - Linear Memory-Based Learning
- Arbitrary Feature Subsets Order
  - Decision Trees
  - Memory-Based Learning
  - Log-linear Models
- Experiments on the connection among the likelihoods and accuracy
15. Experiments: Linear Sequence
- The features 1, 2, ..., 8 ordered by gain ratio: 1, 8, 2, 3, 5, 4, 7, 6
- [Figure: accuracy curves for Jelinek-Mercer with fixed weight and Witten-Bell with varying d; sketches of both schemes follow below]
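A minimal sketch of the two smoothing schemes behind these curves, assuming precomputed training counts are passed in (the count arguments and the `backoff` lower-order estimate are hypothetical names, not from the slides):

    def jelinek_mercer(count_xy, count_x, lam, backoff):
        # Fixed weight: lam * Phat(y | x) + (1 - lam) * lower-order estimate.
        mle = count_xy / count_x if count_x else 0.0
        return lam * mle + (1.0 - lam) * backoff

    def witten_bell(count_xy, count_x, distinct_y_in_x, d, backoff):
        # Varying d: the weight on the higher-order estimate grows with the
        # context count and shrinks with the number of distinct outcomes seen.
        lam = count_x / (count_x + d * distinct_y_in_x) if count_x else 0.0
        mle = count_xy / count_x if count_x else 0.0
        return lam * mle + (1.0 - lam) * backoff

Backing off along the gain-ratio order 1, 8, 2, 3, 5, 4, 7, 6 means the context at level k consists of the first k features in that order.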
16. Experiments: Linear Sequence
- Heavy smoothing is needed for the best results
17. MBL: Linear Subsets Sequence
- Restrict MBL to be an instance of the same linear-subsets-sequence deleted interpolation
- Weighting functions INV3 and INV4 performed best
  - LKNN3 is best at K = 3,000 (accuracy 79.94)
  - LKNN4 is best at K = 15,000 (accuracy 80.18)
  - LKNN4 is the best of all
18. Experiments
- Linear Subsets Feature Order
  - Jelinek-Mercer with fixed weight
  - Witten-Bell with varying d
  - Linear Memory-Based Learning
- Arbitrary Subsets Feature Order
  - Decision Trees
  - Memory-Based Learning
  - Log-linear Models
- Experiments on the connection among the likelihoods and accuracy
19. Model Implementations: Decision Trees (DecTreeWBd)
- n-ary decision trees: if we choose a feature f to split on, all its values form subtrees
- splitting criterion: gain ratio
- final probability estimates at the leaves are Witten-Bell d interpolations of the estimates on the path to the root (a sketch of this leaf estimate follows)
- [Figure: example decision tree splitting on feat1 and then feat2, with values such as HCOMP and NOPTCOMP]
- these are therefore instances of deleted interpolation models!
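A minimal sketch of the DecTreeWBd leaf estimate described above, assuming each node on the root-to-leaf path provides a hypothetical count triple (count of y at the node, total count at the node, number of distinct outcomes at the node):

    def dectree_wbd_estimate(path_counts, d, prior):
        # Interpolate relative-frequency estimates along the root-to-leaf path
        # with Witten-Bell d weights; `prior` is the base (e.g., uniform) estimate.
        estimate = prior
        for c_y, c_total, distinct in path_counts:   # root first, leaf last
            lam = c_total / (c_total + d * distinct) if c_total else 0.0
            mle = c_y / c_total if c_total else 0.0
            estimate = lam * mle + (1.0 - lam) * estimate
        return estimate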
20. Model Implementations: Log-linear Models
- Binary features formed by instantiating templates
- Three models with different allowable features:
  - single attributes only: LogLinSingle
  - pairs of attributes, but only pairs involving the most important feature (the node label): LogLinPairs
  - linear feature subsets, comparable to the previous models: LogLinBackoff
- Gaussian smoothing was used (the implied training objective is sketched below)
- Trained by conjugate gradient (Stanford Classify Package)
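A standard formulation of what Gaussian smoothing plus conjugate-gradient training typically amounts to (the variance \sigma^2 is a hypothetical name for the smoothing parameter; the slides do not give the exact objective):

    P_w(y \mid x) = \frac{\exp\left(\sum_j w_j f_j(x, y)\right)}
                         {\sum_{y'} \exp\left(\sum_j w_j f_j(x, y')\right)},
    \qquad
    \max_w \; \sum_i \log P_w(y_i \mid x_i) \; - \; \sum_j \frac{w_j^2}{2\sigma^2}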
21. Model Implementations: Memory-Based Learning
- Weighting functions INV3 and INV4
- KNN4 is better than DecTreeWBd and the Log-linear models
- KNN4 achieves a 5.8% error reduction over WBd (significant at the 0.01 level)
22. Accuracy Curves for MBL and Decision Trees
23. Experiments
- Linear Subsets Feature Order
  - Jelinek-Mercer with fixed weight
  - Witten-Bell with varying d
  - Linear Memory-Based Learning
- Arbitrary Subsets Feature Order
  - Decision Trees
  - Memory-Based Learning
  - Log-linear Models
- Experiments on the connection among the likelihoods and accuracy
24. Joint Likelihood, Conditional Likelihood, and Classification Accuracy
- Our aim is to maximize parsing accuracy, but:
  - smoothing parameters are usually fit on held-out data to maximize joint likelihood
  - sometimes conditional likelihood is optimized
- We look at the relationship among the maxima of these three scoring functions, depending on the amount of smoothing (see the sketch below), finding that:
  - much heavier smoothing is needed to maximize accuracy than joint likelihood
  - conditional likelihood also increases with smoothing, even long after the maximum for joint likelihood
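A minimal sketch of the experiment behind this comparison, assuming hypothetical model-access functions `joint_prob`, `cond_prob`, and `predict` that take a smoothing value; it tabulates held-out joint log-likelihood, conditional log-likelihood, and accuracy so their maxima over the smoothing grid can be compared:

    import math

    def evaluate_smoothing(smoothing_grid, heldout, joint_prob, cond_prob, predict):
        rows = []
        for s in smoothing_grid:
            joint = sum(math.log(joint_prob(x, y, s)) for x, y in heldout)
            cond = sum(math.log(cond_prob(y, x, s)) for x, y in heldout)
            acc = sum(predict(x, s) == y for x, y in heldout) / len(heldout)
            rows.append((s, joint, cond, acc))
        return rows

    # Finding on the following slides: accuracy (and conditional likelihood)
    # peak at much heavier smoothing than joint likelihood does.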
25. Test Set Performance versus Amount of Smoothing - I
26. Test Set Performance versus Amount of Smoothing
27. Test Set Performance versus Amount of Smoothing: PP Attachment
- [Figure: Witten-Bell with varying d]
28. Summary
- The problem of effectively estimating local probability distributions for compound decision models used for classification is under-explored
- We showed that the chosen local distribution model matters
- We showed the relationship between MBL and deleted interpolation models
- MBL with a large number of neighbors and appropriate weighting outperformed more expensive and popular algorithms: Decision Trees and Log-linear Models
- Fitting a small number of smoothing parameters to maximize classification accuracy is promising for improving performance
29. Future Work
- Compare MBL to other state-of-the-art smoothing methods
- Better ways of fitting MBL weight functions
- Theoretical investigation of bias-variance tradeoffs for compound decision systems with strong independence assumptions