Optimizing Local Probability Models for Statistical Parsing

Transcript and Presenter's Notes
1
Optimizing Local Probability Models for
Statistical Parsing
  • Kristina Toutanova, Mark Mitchell, Christopher
    Manning
  • Computer Science Department
  • Stanford University

2
Highlights
  • Choosing a local probability model
    P(expansion(n) | history(n)) for statistical
    parsing: a comparison of commonly used models
  • A new player: memory-based models and their
    relation to interpolated models
  • Joint likelihood, conditional likelihood, and
    classification accuracy for models of this form

3
Motivation
  • Many problems in natural language processing are
    disambiguation problems
  • word senses
  • jaguar: a big cat, a car, the name of a Java
    package
  • line: a phone line, a queue, a line in
    mathematics, an airline, etc.
  • part-of-speech tags (noun, verb, proper noun,
    etc.)
  • Example: "Joy makes progress every day." Each
    word admits several candidate tags, e.g. Joy: NN
    or NNP; makes: VBZ or NNS; progress: NN or VB;
    every: DT; day: NN
4
Parsing as Classification
  • I would like to meet with you again on Monday

Input: a sentence
Classify it to one of the possible parses
5
Motivation: Classification Problems
  • There are two major differences from typical ML
    domains
  • The number of classes can be very large or even
    infinite; the set of available classes for an
    input varies (and depends on a grammar)
  • Data is usually very sparse and the number of
    possible features is large (e.g. words)

6
Solutions
  • The possible parse trees are broken down into
    small pieces, defining features
  • features are now functions of input and class,
    not of the input only
  • Discriminative or generative models are built
    using these features
  • we concentrate on generative models here: when a
    huge number of analyses are possible, they are
    the only practical ones

7
History-Based Generative Parsing Models
  • Example sentence: "Tuesday Marks bought Brooks."
    [parse tree figure, rooted at TOP, not reproduced]
  • The generative models
  • learn a distribution P(S,T) over <sentence, parse
    tree> pairs
  • select the single most likely parse based on
    P(T|S), which is proportional to P(S,T)
8
Factors in the Performance of Generative
History-Based Models
  • The chosen decomposition of parse tree
    generation, including the representation of parse
    tree nodes and the independence assumptions
  • The model family chosen for representing local
    probability distributions
  • Decision Trees, Naïve Bayes, Log-linear Models
  • The optimization method for fitting major and
    smoothing parameters
  • Maximum likelihood, maximum conditional
    likelihood, minimum error rate, etc.

9
Previous Studies and This Work
  • The influence of the previous three factors has
    not been isolated in previous work: authors
    presented specific choices for all components,
    and the importance of each was unclear.
  • We assume the generative history-based model and
    set of features (the representation of parse tree
    nodes) are fixed and we study carefully the other
    two factors.

10
Deleted Interpolation
Estimating the probability P(y|X) by interpolating
relative frequency estimates for lower-order
distributions (a reconstruction of the formula
follows below)
Most commonly used: linear feature-subsets order -
Jelinek-Mercer with fixed weight, Witten-Bell with
varying d, Decision Trees with path interpolation,
Memory-Based Learning
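The slide's formula is an image that did not survive extraction; a standard statement of deleted interpolation consistent with the description, with the relative-frequency estimates written as P-hat and the lambda_i smoothing weights summing to one, is:

```latex
P(y \mid x_1,\ldots,x_n) \;=\;
    \lambda_n \,\hat{P}(y \mid x_1,\ldots,x_n)
  + \lambda_{n-1}\,\hat{P}(y \mid x_1,\ldots,x_{n-1})
  + \cdots
  + \lambda_0 \,\hat{P}(y),
\qquad \sum_{i=0}^{n} \lambda_i = 1
```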
11
Memory-Based Learning as Deleted Interpolation
  • In k-NN, the probability of a class given the
    features is estimated from the distance-weighted
    votes of the K nearest neighbors (see the sketch
    below)
  • If the distance function depends only on the
    positions of the matching features, it is a
    case of deleted interpolation

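A minimal Python sketch of this estimate, assuming an overlap distance that counts mismatching feature positions and some inverse-distance weighting function w (the slides' INV3 and INV4 are weighting functions of this general kind; their exact forms are not reproduced here):

```python
from collections import Counter

def overlap_distance(a, b):
    """Number of feature positions where a and b disagree."""
    return sum(x != y for x, y in zip(a, b))

def knn_class_prob(query, memory, K, w):
    """Estimate P(class | features) by distance-weighted k-NN voting.

    memory: list of (feature_tuple, cls) training pairs.
    w: weight as a function of distance, e.g. lambda d: 1.0 / (1 + d) ** 3.
    """
    by_distance = sorted(((overlap_distance(query, f), c) for f, c in memory),
                         key=lambda dc: dc[0])
    votes = Counter()
    for d, cls in by_distance[:K]:
        votes[cls] += w(d)
    total = sum(votes.values())
    return {cls: v / total for cls, v in votes.items()}
```

Because the distance depends only on which feature positions match, all neighbors at a given distance share one weight, which is what lets the estimate be rearranged into a deleted interpolation.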
12
Memory-Based Learning as Deleted Interpolation
  • P(eye-color=blue | hair-color=blond)
  • We have N=12 samples of people
  • d=1 or d=0 (match), w(1)=w1, w(0)=w0, K=12
  • This is deleted interpolation where the
    interpolation weights depend on the counts and
    weights of the nearest neighbors at all accepted
    distances

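A toy numeric version of this example; the 12 samples and the weights w0 and w1 below are made-up illustrative values, not the slide's actual data:

```python
# Toy version of the slide's example: estimate
# P(eye-color=blue | hair-color=blond) with K = N = 12 neighbors,
# distance d=0 for samples matching on hair-color and d=1 otherwise.
samples = ([("blond", "blue")] * 4 + [("blond", "brown")] * 2 +
           [("red", "blue")] * 1 + [("brown", "brown")] * 5)
w0, w1 = 4.0, 1.0      # illustrative weights w(0) and w(1)
query = "blond"

# Weighted k-NN estimate over all 12 neighbors.
num = sum((w0 if h == query else w1) for h, e in samples if e == "blue")
den = sum((w0 if h == query else w1) for h, e in samples)
print(num / den)                                   # ~0.5667

# The same value as a two-level deleted interpolation whose weights
# depend on the counts and weights at each accepted distance.
n0 = sum(h == query for h, _ in samples)           # neighbors at d=0
n1 = len(samples) - n0                             # neighbors at d=1
lam = w0 * n0 / (w0 * n0 + w1 * n1)
p0 = sum(e == "blue" for h, e in samples if h == query) / n0
p1 = sum(e == "blue" for h, e in samples if h != query) / n1
print(lam * p0 + (1 - lam) * p1)                   # also ~0.5667
```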
13
The Task and Features Used
Maximum ambiguity: 507; minimum: 2
14
Experiments
  • Linear Feature Subsets Order
  • Jelinek-Mercer with fixed weight
  • Witten-Bell with varying d
  • Linear Memory-Based Learning
  • Arbitrary Feature Subsets Order
  • Decision Trees
  • Memory-Based Learning
  • Log-linear Models
  • Experiments on the connection among likelihoods
    and accuracy

15
Experiments: Linear Sequence
The features 1,2,...,8 ordered by gain ratio:
1,8,2,3,5,4,7,6
[Accuracy curves: Jelinek-Mercer with fixed weight
and Witten-Bell with varying d]
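A sketch of the two schemes under comparison, on one reading of them; `counts` is assumed to map a feature-prefix tuple to the outcome counts observed with that prefix, and the exact variants used in the paper may differ in detail:

```python
from collections import Counter, defaultdict

def rel_freq(counts, context, y):
    """Relative-frequency estimate P^(y | context)."""
    c = counts[context]
    total = sum(c.values())
    return c[y] / total if total else 0.0

def jelinek_mercer(counts, xs, y, lam, p_base):
    """Fixed-weight interpolation of estimates of increasing order."""
    p = p_base                          # e.g. a uniform distribution
    for k in range(1, len(xs) + 1):
        p = lam * rel_freq(counts, xs[:k], y) + (1 - lam) * p
    return p

def witten_bell(counts, xs, y, d, p_base):
    """Witten-Bell with parameter d: each order's weight grows with its
    context count and shrinks with the diversity of outcomes seen there."""
    p = p_base
    for k in range(1, len(xs) + 1):
        c = counts[xs[:k]]
        n, diversity = sum(c.values()), len(c)
        lam = n / (n + d * diversity) if n else 0.0
        p = lam * rel_freq(counts, xs[:k], y) + (1 - lam) * p
    return p

# counts can be built as: counts = defaultdict(Counter), then for each
# training pair (xs, y): for k in range(len(xs) + 1): counts[xs[:k]][y] += 1
```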
16
Experiments: Linear Sequence
Heavy smoothing is needed for the best results
17
MBL: Linear Subsets Sequence
  • Restrict MBL to be an instance of the same
    linear-subsets-sequence deleted interpolation
  • Weighting functions INV3 and INV4 performed best

LKNN3 is best at K=3,000 (79.94); LKNN4 is best at
K=15,000 (80.18); LKNN4 is the best of all
18
Experiments
  • Linear Subsets Feature Order
  • Jelinek-Mercer with fixed weight
  • Witten-Bell with varying d
  • Linear Memory-Based Learning
  • Arbitrary Subsets Feature Order
  • Decision Trees
  • Memory-Based Learning
  • Log-linear Models
  • Experiments on the connection among likelihoods
    and accuracy

19
Model Implementations: Decision Trees
  • (DecTreeWBd)
  • n-ary decision trees: if we choose a feature f to
    split on, all its values form subtrees
  • splitting criterion: gain ratio
  • the final probability estimates at the leaves are
    Witten-Bell d interpolations of the estimates on
    the path to the root

[Decision tree figure: splits on feat1 and feat2;
leaf node labels include HCOMP and NOPTCOMP]
  • these trees are thus also instances of deleted
    interpolation models!
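A sketch of that leaf estimate, assuming each node on the root-to-leaf path keeps the outcome counts of the training instances that reached it (the same Witten-Bell recurrence as in the linear-sequence sketch above):

```python
def leaf_probability(path_counts, y, d, p_base):
    """P(y) at a decision-tree leaf, obtained by Witten-Bell-d interpolation
    of relative-frequency estimates along the root-to-leaf path.

    path_counts: list of {outcome: count} dicts, root first, leaf last.
    """
    p = p_base                                   # e.g. a uniform distribution
    for c in path_counts:                        # interpolate root -> leaf
        n, diversity = sum(c.values()), len(c)
        lam = n / (n + d * diversity) if n else 0.0
        p_hat = c.get(y, 0) / n if n else 0.0
        p = lam * p_hat + (1 - lam) * p
    return p
```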
20
Model Implementations: Log-linear Models
  • Binary features formed by instantiating templates
  • Three models with different allowable features
  • Single attributes only: LogLinSingle
  • Pairs of attributes, only pairs involving the
    most important feature (node label): LogLinPairs
  • Linear feature subsets comparable to the previous
    models: LogLinBackoff
  • Gaussian smoothing was used
  • Trained by Conjugate Gradient (Stanford Classify
    Package)

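A minimal sketch of the objective such models optimize: conditional log-likelihood with a Gaussian (L2) penalty on the weights. The set-of-active-feature-indices representation and all names below are illustrative assumptions, not the paper's code; the actual training used conjugate gradient in the Stanford classify package:

```python
import math

def log_linear_probs(weights, feature_sets):
    """P(class | input) for a log-linear model: each candidate class is
    represented by the set of binary feature indices it activates."""
    scores = [sum(weights[f] for f in fs) for fs in feature_sets]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]      # subtract max for stability
    z = sum(exps)
    return [e / z for e in exps]

def penalized_neg_log_likelihood(weights, data, sigma2):
    """Quantity minimized in training: -log P(correct classes) plus a
    Gaussian prior term sum(w^2) / (2 * sigma^2)."""
    nll = 0.0
    for feature_sets, gold in data:               # gold = index of true class
        probs = log_linear_probs(weights, feature_sets)
        nll -= math.log(probs[gold])
    return nll + sum(w * w for w in weights) / (2.0 * sigma2)
```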
21
Model Implementations: Memory-Based Learning
  • Weighting functions INV3 and INV4
  • KNN4 is better than DecTreeWBd and the log-linear
    models
  • KNN4 gives a 5.8% error reduction over WBd
    (significant at the 0.01 level)

22
Accuracy Curves for MBL and Decision Trees
23
Experiments
  • Linear Subsets Feature Order
  • Jelinek-Mercer with fixed weight
  • Witten-Bell with varying d
  • Linear Memory-Based Learning
  • Arbitrary Subsets Feature Order
  • Decision Trees
  • Memory-Based Learning
  • Log-linear Models
  • Experiments on the connection among likelihoods
    and accuracy

24
Joint Likelihood, Conditional Likelihood, and
Classification Accuracy
  • Our aim is to maximize parsing accuracy, but
  • Smoothing parameters are usually fit on held-out
    data to maximize joint likelihood
  • Sometimes conditional likelihood is optimized
  • We look at the relationship among the maxima of
    these three scoring functions, depending on the
    amount of smoothing, finding that
  • Much heavier smoothing is needed to maximize
    accuracy than joint likelihood
  • Conditional likelihood also increases with
    smoothing, even long past the maximum of the
    joint likelihood

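A sketch of how one could score a single smoothing setting on held-out data by all three criteria; `joint_prob` and `cond_probs` are assumed callbacks into the fitted model (illustrative names, not the paper's code):

```python
import math

def evaluate(joint_prob, cond_probs, heldout):
    """Score one smoothing setting on held-out data by the three criteria.

    joint_prob(x, y): assumed callback -> P(x, y) under the generative model.
    cond_probs(x):    assumed callback -> list of P(class | x) over candidates.
    heldout:          list of (x, gold_index) pairs.
    """
    joint_ll, cond_ll, correct = 0.0, 0.0, 0
    for x, gold in heldout:
        joint_ll += math.log(joint_prob(x, gold))
        probs = cond_probs(x)
        cond_ll += math.log(probs[gold])
        predicted = max(range(len(probs)), key=probs.__getitem__)
        correct += int(predicted == gold)
    return joint_ll, cond_ll, correct / len(heldout)

# Sweeping the smoothing parameter and comparing where each criterion
# peaks reproduces the comparison the slides describe: accuracy peaks at
# much heavier smoothing than joint likelihood.
```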
25
Test Set Performance versus Amount of Smoothing - I
26
Test Set Performance versus Amount of Smoothing
27
Test Set Performance versus Amount of Smoothing:
PP Attachment (Witten-Bell with varying d)
28
Summary
  • The problem of effectively estimating local
    probability distributions for compound decision
    models used for classification is under-explored
  • We showed that the chosen local distribution
    model matters
  • We showed the relationship between MBL and
    deleted interpolation models
  • MBL with large numbers of neighbors and
    appropriate weighting outperformed more expensive
    and popular algorithms: Decision Trees and
    Log-linear Models
  • Fitting a small number of smoothing parameters to
    maximize classification accuracy is promising for
    improving performance

29
Future Work
  • Compare MBL to other state-of-the-art smoothing
    methods
  • Better ways of fitting MBL weight functions
  • Theoretical investigation of bias-variance
    tradeoffs for compound decision systems with
    strong independence assumptions