Chapter 6. Statistical Inference: n-gram Models over Sparse Data

1
Chapter 6. Statistical Inference: n-gram Models
over Sparse Data
Foundations of Statistical Natural Language
Processing
  • 2005. 1. 13
  • huni77@pusan.ac.kr

2
Table of Contents
  • Introduction
  • Bins: Forming Equivalence Classes
  • Reliability vs. Discrimination
  • N-gram models
  • Statistical Estimators
  • Maximum Likelihood Estimation (MLE)
  • Laplace's law, Lidstone's law and the
    Jeffreys-Perks law
  • Held out estimation
  • Cross-validation (deleted estimation)
  • Good-Turing estimation
  • Combining Estimators
  • Simple linear interpolation
  • Katz's backing-off
  • General linear interpolation
  • Conclusions

3
Introduction
  • Object of statistical NLP
  • Perform statistical inference for natural
    language.
  • Statistical inference in general consists of
  • Taking some data generated by an unknown
    probability distribution.
  • Making some inferences about this distribution.
  • The problem divides into three areas
  • Dividing the training data into equivalence
    classes.
  • Finding a good statistical estimator for each
    equivalence class.
  • Combining multiple estimators.

4
Bins: Forming Equivalence Classes (1/2)
  • Reliability vs. discrimination
  • "large green ___________"
  • tree? mountain? frog? car?
  • "swallowed the large green ________"
  • pill? broccoli?
  • Larger n: more information about the context of
    the specific instance (greater discrimination)
  • Smaller n: more instances in training data,
    better statistical estimates (more reliability),
    as the sketch below illustrates
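
To make the trade-off concrete, here is a minimal sketch (the toy
corpus and the helper name are mine, not from the slides): as n grows,
the same text yields more distinct n-grams with fewer instances each,
so every individual estimate rests on less evidence.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = ("the large green tree shaded the large green car "
          "near the large green hill").split()
for n in (1, 2, 3, 4):
    counts = ngram_counts(tokens, n)
    avg = sum(counts.values()) / len(counts)
    print(f"n={n}: {len(counts)} distinct n-grams, "
          f"{avg:.2f} instances each on average")
```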

5
Bins: Forming Equivalence Classes (2/2)
  • N-gram models
  • n-gram: a sequence of n words
  • Used for predicting the next word
  • Markov assumption
  • Only the prior local context (the last few
    words) affects the next word.
  • Selecting n: vocabulary size of 20,000 words
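
A minimal sketch of an n-gram model under the Markov assumption (the
corpus and function name are illustrative, not from the slides): the
counts are keyed only on the last n-1 words of history.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n):
    """Markov assumption: only the last n-1 words (the history)
    affect the next word, so counts are keyed on that history."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        history = tuple(tokens[i:i + n - 1])
        model[history][tokens[i + n - 1]] += 1
    return model

model = train_ngram("swallowed the large green pill".split(), 3)
print(model[("large", "green")])   # Counter({'pill': 1})
```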

6
Statistical Estimators (1/3)
  • Given the observed training data, how do you
    develop a model (probability distribution) to
    predict future events?
  • Probability estimate of the target feature
  • Estimating the unknown probability distribution
    of n-grams.

7
Statistical Estimators (2/3)
  • Notation for the statistical estimation chapter
    (following Manning & Schütze)
  • N: number of training instances
  • B: number of bins (equivalence classes)
  • w_1…w_n: an n-gram
  • C(w_1…w_n): frequency of w_1…w_n in training data
  • r: frequency of an n-gram
  • N_r: number of bins with exactly r training
    instances
  • T_r: total count, in further data, of n-grams
    seen r times in training

8
Statistical Estimators (3/3)
  • Example: instances in the training corpus of
    "inferior to ________"
9
Maximum Likelihood Estimation (MLE) (1/2)
  • Definition
  • Use the relative frequency as the probability
    estimate.
  • Example
  • In the corpus, we find 10 training instances of
    "comes across"
  • 8 times followed by "as": P(as) = 0.8
  • Once each by "more" and "a": P(more) = 0.1,
    P(a) = 0.1
  • Any word x other than the above 3: P(x) = 0.0
  • Formula (sketched in code below)
  • P_MLE(w_1…w_n) = C(w_1…w_n) / N
  • P_MLE(w_n | w_1…w_{n-1}) = C(w_1…w_n) / C(w_1…w_{n-1})
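
A small sketch of the MLE estimate using the slide's "comes across"
counts (the helper name is mine):

```python
from collections import Counter

# Counts of words following "comes across", from the example above.
follow = Counter({"as": 8, "more": 1, "a": 1})

def p_mle(counts, word):
    """MLE: the relative frequency; unseen events get probability 0."""
    return counts[word] / sum(counts.values())

print(p_mle(follow, "as"))    # 0.8
print(p_mle(follow, "more"))  # 0.1
print(p_mle(follow, "the"))   # 0.0 -- MLE reserves no mass for unseen words
```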

10
Maximum Likelihood Estimation (MLE) (2/2)
11
Laplace's law, Lidstone's law and the
Jeffreys-Perks law (1/2)
  • Laplace's law
  • Add a little bit of probability space to unseen
    events:
  • P_Lap(w_1…w_n) = (C(w_1…w_n) + 1) / (N + B)
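
A sketch of Laplace's law on the same "comes across" counts; the
function name is mine, and B = 20,000 reuses the vocabulary size from
slide 5 as the number of bins.

```python
def p_laplace(count, n, bins):
    """Laplace's law: add 1 to every count, so each of the `bins`
    possible events keeps a little probability space."""
    return (count + 1) / (n + bins)

# 10 instances of "comes across", vocabulary of B = 20,000 words.
print(p_laplace(8, 10, 20_000))  # P(as): ~0.00045, far below the MLE 0.8
print(p_laplace(0, 10, 20_000))  # unseen word: small but nonzero
```

Note how much probability mass moves to unseen events when B is large
relative to N; this is the law's known weakness.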

12
Laplace's law, Lidstone's law and the
Jeffreys-Perks law (2/2)
  • Lidstone's law
  • Add some positive value λ to each count:
  • P_Lid(w_1…w_n) = (C(w_1…w_n) + λ) / (N + Bλ)
  • Jeffreys-Perks law
  • λ = 0.5
  • Called ELE (Expected Likelihood Estimation)
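
The same sketch, parametrised by λ (function name mine); λ = 1
recovers Laplace's law and λ = 0.5 gives the Jeffreys-Perks law:

```python
def p_lidstone(count, n, bins, lam=0.5):
    """Lidstone's law: add lambda to each count. lambda = 1 recovers
    Laplace's law; lambda = 0.5 is the Jeffreys-Perks law (ELE)."""
    return (count + lam) / (n + bins * lam)

print(p_lidstone(8, 10, 20_000))       # ELE estimate of P(as)
print(p_lidstone(8, 10, 20_000, 1.0))  # same as Laplace's law
```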

13
Held-out estimation
  • Validate by holding out part of the training
    data.
  • C_1(w_1…w_n): frequency of w_1…w_n in training
    data
  • C_2(w_1…w_n): frequency of w_1…w_n in held-out
    data
  • T: number of tokens in held-out data
  • P_ho(w_1…w_n) = T_r / (N_r · T), where
    r = C_1(w_1…w_n) and T_r is the total held-out
    count of all n-grams seen r times in training
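
A sketch of held-out estimation under the definitions above (function
names are mine; the r = 0 frequency class is omitted for brevity):

```python
from collections import Counter

def held_out(train_counts, heldout_counts, t):
    """Held-out estimation: every n-gram seen r times in training gets
    the same probability T_r / (N_r * T), where T_r pools the held-out
    counts of that whole frequency class."""
    t_r, n_r = Counter(), Counter()
    for gram, r in train_counts.items():
        t_r[r] += heldout_counts[gram]   # T_r: held-out mass of class r
        n_r[r] += 1                      # N_r: size of class r
    def prob(gram):
        r = train_counts[gram]
        return t_r[r] / (n_r[r] * t)
    return prob
```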

14
Cross-validation (deleted estimation) (1/2)
  • Use data for both training and validation
  • Divide the training data into 2 parts
  • Train on A, validate on B
  • Train on B, validate on A
  • Combine the two models
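
A rough sketch of the combination step (simplified: the frequency
class of a queried n-gram is taken from part A only, whereas Manning &
Schütze's formulation is symmetric; function names are mine):

```python
from collections import Counter

def deleted_estimation(counts_a, counts_b, n_total):
    """Deleted estimation: run held-out estimation in both directions
    (train on A / validate on B, and vice versa) and pool the T_r and
    N_r statistics of each frequency class r before dividing."""
    t_ab, n_a, t_ba, n_b = Counter(), Counter(), Counter(), Counter()
    for gram, r in counts_a.items():
        t_ab[r] += counts_b[gram]
        n_a[r] += 1
    for gram, r in counts_b.items():
        t_ba[r] += counts_a[gram]
        n_b[r] += 1
    def prob(gram):
        r = counts_a[gram]  # simplification, see note above
        return (t_ab[r] + t_ba[r]) / (n_total * (n_a[r] + n_b[r]))
    return prob
```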

15
Cross-validation (deleted estimation) (2/2)
  • Cross-validation: the training data is used both as
  • initial training data
  • held-out data
  • On large training corpora, deleted estimation
    works better than held-out estimation

16
Good-Turing estimation
  • Suitable for a large number of observations
    from a large vocabulary
  • Works well for n-grams
  • r* = (r + 1) · E(N_{r+1}) / E(N_r)
  • P_GT = r* / N
  • ( r* is an adjusted frequency )
  • ( E denotes the expectation of a random
    variable; N_r is the number of n-grams that
    occur exactly r times in the training data )
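
A minimal sketch of the Good-Turing adjustment (using the raw N_r
counts as the estimate of E(N_r); real implementations smooth the N_r
sequence first, and split the unseen mass among the unseen n-grams):

```python
from collections import Counter

def good_turing(counts):
    """Good-Turing: adjusted frequency r* = (r + 1) * N_{r+1} / N_r,
    where N_r is the number of distinct n-grams seen exactly r times."""
    n_r = Counter(counts.values())
    n = sum(counts.values())
    def prob(gram):
        r = counts.get(gram, 0)
        if r == 0:
            return n_r[1] / n    # total mass left for ALL unseen n-grams
        if n_r[r + 1] == 0:
            return r / n         # empty class above r: fall back to MLE
        return (r + 1) * n_r[r + 1] / n_r[r] / n
    return prob
```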
17
Combining Estimators (1/3)
  • Basic idea
  • Consider how to combine multiple probability
    estimates from various different models
  • How can you develop a model that uses
    different-length n-grams as appropriate?
  • Simple linear interpolation
  • A weighted combination of trigram, bigram and
    unigram estimates:
  • P_li(w_n | w_{n-2}, w_{n-1}) = λ_1 P_1(w_n)
    + λ_2 P_2(w_n | w_{n-1})
    + λ_3 P_3(w_n | w_{n-2}, w_{n-1}),
    where 0 ≤ λ_i ≤ 1 and Σ_i λ_i = 1
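
A sketch of the interpolated model; the λ values here are illustrative
(in practice they are tuned on held-out data), and the component
estimators are passed in as plain functions:

```python
def interpolate(p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    """Simple linear interpolation: a fixed weighted sum of unigram,
    bigram and trigram estimates, with the lambdas summing to 1."""
    l1, l2, l3 = lambdas
    def prob(w3, w2, w1):
        return (l1 * p_uni(w3)
                + l2 * p_bi(w3, w2)
                + l3 * p_tri(w3, w2, w1))
    return prob
```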

18
Combining Estimators (2/3)
  • Katz's backing-off (sketched below)
  • Used to smooth or to combine information sources
  • If the n-gram appeared more than k times: use
    the n-gram estimate
  • If it appeared k or fewer times: back off to the
    estimate from a shorter n-gram
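
A simplified sketch of the back-off decision (discounting and the
proper normalisation of the back-off weight alpha are omitted; counts
is assumed to hold n-gram counts of every order, keyed by tuple):

```python
def p_backoff(history, word, counts, k=0, alpha=0.4):
    """Katz-style back-off (sketch): if the full n-gram was seen more
    than k times, use its (here undiscounted) estimate; otherwise back
    off, weighted by alpha, to the estimate with a shorter history."""
    gram = tuple(history) + (word,)
    if counts.get(gram, 0) > k:
        context_total = sum(c for g, c in counts.items()
                            if g[:-1] == tuple(history))
        return counts[gram] / context_total
    if not history:                    # nothing left to back off to
        return 1e-7                    # tiny floor for unseen words (sketch)
    return alpha * p_backoff(history[1:], word, counts, k, alpha)
```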

19
Combining Estimators (3/3)
  • General linear interpolation
  • The weight is a function of the history
  • A very general way to combine models (commonly
    used)
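
A sketch of the generalisation (names mine): instead of constant
lambdas, a weight function inspects the history, so e.g. histories
with many training instances can lean harder on the higher-order
models.

```python
def general_interpolation(models, weight_fn):
    """General linear interpolation: weight_fn(history) returns one
    non-negative weight per model, summing to 1; the weights therefore
    vary with the history rather than being constants."""
    def prob(word, history):
        weights = weight_fn(history)
        return sum(w * m(word, history) for w, m in zip(weights, models))
    return prob
```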

20
Conclusions
  • The problem of sparse data
  • Addressed by Good-Turing, linear interpolation,
    or back-off
  • Good-Turing smoothing performs well
  • Church & Gale (1991)
  • Active research
  • Combining probability models
  • Dealing with sparse data