Title: Chapter 6. Statistical Inference: n-gram Models over Sparse Data
1. Chapter 6. Statistical Inference: n-gram Models over Sparse Data
Foundations of Statistical Natural Language Processing
- 2005. 1. 13
- huni77_at_pusan.ac.kr
2. Table of Contents
- Introduction
- Bins Forming Equivalence Classes
- Reliability vs. Discrimination
- N-gram models
- Statistical Estimators
- Maximum Likelihood Estimation (MLE)
- Laplace's law, Lidstone's law and the Jeffreys-Perks law
- Held out estimation
- Cross-validation (deleted estimation)
- Good-Turing estimation
- Combining Estimators
- Simple linear interpolation
- Katz's backing-off
- General linear interpolation
- Conclusions
3. Introduction
- Object of Statistical NLP
- Do statistical inference for the field of natural language.
- Statistical inference in general consists of
- Taking some data generated by an unknown probability distribution.
- Making some inferences about this distribution.
- Divides the problem into three areas
- Dividing the training data into equivalence classes.
- Finding a good statistical estimator for each equivalence class.
- Combining multiple estimators.
4. Bins Forming Equivalence Classes (1/2)
- Reliability vs. Discrimination
- large green ___________
- tree? mountain? frog? car?
- swallowed the large green ________
- pill? broccoli?
- Larger n: more information about the context of the specific instance (greater discrimination).
- Smaller n: more instances in the training data, better statistical estimates (more reliability).
5. Bins Forming Equivalence Classes (2/2)
- N-gram models
- n-gram: a sequence of n words
- predicting the next word
- Markov assumption
- Only the prior local context, the last few words, affects the next word.
- Selecting an n: with a vocabulary of 20,000 words, the number of model parameters grows rapidly with n (see the sketch below).
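- A minimal sketch (not from the slides) of why larger n leads to sparse data: with the slide's assumed vocabulary of 20,000 words, count the free parameters an n-gram model needs.

    V = 20_000  # assumed vocabulary size from the slide
    for n in (2, 3, 4):
        # an n-gram model conditions on n-1 previous words and predicts one of V words
        params = V ** (n - 1) * (V - 1)
        print(f"{n}-gram model: about {params:.2e} free parameters")
    # 2-gram ~ 4e8, 3-gram ~ 8e12, 4-gram ~ 1.6e17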
6. Statistical Estimators (1/3)
- Given the observed training data.
- How do you develop a model (probability distribution) to predict future events?
- Probability estimate of the target feature: P(w_n | w_1 ... w_{n-1})
- Estimating the unknown probability distribution of n-grams.
7. Statistical Estimators (2/3)
- Notation for the statistical estimation chapter:
- N: number of training instances
- B: number of bins (equivalence classes) into which the training instances are divided
- C(w_1 ... w_n): frequency of the n-gram w_1 ... w_n in the training text
- r: frequency of an n-gram
- N_r: number of bins that contain exactly r training instances
8. Statistical Estimators (3/3)
- Example: instances in the training corpus of 'inferior to ________'
9. Maximum Likelihood Estimation (MLE) (1/2)
- Definition
- Using the relative frequency as a probability estimate.
- Example
- In the corpus, we find 10 training instances of the phrase 'comes across'.
- 8 times it was followed by 'as': P(as) = 0.8
- Once by 'more' and once by 'a': P(more) = 0.1, P(a) = 0.1
- For any word x not among these three: P(x) = 0.0
- Formula
- P_MLE(w_1 ... w_n) = C(w_1 ... w_n) / N
- P_MLE(w_n | w_1 ... w_{n-1}) = C(w_1 ... w_n) / C(w_1 ... w_{n-1})
10. Maximum Likelihood Estimation (MLE) (2/2)
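- A minimal sketch of the MLE estimate using the 'comes across' counts from the previous slide; the counts in the Counter are taken from that example.

    from collections import Counter

    # words observed after the 10 training instances of "comes across"
    follow_counts = Counter({"as": 8, "more": 1, "a": 1})
    N = sum(follow_counts.values())  # 10

    def p_mle(word):
        # relative frequency: C(comes across, word) / C(comes across)
        return follow_counts[word] / N

    print(p_mle("as"))    # 0.8
    print(p_mle("more"))  # 0.1
    print(p_mle("the"))   # 0.0 -- unseen continuations get zero probability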
11. Laplace's law, Lidstone's law and the Jeffreys-Perks law (1/2)
- Laplace's law
- Adds a little bit of probability space to unseen events.
- P_Lap(w_1 ... w_n) = (C(w_1 ... w_n) + 1) / (N + B)
12. Laplace's law, Lidstone's law and the Jeffreys-Perks law (2/2)
- Lidstone's law and the Jeffreys-Perks law
- Lidstone's law
- Adds some positive value λ instead of 1:
- P_Lid(w_1 ... w_n) = (C(w_1 ... w_n) + λ) / (N + Bλ)
- Jeffreys-Perks law
- λ = 0.5
- Called ELE (Expected Likelihood Estimation)
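- A minimal sketch of Lidstone (add-λ) smoothing over the same hypothetical 'comes across' counts; B is assumed to equal the vocabulary size of 20,000.

    from collections import Counter

    follow_counts = Counter({"as": 8, "more": 1, "a": 1})
    N = sum(follow_counts.values())
    B = 20_000  # assumed number of bins (possible continuations)

    def p_lidstone(word, lam=0.5):
        # lam = 1.0 gives Laplace's law; lam = 0.5 gives the Jeffreys-Perks law (ELE)
        return (follow_counts[word] + lam) / (N + B * lam)

    print(p_lidstone("as"))   # seen event: pulled far below the MLE value of 0.8
    print(p_lidstone("the"))  # unseen event: small but non-zero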
13. Held out estimation
- Validate by holding out part of the training data.
- C_1(w_1 ... w_n): frequency of w_1 ... w_n in the training data
- C_2(w_1 ... w_n): frequency of w_1 ... w_n in the held out data
- T: number of tokens in the held out data
- N_r: number of n-grams with frequency r in the training data; T_r: total frequency in the held out data of all n-grams that appear r times in the training data
- P_ho(w_1 ... w_n) = T_r / (N_r * T), where r = C_1(w_1 ... w_n)
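- A minimal sketch of held out estimation on assumed toy bigram data; only bigrams seen in training (r > 0) are handled, since estimating N_0 for unseen bins is omitted here.

    from collections import Counter

    def bigrams(tokens):
        return list(zip(tokens, tokens[1:]))

    train = "the cat sat on the mat the cat ran".split()
    heldout = "the dog sat on the mat the dog ran".split()

    c1 = Counter(bigrams(train))    # C1: frequencies in training data
    c2 = Counter(bigrams(heldout))  # C2: frequencies in held out data
    T = len(bigrams(heldout))       # number of bigram tokens in held out data

    # N_r: number of bigram types with training frequency r
    # T_r: their total count in the held out data
    N_r, T_r = Counter(), Counter()
    for bg, r in c1.items():
        N_r[r] += 1
        T_r[r] += c2[bg]

    def p_ho(bigram):
        r = c1[bigram]
        return T_r[r] / (N_r[r] * T)

    print(p_ho(("sat", "on")))  # shares the held out mass of all frequency-1 bigrams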
14. Cross-validation (deleted estimation) (1/2)
- Use data for both training and validation
- Divide the training data into 2 parts (A and B)
- Train on A, validate on B
- Train on B, validate on A
- Combine the two models
15. Cross-validation (deleted estimation) (2/2)
- Cross-validation: training data is used both as
- initial training data
- held out data
- On large training corpora, deleted estimation works better than held out estimation.
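- A minimal sketch of deleted estimation on assumed toy data, following the standard form P_del(w_1 ... w_n) = (T_r^01 + T_r^10) / (T * (N_r^0 + N_r^1)); for simplicity r is read off part A's counts.

    from collections import Counter

    def bigrams(tokens):
        return list(zip(tokens, tokens[1:]))

    # hypothetical split of the training data into halves A and B
    part_a = "the cat sat on the mat the cat ran".split()
    part_b = "the dog sat on the mat the dog ran".split()
    ca, cb = Counter(bigrams(part_a)), Counter(bigrams(part_b))
    T = len(bigrams(part_a)) + len(bigrams(part_b))

    def stats(train_counts, heldout_counts):
        # N_r: bigram types seen r times in the training half
        # T_r: their total count in the other (held out) half
        N_r, T_r = Counter(), Counter()
        for bg, r in train_counts.items():
            N_r[r] += 1
            T_r[r] += heldout_counts[bg]
        return N_r, T_r

    N0, T01 = stats(ca, cb)  # A trains, B validates
    N1, T10 = stats(cb, ca)  # B trains, A validates

    def p_del(bigram):
        r = ca[bigram]  # simplification: frequency taken from part A
        return (T01[r] + T10[r]) / (T * (N0[r] + N1[r]))

    print(p_del(("sat", "on")))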
16. Good-Turing estimation
- Suitable for a large number of observations from a large vocabulary.
- Works well for n-grams.
- r* = (r + 1) E[N_{r+1}] / E[N_r]  (r* is the adjusted frequency)
- P_GT = r* / N  (E denotes the expectation of a random variable)
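- A minimal sketch of Good-Turing reweighting on assumed toy counts, using the observed frequency-of-frequency counts N_r in place of the expectations E[N_r].

    from collections import Counter

    # hypothetical bigram counts
    counts = Counter({("the", "cat"): 3, ("the", "dog"): 2, ("a", "cat"): 1,
                      ("a", "dog"): 1, ("old", "cat"): 1})
    N = sum(counts.values())
    N_r = Counter(counts.values())  # N_r: number of types seen exactly r times

    def r_star(r):
        # adjusted frequency r* = (r + 1) * N_{r+1} / N_r
        return (r + 1) * N_r[r + 1] / N_r[r]

    def p_gt(bigram):
        return r_star(counts[bigram]) / N

    print(p_gt(("a", "cat")))  # discounted below the MLE value 1/8
    print(N_r[1] / N)          # probability mass reserved for unseen bigrams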
17. Combining Estimators (1/3)
- Basic Idea
- Consider how to combine multiple probability estimates from various different models.
- How can you develop a model that utilizes different length n-grams as appropriate?
- Simple linear interpolation
- P_li(w_n | w_{n-2}, w_{n-1}) = λ_1 P_1(w_n) + λ_2 P_2(w_n | w_{n-1}) + λ_3 P_3(w_n | w_{n-2}, w_{n-1}), with λ_i ≥ 0 and λ_1 + λ_2 + λ_3 = 1
- Combination of trigram, bigram and unigram
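- A minimal sketch of simple linear interpolation with assumed weights; in practice the lambdas are tuned on held out data and must be non-negative and sum to 1.

    def p_interp(w, h2, h1, p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
        # weighted mixture of unigram, bigram and trigram estimates
        l1, l2, l3 = lambdas
        return l1 * p_uni(w) + l2 * p_bi(w, h1) + l3 * p_tri(w, h2, h1)

    # toy usage with hypothetical component models
    p = p_interp("as", "comes", "across",
                 p_uni=lambda w: 0.01,
                 p_bi=lambda w, h1: 0.4,
                 p_tri=lambda w, h2, h1: 0.8)
    print(p)  # 0.2*0.01 + 0.3*0.4 + 0.5*0.8 = 0.522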
18. Combining Estimators (2/3)
- Katz's backing-off
- Used to smooth or to combine information sources.
- If the n-gram appeared more than k times: use the (discounted) n-gram estimate.
- If it appeared k times or fewer: back off to the estimate from a shorter (n-1)-gram.
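- A minimal sketch of the back-off decision with assumed counts, a fixed discount and a fixed alpha; real Katz back-off derives the discount from Good-Turing counts and chooses alpha so the distribution sums to 1.

    from collections import Counter

    trigram_counts = Counter({("comes", "across", "as"): 8})
    bigram_history_counts = Counter({("comes", "across"): 10})

    def p_bigram(w, h1):
        # placeholder lower-order model; a full implementation would itself
        # discount and back off further to a unigram model
        return 0.05

    def p_backoff(w, h2, h1, k=1, discount=0.5, alpha=0.4):
        c_tri = trigram_counts[(h2, h1, w)]
        c_hist = bigram_history_counts[(h2, h1)]
        if c_tri > k and c_hist > 0:
            # the trigram was seen more than k times: use its discounted estimate
            return (1 - discount) * c_tri / c_hist
        # otherwise back off to the shorter (bigram) estimate, scaled by alpha
        return alpha * p_bigram(w, h1)

    print(p_backoff("as", "comes", "across"))  # 0.5 * 8 / 10 = 0.4
    print(p_backoff("up", "comes", "across"))  # unseen trigram: 0.4 * 0.05 = 0.02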
19. Combining Estimators (3/3)
- General linear interpolation
- The weights are a function of the history h: P_li(w | h) = Σ_i λ_i(h) P_i(w | h)
- A very general way to combine models (commonly used).
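- A minimal sketch of general linear interpolation where the weights depend on the history; the weight function and component models below are assumptions for illustration.

    def p_general_interp(w, history, models, weight_fn):
        # models: component estimators p_i(w | history)
        # weight_fn: maps the history to one non-negative weight per model, summing to 1
        weights = weight_fn(history)
        return sum(l * p(w, history) for l, p in zip(weights, models))

    # toy weight function: give the higher-order model more weight for longer histories
    def weight_fn(history):
        return (0.1, 0.9) if len(history) >= 2 else (0.6, 0.4)

    models = [lambda w, h: 0.01,  # hypothetical unigram-style model
              lambda w, h: 0.3]   # hypothetical bigram-style model
    print(p_general_interp("as", ("comes", "across"), models, weight_fn))  # 0.271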
20. Conclusions
- Problems of sparse data
- Addressed by Good-Turing smoothing, linear interpolation or back-off.
- Good-Turing smoothing performs well
- Church and Gale (1991)
- Active research areas
- Combining probability models
- Dealing with sparse data