Title: Chapter 6. Statistical Inference: n-gram Models over Sparse Data
1. Chapter 6. Statistical Inference: n-gram Models over Sparse Data
Foundations of Statistical Natural Language Processing
- 2005. 1. 13
- huni77_at_pusan.ac.kr
2. Table of Contents
- Introduction
- Bins Forming Equivalence Classes
- Reliability vs. Discrimination
- N-gram models
- Statistical Estimators
- Maximum Likelihood Estimation (MLE)
- Laplace's law, Lidstone's law and the Jeffreys-Perks law
- Held out estimation
- Cross-validation (deleted estimation)
- Good-Turing estimation
- Combining Estimators
- Simple linear interpolation
- Katz's backing-off
- General linear interpolation
- Conclusions
3. Introduction
- Object of Statistical NLP
- Do statistical inference for the field of natural language.
- Statistical inference in general consists of
- Taking some data generated by an unknown probability distribution.
- Making some inferences about this distribution.
- Divides the problem into three areas
- Dividing the training data into equivalence classes.
- Finding a good statistical estimator for each equivalence class.
- Combining multiple estimators.
4. Bins Forming Equivalence Classes (1/2)
- Reliability vs. Discrimination
- large green ___________
- tree? mountain? frog? car?
- swallowed the large green ________
- pill? broccoli?
- Larger n: more information about the context of the specific instance (greater discrimination).
- Smaller n: more instances in the training data, better statistical estimates (more reliability).
5. Bins Forming Equivalence Classes (2/2)
- N-gram models
- n-gram: a sequence of n words
- predicting the next word
- Markov assumption
- Only the prior local context, the last few words, affects the next word.
- Selecting an n: with a vocabulary of 20,000 words, the number of model parameters grows rapidly with n (see the sketch below).
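- A minimal sketch (not from the slides) of why larger n leads to sparse data: with the slide's assumed vocabulary of 20,000 words, count the free parameters an n-gram model needs.

    V = 20_000  # assumed vocabulary size from the slide
    for n in (2, 3, 4):
        # an n-gram model conditions on n-1 previous words and predicts one of V words
        params = V ** (n - 1) * (V - 1)
        print(f"{n}-gram model: about {params:.2e} free parameters")
    # 2-gram ~ 4e8, 3-gram ~ 8e12, 4-gram ~ 1.6e17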
6. Statistical Estimators (1/3)
- Given the observed training data.
- How do you develop a model (probability distribution) to predict future events?
- Probability estimate of the target feature: P(w_n | w_1 ... w_{n-1})
- Estimating the unknown probability distribution of n-grams.
7. Statistical Estimators (2/3)
- Notation for the statistical estimation chapter:
- N: number of training instances
- B: number of bins (equivalence classes) into which the training instances are divided
- C(w_1 ... w_n): frequency of the n-gram w_1 ... w_n in the training text
- r: frequency of an n-gram
- N_r: number of bins that contain exactly r training instances
8. Statistical Estimators (3/3)
- Example: instances in the training corpus of 'inferior to ________'
9. Maximum Likelihood Estimation (MLE) (1/2)
- Definition
- Using the relative frequency as a probability estimate.
- Example
- In the corpus, we find 10 training instances of the phrase 'comes across'.
- 8 times it was followed by 'as': P(as) = 0.8
- Once by 'more' and once by 'a': P(more) = 0.1, P(a) = 0.1
- For any word x not among these three: P(x) = 0.0
- Formula
- P_MLE(w_1 ... w_n) = C(w_1 ... w_n) / N
- P_MLE(w_n | w_1 ... w_{n-1}) = C(w_1 ... w_n) / C(w_1 ... w_{n-1})
10. Maximum Likelihood Estimation (MLE) (2/2)
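- A minimal sketch of the MLE estimate using the 'comes across' counts from the previous slide; the counts in the Counter are taken from that example.

    from collections import Counter

    # words observed after the 10 training instances of "comes across"
    follow_counts = Counter({"as": 8, "more": 1, "a": 1})
    N = sum(follow_counts.values())  # 10

    def p_mle(word):
        # relative frequency: C(comes across, word) / C(comes across)
        return follow_counts[word] / N

    print(p_mle("as"))    # 0.8
    print(p_mle("more"))  # 0.1
    print(p_mle("the"))   # 0.0 -- unseen continuations get zero probability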
11. Laplace's law, Lidstone's law and the Jeffreys-Perks law (1/2)
- Laplace's law
- Adds a little bit of probability space to unseen events.
- P_Lap(w_1 ... w_n) = (C(w_1 ... w_n) + 1) / (N + B)
12. Laplace's law, Lidstone's law and the Jeffreys-Perks law (2/2)
- Lidstone's law and the Jeffreys-Perks law
- Lidstone's law
- Adds some positive value λ instead of 1:
- P_Lid(w_1 ... w_n) = (C(w_1 ... w_n) + λ) / (N + Bλ)
- Jeffreys-Perks law
- λ = 0.5
- Called ELE (Expected Likelihood Estimation)
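- A minimal sketch of Lidstone (add-λ) smoothing over the same hypothetical 'comes across' counts; B is assumed to equal the vocabulary size of 20,000.

    from collections import Counter

    follow_counts = Counter({"as": 8, "more": 1, "a": 1})
    N = sum(follow_counts.values())
    B = 20_000  # assumed number of bins (possible continuations)

    def p_lidstone(word, lam=0.5):
        # lam = 1.0 gives Laplace's law; lam = 0.5 gives the Jeffreys-Perks law (ELE)
        return (follow_counts[word] + lam) / (N + B * lam)

    print(p_lidstone("as"))   # seen event: pulled far below the MLE value of 0.8
    print(p_lidstone("the"))  # unseen event: small but non-zero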
13. Held out estimation
- Validate by holding out part of the training data.
- C_1(w_1 ... w_n): frequency of w_1 ... w_n in the training data
- C_2(w_1 ... w_n): frequency of w_1 ... w_n in the held out data
- T: number of tokens in the held out data
- N_r: number of n-grams with frequency r in the training data; T_r: total frequency in the held out data of all n-grams that appear r times in the training data
- P_ho(w_1 ... w_n) = T_r / (N_r * T), where r = C_1(w_1 ... w_n)
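- A minimal sketch of held out estimation on assumed toy bigram data; only bigrams seen in training (r > 0) are handled, since estimating N_0 for unseen bins is omitted here.

    from collections import Counter

    def bigrams(tokens):
        return list(zip(tokens, tokens[1:]))

    train = "the cat sat on the mat the cat ran".split()
    heldout = "the dog sat on the mat the dog ran".split()

    c1 = Counter(bigrams(train))    # C1: frequencies in training data
    c2 = Counter(bigrams(heldout))  # C2: frequencies in held out data
    T = len(bigrams(heldout))       # number of bigram tokens in held out data

    # N_r: number of bigram types with training frequency r
    # T_r: their total count in the held out data
    N_r, T_r = Counter(), Counter()
    for bg, r in c1.items():
        N_r[r] += 1
        T_r[r] += c2[bg]

    def p_ho(bigram):
        r = c1[bigram]
        return T_r[r] / (N_r[r] * T)

    print(p_ho(("sat", "on")))  # shares the held out mass of all frequency-1 bigrams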
14. Cross-validation (deleted estimation) (1/2)
- Use data for both training and validation
- Divide the training data into 2 parts (A and B)
- Train on A, validate on B
- Train on B, validate on A
- Combine the two models
15. Cross-validation (deleted estimation) (2/2)
- Cross-validation: training data is used both as
- initial training data
- held out data
- On large training corpora, deleted estimation works better than held out estimation.
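- A minimal sketch of deleted estimation on assumed toy data, following the standard form P_del(w_1 ... w_n) = (T_r^01 + T_r^10) / (T * (N_r^0 + N_r^1)); for simplicity r is read off part A's counts.

    from collections import Counter

    def bigrams(tokens):
        return list(zip(tokens, tokens[1:]))

    # hypothetical split of the training data into halves A and B
    part_a = "the cat sat on the mat the cat ran".split()
    part_b = "the dog sat on the mat the dog ran".split()
    ca, cb = Counter(bigrams(part_a)), Counter(bigrams(part_b))
    T = len(bigrams(part_a)) + len(bigrams(part_b))

    def stats(train_counts, heldout_counts):
        # N_r: bigram types seen r times in the training half
        # T_r: their total count in the other (held out) half
        N_r, T_r = Counter(), Counter()
        for bg, r in train_counts.items():
            N_r[r] += 1
            T_r[r] += heldout_counts[bg]
        return N_r, T_r

    N0, T01 = stats(ca, cb)  # A trains, B validates
    N1, T10 = stats(cb, ca)  # B trains, A validates

    def p_del(bigram):
        r = ca[bigram]  # simplification: frequency taken from part A
        return (T01[r] + T10[r]) / (T * (N0[r] + N1[r]))

    print(p_del(("sat", "on")))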
16. Good-Turing estimation
- Suitable for a large number of observations from a large vocabulary.
- Works well for n-grams.
- r* = (r + 1) E[N_{r+1}] / E[N_r]  (r* is the adjusted frequency)
- P_GT = r* / N  (E denotes the expectation of a random variable)
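- A minimal sketch of Good-Turing reweighting on assumed toy counts, using the observed frequency-of-frequency counts N_r in place of the expectations E[N_r].

    from collections import Counter

    # hypothetical bigram counts
    counts = Counter({("the", "cat"): 3, ("the", "dog"): 2, ("a", "cat"): 1,
                      ("a", "dog"): 1, ("old", "cat"): 1})
    N = sum(counts.values())
    N_r = Counter(counts.values())  # N_r: number of types seen exactly r times

    def r_star(r):
        # adjusted frequency r* = (r + 1) * N_{r+1} / N_r
        return (r + 1) * N_r[r + 1] / N_r[r]

    def p_gt(bigram):
        return r_star(counts[bigram]) / N

    print(p_gt(("a", "cat")))  # discounted below the MLE value 1/8
    print(N_r[1] / N)          # probability mass reserved for unseen bigrams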
17. Combining Estimators (1/3)
- Basic Idea
- Consider how to combine multiple probability estimates from various different models.
- How can you develop a model that utilizes different length n-grams as appropriate?
- Simple linear interpolation
- P_li(w_n | w_{n-2}, w_{n-1}) = λ_1 P_1(w_n) + λ_2 P_2(w_n | w_{n-1}) + λ_3 P_3(w_n | w_{n-2}, w_{n-1}), with λ_i ≥ 0 and λ_1 + λ_2 + λ_3 = 1
- Combination of trigram, bigram and unigram
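- A minimal sketch of simple linear interpolation with assumed weights; in practice the lambdas are tuned on held out data and must be non-negative and sum to 1.

    def p_interp(w, h2, h1, p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
        # weighted mixture of unigram, bigram and trigram estimates
        l1, l2, l3 = lambdas
        return l1 * p_uni(w) + l2 * p_bi(w, h1) + l3 * p_tri(w, h2, h1)

    # toy usage with hypothetical component models
    p = p_interp("as", "comes", "across",
                 p_uni=lambda w: 0.01,
                 p_bi=lambda w, h1: 0.4,
                 p_tri=lambda w, h2, h1: 0.8)
    print(p)  # 0.2*0.01 + 0.3*0.4 + 0.5*0.8 = 0.522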
18. Combining Estimators (2/3)
- Katz's backing-off
- Used to smooth or to combine information sources.
- If the n-gram appeared more than k times: use the (discounted) n-gram estimate.
- If it appeared k times or fewer: back off to the estimate from a shorter (n-1)-gram.
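- A minimal sketch of the back-off decision with assumed counts, a fixed discount and a fixed alpha; real Katz back-off derives the discount from Good-Turing counts and chooses alpha so the distribution sums to 1.

    from collections import Counter

    trigram_counts = Counter({("comes", "across", "as"): 8})
    bigram_history_counts = Counter({("comes", "across"): 10})

    def p_bigram(w, h1):
        # placeholder lower-order model; a full implementation would itself
        # discount and back off further to a unigram model
        return 0.05

    def p_backoff(w, h2, h1, k=1, discount=0.5, alpha=0.4):
        c_tri = trigram_counts[(h2, h1, w)]
        c_hist = bigram_history_counts[(h2, h1)]
        if c_tri > k and c_hist > 0:
            # the trigram was seen more than k times: use its discounted estimate
            return (1 - discount) * c_tri / c_hist
        # otherwise back off to the shorter (bigram) estimate, scaled by alpha
        return alpha * p_bigram(w, h1)

    print(p_backoff("as", "comes", "across"))  # 0.5 * 8 / 10 = 0.4
    print(p_backoff("up", "comes", "across"))  # unseen trigram: 0.4 * 0.05 = 0.02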
19. Combining Estimators (3/3)
- General linear interpolation
- The weights are a function of the history h: P_li(w | h) = Σ_i λ_i(h) P_i(w | h)
- A very general way to combine models (commonly used).
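- A minimal sketch of general linear interpolation where the weights depend on the history; the weight function and component models below are assumptions for illustration.

    def p_general_interp(w, history, models, weight_fn):
        # models: component estimators p_i(w | history)
        # weight_fn: maps the history to one non-negative weight per model, summing to 1
        weights = weight_fn(history)
        return sum(l * p(w, history) for l, p in zip(weights, models))

    # toy weight function: give the higher-order model more weight for longer histories
    def weight_fn(history):
        return (0.1, 0.9) if len(history) >= 2 else (0.6, 0.4)

    models = [lambda w, h: 0.01,  # hypothetical unigram-style model
              lambda w, h: 0.3]   # hypothetical bigram-style model
    print(p_general_interp("as", ("comes", "across"), models, weight_fn))  # 0.271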
20. Conclusions
- Problems of sparse data
- Addressed by Good-Turing smoothing, linear interpolation or back-off.
- Good-Turing smoothing performs well
- Church and Gale (1991)
- Active research areas
- Combining probability models
- Dealing with sparse data