Title: 6. Statistical Inference: n-gram Models over Sparse Data
6. Statistical Inference: n-gram Models over Sparse Data
Foundations of Statistical Natural Language Processing
Outline (1)
- Bins: Forming Equivalence Classes
- Reliability vs. discrimination
- n-gram models
- Building n-gram models
- Statistical Estimators
- Maximum Likelihood Estimation (MLE)
- Laplace's law, Lidstone's law and the Jeffreys-Perks law
- Held out estimation
- Cross-validation
- Good-Turing estimation
Outline (2)
- Combining Estimators
- Simple linear interpolation
- Katz's backing-off
- General linear interpolation
- Language models for Austen
- Conclusions
1. Bins: Forming Equivalence Classes
Reliability vs. discrimination
- classification task
- classificatory feature
- target feature
- equivalence classing helps to predict the value of the target feature
- independence assumption
- compromise is needed
- Discrimination: dividing the data into more bins
- Reliability: the number of training instances in each bin
n-gram models
- predicting the next word (probability function P)
- Markov Assumption
- (n-1)th order Markov model (or n-gram model)
- histories ending in the same last n-1 words are in the same equivalence class
- number of parameters (with V = 20,000)
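- for example, with V = 20,000: a bigram model has 20,000^2 = 4 x 10^8 parameters, a trigram model 20,000^3 = 8 x 10^12, and a four-gram model 20,000^4 = 1.6 x 10^17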
2. Statistical Estimators
- probability estimate
- target feature
- estimating the unknown probability distribution of n-grams
Notation for statistical estimation
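Roughly following Manning and Schütze's conventions, the notation used below is:
- N: number of training instances (n-gram tokens)
- B: number of bins (distinct possible n-grams, e.g. V^n)
- w_1...w_n: an n-gram
- C(w_1...w_n): frequency of w_1...w_n in the training data
- r: the training frequency of an n-gram
- N_r: number of distinct n-grams with training frequency r
- T_r: total frequency, in further data, of the n-grams with training frequency r
- h: history (the preceding words)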
Maximum Likelihood Estimation (MLE)
- Probability estimates for the next word
- The MLE assigns zero probability to unseen events
- These zero probabilities will propagate and give us bad estimates for the probability of longer strings
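In the notation above, the MLE estimates are

  P_MLE(w_1...w_n) = C(w_1...w_n) / N
  P_MLE(w_n | w_1...w_{n-1}) = C(w_1...w_n) / C(w_1...w_{n-1})

A minimal sketch of the conditional (bigram) case in Python; the toy corpus and names are illustrative, not from the slides:

    from collections import Counter

    # toy training corpus; in practice this would be a large text
    tokens = "the cat sat on the mat the cat ate".split()

    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    def p_mle(w_prev, w):
        """MLE estimate P(w | w_prev) = C(w_prev w) / C(w_prev)."""
        if unigram_counts[w_prev] == 0:
            return 0.0
        return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

    print(p_mle("the", "cat"))  # 2/3: "the cat" seen twice, "the" seen three times
    print(p_mle("the", "dog"))  # 0.0: unseen bigram gets zero probability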
Laplace's law
- add a little bit of probability space to unseen events
- but Laplace's law actually gives too much of the probability space to unseen events
- when B > N, Laplace's method is completely unsatisfactory
- too much of the probability space goes to unseen bigrams: 46.5% (Church and Gale)
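In the notation above, Laplace's law (adding one to every count) is

  P_Lap(w_1...w_n) = (C(w_1...w_n) + 1) / (N + B)

For bigrams over a vocabulary of V words, B = V^2, which is why B can be far larger than N and most of the mass ends up on unseen events.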
Lidstone's law and the Jeffreys-Perks law
- Lidstone's Law
- add some positive value λ instead of 1
- Jeffreys-Perks Law
- λ = 0.5
- also called ELE (Expected Likelihood Estimation)
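In the same notation, Lidstone's law is

  P_Lid(w_1...w_n) = (C(w_1...w_n) + λ) / (N + Bλ),  λ > 0

and the Jeffreys-Perks law (ELE) is the special case λ = 1/2.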
Held out estimation
- C_1(w_1...w_n): frequency of w_1...w_n in the training data
- C_2(w_1...w_n): frequency of w_1...w_n in the held out data
- idea: how often do n-grams that appeared r times in the training text appear in further text?
- with N_r the number of n-grams of training frequency r, and T_r their total frequency in the held out data:

  P_ho(w_1...w_n) = T_r / (N_r N),  where C(w_1...w_n) = r
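A minimal sketch of the held-out computation in Python (function and variable names are assumptions; N is taken here as the number of held out n-gram tokens so the estimates sum to one):

    from collections import Counter, defaultdict

    def held_out_probs(train_counts, heldout_counts, n_heldout):
        """Return {r: P_ho(r)}, the probability assigned to any single
        n-gram that occurred r times in the training data."""
        n_r = Counter(train_counts.values())   # N_r: n-grams with training frequency r
        t_r = defaultdict(int)                 # T_r: their total frequency in held out data
        for ngram, r in train_counts.items():
            t_r[r] += heldout_counts.get(ngram, 0)
        # P_ho = T_r / (N_r * N)
        return {r: t_r[r] / (n_r[r] * n_heldout) for r in n_r}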
Cross-validation (deleted estimation)
- cross validation: the training data is used both as
- initial training data
- held out data
- On large training corpora, deleted estimation
works better than held-out estimation
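In deleted estimation the training data is split into two parts, 0 and 1. Writing N_r^a for the number of n-grams occurring r times in part a, and T_r^{ab} for the total frequency in part b of the n-grams occurring r times in part a, the two directions are combined:

  P_del(w_1...w_n) = (T_r^{01} + T_r^{10}) / (N (N_r^0 + N_r^1)),  where C(w_1...w_n) = r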
Good-Turing estimation
- suitable for a large number of observations from a large vocabulary
- works well for n-grams
( r* is an adjusted frequency; see the formula below )
( E denotes the expectation of a random variable )
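In the notation above, the Good-Turing estimator is

  r* = (r + 1) E(N_{r+1}) / E(N_r)
  P_GT(w_1...w_n) = r* / N,  where C(w_1...w_n) = r

A minimal Python sketch that plugs the observed counts-of-counts N_r in for E(N_r) (real implementations smooth the N_r values first; all names here are illustrative):

    from collections import Counter

    def good_turing_adjusted_counts(counts):
        """Return {ngram: r*} with r* = (r + 1) * N_{r+1} / N_r,
        using observed counts-of-counts as a stand-in for E(N_r)."""
        n_r = Counter(counts.values())       # N_r: number of n-grams seen exactly r times
        adjusted = {}
        for ngram, r in counts.items():
            if n_r[r + 1] > 0:
                adjusted[ngram] = (r + 1) * n_r[r + 1] / n_r[r]
            else:
                adjusted[ngram] = float(r)   # no N_{r+1}: keep the raw count for high r
        return adjusted

    # P_GT(ngram) = r* / N; the mass reserved for unseen n-grams is N_1 / N.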
3. Combining Estimators
- Consider how to combine multiple probability estimates from various different models
Simple linear interpolation
- combination of trigram, bigram and unigram estimates
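For trigrams, simple (fixed-weight) linear interpolation is

  P_li(w_n | w_{n-2} w_{n-1}) = λ_1 P_1(w_n) + λ_2 P_2(w_n | w_{n-1}) + λ_3 P_3(w_n | w_{n-2} w_{n-1})

with 0 ≤ λ_i ≤ 1 and λ_1 + λ_2 + λ_3 = 1; the weights are typically set on held out data (e.g. by EM).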
Katz's backing-off
- used to smooth or to combine information sources
- if the n-gram appeared more than k times
- use the (discounted) n-gram estimate
- if it appeared k or fewer times
- back off to the estimate from a shorter n-gram
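Schematically, for a history h = w_{i-n+1}...w_{i-1} (d is a discount, often derived from Good-Turing, and α_h renormalizes the reserved mass over the backed-off distribution):

  P_bo(w_i | h) = (1 - d) C(h w_i) / C(h)                 if C(h w_i) > k
  P_bo(w_i | h) = α_h P_bo(w_i | w_{i-n+2}...w_{i-1})      otherwise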
General linear interpolation
- the weights are a function of the history
- Very general way to combine models (commonly used)
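The general form, with weights that depend on the history h, is

  P_li(w | h) = Σ_i λ_i(h) P_i(w | h),  where 0 ≤ λ_i(h) ≤ 1 and Σ_i λ_i(h) = 1 for all h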
4. Conclusions
- problems of sparse data
- Good-Turing, linear interpolation or back-off
- Good-Turing smoothing works well
- Church and Gale (1991)
- Active research
- combining probability models
- dealing with sparse data