Title: 6. Statistical Inference: n-gram Models over Sparse Data
6. Statistical Inference: n-gram Models over Sparse Data
Foundations of Statistical Natural Language Processing
Outline (1)
- Bins: Forming Equivalence Classes
- Reliability vs. discrimination
- n-gram models
- Building n-gram models
- Statistical Estimators
- Maximum Likelihood Estimation (MLE)
- Laplace's law, Lidstone's law and the Jeffreys-Perks law
- Held out estimation
- Cross-validation
- Good-Turing estimation
Outline (2)
- Combining Estimators
- Simple linear interpolation
- Katz's backing-off
- General linear interpolation
- Language models for Austen
- Conclusions
1. Bins: Forming Equivalence Classes
Reliability vs. discrimination
- classification task
- classificatory feature
- target feature
- equivalence classing helps to predict the value of the target feature
- independence assumption
- compromise is needed
- Discrimination: dividing the data into more bins
- Reliability: the number of training instances in each bin
n-gram models
- predicting the next word (probability function P)
- Markov Assumption
- (n-1)th order Markov model (or n-gram model)
- histories ending in the same last n-1 words are in the same equivalence class
- number of parameters (with V = 20,000)
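- for example, with V = 20,000: a bigram model has 20,000^2 = 4 x 10^8 parameters, a trigram model 20,000^3 = 8 x 10^12, and a four-gram model 20,000^4 = 1.6 x 10^17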
2. Statistical Estimators
- probability estimate
- target feature
- estimating the unknown probability distribution of n-grams
Notation for statistical estimation
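Roughly following Manning and Schütze's conventions, the notation used below is:
- N: number of training instances (n-gram tokens)
- B: number of bins (distinct possible n-grams, e.g. V^n)
- w_1...w_n: an n-gram
- C(w_1...w_n): frequency of w_1...w_n in the training data
- r: the training frequency of an n-gram
- N_r: number of distinct n-grams with training frequency r
- T_r: total frequency, in further data, of the n-grams with training frequency r
- h: history (the preceding words)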
Maximum Likelihood Estimation (MLE)
- Probability estimates for the next word
- The MLE assigns zero probability to unseen events
- These zero probabilities will propagate and give us bad estimates for the probability of longer strings
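In the notation above, the MLE estimates are

  P_MLE(w_1...w_n) = C(w_1...w_n) / N
  P_MLE(w_n | w_1...w_{n-1}) = C(w_1...w_n) / C(w_1...w_{n-1})

A minimal sketch of the conditional (bigram) case in Python; the toy corpus and names are illustrative, not from the slides:

    from collections import Counter

    # toy training corpus; in practice this would be a large text
    tokens = "the cat sat on the mat the cat ate".split()

    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    def p_mle(w_prev, w):
        """MLE estimate P(w | w_prev) = C(w_prev w) / C(w_prev)."""
        if unigram_counts[w_prev] == 0:
            return 0.0
        return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

    print(p_mle("the", "cat"))  # 2/3: "the cat" seen twice, "the" seen three times
    print(p_mle("the", "dog"))  # 0.0: unseen bigram gets zero probability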
Laplace's law
- add a little bit of probability space to unseen events
- but Laplace's law actually gives too much of the probability space to unseen events
- when B > N, Laplace's method is completely unsatisfactory
- too much of the probability space goes to unseen bigrams: 46.5% (Church and Gale)
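In the notation above, Laplace's law (adding one to every count) is

  P_Lap(w_1...w_n) = (C(w_1...w_n) + 1) / (N + B)

For bigrams over a vocabulary of V words, B = V^2, which is why B can be far larger than N and most of the mass ends up on unseen events.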
Lidstone's law and the Jeffreys-Perks law
- Lidstone's Law
- add some positive value λ instead of 1
- Jeffreys-Perks Law
- λ = 0.5
- also called ELE (Expected Likelihood Estimation)
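In the same notation, Lidstone's law is

  P_Lid(w_1...w_n) = (C(w_1...w_n) + λ) / (N + Bλ),  λ > 0

and the Jeffreys-Perks law (ELE) is the special case λ = 1/2.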
Held out estimation
- C_1(w_1...w_n): frequency of w_1...w_n in the training data
- C_2(w_1...w_n): frequency of w_1...w_n in the held out data
- idea: how often do n-grams that appeared r times in the training text appear in further text?
- with N_r the number of n-grams of training frequency r, and T_r their total frequency in the held out data:

  P_ho(w_1...w_n) = T_r / (N_r N),  where C(w_1...w_n) = r
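A minimal sketch of the held-out computation in Python (function and variable names are assumptions; N is taken here as the number of held out n-gram tokens so the estimates sum to one):

    from collections import Counter, defaultdict

    def held_out_probs(train_counts, heldout_counts, n_heldout):
        """Return {r: P_ho(r)}, the probability assigned to any single
        n-gram that occurred r times in the training data."""
        n_r = Counter(train_counts.values())   # N_r: n-grams with training frequency r
        t_r = defaultdict(int)                 # T_r: their total frequency in held out data
        for ngram, r in train_counts.items():
            t_r[r] += heldout_counts.get(ngram, 0)
        # P_ho = T_r / (N_r * N)
        return {r: t_r[r] / (n_r[r] * n_heldout) for r in n_r}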
Cross-validation (deleted estimation)
- cross validation: the training data is used both as
- initial training data
- held out data
- On large training corpora, deleted estimation
works better than held-out estimation
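In deleted estimation the training data is split into two parts, 0 and 1. Writing N_r^a for the number of n-grams occurring r times in part a, and T_r^{ab} for the total frequency in part b of the n-grams occurring r times in part a, the two directions are combined:

  P_del(w_1...w_n) = (T_r^{01} + T_r^{10}) / (N (N_r^0 + N_r^1)),  where C(w_1...w_n) = r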
Good-Turing estimation
- suitable for a large number of observations from a large vocabulary
- works well for n-grams
( r* is an adjusted frequency; see the formula below )
( E denotes the expectation of a random variable )
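In the notation above, the Good-Turing estimator is

  r* = (r + 1) E(N_{r+1}) / E(N_r)
  P_GT(w_1...w_n) = r* / N,  where C(w_1...w_n) = r

A minimal Python sketch that plugs the observed counts-of-counts N_r in for E(N_r) (real implementations smooth the N_r values first; all names here are illustrative):

    from collections import Counter

    def good_turing_adjusted_counts(counts):
        """Return {ngram: r*} with r* = (r + 1) * N_{r+1} / N_r,
        using observed counts-of-counts as a stand-in for E(N_r)."""
        n_r = Counter(counts.values())       # N_r: number of n-grams seen exactly r times
        adjusted = {}
        for ngram, r in counts.items():
            if n_r[r + 1] > 0:
                adjusted[ngram] = (r + 1) * n_r[r + 1] / n_r[r]
            else:
                adjusted[ngram] = float(r)   # no N_{r+1}: keep the raw count for high r
        return adjusted

    # P_GT(ngram) = r* / N; the mass reserved for unseen n-grams is N_1 / N.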
3. Combining Estimators
- Consider how to combine multiple probability estimates from various different models
Simple linear interpolation
- combination of trigram, bigram and unigram estimates
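For trigrams, simple (fixed-weight) linear interpolation is

  P_li(w_n | w_{n-2} w_{n-1}) = λ_1 P_1(w_n) + λ_2 P_2(w_n | w_{n-1}) + λ_3 P_3(w_n | w_{n-2} w_{n-1})

with 0 ≤ λ_i ≤ 1 and λ_1 + λ_2 + λ_3 = 1; the weights are typically set on held out data (e.g. by EM).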
Katz's backing-off
- used to smooth or to combine information sources
- if the n-gram appeared more than k times
- use the (discounted) n-gram estimate
- if it appeared k or fewer times
- back off to the estimate from a shorter n-gram
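Schematically, for a history h = w_{i-n+1}...w_{i-1} (d is a discount, often derived from Good-Turing, and α_h renormalizes the reserved mass over the backed-off distribution):

  P_bo(w_i | h) = (1 - d) C(h w_i) / C(h)                 if C(h w_i) > k
  P_bo(w_i | h) = α_h P_bo(w_i | w_{i-n+2}...w_{i-1})      otherwise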
General linear interpolation
- the weights are a function of the history
- Very general way to combine models (commonly used)
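The general form, with weights that depend on the history h, is

  P_li(w | h) = Σ_i λ_i(h) P_i(w | h),  where 0 ≤ λ_i(h) ≤ 1 and Σ_i λ_i(h) = 1 for all h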
4. Conclusions
- problems of sparse data
- Good-Turing, linear interpolation or back-off
- Good-Turing smoothing works well
- Church and Gale (1991)
- Active research
- combining probability models
- dealing with sparse data