Title: Statistical Inference: n-gram Models over Sparse Data
1 Statistical Inference: n-gram Models over Sparse Data
2 Outline
- Purpose of Statistical NLP
- Forming Equivalence Classes
- Statistical Estimators
- Combining Estimators
- Conclusions
3 Purpose of Statistical NLP
- Doing statistical inference for NLP
- Why statistical inference for NLP tasks?
- Some NLP data is generated by some (unknown) distribution
- We want to make inferences about that distribution
- How? Training data --> equivalence classes (EC)
- Finding good statistical estimators for the ECs
- Combining multiple estimators
4 Example: Language Modelling
- Purpose: predict the next word given the previous words
- Applications
- speech or optical character recognition
- spelling correction
- handwriting recognition
- machine translation
- Methods also applicable to
- word sense disambiguation
- probabilistic parsing
5 Outline
- Purpose of Statistical NLP
- Forming Equivalence Classes
- Statistical Estimators
- Combining Estimators
- Conclusions
6 Reliability vs. Discrimination
- Prediction: a mapping from the past to the future
- From classificatory features to a target feature
- Independence assumption: the data do not depend on other features (or only weakly)
- More features:
- More bins, greater discrimination
- Less training data per bin, lower statistical reliability
7 n-gram models
- Predicting the next word: estimating P(wm | w1 ... wm-1)
- Using the history h = w1 ... wm-1
- Markov assumption: only the prior n-1 words affect wm: P(wm | h) ≈ P(wm | wm-n+1 ... wm-1)
- n = 2: bigram, n = 3: trigram, n = 4: four-gram (see the sketch below)
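A minimal sketch (not from the slides) of how n-grams are read off a token sequence; the ngrams helper and the start-padding symbol <s> are illustrative choices made here, not part of the original material.

```python
from collections import Counter

def ngrams(tokens, n, pad="<s>"):
    # Pad with n-1 start symbols so every word has a full-length history.
    padded = [pad] * (n - 1) + list(tokens)
    return [tuple(padded[i:i + n]) for i in range(len(tokens))]

tokens = "sue swallowed the large green pill".split()
bigram_counts = Counter(ngrams(tokens, 2))    # n = 2: bigrams
trigram_counts = Counter(ngrams(tokens, 3))   # n = 3: trigrams
print(trigram_counts[("the", "large", "green")])  # -> 1
```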
8 How large should n be?
- Large n captures long-distance dependencies
- Example: Sue swallowed the large green ____
- The longer context (swallowed ...) predicts pill, frog
- The local context alone (the large green ...) would also allow tree, car, mountain
- But high-order n-gram models are not realistic:
- With a 20,000-word vocabulary, a bigram model has 400 million bins (20,000^2)
- and a trigram model has 8 trillion bins (20,000^3)
9 n-grams for Austen's novels
- Data available from Project Gutenberg
- 40 M of clean, plain-ASCII files
- Training data: Emma, Mansfield Park, Pride and Prejudice, Sense and Sensibility
- Test data: Persuasion
- Corpus: N = 617,091 word tokens, vocabulary V = 14,585 word types
- All punctuation left out
- Case distinctions kept
10 Outline
- Purpose of Statistical NLP
- Forming Equivalence Classes
- Statistical Estimators
- Combining Estimators
- Conclusions
11 Statistical Estimators
- n-gram model:
- P(wn | w1 ... wn-1) = P(w1 ... wn) / P(w1 ... wn-1)
- C(w1 w2 ... wn) = frequency of the n-gram w1 w2 ... wn
- w1 ... wn-1 = h, the history of preceding words
- N = number of training instances
- Problem: what if r = C(w1 w2 ... wn) is 0 or 1?
- Smoothing, using N0, N1, N2, ..., T1, T2, ...
- Nr = number of distinct n-grams with r training instances
- Tr = r * Nr = total count of the n-grams with r instances
12 Maximum Likelihood Estimation (MLE)
- MLE estimates from relative frequencies
- P(as) = C(as) / N
- P(as) = 8/10, P(more) = 1/10, P(a) = 1/10, P(x) = 0 for all x ∉ {as, more, a}
- PMLE(w1 ... wn) = C(w1 ... wn) / N
- PMLE(wn | w1 ... wn-1) = C(w1 ... wn) / C(w1 ... wn-1) (see the sketch below)
- PMLE gives the highest probability to the training sample
- Why? It wastes no probability mass on unseen events
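A minimal sketch of the MLE conditional estimate PMLE(wn | w1 ... wn-1) = C(w1 ... wn) / C(w1 ... wn-1), here for bigrams; the tiny sample sentence is illustrative, not the Austen corpus.

```python
from collections import Counter

tokens = "in person she was inferior to both sisters".split()
bigram_c = Counter(zip(tokens, tokens[1:]))   # bigram counts C(h, w)
unigram_c = Counter(tokens[:-1])              # history counts C(h)

def p_mle(w, h):
    # Zero for any unseen history or continuation -- the core MLE problem.
    return bigram_c[(h, w)] / unigram_c[h] if unigram_c[h] else 0.0

print(p_mle("was", "she"))   # 1.0 on this tiny sample
print(p_mle("not", "she"))   # 0.0 -- an unseen bigram gets no probability
```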
13 MLE's problem and solution
- When using the model to predict test data:
- Many (a majority of) word types are unseen
- Zero probabilities propagate and wipe out the other probabilities
- Is more data a solution?
- Never a general solution
- Consider all the numbers that can follow "the year"
- Solution:
- Decrease the probability of seen events
- Give non-zero probability to unseen events
14 Using MLE for n-grams of Austen
- Sentence: In person she was inferior to both sisters
- Unigram: not the best, but still useful for prediction
- Bigram: generally increases the probability
- Trigram: can work brilliantly, sometimes
- P(was | person she) = 0.5
- Four-gram: useless
- Intuition: use the highest-order n-gram available when possible
- But zero probabilities still exist
15 Laplace's law etc.
- Laplace's law, or "adding one"
- PLAP(w1 ... wn) = (C(w1 ... wn) + 1) / (N + B), where B is the number of bins
- fLAP(w1 ... wn) = (C(w1 ... wn) + 1) * N / (N + B)
- For r > 0, fLAP < fMLE = r
- For r = 0, fLAP > 0 (while fMLE = r = 0) (see the sketch below)
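A minimal sketch of Laplace's law, PLAP = (C + 1) / (N + B); the sizes N and B below are illustrative numbers, not the slides' data.

```python
def p_laplace(count, N, B):
    # Add one to every count; B is the total number of bins.
    return (count + 1) / (N + B)

N, B = 10, 1000                # illustrative: 10 training instances, 1000 bins
print(p_laplace(8, N, B))      # seen 8 times: heavily discounted vs. MLE 8/10
print(p_laplace(0, N, B))      # unseen: small but non-zero probability
```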
16 PLAP gives too much probability to N0
- Depends on the size of the vocabulary V = {wi}
- B >> N
- AP corpus:
- N = 22 M, V = 400 K
- B = V^2 = 160 G
- N0 = 75 G
- 46.5% of the probability mass goes to the N0 unseen bigrams (N0 * fLAP / N)
- Actually, only 9.2% of the instances in the AP test data are unseen
18 Lidstone's law, Jeffreys-Perks law (ELE)
- PLid(w1 ... wn) = (C(w1 ... wn) + λ) / (N + Bλ)
- An interpolation between the MLE and the uniform distribution: PLid(w1 ... wn) = μ * C(w1 ... wn)/N + (1 - μ) * 1/B, with μ = N/(N + λB)
- λ = 0.5: Jeffreys-Perks law (Expected Likelihood Estimation, ELE)
- Better than PLAP, but problems remain (see the sketch below):
- Which value to use for λ?
- Still a linear function of the MLE, depending only on B and N
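A minimal sketch of Lidstone's law and its interpolation reading; λ = 1 recovers Laplace and λ = 0.5 is the Jeffreys-Perks law. N and B are again illustrative sizes.

```python
def p_lidstone(count, N, B, lam=0.5):
    # PLid = (C + lambda) / (N + B*lambda)
    return (count + lam) / (N + B * lam)

def p_lidstone_interp(count, N, B, lam=0.5):
    # Equivalent form: mu*(C/N) + (1-mu)*(1/B), with mu = N/(N + B*lambda)
    mu = N / (N + B * lam)
    return mu * (count / N) + (1 - mu) * (1 / B)

N, B = 10, 1000
assert abs(p_lidstone(8, N, B) - p_lidstone_interp(8, N, B)) < 1e-12
```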
19 Applying MLE and ELE to Austen
- P(not | was): 0.065 -> 0.036 (MLE -> ELE)
- For bigrams, PELE < PMLE
- Still too much discounting? Yes
- P(she was inferior to both sisters)
- Bigram ELE: PELE = 6.89 x 10^-20 (λ = 0.5)
- Worse than the unigram MLE
- Lower probability than PMLE
20 Held out estimation
- Tr = Σ C_ho(w1 ... wn), summed over the n-grams with C_train(w1 ... wn) = r
- Tr (also written Tr,ho) is the total count in the held-out (HO) data of the n-grams that occur r times in training
- Tr / Nr is their average frequency in the HO data
- f_ho(w1 ... wn) = Tr,ho / Nr, where C_train(w1 ... wn) = r
- Use the held-out frequency to estimate P (see the sketch below):
- P_ho(w1 ... wn) = f_ho(w1 ... wn) / N
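A minimal sketch of held-out estimation, assuming the training and held-out counts are already available as Counters keyed by n-gram; the toy counts are made up for illustration.

```python
from collections import Counter, defaultdict

def held_out_probs(train_counts, ho_counts, n_ho):
    Nr = Counter(train_counts.values())             # Nr: types with training count r
    Tr = defaultdict(int)
    for ngram, r in train_counts.items():
        Tr[r] += ho_counts.get(ngram, 0)            # held-out occurrences of those types
    # Pho per count class r. Note: r = 0 is not covered here; N0 would be
    # the number of bins B minus the number of seen types.
    return {r: Tr[r] / (Nr[r] * n_ho) for r in Nr}

train = Counter({("she", "was"): 2, ("was", "inferior"): 1, ("to", "both"): 1})
ho = Counter({("she", "was"): 1, ("was", "inferior"): 2})
print(held_out_probs(train, ho, n_ho=3))
```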
21 Why held out data?
- Training data
- Study and develop the model
- Train the model
- Test data
- Should never be looked at during development
- Held-out data: a simulation of the test data
- Training data = Train + Held out
- The HO data simulates testing (HO ≈ test)
22 Gold standard to evaluate f
- In the training data:
- Nr = number of n-gram types with training count f_train = r
- Tr = total number of n-gram instances with f_train = r
- In the test data:
- Tr = total number of occurrences in the test data of the n-grams with f_train = r
- f_emp = Tr / Nr = Σ C_test(w1 ... wn) / Nr, summed over the (w1 ... wn) with C_train(w1 ... wn) = r
- Example:
- N0 = 10,000; 3 of these n-grams occur in the test data, with counts 1, 1, 2
- f_emp = (1 + 1 + 2) / 10,000 = 0.0004
23 Cross-validation (deleted estimation)
- Divide the N training instances into two parts, N^0 and N^1
- Nr^a = number of n-grams with count r in part N^a
- Tr^ab = total occurrences in part N^b of the n-grams that occur r times in part N^a
- P_ho(w1 ... wn) = Tr^01 / (Nr^0 * N) or Tr^10 / (Nr^1 * N)
- P_del(w1 ... wn) = (Tr^01 + Tr^10) / ((Nr^0 + Nr^1) * N) (see the sketch below)
- Effective, and close to the gold standard
- But it overestimates P for r = 0
- and underestimates P for r = 1
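A minimal sketch of deleted estimation with two halves of the training data represented as Counters; the halves and the value of N are illustrative.

```python
from collections import Counter, defaultdict

def deleted_estimate(counts0, counts1, N):
    Nr = [Counter(counts0.values()), Counter(counts1.values())]   # Nr^0, Nr^1
    Tr = [defaultdict(int), defaultdict(int)]                     # Tr^01, Tr^10
    for a in (0, 1):
        counts_a = counts0 if a == 0 else counts1
        counts_b = counts1 if a == 0 else counts0
        for ngram, r in counts_a.items():
            Tr[a][r] += counts_b.get(ngram, 0)
    rs = set(Nr[0]) | set(Nr[1])
    # Pdel = (Tr^01 + Tr^10) / (N * (Nr^0 + Nr^1)) for each count class r
    return {r: (Tr[0][r] + Tr[1][r]) / (N * (Nr[0][r] + Nr[1][r])) for r in rs}

half0 = Counter({("she", "was"): 2, ("to", "both"): 1})
half1 = Counter({("she", "was"): 1, ("was", "inferior"): 1})
print(deleted_estimate(half0, half1, N=5))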
24 Leave one out
- Two parts of training data: divide the training data into parts of sizes (N-1, 1)
- P_del(w1 ... wn) = (Tr^01 + Tr^10) / ((Nr^0 + Nr^1) * N), where C(w1 ... wn) = r
- Rotate the left-out instance, N times in total
- Closely related to the Good-Turing method
25 Good-Turing estimation
- Good (1953) attributes the method to Turing
- Based on a binomial distribution
- Works well for many situations, including n-grams
- PGT = r*/N, with r* = (r+1) E(N_{r+1}) / E(N_r) (see the sketch below)
- A redistribution of probability mass from seen to unseen events
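A minimal sketch of the Good-Turing adjusted count r* = (r+1) N_{r+1} / N_r, using the raw counts-of-counts Nr as the estimate of E(Nr); the toy n-gram counts are illustrative.

```python
from collections import Counter

def good_turing_rstar(count_of_counts, r):
    # count_of_counts maps r -> Nr
    nr, nr1 = count_of_counts.get(r, 0), count_of_counts.get(r + 1, 0)
    return (r + 1) * nr1 / nr if nr else 0.0

ngram_counts = Counter({"a b": 3, "b c": 1, "c d": 1, "d e": 1, "e f": 2})
Nr = Counter(ngram_counts.values())     # counts of counts, e.g. {1: 3, 2: 1, 3: 1}
N = sum(ngram_counts.values())          # total training instances
r_star_1 = good_turing_rstar(Nr, 1)     # (1+1) * N2 / N1
print(r_star_1, r_star_1 / N)           # adjusted count and PGT for r = 1
```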
26 Issues with Using Good-Turing
- Problems
- r* = 0 for the largest observed r, because E(N_{r+1}) = 0
- Nr is (roughly) monotonically decreasing but not smooth
- Solutions
- Adjust r only when r < k (e.g., k = 10)
- Use a smoothed value S(r) instead of Nr
- Renormalize so that the probabilities sum to 1
27 Simple Good-Turing
- Due to Gale and Sampson (1995)
- What should S(r) be?
- For low r, use Nr directly: S(r) = Nr
- For high r:
- Replace Nr by S(r) = A * r^b (with b < -1), i.e. log Nr = a + b log r
- Estimate a and b by linear regression over the high-r region (see the sketch below)
- The resulting probabilities must be renormalized to sum to 1
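A minimal sketch of the Simple Good-Turing idea: fit log Nr = a + b*log r by least squares and use S(r) = exp(a) * r^b in place of Nr for large r. Gale and Sampson's full recipe (Zr averaging, switching between raw and smoothed estimates) is omitted; the counts-of-counts below are made up.

```python
import math

def fit_log_log(count_of_counts):
    # Ordinary least squares on (log r, log Nr).
    pts = [(math.log(r), math.log(nr)) for r, nr in count_of_counts.items() if nr > 0]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    b = sum((x - mx) * (y - my) for x, y in pts) / sum((x - mx) ** 2 for x, _ in pts)
    a = my - b * mx
    return a, b

def smoothed_Nr(a, b, r):
    return math.exp(a) * r ** b          # S(r) = A * r^b, with log A = a

Nr = {1: 120, 2: 40, 3: 24, 4: 13, 5: 15, 6: 5, 7: 8}   # illustrative counts-of-counts
a, b = fit_log_log(Nr)
print(b, smoothed_Nr(a, b, 10))          # the fitted slope b should come out below -1
```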
28 Briefly noted
- Absolute discounting:
- Pabs(w1 ... wn) = (r - δ)/N if r > 0; Pabs(w1 ... wn) = (B - N0) δ / (N0 N) if r = 0 (see the sketch below)
- δ ≈ 0.77 works best, except for r = 1
- Linear discounting:
- Pld(w1 ... wn) = (1 - α) r / N if r > 0; Pld(w1 ... wn) = α / N0 if r = 0
- Cannot really be justified:
- it discounts too much from high-count events,
- and high-count events are the statistically most reliable ones
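A minimal sketch of absolute discounting as given on the slide: every seen count loses a fixed δ, and the freed mass (B - N0)·δ is shared equally by the N0 unseen bins. The bin and count values are hypothetical.

```python
def p_abs(r, N, B, N0, delta=0.77):
    if r > 0:
        return (r - delta) / N
    return (B - N0) * delta / (N0 * N)

# Illustrative check with hypothetical sizes: probabilities sum to 1 over all B bins.
counts = [3, 2, 1, 1]                       # counts of the seen bins
B, N = 10, sum(counts)
N0 = B - len(counts)                        # number of unseen bins
total = sum(p_abs(r, N, B, N0) for r in counts) + N0 * p_abs(0, N, B, N0)
print(round(total, 10))                     # -> 1.0
```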
29 Outline
- Purpose of Statistical NLP
- Forming Equivalence Classes
- Statistical Estimators
- Combining Estimators
- Conclusions
30 Combining Estimators
- Simple linear interpolation:
- Pli(wn | wn-2, wn-1) = λ1 P(wn) + λ2 P(wn | wn-1) + λ3 P(wn | wn-2, wn-1) (see the sketch below)
- The λi are trained with the EM algorithm
- Gives good results
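A minimal sketch of simple linear interpolation of unigram, bigram and trigram estimates. The λ values here are fixed illustrative weights; in practice they would be trained with EM on held-out data, which is not shown.

```python
def p_interp(p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    # Weighted mixture of the three estimates; the weights must sum to 1.
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Combine P(w), P(w | w_{n-1}), P(w | w_{n-2}, w_{n-1}) for one prediction:
print(p_interp(p_uni=0.001, p_bi=0.01, p_tri=0.0))  # non-zero even if the trigram is 0
```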
31 Katz's backing-off
- Use the higher-order estimate (with a discounted count) when the n-gram has been seen; otherwise back off to the lower-order model
32 Language models for Austen
- Built with the CMU-Cambridge Statistical Language Modeling Toolkit
- Katz's back-off with Good-Turing discounting
- Trigram is better than bigram
- Four-gram is slightly worse
- The back-off model is of little help when the longer contexts are poor predictors
33 Outline
- Purpose of Statistical NLP
- Forming Equivalence Classes
- Statistical Estimators
- Combining Estimators
- Conclusions
34 Conclusions
- According to Chen and Goodman (1996, 1998, 1999):
- Kneser-Ney smoothing is the best
- According to Church and Gale (1991):
- Good-Turing is the best
- (bigram models, 2 M-word texts)