Title: Statistical Inference: n-gram Models over Sparse Data
1 Statistical Inference: n-gram Models over Sparse Data
2 Outline
- Purpose of Statistical NLP
- Forming Equivalence Classes
- Statistical Estimators
- Combining Estimators
- Conclusions
3 Purpose of Statistical NLP
- Doing statistical inference for NLP
- Why statistical inference for NLP tasks?
- Some NLP data is generated by some (unknown) distribution
- We want to make inferences about that distribution
- How? Training data --> equivalence classes (EC)
- Finding good statistical estimators for the ECs
- Combining multiple estimators
4 Example: Language Modelling
- Purpose: predict the next word given the previous words
- Applications
- speech or optical character recognition
- spelling correction
- handwriting recognition
- machine translation
- Methods also applicable to
- word sense disambiguation
- probabilistic parsing
5 Outline
- Purpose of Statistical NLP
- Forming Equivalence Classes
- Statistical Estimators
- Combining Estimators
- Conclusions
6 Reliability vs. Discrimination
- Prediction: a mapping from the past to the future
- From classificatory features to a target feature
- Independence assumption: the data do not depend on other features (or only weakly)
- More features:
- More bins, greater discrimination
- Less training data per bin, lower statistical reliability
7 n-gram models
- Predicting the next word: estimating P(wm | w1 ... wm-1)
- Using the history h = w1 ... wm-1
- Markov assumption: only the prior n-1 words affect wm: P(wm | h) ≈ P(wm | wm-n+1 ... wm-1)
- n = 2: bigram, n = 3: trigram, n = 4: four-gram (see the sketch below)
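A minimal sketch (not from the slides) of how n-grams are read off a token sequence; the ngrams helper and the start-padding symbol <s> are illustrative choices made here, not part of the original material.

```python
from collections import Counter

def ngrams(tokens, n, pad="<s>"):
    # Pad with n-1 start symbols so every word has a full-length history.
    padded = [pad] * (n - 1) + list(tokens)
    return [tuple(padded[i:i + n]) for i in range(len(tokens))]

tokens = "sue swallowed the large green pill".split()
bigram_counts = Counter(ngrams(tokens, 2))    # n = 2: bigrams
trigram_counts = Counter(ngrams(tokens, 3))   # n = 3: trigrams
print(trigram_counts[("the", "large", "green")])  # -> 1
```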
8 How large should n be?
- Large n captures long-distance dependencies
- Example: Sue swallowed the large green ____
- The longer context (swallowed ...) predicts pill, frog
- The local context alone (the large green ...) would also allow tree, car, mountain
- But high-order n-gram models are not realistic:
- With a 20,000-word vocabulary, a bigram model has 400 million bins (20,000^2)
- and a trigram model has 8 trillion bins (20,000^3)
9 n-grams for Austen's novels
- Data available from Project Gutenberg
- 40 M of clean, plain-ASCII files
- Training data: Emma, Mansfield Park, Pride and Prejudice, Sense and Sensibility
- Test data: Persuasion
- Corpus: N = 617,091 word tokens, vocabulary V = 14,585 word types
- All punctuation left out
- Case distinctions kept
10 Outline
- Purpose of Statistical NLP
- Forming Equivalence Classes
- Statistical Estimators
- Combining Estimators
- Conclusions
11 Statistical Estimators
- n-gram model:
- P(wn | w1 ... wn-1) = P(w1 ... wn) / P(w1 ... wn-1)
- C(w1 w2 ... wn) = frequency of the n-gram w1 w2 ... wn
- w1 ... wn-1 = h, the history of preceding words
- N = number of training instances
- Problem: what if r = C(w1 w2 ... wn) is 0 or 1?
- Smoothing, using N0, N1, N2, ..., T1, T2, ...
- Nr = number of distinct n-grams with r training instances
- Tr = r * Nr = total count of the n-grams with r instances
12 Maximum Likelihood Estimation (MLE)
- MLE estimates from relative frequencies
- P(as) = C(as) / N
- P(as) = 8/10, P(more) = 1/10, P(a) = 1/10, P(x) = 0 for all x ∉ {as, more, a}
- PMLE(w1 ... wn) = C(w1 ... wn) / N
- PMLE(wn | w1 ... wn-1) = C(w1 ... wn) / C(w1 ... wn-1) (see the sketch below)
- PMLE gives the highest probability to the training sample
- Why? It wastes no probability mass on unseen events
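A minimal sketch of the MLE conditional estimate PMLE(wn | w1 ... wn-1) = C(w1 ... wn) / C(w1 ... wn-1), here for bigrams; the tiny sample sentence is illustrative, not the Austen corpus.

```python
from collections import Counter

tokens = "in person she was inferior to both sisters".split()
bigram_c = Counter(zip(tokens, tokens[1:]))   # bigram counts C(h, w)
unigram_c = Counter(tokens[:-1])              # history counts C(h)

def p_mle(w, h):
    # Zero for any unseen history or continuation -- the core MLE problem.
    return bigram_c[(h, w)] / unigram_c[h] if unigram_c[h] else 0.0

print(p_mle("was", "she"))   # 1.0 on this tiny sample
print(p_mle("not", "she"))   # 0.0 -- an unseen bigram gets no probability
```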
13 MLE's problem and solution
- When using the model to predict test data:
- Many (a majority of) word types are unseen
- Zero probabilities propagate and wipe out the other probabilities
- Is more data a solution?
- Never a general solution
- Consider all the numbers that can follow "the year"
- Solution:
- Decrease the probability of seen events
- Give non-zero probability to unseen events
14 Using MLE for n-grams of Austen
- Sentence: In person she was inferior to both sisters
- Unigram: not the best, but still useful for prediction
- Bigram: generally increases the probability
- Trigram: can work brilliantly, sometimes
- P(was | person she) = 0.5
- Four-gram: useless
- Intuition: use the highest-order n-gram available when possible
- But zero probabilities still exist
15 Laplace's law etc.
- Laplace's law, or "adding one"
- PLAP(w1 ... wn) = (C(w1 ... wn) + 1) / (N + B), where B is the number of bins
- fLAP(w1 ... wn) = (C(w1 ... wn) + 1) * N / (N + B)
- For r > 0, fLAP < fMLE = r
- For r = 0, fLAP > 0 (while fMLE = r = 0) (see the sketch below)
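A minimal sketch of Laplace's law, PLAP = (C + 1) / (N + B); the sizes N and B below are illustrative numbers, not the slides' data.

```python
def p_laplace(count, N, B):
    # Add one to every count; B is the total number of bins.
    return (count + 1) / (N + B)

N, B = 10, 1000                # illustrative: 10 training instances, 1000 bins
print(p_laplace(8, N, B))      # seen 8 times: heavily discounted vs. MLE 8/10
print(p_laplace(0, N, B))      # unseen: small but non-zero probability
```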
16 PLAP gives too much probability to N0
- Depends on the size of the vocabulary V = {wi}
- B >> N
- AP corpus:
- N = 22 M, V = 400 K
- B = V^2 = 160 G
- N0 = 75 G
- 46.5% of the probability mass goes to the N0 unseen bigrams (N0 * fLAP / N)
- Actually, only 9.2% of the instances in the AP test data are unseen
18 Lidstone's law, Jeffreys-Perks law (ELE)
- PLid(w1 ... wn) = (C(w1 ... wn) + λ) / (N + Bλ)
- An interpolation between the MLE and the uniform distribution: PLid(w1 ... wn) = μ * C(w1 ... wn)/N + (1 - μ) * 1/B, with μ = N/(N + λB)
- λ = 0.5: Jeffreys-Perks law (Expected Likelihood Estimation, ELE)
- Better than PLAP, but problems remain (see the sketch below):
- Which value to use for λ?
- Still a linear function of the MLE, depending only on B and N
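A minimal sketch of Lidstone's law and its interpolation reading; λ = 1 recovers Laplace and λ = 0.5 is the Jeffreys-Perks law. N and B are again illustrative sizes.

```python
def p_lidstone(count, N, B, lam=0.5):
    # PLid = (C + lambda) / (N + B*lambda)
    return (count + lam) / (N + B * lam)

def p_lidstone_interp(count, N, B, lam=0.5):
    # Equivalent form: mu*(C/N) + (1-mu)*(1/B), with mu = N/(N + B*lambda)
    mu = N / (N + B * lam)
    return mu * (count / N) + (1 - mu) * (1 / B)

N, B = 10, 1000
assert abs(p_lidstone(8, N, B) - p_lidstone_interp(8, N, B)) < 1e-12
```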
19 Applying MLE and ELE to Austen
- P(not | was): 0.065 -> 0.036 (MLE -> ELE)
- For bigrams, PELE < PMLE
- Still too much discounting? Yes
- P(she was inferior to both sisters)
- Bigram ELE: PELE = 6.89 x 10^-20 (λ = 0.5)
- Worse than the unigram MLE
- Lower probability than PMLE
20 Held out estimation
- Tr = Σ C_ho(w1 ... wn), summed over the n-grams with C_train(w1 ... wn) = r
- Tr (also written Tr,ho) is the total count in the held-out (HO) data of the n-grams that occur r times in training
- Tr / Nr is their average frequency in the HO data
- f_ho(w1 ... wn) = Tr,ho / Nr, where C_train(w1 ... wn) = r
- Use the held-out frequency to estimate P (see the sketch below):
- P_ho(w1 ... wn) = f_ho(w1 ... wn) / N
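A minimal sketch of held-out estimation, assuming the training and held-out counts are already available as Counters keyed by n-gram; the toy counts are made up for illustration.

```python
from collections import Counter, defaultdict

def held_out_probs(train_counts, ho_counts, n_ho):
    Nr = Counter(train_counts.values())             # Nr: types with training count r
    Tr = defaultdict(int)
    for ngram, r in train_counts.items():
        Tr[r] += ho_counts.get(ngram, 0)            # held-out occurrences of those types
    # Pho per count class r. Note: r = 0 is not covered here; N0 would be
    # the number of bins B minus the number of seen types.
    return {r: Tr[r] / (Nr[r] * n_ho) for r in Nr}

train = Counter({("she", "was"): 2, ("was", "inferior"): 1, ("to", "both"): 1})
ho = Counter({("she", "was"): 1, ("was", "inferior"): 2})
print(held_out_probs(train, ho, n_ho=3))
```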
21 Why held out data?
- Training data
- Study and develop the model
- Train the model
- Test data
- Should never be looked at during development
- Held-out data: a simulation of the test data
- Training data = Train + Held out
- The HO data simulates testing (HO ≈ test)
22 Gold standard to evaluate f
- In the training data:
- Nr = number of n-gram types with training count f_train = r
- Tr = total number of n-gram instances with f_train = r
- In the test data:
- Tr = total number of occurrences in the test data of the n-grams with f_train = r
- f_emp = Tr / Nr = Σ C_test(w1 ... wn) / Nr, summed over the (w1 ... wn) with C_train(w1 ... wn) = r
- Example:
- N0 = 10,000; 3 of these n-grams occur in the test data, with counts 1, 1, 2
- f_emp = (1 + 1 + 2) / 10,000 = 0.0004
23 Cross-validation (deleted estimation)
- Divide the N training instances into two parts, N^0 and N^1
- Nr^a = number of n-grams with count r in part N^a
- Tr^ab = total occurrences in part N^b of the n-grams that occur r times in part N^a
- P_ho(w1 ... wn) = Tr^01 / (Nr^0 * N) or Tr^10 / (Nr^1 * N)
- P_del(w1 ... wn) = (Tr^01 + Tr^10) / ((Nr^0 + Nr^1) * N) (see the sketch below)
- Effective, and close to the gold standard
- But it overestimates P for r = 0
- and underestimates P for r = 1
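A minimal sketch of deleted estimation with two halves of the training data represented as Counters; the halves and the value of N are illustrative.

```python
from collections import Counter, defaultdict

def deleted_estimate(counts0, counts1, N):
    Nr = [Counter(counts0.values()), Counter(counts1.values())]   # Nr^0, Nr^1
    Tr = [defaultdict(int), defaultdict(int)]                     # Tr^01, Tr^10
    for a in (0, 1):
        counts_a = counts0 if a == 0 else counts1
        counts_b = counts1 if a == 0 else counts0
        for ngram, r in counts_a.items():
            Tr[a][r] += counts_b.get(ngram, 0)
    rs = set(Nr[0]) | set(Nr[1])
    # Pdel = (Tr^01 + Tr^10) / (N * (Nr^0 + Nr^1)) for each count class r
    return {r: (Tr[0][r] + Tr[1][r]) / (N * (Nr[0][r] + Nr[1][r])) for r in rs}

half0 = Counter({("she", "was"): 2, ("to", "both"): 1})
half1 = Counter({("she", "was"): 1, ("was", "inferior"): 1})
print(deleted_estimate(half0, half1, N=5))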
24 Leave one out
- Two parts of training data: divide the training data into parts of sizes (N-1, 1)
- P_del(w1 ... wn) = (Tr^01 + Tr^10) / ((Nr^0 + Nr^1) * N), where C(w1 ... wn) = r
- Rotate the left-out instance, N times in total
- Closely related to the Good-Turing method
25 Good-Turing estimation
- Good (1953) attributes the method to Turing
- Based on a binomial distribution
- Works well for many situations, including n-grams
- PGT = r*/N, with r* = (r+1) E(N_{r+1}) / E(N_r) (see the sketch below)
- A redistribution of probability mass from seen to unseen events
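A minimal sketch of the Good-Turing adjusted count r* = (r+1) N_{r+1} / N_r, using the raw counts-of-counts Nr as the estimate of E(Nr); the toy n-gram counts are illustrative.

```python
from collections import Counter

def good_turing_rstar(count_of_counts, r):
    # count_of_counts maps r -> Nr
    nr, nr1 = count_of_counts.get(r, 0), count_of_counts.get(r + 1, 0)
    return (r + 1) * nr1 / nr if nr else 0.0

ngram_counts = Counter({"a b": 3, "b c": 1, "c d": 1, "d e": 1, "e f": 2})
Nr = Counter(ngram_counts.values())     # counts of counts, e.g. {1: 3, 2: 1, 3: 1}
N = sum(ngram_counts.values())          # total training instances
r_star_1 = good_turing_rstar(Nr, 1)     # (1+1) * N2 / N1
print(r_star_1, r_star_1 / N)           # adjusted count and PGT for r = 1
```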
26 Issues with Using Good-Turing
- Problems
- r* = 0 for the largest observed r, because E(N_{r+1}) = 0
- Nr is (roughly) monotonically decreasing but not smooth
- Solutions
- Adjust r only when r < k (e.g., k = 10)
- Use a smoothed value S(r) instead of Nr
- Renormalize so that the probabilities sum to 1
27 Simple Good-Turing
- Due to Gale and Sampson (1995)
- What should S(r) be?
- For low r, use Nr directly: S(r) = Nr
- For high r:
- Replace Nr by S(r) = A * r^b (with b < -1), i.e. log Nr = a + b log r
- Estimate a and b by linear regression over the high-r region (see the sketch below)
- The resulting probabilities must be renormalized to sum to 1
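A minimal sketch of the Simple Good-Turing idea: fit log Nr = a + b*log r by least squares and use S(r) = exp(a) * r^b in place of Nr for large r. Gale and Sampson's full recipe (Zr averaging, switching between raw and smoothed estimates) is omitted; the counts-of-counts below are made up.

```python
import math

def fit_log_log(count_of_counts):
    # Ordinary least squares on (log r, log Nr).
    pts = [(math.log(r), math.log(nr)) for r, nr in count_of_counts.items() if nr > 0]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    b = sum((x - mx) * (y - my) for x, y in pts) / sum((x - mx) ** 2 for x, _ in pts)
    a = my - b * mx
    return a, b

def smoothed_Nr(a, b, r):
    return math.exp(a) * r ** b          # S(r) = A * r^b, with log A = a

Nr = {1: 120, 2: 40, 3: 24, 4: 13, 5: 15, 6: 5, 7: 8}   # illustrative counts-of-counts
a, b = fit_log_log(Nr)
print(b, smoothed_Nr(a, b, 10))          # the fitted slope b should come out below -1
```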
28 Briefly noted
- Absolute discounting:
- Pabs(w1 ... wn) = (r - δ)/N if r > 0; Pabs(w1 ... wn) = (B - N0) δ / (N0 N) if r = 0 (see the sketch below)
- δ ≈ 0.77 works best, except for r = 1
- Linear discounting:
- Pld(w1 ... wn) = (1 - α) r / N if r > 0; Pld(w1 ... wn) = α / N0 if r = 0
- Cannot really be justified:
- it discounts too much from high-count events,
- and high-count events are the statistically most reliable ones
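A minimal sketch of absolute discounting as given on the slide: every seen count loses a fixed δ, and the freed mass (B - N0)·δ is shared equally by the N0 unseen bins. The bin and count values are hypothetical.

```python
def p_abs(r, N, B, N0, delta=0.77):
    if r > 0:
        return (r - delta) / N
    return (B - N0) * delta / (N0 * N)

# Illustrative check with hypothetical sizes: probabilities sum to 1 over all B bins.
counts = [3, 2, 1, 1]                       # counts of the seen bins
B, N = 10, sum(counts)
N0 = B - len(counts)                        # number of unseen bins
total = sum(p_abs(r, N, B, N0) for r in counts) + N0 * p_abs(0, N, B, N0)
print(round(total, 10))                     # -> 1.0
```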
29 Outline
- Purpose of Statistical NLP
- Forming Equivalence Classes
- Statistical Estimators
- Combining Estimators
- Conclusions
30 Combining Estimators
- Simple linear interpolation:
- Pli(wn | wn-2, wn-1) = λ1 P(wn) + λ2 P(wn | wn-1) + λ3 P(wn | wn-2, wn-1) (see the sketch below)
- The λi are trained with the EM algorithm
- Gives good results
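A minimal sketch of simple linear interpolation of unigram, bigram and trigram estimates. The λ values here are fixed illustrative weights; in practice they would be trained with EM on held-out data, which is not shown.

```python
def p_interp(p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    # Weighted mixture of the three estimates; the weights must sum to 1.
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Combine P(w), P(w | w_{n-1}), P(w | w_{n-2}, w_{n-1}) for one prediction:
print(p_interp(p_uni=0.001, p_bi=0.01, p_tri=0.0))  # non-zero even if the trigram is 0
```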
31 Katz's backing-off
- Use the higher-order estimate (with a discounted count) when the n-gram has been seen; otherwise back off to the lower-order model
32 Language models for Austen
- Built with the CMU-Cambridge Statistical Language Modeling Toolkit
- Katz's back-off with Good-Turing discounting
- Trigram is better than bigram
- Four-gram is slightly worse
- The back-off model is of little help when the longer contexts are poor predictors
33 Outline
- Purpose of Statistical NLP
- Forming Equivalence Classes
- Statistical Estimators
- Combining Estimators
- Conclusions
34 Conclusions
- According to Chen and Goodman (1996, 1998, 1999):
- Kneser-Ney smoothing is the best
- According to Church and Gale (1991):
- Good-Turing is the best
- (bigram models, 2 M-word texts)