Title: 6. N-GRAMs
Word prediction
- Example: I'd like to make a collect ___ (call, telephone, or person-to-person?)
- Spelling error detection
- Augmentative communication
- Context-sensitive spelling error correction
Language Model
- Language Model (LM): a statistical model of word sequences
- n-gram: uses the previous n-1 words to predict the next word
Applications
- Context-sensitive spelling error detection and correction
- He is trying to fine out. (fine should be find)
- The design an construction will take a year. (an should be and)
- Machine translation
Counting Words in Corpora
- Corpora (on-line text collections)
- Which words to count
- What we are going to count
- Where we are going to find the things to count
Brown Corpus
- 1 million words
- 500 texts
- Varied genres (newspaper, novels, non-fiction, academic, etc.)
- Assembled at Brown University in 1963-64
- The first large on-line text collection used in corpus-based NLP research
Issues in Word Counting
- Punctuation symbols (. , ? !)
- Capitalization (He vs. he, Bush vs. bush)
- Inflected forms (cat vs. cats)
- Wordform: cat, cats, eat, eats, ate, eating, eaten
- Lemma (stem): cat, eat
Types vs. Tokens
- Tokens (N): the total number of running words
- Types (B): the number of distinct words in a corpus (the size of the vocabulary)
- Example: They picnicked by the pool, then lay back on the grass and looked at the stars.
- 16 word tokens, 14 word types (not counting punctuation)
- Here, types will mean wordform types rather than lemma types, and punctuation marks will generally be counted as words
How Many Words in English?
- Shakespeare's complete works
- 884,647 wordform tokens
- 29,066 wordform types
- Brown Corpus
- 1 million wordform tokens
- 61,805 wordform types
- 37,851 lemma types
Simple (Unsmoothed) N-grams
- Task: estimating the probability of a word
- First attempt
- Suppose no corpus is available
- Use a uniform distribution
- Assume V word types (e.g., 100,000)
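With no data, every one of the V word types gets the same probability:

\[ P(w) = \frac{1}{V} = \frac{1}{100{,}000} = 10^{-5} \]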
Simple (Unsmoothed) N-grams
- Task: estimating the probability of a word
- Second attempt
- Suppose a corpus is available
- Assume
- N word tokens
- C(w): the number of times w appears in the corpus
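This yields the relative-frequency estimate:

\[ P(w) = \frac{C(w)}{N} \]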
Simple (Unsmoothed) N-grams
- Task: estimating the probability of a word
- Third attempt
- Suppose a corpus is available
- Assume a word depends on its n-1 previous words
Simple (Unsmoothed) N-grams
- n-gram approximation: w_k depends only on its previous n-1 words, as written out below
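By the chain rule, the probability of a word sequence factors into conditional probabilities, and the n-gram approximation truncates each history to the last n-1 words:

\[ P(w_1^k) = \prod_{i=1}^{k} P(w_i \mid w_1^{i-1}) \approx \prod_{i=1}^{k} P(w_i \mid w_{i-n+1}^{i-1}) \]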
Bigram Approximation
- Example: P(I want to eat British food)
- P(I|<s>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British)
- <s>: a special word meaning start of sentence
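A minimal sketch of how such bigram probabilities are estimated and multiplied; the toy corpus, counts, and function names here are made-up illustrations, not data from the course:

    from collections import Counter

    # Toy corpus with <s>/</s> sentence-boundary markers (made-up example).
    corpus = "<s> I want to eat British food </s> <s> I want Chinese food </s>".split()

    unigram_counts = Counter(corpus)
    bigram_counts = Counter(zip(corpus, corpus[1:]))

    def p_bigram(prev, word):
        # MLE estimate: P(word | prev) = C(prev word) / C(prev)
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    def sentence_prob(words):
        # Multiply the bigram probabilities across the sentence.
        prob = 1.0
        for prev, word in zip(words, words[1:]):
            prob *= p_bigram(prev, word)
        return prob

    print(sentence_prob("<s> I want to eat British food </s>".split()))  # 0.5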
Note on Practical Problem
- Multiplying many probabilities results in a very small number and can cause numerical underflow
- Use log probabilities (logprobs) instead in the actual computation
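In log space the product becomes a sum; a sketch reusing the hypothetical p_bigram from the example above:

    import math

    def sentence_logprob(words):
        # Summing log probabilities avoids underflow on long sentences.
        return sum(math.log(p_bigram(prev, word))
                   for prev, word in zip(words, words[1:]))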
Estimating N-gram Probability
- Maximum Likelihood Estimate (MLE)
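The MLE is simply the n-gram's relative frequency among all n-grams that share its history; for a bigram:

\[ P(w_k \mid w_{k-1}) = \frac{C(w_{k-1} w_k)}{C(w_{k-1})} \]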
Estimating Bigram Probability
- Example
- C(to eat) = 860
- C(to) = 3256
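Plugging the counts into the MLE formula:

\[ P(\text{eat} \mid \text{to}) = \frac{C(\text{to eat})}{C(\text{to})} = \frac{860}{3256} \approx 0.26 \]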
Two Important Facts
- N-gram models become increasingly accurate as we increase the value of N
- N-gram models depend very strongly on their training corpus (in particular its genre and its size in words)
Smoothing
- Any particular training corpus is finite
- Sparse data problem: some perfectly valid n-grams will be missing from it
- Need to deal with zero probabilities
Smoothing
- Smoothing: re-evaluating zero-probability n-grams and assigning them non-zero probability
- Also called discounting: lowering non-zero n-gram counts in order to assign some probability mass to the zero-count n-grams
Add-One Smoothing for Bigram
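Add-one (Laplace) smoothing adds 1 to every bigram count and renormalizes with the vocabulary size V:

\[ P_{\text{add-1}}(w_k \mid w_{k-1}) = \frac{C(w_{k-1} w_k) + 1}{C(w_{k-1}) + V} \]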
Things Seen Once
- Use the count of things seen once to help
estimate the count of things never seen
Witten-Bell Discounting
Witten-Bell Discounting for Bigram
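A sketch of the standard Witten-Bell bigram form (notation follows Jurafsky & Martin; T and Z are defined here, not taken from the slides): the probability mass reserved for unseen bigrams after w_{k-1} is estimated from T(w_{k-1}), the number of distinct word types observed after w_{k-1}:

\[ P_{\text{WB}}(w_k \mid w_{k-1}) =
\begin{cases}
\dfrac{C(w_{k-1} w_k)}{C(w_{k-1}) + T(w_{k-1})} & \text{if } C(w_{k-1} w_k) > 0 \\[2ex]
\dfrac{T(w_{k-1})}{Z(w_{k-1}) \left( C(w_{k-1}) + T(w_{k-1}) \right)} & \text{otherwise}
\end{cases} \]

where Z(w_{k-1}) is the number of word types never seen after w_{k-1}.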
Good-Turing Discounting for Bigram
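Good-Turing uses the count of things seen once (above) to re-estimate counts: if N_c is the number of n-grams occurring exactly c times, the discounted count is

\[ c^* = (c + 1) \frac{N_{c+1}}{N_c} \]

and the total probability mass assigned to unseen n-grams is N_1 / N.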
Backoff
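A sketch of the standard Katz backoff form for bigrams: use the discounted bigram estimate when the bigram was seen, otherwise back off to the unigram, scaled so the probabilities still sum to one:

\[ P_{\text{katz}}(w_k \mid w_{k-1}) =
\begin{cases}
P^*(w_k \mid w_{k-1}) & \text{if } C(w_{k-1} w_k) > 0 \\
\alpha(w_{k-1}) \, P(w_k) & \text{otherwise}
\end{cases} \]

where P^* is a discounted estimate and \alpha(w_{k-1}) distributes the left-over probability mass.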
Entropy
- Measure of uncertainty
- Used to evaluate the quality of n-gram models (how well a language model matches a given language)
- Entropy H(X) of a random variable X
- Measured in bits
- The number of bits needed to encode information in the optimal coding scheme
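For a random variable X with probability distribution p:

\[ H(X) = -\sum_{x} p(x) \log_2 p(x) \]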
Example 1
Example 2
Perplexity
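Perplexity is the evaluation measure most often reported in practice; it is a simple transform of entropy:

\[ \text{PP}(W) = 2^{H(W)} = P(w_1 \cdots w_N)^{-1/N} \]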
Entropy of a Sequence
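The per-word entropy of sequences of length n from a language L:

\[ \frac{1}{n} H(w_1^n) = -\frac{1}{n} \sum_{w_1^n \in L} p(w_1^n) \log_2 p(w_1^n) \]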
Entropy of a Language
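Taking the limit over increasingly long sequences gives the entropy rate of the language L:

\[ H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1^n) \]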
Cross Entropy
- Used for comparing two language models
- p: the actual probability distribution that generated some data
- m: a model of p (an approximation to p)
- Cross entropy of m on p
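Sequences are drawn according to p but scored by the model m:

\[ H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \sum_{w_1^n \in L} p(w_1^n) \log_2 m(w_1^n) \]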
Cross Entropy
- By the Shannon-McMillan-Breiman theorem, the cross entropy can be estimated from a single sufficiently long sample (see below)
- Property of cross entropy: H(p) ≤ H(p, m)
- The difference between H(p,m) and H(p) is a measure of how accurate model m is
- The more accurate the model, the lower its cross-entropy
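For a stationary ergodic process, the Shannon-McMillan-Breiman theorem lets the expectation over sequences be replaced by a single long sample:

\[ H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \log_2 m(w_1^n) \]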