6. N-GRAMs

1
6. N-GRAMs

2
Word prediction
  • I'd like to make a collect ...
  • Likely next words: call, telephone, or
    person-to-person
  • Spelling error detection
  • Augmentative communication
  • Context-sensitive spelling error correction

3
Language Model
  • Language Model (LM)
  • statistical model of word sequences
  • n-gram: use the previous n-1 words to predict
    the next word
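  (For example, a bigram model (n = 2) predicts the
  next word from only the single preceding word,
  while a trigram model (n = 3) uses the two
  preceding words.)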

4
Applications
  • context-sensitive spelling error detection and
    correction
  • He is trying to fine out. ("fine" for "find")
  • The design an construction will take a year.
    ("an" for "and")
  • machine translation

5
Counting Words in Corpora
  • Corpora (on-line text collections)
  • Which words to count
  • What we are going to count
  • Where we are going to find the things to count

6
Brown Corpus
  • 1 million words
  • 500 texts
  • Varied genres (newspaper, novels, non-fiction,
    academic, etc.)
  • Assembled at Brown University in 1963-64
  • The first large on-line text collection used in
    corpus-based NLP research

7
Issues in Word Counting
  • Punctuation symbols (. , ? !)
  • Capitalization (He vs. he, Bush vs. bush)
  • Inflected forms (cat vs. cats)
  • Wordform: cat, cats, eat, eats, ate, eating,
    eaten
  • Lemma (stem): cat, eat

8
Types vs. Tokens
  • Tokens (N): total number of running words
  • Types (B): number of distinct words in a corpus
    (size of the vocabulary)
  • Example
  • They picnicked by the pool, then lay back on the
    grass and looked at the stars.
  • 16 word tokens, 14 word types (not counting
    punctuation)
  • Note: here, types will mean wordform types (not
    lemma types), and punctuation marks will
    generally be counted as words

9
How Many Words in English?
  • Shakespeare's complete works
  • 884,647 wordform tokens
  • 29,066 wordform types
  • Brown Corpus
  • 1 million wordform tokens
  • 61,805 wordform types
  • 37,851 lemma types

10
Simple (Unsmoothed) N-grams
  • Task: estimating the probability of a word
  • First attempt
  • Suppose there is no corpus available
  • Use a uniform distribution
  • Assume
  • Number of word types: V (e.g., 100,000)
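A sketch of the uniform estimate this first attempt describes (reconstructed; the slide's own formula is not in the transcript):

  P(w) = 1 / V

so with V = 100,000 word types, every word gets P(w) = 1/100,000 = 0.00001.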

11
Simple (Unsmoothed) N-grams
  • Task: estimating the probability of a word
  • Second attempt
  • Suppose there is a corpus
  • Assume
  • Number of word tokens: N
  • Number of times w appears in the corpus: C(w)
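The corresponding relative-frequency (unigram maximum likelihood) estimate, reconstructed from the quantities defined on this slide:

  P(w) = C(w) / N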

12
Simple (Unsmoothed) N-grams
  • Task: estimating the probability of a word
  • Third attempt
  • Suppose there is a corpus
  • Assume a word depends only on its n-1 previous
    words

13
Simple (Unsmoothed) N-grams
14
Simple (Unsmoothed) N-grams
  • n-gram approximation
  • w_k depends only on its previous n-1 words
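Written out (standard notation; the slide's own equations are not in the transcript), the chain rule and the n-gram approximation are:

  P(w_1 ... w_k) = \prod_{i=1}^{k} P(w_i | w_1 ... w_{i-1})
                 \approx \prod_{i=1}^{k} P(w_i | w_{i-n+1} ... w_{i-1})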

15
Bigram Approximation
  • Example
  • P(I want to eat British food)
  • P(I|<s>) P(want|I) P(to|want) P(eat|to)
    P(British|eat) P(food|British)
  • <s>: a special word meaning start of sentence
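A minimal Python sketch of the bigram computation above; the probability values are hypothetical placeholders, not numbers from the slides:

    # Bigram sentence probability, unsmoothed (hypothetical probabilities).
    bigram_prob = {
        ("<s>", "I"): 0.25, ("I", "want"): 0.32, ("want", "to"): 0.65,
        ("to", "eat"): 0.26, ("eat", "British"): 0.001, ("British", "food"): 0.60,
    }

    def sentence_prob(words, probs):
        p, prev = 1.0, "<s>"
        for w in words:
            p *= probs.get((prev, w), 0.0)  # unseen bigrams get probability 0 here
            prev = w
        return p

    print(sentence_prob("I want to eat British food".split(), bigram_prob))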

16
Note on Practical Problem
  • Multiplying many probabilities results in a very
    small number and can cause numerical underflow
  • Use logprob instead in the actual computation
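The same computation in log space, as a sketch of how logprobs avoid underflow (assumes the hypothetical bigram_prob table from the earlier sketch):

    import math

    def sentence_logprob(words, probs):
        # Sum log probabilities instead of multiplying raw probabilities.
        logp, prev = 0.0, "<s>"
        for w in words:
            logp += math.log(probs[(prev, w)])  # log(0)/missing-key handling omitted in this sketch
            prev = w
        return logp  # compare models by log probability; exponentiate only if needed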

17
Estimating N-gram Probability
  • Maximum Likelihood Estimate (MLE)
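For a bigram model, the MLE referred to here is (reconstructed in standard notation):

  P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})

i.e., the bigram count divided by the count of the preceding word.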

18
(No Transcript)
19
Estimating Bigram Probability
  • Example
  • C(to eat) = 860
  • C(to) = 3256
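Carrying the example through with the MLE above: P(eat | to) = C(to eat) / C(to) = 860 / 3256 ≈ 0.26.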

20
(No Transcript)
21
Two Important Facts
  • The increasing accuracy of N-gram models as we
    increase the value of N
  • Very strong dependency on their training corpus
    (in particular its genre and its size in words)

22
Smoothing
  • Any particular training corpus is finite
  • Sparse data problem
  • Need to deal with zero probabilities

23
Smoothing
  • Smoothing
  • Reevaluating zero probability n-grams and
    assigning them non-zero probability
  • Also called Discounting
  • Lowering non-zero n-gram counts in order to
    assign some probability mass to the zero n-grams

24
Add-One Smoothing for Bigram
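A sketch of the add-one (Laplace) estimate for bigrams, in standard notation (V is the vocabulary size; the slide's own formula is not transcribed):

  P_{add-1}(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)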
25
(No Transcript)
26
(No Transcript)
27
Things Seen Once
  • Use the count of things seen once to help
    estimate the count of things never seen

28
Witten-Bell Discounting
29
Witten-Bell Discounting for Bigram
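A sketch of the standard Witten-Bell formulation for bigrams (reconstructed, since the slide content is not transcribed): the probability mass reserved for bigrams never seen after w_{i-1} is T(w_{i-1}) / (N(w_{i-1}) + T(w_{i-1})), where N(w_{i-1}) is the number of bigram tokens and T(w_{i-1}) the number of distinct bigram types observed after w_{i-1}; each seen bigram gets C(w_{i-1} w_i) / (N(w_{i-1}) + T(w_{i-1})).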
30
Witten-Bell Discounting for Bigram
31
  • Seen counts vs. unseen counts

32
(No Transcript)
33
Good-Turing Discounting for Bigram
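The Good-Turing re-estimated count, in standard notation (reconstruction; N_c is the number of n-gram types that occur exactly c times in training):

  c^* = (c + 1) N_{c+1} / N_c

so the total probability mass assigned to unseen n-grams is estimated from N_1, the number of n-grams seen exactly once.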
34
(No Transcript)
35
Backoff
36
Backoff
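A sketch of the backoff idea in its usual (Katz-style) form, since the slide equations are not transcribed: use the discounted trigram estimate when the trigram has been seen, otherwise back off to the bigram (and from the bigram to the unigram), scaled by a factor alpha so the distribution still sums to one:

  P_bo(w_n | w_{n-2} w_{n-1}) = P*(w_n | w_{n-2} w_{n-1})                    if C(w_{n-2} w_{n-1} w_n) > 0
                              = alpha(w_{n-2} w_{n-1}) P_bo(w_n | w_{n-1})   otherwise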
37
Entropy
  • Measure of uncertainty
  • Used to evaluate quality of n-gram models (how
    well a language model matches a given language)
  • Entropy H(X) of a random variable X
  • Measured in bits
  • Number of bits to encode information in the
    optimal coding scheme
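The definition referred to above, written out:

  H(X) = - \sum_x p(x) \log_2 p(x)

For example, a fair coin flip has entropy 1 bit, and a fair eight-sided die has entropy 3 bits.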

38
Example 1
39
Example 2
40
Perplexity
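Perplexity in terms of entropy and of the test-sequence probability (standard relations; the slide's own formula is not transcribed):

  PP(W) = 2^{H(W)} = P(w_1 w_2 ... w_N)^{-1/N}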
41
Entropy of a Sequence
42
Entropy of a Language
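The per-word entropy rate of a language L, in its standard form (reconstruction):

  H(L) = \lim_{n \to \infty} (1/n) H(w_1 w_2 ... w_n)
       = \lim_{n \to \infty} -(1/n) \sum_{w_1 ... w_n \in L} p(w_1 ... w_n) \log p(w_1 ... w_n)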
43
Cross Entropy
  • Used for comparing two language models
  • p: the actual probability distribution that
    generated some data
  • m: a model of p (an approximation to p)
  • Cross entropy of m on p
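Written out in its standard form (reconstruction), the cross entropy of m on p over long word sequences is:

  H(p, m) = \lim_{n \to \infty} -(1/n) \sum_{w_1 ... w_n} p(w_1 ... w_n) \log m(w_1 ... w_n)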

44
Cross Entropy
  • By the Shannon-McMillan-Breiman theorem
  • Property of cross entropy
  • Difference between H(p,m) and H(p) is a measure
    of how accurate model m is
  • The more accurate a model, the lower its
    cross-entropy
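The simplification the theorem licenses, in standard form (reconstruction): for a single sufficiently long sample drawn from p,

  H(p, m) = \lim_{n \to \infty} -(1/n) \log m(w_1 w_2 ... w_n)

and since H(p, m) >= H(p), the gap H(p, m) - H(p) measures how far the model m is from p.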