6. N-GRAMs

1
6. N-GRAMs

2
Word prediction
  • I'd like to make a collect ...
  • Likely next words: call, telephone, or
    person-to-person
  • Spelling error detection
  • Augmentative communication
  • Context-sensitive spelling error correction

3
Language Model
  • Language Model (LM)
  • statistical model of word sequences
  • n-gram: use the previous n-1 words to predict
    the next word
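  (For example, a bigram model (n = 2) predicts the
  next word from only the single preceding word,
  while a trigram model (n = 3) uses the two
  preceding words.)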

4
Applications
  • context-sensitive spelling error detection and
    correction
  • He is trying to fine out. ("fine" for "find")
  • The design an construction will take a year.
    ("an" for "and")
  • machine translation

5
Counting Words in Corpora
  • Corpora (on-line text collections)
  • Which words to count
  • What we are going to count
  • Where we are going to find the things to count

6
Brown Corpus
  • 1 million words
  • 500 texts
  • Varied genres (newspaper, novels, non-fiction,
    academic, etc.)
  • Assembled at Brown University in 1963-64
  • The first large on-line text collection used in
    corpus-based NLP research

7
Issues in Word Counting
  • Punctuation symbols (. , ? !)
  • Capitalization (He vs. he, Bush vs. bush)
  • Inflected forms (cat vs. cats)
  • Wordform: cat, cats, eat, eats, ate, eating,
    eaten
  • Lemma (stem): cat, eat

8
Types vs. Tokens
  • Tokens (N): total number of running words
  • Types (B): number of distinct words in a corpus
    (size of the vocabulary)
  • Example
  • They picnicked by the pool, then lay back on the
    grass and looked at the stars.
  • 16 word tokens, 14 word types (not counting
    punctuation)
  • Note: here, types will mean wordform types (not
    lemma types), and punctuation marks will
    generally be counted as words

9
How Many Words in English?
  • Shakespeare's complete works
  • 884,647 wordform tokens
  • 29,066 wordform types
  • Brown Corpus
  • 1 million wordform tokens
  • 61,805 wordform types
  • 37,851 lemma types

10
Simple (Unsmoothed) N-grams
  • Task: estimating the probability of a word
  • First attempt
  • Suppose there is no corpus available
  • Use a uniform distribution
  • Assume
  • Number of word types: V (e.g., 100,000)
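A sketch of the uniform estimate this first attempt describes (reconstructed; the slide's own formula is not in the transcript):

  P(w) = 1 / V

so with V = 100,000 word types, every word gets P(w) = 1/100,000 = 0.00001.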

11
Simple (Unsmoothed) N-grams
  • Task: estimating the probability of a word
  • Second attempt
  • Suppose there is a corpus
  • Assume
  • Number of word tokens: N
  • Number of times w appears in the corpus: C(w)
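The corresponding relative-frequency (unigram maximum likelihood) estimate, reconstructed from the quantities defined on this slide:

  P(w) = C(w) / N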

12
Simple (Unsmoothed) N-grams
  • Task: estimating the probability of a word
  • Third attempt
  • Suppose there is a corpus
  • Assume a word depends only on its n-1 previous
    words

13
Simple (Unsmoothed) N-grams
14
Simple (Unsmoothed) N-grams
  • n-gram approximation
  • w_k depends only on its previous n-1 words
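Written out (standard notation; the slide's own equations are not in the transcript), the chain rule and the n-gram approximation are:

  P(w_1 ... w_k) = \prod_{i=1}^{k} P(w_i | w_1 ... w_{i-1})
                 \approx \prod_{i=1}^{k} P(w_i | w_{i-n+1} ... w_{i-1})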

15
Bigram Approximation
  • Example
  • P(I want to eat British food)
  • P(I|<s>) P(want|I) P(to|want) P(eat|to)
    P(British|eat) P(food|British)
  • <s>: a special word meaning start of sentence
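A minimal Python sketch of the bigram computation above; the probability values are hypothetical placeholders, not numbers from the slides:

    # Bigram sentence probability, unsmoothed (hypothetical probabilities).
    bigram_prob = {
        ("<s>", "I"): 0.25, ("I", "want"): 0.32, ("want", "to"): 0.65,
        ("to", "eat"): 0.26, ("eat", "British"): 0.001, ("British", "food"): 0.60,
    }

    def sentence_prob(words, probs):
        p, prev = 1.0, "<s>"
        for w in words:
            p *= probs.get((prev, w), 0.0)  # unseen bigrams get probability 0 here
            prev = w
        return p

    print(sentence_prob("I want to eat British food".split(), bigram_prob))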

16
Note on Practical Problem
  • Multiplying many probabilities results in a very
    small number and can cause numerical underflow
  • Use logprob instead in the actual computation
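The same computation in log space, as a sketch of how logprobs avoid underflow (assumes the hypothetical bigram_prob table from the earlier sketch):

    import math

    def sentence_logprob(words, probs):
        # Sum log probabilities instead of multiplying raw probabilities.
        logp, prev = 0.0, "<s>"
        for w in words:
            logp += math.log(probs[(prev, w)])  # log(0)/missing-key handling omitted in this sketch
            prev = w
        return logp  # compare models by log probability; exponentiate only if needed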

17
Estimating N-gram Probability
  • Maximum Likelihood Estimate (MLE)
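For a bigram model, the MLE referred to here is (reconstructed in standard notation):

  P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})

i.e., the bigram count divided by the count of the preceding word.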

18
(No Transcript)
19
Estimating Bigram Probability
  • Example
  • C(to eat) = 860
  • C(to) = 3256
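Carrying the example through with the MLE above: P(eat | to) = C(to eat) / C(to) = 860 / 3256 ≈ 0.26.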

20
(No Transcript)
21
Two Important Facts
  • The increasing accuracy of N-gram models as we
    increase the value of N
  • Very strong dependency on their training corpus
    (in particular its genre and its size in words)

22
Smoothing
  • Any particular training corpus is finite
  • Sparse data problem
  • Need to deal with zero probabilities

23
Smoothing
  • Smoothing
  • Reevaluating zero probability n-grams and
    assigning them non-zero probability
  • Also called Discounting
  • Lowering non-zero n-gram counts in order to
    assign some probability mass to the zero n-grams

24
Add-One Smoothing for Bigram
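A sketch of the add-one (Laplace) estimate for bigrams, in standard notation (V is the vocabulary size; the slide's own formula is not transcribed):

  P_{add-1}(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)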
25
(No Transcript)
26
(No Transcript)
27
Things Seen Once
  • Use the count of things seen once to help
    estimate the count of things never seen

28
Witten-Bell Discounting
29
Witten-Bell Discounting for Bigram
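A sketch of the standard Witten-Bell formulation for bigrams (reconstructed, since the slide content is not transcribed): the probability mass reserved for bigrams never seen after w_{i-1} is T(w_{i-1}) / (N(w_{i-1}) + T(w_{i-1})), where N(w_{i-1}) is the number of bigram tokens and T(w_{i-1}) the number of distinct bigram types observed after w_{i-1}; each seen bigram gets C(w_{i-1} w_i) / (N(w_{i-1}) + T(w_{i-1})).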
30
Witten-Bell Discounting for Bigram
31
  • Seen counts vs. unseen counts

32
(No Transcript)
33
Good-Turing Discounting for Bigram
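The Good-Turing re-estimated count, in standard notation (reconstruction; N_c is the number of n-gram types that occur exactly c times in training):

  c^* = (c + 1) N_{c+1} / N_c

so the total probability mass assigned to unseen n-grams is estimated from N_1, the number of n-grams seen exactly once.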
34
(No Transcript)
35
Backoff
36
Backoff
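A sketch of the backoff idea in its usual (Katz-style) form, since the slide equations are not transcribed: use the discounted trigram estimate when the trigram has been seen, otherwise back off to the bigram (and from the bigram to the unigram), scaled by a factor alpha so the distribution still sums to one:

  P_bo(w_n | w_{n-2} w_{n-1}) = P*(w_n | w_{n-2} w_{n-1})                    if C(w_{n-2} w_{n-1} w_n) > 0
                              = alpha(w_{n-2} w_{n-1}) P_bo(w_n | w_{n-1})   otherwise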
37
Entropy
  • Measure of uncertainty
  • Used to evaluate quality of n-gram models (how
    well a language model matches a given language)
  • Entropy H(X) of a random variable X
  • Measured in bits
  • Number of bits to encode information in the
    optimal coding scheme
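The definition referred to above, written out:

  H(X) = - \sum_x p(x) \log_2 p(x)

For example, a fair coin flip has entropy 1 bit, and a fair eight-sided die has entropy 3 bits.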

38
Example 1
39
Example 2
40
Perplexity
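Perplexity in terms of entropy and of the test-sequence probability (standard relations; the slide's own formula is not transcribed):

  PP(W) = 2^{H(W)} = P(w_1 w_2 ... w_N)^{-1/N}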
41
Entropy of a Sequence
42
Entropy of a Language
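The per-word entropy rate of a language L, in its standard form (reconstruction):

  H(L) = \lim_{n \to \infty} (1/n) H(w_1 w_2 ... w_n)
       = \lim_{n \to \infty} -(1/n) \sum_{w_1 ... w_n \in L} p(w_1 ... w_n) \log p(w_1 ... w_n)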
43
Cross Entropy
  • Used for comparing two language models
  • p: the actual probability distribution that
    generated some data
  • m: a model of p (an approximation to p)
  • Cross entropy of m on p
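Written out in its standard form (reconstruction), the cross entropy of m on p over long word sequences is:

  H(p, m) = \lim_{n \to \infty} -(1/n) \sum_{w_1 ... w_n} p(w_1 ... w_n) \log m(w_1 ... w_n)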

44
Cross Entropy
  • By the Shannon-McMillan-Breiman theorem
  • Property of cross entropy
  • Difference between H(p,m) and H(p) is a measure
    of how accurate model m is
  • The more accurate a model, the lower its
    cross-entropy
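The simplification the theorem licenses, in standard form (reconstruction): for a single sufficiently long sample drawn from p,

  H(p, m) = \lim_{n \to \infty} -(1/n) \log m(w_1 w_2 ... w_n)

and since H(p, m) >= H(p), the gap H(p, m) - H(p) measures how far the model m is from p.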