Title: N-gram models and the sparsity problem
1. N-gram models and the sparsity problem
2. The task
- Find a probability distribution for the current word in a text (utterance, etc.), given what the last n words have been (n = 0, 1, 2, 3).
- Why this is reasonable
- What the problems are
3. Why this is important
- Probability is the only common currency that can be used to relate information from several sources
- Language model (prior) plus right-now information from sound or writing
- Probability of the joint event (the current input and its analysis) = probability(analysis) · probability(current input | analysis): Bayesian analysis
4. If you take logs
- Log prob(joint analysis) is the sum of:
  - log prob(linguistic analysis)
  - log prob(the data, given this linguistic analysis)
- Find the analysis that maximizes this sum.
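A minimal sketch of this maximization in Python; the candidate analyses and all numbers are purely illustrative placeholders, not from the slides:

import math

# Hypothetical candidates: each analysis has a language-model (prior) probability
# and a likelihood of the observed input (sound or writing) given that analysis.
candidates = {
    "analysis_A": {"prior": 0.020, "likelihood": 0.300},
    "analysis_B": {"prior": 0.050, "likelihood": 0.080},
}

def log_joint(prior, likelihood):
    # log P(analysis, input) = log P(analysis) + log P(input | analysis)
    return math.log(prior) + math.log(likelihood)

best = max(candidates, key=lambda a: log_joint(**candidates[a]))
print(best)  # the analysis with the highest joint log probability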
5. Why an n-gram model is reasonable
- The last few words tell us a lot about the next word
- collocations
- prediction of the current category: "the" is followed by nouns or adjectives
- semantic domain
6. Reminder about applications
- Speech recognition
- Handwriting recognition
- POS tagging
7. Problem of sparsity
- Words are very rare events (even if we're not aware of that), so
- what feel like perfectly common sequences of words may be too rare to actually occur in our training corpus
8. What's the next word?
- in a ____
- with a ____
- the last ____
- shot a _____
- open the ____
- over my ____
- President Bill ____
- keep tabs ____
9. Example
- Corpus: five Jane Austen novels; N = 617,091 words; V = 14,585 unique words
- Task: predict the next word of the trigram "inferior to ________"
- Test data, from Persuasion: "In person, she was inferior to both sisters."
borrowed from Henke, based on Manning and Schütze
10. Instances in the Training Corpus: "inferior to ________"
borrowed from Henke, based on Manning and Schütze
11. Maximum Likelihood Estimate
borrowed from Henke, based on Manning and Schütze
12. Maximum Likelihood Distribution D_ML
- Probability is assigned exactly based on the n-gram count in the training corpus.
- Anything not found in the training corpus gets probability 0.
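A minimal sketch of the maximum-likelihood estimate for a bigram model in Python; the toy corpus and function name are illustrative assumptions, not from the slides:

from collections import Counter

def mle_bigram(tokens):
    """Maximum-likelihood bigram model: P(w2 | w1) = count(w1 w2) / count(w1)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])
    return {(w1, w2): c / context_counts[w1] for (w1, w2), c in bigram_counts.items()}

toy_corpus = "she was inferior to both sisters and inferior to none".split()
p = mle_bigram(toy_corpus)
print(p[("inferior", "to")])          # 1.0: every "inferior" here is followed by "to"
print(p.get(("inferior", "her"), 0))  # 0: unseen bigrams get probability zero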
13. Actual Probability Distribution
borrowed from Henke, based on Manning and Schütze
14. Conundrum
- Do we stick very tightly to the Maximum Likelihood model, assigning zero probability to sequences not seen in the training corpus?
- Answer: we simply cannot; the results are just too bad.
15. Smoothing
- We need, therefore, some smoothing procedure
- which adds some of the probability mass to unseen n-grams
- and must therefore take away some of the probability mass from observed n-grams
16. And linguistics?
- The theory of syntax can be viewed as a contribution to the back-off conundrum: syntactic categories are the first back-off route, and linear distance may be less good than syntactic closeness for the conditioning words.
17. Discounting, back-off, and deleted interpolation
- These words all go with smoothing.
- Smoothing describes the general problem we face: getting probability mass to the great unseen.
- Discounting describes whom we take probability mass away from, and how much.
18.
- Back-off and deleted interpolation (a special case of linear interpolation) are the two standard ways of redistributing the probability mass taken away by discounting.
19. Back-off and deleted interpolation, for a given context
- What is the probability of word w_i in the context following "in the __" (e.g., pocket)?
- Words that were found in this context get a probability a bit less than their maximum-likelihood estimate, and
- with backoff, the held-back probability mass is distributed over words in the context "the __". And how?
20.
- Probability mass is distributed over the WORDS pretty much in proportion to how often each word appears in the context "the ___". But even there, we hold back some of the probability mass, and assign it to all words independent of context.
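A minimal sketch of this back-off idea, assuming a Katz-style scheme: seen bigrams keep a discounted maximum-likelihood probability, and the held-back mass is spread over the unigram probabilities of the words not seen in that context. The fixed discount of 0.5 and all names are illustrative choices, not the slides' exact method:

from collections import Counter

def backoff_prob(w2, w1, bigram_counts, unigram_counts, total, discount=0.5):
    """P(w2 | w1): discounted bigram estimate if (w1, w2) was seen; otherwise
    back off to the unigram distribution, rescaled so that the conditional
    probabilities given w1 still sum to 1."""
    seen = {b: c for b, c in bigram_counts.items() if b[0] == w1}
    if (w1, w2) in seen:
        return (seen[(w1, w2)] - discount) / unigram_counts[w1]
    held_back = discount * len(seen) / unigram_counts[w1]   # mass taken from seen bigrams
    unseen_mass = 1.0 - sum(unigram_counts[b[1]] for b in seen) / total
    return held_back * (unigram_counts[w2] / total) / unseen_mass

tokens = "in the pocket in the garden in the pocket".split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
print(backoff_prob("pocket", "the", bigrams, unigrams, len(tokens)))  # seen: discounted MLE
print(backoff_prob("in", "the", bigrams, unigrams, len(tokens)))      # unseen: backed-off mass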
21. Deleted Interpolation
- Is linear: for any word in a context (e.g., "pocket" after "in the"), we choose three λs and take its probability to be the weighted average of the trigram, bigram, and unigram models:
  λ1 P(pocket | in the) + λ2 P(pocket | the) + λ3 P(pocket)
- If we fixed the λs, we would only need to insist that they sum to 1.0. But...
22.
- We don't fix them: we allow them to vary, depending on the context ("in the"); we then need to do some fancier calculations (Expectation-Maximization).
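A minimal sketch of the interpolated estimate with fixed λs (the context-dependent, EM-trained version is not shown); the λ values and probabilities below are illustrative assumptions:

def interpolated_prob(trigram_p, bigram_p, unigram_p, lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation with fixed weights: a weighted average of the
    trigram, bigram, and unigram estimates. The lambdas must sum to 1.0
    so that the result is still a probability."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * trigram_p + l2 * bigram_p + l3 * unigram_p

# e.g. P(pocket | in the), P(pocket | the), P(pocket) from separately estimated models
print(interpolated_prob(trigram_p=0.0, bigram_p=0.004, unigram_p=0.0001))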
23. General ideas about discounting
- Three closely related ideas that are widely used.
24. Sum-of-counts method of creating a distribution
- You can always get a distribution from a set of counts by dividing each count by the total count of the set.
- "Bins": the name for the different preceding n-grams that we keep track of. Each bin gets a probability, and they must sum to 1.0.
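A minimal sketch of this sum-of-counts normalization; the bins and counts are illustrative placeholders:

def counts_to_distribution(counts):
    """Turn raw counts into a probability distribution by dividing each
    count by the total count of the set; the results sum to 1.0."""
    total = sum(counts.values())
    return {bin_: c / total for bin_, c in counts.items()}

dist = counts_to_distribution({"the cat": 3, "the dog": 1})
print(dist)                # {'the cat': 0.75, 'the dog': 0.25}
print(sum(dist.values()))  # 1.0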
25. Zero knowledge
- Suppose we give a count of 1 to every possible bin in our model.
- If our model is a bigram model, we give a count of 1 to each of the V^2 conceivable bigrams (V if unigram, V^3 if trigram, etc.).
- Admittedly, this model assumes zero knowledge of the language.
- We get a distribution by assigning probability 1/V^2 to each bin. Call this distribution D_N.
26. Too much knowledge
- Give each bin exactly the number of counts that it earns from the training corpus.
- If we are making a bigram model, then there are V^2 bins, and those bigrams that do not appear in the training corpus get a count of 0.
- We get the Maximum Likelihood distribution by dividing by the total count N.
27. Laplace (Adding one)
- Add the bin counts from the zero-knowledge case (1 for each bin, V^2 of them in the bigram case) to the bin counts from the too-much-knowledge case (the score in the training corpus).
- Divide by the total number of counts, V^2 + N.
- Formula: each bin gets probability (count in corpus + 1) / (V^2 + N).
28. Lidstone's Law
- Choose a number λ, between 0 and 1, for the count in the no-knowledge distribution.
- Then the count in each bin is (count in corpus) + λ,
- and we assign it probability (count in corpus + λ) / (N + λ·V^2), where the number of bins is V^2 because we're considering a bigram model.
- If λ = 1, this is Laplace. If λ = 0.5, this is the Jeffreys-Perks Law. If λ = 0, this is Maximum Likelihood.
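A minimal sketch of Lidstone smoothing over the V^2 bigram bins, with Laplace (λ = 1), Jeffreys-Perks (λ = 0.5), and maximum likelihood (λ = 0) as special cases; the function name and toy corpus are illustrative assumptions:

from collections import Counter

def lidstone_bigram_prob(w1, w2, tokens, vocab_size, lam=0.5):
    """Probability of the bigram bin (w1, w2): (count + lambda) / (N + lambda * V^2)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n = len(tokens) - 1                                   # total bigram tokens in the corpus
    return (bigram_counts[(w1, w2)] + lam) / (n + lam * vocab_size ** 2)

tokens = "in the pocket in the garden".split()
V = len(set(tokens))
print(lidstone_bigram_prob("in", "the", tokens, V, lam=1.0))  # Laplace (add one)
print(lidstone_bigram_prob("the", "in", tokens, V, lam=0.5))  # Jeffreys-Perks, unseen bigram
print(lidstone_bigram_prob("in", "the", tokens, V, lam=0.0))  # maximum likelihood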
29. Another way to say this
- We can also think of Laplace as a weighted average of two distributions: the No-Knowledge distribution and the Maximum-Likelihood distribution.
30. 2. Averaging distributions
- Remember this:
- If you take weighted averages of distributions of this form
  λ · distribution D1 + (1 - λ) · distribution D2
- the result is a distribution: all the numbers sum to 1.0.
- This means that you split the probability mass between the two distributions (in proportion λ : (1 - λ)) and then divide up those smaller portions exactly according to D1 and D2.
31. Adding 1 (Laplace)
32.
- This is a special case of
  λ·D_N + (1 - λ)·D_ML
- where λ = V^2/(V^2 + N).
- How big is this? If V = 50,000, then V^2 = 2,500,000,000. This means that if our corpus is two and a half billion words, we are still reserving half of our probability mass for zero knowledge; that's too much.
- λ = V^2/(V^2 + N) = 2,500,000,000/5,000,000,000 = 0.5
33. Good-Turing discounting
- The central problem is assigning probability mass to unseen examples, especially unseen bigrams (or trigrams), based on a known vocabulary.
- Good-Turing estimation says that a good estimate for the total probability of the unseen n-grams is N1/N, where N1 is the number of n-grams seen exactly once.
34.
- So we take the probability mass assigned empirically to n-grams seen once, and assign it to all the unseen n-grams. (We know how many there are: if the vocabulary is of size V, then there are V^n n-grams in all.)
- If we have seen T distinct n-grams, then each unseen n-gram gets probability (N1/N) / (V^n - T).
35.
- So the unseen n-grams got all of the probability mass that had been earned by the n-grams seen once. So the n-grams seen once will grab all of the probability mass earned by the n-grams seen twice, which is then distributed uniformly among them.
36.
- So n-grams seen twice will take all the probability mass earned by n-grams seen three times, and we stop this foolishness around the point where observed frequencies are reliable, around 10 occurrences.
Diagram: the counts earned by n-grams seen 1x, 2x, 3x, 4x, 5x are shifted down one frequency class, so the model assigns the seen-once mass to all unseen n-grams, the seen-twice mass to those seen once, and so on.
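A minimal sketch of this count-shifting in Python, assuming the simple Good-Turing adjustment r* = (r + 1) * N_{r+1} / N_r for counts below a reliability cutoff of 10 (raw counts are kept above it); the names and toy corpus are illustrative:

from collections import Counter

def good_turing_adjusted_counts(ngram_counts, cutoff=10):
    """An n-gram observed r times gets adjusted count r* = (r + 1) * N_{r+1} / N_r,
    so frequency class r takes over the mass earned by class r + 1.
    Counts at or above the cutoff are treated as reliable and left unchanged."""
    count_of_counts = Counter(ngram_counts.values())   # N_r: how many n-grams were seen r times
    adjusted = {}
    for ngram, r in ngram_counts.items():
        if r < cutoff and count_of_counts[r + 1] > 0:
            adjusted[ngram] = (r + 1) * count_of_counts[r + 1] / count_of_counts[r]
        else:
            adjusted[ngram] = r
    return adjusted

tokens = "in the pocket in the garden in the pocket of the coat".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
n = sum(bigram_counts.values())
n1 = sum(1 for c in bigram_counts.values() if c == 1)
print("total mass reserved for unseen bigrams:", n1 / n)  # N1 / N
print(good_turing_adjusted_counts(bigram_counts))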