Title: Smoothing
1. Smoothing
- Anish Johnson and Nate Chambers
- 10 April 2009
(Thanks to Bill MacCartney, Jenny Finkel, and Sushant Prakash for these materials)
2. Format and content of sections
- Mix of theoretical and practical topics, targeted at the PAs
- Emphasis on simple examples worked out in detail
3. Outline for today
- Information theory intuitions and examples
  - entropy, joint entropy, conditional entropy, mutual information
  - relative entropy (KL divergence), cross entropy, perplexity
- Smoothing examples, proofs, implementation
  - absolute discounting example
  - how to prove you have a proper probability distribution
  - Good-Turing smoothing tricks
  - smoothing and conditional distributions
- Java implementation
  - representing huge models efficiently
- Some tips on the programming assignments
4. Example
- Toy language with two words, "A" and "B". We want to build a predictive model from:
  - "A B, B A."
  - "A B, B A A A!"
  - "B A, A B!"
  - "A!"
  - "A, A, A."
  - "" (can say nothing)
5. A Unigram Model
- Let's omit punctuation and put a stop symbol after each utterance:
  - A B B A .
  - A B B A A A .
  - B A A B .
  - A .
  - A A A .
  - .
- Let C(x) be the observed count of unigram x
- Let o(x) be the observed frequency of unigram x
- This is a multinomial probability distribution with 2 free parameters
  - Event space: {A, B, .}
- o(x) gives MLE estimates of the unknown true parameters (a small counting sketch follows)
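To make the counting concrete, here is a minimal Java sketch (not from the original materials; the class and variable names are illustrative) that computes C(x) and o(x) for the toy corpus above:

```java
// Counts unigrams C(x) and observed frequencies o(x) = C(x) / C for the toy corpus.
import java.util.HashMap;
import java.util.Map;

public class UnigramExample {
  public static void main(String[] args) {
    String[][] corpus = {
      {"A","B","B","A","."}, {"A","B","B","A","A","A","."},
      {"B","A","A","B","."}, {"A","."}, {"A","A","A","."}, {"."}
    };
    Map<String, Integer> counts = new HashMap<>();  // C(x)
    int total = 0;                                  // C, the total token count
    for (String[] utterance : corpus) {
      for (String token : utterance) {
        counts.merge(token, 1, Integer::sum);
        total++;
      }
    }
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      double o = (double) e.getValue() / total;     // o(x) = C(x) / C
      System.out.printf("o(%s) = %d/%d = %.3f%n", e.getKey(), e.getValue(), total, o);
    }
  }
}
```

For this corpus it prints o(A) = 12/24 = 0.5 and o(B) = o(.) = 6/24 = 0.25.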
6. A Bigram Model (1/2)
- This time we'll put a stop symbol before and after each utterance:
  - . A B B A .
  - . A B B A A A .
  - . B A A B .
  - . A .
  - . A A A .
  - . .
- Let C(x, y) be the observed count of bigram xy
- Let o(x, y) be the observed frequency of bigram xy
- This is a multinomial probability distribution with 8 free parameters
  - Event space: {A, B, .} × {A, B, .}
- o(x, y) gives MLE estimates of the unknown true parameters
7. A Bigram Model (2/2)
- Marginal distributions o(x) and o(y)
- Conditional distributions o(y | x) and o(x | y):
  - o(y | x) = o(x, y) / o(x)
  - o(x | y) = o(x, y) / o(y)
- Example: o(B | A) = o(A, B) / o(A) = (1/8) / (1/2) = 1/4 (see the code sketch below)
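Here is a minimal sketch, assuming a nested-map representation (an illustration, not the assignment's required data structure), of estimating o(y | x) from bigram counts:

```java
// Estimates o(y | x) = C(x, y) / C(x, *) from stop-padded bigram observations.
import java.util.HashMap;
import java.util.Map;

public class BigramExample {
  // bigramCounts.get(x).get(y) holds C(x, y); contextCounts.get(x) holds C(x, *)
  static Map<String, Map<String, Integer>> bigramCounts = new HashMap<>();
  static Map<String, Integer> contextCounts = new HashMap<>();

  static void observe(String x, String y) {
    bigramCounts.computeIfAbsent(x, k -> new HashMap<>()).merge(y, 1, Integer::sum);
    contextCounts.merge(x, 1, Integer::sum);
  }

  // o(y | x) = C(x, y) / C(x, *), the MLE conditional estimate
  static double conditional(String y, String x) {
    Map<String, Integer> row = bigramCounts.get(x);
    if (row == null || !row.containsKey(y)) return 0.0;
    return (double) row.get(y) / contextCounts.get(x);
  }

  public static void main(String[] args) {
    String[][] corpus = {
      {".","A","B","B","A","."}, {".","A","B","B","A","A","A","."},
      {".","B","A","A","B","."}, {".","A","."}, {".","A","A","A","."}, {".","."}
    };
    for (String[] u : corpus)
      for (int i = 0; i + 1 < u.length; i++) observe(u[i], u[i + 1]);
    System.out.println("o(B | A) = " + conditional("B", "A"));  // 3/12 = 0.25
  }
}
```

Note that C(x, y) / C(x, *) equals o(x, y) / o(x), since numerator and denominator are both divided by the same total.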
8. Entropy (1/2)
- If X is a random variable whose distribution is p (which we write "X ~ p"), then we can define the entropy of X, H(X), as follows:
  - H(X) = -Σx p(x) lg p(x)
- Note that entropy is (the negative of) the expected value of the log probability of x. Log probabilities come up a lot in statistical modeling. Since probabilities are always ≤ 1, log probabilities are always ≤ 0. Therefore, entropy is always ≥ 0.
- Let's calculate the entropy of our observed unigram distribution, o(x)
9. Entropy (2/2)
- What does entropy mean, intuitively?
  - It's helpful to think of it as a measure of the evenness or uniformity of the distribution
- Lowest entropy? What parameters achieve that minimum?
- Highest entropy? What parameters achieve that maximum?
- What if m(x) = 0 for some x?
  - By definition 0 lg 0 = 0, so events with probability 0 do not affect the entropy calculation.
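A minimal sketch (assumed, not from the slides) of the entropy computation in bits, using the convention 0 lg 0 = 0:

```java
// Computes H(X) = -sum_x p(x) lg p(x) in bits, skipping zero-probability events.
public class Entropy {
  static double entropy(double[] p) {
    double h = 0.0;
    for (double px : p) {
      if (px > 0.0) h -= px * (Math.log(px) / Math.log(2));  // lg = log base 2
    }
    return h;
  }

  public static void main(String[] args) {
    // Observed unigram distribution from the toy corpus: o(A)=1/2, o(B)=1/4, o(.)=1/4
    System.out.println(entropy(new double[]{0.5, 0.25, 0.25}));  // 1.5 bits
  }
}
```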
10. Joint Entropy (1/2)
- If random variables X and Y have joint distribution p(x, y), then we can define the joint entropy of X and Y, H(X, Y), as follows:
  - H(X, Y) = -Σx Σy p(x, y) lg p(x, y)
- Let's calculate the joint entropy of our observed bigram distribution, o(x, y)
11. Joint Entropy (2/2)
- Try fiddling with the parameters of the joint distribution m(x, y), and observe what happens to the joint entropy
12. Conditional Entropy (1/2)
- If random variables X and Y have joint distribution p(x, y), then we can define the conditional entropy of Y given X, H(Y | X), as follows:
  - H(Y | X) = Σx p(x) [ -Σy p(y | x) lg p(y | x) ]
13. Conditional Entropy (2/2)
- From the previous slide:
  - H(Y | X) = Σx p(x) [ -Σy p(y | x) lg p(y | x) ]
- An alternative way of computing H(Y | X):
  - H(Y | X) = H(X, Y) - H(X)
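The identity H(Y | X) = H(X, Y) - H(X) can be checked numerically; here is a minimal sketch using an arbitrary 2x2 joint distribution (the numbers are illustrative, not from the slides):

```java
// Verifies H(Y | X) = H(X, Y) - H(X) on a small example joint distribution.
public class ConditionalEntropy {
  static double lg(double v) { return Math.log(v) / Math.log(2); }

  public static void main(String[] args) {
    double[][] p = {{0.4, 0.1}, {0.2, 0.3}};      // p[x][y], an illustrative joint
    double hxy = 0, hx = 0, hyGivenX = 0;
    for (int x = 0; x < 2; x++) {
      double px = p[x][0] + p[x][1];              // marginal p(x)
      if (px > 0) hx -= px * lg(px);
      for (int y = 0; y < 2; y++) {
        if (p[x][y] > 0) {
          hxy -= p[x][y] * lg(p[x][y]);           // contributes to H(X, Y)
          hyGivenX -= p[x][y] * lg(p[x][y] / px); // contributes to H(Y | X)
        }
      }
    }
    System.out.println(hyGivenX + " == " + (hxy - hx));  // the two should agree
  }
}
```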
14. Relative entropy (1/3) (KL divergence)
- If we have two probability distributions, p(x) and q(x), we can define the relative entropy (aka KL divergence) between p and q, D(p || q), as follows:
  - D(p || q) = Σx p(x) lg ( p(x) / q(x) )
  - We define 0 lg 0 = 0 and p lg (p / 0) = ∞
- Relative entropy:
  - measures how much two probability distributions differ
  - is asymmetric
  - Identical distributions have zero relative entropy.
  - Non-identical distributions have positive relative entropy.
  - is never negative
15. Relative entropy (2/3) (KL divergence)
- Suppose we want to compare our observed unigram distribution o(x) to some arbitrary model distribution m(x). What is the relative entropy between them?
- Try fiddling with the parameters of m(x) and see what it does to the KL divergence.
  - What parameters minimize the divergence? Maximize?
16. Relative entropy (3/3) (KL divergence)
- What happens when o(x) = 0? (lg 0 is normally undefined!)
  - Events that cannot happen (according to o(x)) do not contribute to the KL divergence between o(x) and any other distribution.
- What happens when m(x) = 0? (division by 0!)
  - If an event x was observed (o(x) > 0) but your model says it can't happen (m(x) = 0), then your model is infinitely surprised: D(o || m) = ∞.
- Why is D(p || q) ≥ 0? Can you prove it?
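Putting the conventions above together, here is a minimal sketch (assumed, not from the slides) of D(p || q) in bits:

```java
// Computes D(p || q) = sum_x p(x) lg(p(x)/q(x)), with 0 lg 0 = 0 and p lg(p/0) = infinity.
public class KLDivergence {
  static double kl(double[] p, double[] q) {
    double d = 0.0;
    for (int i = 0; i < p.length; i++) {
      if (p[i] == 0.0) continue;                          // 0 lg 0 contributes nothing
      if (q[i] == 0.0) return Double.POSITIVE_INFINITY;   // model is infinitely surprised
      d += p[i] * (Math.log(p[i] / q[i]) / Math.log(2));
    }
    return d;
  }

  public static void main(String[] args) {
    double[] o = {0.5, 0.25, 0.25};            // observed unigram distribution o(x)
    double[] m = {1.0 / 3, 1.0 / 3, 1.0 / 3};  // a uniform model m(x)
    System.out.println(kl(o, m));              // positive; 0 only when m equals o
  }
}
```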
17. Smoothing: Absolute Discounting
- Idea: reduce the counts of observed event types by a fixed amount δ, and reallocate that count mass to unobserved event types.
- Absolute discounting is simple and gives quite good results.
- Terminology:
  - x - an event type (e.g., a bigram)
  - C - count of all observations (tokens) (e.g., training size)
  - C(x) - count of observations (tokens) of type x (e.g., bigram counts)
  - V - number of event types x (e.g., size of the vocabulary)
  - Nr - number of event types x observed r times, i.e. with C(x) = r
  - δ - a number between 0 and 1 (e.g., 0.75)
18. Absolute Discounting (Cont.)
- For seen types, we deduct δ from the count mass:
  - Pad(x) = (C(x) - δ) / C,  if C(x) > 0
- How much count mass did we harvest by doing this? We took δ from each of the V - N0 seen types, so we have δ(V - N0) to redistribute among the N0 unseen types. So each unseen type gets a count mass of δ(V - N0) / N0:
  - Pad(x) = δ(V - N0) / (N0 C),  if C(x) = 0
- To see how this works (a small code sketch follows below), let's go back to our original example and look at bigram counts. To bring unseens into the picture, let's suppose we have another word, C, giving rise to 7 new unseen bigrams.
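A minimal sketch of the two cases, under the assumption that the full event-space size V is known up front and that at least one type is unseen (names are illustrative, not the assignment's API):

```java
// Absolute discounting: Pad(x) = (C(x) - delta)/C for seen x,
// and delta(V - N0)/(N0 C) for unseen x.
import java.util.Map;

public class AbsoluteDiscounting {
  final Map<String, Integer> counts;    // C(x) for seen types only
  final int C;                          // total observations
  final int V;                          // total number of event types (seen + unseen)
  final double delta;                   // e.g. 0.75

  AbsoluteDiscounting(Map<String, Integer> counts, int V, double delta) {
    this.counts = counts;
    this.C = counts.values().stream().mapToInt(Integer::intValue).sum();
    this.V = V;
    this.delta = delta;
  }

  double prob(String x) {
    Integer cx = counts.get(x);
    int n0 = V - counts.size();                       // N0 = number of unseen types (> 0 assumed)
    if (cx != null && cx > 0) {
      return (cx - delta) / C;                        // seen case
    }
    return delta * (V - n0) / (n0 * (double) C);      // unseen case
  }
}
```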
19. Absolute Discounting (Cont.)
- The probabilities add up to 1, but how do we prove this in general?
- Also, how do you choose a good value for δ?
  - 0.75 is often recommended, but you can also use held-out data to tune this parameter.
- Look down the column of Pad probabilities. Anything troubling to you?
20. Proper Probability Distributions
- To prove that a function p is a probability distribution, you must show:
  - 1. p(x) ≥ 0 for all x
  - 2. Σx p(x) = 1
- The first is generally trivial; the second can be more challenging.
- A proof for absolute discounting will illustrate the general idea:
  - Σx Pad(x) = Σ{x: C(x)>0} Pad(x) + Σ{x: C(x)=0} Pad(x)
  - = Σ{x: C(x)>0} (C(x) - δ)/C + Σ{x: C(x)=0} δ(V - N0)/(N0 C)   (V - N0 terms in the first sum, N0 terms in the second)
  - = Σ{x: C(x)>0} C(x)/C - (V - N0)δ/C + (V - N0)δ/C
  - = Σx C(x)/C
  - = C/C = 1
21. Good-Turing Smoothing
- We redistribute the count mass of types observed r+1 times evenly among types observed r times. Then we estimate P(x) as r*/C, where r* is an adjusted count for types observed r times (a small sketch follows below). We want r* such that:
  - r* Nr = (r+1) Nr+1
  - r* = (r+1) Nr+1 / Nr
  - => PGT(x) = ((r+1) Nr+1 / Nr) / C
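A minimal sketch of the adjusted counts (the count-of-counts values below are purely illustrative, not taken from the example):

```java
// Simple Good-Turing adjusted counts r* = (r+1) N_{r+1} / N_r,
// without the smoothing of N_r discussed on the later slides.
import java.util.HashMap;
import java.util.Map;

public class GoodTuring {
  public static void main(String[] args) {
    // countOfCounts.get(r) = N_r, the number of event types observed r times
    Map<Integer, Integer> countOfCounts = new HashMap<>();
    countOfCounts.put(1, 10);   // illustrative values only
    countOfCounts.put(2, 5);
    countOfCounts.put(3, 2);

    for (int r = 1; r <= 2; r++) {
      int nr = countOfCounts.get(r);
      int nr1 = countOfCounts.getOrDefault(r + 1, 0);  // N_{r+1} may be 0 for large r
      double rStar = (r + 1) * (double) nr1 / nr;
      System.out.printf("r = %d: r* = %.3f%n", r, rStar);
    }
  }
}
```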
22. Good-Turing (Cont.)
- To see how this works, let's go back to our example and look at bigram counts. To make the example more interesting, we'll assume we've also seen the sentence "C C C C C C C C C C C", giving us another 12 bigram observations, as summarized in the following table of counts.
23. Good-Turing (Cont.)
- But now the probabilities do not add up to 1.
- One problem is that for high values of r (i.e., for high-frequency bigrams), Nr+1 is quite likely to be 0.
- One way to address this is to use the Good-Turing estimates only for frequencies r < k, for some constant cutoff k. Above this, the MLE estimates are used.
24. Good-Turing (Cont.)
- Thus a better way to define r* is:
  - r* = (r+1) E[Nr+1] / E[Nr]
  - where E[Nr] means the expected number of event types (bigrams) observed r times.
- So we fit some function S to the observed values (r, Nr) (a fitting sketch follows below)
  - Gale & Sampson (1995) suggest using a power curve Nr = a r^b, with b < -1
  - Fit using linear regression in log space: log Nr = log a + b log r
- The observed distribution of Nr, when transformed into log space, looks like this (plot shown on the original slide).
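A minimal sketch (assumed, not Gale & Sampson's actual code) of fitting the power curve by ordinary least squares in log space; the (r, Nr) values are illustrative:

```java
// Fits N_r = a * r^b via least squares on (log r, log N_r), then reports smoothed S(r).
public class PowerCurveFit {
  public static void main(String[] args) {
    // Illustrative count-of-counts data: pairs (r, N_r)
    int[] r  = {1, 2, 3, 4, 5};
    int[] nr = {120, 40, 18, 10, 6};

    // Ordinary least squares on (log r, log N_r): log N_r = log a + b log r
    int n = r.length;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
      double x = Math.log(r[i]), y = Math.log(nr[i]);
      sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double logA = (sy - b * sx) / n;

    // Smoothed values S(r) = a * r^b, usable in r* = (r+1) S(r+1) / S(r)
    for (int i = 1; i <= 6; i++) {
      double s = Math.exp(logA) * Math.pow(i, b);
      System.out.printf("S(%d) = %.2f%n", i, s);
    }
  }
}
```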
25. Good-Turing (Cont.)
- Here is a graph of the line fit in log space and the resulting power curve.
- The fit is poor, but it gives us smoothed values for Nr.
- Now, for r > 0, we use S(r) to generate the adjusted counts r*:
  - r* = (r+1) S(r+1) / S(r)
26. Smoothing and Conditional Probabilities
- Some people have the wrong idea about how to combine smoothing with conditional probability distributions. You know that a conditional distribution can be computed as the ratio of a joint distribution and a marginal distribution:
  - P(x | y) = P(x, y) / P(y)
- What if you want to use smoothing?
  - WRONG: Smooth the joint P(x, y) and the marginal P(y) independently, then combine: P'''(x | y) = P'(x, y) / P''(y).
  - Correct: Smooth P(x | y) itself (see the next slide).
27. Smoothing and Conditional Probabilities (Cont.)
- The problem is that steps 1 and 2 do smoothing separately, so it makes no sense to divide the results. The right way to compute the smoothed conditional probability distribution P(x | y) is:
  - 1. From the joint P(x, y), compute a smoothed joint P'(x, y).
  - 2. From the smoothed joint P'(x, y), compute a smoothed marginal P'(y).
  - 3. Divide them: let P'(x | y) = P'(x, y) / P'(y). (A small code sketch of this recipe follows.)
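A minimal sketch of the recipe, assuming for illustration that the smoothed joint is available as a dense array:

```java
// Smooth the joint once, derive the marginal from the smoothed joint, then divide.
public class SmoothedConditional {
  // smoothedJoint[x][y] holds P'(x, y), produced by a single smoothing step
  static double conditional(double[][] smoothedJoint, int x, int y) {
    double marginalY = 0.0;                      // P'(y) = sum_x P'(x, y)
    for (double[] row : smoothedJoint) marginalY += row[y];
    return smoothedJoint[x][y] / marginalY;      // P'(x | y) = P'(x, y) / P'(y)
  }
}
```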
28. Smoothing and Conditional Probabilities (Cont.)
- Suppose we're on safari, and we count the animals we observe by species (x) and gender (y).
- From these counts, we can easily compute unsmoothed, empirical joint and marginal distributions.
29. Smoothing and Conditional Probabilities (Cont.)
- Now suppose we want to use absolute discounting, with δ = 0.75, to get adjusted counts.
- From these counts, we can compute smoothed joint and marginal distributions.
- Now, since both P'(x, y) and P'(y) result from the same smoothing operation, it's OK to divide them.
30. Java
- Any confusion about roulette wheel sampling in generateWord()? (A small sketch follows below.)
- You want to train a trigram model on 10 million words, but you keep running out of memory:
  - Don't use a V × V × V matrix!
  - CounterMap? Good, but it needs to be elaborated.
  - Intern your Strings!
  - Virtual memory: -mx2000m
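Here is a minimal sketch of roulette wheel sampling; it is not the assignment's actual generateWord(), and the names are illustrative:

```java
// Roulette wheel sampling: draw a word with probability proportional to its weight.
import java.util.Map;
import java.util.Random;

public class RouletteWheel {
  static final Random RNG = new Random();

  // distribution maps word -> probability; probabilities are assumed to sum to 1
  static String sample(Map<String, Double> distribution) {
    double target = RNG.nextDouble();   // a point on the wheel, in [0, 1)
    double cumulative = 0.0;
    String last = null;
    for (Map.Entry<String, Double> e : distribution.entrySet()) {
      cumulative += e.getValue();
      last = e.getKey();
      if (cumulative >= target) return e.getKey();
    }
    return last;  // guard against floating-point round-off
  }
}
```

The idea is that each word owns a slice of [0, 1) proportional to its probability; a uniform draw lands in exactly one slice.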
31. Programming Assignment Tips
- Objectives
  - The assignments are intentionally open-ended
  - Think of the assignment as an investigation
  - Think of your report as a mini research paper
- Implement the basics: bigram and trigram models, smoothing, interpolation
- Choose one or more additions, e.g.:
  - Fancy smoothing: Katz, Witten-Bell, Kneser-Ney, gapped bigrams
  - Compare smoothing the joint vs. smoothing the conditionals
  - Crude spelling model for unknown words
  - Trade-off between memory and performance
- Explain what you did, why you did it, and what you found out
32. Programming Assignment Tips
- Development
  - Use a very small dataset during development (esp. debugging).
  - Use validation data to tune hyperparameters.
  - Investigate the learning curve (performance as a function of training size).
  - Investigate the variance of your model results.
- Report
  - Be concise! 6 pages is plenty.
  - Prove that your actual, implemented distributions are proper (concisely!).
  - Include a graph or two.
  - Error analysis: discuss examples that your model gets wrong.