Title: Smoothing
1. Smoothing
- Anish Johnson and Nate Chambers
- 10 April 2009
(Thanks to Bill MacCartney, Jenny Finkel, and Sushant Prakash for these materials)
2. Format and content of sections
- Mix of theoretical and practical topics, targeted at the PAs
- Emphasis on simple examples worked out in detail
3. Outline for today
- Information theory intuitions and examples
  - entropy, joint entropy, conditional entropy, mutual information
  - relative entropy (KL divergence), cross entropy, perplexity
- Smoothing examples, proofs, implementation
  - absolute discounting example
  - how to prove you have a proper probability distribution
  - Good-Turing smoothing tricks
  - smoothing and conditional distributions
- Java implementation
  - representing huge models efficiently
- Some tips on the programming assignments
4. Example
- Toy language with two words, "A" and "B". We want to build a predictive model from:
  - "A B, B A."
  - "A B, B A A A!"
  - "B A, A B!"
  - "A!"
  - "A, A, A."
  - "" (can say nothing)
5. A Unigram Model
- Let's omit punctuation and put a stop symbol after each utterance:
  - A B B A .
  - A B B A A A .
  - B A A B .
  - A .
  - A A A .
  - .
- Let C(x) be the observed count of unigram x
- Let o(x) be the observed frequency of unigram x
- This is a multinomial probability distribution with 2 free parameters
  - Event space: {A, B, .}
- o(x) gives MLE estimates of the unknown true parameters (a small counting sketch follows)
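To make the counting concrete, here is a minimal Java sketch (not from the original materials; the class and variable names are illustrative) that computes C(x) and o(x) for the toy corpus above:

```java
// Counts unigrams C(x) and observed frequencies o(x) = C(x) / C for the toy corpus.
import java.util.HashMap;
import java.util.Map;

public class UnigramExample {
  public static void main(String[] args) {
    String[][] corpus = {
      {"A","B","B","A","."}, {"A","B","B","A","A","A","."},
      {"B","A","A","B","."}, {"A","."}, {"A","A","A","."}, {"."}
    };
    Map<String, Integer> counts = new HashMap<>();  // C(x)
    int total = 0;                                  // C, the total token count
    for (String[] utterance : corpus) {
      for (String token : utterance) {
        counts.merge(token, 1, Integer::sum);
        total++;
      }
    }
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      double o = (double) e.getValue() / total;     // o(x) = C(x) / C
      System.out.printf("o(%s) = %d/%d = %.3f%n", e.getKey(), e.getValue(), total, o);
    }
  }
}
```

For this corpus it prints o(A) = 12/24 = 0.5 and o(B) = o(.) = 6/24 = 0.25.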
6. A Bigram Model (1/2)
- This time we'll put a stop symbol before and after each utterance:
  - . A B B A .
  - . A B B A A A .
  - . B A A B .
  - . A .
  - . A A A .
  - . .
- Let C(x, y) be the observed count of bigram xy
- Let o(x, y) be the observed frequency of bigram xy
- This is a multinomial probability distribution with 8 free parameters
  - Event space: {A, B, .} × {A, B, .}
- o(x, y) gives MLE estimates of the unknown true parameters
7. A Bigram Model (2/2)
- Marginal distributions o(x) and o(y)
- Conditional distributions o(y | x) and o(x | y):
  - o(y | x) = o(x, y) / o(x)
  - o(x | y) = o(x, y) / o(y)
- Example: o(B | A) = o(A, B) / o(A) = (1/8) / (1/2) = 1/4 (see the code sketch below)
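Here is a minimal sketch, assuming a nested-map representation (an illustration, not the assignment's required data structure), of estimating o(y | x) from bigram counts:

```java
// Estimates o(y | x) = C(x, y) / C(x, *) from stop-padded bigram observations.
import java.util.HashMap;
import java.util.Map;

public class BigramExample {
  // bigramCounts.get(x).get(y) holds C(x, y); contextCounts.get(x) holds C(x, *)
  static Map<String, Map<String, Integer>> bigramCounts = new HashMap<>();
  static Map<String, Integer> contextCounts = new HashMap<>();

  static void observe(String x, String y) {
    bigramCounts.computeIfAbsent(x, k -> new HashMap<>()).merge(y, 1, Integer::sum);
    contextCounts.merge(x, 1, Integer::sum);
  }

  // o(y | x) = C(x, y) / C(x, *), the MLE conditional estimate
  static double conditional(String y, String x) {
    Map<String, Integer> row = bigramCounts.get(x);
    if (row == null || !row.containsKey(y)) return 0.0;
    return (double) row.get(y) / contextCounts.get(x);
  }

  public static void main(String[] args) {
    String[][] corpus = {
      {".","A","B","B","A","."}, {".","A","B","B","A","A","A","."},
      {".","B","A","A","B","."}, {".","A","."}, {".","A","A","A","."}, {".","."}
    };
    for (String[] u : corpus)
      for (int i = 0; i + 1 < u.length; i++) observe(u[i], u[i + 1]);
    System.out.println("o(B | A) = " + conditional("B", "A"));  // 3/12 = 0.25
  }
}
```

Note that C(x, y) / C(x, *) equals o(x, y) / o(x), since numerator and denominator are both divided by the same total.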
8. Entropy (1/2)
- If X is a random variable whose distribution is p (which we write "X ~ p"), then we can define the entropy of X, H(X), as follows:
  - H(X) = -Σx p(x) lg p(x)
- Note that entropy is (the negative of) the expected value of the log probability of x. Log probabilities come up a lot in statistical modeling. Since probabilities are always ≤ 1, log probabilities are always ≤ 0. Therefore, entropy is always ≥ 0.
- Let's calculate the entropy of our observed unigram distribution, o(x)
9. Entropy (2/2)
- What does entropy mean, intuitively?
  - It's helpful to think of it as a measure of the evenness or uniformity of the distribution
- Lowest entropy? What parameters achieve that minimum?
- Highest entropy? What parameters achieve that maximum?
- What if m(x) = 0 for some x?
  - By definition 0 lg 0 = 0, so events with probability 0 do not affect the entropy calculation.
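A minimal sketch (assumed, not from the slides) of the entropy computation in bits, using the convention 0 lg 0 = 0:

```java
// Computes H(X) = -sum_x p(x) lg p(x) in bits, skipping zero-probability events.
public class Entropy {
  static double entropy(double[] p) {
    double h = 0.0;
    for (double px : p) {
      if (px > 0.0) h -= px * (Math.log(px) / Math.log(2));  // lg = log base 2
    }
    return h;
  }

  public static void main(String[] args) {
    // Observed unigram distribution from the toy corpus: o(A)=1/2, o(B)=1/4, o(.)=1/4
    System.out.println(entropy(new double[]{0.5, 0.25, 0.25}));  // 1.5 bits
  }
}
```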
10. Joint Entropy (1/2)
- If random variables X and Y have joint distribution p(x, y), then we can define the joint entropy of X and Y, H(X, Y), as follows:
  - H(X, Y) = -Σx Σy p(x, y) lg p(x, y)
- Let's calculate the joint entropy of our observed bigram distribution, o(x, y)
11. Joint Entropy (2/2)
- Try fiddling with the parameters of the joint distribution m(x, y), and observe what happens to the joint entropy
12. Conditional Entropy (1/2)
- If random variables X and Y have joint distribution p(x, y), then we can define the conditional entropy of Y given X, H(Y | X), as follows:
  - H(Y | X) = Σx p(x) [ -Σy p(y | x) lg p(y | x) ]
13. Conditional Entropy (2/2)
- From the previous slide:
  - H(Y | X) = Σx p(x) [ -Σy p(y | x) lg p(y | x) ]
- An alternative way of computing H(Y | X):
  - H(Y | X) = H(X, Y) - H(X)
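The identity H(Y | X) = H(X, Y) - H(X) can be checked numerically; here is a minimal sketch using an arbitrary 2x2 joint distribution (the numbers are illustrative, not from the slides):

```java
// Verifies H(Y | X) = H(X, Y) - H(X) on a small example joint distribution.
public class ConditionalEntropy {
  static double lg(double v) { return Math.log(v) / Math.log(2); }

  public static void main(String[] args) {
    double[][] p = {{0.4, 0.1}, {0.2, 0.3}};      // p[x][y], an illustrative joint
    double hxy = 0, hx = 0, hyGivenX = 0;
    for (int x = 0; x < 2; x++) {
      double px = p[x][0] + p[x][1];              // marginal p(x)
      if (px > 0) hx -= px * lg(px);
      for (int y = 0; y < 2; y++) {
        if (p[x][y] > 0) {
          hxy -= p[x][y] * lg(p[x][y]);           // contributes to H(X, Y)
          hyGivenX -= p[x][y] * lg(p[x][y] / px); // contributes to H(Y | X)
        }
      }
    }
    System.out.println(hyGivenX + " == " + (hxy - hx));  // the two should agree
  }
}
```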
14. Relative entropy (1/3) (KL divergence)
- If we have two probability distributions, p(x) and q(x), we can define the relative entropy (aka KL divergence) between p and q, D(p || q), as follows:
  - D(p || q) = Σx p(x) lg ( p(x) / q(x) )
  - We define 0 lg 0 = 0 and p lg (p / 0) = ∞
- Relative entropy:
  - measures how much two probability distributions differ
  - is asymmetric
  - Identical distributions have zero relative entropy.
  - Non-identical distributions have positive relative entropy.
  - is never negative
15. Relative entropy (2/3) (KL divergence)
- Suppose we want to compare our observed unigram distribution o(x) to some arbitrary model distribution m(x). What is the relative entropy between them?
- Try fiddling with the parameters of m(x) and see what it does to the KL divergence.
  - What parameters minimize the divergence? Maximize?
16. Relative entropy (3/3) (KL divergence)
- What happens when o(x) = 0? (lg 0 is normally undefined!)
  - Events that cannot happen (according to o(x)) do not contribute to the KL divergence between o(x) and any other distribution.
- What happens when m(x) = 0? (division by 0!)
  - If an event x was observed (o(x) > 0) but your model says it can't happen (m(x) = 0), then your model is infinitely surprised: D(o || m) = ∞.
- Why is D(p || q) ≥ 0? Can you prove it?
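Putting the conventions above together, here is a minimal sketch (assumed, not from the slides) of D(p || q) in bits:

```java
// Computes D(p || q) = sum_x p(x) lg(p(x)/q(x)), with 0 lg 0 = 0 and p lg(p/0) = infinity.
public class KLDivergence {
  static double kl(double[] p, double[] q) {
    double d = 0.0;
    for (int i = 0; i < p.length; i++) {
      if (p[i] == 0.0) continue;                          // 0 lg 0 contributes nothing
      if (q[i] == 0.0) return Double.POSITIVE_INFINITY;   // model is infinitely surprised
      d += p[i] * (Math.log(p[i] / q[i]) / Math.log(2));
    }
    return d;
  }

  public static void main(String[] args) {
    double[] o = {0.5, 0.25, 0.25};            // observed unigram distribution o(x)
    double[] m = {1.0 / 3, 1.0 / 3, 1.0 / 3};  // a uniform model m(x)
    System.out.println(kl(o, m));              // positive; 0 only when m equals o
  }
}
```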
17. Smoothing: Absolute Discounting
- Idea: reduce the counts of observed event types by a fixed amount δ, and reallocate that count mass to unobserved event types.
- Absolute discounting is simple and gives quite good results.
- Terminology:
  - x - an event type (e.g., a bigram)
  - C - count of all observations (tokens) (e.g., training size)
  - C(x) - count of observations (tokens) of type x (e.g., bigram counts)
  - V - number of event types x (e.g., size of the vocabulary)
  - Nr - number of event types x observed r times, i.e. with C(x) = r
  - δ - a number between 0 and 1 (e.g., 0.75)
18. Absolute Discounting (Cont.)
- For seen types, we deduct δ from the count mass:
  - Pad(x) = (C(x) - δ) / C,  if C(x) > 0
- How much count mass did we harvest by doing this? We took δ from each of the V - N0 seen types, so we have δ(V - N0) to redistribute among the N0 unseen types. So each unseen type gets a count mass of δ(V - N0) / N0:
  - Pad(x) = δ(V - N0) / (N0 C),  if C(x) = 0
- To see how this works (a small code sketch follows below), let's go back to our original example and look at bigram counts. To bring unseens into the picture, let's suppose we have another word, C, giving rise to 7 new unseen bigrams.
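A minimal sketch of the two cases, under the assumption that the full event-space size V is known up front and that at least one type is unseen (names are illustrative, not the assignment's API):

```java
// Absolute discounting: Pad(x) = (C(x) - delta)/C for seen x,
// and delta(V - N0)/(N0 C) for unseen x.
import java.util.Map;

public class AbsoluteDiscounting {
  final Map<String, Integer> counts;    // C(x) for seen types only
  final int C;                          // total observations
  final int V;                          // total number of event types (seen + unseen)
  final double delta;                   // e.g. 0.75

  AbsoluteDiscounting(Map<String, Integer> counts, int V, double delta) {
    this.counts = counts;
    this.C = counts.values().stream().mapToInt(Integer::intValue).sum();
    this.V = V;
    this.delta = delta;
  }

  double prob(String x) {
    Integer cx = counts.get(x);
    int n0 = V - counts.size();                       // N0 = number of unseen types (> 0 assumed)
    if (cx != null && cx > 0) {
      return (cx - delta) / C;                        // seen case
    }
    return delta * (V - n0) / (n0 * (double) C);      // unseen case
  }
}
```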
19. Absolute Discounting (Cont.)
- The probabilities add up to 1, but how do we prove this in general?
- Also, how do you choose a good value for δ?
  - 0.75 is often recommended, but you can also use held-out data to tune this parameter.
- Look down the column of Pad probabilities. Anything troubling to you?
20. Proper Probability Distributions
- To prove that a function p is a probability distribution, you must show:
  - 1. p(x) ≥ 0 for all x
  - 2. Σx p(x) = 1
- The first is generally trivial; the second can be more challenging.
- A proof for absolute discounting will illustrate the general idea:
  - Σx Pad(x) = Σ{x: C(x)>0} Pad(x) + Σ{x: C(x)=0} Pad(x)
  - = Σ{x: C(x)>0} (C(x) - δ)/C + Σ{x: C(x)=0} δ(V - N0)/(N0 C)   (V - N0 terms in the first sum, N0 terms in the second)
  - = Σ{x: C(x)>0} C(x)/C - (V - N0)δ/C + (V - N0)δ/C
  - = Σx C(x)/C
  - = C/C = 1
21. Good-Turing Smoothing
- We redistribute the count mass of types observed r+1 times evenly among types observed r times. Then we estimate P(x) as r*/C, where r* is an adjusted count for types observed r times (a small sketch follows below). We want r* such that:
  - r* Nr = (r+1) Nr+1
  - r* = (r+1) Nr+1 / Nr
  - => PGT(x) = ((r+1) Nr+1 / Nr) / C
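A minimal sketch of the adjusted counts (the count-of-counts values below are purely illustrative, not taken from the example):

```java
// Simple Good-Turing adjusted counts r* = (r+1) N_{r+1} / N_r,
// without the smoothing of N_r discussed on the later slides.
import java.util.HashMap;
import java.util.Map;

public class GoodTuring {
  public static void main(String[] args) {
    // countOfCounts.get(r) = N_r, the number of event types observed r times
    Map<Integer, Integer> countOfCounts = new HashMap<>();
    countOfCounts.put(1, 10);   // illustrative values only
    countOfCounts.put(2, 5);
    countOfCounts.put(3, 2);

    for (int r = 1; r <= 2; r++) {
      int nr = countOfCounts.get(r);
      int nr1 = countOfCounts.getOrDefault(r + 1, 0);  // N_{r+1} may be 0 for large r
      double rStar = (r + 1) * (double) nr1 / nr;
      System.out.printf("r = %d: r* = %.3f%n", r, rStar);
    }
  }
}
```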
22. Good-Turing (Cont.)
- To see how this works, let's go back to our example and look at bigram counts. To make the example more interesting, we'll assume we've also seen the sentence "C C C C C C C C C C C", giving us another 12 bigram observations, as summarized in the following table of counts.
23. Good-Turing (Cont.)
- But now the probabilities do not add up to 1.
- One problem is that for high values of r (i.e., for high-frequency bigrams), Nr+1 is quite likely to be 0.
- One way to address this is to use the Good-Turing estimates only for frequencies r < k, for some constant cutoff k. Above this, the MLE estimates are used.
24. Good-Turing (Cont.)
- Thus a better way to define r* is:
  - r* = (r+1) E[Nr+1] / E[Nr]
  - where E[Nr] means the expected number of event types (bigrams) observed r times.
- So we fit some function S to the observed values (r, Nr) (a fitting sketch follows below)
  - Gale & Sampson (1995) suggest using a power curve Nr = a r^b, with b < -1
  - Fit using linear regression in log space: log Nr = log a + b log r
- The observed distribution of Nr, when transformed into log space, looks like this (plot shown on the original slide).
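A minimal sketch (assumed, not Gale & Sampson's actual code) of fitting the power curve by ordinary least squares in log space; the (r, Nr) values are illustrative:

```java
// Fits N_r = a * r^b via least squares on (log r, log N_r), then reports smoothed S(r).
public class PowerCurveFit {
  public static void main(String[] args) {
    // Illustrative count-of-counts data: pairs (r, N_r)
    int[] r  = {1, 2, 3, 4, 5};
    int[] nr = {120, 40, 18, 10, 6};

    // Ordinary least squares on (log r, log N_r): log N_r = log a + b log r
    int n = r.length;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
      double x = Math.log(r[i]), y = Math.log(nr[i]);
      sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double logA = (sy - b * sx) / n;

    // Smoothed values S(r) = a * r^b, usable in r* = (r+1) S(r+1) / S(r)
    for (int i = 1; i <= 6; i++) {
      double s = Math.exp(logA) * Math.pow(i, b);
      System.out.printf("S(%d) = %.2f%n", i, s);
    }
  }
}
```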
25. Good-Turing (Cont.)
- Here is a graph of the line fit in log space and the resulting power curve.
- The fit is poor, but it gives us smoothed values for Nr.
- Now, for r > 0, we use S(r) to generate the adjusted counts r*:
  - r* = (r+1) S(r+1) / S(r)
26. Smoothing and Conditional Probabilities
- Some people have the wrong idea about how to combine smoothing with conditional probability distributions. You know that a conditional distribution can be computed as the ratio of a joint distribution and a marginal distribution:
  - P(x | y) = P(x, y) / P(y)
- What if you want to use smoothing?
  - WRONG: Smooth the joint P(x, y) and the marginal P(y) independently, then combine: P'''(x | y) = P'(x, y) / P''(y).
  - Correct: Smooth P(x | y) itself (see the next slide).
27. Smoothing and Conditional Probabilities (Cont.)
- The problem is that steps 1 and 2 do smoothing separately, so it makes no sense to divide the results. The right way to compute the smoothed conditional probability distribution P(x | y) is:
  - 1. From the joint P(x, y), compute a smoothed joint P'(x, y).
  - 2. From the smoothed joint P'(x, y), compute a smoothed marginal P'(y).
  - 3. Divide them: let P'(x | y) = P'(x, y) / P'(y). (A small code sketch of this recipe follows.)
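A minimal sketch of the recipe, assuming for illustration that the smoothed joint is available as a dense array:

```java
// Smooth the joint once, derive the marginal from the smoothed joint, then divide.
public class SmoothedConditional {
  // smoothedJoint[x][y] holds P'(x, y), produced by a single smoothing step
  static double conditional(double[][] smoothedJoint, int x, int y) {
    double marginalY = 0.0;                      // P'(y) = sum_x P'(x, y)
    for (double[] row : smoothedJoint) marginalY += row[y];
    return smoothedJoint[x][y] / marginalY;      // P'(x | y) = P'(x, y) / P'(y)
  }
}
```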
28. Smoothing and Conditional Probabilities (Cont.)
- Suppose we're on safari, and we count the animals we observe by species (x) and gender (y).
- From these counts, we can easily compute unsmoothed, empirical joint and marginal distributions.
29. Smoothing and Conditional Probabilities (Cont.)
- Now suppose we want to use absolute discounting, with δ = 0.75, to get adjusted counts.
- From these counts, we can compute smoothed joint and marginal distributions.
- Now, since both P'(x, y) and P'(y) result from the same smoothing operation, it's OK to divide them.
30. Java
- Any confusion about roulette wheel sampling in generateWord()? (A small sketch follows below.)
- You want to train a trigram model on 10 million words, but you keep running out of memory:
  - Don't use a V × V × V matrix!
  - CounterMap? Good, but it needs to be elaborated.
  - Intern your Strings!
  - Virtual memory: -mx2000m
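Here is a minimal sketch of roulette wheel sampling; it is not the assignment's actual generateWord(), and the names are illustrative:

```java
// Roulette wheel sampling: draw a word with probability proportional to its weight.
import java.util.Map;
import java.util.Random;

public class RouletteWheel {
  static final Random RNG = new Random();

  // distribution maps word -> probability; probabilities are assumed to sum to 1
  static String sample(Map<String, Double> distribution) {
    double target = RNG.nextDouble();   // a point on the wheel, in [0, 1)
    double cumulative = 0.0;
    String last = null;
    for (Map.Entry<String, Double> e : distribution.entrySet()) {
      cumulative += e.getValue();
      last = e.getKey();
      if (cumulative >= target) return e.getKey();
    }
    return last;  // guard against floating-point round-off
  }
}
```

The idea is that each word owns a slice of [0, 1) proportional to its probability; a uniform draw lands in exactly one slice.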
31. Programming Assignment Tips
- Objectives
  - The assignments are intentionally open-ended
  - Think of the assignment as an investigation
  - Think of your report as a mini research paper
- Implement the basics: bigram and trigram models, smoothing, interpolation
- Choose one or more additions, e.g.:
  - Fancy smoothing: Katz, Witten-Bell, Kneser-Ney, gapped bigrams
  - Compare smoothing the joint vs. smoothing the conditionals
  - Crude spelling model for unknown words
  - Trade-off between memory and performance
- Explain what you did, why you did it, and what you found out
32. Programming Assignment Tips
- Development
  - Use a very small dataset during development (esp. debugging).
  - Use validation data to tune hyperparameters.
  - Investigate the learning curve (performance as a function of training size).
  - Investigate the variance of your model results.
- Report
  - Be concise! 6 pages is plenty.
  - Prove that your actual, implemented distributions are proper (concisely!).
  - Include a graph or two.
  - Error analysis: discuss examples that your model gets wrong.