Smoothing

Transcript and Presenter's Notes

1
Smoothing
  • Anish Johnson and Nate Chambers
  • 10 April 2009

(Thanks to Bill MacCartney, Jenny Finkel, and
Sushant Prakash for these materials)
2
Format and content of sections
  • Mix of theoretical and practical topics,
    targeted at PAs
  • Emphasis on simple examples worked out in detail

3
Outline for today
  • Information theory intuitions and examples
    • entropy, joint entropy, conditional entropy, mutual information
    • relative entropy (KL divergence), cross entropy, perplexity
  • Smoothing examples, proofs, implementation
    • absolute discounting example
    • how to prove you have a proper probability distribution
    • Good-Turing smoothing tricks
    • smoothing and conditional distributions
  • Java implementation
    • representing huge models efficiently
  • Some tips on the programming assignments

4
Example
  • Toy language with two words "A" and "B". Want to
    build predictive model from
  • "A B, B A."
  • "A B, B A A A!"
  • "B A, A B!"
  • "A!"
  • "A, A, A."
  • "" (Can say nothing)

5
A Unigram Model
  • Let's omit punctuation and put a stop symbol
    after each utterance
  • A B B A .
  • A B B A A A .
  • B A A B .
  • A .
  • A A A .
  • .
  • Let C(x) be the observed count of unigram x
  • Let o(x) be the observed frequency of unigram x
  • A multinomial probability distribution with 2
    free parameters
  • Event space {A, B, .}
  • o(x) gives the MLE estimates of the unknown true parameters
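  • A quick tally of the six utterances above (worth double-checking): C(A) = 12, C(B) = 6, C(.) = 6, so C = 24, and o(A) = 12/24 = 1/2, o(B) = 6/24 = 1/4, o(.) = 6/24 = 1/4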

6
A Bigram Model (1/2)
  • This time we'll put a stop symbol before and after each utterance
  • . A B B A .
  • . A B B A A A .
  • . B A A B .
  • . A .
  • . A A A .
  • . .
  • Let C(x, y) be the observed count of bigram xy
  • Let o(x, y) be the observed frequency of bigram xy
  • A multinomial probability distribution with 8 free parameters
  • Event space {A, B, .} × {A, B, .}
  • o(x, y) gives the MLE estimates of the unknown true parameters

7
A Bigram Model (2/2)
  • Marginal distributions o(x) and o(y)
  • Conditional distributions o(y | x) and o(x | y)?
  • o(y | x) = o(x, y) / o(x)
  • o(x | y) = o(x, y) / o(y)

o(B | A) = o(B, A) / o(A) = (1/8) / (1/2) = 1/4
8
Entropy (1/2)
Note that entropy is (the negative of) the
expected value of the log probability of x. Log
probabilities come up a lot in statistical
modeling. Since probabilities are always ≤ 1,
log probabilities are always ≤ 0. Therefore,
entropy is always ≥ 0.
  • If X is a random variable whose distribution is p
    (which we write "X ~ p"), then we can define the
    entropy of X, H(X), as follows
  • Let's calculate the entropy of our observed
    unigram distribution, o(x)
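  • For reference, the standard definition, and the value it gives for o(x) (using the tally above, o(A) = 1/2, o(B) = o(.) = 1/4):
    H(X) = -Σx p(x) lg p(x)
    H = -( 1/2 lg 1/2 + 1/4 lg 1/4 + 1/4 lg 1/4 ) = 1/2 + 1/2 + 1/2 = 1.5 bits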

9
Entropy (2/2)
  • What does entropy mean, intuitively?
  • Helpful to think of it as a measure of the
    evenness or uniformity of the distribution
  • Lowest entropy?
  • what parameters achieve that minimum?
  • Highest entropy?
  • what parameters achieve that maximum?
  • What if m(x) = 0 for some x?
  • By definition 0 lg 0 = 0, so that events with
    probability 0 do not affect the entropy
    calculation.

10
Joint Entropy (1/2)
  • If random variables X and Y have joint
    distribution p(x, y), then we can define the
    joint entropy of X and Y, H(X, Y), as follows
  • Let's calculate the entropy of our observed
    bigram distribution, o(x, y)
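  • For reference, the standard definition: H(X, Y) = -Σx Σy p(x, y) lg p(x, y)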

11
Joint Entropy (2/2)
  • Try fiddling with the parameters of the following
    joint distribution m(x, y), and observe what
    happens to the joint entropy

12
Conditional Entropy (1/2)
  • If random variables X and Y have joint
    distribution p(x, y), then we can define the
    conditional entropy of Y given X, H(Y | X), as
    follows

13
Conditional Entropy (2/2)
  • From previous slide
  • H(Y | X) = Σx p(x) [ -Σy p(y | x) lg p(y | x) ]
  • An alternative way of computing H(Y | X):
  • H(Y | X) = H(X, Y) - H(X)
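  • (One-line check of that identity: since p(x, y) = p(x) p(y | x),
    H(X, Y) = -Σx Σy p(x, y) [lg p(x) + lg p(y | x)] = H(X) + H(Y | X),
    so H(Y | X) = H(X, Y) - H(X).)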

14
Relative entropy (1/3) (KL divergence)
  • If we have two probability distributions, p(x)
    and q(x), we can define the relative entropy (aka
    KL divergence) between p and q, D(p || q), as
    follows (the formula is written out after this list)
  • We define 0 log 0 = 0 and p log(p/0) = ∞
  • Relative entropy
  • measures how much two probability distributions
    differ.
  • asymmetric
  • Identical distributions have zero relative
    entropy.
  • Non-identical distributions have positive
    relative entropy.
  • never negative
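  • For reference, the standard definition: D(p || q) = Σx p(x) log ( p(x) / q(x) )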

15
Relative entropy (2/3) (KL divergence)
  • Suppose we want to compare our observed unigram
    distribution o(x) to some arbitrary model
    distribution m(x). What is the relative entropy
    between them?
  • Try fiddling with the parameters of m(x) and see
    what it does to the KL divergence.
  • What parameters minimize the divergence?
  • Maximize?

16
Relative entropy (3/3) (KL divergence)
  • What happens when o(x) = 0? (lg 0 is normally
    undefined!)
  • Events that cannot happen (according to o(x)) do
    not contribute to KL divergence between o(x) and
    any other distribution.
  • What happens when m(x) = 0? (division by 0!)
  • If an event x was observed (o(x) > 0) but your
    model says it can't happen (m(x) = 0), then your model
    is infinitely surprised: D(o || m) = ∞.
  • Why is D(p || q) ≥ 0? Can you prove it?

17
Smoothing Absolute Discounting
  • Idea: reduce counts of observed event types by a
    fixed amount δ, and reallocate the count mass to
    unobserved event types.
  • Absolute discounting is simple and gives quite
    good results.
  • Terminology
  • x - an event (a type) (e.g., bigram)
  • C - count of all observations (tokens) (e.g.,
    training size)
  • C(x) - count of observations (tokens) of type x
    (e.g. bigram counts)
  • V - # of event types x (e.g., size of
    vocabulary)
  • Nr - # of event types observed r times, i.e.
    #{x : C(x) = r}
  • δ - a number between 0 and 1 (e.g., 0.75)

18
Absolute Discounting (Cont.)
  • For seen types, we deduct δ from the count mass
  • Pad(x) = (C(x) - δ) / C   if C(x) > 0
  • How much count mass did we harvest by doing this?
    We took δ from each of the V - N0 seen types, so we have
    δ(V - N0) to redistribute among the N0 unseen
    types. So each unseen type will get a count mass
    of δ(V - N0) / N0
  • Pad(x) = δ(V - N0) / (N0 C)   if C(x) = 0
    (a small implementation sketch follows this list)
  • To see how this works, let's go back to our
    original example and look at bigram counts. To
    bring unseens into the picture, let's suppose we
    have another word, C, giving rise to 7 new unseen
    bigrams
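  • A minimal Java sketch of the two formulas above (hypothetical helper, not the course code; counts holds only the seen bigram types):

    import java.util.Map;

    // Sketch of absolute discounting. counts maps each SEEN bigram type to C(x);
    // totalTokens is C, allTypes is V (the number of possible bigram types),
    // delta is the discount. Assumes at least one unseen type (N0 > 0).
    public class AbsoluteDiscounting {
      public static double probability(String bigram, Map<String, Integer> counts,
                                       long totalTokens, long allTypes, double delta) {
        long seenTypes = counts.size();          // V - N0
        long unseenTypes = allTypes - seenTypes; // N0
        Integer c = counts.get(bigram);
        if (c != null && c > 0) {
          // Seen type: deduct delta from its count mass.
          return (c - delta) / totalTokens;
        }
        // Unseen type: share the harvested mass delta * (V - N0) evenly among N0 types.
        return delta * seenTypes / (unseenTypes * (double) totalTokens);
      }
    }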

19
Absolute Discounting (Cont.)
  • The probabilities add up to 1, but how do we
    prove this in general?
  • Also, how do you choose a good value for δ?
  • 0.75 is often recommended, but you can also use
    held-out data to tune this parameter.
  • Look down the column of Pad probabilities.
    Anything troubling to you?

20
Proper Probability Distributions
  • To prove that a function p is a probability
    distribution, you must show
  • 1. ∀x: p(x) ≥ 0
  • 2. Σx p(x) = 1
  • The first is generally trivial; the second can be
    more challenging.
  • A proof for absolute discounting will illustrate
    the general idea
  • Σx Pad(x) = Σx:C(x)>0 Pad(x) + Σx:C(x)=0 Pad(x)
  •           = Σx:C(x)>0 (C(x) - δ)/C + Σx:C(x)=0 δ(V - N0) / (N0 C)
  •             (V - N0 terms)             (N0 terms)
  •           = Σx:C(x)>0 C(x)/C - (V - N0)δ/C + (V - N0)δ/C
  •           = Σx C(x)/C
  •           = C/C = 1

21
Good-Turing Smoothing
  • We redistribute the count mass of types observed
    r+1 times evenly among types observed r times.
    Then, estimate P(x) as r*/C, where r* is an
    adjusted count for types observed r times. We
    want r* such that
  • r* Nr = (r+1) Nr+1
  • r* = (r+1) Nr+1 / Nr
  • => PGT(x) = r*/C = ((r+1) Nr+1 / Nr) / C
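  • (A standard sanity check on this formulation: an unseen type gets r* = 1 · N1/N0, so the N0 unseen types together receive probability mass N1/C, the fraction of tokens that were singletons.)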

22
Good-Turing (Cont.)
  • To see how this works, let's go back to our
    example and look at bigram counts. To make the
    example more interesting, we'll assume we've also
    seen the sentence, "C C C C C C C C C C C",
    giving us another 12 bigram observations, as
    summarized in the following table of
    counts

23
Good-Turing (Cont.)
  • But now the probabilities do not add up to 1
  • One problem is that for high values of r (i.e.,
    for high-frequency bigrams), Nr+1 is quite likely
    to be 0
  • One way to address this is to use the Good-Turing
    estimates only for frequencies r < k, for some
    constant cutoff k. Above this, the MLE estimates
    are used.
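  • A minimal Java sketch of the adjusted counts with this cutoff (hypothetical helper, not the course code):

    import java.util.Map;

    // Good-Turing adjusted counts with cutoff k. countOfCounts.get(r) is Nr,
    // the number of event types seen exactly r times.
    public class GoodTuring {
      public static double adjustedCount(int r, Map<Integer, Integer> countOfCounts, int k) {
        if (r >= k) {
          return r;  // high-frequency types: keep the raw (MLE) count
        }
        int nr = countOfCounts.getOrDefault(r, 0);
        int nrPlus1 = countOfCounts.getOrDefault(r + 1, 0);
        if (nr == 0) {
          return r;  // cannot reestimate without Nr; fall back to the raw count
        }
        return (r + 1) * (double) nrPlus1 / nr;  // r* = (r+1) Nr+1 / Nr
      }
      // Then P_GT(x) = adjustedCount(C(x), countOfCounts, k) / C, with C the total observations.
    }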

24
Good-Turing (Cont.)
  • Thus a better way to define r* is
  • r* = (r+1) E[Nr+1] / E[Nr]
  • where E[Nr] is the expected number of event
    types (bigrams) observed r times.
  • So we fit some function S to the observed values
    (r, Nr)
  • Gale & Sampson (1995) suggest using a power curve
    Nr = a r^b, with b < -1
  • Fit using linear regression in logspace: log Nr =
    log a + b log r   (a small fitting sketch follows this list)
  • The observed distribution of Nr, when transformed
    into logspace, looks like this
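  • A least-squares fit of that line in logspace might look like this (a sketch under the assumption that every r in the input has Nr > 0; not the course code):

    // Fits log Nr = log a + b log r by ordinary least squares.
    // rs[i] is a frequency r, ns[i] is the observed Nr; both must be positive.
    public class PowerCurveFit {
      public static double[] fitLogLog(int[] rs, int[] ns) {
        int n = rs.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
          double x = Math.log(rs[i]);
          double y = Math.log(ns[i]);
          sx += x; sy += y; sxx += x * x; sxy += x * y;
        }
        double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);  // slope b (expected < -1)
        double logA = (sy - b * sx) / n;                        // intercept log a
        return new double[] { Math.exp(logA), b };              // S(r) = a * r^b
      }
    }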

25
Good-Turing (Cont.)
  • Here is a graph of the line fit in logspace and
    the resulting power curve
  • The fit is poor, but it gives us smoothed
    values for Nr.
  • Now, for r > 0, we use S(r) to generate the
    adjusted counts r*
  • r* = (r+1) S(r+1) / S(r)

26
Smoothing and Conditional Probabilities
  • Some people have the wrong idea about how to
    combine smoothing with conditional probability
    distributions. You know that a conditional
    distribution can be computed as the ratio of a
    joint distribution and a marginal distribution
  • P(x | y) = P(x, y) / P(y)
  • What if you want to use smoothing?
  • WRONG: Smooth the joint P(x, y) and the marginal P(y)
    independently, then combine: P'''(x | y) = P'(x, y) / P''(y).
  • CORRECT: Do a single smoothing of the joint and derive both
    numerator and denominator from it (next slide).

27
Smoothing and Conditional Probabilities (Cont.)
  • The problem is that steps 1 and 2 do smoothing
    separately, so it makes no sense to divide the
    results. The right way to compute the smoothed
    conditional probability distribution P(x | y) is
  • 1. From the joint P(x, y), compute a smoothed
    joint P'(x, y).
  • 2. From the smoothed joint P'(x, y), compute a
    smoothed marginal P'(y).
  • 3. Divide them: let P'(x | y) = P'(x, y) / P'(y).
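  • In code, the recipe might look like this (a sketch with a hypothetical map layout; the joint passed in is assumed to be already smoothed as a joint):

    import java.util.HashMap;
    import java.util.Map;

    // Computes P'(x | y) from a single smoothed joint P'(x, y).
    // smoothedJoint.get(x).get(y) holds P'(x, y) for every pair (x, y).
    public class SmoothedConditional {
      public static double conditional(String x, String y,
                                       Map<String, Map<String, Double>> smoothedJoint) {
        // Step 2: the marginal P'(y) is summed from the SAME smoothed joint.
        double marginal = 0.0;
        for (Map<String, Double> row : smoothedJoint.values()) {
          marginal += row.getOrDefault(y, 0.0);
        }
        // Step 3: divide; numerator and denominator come from one smoothing pass.
        double joint = smoothedJoint.getOrDefault(x, new HashMap<>()).getOrDefault(y, 0.0);
        return joint / marginal;
      }
    }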

28
Smoothing and Conditional Probabilities (Cont.)
  • Suppose we're on safari, and we count the animals
    we observe by species (x) and gender (y)
  • From these counts, we can easily compute
    unsmoothed, empirical joint and marginal
    distributions

29
Smoothing and Conditional Probabilities (Cont.)
  • Now suppose we want to use absolute discounting,
    with δ = 0.75. The adjusted counts are
  • Now from these counts, we can compute smoothed
    joint and marginal distributions
  • Now, since both P'(x, y) and P'(y) result from
    the same smoothing operation, it's OK to divide
    them

30
Java
  • Any confusion about roulette wheel sampling in
    generateWord()? (A minimal sketch follows this list.)
  • You want to train a trigram model on 10 million
    words, but you keep running out of memory.
  • Don't use a V × V × V matrix!
  • CounterMap? Good, but needs to be elaborated.
  • Intern your Strings!
  • Virtual memory: run Java with -mx2000m to allow a larger heap
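  • Roulette wheel sampling (first bullet above) is just inverse-CDF sampling over the word distribution; a minimal sketch using a plain Map rather than the assignment's CounterMap:

    import java.util.Map;
    import java.util.Random;

    // Roulette wheel sampling: draw a uniform number in [0, 1) and walk the
    // distribution, accumulating probability until the draw is covered.
    public class RouletteWheel {
      private static final Random rng = new Random();

      // dist maps each candidate word to its probability; values should sum to 1.
      public static String sample(Map<String, Double> dist) {
        double target = rng.nextDouble();
        double cumulative = 0.0;
        String last = null;
        for (Map.Entry<String, Double> e : dist.entrySet()) {
          cumulative += e.getValue();
          last = e.getKey();
          if (cumulative >= target) {
            return e.getKey();  // the wheel stopped in this word's slice
          }
        }
        return last;  // guard against probabilities summing to slightly under 1
      }
    }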

31
Programming Assignment Tips
  • Objectives
  • The assignments are intentionally open-ended
  • Think of the assignment as an investigation
  • Think of your report as a mini research paper
  • Implement the basics: bigram and trigram models,
    smoothing, interpolation
  • Choose one or more additions, e.g.
  • Fancy smoothing: Katz, Witten-Bell, Kneser-Ney,
    gapped bigram
  • Compare smoothing joint vs. smoothing
    conditionals
  • Crude spelling model for unknown words
  • Trade-off between memory and performance
  • Explain what you did, why you did it, and what
    you found out

32
Programming Assignment Tips
  • Development
  • Use a very small dataset during development (esp.
    debugging).
  • Use validation data to tune hyperparameters.
  • Investigate the learning curve (performance as a
    function of training size).
  • Investigate the variance of your model results.
  • Report
  • Be concise! 6 pages is plenty.
  • Prove that your actual, implemented distributions
    are proper (concisely!).
  • Include a graph or two.
  • Error analysis: discuss examples that your model
    gets wrong.