CS 124/LINGUIST 180: From Language to Information

1
CS 124/LINGUIST 180: From Language to Information
  • Dan Jurafsky
  • Lecture 3: Intro to Probability, Language
    Modeling

IP notice: some slides for today are from Jim
Martin, Sandiway Fong, Dan Klein
2
Outline
  • Probability
  • Basic probability
  • Conditional probability
  • Language Modeling (N-grams)
  • N-gram Intro
  • The Chain Rule
  • The Shannon Visualization Method
  • Evaluation
  • Perplexity
  • Smoothing
  • Laplace (Add-1)
  • Add-k
  • Add-prior

3
1. Introduction to Probability
  • Experiment (trial)
  • Repeatable procedure with well-defined possible
    outcomes
  • Sample Space (S)
  • the set of all possible outcomes
  • finite or infinite
  • Example
  • coin toss experiment
  • possible outcomes: S = {heads, tails}
  • Example
  • die toss experiment
  • possible outcomes: S = {1, 2, 3, 4, 5, 6}

Slides from Sandiway Fong
4
Introduction to Probability
  • Definition of sample space depends on what we are
    asking
  • Sample Space (S): the set of all possible
    outcomes
  • Example
  • die toss experiment for whether the number is
    even or odd
  • possible outcomes: {even, odd}
  • not {1, 2, 3, 4, 5, 6}

5
More definitions
  • Events
  • an event is any subset of outcomes from the
    sample space
  • Example
  • die toss experiment
  • let A represent the event such that the outcome
    of the die toss experiment is divisible by 3
  • A = {3, 6}
  • A is a subset of the sample space S =
    {1, 2, 3, 4, 5, 6}
  • Example
  • Draw a card from a deck
  • suppose sample space S = {heart, spade, club,
    diamond} (four suits)
  • let A represent the event of drawing a heart
  • let B represent the event of drawing a red card
  • A = {heart}
  • B = {heart, diamond}

6
Introduction to Probability
  • Some definitions
  • Counting
  • suppose operation oi can be performed in ni ways,
    then
  • a sequence of k operations o1, o2, ..., ok
  • can be performed in n1 × n2 × ... × nk ways
  • Example
  • die toss experiment, 6 possible outcomes
  • two dice are thrown at the same time
  • number of sample points in the sample space
    = 6 × 6 = 36
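
A quick Python check of the counting rule for the two-dice
example above (nothing beyond the slide's arithmetic):

    from itertools import product

    # two dice, 6 outcomes each: 6 x 6 = 36 ordered sample points
    sample_space = list(product(range(1, 7), repeat=2))
    print(len(sample_space))  # 36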

7
Definition of Probability
  • The probability law assigns to an event a
    nonnegative number
  • Called P(A)
  • Also called "the probability of A"
  • That encodes our knowledge or belief about the
    collective likelihood of all the elements of A
  • Probability law must satisfy certain properties

8
Probability Axioms
  • Nonnegativity
  • P(A) ≥ 0, for every event A
  • Additivity
  • If A and B are two disjoint events, then the
    probability of their union satisfies
  • P(A ∪ B) = P(A) + P(B)
  • Normalization
  • The probability of the entire sample space S is
    equal to 1, i.e. P(S) = 1.

9
An example
  • An experiment involving a single coin toss
  • There are two possible outcomes, H and T
  • Sample space S is {H, T}
  • If the coin is fair, we should assign equal
    probabilities to the 2 outcomes
  • Since they have to sum to 1
  • P(H) = 0.5
  • P(T) = 0.5
  • P({H, T}) = P(H) + P(T) = 1.0

10
Another example
  • Experiment involving 3 coin tosses
  • Outcome is a 3-long string of H or T
  • S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  • Assume each outcome is equiprobable
  • Uniform distribution
  • What is probability of the event that exactly 2
    heads occur?
  • A = {HHT, HTH, THH}
  • P(A) = P(HHT) + P(HTH) + P(THH)
  • = 1/8 + 1/8 + 1/8
  • = 3/8
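
A short Python sketch that enumerates the 3-toss sample space
and recovers P(exactly two heads) = 3/8 under the uniform
assumption:

    from itertools import product
    from fractions import Fraction

    outcomes = list(product("HT", repeat=3))        # 8 equiprobable outcomes
    A = [o for o in outcomes if o.count("H") == 2]  # {HHT, HTH, THH}
    print(Fraction(len(A), len(outcomes)))          # 3/8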

11
Probability definitions
  • In summary
  • Probability of drawing a spade from 52
    well-shuffled playing cards = 13/52 = 1/4 = 0.25

12
Probabilities of two events
  • If two events A and B are independent
  • Then
  • P(A and B) = P(A) × P(B)
  • If we flip a fair coin twice
  • What is the probability that they are both heads?
  • If we draw a card from a deck, then put it back,
    and draw a card from the deck again
  • What is the probability that both drawn cards are
    hearts?
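
For the two questions above, a minimal sketch under the
independence assumption (fair coin; 52-card deck sampled with
replacement):

    from fractions import Fraction

    p_two_heads = Fraction(1, 2) * Fraction(1, 2)        # 1/4
    p_two_hearts = Fraction(13, 52) * Fraction(13, 52)   # 1/16 (card replaced)
    print(p_two_heads, p_two_hearts)                     # 1/4 1/16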

13
How about non-uniform probabilities? An example
  • A biased coin,
  • twice as likely to come up tails as heads,
  • is tossed twice
  • What is the probability that at least one head
    occurs?
  • Sample space: {hh, ht, th, tt} (h = heads, t =
    tails)
  • Sample points / probabilities for the event
  • ht: 1/3 × 2/3 = 2/9    hh: 1/3 × 1/3 = 1/9
  • th: 2/3 × 1/3 = 2/9    tt: 2/3 × 2/3 = 4/9
  • Answer: 5/9 ≈ 0.56 (the sum of the weights of ht,
    th, and hh)
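
The same answer computed by summing the weighted sample points
in Python (with P(h) = 1/3 and P(t) = 2/3 as above):

    from fractions import Fraction
    from itertools import product

    p = {"h": Fraction(1, 3), "t": Fraction(2, 3)}
    at_least_one_head = sum(
        p[a] * p[b]                       # independent tosses multiply
        for a, b in product("ht", repeat=2)
        if "h" in (a, b))                 # event: at least one head
    print(at_least_one_head)              # 5/9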

14
Moving toward language
  • What's the probability of drawing a 2 from a deck
    of 52 cards with four 2s?
  • What's the probability of a random word (from a
    random dictionary page) being a verb?

15
Probability and part of speech tags
  • What's the probability of a random word (from a
    random dictionary page) being a verb?
  • How to compute each of these
  • All words: just count all the words in the
    dictionary
  • # of ways to get a verb: the number of words
    which are verbs!
  • If a dictionary has 50,000 entries, and 10,000
    are verbs, P(V) is 10000/50000 = 1/5 = .20

16
Conditional Probability
  • A way to reason about the outcome of an
    experiment based on partial information
  • In a word guessing game, the first letter of the
    word is a "t". What is the likelihood that the
    second letter is an "h"?
  • How likely is it that a person has a disease
    given that a medical test was negative?
  • A spot shows up on a radar screen. How likely is
    it that it corresponds to an aircraft?

17
More precisely
  • Given an experiment, a corresponding sample space
    S, and a probability law
  • Suppose we know that the outcome is within some
    given event B
  • We want to quantify the likelihood that the
    outcome also belongs to some other given event A.
  • We need a new probability law that gives us the
    conditional probability of A given B
  • P(A|B)

18
An intuition
  • A is "it's raining now"
  • P(A) in dry California is .01
  • B is "it was raining ten minutes ago"
  • P(A|B) means "what is the probability of it
    raining now if it was raining 10 minutes ago"
  • P(A|B) is probably way higher than P(A)
  • Perhaps P(A|B) is .10
  • Intuition: the knowledge about B should change
    our estimate of the probability of A.

19
Conditional probability
  • One of the following 30 items is chosen at random
  • What is P(X), the probability that it is an X?
  • What is P(X|red), the probability that it is an X
    given that it is red?

20
Conditional Probability
  • let A and B be events
  • p(B|A) = the probability of event B occurring
    given that event A occurs
  • definition: p(B|A) = p(A ∩ B) / p(A)

21
Conditional probability
  • P(A|B) = P(A ∩ B) / P(B)
  • Or, equivalently: P(A ∩ B) = P(A|B) P(B)

Note: P(A,B) = P(A|B) P(B)    Also: P(A,B) = P(B,A)
22
Independence
  • What is P(A,B) if A and B are independent?
  • P(A,B) = P(A) P(B) iff A, B independent
  • P(heads, tails) = P(heads) P(tails) = .5 × .5 =
    .25
  • Note: P(A|B) = P(A) iff A, B independent
  • Also: P(B|A) = P(B) iff A, B independent

23
Summary
  • Probability
  • Conditional Probability
  • Independence

24
Language Modeling
  • We want to compute
  • P(w1, w2, w3, w4, w5, ..., wn) = P(W)
  • the probability of a sequence
  • Alternatively we want to compute
  • P(w5 | w1, w2, w3, w4)
  • the probability of a word given some previous
    words
  • The model that computes
  • P(W) or
  • P(wn | w1, w2, ..., wn-1)
  • is called the language model.
  • A better term for this would be "The Grammar"
  • But "language model" or "LM" is standard

25
Computing P(W)
  • How to compute this joint probability
  • P(the, other, day, I, was, walking, along, and,
    saw, a, lizard)
  • Intuition: let's rely on the Chain Rule of
    Probability

26
The Chain Rule
  • Recall the definition of conditional
    probabilities
  • Rewriting
  • More generally
  • P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
  • In general
  • P(x1,x2,x3,...,xn) = P(x1) P(x2|x1) P(x3|x1,x2)
    ... P(xn|x1...xn-1)

27
The Chain Rule applied to the joint probability of
words in a sentence
  • P(the big red dog was)
  • = P(the) P(big|the) P(red|the big) P(dog|the big
    red) P(was|the big red dog)
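
A small illustration of the chain-rule decomposition in Python.
Here cond_prob is a hypothetical stand-in for whatever estimator
supplies P(word | history):

    def sentence_prob(words, cond_prob):
        """Chain rule: P(w1..wn) = product of P(wi | w1..wi-1)."""
        prob = 1.0
        for i, w in enumerate(words):
            prob *= cond_prob(w, words[:i])   # P(wi | w1 .. wi-1)
        return prob

    # e.g. sentence_prob("the big red dog was".split(), my_estimator)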

28
Very easy estimate
  • How to estimate?
  • P(the | its water is so transparent that)
  • P(the | its water is so transparent that) =
    C(its water is so transparent that the)
    / C(its water is so transparent that)

29
Unfortunately
  • There are a lot of possible sentences
  • We'll never be able to get enough data to compute
    the statistics for those long prefixes
  • P(lizard | the, other, day, I, was, walking,
    along, and, saw, a)
  • Or
  • P(the | its water is so transparent that)

30
Markov Assumption
  • Make the simplifying assumption
  • P(lizard | the, other, day, I, was, walking,
    along, and, saw, a) ≈ P(lizard | a)
  • Or maybe
  • P(lizard | the, other, day, I, was, walking,
    along, and, saw, a) ≈ P(lizard | saw, a)

31
Markov Assumption
  • So for each component in the product, replace it
    with the approximation (assuming a prefix of N)
  • Bigram version: P(wn | w1 ... wn-1) ≈ P(wn | wn-1)

32
Estimating bigram probabilities
  • The Maximum Likelihood Estimate:
    P(wn | wn-1) = C(wn-1 wn) / C(wn-1)

33
An example
  • <s> I am Sam </s>
  • <s> Sam I am </s>
  • <s> I do not like green eggs and ham </s>
  • This is the Maximum Likelihood Estimate, because
    it is the one which maximizes P(Training
    set | Model)
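
A minimal sketch of the bigram MLE, P(wn | wn-1) = C(wn-1 wn) /
C(wn-1), computed over the three sentences above (sentence
markers included):

    from collections import Counter

    corpus = ["<s> I am Sam </s>",
              "<s> Sam I am </s>",
              "<s> I do not like green eggs and ham </s>"]

    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = sent.split()
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))

    def p_mle(w, prev):
        return bigrams[(prev, w)] / unigrams[prev]

    print(p_mle("I", "<s>"))   # 2/3
    print(p_mle("Sam", "am"))  # 1/2
    print(p_mle("do", "I"))    # 1/3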

34
Maximum Likelihood Estimates
  • The maximum likelihood estimate of some parameter
    of a model M from a training set T
  • Is the estimate
  • that maximizes the likelihood of the training set
    T given the model M
  • Suppose the word "Chinese" occurs 400 times in a
    corpus of a million words (Brown corpus)
  • What is the probability that a random word from
    some other text will be "Chinese"?
  • MLE estimate is 400/1,000,000 = .0004
  • This may be a bad estimate for some other corpus
  • But it is the estimate that makes it most likely
    that "Chinese" will occur 400 times in a million
    word corpus.

35
More examples: Berkeley Restaurant Project
sentences
  • can you tell me about any good cantonese
    restaurants close by
  • mid priced thai food is what i'm looking for
  • tell me about chez panisse
  • can you give me a listing of the kinds of food
    that are available
  • i'm looking for a good place to eat breakfast
  • when is caffe venezia open during the day

36
Raw bigram counts
  • Out of 9222 sentences

37
Raw bigram probabilities
  • Normalize by unigrams
  • Result

38
Bigram estimates of sentence probabilities
  • P(<s> I want english food </s>)
  • = P(i | <s>) ×
  • P(want | I) ×
  • P(english | want) ×
  • P(food | english) ×
  • P(</s> | food)
  • = .000031
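
The .000031 above is just the product of the component bigram
probabilities. A hedged reconstruction in Python: the values for
P(want | i), P(food | english), and P(</s> | food) come from the
Berkeley Restaurant bigram table, which did not survive
transcription, so treat them as assumed here:

    probs = {("<s>", "i"): 0.25, ("i", "want"): 0.33,
             ("want", "english"): 0.0011,
             ("english", "food"): 0.5, ("food", "</s>"): 0.68}

    sentence = ["<s>", "i", "want", "english", "food", "</s>"]
    p = 1.0
    for prev, w in zip(sentence, sentence[1:]):
        p *= probs[(prev, w)]
    print(p)   # ~0.000031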

39
What kinds of knowledge?
  • P(english | want) = .0011
  • P(chinese | want) = .0065
  • P(to | want) = .66
  • P(eat | to) = .28
  • P(food | to) = 0
  • P(want | spend) = 0
  • P(i | <s>) = .25

40
The Shannon Visualization Method
  • Generate random sentences
  • Choose a random bigram (<s>, w) according to its
    probability
  • Now choose a random bigram (w, x) according to
    its probability
  • And so on until we choose </s>
  • Then string the words together
  • <s> I
  • I want
  • want to
  • to eat
  • eat Chinese
  • Chinese food
  • food </s>
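
A sketch of the Shannon visualization method as a bigram
sampler; bigram_probs is a hypothetical dict mapping each
history word to a {next_word: probability} distribution:

    import random

    def shannon_generate(bigram_probs, start="<s>", end="</s>"):
        """Sample one bigram at a time until </s> is drawn."""
        word, sent = start, []
        while True:
            nxt = random.choices(list(bigram_probs[word]),
                                 weights=list(bigram_probs[word].values()))[0]
            if nxt == end:
                return " ".join(sent)
            sent.append(nxt)
            word = nxt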

41
Approximating Shakespeare

42
Shakespeare as corpus
  • N = 884,647 tokens, V = 29,066
  • Shakespeare produced 300,000 bigram types out of
    V² = 844 million possible bigrams: so, 99.96% of
    the possible bigrams were never seen (have zero
    entries in the table)
  • Quadrigrams are worse: what's coming out looks
    like Shakespeare because it is Shakespeare

43
The Wall Street Journal is not Shakespeare (no
offense)

44
Lesson 1: the perils of overfitting
  • N-grams only work well for word prediction if the
    test corpus looks like the training corpus
  • In real life, it often doesn't
  • We need to train robust models, adapt to test
    set, etc

45
Lesson 2: zeros or not?
  • Zipf's Law
  • A small number of events occur with high
    frequency
  • A large number of events occur with low frequency
  • You can quickly collect statistics on the high
    frequency events
  • You might have to wait an arbitrarily long time
    to get valid statistics on low frequency events
  • Result
  • Our estimates are sparse! We have no counts at
    all for the vast bulk of things we want to
    estimate!
  • Some of the zeroes in the table are really zeros.
    But others are simply low-frequency events you
    haven't seen yet. After all, ANYTHING CAN
    HAPPEN!
  • How to address?
  • Answer
  • Estimate the likelihood of unseen N-grams!

Slide adapted from Bonnie Dorr and Julia
Hirschberg
46
Smoothing is like Robin Hood: steal from the rich
and give to the poor (in probability mass)
Slide from Dan Klein
47
Laplace smoothing
  • Also called add-one smoothing
  • Just add one to all the counts!
  • Very simple
  • MLE estimate
  • Laplace estimate
  • Reconstructed counts
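
The three formulas on this slide did not survive transcription;
the standard forms are: MLE P(wi | wi-1) = c(wi-1, wi) / c(wi-1);
Laplace P(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V); and the
reconstructed count c* = (c(wi-1, wi) + 1) × c(wi-1) / (c(wi-1) + V).
As a sketch, reusing unigram/bigram Counters like the hypothetical
ones above:

    def p_laplace(w, prev, unigrams, bigrams, V):
        # add 1 to every bigram count; add V to the denominator
        # so each row still sums to 1
        return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

    def reconstructed_count(w, prev, unigrams, bigrams, V):
        # the smoothed count c* implied by the Laplace probabilities
        return ((bigrams[(prev, w)] + 1) * unigrams[prev]
                / (unigrams[prev] + V))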

48
Laplace smoothed bigram counts
49
Laplace-smoothed bigrams
50
Reconstituted counts
51
Note big change to counts
  • C(want to) went from 608 to 238!
  • P(to | want) went from .66 to .26!
  • Discount d = c*/c
  • d for "chinese food" = .10! A 10x reduction
  • So in general, Laplace is a blunt instrument
  • But Laplace smoothing is not used for N-grams, as
    we have much better methods
  • Despite its flaws, Laplace (add-k) is still used
    to smooth other probabilistic models in NLP,
    especially
  • For pilot studies
  • in domains where the number of zeros isn't so
    huge.

52
Add-k
  • Add a small fraction instead of 1

53
Bayesian unigram prior smoothing for bigrams
  • Maximum Likelihood Estimation
  • Laplace Smoothing
  • Bayesian prior Smoothing
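
The formulas on these two slides were images and did not survive
transcription; a hedged reconstruction assuming the usual
textbook forms (N is the total token count):

    def p_add_k(w, prev, unigrams, bigrams, V, k=0.01):
        # add a small fraction k instead of 1; k = 1 recovers Laplace
        return (bigrams[(prev, w)] + k) / (unigrams[prev] + k * V)

    def p_unigram_prior(w, prev, unigrams, bigrams, N):
        # back off toward the unigram prior P(w) = c(w) / N
        return ((bigrams[(prev, w)] + unigrams[w] / N)
                / (unigrams[prev] + 1))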

54
Practical Issues
  • We do everything in log space
  • Avoid underflow
  • (also adding is faster than multiplying)
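
A minimal sketch of why we work in log space: the product of
many small probabilities underflows, so we sum log probabilities
instead (bigram_logprobs is a hypothetical table of
log P(wi | wi-1)):

    def log_sentence_prob(bigram_logprobs, sentence):
        # sum of log P(wi | wi-1) instead of a product of P(wi | wi-1);
        # p1 * p2 * p3 == exp(log p1 + log p2 + log p3)
        return sum(bigram_logprobs[(prev, w)]
                   for prev, w in zip(sentence, sentence[1:]))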

55
Language Modeling Toolkits
  • SRILM
  • http://www.speech.sri.com/projects/srilm/
  • CMU-Cambridge LM Toolkit

56
Google N-Gram Release
57
Google N-Gram Release
  • serve as the incoming 92
  • serve as the incubator 99
  • serve as the independent 794
  • serve as the index 223
  • serve as the indication 72
  • serve as the indicator 120
  • serve as the indicators 45
  • serve as the indispensable 111
  • serve as the indispensible 40
  • serve as the individual 234

58
Advanced stuff: Perplexity
  • We didn't get to this in lecture, but it is good
    to know, and you can check out the section in the
    chapter

59
Evaluation
  • We train parameters of our model on a training
    set.
  • How do we evaluate how well our model works?
  • We look at the model's performance on some new
    data
  • This is what happens in the real world: we want
    to know how our model performs on data we haven't
    seen
  • So we use a test set: a dataset which is
    different from our training set
  • Then we need an evaluation metric to tell us how
    well our model is doing on the test set.
  • One such metric is perplexity (to be introduced
    below)

60
Unknown words: open versus closed vocabulary tasks
  • If we know all the words in advance
  • Vocabulary V is fixed
  • Closed vocabulary task
  • Often we don't know this
  • Out Of Vocabulary (OOV) words
  • Open vocabulary task
  • Instead, create an unknown word token <UNK>
  • Training of <UNK> probabilities
  • Create a fixed lexicon L of size V
  • At the text normalization phase, any training
    word not in L is changed to <UNK>
  • Now we train its probabilities like a normal word
  • At decoding time
  • If text input: use <UNK> probabilities for any
    word not in training
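
A sketch of the <UNK> normalization step described above, with a
hypothetical frequency cutoff standing in for the fixed lexicon L:

    from collections import Counter

    def replace_rare(tokenized_sents, min_count=2):
        """Map training words seen fewer than min_count times to <UNK>."""
        counts = Counter(w for sent in tokenized_sents for w in sent)
        return [[w if counts[w] >= min_count else "<UNK>" for w in sent]
                for sent in tokenized_sents]

    # At decoding time, any word outside the resulting lexicon is
    # likewise scored with the <UNK> probabilities.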

61
Evaluating N-gram models
  • Best evaluation for an N-gram
  • Put model A in a task (language identification,
    speech recognizer, machine translation system)
  • Run the task, get an accuracy for A (how many
    languages identified correctly, or Word Error
    Rate, etc.)
  • Put model B in task, get accuracy for B
  • Compare accuracy for A and B
  • Extrinsic evaluation

62
Difficulty of extrinsic (in-vivo) evaluation of
N-gram models
  • Extrinsic evaluation
  • This is really time-consuming
  • Can take days to run an experiment
  • So
  • As a temporary solution, in order to run
    experiments
  • To evaluate N-grams we often use an intrinsic
    evaluation, an approximation called perplexity
  • But perplexity is a poor approximation unless the
    test data looks just like the training data
  • So it is generally only useful in pilot
    experiments (and generally not sufficient to
    publish)
  • But it is helpful to think about.

63
Perplexity
  • Perplexity is the probability of the test set
    (assigned by the language model), normalized by
    the number of words
  • Chain rule
  • For bigrams
  • Minimizing perplexity is the same as maximizing
    probability
  • The best language model is one that best predicts
    an unseen test set
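
Perplexity is the inverse probability of the test set normalized
by its length, PP(W) = P(w1..wN)^(-1/N); a sketch in log space,
with bigram_logprobs again a hypothetical table of natural-log
bigram probabilities:

    import math

    def perplexity(bigram_logprobs, test_words):
        N = len(test_words) - 1              # number of predicted tokens
        log_p = sum(bigram_logprobs[(prev, w)]
                    for prev, w in zip(test_words, test_words[1:]))
        return math.exp(-log_p / N)          # lower is better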

64
A totally different perplexity intuition
  • How hard is the task of recognizing digits
    "0, 1, 2, 3, 4, 5, 6, 7, 8, 9, oh"? Easy:
    perplexity 11 (or, if we ignore "oh", perplexity
    10)
  • How hard is recognizing (30,000) names at
    Microsoft? Hard: perplexity 30,000
  • If a system has to recognize
  • Operator (1 in 4)
  • Sales (1 in 4)
  • Technical Support (1 in 4)
  • 30,000 names (1 in 120,000 each)
  • Perplexity is 54
  • Perplexity is the weighted equivalent branching
    factor

Slide from Josh Goodman
65
Perplexity as branching factor
66
Lower perplexity = better model
  • Training: 38 million words; test: 1.5 million
    words (WSJ)

67
Advanced LM stuff
  • Current best smoothing algorithm
  • Kneser-Ney smoothing
  • Other stuff
  • Interpolation
  • Backoff
  • Variable-length n-grams
  • Class-based n-grams
  • Clustering
  • Hand-built classes
  • Cache LMs
  • Topic-based LMs
  • Sentence mixture models
  • Skipping LMs
  • Parser-based LMs

68
Summary
  • Probability
  • Basic probability
  • Conditional probability
  • Language Modeling (N-grams)
  • N-gram Intro
  • The Chain Rule
  • The Shannon Visualization Method
  • Evaluation
  • Perplexity
  • Smoothing
  • Laplace (Add-1)
  • Add-k
  • Add-prior