Title: CS 124/LINGUIST 180: From Language to Information
Slide 1: CS 124/LINGUIST 180: From Language to Information
- Dan Jurafsky
- Lecture 3: Intro to Probability, Language Modeling
IP notice: some slides for today are from Jim Martin, Sandiway Fong, and Dan Klein
Slide 2: Outline
- Probability
- Basic probability
- Conditional probability
- Language Modeling (N-grams)
- N-gram Intro
- The Chain Rule
- The Shannon Visualization Method
- Evaluation
- Perplexity
- Smoothing
- Laplace (Add-1)
- Add-prior
Slide 3: 1. Introduction to Probability
- Experiment (trial)
- Repeatable procedure with well-defined possible outcomes
- Sample Space (S)
- the set of all possible outcomes
- finite or infinite
- Example
- coin toss experiment
- possible outcomes: S = {heads, tails}
- Example
- die toss experiment
- possible outcomes: S = {1, 2, 3, 4, 5, 6}
Slides from Sandiway Fong
Slide 4: Introduction to Probability
- Definition of sample space depends on what we are asking
- Sample Space (S): the set of all possible outcomes
- Example
- die toss experiment for whether the number is even or odd
- possible outcomes: {even, odd}
- not {1, 2, 3, 4, 5, 6}
Slide 5: More definitions
- Events
- an event is any subset of outcomes from the sample space
- Example
- die toss experiment
- let A represent the event that the outcome of the die toss experiment is divisible by 3
- A = {3, 6}
- A is a subset of the sample space S = {1, 2, 3, 4, 5, 6}
- Example
- Draw a card from a deck
- suppose the sample space S = {heart, spade, club, diamond} (four suits)
- let A represent the event of drawing a heart
- let B represent the event of drawing a red card
- A = {heart}
- B = {heart, diamond}
Slide 6: Introduction to Probability
- Some definitions
- Counting
- suppose operation oi can be performed in ni ways; then
- a sequence of k operations o1, o2, ..., ok
- can be performed in n1 × n2 × ... × nk ways
- Example
- die toss experiment, 6 possible outcomes
- two dice are thrown at the same time
- number of sample points in the sample space = 6 × 6 = 36
Slide 7: Definition of Probability
- The probability law assigns to an event a nonnegative number
- Called P(A)
- Also called the probability of A
- That encodes our knowledge or belief about the collective likelihood of all the elements of A
- The probability law must satisfy certain properties
Slide 8: Probability Axioms
- Nonnegativity
- P(A) ≥ 0, for every event A
- Additivity
- If A and B are two disjoint events, then the probability of their union satisfies
- P(A ∪ B) = P(A) + P(B)
- Normalization
- The probability of the entire sample space S is equal to 1, i.e., P(S) = 1.
Slide 9: An example
- An experiment involving a single coin toss
- There are two possible outcomes, H and T
- Sample space S is {H, T}
- If the coin is fair, we should assign equal probabilities to the 2 outcomes
- Since they have to sum to 1
- P(H) = 0.5
- P(T) = 0.5
- P({H, T}) = P(H) + P(T) = 1.0
Slide 10: Another example
- Experiment involving 3 coin tosses
- Outcome is a 3-long string of H or T
- S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
- Assume each outcome is equiprobable
- Uniform distribution
- What is the probability of the event that exactly 2 heads occur?
- A = {HHT, HTH, THH}
- P(A) = P(HHT) + P(HTH) + P(THH)
- = 1/8 + 1/8 + 1/8
- = 3/8
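A quick way to check this kind of enumeration argument (a minimal sketch of my own, not from the slides) is to enumerate the sample space and count the outcomes in the event:

```python
from itertools import product

# Sample space for 3 coin tosses: all 3-long strings of H and T.
sample_space = [''.join(toss) for toss in product('HT', repeat=3)]

# Event A: exactly 2 heads occur.
A = [outcome for outcome in sample_space if outcome.count('H') == 2]

# Under the uniform distribution, P(A) = |A| / |S|.
print(len(A), len(sample_space), len(A) / len(sample_space))  # 3 8 0.375
```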
Slide 11: Probability definitions
- In summary
- Probability of drawing a spade from 52 well-shuffled playing cards: 13/52 = 1/4 = 0.25
Slide 12: Probabilities of two events
- If two events A and B are independent
- Then
- P(A and B) = P(A) × P(B)
- If we flip a fair coin twice
- What is the probability that they are both heads?
- If we draw a card from a deck, then put it back, and draw a card from the deck again
- What is the probability that both drawn cards are hearts?
Slide 13: How about non-uniform probabilities? An example
- A biased coin,
- twice as likely to come up tails as heads,
- is tossed twice
- What is the probability that at least one head occurs?
- Sample space: {hh, ht, th, tt} (h = heads, t = tails)
- Sample points/probabilities for the event:
- ht: 1/3 × 2/3 = 2/9    hh: 1/3 × 1/3 = 1/9
- th: 2/3 × 1/3 = 2/9    tt: 2/3 × 2/3 = 4/9
- Answer: 5/9 ≈ 0.56 (sum of the weights in red)
Slide 14: Moving toward language
- What's the probability of drawing a 2 from a deck of 52 cards with four 2s?
- What's the probability of a random word (from a random dictionary page) being a verb?
Slide 15: Probability and part of speech tags
- What's the probability of a random word (from a random dictionary page) being a verb?
- How to compute each of these?
- All words: just count all the words in the dictionary
- # of ways to get a verb: the number of words which are verbs!
- If a dictionary has 50,000 entries, and 10,000 are verbs, P(V) is 10000/50000 = 1/5 = .20
Slide 16: Conditional Probability
- A way to reason about the outcome of an experiment based on partial information
- In a word guessing game, the first letter of the word is a t. What is the likelihood that the second letter is an h?
- How likely is it that a person has a disease given that a medical test was negative?
- A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft?
Slide 17: More precisely
- Given an experiment, a corresponding sample space S, and a probability law
- Suppose we know that the outcome is within some given event B
- We want to quantify the likelihood that the outcome also belongs to some other given event A.
- We need a new probability law that gives us the conditional probability of A given B
- P(A|B)
Slide 18: An intuition
- A is "it's raining now."
- P(A) in dry California is .01
- B is "it was raining ten minutes ago"
- P(A|B) means "what is the probability of it raining now if it was raining 10 minutes ago"
- P(A|B) is probably way higher than P(A)
- Perhaps P(A|B) is .10
- Intuition: the knowledge about B should change our estimate of the probability of A.
Slide 19: Conditional probability
- One of the following 30 items is chosen at random
- What is P(X), the probability that it is an X?
- What is P(X|red), the probability that it is an X given that it is red?
Slide 20: Conditional Probability
- let A and B be events
- p(B|A): the probability of event B occurring given that event A occurs
- definition: p(B|A) = p(A ∩ B) / p(A)
Slide 21: Conditional probability
Note: P(A,B) = P(A|B) P(B). Also, P(A,B) = P(B,A).
Slide 22: Independence
- What is P(A,B) if A and B are independent?
- P(A,B) = P(A) × P(B) iff A, B independent.
- P(heads, tails) = P(heads) × P(tails) = .5 × .5 = .25
- Note: P(A|B) = P(A) iff A, B independent
- Also: P(B|A) = P(B) iff A, B independent
Slide 23: Summary
- Probability
- Conditional Probability
- Independence
Slide 24: Language Modeling
- We want to compute
- P(w1,w2,w3,w4,w5...wn) = P(W)
- the probability of a sequence
- Alternatively we want to compute
- P(w5|w1,w2,w3,w4)
- the probability of a word given some previous words
- The model that computes
- P(W) or
- P(wn|w1,w2...wn-1)
- is called the language model.
- A better term for this would be "the grammar"
- But "language model" or LM is standard
Slide 25: Computing P(W)
- How to compute this joint probability:
- P(the, other, day, I, was, walking, along, and, saw, a, lizard)
- Intuition: let's rely on the Chain Rule of Probability
Slide 26: The Chain Rule
- Recall the definition of conditional probabilities: P(A|B) = P(A,B) / P(B)
- Rewriting: P(A,B) = P(A|B) P(B)
- More generally
- P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
- In general
- P(x1,x2,x3,...,xn) = P(x1) P(x2|x1) P(x3|x1,x2) ... P(xn|x1,...,xn-1)
Slide 27: The Chain Rule applied to joint probability of words in sentence
- P(the big red dog was) =
- P(the) P(big|the) P(red|the big) P(dog|the big red) P(was|the big red dog)
Slide 28: Very easy estimate
- How to estimate?
- P(the | its water is so transparent that)
- P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)
Slide 29: Unfortunately
- There are a lot of possible sentences
- We'll never be able to get enough data to compute the statistics for those long prefixes
- P(lizard | the,other,day,I,was,walking,along,and,saw,a)
- Or
- P(the | its water is so transparent that)
Slide 30: Markov Assumption
- Make the simplifying assumption
- P(lizard | the,other,day,I,was,walking,along,and,saw,a) ≈ P(lizard | a)
- Or maybe
- P(lizard | the,other,day,I,was,walking,along,and,saw,a) ≈ P(lizard | saw,a)
Slide 31: Markov Assumption
- So, for each component in the product, replace it with the approximation (assuming a prefix of N)
- Bigram version
Slide 32: Estimating bigram probabilities
- The Maximum Likelihood Estimate
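The estimate itself appears as an image on the original slide; the standard form of the bigram MLE is:

```latex
P_{\mathrm{MLE}}(w_i \mid w_{i-1})
  = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}
  = \frac{C(w_{i-1} w_i)}{C(w_{i-1})}
```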
Slide 33: An example
- <s> I am Sam </s>
- <s> Sam I am </s>
- <s> I do not like green eggs and ham </s>
- This is the Maximum Likelihood Estimate, because it is the one which maximizes P(Training set | Model)
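A minimal Python sketch (my own illustration; the function and variable names are not from the slides) of computing these bigram MLEs on the toy corpus above:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    """MLE bigram estimate P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("I", "<s>"))   # 2/3: <s> is followed by I in 2 of the 3 sentences
print(p_mle("Sam", "am"))  # 1/2
print(p_mle("do", "I"))    # 1/3
```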
Slide 34: Maximum Likelihood Estimates
- The maximum likelihood estimate of some parameter of a model M from a training set T
- is the estimate
- that maximizes the likelihood of the training set T given the model M
- Suppose the word "Chinese" occurs 400 times in a corpus of a million words (Brown corpus)
- What is the probability that a random word from some other text will be "Chinese"?
- MLE estimate is 400/1,000,000 = .0004
- This may be a bad estimate for some other corpus
- But it is the estimate that makes it most likely that "Chinese" will occur 400 times in a million-word corpus.
Slide 35: More examples: Berkeley Restaurant Project sentences
- can you tell me about any good cantonese restaurants close by
- mid priced thai food is what i'm looking for
- tell me about chez panisse
- can you give me a listing of the kinds of food that are available
- i'm looking for a good place to eat breakfast
- when is caffe venezia open during the day
Slide 36: Raw bigram counts
Slide 37: Raw bigram probabilities
- Normalize by unigrams
- Result
Slide 38: Bigram estimates of sentence probabilities
- P(<s> I want english food </s>) =
- P(i|<s>) ×
- P(want|I) ×
- P(english|want) ×
- P(food|english) ×
- P(</s>|food)
- = .000031
Slide 39: What kinds of knowledge?
- P(english|want) = .0011
- P(chinese|want) = .0065
- P(to|want) = .66
- P(eat|to) = .28
- P(food|to) = 0
- P(want|spend) = 0
- P(i|<s>) = .25
Slide 40: The Shannon Visualization Method
- Generate random sentences:
- Choose a random bigram (<s>, w) according to its probability
- Now choose a random bigram (w, x) according to its probability
- And so on until we choose </s>
- Then string the words together
- <s> I
- I want
- want to
- to eat
- eat Chinese
- Chinese food
- food </s>
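A sketch of this generation loop in Python (my own; it assumes you already have a table of bigram probabilities, here a hand-made toy one, not the Berkeley Restaurant model):

```python
import random

def shannon_generate(bigram_probs):
    """Start from <s>, repeatedly sample the next word from P(w | previous word),
    and stop when </s> is drawn."""
    prev, words = "<s>", []
    while True:
        candidates = bigram_probs[prev]
        prev = random.choices(list(candidates), weights=candidates.values())[0]
        if prev == "</s>":
            return " ".join(words)
        words.append(prev)

# Toy distribution for illustration only.
probs = {
    "<s>":     {"I": 1.0},
    "I":       {"want": 1.0},
    "want":    {"to": 0.7, "Chinese": 0.3},
    "to":      {"eat": 1.0},
    "eat":     {"Chinese": 1.0},
    "Chinese": {"food": 1.0},
    "food":    {"</s>": 1.0},
}
print(shannon_generate(probs))  # e.g. "I want to eat Chinese food" or "I want Chinese food"
```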
Slide 41: Approximating Shakespeare
Slide 42: Shakespeare as corpus
- N = 884,647 tokens, V = 29,066
- Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams, so 99.96% of the possible bigrams were never seen (have zero entries in the table)
- Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare
Slide 43: The wall street journal is not shakespeare (no offense)
Slide 44: Lesson 1: the perils of overfitting
- N-grams only work well for word prediction if the test corpus looks like the training corpus
- In real life, it often doesn't
- We need to train robust models, adapt to the test set, etc.
Slide 45: Lesson 2: zeros or not?
- Zipf's Law:
- A small number of events occur with high frequency
- A large number of events occur with low frequency
- You can quickly collect statistics on the high frequency events
- You might have to wait an arbitrarily long time to get valid statistics on low frequency events
- Result:
- Our estimates are sparse! We have no counts at all for the vast bulk of things we want to estimate!
- Some of the zeroes in the table are really zeros. But others are simply low frequency events you haven't seen yet. After all, ANYTHING CAN HAPPEN!
- How to address this?
- Answer:
- Estimate the likelihood of unseen N-grams!
Slide adapted from Bonnie Dorr and Julia Hirschberg
Slide 46: Smoothing is like Robin Hood: steal from the rich and give to the poor (in probability mass)
Slide from Dan Klein
Slide 47: Laplace smoothing
- Also called add-one smoothing
- Just add one to all the counts!
- Very simple
- MLE estimate
- Laplace estimate
- Reconstructed counts
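The three formulas named above are images on the original slide; their standard forms (with V the vocabulary size and c* the reconstructed count) are:

```latex
% MLE estimate
P_{\mathrm{MLE}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i)}{C(w_{i-1})}

% Laplace (add-one) estimate
P_{\mathrm{Laplace}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + V}

% Reconstructed counts: the counts that would yield the Laplace probabilities
% under the original denominator C(w_{i-1})
c^{*}(w_{i-1} w_i) = \frac{\left(C(w_{i-1} w_i) + 1\right) C(w_{i-1})}{C(w_{i-1}) + V}
```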
Slide 48: Laplace-smoothed bigram counts
Slide 49: Laplace-smoothed bigrams
Slide 50: Reconstituted counts
Slide 51: Note the big change to the counts
- C(want to) went from 608 to 238!
- P(to|want) from .66 to .26!
- Discount d = c*/c
- d for "chinese food" = .10!!! A 10x reduction
- So in general, Laplace is a blunt instrument
- But Laplace smoothing is not used for N-grams, as we have much better methods
- Despite its flaws, Laplace (add-k) is still used to smooth other probabilistic models in NLP, especially
- for pilot studies
- in domains where the number of zeros isn't so huge.
Slide 52: Add-k
- Add a small fraction instead of 1
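The add-k formula is an image on the slide; the usual form, for a small constant k (e.g. 0.01), is:

```latex
P_{\text{Add-}k}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + k}{C(w_{i-1}) + kV}
```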
Slide 53: Bayesian unigram prior smoothing for bigrams
- Maximum Likelihood Estimation
- Laplace Smoothing
- Bayesian prior Smoothing
Slide 54: Practical Issues
- We do everything in log space
- Avoid underflow
- (also adding is faster than multiplying)
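A tiny sketch (mine, not the slide's) of why log space matters: the product of many small probabilities underflows, while the sum of their logs stays representable.

```python
import math

# Hypothetical per-word probabilities for a 400-word stretch of text.
probs = [0.001] * 400

product = 1.0
for p in probs:
    product *= p
print(product)   # 0.0 -- underflows in double precision (true value is 1e-1200)

log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # about -2763.1 -- no underflow, and only additions were needed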
Slide 55: Language Modeling Toolkits
- SRILM
- http://www.speech.sri.com/projects/srilm/
- CMU-Cambridge LM Toolkit
Slide 56: Google N-Gram Release
Slide 57: Google N-Gram Release
- serve as the incoming 92
- serve as the incubator 99
- serve as the independent 794
- serve as the index 223
- serve as the indication 72
- serve as the indicator 120
- serve as the indicators 45
- serve as the indispensable 111
- serve as the indispensible 40
- serve as the individual 234
Slide 58: Advanced stuff: Perplexity
- We didn't get to this in lecture, but it is good to know, and you can check out the section in the chapter
Slide 59: Evaluation
- We train the parameters of our model on a training set.
- How do we evaluate how well our model works?
- We look at the model's performance on some new data
- This is what happens in the real world: we want to know how our model performs on data we haven't seen
- So we use a test set: a dataset which is different from our training set
- Then we need an evaluation metric to tell us how well our model is doing on the test set.
- One such metric is perplexity (to be introduced below)
Slide 60: Unknown words: Open versus closed vocabulary tasks
- If we know all the words in advance
- Vocabulary V is fixed
- Closed vocabulary task
- Often we don't know this
- Out Of Vocabulary (OOV) words
- Open vocabulary task
- Instead, create an unknown word token <UNK>
- Training of <UNK> probabilities
- Create a fixed lexicon L of size V
- At the text normalization phase, any training word not in L is changed to <UNK>
- Now we train its probabilities like a normal word
- At decoding time
- If text input: use <UNK> probabilities for any word not in training
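A minimal sketch (my own names, not the slides') of the text-normalization step described above: replace any token that is not in the fixed lexicon L with <UNK>.

```python
def replace_oov(tokens, lexicon):
    """Map every token outside the fixed lexicon L to the <UNK> symbol."""
    return [tok if tok in lexicon else "<UNK>" for tok in tokens]

L = {"<s>", "</s>", "i", "want", "to", "eat", "chinese", "food"}
print(replace_oov("<s> i want to eat thai food </s>".split(), L))
# ['<s>', 'i', 'want', 'to', 'eat', '<UNK>', 'food', '</s>']
```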
Slide 61: Evaluating N-gram models
- Best evaluation for an N-gram:
- Put model A in a task (language identification, speech recognizer, machine translation system)
- Run the task, get an accuracy for A (how many languages identified correctly, or Word Error Rate, etc.)
- Put model B in the task, get an accuracy for B
- Compare the accuracy for A and B
- Extrinsic evaluation
Slide 62: Difficulty of extrinsic (in-vivo) evaluation of N-gram models
- Extrinsic evaluation
- This is really time-consuming
- Can take days to run an experiment
- So
- As a temporary solution, in order to run experiments
- To evaluate N-grams we often use an intrinsic evaluation, an approximation called perplexity
- But perplexity is a poor approximation unless the test data looks just like the training data
- So it is generally only useful in pilot experiments (generally not sufficient to publish)
- But it is helpful to think about.
Slide 63: Perplexity
- Perplexity is the inverse probability of the test set (assigned by the language model), normalized by the number of words
- Chain rule
- For bigrams
- Minimizing perplexity is the same as maximizing probability
- The best language model is one that best predicts an unseen test set
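The formulas referred to above ("Chain rule", "For bigrams") are images on the slide; their standard forms, for a test set W = w1 w2 ... wN, are:

```latex
% Perplexity: inverse probability of the test set, normalized by its length N
PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
      = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}

% Expanded with the chain rule
PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}

% Under the bigram approximation
PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}
```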
Slide 64: A totally different perplexity intuition
- How hard is the task of recognizing digits 0,1,2,3,4,5,6,7,8,9,oh? Easy: perplexity 11 (or, if we ignore "oh", perplexity 10)
- How hard is recognizing (30,000) names at Microsoft? Hard: perplexity = 30,000
- If a system has to recognize
- Operator (1 in 4)
- Sales (1 in 4)
- Technical Support (1 in 4)
- 30,000 names (1 in 120,000 each)
- Perplexity is 54
- Perplexity is the weighted equivalent branching factor
Slide from Josh Goodman
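As a sanity check of the "weighted equivalent branching factor" idea (my own computation, not on the slide), the perplexity of this call-routing distribution is 2 raised to its entropy, which comes out near the ~54 quoted above:

```python
import math

# Distribution from the slide: 3 options at 1/4 each, 30,000 names at 1/120000 each.
probs = [1/4] * 3 + [1/120000] * 30000
assert abs(sum(probs) - 1.0) < 1e-9

entropy = -sum(p * math.log2(p) for p in probs)  # bits per decision
print(2 ** entropy)  # roughly 53: the weighted equivalent branching factor
```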
Slide 65: Perplexity as branching factor
Slide 66: Lower perplexity = better model
- Training: 38 million words; test: 1.5 million words (WSJ)
Slide 67: Advanced LM stuff
- Current best smoothing algorithm:
- Kneser-Ney smoothing
- Other stuff
- Interpolation
- Backoff
- Variable-length n-grams
- Class-based n-grams
- Clustering
- Hand-built classes
- Cache LMs
- Topic-based LMs
- Sentence mixture models
- Skipping LMs
- Parser-based LMs
Slide 68: Summary
- Probability
- Basic probability
- Conditional probability
- Language Modeling (N-grams)
- N-gram Intro
- The Chain Rule
- The Shannon Visualization Method
- Evaluation
- Perplexity
- Smoothing
- Laplace (Add-1)
- Add-k
- Add-prior