Title: CS 124/LINGUIST 180: From Language to Information
Slide 1: CS 124/LINGUIST 180: From Language to Information
- Dan Jurafsky
- Lecture 3: Intro to Probability, Language Modeling
IP notice: some slides for today are from Jim Martin, Sandiway Fong, and Dan Klein
Slide 2: Outline
- Probability
- Basic probability
- Conditional probability
- Language Modeling (N-grams)
- N-gram Intro
- The Chain Rule
- The Shannon Visualization Method
- Evaluation
- Perplexity
- Smoothing
- Laplace (Add-1)
- Add-prior
Slide 3: 1. Introduction to Probability
- Experiment (trial)
- Repeatable procedure with well-defined possible outcomes
- Sample Space (S)
- the set of all possible outcomes
- finite or infinite
- Example
- coin toss experiment
- possible outcomes: S = {heads, tails}
- Example
- die toss experiment
- possible outcomes: S = {1, 2, 3, 4, 5, 6}
Slides from Sandiway Fong
Slide 4: Introduction to Probability
- Definition of sample space depends on what we are asking
- Sample Space (S): the set of all possible outcomes
- Example
- die toss experiment for whether the number is even or odd
- possible outcomes: {even, odd}
- not {1, 2, 3, 4, 5, 6}
Slide 5: More definitions
- Events
- an event is any subset of outcomes from the sample space
- Example
- die toss experiment
- let A represent the event that the outcome of the die toss experiment is divisible by 3
- A = {3, 6}
- A is a subset of the sample space S = {1, 2, 3, 4, 5, 6}
- Example
- Draw a card from a deck
- suppose the sample space S = {heart, spade, club, diamond} (four suits)
- let A represent the event of drawing a heart
- let B represent the event of drawing a red card
- A = {heart}
- B = {heart, diamond}
Slide 6: Introduction to Probability
- Some definitions
- Counting
- suppose operation oi can be performed in ni ways; then
- a sequence of k operations o1, o2, ..., ok
- can be performed in n1 × n2 × ... × nk ways
- Example
- die toss experiment, 6 possible outcomes
- two dice are thrown at the same time
- number of sample points in the sample space = 6 × 6 = 36
Slide 7: Definition of Probability
- The probability law assigns to an event a nonnegative number
- Called P(A)
- Also called the probability of A
- That encodes our knowledge or belief about the collective likelihood of all the elements of A
- The probability law must satisfy certain properties
Slide 8: Probability Axioms
- Nonnegativity
- P(A) ≥ 0, for every event A
- Additivity
- If A and B are two disjoint events, then the probability of their union satisfies
- P(A ∪ B) = P(A) + P(B)
- Normalization
- The probability of the entire sample space S is equal to 1, i.e., P(S) = 1.
Slide 9: An example
- An experiment involving a single coin toss
- There are two possible outcomes, H and T
- Sample space S is {H, T}
- If the coin is fair, we should assign equal probabilities to the 2 outcomes
- Since they have to sum to 1
- P(H) = 0.5
- P(T) = 0.5
- P({H, T}) = P(H) + P(T) = 1.0
Slide 10: Another example
- Experiment involving 3 coin tosses
- Outcome is a 3-long string of H or T
- S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
- Assume each outcome is equiprobable
- Uniform distribution
- What is the probability of the event that exactly 2 heads occur?
- A = {HHT, HTH, THH}
- P(A) = P(HHT) + P(HTH) + P(THH)
- = 1/8 + 1/8 + 1/8
- = 3/8
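A quick way to check this kind of enumeration argument (a minimal sketch of my own, not from the slides) is to enumerate the sample space and count the outcomes in the event:

```python
from itertools import product

# Sample space for 3 coin tosses: all 3-long strings of H and T.
sample_space = [''.join(toss) for toss in product('HT', repeat=3)]

# Event A: exactly 2 heads occur.
A = [outcome for outcome in sample_space if outcome.count('H') == 2]

# Under the uniform distribution, P(A) = |A| / |S|.
print(len(A), len(sample_space), len(A) / len(sample_space))  # 3 8 0.375
```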
Slide 11: Probability definitions
- In summary
- Probability of drawing a spade from 52 well-shuffled playing cards: 13/52 = 1/4 = 0.25
Slide 12: Probabilities of two events
- If two events A and B are independent
- Then
- P(A and B) = P(A) × P(B)
- If we flip a fair coin twice
- What is the probability that they are both heads?
- If we draw a card from a deck, then put it back, and draw a card from the deck again
- What is the probability that both drawn cards are hearts?
Slide 13: How about non-uniform probabilities? An example
- A biased coin,
- twice as likely to come up tails as heads,
- is tossed twice
- What is the probability that at least one head occurs?
- Sample space: {hh, ht, th, tt} (h = heads, t = tails)
- Sample points/probabilities for the event:
- ht: 1/3 × 2/3 = 2/9    hh: 1/3 × 1/3 = 1/9
- th: 2/3 × 1/3 = 2/9    tt: 2/3 × 2/3 = 4/9
- Answer: 5/9 ≈ 0.56 (sum of the weights in red)
Slide 14: Moving toward language
- What's the probability of drawing a 2 from a deck of 52 cards with four 2s?
- What's the probability of a random word (from a random dictionary page) being a verb?
Slide 15: Probability and part of speech tags
- What's the probability of a random word (from a random dictionary page) being a verb?
- How to compute each of these?
- All words: just count all the words in the dictionary
- # of ways to get a verb: the number of words which are verbs!
- If a dictionary has 50,000 entries, and 10,000 are verbs, P(V) is 10000/50000 = 1/5 = .20
Slide 16: Conditional Probability
- A way to reason about the outcome of an experiment based on partial information
- In a word guessing game, the first letter of the word is a t. What is the likelihood that the second letter is an h?
- How likely is it that a person has a disease given that a medical test was negative?
- A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft?
Slide 17: More precisely
- Given an experiment, a corresponding sample space S, and a probability law
- Suppose we know that the outcome is within some given event B
- We want to quantify the likelihood that the outcome also belongs to some other given event A.
- We need a new probability law that gives us the conditional probability of A given B
- P(A|B)
Slide 18: An intuition
- A is "it's raining now."
- P(A) in dry California is .01
- B is "it was raining ten minutes ago"
- P(A|B) means "what is the probability of it raining now if it was raining 10 minutes ago"
- P(A|B) is probably way higher than P(A)
- Perhaps P(A|B) is .10
- Intuition: the knowledge about B should change our estimate of the probability of A.
Slide 19: Conditional probability
- One of the following 30 items is chosen at random
- What is P(X), the probability that it is an X?
- What is P(X|red), the probability that it is an X given that it is red?
Slide 20: Conditional Probability
- let A and B be events
- p(B|A): the probability of event B occurring given that event A occurs
- definition: p(B|A) = p(A ∩ B) / p(A)
Slide 21: Conditional probability
Note: P(A,B) = P(A|B) P(B). Also, P(A,B) = P(B,A).
Slide 22: Independence
- What is P(A,B) if A and B are independent?
- P(A,B) = P(A) × P(B) iff A, B independent.
- P(heads, tails) = P(heads) × P(tails) = .5 × .5 = .25
- Note: P(A|B) = P(A) iff A, B independent
- Also: P(B|A) = P(B) iff A, B independent
Slide 23: Summary
- Probability
- Conditional Probability
- Independence
Slide 24: Language Modeling
- We want to compute
- P(w1,w2,w3,w4,w5...wn) = P(W)
- the probability of a sequence
- Alternatively we want to compute
- P(w5|w1,w2,w3,w4)
- the probability of a word given some previous words
- The model that computes
- P(W) or
- P(wn|w1,w2...wn-1)
- is called the language model.
- A better term for this would be "the grammar"
- But "language model" or LM is standard
Slide 25: Computing P(W)
- How to compute this joint probability:
- P(the, other, day, I, was, walking, along, and, saw, a, lizard)
- Intuition: let's rely on the Chain Rule of Probability
Slide 26: The Chain Rule
- Recall the definition of conditional probabilities: P(A|B) = P(A,B) / P(B)
- Rewriting: P(A,B) = P(A|B) P(B)
- More generally
- P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
- In general
- P(x1,x2,x3,...,xn) = P(x1) P(x2|x1) P(x3|x1,x2) ... P(xn|x1,...,xn-1)
Slide 27: The Chain Rule applied to joint probability of words in sentence
- P(the big red dog was) =
- P(the) P(big|the) P(red|the big) P(dog|the big red) P(was|the big red dog)
Slide 28: Very easy estimate
- How to estimate?
- P(the | its water is so transparent that)
- P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)
Slide 29: Unfortunately
- There are a lot of possible sentences
- We'll never be able to get enough data to compute the statistics for those long prefixes
- P(lizard | the,other,day,I,was,walking,along,and,saw,a)
- Or
- P(the | its water is so transparent that)
Slide 30: Markov Assumption
- Make the simplifying assumption
- P(lizard | the,other,day,I,was,walking,along,and,saw,a) ≈ P(lizard | a)
- Or maybe
- P(lizard | the,other,day,I,was,walking,along,and,saw,a) ≈ P(lizard | saw,a)
Slide 31: Markov Assumption
- So, for each component in the product, replace it with the approximation (assuming a prefix of N)
- Bigram version
Slide 32: Estimating bigram probabilities
- The Maximum Likelihood Estimate
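The estimate itself appears as an image on the original slide; the standard form of the bigram MLE is:

```latex
P_{\mathrm{MLE}}(w_i \mid w_{i-1})
  = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}
  = \frac{C(w_{i-1} w_i)}{C(w_{i-1})}
```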
Slide 33: An example
- <s> I am Sam </s>
- <s> Sam I am </s>
- <s> I do not like green eggs and ham </s>
- This is the Maximum Likelihood Estimate, because it is the one which maximizes P(Training set | Model)
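A minimal Python sketch (my own illustration; the function and variable names are not from the slides) of computing these bigram MLEs on the toy corpus above:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    """MLE bigram estimate P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("I", "<s>"))   # 2/3: <s> is followed by I in 2 of the 3 sentences
print(p_mle("Sam", "am"))  # 1/2
print(p_mle("do", "I"))    # 1/3
```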
Slide 34: Maximum Likelihood Estimates
- The maximum likelihood estimate of some parameter of a model M from a training set T
- is the estimate
- that maximizes the likelihood of the training set T given the model M
- Suppose the word "Chinese" occurs 400 times in a corpus of a million words (Brown corpus)
- What is the probability that a random word from some other text will be "Chinese"?
- MLE estimate is 400/1,000,000 = .0004
- This may be a bad estimate for some other corpus
- But it is the estimate that makes it most likely that "Chinese" will occur 400 times in a million-word corpus.
Slide 35: More examples: Berkeley Restaurant Project sentences
- can you tell me about any good cantonese restaurants close by
- mid priced thai food is what i'm looking for
- tell me about chez panisse
- can you give me a listing of the kinds of food that are available
- i'm looking for a good place to eat breakfast
- when is caffe venezia open during the day
Slide 36: Raw bigram counts
Slide 37: Raw bigram probabilities
- Normalize by unigrams
- Result
Slide 38: Bigram estimates of sentence probabilities
- P(<s> I want english food </s>) =
- P(i|<s>) ×
- P(want|I) ×
- P(english|want) ×
- P(food|english) ×
- P(</s>|food)
- = .000031
Slide 39: What kinds of knowledge?
- P(english|want) = .0011
- P(chinese|want) = .0065
- P(to|want) = .66
- P(eat|to) = .28
- P(food|to) = 0
- P(want|spend) = 0
- P(i|<s>) = .25
Slide 40: The Shannon Visualization Method
- Generate random sentences:
- Choose a random bigram (<s>, w) according to its probability
- Now choose a random bigram (w, x) according to its probability
- And so on until we choose </s>
- Then string the words together
- <s> I
- I want
- want to
- to eat
- eat Chinese
- Chinese food
- food </s>
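A sketch of this generation loop in Python (my own; it assumes you already have a table of bigram probabilities, here a hand-made toy one, not the Berkeley Restaurant model):

```python
import random

def shannon_generate(bigram_probs):
    """Start from <s>, repeatedly sample the next word from P(w | previous word),
    and stop when </s> is drawn."""
    prev, words = "<s>", []
    while True:
        candidates = bigram_probs[prev]
        prev = random.choices(list(candidates), weights=candidates.values())[0]
        if prev == "</s>":
            return " ".join(words)
        words.append(prev)

# Toy distribution for illustration only.
probs = {
    "<s>":     {"I": 1.0},
    "I":       {"want": 1.0},
    "want":    {"to": 0.7, "Chinese": 0.3},
    "to":      {"eat": 1.0},
    "eat":     {"Chinese": 1.0},
    "Chinese": {"food": 1.0},
    "food":    {"</s>": 1.0},
}
print(shannon_generate(probs))  # e.g. "I want to eat Chinese food" or "I want Chinese food"
```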
Slide 41: Approximating Shakespeare
Slide 42: Shakespeare as corpus
- N = 884,647 tokens, V = 29,066
- Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams, so 99.96% of the possible bigrams were never seen (have zero entries in the table)
- Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare
Slide 43: The wall street journal is not shakespeare (no offense)
Slide 44: Lesson 1: the perils of overfitting
- N-grams only work well for word prediction if the test corpus looks like the training corpus
- In real life, it often doesn't
- We need to train robust models, adapt to the test set, etc.
Slide 45: Lesson 2: zeros or not?
- Zipf's Law:
- A small number of events occur with high frequency
- A large number of events occur with low frequency
- You can quickly collect statistics on the high frequency events
- You might have to wait an arbitrarily long time to get valid statistics on low frequency events
- Result:
- Our estimates are sparse! We have no counts at all for the vast bulk of things we want to estimate!
- Some of the zeroes in the table are really zeros. But others are simply low frequency events you haven't seen yet. After all, ANYTHING CAN HAPPEN!
- How to address this?
- Answer:
- Estimate the likelihood of unseen N-grams!
Slide adapted from Bonnie Dorr and Julia Hirschberg
Slide 46: Smoothing is like Robin Hood: steal from the rich and give to the poor (in probability mass)
Slide from Dan Klein
Slide 47: Laplace smoothing
- Also called add-one smoothing
- Just add one to all the counts!
- Very simple
- MLE estimate
- Laplace estimate
- Reconstructed counts
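The three formulas named above are images on the original slide; their standard forms (with V the vocabulary size and c* the reconstructed count) are:

```latex
% MLE estimate
P_{\mathrm{MLE}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i)}{C(w_{i-1})}

% Laplace (add-one) estimate
P_{\mathrm{Laplace}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + V}

% Reconstructed counts: the counts that would yield the Laplace probabilities
% under the original denominator C(w_{i-1})
c^{*}(w_{i-1} w_i) = \frac{\left(C(w_{i-1} w_i) + 1\right) C(w_{i-1})}{C(w_{i-1}) + V}
```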
Slide 48: Laplace-smoothed bigram counts
Slide 49: Laplace-smoothed bigrams
Slide 50: Reconstituted counts
Slide 51: Note the big change to the counts
- C(want to) went from 608 to 238!
- P(to|want) from .66 to .26!
- Discount d = c*/c
- d for "chinese food" = .10!!! A 10x reduction
- So in general, Laplace is a blunt instrument
- But Laplace smoothing is not used for N-grams, as we have much better methods
- Despite its flaws, Laplace (add-k) is still used to smooth other probabilistic models in NLP, especially
- for pilot studies
- in domains where the number of zeros isn't so huge.
Slide 52: Add-k
- Add a small fraction instead of 1
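The add-k formula is an image on the slide; the usual form, for a small constant k (e.g. 0.01), is:

```latex
P_{\text{Add-}k}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + k}{C(w_{i-1}) + kV}
```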
Slide 53: Bayesian unigram prior smoothing for bigrams
- Maximum Likelihood Estimation
- Laplace Smoothing
- Bayesian prior Smoothing
Slide 54: Practical Issues
- We do everything in log space
- Avoid underflow
- (also adding is faster than multiplying)
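A tiny sketch (mine, not the slide's) of why log space matters: the product of many small probabilities underflows, while the sum of their logs stays representable.

```python
import math

# Hypothetical per-word probabilities for a 400-word stretch of text.
probs = [0.001] * 400

product = 1.0
for p in probs:
    product *= p
print(product)   # 0.0 -- underflows in double precision (true value is 1e-1200)

log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # about -2763.1 -- no underflow, and only additions were needed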
Slide 55: Language Modeling Toolkits
- SRILM
- http://www.speech.sri.com/projects/srilm/
- CMU-Cambridge LM Toolkit
Slide 56: Google N-Gram Release
Slide 57: Google N-Gram Release
- serve as the incoming 92
- serve as the incubator 99
- serve as the independent 794
- serve as the index 223
- serve as the indication 72
- serve as the indicator 120
- serve as the indicators 45
- serve as the indispensable 111
- serve as the indispensible 40
- serve as the individual 234
Slide 58: Advanced stuff: Perplexity
- We didn't get to this in lecture, but it is good to know, and you can check out the section in the chapter
Slide 59: Evaluation
- We train the parameters of our model on a training set.
- How do we evaluate how well our model works?
- We look at the model's performance on some new data
- This is what happens in the real world: we want to know how our model performs on data we haven't seen
- So we use a test set: a dataset which is different from our training set
- Then we need an evaluation metric to tell us how well our model is doing on the test set.
- One such metric is perplexity (to be introduced below)
Slide 60: Unknown words: Open versus closed vocabulary tasks
- If we know all the words in advance
- Vocabulary V is fixed
- Closed vocabulary task
- Often we don't know this
- Out Of Vocabulary (OOV) words
- Open vocabulary task
- Instead, create an unknown word token <UNK>
- Training of <UNK> probabilities
- Create a fixed lexicon L of size V
- At the text normalization phase, any training word not in L is changed to <UNK>
- Now we train its probabilities like a normal word
- At decoding time
- If text input: use <UNK> probabilities for any word not in training
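A minimal sketch (my own names, not the slides') of the text-normalization step described above: replace any token that is not in the fixed lexicon L with <UNK>.

```python
def replace_oov(tokens, lexicon):
    """Map every token outside the fixed lexicon L to the <UNK> symbol."""
    return [tok if tok in lexicon else "<UNK>" for tok in tokens]

L = {"<s>", "</s>", "i", "want", "to", "eat", "chinese", "food"}
print(replace_oov("<s> i want to eat thai food </s>".split(), L))
# ['<s>', 'i', 'want', 'to', 'eat', '<UNK>', 'food', '</s>']
```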
Slide 61: Evaluating N-gram models
- Best evaluation for an N-gram:
- Put model A in a task (language identification, speech recognizer, machine translation system)
- Run the task, get an accuracy for A (how many languages identified correctly, or Word Error Rate, etc.)
- Put model B in the task, get an accuracy for B
- Compare the accuracy for A and B
- Extrinsic evaluation
Slide 62: Difficulty of extrinsic (in-vivo) evaluation of N-gram models
- Extrinsic evaluation
- This is really time-consuming
- Can take days to run an experiment
- So
- As a temporary solution, in order to run experiments
- To evaluate N-grams we often use an intrinsic evaluation, an approximation called perplexity
- But perplexity is a poor approximation unless the test data looks just like the training data
- So it is generally only useful in pilot experiments (generally not sufficient to publish)
- But it is helpful to think about.
Slide 63: Perplexity
- Perplexity is the inverse probability of the test set (assigned by the language model), normalized by the number of words
- Chain rule
- For bigrams
- Minimizing perplexity is the same as maximizing probability
- The best language model is one that best predicts an unseen test set
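The formulas referred to above ("Chain rule", "For bigrams") are images on the slide; their standard forms, for a test set W = w1 w2 ... wN, are:

```latex
% Perplexity: inverse probability of the test set, normalized by its length N
PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
      = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}

% Expanded with the chain rule
PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}

% Under the bigram approximation
PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}
```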
Slide 64: A totally different perplexity intuition
- How hard is the task of recognizing digits 0,1,2,3,4,5,6,7,8,9,oh? Easy: perplexity 11 (or, if we ignore "oh", perplexity 10)
- How hard is recognizing (30,000) names at Microsoft? Hard: perplexity = 30,000
- If a system has to recognize
- Operator (1 in 4)
- Sales (1 in 4)
- Technical Support (1 in 4)
- 30,000 names (1 in 120,000 each)
- Perplexity is 54
- Perplexity is the weighted equivalent branching factor
Slide from Josh Goodman
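As a sanity check of the "weighted equivalent branching factor" idea (my own computation, not on the slide), the perplexity of this call-routing distribution is 2 raised to its entropy, which comes out near the ~54 quoted above:

```python
import math

# Distribution from the slide: 3 options at 1/4 each, 30,000 names at 1/120000 each.
probs = [1/4] * 3 + [1/120000] * 30000
assert abs(sum(probs) - 1.0) < 1e-9

entropy = -sum(p * math.log2(p) for p in probs)  # bits per decision
print(2 ** entropy)  # roughly 53: the weighted equivalent branching factor
```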
Slide 65: Perplexity as branching factor
Slide 66: Lower perplexity = better model
- Training: 38 million words; test: 1.5 million words (WSJ)
Slide 67: Advanced LM stuff
- Current best smoothing algorithm:
- Kneser-Ney smoothing
- Other stuff
- Interpolation
- Backoff
- Variable-length n-grams
- Class-based n-grams
- Clustering
- Hand-built classes
- Cache LMs
- Topic-based LMs
- Sentence mixture models
- Skipping LMs
- Parser-based LMs
Slide 68: Summary
- Probability
- Basic probability
- Conditional probability
- Language Modeling (N-grams)
- N-gram Intro
- The Chain Rule
- The Shannon Visualization Method
- Evaluation
- Perplexity
- Smoothing
- Laplace (Add-1)
- Add-k
- Add-prior