Title: CPSC 503 Computational Linguistics
1. CPSC 503 Computational Linguistics
- Intro to Probability and Information Theory
- Lecture 5
- Giuseppe Carenini
2. Today 28/1
- Why do we need probabilities and information theory?
- Basic Probability Theory
- Basic Information Theory
3. Why do we need probabilities?
- For spelling errors: what is the most probable correct word?
- For real-word spelling errors, speech and handwriting recognition: what is the most probable next word?
- Part-of-speech tagging, word-sense disambiguation, probabilistic parsing
Basic question: what is the probability of a sequence of words (e.g., of a sentence)?
4. Disambiguation Tasks
Example: I made her duck
Part-of-speech tagging
- duck: V / N
- her: possessive adjective / dative pronoun
Word Sense Disambiguation
- make: create / cook
Syntactic Disambiguation
(I (made (her duck))) vs. (I (made (her) (duck)))
- make: transitive (single direct obj.) / ditransitive (two objs.) / cause (direct obj. + verb)
5. Why do we need information theory?
- How much information is contained in a particular probabilistic model (PM)?
- How predictive is a PM?
- Given two PMs, which one better matches a corpus?
Entropy, Mutual Information, Relative Entropy, Cross-Entropy, Perplexity
6. Basic Probability/Info Theory
- An overview (not complete! sometimes imprecise!)
- Clarify basic concepts you may encounter in NLP
- Try to address common misunderstandings
7. Experiments and Sample Spaces
- Uncertain situation: experiment, process, test
- Set of possible basic outcomes: the sample space Ω
- Coin toss (Ω = {head, tail}), die (Ω = {1, ..., 6})
- Opinion poll (Ω = {yes, no})
- Quality test (Ω = {bad, good})
- Lottery (|Ω| ≅ 10^5 to 10^7)
- # of traffic accidents in Canada in 2005 (Ω = N)
- Missing word (|Ω| ≅ vocabulary size)
8. Events
- Event A is a set of basic outcomes
- A ⊆ Ω, and all A ∈ 2^Ω (the event space)
- Ω is the certain event, Ø is the impossible event
- Examples
- Experiment: three coin tosses
- Ω = {HHH, HHT, HTH, THH, TTH, HTT, THT, TTT}
- Cases with exactly two tails: A = {TTH, HTT, THT}
- All heads: A = {HHH}
9. Probability Function/Distribution
- Intuition: measure of how likely an event is
- Formally:
- P: 2^Ω → [0,1], P(Ω) = 1
- If A and B are disjoint events: P(A ∪ B) = P(A) + P(B)
- Immediate consequences:
- P(Ø) = 0, P(¬A) = 1 - P(A), A ⊆ B ⇒ P(A) ≤ P(B)
- Σ_{a∈Ω} P(a) = 1
- How to estimate P(A)?
- Repeat the experiment n times
- c times the outcome is in A
- P(A) ≅ c/n
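A minimal Python sketch of this relative-frequency estimate, using the three-coin-toss experiment from slide 8 (the event and the number of repetitions are illustrative choices, not from the slides):

```python
import random

def toss_three_coins():
    """One run of the experiment: three independent fair coin tosses."""
    return tuple(random.choice("HT") for _ in range(3))

def estimate_probability(event, n=100_000):
    """Relative-frequency estimate: P(A) ~ c/n after n repetitions."""
    c = sum(1 for _ in range(n) if event(toss_three_coins()))
    return c / n

# Event A = "exactly two tails" = {TTH, HTT, THT}
p_hat = estimate_probability(lambda outcome: outcome.count("T") == 2)
print(f"Estimated P(exactly two tails) = {p_hat:.3f}  (exact value: 3/8 = 0.375)")
```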
10. Missing Word from Book
11. Joint and Conditional Probability
- P(A,B) = P(A ∩ B)
- P(A|B) = P(A,B) / P(B)
Bayes Rule:
P(A,B) = P(B,A) (since P(A ∩ B) = P(B ∩ A))
⇒ P(A|B) P(B) = P(B|A) P(A)
⇒ P(A|B) = P(B|A) P(A) / P(B)
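For example, a toy Python sketch of Bayes Rule; the events and all three numbers below are invented purely for illustration:

```python
# Toy numbers (hypothetical): A = "intended word is 'their'",
# B = "observed (possibly misspelled) string is 'thier'".
p_a = 0.01          # prior P(A)
p_b_given_a = 0.1   # likelihood P(B|A)
p_b = 0.0015        # evidence P(B)

# Bayes Rule: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(A|B) = {p_a_given_b:.3f}")
```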
12. Missing Word Independence
13. Independence
- How does P(A|B) relate to P(A)?
If knowing that B is the case does not change the probability of A (i.e., P(A|B) = P(A)), then A and B are independent.
Immediate consequence: P(A,B) = P(A) P(B)
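On a finite sample space this consequence can be checked by brute force; the Python sketch below uses the three-coin-toss space from slide 8 with two illustrative events:

```python
from itertools import product
from fractions import Fraction

# Sample space for three fair coin tosses; each basic outcome has probability 1/8.
omega = list(product("HT", repeat=3))
p_outcome = Fraction(1, 8)

def prob(event):
    """P(A) for an event given as a predicate over basic outcomes."""
    return sum(p_outcome for o in omega if event(o))

def A(o):  # first toss is heads
    return o[0] == "H"

def B(o):  # at least two tails
    return o.count("T") >= 2

p_a, p_b = prob(A), prob(B)
p_ab = prob(lambda o: A(o) and B(o))

print("P(A) =", p_a, " P(B) =", p_b, " P(A,B) =", p_ab)
# A and B are independent exactly when P(A,B) = P(A) P(B).
print("Independent?", p_ab == p_a * p_b)
```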
14. Chain Rule
P(A,B,C,D,...) = P(A) · P(A,B)/P(A) · P(A,B,C)/P(A,B) · P(A,B,C,D)/P(A,B,C) · P(...,A,B,C,D)/P(A,B,C,D)
= P(A) P(B|A) P(C|A,B) P(D|A,B,C) P(...|A,B,C,D)
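This decomposition is what lets us assign a probability to a whole word sequence; a toy Python sketch with invented conditional probabilities for the running example "I made her duck":

```python
# Chain rule sketch for a word sequence (hypothetical probabilities):
# P(w1, w2, w3, w4) = P(w1) * P(w2 | w1) * P(w3 | w1, w2) * P(w4 | w1, w2, w3)
conditionals = [
    ("I",    0.05),   # P(I)
    ("made", 0.01),   # P(made | I)
    ("her",  0.20),   # P(her | I made)
    ("duck", 0.02),   # P(duck | I made her)
]

sentence_prob = 1.0
for word, p in conditionals:
    sentence_prob *= p          # multiply in each conditional factor

print(f"P(I made her duck) = {sentence_prob:.2e}")
```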
15. Random Variables and pmf
- Random variables (RVs) X allow us to talk about the probabilities of numerical values that are related to the event space
- Examples: die with natural numbering {1, ..., 6}; English word length {1, 2, ...}
- Probability mass function: p(x) = P(X = x)
16. Example: English Word Length
[Plot: probability mass function p(x) of English word length, for x from 1 to 25]
Sampling? How to do it?
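One standard answer to the sampling question is inverse-CDF sampling: draw a uniform random number and walk down the cumulative sums of the pmf. A Python sketch with a made-up word-length pmf (the probabilities are hypothetical and only need to sum to 1):

```python
import random

# Hypothetical pmf over English word lengths (made-up numbers that sum to 1).
pmf = {1: 0.05, 2: 0.15, 3: 0.20, 4: 0.20, 5: 0.15, 6: 0.10, 7: 0.08, 8: 0.07}

def sample(pmf):
    """Inverse-CDF sampling: draw u ~ Uniform(0,1) and walk the cumulative sums."""
    u = random.random()
    cumulative = 0.0
    for x, p in pmf.items():
        cumulative += p
        if u <= cumulative:
            return x
    return x  # guard against floating-point rounding

draws = [sample(pmf) for _ in range(10_000)]
print("empirical mean word length:", sum(draws) / len(draws))
```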
17. Expectation and Variance
- The expectation E(X) = Σ_x x p(x) is the (expected) mean or average of a RV
- Example: rolling one die (E(X) = 3.5)
- The variance Var(X) = E((X - E(X))^2) of a RV is a measure of whether the values of the RV tend to be consistent over samples or to vary a lot
- σ = √Var(X) is the standard deviation
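Both quantities can be computed directly from a pmf; a short Python sketch for the one-die example above:

```python
from fractions import Fraction

# Fair die: p(x) = 1/6 for x in 1..6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# E(X) = sum_x x * p(x)
mean = sum(x * p for x, p in pmf.items())

# Var(X) = E((X - E(X))^2) = sum_x (x - E(X))^2 * p(x)
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

print("E(X) =", mean)                      # 7/2 = 3.5, as on the slide
print("Var(X) =", variance)                # 35/12
print("sigma =", float(variance) ** 0.5)   # standard deviation
```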
18. Joint, Marginal and Conditional RV Distributions
- Joint: p(x,y) = P(X = x, Y = y)
- Marginal: p(x) = Σ_y p(x,y)
- Conditional: p(x|y) = p(x,y) / p(y)
Bayes and Chain Rule also apply!
19. Joint Distribution (word length X vs. word class Y)
[Table: joint probabilities p(x,y) for word length X ∈ {1, 2, 3, 4} and word class Y ∈ {N, V, Adj, Adv}]
Note: fictional numbers
20. Conditional and Independence (word length vs. word class)
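Tying slides 18-20 together, the Python sketch below builds a small fictional joint table of word length and word class (all numbers invented), computes the marginals and a conditional, and checks for independence:

```python
# Fictional joint distribution p(x, y) of word length X and word class Y
# (made-up numbers in the spirit of slide 19).
joint = {
    (1, "N"): 0.02, (1, "V"): 0.04, (1, "Adj"): 0.01, (1, "Adv"): 0.03,
    (2, "N"): 0.06, (2, "V"): 0.10, (2, "Adj"): 0.03, (2, "Adv"): 0.06,
    (3, "N"): 0.12, (3, "V"): 0.08, (3, "Adj"): 0.06, (3, "Adv"): 0.04,
    (4, "N"): 0.15, (4, "V"): 0.05, (4, "Adj"): 0.10, (4, "Adv"): 0.05,
}

# Marginals: p(x) = sum_y p(x, y)  and  p(y) = sum_x p(x, y)
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# Conditional: p(y | x) = p(x, y) / p(x)
def conditional(y, x):
    return joint[(x, y)] / p_x[x]

print("p(Y=N | X=4) =", round(conditional("N", 4), 3))

# Independence check: X and Y are independent iff p(x, y) = p(x) * p(y) everywhere.
independent = all(abs(p - p_x[x] * p_y[y]) < 1e-9 for (x, y), p in joint.items())
print("X and Y independent?", independent)
```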
21. Standard Distributions
- Discrete
- Binomial
- Multinomial
- Continuous
- Normal
Go back to your Stats textbook
22. Today 28/1
- Why do we need probabilities and information theory?
- Basic Probability Theory
- Basic Information Theory
23. Entropy
- Def. 1: Measure of uncertainty
- Def. 2: Measure of the information that we need to resolve an uncertain situation
- Def. 3: Measure of the information that we obtain from an experiment that resolves an uncertain situation
- Let p(x) = P(X = x), where x ∈ X
- H(p) = H(X) = - Σ_{x∈X} p(x) log2 p(x)
- It is normally measured in bits.
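A direct Python transcription of the entropy formula, evaluated on a few illustrative distributions:

```python
import math

def entropy(pmf):
    """H(p) = - sum_x p(x) * log2 p(x), in bits (terms with p(x) = 0 contribute 0)."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

fair_coin   = {"H": 0.5, "T": 0.5}
biased_coin = {"H": 0.9, "T": 0.1}
fair_die    = {x: 1 / 6 for x in range(1, 7)}

print("H(fair coin)   =", entropy(fair_coin))             # 1 bit
print("H(biased coin) =", round(entropy(biased_coin), 3))  # about 0.469 bits
print("H(fair die)    =", round(entropy(fair_die), 3))     # log2(6), about 2.585 bits
```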
24. Entropy (extra slides)
- Using the formula: example
- Example: binary outcome
- The limits
- (Why exactly that formula?)
- Entropy and expectation
- Coding interpretation
- Joint and conditional entropy
- Summary of key properties
25. Mutual Information
- Chain rule for entropy: H(X,Y) = H(X) + H(Y|X)
- By the chain rule for entropy, we have H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
- Therefore, H(X) - H(X|Y) = H(Y) - H(Y|X)
- This difference is called the mutual information between X and Y, I(X;Y)
- Reduction in uncertainty of one random variable due to knowing about another
- The amount of information one random variable contains about another
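As a sketch, I(X;Y) can be computed from a joint distribution via the entropy identities above; the joint table here is invented for illustration:

```python
import math

# Toy joint distribution p(x, y) (hypothetical numbers).
joint = {("a", 0): 0.3, ("a", 1): 0.1,
         ("b", 0): 0.2, ("b", 1): 0.4}

# Marginals p(x) and p(y).
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

def H(pmf):
    """Entropy in bits."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

# I(X;Y) = H(X) + H(Y) - H(X,Y), which equals the H(X) - H(X|Y) form on the slide.
mi = H(p_x) + H(p_y) - H(joint)
print("I(X;Y) =", round(mi, 4), "bits")
```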
26. Relative Entropy or Kullback-Leibler Divergence
- Def.: The relative entropy is a measure of how different two probability distributions (over the same event space) are.
- D(p || q) = Σ_{x∈X} p(x) log( p(x) / q(x) )
- Average number of bits wasted by encoding events from a distribution p with distribution q
- I(X;Y) = D( p(x,y) || p(x) p(y) )
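A short Python sketch of D(p || q), reusing the invented joint table from the previous sketch to check the identity in the last bullet:

```python
import math

def kl(p, q):
    """D(p || q) = sum_x p(x) * log2( p(x) / q(x) ), in bits."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

# Same toy joint distribution as before (hypothetical numbers).
joint = {("a", 0): 0.3, ("a", 1): 0.1,
         ("b", 0): 0.2, ("b", 1): 0.4}
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# Product of the marginals, over the same (x, y) pairs.
product = {(x, y): p_x[x] * p_y[y] for (x, y) in joint}

# This value matches I(X;Y) computed via entropies in the previous sketch.
print("D(p(x,y) || p(x)p(y)) =", round(kl(joint, product), 4))
```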
27. Next Time
- Probabilistic models applied to spelling
- Read Chp. 5 up to page 156