Title: CPSC 503 Computational Linguistics
1. CPSC 503 Computational Linguistics
- Intro to Probability and Information Theory
- Lecture 5
- Giuseppe Carenini
2. Today 28/1
- Why do we need probabilities and information theory?
- Basic Probability Theory
- Basic Information Theory
3. Why do we need probabilities?
- For spelling errors: what is the most probable correct word?
- For real-word spelling errors, speech and handwriting recognition: what is the most probable next word?
- Part-of-speech tagging, word-sense disambiguation, probabilistic parsing
Basic question: what is the probability of a sequence of words (e.g., of a sentence)?
4. Disambiguation Tasks
Example: I made her duck
Part-of-speech tagging
- duck: V / N
- her: possessive adjective / dative pronoun
Word Sense Disambiguation
- make: create / cook
Syntactic Disambiguation
(I (made (her duck))) vs. (I (made (her) (duck)))
- make: transitive (single direct obj.) / ditransitive (two objs.) / cause (direct obj. + verb)
5. Why do we need information theory?
- How much information is contained in a particular probabilistic model (PM)?
- How predictive is a PM?
- Given two PMs, which one better matches a corpus?
Entropy, Mutual Information, Relative Entropy, Cross-Entropy, Perplexity
6. Basic Probability/Info Theory
- An overview (not complete! sometimes imprecise!)
- Clarify basic concepts you may encounter in NLP
- Try to address common misunderstandings
7. Experiments and Sample Spaces
- Uncertain situation: experiment, process, test
- Set of possible basic outcomes: the sample space Ω
- Coin toss (Ω = {head, tail}), die (Ω = {1, ..., 6})
- Opinion poll (Ω = {yes, no})
- Quality test (Ω = {bad, good})
- Lottery (|Ω| ≅ 10^5 to 10^7)
- # of traffic accidents in Canada in 2005 (Ω = N)
- Missing word (|Ω| ≅ vocabulary size)
8. Events
- Event A is a set of basic outcomes
- A ⊆ Ω, and all A ∈ 2^Ω (the event space)
- Ω is the certain event, Ø is the impossible event
- Examples
- Experiment: three coin tosses
- Ω = {HHH, HHT, HTH, THH, TTH, HTT, THT, TTT}
- Cases with exactly two tails: A = {TTH, HTT, THT}
- All heads: A = {HHH}
9. Probability Function/Distribution
- Intuition: measure of how likely an event is
- Formally:
- P: 2^Ω → [0,1], P(Ω) = 1
- If A and B are disjoint events: P(A ∪ B) = P(A) + P(B)
- Immediate consequences:
- P(Ø) = 0, P(¬A) = 1 - P(A), A ⊆ B ⇒ P(A) ≤ P(B)
- Σ_{a∈Ω} P(a) = 1
- How to estimate P(A)?
- Repeat the experiment n times
- c times the outcome is in A
- P(A) ≅ c/n
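A minimal Python sketch of this relative-frequency estimate, using the three-coin-toss experiment from slide 8 (the event and the number of repetitions are illustrative choices, not from the slides):

```python
import random

def toss_three_coins():
    """One run of the experiment: three independent fair coin tosses."""
    return tuple(random.choice("HT") for _ in range(3))

def estimate_probability(event, n=100_000):
    """Relative-frequency estimate: P(A) ~ c/n after n repetitions."""
    c = sum(1 for _ in range(n) if event(toss_three_coins()))
    return c / n

# Event A = "exactly two tails" = {TTH, HTT, THT}
p_hat = estimate_probability(lambda outcome: outcome.count("T") == 2)
print(f"Estimated P(exactly two tails) = {p_hat:.3f}  (exact value: 3/8 = 0.375)")
```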
10. Missing Word from Book
11. Joint and Conditional Probability
- P(A,B) = P(A ∩ B)
- P(A|B) = P(A,B) / P(B)
Bayes Rule:
P(A,B) = P(B,A) (since P(A ∩ B) = P(B ∩ A))
⇒ P(A|B) P(B) = P(B|A) P(A)
⇒ P(A|B) = P(B|A) P(A) / P(B)
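For example, a toy Python sketch of Bayes Rule; the events and all three numbers below are invented purely for illustration:

```python
# Toy numbers (hypothetical): A = "intended word is 'their'",
# B = "observed (possibly misspelled) string is 'thier'".
p_a = 0.01          # prior P(A)
p_b_given_a = 0.1   # likelihood P(B|A)
p_b = 0.0015        # evidence P(B)

# Bayes Rule: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(A|B) = {p_a_given_b:.3f}")
```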
12. Missing Word Independence
13. Independence
- How does P(A|B) relate to P(A)?
If knowing that B is the case does not change the probability of A (i.e., P(A|B) = P(A)), then A and B are independent.
Immediate consequence: P(A,B) = P(A) P(B)
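On a finite sample space this consequence can be checked by brute force; the Python sketch below uses the three-coin-toss space from slide 8 with two illustrative events:

```python
from itertools import product
from fractions import Fraction

# Sample space for three fair coin tosses; each basic outcome has probability 1/8.
omega = list(product("HT", repeat=3))
p_outcome = Fraction(1, 8)

def prob(event):
    """P(A) for an event given as a predicate over basic outcomes."""
    return sum(p_outcome for o in omega if event(o))

def A(o):  # first toss is heads
    return o[0] == "H"

def B(o):  # at least two tails
    return o.count("T") >= 2

p_a, p_b = prob(A), prob(B)
p_ab = prob(lambda o: A(o) and B(o))

print("P(A) =", p_a, " P(B) =", p_b, " P(A,B) =", p_ab)
# A and B are independent exactly when P(A,B) = P(A) P(B).
print("Independent?", p_ab == p_a * p_b)
```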
14. Chain Rule
P(A,B,C,D,...) = P(A) · P(A,B)/P(A) · P(A,B,C)/P(A,B) · P(A,B,C,D)/P(A,B,C) · P(...,A,B,C,D)/P(A,B,C,D)
= P(A) P(B|A) P(C|A,B) P(D|A,B,C) P(...|A,B,C,D)
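This decomposition is what lets us assign a probability to a whole word sequence; a toy Python sketch with invented conditional probabilities for the running example "I made her duck":

```python
# Chain rule sketch for a word sequence (hypothetical probabilities):
# P(w1, w2, w3, w4) = P(w1) * P(w2 | w1) * P(w3 | w1, w2) * P(w4 | w1, w2, w3)
conditionals = [
    ("I",    0.05),   # P(I)
    ("made", 0.01),   # P(made | I)
    ("her",  0.20),   # P(her | I made)
    ("duck", 0.02),   # P(duck | I made her)
]

sentence_prob = 1.0
for word, p in conditionals:
    sentence_prob *= p          # multiply in each conditional factor

print(f"P(I made her duck) = {sentence_prob:.2e}")
```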
15. Random Variables and pmf
- Random variables (RVs) X allow us to talk about the probabilities of numerical values that are related to the event space
- Examples: die with natural numbering {1, ..., 6}; English word length {1, 2, ...}
- Probability mass function: p(x) = P(X = x)
16. Example: English Word Length
[Plot: probability mass function p(x) of English word length, for x from 1 to 25]
Sampling? How to do it?
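One standard answer to the sampling question is inverse-CDF sampling: draw a uniform random number and walk down the cumulative sums of the pmf. A Python sketch with a made-up word-length pmf (the probabilities are hypothetical and only need to sum to 1):

```python
import random

# Hypothetical pmf over English word lengths (made-up numbers that sum to 1).
pmf = {1: 0.05, 2: 0.15, 3: 0.20, 4: 0.20, 5: 0.15, 6: 0.10, 7: 0.08, 8: 0.07}

def sample(pmf):
    """Inverse-CDF sampling: draw u ~ Uniform(0,1) and walk the cumulative sums."""
    u = random.random()
    cumulative = 0.0
    for x, p in pmf.items():
        cumulative += p
        if u <= cumulative:
            return x
    return x  # guard against floating-point rounding

draws = [sample(pmf) for _ in range(10_000)]
print("empirical mean word length:", sum(draws) / len(draws))
```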
17. Expectation and Variance
- The expectation E(X) = Σ_x x p(x) is the (expected) mean or average of a RV
- Example: rolling one die (E(X) = 3.5)
- The variance Var(X) = E((X - E(X))^2) of a RV is a measure of whether the values of the RV tend to be consistent over samples or to vary a lot
- σ = √Var(X) is the standard deviation
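Both quantities can be computed directly from a pmf; a short Python sketch for the one-die example above:

```python
from fractions import Fraction

# Fair die: p(x) = 1/6 for x in 1..6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# E(X) = sum_x x * p(x)
mean = sum(x * p for x, p in pmf.items())

# Var(X) = E((X - E(X))^2) = sum_x (x - E(X))^2 * p(x)
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

print("E(X) =", mean)                      # 7/2 = 3.5, as on the slide
print("Var(X) =", variance)                # 35/12
print("sigma =", float(variance) ** 0.5)   # standard deviation
```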
18. Joint, Marginal and Conditional RV Distributions
- Joint: p(x,y) = P(X = x, Y = y)
- Marginal: p(x) = Σ_y p(x,y)
- Conditional: p(x|y) = p(x,y) / p(y)
Bayes and Chain Rule also apply!
19. Joint Distribution (word length X vs. word class Y)
[Table: joint probabilities p(x,y) for word length X ∈ {1, 2, 3, 4} and word class Y ∈ {N, V, Adj, Adv}]
Note: fictional numbers
20. Conditional and Independence (word length vs. word class)
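Tying slides 18-20 together, the Python sketch below builds a small fictional joint table of word length and word class (all numbers invented), computes the marginals and a conditional, and checks for independence:

```python
# Fictional joint distribution p(x, y) of word length X and word class Y
# (made-up numbers in the spirit of slide 19).
joint = {
    (1, "N"): 0.02, (1, "V"): 0.04, (1, "Adj"): 0.01, (1, "Adv"): 0.03,
    (2, "N"): 0.06, (2, "V"): 0.10, (2, "Adj"): 0.03, (2, "Adv"): 0.06,
    (3, "N"): 0.12, (3, "V"): 0.08, (3, "Adj"): 0.06, (3, "Adv"): 0.04,
    (4, "N"): 0.15, (4, "V"): 0.05, (4, "Adj"): 0.10, (4, "Adv"): 0.05,
}

# Marginals: p(x) = sum_y p(x, y)  and  p(y) = sum_x p(x, y)
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# Conditional: p(y | x) = p(x, y) / p(x)
def conditional(y, x):
    return joint[(x, y)] / p_x[x]

print("p(Y=N | X=4) =", round(conditional("N", 4), 3))

# Independence check: X and Y are independent iff p(x, y) = p(x) * p(y) everywhere.
independent = all(abs(p - p_x[x] * p_y[y]) < 1e-9 for (x, y), p in joint.items())
print("X and Y independent?", independent)
```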
21. Standard Distributions
- Discrete
- Binomial
- Multinomial
- Continuous
- Normal
Go back to your Stats textbook
22. Today 28/1
- Why do we need probabilities and information theory?
- Basic Probability Theory
- Basic Information Theory
23. Entropy
- Def. 1: Measure of uncertainty
- Def. 2: Measure of the information that we need to resolve an uncertain situation
- Def. 3: Measure of the information that we obtain from an experiment that resolves an uncertain situation
- Let p(x) = P(X = x), where x ∈ X
- H(p) = H(X) = - Σ_{x∈X} p(x) log2 p(x)
- It is normally measured in bits.
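A direct Python transcription of the entropy formula, evaluated on a few illustrative distributions:

```python
import math

def entropy(pmf):
    """H(p) = - sum_x p(x) * log2 p(x), in bits (terms with p(x) = 0 contribute 0)."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

fair_coin   = {"H": 0.5, "T": 0.5}
biased_coin = {"H": 0.9, "T": 0.1}
fair_die    = {x: 1 / 6 for x in range(1, 7)}

print("H(fair coin)   =", entropy(fair_coin))             # 1 bit
print("H(biased coin) =", round(entropy(biased_coin), 3))  # about 0.469 bits
print("H(fair die)    =", round(entropy(fair_die), 3))     # log2(6), about 2.585 bits
```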
24. Entropy (extra slides)
- Using the formula: example
- Example: binary outcome
- The limits
- (Why exactly that formula?)
- Entropy and expectation
- Coding interpretation
- Joint and conditional entropy
- Summary of key properties
25. Mutual Information
- Chain rule for entropy: H(X,Y) = H(X) + H(Y|X)
- By the chain rule for entropy, we have H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
- Therefore, H(X) - H(X|Y) = H(Y) - H(Y|X)
- This difference is called the mutual information between X and Y, I(X;Y)
- Reduction in uncertainty of one random variable due to knowing about another
- The amount of information one random variable contains about another
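As a sketch, I(X;Y) can be computed from a joint distribution via the entropy identities above; the joint table here is invented for illustration:

```python
import math

# Toy joint distribution p(x, y) (hypothetical numbers).
joint = {("a", 0): 0.3, ("a", 1): 0.1,
         ("b", 0): 0.2, ("b", 1): 0.4}

# Marginals p(x) and p(y).
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

def H(pmf):
    """Entropy in bits."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

# I(X;Y) = H(X) + H(Y) - H(X,Y), which equals the H(X) - H(X|Y) form on the slide.
mi = H(p_x) + H(p_y) - H(joint)
print("I(X;Y) =", round(mi, 4), "bits")
```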
26. Relative Entropy or Kullback-Leibler Divergence
- Def.: The relative entropy is a measure of how different two probability distributions (over the same event space) are.
- D(p || q) = Σ_{x∈X} p(x) log( p(x) / q(x) )
- Average number of bits wasted by encoding events from a distribution p with distribution q
- I(X;Y) = D( p(x,y) || p(x) p(y) )
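A short Python sketch of D(p || q), reusing the invented joint table from the previous sketch to check the identity in the last bullet:

```python
import math

def kl(p, q):
    """D(p || q) = sum_x p(x) * log2( p(x) / q(x) ), in bits."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

# Same toy joint distribution as before (hypothetical numbers).
joint = {("a", 0): 0.3, ("a", 1): 0.1,
         ("b", 0): 0.2, ("b", 1): 0.4}
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# Product of the marginals, over the same (x, y) pairs.
product = {(x, y): p_x[x] * p_y[y] for (x, y) in joint}

# This value matches I(X;Y) computed via entropies in the previous sketch.
print("D(p(x,y) || p(x)p(y)) =", round(kl(joint, product), 4))
```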
27. Next Time
- Probabilistic models applied to spelling
- Read Chp. 5 up to page 156