Title: 2. Mathematical Foundations
2. Mathematical Foundations
Foundations of Statistical Natural Language Processing
Contents: Part 1
- 1. Elementary Probability Theory
- Conditional probability
- Bayes' theorem
- Random variables
- Joint and conditional distributions
- Standard distributions
Conditional probability (1/2)
- P(A): the probability of the event A
- Ex 1> A coin is tossed 3 times.
- Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
- A = {HHT, HTH, THH} (exactly 2 heads), P(A) = 3/8
- B = {HHH, HHT, HTH, HTT} (first toss is a head), P(B) = 1/2
- Conditional probability: P(A|B) = P(A ∩ B) / P(B)
- Here A ∩ B = {HHT, HTH}, so P(A|B) = (2/8) / (1/2) = 1/2
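A quick enumeration check of Ex 1; a minimal Python sketch, with variable names of my own choosing:

```python
from itertools import product
from fractions import Fraction

# Sample space for three tosses of a fair coin (uniform probability).
omega = list(product("HT", repeat=3))
P = lambda event: Fraction(len(event), len(omega))

A = [w for w in omega if w.count("H") == 2]   # exactly two heads
B = [w for w in omega if w[0] == "H"]         # first toss is a head
A_and_B = [w for w in A if w in B]

# P(A|B) = P(A ∩ B) / P(B)
print(P(A), P(B), P(A_and_B) / P(B))          # 3/8 1/2 1/2
```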
Conditional probability (2/2)
- Multiplication rule: P(A ∩ B) = P(A|B) P(B) = P(B|A) P(A)
- Chain rule: P(A1 ∩ A2 ∩ ... ∩ An) = P(A1) P(A2|A1) P(A3|A1 ∩ A2) ... P(An | A1 ∩ ... ∩ An−1)
- Two events A, B are independent if P(A ∩ B) = P(A) P(B)
- If P(B) > 0, this is equivalent to P(A|B) = P(A)
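A small sketch checking independence by enumeration, using two fair dice as an assumed example (the events here are not from the slide):

```python
from itertools import product
from fractions import Fraction

# Sample space: ordered outcomes of two fair dice.
omega = list(product(range(1, 7), repeat=2))
P = lambda event: Fraction(len(event), len(omega))

A = [w for w in omega if w[0] == 6]           # first die shows a six
B = [w for w in omega if sum(w) % 2 == 0]     # the sum is even
A_and_B = [w for w in A if w in B]

# Independence test: does P(A ∩ B) equal P(A) P(B)?
print(P(A_and_B), P(A) * P(B), P(A_and_B) == P(A) * P(B))   # 1/12 1/12 True
```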
Bayes' theorem (1/2)
- Bayes' theorem: P(B|A) = P(A|B) P(B) / P(A)
- Generally, if A ⊆ ∪_i Bi and the Bi are disjoint:
  P(A) = Σ_i P(A|Bi) P(Bi)
- Bayes' theorem then reads: P(Bj|A) = P(A|Bj) P(Bj) / Σ_i P(A|Bi) P(Bi)
Bayes' theorem (2/2)
- Ex 2> G: the event of a sentence having a parasitic gap; T: the event of the test being positive
- Even when the test is positive, the probability that the sentence really contains a parasitic gap, P(G|T), remains very small.
- This poor result comes about because the prior probability of a sentence containing a parasitic gap is so low.
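A worked Bayes calculation for Ex 2; a sketch that assumes the figures from Manning & Schütze's parasitic-gap example (P(G) = 0.00001, P(T|G) = 0.95, P(T|¬G) = 0.005):

```python
# Assumed figures from the textbook's parasitic-gap example.
p_G = 0.00001           # prior: a sentence contains a parasitic gap
p_T_given_G = 0.95      # test fires when a gap is present
p_T_given_notG = 0.005  # false-positive rate

# Total probability of a positive test, then Bayes' theorem.
p_T = p_T_given_G * p_G + p_T_given_notG * (1 - p_G)
p_G_given_T = p_T_given_G * p_G / p_T
print(round(p_G_given_T, 4))   # ~0.0019: the posterior is still tiny
```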
Random variables
- A random variable is a function X: Ω → R.
- Ex 3> Random variable X for the sum of two dice, with values in S = {2, 3, ..., 12}
- Probability mass function (pmf): p(x) = P(X = x), with Σ_x p(x) = 1
- If X: Ω → {0, 1}, then X is called an indicator random variable, or a Bernoulli trial.
- Expectation: E(X) = Σ_x x p(x); for the sum of two dice, E(X) = 7
- Variance: Var(X) = E((X − E(X))²) = E(X²) − E²(X)
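A sketch computing the pmf, expectation, and variance of the two-dice sum in Ex 3 by enumeration:

```python
from itertools import product
from fractions import Fraction
from collections import Counter

# pmf of X = sum of two fair dice, built by enumerating the 36 outcomes.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {x: Fraction(n, 36) for x, n in counts.items()}

E = sum(x * p for x, p in pmf.items())               # expectation E(X)
Var = sum(x * x * p for x, p in pmf.items()) - E**2  # E(X^2) - E(X)^2
print(E, Var)                                        # 7 35/6
```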
Joint and conditional distributions
- The joint pmf for two discrete random variables X, Y:
  p(x, y) = P(X = x, Y = y)
- Marginal pmfs, which total up the probability mass for the values of each variable separately:
  p_X(x) = Σ_y p(x, y),  p_Y(y) = Σ_x p(x, y)
- Conditional pmf:
  p_{X|Y}(x|y) = p(x, y) / p_Y(y), for y such that p_Y(y) > 0
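A minimal sketch of marginalization and conditioning on a small made-up joint pmf (the table below is illustrative only, not from the slide):

```python
from fractions import Fraction as F

# A small made-up joint pmf p(x, y).
joint = {("x1", "y1"): F(1, 4), ("x1", "y2"): F(1, 4),
         ("x2", "y1"): F(1, 2), ("x2", "y2"): F(0)}
xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

p_X = {x: sum(joint[x, y] for y in ys) for x in xs}   # marginal over y
p_Y = {y: sum(joint[x, y] for x in xs) for y in ys}   # marginal over x
# Conditional pmf p(x|y), defined only for y with p_Y(y) > 0.
p_X_given_Y = {(x, y): joint[x, y] / p_Y[y]
               for x in xs for y in ys if p_Y[y] > 0}
print(p_X, p_Y, p_X_given_Y)
```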
Standard distributions (1/3)
- Discrete distributions: the binomial distribution
- Arises when one has a series of trials with only two outcomes, each trial being independent of all the others.
- The number r of successes out of n trials, given that the probability of success in any trial is p:
  b(r; n, p) = C(n, r) p^r (1 − p)^(n−r), where C(n, r) = n! / ((n − r)! r!)
- Expectation: np; variance: np(1 − p)
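A sketch of the binomial pmf with a numeric check that its mean and variance come out as np and np(1 − p); the parameters n = 10, p = 0.7 are just an example choice:

```python
from math import comb

def binom_pmf(r, n, p):
    """b(r; n, p) = C(n, r) p^r (1 - p)^(n - r)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

n, p = 10, 0.7                        # example parameters
mean = sum(r * binom_pmf(r, n, p) for r in range(n + 1))
var = sum(r * r * binom_pmf(r, n, p) for r in range(n + 1)) - mean**2
print(round(mean, 6), round(var, 6))  # 7.0 2.1, i.e. np and np(1 - p)
```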
Standard distributions (2/3)
- Discrete distributions: the binomial distribution
- (Figure: plot of the binomial pmf b(r; n, p))
Standard distributions (3/3)
- Continuous distributions: the normal distribution
- For mean μ and standard deviation σ, the probability density function (pdf) is
  n(x; μ, σ) = exp(−(x − μ)² / (2σ²)) / (√(2π) σ)
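A sketch of the normal pdf as written above; the evaluation point and parameters are arbitrary:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """n(x; mu, sigma) = exp(-(x - mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sqrt(2 * pi) * sigma)

print(round(normal_pdf(0.0, 0.0, 1.0), 4))   # ~0.3989, the standard normal at its mean
```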
Contents: Part 2
- 2. Essential Information Theory
- Entropy
- Joint entropy and conditional entropy
- Mutual information
- The noisy channel model
- Relative entropy or Kullback-Leibler divergence
Shannon's Information Theory
- The goal is to maximize the amount of information that one can transmit over an imperfect communication channel such as a noisy phone line.
- Theoretical maximum for data compression: the entropy H
- Theoretical maximum for the transmission rate: the channel capacity
Entropy (1/4)
- The entropy H (or self-information) is the average uncertainty of a single random variable X.
- Entropy is a measure of uncertainty: the more we know about something, the lower the entropy will be.
- We can use entropy as a measure of the quality of our models.
- Entropy measures the amount of information in a random variable (measured in bits).
- H(X) = −Σ_x p(x) log₂ p(x), where p(x) is the pmf of X
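A minimal entropy function matching the definition above, evaluated on a fair and a weighted coin (this anticipates the figure on the next slide):

```python
from math import log2

def entropy(pmf):
    """H(X) = -sum_x p(x) log2 p(x), in bits; zero-probability terms contribute 0."""
    return -sum(p * log2(p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit for a fair coin
print(entropy([0.9, 0.1]))   # ~0.47 bits for a heavily weighted coin
```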
Entropy (2/4)
- (Figure) The entropy of a weighted coin. The horizontal axis shows the probability of the weighted coin coming up heads; the vertical axis shows the entropy of tossing the corresponding coin once.
Entropy (3/4)
- Ex 7> The result of rolling an 8-sided die (uniform distribution):
  H(X) = −Σ_{i=1}^{8} (1/8) log₂(1/8) = 3 bits
- Entropy is the average length of the message needed to transmit an outcome of that variable.
- In terms of the expectation E: H(X) = E(log₂(1/p(X)))
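The same calculation for the 8-sided die of Ex 7, as a two-line check:

```python
from math import log2

# Uniform 8-sided die: each of the 8 outcomes has probability 1/8.
H = -sum((1 / 8) * log2(1 / 8) for _ in range(8))
print(H)   # 3.0 bits
```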
Entropy (4/4)
- Ex 8> Simplified Polynesian: six letters with per-letter probabilities P(t) = P(a) = 1/4 and P(p) = P(k) = P(i) = P(u) = 1/8
- H(P) = −Σ_i P(i) log₂ P(i) = 2.5 bits
- We can design a code that on average takes 2.5 bits to transmit a letter.
- Entropy can be interpreted as a measure of the size of the search space consisting of the possible values of a random variable.
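A sketch of the per-letter entropy calculation, assuming the letter probabilities of the textbook's simplified Polynesian example:

```python
from math import log2

# Per-letter probabilities, assumed from Manning & Schütze's simplified Polynesian example.
pmf = {"p": 1/8, "t": 1/4, "k": 1/8, "a": 1/4, "i": 1/8, "u": 1/8}
H = -sum(p * log2(p) for p in pmf.values())
print(H)   # 2.5 bits per letter
```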
Joint entropy and conditional entropy (1/3)
- The joint entropy of a pair of discrete random variables X, Y ~ p(x, y):
  H(X, Y) = −Σ_x Σ_y p(x, y) log₂ p(x, y)
- The conditional entropy:
  H(Y|X) = Σ_x p(x) H(Y|X = x) = −Σ_x Σ_y p(x, y) log₂ p(y|x)
- The chain rule for entropy:
  H(X, Y) = H(X) + H(Y|X)
  H(X1, ..., Xn) = H(X1) + H(X2|X1) + ... + H(Xn|X1, ..., Xn−1)
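A sketch verifying the chain rule H(X, Y) = H(X) + H(Y|X) numerically on a small made-up joint pmf:

```python
from math import log2

def H(probs):
    """Entropy in bits of an iterable of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

# A small made-up joint pmf p(x, y), used only to check the chain rule.
joint = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.4, ("b", 1): 0.1}
p_X = {"a": 0.5, "b": 0.5}   # marginal of X for the table above

H_XY = H(joint.values())
# H(Y|X) = sum_x p(x) H(Y | X = x)
H_Y_given_X = sum(px * H([joint[x, y] / px for y in (0, 1)]) for x, px in p_X.items())
print(round(H_XY, 4), round(H(p_X.values()) + H_Y_given_X, 4))   # equal: 1.861 1.861
```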
Joint entropy and conditional entropy (2/3)
- Ex 9> Simplified Polynesian revisited
- All words consist of sequences of CV (consonant-vowel) syllables.
- (Tables: the joint distribution P(C, V) with its marginal probabilities on a per-syllable basis, and the corresponding probabilities on a per-letter basis.)
Joint entropy and conditional entropy (3/3)
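A sketch of the Ex 9 computation, assuming the joint per-syllable distribution from Manning & Schütze's worked example:

```python
from math import log2

# Joint per-syllable distribution P(C, V), assumed from the textbook's example.
joint = {("p", "a"): 1/16, ("p", "i"): 1/16, ("p", "u"): 0,
         ("t", "a"): 3/8,  ("t", "i"): 3/16, ("t", "u"): 3/16,
         ("k", "a"): 1/16, ("k", "i"): 0,    ("k", "u"): 1/16}

H = lambda probs: -sum(p * log2(p) for p in probs if p > 0)

p_C = {c: sum(p for (c2, _), p in joint.items() if c2 == c) for c in "ptk"}
H_C = H(p_C.values())
H_V_given_C = sum(pc * H([joint[c, v] / pc for v in "aiu"]) for c, pc in p_C.items())
print(round(H_C, 3), round(H_V_given_C, 3), round(H_C + H_V_given_C, 3))
# ~1.061 1.375 2.436 bits: H(C), H(V|C), and H(C, V) by the chain rule
```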
Mutual information (1/2)
- By the chain rule for entropy: H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
- Therefore H(X) − H(X|Y) = H(Y) − H(Y|X) = I(X; Y), the mutual information
- Mutual information between X and Y: the amount of information one random variable contains about another (symmetric, non-negative).
- It is 0 only when the two variables are independent.
- It grows not only with the degree of dependence, but also according to the entropy of the variables.
- It is actually better to think of it as a measure of independence.
Mutual information (2/2)
- I(X; Y) = H(X) − H(X|Y) = H(X) + H(Y) − H(X, Y) = Σ_{x,y} p(x, y) log₂ ( p(x, y) / (p(x) p(y)) )
- Since H(X|X) = 0, I(X; X) = H(X); this is why entropy is also called self-information.
- Conditional MI: I(X; Y | Z) = H(X|Z) − H(X|Y, Z)
- Chain rule for MI: I(X1, ..., Xn; Y) = Σ_i I(Xi; Y | X1, ..., Xi−1)
- Pointwise MI: I(x, y) = log₂ ( p(x, y) / (p(x) p(y)) )
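A sketch computing I(X; Y) directly from a joint pmf, using two made-up distributions, one perfectly dependent and one independent:

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) )."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(p * log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)

# Perfectly dependent pair: I(X;Y) = H(X) = 1 bit.
print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))   # 1.0
# Independent pair: I(X;Y) = 0.
print(mutual_information({(0, 0): 0.25, (0, 1): 0.25,
                          (1, 0): 0.25, (1, 1): 0.25}))  # 0.0
```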
Noisy channel model
- Channel capacity: the optimal rate at which one can transmit information through the channel
  C = max_{p(X)} I(X; Y)
- Binary symmetric channel (each input bit is flipped with probability p):
  C = 1 − H(p)
- Since entropy is non-negative, C ≤ 1.
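A sketch of the binary symmetric channel capacity C = 1 − H(p):

```python
from math import log2

def binary_entropy(p):
    """H(p) for a Bernoulli(p) variable, in bits."""
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

def bsc_capacity(p):
    """Capacity C = 1 - H(p) of a binary symmetric channel with flip probability p."""
    return 1 - binary_entropy(p)

for p in (0.0, 0.1, 0.5):
    print(p, round(bsc_capacity(p), 4))   # 1.0 for a noiseless channel, 0.0 at p = 0.5
```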
Relative entropy or Kullback-Leibler divergence
- Relative entropy for two pmfs p(x), q(x):
  D(p || q) = Σ_x p(x) log₂ ( p(x) / q(x) )
- A measure of how close two pmfs are.
- Non-negative, and D(p || q) = 0 iff p = q
- Mutual information as a relative entropy: I(X; Y) = D( p(x, y) || p(x) p(y) )
- Conditional relative entropy: D( p(y|x) || q(y|x) ) = Σ_x p(x) Σ_y p(y|x) log₂ ( p(y|x) / q(y|x) )
- Chain rule: D( p(x, y) || q(x, y) ) = D( p(x) || q(x) ) + D( p(y|x) || q(y|x) )
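A sketch of the KL divergence on two small made-up pmfs:

```python
from math import log2

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log2(p(x) / q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]          # a made-up pmf
q = [1/3, 1/3, 1/3]            # the uniform pmf on three outcomes
print(round(kl_divergence(p, q), 4))   # ~0.085: p and q differ
print(kl_divergence(p, p))             # 0.0: D(p || p) = 0
```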