Title: Announcement
1. Announcement
- Homework 1 is due on Tuesday, 01/27/2003, not
01/26/2003
2. Statistical Inference
3. Statistical Inference
Assume that the height of people follows a Gaussian distribution. Question: based on the measurements of 100 people, what is the Gaussian distribution that best fits the data?
4. Statistical Inference
- Problem
- Likelihood function
- Approach: maximum likelihood estimation (MLE), i.e., maximize the log-likelihood
5. Example I: Flip Coins
6. Example I: Flip Coins (cont'd)
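A minimal numerical sketch of the coin-flip MLE, assuming a Bernoulli model and hypothetical counts (h heads in n tosses); the closed-form MLE h/n is shown for comparison:

```python
import numpy as np
from scipy.optimize import minimize_scalar

h, n = 62, 100  # hypothetical data: 62 heads in 100 tosses

def neg_log_likelihood(p):
    # log L(p) = h*log(p) + (n - h)*log(1 - p)
    return -(h * np.log(p) + (n - h) * np.log(1 - p))

# Maximize the log-likelihood numerically over p in (0, 1).
res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(f"numerical MLE: {res.x:.4f}")    # ~0.62
print(f"closed form h/n: {h / n:.4f}")  # analytic MLE = relative frequency of heads
```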
7. Example II: Normal Distribution
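Similarly for the normal-distribution case (the height example above); the 100 measurements are not given here, so simulated data stand in for them. The MLE is the sample mean and the (1/n) sample variance:

```python
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(loc=170.0, scale=8.0, size=100)  # simulated stand-in data (cm)

# Gaussian MLE: mu_hat = sample mean, sigma2_hat = (1/n) * sum (x_i - mu_hat)^2
mu_hat = heights.mean()
sigma2_hat = ((heights - mu_hat) ** 2).mean()  # note the 1/n (not 1/(n-1)) factor

print(f"mu_hat = {mu_hat:.2f}, sigma_hat = {np.sqrt(sigma2_hat):.2f}")
```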
8. Information Theory
9. Outline
- Information
- Entropy
- Mutual information
- Noisy channel model
10. Information
- Information ≠ knowledge
- Information = reduction in uncertainty
- Example
- (1) flip a coin
- (2) roll a die
- The outcome of (2) is more uncertain than the outcome of (1)
- Therefore, more information is provided by the outcome of (2) than of (1)
11. Definition of Information
- Let E be some event that occurs with probability P(E). If we are told that E has occurred, then we say we have received I(E) = log2(1/P(E)) bits of information
- Example
- Result of a fair coin flip: log2 2 = 1 bit
- Result of a fair die roll: log2 6 ≈ 2.585 bits
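These numbers follow directly from the definition; a minimal sketch:

```python
import math

def information(p):
    """I(E) = log2(1 / P(E)), in bits."""
    return math.log2(1.0 / p)

print(information(1 / 2))  # fair coin flip: 1.0 bit
print(information(1 / 6))  # fair die roll: ~2.585 bits
```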
12. Information is Additive
- I(k fair coin tosses) = log2(2^k) = k bits
- Example: information conveyed by words
- Random word from a 100,000-word vocabulary
- I(word) = log2(100,000) ≈ 16.6 bits
- A 1000-word document from the same source
- I(document) ≈ 16,600 bits
- A 480x640-pixel, 16-greyscale video picture
- I(picture) = 307,200 × log2(16) = 1,228,800 bits
- ⇒ A picture is worth a 1000 words!
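The arithmetic on this slide can be reproduced directly; a short sketch:

```python
import math

# k independent fair coin tosses carry log2(2^k) = k bits.
k = 10
print(math.log2(2 ** k))            # 10.0

# One random word from a 100,000-word vocabulary.
i_word = math.log2(100_000)
print(round(i_word, 1))             # 16.6 bits

# A 1000-word document from the same source (words assumed independent).
print(round(1000 * i_word))         # ~16,610 (the slide rounds 16.6 x 1000 to 16,600)

# A 480 x 640 picture with 16 grey levels per pixel.
print(480 * 640 * math.log2(16))    # 1,228,800.0 bits
```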
13. Outline
- Information
- Entropy
- Mutual Information
- Cross Entropy and Learning
14. Entropy
- A zero-memory information source S is a source that emits symbols from an alphabet {s1, s2, ..., sk} with probabilities p1, p2, ..., pk, respectively, where the symbols emitted are statistically independent.
- What is the average amount of information in observing the output of the source S?
- Call this entropy: H(S) = Σi pi I(si) = Σi pi log2(1/pi)
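A minimal sketch of the entropy computation, following the definition above:

```python
import math

def entropy(probs):
    """H(S) = sum_i p_i * log2(1 / p_i), in bits; p_i = 0 terms contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin source: 1.0 bit/symbol
print(entropy([1 / 6] * 6))   # fair die source: ~2.585 bits/symbol
```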
15. Explanation of Entropy
- Average amount of information provided per symbol
- Average amount of surprise when observing a symbol
- Uncertainty an observer has before seeing the symbol
- Average number of bits needed to communicate each symbol
16. Properties of Entropy
- Non-negative: H(P) ≥ 0
- For any other probability distribution {q1, ..., qk}: -Σi pi log2 qi ≥ H(P) (Gibbs' inequality)
- H(P) ≤ log2 k, with equality iff pi = 1/k for all i
- The further P is from uniform, the lower the entropy.
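A quick numerical check of these properties (the entropy function is repeated so the snippet stands alone):

```python
import math

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

k = 4
uniform = [1 / k] * k
skewed = [0.7, 0.1, 0.1, 0.1]

print(entropy(uniform), math.log2(k))   # 2.0 and 2.0: equality at the uniform distribution
print(entropy(skewed))                  # ~1.357 < log2(4): lower when further from uniform

# Gibbs' inequality: -sum_i p_i*log2(q_i) >= H(P)
cross = -sum(p * math.log2(q) for p, q in zip(skewed, uniform))
print(cross, cross >= entropy(skewed))  # 2.0 >= 1.357: True
```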
17. Entropy: k = 2
- Notice
- zero information at the edges (p = 0 or p = 1)
- maximum information at p = 0.5 (1 bit)
- drops off more quickly near the edges than in the middle
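The curve this slide plots is the binary entropy function H(p); a sketch of a few values shows the behaviour described in the bullets:

```python
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1 - p)*log2(1 - p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.01, 0.1, 0.3, 0.5):
    print(f"H({p}) = {binary_entropy(p):.3f}")
# Output: 0.000, 0.081, 0.469, 0.881, 1.000 -- steep near the edges, flat near 0.5
```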
18. The Entropy of English
- 27 characters (A-Z, space)
- 100,000 words (6.5 characters each on average)
- Assuming independence between successive characters:
- Uniform character distribution: log2 27 ≈ 4.75 bits/character
- True character distribution: 4.03 bits/character
- Assuming independence between successive words:
- Uniform word distribution: log2(100,000)/6.5 ≈ 2.55 bits/character
- True word distribution: 9.45/6.5 ≈ 1.45 bits/character
- The true entropy of English is much lower!
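The uniform-case figures can be reproduced directly; the 4.03 bits/character and 9.45 bits/word values are empirical and taken as given. A sketch:

```python
import math

avg_word_len = 6.5  # average characters per word, as stated above

# Uniform distribution over 27 characters (A-Z plus space)
print(math.log2(27))                       # ~4.75 bits/character

# Uniform distribution over 100,000 words, spread over 6.5 characters each
print(math.log2(100_000) / avg_word_len)   # ~2.55 bits/character

# Empirical word entropy of ~9.45 bits/word (value from the slide)
print(9.45 / avg_word_len)                 # ~1.45 bits/character
```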
19. Entropy of Two Sources
Temperature T: P(T = hot) = 0.3, P(T = mild) = 0.5, P(T = cold) = 0.2
⇒ H(T) = H(0.3, 0.5, 0.2) = 1.485 bits
Humidity M: P(M = low) = 0.6, P(M = high) = 0.4
⇒ H(M) = H(0.6, 0.4) = 0.971 bits
- The random variables T and M are not independent:
- P(T = t, M = m) ≠ P(T = t) P(M = m)
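Both entropies follow from the marginal distributions given above; a sketch:

```python
import math

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

p_T = {"hot": 0.3, "mild": 0.5, "cold": 0.2}   # temperature
p_M = {"low": 0.6, "high": 0.4}                # humidity

print(round(entropy(p_T.values()), 3))   # H(T) = 1.485
print(round(entropy(p_M.values()), 3))   # H(M) = 0.971
```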
20. Joint Entropy
- H(T) = 1.485
- H(M) = 0.971
- H(T) + H(M) = 2.456
- Joint Entropy
- H(T, M) = H(0.1, 0.4, 0.1, 0.2, 0.1, 0.1) = 2.321
- H(T, M) ≤ H(T) + H(M)
Joint Probability P(T, M)
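A sketch of the joint-entropy computation. The joint table below is not printed in the text; it is a reconstruction chosen to be consistent with the marginals P(T) = (0.3, 0.5, 0.2), P(M) = (0.6, 0.4) and the stated value H(T, M) ≈ 2.321:

```python
import math

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Reconstructed joint distribution P(T, M) (first index: temperature, second: humidity)
p_TM = {
    ("hot", "low"): 0.1,  ("hot", "high"): 0.2,
    ("mild", "low"): 0.4, ("mild", "high"): 0.1,
    ("cold", "low"): 0.1, ("cold", "high"): 0.1,
}

h_joint = entropy(p_TM.values())
print(round(h_joint, 3))            # H(T, M) = 2.322
print(h_joint <= 1.485 + 0.971)     # H(T, M) <= H(T) + H(M): True
```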
21. Conditional Entropy
- Conditional Entropy
- H(T | M = low) = 1.252
- H(T | M = high) = 1.5
- Average conditional entropy: H(T | M) = 0.6 × 1.252 + 0.4 × 1.5 = 1.351
- How much is M telling us on average about T?
- H(T) - H(T | M) = 1.485 - 1.351 = 0.134 bits
Conditional Probability P(T | M)
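A sketch continuing the reconstructed joint table from the previous snippet; it reproduces the conditional entropies and the 0.134-bit difference:

```python
import math

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Same reconstructed joint distribution P(T, M) as in the previous sketch.
p_TM = {
    ("hot", "low"): 0.1,  ("hot", "high"): 0.2,
    ("mild", "low"): 0.4, ("mild", "high"): 0.1,
    ("cold", "low"): 0.1, ("cold", "high"): 0.1,
}
p_M = {"low": 0.6, "high": 0.4}

def cond_entropy_given(m):
    """H(T | M = m): entropy of the column for M = m, renormalized by P(M = m)."""
    return entropy(p / p_M[m] for (_, mm), p in p_TM.items() if mm == m)

print(round(cond_entropy_given("low"), 3))    # H(T | M = low)  = 1.252
print(round(cond_entropy_given("high"), 3))   # H(T | M = high) = 1.5

h_T_given_M = sum(p_M[m] * cond_entropy_given(m) for m in p_M)
h_T = entropy([0.3, 0.5, 0.2])
print(round(h_T_given_M, 3))         # average conditional entropy = 1.351
print(round(h_T - h_T_given_M, 3))   # H(T) - H(T | M) = 0.134 bits
```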
22. Mutual Information
- I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X)
- Properties
- Indicates the amount of information one random variable provides about another
- Symmetric: I(X; Y) = I(Y; X)
- Non-negative: I(X; Y) ≥ 0
- Zero iff X and Y are independent
23. Review
Diagram relating H(X, Y), H(X), H(Y), H(X | Y), H(Y | X), and I(X; Y)
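These relationships can be checked numerically on the temperature/humidity example, using the values computed above; a sketch:

```python
# Values for the temperature (T) / humidity (M) example, in bits.
h_T, h_M = 1.485, 0.971
h_TM = 2.322          # joint entropy H(T, M)
h_T_given_M = 1.351   # conditional entropy H(T | M)

# Chain rule: H(T, M) = H(M) + H(T | M)
print(round(h_M + h_T_given_M, 3))        # 2.322

# Mutual information two ways: I(T; M) = H(T) - H(T | M) = H(T) + H(M) - H(T, M)
print(round(h_T - h_T_given_M, 3))        # 0.134
print(round(h_T + h_M - h_TM, 3))         # 0.134
```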
24. A Distance Measure Between Distributions
- Kullback-Leibler distance: K(PD, PM) = Σx PD(x) log2( PD(x) / PM(x) )
- Properties of the Kullback-Leibler distance
- Non-negative: K(PD, PM) ≥ 0, with K(PD, PM) = 0 iff PD = PM
- Minimizing the KL distance ⇒ PM gets close to PD
- Non-symmetric: K(PD, PM) ≠ K(PM, PD)
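A minimal sketch of the KL distance with made-up distributions, illustrating non-negativity and asymmetry:

```python
import math

def kl(p, q):
    """K(p, q) = sum_x p(x) * log2(p(x) / q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_data  = [0.3, 0.5, 0.2]        # stand-in for the data distribution P_D
p_model = [1 / 3, 1 / 3, 1 / 3]  # stand-in for the model distribution P_M

print(round(kl(p_data, p_model), 4))   # ~0.0995, non-negative
print(round(kl(p_model, p_data), 4))   # ~0.1013, a different value (non-symmetric)
print(kl(p_data, p_data))              # 0.0 when the two distributions coincide
```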
25. The Noisy Channel
- Prototypical case
- Input → The channel (adds noise) → Output (noisy)
- 0,1,1,1,0,1,0,1,... → 0,1,1,0,0,1,1,0,...
- Model: probability of error (noise)
- Example: p(0|1) = 0.3, p(1|1) = 0.7, p(1|0) = 0.4, p(0|0) = 0.6
- The Task
- Knowing the noisy output, we want to recover the input (decoding)
- Source coding theorem
- Channel coding theorem
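A sketch of the channel model with the error probabilities given above; the uniform prior over input bits is an added assumption, used only to illustrate posterior decoding of a single output bit:

```python
import random

# Channel model from the slide: p(output | input)
p_out_given_in = {
    1: {1: 0.7, 0: 0.3},   # p(1|1) = 0.7, p(0|1) = 0.3
    0: {0: 0.6, 1: 0.4},   # p(0|0) = 0.6, p(1|0) = 0.4
}

def transmit(bit, rng=random):
    """Pass one bit through the noisy channel."""
    return bit if rng.random() < p_out_given_in[bit][bit] else 1 - bit

def posterior(observed):
    """p(input | output) by Bayes' rule, assuming a uniform prior p(input) = 0.5."""
    prior = {0: 0.5, 1: 0.5}
    joint = {b: prior[b] * p_out_given_in[b][observed] for b in (0, 1)}
    z = sum(joint.values())
    return {b: joint[b] / z for b in (0, 1)}

print([transmit(b) for b in (0, 1, 1, 1, 0, 1, 0, 1)])  # a noisy copy of the input
print(posterior(1))   # p(input | output = 1): {0: ~0.36, 1: ~0.64}
```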
26. Noisy Channel Applications
- OCR
- straightforward: text → print (adds noise) → scanned image
- Handwriting recognition
- text → neurons, muscles (add noise) → scanned/digitized image
- Speech recognition (dictation, commands, etc.)
- text → conversion to acoustic signal (adds noise) → acoustic waves
- Machine Translation
- text in target language → translation (adds noise) → text in source language