Transcript and Presenter's Notes

Title: Announcement


1
Announcement
  • Homework 1 is due on Tuesday, 01/27/2003, not
    01/26/2003

2
Statistical Inference
  • Rong Jin

3
Statistical Inference
Assume that the height of people follows a Gaussian
distribution. Question: Based on the measurements
of 100 people, what is the Gaussian distribution
that best fits the data?
4
Statistical Inference
  • Problem: estimate the parameters of the distribution from the observed data
  • Likelihood function: the probability of the observed data, viewed as a function of the parameters
  • Approach: maximum likelihood estimation (MLE), i.e., maximize the likelihood or, equivalently, the log-likelihood (see the sketch below)
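The slide's formulas did not survive extraction; as a minimal sketch, for independent observations x_1, ..., x_n drawn from a density p(x | theta):

L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta), \qquad
\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta), \qquad
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \ell(\theta)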

5
Example I: Flip Coins
6
Example I: Flip Coins (cont'd)
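The worked example on this slide is missing from the transcript. A standard version: if a coin with unknown head probability p comes up heads h times in n independent flips, then

\ell(p) = h \log p + (n - h) \log (1 - p), \qquad
\frac{d\ell}{dp} = \frac{h}{p} - \frac{n - h}{1 - p} = 0
\;\Rightarrow\; \hat{p}_{\mathrm{MLE}} = \frac{h}{n}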
7
Example II: Normal Distribution
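The derivation is likewise missing here; for the Gaussian case (as in the height example), maximizing the log-likelihood over the mean and variance gives the familiar estimates

\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2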
8
Information Theory
  • Rong Jin

9
Outline
  • Information
  • Entropy
  • Mutual information
  • Noisy channel model

10
Information
  • Information ≠ knowledge
  • Information = reduction in uncertainty
  • Example
  • 1. flip a coin
  • 2. roll a die
  • Outcome 2 is more uncertain than outcome 1
  • Therefore, observing the outcome of 2 provides
    more information than observing the outcome of 1

11
Definition of Information
  • Let E be some event that occurs with probability
    P(E). If we are told that E has occurred, then we
    say we have received I(E) = log2(1/P(E)) bits of
    information
  • Example
  • Result of a fair coin flip: log2 2 = 1 bit
  • Result of a fair die roll: log2 6 ≈ 2.585 bits
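As a quick check (not part of the original slides), the definition can be evaluated directly; the helper below is illustrative only.

import math

def information_bits(p_event):
    """Self-information I(E) = log2(1/P(E)) in bits."""
    return math.log2(1.0 / p_event)

print(information_bits(1 / 2))  # fair coin flip -> 1.0 bit
print(information_bits(1 / 6))  # fair die roll  -> ~2.585 bits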

12
Information is Additive
  • I(k fair coin tosses) = log2(2^k) = k bits
  • Example: information conveyed by words
  • Random word from a 100,000-word vocabulary
  • I(word) = log2(100,000) ≈ 16.6 bits
  • A 1,000-word document from the same source
  • I(document) ≈ 1,000 × 16.6 = 16,600 bits
  • A 480x640-pixel, 16-greyscale video picture
  • I(picture) = 307,200 × log2(16) = 1,228,800 bits
  • ⇒ A picture is worth a 1000 words!
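A small script (not part of the original deck) reproducing the arithmetic above:

import math

word_bits = math.log2(100_000)        # ~16.6 bits per random word
doc_bits = 1_000 * word_bits          # ~16,600 bits per 1,000-word document
pic_bits = 480 * 640 * math.log2(16)  # 307,200 pixels x 4 bits = 1,228,800 bits

print(round(word_bits, 1), round(doc_bits), round(pic_bits))
print(pic_bits / doc_bits)            # one picture carries roughly 74 such documents' worth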

13
Outline
  • Information
  • Entropy
  • Mutual Information
  • Cross Entropy and Learning

14
Entropy
  • A zero-memory information source S is a source
    that emits symbols from an alphabet {s1, s2, ...,
    sk} with probabilities p1, p2, ..., pk,
    respectively, where the emitted symbols are
    statistically independent.
  • What is the average amount of information in
    observing the output of the source S?
  • We call this quantity the entropy H(S) (defined below)
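The formula itself is missing from the transcript; the standard definition the slide refers to is

H(S) = \sum_{i=1}^{k} p_i \log_2 \frac{1}{p_i}
     = -\sum_{i=1}^{k} p_i \log_2 p_i \quad \text{(bits per symbol)}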

15
Explanation of Entropy
  • Average amount of information provided per symbol
  • Average amount of surprise when observing a symbol
  • Uncertainty an observer has before seeing the
    symbol
  • Average number of bits needed to communicate each
    symbol

16
Properties of Entropy
  • Non-negative: H(P) ≥ 0
  • For any other probability distribution q1, ..., qk:
    H(P) ≤ -Σi pi log2 qi (Gibbs' inequality; spelled out below)
  • H(P) ≤ log2 k, with equality iff pi = 1/k for all i
  • The further P is from uniform, the lower the
    entropy.
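The inequality on the slide is truncated in the transcript; the standard statement it appears to reference, and the uniform-distribution bound that follows from it, are

-\sum_{i=1}^{k} p_i \log_2 p_i \;\le\; -\sum_{i=1}^{k} p_i \log_2 q_i,
\qquad \text{and with } q_i = 1/k: \quad H(P) \le \log_2 k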

17
Entropy: k = 2
  • Notice
  • zero information at the edges (p = 0 or p = 1)
  • maximum information at p = 0.5 (1 bit)
  • drops off more quickly near the edges than in the
    middle
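A short script (not in the original deck) that evaluates the binary entropy curve this slide plots; p = 0.5 gives the 1-bit maximum.

import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(p, round(binary_entropy(p), 3))
# values rise steeply away from 0 and 1, peaking at 1 bit for p = 0.5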

18
The Entropy of English
  • 27 characters (A-Z, space)
  • 100,000 words (on average 6.5 characters each)
  • Assuming independence between successive
    characters
  • Uniform character distribution: log2 27 ≈ 4.75
    bits/char
  • True character distribution: 4.03 bits/char
  • Assuming independence between successive words
  • Uniform word distribution: log2(100,000)/6.5 ≈ 2.55
    bits/char
  • True word distribution: 9.45/6.5 ≈ 1.45
    bits/char
  • The true entropy of English is much lower!
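For reference (not from the slides), the two uniform-distribution figures can be recomputed directly:

import math

print(math.log2(27))             # ~4.75 bits/char: uniform over 27 characters
print(math.log2(100_000) / 6.5)  # ~2.55 bits/char: uniform over 100,000 words,
                                 # spread over 6.5 characters per word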

19
Entropy of Two Sources
Temperature T: P(T = hot) = 0.3, P(T = mild) = 0.5,
P(T = cold) = 0.2  ⇒  H(T) = H(0.3, 0.5, 0.2) =
1.485
Humidity M: P(M = low) = 0.6, P(M = high) = 0.4  ⇒
H(M) = H(0.6, 0.4) = 0.971
  • The random variables T and M are not independent
  • P(T = t, M = m) ≠ P(T = t) P(M = m)

20
Joint Entropy
  • H(T) = 1.485
  • H(M) = 0.971
  • H(T) + H(M) = 2.456
  • Joint entropy
  • H(T, M) = H(0.1, 0.4, 0.1, 0.2, 0.1, 0.1) = 2.321
    (the six cell probabilities of P(T, M))
  • H(T, M) ≤ H(T) + H(M)

Joint probability table P(T, M):
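The table itself did not survive extraction; the cells below are reconstructed from the six probabilities listed in H(T, M) above and are consistent with the stated marginals and conditional entropies:

             M = low   M = high
  T = hot      0.1       0.2
  T = mild     0.4       0.1
  T = cold     0.1       0.1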
21
Conditional Entropy
  • Conditional entropy
  • H(T | M = low) = 1.252
  • H(T | M = high) = 1.5
  • Average conditional entropy:
    H(T | M) = 0.6 × 1.252 + 0.4 × 1.5 = 1.351
  • How much does M tell us about T on average?
  • H(T) - H(T | M) = 1.485 - 1.351 = 0.134 bits

Conditional probability table P(T | M)
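The following script (not from the deck) recomputes every number in this running example from the reconstructed joint table above:

import math

# Reconstructed joint distribution P(T, M)
joint = {
    ("hot", "low"): 0.1,  ("hot", "high"): 0.2,
    ("mild", "low"): 0.4, ("mild", "high"): 0.1,
    ("cold", "low"): 0.1, ("cold", "high"): 0.1,
}

def H(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_T = {t: sum(p for (tt, _), p in joint.items() if tt == t) for t in ("hot", "mild", "cold")}
p_M = {m: sum(p for (_, mm), p in joint.items() if mm == m) for m in ("low", "high")}

H_T = H(p_T.values())        # ~1.485
H_M = H(p_M.values())        # ~0.971
H_TM = H(joint.values())     # ~2.321
H_T_given_M = H_TM - H_M     # ~1.351  (average conditional entropy)
I_TM = H_T - H_T_given_M     # ~0.134  (mutual information, next slide)

print(round(H_T, 3), round(H_M, 3), round(H_TM, 3), round(H_T_given_M, 3), round(I_TM, 3))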
22
Mutual Information
  • Properties
  • Indicates the amount of information one random
    variable provides about another
  • Symmetric: I(X; Y) = I(Y; X)
  • Non-negative
  • Zero iff X and Y are independent
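The defining formula is absent from the transcript; the standard definition, consistent with the T/M numbers above, is

I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)
        = \sum_{x, y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)}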

23
Review
[Diagram relating H(X, Y), H(X), H(Y), H(X | Y), H(Y | X), and I(X; Y)]
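Written out, the relationships the review diagram depicts (standard identities, not transcribed from the slide):

H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y), \qquad
I(X; Y) = H(X) - H(X \mid Y) = H(X) + H(Y) - H(X, Y)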
24
A Distance Measure Between Distributions
  • Kullback-Leibler (KL) distance (defined below)
  • Properties of the Kullback-Leibler distance
  • Non-negative: K(PD, PM) ≥ 0, with K(PD, PM) = 0 iff PD = PM
  • Minimizing the KL distance ⇒ PM gets close to PD
  • Non-symmetric: K(PD, PM) ≠ K(PM, PD)
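The formula was lost in extraction; the standard definition, in the slide's notation (PD for the data distribution, PM for the model), is

K(P_D, P_M) = \sum_{x} P_D(x) \log_2 \frac{P_D(x)}{P_M(x)}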

25
The Noisy Channel
  • Prototypical case
  • Input → The channel (adds noise) → Output (noisy)
  • e.g., input 0,1,1,1,0,1,0,1,... comes out as
    0,1,1,0,0,1,1,0,...
  • Model: probability of error (noise)
  • Example: p(0|1) = 0.3, p(1|1) = 0.7, p(1|0) = 0.4,
    p(0|0) = 0.6
  • The task
  • Given the noisy output, recover the input
    (decoding); see the sketch below
  • Source coding theorem
  • Channel coding theorem
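A minimal per-bit decoding sketch (an illustration under assumptions, not the deck's method): assume a prior over input bits and, using the error probabilities above, pick the input that maximizes P(input) · P(output | input).

# Channel model from the slide: P(output | input)
p_out_given_in = {
    0: {0: 0.6, 1: 0.4},  # input 0 -> output 0 w.p. 0.6, flipped w.p. 0.4
    1: {1: 0.7, 0: 0.3},  # input 1 -> output 1 w.p. 0.7, flipped w.p. 0.3
}
p_in = {0: 0.5, 1: 0.5}   # assumed uniform prior over input bits (not given on the slide)

def decode_bit(observed):
    """Return the input bit maximizing P(input) * P(observed | input)."""
    return max((0, 1), key=lambda b: p_in[b] * p_out_given_in[b][observed])

noisy_output = [0, 1, 1, 0, 0, 1, 1, 0]
print([decode_bit(o) for o in noisy_output])
# With a uniform prior this keeps each observed bit; a skewed prior
# (e.g. p_in = {0: 0.8, 1: 0.2}) can flip the decision for observed 1s.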

26
Noisy Channel Applications
  • OCR
  • straightforward: text → print (adds noise) →
    scanned image
  • Handwriting recognition
  • text → neurons, muscles (noise) →
    scanned/digitized image
  • Speech recognition (dictation, commands, etc.)
  • text → conversion to acoustic signal (noise) →
    acoustic waves
  • Machine Translation
  • text in target language → translation (noise) →
    source language