Transcript and Presenter's Notes

Title: Entropy and Compression


1
Entropy and Compression
  • How much information is in English?

2
Shannon's game
  • Capacity to guess the next letter of an English text is a measure of the redundancy of English
  • More guessable ⇒ more redundant ⇒ lower entropy
  • Perfectly guessable ⇒ zero information ⇒ zero entropy
  • How many bits per character are needed to
    represent English?

3
Fixed Length Codes
  • Example: 4 symbols A, B, C, D
  • A = 00, B = 01, C = 10, D = 11
  • In general, with n symbols, codes need to be of length lg n, rounded up
  • For English text, 26 letters + space = 27 symbols, so length 5, since 2^4 < 27 < 2^5
  • (replace all punctuation marks by space)
  • AKA block codes
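
A minimal sketch (not part of the original slides) of such a fixed-length block code for the 27-symbol alphabet of 26 letters plus space; ceil(lg 27) = 5 bits per symbol:

```python
import math
import string

# 27-symbol alphabet: 26 letters plus space (punctuation replaced by space).
SYMBOLS = string.ascii_uppercase + " "
BITS = math.ceil(math.log2(len(SYMBOLS)))   # ceil(lg 27) = 5 bits per symbol

# Fixed-length (block) code: symbol i -> its index written with BITS binary digits.
ENCODE = {s: format(i, f"0{BITS}b") for i, s in enumerate(SYMBOLS)}
DECODE = {code: s for s, code in ENCODE.items()}

def encode(text: str) -> str:
    return "".join(ENCODE[c] for c in text)

def decode(bits: str) -> str:
    return "".join(DECODE[bits[i:i + BITS]] for i in range(0, len(bits), BITS))

msg = "HELLO WORLD"
assert decode(encode(msg)) == msg
print(BITS, len(encode(msg)))   # 5 bits/symbol, 55 bits for the 11-symbol message
```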

4
Modeling the Message Source
(Diagram: message source → destination)
  • Characteristics of the stream of messages coming
    from the source affect the choice of the coding
    method
  • We need a model for a source of English text that
    can be described and analyzed mathematically

5
No compression if no assumptions
  • If nothing is assumed except that there are 27
    possible symbols, the best that can be done to
    construct a code is to use the shortest possible
    fixed-length code
  • A string in which every letter is equally likely
  • xfoml rxkhrjffjuj zlpwcfwkcyj ffjeyvkcqsghyd
    qpaamkbzaacibzlhjqd
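
A quick sketch of this zero-assumption model: with 27 equally likely symbols, a typical string looks like the gibberish above (the generator below is illustrative, not from the slides):

```python
import random
import string

SYMBOLS = string.ascii_lowercase + " "   # 27 equally likely symbols

def zero_order_sample(n: int, seed: int = 0) -> str:
    """Draw n symbols independently and uniformly from the 27-symbol alphabet."""
    rng = random.Random(seed)
    return "".join(rng.choice(SYMBOLS) for _ in range(n))

print(zero_order_sample(60))   # gibberish comparable to the sample above
```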

6
First-Order model of English
  • Assume symbols are produced in proportion to
    their actual frequencies in English texts
  • Use shorter codes for more frequent symbols
  • Morse Code does something like this
  • Example with 4 possible messages (frequencies .7, .1, .1, .1, as used on the following slides)

7
Prefix Codes
  • No codeword is a prefix of another, so there is only one way to decode, left to right (see the sketch below)
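
The slide's code table is an image that is not reproduced in this transcript; the sketch below assumes one possible prefix code for A, B, C, D and shows the left-to-right decoding the bullet describes:

```python
# Assumed prefix code (the slide's actual table is not in the transcript).
CODE = {"A": "0", "B": "10", "C": "110", "D": "111"}

def decode_prefix(bits: str) -> str:
    """Decode left to right: grow the current chunk until it matches a codeword."""
    reverse = {code: sym for sym, code in CODE.items()}
    out, chunk = [], ""
    for b in bits:
        chunk += b
        if chunk in reverse:            # no codeword is a prefix of another,
            out.append(reverse[chunk])  # so the first match is the only possible one
            chunk = ""
    return "".join(out)

print(decode_prefix("0101100111"))   # -> "ABCAD"
```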

8
First-Order model shortens code
  • Average bits per symbol: .7×1 + .1×3 + .1×3 + .1×3 = 1.6 bits/symbol (down from 2)
  • Another prefix code that is better: .7×1 + .1×2 + .1×3 + .1×3 = 1.5
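
A quick check of the arithmetic above, assuming frequencies .7, .1, .1, .1 for the four messages (the values used again on later slides) and the codeword lengths implied by the two sums:

```python
# Message frequencies and the codeword lengths implied by the sums above.
freqs = [0.7, 0.1, 0.1, 0.1]

def avg_length(lengths):
    """Average code length = sum of frequency × codeword length."""
    return sum(p * l for p, l in zip(freqs, lengths))

print(avg_length([2, 2, 2, 2]))  # 2.0  fixed-length code
print(avg_length([1, 3, 3, 3]))  # 1.6  first prefix code
print(avg_length([1, 2, 3, 3]))  # 1.5  better prefix code
```
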
9
First-Order model of English
  • From a First-Order Source for English
  • ocroh hli rgwr nmielwis eu ll nbnesebya th eei
    alhenhttpa oobttva nah brl

10
Measuring Information
  • If a symbol S has frequency p, its self-information is H(S) = lg(1/p) = -lg p.

11
Logarithm of a number ≈ its length
  • Decimal: log(274562328) ≈ 8.4
  • Binary: lg(101011) ≈ 5.4
  • For calculations:
  • lg(x) = log(x)/log(2)
  • Google calculator understands lg(x)
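
Both examples can be checked with lg(x) = log(x)/log(2), e.g. in Python:

```python
import math

def lg(x: float) -> float:
    return math.log(x) / math.log(2)   # equivalently, math.log2(x)

print(math.log10(274562328))  # ≈ 8.44, about the length of the 9-digit number
print(lg(0b101011))           # ≈ 5.43, since binary 101011 is 43
```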

12
Self-Information: H(S) = lg(1/p)
  • Greater frequency ⇔ Less information
  • Extreme case: p = 1, H(S) = lg(1) = 0
  • Why is this the right formula?
  • 1/p is the average length of the gaps between
    recurrences of S
  • ..S..SS.SS..  (a, b, c, d label the gaps between successive occurrences of S)
  • Average of a, b, c, d = 1/p
  • Number of bits to specify a gap is about lg(1/p)

13
Self-information example
  • A: 18/25 = .72, B: 5/25 = .20, C: 2/25 = .08
  • lg(1/.72) = .47, lg(1/.2) = 2.32, lg(1/.08) = 3.64
  • Gaps for A: 1,2,1,2,1,1,3,1,2,1,1,1,1,2,1,1,2,1; ave 1.39 ⇒ about 1 bit to write down
  • Gaps for B: 2, 7, 1, 3, 10; ave 4.6 ⇒ about 2 bits
  • Gaps for C: 5, 14; ave 9.5 ⇒ about 3 bits
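
A sketch reproducing the gap calculation above: the gap lists come from the slide, and the code checks that the average gap is close to 1/p and that lg(1/p) matches the stated self-information:

```python
from math import log2

# Gap data from the slide: distances between successive occurrences of each
# symbol in a 25-character sample (frequencies A: 18/25, B: 5/25, C: 2/25).
gaps = {
    "A": [1, 2, 1, 2, 1, 1, 3, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1],
    "B": [2, 7, 1, 3, 10],
    "C": [5, 14],
}
freqs = {"A": 18 / 25, "B": 5 / 25, "C": 2 / 25}

for s, g in gaps.items():
    avg_gap = sum(g) / len(g)
    p = freqs[s]
    # average gap ≈ 1/p, and lg(1/p) is the self-information of s
    print(s, round(avg_gap, 2), round(1 / p, 2), round(log2(1 / p), 2))
# A 1.39  1.39  0.47
# B 4.6   5.0   2.32
# C 9.5   12.5  3.64
```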

14
First-Order Entropy of Source = Weighted Average Self-Information
  • First-order entropy = Σ p_i · lg(1/p_i), the sum over all symbols of frequency × self-information
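
A minimal sketch of this weighted average, using the four-message source from the earlier slides (frequencies .7, .1, .1, .1):

```python
from math import log2

def first_order_entropy(freqs):
    """Weighted average self-information: sum of p × lg(1/p)."""
    return sum(p * log2(1 / p) for p in freqs if p > 0)

print(first_order_entropy([0.7, 0.1, 0.1, 0.1]))   # ≈ 1.357 bits/symbol
print(first_order_entropy([0.25] * 4))             # 2.0 bits/symbol (uniform case)
```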
15
Entropy, Compressibility, Redundancy
  • Lower entropy ⇔ More redundant ⇔ More compressible
  • Higher entropy ⇔ Less redundant ⇔ Less compressible
  • A source of yeas and nays takes 24 bits per
    symbol but contains at most one bit per symbol of
    information
  • 010110010100010101000001 = yea (the 24 ASCII bits of "YEA")
  • 010011100100000101011001 = nay (the 24 ASCII bits of "NAY")

16
Entropy and Compression
  • No code taking only frequencies into account can be better than first-order entropy
  • Average length for this code: .7×1 + .1×2 + .1×3 + .1×3 = 1.5
  • First-order entropy of this source: .7×lg(1/.7) + .1×lg(1/.1) + .1×lg(1/.1) + .1×lg(1/.1) ≈ 1.357
  • First-order entropy of English is about 4 bits/character, based on typical English texts
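
A sketch of how a figure like "about 4 bits/character" can be estimated from a text sample: count symbol frequencies, then take the weighted average self-information (the sample string below is an arbitrary stand-in, not from the slides):

```python
from collections import Counter
from math import log2

def first_order_entropy_of_text(text: str) -> float:
    """Estimate symbol frequencies by counting, then return sum of p × lg(1/p)."""
    text = "".join(c if c.isalpha() else " " for c in text.upper())
    counts = Counter(text)
    n = len(text)
    return sum((c / n) * log2(n / c) for c in counts.values())

sample = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG"
print(first_order_entropy_of_text(sample))   # ≈ 4.4 for this short pangram;
                                             # large English texts come out near 4
```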

17
Second-Order model of English
  • Source generates all 729 digrams in the right
    proportions
  • Digram = two-letter sequence: AA … ZZ, and also A<sp>, <sp>Z, etc.
  • A string from a second-order source of English:
  • On ie antsoutinys are t inctore st be s deamy
    achin d ilonasive tucoowe at teasonare fuso tizin
    andy tobe seace ctisbe

18
Second-Order Entropy
  • Second-Order Entropy of a source is calculated by
    treating digrams as single symbols according to
    their frequencies
  • Occurrences of q and u are not independent, so it is helpful to treat qu as one symbol
  • Second-order entropy of English is about 3.3
    bits/character
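
A sketch of one common way to make this estimate (the slides do not spell out the formula, so the convention here, entropy of the digram distribution divided by 2, is an assumption):

```python
from collections import Counter
from math import log2

def second_order_entropy_of_text(text: str) -> float:
    """Entropy of the digram distribution, divided by 2 to give bits/character."""
    text = "".join(c if c.isalpha() else " " for c in text.upper())
    digrams = [text[i:i + 2] for i in range(len(text) - 1)]
    counts = Counter(digrams)
    n = len(digrams)
    return sum((c / n) * log2(n / c) for c in counts.values()) / 2

sample = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG"
print(second_order_entropy_of_text(sample))   # ≈ 2.6 on this tiny, undersampled text;
                                              # large English corpora give about 3.3
```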

19
Third-Order Entropy
  • The source generates trigrams in their proper frequencies; a sample:
  • IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID
    PONDENOME OF DEMONSTURES OF THE REPTAGIN IS
    REGOACTIONA OF CRE

20
Word Approximations to English
  • Use English words in their real frequencies
  • First-order word approximation: REPRESENTING AND
    SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT
    NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT
    GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE
    THESE

21
Second-Order Word Approximation
  • THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH
    WRITER THAT THE CHARACTER OF THIS POINT IS
    THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE
    TIME OF WHO EVER TOLD THE PROBLEM FOR AN
    UNEXPECTED

22
What is the entropy of English?
  • Entropy is the limit of the information per symbol using single symbols, digrams, trigrams, …
  • Not really calculable because English is a finite language!
  • Nonetheless it can be determined experimentally using Shannon's game
  • Answer: a little more than 1 bit/character

23
Efficiency of a Code
  • Efficiency of a code for a source = (entropy of source)/(average code length)
  • Average code length = 1.5
  • Assume that the source generates symbols in these frequencies (.7, .1, .1, .1) but otherwise randomly (first-order model)
  • Entropy ≈ 1.357
  • Efficiency ≈ 1.357/1.5 ≈ 0.90
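
The same calculation in code, using the frequencies and codeword lengths assumed earlier:

```python
from math import log2

freqs = [0.7, 0.1, 0.1, 0.1]
lengths = [1, 2, 3, 3]                                # the better prefix code above

entropy = sum(p * log2(1 / p) for p in freqs)         # ≈ 1.357 bits/symbol
avg_len = sum(p * l for p, l in zip(freqs, lengths))  # 1.5 bits/symbol
print(entropy / avg_len)                              # efficiency ≈ 0.90
```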

24
Shannon's Source Coding Theorem
  • No code can achieve efficiency greater than 1,
    but
  • For any source, there are codes with efficiency
    as close to 1 as desired.
  • This is a most remarkable result because the
    proof does not give a method to find the best
    codes. It just sets a limit on how good they can
    be.
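
The theorem gives no construction, but one standard method not covered in these slides, Huffman coding applied to blocks of k symbols, illustrates how efficiency can approach 1 for the four-message source used above; a rough sketch:

```python
import heapq
from itertools import product
from math import log2

freqs = {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1}
entropy = sum(p * log2(1 / p) for p in freqs.values())   # ≈ 1.357 bits/symbol

def huffman_lengths(weights):
    """Codeword length of each symbol in a Huffman code built from the weights."""
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}  # one level deeper
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

for k in (1, 2, 3):
    # Treat every k-symbol block as one super-symbol with the product probability.
    blocks = {"".join(b): 1.0 for b in product(freqs, repeat=k)}
    for b in blocks:
        for s in b:
            blocks[b] *= freqs[s]
    lengths = huffman_lengths(blocks)
    bits_per_symbol = sum(blocks[b] * lengths[b] for b in blocks) / k
    print(k, round(bits_per_symbol, 3), round(entropy / bits_per_symbol, 3))
# k=1 gives 1.5 bits/symbol (efficiency ≈ 0.90); k=2 already drops to ≈ 1.365;
# longer blocks push the average toward the entropy, so efficiency approaches 1.
```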