Title: Entropy and Compression
1. Entropy and Compression
- How much information is in English?
2. Shannon's game
- The capacity to guess the next letter of an English text is a measure of the redundancy of English
- More guessable => more redundant => lower entropy
- Perfectly guessable => zero information, zero entropy
- How many bits per character are needed to represent English?
3. Fixed-Length Codes
- Example: 4 symbols, A, B, C, D
- A = 00, B = 01, C = 10, D = 11
- In general, with n symbols, codes need to be of length lg n, rounded up
- For English text, 26 letters + space = 27 symbols, so length 5, since 2^4 < 27 < 2^5
- (Replace all punctuation marks by space)
- AKA block codes
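Below is a minimal sketch (Python, not part of the original slides) of such a block code: every symbol gets the same codeword length, ceil(lg n) bits.

```python
import math

def fixed_length_code(symbols):
    """Build a block code: every symbol gets a codeword of ceil(lg n) bits."""
    width = math.ceil(math.log2(len(symbols)))
    return {s: format(i, f"0{width}b") for i, s in enumerate(symbols)}

print(fixed_length_code("ABCD"))
# {'A': '00', 'B': '01', 'C': '10', 'D': '11'}

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "        # 26 letters + space = 27 symbols
print(len(fixed_length_code(alphabet)["A"]))    # 5 bits, since 2^4 < 27 < 2^5
```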
4. Modeling the Message Source
(Diagram: Source -> Destination)
- Characteristics of the stream of messages coming from the source affect the choice of the coding method
- We need a model for a source of English text that can be described and analyzed mathematically
5. No compression if no assumptions
- If nothing is assumed except that there are 27 possible symbols, the best that can be done to construct a code is to use the shortest possible fixed-length code
- A string in which every letter is equally likely:
- xfoml rxkhrjffjuj zlpwcfwkcyj ffjeyvkcqsghyd
qpaamkbzaacibzlhjqd
6. First-Order model of English
- Assume symbols are produced in proportion to their actual frequencies in English texts
- Use shorter codes for more frequent symbols
- Morse Code does something like this
- Example with 4 possible messages
7. Prefix Codes
- No codeword is a prefix of another, so there is only one way to decode, reading left to right
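A short sketch (Python, added here; the codewords A=0, B=10, C=110, D=111 are an assumed example matching the 1.5 bits/symbol code on the next slide) of why left-to-right decoding is unambiguous.

```python
CODE = {"A": "0", "B": "10", "C": "110", "D": "111"}   # assumed example prefix code

def decode(bits, code=CODE):
    """Decode left to right: since no codeword is a prefix of another,
    the first codeword the buffer matches is the only possible one."""
    inverse = {w: s for s, w in code.items()}
    out, buffer = [], ""
    for b in bits:
        buffer += b
        if buffer in inverse:
            out.append(inverse[buffer])
            buffer = ""
    return "".join(out)

print(decode("0100110111"))   # ABACD
```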
8. First-Order model shortens code
- With frequencies .7, .1, .1, .1 and codeword lengths 1, 3, 3, 3: .7×1 + .1×3 + .1×3 + .1×3 = 1.6 bits/symbol (down from 2)
- Another prefix code that is better (lengths 1, 2, 3, 3): .7×1 + .1×2 + .1×3 + .1×3 = 1.5
9. First-Order model of English
- From a First-Order Source for English
- ocroh hli rgwr nmielwis eu ll nbnesebya th eei
alhenhttpa oobttva nah brl
10. Measuring Information
- If a symbol S has frequency p, its self-information is H(S) = lg(1/p) = -lg p.
11. Logarithm of a number ≈ its length
- Decimal: log(274562328) ≈ 8.4
- Binary: lg(101011) ≈ 5.4
- For calculations:
- lg(x) = log(x)/log(2)
- Google calculator understands lg(x)
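A quick check (Python, added for illustration) of these values and of the change-of-base formula:

```python
import math

print(math.log10(274562328))        # ~8.44: the decimal numeral is 9 digits long
print(math.log2(0b101011))          # ~5.43: the binary numeral 101011 is 6 bits long
print(math.log(43) / math.log(2))   # same value via lg(x) = log(x)/log(2), with 101011 binary = 43
```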
12. Self-Information: H(S) = lg(1/p)
- Greater frequency <=> less information
- Extreme case: p = 1, H(S) = lg(1) = 0
- Why is this the right formula?
- 1/p is the average length of the gaps between recurrences of S
- Example: ..S..SS.SS.. with gaps a, b, c, d between successive occurrences of S
- Average of a, b, c, d = 1/p
- Number of bits to specify a gap is about lg(1/p)
13. Self-information example
- A: 18/25 = .72, B: 5/25 = .20, C: 2/25 = .08
- lg(1/.72) = .47, lg(1/.2) = 2.32, lg(1/.08) = 3.64
- Gaps for A: 1,2,1,2,1,1,3,1,2,1,1,1,1,2,1,1,2,1; average 1.39 -> about 1 bit to write down a gap
- Gaps for B: 2, 7, 1, 3, 10; average 4.6 -> about 2 bits
- Gaps for C: 5, 14; average 9.5 -> about 3 bits
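A small sketch (Python, added here) reproducing these numbers: the self-information lg(1/p) of each symbol roughly matches the bits needed to record an average gap between its occurrences.

```python
import math

freqs = {"A": 18/25, "B": 5/25, "C": 2/25}
for s, p in freqs.items():
    # the average gap between recurrences of s is about 1/p;
    # writing down such a gap takes about lg(1/p) bits
    print(s, round(1/p, 2), round(math.log2(1/p), 2))
# A 1.39 0.47
# B 5.0 2.32
# C 12.5 3.64
```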
14. First-Order Entropy of Source = Weighted Average Self-Information
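The formula the slide title refers to (the standard definition, restated here), where p_i is the frequency of the i-th symbol:

```latex
H = \sum_i p_i \lg\frac{1}{p_i} = -\sum_i p_i \lg p_i
```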
15. Entropy, Compressibility, Redundancy
- Lower entropy <=> more redundant <=> more compressible
- Higher entropy <=> less redundant <=> less compressible
- A source of yeas and nays takes 24 bits per symbol but contains at most one bit per symbol of information
- 010110010100010101000001 = yea
- 010011100100000110101001 = nay
16. Entropy and Compression
- No code taking only frequencies into account can be better than first-order entropy
- Average length for this code: .7×1 + .1×2 + .1×3 + .1×3 = 1.5
- First-order entropy of this source: .7×lg(1/.7) + .1×lg(1/.1) + .1×lg(1/.1) + .1×lg(1/.1) ≈ 1.357
- First-order entropy of English is about 4 bits/character, based on typical English texts
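A sketch (Python, added for illustration) of the comparison: the first-order entropy of the .7/.1/.1/.1 source against the average length of the 1-, 2-, 3-, 3-bit prefix code.

```python
import math

freqs   = [0.7, 0.1, 0.1, 0.1]
lengths = [1, 2, 3, 3]          # codeword lengths of the better prefix code

entropy    = sum(p * math.log2(1 / p) for p in freqs)
avg_length = sum(p * l for p, l in zip(freqs, lengths))

print(round(entropy, 3), avg_length)   # ~1.357 vs 1.5: the code cannot beat the entropy
```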
17. Second-Order model of English
- Source generates all 729 digrams in the right proportions
- Digram = two-letter sequence: AA ... ZZ, and also A<sp>, <sp>Z, etc.
- A string from a second-order source of English:
- On ie antsoutinys are t inctore st be s deamy
achin d ilonasive tucoowe at teasonare fuso tizin
andy tobe seace ctisbe
18. Second-Order Entropy
- Second-Order Entropy of a source is calculated by treating digrams as single symbols according to their frequencies
- Occurrences of q and u are not independent, so it is helpful to treat qu as one
- Second-order entropy of English is about 3.3 bits/character
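A sketch (Python, added here; the halving step, which converts bits per digram to bits per character, is an assumption about the convention used) of how such an estimate can be made from a sample text:

```python
import math
from collections import Counter

def second_order_entropy(text):
    """Entropy of the digram distribution, halved to give bits per character."""
    digrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(digrams.values())
    bits_per_digram = sum((c / total) * math.log2(total / c) for c in digrams.values())
    return bits_per_digram / 2      # each digram covers two characters

sample = "the quick brown fox jumps over the lazy dog and the cat sat on the mat "
print(second_order_entropy(sample))
```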
19. Third-Order Entropy
- A third-order source has trigrams in their proper frequencies
- IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID
PONDENOME OF DEMONSTURES OF THE REPTAGIN IS
REGOACTIONA OF CRE
20. Word Approximations to English
- Use English words in their real frequencies
- First-order word approximation: REPRESENTING AND
SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT
NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT
GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE
THESE
21. Second-Order Word Approximation
- THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH
WRITER THAT THE CHARACTER OF THIS POINT IS
THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE
TIME OF WHO EVER TOLD THE PROBLEM FOR AN
UNEXPECTED
22. What is the entropy of English?
- Entropy is the limit of the information per symbol using single symbols, digrams, trigrams, ...
- Not really calculable, because English is a finite language!
- Nonetheless it can be determined experimentally using Shannon's game
- Answer: a little more than 1 bit/character
23. Efficiency of a Code
- Efficiency of a code for a source = (entropy of source)/(average code length)
- Average code length = 1.5
- Assume that the source generates symbols in these frequencies but otherwise randomly (first-order model)
- Entropy ≈ 1.357
- Efficiency ≈ 1.357/1.5 ≈ 0.905
24. Shannon's Source Coding Theorem
- No code can achieve efficiency greater than 1, but
- For any source, there are codes with efficiency as close to 1 as desired.
- This is a most remarkable result, because the proof does not give a method to find the best codes. It just sets a limit on how good they can be.