Title: Entropy and Compression
1. Entropy and Compression
- How much information is in English?
2. Shannon's game
- The capacity to guess the next letter of an English text is a measure of the redundancy of English
- More guessable => more redundant => lower entropy
- Perfectly guessable => zero information, zero entropy
- How many bits per character are needed to represent English?
3. Fixed-Length Codes
- Example: 4 symbols, A, B, C, D
- A = 00, B = 01, C = 10, D = 11
- In general, with n symbols, codes need to be of length lg n, rounded up
- For English text, 26 letters + space = 27 symbols, so length 5, since 2^4 < 27 < 2^5
- (Replace all punctuation marks by space)
- AKA block codes
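Below is a minimal sketch (Python, not part of the original slides) of such a block code: every symbol gets the same codeword length, ceil(lg n) bits.

```python
import math

def fixed_length_code(symbols):
    """Build a block code: every symbol gets a codeword of ceil(lg n) bits."""
    width = math.ceil(math.log2(len(symbols)))
    return {s: format(i, f"0{width}b") for i, s in enumerate(symbols)}

print(fixed_length_code("ABCD"))
# {'A': '00', 'B': '01', 'C': '10', 'D': '11'}

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "        # 26 letters + space = 27 symbols
print(len(fixed_length_code(alphabet)["A"]))    # 5 bits, since 2^4 < 27 < 2^5
```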
4. Modeling the Message Source
(Diagram: Source -> Destination)
- Characteristics of the stream of messages coming from the source affect the choice of the coding method
- We need a model for a source of English text that can be described and analyzed mathematically
5. No compression if no assumptions
- If nothing is assumed except that there are 27 possible symbols, the best that can be done to construct a code is to use the shortest possible fixed-length code
- A string in which every letter is equally likely:
- xfoml rxkhrjffjuj zlpwcfwkcyj ffjeyvkcqsghyd
qpaamkbzaacibzlhjqd
6. First-Order model of English
- Assume symbols are produced in proportion to their actual frequencies in English texts
- Use shorter codes for more frequent symbols
- Morse Code does something like this
- Example with 4 possible messages
7. Prefix Codes
- No codeword is a prefix of another, so there is only one way to decode, reading left to right
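A short sketch (Python, added here; the codewords A=0, B=10, C=110, D=111 are an assumed example matching the 1.5 bits/symbol code on the next slide) of why left-to-right decoding is unambiguous.

```python
CODE = {"A": "0", "B": "10", "C": "110", "D": "111"}   # assumed example prefix code

def decode(bits, code=CODE):
    """Decode left to right: since no codeword is a prefix of another,
    the first codeword the buffer matches is the only possible one."""
    inverse = {w: s for s, w in code.items()}
    out, buffer = [], ""
    for b in bits:
        buffer += b
        if buffer in inverse:
            out.append(inverse[buffer])
            buffer = ""
    return "".join(out)

print(decode("0100110111"))   # ABACD
```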
8. First-Order model shortens code
- With frequencies .7, .1, .1, .1 and codeword lengths 1, 3, 3, 3: .7×1 + .1×3 + .1×3 + .1×3 = 1.6 bits/symbol (down from 2)
- Another prefix code that is better (lengths 1, 2, 3, 3): .7×1 + .1×2 + .1×3 + .1×3 = 1.5
9. First-Order model of English
- From a First-Order Source for English
- ocroh hli rgwr nmielwis eu ll nbnesebya th eei
alhenhttpa oobttva nah brl
10. Measuring Information
- If a symbol S has frequency p, its self-information is H(S) = lg(1/p) = -lg p.
11. Logarithm of a number ≈ its length
- Decimal: log(274562328) ≈ 8.4
- Binary: lg(101011) ≈ 5.4
- For calculations:
- lg(x) = log(x)/log(2)
- Google calculator understands lg(x)
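A quick check (Python, added for illustration) of these values and of the change-of-base formula:

```python
import math

print(math.log10(274562328))        # ~8.44: the decimal numeral is 9 digits long
print(math.log2(0b101011))          # ~5.43: the binary numeral 101011 is 6 bits long
print(math.log(43) / math.log(2))   # same value via lg(x) = log(x)/log(2), with 101011 binary = 43
```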
12. Self-Information: H(S) = lg(1/p)
- Greater frequency <=> less information
- Extreme case: p = 1, H(S) = lg(1) = 0
- Why is this the right formula?
- 1/p is the average length of the gaps between recurrences of S
- Example: ..S..SS.SS.. with gaps a, b, c, d between successive occurrences of S
- Average of a, b, c, d = 1/p
- Number of bits to specify a gap is about lg(1/p)
13. Self-information example
- A: 18/25 = .72, B: 5/25 = .20, C: 2/25 = .08
- lg(1/.72) = .47, lg(1/.2) = 2.32, lg(1/.08) = 3.64
- Gaps for A: 1,2,1,2,1,1,3,1,2,1,1,1,1,2,1,1,2,1; average 1.39 -> about 1 bit to write down a gap
- Gaps for B: 2, 7, 1, 3, 10; average 4.6 -> about 2 bits
- Gaps for C: 5, 14; average 9.5 -> about 3 bits
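A small sketch (Python, added here) reproducing these numbers: the self-information lg(1/p) of each symbol roughly matches the bits needed to record an average gap between its occurrences.

```python
import math

freqs = {"A": 18/25, "B": 5/25, "C": 2/25}
for s, p in freqs.items():
    # the average gap between recurrences of s is about 1/p;
    # writing down such a gap takes about lg(1/p) bits
    print(s, round(1/p, 2), round(math.log2(1/p), 2))
# A 1.39 0.47
# B 5.0 2.32
# C 12.5 3.64
```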
14. First-Order Entropy of Source = Weighted Average Self-Information
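The formula the slide title refers to (the standard definition, restated here), where p_i is the frequency of the i-th symbol:

```latex
H = \sum_i p_i \lg\frac{1}{p_i} = -\sum_i p_i \lg p_i
```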
15. Entropy, Compressibility, Redundancy
- Lower entropy <=> more redundant <=> more compressible
- Higher entropy <=> less redundant <=> less compressible
- A source of yeas and nays takes 24 bits per symbol but contains at most one bit per symbol of information
- 010110010100010101000001 = yea
- 010011100100000110101001 = nay
16. Entropy and Compression
- No code taking only frequencies into account can be better than first-order entropy
- Average length for this code: .7×1 + .1×2 + .1×3 + .1×3 = 1.5
- First-order entropy of this source: .7×lg(1/.7) + .1×lg(1/.1) + .1×lg(1/.1) + .1×lg(1/.1) ≈ 1.357
- First-order entropy of English is about 4 bits/character, based on typical English texts
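A sketch (Python, added for illustration) of the comparison: the first-order entropy of the .7/.1/.1/.1 source against the average length of the 1-, 2-, 3-, 3-bit prefix code.

```python
import math

freqs   = [0.7, 0.1, 0.1, 0.1]
lengths = [1, 2, 3, 3]          # codeword lengths of the better prefix code

entropy    = sum(p * math.log2(1 / p) for p in freqs)
avg_length = sum(p * l for p, l in zip(freqs, lengths))

print(round(entropy, 3), avg_length)   # ~1.357 vs 1.5: the code cannot beat the entropy
```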
17. Second-Order model of English
- Source generates all 729 digrams in the right proportions
- Digram = two-letter sequence: AA ... ZZ, and also A<sp>, <sp>Z, etc.
- A string from a second-order source of English:
- On ie antsoutinys are t inctore st be s deamy
achin d ilonasive tucoowe at teasonare fuso tizin
andy tobe seace ctisbe
18. Second-Order Entropy
- Second-Order Entropy of a source is calculated by treating digrams as single symbols according to their frequencies
- Occurrences of q and u are not independent, so it is helpful to treat qu as one
- Second-order entropy of English is about 3.3 bits/character
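A sketch (Python, added here; the halving step, which converts bits per digram to bits per character, is an assumption about the convention used) of how such an estimate can be made from a sample text:

```python
import math
from collections import Counter

def second_order_entropy(text):
    """Entropy of the digram distribution, halved to give bits per character."""
    digrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(digrams.values())
    bits_per_digram = sum((c / total) * math.log2(total / c) for c in digrams.values())
    return bits_per_digram / 2      # each digram covers two characters

sample = "the quick brown fox jumps over the lazy dog and the cat sat on the mat "
print(second_order_entropy(sample))
```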
19. Third-Order Entropy
- A third-order source has trigrams in their proper frequencies
- IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID
PONDENOME OF DEMONSTURES OF THE REPTAGIN IS
REGOACTIONA OF CRE
20. Word Approximations to English
- Use English words in their real frequencies
- First-order word approximation: REPRESENTING AND
SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT
NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT
GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE
THESE
21. Second-Order Word Approximation
- THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH
WRITER THAT THE CHARACTER OF THIS POINT IS
THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE
TIME OF WHO EVER TOLD THE PROBLEM FOR AN
UNEXPECTED
22. What is the entropy of English?
- Entropy is the limit of the information per symbol using single symbols, digrams, trigrams, ...
- Not really calculable, because English is a finite language!
- Nonetheless it can be determined experimentally using Shannon's game
- Answer: a little more than 1 bit/character
23. Efficiency of a Code
- Efficiency of a code for a source = (entropy of source)/(average code length)
- Average code length = 1.5
- Assume that the source generates symbols in these frequencies but otherwise randomly (first-order model)
- Entropy ≈ 1.357
- Efficiency ≈ 1.357/1.5 ≈ 0.905
24. Shannon's Source Coding Theorem
- No code can achieve efficiency greater than 1, but
- For any source, there are codes with efficiency as close to 1 as desired.
- This is a most remarkable result, because the proof does not give a method to find the best codes. It just sets a limit on how good they can be.