Title: Arithmetic Coding for Data Compression
1. Arithmetic Coding for Data Compression
Combinatorics, Complexity and Algorithms (CoCoA)
Group at LUMS
2. On Information Codes
- We use codes to represent information for storage and transmission.
- We shall consider our messages (chunks of information) to consist of symbols from a finite alphabet, e.g. the English alphabet, numerals, or a combination of the two.
- The coding process typically assigns a unique codeword to each symbol, and the decoding process reverses the mapping.
- Sometimes we do not use symbol coding but instead convert the entire message into a number; the decoding process then reproduces the message from that number.
3. Terminology
- Consider an alphabet A = {a1, a2, ..., am}. We represent a code by C = {c1, c2, ..., cm}, where c1 is the codeword for a1, c2 for a2, and so on.
- For this talk, we take our codewords to be strings over the binary alphabet {0, 1}.
- The length of each codeword ci is given by L(ci).
- We assume that we know the probabilities of occurrence of the symbols of the alphabet, and we use pi to denote the probability of ai.
- Our measure of compression is the average number of bits per symbol, Σ pi · L(ci) (a small sketch of this computation follows below).
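To make the average-length measure concrete, here is a minimal Python sketch; the probabilities and codewords are the illustrative ones from the fixed-length example a few slides later, and the names probs and code are just placeholders.

    # Average bits per symbol: sum of p(symbol) * len(codeword).
    def average_bits_per_symbol(probs, code):
        return sum(probs[s] * len(code[s]) for s in probs)

    probs = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
    code = {"a": "00", "b": "01", "c": "10", "d": "11"}   # a fixed-length code
    print(average_bits_per_symbol(probs, code))           # -> 2.0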
4. Fixed and Variable Length Codes
- A fixed-length code employs codewords of the same length for all symbols of the alphabet.
- ASCII code: 8 bits represent each symbol (English-language characters, special characters, etc.).
- A variable-length code assigns codewords of varying lengths to different symbols.
- The famous Morse code uses different-length sequences of dots and dashes to transmit symbols.
5. A Possibility of Data Compression
- Intuitively, if you are buying more doughnuts than either pizzas or brownies, bargaining on the unit price of doughnuts saves you the most.
- Naturally, assigning shorter codes to more frequent symbols and longer codes to less probable ones has a strong possibility of achieving compression.
- If the alphabet is uniformly distributed, we can NOT achieve compression by employing the above scheme alone.
6. Example of Fixed-Length Coding
- Alphabet A = {a, b, c, d}; probabilities p(a) = 0.1, p(b) = 0.2, p(c) = 0.3, p(d) = 0.4.
- We employ the fixed-length code a → 00, b → 01, c → 10, d → 11.
- The message addbcd is then coded to 001111011011, taking 12 bits.
- Decoding is straightforward: break the coded message into chunks of 2 bits and look up which symbol each chunk represents (see the sketch below).
- The average number of bits per symbol is 2.
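A minimal Python sketch of this fixed-length encoder and decoder; it simply restates the code table from this slide.

    # Fixed-length coding: every symbol maps to a 2-bit codeword.
    code = {"a": "00", "b": "01", "c": "10", "d": "11"}
    symbol = {v: k for k, v in code.items()}

    def encode(message):
        return "".join(code[ch] for ch in message)

    def decode(bits):
        # Break the coded message into 2-bit chunks and look each one up.
        return "".join(symbol[bits[i:i + 2]] for i in range(0, len(bits), 2))

    coded = encode("addbcd")
    print(coded, len(coded))    # 001111011011 12
    print(decode(coded))        # addbcd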
7. Example of Variable-Length Coding
- Alphabet A = {a, b, c, d}; probabilities p(a) = 0.1, p(b) = 0.2, p(c) = 0.3, p(d) = 0.4.
- We employ the variable-length code a → 011, b → 010, c → 00, d → 1.
- The message addbcd is then coded to 01111010001, taking 11 bits.
- Decoding: read the message until you are sure you have recognized a symbol, then replace it and continue with the rest of the message (see the sketch below).
- The average number of bits per symbol is 0.1·3 + 0.2·3 + 0.3·2 + 0.4·1 = 1.9.
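A minimal Python sketch of this variable-length code; decoding accumulates bits until the prefix read so far matches a codeword, which is exactly the "read until you recognize a symbol" rule.

    # Variable-length (prefix) coding with the code from this slide.
    code = {"a": "011", "b": "010", "c": "00", "d": "1"}
    symbol = {v: k for k, v in code.items()}

    def encode(message):
        return "".join(code[ch] for ch in message)

    def decode(bits):
        out, prefix = [], ""
        for bit in bits:
            prefix += bit
            if prefix in symbol:       # a full codeword has been recognized
                out.append(symbol[prefix])
                prefix = ""
        return "".join(out)

    coded = encode("addbcd")
    print(coded, len(coded))           # 01111010001 11
    print(decode(coded))               # addbcd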
8. Uniquely Decodable Codes and Prefix Codes
- Uniquely decodable code: no two different messages generate the same coded sequence.
- Prefix code: no codeword for any symbol is a prefix of a codeword for another symbol.
- Any prefix code is also uniquely decodable, but the converse is not true.
- Prefix codes are desirable because they allow ready identification of the symbol at hand.
9. Huffman Coding
10. Huffman Doesn't Want to Take the Final Exam
- In 1951, David Huffman, in a term paper (written as a substitute for the final exam), discovered the optimum prefix code (the Minimum Redundancy Code).
- Huffman, in his 1952 paper, observes that for an optimum code:
- If pi ≥ pk then L(ci) ≤ L(ck).
- The codewords for the two least frequent symbols should differ only in their last digit.
11. Huffman Coding Algorithm
- Start with a forest of trees, where each tree is a single node corresponding to a symbol of the alphabet. The probability of a tree is defined as the probability of its root.
- Until you have a single tree, do the following:
- Choose the two trees whose roots have the minimum frequency (probability).
- Construct a new node and make the two selected trees its children. The probability of the root of the new tree is the sum of the probabilities of the two trees just merged.
- Your code tree is ready; assign 0 to each left edge and 1 to each right edge (a runnable sketch follows below).
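A minimal Python sketch of this construction, assuming the symbol probabilities are known up front; it keeps a heap of (probability, tie-breaker, tree) entries, where a tree is either a symbol or a (left, right) pair. The probabilities shown are those of the example on slide 13.

    import heapq
    from itertools import count

    def huffman_code(probs):
        tick = count()                       # tie-breaker so the heap never compares trees
        heap = [(p, next(tick), sym) for sym, p in probs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, t1 = heapq.heappop(heap)  # the two least probable trees
            p2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (p1 + p2, next(tick), (t1, t2)))
        _, _, tree = heap[0]

        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):      # internal node: 0 on the left, 1 on the right
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:                            # leaf: record the codeword for this symbol
                codes[node] = prefix or "0"
        walk(tree, "")
        return codes

    print(huffman_code({"a": 0.1, "b": 0.1, "c": 0.2, "d": 0.3, "e": 0.3}))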
12. Huffman Coding Algorithm
- All the symbols are at the leaves of the code tree, and hence the prefix property holds.
- The codeword for a symbol is formed by traversing from the root to the corresponding leaf while recording the 1s and 0s along the way.
- Decoding is done by reading the coded message and traversing the tree. Once you reach a leaf, you replace the bits read so far with that leaf's symbol. Repeat the procedure with the remaining message until you are done (see the sketch below).
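A minimal Python sketch of decoding by tree traversal. The code table is one possible Huffman code with the same codeword lengths as the example on the next slide; the table itself is illustrative.

    # Rebuild the code tree from a prefix-free code table, then decode by
    # walking the tree bit by bit and restarting at the root at each leaf.
    code = {"a": "000", "b": "001", "c": "01", "d": "10", "e": "11"}

    root = {}
    for sym, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})
        node[word[-1]] = sym             # leaves hold the symbol itself

    def decode(bits):
        out, node = [], root
        for bit in bits:
            node = node[bit]
            if isinstance(node, str):    # reached a leaf: emit the symbol, restart
                out.append(node)
                node = root
        return "".join(out)

    print(decode("0001001"))             # "000" + "10" + "01" -> "adc"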
13. Huffman Code Example
[Figure: Huffman code tree for the alphabet {a, b, c, d, e} with probabilities 0.1, 0.1, 0.2, 0.3, 0.3. Leaves a and b (0.1 each) merge into a node of probability 0.2, which joins c (0.2) to form a subtree of probability 0.4; d and e (0.3 each) merge into a subtree of probability 0.6; the two subtrees meet at the root (1.0). Left edges are labelled 0 and right edges 1.]
Average code length = Σ p(i) · L(i) = 0.1·3 + 0.1·3 + 0.2·2 + 0.3·2 + 0.3·2 = 2.2 bits per symbol.
14. How Good Can We Get at It?
15. How Much Can We Compress?
- Assuming all possible messages are valid, for each message an algorithm compresses, some other message must expand.
- We take advantage of the fact that some messages are more likely than others; thus, on average, we achieve compression.
- There is a connection between the probability of messages and the best average compression we can achieve.
16. 1948: Shannon's Limit on Data Compression
- Self-information is the number of bits required to specify an event so that, on average, the total number of bits is minimum.
- Shannon showed in 1948 that the self-information is given by S(i) = -log2(pi) = log2(1/pi).
- Thus, knowing only the probabilities of the symbols in our alphabet, we achieve the best average compression when each symbol is encoded using exactly as many bits as its self-information (see the sketch below).
- We can not do any better.
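A small Python check of this bound, using the probabilities from the earlier four-symbol example; the 1.9 bits/symbol of the variable-length code sits just above the entropy.

    from math import log2

    # Self-information of each symbol, and the entropy (Shannon's lower
    # bound on average bits per symbol) for the four-symbol example.
    probs = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}

    for sym, p in probs.items():
        print(sym, log2(1 / p))                    # self-information in bits

    entropy = sum(p * log2(1 / p) for p in probs.values())
    print(entropy)                                 # ~1.846, just below the 1.9 achieved earlier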
17. Why Huffman Coding Cannot Reach Shannon's Limit
- Huffman coding reaches Shannon's limit only when the probabilities of the symbols are negative powers of 2.
- For example, if two symbols have probabilities 1/16 and 1/2, the Huffman coding algorithm generates log2(16) = 4 and log2(2) = 1 bit codewords for them, respectively.
- Now consider a symbol with probability 0.9; the ideal codeword length is log2(1/0.9) ≈ 0.152 bits. But a Huffman code must assign it a codeword at least 1 bit long, and thus cannot approach Shannon's limit.
18. Arithmetic Coding
19. The Idea
- Arithmetic coding differs from Huffman and most other coding techniques in that it does not replace symbols with codewords.
- Instead, we produce a single fractional number corresponding to the message. Each possible message encodes to a unique number and can be recovered uniquely.
- The longer the message, the greater the precision required to represent the coded number.
20. Encoding Procedure
- We use an interval called the current range, initialized to [0, 1).
- While the message has not been completely read:
- Read the next symbol from the message as the current symbol.
- Divide the current range into m sub-intervals, each corresponding to one of the m possible symbols in the alphabet. The length of each sub-interval is proportional to the probability of the corresponding symbol.
- The sub-interval corresponding to the current symbol becomes the new current range.
- Any number within the final interval is the output, and it uniquely describes the message.
21. Example
[Figure: successive subdivisions of the interval while encoding a five-symbol message over the symbols a, c, o, narrowing down to a range around 0.37419.]
Output: 0.374 = 0.0101111111 (binary, 10 bits)
22. Encoding Algorithm
- Low = 0
- High = 1
- While there are symbols to encode:
- CurrentSymbol = next symbol from the message
- CurrentRange = High - Low
- High = Low + CurrentRange × HighEnd(CurrentSymbol)
- Low = Low + CurrentRange × LowEnd(CurrentSymbol)
- (High is updated before Low so that both updates use the old value of Low.)
- Output any number in [Low, High) (a runnable sketch follows below)
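A minimal Python sketch of this loop, using exact fractions to avoid floating-point issues; the three-symbol probability table and the message are illustrative, not the slide's exact figures.

    from fractions import Fraction

    # Cumulative [LowEnd, HighEnd) ranges for each symbol, in a fixed order.
    probs = {"a": Fraction(2, 10), "c": Fraction(4, 10), "o": Fraction(4, 10)}
    low_end, high_end, acc = {}, {}, Fraction(0)
    for sym, p in probs.items():
        low_end[sym], high_end[sym] = acc, acc + p
        acc += p

    def arithmetic_encode(message):
        low, high = Fraction(0), Fraction(1)
        for sym in message:
            rng = high - low
            high = low + rng * high_end[sym]   # update High first ...
            low = low + rng * low_end[sym]     # ... so both use the old Low
        return low, high                       # any number in [low, high) encodes the message

    low, high = arithmetic_encode("cocoa")
    print(float(low), float(high))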
23. Encoding Algorithm at Work
[Figure: step-by-step trace of the encoding algorithm, showing Low and High after each symbol of the example message.]
Output: 0.374125 = 0.010111111100011010101 (binary, 21 bits)
24. Decoding Algorithm
- We apply the inverse operations while decoding:
- Initialize num to the encoded number
- While decoding is not complete:
- Find the symbol within whose range num lies, and output that symbol
- range = HighEnd(symbol) - LowEnd(symbol)
- num = num - LowEnd(symbol)
- num = num / range
- (A runnable sketch follows below.)
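A minimal Python sketch of the decoding loop, mirroring the encoder sketch above (same illustrative probability table); here the decoder is told the message length explicitly, which corresponds to the length-in-header option on the next slide.

    from fractions import Fraction

    probs = {"a": Fraction(2, 10), "c": Fraction(4, 10), "o": Fraction(4, 10)}
    low_end, high_end, acc = {}, {}, Fraction(0)
    for sym, p in probs.items():
        low_end[sym], high_end[sym] = acc, acc + p
        acc += p

    def arithmetic_decode(num, length):
        out = []
        for _ in range(length):                  # stop after `length` symbols
            for sym in probs:                    # find the symbol whose range holds num
                if low_end[sym] <= num < high_end[sym]:
                    out.append(sym)
                    rng = high_end[sym] - low_end[sym]
                    num = (num - low_end[sym]) / rng
                    break
        return "".join(out)

    print(arithmetic_decode(Fraction(1, 4), 2))  # -> "ca" under this table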
25. Decoding at Work
You need to tell the decoder explicitly when to stop. This can be done either by sending an EOF symbol or by sending the file length in the header.
26. Analysis
- Compression is achieved because:
- A high-probability symbol narrows the interval only slightly, whereas a low-probability symbol narrows it much more sharply, requiring a larger number of digits.
- A larger interval needs fewer bits to be specified.
- The number of bits required is -log2(size of the final interval).
- The size of the final interval is the product of the probabilities of the symbols encoded.
- Thus each symbol i with probability pi contributes -log2(pi) bits to the output, which is exactly the self-information of the symbol (see the check below).
- Arithmetic coding thus achieves theoretically optimal compression.
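A quick numerical check of this claim, with an illustrative message and probability table: minus the log of the final interval's width equals the sum of the symbols' self-information.

    from math import log2

    probs = {"a": 0.2, "c": 0.4, "o": 0.4}
    message = "cocoa"

    width = 1.0
    for sym in message:                     # final interval width = product of probabilities
        width *= probs[sym]

    print(-log2(width))                                  # bits needed for the whole message
    print(sum(-log2(probs[sym]) for sym in message))     # same value (up to rounding), summed per symbol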
27. Some Comments
- The main advantage of arithmetic coding is its optimality. Another advantage is that it can accommodate adaptive coding without much increase in complexity.
- The main disadvantage is its slow speed: encoding and decoding require a large number of multiplications and divisions, respectively.
- Today's computers have limited precision, so in practice we cannot represent entire files as fractions. The problem is resolved by scaling the interval so that integer operations can be used instead.
28. Questions!