Title: Huffman Codes
1. Huffman Codes
2. Introduction
- Huffman codes are a very effective technique for compressing data: savings of 20% to 90% are typical, depending on the characteristics of the file being compressed. Huffman's greedy algorithm uses a table of the frequencies of occurrence of each character in the file to build up an optimal way of representing each character as a binary string.
- Suppose we have a 100,000-character data file that we wish to store compactly. Further suppose the characters in the file occur with the following frequencies, shown here together with the two codes discussed below:

                               a     b     c     d     e     f
    Frequency (in thousands)   45    13    12    16    9     5
    Fixed-length codeword      000   001   010   011   100   101
    Variable-length codeword   0     101   100   111   1101  1100
3. Introduction
- That is, there are only 6 different characters in the file and, for example, the character a appears 45,000 times.
- There are many ways to represent such a file of information. We consider the problem of designing a binary character code (or code for short), wherein each character is represented by a unique binary string. If we use a fixed-length code, we need 3 bits per character to represent six characters, and thus 300,000 bits for the entire file.
4. Introduction
- A variable-length code can do considerably better, by giving frequent characters short codewords and infrequent characters long codewords.
- In our example, if we use the given variable-length code, we only need 224,000 bits:
    (1·45 + 3·13 + 3·12 + 3·16 + 4·9 + 4·5) · 1,000 = 224,000.
- We saved approximately 25%. In fact, this is an optimal character code for this file; the sketch below checks both totals.
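- A quick check of both totals (a minimal Python sketch; the frequencies and the codewords for a, b, c, e are stated on these slides, while the codewords d = 111 and f = 1100 are assumed completions of the variable-length code):

    # Frequencies in thousands and the two codes of the running example.
    freq  = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}
    fixed = {'a': '000', 'b': '001', 'c': '010', 'd': '011', 'e': '100', 'f': '101'}
    var   = {'a': '0', 'b': '101', 'c': '100', 'd': '111', 'e': '1101', 'f': '1100'}

    def cost(code):
        # Thousands of bits: frequency times codeword length, summed over characters.
        return sum(freq[ch] * len(code[ch]) for ch in freq)

    print(cost(fixed), cost(var))  # 300 224, i.e. 300,000 vs. 224,000 bits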
5. Prefix codes
- We consider here only codes in which no codeword is also a prefix of some other codeword. Such codes are called prefix(-free) codes. It is possible to show that the optimal data compression achievable by a character code can always be achieved with a prefix code, so there is no loss of generality in restricting attention to prefix codes.
6. Prefix codes
- Prefix codes are desirable because they simplify encoding (compression) and decoding. Encoding is easy for any binary character code: we just concatenate the codewords representing each character of the file. In the example, we code abc as 0101100 (if we use the variable-length prefix code); a one-line sketch follows.
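- Continuing the sketch above (var as defined there), encoding is lookup and concatenation:

    def encode(text, code):
        # Concatenate the codewords of the characters, in order.
        return ''.join(code[ch] for ch in text)

    print(encode('abc', var))  # '0101100'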
7. Prefix codes
- Decoding is also quite simple with a prefix code. Since no codeword is a prefix of any other, the codeword that begins an encoded file is unambiguous. We can simply identify the initial codeword, translate it back to the original character, remove it from the encoded file, and repeat the decoding process on the remainder of the encoded file. In our example, the string 001011101 parses uniquely as 0 · 0 · 101 · 1101, which decodes to aabe. The sketch below implements this loop.
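- A minimal sketch of that loop (again using the var table above; the inverse mapping recognizes the initial codeword, which prefix-freeness guarantees is unique):

    def decode(bits, code):
        inverse = {w: ch for ch, w in code.items()}
        out, buf = [], ''
        for b in bits:
            buf += b
            if buf in inverse:      # complete codeword: unambiguous by prefix-freeness
                out.append(inverse[buf])
                buf = ''
        return ''.join(out)

    print(decode('001011101', var))  # 'aabe'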
8. Prefix codes
- The decoding process needs a convenient representation for the prefix code so that the initial codeword can be easily picked off. A binary tree whose leaves are the given characters provides one such representation. We interpret the binary codeword for a character as the path from the root to that character, where 0 means go to the left child and 1 means go to the right child; the sketch below restates the decoder on this representation.
- The figure on the next slide shows the trees for the two codes of our example.
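- A sketch of the tree-based decoder (build_tree derives a nested-dict tree from the var table above, with leaves holding characters):

    def build_tree(code):
        root = {}
        for ch, word in code.items():
            node = root
            for bit in word[:-1]:
                node = node.setdefault(bit, {})
            node[word[-1]] = ch            # the last edge ends at the character's leaf
        return root

    def decode_with_tree(bits, root):
        out, node = [], root
        for b in bits:
            node = node[b]                 # 0 = go left, 1 = go right
            if isinstance(node, str):      # reached a leaf: emit, restart at the root
                out.append(node)
                node = root
        return ''.join(out)

    print(decode_with_tree('001011101', build_tree(var)))  # 'aabe'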
9. Prefix codes
[Figure: the binary trees for the fixed-length code and for the variable-length prefix code of the example.]
10. Prefix codes
- An optimal code for a file is always represented by a full binary tree, in which every nonleaf node has two children (why?). The fixed-length code in our example is not optimal, since its tree is not a full binary tree: there are codewords beginning 10, but none beginning 11. Since we can now restrict our attention to full binary trees, we can say that if C is the alphabet from which the characters are drawn, then the tree for an optimal prefix code has exactly |C| leaves, one for each letter of the alphabet, and exactly |C| - 1 internal nodes.
11. Prefix codes
- Given a tree T corresponding to a prefix code, it is a simple matter to compute the number of bits required to encode a file. For each character c in the alphabet C, let f(c) denote the frequency of c in the file and let d_T(c) denote the depth of c's leaf in the tree. Note that d_T(c) is also the length of the codeword for the character c. The number of bits required to encode a file is thus
    B(T) = Σ_{c∈C} f(c)·d_T(c),
- which is defined as the cost of the tree T; the sketch below reads it off the leaf depths.
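- Continuing the sketches above (freq, var, and build_tree as defined there), B(T) computed literally from leaf depths:

    def leaf_depths(node, d=0):
        # Yield (character, depth of its leaf) for every leaf below node.
        if isinstance(node, str):
            yield node, d
        else:
            for child in node.values():
                yield from leaf_depths(child, d + 1)

    print(sum(freq[ch] * d for ch, d in leaf_depths(build_tree(var))))  # 224 (thousand bits)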
12. Constructing a Huffman code
- Huffman invented a greedy algorithm that constructs an optimal prefix code, called a Huffman code.
- In the pseudocode that follows, C is a set of n characters and each c ∈ C has a defined frequency f(c). The algorithm builds the tree T corresponding to the optimal code in a bottom-up manner. It begins with a set of |C| leaves and performs a sequence of |C| - 1 merging operations to create the final tree. A priority queue Q, keyed on f, is used to identify the two least-frequent objects to merge together. The result of the merger of two objects is a new object whose frequency is the sum of the frequencies of the two objects that were merged.
13. Constructing a Huffman code
HUFFMAN(C)
    n ← |C|
    Q ← C
    for i ← 1 to n - 1
        do allocate a new node z
           left[z] ← x ← EXTRACT-MIN(Q)
           right[z] ← y ← EXTRACT-MIN(Q)
           f[z] ← f[x] + f[y]
           INSERT(Q, z)
    return EXTRACT-MIN(Q)
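- A runnable version of the pseudocode (a minimal sketch: Python's heapq plays the priority queue Q, a running counter breaks frequency ties so the heap never compares node objects, and an internal node is a (left, right) pair):

    import heapq
    from itertools import count

    def huffman(freq):
        """Return the root of a Huffman tree for a {character: frequency} dict."""
        tick = count()                          # tie-breaker for equal frequencies
        heap = [(f, next(tick), ch) for ch, f in freq.items()]
        heapq.heapify(heap)
        for _ in range(len(freq) - 1):          # |C| - 1 merging operations
            fx, _, x = heapq.heappop(heap)      # the two least-frequent objects
            fy, _, y = heapq.heappop(heap)
            heapq.heappush(heap, (fx + fy, next(tick), (x, y)))  # f[z] = f[x] + f[y]
        return heap[0][2]                       # the single remaining object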
14. Constructing a Huffman code
- For our example, the figures on the next slide show how the algorithm works.
- There are 6 letters, and so the size of the initial queue is n = 6, and 5 merge steps are performed. The final tree represents the optimal prefix code. The codeword for a letter is the sequence of edge labels on the path from the root to the letter, as the sketch below reads off.
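- Reading the codewords off the finished tree (a sketch matching the node shape of the huffman() sketch above: leaves are characters, internal nodes are (left, right) pairs):

    def codewords(node, prefix=''):
        # Label left edges 0 and right edges 1 along every root-to-leaf path.
        if isinstance(node, str):
            return {node: prefix or '0'}   # 'or' covers a one-character alphabet
        left, right = node
        return {**codewords(left, prefix + '0'),
                **codewords(right, prefix + '1')}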
15. Constructing a Huffman code
[Figure: the initial queue and the five merge steps of the algorithm on the example, ending in the final tree.]
16. Constructing a Huffman code
- The analysis of the algorithm is quite simple: we first build the queue, and then perform n - 1 merge steps, in each of which we pick the two least frequent characters and merge them into a new one that finds its proper place in the queue.
- If we implement the queue as a (binary min-)heap, the running time is easily found to be O(n log n); the sketch below runs the heap-based implementation on the example.
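- Putting the sketches together on the example (freq, huffman, and codewords as defined above; with tied frequencies the exact codewords may differ from the figure, but the cost does not):

    code = codewords(huffman(freq))
    print(sorted(code.items()))
    print(sum(freq[ch] * len(code[ch]) for ch in freq))  # 224, i.e. 224,000 bits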
17. Correctness of Huffman's algorithm
- We present several lemmas that will lead to the desired conclusion.
- Lemma 16.2 Let C be an alphabet in which each character c ∈ C has frequency f(c). Let x and y be two characters in C having the lowest frequencies. Then there exists an optimal prefix code for C in which the codewords for x and y have the same length and differ only in the last bit.
18. Correctness of Huffman's algorithm
- Proof
- The idea is to take the tree T representing an arbitrary optimal prefix code and modify it to make a tree representing another optimal prefix code in which the characters x and y appear as sibling leaves of maximum depth. If we succeed, then their codewords will have the same length and differ only in the last bit.
19. Correctness of Huffman's algorithm
- Let a and b be two characters that are sibling leaves of maximum depth in T. Without loss of generality, we assume that f(a) ≤ f(b) and f(x) ≤ f(y). Since f(x) and f(y) are the two lowest leaf frequencies, in order, and f(a) and f(b) are two arbitrary frequencies, in order, we have f(x) ≤ f(a) and f(y) ≤ f(b). We now exchange the positions in T of a and x to get a tree T', and then exchange the positions of b and y to produce a tree T''.
- We should now calculate the difference in cost between T and T'.
20. Correctness of Huffman's algorithm
[Figure: the trees T, T', and T'' of the proof; first a and x, then b and y, exchange positions.]
21. Correctness of Huffman's algorithm
- We start with
    B(T) - B(T') = Σ_{c∈C} f(c)·d_T(c) - Σ_{c∈C} f(c)·d_{T'}(c)
                 = f(x)·d_T(x) + f(a)·d_T(a) - f(x)·d_{T'}(x) - f(a)·d_{T'}(a)
                 = f(x)·d_T(x) + f(a)·d_T(a) - f(x)·d_T(a) - f(a)·d_T(x)
                 = (f(a) - f(x))·(d_T(a) - d_T(x))
                 ≥ 0,
- because both f(a) - f(x) and d_T(a) - d_T(x) are nonnegative (why?).
- Similarly, moving from T' to T'' does not increase the cost, so B(T') - B(T'') ≥ 0. Therefore B(T'') ≤ B(T); but since T was optimal, B(T) ≤ B(T''), which implies B(T'') = B(T).
- Thus, T'' is an optimal tree in which x and y appear as sibling leaves of maximum depth, and the lemma follows. ∎
22. Correctness of Huffman's algorithm
- The lemma implies that the process of building up an optimal tree by mergers can, without loss of generality, begin with the greedy choice of merging together the two characters of lowest frequency.
- The next lemma shows that the problem of constructing optimal prefix codes has (what we call) the optimal-substructure property.
23. Correctness of Huffman's algorithm
- Lemma 16.3 Let T be a full binary tree representing an optimal prefix code over an alphabet C, where frequency f(c) is defined for each character c ∈ C. Consider any two characters x and y that appear as sibling leaves in T, and let z be their parent. Then, considering z as a character with frequency f(z) = f(x) + f(y), the tree
    T' = T - {x, y}
- represents an optimal prefix code for the alphabet
    C' = (C - {x, y}) ∪ {z}.
24. Correctness of Huffman's algorithm
- Proof We first show that the cost B(T) of the tree T can be expressed in terms of the cost B(T') of the tree T' by considering the summands in the definition of B(·). For each c ∈ C - {x, y}, we have d_T(c) = d_{T'}(c), resulting in
    f(c)·d_T(c) = f(c)·d_{T'}(c).
- Since d_T(x) = d_T(y) = d_{T'}(z) + 1, we have
    f(x)·d_T(x) + f(y)·d_T(y) = (f(x) + f(y))·(d_{T'}(z) + 1) = f(z)·d_{T'}(z) + (f(x) + f(y)),
- leading to
    B(T) = B(T') + f(x) + f(y).
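- A concrete instance (the first merge of the running example, frequencies in thousands; here x = f and y = e, so f(z) = 5 + 9 = 14):
    B(T) = B(T') + 14,
  i.e., hanging the two cheapest leaves below z adds exactly 14,000 bits to the cost.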
25. Correctness of Huffman's algorithm
- If T' represents a non-optimal prefix code for the alphabet C', then there exists a tree T'' whose leaves are characters in C' such that B(T'') < B(T'). Since z is treated as a character in C', it appears as a leaf in T''. If we add x and y as the children of z in T'', we obtain a prefix code for C with cost
    B(T'') + f(x) + f(y) < B(T') + f(x) + f(y) = B(T),
- contradicting the optimality of T. Thus, T' must be optimal for the alphabet C'. ∎
26. Correctness of Huffman's algorithm
- Theorem Procedure HUFFMAN produces an optimal prefix code.
- Proof Immediate from the two lemmas. ∎
- Last updated 2/08/2010