Title: Huffman codes
1 Huffman codes
- Binary character code: each character is represented by a unique binary string.
- A data file can be coded in two ways:

                        a    b    c    d    e     f
frequency               45   13   12   16   9     5
fixed-length code       000  001  010  011  100   101
variable-length code    0    101  100  111  1101  1100

The first way needs 100 × 3 = 300 bits.
The second way needs 45×1 + 13×3 + 12×3 + 16×3 + 9×4 + 5×4 = 224 bits.
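As a quick check of the arithmetic, the following Python sketch recomputes both totals from the table. The names FREQ, FIXED, VARIABLE and total_bits are illustrative choices, not part of the slides.

```python
# Frequencies and codewords copied from the table on this slide.
FREQ = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
FIXED = {"a": "000", "b": "001", "c": "010", "d": "011", "e": "100", "f": "101"}
VARIABLE = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}


def total_bits(freq, code):
    """Sum, over all characters, frequency times codeword length."""
    return sum(freq[ch] * len(code[ch]) for ch in freq)


fixed_bits = total_bits(FREQ, FIXED)
variable_bits = total_bits(FREQ, VARIABLE)
print(fixed_bits, variable_bits)
```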
2 Variable-length code
- Need some care to read the code.
- Example: 001011101 (codewords: a = 0, b = 00, c = 01, d = 11).
- Where to cut? 00 can be read as either aa or b.
- The prefixes of 0011 are 0, 00, 001, and 0011.
- Prefix codes: no codeword is a prefix of some other codeword (prefix-free).
- Prefix codes are simple to encode and decode.
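The prefix-free condition can be tested mechanically. The sketch below is one straightforward way to do it, comparing every pair of codewords; the function name is_prefix_free is a hypothetical helper introduced here for illustration.

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of a different codeword."""
    words = list(codewords)
    for i, short in enumerate(words):
        for j, other in enumerate(words):
            # Compare distinct positions only; a word trivially starts with itself.
            if i != j and other.startswith(short):
                return False
    return True


ambiguous = ["0", "00", "01", "11"]                   # a, b, c, d from this slide
prefix_code = ["0", "101", "100", "111", "1101", "1100"]  # the variable-length code
print(is_prefix_free(ambiguous), is_prefix_free(prefix_code))
```

With the ambiguous code, 0 is a prefix of 00, which is exactly why 00 can be cut as aa or as b.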
3 Using the codewords in the table to encode and decode
- Encode abc: 0·101·100 = 0101100 (just concatenate the codewords).
- Decode 001011101: 0·0·101·1101 = aabe.
4
- Encode abc: 0·101·100 = 0101100 (just concatenate the codewords).
- Decode 001011101: 0·0·101·1101 = aabe (use the right-hand binary tree below).
[Figure: tree for the fixed-length code (left); tree for the variable-length code (right)]
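Encoding and tree-based decoding can be sketched in Python as follows. This is a minimal illustration, assuming a non-empty prefix-free code over single-character symbols; the nested-dict tree representation and the names encode, build_tree and decode are choices made here, not taken from the slides.

```python
CODE = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}


def encode(text, code):
    """Concatenate the codeword of each character."""
    return "".join(code[ch] for ch in text)


def build_tree(code):
    """Build the code tree as nested dicts keyed by '0'/'1'; a leaf is a character."""
    root = {}
    for ch, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})
        node[word[-1]] = ch
    return root


def decode(bits, code):
    """Walk down from the root one bit at a time; emit a character at each leaf."""
    root = build_tree(code)
    out = []
    node = root
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):  # reached a leaf, so one codeword is complete
            out.append(node)
            node = root
    if node is not root:
        raise ValueError("bit string ends in the middle of a codeword")
    return "".join(out)


print(encode("abc", CODE))        # 0101100
print(decode("001011101", CODE))  # aabe
```

Because no codeword is a prefix of another, the walk never has to guess where a codeword ends: reaching a leaf is the only possible cut.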
5 Binary tree
- In the tree of an optimal code, every nonleaf node has two children.
- The fixed-length code in our example is not optimal (its tree has a nonleaf node with only one child).
- The total number of bits required to encode a file is
  $B(T) = \sum_{c \in C} f(c)\, d_T(c)$
  - $f(c)$: the frequency (number of occurrences) of c in the file
  - $d_T(c)$: the depth of c's leaf in the tree
6 Constructing an optimal code
- Formal definition of the problem:
  - Input: a set of characters $C = \{c_1, c_2, \ldots, c_n\}$; each $c \in C$ has frequency $f[c]$.
  - Output: a binary tree representing codewords so that the total number of bits required for the file is minimized.
- Huffman proposed a greedy algorithm to solve the problem.
7 [Figure, panels (a) and (b): leaves a:45, b:13, c:12, d:16, e:9, f:5]
8 [Figure, panels (c) and (d)]
9 [Figure, panels (e) and (f)]
10 HUFFMAN(C)
1  n ← |C|
2  Q ← C
3  for i ← 1 to n-1
4      do z ← ALLOCATE_NODE()
5         x ← left[z] ← EXTRACT_MIN(Q)
6         y ← right[z] ← EXTRACT_MIN(Q)
7         f[z] ← f[x] + f[y]
8         INSERT(Q, z)
9  return EXTRACT_MIN(Q)
11 The Huffman Algorithm
- This algorithm builds the tree T corresponding to the optimal code in a bottom-up manner.
- C is a set of n characters, and each character c in C has a defined frequency f[c].
- Q is a priority queue, keyed on f, used to identify the two least-frequent objects to merge together.
- The result of the merger is a new object (internal node) whose frequency is the sum of the frequencies of the two merged objects.
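The procedure described above can be sketched in Python with the standard heapq module standing in for the priority queue Q. This is an illustrative translation, not the slides' own code: it assumes a non-empty frequency table, represents an internal node as a (left, right) tuple, and adds a counter so that heap entries with equal frequencies stay comparable. The names huffman and codewords are introduced here.

```python
import heapq
from itertools import count

FREQ = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}


def huffman(freq):
    """Return a code tree; a leaf is a character, an internal node is (left, right)."""
    tiebreak = count()  # unique sequence number, so nodes themselves are never compared
    heap = [(f, next(tiebreak), ch) for ch, f in freq.items()]
    heapq.heapify(heap)
    for _ in range(len(freq) - 1):
        fx, _tx, x = heapq.heappop(heap)  # least frequent object becomes the left child
        fy, _ty, y = heapq.heappop(heap)  # next least frequent becomes the right child
        heapq.heappush(heap, (fx + fy, next(tiebreak), (x, y)))
    return heap[0][2]


def codewords(tree, prefix=""):
    """Read codewords off the tree: 0 for a left edge, 1 for a right edge."""
    if not isinstance(tree, tuple):
        return {tree: prefix or "0"}  # a one-character alphabet still gets one bit
    left, right = tree
    table = codewords(left, prefix + "0")
    table.update(codewords(right, prefix + "1"))
    return table


table = codewords(huffman(FREQ))
print(table)
```

On the six example frequencies this sketch yields the same variable-length codewords as the table (a = 0, b = 101, c = 100, d = 111, e = 1101, f = 1100). With tied frequencies, a different but equally short code may result.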
12 Time complexity
- Lines 4-8 are executed n-1 times.
- Each heap operation in Lines 4-8 takes O(lg n) time.
- Total time required is O(n lg n).
- Note: The details of heap operations will not be tested. The time complexity O(n lg n) should be remembered.
13 Another example
[Figure: leaves e:4, a:6, c:6, b:9, d:11]
14 [Figure: node d:11]
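Since the tree for this example is not reproduced in the transcript, the sketch below computes only the total encoded length for the frequencies e:4, a:6, c:6, b:9, d:11. It relies on the fact that each merge pushes every character beneath it one level deeper, so the sum of all merged frequencies equals the total number of bits. The name huffman_cost is a hypothetical helper.

```python
import heapq


def huffman_cost(frequencies):
    """Total encoded length of an optimal prefix code for the given frequencies."""
    heap = list(frequencies)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged  # every character under this merge gains one bit
        heapq.heappush(heap, merged)
    return total


example = {"e": 4, "a": 6, "c": 6, "b": 9, "d": 11}
print(huffman_cost(example.values()))  # 82
```

The merges are 4+6 = 10, 6+9 = 15, 10+11 = 21 and 15+21 = 36, which add up to 82 bits. The tie between a and c changes which of them sits deeper in the tree but not the total.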
16 Correctness of Huffman's Greedy Algorithm (Fun Part, not required)
- Again, we use our general strategy.
- Let x and y be the two characters in C having the lowest frequencies (the first two characters selected by the greedy algorithm).
- We will show two properties:
  1. There exists an optimal solution $T_{opt}$ (a binary tree representing codewords) such that x and y are siblings in $T_{opt}$.
  2. Let z be a new character with frequency $f[z] = f[x] + f[y]$ and let $C' = (C - \{x, y\}) \cup \{z\}$. Let $T'$ be an optimal tree for $C'$. Then we can get $T_{opt}$ from $T'$ by replacing the leaf z with an internal node whose children are x and y.
[Figure: three-node subtree, an internal node in z's position with children x and y]
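17 Proof of Property 1
[Figure: trees $T_{opt}$ and $T_{new}$]
- Look at the deepest pair of sibling leaves in $T_{opt}$, say, b and c.
- Exchange x with b and y with c to obtain $T_{new}$.
- $B(T_{opt}) - B(T_{new}) \ge 0$, since $f[x]$ and $f[y]$ are the smallest frequencies and b and c are at maximum depth.
- Hence $T_{new}$ is also optimal, and x and y are siblings in it. Property 1 is proved.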
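The inequality can be checked term by term for a single exchange. In the derivation below, $T_1$ is a name introduced here for the tree obtained from $T_{opt}$ by exchanging only x and b, and $d(\cdot)$ denotes depth in $T_{opt}$; only the terms for x and b change.

```latex
\begin{aligned}
B(T_{opt}) - B(T_1)
  &= f[x]\,d(x) + f[b]\,d(b) - f[x]\,d(b) - f[b]\,d(x) \\
  &= \bigl(f[b] - f[x]\bigr)\bigl(d(b) - d(x)\bigr) \;\ge\; 0
\end{aligned}
```

Both factors are nonnegative: $f[b] \ge f[x]$ because x has a lowest frequency, and $d(b) \ge d(x)$ because b is a leaf of maximum depth. The same argument applies to the exchange of y with c.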
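18
- Let z be a new character with frequency $f[z] = f[x] + f[y]$ and let $C' = (C - \{x, y\}) \cup \{z\}$. Let $T'$ be an optimal tree for $C'$. Then we can get $T_{opt}$ from $T'$ by replacing the leaf z with an internal node whose children are x and y.
- Proof: Let $T$ be the tree obtained from $T'$ by replacing z with the three nodes.
- $B(T) = B(T') + f[x] + f[y]$. (1)
  (The codewords for x and y are 1 bit longer than that of z.)
- Now prove that $T$ is optimal ($T = T_{opt}$) by contradiction.
- If $T \ne T_{opt}$, then $B(T) > B(T_{opt})$. (2)
- From Property 1, x and y are siblings in $T_{opt}$.
- Thus, we can delete x and y from $T_{opt}$ and get another tree $T''$ for $C'$.
- $B(T'') = B(T_{opt}) - f[x] - f[y] < B(T) - f[x] - f[y] = B(T')$, using (2) for the inequality and (1) for the final equality.
- Thus, $B(T'') < B(T')$. This contradicts the assumption that $T'$ is optimum for $C'$.
[Figure: three-node subtree, an internal node in z's position with children x and y]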
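Equation (1) follows directly from the definition of the cost $B$ as the sum of frequency times leaf depth. The short derivation below writes $T'$ for the optimal tree for the reduced alphabet and $T$ for the tree obtained by expanding z; all other leaves keep their depths, so only the terms for x, y and z differ.

```latex
\begin{aligned}
d_T(x) = d_T(y) &= d_{T'}(z) + 1 \\
f[x]\,d_T(x) + f[y]\,d_T(y)
  &= \bigl(f[x] + f[y]\bigr)\bigl(d_{T'}(z) + 1\bigr)
   = f[z]\,d_{T'}(z) + f[x] + f[y] \\
B(T) &= B(T') + f[x] + f[y]
\end{aligned}
```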