Title: Representation of Strings
1. Huffman Encodings (Section 9.4)
2. Data Compression: Array Representation
- S denotes an alphabet used for all strings
- Each element in S is called a character
- Typical representation: contiguous memory (an array)
- The bit sequence representing characters is called the encoding
- Number of bit sequences of length n? 2^n
- Number of bits to represent S? ⌈log₂ |S|⌉
Data compression problem: given a string w over S, store it using as few bits as possible, in such a way that it can be recovered at will
3. Motivation for the Solution
For representing strings we want to take advantage of the fact that not all characters occur with the same frequency
Example: FitalyStamp ("Do you have what it takes to type 50 words per minute in your palm organizer?")
If only a subset S' of S is actually used in w, we could represent each character in ⌈log₂ |S'|⌉ bits
- Problems
  - We need to know S' in advance
  - It doesn't account for the frequency ranking of occurrences
  - Improvement only if ⌈log₂ |S'|⌉ < ⌈log₂ |S|⌉
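The savings from restricting to the used subset can be checked with a short sketch; the alphabet S and the string w below are illustrative assumptions, not taken from the slides:

```python
import math

def fixed_length_bits(alphabet):
    """Bits per character for a fixed-length (array) encoding: ceil(log2 |S|)."""
    return math.ceil(math.log2(len(alphabet)))

# Hypothetical alphabet S (26 uppercase letters) vs. the subset S' used in w
S = [chr(c) for c in range(65, 91)]
w = "AIDAFAN"
S_used = sorted(set(w))  # S' = {'A', 'D', 'F', 'I', 'N'}

print(fixed_length_bits(S))       # 5 bits per character, since |S| = 26
print(fixed_length_bits(S_used))  # 3 bits per character, since |S'| = 5
```

The improvement condition from the slide holds here: ⌈log₂ 5⌉ = 3 < 5 = ⌈log₂ 26⌉.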
4. Encoding Trees
Idea: use different lengths to encode members of S
Potential problem: if E = 101, T = 110, and Q = 101110, then the bit string 101110 decodes as either "ET" or "Q"
Solution: no encoding of a character can be the prefix of the encoding of another character (a prefix-free code)
Suppose that I = 0000, V = 0001, M = 0010, U = 0011, D = 010, H = 0110, N = 0111, A = 10, ␣ (space) = 110, F = 111
Question: how do we represent these codes in a binary tree?
5. Encoding Trees
[Figure: binary encoding tree for the code table above; left branches are 0, right branches are 1, with leaves A, D, F, H, N, ␣, V, M, U, I at depths equal to their code lengths]
Encoding trees can always be assumed to be full!
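The prefix condition from slide 4 can be tested mechanically: after sorting the codewords, any codeword that is a prefix of another ends up immediately before one it prefixes. A minimal sketch, using the slide's code table with ␣ taken to be the space character:

```python
def is_prefix_free(codes):
    """True iff no codeword is a prefix of another, i.e. every
    codeword maps to a leaf of the encoding tree."""
    words = sorted(codes.values())
    # After sorting, a prefix violation always occurs between neighbors
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

codes = {"I": "0000", "V": "0001", "M": "0010", "U": "0011",
         "D": "010", "H": "0110", "N": "0111",
         "A": "10", " ": "110", "F": "111"}
print(is_prefix_free(codes))                                    # True
print(is_prefix_free({"E": "101", "T": "110", "Q": "101110"}))  # False: E prefixes Q
```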
6. Decoding with Encoding Trees
AIDA FAN = 10000001010110111100111
Procedure TreeDecode(pointer T, bitstream b)
    P ← T
    while not Empty(b) do
        if NextBit(b) = 0 then
            P ← LC(P)
        else
            P ← RC(P)
        if isLeaf(P) then
            print(value(P))
            P ← T
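The TreeDecode procedure can be sketched in Python; a trie of nested dicts plays the role of the pointer tree, and the code table is the one from slide 4 (with ␣ taken as the space character):

```python
def build_tree(codes):
    """Insert each codeword into a binary trie; characters sit at leaves."""
    root = {}
    for ch, word in codes.items():
        node = root
        for bit in word:
            node = node.setdefault(bit, {})
        node["value"] = ch
    return root

def tree_decode(root, bits):
    """Walk the trie bit by bit; on reaching a leaf, emit its
    character and restart at the root (P <- T in the slides)."""
    out, node = [], root
    for bit in bits:
        node = node[bit]
        if "value" in node:  # leaf reached
            out.append(node["value"])
            node = root
    return "".join(out)

codes = {"I": "0000", "V": "0001", "M": "0010", "U": "0011",
         "D": "010", "H": "0110", "N": "0111",
         "A": "10", " ": "110", "F": "111"}
print(tree_decode(build_tree(codes), "10000001010110111100111"))  # AIDA FAN
```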
How to generate encoding trees?
7. Constructing Encoding Trees
Example: f(A) = 0.35, f(B) = 0.1, f(C) = 0.2, f(D) = 0.2, f(E) = 0.15
Many possible trees (a combinatorial number); we want the one that has minimum cost
Notation: L(T) is the set of all leaves in T; c(n) is the cost or weight of node n
The cost to minimize is the weighted path length WPL(T) = Σ_{n ∈ L(T)} c(n) · depth(n)
Idea 0: use exhaustive search to find the tree with minimum cost
8. Idea 1: Huffman Encoding Tree
For each character c we know the frequency f_c with which c occurs in w
- Construction method
  - Create one node for each character c in S with weight f_c (each of these nodes will be a leaf in the tree)
  - Repeat the following steps:
    - Pick two parentless nodes n1 and n2 with the smallest weights
    - Create a new parent node for n1 and n2 with weight weight(n1) + weight(n2)
  - Eventually only two parentless nodes remain; their parent (the root) is created and the loop ends
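The construction above maps naturally onto a min-heap. A sketch in Python, run on the frequencies from slide 7's example (the tuple-based tree representation is an implementation choice, not part of the slides):

```python
import heapq
from itertools import count

def huffman_codes(freq):
    """Huffman's greedy construction: repeatedly merge the two
    lowest-weight parentless nodes under a new parent."""
    tie = count()  # tiebreaker so heapq never tries to compare trees
    heap = [(f, next(tie), ch) for ch, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, n1 = heapq.heappop(heap)  # two smallest weights
        f2, _, n2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (n1, n2)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf = character
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

freq = {"A": 0.35, "B": 0.1, "C": 0.2, "D": 0.2, "E": 0.15}
codes = huffman_codes(freq)
print({ch: len(code) for ch, code in sorted(codes.items())})
# {'A': 2, 'B': 3, 'C': 2, 'D': 2, 'E': 3}
```

As expected, the most frequent character A gets a short code and the rarest characters B and E get the longest ones.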
9. Properties of Huffman Encoding Trees
Characters with higher frequency are placed nearer the root; thus they have shorter encodings!
Theorem. Let N be a set of nodes and c(n) the weight of each node n in N. Let T be a Huffman encoding tree for N. If X is any other encoding tree for N, then WPL(T) ≤ WPL(X)
Is the Huffman method for generating encoding trees greedy?
Yes! At each step it makes the locally optimal choice (merge the two smallest weights) and never reconsiders earlier merges
10. Compression Ratio
Compression ratio (CR): ⌈log₂ |S|⌉ is to 100% as (⌈log₂ |S|⌉ − WPL(T)) is to the CR; that is, CR = 100 · (⌈log₂ |S|⌉ − WPL(T)) / ⌈log₂ |S|⌉
Huffman's compression ratio typically falls between 20% and 80%
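With the frequencies from slide 7, the ratio works out as follows; the leaf depths listed are those a Huffman construction produces for that example:

```python
import math

# Frequencies from slide 7's example
freq  = {"A": 0.35, "B": 0.1, "C": 0.2, "D": 0.2, "E": 0.15}
# Leaf depths in a Huffman tree for these frequencies
depth = {"A": 2, "B": 3, "C": 2, "D": 2, "E": 3}

wpl   = sum(freq[c] * depth[c] for c in freq)  # expected bits per character
fixed = math.ceil(math.log2(len(freq)))        # ceil(log2 |S|) = 3
cr    = 100 * (fixed - wpl) / fixed

print(round(wpl, 2))  # 2.25
print(round(cr, 2))   # 25.0
```

A 25% ratio for this small example sits inside the 20% to 80% range quoted on the slide.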