Huffman Codes - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
Huffman Codes
2
Introduction
  • Huffman codes are a very effective technique for
    compressing data; savings of 20% to 90% are
    typical, depending on the characteristics of the
    file being compressed. Huffman's greedy algorithm
    uses a table of frequencies of occurrence of each
    character in the file to build up an optimal way
    of representing each character as a binary
    string.
  • Suppose we have a 100,000-character data file
    that we wish to store compactly. Further suppose
    the characters in the file occur with the
    following frequencies:
  [Table: frequencies and codewords for the six characters]

3
Introduction
  • That is, there are only 6 different characters in
    the file, and, for example, the character a
    appears 45,000 times.
  • There are many ways to represent such a file of
    information. We consider the problem of designing
    a binary character code (or code for short),
    wherein each character is represented by a unique
    binary string. If we use a fixed-length code, we
    need 3 bits to represent six characters, and
    300,000 bits for the entire file.

4
Introduction
  • A variable-length code can do considerably
    better, by giving frequent characters short
    codewords, and infrequent characters long
    codewords. In our example:
  • If we use the given variable-length code, we only
    need 224,000 bits
    (1·45 + 3·13 + 3·12 + 3·16 + 4·9 + 4·5 = 224,
    in thousands of bits).
  • We saved approximately 25%. In fact, this is an
    optimal character code for this file.
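As a sanity check, the savings can be reproduced in a short Python snippet; the frequencies (in thousands) and codeword lengths below are those of the slides' six-character example:

```python
# Compare total bits for the fixed-length and variable-length codes
# of the example (frequencies are in thousands of occurrences).
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
fixed_len = {c: 3 for c in freq}                  # every codeword is 3 bits
var_len = {"a": 1, "b": 3, "c": 3, "d": 3, "e": 4, "f": 4}  # codeword lengths

fixed_bits = sum(freq[c] * fixed_len[c] for c in freq) * 1000
var_bits = sum(freq[c] * var_len[c] for c in freq) * 1000
print(fixed_bits, var_bits)  # 300000 224000
```

The ratio 224,000 / 300,000 confirms the roughly 25% saving claimed above.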

5
Prefix codes
  • We consider here only codes in which no codeword
    is also a prefix of some other codeword. Such
    codes are called prefix(-free) codes. It is
    possible to show that optimal data compression
    achievable by a character code can always be
    achieved with a prefix code, so there is no loss
    of generality in restricting attention to prefix
    codes.

6
Prefix codes
  • Prefix codes are desirable because they simplify
    encoding (compression) and decoding. Encoding is
    always easy for any binary character code: we
    just concatenate the codewords representing each
    character of the file. In the example, we code
    "abc" as 0·101·100 = 0101100 (if we use the
    variable-length prefix code).
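Encoding by concatenation is a one-liner; the codewords below are the ones the example gives for a, b, and c:

```python
# Encoding with a binary character code is just concatenating codewords.
code = {"a": "0", "b": "101", "c": "100"}  # the example's codewords
encoded = "".join(code[ch] for ch in "abc")
print(encoded)  # 0101100
```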

7
Prefix codes
  • Decoding is also quite simple with a prefix code.
    Since no codeword is a prefix of any other, the
    codeword that begins an encoded file is
    unambiguous. We can simply identify the initial
    codeword, translate it back to the original
    character, remove it from the encoded file, and
    repeat the decoding process on the remainder of
    the encoded file. In our example, the string
    001011101 parses uniquely as 0 0 101 1101, which
    decodes to aabe.
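The greedy left-to-right scan described above can be sketched in Python. The transcript spells out only the codewords for a, b, c, and e; the entries for d (111) and f (1100) are assumptions consistent with the code tree discussed later:

```python
# Decoding with a prefix code: since no codeword is a prefix of another,
# scanning left to right and emitting a character as soon as the buffer
# matches a codeword is unambiguous.
code = {"a": "0", "b": "101", "c": "100",
        "d": "111",   # assumed codeword, not spelled out in the slides
        "e": "1101",
        "f": "1100"}  # assumed codeword, not spelled out in the slides
decode_table = {w: ch for ch, w in code.items()}

def decode(bits: str) -> str:
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in decode_table:        # a complete codeword has been read
            out.append(decode_table[buf])
            buf = ""
    assert buf == "", "input ended in the middle of a codeword"
    return "".join(out)

print(decode("001011101"))  # aabe
```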

8
Prefix codes
  • The decoding process needs a convenient
    representation for the prefix code so that the
    initial codeword can be easily picked off. A
    binary tree whose leaves are the given characters
    provides one such representation. We interpret
    the binary codeword for a character as the path
    from the root to the character, where 0 means "go
    to the left child" and 1 means "go to the right
    child."
  • The following figure shows the trees for the two
    codes of our example.

9
Prefix Codes
  [Figure: the code trees for the fixed-length and the variable-length code]
10
Prefix codes
  • An optimal code for a file is always represented
    by a full binary tree, in which every nonleaf
    node has two children (why?). The fixed-length
    code in our example is not optimal since its tree
    is not a full binary tree: there are codewords
    beginning 10, but none beginning 11. Since we
    can now restrict our attention to full binary
    trees, we can say that if C is the alphabet from
    which the characters are drawn, then the tree for
    an optimal prefix code has exactly |C| leaves,
    one for each letter of the alphabet, and exactly
    |C| - 1 internal nodes.

11
Prefix codes
  • Given a tree T corresponding to a prefix code, it
    is a simple matter to compute the number of bits
    required to encode a file. For each character c
    in the alphabet C, let f(c) denote the frequency
    of c in the file and let dT(c) denote the depth
    of c's leaf in the tree. Note that dT(c) is also
    the length of the codeword for the character c.
    The number of bits required to encode a file is
    thus
  • B(T) = Σc∈C f(c) dT(c),
  • which is defined as the cost of the tree T.
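The cost B(T) can be computed directly from a code tree by summing f(c) times leaf depth. Below, the tree is the example's variable-length tree as nested pairs; the b, c, and e positions follow the transcript's codewords, while the d and f placement is an assumption (any optimal tree for these frequencies has cost 224):

```python
# Cost of a code tree: B(T) = sum over leaves of f(c) * dT(c).
# Leaves are characters; internal nodes are (left, right) pairs.
tree = ("a", (("c", "b"), (("f", "e"), "d")))   # d/f placement assumed
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}

def cost(node, depth=0):
    if isinstance(node, str):                   # leaf: contribute f(c) * dT(c)
        return freq[node] * depth
    left, right = node
    return cost(left, depth + 1) + cost(right, depth + 1)

print(cost(tree))  # 224 (in thousands of bits)
```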

12
Constructing a Huffman code
  • Huffman invented a greedy algorithm that
    constructs an optimal prefix code called a
    Huffman code.
  • In the pseudocode that follows, C is a set of n
    characters and each c∈C has a defined frequency
    f[c]. The algorithm builds the tree T
    corresponding to the optimal code in a bottom-up
    manner. It begins with a set of |C| leaves and
    performs a sequence of |C| - 1 merging
    operations to create the final tree. A priority
    queue Q, keyed on f, is used to identify the two
    least-frequent objects to merge together. The
    result of the merger of two objects is a new
    object whose frequency is the sum of the
    frequencies of the two objects that were merged.

13
Constructing a Huffman code
  • HUFFMAN(C)
  •   n ← |C|
  •   Q ← C
  •   for i ← 1 to n - 1
  •       do allocate a new node z
  •          left[z] ← x ← EXTRACT-MIN(Q)
  •          right[z] ← y ← EXTRACT-MIN(Q)
  •          f[z] ← f[x] + f[y]
  •          INSERT(Q, z)
  •   return EXTRACT-MIN(Q)
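The pseudocode translates almost line for line into Python if we use the standard-library `heapq` module as the priority queue. This is a runnable sketch, not the slides' own code; the tie-breaking counter is an implementation detail that keeps heap entries comparable:

```python
import heapq
import itertools

def huffman(freq):
    """Build a Huffman code tree; nodes are dicts, leaves carry 'char'."""
    counter = itertools.count()          # breaks ties so dicts never compare
    q = [(f, next(counter), {"char": c}) for c, f in freq.items()]
    heapq.heapify(q)                     # Q <- C
    for _ in range(len(freq) - 1):       # |C| - 1 merge steps
        fx, _, x = heapq.heappop(q)      # x <- EXTRACT-MIN(Q)
        fy, _, y = heapq.heappop(q)      # y <- EXTRACT-MIN(Q)
        z = {"left": x, "right": y}      # allocate a new node z
        heapq.heappush(q, (fx + fy, next(counter), z))  # f[z] = f[x] + f[y]
    return q[0][2]                       # root of the final tree

def codewords(node, prefix=""):
    """Read codewords off the tree: 0 = left child, 1 = right child."""
    if "char" in node:
        return {node["char"]: prefix or "0"}
    table = codewords(node["left"], prefix + "0")
    table.update(codewords(node["right"], prefix + "1"))
    return table

freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
code = codewords(huffman(freq))
total = sum(freq[c] * len(w) for c, w in code.items())
print(total)  # 224
```

On the example frequencies this reproduces the optimal cost of 224 (thousand bits) from the slides.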

14
Constructing a Huffman code
  • For our example, the following figures show how
    the algorithm works.
  • There are 6 letters, and so the size of the
    initial queue is n = 6. There are 5 merge steps.
    The final tree represents the optimal prefix
    code. The codeword for a letter is a sequence of
    the edge labels on the path from the root to the
    letter.

15
Constructing a Huffman code
  [Figure: the queue and partial trees after each of the 5 merge steps]
16
Constructing a Huffman code
  • The analysis of the code is quite simple: we
    first build the queue, then we have n - 1 merge
    steps; in each step we pick the two least
    frequent characters and merge them into a new
    one, which finds its proper place in the queue.
  • If we implement the queue via a heap, the running
    time is easily found to be O(n log n).

17
Correctness of Huffman's algorithm
  • We present several lemmas that will lead to the
    desired conclusion.
  • Lemma 16.2 Let C be an alphabet in which each
    character c∈C has frequency f[c]. Let x and y be
    two characters in C having the lowest
    frequencies. Then there exists an optimal prefix
    code for C in which the codewords for x and y
    have the same length and differ only in the last
    bit.

18
Correctness of Huffman's algorithm
  • Proof
  • The idea is to take the tree T representing an
    arbitrary optimal prefix code and modify it to
    make a tree representing another optimal prefix
    code such that the characters x and y appear as
    sibling leaves of maximum depth in the new tree.
    If we succeed, then their codewords will have the
    same length and will only differ in the last bit.

19
Correctness of Huffman's algorithm
  • Let a and b be two characters that are sibling
    leaves of maximum depth in T. Without loss of
    generality, we assume that f[a] ≤ f[b] and f[x]
    ≤ f[y]. Since f[x] and f[y] are the two lowest
    leaf frequencies, in order, and f[a] and f[b] are
    two arbitrary frequencies, in order, we have f[x]
    ≤ f[a] and f[y] ≤ f[b]. We now exchange the
    positions in T of a and x to get a tree T′, and
    then exchange the positions of b and y, to
    produce the tree T″.
  • We should now calculate the difference in cost
    between T and T′.

20
Correctness of Huffman's algorithm
  [Figure: the trees T, T′, and T″ produced by the two exchanges]
21
Correctness of Huffman's algorithm
  • We start with
  • B(T) - B(T′)
    = Σc∈C f[c] dT(c) - Σc∈C f[c] dT′(c)
    = f[x] dT(x) + f[a] dT(a) - f[x] dT′(x) - f[a] dT′(a)
    = f[x] dT(x) + f[a] dT(a) - f[x] dT(a) - f[a] dT(x)
    = ( f[a] - f[x] ) ( dT(a) - dT(x) )
    ≥ 0,
  • because both f[a] - f[x] and dT(a) - dT(x) are
    nonnegative (why?).
  • Similarly, when we move from T′ to T″, we do not
    increase the cost. Therefore, B(T″) ≤ B(T), but
    since T was optimal, B(T) ≤ B(T″), which
    implies B(T″) = B(T).
  • Thus, T″ is an optimal tree in which x and y
    appear as sibling leaves of maximum depth, and
    the lemma follows. ∎

22
Correctness of Huffman's algorithm
  • The lemma implies that the process of building up
    an optimal tree by mergers can, without loss of
    generality, begin with the greedy choice of
    merging together the two characters of lowest
    frequency.
  • The next lemma shows that the problem of
    constructing optimal prefix codes has (what we
    call) the optimal substructure property.

23
Correctness of Huffman's algorithm
  • Lemma 16.3 Let T be a full binary tree
    representing an optimal prefix code over an
    alphabet C, where frequency f[c] is defined for
    each character c∈C. Consider any two characters x
    and y that appear as sibling leaves in T, and let
    z be their parent. Then, considering z as a
    character with frequency f[z] = f[x] + f[y], the
    tree
  • T′ = T - {x, y}
  • represents an optimal prefix code for the
    alphabet
  • C′ = (C - {x, y}) ∪ {z}.

24
Correctness of Huffman's algorithm
  • Proof We first show that the cost B(T) of the
    tree T can be expressed in terms of the cost
    B(T′) of the tree T′ by considering the different
    summands in the definition of B(·). For each
    c ∈ C - {x, y}, we have dT(c) = dT′(c),
    resulting in
  • f[c] dT(c) = f[c] dT′(c).
  • Since dT(x) = dT(y) = dT′(z) + 1, we have
  • f[x] dT(x) + f[y] dT(y) = ( f[x] + f[y] ) ( dT′(z) + 1 )
    = f[z] dT′(z) + ( f[x] + f[y] ),
  • leading to
  • B(T) = B(T′) + f[x] + f[y].

25
Correctness of Huffman's algorithm
  • If T′ represents a non-optimal prefix code for
    the alphabet C′, then there exists a tree T″
    whose leaves are characters in C′ such that
    B(T″) < B(T′). Since z is treated as a character
    in C′, it appears as a leaf in T″. If we add x
    and y as the children of z in T″, we obtain a
    prefix code for C with cost
  • B(T″) + f[x] + f[y] < B(T),
  • contradicting the optimality of T. Thus, T′ must
    be optimal for the alphabet C′. ∎

26
Correctness of Huffman's algorithm
  • Theorem Procedure HUFFMAN produces an optimal
    prefix code.
  • Proof Immediate from the two lemmas.
  • Last updated 2/08/2010