Chapter 3: Compression (Part 1)

1
Chapter 3: Compression (Part 1)
2
Storage size for media types (uncompressed)
3
A typical encyclopedia
  • 50,000 pages of text (2 KB/page): 0.1 GB
  • 3,000 color pictures (on average 640 × 480 × 24 bits = 1 MB/picture): 3 GB
  • 500 maps/drawings/illustrations (on average 640 × 480 × 16 bits = 0.6
    MB/map): 0.3 GB
  • 60 minutes of stereo sound (176 KB/s): 0.6 GB
  • 30 animations, on average 2 minutes in duration (640 × 320, 16 bits/pixel,
    6.5 MB/s): 23.4 GB
  • 50 digitized movies, on average 1 min. in duration: 87 GB
    TOTAL: 114 GB (or 180 CDs, or a stack of CDs about 2 feet high)

4
Compressing an encyclopedia
5
Compression and encoding
  • We can view compression as an encoding method
    which transforms a sequence of bits/bytes into an
    artificial format such that the output (in
    general) takes less space than the input.

[Diagram: a compressor (encoder) turns "fat data" into "slim data" by squeezing
out the fat, i.e., the redundancy.]
6
Redundancy
  • Compression is made possible by exploiting redundancy in the source data.
  • For example, redundancy is found in:
  • Character distribution
  • Character repetition
  • High-usage patterns
  • Positional redundancy
    (these categories overlap to some extent)

7
Symbols and Encoding
  • We assume a piece of source data that is subject to compression (such as
    text, an image, etc.) is represented by a sequence of symbols. Each symbol
    is encoded in a computer by a code (or codeword, value), which is a bit
    string.
  • Examples
  • English text: "abc" (symbols), ASCII (coding)
  • Chinese text: characters (symbols), BIG5 (coding)
  • Image: colors (symbols), RGB (coding)

8
Character distribution
  • Some symbols are used more frequently than others.
  • In English text, "e" and space occur most often.
  • Fixed-length encoding uses the same number of bits to represent each
    symbol.
  • With fixed-length encoding, to represent n symbols, we need ⌈log2 n⌉ bits
    for each code.

9
Character distribution
  • Fixed-length encoding may not be the most
    space-efficient method to digitally represent an
    object.
  • Example: abacabacabacabacabacabacabac
  • With fixed-length encoding, we need 2 bits for each of the 3 symbols.
  • Alternatively, we could use "1" to represent a, and "00" and "01" to
    represent b and c, respectively.
  • How much saving did we achieve? (See the sketch below.)
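The saving can be checked directly (a minimal Python sketch; the string and the
code assignments are the ones on this slide):

    # Compare fixed-length vs. variable-length encoding for "abac" repeated 7 times.
    text = "abac" * 7                     # 28 symbols: 14 a's, 7 b's, 7 c's

    fixed_bits = len(text) * 2            # 2 bits per symbol for 3 symbols

    # Variable-length scheme from the slide: a -> "1", b -> "00", c -> "01"
    code = {"a": "1", "b": "00", "c": "01"}
    variable_bits = sum(len(code[s]) for s in text)

    print(fixed_bits, variable_bits)      # 56 vs. 42 bits, i.e. a 25% saving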

10
Character distribution
  • Technique: whenever symbols occur with different frequencies, use
    variable-length encoding.
  • For your interest:
  • In 1838, Samuel Morse and Alfred Vail designed the Morse code for the
    telegraph.
  • Each letter of the alphabet is represented by dots (electric current of
    short duration) and dashes (3 dots long).
  • e.g., E → ".", Z → "--.."
  • Morse and Vail determined the code by counting the letters in each bin of a
    printer's type box.

11
Morse code table
SOS → ... --- ...
12
Character distribution
  • Principle: use fewer bits to represent frequently occurring symbols, and
    more bits to represent rarely occurring ones.

13
Character repetition
  • Occurs when a string of repetitions of a single symbol appears, e.g.,
    aaaaaaaaaaaaaabc
  • Example: in graphics, it is common to have a line of pixels using the same
    color.
  • Technique: run-length encoding
  • Instead of specifying the exact value of each repeating symbol, we specify
    the value once and specify the number of times the value repeats.

14
High usage patterns
  • Certain patterns (i.e., sequences of symbols) occur very often.
  • Example: "qu", "th", ..., in English text.
  • Example: "begin", "end", ..., in Pascal source code.
  • Technique: use a single special symbol to represent a common sequence of
    symbols.
  • e.g., one new symbol → "th"
  • i.e., instead of using 2 codes for "t" and "h", use only 1 code for "th"

15
Positional redundancy
  • Sometimes, given the context, symbols can be predicted.
  • Technique: differential encoding
  • Predict a value and code only the prediction error.
  • Example: differential pulse code modulation (DPCM)

Original samples: 198, 199, 200, 200, 201, 200, 200, 198
Differences:      198,   1,   1,   0,   1,  -1,   0,  -2
Small numbers require fewer bits to represent.
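A minimal differential-encoding sketch in Python for the sequence above (real
DPCM also quantizes the differences, which this sketch omits):

    def diff_encode(samples):
        # Keep the first sample, then store only the change from the previous one.
        return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

    def diff_decode(deltas):
        # Undo the encoding by accumulating the differences.
        out = [deltas[0]]
        for d in deltas[1:]:
            out.append(out[-1] + d)
        return out

    samples = [198, 199, 200, 200, 201, 200, 200, 198]
    print(diff_encode(samples))   # [198, 1, 1, 0, 1, -1, 0, -2]
    assert diff_decode(diff_encode(samples)) == samples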
16
Categories of compression techniques
  • Entropy encoding
  • regards the data stream as a simple digital sequence; its semantics are
    ignored
  • lossless compression
  • Source encoding
  • takes into account the semantics of the data and the characteristics of the
    specific medium to achieve a higher compression ratio
  • lossy compression

17
Categories of compression techniques
  • Hybrid encoding
  • A combination of the above two.

18
A categorization
Entropy Encoding   Huffman code, Shannon-Fano code
Entropy Encoding   Run-length encoding
Entropy Encoding   LZW
Source Coding      Prediction (DPCM, DM)
Source Coding      Transformation (DCT)
Source Coding      Vector quantization
Hybrid Coding      JPEG
Hybrid Coding      MPEG
Hybrid Coding      H.261
19
Information theory
  • Entropy is a quantitative measure of uncertainty or randomness: the higher
    the entropy, the less the order.
  • According to Shannon, the entropy of an information source S is defined as
    H(S) = Σi pi log2(1/pi)
  • where pi is the probability that symbol Si in S occurs.
  • For notational convenience, we use a set of probabilities to denote S,
    e.g., S = {0.2, 0.3, 0.5} for a set S of 3 symbols.

20
Entropy
  • For any set S of n symbols:
    0 = H(1, 0, ..., 0) ≤ H(S) ≤ H(1/n, ..., 1/n) = log2 n.
  • That is, the entropy of S is bounded below and above by 0 and log2 n,
    respectively.
  • The lower bound is obtained if one symbol occurs with a probability of 1
    while all others have probability 0; the upper bound is obtained if all
    symbols occur with equal probability.
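A small Python sketch of the definition, checking the bounds on this slide:

    from math import log2

    def entropy(probs):
        # H(S) = sum of p * log2(1/p); symbols with p == 0 contribute nothing.
        return sum(p * log2(1 / p) for p in probs if p > 0)

    print(entropy([1.0, 0.0, 0.0]))    # 0.0, the lower bound
    print(entropy([1/3, 1/3, 1/3]))    # log2(3) = 1.58..., the upper bound for n = 3
    print(entropy([0.2, 0.3, 0.5]))    # about 1.49, somewhere in between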

21
Self-information
  • I(Si) = log2(1/pi) is called the self-information associated with the
    symbol Si.
  • It indicates the amount of information (surprise or unexpectedness)
    contained in Si, or the number of bits needed to encode it.
  • Note that entropy can be interpreted as the average self-information of a
    symbol of a source.

22
Entropy and Compression
  • In relation to compression and coding, one can
    consider an object (such as an image, a piece of
    text, etc.) as an information source, and the
    elements in the object (such as pixels,
    alpha-numeric characters, etc.) as the symbols.
  • You may also think of the occurrence frequency of an element (e.g., a
    pixel) as the element's occurrence probability.

23
Example
A piece of text: Abracadabra!
Corresponding statistics:
Symbol  Freq.  Prob.
A       1      1/12
b       2      2/12
r       2      2/12
a       4      4/12
c       1      1/12
d       1      1/12
!       1      1/12
24
Example
An image: the symbols are the pixel colors (shown as color swatches on the
original slide).
Corresponding statistics:
Symbol   Freq.  Prob.
color 1  27     0.21
color 2  44     0.34
color 3  28     0.22
color 4  25     0.20
color 5  4      0.03
25
Information theory
  • The entropy is the lower bound for lossless compression: if the probability
    of each symbol is fixed, then on average a symbol requires H(S) bits to
    encode.
  • A guideline: the code lengths for different symbols should vary according
    to the information carried by the symbol. Coding techniques based on this
    are called entropy encoding.

26
Information theory
  • For example, consider an image in which half of the pixels are white and
    the other half are black:
  • p(white) = 0.5, p(black) = 0.5
  • Information theory says that each pixel requires 1 bit to encode on
    average.

27
Information theory
  • Q: How about an image in which white occurs ¼ of the time, black ¼ of the
    time, and gray ½ of the time?
  • Ans: Entropy = 1.5.
  • Information theory says that it takes at least 1.5 bits on average to
    encode a pixel.
  • How? white → 00, black → 01, gray → 1
  • Entropy encoding is about wisely using just enough bits to represent the
    occurrences of certain patterns.
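A quick check that the suggested code matches the bound (a small Python sketch):

    from math import log2

    probs = {"white": 0.25, "black": 0.25, "gray": 0.5}
    code  = {"white": "00", "black": "01", "gray": "1"}

    entropy = sum(p * log2(1 / p) for p in probs.values())
    acl = sum(probs[s] * len(code[s]) for s in probs)
    print(entropy, acl)    # both equal 1.5 bits per pixel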

28
Variable-length encoding
  • A problem with variable-length encoding is
    ambiguity in the decoding process.
  • e.g., 4 symbols: a (0), b (1), c (01), d (10)
  • the bit stream 0110 can be interpreted as
  • abba, or
  • abd, or
  • cba, or
  • cd

29
Prefix code
  • We wish to encode each symbol into a sequence of 0s and 1s so that no code
    for a symbol is the prefix of the code of any other symbol: the prefix
    property.
  • With the prefix property, we are able to decode a sequence of 0s and 1s
    into a sequence of symbols unambiguously.

30
Average code length
  • Given
  • a symbol set S = {S1, ..., Sn}
  • with the probability of Si being pi
  • a coding scheme C such that the codeword of Si is ci
  • and the length (in bits) of ci being li
  • The average code length (ACL) is given by
    ACL = Σi pi li

31
Average code length
  • Example
  • S = {a, b, c, d, e, f, g, h}
  • with probabilities 0.1, 0.1, 0.3, 0.35, 0.05, 0.03, 0.02, 0.05
  • Coding scheme:

a    b    c   d   e    f      g      h
100  101  00  01  110  11100  11101  1111
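The ACL of this scheme works out to 2.5 bits per symbol (a quick Python check
using the formula above):

    probs = [0.1, 0.1, 0.3, 0.35, 0.05, 0.03, 0.02, 0.05]
    codes = ["100", "101", "00", "01", "110", "11100", "11101", "1111"]

    # ACL = sum over all symbols of p_i * l_i
    acl = sum(p * len(c) for p, c in zip(probs, codes))
    print(acl)    # 2.5 bits per symbol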
32
Optimal code
  • An optimal code is one that has the smallest ACL
    for a given probability distribution of the
    symbols.

33
Theorem
  • For any prefix binary coding scheme C for a symbol set S, we have
    ACL(C) ≥ H(S).
  • Entropy is thus a lower bound for lossless compression.

34
Shannon-Fano algorithm
  • Example: 5 characters are used in a text, with the following statistics:
    Symbol  A   B  C  D  E
    Count   15  7  6  6  5
  • Algorithm (top-down approach)
  • 1. Sort symbols according to their probabilities.
  • 2. Divide the symbols into 2 half-sets, each with roughly equal counts.
  • 3. All symbols in the first group are given 0 as the first bit of their
    codes; 1 for the second group.
  • 4. Recursively perform steps 2-3 for each group, with the additional bits
    appended.
35
Shannon-Fano algorithm
  • Example
  • G1 = {A, B}; G2 = {C, D, E}
  • Subdivide G1:
  • G11 = {A}; G12 = {B}
  • Subdivide G2:
  • G21 = {C}; G22 = {D, E}
  • Subdivide G22:
  • G221 = {D}; G222 = {E}
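A minimal recursive Shannon-Fano sketch in Python (one assumption: the split
point is chosen to make the two groups' counts as equal as possible; actual
implementations may break ties differently):

    def shannon_fano(symbols):
        """symbols: list of (symbol, count) pairs sorted by decreasing count.
        Returns a dict mapping each symbol to its code string."""
        if len(symbols) == 1:
            return {symbols[0][0]: ""}
        total = sum(c for _, c in symbols)
        best, best_diff, running = 1, total, 0
        for i in range(1, len(symbols)):
            running += symbols[i - 1][1]
            diff = abs(2 * running - total)      # |left count - right count|
            if diff < best_diff:
                best, best_diff = i, diff
        codes = {}
        for sym, code in shannon_fano(symbols[:best]).items():
            codes[sym] = "0" + code              # first group gets a leading 0
        for sym, code in shannon_fano(symbols[best:]).items():
            codes[sym] = "1" + code              # second group gets a leading 1
        return codes

    counts = [("A", 15), ("B", 7), ("C", 6), ("D", 6), ("E", 5)]
    codes = shannon_fano(counts)
    print(codes)                                        # A:00 B:01 C:10 D:110 E:111
    print(sum(len(codes[s]) * c for s, c in counts))    # 89 bits in total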

36
Shannon-Fano algorithm
  • Result: A → 00, B → 01, C → 10, D → 110, E → 111

Total: 39 symbols encoded in 89 bits
ACL = 89/39 ≈ 2.28 bits per symbol > H(S) ≈ 2.19
37
Shannon-Fano algorithm
  • The coding scheme in a tree representation
  • The total number of bits used for encoding is 89.
  • A fixed-length coding scheme would need 3 bits for each symbol → a total of
    117 bits.

38
Shannon-Fano algorithm
  • It can be shown that the ACL of a code generated by the Shannon-Fano
    algorithm is between H(S) and H(S) + 1. Hence it is fairly close to the
    theoretical optimum.

39
Huffman code
  • Used in many compression programs, e.g., zip.
  • Algorithm (bottom-up approach)
  • Given n symbols:
  • We start with a forest of n trees. Each tree has only 1 node, corresponding
    to one symbol.
  • Let's define the weight of a tree as the sum of the probabilities of all
    the symbols in the tree.
  • Repeat the following steps until the forest has only one tree left:
  • pick the 2 trees of lightest weight, say T1 and T2;
  • replace T1 and T2 by a new tree T such that T1 and T2 are T's children
    (T's weight is then the sum of T1's and T2's weights).
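A compact sketch of the bottom-up construction using a priority queue (Python's
heapq; the counts are those of the running example):

    import heapq
    from itertools import count

    def huffman_codes(freqs):
        """freqs: dict mapping symbol -> weight. Returns dict symbol -> code."""
        tiebreak = count()   # keeps heap comparisons well-defined for equal weights
        # Start with a forest of single-node trees, one per symbol.
        heap = [(w, next(tiebreak), {sym: ""}) for sym, w in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            # Pick the two lightest trees and merge them under a new root.
            w1, _, t1 = heapq.heappop(heap)
            w2, _, t2 = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in t1.items()}
            merged.update({s: "1" + c for s, c in t2.items()})
            heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
        return heap[0][2]

    print(huffman_codes({"a": 15, "b": 7, "c": 6, "d": 6, "e": 5}))
    # One optimal result: "a" gets a 1-bit code, the others 3-bit codes (87 bits total)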

40
Huffman code
[Construction steps: first, e(5/39) and d(6/39), the two lightest trees, are
merged into a tree of weight 11/39; next, c(6/39) and b(7/39) are merged into a
tree of weight 13/39. a(15/39) is still a single-node tree.]
41
Huffman code
[Next step: the 11/39 and 13/39 trees are merged into a tree of weight 24/39;
the final step merges it with a(15/39) to form the complete Huffman tree of
weight 39/39.]
42
Huffman code
  • result from the previous example

43
Huffman code
  • The code of a symbol is obtained by traversing
    the tree from the root to the symbol, taking a
    0 when a left branch is followed and a 1
    when a right branch is followed.

44
Huffman code
symbol code bits/symbol occ. bits used
A 0 1 15 15
B 100 3 7 21
C 101 3 6 18
D 110 3 6 18
E 111 3 5 15
total 87 bits
ABAAAE → 0100000111
45
Huffman code
symbol code bits/symbol occ. bits used
A 0 1 15 15
B 100 3 7 21
C 101 3 6 18
D 110 3 6 18
E 111 3 5 15
Coding table
total 87 bits
ABAAAE → 0100000111
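A small decoding sketch with the table above; thanks to the prefix property the
decoder can emit a symbol as soon as the accumulated bits match a codeword:

    table = {"0": "A", "100": "B", "101": "C", "110": "D", "111": "E"}

    def decode(bits):
        out, buf = [], ""
        for b in bits:
            buf += b
            if buf in table:     # prefix property: a match is never a false start
                out.append(table[buf])
                buf = ""
        return "".join(out)

    print(decode("0100000111"))  # ABAAAE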
46
Huffman code (advantages)
  • Decoding is trivial as long as the coding table
    is stored before the (coded) data. Overhead
    (i.e., the coding table) is relatively negligible
    if the data file is big.
  • Prefix code.

47
Huffman code (advantages)
  • Optimal code size (among coding schemes that use a fixed code for each
    symbol).
  • The entropy is H(S) ≈ 2.19 bits per symbol.
  • The average number of bits per symbol needed for Huffman coding is
    87/39 ≈ 2.23.
48
Huffman code (disadvantages)
  • Need to know the symbol frequency distribution in advance.
  • Might be OK for English text, but not for arbitrary data files.
  • Might need two passes over the data.
  • The frequency distribution may change over the content of the source.

49
Run-length encoding (RLE)
  • Image, video and audio data streams often contain
    sequences of the same bytes or symbols, e.g.,
    background in images, and silence in sound files.
  • Substantial compression can be achieved by
    replacing repeated symbol sequences with the
    number of occurrences of the symbol.

50
Run-length encoding
  • For example, if a symbol occurs at least 4 consecutive times, the sequence
    can be encoded as the symbol followed by a special flag (i.e., a special
    symbol) and the number of its occurrences, e.g.,
  • Uncompressed: ABCCCCCCCCDEFGGGHI
  • Compressed: ABC!8DEFGGGHI
  • If 4 Cs (CCCC) are encoded as C!0, 5 Cs as C!1, and so on, then runs of 4
    to 259 symbols can be compressed into 2 symbols plus a count (see the
    sketch below).
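A minimal sketch of this flavour of RLE in Python ("!" is the escape symbol and
the count is stored offset by 4, as described above; handling of "!" appearing
in the data itself is left out):

    def rle_encode(s, flag="!"):
        out, i = [], 0
        while i < len(s):
            j = i
            while j < len(s) and s[j] == s[i]:
                j += 1
            run = j - i
            if run >= 4:
                # Store symbol, flag, and (run - 4): one byte covers runs of 4..259.
                out.append(s[i] + flag + chr(run - 4))
            else:
                out.append(s[i] * run)
            i = j
        return "".join(out)

    def rle_decode(s, flag="!"):
        out, i = [], 0
        while i < len(s):
            if i + 1 < len(s) and s[i + 1] == flag:
                out.append(s[i] * (ord(s[i + 2]) + 4))
                i += 3
            else:
                out.append(s[i])
                i += 1
        return "".join(out)

    text = "ABCCCCCCCCDEFGGGHI"
    print(rle_encode(text))      # the run of 8 C's becomes C, "!", count byte 4
    assert rle_decode(rle_encode(text)) == text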

51
Run-length encoding
  • Problems
  • Ineffective with text
  • except, perhaps, "Ahhhhhhhhhhhhhhhhh!", "Noooooooooo!"
  • Need an escape byte (!)
  • OK with text
  • not OK with arbitrary (binary) data
  • solution?
  • Used in ITU-T Group 3 fax compression
  • a combination of RLE and Huffman coding
  • compression ratio 10:1 or better

52
(No Transcript)
53
(No Transcript)
54
Lempel-Ziv-Welch algorithm
  • Method due to Ziv and Lempel in 1977 and 1978; improved by Welch in 1984.
    Hence the name LZW compression.
  • Used in UNIX compress, GIF, and V.42bis modem compression.

55
Lempel-Ziv-Welch algorithm
  • Motivation
  • Suppose we want to encode a piece of English text. We first encode
    Webster's English dictionary, which contains about 159,000 entries, so 18
    bits are surely enough for a word. Then why don't we just represent each
    word in the text as an 18-bit number, and so obtain a compressed version of
    the text?

56
LZW
A_simple_example
Using ASCII: 16 symbols, 7 bits each → 112 bits

Dictionary: A → 34, simple → 278, example → 1022
Encoded: 34 278 1022
Using the dictionary approach: 3 symbols, 18 bits each → 54 bits
57
LZW
  • Problems
  • Need to preload a copy of the dictionary before
    encoding and decoding.
  • Dictionary is huge, 159,000 entries.
  • Most of the dictionary entries are not used.
  • Many words cannot be found in the dictionary.
  • Solution
  • Build the dictionary on-line while encoding and
    decoding.
  • Build the dictionary adaptively. Words not used
    in the text are not incorporated in the
    dictionary.

58
LZW
  • Example: encode ^WED^WE^WEE^WEB^WET using LZW (here ^ stands for the
    separator character used in the example).
  • Assume each symbol (character) is encoded by an 8-bit code (i.e., the codes
    for the characters run from 0 to 255). We assume that the dictionary is
    loaded with these 256 codes initially (the base dictionary).
  • While scanning the string, sub-string patterns are added to the dictionary.
    A pattern that appears again after its first occurrence is substituted by
    its index into the dictionary.

59
Example 1 (compression)
Base dictionary (partial): ^ → 16, W → 23, D → 77, E → 40, B → 254, T → 95
[The original slide traces the buffer contents, the codes output, and the
dictionary entries added at each step.]
60
LZW (encoding rules)
  • Rule 1: whenever the pattern in the buffer is in the dictionary, we try to
    extend it.
  • Rule 2: whenever a new pattern is formed in the buffer, we add the pattern
    to the dictionary.
  • Rule 3: whenever a pattern is added to the dictionary, we output the
    codeword of the pattern before the last extension.

61
Example 1
  • Original: 19 symbols, 8-bit code each → 19 × 8 = 152 bits
  • LZW-coded: 12 codes, 9 bits each → 12 × 9 = 108 bits
  • Some compression is achieved; the effectiveness is more substantial for
    long messages.
  • The dictionary is not explicitly stored and sent to the decoder. In a
    sense, the compressor encodes the dictionary in the output stream
    implicitly.

62
LZW decompression
Base dictionary (partial): ^ → 16, W → 23, D → 77, E → 40, B → 254, T → 95
  • Input: 16, 23, 40, 77, 256, 40, 260, 261, 257, 254, 260, 95.
  • Assume the decoder also knows the first 256 entries of the dictionary (the
    base dictionary).
  • While scanning the code stream for decoding, the dictionary is rebuilt.
    When an index is read, the corresponding dictionary entry is retrieved and
    spliced into position.
  • The decoder deduces what the buffer of the encoder has gone through and
    rebuilds the dictionary.

63
Example 1 (decompression)
Base dictionary (partial): ^ → 16, W → 23, D → 77, E → 40, B → 254, T → 95
[The original slide traces the codes read, the strings output, and the
dictionary entries added at each step.]
64
LZW (decoding rule)
  • Whenever there are 2 codes C1 C2 in the buffer, add a new entry to the
    dictionary:
  • (symbols of C1) + (first symbol of C2)

65
  • Compression code (see the sketch below)
  • Decompression code (see the sketch below)
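A minimal LZW sketch in Python following the three encoding rules and the
decoding rule above (assumptions: the base dictionary is the 256 single-byte
codes, and the output codes are kept as plain integers without bit-packing):

    def lzw_compress(text):
        dictionary = {chr(i): i for i in range(256)}     # base dictionary
        next_code, buf, out = 256, "", []
        for ch in text:
            if buf + ch in dictionary:
                buf += ch                                # Rule 1: extend the pattern
            else:
                out.append(dictionary[buf])              # Rule 3: output code before extension
                dictionary[buf + ch] = next_code         # Rule 2: add the new pattern
                next_code += 1
                buf = ch
        if buf:
            out.append(dictionary[buf])
        return out

    def lzw_decompress(codes):
        dictionary = {i: chr(i) for i in range(256)}
        next_code = 256
        prev = dictionary[codes[0]]
        out = [prev]
        for code in codes[1:]:
            # Special case: the code may have been added by the encoder in the
            # immediately preceding step and not be in the decoder's dictionary yet.
            entry = dictionary[code] if code in dictionary else prev + prev[0]
            out.append(entry)
            dictionary[next_code] = prev + entry[0]      # symbols of C1 + first symbol of C2
            next_code += 1
            prev = entry
        return "".join(out)

    data = "^WED^WE^WEE^WEB^WET"
    codes = lzw_compress(data)
    print(codes)                                         # 12 codes for 19 symbols
    assert lzw_decompress(codes) == data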

66
Example 2
  • ababcbababaaaaaaa

67
Example 2 (compression)
68
LZW Pros and Cons
  • LZW (advantages)
  • effective at exploiting:
  • character-repetition redundancy
  • high-usage-pattern redundancy
  • an adaptive algorithm (no need to learn the statistical properties of the
    symbols beforehand)
  • a one-pass procedure

69
LZW Pros and Cons
  • LZW (disadvantage)
  • Messages must be long for effective compression, since it takes time for
    the dictionary to build up.

70
LZW (Implementation)
  • Decompression is generally faster than compression. Why?
  • Usually uses 12-bit codes → 4096 table entries.
  • Compression needs a hash table to look patterns up by string; decompression
    only indexes into the table by code.

71
LZW and GIF
  • 11 colors → create a color map of 11 entries:
    color (r0,g0,b0) → code 0, (r1,g1,b1) → 1, (r2,g2,b2) → 2, ...,
    (r10,g10,b10) → 10
  • Convert the image into a sequence of pixel color codes, then apply LZW to
    the code stream.
  • Image size: 320 x 200 pixels
72
GIF: 5280 bytes (0.66 bpp)
JPEG: 16327 bytes (2.04 bpp)
(bpp = bits per pixel)
73
GIF image
74
JPEG image