1
Compression Techniques
2
Introduction
  • What is Compression?
  • Data compression requires the identification and
    extraction of source redundancy.
  • In other words, data compression seeks to reduce
    the number of bits used to store or transmit
    information.
  • There is a wide range of compression methods,
    some so unlike one another that they have
    little in common except that they compress data.

3
Introduction
  • Compression can be categorized in two broad ways:
  • Lossless compression
  • recovers the exact original data after
    decompression.
  • mainly used for compressing database records,
    spreadsheets or word-processing files, where
    exact replication of the original is essential.
  • Lossy compression
  • results in a certain loss of accuracy in
    exchange for a substantial increase in
    compression.
  • more effective when used to compress graphic
    images and digitised voice, where losses outside
    visual or aural perception can be tolerated.
  • Most lossy compression techniques can be adjusted
    to different quality levels, gaining higher
    accuracy in exchange for less effective
    compression.

4
The Need For Compression
  • In terms of storage, the capacity of a storage
    device can be effectively increased with methods
    that compress a body of data on its way to the
    storage device and decompress it when it is
    retrieved.
  • In terms of communications, the bandwidth of a
    digital communication link can be effectively
    increased by compressing data at the sending end
    and decompressing it at the receiving end.

5
A Brief History of Data Compression
  • The late 1940s were the early years of information
    theory, when the idea of developing efficient new
    coding methods was just starting to be fleshed
    out. Ideas of entropy, information content and
    redundancy were explored.
  • One popular notion held that if the probability
    of symbols in a message were known, there ought
    to be a way to code the symbols so that the
    message takes up less space.

6
  • The first well-known method for compressing
    digital signals is now known as Shannon-Fano
    coding. Shannon and Fano (1948) simultaneously
    developed this algorithm, which assigns binary
    codewords to the unique symbols that appear within
    a given data file.
  • While Shannon-Fano coding was a great leap
    forward, it had the unfortunate luck to be
    quickly superseded by an even more efficient
    coding system: Huffman coding.

7
  • Huffman coding (1952) shares most characteristics
    of Shannon-Fano coding.
  • Huffman coding could perform effective data
    compression by reducing the amount of redundancy
    in the coding of symbols.
  • It has been proven to be optimal among methods
    that assign a codeword to each symbol one at a
    time (prefix codes).

8
  • In the last fifteen years, Huffman coding has
    been replaced by arithmetic coding.
  • Arithmetic coding bypasses the idea of replacing
    an input symbol with a specific code.
  • It replaces a stream of input symbols with a
    single floating-point output number.
  • More bits are needed in the output number for
    longer, more complex messages.

9
  • Dictionary-based compression algorithms use a
    completely different method to compress data.
  • They encode variable-length strings of symbols as
    single tokens.
  • The token forms an index to a phrase dictionary.
  • If the tokens are smaller than the phrases, they
    replace the phrases and compression occurs.

10
  • Two dictionary-based compression techniques
    called LZ77 and LZ78 have been developed.
  • LZ77 is a "sliding window" technique in which the
    dictionary consists of a set of fixed-length
    phrases found in a "window" into the previously
    seen text.
  • LZ78 takes a completely different approach to
    building a dictionary.
  • Instead of using fixed-length phrases from a
    window into the text, LZ78 builds phrases up one
    symbol at a time, adding a new symbol to an
    existing phrase when a match occurs.

12
Terminology
  • Compressor: Software (or hardware) device that
    compresses data
  • Decompressor: Software (or hardware) device that
    decompresses data
  • Codec: Software (or hardware) device that
    compresses and decompresses data
  • Algorithm: The logic that governs the
    compression/decompression process

13
Lossless Compression Algorithms
  • Repetitive Sequence Suppression
  • Run-length Encoding
  • Pattern Substitution
  • Entropy Encoding
  • The Shannon-Fano Algorithm
  • Huffman Coding
  • Arithmetic Coding

14
Repetitive Sequence Suppression
  • Fairly straightforward to understand and
    implement.
  • Simplicity is their downfall: NOT the best
    compression ratios.
  • Some methods have their applications, e.g. a
    component of JPEG, silence suppression.

15
Repetitive Sequence Suppression
  • If a sequence (a series of n successive tokens)
    appears:
  • replace the series with a single token and a
    count of the number of occurrences.
  • Usually we need a special flag to denote when the
    repeated token appears.
  • Example:
  • 89400000000000000000000000000000000
  • we can replace this with 894f32, where f is the
    flag for zero.
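
A minimal Python sketch of this flag scheme (the function
name and string input are illustrative assumptions):

# Zero suppression sketch: replace each run of zeros with the
# flag 'f' followed by the run length.
def suppress_zeros(s, flag="f"):
    out, i = "", 0
    while i < len(s):
        if s[i] == "0":
            j = i
            while j < len(s) and s[j] == "0":
                j += 1                   # extend the run of zeros
            out += flag + str(j - i)     # emit flag + count
            i = j
        else:
            out += s[i]                  # other tokens pass through
            i += 1
    return out

print(suppress_zeros("894" + "0" * 32))  # -> 894f32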

16
Repetitive Sequence Suppression
  • How Much Compression?
  • Compression savings depend on the content of the
    data.
  • Applications of this simple compression technique
    include:
  • Suppression of zeros in a file (Zero Length
    Suppression)
  • Silence in audio data, pauses in conversation,
    etc.
  • Bitmaps
  • Blanks in text or program source files
  • Backgrounds in images
  • Other regular image or data tokens

17
Run-length Encoding
  • This encoding method is frequently applied to
    images (or pixels in a scan line).
  • It is a small compression component used in JPEG
    compression.
  • In this instance:
  • sequences of image elements X1, X2, ..., Xn (row
    by row)
  • are mapped to pairs (c1, l1), (c2, l2), ..., (cn,
    ln), where ci represents the image intensity or
    colour and li the length of the ith run of pixels.

18
Run-length Encoding
  • Example
  • Original sequence:
  • 111122233333311112222
  • can be encoded as:
  • (1,4),(2,3),(3,6),(1,4),(2,4)
  • How Much Compression?
  • The savings are dependent on the data.
  • In the worst case (random noise) the encoding is
    larger than the original file:
  • 2 integers rather than 1 integer, if the data is
    represented as integers.
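
A minimal sketch of this (value, length) pairing in Python
(names are illustrative):

# Run-length encoder producing (value, length) pairs.
def rle(seq):
    pairs = []
    for x in seq:
        if pairs and pairs[-1][0] == x:
            pairs[-1][1] += 1            # extend the current run
        else:
            pairs.append([x, 1])         # start a new run
    return [tuple(p) for p in pairs]

print(rle("111122233333311112222"))
# -> [('1', 4), ('2', 3), ('3', 6), ('1', 4), ('2', 4)]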

19
Run-Length Encoding (RLE) Method
  • Example

(Figure: a row of coloured pixels to be encoded)
20
Run-Length Encoding (RLE) Method
  • Example

blue x 6, magenta x 7, red x 3, yellow x 3 and
green x 4
21
Run-Length Encoding (RLE) Method
  • Example
  • In the worst case every pixel differs from its
    neighbours, so each pixel becomes a (colour, 1)
    pair.

This would give two values per pixel,
which is twice the size!
22
Uncompressed: Blue White White White White White
White Blue White Blue White White White White
White Blue etc.
Compressed: 1xBlue 6xWhite 1xBlue 1xWhite 1xBlue
4xWhite 1xBlue 1xWhite etc.
23
Run-Length Encoding (RLE) Method
  • One advantage of this method is that it is
    sequential: once a particular series has been
    counted, it can be transmitted.
  • Consequently the principles of this method are
    also employed by the CCITT codec for fax
    communication, in conjunction with the Huffman
    method.

24
Pattern Substitution
  • Here we substitute a frequently repeating
    pattern with a short code.
  • For example, replace all occurrences of "The"
    with a predefined code such as "&".
  • So
  • The code is The Key
  • becomes
  • & code is & Key
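
In Python this substitution is a plain string replacement;
a minimal sketch (the "&" code is just the stand-in used
above):

# Pattern substitution sketch: swap a frequent pattern for a
# shorter predefined code.
def substitute(text, pattern, code):
    return text.replace(pattern, code)

print(substitute("The code is The Key", "The", "&"))  # -> & code is & Key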

25
Entropy Encoding
  • Lossless compression frequently involves some
    form of entropy encoding
  • based on information-theoretic techniques.
  • According to Shannon, the entropy of an
    information source S is defined as:

    H(S) = \sum_i p_i \log_2 \frac{1}{p_i}

where p_i is the probability that symbol S_i in S
will occur.
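
A small sketch of the formula in Python, using the A-E
symbol counts that appear on the following slides:

from math import log2

# Shannon entropy: sum over symbols of p_i * log2(1/p_i).
def entropy(probs):
    return sum(p * log2(1.0 / p) for p in probs if p > 0)

counts = [15, 7, 6, 6, 5]                      # A..E counts
total = sum(counts)
print(entropy([c / total for c in counts]))    # ~2.19 bits per symbol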
26
The Shannon-Fano Algorithm
  • Example
  • Data: ABBAAAACDEAAABBBDDEEAAA........
  • Count symbols in stream:

Symbol   A    B    C    D    E
Count    15   7    6    6    5

27
The Shannon-Fano Algorithm
  • A top-down approach
  • Sort symbols (Tree Sort) according to their
    frequencies/probabilities, e.g., ABCDE.
  • Recursively divide into two parts, each with
    approx. same number of counts.

(Figure: the five symbols E, C, D, B, A laid out as the
leaves of the tree to be built)
28
The Shannon-Fano Algorithm
  • First division: A(15) and B(7) on the 0 branch
    (22 counts) vs. C(6), D(6) and E(5) on the 1
    branch (17 counts).
  • Each part is then divided again: A vs. B, and C
    vs. D, E.

(Figure: the tree after the recursive divisions, with 0/1
labels on the branches)
29
The Shannon-Fano Algorithm
  • The resulting codes:

Symbol   Count   Code   Bits
A        15      00     30
B         7      01     14
C         6      10     12
D         6      110    18
E         5      111    15

30
The Shannon-Fano Algorithm
  • Assemble each code by a depth-first traversal of
    the tree to the symbol's node.
  • Raw token stream: 8 bits per token x 39 tokens =
    312 bits.
  • Coded data stream: 89 bits.
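
A minimal recursive sketch of the top-down splitting
described above, assuming the (symbol, count) pairs are
already sorted by descending count:

def shannon_fano(symbols):
    # symbols: list of (symbol, count), sorted by descending count
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(c for _, c in symbols)
    best_diff, split = total, 1
    for i in range(1, len(symbols)):
        left = sum(c for _, c in symbols[:i])
        # imbalance between the two halves if we split at i
        diff = abs(total - 2 * left)
        if diff < best_diff:
            best_diff, split = diff, i
    codes = {}
    for prefix, part in (("0", symbols[:split]), ("1", symbols[split:])):
        for sym, code in shannon_fano(part).items():
            codes[sym] = prefix + code
    return codes

counts = [("A", 15), ("B", 7), ("C", 6), ("D", 6), ("E", 5)]
print(shannon_fano(counts))
# -> {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'} (89 bits)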

31
Huffman Coding
  • Based on the frequency of occurrence of a data
    item (pixels or small blocks of pixels in
    images).
  • Uses a lower number of bits to encode more
    frequent data.
  • Codes are stored in a Code Book, as for Shannon-
    Fano (previous slides).
  • The code book is constructed for each image or a
    set of images.
  • The code book plus encoded data must be
    transmitted to enable decoding.

32
Huffman Coding
  • A bottom-up approach (see the sketch after this
    list):
  • Put all nodes in an OPEN list, kept sorted at all
    times (e.g., A B C D E).
  • Repeat until the OPEN list has only one node
    left:
  • From OPEN pick the two nodes having the lowest
    frequencies/probabilities and create a parent
    node for them.
  • Assign the sum of the children's frequencies/
    probabilities to the parent node and insert it
    into OPEN.
  • Assign codes 0 and 1 to the two branches of the
    tree, and delete the children from OPEN.
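
A minimal sketch of the procedure in Python, using a heap
as the sorted OPEN list (the tuple tree representation is
an illustrative choice; ties can produce different but
equally good codes):

import heapq
from itertools import count

def huffman(freqs):
    order = count()                     # tie-breaker for equal frequencies
    heap = [(f, next(order), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)                 # the OPEN list
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)  # two lowest-frequency nodes
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(order), (a, b)))  # parent node
    codes = {}
    def walk(node, code):
        if isinstance(node, tuple):     # internal node: branches 0 and 1
            walk(node[0], code + "0")
            walk(node[1], code + "1")
        else:
            codes[node] = code or "0"   # leaf: record its code
    walk(heap[0][2], "")
    return codes

print(huffman({"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}))
# e.g. {'A': '0', 'E': '100', 'C': '101', 'D': '110', 'B': '111'} (87 bits)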

33
Huffman Coding
  • Example
  • Data: ABBAAAACDEAAABBBDDEEAAA........
  • Count symbols in stream: A 15, B 7, C 6, D 6, E 5
    (as before)

34
Huffman Coding
(Figure: the Huffman tree: A(15) hangs off the root on
branch 0; B(7), C(6), D(6) and E(5) are combined pairwise
under branch 1)
35
Huffman Coding
log2(1/p), where p is the probability of the symbol:

Symbol   Count   log2(1/p)   Code   Length
A        15      1.38        0      1
B         7      2.48        111    3
C         6      2.70        101    3
D         6      2.70        110    3
E         5      2.96        100    3
Total            12.22              13
37
The Huffman Method
  • Example
  • The method counts symbol frequencies, then encodes
    each symbol with a variable-length code: the more
    frequent the symbol, the shorter the code.

40
The Huffman Method
  • Although the number of bits required for
    less-frequently-occurring symbols increases
    rapidly, there is a significant reduction in the
    number of bits required for the overall data, due
    to the savings gained from the
    more-frequently-occurring symbols.
  • Clearly this method requires at least one pass
    through the data to determine the symbol
    frequencies.

41
The Huffman Method
  • For many applications this is inefficient.
  • This method also requires that the look-up table
    of symbol representations accompany the data;
    its size must be considered as an additional
    overhead, since it is combined with the actual
    data.

42
The Huffman Method
  • Further Reading
  • Data Compression (Lelewer and Hirschberg): an
    informative paper on data compression algorithms
    and techniques.
  • Various links to compression papers and
    source code.
  • Data Compression Reference Centre (Huffman): a
    comprehensive site with descriptions and basic
    examples of various compression methods (for
    example, Shannon-Fano), accessible from its home
    page. NOTE: this site is very slow to access.
  • Huffman Coding Example: part of a larger document
    on electrical engineering.
  • The section on Huffman coding: part of the larger
    document Information Engineering Across the
    Professions by David Cyganski, John A. Orr and
    Richard F. Vaz.

43
Arithmetic Coding
  • A widely used entropy coder.
  • Also used in JPEG (more on this soon).
  • Good compression ratio (better than Huffman
    coding); entropy around the Shannon ideal value.
  • Its only problem is speed, due to the possibly
    complex computations required by large symbol
    tables.

44
Arithmetic Coding
  • Why better than Huffman?
  • Huffman coding and similar methods use an integer
    number (k) of bits for each symbol, hence k is
    never less than 1.
  • Sometimes, e.g., when sending a 1-bit image,
    compression becomes impossible.

45
Arithmetic Coding
  • Basic Idea
  • The idea behind arithmetic coding is:
  • to have a probability line, 0 to 1, and
  • to assign every symbol a range on this line based
    on its probability;
  • the higher the probability, the larger the range
    assigned to it.
  • Once we have defined the ranges and the
    probability line:
  • start to encode symbols;
  • every symbol narrows the range in which the
    output floating-point number must land.

46
Arithmetic Coding
  • Example
  • Raw data: BACA
  • Therefore:
  • A occurs with probability 2/4 = 0.5,
  • B and C with probability 1/4 = 0.25 each.
47
Arithmetic Coding
  • Start by assigning each symbol a range within the
    probability line 0 to 1:
  • A: [0, 0.5), B: [0.5, 0.75), C: [0.75, 1.0).

The first symbol in our example stream is B.
48
Arithmetic Coding
(Figure: encoding B narrows the interval from [0, 1) to
[0.5, 0.75), with new sub-ranges A: [0.5, 0.625),
B: [0.625, 0.6875), C: [0.6875, 0.75))
49
Arithmetic Coding
(Figure: encoding A narrows the interval to [0.5, 0.625),
with new sub-ranges A: [0.5, 0.5625), B: [0.5625, 0.59375),
C: [0.59375, 0.625))
50
Arithmetic Coding
(Figure: encoding C narrows the interval to [0.59375,
0.625), with new sub-ranges A: [0.59375, 0.609375),
B: [0.609375, 0.6171875), C: [0.6171875, 0.625))
51
  • Encoding the final A narrows the interval to
    [0.59375, 0.609375), so the (unique) output code
    for BACA is any number in the range
  • 0.59375 to 0.609375.
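
A minimal sketch of this interval narrowing in Python (the
ranges dictionary holds the A/B/C sub-ranges from the
slides above):

def arith_encode(message, ranges):
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        s_low, s_high = ranges[sym]
        # narrow the current interval to the symbol's sub-range
        low, high = low + width * s_low, low + width * s_high
    return low, high

ranges = {"A": (0.0, 0.5), "B": (0.5, 0.75), "C": (0.75, 1.0)}
print(arith_encode("BACA", ranges))   # -> (0.59375, 0.609375)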

52
Example
  • Table 1 shows the symbol distribution of the raw
    data. CAEE (followed by the terminator symbol $)
    is part of the data to be transmitted. How do we
    compress that data using arithmetic coding before
    it can be transmitted?

Table 1: Symbol distribution

Symbol   Probability   Range
A        0.2           [0.00, 0.20)
B        0.1           [0.20, 0.30)
C        0.2           [0.30, 0.50)
D        0.05          [0.50, 0.55)
E        0.3           [0.55, 0.85)
F        0.05          [0.85, 0.90)
$        0.1           [0.90, 1.00)
53
Encoding CAEE$, the interval narrows step by step:

Symbol   New low    New high
C        0.3        0.5
A        0.3        0.34
E        0.322      0.334
E        0.3286     0.3322
$        0.33184    0.3322
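
The same interval narrowing reproduces this trace (using
the Table 1 ranges; floating-point results are
approximate):

ranges = {"A": (0.0, 0.2), "B": (0.2, 0.3), "C": (0.3, 0.5),
          "D": (0.5, 0.55), "E": (0.55, 0.85), "F": (0.85, 0.9),
          "$": (0.9, 1.0)}
low, high = 0.0, 1.0
for sym in "CAEE$":
    width = high - low
    low, high = low + width * ranges[sym][0], low + width * ranges[sym][1]
print(low, high)   # -> ~0.33184, ~0.3322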
56
  • Generating the codeword for the encoder:

BEGIN
  code = 0
  k = 1
  while (value(code) < low)
    assign 1 to the k-th binary fraction bit
    if (value(code) > high)
      replace the k-th bit by 0
    k = k + 1
END
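
A runnable version of this procedure, assuming value(code)
means the decimal value of the binary fraction built so
far:

def generate_codeword(low, high):
    # Build the shortest binary fraction whose value lies in [low, high].
    code, k, bits = 0.0, 1, []
    while code < low:
        code += 2.0 ** -k              # tentatively set bit k to 1
        bits.append("1")
        if code > high:                # overshoot: reset bit k to 0
            code -= 2.0 ** -k
            bits[-1] = "0"
        k += 1
    return "0." + "".join(bits), code

print(generate_codeword(0.59375, 0.609375))  # BACA:  ('0.10011', 0.59375)
print(generate_codeword(0.33184, 0.33220))   # CAEE$: ('0.01010101', 0.33203125)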
57
How to translate a range to bits
  • Example
  • BACA:
  • low = 0.59375, high = 0.609375
  • CAEE$:
  • low = 0.33184, high = 0.3322

58
Decimal
  • 0.12345 = 1 x 10^-1 + 2 x 10^-2 + 3 x 10^-3
            + 4 x 10^-4 + 5 x 10^-5
59
Binary
  • 0.01010 = 0 x 2^-1 + 1 x 2^-2 + 0 x 2^-3
            + 1 x 2^-4 + 0 x 2^-5
60
Binary to decimal
  • What is the value of
  • 0.01010101 (binary) in decimal?

0.01010101 = 2^-2 + 2^-4 + 2^-6 + 2^-8 = 0.33203125
61
Generating codeword for encoder
Range: (0.33184, 0.33220)

BEGIN
  code = 0
  k = 1
  while (value(code) < low)
    assign 1 to the k-th binary fraction bit
    if (value(code) > high)
      replace the k-th bit by 0
    k = k + 1
END
62
Example 1: Range (0.33184, 0.33220)

Apply the procedure with low = 0.33184 and
high = 0.33220, tracking the codeword in binary and its
value in decimal:

BEGIN
  code = 0
  k = 1
  while (value(code) < 0.33184)
    assign 1 to the k-th binary fraction bit
    if (value(code) > 0.33220)
      replace the k-th bit by 0
    k = k + 1
END
63
  • Assign 1 to the first fraction bit (codeword =
    0.1 binary) and compare with the range:
  • value(0.1 binary) = 0.5 decimal > 0.33220 (high)
    -> out of range.
  • Hence, we assign 0 to the first bit.
  • value(0.0 binary) = 0 < 0.33184 (low) -> the
    while loop continues.
  • Assign 1 to the second fraction bit: 0.01 binary
    = 0.25 decimal, which is less than high (0.33220),
    so the bit stays 1.

64
  • Assign 1 to the third fraction bit: 0.011 binary
    = 0.25 + 0.125 = 0.375 decimal, which is bigger
    than high (0.33220), so replace the k-th bit by 0.
    Now the codeword is 0.010 binary.
  • Assign 1 to the fourth fraction bit: 0.0101 binary
    = 0.25 + 0.0625 = 0.3125 decimal, which is less
    than high (0.33220). Now the codeword is 0.0101
    binary.
  • Continue: 0.3125 is still less than low (0.33184),
    so the loop repeats...

65
  • Eventually, the binary codeword generated is
    0.01010101, which is 0.33203125 in decimal.
  • 8 binary fraction bits represent CAEE$.