Title: Compression Techniques
1 Compression Techniques
2 Introduction
- What is Compression?
- Data compression requires the identification and extraction of source redundancy.
- In other words, data compression seeks to reduce the number of bits used to store or transmit information.
- There is a wide range of compression methods, which can be so unlike one another that they have little in common except that they compress data.
3 Introduction
- Compression can be categorized in two broad ways
- Lossless compression
- recovers the exact original data after decompression.
- mainly used for compressing database records, spreadsheets or word-processing files, where exact replication of the original is essential.
- Lossy compression
- results in a certain loss of accuracy in exchange for a substantial increase in compression.
- more effective when used to compress graphic images and digitised voice, where losses outside visual or aural perception can be tolerated.
- Most lossy compression techniques can be adjusted to different quality levels, gaining higher accuracy in exchange for less effective compression.
4 The Need For Compression
- In terms of storage, the capacity of a storage device can be effectively increased with methods that compress a body of data on its way to the storage device and decompress it when it is retrieved.
- In terms of communications, the bandwidth of a digital communication link can be effectively increased by compressing data at the sending end and decompressing it at the receiving end.
5 A Brief History of Data Compression
- The late 1940s were the early years of Information Theory, when the idea of developing efficient new coding methods was just starting to be fleshed out. Ideas of entropy, information content and redundancy were explored.
- One popular notion held that if the probability of symbols in a message were known, there ought to be a way to code the symbols so that the message would take up less space.
6 - The first well-known method for compressing digital signals is now known as Shannon-Fano coding. Shannon and Fano (1948) developed this algorithm at around the same time; it assigns binary codewords to the unique symbols that appear within a given data file.
- While Shannon-Fano coding was a great leap forward, it had the unfortunate luck to be quickly superseded by an even more efficient coding system: Huffman coding.
7 - Huffman coding (1952) shares most characteristics of Shannon-Fano coding.
- Huffman coding could perform effective data compression by reducing the amount of redundancy in the coding of symbols.
- It has been proven to be the most efficient coding method of its type: no other symbol-by-symbol code that assigns a whole number of bits to each symbol can do better.
8 - In the last fifteen years, Huffman coding has been replaced by arithmetic coding.
- Arithmetic coding bypasses the idea of replacing each input symbol with a specific code.
- It replaces a stream of input symbols with a single floating-point output number.
- More bits are needed in the output number for longer, more complex messages.
9 - Dictionary-based compression algorithms use a completely different method to compress data.
- They encode variable-length strings of symbols as single tokens.
- Each token forms an index into a phrase dictionary.
- If the tokens are smaller than the phrases, they replace the phrases and compression occurs.
10 - Two dictionary-based compression techniques, called LZ77 and LZ78, have been developed.
- LZ77 is a "sliding window" technique in which the dictionary consists of the phrases found in a fixed-size "window" into the previously seen text.
- LZ78 takes a completely different approach to building a dictionary.
- Instead of using phrases from a window into the text, LZ78 builds phrases up one symbol at a time, adding a new symbol to an existing phrase when a match occurs.
12 Terminology
- Compressor: software (or hardware) device that compresses data
- Decompressor: software (or hardware) device that decompresses data
- Codec: software (or hardware) device that compresses and decompresses data
- Algorithm: the logic that governs the compression/decompression process
13 Lossless Compression Algorithms
- Repetitive Sequence Suppression
- Run-length Encoding
- Pattern Substitution
- Entropy Encoding
- The Shannon-Fano Algorithm
- Huffman Coding
- Arithmetic Coding
14 Repetitive Sequence Suppression
- Fairly straightforward to understand and implement.
- Simplicity is their downfall: NOT the best compression ratios.
- Some methods have their applications, e.g. a component of JPEG, silence suppression.
15 Repetitive Sequence Suppression
- If a sequence (a series of n successive identical tokens) appears, replace the series with one token and a count of the number of occurrences.
- Usually we need a special flag to denote when the repeated token appears.
- Example (a Python sketch of the idea follows below):
- 89400000000000000000000000000000000
- we can replace this with 894f32, where f is the flag for zero.
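A minimal Python sketch of zero suppression, assuming "f" as the flag symbol (as in the 894f32 example) and an illustrative threshold for when a run is worth flagging:

# Replace runs of '0' longer than a threshold with the flag followed by the run length.
FLAG = "f"

def suppress_zeros(text, threshold=4):
    out, i = [], 0
    while i < len(text):
        if text[i] == "0":
            j = i
            while j < len(text) and text[j] == "0":
                j += 1
            run = j - i
            out.append(FLAG + str(run) if run >= threshold else "0" * run)
            i = j
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

print(suppress_zeros("894" + "0" * 32))   # -> 894f32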
16 Repetitive Sequence Suppression
- How Much Compression?
- Compression savings depend on the content of the data.
- Applications of this simple compression technique include
- Suppression of zeros in a file (Zero Length Suppression)
- Silence in audio data, pauses in conversation, etc.
- Bitmaps
- Blanks in text or program source files
- Backgrounds in images
- Other regular image or data tokens
17 Run-length Encoding
- This encoding method is frequently applied to images (or pixels in a scan line).
- It is a small compression component used in JPEG compression.
- In this instance:
- Sequences of image elements X1, X2, ..., Xn (row by row)
- are mapped to pairs (c1, l1), (c2, l2), ..., (cn, ln), where ci represents the image intensity or colour and li the length of the i-th run of pixels.
18 Run-length Encoding
- Example
- Original sequence:
- 111122233333311112222
- can be encoded as:
- (1,4),(2,3),(3,6),(1,4),(2,4)
- How Much Compression?
- The savings are dependent on the data.
- In the worst case (random noise) the encoding is larger than the original file: 2 integers rather than 1 integer, if the data is represented as integers. (A Python sketch of run-length encoding follows below.)
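A short Python sketch of run-length encoding and decoding over a token sequence, reproducing the example above:

def rle_encode(seq):
    # Scan the sequence and emit (value, run length) pairs.
    pairs, i = [], 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        pairs.append((seq[i], j - i))
        i = j
    return pairs

def rle_decode(pairs):
    # Expand each (value, run length) pair back into a run of tokens.
    out = []
    for value, length in pairs:
        out.extend([value] * length)
    return out

data = [1,1,1,1,2,2,2,3,3,3,3,3,3,1,1,1,1,2,2,2,2]
print(rle_encode(data))                     # -> [(1, 4), (2, 3), (3, 6), (1, 4), (2, 4)]
assert rle_decode(rle_encode(data)) == data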
19 Run-Length Encoding (RLE) Method
20 Run-Length Encoding (RLE) Method
(Figure: a scan line containing runs of blue x 6, magenta x 7, red x 3, yellow x 3 and green x 4)
21 Run-Length Encoding (RLE) Method
(Figure: the resulting run-length encoded data) This would give an encoding which is twice the size!
22 Uncompressed: Blue White White White White White White Blue White Blue White White White White White Blue etc.
Compressed: 1xBlue 6xWhite 1xBlue 1xWhite 1xBlue 4xWhite 1xBlue 1xWhite etc.
23 Run-Length Encoding (RLE) Method
- One advantage of this method is that it is sequential: once a particular series has been counted it can be transmitted.
- Consequently, the principles of this method are also employed by the CCITT codec for fax communication, in conjunction with the Huffman method.
24 Pattern Substitution
- Here we substitute a frequently repeating pattern (or patterns) with a code (a sketch of the idea follows below).
- For example, replace all occurrences of "The" with a predefined code.
- So
- "The code is The Key"
- Becomes
- "code is Key" (with the code symbol substituted for each "The")
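A minimal Python sketch of pattern substitution; the dictionary and the code symbol "&" are invented here purely for illustration:

# Substitute each dictionary pattern with its (shorter) code, and restore it on decode.
dictionary = {"The": "&"}                     # pattern -> code (assumed)
reverse = {code: pat for pat, code in dictionary.items()}

def substitute(text):
    for pattern, code in dictionary.items():
        text = text.replace(pattern, code)
    return text

def restore(text):
    for code, pattern in reverse.items():
        text = text.replace(code, pattern)
    return text

encoded = substitute("The code is The Key")
print(encoded)                                # -> "& code is & Key"
assert restore(encoded) == "The code is The Key"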
25 Entropy Encoding
- Lossless compression frequently involves some form of entropy encoding
- Based on information-theoretic techniques.
- According to Shannon, the entropy of an information source S is defined as
- H(S) = sum over i of pi * log2(1/pi)
- where pi is the probability that symbol Si in S will occur. (A Python sketch of this calculation follows below.)
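A small Python sketch of the entropy calculation, using the symbol counts A = 15, B = 7, C = 6, D = 6, E = 5 from the Shannon-Fano example on the following slides:

from collections import Counter
from math import log2

def entropy(stream):
    # H(S) = sum over symbols of p * log2(1/p)
    counts = Counter(stream)
    total = len(stream)
    return sum((c / total) * log2(total / c) for c in counts.values())

stream = "A" * 15 + "B" * 7 + "C" * 6 + "D" * 6 + "E" * 5
print(round(entropy(stream), 2))   # about 2.19 bits per symbol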
26 The Shannon-Fano Algorithm
- Example
- Data: ABBAAAACDEAAABBBDDEEAAA........
- Count the symbols in the stream: A = 15, B = 7, C = 6, D = 6, E = 5 (39 symbols in total, as used on the following slides).
27 The Shannon-Fano Algorithm
- A top-down approach
- Sort symbols (tree sort) according to their frequencies/probabilities, e.g., ABCDE.
- Recursively divide into two parts, each with approx. the same number of counts.
(Figure: the sorted symbols A, B, C, D, E placed as leaves of the tree)
28 The Shannon-Fano Algorithm
- A top-down approach
- Sort symbols (tree sort) according to their frequencies/probabilities, e.g., ABCDE.
- Recursively divide into two parts, each with approx. the same number of counts.
(Figure: Shannon-Fano tree for A(15), B(7), C(6), D(6), E(5); the first split separates A, B from C, D, E, and each pair of branches is labelled 0 and 1)
29 The Shannon-Fano Algorithm
- A top-down approach
- Sort symbols (tree sort) according to their frequencies/probabilities, e.g., ABCDE.
- Recursively divide into two parts, each with approx. the same number of counts.
30 The Shannon-Fano Algorithm
- Assemble each code by a depth-first traversal of the tree to the symbol's node. (A Python sketch follows below.)
- Raw token stream: 8 bits per token x 39 tokens = 312 bits.
- Coded data stream: 89 bits.
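A Python sketch of Shannon-Fano coding (the top-down recursive split described above), using the counts from the example; it reproduces the 89-bit total quoted on this slide:

def shannon_fano(symbols):
    # symbols: list of (symbol, count) sorted by decreasing count.
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(c for _, c in symbols)
    # Choose the split point that makes the two halves' counts as equal as possible.
    best_diff, split = total, 1
    for i in range(1, len(symbols)):
        left = sum(c for _, c in symbols[:i])
        diff = abs(total - 2 * left)
        if diff < best_diff:
            best_diff, split = diff, i
    codes = {}
    for sym, code in shannon_fano(symbols[:split]).items():
        codes[sym] = "0" + code
    for sym, code in shannon_fano(symbols[split:]).items():
        codes[sym] = "1" + code
    return codes

counts = [("A", 15), ("B", 7), ("C", 6), ("D", 6), ("E", 5)]
codes = shannon_fano(counts)
print(codes)                                       # {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}
print(sum(c * len(codes[s]) for s, c in counts))   # 89 bits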
31 Huffman Coding
- Based on the frequency of occurrence of a data item (pixels or small blocks of pixels in images).
- Uses a lower number of bits to encode more frequent data.
- Codes are stored in a Code Book, as for Shannon-Fano (previous slides).
- The code book is constructed for each image or a set of images.
- The code book plus the encoded data must be transmitted to enable decoding.
32 Huffman Coding
- A bottom-up approach (a Python sketch follows below)
- Put all nodes in an OPEN list, and keep it sorted at all times (e.g., ABCDE).
- Repeat until the OPEN list has only one node left:
- From OPEN pick the two nodes having the lowest frequencies/probabilities, and create a parent node for them.
- Assign the sum of the children's frequencies/probabilities to the parent node and insert it into OPEN.
- Assign codes 0 and 1 to the two branches of the tree, and delete the children from OPEN.
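A Python sketch of the bottom-up Huffman procedure above, using a heap as the sorted OPEN list and the counts A = 15, B = 7, C = 6, D = 6, E = 5 from the example:

import heapq
from itertools import count

def huffman_codes(freqs):
    tiebreak = count()                        # keeps heap entries comparable
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)     # the two lowest-frequency nodes
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):           # internal node: branch on 0 and 1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                 # leaf: a symbol
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

freqs = {"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}
codes = huffman_codes(freqs)
print(codes)
print(sum(freqs[s] * len(c) for s, c in codes.items()))   # 87 bits for the 39 symbols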
33 Huffman Coding
- Example
- Data: ABBAAAACDEAAABBBDDEEAAA........
- Count the symbols in the stream (A = 15, B = 7, C = 6, D = 6, E = 5, as before).
34 Huffman Coding
(Figure: Huffman tree built from the counts A(15), B(7), C(6), D(6), E(5), with each pair of branches labelled 0 and 1)
35 Huffman Coding
(Table: for each symbol, its probability p and information content log2(1/p); the log2(1/p) values sum to 12.22, while the Huffman code lengths sum to 13 bits)
37 The Huffman Method
- Example
- The method counts the symbol frequencies and then encodes each symbol with a variable-length code: the more frequent the symbol, the shorter the code.
38 The Huffman Method
39 The Huffman Method
40 The Huffman Method
- Although the number of bits required for the less-frequently-occurring symbols increases, there is a significant reduction in the number of bits required for the data overall, due to the savings gained on the more-frequently-occurring symbols.
- Clearly this method requires at least one pass through the data to determine the symbol frequencies.
41 The Huffman Method
- For many applications this is inefficient.
- This method also requires that the look-up table of symbol representations accompany the data; its size must be considered an additional overhead when it is combined with the actual data.
42 The Huffman Method
- Further Reading
- Data Compression (Lelewer and Hirschberg): an informative paper on data compression algorithms and techniques.
- Various links to compression papers and source code.
- Data Compression Reference Centre, Huffman: a comprehensive site with descriptions and basic examples of various compression methods (for example, Shannon-Fano) accessible from its home page. NOTE: this site is very slow to access.
- Huffman Coding Example: part of a larger document on electrical engineering.
- The section on Huffman coding: part of the larger document Information Engineering Across the Professions by David Cyganski, John A. Orr and Richard F. Vaz.
43 Arithmetic Coding
- A widely used entropy coder
- Also used in JPEG (more soon)
- Good compression ratio (better than Huffman coding), with entropy around the Shannon ideal value.
- Its only problem is speed, due to the possibly complex computations required by large symbol tables.
44 Arithmetic Coding
- Why better than Huffman?
- Huffman coding etc. use an integer number (k) of bits for each symbol, hence k is never less than 1.
- Sometimes, e.g., when sending a 1-bit image, compression then becomes impossible.
45 Arithmetic Coding
- Basic Idea
- The idea behind arithmetic coding is:
- Have a probability line, 0 to 1, and
- assign to every symbol a range on this line based on its probability;
- the higher the probability, the wider the range assigned to it.
- Once we have defined the ranges and the probability line,
- start to encode symbols;
- every symbol narrows the range within which the output floating-point number must land.
46 Arithmetic Coding
- Example
- Raw data: BACA
- Therefore:
- A occurs with probability 0.5 (2/4),
- B and C with probability 0.25 (1/4) each.
47 Arithmetic Coding
- Start by assigning each symbol a range on the probability line 0 to 1: A = 0 to 0.5, B = 0.5 to 0.75, C = 0.75 to 1.
- The first symbol in our example stream is B.
48 Arithmetic Coding
(Figure: the line 0 to 1 is divided into A = 0-0.5, B = 0.5-0.75, C = 0.75-1; encoding B narrows the range to 0.5-0.75, which is subdivided into A = 0.5-0.625, B = 0.625-0.6875, C = 0.6875-0.75)
49 Arithmetic Coding
(Figure: encoding A narrows the range to 0.5-0.625, which is subdivided into A = 0.5-0.5625, B = 0.5625-0.59375, C = 0.59375-0.625)
50 Arithmetic Coding
(Figure: encoding C narrows the range to 0.59375-0.625, which is subdivided into A = 0.59375-0.609375, B = 0.609375-0.6171875, C = 0.6171875-0.625)
51 - So the (unique) output code for BACA is any number in the range 0.59375 to 0.609375. (A Python sketch of this interval narrowing follows below.)
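A Python sketch of the interval narrowing used above, assuming the symbol ranges A = 0-0.5, B = 0.5-0.75, C = 0.75-1 from the example; it reproduces the range for BACA:

ranges = {"A": (0.0, 0.5), "B": (0.5, 0.75), "C": (0.75, 1.0)}

def encode_interval(message):
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        sym_low, sym_high = ranges[sym]
        # Narrow [low, high) to the sub-range belonging to this symbol.
        low, high = low + width * sym_low, low + width * sym_high
    return low, high

print(encode_interval("BACA"))   # -> (0.59375, 0.609375)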
52 Example
- Table 1 shows a symbol distribution for some raw data. CAEE is part of the data to be transmitted. How can that data be compressed using arithmetic coding before it is transmitted?
Table 1: Symbol distribution
53 CAEE
(Figure: the probability line is narrowed step by step as the symbols C, A, E and E are encoded)
54 CAEE
(Figure: interval narrowing for CAEE, continued)
55 CAEE
(Figure: interval narrowing for CAEE, continued)
56 Generating the codeword for the encoder
BEGIN
  code = 0
  k = 1
  while (value(code) < low)
    assign 1 to the k-th binary fraction bit
    if (value(code) > high)
      replace the k-th bit by 0
    k = k + 1
END
57 How to translate a range into bits
- Example
- BACA
- low = 0.59375, high = 0.609375
- CAEE
- low = 0.33184, high = 0.3322
58 Decimal
- In a decimal fraction, successive digits have the place values 10^-1, 10^-2, 10^-3, 10^-4, 10^-5, ...
59 Binary
- In a binary fraction, successive bits have the place values 2^-1, 2^-2, 2^-3, 2^-4, 2^-5, ...
60 Binary to decimal
- What is the value of 0.01010101 (binary) in decimal?
- 0.01010101 = 2^-2 + 2^-4 + 2^-6 + 2^-8 = 0.25 + 0.0625 + 0.015625 + 0.00390625 = 0.33203125 (a Python sketch follows below).
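A one-function Python sketch of this binary-fraction-to-decimal conversion:

def binary_fraction_to_decimal(bits):
    # bits are the digits after the binary point, most significant first.
    return sum(int(b) * 2 ** -(i + 1) for i, b in enumerate(bits))

print(binary_fraction_to_decimal("01010101"))   # -> 0.33203125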
61 Generating the codeword for the encoder
Range: (0.33184, 0.33220)
BEGIN
  code = 0
  k = 1
  while (value(code) < low)
    assign 1 to the k-th binary fraction bit
    if (value(code) > high)
      replace the k-th bit by 0
    k = k + 1
END
62 Example 1: Range (0.33184, 0.33220)
BEGIN
  code = 0
  k = 1
  while (value(code) < 0.33184)      (low, in decimal)
    assign 1 to the k-th binary fraction bit
    if (value(code) > 0.33220)       (high, in decimal)
      replace the k-th bit by 0
    k = k + 1
END
63 - Assign 1 to the first fraction bit (codeword = 0.1 in binary) and compare it with the range.
- value(0.1 binary) = 0.5 decimal, which is greater than high (0.33220) -> out of range.
- Hence, we replace the first bit with 0.
- value(0.0 binary) = 0 < low (0.33184) -> the while loop continues.
- Assign 1 to the second fraction bit (0.01 binary) = 0.25 decimal, which is less than high (0.33220), so the bit is kept.
64 - Assign 1 to the third fraction bit (0.011 binary) = 0.25 + 0.125 = 0.375 decimal, which is bigger than high (0.33220), so replace the k-th bit with 0. Now the codeword is 0.010 (binary).
- Assign 1 to the fourth fraction bit (0.0101 binary) = 0.25 + 0.0625 = 0.3125 decimal, which is less than high (0.33220). Now the codeword is 0.0101 (binary).
- Continue...
65 - Eventually, the binary codeword generated is 0.01010101, which is 0.33203125 in decimal (inside the range 0.33184 to 0.33220).
- This 8-bit binary fraction represents CAEE. (A Python sketch of the procedure follows below.)