Title: Compression Techniques
1 Compression Techniques
2 Introduction
- What is Compression?
- Data compression requires the identification and extraction of source redundancy.
- In other words, data compression seeks to reduce the number of bits used to store or transmit information.
- There is a wide range of compression methods, which can be so unlike one another that they have little in common except that they compress data.
3 Introduction
- Compression can be categorized in two broad ways
- Lossless compression
- recovers the exact original data after decompression.
- mainly used for compressing database records, spreadsheets or word-processing files, where exact replication of the original is essential.
- Lossy compression
- results in a certain loss of accuracy in exchange for a substantial increase in compression.
- more effective when used to compress graphic images and digitised voice, where losses outside visual or aural perception can be tolerated.
- Most lossy compression techniques can be adjusted to different quality levels, gaining higher accuracy in exchange for less effective compression.
4 The Need For Compression
- In terms of storage, the capacity of a storage device can be effectively increased with methods that compress a body of data on its way to the storage device and decompress it when it is retrieved.
- In terms of communications, the bandwidth of a digital communication link can be effectively increased by compressing data at the sending end and decompressing it at the receiving end.
5 A Brief History of Data Compression
- The late 1940s were the early years of Information Theory, when the idea of developing efficient new coding methods was just starting to be fleshed out. Ideas of entropy, information content and redundancy were explored.
- One popular notion held that if the probability of symbols in a message were known, there ought to be a way to code the symbols so that the message would take up less space.
6 - The first well-known method for compressing digital signals is now known as Shannon-Fano coding. Shannon and Fano (1948) developed this algorithm at around the same time; it assigns binary codewords to the unique symbols that appear within a given data file.
- While Shannon-Fano coding was a great leap forward, it had the unfortunate luck to be quickly superseded by an even more efficient coding system: Huffman coding.
7 - Huffman coding (1952) shares most characteristics of Shannon-Fano coding.
- Huffman coding could perform effective data compression by reducing the amount of redundancy in the coding of symbols.
- It has been proven to be the most efficient coding method of its type: no other symbol-by-symbol code that assigns a whole number of bits to each symbol can do better.
8 - In the last fifteen years, Huffman coding has been replaced by arithmetic coding.
- Arithmetic coding bypasses the idea of replacing each input symbol with a specific code.
- It replaces a stream of input symbols with a single floating-point output number.
- More bits are needed in the output number for longer, more complex messages.
9 - Dictionary-based compression algorithms use a completely different method to compress data.
- They encode variable-length strings of symbols as single tokens.
- Each token forms an index into a phrase dictionary.
- If the tokens are smaller than the phrases, they replace the phrases and compression occurs.
10 - Two dictionary-based compression techniques, called LZ77 and LZ78, have been developed.
- LZ77 is a "sliding window" technique in which the dictionary consists of the phrases found in a fixed-size "window" into the previously seen text.
- LZ78 takes a completely different approach to building a dictionary.
- Instead of using phrases from a window into the text, LZ78 builds phrases up one symbol at a time, adding a new symbol to an existing phrase when a match occurs.
12 Terminology
- Compressor: software (or hardware) device that compresses data
- Decompressor: software (or hardware) device that decompresses data
- Codec: software (or hardware) device that compresses and decompresses data
- Algorithm: the logic that governs the compression/decompression process
13 Lossless Compression Algorithms
- Repetitive Sequence Suppression
- Run-length Encoding
- Pattern Substitution
- Entropy Encoding
- The Shannon-Fano Algorithm
- Huffman Coding
- Arithmetic Coding
14 Repetitive Sequence Suppression
- Fairly straightforward to understand and implement.
- Simplicity is their downfall: NOT the best compression ratios.
- Some methods have their applications, e.g. a component of JPEG, silence suppression.
15 Repetitive Sequence Suppression
- If a sequence (a series of n successive identical tokens) appears, replace the series with one token and a count of the number of occurrences.
- Usually we need a special flag to denote when the repeated token appears.
- Example (a Python sketch of the idea follows below):
- 89400000000000000000000000000000000
- we can replace this with 894f32, where f is the flag for zero.
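A minimal Python sketch of zero suppression, assuming "f" as the flag symbol (as in the 894f32 example) and an illustrative threshold for when a run is worth flagging:

# Replace runs of '0' longer than a threshold with the flag followed by the run length.
FLAG = "f"

def suppress_zeros(text, threshold=4):
    out, i = [], 0
    while i < len(text):
        if text[i] == "0":
            j = i
            while j < len(text) and text[j] == "0":
                j += 1
            run = j - i
            out.append(FLAG + str(run) if run >= threshold else "0" * run)
            i = j
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

print(suppress_zeros("894" + "0" * 32))   # -> 894f32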
16 Repetitive Sequence Suppression
- How Much Compression?
- Compression savings depend on the content of the data.
- Applications of this simple compression technique include
- Suppression of zeros in a file (Zero Length Suppression)
- Silence in audio data, pauses in conversation, etc.
- Bitmaps
- Blanks in text or program source files
- Backgrounds in images
- Other regular image or data tokens
17 Run-length Encoding
- This encoding method is frequently applied to images (or pixels in a scan line).
- It is a small compression component used in JPEG compression.
- In this instance:
- Sequences of image elements X1, X2, ..., Xn (row by row)
- are mapped to pairs (c1, l1), (c2, l2), ..., (cn, ln), where ci represents the image intensity or colour and li the length of the i-th run of pixels.
18 Run-length Encoding
- Example
- Original sequence:
- 111122233333311112222
- can be encoded as:
- (1,4),(2,3),(3,6),(1,4),(2,4)
- How Much Compression?
- The savings are dependent on the data.
- In the worst case (random noise) the encoding is larger than the original file: 2 integers rather than 1 integer, if the data is represented as integers. (A Python sketch of run-length encoding follows below.)
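A short Python sketch of run-length encoding and decoding over a token sequence, reproducing the example above:

def rle_encode(seq):
    # Scan the sequence and emit (value, run length) pairs.
    pairs, i = [], 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        pairs.append((seq[i], j - i))
        i = j
    return pairs

def rle_decode(pairs):
    # Expand each (value, run length) pair back into a run of tokens.
    out = []
    for value, length in pairs:
        out.extend([value] * length)
    return out

data = [1,1,1,1,2,2,2,3,3,3,3,3,3,1,1,1,1,2,2,2,2]
print(rle_encode(data))                     # -> [(1, 4), (2, 3), (3, 6), (1, 4), (2, 4)]
assert rle_decode(rle_encode(data)) == data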
19 Run-Length Encoding (RLE) Method
20 Run-Length Encoding (RLE) Method
(Figure: a scan line containing runs of blue x 6, magenta x 7, red x 3, yellow x 3 and green x 4)
21 Run-Length Encoding (RLE) Method
(Figure: the resulting run-length encoded data) This would give an encoding which is twice the size!
22 Uncompressed: Blue White White White White White White Blue White Blue White White White White White Blue etc.
Compressed: 1xBlue 6xWhite 1xBlue 1xWhite 1xBlue 4xWhite 1xBlue 1xWhite etc.
23 Run-Length Encoding (RLE) Method
- One advantage of this method is that it is sequential: once a particular series has been counted it can be transmitted.
- Consequently, the principles of this method are also employed by the CCITT codec for fax communication, in conjunction with the Huffman method.
24 Pattern Substitution
- Here we substitute a frequently repeating pattern (or patterns) with a code (a sketch of the idea follows below).
- For example, replace all occurrences of "The" with a predefined code.
- So
- "The code is The Key"
- Becomes
- "code is Key" (with the code symbol substituted for each "The")
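A minimal Python sketch of pattern substitution; the dictionary and the code symbol "&" are invented here purely for illustration:

# Substitute each dictionary pattern with its (shorter) code, and restore it on decode.
dictionary = {"The": "&"}                     # pattern -> code (assumed)
reverse = {code: pat for pat, code in dictionary.items()}

def substitute(text):
    for pattern, code in dictionary.items():
        text = text.replace(pattern, code)
    return text

def restore(text):
    for code, pattern in reverse.items():
        text = text.replace(code, pattern)
    return text

encoded = substitute("The code is The Key")
print(encoded)                                # -> "& code is & Key"
assert restore(encoded) == "The code is The Key"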
25 Entropy Encoding
- Lossless compression frequently involves some form of entropy encoding
- Based on information-theoretic techniques.
- According to Shannon, the entropy of an information source S is defined as
- H(S) = sum over i of pi * log2(1/pi)
- where pi is the probability that symbol Si in S will occur. (A Python sketch of this calculation follows below.)
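A small Python sketch of the entropy calculation, using the symbol counts A = 15, B = 7, C = 6, D = 6, E = 5 from the Shannon-Fano example on the following slides:

from collections import Counter
from math import log2

def entropy(stream):
    # H(S) = sum over symbols of p * log2(1/p)
    counts = Counter(stream)
    total = len(stream)
    return sum((c / total) * log2(total / c) for c in counts.values())

stream = "A" * 15 + "B" * 7 + "C" * 6 + "D" * 6 + "E" * 5
print(round(entropy(stream), 2))   # about 2.19 bits per symbol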
26 The Shannon-Fano Algorithm
- Example
- Data: ABBAAAACDEAAABBBDDEEAAA........
- Count the symbols in the stream: A = 15, B = 7, C = 6, D = 6, E = 5 (39 symbols in total, as used on the following slides).
27 The Shannon-Fano Algorithm
- A top-down approach
- Sort symbols (tree sort) according to their frequencies/probabilities, e.g., ABCDE.
- Recursively divide into two parts, each with approx. the same number of counts.
(Figure: the sorted symbols A, B, C, D, E placed as leaves of the tree)
28 The Shannon-Fano Algorithm
- A top-down approach
- Sort symbols (tree sort) according to their frequencies/probabilities, e.g., ABCDE.
- Recursively divide into two parts, each with approx. the same number of counts.
(Figure: Shannon-Fano tree for A(15), B(7), C(6), D(6), E(5); the first split separates A, B from C, D, E, and each pair of branches is labelled 0 and 1)
29 The Shannon-Fano Algorithm
- A top-down approach
- Sort symbols (tree sort) according to their frequencies/probabilities, e.g., ABCDE.
- Recursively divide into two parts, each with approx. the same number of counts.
30 The Shannon-Fano Algorithm
- Assemble each code by a depth-first traversal of the tree to the symbol's node. (A Python sketch follows below.)
- Raw token stream: 8 bits per token x 39 tokens = 312 bits.
- Coded data stream: 89 bits.
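A Python sketch of Shannon-Fano coding (the top-down recursive split described above), using the counts from the example; it reproduces the 89-bit total quoted on this slide:

def shannon_fano(symbols):
    # symbols: list of (symbol, count) sorted by decreasing count.
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(c for _, c in symbols)
    # Choose the split point that makes the two halves' counts as equal as possible.
    best_diff, split = total, 1
    for i in range(1, len(symbols)):
        left = sum(c for _, c in symbols[:i])
        diff = abs(total - 2 * left)
        if diff < best_diff:
            best_diff, split = diff, i
    codes = {}
    for sym, code in shannon_fano(symbols[:split]).items():
        codes[sym] = "0" + code
    for sym, code in shannon_fano(symbols[split:]).items():
        codes[sym] = "1" + code
    return codes

counts = [("A", 15), ("B", 7), ("C", 6), ("D", 6), ("E", 5)]
codes = shannon_fano(counts)
print(codes)                                       # {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}
print(sum(c * len(codes[s]) for s, c in counts))   # 89 bits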
31 Huffman Coding
- Based on the frequency of occurrence of a data item (pixels or small blocks of pixels in images).
- Uses a lower number of bits to encode more frequent data.
- Codes are stored in a Code Book, as for Shannon-Fano (previous slides).
- The code book is constructed for each image or a set of images.
- The code book plus the encoded data must be transmitted to enable decoding.
32 Huffman Coding
- A bottom-up approach (a Python sketch follows below)
- Put all nodes in an OPEN list, and keep it sorted at all times (e.g., ABCDE).
- Repeat until the OPEN list has only one node left:
- From OPEN pick the two nodes having the lowest frequencies/probabilities, and create a parent node for them.
- Assign the sum of the children's frequencies/probabilities to the parent node and insert it into OPEN.
- Assign codes 0 and 1 to the two branches of the tree, and delete the children from OPEN.
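A Python sketch of the bottom-up Huffman procedure above, using a heap as the sorted OPEN list and the counts A = 15, B = 7, C = 6, D = 6, E = 5 from the example:

import heapq
from itertools import count

def huffman_codes(freqs):
    tiebreak = count()                        # keeps heap entries comparable
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)     # the two lowest-frequency nodes
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):           # internal node: branch on 0 and 1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                 # leaf: a symbol
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

freqs = {"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}
codes = huffman_codes(freqs)
print(codes)
print(sum(freqs[s] * len(c) for s, c in codes.items()))   # 87 bits for the 39 symbols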
33 Huffman Coding
- Example
- Data: ABBAAAACDEAAABBBDDEEAAA........
- Count the symbols in the stream (A = 15, B = 7, C = 6, D = 6, E = 5, as before).
34 Huffman Coding
(Figure: Huffman tree built from the counts A(15), B(7), C(6), D(6), E(5), with each pair of branches labelled 0 and 1)
35 Huffman Coding
(Table: for each symbol, its probability p and information content log2(1/p); the log2(1/p) values sum to 12.22, while the Huffman code lengths sum to 13 bits)
37 The Huffman Method
- Example
- The method counts the symbol frequencies and then encodes each symbol with a variable-length code: the more frequent the symbol, the shorter the code.
38 The Huffman Method
39 The Huffman Method
40 The Huffman Method
- Although the number of bits required for the less-frequently-occurring symbols increases, there is a significant reduction in the number of bits required for the data overall, due to the savings gained on the more-frequently-occurring symbols.
- Clearly this method requires at least one pass through the data to determine the symbol frequencies.
41 The Huffman Method
- For many applications this is inefficient.
- This method also requires that the look-up table of symbol representations accompany the data; its size must be considered an additional overhead when it is combined with the actual data.
42 The Huffman Method
- Further Reading
- Data Compression (Lelewer and Hirschberg): an informative paper on data compression algorithms and techniques.
- Various links to compression papers and source code.
- Data Compression Reference Centre, Huffman: a comprehensive site with descriptions and basic examples of various compression methods (for example, Shannon-Fano) accessible from its home page. NOTE: this site is very slow to access.
- Huffman Coding Example: part of a larger document on electrical engineering.
- The section on Huffman coding: part of the larger document Information Engineering Across the Professions by David Cyganski, John A. Orr and Richard F. Vaz.
43 Arithmetic Coding
- A widely used entropy coder
- Also used in JPEG (more soon)
- Good compression ratio (better than Huffman coding), with entropy around the Shannon ideal value.
- Its only problem is speed, due to the possibly complex computations required by large symbol tables.
44 Arithmetic Coding
- Why better than Huffman?
- Huffman coding etc. use an integer number (k) of bits for each symbol, hence k is never less than 1.
- Sometimes, e.g., when sending a 1-bit image, compression then becomes impossible.
45 Arithmetic Coding
- Basic Idea
- The idea behind arithmetic coding is:
- Have a probability line, 0 to 1, and
- assign to every symbol a range on this line based on its probability;
- the higher the probability, the wider the range assigned to it.
- Once we have defined the ranges and the probability line,
- start to encode symbols;
- every symbol narrows the range within which the output floating-point number must land.
46 Arithmetic Coding
- Example
- Raw data: BACA
- Therefore:
- A occurs with probability 0.5 (2/4),
- B and C with probability 0.25 (1/4) each.
47 Arithmetic Coding
- Start by assigning each symbol a range on the probability line 0 to 1: A = 0 to 0.5, B = 0.5 to 0.75, C = 0.75 to 1.
- The first symbol in our example stream is B.
48 Arithmetic Coding
(Figure: the line 0 to 1 is divided into A = 0-0.5, B = 0.5-0.75, C = 0.75-1; encoding B narrows the range to 0.5-0.75, which is subdivided into A = 0.5-0.625, B = 0.625-0.6875, C = 0.6875-0.75)
49 Arithmetic Coding
(Figure: encoding A narrows the range to 0.5-0.625, which is subdivided into A = 0.5-0.5625, B = 0.5625-0.59375, C = 0.59375-0.625)
50 Arithmetic Coding
(Figure: encoding C narrows the range to 0.59375-0.625, which is subdivided into A = 0.59375-0.609375, B = 0.609375-0.6171875, C = 0.6171875-0.625)
51 - So the (unique) output code for BACA is any number in the range 0.59375 to 0.609375. (A Python sketch of this interval narrowing follows below.)
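A Python sketch of the interval narrowing used above, assuming the symbol ranges A = 0-0.5, B = 0.5-0.75, C = 0.75-1 from the example; it reproduces the range for BACA:

ranges = {"A": (0.0, 0.5), "B": (0.5, 0.75), "C": (0.75, 1.0)}

def encode_interval(message):
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        sym_low, sym_high = ranges[sym]
        # Narrow [low, high) to the sub-range belonging to this symbol.
        low, high = low + width * sym_low, low + width * sym_high
    return low, high

print(encode_interval("BACA"))   # -> (0.59375, 0.609375)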
52 Example
- Table 1 shows a symbol distribution for some raw data. CAEE is part of the data to be transmitted. How can that data be compressed using arithmetic coding before it is transmitted?
Table 1: Symbol distribution
53 CAEE
(Figure: the probability line is narrowed step by step as the symbols C, A, E and E are encoded)
54 CAEE
(Figure: interval narrowing for CAEE, continued)
55 CAEE
(Figure: interval narrowing for CAEE, continued)
56 Generating the codeword for the encoder
BEGIN
  code = 0
  k = 1
  while (value(code) < low)
    assign 1 to the k-th binary fraction bit
    if (value(code) > high)
      replace the k-th bit by 0
    k = k + 1
END
57 How to translate a range into bits
- Example
- BACA
- low = 0.59375, high = 0.609375
- CAEE
- low = 0.33184, high = 0.3322
58 Decimal
- In a decimal fraction, successive digits have the place values 10^-1, 10^-2, 10^-3, 10^-4, 10^-5, ...
59 Binary
- In a binary fraction, successive bits have the place values 2^-1, 2^-2, 2^-3, 2^-4, 2^-5, ...
60 Binary to decimal
- What is the value of 0.01010101 (binary) in decimal?
- 0.01010101 = 2^-2 + 2^-4 + 2^-6 + 2^-8 = 0.25 + 0.0625 + 0.015625 + 0.00390625 = 0.33203125 (a Python sketch follows below).
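A one-function Python sketch of this binary-fraction-to-decimal conversion:

def binary_fraction_to_decimal(bits):
    # bits are the digits after the binary point, most significant first.
    return sum(int(b) * 2 ** -(i + 1) for i, b in enumerate(bits))

print(binary_fraction_to_decimal("01010101"))   # -> 0.33203125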
61 Generating the codeword for the encoder
Range: (0.33184, 0.33220)
BEGIN
  code = 0
  k = 1
  while (value(code) < low)
    assign 1 to the k-th binary fraction bit
    if (value(code) > high)
      replace the k-th bit by 0
    k = k + 1
END
62 Example 1: Range (0.33184, 0.33220)
BEGIN
  code = 0
  k = 1
  while (value(code) < 0.33184)      (low, in decimal)
    assign 1 to the k-th binary fraction bit
    if (value(code) > 0.33220)       (high, in decimal)
      replace the k-th bit by 0
    k = k + 1
END
63 - Assign 1 to the first fraction bit (codeword = 0.1 in binary) and compare it with the range.
- value(0.1 binary) = 0.5 decimal, which is greater than high (0.33220) -> out of range.
- Hence, we replace the first bit with 0.
- value(0.0 binary) = 0 < low (0.33184) -> the while loop continues.
- Assign 1 to the second fraction bit (0.01 binary) = 0.25 decimal, which is less than high (0.33220), so the bit is kept.
64 - Assign 1 to the third fraction bit (0.011 binary) = 0.25 + 0.125 = 0.375 decimal, which is bigger than high (0.33220), so replace the k-th bit with 0. Now the codeword is 0.010 (binary).
- Assign 1 to the fourth fraction bit (0.0101 binary) = 0.25 + 0.0625 = 0.3125 decimal, which is less than high (0.33220). Now the codeword is 0.0101 (binary).
- Continue...
65 - Eventually, the binary codeword generated is 0.01010101, which is 0.33203125 in decimal (inside the range 0.33184 to 0.33220).
- This 8-bit binary fraction represents CAEE. (A Python sketch of the procedure follows below.)