Title: Data Compression
1. Data Compression
2. Data Compression
- Data transmission and storage cost money.
- It is advantageous to represent data files in the most compact form during either transmission or storage.
[Diagram: Native Format → Compressor → Encoded Files (video, audio; e.g., MPEG, MP3) → Storage → Player/Decompressor → End User]
3. Compression
- A compression program is used to convert data from a native format to one optimized for compactness.
- An uncompression program returns the compressed format to a usable form, which may or may not be a replication of the original, depending on whether the compression is lossy or not.
4. Compression
- A lossless compression algorithm means that the restored data is identical to the original. Necessary for many applications: executable code, ASCII files, etc.
- A lossy compression algorithm allows for some degradation. OK for many types of signal applications: audio, video, graphics.
5. Lossy versus Lossless
6Input and Output Group Size
- Most compression algorithms operate by taking a
group of data from the original (native) file,
and writing a compressed group to the output
file. - Some algorithms have a fixed-input fixed output
scheme. That is, a fixed number of bits are read
from the input file and a smaller fixed number of
bits are written to the output file. - CSQ reads 24 bits of an audio file, and outputs
8 bits, resulting in a 31 compression ratio. - Others allow for a variable number of bits to be
read or written -
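The slide does not spell out how CS&Q discards the extra bits; a minimal sketch, assuming the simplest form of coarser quantization (keeping only the 8 most significant bits of each 24-bit sample), is:

```python
def requantize_24_to_8(sample_24bit):
    """Reduce a 24-bit audio sample to 8 bits by discarding the 16 least
    significant bits (coarser quantization). Every 24-bit input group becomes
    an 8-bit output group, a fixed 3:1 compression ratio."""
    return sample_24bit >> 16

# A full-scale 24-bit value maps to a full-scale 8-bit value.
print(requantize_24_to_8(0xFFFFFF))  # 255
```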
7. Input and Output Group Size
8. Run-Length Encoding
- Digitized signals can have runs of the same value over and over again.
- An image of the night sky could contain long runs of the same value representing the black background.
- Digitized albums on CD might have long runs of zeros between tracks.
9. Run-Length Encoding
10. PackBits
- Created for Macintosh users.
- Each byte (8 bits) from the input file is replaced by 9 bits in the compressed file.
- The added bit is interpreted as the sign of the number.
- Each character read from the input file is between 0 and 255 (ASCII).
- Each character added to the encoded file is between -255 and 255.
11. PackBits
- Consider the input file 1,2,3,4,2,2,2,2,4 and the compressed file generated by PackBits: 1,2,3,4,2,-3,4.
- The compressed file replaced 2,2,2,2 with 2,-3.
- The 2 represents the character of the run. The -3 indicates the number of characters in the run, found by taking the absolute value and adding one.
- Essentially, the negative sign indicates that this value is to be interpreted as the run length of the previous character.
- Ex.: 4,-2 means 4,4,4; 21,-4 means 21,21,21,21,21 (the encoder and decoder are sketched below).
Why is the secondary value one less than the character run? Why not just use the exact number?
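A minimal sketch of the run-length scheme described on this slide (a simplified PackBits, not Apple's exact on-disk format); the function names are illustrative:

```python
def packbits_encode(data):
    """Simplified PackBits-style run-length encoder from the slide: a run of
    n identical values is written as the value followed by -(n - 1);
    isolated values are copied through unchanged."""
    out = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1
        run = j - i
        out.append(data[i])
        if run > 1:
            out.append(-(run - 1))
        i = j
    return out

def packbits_decode(encoded):
    """Inverse: a negative value -k repeats the previous value k more times."""
    out = []
    for v in encoded:
        if v < 0:
            out.extend([out[-1]] * (-v))
        else:
            out.append(v)
    return out

print(packbits_encode([1, 2, 3, 4, 2, 2, 2, 2, 4]))  # [1, 2, 3, 4, 2, -3, 4]
print(packbits_decode([1, 2, 3, 4, 2, -3, 4]))       # [1, 2, 3, 4, 2, 2, 2, 2, 4]
```

One plausible answer to the slide's question: a run of length 1 never needs a count, so storing one less than the run length lets the same negative range describe runs one character longer.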
12. Huffman Encoding
- Let us examine the histogram of the byte values
from a sample ASCII file.
13. Huffman Encoding
- 96% of the sample text consists of only 31 characters: the lower case letters, the space, the comma, the period, and the return character.
- This observation can be used to make an appropriate compression scheme for the file.
14. Huffman Encoding
- Assign each of the 31 common characters a five-bit binary code: 00000 = "a", 00001 = "b", 00010 = "c", etc.
- This allows 96% of the file to be reduced in size by 5/8.
- Let 11111 be a flag indicating that the character being transmitted is not one of the 31 common characters, and append the 8 bits of its ASCII code.
- This results in 4% of the characters being represented by 5 + 8 = 13 bits (a quick check of the average cost appears below).
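Using the slide's own percentages, the expected cost per character works out to about 5.3 bits, roughly a 1.5:1 reduction from the original 8 bits per character:

```python
# Average bits per character for the 5-bit/13-bit scheme above:
# 96% of characters take 5 bits; the remaining 4% take 5 + 8 = 13 bits.
avg_bits = 0.96 * 5 + 0.04 * 13
print(avg_bits)      # 5.32 bits per character, versus 8 bits uncompressed
print(8 / avg_bits)  # ~1.5:1 compression ratio
```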
15. Huffman Encoding
- The principle is to assign the vast majority of characters fewer bits and rare characters more bits.
- Huffman encoding takes this principle to the extreme.
- Characters that occur most often, such as the space and the period, may be assigned only one or two bits.
- Characters that occur only rarely may require a dozen or more bits.
- The optimal solution is reached when the number of bits used for each character is proportional to the logarithm of the character's probability of occurrence.
- Huffman encoding is a fixed-input, variable-output scheme; however, the output bits are regrouped into fixed-length bytes for transmission and storage.
16. Huffman Encoding
17. Huffman Encoding
- Figure 27-3 shows a simplified Huffman encoding scheme: the characters A through G occur in the original data stream with the probabilities shown.
- Since the character A is the most common, it is represented with a single bit, the code 1.
- The next most common character, B, receives two bits, the code 01.
- The least frequent character, G, is assigned six bits, the code 000011 (the construction is sketched below).
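A compact sketch of the standard Huffman construction (repeatedly merge the two least probable groups). The probabilities below are illustrative, not the exact values from Figure 27-3, and the exact bit patterns depend on how branches are labelled, but the code lengths come out as on the slide: A gets 1 bit, B gets 2, and G gets 6.

```python
import heapq
from itertools import count

def huffman_codes(probabilities):
    """Build a prefix-free Huffman code from a {symbol: probability} map by
    repeatedly merging the two least probable groups of symbols; one group
    gets a leading '0', the other a leading '1'."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), sym) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    codes = {sym: "" for sym in probabilities}
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)   # groups are strings of merged symbols
        p2, _, group2 = heapq.heappop(heap)
        for sym in group1:
            codes[sym] = "0" + codes[sym]
        for sym in group2:
            codes[sym] = "1" + codes[sym]
        heapq.heappush(heap, (p1 + p2, next(tiebreak), group1 + group2))
    return codes

# Hypothetical probabilities in the spirit of Figure 27-3 (not its exact values).
probs = {"A": 0.45, "B": 0.25, "C": 0.12, "D": 0.08, "E": 0.06, "F": 0.03, "G": 0.01}
print(huffman_codes(probs))   # A: 1 bit, B: 2 bits, ..., G: 6 bits
```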
18. Huffman Encoding
- The variable-length codes are regrouped into eight-bit groups, or bytes, for storage.
- The uncompression program looks at the stream of ones and zeros until a valid code is formed, and then looks for the next character.
- The way that the codes are formed ensures that no ambiguity exists in the separation.
- Note that there is no need for delimiters between the codes, if the code is chosen correctly (a decoding sketch appears below).
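Decoding needs nothing more than accumulating bits until they form a valid codeword; because the code is prefix-free, no delimiters are required. The code table below is illustrative:

```python
def huffman_decode(bitstream, codes):
    """Walk the bit string, emitting a symbol whenever the accumulated bits
    form a valid codeword; a prefix-free code makes the split unambiguous."""
    reverse = {code: sym for sym, code in codes.items()}
    out, buffer = [], ""
    for bit in bitstream:
        buffer += bit
        if buffer in reverse:   # a complete codeword has been formed
            out.append(reverse[buffer])
            buffer = ""
    return "".join(out)

# A small prefix-free code in the spirit of the slide (illustrative values).
codes = {"A": "1", "B": "01", "C": "00"}
print(huffman_decode("10100", codes))   # "ABC"
```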
19. Delta Encoding
- Delta encoding refers to several techniques that store data as the difference between successive samples (or characters).
- The first value in the delta-encoded file is the same as the first value in the original file.
- All the following values in the encoded file are equal to the delta (difference) between the corresponding value in the input file and the previous value in the input file (see the sketch below).
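A minimal sketch of this rule (the function names are illustrative):

```python
def delta_encode(samples):
    """First value is kept as-is; each later value is replaced by its
    difference from the previous original sample."""
    return [samples[0]] + [samples[i] - samples[i - 1] for i in range(1, len(samples))]

def delta_decode(deltas):
    """A running sum undoes the encoding exactly (delta encoding is lossless)."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

print(delta_encode([10, 12, 13, 13, 13, 11]))  # [10, 2, 1, 0, 0, -2]
print(delta_decode([10, 2, 1, 0, 0, -2]))      # [10, 12, 13, 13, 13, 11]
```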
20. Delta Encoding
- Delta encoding is effective when the original data is smooth, that is, when there is very little change between adjacent values.
- This is not the case for ASCII text or executable code, but it is very true for signals.
21. Delta Encoding
- The essential feature is that the delta-encoded signal has a much lower amplitude than the original signal.
- If the original signal is not changing, or is changing in a straight line, delta encoding will result in runs of samples having the same value.
- Delta encoding followed by Huffman and/or run-length encoding is a common strategy for compressing signals.
22. JPEG
- JPEG belongs to a family of lossy compression techniques called transform compression.
- Transform compression is based on the following premise: when the signal is passed through the Fourier (or other) transform, the resulting data values will be unequal in their information-carrying capacity.
- The low-frequency components of a signal are more important than the high-frequency components.
- For example, removing 50% of the bits from the high-frequency components might remove only 5% of the encoded information.
24. JPEG
- JPEG compression starts by breaking the image into 8×8 pixel groups.
- Each pixel is a single byte, a greyscale value between 0 and 255.
- Each group is therefore represented by 64 bytes.
- After transforming and removing data, each group is represented by only 2 to 20 bytes.
- During uncompression, an inverse transform creates an approximation of the original 8×8 group.
25. JPEG
- The transform used for JPEG compression is called the Discrete Cosine Transform (DCT).
- Like the discrete Fourier transform (DFT), the DCT provides information about the signal in the frequency domain.
- Unlike the DFT, the DCT of a real signal is real valued.
26. JPEG
- The transform is linear, so it can be expressed in matrix-vector form (sketched below for N = 8):

X_C^N = C_N x_N

where x_N is the N-vector describing the signal, X_C^N is the N-vector describing the result of the transform, and C_N is a square, nonsingular N×N matrix describing the transform itself. The matrix C_N is real valued.
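A sketch of this matrix form, using the orthonormal DCT-II convention (other normalizations exist; the slide does not specify one). Because this C_N is orthogonal, it is nonsingular and its transpose is its inverse:

```python
import numpy as np

def dct_matrix(N):
    """Orthonormal DCT-II matrix C_N: row k, column n holds
    sqrt(2/N) * cos(pi * (2n + 1) * k / (2N)), with row 0 rescaled to sqrt(1/N)."""
    n = np.arange(N)
    C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
    C[0, :] = np.sqrt(1.0 / N)
    return C

N = 8
C = dct_matrix(N)
x = np.arange(N, dtype=float)            # any real test signal x_N
X = C @ x                                # forward transform: X_C^N = C_N x_N
print(np.allclose(C @ C.T, np.eye(N)))   # True: C_N is orthogonal, hence nonsingular
print(np.allclose(C.T @ X, x))           # True: the transform is exactly invertible
```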
27. JPEG
- The DCT is a transform that maps a block of pixel color values in the spatial domain to values in the frequency domain.
- The DCT can operate mathematically in any dimension; however, an image is a two-dimensional (2-D) surface, so the 2-D DCT is used.
28. JPEG
29. JPEG
- The 2-D DCT is applied to each block so that an 8×8 spectrum (an 8×8 matrix of DCT coefficients) is produced for each 8×8 block.
- 64 individual pixels are transformed into 64 real numbers representing frequency components of the entire block (see the sketch below).
- The top left component of the DCT matrix is called the DC coefficient and can be interpreted as the component responsible for the average background color of the block.
- The remaining 63 components of the DCT matrix are the AC components, representing the frequency components of the image. The further an AC component lies from the DC component, the higher the frequency it represents.
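A sketch of the 2-D transform using the same orthonormal DCT matrix in its separable form (transform the rows, then the columns). The test block is hypothetical:

```python
import numpy as np

N = 8
n = np.arange(N)
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
C[0, :] = np.sqrt(1.0 / N)            # the same orthonormal DCT-II matrix as above

block = np.full((N, N), 128.0)        # hypothetical flat grey 8x8 block...
block[3, 3] = 140.0                   # ...with one slightly brighter pixel

spectrum = C @ block @ C.T            # separable 2-D DCT: rows, then columns
print(round(spectrum[0, 0], 1))       # DC coefficient = 8 * mean of the block = 1025.5

recovered = C.T @ spectrum @ C        # inverse 2-D DCT restores the block exactly
print(np.allclose(recovered, block))  # True
```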
30. JPEG
- The 64 frequency components are described by 64 basis functions, and each value in the DCT matrix is the amplitude of its corresponding 2-D basis function.
31. JPEG
The one-half cycle basis function
(0,1) represents brightness gradually
changing over a region
32. JPEG
- The original image block is recovered from the
DCT coefficients by applying the inverse DCT
(IDCT)
33. JPEG
- Most of the signal amplitude is concentrated in the low-frequency components (the upper left portion of the DCT matrix).
- This means the highest-frequency components can be eliminated while only degrading the signal a small amount (demonstrated in the sketch below).
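A sketch of this effect on a hypothetical smooth 8×8 block: keep only the 4×4 lowest-frequency corner of the spectrum, reconstruct with the inverse transform, and measure the error.

```python
import numpy as np

N = 8
n = np.arange(N)
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
C[0, :] = np.sqrt(1.0 / N)                     # orthonormal DCT-II matrix, as before

ramp = np.linspace(0.0, 1.0, N)
block = 128 + 30 * np.outer(ramp, ramp)        # smooth, slowly varying test block

spectrum = C @ block @ C.T
kept = np.zeros_like(spectrum)
kept[:4, :4] = spectrum[:4, :4]                # discard the 48 highest-frequency coefficients
approx = C.T @ kept @ C                        # reconstruct from the reduced spectrum

print(np.sqrt(np.mean((approx - block) ** 2))) # RMS error well under one grey level
```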
34. JPEG
Reconstruction of the eye image with reduced
spectrum coefficients
35. Quantization Tables
- Each value in the DCT spectrum is divided by the corresponding value in the quantization table, and the result is rounded to the nearest integer (see the sketch below).
- For example, the lower right-hand value of a) is 16, meaning that the original range of -127 to 127 is reduced to -7 to 7; correspondingly, the value has been reduced in precision from 8 bits to 4 bits.
36. Zig-Zag Pattern
- A zig-zag pattern places all the high-frequency components together at the end of the linear sequence (see the sketch below).
- This groups the zeros from the eliminated components into long runs.
- Run-length encoding compresses the zeros.
- The resulting sequence is encoded by Huffman or arithmetic coding.
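A sketch that generates the zig-zag visiting order (shown for a 4×4 block so the output stays short; the 8×8 case works the same way):

```python
def zigzag_order(n=8):
    """Return the (row, col) visiting order that scans an n x n block along
    anti-diagonals, so low-frequency coefficients come first and the mostly
    zero high-frequency coefficients are grouped at the end."""
    order = []
    for s in range(2 * n - 1):                        # s = row + col, one anti-diagonal at a time
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        order.extend(diag if s % 2 else diag[::-1])   # alternate direction on each diagonal
    return order

print(zigzag_order(4))
# [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2), (0, 3), (1, 2), (2, 1), (3, 0), ...]
```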
37. Compression Ratios