Title: Data Compression
1. Data Compression
2. Data Compression
- Data transmission and storage cost money.
- It is advantageous to represent data files in the most compact form during either transmission or storage.
[Diagram: Native Format → Compressor → Encoded Files (video, audio; e.g., MPEG, MP3) → Storage → Player/Decompressor → End User]
3. Compression
- A compression program is used to convert data from a native format to one optimized for compactness.
- An uncompression program returns the compressed format to a usable form, which may or may not be a replication of the original, depending on whether the compression is lossy or not.
4. Compression
- A lossless compression algorithm means that the restored data is identical to the original. Necessary for many applications: executable code, ASCII files, etc.
- A lossy compression algorithm allows for some degradation. OK for many types of signal applications: audio, video, graphics.
5. Lossy versus Lossless
6Input and Output Group Size
- Most compression algorithms operate by taking a
group of data from the original (native) file,
and writing a compressed group to the output
file. - Some algorithms have a fixed-input fixed output
scheme. That is, a fixed number of bits are read
from the input file and a smaller fixed number of
bits are written to the output file. - CSQ reads 24 bits of an audio file, and outputs
8 bits, resulting in a 31 compression ratio. - Others allow for a variable number of bits to be
read or written -
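The slide does not spell out how CS&Q discards the extra bits; a minimal sketch, assuming the simplest form of coarser quantization (keeping only the 8 most significant bits of each 24-bit sample), is:

```python
def requantize_24_to_8(sample_24bit):
    """Reduce a 24-bit audio sample to 8 bits by discarding the 16 least
    significant bits (coarser quantization). Every 24-bit input group becomes
    an 8-bit output group, a fixed 3:1 compression ratio."""
    return sample_24bit >> 16

# A full-scale 24-bit value maps to a full-scale 8-bit value.
print(requantize_24_to_8(0xFFFFFF))  # 255
```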
7. Input and Output Group Size
8. Run-Length Encoding
- Digitized signals can have runs of the same value over and over again.
- An image of the night sky could contain long runs of the same value representing the black background.
- Digitized albums on CD might have long runs of zeros between tracks.
9. Run-Length Encoding
10. PackBits
- Created for Macintosh users.
- Each byte (8 bits) from the input file is replaced by 9 bits in the compressed file.
- The added bit is interpreted as the sign of the number.
- Each character read from the input file is between 0 and 255 (ASCII).
- Each character added to the encoded file is between -255 and 255.
11. PackBits
- Consider the input file 1,2,3,4,2,2,2,2,4 and the compressed file generated by PackBits: 1,2,3,4,2,-3,4.
- The compressed file replaced 2,2,2,2 with 2,-3.
- The 2 represents the character of the run. The -3 indicates the number of characters in the run, found by taking the absolute value and adding one.
- Essentially, the negative sign indicates that this value is to be interpreted as the run length of the previous character.
- Ex.: 4,-2 means 4,4,4; 21,-4 means 21,21,21,21,21 (the encoder and decoder are sketched below).
Why is the secondary value one less than the character run? Why not just use the exact number?
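A minimal sketch of the run-length scheme described on this slide (a simplified PackBits, not Apple's exact on-disk format); the function names are illustrative:

```python
def packbits_encode(data):
    """Simplified PackBits-style run-length encoder from the slide: a run of
    n identical values is written as the value followed by -(n - 1);
    isolated values are copied through unchanged."""
    out = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1
        run = j - i
        out.append(data[i])
        if run > 1:
            out.append(-(run - 1))
        i = j
    return out

def packbits_decode(encoded):
    """Inverse: a negative value -k repeats the previous value k more times."""
    out = []
    for v in encoded:
        if v < 0:
            out.extend([out[-1]] * (-v))
        else:
            out.append(v)
    return out

print(packbits_encode([1, 2, 3, 4, 2, 2, 2, 2, 4]))  # [1, 2, 3, 4, 2, -3, 4]
print(packbits_decode([1, 2, 3, 4, 2, -3, 4]))       # [1, 2, 3, 4, 2, 2, 2, 2, 4]
```

One plausible answer to the slide's question: a run of length 1 never needs a count, so storing one less than the run length lets the same negative range describe runs one character longer.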
12. Huffman Encoding
- Let us examine the histogram of the byte values
from a sample ASCII file.
13. Huffman Encoding
- 96% of the sample text consists of only 31 characters: the lower case letters, the space, the comma, the period, and the return character.
- This observation can be used to make an appropriate compression scheme for the file.
14. Huffman Encoding
- Assign each of the 31 common characters a five-bit binary code: 00000 = "a", 00001 = "b", 00010 = "c", etc.
- This allows 96% of the file to be reduced in size by 5/8.
- Let 11111 be a flag indicating that the character being transmitted is not one of the 31 common characters, and append the 8 bits of its ASCII code.
- This results in 4% of the characters being represented by 5 + 8 = 13 bits (a quick check of the average cost appears below).
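Using the slide's own percentages, the expected cost per character works out to about 5.3 bits, roughly a 1.5:1 reduction from the original 8 bits per character:

```python
# Average bits per character for the 5-bit/13-bit scheme above:
# 96% of characters take 5 bits; the remaining 4% take 5 + 8 = 13 bits.
avg_bits = 0.96 * 5 + 0.04 * 13
print(avg_bits)      # 5.32 bits per character, versus 8 bits uncompressed
print(8 / avg_bits)  # ~1.5:1 compression ratio
```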
15. Huffman Encoding
- The principle is to assign the vast majority of characters fewer bits and rare characters more bits.
- Huffman encoding takes this principle to the extreme.
- Characters that occur most often, such as the space and the period, may be assigned only one or two bits.
- Characters that occur only rarely may require a dozen or more bits.
- The optimal solution is reached when the number of bits used for each character is proportional to the logarithm of the character's probability of occurrence.
- Huffman encoding is a fixed-input, variable-output scheme; however, the output bits are regrouped into fixed-length bytes for transmission and storage.
16. Huffman Encoding
17. Huffman Encoding
- Figure 27-3 shows a simplified Huffman encoding scheme: the characters A through G occur in the original data stream with the probabilities shown.
- Since the character A is the most common, it is represented with a single bit, the code 1.
- The next most common character, B, receives two bits, the code 01.
- The least frequent character, G, is assigned six bits, the code 000011 (the construction is sketched below).
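A compact sketch of the standard Huffman construction (repeatedly merge the two least probable groups). The probabilities below are illustrative, not the exact values from Figure 27-3, and the exact bit patterns depend on how branches are labelled, but the code lengths come out as on the slide: A gets 1 bit, B gets 2, and G gets 6.

```python
import heapq
from itertools import count

def huffman_codes(probabilities):
    """Build a prefix-free Huffman code from a {symbol: probability} map by
    repeatedly merging the two least probable groups of symbols; one group
    gets a leading '0', the other a leading '1'."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), sym) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    codes = {sym: "" for sym in probabilities}
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)   # groups are strings of merged symbols
        p2, _, group2 = heapq.heappop(heap)
        for sym in group1:
            codes[sym] = "0" + codes[sym]
        for sym in group2:
            codes[sym] = "1" + codes[sym]
        heapq.heappush(heap, (p1 + p2, next(tiebreak), group1 + group2))
    return codes

# Hypothetical probabilities in the spirit of Figure 27-3 (not its exact values).
probs = {"A": 0.45, "B": 0.25, "C": 0.12, "D": 0.08, "E": 0.06, "F": 0.03, "G": 0.01}
print(huffman_codes(probs))   # A: 1 bit, B: 2 bits, ..., G: 6 bits
```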
18. Huffman Encoding
- The variable-length codes are regrouped into eight-bit groups, or bytes, for storage.
- The uncompression program looks at the stream of ones and zeros until a valid code is formed, and then looks for the next character.
- The way that the codes are formed ensures that no ambiguity exists in the separation.
- Note that there is no need for delimiters between the codes, if the code is chosen correctly (a decoding sketch appears below).
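Decoding needs nothing more than accumulating bits until they form a valid codeword; because the code is prefix-free, no delimiters are required. The code table below is illustrative:

```python
def huffman_decode(bitstream, codes):
    """Walk the bit string, emitting a symbol whenever the accumulated bits
    form a valid codeword; a prefix-free code makes the split unambiguous."""
    reverse = {code: sym for sym, code in codes.items()}
    out, buffer = [], ""
    for bit in bitstream:
        buffer += bit
        if buffer in reverse:   # a complete codeword has been formed
            out.append(reverse[buffer])
            buffer = ""
    return "".join(out)

# A small prefix-free code in the spirit of the slide (illustrative values).
codes = {"A": "1", "B": "01", "C": "00"}
print(huffman_decode("10100", codes))   # "ABC"
```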
19. Delta Encoding
- Delta encoding refers to several techniques that store data as the difference between successive samples (or characters).
- The first value in the delta-encoded file is the same as the first value in the original file.
- All the following values in the encoded file are equal to the delta (difference) between the corresponding value in the input file and the previous value in the input file (see the sketch below).
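A minimal sketch of this rule (the function names are illustrative):

```python
def delta_encode(samples):
    """First value is kept as-is; each later value is replaced by its
    difference from the previous original sample."""
    return [samples[0]] + [samples[i] - samples[i - 1] for i in range(1, len(samples))]

def delta_decode(deltas):
    """A running sum undoes the encoding exactly (delta encoding is lossless)."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

print(delta_encode([10, 12, 13, 13, 13, 11]))  # [10, 2, 1, 0, 0, -2]
print(delta_decode([10, 2, 1, 0, 0, -2]))      # [10, 12, 13, 13, 13, 11]
```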
20. Delta Encoding
- Delta encoding is effective when the original data is smooth, that is, when there is very little change between adjacent values.
- This is not the case for ASCII text or executable code, but it is very true for signals.
21. Delta Encoding
- The essential feature is that the delta-encoded signal has a much lower amplitude than the original signal.
- If the original signal is not changing, or is changing in a straight line, delta encoding will result in runs of samples having the same value.
- Delta encoding followed by Huffman and/or run-length encoding is a common strategy for compressing signals.
22. JPEG
- JPEG belongs to a family of lossy compression techniques called transform compression.
- Transform compression is based on the following premise: when the signal is passed through the Fourier (or other) transform, the resulting data values will be unequal in their information-carrying capacity.
- The low-frequency components of a signal are more important than the high-frequency components.
- For example, removing 50% of the bits from the high-frequency components might remove only 5% of the encoded information.
24. JPEG
- JPEG compression starts by breaking the image into 8×8 pixel groups.
- Each pixel is a single byte, a greyscale value between 0 and 255.
- Each group is therefore represented by 64 bytes.
- After transforming and removing data, each group is represented by only 2 to 20 bytes.
- During uncompression, an inverse transform creates an approximation of the original 8×8 group.
25. JPEG
- The transform used for JPEG compression is called the Discrete Cosine Transform (DCT).
- Like the discrete Fourier transform (DFT), the DCT provides information about the signal in the frequency domain.
- Unlike the DFT, the DCT of a real signal is real valued.
26. JPEG
- The transform is linear, so it can be expressed in matrix-vector form (sketched below for N = 8):

X_C^N = C_N x_N

where x_N is the N-vector describing the signal, X_C^N is the N-vector describing the result of the transform, and C_N is a square, nonsingular N×N matrix describing the transform itself. The matrix C_N is real valued.
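A sketch of this matrix form, using the orthonormal DCT-II convention (other normalizations exist; the slide does not specify one). Because this C_N is orthogonal, it is nonsingular and its transpose is its inverse:

```python
import numpy as np

def dct_matrix(N):
    """Orthonormal DCT-II matrix C_N: row k, column n holds
    sqrt(2/N) * cos(pi * (2n + 1) * k / (2N)), with row 0 rescaled to sqrt(1/N)."""
    n = np.arange(N)
    C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
    C[0, :] = np.sqrt(1.0 / N)
    return C

N = 8
C = dct_matrix(N)
x = np.arange(N, dtype=float)            # any real test signal x_N
X = C @ x                                # forward transform: X_C^N = C_N x_N
print(np.allclose(C @ C.T, np.eye(N)))   # True: C_N is orthogonal, hence nonsingular
print(np.allclose(C.T @ X, x))           # True: the transform is exactly invertible
```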
27. JPEG
- The DCT is a transform that maps a block of pixel color values in the spatial domain to values in the frequency domain.
- The DCT can operate mathematically in any dimension; however, an image is a two-dimensional (2-D) surface, so the 2-D DCT is used.
28. JPEG
29. JPEG
- The 2-D DCT is applied to each block so that an 8×8 spectrum (an 8×8 matrix of DCT coefficients) is produced for each 8×8 block.
- 64 individual pixels are transformed into 64 real numbers representing frequency components of the entire block (see the sketch below).
- The top left component of the DCT matrix is called the DC coefficient and can be interpreted as the component responsible for the average background color of the block.
- The remaining 63 components of the DCT matrix are the AC components, representing the frequency components of the image. The further an AC component lies from the DC component, the higher the frequency it represents.
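A sketch of the 2-D transform using the same orthonormal DCT matrix in its separable form (transform the rows, then the columns). The test block is hypothetical:

```python
import numpy as np

N = 8
n = np.arange(N)
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
C[0, :] = np.sqrt(1.0 / N)            # the same orthonormal DCT-II matrix as above

block = np.full((N, N), 128.0)        # hypothetical flat grey 8x8 block...
block[3, 3] = 140.0                   # ...with one slightly brighter pixel

spectrum = C @ block @ C.T            # separable 2-D DCT: rows, then columns
print(round(spectrum[0, 0], 1))       # DC coefficient = 8 * mean of the block = 1025.5

recovered = C.T @ spectrum @ C        # inverse 2-D DCT restores the block exactly
print(np.allclose(recovered, block))  # True
```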
30. JPEG
- The 64 frequency components are described by 64 basis functions, and each value in the DCT matrix is the amplitude of its corresponding 2-D basis function.
31. JPEG
The one-half cycle basis function
(0,1) represents brightness gradually
changing over a region
32. JPEG
- The original image block is recovered from the
DCT coefficients by applying the inverse DCT
(IDCT)
33. JPEG
- Most of the signal amplitude is concentrated in the low-frequency components (the upper left portion of the DCT matrix).
- This means the highest-frequency components can be eliminated while only degrading the signal a small amount (demonstrated in the sketch below).
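A sketch of this effect on a hypothetical smooth 8×8 block: keep only the 4×4 lowest-frequency corner of the spectrum, reconstruct with the inverse transform, and measure the error.

```python
import numpy as np

N = 8
n = np.arange(N)
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
C[0, :] = np.sqrt(1.0 / N)                     # orthonormal DCT-II matrix, as before

ramp = np.linspace(0.0, 1.0, N)
block = 128 + 30 * np.outer(ramp, ramp)        # smooth, slowly varying test block

spectrum = C @ block @ C.T
kept = np.zeros_like(spectrum)
kept[:4, :4] = spectrum[:4, :4]                # discard the 48 highest-frequency coefficients
approx = C.T @ kept @ C                        # reconstruct from the reduced spectrum

print(np.sqrt(np.mean((approx - block) ** 2))) # RMS error well under one grey level
```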
34. JPEG
Reconstruction of the eye image with reduced
spectrum coefficients
35. Quantization Tables
- Each value in the DCT spectrum is divided by the corresponding value in the quantization table, and the result is rounded to the nearest integer (see the sketch below).
- For example, the lower right-hand value of a) is 16, meaning that the original range of -127 to 127 is reduced to -7 to 7; correspondingly, the value has been reduced in precision from 8 bits to 4 bits.
36. Zig-Zag Pattern
- A zig-zag pattern places all the high-frequency components together at the end of the linear sequence (see the sketch below).
- This groups the zeros from the eliminated components into long runs.
- Run-length encoding compresses the zeros.
- The resulting sequence is encoded by Huffman or arithmetic coding.
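A sketch that generates the zig-zag visiting order (shown for a 4×4 block so the output stays short; the 8×8 case works the same way):

```python
def zigzag_order(n=8):
    """Return the (row, col) visiting order that scans an n x n block along
    anti-diagonals, so low-frequency coefficients come first and the mostly
    zero high-frequency coefficients are grouped at the end."""
    order = []
    for s in range(2 * n - 1):                        # s = row + col, one anti-diagonal at a time
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        order.extend(diag if s % 2 else diag[::-1])   # alternate direction on each diagonal
    return order

print(zigzag_order(4))
# [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2), (0, 3), (1, 2), (2, 1), (3, 0), ...]
```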
37. Compression Ratios