1
Data Compression
  • Ajmal Muhammad

2
Lecture outline
  • Introduction
  • Basic definitions
  • Kraft Inequality
  • Optimal codes
  • Huffman codes
  • Shannon-Fano coding
  • Shannon-Fano-Elias coding
  • Arithmetic coding
  • Competitive optimality
  • Generation of random variables

3
Introduction
  • What is Compression?
  • It is a process of deriving more compact (i.e.,
    smaller) representations of data
  • Goal of Compression
  • Significant reduction in the data size to reduce
    the storage/bandwidth requirements
  • Constraints on Compression
  • Perfect or near-perfect reconstruction
    (lossless/lossy)
  • Strategies for Compression
  • Reducing redundancies
  • Exploiting the characteristics of human vision

4
Introduction, cont
  • Strategies for Compression: Reducing
    redundancies
  • Symbol-Level Representation Redundancy
  • Different symbols occur with different
    frequencies
  • Variable-length codes vs. fixed-length codes
  • Frequent symbols are better coded with short
    codes
  • Infrequent symbols are coded with long codes
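As a rough illustration (the symbols, probabilities, and codewords below are invented for this sketch), the gain from matching codeword lengths to symbol frequencies can be checked numerically:

    # Toy source with skewed symbol probabilities (illustrative values only)
    probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

    # A prefix-free variable-length code: frequent symbols get short codewords
    var_code = {"a": "0", "b": "10", "c": "110", "d": "111"}

    fixed_len = 2  # a fixed-length code needs 2 bits for 4 symbols
    var_len = sum(p * len(var_code[s]) for s, p in probs.items())

    print(fixed_len)  # 2 bits/symbol
    print(var_len)    # 1.75 bits/symbol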

5
Introduction, cont
  • Block-Level Representation Redundancy
  • Different blocks of data occur with varying
    frequencies
  • It is then better to code blocks than individual
    symbols
  • The block size can be fixed or variable
  • The block-code size can be fixed or variable
  • Frequent blocks are better coded with short codes

6
Basic definitions
  • Source coding
  • Converts the source symbols from their original
    form into channel symbols, typically {0,1} for a
    binary channel
  • Discrete memoryless source
  • A data generator whose source alphabet is finite
    and whose generated symbols are independent of
    one another
  • Source with Memory
  • Presence of inter-symbol correlation

7
Basic definitions, cont
  • Expected code length
  • L(C) = Σx p(x) l(x), where l(x) is the length of
    the codeword assigned to x
  • Non-singular code
  • Every element of the range of X maps to a
    different codeword string, i.e. x ≠ x' implies
    C(x) ≠ C(x')
  • Extension of a code
  • Mapping from finite strings of X to finite-length
    strings of D, i.e. C(x1 x2 ... xn) =
    C(x1)C(x2)...C(xn)

8
Basic definitions, cont
  • Uniquely decodable code
  • When the extension of the code is non-singular
  • Prefix-free /Instantaneous code
  • No codeword is a prefix of any other codeword
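A minimal sketch, assuming binary codewords given as strings, of checking the prefix-free property directly from the definition above:

    def is_prefix_free(codewords):
        """True if no codeword is a prefix of any other codeword."""
        for c1 in codewords:
            for c2 in codewords:
                if c1 != c2 and c2.startswith(c1):
                    return False
        return True

    print(is_prefix_free(["0", "10", "110", "111"]))  # True: instantaneous code
    print(is_prefix_free(["0", "01", "11"]))          # False: "0" is a prefix of "01"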

9
Kraft Inequality
  • It is a condition determining whether it is
    possible to construct a prefix-free code for a
    given discrete source alphabet X =
    {a1, a2, ..., aM} with a given set of codeword
    lengths li, 1 ≤ i ≤ M:
    Σi D^(-li) ≤ 1
  • Conversely, given a set of codeword lengths that
    satisfies this inequality, there exists a
    prefix-free/instantaneous code with these word
    lengths (a sketch follows below)
  • Here D denotes the size of the alphabet used for
    the codewords, e.g. for a binary channel D = 2,
    i.e. {0,1}
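A short sketch (the lengths below are illustrative) that checks the Kraft inequality for D = 2 and, when it holds, builds one binary prefix-free code with those lengths by assigning codewords in order of increasing length:

    def kraft_sum(lengths, D=2):
        """Sum of D^(-li); a prefix-free code with these lengths exists iff the sum is <= 1."""
        return sum(D ** (-l) for l in lengths)

    def code_from_lengths(lengths):
        """Build one binary prefix-free code for lengths satisfying the Kraft inequality."""
        assert kraft_sum(lengths) <= 1
        codewords, value, prev_len = [], 0, 0
        for l in sorted(lengths):
            value <<= (l - prev_len)              # extend to the new length
            codewords.append(format(value, "0%db" % l))
            value += 1                            # skip past this codeword's subtree
            prev_len = l
        return codewords

    lengths = [1, 2, 3, 3]
    print(kraft_sum(lengths))          # 1.0 -> a prefix-free code exists
    print(code_from_lengths(lengths))  # ['0', '10', '110', '111']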

10
Optimal codes
  • The Kraft Inequality determines which sets of
    codeword lengths are possible for prefix-free
    codes
  • Given a source, we want to determine the set of
    codeword lengths that minimizes the expected
    length of a prefix-free code for that source,
    i.e. we want to minimize the expected length
    subject to the Kraft Inequality
  • Standard optimization problem
  • Minimize L = Σi pi li
  • Subject to Σi D^(-li) ≤ 1 (a derivation sketch
    follows below)
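Where the optimal lengths come from: ignoring the integer constraint on the li, the constrained minimization can be solved with a Lagrange multiplier (a standard calculation, sketched here in LaTeX):

    J = \sum_i p_i l_i + \lambda \Bigl( \sum_i D^{-l_i} - 1 \Bigr)

    \frac{\partial J}{\partial l_i} = p_i - \lambda (\ln D)\, D^{-l_i} = 0
    \quad\Rightarrow\quad D^{-l_i} = \frac{p_i}{\lambda \ln D}

    \sum_i D^{-l_i} = 1 \;\Rightarrow\; \lambda \ln D = 1
    \quad\Rightarrow\quad D^{-l_i^{*}} = p_i
    \quad\Rightarrow\quad l_i^{*} = -\log_D p_i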

11
Optimal codes, cont
  • The optimal codelengths will be li* = -logD pi
  • Expected codeword length L* = Σi pi li* = HD(X)
  • li must be an integer, so it will not always be
    possible to set the codeword lengths equal to
    -logD pi
  • The expected length L will be greater than or
    equal to the entropy, i.e. L ≥ HD(X), with
    equality iff pi = D^(-li) for all i
  • Bounds on the optimal codelength:
    HD(X) ≤ L* < HD(X) + 1

12
Getting closer to H(X)
  • Reduce the overhead per symbol by spreading it
    out over many symbols
  • Consider a sequence of n source symbols from X
  • The symbols are assumed to be drawn i.i.d., so
    H(X1, ..., Xn) = nH(X)
  • The expected codeword length Ln per input symbol
    will be Ln = (1/n) E[l(X1, ..., Xn)]
  • The bounds will be
    nH(X) ≤ E[l(X1, ..., Xn)] < nH(X) + 1
  • Dividing by n: H(X) ≤ Ln < H(X) + 1/n

13
Huffman codes
  • Special prefix-free codes that can be shown to be
    optimal, i.e. to have the shortest expected
    length
  • Algorithm (a sketch in Python follows below)
  • Arrange the source symbols in decreasing order of
    probability
  • Assign 1 to the last digit of the codeword of Xn
    and 0 to the last digit of the codeword of Xn-1
    (the two least probable symbols)
  • Combine pn and pn-1 to form a new set of
    probabilities
  • If left with just one symbol then done, otherwise
    repeat the above steps
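A compact sketch of this construction in Python (the symbols and probabilities are illustrative; heapq is used to repeatedly pick the two least probable groups):

    import heapq
    from itertools import count

    def huffman_code(probs):
        """Binary Huffman code for a dict {symbol: probability}."""
        tiebreak = count()  # keeps the heap comparable when probabilities tie
        heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, c1 = heapq.heappop(heap)   # least probable group -> gets "1"
            p2, _, c2 = heapq.heappop(heap)   # next least probable  -> gets "0"
            merged = {s: "0" + w for s, w in c2.items()}
            merged.update({s: "1" + w for s, w in c1.items()})
            heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
        return heap[0][2]

    probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
    print(huffman_code(probs))  # one optimal code, e.g. a->'1', b->'01', c->'000', d->'001'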

14
Huffman codes, cont
  • There are many optimal codes, but all optimal
    codes share some properties
  • If pj > pk, then lj ≤ lk
  • The two longest codewords have the same length
  • The two longest codewords differ only in the
    last bit and correspond to the two least likely
    symbols

15
Shannon-Fano coding
  • Suboptimal but simple technique for constructing
    a source code (a sketch follows below)
  • The source symbols and their probabilities of
    occurring are listed in decreasing order
  • The list is then divided so as to form two groups
    of as nearly equal total probability as possible
  • Each symbol in the first group receives 0 as the
    first digit of its codeword, while the symbols in
    the second group have codewords beginning with 1
  • Each of these groups is then divided according
    to the same criterion, and additional code digits
    are appended
  • The process is continued until each subset
    contains only one symbol
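A recursive sketch of this splitting procedure (symbols and probabilities are illustrative; ties in the split point can be resolved differently without changing the idea):

    def shannon_fano(symbols):
        """symbols: list of (symbol, probability) sorted by decreasing probability.
        Returns a dict {symbol: codeword}."""
        if len(symbols) == 1:
            return {symbols[0][0]: ""}
        total = sum(p for _, p in symbols)
        acc, best_k, best_diff = 0.0, 1, float("inf")
        for k in range(1, len(symbols)):      # split giving nearly equal groups
            acc += symbols[k - 1][1]
            diff = abs(acc - (total - acc))
            if diff < best_diff:
                best_diff, best_k = diff, k
        code = {}
        for s, w in shannon_fano(symbols[:best_k]).items():
            code[s] = "0" + w                 # first group starts with 0
        for s, w in shannon_fano(symbols[best_k:]).items():
            code[s] = "1" + w                 # second group starts with 1
        return code

    probs = [("a", 0.35), ("b", 0.25), ("c", 0.2), ("d", 0.15), ("e", 0.05)]
    print(shannon_fano(probs))  # {'a': '00', 'b': '01', 'c': '10', 'd': '110', 'e': '111'}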

16
Shannon-Fano-Elias coding
  • Uses the cumulative distribution function to
    allot codewords, i.e. codes a (truncated) binary
    representation of the cumulative distribution
    function
  • Consider the random variable X taking as values
    the m letters of an alphabet a1, a2, ..., am, and
    for the letter ai the probability mass function
    is p(X = ai) = p(ai) > 0
  • The cumulative distribution function is
    F(ai) = Σ(j ≤ i) p(aj)
  • We assume the lexicographic ordering relation,
    i.e. ai < aj if i < j

17
Shannon-Fano-Elias coding, cont
  • y = F(x) is a staircase function, with a jump of
    size p(ak) at x = ak
  • If all p(ai) > 0, an arbitrary value y with
    F(a(k-1)) < y ≤ F(ak) uniquely determines the
    symbol ak, as only that symbol obeys both
    inequalities
  • To avoid dealing with interval boundaries, define
    F̄(ai) = F(a(i-1)) + p(ai)/2
  • The values of F̄ are the midways of the steps in
    the distribution plot
  • If F̄(ai), or a sufficiently accurate
    approximation of it, is given, we can find ai
  • F̄(ai) needs to be represented in about
    ⌈log2(1/p(ai))⌉ + 1 bits (see the sketch below)
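A minimal sketch of this construction (the alphabet and probabilities are illustrative): each symbol is coded by the first ⌈log2(1/p(ai))⌉ + 1 bits of the binary expansion of the midpoint F̄(ai).

    from math import ceil, log2

    def sfe_code(probs):
        """Shannon-Fano-Elias code for a dict {symbol: probability} in a fixed order."""
        code, F = {}, 0.0
        for s, p in probs.items():
            Fbar = F + p / 2                  # midpoint of the step at s
            l = ceil(log2(1 / p)) + 1         # number of bits to keep
            bits, frac = "", Fbar
            for _ in range(l):                # truncated binary expansion of Fbar
                frac *= 2
                bit = int(frac)
                bits += str(bit)
                frac -= bit
            code[s] = bits
            F += p                            # advance the cumulative distribution
        return code

    probs = {"a": 0.25, "b": 0.5, "c": 0.125, "d": 0.125}
    print(sfe_code(probs))  # {'a': '001', 'b': '10', 'c': '1101', 'd': '1111'}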

18
Arithmetic codes
  • Motivation
  • Huffman codes are optimal codes; however, their
    average length can exceed the entropy by up to 1
    bit
  • To reach an average codelength closer to the
    entropy, Huffman coding is applied to blocks of
    symbols, but the size of the Huffman table needed
    to store the code increases exponentially with
    the length of the block
  • If during encoding we improve our knowledge of
    the symbol probabilities, the Huffman table has
    to be redesigned
  • To encode long blocks of symbols, or to adapt the
    code to a new distribution, the solution is
    arithmetic coding

19
Arithmetic codes, cont
  • Principle
  • Similar to Shannon-Fano-Elias coding, i.e. the
    cumulative distribution is used to find the codes
  • Efficiently calculate the probability mass
    function p(x1...xn) and the cumulative
    distribution function F(x1...xn) for the source
    sequence x1...xn
  • Use any number in the interval
    (F(x1...xn) - p(x1...xn), F(x1...xn)] as the code
    for x1...xn
  • Expressing this number with an accuracy of about
    log(1/p(x1...xn)) bits gives a code for the
    source sequence, and the codewords for different
    sequences will be different

20
Arithmetic codes, cont
  • We can construct a prefix-free set by using
    ⌈log(1/p(x1...xn))⌉ + 2 bits to round the
    midpoint of the interval (see the sketch below)
  • The most used models for computing the
    probabilities are i.i.d. sources and Markov
    sources
  • For i.i.d. sources: p(x1...xn) = Π p(xi)
  • For Markov sources:
    p(x1...xn) = p(x1) Π p(xi | x(i-1))
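A floating-point sketch of the interval computation for an i.i.d. source (symbols, probabilities, and the sequence are illustrative; practical coders use integer arithmetic and renormalize the interval incrementally):

    from math import ceil, log2

    def sequence_interval(sequence, probs):
        """Narrow [low, low + width) symbol by symbol for an i.i.d. source; the
        final interval has width p(x1...xn) and upper end F(x1...xn)."""
        order = list(probs)
        low, width = 0.0, 1.0
        for x in sequence:
            cum = sum(probs[s] for s in order[:order.index(x)])
            low += width * cum                # shift to the symbol's sub-interval
            width *= probs[x]                 # shrink by the symbol's probability
        return low, width

    def encode(sequence, probs):
        low, width = sequence_interval(sequence, probs)
        l = ceil(log2(1 / width)) + 2         # enough bits for a prefix-free code
        mid = low + width / 2                 # round the midpoint of the interval
        return format(int(mid * 2 ** l), "0%db" % l)

    probs = {"a": 0.6, "b": 0.3, "c": 0.1}
    print(encode("abba", probs))  # a 7-bit codeword inside the interval for "abba"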

21
Competitive optimality
  • Let X be a discrete random variable drawn
    according to a probability mass function p(x),
    and suppose p(x) is dyadic, i.e. log(1/p(x)) is
    an integer for each x
  • Then the binary codelength assignment
    l(x) = log(1/p(x)) dominates any other uniquely
    decodable assignment l'(x) in expected length, in
    the sense that E[l(X)] ≤ E[l'(X)], indicating
    optimality in long-run performance
  • l(x) also competitively dominates l'(x), in the
    sense that P(l(X) < l'(X)) ≥ P(l(X) > l'(X)),
    which indicates that l(x) is also optimal in the
    short run
  • If p(x) is not dyadic, then the codelengths
    l(x) = ⌈log(1/p(x))⌉ dominate l'(x) + 1 in
    expected length and competitively dominate
    l'(x) + 1

22
Generation of random variables
  • When a random source is compressed into a
    sequence of bits so that the average length is
    minimized, the encoded sequence is essentially
    incompressible, and therefore has an entropy rate
    close to 1 bit per symbol
  • The bits of the encoded sequence are essentially
    fair coin flips
  • Opposite direction: how many fair coin flips does
    it take to generate a random variable X drawn
    according to some specified probability mass
    function p?

23
Generation of random variables, cont
  • We map strings of bits Z1Z2........ to possible
    outcomes X by a binary tree, where the leaves are
    marked by output symbols X and the path to the
    leaves is given by the sequence of bits produced
    by the fair coin
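A minimal sketch for a dyadic distribution (the alphabet, probabilities, and tree below are illustrative): each fair bit moves one level down the tree, and the output symbol is the leaf that is reached.

    import random

    # Leaves of a binary tree, identified by their bit-paths from the root.
    # X takes values a, b, c with the dyadic pmf p = (1/2, 1/4, 1/4).
    leaves = {"0": "a", "10": "b", "11": "c"}

    def generate(flip=lambda: random.randint(0, 1)):
        """Flip a fair coin to descend the tree until a leaf is reached."""
        path = ""
        while path not in leaves:
            path += str(flip())
        return leaves[path]

    counts = {"a": 0, "b": 0, "c": 0}
    for _ in range(10000):
        counts[generate()] += 1
    print(counts)  # roughly 5000 / 2500 / 2500; expected flips = H(X) = 1.5 bits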

24
www.liu.se