1
Data Compression
  • Ajmal Muhammad

2
Lecture outline
  • Introduction
  • Basic definitions
  • Kraft Inequality
  • Optimal codes
  • Huffman codes
  • Shannon-Fano coding
  • Shannon-Fano-Elias coding
  • Arithmetic coding
  • Competitive optimality
  • Generation of random variables

3
Introduction
  • What is Compression?
  • It is a process of deriving more compact (i.e.,
    smaller) representations of data
  • Goal of Compression
  • Significant reduction in the data size to reduce
    the storage/bandwidth requirements
  • Constraints on Compression
  • Perfect or near-perfect reconstruction
    (lossless/lossy)
  • Strategies for Compression
  • Reducing redundancies
  • Exploiting the characteristics of human vision

4
Introduction, cont
  • Strategies for Compression: Reducing
    redundancies
  • Symbol-Level Representation Redundancy
  • Different symbols occur with different
    frequencies
  • Variable-length codes vs. fixed-length codes
  • Frequent symbols are better coded with short
    codes
  • Infrequent symbols are coded with long codes
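As a rough illustration (the symbols, probabilities, and codewords below are invented for this sketch), the gain from matching codeword lengths to symbol frequencies can be checked numerically:

    # Toy source with skewed symbol probabilities (illustrative values only)
    probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

    # A prefix-free variable-length code: frequent symbols get short codewords
    var_code = {"a": "0", "b": "10", "c": "110", "d": "111"}

    fixed_len = 2  # a fixed-length code needs 2 bits for 4 symbols
    var_len = sum(p * len(var_code[s]) for s, p in probs.items())

    print(fixed_len)  # 2 bits/symbol
    print(var_len)    # 1.75 bits/symbol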

5
Introduction, cont
  • Block-Level Representation Redundancy
  • Different blocks of data occur with varying
    frequencies
  • It is then better to code blocks than individual
    symbols
  • The block size can be fixed or variable
  • The block-code size can be fixed or variable
  • Frequent blocks are better coded with short codes

6
Basic definitions
  • Source coding
  • Converts the source symbols from their original
    form into channel symbols, typically {0,1} for a
    binary channel
  • Discrete memoryless source
  • A data generator whose source alphabet is finite
    and whose generated symbols are independent of
    one another
  • Source with Memory
  • Presence of inter-symbol correlation

7
Basic definitions, cont
  • Expected code length
  • L(C) = Σx p(x) l(x), where l(x) is the length of
    the codeword assigned to x
  • Non-singular code
  • Every element of the range of X maps to a
    different codeword string, i.e. x ≠ x' implies
    C(x) ≠ C(x')
  • Extension of a code
  • Mapping from finite strings of X to finite-length
    strings of D, i.e. C(x1 x2 ... xn) =
    C(x1)C(x2)...C(xn)

8
Basic definitions, cont
  • Uniquely decodable code
  • When the extension of the code is non-singular
  • Prefix-free /Instantaneous code
  • No codeword is a prefix of any other codeword
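A minimal sketch, assuming binary codewords given as strings, of checking the prefix-free property directly from the definition above:

    def is_prefix_free(codewords):
        """True if no codeword is a prefix of any other codeword."""
        for c1 in codewords:
            for c2 in codewords:
                if c1 != c2 and c2.startswith(c1):
                    return False
        return True

    print(is_prefix_free(["0", "10", "110", "111"]))  # True: instantaneous code
    print(is_prefix_free(["0", "01", "11"]))          # False: "0" is a prefix of "01"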

9
Kraft Inequality
  • It is a condition determining whether it is
    possible to construct a prefix-free code for a
    given discrete source alphabet X =
    {a1, a2, ..., aM} with a given set of codeword
    lengths li, 1 ≤ i ≤ M:
    Σi D^(-li) ≤ 1
  • Conversely, given a set of codeword lengths that
    satisfies this inequality, there exists a
    prefix-free/instantaneous code with these word
    lengths (a sketch follows below)
  • Here D denotes the size of the alphabet used for
    the codewords, e.g. for a binary channel D = 2,
    i.e. {0,1}
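A short sketch (the lengths below are illustrative) that checks the Kraft inequality for D = 2 and, when it holds, builds one binary prefix-free code with those lengths by assigning codewords in order of increasing length:

    def kraft_sum(lengths, D=2):
        """Sum of D^(-li); a prefix-free code with these lengths exists iff the sum is <= 1."""
        return sum(D ** (-l) for l in lengths)

    def code_from_lengths(lengths):
        """Build one binary prefix-free code for lengths satisfying the Kraft inequality."""
        assert kraft_sum(lengths) <= 1
        codewords, value, prev_len = [], 0, 0
        for l in sorted(lengths):
            value <<= (l - prev_len)              # extend to the new length
            codewords.append(format(value, "0%db" % l))
            value += 1                            # skip past this codeword's subtree
            prev_len = l
        return codewords

    lengths = [1, 2, 3, 3]
    print(kraft_sum(lengths))          # 1.0 -> a prefix-free code exists
    print(code_from_lengths(lengths))  # ['0', '10', '110', '111']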

10
Optimal codes
  • The Kraft Inequality determines which sets of
    codeword lengths are possible for prefix-free
    codes
  • Given a source, we want to determine the set of
    codeword lengths that minimizes the expected
    length of a prefix-free code for that source,
    i.e. we want to minimize the expected length
    subject to the Kraft Inequality
  • Standard optimization problem
  • Minimize L = Σi pi li
  • Subject to Σi D^(-li) ≤ 1 (a derivation sketch
    follows below)
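Where the optimal lengths come from: ignoring the integer constraint on the li, the constrained minimization can be solved with a Lagrange multiplier (a standard calculation, sketched here in LaTeX):

    J = \sum_i p_i l_i + \lambda \Bigl( \sum_i D^{-l_i} - 1 \Bigr)

    \frac{\partial J}{\partial l_i} = p_i - \lambda (\ln D)\, D^{-l_i} = 0
    \quad\Rightarrow\quad D^{-l_i} = \frac{p_i}{\lambda \ln D}

    \sum_i D^{-l_i} = 1 \;\Rightarrow\; \lambda \ln D = 1
    \quad\Rightarrow\quad D^{-l_i^{*}} = p_i
    \quad\Rightarrow\quad l_i^{*} = -\log_D p_i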

11
Optimal codes, cont
  • The optimal codelengths will be li* = -logD pi
  • Expected codeword length L* = Σi pi li* = HD(X)
  • li must be an integer, so it will not always be
    possible to set the codeword lengths equal to
    -logD pi
  • The expected length L will be greater than or
    equal to the entropy, i.e. L ≥ HD(X), with
    equality iff pi = D^(-li) for all i
  • Bounds on the optimal codelength:
    HD(X) ≤ L* < HD(X) + 1

12
Getting closer to H(X)
  • Reduce the overhead per symbol by spreading it
    out over many symbols
  • Consider a sequence of n source symbols from X
  • The symbols are assumed to be drawn i.i.d., so
    H(X1, ..., Xn) = nH(X)
  • The expected codeword length Ln per input symbol
    will be Ln = (1/n) E[l(X1, ..., Xn)]
  • The bounds will be
    nH(X) ≤ E[l(X1, ..., Xn)] < nH(X) + 1
  • Dividing by n: H(X) ≤ Ln < H(X) + 1/n

13
Huffman codes
  • Special prefix-free codes that can be shown to be
    optimal, i.e. to have the shortest expected
    length
  • Algorithm (a sketch in Python follows below)
  • Arrange the source symbols in decreasing order of
    probability
  • Assign 1 to the last digit of the codeword of Xn
    and 0 to the last digit of the codeword of Xn-1
    (the two least probable symbols)
  • Combine pn and pn-1 to form a new set of
    probabilities
  • If left with just one symbol then done, otherwise
    repeat the above steps
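A compact sketch of this construction in Python (the symbols and probabilities are illustrative; heapq is used to repeatedly pick the two least probable groups):

    import heapq
    from itertools import count

    def huffman_code(probs):
        """Binary Huffman code for a dict {symbol: probability}."""
        tiebreak = count()  # keeps the heap comparable when probabilities tie
        heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, c1 = heapq.heappop(heap)   # least probable group -> gets "1"
            p2, _, c2 = heapq.heappop(heap)   # next least probable  -> gets "0"
            merged = {s: "0" + w for s, w in c2.items()}
            merged.update({s: "1" + w for s, w in c1.items()})
            heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
        return heap[0][2]

    probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
    print(huffman_code(probs))  # one optimal code, e.g. a->'1', b->'01', c->'000', d->'001'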

14
Huffman codes, cont
  • There are many optimal codes, but all optimal
    codes share some properties
  • If pj > pk, then lj ≤ lk
  • The two longest codewords have the same length
  • The two longest codewords differ only in the
    last bit and correspond to the two least likely
    symbols

15
Shannon-Fano coding
  • Suboptimal but simple technique for constructing
    a source code (a sketch follows below)
  • The source symbols and their probabilities of
    occurring are listed in decreasing order
  • The list is then divided so as to form two groups
    of as nearly equal total probability as possible
  • Each symbol in the first group receives 0 as the
    first digit of its codeword, while the symbols in
    the second group have codewords beginning with 1
  • Each of these groups is then divided according
    to the same criterion, and additional code digits
    are appended
  • The process is continued until each subset
    contains only one symbol
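A recursive sketch of this splitting procedure (symbols and probabilities are illustrative; ties in the split point can be resolved differently without changing the idea):

    def shannon_fano(symbols):
        """symbols: list of (symbol, probability) sorted by decreasing probability.
        Returns a dict {symbol: codeword}."""
        if len(symbols) == 1:
            return {symbols[0][0]: ""}
        total = sum(p for _, p in symbols)
        acc, best_k, best_diff = 0.0, 1, float("inf")
        for k in range(1, len(symbols)):      # split giving nearly equal groups
            acc += symbols[k - 1][1]
            diff = abs(acc - (total - acc))
            if diff < best_diff:
                best_diff, best_k = diff, k
        code = {}
        for s, w in shannon_fano(symbols[:best_k]).items():
            code[s] = "0" + w                 # first group starts with 0
        for s, w in shannon_fano(symbols[best_k:]).items():
            code[s] = "1" + w                 # second group starts with 1
        return code

    probs = [("a", 0.35), ("b", 0.25), ("c", 0.2), ("d", 0.15), ("e", 0.05)]
    print(shannon_fano(probs))  # {'a': '00', 'b': '01', 'c': '10', 'd': '110', 'e': '111'}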

16
Shannon-Fano-Elias coding
  • Uses the cumulative distribution function to
    allot codewords, i.e. codes a (truncated) binary
    representation of the cumulative distribution
    function
  • Consider the random variable X taking as values
    the m letters of an alphabet a1, a2, ..., am, and
    for the letter ai the probability mass function
    is p(X = ai) = p(ai) > 0
  • The cumulative distribution function is
    F(ai) = Σ(j ≤ i) p(aj)
  • We assume the lexicographic ordering relation,
    i.e. ai < aj if i < j

17
Shannon-Fano-Elias coding, cont
  • y = F(x) is a staircase function, with a jump of
    size p(ak) at x = ak
  • If all p(ai) > 0, an arbitrary value y with
    F(a(k-1)) < y ≤ F(ak) uniquely determines the
    symbol ak, as only that symbol obeys both
    inequalities
  • To avoid dealing with interval boundaries, define
    F̄(ai) = F(a(i-1)) + p(ai)/2
  • The values of F̄ are the midways of the steps in
    the distribution plot
  • If F̄(ai), or a sufficiently accurate
    approximation of it, is given, we can find ai
  • F̄(ai) needs to be represented in about
    ⌈log2(1/p(ai))⌉ + 1 bits (see the sketch below)
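A minimal sketch of this construction (the alphabet and probabilities are illustrative): each symbol is coded by the first ⌈log2(1/p(ai))⌉ + 1 bits of the binary expansion of the midpoint F̄(ai).

    from math import ceil, log2

    def sfe_code(probs):
        """Shannon-Fano-Elias code for a dict {symbol: probability} in a fixed order."""
        code, F = {}, 0.0
        for s, p in probs.items():
            Fbar = F + p / 2                  # midpoint of the step at s
            l = ceil(log2(1 / p)) + 1         # number of bits to keep
            bits, frac = "", Fbar
            for _ in range(l):                # truncated binary expansion of Fbar
                frac *= 2
                bit = int(frac)
                bits += str(bit)
                frac -= bit
            code[s] = bits
            F += p                            # advance the cumulative distribution
        return code

    probs = {"a": 0.25, "b": 0.5, "c": 0.125, "d": 0.125}
    print(sfe_code(probs))  # {'a': '001', 'b': '10', 'c': '1101', 'd': '1111'}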

18
Arithmetic codes
  • Motivation
  • Huffman codes are optimal codes; however, their
    average length can exceed the entropy by up to 1
    bit
  • To reach an average codelength closer to the
    entropy, Huffman coding is applied to blocks of
    symbols, but the size of the Huffman table needed
    to store the code increases exponentially with
    the length of the block
  • If during encoding we improve our knowledge of
    the symbol probabilities, the Huffman table has
    to be redesigned
  • To encode long blocks of symbols, or to adapt the
    code to a new distribution, the solution is
    arithmetic coding

19
Arithmetic codes, cont
  • Principle
  • Similar to Shannon-Fano-Elias coding, i.e. the
    cumulative distribution is used to find the codes
  • Efficiently calculate the probability mass
    function p(x1...xn) and the cumulative
    distribution function F(x1...xn) for the source
    sequence x1...xn
  • Use any number in the interval
    (F(x1...xn) - p(x1...xn), F(x1...xn)] as the code
    for x1...xn
  • Expressing this number with an accuracy of about
    log(1/p(x1...xn)) bits gives a code for the
    source sequence, and the codewords for different
    sequences will be different

20
Arithmetic codes, cont
  • We can construct a prefix-free set by using
    ⌈log(1/p(x1...xn))⌉ + 2 bits to round the
    midpoint of the interval (see the sketch below)
  • The most used models for computing the
    probabilities are i.i.d. sources and Markov
    sources
  • For i.i.d. sources: p(x1...xn) = Π p(xi)
  • For Markov sources:
    p(x1...xn) = p(x1) Π p(xi | x(i-1))
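A floating-point sketch of the interval computation for an i.i.d. source (symbols, probabilities, and the sequence are illustrative; practical coders use integer arithmetic and renormalize the interval incrementally):

    from math import ceil, log2

    def sequence_interval(sequence, probs):
        """Narrow [low, low + width) symbol by symbol for an i.i.d. source; the
        final interval has width p(x1...xn) and upper end F(x1...xn)."""
        order = list(probs)
        low, width = 0.0, 1.0
        for x in sequence:
            cum = sum(probs[s] for s in order[:order.index(x)])
            low += width * cum                # shift to the symbol's sub-interval
            width *= probs[x]                 # shrink by the symbol's probability
        return low, width

    def encode(sequence, probs):
        low, width = sequence_interval(sequence, probs)
        l = ceil(log2(1 / width)) + 2         # enough bits for a prefix-free code
        mid = low + width / 2                 # round the midpoint of the interval
        return format(int(mid * 2 ** l), "0%db" % l)

    probs = {"a": 0.6, "b": 0.3, "c": 0.1}
    print(encode("abba", probs))  # a 7-bit codeword inside the interval for "abba"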

21
Competitive optimality
  • Let X be a discrete random variable drawn
    according to a probability mass function p(x),
    and suppose p(x) is dyadic, i.e. log(1/p(x)) is
    an integer for each x
  • Then the binary codelength assignment
    l(x) = log(1/p(x)) dominates any other uniquely
    decodable assignment l'(x) in expected length, in
    the sense that E[l(X)] ≤ E[l'(X)], indicating
    optimality in long-run performance
  • l(x) also competitively dominates l'(x), in the
    sense that P(l(X) < l'(X)) ≥ P(l(X) > l'(X)),
    which indicates that l(x) is also optimal in the
    short run
  • If p(x) is not dyadic, then the codelengths
    l(x) = ⌈log(1/p(x))⌉ dominate l'(x) + 1 in
    expected length and competitively dominate
    l'(x) + 1

22
Generation of random variables
  • When a random source is compressed into a
    sequence of bits so that the average length is
    minimized, the encoded sequence is essentially
    incompressible, and therefore has an entropy rate
    close to 1 bit per symbol
  • The bits of the encoded sequence are essentially
    fair coin flips
  • Opposite direction: how many fair coin flips does
    it take to generate a random variable X drawn
    according to some specified probability mass
    function p?

23
Generation of random variables, cont
  • We map strings of bits Z1Z2........ to possible
    outcomes X by a binary tree, where the leaves are
    marked by output symbols X and the path to the
    leaves is given by the sequence of bits produced
    by the fair coin
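A minimal sketch for a dyadic distribution (the alphabet, probabilities, and tree below are illustrative): each fair bit moves one level down the tree, and the output symbol is the leaf that is reached.

    import random

    # Leaves of a binary tree, identified by their bit-paths from the root.
    # X takes values a, b, c with the dyadic pmf p = (1/2, 1/4, 1/4).
    leaves = {"0": "a", "10": "b", "11": "c"}

    def generate(flip=lambda: random.randint(0, 1)):
        """Flip a fair coin to descend the tree until a leaf is reached."""
        path = ""
        while path not in leaves:
            path += str(flip())
        return leaves[path]

    counts = {"a": 0, "b": 0, "c": 0}
    for _ in range(10000):
        counts[generate()] += 1
    print(counts)  # roughly 5000 / 2500 / 2500; expected flips = H(X) = 1.5 bits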

24
www.liu.se