1
Data Compression
  • Gabriel Laden
  • CS146 Dr. Sin-Min Lee
  • Spring 2004

2
What is Data Compression?
  • Compression can be lossless or lossy; either
    way, the file size is reduced
  • This saves both time and space (both at a premium)
  • Data compression algorithms are more successful
    when they are based on statistical analysis of the
    frequency of the data and the accuracy needed to
    represent the data

3
Examples in computers
  • jpeg is a compressed image file
  • mp3 is a compressed audio file
  • zip is a compressed archive of files
  • there are lots of encoding algorithms; we will
    look at Huffman's Algorithm
  • (see our textbook, pp. 357-362)

4
What is a Greedy Algorithm?
  • Solve a problem in stages
  • Make a locally optimal decision at each stage
  • The algorithm is good if the local optimum is
    equal to the global optimum

5
Examples of Greedy
  • Dijkstra, Prim, Kruskal
  • Bin Packing problem
  • Huffman Code

6
Problem with Greedy
  • A Greedy Algorithm does not always work for a
    given set of data; there can be conflicts
  • What if all characters are equally distributed?
  • What if characters are very unequally
    distributed?
  • A problem from our textbook (see the sketch below)
  • If we had such a thing as a 12-cent coin and were
    asked to make 15 cents in change, the Greedy
    Algorithm would produce
  • 1 (12-cent) + 3 (penny) = 15 cents, using 4 coins: a suboptimal answer
  • 1 (dime) + 1 (nickel) = 15 cents, using 2 coins: the optimal answer
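A minimal Python sketch of this coin example (the function name
greedy_change and the coin sets are illustrative, not from the slides;
greedy here means always taking the largest coin that still fits):

    def greedy_change(amount, coins):
        """Make change by always taking the largest coin that still fits."""
        picked = []
        for coin in sorted(coins, reverse=True):
            while amount >= coin:
                amount -= coin
                picked.append(coin)
        return picked

    # With an imaginary 12-cent coin, the greedy choice stops being optimal:
    print(greedy_change(15, [1, 5, 10, 12]))  # [12, 1, 1, 1] -> 4 coins (suboptimal)
    print(greedy_change(15, [1, 5, 10]))      # [10, 5]       -> 2 coins (optimal)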

7
David Huffman
  • Paper published in 1952
  • "A Method for the Construction of
    Minimum-Redundancy Codes"
  • What we call Data Compression is what he termed
    Minimum Redundancy

8
ASCII Code
  • 128 characters, including punctuation
  • log2(128) = 7 bits
  • 1 byte = 8 bits
  • All characters are 8 bits long
  • Fixed-Length Encoding
  • "Etaoin Shrdlu": the most common letters in
    English, in order of frequency!

9
Intro to Huffman Algorithm
  • Method of construction for an encoding tree
  • Full Binary Tree Representation
  • Each edge of the tree has a value
  • (0 for the edge to the left child, 1 for the edge
    to the right child)
  • Data is at the leaves, not at internal nodes
  • Result: an encoding tree
  • Variable-Length Encoding

10
Huffman Algorithm (English)
  • 1. Maintain a forest of trees
  • 2. Weight of a tree = sum of the frequencies of
    its leaves
  • 3. Repeat N - 1 times:
  • Select the two trees with the smallest weights
  • Form a new tree whose weight is their sum

11
Huffman Algorithm (Technical)
  • n ← |C|
  • Q ← C
  • for i ← 1 to n - 1
  • do z ← Allocate-Node()
  • x ← left[z] ← Extract-Min(Q)
  • y ← right[z] ← Extract-Min(Q)
  • f[z] ← f[x] + f[y]
  • Insert(Q, z)
  • return Extract-Min(Q)
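The pseudocode maps almost line for line onto a binary min-heap. A small
Python sketch under that reading (the function name huffman_codes and the
nested-tuple tree representation are mine, not from the slides; exact 0/1
assignments can differ from the worked example on the next slides because
of ties, but any Huffman tree gives the same total encoded length):

    import heapq
    from itertools import count

    def huffman_codes(freq):
        """Build a Huffman code from a {symbol: frequency} dict."""
        tie = count()  # tie-breaker so the heap never compares trees directly
        q = [(f, next(tie), sym) for sym, f in freq.items()]
        heapq.heapify(q)                              # Q <- C
        for _ in range(len(freq) - 1):                # for i <- 1 to n - 1
            fx, _, x = heapq.heappop(q)               # x <- Extract-Min(Q)
            fy, _, y = heapq.heappop(q)               # y <- Extract-Min(Q)
            heapq.heappush(q, (fx + fy, next(tie), (x, y)))  # f[z] <- f[x] + f[y]
        _, _, root = q[0]                             # return Extract-Min(Q)

        codes = {}
        def walk(node, prefix):                       # 0 = left edge, 1 = right edge
            if isinstance(node, tuple):
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:
                codes[node] = prefix or "0"
        walk(root, "")
        return codes

    # Frequencies from the worked example on the next slides:
    print(huffman_codes({'q': 10, 'w': 20, 'e': 5, 'r': 15, 't': 25, 'y': 1, 'u': 16}))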

12
Ambiguity in using code?
  • What if you have an encoded string such as
  • 000010101101011000110001110
  • How do you know where to break it up?
  • Prefix Coding Rule
  • No code is a prefix of another
  • The way the tree is built disallows ambiguity
  • If there is a code 00, there cannot be a code 0
    (see the decoding sketch below)
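A short Python sketch of why the prefix rule makes left-to-right decoding
unambiguous (the decode function is illustrative; the code table is the one
the worked example arrives at on slide 15, and the bit string is the one
above):

    def decode(bits, codes):
        """Decode a bit string with a prefix code: because no codeword is a
        prefix of another, the first codeword matched is the only possibility."""
        lookup = {code: sym for sym, code in codes.items()}
        out, buf = [], ""
        for bit in bits:
            buf += bit
            if buf in lookup:          # complete codeword; cannot be extended
                out.append(lookup[buf])
                buf = ""
        if buf:
            raise ValueError("bit string does not end on a codeword boundary")
        return "".join(out)

    codes = {'w': '01', 'y': '0000', 't': '10', 'e': '0001',
             'r': '110', 'q': '001', 'u': '111'}
    print(decode("000010101101011000110001110", codes))   # -> yttrtrqtqr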

13
Worked example (shown as tree diagrams on the original slides): build the
tree for frequencies q = 10, w = 20, e = 5, r = 15, t = 25, y = 1, u = 16.
Step 0: seven single-node trees with weights 10 (q), 20 (w), 5 (e), 15 (r), 25 (t), 1 (y), 16 (u)
Step 1: merge the two smallest, y (1) and e (5), into a tree of weight 6
Step 2: merge that tree (6) with q (10) into a tree of weight 16
Step 3: merge r (15) and u (16) into a tree of weight 31
14
Step 3 (carried over): the remaining trees have weights 16, 20 (w), 25 (t), and 31
Step 4: merge the 16-tree with w (20) into a tree of weight 36
Step 5: merge t (25) with the 31-tree into a tree of weight 56
Step 6: merge the 36-tree and the 56-tree into the final tree of weight 92
15
Final tree (Step 6, total weight 92): the left subtree holds y, e, q, w and
the right subtree holds t, r, u.
When the tree is used to encode a file, it is written as a header above the
body of the encoded bits of text (0 = left edge, 1 = right edge; a stack can
be used to write it out).
Code table:
  w 01    t 10    q 001    r 110    u 111    y 0000    e 0001
Header: 0000y0001e001q01w10t110r111u
16
Proof part 1
  • Lemma
  • Let C be an alphabet in which each character c
    in C has frequency f[c]
  • Let x and y be two characters in C having the
    lowest frequencies
  • Then there exists an optimal prefix code for C in
    which the codewords for x and y have the same
    length and differ only in the last bit

17
Proof part 2
  • Lemma
  • Let T be a full binary tree representing an
    optimal prefix code over an alphabet C
  • Let z be the parent of two leaves x and y, with
    f[z] = f[x] + f[y]
  • Then T' = T - {x, y} represents an optimal prefix
    code for the alphabet C' = (C - {x, y}) ∪ {z}

18
Lengths of Encoding Set
A perfectly balanced tree: 8 leaves (symbols 1 through 8), each at depth 3.
Length of the encoded set = (8 leaves) × (3 edges each) = 24 bits.
This is what you would get if the symbols are mostly random and equal in probability.
19
Lengths of Encoding Set
A maximally skewed tree: one leaf at each of depths 1 through 6, and two leaves at depth 7.
Length of the encoded set = 7 + 7 + 6 + 5 + 4 + 3 + 2 + 1 = 35 bits.
This is what you would get if the symbols vary the most in probability.
20
Expected Value / character
  • In example 1 (the balanced tree):
  • 8 × (1/2^3 × 3) = 3 bits per character
  • In example 2 (the skewed tree):
  • 2 × (1/2^7 × 7) + (1/2^6 × 6) + (1/2^5 × 5)
    + (1/2^4 × 4) + (1/2^3 × 3) + (1/2^2 × 2)
    + (1/2^1 × 1) ≈ 1.98 bits per character
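A throwaway Python check of that arithmetic (the probabilities
1/2, 1/4, ..., 1/128, 1/128 for the skewed tree are those implied by the
leaf depths on the previous slide):

    # Example 1: 8 equally likely symbols, each encoded with 3 bits.
    balanced = 8 * (1 / 2**3) * 3
    # Example 2: leaf depths 1..7 plus a second leaf at depth 7.
    skewed = 2 * (1 / 2**7) * 7 + sum((1 / 2**k) * k for k in range(1, 7))
    print(balanced, skewed)   # 3.0 1.984375  (about 1.98 bits per character)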

21
Main Point
  • Statistical methods work better when the symbols
    in the data set have varying probabilities.
  • Otherwise you need to use a different method for
    compression. (Example jpeg)

22
Image Compression
  • Lossy, meaning details are lost
  • An approximation of the original image is made in
    which large areas of similar color are combined
    into a single block
  • This introduces a certain amount of error, which
    is the trade-off for the smaller file

23
Steps to Image Compression
  • Specify the requested output file size
  • Divide the image into several areas
  • Divide the file size by the number of areas
  • Quantize each area (information is lost here)
  • Encode each area separately and write it to the
    file (see the sketch below)
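A toy Python sketch of the quantization idea only, not the actual JPEG
pipeline (the block size, the function name, and the use of a plain block
average are all assumptions for illustration; real codecs transform each
block and quantize the coefficients instead):

    import numpy as np

    def block_quantize(image, block=8):
        """Lossy approximation: replace every block x block tile of a
        grayscale image with the tile's average value."""
        h, w = image.shape
        out = image.astype(float).copy()
        for r in range(0, h, block):
            for c in range(0, w, block):
                tile = image[r:r + block, c:c + block]
                out[r:r + block, c:c + block] = tile.mean()   # information lost here
        return out.astype(image.dtype)

    # The error introduced is the trade-off mentioned on the previous slide:
    img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
    approx = block_quantize(img)
    print(np.abs(img.astype(int) - approx.astype(int)).mean())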

24
Image Decomposition
25
References
  • Data Structures and Algorithm Analysis - Mark
    Allen Weiss
  • Introduction to Algorithms - Thomas H. Cormen