Title: Data Compression
1. Data Compression
- Gabriel Laden
- CS146, Dr. Sin-Min Lee
- Spring 2004
2. What is Data Compression?
- There is lossless and lossy compression; either way, the file size is reduced
- This saves both space and time (both are at a premium)
- Data compression algorithms are more successful if they are based on a statistical analysis of the frequency of the data and the accuracy needed to represent the data
3. Examples in Computers
- JPEG is a compressed image file
- MP3 is a compressed audio file
- ZIP is a compressed archive of files
- There are many encoding algorithms; we will look at Huffman's Algorithm (see our textbook, pp. 357-362)
4. What is a Greedy Algorithm?
- Solve a problem in stages
- Make a locally optimal decision at each stage
- The algorithm is good if the local optimum is equal to the global optimum
5. Examples of Greedy Algorithms
- Dijkstra, Prim, Kruskal
- Bin Packing problem
- Huffman Code
6. Problems with Greedy Algorithms
- A greedy algorithm does not always work for a given set of data; there can be conflicts
- What if all characters are equally distributed?
- What if characters are very unequally distributed?
- A problem from our textbook: if we had such a thing as a 12-cent coin and were asked to make 15 cents in change, a greedy algorithm would produce (as sketched below)
  - 1 (12-cent) + 3 (pennies) = 15 cents using 4 coins: the incorrect answer
  - 1 (dime) + 1 (nickel) = 15 cents using 2 coins: the correct answer
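
A minimal sketch of this failure in Python; the coin set with the hypothetical 12-cent coin follows the textbook example, and greedy_change is our own illustrative helper:

    def greedy_change(amount, coins):
        """Repeatedly take the largest coin that still fits."""
        result = []
        for coin in sorted(coins, reverse=True):
            while amount >= coin:
                result.append(coin)
                amount -= coin
        return result

    print(greedy_change(15, [12, 10, 5, 1]))  # [12, 1, 1, 1] -- 4 coins, wrong
    print(greedy_change(15, [10, 5, 1]))      # [10, 5]       -- 2 coins, right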
7. David Huffman
- Paper published in 1952
- "A Method for the Construction of Minimum-Redundancy Codes"
- What we call data compression is what he termed minimum redundancy
8. ASCII Code
- 128 characters, including punctuation
- log2(128) = 7 bits
- 1 byte = 8 bits
- All characters are 8 bits long
- Fixed-length encoding (illustrated below)
- "Etaoin Shrdlu": the most common letters in English, in order!
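
A quick sketch of the fixed-length cost, using an arbitrary sample string: every character takes 8 bits no matter how frequent it is.

    text = "etaoin shrdlu"
    bits = "".join(format(ord(c), "08b") for c in text)   # 8 bits per character
    print(len(text), "characters ->", len(bits), "bits")  # 13 characters -> 104 bits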
9. Intro to the Huffman Algorithm
- A method of construction for an encoding tree
- Full binary tree representation
- Each edge of the tree has a value (0 for the left child, 1 for the right child)
- Data is at the leaves, not at internal nodes
- Result: an encoding tree
- Variable-length encoding
10. Huffman Algorithm (English)
- 1. Maintain a forest of trees
- 2. The weight of a tree is the sum of the frequencies of its leaves
- 3. Repeat N - 1 times:
  - Select the two trees of smallest weight
  - Merge them to form a new tree
11. Huffman Algorithm (Technical)
- n ← |C|
- Q ← C
- for i ← 1 to n - 1
-     do z ← AllocateNode()
-        x ← left[z] ← ExtractMin(Q)
-        y ← right[z] ← ExtractMin(Q)
-        f[z] ← f[x] + f[y]
-        Insert(Q, z)
- return ExtractMin(Q)
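
A runnable sketch of the pseudocode above, using Python's heapq module as the min-priority queue Q; the tuple layout and tie-breaking counter are implementation choices, not part of the algorithm. With these inputs and this tie-breaking it happens to reproduce the code table built on the later slides:

    import heapq
    import itertools

    def huffman(freq):
        """Build a Huffman tree from {char: frequency}; return {char: code}."""
        counter = itertools.count()          # breaks ties so trees are never compared
        # Each queue entry is (weight, tie-breaker, tree); a leaf is the character itself.
        q = [(f, next(counter), c) for c, f in freq.items()]
        heapq.heapify(q)                     # Q <- C
        for _ in range(len(freq) - 1):       # for i <- 1 to n - 1
            fx, _, x = heapq.heappop(q)      # x <- ExtractMin(Q)
            fy, _, y = heapq.heappop(q)      # y <- ExtractMin(Q)
            heapq.heappush(q, (fx + fy, next(counter), (x, y)))  # f[z] = f[x] + f[y]
        _, _, root = q[0]                    # return ExtractMin(Q)

        codes = {}
        def walk(node, path):                # 0 on the left edge, 1 on the right edge
            if isinstance(node, tuple):      # internal node: recurse; data is at leaves
                walk(node[0], path + "0")
                walk(node[1], path + "1")
            else:
                codes[node] = path or "0"    # a one-symbol alphabet still needs a bit
        walk(root, "")
        return codes

    # Frequencies from the worked example on the next slides.
    print(huffman({"q": 10, "w": 20, "e": 5, "r": 15, "t": 25, "y": 1, "u": 16}))
    # {'y': '0000', 'e': '0001', 'q': '001', 'w': '01', 't': '10', 'r': '110', 'u': '111'}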
12. Ambiguity in Using the Code?
- What if you have an encoded string
- 000010101101011000110001110
- How do you know where to break it up?
- Prefix coding rule
- No code is a prefix of another
- The way the tree is built disallows ambiguity
- If there is a 00 code, there cannot be a 0 code (see the decoding sketch below)
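
A minimal decoding sketch in Python, using the code table that the next slides construct; because the code is prefix-free, a greedy left-to-right scan decodes the string above unambiguously:

    # Code table from the worked example on the following slides.
    CODES = {"0000": "y", "0001": "e", "001": "q", "01": "w",
             "10": "t", "110": "r", "111": "u"}

    def decode(bits):
        out, buf = [], ""
        for b in bits:
            buf += b
            if buf in CODES:            # no code is a prefix of another,
                out.append(CODES[buf])  # so the first match is the only match
                buf = ""
        return "".join(out)

    print(decode("000010101101011000110001110"))   # -> yttrtrqtqr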
13. (Worked example, steps 0-3)
Step 0: start with frequencies q=10, w=20, e=5, r=15, t=25, y=1, u=16
Step 1: merge y(1) and e(5) into a tree of weight 6
Step 2: merge the 6 tree and q(10) into a tree of weight 16
Step 3: merge r(15) and u(16) into a tree of weight 31

14. (Worked example, steps 4-6)
Step 4: merge the 16 tree and w(20) into a tree of weight 36
Step 5: merge t(25) and the 31 tree into a tree of weight 56
Step 6: merge the 36 tree and the 56 tree into the root, of weight 92

15. (Worked example, resulting code)
When the tree is used to encode a file, it is written as a header above the body of the encoded bits of text (0 is the left edge, 1 is the right edge; use a stack to do this).

Code table:
  0000  y
  0001  e
  001   q
  01    w
  10    t
  110   r
  111   u

Header: 0000y0001e001q01w10t110r111u
16. Proof, Part 1
- Lemma
- Let C be an alphabet in which each character c in C has frequency f(c)
- Let x and y be two characters in C having the lowest frequencies
- Then there exists an optimal prefix code for C in which the codewords for x and y have the same length and differ only in the last bit
17. Proof, Part 2
- Lemma
- Let T be a full binary tree representing an optimal prefix code over an alphabet C
- Let z be the parent of two leaf characters x and y, with f(z) = f(x) + f(y)
- Then T' = T - {x, y} represents an optimal prefix code for the alphabet C' = (C - {x, y}) ∪ {z}
18. Lengths of the Encoding Set
A perfectly balanced tree: all 8 leaves (symbols 1 through 8) sit at depth 3, so every code is 3 bits long.
Length of the set = (8 nodes) × (3 edges) = 24 bits.
This is what you would get if the symbols are mostly random and equal in probability.
19. Lengths of the Encoding Set
A maximally skewed tree: symbol 8 sits at depth 1, symbol 7 at depth 2, and so on down to symbols 1 and 2, which share depth 7.
Length of the set = 7 + 7 + 6 + 5 + 4 + 3 + 2 + 1 = 35 bits.
This is what you would get if the symbols vary the most in probability.
20. Expected Value per Character
- In example 1 (the balanced tree):
- 8 × (1/2^3 × 3) = 3 bits
- In example 2 (the skewed tree):
- 2 × (1/2^7 × 7) + (1/2^6 × 6) + (1/2^5 × 5) + (1/2^4 × 4) + (1/2^3 × 3) + (1/2^2 × 2) + (1/2^1 × 1) ≈ 1.98 bits
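
The same arithmetic, checked in Python; the probabilities 1/2^k are the assumption behind the two examples:

    # Example 1: 8 equally likely symbols, each with a 3-bit code.
    ex1 = 8 * (1 / 2**3) * 3
    # Example 2: codeword depths 7,7,6,5,4,3,2,1 with probabilities 1/2^7 ... 1/2.
    depths = [7, 7, 6, 5, 4, 3, 2, 1]
    probs = [1 / 2**7, 1 / 2**7, 1 / 2**6, 1 / 2**5,
             1 / 2**4, 1 / 2**3, 1 / 2**2, 1 / 2**1]
    ex2 = sum(p * d for p, d in zip(probs, depths))
    print(ex1, round(ex2, 2))   # 3.0 1.98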
21. Main Point
- Statistical methods work better when the symbols in the data set have varying probabilities
- Otherwise you need to use a different method for compression (example: JPEG)
22. Image Compression
- Lossy, meaning details are lost
- An approximation of the original image is made in which large areas of similar color are combined into a single block
- This introduces a certain amount of error, which is the tradeoff
23. Steps to Image Compression
- Specify the requested output file size
- Divide the image into several areas
- Divide the file size by the number of areas
- Quantize each area (information is lost here); see the sketch below
- Encode each area separately and write it to the file
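
A minimal sketch in Python of the quantization step described above: each block of pixels is replaced by its average, which is where detail (and information) is lost. The 2x2 block size and the tiny grayscale image are hypothetical:

    def quantize(image, block=2):
        """Replace each block-by-block region with its average pixel value."""
        h, w = len(image), len(image[0])
        out = [row[:] for row in image]
        for i in range(0, h, block):
            for j in range(0, w, block):
                cells = [(r, c) for r in range(i, min(i + block, h))
                                for c in range(j, min(j + block, w))]
                avg = sum(image[r][c] for r, c in cells) // len(cells)
                for r, c in cells:
                    out[r][c] = avg        # information is lost here
        return out

    img = [[10, 12, 200, 198],
           [11, 13, 201, 199],
           [90, 92, 50, 52],
           [91, 93, 51, 53]]
    for row in quantize(img):
        print(row)   # each 2x2 block collapses to one value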
24. Image Decomposition
25. References
- Mark Allen Weiss, Data Structures and Algorithm Analysis
- Thomas H. Cormen et al., Introduction to Algorithms