Title: Data Compression
1. Data Compression
- Gabriel Laden
- CS146, Dr. Sin-Min Lee
- Spring 2004
2. What is Data Compression?
- There is lossless and lossy compression; either way, the file size is reduced
- This saves both space and time (both are at a premium)
- Data compression algorithms are more successful if they are based on a statistical analysis of the frequency of the data and the accuracy needed to represent the data
3. Examples in Computers
- JPEG is a compressed image file
- MP3 is a compressed audio file
- ZIP is a compressed archive of files
- There are many encoding algorithms; we will look at Huffman's Algorithm (see our textbook, pp. 357-362)
4. What is a Greedy Algorithm?
- Solve a problem in stages
- Make a locally optimal decision at each stage
- The algorithm is good if the local optimum is equal to the global optimum
5. Examples of Greedy Algorithms
- Dijkstra, Prim, Kruskal
- Bin Packing problem
- Huffman Code
6. Problems with Greedy Algorithms
- A greedy algorithm does not always work for a given set of data; there can be conflicts
- What if all characters are equally distributed?
- What if characters are very unequally distributed?
- A problem from our textbook: if we had such a thing as a 12-cent coin and were asked to make 15 cents in change, a greedy algorithm would produce (as sketched below)
  - 1 (12-cent) + 3 (pennies) = 15 cents using 4 coins: the incorrect answer
  - 1 (dime) + 1 (nickel) = 15 cents using 2 coins: the correct answer
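
A minimal sketch of this failure in Python; the coin set with the hypothetical 12-cent coin follows the textbook example, and greedy_change is our own illustrative helper:

    def greedy_change(amount, coins):
        """Repeatedly take the largest coin that still fits."""
        result = []
        for coin in sorted(coins, reverse=True):
            while amount >= coin:
                result.append(coin)
                amount -= coin
        return result

    print(greedy_change(15, [12, 10, 5, 1]))  # [12, 1, 1, 1] -- 4 coins, wrong
    print(greedy_change(15, [10, 5, 1]))      # [10, 5]       -- 2 coins, right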
7. David Huffman
- Paper published in 1952
- "A Method for the Construction of Minimum-Redundancy Codes"
- What we call data compression is what he termed minimum redundancy
8. ASCII Code
- 128 characters, including punctuation
- log2(128) = 7 bits
- 1 byte = 8 bits
- All characters are 8 bits long
- Fixed-length encoding (illustrated below)
- "Etaoin Shrdlu": the most common letters in English, in order!
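
A quick sketch of the fixed-length cost, using an arbitrary sample string: every character takes 8 bits no matter how frequent it is.

    text = "etaoin shrdlu"
    bits = "".join(format(ord(c), "08b") for c in text)   # 8 bits per character
    print(len(text), "characters ->", len(bits), "bits")  # 13 characters -> 104 bits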
9. Intro to the Huffman Algorithm
- A method of construction for an encoding tree
- Full binary tree representation
- Each edge of the tree has a value (0 for the left child, 1 for the right child)
- Data is at the leaves, not at internal nodes
- Result: an encoding tree
- Variable-length encoding
10. Huffman Algorithm (English)
- 1. Maintain a forest of trees
- 2. The weight of a tree is the sum of the frequencies of its leaves
- 3. Repeat N - 1 times:
  - Select the two trees of smallest weight
  - Merge them to form a new tree
11. Huffman Algorithm (Technical)
- n ← |C|
- Q ← C
- for i ← 1 to n - 1
-     do z ← AllocateNode()
-        x ← left[z] ← ExtractMin(Q)
-        y ← right[z] ← ExtractMin(Q)
-        f[z] ← f[x] + f[y]
-        Insert(Q, z)
- return ExtractMin(Q)
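
A runnable sketch of the pseudocode above, using Python's heapq module as the min-priority queue Q; the tuple layout and tie-breaking counter are implementation choices, not part of the algorithm. With these inputs and this tie-breaking it happens to reproduce the code table built on the later slides:

    import heapq
    import itertools

    def huffman(freq):
        """Build a Huffman tree from {char: frequency}; return {char: code}."""
        counter = itertools.count()          # breaks ties so trees are never compared
        # Each queue entry is (weight, tie-breaker, tree); a leaf is the character itself.
        q = [(f, next(counter), c) for c, f in freq.items()]
        heapq.heapify(q)                     # Q <- C
        for _ in range(len(freq) - 1):       # for i <- 1 to n - 1
            fx, _, x = heapq.heappop(q)      # x <- ExtractMin(Q)
            fy, _, y = heapq.heappop(q)      # y <- ExtractMin(Q)
            heapq.heappush(q, (fx + fy, next(counter), (x, y)))  # f[z] = f[x] + f[y]
        _, _, root = q[0]                    # return ExtractMin(Q)

        codes = {}
        def walk(node, path):                # 0 on the left edge, 1 on the right edge
            if isinstance(node, tuple):      # internal node: recurse; data is at leaves
                walk(node[0], path + "0")
                walk(node[1], path + "1")
            else:
                codes[node] = path or "0"    # a one-symbol alphabet still needs a bit
        walk(root, "")
        return codes

    # Frequencies from the worked example on the next slides.
    print(huffman({"q": 10, "w": 20, "e": 5, "r": 15, "t": 25, "y": 1, "u": 16}))
    # {'y': '0000', 'e': '0001', 'q': '001', 'w': '01', 't': '10', 'r': '110', 'u': '111'}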
12. Ambiguity in Using the Code?
- What if you have an encoded string
- 000010101101011000110001110
- How do you know where to break it up?
- Prefix coding rule
- No code is a prefix of another
- The way the tree is built disallows ambiguity
- If there is a 00 code, there cannot be a 0 code (see the decoding sketch below)
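
A minimal decoding sketch in Python, using the code table that the next slides construct; because the code is prefix-free, a greedy left-to-right scan decodes the string above unambiguously:

    # Code table from the worked example on the following slides.
    CODES = {"0000": "y", "0001": "e", "001": "q", "01": "w",
             "10": "t", "110": "r", "111": "u"}

    def decode(bits):
        out, buf = [], ""
        for b in bits:
            buf += b
            if buf in CODES:            # no code is a prefix of another,
                out.append(CODES[buf])  # so the first match is the only match
                buf = ""
        return "".join(out)

    print(decode("000010101101011000110001110"))   # -> yttrtrqtqr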
13. (Worked example, steps 0-3)
Step 0: start with frequencies q=10, w=20, e=5, r=15, t=25, y=1, u=16
Step 1: merge y(1) and e(5) into a tree of weight 6
Step 2: merge the 6 tree and q(10) into a tree of weight 16
Step 3: merge r(15) and u(16) into a tree of weight 31

14. (Worked example, steps 4-6)
Step 4: merge the 16 tree and w(20) into a tree of weight 36
Step 5: merge t(25) and the 31 tree into a tree of weight 56
Step 6: merge the 36 tree and the 56 tree into the root, of weight 92

15. (Worked example, resulting code)
When the tree is used to encode a file, it is written as a header above the body of the encoded bits of text (0 is the left edge, 1 is the right edge; use a stack to do this).

Code table:
  0000  y
  0001  e
  001   q
  01    w
  10    t
  110   r
  111   u

Header: 0000y0001e001q01w10t110r111u
16. Proof, Part 1
- Lemma
- Let C be an alphabet in which each character c in C has frequency f(c)
- Let x and y be two characters in C having the lowest frequencies
- Then there exists an optimal prefix code for C in which the codewords for x and y have the same length and differ only in the last bit
17. Proof, Part 2
- Lemma
- Let T be a full binary tree representing an optimal prefix code over an alphabet C
- Let z be the parent of two leaf characters x and y, with f(z) = f(x) + f(y)
- Then T' = T - {x, y} represents an optimal prefix code for the alphabet C' = (C - {x, y}) ∪ {z}
18. Lengths of the Encoding Set
A perfectly balanced tree: all 8 leaves (symbols 1 through 8) sit at depth 3, so every code is 3 bits long.
Length of the set = (8 nodes) × (3 edges) = 24 bits.
This is what you would get if the symbols are mostly random and equal in probability.
19. Lengths of the Encoding Set
A maximally skewed tree: symbol 8 sits at depth 1, symbol 7 at depth 2, and so on down to symbols 1 and 2, which share depth 7.
Length of the set = 7 + 7 + 6 + 5 + 4 + 3 + 2 + 1 = 35 bits.
This is what you would get if the symbols vary the most in probability.
20. Expected Value per Character
- In example 1 (the balanced tree):
- 8 × (1/2^3 × 3) = 3 bits
- In example 2 (the skewed tree):
- 2 × (1/2^7 × 7) + (1/2^6 × 6) + (1/2^5 × 5) + (1/2^4 × 4) + (1/2^3 × 3) + (1/2^2 × 2) + (1/2^1 × 1) ≈ 1.98 bits
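
The same arithmetic, checked in Python; the probabilities 1/2^k are the assumption behind the two examples:

    # Example 1: 8 equally likely symbols, each with a 3-bit code.
    ex1 = 8 * (1 / 2**3) * 3
    # Example 2: codeword depths 7,7,6,5,4,3,2,1 with probabilities 1/2^7 ... 1/2.
    depths = [7, 7, 6, 5, 4, 3, 2, 1]
    probs = [1 / 2**7, 1 / 2**7, 1 / 2**6, 1 / 2**5,
             1 / 2**4, 1 / 2**3, 1 / 2**2, 1 / 2**1]
    ex2 = sum(p * d for p, d in zip(probs, depths))
    print(ex1, round(ex2, 2))   # 3.0 1.98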
21. Main Point
- Statistical methods work better when the symbols in the data set have varying probabilities
- Otherwise you need to use a different method for compression (example: JPEG)
22. Image Compression
- Lossy, meaning details are lost
- An approximation of the original image is made in which large areas of similar color are combined into a single block
- This introduces a certain amount of error, which is the tradeoff
23. Steps to Image Compression
- Specify the requested output file size
- Divide the image into several areas
- Divide the file size by the number of areas
- Quantize each area (information is lost here); see the sketch below
- Encode each area separately and write it to the file
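
A minimal sketch in Python of the quantization step described above: each block of pixels is replaced by its average, which is where detail (and information) is lost. The 2x2 block size and the tiny grayscale image are hypothetical:

    def quantize(image, block=2):
        """Replace each block-by-block region with its average pixel value."""
        h, w = len(image), len(image[0])
        out = [row[:] for row in image]
        for i in range(0, h, block):
            for j in range(0, w, block):
                cells = [(r, c) for r in range(i, min(i + block, h))
                                for c in range(j, min(j + block, w))]
                avg = sum(image[r][c] for r, c in cells) // len(cells)
                for r, c in cells:
                    out[r][c] = avg        # information is lost here
        return out

    img = [[10, 12, 200, 198],
           [11, 13, 201, 199],
           [90, 92, 50, 52],
           [91, 93, 51, 53]]
    for row in quantize(img):
        print(row)   # each 2x2 block collapses to one value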
24. Image Decomposition
25. References
- Mark Allen Weiss, Data Structures and Algorithm Analysis
- Thomas H. Cormen et al., Introduction to Algorithms