Lempel-Ziv methods - PowerPoint PPT Presentation

About This Presentation

Title:

Lempel-Ziv methods

Description:

Title: Slide 1 Author - Last modified by - Created Date: 5/24/2005 9:08:06 AM Document presentation format: On-screen Show Company - Other titles – PowerPoint PPT presentation

Slides: 39

Provided by: 6383156

Transcript and Presenter's Notes

Title: Lempel-Ziv methods

1
Lempel-Ziv methods
2
Dictionary models - I

Dictionary-based compression methods use the
principle of replacing substrings in a message
with a codeword that identifies each substring in
a dictionary, or codebook
The dictionary contains a list of substrings and
their associated codewords
Unlike symbolwise methods, dictionary methods
often use fixed codewords rather than explicit
probability distributions

3
Dictionary models - II

For example, we can insert into the dictionary
the full set of 8-bit ASCII characters (how many? 256)
and the 256 most common pairs of characters
If we use fixed-length codewords, how many bits
do we need to index the dictionary entries?
SOL. 9 bits, since 256 + 256 = 512 = 2^9 entries
What about the performance in bits/character in
the best and in the worst case?
SOL. best: 4.5 bits/char (every codeword covers a pair); worst: 9 bits/char (every codeword covers a single character)!!
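
The arithmetic behind this answer can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are made up for the example.

```python
import math

single_chars = 256   # all 8-bit characters
common_pairs = 256   # the 256 most common character pairs
entries = single_chars + common_pairs

# Fixed-length index needed to address every dictionary entry.
bits_per_codeword = math.ceil(math.log2(entries))

# Best case: every codeword stands for a pair (2 characters).
best = bits_per_codeword / 2
# Worst case: every codeword stands for a single character.
worst = bits_per_codeword / 1

print(bits_per_codeword, best, worst)  # 9 4.5 9.0
```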

4
Dictionary models - III

Another possibility is to use longer words in the
dictionary, perhaps common words like "the" or "and",
or common components of words like "tion". These
strings are the phrases of the dictionary
A dictionary with a predefined set of phrases
does not achieve good compression
Performance is better if we tune the dictionary
to the input source, i.e. if we give up input
independence

5
Dictionary models - IV

For instance, common phrases in an Italian sports
newspaper are very rare in a business management
book
To avoid the problem of the dictionary being
unsuitable for the text at hand we can build a
new dictionary for each message to be
compressed...
... but there is a significant overhead for
transmitting and storing it
Deciding the size of the dictionary in order to
maximize compression is a very difficult problem

6
The Lempel-Ziv methods

The only efficient solution to the problem is to
use an adaptive dictionary scheme
Practically all adaptive dictionary compression
methods are based on one of the two methods
developed by two Israeli researchers, Abraham
Lempel and Jacob Ziv, in 1977 and 1978, and called
LZ77 and LZ78
The LZ77 paper: "A Universal Algorithm for Sequential Data
Compression", IEEE Transactions on
Information Theory, May 1977

7
The key idea - I

The key insight of the method is that it is
possible to automatically build a dictionary of
previously seen strings in the text being
compressed
The prior text makes a very good dictionary,
since it usually has the same style and language
as the upcoming text

8
The key idea - II

The dictionary does not have to be transmitted
with the compressed text, since the decompressor
can build it the same way the compressor does
The many variants of Lempel-Ziv methods differ in
how pointers are represented and in the
limitations on what the pointers are able to
refer to
The presence of so many variants is also caused
by some patents, and by the disputes over
patenting

9
The LZ77 family

Quite easy to implement
Fast decoding with little use of memory
The output of the encoding consists of a series
of triples:
  the first component indicates how far back to
  look in the previously decoded text
  the second component is the length of the phrase
  the third is the next character of the input

10
An example - encoding

alphabet {a, b}

input: a b a a b a b a a b b

phrase   triple
a        <0,0,a>
b        <0,0,b>
aa       <2,1,a>
bab      <3,2,b>
aabb     <5,3,b>
11
An example - decoding

<0,0,x> <0,0,y> <2,1,z> <2,1,x> <5,3,z> <6,3,z> <5,2,z>

SOL. x y xz xx yxzz xxyz zxz
12
A recursive example

<0,0,a> <0,0,c> <2,1,a> <4,2,b> <1,10,a>
Despite the recursive references, each character
is available when needed

triple      decoded phrase
<0,0,a>     a
<0,0,c>     c
<2,1,a>     aa
<4,2,b>     acb
<1,10,a>    bbbbbbbbbba
13
Further details on LZ77

The LZ77 algorithm places limitations on how far back
a pointer can refer (i.e. on the size of the
first component of the triple) and on the maximum
size of the string referred to (i.e. on the
size of the second component)
For example, in English text there is no gain in
using a sliding window of more than a few
thousand characters
We can use a window of 8,192 characters, i.e. 13-bit
pointers

14
Further details on LZ77

At the same time, the length of the match is
rarely over 16 characters, so the extra cost of
allowing longer matches is usually not justified
Exercise: encode the sequence
010020$0110$$0111
with a sliding window of 7 symbols and a maximal
match length of 3. Calculate the compression
ratio
SOL. <0,0,0> <0,0,1> <2,1,0> <0,0,2> <2,1,$> <7,2,1> <5,2,$> <6,3,1>
Each triple takes 7 bits (3 for the pointer, 2 for the length, 2 for the symbol):
0000000 0000001 0100100 0000010 0100111 1111001
1011011 1101101
C = (17 × 2)/(7 × 8) = 0.607 << 1!!! (34 original bits become 56 coded bits)
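
The bit strings in this solution can be reproduced with a short sketch. It assumes that the four symbols are coded as 0 → 00, 1 → 01, 2 → 10, $ → 11, which is consistent with the bit strings above but is not stated on the slide; the function name `pack` is made up for the example.

```python
SYMBOLS = "012$"  # assumed 2-bit symbol codes: position in this string

def pack(triples):
    """Write each (offset, length, symbol) triple as 3 + 2 + 2 bits."""
    words = []
    for offset, length, symbol in triples:
        words.append(f"{offset:03b}{length:02b}{SYMBOLS.index(symbol):02b}")
    return words

triples = [(0, 0, "0"), (0, 0, "1"), (2, 1, "0"), (0, 0, "2"),
           (2, 1, "$"), (7, 2, "1"), (5, 2, "$"), (6, 3, "1")]
words = pack(triples)
print(" ".join(words))

original_bits = 17 * 2               # 17 symbols, 2 bits each
coded_bits = sum(len(w) for w in words)
print(round(original_bits / coded_bits, 3))
```

With only 17 symbols the triples cost more than the raw text, which is why the ratio is below 1.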

15
LZ77 - encoding

Encode the text S[1..N] using LZ77, with a
sliding window of W characters
p = 1
WHILE p <= N
  search for the longest match for S[p...] in
  S[p-W ... p-1]
  Suppose the match occurs at position m, with
  length l
  output the triple < p-m, l, S[p+l] >
  p = p + l + 1
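
The loop above can be sketched in Python as follows. This is a minimal brute-force version, not an efficient implementation: it uses 0-based indexing, and two details are assumptions not fixed by the slide, namely that ties between equally long matches go to the most recent one and that the last character of the text is always kept for the third component.

```python
def lz77_encode(text, window, max_len):
    """Greedy LZ77: return a list of (offset, length, next_char) triples."""
    triples = []
    n = len(text)
    p = 0
    while p < n:
        best_off, best_len = 0, 0
        # Try every start position inside the sliding window.
        for start in range(max(0, p - window), p):
            length = 0
            # Keep one character in reserve for the third component;
            # the match itself may overlap position p.
            while (length < max_len and p + length < n - 1
                   and text[start + length] == text[p + length]):
                length += 1
            # '>=' prefers the most recent among equally long matches.
            if length > 0 and length >= best_len:
                best_off, best_len = p - start, length
        triples.append((best_off, best_len, text[p + best_len]))
        p += best_len + 1
    return triples

print(lz77_encode("abaababaabb", window=8, max_len=8))
# [(0, 0, 'a'), (0, 0, 'b'), (2, 1, 'a'), (3, 2, 'b'), (5, 3, 'b')]
```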

16
LZ77 - decoding

Decode the text S[1..N] using LZ77, with a
sliding window of W characters
p = 1
FOREACH triple < f, l, c >
  S[p ... p+l-1] = S[p-f ... p-f+l-1]
  S[p+l] = c
  p = p + l + 1
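
A minimal Python sketch of this decoding loop is shown below. The function name is made up for the example. The copy is done one character at a time so that recursive (overlapping) references, such as a triple <1,10,a>, find each character already in place.

```python
def lz77_decode(triples):
    """Rebuild the text from (offset, length, next_char) triples."""
    out = []
    for offset, length, char in triples:
        start = len(out) - offset
        # Character-by-character copy: the source may overlap the output
        # that is being produced by this very triple.
        for i in range(length):
            out.append(out[start + i])
        out.append(char)
    return "".join(out)

triples = [(0, 0, "a"), (0, 0, "c"), (2, 1, "a"), (4, 2, "b"), (1, 10, "a")]
print(lz77_decode(triples))  # acaaacbbbbbbbbbbba
```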

17
LZ77 - improvements

LZ77 has been gradually refined:
  first component of the triple: it is useful to
  use variable-length codes, assigning shorter codewords
  to recent matches (which are more common)
  second component of the triple: variable-length
  codes that use fewer bits to represent smaller
  numbers
  third component of the triple: in some variants
  it is added only when needed (when?), with a
  1-bit flag to indicate the presence or the
  absence of this third component
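
As an illustration of a variable-length code that spends fewer bits on smaller numbers, here is the Elias gamma code. It is chosen only as a well-known example of the idea; the slides do not say which code any particular LZ77 variant uses.

```python
def elias_gamma(n):
    """Elias gamma code of a positive integer, returned as a bit string."""
    if n < 1:
        raise ValueError("gamma codes are defined for n >= 1")
    binary = bin(n)[2:]  # binary form; its leading bit is always 1
    # Prefix with (len - 1) zeros so the decoder knows how many bits follow.
    return "0" * (len(binary) - 1) + binary

for n in (1, 2, 3, 4, 16):
    print(n, elias_gamma(n))
```

A length of 1 costs a single bit, while a length of 16 costs 9 bits, so short matches stay cheap.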

18
gzip algorithm - I

gzip is one of the more effective variants of
LZ77
It is distributed by the GNU Project of the Free Software
Foundation
Home page of the gzip project: www.gzip.org

19
gzip algorithm - II

gzip uses hash tables: the next 3 characters to
be coded are hashed, and the hash value is used
as an index to look up a table entry
This entry is the head of a list that contains
the positions where those 3 characters occur in
the window
The list is searched for the longest match
If there is no match the string is coded as raw
characters

20
gzip algorithm - III

If a match exists, we have a length and a
distance, otherwise we have a zero length and a
raw character
The sliding window has size W = 32 KB; lengths
are limited to 258 bytes
List lengths are limited to avoid time-consuming
searches
Trade-off between accuracy and time: the user's choice
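
The search described on the last two slides can be sketched as follows. This is a simplified illustration and not gzip's actual code: it uses the 3-character string itself as a dictionary key instead of a real hash function, and the names `index_position`, `longest_match` and `max_chain` are made up for the example.

```python
from collections import defaultdict

def index_position(data, pos, chains):
    """Record that the 3-character string starting at pos occurs at pos."""
    if pos + 3 <= len(data):
        chains[data[pos:pos + 3]].append(pos)

def longest_match(data, pos, chains, window=32 * 1024, max_len=258, max_chain=32):
    """Return (length, distance) of the longest match for data[pos:], or (0, 0)."""
    best_len, best_dist = 0, 0
    candidates = chains.get(data[pos:pos + 3], [])
    # Visit the most recent occurrences first; stop after max_chain of them.
    for cand in reversed(candidates[-max_chain:]):
        if pos - cand > window:
            break  # older candidates are even further away
        length = 0
        while (length < max_len and pos + length < len(data)
               and data[cand + length] == data[pos + length]):
            length += 1
        if length > best_len:
            best_len, best_dist = length, pos - cand
    return best_len, best_dist

data = "abcxabcyabcx"
chains = defaultdict(list)
for i in range(8):            # index everything before position 8
    index_position(data, i, chains)
print(longest_match(data, 8, chains))               # (4, 8)
print(longest_match(data, 8, chains, max_chain=1))  # (3, 4)
```

The second call shows the trade-off: with a shorter list the search is faster but misses the longer match that lies further back.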

21
gzip algorithm - IV

Lengths, distances and raw characters are coded
with two Huffman trees, one for distances and the
other for lengths and raw characters
Huffman codes are generated by processing blocks of
up to 64 KB (with canonical Huffman)
So gzip is not really one-pass. From a
practical point of view it is one-pass because
blocks are small, so they are read only once
and kept in main memory

22
gzip - example

abacbcaab
<length, distance/character>
<0,a> <0,b> <1,2> <0,c> <1,3> <1,2> <1,4> <2,7>
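
A short sketch shows how a decoder reads this pair notation: a zero length means the second field is a raw character, otherwise it is a distance back into the text already decoded. The function name is made up for the example, and the Huffman coding stage is left out.

```python
def decode_tokens(tokens):
    """Decode <length, distance/character> pairs into text."""
    out = []
    for length, value in tokens:
        if length == 0:
            out.append(value)             # raw character
        else:
            start = len(out) - value      # value is the distance back
            for i in range(length):
                out.append(out[start + i])
    return "".join(out)

tokens = [(0, "a"), (0, "b"), (1, 2), (0, "c"), (1, 3), (1, 2), (1, 4), (2, 7)]
print(decode_tokens(tokens))  # abacbcaab
```

The slide uses matches of length 1 and 2 to keep the example small; the DEFLATE format used by gzip only emits matches of at least 3 bytes.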

23
gzip best compression

abacbcaab
<length, distance/character>
<0,a> <0,b> <1,2> <0,c> <1,3> <1,2> <1,4> <2,7>
two solutions
ab caab
a bcaab
The first is greedy: it uses the longest possible
match. But sometimes the second is better. If
best compression is selected, gzip takes longer
but chooses the better of the two, possibly
coding raw characters even where matches are
possible, if that gives better compression in the
long run

...... abcaab
24
LZ78 family

It has restrictions on which substrings can be
referenced (but this avoids some inefficiency)
Decoding is slower than LZ77 and requires more
memory
It does not have a window to limit how far back
substrings can be referenced
One of its variants, LZW, is widely used in many
popular compression systems

25
Referenceable strings

The text prior to the current coding position is
parsed into substrings, and only parsed phrases can
be referenced
Previous phrases are numbered in sequence, and
the output is a list of pairs
<previous phrase, next character>
This previously unseen combination is stored as a new phrase

26
An example

Input: a b aa ba baa

Phrases       Output
0 <null>
1 a           <0,a>
2 b           <0,b>
3 aa          <1,a>
4 ba          <2,a>
5 baa         <4,a>

Only these phrases can be referenced
This avoids the inefficiency of having more than
one coded representation for the same string, as
is usual in LZ77

27
How to store the phrases

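For the efficiency of the algorithm it is crucial to store
the phrases in a clever way
This can be achieved using a trie

Phrases
0 <null>
1 a
2 b
3 aa
4 ba
5 baa

[Figure: trie storing the phrases. The root is node 0, each edge is labelled a or b, and each node holds a phrase number: 0 -a-> 1, 0 -b-> 2, 1 -a-> 3, 2 -a-> 4, 4 -a-> 5. The figure also shows a node 6 reached by an edge labelled b.]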
28
How to store the phrases

The characters of each phrase specify a path from
the root to a node
The characters to be encoded are used to traverse
the trie until the path is blocked
The last node reached contains the phrase number to
output
A new node is added with the next input character, to
form a new phrase

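[Figure: the trie with nodes 0 to 6. Encoding the input baab follows the path b, a, a to node 5, outputs <5,b>, and adds node 7 below node 5 on an edge labelled b.]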
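
The traversal described above can be sketched in Python with the trie stored as a dictionary of child tables. This is one possible representation, not the only one; the function name and the choice to flush a trailing partial phrase with an empty character are assumptions made for the example.

```python
def lz78_encode(text):
    """Return a list of (phrase_number, next_char) pairs."""
    trie = {0: {}}        # node number -> {character: child node number}
    next_phrase = 1
    output = []
    node = 0              # start at the root (the null phrase)
    for ch in text:
        if ch in trie[node]:
            node = trie[node][ch]          # keep following the path
        else:
            output.append((node, ch))      # path blocked: emit the pair
            trie[node][ch] = next_phrase   # new node = new phrase
            trie[next_phrase] = {}
            next_phrase += 1
            node = 0
    if node != 0:
        # The input ended inside a known phrase: emit it with no character.
        output.append((node, ""))
    return output

print(lz78_encode("abaababaa"))
# [(0, 'a'), (0, 'b'), (1, 'a'), (2, 'a'), (4, 'a')]
```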
29
A problem

The trie data structure continues to grow during
coding, and eventually growth must be stopped to
avoid excessive use of memory
There are various strategies:
  the trie can be reinitialized from scratch
  the trie can be used as is, without further
  updates
  the trie can be partially rebuilt using the last part
  of the text (this avoids the penalties of
  starting from scratch)

30
LZ78 vs. LZ77

LZ78 encoding can be faster
LZ78 decoding is slower because the decoder must
also store the parsed phrases
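
A minimal decoder sketch makes the extra storage visible: the list `phrases` below grows by one entry per pair and has to be kept for the whole decoding run. The names are made up for the example.

```python
def lz78_decode(pairs):
    """Rebuild the text from (phrase_number, next_char) pairs."""
    phrases = [""]        # phrase 0 is the null phrase
    out = []
    for number, ch in pairs:
        phrase = phrases[number] + ch
        phrases.append(phrase)   # the decoder rebuilds the same phrase list
        out.append(phrase)
    return "".join(out)

pairs = [(0, "a"), (0, "b"), (1, "a"), (2, "a"), (4, "a")]
print(lz78_decode(pairs))  # abaababaa
```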

31
LZ78 - exercise

Code the sequence with LZ78 and show the trie that
stores the phrases
0100101110101001011
SOL. <0,0> <0,1> <1,0> <2,0> <2,1> <4,1> <1,1> <3,1> <7,1>

PHRASES
0 <null>
1 0
2 1
3 00
4 10
5 11
6 101
7 01
8 001
9 011

Trie (edge label -> node number, children indented under their parent)
0 (root)
  0 -> 1
    0 -> 3
      1 -> 8
    1 -> 7
      1 -> 9
  1 -> 2
    0 -> 4
      1 -> 6
    1 -> 5
32
LZW variant - I

One of the most popular variants of Lempel-Ziv
coding (Welch 1984)
It forms the basis for the Unix utility compress and
many other popular compressors
The main difference between LZW and LZ78 is that
LZW encodes only phrase numbers without any
ending characters
This scheme works fine because we initialize the
dictionary with a phrase for each character of
the source alphabet (e.g. the 256 characters of
the 8-bit ASCII)

33
LZW variant - II

A new phrase is constructed from a coded one by
appending the first character of the next phrase
Suppose we use 7-bit ASCII: the dictionary is
initialized with 128 phrases (0-127)

34
LZW - encoding

input    output    new phrase added
a        97        128 ab
b        98        129 ba
a        97        130 aa
ab       128       131 aba
ab       128       132 abb
ba       129       133 baa
aba      131       134 abaa
abaa     134
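
The encoding can be sketched in Python as below. This is a minimal illustration with an unbounded dictionary; the function name and the `alphabet_size` parameter are made up for the example, and the input is assumed to contain only characters with codes below `alphabet_size`.

```python
def lzw_encode(text, alphabet_size=128):
    """Return the list of phrase numbers produced by LZW."""
    # Start with one phrase per character of the source alphabet.
    dictionary = {chr(i): i for i in range(alphabet_size)}
    next_code = alphabet_size
    output = []
    w = ""
    for ch in text:
        if w + ch in dictionary:
            w += ch                          # extend the current phrase
        else:
            output.append(dictionary[w])     # emit the longest known phrase
            dictionary[w + ch] = next_code   # phrase + first char of the next
            next_code += 1
            w = ch
    if w:
        output.append(dictionary[w])
    return output

print(lzw_encode("abaababbaabaabaa"))
# [97, 98, 97, 128, 128, 129, 131, 134]
```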
35
LZW - decoding
input    output    phrase completed    phrase pending
97       a                             128 a?
98       b         128 ab              129 b?
97       a         129 ba              130 a?
128      ab        130 aa              131 ab?
128      ab        131 aba             132 ab?
129      ba        132 abb             133 ba?
131      aba       133 baa             134 aba?
134      abaa      134 abaa            135 abaa?

When code 134 arrives, phrase 134 is not ready!! (it is still aba?)

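The delay in phrase creation is not a problem
unless the encoder uses the phrase immediately
after its creation. In this case, a decoder that
inserts phrases only when they are completed
cannot decode, because it does not yet have phrase 134
The decoder can still recover it: phrase 134 is the
previous output (aba) plus its own first character (a), i.e. abaa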
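
A decoder sketch that handles this special case is shown below. When the incoming code is exactly the next phrase number to be created, the phrase must be the previous output followed by the previous output's first character. The function name and the `alphabet_size` parameter are made up for the example.

```python
def lzw_decode(codes, alphabet_size=128):
    """Rebuild the text from a list of LZW phrase numbers."""
    dictionary = {i: chr(i) for i in range(alphabet_size)}
    next_code = alphabet_size
    codes = iter(codes)
    try:
        prev = dictionary[next(codes)]
    except StopIteration:
        return ""
    out = [prev]
    for code in codes:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            # Phrase used immediately after creation: not completed yet,
            # but it must be prev + first character of prev.
            entry = prev + prev[0]
        else:
            raise ValueError(f"invalid LZW code {code}")
        out.append(entry)
        dictionary[next_code] = prev + entry[0]   # complete the pending phrase
        next_code += 1
        prev = entry
    return "".join(out)

print(lzw_decode([97, 98, 97, 128, 128, 129, 131, 134]))
# abaababbaabaabaa
```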

36
LZW - exercise

Code with LZW the sequence
0102$00$10111$02$
using 8-bit ASCII codes
Hint. 0 → 48, 1 → 49, 2 → 50, $ → 36
SOL. 48 49 48 50 36 48 48 36 257 49 265 260 259

PHRASES
256 01
257 10
258 02
259 2$
260 $0
261 00
262 0$
263 $1
264 101
265 11
266 11$
267 $02
268 2$?
37
Lempel-Ziv methods summary

LZ77 <pointer, length, character>
gzip <length, distance/character>
LZ78 <phrase, character>
LZW <phrase>
38
Lempel-Ziv methods exercise

Code the message using all the methods studied
abbb010cac0bb0abb10111b1a
using a sliding window of 7 bits and a max match
length of 7. No limit is given with respect to
the dictionary size
You can use the following 8-bit ASCII codes
a → 97, b → 98, c → 99, 0 → 48, 1 → 49
