Lempel-Ziv methods - PowerPoint PPT Presentation

About This Presentation
Title:

Lempel-Ziv methods

Description:

Title: Slide 1 Author - Last modified by - Created Date: 5/24/2005 9:08:06 AM Document presentation format: On-screen Show Company - Other titles – PowerPoint PPT presentation

Number of Views:151
Avg rating:3.0/5.0
Slides: 39
Provided by: 6383156
Category:
Tags: alphabet | lempel | methods | ziv

less

Transcript and Presenter's Notes

Title: Lempel-Ziv methods


1
Lempel-Ziv methods
2
Dictionary models - I
  • Dictionary-based compression methods use the
    principle of replacing substrings in a message
    with a codeword that identifies each substring in
    a dictionary, or codebook
  • The dictionary contains a list of substrings and
    their associated codewords
  • Unlike symbolwise methods, dictionary methods
    often use fixed codewords rather than explicit
    probability distribution

3
Dictionary models - II
  • For example, we can insert into the dictionary
    the full set of 8-bit ASCII characters How many?
    and the 256 most common pairs of characters
  • If we use fixed length codeword, how many bits
    does we need to index dictionary entries?
  • SOL. 9 bits
  • What about the performances in bits/character in
    the best and in the worst case?
  • SOL. best4.5b/char worst9b/char!!

4
Dictionary models - III
  • Another possibility is to use longer words in the
    dictionary, perhaps common words like the or and
    or common components of words like tion. This
    strings are the phrases of the dictionary
  • A dictionary with a predefined set of phrases
    does not achieve good compression
  • Performances are better if we tune the dictionary
    on input source, i.e. if we loose input
    indipendence

5
Dictionary models - IV
  • For istance common phrases for an italian sport
    newspaper are very rare in a business management
    book
  • To avoid the problem of dictionary being
    unsuitable for the text at hand we can build a
    new dictionary for each message to be
    compressed....
  • .... but there is a significant overhead for
    transmitting and storing it
  • Deciding the size of the dictionary in order to
    maximize compression is a very difficult problem

6
The Lempel-Ziv methods
  • The only efficient solution to the problem is to
    use an adaptive dictionary scheme
  • Pratically all adaptive dictionary compression
    methods are based on one of the two methods
    developed by two israely researchers, Abraham
    Lempel and Jacob Ziv in 1977 e 1978, and called
    LZ77 and LZ78
  • "A Universal Algorithm for Sequential Data
    Compression" in the IEEE Transactions on
    Information Theory, May 1977

7
The key idea - I
  • The key insight of the method is that it is
    possible to automatically build a dictionary of
    previously seen strings in the text being
    compressed
  • The prior text makes a very good dictionary,
    since it has usually the same style and language
    of the upcoming text

8
The key idea - II
  • The dictionary does not have to be transmitted
    with the compressed text, since the decompressor
    can build it the same way the compressor does
  • The many variants of Lempel-Ziv methods differ in
    how pointers are represented and in the
    limitations on what the pointers are able to
    refer to
  • The presence of so many variants is also caused
    by same patents, and by the disputes over
    patenting

9
The LZ77 family
  • Quite easy to implement
  • Fast decoding with little use of memory
  • The output of the encoding consists of a series
    of triples
  • the first component indicates how far back to
    look in the previously decoded text
  • the second component is the length of the phrase
  • the third is next character for the input

10
An example - encoding
  • alphabet a,b

a
a
a
a
b
b
b
aabb
lt0,0,agt
lt0,0,bgt
lt2,1,agt
lt3,2,bgt
lt5,3,bgt
11
An example - decoding
  • lt0,0,xgtlt0,0,ygtlt2,1,zgtlt2,1,xgtlt5,3,zgt lt6,3,zgtlt5,2,zgt

SOL. x y xz xx yxzz xxyz zxz
12
A recursive example
  • lt0,0,agtlt0,0,cgtlt2,1,agtlt4,2,bgtlt1,10,agt
  • Despite the recursive references, each character
    is available when needed

a
c
aa
acb
??
bbbbbbbbbba
13
Further details on LZ77
  • LZ77 algorithm places limitations on how far back
    a pointer can refer (i.e. on the length of the
    first component of the triple) and on the maximum
    size of the string referred to (i.e. on the
    length of the second component)
  • For example, in English text there is no gain in
    using a sliding windows of more than a few
    thousand characters
  • We can use a windows of 8.192 characters, i.e. 13
    bits

14
Further details on LZ77
  • At the same time, the length of the match is
    rarely over 16 characters, so the extra cost to
    allow longer match usually is not justified
  • Exercise encode the sequence
  • 01002001100111
  • with a sliding window of 7 symbols and a maximal
    match length of 3. Calculate the compression
    ratio
  • SOL. lt0,0,0gtlt0,0,1gtlt2,1,0gtlt0,0,2gtlt2,1,gtlt7,2,1gtlt5
    ,2,gt,lt6,3,1gt
  • 0000000 0000001 0100100 0000010 0100111 1111001
    1011011 1101101
  • C(172)/780.607 ltlt 1!!!

15
LZ77 - encoding
  • Encode the text S1..N using LZ77, with a
    sliding window of W characters
  • p1
  • WHILE pltN
  • search for the longest match for Sp... in Sp-W
    ... p-1.
  • Suppose the match occurs at position p-m, with
    length l
  • output the triple lt p-m, l, Spl gt
  • pp l 1

16
LZ77 - decoding
  • Decode the text S1..N using LZ77, with a
    sliding window of W characters
  • p1
  • FOREACH triple lt f, l, cgt
  • Sp ... p l - 1 S p - f ... p f l - 1
  • Suppose the match occurs at position p-m, with
    length l
  • Spl c
  • pp l 1

17
LZ77 - improvements
  • The LZ77 has been gradually refined
  • first component of the triple it is useful to
    use variable length, assigning shorter codewords
    to recent matches (that are more common)
  • second component of the triple variable length
    codes that uses less bits to represent smaller
    numbers
  • third component of the triple in some variants
    it is added only when needed (when?), with a
    1-bit flag to indicate the presence or the
    absence of this third component

18
gzip algorithm - I
  • gzip is one of the more effective variants of
    LZ77
  • It is distributed by the Gnu Free Software
    Foundation
  • home page of gzip project www.gzip.org

19
gzip algorithm - II
  • gzip uses a hash tables the next 3 characters to
    be coded are hashed, and the return value is used
    as index to lookup a table entry
  • This entry is the head of a list that contains
    the places of occurrence of the 3 characters in
    the window
  • The list is searched for the longest match
  • If there is no match the string is coded as raw
    characters

20
gzip algorithm - III
  • If the match exists, we have a length and a
    distance, otherwise we have a zero length and a
    raw character
  • the sliding window has dimension W32KB, lengths
    are limited to 258 bytes
  • List lengths are limited to avoid time consuming
    researches
  • tradeoff accuracy/time ? users choice

21
gzip algorithm - IV
  • Lengths, distances and raw characters are coded
    with two Huffman trees, one for distances and the
    other for lengths and raw characters
  • Huffman codes are generated processing blocks of
    up to 64KB (with canonical Huffman)
  • so gzip it is not really one-pass. From a
    pratical point of view it is one-pass because
    blocks are small, so they are read only one time
    and kept in main memory

22
gzip - example
  • abacbcaab
  • ltlength, distance/charactergt
  • lt0,agtlt0,bgtlt1,2gtlt0,cgtlt1,3gtlt1,2gtlt1,4gtlt2,7gt

23
gzip best compression
  • abacbcaab
  • ltlength, distance/charactergt
  • lt0,agtlt0,bgtlt1,2gtlt0,cgtlt1,3gtlt1,2gtlt1,4gtlt2,7gt
  • two solutions
  • ab caab
  • a bcaab
  • The first is greedy, it uses longest possible
    match. But sometimes the second is better. If
    best compression is selected, gzip takes a longer
    time but chooses the best of the two, eventually
    coding raw characters even if matches are
    possible, if it gives better compression in the
    long run

...... abcaab
24
LZ78 family
  • it has restrictions on which substring can be
    referenced (but this avoids some inefficiency)
  • decoding is slower than LZ77 and require more
    memory
  • does not have a window to limit how far back
    substring can be referenced
  • one of its variant, LZW, is widely used in many
    popular compression systems

25
Referentiable strings
  • The text prior to the current coding position is
    parsed in substrings, and only parsed phrases can
    be referenced
  • Previous phrases are numbered in sequence, and
    the output is a list of pairs
  • ltprevious phrase, next charactergt
  • This unseen combination is stored as a new phrase

26
An example
  • a

a
a
a
a
a
b
b
b
Phrases
Output
0 ltnullgt
lt0,agt
lt0,bgt
lt1,agt
lt2,agt
lt4,agt
1 a
2 b
3 aa
4 ba
5 baa
  • Only this phrases can be referenced
  • This avoids the inefficiency of having more than
    one coded representation for the same string, as
    usual in LZ77

27
How to store the phrases
  • It is crucial for algorithm efficiency, to store
    the phrases in a clever way
  • This can be obtained using a trie

0
a
b
Phrases
0 ltnullgt
1
2
1 a
a
a
2 b
3
4
3 aa
a
b
4 ba
5 baa
5
6
28
How to store the phrases
  • The character of each phrase specify a path from
    the root to a leaf
  • The character to be encoded are used to traverse
    the trie until the path is blocked
  • The last node contains the phrase number to
    output
  • A new node is added with next input character, to
    form a new phrase

0
a
b
1
2
a
a
3
4
a
b
5
6
b
7
baab
lt5,bgt
29
A problem
  • The trie data structure continues to grow during
    coding, and eventually growth must be stopped to
    avoid an eccessive use of memory
  • There are various strategies
  • the trie can be reinitialized from scratch
  • the trie can be used as is, without further
    updates
  • the trie can be partially rebuild using last part
    of the text (this avoids the penalties of
    starting form scratch)

30
LZ78 vs. LZ77
  • LZ78 encoding can be faster
  • LZ78 decoding is slower because the decoder must
    also store the parsed phrases

31
LZ78 - exercise
  • Code the sequence with LZ78 e show the trie that
    store the phrases
  • 0100101110101001011
  • SOL. lt0,0gtlt0,1gtlt1,0gtlt2,0gtlt2,1gtlt4,1gtlt1,1gtlt3,1gtlt7,1gt

0
0
1
PHRASES 0 ltnullgt 1 0 2 1 3 00 4 10 5 11 6 1
01 7 01
8 001 9 011
1
2
0
0
1
1
3
4
7
5
1
1
1
6
8
9
32
LZW variant - I
  • One of the most popular variants of Lempel-Ziv
    coding (Welch 1984)
  • It forms the basis for Unix utility compress and
    many other popular compressors
  • The main difference between LZW and LZ78 is that
    LZW encodes only phrase numbers without any
    ending characters
  • This scheme works fine because we initialize the
    dictionary with a phrase for each character of
    the source alphabet (e.g. the 256 characters of
    the 8-bit ASCII)

33
LZW variant - II
  • A new phrase is constructed from a coded one by
    appending the first character of the next phrase
  • Suppose we use 7-bit ASCII dictionary is
    initialized with 128 phrases (0-127)

34
LZW - encoding
  • a

b
a
ab
ab
ba
aba
abaa
input
output
97
98
97
128
128
129
131
134
new phrases added
128 ab
129 ba
130 aa
131 aba
132 abb
133 baa
134 abaa
35
LZW - decoding
input
97
98
97
128
128
129
131
134
a
b
a
ab
ab
ba
aba
aba?
output
abaa
it is not ready!!
?
new phrases added
128
129
130
131
132
133
134
a?
ab
b?
a?
ab?
ab?
ba?
aba?
aba
abb
baa
abaa
ba
aa
abaa?
  • The delay in phrase creation is not a problem
    unless the encoder uses the phrase immediately
    after its creation. In this case, if decoder
    inserts the phrases only when they are completed,
    cannot decode, because it doesnt have phrase 134

36
LZW - exercise
  • Code with LZW the sequence
  • 0102001011102
  • using 8-bit ASCII codes
  • Hint. 0?48, 1?49, 2?50, ?36
  • SOL. 48 49 48 50 36 48 48 36 257 49 265 260 259

PHRASES 256 01 257 10 258 02 259 2 260 0
  • 261 00
  • 262 0
  • 263 1
  • 264 101
  • 265 11
  • 11

267 02 268 2?
37
Lempel-Ziv methods summary
  • LZ77
  • ltpointer,length,charactergt

gzip ltlength, distance/charactergt
LZ78 ltphrase,charactergt
LZW ltphrasegt
38
Lempel-Ziv methods exercise
  • Code the message using all studied methods
  • abbb010cac0bb0abb10111b1a
  • using a sliding window of 7 bits and a max match
    length of 7. No limit is given with respect to
    the dictionary dimension
  • You can use the following 8-bit ASCII codes
  • a?97 b?98 c?99 0?48 1?49
Write a Comment
User Comments (0)
About PowerShow.com