Title: Chapter 7 Dictionary-Based Compression
Chapter 7: Dictionary-Based Compression
The Data Compression Book
7.1 An Example
A good example of how dictionary-based compression works: the eight-word phrase "A good example of how dictionary based compression" can be coded as 1/1 822/3 674/4 1343/60 928/75 550/32 173/46 421/2. (The first number gives the page of the dictionary, and the second number tells the number of the word on that page.) If the dictionary has 2,200 pages, with fewer than 256 entries on each page, then 12 bits are enough for the page number and 8 bits for the entry number, so the total number of bits needed for the above message is 8 × 20 bits = 160 bits = 20 bytes < 43 bytes (the ASCII encoding).
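A rough sketch in C of that arithmetic: each reference is packed into a 20-bit code, with the 12-bit page number in the high bits and the 8-bit entry number in the low bits. The function name and the use of a 32-bit integer as a container are illustrative assumptions, not part of any real dictionary coder.

    #include <stdio.h>
    #include <stdint.h>

    /* Pack one dictionary reference into 20 bits: 12 bits of page number
     * (enough for a 2,200-page dictionary) followed by 8 bits of entry
     * number (each page holds fewer than 256 entries). */
    static uint32_t pack_reference(uint32_t page, uint32_t entry)
    {
        return (page << 8) | (entry & 0xFF);
    }

    int main(void)
    {
        /* The eight references from the example message. */
        static const uint32_t refs[8][2] = {
            {1, 1}, {822, 3}, {674, 4}, {1343, 60},
            {928, 75}, {550, 32}, {173, 46}, {421, 2}
        };
        unsigned long total_bits = 0;

        for (int i = 0; i < 8; i++) {
            uint32_t code = pack_reference(refs[i][0], refs[i][1]);
            printf("%4u/%-2u -> 0x%05X\n",
                   (unsigned) refs[i][0], (unsigned) refs[i][1], (unsigned) code);
            total_bits += 20;
        }
        printf("total: %lu bits = %lu bytes\n", total_bits, total_bits / 8);
        return 0;
    }

Running it prints the 20-bit code for each reference and confirms the 160 bits = 20 bytes total.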
7.2 Static vs. Adaptive
- A static dictionary is built up before compression occurs, it does not change while the data is being compressed, and it can be tuned to fit the data it is compressing.
- Adaptive compression schemes do not have to deal with the problem of how to pass the dictionary from the encoder to the decoder.
7.2.1 Adaptive Methods
A section of code that compresses text using such an algorithm might look something like this:

    for ( ; ; ) {
        word = read_word( input_file );
        dictionary_index = look_up( word, dictionary );
        if ( dictionary_index < 0 ) {
            output( word, output_file );             /* pass the new word through as plain text */
            add_to_dictionary( word, dictionary );
        } else
            output( dictionary_index, output_file ); /* emit a reference to the existing entry */
    }
7.2.1 Adaptive Methods (cont.)
- The basic components of an adaptive dictionary compression algorithm (a minimal sketch of both directions follows this list):
  - To parse the input text stream into fragments to be tested against the dictionary.
  - To test the input fragments against the dictionary; it may or may not be desirable to report on partial matches.
  - To add new phrases to the dictionary.
  - To encode dictionary indices and plain text so that they are distinguishable.
- The corresponding decompression components:
  - To decode the input stream into either dictionary indices or plain text.
  - To add new phrases to the dictionary.
  - To convert dictionary indices into phrases.
  - To output phrases as plain text.
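A minimal, self-contained sketch of these components in C, keeping the "compressed stream" as an in-memory array of tokens so both directions can be shown together. The word splitting, the token layout, and the fixed-size tables are simplifying assumptions made for illustration, not part of any particular algorithm.

    #include <stdio.h>
    #include <string.h>

    #define MAX_WORDS 64
    #define MAX_LEN   32

    typedef struct {
        int  is_index;          /* 1 = dictionary index, 0 = plain text */
        int  index;
        char text[MAX_LEN];
    } Token;

    static char dict[MAX_WORDS][MAX_LEN];
    static int  dict_size;

    static int look_up(const char *word)
    {
        for (int i = 0; i < dict_size; i++)
            if (strcmp(dict[i], word) == 0)
                return i;
        return -1;
    }

    static void add_to_dictionary(const char *word)
    {
        if (dict_size < MAX_WORDS)
            strcpy(dict[dict_size++], word);
    }

    /* Compression: parse into words, test each against the dictionary,
     * emit an index or a literal word, and grow the dictionary. */
    static int compress(const char *input, Token *out)
    {
        char buf[256];
        int n = 0;
        strcpy(buf, input);
        for (char *w = strtok(buf, " "); w != NULL; w = strtok(NULL, " ")) {
            int idx = look_up(w);
            if (idx < 0) {
                out[n].is_index = 0;
                strcpy(out[n].text, w);
                add_to_dictionary(w);
            } else {
                out[n].is_index = 1;
                out[n].index = idx;
            }
            n++;
        }
        return n;
    }

    /* Decompression: decode each token as an index or plain text, and
     * grow an identical dictionary so later indices resolve correctly. */
    static void decompress(const Token *in, int n)
    {
        dict_size = 0;                       /* decoder rebuilds its own copy */
        for (int i = 0; i < n; i++) {
            const char *phrase;
            if (in[i].is_index) {
                phrase = dict[in[i].index];
            } else {
                phrase = in[i].text;
                add_to_dictionary(in[i].text);
            }
            printf("%s ", phrase);
        }
        printf("\n");
    }

    int main(void)
    {
        Token tokens[MAX_WORDS];
        int n = compress("the cat sat on the mat and the cat slept", tokens);
        decompress(tokens, n);
        return 0;
    }

Running it on the short phrase shows first occurrences passing through as plain text while repeated words come back as small dictionary indices.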
7.2.2 A Representative Example
QIC-122 (QIC refers to the Quarter Inch Cartridge industry group, a trade association of tape-drive manufacturers) provides a
good example of how a sliding-window,
dictionary-based compression algorithm actually
works. It is based on the LZ77 sliding-window
concept. As symbols are read in by the encoder,
they are added to the end of a 2K window that
forms the phrase dictionary. To encode a symbol,
the encoder checks to see if it is part of a
phrase already in the dictionary. If it is, it
creates a token that defines the location of the
phrase and its length. If it is not, the symbol
is passed through unencoded. Each token or symbol
is prefixed by a single bit flag that indicates
whether the following data is a dictionary
reference or a plain symbol. The definitions for
these two sequences are: (1) plaintext: <1><eight-bit-symbol>; (2) dictionary reference: <0><window-offset><phrase-length>.
7.2.2 A Representative Example (cont.)
The glossed-over version of the C code for this algorithm:

    while ( !out_of_symbols ) {
        length = find_longest_match( &offset );
        if ( length > 1 ) {              /* a phrase in the window matched */
            output_bit( 0 );
            output_bits( offset );
            output_bits( length );
            shift_input_buffer( length );
        } else {                         /* no usable match: pass the symbol through */
            output_bit( 1 );
            output_byte( buffer[ 0 ] );
            shift_input_buffer( 1 );
        }
    }
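To show the other side of the token format, here is a toy decoder for the two sequences defined above. The compressed stream is kept in memory one bit per array element, and the offset and length fields are read as fixed 11-bit and 4-bit quantities; the real QIC-122 standard encodes both fields with variable-length codes, so the field widths here are purely illustrative assumptions.

    #include <stdio.h>

    static int stream[256], stream_len;      /* compressed stream, one bit per element */
    static int pos;                          /* read position for the decoder */

    static void put_bit(int b) { stream[stream_len++] = b; }

    static void put_bits(int value, int count)   /* write `count` bits, MSB first */
    {
        while (count-- > 0)
            put_bit((value >> count) & 1);
    }

    static int get_bits(int count)                /* read `count` bits, MSB first */
    {
        int value = 0;
        while (count-- > 0)
            value = (value << 1) | stream[pos++];
        return value;
    }

    static void decode(void)
    {
        char window[2048];                        /* the 2K phrase dictionary */
        int window_len = 0;

        while (pos < stream_len) {
            if (get_bits(1) == 1) {               /* <1><eight-bit-symbol> */
                window[window_len++] = (char) get_bits(8);
            } else {                              /* <0><window-offset><phrase-length> */
                int offset = get_bits(11);
                int length = get_bits(4);
                for (int i = 0; i < length; i++, window_len++)
                    window[window_len] = window[window_len - offset];
            }
        }
        fwrite(window, 1, window_len, stdout);
        putchar('\n');
    }

    int main(void)
    {
        /* Hand-built stream: literal 'A', literal 'B', then a reference that
         * copies 4 bytes starting 2 back, which decodes to "ABABAB". */
        put_bit(1); put_bits('A', 8);
        put_bit(1); put_bits('B', 8);
        put_bit(0); put_bits(2, 11); put_bits(4, 4);
        decode();
        return 0;
    }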
7.2.2 A Representative Example (cont.)
Here is an example of what this sliding window looks like when used to encode some C code, in this case the phrase output_byte. The previously encoded text, which ends with the phrase output_bit( 1 )\r, is at the end of the window. The
find_longest_match routine will return a value of
8, since the first eight characters of
output_byte match the first eight characters of
output_bit. It encodes 8 bytes of data (output_b) with only 2 bytes.
7.3 Israeli Roots
Dig beneath the surface of virtually any
dictionary-based compression program, and you
will find the work of Jacob Ziv and Abraham
Lempel. It all began in 1977 with the publication of Jacob Ziv and Abraham Lempel's "A Universal Algorithm for Sequential Data Compression" in IEEE Transactions on Information Theory. This paper, with its 1978 sequel "Compression of Individual Sequences via Variable-Rate Coding," triggered a flood of dictionary-based compression research, algorithms, and programs. LZ77 and LZ78: LZ77 is a sliding-window technique in which the
dictionary consists of a set of fixed-length
phrases found in a window into the previously
processed text. LZ78 takes a completely different
approach to building a dictionary. Instead of
using fixed-length phrases from a window into the
text, LZ78 builds phrases up one symbol at a
time, adding a new symbol to an existing phrase
when a match occurs.
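Since the LZ77 side has just been illustrated by QIC-122, here is a minimal sketch of the LZ78 idea of building phrases one symbol at a time. The (phrase-index, next-symbol) output pairs are the standard LZ78 form, but the linear-search phrase table, the table sizes, and the helper names are assumptions of this sketch.

    #include <stdio.h>
    #include <string.h>

    #define MAX_PHRASES 256
    #define MAX_LEN     64

    static char phrases[MAX_PHRASES][MAX_LEN];
    static int  phrase_count = 1;            /* entry 0 is the empty phrase */

    /* Return the index of a phrase of exactly `len` symbols, or 0 if absent. */
    static int find_phrase(const char *s, int len)
    {
        for (int i = 1; i < phrase_count; i++)
            if ((int) strlen(phrases[i]) == len && memcmp(phrases[i], s, len) == 0)
                return i;
        return 0;
    }

    int main(void)
    {
        const char *text = "abababababa";
        int n = (int) strlen(text);
        int i = 0;

        while (i < n) {
            int len = 0, index = 0;
            /* Grow the current phrase one symbol at a time while it still
             * matches an existing dictionary entry. */
            while (i + len < n) {
                int next = find_phrase(&text[i], len + 1);
                if (next == 0)
                    break;
                index = next;
                len++;
            }
            /* Emit (index of longest matching phrase, first unmatched symbol)
             * and add the extended phrase to the dictionary. */
            char symbol = (i + len < n) ? text[i + len] : '-';
            printf("(%d, '%c')\n", index, symbol);
            if (i + len < n && phrase_count < MAX_PHRASES && len + 1 < MAX_LEN) {
                memcpy(phrases[phrase_count], &text[i], len + 1);
                phrases[phrase_count][len + 1] = '\0';
                phrase_count++;
            }
            i += len + 1;
        }
        return 0;
    }

For "abababababa" this prints (0,'a') (0,'b') (1,'b') (3,'a') (2,'a') (5,'-'), showing each new dictionary phrase being one symbol longer than the match that produced it.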
7.3.1 History
The June 1984 issue of IEEE Computer carried an article entitled "A Technique for High-Performance Data Compression" by Terry Welch. His paper was a practical description of his implementation of the LZ78 algorithm, which he called LZW. LZW soon found its way into the UNIX compress program, at a time when desktop software was still struggling along with an order-0 Huffman coding program known as SQ.
7.4 ARC: The Father of MS-DOS Dictionary Compression
In 1985, System Enhancement Associates released a
general-purpose compression and cataloging
program called ARC. It was followed by PKWare's PKARC, LHarc by Haruyasu Yoshizaki, and ARJ by Robert Jung. A patent dispute between Unisys, which owns the patent for LZ78-derived algorithms (Terry Welch's work), and the rest of the computer industry has resulted in a definite shift over to LZ77-derived algorithms. For example, the recently designed PNG format (discussed later in this book) is being promulgated as a replacement for CompuServe's GIF format, in order to sidestep Unisys' patent claims.
7.4.1 Dictionary Compression: Where It Shows Up
- General-purpose programs
- Hardware-specific code
7.5 Danger Ahead: Patents
One of the first data-compression patents was
granted to Sperry Corp. (now Unisys) for the
improvements to LZ78 developed by Terry Welch at the Sperry Research Center: LZW. LZW-derived compression shows up, among other places, in CCITT V.42bis and in GIF.
7.6 Conclusion
Dictionary-based compression techniques are
presently the most popular forms of compression
in the lossless arena. Almost without exception,
these techniques can trace their origins back to
the original work published by Ziv and Lempel in
1977 and 1978. Refinements on these algorithms
yield better performance at lower cost, but both
types of improvements are evolutionary, not
revolutionary.