1
Chapter 6 Statistical Modeling
The Data Compression Book
2
The previous three chapters have shown several
coding techniques used to compress data. The two
coding methods discussed, Huffman and arithmetic
coding, can be implemented using either the fixed
or adaptive approaches, but in all cases a
statistical model needs to drive them. The
chapters which discuss these coding techniques
all used a simple order-0 model, which provides
fairly good compression. This chapter discusses
how to combine more sophisticated modeling
techniques with standard coding methods to
achieve better compression.
3
6.1 Higher-Order Modeling
To compress data using arithmetic or Huffman
coding, we need a model of the data stream. The
model needs to do two things to achieve
compression: (1) it needs to accurately predict
the frequency/probability of symbols in the input
data stream, and (2) the symbol probabilities
generated by the model need to deviate from a
uniform distribution. (Under a uniform
distribution over 256 byte values, every symbol
costs log2(256) = 8 bits, exactly what the raw
data already uses, so no compression occurs.)
4
6.2 Finite Context Modeling
The modeling discussed in this chapter is called
finite-context modeling. It is based on a
simple idea: calculate the probabilities for each
incoming symbol based on the context in which the
symbol appears. In all of the examples shown
here, the context consists of nothing more than
symbols previously encountered. The "order" of
the model refers to the number of previous
symbols that make up the context.
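As a minimal sketch (hypothetical names, not the
book's code), an order-1 model keyed on a single
previous symbol might look like this in C:

    /* Order-1 finite-context model: the previous symbol selects
     * the table of counts used for the current symbol. */
    #define ALPHABET 256

    static unsigned long counts[ALPHABET][ALPHABET]; /* [context][symbol] */
    static unsigned long totals[ALPHABET];           /* per-context totals */

    double order1_probability(unsigned char context, unsigned char symbol)
    {
        if (totals[context] == 0)
            return 1.0 / ALPHABET;  /* empty context: fall back to uniform */
        return (double)counts[context][symbol] / (double)totals[context];
    }

    void order1_update(unsigned char context, unsigned char symbol)
    {
        counts[context][symbol]++;
        totals[context]++;
    }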
5
6.3 Adaptive Modeling
It seems logical that as the order of the model
increases, compression ratios ought to improve as
well. The probability of the letter "u" appearing
in the text of this book may only be 5 percent,
for example, but if the previous context
character is "q", the probability goes up to 95
percent. Predicting characters with high
probability lowers the number of bits needed, and
larger contexts ought to let us make better
predictions.
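The gain is easy to quantify. An arithmetic coder
needs about -log2(p) bits for a symbol of
probability p, so a "u" at 5 percent costs about
4.3 bits, while a "u" following "q" at 95 percent
costs under a tenth of a bit. A quick check:

    /* Ideal code length, in bits, for a symbol of probability p. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        printf("p = 0.05: %.2f bits\n", -log2(0.05));  /* about 4.32 */
        printf("p = 0.95: %.2f bits\n", -log2(0.95));  /* about 0.07 */
        return 0;
    }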
6
6.3.1 A Simple Example
The sample program in Chapter 4 used Huffman
coding to demonstrate adaptive compression. In
this chapter, the sample program will use
adaptive arithmetic coding. When performing
finite-context modeling, we need a data structure
to describe each context used while compressing
the data. If we move up from an order-0 to an
order-1 model, for example, we will use the
previous symbol as the context for encoding the
current symbol.
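A sketch of that order-1 loop, with hypothetical
stubs standing in for the real coder and model
routines:

    #include <stdio.h>

    /* Stubs for the arithmetic coder and model update; the real
     * routines in the book's code have different names. */
    static void encode_symbol(int context, int symbol) { (void)context; (void)symbol; }
    static void update_model(int context, int symbol)  { (void)context; (void)symbol; }

    /* Order-1 driver: each symbol is coded in the context of the
     * one before it, then becomes the context for the next. */
    void compress_stream(FILE *input)
    {
        int context = 0;   /* arbitrary starting context */
        int c;

        while ((c = getc(input)) != EOF) {
            encode_symbol(context, c);   /* code c under table[context] */
            update_model(context, c);    /* then update that table */
            context = c;                 /* current symbol is next context */
        }
    }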
7
6.3.2 Using the Escape Code as a Fallback
The simple order-1 program does in fact do a
creditable job of compression, but it has a
couple of problems to address. First, the model
for this program makes it a slow starter. Every
context starts off with 257 symbols initialized
to a single count, meaning every symbol starts
off being encoded in roughly eight bits. As new
symbols are added to the table, they will
gradually begin to be encoded in fewer bits. This
process, however, will not happen very quickly.
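The escape mechanism addresses this slow start: a
context can begin nearly empty and defer to lower
orders until it has gathered real statistics. A
rough sketch of the fallback, with hypothetical
stubs:

    /* Stubs for the model and coder. A real model returns 0 when
     * the symbol is unseen in that order's context; order -1
     * gives every symbol a fixed nonzero count. */
    static int  count_in_context(int order, int symbol) { (void)symbol; return order < 0 ? 1 : 0; }
    static void encode_in_context(int order, int symbol) { (void)order; (void)symbol; }
    static void encode_escape(int order) { (void)order; }

    /* Fall through the orders until some context can code the
     * symbol; the order(-1) table guarantees termination. */
    void encode_with_fallback(int order, int symbol)
    {
        while (count_in_context(order, symbol) == 0) {
            encode_escape(order);   /* tell the decoder to drop an order */
            order--;
        }
        encode_in_context(order, symbol);
    }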
8
6.3.3 Improvements
One problem with the method of encoding in
ARITH-1.C is the high cost of the operations
associated with updating the model. Each time we
update the counts for symbol c, every count in
totals[context] from c up to 256 has to be
incremented. An
average of 128 increment operations have to be
performed for every character encoded or decoded.
For a simple demonstration program like the one
shown here, this may not be a major problem, but
a production program should be modified to be
more efficient.
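The costly update looks roughly like this (the
table layout is assumed, not copied from
ARITH-1.C). A production version might instead
keep the counts in a binary indexed (Fenwick)
tree and update in O(log n) time, a different
technique than the book's code uses:

    /* Naive cumulative-count update: adding one occurrence of
     * symbol c means bumping every running total from c up to the
     * end of the table, about 128 increments on average. */
    #define TABLE_SIZE 257   /* 256 byte values plus an extra slot */

    void slow_update(unsigned int totals[TABLE_SIZE], int c)
    {
        int i;
        for (i = c; i < TABLE_SIZE; i++)
            totals[i]++;
    }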
9
6.4 Highest-Order Modeling
The previous sample program used order-1
statistics to compress data. It seems logical
that if we move to higher orders, we should
achieve better compression. The importance of the
escape code becomes even more clear here. When
using an order-3 model, we potentially have 16
million context tables to work with (actually
256 × 256 × 256, or 16,777,216). We would have to
read in an incredible amount of text before those
16 million tables would have enough statistics to
start compressing data, and many of those 16
million tables will never be used, which means
they take up space in our computer's memory for
no good reason. When compressing English text,
for example, it does no good to allocate space
for the table "QQW". It will never appear.
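One way to avoid paying for contexts like "QQW"
is to allocate a table only the first time its
context actually occurs. The hashed scheme below
is a hypothetical illustration, not ARITH-N.C's
actual structure:

    #include <stdlib.h>
    #include <string.h>

    #define HASH_SIZE 65536

    typedef struct table {
        unsigned char context[3];  /* the three preceding symbols */
        unsigned int  counts[257]; /* symbol counts for this context */
        struct table *next;        /* collision chain */
    } TABLE;

    static TABLE *hash_table[HASH_SIZE];

    /* Find the table for a context, creating it on first use, so
     * contexts that never occur consume no memory at all. */
    TABLE *lookup_or_create(const unsigned char ctx[3])
    {
        unsigned int h = (ctx[0] * 65599u + ctx[1] * 251u + ctx[2]) % HASH_SIZE;
        TABLE *t;

        for (t = hash_table[h]; t != NULL; t = t->next)
            if (memcmp(t->context, ctx, 3) == 0)
                return t;

        t = calloc(1, sizeof(TABLE));  /* error handling omitted */
        memcpy(t->context, ctx, 3);
        t->next = hash_table[h];
        hash_table[h] = t;
        return t;
    }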
10
6.4.1 Updating the Model-1
ARITH-1E.C does a good job of compressing data.
But quite a few improvements can still be made to
this simple statistical method without changing
the fundamental nature of its algorithm. The rest
of this chapter is devoted to discussing those
improvements, along with a look at a sample
compression module, ARITH-N.C, that implements
most of them.
11
6.4.2 Escape Probabilities
  • When the program first starts encoding a text
    stream, it will emit quite a few escape codes.
    The number of bits used to encode escape
    characters will probably have a large effect on
    the compression ratio, particularly in small
    files. In our first attempts to use escape codes,
    we set the escape count to 1 and left it there,
    regardless of the state of the rest of the
    context table. Bell, Cleary, and Witten call this
    "Method A." "Method B" sets the count for the
    escape character at the number of symbols
    presently defined for the context table. If
    eleven different characters have been seen so
    far, for example, the escape symbol count will be
    set at eleven, regardless of what the counts
    are. (Both rules are sketched below.)
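A sketch of the two rules for a single context
table (names hypothetical):

    /* Escape count under the two rules described above. */
    int escape_count_method_a(int symbols_seen)
    {
        (void)symbols_seen;   /* Method A ignores the table entirely */
        return 1;
    }

    int escape_count_method_b(int symbols_seen)
    {
        return symbols_seen;  /* e.g. 11 once eleven symbols are defined */
    }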

12
6.4.3 Scoreboarding-1
When using highest-order modeling techniques, an
additional enhancement, scoreboarding, can
improve compression efficiency. When we first try
to compress a symbol, we can generate either the
code for the symbol or an escape code. If we
generate an escape code, the symbol had not
previously occurred in that context, so we had a
count of 0. But we do gain some information about
the symbol just by generating an escape. We can
now generate a list of symbols that did not match
the symbol to be encoded. These symbols can
temporarily have their counts set to 0 when we
calculate the probabilities for lower-order
models. The counts will be reset back to their
permanent values after the encoding for the
particular character is complete. This process is
called scoreboarding.
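A minimal scoreboard might be a 256-entry flag
array consulted whenever a lower-order model
computes its probabilities (hypothetical names):

    #include <string.h>

    static unsigned char scoreboard[256];  /* 1 = ruled out this character */

    /* After an escape, every symbol the higher-order context could
     * have coded is known not to be the current symbol. */
    void rule_out(const unsigned int counts[256])
    {
        int i;
        for (i = 0; i < 256; i++)
            if (counts[i] > 0)
                scoreboard[i] = 1;
    }

    /* Lower orders see a count of 0 for ruled-out symbols. */
    unsigned int effective_count(const unsigned int counts[256], int sym)
    {
        return scoreboard[sym] ? 0 : counts[sym];
    }

    /* Reset once the character is fully encoded. */
    void clear_scoreboard(void)
    {
        memset(scoreboard, 0, sizeof(scoreboard));
    }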
13
6.4.4 Data Structures
All improvements to the basic statistical
modeling assume that higher-order modeling can
actually be accomplished on the target computer.
The problem with increasing the order is one of
memory. The cumulative totals table in the
order-0 model in Chapter 5 occupied 516 bytes of
memory. If we used the same data structures for
an order-1 model, the memory used would shoot up
to 133K, which is still probably acceptable. But
going to order-2 will increase the RAM
requirements for the statistics unit to
thirty-four megabytes! Since we would like to try
orders even higher than 2, we need to redesign
the data structures that hold the counts.
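The jump comes from each added order multiplying
the table size by roughly 258: 258 × 516 is about
133K for order-1, and 258 × 258 × 516 is about
thirty-four megabytes for order-2. A sparse
design in the spirit of ARITH-N.C stores only the
symbols actually seen in each context; the field
names below are illustrative:

    /* Per-context statistics that grow only as symbols appear. */
    typedef struct context {
        int             max_index;   /* symbols seen so far, minus one */
        unsigned char  *symbol;      /* the symbols seen in this context */
        unsigned char  *counts;      /* one count per symbol seen */
        struct context *lesser;      /* same context, one order lower */
    } CONTEXT;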
14
6.4.5 The Finishing Touches: Tables (-1) and (-2)
The final touch to the context tree in ARITH-N.C
is the addition of two special tables. The
order(-1) table has been discussed previously.
This is a table with a fixed probability for
every symbol. If a symbol is not found in any of
the higher-order models, it will show up in the
order(-1) model. This is the table of last
resort. Since it guarantees that it will always
provide a code for every symbol in the alphabet,
we don't update this table, which means it uses a
fixed probability for every symbol.
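Initialization might look like this (a sketch;
the table size and names are assumptions):

    #define SYMBOLS 256   /* assumed alphabet size */

    static unsigned char order_minus_one[SYMBOLS];

    /* The table of last resort: one fixed count per symbol, set
     * once and never updated, so every symbol can always be coded. */
    void init_order_minus_one(void)
    {
        int i;
        for (i = 0; i < SYMBOLS; i++)
            order_minus_one[i] = 1;
    }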
15
6.4.6 Model Flushing
The creation of the order(-2) model allows us to
pass a second control code from the encoder to
the expander: the flush code, which tells the
decoder to flush statistics out of the model. The
compressor does this when the performance of the
model starts to slip. The ratio is adjustable and
is set in this implementation to 10 percent. When
compression falls below this ratio, the model is
flushed by dividing all counts by two. This
gives more weight to newer statistics, which
should improve the compression.
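The flush itself is simple: for each context
table, every count is halved (a sketch; entries
that fall to zero effectively drop out of the
statistics):

    /* Age the statistics in one context table by halving each
     * count; recent symbols then dominate the probabilities. */
    void flush_counts(unsigned char counts[], int num_entries)
    {
        int i;
        for (i = 0; i < num_entries; i++)
            counts[i] /= 2;   /* a count of 1 falls to 0 and drops out */
    }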
16
6.4.7 Implementation
Even with the Byzantine data structures used
here, the compression and expansion programs
built around ARITH-N.C have prodigious memory
requirements. When running on DOS machines
limited to 640K, these programs have to be
limited to order-1, or perhaps order-2 for text
that has a higher redundancy ratio. To examine
compression ratios for higher orders on binary
files, there are a couple of choices for these
programs. First, they can be built using a DOS
Extender, such as Rational Systems' DOS/16M. Or they
can be built on a machine that has either a
larger address space or support for virtual
memory, such as Windows 95, VMS, or UNIX. The
code distributed here was written in an attempt
to be portable across all these options.
17
6.5 Conclusions
Compression-ratio tests show that statistical
modeling can perform at least as well as
dictionary-based methods. But these programs are
at present somewhat impractical because of their
high resource requirements. ARITH-N is fairly
slow, compressing data with speeds in the range
of 1K per second and needing huge amounts of
memory to use higher-order modeling. As memory
becomes cheaper and processors become more
powerful, however, schemes such as the ones shown
here may become practical. They could be applied
today to circumstances in which either storage or
transmission costs are extremely high.
18
6.5.1 Enhancement
The performance of these algorithms could be
improved significantly beyond the implementation
discussed here. The first area for improvement
would be in memory management. Right now, when
the programs run out of memory, they abort. A
more sensible approach would be to have the
programs start with fixed amounts of memory
available for statistics. When the statistics
fill the space, the program should then stop
updating the tables and just use what it has.
This would mean implementing internal
memory-management routines instead of using the C
run-time library routines.
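A sketch of that strategy, with the pool size and
names as assumptions (alignment handling is
omitted for brevity): all statistics come from
one fixed arena, and exhausting it freezes the
model instead of aborting.

    #include <stddef.h>

    #define POOL_SIZE (4u * 1024u * 1024u)  /* assumed 4 MB budget */

    static unsigned char pool[POOL_SIZE];
    static size_t pool_used;
    static int model_frozen;   /* when set, stop updating the tables */

    /* Hand out statistics memory until the arena is gone, then
     * freeze the model rather than aborting the program. */
    void *stats_alloc(size_t bytes)
    {
        if (pool_used + bytes > POOL_SIZE) {
            model_frozen = 1;  /* keep coding with what we have */
            return NULL;
        }
        pool_used += bytes;
        return &pool[pool_used - bytes];
    }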
19
6.6 ARITH-N Listing
The code ARITH-N.C