1
Processing of large document collections
  • Fall 2002, Part 3

2
Text compression
  • Despite a continuous increase in storage and
    transmission capacities, more and more effort has
    been put into using compression to increase the
    amount of data that can be handled
  • no matter how much storage space or transmission
    bandwidth is available, someone always finds ways
    to fill it

3
Text compression
  • Efficient storage and representation of
    information is an old problem (before the
    computer era)
  • Morse code uses shorter representations for
    common characters
  • Braille code for the blind includes
    contractions, which represent common words with 2
    or 3 characters

4
Text compression
  • On a computer: changing the representation of a
    file so that it takes less space to store or less
    time to transmit
  • the original file can be reconstructed exactly
    from the compressed representation
  • different from data compression in general:
  • text compression has to be lossless
  • compare with sound and images, where small changes
    and noise are tolerated

5
Text compression methods
  • Huffman coding (in the 50s)
  • compresses English to about 5 bits/character
  • Ziv-Lempel compression (in the 70s)
  • about 4 bits/character
  • arithmetic coding
  • about 2 bits/character (more processing power
    needed)
  • prediction by partial matching (80s)

6
Text compression methods
  • Since the 80s, the compression rate has been
    about the same
  • improvements are made in processor and memory
    utilization during compression
  • also, the amount of compression may increase when
    more memory (for compression and decompression)
    is available

7
Text compression methods
  • Most text compression methods can be placed in
    one of two classes
  • symbolwise methods
  • dictionary methods

8
Symbolwise methods
  • Work by estimating the probabilities of symbols
    (often characters)
  • coding one symbol at a time
  • using shorter codewords for the most likely
    symbols (in the same way as Morse code does)

9
Symbolwise methods
  • variations differ mainly in how they estimate
    probabilities for symbols
  • the more accurate these estimates are, the
    greater the compression that can be achieved
  • to obtain good compression, the probability
    estimate is usually based on the context in which
    a symbol occurs

10
Dictionary methods
  • compress by replacing words and other fragments
    of text with an index to an entry in a
    dictionary
  • compression is achieved if the index is stored in
    fewer bits than the string it replaces

11
Symbolwise methods
  • Modeling
  • estimating probabilities
  • there does not appear to be any single best
    method
  • Coding
  • converting the probabilities into a bitstream for
    transmission
  • well understood, can be performed effectively

12
Models
  • Compression methods obtain high compression by
    forming good models of the data that is to be
    coded
  • the function of a model is to predict symbols
  • e.g. during the encoding of a text, the
    prediction for the next symbol might include a
    probability of 2% for the letter 'u', based on
    its relative frequency in a sample of text

13
Models
  • The set of all possible symbols is called the
    alphabet
  • the probability distribution provides an
    estimated probability for each symbol in the
    alphabet

14
Encoding, decoding
  • the model provides the probability distribution
    to the encoder, which uses it to encode the
    symbol that actually occurs
  • the decoder uses an identical model together with
    the output of the encoder to find out what the
    encoded symbol was

15
Information content of a symbol
  • The number of bits in which a symbol s should be
    coded is called the information content I(s) of
    the symbol
  • the information content I(s) is directly related
    to the symbol's predicted probability P(s) by
    the function
  • I(s) = -log2 P(s) bits

16
Information content of a symbol
  • The average amount of information per symbol over
    the whole alphabet is known as the entropy of the
    probability distribution, denoted by H

17
Information content of a symbol
  • Provided that the symbols appear independently
    and with the assumed probabilities, H is a lower
    bound on compression, measured in bits per
    symbol, that can be achieved by any coding method

18
Information content of a symbol
  • If the probability of symbol 'u' is estimated to
    be 2%, the corresponding information content is
    5.6 bits
  • if 'u' happens to be the next symbol that is to
    be coded, it should be transmitted in 5.6 bits

19
Information content of a symbol
  • predictions can usually be improved by taking
    account of the previous symbol
  • if a 'q' has just occurred, the probability of
    'u' may jump to 95%, based on how often 'q' is
    followed by 'u' in a sample of text
  • the information content of 'u' in this case is
    0.074 bits
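  • in symbols, H is the expected value of I(s):
    H = - sum over s of P(s) * log2 P(s)
  • a minimal Python sketch (not from the slides),
    reproducing the figures quoted above:

    import math

    def information_content(p):
        # I(s) = -log2 P(s): bits needed for a symbol of probability p
        return -math.log2(p)

    def entropy(probs):
        # H: average information content, a lower bound in bits/symbol
        return sum(p * information_content(p) for p in probs if p > 0)

    print(round(information_content(0.02), 1))   # 5.6 bits: 'u' at 2%
    print(round(information_content(0.95), 3))   # 0.074 bits: 'u' after 'q'
    print(entropy([0.5, 0.25, 0.125, 0.125]))    # 1.75 bits/symbol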

20
Information content of a symbol
  • Models that take a few immediately preceding
    symbols into account to make a prediction are
    called finite-context models of order m
  • m is the number of previous symbols used to make
    a prediction

21
Static models
  • There are many ways to estimate the probabilities
    in a model
  • we could use static modelling
  • always use the same probabilities for symbols,
    regardless of what text is being coded
  • the compression system may not perform well if a
    different kind of text is received
  • e.g. a model for English applied to a file of
    numbers

22
Semi-static models
  • One solution is to generate a model specifically
    for each file that is to be compressed
  • an initial pass is made through the file to
    estimate symbol probabilities, and these are
    transmitted to the decoder before transmitting
    the encoded symbols
  • this is called semi-static modelling

23
Semi-static models
  • Semi-static modelling has the advantage that the
    model is invariably better suited to the input
    than a static one, but the penalty paid is
  • having to transmit the model first,
  • as well as the preliminary pass over the data to
    accumulate symbol probabilities

24
Adaptive models
  • An adaptive model begins with a bland probability
    distribution and gradually alters it as more
    symbols are encountered
  • as an example, assume a zero-order model, i.e.,
    no context is used to predict the next symbol

25
Adaptive models
  • Assume that an encoder has already encoded a long
    text and has come to the phrase 'It migh'
  • now the probability that the next character is
    't' is estimated to be 49,983/768,078 ≈ 6.5%,
    since 49,983 of the 768,078 characters seen in
    the previous text were 't'

26
Adaptive models
  • Using the same system, 'e' has probability 9.4%
    and 'x' has probability 0.11%
  • the model provides this estimated probability
    distribution to an encoder
  • the decoder is able to generate the same model
    since it has the same probability estimates as
    the encoder
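  • a minimal sketch of this counting scheme (class
    name and sample text are illustrative):

    from collections import Counter

    class ZeroOrderAdaptiveModel:
        # probabilities are the relative frequencies of the symbols
        # seen so far; encoder and decoder apply identical updates,
        # so the model never has to be transmitted
        def __init__(self):
            self.counts = Counter()
            self.total = 0

        def probability(self, symbol):
            return self.counts[symbol] / self.total if self.total else 0.0

        def update(self, symbol):
            self.counts[symbol] += 1
            self.total += 1

    model = ZeroOrderAdaptiveModel()
    for ch in "it might be that it might be":
        model.update(ch)
    print(round(model.probability("t"), 3))  # relative frequency of 't'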

27
Adaptive models
  • For a higher-order model, such as a first-order
    model, the probability is estimated by how often
    that character has occurred in the current
    context
  • in the zero-order model earlier, the symbol 't'
    occurred in the context 'It migh', but the model
    made no use of the characters of the phrase

28
Adaptive models
  • A first-order model would use the final 'h' as a
    context with which to condition the probability
    estimates
  • the letter 'h' has occurred 37,525 times in the
    prior text, and 1,133 of those times it was
    followed by a 't'
  • the probability of 't' occurring after an 'h' can
    be estimated to be 1,133/37,525 ≈ 3.02%

29
Adaptive models
  • For 't', a prediction of 3.02% is actually worse
    than in the zero-order model, because 't' is rare
    in this context ('e' follows 'h' much more often)
  • a second-order model would use the relative
    frequency with which the context 'gh' is followed
    by 't', which is the case in 64.6% of cases
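  • the same idea with one character of context, as
    in the 'h' -> 't' estimate above (a hypothetical
    sketch, not from the slides):

    from collections import Counter, defaultdict

    class FirstOrderAdaptiveModel:
        # one table of symbol counts per one-character context
        def __init__(self):
            self.contexts = defaultdict(Counter)

        def probability(self, context, symbol):
            counts = self.contexts[context]
            total = sum(counts.values())
            return counts[symbol] / total if total else 0.0

        def update(self, context, symbol):
            self.contexts[context][symbol] += 1

    model = FirstOrderAdaptiveModel()
    text = "it might be that it might be"
    for prev, ch in zip(text, text[1:]):
        model.update(prev, ch)
    print(round(model.probability("h", "t"), 3))  # P('t' | 'h')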

30
Adaptive models
  • Good: robust, reliable, flexible
  • Bad: not suitable for random access to compressed
    files
  • a text can be decoded only from the beginning,
    since the model used for coding a particular part
    of the text is determined from all the preceding
    text
  • -> not suitable for full-text retrieval

31
Coding
  • Coding is the task of determining the output
    representation of a symbol, based on a
    probability distribution supplied by a model
  • general idea: the coder should output short
    codewords for likely symbols and long codewords
    for rare ones
  • symbolwise methods depend heavily on a good coder
    to achieve compression

32
Huffman coding
  • A phrase is coded by replacing each of its
    symbols with the codeword given by a table
  • Huffman coding generates codewords for a set of
    symbols, given some probability distribution for
    the symbols
  • this type of code is called a prefix-free code
  • no codeword is a prefix of another symbol's
    codeword

33
Huffman coding
  • The codewords can be stored in a tree (a decoding
    tree)
  • Huffman's algorithm works by constructing the
    decoding tree from the bottom up

34
Huffman coding
  • Algorithm (a Python sketch follows below)
  • create for each symbol a leaf node containing the
    symbol and its probability
  • two nodes with the smallest probabilities become
    siblings under a new parent node, which is given
    a probability equal to the sum of its two
    children's probabilities
  • the combining operation is repeated until there
    is only one node without a parent
  • the two branches from every nonleaf node are then
    labeled 0 and 1
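  • a minimal Python sketch of this bottom-up
    construction (helper names are illustrative, not
    from the slides):

    import heapq
    from itertools import count

    def huffman_codes(probs):
        # probs: {symbol: probability}; returns {symbol: codeword}
        tiebreak = count()   # heapq needs a total order on equal probs
        heap = [(p, next(tiebreak), sym) for sym, p in probs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            # the two smallest-probability nodes become siblings
            p1, _, left = heapq.heappop(heap)
            p2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))
        codes = {}
        def label(node, prefix):
            if isinstance(node, tuple):       # nonleaf: branches 0 and 1
                label(node[0], prefix + "0")
                label(node[1], prefix + "1")
            else:                             # leaf: one symbol
                codes[node] = prefix or "0"   # one-symbol alphabet case
        label(heap[0][2], "")
        return codes

    print(huffman_codes({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}))
    # e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

  • since no codeword is a prefix of another, the
    decoder can follow the tree from the root and
    know unambiguously where each symbol ends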

35
Huffman coding
  • Huffman coding is generally fast for both
    encoding and decoding, provided that the
    probability distribution is static
  • adaptive Huffman coding is possible, but needs
    either a lot of memory or is slow
  • coupled with a word-based model (rather than a
    character-based model), it gives good compression

36
Dictionary models
  • Dictionary-based compression methods use the
    principle of replacing substrings in a text with
    a codeword that identifies that substring in a
    dictionary
  • the dictionary contains a list of substrings and
    a codeword for each substring
  • often, fixed codewords are used
  • reasonable compression is obtained even if the
    coding is simple

37
Dictionary models
  • The simplest dictionary compression methods use
    small dictionaries
  • for instance, digram coding
  • selected pairs of letters are replaced with
    codewords
  • a dictionary for the ASCII character set might
    contain the 128 ASCII characters, as well as 128
    common letter pairs

38
Dictionary models
  • Digram coding
  • the output codewords are eight bits each
  • the presence of the full ASCII character set
    ensures that any (ASCII) input can be represented
  • at best, every pair of characters is replaced
    with a codeword, reducing the input from 7
    bits/character to 4 bits/character
  • at worst, each 7-bit character will be expanded
    to 8 bits
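  • a toy version of such a digram coder (the pair
    table below is a short illustrative stand-in for
    the 128 common pairs):

    # codewords 0..127 are plain ASCII, 128..255 stand for pairs
    PAIRS = ["e ", " t", "th", "he", "s ", " a", "in", "er"]

    def digram_encode(text):
        out = []
        i = 0
        while i < len(text):
            pair = text[i:i+2]
            if pair in PAIRS:                    # known pair: one byte
                out.append(128 + PAIRS.index(pair))
                i += 2
            else:                                # single ASCII character
                out.append(ord(text[i]))
                i += 1
        return bytes(out)

    def digram_decode(data):
        return "".join(chr(b) if b < 128 else PAIRS[b - 128]
                       for b in data)

    coded = digram_encode("the tin")
    print(len(coded), digram_decode(coded))      # 4 the tin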

39
Dictionary models
  • Natural extension
  • put even larger entries in the dictionary, e.g.
    common words like 'and' and 'the', or common
    components of words like 'pre' and 'tion'
  • a predefined set of dictionary phrases makes the
    compression domain-dependent
  • or very short phrases have to be used -> good
    compression is not achieved

40
Dictionary models
  • One way to avoid the problem of the dictionary
    being unsuitable for the text at hand is to use a
    semi-static dictionary scheme
  • construct a new dictionary for every text that is
    to be compressed
  • the overhead of transmitting or storing the
    dictionary is significant
  • deciding which phrases should be included is a
    difficult problem

41
Dictionary models
  • Solution: use an adaptive dictionary scheme
  • Ziv-Lempel coders (LZ77 and LZ78)
  • a substring of text is replaced with a pointer to
    where it has occurred previously
  • dictionary: all the text prior to the current
    position
  • codewords: pointers

42
Dictionary models
  • Ziv-Lempel
  • the prior text makes a very good dictionary since
    it is usually in the same style and language as
    upcoming text
  • the dictionary is transmitted implicitly at no
    extra cost, because the decoder has access to all
    previously encoded text

43
LZ77
  • Key benefits
  • relatively easy to implement
  • decoding can be performed extremely quickly using
    only a small amount of memory
  • suitable when the resources required for decoding
    must be minimized, like when data is distributed
    or broadcast from a central source to a number of
    small computers

44
LZ77
  • The output of an encoder consists of a sequence
    of triples, e.g. <3,2,b>
  • the first component of a triple indicates how far
    back to look in the previous (decoded) text to
    find the next phrase
  • the second component records how long the phrase
    is
  • the third component gives the next character from
    the input

45
LZ77
  • Components 1 and 2 constitute a pointer back
    into the text
  • component 3 is actually necessary only when the
    character to be coded does not occur anywhere in
    the previous input
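  • a minimal sketch of decoding such triples;
    copying one character at a time makes overlapping
    matches (length greater than the back-distance)
    come out right:

    def lz77_decode(triples):
        out = []
        for distance, length, ch in triples:
            start = len(out) - distance
            for k in range(length):      # copy the matched phrase
                out.append(out[start + k])
            out.append(ch)               # then the explicit character
        return "".join(out)

    # (0,0,c) codes a bare literal c
    print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (2, 3, "b")]))  # ababab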

46
LZ77
  • Encoding (a sketch in Python follows below)
  • for the text from the current point ahead:
  • search for the longest match in the previous text
  • output a triple that records the position and
    length of the match
  • the search for a match may return a length of
    zero, in which case the position of the match is
    not relevant
  • the search can be accelerated by indexing the
    prior text with a suitable data structure
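  • a brute-force sketch of this greedy search (the
    window and length limits are illustrative; see
    the next slide):

    def lz77_encode(text, window=4096, max_len=16):
        triples = []
        i = 0
        while i < len(text):
            best_dist, best_len = 0, 0
            for j in range(max(0, i - window), i):   # candidate starts
                length = 0
                while (length < max_len and i + length < len(text) - 1
                       and text[j + length] == text[i + length]):
                    length += 1
                if length > best_len:
                    best_dist, best_len = i - j, length
            # emit the match plus the explicit next character
            triples.append((best_dist, best_len, text[i + best_len]))
            i += best_len + 1
        return triples

    print(lz77_encode("ababab"))
    # [(0, 0, 'a'), (0, 0, 'b'), (2, 3, 'b')]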

47
LZ77
  • There are limitations on how far back a pointer
    can refer and on the maximum length of the string
    referred to
  • e.g. for English text, a window of a few thousand
    characters
  • the length of the phrase is limited, e.g. to a
    maximum of 16 characters
  • otherwise too much space is wasted without
    benefit
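  • as an illustration (figures not from the
    slides): with a 4,096-character window and a
    match length capped at 16, a triple costs about
    12 + 4 + 8 = 24 bits; since it stands for
    length+1 characters of text, it breaks even at a
    match of length 2 and saves space on longer
    matches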

48
LZ77
  • The decoding program is very simple, so it can be
    included with the data at very little cost
  • in fact, the compressed data is stored as part of
    the decoder program, which makes the data
    self-expanding
  • a common way to distribute files