1
Processing of large document collections
  • Fall 2002, Part 3

2
Text compression
  • Despite a continuous increase in storage and
    transmission capacities, more and more effort has
    been put into using compression to increase the
    amount of data that can be handled
  • no matter how much storage space or transmission
    bandwidth is available, someone always finds ways
    to fill it

3
Text compression
  • Efficient storage and representation of
    information is an old problem (before the
    computer era)
  • Morse code uses shorter representations for
    common characters
  • Braille code for the blind includes
    contractions, which represent common words with 2
    or 3 characters

4
Text compression
  • On a computer: changing the representation of a
    file so that it takes less space to store or less
    time to transmit
  • the original file can be reconstructed exactly
    from the compressed representation
  • different from data compression in general:
  • text compression has to be lossless
  • compare with sound and images, where small changes
    and noise are tolerated

5
Text compression methods
  • Huffman coding (in the 50s)
  • compresses English to about 5 bits/character
  • Ziv-Lempel compression (in the 70s)
  • about 4 bits/character
  • arithmetic coding
  • about 2 bits/character (more processing power
    needed)
  • prediction by partial matching (80s)

6
Text compression methods
  • Since the 80s, the compression rate has been
    about the same
  • improvements are made in processor and memory
    utilization during compression
  • also, the amount of compression may increase when
    more memory (for compression and decompression)
    is available

7
Text compression methods
  • Most text compression methods can be placed in
    one of two classes
  • symbolwise methods
  • dictionary methods

8
Symbolwise methods
  • Work by estimating the probabilities of symbols
    (often characters)
  • coding one symbol at a time
  • using shorter codewords for the most likely
    symbols (in the same way as Morse code does)

9
Symbolwise methods
  • variations differ mainly in how they estimate
    probabilities for symbols
  • the more accurate these estimates are, the
    greater the compression that can be achieved
  • to obtain good compression, the probability
    estimate is usually based on the context in which
    a symbol occurs

10
Dictionary methods
  • compress by replacing words and other fragments
    of text with an index to an entry in a
    dictionary
  • compression is achieved if the index is stored in
    fewer bits than the string it replaces

11
Symbolwise methods
  • Modeling
  • estimating probabilities
  • there does not appear to be any single best
    method
  • Coding
  • converting the probabilities into a bitstream for
    transmission
  • well understood, can be performed effectively

12
Models
  • Compression methods obtain high compression by
    forming good models of the data that is to be
    coded
  • the function of a model is to predict symbols
  • e.g. during the encoding of a text, the
    prediction for the next symbol might include a
    probability of 2% for the letter 'u', based on
    its relative frequency in a sample of text

13
Models
  • The set of all possible symbols is called the
    alphabet
  • the probability distribution provides an
    estimated probability for each symbol in the
    alphabet

14
Encoding, decoding
  • the model provides the probability distribution
    to the encoder, which uses it to encode the
    symbol that actually occurs
  • the decoder uses an identical model together with
    the output of the encoder to find out what the
    encoded symbol was

15
Information content of a symbol
  • The number of bits in which a symbol s should be
    coded is called the information content I(s) of
    the symbol
  • the information content I(s) is directly related
    to the symbol's predicted probability P(s) by
    the function
  • I(s) = -log2 P(s) bits

16
Information content of a symbol
  • The average amount of information per symbol over
    the whole alphabet is known as the entropy of the
    probability distribution, denoted by H

17
Information content of a symbol
  • Provided that the symbols appear independently
    and with the assumed probabilities, H is a lower
    bound on compression, measured in bits per
    symbol, that can be achieved by any coding method

18
Information content of a symbol
  • If the probability of symbol 'u' is estimated to
    be 2%, the corresponding information content is
    5.6 bits
  • if 'u' happens to be the next symbol that is to
    be coded, it should be transmitted in 5.6 bits

19
Information content of a symbol
  • predictions can usually be improved by taking
    account of the previous symbol
  • if a 'q' has just occurred, the probability of
    'u' may jump to 95%, based on how often 'q' is
    followed by 'u' in a sample of text
  • the information content of 'u' in this case is
    0.074 bits
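  • in symbols, H is the expected value of I(s):
    H = - sum over s of P(s) * log2 P(s)
  • a minimal Python sketch (not from the slides),
    reproducing the figures quoted above:

    import math

    def information_content(p):
        # I(s) = -log2 P(s): bits needed for a symbol of probability p
        return -math.log2(p)

    def entropy(probs):
        # H: average information content, a lower bound in bits/symbol
        return sum(p * information_content(p) for p in probs if p > 0)

    print(round(information_content(0.02), 1))   # 5.6 bits: 'u' at 2%
    print(round(information_content(0.95), 3))   # 0.074 bits: 'u' after 'q'
    print(entropy([0.5, 0.25, 0.125, 0.125]))    # 1.75 bits/symbol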

20
Information content of a symbol
  • Models that take a few immediately preceding
    symbols into account to make a prediction are
    called finite-context models of order m
  • m is the number of previous symbols used to make
    a prediction

21
Static models
  • There are many ways to estimate the probabilities
    in a model
  • we could use static modelling
  • always use the same probabilities for symbols,
    regardless of what text is being coded
  • the compression system may not perform well if a
    different kind of text is received
  • e.g. a model for English applied to a file of
    numbers

22
Semi-static models
  • One solution is to generate a model specifically
    for each file that is to be compressed
  • an initial pass is made through the file to
    estimate symbol probabilities, and these are
    transmitted to the decoder before transmitting
    the encoded symbols
  • this is called semi-static modelling

23
Semi-static models
  • Semi-static modelling has the advantage that the
    model is invariably better suited to the input
    than a static one, but the penalty paid is
  • having to transmit the model first,
  • as well as the preliminary pass over the data to
    accumulate symbol probabilities

24
Adaptive models
  • An adaptive model begins with a bland probability
    distribution and gradually alters it as more
    symbols are encountered
  • as an example, assume a zero-order model, i.e.,
    no context is used to predict the next symbol

25
Adaptive models
  • Assume that an encoder has already encoded a long
    text and has come to the phrase 'It migh'
  • now the probability that the next character is
    't' is estimated to be 49,983/768,078 ≈ 6.5%,
    since 49,983 of the 768,078 characters seen in
    the previous text were 't'

26
Adaptive models
  • Using the same system, 'e' has probability 9.4%
    and 'x' has probability 0.11%
  • the model provides this estimated probability
    distribution to an encoder
  • the decoder is able to generate the same model
    since it has the same probability estimates as
    the encoder
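  • a minimal sketch of this counting scheme (class
    name and sample text are illustrative):

    from collections import Counter

    class ZeroOrderAdaptiveModel:
        # probabilities are the relative frequencies of the symbols
        # seen so far; encoder and decoder apply identical updates,
        # so the model never has to be transmitted
        def __init__(self):
            self.counts = Counter()
            self.total = 0

        def probability(self, symbol):
            return self.counts[symbol] / self.total if self.total else 0.0

        def update(self, symbol):
            self.counts[symbol] += 1
            self.total += 1

    model = ZeroOrderAdaptiveModel()
    for ch in "it might be that it might be":
        model.update(ch)
    print(round(model.probability("t"), 3))  # relative frequency of 't'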

27
Adaptive models
  • For a higher-order model, such as a first-order
    model, the probability is estimated by how often
    that character has occurred in the current
    context
  • in the zero-order model earlier, the symbol 't'
    occurred in the context 'It migh', but the model
    made no use of the characters of the phrase

28
Adaptive models
  • A first-order model would use the final 'h' as a
    context with which to condition the probability
    estimates
  • the letter 'h' has occurred 37,525 times in the
    prior text, and 1,133 of those times it was
    followed by a 't'
  • the probability of 't' occurring after an 'h' can
    be estimated to be 1,133/37,525 ≈ 3.02%

29
Adaptive models
  • For 't', a prediction of 3.02% is actually worse
    than in the zero-order model, because 't' is rare
    in this context ('e' follows 'h' much more often)
  • a second-order model would use the relative
    frequency with which the context 'gh' is followed
    by 't', which is the case in 64.6% of cases
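  • the same idea with one character of context, as
    in the 'h' -> 't' estimate above (a hypothetical
    sketch, not from the slides):

    from collections import Counter, defaultdict

    class FirstOrderAdaptiveModel:
        # one table of symbol counts per one-character context
        def __init__(self):
            self.contexts = defaultdict(Counter)

        def probability(self, context, symbol):
            counts = self.contexts[context]
            total = sum(counts.values())
            return counts[symbol] / total if total else 0.0

        def update(self, context, symbol):
            self.contexts[context][symbol] += 1

    model = FirstOrderAdaptiveModel()
    text = "it might be that it might be"
    for prev, ch in zip(text, text[1:]):
        model.update(prev, ch)
    print(round(model.probability("h", "t"), 3))  # P('t' | 'h')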

30
Adaptive models
  • Good: robust, reliable, flexible
  • Bad: not suitable for random access to compressed
    files
  • a text can be decoded only from the beginning,
    since the model used for coding a particular part
    of the text is determined from all the preceding
    text
  • -> not suitable for full-text retrieval

31
Coding
  • Coding is the task of determining the output
    representation of a symbol, based on a
    probability distribution supplied by a model
  • general idea: the coder should output short
    codewords for likely symbols and long codewords
    for rare ones
  • symbolwise methods depend heavily on a good coder
    to achieve compression

32
Huffman coding
  • A phrase is coded by replacing each of its
    symbols with the codeword given by a table
  • Huffman coding generates codewords for a set of
    symbols, given some probability distribution for
    the symbols
  • this type of code is called a prefix-free code
  • no codeword is a prefix of another symbol's
    codeword

33
Huffman coding
  • The codewords can be stored in a tree (a decoding
    tree)
  • Huffman's algorithm works by constructing the
    decoding tree from the bottom up

34
Huffman coding
  • Algorithm (a Python sketch follows below)
  • create for each symbol a leaf node containing the
    symbol and its probability
  • two nodes with the smallest probabilities become
    siblings under a new parent node, which is given
    a probability equal to the sum of its two
    children's probabilities
  • the combining operation is repeated until there
    is only one node without a parent
  • the two branches from every nonleaf node are then
    labeled 0 and 1
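  • a minimal Python sketch of this bottom-up
    construction (helper names are illustrative, not
    from the slides):

    import heapq
    from itertools import count

    def huffman_codes(probs):
        # probs: {symbol: probability}; returns {symbol: codeword}
        tiebreak = count()   # heapq needs a total order on equal probs
        heap = [(p, next(tiebreak), sym) for sym, p in probs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            # the two smallest-probability nodes become siblings
            p1, _, left = heapq.heappop(heap)
            p2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))
        codes = {}
        def label(node, prefix):
            if isinstance(node, tuple):       # nonleaf: branches 0 and 1
                label(node[0], prefix + "0")
                label(node[1], prefix + "1")
            else:                             # leaf: one symbol
                codes[node] = prefix or "0"   # one-symbol alphabet case
        label(heap[0][2], "")
        return codes

    print(huffman_codes({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}))
    # e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

  • since no codeword is a prefix of another, the
    decoder can follow the tree from the root and
    know unambiguously where each symbol ends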

35
Huffman coding
  • Huffman coding is generally fast for both
    encoding and decoding, provided that the
    probability distribution is static
  • adaptive Huffman coding is possible, but needs
    either a lot of memory or is slow
  • coupled with a word-based model (rather than a
    character-based model), it gives good compression

36
Dictionary models
  • Dictionary-based compression methods use the
    principle of replacing substrings in a text with
    a codeword that identifies that substring in a
    dictionary
  • the dictionary contains a list of substrings and
    a codeword for each substring
  • often, fixed codewords are used
  • reasonable compression is obtained even if the
    coding is simple

37
Dictionary models
  • The simplest dictionary compression methods use
    small dictionaries
  • for instance, digram coding
  • selected pairs of letters are replaced with
    codewords
  • a dictionary for the ASCII character set might
    contain the 128 ASCII characters, as well as 128
    common letter pairs

38
Dictionary models
  • Digram coding
  • the output codewords are eight bits each
  • the presence of the full ASCII character set
    ensures that any (ASCII) input can be represented
  • at best, every pair of characters is replaced
    with a codeword, reducing the input from 7
    bits/character to 4 bits/character
  • at worst, each 7-bit character will be expanded
    to 8 bits
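  • a toy version of such a digram coder (the pair
    table below is a short illustrative stand-in for
    the 128 common pairs):

    # codewords 0..127 are plain ASCII, 128..255 stand for pairs
    PAIRS = ["e ", " t", "th", "he", "s ", " a", "in", "er"]

    def digram_encode(text):
        out = []
        i = 0
        while i < len(text):
            pair = text[i:i+2]
            if pair in PAIRS:                    # known pair: one byte
                out.append(128 + PAIRS.index(pair))
                i += 2
            else:                                # single ASCII character
                out.append(ord(text[i]))
                i += 1
        return bytes(out)

    def digram_decode(data):
        return "".join(chr(b) if b < 128 else PAIRS[b - 128]
                       for b in data)

    coded = digram_encode("the tin")
    print(len(coded), digram_decode(coded))      # 4 the tin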

39
Dictionary models
  • Natural extension
  • put even larger entries in the dictionary, e.g.
    common words like 'and' and 'the', or common
    components of words like 'pre' and 'tion'
  • a predefined set of dictionary phrases makes the
    compression domain-dependent
  • or very short phrases have to be used -> good
    compression is not achieved

40
Dictionary models
  • One way to avoid the problem of the dictionary
    being unsuitable for the text at hand is to use a
    semi-static dictionary scheme
  • construct a new dictionary for every text that is
    to be compressed
  • the overhead of transmitting or storing the
    dictionary is significant
  • deciding which phrases should be included is a
    difficult problem

41
Dictionary models
  • Solution: use an adaptive dictionary scheme
  • Ziv-Lempel coders (LZ77 and LZ78)
  • a substring of text is replaced with a pointer to
    where it has occurred previously
  • dictionary: all the text prior to the current
    position
  • codewords: pointers

42
Dictionary models
  • Ziv-Lempel
  • the prior text makes a very good dictionary since
    it is usually in the same style and language as
    upcoming text
  • the dictionary is transmitted implicitly at no
    extra cost, because the decoder has access to all
    previously encoded text

43
LZ77
  • Key benefits
  • relatively easy to implement
  • decoding can be performed extremely quickly using
    only a small amount of memory
  • suitable when the resources required for decoding
    must be minimized, like when data is distributed
    or broadcast from a central source to a number of
    small computers

44
LZ77
  • The output of an encoder consists of a sequence
    of triples, e.g. <3,2,b>
  • the first component of a triple indicates how far
    back to look in the previous (decoded) text to
    find the next phrase
  • the second component records how long the phrase
    is
  • the third component gives the next character from
    the input

45
LZ77
  • Components 1 and 2 constitute a pointer back
    into the text
  • component 3 is actually necessary only when the
    character to be coded does not occur anywhere in
    the previous input
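  • a minimal sketch of decoding such triples;
    copying one character at a time makes overlapping
    matches (length greater than the back-distance)
    come out right:

    def lz77_decode(triples):
        out = []
        for distance, length, ch in triples:
            start = len(out) - distance
            for k in range(length):      # copy the matched phrase
                out.append(out[start + k])
            out.append(ch)               # then the explicit character
        return "".join(out)

    # (0,0,c) codes a bare literal c
    print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (2, 3, "b")]))  # ababab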

46
LZ77
  • Encoding (a sketch in Python follows below)
  • for the text from the current point ahead:
  • search for the longest match in the previous text
  • output a triple that records the position and
    length of the match
  • the search for a match may return a length of
    zero, in which case the position of the match is
    not relevant
  • the search can be accelerated by indexing the
    prior text with a suitable data structure
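  • a brute-force sketch of this greedy search (the
    window and length limits are illustrative; see
    the next slide):

    def lz77_encode(text, window=4096, max_len=16):
        triples = []
        i = 0
        while i < len(text):
            best_dist, best_len = 0, 0
            for j in range(max(0, i - window), i):   # candidate starts
                length = 0
                while (length < max_len and i + length < len(text) - 1
                       and text[j + length] == text[i + length]):
                    length += 1
                if length > best_len:
                    best_dist, best_len = i - j, length
            # emit the match plus the explicit next character
            triples.append((best_dist, best_len, text[i + best_len]))
            i += best_len + 1
        return triples

    print(lz77_encode("ababab"))
    # [(0, 0, 'a'), (0, 0, 'b'), (2, 3, 'b')]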

47
LZ77
  • There are limitations on how far back a pointer
    can refer and on the maximum length of the string
    referred to
  • e.g. for English text, a window of a few thousand
    characters
  • the length of the phrase is limited, e.g. to a
    maximum of 16 characters
  • otherwise too much space is wasted without
    benefit
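  • as an illustration (figures not from the
    slides): with a 4,096-character window and a
    match length capped at 16, a triple costs about
    12 + 4 + 8 = 24 bits; since it stands for
    length+1 characters of text, it breaks even at a
    match of length 2 and saves space on longer
    matches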

48
LZ77
  • The decoding program is very simple, so it can be
    included with the data at very little cost
  • in fact, the compressed data is stored as part of
    the decoder program, which makes the data
    self-expanding
  • a common way to distribute files