Information Theory and Pattern-based Data Compression


1
Information Theory and Pattern-based Data
Compression
  • José Galaviz Casas
  • Facultad de Ciencias
  • UNAM

2
Contents
  • Introduction, fundamental concepts.
  • Huffman codes and extensions of a source.
  • Pattern-based Data Compression (PbDC).
  • The problems for PbDC.
  • Trying to solve them: heuristics.
  • Conclusions and further research.

3
Information source
  • Is a 'thing' that produces infinite sequences of
    symbols over some finite alphabet Σ.
  • The theoretical model proposed by Shannon is an
    ergodic Markov chain.
  • Markov chain: a stochastic process where the state
    reached at the i-th time step depends on the n
    previous states; n denotes the order of the
    Markov chain.

4
Ergodic source
  • A Markov chain is ergodic if the probability
    distribution over the set of states tends to a
    stable limit. If p(i,j) denotes the transition
    probability from state i to state j in an ergodic
    Markov chain, then the long-run probability of
    being in state j tends to a limit that does not
    depend on the starting state i.
  • Almost every sample is a representative sample.
  • There is only one set of mutually reachable
    (interconnected) states.
  • No periodic states.

5
Information
  • Let p(s) be the probability that symbol s ∈ Σ will
    be produced by some information source S.
  • The information (in bits) of s is defined as
    I(s) = -log2 p(s)

6
The meaning
  • A measure of surprise.
  • Better: the number of yes/no questions needed
    to determine that s has occurred.

7
Entropy
  • The expected value of the symbol information:
    H(S) = -Σ p(s) log2 p(s), the sum taken over all
    s in Σ.
  • Important: note that entropy is measured over the
    source. Probabilities are used, assuming an
    infinite amount of data. (A small sketch of both
    definitions follows.)
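
A minimal Python sketch of these definitions (the probabilities below are those of the two-symbol source used later in the presentation; the function names are illustrative):

    from math import log2

    def information(p):
        # Information (in bits) of a symbol emitted with probability p: I(s) = -log2 p(s)
        return -log2(p)

    def entropy(probabilities):
        # Expected value of the symbol information: H(S) = -sum of p(s) * log2 p(s)
        return sum(p * information(p) for p in probabilities.values() if p > 0)

    source = {"A": 0.6875, "B": 0.3125}
    print(information(source["B"]), entropy(source))  # entropy is about 0.8960 bits/symbol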

8
Data Compression
  • Given a finite sample of data produced by an
    unknown information source (unknown in the sense
    that we do not know the statistical model of
    such a source),
  • express the same information contained in the
    sample with less data.
  • Exactly the same: lossless. Almost the same:
    lossy.
  • We will focus on lossless data compression.

9
Huffman encoding
  • Is based on a statistical model of the sample to
    be compressed.
  • The codeword length for a symbol is inversely
    related to its frequency.
  • Target: minimize the average codeword length
    (AveLen).

10
Example
  • f(A) = 10
  • f(B) = 15
  • f(C) = 10
  • f(D) = 15
  • f(E) = 25
  • f(F) = 40

11
Huffman codes
  • A = 000
  • B = 100
  • C = 001
  • D = 101
  • E = 01
  • F = 11
  • AveLen = 2.43 bits/word vs. 3 bits for a
    fixed-length code (a construction sketch follows)
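
A sketch of the construction in Python. It is the standard Huffman procedure over a frequency table; the exact codewords may differ from the slide's depending on which child receives 0 or 1, but the codeword lengths are the same.

    import heapq
    from itertools import count

    def huffman_codes(freqs):
        # Repeatedly merge the two least frequent subtrees, then read codes off the tree.
        tiebreak = count()  # keeps heap entries comparable when frequencies tie
        heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
        codes = {}
        def walk(node, prefix=""):
            if isinstance(node, tuple):   # internal node: descend into both children
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:                         # leaf: a symbol of the alphabet
                codes[node] = prefix or "0"
        walk(heap[0][2])
        return codes

    freqs = {"A": 10, "B": 15, "C": 10, "D": 15, "E": 25, "F": 40}
    codes = huffman_codes(freqs)
    avelen = sum(freqs[s] * len(codes[s]) for s in freqs) / sum(freqs.values())
    print(codes, round(avelen, 2))  # average codeword length in bits/word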

12
Extensions of a source
  • Suppose a source S with alphabet Σ = {A, B}
  • P(A) = 0.6875, P(B) = 0.3125
  • Since there are only two symbols, the Huffman
    algorithm encodes every sample of such a source
    using one bit per symbol of Σ (1 BPS).
  • Entropy: H(S) = 0.8960

13
2nd extension
  • P(AA) = 0.4727, WordLen(AA) = 1
  • P(AB) = 0.2148, WordLen(AB) = 2
  • P(BA) = 0.2148, WordLen(BA) = 3
  • P(BB) = 0.0977, WordLen(BB) = 3
  • AveLen = 1.8398, BPS2(S) = 0.9199

14
3rd extension
AveLen = 2.759, BPS3(S) = 0.9197
15
And so on...
  • 4th extension
  • AveLen = 3.64138794
  • BPS4(S) = 0.91034699 (a sketch reproducing these
    figures follows)
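
A sketch that reproduces these figures (up to rounding): it builds the k-th extension of the two-symbol source, runs Huffman over the extension alphabet, and reports the average codeword length and the bits per original symbol. Names are illustrative.

    import heapq
    from itertools import product

    def huffman_lengths(probs):
        # Codeword lengths from Huffman's algorithm: every merge adds one bit to
        # each symbol contained in the two merged subtrees.
        heap = [(p, [sym]) for sym, p in probs.items()]
        heapq.heapify(heap)
        lengths = dict.fromkeys(probs, 0)
        while len(heap) > 1:
            p1, group1 = heapq.heappop(heap)
            p2, group2 = heapq.heappop(heap)
            for s in group1 + group2:
                lengths[s] += 1
            heapq.heappush(heap, (p1 + p2, group1 + group2))
        return lengths

    base = {"A": 0.6875, "B": 0.3125}
    for k in (1, 2, 3, 4):
        extension = {"".join(word): 1.0 for word in product(base, repeat=k)}
        for word in extension:
            for symbol in word:
                extension[word] *= base[symbol]   # probability of the metasymbol
        lengths = huffman_lengths(extension)
        avelen = sum(extension[w] * lengths[w] for w in extension)
        print(k, round(avelen, 4), round(avelen / k, 4))  # AveLen and BPS_k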

16
In practice
  • Suppose a sample from our previous source S:
  • A A A B A A A A B A A B A A B B
  • f(A) = 11
  • f(B) = 5
  • There are only two symbols, therefore Huffman
    assigns A = 0, B = 1: 16 bits to express the sample.

17
Thinking in extensions
18
Longer strings are better
  • The 4-gram sample cannot be compressed, since each
    of the 4 metasymbols (strings of 4 symbols)
    found appears with the same frequency.
  • Let Σ = {AAAA, AAAB, BAAB, AABB} be the alphabet
    of some information source S that produces the
    symbols in Σ equiprobably.
  • The sample could be produced by this maximum
    entropy source S.

19
Dictionary-based methods
  • Build a dictionary with frequent strings.
  • Each time a string in the dictionary appears in
    the sample, replace it with a dictionary
    reference, which is shorter.
  • Every frequent string is included only once (in
    the dictionary).

20
Example
AL QUE INGRATO ME DEJA, BUSCO AMANTE AL QUE
AMANTE ME SIGUE, DEJO INGRATA CONSTANTE ADORO A
QUIEN MI AMOR MALTRATA MALTRATO A QUIEN MI AMOR
BUSCA CONSTANTE
  • AL_QUE_
  • INGRAT
  • _ME_
  • _AMANTE
  • _A_QUIEN_MI_AMOR_
  • MALTRAT
  • CONSTANTE
  • DEJ
  • BUSC

21
Result
  • The sample rewritten with dictionary references;
    each digit refers to the corresponding entry above:
12O38A, 9O 4 143SIGUE, 8O 2A 7 ADORO56A
6O59A 7
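
A sketch of the replacement step in Python, using the sample and dictionary above. The single-character references and the greedy left-to-right replacement order are illustrative simplifications, so the output need not match the hand-built result on the slide exactly.

    def compress_with_dictionary(text, dictionary, marker="\x00"):
        # Replace each occurrence of every dictionary string by a two-character
        # reference: a marker byte followed by the entry's index ('1'..'9').
        for index, entry in enumerate(dictionary):
            text = text.replace(entry, marker + chr(ord("1") + index))
        return text

    sample = ("AL_QUE_INGRATO_ME_DEJA,_BUSCO_AMANTE_"
              "AL_QUE_AMANTE_ME_SIGUE,_DEJO_INGRATA_"
              "CONSTANTE_ADORO_A_QUIEN_MI_AMOR_MALTRATA_"
              "MALTRATO_A_QUIEN_MI_AMOR_BUSCA_CONSTANTE")
    dictionary = ["AL_QUE_", "INGRAT", "_ME_", "_AMANTE", "_A_QUIEN_MI_AMOR_",
                  "MALTRAT", "CONSTANTE", "DEJ", "BUSC"]
    compressed = compress_with_dictionary(sample, dictionary)
    print(len(sample), len(compressed))  # shorter, ignoring the dictionary's own size
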
22
Another possibility
AL_QUE_INGRATO_ME_DEJA,_BUSCO_AMANTE_
AL_QUE_AMANTE_ME_SIGUE,_DEJO_INGRATA_
CONSTANTE_ADORO_A_QUIEN_MI_AMOR_MALTRATA_
MALTRATO_A_QUIEN_MI_AMOR_BUSCA_CONSTANTE
  • Build a dictionary of frequent patterns, not
    necessarily of consecutive symbols (strings).

23
The compression process
  • Given a finite sample of consecutive symbols
    produced by some source S whose statistical
    properties can only be estimated from the sample:
  • find a set of frequent patterns such that the
    sample can be expressed briefly using references
    to these patterns;
  • encode the sample using the set of patterns
    (dictionary), and encode the dictionary itself
    using some other method.

24
Example
25
Finding patterns, a naïve algorithm
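
The slide itself is a figure that the transcript does not reproduce. As a stand-in, here is a minimal sketch of one naive first step, consistent with the "coincidence patterns" described on the next slide: align the sample with itself at every offset and keep the symbols that coincide, using a wildcard elsewhere. All names are illustrative.

    def coincidence_pattern(sample, offset, wildcard="."):
        # Compare the sample with itself shifted by `offset`; keep coinciding symbols.
        return "".join(symbol if symbol == sample[i + offset] else wildcard
                       for i, symbol in enumerate(sample[:len(sample) - offset]))

    def first_level_patterns(sample):
        # One coincidence pattern per self-alignment; the naive method then looks for
        # coincidences among these patterns, and so on (hence the blow-up noted below).
        return {offset: coincidence_pattern(sample, offset)
                for offset in range(1, len(sample))}

    for offset, pattern in first_level_patterns("AAABAAAABAABAABB").items():
        print(offset, pattern)
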
26
Algorithm complexity
  • The naïve algorithm is very expensive.
  • We need to find coincidence patterns, then
    coincidence patterns within the coincidence
    patterns previously found, then...
  • The number of intersections between coincidence
    patterns grows exponentially with the number of
    patterns found (which is O(sample size)).

27
There are better algorithms but...
  • Not very much better.
  • The best reported algorithms have complexity
    O(n·2^n) [Vilo 02].
  • The patterns we are looking for are of type P3:
    patterns with wildcards of unrestricted length.

28
The algorithms for pattern discovery
  • Are based on well-known string matching
    techniques supported by a special data structure
    called a suffix tree.
  • There are several algorithms for suffix tree
    construction (n stands for the string size).
  • The worst is O(n^3).
  • The two best methods (Weiner and Ukkonen) are
    linear in n and build the tree on the fly.

29
Suffix tree
Suffix tree for the string ATCAGTGCAATGC
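
For contrast with the linear-time constructions mentioned above, a naive sketch: a plain character trie holding every suffix of the string. A real suffix tree additionally compresses unary paths into single edges. Names are illustrative.

    def build_suffix_trie(text, terminator="$"):
        # O(n^2) construction: insert every suffix of `text`, one character at a time.
        root = {}
        for start in range(len(text)):
            node = root
            for symbol in text[start:] + terminator:
                node = node.setdefault(symbol, {})
        return root

    trie = build_suffix_trie("ATCAGTGCAATGC")  # the string from the slide above
    print(sorted(trie))                        # distinct first symbols of the suffixes
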
30
Some possibility?
  • Generalize the suffix tree concept in order to
    include patterns rather than strings: a tree of
    suffix patterns.
  • It cannot be constructed on the fly, since we need
    to remember an arbitrary number of previous
    symbols.
  • We need to solve: find the longest common
    pattern in a set of strings (a brute-force sketch
    follows below).
  • We call this problem the MAXIMUM-COMMON-PATTERN
    problem, or MCP.
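
A pattern with wildcards of unrestricted length that occurs in every string is, in effect, a common subsequence of all of them. Under that reading, a brute-force sketch of MCP looks as follows; it is exponential in the length of the shortest string and is meant only to illustrate the problem. Names are illustrative.

    from itertools import combinations

    def occurs_in(pattern, string):
        # True if `pattern` occurs in `string` allowing gaps of any length between symbols.
        remaining = iter(string)
        return all(symbol in remaining for symbol in pattern)

    def maximum_common_pattern(strings):
        # Try every selection of positions from the shortest string, longest first.
        shortest = min(strings, key=len)
        for size in range(len(shortest), 0, -1):
            for positions in combinations(range(len(shortest)), size):
                candidate = "".join(shortest[i] for i in positions)
                if all(occurs_in(candidate, s) for s in strings):
                    return candidate
        return ""

    print(maximum_common_pattern(["ATCAGTGC", "AATGCCT", "CATGAC"]))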

31
MAXIMUM-COMMON-PATTERN
  • We have recently proved that this problem is
    NP-Complete. That is, no deterministic
    polynomial-time algorithm for it is currently
    known. If such an algorithm were found, then all
    the other problems in this class (the hardest
    problems in NP) could also be solved in
    polynomial time, and P = NP (the fundamental
    open question of complexity theory).

32
Finding patterns (option 1)
33
Finding patterns (option 2)
34
Finding patterns (option 3)
35
Several options
  • Option 1: 12 metasymbols
  • Option 2: 14 metasymbols
  • Option 3: 10 metasymbols
  • Option 3 gives the shortest expression of the
    sample, considering only the data in the sample
    and ignoring the dictionary size.

36
There is a right choice but...
  • The right choice is not easy to make.
  • There is a trade-off between pattern size and
    pattern frequency.
  • The inclusion of a pattern in the dictionary must
    be amortized by its use.

37
How difficult is the right choice?
  • Suppose we have a set of frequent patterns P.
    Each pattern has its frequency and its size.
  • We need to choose the subset P' ⊆ P that maximizes
    the compression ratio

38
  • where M is the original sample size, and T(P') is
    the sample size after compression is done and the
    dictionary is included.
  • T(P') = D(P') + E(P') (a small sketch follows)
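
The transcript does not reproduce the ratio formula from the previous slide, so the sketch below uses the usual "fraction of space saved" definition; that definition, and the sizes passed in, are assumptions for illustration only.

    def compression_ratio(M, dictionary_size, encoded_size):
        # T(P') = D(P') + E(P'): compressed size = dictionary size + encoded-sample size.
        # Assumed ratio: the fraction of the original size M that was saved.
        T = dictionary_size + encoded_size
        return (M - T) / M

    print(compression_ratio(M=156, dictionary_size=70, encoded_size=45))  # illustrative sizes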

39
OPTIMAL-PATTERN-SUBSET
  • We call the selection of the best subset of
    patterns the OPTIMAL-PATTERN-SUBSET problem.
  • We have proved that this problem is also
    NP-Complete.

40
But here we have some resources
  • We can approximate the best subset with a
    heuristic algorithm.
  • We select the patterns with the greatest coverage
    (the number of symbols in the sample covered by
    the pattern's appearances).
  • Then we iteratively refine the solution with
    hill climbers that make local changes (see the
    sketch below).
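
A minimal sketch of this heuristic. The pattern representation, the size of the starting set, and the toy objective are illustrative stand-ins for the real compression-ratio evaluation.

    import random

    def coverage(pattern):
        # Symbols of the sample covered by all appearances of the pattern.
        return pattern["length"] * pattern["frequency"]

    def greedy_then_hillclimb(patterns, evaluate, iterations=1000, seed=0):
        # Start from the half of the patterns with greatest coverage, then repeatedly
        # toggle one pattern in or out and keep the change only when `evaluate` improves.
        rng = random.Random(seed)
        ranked = sorted(patterns, key=coverage, reverse=True)
        chosen = {p["name"] for p in ranked[:max(1, len(patterns) // 2)]}
        best = evaluate(chosen)
        for _ in range(iterations):
            candidate = set(chosen)
            candidate ^= {rng.choice(patterns)["name"]}   # local change: add or drop one
            score = evaluate(candidate)
            if score > best:
                chosen, best = candidate, score
        return chosen, best

    patterns = [{"name": "p1", "length": 7, "frequency": 2},
                {"name": "p2", "length": 4, "frequency": 3},
                {"name": "p3", "length": 9, "frequency": 1}]
    toy_objective = len   # placeholder objective: just counts the chosen patterns
    print(greedy_then_hillclimb(patterns, toy_objective))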

41
Conclusions
  • Pattern-based data compression is the most
    general approach to the compression problem based
    on statistical models of the data to be
    compressed. Every other technique in this class
    can be considered a particular case.
  • Unfortunately, the sub-tasks involved in the
    compression process are mostly NP-Complete
    problems.

42
Further research
  • We need to develop approximation algorithms or
    heuristics in order to solve the pattern
    discovery problem efficiently.