Title: Information Theory and Pattern-based Data Compression
1. Information Theory and Pattern-based Data Compression
- José Galaviz Casas
- Facultad de Ciencias
- UNAM
2. Contents
- Introduction, fundamental concepts.
- Huffman codes and extensions of a source.
- Pattern-based Data Compression (PbDC).
- The problems for PbDC.
- Trying to solve them: heuristics.
- Conclusions and further research.
3. Information source
- An information source is something that produces infinite sequences of symbols from some finite alphabet Σ.
- The theoretical model proposed by Shannon is an ergodic Markov chain.
- A Markov chain is a stochastic process where the state reached at the i-th time step depends on the n previous states; n denotes the order of the Markov chain.
4. Ergodic source
- A Markov chain is ergodic if the probability distribution over the set of states tends to be stable in the limit. If p(i,j) denotes the probability of being in state j after many steps, starting from state i, then in an ergodic Markov chain p(i,j) tends to a limit that does not depend on the initial state i.
- Almost every sample is a representative sample.
- There exists only one set of interconnected states.
- No periodic states.
5. Information
- Let p(s) be the probability that symbol s ∈ Σ will be produced by some information source S.
- The information (in bits) of s is defined as I(s) = -log2 p(s).
6. The meaning
- A measure of surprise.
- Better: the number of yes/no questions needed to determine that s has occurred.
7. Entropy
- Entropy is the expected value of the symbol information: H(S) = Σ p(s) I(s) = -Σ p(s) log2 p(s).
- Important: note that entropy is measured over the source. Probabilities are used, assuming an infinite amount of data.
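As an illustration (not part of the original slides), here is a minimal Python sketch of these two definitions, applied to the two-symbol source used later in the talk (P(A) = 0.6875, P(B) = 0.3125), whose entropy is about 0.896:

```python
# Minimal sketch (not from the slides): symbol information and source entropy.
import math

def information(p):
    """Information (in bits) of a symbol with probability p: I = -log2 p."""
    return -math.log2(p)

def entropy(dist):
    """Entropy of a source: expected value of the symbol information."""
    return sum(p * information(p) for p in dist.values() if p > 0)

# The two-symbol source used later in the talk.
source = {"A": 0.6875, "B": 0.3125}
print({s: round(information(p), 4) for s, p in source.items()})
print(round(entropy(source), 4))   # approximately 0.896
```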
8. Data Compression
- Given a finite sample of data produced by an unknown information source (unknown in the sense that we do not know the statistical model of such a source):
- To express the same information contained in the sample with less data.
- Exactly the same: lossless. Almost the same: lossy.
- We will focus on lossless data compression.
9. Huffman encoding
- Is based on a statistical model of the sample to be compressed.
- The codeword length for a symbol is inversely related to its frequency.
- Target: minimize the average codeword length (AveLen).
10. Example
- f(A) = 10
- f(B) = 15
- f(C) = 10
- f(D) = 15
- f(E) = 25
- f(F) = 40
11. Huffman codes
- A: 000
- B: 100
- C: 001
- D: 101
- E: 01
- F: 11
- AveLen = 2.24 bits/word vs. 3 bits for a fixed-length code (see the sketch below)
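A minimal sketch of the construction behind such codes (a standard heap-based Huffman builder, not the author's own code): it takes the frequencies of the previous slide and reports the codewords and the frequency-weighted average codeword length.

```python
# Minimal Huffman-coding sketch over the frequencies of the example slide.
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman code; returns {symbol: codeword}."""
    tick = count()  # tie-breaker so the heap never compares tree nodes directly
    heap = [(f, next(tick), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: descend into both children
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: a symbol
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

freqs = {"A": 10, "B": 15, "C": 10, "D": 15, "E": 25, "F": 40}
codes = huffman_codes(freqs)
ave_len = sum(freqs[s] * len(codes[s]) for s in freqs) / sum(freqs.values())
print(codes, round(ave_len, 2))
```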
12. Extensions of a source
- Suppose a source S with alphabet Σ = {A, B}
- P(A) = 0.6875, P(B) = 0.3125
- Since there are only two symbols, the Huffman algorithm encodes every sample of such a source using one bit per symbol of Σ (1 BPS).
- Entropy H(S) = 0.8960
13. 2nd extension
- P(AA) = 0.4727, WordLen(AA) = 1
- P(AB) = 0.2148, WordLen(AB) = 2
- P(BA) = 0.2148, WordLen(BA) = 3
- P(BB) = 0.0977, WordLen(BB) = 3
- AveLen = 1.8398, BPS2(S) = 0.9199
14. 3rd extension
AveLen = 2.759, BPS3(S) = 0.9197
15. And so on...
- 4th extension
- AveLen = 3.64138794
- BPS4(S) = 0.91034699
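A small sketch (own code, assuming the source emits symbols independently with P(A) = 0.6875 and P(B) = 0.3125) that reproduces, up to rounding, the AveLen and BPS figures of slides 13-15 by Huffman-coding the n-th extension:

```python
# Sketch: average codeword length and bits per symbol (BPS) of Huffman coding the
# n-th extension of the source P(A) = 0.6875, P(B) = 0.3125 (assumed i.i.d.).
import heapq
from itertools import count, product

def huffman_average_length(probs):
    """AveLen of a Huffman code = sum of the weights of all merged (internal) nodes."""
    tick = count()
    heap = [(p, next(tick)) for p in probs]
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        p1, _ = heapq.heappop(heap)
        p2, _ = heapq.heappop(heap)
        total += p1 + p2                     # every merge adds one bit to the symbols below it
        heapq.heappush(heap, (p1 + p2, next(tick)))
    return total

base = {"A": 0.6875, "B": 0.3125}
for n in (1, 2, 3, 4):
    ext = []
    for word in product(base.values(), repeat=n):   # probability of each n-symbol metasymbol
        p = 1.0
        for q in word:
            p *= q
        ext.append(p)
    ave = huffman_average_length(ext)
    print(n, round(ave, 4), round(ave / n, 4))       # AveLen and BPS_n, cf. slides 12-15
```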
16. In practice
- Suppose a sample of our previous source S:
- A A A B A A A A B A A B A A B B
- f(A) = 11
- f(B) = 5
- There are only two symbols, therefore Huffman assigns A = 0, B = 1: 16 bits to express the sample.
17. Thinking in extensions
18. Longer strings are better
- The 4-gram sample cannot be compressed, since each of the 4 metasymbols (strings of 4 symbols) found appears with the same frequency (see the sketch after this list).
- Let Σ = {AAAA, AAAB, BAAB, AABB} be the alphabet of some information source S that produces the symbols in Σ equiprobably.
- The sample could have been produced by this maximum-entropy source S.
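A small sketch (not on the slides) of the same sample read as 4-symbol metasymbols: each of the four metasymbols appears exactly once, so the best block code at this length is a fixed 2-bit code per metasymbol, i.e. 8 bits for the data itself versus the 16 bits of the symbol-level code (ignoring, for now, the cost of describing the metasymbols).

```python
# Sketch: the sample from the "In practice" slide, read as 4-symbol metasymbols.
from collections import Counter

sample = "AAABAAAABAABAABB"
grams = [sample[i:i + 4] for i in range(0, len(sample), 4)]
print(Counter(grams))   # each 4-gram appears exactly once
# 4 equiprobable metasymbols -> 2 bits each -> 2 * 4 = 8 bits for the sample data,
# versus 16 bits with the 1-bit-per-symbol code of the previous slide.
```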
19. Dictionary-based methods
- Build a dictionary with frequent strings.
- Each time a string from the dictionary appears in the sample, replace it with a dictionary reference, which is shorter.
- Every frequent string is included only once (in the dictionary).
20. Example
AL QUE INGRATO ME DEJA, BUSCO AMANTE AL QUE
AMANTE ME SIGUE, DEJO INGRATA CONSTANTE ADORO A
QUIEN MI AMOR MALTRATA MALTRATO A QUIEN MI AMOR
BUSCA CONSTANTE
- AL_QUE_
- INGRAT
- _ME_
- _AMANTE
- _A_QUIEN_MI_AMOR_
- MALTRAT
- CONSTANTE
- DEJ
- BUSC
21. Result
12O38A, 9O 4 143SIGUE, 8O 2A 7 ADORO56A
6O59A 7
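A rough sketch of the substitution idea (own code, not the encoder that produced the exact result string above): each occurrence of a dictionary entry is replaced by a one-digit reference to it, scanning greedily for the longest match at every position.

```python
# Sketch: greedy longest-match dictionary substitution (illustrative, not the slide's exact coder).
dictionary = ["AL_QUE_", "INGRAT", "_ME_", "_AMANTE", "_A_QUIEN_MI_AMOR_",
              "MALTRAT", "CONSTANTE", "DEJ", "BUSC"]

def encode(text, entries):
    out, i = [], 0
    while i < len(text):
        # pick the longest dictionary entry matching at position i, if any
        best = max((e for e in entries if text.startswith(e, i)), key=len, default=None)
        if best is None:
            out.append(text[i]); i += 1                      # literal symbol
        else:
            out.append(str(entries.index(best) + 1)); i += len(best)   # reference
    return "".join(out)

text = ("AL_QUE_INGRATO_ME_DEJA,_BUSCO_AMANTE_"
        "AL_QUE_AMANTE_ME_SIGUE,_DEJO_INGRATA_"
        "CONSTANTE_ADORO_A_QUIEN_MI_AMOR_MALTRATA_"
        "MALTRATO_A_QUIEN_MI_AMOR_BUSCA_CONSTANTE")
compressed = encode(text, dictionary)
print(len(text), len(compressed), compressed)
```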
22. Another possibility
AL_QUE_INGRATO_ME_DEJA,_BUSCO_AMANTE_
AL_QUE_AMANTE_ME_SIGUE,_DEJO_INGRATA_
CONSTANTE_ADORO_A_QUIEN_MI_AMOR_MALTRATA_
MALTRATO_A_QUIEN_MI_AMOR_BUSCA_CONSTANTE
- Build a dictionary of frequent patterns, not necessarily made of consecutive symbols (strings).
23. The compression process
- Given a finite sample of consecutive symbols produced by some source S whose statistical properties can only be estimated from its sample:
- To find a set of frequent patterns such that the sample can be expressed briefly using references to these patterns.
- Encode the sample using the set of patterns (dictionary), and encode the dictionary itself using some other method.
24. Example
25. Finding patterns, a naïve algorithm
26. Algorithm complexity
- The naïve algorithm is very expensive.
- We need to find coincidence patterns, then coincidence patterns within the coincidence patterns previously found, then...
- The number of intersections between coincidence patterns grows exponentially with the number of patterns found (which is O(sample size)).
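The elementary operation this naïve approach repeats is, roughly, the following (a sketch with my own naming, not the author's code): the coincidence pattern of two equal-length strings keeps the symbols on which they agree and puts a wildcard everywhere else; the algorithm then has to intersect such patterns with one another, and that is what explodes.

```python
# Sketch (own naming): the elementary step of the naive approach -- the
# "coincidence pattern" of two equal-length strings keeps the positions where
# they agree and puts a wildcard ('.') elsewhere.
def coincidence(u, v, wildcard="."):
    return "".join(a if a == b else wildcard for a, b in zip(u, v))

first  = "AL_QUE_INGRATO_ME_DEJA"
second = "AL_QUE_AMANTE_ME_SIGUE"
print(coincidence(first, second))
# The naive algorithm computes such patterns for many pairs of alignments, then
# coincidence patterns of the coincidence patterns, and so on; the number of
# intersections grows exponentially with the number of patterns found.
```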
27. There are better algorithms but...
- Not very much better.
- The best reported algorithms have complexity O(n·2^n) [Vilo 02].
- The patterns we are looking for are of type P3: patterns with wildcards of unrestricted length.
28. The algorithms for pattern discovery
- Are based on well-known string-matching techniques supported by a special data structure called a suffix tree.
- There are several algorithms for suffix tree construction (n stands for the string size).
- The worst is O(n^3).
- The two best methods (Weiner and Ukkonen) are linear in n, and build the tree on the fly.
29. Suffix tree
Suffix tree for the string ATCAGTGCAATGC
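As a simple illustration (a naïve O(n²) sketch, not Weiner's or Ukkonen's linear-time construction, and an uncompressed trie rather than a true suffix tree), the following builds the suffix structure for the slide's string and answers substring queries by walking it:

```python
# Naive suffix trie sketch (illustrative only; real suffix trees are compressed
# and can be built in O(n) by Weiner's or Ukkonen's algorithms).
class TrieNode:
    def __init__(self):
        self.children = {}

def build_suffix_trie(text):
    """Insert every suffix of text (terminated by '$') into a trie: O(n^2) time and space."""
    text += "$"
    root = TrieNode()
    for i in range(len(text)):            # one insertion per suffix
        node = root
        for ch in text[i:]:
            node = node.children.setdefault(ch, TrieNode())
    return root

def contains(root, pattern):
    """Substring query: walk the trie along the pattern's symbols."""
    node = root
    for ch in pattern:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return True

trie = build_suffix_trie("ATCAGTGCAATGC")   # the string from the slide
print(contains(trie, "CAGT"))   # True
print(contains(trie, "GGA"))    # False
```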
30. Some possibility?
- Generalize the suffix tree concept in order to include patterns rather than strings: a tree of suffix patterns.
- It cannot be constructed on the fly, since we need to remember an arbitrary number of previous symbols.
- We need to solve: find the longest common pattern in a set of strings.
- We call this problem the MAXIMUM COMMON PATTERN problem, or MCP.
31. MAXIMUM COMMON PATTERN
- We have recently proved that this problem is NP-complete. That is, there is currently no known deterministic polynomial-time algorithm to solve it. If such an algorithm were found, then all the other problems in this category (the hardest problems in NP) could also be solved in polynomial time, and P = NP (the fundamental open question of computational complexity theory).
32. Finding patterns (option 1)
33. Finding patterns (option 2)
34. Finding patterns (option 3)
35. Several options
- Option 1: 12 metasymbols
- Option 2: 14 metasymbols
- Option 3: 10 metasymbols
- Option 3 gives the shortest expression of the sample, considering only the data in the sample and ignoring the dictionary size.
36. There is a right choice but...
- The right choice is not easy to make.
- There is a trade-off between pattern size and pattern frequency.
- The inclusion of a pattern in the dictionary must be amortized by its use.
37. How difficult is the right choice?
- Suppose we have a set of frequent patterns P. Each pattern has its frequency and its size.
- We need to choose the subset P' ⊆ P that maximizes the compression ratio.
38.
- Here M is the original sample size, and T(P') is the sample size after compression is done and the dictionary is included.
- T(P') = D(P') + E(P')
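The formula itself did not survive in this text; a plausible reconstruction, consistent with the quantities M and T(P') defined on this slide (reading D(P') as the cost of the dictionary and E(P') as the cost of the sample encoded against it), is:

```latex
% Hedged reconstruction -- the slide's exact formula is not reproduced in this text.
% M: original sample size; T(P'): size after compression, dictionary included.
\mathrm{ratio}(P') \;=\; \frac{M - T(P')}{M},
\qquad
T(P') \;=\; D(P') + E(P')
```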
39. OPTIMAL PATTERN SUBSET
- We call the selection of the best subset of patterns the OPTIMAL PATTERN SUBSET problem.
- We have proved that this problem is also NP-complete.
40. But here we have some resources
- We can approximate the best subset with a heuristic algorithm (a sketch follows below).
- We select the patterns with the greatest coverage (the number of symbols in the sample covered by the pattern's appearances).
- Then we iteratively refine the solution with hill climbers that make local changes.
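A minimal sketch of this heuristic under a deliberately simplified cost model (my own, assuming the listed occurrences do not overlap and each reference costs one symbol): greedy selection by coverage, followed by a one-flip hill climber that keeps any change lowering the estimated T(P') = D(P') + E(P').

```python
# Sketch of the heuristic under a simplified cost model (occurrences assumed non-overlapping).
def estimated_size(sample_len, chosen, ref_cost=1):
    D = sum(len(p) for p, _ in chosen)                                       # dictionary size
    E = sample_len - sum((len(p) - ref_cost) * occ for p, occ in chosen)     # encoded sample size
    return D + E

def select_patterns(sample_len, patterns):
    # patterns: list of (pattern_string, number_of_occurrences)
    order = sorted(patterns, key=lambda pf: len(pf[0]) * pf[1], reverse=True)  # by coverage
    chosen = []
    for p in order:                       # greedy phase: keep a pattern only if it pays off
        if estimated_size(sample_len, chosen + [p]) < estimated_size(sample_len, chosen):
            chosen.append(p)
    improved = True
    while improved:                       # hill-climbing phase: try flipping each pattern in/out
        improved = False
        for p in patterns:
            trial = [q for q in chosen if q != p] if p in chosen else chosen + [p]
            if estimated_size(sample_len, trial) < estimated_size(sample_len, chosen):
                chosen, improved = trial, True
    return chosen

# Hypothetical pattern list in the spirit of the poem example (pattern, occurrence count).
patterns = [("CONSTANTE", 2), ("_A_QUIEN_MI_AMOR_", 2), ("AL_QUE_", 2), ("XYZ", 1)]
sample_len = 155   # illustrative sample size
print(select_patterns(sample_len, patterns))   # the rare "XYZ" does not pay for itself
```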
41. Conclusions
- Pattern-based data compression is the most general approach to the compression problem based on statistical models of the data to be compressed. Every other technique in this class can be considered a particular case.
- Unfortunately, the sub-tasks involved in the compression process are mostly NP-complete problems.
42. Further research
- We need to develop approximation algorithms or heuristics in order to solve the pattern discovery problem efficiently.