Title: Information Theory and Pattern-based Data Compression
1. Information Theory and Pattern-based Data Compression
- José Galaviz Casas
- Facultad de Ciencias
- UNAM
2. Contents
- Introduction, fundamental concepts.
- Huffman codes and extensions of a source.
- Pattern-based Data Compression (PbDC).
- The problems for PbDC.
- Trying to solve them: heuristics.
- Conclusions and further research.
3. Information source
- An information source is something that produces infinite sequences of symbols from some finite alphabet Σ.
- The theoretical model proposed by Shannon is an ergodic Markov chain.
- A Markov chain is a stochastic process where the state reached at the i-th time step depends on the n previous states; n denotes the order of the Markov chain.
4. Ergodic source
- A Markov chain is ergodic if the probability distribution over the set of states tends to be stable in the limit. If p(i,j) denotes the probability of being in state j after many steps, starting from state i, then in an ergodic Markov chain p(i,j) tends to a limit that does not depend on the initial state i.
- Almost every sample is a representative sample.
- There exists only one set of interconnected states.
- No periodic states.
5. Information
- Let p(s) be the probability that symbol s ∈ Σ will be produced by some information source S.
- The information (in bits) of s is defined as I(s) = -log2 p(s).
6. The meaning
- A measure of surprise.
- Better: the number of yes/no questions needed to determine that s has occurred.
7. Entropy
- Entropy is the expected value of the symbol information: H(S) = Σ p(s) I(s) = -Σ p(s) log2 p(s).
- Important: note that entropy is measured over the source. Probabilities are used, assuming an infinite amount of data.
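As an illustration (not part of the original slides), here is a minimal Python sketch of these two definitions, applied to the two-symbol source used later in the talk (P(A) = 0.6875, P(B) = 0.3125), whose entropy is about 0.896:

```python
# Minimal sketch (not from the slides): symbol information and source entropy.
import math

def information(p):
    """Information (in bits) of a symbol with probability p: I = -log2 p."""
    return -math.log2(p)

def entropy(dist):
    """Entropy of a source: expected value of the symbol information."""
    return sum(p * information(p) for p in dist.values() if p > 0)

# The two-symbol source used later in the talk.
source = {"A": 0.6875, "B": 0.3125}
print({s: round(information(p), 4) for s, p in source.items()})
print(round(entropy(source), 4))   # approximately 0.896
```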
8. Data Compression
- Given a finite sample of data produced by an unknown information source (unknown in the sense that we do not know the statistical model of such a source):
- To express the same information contained in the sample with less data.
- Exactly the same: lossless. Almost the same: lossy.
- We will focus on lossless data compression.
9. Huffman encoding
- Is based on a statistical model of the sample to be compressed.
- The codeword length for a symbol is inversely related to its frequency.
- Target: minimize the average codeword length (AveLen).
10. Example
- f(A) = 10
- f(B) = 15
- f(C) = 10
- f(D) = 15
- f(E) = 25
- f(F) = 40
11. Huffman codes
- A: 000
- B: 100
- C: 001
- D: 101
- E: 01
- F: 11
- AveLen = 2.24 bits/word vs. 3 bits for a fixed-length code (see the sketch below)
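A minimal sketch of the construction behind such codes (a standard heap-based Huffman builder, not the author's own code): it takes the frequencies of the previous slide and reports the codewords and the frequency-weighted average codeword length.

```python
# Minimal Huffman-coding sketch over the frequencies of the example slide.
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman code; returns {symbol: codeword}."""
    tick = count()  # tie-breaker so the heap never compares tree nodes directly
    heap = [(f, next(tick), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: descend into both children
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: a symbol
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

freqs = {"A": 10, "B": 15, "C": 10, "D": 15, "E": 25, "F": 40}
codes = huffman_codes(freqs)
ave_len = sum(freqs[s] * len(codes[s]) for s in freqs) / sum(freqs.values())
print(codes, round(ave_len, 2))
```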
12. Extensions of a source
- Suppose a source S with alphabet Σ = {A, B}
- P(A) = 0.6875, P(B) = 0.3125
- Since there are only two symbols, the Huffman algorithm encodes every sample of such a source using one bit per symbol of Σ (1 BPS).
- Entropy H(S) = 0.8960
13. 2nd extension
- P(AA) = 0.4727, WordLen(AA) = 1
- P(AB) = 0.2148, WordLen(AB) = 2
- P(BA) = 0.2148, WordLen(BA) = 3
- P(BB) = 0.0977, WordLen(BB) = 3
- AveLen = 1.8398, BPS2(S) = 0.9199
14. 3rd extension
AveLen = 2.759, BPS3(S) = 0.9197
15. And so on...
- 4th extension
- AveLen = 3.64138794
- BPS4(S) = 0.91034699
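A small sketch (own code, assuming the source emits symbols independently with P(A) = 0.6875 and P(B) = 0.3125) that reproduces, up to rounding, the AveLen and BPS figures of slides 13-15 by Huffman-coding the n-th extension:

```python
# Sketch: average codeword length and bits per symbol (BPS) of Huffman coding the
# n-th extension of the source P(A) = 0.6875, P(B) = 0.3125 (assumed i.i.d.).
import heapq
from itertools import count, product

def huffman_average_length(probs):
    """AveLen of a Huffman code = sum of the weights of all merged (internal) nodes."""
    tick = count()
    heap = [(p, next(tick)) for p in probs]
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        p1, _ = heapq.heappop(heap)
        p2, _ = heapq.heappop(heap)
        total += p1 + p2                     # every merge adds one bit to the symbols below it
        heapq.heappush(heap, (p1 + p2, next(tick)))
    return total

base = {"A": 0.6875, "B": 0.3125}
for n in (1, 2, 3, 4):
    ext = []
    for word in product(base.values(), repeat=n):   # probability of each n-symbol metasymbol
        p = 1.0
        for q in word:
            p *= q
        ext.append(p)
    ave = huffman_average_length(ext)
    print(n, round(ave, 4), round(ave / n, 4))       # AveLen and BPS_n, cf. slides 12-15
```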
16. In practice
- Suppose a sample of our previous source S:
- A A A B A A A A B A A B A A B B
- f(A) = 11
- f(B) = 5
- There are only two symbols, therefore Huffman assigns A = 0, B = 1: 16 bits to express the sample.
17. Thinking in extensions
18. Longer strings are better
- The 4-gram sample cannot be compressed, since each of the 4 metasymbols (strings of 4 symbols) found appears with the same frequency (see the sketch after this list).
- Let Σ = {AAAA, AAAB, BAAB, AABB} be the alphabet of some information source S that produces the symbols in Σ equiprobably.
- The sample could have been produced by this maximum-entropy source S.
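A small sketch (not on the slides) of the same sample read as 4-symbol metasymbols: each of the four metasymbols appears exactly once, so the best block code at this length is a fixed 2-bit code per metasymbol, i.e. 8 bits for the data itself versus the 16 bits of the symbol-level code (ignoring, for now, the cost of describing the metasymbols).

```python
# Sketch: the sample from the "In practice" slide, read as 4-symbol metasymbols.
from collections import Counter

sample = "AAABAAAABAABAABB"
grams = [sample[i:i + 4] for i in range(0, len(sample), 4)]
print(Counter(grams))   # each 4-gram appears exactly once
# 4 equiprobable metasymbols -> 2 bits each -> 2 * 4 = 8 bits for the sample data,
# versus 16 bits with the 1-bit-per-symbol code of the previous slide.
```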
19. Dictionary-based methods
- Build a dictionary with frequent strings.
- Each time a string from the dictionary appears in the sample, replace it with a dictionary reference, which is shorter.
- Every frequent string is included only once (in the dictionary).
20. Example
AL QUE INGRATO ME DEJA, BUSCO AMANTE AL QUE
AMANTE ME SIGUE, DEJO INGRATA CONSTANTE ADORO A
QUIEN MI AMOR MALTRATA MALTRATO A QUIEN MI AMOR
BUSCA CONSTANTE
- AL_QUE_
- INGRAT
- _ME_
- _AMANTE
- _A_QUIEN_MI_AMOR_
- MALTRAT
- CONSTANTE
- DEJ
- BUSC
21. Result
12O38A, 9O 4 143SIGUE, 8O 2A 7 ADORO56A
6O59A 7
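A rough sketch of the substitution idea (own code, not the encoder that produced the exact result string above): each occurrence of a dictionary entry is replaced by a one-digit reference to it, scanning greedily for the longest match at every position.

```python
# Sketch: greedy longest-match dictionary substitution (illustrative, not the slide's exact coder).
dictionary = ["AL_QUE_", "INGRAT", "_ME_", "_AMANTE", "_A_QUIEN_MI_AMOR_",
              "MALTRAT", "CONSTANTE", "DEJ", "BUSC"]

def encode(text, entries):
    out, i = [], 0
    while i < len(text):
        # pick the longest dictionary entry matching at position i, if any
        best = max((e for e in entries if text.startswith(e, i)), key=len, default=None)
        if best is None:
            out.append(text[i]); i += 1                      # literal symbol
        else:
            out.append(str(entries.index(best) + 1)); i += len(best)   # reference
    return "".join(out)

text = ("AL_QUE_INGRATO_ME_DEJA,_BUSCO_AMANTE_"
        "AL_QUE_AMANTE_ME_SIGUE,_DEJO_INGRATA_"
        "CONSTANTE_ADORO_A_QUIEN_MI_AMOR_MALTRATA_"
        "MALTRATO_A_QUIEN_MI_AMOR_BUSCA_CONSTANTE")
compressed = encode(text, dictionary)
print(len(text), len(compressed), compressed)
```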
22. Another possibility
AL_QUE_INGRATO_ME_DEJA,_BUSCO_AMANTE_
AL_QUE_AMANTE_ME_SIGUE,_DEJO_INGRATA_
CONSTANTE_ADORO_A_QUIEN_MI_AMOR_MALTRATA_
MALTRATO_A_QUIEN_MI_AMOR_BUSCA_CONSTANTE
- Build a dictionary of frequent patterns, not necessarily made of consecutive symbols (strings).
23. The compression process
- Given a finite sample of consecutive symbols produced by some source S whose statistical properties can only be estimated from its sample:
- To find a set of frequent patterns such that the sample can be expressed briefly using references to these patterns.
- Encode the sample using the set of patterns (dictionary), and encode the dictionary itself using some other method.
24. Example
25. Finding patterns, a naïve algorithm
26. Algorithm complexity
- The naïve algorithm is very expensive.
- We need to find coincidence patterns, then coincidence patterns within the coincidence patterns previously found, then...
- The number of intersections between coincidence patterns grows exponentially with the number of patterns found (which is O(sample size)).
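The elementary operation this naïve approach repeats is, roughly, the following (a sketch with my own naming, not the author's code): the coincidence pattern of two equal-length strings keeps the symbols on which they agree and puts a wildcard everywhere else; the algorithm then has to intersect such patterns with one another, and that is what explodes.

```python
# Sketch (own naming): the elementary step of the naive approach -- the
# "coincidence pattern" of two equal-length strings keeps the positions where
# they agree and puts a wildcard ('.') elsewhere.
def coincidence(u, v, wildcard="."):
    return "".join(a if a == b else wildcard for a, b in zip(u, v))

first  = "AL_QUE_INGRATO_ME_DEJA"
second = "AL_QUE_AMANTE_ME_SIGUE"
print(coincidence(first, second))
# The naive algorithm computes such patterns for many pairs of alignments, then
# coincidence patterns of the coincidence patterns, and so on; the number of
# intersections grows exponentially with the number of patterns found.
```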
27. There are better algorithms but...
- Not very much better.
- The best reported algorithms have complexity O(n·2^n) [Vilo 02].
- The patterns we are looking for are of type P3: patterns with wildcards of unrestricted length.
28. The algorithms for pattern discovery
- Are based on well-known string-matching techniques supported by a special data structure called a suffix tree.
- There are several algorithms for suffix tree construction (n stands for the string size).
- The worst is O(n^3).
- The two best methods (Weiner and Ukkonen) are linear in n, and build the tree on the fly.
29. Suffix tree
Suffix tree for the string ATCAGTGCAATGC
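As a simple illustration (a naïve O(n²) sketch, not Weiner's or Ukkonen's linear-time construction, and an uncompressed trie rather than a true suffix tree), the following builds the suffix structure for the slide's string and answers substring queries by walking it:

```python
# Naive suffix trie sketch (illustrative only; real suffix trees are compressed
# and can be built in O(n) by Weiner's or Ukkonen's algorithms).
class TrieNode:
    def __init__(self):
        self.children = {}

def build_suffix_trie(text):
    """Insert every suffix of text (terminated by '$') into a trie: O(n^2) time and space."""
    text += "$"
    root = TrieNode()
    for i in range(len(text)):            # one insertion per suffix
        node = root
        for ch in text[i:]:
            node = node.children.setdefault(ch, TrieNode())
    return root

def contains(root, pattern):
    """Substring query: walk the trie along the pattern's symbols."""
    node = root
    for ch in pattern:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return True

trie = build_suffix_trie("ATCAGTGCAATGC")   # the string from the slide
print(contains(trie, "CAGT"))   # True
print(contains(trie, "GGA"))    # False
```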
30. Some possibility?
- Generalize the suffix tree concept in order to include patterns rather than strings: a tree of suffix patterns.
- It cannot be constructed on the fly, since we need to remember an arbitrary number of previous symbols.
- We need to solve: find the longest common pattern in a set of strings.
- We call this problem the MAXIMUM COMMON PATTERN problem, or MCP.
31. MAXIMUM COMMON PATTERN
- We have recently proved that this problem is NP-complete. That is, there is currently no known deterministic polynomial-time algorithm to solve it. If such an algorithm were found, then all the other problems in this category (the hardest problems in NP) could also be solved in polynomial time, and P = NP (the fundamental open question of computational complexity theory).
32. Finding patterns (option 1)
33. Finding patterns (option 2)
34. Finding patterns (option 3)
35. Several options
- Option 1: 12 metasymbols
- Option 2: 14 metasymbols
- Option 3: 10 metasymbols
- Option 3 gives the shortest expression of the sample, considering only the data in the sample and ignoring the dictionary size.
36. There is a right choice but...
- The right choice is not easy to make.
- There is a trade-off between pattern size and pattern frequency.
- The inclusion of a pattern in the dictionary must be amortized by its use.
37. How difficult is the right choice?
- Suppose we have a set of frequent patterns P. Each pattern has its frequency and its size.
- We need to choose the subset P' ⊆ P that maximizes the compression ratio.
38.
- Here M is the original sample size, and T(P') is the sample size after compression is done and the dictionary is included.
- T(P') = D(P') + E(P')
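The formula itself did not survive in this text; a plausible reconstruction, consistent with the quantities M and T(P') defined on this slide (reading D(P') as the cost of the dictionary and E(P') as the cost of the sample encoded against it), is:

```latex
% Hedged reconstruction -- the slide's exact formula is not reproduced in this text.
% M: original sample size; T(P'): size after compression, dictionary included.
\mathrm{ratio}(P') \;=\; \frac{M - T(P')}{M},
\qquad
T(P') \;=\; D(P') + E(P')
```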
39. OPTIMAL PATTERN SUBSET
- We call the selection of the best subset of patterns the OPTIMAL PATTERN SUBSET problem.
- We have proved that this problem is also NP-complete.
40. But here we have some resources
- We can approximate the best subset with a heuristic algorithm (a sketch follows below).
- We select the patterns with the greatest coverage (the number of symbols in the sample covered by the pattern's appearances).
- Then we iteratively refine the solution with hill climbers that make local changes.
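A minimal sketch of this heuristic under a deliberately simplified cost model (my own, assuming the listed occurrences do not overlap and each reference costs one symbol): greedy selection by coverage, followed by a one-flip hill climber that keeps any change lowering the estimated T(P') = D(P') + E(P').

```python
# Sketch of the heuristic under a simplified cost model (occurrences assumed non-overlapping).
def estimated_size(sample_len, chosen, ref_cost=1):
    D = sum(len(p) for p, _ in chosen)                                       # dictionary size
    E = sample_len - sum((len(p) - ref_cost) * occ for p, occ in chosen)     # encoded sample size
    return D + E

def select_patterns(sample_len, patterns):
    # patterns: list of (pattern_string, number_of_occurrences)
    order = sorted(patterns, key=lambda pf: len(pf[0]) * pf[1], reverse=True)  # by coverage
    chosen = []
    for p in order:                       # greedy phase: keep a pattern only if it pays off
        if estimated_size(sample_len, chosen + [p]) < estimated_size(sample_len, chosen):
            chosen.append(p)
    improved = True
    while improved:                       # hill-climbing phase: try flipping each pattern in/out
        improved = False
        for p in patterns:
            trial = [q for q in chosen if q != p] if p in chosen else chosen + [p]
            if estimated_size(sample_len, trial) < estimated_size(sample_len, chosen):
                chosen, improved = trial, True
    return chosen

# Hypothetical pattern list in the spirit of the poem example (pattern, occurrence count).
patterns = [("CONSTANTE", 2), ("_A_QUIEN_MI_AMOR_", 2), ("AL_QUE_", 2), ("XYZ", 1)]
sample_len = 155   # illustrative sample size
print(select_patterns(sample_len, patterns))   # the rare "XYZ" does not pay for itself
```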
41. Conclusions
- Pattern-based data compression is the most general approach to the compression problem based on statistical models of the data to be compressed. Every other technique in this class can be considered a particular case.
- Unfortunately, the sub-tasks involved in the compression process are mostly NP-complete problems.
42. Further research
- We need to develop approximation algorithms or heuristics in order to solve the pattern discovery problem efficiently.