Title: BWT-Based Compression Algorithms
1. BWT-Based Compression Algorithms
- Haim Kaplan and Elad Verbin
- Tel-Aviv University
- Presented at CPM '07, July 8, 2007
2. Results
- Cannot show a constant c < 2 such that BW0(s) ≤ c·nHk(s) + lower-order terms
- Similarly:
  - no c < 1.26 for BWRL
  - no c < 1.3 for BWDC
- Probabilistic technique
3. Outline
- Part I: Definitions
- Part II: Results
- Part III: Proofs
- Part IV: Experimental Results
4. Part I: Definitions
5. BW0
- The Main Burrows-Wheeler Compression Algorithm, a three-stage pipeline:
  String S -> BWT (Burrows-Wheeler Transform) -> MTF (Move-to-Front) -> Order-0 Encoding -> Compressed string
- Input S: text in English (similar contexts -> similar character)
- After BWT: text with local uniformity
- After MTF: integer string with many small numbers
6. The BWT
- Invented by Burrows and Wheeler ('94)
- Analogous to the Fourier Transform (smooth!) [Fenwick]
- String with context-regularity: mississippi
  -> BWT ->
  String with spikes (close repetitions): ipssmpissii
7. The BWT
- T = mississippi
- F = first column of the sorted rotation matrix, L = BWT(T) = last column
- The cyclic rotations of T:
  mississippi
  ississippim
  ssissippimi
  sissippimis
  issippimiss
  ssippimissi
  sippimissis
  ippimississ
  ppimississi
  pimississip
  imississipp
- BWT sorts the characters by their post-context (a code sketch follows)
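A minimal sketch of the transform as described above: form all cyclic rotations, sort them, and read off the last column. The slide leaves the end-of-string sentinel implicit; the sketch appends a '$', which is why its output is 'ipssm$pissii' rather than the slide's 'ipssmpissii'. The function name is illustrative, not from the paper.

```python
def bwt(s: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler Transform: append a unique, smallest sentinel,
    sort all cyclic rotations, and return the last column (L)."""
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("mississippi"))   # -> 'ipssm$pissii' (the slide omits the '$')
```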
8. BWT Facts
- Permutes the text
- An (n+1)-to-1 function
9. Move To Front
- By Bentley, Sleator, Tarjan and Wei ('86)
- String with spikes (close repetitions): ipssmpissii
  -> move-to-front ->
  0,0,0,0,0,2,4,3,0,1,0
  an integer string with small numbers (an encoder sketch follows)
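A sketch of the MTF encoder described above. The exact codes depend on the initial order of the symbol list, a convention the slide does not spell out; this sketch starts from the sorted alphabet, so its output differs from the slide's numbers, but the qualitative effect (mostly small values on locally uniform input) is the same.

```python
def mtf_encode(s: str) -> list[int]:
    """Move-to-Front: emit each character's current position in the
    symbol list, then move that character to the front of the list."""
    symbols = sorted(set(s))              # initial order is a convention
    out = []
    for c in s:
        i = symbols.index(c)
        out.append(i)
        symbols.insert(0, symbols.pop(i))
    return out

print(mtf_encode("ipssmpissii"))          # e.g. [0, 2, 3, 0, 3, 2, 3, 3, 0, 1, 0]
```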
10-17. Move to Front (step-by-step animation of the MTF list updates)
18. After MTF
- Now we have a string with small numbers: lots of 0s, many 1s, ...
- Skewed character frequencies: run Arithmetic coding!
19. BW0
- The Main Burrows-Wheeler Compression Algorithm, a three-stage pipeline:
  String S -> BWT (Burrows-Wheeler Transform) -> MTF (Move-to-Front) -> Order-0 Encoding -> Compressed string
- Input S: text in English (similar contexts -> similar character)
- After BWT: text with local uniformity
- After MTF: integer string with many small numbers (an end-to-end sketch follows)
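With all three stages introduced, here is a compact end-to-end sketch of the BW0 pipeline in the spirit of this slide. The helper names are mine, and an empirical order-0 entropy count stands in for an actual order-0 (arithmetic) encoder, so the numbers are idealized sizes rather than real output lengths.

```python
import math
from collections import Counter

def bwt(s: str) -> str:
    """BWT via sorted rotations, with a '$' end-of-string sentinel."""
    s += "$"
    return "".join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))

def mtf(s: str) -> list[int]:
    """Move-to-Front codes, starting from the sorted alphabet."""
    table, out = sorted(set(s)), []
    for c in s:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

def order0_bits(seq) -> float:
    """Empirical order-0 entropy of the sequence, in bits (n * H0)."""
    n, counts = len(seq), Counter(seq)
    return sum(k * math.log2(n / k) for k in counts.values())

text = "mississippimississippimississippi"
pipeline = mtf(bwt(text))
print("order-0 bits of raw text:  ", round(order0_bits(text), 1))
print("order-0 bits after BWT+MTF:", round(order0_bits(pipeline), 1))
```

On repetitive input like this, the order-0 cost after BWT+MTF is typically much smaller than the order-0 cost of the raw text, which is the point of the pipeline.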
20. BWRL (e.g. bzip)
- Like BW0, but with a run-length encoding step added after MTF (hence "RL")
21. Many more BWT-based algorithms
- BWDC: encodes using distance coding instead of MTF
- BW with inversion frequencies coding
- Booster-based (Ferragina-Giancarlo-Manzini-Sciortino)
- Block-based compressor of Effros et al.
22. Order-0 entropy
- Lower bound for compression without context information
- Example: S = ACABBA
  - 1/2 A's, each represented by 1 bit
  - 1/3 B's, each represented by log(3) bits
  - 1/6 C's, each represented by log(6) bits
  - 6·H0(S) = 3·1 + 2·log(3) + 1·log(6)
- (A quick check of this arithmetic follows.)
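A quick check of the arithmetic above, assuming the usual empirical order-0 entropy nH0(S) = sum over characters c of n_c·log2(n/n_c); the function name is mine.

```python
import math
from collections import Counter

def n_h0(s: str) -> float:
    """n * H0(s): empirical order-0 entropy of s, in bits."""
    n, counts = len(s), Counter(s)
    return sum(k * math.log2(n / k) for k in counts.values())

s = "ACABBA"
print(n_h0(s))                                   # 8.754...
print(3 * 1 + 2 * math.log2(3) + math.log2(6))   # same: 3*1 + 2*log(3) + 1*log(6)
```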
23. Order-k entropy
- Lower bound for compression with order-k contexts
24. Order-k entropy
- Example: mississippi, with k = 1 (each character's context is the character preceding it)
  - Context for i: mssp
  - Context for s: isis
  - Context for p: ip
- (A sketch of computing nHk this way follows.)
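A sketch of how nHk can be computed under the convention used on this slide (the context of a character is the k characters preceding it): group the characters by context and sum the order-0 entropies of the groups. Function names are mine.

```python
import math
from collections import Counter, defaultdict

def n_h0(seq) -> float:
    """n * H0 of a sequence, in bits."""
    n, counts = len(seq), Counter(seq)
    return sum(k * math.log2(n / k) for k in counts.values())

def n_hk(s: str, k: int) -> float:
    """n * Hk(s): group each character by its k preceding characters
    and sum the order-0 entropies of the groups."""
    groups = defaultdict(list)
    for i in range(k, len(s)):
        groups[s[i - k:i]].append(s[i])
    return sum(n_h0(g) for g in groups.values())

s = "mississippi"
print(n_h0(s), n_hk(s, 1), n_hk(s, 2))   # the entropy drops as k grows
```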
25. Part II: Results
26. Measuring against Hk
- When performing worst-case analysis of lossless text compressors, we usually measure against Hk
- The goal: a bound of the form A(s) ≤ c·nHk(s) + lower-order term
- Optimal: A(s) ≤ nHk(s) + lower-order term
27-29. Bounds (comparison tables; the constants appear on slide 2)
30. Bounds
- Surprising, since BWT-based compressors work better than gzip in practice!
31. Possible Explanations
- Asymptotics: real compressors cut the input into blocks, so the asymptotic regime may never be reached
- English text is not Markovian!
  - Analyzing on a different model might show BWT's superiority
32. Part III: Proofs
33. Lower bound
- Wish to analyze BW0 = BWT + MTF + Order-0
- Need to show a string s for which BW0(s) is close to 2·nHk(s)
- Consider a string s with 10^3 a's and 10^6 b's
- Entropy of s: about 10^3·log(10^6/10^3) ≈ 10^4 bits
- BWT(s): same character frequencies
- MTF(BWT(s)) has about 2·10^3 1's and 10^6 - 10^3 0's
- Compressed size: about 2·10^3·log(10^6/(2·10^3)) ≈ 1.8·10^4 bits, close to twice the entropy
- For this we need BWT(s) to have many isolated a's (a small demonstration follows)
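The MTF part of this argument can be checked directly on a string that already has isolated a's; whether such a string can actually arise as a BWT image is exactly what the probabilistic argument on the next slide supplies. The sketch below uses smaller counts than the slide's 10^3 / 10^6 so it runs instantly; all sizes and names are illustrative.

```python
import math
import random
from collections import Counter

def mtf(s):
    """Move-to-Front over the two-letter alphabet {a, b}."""
    table, out = ["a", "b"], []
    for c in s:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

def n_h0(seq):
    """n * H0 of a sequence, in bits."""
    n, counts = len(seq), Counter(seq)
    return sum(k * math.log2(n / k) for k in counts.values())

# Mostly b's, with k isolated a's (no two a's adjacent).
random.seed(0)
n, k = 100_000, 100
s = ["b"] * n
for p in random.sample(range(0, n, 2), k):   # even positions -> isolated a's
    s[p] = "a"

codes = mtf(s)
print("1's in the MTF output:", codes.count(1))           # about 2*k
print("entropy of s (bits):", round(n_h0(s)))
print("order-0 size of MTF output (bits):", round(n_h0(codes)))
# The last number is noticeably larger than the entropy;
# the ratio tends to 2 as the a's become sparser.
```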
34. Many isolated a's
- Goal: find s such that in BWT(s), most a's are isolated
- Solution: probabilistic
- BWT is an (n+1)-to-1 function
- A random string has a 1/(n+1) chance of being a BWT image
- A random string has a 1 - 1/n^2 chance of having many isolated a's
- Since 1/(n+1) > 1/n^2, some string is both; therefore, such a string exists
35. General Calculation
- s contains pn a's and (1-p)n b's
- Entropy of s: nH(p), where H is the binary entropy function
- MTF(BWT(s)) contains about 2p(1-p)n 1's, the rest 0's
- Compressed size of MTF(BWT(s)): about nH(2p(1-p))
- Ratio: H(2p(1-p)) / H(p), which tends to 2 as p -> 0 (tabulated below)
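Taking the formulas above at face value (entropy nH(p), compressed size about nH(2p(1-p)), with H the binary entropy function), the ratio can simply be tabulated; it climbs toward 2 as p -> 0. A sketch of that calculation, not code from the paper:

```python
import math

def H(p: float) -> float:
    """Binary entropy, in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.1, 0.01, 1e-3, 1e-6, 1e-9, 1e-12):
    print(f"p = {p:<6g}  ratio = {H(2 * p * (1 - p)) / H(p):.3f}")
# The ratio approaches 2 as p -> 0.
```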
36. Lower bounds on BWDC, BWRL
- Similar technique
- p infinitesimally small gives a compressible string
- So maximize the ratio over p
- Gives weird constants, but quite strong
37. Experimental Results
- Sanity check: picking texts from the above Markov models really shows this behavior in practice
- Picking text from realistic Markov sources also shows non-optimal behavior ("realistic" = generated from actual texts)
- On long Markov text, gzip works better than BWT (a toy comparison follows)
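A toy version of the kind of sanity check described above, not the authors' actual experiment: generate text from a made-up order-1 Markov source and compare gzip (Lempel-Ziv based) with bzip2 (BWT based) via Python's standard-library bindings. The transition matrix is invented purely for illustration, and which compressor wins depends on the source and the length; the slide reports gzip winning on long Markov text.

```python
import bz2
import gzip
import random

random.seed(0)
alphabet = "abcd"
# Invented order-1 transition probabilities (each row sums to 1).
trans = {
    "a": [0.70, 0.10, 0.10, 0.10],
    "b": [0.10, 0.70, 0.10, 0.10],
    "c": [0.10, 0.10, 0.70, 0.10],
    "d": [0.10, 0.10, 0.10, 0.70],
}

def markov_text(length: int) -> bytes:
    """Sample `length` characters from the order-1 Markov chain above."""
    out, cur = [], "a"
    for _ in range(length):
        cur = random.choices(alphabet, weights=trans[cur])[0]
        out.append(cur)
    return "".join(out).encode()

text = markov_text(1_000_000)
print("gzip :", len(gzip.compress(text)), "bytes")
print("bzip2:", len(bz2.compress(text)), "bytes")
```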
38. Bottom Line
- BWT compressors are not optimal (vs. order-k entropy)
- We believe that they are good because English text is not Markovian
- Find theoretical justification!
- Also: improve the constants, find BWT algorithms with better ratios, ...
39. Thank You!
40. Additional Slides (taken out for lack of time)
41. BWT - Invertibility
- Go forward, one character at a time
42. Main Property: the L -> F mapping
- The i-th occurrence of c in L corresponds to the i-th occurrence of c in F
- This happens because the characters in L are sorted by their post-context, and the occurrences of character c in F are also sorted by their post-context (an inversion sketch follows)
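A sketch of inversion built directly on this property: the i-th occurrence of c in L is the same text character as the i-th occurrence of c in F, which gives the LF mapping used to walk through the text one character at a time. It assumes the '$'-terminated BWT from the earlier sketch; names are illustrative.

```python
def inverse_bwt(L: str) -> str:
    """Invert a '$'-terminated BWT using the LF mapping."""
    F = sorted(L)
    first = {}                          # first[c] = first row of F starting with c
    for i, c in enumerate(F):
        first.setdefault(c, i)
    counts, rank = {}, []               # rank[i] = # of L[i]'s among L[:i]
    for c in L:
        rank.append(counts.get(c, 0))
        counts[c] = counts.get(c, 0) + 1

    out, row = [], 0                    # row 0 is the rotation starting with '$'
    for _ in range(len(L) - 1):         # recover every character except '$'
        c = L[row]
        out.append(c)
        row = first[c] + rank[row]      # LF: same occurrence of c, now in F
    return "".join(reversed(out))

print(inverse_bwt("ipssm$pissii"))      # -> 'mississippi'
```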
43. BW0 vs. Lempel-Ziv
- BW0 dynamically takes advantage of context-regularity
- A robust, smooth alternative to Lempel-Ziv
44. BW0 vs. Statistical Coding
- Statistical Coding (e.g. PPM):
  - Builds a model for each context
  - Prediction -> Compression
45. Compressed Text Indexing
- An application of BWT
- A compressed representation of the text that supports:
  - fast pattern matching (without decompression!)
  - partial decompression
- So, no need to ever decompress!
- Space usage: BW0(s) + o(n)
- See more in Ferragina-Manzini
46. Musings
- On one hand, BWT-based algorithms are not optimal, while Lempel-Ziv is
- On the other hand, BWT compresses much better in practice
- Reasons:
  - The results are asymptotic (the "EE" reason)
  - English text was not generated by a Markov source (the real reason?)
- Goal: a more honest way to analyze
  - Use a statistic different from Hk?