Title: BWT-Based Compression Algorithms
1. BWT-Based Compression Algorithms
- Haim Kaplan and Elad Verbin
- Tel-Aviv University
- Presented at CPM '07, July 8, 2007
2. Results
- Cannot show a constant c < 2 such that BW0(s) ≤ c·nHk(s) + lower-order terms
- Similarly:
  - no c < 1.26 for BWRL
  - no c < 1.3 for BWDC
- Probabilistic technique
3. Outline
- Part I: Definitions
- Part II: Results
- Part III: Proofs
- Part IV: Experimental Results
4. Part I: Definitions
5. BW0
- The Main Burrows-Wheeler Compression Algorithm, a three-stage pipeline:
  String S -> BWT (Burrows-Wheeler Transform) -> MTF (Move-to-Front) -> Order-0 Encoding -> Compressed string
- Input S: text in English (similar contexts -> similar character)
- After BWT: text with local uniformity
- After MTF: integer string with many small numbers
6. The BWT
- Invented by Burrows and Wheeler ('94)
- Analogous to the Fourier Transform (smooth!) [Fenwick]
- String with context-regularity: mississippi
  -> BWT ->
  String with spikes (close repetitions): ipssmpissii
7. The BWT
- T = mississippi
- F = first column of the sorted rotation matrix, L = BWT(T) = last column
- The cyclic rotations of T:
  mississippi
  ississippim
  ssissippimi
  sissippimis
  issippimiss
  ssippimissi
  sippimissis
  ippimississ
  ppimississi
  pimississip
  imississipp
- BWT sorts the characters by their post-context (a code sketch follows)
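A minimal sketch of the transform as described above: form all cyclic rotations, sort them, and read off the last column. The slide leaves the end-of-string sentinel implicit; the sketch appends a '$', which is why its output is 'ipssm$pissii' rather than the slide's 'ipssmpissii'. The function name is illustrative, not from the paper.

```python
def bwt(s: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler Transform: append a unique, smallest sentinel,
    sort all cyclic rotations, and return the last column (L)."""
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("mississippi"))   # -> 'ipssm$pissii' (the slide omits the '$')
```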
8. BWT Facts
- Permutes the text
- An (n+1)-to-1 function
9. Move To Front
- By Bentley, Sleator, Tarjan and Wei ('86)
- String with spikes (close repetitions): ipssmpissii
  -> move-to-front ->
  0,0,0,0,0,2,4,3,0,1,0
  an integer string with small numbers (an encoder sketch follows)
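A sketch of the MTF encoder described above. The exact codes depend on the initial order of the symbol list, a convention the slide does not spell out; this sketch starts from the sorted alphabet, so its output differs from the slide's numbers, but the qualitative effect (mostly small values on locally uniform input) is the same.

```python
def mtf_encode(s: str) -> list[int]:
    """Move-to-Front: emit each character's current position in the
    symbol list, then move that character to the front of the list."""
    symbols = sorted(set(s))              # initial order is a convention
    out = []
    for c in s:
        i = symbols.index(c)
        out.append(i)
        symbols.insert(0, symbols.pop(i))
    return out

print(mtf_encode("ipssmpissii"))          # e.g. [0, 2, 3, 0, 3, 2, 3, 3, 0, 1, 0]
```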
10-17. Move to Front (step-by-step animation of the MTF list updates)
18. After MTF
- Now we have a string with small numbers: lots of 0s, many 1s, ...
- Skewed character frequencies: run Arithmetic coding!
19. BW0
- The Main Burrows-Wheeler Compression Algorithm, a three-stage pipeline:
  String S -> BWT (Burrows-Wheeler Transform) -> MTF (Move-to-Front) -> Order-0 Encoding -> Compressed string
- Input S: text in English (similar contexts -> similar character)
- After BWT: text with local uniformity
- After MTF: integer string with many small numbers (an end-to-end sketch follows)
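With all three stages introduced, here is a compact end-to-end sketch of the BW0 pipeline in the spirit of this slide. The helper names are mine, and an empirical order-0 entropy count stands in for an actual order-0 (arithmetic) encoder, so the numbers are idealized sizes rather than real output lengths.

```python
import math
from collections import Counter

def bwt(s: str) -> str:
    """BWT via sorted rotations, with a '$' end-of-string sentinel."""
    s += "$"
    return "".join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))

def mtf(s: str) -> list[int]:
    """Move-to-Front codes, starting from the sorted alphabet."""
    table, out = sorted(set(s)), []
    for c in s:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

def order0_bits(seq) -> float:
    """Empirical order-0 entropy of the sequence, in bits (n * H0)."""
    n, counts = len(seq), Counter(seq)
    return sum(k * math.log2(n / k) for k in counts.values())

text = "mississippimississippimississippi"
pipeline = mtf(bwt(text))
print("order-0 bits of raw text:  ", round(order0_bits(text), 1))
print("order-0 bits after BWT+MTF:", round(order0_bits(pipeline), 1))
```

On repetitive input like this, the order-0 cost after BWT+MTF is typically much smaller than the order-0 cost of the raw text, which is the point of the pipeline.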
20. BWRL (e.g. bzip)
- Like BW0, but with a run-length encoding step added after MTF (hence "RL")
21. Many more BWT-based algorithms
- BWDC: encodes using distance coding instead of MTF
- BW with inversion frequencies coding
- Booster-based (Ferragina-Giancarlo-Manzini-Sciortino)
- Block-based compressor of Effros et al.
22. Order-0 entropy
- Lower bound for compression without context information
- Example: S = ACABBA
  - 1/2 A's, each represented by 1 bit
  - 1/3 B's, each represented by log(3) bits
  - 1/6 C's, each represented by log(6) bits
  - 6·H0(S) = 3·1 + 2·log(3) + 1·log(6)
- (A quick check of this arithmetic follows.)
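A quick check of the arithmetic above, assuming the usual empirical order-0 entropy nH0(S) = sum over characters c of n_c·log2(n/n_c); the function name is mine.

```python
import math
from collections import Counter

def n_h0(s: str) -> float:
    """n * H0(s): empirical order-0 entropy of s, in bits."""
    n, counts = len(s), Counter(s)
    return sum(k * math.log2(n / k) for k in counts.values())

s = "ACABBA"
print(n_h0(s))                                   # 8.754...
print(3 * 1 + 2 * math.log2(3) + math.log2(6))   # same: 3*1 + 2*log(3) + 1*log(6)
```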
23. Order-k entropy
- Lower bound for compression with order-k contexts
24. Order-k entropy
- Example: mississippi, with k = 1 (each character's context is the character preceding it)
  - Context for i: mssp
  - Context for s: isis
  - Context for p: ip
- (A sketch of computing nHk this way follows.)
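A sketch of how nHk can be computed under the convention used on this slide (the context of a character is the k characters preceding it): group the characters by context and sum the order-0 entropies of the groups. Function names are mine.

```python
import math
from collections import Counter, defaultdict

def n_h0(seq) -> float:
    """n * H0 of a sequence, in bits."""
    n, counts = len(seq), Counter(seq)
    return sum(k * math.log2(n / k) for k in counts.values())

def n_hk(s: str, k: int) -> float:
    """n * Hk(s): group each character by its k preceding characters
    and sum the order-0 entropies of the groups."""
    groups = defaultdict(list)
    for i in range(k, len(s)):
        groups[s[i - k:i]].append(s[i])
    return sum(n_h0(g) for g in groups.values())

s = "mississippi"
print(n_h0(s), n_hk(s, 1), n_hk(s, 2))   # the entropy drops as k grows
```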
25. Part II: Results
26. Measuring against Hk
- When performing worst-case analysis of lossless text compressors, we usually measure against Hk
- The goal: a bound of the form A(s) ≤ c·nHk(s) + lower-order term
- Optimal: A(s) ≤ nHk(s) + lower-order term
27-29. Bounds (comparison tables; the constants appear on slide 2)
30. Bounds
- Surprising, since BWT-based compressors work better than gzip in practice!
31. Possible Explanations
- Asymptotics: real compressors cut the input into blocks, so the asymptotic regime may never be reached
- English text is not Markovian!
  - Analyzing on a different model might show BWT's superiority
32. Part III: Proofs
33. Lower bound
- Wish to analyze BW0 = BWT + MTF + Order-0
- Need to show a string s for which BW0(s) is close to 2·nHk(s)
- Consider a string s with 10^3 a's and 10^6 b's
- Entropy of s: about 10^3·log(10^6/10^3) ≈ 10^4 bits
- BWT(s): same character frequencies
- MTF(BWT(s)) has about 2·10^3 1's and 10^6 - 10^3 0's
- Compressed size: about 2·10^3·log(10^6/(2·10^3)) ≈ 1.8·10^4 bits, close to twice the entropy
- For this we need BWT(s) to have many isolated a's (a small demonstration follows)
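The MTF part of this argument can be checked directly on a string that already has isolated a's; whether such a string can actually arise as a BWT image is exactly what the probabilistic argument on the next slide supplies. The sketch below uses smaller counts than the slide's 10^3 / 10^6 so it runs instantly; all sizes and names are illustrative.

```python
import math
import random
from collections import Counter

def mtf(s):
    """Move-to-Front over the two-letter alphabet {a, b}."""
    table, out = ["a", "b"], []
    for c in s:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

def n_h0(seq):
    """n * H0 of a sequence, in bits."""
    n, counts = len(seq), Counter(seq)
    return sum(k * math.log2(n / k) for k in counts.values())

# Mostly b's, with k isolated a's (no two a's adjacent).
random.seed(0)
n, k = 100_000, 100
s = ["b"] * n
for p in random.sample(range(0, n, 2), k):   # even positions -> isolated a's
    s[p] = "a"

codes = mtf(s)
print("1's in the MTF output:", codes.count(1))           # about 2*k
print("entropy of s (bits):", round(n_h0(s)))
print("order-0 size of MTF output (bits):", round(n_h0(codes)))
# The last number is noticeably larger than the entropy;
# the ratio tends to 2 as the a's become sparser.
```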
34. Many isolated a's
- Goal: find s such that in BWT(s), most a's are isolated
- Solution: probabilistic
- BWT is an (n+1)-to-1 function
- A random string has a 1/(n+1) chance of being a BWT image
- A random string has a 1 - 1/n^2 chance of having many isolated a's
- Since 1/(n+1) > 1/n^2, some string is both; therefore, such a string exists
35. General Calculation
- s contains pn a's and (1-p)n b's
- Entropy of s: nH(p), where H is the binary entropy function
- MTF(BWT(s)) contains about 2p(1-p)n 1's, the rest 0's
- Compressed size of MTF(BWT(s)): about nH(2p(1-p))
- Ratio: H(2p(1-p)) / H(p), which tends to 2 as p -> 0 (tabulated below)
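Taking the formulas above at face value (entropy nH(p), compressed size about nH(2p(1-p)), with H the binary entropy function), the ratio can simply be tabulated; it climbs toward 2 as p -> 0. A sketch of that calculation, not code from the paper:

```python
import math

def H(p: float) -> float:
    """Binary entropy, in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.1, 0.01, 1e-3, 1e-6, 1e-9, 1e-12):
    print(f"p = {p:<6g}  ratio = {H(2 * p * (1 - p)) / H(p):.3f}")
# The ratio approaches 2 as p -> 0.
```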
36. Lower bounds on BWDC, BWRL
- Similar technique
- p infinitesimally small gives a compressible string
- So maximize the ratio over p
- Gives weird constants, but quite strong
37. Experimental Results
- Sanity check: picking texts from the above Markov models really shows this behavior in practice
- Picking text from realistic Markov sources also shows non-optimal behavior ("realistic" = generated from actual texts)
- On long Markov text, gzip works better than BWT (a toy comparison follows)
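A toy version of the kind of sanity check described above, not the authors' actual experiment: generate text from a made-up order-1 Markov source and compare gzip (Lempel-Ziv based) with bzip2 (BWT based) via Python's standard-library bindings. The transition matrix is invented purely for illustration, and which compressor wins depends on the source and the length; the slide reports gzip winning on long Markov text.

```python
import bz2
import gzip
import random

random.seed(0)
alphabet = "abcd"
# Invented order-1 transition probabilities (each row sums to 1).
trans = {
    "a": [0.70, 0.10, 0.10, 0.10],
    "b": [0.10, 0.70, 0.10, 0.10],
    "c": [0.10, 0.10, 0.70, 0.10],
    "d": [0.10, 0.10, 0.10, 0.70],
}

def markov_text(length: int) -> bytes:
    """Sample `length` characters from the order-1 Markov chain above."""
    out, cur = [], "a"
    for _ in range(length):
        cur = random.choices(alphabet, weights=trans[cur])[0]
        out.append(cur)
    return "".join(out).encode()

text = markov_text(1_000_000)
print("gzip :", len(gzip.compress(text)), "bytes")
print("bzip2:", len(bz2.compress(text)), "bytes")
```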
38. Bottom Line
- BWT compressors are not optimal (vs. order-k entropy)
- We believe that they are good because English text is not Markovian
- Find theoretical justification!
- Also: improve the constants, find BWT algorithms with better ratios, ...
39. Thank You!
40. Additional Slides (taken out for lack of time)
41. BWT - Invertibility
- Go forward, one character at a time
42. Main Property: the L -> F mapping
- The i-th occurrence of c in L corresponds to the i-th occurrence of c in F
- This happens because the characters in L are sorted by their post-context, and the occurrences of character c in F are also sorted by their post-context (an inversion sketch follows)
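A sketch of inversion built directly on this property: the i-th occurrence of c in L is the same text character as the i-th occurrence of c in F, which gives the LF mapping used to walk through the text one character at a time. It assumes the '$'-terminated BWT from the earlier sketch; names are illustrative.

```python
def inverse_bwt(L: str) -> str:
    """Invert a '$'-terminated BWT using the LF mapping."""
    F = sorted(L)
    first = {}                          # first[c] = first row of F starting with c
    for i, c in enumerate(F):
        first.setdefault(c, i)
    counts, rank = {}, []               # rank[i] = # of L[i]'s among L[:i]
    for c in L:
        rank.append(counts.get(c, 0))
        counts[c] = counts.get(c, 0) + 1

    out, row = [], 0                    # row 0 is the rotation starting with '$'
    for _ in range(len(L) - 1):         # recover every character except '$'
        c = L[row]
        out.append(c)
        row = first[c] + rank[row]      # LF: same occurrence of c, now in F
    return "".join(reversed(out))

print(inverse_bwt("ipssm$pissii"))      # -> 'mississippi'
```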
43. BW0 vs. Lempel-Ziv
- BW0 dynamically takes advantage of context-regularity
- A robust, smooth alternative to Lempel-Ziv
44. BW0 vs. Statistical Coding
- Statistical Coding (e.g. PPM):
  - Builds a model for each context
  - Prediction -> Compression
45. Compressed Text Indexing
- An application of BWT
- A compressed representation of the text that supports:
  - fast pattern matching (without decompression!)
  - partial decompression
- So, no need to ever decompress!
- Space usage: BW0(s) + o(n)
- See more in Ferragina-Manzini
46. Musings
- On one hand, BWT-based algorithms are not optimal, while Lempel-Ziv is
- On the other hand, BWT compresses much better in practice
- Reasons:
  - The results are asymptotic (the "EE" reason)
  - English text was not generated by a Markov source (the real reason?)
- Goal: a more honest way to analyze
  - Use a statistic different from Hk?