Title: BWT-Based Compression Algorithms
1. BWT-Based Compression Algorithms
- Elad Verbin
- Tel-Aviv University
Based on joint work with Haim Kaplan and Shir
Landau
Presented at ITCS, Tsinghua University, Sept.
14th, 2007
2. Talk Outline
- Part I: Introduction to BWT-based compression
- The BWT and its properties
- BWT-based compression
- Part II: Measuring efficiency of compression algorithms
- Empirical evaluation vs. worst-case analysis
- Lower and upper bounds
- Part III: Proving the lower bound
- Part IV: A new research direction
3. Part I: Introduction to BWT-Based Compression
4. Lossless text compression
- Three main classes:
- Dictionary-based (Lempel-Ziv, etc.)
- Statistical coders (PPM, etc.)
- BWT-based (bzip2, etc.)
- Currently, statistical coders achieve the best compression ratios - will this change?
- Advantage of BWT-based compressors: they work out-of-the-box and are very fast
5. Empirical Comparison
- Alice.txt (file size 152K), taken from the Canterbury Corpus website
- bzip2: achieves compression close to statistical coders, with running time close to gzip.
6. Algorithm BW0
7. BW0: Burrows-Wheeler Compression
Text in English (similar contexts -> similar character): mississippi
--BWT-->
Text with spikes (close repetitions): ipssmpissii
--Move-to-front-->
Integer string with small numbers: 0,0,0,0,0,2,4,3,0,1,0
--Arithmetic coding-->
Compressed text: 01000101010100010101010
8. The BWT
- Invented by Burrows and Wheeler ('94)
- Analogous to the Fourier Transform (smooth!) [Fenwick]
string with context-regularity: mississippi
--BWT-->
string with spikes (close repetitions): ipssmpissii
9. The BWT
T = mississippi
The BWT matrix consists of all cyclic rotations of T, sorted; F is its first column and L = BWT(T) is its last column. The rotations of T:
mississippi
ississippim
ssissippimi
sissippimis
issippimiss
ssippimissi
sippimissis
ippimississ
ppimississi
pimississip
imississipp
BWT sorts the characters by their post-context.
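For concreteness, here is a minimal Python sketch of the transform by sorting cyclic rotations; the function name `bwt` and the explicit '$' end marker are my additions (the slide's example leaves the marker implicit), so treat this as an illustration rather than the deck's own code.

```python
def bwt(text, end_marker="$"):
    """Burrows-Wheeler Transform: sort all cyclic rotations of text + marker
    and return the last column. The end marker makes the transform uniquely
    invertible; the slide's example leaves it implicit."""
    s = text + end_marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))  # rows of the BWT matrix
    return "".join(row[-1] for row in rotations)              # last column L

print(bwt("mississippi"))  # -> 'ipssm$pissii' (the slide's 'ipssmpissii' plus the marker)
```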
10. The BWT
- Invertible! (see the sketch below)
- Linear-time algorithms exist both for computing it and for inverting it
- Useful for compression
- Facts:
- BWT permutes the input text
- BWT is an (n+1)-to-1 function
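As an illustration of invertibility, a short sketch of inversion via the LF-mapping follows; it runs in O(n log n) rather than the linear time mentioned above, and the function name is mine.

```python
def inverse_bwt(last_column, end_marker="$"):
    """Invert the BWT via the LF-mapping: a simple O(n log n) sketch,
    not the linear-time algorithms the slide refers to."""
    n = len(last_column)
    # Stable-sorting the positions of L yields F: lf[j] is the position in L
    # holding the same occurrence of the character that starts row j (column F).
    lf = sorted(range(n), key=lambda i: last_column[i])
    row = last_column.index(end_marker)   # the rotation equal to text + marker
    out = []
    for _ in range(n):
        row = lf[row]
        out.append(last_column[row])
    return "".join(out)[:-1]              # drop the trailing end marker

print(inverse_bwt("ipssm$pissii"))  # -> 'mississippi'
```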
11. BWT useful for compression
- Suppose the text contains 1000 occurrences of "swimming", 100 of "fighting", and no other words containing "ing".
- So the output of BWT has a long run of 'm's, interspersed with some 't's
- BWT turns context-regularity into local uniformity (closely-occurring character repetitions)
12. Move To Front
- By Bentley, Sleator, Tarjan and Wei ('86)
string with spikes (close repetitions): ipssmpissii
--move-to-front-->
integer string with small numbers: 0,0,0,0,0,2,4,3,0,1,0
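A minimal move-to-front sketch in Python; the function name is mine, and since the slides do not specify the initial list, exact codes can differ from the illustrative numbers above.

```python
def mtf_encode(s, initial_table=None):
    """Move-to-front: output each character's current index in a list,
    then move that character to the front of the list."""
    # The slides don't specify the initial list; here it is the sorted
    # alphabet of the input, so exact codes may differ from the slides'.
    table = sorted(set(s)) if initial_table is None else list(initial_table)
    out = []
    for ch in s:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))   # move ch to the front
    return out

print(mtf_encode("ipssmpissii"))  # -> [0, 2, 3, 0, 3, 2, 3, 3, 0, 1, 0]
```

Note how repeated characters produce 0s: close repetitions in the BWT output become small numbers.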
13-20. Move to Front (step-by-step animation of the move-to-front list; figures omitted)
21. After MTF
- Now we have a string with small numbers: lots of 0s, many 1s, ...
- Skewed frequencies: run Arithmetic coding!
(Chart: character frequencies)
22. Order-0 encoding
- Huffman Encoding
- Arithmetic Encoding
- These work well when character frequencies are
skewed
23. Summary of BW0
24. BW0
- The main Burrows-Wheeler compression algorithm:
String S: text in English (similar contexts -> similar character)
--BWT (Burrows-Wheeler Transform)-->
Text with local uniformity
--MTF (Move-to-front)-->
Integer string with many small numbers
--Order-0 Encoding-->
Compressed string S'
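Putting the three stages together, here is a self-contained sketch of the pipeline; the arithmetic coder is replaced by an order-0 entropy estimate of the output size, and all names are illustrative rather than taken from the slides.

```python
import math
from collections import Counter

def bw0_size_estimate(text, end_marker="$"):
    """Sketch of the BW0 pipeline: BWT -> MTF -> order-0 stage.
    The order-0 coder is replaced by an entropy estimate of its output size."""
    s = text + end_marker
    n = len(s)
    # 1. BWT: last column of the lexicographically sorted cyclic rotations.
    last = "".join(row[-1] for row in sorted(s[i:] + s[:i] for i in range(n)))
    # 2. MTF: each character becomes its index in a self-adjusting list.
    table = sorted(set(last))
    codes = []
    for ch in last:
        i = table.index(ch)
        codes.append(i)
        table.insert(0, table.pop(i))
    # 3. Order-0 stage: about n * H0(codes) bits (what an arithmetic coder needs).
    counts = Counter(codes)
    h0 = sum((c / n) * math.log2(n / c) for c in counts.values())
    return n * h0

print(bw0_size_estimate("mississippi"))  # estimated output size in bits
```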
25. BWRL (e.g. bzip)
26. Many more BWT-based algorithms
- BW0 is the basic algorithm, but there exist many others [Abel, Deorowicz, Fenwick]
- BWRL: runs RLE after MTF
- BWDC: encodes using distance coding instead of MTF
- Booster-based [Ferragina-Giancarlo-Manzini-Sciortino]: completely different.
27. A dirty little secret
- Before compressing, bzip2 partitions the text into chunks of 900K and compresses each one separately
- This seems horrible. Can you get rid of this?
- Streaming algorithms?
- Difficult algorithmic problem: BWT is hard to divide-and-conquer; the chunks are highly dependent.
- gzip and PPM partition too.
28. Part II: Measuring Efficiency of Compression Algorithms
29. Measuring Compression
- Three approaches:
- Empirical
- Model-based
- Worst-case
30. Worst-Case Bounds for Compression
- Need a statistic to measure against.
- We can't compress all texts; the chosen statistic measures the compressibility of the text.
- Examples:
- H0 (order-0 entropy)
- Hk (order-k entropy)
- Kolmogorov Complexity
- Smallest Grammar that Generates the text
31. order-0 entropy
- Lower bound for compression without context information
S = ACABBA
1/2 of the characters are A's, each represented by 1 bit
1/3 are B's, each represented by log(3) bits
1/6 are C's, each represented by log(6) bits
6·H0(S) = 3·1 + 2·log(3) + 1·log(6)
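A small sketch that computes the empirical order-0 entropy and reproduces the numbers in this example; the function name is my own choice.

```python
import math
from collections import Counter

def order0_entropy(s):
    """Empirical order-0 entropy in bits per character:
    H0(s) = sum over characters c of (n_c / n) * log2(n / n_c)."""
    n = len(s)
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())

s = "ACABBA"
print(order0_entropy(s))      # (1/2)*1 + (1/3)*log2(3) + (1/6)*log2(6)
print(6 * order0_entropy(s))  # = 3*1 + 2*log2(3) + 1*log2(6), as on the slide
```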
32. Order-0 Compressors
- Used when we need to compress texts where the character frequencies are skewed
- Huffman Encoding:
- Huff(s) ≤ nH0(s) + n
- Arithmetic Encoding:
- Arith(s) ≤ nH0(s) + O(log n)
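To make the Huffman bound concrete, here is a sketch (names mine) that computes the total Huffman output length and checks it against Huff(s) ≤ nH0(s) + n on the earlier example; it uses the standard trick that each heap merge charges one extra bit to every symbol occurrence below it.

```python
import heapq
import math
from collections import Counter

def huffman_bits(s):
    """Total bits used by a Huffman code for s (codebook not counted)."""
    counts = list(Counter(s).values())
    if len(counts) == 1:
        return len(s)                  # degenerate case: one distinct character
    heapq.heapify(counts)
    total = 0
    while len(counts) > 1:
        a, b = heapq.heappop(counts), heapq.heappop(counts)
        total += a + b                 # each merge adds one bit per symbol below it
        heapq.heappush(counts, a + b)
    return total

def order0_entropy(s):
    n = len(s)
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())

s = "ACABBA"
print(huffman_bits(s), "<=", len(s) * order0_entropy(s) + len(s))  # Huff(s) <= nH0(s) + n
```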
33. order-k entropy
- Lower bound for compression
with order-k contexts
34. order-k entropy
mississippi
Context for "i": mssp
Context for "s": isis
Context for "p": ip
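The sketch below computes the empirical order-k entropy using the standard definition Hk(s) = (1/n) Σ_w |s_w| H0(s_w), grouping each character by the k characters that follow it (the post-context convention used with the BWT); the grouping reproduces this slide's example for k = 1. Function names are mine, and boundary handling (the last k characters have no full post-context) is a simplification.

```python
import math
from collections import Counter, defaultdict

def order0_entropy(s):
    n = len(s)
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values()) if n else 0.0

def orderk_entropy(s, k):
    """Empirical order-k entropy: H_k(s) = (1/n) * sum over length-k contexts w
    of |s_w| * H0(s_w), where s_w collects the characters whose k following
    characters equal w. The last k characters (no full post-context) are skipped."""
    n = len(s)
    buckets = defaultdict(list)
    for i in range(n - k):
        buckets[s[i + 1:i + 1 + k]].append(s[i])
    return sum(len(b) * order0_entropy(b) for b in buckets.values()) / n

s = "mississippi"
print(orderk_entropy(s, 1))  # groups: s_"i" = "mssp", s_"s" = "isis", s_"p" = "ip"
```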
35. Measuring against Hk
- When performing worst-case analysis of lossless text compressors, we usually measure against Hk
- The goal: a bound of the form
- A(s) ≤ a·nHk(s) + lower-order term
- Optimal: A(s) ≤ nHk(s) + lower-order term
36. Known Bounds
37. Known Bounds
38. Known Bounds
39. Known Bounds
Surprising!! Since BWT-based compressors work better than gzip in practice! Note: on Markov sources, gzip indeed works better.
40. Part III: Proof of Lower Bound
41. Lower bound
- Wish to analyze BW0 = BWT + MTF + Order-0
- Need to show an s for which BW0 compresses poorly relative to the entropy of s
- Consider a string s with 10^3 'a's and 10^6 'b's
- Entropy of s: about n·H(p) bits, where p = 10^3/(10^6 + 10^3) and H is the binary entropy function
- BWT(s) has the same character frequencies
- MTF(BWT(s)) has about 2·10^3 '1's and 10^6 - 10^3 '0's
- Compressed size: about n·H(2·10^3/n) bits, noticeably larger than the entropy
- For this we need BWT(s) to have many isolated 'a's
42. many isolated 'a's
- Goal: find s such that in s' = BWT(s), many 'a's are isolated
- Solution: probabilistic.
- BWT is an (n+1)-to-1 function.
- So a random string has a chance of at least 1/(n+1) of being a BWT-image
- A random string (with these character frequencies) has a very high chance of having many isolated 'a's
- Therefore, such a string exists.
43. General Calculation
- s contains pn 'a's and (1-p)n 'b's.
- Entropy of s: nH0(s) = n·H(p), where H is the binary entropy function
- MTF(BWT(s)) contains about 2p(1-p)n '1's, the rest '0's
- Compressed size of MTF(BWT(s)): about n·H(2p(1-p)) bits
- Ratio: H(2p(1-p)) / H(p)
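A small experiment (my own, checking only the counting step, not the BWT-image part of the proof): in a random arrangement of pn 'a's among (1-p)n 'b's, about 2p(1-p)n positions differ from their predecessor, and each such position becomes a nonzero MTF code, so the order-0 stage needs about n·H(2p(1-p)) bits while the string's entropy is only n·H(p).

```python
import math
import random

def binary_entropy(p):
    """H(p) = p*log2(1/p) + (1-p)*log2(1/(1-p)), in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

# Random string with the slide's frequencies: p*n 'a's among (1-p)*n 'b's.
n, p = 1_000_000, 0.001
chars = ['a'] * int(p * n) + ['b'] * (n - int(p * n))
random.shuffle(chars)
# Each position that differs from its predecessor becomes a nonzero MTF code.
transitions = sum(chars[i] != chars[i - 1] for i in range(1, n))

print(transitions, 2 * p * (1 - p) * n)      # observed vs. predicted '1' count
print(n * binary_entropy(2 * p * (1 - p)))   # ~ bits used by MTF + order-0
print(n * binary_entropy(p))                 # ~ entropy n*H0(s)
```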
44. Lower bounds on BWDC, BWRL
- Similar technique.
- An infinitesimally small p gives a highly compressible string.
- So maximize the ratio over p.
- This gives weird constants, but quite strong bounds.
45. Weird Observation
- The same argument works when BWT is replaced by any function f such that:
- f is poly(n)-to-1
- f(s) is a permutation of s
- Thus, BWT is just as bad as any such function! Weird.
- Explanation:
- If A is bad vs. H0, then BWT + A is bad vs. Hk
- If A is good vs. H0, and A is convex, then BWT + A is good vs. Hk
46. Sketch of upper bound
- In KLV we have proved: (formulas omitted)
- Thm (Manzini '99): If A is good vs. H0, and A is convex, then BWT + A is good vs. Hk, for any k
- A is convex if: (definition omitted)
47. Sketch of upper bound
- A = MTF + Order-0 is not convex, so we can't use the theorem.
- Define A' = MTF + PF, where PF is a prefix-free code.
- A' is convex
- We prove that: (bound omitted)
- This finishes the proof.
48. Conclusion
- Seen BWT-based algorithms.
- Intuition why they work well
- Analytical results
- Compared to other types of compression
- Further work
- Find better analytical ways to explain behaviour
- Compare against something other than entropy?
- English text is not Markovian
- Get better BWT compressors
- Modern Applications (indexing,...)
- Get better constants for existing algorithms
49. Thank you!
- And thanks to Haim and Shir!