Title: A Simpler Analysis of Burrows-Wheeler Based Compression

1. A Simpler Analysis of Burrows-Wheeler Based Compression
- Haim Kaplan, Shir Landau, Elad Verbin
2. Our Results
- Improve the bounds of one of the main BWT-based compression algorithms
- New technique for worst-case analysis of BWT-based compression algorithms, using the Local Entropy
- Interesting results concerning compression of integer strings
3. The Burrows-Wheeler Transform (1994)
- Given a string S, the Burrows-Wheeler Transform creates a permutation of S that is locally homogeneous.
- [Diagram: S → BWT → BWT(S), which is locally homogeneous]
4. Empirical Entropy - Intuition
- H0(s): the maximum compression we can get without context information, where a fixed codeword is assigned to each alphabet character (e.g. a Huffman code)
- Hk(s): a lower bound for compression with order-k contexts, where the codeword representing each symbol depends on the k symbols preceding it
- Traditionally, the compression ratio of compression algorithms is measured using Hk(s)
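As a sketch of what these statistics measure, here is a small Python implementation of H0 and Hk. The exact formulas are not spelled out on the slide, so the definitions below are the usual empirical-entropy ones.

```python
import math
from collections import Counter

def h0(s):
    """Order-0 empirical entropy: bits per symbol of the best
    fixed-codeword encoding, ignoring context."""
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())

def hk(s, k):
    """Order-k empirical entropy: H0 of the symbols following each
    length-k context, weighted by how often that context occurs."""
    if k == 0:
        return h0(s)
    followers = {}
    for i in range(len(s) - k):
        followers.setdefault(s[i:i + k], []).append(s[i + k])
    return sum(len(f) * h0(f) for f in followers.values()) / len(s)
```

For example, hk("abababab", 1) is 0: once you know the previous symbol, the next one is determined, so order-1 context makes the string free to encode.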
5. History
- The main Burrows-Wheeler compression algorithm (Burrows, Wheeler 1994)
6. MTF
Given a string S = baacb over the alphabet {a, b, c, d}:

S      = b a a c b
MTF(S) = 1 1 0 2 2
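The move-to-front step above can be sketched in a few lines (a minimal illustration, not the authors' implementation):

```python
def mtf_encode(s, alphabet):
    """Move-to-front: for each symbol, emit its current position in the
    list, then move that symbol to the front of the list."""
    table = list(alphabet)
    out = []
    for ch in s:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))  # move ch to the front
    return out

# mtf_encode("baacb", "abcd") reproduces the slide: [1, 1, 0, 2, 2]
```

Runs of identical symbols become runs of zeros, which is why a locally homogeneous BWT output turns into a sequence of small integers.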
7. Main Bounds (Manzini 1999)
- gk is a constant dependent on the context k and the size of the alphabet
- These are worst-case bounds
8. Now we are ready to begin
9. Some Intuition
- The more similar the contexts are in the original string, the more local similarity its BWT will exhibit
- The more local similarity found in the BWT of the string, the smaller the numbers we get from MTF
- ⇒ We want a statistic that measures local similarity in a string, and specifically in the BWT of the string
- ⇒ The solution: Local Entropy
10. The Local Entropy - Definition
- Given a string s = s1 s2 ... sn, the local entropy of s (Bentley, Sleator, Tarjan, Wei, 86) is
  LE(s) = Σi log2(MTF(s)i + 1)
11. The Local Entropy - Definition
- Note: LE(s) = the number of bits needed to write the MTF sequence in binary.
- Example:
  - MTF(s) = 311
  - ⇒ LE(s) = 4
  - ⇒ MTF(s) in binary: 1111

In a dream world, we would like to compress S down to LE(S) bits.
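Taking LE(s) as the sum of log2(v + 1) over the MTF values v, which matches the example above (2 + 1 + 1 = 4 bits for MTF(s) = 311), a one-function sketch:

```python
import math

def local_entropy(mtf_seq):
    """LE = sum of log2(v + 1) over the MTF values, i.e. roughly the
    number of bits needed to write the MTF sequence in binary (BSTW '86)."""
    return sum(math.log2(v + 1) for v in mtf_seq)

# local_entropy([3, 1, 1]) -> 4.0, as in the slide's example
```

Note that a run of zeros contributes nothing (log2(0 + 1) = 0), so highly locally similar strings have small LE.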
12. The Local Entropy - Properties
- We use two properties of LE:
  - The entropy hierarchy
  - Convexity
13. The Local Entropy - Property 1
- The entropy hierarchy
- We prove: for each k,
  LE(BWT(s)) ≤ nHk(s) + O(1)
- ⇒ Any upper bound that we get for BWT with LE holds for Hk(s) as well.
14. The Local Entropy - Property 2
- Convexity
- ⇒ This means that partitioning a string s does not improve the Local Entropy of s.
15. Convexity
- Cutting the input string into parts doesn't influence LE much: only a bounded number of positions per part are affected
16. Convexity - Why do we need it?
- Ferragina, Giancarlo, Manzini and Sciortino, JACM 2005
17. Using LE and its properties, we get our bounds
- Our LE bound
- Our Hk bound
18. Our bounds
- We get an improvement of the known bounds
- As opposed to the known bounds (Manzini, 1999)
19. Our Test Results
The files are non-binary files from the Canterbury corpus. The gzip results are also taken from the corpus. Sizes are given in bytes.
20. How is LE related to compression of integer sequences?
- We mentioned a dream world, but what about reality?
- How close can we come to LE(S)?
- Problem: compress an integer sequence S to close to its sum of logs,
  SL(S) = Σi log2(Si + 1)
- Notice: for any s, LE(s) = SL(MTF(s))
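A sketch of the sum-of-logs statistic and its link to LE, assuming SL is the sum of log2(si + 1) as stated in the paper's setting:

```python
import math

def sum_of_logs(seq):
    """SL(s) = sum over i of log2(s_i + 1) for an integer sequence s."""
    return sum(math.log2(x + 1) for x in seq)

def mtf(s, alphabet):
    """Move-to-front (as on slide 6), included so the LE/SL link is runnable."""
    table, out = list(alphabet), []
    for ch in s:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

# The link between the two notions: LE(s) = SL(MTF(s)) --
# the local entropy of s is exactly the sum-of-logs of its MTF image.
```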
21. Compressing Integer Sequences
- Universal encodings of integers: prefix-free encodings for the integers (e.g. Fibonacci encoding, Elias encoding)
- Doing some math, it turns out that order-0 encoding is good
- Not only good: it is best!
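As a concrete example of a universal encoding, here is the Elias gamma code (one specific member of the Elias family mentioned above; the choice of gamma here is illustrative, not taken from the slides):

```python
def elias_gamma(x):
    """Elias gamma code for a positive integer x: floor(log2 x) zero
    bits, then x written in binary. Total length 2*floor(log2 x) + 1,
    i.e. about 2 log2(x) bits -- within a constant factor of log2(x+1)."""
    assert x >= 1, "gamma codes are defined for positive integers"
    b = bin(x)[2:]                 # binary representation, no leading zeros
    return "0" * (len(b) - 1) + b

# elias_gamma(1) -> "1", elias_gamma(2) -> "010", elias_gamma(5) -> "00101"
```

The zero-prefix tells the decoder how many bits the binary part has, which is what makes the code prefix-free without knowing the integers' range in advance.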
22. The order-0 math
- Theorem: For any string s of length n over the integer alphabet {1, 2, ..., h} and for any μ > 1, the order-0 encoding uses at most μ·SL(s) plus a per-symbol term depending on μ
- Strange conclusion: we get an upper bound on the order-0 algorithm with a term dependent on the value of the integers
- This is true for all strings, but is especially interesting for strings with smaller integers
23. A lower bound for SL
- Theorem: For any algorithm A, for any μ > 1, and for any C such that C < log(ζ(μ)), there exists a string S of length n for which
  A(S) > μ·SL(S) + C·n
24. Our Results - Summary
- New improved bounds for BWT+MTF compression
- Local Entropy (LE)
- New bounds for compression of integer strings
25. Open Issues
- We question the effectiveness of Hk(s) as a statistic.
- Is there a better statistic?
26. Any Questions?

27. Thank You!
Anybody want to guess??
28. Creating a Huffman encoding
- For each encoding unit (a letter, in this example), associate a frequency (the number of times it occurs)
- Create a binary tree whose children are the encoding units with the smallest frequencies
- The frequency of the root is the sum of the frequencies of the leaves
- Repeat this procedure until all the encoding units are in the binary tree
29. Example
Assume that the relative frequencies are: A: 40, B: 20, C: 10, D: 10, R: 20
30. Example, cont.

31. Example, cont.
- Assign 0 to left branches, 1 to right branches
- Each encoding is a path from the root:
  A = 0, B = 100, C = 1010, D = 1011, R = 11
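The merge procedure from slide 28 can be sketched as below. Note that Huffman ties can be broken in different ways, so this sketch may output a different code than the A = 0, B = 100, ... assignment above, but any such code has the same (optimal) total cost:

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman code from a {symbol: frequency} map by repeatedly
    merging the two lowest-frequency trees (slide 28's procedure)."""
    # Heap items are (frequency, tiebreak, tree); a tree is either a
    # symbol or a (left, right) pair. The tiebreak keeps comparisons
    # on frequencies only.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (t1, t2)))  # merged tree
        count += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")   # 0 on left branches
            walk(tree[1], prefix + "1")   # 1 on right branches
        else:
            codes[tree] = prefix or "0"   # single-symbol edge case
    walk(heap[0][2], "")
    return codes
```

With the frequencies from slide 29 (A 40, B 20, C 10, D 10, R 20), any Huffman code encodes 100 symbols in 220 bits total, matching the cost of the code shown above.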
32. The Burrows-Wheeler Transform (1994)
Given a string S = banana, form all cyclic rotations of S and sort them lexicographically; the BWT is the last column:

banana        abanan
ananab        anaban
nanaba  sort  ananab
anaban  --->  banana
nabana        nabana
abanan        nanaba

BWT(banana) = nnbaaa (the last column of the sorted rotations)
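The rotate-and-sort construction above, as a direct (quadratic-space, illustration-only) sketch:

```python
def bwt(s):
    """Burrows-Wheeler transform via sorted cyclic rotations:
    build every rotation, sort them, take the last column."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

# bwt("banana") -> "nnbaaa", as on the slide
```

Real implementations avoid materializing the rotation matrix by sorting suffixes instead, which is what the next slide is about.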
33. Suffix Arrays and the BWT

The suffix array: 7 6 4 2 1 5 3
Index of BWT:     6 5 3 1 7 4 2
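The two rows are consistent with the example string banana with a sentinel $ appended (an assumption; the slide does not restate the string). A sketch that reproduces both rows:

```python
def suffix_array(s):
    """1-based suffix array: starting positions of the suffixes of s,
    in lexicographically sorted order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])

def bwt_positions(s):
    """For each sorted suffix, the 1-based position of the character
    just before it (wrapping around) -- the slide's 'Index of BWT'."""
    return [(i - 2) % len(s) + 1 for i in suffix_array(s)]

def bwt_from_sa(s):
    """The BWT itself: the characters at those positions."""
    return "".join(s[p - 1] for p in bwt_positions(s))
```

So the BWT can be read off a suffix array directly: sorting the suffixes of s$ sorts the rotations, and the character preceding each suffix is the last column.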