Title: The Burrows-Wheeler Transform: Theory and Practice
1 The Burrows-Wheeler Transform: Theory and Practice
- Article by Giovanni Manzini
- Original algorithm by M. Burrows and D. J. Wheeler
- Lecturer: Eran Vered
2 Overview
- The Burrows-Wheeler transform (bwt).
- Statistical compression overview
- Compressing using bwt
- Analysis of the results of the compression.
3 General
- The bwt permutes the symbols of a text.
- The bwt output can be very easily compressed.
- Used by the compressor bzip2.
4 Calculating bw(s)
- Add an end-of-string symbol to s
- Generate a matrix of all the cyclic shifts of s (with the end-of-string symbol appended)
- Sort the matrix rows in right-to-left lexicographic order
- bw(s) is the first column of the matrix
- The end-of-string symbol is dropped; its location is saved
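The steps above can be sketched in Python as follows. This is a minimal illustration that builds the whole matrix, not the article's implementation. The function name `bw` mirrors the slide, and the end-of-string symbol is assumed here to be the NUL character (`"\0"`), chosen only because it sorts before every other symbol.

```python
def bw(s, eos="\0"):
    """Return (transformed text, 1-based position of the dropped end marker).

    Assumption: `eos` is a stand-in end-of-string symbol that does not occur
    in `s` and compares smaller than every symbol of `s`.
    """
    if eos in s:
        raise ValueError("end-of-string symbol must not occur in s")
    t = s + eos
    # All cyclic shifts of s + eos.
    rows = [t[i:] + t[:i] for i in range(len(t))]
    # Right-to-left lexicographic order: compare the reversed rows.
    rows.sort(key=lambda row: row[::-1])
    first_column = "".join(row[0] for row in rows)
    pos = first_column.index(eos)
    # Drop the end marker and remember where it was.
    return first_column[:pos] + first_column[pos + 1:], pos + 1


print(bw("mississippi"))  # ('msspipissii', 3)
```

Building every cyclic shift costs quadratic space, so this form is only suitable for short strings.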
5 BWT Example
[Figure: the matrix of all cyclic shifts of mississippi with its end-of-string symbol, before and after sorting the rows]
Sorting the rows of the matrix is equivalent to sorting the suffixes of s^r, the reverse of s (ippississim).
bw(s) = (msspipissii, 3)
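The equivalence with suffix sorting suggests a way to compute bw(s) without materializing the matrix. The sketch below sorts the suffixes of the reversed string naively with Python's built-in sort. The function name `bw_by_suffix_sort` is hypothetical, and a real compressor would use a dedicated suffix-sorting algorithm instead.

```python
def bw_by_suffix_sort(s):
    """Compute bw(s) by sorting the suffixes of the reversed string.

    Row j of the matrix (the shift that starts at s[j]) is ranked by the
    reverse of the prefix s[:j], which is the suffix r[n - j:] of r = s^r.
    The empty suffix (j = 0) sorts first, like the row that ends with the
    end-of-string symbol.
    """
    r = s[::-1]
    n = len(s)
    order = sorted(range(n + 1), key=lambda j: r[n - j:])
    # j == n is the row that starts with the end-of-string symbol.
    pos = order.index(n) + 1
    return "".join(s[j] for j in order if j < n), pos


print(bw_by_suffix_sort("mississippi"))  # ('msspipissii', 3)
```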
6 BWT Matrix Properties
F is the first column of the sorted matrix and L is its last column.
- Sorting F gives L
- $s_1 = F_1$
- $F_i$ follows $L_i$ in s
- Equal symbols in L are ordered the same as in F
[Figure: the sorted matrix for mississippi with the columns F and L marked]
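7 Reconstructing s
- Equal symbols appear in the same order in F and in L
[Figure: following the links between L and F recovers s one symbol at a time: m, i, s, s, i]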
8 Reconstructing s
- L = sort(F)
- s = F_1
- j = 1
- for i = 2 to n:
  - a = number of appearances of F_j in F_1, F_2, ..., F_j
  - j = index of the a-th appearance of F_j in L
  - s = s + F_j
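The loop above can be written out as a short Python sketch. It assumes the same stand-in end-of-string symbol (`"\0"`) is reinserted at the saved position before L is computed; the function name `inverse_bw` and the helper tables `rank` and `where` are illustrative choices, not part of the original slides.

```python
def inverse_bw(f, pos, eos="\0"):
    """Rebuild s from bw(s) = (f, pos), following the loop on the slide.

    Assumption: `eos` is the stand-in end-of-string symbol that was dropped
    from (1-based) position `pos` of the first column.
    """
    if not f:
        return ""
    F = f[:pos - 1] + eos + f[pos - 1:]    # first column, end marker restored
    L = sorted(F)                          # last column: L = sort(F)

    # rank[j]: number of appearances of F[j] in F[0..j] (the value "a").
    seen, rank = {}, []
    for c in F:
        seen[c] = seen.get(c, 0) + 1
        rank.append(seen[c])

    # where[(c, a)]: index of the a-th appearance of c in L.
    seen, where = {}, {}
    for idx, c in enumerate(L):
        seen[c] = seen.get(c, 0) + 1
        where[(c, seen[c])] = idx

    out, j = [F[0]], 0                     # s = F_1, j = 1
    for _ in range(len(f) - 1):            # for i = 2 to n
        j = where[(F[j], rank[j])]
        out.append(F[j])
    return "".join(out)


print(inverse_bw("msspipissii", 3))  # mississippi
```

Precomputing the two tables keeps the reconstruction linear in the length of the text after the sort.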
9 What's good about bwt?
- bwt(s) is locally homogeneous
- For every substring w of s, all the symbols following w in s are grouped together.
[Figure: the sorted matrix for mississippi]
- These symbols will usually be homogeneous.
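10 What's good about bwt?
- s = miss_mississippi_misses_miss_missouri
- bw(s) = mmmmmssssss_spiiiiiupii_ssssss_e_ioir
- Groups marked in bw(s): the symbols that follow mi, the symbols that follow _, the symbols that follow mis, and the symbols that follow m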
11 Statistical Compression
- We will discuss lossless statistical compression with the following notation:
- s = input string over the alphabet $\Sigma = \{a_1, a_2, a_3, \ldots, a_h\}$
- $h = |\Sigma|$
- $n = |s|$
- $n_i$ = number of appearances of $a_i$ in s
- $\log x = \log_2 x$
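12 Zeroth Order Encoding
Example code: e → 0, a → 10, c → 111
- Every input symbol is replaced by the same codeword for all its appearances: $a_i \to c_i$
- Kraft's inequality: $\sum_{i=1}^{h} 2^{-|c_i|} \le 1$
- Output size: $\sum_{i=1}^{h} n_i |c_i|$ bits
- Minimum achieved for $|c_i| = -\log(n_i / n)$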
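As a small check of Kraft's inequality and of the output-size formula, the sketch below uses the example code from this slide (e → 0, a → 10, c → 111). The input string `"eeaec"` is a made-up illustration, and the function names are hypothetical.

```python
def kraft_sum(codewords):
    """Sum of 2^(-|c_i|) over all codewords; at most 1 for a prefix code."""
    return sum(2.0 ** -len(c) for c in codewords.values())


def output_size(s, codewords):
    """Number of bits used when every symbol is replaced by its codeword."""
    return sum(len(codewords[ch]) for ch in s)


code = {"e": "0", "a": "10", "c": "111"}
print(kraft_sum(code))             # 0.875
print(output_size("eeaec", code))  # 8
```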
13 Zeroth Order Encoding
- Output size is bounded from below by $|s| H_0(s)$, where $H_0(s) = -\sum_{i=1}^{h} \frac{n_i}{n} \log \frac{n_i}{n}$ is the (zeroth order) empirical entropy of s.
- Compressing a string using Huffman coding or arithmetic coding produces an output whose size is close to $|s| H_0(s)$ bits.
- Specifically: [...]
14 Zeroth Order Entropy Example
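The slide's own worked example did not survive, so the following is a substitute sketch rather than the lecturer's example. It computes the zeroth order empirical entropy, the negated sum of $(n_i/n) \log (n_i/n)$ over the symbols, for the running string mississippi. The function name `h0` is an illustrative choice.

```python
from collections import Counter
from math import log2


def h0(s):
    """Zeroth order empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    # n_i / n is the empirical frequency of each distinct symbol.
    return -sum((c / n) * log2(c / n) for c in Counter(s).values())


print(round(h0("mississippi"), 3))  # 1.823
```

Here i and s appear 4 times each, p twice, and m once, so a zeroth order coder needs roughly 1.82 bits per symbol on this string.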
15 k-th Order Encoding
- The codeword that encodes an input symbol is determined by that symbol and its k preceding symbols.
- Output size is bounded from below by $|s| H_k(s)$ bits, where $H_k(s) = \frac{1}{|s|} \sum_{w \in \Sigma^k} |w_s| H_0(w_s)$ is the k-th order empirical entropy of s.
- $w_s$ = a string containing all the symbols following w in s.
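16 k-th Order Entropy Example
- s = mississippi (k = 1)
- $m_s$ = i → $H_0(i) = 0$
- $i_s$ = ssp → $H_0(ssp) = 0.92$
- $s_s$ = sisi → $H_0(sisi) = 1$
- $p_s$ = pi → $H_0(pi) = 1$
- $H_1(s) = \frac{1}{11}(1 \cdot 0 + 3 \cdot 0.92 + 4 \cdot 1 + 2 \cdot 1) \approx 0.80$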
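The example can be reproduced with a short sketch that collects, for every context w of length k, the string $w_s$ of symbols following w, and then averages $|w_s| H_0(w_s)$ over the length of s. The function names `h0` and `hk` are illustrative and not taken from the article.

```python
from collections import Counter, defaultdict
from math import log2


def h0(s):
    """Zeroth order empirical entropy of a sequence of symbols."""
    n = len(s)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(s).values())


def hk(s, k):
    """k-th order empirical entropy: (1/|s|) * sum of |w_s| * H0(w_s)."""
    if k == 0 or not s:
        return h0(s)
    followers = defaultdict(list)          # context w -> symbols following w
    for i in range(len(s) - k):
        followers[s[i:i + k]].append(s[i + k])
    return sum(len(ws) * h0(ws) for ws in followers.values()) / len(s)


print(round(hk("mississippi", 1), 3))  # 0.796
```

For this string the first order entropy (about 0.80 bits per symbol) is well below the zeroth order entropy, which is what a context-aware coder can exploit.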
17 k-th Order Encoding and bwt
- After applying bwt, for every substring w of s,
all the symbols following w in s are grouped
together
- Did we get an optimal k-th order compressor?
- Local homogeneity instead of global homogeneity.
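18 k-th Order Encoding and bwt
- For example:
- s = ababababababab... (ab repeated 10 times)
- bwt(s) = abbbbbbbbbbaaaaaaaaa = $w_1 w_2 w_3$, where $w_1$ = a follows the end-of-string symbol, $w_2$ = bbbbbbbbbb follows a, and $w_3$ = aaaaaaaaa follows b
- $H_1(s) = 0$ (since $w_a = bbb\ldots$ and $w_b = aaa\ldots$) and $H_0(w_i) = 0$ for every i, but $H_0(w_1 w_2 w_3) = H_0(s) = 1$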
19 Compressing bwt
20 MoveToFront Compression
- Every input symbol is encoded by the number of distinct symbols that occurred since the last appearance of this symbol.
- Implemented using a list of symbols sorted by recency of usage.
- Output contains a lot of small numbers if the text is locally homogeneous.
- ⇒ Transforms local homogeneity into global homogeneity.
21 MoveToFront Compression
- $\Sigma = \{d, e, h, l, o, r, w\}$
- s = h e l l o w o r l d
- mtf-list before each of the first seven symbols, with the mtf(s) output for that symbol:
d, e, h, l, o, r, w    h → 2
h, d, e, l, o, r, w    e → 2
e, h, d, l, o, r, w    l → 3
l, e, h, d, o, r, w    l → 0
l, e, h, d, o, r, w    o → 4
o, l, e, h, d, r, w    w → 6
w, o, l, e, h, d, r    o → 1
- Initial list may be either:
- Ordered alphabetically
- Symbols in order of appearance in the string (need to add it to the output)
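The list-based procedure can be sketched in a few lines of Python. This version assumes the initial list is passed in explicitly (here, the alphabetically ordered alphabet); the function name `mtf` follows the slides, while the parameter name `alphabet` is an illustrative choice.

```python
def mtf(s, alphabet):
    """Move-to-front encode s, starting from the given initial list."""
    lst = list(alphabet)
    out = []
    for ch in s:
        i = lst.index(ch)            # symbols in front of ch: all more recent
        out.append(i)
        lst.insert(0, lst.pop(i))    # move ch to the front of the list
    return out


print(mtf("helloworld", "dehlorw"))
# [2, 2, 3, 0, 4, 6, 1, 6, 3, 6]

# On a locally homogeneous input almost every output symbol is 0:
print(mtf("abbbbbbbbbbaaaaaaaaa", "ab").count(0))  # 18
```

The linear `index` scan is fine for small alphabets, which is the usual situation for byte-oriented text.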
22 bwt0 Compression
- bwt0(s) = arit( mtf( bw(s) ) )
Theorem 1. For any k (h = size of the alphabet): [...]
23 Notations
- $x' = mtf(x)$
- For a string w over {0, 1, 2, ..., m} define:
- $w^{01}$ = w with all the non-zeros replaced by 1.
- $x'^{01}$ = x' with all the non-zeros replaced by 1.
- Note bwt(x) x mtf(x)x
24 Theorem 1 - Proof
Lemma 1. Let $s = s_1 s_2 \cdots s_t$ and $s' = mtf(s)$. Then [...]
25 Theorem 1 - Proof
- bw(s) can be partitioned into at most $h^k$ substrings $w_1, w_2, \ldots, w_l$ such that $\sum_{i=1}^{l} |w_i| H_0(w_i) = |s| H_k(s)$
- Using the bound on the output of Arit
26 Lemma 1 - Proof
Let $s = s_1 s_2 \cdots s_t$ and $s' = mtf(s)$. Then [...]
- Encoding of s':
- For each symbol: is it 0 or not?
- For non-zeros: encode one of 1, 2, 3, ..., h-1
- Note: ignoring some inter-substring problems.
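27 Encoding non-zeros of s'
- Use a prefix code ($i \to c_i$): $s' \to pcnz(s')$
- $c_1 = 10$
- $c_2 = 11$
- $c_i = 0 \cdots 0\, B(i+1)$ for $i > 2$: $|B(i+1)| - 2$ zeros followed by $B(i+1)$, the binary representation of i+1
- $|c_i| \le 2 \log(i+1)$ (with $|c_0| = 0$)
- $m_i$ = number of occurrences of i in s'.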
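The prefix code is easy to try out. The sketch below assumes B(i+1) denotes the binary representation of i+1, which is consistent with $c_1 = 10$ and $c_2 = 11$. The function names `codeword` and `pcnz` are illustrative, with `pcnz` taken from the slide's notation.

```python
from math import log2


def codeword(i):
    """c_i for i >= 1: (|B(i+1)| - 2) zeros followed by B(i+1)."""
    if i < 1:
        raise ValueError("only non-zero values get a codeword")
    b = bin(i + 1)[2:]               # B(i+1), assumed to be i+1 in binary
    return "0" * (len(b) - 2) + b


def pcnz(values):
    """Concatenate the codewords of the non-zero values, skipping zeros."""
    return "".join(codeword(v) for v in values if v != 0)


print(codeword(1), codeword(2), codeword(3))  # 10 11 0100
print(pcnz([2, 0, 1, 3]))                     # 11100100

# The length bound |c_i| <= 2 log(i+1) from the slide, on a few values:
for i in (1, 2, 3, 7, 100):
    assert len(codeword(i)) <= 2 * log2(i + 1)
```

The number of leading zeros announces how long the binary part is, which is what makes the code decodable without separators.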
28 Encoding non-zeros of s' = mtf(s)
For any string s: [...] (sum over all symbols of s)
Proof: let $N_a$ be the number of occurrences of symbol a in s, at positions $p_1, p_2, \ldots, p_{N_a}$.
29 Encoding non-zeros of s'
$s = s_1 s_2 \cdots s_t$
- Summing over all substrings: [...]
30 Encoding of s'
- For non-zeros: encode one of 1, 2, 3, ..., h-1
- ⇒ No more than [...] bits
- For each symbol: is it 0 or not?
- ⇒ Encode $s'^{01}$
31 Encoding $s'^{01}$
- If for every $s_i'^{01}$ the number of 0s is at least as large as the number of 1s: [...]
and [...]
It follows that [...]
32 Encoding $s'^{01}$ (second case)
- If $s_i'^{01}$ has more 1s than 0s for i = 1, 2, ..., l
If there are more 1s than 0s in $s_i'^{01}$, then [...]
It follows that [...]
33 Encoding of s'
- For non-zeros: encode one of 1, 2, 3, ..., h-1
- ⇒ No more than [...] bits
- For each symbol: is it 0 or not? (Encode $s'^{01}$)
- ⇒ No more than [...] bits
- Total (after fixing some inaccuracies):
- ⇒ No more than [...] bits
34 Improvement
- Use run-length encoding (RLE)
- bw0RL(s) = arit( rle( mtf( bw(s) ) ) )
- Better performance
- Better theoretical bound
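The slides do not say which run-length scheme is used, so the following is only one simple variant chosen for illustration: each run of zeros in the move-to-front output is collapsed into a `(0, length)` pair, and non-zero values pass through unchanged. The function name `rle_zeros` and the pair representation are assumptions, not the article's definition of rle.

```python
from itertools import groupby


def rle_zeros(values):
    """Collapse each run of zeros into (0, run_length); keep other values."""
    out = []
    for v, run in groupby(values):
        n = len(list(run))
        if v == 0:
            out.append((0, n))
        else:
            out.extend([v] * n)
    return out


# Move-to-front output for abbbbbbbbbbaaaaaaaaa with initial list a, b:
mtf_output = [0, 1] + [0] * 9 + [1] + [0] * 8
print(rle_zeros(mtf_output))  # [(0, 1), 1, (0, 9), 1, (0, 8)]
```

Long zero runs are exactly what move-to-front produces on a locally homogeneous input, which is why a run-length stage before the final coder helps.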
35 Notes
- Compressor implementation:
- Use blocks of text. Sort using one of:
- Compact suffix trees (long average LCP)
- Suffix arrays (medium average LCP)
- General string sorter (short average LCP)
- Search in a compressed text: extract the suffix array from bwt(s).
- Empirical results