Title: BWT-Based Compression Algorithms
1. BWT-Based Compression Algorithms
- Elad Verbin
- Tel-Aviv University
Based on joint work with Haim Kaplan and Shir
Landau
Presented at ITCS, Tsinghua University, Sept.
14th, 2007
2. Talk Outline
- Part I: Introduction to BWT-based compression
- The BWT and its properties
- BWT-based compression
- Part II: Measuring efficiency of compression algorithms
- Empirical evaluation vs. worst-case analysis
- Lower and upper bounds
- Part III: Proving the lower bound
- Part IV: A new research direction
3. Part I: Introduction to BWT-Based Compression
4. Lossless text compression
- Three main classes:
- Dictionary-based (Lempel-Ziv, etc.)
- Statistical coders (PPM, etc.)
- BWT-based (bzip2, etc.)
- Currently, statistical coders achieve the best compression ratios - will this change?
- Advantage of BWT-based compressors: they work out-of-the-box and are very fast
5. Empirical Comparison
- Alice.txt (file size 152K), taken from the Canterbury Corpus website
- bzip2: achieves compression close to statistical coders, with running time close to gzip.
6. Algorithm BW0
7. BW0: Burrows-Wheeler Compression
Text in English (similar contexts -> similar character): mississippi
--BWT-->
Text with spikes (close repetitions): ipssmpissii
--Move-to-front-->
Integer string with small numbers: 0,0,0,0,0,2,4,3,0,1,0
--Arithmetic coding-->
Compressed text: 01000101010100010101010
8. The BWT
- Invented by Burrows and Wheeler ('94)
- Analogous to the Fourier Transform (smooth!) [Fenwick]
string with context-regularity: mississippi
--BWT-->
string with spikes (close repetitions): ipssmpissii
9. The BWT
T = mississippi
The BWT matrix consists of all cyclic rotations of T, sorted; F is its first column and L = BWT(T) is its last column. The rotations of T:
mississippi
ississippim
ssissippimi
sissippimis
issippimiss
ssippimissi
sippimissis
ippimississ
ppimississi
pimississip
imississipp
BWT sorts the characters by their post-context.
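For concreteness, here is a minimal Python sketch of the transform by sorting cyclic rotations; the function name `bwt` and the explicit '$' end marker are my additions (the slide's example leaves the marker implicit), so treat this as an illustration rather than the deck's own code.

```python
def bwt(text, end_marker="$"):
    """Burrows-Wheeler Transform: sort all cyclic rotations of text + marker
    and return the last column. The end marker makes the transform uniquely
    invertible; the slide's example leaves it implicit."""
    s = text + end_marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))  # rows of the BWT matrix
    return "".join(row[-1] for row in rotations)              # last column L

print(bwt("mississippi"))  # -> 'ipssm$pissii' (the slide's 'ipssmpissii' plus the marker)
```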
10. The BWT
- Invertible! (see the sketch below)
- Linear-time algorithms exist both for computing it and for inverting it
- Useful for compression
- Facts:
- BWT permutes the input text
- BWT is an (n+1)-to-1 function
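As an illustration of invertibility, a short sketch of inversion via the LF-mapping follows; it runs in O(n log n) rather than the linear time mentioned above, and the function name is mine.

```python
def inverse_bwt(last_column, end_marker="$"):
    """Invert the BWT via the LF-mapping: a simple O(n log n) sketch,
    not the linear-time algorithms the slide refers to."""
    n = len(last_column)
    # Stable-sorting the positions of L yields F: lf[j] is the position in L
    # holding the same occurrence of the character that starts row j (column F).
    lf = sorted(range(n), key=lambda i: last_column[i])
    row = last_column.index(end_marker)   # the rotation equal to text + marker
    out = []
    for _ in range(n):
        row = lf[row]
        out.append(last_column[row])
    return "".join(out)[:-1]              # drop the trailing end marker

print(inverse_bwt("ipssm$pissii"))  # -> 'mississippi'
```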
11. BWT useful for compression
- Suppose the text contains 1000 occurrences of "swimming", 100 of "fighting", and no other words containing "ing".
- So the output of BWT has a long run of 'm's, interspersed with some 't's
- BWT turns context-regularity into local uniformity (closely-occurring character repetitions)
12. Move To Front
- By Bentley, Sleator, Tarjan and Wei ('86)
string with spikes (close repetitions): ipssmpissii
--move-to-front-->
integer string with small numbers: 0,0,0,0,0,2,4,3,0,1,0
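A minimal move-to-front sketch in Python; the function name is mine, and since the slides do not specify the initial list, exact codes can differ from the illustrative numbers above.

```python
def mtf_encode(s, initial_table=None):
    """Move-to-front: output each character's current index in a list,
    then move that character to the front of the list."""
    # The slides don't specify the initial list; here it is the sorted
    # alphabet of the input, so exact codes may differ from the slides'.
    table = sorted(set(s)) if initial_table is None else list(initial_table)
    out = []
    for ch in s:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))   # move ch to the front
    return out

print(mtf_encode("ipssmpissii"))  # -> [0, 2, 3, 0, 3, 2, 3, 3, 0, 1, 0]
```

Note how repeated characters produce 0s: close repetitions in the BWT output become small numbers.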
13-20. Move to Front (step-by-step animation of the move-to-front list; figures omitted)
21. After MTF
- Now we have a string with small numbers: lots of 0s, many 1s, ...
- Skewed frequencies: run Arithmetic coding!
(Chart: character frequencies)
22. Order-0 encoding
- Huffman Encoding
- Arithmetic Encoding
- These work well when character frequencies are
skewed
23. Summary of BW0
24. BW0
- The main Burrows-Wheeler compression algorithm:
String S: text in English (similar contexts -> similar character)
--BWT (Burrows-Wheeler Transform)-->
Text with local uniformity
--MTF (Move-to-front)-->
Integer string with many small numbers
--Order-0 Encoding-->
Compressed string S'
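Putting the three stages together, here is a self-contained sketch of the pipeline; the arithmetic coder is replaced by an order-0 entropy estimate of the output size, and all names are illustrative rather than taken from the slides.

```python
import math
from collections import Counter

def bw0_size_estimate(text, end_marker="$"):
    """Sketch of the BW0 pipeline: BWT -> MTF -> order-0 stage.
    The order-0 coder is replaced by an entropy estimate of its output size."""
    s = text + end_marker
    n = len(s)
    # 1. BWT: last column of the lexicographically sorted cyclic rotations.
    last = "".join(row[-1] for row in sorted(s[i:] + s[:i] for i in range(n)))
    # 2. MTF: each character becomes its index in a self-adjusting list.
    table = sorted(set(last))
    codes = []
    for ch in last:
        i = table.index(ch)
        codes.append(i)
        table.insert(0, table.pop(i))
    # 3. Order-0 stage: about n * H0(codes) bits (what an arithmetic coder needs).
    counts = Counter(codes)
    h0 = sum((c / n) * math.log2(n / c) for c in counts.values())
    return n * h0

print(bw0_size_estimate("mississippi"))  # estimated output size in bits
```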
25. BWRL (e.g. bzip)
26. Many more BWT-based algorithms
- BW0 is the basic algorithm, but there exist many others [Abel, Deorowicz, Fenwick]
- BWRL: runs RLE after MTF
- BWDC: encodes using distance coding instead of MTF
- Booster-based [Ferragina-Giancarlo-Manzini-Sciortino]: completely different.
27. A dirty little secret
- Before compressing, bzip2 partitions the text into chunks of 900K and compresses each one separately
- This seems horrible. Can you get rid of this?
- Streaming algorithms?
- Difficult algorithmic problem: BWT is hard to divide-and-conquer; the chunks are highly dependent.
- gzip and PPM partition too.
28. Part II: Measuring Efficiency of Compression Algorithms
29. Measuring Compression
- Three approaches:
- Empirical
- Model-based
- Worst-case
30. Worst-Case Bounds for Compression
- Need a statistic to measure against.
- We can't compress all texts; the chosen statistic measures the compressibility of the text.
- Examples:
- H0 (order-0 entropy)
- Hk (order-k entropy)
- Kolmogorov Complexity
- Smallest Grammar that Generates the text
31. order-0 entropy
- Lower bound for compression without context information
S = ACABBA
1/2 of the characters are A's, each represented by 1 bit
1/3 are B's, each represented by log(3) bits
1/6 are C's, each represented by log(6) bits
6·H0(S) = 3·1 + 2·log(3) + 1·log(6)
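A small sketch that computes the empirical order-0 entropy and reproduces the numbers in this example; the function name is my own choice.

```python
import math
from collections import Counter

def order0_entropy(s):
    """Empirical order-0 entropy in bits per character:
    H0(s) = sum over characters c of (n_c / n) * log2(n / n_c)."""
    n = len(s)
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())

s = "ACABBA"
print(order0_entropy(s))      # (1/2)*1 + (1/3)*log2(3) + (1/6)*log2(6)
print(6 * order0_entropy(s))  # = 3*1 + 2*log2(3) + 1*log2(6), as on the slide
```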
32. Order-0 Compressors
- Used when we need to compress texts where the character frequencies are skewed
- Huffman Encoding:
- Huff(s) ≤ nH0(s) + n
- Arithmetic Encoding:
- Arith(s) ≤ nH0(s) + O(log n)
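To make the Huffman bound concrete, here is a sketch (names mine) that computes the total Huffman output length and checks it against Huff(s) ≤ nH0(s) + n on the earlier example; it uses the standard trick that each heap merge charges one extra bit to every symbol occurrence below it.

```python
import heapq
import math
from collections import Counter

def huffman_bits(s):
    """Total bits used by a Huffman code for s (codebook not counted)."""
    counts = list(Counter(s).values())
    if len(counts) == 1:
        return len(s)                  # degenerate case: one distinct character
    heapq.heapify(counts)
    total = 0
    while len(counts) > 1:
        a, b = heapq.heappop(counts), heapq.heappop(counts)
        total += a + b                 # each merge adds one bit per symbol below it
        heapq.heappush(counts, a + b)
    return total

def order0_entropy(s):
    n = len(s)
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())

s = "ACABBA"
print(huffman_bits(s), "<=", len(s) * order0_entropy(s) + len(s))  # Huff(s) <= nH0(s) + n
```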
33. order-k entropy
- Lower bound for compression
with order-k contexts
34. order-k entropy
mississippi
Context for "i": mssp
Context for "s": isis
Context for "p": ip
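The sketch below computes the empirical order-k entropy using the standard definition Hk(s) = (1/n) Σ_w |s_w| H0(s_w), grouping each character by the k characters that follow it (the post-context convention used with the BWT); the grouping reproduces this slide's example for k = 1. Function names are mine, and boundary handling (the last k characters have no full post-context) is a simplification.

```python
import math
from collections import Counter, defaultdict

def order0_entropy(s):
    n = len(s)
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values()) if n else 0.0

def orderk_entropy(s, k):
    """Empirical order-k entropy: H_k(s) = (1/n) * sum over length-k contexts w
    of |s_w| * H0(s_w), where s_w collects the characters whose k following
    characters equal w. The last k characters (no full post-context) are skipped."""
    n = len(s)
    buckets = defaultdict(list)
    for i in range(n - k):
        buckets[s[i + 1:i + 1 + k]].append(s[i])
    return sum(len(b) * order0_entropy(b) for b in buckets.values()) / n

s = "mississippi"
print(orderk_entropy(s, 1))  # groups: s_"i" = "mssp", s_"s" = "isis", s_"p" = "ip"
```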
35. Measuring against Hk
- When performing worst-case analysis of lossless text compressors, we usually measure against Hk
- The goal: a bound of the form
- A(s) ≤ a·nHk(s) + lower-order term
- Optimal: A(s) ≤ nHk(s) + lower-order term
36. Known Bounds
37. Known Bounds
38. Known Bounds
39. Known Bounds
Surprising!! Since BWT-based compressors work better than gzip in practice! Note: on Markov sources, gzip indeed works better.
40. Part III: Proof of Lower Bound
41. Lower bound
- Wish to analyze BW0 = BWT + MTF + Order-0
- Need to show an s for which BW0 compresses poorly relative to the entropy of s
- Consider a string s with 10^3 'a's and 10^6 'b's
- Entropy of s: about n·H(p) bits, where p = 10^3/(10^6 + 10^3) and H is the binary entropy function
- BWT(s) has the same character frequencies
- MTF(BWT(s)) has about 2·10^3 '1's and 10^6 - 10^3 '0's
- Compressed size: about n·H(2·10^3/n) bits, noticeably larger than the entropy
- For this we need BWT(s) to have many isolated 'a's
42. many isolated 'a's
- Goal: find s such that in s' = BWT(s), many 'a's are isolated
- Solution: probabilistic.
- BWT is an (n+1)-to-1 function.
- So a random string has a chance of at least 1/(n+1) of being a BWT-image
- A random string (with these character frequencies) has a very high chance of having many isolated 'a's
- Therefore, such a string exists.
43. General Calculation
- s contains pn 'a's and (1-p)n 'b's.
- Entropy of s: nH0(s) = n·H(p), where H is the binary entropy function
- MTF(BWT(s)) contains about 2p(1-p)n '1's, the rest '0's
- Compressed size of MTF(BWT(s)): about n·H(2p(1-p)) bits
- Ratio: H(2p(1-p)) / H(p)
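A small experiment (my own, checking only the counting step, not the BWT-image part of the proof): in a random arrangement of pn 'a's among (1-p)n 'b's, about 2p(1-p)n positions differ from their predecessor, and each such position becomes a nonzero MTF code, so the order-0 stage needs about n·H(2p(1-p)) bits while the string's entropy is only n·H(p).

```python
import math
import random

def binary_entropy(p):
    """H(p) = p*log2(1/p) + (1-p)*log2(1/(1-p)), in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

# Random string with the slide's frequencies: p*n 'a's among (1-p)*n 'b's.
n, p = 1_000_000, 0.001
chars = ['a'] * int(p * n) + ['b'] * (n - int(p * n))
random.shuffle(chars)
# Each position that differs from its predecessor becomes a nonzero MTF code.
transitions = sum(chars[i] != chars[i - 1] for i in range(1, n))

print(transitions, 2 * p * (1 - p) * n)      # observed vs. predicted '1' count
print(n * binary_entropy(2 * p * (1 - p)))   # ~ bits used by MTF + order-0
print(n * binary_entropy(p))                 # ~ entropy n*H0(s)
```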
44. Lower bounds on BWDC, BWRL
- Similar technique.
- An infinitesimally small p gives a highly compressible string.
- So maximize the ratio over p.
- This gives weird constants, but quite strong bounds.
45. Weird Observation
- The same argument works when BWT is replaced by any function f such that:
- f is poly(n)-to-1
- f(s) is a permutation of s
- Thus, BWT is just as bad as any such function! Weird.
- Explanation:
- If A is bad vs. H0, then BWT + A is bad vs. Hk
- If A is good vs. H0, and A is convex, then BWT + A is good vs. Hk
46. Sketch of upper bound
- In KLV we have proved: (formulas omitted)
- Thm (Manzini '99): If A is good vs. H0, and A is convex, then BWT + A is good vs. Hk, for any k
- A is convex if: (definition omitted)
47. Sketch of upper bound
- A = MTF + Order-0 is not convex, so we can't use the theorem.
- Define A' = MTF + PF, where PF is a prefix-free code.
- A' is convex
- We prove that: (bound omitted)
- This finishes the proof.
48. Conclusion
- Seen BWT-based algorithms.
- Intuition why they work well
- Analytical results
- Compared to other types of compression
- Further work
- Find better analytical ways to explain behaviour
- Compare against something other than entropy?
- English text is not Markovian
- Get better BWT compressors
- Modern Applications (indexing,...)
- Get better constants for existing algorithms
49. Thank you!
- And thanks to Haim and Shir!