1
BWT-Based Compression Algorithms
  • Elad Verbin
  • Tel-Aviv University

Based on joint work with Haim Kaplan and Shir
Landau
Presented at ITCS, Tsinghua University, Sept.
14th, 2007
2
Talk Outline
  • Part I: Introduction to BWT-based compression
  • The BWT and its properties
  • BWT-based compression
  • Part II: Measuring efficiency of compression algorithms
  • Empirical evaluation vs. worst-case analysis
  • Lower and upper bounds
  • Part III: Proving the Lower Bound
  • Part IV: A new research direction

3
Part I: Introduction to BWT-Based Compression
4
Lossless text compression
  • Three main classes
  • Dictionary-based (Lempel-Ziv etc.)
  • Statistical Coders (PPM, etc.)
  • BWT-Based (bzip2, etc.)
  • Currently, statistical coders achieve the best compression ratios
  • Will this change?
  • Advantage of BWT-based: they work out of the box and are very fast

5
Empirical Comparison
  • Alice.txt (filesize 152K)

taken from the Canterbury Corpus website
  • bzip2 achieves compression close to statistical coders, with running time close to gzip.
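For a hands-on version of this comparison, the sketch below uses Python's standard-library zlib (the algorithm behind gzip) and bz2 (BWT-based). The file name alice29.txt (the Canterbury Corpus name for this file) is an assumption of the sketch: point it at any local text file, and expect the exact sizes and timings to differ from the slide's figures.

```python
# A small sketch comparing a dictionary-based coder (zlib, as in gzip)
# with the BWT-based bz2, both from the Python standard library.
# NOTE: "alice29.txt" is an assumed local file name, not part of the slides.
import bz2
import time
import zlib

def report(path: str) -> None:
    with open(path, "rb") as f:
        data = f.read()
    for name, compress in [("zlib (gzip-like)", lambda d: zlib.compress(d, 9)),
                           ("bz2 (BWT-based) ", lambda d: bz2.compress(d, 9))]:
        start = time.perf_counter()
        out = compress(data)
        elapsed = time.perf_counter() - start
        print(f"{name}  {len(out):8d} bytes  ({elapsed * 1000:.1f} ms)")

if __name__ == "__main__":
    report("alice29.txt")
```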

6
Algorithm BW0
  • BW0: BWT → MTF → Arithmetic coding

7
BW0: Burrows-Wheeler Compression
Text in English (similar contexts → similar characters)
mississippi
BWT
Text with spikes (close repetitions)
ipssmpissii
Move-to-front
Integer string with small numbers
0,0,0,0,0,2,4,3,0,1,0
Arithmetic
Compressed text
01000101010100010101010
8
The BWT
  • Invented by Burrows and Wheeler ('94)
  • Analogous to Fourier Transform (smooth!)

Fenwick
string with context-regularity
mississippi
BWT
ipssmpissii
string with spikes (close repetitions)
9
The BWT
T = mississippi

Cyclic rotations of T (the rows of the BWT matrix; F = first column, L = BWT(T) = last column once the rows are sorted):
mississippi
ississippim
ssissippimi
sissippimis
issippimiss
ssippimissi
sippimissis
ippimississ
ppimississi
pimississip
imississipp
BWT sorts the characters by their post-context
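To make the transform concrete, here is a minimal Python sketch that builds the BWT by sorting rotations. The '$' end-of-string marker is an assumption of this sketch (the slide omits it); with the marker, the last column for mississippi comes out as ipssm$pissii, i.e. the slide's ipssmpissii once the marker is dropped.

```python
# A minimal sketch of the BWT via sorted rotations (not the linear-time
# suffix-array construction mentioned on the next slide).
# ASSUMPTION: a unique '$' end-of-string marker is appended.

def bwt(text: str, eos: str = "$") -> str:
    s = text + eos
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    rotations.sort()                              # sorting rotations = sorting characters by post-context
    return "".join(rot[-1] for rot in rotations)  # last column L

if __name__ == "__main__":
    print(bwt("mississippi"))                     # -> ipssm$pissii
```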
10
The BWT
  • Invertible!
  • Linear-time algorithms for computing and for
    inverting
  • Useful for compression
  • Facts
  • BWT permutes the input text
  • BWT is an (n+1)-to-1 function
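The slide notes that BWT is invertible in linear time; below is only a minimal, quadratic-time sketch of the classical table-reconstruction inversion, assuming the same '$' marker convention as the previous sketch, not the linear-time method the slide refers to.

```python
# A minimal (quadratic) sketch of inverting the BWT by repeated re-sorting.
# ASSUMPTION: the input was produced with a unique '$' terminator.

def inverse_bwt(last_column: str, eos: str = "$") -> str:
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        # Prepend L and re-sort: after k rounds each row holds the
        # first k characters of one sorted rotation.
        table = sorted(ch + row for ch, row in zip(last_column, table))
    original = next(row for row in table if row.endswith(eos))
    return original.rstrip(eos)

if __name__ == "__main__":
    print(inverse_bwt("ipssm$pissii"))   # -> mississippi
```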

11
BWT useful for compression
  • Suppose the text contains 1000 occurrences of "swimming", 100 of "fighting", and no other words containing "ing"
  • So the characters preceding "ing" form a long run of m's, interspersed with some t's
  • BWT turns context-regularity into local uniformity (closely-occurring character repetitions)

(figure: output of BWT)
12
Move To Front
  • By Bentley, Sleator, Tarjan and Wei ('86)

string with spikes (close repetitions)
ipssmpissii
move-to-front
0,0,0,0,0,2,4,3,0,1,0
integer string with small numbers
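A minimal sketch of MTF encoding follows. The slides do not fix the initial ordering of the symbol table, so the exact integers printed below differ from the illustrative 0,0,0,0,0,2,4,3,0,1,0 above; the point is the same: symbols that repeat close together map to small numbers.

```python
# A minimal sketch of Move-to-Front encoding.
# ASSUMPTION: the table starts as the distinct input characters in sorted
# order (the slides leave the initial ordering unspecified).

def mtf_encode(text: str) -> list:
    table = sorted(set(text))
    output = []
    for ch in text:
        i = table.index(ch)              # position of ch in the current table
        output.append(i)
        table.insert(0, table.pop(i))    # move ch to the front
    return output

if __name__ == "__main__":
    print(mtf_encode("ipssmpissii"))     # e.g. [0, 2, 3, 0, 3, 2, 3, 3, 0, 1, 0]
```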
13-20
Move to Front
(animation: MTF encoding of ipssmpissii, one step per slide)
21
After MTF
  • Now we have a string with small numbers: lots of 0's, many 1's, ...
  • Skewed frequencies: run arithmetic coding!

(figure: character frequencies)
22
Order-0 encoding
  • Huffman Encoding
  • Arithmetic Encoding
  • These work well when character frequencies are
    skewed
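To make the order-0 step concrete, below is a minimal Huffman sketch that computes optimal prefix-code lengths from symbol frequencies. bzip2's actual entropy coder is more elaborate, and arithmetic coding gets slightly closer to the entropy, so treat this only as an illustration of why skewed frequencies compress well.

```python
# A minimal Huffman-coding sketch (code lengths only), used here purely to
# illustrate order-0 coding of a skewed symbol distribution.
import heapq
from collections import Counter

def huffman_lengths(symbols) -> dict:
    """Return {symbol: code length in bits} of an optimal prefix code."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {s: 1 for s in freq}                        # degenerate one-symbol case
    # Heap entries: (subtree weight, tie-breaker, {symbol: depth in subtree}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in (a | b).items()}    # both subtrees one level deeper
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

if __name__ == "__main__":
    mtf_output = [0, 0, 0, 0, 0, 2, 4, 3, 0, 1, 0]         # the slides' example
    lengths = huffman_lengths(mtf_output)
    print(lengths, sum(lengths[s] for s in mtf_output), "bits")
```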

23
Summary of BW0
  • BW0: BWT → MTF → Arithmetic coding

24
BW0
  • The Main Burrows-Wheeler Compression Algorithm

String S: text in English (similar contexts → similar characters)
→ BWT (Burrows-Wheeler Transform): text with local uniformity
→ MTF (Move-to-front): integer string with many small numbers
→ Order-0 Encoding: compressed string S
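Putting the three stages together: a minimal end-to-end sketch of BW0 in which the arithmetic-coding stage is replaced by its idealized cost, n·H0 of the MTF output. The printed numbers are therefore estimates of the compressed size, not the output of a real coder such as bzip2.

```python
# A minimal end-to-end sketch of the BW0 pipeline (BWT -> MTF -> order-0).
# ASSUMPTION: the order-0 stage is modelled by its idealised cost n*H0.
from collections import Counter
from math import log2

def bwt(text: str, eos: str = "$") -> str:
    s = text + eos
    return "".join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))

def mtf_encode(text: str) -> list:
    table, out = sorted(set(text)), []
    for ch in text:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

def order0_bits(symbols) -> float:
    """Idealised order-0 cost: n * H0(symbols), in bits."""
    n = len(symbols)
    return sum(c * log2(n / c) for c in Counter(symbols).values())

if __name__ == "__main__":
    t = "mississippi" * 50
    print("order-0 on raw text:", round(order0_bits(t)), "bits")
    print("BW0 estimate       :", round(order0_bits(mtf_encode(bwt(t)))), "bits")
```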
25
BWRL (e.g. bzip)
26
Many more BWT-based algorithms
  • BW0 is the basic algorithm, but there exist many others [Abel, Deorowicz, Fenwick]
  • BWRL: runs RLE after MTF
  • BWDC: encodes using distance coding instead of MTF
  • Booster-based [Ferragina-Giancarlo-Manzini-Sciortino]: completely different

27
A dirty little secret
  • Before compressing, bzip2 partitions the text
    into chunks of 900K, and compresses each one
    separately
  • This seems horrible. Can you get rid of this?
  • Streaming algorithms?
  • Difficult algorithmic problem: BWT is hard to divide-and-conquer; chunks are highly dependent.
  • gzip and PPM partition too.

28
Part II: Measuring Efficiency of Compression Algorithms
29
Measuring Compression
  • 3 Approaches
  • Empirical
  • Model-Based
  • Worst-Case

30
Worst-Case Bounds for Compression
  • Need a statistic to measure against.
  • Can't compress all texts; the chosen statistic measures the compressibility of the text.
  • Examples
  • H0 (order-0 entropy)
  • Hk (order-k entropy)
  • Kolmogorov Complexity
  • Smallest Grammar that Generates the text

31
order-0 entropy
  • Lower bound for compression without context
    information

S = ACABBA (n = 6)
1/2 of the characters are A's: each represented by log(2) = 1 bit
1/3 are B's: each represented by log(3) bits
1/6 are C's: each represented by log(6) bits
6·H0(S) = 3·1 + 2·log(3) + 1·log(6)
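A quick check of the arithmetic above, computing n·H0 directly from the character counts:

```python
# Check of the slide's example: 6*H0("ACABBA") = 3 + 2*log2(3) + log2(6) ≈ 8.755 bits.
from collections import Counter
from math import log2

S = "ACABBA"
n = len(S)
nH0 = sum(c * log2(n / c) for c in Counter(S).values())
print(nH0)   # ≈ 8.755
```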
32
Order-0 Compressors
  • Used when we need to compress texts where the character frequencies are skewed
  • Huffman Encoding: Huff(s) ≤ n·H0(s) + n
  • Arithmetic Encoding: Arith(s) ≤ n·H0(s) + O(log n)

33
order-k entropy
  • Lower bound for compression
    with order-k contexts

34
order-k entropy
mississippi
Context for i (the characters followed by an i): m, s, s, p
Context for s: i, s, i, s
Context for p: i, p
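The slides leave the formula for Hk to a figure; the sketch below follows the standard empirical definition, grouping each character by the k characters that follow it (its post-context, as in the example above) and summing the order-0 costs of the groups. Boundary characters whose context is shorter than k are simply grouped by that shorter context in this sketch.

```python
# A minimal sketch of the empirical order-k entropy n*Hk.
# ASSUMPTION: a character's context is the k characters that FOLLOW it,
# matching the slides' "post-context" convention.
from collections import Counter, defaultdict
from math import log2

def n_times_hk(text: str, k: int) -> float:
    by_context = defaultdict(list)
    for i, ch in enumerate(text):
        by_context[text[i + 1:i + 1 + k]].append(ch)   # group by post-context
    total = 0.0
    for chars in by_context.values():
        m = len(chars)
        total += sum(c * log2(m / c) for c in Counter(chars).values())   # m * H0 of the group
    return total

if __name__ == "__main__":
    t = "mississippi"
    print(n_times_hk(t, 0), n_times_hk(t, 1))   # k = 0 reduces to n*H0
```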
35
Measuring against Hk
  • When performing worst-case analysis of lossless
    text compressors, we usually measure against Hk
  • The goal: a bound of the form A(s) ≤ a·n·Hk(s) + lower-order terms
  • Optimal: A(s) ≤ n·Hk(s) + lower-order terms

36-38
Known Bounds
(table of known bounds)
39
Known Bounds
Surprising!! (since BWT-based compressors work better than gzip in practice). Note: on Markov sources, gzip indeed works better.
40
Part III: Proof of the Lower Bound
41
Lower bound
  • Wish to analyze BW0 = BWT + MTF + Order-0
  • Need to show a string s such that BW0(s) is much larger than n·Hk(s)
  • Consider a string s with 10^3 a's and 10^6 b's
  • Entropy of s: roughly 10^4 bits
  • BWT(s): same character frequencies (BWT permutes s)
  • MTF(BWT(s)) has 2·10^3 1's and 10^6 - 10^3 0's
  • Compressed size: about 2·10^4 bits

(need BWT(s) to have many isolated a's)
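A back-of-the-envelope check of the example above, assuming the order-0 coder achieves its entropy bound on the MTF image; the printed ratio is well above 1 and approaches 2 as the fraction of a's shrinks, as made explicit in the general calculation two slides ahead.

```python
# Numeric check of the slides' example: n*H0 of the original string
# (10^3 a's, 10^6 b's) vs. the order-0 cost of an MTF image with ~2*10^3 ones.
# ASSUMPTION: the order-0 coder is modelled by its entropy bound.
from math import log2

def nh0(counts) -> float:
    n = sum(counts)
    return sum(c * log2(n / c) for c in counts if c > 0)

a, b = 10**3, 10**6
entropy_of_s   = nh0([a, b])            # n*H0(s)
cost_after_mtf = nh0([2 * a, b - a])    # ~2*10^3 ones, the rest zeros
print(entropy_of_s, cost_after_mtf, cost_after_mtf / entropy_of_s)
```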
42
Many isolated a's
  • Goal: find s such that in s' = BWT(s), many a's are isolated
  • Solution: probabilistic.
  • BWT is an (n+1)-to-1 function.
  • A random string therefore has chance at least about 1/(n+1) of being a BWT-image
  • A random string has overwhelming chance of having many isolated a's
  • Therefore, such a string exists.

43
General Calculation
  • s contains p·n a's and (1-p)·n b's
  • Entropy of s: n·H(p), where H is the binary entropy function
  • MTF(BWT(s)) contains about 2p(1-p)·n 1's, the rest 0's
  • Compressed size of MTF(BWT(s)): about n·H(2p(1-p))
  • Ratio: H(2p(1-p)) / H(p)

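A sketch of the resulting ratio as a function of p, assuming (per the bullets above) that the compressed size is about n·H(2p(1-p)) while the entropy is n·H(p). The ratio grows toward 2 as p shrinks; the paper's actual constants come from balancing this against lower-order terms that are not modelled here (see the next slide).

```python
# Ratio H(2p(1-p)) / H(p) for a few values of p, where H is binary entropy.
# ASSUMPTION: the compressed size and entropy take the forms stated above.
from math import log2

def H(x: float) -> float:
    """Binary entropy in bits."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)

for p in (0.25, 0.1, 0.01, 0.001, 0.0001):
    print(f"p = {p:<7} ratio = {H(2 * p * (1 - p)) / H(p):.3f}")
```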
44
Lower bounds on BWDC, BWRL
  • Similar technique.
  • p infinitesimally small gives a very compressible string
  • So maximize the ratio over p
  • This gives weird constants, but quite strong bounds

45
Weird Observation
  • The same argument works when BWT is replaced by
    any function f such that
  • f is poly(n)-to-1
  • f(s) is a permutation of s
  • Thus, BWT is just as bad as any such function! Weird
  • Explanation:
  • If A is bad vs. H0, then BWT + A is bad vs. Hk
  • If A is good vs. H0, and A is convex, then BWT + A is good vs. Hk

46
Sketch of upper bound
  • In [KLV] we have proved:
  • Thm (Manzini '99): If A is good vs. H0, and A is convex, then BWT + A is good vs. Hk, for any k
  • A is convex if

47
Sketch of upper bound
  • A = MTF + Order-0 is not convex, so we can't use the Thm.
  • Define A = MTF + PF, where PF is a prefix-free code.
  • A is convex
  • We prove that
  • This finishes the proof.

48
Conclusion
  • We have seen BWT-based algorithms
  • Intuition for why they work well
  • Analytical results
  • Compared to other types of compression
  • Further work
  • Find better analytical ways to explain behaviour
  • Compare against something other than entropy?
  • English text is not Markovian
  • Get better BWT compressors
  • Modern Applications (indexing,...)
  • Get better constants for existing algorithms

49
Thank you!
  • And thanks to Haim and Shir!