BWT-Based Compression Algorithms


1
BWT-Based Compression Algorithms
  • Haim Kaplan and Elad Verbin
  • Tel-Aviv University

Presented at CPM '07, July 8, 2007
2
Results
  • Cannot show a bound of the form |BW0(s)| ≤ c·n·Hk(s) + lower-order term with a constant c < 2
  • Similarly,
  • no c < 1.26 for BWRL
  • no c < 1.3 for BWDC
  • Probabilistic technique.

3
Outline
  • Part I: Definitions
  • Part II: Results
  • Part III: Proofs
  • Part IV: Experimental Results

4
Part I: Definitions
5
BW0
  • The Main Burrows-Wheeler Compression Algorithm

String S → BWT (Burrows-Wheeler Transform) → MTF (Move-to-front) → Order-0 Encoding → Compressed string
  • Input: text in English (similar contexts → similar character)
  • After BWT: text with local uniformity
  • After MTF: integer string with many small numbers
6
The BWT
  • Invented by Burrows and Wheeler ('94)
  • Analogous to Fourier Transform (smooth!) [Fenwick]

string with context-regularity: mississippi
→ BWT →
string with spikes (close repetitions): ipssmpissii
7
The BWT
T = mississippi
Columns: F (first), L = BWT(T) (last)
Rotations of T:
mississippi
ississippim
ssissippimi
sissippimis
issippimiss
ssippimissi
sippimissis
ippimississ
ppimississi
pimississip
imississipp
mississippi
BWT sorts the characters by their post-context
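
A minimal sketch of the transform by sorting rotations. It assumes an end-of-text sentinel (here the hypothetical choice "\0", smaller than every text character) that is appended before sorting and dropped from the output; under that assumption it reproduces the slide's ipssmpissii. Sorting full rotations is quadratic and only meant for illustration.

```python
SENTINEL = "\0"  # assumed end-of-text marker; must not occur in the text


def bwt(text: str) -> str:
    """Return the last column of the sorted rotation matrix, sentinel removed."""
    if SENTINEL in text:
        raise ValueError("text must not contain the sentinel")
    t = text + SENTINEL
    # Each row is a cyclic rotation of t; sorting orders rows by what
    # follows each last-column character (its post-context).
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    last_column = "".join(row[-1] for row in rotations)
    return last_column.replace(SENTINEL, "")


print(bwt("mississippi"))  # ipssmpissii
```

Dropping the sentinel loses its position in the last column, which is one way to see why several outputs of length n+1 collapse to the same length-n string.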
8
BWT Facts
  • permutes the text
  • (n+1)-to-1 function

9
Move To Front
  • By Bentley, Sleator, Tarjan and Wei (86)

string with spikes (close repetitions): ipssmpissii
→ move-to-front (initial list i, m, p, s) →
integer string with small numbers: 0,2,3,0,3,2,3,3,0,1,0
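
A minimal sketch of move-to-front coding. The initial list order is an assumption (alphabetically sorted symbols of the input unless an explicit `alphabet` is passed); the exact ranks produced depend on that choice.

```python
def move_to_front(text, alphabet=None):
    """Encode each symbol as its current rank, then move it to the front."""
    # Assumed initial list: sorted distinct symbols of the input.
    symbols = sorted(set(text)) if alphabet is None else list(alphabet)
    ranks = []
    for ch in text:
        rank = symbols.index(ch)
        ranks.append(rank)
        symbols.insert(0, symbols.pop(rank))  # recently seen symbols get small ranks
    return ranks


print(move_to_front("ipssmpissii"))  # [0, 2, 3, 0, 3, 2, 3, 3, 0, 1, 0]
```

A repeated symbol always encodes as 0, so close repetitions in the BWT output turn into runs of zeros.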
18
After MTF
  • Now we have a string with small numbers: lots of
    0s, many 1s, ...
  • Skewed frequencies: run arithmetic coding!

[Figure: character frequencies]
19
BW0
  • The Main Burrows-Wheeler Compression Algorithm

String S → BWT (Burrows-Wheeler Transform) → MTF (Move-to-front) → Order-0 Encoding → Compressed string
  • Input: text in English (similar contexts → similar character)
  • After BWT: text with local uniformity
  • After MTF: integer string with many small numbers
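
The three stages can be chained in a few lines. This is a sketch under stated assumptions, not the authors' implementation: the sentinel "\0", the sorted initial MTF list, and the helper names are all hypothetical, and the order-0 coder is replaced by its ideal cost (the empirical entropy of the MTF output, in bits), ignoring any header overhead.

```python
import math
from collections import Counter

SENTINEL = "\0"  # assumed end-of-text marker


def bwt(text):
    t = text + SENTINEL
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(row[-1] for row in rotations).replace(SENTINEL, "")


def move_to_front(text):
    symbols = sorted(set(text))  # assumed initial list order
    ranks = []
    for ch in text:
        rank = symbols.index(ch)
        ranks.append(rank)
        symbols.insert(0, symbols.pop(rank))
    return ranks


def order0_bits(seq):
    """Ideal order-0 cost: n * H0(seq), in bits."""
    n = len(seq)
    return sum(c * math.log2(n / c) for c in Counter(seq).values())


def bw0_estimate_bits(text):
    """Estimated BW0 output size: order-0 cost of MTF(BWT(text))."""
    return order0_bits(move_to_front(bwt(text)))


sample = "abracadabra" * 50
print(order0_bits(sample), bw0_estimate_bits(sample))
```

On repetitive input such as `sample`, the second number is much smaller than the first, because the BWT groups equal characters and MTF turns them into zeros.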
20
BWRL (e.g. bzip)
21
Many more BWT-based algorithms
  • BWDC: encodes using distance coding instead of
    MTF
  • BW with inversion frequencies coding
  • Booster-based [Ferragina-Giancarlo-Manzini-Sciortino]
  • Block-based compressor of Effros et al.

22
order-0 entropy
  • Lower bound for compression without context
    information

S = ACABBA
1/2 A's: each represented by 1 bit
1/3 B's: each represented by log(3) bits
1/6 C's: each represented by log(6) bits
6·H0(S) = 3·1 + 2·log(3) + 1·log(6)
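
The ACABBA arithmetic can be checked with a short sketch; logarithms are taken base 2, and the function name is illustrative.

```python
import math
from collections import Counter


def order0_entropy(s):
    """Empirical order-0 entropy H0(s) in bits per character."""
    n = len(s)
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())


s = "ACABBA"
total_bits = len(s) * order0_entropy(s)
print(round(total_bits, 3))  # 8.755, i.e. 3*1 + 2*log2(3) + 1*log2(6)
```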
23
order-k entropy
  • Lower bound for compression
    with order-k contexts

24
order-k entropy
s = mississippi
Context for i: mssp
Context for s: isis
Context for p: ip
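
A sketch of the order-k computation implied by the mississippi example. It assumes the convention the example appears to use: the string attached to a context w collects the characters that immediately precede each occurrence of w, and n·Hk(s) is the sum of the order-0 costs of those strings. Characters with no full context (non-cyclic treatment) are ignored; the helper names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def order0_bits(s):
    """n * H0(s) in bits."""
    n = len(s)
    return sum(c * math.log2(n / c) for c in Counter(s).values())


def context_strings(s, k):
    """Map each length-k context w to the characters that precede w in s."""
    ctx = defaultdict(list)
    for i in range(1, len(s) - k + 1):
        ctx[s[i:i + k]].append(s[i - 1])
    return {w: "".join(chars) for w, chars in ctx.items()}


def order_k_bits(s, k):
    """n * Hk(s): sum of order-0 costs over all context strings."""
    return sum(order0_bits(t) for t in context_strings(s, k).values())


print(context_strings("mississippi", 1))  # {'i': 'mssp', 's': 'isis', 'p': 'ip'}
print(order_k_bits("mississippi", 1))     # 12.0
```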
25
Part II: Results
26
Measuring against Hk
  • When performing worst-case analysis of lossless
    text compressors, we usually measure against Hk
  • The goal: a bound of the form
  • |A(s)| ≤ c·n·Hk(s) + lower-order term
  • Optimal: |A(s)| ≤ n·Hk(s) + lower-order term

27
Bounds
30
Bounds
Surprising!! Since BWT-based compressors work
better than gzip in practice!
31
Possible Explanations
  • Asymptotics
  • and real compressors cut into blocks, so
  • English text is not Markovian!
  • Analyzing on a different model might show BWT's
    superiority

32
Part III: Proofs
33
Lower bound
  • Wish to analyze BW0 = BWT + MTF + Order-0
  • Need to show s s.t.
  • Consider string s: 10^3 'a's, 10^6 'b's
  • Entropy of s
  • BWT(s)
  • same frequencies
  • MTF(BWT(s)) has 2·10^3 '1's, 10^6 - 10^3 '0's
  • Compressed size about

need BWT(s) to have many isolated 'a's
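
The numbers for this example can be estimated with a short sketch. It assumes the order-0 coder achieves the empirical entropy of its input and that every 'a' in BWT(s) is isolated, so the MTF output has exactly the symbol counts quoted on the slide.

```python
import math


def order0_bits(counts):
    """n * H0 for a sequence with the given symbol counts, in bits."""
    n = sum(counts)
    return sum(c * math.log2(n / c) for c in counts if c > 0)


num_a, num_b = 10**3, 10**6

entropy_bits = order0_bits([num_a, num_b])        # n * H0(s)
mtf_bits = order0_bits([2 * num_a, num_b - num_a])  # 2*10^3 ones, 10^6 - 10^3 zeros

ratio = mtf_bits / entropy_bits
print(round(entropy_bits), round(mtf_bits), round(ratio, 2))  # ratio is about 1.82
```

The ratio is already well above 1 for these counts and moves toward 2 as the fraction of 'a's shrinks.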
34
many isolated 'a's
  • Goal: find s such that in BWT(s), most 'a's are
    isolated
  • Solution: probabilistic.
  • BWT is an (n+1)-to-1 function.
  • A random string s has 1/(n+1) chance of being a
    BWT-image
  • A random string has 1 - 1/n^2 chance of having
    many isolated 'a's
  • Therefore, such a string exists
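
A small simulation illustrates only the second ingredient of the argument: in a random string with few 'a's, almost all of them are isolated. It says nothing about which strings are BWT images. The sizes and the seed are arbitrary choices for the illustration.

```python
import random


def count_isolated(s, ch="a"):
    """Count occurrences of ch with no adjacent ch on either side."""
    count = 0
    for i, c in enumerate(s):
        if c != ch:
            continue
        left_ok = i == 0 or s[i - 1] != ch
        right_ok = i == len(s) - 1 or s[i + 1] != ch
        if left_ok and right_ok:
            count += 1
    return count


def random_ab_string(num_a, num_b, rng):
    """Uniformly random arrangement of num_a 'a's and num_b 'b's."""
    chars = ["a"] * num_a + ["b"] * num_b
    rng.shuffle(chars)
    return "".join(chars)


rng = random.Random(0)  # fixed seed so the run is repeatable
s = random_ab_string(100, 100_000, rng)
print(count_isolated(s), "of 100 'a's are isolated")
```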

35
General Calculation
  • s contains pn 'a's, (1-p)n 'b's.
  • Entropy of s
  • MTF(BWT(s)) contains 2p(1-p)n '1's, rest '0's
  • compressed size of MTF(BWT(s))
  • Ratio
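
A sketch of how the ratio behaves as p varies, under the assumption that the order-0 coder achieves the empirical entropy. Then the entropy of s is n·H(p) and the compressed size of MTF(BWT(s)) is about n·H(2p(1-p)), where H is the binary entropy function, so the ratio is H(2p(1-p)) / H(p). These closed forms are a reconstruction from the counts on the slide, not formulas quoted from it.

```python
import math


def binary_entropy(q):
    """H(q) = -q log2 q - (1-q) log2 (1-q), with H(0) = H(1) = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)


def bw0_ratio(p):
    """Estimated (compressed size) / (n * H0(s)) for a fraction p of 'a's."""
    return binary_entropy(2 * p * (1 - p)) / binary_entropy(p)


for p in (0.5, 1e-1, 1e-2, 1e-3, 1e-6):
    print(p, round(bw0_ratio(p), 3))
```

The ratio is 1 at p = 1/2, stays below 2, and increases toward 2 as p shrinks, which is consistent with the constant 2 in the BW0 result.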

36
Lower bounds on BWDC, BWRL
  • Similar technique.
  • p infinitesimally small gives compressible
    string.
  • So maximize ratio over p.
  • Gives weird constants, but quite strong

37
Experimental Results
  • Sanity check: picking texts from the above Markov
    models really shows this behavior in practice
  • Picking text from realistic Markov sources also
    shows non-optimal behavior
  • (realistic = generated from actual texts)
  • On long Markov text, gzip works better than BWT

38
Bottom Line
  • BWT compressors are not optimal
  • (vs. order-k entropy)
  • We believe that they are good since English text
    is not Markovian.
  • Find theoretical justification!
  • also improve constants, find BWT algs with better
    ratios, ...

39
Thank You!
40
Additional Slides (taken out for lack of time)
41
BWT - Invertibility
  • Go forward, one character at a time

42
Main Property: L → F mapping
  • The ith occurrence of c in L corresponds to the
    ith occurrence of c in F.
  • This happens because the characters in L are
    sorted by their post-context, and the occurrences
    of character c in F are sorted by their
    post-context.
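
The property above is enough to invert the transform. A minimal sketch, assuming the sentinel "\0" is kept in the last column so that the starting row is known: a stable sort of the positions of L by character pairs the i-th occurrence of c in F with the i-th occurrence of c in L, and following that pairing recovers the text one character at a time, front to back.

```python
SENTINEL = "\0"  # assumed end-of-text marker, kept in the output here


def bwt_with_sentinel(text):
    t = text + SENTINEL
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(row[-1] for row in rotations)


def inverse_bwt(last_column):
    n = len(last_column)
    # f_to_l[j] is the position in L of the occurrence that sits at F[j];
    # sorted() is stable, so equal characters keep their relative order.
    f_to_l = sorted(range(n), key=lambda i: last_column[i])
    row = last_column.index(SENTINEL)  # the row holding the text itself
    out = []
    for _ in range(n):
        row = f_to_l[row]              # move to the rotation starting one step later
        out.append(last_column[row])   # its last character is the next text character
    return "".join(out[:-1])           # drop the trailing sentinel


print(inverse_bwt(bwt_with_sentinel("mississippi")))  # mississippi
```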

43
BW0 vs. Lempel-Ziv
  • BW0 dynamically takes advantage of
    context-regularity
  • Robust, smooth, alternative for Lempel-Ziv

44
BW0 vs. Statistical Coding
  • Statistical Coding (e.g. PPM)
  • Builds a model for each context
  • Prediction → Compression

45
Compressed Text Indexing
  • Application of BWT
  • Compressed representation of text that supports:
  • fast pattern matching (without decompression!)
  • Partial decompression
  • So, no need to ever decompress!
  • space usage: |BW0(s)| + o(n)
  • See more in Ferragina-Manzini

46
Musings
  • On one hand: BWT-based algorithms are not
    optimal, while Lempel-Ziv is.
  • On the other hand: BWT compresses much better
  • Reasons:
  • Results are asymptotic. (EE reason)
  • English text was not generated by a Markov
    source (real reason?)
  • Goal: get a more honest way to analyze
  • Use a statistic different than Hk?