The PAQ4 Data Compressor

1 / 44
About This Presentation
Title:

The PAQ4 Data Compressor

Description:

Every compression algorithm must expand some ... Text: as hard as AI. Encrypted data: as hard as cryptanalysis. Text compression is as hard as passing ... – PowerPoint PPT presentation

Number of Views:64
Avg rating:3.0/5.0
Slides: 45
Provided by: mattma
Learn more at: https://cs.fit.edu

less

Transcript and Presenter's Notes

Title: The PAQ4 Data Compressor


1
The PAQ4 Data Compressor
  • Matt Mahoney
  • Florida Tech.

2
Outline
  • Data compression background
  • The PAQ4 compressor
  • Modeling NASA valve data
  • History of PAQ4 development

3
Data Compression Background
  • Lossy vs. lossless
  • Theoretical limits on lossless compression
  • Difficulty of modeling data
  • Current compression algorithms

4
Lossy vs. Lossless
  • Lossy compression discards unimportant
    information
  • NTSC (color TV), JPEG, MPEG discard imperceptible
    image details
  • MP3 discards inaudible details
  • Losslessly compressed data can be restored exactly

5
Theoretical Limits on Lossless Compression
  • Cannot compress random data
  • Cannot compress recursively
  • Cannot compress every possible message
  • Every compression algorithm must expand some
    messages by at least 1 bit
  • Cannot compress x better than log2 1/P(x) bits on
    average (Shannon, 1949)

6
Difficulty of Modeling
  • In general, the probability distribution P of a
    source is unknown
  • Estimating P is called modeling
  • Modeling is hard
  • Text as hard as AI
  • Encrypted data as hard as cryptanalysis

7
Text compression is as hard as passing the Turing
test for AI
Q Are you human?
A Yes
Computer
  • P(x) probability of a human dialogue x (known
    implicitly by humans)
  • A machine knowing P(AQ) P(QA)/P(Q) would be
    indistinguishable from human
  • Entropy of English 1 bit per character
    (Shannon, 1950)
  • Best compression 1.2 to 2 bpc (depending on
    input size)

8
Compressing encrypted data is equivalent to
breaking the encryption
  • Example x 1,000,000 0 bytes encrypted with AES
    in CBC mode and key foobar
  • The encrypted data passes all tests for
    statistical randomness (not compressible)
  • C(x) 65 bytes using English
  • Finding C(x) requires guessing the key

9
Nevertheless, some common data is compressible

10
Redundancy in English text
  • Letter frequency P(e) gt P(q)
  • so e is assigned a shorter code
  • Word frequency P(the) gt P(eth)
  • Semantic constraints P(drink tea) gt P(drink air)
  • Syntactic constraints P(of the) gt P(the of)

11
Redundancy in images (pic from Calgary corpus)
Adjacent pixels are often the same color,
P(000111) gt P(011010)
12
Redundancy in the Calgary corpusDistance back to
last match of length 1, 2, 4, or 8
13
Redundancy in DNA
tcgggtcaataaaattattaaagccgcgttttaacaccaccgggcgtttc
tgccagtgacgttcaagaaaatcgggccattaagagtgagttggtattcc
atgttaagcatccacaggctggtatctgcaaccgattataacggatgctt
aacgtaatcgtgaagtatgggcatatttattcatctttcggcgcagaatg
ctggcgaccaaaaatcacctccatccgcgcaccgcccgcatgctctctcc
ggcgacgattttaccctcatattgctcggtgatttcgcgggctacc
P(a)P(t)P(c)P(g)1/4 (2 bpc) e.coli (1.92
bpc?)
14
Some data compression methods
  • LZ77 (gzip) Repeated strings replaced with
    pointers back to previous occurrence
  • LZW (compress, gif) Repeated strings replaced
    with index into dictionary
  • LZ decompression is very fast
  • PPM (prediction by partial match) characters
    are arithmetic encoded based on statistics of
    longest matching context
  • Slower, but better compression

15
LZ77 Example
the cat in the hat
Sub-optimal compression due to redundancy in
LZ77 coding
or?
...a...a...a...
16
LZW Example
at the in
the cat in the hat
a ab b bc c
Sub-optimal compression due to parsing ambiguity
abc or abc?
...ab...bc...abc...
17
Predictive Arithmetic Compression (optimal)
Compressor
input
p
Predict next symbol
Arithmetic Coder
compressed data
Decompressor
output
Arithmetic Decoder
Predict next symbol
p
18
Arithmetic Coding
  • Maps string x into C(x) ? 0,1) represented as a
    high precision binary fraction
  • P(y lt x) lt C(x) lt P(y x)
  • lt is a lexicographical ordering
  • There exists a C(x) with at most a log2 1/P(x)
    1 bit representation
  • Optimal within 1 bit of Shannon limit
  • Can be computed incrementally
  • As characters of x are read, the bounds tighten
  • As the bounds tighten, the high order bits of
    C(x) can be output

19
Arithmetic coding example
  • P(a) 2/3, P(b) 1/3
  • We can output 1 after the first b

aaa
0
a
aa
aaa
0.01
aab
aba 1
ab
0.1
aba
abb
b
ba
baa
baa 11
0.11
bab
bb
bbb 11111
bba
bbb
20
Prediction by Partial Match (PPM) Guess next
letter by matching longest context
the cat in th?
Longest context match is th Next letter in
context th is e
the cat in the ha?
Longest context match is a Next letter in
context a is t
21
How do you mix old and new evidence?
..abx...abx...abx...aby...ab?
P(x) ? P(y) ?
22
How do you mix evidence from contexts of
different lengths?
..abcx...bcy...cy...abc?
P(x) ? P(y) ? P(z) ? (unseen but not
impossible)
23
PAQ4 Overview
  • Predictive arithmetic coder
  • Predicts 1 bit at a time
  • 19 models make independent predictions
  • Most models favor newer data
  • Weighted average of model predictions
  • Weights adapted by gradient descent
  • SSE adjusts final probability (Osnach)
  • Mixer and SSE are context sensitive

24
PAQ4
Input Data
context
Model
p
Mixer
Model
Arithmetic Coder
p
p
SSE
Model
Model
Compressed Data
25
19 Models
  • Fixed (P(1) ½)
  • n-gram, n 1 to 8 bytes
  • Match model for n gt 8
  • 1-word context (white space boundary)
  • Sparse 2-byte contexts (skips a byte) (Osnach)
  • Table models (2 above, or 1 above and left)
  • 8 predictions per byte
  • Context normally begins on a byte boundary

26
n-gram and sparse contexts
.......x? .....x.x? ......xx?
....x..x? .....xxx? ....x.x.? ....xxxx?
x...x...? ...xxxxx? .....xx.? ..xxxxxx?
....xx..? .xxxxxxx? ... word? (begins after
space) xxxxxxxx? xxxxxxxxxx? (variable length gt
8)
27
Record (or Table) Model
  • Find a byte repeated 4 times with same interval,
    e.g. ..x..x..x..x
  • If interval is at least 3, assume a table
  • 2 models
  • first and second bytes above
  • bytes above and left

...x... ...x... ...? ....... ...x... ..x?
28
Nonstationary counter model
  • Count 0 and 1 bits observed in each context
  • Discard from the opposite count
  • If more than 2 then discard ½ of the excess
  • Favors newer data and highly predictive contexts

29
Nonstationary counter example
Input (in some context) n0 n1
p(1) ----------------------- -- --
---- 0000000000 10 0
0/10 00000000001 6 1
1/7 000000000011 4 2
2/6 0000000000111 3 3
3/6 00000000001111 2 4
4/6 000000000011111 2 5
5/7 0000000000111111 2 6 6/8
30
Mixer
  • p(1) ?i win1i / ?i ni
  • wi weight of ith model
  • n0i, n1i 0 and 1 counts for ith model
  • ni n0i n1i
  • Cost to code a 0 bit -log p(1)
  • Weight gradient to reduce cost ?cost/?wi
    n1i/?jwjnj ni/?jwjn1j
  • Adjust wi by small amount (0.1-0.5) in direction
    of negative gradient after coding each bit (to
    reduce the cost of coding that bit)

31
Secondary Symbol Estimation (SSE)
  • Maps P(x) to P(x)
  • Refines final probability by adapting to observed
    bits
  • Piecewise linear approximation
  • 32 segments (shorter near 0 or 1)
  • Counts n0, n1 at segment intersections
    (stationary, no discounting opposite count)
  • 8-bit counts are halved if over 255

32
SSE example
Output p
1
Initial function
Trained function
0
0 Input p 1
33
Mixer and SSE are context sensitive
  • 8 mixers selected by 3 high order bits of last
    whole byte
  • 1024 SSE functions selected by current partial
    byte and 2 high order bits of last whole byte

34
Experimental Results on Popular Compressors,
Calgary Corpus
Compressor Size (bytes) Compression Time, 750 MHz
Original data 3141622
compress 1272772 1.5 sec.
pkzip 2.04e 1032290 1.5
gzip -9 1017624 2
winrar 3.20 754270 7
paq4 672134 166
35
Results on Top Compressors
Compressor Size Time
ppmn 716297 23 sec.
rk 1.02 707160 44
ppmonstr I 696647 35
paq4 672134 166
epm r9 668115 54
rkc 661602 91
slim 18 659358 153
compressia 1.0b 650398 66
durilca 0.3a 647028 35
36
Compression for Anomaly Detection
  • Anomaly detection finding unlikely events
  • Depends on ability to estimate probability
  • So does compression

37
Prior work
  • Compression detects anomalies in NASA TEK valve
    data
  • C(normal) C(abnormal)
  • C(normal normal) lt C(normal abnormal)
  • Verified with gzip, rk, and paq4

38
NASA Valve Solenoid Traces
  • Data set 3 solenoid current (Hall effect sensor)
  • 218 normal traces
  • 20,000 samples per trace
  • Measurements quantized to 208 values
  • Data converted to a 4,360,000 byte file with 1
    sample per byte

39
Graph of 218 overlapped traces data (green)
40
Compression Results
Compressor Size
Original 4360000
gzip -9 1836587
slim 18 1298189
epm r9 1290581
durilca 0.3a 1287610
rkc 1277363
rk4 mx 1275324
ppmonstr Ipre 1272559
paq4 1263021
41
PAQ4 Analysis
  • Removing SSE had little effect
  • Removing all models except n1 to 5 had little
    effect
  • Delta coding made compression worse for all
    compressors
  • Model is still too large to code in SCL, but
    uncompressed data is probably noise which can be
    modeled statistically

42
Future Work
  • Compress with noise filtered out
  • Verify anomaly detection by temperature, voltage,
    and plunger impediment (voltage test 1)
  • Investigate analog and other models
  • Convert models to rules

43
History of PAQ4
Date Compressor Calgary Size
Nov. 1999 P12 (Neural net, FLAIRS paper in 5/2000) 831341
Jan. 2002 PAQ1 (Nonstationary counters) 716704
May 2003 PAQ2 (Serge Osnach adds SSE) 702382
Sept. 2003 PAQ3 (Improved SSE) 696616
Oct. 2003 PAQ3N (Osnach adds sparse models) 684580
Nov. 2003 PAQ4 (Adaptive mixing) 672135
44
Acknowledgments
  • Serge Osnach (author of EPM) for adding SSE and
    sparse models to PAQ2, PAQ3N
  • Yoockin Vadim (YBS), Werner Bergmans, Berto
    Destasio for benchmarking PAQ4
  • Jason Schmidt, Eugene Shelwien (ASH, PPMY) for
    compiling faster/smaller executables
  • Eugene Shelwien, Dmitry Shkarin (DURILCA,
    PPMONSTR, BMF) for improvements to SSE contexts
  • Alexander Ratushnyak (ERI) for finding a bug in
    an earlier version of PAQ4
Write a Comment
User Comments (0)