Title: Canonical Huffman trees
1- Canonical Huffman trees
- Goals: a scheme for large alphabets with
- Efficient decoding
- Efficient coding
- Economical use of main memory
2- A non-Huffman tree of the same cost
- Code 1: lca(e,b) is the node reached by 0; in Code 2, lca(e,b) is the root
- Code 2: successive integers
- (going down from the longest codes)

decimal  Code 2  Code 1 (Huffman)  frequency  symbol
0        000     000               10         a
1        001     001               11         b
2        010     100               12         c
3        011     101               13         d
4        10      01                22         e
5        11      11                23         f
3- Tree for Code 2
- Lemma: the number of nodes in each level of a Huffman tree is even
- Proof: a parent with a single child is impossible
- [Figure: the tree for Code 2; leaves e (4), f (5), b (1), c (2), d (3), a (0)]
4-5- Canonical Huffman algorithm
- Compute the lengths of the codes and the number of symbols for each length (as for regular Huffman); L = the maximal length
- first(L) = 0
- for i = L-1 downto 1:
-   first(i) = ⌈(first(i+1) + numl(i+1)) / 2⌉, where numl(j) = number of symbols with code length j
- Assign to the symbols of length i codes of this length, starting at first(i)
- Q: What happens when there are no symbols of length i?
- Q: Does first(L) = 0 < first(L-1) < ... < first(1) always hold?
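- A minimal Python sketch of this assignment (not part of the slides; the names canonical_codes, numl and first are illustrative, and the code lengths are assumed to have been computed already by regular Huffman):

def canonical_codes(lengths):
    # lengths: dict symbol -> code length
    L = max(lengths.values())
    numl = [0] * (L + 2)                      # numl[i] = number of symbols of length i
    for l in lengths.values():
        numl[l] += 1
    first = [0] * (L + 2)
    first[L] = 0
    for i in range(L - 1, 0, -1):             # first(i) = ceil((first(i+1) + numl(i+1)) / 2)
        first[i] = (first[i + 1] + numl[i + 1] + 1) // 2
    next_code = first[:]                      # next free code value for each length
    codes = {}
    # within a length, any fixed symbol order agreed by coder and decoder works
    for sym in sorted(lengths, key=lambda s: (lengths[s], s)):
        l = lengths[sym]
        codes[sym] = format(next_code[l], '0{}b'.format(l))
        next_code[l] += 1
    return codes, first

- On the table above (a,b,c,d of length 3, e,f of length 2) this yields exactly Code 2: a=000, ..., d=011, e=10, f=11.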
6- Decoding (assume we now start on a new symbol)
- i = 1
- v = nextbit()   // we have read the first bit
- while v < first(i)   // small codes start at large numbers!
-   i = i + 1
-   v = 2v + nextbit()
- /* now, v is the code, of length i, of a symbol s;
-    s is in position v - first(i) in the block of symbols with code length i (positions from 0) */
- Decoding can be implemented by shift/compare (very efficient)
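- A minimal Python sketch of this loop (not from the slides): bits is an iterator of 0/1 ints, first is the array computed above, and syms_by_len[i] lists the symbols of code length i in code order (the arrays S(i) of the next slide); all names are illustrative.

def decode_symbol(bits, first, syms_by_len):
    i = 1
    v = next(bits)                 # we have read the first bit
    while v < first[i]:            # short codes start at large numbers!
        i += 1
        v = 2 * v + next(bits)
    # v is now the length-i code of a symbol s
    return syms_by_len[i][v - first[i]]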
7- Data structures for the decoder
- The array first(i)
- Arrays S(i) of the symbols with code length i, ordered by their code
- (v - first(i) is the index into S(i) that gives the symbol for code v)
- Thus, decoding uses efficient arithmetic operations and array look-up; more efficient than storing a tree and traversing pointers
- What about coding (for large alphabets, where symbols = words or blocks)?
- The problem: millions of symbols → a large Huffman tree
8- Construction of canonical Huffman (sketch)
- Assumption: we have the symbol frequencies
- Input: a sequence of (symbol, freq)
- Output: a sequence of (symbol, length)
- Idea: use an array to represent a heap for creating the tree, and then the resulting tree and lengths (a sketch follows the summary below)
- We illustrate by example
9- Example: frequencies 2, 8, 11, 12
- (each cell with a freq. also contains its symbol, not shown)
- Now the reps of 2, 8 (the smallest) go out, the rest percolate
- The sum 10 is put into cell 4, and its rep into cell 3
- Cell 4 is the parent (sum) of cells 5, 8
- [Figure: snapshots of the 2n-cell array during the first merge step]
10- After one more step
- Finally, a representation of the Huffman tree
- Next, for i = 2 to 2n (= 8), assign lengths (shown here after i = 4)
- [Figure: array snapshots of the length-assignment pass]
11- Summary
- Insertion of the (symbol, freq) pairs into the array: O(n)
- Creation of the heap: O(n)
- Creating the tree from the heap: each step is O(log n), the total is O(n log n)
- Computing lengths: O(n)
- Storage requirement: 2n cells (compare to an explicit tree!)
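- For completeness, a sketch that computes the (symbol, length) pairs needed by the canonical assignment above; it uses Python's heapq for brevity instead of the slides' in-place 2n-cell array, so it illustrates the result rather than the memory-efficient construction.

import heapq

def code_lengths(freqs):
    # freqs: list of (symbol, freq) pairs
    heap = [(f, i, [i]) for i, (sym, f) in enumerate(freqs)]  # third item: leaf indices below
    heapq.heapify(heap)
    lengths = [0] * len(freqs)
    while len(heap) > 1:
        f1, _, l1 = heapq.heappop(heap)
        f2, i2, l2 = heapq.heappop(heap)
        for leaf in l1 + l2:               # every leaf under the merged node gets one level deeper
            lengths[leaf] += 1
        heapq.heappush(heap, (f1 + f2, i2, l1 + l2))
    return [(sym, lengths[i]) for i, (sym, f) in enumerate(freqs)]

- For the frequencies 2, 8, 11, 12 this gives lengths 3, 3, 2, 1, matching the example.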
12- Entropy H: a lower bound on compression
- How can one still improve?
- Huffman works for given frequencies, e.g., for the English language: static modeling
- Plus: no need to store the model in the coder/decoder
- But one can construct a frequency table for each file: semi-static modeling
- Minus:
- Need to store the model in the compressed file (negligible for large files)
- Takes more time to compress
- Plus: may provide better compression
13- 3rd option: start compressing with default freqs
- As coding proceeds, update the frequencies
- After reading a symbol:
- compress it
- update the freq table
- Adaptive modeling
- Decoding must use precisely the same algorithm for updating freqs → it can follow the coding
- Plus:
- Model need not be stored
- May provide compression that adapts to the file, including local changes of freqs
- Minus: less efficient than the previous models
- May use a sliding window to better reflect local changes
14- Adaptive Huffman
- Construction of a Huffman tree after each symbol: O(n)
- Incremental adaptation in O(log n) is possible
- Both are too expensive for practical use (large alphabets)
- We illustrate adaptive modeling with arithmetic coding (soon)
15- Higher-order modeling: use of context
- E.g., for each block of 2 letters, construct a freq. table for the next letter (2nd-order compression)
- (uses conditional probabilities, hence the improvement)
- This too can be static / semi-static / adaptive
16- Arithmetic coding
- Can be static, semi-static, adaptive
- Basic idea
- Coder: start with the interval [0,1)
- The 1st symbol selects a sub-interval, based on its probability
- The i-th symbol selects a sub-interval of the (i-1)-th interval, based on its probability
- When the file ends, store a number in the final interval
- Decoder: reads the number, reconstructs the sequence of intervals, i.e., of symbols
- Important: the length of the file is stored at the beginning of the compressed file
- (otherwise, the decoder does not know when to stop)
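- A toy sketch of the interval narrowing, using exact fractions and the static model of the next slide; real coders use fixed-precision integers and the renormalization described later, and the names narrow and cum are illustrative.

from fractions import Fraction as F

def narrow(symbols, cum):
    # cum[s] = (cumulative probability below s, cumulative probability through s)
    low, high = F(0), F(1)
    for s in symbols:
        lo_s, hi_s = cum[s]
        width = high - low
        low, high = low + width * lo_s, low + width * hi_s
    return low, high               # any number in [low, high) encodes the file

cum = {'a': (F(0), F(3, 4)), 'b': (F(3, 4), F(1))}
print(narrow("aaaba", cum))        # -> (81/256, 405/1024), i.e. [324/1024, 405/1024)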
17- Example (static): a: 3/4, b: 1/4
- The file to be compressed: aaaba
- The sequence of intervals (and the symbols creating them):
- [0,1), a→[0,3/4), a→[0,9/16), a→[0,27/64),
- b→[81/256, 108/256), a→[324/1024, 405/1024)
- Assuming this is the end, we store:
- 5 = the length of the file
- Any number in the final interval, say 0.011 (3 digits)
- (after the first 3 a's, one digit suffices!)
- (for a large file, the length will be negligible)
18- Why is it a good approach in general?
- For a symbol with large probability, the number of binary digits needed to represent an occurrence is smaller than 1 → poor compression with Huffman, which must spend at least one bit
- But arithmetic coding represents such a symbol by a small shrinkage of the interval, hence the extra number of digits is smaller than 1!
- Consider the example above, after aaa
19- Arithmetic coding, adaptive: an example
- The symbols: a, b, c
- Initial frequencies: 1, 1, 1 (= initial accumulated freqs)
- (0 is illegal, one cannot code a symbol with probability 0!)
- b: the model passes to the coder the triple (1, 2, 3)
- 1 = the accumulated freqs up to, not including, b
- 2 = the accumulated freqs up to, and including, b
- 3 = the sum of the freqs
- Coder: the new interval is [1/3, 2/3)
- Model updates the freqs to 1, 2, 1
- c: the model passes (3, 4, 4) (the upper quarter)
- Coder updates the interval to [7/12, 8/12)
- Model updates the freqs to (1, 2, 2)
- And so on ...
20- Practical considerations
- Interval ends are held as binary numbers
- The number of bits in the number to be stored is proportional to the size of the file; impractical to compute it all before storing
- Solution: as the interval gets small, the first bit of any number in it is determined. This bit is written by the coder into the compressed file, and removed from the interval ends (= multiply by 2)
- Example: in the 1st example, when the interval becomes [0, 27/64) = [0.000000, 0.011011) (after 3 a's), output 0 and update to [0.00000, 0.11011)
- Decoder: seeing the 1st 0, knows the first three symbols are a's,
- computes the interval, throws the 0 away
21- Practically, the (de)coder maintains one word for each number, and the computations are approximate
- Some (very small) loss of compression
- Both sides must perform the same approximations at the same time
- Initial assignment of freq. 1: too much weight for low-freq. symbols?
- Solution: assign a total freq. of 1 to all symbols not seen so far
- If k symbols were not seen yet and one now occurs, give it 1/k
- Since the decoder does not know when to stop, the file length must be stored in the compressed file
22- Frequencies data structure: needs to allow both updates and sums of the form f_1 + ... + f_k
- (expensive for large alphabets)
- Solution: a tree-like structure
- O(log n) accesses!

sum            binary  cell
f1             1       1
f1+f2          10      2
f3             11      3
f1+f2+f3+f4    100     4
f5             101     5
f5+f6          110     6
f7             111     7
f1+...+f8      1000    8

- If the binary representation of cell k ends with i 0s, the cell contains f_k + f_(k-1) + ... + f_(k-2^i+1)
- What is the algorithm to compute f_1 + ... + f_k?
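- A sketch of this structure, a Fenwick (binary-indexed) tree: both the update of a single f_k and the prefix sum f_1 + ... + f_k take O(log n). Class and method names are illustrative.

class CumFreqs:
    # cell k covers f_k + ... + f_{k-2^i+1}, where 2^i = k & -k
    def __init__(self, n):
        self.tree = [0] * (n + 1)          # 1-based cells, as in the table above
    def update(self, k, delta):            # f_k += delta
        while k < len(self.tree):
            self.tree[k] += delta
            k += k & -k                    # next cell whose range covers k
    def prefix_sum(self, k):               # returns f_1 + ... + f_k
        s = 0
        while k > 0:
            s += self.tree[k]
            k -= k & -k                    # strip the lowest set bit
        return s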
23- Dictionary-based methods
- Huffman is a dictionary-based method
- Each symbol in the dictionary has an associated code
- But adaptive Huffman is not practical
- Famous adaptive methods:
- LZ77, LZ78 (Lempel-Ziv)
- We describe LZ77 (the basis of gzip in Unix)
24- Basic idea: the dictionary = the sequences of symbols in a window before the current position
- (typical window size: ...)
- When the coder is at position p, the window is the symbols in positions p-w .. p-1
- The coder searches for the longest sequence that matches the one starting at position p
- If one is found, of length l, put (n, l) into the file (n = offset, l = length), and move forward l positions,
- else output the current symbol
25- Example
- The input is a b a a b a b b...b (11 b's at the end)
- The code is: a b (2,1) (1,1) (3,2) (2,1) (1,10)
- Decoding: a → a, b → b, (2,1) → a, (1,1) → a,
- current known string: a b a a
- (3,2) → b a, (2,1) → b
- current known string: a b a a b a b
- (1,10) → go back one step, to b
- do 10 times: output the scanned symbol, advance one
- (note: run-length encoding hides here)
- Note: decoding is extremely fast!
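- A sketch of the decoder for the example's token stream, where a token is either a literal symbol or an (offset, length) pair (not from the slides; names are illustrative). Copying one symbol at a time is what makes overlapping copies (offset < length) act as run-length encoding.

def lz77_decode(tokens):
    out = []
    for t in tokens:
        if isinstance(t, tuple):
            offset, length = t
            for _ in range(length):
                out.append(out[-offset])   # may re-read a symbol written in this same copy
        else:
            out.append(t)
    return ''.join(out)

print(lz77_decode(['a', 'b', (2, 1), (1, 1), (3, 2), (2, 1), (1, 10)]))
# -> 'abaabab' followed by ten more b's (11 b's at the end, as in the example)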
26- Practical issues
- 1. Maintenance of the window: use a cyclic buffer
- 2. Searching for the longest matching word → expensive coding
- 3. How to distinguish a pair (n,l) from a symbol?
- 4. Can we save on the space for (n,l)?
- The gzip solution for 2-4:
- 2: a hash table of 3-sequences, with lists of the positions where a sequence starting with them exists (what about short matches?)
- An option: limit the search in the list (saves time)
- Does not always find the longest match, but the loss is very small
27- 3: one bit suffices (but see below)
- 4: offsets are integers in the range 1..2^k (the window size); smaller values are often more frequent
- Semi-static solution (gzip):
- Divide the file into segments of 64K; for each:
- Find the offsets used and their frequencies
- Code using canonical Huffman
- Do the same for the lengths
- Actually, add the symbols (issue 3) to the set of lengths, code them together using one code, and put this code in the file before the offset code (why?)
28- One last issue (for all methods): synchronization
- Assume you want to start decoding in mid-file
- E.g., a db of files, coded using one code
- Bit-based addresses for the files: these addresses occur in many ILs, which are loaded to MM
- 32 bits/address is OK, 64 bits/address may be costly
- Byte- or word-based addresses allow for much larger dbs; it may even pay to use addresses based on k-word blocks
- But then, how does one synchronize?
29- Solution: fill the last block with 011
- If the code exactly fills the last block, add a filler block
- Since the file addresses/lengths are known, the filling can be removed
- Does this work for Huffman? Arithmetic? LZ77?
- What is the cost?
30- Summary of file compression
- Large dbs → compression helps reduce storage
- Fast query processing requires synchronization and fast decoding
- The db is often given, so statistics can be collected
- semi-static is a viable option
- (plus regular re-organization)
- Context-based methods give good compression, but expensive decoding
- Word-based Huffman is recommended (semi-static)
- Construct two models: one for words, another for non-words
31- Compression of inverted lists
- Introduction
- Global, non-parametric methods
- Global parametric methods
- Local parametric methods
32- Introduction
- Important parameters:
- N - number of documents in the db
- n - number of (distinct) words
- F - number of word occurrences
- f - number of inverted list entries
- The index contains:
- lexicon (in MM, if possible), ILs (on disk)
- IL compression helps to reduce the size of the index and the cost of I/O
- (In TREC, '99: N = 741,856, n = 535,346, F = 333,338,738, f = 134,994,414; total size 2GB)
33- The IL for a term t contains f_t entries
- An entry:
- d (the doc. id), the in-doc freq. f_{d,t}, in-doc positions, ...
- For ranked answers, the entry is usually (d, f_{d,t})
- We consider each component separately; independent compressions can be composed
34- Compression of doc numbers
- A sequence of numbers in 1..N: how can it be compressed?
- Most methods use gaps:
- g_1 = d_1, g_2 = d_2 - d_1, ...
- We know that g_1 + g_2 + ... + g_{f_t} = d_{f_t} ≤ N
- For long lists, most gaps are small
- These facts can be used for compression
- (Each method has an associated probability distribution on the gaps, defined by its code lengths)
35- Global, non-parametric methods
- Binary coding
- Represent each gap by a fixed-length binary number
- Code length for g: ⌈log N⌉ bits
- Implied probability: the uniform distribution, p(g) = 1/N
36- Unary coding
- Represent each g > 0 by g-1 digits 1, then 0
- 1 → 0, 2 → 10, 3 → 110, 4 → 1110, ...
- Code length for g: g bits
- Worst case for the sum over a list: N (hence, for all ILs, nN)
- Is this a nice bound?
- Implied probability: P(g) = 2^(-g)
- Exponential decay; if it does not hold in practice
- → compression penalty
37- Gamma (γ) code
- A number g is represented by:
- Prefix: the unary code for 1 + ⌊log g⌋
- Suffix: the binary code, with ⌊log g⌋ digits, for g - 2^⌊log g⌋
- Examples: 1 → 0, 2 → 10 0, 3 → 10 1, 4 → 110 00, 5 → 110 01, ...
- (Why not ... ?)
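- A small Python sketch of the gamma code as defined above (unary prefix for 1 + ⌊log g⌋, then ⌊log g⌋ binary digits of the remainder); function names are illustrative.

def unary(n):                              # n >= 1: n-1 ones followed by a zero
    return '1' * (n - 1) + '0'

def gamma(g):                              # g >= 1
    k = g.bit_length() - 1                 # k = floor(log2 g)
    suffix = format(g - (1 << k), '0{}b'.format(k)) if k else ''
    return unary(k + 1) + suffix           # unary(1 + floor(log2 g)), then k-bit suffix

# 1 -> '0', 2 -> '100', 3 -> '101', 4 -> '11000', 5 -> '11001', 6 -> '11010'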
38-39- Interim summary
- We have codes with associated probability distributions
- Q: can you prove that the (exact) formulas for the probabilities of gamma, delta sum to 1?
40- Golomb code
- Semi-static, uses db statistics
- A global, parametric code
- Select a basis b (based on db statistics; later)
- g > 0 → we represent g-1
- Prefix: let q = (g-1) div b (integer division)
- represent q+1 in unary
- Suffix: the remainder is r = (g-1) - qb (in 0..b-1)
- represent it by a binary tree code:
- some leaves at distance ⌊log b⌋
- the others at distance ⌈log b⌉
41- The binary tree code
- Cut 2j leaves from the full binary tree of depth k = ⌈log b⌉ (with j = 2^k - b), so that b leaves remain
- Assign the leaves, in order, to the values in 0..b-1
- Example, b = 6: 0 → 00, 1 → 01, 2 → 100, 3 → 101, 4 → 110, 5 → 111
42- Summary of the Golomb code
- Exponential decay like unary, but at a slower rate, controlled by b
- Q: what is the underlying theory?
- Q: how is b chosen?
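- A sketch of the full Golomb code from the last two slides: a unary prefix for q+1, then the truncated binary code for the remainder, with the shorter codewords given to the smaller remainders (as in the b = 6 example); the function name is illustrative.

import math

def golomb(g, b):                          # g >= 1, basis b >= 2
    q, r = divmod(g - 1, b)
    prefix = '1' * q + '0'                 # unary code for q+1
    k = math.ceil(math.log2(b))
    short = (1 << k) - b                   # this many remainders get only k-1 bits
    if r < short:
        return prefix + format(r, '0{}b'.format(k - 1))
    return prefix + format(r + short, '0{}b'.format(k))

# b = 6: remainders 0,1 -> 00,01 and 2..5 -> 100,101,110,111, as in the example above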
43- Infinite Huffman trees
- Example: consider the infinite distribution 1/2, 1/4, 1/8, 1/16, ...
- The code (*): 0, 10, 110, 1110, ...
- seems natural, but the Huffman algorithm is not applicable! (why?)
- For each m, consider the (finite) m-approximation
- each has a Huffman tree: codes 0, 10, ..., 1...10, 1...1
- the code for m+1 refines that of m
- The sequence of codes converges to (*)
44- [Figure: Huffman trees of successive approximations; leaf weights 1/2, 1/4, 1/8, ...]
- approximation 1: code words 0, 1
- approximation 2: code words 0, 10, 11
- a later approximation: code words 0, 10, 110, 1110, 1111
45- A more general approximation scheme
- Given the sequence p_1, p_2, ...
- An m-approximation with skip b is the finite sequence p_1, ..., p_m, q_1, ..., q_b, where q_j is the sum of the tail probabilities p_{m+j}, p_{m+j+b}, p_{m+j+2b}, ...
- For example, b = 3:
- [Figure: the approximated tail for b = 3]
46- Fact: refining the m-approx. by splitting q_1 into p_{m+1} and the rest of its tail gives the (m+1)-approx.
- A sequence of m-approximations is good if
- (*) ... are the smallest in the sequence,
- so they are the 1st pair merged by Huffman (why is this important?)
- (*) depends on the p_i and on b
47- Let p_g = (1-p)^(g-1) p -- the Bernoulli distribution
- A decreasing sequence
- To prove (*), we need to show ...
- For which b do they hold?
49- We select ≤ on the right (useful later)
- This has a unique solution
- To solve: from the left side, we obtain ...
- Hence the solution is b = ⌈log(2-p) / (-log(1-p))⌉ (b is an integer)
50- Next: what do these Huffman trees look like?
- Start with the 0-approx.
- Facts:
- 1. A decreasing sequence (so the last two are the smallest)
- 2. ... (when b > 3); follows from ... and (*)
- 3. The previous two properties are preserved when the last two are replaced by their sum
- 4. The Huffman tree for the sequence assigns codes of lengths ..., of the same cost as the Golomb code for the remainders
- Proof: induction on b
51- Now, expand the approximations to obtain the infinite tree
- This is the Golomb code (with the places of prefix/suffix exchanged)!!
- [Figure: the infinite tree, edges labeled 0/1]
52- Last question
- Where do we get p, and why Bernoulli?
- Assume an equal probability p for t to be in d
- For a given t, the probability of the gap g from one doc to the next is then (1-p)^(g-1) p
- For p: there are f pairs (t, d), so estimate p by f/(nN)
- Since N is large, this is a reasonable estimate
53- For TREC
- To estimate b for a small p:
- log(2-p) ≈ log 2, log(1-p) ≈ -p
- b ≈ (log 2)/p = 0.69 nN/f ≈ 1917
- End of (global) Golomb
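- A small sketch of this choice of b, with both the exact solution from the previous slides and the small-p approximation used here; for the global code, p would be estimated as f/(nN). Function names are illustrative.

import math

def golomb_basis(p):                       # exact choice for Bernoulli-distributed gaps
    return math.ceil(math.log(2 - p) / -math.log(1 - p))

def golomb_basis_approx(p):                # log(2-p) ~ log 2, log(1-p) ~ -p
    return round(math.log(2) / p)          # roughly 0.69 / p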
54- Global observed frequency (a global method)
- Construct all the ILs; collect statistics on the frequencies of the gaps
- Construct a canonical Huffman tree for the gaps
- The model/tree needs to be stored
- (gaps are in 1..N; for TREC this is 3/4M gap values → the storage overhead may not be so large)
- Practically, not far from gamma, delta
- But local methods are better
55- Local (parametric) methods
- Coding of IL(t) based on the statistics of IL(t)
- Local observed frequency:
- Construct a canonical Huffman code for IL(t) based on its gap frequencies
- Problem: in small ILs, the number of distinct gaps is close to the number of gaps
- Size of the model is close to the size of the compressed data
- Example: 25 entries, 15 gap values
- Model: 15 gaps, 15 lengths (or freqs)
- Way out: construct a model for groups of ILs
- (see the book for details)
56- Local Bernoulli/Golomb
- Assumption: the number f_t of entries of IL(t) is known
- (to both coder and decoder)
- Take p ≈ f_t/N, estimate b, construct the Golomb code
- Note:
- Large f_t → larger p → smaller b → the code gets close to unary (reasonable: many small gaps)
- Small f_t → large b → most of the coding is ≈ log b bits
- For example, f_t = 2 (one gap) → b ≈ 0.69N
- for a gap < 0.69N, code in ≈ log(0.69N) bits
- for a larger gap, one more bit
57- Interpolative coding
- Uses the original d's, not the g's
- Let f = f_t; assume the d's are stored in L[0..f-1]
- (each entry is at most N)
- Standard binary code for the middle d, with the number of bits determined by its range
- Continue as in binary search: each d in binary, with the number of bits determined by its (modified) range
58- Example: L = 3, 8, 9, 11, 12, 13, 18 (f = 7), N = 20
- h = 7 div 2 = 3; L[3] = 11 (the 4th d)
- the smallest d is 1, and there are 3 entries to the left of L[3]
- the largest d is 20, and there are 3 entries to the right of L[3]
- the size of the interval is (20-3) - (1+3) = 17 - 4 = 13
- → code 11 in 4 bits
- For the sub-list left of 11: 3, 8, 9
- h = 3 div 2 = 1; L[1] = 8
- bounds: lower 1+1 = 2, upper 10-1 = 9
- code using 3 bits
- For L[2] = 9, the range is 9..10, use 1 bit
- For the sub-list right of 11: done on the board
- (note the element that is coded in 0 bits!)
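- A sketch of this recursion (not from the slides): each middle document id is coded in ⌈log2(size of its possible range)⌉ bits, 0 bits when only one value is possible; the exact bit counts depend on the binary-code convention, and the names interpolative and bits_for are illustrative.

import math

def bits_for(count):                       # bits to pick one of 'count' possible values
    return math.ceil(math.log2(count)) if count > 1 else 0

def interpolative(L, lo, hi):              # sorted doc ids L, all known to lie in [lo, hi]
    out = []                               # list of (doc id, its binary code)
    def rec(left, right, lo, hi):
        if left > right:
            return
        h = (left + right) // 2
        low = lo + (h - left)              # h-left smaller ids must fit below L[h]
        high = hi - (right - h)            # right-h larger ids must fit above L[h]
        nb = bits_for(high - low + 1)
        out.append((L[h], format(L[h] - low, '0{}b'.format(nb)) if nb else ''))
        rec(left, h - 1, lo, L[h] - 1)
        rec(h + 1, right, L[h] + 1, hi)
    rec(0, len(L) - 1, lo, hi)
    return out

print(interpolative([3, 8, 9, 11, 12, 13, 18], 1, 20))   # 12 comes out with 0 bits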
59- Advantages
- Relatively easy to code and decode
- Very efficient for clusters (a word that occurs in many documents close to each other)
- Disadvantages: more complex to implement, requires a stack
- And the cost of decoding is a bit more than Golomb
- Summary of methods: see Table 3.8
60- An entry in IL(t) also contains f_{d,t} - the freq. of t in d
- Compression of f_{d,t}:
- In TREC, F/f ≈ 2.7 → these are small numbers
- Unary: the total overhead is F bits
- the cost per entry is F/f (for TREC: 2.7)
- Gamma: shorter than unary, except for 2, 4
- (for TREC: 2.13)
- It does not pay the complexity to choose another code
- Total cost of compression of an IL: 8-9 bits/entry