Canonical Huffman trees

Slides: 61
Provided by: Offic96

1
  • Canonical Huffman trees
  • Goals: a scheme for large alphabets with
  • Efficient decoding
  • Efficient coding
  • Economic use of main memory

2
  • A non-Huffman tree of the same cost
  • Code 1: lca(e,b) = 0; code 2: lca(e,b) = the root (empty prefix)
  • Code 2: codewords are successive integers
  • (going down from the longest codes)

decimal  code 2  code 1 (Huffman)  frequency  symbol
0        000     000               10         a
1        001     001               11         b
2        010     100               12         c
3        011     101               13         d
4        10      01                22         e
5        11      11                23         f
3
  • tree for code 2
  • Lemma: the number of nodes in each level of a Huffman tree
    (below the root) is even
  • Proof: a parent with a single child is impossible

[Figure: tree for code 2, leaves labelled a=0, b=1, c=2, d=3, e=4, f=5]
4
  • General approach

5
  • Canonical Huffman Algorithm
  • compute lengths of codes and numbers of symbols
    for each length (as for regular Huffman)
  • L = max length
  • first(L) = 0
  • for i = L-1 downto 1
  • assign to the symbols of length i codes of this
    length, starting at first(i)
  • Q: What happens when there are no symbols of
    length i?
  • Does first(L) = 0 < first(L-1) < ... < first(1) always
    hold?
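
A minimal Python sketch of the assignment loop above. The slide does not show how first(i) is obtained; the sketch assumes the usual canonical-Huffman rule first(i) = ceil((first(i+1) + num(i+1)) / 2), where num(i) is the number of symbols of length i. The name `canonical_codes` and the ordering of symbols within one length are illustrative choices, not taken from the slides.

```python
def canonical_codes(lengths):
    """Assign canonical codes from a dict symbol -> code length.

    Returns (first, codes): first[i] is the smallest code value of
    length i, codes maps each symbol to its bit string.
    """
    max_len = max(lengths.values())
    num = [0] * (max_len + 2)            # num[i]: number of symbols of length i
    for length in lengths.values():
        num[length] += 1

    first = [0] * (max_len + 2)          # first[max_len] stays 0
    for i in range(max_len - 1, 0, -1):
        # Assumed rule: halve (rounding up) the first unused code
        # value of the next longer length.
        first[i] = (first[i + 1] + num[i + 1] + 1) // 2

    next_code = list(first)
    codes = {}
    for sym in sorted(lengths):          # illustrative order within a length
        length = lengths[sym]
        codes[sym] = format(next_code[length], "0{}b".format(length))
        next_code[length] += 1
    return first, codes


first, codes = canonical_codes({"a": 3, "b": 3, "c": 3, "d": 3, "e": 2, "f": 2})
print(codes)
```

With the code lengths of the six-symbol example this gives a..d the codes 000..011 and e, f the codes 10, 11, i.e. code 2. Length 1 has no symbols, yet first(1) is still computed, which is one way to approach the question on the slide.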

6
  • Decoding (assume we start now on new symbol)
  • i = 1
  • v = nextbit() // we have read the first bit
  • while v < first(i) // small codes start at large
    numbers!
  • i = i + 1
  • v = 2v + nextbit()
  • /* now, v is the code of length i of a symbol s
  • s is in position v - first(i) in the block of
    symbols with code length i (positions from 0)
  • */
  • Decoding can be implemented by shift/compare
  • (very efficient)
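
The decoding loop can be sketched in Python as follows. The tables `first` and `symbols` are hypothetical inputs in the shape the slides describe (first code value per length, and the symbols of each length ordered by code); the input is assumed to be a well-formed bit sequence.

```python
def decode(bits, first, symbols):
    """Decode an iterable of 0/1 bits.

    first[i]   : smallest code value of length i
    symbols[i] : symbols of code length i, ordered by their code
    """
    stream = iter(bits)
    out = []
    for bit in stream:                   # first bit of a new symbol
        i, v = 1, bit
        while v < first[i]:              # the code is longer than i bits
            i += 1
            v = 2 * v + next(stream)
        out.append(symbols[i][v - first[i]])
    return out


# Tables for the six-symbol example (assumed: first(1) = first(2) = 2, first(3) = 0).
first = {1: 2, 2: 2, 3: 0}
symbols = {2: ["e", "f"], 3: ["a", "b", "c", "d"]}
bits = [int(ch) for ch in "00110000011"]   # codes of b, e, a, d
print(decode(bits, first, symbols))
```

Only comparisons, a shift-and-add and one array look-up are needed per symbol; no tree is stored.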

7
  • Data structures for decoder
  • The array first(i)
  • Arrays S(i) of the symbols with code length i,
  • ordered by their code
  • (v-first(i) is the index value to get the
    symbol for code v)
  • Thus, decoding uses efficient arithmetic
    operations + array look-up: more efficient than
    storing a tree and traversing pointers
  • What about coding (for large alphabets, where
    symbols = words or blocks)?
  • The problem: millions of symbols → a large Huffman
    tree

8
  • Construction of canonical Huffman (sketch)
  • Assumption: we have the symbol frequencies
  • Input: a sequence of (symbol, freq)
  • Output: a sequence of (symbol, length)
  • Idea: use an array to represent a heap for
    creating the tree, and the resulting tree and
    lengths
  • We illustrate by example

9
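  • Example: frequencies 2, 8, 11, 12
  • (each cell with a freq. also contains a symbol,
    not shown)
  • Now the reps of 2 and 8 (the smallest) go out, the rest
    percolate
  • The sum 10 is put into cell 4, and its rep into
    cell 3
  • Cell 4 is the parent (sum) of cells 5 and 8.

[Figure: snapshots of the 8-cell array. Cells 5-8 initially hold the frequencies 8, 11, 12, 2; after this step cells 5 and 8 hold 4 (a pointer to their parent) and cell 4 holds the sum 10.]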
10
  • after one more step
  • Finally, a representation of the Huffman tree
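  • Next, for i = 2 to 8, assign lengths (shown here
    after i = 4)

[Figure: snapshots of the array. In the final tree representation cell 2 holds the root sum 33 and cells 3-8 hold the parent pointers 2, 3, 4, 3, 2, 4; after the length pass has reached i = 4, cells 2-4 hold the lengths 0, 1, 2.]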
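
A sketch of the length computation in Python. It is not the in-place 2n-cell array of the slides: it assumes Python's `heapq` for the heap and a separate `parent` list, but keeps the same three phases (build a heap, repeatedly merge the two smallest while recording parent pointers, then turn parent pointers into depths in one pass). All names are illustrative.

```python
import heapq


def huffman_lengths(freqs):
    """Return the Huffman code length of each frequency in freqs."""
    n = len(freqs)
    parent = [None] * (2 * n - 1)        # leaves 0..n-1, internal nodes n..2n-2
    heap = [(f, i) for i, f in enumerate(freqs)]
    heapq.heapify(heap)                  # heap creation

    next_node = n
    while len(heap) > 1:                 # n-1 merge steps
        f1, a = heapq.heappop(heap)
        f2, b = heapq.heappop(heap)
        parent[a] = parent[b] = next_node
        heapq.heappush(heap, (f1 + f2, next_node))
        next_node += 1

    # A parent always has a larger index than its children, so one
    # pass from the root downwards fills in all depths.
    depth = [0] * (2 * n - 1)
    for node in range(2 * n - 3, -1, -1):
        depth[node] = depth[parent[node]] + 1
    return depth[:n]


print(huffman_lengths([2, 8, 11, 12]))
```

For the example frequencies 2, 8, 11, 12 the merges produce the sums 10, 21 and 33, as on the slides.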
11
  • Summary
  • Insertion of (symbol, freq) into array: O(n)
  • Creation of heap: O(n)
  • Creating tree from heap: each step is O(log n),
    total is O(n log n)
  • Computing lengths: O(n)
  • Storage requirements: 2n (compare to tree!)

12
  • Entropy H: a lower bound on compression
  • How can one still improve?
  • Huffman works for given frequencies, e.g., for
    the English language: static modeling
  • Plus: no need to store model in coder/decoder
  • But, can construct a frequency table for each file:
  • semi-static
    modeling
  • Minus:
  • Need to store model in compressed file
    (negligible for large files)
  • Takes more time to compress
  • Plus: may provide for better compression

13
  • 3rd option: start compressing with default freqs
  • As coding proceeds, update frequencies
  • After reading a symbol:
  • compress it
  • update freq table
  • Adaptive modeling
  • Decoding must use precisely the same algorithm for
    updating freqs → can follow coding
  • Plus:
  • Model need not be stored
  • May provide compression that adapts to file,
    including local changes of freqs
  • Minus: less efficient than previous models
  • May use a sliding window to better reflect
    local changes

14
  • Adaptive Huffman
  • Construction of Huffman after each symbol: O(n)
  • Incremental adaptation in O(log n) is possible
  • Both too expensive for practical use (large
    alphabets)
  • We illustrate adaptive by arithmetic coding (soon)

15
  • Higher-order modeling: use of context
  • e.g. for each block of 2 letters, construct a
    freq. table for the next letter (order-2
    compression)
  • (uses conditional probabilities, hence the
    improvement)
  • This also can be static/semi-static/adaptive

16
  • Arithmetic coding
  • Can be static, semi-static, adaptive
  • Basic idea:
  • Coder: start with the interval [0,1)
  • 1st symbol selects a sub-interval, based on its
    probability
  • i'th symbol selects a sub-interval of the (i-1)th
    interval, based on its probability
  • When file ends, store a number in the final
    interval
  • Decoder: reads the number, reconstructs the
    sequence of intervals, i.e. symbols
  • Important: length of file is stored at beginning of
    compressed file
  • (otherwise, decoder does not know when
    to stop)

17
  • Example (static): a ~ 3/4, b ~ 1/4
  • The file to be compressed: aaaba
  • The sequence of intervals (and the symbols creating
    them):
  • [0,1), a → [0,3/4), a → [0,9/16), a → [0,27/64),
  • b → [81/256, 108/256), a → [324/1024, 405/1024)
  • Assuming this is the end, we store:
  • 5: length of file
  • Any number in the final interval, say 0.011 (3
    digits)
  • (after the first 3 a's, one digit suffices!)
  • (for a large file, the length will be negligible)
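
The interval sequence of this example can be checked with exact rational arithmetic. The sketch below uses Python's `fractions.Fraction`; the function names and the dict-based model are illustrative assumptions, and a practical coder would use fixed-width integers instead.

```python
from fractions import Fraction


def final_interval(text, probs):
    """Return (low, high) of the interval selected by text."""
    start, acc = {}, Fraction(0)
    for sym, p in probs.items():         # cumulative probability below sym
        start[sym] = acc
        acc += p
    low, width = Fraction(0), Fraction(1)
    for sym in text:
        low += width * start[sym]
        width *= probs[sym]
    return low, low + width


def decode_number(x, length, probs):
    """Recover `length` symbols from a number x in the final interval."""
    out = []
    for _ in range(length):
        acc = Fraction(0)
        for sym, p in probs.items():
            if acc <= x < acc + p:
                out.append(sym)
                x = (x - acc) / p        # rescale to [0, 1)
                break
            acc += p
    return "".join(out)


probs = {"a": Fraction(3, 4), "b": Fraction(1, 4)}
low, high = final_interval("aaaba", probs)
print(low, high)                         # 81/256 405/1024
print(decode_number(Fraction(3, 8), 5, probs))   # 0.011 in binary is 3/8
```

The decoder needs the length 5 as a separate input: without it, it would keep producing symbols from the same number.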

18
  • Why is it a good approach in general?
  • For a symbol with large probability, the number of binary
    digits needed to represent an occurrence is
    smaller than 1 → poor compression with Huffman
  • But, arithmetic coding represents such a symbol with a
    small shrinkage of the interval, hence the extra
    number of digits is smaller than 1!
  • Consider the example above, after aaa

19
  • Arithmetic coding, adaptive: an example
  • The symbols: a, b, c
  • Initial frequencies: 1, 1, 1 (plus initial accumulated
    freqs)
  • (0 is illegal, cannot code a symbol with
    probability 0!)
  • b: model passes to coder the triple (1, 2, 3):
  • 1: the accumulated freqs up to, not including, b
  • 2: the accumulated freqs up to, including, b
  • 3: the sum of freqs
  • Coder notes new interval [1/3, 2/3)
  • Model updates freqs to 1, 2, 1
  • c: model passes (3, 4, 4) (upper quarter)
  • Coder updates interval to [7/12, 8/12)
  • Model updates freqs to (1, 2, 2)
  • And so on.
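
A sketch of the model/coder interaction in this example, again with exact fractions. The function `adaptive_interval` and its return shape are illustrative; the model here is a plain frequency dict that starts at 1 for every symbol and is incremented after each coded symbol.

```python
from fractions import Fraction


def adaptive_interval(text, alphabet):
    """Return (low, high, freqs) after coding text adaptively."""
    freq = {sym: 1 for sym in alphabet}  # initial frequencies, never 0
    low, width = Fraction(0), Fraction(1)
    for sym in text:
        total = sum(freq.values())
        below = 0                        # accumulated freqs up to, not including, sym
        for other in alphabet:
            if other == sym:
                break
            below += freq[other]
        # The model passes (below, below + freq[sym], total) to the coder.
        low += width * Fraction(below, total)
        width *= Fraction(freq[sym], total)
        freq[sym] += 1                   # model update, mirrored by the decoder
    return low, low + width, freq


print(adaptive_interval("bc", "abc"))
```

Coding b and then c gives the interval [7/12, 8/12) and the frequencies (1, 2, 2), as on the slide.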

20
  • Practical considerations
  • Interval ends are held as binary numbers
  • The number of bits in the number to be stored is proportional to
    the size of the file: impractical to compute it all
    before storing
  • Solution: as the interval gets small, the first bit of any
    number in it is determined. This bit is written
    by the coder into the compressed file, and removed from
    the interval ends (= mult by 2)
  • Example: in the 1st example, when the interval becomes
    [0, 27/64) = [0.000000, 0.011011) (after 3 a's), output
    0 and update to [0.00000, 0.11011)
  • Decoder sees the 1st 0: knows the first three are
    a's,
  • computes the interval, throws away the 0

21
  • Practically, the (de)coder maintains a word for each
    number; computations are approximate
  • Some (very small) loss of compression
  • Both sides must perform the same approximations at
    the same time
  • Initial assignment of freq. 1 to
    low-freq. symbols?
  • Solution: assign 1 to all symbols not seen so far
  • If k were not seen yet, and one now occurs, give it
    1/k
  • Since the decoder does not know when to stop, file
    length must be stored in compressed file

22
  • Frequencies data structure: needs to allow both
    update, and sums of the form f1 + f2 + ... + fk
  • (expensive for large alphabets)
  • Solution: a tree-like structure
  • O(log n) accesses!

sum            binary  cell
f1             1       1
f1+f2          10      2
f3             11      3
f1+f2+f3+f4    100     4
f5             101     5
f5+f6          110     6
f7             111     7
f1+...+f8      1000    8
If k, the binary cell number, ends with i 0s, the
cell contains fk + f(k-1) + ... + f(k-2^i+1)
What is the algorithm to compute f1 + ... + fk?
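
The cell layout in the table is that of a binary indexed (Fenwick) tree. A minimal Python sketch, with illustrative names, of the two operations a frequency model needs: updating one frequency and summing a prefix, each touching O(log n) cells.

```python
class FenwickTree:
    """Cells 1..n; cell k covers the last 2^i frequencies up to k,
    where i is the number of trailing zero bits of k."""

    def __init__(self, n):
        self.n = n
        self.cell = [0] * (n + 1)        # index 0 is unused

    def add(self, k, delta):
        """Add delta to frequency f_k."""
        while k <= self.n:
            self.cell[k] += delta
            k += k & -k                  # next cell whose range contains k

    def prefix_sum(self, k):
        """Return f_1 + ... + f_k."""
        total = 0
        while k > 0:
            total += self.cell[k]
            k -= k & -k                  # drop the lowest set bit
        return total


tree = FenwickTree(8)
for k, f in enumerate([3, 1, 4, 1, 5, 9, 2, 6], start=1):
    tree.add(k, f)
print(tree.prefix_sum(5), tree.cell[6], tree.cell[8])
```

Here `k & -k` isolates the lowest set bit of k, which is 2^i for a cell number ending in i zeros.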
23
Dictionary-based methods
  • Huffman is a dictionary-based method
  • Each symbol in dictionary has associated code
  • But, adaptive Huffman is not practical
  • Famous adaptive methods
  • LZ77, LZ78 (Lempel-Ziv)
  • We describe LZ77 (basis of gzip in Unix)

24
  • Basic idea: the dictionary is the set of sequences of
    symbols in a window before the current position
  • (typical window size )
  • When the coder is at position p, the window is the symbols
    in positions p-w, ..., p-1
  • Coder searches for the longest seq. that matches the
    one at position p
  • If found, of length l, put (n,l) into file (n:
    offset, l: length), and move forward l positions,
  • else output the current symbol

25
  • Example
  • input is a b a a b a b b ... b (11 b's)
  • Code is a b (2,1) (1,1) (3,2) (2,1) (1,10)
  • Decoding: a → a, b → b, (2,1) → a, (1,1) → a,
  • current known string: a b a a
  • (3,2) → b a, (2,1) → b
  • current known string: a b a a b a b
  • (1,10) → go back one step, to b
  • do 10 times: output the scanned
    symbol,
    advance one
  • (note: run-length encoding hides here)
  • Note: decoding is extremely fast!
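
The decoding steps of this example, as a short Python sketch. The token format (a plain string for a literal symbol, a tuple (offset, length) for a pair) is an assumption made for illustration; it is not how gzip stores its output.

```python
def lz77_decode(tokens):
    """Decode a list of literals and (offset, length) pairs."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            offset, length = tok
            for _ in range(length):
                # Copy one symbol at a time, so a pair may overlap
                # the text it is producing (the run-length case).
                out.append(out[-offset])
        else:
            out.append(tok)
    return "".join(out)


tokens = ["a", "b", (2, 1), (1, 1), (3, 2), (2, 1), (1, 10)]
print(lz77_decode(tokens))
```

The pair (1,10) copies from a position only one step back, so each copied b becomes the source of the next one.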

26
  • Practical issues
  • 1. Maintenance of window: use a cyclic buffer
  • 2. Searching for the longest matching word
  • → expensive coding
  • 3. How to distinguish a pair (n,l) from a symbol?
  • 4. Can we save on the space for (n,l)?
  • The gzip solution for 2-4:
  • 2: a hash table of 3-sequences, with lists of
    positions where a sequence starting with them
    exists (what about short matches?)
  • An option: limit the search in the list (saves
    time)
  • Does not always find the longest match, but the loss
    is very small

27
  • 3: one bit suffices (but see below)
  • 4: offsets are integers in the range 1..2^k; often
    smaller values are more frequent
  • Semi-static solution (gzip):
  • Divide file into segments of 64K; for each:
  • Find the offsets used and their frequencies
  • Code using canonical Huffman
  • Do the same for lengths
  • Actually, add symbols (issue 3) to the set of
    lengths, code together using one code, and put
    this code in the file before the offset code (why?)

28
  • One last issue (for all methods): synchronization
  • Assume you want to start decoding in mid-file?
  • E.g. a db of files, coded using one code
  • Bit-based addresses for the files: these
    addresses occur in many ILs, which are loaded to
    MM.
  • 32 bits/address is OK, 64 bits/address may be costly
  • Byte/word-based addresses allow for much larger
    dbs. It may even pay to use addresses based on
    k-word blocks
  • How does one synchronize?

29
  • Solution: fill the last block with 01...1
  • if the code fills the last block, add a
    block
  • Since file addresses/lengths are known, filling
    can be removed
  • Does this work for Huffman? Arithmetic? LZ77?
  • What is the cost?

30
  • Summary of file compression
  • Large dbs → compression helps reduce storage
  • Fast query processing requires synchronization
  • and fast decoding
  • Db is often given, so statistics can be collected:
  • semi-static is a viable option
  • (plus regular re-organization)
  • Context-based methods give good compression, but
    expensive decoding
  • word-based Huffman is recommended (semi-static)
  • Construct two models: one for words, another for
    non-words

31
Compression of inverted lists
  • Introduction
  • Global, non-parametric methods
  • Global parametric methods
  • Local parametric methods

32
  • Introduction
  • Important parameters:
  • N: number of documents in db
  • n: number of (distinct) words
  • F: number of word occurrences
  • f: number of inverted list entries
  • The index contains:
  • lexicon (MM, if possible), ILs (disk)
  • IL compression helps to reduce size of index,
    cost of I/O

(in TREC, 99) N = 741,856, n = 535,346, F = 333,338,738, f = 134,994,414.
Total size 2G
33
  • The IL for a term t contains entries
  • An entry:
  • d (doc. id), f_d,t (in-doc freq.),
    in-doc positions, ...
  • For ranked answers, the entry is usually (d,
    f_d,t)
  • We consider each separately: independent
    compressions, can be composed

34
  • Compression of doc numbers
  • A sequence of numbers in 1..N: how can it be
    compressed?
  • Most methods use gaps:
  • g1 = d1, g2 = d2 - d1, ...
  • We know that the sum of the gaps is at most N
  • For long lists, most gaps are small.
  • These facts can be used for compression
  • (Each method has an associated probability
    distribution on the gaps, defined by the code
    lengths)
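
The gap transformation itself is a one-liner in each direction. A small Python sketch with illustrative function names; the document list is an arbitrary increasing sequence chosen for the demonstration.

```python
from itertools import accumulate


def to_gaps(docs):
    """Turn increasing doc numbers d1 < d2 < ... into gaps g1, g2, ..."""
    return [d - prev for prev, d in zip([0] + docs[:-1], docs)]


def from_gaps(gaps):
    """Inverse: running sums of the gaps give back the doc numbers."""
    return list(accumulate(gaps))


docs = [3, 8, 9, 11, 12, 13, 18]
gaps = to_gaps(docs)
print(gaps, sum(gaps))
```

The gaps sum to the last document number, which is why their total can never exceed N.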

35
  • Global, non-parametric methods
  • Binary coding
  • represent each gap by a fixed-length binary
    number
  • Code length for g: ceil(log2 N)
  • Probability: uniform distribution, p(g) = 1/N

36
  • Unary coding
  • represent each g > 0 by g-1 digits 1, then 0
  • 1 -> 0, 2 -> 10, 3 -> 110, 4 -> 1110, ...
  • Code length for g: g
  • → Worst case for the sum of lengths: N (hence for all ILs: nN)
  • is this a nice bound?
  • P(g) = 2^(-g)
  • Exponential decay; if it does not hold in practice
  • → compression penalty

37
  • Gamma (γ) code
  • a number g is represented by
  • Prefix: unary code for 1 + floor(log2 g)
  • Suffix: binary code, with floor(log2 g)
    digits, for g - 2^floor(log2 g)
  • Examples
  • (Why not ?)

38
  • Delta (δ) code
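
A Python sketch of the unary, gamma and delta codes, returning bit strings for readability. The delta slide carries no definition in this transcript, so the sketch assumes the usual one: like gamma, but the prefix 1 + floor(log2 g) is itself written in gamma instead of unary. Function names are illustrative.

```python
def unary(g):
    """g-1 ones followed by a zero (g >= 1)."""
    return "1" * (g - 1) + "0"


def _suffix(g):
    """g - 2^k in exactly k binary digits, where k = floor(log2 g)."""
    k = g.bit_length() - 1
    return format(g - (1 << k), "b").zfill(k) if k else ""


def gamma(g):
    k = g.bit_length() - 1               # floor(log2 g)
    return unary(k + 1) + _suffix(g)


def delta(g):
    k = g.bit_length() - 1
    return gamma(k + 1) + _suffix(g)     # assumed definition, see lead-in


for g in (1, 2, 3, 4, 9):
    print(g, unary(g), gamma(g), delta(g))
```

The gamma code of g takes 1 + 2 floor(log2 g) bits, so it grows logarithmically where unary grows linearly.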

39
  • Interim summary
  • We have codes with probability distributions
  • Q: can you prove that the (exact) formulas for
    probabilities for gamma, delta sum to 1?

40
  • Golomb code
  • Semi-static, uses db statistics
  • global, parametric code
  • Select a basis b (based on db statistics; later)
  • g > 0 → we represent g-1
  • Prefix: let q = (g-1) div b (integer
    division)
  • represent, in unary, q+1
  • Suffix: the remainder is (g-1) - qb (in 0..b-1)
  • represent by a binary tree code
  • - some leaves at distance floor(log2 b)
  • - the others at distance ceil(log2 b)

41
  • The binary tree code
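  • cut 2j leaves from the full binary tree of depth
    k, where k = ceil(log2 b) and j = 2^k - b
  • assign leaves, in order, to the values in
    0..b-1
  • Example: b = 6

[Figure: the code tree for b = 6, with leaves labelled 0, 1, 2, 3, 4, 5 in order]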
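
A sketch of a Golomb encoder in Python, returning a bit string. It assumes q = (g-1) div b for the prefix and, for the remainder, the binary tree code with k = ceil(log2 b) and j = 2^k - b short leaves: the first j remainders get k-1 bits, the others k bits. The name `golomb` is illustrative.

```python
def golomb(g, b):
    """Golomb code of g >= 1 with basis b >= 1, as a bit string."""
    q, r = divmod(g - 1, b)
    prefix = "1" * q + "0"               # q+1 in unary
    k = (b - 1).bit_length()             # ceil(log2 b)
    j = (1 << k) - b                     # remainders that get a short code
    if r < j:
        value, width = r, k - 1
    else:
        value, width = r + j, k          # skip the values used by short codes
    suffix = format(value, "b").zfill(width) if width else ""
    return prefix + suffix


for g in range(1, 10):
    print(g, golomb(g, 6))
```

For b = 6 the remainders 0..5 get the suffixes 00, 01, 100, 101, 110, 111, i.e. the leaves of the cut tree in order. With b = 1 the code degenerates to unary.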
42
  • Summary of Golomb code
  • Exponential decay like unary; slower rate,
    affected by b
  • Q: what is the underlying theory?
  • Q: how is b chosen?

43
  • Infinite Huffman Trees
  • Example: Consider P(i) = 2^(-i), i = 1, 2, ...
  • The code (*): 0, 10, 110, 1110, ...
  • seems natural, but the Huffman algorithm is not
    applicable! (why?)
  • For each m, consider the (finite)
    m-approximation
  • each has a Huffman tree: code 0, 10, ..., 1...10, 1...1
  • the code for m+1 refines that of m
  • The sequence of codes converges to (*)

44
[Figure: Huffman trees of successive approximations. Approximation 1 (probabilities 1/2, 1/2): code words 0, 1. Approximation 2: code words 0, 10, 11. A later approximation: code words 0, 10, 110, 1110, 1111.]
45
  • A more general approximation scheme
  • Given the sequence
  • An m-approximation, with skip b, is the finite
    sequence where
  • for example, b = 3

[Figure: the approximated tail]
46
  • Fact: refining the m-approx. by splitting
    to and gives the
    (m+1)-approx.
  • A sequence of m-approximations is good if
  • (*) are the smallest in the
    sequence,
  • so they are the 1st pair merged by Huffman
    (why is this important?)
  • (*) depends on the and on b

47
  • Let -- the Bernoulli
    distribution
  • A decreasing sequence
  • ?
  • to prove (*), need to show
  • For which b do they hold?

48
(No Transcript)
49
  • We select < on the right (useful later)
  • has a unique solution
  • To solve, from the left side, we obtain
  • Hence the solution is (b is an integer)
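
One way to write out the inequality and its solution, as a sketch. It assumes the geometric gap probabilities P(g) = (1-p)^(g-1) p and the two-sided condition usually quoted for this construction, with <= on the left and < on the right. Both are assumptions here, chosen because they lead to the approximation b ≈ (log 2)/p used for TREC later in the deck.

```latex
% Assumed condition on the skip b, with P(g) = (1-p)^{g-1} p:
\[
  (1-p)^{b} + (1-p)^{b+1} \;\le\; 1 \;<\; (1-p)^{b-1} + (1-p)^{b}
\]
% Left side: (1-p)^b (2-p) \le 1, and \log(1-p) < 0, so
\[
  b \;\ge\; \frac{\log(2-p)}{-\log(1-p)}
\]
% Right side: (1-p)^{b-1} (2-p) > 1, so
\[
  b - 1 \;<\; \frac{\log(2-p)}{-\log(1-p)}
\]
% Exactly one integer satisfies both bounds:
\[
  b \;=\; \left\lceil \frac{\log(2-p)}{-\log(1-p)} \right\rceil
\]
```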

50
  • Next: what do these Huffman trees look like?
  • Start with 0-approx.
  • Facts:
  • 1. A decreasing sequence (so last two are smallest)
  • 2. (when b > 3)
  • follows from
  • and (*)
  • 3. Previous two properties are preserved when
    last two are replaced by their sum
  • 4. The Huffman tree for the sequence assigns to
    codes of lengths
    of same cost as the Golomb code for remainders
  • Proof induction on b

51
  • Now, expand the approximations, to obtain
    infinite tree
  • This is the Golomb code, (with places of
    prefix/suffix exchanged)!!

[Figure: the infinite tree, with edges labelled 0 and 1]
52
  • Last question:
  • where do we get p, and why Bernoulli?
  • Assume equal probability p for t to be in d
  • For a given t, the probability of the gap g from
    one doc to the next is then (1-p)^(g-1) p
  • For p: there are f pairs (t, d), estimate p by f/(N n)
  • Since N is large, a reasonable estimate

53
  • For TREC:
  • To estimate b for a
    small p:
  • log(2-p) ≈ log 2, log(1-p) ≈ -p
  • b ≈ (log 2)/p ≈ 0.69 nN/f = 1917
  • end of (global) Golomb
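
A small Python sketch comparing the small-p approximation with an exact value. The exact formula used, b = ceil(log(2-p) / -log(1-p)), is an assumption: it is the closed form that the two approximations on this slide suggest, and the function names are illustrative.

```python
import math


def golomb_b(p):
    """Assumed exact choice of the Golomb basis for gap probability p."""
    return math.ceil(math.log(2 - p) / -math.log(1 - p))


def golomb_b_approx(p):
    """Small-p approximation: b is about (log 2) / p."""
    return math.log(2) / p


for p in (0.5, 0.1, 0.01, 0.001):
    print(p, golomb_b(p), round(golomb_b_approx(p), 1))
```

For small p the two values agree to within one unit, which is why 0.69/p is a convenient estimate for large collections.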

54
  • Global observed frequency (a global method)
  • Construct all ILs; collect statistics on
    frequencies of gaps
  • Construct canonical Huffman tree for gaps
  • The model/tree needs to be stored
  • (gaps are in 1..N; for TREC this is 3/4M gap
    values → storage overhead may not be so large)
  • Practically, not far from gamma, delta
  • But local methods are better

55
  • Local (parametric) methods
  • Coding of IL(t) is based on statistics of IL(t)
  • Local observed frequency:
  • Construct canonical Huffman for IL(t) based on
    its gap frequencies
  • Problem: in small ILs the number of distinct gaps is
    close to the number of gaps
  • → size of model is close to size of compressed data
  • Example: 25 entries, 15 gap values
  • Model: 15 gaps, 15 lengths (or freqs)
  • Way out: construct a model for groups of ILs
  • (see book for details)

56
  • Local Bernoulli/Golomb
  • Assumption: f_t, the number of entries of IL(t), is known
  • (to coder and decoder)
  • Estimate p and b; construct
    Golomb
  • Note:
  • Large f_t → larger p → smaller b → code gets
    close to unary (reasonable, many small gaps)
  • Small f_t → large b → most codes take about log b bits
  • For example: f_t = 2 (one gap) → b = 0.69N
  • for a gap < 0.69N, code in log(0.69N) bits
  • for a larger gap, one more bit

57
  • Interpolative coding
  • Uses the original d's, not the g's
  • Let f = f_t, assume the d's are stored in L[0, ..., f-1]
  • (each entry is at most N)
  • Standard binary for the middle d, with the number of bits
    determined by its range
  • Continue as in binary search: each d in binary,
    with the number of bits determined by the modified range

58
  • Example: L = [3, 8, 9, 11, 12, 13, 18] (f = 7), N = 20
  • h ← 7 div 2 = 3, L[3] = 11 (4th d)
  • smallest d is 1, and there are 3 to the left of
    L[3]
  • largest d is 20, and there are 3 to the right of L[3]
  • size of interval is (20-3) - (1+3) = 17 - 4 = 13
  • → code 11 in 4 bits
  • For the sub-list left of 11: 3, 8, 9
  • h ← 3 div 2 = 1, L[1] = 8
  • bounds: lower 1+1 = 2, upper 10-1 = 9
  • code using 3 bits
  • For L[2] = 9, range is 9..10, use 1 bit
  • For the sub-list right of 11: do on board
  • (note the element that is coded in 0 bits!)
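
The recursion can be traced with a short Python sketch. It does not emit bits; it lists, in coding order, each value with the range it must lie in and the number of bits that range needs, assuming ceil(log2(range size)) bits per value. The function name and the tuple layout are illustrative.

```python
def interpolative_plan(L, lo, hi, out=None):
    """Append (value, low, high, bits) for each element of the sorted
    list L, whose values are known to lie in lo..hi."""
    if out is None:
        out = []
    if not L:
        return out
    h = len(L) // 2
    low = lo + h                         # h smaller values must fit below
    high = hi - (len(L) - h - 1)         # and the larger ones above
    bits = (high - low).bit_length()     # ceil(log2(high - low + 1))
    out.append((L[h], low, high, bits))
    interpolative_plan(L[:h], lo, L[h] - 1, out)
    interpolative_plan(L[h + 1:], L[h] + 1, hi, out)
    return out


for row in interpolative_plan([3, 8, 9, 11, 12, 13, 18], 1, 20):
    print(row)
```

In the right sub-list the value 12 is squeezed between 11 and 13, so its range is 12..12 and it costs 0 bits.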

59
  • Advantages:
  • Relatively easy to code, decode
  • Very efficient for clusters (a word that occurs
    in many documents close to each other)
  • Disadvantage: more complex to implement,
    requires a stack
  • And cost of decoding is a bit more than Golomb
  • Summary of methods: show Table 3.8

60
  • An entry in IL(t) also contains f_d,t: the freq. of t
    in d
  • Compression of f_d,t:
  • In TREC, F/f ≈ 2.7 → these are small numbers
  • Unary: total overhead is F bits
  • Cost per entry is F/f (for TREC: 2.7)
  • Gamma: shorter than unary, except for 2, 4
  • (For TREC: 2.13)
  • It does not pay to take on the complexity of choosing another code
  • Total cost of compression of IL: 8-9 bits/entry