Language-Model Based Text-Compression

Learn more at: https://nlp.stanford.edu

1
Language-Model Based Text-Compression
  • James Connor
  • Antoine El Daher

2
Compressing with Structure
  • Compression
      • Huffman
      • Arithmetic
      • Lempel-Ziv (LZ78, LZ77)
      • Most popular compression tools are based on LZ77
  • Exploiting structure
      • Our goal: incorporate prior knowledge about the
        structure of the input sequence

3
Perplexity and Entropy
  • The achievable compression ratio is bounded by the
    entropy of the sequence to be compressed
  • A low-perplexity language model is also a
    low-entropy distribution (perplexity is 2 raised to
    the cross-entropy in bits), so it gives shorter codes
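
To make the link between entropy, perplexity, and compressed size concrete, here is a minimal sketch. It uses an order-0 character model estimated from the text itself, which is an assumption made for illustration; the function names are hypothetical and not taken from the slides.

```python
import math
from collections import Counter

def entropy_bits(text: str) -> float:
    """Empirical order-0 entropy of the characters in `text`, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def perplexity(text: str) -> float:
    """Perplexity is 2 raised to the entropy measured in bits."""
    return 2 ** entropy_bits(text)

sample = "abracadabra"
h = entropy_bits(sample)
print(f"entropy: {h:.3f} bits/char, perplexity: {perplexity(sample):.3f}")
# No code that models characters independently can average fewer bits than this.
print(f"order-0 lower bound: {h * len(sample) / 8:.2f} bytes")
```

A better model (one that conditions on context) lowers the entropy estimate, and with it the bound on the compressed size.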

4
Character N-grams
  • Represent text as an nth-order Markov chain of
    characters
  • Maintain counts of n-grams
  • Build a library of Huffman tables based on these
    counts
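
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each n-gram is split into an (n-1)-character context plus the next character, one Huffman table is built per context, and the helper names `ngram_counts` and `huffman_table` are hypothetical.

```python
import heapq
from collections import Counter, defaultdict
from itertools import count

def ngram_counts(text, n):
    """Map each (n-1)-character context to a Counter of the characters that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, nxt = text[i:i + n - 1], text[i + n - 1]
        counts[context][nxt] += 1
    return counts

def huffman_table(freqs):
    """Build a prefix code {symbol: bitstring} from a {symbol: count} mapping."""
    if len(freqs) == 1:
        # A single symbol still needs one bit so the code is non-empty.
        return {next(iter(freqs)): "0"}
    tiebreak = count()  # keeps heap entries comparable when counts are equal
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        # Merge the two least frequent subtrees, prefixing their codes with 0 and 1.
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

text = "the theme of the thesis"
# The "library": one Huffman table per two-character context (trigram counts).
library = {ctx: huffman_table(c) for ctx, c in ngram_counts(text, 3).items()}
print(library["th"])
```

Characters that are frequent after a given context receive short codes in that context's table, which is where the gain over a single global Huffman table would come from.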

5
Compressing the File
  • Training
      • For each bigram in the training set, we keep a
        map of all the words that can follow it, along
        with their probabilities.
      • E.g. "to have" → (seen, 0.1), (been, 0.1),
        (UNK, 0.1), etc.
      • Then for each bigram, we build a Huffman tree.
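
A minimal sketch of the training map described above. The slides do not say how much probability UNK receives, so reserving a fixed pseudo-count per context is an assumption; `train_followers`, `unk_count`, and the `<UNK>` spelling are hypothetical names.

```python
from collections import Counter, defaultdict

UNK = "<UNK>"  # escape token; the spelling is an assumption

def train_followers(tokens, unk_count=1):
    """Map each bigram context to {next_word: probability}, reserving mass for UNK."""
    followers = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        followers[(w1, w2)][w3] += 1
    model = {}
    for context, counts in followers.items():
        counts[UNK] += unk_count  # assumed pseudo-count, not from the slides
        total = sum(counts.values())
        model[context] = {w: c / total for w, c in counts.items()}
    return model

tokens = "to have seen to have been".split()
model = train_followers(tokens)
print(model[("to", "have")])
```

Each per-context distribution (or the counts behind it) would then be handed to a Huffman builder to produce that context's code table.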

6
Compressing the File
  • Compressing
      • We go through the input file, using the Huffman
        trees from the training set to code each word
        based on the two preceding words.
      • If the trigram is unknown, we code the UNK token,
        then revert to a unigram model (also coded using
        Huffman).
      • If the unigram is unknown, we use a character-level
        Huffman code (trained on the training set) to
        code it.
  • Decompression works similarly: we mimic the same
    behavior
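
The backoff chain above can be sketched like this. The code tables are small hand-written prefix codes standing in for trained Huffman tables, and several details are assumptions the slides do not specify: the `<END>` end-of-word marker for the character-level code, skipping straight to the unigram table when the two-word context was never seen, and all function names.

```python
UNK = "<UNK>"  # escape symbol; the spelling is an assumption
END = "<END>"  # hypothetical end-of-word marker for the character-level code

def encode_word(word, context, trigram_codes, unigram_codes, char_codes):
    """Return the bitstring for `word` given the two preceding words."""
    bits = ""
    table = trigram_codes.get(context)
    if table is not None:
        if word in table:
            return table[word]
        bits = table[UNK]                 # signal the fallback to the decoder
    if word in unigram_codes:
        return bits + unigram_codes[word]
    bits += unigram_codes[UNK]            # second fallback: spell the word out
    return bits + "".join(char_codes[ch] for ch in word) + char_codes[END]

def read_symbol(bits, pos, table):
    """Decode one prefix-code symbol starting at bits[pos]; return (symbol, new_pos)."""
    inverse = {code: sym for sym, code in table.items()}
    end = pos + 1
    while bits[pos:end] not in inverse:
        if end >= len(bits):
            raise ValueError("truncated or invalid bitstream")
        end += 1
    return inverse[bits[pos:end]], end

def decode_word(bits, pos, context, trigram_codes, unigram_codes, char_codes):
    """Mirror encode_word: the same tables are tried in the same order."""
    table = trigram_codes.get(context)
    if table is not None:
        sym, pos = read_symbol(bits, pos, table)
        if sym != UNK:
            return sym, pos
    sym, pos = read_symbol(bits, pos, unigram_codes)
    if sym != UNK:
        return sym, pos
    chars = []
    while True:
        ch, pos = read_symbol(bits, pos, char_codes)
        if ch == END:
            return "".join(chars), pos
        chars.append(ch)

# Toy tables for illustration only.
trigram_codes = {("to", "have"): {"seen": "0", "been": "10", UNK: "11"}}
unigram_codes = {"seen": "00", "been": "01", "gone": "10", UNK: "11"}
char_codes = {"x": "00", "y": "01", "z": "10", END: "11"}

for w in ["seen", "gone", "zyx"]:
    b = encode_word(w, ("to", "have"), trigram_codes, unigram_codes, char_codes)
    print(w, b, decode_word(b, 0, ("to", "have"), trigram_codes, unigram_codes, char_codes))
```

Because the decoder sees the same two preceding words as the encoder, it can pick the same table at every step without any extra side information.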

7
Extensions
  • We use a sliding context window: while compressing
    a file, words seen in it have their counts
    incremented when they enter the window and
    decremented when they leave.
  • This lets us make better use of the local context
    in terms of trigrams/bigrams, and gives more
    representative weights.
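
A minimal sketch of the sliding-window bookkeeping, assuming plain word counts for brevity (the same idea would apply to counts keyed by bigram or trigram context). The class name, the `window_size` parameter, and the choice to drop entries whose count reaches zero are assumptions.

```python
from collections import Counter, deque

class SlidingWindowCounts:
    """Training counts adjusted by the words currently inside a recent-context window."""

    def __init__(self, base_counts, window_size):
        self.counts = Counter(base_counts)  # counts learned from the training set
        self.window = deque()
        self.window_size = window_size

    def observe(self, word):
        """Add `word` to the window; evict the oldest word once the window is full."""
        self.window.append(word)
        self.counts[word] += 1              # increment on entering the window
        if len(self.window) > self.window_size:
            old = self.window.popleft()
            self.counts[old] -= 1           # decrement on leaving the window
            if self.counts[old] <= 0:
                del self.counts[old]

sw = SlidingWindowCounts({"the": 5}, window_size=2)
for w in ["cat", "cat", "dog", "the"]:
    sw.observe(w)
print(dict(sw.counts), list(sw.window))
```

For decompression to stay in step, the decoder would have to apply exactly the same updates after each decoded word, and any code table derived from these counts would need to be rebuilt or adjusted as they change.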

8
Results
  • Competitive with Gzip