1
Indexing
2
Overview of the Talk
  • Inverted File Indexing
  • Compression of inverted files
  • Signature files and bitmaps
  • Comparison of indexing methods
  • Conclusion

3
Inverted File Indexing
  • An inverted file index
  • contains a list of the terms that appear in the
    document collection (called the lexicon or
    vocabulary)
  • and, for each term in the lexicon, stores a list
    of pointers to all occurrences of that term in
    the document collection. This list is called an
    inverted list.
  • The granularity of the index determines how
    accurately it represents the location of a word
  • A coarse-grained index requires less storage, but
    more query processing to eliminate false matches
  • A word-level index enables queries involving
    adjacency and proximity, but has higher space
    requirements

4
Inverted File Index Example
  Term       Inverted list
  -------------------------
  cold       <2; 1, 4>
  days       <2; 3, 6>
  hot        <2; 1, 4>
  in         <2; 2, 5>
  it         <2; 4, 5>
  like       <2; 4, 5>
  nine       <2; 3, 6>
  old        <2; 3, 6>
  pease      <2; 1, 2>
  porridge   <2; 1, 2>
  pot        <2; 2, 5>
  some       <2; 4, 5>
  the        <2; 2, 5>

  Doc  Text
  ---------
  1    Pease porridge hot, pease porridge cold,
  2    Pease porridge in the pot,
  3    Nine days old.
  4    Some like it hot, some like it cold,
  5    Some like it in the pot,
  6    Nine days old.

Notation:
  N = number of documents (6)
  n = number of distinct terms (13)
  f = number of index pointers (26)
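To make the example concrete, here is a minimal Python sketch that builds the document-level index above. The tokenization (lower-casing, stripping the commas and periods) is an assumption for illustration, not something the slides specify.

    from collections import defaultdict

    docs = {
        1: "Pease porridge hot, pease porridge cold,",
        2: "Pease porridge in the pot,",
        3: "Nine days old.",
        4: "Some like it hot, some like it cold,",
        5: "Some like it in the pot,",
        6: "Nine days old.",
    }

    def build_index(docs):
        # term -> set of documents containing it (document-level granularity)
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for word in text.lower().replace(",", " ").replace(".", " ").split():
                index[word].add(doc_id)
        return {t: sorted(ds) for t, ds in sorted(index.items())}

    for term, postings in build_index(docs).items():
        print(f"{term:9s} <{len(postings)}; {', '.join(map(str, postings))}>")
    # prints e.g.  cold      <2; 1, 4>  -- matching the table above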
5
Inverted File Compression
Each inverted list has the form ⟨f_t; d_1, d_2, ..., d_{f_t}⟩, where f_t is the number of documents containing term t.
A naïve representation results in a storage overhead of f × ⌈log₂ N⌉ bits.
6
Text Compression
  • Two classes of text compression methods
  • Symbolwise (or statistical) methods
  • Estimate probabilities of symbols - modeling step
  • Code one symbol at a time - coding step
  • Use shorter codes for the most likely symbols
  • Usually based on either arithmetic or Huffman
    coding
  • Dictionary methods
  • Replace fragments of text with a single code word
    (typically an index to an entry in the
    dictionary)
  • e.g. Ziv-Lempel coding, which replaces strings of
    characters with a pointer to a previous
    occurrence of the string (see the sketch below)
  • No probability estimates needed
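The Ziv-Lempel idea can be illustrated with a toy LZ77-style encoder. The (offset, length, next-char) triple output and the window size are illustrative assumptions; real Ziv-Lempel variants differ in their window handling and output format.

    def lz77_encode(text, window=4096):
        # Toy LZ77: replace repeated strings with a pointer (offset, length)
        # to a previous occurrence, plus the next literal character.
        # No probability estimates are needed.
        i, out = 0, []
        while i < len(text):
            best_off, best_len = 0, 0
            for j in range(max(0, i - window), i):
                k = 0
                while i + k < len(text) - 1 and text[j + k] == text[i + k]:
                    k += 1
                if k > best_len:
                    best_off, best_len = i - j, k
            out.append((best_off, best_len, text[i + best_len]))
            i += best_len + 1
        return out

    # the second "pease porridge " is emitted as a single back-pointer
    print(lz77_encode("pease porridge hot, pease porridge cold"))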

7
Models
8
Huffman Coding Example
9
Huffman Coding Example
10
Huffman Coding Example
  Symbol   Code   Probability
  A        0000   0.05
  B        0001   0.05
  C        001    0.1
  D        01     0.2
  E        10     0.3
  F        110    0.2
  G        111    0.1
[Huffman tree: the seven leaves above are merged pairwise, lowest probabilities first, through internal nodes of weight 0.1, 0.2, 0.3, 0.4 and 0.6 up to the root 1.0; each left branch is labelled 0 and each right branch 1, so a symbol's code is the path from the root to its leaf]
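The tree can be reproduced with the standard greedy algorithm: repeatedly merge the two least-probable nodes. A sketch follows; tie-breaking among equal probabilities may swap which of D and F receives the 2-bit code, but the average length (2.6 bits/symbol) is the same.

    import heapq
    from itertools import count

    def huffman_code(probs):
        # Build a Huffman tree bottom-up; returns {symbol: bitstring}.
        tick = count()  # tie-breaker so heap tuples never compare trees
        heap = [(p, next(tick), sym) for sym, p in probs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, left = heapq.heappop(heap)   # two least-probable nodes
            p2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (p1 + p2, next(tick), (left, right)))
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):         # internal node: 0 left, 1 right
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:
                codes[node] = prefix or "0"
        walk(heap[0][2], "")
        return codes

    probs = {"A": 0.05, "B": 0.05, "C": 0.1, "D": 0.2,
             "E": 0.3, "F": 0.2, "G": 0.1}
    print(huffman_code(probs))  # lengths match the table: 4,4,3,2,2,3,3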
11
Huffman Coding Conclusions
12
Arithmetic Coding
String: bccb    Alphabet: {a, b, c}
Code: 0.64
13
Arithmetic Coding Conclusions
  • High-probability events do not reduce the size of
    the interval in the next step very much, whereas
    low-probability events do.
  • A small final interval requires many digits to
    specify a number guaranteed to be in the
    interval.
  • The number of bits required is proportional to
    the negative logarithm of the size of the
    interval.
  • A symbol s of probability Pr[s] contributes
    −log₂ Pr[s] bits to the output (sketched below).
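The interval narrowing can be seen in a few lines of Python. The symbol probabilities below are assumed for illustration; the model behind the bccb example on slide 12 is not shown in this transcript.

    import math

    model = {"a": 0.2, "b": 0.3, "c": 0.5}  # assumed static model

    def encode_interval(message, model):
        # Narrow [low, high) once per symbol of the message.
        low, high = 0.0, 1.0
        for sym in message:
            span = high - low
            cum = 0.0
            for s, p in model.items():
                if s == sym:
                    low, high = low + span * cum, low + span * (cum + p)
                    break
                cum += p
        return low, high

    low, high = encode_interval("bccb", model)
    # Any number in [low, high) identifies the message; it needs about
    # -log2(high - low) bits, i.e. the sum of -log2 Pr[s] over the symbols.
    print(low, high, -math.log2(high - low))  # about 5.47 bits here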

14
Inverted File Compression
Each inverted list has the form ⟨f_t; d_1, d_2, ..., d_{f_t}⟩.
A naïve representation results in a storage overhead of f × ⌈log₂ N⌉ bits.
This can also be stored as ⟨f_t; d_1, d_2 − d_1, d_3 − d_2, ..., d_{f_t} − d_{f_t−1}⟩.
Each difference is called a d-gap. Since the d-gaps of a list sum to at most N, each pointer requires fewer than ⌈log₂ N⌉ bits on average.
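A sketch of the d-gap transform and its inverse, using a made-up postings list:

    def to_dgaps(postings):
        # <f; d1, d2, ...>  ->  <f; d1, d2-d1, d3-d2, ...>
        return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

    def from_dgaps(gaps):
        # Invert the transform with a running (prefix) sum.
        out, total = [], 0
        for g in gaps:
            total += g
            out.append(total)
        return out

    postings = [3, 8, 9, 11, 12, 13, 17]   # illustrative list
    gaps = to_dgaps(postings)              # [3, 5, 1, 2, 1, 1, 4]
    assert from_dgaps(gaps) == postings
    print(gaps)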
15
Methods for Inverted File Compression
  • Methods for compressing d-gap sizes can be
    classified into
  • global: each list is compressed using the same
    model
  • local: the model for compressing an inverted list
    is adjusted according to some parameter, like the
    frequency of the term
  • Global methods can be divided into
  • non-parameterized: the probability distribution
    for d-gap sizes is predetermined
  • parameterized: the probability distribution is
    adjusted according to certain parameters of the
    collection
  • By definition, local methods are parameterized.

16
Non-parameterized models
Unary code: an integer x > 0 is coded as (x − 1) 1-bits followed by a 0-bit.
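A sketch of the unary code, together with the Elias γ code that the later slides refer to (γ codes the length of x's binary representation in unary, then the remaining bits of x in binary):

    def unary(x):
        # x > 0: (x-1) one-bits then a zero-bit; implied Pr[x] = 2^-x
        return "1" * (x - 1) + "0"

    def gamma(x):
        # Elias gamma; implied probability roughly 1/(2x^2)
        b = bin(x)[2:]                 # binary representation of x
        return unary(len(b)) + b[1:]   # length in unary + bits after top bit

    for x in (1, 2, 3, 4, 9):
        print(x, unary(x), gamma(x))
    # unary(9) needs 9 bits; gamma(9) needs 7: '1110' + '001'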
17
Non-parameterized models
Each code has an underlying probability distribution, which can be derived using Shannon's formula: a symbol s of probability Pr[s] carries I(s) = −log₂ Pr[s] bits of information, so a codeword of length ℓ implies a probability of 2^−ℓ.
The probability assumed by the unary code (2^−x) is too small for typical d-gaps.
18
Global parameterized models
Probability that a random document contains a random term: p = f / (N × n).
Assuming a Bernoulli process, the d-gaps follow a geometric distribution, Pr[x] = (1 − p)^(x−1) p, which can be coded with:
  • Arithmetic coding
  • Huffman-style coding (Golomb coding)
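A sketch of Golomb coding under these assumptions: the quotient is coded in unary and the remainder in minimal (truncated) binary, with the common b ≈ 0.69/p rule of thumb for the parameter. The value of p below uses the numbers from the slide 4 example (f = 26, N = 6, n = 13).

    import math

    def golomb(x, b):
        # Golomb code for x >= 1: quotient in unary, remainder in
        # minimal (truncated) binary.
        q, r = (x - 1) // b, (x - 1) % b
        out = "1" * q + "0"                  # quotient q in unary
        if b > 1:
            k = math.ceil(math.log2(b))
            cutoff = (1 << k) - b            # first `cutoff` remainders short
            if r < cutoff:
                out += format(r, "b").zfill(k - 1)
            else:
                out += format(r + cutoff, "b").zfill(k)
        return out

    p = 26 / (6 * 13)            # Bernoulli model: p = f / (N * n)
    b = max(1, round(0.69 / p))  # b ~ 0.69 * mean d-gap; b = 2 here
    print(b, [golomb(x, b) for x in (1, 2, 3, 4, 5)])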
19
Global observed frequency model
  • Use the observed frequencies of the exact d-gap
    values with arithmetic or Huffman coding
  • Only slightly better than the γ or δ code
  • Reason: pointers are not scattered randomly in
    the inverted file
  • Need local methods for any improvement

20
Local methods
  • Local Bernoulli
  • Use a different p for each inverted list (see the
    sketch after this list)
  • Use a γ code for storing each f_t
  • Skewed Bernoulli
  • The local Bernoulli model is bad for clusters
  • Use a cross between γ and Golomb, with b = median
    gap size
  • Need to store b (use a γ representation)
  • This is still a static model
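A sketch of the per-list parameter choice for the local Bernoulli method, assuming the same b ≈ 0.69 × (mean d-gap) approximation as above:

    def local_golomb_b(N, f_t):
        # Per-list Golomb parameter: each list gets its own density p.
        p = f_t / N                      # density of this term's list
        return max(1, round(0.69 / p))   # b ~ 0.69 * mean d-gap

    # Illustration with an assumed collection of N = 1000 documents:
    for f_t in (2, 10, 100, 500):
        print(f_t, local_golomb_b(1000, f_t))
    # rare terms (long gaps) get a large b; common terms a small one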

21
Interpolative code
Consider an inverted list in which documents 8, 9, 11, 12 and 13 form a cluster.
We can do better than coding each d-gap independently by using a minimal binary code, as sketched below.
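A sketch of binary interpolative coding: the middle document number is coded first with a minimal binary code over its feasible range, and the two halves are coded recursively within narrowed ranges. The list below contains the cluster from the slide, and the collection is assumed to have N = 20 documents (an illustrative choice). Note that document 12 needs zero bits.

    import math

    def minbin(x, n):
        # Minimal (truncated) binary code for x in [0, n-1]; 0 bits if n == 1.
        if n == 1:
            return ""
        k = math.ceil(math.log2(n))
        cutoff = (1 << k) - n          # first `cutoff` values get k-1 bits
        if x < cutoff:
            return format(x, "b").zfill(k - 1)
        return format(x + cutoff, "b").zfill(k)

    def interpolative(lst, lo, hi):
        # Encode sorted doc numbers within [lo, hi]: middle first, then halves.
        if not lst:
            return []
        m = len(lst) // 2
        x = lst[m]
        # x is constrained to [lo + m, hi - (len(lst) - m - 1)]
        a, b = lo + m, hi - (len(lst) - m - 1)
        bits = minbin(x - a, b - a + 1)
        return ([(x, bits)] + interpolative(lst[:m], lo, x - 1)
                            + interpolative(lst[m + 1:], x + 1, hi))

    for x, bits in interpolative([3, 8, 9, 11, 12, 13, 17], 1, 20):
        print(x, repr(bits))   # the clustered documents get very short codes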
22
Performance of index compression methods
Compression of inverted files in bits per pointer
23
Signature Files
  • Each document is given a signature that captures
    its content (see the sketch after this list)
  • Hash each document term to get several hash
    values
  • Bits corresponding to those values are set to 1
  • Query processing
  • Hash each query term to get several hash values
  • If a document has all bits corresponding to those
    values set to 1, it may contain the query term
  • False matches are reduced by
  • setting several bits for each term
  • making the signatures sufficiently long
  • A naïve representation may have to read the entire
    signature file for each query term
  • Use bit-slicing to save on disk transfer time
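A sketch of signature construction and probing; the signature width (64 bits) and the number of bits set per term (3) are arbitrary illustrative choices.

    import hashlib

    WIDTH, K = 64, 3   # assumed signature width and bits per term

    def term_bits(term):
        # Derive K bit positions for a term from a hash digest.
        digest = hashlib.sha1(term.encode()).digest()
        return {digest[i] % WIDTH for i in range(K)}

    def signature(text):
        sig = 0
        for term in text.lower().split():
            for b in term_bits(term):
                sig |= 1 << b          # superimpose the term's bits
        return sig

    def maybe_contains(sig, term):
        # True -> possible match (may be false); False -> definitely absent.
        return all(sig >> b & 1 for b in term_bits(term))

    sig = signature("pease porridge hot")
    print(maybe_contains(sig, "porridge"))  # True
    print(maybe_contains(sig, "nine"))      # almost certainly False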

24
Signature files Conclusion
  • Design involves many tradeoffs
  • wide, sparse signatures reduce the number of
    false matches
  • short, dense signatures require more disk
    accesses
  • For reasonable query times, signature files
    require more space than a compressed inverted
    file
  • Inefficient for documents of varying sizes
  • Blocking makes simple queries difficult to answer
  • Text is not random

25
Bitmaps
  • Simple representation: for each term in the
    lexicon, store a bitvector of length N. A bit is
    set if and only if the corresponding document
    contains the term.
  • Efficient for Boolean queries (see the sketch
    after this list)
  • Enormous storage requirement, even after
    removing stop words
  • Bitmaps have nevertheless been used to represent
    common words
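A sketch of the bitmap representation, reusing the pease-porridge collection; a conjunctive query is a single bitwise AND over the term bitvectors.

    docs = {
        1: "pease porridge hot pease porridge cold",
        2: "pease porridge in the pot",
        3: "nine days old",
        4: "some like it hot some like it cold",
        5: "some like it in the pot",
        6: "nine days old",
    }

    # One N-bit integer per term: bit (doc-1) is set iff the document
    # contains the term.
    bitmap = {}
    for doc, text in docs.items():
        for term in text.split():
            bitmap[term] = bitmap.get(term, 0) | (1 << (doc - 1))

    # Boolean queries become bitwise operations on the vectors:
    hot_and_cold = bitmap["hot"] & bitmap["cold"]
    print([d for d in docs if hot_and_cold >> (d - 1) & 1])  # -> [1, 4]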

26
Compression of signature files and bitmaps
  • Signature files are already in compressed form
  • Decompression affects query time substantially
  • Lossy compression results in false matches
  • Bitmaps can be compressed by a significant amount

Example: the 64-bit bitmap
  0000 0010 0000 0011 1000 0000 0100 0000
  0000 0000 0000 0000 0000 0000 0000 0000
compresses to
  1100 | 0101 1010 | 0010 0011 1000 0100
(one bit per 16-bit block marking the nonzero blocks, then one bit per
nibble within each nonzero block, then the nonzero nibbles themselves;
the "|" separators are added here for readability)
27
Comparison of indexing methods
  • All indexing methods are variations of the same
    basic idea!!
  • Signature files and inverted files require an
    order of magnitude less secondary storage than
    bitmaps
  • Signature files cause unnecessary access to the
    document collection unless signature width is
    large
  • Signature files are disastrous when record
    lengths vary a lot
  • Advantages of signature files
  • no need to keep lexicon in memory
  • better for conjunctive queries involving common
    terms

28
Conclusion
  • For practical purposes, the best index
    compression algorithm is the local Bernoulli
    method (using Golomb coding)
  • In most practical situations, compressed inverted
    files are better than signature files and
    bitmaps, in terms of both space and query
    response time