Title: Indexing
1 Indexing
2 Overview of the Talk
- Inverted File Indexing
- Compression of inverted files
- Signature files and bitmaps
- Comparison of indexing methods
- Conclusion
3 Inverted File Indexing
- Inverted file index
  - contains a list of the terms that appear in the document collection (called the lexicon or vocabulary)
  - and, for each term in the lexicon, stores a list of pointers to all occurrences of that term in the document collection. This list is called an inverted list.
- The granularity of an index determines how accurately it represents the location of a word
- A coarse-grained index requires less storage but more query processing to eliminate false matches
- A word-level index enables queries involving adjacency and proximity, but has higher space requirements
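The construction above can be sketched in a few lines (a minimal document-level index; the tokenization is illustrative, and a word-level index would also record in-document positions):

```python
# Build a document-level inverted index: term -> (f_t, sorted doc pointers).
from collections import defaultdict

def build_inverted_index(docs):
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs, start=1):
        for term in text.lower().replace(",", "").replace(".", "").split():
            postings[term].add(doc_id)
    # Each inverted list stores its length f_t and the sorted pointers.
    return {t: (len(ds), sorted(ds)) for t, ds in postings.items()}

docs = [
    "Pease porridge hot, pease porridge cold,",
    "Pease porridge in the pot,",
    "Nine days old.",
    "Some like it hot, some like it cold,",
    "Some like it in the pot,",
    "Nine days old.",
]
index = build_inverted_index(docs)
print(index["porridge"])   # -> (2, [1, 2])
```

This reproduces the example on the next slide: 13 distinct terms and 26 index pointers over 6 documents.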
4 Inverted File Index Example

Term       Inverted list
cold       ⟨2; 1, 4⟩
days       ⟨2; 3, 6⟩
hot        ⟨2; 1, 4⟩
in         ⟨2; 2, 5⟩
it         ⟨2; 4, 5⟩
like       ⟨2; 4, 5⟩
nine       ⟨2; 3, 6⟩
old        ⟨2; 3, 6⟩
pease      ⟨2; 1, 2⟩
porridge   ⟨2; 1, 2⟩
pot        ⟨2; 2, 5⟩
some       ⟨2; 4, 5⟩
the        ⟨2; 2, 5⟩

Doc  Text
1    Pease porridge hot, pease porridge cold,
2    Pease porridge in the pot,
3    Nine days old.
4    Some like it hot, some like it cold,
5    Some like it in the pot,
6    Nine days old.

Notation:
N = number of documents (6)
n = number of distinct terms (13)
f = number of index pointers (26)
5 Inverted File Compression
Each inverted list has the form ⟨f_t; d_1, d_2, …, d_{f_t}⟩.
A naïve representation results in a storage overhead of f · ⌈log N⌉ bits.
6 Text Compression
- Two classes of text compression methods:
- Symbolwise (or statistical) methods
  - Estimate probabilities of symbols (the modelling step)
  - Code one symbol at a time (the coding step)
  - Use shorter codes for the most likely symbols
  - Usually based on either arithmetic or Huffman coding
- Dictionary methods
  - Replace fragments of text with a single code word (typically an index to an entry in the dictionary)
  - e.g. Ziv-Lempel coding, which replaces strings of characters with a pointer to a previous occurrence of the string
  - No probability estimates needed
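A toy sketch of the Ziv-Lempel idea (the triple token format and window size are illustrative, not the exact scheme from the slides):

```python
# Toy LZ77-style compressor: emit (offset, length, next_char) triples,
# where offset/length point back into the already-seen text.
def lz77_compress(text, window=255):
    i, out = 0, []
    while i < len(text):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # Matches may overlap the current position (classic LZ77 trick).
            while i + k < len(text) - 1 and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        out.append((best_off, best_len, text[i + best_len]))
        i += best_len + 1
    return out

def lz77_decompress(tokens):
    s = []
    for off, length, ch in tokens:
        for _ in range(length):
            s.append(s[-off])   # copy from the previous occurrence
        s.append(ch)
    return "".join(s)

text = "pease porridge hot, pease porridge cold"
tokens = lz77_compress(text)
assert lz77_decompress(tokens) == text
```

The repeated phrase "pease porridge " is replaced by a single pointer to its first occurrence, which is exactly why no probability model is needed.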
7 Models
8 Huffman Coding Example
Symbol  Code  Probability
A       0000  0.05
B       0001  0.05
C       001   0.1
D       01    0.2
E       10    0.3
F       110   0.2
G       111   0.1

[Figure: the Huffman tree for this code; leaves A-G, internal node weights 0.1, 0.2, 0.3, 0.4, 0.6 and 1.0, edges labelled 0 and 1.]
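The code table above can be reproduced with the standard Huffman construction. Tie-breaking between equal weights may yield different codewords than the slide, but the expected code length, 2.6 bits per symbol here, is the same for any Huffman tree:

```python
# Standard Huffman construction over a probability table.
import heapq

def huffman_codes(probs):
    # Heap entries carry a unique counter so equal weights never
    # fall through to comparing the code dictionaries.
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        # Prefix 0 onto one subtree's codes and 1 onto the other's.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

probs = {"A": 0.05, "B": 0.05, "C": 0.1, "D": 0.2,
         "E": 0.3, "F": 0.2, "G": 0.1}
codes = huffman_codes(probs)
expected = sum(p * len(codes[s]) for s, p in probs.items())
print(round(expected, 2))  # -> 2.6
```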
11 Huffman Coding Conclusions
12 Arithmetic Coding
String: bccb; Alphabet: {a, b, c}
Code: 0.64
13 Arithmetic Coding Conclusions
- High-probability events do not reduce the size of the interval in the next step very much, whereas low-probability events do.
- A small final interval requires many digits to specify a number guaranteed to be in the interval.
- The number of bits required is proportional to the negative logarithm of the size of the interval.
- A symbol s of probability Pr[s] contributes −log Pr[s] bits to the output.
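The interval-narrowing step can be sketched as follows. The uniform model below is an assumption for illustration; the slide's code value 0.64 depends on the exact model used on slide 12:

```python
# Arithmetic encoding as pure interval narrowing: each symbol s shrinks the
# current interval to the sub-interval of relative width Pr[s].
import math

def narrow(interval, symbol, model):
    low, high = interval
    width = high - low
    cum = 0.0
    for s, p in model:
        if s == symbol:
            return (low + cum * width, low + (cum + p) * width)
        cum += p
    raise ValueError(symbol)

model = [("a", 1/3), ("b", 1/3), ("c", 1/3)]   # assumed uniform model
interval = (0.0, 1.0)
for s in "bccb":
    interval = narrow(interval, s, model)

low, high = interval
# The final width is the product of the symbol probabilities, (1/3)**4,
# so any number in the interval takes about -log2(width) bits to specify.
print(math.ceil(-math.log2(high - low)))  # -> 7
```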
14 Inverted File Compression
Each inverted list has the form ⟨f_t; d_1, d_2, …, d_{f_t}⟩.
A naïve representation results in a storage overhead of f · ⌈log N⌉ bits.
This can also be stored as ⟨f_t; d_1, d_2 − d_1, d_3 − d_2, …, d_{f_t} − d_{f_t−1}⟩.
Each difference is called a d-gap. Since the d-gaps in a list sum to at most N, each pointer requires fewer than ⌈log N⌉ bits on average.
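For example (the inverted list used here is illustrative):

```python
# Convert an inverted list of doc pointers to d-gaps and back.
def to_dgaps(docs):
    return [docs[0]] + [b - a for a, b in zip(docs, docs[1:])]

def from_dgaps(gaps):
    out, total = [], 0
    for g in gaps:
        total += g          # each pointer is the running sum of gaps
        out.append(total)
    return out

docs = [3, 8, 9, 11, 12, 13, 17]
gaps = to_dgaps(docs)
print(gaps)                      # -> [3, 5, 1, 2, 1, 1, 4]
assert from_dgaps(gaps) == docs
assert sum(gaps) == docs[-1]     # d-gaps sum to the largest pointer (<= N)
```

Frequent terms produce many small gaps, which is what the codes on the following slides exploit.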
15 Methods for Inverted File Compression
- Methods for compressing d-gap sizes can be classified into:
  - global: each list is compressed using the same model
  - local: the model for compressing an inverted list is adjusted according to some parameter, like the frequency of the term
- Global methods can be divided into:
  - non-parameterized: the probability distribution for d-gap sizes is predetermined
  - parameterized: the probability distribution is adjusted according to certain parameters of the collection
- By definition, local methods are parameterized.
16Non-parameterized models
Unary code An integer x gt 0, is coded as (x-1)
1 bits followed by a 0 bit.
17 Non-parameterized models
Each code has an underlying probability distribution, which can be derived using Shannon's formula, Pr[x] = 2^(−length(x)).
The probability assumed by the unary code, Pr[x] = 2^(−x), is too small.
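Sketches of the unary code and of the Elias γ code (γ is the standard non-parameterized alternative referred to later: a unary length part followed by the binary offset of x):

```python
def unary(x):
    # x > 0: (x-1) 1-bits followed by a 0-bit, so Pr[x] is implicitly 2^-x.
    return "1" * (x - 1) + "0"

def gamma(x):
    # Elias gamma: unary code for 1 + floor(log2 x), then the bits of x
    # below its leading 1-bit.
    n = x.bit_length()                    # 1 + floor(log2 x)
    return unary(n) + format(x, "b")[1:]  # drop the leading 1-bit

print(unary(4))  # -> "1110"
print(gamma(9))  # -> "1110" + "001" = "1110001"
```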
18 Global parameterized models
Probability that a random document contains a random term: p = f / (N·n).
Assuming a Bernoulli process, the d-gaps follow a geometric distribution, Pr[x] = (1 − p)^(x−1) · p, which can be coded with:
- Arithmetic coding
- Huffman-style coding (Golomb coding)
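A sketch of the Golomb encoder (for the geometric distribution above, the near-optimal parameter is b ≈ 0.69/p; the parameter values used below are illustrative):

```python
def golomb_encode(x, b):
    """Golomb code for integer x >= 1 with parameter b >= 1."""
    q, r = (x - 1) // b, (x - 1) % b
    bits = "1" * q + "0"              # quotient in unary
    if b == 1:
        return bits                   # no remainder part needed
    k = (b - 1).bit_length()          # ceil(log2 b)
    thresh = (1 << k) - b             # this many remainders get k-1 bits
    if r < thresh:
        bits += format(r, "b").zfill(k - 1)
    else:
        bits += format(r + thresh, "b").zfill(k)
    return bits

# With b = 3 the remainders 0, 1, 2 get truncated-binary codes "0", "10", "11".
for x in range(1, 7):
    print(x, golomb_encode(x, 3))
```

Small gaps (frequent terms, large p, small b) get short codewords; rare terms are better served by a larger b.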
19 Global observed frequency model
- Use the observed d-gap frequencies, then arithmetic or Huffman coding
- Only slightly better than the γ or δ code
- Reason: pointers are not scattered randomly in the inverted file
- Need local methods for any improvement
20 Local methods
- Local Bernoulli
  - Use a different p for each inverted list
  - Use the γ code for storing the term frequencies f_t
- Skewed Bernoulli
  - The local Bernoulli model is bad for clusters
  - Use a cross between γ and Golomb, with b = median gap size
  - Need to store b (use the γ representation)
  - This is still a static model
21 Interpolative code
Consider an inverted list in which documents 8, 9, 11, 12 and 13 form a cluster.
Can do better with a minimal binary code: encode the middle pointer first, then recurse on each half, so every pointer is coded within an ever-narrower range.
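A sketch of binary interpolative coding; the example list (with its 8..13 cluster) and N = 20 are illustrative assumptions:

```python
def min_binary(x, n):
    """Minimal binary code for x in [0, n); emits no bits when n == 1."""
    if n == 1:
        return ""
    k = (n - 1).bit_length()      # ceil(log2 n)
    thresh = (1 << k) - n         # this many values get k-1 bits
    if x < thresh:
        return format(x, "b").zfill(k - 1)
    return format(x + thresh, "b").zfill(k)

def interpolative(docs, lo, hi):
    """Encode sorted doc numbers, each known to lie in [lo, hi]."""
    if not docs:
        return ""
    m = len(docs) // 2
    d = docs[m]
    # d is bounded below by lo + m and above by hi - (len - 1 - m),
    # because m pointers precede it and len-1-m pointers follow it.
    a, b = lo + m, hi - (len(docs) - 1 - m)
    bits = min_binary(d - a, b - a + 1)
    bits += interpolative(docs[:m], lo, d - 1)
    bits += interpolative(docs[m + 1:], d + 1, hi)
    return bits

docs = [3, 8, 9, 11, 12, 13, 17]
print(len(interpolative(docs, 1, 20)), "bits")   # -> 16 bits
```

Inside the cluster the bounds become so tight that some pointers (here document 12) cost zero bits.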
22 Performance of index compression methods
Compression of inverted files, in bits per pointer.
23 Signature Files
- Each document is given a signature that captures its content
  - Hash each document term to get several hash values
  - Bits corresponding to those values are set to 1
- Query processing
  - Hash each query term to get several hash values
  - If a document has all bits corresponding to those values set to 1, it may contain the query term
- To reduce false matches
  - set several bits for each term
  - make the signatures sufficiently long
- A naïve representation may have to read the entire signature file for each query term
  - Use bitslicing to save on disk transfer time
24 Signature Files: Conclusion
- Design involves many tradeoffs
  - wide, sparse signatures reduce the number of false matches
  - short, dense signatures require more disk accesses
- For reasonable query times, requires more space than a compressed inverted file
- Inefficient for documents of varying sizes
  - Blocking makes simple queries difficult to answer
- Text is not random
25 Bitmaps
- Simple representation: for each term in the lexicon, store a bitvector of length N. A bit is set if and only if the corresponding document contains the term.
- Efficient for Boolean queries
- Enormous storage requirement, even after removing stop words
- Have been used to represent common words
26 Compression of signature files and bitmaps
- Signature files are already in compressed form
  - Decompression affects query time substantially
  - Lossy compression results in false matches
- Bitmaps can be compressed by a significant amount

Bitmap:
0000 0010 0000 0011 1000 0000 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000
Compressed code: 1100 0101 1010 0010 0011 1000 0100
27 Comparison of indexing methods
- All indexing methods are variations of the same basic idea!
- Signature files and inverted files require an order of magnitude less secondary storage than bitmaps
- Signature files cause unnecessary accesses to the document collection unless the signature width is large
- Signature files are disastrous when record lengths vary a lot
- Advantages of signature files
  - no need to keep the lexicon in memory
  - better for conjunctive queries involving common terms
28 Conclusion
- For practical purposes, the best index compression algorithm is the local Bernoulli method (using Golomb coding)
- In practice, compressed inverted indices are almost always better than signature files and bitmaps, in terms of both space and query response time