Title: Indexing
1 Indexing
2 Overview of the Talk
- Inverted File Indexing
- Compression of inverted files
- Signature files and bitmaps
- Comparison of indexing methods
- Conclusion
3 Inverted File Indexing
- Inverted file index
  - contains a list of the terms that appear in the document collection (called the lexicon or vocabulary)
  - and, for each term in the lexicon, stores a list of pointers to all occurrences of that term in the document collection. This list is called an inverted list.
- The granularity of an index determines how accurately it represents the location of a word
- A coarse-grained index requires less storage but more query processing to eliminate false matches
- A word-level index enables queries involving adjacency and proximity, but has higher space requirements
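The construction above can be sketched in a few lines (a minimal document-level index; the tokenization is illustrative, and a word-level index would also record in-document positions):

```python
# Build a document-level inverted index: term -> (f_t, sorted doc pointers).
from collections import defaultdict

def build_inverted_index(docs):
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs, start=1):
        for term in text.lower().replace(",", "").replace(".", "").split():
            postings[term].add(doc_id)
    # Each inverted list stores its length f_t and the sorted pointers.
    return {t: (len(ds), sorted(ds)) for t, ds in postings.items()}

docs = [
    "Pease porridge hot, pease porridge cold,",
    "Pease porridge in the pot,",
    "Nine days old.",
    "Some like it hot, some like it cold,",
    "Some like it in the pot,",
    "Nine days old.",
]
index = build_inverted_index(docs)
print(index["porridge"])   # -> (2, [1, 2])
```

This reproduces the example on the next slide: 13 distinct terms and 26 index pointers over 6 documents.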
4 Inverted File Index Example

Term       Inverted list
cold       ⟨2; 1, 4⟩
days       ⟨2; 3, 6⟩
hot        ⟨2; 1, 4⟩
in         ⟨2; 2, 5⟩
it         ⟨2; 4, 5⟩
like       ⟨2; 4, 5⟩
nine       ⟨2; 3, 6⟩
old        ⟨2; 3, 6⟩
pease      ⟨2; 1, 2⟩
porridge   ⟨2; 1, 2⟩
pot        ⟨2; 2, 5⟩
some       ⟨2; 4, 5⟩
the        ⟨2; 2, 5⟩

Doc  Text
1    Pease porridge hot, pease porridge cold,
2    Pease porridge in the pot,
3    Nine days old.
4    Some like it hot, some like it cold,
5    Some like it in the pot,
6    Nine days old.

Notation:
N = number of documents (6)
n = number of distinct terms (13)
f = number of index pointers (26)
5 Inverted File Compression
Each inverted list has the form ⟨f_t; d_1, d_2, …, d_{f_t}⟩.
A naïve representation results in a storage overhead of f · ⌈log N⌉ bits.
6 Text Compression
- Two classes of text compression methods:
- Symbolwise (or statistical) methods
  - Estimate probabilities of symbols (the modelling step)
  - Code one symbol at a time (the coding step)
  - Use shorter codes for the most likely symbols
  - Usually based on either arithmetic or Huffman coding
- Dictionary methods
  - Replace fragments of text with a single code word (typically an index to an entry in the dictionary)
  - e.g. Ziv-Lempel coding, which replaces strings of characters with a pointer to a previous occurrence of the string
  - No probability estimates needed
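A toy sketch of the Ziv-Lempel idea (the triple token format and window size are illustrative, not the exact scheme from the slides):

```python
# Toy LZ77-style compressor: emit (offset, length, next_char) triples,
# where offset/length point back into the already-seen text.
def lz77_compress(text, window=255):
    i, out = 0, []
    while i < len(text):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # Matches may overlap the current position (classic LZ77 trick).
            while i + k < len(text) - 1 and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        out.append((best_off, best_len, text[i + best_len]))
        i += best_len + 1
    return out

def lz77_decompress(tokens):
    s = []
    for off, length, ch in tokens:
        for _ in range(length):
            s.append(s[-off])   # copy from the previous occurrence
        s.append(ch)
    return "".join(s)

text = "pease porridge hot, pease porridge cold"
tokens = lz77_compress(text)
assert lz77_decompress(tokens) == text
```

The repeated phrase "pease porridge " is replaced by a single pointer to its first occurrence, which is exactly why no probability model is needed.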
7 Models
8 Huffman Coding Example
Symbol  Code  Probability
A       0000  0.05
B       0001  0.05
C       001   0.1
D       01    0.2
E       10    0.3
F       110   0.2
G       111   0.1

[Figure: the Huffman tree for this code; leaves A-G, internal node weights 0.1, 0.2, 0.3, 0.4, 0.6 and 1.0, edges labelled 0 and 1.]
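The code table above can be reproduced with the standard Huffman construction. Tie-breaking between equal weights may yield different codewords than the slide, but the expected code length, 2.6 bits per symbol here, is the same for any Huffman tree:

```python
# Standard Huffman construction over a probability table.
import heapq

def huffman_codes(probs):
    # Heap entries carry a unique counter so equal weights never
    # fall through to comparing the code dictionaries.
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        # Prefix 0 onto one subtree's codes and 1 onto the other's.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

probs = {"A": 0.05, "B": 0.05, "C": 0.1, "D": 0.2,
         "E": 0.3, "F": 0.2, "G": 0.1}
codes = huffman_codes(probs)
expected = sum(p * len(codes[s]) for s, p in probs.items())
print(round(expected, 2))  # -> 2.6
```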
11 Huffman Coding Conclusions
12 Arithmetic Coding
String: bccb; Alphabet: {a, b, c}
Code: 0.64
13 Arithmetic Coding Conclusions
- High-probability events do not reduce the size of the interval in the next step very much, whereas low-probability events do.
- A small final interval requires many digits to specify a number guaranteed to be in the interval.
- The number of bits required is proportional to the negative logarithm of the size of the interval.
- A symbol s of probability Pr[s] contributes −log Pr[s] bits to the output.
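The interval-narrowing step can be sketched as follows. The uniform model below is an assumption for illustration; the slide's code value 0.64 depends on the exact model used on slide 12:

```python
# Arithmetic encoding as pure interval narrowing: each symbol s shrinks the
# current interval to the sub-interval of relative width Pr[s].
import math

def narrow(interval, symbol, model):
    low, high = interval
    width = high - low
    cum = 0.0
    for s, p in model:
        if s == symbol:
            return (low + cum * width, low + (cum + p) * width)
        cum += p
    raise ValueError(symbol)

model = [("a", 1/3), ("b", 1/3), ("c", 1/3)]   # assumed uniform model
interval = (0.0, 1.0)
for s in "bccb":
    interval = narrow(interval, s, model)

low, high = interval
# The final width is the product of the symbol probabilities, (1/3)**4,
# so any number in the interval takes about -log2(width) bits to specify.
print(math.ceil(-math.log2(high - low)))  # -> 7
```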
14 Inverted File Compression
Each inverted list has the form ⟨f_t; d_1, d_2, …, d_{f_t}⟩.
A naïve representation results in a storage overhead of f · ⌈log N⌉ bits.
This can also be stored as ⟨f_t; d_1, d_2 − d_1, d_3 − d_2, …, d_{f_t} − d_{f_t−1}⟩.
Each difference is called a d-gap. Since the d-gaps in a list sum to at most N, each pointer requires fewer than ⌈log N⌉ bits on average.
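For example (the inverted list used here is illustrative):

```python
# Convert an inverted list of doc pointers to d-gaps and back.
def to_dgaps(docs):
    return [docs[0]] + [b - a for a, b in zip(docs, docs[1:])]

def from_dgaps(gaps):
    out, total = [], 0
    for g in gaps:
        total += g          # each pointer is the running sum of gaps
        out.append(total)
    return out

docs = [3, 8, 9, 11, 12, 13, 17]
gaps = to_dgaps(docs)
print(gaps)                      # -> [3, 5, 1, 2, 1, 1, 4]
assert from_dgaps(gaps) == docs
assert sum(gaps) == docs[-1]     # d-gaps sum to the largest pointer (<= N)
```

Frequent terms produce many small gaps, which is what the codes on the following slides exploit.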
15 Methods for Inverted File Compression
- Methods for compressing d-gap sizes can be classified into:
  - global: each list is compressed using the same model
  - local: the model for compressing an inverted list is adjusted according to some parameter, like the frequency of the term
- Global methods can be divided into:
  - non-parameterized: the probability distribution for d-gap sizes is predetermined
  - parameterized: the probability distribution is adjusted according to certain parameters of the collection
- By definition, local methods are parameterized.
16Non-parameterized models
Unary code An integer x gt 0, is coded as (x-1)
1 bits followed by a 0 bit.
17 Non-parameterized models
Each code has an underlying probability distribution, which can be derived using Shannon's formula, Pr[x] = 2^(−length(x)).
The probability assumed by the unary code, Pr[x] = 2^(−x), is too small.
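Sketches of the unary code and of the Elias γ code (γ is the standard non-parameterized alternative referred to later: a unary length part followed by the binary offset of x):

```python
def unary(x):
    # x > 0: (x-1) 1-bits followed by a 0-bit, so Pr[x] is implicitly 2^-x.
    return "1" * (x - 1) + "0"

def gamma(x):
    # Elias gamma: unary code for 1 + floor(log2 x), then the bits of x
    # below its leading 1-bit.
    n = x.bit_length()                    # 1 + floor(log2 x)
    return unary(n) + format(x, "b")[1:]  # drop the leading 1-bit

print(unary(4))  # -> "1110"
print(gamma(9))  # -> "1110" + "001" = "1110001"
```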
18 Global parameterized models
Probability that a random document contains a random term: p = f / (N·n).
Assuming a Bernoulli process, the d-gaps follow a geometric distribution, Pr[x] = (1 − p)^(x−1) · p, which can be coded with:
- Arithmetic coding
- Huffman-style coding (Golomb coding)
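A sketch of the Golomb encoder (for the geometric distribution above, the near-optimal parameter is b ≈ 0.69/p; the parameter values used below are illustrative):

```python
def golomb_encode(x, b):
    """Golomb code for integer x >= 1 with parameter b >= 1."""
    q, r = (x - 1) // b, (x - 1) % b
    bits = "1" * q + "0"              # quotient in unary
    if b == 1:
        return bits                   # no remainder part needed
    k = (b - 1).bit_length()          # ceil(log2 b)
    thresh = (1 << k) - b             # this many remainders get k-1 bits
    if r < thresh:
        bits += format(r, "b").zfill(k - 1)
    else:
        bits += format(r + thresh, "b").zfill(k)
    return bits

# With b = 3 the remainders 0, 1, 2 get truncated-binary codes "0", "10", "11".
for x in range(1, 7):
    print(x, golomb_encode(x, 3))
```

Small gaps (frequent terms, large p, small b) get short codewords; rare terms are better served by a larger b.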
19 Global observed frequency model
- Use the observed d-gap frequencies, then arithmetic or Huffman coding
- Only slightly better than the γ or δ code
- Reason: pointers are not scattered randomly in the inverted file
- Need local methods for any improvement
20 Local methods
- Local Bernoulli
  - Use a different p for each inverted list
  - Use the γ code for storing the term frequencies f_t
- Skewed Bernoulli
  - The local Bernoulli model is bad for clusters
  - Use a cross between γ and Golomb, with b = median gap size
  - Need to store b (use the γ representation)
  - This is still a static model
21 Interpolative code
Consider an inverted list in which documents 8, 9, 11, 12 and 13 form a cluster.
Can do better with a minimal binary code: encode the middle pointer first, then recurse on each half, so every pointer is coded within an ever-narrower range.
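A sketch of binary interpolative coding; the example list (with its 8..13 cluster) and N = 20 are illustrative assumptions:

```python
def min_binary(x, n):
    """Minimal binary code for x in [0, n); emits no bits when n == 1."""
    if n == 1:
        return ""
    k = (n - 1).bit_length()      # ceil(log2 n)
    thresh = (1 << k) - n         # this many values get k-1 bits
    if x < thresh:
        return format(x, "b").zfill(k - 1)
    return format(x + thresh, "b").zfill(k)

def interpolative(docs, lo, hi):
    """Encode sorted doc numbers, each known to lie in [lo, hi]."""
    if not docs:
        return ""
    m = len(docs) // 2
    d = docs[m]
    # d is bounded below by lo + m and above by hi - (len - 1 - m),
    # because m pointers precede it and len-1-m pointers follow it.
    a, b = lo + m, hi - (len(docs) - 1 - m)
    bits = min_binary(d - a, b - a + 1)
    bits += interpolative(docs[:m], lo, d - 1)
    bits += interpolative(docs[m + 1:], d + 1, hi)
    return bits

docs = [3, 8, 9, 11, 12, 13, 17]
print(len(interpolative(docs, 1, 20)), "bits")   # -> 16 bits
```

Inside the cluster the bounds become so tight that some pointers (here document 12) cost zero bits.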
22 Performance of index compression methods
Compression of inverted files, in bits per pointer.
23 Signature Files
- Each document is given a signature that captures its content
  - Hash each document term to get several hash values
  - Bits corresponding to those values are set to 1
- Query processing
  - Hash each query term to get several hash values
  - If a document has all bits corresponding to those values set to 1, it may contain the query term
- To reduce false matches
  - set several bits for each term
  - make the signatures sufficiently long
- A naïve representation may have to read the entire signature file for each query term
  - Use bitslicing to save on disk transfer time
24 Signature Files: Conclusion
- Design involves many tradeoffs
  - wide, sparse signatures reduce the number of false matches
  - short, dense signatures require more disk accesses
- For reasonable query times, requires more space than a compressed inverted file
- Inefficient for documents of varying sizes
  - Blocking makes simple queries difficult to answer
- Text is not random
25 Bitmaps
- Simple representation: for each term in the lexicon, store a bitvector of length N. A bit is set if and only if the corresponding document contains the term.
- Efficient for Boolean queries
- Enormous storage requirement, even after removing stop words
- Have been used to represent common words
26 Compression of signature files and bitmaps
- Signature files are already in compressed form
  - Decompression affects query time substantially
  - Lossy compression results in false matches
- Bitmaps can be compressed by a significant amount

Bitmap:
0000 0010 0000 0011 1000 0000 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000
Compressed code: 1100 0101 1010 0010 0011 1000 0100
27 Comparison of indexing methods
- All indexing methods are variations of the same basic idea!
- Signature files and inverted files require an order of magnitude less secondary storage than bitmaps
- Signature files cause unnecessary accesses to the document collection unless the signature width is large
- Signature files are disastrous when record lengths vary a lot
- Advantages of signature files
  - no need to keep the lexicon in memory
  - better for conjunctive queries involving common terms
28 Conclusion
- For practical purposes, the best index compression algorithm is the local Bernoulli method (using Golomb coding)
- In practice, compressed inverted indices are almost always better than signature files and bitmaps, in terms of both space and query response time