Title: Using Fingerprints in n-Gram Indices
1Using Fingerprints in n-Gram Indices
Stefan Selbach selbach_at_informatik.uni-wuerzburg.de
Digital Libraries Advanced Methods and
Technologies, Digital Collections
17.09.2009
2Using Fingerprints in n-Gram Indices
- Overview
- Introduction
- Inverted Index
- N-Gram Index
- Bitmaps
- Signature Files
- n-Gram Fingerprints
- n-Gram Fingerprints in Combination with Posting Lists
- Fingerprint Compression
- Conclusion and Future Work
3Introduction
4Inverted Index
- Very common index structure
- Term-oriented
- Every term is linked to its postings
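As an illustration of the structure described on this slide, here is a minimal sketch of a term-oriented inverted index. The function name, the whitespace tokenization, and the sample documents are assumptions made for the example, not part of the talk.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to its postings, stored as (doc_id, word_position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # Assumption: terms are lower-cased, whitespace-separated words.
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return dict(index)

docs = ["the skin is an organ", "skin diseases of the scalp"]
index = build_inverted_index(docs)
print(index["skin"])  # [(0, 1), (1, 0)]
```

Only whole terms can be looked up this way; a query such as "derm" would not find "dermatology" without further machinery.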
5n-Gram Index
- Uses n-Grams as indexing terms
- Any kind of subsequence can be searched
- An n-gram is a subsequence of a text with length n
- Postings for longer subsequences can be calculated from the postings of their n-grams
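One way to calculate the postings of a longer subsequence is to intersect the postings of its overlapping n-grams after aligning their offsets. The sketch below assumes postings of the form (doc_id, offset) and uses hypothetical helper names; it is not taken from the talk.

```python
from collections import defaultdict

def build_ngram_index(docs, n=2):
    """Map every n-gram to the set of (doc_id, offset) positions where it starts."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for off in range(len(text) - n + 1):
            index[text[off:off + n]].add((doc_id, off))
    return index

def search(index, query, n=2):
    """Return the sorted (doc_id, offset) positions at which `query` occurs."""
    if len(query) < n:
        raise ValueError("query must contain at least one n-gram")
    result = None
    for k in range(len(query) - n + 1):
        gram = query[k:k + n]
        # Shift each posting back by k so that it refers to the start of the query.
        aligned = {(d, off - k) for (d, off) in index.get(gram, ())}
        result = aligned if result is None else result & aligned
        if not result:
            break
    return sorted(result)

docs = ["rhinology", "dermatology"]
index = build_ngram_index(docs, n=2)
print(search(index, "olo"))    # [(0, 4), (1, 6)]
print(search(index, "rhino"))  # [(0, 0)]
```

Because consecutive n-grams overlap, a position survives the intersection only if every character of the query matches there.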
6n-Gram Index
- Index structure is very similar to an inverted index
- Searching is more complex
7Bitmaps
- Bitmaps are occurrence maps
- Each bit signals an occurrence of a specific term
in a specific document
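A small sketch of such an occurrence map, using one Python integer per term as the bitmap. The helper names and the sample documents are assumptions for illustration.

```python
def build_bitmaps(docs):
    """One integer bitmap per term; bit d is set if the term occurs in document d."""
    bitmaps = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            bitmaps[term] = bitmaps.get(term, 0) | (1 << doc_id)
    return bitmaps

def docs_with_all(bitmaps, terms, num_docs):
    """AND the bitmaps of all terms and list the documents whose bit survives."""
    bits = (1 << num_docs) - 1
    for term in terms:
        bits &= bitmaps.get(term, 0)
    return [d for d in range(num_docs) if (bits >> d) & 1]

docs = ["acne of the skin", "skin cancer", "hair loss"]
bitmaps = build_bitmaps(docs)
print(docs_with_all(bitmaps, ["skin"], len(docs)))            # [0, 1]
print(docs_with_all(bitmaps, ["skin", "cancer"], len(docs)))  # [1]
```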
8Signature Files
9n-Gram Fingerprint
10N-Gram Fingerprint
- The idea
- Create fingerprints that
- Have a fixed size
- Contain information about the postings
11N-Gram Fingerprint
- A 2D-Fingerprint is a bit-matrix
12N-Gram Fingerprint
- Given two 1-grams and their fingerprints Bw1 and Bw2, the fingerprint Bw1w2 can be approximated
- A shifted copy of Bw2 is constructed by cyclically shifting each column of Bw2 by one position to the left
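The approximation can be sketched as follows. The layout is an assumption: a fingerprint is taken to be a bit-matrix whose rows are residue classes of the file ID and whose positions within a row are residue classes of the offset, so that the cyclic shift runs along the offset axis. The matrix sizes and all function names are hypothetical.

```python
FILE_CLASSES = 4    # assumed number of residue classes for the file ID
OFFSET_CLASSES = 8  # assumed number of residue classes for the offset

def postings(docs, gram):
    """All (file_id, offset) positions at which `gram` starts."""
    n = len(gram)
    return [(f, o) for f, text in enumerate(docs)
            for o in range(len(text) - n + 1) if text[o:o + n] == gram]

def fingerprint(posting_list):
    """Fixed-size bit-matrix: bit (f mod FILE_CLASSES, o mod OFFSET_CLASSES) is set."""
    fp = [[0] * OFFSET_CLASSES for _ in range(FILE_CLASSES)]
    for file_id, offset in posting_list:
        fp[file_id % FILE_CLASSES][offset % OFFSET_CLASSES] = 1
    return fp

def shift_left(fp):
    """Cyclically shift every row by one offset class to the left."""
    return [row[1:] + row[:1] for row in fp]

def approximate(fp_w1, fp_w2):
    """Approximate the fingerprint of w1w2 as Bw1 AND (Bw2 shifted left by one)."""
    shifted = shift_left(fp_w2)
    return [[a & b for a, b in zip(r1, r2)] for r1, r2 in zip(fp_w1, shifted)]

docs = ["abab", "ba"]
approx = approximate(fingerprint(postings(docs, "a")),
                     fingerprint(postings(docs, "b")))
exact = fingerprint(postings(docs, "ab"))
```

Under these assumptions every bit set in the exact fingerprint is also set in the approximation, so the approximation can yield false positives but no false negatives. That is consistent with the verification step reported in the search-speed results.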
13N-Gram Fingerprint
14N-Gram Fingerprint
Search Speed
Query      Bit-matrix  Time for verification  Hits
rhinolo    219 ms      94 ms                  18
sanfilipo  290 ms      0 ms                   0
itracon    266 ms      336 ms                 64
oxyuria    197 ms      48 ms                  6
Results from the Online Encyclopedia of
Dermatology by P. Altmeyer
15Term Frequencies and Query Probability
16N-Gram Fingerprints in Combination with Posting Lists
17Combining Fingerprints and Posting Lists
- By combining fingerprints and posting lists:
- No verification step is needed
- Posting lists are partitioned into smaller subsets. Each bit of the fingerprint corresponds to a separate posting list
- Costs for intersecting posting lists are reduced
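The partitioning can be sketched as below. It assumes that each fingerprint bit is identified by a pair (file ID residue class, offset residue class) and that a non-empty subset plays the role of a set bit. The class counts and function names are hypothetical.

```python
from collections import defaultdict

FILE_CLASSES = 4    # assumed number of residue classes for the file ID
OFFSET_CLASSES = 8  # assumed number of residue classes for the offset

def partition(posting_list):
    """Split a posting list into subsets, one per fingerprint bit."""
    subsets = defaultdict(set)
    for f, o in posting_list:
        subsets[(f % FILE_CLASSES, o % OFFSET_CLASSES)].add((f, o))
    return subsets

def bigram_postings(subsets_w1, subsets_w2):
    """Exact postings of w1w2, intersecting only the matching subsets."""
    hits = []
    for (fc, oc), sub1 in subsets_w1.items():
        # w2 must start one position later, i.e. in the next offset class.
        sub2 = subsets_w2.get((fc, (oc + 1) % OFFSET_CLASSES))
        if not sub2:
            continue  # the corresponding fingerprint bit is 0: nothing to intersect
        hits.extend((f, o) for (f, o) in sub1 if (f, o + 1) in sub2)
    return sorted(hits)

a = partition([(0, 0), (0, 2), (1, 1)])  # postings of "a" in "abab", "ba"
b = partition([(0, 1), (0, 3), (1, 0)])  # postings of "b" in "abab", "ba"
print(bigram_postings(a, b))  # [(0, 0), (0, 2)]
```

Since the subsets hold real postings, the result is exact and needs no separate verification, and each intersection touches only a small part of the full posting lists.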
18Combining Fingerprints and Posting Lists
19Managing n-Gram Posting Lists
- A very large number of posting subsets has to be managed. For example: 1024 residue classes for the fileID, 128 residue classes for the offset, and 14,000 different n-grams
- Subsets are stored in a hash
- The hash value is a function of the residue classes
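The slides do not give the hash function itself. As one possible sketch, the key below combines a numeric n-gram ID (an assumption) with the two residue classes into a single integer, using the example sizes from this slide.

```python
FILE_CLASSES = 1024   # residue classes for the fileID (from the slide)
OFFSET_CLASSES = 128  # residue classes for the offset (from the slide)
NUM_NGRAMS = 14_000   # number of different n-grams (from the slide)

def subset_key(ngram_id, file_id, offset):
    """Hypothetical key: unique per (n-gram, file class, offset class) triple."""
    file_class = file_id % FILE_CLASSES
    offset_class = offset % OFFSET_CLASSES
    return (ngram_id * FILE_CLASSES + file_class) * OFFSET_CLASSES + offset_class

store = {}  # hash from subset key to the postings in that subset

def add_posting(ngram_id, file_id, offset):
    store.setdefault(subset_key(ngram_id, file_id, offset), []).append((file_id, offset))

add_posting(5, 3, 7)
add_posting(5, 1024 + 3, 128 + 7)  # same residue classes, so same subset
print(store[subset_key(5, 3, 7)])  # [(3, 7), (1027, 135)]
print(NUM_NGRAMS * FILE_CLASSES * OFFSET_CLASSES)  # 1835008000 possible subsets
```

With these sizes there are up to 1,835,008,000 possible subsets, which is why a sparse structure such as a hash, holding only the non-empty subsets, is a natural fit.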
20Managing n-Gram Posting Lists
21Managing n-Gram Posting Lists
22Results
- Performance improved by 40% compared to the setup without posting lists
Query      Bit-matrix  Time for verification  Hits
rhinolo    230 ms      10 ms                  18
sanfilipo  271 ms      0 ms                   0
itracon    245 ms      15 ms                  64
oxyuria    210 ms      12 ms                  6
23Fingerprint compression
24Fingerprint Compression
- Fingerprints with high or low densities do not contain much information
- Fingerprints can be compressed by reducing the resolution
- Dictionary-based compression
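A minimal sketch of density-dependent resolution reduction. The slides do not specify the operator, so this example assumes that adjacent offset classes are merged with a bitwise OR, which keeps every set bit and therefore cannot introduce false negatives. The thresholds and function names are illustrative.

```python
def density(fp):
    """Fraction of set bits in a fingerprint given as a list of 0/1 rows."""
    cells = sum(len(row) for row in fp)
    return sum(sum(row) for row in fp) / cells

def halve_resolution(fp):
    """OR each pair of adjacent bits in a row (assumes an even row length)."""
    return [[row[j] | row[j + 1] for j in range(0, len(row), 2)] for row in fp]

def maybe_compress(fp, low=0.05, high=0.95):
    """Reduce the resolution only if the density is very low or very high."""
    d = density(fp)
    return halve_resolution(fp) if d <= low or d >= high else fp

sparse = [[0, 0, 0, 1, 0, 0, 0, 0],
          [0, 0, 0, 0, 0, 0, 0, 0]]  # density 1/16, kept at full resolution
very_sparse = sparse + [[0] * 8, [0] * 8]  # density 1/32, below the threshold
print(maybe_compress(sparse) is sparse)  # True
print(maybe_compress(very_sparse))       # [[0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
```

In a fingerprint compressed this way, the bit for offset class j would be looked up at position j // 2.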
25Fingerprint Compression
- Results: fingerprint convolution
Density threshold for convolution  Performance loss (%)  Fingerprint index reduction (%)
no convolution                     0                     0
0-0.025 and 0.975-1                3.1                   23
0-0.05 and 0.95-1                  3.2                   27
0-0.1 and 0.9-1                    10                    29
0-0.2 and 0.8-1                    25                    31
- In combination with dictionary-based compression, the index size is reduced by an additional 30%
26Conclusion and Future Work
27Conclusion
- Fingerprints improve the scalability of n-gram indices
- Fingerprints improve the performance of n-gram indices
- The index structure can be adjusted to user behavior, so that common queries can be processed more efficiently
- The fingerprints can be stored in a compressed index while losing only a minimum of performance
28Future Work
- Combination of term-based inverted index and n-Gram fingerprint index
- Profit from the advantages of both using terms and n-Grams as indexing terms
- Substring search
- Ranking
- Thesaurus information
29Thank You!