Using Fingerprints in n-Gram Indices

1
Using Fingerprints in n-Gram Indices
Stefan Selbach selbach_at_informatik.uni-wuerzburg.de
Digital Libraries: Advanced Methods and
Technologies, Digital Collections
17.09.2009
2
Using Fingerprints in n-Gram Indices
  • Overview
  • Introduction
  • Inverted Index
  • N-Gram Index
  • Bitmaps
  • Signature Files
  • n-Gram Fingerprints
  • n-Gram Fingerprints in Combination with Posting
    Lists
  • Fingerprint Compression
  • Conclusion and Future Work

3
Introduction
4
Inverted Index
  • Very common index structure
  • Term-oriented
  • Every term is linked to its postings
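
The term-to-postings link described above can be sketched as follows. This is a minimal illustration, not the structure used in the talk: the whitespace tokenizer, the (doc_id, position) posting format and the name build_inverted_index are assumptions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to its postings, stored as (doc_id, word_position)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # Assumption: terms are lower-cased, whitespace-separated words.
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return dict(index)

docs = ["skin disease atlas", "rare skin condition"]
index = build_inverted_index(docs)
print(index["skin"])
```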

5
n-Gram Index
  • Uses n-Grams as indexing terms
  • Any kind of subsequence can be searched
  • An n-gram is a subsequence of a text with length n
  • Postings for longer subsequences can be
    calculated
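
One way to calculate postings for a longer subsequence is to intersect the postings of its n-grams after aligning their offsets. The sketch below assumes character 2-grams and (doc_id, offset) postings; build_ngram_index and search are hypothetical names.

```python
from collections import defaultdict

def build_ngram_index(docs, n=2):
    """Map every character n-gram to the set of (doc_id, offset) postings."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for off in range(len(text) - n + 1):
            index[text[off:off + n]].add((doc_id, off))
    return index

def search(index, query, n=2):
    """Postings of a query with len(query) >= n, derived from n-gram postings."""
    if len(query) < n:
        raise ValueError("query must contain at least one n-gram")
    result = None
    for k in range(len(query) - n + 1):
        gram = query[k:k + n]
        # Shift each posting back by k so it points at the query start.
        shifted = {(d, o - k) for d, o in index.get(gram, ())}
        result = shifted if result is None else result & shifted
        if not result:
            break
    return sorted(result)

docs = ["rhinology", "oxyuria rhino"]
ngram_index = build_ngram_index(docs)
print(search(ngram_index, "rhino"))
```

Every n-gram of the query contributes one posting set to the intersection, which is why searching costs more than a single lookup in a term-based index.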

6
n-Gram Index
  • Index structure is very similar to an inverted
    index
  • Searching is more complex

7
Bitmaps
  • Bitmaps are occurrence maps
  • Each bit signals an occurrence of a specific term
    in a specific document
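
A term-by-document bitmap can be sketched with one Python integer per term, where bit d marks an occurrence in document d. The word tokenizer and the function names are assumptions made for this example.

```python
def build_bitmaps(docs):
    """One integer per term; bit d is set if the term occurs in document d."""
    bitmaps = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            bitmaps[term] = bitmaps.get(term, 0) | (1 << doc_id)
    return bitmaps

def docs_containing_all(bitmaps, terms, num_docs):
    """AND the bitmaps of all terms and list the documents that remain."""
    acc = (1 << num_docs) - 1
    for term in terms:
        acc &= bitmaps.get(term, 0)
    return [d for d in range(num_docs) if (acc >> d) & 1]

docs = ["skin disease", "rare skin condition", "rare disease"]
bitmaps = build_bitmaps(docs)
print(docs_containing_all(bitmaps, ["rare", "skin"], len(docs)))
```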

8
Signature Files
9
n-Gram Fingerprint
10
N-Gram Fingerprint
  • The idea: create fingerprints that
    • have a fixed size
    • contain information about the postings

11
N-Gram Fingerprint
  • A 2D-Fingerprint is a bit-matrix
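
A fixed-size bit-matrix can be derived from a posting list as sketched below. The mapping is an assumption based on the residue classes mentioned later in the talk: the row is the file ID modulo the number of rows, the column is the offset modulo the number of columns. The matrix size of 4 x 8 is chosen only for readability.

```python
def fingerprint(postings, rows=4, cols=8):
    """Fold (file_id, offset) postings into a rows x cols bit-matrix."""
    bits = [[0] * cols for _ in range(rows)]
    for file_id, offset in postings:
        # Assumption: row = residue class of the file ID,
        # column = residue class of the offset.
        bits[file_id % rows][offset % cols] = 1
    return bits

postings = [(0, 3), (5, 10), (5, 18)]
fp = fingerprint(postings)
for row in fp:
    print("".join(map(str, row)))
```

The size of the matrix does not depend on the number of postings. Different postings can fall onto the same bit, as (5, 10) and (5, 18) do here, so a fingerprint only approximates the posting list.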

12
N-Gram Fingerprint
  • Given two 1-grams w1 and w2 and their fingerprints
    Bw1 and Bw2, the fingerprint Bw1w2 can be
    approximated
  • B'w2 is constructed by cyclically shifting each
    column of Bw2 by one position to the left
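
The approximation can be sketched as a bitwise AND of Bw1 with a column-shifted copy of Bw2. The AND step and the row/column mapping (row = file ID residue, column = offset residue) are assumptions of this example; the transcript only states the shift.

```python
def postings_of(docs, gram):
    """All (doc_id, offset) positions where gram occurs."""
    return [(d, o) for d, text in enumerate(docs)
            for o in range(len(text) - len(gram) + 1)
            if text.startswith(gram, o)]

def fingerprint(postings, rows=2, cols=4):
    bits = [[0] * cols for _ in range(rows)]
    for file_id, offset in postings:
        bits[file_id % rows][offset % cols] = 1
    return bits

def shift_columns_left(bits):
    """Cyclically move every column one position to the left."""
    return [row[1:] + row[:1] for row in bits]

def bit_and(a, b):
    return [[x & y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

docs = ["abcab", "ba"]
b_w1 = fingerprint(postings_of(docs, "a"))
b_w2 = fingerprint(postings_of(docs, "b"))
approx = bit_and(b_w1, shift_columns_left(b_w2))
exact = fingerprint(postings_of(docs, "ab"))
print(approx, exact)
```

Every bit set in the exact fingerprint is also set in the approximation, but the reverse need not hold, so candidate hits still have to be verified against the text.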

13
N-Gram Fingerprint
14
N-Gram Fingerprint
Search Speed
Query       Bit-matrix time   Verification time   Hits
rhinolo     219 ms            94 ms               18
sanfilipo   290 ms            0 ms                0
itracon     266 ms            336 ms              64
oxyuria     197 ms            48 ms               6
Results from the Online Encyclopedia of
Dermatology by P. Altmeyer
15
Term Frequencies and Query Probability
16
n-Gram Fingerprints in Combination with Posting
Lists
17
Combining Fingerprints and Posting Lists
  • By combining fingerprints and posting lists, no
    verification step is needed
  • Posting lists are partitioned into smaller
    subsets. Each bit of the fingerprint corresponds
    to a separate posting list
  • The cost of intersecting posting lists is
    reduced
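
A partitioned index of this kind can be sketched as follows. Each n-gram owns one small posting set per fingerprint cell, and a query only intersects the sets of matching cells. The cell mapping, the 4 x 8 layout and all names are assumptions of this example.

```python
from collections import defaultdict

ROWS, COLS = 4, 8  # assumed fingerprint dimensions

def build_partitioned_index(docs, n=2):
    """n-gram -> {(row, col): set of (doc_id, offset)}; one set per bit."""
    index = defaultdict(lambda: defaultdict(set))
    for d, text in enumerate(docs):
        for o in range(len(text) - n + 1):
            index[text[o:o + n]][(d % ROWS, o % COLS)].add((d, o))
    return index

def search(index, query, n=2):
    if len(query) < n:
        raise ValueError("query must contain at least one n-gram")
    grams = [query[k:k + n] for k in range(len(query) - n + 1)]
    hits = set()
    for (r, c), first in index.get(grams[0], {}).items():
        candidates = set(first)
        for k, gram in enumerate(grams[1:], start=1):
            # The k-th n-gram must lie k columns to the right, same row.
            cell = index.get(gram, {}).get((r, (c + k) % COLS), ())
            candidates &= {(d, o - k) for d, o in cell}
            if not candidates:
                break
        hits |= candidates
    return sorted(hits)

docs = ["rhinology", "oxyuria rhino"]
part_index = build_partitioned_index(docs)
print(search(part_index, "rhino"))
```

Because the cells hold real postings, the intersection yields exact positions, and each intersection touches only a small subset of the full posting list.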

18
Combining Fingerprints and Posting Lists
19
Managing n-Gram Posting Lists
  • A very large number of posting subsets has to be
    managed. For example:
    • 1024 residue classes for the file ID
    • 128 residue classes for the offset
    • 14,000 different n-grams
  • Subsets are stored in a hash
  • The hash value is a function of the residue
    classes
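
One possible key layout for such a hash is sketched below. It uses the figures from the example above (1024 file ID classes, 128 offset classes); folding in a numeric n-gram ID, and the names subset_key and add_posting, are assumptions of this sketch.

```python
FILE_CLASSES = 1024    # residue classes for the file ID
OFFSET_CLASSES = 128   # residue classes for the offset

def subset_key(gram_id, file_id, offset):
    """Combine n-gram ID and both residue classes into one integer key."""
    file_class = file_id % FILE_CLASSES
    offset_class = offset % OFFSET_CLASSES
    return (gram_id * FILE_CLASSES + file_class) * OFFSET_CLASSES + offset_class

def add_posting(table, gram_id, file_id, offset):
    """Append a posting to the subset selected by its key."""
    table.setdefault(subset_key(gram_id, file_id, offset), []).append((file_id, offset))

table = {}
add_posting(table, 7, 2050, 130)
add_posting(table, 7, 2, 2)      # same residue classes, same subset
print(len(table), 14_000 * FILE_CLASSES * OFFSET_CLASSES)
```

With 14,000 n-grams this layout allows up to 1,835,008,000 distinct subsets, most of which stay empty, which is why a hash is a natural container for them.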

20
Managing n-Gram Posting Lists
21
Managing n-Gram Posting Lists
22
Results
  • Performance improved by 40% compared to the setup
    without posting lists

Query       Bit-matrix time   Verification time   Hits
rhinolo     230 ms            10 ms               18
sanfilipo   271 ms            0 ms                0
itracon     245 ms            15 ms               64
oxyuria     210 ms            12 ms               6
23
Fingerprint compression
24
Fingerprint Compression
  • Fingerprints with high or low densities do not
    contain much information
  • Fingerprints can be compressed by reducing the
    resolution
  • Dictionary based compression
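
Reducing the resolution of very sparse or very dense fingerprints can be sketched as follows. Merging neighbouring columns with OR is an assumption of this example, as are the thresholds and names; the transcript does not specify the exact operation.

```python
def density(bits):
    """Fraction of bits that are set."""
    return sum(map(sum, bits)) / (len(bits) * len(bits[0]))

def halve_columns(bits):
    """OR neighbouring column pairs; assumes an even number of columns."""
    return [[row[i] | row[i + 1] for i in range(0, len(row), 2)] for row in bits]

def maybe_compress(bits, low=0.05, high=0.95):
    """Reduce resolution only when the density is very low or very high."""
    d = density(bits)
    return halve_columns(bits) if d <= low or d >= high else bits

sparse = [[0] * 8 for _ in range(4)]
sparse[0][3] = 1
mixed = [[1, 0] * 4 for _ in range(4)]
print(maybe_compress(sparse))
print(maybe_compress(mixed) is mixed)
```

OR-merging never clears a set bit, so the compressed fingerprint can only add candidate positions, not lose real ones.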

25
Fingerprint Compression
  • Results: fingerprint convolution

Density threshold for convolution   Performance loss   Fingerprint index reduction
no convolution                      0%                 0%
0-0.025 and 0.975-1                 3.1%               23%
0-0.05 and 0.95-1                   3.2%               27%
0-0.1 and 0.9-1                     10%                29%
0-0.2 and 0.8-1                     25%                31%
  • In combination with dictionary-based
    compression, the index size is reduced by an
    additional 30%

26
Conclusion and Future Work
27
Conclusion
  • Fingerprints improve the scalability of n-gram
    indices
  • Fingerprints improve the performance of n-gram
    indices
  • The index structure can be adjusted to user
    behavior, so that common queries can be processed
    more efficiently
  • The fingerprints can be stored in a compressed
    index while losing only a minimum of performance

28
Future Work
  • Combination of term based inverted index and
    n-Gram fingerprint index
  • Profit from the advantages of both using terms
    and n-Grams as indexing terms
  • Substring search
  • Ranking
  • Thesaurus information

29
Thank You!  