Title: Space-Constrained Gram-Based Indexing for Efficient Approximate String Search
Slide 1: Space-Constrained Gram-Based Indexing for Efficient Approximate String Search
- Alexander Behm (1), Shengyue Ji (1), Chen Li (1), Jiaheng Lu (2)
- (1) University of California, Irvine
- (2) Renmin University of China
Slide 2: Overview
- Motivation & Preliminaries
- Approach 1: Discarding Lists
- Approach 2: Combining Lists
- Experiments & Conclusion
Slide 3: Motivation: Data Cleaning
[Figure caption: Should clearly be Niels Bohr]
- Real-world data is dirty
- Typos
- Inconsistent representations (PO Box vs. P.O. Box)
- Approximately check against a clean dictionary
Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008
Slide 4: Motivation: Record Linkage
We want to link records belonging to the same entity. There is no exact match!
The same entity may have similar representations:
Arnold Schwarzeneger versus Arnold Schwarzenegger
Forrest Whittaker versus Forest Whittacker
Slide 5: Motivation: Query Relaxation
- Errors in queries
- Errors in data
- Bring the query and meaningful results closer together
Actual queries gathered by Google: http://www.google.com/jobs/britney.html
Slide 6: What is Approximate String Search?
String collection (people): Brad Pitt, Forest Whittacker, George Bush, Angelina Jolie, Arnold Schwarzeneger
Queries against the collection: find all entries similar to Forrest Whitaker; find all entries similar to Arnold Schwarzenegger; find all entries similar to Brittany Spears
- What do we mean by "similar to"?
- Edit Distance
- Jaccard Similarity
- Cosine Similarity
- Dice
- Etc.
The "similar to" predicate can help the applications described above! How can we support these types of queries efficiently?
Slide 7: Approximate Query Answering
Main idea: use q-grams as signatures for a string.
irvine → (sliding window) → 2-grams: ir, rv, vi, in, ne
Intuition: similar strings share a certain number of grams.
An inverted index on grams supports finding all data strings sharing enough grams with a query.
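As a minimal sketch (not from the slides), sliding-window gram extraction looks like:

```python
def qgrams(s, q=2):
    """Slide a window of length q over s, producing one gram per position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

print(qgrams("irvine"))  # ['ir', 'rv', 'vi', 'in', 'ne']
```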
Slide 8: Approximate Query Example
Query: irvine, edit distance 1; 2-grams: ir, rv, vi, in, ne
Look up the grams in the inverted lists (stringIDs):
tf → 5, 9
vi → 1, 2, 4, 5, 6
ir → 1, 3, 4, 5, 7, 9
ef → 3, 9
rv → 1, 2, 3, 9
ne → 1, 5
un → 7, 9
in → 5, 6, 9
Candidates: 1, 5, 9. These may include false positives, so we still need to compute the real similarity.
Each edit operation can destroy at most q grams, so answers must share at least T = 5 - 1 * 2 = 3 grams.
T-occurrence problem: find the elements occurring at least T = 3 times among the inverted lists. This is called list-merging; T is called the merging-threshold.
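The T-occurrence test can be sketched with a simple ScanCount pass (a minimal illustration; the gram-to-list mapping below is assumed for the example):

```python
from collections import Counter

def scancount(query_grams, inverted_index, T):
    """Count, for every stringID, how many query grams' lists contain it;
    stringIDs reaching the merging-threshold T become candidates."""
    counts = Counter()
    for gram in query_grams:
        for sid in inverted_index.get(gram, []):
            counts[sid] += 1
    return sorted(sid for sid, c in counts.items() if c >= T)

# 2-gram index from the example (gram -> stringIDs); assumed mapping.
index = {"tf": [5, 9], "vi": [1, 2, 4, 5, 6], "ir": [1, 3, 4, 5, 7, 9],
         "ef": [3, 9], "rv": [1, 2, 3, 9], "ne": [1, 5],
         "un": [7, 9], "in": [5, 6, 9]}
print(scancount(["ir", "rv", "vi", "in", "ne"], index, T=3))  # [1, 5, 9]
```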
Slide 9: Motivation: Compression
Index-size estimation: each string s produces |s| - q + 1 grams, and for each gram we add one element (a 4-byte uint) to its inverted list. With ASCII encoding the index is roughly 4x as large as the original data!
- The inverted index can be very large compared to the source data
- It may need to fit in memory for fast query processing
- Can we compress the index to fit into a space budget?
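A back-of-the-envelope check of this estimate (a hypothetical helper, not from the slides; the ratio approaches 4 as strings get long relative to q):

```python
def index_size_bytes(strings, q=3):
    """Rough inverted-index size: one 4-byte stringID per gram,
    with |s| - q + 1 grams per string (strings shorter than q ignored)."""
    return sum(4 * (len(s) - q + 1) for s in strings if len(s) >= q)

strings = ["irvine", "shanghai"]
data_bytes = sum(len(s) for s in strings)   # 14 bytes of ASCII data
print(index_size_bytes(strings) / data_bytes)
```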
Slide 10: Motivation: Related Work
- The IR community has developed many lossless compression algorithms for inverted lists (mostly in a disk-based setting)
  - They mainly use delta representation plus packing
- If the inverted lists are in memory, these techniques always impose decompression overhead
- The compression ratio is difficult to tune
- How can we overcome these limitations in our setting?
Slide 11: This Paper
- We developed two lossy compression techniques
- We answer queries exactly
- The index can fit into a space budget (space constraint)
- Queries can become faster on the compressed indexes
- Flexibility to choose the space/time tradeoff
- Existing list-merging algorithms can be reused (even with compression-specific optimizations)
Slide 12: Overview
- Motivation & Preliminaries
- Approach 1: Discarding Lists
- Approach 2: Combining Lists
- Experiments & Conclusion
Slide 13: Approach 1: Discarding Lists
BEFORE: inverted lists (stringIDs) for the 2-grams:
tf → 5, 9
vi → 1, 2, 4, 5, 6
ir → 1, 3, 4, 5, 7, 9
ef → 3, 9
rv → 1, 2, 3, 9
ne → 1, 5
un → 7, 9
in → 5, 6, 9
AFTER: the lists for ir, ef, and rv are discarded, leaving holes:
tf → 5, 9
vi → 1, 2, 4, 5, 6
ne → 1, 5
un → 7, 9
in → 5, 6, 9
Slide 14: Effects on Queries
- Need to decrease the merging-threshold T
- Lower T → more false positives to post-process
- If T drops to 0 we must panic: scan the whole collection and compute the true similarities
- Surprisingly, query processing time can decrease because there are fewer lists to consider
Slide 15: Query: shanghai, edit distance 1; 3-grams: sha, han, ang, ngh, gha, hai
Hole grams (lists discarded): han, ngh, hai. Regular grams: sha, ang, gha (the index also contains grams of other strings, e.g. uni, ing, ter).
Merging-threshold without holes: T = #grams - ed * q = 6 - 1 * 3 = 3. Basis: each edit operation can destroy at most q = 3 grams.
Naïve new merging-threshold: T' = T - #holes = 0 → panic!
But can one edit operation really destroy q = 3 non-hole grams here? Deleting 'a' or deleting 'g' destroys at most 2 of the non-hole grams sha, ang, gha. So each edit operation can destroy at most 2 non-hole grams, and the new merging-threshold is T' = 1. We use dynamic programming to compute this tighter T'.
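For the single-edit case, the tighter threshold can be sketched without the full dynamic program (a simplified illustration; the paper's DP handles multiple edits exactly):

```python
def tighter_threshold(s, q, ed, hole_grams):
    """Lower bound on surviving non-hole grams: an edit at character
    position p can only destroy the grams overlapping p, so take the
    worst position. Simplified single-edit reasoning."""
    grams = [s[i:i + q] for i in range(len(s) - q + 1)]
    nonhole = [i for i, g in enumerate(grams) if g not in hole_grams]
    worst = max(sum(1 for i in nonhole if p - q + 1 <= i <= p)
                for p in range(len(s)))
    return max(len(nonhole) - ed * worst, 0)

# shanghai with hole grams han, ngh, hai: T' = 3 - 1 * 2 = 1
print(tighter_threshold("shanghai", 3, 1, {"han", "ngh", "hai"}))  # 1
```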
Slide 16: Choosing Lists to Discard
- One extreme: the query is entirely unaffected
- Other extreme: the query becomes a panic
- A good choice of lists depends on the query workload
- Many combinations of lists to discard satisfy the memory constraint; checking all of them is infeasible
- How can we make a reasonable choice efficiently?
Slide 17: Choosing Lists to Discard
Input: memory constraint, inverted lists L, query workload W
Output: lists to discard, D

DiscardLists:
  while memory constraint not satisfied:
    for each list in L:
      Δt = estimateImpact(list, W)
      benefit = list.size()
    use the Δts and benefits to choose a list 'discard'
    add 'discard' to D
    remove 'discard' from L

How can we do this efficiently? Perhaps incrementally?
Times needed: list-merging time, post-processing time, panic time.
What exactly should we minimize? benefit / cost? cost only? (We could ignore the benefit.)
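The greedy loop might look as follows in code (a sketch: estimate_impact and the cost-per-benefit rule stand in for the paper's actual cost model):

```python
def discard_lists(lists, workload, budget, estimate_impact):
    """Greedy DiscardLists: while over budget, drop the list with the
    lowest estimated query-time impact per freed list entry."""
    discarded = []
    size = sum(len(l) for l in lists.values())
    while size > budget and lists:
        def score(gram):
            dt = estimate_impact(gram, workload)  # estimated slowdown
            benefit = len(lists[gram])            # freed list entries
            return dt / benefit                   # cost per unit of benefit
        victim = min(lists, key=score)
        size -= len(lists[victim])
        del lists[victim]
        discarded.append(victim)
    return discarded
```

With a constant impact estimate, the longest list is discarded first, since it frees the most space per unit of cost.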
Slide 18: Choosing Lists to Discard
Estimating query times with holes:
- List-merging time: a cost function whose parameters are decided offline with linear regression
- Post-processing time: #candidates x average compute-similarity time
- Panic time: #strings x average compute-similarity time
- #candidates depends on T, the data distribution, and the number of holes

Incremental-ScanCount algorithm:
Before discarding a list (T = 3), the counts for stringIDs 0-9 are 2, 0, 3, 3, 2, 4, 0, 0, 1, 0 → 3 candidates (IDs 2, 3, 5).
To discard the list {2, 3, 4, 8}, decrement the counts of its stringIDs.
After discarding (T' = T - 1 = 2), the counts are 2, 0, 2, 2, 1, 4, 0, 0, 0, 0 → 4 candidates (IDs 0, 2, 3, 5).
There are many more ways to improve the speed of DiscardLists; this is just one example.
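A minimal sketch of the incremental update (assuming the naive T' = T - 1 lowering used in this example):

```python
def incremental_scancount(counts, discarded_list, T):
    """Update ScanCount state after discarding one inverted list:
    decrement the count of every stringID on the discarded list,
    then re-collect candidates against the lowered threshold."""
    for sid in discarded_list:
        counts[sid] -= 1
    new_T = T - 1  # naive lowering; a tighter bound is computed via DP
    candidates = [sid for sid, c in enumerate(counts) if c >= new_T]
    return counts, new_T, candidates

# The slide's example: counts for stringIDs 0..9, T = 3
counts = [2, 0, 3, 3, 2, 4, 0, 0, 1, 0]
counts, T, cands = incremental_scancount(counts, [2, 3, 4, 8], 3)
print(T, cands)  # 2 [0, 2, 3, 5]
```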
Slide 19: Overview
- Motivation & Preliminaries
- Approach 1: Discarding Lists
- Approach 2: Combining Lists
- Experiments & Conclusion
Slide 20: Approach 2: Combining Lists
BEFORE: each 2-gram (tf, vi, ir, ef, rv, ne, un, in) has its own inverted list (stringIDs):
1, 2, 4, 5, 6 / 5, 9 / 1, 3, 4, 5, 7, 9 / 5, 6, 9 / 1, 2, 3, 9 / 1, 3, 9 / 7, 9 / 6, 9
AFTER: correlated lists are combined into shared physical lists: {5, 9} and {5, 6, 9} become one shared list {5, 6, 9}, and {1, 3, 9} and {1, 2, 3, 9} become {1, 2, 3, 9}; the remaining lists are unchanged.
Intuition: combine correlated lists.
Slide 21: Effects on Queries
- The merging-threshold T is unchanged (no new panics)
- Lists become longer:
  - More time to traverse the lists
  - More false positives
List-merging optimization (3-grams sha, han, ang, ngh, gha, hai): traverse each physical list once, and increase the count for the stringIDs on a physical list by its refcount (e.g. 2 or 3 for a combined list shared by that many query grams) instead of by 1.
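The refcount optimization can be sketched as follows (gram_to_listid and the example data are assumptions for illustration):

```python
from collections import Counter

def scancount_combined(query_grams, gram_to_listid, lists, T):
    """ScanCount over combined lists: traverse each physical list once,
    incrementing stringID counts by the list's refcount (the number of
    query grams mapped to that physical list) instead of by 1."""
    refcount = Counter(gram_to_listid[g]
                       for g in query_grams if g in gram_to_listid)
    counts = Counter()
    for list_id, rc in refcount.items():
        for sid in lists[list_id]:
            counts[sid] += rc
    return sorted(sid for sid, c in counts.items() if c >= T)

# Hypothetical combined index: sha and han share physical list 0.
lists = [[1, 2, 9], [2, 9]]
gram_to_listid = {"sha": 0, "han": 0, "ang": 1}
print(scancount_combined(["sha", "han", "ang"], gram_to_listid, lists, 3))
```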
Slide 22: Choosing Lists to Combine
- Discovering candidate gram pairs
  - Frequent (q+1)-grams → correlated adjacent q-grams
  - Using Locality-Sensitive Hashing (LSH)
- Selecting candidate pairs to combine
  - Based on the estimated cost on the query workload
  - Similar to DiscardLists
  - Uses a different incremental ScanCount algorithm
Slide 23: Overview
- Motivation & Preliminaries
- Approach 1: Discarding Lists
- Approach 2: Combining Lists
- Experiments & Conclusion
Slide 24: Experiments
- Datasets
  - Google WebCorpus (word grams)
  - IMDB actors
- Queries picked from the dataset, Zipf-distributed
- q = 3, edit distance = 2
- Overview
  - Performance of flavors of DiscardLists and CombineLists
  - Scalability with increasing index size
  - Comparison with an IR compression technique
  - Comparison with VGRAM
  - What if the workload changes from the training workload?
Slide 25: Experiments
[Figures: DiscardLists and CombineLists — in both cases the runtime decreases!]
Slide 26: Experiments
Comparison with an IR compression technique.
[Figures: compressed vs. uncompressed index]
Slide 27: Experiments
Comparison with the variable-length gram technique, VGRAM.
[Figures: compressed vs. uncompressed index]
Slide 28: Future Work
- DiscardLists, CombineLists, and IR compression could be combined
- When considering a filter tree: global vs. local decisions
- How to minimize the impact on performance if the workload changes
Slide 29: Conclusion
- We developed two lossy compression techniques
- We answer queries exactly
- The index can fit into a space budget (space constraint)
- Queries can become faster on the compressed indexes
- Flexibility to choose the space/time tradeoff
- Existing list-merging algorithms can be reused (even with compression-specific optimizations)
Slide 30: More Experiments
What if the workload changes from the training workload?
Slide 31: More Experiments
What if the workload changes from the training workload?