Title: Space-Constrained Gram-Based Indexing for Efficient Approximate String Search
Slide 1: Space-Constrained Gram-Based Indexing for Efficient Approximate String Search
- Alexander Behm (1), Shengyue Ji (1), Chen Li (1), Jiaheng Lu (2)
- (1) University of California, Irvine
- (2) Renmin University of China
Slide 2: Overview
- Motivation & Preliminaries
- Approach 1: Discarding Lists
- Approach 2: Combining Lists
- Experiments & Conclusion
Slide 3: Motivation: Data Cleaning
[Figure caption: Should clearly be Niels Bohr]
- Real-world data is dirty
- Typos
- Inconsistent representations (PO Box vs. P.O. Box)
- Approximately check against a clean dictionary
Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008
Slide 4: Motivation: Record Linkage
We want to link records belonging to the same entity. There is no exact match!
The same entity may have similar representations:
Arnold Schwarzeneger versus Arnold Schwarzenegger
Forrest Whittaker versus Forest Whittacker
Slide 5: Motivation: Query Relaxation
- Errors in queries
- Errors in data
- Bring the query and meaningful results closer together
Actual queries gathered by Google: http://www.google.com/jobs/britney.html
Slide 6: What is Approximate String Search?
String collection (people): Brad Pitt, Forest Whittacker, George Bush, Angelina Jolie, Arnold Schwarzeneger
Queries against the collection: find all entries similar to Forrest Whitaker; find all entries similar to Arnold Schwarzenegger; find all entries similar to Brittany Spears
- What do we mean by "similar to"?
- Edit Distance
- Jaccard Similarity
- Cosine Similarity
- Dice
- Etc.
The "similar to" predicate can help the applications described above! How can we support these types of queries efficiently?
Slide 7: Approximate Query Answering
Main idea: use q-grams as signatures for a string.
irvine → (sliding window) → 2-grams: ir, rv, vi, in, ne
Intuition: similar strings share a certain number of grams.
An inverted index on grams supports finding all data strings sharing enough grams with a query.
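As a minimal sketch (not from the slides), sliding-window gram extraction looks like:

```python
def qgrams(s, q=2):
    """Slide a window of length q over s, producing one gram per position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

print(qgrams("irvine"))  # ['ir', 'rv', 'vi', 'in', 'ne']
```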
Slide 8: Approximate Query Example
Query: irvine, edit distance 1; 2-grams: ir, rv, vi, in, ne
Look up the grams in the inverted lists (stringIDs):
tf → 5, 9
vi → 1, 2, 4, 5, 6
ir → 1, 3, 4, 5, 7, 9
ef → 3, 9
rv → 1, 2, 3, 9
ne → 1, 5
un → 7, 9
in → 5, 6, 9
Candidates: 1, 5, 9. These may include false positives, so we still need to compute the real similarity.
Each edit operation can destroy at most q grams, so answers must share at least T = 5 - 1 * 2 = 3 grams.
T-occurrence problem: find the elements occurring at least T = 3 times among the inverted lists. This is called list-merging; T is called the merging-threshold.
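The T-occurrence test can be sketched with a simple ScanCount pass (a minimal illustration; the gram-to-list mapping below is assumed for the example):

```python
from collections import Counter

def scancount(query_grams, inverted_index, T):
    """Count, for every stringID, how many query grams' lists contain it;
    stringIDs reaching the merging-threshold T become candidates."""
    counts = Counter()
    for gram in query_grams:
        for sid in inverted_index.get(gram, []):
            counts[sid] += 1
    return sorted(sid for sid, c in counts.items() if c >= T)

# 2-gram index from the example (gram -> stringIDs); assumed mapping.
index = {"tf": [5, 9], "vi": [1, 2, 4, 5, 6], "ir": [1, 3, 4, 5, 7, 9],
         "ef": [3, 9], "rv": [1, 2, 3, 9], "ne": [1, 5],
         "un": [7, 9], "in": [5, 6, 9]}
print(scancount(["ir", "rv", "vi", "in", "ne"], index, T=3))  # [1, 5, 9]
```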
Slide 9: Motivation: Compression
Index-size estimation: each string s produces |s| - q + 1 grams, and for each gram we add one element (a 4-byte uint) to its inverted list. With ASCII encoding the index is roughly 4x as large as the original data!
- The inverted index can be very large compared to the source data
- It may need to fit in memory for fast query processing
- Can we compress the index to fit into a space budget?
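A back-of-the-envelope check of this estimate (a hypothetical helper, not from the slides; the ratio approaches 4 as strings get long relative to q):

```python
def index_size_bytes(strings, q=3):
    """Rough inverted-index size: one 4-byte stringID per gram,
    with |s| - q + 1 grams per string (strings shorter than q ignored)."""
    return sum(4 * (len(s) - q + 1) for s in strings if len(s) >= q)

strings = ["irvine", "shanghai"]
data_bytes = sum(len(s) for s in strings)   # 14 bytes of ASCII data
print(index_size_bytes(strings) / data_bytes)
```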
Slide 10: Motivation: Related Work
- The IR community has developed many lossless compression algorithms for inverted lists (mostly in a disk-based setting)
  - They mainly use delta representation plus packing
- If the inverted lists are in memory, these techniques always impose decompression overhead
- The compression ratio is difficult to tune
- How can we overcome these limitations in our setting?
Slide 11: This Paper
- We developed two lossy compression techniques
- We answer queries exactly
- The index can fit into a space budget (space constraint)
- Queries can become faster on the compressed indexes
- Flexibility to choose the space/time tradeoff
- Existing list-merging algorithms can be reused (even with compression-specific optimizations)
Slide 12: Overview
- Motivation & Preliminaries
- Approach 1: Discarding Lists
- Approach 2: Combining Lists
- Experiments & Conclusion
Slide 13: Approach 1: Discarding Lists
BEFORE: inverted lists (stringIDs) for the 2-grams:
tf → 5, 9
vi → 1, 2, 4, 5, 6
ir → 1, 3, 4, 5, 7, 9
ef → 3, 9
rv → 1, 2, 3, 9
ne → 1, 5
un → 7, 9
in → 5, 6, 9
AFTER: the lists for ir, ef, and rv are discarded, leaving holes:
tf → 5, 9
vi → 1, 2, 4, 5, 6
ne → 1, 5
un → 7, 9
in → 5, 6, 9
Slide 14: Effects on Queries
- Need to decrease the merging-threshold T
- Lower T → more false positives to post-process
- If T drops to 0 we must panic: scan the whole collection and compute the true similarities
- Surprisingly, query processing time can decrease because there are fewer lists to consider
Slide 15: Query: shanghai, edit distance 1; 3-grams: sha, han, ang, ngh, gha, hai
Hole grams (lists discarded): han, ngh, hai. Regular grams: sha, ang, gha (the index also contains grams of other strings, e.g. uni, ing, ter).
Merging-threshold without holes: T = #grams - ed * q = 6 - 1 * 3 = 3. Basis: each edit operation can destroy at most q = 3 grams.
Naïve new merging-threshold: T' = T - #holes = 0 → panic!
But can one edit operation really destroy q = 3 non-hole grams here? Deleting 'a' or deleting 'g' destroys at most 2 of the non-hole grams sha, ang, gha. So each edit operation can destroy at most 2 non-hole grams, and the new merging-threshold is T' = 1. We use dynamic programming to compute this tighter T'.
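For the single-edit case, the tighter threshold can be sketched without the full dynamic program (a simplified illustration; the paper's DP handles multiple edits exactly):

```python
def tighter_threshold(s, q, ed, hole_grams):
    """Lower bound on surviving non-hole grams: an edit at character
    position p can only destroy the grams overlapping p, so take the
    worst position. Simplified single-edit reasoning."""
    grams = [s[i:i + q] for i in range(len(s) - q + 1)]
    nonhole = [i for i, g in enumerate(grams) if g not in hole_grams]
    worst = max(sum(1 for i in nonhole if p - q + 1 <= i <= p)
                for p in range(len(s)))
    return max(len(nonhole) - ed * worst, 0)

# shanghai with hole grams han, ngh, hai: T' = 3 - 1 * 2 = 1
print(tighter_threshold("shanghai", 3, 1, {"han", "ngh", "hai"}))  # 1
```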
Slide 16: Choosing Lists to Discard
- One extreme: the query is entirely unaffected
- Other extreme: the query becomes a panic
- A good choice of lists depends on the query workload
- Many combinations of lists to discard satisfy the memory constraint; checking all of them is infeasible
- How can we make a reasonable choice efficiently?
Slide 17: Choosing Lists to Discard
Input: memory constraint, inverted lists L, query workload W
Output: lists to discard, D

DiscardLists:
  while memory constraint not satisfied:
    for each list in L:
      Δt = estimateImpact(list, W)
      benefit = list.size()
    use the Δts and benefits to choose a list 'discard'
    add 'discard' to D
    remove 'discard' from L

How can we do this efficiently? Perhaps incrementally?
Times needed: list-merging time, post-processing time, panic time.
What exactly should we minimize? benefit / cost? cost only? (We could ignore the benefit.)
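The greedy loop might look as follows in code (a sketch: estimate_impact and the cost-per-benefit rule stand in for the paper's actual cost model):

```python
def discard_lists(lists, workload, budget, estimate_impact):
    """Greedy DiscardLists: while over budget, drop the list with the
    lowest estimated query-time impact per freed list entry."""
    discarded = []
    size = sum(len(l) for l in lists.values())
    while size > budget and lists:
        def score(gram):
            dt = estimate_impact(gram, workload)  # estimated slowdown
            benefit = len(lists[gram])            # freed list entries
            return dt / benefit                   # cost per unit of benefit
        victim = min(lists, key=score)
        size -= len(lists[victim])
        del lists[victim]
        discarded.append(victim)
    return discarded
```

With a constant impact estimate, the longest list is discarded first, since it frees the most space per unit of cost.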
Slide 18: Choosing Lists to Discard
Estimating query times with holes:
- List-merging time: a cost function whose parameters are decided offline with linear regression
- Post-processing time: #candidates x average compute-similarity time
- Panic time: #strings x average compute-similarity time
- #candidates depends on T, the data distribution, and the number of holes

Incremental-ScanCount algorithm:
Before discarding a list (T = 3), the counts for stringIDs 0-9 are 2, 0, 3, 3, 2, 4, 0, 0, 1, 0 → 3 candidates (IDs 2, 3, 5).
To discard the list {2, 3, 4, 8}, decrement the counts of its stringIDs.
After discarding (T' = T - 1 = 2), the counts are 2, 0, 2, 2, 1, 4, 0, 0, 0, 0 → 4 candidates (IDs 0, 2, 3, 5).
There are many more ways to improve the speed of DiscardLists; this is just one example.
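A minimal sketch of the incremental update (assuming the naive T' = T - 1 lowering used in this example):

```python
def incremental_scancount(counts, discarded_list, T):
    """Update ScanCount state after discarding one inverted list:
    decrement the count of every stringID on the discarded list,
    then re-collect candidates against the lowered threshold."""
    for sid in discarded_list:
        counts[sid] -= 1
    new_T = T - 1  # naive lowering; a tighter bound is computed via DP
    candidates = [sid for sid, c in enumerate(counts) if c >= new_T]
    return counts, new_T, candidates

# The slide's example: counts for stringIDs 0..9, T = 3
counts = [2, 0, 3, 3, 2, 4, 0, 0, 1, 0]
counts, T, cands = incremental_scancount(counts, [2, 3, 4, 8], 3)
print(T, cands)  # 2 [0, 2, 3, 5]
```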
Slide 19: Overview
- Motivation & Preliminaries
- Approach 1: Discarding Lists
- Approach 2: Combining Lists
- Experiments & Conclusion
Slide 20: Approach 2: Combining Lists
BEFORE: each 2-gram (tf, vi, ir, ef, rv, ne, un, in) has its own inverted list (stringIDs):
1, 2, 4, 5, 6 / 5, 9 / 1, 3, 4, 5, 7, 9 / 5, 6, 9 / 1, 2, 3, 9 / 1, 3, 9 / 7, 9 / 6, 9
AFTER: correlated lists are combined into shared physical lists: {5, 9} and {5, 6, 9} become one shared list {5, 6, 9}, and {1, 3, 9} and {1, 2, 3, 9} become {1, 2, 3, 9}; the remaining lists are unchanged.
Intuition: combine correlated lists.
Slide 21: Effects on Queries
- The merging-threshold T is unchanged (no new panics)
- Lists become longer:
  - More time to traverse the lists
  - More false positives
List-merging optimization (3-grams sha, han, ang, ngh, gha, hai): traverse each physical list once, and increase the count for the stringIDs on a physical list by its refcount (e.g. 2 or 3 for a combined list shared by that many query grams) instead of by 1.
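The refcount optimization can be sketched as follows (gram_to_listid and the example data are assumptions for illustration):

```python
from collections import Counter

def scancount_combined(query_grams, gram_to_listid, lists, T):
    """ScanCount over combined lists: traverse each physical list once,
    incrementing stringID counts by the list's refcount (the number of
    query grams mapped to that physical list) instead of by 1."""
    refcount = Counter(gram_to_listid[g]
                       for g in query_grams if g in gram_to_listid)
    counts = Counter()
    for list_id, rc in refcount.items():
        for sid in lists[list_id]:
            counts[sid] += rc
    return sorted(sid for sid, c in counts.items() if c >= T)

# Hypothetical combined index: sha and han share physical list 0.
lists = [[1, 2, 9], [2, 9]]
gram_to_listid = {"sha": 0, "han": 0, "ang": 1}
print(scancount_combined(["sha", "han", "ang"], gram_to_listid, lists, 3))
```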
Slide 22: Choosing Lists to Combine
- Discovering candidate gram pairs
  - Frequent (q+1)-grams → correlated adjacent q-grams
  - Using Locality-Sensitive Hashing (LSH)
- Selecting candidate pairs to combine
  - Based on the estimated cost on the query workload
  - Similar to DiscardLists
  - Uses a different incremental ScanCount algorithm
Slide 23: Overview
- Motivation & Preliminaries
- Approach 1: Discarding Lists
- Approach 2: Combining Lists
- Experiments & Conclusion
Slide 24: Experiments
- Datasets
  - Google WebCorpus (word grams)
  - IMDB actors
- Queries picked from the dataset, Zipf-distributed
- q = 3, edit distance = 2
- Overview
  - Performance of flavors of DiscardLists and CombineLists
  - Scalability with increasing index size
  - Comparison with an IR compression technique
  - Comparison with VGRAM
  - What if the workload changes from the training workload?
Slide 25: Experiments
[Figures: DiscardLists and CombineLists — in both cases the runtime decreases!]
Slide 26: Experiments
Comparison with an IR compression technique.
[Figures: compressed vs. uncompressed index]
Slide 27: Experiments
Comparison with the variable-length gram technique, VGRAM.
[Figures: compressed vs. uncompressed index]
Slide 28: Future Work
- DiscardLists, CombineLists, and IR compression could be combined
- When considering a filter tree: global vs. local decisions
- How to minimize the impact on performance if the workload changes
Slide 29: Conclusion
- We developed two lossy compression techniques
- We answer queries exactly
- The index can fit into a space budget (space constraint)
- Queries can become faster on the compressed indexes
- Flexibility to choose the space/time tradeoff
- Existing list-merging algorithms can be reused (even with compression-specific optimizations)
Slide 30: More Experiments
What if the workload changes from the training workload?
Slide 31: More Experiments
What if the workload changes from the training workload?