Title: Efficient Approximate Search on String Collections Part I
1Efficient Approximate Search on String Collections Part I
Chen Li
2DBLP Author Search
- http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html
3Try their names (good luck!)
UCSD
Yannis Papakonstantinou
Meral Ozsoyoglu
Marios Hadjieleftheriou
- http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html
4?
5Better system?
http://dblp.ics.uci.edu/authors/
6People Search at UC Irvine
http://psearch.ics.uci.edu/
7Web Search
- Errors in queries
- Errors in data
- Bring query and meaningful results closer together
Actual queries gathered by Google
http://www.google.com/jobs/britney.html
8Data Cleaning
R
S
infromix
mcrosoft
informix
microsoft
9Problem Formulation
- Find strings similar to a given string: dist(Q,D) <= d
- Example: find strings similar to hadjeleftheriou
- Performance is important!
- 10 ms per query: 100 queries per second (QPS)
- 5 ms per query: 200 QPS
10Outline
- Motivation
- Preliminaries
- Trie-based approach
- Gram-based algorithms
- Sketch-based algorithms
- Compression
- Selectivity estimation
- Transformations/Synonyms
- Conclusion
Part I
Part II
11Next
12Similarity Functions
- Similar to
- a domain-specific function
- returns a similarity value between two strings
- Examples
- Edit distance
- Hamming distance
- Jaccard similarity
- Soundex
- TF/IDF, BM25, DICE
- See KSS06 for an excellent survey
13Edit Distance
- A widely used metric to define string similarity
- ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
- Example
- s1 = Tom Hanks
- s2 = Ton Hank
- ed(s1,s2) = 2
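The definition above can be checked with the standard dynamic-programming recurrence. This is a minimal sketch, not taken from the slides; the function name `edit_distance` is an illustrative choice.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum # of insertions, deletions, and substitutions turning s1 into s2."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

Only two rows of the table are kept, so memory is linear in the length of s2.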
14Next
- Gram-based algorithms
- List-merging algorithms LLL08
- Variable-length grams (VGRAM) LWY07,YWL08
15q-grams of strings
u n i v e r s a l
2-grams
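A small sketch of q-gram extraction for the string on this slide. It assumes no padding characters are added at the string boundaries (some formulations pad); the name `qgrams` is illustrative.

```python
def qgrams(s: str, q: int) -> list[str]:
    """All substrings of length q, in order of their start position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal", 2))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```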
16Edit operations effect on grams
Fixed length q
u n i v e r s a l
- k operations could affect k * q grams
- If ed(s1,s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q
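The count bound can be tried on a concrete pair. The sketch below assumes unpadded q-grams compared as multisets; `count_lower_bound` and `common_grams` are hypothetical helper names.

```python
from collections import Counter


def count_lower_bound(length: int, q: int, k: int) -> int:
    """(|s1| - q + 1) - k * q; a value <= 0 means the bound filters nothing."""
    return (length - q + 1) - k * q


def common_grams(s1: str, s2: str, q: int) -> int:
    """Size of the multiset intersection of the q-grams of s1 and s2."""
    c1 = Counter(s1[i:i + q] for i in range(len(s1) - q + 1))
    c2 = Counter(s2[i:i + q] for i in range(len(s2) - q + 1))
    return sum((c1 & c2).values())


# "shtick" and "stick" are within edit distance 1 (delete 'h').
print(count_lower_bound(len("shtick"), q=2, k=1))  # 3
print(common_grams("shtick", "stick", q=2))        # 3 (ti, ic, ck)
```

Here the bound is tight: the pair shares exactly as many 2-grams as the bound requires.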
17q-gram inverted lists
18Searching using inverted lists
- Query: shtick, ED(shtick, ?) <= 1
- 2-grams of the query: sh ht ti ic ck
- Inverted lists shown: ic, ck, ti
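A minimal sketch of answering such a query with q-gram inverted lists. The string collection below is made up for illustration, the helper names are hypothetical, and the returned ids are only candidates that still need verification by computing the real edit distance. The sketch assumes the threshold T is at least 1.

```python
from collections import Counter, defaultdict


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q):
    """Map each q-gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in set(qgrams(s, q)):  # each id appears once per list
            index[g].append(sid)
    return index


def candidates(index, query, q, k):
    """Ids that occur on at least T of the query's inverted lists."""
    T = (len(query) - q + 1) - k * q  # count bound; assumed >= 1 here
    counts = Counter()
    for g in qgrams(query, q):
        for sid in index.get(g, []):
            counts[sid] += 1
    return sorted(sid for sid, c in counts.items() if c >= T)


strings = ["rich", "stick", "stich", "stuck", "static"]  # hypothetical data
index = build_index(strings, q=2)
print(candidates(index, "shtick", q=2, k=1))  # [1], i.e. "stick"
```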
19T-occurrence Problem
- Merge the inverted lists (ids in ascending order)
- Find elements whose # of occurrences >= T
20Example
1 3 5 10 13
10 13 15
5 7 13
13
15
Result: 13
21List-Merging Algorithms
- SK04: HeapMerger, MergeOpt
- LLL08, BK02: ScanCount, MergeSkip, DivideSkip
22Heap-based Algorithm
- Push the front element of each list to a min-heap
- Count the # of occurrences of each element using the heap
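A sketch of the heap-based merge on the lists from the earlier example. It assumes each list is sorted ascending with no repeated id; `heap_merge` is an illustrative name.

```python
import heapq


def heap_merge(lists, T):
    """Return ids that occur on at least T of the sorted lists."""
    # Heap entries: (id, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    current, count = None, 0
    while heap:
        value, i, pos = heapq.heappop(heap)
        if value == current:
            count += 1
        else:
            if current is not None and count >= T:
                result.append(current)
            current, count = value, 1
        if pos + 1 < len(lists[i]):  # advance the list we just popped from
            heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
    if current is not None and count >= T:
        result.append(current)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, T=4))  # [13]
```

Every element of every list passes through the heap, which is the cost the later algorithms try to avoid.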
23MergeOpt Algorithm SK04
- Long lists: the T-1 longest lists, probed by binary search
- Short lists: the remaining lists, merged first
24Example of MergeOpt
1 3 5 10 13
10 13 15
5 7 13
13
15
- Long lists: 3
- Short lists: 2
- Count threshold T = 4
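A sketch of the MergeOpt idea on this example: merge only the short lists, then probe the T-1 long lists by binary search for each id found. An id that appears only on long lists can occur at most T-1 times, so it is safe to ignore. The function names are illustrative, and the code is a reading of the slide rather than the exact procedure of SK04.

```python
import heapq
from bisect import bisect_left
from itertools import groupby


def contains(sorted_list, x):
    """Binary search for x in an ascending list."""
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def merge_opt(lists, T):
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]
    result = []
    # heapq.merge yields the ids of the short lists in ascending order.
    for value, group in groupby(heapq.merge(*short_lists)):
        count = sum(1 for _ in group)
        count += sum(contains(lst, value) for lst in long_lists)
        if count >= T:
            result.append(value)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, T=4))  # [13]
```

With T = 4 the two short lists contribute only the ids 13 and 15, so just two ids are probed in the long lists.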
25ScanCount
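- Keep an array indexed by string id holding the # of occurrences
- Scan every list; increment the count of each id by 1
- Lists: 1 3 5 10 13 | 10 13 15 | 5 7 13 | 13 | 15
- Count threshold T = 4
- Result: id 13 (count 4); id 15 has count 2, id 14 has count 0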
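A sketch of ScanCount on the example lists. It assumes string ids are integers in the range 0 to `num_ids` - 1 (a parameter introduced here) and that no id repeats within a list.

```python
def scan_count(lists, T, num_ids):
    """Count occurrences per id in a flat array; report ids reaching T."""
    counts = [0] * num_ids
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # reported exactly once, when T is reached
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, T=4, num_ids=16))  # [13]
```

No heap or sorting of list contents is needed, at the price of an array as large as the id space.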
26List-Merging Algorithms
- SK04: HeapMerger, MergeOpt
- LLL08, BK02: ScanCount, MergeSkip, DivideSkip
27MergeSkip algorithm BK02, LLL08
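- Min-heap over the front element of each list
- If the top element occurs fewer than T times, pop T-1 elements from the heap
- Jump: move each popped list to its first element greater than or equal to the new heap top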
28Example of MergeSkip
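- Figure: MergeSkip on five inverted lists; the minHeap holds their front elements (1, 10, 5, 13, 15)
- Jump: popped lists skip ahead instead of advancing one element at a time
- Count threshold T = 4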
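A sketch of MergeSkip as described on these slides, run on the lists of the earlier merging example (an assumption, since the figure's exact lists are not recoverable). It assumes T >= 1 and ascending lists without repeated ids; `merge_skip` is an illustrative name, not the code of BK02 or LLL08.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Ids on at least T lists, skipping elements that cannot be answers."""
    pos = [0] * len(lists)  # current position in each list
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while len(heap) >= T:  # fewer than T active lists cannot give an answer
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            result.append(top)
            for i in popped:
                pos[i] += 1
        else:
            # Pop more lists until T-1 are popped in total; any id below the
            # new heap top sits on at most T-1 lists, so it can be skipped.
            while len(popped) < T - 1:
                popped.append(heapq.heappop(heap)[1])
            target = heap[0][0]
            for i in popped:  # jump to first element >= target
                pos[i] = bisect_left(lists[i], target, pos[i])
        for i in popped:
            if pos[i] < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, T=4))  # [13]
```

On this input the first step pops 1, 5, and 10, and all three lists jump directly to 13.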
29DivideSkip Algorithm LLL08
- Long lists: probed by binary search
- Short lists: merged with MergeSkip
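A sketch of how the two pieces could be combined: run MergeSkip on the short lists with the reduced threshold T - L, then binary-search the L long lists for each surviving id. The number of long lists L is left as a plain parameter here; how to choose it well is not covered by this sketch. The function names are hypothetical, and the code is an illustration rather than the implementation of LLL08.

```python
import heapq
from bisect import bisect_left


def merge_skip_counts(lists, T):
    """MergeSkip returning (id, count) for ids on at least T lists (T >= 1)."""
    pos = [0] * len(lists)
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    found = []
    while len(heap) >= T:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            found.append((top, len(popped)))
            for i in popped:
                pos[i] += 1
        else:
            while len(popped) < T - 1:
                popped.append(heapq.heappop(heap)[1])
            target = heap[0][0]
            for i in popped:  # jump to first element >= target
                pos[i] = bisect_left(lists[i], target, pos[i])
        for i in popped:
            if pos[i] < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos[i]], i))
    return found


def divide_skip(lists, T, L):
    """Treat the L longest lists as long lists; requires 0 <= L < T."""
    if not 0 <= L < T:
        raise ValueError("L must satisfy 0 <= L < T")
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:L], by_length[L:]
    result = []
    for sid, count in merge_skip_counts(short_lists, T - L):
        for lst in long_lists:  # binary search in each long list
            i = bisect_left(lst, sid)
            if i < len(lst) and lst[i] == sid:
                count += 1
        if count >= T:
            result.append(sid)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(divide_skip(lists, T=4, L=2))  # [13]
```

With L = 0 this degenerates to plain MergeSkip, and with L = T - 1 it behaves like MergeOpt.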
30How many lists are treated as long lists?
31 Length Filtering
- s: length 10
- t: length 19
- Ed(s,t) <= 2?
- Prune by length only!
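The length filter follows from the fact that each edit operation changes the length of a string by at most one. A minimal sketch, with a hypothetical function name and placeholder strings of the lengths shown on the slide:

```python
def passes_length_filter(s: str, t: str, k: int) -> bool:
    """False means ed(s, t) > k for certain; True means 'still possible'."""
    return abs(len(s) - len(t)) <= k


s = "x" * 10  # placeholder string of length 10
t = "y" * 19  # placeholder string of length 19
print(passes_length_filter(s, t, k=2))  # False: lengths differ by 9 > 2
```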
32Positional Filtering
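- Ed(s,t) <= 2
- s has positional gram (ab,1)
- t has positional gram (ab,12)
- The positions differ by more than 2, so these two grams cannot be matched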
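A sketch of positional filtering with 1-based gram positions, as on the slide. The count below is an upper bound on the number of matchable grams (each gram of s only needs some compatible partner in t), which keeps the filter safe. All names and the sample strings are illustrative assumptions.

```python
def positional_qgrams(s, q):
    """(gram, position) pairs with 1-based positions."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def may_match(gram_s, gram_t, k):
    """Same gram text and positions at most k apart."""
    (g1, p1), (g2, p2) = gram_s, gram_t
    return g1 == g2 and abs(p1 - p2) <= k


def matchable_grams(s, t, q, k):
    """# of grams of s that have at least one compatible gram in t."""
    grams_t = positional_qgrams(t, q)
    return sum(any(may_match(gs, gt, k) for gt in grams_t)
               for gs in positional_qgrams(s, q))


print(may_match(("ab", 1), ("ab", 12), k=2))       # False
print(matchable_grams("abxyz", "xyzab", q=2, k=1))  # 0
```

Without positions, "abxyz" and "xyzab" share three 2-grams (ab, xy, yz); with k = 1 none of them can be matched.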
33A filter tree
- Combine filters with list-merging algorithms LLL08
34Next
- Variable-length grams (VGRAM) LWY07,YWL08
35 2-grams -> 3-grams?
- Query: shtick, ED(shtick, ?) <= 1
- 3-grams of the query: sht hti tic ick
- Inverted lists shown: ick, tic
- # of common grams >= 1
36Observation 1 dilemma of choosing q
- Increasing q causes
- Longer grams -> shorter lists
- Smaller # of common grams of similar strings
37Observation 2 skew distributions of gram frequencies
- DBLP: 276,699 article titles
- Popular 5-grams: ation (> 114K times), tions, ystem, catio
38VGRAM Main idea
- Grams with variable lengths (between qmin and qmax)
- zebra
- ze (123)
- corrasion
- co (5213), cor (859), corr (171)
- Advantages
- Reduced index size
- Reduced running time
- Adoptable by many algorithms
39Challenges
- Generating variable-length grams?
- Constructing a high-quality gram dictionary?
- Relationship between string similarity and their gram-set similarity?
- Adopting VGRAM in existing algorithms?
40Challenge 1 String -> Variable-length grams?
u n i v e r s a l
u n i v e r s a l
41Representing gram dictionary as a trie
ni ivr sal uni vers
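A sketch of one way a string could be decomposed with a gram dictionary stored as a trie: at each position take the longest dictionary gram of length between qmin and qmax, fall back to the plain qmin-gram when no dictionary gram matches, and drop any gram that is already covered by a previously emitted one. This is an illustrative reading in the spirit of LWY07, not necessarily the authors' exact procedure. The function names are hypothetical, and qmin = 2 and qmax = 4 are assumed values for the dictionary on this slide.

```python
def build_trie(grams):
    """Nested-dict trie; the key None marks the end of a dictionary gram."""
    root = {}
    for g in grams:
        node = root
        for ch in g:
            node = node.setdefault(ch, {})
        node[None] = True
    return root


def longest_match(trie, s, start, qmax):
    """Length of the longest dictionary gram starting at s[start] (0 if none)."""
    node, best = trie, 0
    for j in range(start, min(len(s), start + qmax)):
        node = node.get(s[j])
        if node is None:
            break
        if None in node:
            best = j - start + 1
    return best


def vgrams(s, trie, qmin, qmax):
    """(position, gram) pairs, 0-based, skipping subsumed grams."""
    out, max_end = [], 0
    for p in range(len(s) - qmin + 1):
        length = longest_match(trie, s, p, qmax)
        if length < qmin:
            length = qmin      # no dictionary gram here: use the qmin-gram
        end = p + length
        if end <= max_end:     # covered by an earlier, longer gram
            continue
        out.append((p, s[p:end]))
        max_end = end
    return out


trie = build_trie(["ni", "ivr", "sal", "uni", "vers"])
print(vgrams("universal", trie, qmin=2, qmax=4))
# [(0, 'uni'), (2, 'iv'), (3, 'vers'), (6, 'sal')]
```

The string yields four variable-length grams instead of the eight fixed 2-grams.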
42Step 2 Constructing a gram dictionary
qmin = 2
qmax = 4
- Frequency-based LWY07
- Cost-based YWL08
43Challenge 3 Edit operations effect on grams
Fixed length q
u n i v e r s a l
- k operations could affect k * q grams
44Deletion affects variable-length grams
- Deletion at position i
- Affected: grams within [i - qmax + 1, i + qmax - 1]
- Not affected: grams entirely before or after this range
45Main idea
- For a string, for each position, compute the number of grams that could be destroyed by an operation at this position
- Compute number of grams possibly destroyed by k operations
- Store these numbers (for all data strings) as part of the index
- Use this number to do count filtering
46Summary of VGRAM index
47Challenge 4 adopting VGRAM
- Easily adoptable by many algorithms
- Basic interfaces
- String s -> grams
- String s1, s2 such that ed(s1,s2) <= k -> min # of their common grams
48Lower bound on # of common grams
Fixed length (q)
u n i v e r s a l
- If ed(s1,s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q
Variable lengths: # of grams of s1 - NAG(s1,k)
49Example algorithm using inverted lists
- Query: shtick, ED(shtick, ?) <= 1
- 2-4 grams of the query: sh ht tick
- Inverted list shown: tick
- Lower bound with 2-grams: 3
- Lower bound with 2-4 grams: 1
50End of part I
- Motivation
- Preliminaries
- Trie-based approach
- Gram-based algorithms
- Sketch-based algorithms
- Compression
- Selectivity estimation
- Transformations/Synonyms
- Conclusion
Part I
Part II