Title: Efficient Approximate Search on String Collections Part I
1. Efficient Approximate Search on String Collections, Part I
Chen Li
2. DBLP Author Search
- http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html
3. Try their names (good luck!)
UCSD
Yannis Papakonstantinou
Meral Ozsoyoglu
Marios Hadjieleftheriou
- http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html
4. ?
5. Better system?
http://dblp.ics.uci.edu/authors/
6. People Search at UC Irvine
http://psearch.ics.uci.edu/
7. Web Search
- Errors in queries
- Errors in data
- Bring query and meaningful results closer together
Actual queries gathered by Google: http://www.google.com/jobs/britney.html
8. Data Cleaning
[Figure: two relations, R and S]
9. Problem Formulation
Find strings similar to a given string: dist(Q, D) ≤ d
Example: find strings similar to hadjeleftheriou
- Performance is important!
- 10 ms per query: 100 queries per second (QPS)
- 5 ms per query: 200 QPS
10. Outline
- Motivation
- Preliminaries
- Trie-based approach
- Gram-based algorithms
- Sketch-based algorithms
- Compression
- Selectivity estimation
- Transformations/Synonyms
- Conclusion
Part I
Part II
11. Next
12. Similarity Functions
- Similar to
- a domain-specific function
- returns a similarity value between two strings
- Examples
- Edit distance
- Hamming distance
- Jaccard similarity
- Soundex
- TF/IDF, BM25, DICE
- See [KSS06] for an excellent survey
13. Edit Distance
- A widely used metric to define string similarity
- ed(s1, s2) = minimum number of operations (insertion, deletion, substitution) needed to change s1 into s2
- Example
- s1 = Tom Hanks
- s2 = Ton Hank
- ed(s1, s2) = 2
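The definition above can be checked with the textbook dynamic-programming recurrence. The following is a minimal sketch, not code from the slides; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance, computed one DP row at a time."""
    # prev[j] = distance between the empty prefix of s1 and s2[:j]
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance between s1[:i] and the empty string
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))
```

For the slide's example the two operations are one substitution (m to n) and one deletion (the final s).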
14. State of the art: Oracle 10g and older versions
- Supported by Oracle Text
- CREATE TABLE engdict(word VARCHAR(20), len INT);
- Create preferences for text indexing:
    begin
      ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH');
    end;
    /
- CREATE INDEX fuzzy_stem_subst_idx ON engdict (word) INDEXTYPE IS ctxsys.context PARAMETERS ('Wordlist STEM_FUZZY_PREF');
- Usage:
    SELECT * FROM engdict
    WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;
- Limitation: cannot handle errors in the first letters
- Katherine versus Catherine
15. Microsoft SQL Server [CGG05]
- Data cleaning tools available in SQL Server 2005
- Part of Integration Services
- Supports fuzzy lookups
- Uses data flow pipeline of transformations
- Similarity function: tokens with TF/IDF scores
16. Lucene
- Uses Levenshtein distance (edit distance)
- Example: roam~0.8
- Prefix pruning followed by a scan (Efficiency?)
17. Next
- Trie-based approach [JLL09]
18. Trie Indexing
- Strings
- exam
- example
- exemplar
- exempt
- sample
[Figure: trie built over the five strings, with branches starting at e and s]
19. Active nodes on Trie
- Query: example
- Edit-distance threshold: 2
[Figure: the trie with each active node annotated by its edit distance to the query; the annotated values are 2, 1, 2, 2, 0, 2]
20. Initialization
Initial active nodes: all nodes within depth d
[Figure: the trie with the root annotated 0, the nodes e and s annotated 1, and their children x and a annotated 2]
21. Incremental Algorithm
Return leaf nodes as answers.
22. Active nodes for Q = e
[Figure: the trie with the active nodes for the query prefix e annotated by edit distance; the annotated values are 1, 0, 1, 1, 2, 2, 2]
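The notion of active nodes (trie nodes whose prefix is within the edit-distance threshold of the query) can be sketched as follows. This is a simpler, non-incremental stand-in for the incremental algorithm of [JLL09]: it recomputes the active nodes for a given query from scratch, carrying one dynamic-programming row per trie node. The names `build_trie` and `active_nodes`, the nested-dict trie, and the "$" end marker are assumptions made for this illustration.

```python
def build_trie(strings):
    """Build a trie as nested dicts; the key "$" marks the end of a string.

    Assumes "$" does not occur in the data strings.
    """
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


def active_nodes(trie, query, tau):
    """Map each trie prefix within edit distance tau of `query` to that distance."""
    result = {}

    def visit(node, prefix, row):
        # row[j] = edit distance between `prefix` and query[:j]
        if row[-1] <= tau:
            result[prefix] = row[-1]
        for ch, child in node.items():
            if ch == "$":
                continue
            new_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,   # skip a query character
                                   row[j] + 1,           # skip the trie character
                                   row[j - 1] + cost))   # match or substitute
            # No descendant can get below the row minimum, so prune on it.
            if min(new_row) <= tau:
                visit(child, prefix + ch, new_row)

    visit(trie, "", list(range(len(query) + 1)))
    return result


trie = build_trie(["exam", "example", "exemplar", "exempt", "sample"])
print(active_nodes(trie, "", 2))         # initialization: nodes within depth 2
print(active_nodes(trie, "example", 2))
```

Answers for a query prefix are then the data strings stored at or below the active nodes.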
23. Good and bad
- Advantages
- Trie size is small
- Can do search as the user types
- Disadvantages
- Works for edit distance only
24. Next
- Gram-based algorithms
- List-merging algorithms [LLL08]
- Variable-length grams (VGRAM) [LWY07, YWL08]
25. q-grams of strings
Example: universal
2-grams: un, ni, iv, ve, er, rs, sa, al
26. Edit operations' effect on grams
Fixed length: q
Example: universal
- k operations could affect at most k * q grams
- If ed(s1, s2) ≤ k, then their # of common grams ≥ (|s1| - q + 1) - k * q
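The count bound for fixed-length grams can be tried out on small strings. A minimal sketch, with function names (`qgrams`, `common_grams`, `count_lower_bound`) chosen here for illustration; grams are compared as multisets.

```python
from collections import Counter


def qgrams(s: str, q: int) -> list:
    """All overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def common_grams(s1: str, s2: str, q: int) -> int:
    """Number of q-grams shared by s1 and s2 (multiset intersection)."""
    shared = Counter(qgrams(s1, q)) & Counter(qgrams(s2, q))
    return sum(shared.values())


def count_lower_bound(length: int, q: int, k: int) -> int:
    """(|s1| - q + 1) - k * q; a value <= 0 means the filter prunes nothing."""
    return (length - q + 1) - k * q


print(qgrams("universal", 2))
# One substitution (e -> a) destroys at most q = 2 of the 8 grams.
print(common_grams("universal", "univarsal", 2),
      count_lower_bound(len("universal"), 2, 1))
```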
27. q-gram inverted lists
28. Searching using inverted lists
- Query: shtick, ED(shtick, ?) ≤ 1
- 2-grams of the query: sh, ht, ti, ic, ck
[Figure: inverted lists of the grams ti, ic, and ck]
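A q-gram inverted index maps each gram to the sorted list of ids of the strings containing it. The sketch below builds one over a small hypothetical collection (the four data strings are made up for this example, not taken from the slides) and fetches the lists for the 2-grams of the query shtick.

```python
from collections import defaultdict


def qgrams(s: str, q: int) -> list:
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q):
    """Map each q-gram to the ascending list of string ids that contain it.

    Simplification: an id is listed once per gram, even if the gram
    occurs several times in the string.
    """
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):
            index[gram].append(sid)
    return index


strings = ["stick", "stich", "such", "stuck"]  # hypothetical data, ids 0..3
index = build_index(strings, 2)

# One inverted list per query gram; grams absent from the data get an empty list.
query_lists = {g: index.get(g, []) for g in qgrams("shtick", 2)}
print(query_lists)
```

Only id 0 (stick) appears on at least 3 of the 5 lists, so with the count bound (6 - 2 + 1) - 1 * 2 = 3 it is the only candidate left to verify.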
29. T-occurrence Problem
Find elements whose # of occurrences ≥ T
[Figure: inverted lists, each in ascending order, being merged]
30. Example
Lists:
1 3 5 10 13
10 13 15
5 7 13
13
15
Result: 13
31. List-Merging Algorithms
- HeapMerger, MergeOpt [SK04]
- ScanCount, MergeSkip, DivideSkip [LLL08, BK02]
32. Heap-based Algorithm
- Count the # of occurrences of each element using a heap
[Figure: list elements are pushed to a min-heap]
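A heap-based merge for the T-occurrence problem can be sketched as below: the min-heap holds the current element of every list, so equal ids are popped consecutively and can be counted. The function name `heap_merge` is chosen for illustration; lists are assumed to be sorted ascending without duplicates.

```python
import heapq


def heap_merge(lists, T):
    """Return the ids that occur on at least T of the sorted lists."""
    # Heap entries: (value, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    current, count = None, 0
    while heap:
        value, i, pos = heapq.heappop(heap)
        if value == current:
            count += 1
        else:
            if current is not None and count >= T:
                result.append(current)
            current, count = value, 1
        if pos + 1 < len(lists[i]):
            heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
    if current is not None and count >= T:
        result.append(current)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, 4))
```

Every element of every list passes through the heap, which is the cost the skipping algorithms try to avoid.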
33. MergeOpt Algorithm [SK04]
[Figure: the lists are split into T-1 long lists, which are probed by binary search, and the remaining short lists]
34. Example of MergeOpt
Lists:
1 3 5 10 13
10 13 15
5 7 13
13
15
Long lists: 3
Short lists: 2
Count threshold T = 4
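The long/short split can be sketched as follows. Any id that occurs on T lists must occur on at least one short list, because only T-1 lists are long; so candidates come from the short lists and are then looked up in the long lists by binary search. This sketch counts candidates on the short lists with a `Counter` rather than a heap, which is a simplification made here, and the name `merge_opt` is illustrative. Lists are assumed sorted and duplicate-free.

```python
from bisect import bisect_left
from collections import Counter


def merge_opt(lists, T):
    """Return the ids that occur on at least T of the sorted lists."""
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]

    # Count occurrences on the short lists only.
    counts = Counter()
    for lst in short_lists:
        counts.update(lst)

    result = []
    for value in sorted(counts):
        count = counts[value]
        # Probe each long list with a binary search.
        for lst in long_lists:
            p = bisect_left(lst, value)
            if p < len(lst) and lst[p] == value:
                count += 1
        if count >= T:
            result.append(value)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, 4))
```

With T = 4 the three longest lists are treated as long and the lists [13] and [15] as short, matching the split on the slide.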
35. ScanCount
- Keep one counter (# of occurrences) per string id
- Scan every list and increment the counter of each id by 1
- Count threshold T = 4
Lists:
1 3 5 10 13
10 13 15
5 7 13
13
15
Result: 13 (count 4)
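ScanCount replaces the heap with an array of counters indexed by string id. A minimal sketch, assuming ids are integers in the range 0..max_id and that no id repeats within a list; the name `scan_count` and the `max_id` parameter are chosen for illustration.

```python
def scan_count(lists, T, max_id):
    """Return the ids that occur on at least T lists, using one counter per id."""
    counts = [0] * (max_id + 1)
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # report each id once, when it reaches T
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, 4, max_id=15))
```

The lists do not need to be sorted for this to work; the price is a counter array as large as the id space.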
36. List-Merging Algorithms
- HeapMerger, MergeOpt [SK04]
- ScanCount, MergeSkip, DivideSkip [LLL08, BK02]
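37. MergeSkip algorithm [BK02, LLL08]
[Figure: T-1 elements are popped from the min-heap; each popped list then jumps to its first element greater than or equal to the new top of the heap]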
38. Example of MergeSkip
Count threshold T = 4
[Figure: five inverted lists and a min-heap over their current elements (1, 10, 5, 13, 15); the popped lists jump forward to their next candidate]
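The pop-and-jump step can be sketched as follows, based on the description of MergeSkip in [LLL08] as understood here; details such as the use of `bisect_left` for the jump and the name `merge_skip` are choices made for this illustration. Lists are assumed sorted ascending without duplicates, and T ≥ 1.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Return the ids that occur on at least T of the sorted lists."""
    pos = [0] * len(lists)  # current position in each list
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    # With fewer than T lists left, no id can reach T occurrences.
    while len(heap) >= T:
        t = heap[0][0]
        popped = []
        while heap and heap[0][0] == t:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            result.append(t)
            for i in popped:  # advance each matching list by one
                pos[i] += 1
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
        else:
            # Pop more lists until T-1 are out of the heap.
            while len(popped) < T - 1:
                popped.append(heapq.heappop(heap)[1])
            # Ids below the new top can appear on at most T-1 lists: skip them.
            new_top = heap[0][0]
            for i in popped:
                pos[i] = bisect_left(lists[i], new_top, pos[i])
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, 4))
```

On these lists with T = 4, the first round pops the elements 1, 5, and 10 and all three lists jump directly to 13, so 3 and 7 are never visited.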
39. DivideSkip Algorithm [LLL08]
- Long lists: binary search
- Short lists: MergeSkip
40. How many lists are treated as long lists?
41. Length Filtering
Ed(s, t) ≤ 2
- s: length 10
- t: length 19
- By length only!
42. Positional Filtering
Ed(s, t) ≤ 2
- s: (ab, 1)
- t: (ab, 12)
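Both filters reduce to simple comparisons. The sketch below assumes the usual formulation: two strings within edit distance k differ in length by at most k, and a shared gram only counts if its positions in the two strings differ by at most k. The function names are chosen for illustration.

```python
def passes_length_filter(s: str, t: str, k: int) -> bool:
    """ed(s, t) <= k is only possible if the lengths differ by at most k."""
    return abs(len(s) - len(t)) <= k


def positional_qgrams(s: str, q: int) -> list:
    """q-grams paired with their starting position."""
    return [(s[i:i + q], i) for i in range(len(s) - q + 1)]


def grams_can_match(g1, g2, k: int) -> bool:
    """Two positional grams correspond only if equal and at most k positions apart."""
    (gram1, pos1), (gram2, pos2) = g1, g2
    return gram1 == gram2 and abs(pos1 - pos2) <= k


# Lengths 10 and 19 with k = 2: pruned by length alone.
print(passes_length_filter("a" * 10, "b" * 19, 2))
# The gram ab at positions 1 and 12 with k = 2: not counted as common.
print(grams_can_match(("ab", 1), ("ab", 12), 2))
```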
43. Normalized weights [HKS08]
- Compute a weight for each string
- L0: length of the string
- L1, L2: depend on q-gram frequencies
- Similar strings have similar weights
- A very strong pruning condition
44. Pruning using normalized weights
- Sort inverted lists based on string weights
- Search within a small weight range
- Shown to be effective (> 90% of candidates pruned)
45. Next
- Variable-length grams (VGRAM) [LWY07, YWL08]
46. 2-grams -> 3-grams?
- Query: shtick, ED(shtick, ?) ≤ 1
- 3-grams of the query: sht, hti, tic, ick
- # of common grams ≥ 1
[Figure: inverted lists of the grams tic and ick]
47. Observation 1: dilemma of choosing q
- Increasing q causes:
- Longer grams -> shorter lists
- Smaller # of common grams of similar strings
48. Observation 2: skewed distributions of gram frequencies
- DBLP: 276,699 article titles
- Popular 5-grams: ation (> 114K times), tions, ystem, catio
49. VGRAM: Main idea
- Grams with variable lengths (between qmin and qmax)
- zebra
- ze(123)
- corrasion
- co(5213), cor(859), corr(171)
- Advantages
- Reduce index size
- Reduce running time
- Adoptable by many algorithms
50. Challenges
- Generating variable-length grams?
- Constructing a high-quality gram dictionary?
- Relationship between string similarity and their gram-set similarity?
- Adopting VGRAM in existing algorithms?
51. Challenge 1: String -> variable-length grams?
Example: universal
52. Representing gram dictionary as a trie
Gram dictionary: ni, ivr, sal, uni, vers
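One way to turn a string into variable-length grams from a gram dictionary is sketched below. It follows the generation procedure of [LWY07] as understood here, which is an assumption since the slides do not spell it out: at each position take the longest dictionary gram starting there (falling back to the qmin-gram), and keep it unless it is already covered by a previously kept gram. The dictionary is a plain set rather than a trie, positions are 0-based, and the name `vgen` is illustrative.

```python
def vgen(s, dictionary, qmin, qmax):
    """Decompose s into positional variable-length grams [(position, gram), ...]."""
    grams = []
    covered_end = -1  # last index covered by the grams kept so far
    for p in range(len(s) - qmin + 1):
        gram = s[p:p + qmin]  # fallback when no dictionary gram starts at p
        # Try the longest candidate first.
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + length] in dictionary:
                gram = s[p:p + length]
                break
        end = p + len(gram) - 1
        # Earlier grams start before p, so the gram is subsumed
        # exactly when it ends no later than what is already covered.
        if end > covered_end:
            grams.append((p, gram))
            covered_end = end
    return grams


dictionary = {"ni", "ivr", "sal", "uni", "vers"}
print(vgen("universal", dictionary, qmin=2, qmax=4))
```

For universal this keeps four grams (uni, iv, vers, sal) instead of the eight fixed-length 2-grams.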
53. Step 2: Constructing a gram dictionary
qmin = 2
qmax = 4
- Frequency-based [LWY07]
- Cost-based [YWL08]
54. Challenge 3: Edit operations' effect on grams
Fixed length: q
Example: universal
- k operations could affect at most k * q grams
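55. Deletion affects variable-length grams
[Figure: a deletion at position i can affect only grams that overlap the range from i - qmax + 1 to i + qmax - 1; grams entirely outside that range are not affected]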
56. Main idea
- For a string, for each position, compute the number of grams that could be destroyed by an operation at this position
- Compute the number of grams possibly destroyed by k operations
- Store these numbers (for all data strings) as part of the index
- Use this number to do count filtering
57. Summary of VGRAM index
58. Challenge 4: adopting VGRAM
- Easily adoptable by many algorithms
- Basic interfaces
- String s -> grams
- Strings s1, s2 such that ed(s1, s2) ≤ k -> min # of their common grams
59. Lower bound on # of common grams
Fixed length (q), example: universal
- If ed(s1, s2) ≤ k, then their # of common grams ≥ (|s1| - q + 1) - k * q
Variable lengths: # of grams of s1 - NAG(s1, k)
60. Example algorithm using inverted lists
- Query: shtick, ED(shtick, ?) ≤ 1
- 2-grams: lower bound = 3
- 2-4 grams (sh, ht, tick): lower bound = 1
[Figure: inverted list of the gram tick]
61. End of Part I
- Motivation
- Preliminaries
- Trie-based approach
- Gram-based algorithms
- Sketch-based algorithms
- Compression
- Selectivity estimation
- Transformations/Synonyms
- Conclusion
Part I
Part II