Efficient Approximate Search on String Collections Part I

1
Efficient Approximate Search on String Collections
Part I

Marios Hadjieleftheriou

Chen Li
2
DBLP Author Search

http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html

3
Try their names (good luck!)
UCSD
Yannis Papakonstantinou
Meral Ozsoyoglu
Marios Hadjieleftheriou

http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html

4
?
5
Better system?
http://dblp.ics.uci.edu/authors/
6
People Search at UC Irvine
http://psearch.ics.uci.edu/
7
Web Search

Errors in queries
Errors in data
Bring query and meaningful results closer together

Actual queries gathered by Google
http://www.google.com/jobs/britney.html
8
Data Cleaning
R
S
9
Problem Formulation
Find strings similar to a given string
dist(Q, D) <= d
Example: find strings similar to hadjeleftheriou

Performance is important!
10 ms per query: 100 queries per second (QPS)
5 ms per query: 200 QPS

10
Outline

Motivation
Preliminaries
Trie-based approach
Gram-based algorithms
Sketch-based algorithms
Compression
Selectivity estimation
Transformations/Synonyms
Conclusion

Part I
Part II
11
Next

Preliminaries

12
Similarity Functions

"Similar to": a domain-specific function that returns a similarity value between two strings
Examples
Edit distance
Hamming distance
Jaccard similarity
Soundex
TF/IDF, BM25, DICE
See [KSS06] for an excellent survey

13
Edit Distance

A widely used metric to define string similarity
ed(s1, s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
Example:
s1 = Tom Hanks
s2 = Ton Hank
ed(s1, s2) = 2
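The definition above can be computed with the classic dynamic program. This is a minimal sketch with unit costs for all three operations; the function name edit_distance is chosen here for illustration and does not come from the slides.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s1 into s2."""
    prev = list(range(len(s2) + 1))  # distances from the empty prefix of s1
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

Only two rows of the table are kept, so memory is linear in the length of s2.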

14
State-of-the-art: Oracle 10g and older versions

Supported by Oracle Text
CREATE TABLE engdict(word VARCHAR(20), len INT);
Create preferences for text indexing:
begin
  ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH');
end;
/
CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word )
  INDEXTYPE IS ctxsys.context PARAMETERS ('Wordlist STEM_FUZZY_PREF');
Usage:
SELECT * FROM engdict
WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;
Limitation: cannot handle errors in the first letters
Katherine versus Catherine

15
Microsoft SQL Server [CGG05]

Data cleaning tools available in SQL Server 2005
Part of Integration Services
Supports fuzzy lookups
Uses data flow pipeline of transformations
Similarity function: tokens with TF/IDF scores

16
Lucene

Uses Levenshtein distance (edit distance)
Example: roam~0.8
Prefix pruning followed by a scan (Efficiency?)

17
Next

Trie-based approach [JLL09]

18
Trie Indexing
Strings: exam, example, exemplar, exempt, sample
[Figure: trie over the five strings; the root has two children, e and s]
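A trie over the five strings can be sketched as follows. The class and function names (TrieNode, build_trie, count_nodes, words_with_prefix) are illustrative choices, not taken from the slides.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # True if one of the indexed strings ends here


def build_trie(strings):
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def count_nodes(node):
    """Number of nodes in the subtree, including the node itself."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


def words_with_prefix(root, prefix):
    """Exact prefix lookup: all indexed strings that start with prefix."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    found, stack = [], [(node, prefix)]
    while stack:
        n, text = stack.pop()
        if n.is_word:
            found.append(text)
        stack.extend((child, text + ch) for ch, child in n.children.items())
    return sorted(found)


trie = build_trie(["exam", "example", "exemplar", "exempt", "sample"])
print(count_nodes(trie))               # 21 (the root plus 20 character nodes)
print(words_with_prefix(trie, "exe"))  # ['exemplar', 'exempt']
```

Shared prefixes such as "ex" and "exem" are stored once, which is why the trie stays small.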

19
Active nodes on Trie
Query: example
Edit-distance threshold = 2
[Figure: the trie with the active nodes for the query labeled by their edit distances]

20
Initialization

Initial active nodes: all nodes within depth d
[Figure: the trie with the root labeled 0, the nodes at depth 1 labeled 1, and the nodes at depth 2 labeled 2]

21
Incremental Algorithm
Return leaf nodes as answers.
22

Q = e x a m p l e
Active nodes for Q = e
[Figure: the trie with the active nodes for Q = e labeled by their edit distances]
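The initialization and the incremental step can be sketched together. This is one reading of the active-node idea, not necessarily the exact procedure of [JLL09]: an active node is a trie node whose prefix is within the threshold tau of the query typed so far, and each new keystroke derives the next active set from the previous one (deletion, substitution, or match followed by insertions). All names are illustrative, and answers are reported for nodes where an indexed string ends (is_word), which is an assumption that also covers strings such as exam that are not leaves.

```python
class Node:
    def __init__(self, text=""):
        self.text = text      # prefix spelled by the path from the root
        self.children = {}    # character -> Node
        self.is_word = False


def build_trie(strings):
    root = Node()
    for s in strings:
        node = root
        for ch in s:
            if ch not in node.children:
                node.children[ch] = Node(node.text + ch)
            node = node.children[ch]
        node.is_word = True
    return root


def within_depth(node, depth):
    """Yield (n, k) for node itself (k = 0) and descendants at most depth levels below."""
    stack = [(node, 0)]
    while stack:
        n, k = stack.pop()
        yield n, k
        if k < depth:
            stack.extend((child, k + 1) for child in n.children.values())


def initial_active(root, tau):
    # Empty query: every node within depth tau is active, distance = depth.
    return dict(within_depth(root, tau))


def next_active(active, ch, tau):
    new = {}

    def offer(n, d):
        if d <= tau and d < new.get(n, tau + 1):
            new[n] = d

    for n, d in active.items():
        offer(n, d + 1)                      # delete ch from the query
        for label, child in n.children.items():
            if label != ch:
                offer(child, d + 1)          # substitute ch by label
            else:
                # match, then allow up to tau - d insertions below the child
                for desc, extra in within_depth(child, tau - d):
                    offer(desc, d + extra)
    return new


def fuzzy_search(root, query, tau):
    active = initial_active(root, tau)
    for ch in query:
        active = next_active(active, ch, tau)
    return {n.text: d for n, d in active.items() if n.is_word}


trie = build_trie(["exam", "example", "exemplar", "exempt", "sample"])
print(sorted(fuzzy_search(trie, "example", 2).items()))
# [('example', 0), ('sample', 2)]
```

Because the active set is updated one character at a time, the same loop supports search as the user types.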

23
Good and bad

Advantages
Trie size is small
Can do search as the user types
Disadvantages
Works for edit distance only

24
Next

Gram-based algorithms
List-merging algorithms [LLL08]
Variable-length grams (VGRAM) [LWY07, YWL08]

25
q-grams of strings
u n i v e r s a l
2-grams
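Extracting the q-grams of a string is a sliding window of width q. A minimal sketch (the name qgrams is illustrative, and no padding characters are added at the string boundaries):

```python
def qgrams(s, q=2):
    """All substrings of length q, in order of their starting position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n has n - q + 1 such grams.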
26
Edit operations' effect on grams
Fixed length: q
u n i v e r s a l

k operations could affect k * q grams

If ed(s1, s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q
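The bound follows because a string of length |s1| has |s1| - q + 1 grams and each edit operation can destroy at most q of them. A small sketch that checks this on one pair; the misspelling "univarsal" is an example made up here, and common grams are counted as a multiset intersection.

```python
from collections import Counter


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def min_common_grams(length, q, k):
    """Lower bound on shared q-grams when the edit distance is at most k."""
    return (length - q + 1) - k * q


def common_grams(s1, s2, q=2):
    c1, c2 = Counter(qgrams(s1, q)), Counter(qgrams(s2, q))
    return sum((c1 & c2).values())  # multiset intersection size


print(min_common_grams(len("universal"), q=2, k=1))  # 6
print(common_grams("universal", "univarsal"))        # 6
```

The single substitution destroys the two grams ve and er, so exactly 8 - 2 = 6 grams survive and the bound is tight here.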
27
q-gram inverted lists
28
Searching using inverted lists

Query: shtick, ED(shtick, ?) <= 1

2-grams of the query: sh ht ti ic ck
[Figure: inverted lists of the 2-grams ti, ic, and ck]
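A sketch of candidate generation with q-gram inverted lists. The five data strings are made up for illustration (they are not the collection on the slide), each gram is assumed to occur at most once per string, and the count threshold is assumed to be at least 1. Candidates still need to be verified with a real edit-distance computation.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q=2):
    """Map each gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):  # assumption: count a gram once per string
            index[gram].append(sid)
    return index


def candidates(index, query, k, q=2):
    """Ids that share enough grams with the query to be within edit distance k."""
    threshold = (len(query) - q + 1) - k * q  # assumed >= 1 in this sketch
    counts = Counter()
    for gram in set(qgrams(query, q)):
        counts.update(index.get(gram, []))
    return sorted(sid for sid, c in counts.items() if c >= threshold)


strings = ["stick", "shtick", "slick", "thick", "stuck"]  # hypothetical data
index = build_index(strings)
print(index["ck"])                       # [0, 1, 2, 3, 4]
print(candidates(index, "shtick", k=1))  # [0, 1] -> stick, shtick
```

With k = 1 and q = 2 the threshold is (6 - 2 + 1) - 2 = 3, so only strings sharing at least three of the grams sh, ht, ti, ic, ck survive.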
29
T-occurrence Problem
Merge the inverted lists (each sorted in ascending order of string ids)
Find elements whose # of occurrences >= T
30
Example

1 3 5 10 13
10 13 15
5 7 13
13
15
Result: 13
31
List-Merging Algorithms
HeapMerger, MergeOpt [SK04]
ScanCount, MergeSkip, DivideSkip [LLL08, BK02]
32
Heap-based Algorithm
Push to heap
Min-heap

Count the # of occurrences of each element using a heap
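A sketch of the heap-based merge, run on the five lists of the earlier example. The function name heap_merge is illustrative; the heap holds the current front element of every list, so equal ids come out consecutively and can be counted.

```python
import heapq


def heap_merge(lists, T):
    """Return ids that occur on at least T of the sorted lists."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    current, count = None, 0
    while heap:
        value, i, pos = heapq.heappop(heap)
        if value == current:
            count += 1
        else:
            if current is not None and count >= T:
                result.append(current)
            current, count = value, 1
        if pos + 1 < len(lists[i]):  # push the next element of the same list
            heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
    if current is not None and count >= T:
        result.append(current)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, T=4))  # [13]
```

Every element of every list passes through the heap, which is the cost the skipping algorithms try to avoid.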
33
MergeOpt Algorithm [SK04]
Long lists: the T-1 longest lists, probed by binary search
Short lists: the remaining lists, merged
34
Example of MergeOpt
1 3 5 10 13
10 13 15
5 7 13
13
15
Long lists: 3
Short lists: 2
Count threshold T = 4
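A sketch of the MergeOpt idea on this example. An id that occurs at least T times must appear on at least one short list, because there are only T-1 long lists; so the short lists supply the candidates and the long lists are only probed. For brevity the short lists are counted with a Counter rather than merged with a heap, which is a simplification made here; each list is assumed to hold distinct ids.

```python
from bisect import bisect_left
from collections import Counter
from itertools import chain


def merge_opt(lists, T):
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]
    # Candidates: every id seen on a short list, with its count there.
    counts = Counter(chain.from_iterable(short_lists))
    result = []
    for sid in sorted(counts):
        total = counts[sid]
        for lst in long_lists:
            pos = bisect_left(lst, sid)  # binary search in a long list
            if pos < len(lst) and lst[pos] == sid:
                total += 1
        if total >= T:
            result.append(sid)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, T=4))  # [13]
```

Here only the ids 13 and 15 from the two short lists are ever looked up in the three long lists.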
35
ScanCount
One counter per string id: string id -> # of occurrences
Scan each list and increment the counter of every id on it by 1
1 3 5 10 13
10 13 15
5 7 13
13
15
Count threshold T = 4
Result: 13 (its counter reaches 4)
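ScanCount needs no heap at all. A minimal sketch, assuming string ids are integers in the range 0 to num_strings - 1 (the parameter name is chosen here):

```python
def scan_count(lists, T, num_strings):
    counts = [0] * num_strings  # one counter per string id
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # report each id once, when it reaches T
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, T=4, num_strings=16))  # [13]
```

The trade-off is the counter array, which is as large as the whole collection regardless of how short the lists are.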
36
List-Merging Algorithms
HeapMerger, MergeOpt [SK04]
ScanCount, MergeSkip, DivideSkip [LLL08, BK02]
37
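MergeSkip algorithm [BK02, LLL08]
Min-heap over the front elements of the lists
If the top element occurs fewer than T times, pop T-1 elements in total
Jump: in each popped list, skip to the smallest element greater than or equal to the new heap top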
38
Example of MergeSkip
[Figure: MergeSkip run on the example lists, showing the min-heap (minHeap) and a jump step]
Count threshold T = 4
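A sketch of MergeSkip as understood here, run on the five lists of the earlier T-occurrence example rather than on the lists in the figure. When the top of the heap occurs fewer than T times, T-1 lists are popped in total; anything smaller than the new top can then occur on at most T-1 lists, so those lists jump ahead by binary search. Names are illustrative.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    pos = [0] * len(lists)  # cursor into each list
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while len(heap) >= T:  # fewer than T live lists cannot yield a result
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            result.append(top)
            for i in popped:
                pos[i] += 1
        else:
            while len(popped) < T - 1:  # pop until T-1 lists are out
                popped.append(heapq.heappop(heap)[1])
            target = heap[0][0]         # new top of the heap
            for i in popped:            # jump to the first element >= target
                pos[i] = bisect_left(lists[i], target, pos[i])
        for i in popped:
            if pos[i] < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, T=4))  # [13]
```

On this input the first round pops the fronts 1, 5, and 10, sees 13 as the new top, and jumps all three lists directly to 13, so the ids 3 and 7 are never visited.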
39
DivideSkip Algorithm [LLL08]
Long lists: binary search
Short lists: MergeSkip
40
How many lists are treated as long lists?
41
Length Filtering
s: length 10
t: length 19
Ed(s, t) <= 2?
No: the lengths differ by 9 > 2, so t is pruned by length only!
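The length filter is a one-line check, since every insertion or deletion changes the length by one. A minimal sketch; the two strings are placeholders with the lengths from the slide.

```python
def passes_length_filter(s, t, k):
    """False means ed(s, t) > k for certain; True means the pair needs further checks."""
    return abs(len(s) - len(t)) <= k


s = "abcdefghij"           # length 10 (placeholder string)
t = "abcdefghijklmnopqrs"  # length 19 (placeholder string)
print(passes_length_filter(s, t, k=2))  # False
```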
42
Positional Filtering
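Ed(s, t) <= 2
s: (ab, 1)
t: (ab, 12)
The gram ab is at position 1 in s and at position 12 in t. The positions differ by more than 2, so this occurrence does not count as a common gram.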
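A sketch of counting common grams with positions taken into account. Positions are 1-based to match the (ab, 1) notation; the two strings are made up so that ab sits at position 1 in one and position 12 in the other, and occurrences of the same gram are paired greedily in position order, which is a choice made for this sketch.

```python
from collections import defaultdict


def positional_qgrams(s, q=2):
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def common_positional_grams(s, t, q, k):
    """Count grams shared by s and t whose positions differ by at most k."""
    s_pos, t_pos = defaultdict(list), defaultdict(list)
    for gram, p in positional_qgrams(s, q):
        s_pos[gram].append(p)
    for gram, p in positional_qgrams(t, q):
        t_pos[gram].append(p)
    common = 0
    for gram, ps in s_pos.items():
        pt = t_pos.get(gram, [])
        i = j = 0
        while i < len(ps) and j < len(pt):
            if abs(ps[i] - pt[j]) <= k:
                common += 1
                i += 1
                j += 1
            elif ps[i] < pt[j]:
                i += 1
            else:
                j += 1
    return common


s = "abcdefghijklm"  # ab at position 1
t = "cdefghijklmab"  # ab at position 12
print(common_positional_grams(s, t, q=2, k=100))  # 11 (positions effectively ignored)
print(common_positional_grams(s, t, q=2, k=2))    # 10 (ab no longer counts)
```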
43
Normalized weights [HKS08]

Compute a weight for each string
L0: length of the string
L1, L2: depend on q-gram frequencies
Similar strings have similar weights
A very strong pruning condition

44
Pruning using normalized weights

Sort inverted lists based on string weights
Search within a small weight range
Shown to be effective (> 90% of candidates pruned)

45
Next

Variable-length grams (VGRAM) [LWY07, YWL08]

46
2-grams -> 3-grams?

Query: shtick, ED(shtick, ?) <= 1

3-grams of the query: sht hti tic ick
# of common grams >= 1
[Figure: inverted lists of the 3-grams tic and ick]
47
Observation 1: dilemma of choosing q

Increasing q causes:
Longer grams -> shorter lists
Smaller # of common grams of similar strings

48
Observation 2: skewed distributions of gram frequencies

DBLP: 276,699 article titles
Popular 5-grams: ation (> 114K times), tions, ystem, catio

49
VGRAM: Main idea

Grams with variable lengths (between qmin and qmax)
zebra
ze(123)
corrasion
co(5213), cor(859), corr(171)
Advantages
Reduces index size
Reduces running time
Adoptable by many algorithms

50
Challenges

Generating variable-length grams?
Constructing a high-quality gram dictionary?
Relationship between string similarity and their
gram-set similarity?
Adopting VGRAM in existing algorithms?

51
Challenge 1: String -> Variable-length grams?

Fixed-length 2-grams

u n i v e r s a l

Variable-length grams

u n i v e r s a l
52
Representing gram dictionary as a trie
Gram dictionary: ni, ivr, sal, uni, vers
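One plausible way to turn a string into variable-length grams with such a dictionary is sketched below. The rule used here is an assumption, not confirmed by the slides: at each position take the longest dictionary gram that starts there, fall back to the qmin-gram if there is none, and drop any gram that is fully covered by a gram already chosen. A plain set stands in for the trie, and qmin = 2, qmax = 4 are assumed values.

```python
def vgram_decompose(s, dictionary, qmin, qmax):
    """Return (start, gram) pairs under the assumed greedy longest-match rule."""
    grams = []
    covered_end = -1  # end (exclusive) of the right-most gram kept so far
    for p in range(len(s) - qmin + 1):
        gram = s[p:p + qmin]  # fallback when no dictionary gram starts at p
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + length] in dictionary:
                gram = s[p:p + length]  # longest dictionary gram at p
                break
        end = p + len(gram)
        if end > covered_end:  # keep only grams not subsumed by an earlier one
            grams.append((p, gram))
            covered_end = end
    return grams


dictionary = {"ni", "ivr", "sal", "uni", "vers"}
print(vgram_decompose("universal", dictionary, qmin=2, qmax=4))
# [(0, 'uni'), (2, 'iv'), (3, 'vers'), (6, 'sal')]
```

Under these assumptions universal yields four grams instead of the eight fixed-length 2-grams.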
53
Step 2: Constructing a gram dictionary
qmin = 2
qmax = 4

Frequency-based [LWY07]
Cost-based [YWL08]

54
Challenge 3: Edit operations' effect on grams
Fixed length: q
u n i v e r s a l

k operations could affect k * q grams

55
Deletion affects variable-length grams
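Deletion at position i
Affected: grams that overlap the window [i - qmax + 1, i + qmax - 1]
Not affected: grams entirely to the left or to the right of this window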
56
Main idea

For a string, for each position, compute the
number of grams that could be destroyed by an
operation at this position
Compute number of grams possibly destroyed by k
operations
Store these numbers (for all data strings) as
part of the index