Efficient Approximate Search on String Collections Part I

1
Efficient Approximate Search on String Collections, Part I
  • Marios Hadjieleftheriou
  • Chen Li
2
DBLP Author Search
  • http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html

3
Try their names (good luck!)
UCSD
Yannis Papakonstantinou
Meral Ozsoyoglu
Marios Hadjieleftheriou
  • http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html

4
?
5
Better system?
http://dblp.ics.uci.edu/authors/
6
People Search at UC Irvine
http://psearch.ics.uci.edu/
7
Web Search
  • Errors in queries
  • Errors in data
  • Bring query and meaningful results closer together

Actual queries gathered by Google
http://www.google.com/jobs/britney.html
8
Data Cleaning
R
S
9
Problem Formulation
Find strings similar to a given string: dist(Q, D) ≤ d
Example: find strings similar to "hadjeleftheriou"
  • Performance is important!
  • 10 ms per query = 100 queries per second (QPS)
  • 5 ms per query = 200 QPS

10
Outline
  • Motivation
  • Preliminaries
  • Trie-based approach
  • Gram-based algorithms
  • Sketch-based algorithms
  • Compression
  • Selectivity estimation
  • Transformations/Synonyms
  • Conclusion

Part I: Motivation through Gram-based algorithms
Part II: Sketch-based algorithms through Conclusion
11
Next
  • Preliminaries

12
Similarity Functions
  • "Similar to":
      • a domain-specific function
      • returns a similarity value between two strings
  • Examples:
      • Edit distance
      • Hamming distance
      • Jaccard similarity
      • Soundex
      • TF/IDF, BM25, DICE
  • See [KSS06] for an excellent survey

13
Edit Distance
  • A widely used metric to define string similarity
  • ed(s1, s2) = minimum number of operations (insertion,
    deletion, substitution) needed to change s1 into s2
  • Example:
      • s1 = Tom Hanks
      • s2 = Ton Hank
      • ed(s1, s2) = 2
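
The definition above can be checked with the textbook dynamic program for Levenshtein distance. This is a minimal sketch, not code from the slides; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance, computed one DP row at a time."""
    prev = list(range(len(s2) + 1))  # distance from "" to each prefix of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute (free on a match)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

The two operations in the example are one substitution (m to n) and one deletion (the trailing s).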

14
State-of-the-art: Oracle 10g and older versions
  • Supported by Oracle Text
  • CREATE TABLE engdict(word VARCHAR(20), len INT);
  • Create preferences for text indexing:
    begin
      ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'FUZZY_MATCH', 'ENGLISH');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'FUZZY_SCORE', '0');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'FUZZY_NUMRESULTS', '5000');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'SUBSTRING_INDEX', 'TRUE');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF', 'STEMMER', 'ENGLISH');
    end;
    /
  • CREATE INDEX fuzzy_stem_subst_idx ON engdict (word)
    INDEXTYPE IS ctxsys.context
    PARAMETERS ('Wordlist STEM_FUZZY_PREF');
  • Usage:
    SELECT * FROM engdict
    WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;
  • Limitation: cannot handle errors in the first letters
      • Katherine versus Catherine

15
Microsoft SQL Server [CGG05]
  • Data cleaning tools available in SQL Server 2005
  • Part of Integration Services
  • Supports fuzzy lookups
  • Uses data flow pipeline of transformations
  • Similarity function: tokens with TF/IDF scores

16
Lucene
  • Uses Levenshtein distance (edit distance)
  • Example: roam~0.8
  • Prefix pruning followed by a scan (Efficiency?)

17
Next
  • Trie-based approach [JLL09]

18
Trie Indexing
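  • Strings
      • exam
      • example
      • exemplar
      • exempt
      • sample

[Figure: trie over the five strings. The root has two children, e and s; the four strings starting with "ex" share the path e-x, and "sample" has its own branch.]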
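
To make the structure concrete, here is a minimal sketch of a trie built over the five strings. The class and function names (`TrieNode`, `build_trie`, `count_nodes`, `contains`) are illustrative choices, not taken from the slides.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # True if a stored string ends at this node


def build_trie(strings):
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            # Reuse the child for a shared prefix, or create it.
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children.values())


def contains(root, s):
    node = root
    for ch in s:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word


strings = ["exam", "example", "exemplar", "exempt", "sample"]
root = build_trie(strings)
print(count_nodes(root))  # 21 nodes (root included) for 31 characters of input
```

Shared prefixes such as "ex" and "exemp" are stored once, which is why the node count is smaller than the total number of characters.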

19
Active nodes on Trie
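  • Query: example
  • Edit-distance threshold = 2

[Figure: the same trie, with each active node labeled by its edit distance to the query: one node at distance 0, one at distance 1, and four at distance 2.]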

20
Initialization
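  • Q = ε (empty query)
  • Initial active nodes: all nodes within depth d

[Figure: the trie with the initial active nodes labeled by depth: the root (0), the nodes for e and s (1), and the nodes for ex and sa (2).]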

21
Incremental Algorithm
Return leaf nodes as answers.
22
  • Q = e x a m p l e

[Figure: the trie annotated with the active nodes before and after reading the first query character e. After reading e, the labels are: root 1, e 0, s 1, ex 1, sa 2, exa 2, exe 2.]
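
The incremental idea can be sketched as follows: keep a map from active trie nodes to their edit distance from the query typed so far, and update it one character at a time with the usual edit-distance recurrence. This is a simplified illustration under my own assumptions, not the exact procedure of [JLL09]; all names are hypothetical, and it reports strings that end at an active node (whole-string matching) rather than all leaves below active nodes.

```python
class Node:
    def __init__(self, char="", parent=None):
        self.char, self.parent = char, parent
        self.children = {}
        self.word = None  # the stored string that ends here, if any


def build_trie(strings):
    root = Node()
    for s in strings:
        node = root
        for ch in s:
            if ch not in node.children:
                node.children[ch] = Node(ch, node)
            node = node.children[ch]
        node.word = s
    return root


def relax_down(active, d):
    """Apply the rule dist(child) <= dist(parent) + 1 until nothing changes."""
    stack = list(active)
    while stack:
        node = stack.pop()
        for child in node.children.values():
            dist = active[node] + 1
            if dist <= d and dist < active.get(child, d + 1):
                active[child] = dist
                stack.append(child)


def initial_active(root, d):
    active = {root: 0}  # empty query: distance equals depth
    relax_down(active, d)
    return active


def extend(active, ch, d):
    """Active nodes after appending query character ch."""
    new = {}
    for node, dist in active.items():
        if dist + 1 <= d:  # treat ch as an extra query character
            new[node] = min(new.get(node, d + 1), dist + 1)
        for child in node.children.values():  # match or substitute
            cost = dist + (child.char != ch)
            if cost <= d:
                new[child] = min(new.get(child, d + 1), cost)
    relax_down(new, d)
    return new


def fuzzy_search(root, query, d):
    active = initial_active(root, d)
    for ch in query:
        active = extend(active, ch, d)
    return {n.word: dist for n, dist in active.items() if n.word is not None}


root = build_trie(["exam", "example", "exemplar", "exempt", "sample"])
print(sorted(fuzzy_search(root, "example", 2).items()))
# [('example', 0), ('sample', 2)]
```

Distances above the threshold d are never stored, which is safe because every case of the recurrence only adds cost. For search as the user types, one would instead report the stored strings below every active node after each keystroke.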


23
Good and bad
  • Advantages
      • Trie size is small
      • Can do search as the user types
  • Disadvantages
      • Works for edit distance only

24
Next
  • Gram-based algorithms
  • List-merging algorithms [LLL08]
  • Variable-length grams (VGRAM) [LWY07, YWL08]

25
q-grams of strings
String: universal
2-grams: un, ni, iv, ve, er, rs, sa, al
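
A q-gram set is just a sliding window over the string. A minimal sketch (the helper name `qgrams` is an illustrative choice):

```python
def qgrams(s, q=2):
    """All overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n has n - q + 1 such grams, here 9 - 2 + 1 = 8.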
26
Edit operations effect on grams
Fixed length q
u n i v e r s a l
  • k operations could affect k * q grams

If ed(s1, s2) ≤ k, then their # of common grams
≥ (|s1| - q + 1) - k * q
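
The count bound, (|s1| - q + 1) - k * q common grams for strings within edit distance k, can be checked on a small case. The sketch below counts common grams as a multiset intersection; the function names and the test pair "shtick"/"stick" are illustrative assumptions.

```python
from collections import Counter


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def common_grams(s1, s2, q):
    """Size of the multiset intersection of the two gram bags."""
    c1, c2 = Counter(qgrams(s1, q)), Counter(qgrams(s2, q))
    return sum((c1 & c2).values())


def count_lower_bound(s1, q, k):
    """Each of the k edits can destroy at most q of the |s1| - q + 1 grams."""
    return (len(s1) - q + 1) - k * q


# "shtick" and "stick" are one deletion apart, so k = 1.
print(count_lower_bound("shtick", 2, 1), common_grams("shtick", "stick", 2))  # 3 3
print(count_lower_bound("shtick", 3, 1), common_grams("shtick", "stick", 3))  # 1 2
```

The bound is a necessary condition only: strings that share enough grams are candidates and still need to be verified with the real edit distance.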
27
q-gram inverted lists
28
Searching using inverted lists
  • Query: shtick, ED(shtick, ?) ≤ 1

2-grams of the query: sh, ht, ti, ic, ck
[Figure: the inverted lists of the query's 2-grams.]
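
A q-gram inverted index maps each gram to the sorted list of ids of the strings that contain it; answering a query starts by fetching the lists of the query's grams. The sketch below uses a small hypothetical collection, since the strings on the slide are not recoverable, and the helper names are illustrative.

```python
from collections import defaultdict


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_inverted_index(strings, q=2):
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        # set() keeps each string id at most once per list (a simplification).
        for gram in set(qgrams(s, q)):
            index[gram].append(sid)  # ids arrive in ascending order
    return index


strings = ["stick", "stich", "such", "stuck"]  # hypothetical collection
index = build_inverted_index(strings)

# Fetch one inverted list per gram of the query.
query_lists = {g: index.get(g, []) for g in qgrams("shtick")}
print(query_lists)
# {'sh': [], 'ht': [], 'ti': [0, 1], 'ic': [0, 1], 'ck': [0, 3]}
```

With ED ≤ 1 and q = 2 a candidate must appear on at least 3 of these lists; only id 0 ("stick") does.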
29
T-occurrence Problem
Merge the inverted lists, each sorted in ascending order
Find elements whose # of occurrences ≥ T
30
Example
  • T = 4

Inverted lists:
1 3 5 10 13
10 13 15
5 7 13
13
15
Result: 13
31
List-Merging Algorithms
  • HeapMerger, MergeOpt [SK04]
  • ScanCount, MergeSkip, DivideSkip [LLL08, BK02]
32
Heap-based Algorithm
  • Push the frontier of each list onto a min-heap
  • Count the occurrences of each element using the heap
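
A heap-based merge can be sketched as follows: the heap holds the current frontier of every list, elements come out in ascending order, and equal elements are counted as they are popped. This is an illustrative version using Python's `heapq`, assuming each list is sorted and free of duplicates; it is not the authors' implementation.

```python
import heapq


def heap_merge(lists, T):
    """Return the elements that occur on at least T of the sorted lists."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    current, count = None, 0
    while heap:
        value, i, pos = heapq.heappop(heap)
        if value == current:
            count += 1
        else:
            if current is not None and count >= T:
                result.append(current)
            current, count = value, 1
        if pos + 1 < len(lists[i]):  # advance list i by one element
            heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
    if current is not None and count >= T:
        result.append(current)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, 4))  # [13]
```

Every element of every list passes through the heap once, which is the cost the skipping algorithms try to avoid.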
33
MergeOpt Algorithm [SK04]
  • Long lists: the T-1 longest lists, probed by binary search
  • Short lists: the remaining lists, merged first
34
Example of MergeOpt
1 3 5 10 13
10 13 15
5 7 13
13
15
Long lists: 3
Short lists: 2
Count threshold T = 4
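
The split into long and short lists can be sketched like this. An answer must occur at least once on the short lists, because the T-1 long lists alone cannot supply T occurrences. The sketch counts the short lists with a `Counter` instead of a heap merge, which is a simplification of mine, and the function names are illustrative.

```python
import bisect
from collections import Counter


def contains(sorted_list, x):
    """Binary search for x in a sorted list."""
    i = bisect.bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def merge_opt(lists, T):
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]
    # Count occurrences on the short lists only.
    short_counts = Counter(x for lst in short_lists for x in lst)
    result = []
    for x in sorted(short_counts):
        # Probe the long lists for each candidate from the short lists.
        total = short_counts[x] + sum(contains(lst, x) for lst in long_lists)
        if total >= T:
            result.append(x)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, 4))  # [13]
```

With T = 4 the three longest lists are the long lists and the two singleton lists are the short lists, matching the example.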
35
ScanCount
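  • Keep an array of occurrence counts indexed by string id
  • Scan every list and increment the count of each id by 1

Inverted lists:
1 3 5 10 13
10 13 15
5 7 13
13
15

Count threshold T = 4
[Figure: the count array after the scan. Id 13 reaches count 4 (result!); id 15 reaches count 2.]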
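
ScanCount needs no heap at all. A minimal sketch, assuming string ids are integers in `range(num_strings)`; the parameter `num_strings` is a name introduced here for illustration.

```python
def scan_count(lists, T, num_strings):
    """Return ids that occur on at least T lists, using a count array."""
    counts = [0] * num_strings
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # report each id once, when it reaches T
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, 4, num_strings=16))  # [13]
```

The scan touches every list element exactly once and the lists do not even need to be sorted, at the price of a count array as large as the collection.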
36
List-Merging Algorithms
  • HeapMerger, MergeOpt [SK04]
  • ScanCount, MergeSkip, DivideSkip [LLL08, BK02]
37
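MergeSkip algorithm [BK02, LLL08]
  • Keep the frontier of each list in a min-heap
  • If the top element occurs fewer than T times, pop T-1 elements in total
  • Jump each popped list to its first element greater than or equal to
    the new top of the heap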
38
Example of MergeSkip
[Figure: MergeSkip run on five inverted lists with a min-heap. After T-1 elements are popped, the popped lists jump ahead to the new top of the heap.]
Count threshold T = 4
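
The skipping step can be sketched as below. When the smallest frontier element occurs on fewer than T lists, the code pops until T-1 lists are off the heap; anything smaller than the new heap top can then occur on at most those T-1 lists, so they can safely jump forward with a binary search. This is my reading of the technique with illustrative names, assuming sorted lists without duplicates.

```python
import bisect
import heapq


def merge_skip(lists, T):
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    pos = [0] * len(lists)  # current position in each list
    result = []
    while len(heap) >= T:  # fewer than T live lists cannot give an answer
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            result.append(top)
            for i in popped:
                pos[i] += 1
        else:
            # Pop more lists until T-1 are off the heap, then skip ahead.
            while len(popped) < T - 1:
                popped.append(heapq.heappop(heap)[1])
            target = heap[0][0]
            for i in popped:
                pos[i] = bisect.bisect_left(lists[i], target, pos[i])
        for i in popped:
            if pos[i] < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, 4))  # [13]
```

On these lists with T = 4, the first round pops 1, 5 and 10 and jumps all three lists straight to 13, so the elements 3 and 7 are never examined.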
39
DivideSkip Algorithm [LLL08]
  • Long lists: binary search
  • Short lists: MergeSkip
40
How many lists are treated as long lists?
41
Length Filtering
  • s: length 10
  • t: length 19
  • Ed(s,t) ≤ 2? Pruned by length only!
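
Length filtering follows from the fact that each edit operation changes the length by at most one. A minimal sketch with a hypothetical helper name and placeholder strings of the lengths on the slide:

```python
def passes_length_filter(s, t, k):
    """Strings within edit distance k differ in length by at most k."""
    return abs(len(s) - len(t)) <= k


s = "a" * 10  # placeholder string of length 10
t = "a" * 19  # placeholder string of length 19
print(passes_length_filter(s, t, 2))  # False
```

Only pairs that pass the filter need any further, more expensive checking.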
42
Positional Filtering
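  • Ed(s,t) ≤ 2
  • s contains the positional gram (ab,1)
  • t contains the positional gram (ab,12)
  • The positions differ by more than 2, so these two occurrences cannot
    be counted as a common gram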
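
Positional filtering attaches a position to every gram and only lets two occurrences match when their positions are at most k apart. A small sketch; positions are 1-based to follow the slide, and the function names are illustrative.

```python
def positional_qgrams(s, q=2):
    """Grams paired with their 1-based start position."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def compatible(g1, g2, k):
    """Same gram text, and positions shifted by at most k."""
    (gram1, pos1), (gram2, pos2) = g1, g2
    return gram1 == gram2 and abs(pos1 - pos2) <= k


print(compatible(("ab", 1), ("ab", 12), 2))  # False
print(positional_qgrams("abc"))              # [('ab', 1), ('bc', 2)]
```

The reasoning is that k edit operations can shift any surviving gram by at most k positions.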
43
Normalized weights [HKS08]
  • Compute a weight for each string
      • L0: length of the string
      • L1, L2: depend on q-gram frequencies
  • Similar strings have similar weights
  • A very strong pruning condition

44
Pruning using normalized weights
  • Sort inverted lists based on string weights
  • Search within a small weight range
  • Shown to be effective (> 90% of candidates pruned)

45
Next
  • Variable-length grams (VGRAM) [LWY07, YWL08]

46
2-grams -> 3-grams?
  • Query: shtick, ED(shtick, ?) ≤ 1

3-grams of the query: sht, hti, tic, ick
# of common grams ≥ 1
47
Observation 1: dilemma of choosing q
  • Increasing q causes:
      • Longer grams → shorter lists
      • Smaller # of common grams of similar strings

48
Observation 2: skewed distributions of gram frequencies
  • DBLP: 276,699 article titles
  • Popular 5-grams: ation (> 114K times), tions, ystem, catio

49
VGRAM Main idea
  • Grams with variable lengths (between qmin and qmax)
      • zebra
          • ze(123)
      • corrasion
          • co(5213), cor(859), corr(171)
  • Advantages
      • Reduces index size
      • Reduces running time
      • Adoptable by many algorithms

50
Challenges
  • Generating variable-length grams?
  • Constructing a high-quality gram dictionary?
  • Relationship between string similarity and their
    gram-set similarity?
  • Adopting VGRAM in existing algorithms?

51
Challenge 1: String → variable-length grams?
  • Fixed-length 2-grams

u n i v e r s a l
  • Variable-length grams

u n i v e r s a l
52
Representing gram dictionary as a trie
Gram dictionary: ni, ivr, sal, uni, vers
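
One way to decompose a string with such a dictionary is sketched below. The rule used here is an assumption about the procedure, stated plainly: at each position take the longest dictionary gram of length qmin to qmax that starts there, fall back to the plain qmin-gram if none matches, and drop any gram fully covered by an earlier one. A set stands in for the dictionary trie, and the name `vgen` is a label chosen for this sketch.

```python
def vgen(s, dictionary, qmin, qmax):
    """Decompose s into variable-length grams as (start, gram) pairs."""
    grams = []
    max_end = 0  # furthest end position covered by the grams kept so far
    for p in range(len(s) - qmin + 1):
        gram = s[p:p + qmin]  # fallback when no dictionary gram starts at p
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + length] in dictionary:
                gram = s[p:p + length]  # longest match wins
                break
        end = p + len(gram)
        if end > max_end:  # skip grams contained in an earlier gram
            grams.append((p, gram))
            max_end = end
    return grams


dictionary = {"ni", "ivr", "sal", "uni", "vers"}
print(vgen("universal", dictionary, qmin=2, qmax=4))
# [(0, 'uni'), (2, 'iv'), (3, 'vers'), (6, 'sal')]
```

Under these assumptions "universal" yields four grams instead of the eight fixed-length 2-grams, which is where the smaller index comes from.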
53
Step 2: Constructing a gram dictionary
qmin = 2
qmax = 4
  • Frequency-based [LWY07]
  • Cost-based [YWL08]

54
Challenge 3: Edit operations' effect on grams
Fixed length q
u n i v e r s a l
  • k operations could affect k * q grams

55
Deletion affects variable-length grams
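[Figure: a deletion at position i. Grams that overlap the range from i - qmax + 1 to i + qmax - 1 may be affected; grams entirely to the left or right of that range are not affected.]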
56
Main idea
  • For a string, for each position, compute the
    number of grams that could be destroyed by an
    operation at this position
  • Compute number of grams possibly destroyed by k
    operations
  • Store these numbers (for all data strings) as
    part of the index
  • Use this number to do count filtering

57
Summary of VGRAM index
58
Challenge 4: adopting VGRAM
  • Easily adoptable by many algorithms
  • Basic interfaces
      • String s → grams
      • Strings s1, s2 such that ed(s1, s2) ≤ k → min # of
        their common grams

59
Lower bound on # of common grams
Fixed length (q)
u n i v e r s a l
  • If ed(s1, s2) ≤ k, then their # of common grams
    ≥ (|s1| - q + 1) - k * q

Variable lengths: (# of grams of s1) - NAG(s1, k)
60
Example algorithm using inverted lists
  • Query: shtick, ED(shtick, ?) ≤ 1
  • 2-grams: sh ht ti ic ck (lower bound = 3)
  • 2-4 grams: sh ht tick (lower bound = 1)
61
End of part I
  • Motivation
  • Preliminaries
  • Trie-based approach
  • Gram-based algorithms
  • Sketch-based algorithms
  • Compression
  • Selectivity estimation
  • Transformations/Synonyms
  • Conclusion

Part I: Motivation through Gram-based algorithms
Part II: Sketch-based algorithms through Conclusion