Efficient Approximate Search on String Collections, Part I

Slides: 51
Provided by: MariosHadj9
1
Efficient Approximate Search on String Collections, Part I
  • Marios Hadjieleftheriou
  • Chen Li
2
DBLP Author Search
  • http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html

3
Try their names (good luck!)
UCSD
Yannis Papakonstantinou
Meral Ozsoyoglu
Marios Hadjieleftheriou
  • http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html

4
?
5
Better system?
http://dblp.ics.uci.edu/authors/
6
People Search at UC Irvine
http://psearch.ics.uci.edu/
7
Web Search
  • Errors in queries
  • Errors in data
  • Bring query and meaningful results closer together

Actual queries gathered by Google
http://www.google.com/jobs/britney.html
8
Data Cleaning
[Figure: relations R and S holding near-duplicate values to be matched: "infromix" vs. "informix", "mcrosoft" vs. "microsoft"]
9
Problem Formulation
Find strings similar to a given string
dist(Q, D) <= d
Example: find strings similar to "hadjeleftheriou"
  • Performance is important!
  • 10 ms per query = 100 queries per second (QPS)
  • 5 ms per query = 200 QPS

10
Outline
  • Motivation
  • Preliminaries
  • Trie-based approach
  • Gram-based algorithms
  • Sketch-based algorithms
  • Compression
  • Selectivity estimation
  • Transformations/Synonyms
  • Conclusion

Part I
Part II
11
Next
  • Preliminaries

12
Similarity Functions
  • "Similar to": a domain-specific function that returns a similarity value between two strings
  • Examples: edit distance, Hamming distance, Jaccard similarity, Soundex, TF/IDF, BM25, DICE
  • See [KSS06] for an excellent survey

13
Edit Distance
  • A widely used metric to define string similarity
  • ed(s1, s2) = minimum # of operations (insertion,
    deletion, substitution) to change s1 to s2
  • Example
  • s1 = "Tom Hanks"
  • s2 = "Ton Hank"
  • ed(s1, s2) = 2
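
The definition above can be checked with the textbook dynamic program for edit distance. This is a minimal sketch, not code from the presentation; the function name `edit_distance` is an assumption made for illustration.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Edit distance (insertion, deletion, substitution) via a two-row DP."""
    prev = list(range(len(s2) + 1))  # distances from "" to each prefix of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

The two operations here are the substitution m -> n and the deletion of the trailing s.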

14
Next
  • Gram-based algorithms
  • List-merging algorithms [LLL08]
  • Variable-length grams (VGRAM) [LWY07, YWL08]

15
q-grams of strings
String: universal
2-grams: un, ni, iv, ve, er, rs, sa, al
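
Extracting the q-grams of a string amounts to sliding a window of length q over it. The helper below is a sketch with an assumed name (`qgrams`) and uses no padding characters at the string boundaries, which matches the gram lists shown in these slides.

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """All overlapping substrings of length q, left to right (no padding)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```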
16
Edit operations' effect on grams
Fixed length q
String: universal
  • k operations could affect k * q grams

If ed(s1, s2) <= k, then their # of common grams
>= (|s1| - q + 1) - k * q
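
The count bound for fixed-length grams, (|s1| - q + 1) - k * q, can be evaluated and compared against an actual pair of strings. The sketch below uses assumed helper names (`min_common_grams`, `common_grams`) and counts common grams as a multiset intersection; the pair "shtick" / "stick" is an illustrative choice with edit distance 1.

```python
from collections import Counter


def qgrams(s: str, q: int) -> list[str]:
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def min_common_grams(length: int, q: int, k: int) -> int:
    """Lower bound on shared q-grams when ed <= k: (|s1| - q + 1) - k * q."""
    return max(0, (length - q + 1) - k * q)


def common_grams(s1: str, s2: str, q: int) -> int:
    """Size of the multiset intersection of the two gram bags."""
    c1, c2 = Counter(qgrams(s1, q)), Counter(qgrams(s2, q))
    return sum((c1 & c2).values())


print(min_common_grams(len("shtick"), q=2, k=1))  # 3
print(common_grams("shtick", "stick", q=2))       # 3 (ti, ic, ck)
print(min_common_grams(len("shtick"), q=3, k=1))  # 1
```

With q = 3 the bound drops to 1, so longer grams give a weaker count filter for the same k.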
17
q-gram inverted lists
18
Searching using inverted lists
  • Query: "shtick", ED(shtick, ?) <= 1

2-grams of the query: sh, ht, ti, ic, ck
[Figure: inverted lists of the query grams ti, ic, ck]
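
A q-gram inverted index maps each gram to the sorted list of ids of the strings that contain it; a query then retrieves one list per query gram. The sketch below is illustrative only: the four-string collection and the names `build_index` and `query_lists` are assumptions, not data from the slides.

```python
from collections import defaultdict


def qgrams(s: str, q: int = 2) -> list[str]:
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings: list[str], q: int = 2) -> dict[str, list[int]]:
    """Map each gram to the ids of strings containing it, in ascending order."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):  # ids are visited in increasing order
        for g in qgrams(s, q):
            index[g].append(sid)
    return index


strings = ["stick", "stich", "such", "stuck"]  # hypothetical collection
index = build_index(strings)

# Fetch the inverted lists of the query grams that occur in the collection.
query_lists = {g: index[g] for g in qgrams("shtick") if g in index}
print(query_lists)  # {'ti': [0, 1], 'ic': [0, 1], 'ck': [0, 3]}
```

With a count threshold of 3 (the fixed-length bound for |query| = 6, q = 2, k = 1), only id 0 ("stick") appears on enough lists to remain a candidate, and it still has to be verified with a real edit-distance computation.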
19
T-occurrence Problem
Merge the inverted lists, each sorted in ascending order
Find elements whose # of occurrences >= T
20
Example
  • T = 4

1 3 5 10 13
10 13 15
5 7 13
13
15
Result: 13
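
As a reference point for the merging algorithms, the T-occurrence problem can be stated in a few lines of brute-force code. This sketch (with the assumed name `t_occurrence`) ignores the sortedness of the lists and is only meant as a specification to check faster algorithms against.

```python
from collections import Counter


def t_occurrence(lists: list[list[int]], T: int) -> list[int]:
    """Ids that occur at least T times across all lists (brute force)."""
    counts = Counter(x for lst in lists for x in lst)
    return sorted(x for x, c in counts.items() if c >= T)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(t_occurrence(lists, T=4))  # [13]
```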
21
List-Merging Algorithms
  • HeapMerger, MergeOpt [SK04]
  • ScanCount, MergeSkip, DivideSkip [LLL08, BK02]
22
Heap-based Algorithm
Push to heap
Min-heap

Count the # of occurrences of each element using a heap
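
A heap-based merge keeps the current frontier of every list in a min-heap, pops elements in ascending order, and counts how often each value is popped. The following is a sketch of that idea under the assumption that every list is sorted ascending; the name `heap_merge` is illustrative.

```python
import heapq


def heap_merge(lists: list[list[int]], T: int) -> list[int]:
    """Ids occurring at least T times, found by a min-heap merge of sorted lists."""
    # Heap entries: (value, list index, position inside that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result, current, count = [], None, 0
    while heap:
        value, i, pos = heapq.heappop(heap)
        if value == current:
            count += 1
        else:
            if current is not None and count >= T:
                result.append(current)
            current, count = value, 1
        if pos + 1 < len(lists[i]):  # push the next element of the same list
            heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
    if current is not None and count >= T:
        result.append(current)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, T=4))  # [13]
```

Every element of every list passes through the heap, which is the cost the skipping algorithms try to avoid.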
23
MergeOpt Algorithm [SK04]
  • Long lists: the T-1 longest lists, probed with binary search
  • Short lists: the remaining lists, merged to produce candidates
24
Example of MergeOpt
1 3 5 10 13
10 13 15
5 7 13
13
15
Long lists: 3
Short lists: 2
Count threshold T = 4
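
The MergeOpt idea can be sketched as follows: set aside the T-1 longest lists, merge only the short lists to get candidates, then look each candidate up in the long lists by binary search. Any id with at least T occurrences must appear on some short list, because there are only T-1 long ones. The names below (`merge_opt`, `contains`) are assumptions for illustration.

```python
import heapq
from bisect import bisect_left


def contains(sorted_list: list[int], x: int) -> bool:
    """Binary search for x in an ascending list."""
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def merge_opt(lists: list[list[int]], T: int) -> list[int]:
    by_len = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_len[:T - 1], by_len[T - 1:]
    # Count occurrences on the short lists only (candidates, ascending order).
    counts: dict[int, int] = {}
    for x in heapq.merge(*short_lists):
        counts[x] = counts.get(x, 0) + 1
    result = []
    for x, c in counts.items():
        for lst in long_lists:
            if c >= T:
                break  # threshold already reached, skip remaining probes
            if contains(lst, x):
                c += 1
        if c >= T:
            result.append(x)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, T=4))  # [13]
```

For T = 4 the three longest lists are treated as long and the lists [13] and [15] as short, so only 13 and 15 are ever probed.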
25
ScanCount
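  • Keep an array of counters, one per string id (# of occurrences)
  • Scan every list and increment the counter of each string id by 1
  • Report the ids whose count reaches the threshold

1 3 5 10 13
10 13 15
5 7 13
13
15
Count threshold T = 4
Counts after the scan: id 13 has count 4 (result!), id 15 has count 2, id 14 has count 0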
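
ScanCount replaces the heap with a plain array of counters indexed by string id. The sketch below assumes ids are integers in the range 0 to num_ids - 1 and that an id appears at most once per list; `scan_count` and `num_ids` are illustrative names.

```python
def scan_count(lists: list[list[int]], T: int, num_ids: int) -> list[int]:
    """Ids whose occurrence count over all lists reaches T."""
    counts = [0] * num_ids
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # fires exactly once per qualifying id
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, T=4, num_ids=16))  # [13]
```

The lists do not need to be sorted for this approach, at the price of one counter per string in the collection.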
26
List-Merging Algorithms
  • HeapMerger, MergeOpt [SK04]
  • ScanCount, MergeSkip, DivideSkip [LLL08, BK02]
27
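MergeSkip algorithm [BK02, LLL08]
  • Keep the frontier of each list in a min-heap
  • If the top element occurs fewer than T times, pop T-1 elements from the heap
  • In each popped list, jump to the smallest element greater than or equal to the new top of the heap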
28
Example of MergeSkip
[Figure: MergeSkip walk-through on five inverted lists, showing the min-heap contents and the jumps to later elements]
Count threshold T = 4
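
One way to realize the skipping idea is sketched below. When the smallest frontier value occurs on fewer than T lists, T-1 frontiers are popped in total; any value below the new heap top can then occur on at most those T-1 lists, so each popped list may jump (by binary search) to its first element that is greater than or equal to the new top. This is an illustrative reading of the technique, and `merge_skip` is an assumed name.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists: list[list[int]], T: int) -> list[int]:
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    pos = [0] * len(lists)  # current frontier position in each list
    result = []
    while heap and len(heap) >= T:  # fewer than T live lists: no more results
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            result.append(top)
            for i in popped:
                pos[i] += 1
        else:
            while len(popped) < T - 1:  # pop until T-1 lists are out
                popped.append(heapq.heappop(heap)[1])
            new_top = heap[0][0]  # heap is non-empty: it held >= T entries
            for i in popped:      # jump over values that cannot reach T
                pos[i] = bisect_left(lists[i], new_top, pos[i])
        for i in popped:
            if pos[i] < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, T=4))  # [13]
```

On these lists the first round pops the frontiers 1, 5 and 10, sees 13 as the new top, and moves all three lists directly to 13 without visiting 3 or 7.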
29
DivideSkip Algorithm [LLL08]
  • Long lists: binary search
  • Short lists: MergeSkip
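
Combining the two ideas can be sketched as follows: with L long lists, an id needs at least T - L occurrences on the short lists to be a candidate, so MergeSkip runs on the short lists with that reduced threshold and each candidate is then probed in the long lists by binary search. The parameter `num_long` (the value of L) is left to the caller here, and all names are assumptions for illustration.

```python
import heapq
from bisect import bisect_left


def merge_skip_counts(lists, T):
    """MergeSkip variant returning (id, count) pairs with count >= T (T >= 1)."""
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    pos = [0] * len(lists)
    out = []
    while heap and len(heap) >= T:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            out.append((top, len(popped)))
            for i in popped:
                pos[i] += 1
        else:
            while len(popped) < T - 1:
                popped.append(heapq.heappop(heap)[1])
            new_top = heap[0][0]
            for i in popped:  # skip values that cannot reach the threshold
                pos[i] = bisect_left(lists[i], new_top, pos[i])
        for i in popped:
            if pos[i] < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos[i]], i))
    return out


def divide_skip(lists, T, num_long):
    num_long = max(0, min(num_long, T - 1, len(lists)))  # keep T - L >= 1
    by_len = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_len[:num_long], by_len[num_long:]
    result = []
    for x, c in merge_skip_counts(short_lists, T - num_long):
        for lst in long_lists:  # binary-search each long list for x
            j = bisect_left(lst, x)
            if j < len(lst) and lst[j] == x:
                c += 1
        if c >= T:
            result.append(x)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(divide_skip(lists, T=4, num_long=2))  # [13]
```

With num_long = 0 this degenerates to plain MergeSkip, and with num_long = T - 1 it behaves like the long/short split of MergeOpt.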
30
How many lists are treated as long lists?
31
Length Filtering
  • s: length 10
  • t: length 19
  • ed(s, t) <= 2 is impossible: prune by length only!
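
Length filtering relies on the fact that each edit operation changes the length of a string by at most one, so two strings within edit distance k differ in length by at most k. A minimal sketch, with an assumed function name and placeholder strings of the lengths used on the slide:

```python
def passes_length_filter(s: str, t: str, k: int) -> bool:
    """False means ed(s, t) > k for certain; True means 'still a candidate'."""
    return abs(len(s) - len(t)) <= k


s = "a" * 10  # placeholder string of length 10
t = "b" * 19  # placeholder string of length 19
print(passes_length_filter(s, t, k=2))  # False: lengths differ by 9
```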
32
Positional Filtering
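  • ed(s, t) <= 2
  • s contains the positional gram (ab, 1)
  • t contains the positional gram (ab, 12)
  • The positions differ by more than 2, so this gram does not count as common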
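
Positional filtering attaches a position to every gram and only counts two equal grams as common when their positions are at most k apart. The sketch below assumes 1-based positions, as in the (ab, 1) and (ab, 12) example; the function names are illustrative.

```python
def positional_qgrams(s: str, q: int = 2) -> list[tuple[str, int]]:
    """(gram, position) pairs with 1-based positions."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def can_match(g1: tuple[str, int], g2: tuple[str, int], k: int) -> bool:
    """Two positional grams count as common only if the text is equal
    and the positions differ by at most k."""
    return g1[0] == g2[0] and abs(g1[1] - g2[1]) <= k


print(can_match(("ab", 1), ("ab", 12), k=2))  # False
print(can_match(("ab", 1), ("ab", 3), k=2))   # True
```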
33
A filter tree
  • Combine filters with list-merging algorithms [LLL08]

34
Next
  • Variable-length grams (VGRAM) [LWY07, YWL08]

35
2-grams -> 3-grams?
  • Query: "shtick", ED(shtick, ?) <= 1

3-grams of the query: sht, hti, tic, ick
[Figure: inverted lists of the query grams ick, tic]
# of common grams >= 1
36
Observation 1: dilemma of choosing q
  • Increasing q causes
  • Longer grams -> shorter lists
  • Smaller # of common grams of similar strings

37
Observation 2: skew distributions of gram frequencies
  • DBLP: 276,699 article titles
  • Popular 5-grams: ation (> 114K times), tions,
    ystem, catio

38
VGRAM: Main idea
  • Grams with variable lengths (between qmin and
    qmax)
  • zebra
  • ze (123)
  • corrasion
  • co (5213), cor (859), corr (171)
  • Advantages
  • Reduce index size
  • Reduce running time
  • Adoptable by many algorithms

39
Challenges
  • Generating variable-length grams?
  • Constructing a high-quality gram dictionary?
  • Relationship between string similarity and their
    gram-set similarity?
  • Adopting VGRAM in existing algorithms?

40
Challenge 1: String -> variable-length grams?
  • Fixed-length 2-grams

universal
  • Variable-length grams

universal
41
Representing gram dictionary as a trie
Gram dictionary: ni, ivr, sal, uni, vers
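
One plausible way to decompose a string with such a dictionary is sketched below: at every position take the longest dictionary gram that starts there (between qmin and qmax characters), fall back to the plain qmin-gram when none exists, and drop any gram that is fully contained in a gram already produced. This is an assumption-laden illustration of the idea, not the authors' code; it uses a Python set where the slide uses a trie, 0-based positions, and the hypothetical name `vgrams`.

```python
def vgrams(s: str, dictionary: set[str], qmin: int, qmax: int) -> list[tuple[int, str]]:
    """Variable-length positional grams of s as (start, gram), 0-based."""
    grams = []
    covered_end = -1  # last index covered by the grams produced so far
    for p in range(len(s) - qmin + 1):
        gram = s[p:p + qmin]  # fallback when no dictionary gram starts at p
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + length] in dictionary:  # longest match wins
                gram = s[p:p + length]
                break
        end = p + len(gram) - 1
        if end > covered_end:  # otherwise contained in an earlier gram
            grams.append((p, gram))
            covered_end = end
    return grams


dictionary = {"ni", "ivr", "sal", "uni", "vers"}
print(vgrams("universal", dictionary, qmin=2, qmax=4))
# [(0, 'uni'), (2, 'iv'), (3, 'vers'), (6, 'sal')]
```

Under these rules "universal" yields four grams instead of the eight fixed-length 2-grams, which is where the smaller index comes from.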
42
Step 2: Constructing a gram dictionary
qmin = 2
qmax = 4
  • Frequency-based [LWY07]
  • Cost-based [YWL08]

43
Challenge 3: Edit operations' effect on grams
Fixed length q
universal
  • k operations could affect k * q grams

44
Deletion affects variable-length grams
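  • Deletion at position i
  • Affected: grams within positions i - qmax + 1 to i + qmax - 1
  • Not affected: grams entirely to the left or entirely to the right of that range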
45
Main idea
  • For a string, for each position, compute the
    number of grams that could be destroyed by an
    operation at this position
  • Compute number of grams possibly destroyed by k
    operations
  • Store these numbers (for all data strings) as
    part of the index
  • Use this number to do count filtering

46
Summary of VGRAM index
47
Challenge 4: adopting VGRAM
  • Easily adoptable by many algorithms
  • Basic interfaces
  • String s -> grams
  • Strings s1, s2 such that ed(s1, s2) <= k -> min # of
    their common grams

48
Lower bound on # of common grams
Fixed length (q)
universal
  • If ed(s1, s2) <= k, then their # of common grams
    >= (|s1| - q + 1) - k * q

Variable lengths: # of grams of s1 - NAG(s1, k)
49
Example algorithm using inverted lists
  • Query: "shtick", ED(shtick, ?) <= 1

2-4 grams of the query: sh, ht, tick
[Figure: inverted list of the query gram tick]
2-4 grams: lower bound = 1
2-grams: lower bound = 3
50
End of part I
  • Motivation
  • Preliminaries
  • Trie-based approach
  • Gram-based algorithms
  • Sketch-based algorithms
  • Compression
  • Selectivity estimation
  • Transformations/Synonyms
  • Conclusion

Part I
Part II