Title: Liang Jin and Chen Li
1Selectivity Estimation for Fuzzy String
Predicates in Large Data Sets
Supported by NSF CAREER Award IIS-0238586
2Example a movie database
Find movies starred Schwarrzenger?
Find movies with a star similar to
Schwarrzenger.
Star Title Year Genre
Keanu Reeves The Matrix 1999 Sci-Fi
Samuel Jackson Star Wars Episode III - Revenge of the Sith 2005 Sci-Fi
Schwarzenegger The Terminator 1984 Sci-Fi
Samuel Jackson Goodfellas 1990 Drama
3Queries with Fuzzy String Predicates
- Stars name similar to Schwarrzenger
- Employees SSN similar to 430-87-7294
- Customers telephone number similar to 412-0964
- Similar to
- a domain-specific function
- returns a similarity value between two strings
- Example edit distance
- Ed(s1,s2) minimum of operations (insertion,
deletion, substitution) to change s1 to s2 - ed(Tom Hanks,
- Ton Hank ) 2
Database
4Selectivity Estimation Problem Formulation
star SIMILARTO Schwarrzenger
Input fuzzy string predicate P(q, d)
A bag of strings
Output of strings s that satisfy dist(s,q) lt d
5Why Selectivity Estimation?
SELECT FROM Movies WHERE star SIMILARTO
Schwarrzenger AND year BETWEEN 1970,1971
SELECT FROM Movies WHERE star SIMILARTO
Schwarrzenger AND year BETWEEN 1980,1999
Movies
Star Title Year Genre
Keanu Reeves The Matrix 1999 Sci-Fi
Samuel Jackson Star Wars Episode III - Revenge of the Sith 2005 Sci-Fi
Schwarzenegger The Terminator 1984 Sci-Fi
Samuel Jackson Goodfellas 1990 Drama
The optimizer needs to know the selectivity of a
predicate to decide a good plan.
6Rest of the talk
- Motivation selectivity estimation of fuzzy
predicates - Our approach SEPIA
- Proximity between strings
- Histograms and estimation algorithm
- Construction and maintenance of SEPIA
- Experiments
7Intuition of SEPIA
- Selectivity Estimation of Approximate Predicates
8Proximity between Strings
Edit Distance? Not discriminative enough
9Edit Vector from s1 to s2
- A vector ltI, D, Sgt
- I of insertions
- D of deletions
- S of substitutions
- in a sequence of edit operations with their edit
distance
10Why Edit Vector? More discriminative
11SEPIA histograms Overview
12Frequency table for each cluster
13Global PPD Table
- Proximity Pair Distribution table
14SEPIA histograms summary
15Selectivity Estimation ed(lukas, 2)
- Do it for all v2 vectors in each cluster, for all
clusters - Take the sum of these contributions
16Selectivity Estimation for ed(q,d)
- For each cluster Ci
- For each v2 in frequency table of Ci
- Use (v1,v2,d) to lookup PPD
- Take the sum of these f N
- Pruning possible (triangle inequality)
17Outline
- Motivation selectivity estimation of fuzzy
predicates - Our approach SEPIA
- Proximity between strings
- Histograms and estimation algorithm
- Construction and maintenance of SEPIA
- Experiments
18Clustering Strings
- Two example algorithms
- Lexicographic order based.
- K-Medoids
- Choose initial pivots
- Assign strings to its closest pivot
- Swap a pivot with another string
- Reassign the strings
19Number of Clusters
- It affects
- Cluster quality
- Similarity of strings within each cluster
- Costs
- Space
- Estimation time
20Constructing Frequency Tables
- For each cluster, group strings based on their
edit vector from the pivot - Count the frequency for each group
21Constructing PPD Table
- Get enough samples of string triplets (q,p,s)
- Propose a few heuristics
- ALL_RAND
- CLOSE_RAND
- CLOSE_LEX
- CLOSE_UNIQUE
22Dynamic Maintenance Frequency Table
- Take insertion as an example
23Dynamic Maintenance PPD
24Improving Estimation Accuracy
- A post-processing step to further improve
estimation accuracy - See paper for details.
25Outline
- Motivation selectivity estimation of fuzzy
predicates - Our approach SEPIA
- Proximity between strings
- Histograms and estimation algorithm
- Construction and maintenance of SEPIA
- Experiments
26Data
- Citeseer
- 71K author names
- Length 2,20, avg 12
- Movie records from UCI KDD repository
- 11K movie titles.
- Length 3,80, avg 35
- Introduced duplicates
- 10 of records
- of duplicates 1,20, uniform
- Final results
- Citeseer 142K author names
- UCI KDD 23K movie titles
27Setting
- Test bed
- PC 2.4G P4, 1.2GB RAM, Windows XP
- Visual C compiler
- Query workload
- Strings from the data
- String not in the data
- Results similar
- Quality measurements
- Relative error (fest freal) / freal
- Absolute relative error fest freal / freal
28Quartile distribution of relative errors
Data set 1. CLOSE_RAND 1000 clusters
29Number of Clusters
30Dynamic Maintenance
- More results in the paper
- Extension to other similarity functions
- More experimental results
31Related Work
- Traditional histograms
- Selectivity estimation for predicates with
wildcards star LIKE Hanks - Answering fuzzy predicates efficiently (another
talk in this conference)
32Conclusions
- Important to support queries with fuzzy string
predicates - SEPIA provides accurate selectivity estimation
- Structures can be efficiently constructed and
maintained. - Extendable to various similarity measurements
The Flamingo Project http//www.ics.uci.edu/fla
mingo/
QA?
33BLANK
34Backup Slides
35Why do we care?
- Errors in queries
- User doesnt remember a string exactly
- User types a wrong string
36Size of histograms
- Data set 1
- 1000 clusters
- PPD table 5MB
- Frequency tables 200KB
37Constructing PPD table
- We want to generate enough sample triplets to
cover as many (v1, v2) pairs as possible. - We also want to control the cost of generating
the samples and calculation. - Heuristics
- ALL_RAND
- CLOSE_RAND
- CLOSE_LEX
- CLOSE_UNIQUE
38Populating PPD Table
CLOSE_RAND is used
39Number of Clusters (cont)
Number of clusters grows with the size of the
dataset
Fixed number of clusters
40Extension to other similarity functions
- SEPIA a general framework for selectivity
estimation for fuzzy string predicates. - Key issue in extensions proximity between
strings - Too specific?
- Too general?
- Example Jaccard coefficient distance
- Proximity between two strings s1 and s2.
- G(s,n) is the n-gram set for string s.
- lt G(s1, n) G(s2, n), G(s1, n) v G(s2, n),
ed(s1, s2) gt
41Jaccard Coefficient Distance
42Research Issues
- Deciding similarity functions
- Domain specific
- Query processing
- Answering a query with fuzzy predicates
efficiently - Query optimization
- Selectivity estimation
43Queries with Fuzzy String Predicates
- Stars name similar to Schwarrzenger
- Employees SSN similar to 430-87-7294
- Customers telephone number similar to 412-0964
Database
- Similar to
- a domain-specific function
- returns a similarity value between two strings
- Examples
- Edit distance ed(Schwarrzenger,
Schwarzenegger)2 - Cosine similarity
- Jaccard coefficient distance
- Soundex
44Why do we care?
- Errors in the query
- The user doesnt remember a string exactly
- The user unintentionally types a wrong string
45Selectivity of Fuzzy Predicates
- star SIMILARTO Schwarrzenger
- Selectivity of records satisfying the predicate
Star Title Year Genre
Keanu Reeves The Matrix 1999 Sci-Fi
Samuel Jackson Star Wars Episode III - Revenge of the Sith 2005 Sci-Fi
Schwarzenegger The Terminator 1984 Sci-Fi
Samuel Jackson Goodfellas 1990 Drama
46Example Similarity Function Edit Distance
- A widely used metric to define string similarity
- Ed(s1,s2) minimum of operations (insertion,
deletion, substitution) to change s1 to s2 - Example
- s1 Tom Hanks
- s2 Ton Hank
- ed(s1,s2) 2
47Using traditional histograms?
- No nice order for strings
- Lexicographical order?
- Similar strings could be far from each other
Kammy/Cammy - Adjacent strings have different selectivities
Cathy/Catherine
48Edit Vector from s1 to s2
- A vector ltI, D, Sgt
- I of insertions
- D of deletions
- S of substitutions
- in a sequence of edit operations with their edit
distance
- Easily computable
- Not symmetric
- Not unique, but tend to be (ed lt 3 ? 91 unique)
49Improving Estimation Accuracy
- Reasons of estimate errors
- Miss hits in PPD.
- Inaccurate percentage entries in PPD.
- Improvement use sample fuzzy predicates to
analyze their estimation errors
50Relative-Error Model
- Use the errors to build a model
- Use the model to adjust initial estimation
51Effectiveness of Applying Relative-Error Model
52Clustering Algorithms
K-Metoids is better