Title: Chen Li
1Answering Approximate Queries Efficiently
- Chen Li
- Department of Computer Science
- Joint work with Liang Jin, Nick Koudas, Anthony
Tung, and Rares Vernica
230,000-Foot View of Info Systems
Data Repository (RDBMS, Search Engines, etc.)
Answers matching conditions
Query
3Example a movie database
Star Title Year Genre
Keanu Reeves The Matrix 1999 Sci-Fi
Samuel Jackson Star Wars Episode III - Revenge of the Sith 2005 Sci-Fi
Schwarzenegger The Terminator 1984 Sci-Fi
Samuel Jackson Goodfellas 1990 Drama
4How about our governor Schwarrzenger?
The user doesnt know the exact spelling!
Star Title Year Genre
Keanu Reeves The Matrix 1999 Sci-Fi
Samuel Jackson Star Wars Episode III - Revenge of the Sith 2005 Sci-Fi
Schwarzenegger The Terminator 1984 Sci-Fi
Samuel Jackson Goodfellas 1990 Drama
5Relaxing Conditions
Find movies with a star similar to
Schwarrzenger.
Star Title Year Genre
Keanu Reeves The Matrix 1999 Sci-Fi
Samuel Jackson Star Wars Episode III - Revenge of the Sith 2005 Sci-Fi
Schwarzenegger The Terminator 1984 Sci-Fi
Samuel Jackson Goodfellas 1990 Drama
6In general Gap between Queries and Facts
- Errors in the query
- The user doesnt remember a string exactly
- The user unintentionally types a wrong string
7Did you mean? features in Search Engines
8What if we dont want the user to change the
query?Answering Queries Approximately
Data Repository (RDBMS, Search Engines, etc.)
Answers matching conditions approximately
Query
9Technical Challenges
- How to relax conditions?
- Name Schwarzenegger vs Schwarrzenger
- Salary in 50K,60K vs in 49K,63K
- How to answer queries efficiently?
- Index structures
- Selectivity estimation
- See our three recent VLDB papers
10Rest of the talk
- Selectivity estimation of fuzzy predicates
- Our approach SEPIA
- Construction and maintenance of SEPIA
- Experiments
- Other works
11Queries with Fuzzy String Predicates
- Stars name similar to Schwarrzenger
- Employees SSN similar to 430-87-7294
- Customers telephone number similar to 412-0964
Database
- Similar to
- a domain-specific function
- returns a similarity value between two strings
- Examples
- Edit distance ed(Schwarrzenger,
Schwarzenegger)2 - Cosine similarity
- Jaccard coefficient distance
- Soundex
12Example Similarity Function Edit Distance
- A widely used metric to define string similarity
- Ed(s1,s2) minimum of operations (insertion,
deletion, substitution) to change s1 to s2 - Example
- s1 Tom Hanks
- s2 Ton Hank
- ed(s1,s2) 2
13Selectivity of Fuzzy Predicates
- star SIMILARTO Schwarrzenger
- Selectivity of records satisfying the predicate
Star Title Year Genre
Keanu Reeves The Matrix 1999 Sci-Fi
Samuel Jackson Star Wars Episode III - Revenge of the Sith 2005 Sci-Fi
Schwarzenegger The Terminator 1984 Sci-Fi
Samuel Jackson Goodfellas 1990 Drama
14Selectivity Estimation Problem Formulation
star SIMILARTO Schwarrzenger
Input fuzzy string predicate P(q, d)
A bag of strings
Output of strings s that satisfy dist(s,q) lt d
15Why Selectivity Estimation?
SELECT FROM Movies WHERE star SIMILARTO
Schwarrzenger AND year BETWEEN 1970,1971
SELECT FROM Movies WHERE star SIMILARTO
Schwarrzenger AND year BETWEEN 1980,1989
Movies
Star Title Year Genre
Keanu Reeves The Matrix 1999 Sci-Fi
Samuel Jackson Star Wars Episode III - Revenge of the Sith 2005 Sci-Fi
Schwarzenegger The Terminator 1984 Sci-Fi
Samuel Jackson Goodfellas 1990 Drama
The optimizer needs to know the selectivity of a
predicate to decide a good plan.
16Using traditional histograms?
- No nice order for strings
- Lexicographical order?
- Similar strings could be far from each other
Kammy/Cammy - Adjacent strings have different selectivities
Cathy/Catherine
17Outline
- Selectivity estimation of fuzzy predicates
- Our approach SEPIA
- Overview
- Proximity between strings
- Estimation algorithm
- Construction and maintenance of SEPIA
- Experiments
- Other works
18Our approach SEPIA
- Selectivity Estimation of Approximate Predicates
Intuition
19Proximity between Strings
Edit Distance? Not discriminative enough
20Edit Vector from s1 to s2
- A vector ltI, D, Sgt
- I of insertions
- D of deletions
- S of substitutions
- in a sequence of edit operations with their edit
distance
- Easily computable
- Not symmetric
- Not unique, but tend to be (ed lt 3 ? 91 unique)
21Why Edit Vector?
22SEPIA histograms Overview
23Frequency table for each cluster
24Global PPD Table
- Proximity Pair Distribution table
25SEPIA histograms summary
26Selectivity Estimation ed(lukas, 2)
- Do it for all v2 vectors in each cluster, for all
clusters - Take the sum of these contributions
27Selectivity Estimation for ed(q,d)
- For each cluster Ci
- For each v2 in frequency table of Ci
- Use (v1,v2,d) to lookup PPD
- Take the sum of these f N
- Pruning possible (triangle inequality)
28Outline
- Selectivity estimation of fuzzy predicates
- Our approach SEPIA
- Overview
- Proximity between strings
- Estimation algorithm
- Construction and maintenance of SEPIA
- Experiments
- Other works
29Clustering Strings
- Two example algorithms
- Lexicographic order based.
- K-Medoids
- Choose initial pivots
- Assign strings to its closest pivot
- Swap a pivot with another string
- Reassign the strings
30Number of Clusters
- It affects
- Cluster quality
- Similarity of strings within each cluster
- Costs
- Space
- Estimation time
31Constructing Frequency Tables
- For each cluster, group strings based on their
edit vector from the pivot - Count the frequency for each group
32Constructing PPD Table
- Get enough samples of string triplets (q,p,s)
- Propose a few heuristics
- ALL_RAND
- CLOSE_RAND
- CLOSE_LEX
- CLOSE_UNIQUE
33Dynamic Maintenance Frequency Table
- Take insertion as an example
34Dynamic Maintenance PPD
35Improving Estimation Accuracy
- Reasons of estimate errors
- Miss hits in PPD.
- Inaccurate percentage entries in PPD.
- Improvement use sample fuzzy predicates to
analyze their estimation errors
36Relative-Error Model
- Use the errors to build a model
- Use the model to adjust initial estimation
37Outline
- Motivation selectivity estimation of fuzzy
predicates - Our approach SEPIA
- Overview
- Proximity between strings
- Estimation algorithm
- Construction and maintenance of SEPIA
- Experiments
- Other works
38Data
- Citeseer
- 71K author names
- Length 2,20, avg 12
- Movie records from UCI KDD repository
- 11K movie titles.
- Length 3,80, avg 35
- Introduced duplicates
- 10 of records
- of duplicates 1,20, uniform
- Final results
- Citeseer 142K author names
- UCI KDD 23K movie titles
39Setting
- Test bed
- PC 2.4G P4, 1.2GB RAM, Windows XP
- Visual C compiler
- Query workload
- Strings from the data
- String not in the data
- Results similar
- Quality measurements
- Relative error (fest freal) / freal
- Absolute relative error fest freal / freal
40Clustering Algorithms
K-Metoids is better
41Quartile distribution of relative errors
Data set 1. CLOSE_RAND 1000 clusters
42Number of Clusters
43Effectiveness of Applying Relative-Error Model
44Dynamic Maintenance
45Other work 1 Relaxing SQL queries with
Selections/Joins
- SELECT FROM Jobs J, Candidate CWHERE J.Salary
lt 95 AND J.Zipcode C.Zipcode AND
C.WorkYear gt 5
Jobs Jobs Jobs Jobs Candidates Candidates Candidates Candidates
JID Company Zipcode Salary CID Zipcode ExpSalary WorkYear
r1 Broadcom 92047 80 s1 93652 120 3
r2 Intel 93652 95 s2 92612 130 6
r3 Microsoft 82632 120 s3 82632 100 5
r4 IBM 90391 130 s4 90391 150 1
... ...
46Query Relaxation Skyline!
47Other work 2 Fuzzy predicates on attributes of
mixed types
SELECT FROM Movies WHERE star SIMILARTO
Schwarrzenger AND year 1977 lt 3
Movies
Star Title Year Genre
Keanu Reeves The Matrix 1999 Sci-Fi
Samuel Jackson Star Wars Episode III - Revenge of the Sith 2005 Sci-Fi
Schwarzenegger The Terminator 1984 Sci-Fi
Samuel Jackson Goodfellas 1990 Drama
48Mixed-Typed Predicates
- String attributes edit distance
- Numeric attributes absolute numeric difference
SELECT FROM Movies WHERE star SIMILARTO
Schwarrzenger AND year 1977 lt 3
49MAT-tree Intuition
- Indexing on two attributes is more effective than
two separate indexing structures - Numeric attribute B-tree
- String attribute tree-based index structure?
50MAT-tree Overview
- Tree-based indexing structure
- Each node has MBR for both numeric attribute and
string attribute - Compressing strings as a compressed trie that
fits into a limited space - An edit distance between a string and compressed
trie can be computed - Experiments show that MAT-tree is very efficient
51Conclusion
- Its important to support answering approximate
queries efficiently - Our results so far
- SEPIA provides accurate selectivity estimation
for fuzzy string predicates - Relaxing SQL queries with selections and joins
- MAT-tree indexing structure supporting fuzzy
queries with mixed-types predicates