Liang Jin and Chen Li - PowerPoint PPT Presentation

1 / 52

About This Presentation

Title:

Liang Jin and Chen Li

Description:

Star Wars: Episode III - Revenge of the Sith. The Matrix. Title. Schwarzenegger. Samuel Jackson ... estimation for predicates with wildcards: star LIKE '%Hanks ... – PowerPoint PPT presentation

Number of Views:26

Avg rating:3.0/5.0

Slides: 53

Provided by: che791

Category:

more less

Transcript and Presenter's Notes

Title: Liang Jin and Chen Li

1
Selectivity Estimation for Fuzzy String
Predicates in Large Data Sets

Liang Jin and Chen Li

Supported by NSF CAREER Award IIS-0238586
2
Example a movie database
Find movies starred Schwarrzenger?
Find movies with a star similar to
Schwarrzenger.
Star Title Year Genre
Keanu Reeves The Matrix 1999 Sci-Fi
Samuel Jackson Star Wars Episode III - Revenge of the Sith 2005 Sci-Fi
Schwarzenegger The Terminator 1984 Sci-Fi
Samuel Jackson Goodfellas 1990 Drama

3
Queries with Fuzzy String Predicates

Stars name similar to Schwarrzenger
Employees SSN similar to 430-87-7294
Customers telephone number similar to 412-0964
Similar to
a domain-specific function
returns a similarity value between two strings
Example edit distance
Ed(s1,s2) minimum of operations (insertion,
deletion, substitution) to change s1 to s2
ed(Tom Hanks,
Ton Hank ) 2

Database
4
Selectivity Estimation Problem Formulation
star SIMILARTO Schwarrzenger
Input fuzzy string predicate P(q, d)
A bag of strings
Output of strings s that satisfy dist(s,q) lt d
5
Why Selectivity Estimation?
SELECT FROM Movies WHERE star SIMILARTO
Schwarrzenger AND year BETWEEN 1970,1971
SELECT FROM Movies WHERE star SIMILARTO
Schwarrzenger AND year BETWEEN 1980,1999
Movies
Star Title Year Genre
Keanu Reeves The Matrix 1999 Sci-Fi
Samuel Jackson Star Wars Episode III - Revenge of the Sith 2005 Sci-Fi
Schwarzenegger The Terminator 1984 Sci-Fi
Samuel Jackson Goodfellas 1990 Drama

The optimizer needs to know the selectivity of a
predicate to decide a good plan.
6
Rest of the talk

Motivation selectivity estimation of fuzzy
predicates
Our approach SEPIA
Proximity between strings
Histograms and estimation algorithm
Construction and maintenance of SEPIA
Experiments

7
Intuition of SEPIA

Selectivity Estimation of Approximate Predicates

8
Proximity between Strings
Edit Distance? Not discriminative enough
9
Edit Vector from s1 to s2

A vector ltI, D, Sgt
I of insertions
D of deletions
S of substitutions
in a sequence of edit operations with their edit
distance

10
Why Edit Vector? More discriminative
11
SEPIA histograms Overview
12
Frequency table for each cluster
13
Global PPD Table

Proximity Pair Distribution table

14
SEPIA histograms summary
15
Selectivity Estimation ed(lukas, 2)

Do it for all v2 vectors in each cluster, for all
clusters
Take the sum of these contributions

16
Selectivity Estimation for ed(q,d)

For each cluster Ci
For each v2 in frequency table of Ci
Use (v1,v2,d) to lookup PPD
Take the sum of these f N
Pruning possible (triangle inequality)

17
Outline

Motivation selectivity estimation of fuzzy
predicates
Our approach SEPIA
Proximity between strings
Histograms and estimation algorithm
Construction and maintenance of SEPIA
Experiments

18
Clustering Strings

Two example algorithms
Lexicographic order based.
K-Medoids
Choose initial pivots
Assign strings to its closest pivot
Swap a pivot with another string
Reassign the strings

19
Number of Clusters

It affects
Cluster quality
Similarity of strings within each cluster
Costs
Space
Estimation time

20
Constructing Frequency Tables

For each cluster, group strings based on their
edit vector from the pivot
Count the frequency for each group

21
Constructing PPD Table

Get enough samples of string triplets (q,p,s)
Propose a few heuristics
ALL_RAND
CLOSE_RAND
CLOSE_LEX
CLOSE_UNIQUE

22
Dynamic Maintenance Frequency Table

Take insertion as an example

23
Dynamic Maintenance PPD
24
Improving Estimation Accuracy

A post-processing step to further improve
estimation accuracy
See paper for details.

25
Outline

Motivation selectivity estimation of fuzzy
predicates
Our approach SEPIA
Proximity between strings
Histograms and estimation algorithm
Construction and maintenance of SEPIA
Experiments

26
Data

Citeseer
71K author names
Length 2,20, avg 12
Movie records from UCI KDD repository
11K movie titles.
Length 3,80, avg 35
Introduced duplicates
10 of records
of duplicates 1,20, uniform
Final results
Citeseer 142K author names
UCI KDD 23K movie titles

27
Setting

Test bed
PC 2.4G P4, 1.2GB RAM, Windows XP
Visual C compiler
Query workload
Strings from the data
String not in the data
Results similar
Quality measurements
Relative error (fest freal) / freal
Absolute relative error fest freal / freal

28
Quartile distribution of relative errors
Data set 1. CLOSE_RAND 1000 clusters
29
Number of Clusters
30
Dynamic Maintenance

More results in the paper
Extension to other similarity functions
More experimental results

31
Related Work

Traditional histograms
Selectivity estimation for predicates with
wildcards star LIKE Hanks
Answering fuzzy predicates efficiently (another
talk in this conference)

32
Conclusions

Important to support queries with fuzzy string
predicates
SEPIA provides accurate selectivity estimation
Structures can be efficiently constructed and
maintained.
Extendable to various similarity measurements

The Flamingo Project http//www.ics.uci.edu/fla
mingo/
QA?
33
BLANK
34
Backup Slides
35
Why do we care?

Errors in queries
User doesnt remember a string exactly
User types a wrong string

36
Size of histograms

Data set 1
1000 clusters
PPD table 5MB
Frequency tables 200KB

37
Constructing PPD table

We want to generate enough sample triplets to
cover as many (v1, v2) pairs as possible.
We also want to control the cost of generating
the samples and calculation.
Heuristics
ALL_RAND
CLOSE_RAND
CLOSE_LEX
CLOSE_UNIQUE

38
Populating PPD Table
CLOSE_RAND is used
39
Number of Clusters (cont)
Number of clusters grows with the size of the
dataset
Fixed number of clusters
40
Extension to other similarity functions

SEPIA a general framework for selectivity
estimation for fuzzy string predicates.
Key issue in extensions proximity between
strings
Too specific?
Too general?
Example Jaccard coefficient distance
Proximity between two strings s1 and s2.
G(s,n) is the n-gram set for string s.
lt G(s1, n) G(s2, n), G(s1, n) v G(s2, n),
ed(s1, s2) gt

41
Jaccard Coefficient Distance
42
Research Issues

Deciding similarity functions
Domain specific
Query processing
Answering a query with fuzzy predicates
efficiently
Query optimization
Selectivity estimation

43
Queries with Fuzzy String Predicates

Stars name similar to Schwarrzenger
Employees SSN similar to 430-87-7294
Customers telephone number similar to 412-0964

Database

Similar to
a domain-specific function
returns a similarity value between two strings
Examples
Edit distance ed(Schwarrzenger,
Schwarzenegger)2
Cosine similarity
Jaccard coefficient distance
Soundex

44
Why do we care?

Errors in the query
The user doesnt remember a string exactly
The user unintentionally types a wrong string

45
Selectivity of Fuzzy Predicates

star SIMILARTO Schwarrzenger
Selectivity of records satisfying the predicate

Star Title Year Genre
Keanu Reeves The Matrix 1999 Sci-Fi
Samuel Jackson Star Wars Episode III - Revenge of the Sith 2005 Sci-Fi
Schwarzenegger The Terminator 1984 Sci-Fi
Samuel Jackson Goodfellas 1990 Drama

46
Example Similarity Function Edit Distance

A widely used metric to define string similarity
Ed(s1,s2) minimum of operations (insertion,
deletion, substitution) to change s1 to s2
Example
s1 Tom Hanks
s2 Ton Hank
ed(s1,s2) 2

47
Using traditional histograms?