Hashed Samples - PowerPoint PPT Presentation

Slides: 28
Provided by: Marios79

Transcript and Presenter's Notes

Title: Hashed Samples


1
Hashed Samples
  • Selectivity Estimators for
  • Set Similarity Selection Queries

2
Set Similarity: An Application
  • Find similar strings
  • Decompose strings into 3-grams.
  • Represent strings as sets of 3-grams.
  • Compare strings by comparing their respective
    3-gram sets.
  • Nick Koudas → {Nic, ick, …, das}
  • Nick Arkoudas → {Nic, …, das}
  • We can use TF/IDF similarity (or other metrics)
    to evaluate set similarity.
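
The decomposition above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and Jaccard similarity is used as a simple stand-in for the "other metrics" the slide mentions, because TF/IDF needs corpus statistics that are not available for two isolated strings.

```python
def three_grams(s):
    """Return the set of all 3-character substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| (stand-in for TF/IDF)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


g1 = three_grams("Nick Koudas")
g2 = three_grams("Nick Arkoudas")
shared = g1 & g2            # 3-grams the two strings have in common
similarity = jaccard(g1, g2)
print(sorted(shared), round(similarity, 3))
```

Note that the comparison here is case sensitive, so "Kou" and "kou" count as different 3-grams; a real system would probably normalize the strings first.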

3
Indexes for Set Similarity Evaluation
  • Current approaches use inverted lists
  • Compute IDF of set elements (e.g., 3-grams).
  • Create one inverted list per set element
    consisting of one entry per database set
    containing the respective elements (e.g., one
    entry per string containing the respective
    3-gram).
  • Use various algorithms and sorting/compression
    schemes for fast merging of inverted lists.
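
A minimal sketch of such an index, assuming strings are identified by their position in the input and that IDF is computed as log(1 + N / df). The slides do not say which IDF variant is used, so that formula and all names below are assumptions for illustration.

```python
import math
from collections import defaultdict


def three_grams(s):
    """Return the set of all 3-character substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def build_index(strings):
    """Map each 3-gram to the list of ids of the strings containing it."""
    lists = defaultdict(list)
    for set_id, s in enumerate(strings):
        for gram in three_grams(s):
            lists[gram].append(set_id)   # ids arrive in increasing order
    return dict(lists)


def idf_weights(index, n):
    """Assumed IDF variant: log(1 + n / document frequency)."""
    return {gram: math.log(1 + n / len(ids)) for gram, ids in index.items()}


strings = ["Nick Koudas", "Nick Arkoudas", "Marios"]
index = build_index(strings)
weights = idf_weights(index, len(strings))
```

Because each list is built in id order, lists can be merged with a standard sorted-list intersection or union; rare 3-grams receive higher weights than common ones.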

4
Motivation
  • Set similarity queries are very important
  • String matching.
  • Data cleaning.
  • Set-valued attributes in ORDBMS.
  • A variety of set similarity operators have been
    proposed (for join, selection queries)
  • Selectivity estimation is important for query
    optimization

5
The Problem
  • Let I be a predefined set similarity measure.
  • Let D be a collection of sets.
  • Given a query set q and a threshold t, a set
    similarity selection query returns the answer set
    A = {s ∈ D s.t. I(q, s) > t}.
  • A set similarity selectivity estimation query
    estimates the size of A.
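
To make the definitions concrete, here is a brute-force sketch of the exact query whose answer size the estimators try to predict. Jaccard similarity stands in for the unspecified measure I, and the tiny database is made up for illustration.

```python
def jaccard(a, b):
    """Stand-in similarity measure I(q, s)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def answer_set(db, q, t, sim=jaccard):
    """Ids of all sets s in db with sim(q, s) > t (exact evaluation)."""
    return [i for i, s in enumerate(db) if sim(q, s) > t]


db = [{1, 2, 3}, {2, 3, 4}, {7, 8}, {1, 2, 3, 4}]
q = {1, 2, 3}
exact = answer_set(db, q, t=0.4)
selectivity = len(exact)     # the quantity a selectivity estimator predicts
```

The scan touches every set in the database, which is the cost an estimator is meant to avoid.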

6
Naïve Solutions
7
Random Sampling
  • Maintain a sample S of sets s ∈ D.
  • Size of answer
  • |A| ≈ |A_s| · |D| / |S|,
  • where A_s = {s ∈ S s.t. I(q, s) > t}.
  • Drawbacks
  • Query independent.
  • Large variance.
  • Needs to store complete sets in the sample.
  • Cannot handle updates.
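
A small sketch of this estimator, assuming a uniform sample without replacement and Jaccard similarity as a stand-in for I. The synthetic data and the seed are arbitrary choices for illustration.

```python
import random


def jaccard(a, b):
    """Stand-in similarity measure I(q, s)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def sample_estimate(db, sample, q, t, sim=jaccard):
    """Scale the number of qualifying sampled sets by |D| / |S|."""
    hits = sum(1 for s in sample if sim(q, s) > t)
    return hits * len(db) / len(sample)


rng = random.Random(0)
# Synthetic database: 1000 sets of 5 elements drawn from a domain of 20.
db = [frozenset(rng.sample(range(20), 5)) for _ in range(1000)]
sample = rng.sample(db, 100)          # 10% uniform sample of complete sets
q = db[0]
exact = sum(1 for s in db if jaccard(q, s) > 0.3)
estimate = sample_estimate(db, sample, q, 0.3)
```

The sample has to hold complete sets so that I(q, s) can be evaluated, and the same sample is used for every query, which matches the drawbacks listed above.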

8
One Sample Per List
  • Use the existing inverted index
  • Compute one sample per inverted list.
  • Compute independent estimates per list
    corresponding to the query set elements only
    (query specific)
  • Report the median, maximum, or average of the
    per-list estimates
  • Drawbacks
  • Ignores correlations between lists.
  • Needs to store complete sets in inverted lists.
  • Will not be better than simple random sampling.
  • Cannot handle updates.

9
Sample Union
  • Compute the sample union of the samples
    corresponding to the query set elements.
  • Drawbacks
  • Results in a biased sample.
  • There are duplicate elements in those lists.
  • Even if we eliminate duplicates we still need to
    compute the distinct set size of the sample union
    (for scaling up).
  • This is more expensive than answering the set
    similarity query exactly to begin with.
  • Needs to store complete sets in inverted lists.
  • Cannot handle updates.

10
Dynamically Computed Samples
  • Given the query
  • Use reservoir sampling to compute a sample union
    from the inverted lists on the fly.
  • Drawbacks
  • Produces a biased sample
  • Skips part of the input.
  • Duplicates.
  • Need to store complete sets in the inverted lists.

11
Hashed Samples
12
Hashed Sample
  • An a priori computed sample that
  • Builds uniform samples from arbitrary
    combinations of inverted lists.
  • Does not need to store complete sets in the
    sample (only set ids).
  • Leverages partial weight information contained in
    the lists.
  • Eliminates the need to store distinct value
    estimation synopses.
  • Provides unbiased estimates.
  • Handles updates gracefully.

13
Construction
  • We cannot draw independent samples per list
  • Draw samples deterministically
  • In order to leverage partial weights contained in
    lists for computing I(q, s) efficiently.
  • Guarantee that if a set id is sampled from one
    list, it will always be sampled from every other
    list that contains it.
  • Guarantee that union of list samples is a uniform
    random sample.
  • We impose a random permutation on the domain of
    set ids
  • Use hashing and sample a consistent subset.

14
Construction 2
  • Randomly choose a hash function h from a family
    of universal hash functions.
  • Assume that h hashes into [1, 100].
  • Values h(s1), h(s2), … appear i.i.d.
    (empirically)
  • Choose a value x and sample from every list the
    sets s with h(s) < x.
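
The construction can be sketched as follows. The hash family ((a·s + b) mod p) mod m is one standard universal family and is an assumption here, since the slides do not name the family used; the lists, the seed, and the names are also illustrative.

```python
import random

PRIME = (1 << 61) - 1   # a Mersenne prime, larger than any set id used here


def make_hash(rng, buckets=100):
    """Randomly pick h(s) = ((a*s + b) mod PRIME) mod buckets + 1."""
    a = rng.randrange(1, PRIME)
    b = rng.randrange(0, PRIME)
    return lambda s: ((a * s + b) % PRIME) % buckets + 1


def hashed_sample(lists, h, x):
    """Keep, in every inverted list, exactly the set ids with h(s) < x."""
    return {elem: [s for s in ids if h(s) < x] for elem, ids in lists.items()}


rng = random.Random(42)
h = make_hash(rng)
# Hypothetical inverted lists: element -> sorted list of set ids.
lists = {
    "Nic": list(range(0, 1000, 2)),
    "ick": list(range(0, 1000, 3)),
    "das": list(range(0, 1000, 5)),
}
sampled = hashed_sample(lists, h, x=11)
```

Because the decision to keep a set id depends only on h(s), an id kept in one list is kept in every list that contains it, which is the consistency property the construction requires.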

15
Hashed Sample Properties
  • We get an x% sample per list on average.
  • We get an overall x% sample.
  • The union of samples of any set of inverted lists
    is an x% sample of the respective lists.
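  • Let q = {q1, q2, …, qn}.
  • |A| ≈ |A_s| · |q1 ∪ … ∪ qn|_d / |qs1 ∪ … ∪ qsn|_d,
    where qi and qsi are the full and the sampled
    inverted list of element qi, and |·|_d counts
    distinct set ids.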
  • Computing As is simple
  • Run any exact evaluation algorithm on the sampled
    lists!
  • Performance improvement with respect to exact
    evaluation is directly proportional to the size
    of the sample.
  • We still need the distinct number of set ids in q
    to scale up the results.
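
A sketch of the scaled estimate under simplifying assumptions: the similarity is taken to be the number of query elements a set contains (so the exact algorithm is a simple list merge with counting), the hash function is a fixed toy function rather than a randomly drawn one, and the distinct size of the full union is computed exactly just to show the scaling step.

```python
from collections import Counter


def h(s):
    """Hypothetical fixed hash into [1, 100]; a real one is drawn at random."""
    return (s * 7919 + 13) % 100 + 1


def answers(lists, t):
    """Exact evaluation: set ids that occur in more than t of the lists."""
    counts = Counter(s for ids in lists for s in ids)
    return {s for s, c in counts.items() if c > t}


def distinct_union(lists):
    """Number of distinct set ids in the union of the lists."""
    return len(set().union(*lists))


def estimate(full, sampled, t):
    """|A_s| scaled by the ratio of full to sampled distinct union sizes."""
    d_s = distinct_union(sampled)
    if d_s == 0:
        return 0.0
    return len(answers(sampled, t)) * distinct_union(full) / d_s


# Hypothetical inverted lists for the three elements of a query.
full = [list(range(0, 1000, 2)), list(range(0, 1000, 3)), list(range(0, 1000, 5))]
sampled = [[s for s in ids if h(s) < 11] for ids in full]
exact = len(answers(full, 1))
approx = estimate(full, sampled, 1)
```

The same `answers` routine runs on both the full and the sampled lists, which illustrates the point that any exact algorithm can be reused unchanged. Computing `distinct_union(full)` directly defeats the purpose in practice, which is why a distinct-count estimate is still needed.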

16
The K-Minimum Values Synopsis
  • Estimating the distinct size of arbitrary list
    unions
  • The sampled lists themselves can be used as a KMV
    synopsis, by construction.
  • The r-th smallest hash value h_r of a set of
    elements gives an unbiased estimator of the
    distinct number of elements in the set
  • |S|_d ≈ P · (r − 1) / h_r
  • Given that sample lists contain all elements s.t.
    h(s) < x, we can deduce the rank of h_m = x.
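
A minimal sketch of the KMV estimator, assuming P denotes the size of the hash domain [1, P] (the slide does not define P) and simulating an ideal hash function with one random value per set id.

```python
import random


def kmv_estimate(hash_values, r, domain):
    """Estimate the distinct count as domain * (r - 1) / h_r."""
    smallest = sorted(set(hash_values))[:r]
    if len(smallest) < r:
        # Fewer than r distinct hash values: report their count directly
        # (exact unless two ids collide on the same hash value).
        return float(len(smallest))
    return domain * (r - 1) / smallest[-1]


rng = random.Random(7)
domain = 2 ** 61
# Simulated ideal hash: one random value in [1, domain] per set id.
h = {s: rng.randrange(1, domain + 1) for s in range(10_000)}
# Every id appears three times, as it might across several inverted lists.
stream = [h[s] for s in range(10_000)] * 3
estimate = kmv_estimate(stream, r=256, domain=domain)
```

Duplicates do not affect the result because only distinct hash values are ranked. For a hashed sample, one reading of the last bullet is that the number of distinct sampled ids fixes the rank of the threshold x, so the sampled lists already carry everything the estimator needs.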

17
Experimental Evaluation
18
Setup
  • IMDB, DBLP, YellowPages
  • Decompose strings into 3-grams and build inverted
    index for TF/IDF similarity.
  • Build list samples of 1%, 5%, and 10%.
  • Draw queries from the data
  • 100 queries per workload.
  • Each workload contains queries of a preset selectivity.
  • Evaluate estimation accuracy and runtime.

19
Storing Sets vs. Storing Set Ids
20
Reservoir Sampling Accuracy
21
Reservoir Sampling Cost
22
Hashed Sampling Accuracy
23
Hashed Sampling Cost
24
Hashed Sampling Threshold
25
Hashed Sampling Answer Size
26
Hashed Sampling KMV Accuracy