STHoles: A Multidimensional Workload-Aware Histogram - PowerPoint PPT Presentation

1 / 32
About This Presentation
Title:

STHoles: A Multidimensional Workload-Aware Histogram

Description:

Used for selectivity estimation and approximate query processing. ... Overhead for intercepting query results in Microsoft SQL Server 2000 is less than 8 ... – PowerPoint PPT presentation

Number of Views:17
Avg rating:3.0/5.0
Slides: 33
Provided by: emilio6
Category:

less

Transcript and Presenter's Notes

Title: STHoles: A Multidimensional Workload-Aware Histogram


1
STHoles A Multidimensional Workload-Aware
Histogram
  • Nicolas BrunoColumbia University

Luis GravanoColumbia University
Surajit ChaudhuriMicrosoft Research
Work done in part while the authors were
visiting Microsoft Research.
SIGMOD 2001
2
Histograms as Succinct Data Set Summaries
  • Used for selectivity estimation and approximate
    query processing.
  • Data set partitioned into buckets, each
    approximated by aggregate statistics.

3
Histograms
  • Each bucket consists of a bounding box and a
    tuple frequency value.
  • Uniformity is assumed inside buckets.
  • Histograms should partition data set in buckets
    with uniform tuple density.
  • Multi-dimensional data makes partitioning even
    more challenging.

4
Outline
  • Overview of existing multidimensional histogram
    techniques.
  • Introduction to STHoles histograms.
  • System architecture and STHoles construction
    algorithm.
  • Experimental evaluation.

5
Histograms Techniques EquiDepth
Gaussian Data Set
EquiDepth Histogram Muralikrishna and DeWitt
1988
  • Correctly identifies core of densest clusters.
  • Partitioning uses equi-count instead of
    equi-density

6
Histogram Techniques MHist
Gaussian Data Set
MHist Histogram Poosala and Ioannidis 1997
  • Works well for highly skewed data distributions.
  • Devotes too many buckets to the densest clusters.
  • Bad initial choices are amplified in later
    steps.

7
Histogram Techniques GenHist
GenHist Histogram Gunopulos et al. 2000
Gaussian Data Set
  • More robust than previous techniques (based on
    multidimensional information).
  • Difficult to choose right values of various
    parameters.
  • Requires at least 5-10 passes over the data.

8
Histogram Techniques STGrid
Gaussian Data Set
STGrid Histogram Aboulnaga and Chaudhuri 1999
  • Incorporates feedback from query execution.
  • Grid partitioning strategy is sometimes too
    rigid.
  • Focuses on efficiency rather than accuracy.

9
Our New Histogram Technique STHoles
  • Flexible bucket partitioning.
  • Exploits workload information to allocate
    buckets.
  • Query feedback captures uniformly dense regions.
  • Does not examine actual data set.

10
STHoles Histograms
  • Tree structure among buckets.
  • Buckets with holes relaxes rectangular regions
    while using rectangular bucket structures.

11
System Architecture for STHoles
Range Query
12
STHoles Construction Algorithm
  • Initialize histogram H as an empty histogram.
  • For each query q in workload
  • 1- Gather simple statistics from query results.
  • 2- Identify candidate holes and drill (add) them
    as new buckets in H.
  • 3- Merge superfluous buckets in H.

13
Drilling New Candidate Buckets
For each query q in workload and bucket b in
histogram
  • Count how many tuples in result stream lie inside
    q?b.
  • Drill q?b as a new bucket (child of b).

q
14
Shrinking Candidate Buckets
  • Partition constraint Bounding boxes must be
    rectangular.
  • Apply greedy technique to shrink a candidate hole
    to a rectangle.

15
Merging Buckets
  • To avoid exceeding available space.
  • Merge most similar buckets in terms of tuple
    density.

16
Parent-Child Merges
  • Eliminate buckets too similar to their parents.
  • Example The interesting region in bc is covered
    by its child b1.

17
Sibling-Sibling Merges
  • Consolidate buckets with similar densities that
    cover close regions.
  • Extrapolate frequency distributions to yet unseen
    regions.

18
An Example STHoles Histogram
Gaussian Data Set
STHoles Histogram
19
Experimental Setting
  • Data Sets
  • Real (UCI Repository)
  • Sample of Census data set (200K tuples)
  • Cover data set (500K tuples)
  • Synthetic Variations of Gaussian and
    Zipfian(Array) distributions.
  • 200K to 500K tuples, 2 to 4 dimensions.
  • Histograms
  • 1024 available bytes per histogram.
  • EquiDept, MHist, GenHist, STGrid, STHoles.

20
Experimental Setting (cont.)
  • Workloads Pagel et al. 1993
  • 1,000 queries.
  • Query centers follow different distributions
    Uniform, Biased, Gaussian.
  • Query boundaries follow different constraints
    area covered, tuples covered.

21
Comparison with Other Approaches Biased Workload
22
Comparison with Other Approaches Uniform Workload
23
Convergence with Workload
24
Handling Data Set Updates
From Gaussian to Zipfian data distributions.
25
Other Experiments
  • Varying
  • data skew.
  • data dimensionality.
  • histogram size.
  • workload generation parameters.
  • number of attributes in queries.
  • Overhead for intercepting query results in
    Microsoft SQL Server 2000 is less than 8.
  • STHoles lead to robust selectivity estimates
    across data distributions and workloads.
  • See full paper for details!

26
Summary STHoles, a Multidimensional
Workload-Aware Histogram
  • Exploits query feedback.
  • Built without examining data set.
  • Allows bucket nesting to capture complex shapes
    using only rectangular bucket structures.
  • Results in robust and accurate selectivity
    estimations.
  • In many cases, outperforms the best techniques
    that access full data sets.

27
Related Work (Histograms)
  • Unidimensional
  • EquiDepth Piatetsky-Shapiro and Connell 1984
  • MaxDiff Poosala et al. 1996
  • V-Optimal Jagadish et al. 1998
  • Many more!
  • Multidimensional
  • EquiDepth Muralikrishna and DeWitt 1988
  • MHist Poosala and Ioannidis 1997
  • GenHist Gunopulos et al. 2000
  • STGrid Aboulnaga and Chaudhuri 1999

28
Related Work (Other Techniques)
  • Sampling Olken and Rotem 1990
  • Wavelets Matias et al. 1997
  • Discrete transformations Lee et al. 1999
  • Parametric Curve Fitting Chen and Roussopoulos
    1994

29
Evaluation Metric
  • Absolute Error
  • Normalized Absolute Error

30
Overhead Evaluation over Microsoft SQL Server 2000
31
Varying Histogram Size
32
Varying Spatial Selectivity
Write a Comment
User Comments (0)
About PowerShow.com