On Random Sampling over Joins - PowerPoint PPT Presentation

About This Presentation
Title:

On Random Sampling over Joins

Description:

Semantics and algorithms of sampling. Two previous sampling strategies ... Conclusion: The new strategies are better than the earlier techniques. ...

Slides: 20
Provided by: cryst
Learn more at: https://crystal.uta.edu

Transcript and Presenter's Notes


1
On Random Sampling over Joins
  • Surajit Chaudhuri, Rajeev Motwani, Vivek
    Narasayya
  • Microsoft Research, Stanford University,
    Microsoft Research
  • Compiled by Arjun Dasgupta

2
CONTENTS
  • The difficulty of join sampling
  • Semantics and algorithms of sampling
  • Two previous sampling strategies
  • New strategies for join sampling
  • Experimental results

3
  • SAMPLE(R1 ⋈ R2, f)
  • ?
  • SAMPLE(R1, f) ⋈ SAMPLE(R2, f)

4
STRATEGY USED
  • Obtain SAMPLE(R1 ⋈ R2, f) from non-uniform samples
    of R1 and R2

5
The Difficulty of Join Sampling: Example
  • Suppose that we have the relations

6
TECHNIQUES FOR SAMPLING
  • Black-Box U1 (unweighted)
  • Black-Box U2 (unweighted)
  • Black-Box WR1 (weighted)
  • Black-Box WR2 (weighted)

7
Black-Box U2: Given relation R with n tuples,
generate an unweighted WR (with-replacement) sample of size r.
  • 1. N ← 0.
  • 2. Initialize reservoir array A[1..r] with r
    dummy values.
  • 3. While tuples are streaming by do begin
    (a) get next tuple t;
    (b) N ← N + 1;
    (c) for j = 1 to r, set A[j] to t with
    probability 1/N
    end.
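
The one-pass procedure on this slide can be sketched in Python as follows. This is a minimal illustration, not the presenters' code: the function name `black_box_u2` and the optional `rng` parameter are assumptions made for the example. The counter `n_seen` plays the role of N, the number of tuples seen so far.

```python
import random

def black_box_u2(stream, r, rng=random):
    """One-pass, unweighted with-replacement (WR) sample of size r.

    The size of the input does not need to be known in advance.
    """
    n_seen = 0                   # N: tuples seen so far
    reservoir = [None] * r       # r dummy values
    for t in stream:
        n_seen += 1
        for j in range(r):
            # Each slot independently takes t with probability 1/N,
            # so the first tuple always fills every slot.
            if rng.random() < 1.0 / n_seen:
                reservoir[j] = t
    return reservoir
```

Because every slot is updated independently, the same tuple may occupy several slots, which is what makes the result a with-replacement sample.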

8
Black-Box WR2: Given relation R with n tuples,
generate a weighted WR (with-replacement) sample of size r.
  • 1. W ← 0.
  • 2. Initialize reservoir array A[1..r] with r dummy
    values.
  • 3. While tuples are streaming by do begin
    (a) get next tuple t with weight w(t);
    (b) W ← W + w(t);
    (c) for j = 1 to r, set A[j] to t with
    probability w(t)/W
    end.
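
A weighted variant of the reservoir loop might look like the sketch below. The names `black_box_wr2`, `weight`, and `rng` are assumptions for illustration; `total_w` stands for the running weight total W. Skipping zero-weight tuples is a small addition of this sketch: it avoids a division by zero and matches the intended behaviour, since such tuples would be chosen with probability 0 anyway.

```python
import random

def black_box_wr2(stream, weight, r, rng=random):
    """One-pass weighted with-replacement (WR) sample of size r.

    `weight` maps a tuple to its non-negative weight w(t).
    """
    total_w = 0.0                # W: total weight seen so far
    reservoir = [None] * r       # r dummy values
    for t in stream:
        w = weight(t)
        if w <= 0:
            continue             # probability 0: can never be chosen
        total_w += w
        for j in range(r):
            # Each slot independently takes t with probability w(t)/W.
            if rng.random() < w / total_w:
                reservoir[j] = t
    return reservoir
```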

9
The Classification of the Problem
  • Case A: No information is available for
    either R1 or R2.
  • Case B: No information is available for R1,
    but indexes and/or statistics are available for
    R2.
  • Case C: Indexes/statistics are available for
    R1 and R2.

10
Previous Sampling Strategies
  • Strategy Naive-Sample
  • 1. Compute the join J = R1 ⋈ R2.
  • 2. As the tuples of J stream by, use Black-Box
    U1 or U2 to produce a uniform random sample
    of J.
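
As a rough illustration of Naive-Sample, the sketch below materialises the join as a stream and feeds it to a U2-style reservoir. Everything here is an assumption made for the example: relations are Python lists, `key1` and `key2` extract the join attribute, and a simple hash join stands in for whatever join method the system would actually use.

```python
import random
from collections import defaultdict

def join_stream(r1, r2, key1, key2):
    """Yield every pair (t1, t2) of the equi-join, using a hash join."""
    index = defaultdict(list)
    for t2 in r2:
        index[key2(t2)].append(t2)
    for t1 in r1:
        for t2 in index.get(key1(t1), []):
            yield (t1, t2)

def naive_sample(r1, r2, key1, key2, r, rng=random):
    """Compute the full join, then take a WR sample of size r from it."""
    n_seen = 0
    reservoir = [None] * r
    for pair in join_stream(r1, r2, key1, key2):
        n_seen += 1
        for j in range(r):
            if rng.random() < 1.0 / n_seen:
                reservoir[j] = pair
    return reservoir
```

The cost is dominated by producing every join tuple, even though only r of them survive.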

11
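Previous Sampling Strategies
  • Strategy Olken-Sample
  • 1. Let M be an upper bound on m2(v) for all
    join-attribute values v, where m2(v) is the
    number of tuples of R2 with value v.
  • 2. repeat
  • (a) Sample a tuple t1 from R1 uniformly at
    random.
  • (b) Sample a random tuple t2 from R2 from
    among all tuples of R2 that have the same
    join-attribute value as t1.
  • (c) Output t1 ⋈ t2 with probability
    m2(v)/M, where v is the join-attribute value
    of t1, and with remaining probability reject
    the sample.
  • until r tuples have been produced.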
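
The rejection loop of Olken-Sample can be sketched as below. This is an illustration under stated assumptions, not the original implementation: the in-memory dictionary `index` stands in for an index on R2, `r1` is a list so that a tuple can be drawn uniformly, and the names `olken_sample`, `key1`, `key2`, and `rng` are hypothetical. The acceptance probability is the number of R2 matches for the drawn tuple divided by the bound M.

```python
import random
from collections import defaultdict

def olken_sample(r1, r2, key1, key2, r, rng=random):
    """Rejection-based WR sample of size r from the equi-join of r1 and r2."""
    index = defaultdict(list)            # stand-in for an index on R2
    for t2 in r2:
        index[key2(t2)].append(t2)
    if not any(key1(t1) in index for t1 in r1):
        raise ValueError("join is empty; the loop would never terminate")
    M = max(len(group) for group in index.values())   # upper bound on matches
    out = []
    while len(out) < r:
        t1 = rng.choice(r1)                            # uniform over R1
        matches = index.get(key1(t1), [])
        if not matches:
            continue                                   # acceptance probability 0
        t2 = rng.choice(matches)                       # uniform among matches
        if rng.random() < len(matches) / M:            # accept or reject
            out.append((t1, t2))
    return out
```

Tuples of R1 with few matches are rejected often, so several iterations may be needed per output tuple.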

12
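New Strategies for Join Sampling
  • Strategy Stream-Sample
  • 1. Use Black-Box WR1 or WR2 to produce a WR
    sample S1 of R1 of size r, where the weight for
    a tuple t of R1 is set to m2(v), the number of
    tuples of R2 with t's join-attribute value v.
  • 2. While tuples of S1 are streaming by do begin
  • (a) get next tuple t1 and let v be its
    join-attribute value;
  • (b) sample a random tuple t2 from R2 from
    among all tuples of R2 that have
    join-attribute value v;
  • (c) output t1 ⋈ t2
    end.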
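
A compact sketch of Stream-Sample follows. It is illustrative only: `random.choices` with weights is swapped in for the weighted black box (WR1/WR2), the dictionary `index` stands in for an index on R2, and all function and parameter names are assumptions. The weight of each R1 tuple is the number of R2 tuples that share its join-attribute value.

```python
import random
from collections import defaultdict

def stream_sample(r1, r2, key1, key2, r, rng=random):
    """WR sample of size r from the equi-join of r1 and r2, with no rejections."""
    index = defaultdict(list)            # stand-in for an index on R2
    for t2 in r2:
        index[key2(t2)].append(t2)
    # Step 1: weighted WR sample of R1; weight = number of matches in R2.
    weights = [len(index.get(key1(t1), [])) for t1 in r1]
    if sum(weights) == 0:
        raise ValueError("join is empty")
    s1 = rng.choices(r1, weights=weights, k=r)
    # Step 2: pair each sampled tuple with one uniformly chosen match.
    return [(t1, rng.choice(index[key1(t1)])) for t1 in s1]
```

Choosing t1 with probability proportional to its match count and then t2 uniformly among those matches makes every join tuple equally likely, which is why no rejection step is needed.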

13
New Strategies for Join Sampling
  • Strategy Stream-Sample is more efficient than
    Olken-Sample:
    1. No information is required for R1
    (Case B).
    2. No tuple is rejected after computing the
    join.
    3. Only one iteration is needed for each
    output tuple.

14
New Strategies for Join Sampling
  • Strategy Group-Sample
  • 1. Use Black-Box WR1 or WR2 to produce a WR
    sample S1 of R1 of size r, where the weight
    for a tuple t of R1 is set to m2(v), the
    number of tuples of R2 with t's
    join-attribute value v.
  • 2. Let S1 consist of the tuples s1, ..., sr.
    Produce S2 = S1 ⋈ R2, whose tuples are
    grouped by the tuples of S1 that generated
    them.
  • 3. Use r invocations of Black-Box U1 or U2 to
    sample r tuples, one from each group.
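
Group-Sample can be sketched as below, under several assumptions made for the example: the frequency table `m2` is computed with an extra pass over `r2` and stands in for precomputed statistics, `random.choices` replaces the weighted black box, and step 3 uses a size-one reservoir per group. All names are hypothetical.

```python
import random
from collections import defaultdict

def group_sample(r1, r2, key1, key2, r, rng=random):
    """WR sample of size r from the equi-join, using only counts on r2."""
    m2 = defaultdict(int)                # stand-in for statistics on R2
    for t2 in r2:
        m2[key2(t2)] += 1
    # Step 1: weighted WR sample s_1..s_r of R1.
    weights = [m2.get(key1(t1), 0) for t1 in r1]
    if sum(weights) == 0:
        raise ValueError("join is empty")
    s1 = rng.choices(r1, weights=weights, k=r)
    # Step 2: scan R2 once; group i holds the matches of s_i.
    groups_by_value = defaultdict(list)
    for i, s in enumerate(s1):
        groups_by_value[key1(s)].append(i)
    chosen = [None] * r
    seen = [0] * r
    for t2 in r2:
        for i in groups_by_value.get(key2(t2), []):
            # Step 3: size-one reservoir keeps one uniform tuple per group.
            seen[i] += 1
            if rng.random() < 1.0 / seen[i]:
                chosen[i] = t2
    return [(s1[i], chosen[i]) for i in range(r)]
```

Unlike the index-based lookups in the earlier strategies, this version only scans `r2`, so it fits the situation where statistics are available but an index is not.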

15
New Strategy for Join Sampling
  • Strategy Frequency-Partition-Sample

16
Experimental Results

17
Experimental Results

18
Experimental Results

19
Summary
  • The difficulty of join sampling: example.
  • The classification of the problem: 3 cases.
  • Previous strategies: Naive-Sample,
    Olken-Sample
  • New strategies: Stream-Sample, Group-Sample,
    Frequency-Partition-Sample
  • Conclusion: The new strategies are better than
    the earlier techniques.