Rainer Gemulla, Wolfgang Lehner and Peter J. Haas - PowerPoint PPT Presentation

1
A Dip in the Reservoir: Maintaining Sample
Synopses of Evolving Datasets
  • Rainer Gemulla, Wolfgang Lehner and Peter J. Haas
  • VLDB 2006

2
Outline
  • Introduction
  • Bernoulli Sampling
  • Reservoir Sampling
  • Random Pairing
  • Conclusion

3
Introduction
  • Random sampling
  • approximate query answering
  • data mining
  • data stream processing
  • query optimization
  • data integration
  • For example
  • It is often infeasible to process or store the
    entire data stream
  • Random sampling is an appealing approach to build
    synopses of large data streams

4
Uniform Sampling
  • Uniform sampling
  • all samples of the same size are equally likely
  • many statistical procedures assume uniformity
  • flexibility
  • Example
  • a data set (also called population)
  • possible samples of size 2
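
As a small illustration of the definition above, the following Python sketch enumerates every sample of a given size from a data set. The data set {1, 2, 3} and the helper name samples_of_size are hypothetical stand-ins chosen for illustration; they are not taken from the slide. Under uniform sampling, each of the enumerated samples would be selected with the same probability.

```python
from itertools import combinations

def samples_of_size(population, k):
    """Return every possible sample (as a tuple) of size k from the population."""
    return list(combinations(population, k))

# Hypothetical data set (population) with three items.
population = [1, 2, 3]
candidates = samples_of_size(population, 2)

# Under a uniform scheme, each candidate has probability 1 / len(candidates).
uniform_probability = 1 / len(candidates)
```

With three items and sample size 2 there are three candidate samples, so a uniform scheme picks each one with probability 1/3.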

5
Bernoulli Sampling
  • Each inserted item is included in the sample with
    probability q and excluded with probability 1-q
  • For a dataset R, the sample size |S| follows the
    binomial distribution BINOM(|R|, q), so that
    $\Pr\{|S| = k\} = \binom{|R|}{k} q^k (1-q)^{|R|-k}$
  • The main disadvantage is the uncontrollable
    variability of the sample size.
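
The inclusion rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the function name bernoulli_sample and the optional rng parameter are assumptions made here so that runs can be reproduced.

```python
import random

def bernoulli_sample(items, q, rng=None):
    """Include each item independently with probability q (assumed 0 <= q <= 1)."""
    rng = rng if rng is not None else random.Random()
    # rng.random() is uniform on [0, 1), so the test succeeds with probability q.
    return [x for x in items if rng.random() < q]

# Two runs over the same data can yield samples of different sizes,
# which is the variability the slide points out.
run_a = bernoulli_sample(range(1000), 0.1, random.Random(1))
run_b = bernoulli_sample(range(1000), 0.1, random.Random(2))
```

The sample size is not bounded in advance: it can be anything from 0 to the full data set size, depending on the coin flips.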

6
Reservoir Sampling
  • Reservoir sampling
  • Maintains a random sample of fixed size M
  • building block for many sophisticated sampling
    schemes
  • single-scan algorithm
  • add the first M elements
  • afterwards, flip a coin
  • ignore the element (reject)
  • replace a random element in the sample (accept)
  • accept probability of the ith element: M/i (for i > M)
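
The single-scan procedure can be sketched as follows. This is a minimal Python illustration of classic reservoir sampling, assuming the standard accept probability M/i for the ith element; the function name reservoir_sample and the rng parameter are chosen here for illustration.

```python
import random

def reservoir_sample(stream, M, rng=None):
    """Maintain a uniform random sample of at most M items in one pass."""
    rng = rng if rng is not None else random.Random()
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= M:
            # The first M elements are always added.
            sample.append(item)
        elif rng.random() < M / i:
            # Accept: replace a randomly chosen element of the sample.
            sample[rng.randrange(M)] = item
        # Otherwise reject: the element is ignored.
    return sample
```

For a stream with at least M elements the result always has exactly M items; for a shorter stream it contains the whole stream.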

7
Reservoir Sampling (Example)
  • Example
  • sample size M = 2

8
Problems with Reservoir Sampling
  • Problems with reservoir sampling
  • lacks support for deletions (stable data sets)

9
An Incorrect Approach
  • Idea
  • use arriving insertions to refill the sample

Not uniform!
10
Random Pairing
  • Random pairing
  • compensates deletions with arriving insertions
  • corrects inclusion probabilities
  • General idea (insertion)
  • no uncompensated deletions → reservoir sampling
  • otherwise,
  • randomly select an uncompensated deletion
    (partner)
  • compensate it: was it in the sample?
  • yes → add arriving element to sample
  • no → ignore arriving element

11
(Cont.)
  • The RP algorithm maintains two counters
  • c1 records the number of uncompensated deletions
    in which the deleted item was in the sample
  • c2 records the number of uncompensated deletions
    in which the deleted item was not in the sample
  • d = c1 + c2 is the total number of uncompensated
    deletions
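
The two counters make it possible to pair an insertion with a deletion without remembering the deleted items themselves: choosing a partner uniformly at random among the d uncompensated deletions hits one that was in the sample with probability c1/d. The Python sketch below illustrates this under stated assumptions: items are distinct, every deleted item is currently in the data set, and the class name RandomPairingSample, its attributes, and the rng parameter are chosen here for illustration rather than taken from the slides.

```python
import random

class RandomPairingSample:
    """Bounded-size sample (at most M items) under insertions and deletions."""

    def __init__(self, M, rng=None):
        self.M = M
        self.rng = rng if rng is not None else random.Random()
        self.sample = []
        self.size = 0  # current data set size |R|
        self.c1 = 0    # uncompensated deletions that were in the sample
        self.c2 = 0    # uncompensated deletions that were not in the sample

    def insert(self, item):
        self.size += 1
        d = self.c1 + self.c2
        if d == 0:
            # No uncompensated deletions: ordinary reservoir sampling step.
            if len(self.sample) < self.M:
                self.sample.append(item)
            elif self.rng.random() < self.M / self.size:
                self.sample[self.rng.randrange(self.M)] = item
        elif self.rng.random() < self.c1 / d:
            # The partner deletion was in the sample: take its place.
            self.c1 -= 1
            self.sample.append(item)
        else:
            # The partner deletion was not in the sample: ignore the item.
            self.c2 -= 1

    def delete(self, item):
        # Assumes the item is currently in the data set.
        self.size -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c1 += 1
        else:
            self.c2 += 1
```

A short usage example: after ten insertions with M = 3 the sample holds three items; deleting exactly those three leaves c1 = 3 and c2 = 0, so the next three insertions are all placed in the sample.

```python
rp = RandomPairingSample(3, random.Random(0))
for x in range(10):
    rp.insert(x)
for x in list(rp.sample):
    rp.delete(x)
for x in (100, 101, 102):
    rp.insert(x)
```

The list-based membership test and removal cost O(M) per deletion, which keeps the sketch short; a production version would likely use a different container.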

12
Random Pairing
  • Example

13
Conclusion
  • Reservoir Sampling
  • lacks support for deletions
  • Random Pairing
  • uses arriving insertions to compensate for
    deletions
  • Can these sampling schemes be applied to sliding
    windows?
  • It may be difficult, because the number of
    items in the window is unknown in advance.