Two Improved Range-Efficient Algorithms for F0 Estimation

1
Two Improved Range-Efficient Algorithms for F0
Estimation
TAMC 2007
  • He Sun
  • Fudan University

2
Outline
3
Data Stream Model
  • Massive data set
  • Limited workspace
    • Typically poly-logarithmic in the size of the data
  • Fast processing per item
    • Constant or logarithmic in the size of the data set
  • Provide approximate answers to aggregate queries

4
Definition of Fk
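For reference, a minimal sketch of the standard frequency moments as defined by Alon et al., on the assumption that this is the definition the slide showed: if m_i is the number of occurrences of value i in the stream, then F_k is the sum of m_i^k over all values that occur, so F_0 is the number of distinct values and F_1 is the stream length. The name `frequency_moment` is illustrative only.

```python
from collections import Counter


def frequency_moment(stream, k):
    """Exact k-th frequency moment: the sum of m_i**k over distinct values i.

    m_i is the number of times value i occurs in the stream. For k = 0 every
    value that occurs contributes 1, so F_0 is the number of distinct values.
    """
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())
```

This exact computation stores one counter per distinct value, which is the linear workspace that streaming algorithms are designed to avoid.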
5
Related Work
  • F0 of a data stream
    • Flajolet-Martin (JCSS 1985)
    • Alon et al. (JCSS 1999)
    • Cormode et al. (VLDB 2002)
    • Bar-Yossef et al. (RANDOM 2002)
    • Lower bounds: Indyk-Woodruff (FOCS 2003)
  • Lp difference of data streams
    • P. Indyk (FOCS 2000)

6
Range-Efficient F0 Problem
  • Input: a series of intervals
  • where
  • Output: the size of the union of the intervals

Example
[Figure: intervals on a number line with labeled points 2, 3, 10, 15, 50, 60, 66, 80, and the resulting answer]
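As a point of comparison, the exact answer can be computed offline by sorting the intervals and sweeping once. The sketch below assumes closed integer intervals given as `(lo, hi)` pairs; it stores every interval, so it is a baseline for checking results rather than a one-pass streaming algorithm. The function name and the sample intervals in the usage note are hypothetical.

```python
def union_size(intervals):
    """Exact number of distinct integers covered by closed intervals [lo, hi]."""
    total = 0
    covered_up_to = None  # largest integer already counted
    for lo, hi in sorted(intervals):
        if covered_up_to is None or lo > covered_up_to:
            total += hi - lo + 1          # disjoint from everything counted so far
            covered_up_to = hi
        elif hi > covered_up_to:
            total += hi - covered_up_to   # only the new right-hand part
            covered_up_to = hi
    return total
```

For instance, `union_size([(1, 5), (3, 8), (20, 22)])` counts the integers 1 through 8 and 20 through 22.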
7
Constraints
  • One pass through the data
  • Space complexity
  • Fast processing time
  • An $(\epsilon, \delta)$-approximation, i.e. the relationship
    between the output $\hat{Z}$ and the exact answer $Z$
    satisfies $\Pr[\,|\hat{Z} - Z| \le \epsilon Z\,] \ge 1 - \delta$

8
Result Comparison
9
Applications to Range-Efficient F0
10
Preliminaries: Hash Function
  • LSB(x): the number of consecutive 0s, counted from the
    rightmost bit, in x's binary representation.
  • Define a hash function
    $h(x) = (ax + b) \bmod p$, where a and b are chosen uniformly from
    {0, ..., p-1}.

Example
LSB(1) = 0, LSB(2) = 1, LSB(8) = 3, LSB(10) = 1
Lemma: LSB(h(x)) is a pairwise independent
hash function.
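A minimal sketch of these two building blocks, assuming the hash has the usual linear form h(x) = (a*x + b) mod p for a prime p. The helper names `lsb` and `make_hash` are illustrative, and leaving LSB undefined for 0 is a choice made here, not something the slides specify.

```python
import random


def lsb(x):
    """Number of consecutive 0 bits at the low end of x (x must be positive)."""
    if x <= 0:
        raise ValueError("lsb is defined here only for positive integers")
    # x & -x isolates the lowest set bit; its bit length minus 1 is its position.
    return (x & -x).bit_length() - 1


def make_hash(p, rng=random):
    """Draw a and b uniformly from {0, ..., p-1}; return h(x) = (a*x + b) mod p."""
    a = rng.randrange(p)
    b = rng.randrange(p)
    return lambda x: (a * x + b) % p
```

Passing a seeded `random.Random` instance as `rng` makes the choice of a and b reproducible.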
11
Algorithm
12
A key issue in our algorithm
  • For the given function h and interval R, a key
    issue in our algorithm design is how to calculate
    the following quantities efficiently:
  • 1. M(R)
  • 2. G(R)

13
Calculation of M(R) and G(R)
  • M(R): the largest value among the hashed integers in
    R.
  • G(R): the number of integers in R that achieve that
    value.
  • A naïve algorithm requires time O(|R|).

Calculate these two quantities in time
O(poly(log U))
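The naïve O(|R|) computation can be sketched as follows, assuming that the "hashed value" of x is LSB(h(x)) with h(x) = (a*x + b) mod p, and assuming the convention (chosen here, not taken from the slides) that LSB(0) equals the bit length of p - 1, which is larger than the LSB of any nonzero hash value.

```python
def lsb(x, cap):
    """Trailing zero bits of x; by the convention assumed here, lsb(0) = cap."""
    if x == 0:
        return cap
    return (x & -x).bit_length() - 1


def naive_m_and_g(a, b, p, lo, hi):
    """Scan R = [lo, hi] and return (M, G) for h(x) = (a*x + b) mod p.

    M is the largest LSB(h(x)) over x in R; G is how many x attain it.
    """
    cap = (p - 1).bit_length()
    best, count = -1, 0
    for x in range(lo, hi + 1):
        level = lsb((a * x + b) % p, cap)
        if level > best:
            best, count = level, 1
        elif level == best:
            count += 1
    return best, count
```

The loop touches every integer in R, which is exactly the cost that a poly(log U) method has to avoid for long intervals.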
14
Equivalent Description of Calculating M(R), G(R)
  • Given the sequence
  • Find the maximum i, , such
    that

Analysis
For the given i, we obtain
15
Calculating M(R) and G(R) (Cont.)
  • Define
  • Construct an arithmetic progression S over the
    field Zp
  • Calculate the number of elements in S belonging
    to the interval [0, t].
  • By Pavan and Tirthapura's algorithm, this
    quantity can be calculated with time complexity
    O(log U)
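The counting step can be illustrated with a different, well-known technique: the Euclid-style "floor sum" recursion. This is not Pavan and Tirthapura's algorithm, only a stand-in that answers the same question (how many terms of an arithmetic progression modulo p fall in [0, t]) in a logarithmic number of arithmetic steps. The names `floor_sum` and `count_leq` and the parameter layout are assumptions made for this sketch.

```python
def floor_sum(n, m, a, b):
    """Sum of (a*i + b) // m for i in 0..n-1, with n >= 0, m >= 1, a >= 0, b >= 0."""
    total = 0
    while True:
        if a >= m:
            total += (n - 1) * n // 2 * (a // m)  # contribution of whole multiples of m in a
            a %= m
        if b >= m:
            total += n * (b // m)                 # contribution of whole multiples of m in b
            b %= m
        y_max = a * n + b
        if y_max < m:
            return total
        # Swap the roles of the axes, as in the Euclidean algorithm.
        n, b = divmod(y_max, m)
        m, a = a, m


def count_leq(step, start, p, n, t):
    """Count j in [0, n) with (step*j + start) mod p <= t, for 0 <= t < p."""
    step %= p
    start %= p
    # For v = step*j + start, (v + p - t - 1)//p - v//p is 1 exactly when v mod p > t.
    above = floor_sum(n, p, step, start + p - t - 1) - floor_sum(n, p, step, start)
    return n - above
```

Each pass of the loop shrinks the pair (m, a) the way the Euclidean algorithm does, so the number of passes is logarithmic in p.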

16
Algorithm for M(R) and G(R)
  • Use binary search to determine the maximum i.
  • For each i, use Pavan and Tirthapura's algorithm to calculate
    G(R).
  • If G(R) > 0, then M(R) = i.

Theorem [Sun, Poon, 2006]
There exists an algorithm for calculating M(R)
and G(R), with time complexity O(log U log log U) and
space complexity O(log U).
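One way to realize the binary search is sketched below. It rests on assumptions that the slides do not spell out: h(x) = (a*x + b) mod p with p an odd prime, and the observation that a value y in [0, p-1] is a multiple of 2^i exactly when (2^-i * y) mod p lies in [0, (p-1) >> i], which turns "LSB(h(x)) >= i" into a range count over an arithmetic progression in Z_p. The counter is passed in as a parameter; the default is a plain linear scan so that the block runs on its own, and any logarithmic-time counter with the same signature can be substituted. LSB(0) is treated as the bit length of p - 1 by convention. The code needs Python 3.8 or later for `pow` with a negative exponent.

```python
def scan_count_leq(step, start, p, n, t):
    """Placeholder counter: j in [0, n) with (step*j + start) mod p <= t, by linear scan."""
    return sum(1 for j in range(n) if (step * j + start) % p <= t)


def max_lsb_and_count(a, b, p, lo, hi, count_leq=scan_count_leq):
    """Return (M, G) for R = [lo, hi] under h(x) = (a*x + b) mod p, p an odd prime."""
    n = hi - lo + 1
    cap = (p - 1).bit_length()  # 2**cap > p - 1, so only h(x) = 0 reaches level cap

    def at_least(i):
        # Number of x in R whose hash value is a multiple of 2**i.
        inv = pow(2, -i, p)                       # inverse of 2**i modulo p
        step = inv * a % p
        start = inv * ((a * lo + b) % p) % p
        return count_leq(step, start, p, n, (p - 1) >> i)

    # at_least(0) == n > 0 and at_least is non-increasing in i,
    # so binary search finds the largest level that is still non-empty.
    low, high = 0, cap
    while low < high:
        mid = (low + high + 1) // 2
        if at_least(mid) > 0:
            low = mid
        else:
            high = mid - 1
    m = low
    g = at_least(m) - (at_least(m + 1) if m < cap else 0)
    return m, g
```

The search makes O(log log p) calls to `at_least`, so with a logarithmic counter the total cost has the same shape as the bound in the theorem.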
17
Adaptive Sampling for F0
  • Given a stream of numbers, find the number of
    distinct elements in the stream
  • Random Sampling Algorithm
    • Random sample of the distinct elements seen so far
    • Sampling level i (sampling probability 1/2^i)
    • If the sample size exceeds a threshold, then sub-sample
      to a smaller probability
    • Sample size: O(1/ε^2) integers
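A minimal single-item sketch of this adaptive sampling scheme, assuming that an element x belongs to level i when LSB(h(x)) >= i for h(x) = (a*x + b) mod p, and that the estimate is the sample size scaled by 2^level. The function name, the fixed hash parameters, and the convention for LSB(0) are illustrative assumptions.

```python
def lsb(x, cap):
    """Trailing zero bits of x; by the convention assumed here, lsb(0) = cap."""
    if x == 0:
        return cap
    return (x & -x).bit_length() - 1


def estimate_distinct(stream, a, b, p, threshold):
    """Estimate the number of distinct elements using a sample of bounded size."""
    cap = (p - 1).bit_length()
    level = 0
    sample = set()
    for x in stream:
        if lsb((a * x + b) % p, cap) >= level:
            sample.add(x)
            # Sub-sample: raise the level until the sample fits again.
            while len(sample) > threshold:
                level += 1
                sample = {y for y in sample if lsb((a * y + b) % p, cap) >= level}
    # Each surviving element stands for roughly 2**level distinct elements.
    return len(sample) * 2 ** level
```

While the number of distinct elements stays at or below the threshold, the level remains 0 and the estimate is exact; beyond that, accuracy depends on the hash and on the threshold, which the slide sets at O(1/ε^2).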

18
Adaptive Sample
Sample of element
19
Algorithm Description
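[Figure: a stack of sampling levels (Level 0, Level 1, Level 2, Level 3, ..., Top Level) of height O(log U). Current level = 0. An interval R1 with LSB(h(R1)) = 0 is stored at Level 0 as the tuple (R1, M(R1), G(R1)). Annotation: "if # of intervals > the size".]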
20
Correctness Proof
Theorem
By the Chernoff bound, the failure probability can be reduced
to $\delta$ by running $O(\log(1/\delta))$ copies
of the algorithm in parallel.
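The amplification step is commonly implemented by taking the median of independent runs. The sketch below assumes that combiner, which the slide does not state, and uses a hypothetical constant `c` for the number of copies; `run_once` stands for any estimator that takes a seed and returns one estimate.

```python
import math
import statistics


def amplified_estimate(run_once, delta, c=12):
    """Median of k = O(log(1/delta)) independent estimates, for 0 < delta < 1.

    run_once(seed) must return one estimate. The constant c is an assumed
    placeholder, not a value taken from the analysis.
    """
    k = math.ceil(c * math.log(1 / delta)) | 1  # force k odd so the median is one of the runs
    return statistics.median(run_once(seed) for seed in range(k))
```

If each copy is within the error bound with probability well above 1/2, the median is outside the bound only when at least half of the copies fail, which is the event the Chernoff bound controls.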
21
Time/Space Complexity
  • Use a binary tree to maintain the sample
  • Amortized update time
  • Worst case update time
  • Use a binary tree to maintain the intervals
    mapping to the same level
  • Amortized update time
  • Worst case update time

22
Further Work
  • Design range-efficient F0 estimation algorithms
    under the turnstile model.
  • There exists an F0 estimation algorithm for the
    single-item case in the turnstile model, which employs
    p-stable distributions.
  • How to generate general range-summable p-stable
    distributions?

23
Thanks
http://www.cs.fudan.edu.cn/sun