Title: Two Improved Range-Efficient Algorithms for F0 Estimation
1Two Improved Range-Efficient Algorithms for F0 Estimation
TAMC 2007
2Outline
3Data Stream Model
- Massive data set
- Limited workspace
- Typically poly-logarithmic in the size of data
- Fast processing per item
- Constant or logarithmic in the size of the data set
- Provide approximate answers to aggregate queries
4Definition of FK
5Related Work
- F0 of a data stream
- Flajolet-Martin (JCSS 1985)
- Alon et al. (JCSS 1999)
- Cormode et al. (VLDB 2002)
- Bar-Yossef et al. (RANDOM 2002)
- Lower Bounds, Indyk-Woodruff (FOCS 2003)
- Lp difference of data streams
- P. Indyk (FOCS 2000)
6Range-Efficient F0 Problem
- Input: A series of intervals
- where
- Output: The size of the union of the intervals
Example
2 3 10 15
50 60 66 80
Answer
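For reference, the exact quantity being estimated can be computed offline by sorting and merging the intervals. The sketch below is only a baseline: the function name and the sample intervals are hypothetical, intervals are assumed to be closed integer ranges, and the approach stores every interval, which is exactly what the streaming setting rules out.

```python
def union_size(intervals):
    """Exact size of the union of closed integer intervals (lo, hi)."""
    total = 0
    cur_lo = cur_hi = None
    for lo, hi in sorted(intervals):
        if cur_hi is None or lo > cur_hi + 1:
            # Disjoint from the current run: close it and start a new one.
            if cur_hi is not None:
                total += cur_hi - cur_lo + 1
            cur_lo, cur_hi = lo, hi
        else:
            # Overlapping or adjacent: extend the current run.
            cur_hi = max(cur_hi, hi)
    if cur_hi is not None:
        total += cur_hi - cur_lo + 1
    return total


# Hypothetical input: [1,5] and [3,8] overlap, [20,20] is a single point.
print(union_size([(1, 5), (3, 8), (20, 20)]))
```

This takes O(n log n) time and O(n) space for n intervals, so it serves only as a correctness check for a streaming estimator.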
7Constraints
- One pass through the data
- Space complexity
- Fast processing time
- An (ε, δ)-approximation, i.e. the output Z' and the exact answer Z
satisfy Pr[|Z' - Z| ≤ εZ] ≥ 1 - δ
8Result Comparison
9Applications to Range-Efficient F0
10Preliminaries: Hash Function
- LSB(x): the number of consecutive 0s counted from the rightmost bit of x's binary representation.
- Define a hash function h(x) = (ax + b) mod p, where a and b are chosen uniformly at random from {0, ..., p-1}.
Example
LSB(1) = 0, LSB(2) = 1, LSB(8) = 3, LSB(10) = 1
Lemma: LSB(h(x)) is a pairwise independent
hash function.
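A minimal sketch of these two building blocks, assuming the hash has the usual linear form (a*x + b) mod p with p prime. The helper names `lsb` and `make_hash` are illustrative and not taken from the slides.

```python
import random


def lsb(x):
    """Number of trailing zero bits in the binary representation of x > 0."""
    if x <= 0:
        raise ValueError("lsb is defined here only for positive integers")
    # x & -x isolates the lowest set bit; its bit length minus 1 is its index.
    return (x & -x).bit_length() - 1


def make_hash(p, rng=random):
    """Return h(x) = (a*x + b) mod p with a, b drawn uniformly from {0..p-1}."""
    a = rng.randrange(p)
    b = rng.randrange(p)
    return lambda x: (a * x + b) % p


# The examples from the slide.
print(lsb(1), lsb(2), lsb(8), lsb(10))
```

Since h(x) can be 0, a full implementation needs a convention for LSB(0); this sketch simply rejects it.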
11Algorithm
12A key issue in our algorithm
- For the given function h and interval R, a key
issue in our algorithm design is how to calculate
the following quantities efficiently:
- 1
- 2
13Calculation of M(R) and G(R)
- M(R): the largest hash value LSB(h(x)) among the integers x in R.
- G(R): the number of integers in R that achieve that value.
- A naïve algorithm requires time O(|R|).
- Goal: calculate these two quantities in time O(poly(log U)).
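The naïve approach can be written down directly, which makes the O(|R|) cost visible. This is a sketch under stated assumptions: the hash is taken to be (a*x + b) mod p, and LSB(0) is capped at the bit length of p, a convention chosen here and not given on the slides.

```python
def lsb_capped(v, cap):
    """Trailing zeros of v, with LSB(0) defined as `cap` (an assumption)."""
    return cap if v == 0 else (v & -v).bit_length() - 1


def naive_m_g(lo, hi, a, b, p):
    """Scan every x in R = [lo, hi]; return (M(R), G(R))."""
    cap = p.bit_length()
    best, count = -1, 0
    for x in range(lo, hi + 1):
        level = lsb_capped((a * x + b) % p, cap)
        if level > best:
            best, count = level, 1      # new maximum level
        elif level == best:
            count += 1                  # another integer at the maximum
    return best, count


# With a=1, b=0 the hash is the identity on 1..10; only x=8 reaches level 3.
print(naive_m_g(1, 10, 1, 0, 101))
```

The loop touches each integer of R once, so an interval of length 2^40 is already out of reach, which is why a poly(log U) method is needed.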
14Equivalent Description of Calculating M(R), G(R)
- Given the sequence
- Find the maximum i, , such
that
Analysis
For the given i, we obtain
15Calculating M(R) and G(R) Cont.
- Define
- Construct an arithmetic progression S over the field Z_p
- Calculate the number of elements in S belonging to the interval [0, t].
- By Pavan and Tirthapura's algorithm, this quantity can be calculated with time complexity O(log U)
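The counting step can be illustrated without reproducing Pavan and Tirthapura's algorithm. The sketch below swaps in a different, standard technique: the Euclid-style "floor sum" recursion, combined with the identity [y mod p ≤ t] = floor((y + p)/p) - floor((y + p - t - 1)/p). The function names and the parameters u (first term), d (common difference) and n (number of terms) are assumptions made for illustration.

```python
def floor_sum(n, m, a, b):
    """Sum of floor((a*i + b) / m) for i = 0..n-1; needs n, a, b >= 0, m > 0."""
    total = 0
    while True:
        if a >= m:
            total += (n - 1) * n // 2 * (a // m)
            a %= m
        if b >= m:
            total += n * (b // m)
            b %= m
        y_max = a * n + b
        if y_max < m:
            return total                # every remaining term is 0
        # Swap the roles of slope and modulus, as in Euclid's algorithm.
        n, b = divmod(y_max, m)
        m, a = a, m


def count_in_prefix(n, u, d, p, t):
    """Count i in [0, n) with (u + i*d) mod p in [0, t].

    Assumes 0 <= u < p, 0 <= d < p and 0 <= t < p.
    """
    return floor_sum(n, p, d, u + p) - floor_sum(n, p, d, u + p - t - 1)


# Progression 3, 8, 13, ... taken mod 11, first 6 terms: 3, 8, 2, 7, 1, 6.
print(count_in_prefix(6, 3, 5, 11, 4))
```

Each pass of the loop shrinks the pair (m, a) the way a Euclidean step does, so the number of passes is logarithmic in p and does not depend on the progression length n.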
16Algorithm for M(R) and G(R)
- Use binary search to determine the maximum i,
- For each i, use Pavan's algorithm to calculate G(R).
- If G(R) > 0, then M(R) = i
Theorem (Sun, Poon, 2006)
There exists an algorithm for calculating M(R)
and G(R), with time complexity O(log U log log U) and
space complexity O(log U).
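The binary search over levels can be sketched as follows. The search is valid because the number of x in R with LSB(h(x)) ≥ i can only shrink as i grows. Assumptions in this sketch: the hash is (a*x + b) mod p, LSB(0) is capped at the bit length of p, and the counting oracle `count_ge` is a brute-force placeholder standing in for a fast progression-counting routine.

```python
def m_and_g(lo, hi, a, b, p):
    """Return (M(R), G(R)) for R = [lo, hi] via binary search on the level."""
    cap = p.bit_length()                # assumed level assigned to h(x) == 0

    def count_ge(i):
        # Placeholder oracle: number of x in R whose hash is divisible by 2^i,
        # i.e. whose level is at least i. Brute force, for illustration only.
        return sum(1 for x in range(lo, hi + 1)
                   if ((a * x + b) % p) % (1 << i) == 0)

    low, high = 0, cap                  # count_ge(0) = |R| > 0 always holds
    while low < high:
        mid = (low + high + 1) // 2
        if count_ge(mid) > 0:
            low = mid                   # some integer reaches level mid
        else:
            high = mid - 1
    # No integer exceeds level `low`, so count_ge(low) counts exactly level low.
    return low, count_ge(low)


print(m_and_g(1, 10, 1, 0, 101))
```

The search makes O(log log p) oracle calls over O(log p) candidate levels, which is consistent with the stated O(log U log log U) bound if each call costs O(log U).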
17Adaptive Sampling for F0
- Given a stream of numbers, find the number of distinct elements in the stream
- Random Sampling Algorithm
- Random sample of distinct elements seen so far
- Sampling level i (sampling probability 1/2^i)
- If sample size exceeds threshold, then sub-sample to a smaller probability
- Sample size O(1/ε^2) integers
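The sampling scheme for single items can be sketched as below. This is an illustrative version, not the authors' code: the class name, the `capacity` parameter (standing in for the O(1/ε^2) threshold), the choice p = 2^61 - 1 and the cap used for LSB(0) are all assumptions.

```python
import random


class AdaptiveSampler:
    """Keep the distinct items whose hash level is >= the current level."""

    def __init__(self, capacity, p=(1 << 61) - 1, seed=None):
        rng = random.Random(seed)
        self.capacity = capacity        # threshold on the sample size
        self.p = p                      # 2^61 - 1 is prime
        self.a = rng.randrange(p)
        self.b = rng.randrange(p)
        self.level = 0                  # sampling probability is about 1/2^level
        self.sample = set()

    def _level_of(self, x):
        v = (self.a * x + self.b) % self.p
        return self.p.bit_length() if v == 0 else (v & -v).bit_length() - 1

    def add(self, x):
        if self._level_of(x) >= self.level:
            self.sample.add(x)
            # Sub-sample until the sample fits under the threshold again.
            while len(self.sample) > self.capacity:
                self.level += 1
                self.sample = {y for y in self.sample
                               if self._level_of(y) >= self.level}

    def estimate(self):
        """Scale the sample size by the inverse sampling probability."""
        return len(self.sample) * (1 << self.level)


s = AdaptiveSampler(capacity=100, seed=1)
for item in [5, 7, 5, 9, 7]:
    s.add(item)
print(s.estimate())
```

While the number of distinct items stays below the threshold the level remains 0 and the estimate is exact; beyond that the estimate is randomized and its accuracy depends on the capacity.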
18Adaptive Sample
Sample of element
19Algorithm Description
[Figure: the sample organized by levels, from Level 0 up to the top level; height = O(log U); current level = 0. Labels: LSB(h(R1)) = 0; stored tuple (R1, M(R1), G(R1)); "if # of intervals > the size".]
20Correctness Proof
Theorem
By the Chernoff bound, the failure probability can be reduced
to δ by running O(log(1/δ)) copies of the algorithm
in parallel.
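One common way to combine parallel copies, assumed here since the slide does not name the combining rule, is to report the median of their outputs. The function name and the sample values below are hypothetical.

```python
import statistics


def median_estimate(estimates):
    """Combine the outputs of independent copies by taking their median."""
    return statistics.median(estimates)


# Five hypothetical outputs; one copy failed badly, the median ignores it.
print(median_estimate([98, 103, 5000, 101, 97]))
```

The median is wrong only if at least half of the copies fail at once, which is the event the Chernoff bound controls.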
21Time/Space Complexity
- Use a binary tree to maintain the sample
- Amortized update time
- Worst case update time
- Use a binary tree to maintain the intervals mapping to the same level
- Amortized update time
- Worst case update time
22Further Work
- Design range-efficient F0 estimation algorithms under the Turnstile Model.
- An F0 estimation algorithm exists for the single-item case in the Turnstile Model; it employs p-stable distributions.
- How to generate general range-summable p-stable distributions?
23Thanks
http://www.cs.fudan.edu.cn/sun