Title: Statistics Estimation over Data Streams
Slide 1: Statistics Estimation over Data Streams
Slides modified from Minos Garofalakis (Yahoo! Research) and S. Muthukrishnan (Rutgers University)
Slide 2: Outline
- Introduction
- Frequency moment estimation
- Element frequency estimation
Slide 3: Data Stream Processing Algorithms
- Generally, algorithms compute approximate answers
  - It is provably difficult to compute answers accurately with limited memory
- Approximate answers with deterministic bounds
  - Algorithms compute only an approximate answer, but with guaranteed bounds on the error
- Approximate answers with probabilistic bounds
  - Algorithms compute an approximate answer with high probability
  - With probability at least 1 − δ, the computed answer is within a factor (1 ± ε) of the actual answer
Slide 4: Sampling Basics
- Idea: a small random sample S of the data often represents all the data well
  - For a fast approximate answer, apply a modified query to S
  - Example: SELECT agg FROM R  (n = 12)
  - If agg is AVG, return the average of the elements in S
  - How many odd elements are there?

Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
Sample S: 9 5 1 8
Answer: 11.5
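The odd-element count above can be estimated from a sample by simple scaling. A minimal Python sketch of that idea (uniform sampling with replacement; the sample size of 6 is an arbitrary illustrative choice, not from the slides):

```python
import random

stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]  # n = 12, 8 odd elements

def estimate_odd_count(stream, sample_size, seed=0):
    """Estimate the number of odd elements by counting them in a uniform
    random sample (with replacement) and scaling up to the stream length."""
    rng = random.Random(seed)
    sample = [rng.choice(stream) for _ in range(sample_size)]
    odd_in_sample = sum(1 for x in sample if x % 2 == 1)
    return odd_in_sample / sample_size * len(stream)

true_count = sum(1 for x in stream if x % 2 == 1)
estimate = estimate_odd_count(stream, sample_size=6)
```

Because the sample is random, the estimate is itself a random variable; the following slides quantify how far it can stray from the true count.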
Slide 5: Probabilistic Guarantees
- Example: the actual answer is within 11.5 ± 1 with probability ≥ 0.9
- Randomized algorithms: the answer returned is a specially-built random variable
- Use tail inequalities to give probabilistic bounds on the returned answer
  - Markov inequality
  - Chebyshev's inequality
  - Chernoff/Hoeffding bound
Slide 6: Basic Tools: Tail Inequalities
- General bounds on the tail probability of a random variable (that is, the probability that the random variable deviates far from its expectation)
- Basic inequalities: let X be a random variable with expectation μ and variance Var[X]. Then, for any ε > 0:

  Markov:    Pr[X ≥ (1 + ε)μ] ≤ 1 / (1 + ε)
  Chebyshev: Pr[|X − μ| ≥ εμ] ≤ Var[X] / (ε²μ²)
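As a quick illustration of how these tail bounds behave, one can simulate a fair die and compare the empirical deviation frequency against Chebyshev's bound (an illustrative check, not part of the slides):

```python
import random

# Fair six-sided die: mu = 3.5, Var = 35/12.
mu, var, eps = 3.5, 35 / 12, 0.6
rng = random.Random(42)
trials = 100_000
# |X - mu| >= eps*mu = 2.1 happens exactly when X is 1 or 6, i.e. prob 1/3
hits = sum(1 for _ in range(trials) if abs(rng.randint(1, 6) - mu) >= eps * mu)
empirical = hits / trials
chebyshev_bound = var / (eps ** 2 * mu ** 2)  # valid but loose (~0.66 vs ~0.33)
```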
Slide 7: Tail Inequalities for Sums
- It is possible to derive even stronger bounds on tail probabilities for the sum of independent Bernoulli trials
- Chernoff bound: let X1, ..., Xm be independent Bernoulli trials such that Pr[Xi = 1] = p (Pr[Xi = 0] = 1 − p). Let X = Σᵢ Xᵢ and let μ = mp be the expectation of X. Then, for any 0 < ε ≤ 1:

  Pr[|X − μ| ≥ εμ] ≤ 2·exp(−με²/3)

- No need to compute Var[X], but the independence assumption is required!
- Application to count queries
  - m is the size of the sample S (4 in the example)
  - p is the fraction of odd elements in the stream (2/3 in the example)
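Inverting the Chernoff bound gives the sample size needed for the count-query application. A hedged sketch, using the two-sided form 2·exp(−mpε²/3) stated above:

```python
import math

def chernoff_sample_size(p, eps, delta):
    """Smallest m such that 2*exp(-m*p*eps**2/3) <= delta: a sample of m
    elements estimates the fraction p within a (1 +/- eps) factor with
    probability at least 1 - delta."""
    return math.ceil(3 * math.log(2 / delta) / (p * eps ** 2))

# The slide's setting: p = 2/3 of the stream is odd
m = chernoff_sample_size(p=2 / 3, eps=0.1, delta=0.05)
```

Note how the required sample size grows as 1/ε²: halving the error quadruples the sample.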
Slide 8: The Streaming Model
- Underlying signal: a one-dimensional array A[1..N] with values A[i] all initially zero
  - Multi-dimensional arrays as well (e.g., row-major)
- The signal is implicitly represented via a stream of updates
  - The j-th update is <k, c[j]>, implying A[k] = A[k] + c[j]  (c[j] can be > 0 or < 0)
- Goal: compute functions on A subject to
  - Small space
  - Fast processing of updates
  - Fast function computation
Slide 9: Streaming Model: Special Cases
- Time-series model
  - The j-th update updates only A[j] (i.e., A[j] = c[j])
- Cash-register model
  - c[j] is always > 0 (i.e., increment-only)
  - Typically c[j] = 1, so we see a multi-set of items in one pass
- Turnstile model
  - The most general streaming model
  - c[j] can be > 0 or < 0 (i.e., increment or decrement)
- Problem difficulty varies depending on the model
  - E.g., MIN/MAX in time-series vs. turnstile!
Slide 10: Frequency Moment Computation
- Problem
  - Data arrives online: a1, a2, a3, ..., am
  - Let f(i) = |{j : aj = i}|  (represented by A[i])
  - Fk = Σᵢ f(i)^k
- Example: frequencies f = (1, 2, 2, 1, 1)
  - F0 = 5 <distinct elements>, F1 = 7, F2 = 1·1 + 2·2 + 2·2 + 1·1 + 1·1 = 11  (the "surprise index")
  - What is F8?
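With unbounded memory the frequency moments are straightforward to compute exactly, which is useful for checking the streaming estimators later. For the frequency vector (1, 2, 2, 1, 1) of the example:

```python
from collections import Counter

def frequency_moment(stream, k):
    """F_k = sum of f(i)**k over distinct items i; F_0 is the distinct count."""
    return sum(f ** k for f in Counter(stream).values())

# Any stream with frequency vector (1, 2, 2, 1, 1), as in the example
stream = ['a', 'b', 'b', 'c', 'c', 'd', 'e']
```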
Slide 11: Frequency Moment Computation
- Easy for F1: just count the stream length
- How about the others?
  - Focus on F2 and F0 first
  - Then estimation of general Fk
Slide 12: Linear-Projection (AMS) Sketch Synopses
- Goal: build a small-space summary for the distribution vector f(i) (i = 1, ..., N) seen as a stream of i-values
- Basic construct: a randomized linear projection of f() = the inner/dot product of the f-vector with a vector ξ of random values from an appropriate distribution:
  <f, ξ> = Σᵢ f(i)·ξᵢ
- Simple to compute over the stream: add ξᵢ whenever the i-th value is seen
- Generate the ξᵢ in small O(log N) space using pseudo-random generators
- Tunable probabilistic guarantees on approximation error
- Delete-proof: just subtract ξᵢ to delete an occurrence of the i-th value
Slide 13: AMS Sketch (cont.)
- Key intuition: use randomized linear projections of f() to define a random variable X such that
  - X is easily computed over the stream (in small space)
  - E[X] = F2
  - Var[X] is small
- Basic idea
  - Define a family of 4-wise independent {−1, +1} random variables ξᵢ
    - Pr[ξᵢ = +1] = Pr[ξᵢ = −1] = 1/2
    - Expected value of each ξᵢ: E[ξᵢ] = 0 and E[ξᵢ²] = 1
    - The variables are 4-wise independent: the expected value of the product of any 4 distinct ξ's is 0, i.e., E[ξᵢξⱼξₖξₗ] = 0
    - The variables can be generated by a pseudo-random generator using only O(log N) space (for seeding)!
- Probabilistic error guarantees (e.g., the actual answer is 10 ± 1 with probability 0.9)
Slide 14: AMS Sketch (cont.)
- Z = Σᵢ f(i)·ξᵢ
[Figure: example frequency histogram over values 1..4]
- Suppose
  - ξ₁, ξ₂ → +1 and ξ₃, ξ₄ → −1; then Z = ?
  - ξ₄ → +1 and ξ₁, ξ₂, ξ₃ → −1; then Z = ?
Slide 15: AMS Sketch (cont.)
- Expected value of X = Z²:
  E[X] = E[Z²] = Σᵢ f(i)²·E[ξᵢ²] + Σᵢ≠ⱼ f(i)f(j)·E[ξᵢξⱼ] = F2 + 0 = F2
- Using 4-wise independence, it is possible to show that
  Var[X] ≤ 2·F2²
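A single AMS estimator for F2 can be sketched in a few lines of Python. Note the ±1 variables below come from a per-item seeded PRNG, which is only an illustrative stand-in for the 4-wise independent family the slide assumes:

```python
import random

class AMSF2Sketch:
    """One AMS estimator for F2: maintain Z = sum_i f(i)*xi_i over the
    stream and return X = Z**2. The +/-1 variables come from a per-item
    seeded PRNG, a stand-in for a true 4-wise independent family."""

    def __init__(self, seed):
        self.seed = seed
        self.z = 0

    def _xi(self, item):
        return random.Random(f"{self.seed}:{item}").choice((-1, 1))

    def update(self, item, count=1):
        # Delete-proof: a deletion is just a negative count
        self.z += count * self._xi(item)

    def estimate(self):
        return self.z ** 2

def averaged_f2(stream, copies=2000):
    """Average many independent copies of X (variance reduction only)."""
    total = 0
    for seed in range(copies):
        sk = AMSF2Sketch(seed)
        for item in stream:
            sk.update(item)
        total += sk.estimate()
    return total / copies

stream = ['a', 'b', 'b', 'c', 'c', 'd', 'e']   # frequencies (1,2,2,1,1), F2 = 11
est = averaged_f2(stream)
```

A single copy is unbiased but noisy; the next slides show how averaging and a median tighten the estimate.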
Slide 16: Boosting Accuracy
- Chebyshev's inequality: Pr[|Y − F2| ≥ εF2] ≤ Var[Y] / (ε²F2²)
- Boost accuracy to ε by averaging over s1 = O(Var[X]/(ε²·E[X]²)) = O(1/ε²) independent copies of X (averaging reduces variance: Var[Y] = Var[X]/s1)
- By Chebyshev: Pr[|Y − F2| ≥ εF2] ≤ 1/8
[Figure: s1 independent copies of X averaged into Y]
Slide 17: Boosting Confidence
- Boost confidence to 1 − δ by taking the median of 2·log(1/δ) independent copies of Y
- Each Y is a Bernoulli trial: FAILURE means |Y − F2| ≥ εF2, which happens with probability ≤ 1/8
- Pr[|median(Y) − F2| ≥ εF2] ≤ δ  (by the Chernoff bound: more than half the copies must fail)
[Figure: 2·log(1/δ) copies of Y, median taken]
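The two boosting steps, averaging for accuracy and a median for confidence, combine into the standard median-of-means trick. A generic illustrative helper (`estimates` stands for independent copies of X):

```python
import statistics

def median_of_means(estimates, num_groups):
    """Average within groups to shrink variance (Chebyshev), then take the
    median across groups so a few bad groups cannot spoil the answer
    (Chernoff): with O(log(1/delta)) groups the failure probability is delta."""
    group_size = len(estimates) // num_groups
    means = [
        sum(estimates[g * group_size:(g + 1) * group_size]) / group_size
        for g in range(num_groups)
    ]
    return statistics.median(means)

# One wild outlier among ten copies of an estimator whose true value is 10
estimates = [10] * 9 + [1000]
robust = median_of_means(estimates, num_groups=5)
```

A plain mean of these estimates would give 109; the median of group means ignores the single bad group.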
Slide 18: Summary of AMS Sketching for F2
- Step 1: compute the random variable Z = Σᵢ f(i)·ξᵢ
- Step 2: define X = Z²
- Steps 3 & 4: average independent copies of X; return the median of the averages
- Main theorem: sketching approximates F2 to within a relative error of ε with probability ≥ 1 − δ using space O((1/ε²)·log(1/δ)·log N)
  - Remember: O(log N) space for seeding the construction of each X
[Figure: O(1/ε²) copies of X averaged into each Y; the median over O(log(1/δ)) copies of Y is returned]
Slide 19: Binary-Join COUNT Query
- Problem: compute the answer for the query COUNT(R ⋈_A S) = Σᵢ fR(i)·fS(i)
- Example:
  Data stream R.A: 4 1 2 4 1 4  →  fR(1) = 2, fR(2) = 1, fR(3) = 0, fR(4) = 3
  Data stream S.A: 3 1 2 4 2 4  →  fS(1) = 1, fS(2) = 2, fS(3) = 1, fS(4) = 2
  COUNT = 2·1 + 1·2 + 0·1 + 3·2 = 10  (terms: 2 + 2 + 0 + 6)
- An exact solution is too expensive: it requires O(N) space!
  - N = sizeof(domain(A))
Slide 20: Basic AMS Sketching Technique [AMS96]
- Key intuition: use randomized linear projections of f() to define a random variable X such that
  - X is easily computed over the stream (in small space)
  - E[X] = COUNT(R ⋈_A S)
  - Var[X] is small
- Basic idea
  - Define a family of 4-wise independent {−1, +1} random variables ξᵢ
Slide 21: AMS Sketch Construction
- Compute random variables XR = Σᵢ fR(i)·ξᵢ and XS = Σᵢ fS(i)·ξᵢ
  - Simply add ξᵢ to XR (XS) whenever the i-th value is observed in the R.A (S.A) stream
- Define X = XR·XS to be the estimate of the COUNT query
- Example:
  Data stream R.A: 4 1 2 4 1 4  →  XR = 2ξ₁ + ξ₂ + 3ξ₄
  Data stream S.A: 3 1 2 4 2 4  →  XS = ξ₁ + 2ξ₂ + ξ₃ + 2ξ₄
Slide 22: Binary-Join AMS Sketching Analysis
- Expected value of X = COUNT(R ⋈_A S):
  E[X] = E[XR·XS] = Σᵢ fR(i)fS(i)·E[ξᵢ²] + Σᵢ≠ⱼ fR(i)fS(j)·E[ξᵢξⱼ] = Σᵢ fR(i)·fS(i) + 0
- Using 4-wise independence, it is possible to show that
  Var[X] ≤ 2·SJ(R)·SJ(S)
  where SJ(R) = Σᵢ fR(i)² is the self-join size of R (its second/L2 moment)
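The construction can be sketched end-to-end in Python, using the example streams from slide 21. The sign variables are seeded-PRNG stand-ins for a 4-wise independent family, and only the averaging step is applied here (no median):

```python
import random

def xi(seed, item):
    """+/-1 sign for (seed, item); a seeded-PRNG stand-in for 4-wise independence."""
    return random.Random(f"{seed}:{item}").choice((-1, 1))

def join_count_estimate(stream_r, stream_s, copies=600):
    """Average independent copies of X = XR * XS, where XR and XS are the
    sketches of the two streams built with the same sign family."""
    total = 0
    for seed in range(copies):
        xr = sum(xi(seed, v) for v in stream_r)
        xs = sum(xi(seed, v) for v in stream_s)
        total += xr * xs
    return total / copies

r = [4, 1, 2, 4, 1, 4]   # fR = (2, 1, 0, 3)
s = [3, 1, 2, 4, 2, 4]   # fS = (1, 2, 1, 2)
est = join_count_estimate(r, s)   # true COUNT = 2*1 + 1*2 + 0*1 + 3*2 = 10
```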
Slide 23: Boosting Accuracy
- Chebyshev's inequality: Pr[|Y − COUNT| ≥ ε·COUNT] ≤ Var[Y] / (ε²·COUNT²)
- Boost accuracy to ε by averaging over s1 = O(Var[X]/(ε²·COUNT²)) independent copies of X (averaging reduces variance: Var[Y] = Var[X]/s1)
- By Chebyshev: Pr[|Y − COUNT| ≥ ε·COUNT] ≤ 1/8
[Figure: s1 independent copies of X averaged into Y]
Slide 24: Boosting Confidence
- Boost confidence to 1 − δ by taking the median of 2·log(1/δ) independent copies of Y
- Each Y is a Bernoulli trial: FAILURE means |Y − COUNT| ≥ ε·COUNT, which happens with probability ≤ 1/8
- Pr[|median(Y) − COUNT| ≥ ε·COUNT] ≤ δ  (by the Chernoff bound)
[Figure: 2·log(1/δ) copies of Y, median taken]
Slide 25: Summary of Binary-Join AMS Sketching
- Step 1: compute the random variables XR and XS
- Step 2: define X = XR·XS
- Steps 3 & 4: average independent copies of X; return the median of the averages
- Main theorem [AGMS99]: sketching approximates COUNT to within a relative error of ε with probability ≥ 1 − δ using space O((SJ(R)·SJ(S)/(ε²·COUNT²))·log(1/δ)·log N)
  - Remember: O(log N) space for seeding the construction of each X
[Figure: copies of X grouped into averages Y; the median of the Y's is returned]
Slide 26: Distinct Value Estimation (F0)
- Problem: find the number of distinct values in a stream of values with domain {0, ..., N−1}
  - The zeroth frequency moment F0
- Statistics: number of species or classes in a population
- Important for query optimizers
- Network monitoring: distinct destination IP addresses, source/destination pairs, requested URLs, etc.
- Example (N = 64): number of distinct values = 5
- A hard problem for random sampling!
  - Must sample almost the entire table to guarantee that the estimate is within a factor of 10 with probability > 1/2, regardless of the estimator used!
Slide 27: Hash (aka FM) Sketches for Distinct Value Estimation [FM85]
- Assume a hash function h(x) that maps incoming values x in [0, ..., N−1] uniformly across [0, ..., 2^L − 1], where L = O(log N)
- Let lsb(y) denote the position of the least-significant 1 bit in the binary representation of y
  - A value x is mapped to lsb(h(x))
- Maintain a hash sketch: a BITMAP array of L bits, initialized to 0
  - For each incoming value x, set BITMAP[lsb(h(x))] = 1
- Prob[lsb(h(x)) = i] = 1/2^(i+1)
[Figure: example insertion for x = 5]
Slide 28: Hash (FM) Sketches for Distinct Value Estimation [FM85]
- By uniformity through h(x): Prob[BITMAP[k] = 1] = 1/2^(k+1)
- Assuming d distinct values: expect d/2 to map to BITMAP[0], d/4 to map to BITMAP[1], ...
- Let R = position of the rightmost zero in BITMAP (positions 0 through L−1)
  - Use R as an indicator of log(d)
- [FM85] prove that E[R] = log(φd), where φ ≈ 0.7735
  - Estimate d = 2^R / φ
- Average several iid instances (different hash functions) to reduce estimator variance
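A compact Python sketch of the FM estimator described above, averaging R over several iid instances as the last bullet suggests (a salted MD5 plays the role of the ideal hash function, an illustrative assumption):

```python
import hashlib
import statistics

PHI = 0.77351  # the FM85 correction constant

def lsb(y):
    """Position of the least-significant set bit of y > 0."""
    return (y & -y).bit_length() - 1

def fm_distinct_estimate(stream, num_sketches=64, L=32):
    """Average R over several iid FM sketches and return 2**avg(R) / PHI.
    A salted MD5 stands in for the ideal hash function h(x)."""
    positions = []
    for salt in range(num_sketches):
        bitmap = [0] * L
        for x in stream:
            h = int(hashlib.md5(f"{salt}:{x}".encode()).hexdigest(), 16) % (2 ** L)
            if h:                             # h == 0 has no set bit; skip it
                bitmap[lsb(h)] = 1
        positions.append(bitmap.index(0))     # R = position of the rightmost zero
    return 2 ** statistics.mean(positions) / PHI

est = fm_distinct_estimate(list(range(1000)) * 2)   # 1000 distinct values
```

Note that repeated values set the same bit again, so duplicates have no effect: exactly the property that makes the sketch count distinct values.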
Slide 29: Accuracy of FM
[Figure: m independent bitmap sketches, BITMAP 1 through BITMAP m]
- (1 ± ε) approximation with probability at least 1 − δ
Slide 30: Hash (FM) Sketches for Distinct Value Estimation
- [FM85] assume ideal hash functions h(x) (N-wise independence)
- In practice: h(x) = a·x + b, where a, b are random binary vectors in [0, ..., 2^L − 1]
- Composable: component-wise OR/add distributed sketches together
  - Estimate |S1 ∪ S2 ∪ ... ∪ Sk| = set-union cardinality
Slide 31: Cash Register Sketch (AMS)
- A more general algorithm for Fk
- Stream sampling: choose a random position p from 1..m and let
  r = |{q : q ≥ p and aq = ap}|  (the count of ap from position p onward)
- Estimator: X = m·(r^k − (r−1)^k)
- Using F2 (k = 2) as an example (stream of m = 7 elements):
  - If we choose the first element a1: r = 2 and X = 7·(2² − 1²) = 21
  - And for a2: r = ?, X = ?  For a5: r = ?, X = ?
Slide 32: Cash Register Sketch (AMS)
- Y = average of A copies of X, and Z = median of B copies of Y's
[Figure: A copies of X averaged into each Y; median taken over B copies of Y]
- Claim: this is a (1 ± ε) approximation to F2 with probability at least 1 − δ, and the space used is O(A·B) words of size O(log n + log m)
Slide 33: Analysis: Cash Register Sketch
- E[X] = F2
- V[X] = E[X²] − (E[X])²
- Using (a² − b²) ≤ 2(a − b)·a, we have V[X] ≤ 2·F1·F3
- Also, F1·F3 ≤ √n·F2², so V[X] ≤ 2·√n·F2²
- Hence, E[Yᵢ] = E[Xᵢ] = F2 and V[Yᵢ] = V[Xᵢ]/A
Slide 34: Analysis (cont.)
- Applying Chebyshev's inequality: Pr[|Yᵢ − F2| ≥ εF2] ≤ V[Yᵢ]/(ε²F2²) ≤ 1/8 once A = O(√n/ε²)
- Hence, by Chernoff bounds, the probability that more than B/2 of the Yᵢ's deviate that far is at most δ if we take B = O(log(1/δ)) copies of the Yᵢ's. Hence, the median gives the correct approximation.
Slide 35: Computation of Fk
- E[X] = Fk
- With A = O(k·n^(1−1/k)/ε²) and B = O(log(1/δ)):
  - We get an approximation within relative error ε with probability at least 1 − δ
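The sampling estimator of slide 31 together with the average-then-median boosting can be sketched as follows (A and B below are small illustrative defaults, not the O(k·n^(1−1/k)/ε²) and O(log(1/δ)) values from the analysis):

```python
import random
import statistics

def single_fk_estimate(stream, k, rng):
    """Pick a random position p, let r = occurrences of a_p from p onward,
    and return X = m * (r**k - (r-1)**k); E[X] = F_k."""
    m = len(stream)
    p = rng.randrange(m)
    r = stream[p:].count(stream[p])
    return m * (r ** k - (r - 1) ** k)

def fk_estimate(stream, k, A=200, B=9, seed=0):
    """Median of B averages of A independent copies of X."""
    rng = random.Random(seed)
    ys = [
        sum(single_fk_estimate(stream, k, rng) for _ in range(A)) / A
        for _ in range(B)
    ]
    return statistics.median(ys)

stream = ['a', 'b', 'b', 'c', 'c', 'd', 'e']   # F1 = 7, F2 = 11
```

For k = 1 the estimator is exact: X = m·(r − (r − 1)) = m for every sampled position.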
Slide 36: Estimating Element Frequency
- Ask for f(1) = ?  f(4) = ?
  - AMS-based algorithm
  - Count-Min sketch
Slide 37: AMS (Sketch) Based Algorithm
- Key intuition: use randomized linear projections of f() to define a random variable Z such that, for a given element i, E[Z·ξᵢ] = f(i)
  - Similarly, E[Z·ξⱼ] = f(j)
- Basic idea
  - Define a family of 4-wise independent {−1, +1} random variables ξᵢ (same as before)
    - Pr[ξᵢ = +1] = Pr[ξᵢ = −1] = 1/2
  - Let Z = Σⱼ f(j)·ξⱼ
  - So E[Z·ξᵢ] = f(i)·E[ξᵢ²] + Σⱼ≠ᵢ f(j)·E[ξⱼξᵢ] = f(i)·1 + 0 = f(i)
Slide 38: AMS (cont.)
- Keep an array of w × d counters Z[i][j]
- Use d hash functions to map an element a to 1..w
[Figure: element a routed by h1(a), ..., hd(a) to one counter Z[i, hi(a)] per row]
- On an update to a, add the sign variable ξ(i, a) to Z[i, hi(a)] in each row i
- Est(f(a)) = median over i of ξ(i, a)·Z[i, hi(a)]
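The counter-array scheme above can be sketched as a small Python class. The hash and sign functions are seeded-PRNG stand-ins for the independent families it assumes:

```python
import random
import statistics

class AMSPointSketch:
    """w x d array of counters Z; row i adds the sign xi(i, a) at column
    h_i(a); Est(f(a)) = median over rows of sign * counter. Hashes and
    signs are seeded-PRNG stand-ins for the assumed independent families."""

    def __init__(self, w, d, seed=0):
        self.w, self.d, self.seed = w, d, seed
        self.z = [[0] * w for _ in range(d)]

    def _h(self, i, a):
        return random.Random(f"h{self.seed}:{i}:{a}").randrange(self.w)

    def _xi(self, i, a):
        return random.Random(f"x{self.seed}:{i}:{a}").choice((-1, 1))

    def update(self, a, count=1):
        for i in range(self.d):
            self.z[i][self._h(i, a)] += count * self._xi(i, a)

    def estimate(self, a):
        return statistics.median(
            self._xi(i, a) * self.z[i][self._h(i, a)] for i in range(self.d)
        )

sk = AMSPointSketch(w=32, d=5)
for item in ['x'] * 50 + [str(i) for i in range(10)]:
    sk.update(item)
est = sk.estimate('x')   # true f('x') = 50; each colliding item adds +/-1
```

The random signs make collision noise cancel in expectation, so the estimate is unbiased; the median over rows controls the variance.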
Slide 39: The Count-Min (CM) Sketch
- A simple sketch idea that can be used for point queries (f(i)), range queries, quantiles, and join size estimation
- Creates a small summary as an array of w × d counters C
- Uses d hash functions to map elements to 1..w
[Figure: w × d counter array]
Slide 40: CM Sketch Structure
[Figure: element xi routed by h1(xi), ..., hd(xi) to one counter per row]
- Each element xi is mapped to one counter per row
  - C[k, hk(xi)] = C[k, hk(xi)] + 1  (−1 if deletion)
  - or + cj if the incoming update is <j, cj>
- Estimate A[j] by taking mink C[k, hk(j)]
Slide 41: CM Sketch Summary
- The CM sketch guarantees an approximation error on point queries of less than ε·||A||₁ using space O((1/ε)·log(1/δ))
- The probability of a larger error is less than δ
- Hints
  - The counts are biased upward (collisions only add mass)! Can you limit the expected amount of extra mass at each bucket? (Use Markov)
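For contrast with the AMS-based array, a minimal Count-Min sketch (hash functions again simulated with seeded PRNGs; real implementations use pairwise-independent hashes):

```python
import random

class CountMinSketch:
    """d rows of w counters; each update adds its count to one counter per
    row; a point query returns the row-minimum, an overestimate whose
    expected excess per row is F1/w (Markov then bounds the error). Hashes
    are seeded-PRNG stand-ins for pairwise-independent functions."""

    def __init__(self, w, d, seed=0):
        self.w, self.d, self.seed = w, d, seed
        self.c = [[0] * w for _ in range(d)]

    def _h(self, k, x):
        return random.Random(f"{self.seed}:{k}:{x}").randrange(self.w)

    def update(self, x, count=1):
        for k in range(self.d):
            self.c[k][self._h(k, x)] += count

    def query(self, x):
        return min(self.c[k][self._h(k, x)] for k in range(self.d))

cm = CountMinSketch(w=64, d=4)
for item in ['x'] * 30 + [str(i) for i in range(20)]:
    cm.update(item)
q = cm.query('x')   # never below the true count 30; at most 30 + 20
```

Unlike the AMS array, all counter contributions are nonnegative, so the query never underestimates: that is exactly the one-sided bias the slide's hint asks about.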