Title: Statistics Estimation over Data Streams
Slide 1: Statistics Estimation over Data Streams
Slides modified from Minos Garofalakis (Yahoo! Research) and S. Muthukrishnan (Rutgers University)
Slide 2: Outline
- Introduction
- Frequency moment estimation
- Element frequency estimation
Slide 3: Data Stream Processing Algorithms
- Generally, algorithms compute approximate answers
  - It is provably difficult to compute answers accurately with limited memory
- Approximate answers with deterministic bounds
  - Algorithms compute only an approximate answer, but with guaranteed bounds on the error
- Approximate answers with probabilistic bounds
  - Algorithms compute an approximate answer with high probability
  - With probability at least 1 − δ, the computed answer is within a factor (1 ± ε) of the actual answer
Slide 4: Sampling Basics
- Idea: a small random sample S of the data often represents all the data well
  - For a fast approximate answer, apply a modified query to S
  - Example: SELECT agg FROM R  (n = 12)
  - If agg is AVG, return the average of the elements in S
  - How many odd elements are there?

Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
Sample S: 9 5 1 8
Answer: 11.5
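The odd-element count above can be estimated from a sample by simple scaling. A minimal Python sketch of that idea (uniform sampling with replacement; the sample size of 6 is an arbitrary illustrative choice, not from the slides):

```python
import random

stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]  # n = 12, 8 odd elements

def estimate_odd_count(stream, sample_size, seed=0):
    """Estimate the number of odd elements by counting them in a uniform
    random sample (with replacement) and scaling up to the stream length."""
    rng = random.Random(seed)
    sample = [rng.choice(stream) for _ in range(sample_size)]
    odd_in_sample = sum(1 for x in sample if x % 2 == 1)
    return odd_in_sample / sample_size * len(stream)

true_count = sum(1 for x in stream if x % 2 == 1)
estimate = estimate_odd_count(stream, sample_size=6)
```

Because the sample is random, the estimate is itself a random variable; the following slides quantify how far it can stray from the true count.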
Slide 5: Probabilistic Guarantees
- Example: the actual answer is within 11.5 ± 1 with probability ≥ 0.9
- Randomized algorithms: the answer returned is a specially-built random variable
- Use tail inequalities to give probabilistic bounds on the returned answer
  - Markov inequality
  - Chebyshev's inequality
  - Chernoff/Hoeffding bound
Slide 6: Basic Tools: Tail Inequalities
- General bounds on the tail probability of a random variable (that is, the probability that the random variable deviates far from its expectation)
- Basic inequalities: let X be a random variable with expectation μ and variance Var[X]. Then, for any ε > 0:

  Markov:    Pr[X ≥ (1 + ε)μ] ≤ 1 / (1 + ε)
  Chebyshev: Pr[|X − μ| ≥ εμ] ≤ Var[X] / (ε²μ²)
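As a quick illustration of how these tail bounds behave, one can simulate a fair die and compare the empirical deviation frequency against Chebyshev's bound (an illustrative check, not part of the slides):

```python
import random

# Fair six-sided die: mu = 3.5, Var = 35/12.
mu, var, eps = 3.5, 35 / 12, 0.6
rng = random.Random(42)
trials = 100_000
# |X - mu| >= eps*mu = 2.1 happens exactly when X is 1 or 6, i.e. prob 1/3
hits = sum(1 for _ in range(trials) if abs(rng.randint(1, 6) - mu) >= eps * mu)
empirical = hits / trials
chebyshev_bound = var / (eps ** 2 * mu ** 2)  # valid but loose (~0.66 vs ~0.33)
```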
Slide 7: Tail Inequalities for Sums
- It is possible to derive even stronger bounds on tail probabilities for the sum of independent Bernoulli trials
- Chernoff bound: let X1, ..., Xm be independent Bernoulli trials such that Pr[Xi = 1] = p (Pr[Xi = 0] = 1 − p). Let X = Σᵢ Xᵢ and let μ = mp be the expectation of X. Then, for any 0 < ε ≤ 1:

  Pr[|X − μ| ≥ εμ] ≤ 2·exp(−με²/3)

- No need to compute Var[X], but the independence assumption is required!
- Application to count queries
  - m is the size of the sample S (4 in the example)
  - p is the fraction of odd elements in the stream (2/3 in the example)
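Inverting the Chernoff bound gives the sample size needed for the count-query application. A hedged sketch, using the two-sided form 2·exp(−mpε²/3) stated above:

```python
import math

def chernoff_sample_size(p, eps, delta):
    """Smallest m such that 2*exp(-m*p*eps**2/3) <= delta: a sample of m
    elements estimates the fraction p within a (1 +/- eps) factor with
    probability at least 1 - delta."""
    return math.ceil(3 * math.log(2 / delta) / (p * eps ** 2))

# The slide's setting: p = 2/3 of the stream is odd
m = chernoff_sample_size(p=2 / 3, eps=0.1, delta=0.05)
```

Note how the required sample size grows as 1/ε²: halving the error quadruples the sample.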
Slide 8: The Streaming Model
- Underlying signal: a one-dimensional array A[1..N] with values A[i] all initially zero
  - Multi-dimensional arrays as well (e.g., row-major)
- The signal is implicitly represented via a stream of updates
  - The j-th update is <k, c[j]>, implying A[k] = A[k] + c[j]  (c[j] can be > 0 or < 0)
- Goal: compute functions on A subject to
  - Small space
  - Fast processing of updates
  - Fast function computation
Slide 9: Streaming Model: Special Cases
- Time-series model
  - The j-th update updates only A[j] (i.e., A[j] = c[j])
- Cash-register model
  - c[j] is always > 0 (i.e., increment-only)
  - Typically c[j] = 1, so we see a multi-set of items in one pass
- Turnstile model
  - The most general streaming model
  - c[j] can be > 0 or < 0 (i.e., increment or decrement)
- Problem difficulty varies depending on the model
  - E.g., MIN/MAX in time-series vs. turnstile!
Slide 10: Frequency Moment Computation
- Problem
  - Data arrives online: a1, a2, a3, ..., am
  - Let f(i) = |{j : aj = i}|  (represented by A[i])
  - Fk = Σᵢ f(i)^k
- Example: frequencies f = (1, 2, 2, 1, 1)
  - F0 = 5 <distinct elements>, F1 = 7, F2 = 1·1 + 2·2 + 2·2 + 1·1 + 1·1 = 11  (the "surprise index")
  - What is F8?
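With unbounded memory the frequency moments are straightforward to compute exactly, which is useful for checking the streaming estimators later. For the frequency vector (1, 2, 2, 1, 1) of the example:

```python
from collections import Counter

def frequency_moment(stream, k):
    """F_k = sum of f(i)**k over distinct items i; F_0 is the distinct count."""
    return sum(f ** k for f in Counter(stream).values())

# Any stream with frequency vector (1, 2, 2, 1, 1), as in the example
stream = ['a', 'b', 'b', 'c', 'c', 'd', 'e']
```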
Slide 11: Frequency Moment Computation
- Easy for F1: just count the stream length
- How about the others?
  - Focus on F2 and F0 first
  - Then estimation of general Fk
Slide 12: Linear-Projection (AMS) Sketch Synopses
- Goal: build a small-space summary for the distribution vector f(i) (i = 1, ..., N) seen as a stream of i-values
- Basic construct: a randomized linear projection of f() = the inner/dot product of the f-vector with a vector ξ of random values from an appropriate distribution:
  <f, ξ> = Σᵢ f(i)·ξᵢ
- Simple to compute over the stream: add ξᵢ whenever the i-th value is seen
- Generate the ξᵢ in small O(log N) space using pseudo-random generators
- Tunable probabilistic guarantees on approximation error
- Delete-proof: just subtract ξᵢ to delete an occurrence of the i-th value
Slide 13: AMS Sketch (cont.)
- Key intuition: use randomized linear projections of f() to define a random variable X such that
  - X is easily computed over the stream (in small space)
  - E[X] = F2
  - Var[X] is small
- Basic idea
  - Define a family of 4-wise independent {−1, +1} random variables ξᵢ
    - Pr[ξᵢ = +1] = Pr[ξᵢ = −1] = 1/2
    - Expected value of each ξᵢ: E[ξᵢ] = 0 and E[ξᵢ²] = 1
    - The variables are 4-wise independent: the expected value of the product of any 4 distinct ξ's is 0, i.e., E[ξᵢξⱼξₖξₗ] = 0
    - The variables can be generated by a pseudo-random generator using only O(log N) space (for seeding)!
- Probabilistic error guarantees (e.g., the actual answer is 10 ± 1 with probability 0.9)
Slide 14: AMS Sketch (cont.)
- Z = Σᵢ f(i)·ξᵢ
[Figure: example frequency histogram over values 1..4]
- Suppose
  - ξ₁, ξ₂ → +1 and ξ₃, ξ₄ → −1; then Z = ?
  - ξ₄ → +1 and ξ₁, ξ₂, ξ₃ → −1; then Z = ?
Slide 15: AMS Sketch (cont.)
- Expected value of X = Z²:
  E[X] = E[Z²] = Σᵢ f(i)²·E[ξᵢ²] + Σᵢ≠ⱼ f(i)f(j)·E[ξᵢξⱼ] = F2 + 0 = F2
- Using 4-wise independence, it is possible to show that
  Var[X] ≤ 2·F2²
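A single AMS estimator for F2 can be sketched in a few lines of Python. Note the ±1 variables below come from a per-item seeded PRNG, which is only an illustrative stand-in for the 4-wise independent family the slide assumes:

```python
import random

class AMSF2Sketch:
    """One AMS estimator for F2: maintain Z = sum_i f(i)*xi_i over the
    stream and return X = Z**2. The +/-1 variables come from a per-item
    seeded PRNG, a stand-in for a true 4-wise independent family."""

    def __init__(self, seed):
        self.seed = seed
        self.z = 0

    def _xi(self, item):
        return random.Random(f"{self.seed}:{item}").choice((-1, 1))

    def update(self, item, count=1):
        # Delete-proof: a deletion is just a negative count
        self.z += count * self._xi(item)

    def estimate(self):
        return self.z ** 2

def averaged_f2(stream, copies=2000):
    """Average many independent copies of X (variance reduction only)."""
    total = 0
    for seed in range(copies):
        sk = AMSF2Sketch(seed)
        for item in stream:
            sk.update(item)
        total += sk.estimate()
    return total / copies

stream = ['a', 'b', 'b', 'c', 'c', 'd', 'e']   # frequencies (1,2,2,1,1), F2 = 11
est = averaged_f2(stream)
```

A single copy is unbiased but noisy; the next slides show how averaging and a median tighten the estimate.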
Slide 16: Boosting Accuracy
- Chebyshev's inequality: Pr[|Y − F2| ≥ εF2] ≤ Var[Y] / (ε²F2²)
- Boost accuracy to ε by averaging over s1 = O(Var[X]/(ε²·E[X]²)) = O(1/ε²) independent copies of X (averaging reduces variance: Var[Y] = Var[X]/s1)
- By Chebyshev: Pr[|Y − F2| ≥ εF2] ≤ 1/8
[Figure: s1 independent copies of X averaged into Y]
Slide 17: Boosting Confidence
- Boost confidence to 1 − δ by taking the median of 2·log(1/δ) independent copies of Y
- Each Y is a Bernoulli trial: FAILURE means |Y − F2| ≥ εF2, which happens with probability ≤ 1/8
- Pr[|median(Y) − F2| ≥ εF2] ≤ δ  (by the Chernoff bound: more than half the copies must fail)
[Figure: 2·log(1/δ) copies of Y, median taken]
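The two boosting steps, averaging for accuracy and a median for confidence, combine into the standard median-of-means trick. A generic illustrative helper (`estimates` stands for independent copies of X):

```python
import statistics

def median_of_means(estimates, num_groups):
    """Average within groups to shrink variance (Chebyshev), then take the
    median across groups so a few bad groups cannot spoil the answer
    (Chernoff): with O(log(1/delta)) groups the failure probability is delta."""
    group_size = len(estimates) // num_groups
    means = [
        sum(estimates[g * group_size:(g + 1) * group_size]) / group_size
        for g in range(num_groups)
    ]
    return statistics.median(means)

# One wild outlier among ten copies of an estimator whose true value is 10
estimates = [10] * 9 + [1000]
robust = median_of_means(estimates, num_groups=5)
```

A plain mean of these estimates would give 109; the median of group means ignores the single bad group.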
Slide 18: Summary of AMS Sketching for F2
- Step 1: compute the random variable Z = Σᵢ f(i)·ξᵢ
- Step 2: define X = Z²
- Steps 3 & 4: average independent copies of X; return the median of the averages
- Main theorem: sketching approximates F2 to within a relative error of ε with probability ≥ 1 − δ using space O((1/ε²)·log(1/δ)·log N)
  - Remember: O(log N) space for seeding the construction of each X
[Figure: O(1/ε²) copies of X averaged into each Y; the median over O(log(1/δ)) copies of Y is returned]
Slide 19: Binary-Join COUNT Query
- Problem: compute the answer for the query COUNT(R ⋈_A S) = Σᵢ fR(i)·fS(i)
- Example:
  Data stream R.A: 4 1 2 4 1 4  →  fR(1) = 2, fR(2) = 1, fR(3) = 0, fR(4) = 3
  Data stream S.A: 3 1 2 4 2 4  →  fS(1) = 1, fS(2) = 2, fS(3) = 1, fS(4) = 2
  COUNT = 2·1 + 1·2 + 0·1 + 3·2 = 10  (terms: 2 + 2 + 0 + 6)
- An exact solution is too expensive: it requires O(N) space!
  - N = sizeof(domain(A))
Slide 20: Basic AMS Sketching Technique [AMS96]
- Key intuition: use randomized linear projections of f() to define a random variable X such that
  - X is easily computed over the stream (in small space)
  - E[X] = COUNT(R ⋈_A S)
  - Var[X] is small
- Basic idea
  - Define a family of 4-wise independent {−1, +1} random variables ξᵢ
Slide 21: AMS Sketch Construction
- Compute random variables XR = Σᵢ fR(i)·ξᵢ and XS = Σᵢ fS(i)·ξᵢ
  - Simply add ξᵢ to XR (XS) whenever the i-th value is observed in the R.A (S.A) stream
- Define X = XR·XS to be the estimate of the COUNT query
- Example:
  Data stream R.A: 4 1 2 4 1 4  →  XR = 2ξ₁ + ξ₂ + 3ξ₄
  Data stream S.A: 3 1 2 4 2 4  →  XS = ξ₁ + 2ξ₂ + ξ₃ + 2ξ₄
Slide 22: Binary-Join AMS Sketching Analysis
- Expected value of X = COUNT(R ⋈_A S):
  E[X] = E[XR·XS] = Σᵢ fR(i)fS(i)·E[ξᵢ²] + Σᵢ≠ⱼ fR(i)fS(j)·E[ξᵢξⱼ] = Σᵢ fR(i)·fS(i) + 0
- Using 4-wise independence, it is possible to show that
  Var[X] ≤ 2·SJ(R)·SJ(S)
  where SJ(R) = Σᵢ fR(i)² is the self-join size of R (its second/L2 moment)
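The construction can be sketched end-to-end in Python, using the example streams from slide 21. The sign variables are seeded-PRNG stand-ins for a 4-wise independent family, and only the averaging step is applied here (no median):

```python
import random

def xi(seed, item):
    """+/-1 sign for (seed, item); a seeded-PRNG stand-in for 4-wise independence."""
    return random.Random(f"{seed}:{item}").choice((-1, 1))

def join_count_estimate(stream_r, stream_s, copies=600):
    """Average independent copies of X = XR * XS, where XR and XS are the
    sketches of the two streams built with the same sign family."""
    total = 0
    for seed in range(copies):
        xr = sum(xi(seed, v) for v in stream_r)
        xs = sum(xi(seed, v) for v in stream_s)
        total += xr * xs
    return total / copies

r = [4, 1, 2, 4, 1, 4]   # fR = (2, 1, 0, 3)
s = [3, 1, 2, 4, 2, 4]   # fS = (1, 2, 1, 2)
est = join_count_estimate(r, s)   # true COUNT = 2*1 + 1*2 + 0*1 + 3*2 = 10
```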
Slide 23: Boosting Accuracy
- Chebyshev's inequality: Pr[|Y − COUNT| ≥ ε·COUNT] ≤ Var[Y] / (ε²·COUNT²)
- Boost accuracy to ε by averaging over s1 = O(Var[X]/(ε²·COUNT²)) independent copies of X (averaging reduces variance: Var[Y] = Var[X]/s1)
- By Chebyshev: Pr[|Y − COUNT| ≥ ε·COUNT] ≤ 1/8
[Figure: s1 independent copies of X averaged into Y]
Slide 24: Boosting Confidence
- Boost confidence to 1 − δ by taking the median of 2·log(1/δ) independent copies of Y
- Each Y is a Bernoulli trial: FAILURE means |Y − COUNT| ≥ ε·COUNT, which happens with probability ≤ 1/8
- Pr[|median(Y) − COUNT| ≥ ε·COUNT] ≤ δ  (by the Chernoff bound)
[Figure: 2·log(1/δ) copies of Y, median taken]
Slide 25: Summary of Binary-Join AMS Sketching
- Step 1: compute the random variables XR and XS
- Step 2: define X = XR·XS
- Steps 3 & 4: average independent copies of X; return the median of the averages
- Main theorem [AGMS99]: sketching approximates COUNT to within a relative error of ε with probability ≥ 1 − δ using space O((SJ(R)·SJ(S)/(ε²·COUNT²))·log(1/δ)·log N)
  - Remember: O(log N) space for seeding the construction of each X
[Figure: copies of X grouped into averages Y; the median of the Y's is returned]
Slide 26: Distinct Value Estimation (F0)
- Problem: find the number of distinct values in a stream of values with domain {0, ..., N−1}
  - The zeroth frequency moment F0
- Statistics: number of species or classes in a population
- Important for query optimizers
- Network monitoring: distinct destination IP addresses, source/destination pairs, requested URLs, etc.
- Example (N = 64): number of distinct values = 5
- A hard problem for random sampling!
  - Must sample almost the entire table to guarantee that the estimate is within a factor of 10 with probability > 1/2, regardless of the estimator used!
Slide 27: Hash (aka FM) Sketches for Distinct Value Estimation [FM85]
- Assume a hash function h(x) that maps incoming values x in [0, ..., N−1] uniformly across [0, ..., 2^L − 1], where L = O(log N)
- Let lsb(y) denote the position of the least-significant 1 bit in the binary representation of y
  - A value x is mapped to lsb(h(x))
- Maintain a hash sketch: a BITMAP array of L bits, initialized to 0
  - For each incoming value x, set BITMAP[lsb(h(x))] = 1
- Prob[lsb(h(x)) = i] = 1/2^(i+1)
[Figure: example insertion for x = 5]
Slide 28: Hash (FM) Sketches for Distinct Value Estimation [FM85]
- By uniformity through h(x): Prob[BITMAP[k] = 1] = 1/2^(k+1)
- Assuming d distinct values: expect d/2 to map to BITMAP[0], d/4 to map to BITMAP[1], ...
- Let R = position of the rightmost zero in BITMAP (positions 0 through L−1)
  - Use R as an indicator of log(d)
- [FM85] prove that E[R] = log(φd), where φ ≈ 0.7735
  - Estimate d = 2^R / φ
- Average several iid instances (different hash functions) to reduce estimator variance
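A compact Python sketch of the FM estimator described above, averaging R over several iid instances as the last bullet suggests (a salted MD5 plays the role of the ideal hash function, an illustrative assumption):

```python
import hashlib
import statistics

PHI = 0.77351  # the FM85 correction constant

def lsb(y):
    """Position of the least-significant set bit of y > 0."""
    return (y & -y).bit_length() - 1

def fm_distinct_estimate(stream, num_sketches=64, L=32):
    """Average R over several iid FM sketches and return 2**avg(R) / PHI.
    A salted MD5 stands in for the ideal hash function h(x)."""
    positions = []
    for salt in range(num_sketches):
        bitmap = [0] * L
        for x in stream:
            h = int(hashlib.md5(f"{salt}:{x}".encode()).hexdigest(), 16) % (2 ** L)
            if h:                             # h == 0 has no set bit; skip it
                bitmap[lsb(h)] = 1
        positions.append(bitmap.index(0))     # R = position of the rightmost zero
    return 2 ** statistics.mean(positions) / PHI

est = fm_distinct_estimate(list(range(1000)) * 2)   # 1000 distinct values
```

Note that repeated values set the same bit again, so duplicates have no effect: exactly the property that makes the sketch count distinct values.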
Slide 29: Accuracy of FM
[Figure: m independent bitmap sketches, BITMAP 1 through BITMAP m]
- (1 ± ε) approximation with probability at least 1 − δ
Slide 30: Hash (FM) Sketches for Distinct Value Estimation
- [FM85] assume ideal hash functions h(x) (N-wise independence)
- In practice: h(x) = a·x + b, where a, b are random binary vectors in [0, ..., 2^L − 1]
- Composable: component-wise OR/add distributed sketches together
  - Estimate |S1 ∪ S2 ∪ ... ∪ Sk| = set-union cardinality
Slide 31: Cash Register Sketch (AMS)
- A more general algorithm for Fk
- Stream sampling: choose a random position p from 1..m and let
  r = |{q : q ≥ p and aq = ap}|  (the count of ap from position p onward)
- Estimator: X = m·(r^k − (r−1)^k)
- Using F2 (k = 2) as an example (stream of m = 7 elements):
  - If we choose the first element a1: r = 2 and X = 7·(2² − 1²) = 21
  - And for a2: r = ?, X = ?  For a5: r = ?, X = ?
Slide 32: Cash Register Sketch (AMS)
- Y = average of A copies of X, and Z = median of B copies of Y's
[Figure: A copies of X averaged into each Y; median taken over B copies of Y]
- Claim: this is a (1 ± ε) approximation to F2 with probability at least 1 − δ, and the space used is O(A·B) words of size O(log n + log m)
Slide 33: Analysis: Cash Register Sketch
- E[X] = F2
- V[X] = E[X²] − (E[X])²
- Using (a² − b²) ≤ 2(a − b)·a, we have V[X] ≤ 2·F1·F3
- Also, F1·F3 ≤ √n·F2², so V[X] ≤ 2·√n·F2²
- Hence, E[Yᵢ] = E[Xᵢ] = F2 and V[Yᵢ] = V[Xᵢ]/A
Slide 34: Analysis (cont.)
- Applying Chebyshev's inequality: Pr[|Yᵢ − F2| ≥ εF2] ≤ V[Yᵢ]/(ε²F2²) ≤ 1/8 once A = O(√n/ε²)
- Hence, by Chernoff bounds, the probability that more than B/2 of the Yᵢ's deviate that far is at most δ if we take B = O(log(1/δ)) copies of the Yᵢ's. Hence, the median gives the correct approximation.
Slide 35: Computation of Fk
- E[X] = Fk
- With A = O(k·n^(1−1/k)/ε²) and B = O(log(1/δ)):
  - We get an approximation within relative error ε with probability at least 1 − δ
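The sampling estimator of slide 31 together with the average-then-median boosting can be sketched as follows (A and B below are small illustrative defaults, not the O(k·n^(1−1/k)/ε²) and O(log(1/δ)) values from the analysis):

```python
import random
import statistics

def single_fk_estimate(stream, k, rng):
    """Pick a random position p, let r = occurrences of a_p from p onward,
    and return X = m * (r**k - (r-1)**k); E[X] = F_k."""
    m = len(stream)
    p = rng.randrange(m)
    r = stream[p:].count(stream[p])
    return m * (r ** k - (r - 1) ** k)

def fk_estimate(stream, k, A=200, B=9, seed=0):
    """Median of B averages of A independent copies of X."""
    rng = random.Random(seed)
    ys = [
        sum(single_fk_estimate(stream, k, rng) for _ in range(A)) / A
        for _ in range(B)
    ]
    return statistics.median(ys)

stream = ['a', 'b', 'b', 'c', 'c', 'd', 'e']   # F1 = 7, F2 = 11
```

For k = 1 the estimator is exact: X = m·(r − (r − 1)) = m for every sampled position.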
Slide 36: Estimating Element Frequency
- Ask for f(1) = ?  f(4) = ?
  - AMS-based algorithm
  - Count-Min sketch
Slide 37: AMS (Sketch) Based Algorithm
- Key intuition: use randomized linear projections of f() to define a random variable Z such that, for a given element i, E[Z·ξᵢ] = f(i)
  - Similarly, E[Z·ξⱼ] = f(j)
- Basic idea
  - Define a family of 4-wise independent {−1, +1} random variables ξᵢ (same as before)
    - Pr[ξᵢ = +1] = Pr[ξᵢ = −1] = 1/2
  - Let Z = Σⱼ f(j)·ξⱼ
  - So E[Z·ξᵢ] = f(i)·E[ξᵢ²] + Σⱼ≠ᵢ f(j)·E[ξⱼξᵢ] = f(i)·1 + 0 = f(i)
Slide 38: AMS (cont.)
- Keep an array of w × d counters Z[i][j]
- Use d hash functions to map an element a to 1..w
[Figure: element a routed by h1(a), ..., hd(a) to one counter Z[i, hi(a)] per row]
- On an update to a, add the sign variable ξ(i, a) to Z[i, hi(a)] in each row i
- Est(f(a)) = median over i of ξ(i, a)·Z[i, hi(a)]
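The counter-array scheme above can be sketched as a small Python class. The hash and sign functions are seeded-PRNG stand-ins for the independent families it assumes:

```python
import random
import statistics

class AMSPointSketch:
    """w x d array of counters Z; row i adds the sign xi(i, a) at column
    h_i(a); Est(f(a)) = median over rows of sign * counter. Hashes and
    signs are seeded-PRNG stand-ins for the assumed independent families."""

    def __init__(self, w, d, seed=0):
        self.w, self.d, self.seed = w, d, seed
        self.z = [[0] * w for _ in range(d)]

    def _h(self, i, a):
        return random.Random(f"h{self.seed}:{i}:{a}").randrange(self.w)

    def _xi(self, i, a):
        return random.Random(f"x{self.seed}:{i}:{a}").choice((-1, 1))

    def update(self, a, count=1):
        for i in range(self.d):
            self.z[i][self._h(i, a)] += count * self._xi(i, a)

    def estimate(self, a):
        return statistics.median(
            self._xi(i, a) * self.z[i][self._h(i, a)] for i in range(self.d)
        )

sk = AMSPointSketch(w=32, d=5)
for item in ['x'] * 50 + [str(i) for i in range(10)]:
    sk.update(item)
est = sk.estimate('x')   # true f('x') = 50; each colliding item adds +/-1
```

The random signs make collision noise cancel in expectation, so the estimate is unbiased; the median over rows controls the variance.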
Slide 39: The Count-Min (CM) Sketch
- A simple sketch idea that can be used for point queries (f(i)), range queries, quantiles, and join size estimation
- Creates a small summary as an array of w × d counters C
- Uses d hash functions to map elements to 1..w
[Figure: w × d counter array]
Slide 40: CM Sketch Structure
[Figure: element xi routed by h1(xi), ..., hd(xi) to one counter per row]
- Each element xi is mapped to one counter per row
  - C[k, hk(xi)] = C[k, hk(xi)] + 1  (−1 if deletion)
  - or + cj if the incoming update is <j, cj>
- Estimate A[j] by taking mink C[k, hk(j)]
Slide 41: CM Sketch Summary
- The CM sketch guarantees an approximation error on point queries of less than ε·||A||₁ using space O((1/ε)·log(1/δ))
- The probability of a larger error is less than δ
- Hints
  - The counts are biased upward (collisions only add mass)! Can you limit the expected amount of extra mass at each bucket? (Use Markov)
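For contrast with the AMS-based array, a minimal Count-Min sketch (hash functions again simulated with seeded PRNGs; real implementations use pairwise-independent hashes):

```python
import random

class CountMinSketch:
    """d rows of w counters; each update adds its count to one counter per
    row; a point query returns the row-minimum, an overestimate whose
    expected excess per row is F1/w (Markov then bounds the error). Hashes
    are seeded-PRNG stand-ins for pairwise-independent functions."""

    def __init__(self, w, d, seed=0):
        self.w, self.d, self.seed = w, d, seed
        self.c = [[0] * w for _ in range(d)]

    def _h(self, k, x):
        return random.Random(f"{self.seed}:{k}:{x}").randrange(self.w)

    def update(self, x, count=1):
        for k in range(self.d):
            self.c[k][self._h(k, x)] += count

    def query(self, x):
        return min(self.c[k][self._h(k, x)] for k in range(self.d))

cm = CountMinSketch(w=64, d=4)
for item in ['x'] * 30 + [str(i) for i in range(20)]:
    cm.update(item)
q = cm.query('x')   # never below the true count 30; at most 30 + 20
```

Unlike the AMS array, all counter contributions are nonnegative, so the query never underestimates: that is exactly the one-sided bias the slide's hint asks about.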