Title: Approximate Frequency Counts over Data Streams
1 Approximate Frequency Counts over Data Streams
Gurmeet Singh Manku (Stanford), Rajeev Motwani (Stanford)
Presented by Michal Spivak, November 2003
2 The Problem
Identify all elements whose current frequency exceeds a support threshold s = 0.1.
3 Related problem
Identify all subsets of items whose current frequency exceeds s = 0.1.
4 Purpose of this paper
- Present an algorithm for computing frequency counts exceeding a user-specified threshold over data streams, with the following advantages:
  - Simple
  - Low memory footprint
  - Output is approximate but guaranteed not to exceed a user-specified error parameter
  - Can be deployed for streams of singleton items and can handle streams of variable-sized sets of items
5 Overview
- Introduction
- Frequency counting applications
- Problem definition
- Algorithm for Frequent Items
- Algorithm for Frequent Sets of Items
- Experimental results
- Summary
6 Introduction
7 Motivating examples
- Iceberg Queries: perform an aggregate function over an attribute and eliminate those below some threshold.
- Association Rules: require computation of frequent itemsets.
- Iceberg Datacubes: group-bys of a CUBE operator whose aggregate frequency exceeds a threshold.
- Traffic measurement: requires identification of flows that exceed a certain fraction of total traffic.
8 What's out there today
- Algorithms that compute exact results
- They attempt to minimize the number of data passes (the best algorithms take two passes)
- Problems when adapted to streams:
  - Only one pass is allowed.
  - Results are expected to be available with short response time.
  - They fail to provide any a priori guarantee on the quality of their output.
9 Why Streams? Streams vs. Stored data
- The volume of a stream over its lifetime can be huge
- Queries over streams require timely answers; response times need to be small
- As a result it is not possible to store the stream in its entirety
10 Frequency counting applications
11 Existing applications for the following problems
- Iceberg Queries: perform an aggregate function over an attribute and eliminate those below some threshold.
- Association Rules: require computation of frequent itemsets.
- Iceberg Datacubes: group-bys of a CUBE operator whose aggregate frequency exceeds a threshold.
- Traffic measurement: requires identification of flows that exceed a certain fraction of total traffic.
12 Iceberg Queries: identify aggregates that exceed a user-specified threshold r
- One of the published algorithms to compute iceberg queries efficiently uses repeated hashing over multiple passes.
- Basic idea:
  - In the first pass a set of counters is maintained.
  - Each incoming item is hashed to one of the counters, which is incremented.
  - These counters are then compressed to a bitmap, with a 1 denoting a large counter value.
  - In the second pass exact frequencies are maintained only for those elements that hash to a counter whose bitmap value is 1.
- This algorithm is difficult to adapt for streams because it requires two passes.
M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. Ullman. Computing iceberg queries efficiently. In Proc. of 24th Intl. Conf. on Very Large Data Bases, pages 299-310, 1998.
13 Association Rules
- Definitions:
  - Transaction: a subset of items drawn from I, the universe of all items.
  - An itemset X ⊆ I has support s if X occurs as a subset in at least a fraction s of all transactions.
  - Association rules over a set of transactions are of the form X => Y, where X and Y are subsets of I such that X ∩ Y = ∅ and X ∪ Y has support exceeding a user-specified threshold s.
  - Confidence of a rule X => Y is the value support(X ∪ Y) / support(X).
14 Example - Market basket analysis
For support = 50% and confidence = 50%, we have the following rules:
- 1 => 3 with 50% support and 66% confidence
- 3 => 1 with 50% support and 100% confidence
15 Reduce to computing frequent itemsets
For support = 50%, confidence = 50%:
- For the rule 1 => 3:
  - Support = Support({1, 3}) = 50%
  - Confidence = Support({1, 3}) / Support({1}) = 66%
(A small worked example follows below.)
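As a worked illustration, the sketch below uses a hypothetical four-transaction dataset that is consistent with the numbers above; the dataset and helper names are my own, not the slide's.

```python
# Hypothetical transactions consistent with the numbers above
transactions = [{1, 3}, {1, 3, 5}, {1, 2}, {2, 5}]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Confidence of rule X => Y: support(X ∪ Y) / support(X)."""
    return support(x | y, transactions) / support(x, transactions)

print(support({1, 3}, transactions))       # 0.5   -> 50% support
print(confidence({1}, {3}, transactions))  # 0.66… -> 66% confidence for 1 => 3
print(confidence({3}, {1}, transactions))  # 1.0   -> 100% confidence for 3 => 1
```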
16 Toivonen's algorithm
- Based on sampling of the data stream.
- Basically, in the first pass frequencies are computed for a sample of the stream, and in the second pass the validity of these items is determined. Can be adapted for data streams.
- Problems:
  - false negatives occur because the error in frequency counts is two-sided
  - for small values of ε the number of samples required is enormous, on the order of 1/ε (100 million samples)
17 Network flow identification
- Flow: a sequence of transport-layer packets that share the same source/destination addresses.
- Estan and Verghese proposed an algorithm for identifying flows that exceed a certain threshold. The algorithm is a combination of repeated hashing and sampling, similar to those for iceberg queries.
- The algorithm presented in this paper is directly applicable to the problem of network flow identification. It beats their algorithm in terms of space requirements.
18 Problem definition
19 Problem Definition
- The algorithm accepts two user-specified parameters:
  - a support threshold s ∈ (0, 1)
  - an error parameter ε ∈ (0, 1), with ε << s
- N: length of the stream (i.e., the number of tuples seen so far)
- Itemset: a set of items
- Item(set) denotes an item or an itemset
- At any point in time, the algorithm can be asked to produce a list of item(set)s along with their estimated frequencies.
20 Approximation guarantees
- All item(set)s whose true frequency exceeds sN are output. There are no false negatives.
- No item(set) whose true frequency is less than (s - ε)N is output.
- Estimated frequencies are less than the true frequencies by at most εN.
21 Input Example
- s = 0.1
- ε, as a rule of thumb, should be set to one-tenth or one-twentieth of s; here ε = 0.01
- As per property 1, ALL elements with frequency exceeding 0.1·N will be output.
- As per property 2, NO element with frequency below 0.09·N will be output.
- Elements with frequency between 0.09·N and 0.1·N may or may not be output. Those that make their way in are false positives.
- As per property 3, all estimated frequencies are less than the true frequencies by at most 0.01·N.
22 Problem Definition cont.
- An algorithm maintains an ε-deficient synopsis if its output satisfies the aforementioned properties.
- Goal: devise algorithms that support an ε-deficient synopsis using as little main memory as possible.
23 The Algorithms for Frequent Items
- Sticky Sampling
- Lossy Counting
24 Sticky Sampling Algorithm
[Figure: a stream of elements (34, 15, 30, 28, 31, 41, 23, 35, 19); counters are created by sampling]
25 Notations
- Data structure S: a set of entries of the form (e, f)
- f estimates the frequency of an element e
- r: sampling rate. Sampling an element with rate r means we select the element with probability 1/r
26 Sticky Sampling cont.
- Initially S is empty and r = 1.
- For each incoming element e:
    if (e exists in S)
        increment the corresponding f
    else
        sample the element with rate r
        if (sampled)
            add entry (e, 1) to S
        else
            ignore e
27 The sampling rate
- Let t = 1/ε · log(s⁻¹ δ⁻¹), where δ is the probability of failure
- The first 2t elements are sampled at rate r = 1
- The next 2t elements at rate r = 2
- The next 4t elements at rate r = 4
- and so on
28 Sticky Sampling cont.
- Whenever the sampling rate r changes, for each entry (e, f) in S:
    repeat
        toss an unbiased coin
        if (toss is not successful)
            diminish f by one
            if (f == 0)
                delete the entry from S and break
    until the toss is successful
(A runnable sketch of the full algorithm follows below.)
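A minimal runnable sketch combining the last three slides, assuming Python, natural logarithms, and my own names (StickySampling, process, output); this is a reconstruction, not the authors' code.

```python
import math
import random

class StickySampling:
    """Sketch of Sticky Sampling; s = support, eps = error, delta = failure prob."""

    def __init__(self, s, eps, delta):
        self.s, self.eps = s, eps
        self.t = (1.0 / eps) * math.log(1.0 / (s * delta))  # t = 1/eps * log(1/(s*delta))
        self.r = 1                      # current sampling rate
        self.n = 0                      # stream length N seen so far
        self.next_change = 2 * self.t   # first 2t elements use r = 1
        self.S = {}                     # element -> estimated frequency f

    def _maybe_increase_rate(self):
        # After the first 2t elements, the rate doubles each time the
        # current rate's quota (r * t elements) is exhausted.
        while self.n > self.next_change:
            self.r *= 2
            self.next_change += self.r * self.t
            self._adjust_entries()

    def _adjust_entries(self):
        # For each entry, diminish f once per unsuccessful unbiased coin toss,
        # stopping at the first successful toss; drop entries that reach 0.
        for e in list(self.S):
            while random.random() < 0.5:        # unsuccessful toss
                self.S[e] -= 1
                if self.S[e] == 0:
                    del self.S[e]
                    break

    def process(self, e):
        self.n += 1
        self._maybe_increase_rate()
        if e in self.S:
            self.S[e] += 1
        elif random.random() < 1.0 / self.r:    # sample with rate r
            self.S[e] = 1

    def output(self):
        # Report entries with f >= (s - eps) * N
        return {e: f for e, f in self.S.items()
                if f >= (self.s - self.eps) * self.n}
```

Usage is simply `ss.process(x)` per stream element and `ss.output()` whenever a list of frequent items is requested.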
29 Sticky Sampling cont.
- The number of unsuccessful coin tosses follows a geometric distribution.
- Effectively, after each rate change S is transformed to exactly the state it would have been in if the new rate had been used from the beginning.
- When a user requests a list of items with threshold s, the output is those entries in S where f ≥ (s - ε)N.
30 Theorem 1
- Sticky Sampling computes an ε-deficient synopsis with probability at least 1 - δ using at most 2/ε · log(s⁻¹ δ⁻¹) expected number of entries.
31 Theorem 1 - proof
- The first 2t elements find their way into S.
- When r ≥ 2, we can write N = rt + rt′ with t′ ∈ [1, t), which gives 1/r ≥ t/N.
- The error in the frequency of an element e corresponds to a sequence of unsuccessful coin tosses during the first few occurrences of e. The probability that this sequence is longer than εN is at most (1 - 1/r)^(εN) < (1 - t/N)^(εN) < e^(-εt).
- The number of elements with frequency at least sN is no more than 1/s, so the probability that the estimate for any of them is deficient by εN is at most e^(-εt)/s.
32 Theorem 1 proof cont.
- The probability of failure should be at most δ. This yields e^(-εt)/s ≤ δ.
- Solving for t gives t ≥ 1/ε · log(s⁻¹ δ⁻¹).
- Since the space requirement is 2t entries in expectation, the space bound follows.
(The algebra is spelled out below.)
33 Sticky Sampling summary
- The algorithm is called Sticky Sampling because S sweeps over the stream like a magnet, attracting all elements which already have an entry in S.
- The space complexity is independent of N.
- The idea of maintaining samples was first presented by Gibbons and Matias, who used it to solve the top-k problem.
- This algorithm is different in that the sampling rate r increases logarithmically so as to produce ALL items with frequency exceeding sN, not just the top k.
34 Lossy Counting
Divide the stream into buckets. Keep exact counters for items in the buckets. Prune entries at bucket boundaries.
35 Lossy Counting cont.
- A deterministic algorithm that computes frequency counts over a stream of single-item transactions, satisfying the guarantees outlined earlier (Section 3 of the paper), using at most 1/ε · log(εN) space, where N denotes the current length of the stream.
- The user specifies two parameters:
  - support s
  - error ε
36 Definitions
- The incoming stream is conceptually divided into buckets of width w = ceil(1/ε).
- Buckets are labeled with bucket ids, starting from 1.
- Denote the current bucket id by b_current, whose value is ceil(N/w).
- Denote by f_e the true frequency of an element e in the stream seen so far.
- Data structure D is a set of entries of the form (e, f, Δ).
37 The algorithm
- Initially D is empty.
- Receive element e:
    if (e exists in D)
        increment its frequency f by 1
    else
        create a new entry (e, 1, b_current - 1)
- At a bucket boundary, prune D by the following rule: (e, f, Δ) is deleted if f + Δ ≤ b_current.
- When the user requests a list of items with threshold s, output those entries in D where f ≥ (s - ε)N.
(A runnable sketch follows below.)
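A minimal sketch of the single-item algorithm, assuming the class and method names below (LossyCounting, process, output), which are mine, not the paper's.

```python
import math

class LossyCounting:
    """Sketch of Lossy Counting for single items; eps is the error parameter."""

    def __init__(self, eps):
        self.eps = eps
        self.w = math.ceil(1.0 / eps)    # bucket width w = ceil(1/eps)
        self.n = 0                       # stream length N
        self.D = {}                      # element -> (f, delta)

    def process(self, e):
        self.n += 1
        b_current = math.ceil(self.n / self.w)
        if e in self.D:
            f, delta = self.D[e]
            self.D[e] = (f + 1, delta)
        else:
            # delta = b_current - 1 bounds how often e may have occurred
            # before this entry was created
            self.D[e] = (1, b_current - 1)
        # prune at bucket boundaries: delete entries with f + delta <= b_current
        if self.n % self.w == 0:
            self.D = {x: (f, d) for x, (f, d) in self.D.items()
                      if f + d > b_current}

    def output(self, s):
        # report entries with f >= (s - eps) * N
        return {x: f for x, (f, d) in self.D.items()
                if f >= (s - self.eps) * self.n}
```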
38 Some algorithm facts
- For an entry (e, f, Δ), f represents the exact frequency count of e ever since it was inserted into D.
- The value Δ is the maximum number of times e could have occurred in the first b_current - 1 buckets (this value is exactly b_current - 1 at the time of insertion).
- Once an entry is inserted into D, its value Δ remains unchanged.
39 Lossy counting in action
[Figure: D starts out empty and fills up as the stream is processed]
At a window (bucket) boundary, remove entries for which f + Δ ≤ b_current.
40 Lossy counting in action cont.
At a window (bucket) boundary, remove entries for which f + Δ ≤ b_current.
41 Lemma 1
- Whenever deletions occur, b_current ≤ εN.
- Proof: deletions occur only at bucket boundaries, where N = b_current · w ≥ b_current/ε, and hence b_current ≤ εN.
42 Lemma 2
- Whenever an entry (e, f, Δ) gets deleted, f_e ≤ b_current.
- Proof by induction:
  - Base case: b_current = 1. (e, f, Δ) is deleted only if f ≤ 1, thus f_e ≤ b_current (since f_e = f).
  - Induction step: consider an entry (e, f, Δ) that gets deleted for some b_current > 1.
    - This entry was inserted when bucket Δ + 1 was being processed.
    - An earlier entry for e could have been deleted as late as the time when bucket Δ became full.
    - By the induction hypothesis, the true frequency of e up to that point was no more than Δ.
    - f is the true frequency of e since this entry was inserted, so f_e ≤ f + Δ.
    - Combined with the deletion rule f + Δ ≤ b_current, this gives f_e ≤ b_current.
43 Lemma 3
- If e does not appear in D, then f_e ≤ εN.
- Proof: if the lemma is true for an element e at the moment its entry gets deleted, it remains true for all later N as well. From Lemmas 1 and 2 we infer that f_e ≤ b_current ≤ εN whenever the entry gets deleted.
44 Lemma 4
- If (e, f, Δ) ∈ D, then f ≤ f_e ≤ f + εN.
- Proof:
  - If Δ = 0, then f = f_e.
  - Otherwise, e was possibly deleted during the first Δ buckets. From Lemma 2, f_e ≤ f + Δ, and Δ ≤ b_current - 1 ≤ εN.
  - Conclusion: f ≤ f_e ≤ f + εN.
45 Lossy Counting cont.
- Lemma 3 shows that all elements whose true frequency exceeds εN have entries in D.
- Lemma 4 shows that the estimated frequencies of all such elements are accurate to within εN.
- => D correctly maintains an ε-deficient synopsis.
46 Theorem 2
- Lossy Counting computes an ε-deficient synopsis using at most 1/ε · log(εN) entries.
47 Theorem 2 - proof
- Let B = b_current.
- Let d_i denote the number of entries in D whose bucket id is B - i + 1 (for i ∈ [1, B]).
- An element e counted in d_i must occur at least i times in buckets B - i + 1 through B; otherwise it would have been deleted.
- We get the following constraint: (1) Σ_{i=1..j} i·d_i ≤ j·w for j = 1, 2, ..., B.
48 Theorem 2 proof cont.
- The following inequality can be proved by induction from (1): Σ_{i=1..j} d_i ≤ Σ_{i=1..j} w/i for j = 1, 2, ..., B.
- |D| = Σ_{i=1..B} d_i.
- From the above inequality, |D| ≤ Σ_{i=1..B} w/i ≤ 1/ε · log B = 1/ε · log(εN).
(The last step is spelled out below.)
49 Sticky Sampling vs. Lossy Counting
[Chart: number of entries vs. log10 of N (stream length), for support s = 1% and error ε = 0.1%]
50 Sticky Sampling vs. Lossy Counting cont.
Kinks in the curve for Sticky Sampling correspond to re-sampling. Kinks in the curve for Lossy Counting correspond to bucket boundaries.
51 Sticky Sampling vs. Lossy Counting cont.

ε (%)   s (%)   SS worst   LC worst   SS Zipf   LC Zipf   SS Uniq   LC Uniq
0.1     1.0     27K        9K         6K        419       27K       1K
0.05    0.5     58K        17K        11K       709       58K       2K
0.01    0.1     322K       69K        37K       2K        322K      10K
0.005   0.05    672K       124K       62K       4K        672K      20K

SS = Sticky Sampling, LC = Lossy Counting, Zipf = Zipfian distribution, Uniq = stream with no duplicates
52 Sticky Sampling vs. Lossy Counting - summary
- Lossy Counting is superior by a large factor.
- Sticky Sampling performs worse because of its tendency to remember every unique element that gets sampled.
- Lossy Counting is good at pruning low-frequency elements quickly.
53 Comparison with alternative approaches
- Toivonen: sampling algorithm for association rules.
- Sticky Sampling beats that approach by roughly a factor of 1/ε.
54 Comparison with alternative approaches cont.
- [KPS02]: in the first pass the algorithm maintains 1/ε elements with their frequencies. If a counter exists for an element it is incremented; if there is a free counter, the element is inserted; otherwise all existing counters are decremented by one.
- Can be used to maintain an ε-deficient synopsis with exactly 1/ε space.
- If the input stream is Zipfian, Lossy Counting takes less than 1/ε space: for ε = 0.01%, roughly 2000 entries, about 20% of 1/ε.
55 Frequent Sets of Items
56 Frequent Sets of Items
Identify all subsets of items whose current frequency exceeds s = 0.1.
57 Frequent itemsets algorithm
- Input: a stream of transactions, where each transaction is a set of items from I.
- N: length of the stream.
- The user specifies two parameters: support s and error ε.
- Challenges:
  - handling variable-sized transactions
  - avoiding explicit enumeration of all subsets of any transaction
58 Notations
- Data structure D: a set of entries of the form (set, f, Δ).
- Transactions are divided into buckets.
- w = ceil(1/ε): the number of transactions in each bucket.
- b_current: current bucket id.
- Transactions are not processed one by one. Main memory is filled with as many transactions as possible, and processing is done on a batch of transactions.
- B: the number of buckets in main memory in the current batch being processed.
59 Update D
- UPDATE_SET: for each entry (set, f, Δ) ∈ D, update f by counting the occurrences of set in the current batch. If the updated entry satisfies f + Δ ≤ b_current, delete this entry.
- NEW_SET: if a set set has frequency f ≥ B in the current batch and set does not occur in D, create a new entry (set, f, b_current - B).
(A simplified sketch follows below.)
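To make the two rules concrete, here is a deliberately naive sketch that enumerates every subset of every transaction; the real implementation avoids exactly this enumeration via the BUFFER / TRIE / SetGen modules described next. All names and the dict-based representation of D are assumptions.

```python
import math
from itertools import combinations

def process_batch(D, batch, eps, b_current):
    """Naive sketch of one batch of Lossy Counting over itemsets.
    D maps frozenset -> (f, delta); `batch` is a list of transactions (sets).
    WARNING: enumerating all subsets is exponential in transaction size;
    the paper's SetGen module exists precisely to avoid this."""
    w = math.ceil(1.0 / eps)
    B = len(batch) // w                 # number of complete buckets in this batch
    assert B >= 1, "batch is assumed to span at least one full bucket"
    b_new = b_current + B

    # count occurrences of every subset appearing in the batch
    counts = {}
    for t in batch:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for sub in combinations(items, k):
                key = frozenset(sub)
                counts[key] = counts.get(key, 0) + 1

    # UPDATE_SET: refresh existing entries, delete those with
    # f + delta <= b_current (taken after the batch, i.e. b_new)
    for s, (f, delta) in list(D.items()):
        f += counts.pop(s, 0)
        if f + delta <= b_new:
            del D[s]
        else:
            D[s] = (f, delta)

    # NEW_SET: sets not previously in D that occur at least B times in the batch
    for s, f in counts.items():
        if f >= B:
            D[s] = (f, b_new - B)

    return b_new                        # updated value of b_current
```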
60 Algorithm facts
- If f_set ≥ εN, then set has an entry in D.
- If (set, f, Δ) ∈ D, then the true frequency f_set satisfies the inequality f ≤ f_set ≤ f + Δ.
- When a user requests a list of itemsets with threshold s, output those entries in D where f ≥ (s - ε)N.
- B needs to be a large number: any subset of I that occurs B + 1 times or more contributes an entry to D.
61Three modules
maintains the data structure D
implement UPDATE_SET, NEW_SET
operates on the current batch of transactions
repeatedly reads in a batch of transactionsinto
available main memory
62 Module 1 - BUFFER
- Reads in a batch of transactions.
- Transactions are laid out one after the other in a big array.
- A bitmap is used to remember transaction boundaries.
- After reading in a batch, BUFFER sorts each transaction by its item-ids.
63 Module 2 - TRIE
64 Module 2 - TRIE cont.
- Nodes are labeled (item-id, f, Δ, level).
- Children of any node are ordered by their item-ids.
- Root nodes are also ordered by their item-ids.
- A node represents the itemset consisting of the item-ids in that node and all of its ancestors.
- The TRIE is maintained as an array of entries of the form (item-id, f, Δ, level), corresponding to a pre-order traversal of the trees. This is equivalent to a lexicographic ordering of the subsets it encodes.
- There are no pointers; the levels compactly encode the underlying tree structure. (A small illustration follows below.)
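A small illustration of this pointer-free layout; the item-ids, frequencies, and Δ values below are invented, and the itemsets() helper is mine, shown only to demonstrate how the level field recovers the tree.

```python
# Hypothetical trie encoded as a flat array of (item_id, f, delta, level)
# entries written in pre-order; `level` alone recovers parent/child structure.
trie = [
    (1, 40, 0, 0),   # root {1}
    (2, 30, 0, 1),   #   child {1, 2}
    (3, 25, 0, 2),   #     child {1, 2, 3}
    (4, 12, 0, 1),   #   child {1, 4}
    (2, 35, 0, 0),   # root {2}
    (5, 20, 0, 1),   #   child {2, 5}
]

def itemsets(trie):
    """Recover the itemset each entry represents from the pre-order + level encoding."""
    path = []
    for item_id, f, delta, level in trie:
        # keep ancestors at levels 0..level-1, then append the current node
        path = path[:level] + [item_id]
        yield (set(path), f, delta)

for s, f, d in itemsets(trie):
    print(sorted(s), f, d)
```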
65 Module 3 - SetGen
SetGen uses the following pruning rule: if a subset S does not make its way into the TRIE after application of both UPDATE_SET and NEW_SET, then no supersets of S need to be considered.
66 Overall Algorithm
67 Efficient Implementations: BUFFER
- If item-ids are successive integers from 1 through |I|, and |I| is small enough (less than 1 million): maintain exact frequency counts for singleton sets, prune away those item-ids whose frequency is less than εN, and only then sort the transactions.
- If |I| = 10^5, the array size is 0.4 MB.
68 Efficient Implementations: TRIE
- Take advantage of the fact that the sets produced by SetGen arrive in lexicographic order.
- Maintain the TRIE as a set of fairly large chunks of memory instead of one huge array.
- Instead of modifying the original TRIE, create a new TRIE.
- Chunks from the old TRIE are freed as soon as they are no longer required.
- By the time SetGen finishes, the chunks of the original TRIE have been discarded.
69 Efficient Implementations: SetGen
- Employs a priority queue called Heap.
- Initially the Heap contains pointers to the smallest item-ids of all transactions in BUFFER.
- Duplicate members are maintained together and constitute a single item in the Heap; all these pointers are chained together.
- The space for the Heap is derived from BUFFER by replacing item-ids with pointers.
70 Efficient Implementations: SetGen cont.
[Figure: BUFFER contents 3 1 6 5 4 2 5 4 1 3 2 1, with the Heap holding pointers to the smallest item-ids]
71 Efficient Implementations: SetGen cont.
- Repeatedly process the smallest item-id in the Heap to generate singleton sets.
- If the singleton belongs to the TRIE after UPDATE_SET and NEW_SET, try to generate the next set by extending the current singleton set.
- This is done by invoking SetGen recursively with a new Heap created out of the successors of the pointers to the item-ids just processed and removed.
- When the recursive call returns, the smallest entry in the Heap is removed and all successors of the currently smallest item-id are added.
(A simplified sketch follows below.)
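A simplified reconstruction of this recursion, using Python's heapq over (item-id, transaction, position) pointers instead of the chained-pointer representation in BUFFER; the visit callback stands in for the interaction with the TRIE (UPDATE_SET / NEW_SET). The interface and names are assumptions, not the authors' code.

```python
import heapq

def setgen(buffer_, pointers, prefix, visit):
    """buffer_  : list of sorted transactions (lists of item-ids)
       pointers : list of (tid, pos) pointers into the transactions
       prefix   : itemset generated so far (tuple of item-ids)
       visit(itemset, count) -> bool : stands in for UPDATE_SET / NEW_SET;
           returns True iff the itemset made it into the TRIE."""
    # build a heap of (item_id, tid, pos) from the live pointers
    heap = [(buffer_[tid][pos], tid, pos) for tid, pos in pointers
            if pos < len(buffer_[tid])]
    heapq.heapify(heap)

    while heap:
        smallest = heap[0][0]
        group = []                       # pointers whose current item-id == smallest
        while heap and heap[0][0] == smallest:
            _, tid, pos = heapq.heappop(heap)
            group.append((tid, pos))

        itemset = prefix + (smallest,)
        if visit(itemset, len(group)):
            # pruning rule: only supersets of surviving itemsets are generated
            setgen(buffer_, [(tid, pos + 1) for tid, pos in group], itemset, visit)

        # advance the processed pointers so later item-ids of these
        # transactions are still considered at this level
        for tid, pos in group:
            if pos + 1 < len(buffer_[tid]):
                heapq.heappush(heap, (buffer_[tid][pos + 1], tid, pos + 1))

# Example use: enumerate all subsets of a tiny batch (visit always returns True)
# batch = [[1, 2, 3], [2, 3], [1, 3]]
# setgen(batch, [(i, 0) for i in range(len(batch))], (),
#        lambda s, c: print(s, c) or True)
```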
72 Efficient Implementations: SetGen cont.
[Figure: the same BUFFER array (3 1 6 5 4 2 5 2 1 3 2 1) after processing the smallest item-ids; the Heap now holds their successors]
73 System issues and optimizations
- BUFFER scans the incoming stream by memory-mapping the input file.
- The standard qsort is used to sort transactions.
- Threading SetGen and BUFFER does not help, because SetGen is significantly slower.
- The rate at which tries are scanned is much smaller than the rate at which sequential disk I/O can be done.
- It is therefore possible to maintain the TRIE on disk without loss in performance.
74 System issues and optimizations: TRIE on disk - advantages
- The size of the TRIE is not limited by main memory; the algorithm can function with a small amount of main memory.
- Since most of the available main memory can be devoted to BUFFER, the algorithm can handle smaller values of ε than other algorithms can.
75 Novel features of this technique
- No candidate generation phase.
- The compact disk-based trie is novel.
- Able to compute frequent itemsets under low memory conditions.
- Able to handle smaller values of the support threshold than previously possible.
76 Experimental results
77 Experimental Results
78 What is studied
- Support threshold s
- Number of transactions N
- Size of BUFFER B
- Total time taken t
with ε set to 0.1·s
79 Varying buffer sizes and support s
[Chart] Decreasing s leads to increases in running time.
80 Varying support s and buffer size B
[Chart: time taken in seconds] Kinks occur due to the TRIE optimization on the last batch.
81 Varying length N and support s
[Chart] Running time is linearly proportional to the length of the stream. The curve flattens at the end because processing the last batch is faster.
82 Comparison with Apriori

                 Apriori              Our Algorithm (4 MB Buffer)   Our Algorithm (44 MB Buffer)
Support          Time     Memory      Time     Memory               Time     Memory
0.001            99 s     82 MB       111 s    12 MB                27 s     45 MB
0.002            25 s     53 MB       94 s     10 MB                15 s     45 MB
0.004            14 s     48 MB       65 s     7 MB                 8 s      45 MB
0.006            13 s     48 MB       46 s     6 MB                 6 s      45 MB
0.008            13 s     48 MB       34 s     5 MB                 4 s      45 MB
0.010            14 s     48 MB       26 s     5 MB                 4 s      45 MB

Dataset: IBM T10.I4.1000K with 1M transactions, average transaction size 10.
83 Comparison with Iceberg Queries
Query: identify all word pairs in 100K web documents which co-occur in at least 0.5% of the documents.
[FSGM+98] multiple-pass algorithm: 7000 seconds with 30 MB memory.
Our single-pass algorithm: 4500 seconds with 26 MB memory.
84 Summary
A novel algorithm for computing approximate frequency counts over data streams.
85 Summary: advantages of the algorithms presented
- Require provably small main memory footprints.
- Each of the motivating examples can now be solved over streaming data.
- Handle smaller values of the support threshold than previously possible.
- Remain practical in environments with moderate main memory.
86 Summary cont.
- Give an a priori error guarantee.
- Work for variable-sized transactions.
- Optimized implementation for frequent itemsets.
- For the datasets tested, the algorithm runs in one pass and produces exact results, beating previous algorithms in terms of time.
87 Questions?
- More questions/comments can be sent to michal.spivak_at_sun.com