Title: Approximate Frequency Counts over Data Streams
1 Approximate Frequency Counts over Data Streams
Gurmeet Singh Manku (Stanford), Rajeev Motwani (Stanford)
Presented by Michal Spivak, November 2003
2 The Problem
Identify all elements whose current frequency exceeds a support threshold s = 0.1.
3 Related problem
Identify all subsets of items whose current frequency exceeds s = 0.1.
4 Purpose of this paper
- Present an algorithm for computing frequency counts exceeding a user-specified threshold over data streams, with the following advantages:
  - Simple
  - Low memory footprint
  - Output is approximate but guaranteed not to exceed a user-specified error parameter
  - Can be deployed for streams of singleton items and can handle streams of variable-sized sets of items
5 Overview
- Introduction
- Frequency counting applications
- Problem definition
- Algorithm for Frequent Items
- Algorithm for Frequent Sets of Items
- Experimental results
- Summary
6 Introduction
7 Motivating examples
- Iceberg Queries: perform an aggregate function over an attribute and eliminate those below some threshold.
- Association Rules: require computation of frequent itemsets.
- Iceberg Datacubes: group-bys of a CUBE operator whose aggregate frequency exceeds a threshold.
- Traffic measurement: requires identification of flows that exceed a certain fraction of total traffic.
8 What's out there today
- Algorithms that compute exact results
- They attempt to minimize the number of data passes (the best algorithms take two passes)
- Problems when adapted to streams:
  - Only one pass is allowed.
  - Results are expected to be available with short response time.
  - They fail to provide any a priori guarantee on the quality of their output.
9 Why Streams? Streams vs. Stored data
- The volume of a stream over its lifetime can be huge
- Queries over streams require timely answers; response times need to be small
- As a result it is not possible to store the stream in its entirety
10 Frequency counting applications
11 Existing applications for the following problems
- Iceberg Queries: perform an aggregate function over an attribute and eliminate those below some threshold.
- Association Rules: require computation of frequent itemsets.
- Iceberg Datacubes: group-bys of a CUBE operator whose aggregate frequency exceeds a threshold.
- Traffic measurement: requires identification of flows that exceed a certain fraction of total traffic.
12 Iceberg Queries: identify aggregates that exceed a user-specified threshold r
- One of the published algorithms to compute iceberg queries efficiently uses repeated hashing over multiple passes.
- Basic idea:
  - In the first pass a set of counters is maintained.
  - Each incoming item is hashed to one of the counters, which is incremented.
  - These counters are then compressed to a bitmap, with a 1 denoting a large counter value.
  - In the second pass exact frequencies are maintained only for those elements that hash to a counter whose bitmap value is 1.
- This algorithm is difficult to adapt for streams because it requires two passes.
M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. Ullman. Computing iceberg queries efficiently. In Proc. of 24th Intl. Conf. on Very Large Data Bases, pages 299-310, 1998.
13 Association Rules
- Definitions:
  - Transaction: a subset of items drawn from I, the universe of all items.
  - An itemset X ⊆ I has support s if X occurs as a subset in at least a fraction s of all transactions.
  - Association rules over a set of transactions are of the form X => Y, where X and Y are subsets of I such that X ∩ Y = ∅ and X ∪ Y has support exceeding a user-specified threshold s.
  - Confidence of a rule X => Y is the value support(X ∪ Y) / support(X).
14 Example - Market basket analysis
For support = 50% and confidence = 50%, we have the following rules:
- 1 => 3 with 50% support and 66% confidence
- 3 => 1 with 50% support and 100% confidence
15 Reduce to computing frequent itemsets
For support = 50%, confidence = 50%:
- For the rule 1 => 3:
  - Support = Support({1, 3}) = 50%
  - Confidence = Support({1, 3}) / Support({1}) = 66%
(A small worked example follows below.)
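As a worked illustration, the sketch below uses a hypothetical four-transaction dataset that is consistent with the numbers above; the dataset and helper names are my own, not the slide's.

```python
# Hypothetical transactions consistent with the numbers above
transactions = [{1, 3}, {1, 3, 5}, {1, 2}, {2, 5}]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Confidence of rule X => Y: support(X ∪ Y) / support(X)."""
    return support(x | y, transactions) / support(x, transactions)

print(support({1, 3}, transactions))       # 0.5   -> 50% support
print(confidence({1}, {3}, transactions))  # 0.66… -> 66% confidence for 1 => 3
print(confidence({3}, {1}, transactions))  # 1.0   -> 100% confidence for 3 => 1
```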
16 Toivonen's algorithm
- Based on sampling of the data stream.
- Basically, in the first pass frequencies are computed for a sample of the stream, and in the second pass the validity of these items is determined. Can be adapted for data streams.
- Problems:
  - false negatives occur because the error in frequency counts is two-sided
  - for small values of ε the number of samples required is enormous, on the order of 1/ε (100 million samples)
17 Network flow identification
- Flow: a sequence of transport-layer packets that share the same source/destination addresses.
- Estan and Verghese proposed an algorithm for identifying flows that exceed a certain threshold. The algorithm is a combination of repeated hashing and sampling, similar to those for iceberg queries.
- The algorithm presented in this paper is directly applicable to the problem of network flow identification. It beats their algorithm in terms of space requirements.
18 Problem definition
19 Problem Definition
- The algorithm accepts two user-specified parameters:
  - a support threshold s ∈ (0, 1)
  - an error parameter ε ∈ (0, 1), with ε << s
- N: length of the stream (i.e., the number of tuples seen so far)
- Itemset: a set of items
- Item(set) denotes an item or an itemset
- At any point in time, the algorithm can be asked to produce a list of item(set)s along with their estimated frequencies.
20 Approximation guarantees
- All item(set)s whose true frequency exceeds sN are output. There are no false negatives.
- No item(set) whose true frequency is less than (s - ε)N is output.
- Estimated frequencies are less than the true frequencies by at most εN.
21 Input Example
- s = 0.1
- ε, as a rule of thumb, should be set to one-tenth or one-twentieth of s; here ε = 0.01
- As per property 1, ALL elements with frequency exceeding 0.1·N will be output.
- As per property 2, NO element with frequency below 0.09·N will be output.
- Elements with frequency between 0.09·N and 0.1·N may or may not be output. Those that make their way in are false positives.
- As per property 3, all estimated frequencies are less than the true frequencies by at most 0.01·N.
22 Problem Definition cont.
- An algorithm maintains an ε-deficient synopsis if its output satisfies the aforementioned properties.
- Goal: devise algorithms that support an ε-deficient synopsis using as little main memory as possible.
23 The Algorithms for Frequent Items
- Sticky Sampling
- Lossy Counting
24 Sticky Sampling Algorithm
[Figure: a stream of elements (34, 15, 30, 28, 31, 41, 23, 35, 19); counters are created by sampling]
25 Notations
- Data structure S: a set of entries of the form (e, f)
- f estimates the frequency of an element e
- r: sampling rate. Sampling an element with rate r means we select the element with probability 1/r
26 Sticky Sampling cont.
- Initially S is empty and r = 1.
- For each incoming element e:
    if (e exists in S)
        increment the corresponding f
    else
        sample the element with rate r
        if (sampled)
            add entry (e, 1) to S
        else
            ignore e
27 The sampling rate
- Let t = 1/ε · log(s⁻¹ δ⁻¹), where δ is the probability of failure
- The first 2t elements are sampled at rate r = 1
- The next 2t elements at rate r = 2
- The next 4t elements at rate r = 4
- and so on
28 Sticky Sampling cont.
- Whenever the sampling rate r changes, for each entry (e, f) in S:
    repeat
        toss an unbiased coin
        if (toss is not successful)
            diminish f by one
            if (f == 0)
                delete the entry from S and break
    until the toss is successful
(A runnable sketch of the full algorithm follows below.)
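A minimal runnable sketch combining the last three slides, assuming Python, natural logarithms, and my own names (StickySampling, process, output); this is a reconstruction, not the authors' code.

```python
import math
import random

class StickySampling:
    """Sketch of Sticky Sampling; s = support, eps = error, delta = failure prob."""

    def __init__(self, s, eps, delta):
        self.s, self.eps = s, eps
        self.t = (1.0 / eps) * math.log(1.0 / (s * delta))  # t = 1/eps * log(1/(s*delta))
        self.r = 1                      # current sampling rate
        self.n = 0                      # stream length N seen so far
        self.next_change = 2 * self.t   # first 2t elements use r = 1
        self.S = {}                     # element -> estimated frequency f

    def _maybe_increase_rate(self):
        # After the first 2t elements, the rate doubles each time the
        # current rate's quota (r * t elements) is exhausted.
        while self.n > self.next_change:
            self.r *= 2
            self.next_change += self.r * self.t
            self._adjust_entries()

    def _adjust_entries(self):
        # For each entry, diminish f once per unsuccessful unbiased coin toss,
        # stopping at the first successful toss; drop entries that reach 0.
        for e in list(self.S):
            while random.random() < 0.5:        # unsuccessful toss
                self.S[e] -= 1
                if self.S[e] == 0:
                    del self.S[e]
                    break

    def process(self, e):
        self.n += 1
        self._maybe_increase_rate()
        if e in self.S:
            self.S[e] += 1
        elif random.random() < 1.0 / self.r:    # sample with rate r
            self.S[e] = 1

    def output(self):
        # Report entries with f >= (s - eps) * N
        return {e: f for e, f in self.S.items()
                if f >= (self.s - self.eps) * self.n}
```

Usage is simply `ss.process(x)` per stream element and `ss.output()` whenever a list of frequent items is requested.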
29 Sticky Sampling cont.
- The number of unsuccessful coin tosses follows a geometric distribution.
- Effectively, after each rate change S is transformed to exactly the state it would have been in if the new rate had been used from the beginning.
- When a user requests a list of items with threshold s, the output is those entries in S where f ≥ (s - ε)N.
30 Theorem 1
- Sticky Sampling computes an ε-deficient synopsis with probability at least 1 - δ using at most 2/ε · log(s⁻¹ δ⁻¹) expected number of entries.
31 Theorem 1 - proof
- The first 2t elements find their way into S.
- When r ≥ 2, we can write N = rt + rt′ with t′ ∈ [1, t), which gives 1/r ≥ t/N.
- The error in the frequency of an element e corresponds to a sequence of unsuccessful coin tosses during the first few occurrences of e. The probability that this sequence is longer than εN is at most (1 - 1/r)^(εN) < (1 - t/N)^(εN) < e^(-εt).
- The number of elements with frequency at least sN is no more than 1/s, so the probability that the estimate for any of them is deficient by εN is at most e^(-εt)/s.
32 Theorem 1 proof cont.
- The probability of failure should be at most δ. This yields e^(-εt)/s ≤ δ.
- Solving for t gives t ≥ 1/ε · log(s⁻¹ δ⁻¹).
- Since the space requirement is 2t entries in expectation, the space bound follows.
(The algebra is spelled out below.)
33 Sticky Sampling summary
- The algorithm is called Sticky Sampling because S sweeps over the stream like a magnet, attracting all elements which already have an entry in S.
- The space complexity is independent of N.
- The idea of maintaining samples was first presented by Gibbons and Matias, who used it to solve the top-k problem.
- This algorithm is different in that the sampling rate r increases logarithmically so as to produce ALL items with frequency exceeding sN, not just the top k.
34 Lossy Counting
Divide the stream into buckets. Keep exact counters for items in the buckets. Prune entries at bucket boundaries.
35 Lossy Counting cont.
- A deterministic algorithm that computes frequency counts over a stream of single-item transactions, satisfying the guarantees outlined earlier (Section 3 of the paper), using at most 1/ε · log(εN) space, where N denotes the current length of the stream.
- The user specifies two parameters:
  - support s
  - error ε
36 Definitions
- The incoming stream is conceptually divided into buckets of width w = ceil(1/ε).
- Buckets are labeled with bucket ids, starting from 1.
- Denote the current bucket id by b_current, whose value is ceil(N/w).
- Denote by f_e the true frequency of an element e in the stream seen so far.
- Data structure D is a set of entries of the form (e, f, Δ).
37 The algorithm
- Initially D is empty.
- Receive element e:
    if (e exists in D)
        increment its frequency f by 1
    else
        create a new entry (e, 1, b_current - 1)
- At a bucket boundary, prune D by the following rule: (e, f, Δ) is deleted if f + Δ ≤ b_current.
- When the user requests a list of items with threshold s, output those entries in D where f ≥ (s - ε)N.
(A runnable sketch follows below.)
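A minimal sketch of the single-item algorithm, assuming the class and method names below (LossyCounting, process, output), which are mine, not the paper's.

```python
import math

class LossyCounting:
    """Sketch of Lossy Counting for single items; eps is the error parameter."""

    def __init__(self, eps):
        self.eps = eps
        self.w = math.ceil(1.0 / eps)    # bucket width w = ceil(1/eps)
        self.n = 0                       # stream length N
        self.D = {}                      # element -> (f, delta)

    def process(self, e):
        self.n += 1
        b_current = math.ceil(self.n / self.w)
        if e in self.D:
            f, delta = self.D[e]
            self.D[e] = (f + 1, delta)
        else:
            # delta = b_current - 1 bounds how often e may have occurred
            # before this entry was created
            self.D[e] = (1, b_current - 1)
        # prune at bucket boundaries: delete entries with f + delta <= b_current
        if self.n % self.w == 0:
            self.D = {x: (f, d) for x, (f, d) in self.D.items()
                      if f + d > b_current}

    def output(self, s):
        # report entries with f >= (s - eps) * N
        return {x: f for x, (f, d) in self.D.items()
                if f >= (s - self.eps) * self.n}
```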
38 Some algorithm facts
- For an entry (e, f, Δ), f represents the exact frequency count of e ever since it was inserted into D.
- The value Δ is the maximum number of times e could have occurred in the first b_current - 1 buckets (this value is exactly b_current - 1 at the time of insertion).
- Once an entry is inserted into D, its value Δ remains unchanged.
39 Lossy counting in action
[Figure: D starts out empty and fills up as the stream is processed]
At a window (bucket) boundary, remove entries for which f + Δ ≤ b_current.
40 Lossy counting in action cont.
At a window (bucket) boundary, remove entries for which f + Δ ≤ b_current.
41 Lemma 1
- Whenever deletions occur, b_current ≤ εN.
- Proof: deletions occur only at bucket boundaries, where N = b_current · w ≥ b_current/ε, and hence b_current ≤ εN.
42 Lemma 2
- Whenever an entry (e, f, Δ) gets deleted, f_e ≤ b_current.
- Proof by induction:
  - Base case: b_current = 1. (e, f, Δ) is deleted only if f ≤ 1, thus f_e ≤ b_current (since f_e = f).
  - Induction step: consider an entry (e, f, Δ) that gets deleted for some b_current > 1.
    - This entry was inserted when bucket Δ + 1 was being processed.
    - An earlier entry for e could have been deleted as late as the time when bucket Δ became full.
    - By the induction hypothesis, the true frequency of e up to that point was no more than Δ.
    - f is the true frequency of e since this entry was inserted, so f_e ≤ f + Δ.
    - Combined with the deletion rule f + Δ ≤ b_current, this gives f_e ≤ b_current.
43 Lemma 3
- If e does not appear in D, then f_e ≤ εN.
- Proof: if the lemma is true for an element e at the moment its entry gets deleted, it remains true for all later N as well. From Lemmas 1 and 2 we infer that f_e ≤ b_current ≤ εN whenever the entry gets deleted.
44 Lemma 4
- If (e, f, Δ) ∈ D, then f ≤ f_e ≤ f + εN.
- Proof:
  - If Δ = 0, then f = f_e.
  - Otherwise, e was possibly deleted during the first Δ buckets. From Lemma 2, f_e ≤ f + Δ, and Δ ≤ b_current - 1 ≤ εN.
  - Conclusion: f ≤ f_e ≤ f + εN.
45 Lossy Counting cont.
- Lemma 3 shows that all elements whose true frequency exceeds εN have entries in D.
- Lemma 4 shows that the estimated frequencies of all such elements are accurate to within εN.
- => D correctly maintains an ε-deficient synopsis.
46 Theorem 2
- Lossy Counting computes an ε-deficient synopsis using at most 1/ε · log(εN) entries.
47 Theorem 2 - proof
- Let B = b_current.
- Let d_i denote the number of entries in D whose bucket id is B - i + 1 (for i ∈ [1, B]).
- An element e counted in d_i must occur at least i times in buckets B - i + 1 through B; otherwise it would have been deleted.
- We get the following constraint: (1) Σ_{i=1..j} i·d_i ≤ j·w for j = 1, 2, ..., B.
48 Theorem 2 proof cont.
- The following inequality can be proved by induction from (1): Σ_{i=1..j} d_i ≤ Σ_{i=1..j} w/i for j = 1, 2, ..., B.
- |D| = Σ_{i=1..B} d_i.
- From the above inequality, |D| ≤ Σ_{i=1..B} w/i ≤ 1/ε · log B = 1/ε · log(εN).
(The last step is spelled out below.)
49 Sticky Sampling vs. Lossy Counting
[Chart: number of entries vs. log10 of N (stream length), for support s = 1% and error ε = 0.1%]
50 Sticky Sampling vs. Lossy Counting cont.
Kinks in the curve for Sticky Sampling correspond to re-sampling. Kinks in the curve for Lossy Counting correspond to bucket boundaries.
51 Sticky Sampling vs. Lossy Counting cont.

ε (%)   s (%)   SS worst   LC worst   SS Zipf   LC Zipf   SS Uniq   LC Uniq
0.1     1.0     27K        9K         6K        419       27K       1K
0.05    0.5     58K        17K        11K       709       58K       2K
0.01    0.1     322K       69K        37K       2K        322K      10K
0.005   0.05    672K       124K       62K       4K        672K      20K

SS = Sticky Sampling, LC = Lossy Counting, Zipf = Zipfian distribution, Uniq = stream with no duplicates
52 Sticky Sampling vs. Lossy Counting - summary
- Lossy Counting is superior by a large factor.
- Sticky Sampling performs worse because of its tendency to remember every unique element that gets sampled.
- Lossy Counting is good at pruning low-frequency elements quickly.
53 Comparison with alternative approaches
- Toivonen: sampling algorithm for association rules.
- Sticky Sampling beats that approach by roughly a factor of 1/ε.
54 Comparison with alternative approaches cont.
- [KPS02]: in the first pass the algorithm maintains 1/ε elements with their frequencies. If a counter exists for an element it is incremented; if there is a free counter, the element is inserted; otherwise all existing counters are decremented by one.
- Can be used to maintain an ε-deficient synopsis with exactly 1/ε space.
- If the input stream is Zipfian, Lossy Counting takes less than 1/ε space: for ε = 0.01%, roughly 2000 entries, about 20% of 1/ε.
55 Frequent Sets of Items
56 Frequent Sets of Items
Identify all subsets of items whose current frequency exceeds s = 0.1.
57 Frequent itemsets algorithm
- Input: a stream of transactions, where each transaction is a set of items from I.
- N: length of the stream.
- The user specifies two parameters: support s and error ε.
- Challenges:
  - handling variable-sized transactions
  - avoiding explicit enumeration of all subsets of any transaction
58 Notations
- Data structure D: a set of entries of the form (set, f, Δ).
- Transactions are divided into buckets.
- w = ceil(1/ε): the number of transactions in each bucket.
- b_current: current bucket id.
- Transactions are not processed one by one. Main memory is filled with as many transactions as possible, and processing is done on a batch of transactions.
- B: the number of buckets in main memory in the current batch being processed.
59 Update D
- UPDATE_SET: for each entry (set, f, Δ) ∈ D, update f by counting the occurrences of set in the current batch. If the updated entry satisfies f + Δ ≤ b_current, delete this entry.
- NEW_SET: if a set set has frequency f ≥ B in the current batch and set does not occur in D, create a new entry (set, f, b_current - B).
(A simplified sketch follows below.)
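To make the two rules concrete, here is a deliberately naive sketch that enumerates every subset of every transaction; the real implementation avoids exactly this enumeration via the BUFFER / TRIE / SetGen modules described next. All names and the dict-based representation of D are assumptions.

```python
import math
from itertools import combinations

def process_batch(D, batch, eps, b_current):
    """Naive sketch of one batch of Lossy Counting over itemsets.
    D maps frozenset -> (f, delta); `batch` is a list of transactions (sets).
    WARNING: enumerating all subsets is exponential in transaction size;
    the paper's SetGen module exists precisely to avoid this."""
    w = math.ceil(1.0 / eps)
    B = len(batch) // w                 # number of complete buckets in this batch
    assert B >= 1, "batch is assumed to span at least one full bucket"
    b_new = b_current + B

    # count occurrences of every subset appearing in the batch
    counts = {}
    for t in batch:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for sub in combinations(items, k):
                key = frozenset(sub)
                counts[key] = counts.get(key, 0) + 1

    # UPDATE_SET: refresh existing entries, delete those with
    # f + delta <= b_current (taken after the batch, i.e. b_new)
    for s, (f, delta) in list(D.items()):
        f += counts.pop(s, 0)
        if f + delta <= b_new:
            del D[s]
        else:
            D[s] = (f, delta)

    # NEW_SET: sets not previously in D that occur at least B times in the batch
    for s, f in counts.items():
        if f >= B:
            D[s] = (f, b_new - B)

    return b_new                        # updated value of b_current
```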
60 Algorithm facts
- If f_set ≥ εN, then set has an entry in D.
- If (set, f, Δ) ∈ D, then the true frequency f_set satisfies the inequality f ≤ f_set ≤ f + Δ.
- When a user requests a list of itemsets with threshold s, output those entries in D where f ≥ (s - ε)N.
- B needs to be a large number: any subset of I that occurs B + 1 times or more contributes an entry to D.
61Three modules
maintains the data structure D
implement UPDATE_SET, NEW_SET
operates on the current batch of transactions
repeatedly reads in a batch of transactionsinto
available main memory
62 Module 1 - BUFFER
- Reads in a batch of transactions.
- Transactions are laid out one after the other in a big array.
- A bitmap is used to remember transaction boundaries.
- After reading in a batch, BUFFER sorts each transaction by its item-ids.
63 Module 2 - TRIE
64 Module 2 - TRIE cont.
- Nodes are labeled (item-id, f, Δ, level).
- Children of any node are ordered by their item-ids.
- Root nodes are also ordered by their item-ids.
- A node represents the itemset consisting of the item-ids in that node and all of its ancestors.
- The TRIE is maintained as an array of entries of the form (item-id, f, Δ, level), corresponding to a pre-order traversal of the trees. This is equivalent to a lexicographic ordering of the subsets it encodes.
- There are no pointers; the levels compactly encode the underlying tree structure. (A small illustration follows below.)
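A small illustration of this pointer-free layout; the item-ids, frequencies, and Δ values below are invented, and the itemsets() helper is mine, shown only to demonstrate how the level field recovers the tree.

```python
# Hypothetical trie encoded as a flat array of (item_id, f, delta, level)
# entries written in pre-order; `level` alone recovers parent/child structure.
trie = [
    (1, 40, 0, 0),   # root {1}
    (2, 30, 0, 1),   #   child {1, 2}
    (3, 25, 0, 2),   #     child {1, 2, 3}
    (4, 12, 0, 1),   #   child {1, 4}
    (2, 35, 0, 0),   # root {2}
    (5, 20, 0, 1),   #   child {2, 5}
]

def itemsets(trie):
    """Recover the itemset each entry represents from the pre-order + level encoding."""
    path = []
    for item_id, f, delta, level in trie:
        # keep ancestors at levels 0..level-1, then append the current node
        path = path[:level] + [item_id]
        yield (set(path), f, delta)

for s, f, d in itemsets(trie):
    print(sorted(s), f, d)
```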
65 Module 3 - SetGen
SetGen uses the following pruning rule: if a subset S does not make its way into the TRIE after application of both UPDATE_SET and NEW_SET, then no supersets of S need to be considered.
66 Overall Algorithm
67 Efficient Implementations: BUFFER
- If item-ids are successive integers from 1 through |I|, and |I| is small enough (less than 1 million): maintain exact frequency counts for singleton sets, prune away those item-ids whose frequency is less than εN, and only then sort the transactions.
- If |I| = 10^5, the array size is 0.4 MB.
68 Efficient Implementations: TRIE
- Take advantage of the fact that the sets produced by SetGen arrive in lexicographic order.
- Maintain the TRIE as a set of fairly large chunks of memory instead of one huge array.
- Instead of modifying the original TRIE, create a new TRIE.
- Chunks from the old TRIE are freed as soon as they are no longer required.
- By the time SetGen finishes, the chunks of the original TRIE have been discarded.
69 Efficient Implementations: SetGen
- Employs a priority queue called Heap.
- Initially the Heap contains pointers to the smallest item-ids of all transactions in BUFFER.
- Duplicate members are maintained together and constitute a single item in the Heap; all these pointers are chained together.
- The space for the Heap is derived from BUFFER by replacing item-ids with pointers.
70 Efficient Implementations: SetGen cont.
[Figure: BUFFER contents 3 1 6 5 4 2 5 4 1 3 2 1, with the Heap holding pointers to the smallest item-ids]
71 Efficient Implementations: SetGen cont.
- Repeatedly process the smallest item-id in the Heap to generate singleton sets.
- If the singleton belongs to the TRIE after UPDATE_SET and NEW_SET, try to generate the next set by extending the current singleton set.
- This is done by invoking SetGen recursively with a new Heap created out of the successors of the pointers to the item-ids just processed and removed.
- When the recursive call returns, the smallest entry in the Heap is removed and all successors of the currently smallest item-id are added.
(A simplified sketch follows below.)
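A simplified reconstruction of this recursion, using Python's heapq over (item-id, transaction, position) pointers instead of the chained-pointer representation in BUFFER; the visit callback stands in for the interaction with the TRIE (UPDATE_SET / NEW_SET). The interface and names are assumptions, not the authors' code.

```python
import heapq

def setgen(buffer_, pointers, prefix, visit):
    """buffer_  : list of sorted transactions (lists of item-ids)
       pointers : list of (tid, pos) pointers into the transactions
       prefix   : itemset generated so far (tuple of item-ids)
       visit(itemset, count) -> bool : stands in for UPDATE_SET / NEW_SET;
           returns True iff the itemset made it into the TRIE."""
    # build a heap of (item_id, tid, pos) from the live pointers
    heap = [(buffer_[tid][pos], tid, pos) for tid, pos in pointers
            if pos < len(buffer_[tid])]
    heapq.heapify(heap)

    while heap:
        smallest = heap[0][0]
        group = []                       # pointers whose current item-id == smallest
        while heap and heap[0][0] == smallest:
            _, tid, pos = heapq.heappop(heap)
            group.append((tid, pos))

        itemset = prefix + (smallest,)
        if visit(itemset, len(group)):
            # pruning rule: only supersets of surviving itemsets are generated
            setgen(buffer_, [(tid, pos + 1) for tid, pos in group], itemset, visit)

        # advance the processed pointers so later item-ids of these
        # transactions are still considered at this level
        for tid, pos in group:
            if pos + 1 < len(buffer_[tid]):
                heapq.heappush(heap, (buffer_[tid][pos + 1], tid, pos + 1))

# Example use: enumerate all subsets of a tiny batch (visit always returns True)
# batch = [[1, 2, 3], [2, 3], [1, 3]]
# setgen(batch, [(i, 0) for i in range(len(batch))], (),
#        lambda s, c: print(s, c) or True)
```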
72 Efficient Implementations: SetGen cont.
[Figure: the same BUFFER array (3 1 6 5 4 2 5 2 1 3 2 1) after processing the smallest item-ids; the Heap now holds their successors]
73 System issues and optimizations
- BUFFER scans the incoming stream by memory-mapping the input file.
- The standard qsort is used to sort transactions.
- Threading SetGen and BUFFER does not help, because SetGen is significantly slower.
- The rate at which tries are scanned is much smaller than the rate at which sequential disk I/O can be done.
- It is therefore possible to maintain the TRIE on disk without loss in performance.
74 System issues and optimizations: TRIE on disk - advantages
- The size of the TRIE is not limited by main memory; the algorithm can function with a small amount of main memory.
- Since most of the available main memory can be devoted to BUFFER, the algorithm can handle smaller values of ε than other algorithms can.
75 Novel features of this technique
- No candidate generation phase.
- The compact disk-based trie is novel.
- Able to compute frequent itemsets under low memory conditions.
- Able to handle smaller values of the support threshold than previously possible.
76 Experimental results
77 Experimental Results
78 What is studied
- Support threshold s
- Number of transactions N
- Size of BUFFER B
- Total time taken t
with ε set to 0.1·s
79 Varying buffer sizes and support s
[Chart] Decreasing s leads to increases in running time.
80 Varying support s and buffer size B
[Chart: time taken in seconds] Kinks occur due to the TRIE optimization on the last batch.
81 Varying length N and support s
[Chart] Running time is linearly proportional to the length of the stream. The curve flattens at the end because processing the last batch is faster.
82 Comparison with Apriori

                 Apriori              Our Algorithm (4 MB Buffer)   Our Algorithm (44 MB Buffer)
Support          Time     Memory      Time     Memory               Time     Memory
0.001            99 s     82 MB       111 s    12 MB                27 s     45 MB
0.002            25 s     53 MB       94 s     10 MB                15 s     45 MB
0.004            14 s     48 MB       65 s     7 MB                 8 s      45 MB
0.006            13 s     48 MB       46 s     6 MB                 6 s      45 MB
0.008            13 s     48 MB       34 s     5 MB                 4 s      45 MB
0.010            14 s     48 MB       26 s     5 MB                 4 s      45 MB

Dataset: IBM T10.I4.1000K with 1M transactions, average transaction size 10.
83 Comparison with Iceberg Queries
Query: identify all word pairs in 100K web documents which co-occur in at least 0.5% of the documents.
[FSGM+98] multiple-pass algorithm: 7000 seconds with 30 MB memory.
Our single-pass algorithm: 4500 seconds with 26 MB memory.
84 Summary
A novel algorithm for computing approximate frequency counts over data streams.
85 Summary: advantages of the algorithms presented
- Require provably small main memory footprints.
- Each of the motivating examples can now be solved over streaming data.
- Handle smaller values of the support threshold than previously possible.
- Remain practical in environments with moderate main memory.
86 Summary cont.
- Give an a priori error guarantee.
- Work for variable-sized transactions.
- Optimized implementation for frequent itemsets.
- For the datasets tested, the algorithm runs in one pass and produces exact results, beating previous algorithms in terms of time.
87 Questions?
- More questions/comments can be sent to michal.spivak_at_sun.com