Title: Approximate Frequency Counts over Data Streams
1. Approximate Frequency Counts over Data Streams
- Loo Kin Kong
- 4th Oct., 2002
2. Plan
- Motivation
- Paper review: Approximate Frequency Counts over Data Streams
- Finding frequent items
- Finding frequent itemsets
- Performance
- Conclusion
3. Motivation
- In some new applications, data arrive as a continuous stream
- The sheer volume of a stream over its lifetime is huge
- Queries require timely answers
- Examples:
- Stock ticks
- Network traffic measurements
4. Frequent itemset mining: offline databases vs. data streams
- Level-wise algorithms are often used to mine offline databases
- E.g., the Apriori algorithm and its variants
- At least 2 database scans are needed
- Level-wise algorithms cannot be applied to mine data streams
- A data stream cannot be scanned multiple times
5. Paper review: Approximate Frequency Counts over Data Streams
- By G. S. Manku and R. Motwani
- Published in VLDB 2002
- Main contributions of the paper:
- Proposed 2 algorithms to find the frequent items appearing in a data stream of items
- Extended the algorithms to find frequent itemsets
6. Notations
- Some notations used throughout:
- Let N denote the current length of the stream
- Let s ∈ (0,1) denote the support threshold
- Let ε ∈ (0,1) denote the error tolerance
7. Goals of the paper
- The algorithm ensures that:
- All itemsets whose true frequency exceeds sN are reported
- No itemset whose true frequency is less than (s − ε)N is output
- Estimated frequencies are less than the true frequencies by at most εN
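- For example (illustrative numbers, not from the paper): with s = 1%, ε = 0.1%, and N = 1,000,000, every item occurring more than 10,000 times is reported, no item occurring fewer than 9,000 times is output, and each reported count underestimates the true count by at most 1,000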
8. The simple case: finding frequent items
- Each transaction in the stream contains only 1 item
- 2 algorithms were proposed, namely:
- Sticky Sampling Algorithm
- Lossy Counting Algorithm
- Features of the algorithms:
- Sampling techniques are used
- Frequency counts found are approximate, but the error is guaranteed not to exceed a user-specified tolerance level
- For Lossy Counting, all frequent items are reported
9. Sticky Sampling Algorithm
- User input includes 3 values, namely:
- Support threshold s
- Error tolerance ε
- Probability of failure δ
- Counts are kept in a data structure S
- Each entry in S is of the form (e, f), where:
- e is the item
- f is the frequency of e in the stream since the entry was inserted into S
- When queried about the frequent items, all entries (e, f) such that f ≥ (s − ε)N are reported
10. Sticky Sampling Algorithm (cont'd)
- Legend: S = the set of all counts; e = transaction (item); N = current length of stream; r = sampling rate; t = (1/ε) log(1/(sδ))
- 1. S ← ∅; N ← 0; t ← (1/ε) log(1/(sδ)); r ← 1
- 2. e ← next transaction; N ← N + 1
- 3. if (e, f) exists in S do
  - increment the count f
  - else if random(0,1) < 1/r do
  - insert (e, 1) into S
  - endif
- 4. if N = 2t · 2^n for some integer n ≥ 0 do
  - r ← 2r
  - halfSampRate(S)
  - endif
- 5. Goto 2
11. Sticky Sampling Algorithm: halfSampRate()
- function halfSampRate(S)
  - for every entry (e, f) in S do
    - while random(0,1) < 0.5 and f > 0 do
      - f ← f − 1
    - if f = 0 do
      - remove the entry from S
    - endif
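To make the two slides above concrete, here is a minimal Python sketch (class and method names are our own, and the rate-doubling boundary check is simplified to a threshold comparison):

    import math
    import random

    class StickySampling:
        def __init__(self, s, eps, delta):
            self.s, self.eps = s, eps
            self.t = (1.0 / eps) * math.log(1.0 / (s * delta))
            self.r = 1       # new items are sampled with probability 1/r
            self.n = 0       # stream length N
            self.S = {}      # item -> f

        def process(self, e):
            self.n += 1
            if e in self.S:
                self.S[e] += 1
            elif random.random() < 1.0 / self.r:
                self.S[e] = 1
            if self.n >= 2 * self.t * self.r:   # boundary N = 2t·2^n
                self.r *= 2
                self._half_samp_rate()

        def _half_samp_rate(self):
            # Toss an unbiased coin per entry until heads, decrementing f
            # for each tails; drop entries whose count reaches 0.
            for e in list(self.S):
                while random.random() < 0.5 and self.S[e] > 0:
                    self.S[e] -= 1
                if self.S[e] == 0:
                    del self.S[e]

        def frequent_items(self):
            # Report entries with f >= (s - eps) * N
            return [e for e, f in self.S.items()
                    if f >= (self.s - self.eps) * self.n]

For example, StickySampling(0.01, 0.001, 0.0001) tracks items with support above 1% within 0.1% error, with the guarantee failing with probability at most 0.01%.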
12. Lossy Counting Algorithm
- Incoming data stream is conceptually divided into buckets of ⌈1/ε⌉ transactions
- Counts are kept in a data structure D
- Each entry in D is of the form (e, f, Δ), where:
- e is the item
- f is the frequency of e in the stream since the entry was inserted into D
- Δ is the maximum possible count of e in the stream before e was added to D
13. Lossy Counting Algorithm (cont'd)
- Legend: D = the set of all counts; N = current length of stream; e = transaction (item); w = bucket width; b = current bucket id
- 1. D ← ∅; N ← 0
- 2. w ← ⌈1/ε⌉; b ← 1
- 3. e ← next transaction; N ← N + 1
- 4. if (e, f, Δ) exists in D do
  - f ← f + 1
  - else do
  - insert (e, 1, b − 1) into D
  - endif
- 5. if N mod w = 0 do
  - prune(D, b); b ← b + 1
  - endif
- 6. Goto 3
14. Lossy Counting Algorithm: prune()
- function prune(D, b)
  - for each entry (e, f, Δ) in D do
    - if f + Δ ≤ b do
      - remove the entry from D
    - endif
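A minimal Python sketch of slides 12-14 (names are our own):

    import math

    class LossyCounting:
        def __init__(self, s, eps):
            self.s, self.eps = s, eps
            self.w = math.ceil(1.0 / eps)   # bucket width
            self.b = 1                      # current bucket id
            self.n = 0                      # stream length N
            self.D = {}                     # item -> [f, delta]

        def process(self, e):
            self.n += 1
            if e in self.D:
                self.D[e][0] += 1
            else:
                self.D[e] = [1, self.b - 1]
            if self.n % self.w == 0:        # bucket boundary
                self._prune()
                self.b += 1

        def _prune(self):
            # Delete entries whose count plus maximum prior error
            # no longer exceeds the current bucket id.
            for e in list(self.D):
                f, delta = self.D[e]
                if f + delta <= self.b:
                    del self.D[e]

        def frequent_items(self):
            return [e for e, (f, _) in self.D.items()
                    if f >= (self.s - self.eps) * self.n]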
15. Lossy Counting
- Lossy Counting guarantees that:
- When a deletion occurs, b ≤ εN
- If an entry (e, f, Δ) is deleted, then fe ≤ b, where fe is the true frequency count of e
- Hence, if an entry (e, f, Δ) is deleted, fe ≤ εN
- Finally, f ≤ fe ≤ f + εN
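- To see why: deletions happen only at bucket boundaries, where N = b·w and w = ⌈1/ε⌉ ≥ 1/ε, so b = N/w ≤ εN; a deleted entry satisfies f + Δ ≤ b, and since fe ≤ f + Δ, it follows that fe ≤ b ≤ εN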
16. Sticky Sampling vs. Lossy Counting
- Sticky Sampling is non-deterministic, while Lossy Counting is deterministic
- Experimental results show that Lossy Counting requires fewer entries than Sticky Sampling
17. The more complex case: finding frequent itemsets
- The Lossy Counting algorithm is extended to find frequent itemsets
- Transactions in the data stream contain any number of items
18. Overview of the algorithm
- Incoming data stream is conceptually divided into buckets of ⌈1/ε⌉ transactions
- Counts are kept in a data structure D
- Multiple buckets (β of them, say) are processed in a batch
- Each entry in D is of the form (set, f, Δ), where:
- set is the itemset
- f is the frequency of set in the stream since the entry was inserted into D
- Δ is the maximum possible count of set in the stream before set was added to D
19. Overview of the algorithm (cont'd)
- D is updated by the operations UPDATE_SET and NEW_SET (see the sketch after this list)
- UPDATE_SET updates and deletes entries in D
- For each entry (set, f, Δ), count the occurrences of set in the batch and update the entry
- If an updated entry satisfies f + Δ ≤ b_current, the entry is removed from D
- NEW_SET inserts new entries into D
- If a set set has frequency f ≥ β in the batch and set does not occur in D, create a new entry (set, f, b_current − β)
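A minimal Python sketch of one batch step, assuming D maps frozensets to (f, Δ) pairs; the names and the brute-force subset enumeration are ours, for clarity only (the paper avoids full enumeration via SetGen, described next):

    from collections import Counter
    from itertools import combinations

    def process_batch(D, batch, b_current, beta):
        # Count every subset occurring in the batch (exponential in
        # transaction size; illustration only).
        freq = Counter()
        for t in batch:
            for k in range(1, len(t) + 1):
                for sub in combinations(sorted(t), k):
                    freq[frozenset(sub)] += 1
        existed = set(D)
        # UPDATE_SET: add batch counts; delete entries with f + delta <= b_current
        for s in list(D):
            f, delta = D[s]
            f += freq.get(s, 0)
            if f + delta <= b_current:
                del D[s]
            else:
                D[s] = (f, delta)
        # NEW_SET: insert previously untracked sets with batch count >= beta
        for s, f in freq.items():
            if s not in existed and f >= beta:
                D[s] = (f, b_current - beta)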
20. Implementation
- Challenges:
- Avoid enumerating all subsets of a transaction
- The data structure must be compact for better space efficiency
- 3 major modules:
- Buffer
- Trie
- SetGen
21. Implementation (cont'd)
- Buffer repeatedly reads a batch of buckets of transactions, where each transaction is a set of item-ids, into available main memory
- Trie maintains the data structure D
- SetGen generates subsets of item-ids along with their frequency counts in the current batch
- Not all possible subsets need to be generated
- If a subset S is not inserted into D after application of both UPDATE_SET and NEW_SET, then no superset of S needs to be considered (see the sketch after this list)
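The pruning rule can be illustrated with a hypothetical level-wise enumeration (the paper's SetGen instead walks the Trie over sorted item-ids, but the effect of the rule is the same): a k-itemset is generated only if all of its (k − 1)-subsets survived into D.

    from collections import Counter
    from itertools import combinations

    def setgen_levelwise(batch, survives):
        # survives(s) should report whether itemset s is present in D
        # after UPDATE_SET and NEW_SET were applied to its level.
        k, kept = 1, set()
        while True:
            counts = Counter()
            for t in batch:
                for cand in combinations(sorted(t), k):
                    # Prune: skip candidates with a non-surviving subset
                    if k == 1 or all(frozenset(c) in kept
                                     for c in combinations(cand, k - 1)):
                        counts[frozenset(cand)] += 1
            if not counts:
                return
            yield k, counts              # per-level batch frequencies
            kept = {s for s in counts if survives(s)}
            if not kept:
                return
            k += 1

A caller would update D with each level's counts before requesting the next level, e.g. passing survives = lambda s: s in D.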
22. Performance
- IBM synthetic dataset T10.I4.D1000K with 10K items (results chart omitted)
23. Performance (cont'd)
- Comparison with Apriori
- IBM synthetic dataset T10.I4.D1000K with 10K items (results chart omitted)
24. Conclusion
- Sticky Sampling and Lossy Counting are 2 approximate algorithms that can find frequent items
- Both algorithms produce frequency counts within a user-specified error tolerance level, though Sticky Sampling is non-deterministic
- Lossy Counting can be extended to find frequent itemsets
25. Reference
- G. S. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB 2002), Hong Kong, 2002.
26. Q & A