Title: Approximate Frequency Counts over Data Streams
1. Approximate Frequency Counts over Data Streams
- Loo Kin Kong
- 4th Oct., 2002
2. Plan
- Motivation
- Paper review: Approximate Frequency Counts over Data Streams
- Finding frequent items
- Finding frequent itemsets
- Performance
- Conclusion
3. Motivation
- In some new applications, data arrive as a continuous stream
- The sheer volume of a stream over its lifetime is huge
- Queries require timely answers
- Examples:
- Stock ticks
- Network traffic measurements
4. Frequent itemset mining: offline databases vs. data streams
- Level-wise algorithms are often used to mine offline databases
- E.g., the Apriori algorithm and its variants
- At least 2 database scans are needed
- Level-wise algorithms cannot be applied to mine data streams
- A data stream cannot be scanned multiple times
5. Paper review: Approximate Frequency Counts over Data Streams
- By G. S. Manku and R. Motwani
- Published in VLDB 2002
- Main contributions of the paper:
- Proposed 2 algorithms to find the frequent items appearing in a data stream of items
- Extended the algorithms to find frequent itemsets
6. Notations
- Some notations used throughout:
- Let N denote the current length of the stream
- Let s ∈ (0,1) denote the support threshold
- Let ε ∈ (0,1) denote the error tolerance
7. Goals of the paper
- The algorithm ensures that:
- All itemsets whose true frequency exceeds sN are reported
- No itemset whose true frequency is less than (s − ε)N is output
- Estimated frequencies are less than the true frequencies by at most εN
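- For example (illustrative numbers, not from the paper): with s = 1%, ε = 0.1%, and N = 1,000,000, every item occurring more than 10,000 times is reported, no item occurring fewer than 9,000 times is output, and each reported count underestimates the true count by at most 1,000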
8. The simple case: finding frequent items
- Each transaction in the stream contains only 1 item
- 2 algorithms were proposed, namely:
- Sticky Sampling Algorithm
- Lossy Counting Algorithm
- Features of the algorithms:
- Sampling techniques are used
- Frequency counts found are approximate, but the error is guaranteed not to exceed a user-specified tolerance level
- For Lossy Counting, all frequent items are reported
9. Sticky Sampling Algorithm
- User input includes 3 values, namely:
- Support threshold s
- Error tolerance ε
- Probability of failure δ
- Counts are kept in a data structure S
- Each entry in S is of the form (e, f), where:
- e is the item
- f is the frequency of e in the stream since the entry was inserted into S
- When queried about the frequent items, all entries (e, f) such that f ≥ (s − ε)N are reported
10. Sticky Sampling Algorithm (cont'd)
- Legend: S = the set of all counts; e = transaction (item); N = current length of stream; r = sampling rate; t = (1/ε) log(1/(sδ))
- 1. S ← ∅; N ← 0; t ← (1/ε) log(1/(sδ)); r ← 1
- 2. e ← next transaction; N ← N + 1
- 3. if (e, f) exists in S do
  - increment the count f
  - else if random(0,1) < 1/r do
  - insert (e, 1) into S
  - endif
- 4. if N = 2t · 2^n for some integer n ≥ 0 do
  - r ← 2r
  - halfSampRate(S)
  - endif
- 5. Goto 2
11. Sticky Sampling Algorithm: halfSampRate()
- function halfSampRate(S)
  - for every entry (e, f) in S do
    - while random(0,1) < 0.5 and f > 0 do
      - f ← f − 1
    - if f = 0 do
      - remove the entry from S
    - endif
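To make the two slides above concrete, here is a minimal Python sketch (class and method names are our own, and the rate-doubling boundary check is simplified to a threshold comparison):

    import math
    import random

    class StickySampling:
        def __init__(self, s, eps, delta):
            self.s, self.eps = s, eps
            self.t = (1.0 / eps) * math.log(1.0 / (s * delta))
            self.r = 1       # new items are sampled with probability 1/r
            self.n = 0       # stream length N
            self.S = {}      # item -> f

        def process(self, e):
            self.n += 1
            if e in self.S:
                self.S[e] += 1
            elif random.random() < 1.0 / self.r:
                self.S[e] = 1
            if self.n >= 2 * self.t * self.r:   # boundary N = 2t·2^n
                self.r *= 2
                self._half_samp_rate()

        def _half_samp_rate(self):
            # Toss an unbiased coin per entry until heads, decrementing f
            # for each tails; drop entries whose count reaches 0.
            for e in list(self.S):
                while random.random() < 0.5 and self.S[e] > 0:
                    self.S[e] -= 1
                if self.S[e] == 0:
                    del self.S[e]

        def frequent_items(self):
            # Report entries with f >= (s - eps) * N
            return [e for e, f in self.S.items()
                    if f >= (self.s - self.eps) * self.n]

For example, StickySampling(0.01, 0.001, 0.0001) tracks items with support above 1% within 0.1% error, with the guarantee failing with probability at most 0.01%.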
12. Lossy Counting Algorithm
- Incoming data stream is conceptually divided into buckets of ⌈1/ε⌉ transactions
- Counts are kept in a data structure D
- Each entry in D is of the form (e, f, Δ), where:
- e is the item
- f is the frequency of e in the stream since the entry was inserted into D
- Δ is the maximum possible count of e in the stream before e was added to D
13. Lossy Counting Algorithm (cont'd)
- Legend: D = the set of all counts; N = current length of stream; e = transaction (item); w = bucket width; b = current bucket id
- 1. D ← ∅; N ← 0
- 2. w ← ⌈1/ε⌉; b ← 1
- 3. e ← next transaction; N ← N + 1
- 4. if (e, f, Δ) exists in D do
  - f ← f + 1
  - else do
  - insert (e, 1, b − 1) into D
  - endif
- 5. if N mod w = 0 do
  - prune(D, b); b ← b + 1
  - endif
- 6. Goto 3
14. Lossy Counting Algorithm: prune()
- function prune(D, b)
  - for each entry (e, f, Δ) in D do
    - if f + Δ ≤ b do
      - remove the entry from D
    - endif
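A minimal Python sketch of slides 12-14 (names are our own):

    import math

    class LossyCounting:
        def __init__(self, s, eps):
            self.s, self.eps = s, eps
            self.w = math.ceil(1.0 / eps)   # bucket width
            self.b = 1                      # current bucket id
            self.n = 0                      # stream length N
            self.D = {}                     # item -> [f, delta]

        def process(self, e):
            self.n += 1
            if e in self.D:
                self.D[e][0] += 1
            else:
                self.D[e] = [1, self.b - 1]
            if self.n % self.w == 0:        # bucket boundary
                self._prune()
                self.b += 1

        def _prune(self):
            # Delete entries whose count plus maximum prior error
            # no longer exceeds the current bucket id.
            for e in list(self.D):
                f, delta = self.D[e]
                if f + delta <= self.b:
                    del self.D[e]

        def frequent_items(self):
            return [e for e, (f, _) in self.D.items()
                    if f >= (self.s - self.eps) * self.n]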
15. Lossy Counting
- Lossy Counting guarantees that:
- When a deletion occurs, b ≤ εN
- If an entry (e, f, Δ) is deleted, then fe ≤ b, where fe is the true frequency count of e
- Hence, if an entry (e, f, Δ) is deleted, fe ≤ εN
- Finally, f ≤ fe ≤ f + εN
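- To see why: deletions happen only at bucket boundaries, where N = b·w and w = ⌈1/ε⌉ ≥ 1/ε, so b = N/w ≤ εN; a deleted entry satisfies f + Δ ≤ b, and since fe ≤ f + Δ, it follows that fe ≤ b ≤ εN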
16. Sticky Sampling vs. Lossy Counting
- Sticky Sampling is non-deterministic, while Lossy Counting is deterministic
- Experimental results show that Lossy Counting requires fewer entries than Sticky Sampling
17. The more complex case: finding frequent itemsets
- The Lossy Counting algorithm is extended to find frequent itemsets
- Transactions in the data stream contain any number of items
18. Overview of the algorithm
- Incoming data stream is conceptually divided into buckets of ⌈1/ε⌉ transactions
- Counts are kept in a data structure D
- Multiple buckets (β of them, say) are processed in a batch
- Each entry in D is of the form (set, f, Δ), where:
- set is the itemset
- f is the frequency of set in the stream since the entry was inserted into D
- Δ is the maximum possible count of set in the stream before set was added to D
19. Overview of the algorithm (cont'd)
- D is updated by the operations UPDATE_SET and NEW_SET (see the sketch after this list)
- UPDATE_SET updates and deletes entries in D
- For each entry (set, f, Δ), count the occurrences of set in the batch and update the entry
- If an updated entry satisfies f + Δ ≤ b_current, the entry is removed from D
- NEW_SET inserts new entries into D
- If a set set has frequency f ≥ β in the batch and set does not occur in D, create a new entry (set, f, b_current − β)
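A minimal Python sketch of one batch step, assuming D maps frozensets to (f, Δ) pairs; the names and the brute-force subset enumeration are ours, for clarity only (the paper avoids full enumeration via SetGen, described next):

    from collections import Counter
    from itertools import combinations

    def process_batch(D, batch, b_current, beta):
        # Count every subset occurring in the batch (exponential in
        # transaction size; illustration only).
        freq = Counter()
        for t in batch:
            for k in range(1, len(t) + 1):
                for sub in combinations(sorted(t), k):
                    freq[frozenset(sub)] += 1
        existed = set(D)
        # UPDATE_SET: add batch counts; delete entries with f + delta <= b_current
        for s in list(D):
            f, delta = D[s]
            f += freq.get(s, 0)
            if f + delta <= b_current:
                del D[s]
            else:
                D[s] = (f, delta)
        # NEW_SET: insert previously untracked sets with batch count >= beta
        for s, f in freq.items():
            if s not in existed and f >= beta:
                D[s] = (f, b_current - beta)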
20. Implementation
- Challenges:
- Avoid enumerating all subsets of a transaction
- The data structure must be compact for better space efficiency
- 3 major modules:
- Buffer
- Trie
- SetGen
21. Implementation (cont'd)
- Buffer repeatedly reads a batch of buckets of transactions, where each transaction is a set of item-ids, into available main memory
- Trie maintains the data structure D
- SetGen generates subsets of item-ids along with their frequency counts in the current batch
- Not all possible subsets need to be generated
- If a subset S is not inserted into D after application of both UPDATE_SET and NEW_SET, then no superset of S needs to be considered (see the sketch after this list)
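The pruning rule can be illustrated with a hypothetical level-wise enumeration (the paper's SetGen instead walks the Trie over sorted item-ids, but the effect of the rule is the same): a k-itemset is generated only if all of its (k − 1)-subsets survived into D.

    from collections import Counter
    from itertools import combinations

    def setgen_levelwise(batch, survives):
        # survives(s) should report whether itemset s is present in D
        # after UPDATE_SET and NEW_SET were applied to its level.
        k, kept = 1, set()
        while True:
            counts = Counter()
            for t in batch:
                for cand in combinations(sorted(t), k):
                    # Prune: skip candidates with a non-surviving subset
                    if k == 1 or all(frozenset(c) in kept
                                     for c in combinations(cand, k - 1)):
                        counts[frozenset(cand)] += 1
            if not counts:
                return
            yield k, counts              # per-level batch frequencies
            kept = {s for s in counts if survives(s)}
            if not kept:
                return
            k += 1

A caller would update D with each level's counts before requesting the next level, e.g. passing survives = lambda s: s in D.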
22. Performance
- IBM synthetic dataset T10.I4.D1000K with 10K items (results chart omitted)
23. Performance (cont'd)
- Comparison with Apriori
- IBM synthetic dataset T10.I4.D1000K with 10K items (results chart omitted)
24. Conclusion
- Sticky Sampling and Lossy Counting are 2 approximate algorithms that can find frequent items
- Both algorithms produce frequency counts within a user-specified error tolerance level, though Sticky Sampling is non-deterministic
- Lossy Counting can be extended to find frequent itemsets
25. Reference
- G. S. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB 2002), Hong Kong, 2002.
26. Q & A