Approximate Frequency Counts over Data Streams - Presentation Transcript
1
Approximate Frequency Counts over Data Streams
  • Gurmeet Singh Manku (Stanford), Rajeev Motwani
    (Stanford)

Presented by Michal Spivak, November 2003
2
The Problem
Identify all elements whose current frequency
exceeds a support threshold s = 0.1%.
3
Related problem
Identify all subsets of items whose current
frequency exceeds s = 0.1%.
4
Purpose of this paper
  • Present an algorithm for computing frequency
    counts exceeding a user-specified threshold over
    data streams, with the following advantages:
  • Simple
  • Low memory footprint
  • Output is approximate but guaranteed not to
    exceed a user-specified error parameter.
  • Can be deployed for streams of singleton items
    and can handle streams of variable-sized sets of
    items.

5
Overview
  • Introduction
  • Frequency counting applications
  • Problem definition
  • Algorithm for Frequent Items
  • Algorithm for Frequent Sets of Items
  • Experimental results
  • Summary

6
Introduction
7
Motivating examples
  • Iceberg Queries: perform an aggregate function over
    an attribute and eliminate those below some
    threshold.
  • Association Rules: require computation of frequent
    itemsets.
  • Iceberg Datacubes: group-bys of a CUBE operator
    whose aggregate frequency exceeds a threshold.
  • Traffic measurement: requires identification of
    flows that exceed a certain fraction of total
    traffic.

8
What's out there today
  • Algorithms that compute exact results
  • Attempt to minimize the number of data passes (the
    best algorithms take two passes).
  • Problems when adapted to streams:
  • Only one pass is allowed.
  • Results are expected to be available with short
    response time.
  • They fail to provide any a priori guarantee on the
    quality of their output.

9
Why Streams? Streams vs. Stored Data
  • The volume of a stream over its lifetime can be huge
  • Queries over streams require timely answers;
    response times need to be small
  • As a result it is not possible to store the
    stream in its entirety.

10
Frequency counting applications
11
Existing applications for the following problems
  • Iceberg Queries: perform an aggregate function over
    an attribute and eliminate those below some
    threshold.
  • Association Rules: require computation of frequent
    itemsets.
  • Iceberg Datacubes: group-bys of a CUBE operator
    whose aggregate frequency exceeds a threshold.
  • Traffic measurement: requires identification of
    flows that exceed a certain fraction of total
    traffic.

12
Iceberg Queries: identify aggregates that exceed a
user-specified threshold r
  • One of the published algorithms to compute
    iceberg queries efficiently uses repeated hashing
    over multiple passes.
  • Basic idea:
  • In the first pass, a set of counters is maintained
  • Each incoming item is hashed to one of the
    counters, which is incremented
  • These counters are then compressed into a bitmap,
    with a 1 denoting a large counter value
  • In the second pass, exact frequencies are
    maintained only for those elements that hash to a
    counter whose bitmap value is 1
  • This algorithm is difficult to adapt for streams
    because it requires two passes

M. Fang, N. Shivakumar, H. Garcia-Molina, R.
Motwani, and J. Ullman. Computing iceberg queries
efficiently. In Proc. of 24th Intl. Conf. on Very
Large Data Bases, pages 299-310, 1998.
13
Association Rules
  • Definitions:
  • Transaction: a subset of items drawn from I, the
    universe of all items.
  • Itemset: X ⊆ I has support s if X occurs as a
    subset in at least a fraction s of all
    transactions.
  • Association rules over a set of transactions are
    of the form X ⇒ Y, where X and Y are subsets of I
    such that X ∩ Y = ∅ and X ∪ Y has support exceeding
    a user-specified threshold s.
  • Confidence of a rule X ⇒ Y is the value
    support(X ∪ Y) / support(X).
14
Example - Market basket analysis
For support = 50% and confidence = 50%, we have the
following rules: 1 ⇒ 3 with 50% support and 66%
confidence; 3 ⇒ 1 with 50% support and 100%
confidence.
15
Reduce to computing frequent itemsets
For support = 50% and confidence = 50%:
  • For the rule 1 ⇒ 3:
  • Support = Support({1, 3}) = 50%
  • Confidence = Support({1, 3}) / Support({1}) = 66%

16
Toivonen's algorithm
  • Based on sampling of the data stream.
  • Basically, in the first pass, frequencies are
    computed for a sample of the stream, and in the
    second pass the validity of these items is
    determined. Can be adapted for data streams.
  • Problems: false negatives occur because the
    error in frequency counts is two-sided; for
    small values of ε, the number of samples is
    enormous, roughly 1/ε (100 million samples).

17
Network flow identification
  • Flow: a sequence of transport-layer packets that
    share the same source and destination addresses
  • Estan and Verghese proposed an algorithm for
    identifying flows that exceed a certain
    threshold. The algorithm is a combination of
    repeated hashing and sampling, similar to those
    for iceberg queries.
  • The algorithm presented in this paper is directly
    applicable to the problem of network flow
    identification. It beats their algorithm in terms
    of space requirements.

18
Problem definition
19
Problem Definition
  • The algorithm accepts two user-specified parameters:
    a support threshold s ∈ (0,1) and an error
    parameter ε ∈ (0,1), with ε << s
  • N: length of the stream (i.e., the number of tuples
    seen so far)
  • Itemset: a set of items
  • "Item(set)" denotes an item or an itemset
  • At any point in time, the algorithm can be asked
    to produce a list of item(set)s along with their
    estimated frequencies.

20
Approximation guarantees
  • All item(set)s whose true frequency exceeds sN
    are output. There are no false negatives.
  • No item(set) whose true frequency is less than
    (s − ε)N is output.
  • Estimated frequencies are less than the true
    frequencies by at most εN.

21
Input Example
  • s = 0.1
  • ε, as a rule of thumb, should be set to one-tenth
    or one-twentieth of s; here ε = 0.01
  • As per property 1, ALL elements with frequency
    exceeding 0.1N will be output.
  • As per property 2, NO element with frequency
    below 0.09N will be output.
  • Elements with frequency between 0.09N and 0.1N may
    or may not be output. Those that make their way in
    are false positives.
  • As per property 3, all estimated frequencies are
    less than the true frequencies by at most 0.01N.

22
Problem Definition cont
  • An algorithm maintains an ε-deficient synopsis if
    its output satisfies the aforementioned
    properties
  • Goal: to devise algorithms that support an
    ε-deficient synopsis using as little main memory
    as possible

23
The Algorithms for Frequent Items
  • Sticky Sampling
  • Lossy Counting

24
Sticky Sampling Algorithm
[Figure: a stream of elements (34, 15, 30, 28, 31, 41, 23, 35, 19); create counters by sampling]
25
Notations
  • Data structure S: a set of entries of the form
    (e, f)
  • f estimates the frequency of an element e.
  • r: sampling rate. Sampling an element with rate
    r means we select the element with probability
    1/r

26
Sticky Sampling cont
  • Initially S is empty and r = 1.
  • For each incoming element e:
      if (e exists in S) increment the corresponding f
      else sample the element with rate r:
          if (sampled) add the entry (e, 1) to S
          else ignore e

27
The sampling rate
  • Let t = (1/ε) log(s⁻¹ δ⁻¹), where δ is the
    probability of failure
  • The first 2t elements are sampled at rate r = 1
  • The next 2t elements at rate r = 2
  • The next 4t elements at rate r = 4
  • and so on

28
Sticky Sampling cont
  • Whenever the sampling rate r changes, for each
    entry (e, f) in S:
      repeat: toss an unbiased coin
          if (the toss is not successful) diminish f by one
              if (f == 0) delete the entry from S and break
      until the toss is successful
  • (A sketch of the complete procedure follows below.)
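The following is a minimal Python sketch of Sticky Sampling as just described. It is an illustrative reconstruction, not the authors' code: the function name, the stream/parameter interface and the use of random.random() for coin tosses are assumptions.

    import math
    import random

    def sticky_sampling(stream, s, eps, delta):
        # Sketch of Sticky Sampling: returns entries (e, f) with f >= (s - eps) * N.
        t = (1.0 / eps) * math.log(1.0 / (s * delta))
        S = {}              # e -> estimated frequency f
        r = 1               # current sampling rate
        n = 0               # number of elements seen so far
        boundary = 2 * t    # stream position at which r next doubles
        for e in stream:
            n += 1
            if n > boundary:                 # sampling rate changes: r doubles
                r *= 2
                boundary += r * t            # next segment has length r * t
                for key in list(S):          # adjust S as if rate r had been used all along
                    while random.random() >= 0.5:   # unsuccessful coin toss
                        S[key] -= 1
                        if S[key] == 0:
                            del S[key]
                            break
            if e in S:
                S[e] += 1                    # element already tracked: exact increment
            elif random.random() < 1.0 / r:  # otherwise sample with rate r
                S[e] = 1
        return [(e, f) for e, f in S.items() if f >= (s - eps) * n]

For example, sticky_sampling(iter(data), s=0.01, eps=0.001, delta=0.01) would report every element whose frequency in data exceeds about 1%, with the guarantees of Theorem 1 below.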

29
Sticky Sampling cont
  • The number of unsuccessful coin tosses follows a
    geometric distribution.
  • Effectively, after each rate change S is
    transformed to exactly the state it would have
    been in if the new rate had been used from the
    beginning.
  • When a user requests a list of items with
    threshold s, the output is those entries in S
    where f ≥ (s − ε)N

30
Theorem 1
  • Sticky Sampling computes an ε-deficient synopsis
    with probability at least 1 − δ using at most
    (2/ε) log(s⁻¹ δ⁻¹) expected entries.

31
Theorem 1 - proof
  • The first 2t elements find their way into S
  • When the sampling rate is r ≥ 2, at least rt
    elements have already been seen (N ≥ rt), hence
    1/r ≥ t/N
  • An error in frequency corresponds to a sequence of
    unsuccessful coin tosses during the first few
    occurrences of e. The probability that this
    length exceeds εN is at most
    (1 − 1/r)^(εN) ≤ (1 − t/N)^(εN) ≤ e^(−εt)
  • The number of elements with frequency exceeding s
    is no more than 1/s ⇒ the probability that the
    estimate for any of them is deficient by εN is at
    most e^(−εt)/s

32
Theorem 1 proof cont
  • The probability of failure should be at most δ.
    This yields e^(−εt)/s ≤ δ,
  • which is satisfied by t = (1/ε) log(s⁻¹ δ⁻¹)
  • Since the expected space requirement is 2t, the
    space bound follows. (A compact restatement of the
    argument appears below.)
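The two steps of the proof can be restated compactly in LaTeX notation (this summary is an addition for readability, not part of the original slides; the union bound is over the at most 1/s elements whose frequency exceeds s):

    \[
    \Pr\big[\text{estimate of a fixed element deficient by } \epsilon N\big]
      \;\le\; \Big(1 - \tfrac{1}{r}\Big)^{\epsilon N}
      \;\le\; \Big(1 - \tfrac{t}{N}\Big)^{\epsilon N}
      \;\le\; e^{-\epsilon t}
    \]
    \[
    \Pr[\text{failure}] \;\le\; \tfrac{1}{s}\, e^{-\epsilon t} \;\le\; \delta
      \;\Longleftrightarrow\;
      t \;\ge\; \tfrac{1}{\epsilon}\,\log\!\big(s^{-1}\delta^{-1}\big)
    \]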

33
Sticky Sampling summary
  • The algorithm is called Sticky Sampling because S
    sweeps over the stream like a magnet, attracting
    all elements that already have an entry in S
  • The space complexity is independent of N
  • The idea of maintaining samples was first
    presented by Gibbons and Matias, who used it to
    solve the top-k problem.
  • This algorithm is different in that the sampling
    rate r increases logarithmically so as to produce
    ALL items with frequency exceeding s, not just the
    top k

34
Lossy Counting
Divide the stream into buckets. Keep exact
counters for items in the buckets. Prune entries at
bucket boundaries.
35
Lossy Counting cont
  • A deterministic algorithm that computes frequency
    counts over a stream of single-item transactions,
    satisfying the guarantees outlined in Section 3
    using at most (1/ε) log(εN) space, where N denotes
    the current length of the stream.
  • The user specifies two parameters: support s and
    error ε

36
Definitions
  • The incoming stream is conceptually divided into
    buckets of width w = ⌈1/ε⌉
  • Buckets are labeled with bucket ids, starting
    from 1
  • Denote the current bucket id by b_current, whose
    value is ⌈N/w⌉
  • Denote by f_e the true frequency of an element
    e in the stream seen so far
  • Data structure D is a set of entries of the form
    (e, f, Δ)

37
The algorithm
  • Initially D is empty
  • Receive an element e:
      if (e exists in D) increment its frequency f by 1
      else create a new entry (e, 1, b_current − 1)
  • At a bucket boundary, prune D by the following
    rule: (e, f, Δ) is deleted if f + Δ ≤ b_current
  • When the user requests a list of items with
    threshold s, output those entries in D where f ≥
    (s − ε)N
  • (A minimal sketch of this procedure follows below.)
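A minimal Python sketch of Lossy Counting as described on this slide; it is an illustrative reconstruction under the stated rules, not the authors' implementation, and the function and variable names are assumptions.

    import math

    def lossy_counting(stream, s, eps):
        # Sketch of Lossy Counting: returns entries (e, f) with f >= (s - eps) * N.
        w = math.ceil(1.0 / eps)   # bucket width
        D = {}                     # e -> (f, delta)
        n = 0                      # stream length seen so far
        for e in stream:
            n += 1
            b_current = math.ceil(n / w)
            if e in D:
                f, delta = D[e]
                D[e] = (f + 1, delta)          # exact count since insertion
            else:
                D[e] = (1, b_current - 1)      # new entry with delta = b_current - 1
            if n % w == 0:                     # bucket boundary: prune
                for key in list(D):
                    f, delta = D[key]
                    if f + delta <= b_current:
                        del D[key]
        return [(e, f) for e, (f, delta) in D.items() if f >= (s - eps) * n]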

38
Some algorithm facts
  • For an entry (e, f, Δ), f represents the exact
    frequency count of e ever since it was inserted
    into D.
  • The value Δ is the maximum number of times e
    could have occurred in the first b_current − 1
    buckets (at insertion time this value is exactly
    b_current − 1)
  • Once an entry is inserted into D, its Δ value
    remains unchanged

39
Lossy counting in action
[Animation: D starts out empty and fills as elements arrive]
At a bucket boundary, remove the entries for which
f + Δ ≤ b_current
40
Lossy counting in action cont
At a bucket boundary, remove the entries for which
f + Δ ≤ b_current
41
Lemma 1
  • Whenever deletions occur, b_current ≤ εN
  • Proof: deletions occur only at bucket boundaries,
    where N = b_current · w; since w = ⌈1/ε⌉ ≥ 1/ε, we
    get N ≥ b_current/ε, i.e., b_current ≤ εN

42
Lemma 2
  • Whenever an entry (e, f, Δ) gets deleted, f_e ≤
    b_current
  • Proof by induction
  • Base case: b_current = 1. (e, f, Δ) is deleted only
    if f ≤ 1, thus f_e ≤ b_current (since f_e = f)
  • Induction step: consider an entry (e, f, Δ) that
    gets deleted for some b_current > 1. This entry was
    inserted when bucket Δ + 1 was being processed. An
    earlier entry for e could have been deleted as late
    as the time bucket Δ became full. By induction, the
    true frequency of e up to that point was no more
    than Δ. f is the exact frequency of e since the
    current entry was inserted, so f_e ≤ f + Δ.
    Combined with the deletion rule f + Δ ≤ b_current,
    this gives f_e ≤ b_current

43
Lemma 3
  • If e does not appear in D, then f_e ≤ εN
  • Proof: if the lemma holds for an element e at the
    moment it gets deleted, it continues to hold for
    all later N as well. From Lemmas 1 and 2 we infer
    that f_e ≤ b_current ≤ εN whenever e gets deleted.

44
Lemma 4
  • If (e, f, Δ) ∈ D, then f ≤ f_e ≤ f + εN
  • Proof:
  • If Δ = 0, then f = f_e.
  • Otherwise, e was possibly deleted sometime during
    the first Δ buckets. From Lemma 2, f_e ≤ f + Δ, and
    Δ ≤ b_current − 1 ≤ εN
  • Conclusion: f ≤ f_e ≤ f + εN

45
Lossy Counting cont
  • Lemma 3 shows that all elements whose true
    frequency exceeds εN have entries in D
  • Lemma 4 shows that the estimated frequencies of
    all such elements are accurate to within εN
  • ⇒ D correctly maintains an ε-deficient synopsis

46
Theorem 2
  • Lossy Counting computes an ε-deficient synopsis
    using at most (1/ε) log(εN) entries

47
Theorem 2 - proof
  • Let B = b_current
  • Let d_i denote the number of entries in D whose
    bucket id is B − i + 1 (i ∈ [1, B])
  • The element corresponding to such an entry must
    occur at least i times in buckets B − i + 1 through
    B; otherwise it would have been deleted
  • We get the following constraint:
    (1) Σ_{i=1..j} i·d_i ≤ j·w for j = 1, 2, ..., B

48
Theorem 2 proof
  • The following inequality can be proved by
    induction from (1):
    Σ_{i=1..j} d_i ≤ Σ_{i=1..j} w/i for j = 1, 2, ..., B
  • |D| = Σ_{i=1..B} d_i
  • From the above inequality,
    |D| ≤ Σ_{i=1..B} w/i ≤ (1/ε) log B = (1/ε) log(εN)
  • (The final step is spelled out below.)
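The final step, spelled out in LaTeX notation (an added expansion, not from the original slides; H_B denotes the B-th harmonic number, and the rounding in w = ⌈1/ε⌉ is ignored):

    \[
    |D| \;=\; \sum_{i=1}^{B} d_i
        \;\le\; \sum_{i=1}^{B} \frac{w}{i}
        \;=\; w\,H_B
        \;\approx\; \frac{1}{\epsilon}\,\log B ,
    \qquad
    B \;=\; b_{\mathrm{current}} \;=\; \Big\lceil \frac{N}{w} \Big\rceil \;\approx\; \epsilon N ,
    \]

which gives the stated bound of (1/ε) log(εN) entries.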

49
Sticky Sampling vs. Lossy Counting
[Plot: number of entries vs. log10 of N (stream length), for support s = 1% and error ε = 0.1%]
50
Sticky Sampling vs. Lossy Counting cont.
Kinks in the curve for Sticky Sampling correspond
to re-sampling. Kinks in the curve for Lossy
Counting correspond to bucket boundaries.
51
Sticky Sampling vs. Lossy Counting cont.

ε        s        SS worst   LC worst   SS Zipf   LC Zipf   SS Uniq   LC Uniq
0.1%     1.0%     27K        9K         6K        419       27K       1K
0.05%    0.5%     58K        17K        11K       709       58K       2K
0.01%    0.1%     322K       69K        37K       2K        322K      10K
0.005%   0.05%    672K       124K       62K       4K        672K      20K

SS = Sticky Sampling, LC = Lossy Counting, Zipf =
Zipfian distribution, Uniq = stream with no
duplicates
52
Sticky Sampling vs. Lossy Counting summary
  • Lossy counting is superior by a large factor
  • Sticky sampling performs worse because of its
    tendency to remember every unique element that
    gets sampled
  • Lossy counting is good at pruning low frequency
    elements quickly

53
Comparison with alternative approaches
  • Toivonen's sampling algorithm for association
    rules.
  • Sticky Sampling beats that approach by roughly a
    factor of 1/ε

54
Comparison with alternative approaches cont
  • [KPS02]: In the first pass, the algorithm maintains
    1/ε elements with their frequencies. If a counter
    exists for an element it is increased; if there
    is a free counter, the element is inserted;
    otherwise all existing counters are reduced by one.
    (A sketch of this counter scheme appears below.)
  • Can be used to maintain an ε-deficient synopsis
    with exactly 1/ε space
  • If the input stream is Zipfian, Lossy Counting
    takes less than 1/ε space: for ε = 0.01%, roughly
    2000 entries ≈ 20% of 1/ε
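The counter scheme described above can be sketched in a few lines of Python; this is an illustrative reconstruction of the idea, not code from [KPS02]:

    def kps_counters(stream, eps):
        # Maintain at most 1/eps counters; when no counter is free,
        # decrement all counters and drop those that reach zero.
        k = int(1.0 / eps)      # maximum number of counters
        counters = {}           # element -> counter value
        for e in stream:
            if e in counters:
                counters[e] += 1
            elif len(counters) < k:
                counters[e] = 1
            else:
                for key in list(counters):
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
        return counters         # candidate frequent elements with estimated counts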

55
Frequent Sets of Items
  • From Theory to Practice

56
Frequent Sets of Items
Identify all subsets of items whose current
frequency exceeds s = 0.1%.
57
Frequent itemsets algorithm
  • Input: a stream of transactions, where each
    transaction is a set of items from I
  • N: length of the stream
  • The user specifies two parameters: support s and
    error ε
  • Challenges: - handling variable-sized
    transactions - avoiding explicit enumeration of
    all subsets of any transaction

58
Notations
  • Data structure D: a set of entries of the form
    (set, f, Δ)
  • Transactions are divided into buckets
  • w = ⌈1/ε⌉: the number of transactions in each
    bucket
  • b_current: the current bucket id
  • Transactions are not processed one by one. Main
    memory is filled with as many transactions as
    possible, and processing is done on this batch of
    transactions. β: the number of buckets in main
    memory in the current batch being processed.

59
Update D
  • UPDATE_SET: for each entry (set, f, Δ) ∈ D, update
    f by counting the occurrences of set in the current
    batch. If the updated entry satisfies f + Δ ≤
    b_current, delete this entry
  • NEW_SET: if a set has frequency f ≥ β in the
    current batch and does not occur in D, create
    a new entry (set, f, b_current − β).
    (A batch-processing sketch follows below.)

60
Algorithm facts
  • If f_set ≥ εN, the set has an entry in D
  • If (set, f, Δ) ∈ D, then the true frequency f_set
    satisfies the inequality f ≤ f_set ≤ f + Δ
  • When a user requests a list of itemsets with
    threshold s, output those entries in D where f ≥
    (s − ε)N
  • β needs to be a large number: any subset of I
    that occurs β + 1 times or more contributes an
    entry to D.

61
Three modules
  • TRIE: maintains the data structure D
  • SetGen: implements UPDATE_SET and NEW_SET;
    operates on the current batch of transactions
  • BUFFER: repeatedly reads in a batch of transactions
    into available main memory
62
Module 1 - Buffer
  • Read a batch of transactions
  • Transactions are laid out one after the other in
    a big array
  • A bitmap is used to remember transaction
    boundaries
  • After reading in a batch, BUFFER sorts each
    transaction by its item-ids

63
Module 2 - TRIE
64
Module 2 TRIE cont
  • Nodes are labeled (item-id, f, Δ, level)
  • Children of any node are ordered by their
    item-ids
  • Root nodes are also ordered by their item-ids
  • A node represents the itemset consisting of the
    item-ids in that node and all its ancestors
  • The TRIE is maintained as an array of entries of
    the form (item-id, f, Δ, level), in pre-order of
    the trees. This is equivalent to a lexicographic
    ordering of the subsets it encodes.
  • There are no pointers; the levels compactly encode
    the underlying tree structure. (See the decoding
    sketch below.)
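To illustrate the pointer-free encoding, here is a small Python sketch (an assumption about how decoding might look, not the paper's code) that recovers the itemset represented by each array entry from the pre-order sequence and the level fields alone:

    def itemsets_from_trie_array(trie_array):
        # trie_array: list of (item_id, f, delta, level) entries in pre-order.
        # Yields (itemset, f, delta), where itemset is the tuple of item-ids on
        # the path from a root (level 0) down to the current node.
        path = []  # item-ids of the current node's ancestors, indexed by level
        for item_id, f, delta, level in trie_array:
            del path[level:]        # drop entries at this level or deeper
            path.append(item_id)    # the current node extends the path
            yield tuple(path), f, delta

For example, the entries [(1, 5, 0, 0), (2, 3, 0, 1), (3, 2, 0, 2), (3, 4, 0, 1)] encode the itemsets {1}, {1, 2}, {1, 2, 3} and {1, 3}, in lexicographic order.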

65
Module 3 - SetGen
SetGen uses the following pruning rule: if a
subset S does not make its way into the TRIE after
application of both UPDATE_SET and NEW_SET, then
no supersets of S need to be considered (illustrated below).
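A minimal Python sketch of this Apriori-style pruning (illustrative only; candidates_ok is a hypothetical callback standing in for "did this set survive UPDATE_SET / NEW_SET and enter the TRIE"):

    def setgen(candidates_ok, prefix, remaining_items):
        # Generate itemsets in lexicographic order, extending a prefix only if
        # the prefix itself made it into the TRIE.
        for i, item in enumerate(remaining_items):
            candidate = prefix + (item,)
            if candidates_ok(candidate):
                yield candidate
                yield from setgen(candidates_ok, candidate, remaining_items[i + 1:])

Calling setgen(candidates_ok, (), sorted_item_ids) enumerates exactly the sets that are not pruned by the rule above.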
66
Overall Algorithm
67
Efficient Implementations: Buffer
  • If item-ids are successive integers from 1 through
    |I|, and |I| is small enough (less than 1 million):
    maintain exact frequency counts for singleton
    sets, prune away those item-ids whose frequency
    is less than εN, and then sort the transactions
  • If |I| = 10^5, the array size is 0.4 MB

68
Efficient Implementations: TRIE
  • Take advantage of the fact that the sets produced
    by SetGen are lexicographically ordered.
  • Maintain the TRIE as a set of fairly large
    chunks of memory instead of one huge array
  • Instead of modifying the original TRIE, create a
    new TRIE.
  • Chunks from the old TRIE are freed as soon as
    they are no longer required.
  • By the time SetGen finishes, the chunks of the
    original TRIE have been discarded.

69
Efficient Implementations: SetGen
  • Employs a priority queue called Heap
  • Initially, Heap contains pointers to the smallest
    item-ids of all transactions in BUFFER
  • Duplicate members are maintained together and
    constitute a single item in the Heap: all these
    pointers are chained together.
  • The space for the chains is derived from BUFFER by
    replacing item-ids with pointers.

70
Efficient Implementations: SetGen cont.
[Figure: example BUFFER contents with Heap pointers to the smallest item-ids, duplicates chained together]
71
Efficient Implementations: SetGen cont.
  • Repeatedly process the smallest item-id in Heap
    to generate singleton sets.
  • If the singleton belongs to the TRIE after
    UPDATE_SET and NEW_SET, try to generate the next
    set by extending the current singleton set.
  • This is done by invoking SetGen recursively with
    a new Heap created out of the successors of the
    pointers to the item-ids just processed and removed.
  • When the recursive call returns, the smallest
    entry in Heap is removed and all successors of
    the currently smallest item-id are added.

72
Efficient Implementations: SetGen cont.
[Figure: SetGen recursion on the example BUFFER, with new Heaps built from the successors of the processed item-ids]
73
System issues and optimizations
  • BUFFER scans the incoming stream by memory
    mapping the input file.
  • Standard qsort is used to sort transactions
  • Threading SetGen and BUFFER does not help because
    SetGen is significantly slower.
  • The rate at which tries are scanned is much
    smaller than the rate at which sequential disk
    I/O can be done
  • It is therefore possible to maintain the TRIE on
    disk without loss in performance
74
System issues and optimizations: TRIE on disk,
advantages
  • The size of the TRIE is not limited by main
    memory, so the algorithm can function with a small
    amount of main memory.
  • Since most available main memory can be devoted
    to BUFFER, this algorithm can handle smaller
    values of ε than other algorithms can.

75
Novel features of this technique
  • No candidate generation phase.
  • The compact disk-based TRIE is novel.
  • Able to compute frequent itemsets under low
    memory conditions.
  • Able to handle smaller values of the support
    threshold than previously possible.

76
Experimental results
77
Experimental Results
78
What is studied
  • Support threshold s
  • Number of transactions N
  • Size of BUFFER B
  • Total time taken t

In all experiments, ε is set to 0.1s
79
Varying buffer sizes and support s
Decreasing s leads to increases in running time
80
Varying support s and buffer size B
Time taken in seconds
Kinks occur due to TRIE optimization on last
batch
81
Varying length N and support s
The running time is linearly proportional to the
length of the stream. The curve flattens at the end
because processing the last batch is faster.
82
Comparison with Apriori
Support   Apriori (time / memory)   Our algorithm, 4 MB buffer   Our algorithm, 44 MB buffer
0.001     99 s / 82 MB              111 s / 12 MB                27 s / 45 MB
0.002     25 s / 53 MB              94 s / 10 MB                 15 s / 45 MB
0.004     14 s / 48 MB              65 s / 7 MB                  8 s / 45 MB
0.006     13 s / 48 MB              46 s / 6 MB                  6 s / 45 MB
0.008     13 s / 48 MB              34 s / 5 MB                  4 s / 45 MB
0.010     14 s / 48 MB              26 s / 5 MB                  4 s / 45 MB
Dataset: IBM T10.I4.1000K with 1M transactions of
average size 10.
83
Comparison with Iceberg Queries
Query: identify all word pairs in 100K web
documents which co-occur in at least
0.5% of the documents.
The multiple-pass algorithm of [FSGM+98] takes
7000 seconds with 30 MB of memory. Our
single-pass algorithm takes 4500 seconds
with 26 MB of memory.
84
Summary
A Novel Algorithm for computing approximate
frequency counts over Data Streams
85
Summary: Advantages of the algorithms presented
  • Require provably small main memory footprints
  • Each of the motivating examples can now be solved
    over streaming data
  • Handle smaller values of support threshold than
    previously possible
  • Remains practical in environments with moderate
    main memory

86
Summary cont
  • Give an a priori error guarantee
  • Work for variable-sized transactions.
  • Optimized implementation for frequent itemsets
  • For the datasets tested, the algorithm runs in
    one pass and produces exact results, beating
    previous algorithms in terms of time.

87
Questions?
  • More questions/comments can be sent to
    michal.spivak@sun.com