Title: Approximation Algorithms for Frequency Related Query Processing on Streaming Data
1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data
- Presented by Fan Deng
- Supervisor: Dr. Davood Rafiei
- May 24, 2007
2 Data stream
- A sequence of data records
- Examples
- Document/URL streams from a Web crawler
- IP packet streams
- Web advertisement click streams
- Sensor reading streams
- ...
3 Processing in one pass
- One pass processing
- Online stream (one scan required)
- Massive offline stream (one scan preferred)
- Challenges
- Huge data volume
- Fast processing requirement
- Relatively small fast storage space
4 Approximation algorithms
- Exact query answers
- can be slow to obtain
- may need large storage space
- sometimes are not necessary
- Approximate query answers
- can take much less time
- may need less space
- with acceptable errors
5 Frequency related queries
- Frequency
- # of occurrences
- Continuous membership query
- Point query
- Similarity self-join size estimation
6 Outline
- Introduction
- Continuous membership query
- Motivating application
- Problem statement
- Existing solutions and our solution
- Theoretical and experimental results
- Point query
- Similarity self-join size estimation
- Conclusions and future work
7 A Motivating Application
- Duplicate URL detection in Web crawling
- Search engines [Broder et al., WWW'03]
- Fetch web pages continuously
- Extract URLs within each downloaded page
- Check each URL (duplicate detection)
- If never seen before
- Then fetch it
- Else skip it
8 A Motivating Application (cont.)
- Problems
- Huge number of distinct URLs
- Memory is usually not large enough
- Disks are slow
- Errors are usually acceptable
- A false positive (a false alarm)
- A distinct URL is wrongly reported as a duplicate
- Consequence: this URL will not be crawled
- A false negative (a miss)
- A duplicate URL is wrongly reported as distinct
- Consequence: this URL will be crawled redundantly or searched for on disk
9 Problem statement
- An ordered sequence of elements
- Storage space M
- Not large enough to store all distinct elements
- Continuous membership query
- Appeared before? Yes or No
- Example stream: d g a f b e a d c b a
- Our goal
- Minimize the # of errors
- Fast
10 An existing solution (caching)
- Store as many distinct elements as possible in a buffer
- Duplicate detection process (see the sketch below)
- Upon element arrival, search the buffer
- If found, then report duplicate; else report distinct
- Update the buffer using some replacement policy
- LRU, FIFO, Random, ...
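A minimal Python sketch of this caching approach, assuming an LRU replacement policy; the buffer size and the example stream are illustrative:

    from collections import OrderedDict

    def detect_duplicates_lru(stream, buffer_size):
        """Report 'duplicate' or 'distinct' for each element using an LRU buffer."""
        buffer = OrderedDict()              # keys = elements, order = recency
        for x in stream:
            if x in buffer:
                buffer.move_to_end(x)       # refresh recency
                yield x, "duplicate"
            else:
                yield x, "distinct"
                buffer[x] = True
                if len(buffer) > buffer_size:
                    buffer.popitem(last=False)   # evict the least recently used element

    # Example on the stream from the problem statement slide
    print(list(detect_duplicates_lru("dgafbeadcba", buffer_size=4)))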
11 Another solution (Bloom filters)
- A bitmap, originally all 0
- Duplicate detection process
- Hash each incoming element into some bits
- If any bit is 0, then report distinct; else report duplicate
- Update process: set the corresponding bits to 1 (see the sketch below)
- Example: a bitmap with positions 1-6 and two hash functions h1, h2
  x    h1(x)   h2(x)
  a    1       2
  b    1       3
  c    2       4
  a    1       2
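A minimal Bloom filter sketch in Python of the detection and update steps above; the bitmap size, the number of hash functions, and the SHA-1 salting scheme are illustrative choices, not the exact ones from the talk:

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits, num_hashes):
            self.bits = [0] * num_bits
            self.num_hashes = num_hashes

        def _positions(self, x):
            # Derive the bit positions by salting one hash function (illustrative).
            for i in range(self.num_hashes):
                digest = hashlib.sha1(f"{i}:{x}".encode()).hexdigest()
                yield int(digest, 16) % len(self.bits)

        def seen_before(self, x):
            # Detection: if any bit is 0, x is distinct. Update: set all its bits to 1.
            positions = list(self._positions(x))
            duplicate = all(self.bits[p] for p in positions)
            for p in positions:
                self.bits[p] = 1
            return duplicate

    bf = BloomFilter(num_bits=6, num_hashes=2)
    for x in "abca":
        print(x, "duplicate" if bf.seen_before(x) else "distinct")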
12 Another solution (Bloom filters, cont.)
- False positives (false alarms)
- The Bloom filter will eventually become full
- Then all distinct URLs will be reported as duplicates, and thus skipped!
13 Our solution (Stable Bloom Filters)
- Kick elements out of the Bloom filters
- Change bits to cells (cellmap)
14 Stable Bloom Filters (SBF, cont.)
- A cellmap, originally all 0
- Duplicate detection
- Hash each element into some cells and check those cells
- If any cell is 0, report distinct; else report duplicate
- Kick elements out
- Randomly choose some cells and decrement them by 1
- Update the cellmap
- Set the element's cells to a predefined value Max > 0
- Use the same hash functions as in the detection stage (see the sketch below)
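A minimal Python sketch of the SBF steps above (detection, kick-out, update); the cell count, number of hash functions, Max, and the number of cells decremented per update are illustrative parameters, not the tuned values from the thesis:

    import hashlib
    import random

    class StableBloomFilter:
        def __init__(self, num_cells, num_hashes, max_value, kicks_per_update):
            self.cells = [0] * num_cells
            self.num_hashes = num_hashes
            self.max_value = max_value              # Max > 0
            self.kicks_per_update = kicks_per_update

        def _positions(self, x):
            for i in range(self.num_hashes):
                digest = hashlib.sha1(f"{i}:{x}".encode()).hexdigest()
                yield int(digest, 16) % len(self.cells)

        def seen_before(self, x):
            positions = list(self._positions(x))
            # Detection: if any cell is 0, report distinct; else duplicate.
            duplicate = all(self.cells[p] > 0 for p in positions)
            # Kick-out: randomly choose some cells and decrement them by 1.
            for _ in range(self.kicks_per_update):
                p = random.randrange(len(self.cells))
                if self.cells[p] > 0:
                    self.cells[p] -= 1
            # Update: set the element's own cells to Max.
            for p in positions:
                self.cells[p] = self.max_value
            return duplicate

    sbf = StableBloomFilter(num_cells=1000, num_hashes=3, max_value=3, kicks_per_update=10)
    print(["dup" if sbf.seen_before(x) else "new" for x in "dgafbeadcba"])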
15 SBF theoretical results
- SBF will be stable
- The expected # of 0s becomes a constant after a number of updates
- Convergence is at an exponential rate
- Monotonic
- False positive rates become constant
- An upper bound on the false positive rate
- (a function of 4 parameters: SBF size, # of hash functions, max cell value, and kick-out rate)
- Setting the optimal parameters (partially empirical)
16 SBF experimental results
- Experimental comparison between SBF and the caching/buffering method (LRU)
- URL fingerprint data set, originally obtained from the Internet Archive (about 700M URLs)
- For a fair comparison, we introduce FPBuffering
- Let Caching generate some false positives
- FPBuffering
- If an element is not found in the buffer, report duplicate with a certain probability
17 SBF experimental results (cont.)
- SBF generates 3-13% fewer false negatives than FPBuffering, while having exactly the same # of false positives (<10%)
18 SBF experimental results (cont.)
19 SBF experimental results (cont.)
- MIN [Broder et al., WWW'03], theoretically optimal
- assumes the entire sequence of requests is known in advance
- beats LRU caching by <5% in most cases
- The more false positives allowed, the more SBF gains
20 Outline
- Introduction
- Continuous membership query
- Point query
- Motivating application
- Problem statement
- Existing solutions and our solution
- Theoretical and experimental results
- Similarity self-join size estimation
- Conclusions and future work
21 Motivating application
- Internet traffic monitoring
- Query the # of IP packets sent by a particular IP address in the past hour
- Phone call record analysis
- Query the # of calls to a given phone number yesterday
22 Problem statement
- Point query
- Summarize a stream of elements
- Estimate the frequency of a given element
- Goal: minimize the space cost and answer the query fast
23 Existing solutions
- Fast-AGMS sketch [AMS'97, Charikar et al. 2002]
- Count-min sketch (counting Bloom filters)
- e.g. an element is hashed to 4 counters
- Take the min counter value as the estimate (see the sketch below)
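A minimal Count-min sketch in Python illustrating the point-query estimate; the width, depth, and salted SHA-1 hashing are illustrative choices:

    import hashlib

    class CountMinSketch:
        def __init__(self, width, depth):
            self.width, self.depth = width, depth
            self.counters = [[0] * width for _ in range(depth)]

        def _col(self, row, x):
            digest = hashlib.sha1(f"{row}:{x}".encode()).hexdigest()
            return int(digest, 16) % self.width

        def add(self, x, count=1):
            for row in range(self.depth):
                self.counters[row][self._col(row, x)] += count

        def estimate(self, x):
            # Point query: take the minimum counter value over all rows.
            return min(self.counters[row][self._col(row, x)] for row in range(self.depth))

    cms = CountMinSketch(width=64, depth=4)
    for x in "abracadabra":
        cms.add(x)
    print(cms.estimate("a"), cms.estimate("z"))   # true counts: 5 and 0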
24 Our solution
- Count-median-mean (CMM)
- Count-min based
- Take the value of the counter the element is hashed to
- Deduct the median/mean value of all other counters
- The remainder from deducting the mean is an unbiased estimate
- Basic idea: all counters are expected to have the same value
- Example (see also the sketch below)
- counter value: 3
- mean value of all other counters: 2 (the median, also 2, is more robust)
- remainder: 1, so the frequency estimate = 3 - 2 = 1
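A Python sketch of the CMM idea on top of a Count-min style counter array: per row, deduct the mean (or median) of all other counters from the element's counter, then combine the per-row residues. Combining rows by taking their median is an assumption made here for illustration; the exact combination rule in the thesis may differ.

    import hashlib
    from statistics import median

    class CMMSketch:
        def __init__(self, width, depth):
            self.width, self.depth = width, depth
            self.counters = [[0] * width for _ in range(depth)]
            self.total = 0                        # total count inserted (per row)

        def _col(self, row, x):
            digest = hashlib.sha1(f"{row}:{x}".encode()).hexdigest()
            return int(digest, 16) % self.width

        def add(self, x, count=1):
            self.total += count
            for row in range(self.depth):
                self.counters[row][self._col(row, x)] += count

        def estimate(self, x, use_median=True):
            residues = []
            for row in range(self.depth):
                col = self._col(row, x)
                c = self.counters[row][col]
                if use_median:
                    noise = median(v for i, v in enumerate(self.counters[row]) if i != col)
                else:
                    noise = (self.total - c) / (self.width - 1)   # mean of all other counters
                residues.append(c - noise)
            return median(residues)               # combine rows (illustrative choice)

    cmm = CMMSketch(width=64, depth=5)
    for x in "abracadabra":
        cmm.add(x)
    print(round(cmm.estimate("a"), 2), round(cmm.estimate("z"), 2))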
25 Theoretical results
- Unbiased estimate (deduct mean)
- The estimate variance is the same as that of Fast-AGMS (in the case of deducting the mean)
- For less skewed data sets
- the estimation accuracies of CMM and Fast-AGMS are exactly the same
26 Experimental results and analysis
- For skewed data sets
- Accuracy (given the same space)
- CMM-median ≈ Fast-AGMS > CMM-mean
- Time cost analysis
- CMM-mean ≈ Fast-AGMS < CMM-median
- but the difference is small
- Advantage of CMM
- More flexible (with an estimate upper bound)
- More powerful (Count-min can be more accurate for very skewed data sets)
27 Outline
- Introduction
- Continuous membership query
- Point query
- Similarity self-join size estimation
- Motivating application
- Problem statement
- Existing solutions and our solution
- Theoretical and experimental results
- Conclusions and future work
28 Motivating application
- Near-duplicate document detection for search engines [Broder '99, Henzinger '06]
- Very slow (30M pages took 10 days in 1997; 2006?)
- Good to predict the time
- How? Estimate the number of similar pairs
- Data cleaning in general (similarity self-join)
- To find a better query plan (query optimization)
- An estimate of the similarity self-join size is needed
29 Problem statement
- Similarity self-join size
- Given a set of records with d attributes, estimate the # of record pairs that are at least s-similar
- An s-similar pair
- A pair of records with s attributes in common
- E.g. <Davood, Rafiei, CS, UofA, Canada>
- and <Fan, Deng, CS, UofA, Canada>
- are 3-similar (see the helper sketch below)
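A tiny Python helper illustrating the attribute-wise similarity used above; it assumes attributes are compared position by position:

    def similarity(r1, r2):
        """# of attribute positions on which two records agree."""
        return sum(a == b for a, b in zip(r1, r2))

    r1 = ("Davood", "Rafiei", "CS", "UofA", "Canada")
    r2 = ("Fan", "Deng", "CS", "UofA", "Canada")
    print(similarity(r1, r2))   # 3, so the pair is 3-similar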
30 Existing solutions
- A straightforward solution
- Compare each record with all other records
- Count the number of pairs that are at least s-similar
- Time cost: O(n^2) for n records
- Random sampling (see the sketch below)
- Take a sample of size m uniformly at random
- Count the number of pairs that are at least s-similar
- Scale it by a factor of c = n(n-1) / (m(m-1))
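A hedged Python sketch of the sampling baseline; it reuses the positional similarity above, takes a uniform sample without replacement, and applies the scaling factor from the slide:

    import random
    from itertools import combinations

    def sampled_selfjoin_size(records, m, s):
        """Estimate the # of record pairs that are at least s-similar."""
        n = len(records)
        sample = random.sample(records, m)              # uniform sample of size m
        hits = sum(1 for r1, r2 in combinations(sample, 2)
                   if sum(a == b for a, b in zip(r1, r2)) >= s)
        return hits * (n * (n - 1)) / (m * (m - 1))     # scale by c = n(n-1)/(m(m-1))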
31 Our solution
- Offline SimParCount (Step 1: data processing; see the sketch below)
- Linearly scan all records once
- For each record
- for each k with s ≤ k ≤ d
- Randomly pick k different attribute values and concatenate them into one k-super-value
- Repeat this process l_k times
- Treat all k-super-values as a stream
- Store the (d-s+1) super-value streams on disk
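A Python sketch of the super-value generation step; the attribute tagging format and the handling of l_k (treated here as a single constant, whereas in the thesis the repeat count may depend on k) are illustrative assumptions:

    import random

    def supervalue_streams(records, s, d, l_k):
        """Emit one k-super-value stream for each k in s..d."""
        streams = {k: [] for k in range(s, d + 1)}
        for record in records:
            # Tag each value with its attribute position, e.g. '1a', '2c', ...
            tagged = [f"{i}{v}" for i, v in enumerate(record, start=1)]
            for k in range(s, d + 1):
                for _ in range(l_k):
                    picked = sorted(random.sample(range(d), k))      # k different attributes
                    streams[k].append(",".join(tagged[i] for i in picked))
        return streams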
32 Our solution (cont.)
- Offline SimParCount (Step 2: result generation)
- Obtain the self-join size of those 1-dimensional super-value streams (see the sketch below)
- Based on the d-s+1 self-join sizes, estimate the similarity self-join size
- Online SimParCount
- Use small sketches to estimate the stream self-join sizes rather than expensive external sorting
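In the offline variant, the self-join size of each 1-dimensional super-value stream can be obtained with a frequency map (the online variant would replace this with a small sketch such as Fast-AGMS). The Python sketch below counts matching unordered pairs, i.e. the sum over values of f(f-1)/2; the final combination of the d-s+1 sizes into the similarity self-join size estimate follows the thesis' formula, which is not spelled out on the slide and is omitted here:

    from collections import Counter

    def stream_selfjoin_size(stream):
        """# of unordered pairs of stream elements with identical values."""
        counts = Counter(stream)
        return sum(f * (f - 1) // 2 for f in counts.values())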
33 Our solution (cont.)
- Key idea
- Convert similarity self-join size estimation to stream self-join size estimation
- A similar record pair has a certain chance of producing a match in the super-value stream
- records --> 2-super-values
- <1a,2c,3b,4v> --> <2c,3b>
- <1e,2c,3b,4v> --> <2c,3b>
- <1e,2f,3d,4e> --> <1e,3d>
- ...
34 Theoretical results
- Unbiased estimate
- Standard deviation bound of the estimate
- Time and space cost
- (For both offline and online SimParCount)
35 Experimental results
- Online SimParCount vs. random sampling
- Given the same amount of space
- Error = (estimate - trueValue) / trueValue
- Dataset
- DBLP paper titles
- Each converted into a record with 6 attributes
- Using min-wise independent hashing (see the sketch below)
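A hedged Python sketch of how a paper title could be turned into a fixed-size record via min-wise independent hashing; the shingle size, the salted SHA-1 hashing, and the use of 6 attributes mirror the slide, but the details are illustrative assumptions:

    import hashlib

    def minhash_record(title, num_attributes=6, shingle_size=3):
        """Map a text title to a record of num_attributes min-hash values."""
        words = title.lower().split()
        shingles = {" ".join(words[i:i + shingle_size])
                    for i in range(max(1, len(words) - shingle_size + 1))}
        record = []
        for i in range(num_attributes):
            record.append(min(int(hashlib.sha1(f"{i}:{s}".encode()).hexdigest(), 16)
                              for s in shingles))
        return tuple(record)

    print(minhash_record("Approximation Algorithms for Frequency Related Query Processing"))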
36 Similarity self-join size estimation: Experimental results (cont.)
37 Conclusions and future work
- Streaming algorithms
- have found real applications (important)
- can lead to theoretical results (fun)
- More work to be done
- Current direction
- multi-dimensional streaming algorithms
- E.g.
- Estimating the # of outliers in one pass
38 Questions/Comments?