Title: Approximation Algorithms for Frequency Related Query Processing on Streaming Data
1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data
- Presented by Fan Deng
- Supervisor: Dr. Davood Rafiei
- May 24, 2007
2 Data stream
- A sequence of data records
- Examples
- Document/URL streams from a Web crawler
- IP packet streams
- Web advertisement click streams
- Sensor reading streams
- ...
3 Processing in one pass
- One pass processing
- Online stream (one scan required)
- Massive offline stream (one scan preferred)
- Challenges
- Huge data volume
- Fast processing requirement
- Relatively small fast storage space
4 Approximation algorithms
- Exact query answers
- can be slow to obtain
- may need large storage space
- sometimes are not necessary
- Approximate query answers
- can take much less time
- may need less space
- with acceptable errors
5 Frequency related queries
- Frequency
- # of occurrences
- Continuous membership query
- Point query
- Similarity self-join size estimation
6 Outline
- Introduction
- Continuous membership query
- Motivating application
- Problem statement
- Existing solutions and our solution
- Theoretical and experimental results
- Point query
- Similarity self-join size estimation
- Conclusions and future work
7 A Motivating Application
- Duplicate URL detection in Web crawling
- Search engines [Broder et al., WWW'03]
- Fetch web pages continuously
- Extract URLs within each downloaded page
- Check each URL (duplicate detection)
- If never seen before
- Then fetch it
- Else skip it
8 A Motivating Application (cont.)
- Problems
- Huge number of distinct URLs
- Memory is usually not large enough
- Disks are slow
- Errors are usually acceptable
- A false positive (a false alarm)
- A distinct URL is wrongly reported as a duplicate
- Consequence: this URL will not be crawled
- A false negative (a miss)
- A duplicate URL is wrongly reported as distinct
- Consequence: this URL will be crawled redundantly or searched for on disk
9 Problem statement
- An ordered sequence of elements
- Storage space M
- Not large enough to store all distinct elements
- Continuous membership query
- Appeared before? Yes or No
- Example stream: d g a f b e a d c b a
- Our goal
- Minimize the # of errors
- Fast
10 An existing solution (caching)
- Store as many distinct elements as possible in a buffer
- Duplicate detection process (see the sketch below)
- Upon element arrival, search the buffer
- If found, then report duplicate; else report distinct
- Update the buffer using some replacement policy
- LRU, FIFO, Random, ...
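A minimal Python sketch of this caching approach, assuming an LRU replacement policy; the buffer size and the example stream are illustrative:

    from collections import OrderedDict

    def detect_duplicates_lru(stream, buffer_size):
        """Report 'duplicate' or 'distinct' for each element using an LRU buffer."""
        buffer = OrderedDict()              # keys = elements, order = recency
        for x in stream:
            if x in buffer:
                buffer.move_to_end(x)       # refresh recency
                yield x, "duplicate"
            else:
                yield x, "distinct"
                buffer[x] = True
                if len(buffer) > buffer_size:
                    buffer.popitem(last=False)   # evict the least recently used element

    # Example on the stream from the problem statement slide
    print(list(detect_duplicates_lru("dgafbeadcba", buffer_size=4)))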
11 Another solution (Bloom filters)
- A bitmap, originally all 0
- Duplicate detection process
- Hash each incoming element into some bits
- If any bit is 0, then report distinct; else report duplicate
- Update process: set the corresponding bits to 1 (see the sketch below)
- Example: a bitmap with positions 1-6 and two hash functions h1, h2
  x    h1(x)   h2(x)
  a    1       2
  b    1       3
  c    2       4
  a    1       2
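A minimal Bloom filter sketch in Python of the detection and update steps above; the bitmap size, the number of hash functions, and the SHA-1 salting scheme are illustrative choices, not the exact ones from the talk:

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits, num_hashes):
            self.bits = [0] * num_bits
            self.num_hashes = num_hashes

        def _positions(self, x):
            # Derive the bit positions by salting one hash function (illustrative).
            for i in range(self.num_hashes):
                digest = hashlib.sha1(f"{i}:{x}".encode()).hexdigest()
                yield int(digest, 16) % len(self.bits)

        def seen_before(self, x):
            # Detection: if any bit is 0, x is distinct. Update: set all its bits to 1.
            positions = list(self._positions(x))
            duplicate = all(self.bits[p] for p in positions)
            for p in positions:
                self.bits[p] = 1
            return duplicate

    bf = BloomFilter(num_bits=6, num_hashes=2)
    for x in "abca":
        print(x, "duplicate" if bf.seen_before(x) else "distinct")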
12 Another solution (Bloom filters, cont.)
- False positives (false alarms)
- The Bloom filter will eventually become full
- Then all distinct URLs will be reported as duplicates, and thus skipped!
13 Our solution (Stable Bloom Filters)
- Kick elements out of the Bloom filters
- Change bits to cells (cellmap)
14 Stable Bloom Filters (SBF, cont.)
- A cellmap, originally all 0
- Duplicate detection
- Hash each element into some cells and check those cells
- If any cell is 0, report distinct; else report duplicate
- Kick elements out
- Randomly choose some cells and decrement them by 1
- Update the cellmap
- Set the element's cells to a predefined value Max > 0
- Use the same hash functions as in the detection stage (see the sketch below)
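A minimal Python sketch of the SBF steps above (detection, kick-out, update); the cell count, number of hash functions, Max, and the number of cells decremented per update are illustrative parameters, not the tuned values from the thesis:

    import hashlib
    import random

    class StableBloomFilter:
        def __init__(self, num_cells, num_hashes, max_value, kicks_per_update):
            self.cells = [0] * num_cells
            self.num_hashes = num_hashes
            self.max_value = max_value              # Max > 0
            self.kicks_per_update = kicks_per_update

        def _positions(self, x):
            for i in range(self.num_hashes):
                digest = hashlib.sha1(f"{i}:{x}".encode()).hexdigest()
                yield int(digest, 16) % len(self.cells)

        def seen_before(self, x):
            positions = list(self._positions(x))
            # Detection: if any cell is 0, report distinct; else duplicate.
            duplicate = all(self.cells[p] > 0 for p in positions)
            # Kick-out: randomly choose some cells and decrement them by 1.
            for _ in range(self.kicks_per_update):
                p = random.randrange(len(self.cells))
                if self.cells[p] > 0:
                    self.cells[p] -= 1
            # Update: set the element's own cells to Max.
            for p in positions:
                self.cells[p] = self.max_value
            return duplicate

    sbf = StableBloomFilter(num_cells=1000, num_hashes=3, max_value=3, kicks_per_update=10)
    print(["dup" if sbf.seen_before(x) else "new" for x in "dgafbeadcba"])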
15 SBF theoretical results
- SBF will be stable
- The expected # of 0s becomes a constant after a number of updates
- Convergence is at an exponential rate
- Monotonic
- False positive rates become constant
- An upper bound on the false positive rate
- (a function of 4 parameters: SBF size, # of hash functions, max cell value, and kick-out rate)
- Setting the optimal parameters (partially empirical)
16 SBF experimental results
- Experimental comparison between SBF and the caching/buffering method (LRU)
- URL fingerprint data set, originally obtained from the Internet Archive (about 700M URLs)
- For a fair comparison, we introduce FPBuffering
- Let Caching generate some false positives
- FPBuffering
- If an element is not found in the buffer, report duplicate with a certain probability
17 SBF experimental results (cont.)
- SBF generates 3-13% fewer false negatives than FPBuffering, while having exactly the same # of false positives (<10%)
18 SBF experimental results (cont.)
19 SBF experimental results (cont.)
- MIN [Broder et al., WWW'03], theoretically optimal
- assumes the entire sequence of requests is known in advance
- beats LRU caching by <5% in most cases
- The more false positives allowed, the more SBF gains
20 Outline
- Introduction
- Continuous membership query
- Point query
- Motivating application
- Problem statement
- Existing solutions and our solution
- Theoretical and experimental results
- Similarity self-join size estimation
- Conclusions and future work
21 Motivating application
- Internet traffic monitoring
- Query the # of IP packets sent by a particular IP address in the past hour
- Phone call record analysis
- Query the # of calls to a given phone number yesterday
22 Problem statement
- Point query
- Summarize a stream of elements
- Estimate the frequency of a given element
- Goal: minimize the space cost and answer the query fast
23 Existing solutions
- Fast-AGMS sketch [AMS'97, Charikar et al. 2002]
- Count-min sketch (counting Bloom filters)
- e.g. an element is hashed to 4 counters
- Take the min counter value as the estimate (see the sketch below)
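A minimal Count-min sketch in Python illustrating the point-query estimate; the width, depth, and salted SHA-1 hashing are illustrative choices:

    import hashlib

    class CountMinSketch:
        def __init__(self, width, depth):
            self.width, self.depth = width, depth
            self.counters = [[0] * width for _ in range(depth)]

        def _col(self, row, x):
            digest = hashlib.sha1(f"{row}:{x}".encode()).hexdigest()
            return int(digest, 16) % self.width

        def add(self, x, count=1):
            for row in range(self.depth):
                self.counters[row][self._col(row, x)] += count

        def estimate(self, x):
            # Point query: take the minimum counter value over all rows.
            return min(self.counters[row][self._col(row, x)] for row in range(self.depth))

    cms = CountMinSketch(width=64, depth=4)
    for x in "abracadabra":
        cms.add(x)
    print(cms.estimate("a"), cms.estimate("z"))   # true counts: 5 and 0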
24 Our solution
- Count-median-mean (CMM)
- Count-min based
- Take the value of the counter the element is hashed to
- Deduct the median/mean value of all other counters
- The remainder from deducting the mean is an unbiased estimate
- Basic idea: all counters are expected to have the same value
- Example (see also the sketch below)
- counter value: 3
- mean value of all other counters: 2 (the median, also 2, is more robust)
- remainder: 1, so the frequency estimate = 3 - 2 = 1
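A Python sketch of the CMM idea on top of a Count-min style counter array: per row, deduct the mean (or median) of all other counters from the element's counter, then combine the per-row residues. Combining rows by taking their median is an assumption made here for illustration; the exact combination rule in the thesis may differ.

    import hashlib
    from statistics import median

    class CMMSketch:
        def __init__(self, width, depth):
            self.width, self.depth = width, depth
            self.counters = [[0] * width for _ in range(depth)]
            self.total = 0                        # total count inserted (per row)

        def _col(self, row, x):
            digest = hashlib.sha1(f"{row}:{x}".encode()).hexdigest()
            return int(digest, 16) % self.width

        def add(self, x, count=1):
            self.total += count
            for row in range(self.depth):
                self.counters[row][self._col(row, x)] += count

        def estimate(self, x, use_median=True):
            residues = []
            for row in range(self.depth):
                col = self._col(row, x)
                c = self.counters[row][col]
                if use_median:
                    noise = median(v for i, v in enumerate(self.counters[row]) if i != col)
                else:
                    noise = (self.total - c) / (self.width - 1)   # mean of all other counters
                residues.append(c - noise)
            return median(residues)               # combine rows (illustrative choice)

    cmm = CMMSketch(width=64, depth=5)
    for x in "abracadabra":
        cmm.add(x)
    print(round(cmm.estimate("a"), 2), round(cmm.estimate("z"), 2))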
25 Theoretical results
- Unbiased estimate (deduct mean)
- The estimate variance is the same as that of Fast-AGMS (in the case of deducting the mean)
- For less skewed data sets
- the estimation accuracies of CMM and Fast-AGMS are exactly the same
26 Experimental results and analysis
- For skewed data sets
- Accuracy (given the same space)
- CMM-median ≈ Fast-AGMS > CMM-mean
- Time cost analysis
- CMM-mean ≈ Fast-AGMS < CMM-median
- but the difference is small
- Advantage of CMM
- More flexible (with an estimate upper bound)
- More powerful (Count-min can be more accurate for very skewed data sets)
27 Outline
- Introduction
- Continuous membership query
- Point query
- Similarity self-join size estimation
- Motivating application
- Problem statement
- Existing solutions and our solution
- Theoretical and experimental results
- Conclusions and future work
28 Motivating application
- Near-duplicate document detection for search engines [Broder '99, Henzinger '06]
- Very slow (30M pages took 10 days in 1997; 2006?)
- Good to predict the time
- How? Estimate the number of similar pairs
- Data cleaning in general (similarity self-join)
- To find a better query plan (query optimization)
- An estimate of the similarity self-join size is needed
29 Problem statement
- Similarity self-join size
- Given a set of records with d attributes, estimate the # of record pairs that are at least s-similar
- An s-similar pair
- A pair of records with s attributes in common
- E.g. <Davood, Rafiei, CS, UofA, Canada>
- and <Fan, Deng, CS, UofA, Canada>
- are 3-similar (see the helper sketch below)
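A tiny Python helper illustrating the attribute-wise similarity used above; it assumes attributes are compared position by position:

    def similarity(r1, r2):
        """# of attribute positions on which two records agree."""
        return sum(a == b for a, b in zip(r1, r2))

    r1 = ("Davood", "Rafiei", "CS", "UofA", "Canada")
    r2 = ("Fan", "Deng", "CS", "UofA", "Canada")
    print(similarity(r1, r2))   # 3, so the pair is 3-similar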
30 Existing solutions
- A straightforward solution
- Compare each record with all other records
- Count the number of pairs that are at least s-similar
- Time cost: O(n^2) for n records
- Random sampling (see the sketch below)
- Take a sample of size m uniformly at random
- Count the number of pairs that are at least s-similar
- Scale it by a factor of c = n(n-1) / (m(m-1))
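A hedged Python sketch of the sampling baseline; it reuses the positional similarity above, takes a uniform sample without replacement, and applies the scaling factor from the slide:

    import random
    from itertools import combinations

    def sampled_selfjoin_size(records, m, s):
        """Estimate the # of record pairs that are at least s-similar."""
        n = len(records)
        sample = random.sample(records, m)              # uniform sample of size m
        hits = sum(1 for r1, r2 in combinations(sample, 2)
                   if sum(a == b for a, b in zip(r1, r2)) >= s)
        return hits * (n * (n - 1)) / (m * (m - 1))     # scale by c = n(n-1)/(m(m-1))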
31 Our solution
- Offline SimParCount (Step 1: data processing; see the sketch below)
- Linearly scan all records once
- For each record
- for each k with s ≤ k ≤ d
- Randomly pick k different attribute values and concatenate them into one k-super-value
- Repeat this process l_k times
- Treat all k-super-values as a stream
- Store the (d-s+1) super-value streams on disk
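A Python sketch of the super-value generation step; the attribute tagging format and the handling of l_k (treated here as a single constant, whereas in the thesis the repeat count may depend on k) are illustrative assumptions:

    import random

    def supervalue_streams(records, s, d, l_k):
        """Emit one k-super-value stream for each k in s..d."""
        streams = {k: [] for k in range(s, d + 1)}
        for record in records:
            # Tag each value with its attribute position, e.g. '1a', '2c', ...
            tagged = [f"{i}{v}" for i, v in enumerate(record, start=1)]
            for k in range(s, d + 1):
                for _ in range(l_k):
                    picked = sorted(random.sample(range(d), k))      # k different attributes
                    streams[k].append(",".join(tagged[i] for i in picked))
        return streams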
32 Our solution (cont.)
- Offline SimParCount (Step 2: result generation)
- Obtain the self-join size of those 1-dimensional super-value streams (see the sketch below)
- Based on the d-s+1 self-join sizes, estimate the similarity self-join size
- Online SimParCount
- Use small sketches to estimate the stream self-join sizes rather than expensive external sorting
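In the offline variant, the self-join size of each 1-dimensional super-value stream can be obtained with a frequency map (the online variant would replace this with a small sketch such as Fast-AGMS). The Python sketch below counts matching unordered pairs, i.e. the sum over values of f(f-1)/2; the final combination of the d-s+1 sizes into the similarity self-join size estimate follows the thesis' formula, which is not spelled out on the slide and is omitted here:

    from collections import Counter

    def stream_selfjoin_size(stream):
        """# of unordered pairs of stream elements with identical values."""
        counts = Counter(stream)
        return sum(f * (f - 1) // 2 for f in counts.values())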
33 Our solution (cont.)
- Key idea
- Convert similarity self-join size estimation to stream self-join size estimation
- A similar record pair has a certain chance of producing a match in the super-value stream
- records --> 2-super-values
- <1a,2c,3b,4v> --> <2c,3b>
- <1e,2c,3b,4v> --> <2c,3b>
- <1e,2f,3d,4e> --> <1e,3d>
- ...
34 Theoretical results
- Unbiased estimate
- Standard deviation bound of the estimate
- Time and space cost
- (For both offline and online SimParCount)
35 Experimental results
- Online SimParCount vs. random sampling
- Given the same amount of space
- Error = (estimate - trueValue) / trueValue
- Dataset
- DBLP paper titles
- Each converted into a record with 6 attributes
- Using min-wise independent hashing (see the sketch below)
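A hedged Python sketch of how a paper title could be turned into a fixed-size record via min-wise independent hashing; the shingle size, the salted SHA-1 hashing, and the use of 6 attributes mirror the slide, but the details are illustrative assumptions:

    import hashlib

    def minhash_record(title, num_attributes=6, shingle_size=3):
        """Map a text title to a record of num_attributes min-hash values."""
        words = title.lower().split()
        shingles = {" ".join(words[i:i + shingle_size])
                    for i in range(max(1, len(words) - shingle_size + 1))}
        record = []
        for i in range(num_attributes):
            record.append(min(int(hashlib.sha1(f"{i}:{s}".encode()).hexdigest(), 16)
                              for s in shingles))
        return tuple(record)

    print(minhash_record("Approximation Algorithms for Frequency Related Query Processing"))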
36 Similarity self-join size estimation: Experimental results (cont.)
37 Conclusions and future work
- Streaming algorithms
- have found real applications (important)
- can lead to theoretical results (fun)
- More work to be done
- Current direction
- multi-dimensional streaming algorithms
- E.g.
- Estimating the # of outliers in one pass
38 Questions/Comments?