Approximation Algorithms for Frequency Related Query Processing on Streaming Data

1
Approximation Algorithms for Frequency Related
QueryProcessing on Streaming Data
  • Presented by Fan Deng
  • Supervisor: Dr. Davood Rafiei
  • May 24, 2007

2
Data stream
  • A sequence of data records
  • Examples
  • Document/URL streams from a Web crawler
  • IP packet streams
  • Web advertisement click streams
  • Sensor reading streams
  • ...

3
Processing in one pass
  • One pass processing
  • Online stream (one scan required)
  • Massive offline stream (one scan preferred)
  • Challenges
  • Huge data volume
  • Fast processing requirement
  • Relatively small fast storage space

4
Approximation algorithms
  • Exact query answers
  • can be slow to obtain
  • may need large storage space
  • sometimes are not necessary
  • Approximate query answers
  • can take much less time
  • may need less space
  • with acceptable errors

5
Frequency related queries
  • Frequency
  • Number of occurrences of an element
  • Continuous membership query
  • Point query
  • Similarity self-join size estimation

6
Outline
  • Introduction
  • Continuous membership query
  • Motivating application
  • Problem statement
  • Existing solutions and our solution
  • Theoretical and experimental results
  • Point query
  • Similarity self-join size estimation
  • Conclusions and future work

7
A Motivating Application
  • Duplicate URL detection in Web crawling
  • Search engines [Broder et al., WWW'03]
  • Fetch web pages continuously
  • Extract URLs within each downloaded page
  • Check each URL (duplicate detection)
  • If never seen before
  • Then fetch it
  • Else skip it

8
A Motivating Application (cont.)
  • Problems
  • Huge number of distinct URLs
  • Memory is usually not large enough
  • Disks are slow
  • Errors are usually acceptable
  • A false positive (false alarm)
  • A distinct URL is wrongly reported as a duplicate
  • Consequence: this URL will not be crawled
  • A false negative (miss)
  • A duplicate URL is wrongly reported as distinct
  • Consequence: this URL will be crawled redundantly
    or searched for on disks

9
Problem statement
  • A sequence of elements with order
  • Storage space M
  • Not large enough to store all distinct elements
  • Continuous membership query
  • Appeared before? Yes or No
  • d g a f b e a d c b a
  • Our goal
  • Minimize the number of errors
  • Fast

10
An existing solution (caching)
  • Store as many distinct elements as possible in a
    buffer
  • Duplicate detection process
  • Upon element arrival, search the buffer
  • if found then report duplicate else distinct
  • Update the buffer using some replacement policies
  • LRU, FIFO, Random, ...
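A minimal sketch of this caching approach with an LRU replacement policy; the buffer capacity and element type are illustrative assumptions, not values from the slides:

```python
# Duplicate detection with a fixed-size LRU buffer (illustrative sketch).
from collections import OrderedDict

class LRUDuplicateDetector:
    def __init__(self, capacity):
        self.capacity = capacity      # number of distinct elements the buffer can hold
        self.buffer = OrderedDict()   # element -> None, ordered from least to most recent

    def is_duplicate(self, element):
        if element in self.buffer:
            self.buffer.move_to_end(element)   # refresh recency
            return True                        # found in buffer: report duplicate
        # not found: report distinct, insert, and evict the least recently used if full
        if len(self.buffer) >= self.capacity:
            self.buffer.popitem(last=False)
        self.buffer[element] = None
        return False
```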

11
Another solution (Bloom filters)
  • A bitmap, originally all 0
  • Duplicate detection process
  • Hash each incoming element into some bits
  • If any bit is 0 then report distinct else
    duplicate
  • Update process - sets corresponding bits to 1
  • Example: a 6-bit bitmap (bits 1-6), two hash functions h1 and h2
  •   x   h1(x)   h2(x)
  •   a   1       2
  •   b   1       3
  •   c   2       4
  •   a   1       2   (bits 1 and 2 already set, so this 'a' is reported as a duplicate)
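A minimal sketch of this Bloom-filter duplicate check; the bitmap size, number of hash functions, and the salted SHA-256 hashing are illustrative assumptions:

```python
# Bloom-filter duplicate detection: check the hashed bits, then set them (illustrative sketch).
import hashlib

class BloomFilter:
    def __init__(self, num_bits=6, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits          # bitmap, originally all 0

    def _positions(self, element):
        # derive num_hashes bit positions from independently salted hashes
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{element}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def check_and_insert(self, element):
        positions = list(self._positions(element))
        duplicate = all(self.bits[p] == 1 for p in positions)   # distinct if any bit is 0
        for p in positions:
            self.bits[p] = 1                                     # update: set the bits to 1
        return duplicate
```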

12
Another solution (Bloom filters, cont.)
  • False positives (false alarms)
  • The Bloom filter will eventually fill up
  • Then all distinct URLs will be reported as
    duplicates, and thus skipped!

13
Our solution (Stable Bloom Filters)
  • Kick elements out of the Bloom filters
  • Change bits to cells (cellmap)

14
Stable Bloom Filters (SBF, cont.)
  • A cellmap, originally all 0
  • Duplicate detection
  • Hash each element into some cells, check those
    cells
  • If any cell is 0, report distinct else
    duplicate
  • Kick elements
  • Randomly choose some cells and deduct them by 1
  • Update the cellmap
  • Set cells to a predefined value Max > 0
  • Use the same hash functions as in the detection
    stage
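A minimal sketch of the SBF check/kick/update cycle described above; the cellmap size, number of hash functions, Max, and the number of random decrements per arrival are illustrative assumptions, not the tuned parameters from the thesis:

```python
# Stable Bloom Filter: check cells, randomly deduct some cells, set own cells to Max (illustrative sketch).
import hashlib
import random

class StableBloomFilter:
    def __init__(self, num_cells=10000, num_hashes=3, max_value=3, kicks_per_update=10):
        self.cells = [0] * num_cells        # cellmap, originally all 0
        self.num_hashes = num_hashes
        self.max_value = max_value          # the predefined value Max > 0
        self.kicks = kicks_per_update       # how many random cells to deduct per arrival

    def _positions(self, element):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{element}".encode()).hexdigest()
            yield int(digest, 16) % len(self.cells)

    def check_and_update(self, element):
        positions = list(self._positions(element))
        duplicate = all(self.cells[p] > 0 for p in positions)   # distinct if any cell is 0
        # kick elements: randomly choose some cells and deduct them by 1
        for _ in range(self.kicks):
            j = random.randrange(len(self.cells))
            if self.cells[j] > 0:
                self.cells[j] -= 1
        # update: set this element's cells to Max, using the same hash functions
        for p in positions:
            self.cells[p] = self.max_value
        return duplicate
```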

15
SBF theoretical results
  • SBF will be stable
  • The expected number of 0s will become a constant
    after a number of updates
  • Converge at an exponential rate
  • Monotonic
  • False positive rates become constant
  • An upper bound of false positive rates
  • (a function of 4 parameters: SBF size, number of hash
    functions, max cell value, and kick-out rate)
  • Setting the optimal parameters (partially
    empirical)

16
SBF experimental results
  • Experimental comparison between SBF and the
    caching/buffering method (LRU)
  • URL fingerprint data set, originally obtained
    from the Internet Archive (~700M URLs)
  • To fairly compare, we introduce FPBuffering
  • Let Caching generate some false positives
  • FPBuffering
  • If an element is not found in the buffer, report
    duplicate with a certain probability

17
SBF experimental results (cont.)
  • SBF generates 3-13% fewer false negatives than
    FPBuffering, while having exactly the same false
    positive rate (<10%)

18
SBF experimental results (cont.)
19
SBF experimental results (cont.)
  • MIN [Broder et al., WWW'03] is theoretically optimal
  • It assumes the entire sequence of requests is known
    in advance
  • It beats LRU caching by <5% in most cases
  • When more false positives are allowed, SBF gains more

20
Outline
  • Introduction
  • Continuous membership query
  • Point query
  • Motivating application
  • Problem statement
  • Existing solutions and our solution
  • Theoretical and experimental results
  • Similarity self-join size estimation
  • Conclusions and future work

21
Motivating application
  • Internet traffic monitoring
  • Query the number of IP packets sent by a particular IP
    address in the past hour
  • Phone call record analysis
  • Query the number of calls to a given phone number yesterday

22
Problem statement
  • Point query
  • Summarize a stream of elements
  • Estimate the frequency of a given element
  • Goal: minimize the space cost and answer the
    query fast

23
Existing solutions
  • Fast-AGMS sketch [AMS'97, Charikar et al. 2002]
  • Count-min sketch (counting Bloom filters)
  • e.g. an element is hashed to 4 counters
  • Take the min counter value as the estimate
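A minimal sketch of a Count-min style summary for point queries; the depth, width, and hashing scheme are illustrative assumptions:

```python
# Count-min sketch: hash each element to one counter per row; answer with the minimum (illustrative sketch).
import hashlib

class CountMinSketch:
    def __init__(self, depth=4, width=1000):
        self.depth = depth
        self.width = width
        self.counters = [[0] * width for _ in range(depth)]

    def _column(self, row, element):
        digest = hashlib.sha256(f"{row}:{element}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def update(self, element, count=1):
        for row in range(self.depth):
            self.counters[row][self._column(row, element)] += count

    def point_query(self, element):
        # take the min counter value as the estimate
        return min(self.counters[row][self._column(row, element)]
                   for row in range(self.depth))
```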

24
Our solution
  • Count-median-mean (CMM)
  • Count-min based
  • Take the value of the counter the element is
    hashed to
  • Deduct the median/mean value of all other
    counters
  • The remainder is an unbiased estimate (in the case
    of deducting the mean)
  • Basic idea all counters are expected to have the
    same value
  • Example
  • counter value = 3
  • mean value of all other counters = 2 (median = 2,
    more robust)
  • remainder = 1, so frequency estimate = 3 - 2 = 1
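A minimal sketch of the CMM estimate for a single counter row, following the example above; combining several rows (e.g., by their median) is omitted, and the exact estimator in the thesis may differ in detail:

```python
# CMM idea on one row: deduct the median (or mean) of all other counters from the element's counter.
import statistics

def cmm_row_estimate(counters, col, deduct="median"):
    """counters: one row of counter values; col: index the element hashes to."""
    others = counters[:col] + counters[col + 1:]
    noise = statistics.median(others) if deduct == "median" else statistics.mean(others)
    return counters[col] - noise

# The slide's example: counter value 3, median/mean of the other counters 2 -> estimate 1
print(cmm_row_estimate([3, 2, 2, 2], col=0))   # prints 1
```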

25
Theoretical results
  • Unbiased estimate (when deducting the mean)
  • The estimate variance is the same as that of
    Fast-AGMS (in the case of deducting the mean)
  • For less skewed data sets
  • the estimation accuracies of CMM and Fast-AGMS
    are exactly the same

26
Experimental results and analysis
  • For skewed data sets
  • Accuracy (given the same space)
  • CMM-median ≈ Fast-AGMS > CMM-mean
  • Time cost analysis
  • CMM-mean ≈ Fast-AGMS < CMM-median
  • but the difference is small
  • Advantage of CMM
  • More flexible (provides an upper bound on the estimate)
  • More powerful (Count-min can be more accurate for
    very skewed data sets)

27
Outline
  • Introduction
  • Continuous membership query
  • Point query
  • Similarity self-join size estimation
  • Motivating application
  • Problem statement
  • Existing solutions and our solution
  • Theoretical and experimental results
  • Conclusions and future work

28
Motivating application
  • Near-duplicate document detection for search
    engines Broder 99, Henzinger 06
  • Very slow (30M pages, 10 days in 1997; 2006?)
  • Good to predict the time
  • How? Estimate the number of similar pairs
  • Data cleaning in general (similarity self-join)
  • To find a better query plan (query optimization)
  • An estimate of the similarity self-join size is needed

29
Problem statement
  • Similarity self-join size
  • Given a set of records with d attributes,
    estimate the number of record pairs that are at least
    s-similar
  • An s-similar pair
  • A pair of records with s attributes in common
  • E.g. <Davood, Rafiei, CS, UofA, Canada>
  • <Fan, Deng, CS, UofA, Canada>
  • are 3-similar

30
Existing solutions
  • A straightforward solution
  • Compare each record with all other records
  • Count the number of pairs at least s-similar
  • Time cost O(n2) for n records
  • Random sampling
  • Take a sample of size m uniformly at random
  • Count the number of pairs at least s-similar
  • Scale it by a factor of c = n(n-1) / (m(m-1))
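A minimal sketch of this random-sampling baseline; records are assumed to be tuples of d attribute values, and two records count as s-similar when they agree on at least s attributes:

```python
# Random-sampling estimate of similarity self-join size, scaled by c = n(n-1)/(m(m-1)) (illustrative sketch).
import itertools
import random

def is_s_similar(r1, r2, s):
    # records are s-similar if they have at least s attribute values in common
    return sum(a == b for a, b in zip(r1, r2)) >= s

def sampled_selfjoin_size(records, s, sample_size):
    n = len(records)
    sample = random.sample(records, sample_size)
    pairs = sum(is_s_similar(r1, r2, s)
                for r1, r2 in itertools.combinations(sample, 2))
    scale = n * (n - 1) / (sample_size * (sample_size - 1))
    return pairs * scale
```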

31
Our solution
  • Offline SimParCount (Step 1- data processing)
  • Linearly scan all records once
  • For each record
  • for each k with s ≤ k ≤ d
  • Randomly pick k different attribute values, and
    concatenate them into one k-super-value
  • Repeat this process l_k times
  • Look at all k-super-values as a stream
  • Store the (d-s+1) super-value streams on disks
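A minimal sketch of Step 1 as described above: for each record and each k from s to d, draw l_k random k-subsets of its attribute values and emit each concatenation as a k-super-value. The fixed l_k and the string encoding of super-values are illustrative assumptions:

```python
# SimParCount Step 1: turn records into d-s+1 streams of k-super-values (illustrative sketch).
import random

def supervalue_streams(records, s, d, l_k=5):
    streams = {k: [] for k in range(s, d + 1)}     # one stream per k, d-s+1 streams in total
    for record in records:                          # one linear scan over the records
        for k in range(s, d + 1):
            for _ in range(l_k):                    # repeat the random pick l_k times
                positions = sorted(random.sample(range(d), k))
                # concatenate the chosen attribute positions/values into one k-super-value
                supervalue = "|".join(f"{p}:{record[p]}" for p in positions)
                streams[k].append(supervalue)
    return streams
```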

32
Our solution (cont.)
  • Offline SimParCount
  • (Step 2 - Result generating)
  • Obtain the self-join size of those 1-dimensional
    super-value streams
  • Based on the d-s+1 self-join sizes, estimate the
    similarity self-join size
  • Online SimParCount
  • Use small sketches to estimate stream self-join
    sizes rather than expensive external sorting

33
Our solution (cont.)
  • Key idea
  • Convert similarity self-join size estimation to
    stream self-join size estimation
  • A similar record pair will have a certain chance of
    having a match in the super-value stream
  • records --- 2-super-values
  • <1a, 2c, 3b, 4v> --- <2c, 3b>
  • <1e, 2c, 3b, 4v> --- <2c, 3b>
  • <1e, 2f, 3d, 4e> --- <1e, 3d>

34
Theoretical results
  • Unbiased estimate
  • Standard deviation bound of the estimate
  • Time and space cost
  • (For both offline and online SimParCount)

35
Experimental results
  • Online SimParCount vs. random sampling
  • Given the same amount of space
  • Error = |estimate - trueValue| / trueValue
  • Dataset
  • DBLP paper titles
  • Each converted into a record with 6 attributes
  • Using min-wise independent hashing

36
Similarity self-join size estimation
Experimental results (cont.)
37
Conclusions and future work
  • Streaming algorithms
  • found real applications (important)
  • can lead to theoretical results (fun)
  • More work to be done
  • Current direction
  • multi-dimensional streaming algorithms
  • E.g.
  • Estimating the number of outliers in one pass

38
Questions/Comments?