Models and Issues in Data Stream Systems

1
Models and Issues in Data Stream Systems
Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Jennifer Widom
ACM SIGMOD/PODS, 2002
  • Adesola Omotayo
  • October 18, 2005

2
Presentation Outline
  • The Data Stream Model
  • Review of Data Stream Projects
  • Queries of Data Streams
  • Proposal for a DSMS
  • Algorithmic Issues
  • Conclusions
  • My Opinions

3
Proposal for a DSMS
  • STREAM (STanford stREam datA Manager)
  • Query Language for a DSMS
  • Query Processing Architecture of a DSMS

4
Query Language for a DSMS
  • Modified version of SQL
  • well known
  • declarative language
  • Allows FROM clause to refer to streams and
    relations
  • Allows the formulation of sliding window queries
  • ordering of data stream elements
  • optional window specification after a stream in
    the FROM clause

[Figure: incoming stream (S), S = (s1, i1), (s2, i2), ..., (sn, in), with sample tuples (847, 12), (847, 20), (847, 15). Labels: implicit timestamp (order of arrival) and explicit timestamp.]
5
Query Examples
  • Calls(customer_id, type, minutes, timestamp)

SELECT AVG(S.minutes)
FROM Calls S [PARTITION BY S.customer_id
              ROWS 10 PRECEDING
              WHERE S.type = 'Long Distance']

SELECT AVG(S.minutes)
FROM Calls S [PARTITION BY S.customer_id
              ROWS 10 PRECEDING]
WHERE S.type = 'Long Distance'
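As a rough illustration of the window semantics, the sketch below emulates in Python the two ways a type filter can combine with a per-customer window of recent rows: filtering before windowing, or windowing before filtering. The function names, the (customer_id, type, minutes) tuple layout, and reading ROWS 10 PRECEDING as "the 10 most recent rows" are assumptions made for this example; it is not STREAM's implementation.

```python
from collections import defaultdict, deque


def avg_filter_then_window(calls, rows=10):
    """Average over the last `rows` long-distance calls of each customer."""
    windows = defaultdict(lambda: deque(maxlen=rows))
    for customer_id, call_type, minutes in calls:
        if call_type == "Long Distance":      # filter first
            windows[customer_id].append(minutes)
    values = [m for w in windows.values() for m in w]
    return sum(values) / len(values) if values else None


def avg_window_then_filter(calls, rows=10):
    """Average over the long-distance calls among each customer's last `rows` calls."""
    windows = defaultdict(lambda: deque(maxlen=rows))
    for customer_id, call_type, minutes in calls:
        windows[customer_id].append((call_type, minutes))   # window first
    values = [m for w in windows.values()
              for call_type, m in w if call_type == "Long Distance"]
    return sum(values) / len(values) if values else None
```

The two functions can disagree on the same stream, which is why the placement of the filter relative to the window specification matters.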
6
Timestamps in Streams 1
  • Timestamps are ambiguous for tuples derived from
    multiple streams
  • Drawback of explicit timestamps: tuples may
    arrive out of timestamp order
  • an almost-sorted stream can be fixed with a
    little buffering.
  • Methods of assigning timestamps to the output of
    binary operators
  • best effort approach
  • stricter approach

SELECT *
FROM S1 [ROWS 1000 PRECEDING], S2 [ROWS 100 PRECEDING]
WHERE S1.A = S2.B
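The "little buffering" fix for an almost-sorted stream can be sketched with a small min-heap. This is a minimal sketch under the assumption that no tuple arrives more than `capacity` positions late; the function name and the (timestamp, value) tuple layout are hypothetical.

```python
import heapq


def reorder(stream, capacity):
    """Yield (timestamp, value) pairs in timestamp order.

    Correct only if every tuple is at most `capacity` positions out of place.
    """
    heap = []
    for seq, (ts, value) in enumerate(stream):
        # seq breaks ties so that values themselves are never compared
        heapq.heappush(heap, (ts, seq, value))
        if len(heap) > capacity:
            ts_out, _, value_out = heapq.heappop(heap)
            yield ts_out, value_out
    while heap:                                # drain at end of input
        ts_out, _, value_out = heapq.heappop(heap)
        yield ts_out, value_out
```

The buffer holds at most `capacity + 1` tuples at any time, so memory stays bounded regardless of the stream length.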
7
Timestamps in Streams 2
  • The keyword, recent

8
Query Processing Architecture 1
  • Query execution plans consist of operators
    connected by queues
  • Central scheduler schedules operators for
    execution
  • During execution, an operator reads data from its
    input queues, updates its synopsis structure, and
    writes results to its output queues
  • The scheduler dynamically determines each
    operator's period of execution; the operator
    returns control to the scheduler once the period
    expires

This object is copied from the original paper
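The operator, queue, and scheduler arrangement described on this slide can be sketched roughly as follows. The class and function names are hypothetical, and two simplifications are assumed: the execution period is a fixed tuple budget rather than a dynamically chosen time slice, and the scheduler is plain round-robin.

```python
from collections import deque


class Operator:
    def __init__(self, fn, inbox, outbox):
        self.fn = fn                  # maps a tuple to an output tuple, or None to drop it
        self.inbox = inbox            # input queue
        self.outbox = outbox          # output queue
        self.synopsis = {"seen": 0}   # toy synopsis: count of tuples processed

    def run(self, budget):
        """Process at most `budget` tuples, then return control to the scheduler."""
        done = 0
        while self.inbox and done < budget:
            item = self.inbox.popleft()
            self.synopsis["seen"] += 1
            out = self.fn(item)
            if out is not None:
                self.outbox.append(out)
            done += 1
        return done


def schedule(operators, budget=2):
    """Round-robin over operators until none of them makes progress."""
    progress = True
    while progress:
        progress = False
        for op in operators:
            if op.run(budget) > 0:
                progress = True
```

A plan is built by sharing queues: the output queue of one operator is the input queue of the next.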
9
Query Processing Architecture 2
  • To handle fluctuations in stream and query
    parameters, operators are adaptive (primarily to
    memory)
  • Trading accuracy for memory
  • Operator maximizes accuracy of output based on
    size of available memory
  • Handles dynamic changes in size of its available
    memory
  • Example: for a sliding window join, the larger
    the window, the better the approximation

10
Query Processing Architecture 3
  • Issues in Memory Management
  • How do different query ops produce approximate
    answers under limited memory?
  • How do approximate results behave when operators
    are composed in query plans?
  • How can the DSMS allocate memory to operators to
    maximize accuracy of answer?
  • How can DSMS reallocate memory among operators
    under changing conditions?
  • Given a query, how does the query optimizer come
    up with a query plan that, with best memory
    allocation, minimizes approximation error? Should
    plans be modified when conditions change?
  • Since synopses can be shared among query plans,
    how do we optimally consider a set of queries,
    which may be weighted by importance?

11
Query Processing Architecture 4
  • Issues in Scheduling
  • Scheduler needs to provide rate synchronization
    within operators and across pipelined operators
    in query plans
  • Time-varying arrival rates of data streams and
    time-varying output rates of operators complicate
    matters
  • Need to take into account
  • Memory allocation across operators
  • Management of buffers for incoming streams
  • Availability of synopses on disk (instead of
    memory)
  • Performance requirements of individual queries

12
Algorithmic Issues
  • Random Samples
  • Sketching Techniques
  • Histograms
  • Sliding Windows
  • Negative Results
  • Miscellaneous algorithms

13
Random Samples 1
  • Used as summary structure in many scenarios where
    small sample is expected to capture essential
    characteristics of data set
  • Easiest form of summarization
  • Other synopses can be built from sample itself
  • Variations include
  • stratified sampling
  • uniform sampling
  • weighted sampling

14
Random Samples 2
  • Idea: a small random sample S of the data often
    represents all the data well

Data stream (n = 12): 9 3 5 2 7 1 6 5 8 4 9 1
Sample S: 9 5 1 8
Example: SELECT AVG(R.e) FROM R WHERE R.e is odd
Answer (estimated from S): 5
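One standard way to maintain a uniform random sample of fixed size over a stream of unknown length is reservoir sampling. The slides do not say how the sample S was drawn, so the sketch below is an assumed technique for illustration; the function name and the optional `rng` parameter are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items in one pass."""
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)           # fill the reservoir first
        else:
            j = rng.randrange(n)          # uniform integer in [0, n)
            if j < k:                     # happens with probability k / n
                sample[j] = item
    return sample
```

An aggregate such as the average of the odd values can then be estimated from the sample instead of the full stream.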
15
Sketching Techniques 1
  • Building summary of data stream using small
    amount of memory
  • Make it possible to estimate answers to certain
    queries over data set
  • F0 is the number of distinct values in S,
    estimable in O(log d) space
  • F1 is the length of S
  • F2 is the self-join size, estimable in
    O(log d + log N) space
  • F∞ is the multiplicity of the most frequent item
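To make the F2 estimate concrete, here is a minimal sketch in the spirit of the Alon-Matias-Szegedy "tug-of-war" estimator: each counter adds a pseudo-random sign of +1 or -1 per item, and the square of the counter estimates F2. Several assumptions apply: items are non-negative integers below the modulus, signs come from a random cubic polynomial hash, and the counters are combined with a plain mean rather than a median of means. All names are hypothetical.

```python
import random
from collections import Counter

P = 2_147_483_647  # prime modulus (2^31 - 1) for the hash polynomials


class F2Sketch:
    def __init__(self, counters=64, seed=0):
        rng = random.Random(seed)
        # one random cubic polynomial per counter (4-wise independent hashing)
        self.coeffs = [[rng.randrange(P) for _ in range(4)] for _ in range(counters)]
        self.z = [0] * counters

    @staticmethod
    def _sign(coeffs, x):
        a, b, c, d = coeffs
        h = (((a * x + b) * x + c) * x + d) % P
        return 1 if h & 1 else -1

    def add(self, x):
        for i, coeffs in enumerate(self.coeffs):
            self.z[i] += self._sign(coeffs, x)

    def estimate(self):
        return sum(z * z for z in self.z) / len(self.z)


def exact_f2(stream):
    """Exact self-join size, for comparison on small inputs."""
    return sum(c * c for c in Counter(stream).values())
```

Memory depends on the number of counters, not on the number of distinct values seen, which is the point of the sketch.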

16
Sketching Techniques 2
  • Building small-space summary for distribution
    vector m_i (i = 1, ..., N) seen as a stream of
    i-values

Data stream
3, 1, 2, 4, 2, 3, 5, . . .
17
Histograms 1
  • V-Optimal Histogram
  • Equi-Width Histograms
  • End-Biased Histograms

18
Histograms 2
  • V-Optimal Histogram
  • approximates distribution of a set of values by a
    piecewise-constant function
  • such that the sum of squared error is minimized

Idea: select buckets to minimize frequency
variance within buckets
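For a small, fully stored sequence, the optimal bucket boundaries can be found with a straightforward dynamic program over prefixes. The sketch below is an offline illustration, not a streaming algorithm; the function name is hypothetical, it assumes `buckets` is at most the number of values, and for simplicity it keeps full tables, so it uses more space than a tuned implementation would.

```python
def v_optimal(values, buckets):
    """Return (min sum of squared errors, list of (start, end) bucket ranges)."""
    n = len(values)
    prefix, prefix_sq = [0.0], [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)

    def sse(i, j):
        """Squared error of approximating values[i:j] by their mean."""
        s = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - s * s / (j - i)

    inf = float("inf")
    # best[b][j]: minimal error covering the first j values with b buckets
    best = [[inf] * (n + 1) for _ in range(buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(buckets + 1)]
    best[0][0] = 0.0
    for b in range(1, buckets + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                cost = best[b - 1][i] + sse(i, j)
                if cost < best[b][j]:
                    best[b][j], cut[b][j] = cost, i

    bounds, j = [], n
    for b in range(buckets, 0, -1):      # walk the cuts backwards
        i = cut[b][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return best[buckets][n], bounds
```

The triple loop costs time quadratic in the number of values for each bucket, which is what makes the offline approach impractical on long streams.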
19
Recent Work on V-Optimal Histograms
  • V-Optimal Histogram
  • Jagadish et al.'s algorithm uses O(N) space and
    requires O(N^2 B) time
  • Guha, Koudas and Shim adapted this algorithm to
    sorted data streams with O(B^2 log N) space and
    O(B^2 log N) time per data element
  • Gilbert et al. removed the restriction that the
    data stream be sorted and achieved poly(B, log N,
    1/ε)

20
Histograms 3
  • Equi-Width Histograms
  • partition the domain into buckets such that the
    number of values falling into each bucket is
    uniform across all buckets (histograms built
    this way are more commonly called equi-depth).
  • They maintain quantiles for the underlying data
    distribution as the bucket boundaries.

Idea: select buckets such that counts per bucket
are equal
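The "equal counts per bucket" idea amounts to using quantiles as bucket boundaries. The sketch below computes them offline by sorting, purely for illustration; it is not a one-pass streaming quantile algorithm, and the function names are hypothetical.

```python
from bisect import bisect_right


def equal_count_boundaries(values, buckets):
    """Return buckets - 1 boundary values taken at evenly spaced ranks."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // buckets] for k in range(1, buckets)]


def bucket_counts(values, boundaries):
    """Count values per bucket; a value equal to a boundary goes to the right."""
    counts = [0] * (len(boundaries) + 1)
    for v in values:
        counts[bisect_right(boundaries, v)] += 1
    return counts
```

With many duplicate values the counts can only be approximately equal, since all copies of a value fall into the same bucket.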
21
Recent Work on Equi-Width Histogram
  • Equi-Width Histograms
  • Characterize data distributions in a manner that
    is less sensitive to outliers
  • Applications
  • Traditional databases for selectivity estimation
  • Parallel databases for generation of quantiles or
    splitters
  • Greenwald and Khanna's algorithm needs
    O((1/ε) log(εN)) space with a precision of εN

22
Histograms 4
  • End-Biased Histograms
  • maintain exact counts of items that occur with
    frequency above a threshold, and approximate
    other counts by a uniform distribution.
  • Example
  • SELECT line1, line2, COUNT(others)
  • FROM calls
  • GROUP BY line1, line2
  • HAVING COUNT(others) >= 3
  • Answer <100, 500, 3>

23
Recent Work on End-Biased Histogram
  • End-Biased Histograms
  • Find aggregate values above a specified
    threshold. These queries are referred to as
    iceberg queries
  • Example: find search terms that account for more
    than 1% of the queries to a search engine
  • Fang et al.'s algorithm computes over
    disk-resident data and requires multiple passes.
  • Manku and Motwani's deterministic algorithm
    maintains a sample of distinct items along with
    their frequency. Requires O((1/ε) log(εN)) space.
    No item is undercounted by more than εN
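A deterministic counting scheme of this kind can be sketched along the lines of Manku and Motwani's lossy counting: the stream is split into buckets of width ceil(1/ε), each tracked item stores a count plus the maximum amount by which it may have been undercounted, and low-count entries are pruned at bucket boundaries. The class and method names below are hypothetical, and this is a simplified sketch rather than the authors' code.

```python
import math


class LossyCounter:
    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)   # bucket width
        self.n = 0                            # items seen so far
        self.entries = {}                     # item -> [count, max undercount]

    def add(self, item):
        self.n += 1
        bucket = (self.n + self.width - 1) // self.width   # current bucket id
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:          # bucket boundary: prune
            self.entries = {k: v for k, v in self.entries.items()
                            if v[0] + v[1] > bucket}

    def frequent(self, support):
        """Items whose stored count is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {k: c for k, (c, _) in self.entries.items() if c >= threshold}
```

Stored counts never exceed the true counts, and an item's stored count falls short of its true count by at most εN.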

24
Wavelets 1
  • Mathematical tools for hierarchical decomposition
    of functions/signals
  • Provide a summary representation of data
  • Haar wavelets are used in DB for ease of
    computation
  • The signal reconstructed from the top few wavelet
    coefficients best approximates the original
    signal

25
Wavelets 2
  • Haar Wavelets
  • Recursive pairwise averaging and differencing
    operation

Averages                    Detail Coefficients
[2, 2, 0, 2, 3, 5, 4, 4]    ----
[2, 1, 4, 4]                [0, -1, -1, 0]
[1.5, 4]                    [0.5, 0]
[2.75]                      [-1.25]

Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
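The recursive pairwise averaging and differencing can be written in a few lines. This is a minimal sketch that assumes the input length is a power of two and uses the unnormalized convention of the example above (detail = half the difference of each pair); the function name is hypothetical.

```python
def haar_decompose(values):
    """Return [overall average, then detail coefficients from coarsest to finest]."""
    averages = list(values)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        details = level_details + details     # coarser levels go in front
    return averages + details


coeffs = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
# coeffs == [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```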
26
Wavelets 3
  • Haar Wavelets Hierarchical decomposition
    structure
  • Reconstruct data values d(i) as Σ (±1) ×
    (coefficient on path)
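The path-based reconstruction can be sketched as a walk down the coefficient tree, adding a detail coefficient when the target index lies in the left half of the current range and subtracting it otherwise. The sketch assumes the coefficient layout [overall average, coarsest detail, ..., finest details] with a power-of-two length; the function name is hypothetical.

```python
def haar_value(coeffs, i):
    """Reconstruct the i-th data value from Haar coefficients."""
    n = len(coeffs)
    value = coeffs[0]              # overall average
    node, lo, hi = 1, 0, n         # tree node index and the index range it covers
    while node < n:
        mid = (lo + hi) // 2
        if i < mid:                # left half: add the detail coefficient
            value += coeffs[node]
            node, hi = 2 * node, mid
        else:                      # right half: subtract it
            value -= coeffs[node]
            node, lo = 2 * node + 1, mid
    return value
```

Each value touches only one coefficient per level, so a lookup reads about log2(n) + 1 coefficients.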

27
Sliding Windows 1
  • At every time t, a data item arrives
  • The item expires at time t + N
    (N is the window length)

[Figure: timeline marking a window of size N over the recent data, between past data and future data]
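The expiry rule can be illustrated with an exact sliding-window mean that stores every item still in the window. This is a minimal sketch with hypothetical names; it uses memory proportional to the window length, so it shows the semantics rather than a small-space technique.

```python
from collections import deque


class SlidingWindowMean:
    def __init__(self, n):
        self.n = n                  # window length N
        self.window = deque()       # (arrival_time, value) pairs, oldest first
        self.total = 0

    def add(self, t, value):
        """Insert the item arriving at time t and drop expired items."""
        self.window.append((t, value))
        self.total += value
        # an item that arrived at time t0 expires at time t0 + N
        while self.window and self.window[0][0] + self.n <= t:
            _, old = self.window.popleft()
            self.total -= old

    def mean(self):
        return self.total / len(self.window)
```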
28
Sliding Windows 2
  • Prevent stale data from influencing analysis and
    statistics
  • Serve as tool for approx. in face of bounded
    memory
  • Open problems
  • Clustering
  • Maintaining top wavelet coefficients
  • Maintaining statistics like variance
  • Computing correlated aggregates

29
Negative Results
  • Emerging set of negative results on space-time
    requirements of algorithms that operate in stream
    model
  • Henzinger, Raghavan, and Rajagopalan provided
    space lower bounds for concrete problems in the
    stream model, e.g., frequent item counting
  • Alon, Matias, and Szegedy provided almost tight
    lower bounds for computing the frequency moments;
    lower bound of Ω(N) for estimating F∞
  • Manku and Motwani's algorithm for computing a
    sample of distinct items along with their
    frequency has a lower bound of Ω((1/ε) log(εN))
  • General lower bound technique for sampling-based
    algorithms presented by Bar-Yossef et al.
  • useful for deriving space lower bounds for data
    stream algorithms that resort to oblivious
    sampling.

30
Miscellaneous Algorithms
  • Data Mining: decision trees are used for
    prediction, and clustering is used to summarize
    data.
  • Multiple Streams: computation of simple
    functions, such as the number of distinct
    elements, over unions of data streams is useful
    in a distributed environment
  • Reduction of Streams: list-efficient streaming
    algorithms that are presented with a list of data
    items in a succinct form must be employed in
    order for reductions to be efficient.
  • Property Testing: programs that make one pass
    over the data and, using small space, verify
    whether the data satisfies a certain property
  • Measuring Sortedness: useful in determining the
    choice of a sort algorithm for underlying data

31
Conclusions
  • The need for and research issues arising from a
    new model of data processing.
  • Review past work relevant to data stream systems
    and current projects in that area.
  • Explore topics in stream query languages, new
    requirements and challenges in query processing,
    and algorithmic issues.

32
My Opinions
  • Some existing techniques may be built on in
    solving some outstanding problems in data stream
    model
  • Exact answers to data stream queries are
    probably not possible
  • There are many ongoing projects that deal with
    streams
  • The reviews are too high level!

33
Thank You