Title: Models and Issues in Data Stream Systems
1Models and Issues in Data Stream Systems
Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Jennifer Widom
ACM SIGMOD/PODS, 2002
- Adesola Omotayo
- October 18, 2005
2Presentation Outline
- The Data Stream Model
- Review of Data Stream Projects
- Queries of Data Streams
- Proposal for a DSMS
- Algorithmic Issues
- Conclusions
- My Opinions
3Proposal for a DSMS
- STREAM (STanford stREam datA Manager)
- Query Language for a DSMS
- Query Processing Architecture of a DSMS
4Query Language for a DSMS
- Modified version of SQL
- well known
- declarative language
- Allows FROM clause to refer to streams and relations
- Allows the formulation of sliding window queries
- ordering of data stream elements
- optional window specification after a stream in the FROM clause
[Figure: incoming stream (S), S = (s1, i1), (s2, i2), ..., (sn, in), with sample tuples 847, 12; 847, 20; 847, 15. Labels: Implicit Timestamp (order of arrival); Explicit Timestamp.]
5Query Examples
- Calls (customer_id, type, minutes, timestamp)
SELECT AVG(S.minutes) FROM Calls S [PARTITION BY
S.customer_id ROWS 10 PRECEDING WHERE S.type =
'Long Distance']
SELECT AVG(S.minutes) FROM Calls S [PARTITION BY
S.customer_id ROWS 10 PRECEDING] WHERE S.type =
'Long Distance'
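A per-customer window query like the ones above can be read in two ways, depending on whether the type filter is applied before or after the "last 10 rows" window is taken. The sketch below illustrates both readings in Python. The function names and the tuple layout (customer_id, type, minutes) are assumptions made for illustration, and the functions return the averages at the end of the input rather than emitting a result per arriving tuple.

```python
from collections import defaultdict, deque


def avg_filter_inside_window(calls, n=10):
    """Average over each customer's last n long-distance calls
    (the filter is applied before the window is formed)."""
    windows = defaultdict(lambda: deque(maxlen=n))
    for customer_id, call_type, minutes in calls:
        if call_type == "Long Distance":
            windows[customer_id].append(minutes)
    return {c: sum(w) / len(w) for c, w in windows.items() if w}


def avg_filter_outside_window(calls, n=10):
    """Average of the long-distance calls found among each customer's
    last n calls of any type (the filter is applied after windowing)."""
    windows = defaultdict(lambda: deque(maxlen=n))
    for customer_id, call_type, minutes in calls:
        windows[customer_id].append((call_type, minutes))
    result = {}
    for c, w in windows.items():
        long_distance = [m for t, m in w if t == "Long Distance"]
        if long_distance:
            result[c] = sum(long_distance) / len(long_distance)
    return result


calls = [
    (1, "Long Distance", 10),
    (1, "Local", 5),
    (1, "Long Distance", 20),
    (1, "Local", 7),
]
inside = avg_filter_inside_window(calls, n=2)
outside = avg_filter_outside_window(calls, n=2)
```

With a window of 2, the first reading averages the two most recent long-distance calls, while the second only sees the long-distance call that falls among the two most recent calls overall.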
6Timestamps in Streams 1
- Ambiguous for tuples derived from multiple streams
- Drawback of explicit timestamp
- an almost-sorted stream can be fixed with a little buffering.
- Methods of assigning timestamps to the output of binary operators
- best effort approach
- stricter approach
SELECT * FROM S1 [ROWS 1000 PRECEDING], S2 [ROWS
100 PRECEDING] WHERE S1.A = S2.B
7Timestamps in Streams 2
8Query Processing Architecture 1
- Query execution plans consist of operators connected by queues
- Central scheduler schedules operators for execution
- During execution
- operator reads data from its input queues, updates synopsis structure and writes results to output queues
- Period of execution of operator determined dynamically by scheduler and operator returns control back to scheduler once period expires
This object is copied from the original paper
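The operator, queue and scheduler arrangement described on this slide can be sketched in a few lines of Python. This is a toy illustration, not the STREAM implementation: the class and function names are hypothetical, and the scheduler uses a fixed per-turn budget in round-robin order, whereas the slide says the execution period is determined dynamically.

```python
from collections import deque


class Operator:
    """Reads from an input queue, updates a synopsis, writes to an output queue."""

    def __init__(self, step, in_queue, out_queue):
        self.step = step          # function(item, synopsis) -> list of outputs
        self.in_queue = in_queue
        self.out_queue = out_queue
        self.synopsis = {}        # per-operator state kept between turns

    def run(self, budget):
        """Process at most `budget` items, then return control to the scheduler."""
        processed = 0
        while self.in_queue and processed < budget:
            item = self.in_queue.popleft()
            self.out_queue.extend(self.step(item, self.synopsis))
            processed += 1
        return processed


def keep_even(item, synopsis):
    # Stateless selection operator.
    return [item] if item % 2 == 0 else []


def running_sum(item, synopsis):
    # Aggregation operator whose synopsis is a single running total.
    synopsis["sum"] = synopsis.get("sum", 0) + item
    return [synopsis["sum"]]


def schedule(operators, budget=2):
    """Round-robin scheduler: stop when no operator makes progress."""
    progress = True
    while progress:
        progress = False
        for op in operators:
            if op.run(budget) > 0:
                progress = True


source, middle, sink = deque([1, 2, 3, 4, 5, 6]), deque(), deque()
plan = [Operator(keep_even, source, middle), Operator(running_sum, middle, sink)]
schedule(plan)
```

After the run, the sink queue holds the running sums of the even inputs.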
9Query Processing Architecture 2
- To handle fluctuations in stream and query parameters, operators are adaptive (primarily to memory)
- Trading accuracy for memory
- Operator maximizes accuracy of output based on size of available memory
- Handles dynamic changes in size of its available memory
- Example: For a sliding window join, the larger the window, the better the approximation
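The sliding window join example can be made concrete with a small sketch. The function below is a hypothetical illustration: it assumes the two streams arrive in lockstep, joins on equal values, and lets each side remember only its last `window` items, so a smaller memory budget produces fewer of the matches a full join would find.

```python
from collections import deque


def window_join(s1, s2, window):
    """Symmetric equality join where each side keeps only its last `window` items."""
    w1, w2 = deque(maxlen=window), deque(maxlen=window)
    out = []
    for a, b in zip(s1, s2):  # assumption: one item per stream per step
        # New item on stream 1: probe the other window, then remember it.
        out.extend((a, y) for y in w2 if y == a)
        w1.append(a)
        # New item on stream 2: probe the other window, then remember it.
        out.extend((x, b) for x in w1 if x == b)
        w2.append(b)
    return out


small = window_join([1, 2, 3, 4], [3, 1, 2, 9], window=1)
large = window_join([1, 2, 3, 4], [3, 1, 2, 9], window=4)
```

Here the window of size 1 misses every match, while the window of size 4 recovers all three matching pairs.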
10Query Processing Architecture 3
- Issues in Memory Management
- How do different query ops produce approximate answers under limited memory?
- How do approximate results behave when operators are composed in query results?
- How can the DSMS allocate memory to operators to maximize accuracy of answer?
- How can DSMS reallocate memory among operators under changing conditions?
- Given a query, how does the query optimizer come up with a query plan that, with best memory allocation, minimizes approximation? Should plans be modified when conditions change?
- Since synopses can be shared among query plans, how do we optimally consider a set of queries, which may be weighted by importance?
11Query Processing Architecture 4
- Issues in Scheduling
- Scheduler needs to provide rate synchronization within operators and across pipelined operators in query plans
- Time-varying arrival rates of data streams and time-varying output rates of operators complicate matters
- Need to take into account
- Memory allocation across operators
- Management of buffers for incoming streams
- Availability of synopses on disk (instead of memory)
- Performance requirements of individual queries
12Algorithmic Issues
- Random Samples
- Sketching Techniques
- Histograms
- Sliding Windows
- Negative Results
- Miscellaneous algorithms
13Random Samples 1
- Used as summary structure in many scenarios where a small sample is expected to capture essential characteristics of the data set
- Easiest form of summarization
- Other synopses can be built from sample itself
- Variations include
- stratified sampling
- uniform sampling
- weighted sampling
14Random Samples 2
- Idea: A small random sample S of the data often well-represents all the data
Data stream: 9 3 5 2 7 1 6 5 8 4 9 1 (n = 12)
Sample S: 9 5 1 8
Example: select AVG(R.e) from R where R.e is odd
Answer: 5
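The slide does not say how the sample is drawn. One standard way to keep a uniform random sample of fixed size over a stream of unknown length is reservoir sampling, sketched below as an illustration. The function name and the optional `rng` parameter are assumptions; no expected sample is shown because the result depends on the random choices.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream in one pass."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


data = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
sample = reservoir_sample(data, 4)
odd = [x for x in sample if x % 2 == 1]
# Estimate of AVG(R.e) over odd values, computed from the sample only.
estimate = sum(odd) / len(odd) if odd else None
```

The estimate varies from run to run; on the full stream above the exact average of the odd values is 5.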
15Sketching Techniques 1
- Building summary of data stream using small
amount of memory
- Make it possible to estimate answers to certain
queries over data set
- F0 is the number of distinct values in S: O(log d)
- F1 is the length of S
- F2 is the self-join size: O(log d log N)
- F∞ is the multiplicity of the most frequent item
16Sketching Techniques 2
- Building small-space summary for distribution vector m_i (i = 1, ..., N) seen as a stream of i-values
Data stream: 3, 1, 2, 4, 2, 3, 5, . . .
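To make the frequency moments concrete, the sketch below computes them exactly for the visible prefix of the example stream. This is the definition a sketch would approximate, not a sketching algorithm: it stores every distinct value, which is exactly what a small-space summary avoids. The function name is an assumption.

```python
from collections import Counter


def frequency_moments(stream):
    """Exact F0, F1, F2 and F-infinity of a finite stream."""
    m = Counter(stream)                      # m[i] = number of occurrences of value i
    f0 = len(m)                              # number of distinct values
    f1 = sum(m.values())                     # length of the stream
    f2 = sum(c * c for c in m.values())      # self-join size
    f_inf = max(m.values(), default=0)       # multiplicity of the most frequent item
    return f0, f1, f2, f_inf


moments = frequency_moments([3, 1, 2, 4, 2, 3, 5])  # (5, 7, 11, 2)
```

For the seven items shown, the values 2 and 3 each occur twice and the rest once, so F2 = 1 + 4 + 4 + 1 + 1 = 11.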
17Histograms 1
- V-Optimal Histogram
- Equi-Width Histograms
- End-Biased Histograms
18Histograms 2
- V-Optimal Histogram
- approximates distribution of a set of values by a piecewise-constant function
- such that the sum of squared error is minimized
Idea: Select buckets to minimize frequency variance within buckets
19Recent Work on V-Optimal Histograms
- V-Optimal Histogram
- Jagadish et al.'s algorithm uses O(N) space and requires O(N^2 B) time
- Guha, Koudas and Shim adapted this algorithm to sorted data streams with O(B^2 log N) space and O(B^2 log N) time per data element
- Gilbert et al. removed the restriction that the data stream be sorted and achieved poly(B, log N, 1/ε)
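The offline dynamic program behind the O(N^2 B) time bound can be sketched as follows. This is a minimal illustration written from the textbook formulation, not code from any of the cited papers: the names are assumptions, and for simplicity it keeps full O(N B) tables so the bucket boundaries can be recovered, rather than the O(N) space quoted above.

```python
def v_optimal(values, num_buckets):
    """Split `values` into contiguous buckets minimizing total squared error.

    Returns (error, buckets) where buckets is a list of (start, end) index
    pairs, end exclusive. Runs in O(N^2 * B) time.
    """
    n = len(values)
    if not 1 <= num_buckets <= n:
        raise ValueError("need 1 <= num_buckets <= len(values)")

    # Prefix sums of values and squared values give each bucket's error in O(1).
    ps, pss = [0.0] * (n + 1), [0.0] * (n + 1)
    for i, v in enumerate(values):
        ps[i + 1] = ps[i] + v
        pss[i + 1] = pss[i] + v * v

    def sse(i, j):
        """Squared error of approximating values[i:j] by their mean."""
        s = ps[j] - ps[i]
        return (pss[j] - pss[i]) - s * s / (j - i)

    inf = float("inf")
    # best[b][j]: minimum error for the first j values using b buckets.
    best = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    best[0][0] = 0.0
    for b in range(1, num_buckets + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):  # last bucket is values[i:j]
                cand = best[b - 1][i] + sse(i, j)
                if cand < best[b][j]:
                    best[b][j], cut[b][j] = cand, i

    # Walk the cut table backwards to recover the bucket boundaries.
    buckets, j = [], n
    for b in range(num_buckets, 0, -1):
        i = cut[b][j]
        buckets.append((i, j))
        j = i
    buckets.reverse()
    return best[num_buckets][n], buckets


error, buckets = v_optimal([1, 1, 1, 5, 5, 5, 9, 9], 3)
```

On this input the three runs of equal values become the three buckets, so the squared error is zero.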
20Histograms 3
- Equi-Width Histograms
- partition the domain into buckets such that the number of values falling into each bucket is uniform across all buckets (elsewhere this is often called an equi-depth histogram).
- They maintain quantiles for the underlying data distribution as the bucket boundaries.
Idea: Select buckets such that counts per bucket are equal
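The idea of using quantiles as bucket boundaries can be shown with a small offline sketch. It sorts all the data, so it is not a streaming algorithm and is not the Greenwald and Khanna method; the function name is an assumption.

```python
def equal_count_boundaries(values, num_buckets):
    """Return the num_buckets - 1 boundary values that split the sorted
    data into buckets holding (nearly) equal numbers of values."""
    ordered = sorted(values)
    n = len(ordered)
    # The k-th boundary is the k/num_buckets quantile of the data.
    return [ordered[(k * n) // num_buckets] for k in range(1, num_buckets)]


boundaries = equal_count_boundaries([9, 3, 5, 2, 7, 1, 6, 12, 8, 4, 10, 11], 4)
# boundaries == [4, 7, 10]: buckets (<4), [4,7), [7,10), (>=10) hold 3 values each
```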
21Recent Work on Equi-Width Histogram
- Equi-Width Histograms
- Characterize data distributions in a manner that is less sensitive to outliers
- Applications
- Traditional databases for selectivity estimation
- Parallel databases for generation of quantiles or splitters
- Greenwald and Khanna's algorithm needs O((1/ε) log(εN)) space with a precision of εN
22Histograms 4
- End-Biased Histograms
- maintain exact counts of items that occur with frequency above a threshold, and approximate other counts by a uniform distribution.
- Example
SELECT line1, line2, COUNT(others)
FROM calls
GROUP BY line1, line2
HAVING COUNT(others) > 3
- Answer: <100, 500, 3>
23Recent Work on End-Biased Histogram
- End-Biased Histograms
- Find aggregate values above a specified threshold. These queries are referred to as iceberg queries
- Example: find search terms that account for more than 1% of the queries to a search engine
- Fang et al.'s algorithm computes over disk-resident data and requires multiple passes.
- Manku and Motwani's deterministic algorithm maintains a sample of distinct items along with their frequency. Requires O((1/ε) log(εN)) space. No item is undercounted by more than εN
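A deterministic counter with the "undercount by at most εN" guarantee can be sketched in the style of Manku and Motwani's lossy counting. The code below is written from the usual description of that algorithm and is an illustration only; the class and method names are assumptions.

```python
import math


class LossyCounter:
    """Approximate frequency counts over a stream.

    Each stored count is never above the true count and is below it
    by at most epsilon * N, where N is the number of items seen.
    """

    def __init__(self, epsilon):
        self.width = math.ceil(1 / epsilon)  # items per bucket
        self.n = 0                           # items seen so far
        self.entries = {}                    # item -> [count, max possible undercount]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)  # id of the current bucket
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # The item may have been seen and pruned in earlier buckets.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # Bucket boundary: prune entries that cannot be frequent.
            self.entries = {
                k: v for k, v in self.entries.items() if v[0] + v[1] > bucket
            }

    def count(self, item):
        return self.entries.get(item, [0, 0])[0]


stream = ["a" if i % 2 == 0 else i for i in range(100)]
counter = LossyCounter(0.1)
for x in stream:
    counter.add(x)
```

In this stream "a" makes up half of the items and keeps its exact count, while the values that occur only once are pruned at bucket boundaries.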
24Wavelets 1
- Mathematical tools for hierarchical decomposition of functions/signals
- Provide a summary representation of data
- Haar wavelets are used in DB for ease of computation
- The signal reconstructed from the top few wavelet coefficients best approximates the original signal
25Wavelets 2
- Haar Wavelets
- Recursive pairwise averaging and differencing
operation
Averages                  Detail Coefficients
2, 2, 0, 2, 3, 5, 4, 4    ----
2, 1, 4, 4                0, -1, -1, 0
1.5, 4                    0.5, 0
2.75                      -1.25
Haar wavelet decomposition: 2.75, -1.25, 0.5, 0, 0, -1, -1, 0
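The pairwise averaging and differencing can be written out directly. The sketch below assumes the input length is a power of two and uses the unnormalized convention of the example on this slide, where each detail coefficient is half the difference of a pair; the function name is an assumption.

```python
def haar_decompose(values):
    """Haar decomposition by repeated pairwise averaging and differencing.

    Returns [overall average, then detail coefficients from coarsest to finest].
    Assumes len(values) is a power of two.
    """
    averages = list(values)
    details = []
    while len(averages) > 1:
        pairs = [(averages[i], averages[i + 1]) for i in range(0, len(averages), 2)]
        level_details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        # Coarser levels go in front of the finer ones computed earlier.
        details = level_details + details
    return averages + details


coefficients = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
# coefficients == [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
```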
26Wavelets 3
- Haar Wavelets: hierarchical decomposition structure
- Reconstruct data values d(i) as d(i) = Σ (+/-1) × (coefficient on path)
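The reconstruction rule can be checked with a short sketch that walks down the hierarchy: at each level a value splits into (value + detail) for the left child and (value - detail) for the right child, so every data value ends up as a signed sum of the coefficients on its root-to-leaf path. The function name and coefficient layout (overall average first, then details from coarsest to finest) are assumptions.

```python
def haar_reconstruct(coefficients):
    """Invert an unnormalized Haar decomposition.

    `coefficients` is [overall average, details coarsest to finest];
    its length is assumed to be a power of two.
    """
    values = [coefficients[0]]
    pos = 1
    while pos < len(coefficients):
        details = coefficients[pos:pos + len(values)]
        expanded = []
        for avg, d in zip(values, details):
            # Left child adds the detail coefficient, right child subtracts it.
            expanded.extend([avg + d, avg - d])
        pos += len(values)
        values = expanded
    return values


data = haar_reconstruct([2.75, -1.25, 0.5, 0, 0, -1, -1, 0])
# data == [2, 2, 0, 2, 3, 5, 4, 4]
```

For instance, the third value is 2.75 + (-1.25) - 0.5 + (-1) = 0.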
27Sliding Windows 1
- At every time t, a data item arrives
- The item expires at time t + N (N is the window length)
[Figure: timeline with a window of size N covering Recent Data, between Past Data and Future Data]
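The expiry rule can be illustrated with an exact sliding-window sum. This sketch stores every item in the window, so it uses memory proportional to N; it shows the model only, not a small-space approximation. The class name is an assumption.

```python
from collections import deque


class SlidingWindowSum:
    """Exact sum over the items that have not yet expired."""

    def __init__(self, n):
        self.n = n              # window length N
        self.items = deque()    # (arrival_time, value), oldest first
        self.total = 0

    def insert(self, t, value):
        # An item that arrived at time a expires at time a + N.
        while self.items and self.items[0][0] + self.n <= t:
            _, old = self.items.popleft()
            self.total -= old
        self.items.append((t, value))
        self.total += value
        return self.total


window = SlidingWindowSum(3)
sums = [window.insert(t, v) for t, v in enumerate([1, 2, 3, 4, 5, 6])]
# sums == [1, 3, 6, 9, 12, 15]
```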
28Sliding Windows 2
- Prevent stale data from influencing analysis and statistics
- Serve as a tool for approximation in the face of bounded memory
- Open problems
- Clustering
- Maintaining top wavelet coefficients
- Maintaining statistics like variance
- Computing correlated aggregates
29Negative Results
- Emerging set of negative results on space-time requirements of algorithms that operate in the stream model
- Henzinger, Raghavan, and Rajagopalan provided space lower bounds for concrete problems in the stream model, e.g., frequent item counting
- Alon, Matias, and Szegedy provided almost tight lower bounds for computing the frequency moments, including a lower bound of Ω(N) for estimating F∞
- Manku and Motwani's algorithm for computing a sample of distinct items along with their frequency has a lower bound of O((1/ε) log(εN))
- General lower bound technique for sampling-based algorithms presented by Bar-Yossef et al.
- useful for deriving space lower bounds for data stream algorithms that resort to oblivious sampling.
30Miscellaneous Algorithms
- Data Mining: Decision trees are used for prediction and clustering is used to summarize data.
- Multiple Streams: Computation of simple functions, such as the number of distinct elements, over unions of data streams is useful in distributed environments
- Reduction of Streams: List-efficient streaming algorithms that are presented with a list of data items in a succinct form must be employed in order for reductions to be efficient.
- Property Testing: Programs that make one pass over the data and, using small space, verify whether the data satisfies a certain property
- Measuring Sortedness: Useful in determining the choice of a sort algorithm for the underlying data
31Conclusions
- The need for and research issues arising from a new model of data processing.
- Review past work relevant to data stream systems and current projects in that area.
- Explore topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues.
32My Opinions
- Some existing techniques may be built on in solving some outstanding problems in the data stream model
- Exact answers from a data stream query are probably not possible
- There are many ongoing projects that deal with streams
- The reviews are too high level!
33Thank You