Models and Issues in Data Stream Systems - PowerPoint PPT Presentation


Provided by: Raymo68

1
Models and Issues in Data Stream Systems
Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev
Motwani, Jennifer Widom
ACM SIGMOD/PODS, 2002

Adesola Omotayo
October 18, 2005

2
Presentation Outline

The Data Stream Model
Review of Data Stream Projects
Queries of Data Streams
Proposal for a DSMS
Algorithmic Issues
Conclusions
My Opinions

3
Proposal for a DSMS

STREAM (STanford stREam datA Manager)
Query Language for a DSMS
Query Processing Architecture of a DSMS

4
Query Language for a DSMS

Modified version of SQL
well known
declarative language
Allows FROM clause to refer to streams and
relations
Allows the formulation of sliding window queries
ordering of data stream elements
optional window specification after a stream in
the FROM clause

[Figure: incoming stream S = (s1, i1), (s2, i2), ..., (sn, in),
with example elements (847, 12), (847, 20), (847, 15).
Labels: Implicit Timestamp (order of arrival), Explicit Timestamp.]
5
Query Examples

Calls (customer_id, type, minutes, timestamp)

SELECT AVG(S.minutes)
FROM Calls S [PARTITION BY S.customer_id
              ROWS 10 PRECEDING
              WHERE S.type = 'Long Distance']
6
Timestamps in Streams 1

Ambiguous for tuples derived from multiple
streams
Drawback of explicit timestamp: tuples may not
arrive in timestamp order, but
an almost-sorted stream can be fixed with a
little buffering.
Methods of assigning timestamps to the output of
binary operators
best effort approach
stricter approach

SELECT * FROM S1 [ROWS 1000 PRECEDING], S2 [ROWS
100 PRECEDING] WHERE S1.A = S2.B
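
The "little buffering" fix for an almost-sorted stream can be sketched with a small min-heap. This is only an illustration, not the mechanism used by STREAM: the function name `reorder` and the `slack` parameter are hypothetical, and the sketch assumes that no element arrives more than `slack` positions away from its sorted position.

```python
import heapq


def reorder(stream, slack):
    """Yield (timestamp, value) pairs in timestamp order.

    Assumption: every element is at most `slack` positions out of place,
    so holding `slack` elements in a min-heap is enough to restore order.
    """
    heap = []
    for item in stream:
        heapq.heappush(heap, item)
        if len(heap) > slack:
            # The smallest buffered timestamp can no longer be overtaken.
            yield heapq.heappop(heap)
    # End of input: flush whatever is still buffered, smallest first.
    while heap:
        yield heapq.heappop(heap)


almost_sorted = [(1, "a"), (3, "c"), (2, "b"), (4, "d"), (6, "f"), (5, "e")]
in_order = list(reorder(almost_sorted, slack=1))
```

A larger `slack` tolerates more disorder at the cost of more memory and more delay before a tuple is released.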
7
Timestamps in Streams 2

The keyword, recent

8
Query Processing Architecture 1

Query execution plans consist of operators
connected by queues
Central scheduler schedules operators for
execution
During execution
operator reads data from its input queues,
updates synopsis structure and writes results to
output queues
Period of execution of operator determined
dynamically by scheduler and operator returns
control back to scheduler once period expires

This object is copied from the original paper
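
The operator-and-queue architecture described on this slide can be sketched roughly as follows. Everything here is an assumption made for illustration: the class and function names are hypothetical, the "period of execution" is modelled as a fixed tuple budget rather than a time slice chosen dynamically, and the scheduler is plain round-robin.

```python
from collections import deque


class Operator:
    """Reads its input queue, updates a synopsis, writes its output queue."""

    def __init__(self, name, fn, inbox, outbox):
        self.name = name
        self.fn = fn              # fn(tuple, synopsis) -> list of output tuples
        self.inbox = inbox
        self.outbox = outbox
        self.synopsis = {}        # per-operator summary state

    def run(self, budget):
        """Process at most `budget` tuples, then hand control back."""
        done = 0
        while self.inbox and done < budget:
            item = self.inbox.popleft()
            self.outbox.extend(self.fn(item, self.synopsis))
            done += 1
        return done


def long_calls(call, synopsis):
    # Selection operator: stateless, so the synopsis is unused.
    return [call] if call["minutes"] >= 10 else []


def running_count(call, synopsis):
    # Aggregation operator: the synopsis holds the running count.
    synopsis["count"] = synopsis.get("count", 0) + 1
    return [synopsis["count"]]


def schedule(operators, budget=2):
    """Round-robin scheduler: stop once no operator makes progress."""
    progress = True
    while progress:
        progress = False
        for op in operators:
            if op.run(budget) > 0:
                progress = True


source, middle, sink = deque(), deque(), deque()
plan = [
    Operator("select", long_calls, source, middle),
    Operator("count", running_count, middle, sink),
]
source.extend({"minutes": m} for m in [3, 12, 25, 7, 40])
schedule(plan)
```

After `schedule(plan)` returns, `sink` holds the running counts emitted for the calls that passed the selection.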
9
Query Processing Architecture 2

To handle fluctuations in stream and query
parameters, operators are adaptive (primarily to
memory)
Trading accuracy for memory
Operator maximizes accuracy of output based on
size of available memory
Handles dynamic changes in size of its available
memory
Example: For a sliding window join, the larger
the window, the better the approximation

10
Query Processing Architecture 3

Issues in Memory Management
How do different query ops produce approximate
answers under limited memory?
How do approximate results behave when operators
are composed in query plans?
How can the DSMS allocate memory to operators to
maximize accuracy of answer?
How can DSMS reallocate memory among operators
under changing conditions?
Given a query, how does the query optimizer come
up with a query plan that, with best memory
allocation, minimizes approximation? Should plans
be modified when conditions change?
Since synopses can be shared among query plans,
how do we optimally consider a set of queries,
which may be weighted by importance?

11
Query Processing Architecture 4

Issues in Scheduling
Scheduler needs to provide rate synchronization
within operators and across pipelined operators
in query plans
Time-varying arrival rates of data streams and
time-varying output rates of operators complicate
matters
Need to take into account
Memory allocation across operators
Management of buffers for incoming streams
Availability of synopses on disk (instead of
memory)
Performance requirements of individual queries

12
Algorithmic Issues

Random Samples
Sketching Techniques
Histograms
Sliding Windows
Negative Results
Miscellaneous algorithms

13
Random Samples 1

Used as summary structure in many scenarios where
small sample is expected to capture essential
characteristics of data set
Easiest form of summarization
Other synopses can be built from sample itself
Variations include
stratified sampling
uniform sampling
weighted sampling

14
Random Samples 2

Idea: A small random sample S of the data often
represents all the data well

Data stream: 9 3 5 2 7 1 6 5 8 4 9 1 (n = 12)
Sample S: 9 5 1 8
Example: select AVG(R.e) from R where R.e is odd
Answer: 5
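
The slide does not say how the sample is drawn. One common choice for streams of unknown length is reservoir sampling, sketched below as an assumption rather than as the method the presentation has in mind. The helper names are hypothetical. The last lines check the slide's own example: the average of the odd values is 5 on both the full stream and the sample shown.

```python
import random


def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of k items from a stream in one pass."""
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(n)         # uniform integer in [0, n)
            if j < k:                    # happens with probability k / n
                sample[j] = item
    return sample


def avg_odd(values):
    """AVG(R.e) over the odd values, or None if there are none."""
    odd = [v for v in values if v % 2 == 1]
    return sum(odd) / len(odd) if odd else None


stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
slide_sample = [9, 5, 1, 8]              # the sample shown on the slide

exact_answer = avg_odd(stream)           # odd values sum to 40 over 8 items
sample_answer = avg_odd(slide_sample)    # odd values sum to 15 over 3 items
drawn = reservoir_sample(stream, 4, random.Random(0))
```

A different random sample would generally give an estimate close to, but not exactly equal to, the exact answer.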
15
Sketching Techniques 1

Building summary of data stream using small
amount of memory

Make it possible to estimate answers to certain
queries over data set

F0 is the number of distinct values in S: O(log d)
F1 is the length of S
F2 is the self-join size: O(log d log N)
F∞ is the most frequent item's multiplicity

16
Sketching Techniques 2

Building small-space summary for distribution
vector m(i) (i = 1, ..., N) seen as a stream of
i-values

Data stream: 3, 1, 2, 4, 2, 3, 5, ...
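
To make the frequency moments concrete, the sketch below computes F0, F1, F2 and F∞ exactly for the stream prefix shown above, and then illustrates the idea behind a "tug-of-war" style estimator for F2: give every value a random sign, sum the signs over the stream, and square the result. This is an illustration under simplifying assumptions, not a small-space implementation: the sign table is stored explicitly instead of being derived from a hash function, and the function names are hypothetical. Averaging the estimator over every possible sign assignment recovers F2 exactly, which is what makes it unbiased.

```python
from collections import Counter
from itertools import product


def frequency_moments(stream):
    """Exact moments from the distribution vector m(i) (needs full counts)."""
    m = Counter(stream)
    return {
        "F0": len(m),                            # distinct values
        "F1": sum(m.values()),                   # stream length
        "F2": sum(c * c for c in m.values()),    # self-join size
        "Finf": max(m.values()),                 # largest multiplicity
    }


def tug_of_war(stream, sign):
    """One F2 estimate: square of the sum of per-value signs (+1 or -1)."""
    z = 0
    for item in stream:
        z += sign[item]       # a single counter is all that is updated
    return z * z


prefix = [3, 1, 2, 4, 2, 3, 5]
exact = frequency_moments(prefix)

# Enumerate all sign assignments to show the estimator's expectation.
values = sorted(set(prefix))
estimates = [
    tug_of_war(prefix, dict(zip(values, signs)))
    for signs in product((-1, 1), repeat=len(values))
]
mean_estimate = sum(estimates) / len(estimates)
```

In practice a single estimate is noisy, so several independent estimates are combined to tighten the result.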
17
Histograms 1

V-Optimal Histogram
Equi-Width Histograms
End-Biased Histograms

18
Histograms 2

V-Optimal Histogram
approximates distribution of a set of values by a
piecewise-constant function
such that the sum of squared error is minimized

Idea: Select buckets to minimize frequency
variance within buckets
19
Recent Work on V-Optimal Histograms

V-Optimal Histogram
Jagadish et al.'s algorithm uses O(N) space and
requires O(N^2 B) time
Guha, Koudas and Shim adapted this algorithm to
sorted data streams with O(B^2 log N) space and
O(B^2 log N) time per data element
Gilbert et al. removed the restriction that the
data stream be sorted and achieved poly(B, log N,
1/ε)
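
A dynamic program in the spirit of the O(N^2 B)-time offline algorithm can be sketched as follows. It is a minimal illustration written from the definition of a V-Optimal histogram (minimize the total sum of squared error of a piecewise-constant fit), not a transcription of Jagadish et al.'s algorithm. It returns only the minimum error, not the bucket boundaries, and the function name is hypothetical.

```python
def v_optimal_error(values, buckets):
    """Minimum total squared error of a fit with exactly `buckets` buckets."""
    n = len(values)
    # Prefix sums let us get the squared error of any range in O(1).
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for i, v in enumerate(values):
        prefix[i + 1] = prefix[i] + v
        prefix_sq[i + 1] = prefix_sq[i] + v * v

    def sse(i, j):
        """Squared error of values[i:j] around their mean (j > i)."""
        s = prefix[j] - prefix[i]
        return prefix_sq[j] - prefix_sq[i] - s * s / (j - i)

    inf = float("inf")
    # best[j] = minimum error for values[:j] with the current bucket count.
    # With zero buckets only the empty prefix can be covered.
    best = [0.0] + [inf] * n
    for _ in range(buckets):
        nxt = [inf] * (n + 1)
        for j in range(1, n + 1):
            # The last bucket covers values[i:j]; the rest is solved already.
            nxt[j] = min(best[i] + sse(i, j) for i in range(j))
        best = nxt
    return best[n]


frequencies = [1, 1, 1, 5, 5, 5]
one_bucket = v_optimal_error(frequencies, 1)
two_buckets = v_optimal_error(frequencies, 2)
```

Only two rows of the table are kept at a time, so the error computation needs O(N) working space; recovering the boundaries themselves would need extra bookkeeping. Asking for more buckets than values yields infinity, because every bucket must be non-empty in this sketch.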

20
Histograms 3

Equi-Width Histograms
partition the domain into buckets such that the
number of values falling into each bucket is
uniform across all buckets.
They maintain quantiles for the underlying data
distribution as the bucket boundaries.

Idea: Select buckets such that counts per bucket
are equal
21
Recent Work on Equi-Width Histogram

Equi-Width Histograms
Characterize data distributions in a manner that
is less sensitive to outliers
Applications
Traditional databases for selectivity estimation
Parallel databases for generation of quantiles or
splitters
Greenwald and Khanna's algorithm needs
O((1/ε) log(εN)) space with a precision of εN

22
Histograms 4

End-Biased Histograms
maintain exact counts of items that occur with
frequency above a threshold, and approximate
other counts by a uniform distribution.
Example:
SELECT line1, line2, COUNT(others)
FROM calls
GROUP BY line1, line2
HAVING COUNT(others) > 3
Answer: <100, 500, 3>

23
Recent Work on End-Biased Histogram

End-Biased Histograms
Find aggregate values above a specified
threshold. These queries are referred to as
iceberg queries
Example: find search terms that account for more
than 1% of the queries to a search engine
Fang et al.'s algorithm computes over
disk-resident data and requires multiple passes.
Manku and Motwani's deterministic algorithm
maintains a sample of distinct items along with
their frequency. Requires O((1/ε) log(εN)) space.
No item is undercounted by more than εN
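
The deterministic one-pass algorithm described here matches what is usually called lossy counting. The sketch below is written from the published description of that technique and assumes it is the algorithm the slide refers to; the function and variable names are hypothetical. The stream is cut into buckets of width ceil(1/ε), each tracked item stores its count plus the largest amount by which it may have been undercounted, and low-count items are pruned at every bucket boundary.

```python
import math


def lossy_count(stream, epsilon):
    """Return ({item: counted frequency}, stream length) after one pass."""
    width = math.ceil(1 / epsilon)      # bucket width
    entries = {}                        # item -> [count, max undercount]
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)   # id of the current bucket
        if item in entries:
            entries[item][0] += 1
        else:
            # The item may have been seen and pruned in earlier buckets.
            entries[item] = [1, bucket - 1]
        if n % width == 0:
            # Bucket boundary: prune items that cannot be frequent so far.
            stale = [k for k, (f, d) in entries.items() if f + d <= bucket]
            for key in stale:
                del entries[key]
    return {k: f for k, (f, d) in entries.items()}, n


demo_stream = list("abacabadabacabae" * 5)
demo_epsilon = 0.2
counts, total = lossy_count(demo_stream, demo_epsilon)
```

For an iceberg query with support threshold s, one would report the items whose counted frequency is at least (s - ε) times the stream length.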

24
Wavelets 1

Mathematical tools for hierarchical decomposition
of functions/signals
Provide a summary representation of data
Haar wavelets are used in DB for ease of
computation
The signal reconstructed from top few wavelet
coefficients best approximate the original signal

25
Wavelets 2

Haar Wavelets
Recursive pairwise averaging and differencing
operation

Averages                    Detail Coefficients
2, 2, 0, 2, 3, 5, 4, 4      ----
2, 1, 4, 4                  0, -1, -1, 0
1.5, 4                      0.5, 0
2.75                        -1.25

Haar wavelet decomposition:
2.75, -1.25, 0.5, 0, 0, -1, -1, 0
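
The recursive pairwise averaging and differencing can be sketched in a few lines. The function name is hypothetical, the input length is assumed to be a power of two, and each detail coefficient is taken as (left - right) / 2, which is the convention that reproduces the numbers on this slide.

```python
def haar_decompose(data):
    """Return [overall average, coarsest detail, ..., finest details]."""
    averages = list(data)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        # Coarser levels go in front of the finer ones collected so far.
        details = level_details + details
    return averages + details


signal = [2, 2, 0, 2, 3, 5, 4, 4]
coefficients = haar_decompose(signal)
```

Keeping only the largest few coefficients and zeroing the rest gives the compact summary described on the previous slide.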
26
Wavelets 3

Haar Wavelets Hierarchical decomposition
structure
Reconstruct data values d(i) as Σ (±1) ×
(coefficient on path)
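
Reconstruction along the root-to-leaf path can be sketched as below. The layout of the coefficient array (overall average first, then details from coarsest to finest) and the sign rule (add a detail when going left, subtract when going right) are assumptions chosen to match the decomposition of 2, 2, 0, 2, 3, 5, 4, 4 into 2.75, -1.25, 0.5, 0, 0, -1, -1, 0. The function name is hypothetical.

```python
def reconstruct(coeffs, i):
    """Rebuild data value d(i) from the coefficients on its path."""
    n = len(coeffs)            # assumed to be a power of two
    value = coeffs[0]          # start from the overall average
    node, lo, hi = 1, 0, n     # node 1 is the coarsest detail, covering [lo, hi)
    while node < n:
        mid = (lo + hi) // 2
        if i < mid:
            value += coeffs[node]          # left half: sign +1
            node, hi = 2 * node, mid
        else:
            value -= coeffs[node]          # right half: sign -1
            node, lo = 2 * node + 1, mid
    return value


haar = [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
rebuilt = [reconstruct(haar, i) for i in range(len(haar))]
```

Each value touches only about log2(n) + 1 coefficients, which is why a single point can be reconstructed without expanding the whole signal.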

27
Sliding Windows 1

At every time t, a data item arrives
The item expires at time t + N
(N is the window length)

[Figure: timeline with a window of size N from t to t + N;
labels: Past Data, Recent Data, Future Data]
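
A minimal sketch of the expiry rule: an item that arrives at time t is dropped once the clock reaches t + N. The class below keeps an exact running mean over the window; it is only an illustration with hypothetical names, and it stores every item in the window, so it uses O(N) memory rather than the bounded-memory approximations this part of the talk is about.

```python
from collections import deque


class SlidingWindowMean:
    """Exact mean over the items that have not yet expired."""

    def __init__(self, window):
        self.window = window      # N, the window length
        self.items = deque()      # (arrival time, value), oldest first
        self.total = 0.0

    def add(self, t, value):
        self.items.append((t, value))
        self.total += value
        # Expire every item whose expiry time (arrival + N) has been reached.
        while self.items and self.items[0][0] + self.window <= t:
            _, old = self.items.popleft()
            self.total -= old

    def mean(self):
        return self.total / len(self.items) if self.items else None


w = SlidingWindowMean(window=3)
means = []
for t, value in enumerate([10, 20, 30, 40, 50], start=1):
    w.add(t, value)
    means.append(w.mean())
```

With one arrival per time step, the window always holds the N most recent items once it has filled up.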
28
Sliding Windows 2

Prevent stale data from influencing analysis and
statistics
Serve as a tool for approximation in the face of
bounded memory
Open problems
Clustering
Maintaining top wavelet coefficients
Maintaining statistics like variance
Computing correlated aggregates

29
Negative Results

Emerging set of negative results on space-time
requirements of algorithms that operate in stream
model
Henzinger, Raghavan, and Rajagopalan provided
space lower bounds for concrete problems in
stream model, e.g., frequent item counting
Alon, Matias, and Szegedy provided almost tight
lower bounds for computing the frequency moments,
including a lower bound of Ω(N) for estimating F∞
Manku and Motwani's algorithm for computing a
sample of distinct items along with their
frequency has a lower bound of O((1/ε) log(εN))
General lower bound technique for sampling-based
algorithms presented by Bar-Yossef et al.
useful for deriving space lower bounds for data
stream algorithms that resort to oblivious
sampling.

30
Miscellaneous Algorithms

Data Mining: Decision trees are used for
prediction, and clustering is used to summarize
data.
Multiple Streams: Computation of simple functions,
such as the number of distinct elements, over
unions of data streams is useful in distributed
environments.
Reduction of Streams: List-efficient streaming
algorithms that are presented with a list of data
items in a succinct form must be employed in
order for reductions to be efficient.
Property Testing: Programs that make one pass
over the data and, using small space, verify
whether the data satisfies a certain property.
Measuring Sortedness: Useful in determining the
choice of a sort algorithm for the underlying data.

31
Conclusions

The need for and research issues arising from a
new model of data processing.
Review past work relevant to data stream systems
and current projects in that area.
Explore topics in stream query languages, new
requirements and challenges in query processing,
and algorithmic issues.

32
My Opinions

Some existing techniques may be built on in
solving some outstanding problems in data stream
model
Exact answers to data stream queries are
probably not possible
There are many ongoing projects that deal with
streams
The reviews are too high level!

33
Thank You
