1 Consistent Streaming Through Time: A Vision for Event Stream Processing
- by Jonathan Goldstein (speaker), Roger Barga, Mohamed Ali, and Mingsheng Hong
- Microsoft Research
2 Are StreamSQL semantics OK?
- Suppose we want to monitor the bandwidth of a device
- We create an input stream which has one field: bytes sent
- We create an output stream which computes a windowed sum
- What are the StreamSQL semantics when the system gets overloaded (strange question to ask)?
- Either events must be dropped, or they must be queued at the receiver or sender for later processing
- Since window semantics are based on system time (StreamSQL server time), if the device has constant bandwidth, apparent bandwidth will decrease!
- In StreamSQL, the user has no reasonable way of knowing!
- Conclusion: Something is deeply wrong with the use of time in StreamSQL query semantics!
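The overload effect above can be illustrated with a small sketch. It is not StreamSQL; the device rate, the 2x processing delay, and the names `windowed_sum`, `app_time`, and `server_time` are all assumptions made for illustration.

```python
def windowed_sum(events, time_of, window_start, window_end):
    """Sum the bytes of events whose chosen timestamp lies in [window_start, window_end)."""
    return sum(e["bytes"] for e in events if window_start <= time_of(e) < window_end)

# Hypothetical device: sends 100 bytes once per second of application time.
events = [{"app_time": t, "bytes": 100} for t in range(20)]

# Hypothetical overload: the server needs 2 seconds per event, so the event
# sent at application second t is only processed at server second 2 * t.
for e in events:
    e["server_time"] = 2 * e["app_time"]

# Same 10 second window, measured on the two different clocks.
by_server_time = windowed_sum(events, lambda e: e["server_time"], 0, 10)
by_app_time = windowed_sum(events, lambda e: e["app_time"], 0, 10)
print(by_server_time, by_app_time)
```

Under these assumptions the server-time window sees only half of the bytes that the application-time window sees, although the device rate never changed.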
3 What's in the paper?
- Laundry list of CEDR features either unsupported or poorly supported in existing streaming systems (read the paper)
- Some of these features come from event processing
- Some come from specific scenarios which we believe to be important
- These features are described formally through a query language description
4 What's in the talk (and the paper)?
- Formal definitions of CEDR streams and operator semantics
  - Provides a clear and intuitive framework for discussing subtle semantic issues
- Formalization of materialized view update semantics in standing queries, and a discussion of why they are inadequate in isolation
- Definition of a non-view update compliant operator which can express a very wide range of seemingly disparate streaming features
  - A myriad of window types, the separation of inserts and deletes, etc.
- We discuss theoretically how to express and correctly handle both data delivered out of order and data retraction
- Different formal notions of correctness lead to different consistency levels and associated performance tradeoffs
5 What is a stream and a standing query?
- A stream is a (possibly infinite) collection of events, where each event contains
  - A payload (P)
  - A key which uniquely identifies the event (K)
  - An interval of (application) time for which the payload is valid [Vs, Ve)
  - A time at which it arrives at a listener (C, for CEDR time)
- A standing query is an operator graph, where each operator takes 0 or more input streams and produces 0 or more output streams
Acknowledgement: This is inspired by and built on Rick Snodgrass's temporal work
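A minimal sketch of the event shape described on this slide. The class name `Event`, the field names, and the `valid_at` helper are assumptions for illustration; the valid interval is treated as half-open, and an unbounded lifetime is written as `math.inf`.

```python
from dataclasses import dataclass
from typing import Any
import math

@dataclass
class Event:
    key: Any       # K: uniquely identifies the event
    payload: Any   # P: the data carried by the event
    vs: float      # Vs: valid start time (application time)
    ve: float      # Ve: valid end time (application time, exclusive)
    c: float       # C: CEDR time, when the event arrived at the listener

    def valid_at(self, t: float) -> bool:
        """True if the payload is valid at application time t."""
        return self.vs <= t < self.ve

# A tiny, finite stand-in for a stream (hypothetical values).
stream = [
    Event(key="K1", payload={"bytes_sent": 100}, vs=1, ve=5, c=2),
    Event(key="K2", payload={"bytes_sent": 250}, vs=3, ve=math.inf, c=3),
]
valid_at_4 = [e.key for e in stream if e.valid_at(4)]
valid_at_5 = [e.key for e in stream if e.valid_at(5)]
```

Here both events are valid at time 4, while only K2 is still valid at time 5.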
6 What properties do operators have?
- All operators should be well behaved
- Definition 6: A CEDR operator O is well behaved iff for all (combinations of) inputs to O which are logically equivalent to infinity, O's outputs are also logically equivalent to infinity
- Any well behaved operator, when given 2 sets of input streams that are identical except for CEDR time, should produce sets of output streams that are identical except for CEDR time
- Query semantics are independent of CEDR time
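The independence from CEDR time can be checked on a toy operator. This is a sketch only: events are plain dictionaries, `select` is a hypothetical selection operator, and "identical except for CEDR time" is approximated by comparing the events with the `c` field and the delivery order stripped.

```python
def select(stream, predicate):
    """A selection operator: keeps events whose payload satisfies the predicate."""
    return [e for e in stream if predicate(e["payload"])]

def without_cedr_time(stream):
    """Drop CEDR time and delivery order, keeping only (key, payload, vs, ve)."""
    return sorted((e["key"], e["payload"], e["vs"], e["ve"]) for e in stream)

a = [{"key": "K1", "payload": 5, "vs": 1, "ve": 4, "c": 1},
     {"key": "K2", "payload": 9, "vs": 2, "ve": 6, "c": 2}]
# The same events, delivered later and in the opposite order.
b = [{"key": "K2", "payload": 9, "vs": 2, "ve": 6, "c": 7},
     {"key": "K1", "payload": 5, "vs": 1, "ve": 4, "c": 8}]

out_a = select(a, lambda p: p > 3)
out_b = select(b, lambda p: p > 3)
same_except_cedr_time = without_cedr_time(out_a) == without_cedr_time(out_b)
```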
7 What properties do operators have?
- Some operators are also view update compliant
- Definition 11: A unary CEDR operator O is view update compliant iff for all R, S s.t. (R) and (S) are identical, (O(R)) and (O(S)) are also identical
- If we interpret the stream as describing a changing relation where each row's lifetime is specified by valid time, then
  - A view update compliant operator produces snapshot identical output for snapshot identical input
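A sketch of what "snapshot identical" can mean in practice. Events are reduced to `(vs, ve, payload)` tuples, the helper names are hypothetical, and snapshots are compared only at integer times, which is sufficient here because every endpoint in the example is an integer.

```python
from collections import Counter

def snapshot(stream, t):
    """Multiset of payloads valid at application time t."""
    return Counter(p for (vs, ve, p) in stream if vs <= t < ve)

def snapshot_identical(r, s, times):
    """True if r and s describe the same relation at every time checked."""
    return all(snapshot(r, t) == snapshot(s, t) for t in times)

def select(stream, predicate):
    """Relational selection on payloads; lifetimes are left untouched."""
    return [(vs, ve, p) for (vs, ve, p) in stream if predicate(p)]

# R holds "a" in one event; S splits the same lifetime into two events.
R = [(1, 5, "a"), (2, 4, "b")]
S = [(1, 3, "a"), (3, 5, "a"), (2, 4, "b")]
times = range(0, 7)

def is_a(p):
    return p == "a"

inputs_match = snapshot_identical(R, S, times)
outputs_match = snapshot_identical(select(R, is_a), select(S, is_a), times)
```

Selection behaves as a view update compliant operator on this pair: the inputs differ as event sets but agree at every snapshot, and so do the outputs.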
8 What are our operators?
- We may now happily use all our favorite relational operators
- Definition 9: Join $\bowtie_{\theta(P_1,P_2)}(S_1, S_2)$
  - $\bowtie_{\theta(P_1,P_2)}(S_1, S_2) = \{(V_s, V_e, (e_1.\mathrm{Payload} \text{ concatenated with } e_2.\mathrm{Payload})) \mid e_1 \in E(S_1),\ e_2 \in E(S_2),\ V_s = \max\{e_1.V_s, e_2.V_s\},\ V_e = \min\{e_1.V_e, e_2.V_e\}, \text{ where } V_s < V_e \text{ and } \theta(e_1.\mathrm{Payload}, e_2.\mathrm{Payload})\}$
- These operators' output streams describe the changing contents of a materialized view computed over the changing input relation(s) described by the input streams
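The join definition can be sketched as a nested loop over two finite streams. This is an illustration, not the CEDR implementation: events are `(vs, ve, payload)` tuples, payloads are Python tuples so that concatenation is `+`, and the order/shipment data is invented.

```python
def join(s1, s2, theta):
    """Temporal join: pair events whose lifetimes overlap and whose payloads satisfy theta."""
    out = []
    for (vs1, ve1, p1) in s1:
        for (vs2, ve2, p2) in s2:
            vs, ve = max(vs1, vs2), min(ve1, ve2)  # intersection of the two lifetimes
            if vs < ve and theta(p1, p2):          # non-empty overlap and predicate holds
                out.append((vs, ve, p1 + p2))      # payloads concatenated
    return out

orders = [(1, 6, ("order", 7))]
shipments = [
    (4, 9, ("ship", 7)),  # overlaps the order on [4, 6) and matches on id
    (6, 8, ("ship", 7)),  # matches on id, but the lifetimes only touch at 6
    (2, 5, ("ship", 8)),  # overlaps, but the ids differ
]
result = join(orders, shipments, lambda a, b: a[1] == b[1])
```

Only the first shipment joins; the output event lives on the intersection [4, 6) of the two input lifetimes.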
9 Non-view update compliant operators
- Moving window: all output valid end times are set to their valid start times plus the window size
- Insert separation (CQL): all output valid end times are set to infinity
- The semantics of these operations, plus many more, can be easily captured using AlterLifetime
- Definition 12: AlterLifetime $\Pi_{f_{V_s}, f_\Delta}(S)$
  - $\Pi_{f_{V_s}, f_\Delta}(S) = \{(f_{V_s}(e),\ f_{V_s}(e) + f_\Delta(e),\ e.\mathrm{Payload}) \mid e \in E(S)\}$
- Allows the lifetime of input events to be recomputed
- It is not view update compliant, but it is well behaved
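A sketch of AlterLifetime and the two special cases named on this slide, again over `(vs, ve, payload)` tuples. The function names and sample data are assumptions. The last lines show, on one pair of snapshot identical inputs, why the operator is not view update compliant.

```python
import math

def alter_lifetime(stream, f_vs, f_delta):
    """Recompute each lifetime as [f_vs(e), f_vs(e) + f_delta(e)); payload is unchanged."""
    return [(f_vs(e), f_vs(e) + f_delta(e), e[2]) for e in stream]

def moving_window(stream, size):
    """Valid end time becomes valid start time plus the window size."""
    return alter_lifetime(stream, lambda e: e[0], lambda e: size)

def insert_separation(stream):
    """Keep only the insert: valid end time becomes infinity."""
    return alter_lifetime(stream, lambda e: e[0], lambda e: math.inf)

events = [(1, 2, "x"), (3, 10, "y")]
windowed = moving_window(events, 5)
inserts = insert_separation(events)

def payloads_at(stream, t):
    """Payloads valid at application time t."""
    return sorted(p for (vs, ve, p) in stream if vs <= t < ve)

# r and s describe the same relation: "a" is present exactly on [1, 5).
r = [(1, 5, "a")]
s = [(1, 3, "a"), (3, 5, "a")]
# After a window of size 2 the snapshots at time 3 no longer agree.
differs = payloads_at(moving_window(r, 2), 3) != payloads_at(moving_window(s, 2), 3)
```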
10 But is this implementable?
[Figure panels: Input; Correct Output]
- Avg(P): The usual average operator in materialized view update compliant form
- But how could CEDR know it needed to wait for K2 (to produce output) when it saw K1?
- It couldn't have without waiting indefinitely or without some external guarantee
11 But is this implementable?
- We need the ability to retract previously output results in the stream
[Figure: two streams related by the label "is logically equivalent to"]
12 But is this implementable?
- Our real definition of well behavedness
  - Any well behaved operator, when given logically equivalent sets of input streams, produces logically equivalent sets of output streams
- Avg may now fully retract incorrect previous output and issue new correct output for the appropriate time period
- We can denote operator semantics in a very clean manner even in a system with arbitrarily out of order data
- The use of retractions to handle out of order data induces a spectrum of formally defined consistency levels for operators
- These levels expose interesting tradeoffs between various aspects of performance and correctness (much more in the paper)
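One way to picture an optimistic Avg that retracts is sketched below. It is a deliberately naive illustration, not the CEDR algorithm: the view is recomputed from scratch on every arrival, output is modelled as `("insert", segment)` and `("retract", segment)` pairs, and the events K1 and K2 are invented values.

```python
def avg_view(events):
    """Avg as a view: one (vs, ve, average) segment per interval between consecutive endpoints."""
    points = sorted({t for (vs, ve, _) in events for t in (vs, ve)})
    view = []
    for lo, hi in zip(points, points[1:]):
        live = [v for (vs, ve, v) in events if vs <= lo and hi <= ve]
        if live:
            view.append((lo, hi, sum(live) / len(live)))
    return view

def avg_with_retractions(arrivals):
    """Emit output eagerly; retract segments that a later arrival proves wrong."""
    seen, current, output = [], [], []
    for event in arrivals:
        seen.append(event)
        updated = avg_view(seen)
        output += [("retract", seg) for seg in current if seg not in updated]
        output += [("insert", seg) for seg in updated if seg not in current]
        current = updated
    return output, current

K1 = (1, 5, 10.0)  # (vs, ve, value)
K2 = (3, 7, 20.0)
output, final_view = avg_with_retractions([K1, K2])
_, final_view_reordered = avg_with_retractions([K2, K1])
```

After K1 alone the operator emits an average of 10.0 over [1, 5); when K2 arrives it retracts that segment and inserts three corrected ones. The final view is the same for either arrival order, which is the sense in which the two output streams are logically equivalent.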
13 Imperfections in Event Streaming
- How do current systems cope?
- Wait until we're sure we have all data that affects our results up to a point in time (High Consistency)
  - High latency
  - Requires application and network guarantee
  - Requires high memory
  - Absolutely correct answers
  - Useful for standing queries that result in some expensive form of corrective or examination action
    - A human must examine something because some aggregation (avg) or negation based alert tripped
- Provide an answer quickly as of the current time, but ignore late arriving data (Low Consistency)
  - Low latency
  - No application or network guarantee required
  - Low memory
  - Sacrifices answer correctness
  - Useful in applications which are unable to provide guarantees about data arrival timeliness and where exact answers aren't required
    - E.g. aggregations in internet scale monitoring
14 Imperfections in Event Streaming
- With retractions
- Compute our output early in an optimistic fashion and retract later if necessary (Middle Consistency)
  - Low latency
  - Doesn't require application and network guarantees
  - High memory requirements, equal to the high consistency case if we have guarantees
  - May produce more output
  - Useful in situations where we don't want to block, but where we want eventual correctness
    - Stock ticker data example: we want to compute real time info about stock data, but compensate when a correction is issued
    - Shared expressions between two queries, one running at the high level of consistency and one at the low
15 Infinite Spectrum of Consistency Levels
- B: How long (at most) does the query block?
- M: How long (at most) is the query required to remember data?
[Figure: spectrum of consistency levels. Axis labels: Blocking (B), Memory (M). Level labels: Strong consistency, Middle consistency, Weak consistency. Other labels: "Slow, cautious"; "Quick, optimistic"; "Small, less correct"; "Big, more correct".]
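The roles of B and M can be made concrete with a toy windowed count. This is one possible reading of the two parameters, not the paper's formal definition: the operator blocks for B after the window closes before answering, then keeps state until M after the window closes and corrects its answer with a retraction for any late event seen in between. The function name and arrival data are invented.

```python
def count_window(arrivals, window_end, B, M):
    """Count events with application time in [0, window_end).

    arrivals: list of (cedr_time, app_time) pairs.
    Returns a list of (cedr_time, kind, count) outputs.
    """
    emit_at = window_end + B             # first answer after blocking for B
    forget_at = window_end + max(B, M)   # state is discarded after this CEDR time
    in_window = sorted(c for (c, t) in arrivals if 0 <= t < window_end)
    count = sum(1 for c in in_window if c <= emit_at)
    outputs = [(emit_at, "insert", count)]
    for c in in_window:
        if emit_at < c <= forget_at:     # late, but still remembered: correct the answer
            outputs.append((c, "retract", count))
            count += 1
            outputs.append((c, "insert", count))
    return outputs

# Two timely events, one slightly late event, one very late event.
arrivals = [(1, 1), (2, 2), (12, 9), (30, 5)]

strong = count_window(arrivals, window_end=10, B=20, M=20)  # block, then answer once
middle = count_window(arrivals, window_end=10, B=0, M=20)   # answer early, retract later
weak = count_window(arrivals, window_end=10, B=0, M=0)      # answer early, ignore late data
```

With these inputs the strong setting answers once, late, with the full count; the middle setting answers at once and reaches the same count after two corrections; the weak setting answers at once and never sees the late events.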