Title: Stream and Sensor Data Management
1 Stream and Sensor Data Management
- Zachary G. Ives
- University of Pennsylvania
- CIS 650 Implementing Data Management Systems
- November 17, 2008
2 Converting between Streams and Relations
- Stream-to-relation operators
- Sliding window: tuple-based (last N rows) or time-based (within time range)
- Partitioned sliding window: groups by keys, then applies a sliding window over each group
- Is this necessary or minimal?
- Relation-to-stream operators
- Istream: stream-ifies any insertions over a relation
- Dstream: stream-ifies the deletes
- Rstream: the stream contains the set of tuples in the relation
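The operators above can be sketched in a few lines of Python. This is only an illustration, assuming a stream is a timestamp-ordered list of (timestamp, row) pairs and a relation is a plain list with no duplicate rows; the function names are made up for this sketch and are not part of any stream system.

```python
from collections import deque

def tuple_window(stream, n):
    """Tuple-based window: after each arrival, yield the last n rows."""
    window = deque(maxlen=n)  # deque drops the oldest row automatically
    for _, row in stream:
        window.append(row)
        yield list(window)

def time_window(stream, width):
    """Time-based window: after an arrival at time t, yield rows in (t - width, t]."""
    window = deque()
    for t, row in stream:
        window.append((t, row))
        while window[0][0] <= t - width:  # expire rows that fell out of range
            window.popleft()
        yield [r for _, r in window]

def istream(prev, curr):
    """Rows inserted between two successive states of a relation."""
    return [r for r in curr if r not in prev]

def dstream(prev, curr):
    """Rows deleted between two successive states of a relation."""
    return [r for r in prev if r not in curr]
```

Rstream needs no helper under these assumptions: it simply emits the whole current relation at every time step.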
3 Some Examples
- Select * From S1 [Rows 1000], S2 [Range 2 Minutes] Where S1.A = S2.A And S1.A > 10
- Select Rstream(S.A, R.B) From S [Now], R Where S.A = R.A
4 Building a Stream System
- Basic data item is the element
- <op, time, tuple>, where op ∈ {+, -}
- Query plans need a few new (?) items
- Queues
- Used for hooking together operators, esp. over windows
- (Assumption is that pipelining is generally not possible, and we may need to drop some tuples from the queue)
- Synopses
- The intermediate state an operator needs to carry around
- Note that this is usually bounded by windows
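To make elements and synopses concrete, here is a minimal sketch of a tuple-based window operator. It assumes "+" marks an insertion and "-" a deletion; the names Element and RowWindow are hypothetical and do not come from the STREAM code base.

```python
from collections import deque, namedtuple

# An element as described above: an operation flag, a timestamp, and a tuple.
Element = namedtuple("Element", ["op", "time", "tuple"])

class RowWindow:
    """Window over the last n tuples; the synopsis is exactly those tuples."""

    def __init__(self, n):
        self.n = n
        self.synopsis = deque()  # state bounded by the window size

    def process(self, elem):
        """Consume one arriving element and return the elements to enqueue downstream."""
        out = [Element("+", elem.time, elem.tuple)]
        self.synopsis.append(elem.tuple)
        if len(self.synopsis) > self.n:
            # The oldest tuple leaves the window, so emit a deletion for it.
            out.append(Element("-", elem.time, self.synopsis.popleft()))
        return out
```

In a full plan, the returned list would be appended to the queue that feeds the next operator.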
5 Example Query Plan
What's different here?
6 Some Tricks for Performance
- Sharing synopses across multiple operators
- In a few cases, more than one operator may join with the same synopsis
- Can exploit punctuations or k-constraints
- Analogous to interesting orders
- Referential-integrity k-constraint: bound of k between the arrival of a "many" element and its corresponding "one" element
- Ordered-arrival k-constraint: need a window of at most k to sort
- Clustered-arrival k-constraint: bound on the distance between items with the same grouping attributes
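The ordered-arrival case is easy to illustrate. The sketch below assumes the constraint means that any element arriving more than k positions after an element s has a sort key no smaller than that of s; under that assumption a buffer of k elements is enough to emit the stream in sorted order. The function name is made up for the example.

```python
import heapq

def sort_with_k_constraint(stream, k):
    """Emit a k-disordered stream in sorted order using a buffer of at most k items."""
    heap = []
    for item in stream:
        heapq.heappush(heap, item)
        if len(heap) > k:
            # With k + 1 buffered items, the smallest can no longer be
            # undercut by a future arrival, so it is safe to emit.
            yield heapq.heappop(heap)
    while heap:  # end of input: drain the buffer in order
        yield heapq.heappop(heap)
```

For example, `list(sort_with_k_constraint([3, 1, 2, 4, 6, 5], 2))` yields the values 1 through 6 in order while never holding more than three items, including the one just pushed.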
7 Query Processing: Chain Scheduling
- Similar in many ways to eddies
- May decide to apply operators as follows:
- Assume we know how many tuples can be processed in a time unit
- Cluster groups of operators into chains that maximize reduction in queue size per unit time
- Greedily forward tuples into the most selective chain
- Within a chain, process in FIFO order
- They also do a form of join reordering
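A rough sketch of the chain idea follows. The slide gives no algorithm, so the details are assumptions: each operator on a path is described by a hypothetical (cost, selectivity) pair, a chain is the run of operators with the steepest drop in data size per unit of time, and the scheduler always serves the non-empty chain with the steepest slope. It ignores join reordering.

```python
def build_chains(ops):
    """Split a path of (cost, selectivity) operators into chains.

    Returns a list of (start, end, slope) covering operator indexes [start, end).
    Costs are assumed to be positive.
    """
    chains = []
    start, size = 0, 1.0  # size = fraction of an input tuple still alive
    while start < len(ops):
        elapsed, remaining = 0.0, size
        best = None  # (end, slope, size_after)
        for end in range(start, len(ops)):
            cost, selectivity = ops[end]
            elapsed += cost
            remaining *= selectivity
            slope = (size - remaining) / elapsed  # size reduction per unit time
            if best is None or slope > best[1]:
                best = (end + 1, slope, remaining)
        chains.append((start, best[0], best[1]))
        start, size = best[0], best[2]
    return chains

def next_chain(chains, queue_lengths):
    """Pick the non-empty chain with the steepest slope, or None if all are idle."""
    ready = [i for i, n in enumerate(queue_lengths) if n > 0]
    return max(ready, key=lambda i: chains[i][2]) if ready else None
```

With operators [(1, 0.9), (1, 0.1), (2, 0.5)], the first two fall into one chain because running them together sheds data faster than running the first alone.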
8 Scratching the Surface: Approximation
- They point out two areas where we might need to approximate output
- CPU is limited, and we need to drop some stream elements according to some probabilistic metric
- Collect statistics via a profiler
- Use Hoeffding inequality to derive a sampling rate in order to maintain a confidence interval
- May need to do similar things if memory usage is a constraint
- Are there other options? When might they be useful?
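As an illustration of the Hoeffding step, the sketch below uses the standard two-sided bound for the mean of n independent samples with values in a range of width R: the error exceeds ε with probability at most 2·exp(-2nε²/R²). Solving for n gives the sample size. How STREAM applies the bound is not stated here, so the function names and the per-window framing are assumptions.

```python
import math

def hoeffding_sample_size(epsilon, delta, value_range=1.0):
    """Samples needed so the sample mean is within epsilon of the true mean
    with probability at least 1 - delta (two-sided Hoeffding bound)."""
    return math.ceil(value_range ** 2 * math.log(2 / delta) / (2 * epsilon ** 2))

def sampling_rate(tuples_per_window, epsilon, delta, value_range=1.0):
    """Fraction of a window's tuples to keep; the rest can be shed."""
    needed = hoeffding_sample_size(epsilon, delta, value_range)
    return min(1.0, needed / tuples_per_window)
```

For ε = 0.05 and δ = 0.05 on values in [0, 1], 738 samples suffice, so a window of 10,000 tuples could be sampled at a rate of about 7.4%.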
9 STREAM in General
- Logical semantics first
- Starts with a basic data model: streams as timestamped sets
- Develops a language and semantics
- Heavily based on SQL
- Proposes a relatively straightforward implementation
- Interesting ideas like k-constraints
- Interesting approaches like chain scheduling
- No real consideration of distributed processing
10 Aurora
- Implementation first: mix and match operations from past literature
- Basic philosophy: most of the ideas in streams existed in previous research
- Sliding windows, load shedding, approximation, ...
- So let's borrow those ideas and focus on how to build a real system with them!
- Emphasis is on building a scalable, robust system
- Distributed implementation: Medusa
11 Queries in Aurora
- Oddly, no declarative query language in the initial version! (Added for commercial product)
- Queries are workflows of physical query operators (SQuAl)
- Many operators resemble relational algebra ops
12 Example Query
13 Some Interesting Aspects
- A relatively simple adaptive query optimizer
- Can push filtering and mapping into many operators
- Can reorder some operators (e.g., joins, unions)
- Need built-in error handling
- If a data source fails to respond in a certain amount of time, create a special alarm tuple
- This propagates through the query plan
- Incorporate built-in load shedding and real-time scheduling to support QoS
- Have a notion of combining a query over historical data with data from a stream
- Switches from a pull-based mode (reading from disk) to a push-based mode (reading from network)
14 The Medusa Processor
- Distributed coordinator between many Aurora nodes
- Scalability through federation and distribution
- Fail-over
- Load balancing
15 Main Components
- Lookup
- Distributed catalog: schemas, where to find streams, where to find queries
- Brain
- Query setup, load monitoring via I/O queues and stats
- Load distribution and balancing scheme is used
- Very reminiscent of Mariposa!
16 Load Balancing
- Migration: an operator can be moved from one node to another
- Initial implementation didn't support moving of state
- The state is simply dropped, and operator processing resumes
- Implications on semantics?
- Plans to support state migration
- Agoric system model to create incentives
- Clients pay nodes for processing queries
- Nodes pay each other to handle load: pairwise contracts negotiated offline
- Bounded-price mechanism: price for migration of load, spec for what a node will take on
- Does this address the weaknesses of the Mariposa model?
17 Some Applications They Tried
- Financial services (stock ticker)
- Main issue is not volume, but problems with feeds
- Two-level alarm system, where higher-level alarm helps diagnose problems
- Shared computation among queries
- User-defined aggregation and mapping
- This is the main application for the commercial version (StreamBase)
- Linear road (sensor monitoring)
- Traffic sensors in a toll road: change toll depending on how many cars are on the road
- Combination of historical and continuous queries
- Environmental monitoring
- Sliding-window calculations
18 Lessons Learned
- Historical data is important, not just stream data
- (Summaries?)
- Sometimes need synchronization for consistency
- ACID for streams?
- Streams can be out of order, bursty
- Stream cleaning?
- Adaptors (and also XML) are important
- But we already knew that!
- Performance is critical
- They spent a great deal of time using microbenchmarks and optimizing
19 Sensors and Sensor Networks
- Trends:
- Cameras and other sensors are very cheap
- Microprocessors and microcontrollers can be very small
- Wireless networks are easy to build
- Why not instrument the physical world with tiny wireless sensors and networks?
- Vision: Smart dust
- Berkeley motes, RF tags, cameras, camera phones, temperature sensors, etc.
- Today we already see pieces of this
- Penn buildings and SCADA system
- 250 surveillance cameras through campus
20 What Can We Do with Sensor Networks?
- Many passive monitoring applications
- Environmental monitoring
- temperature in different parts of a building
- air quality
- etc.
- Law enforcement
- Video feeds and anomalous behavior
- Research studies
- Study ocean temperature, currents
- Monitor status of eggs in endangered birds' nests
- ZebraNet
- Fun
- Record sporting events or performances from every angle (video and audio)
- Ultimately, build reactive systems as well: robotics, Mars landers, ...
21 Some Challenges
- Highly distributed!
- May have thousands of nodes
- Know about a few nodes within proximity; may not know location
- Nodes' transmissions may interfere with one another
- Power and resource constraints
- Most of these devices are wireless, tiny, battery-powered
- Can only transmit data every so often
- Limited CPU, memory: can't run sophisticated code
- High rate of failure
- Collisions, battery failures, sensor calibration, ...
22 The Target Platform
- Most sensor network research argues for the Berkeley mote as a target platform
- Mote: 4MHz, 8-bit CPU
- 128KB RAM
- 512KB Flash memory
- 40kbps radio, 100 ft range
- Sensors:
- Light, temperature, microphone
- Accelerometer
- Magnetometer
http://robotics.eecs.berkeley.edu/~pister/SmartDust/
23 Sensor Net Data Acquisition
- First: build routing tree
- Second: begin sensing and aggregation
24 Sensor Net Data Acquisition (Sum)
[Figure: routing tree with a sensor reading at each node (mostly 5, plus a 7 and an 8)]
- First: build routing tree
- Second: begin sensing and aggregation (e.g., sum)
25 Sensor Net Data Acquisition (Sum)
[Figure: routing tree annotated with sensor readings and partial sums propagating up toward the root]
- First: build routing tree
- Second: begin sensing and aggregation (e.g., sum)
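The two steps can be sketched as follows. This is an idealized illustration, assuming a loss-free network given as an adjacency dictionary and a flood in which each node adopts the first node it hears from as its parent; the node names and readings are invented.

```python
from collections import deque

def build_routing_tree(neighbors, root):
    """Step 1: flood from the base station; each node keeps its first sender as parent."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nb in neighbors[node]:
            if nb not in parent:
                parent[nb] = node
                queue.append(nb)
    return parent

def aggregate_sum(parent, readings):
    """Step 2: each node adds its own reading to its children's partial sums."""
    children = {n: [] for n in parent}
    for node, p in parent.items():
        if p is not None:
            children[p].append(node)

    def partial_sum(node):
        return readings[node] + sum(partial_sum(c) for c in children[node])

    root = next(n for n, p in parent.items() if p is None)
    return partial_sum(root)
```

Each node transmits a single partial sum to its parent rather than forwarding every reading, which is where the power savings come from.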
26 Sensor Network Research
- Routing: need to aggregate and consolidate data in a power-efficient way
- Ad hoc routing: generate routing tree to base station
- Generally need to merge computation with routing
- Robustness: need to combine info from many sensors to account for individual errors
- What aggregation functions make sense?
- Languages: how do we express what we want to do with sensor networks?
- Many proposals here
27 A First Try: TinyOS and nesC
- TinyOS: a custom OS for sensor nets, written in nesC
- Assumes low-power CPU
- Very limited concurrency support: events (signaled asynchronously) and tasks (cooperatively scheduled)
- Applications built from components
- Basically, small objects without any local state
- Various features in libraries that may or may not be included
- interface Timer {
    command result_t start(char type, uint32_t interval);
    command result_t stop();
    event result_t fired();
  }
28 Drawbacks of this Approach
- Need to write very low-level code for sensor net behavior
- Only simple routing policies are built into TinyOS; some of the routing algorithms may have to be implemented by hand
- Has required many follow-up papers to fill in some of the missing pieces, e.g., Hood (object tracking and state sharing), ...
29 An Alternative
- Much of the computation being done in sensor nets looks like what we were discussing with STREAM
- Today's sensor networks look a lot like databases, pre-Codd
- Custom access paths to get to data
- One-off custom code
- So why not look at mapping sensor network computation to SQL?
- Not very many joins here, but significant aggregation
- Now the challenge is in picking a distribution and routing strategy that provides appropriate guarantees and minimizes power usage
30 TinyDB and TinySQL
- Treat the entire sensor network as a universal relation
- Each type of sensor data is a column in a global table
- Tuples are created according to a sample interval (separated by epochs)
- (Implications of this model?)
- SELECT nodeid, light, temp
  FROM sensors
  SAMPLE INTERVAL 1s FOR 10s
31 Storage Points and Windows
- Like Aurora, STREAM, can materialize portions of the data
- CREATE STORAGE POINT recentlight SIZE 8
  AS (SELECT nodeid, light FROM sensors
      SAMPLE INTERVAL 10s)
- and we can use windowed aggregates
- SELECT WINAVG(volume, 30s, 5s)
  FROM sensors
  SAMPLE INTERVAL 1s
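One plausible reading of WINAVG(volume, 30s, 5s) at a 1s sample interval is "every 5 samples, report the average of the last 30 samples". The sketch below follows that reading and assumes the first result appears only once a full window is available; it is an illustration, not TinyDB's implementation.

```python
def winavg(samples, window, slide):
    """Sliding-window average over a list of readings taken once per sample interval.

    window and slide are expressed in numbers of samples.
    """
    results = []
    for end in range(window, len(samples) + 1, slide):
        chunk = samples[end - window:end]  # the most recent `window` readings
        results.append(sum(chunk) / window)
    return results
```

With 40 one-second readings 0, 1, ..., 39, `winavg(readings, 30, 5)` produces three averages, one each at 30s, 35s, and 40s.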
32 Events
- ON EVENT bird-detect(loc):
  SELECT AVG(light), AVG(temp), event.loc
  FROM sensors AS s
  WHERE dist(s.loc, event.loc) < 10m
  SAMPLE INTERVAL 2s FOR 30s
- How do we know about events?
- Contrast to UDFs? triggers?
33 Power and TinyDB
- Cost-based optimizer tries to find a query plan to yield lowest overall power consumption
- Different sensors have different power usage
- Try to order sampling according to selectivity (sounds familiar?)
- Assumption of uniform distribution of values over range
- Batching of queries (multi-query optimization)
- Convert a series of events into a stream join: does this resemble anything we've seen recently?
- Also need to consider where the query is processed
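The sampling-order idea can be sketched with the classic rank rule for expensive predicates: sample sensors in increasing order of cost / (1 - selectivity), so cheap and selective predicates run first and later sensors are sampled only when earlier predicates pass. This is a textbook heuristic, not necessarily TinyDB's exact cost model, and the energy costs below are invented.

```python
def uniform_selectivity(lo, hi, threshold):
    """Fraction of readings above threshold, assuming values are uniform on [lo, hi]."""
    return min(1.0, max(0.0, (hi - threshold) / (hi - lo)))

def expected_energy(order):
    """Expected energy per epoch for (name, cost, selectivity) predicates in this order."""
    total, reach = 0.0, 1.0  # reach = probability that we get this far
    for _, cost, selectivity in order:
        total += reach * cost
        reach *= selectivity
    return total

def order_sampling(preds):
    """Sort predicates by rank = cost / (1 - selectivity), cheapest rank first."""
    def rank(pred):
        _, cost, selectivity = pred
        return cost / (1.0 - selectivity) if selectivity < 1.0 else float("inf")
    return sorted(preds, key=rank)
```

With a magnetometer predicate costing 90 units at selectivity 0.1 and a light predicate costing 1 unit at selectivity 0.5, sampling light first costs 46 units per epoch in expectation, against 90.1 the other way round.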
34 Dissemination of Queries
- Based on semantic routing tree idea
- SRT build request is flooded first
- Node n gets to choose its parent p, based on radio range from root
- Parent knows its children
- Maintains an interval on values for each child
- Forwards requests to children as appropriate
- Maintenance:
- If interval changes, child notifies its parent
- If a node disappears, parent learns of this when it fails to get a response to a query
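The forwarding rule can be sketched as below. The class is hypothetical: it assumes each node holds one fixed attribute value (for example, a coordinate), that children are attached bottom-up so intervals are already complete, and it leaves out the maintenance messages.

```python
class SRTNode:
    """A node that remembers, per child, the value interval covered by that child's subtree."""

    def __init__(self, node_id, value):
        self.node_id = node_id
        self.value = value
        self.children = []
        self.child_interval = {}  # child id -> (lo, hi) over the child's subtree

    def interval(self):
        """Interval covering this node's value and everything below it."""
        los = [self.value] + [lo for lo, _ in self.child_interval.values()]
        his = [self.value] + [hi for _, hi in self.child_interval.values()]
        return min(los), max(his)

    def add_child(self, child):
        self.children.append(child)
        self.child_interval[child.node_id] = child.interval()

    def route(self, lo, hi, visited):
        """Forward a range query only to children whose interval overlaps [lo, hi]."""
        visited.append(self.node_id)
        matches = [self.node_id] if lo <= self.value <= hi else []
        for child in self.children:
            c_lo, c_hi = self.child_interval[child.node_id]
            if c_lo <= hi and lo <= c_hi:
                matches += child.route(lo, hi, visited)
        return matches
```

The `visited` list records which nodes had to wake up for the query; subtrees whose interval misses the query range never hear about it.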
35 Query Processing
- Mostly consists of sleeping!
- Wake briefly, sample, and compute operators, then route onwards
- Nodes are time synchronized
- Awake time is proportional to the neighborhood size (why?)
- Computation is based on partial state records
- Basically, each operation is a partial aggregate value, plus the reading from the sensor
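For AVG, a partial state record can be a (sum, count) pair, since averages of averages would give the wrong answer. The sketch below assumes that representation; the function names are illustrative.

```python
def init(reading):
    """Partial state record for a single sensor reading."""
    return (reading, 1)

def merge(a, b):
    """Combine two partial state records."""
    return (a[0] + b[0], a[1] + b[1])

def evaluate(psr):
    """Turn the final record into the aggregate value at the root."""
    return psr[0] / psr[1]

def node_epoch(reading, child_psrs):
    """What one node does per epoch: merge its reading with the records from its children."""
    psr = init(reading)
    for child in child_psrs:
        psr = merge(psr, child)
    return psr
```

A node with reading 30 and two leaf children reading 10 and 20 forwards (60, 3) to its parent, and the root evaluates that to 20.0.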
36 Load Shedding and Approximation
- What if the router queue is overflowing?
- Need to prioritize tuples, drop the ones we don't want
- FIFO vs. averaging the head of the queue vs. delta-proportional weighting
- Later work considers the question of using approximation for more power efficiency
- If sensors in one region change less frequently, can sample less frequently (or fewer times) in that region
- If sensors change less frequently, can sample readings that take less power but are correlated (e.g., battery voltage vs. temperature)
- Thursday, 4:30PM, DB Group Meeting, I'll discuss some of this work
37 The Future of Sensor Nets?
- TinySQL is a nice way of formulating the problem of query processing with motes
- View the sensor net as a universal relation
- Can define views to abstract some concepts, e.g., an object being monitored
- But:
- What about when we have multiple instances of an object to be tracked? Correlations between objects?
- What if we have more complex data? More CPU power?
- What if we want to reason about accuracy?