1
Stream and Sensor Data Management

Zachary G. Ives
University of Pennsylvania
CIS 650 Implementing Data Management Systems
November 17, 2008

2
Converting between Streams & Relations

Stream-to-relation operators
Sliding window: tuple-based (last N rows) or
time-based (within time range)
Partitioned sliding window: does grouping by
keys, then does sliding window over that
Is this necessary or minimal?
Relation-to-stream operators
Istream: stream-ifies any insertions over a
relation
Dstream: stream-ifies the deletes
Rstream: stream contains the set of tuples in the
relation
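
A minimal Python sketch of the two directions above: a tuple-based sliding window (stream to relation), and the insert/delete deltas that Istream and Dstream would emit (relation to stream). The function names are illustrative, not part of STREAM, and the sketch assumes set semantics (no duplicate rows) for simplicity.

```python
from collections import deque


def tuple_window(stream, n):
    """Yield the contents of a tuple-based window (last n rows) after each arrival."""
    window = deque(maxlen=n)  # the deque drops the oldest row once n is exceeded
    for row in stream:
        window.append(row)
        yield list(window)


def istream_dstream(relation_states):
    """For successive states of a relation, yield (inserted, deleted) rows per step."""
    previous = set()
    for state in relation_states:
        current = set(state)
        # Istream: rows present now but not before; Dstream: the reverse.
        yield sorted(current - previous), sorted(previous - current)
        previous = current


states = list(tuple_window([1, 2, 3, 4], n=2))
deltas = list(istream_dstream(states))
print(states)
print(deltas)
```

Rstream would simply emit each full state in `states` at every time step.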

3
Some Examples

Select * From S1 [Rows 1000], S2 [Range 2 minutes]
Where S1.A = S2.A And S1.A > 10
Select Rstream(S.A, R.B) From S [Now], R
Where S.A = R.A

4
Building a Stream System

Basic data item is the element
<op, time, tuple> where op ∈ {+, -}
Query plans need a few new (?) items
Queues
Used for hooking together operators, esp. over
windows
(Assumption is that pipelining is generally not
possible, and we may need to drop some tuples
from the queue)
Synopses
The intermediate state an operator needs to carry
around
Note that this is usually bounded by windows

5
Example Query Plan
What's different here?
6
Some Tricks for Performance

Sharing synopses across multiple operators
In a few cases, more than one operator may join
with the same synopsis
Can exploit punctuations or k-constraints
Analogous to interesting orders
Referential integrity k-constraint: bound of k
between arrival of a "many" element and its
corresponding "one" element
Ordered-arrival k-constraint: need window of at
most k to sort
Clustered-arrival k-constraint: bound on distance
between items with same grouping attributes
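
To illustrate the ordered-arrival k-constraint, here is a small Python sketch that sorts a nearly ordered stream with a buffer of at most k elements. It assumes the constraint means that any element arriving more than k positions after an element s is no smaller than s; the function name is illustrative.

```python
import heapq


def sort_with_k_constraint(stream, k):
    """Emit elements in sorted order using a min-heap that holds at most k items."""
    buffer = []
    for value in stream:
        heapq.heappush(buffer, value)
        if len(buffer) > k:
            # Under the k-constraint nothing smaller can still arrive,
            # so the current minimum is safe to emit.
            yield heapq.heappop(buffer)
    while buffer:  # end of stream: flush what is left
        yield heapq.heappop(buffer)


ordered = list(sort_with_k_constraint([2, 1, 4, 3, 6, 5], k=1))
print(ordered)
```

The operator's synopsis stays bounded by k regardless of stream length, which is the point of exploiting the constraint.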

7
Query Processing: Chain Scheduling

Similar in many ways to eddies
May decide to apply operators as follows
Assume we know how many tuples can be processed
in a time unit
Cluster groups of operators into chains that
maximize reduction in queue size per unit time
Greedily forward tuples into the most selective
chain
Within a chain, process in FIFO order
They also do a form of join reordering
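
A simplified Python sketch of the chain-building step: walking along an operator path, repeatedly pick the run of operators that gives the steepest reduction in tuple size per unit of processing time. The (time, selectivity) pairs are made-up numbers, and the sketch leaves out the scheduler itself, which would always run the chain with the steepest slope.

```python
def build_chains(operators):
    """operators: list of (time_per_tuple, selectivity) along one path.

    Returns a list of (start_index, end_index, slope) chains.
    """
    chains = []
    start = 0
    size = 1.0  # tuple size entering operator `start`, relative to the input
    while start < len(operators):
        best_end, best_slope, best_size = start, float("-inf"), size
        elapsed, current = 0.0, size
        for end in range(start, len(operators)):
            time, selectivity = operators[end]
            elapsed += time
            current *= selectivity
            slope = (size - current) / elapsed  # size reduction per unit time
            if slope > best_slope:
                best_end, best_slope, best_size = end, slope, current
        chains.append((start, best_end, best_slope))
        start, size = best_end + 1, best_size
    return chains


# Hypothetical path: a weak filter, a strong filter, then a slow operator.
chains = build_chains([(1.0, 0.9), (1.0, 0.1), (2.0, 0.5)])
print(chains)
```

Here the weak first filter is grouped with the strong second one, because together they shrink the queue faster than the first does alone.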

8
Scratching the Surface: Approximation

They point out two areas where we might need to
approximate output
CPU is limited, and we need to drop some stream
elements according to some probabilistic metric
Collect statistics via a profiler
Use Hoeffding inequality to derive a sampling
rate in order to maintain a confidence interval
May need to do similar things if memory usage is
a constraint
Are there other options? When might they be
useful?
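
One way the Hoeffding inequality can be turned into a sampling rate, sketched in Python. This assumes independent samples of a value bounded in a known range, and uses the two-sided bound P(|mean error| >= epsilon) <= 2 exp(-2 n epsilon^2 / range^2). The function names and the example numbers are illustrative, not taken from the STREAM papers.

```python
import math


def hoeffding_sample_size(epsilon, delta, value_range=1.0):
    """Samples needed so the sample mean is within epsilon with probability >= 1 - delta."""
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def sampling_rate(epsilon, delta, window_size, value_range=1.0):
    """Fraction of the tuples in a window to keep; 1.0 means no shedding is possible."""
    needed = hoeffding_sample_size(epsilon, delta, value_range)
    return min(1.0, needed / window_size)


n = hoeffding_sample_size(epsilon=0.1, delta=0.05)
rate = sampling_rate(epsilon=0.1, delta=0.05, window_size=1000)
print(n, rate)
```

Note that the required sample count does not depend on the window size, so larger windows tolerate proportionally more shedding.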

9
STREAM in General

Logical semantics first
Starts with a basic data model: streams as
timestamped sets
Develops a language and semantics
Heavily based on SQL
Proposes a relatively straightforward
implementation
Interesting ideas like k-constraints
Interesting approaches like chain scheduling
No real consideration of distributed processing

10
Aurora

Implementation first: mix and match operations
from past literature
Basic philosophy: most of the ideas in streams
existed in previous research
Sliding windows, load shedding, approximation, ...
So let's borrow those ideas and focus on how to
build a real system with them!
Emphasis is on building a scalable, robust system
Distributed implementation: Medusa

11
Queries in Aurora

Oddly, no declarative query language in the
initial version! (Added for commercial product)
Queries are workflows of physical query operators
(SQuAl)
Many operators resemble relational algebra ops

12
Example Query
13
Some Interesting Aspects

A relatively simple adaptive query optimizer
Can push filtering and mapping into many
operators
Can reorder some operators (e.g., joins, unions)
Need built-in error handling
If a data source fails to respond in a certain
amount of time, create a special alarm tuple
This propagates through the query plan
Incorporate built-in load shedding and real-time
scheduling to support QoS
Have a notion of combining a query over
historical data with data from a stream
Switches from a pull-based mode (reading from
disk) to a push-based mode (reading from network)

14
The Medusa Processor

Distributed coordinator between many Aurora nodes
Scalability through federation and distribution
Fail-over
Load balancing

15
Main Components

Lookup
Distributed catalog: schemas, where to find
streams, where to find queries
Brain
Query setup, load monitoring via I/O queues and
stats
Load distribution and balancing scheme is used
Very reminiscent of Mariposa!

16
Load Balancing

Migration: an operator can be moved from one
node to another
Initial implementation didn't support moving of
state
The state is simply dropped, and operator
processing resumes
Implications on semantics?
Plans to support state migration
Agoric system model to create incentives
Clients pay nodes for processing queries
Nodes pay each other to handle load: pairwise
contracts negotiated offline
Bounded-price mechanism: price for migration of
load, spec for what a node will take on
Does this address the weaknesses of the Mariposa
model?

17
Some Applications They Tried

Financial services (stock ticker)
Main issue is not volume, but problems with feeds
Two-level alarm system, where higher-level alarm
helps diagnose problems
Shared computation among queries
User-defined aggregation and mapping
This is the main application for the commercial
version (StreamBase)
Linear road (sensor monitoring)
Traffic sensors in a toll road: change toll
depending on how many cars are on the road
Combination of historical and continuous queries
Environmental monitoring
Sliding-window calculations

18
Lessons Learned

Historical data is important, not just stream
data
(Summaries?)
Sometimes need synchronization for consistency
ACID for streams?
Streams can be out of order, bursty
Stream cleaning?
Adaptors (and also XML) are important
But we already knew that!
Performance is critical
They spent a great deal of time using
microbenchmarks and optimizing

19
Sensors and Sensor Networks

Trends
Cameras and other sensors are very cheap
Microprocessors and microcontrollers can be very
small
Wireless networks are easy to build
Why not instrument the physical world with tiny
wireless sensors and networks?
Vision: "Smart dust"
Berkeley motes, RF tags, cameras, camera phones,
temperature sensors, etc.
Today we already see pieces of this
Penn buildings and SCADA system
250 surveillance cameras through campus

20
What Can We Do with Sensor Networks?

Many passive monitoring applications
Environmental monitoring
temperature in different parts of a building
air quality
etc.
Law enforcement
Video feeds and anomalous behavior
Research studies
Study ocean temperature, currents
Monitor status of eggs in endangered birds' nests
ZebraNet
Fun
Record sporting events or performances from every
angle (video and audio)
Ultimately, build reactive systems as well:
robotics, Mars landers, ...

21
Some Challenges

Highly distributed!
May have thousands of nodes
Know about a few nodes within proximity; may not
know location
Nodes' transmissions may interfere with one
another
Power and resource constraints
Most of these devices are wireless, tiny,
battery-powered
Can only transmit data every so often
Limited CPU, memory: can't run sophisticated
code
High rate of failure
Collisions, battery failures, sensor calibration, ...

22
The Target Platform

Most sensor network research argues for the
Berkeley mote as a target platform
Mote: 4MHz, 8-bit CPU
128KB RAM
512KB Flash memory
40kbps radio, 100 ft range
Sensors
Light, temperature, microphone
Accelerometer
Magnetometer

http://robotics.eecs.berkeley.edu/~pister/SmartDust/
23
Sensor Net Data Acquisition

First: build routing tree
Second: begin sensing and aggregation

24
Sensor Net Data Acquisition (Sum)
[Figure: routing tree of sensor nodes, each labeled with its current reading (5 at most nodes, 7 and 8 at two of them)]

First: build routing tree
Second: begin sensing and aggregation (e.g.,
sum)

25
Sensor Net Data Acquisition (Sum)
[Figure: the same routing tree, with partial sums accumulating as readings flow toward the root]

First: build routing tree
Second: begin sensing and aggregation (e.g.,
sum)
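
The two steps above can be sketched in Python: given a routing tree, each node adds its own reading to the partial sums received from its children and forwards a single value to its parent. The tree shape, node names, and readings below are made up for illustration.

```python
def partial_sums(children, readings, root):
    """Return {node: sum of readings in the subtree rooted at node}."""
    sums = {}

    def visit(node):
        total = readings[node]
        for child in children.get(node, []):
            total += visit(child)  # one message per child, carrying one number
        sums[node] = total
        return total

    visit(root)
    return sums


# Hypothetical routing tree: root has children a and b; a has children c and d.
children = {"root": ["a", "b"], "a": ["c", "d"]}
readings = {"root": 5, "a": 5, "b": 7, "c": 8, "d": 5}
sums = partial_sums(children, readings, "root")
print(sums["root"])
```

Each non-root node transmits exactly one number, instead of relaying every descendant's reading to the base station.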

26
Sensor Network Research

Routing: need to aggregate and consolidate data
in a power-efficient way
Ad hoc routing: generate routing tree to base
station
Generally need to merge computation with routing
Robustness: need to combine info from many
sensors to account for individual errors
What aggregation functions make sense?
Languages: how do we express what we want to do
with sensor networks?
Many proposals here

27
A First Try: TinyOS and nesC

TinyOS: a custom OS for sensor nets, written in
nesC
Assumes low-power CPU
Very limited concurrency support: events
(signaled asynchronously) and tasks
(cooperatively scheduled)
Applications built from components
Basically, small objects without any local state
Various features in libraries that may or may not
be included
interface Timer {
  command result_t start(char type, uint32_t interval);
  command result_t stop();
  event result_t fired();
}

28
Drawbacks of this Approach

Need to write very low-level code for sensor net
behavior
Only simple routing policies are built into
TinyOS; some of the routing algorithms may have
to be implemented by hand
Has required many follow-up papers to fill in
some of the missing pieces, e.g., Hood (object
tracking and state sharing), ...

29
An Alternative

Much of the computation being done in sensor
nets looks like what we were discussing with
STREAM
Today's sensor networks look a lot like
databases, pre-Codd
Custom access paths to get to data
One-off custom-code
So why not look at mapping sensor network
computation to SQL?
Not very many joins here, but significant
aggregation
Now the challenge is in picking a distribution
and routing strategy that provides appropriate
guarantees and minimizes power usage

30
TinyDB and TinySQL

Treat the entire sensor network as a universal
relation
Each type of sensor data is a column in a global
table
Tuples are created according to a sample interval
(separated by epochs)
(Implications of this model?)
SELECT nodeid, light, temp
FROM sensors
SAMPLE INTERVAL 1s FOR 10s

31
Storage Points and Windows

Like Aurora, STREAM, can materialize portions of
the data
CREATE STORAGE POINT recentlight SIZE 8
AS (SELECT nodeid, light FROM sensors
SAMPLE INTERVAL 10s)
... and we can use windowed aggregates
SELECT WINAVG(volume, 30s, 5s)
FROM sensors
SAMPLE INTERVAL 1s
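
A rough Python sketch of what a windowed aggregate such as WINAVG(volume, 30s, 5s) computes, assuming the second argument is the window length and the third is how often a result is reported. Window and slide are expressed in numbers of samples (with a 1s sample interval, 30 and 5), and the sketch assumes the first result appears once the window is full.

```python
from collections import deque


def winavg(samples, window, slide):
    """Average of the last `window` samples, reported every `slide` samples."""
    recent = deque(maxlen=window)
    for i, value in enumerate(samples, start=1):
        recent.append(value)
        if i >= window and (i - window) % slide == 0:
            yield sum(recent) / window


# Small illustrative run: window of 4 samples, sliding by 2.
averages = list(winavg(range(1, 11), window=4, slide=2))
print(averages)
```

Only `window` readings are buffered at any time, which matters on a device with a few kilobytes of memory.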

32
Events

ON EVENT bird-detect(loc):
SELECT AVG(light), AVG(temp), event.loc
FROM sensors AS s
WHERE dist(s.loc, event.loc) < 10m
SAMPLE INTERVAL 2s FOR 30s
How do we know about events?
Contrast to UDFs? triggers?

33
Power and TinyDB

Cost-based optimizer tries to find a query plan
to yield lowest overall power consumption
Different sensors have different power usage
Try to order sampling according to selectivity
(sound familiar?)
Assumption of uniform distribution of values over
range
Batching of queries (multi-query optimization)
Convert a series of events into a stream join:
does this resemble anything we've seen recently?
Also need to consider where the query is
processed

34
Dissemination of Queries

Based on semantic routing tree idea
SRT build request is flooded first
Node n gets to choose its parent p, based on
radio range from root
Parent knows its children
Maintains an interval on values for each child
Forwards requests to children as appropriate
Maintenance
If interval changes, child notifies its parent
If a node disappears, parent learns of this when
it fails to get a response to a query
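
A minimal Python sketch of the parent-side bookkeeping described above, assuming the routing tree is built over a single numeric attribute: each parent keeps an interval per child and forwards a range query only to children whose interval overlaps it. Node names and values are hypothetical.

```python
def subtree_interval(own_value, child_intervals):
    """Interval a node reports to its parent: covers itself and all of its children."""
    lows = [own_value] + [low for low, _ in child_intervals.values()]
    highs = [own_value] + [high for _, high in child_intervals.values()]
    return min(lows), max(highs)


def children_to_forward(child_intervals, query_low, query_high):
    """Children whose interval overlaps the query range [query_low, query_high]."""
    return [
        child
        for child, (low, high) in child_intervals.items()
        if low <= query_high and query_low <= high
    ]


child_intervals = {"n2": (1, 4), "n3": (6, 9), "n4": (10, 12)}
targets = children_to_forward(child_intervals, 5, 10)
reported = subtree_interval(7, child_intervals)
print(targets, reported)
```

Subtrees that cannot contribute, such as n2 here, never receive the query and can stay asleep.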

35
Query Processing

Mostly consists of sleeping!
Wake briefly, sample, and compute operators, then
route onwards
Nodes are time synchronized
Awake time is proportional to the neighborhood
size (why?)
Computation is based on partial state records
Basically, each operation is a partial aggregate
value, plus the reading from the sensor
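
As an illustration of partial state records, here is a Python sketch for AVG, which cannot be computed by averaging averages. The record is a (sum, count) pair; the three function names are illustrative.

```python
def init_record(reading):
    """Partial state record for a single sensor reading."""
    return (reading, 1)


def merge_records(a, b):
    """Combine two partial state records into one."""
    return (a[0] + b[0], a[1] + b[1])


def evaluate(record):
    """Final answer, computed only at the root."""
    total, count = record
    return total / count


# A node with reading 5 merges its own record with two records from its children.
record = init_record(5)
for child_record in [(15, 3), (8, 1)]:
    record = merge_records(record, child_record)
print(record, evaluate(record))
```

The record stays the same size no matter how many readings it summarizes, so every node sends a fixed-size message per epoch.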

36
Load Shedding & Approximation

What if the router queue is overflowing?
Need to prioritize tuples, drop the ones we don't
want
FIFO vs. averaging the head of the queue vs.
delta-proportional weighting
Later work considers the question of using
approximation for more power efficiency
If sensors in one region change less frequently,
can sample less frequently (or fewer times) in
that region
If sensors change less frequently, can sample
readings that take less power but are correlated
(e.g., battery voltage vs. temperature)
Thursday, 4:30PM, DB Group Meeting: I'll discuss
some of this work
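
For the overflow policies listed on this slide, a Python sketch of the "averaging the head of the queue" option: when the queue is full, the two oldest results are replaced by their average to make room, rather than dropping a tuple outright. It assumes numeric results and a capacity of at least 2; the function name is illustrative.

```python
from collections import deque


def enqueue_with_head_averaging(queue, value, capacity):
    """Append value; if the queue is full, merge the two oldest entries first."""
    if capacity < 2:
        raise ValueError("capacity must be at least 2 to average two entries")
    if len(queue) >= capacity:
        first = queue.popleft()
        second = queue.popleft()
        queue.appendleft((first + second) / 2.0)  # one slot freed, information kept
    queue.append(value)


queue = deque()
for reading in [10, 20, 30, 40]:
    enqueue_with_head_averaging(queue, reading, capacity=3)
print(list(queue))
```

A plain FIFO policy would instead lose a reading entirely, and a delta-based policy would need a score per tuple to decide which one to evict.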

37
The Future of Sensor Nets?

TinySQL is a nice way of formulating the problem
of query processing with motes
View the sensor net as a universal relation
Can define views to abstract some concepts, e.g.,
an object being monitored
But
What about when we have multiple instances of an
object to be tracked? Correlations between
objects?
What if we have more complex data? More CPU
power?
What if we want to reason about accuracy?