1
Stream and Sensor Data Management
  • Zachary G. Ives
  • University of Pennsylvania
  • CIS 650 Implementing Data Management Systems
  • November 17, 2008

2
Converting between Streams and Relations
  • Stream-to-relation operators
  • Sliding window: tuple-based (last N rows) or
    time-based (within a time range)
  • Partitioned sliding window: groups by keys, then
    applies a sliding window within each partition
  • Is this necessary or minimal?
  • Relation-to-stream operators
  • Istream: streams the insertions to a relation
    (see the sketch below)
  • Dstream: streams the deletions from a relation
  • Rstream: streams the full set of tuples in the
    relation
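  • A minimal sketch combining a window with Istream
    (CQL-style syntax as in the STREAM papers; stream S
    and attribute A are stand-ins):

    -- Emit each tuple as it enters a 1-minute sliding
    -- window over S, keeping only tuples with A > 10
    Select Istream(*)
    From S [Range 1 Minute]
    Where S.A > 10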

3
Some Examples
  • Select *
    From S1 [Rows 1000], S2 [Range 2 Minutes]
    Where S1.A = S2.A And S1.A > 10
  • Select Rstream(S.A, R.B)
    From S [Now], R
    Where S.A = R.A

4
Building a Stream System
  • Basic data item is the element
  • <op, time, tuple> where op ∈ {+, -} (formalized
    below)
  • Query plans need a few new (?) items
  • Queues
  • Used for hooking together operators, esp. over
    windows
  • (Assumption is that pipelining is generally not
    possible, and we may need to drop some tuples
    from the queue)
  • Synopses
  • The intermediate state an operator needs to carry
    around
  • Note that this is usually bounded by windows
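  • A worked formalization (assuming the element
    encoding above; notation ours): the relation state
    at time t is determined by the signed elements seen
    so far,

    R(t) = \{\, r \mid \langle +, t', r \rangle \text{ arrived with } t' \le t \text{ and no } \langle -, t'', r \rangle \text{ with } t' < t'' \le t \,\}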

5
Example Query Plan
What's different here?
6
Some Tricks for Performance
  • Sharing synopses across multiple operators
  • In a few cases, more than one operator may join
    with the same synopsis
  • Can exploit punctuations or k-constraints
  • Analogous to interesting orders
  • Referential integrity k-constraint: bound of k
    between the arrival of a "many" element and its
    corresponding "one" element
  • Ordered-arrival k-constraint: need a window of at
    most k to sort (formalized below)
  • Clustered-arrival k-constraint: bound on distance
    between items with the same grouping attributes
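  • For instance, the ordered-arrival case can be
    stated as (formalization ours): for elements e_i
    with timestamps \tau(e_i),

    j > i + k \implies \tau(e_j) \ge \tau(e_i)

    so a buffer of k elements suffices to emit the
    stream in timestamp order.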

7
Query Processing: Chain Scheduling
  • Similar in many ways to eddies
  • May decide to apply operators as follows
  • Assume we know how many tuples can be processed
    in a time unit
  • Cluster groups of operators into chains that
    maximize reduction in queue size per unit time
  • Greedily forward tuples into the most selective
    chain
  • Within a chain, process in FIFO order
  • They also do a form of join reordering
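  • A sketch of the greedy criterion (notation ours:
    t_k is operator k's per-tuple processing time, s_k
    the cumulative selectivity after operator k): the
    priority of a chain O_i ... O_j is its slope on the
    progress chart,

    \text{priority}(O_i \ldots O_j) = \frac{s_{i-1} - s_j}{\sum_{k=i}^{j} t_k}

    i.e., favor the chain that sheds the most queue
    volume per unit of processing time.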

8
Scratching the Surface: Approximation
  • They point out two areas where we might need to
    approximate output
  • CPU is limited, and we need to drop some stream
    elements according to some probabilistic metric
  • Collect statistics via a profiler
  • Use Hoeffding's inequality to derive a sampling
    rate that maintains a confidence interval (worked
    below)
  • May need to do similar things if memory usage is
    a constraint
  • Are there other options? When might they be
    useful?
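  • A worked sketch (standard Hoeffding bound; the
    error bound \epsilon, confidence parameter \delta,
    and value range [a, b] are illustration parameters,
    not from the slide): for n independent samples with
    values in [a, b],

    P(|\bar{X} - E[\bar{X}]| \ge \epsilon) \le 2 e^{-2 n \epsilon^2 / (b-a)^2}

    so keeping error below \epsilon with confidence
    1 - \delta requires only

    n \ge \frac{(b-a)^2}{2\epsilon^2} \ln \frac{2}{\delta}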

9
STREAM in General
  • Logical semantics first
  • Starts with a basic data model: streams as
    timestamped sets
  • Develops a language and semantics
  • Heavily based on SQL
  • Proposes a relatively straightforward
    implementation
  • Interesting ideas like k-constraints
  • Interesting approaches like chain scheduling
  • No real consideration of distributed processing

10
Aurora
  • Implementation first: mix and match operations
    from past literature
  • Basic philosophy: most of the ideas in streams
    existed in previous research
  • Sliding windows, load shedding, approximation, ...
  • So let's borrow those ideas and focus on how to
    build a real system with them!
  • Emphasis is on building a scalable, robust system
  • Distributed implementation: Medusa

11
Queries in Aurora
  • Oddly, no declarative query language in the
    initial version! (Added for the commercial product)
  • Queries are workflows of physical query operators
    (SQuAl)
  • Many operators resemble relational algebra ops

12
Example Query
13
Some Interesting Aspects
  • A relatively simple adaptive query optimizer
  • Can push filtering and mapping into many
    operators
  • Can reorder some operators (e.g., joins, unions)
  • Need built-in error handling
  • If a data source fails to respond in a certain
    amount of time, create a special alarm tuple
  • This propagates through the query plan
  • Incorporates built-in load shedding and real-time
    scheduling to support QoS
  • Have a notion of combining a query over
    historical data with data from a stream
  • Switches from a pull-based mode (reading from
    disk) to a push-based mode (reading from network)

14
The Medusa Processor
  • Distributed coordinator between many Aurora nodes
  • Scalability through federation and distribution
  • Fail-over
  • Load balancing

15
Main Components
  • Lookup
  • Distributed catalog: schemas, where to find
    streams, where to find queries
  • Brain
  • Query setup; load monitoring via I/O queues and
    stats
  • Runs the load distribution and balancing scheme
  • Very reminiscent of Mariposa!

16
Load Balancing
  • Migration: an operator can be moved from one
    node to another
  • Initial implementation didn't support moving of
    state
  • The state is simply dropped, and operator
    processing resumes
  • Implications on semantics?
  • Plans to support state migration
  • Agoric system model to create incentives
  • Clients pay nodes for processing queries
  • Nodes pay each other to handle load; pairwise
    contracts negotiated offline
  • Bounded-price mechanism: price for migration of
    load, spec for what a node will take on
  • Does this address the weaknesses of the Mariposa
    model?

17
Some Applications They Tried
  • Financial services (stock ticker)
  • Main issue is not volume, but problems with feeds
  • Two-level alarm system, where higher-level alarm
    helps diagnose problems
  • Shared computation among queries
  • User-defined aggregation and mapping
  • This is the main application for the commercial
    version (StreamBase)
  • Linear Road (sensor monitoring)
  • Traffic sensors on a toll road; the toll changes
    depending on how many cars are on the road
  • Combination of historical and continuous queries
  • Environmental monitoring
  • Sliding-window calculations

18
Lessons Learned
  • Historical data is important, not just stream
    data
  • (Summaries?)
  • Sometimes need synchronization for consistency
  • ACID for streams?
  • Streams can be out of order, bursty
  • Stream cleaning?
  • Adaptors (and also XML) are important
  • But we already knew that!
  • Performance is critical
  • They spent a great deal of time using
    microbenchmarks and optimizing

19
Sensors and Sensor Networks
  • Trends
  • Cameras and other sensors are very cheap
  • Microprocessors and microcontrollers can be very
    small
  • Wireless networks are easy to build
  • Why not instrument the physical world with tiny
    wireless sensors and networks?
  • Vision: smart dust
  • Berkeley motes, RF tags, cameras, camera phones,
    temperature sensors, etc.
  • Today we already see pieces of this
  • Penn buildings and SCADA system
  • 250 surveillance cameras throughout campus

20
What Can We Do with Sensor Networks?
  • Many passive monitoring applications
  • Environmental monitoring
  • temperature in different parts of a building
  • air quality
  • etc.
  • Law enforcement
  • Video feeds and anomalous behavior
  • Research studies
  • Study ocean temperature, currents
  • Monitor status of eggs in endangered birds' nests
  • ZebraNet
  • Fun
  • Record sporting events or performances from every
    angle (video + audio)
  • Ultimately, build reactive systems as well:
    robotics, Mars landers, ...

21
Some Challenges
  • Highly distributed!
  • May have thousands of nodes
  • Each node knows about only a few nodes within
    proximity; may not know its own location
  • Nodes' transmissions may interfere with one
    another
  • Power and resource constraints
  • Most of these devices are wireless, tiny,
    battery-powered
  • Can only transmit data every so often
  • Limited CPU and memory: can't run sophisticated
    code
  • High rate of failure
  • Collisions, battery failures, sensor calibration, ...

22
The Target Platform
  • Most sensor network research argues for the
    Berkeley mote as a target platform
  • Mote: 4 MHz, 8-bit CPU
  • 128 KB RAM
  • 512 KB Flash memory
  • 40 kbps radio, 100 ft range
  • Sensors
  • Light, temperature, microphone
  • Accelerometer
  • Magnetometer

http://robotics.eecs.berkeley.edu/~pister/SmartDust/
23
Sensor Net Data Acquisition
  • First: build routing tree
  • Second: begin sensing and aggregation

24
Sensor Net Data Acquisition (Sum)
(figure: routing tree; each node holds its own sensor reading)
  • First: build routing tree
  • Second: begin sensing and aggregation (e.g., sum)

25
Sensor Net Data Acquisition (Sum)
(figure: partial sums propagating up the routing tree; the total at the root is 85)
  • First: build routing tree
  • Second: begin sensing and aggregation (e.g., sum),
    as formalized below
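  • In equation form (notation ours): each node n
    combines its own reading r(n) with its children's
    partial sums,

    S(n) = r(n) + \sum_{c \in \mathrm{children}(n)} S(c)

    and the value at the root (85 above) is the
    network-wide sum.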

26
Sensor Network Research
  • Routing: need to aggregate and consolidate data
    in a power-efficient way
  • Ad hoc routing: generate a routing tree to a base
    station
  • Generally need to merge computation with routing
  • Robustness: need to combine info from many
    sensors to account for individual errors
  • What aggregation functions make sense?
  • Languages: how do we express what we want to do
    with sensor networks?
  • Many proposals here

27
A First Try: TinyOS and nesC
  • TinyOS: a custom OS for sensor nets, written in
    nesC
  • Assumes low-power CPU
  • Very limited concurrency support: events
    (signaled asynchronously) and tasks
    (cooperatively scheduled)
  • Applications built from components
  • Basically, small objects without any local state
  • Various features in libraries that may or may not
    be included
  • interface Timer {
      command result_t start(char type, uint32_t interval);
      command result_t stop();
      event result_t fired();
    }

28
Drawbacks of this Approach
  • Need to write very low-level code for sensor net
    behavior
  • Only simple routing policies are built into
    TinyOS; some of the routing algorithms may have
    to be implemented by hand
  • Has required many follow-up papers to fill in
    some of the missing pieces, e.g., Hood (object
    tracking and state sharing), ...

29
An Alternative
  • Much of the computation being done in sensor
    nets looks like what we were discussing with
    STREAM
  • Today's sensor networks look a lot like
    databases, pre-Codd
  • Custom access paths to get to data
  • One-off custom code
  • So why not look at mapping sensor network
    computation to SQL?
  • Not very many joins here, but significant
    aggregation
  • Now the challenge is in picking a distribution
    and routing strategy that provides appropriate
    guarantees and minimizes power usage

30
TinyDB and TinySQL
  • Treat the entire sensor network as a universal
    relation
  • Each type of sensor data is a column in a global
    table
  • Tuples are created according to a sample interval
    (separated by epochs)
  • (Implications of this model?)
  • SELECT nodeid, light, temp
    FROM sensors
    SAMPLE INTERVAL 1s FOR 10s

31
Storage Points and Windows
  • Like Aurora, STREAM, can materialize portions of
    the data
  • CREATE STORAGE POINT recentlight SIZE 8
    AS (SELECT nodeid, light FROM sensors
    SAMPLE INTERVAL 10s)
  • and we can use windowed aggregates
  • SELECT WINAVG(volume, 30s, 5s)
    FROM sensors
    SAMPLE INTERVAL 1s
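  • A sketch of combining the storage point with live
    readings (TinySQL-style; treat the exact syntax as
    an assumption rather than a guaranteed TinyDB
    feature):

    -- For each sample, count recent stored readings
    -- brighter than the node's current reading
    SELECT COUNT(*)
    FROM sensors AS s, recentlight AS rl
    WHERE rl.nodeid = s.nodeid
      AND s.light < rl.light
    SAMPLE INTERVAL 10s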

32
Events
  • ON EVENT bird-detect(loc)
    SELECT AVG(light), AVG(temp), event.loc
    FROM sensors AS s
    WHERE dist(s.loc, event.loc) < 10m
    SAMPLE INTERVAL 2s FOR 30s
  • How do we know about events?
  • Contrast to UDFs? triggers?

33
Power and TinyDB
  • Cost-based optimizer tries to find a query plan
    to yield lowest overall power consumption
  • Different sensors have different power usage
  • Try to order sampling according to selectivity
    (sound familiar? see the worked sketch below)
  • Assumption of uniform distribution of values over
    range
  • Batching of queries (multi-query optimization)
  • Convert a series of events into a stream join;
    does this resemble anything we've seen recently?
  • Also need to consider where the query is
    processed
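  • A worked sketch of the ordering idea (two
    hypothetical sensors; C_i is the acquisition cost
    of sensor i and \sigma_1 the fraction of tuples
    passing the predicate on sensor 1): sampling
    sensor 1 first gives expected per-tuple cost

    E[\text{cost}] = C_1 + \sigma_1 \cdot C_2

    so, under the uniformity assumption above, the
    optimizer samples the cheap, highly selective
    sensor first.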

34
Dissemination of Queries
  • Based on the semantic routing tree (SRT) idea
  • SRT build request is flooded first
  • Node n gets to choose its parent p, based on
    radio range from root
  • Parent knows its children
  • Maintains an interval on values for each child
  • Forwards requests to children as appropriate
    (worked example below)
  • Maintenance
  • If interval changes, child notifies its parent
  • If a node disappears, parent learns of this when
    it fails to get a response to a query
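  • A worked example of the forwarding test (interval
    values hypothetical): if a child's reported
    interval for light is [10, 50], a query with
    predicate light > 100 is never forwarded to that
    subtree; in general,

    \text{forward}(q, c) \iff \text{interval}(c) \cap \text{range}(q) \ne \emptyset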

35
Query Processing
  • Mostly consists of sleeping!
  • Wake briefly, sample, and compute operators, then
    route onwards
  • Nodes are time synchronized
  • Awake time is proportional to the neighborhood
    size (why?)
  • Computation is based on partial state records
  • Basically, each operation is a partial aggregate
    value, plus the reading from the sensor
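  • A worked example for AVG (standard decomposition;
    notation ours): carry the partial state as a
    \langle sum, count \rangle pair,

    \text{merge}(\langle s_1, c_1 \rangle, \langle s_2, c_2 \rangle) = \langle s_1 + s_2,\; c_1 + c_2 \rangle

    and compute AVG = s / c only at the root.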

36
Load Shedding and Approximation
  • What if the router queue is overflowing?
  • Need to prioritize tuples, drop the ones we don't
    want
  • FIFO vs. averaging the head of the queue vs.
    delta-proportional weighting
  • Later work considers the question of using
    approximation for more power efficiency
  • If sensors in one region change less frequently,
    can sample less frequently (or fewer times) in
    that region
  • If sensors change less frequently, can sample
    readings that take less power but are correlated
    (e.g., battery voltage vs. temperature)
  • Thursday, 4:30 PM, DB Group Meeting: I'll discuss
    some of this work

37
The Future of Sensor Nets?
  • TinySQL is a nice way of formulating the problem
    of query processing with motes
  • View the sensor net as a universal relation
  • Can define views to abstract some concepts, e.g.,
    an object being monitored
  • But ...
  • What about when we have multiple instances of an
    object to be tracked? Correlations between
    objects?
  • What if we have more complex data? More CPU
    power?
  • What if we want to reason about accuracy?