Title: Chapter 10: Stream-based Data Management
1 Chapter 10: Stream-based Data Management
- Title: Retrospective on Aurora
- Authors: Hari Balakrishnan et al.
2 Design, Implementation, and Evaluation of the
Linear Road Benchmark on the Stream Processing
Core
- Problem
- Problem Statement
- Why is this problem important?
- Why is this problem hard?
- Approaches
- Approach description, key concepts
- Contributions (novelty, improved)
- Assumptions
3 Problem Statement
- Given
- Stream data
- Experience developing five stream-based
applications using the Aurora stream
processing engine
- Find
- Key requirements of streaming applications
- Objectives
- Reflect on the design of Aurora based on this
experience
- Eliminate the limitations and address new
challenges in a follow-on project, Borealis
- Constraints
- Data streams arrive in no particular order.
- Data streams arrive without any temporal
regularity.
4 Why is this problem important?
- Stream-processing applications
- Financial services: stock ticker
- Transportation: congestion pricing, dynamic
tolls
- Sensor networks: environmental monitoring
- Defense: battalion monitoring
5 Why is this problem hard?
- High update rate
- Time series
- Streaming applications entail time series.
- Time series operations are not well supported by
current DBMSs.
- Real-time constraints
- Outbound processing, where data are stored before
being processed, cannot meet real-time
latency requirements.
- Stream processing engines (SPEs) must adopt inbound
processing, where query processing is performed
directly on incoming messages.
- Spikes in message load
- Incoming traffic is bursty.
- Quality of Service (QoS) requirements
6 Novel Contributions
- Comparison with SQL-centric related work
- Data Flow Network (DFN) centric
- Developer: composes the DFN using a graphical user
interface
- Optimizer: rearranges the DFN, e.g., swaps boxes
- Compiler: translates the DFN to an intermediate
representation
- Run-time: schedules tasks based on QoS
requirements
- Other contributions: lessons learnt
- Identify characteristics of streaming
applications (from 5 case studies)
- Identify core performance tuning ideas
7 Aurora Architecture
- Aurora is based on a dataflow-style "boxes and
arrows" paradigm, unlike other systems that use a
SQL-style query interface (i.e., issuing queries back
and forth adds system overhead and latency).
- Can be spread across any number of machines for
scalability and availability.
[Figure: Aurora dataflow network showing Input, Operator, and Output; panels titled "Aurora Operators" and "Aurora GUI"]
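The boxes-and-arrows paradigm on this slide can be illustrated with a minimal sketch. The class names (`Box`, `FilterBox`, `MapBox`, `OutputBox`) and the `connect`/`push` methods are hypothetical and are not Aurora's actual API; the sketch only shows tuples being pushed through connected operators as they arrive.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]  # a stream tuple, modeled here as a plain dict


class Box:
    """Hypothetical operator box; `downstream` holds its outgoing arrows."""

    def __init__(self) -> None:
        self.downstream: List["Box"] = []

    def connect(self, other: "Box") -> "Box":
        # Draw an arrow from this box to `other`; return `other` for chaining.
        self.downstream.append(other)
        return other

    def emit(self, item: Record) -> None:
        for box in self.downstream:
            box.push(item)

    def push(self, item: Record) -> None:
        raise NotImplementedError


class FilterBox(Box):
    """Forward only the tuples that satisfy the predicate."""

    def __init__(self, predicate: Callable[[Record], bool]) -> None:
        super().__init__()
        self.predicate = predicate

    def push(self, item: Record) -> None:
        if self.predicate(item):
            self.emit(item)


class MapBox(Box):
    """Apply a user-defined function to every tuple."""

    def __init__(self, fn: Callable[[Record], Record]) -> None:
        super().__init__()
        self.fn = fn

    def push(self, item: Record) -> None:
        self.emit(self.fn(item))


class OutputBox(Box):
    """Collect results at the edge of the network."""

    def __init__(self) -> None:
        super().__init__()
        self.results: List[Record] = []

    def push(self, item: Record) -> None:
        self.results.append(item)


# Network: input -> filter -> map -> output
source = FilterBox(lambda t: t["price"] > 100.0)
sink = OutputBox()
source.connect(MapBox(lambda t: {**t, "flag": "high"})).connect(sink)

for quote in [{"symbol": "A", "price": 99.0}, {"symbol": "B", "price": 101.5}]:
    source.push(quote)  # each tuple is processed on arrival, without being stored first
```

Because each box only knows its outgoing arrows, an optimizer could in principle reorder or swap boxes without changing how tuples are pushed.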
8 Aurora Case Study 1: Financial Services
- An application detects feed problems and triggers
a switch between feeds in real time.
- Hierarchical alarm
- A low alarm is triggered when an update is delayed
beyond a threshold (e.g., 5 sec).
- A high alarm is triggered when low alarms
accumulate beyond a threshold (e.g., 100 times).
- Boxes in the red circle separate the alarms from
both Reuters and Comstock into alarms from
NYSE and alarms from NASDAQ.
- Filter Merging techniques
- This case study illustrates the ability to detect
stream imperfections and extend functionality
using user-defined Map functions.
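The hierarchical alarm described on this slide can be sketched as a small stateful operator. The class and parameter names are hypothetical; the thresholds default to the slide's examples (5 seconds, 100 low alarms), and the sketch assumes a delay is only noticed when the next update arrives.

```python
from typing import List, Optional


class HierarchicalAlarm:
    """Low alarm on a late update; high alarm once low alarms accumulate."""

    def __init__(self, max_gap_sec: float = 5.0, low_alarm_limit: int = 100) -> None:
        self.max_gap_sec = max_gap_sec
        self.low_alarm_limit = low_alarm_limit
        self.last_seen: Optional[float] = None
        self.low_alarms = 0

    def on_update(self, timestamp: float) -> List[str]:
        """Process one feed update and return the alarms it raises."""
        alarms: List[str] = []
        late = (
            self.last_seen is not None
            and timestamp - self.last_seen > self.max_gap_sec
        )
        if late:
            self.low_alarms += 1
            alarms.append("LOW")
            # Escalate once the count of low alarms exceeds the limit.
            if self.low_alarms > self.low_alarm_limit:
                alarms.append("HIGH")
        self.last_seen = timestamp
        return alarms


# Small limit so the escalation is visible in a short run.
alarm = HierarchicalAlarm(max_gap_sec=5.0, low_alarm_limit=2)
history = [alarm.on_update(ts) for ts in (0.0, 1.0, 10.0, 20.0, 30.0)]
```

A feed that goes completely silent would never trigger this version, so a production design would presumably also need a timer-driven check.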
9 Aurora Case Study 2: Linear Road Benchmark
- Linear Road is a benchmark for stream processing
engines.
- Simulates an urban highway system that uses
variable tolling (i.e., congestion-based
pricing).
- Linear Road requires support for
- Two continuous queries
- Calculate a segment toll every time a vehicle
enters the segment.
- Detect and report accidents and adjust tolls
accordingly.
- Three historical queries
- Request an account balance
- A day's total expenditure for a given vehicle
- Prediction of travel time between two segments
using historical data
- Each of these queries must be answered with a
specified accuracy and within a specified
response time.
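Two of the historical queries listed above (account balance and a day's expenditure) can be sketched against a simple in-memory ledger. The `TollLedger` class, its method names, and the assumption that a balance is the sum of all tolls charged are illustrative only; the benchmark's actual toll formula and data layout are not reproduced here.

```python
from collections import defaultdict
from typing import Dict, Tuple


class TollLedger:
    """Hypothetical store of assessed tolls, keyed by (vehicle, day)."""

    def __init__(self) -> None:
        self._by_vehicle_day: Dict[Tuple[int, int], float] = defaultdict(float)

    def charge(self, vehicle_id: int, day: int, toll: float) -> None:
        # Would be called by the continuous toll query when a vehicle enters a segment.
        self._by_vehicle_day[(vehicle_id, day)] += toll

    def account_balance(self, vehicle_id: int) -> float:
        """Total of all tolls charged to the vehicle (assumed meaning of 'balance')."""
        return sum(
            toll
            for (vid, _day), toll in self._by_vehicle_day.items()
            if vid == vehicle_id
        )

    def daily_expenditure(self, vehicle_id: int, day: int) -> float:
        """Total tolls charged to the vehicle on one day."""
        return self._by_vehicle_day.get((vehicle_id, day), 0.0)


ledger = TollLedger()
ledger.charge(vehicle_id=7, day=1, toll=2.0)
ledger.charge(vehicle_id=7, day=1, toll=3.0)
ledger.charge(vehicle_id=7, day=2, toll=1.5)
ledger.charge(vehicle_id=8, day=1, toll=4.0)
```

Keeping the ledger updated incrementally by the continuous query is one way to answer historical queries within a bounded response time, since no scan of raw position reports is needed.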
10 Aurora Case Study 3: Battalion Monitoring
- Aircraft gather data and send them to monitoring
stations.
- Enemy units cross a given line, signaling an
attack.
- The limited resource is the bandwidth between
aircraft and ground. When an attack is initiated,
selective dropping of data is allowed to serve
important classes.
- The authors could test their load-shedding
techniques.
- Insert random drop boxes to discard a fraction of
their input tuples.
- Insert semantic, predicate-based drop filters.
- Observations
- The semantic load-shedding techniques achieve the
least loss in value utility.
- As load increases, the two techniques show similar
performance.
- At high loads, all algorithms converge to the same
loss levels.
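The two load-shedding styles on this slide can be contrasted in a short sketch. The function names and the `priority` field are hypothetical; the sketch only shows the difference between dropping a random fraction and dropping by predicate, not how Aurora decides where to place drop boxes.

```python
import random
from typing import Any, Callable, Dict, Iterable, Iterator

Record = Dict[str, Any]


def random_drop(
    stream: Iterable[Record], drop_fraction: float, rng: random.Random
) -> Iterator[Record]:
    """Discard roughly `drop_fraction` of the tuples, regardless of content."""
    for item in stream:
        if rng.random() >= drop_fraction:
            yield item


def semantic_drop(
    stream: Iterable[Record], keep: Callable[[Record], bool]
) -> Iterator[Record]:
    """Discard tuples that fail `keep`, so important classes survive."""
    for item in stream:
        if keep(item):
            yield item


# Illustrative data: every fourth reading belongs to the important class.
readings = [
    {"unit": i, "priority": "high" if i % 4 == 0 else "low"} for i in range(100)
]
kept_random = list(random_drop(readings, 0.5, random.Random(0)))
kept_semantic = list(semantic_drop(readings, lambda t: t["priority"] == "high"))
```

The random variant sheds load without looking at tuple content, while the semantic variant spends its bandwidth budget on the tuples judged most valuable.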
11 Aurora Case Study 4: Environmental Monitoring
- Monitoring toxins in water.
- Stream data is fish behavior (e.g., breathing
rate) and water quality (e.g., temperature).
- When the fish behave abnormally, an alarm is
sounded.
- The water data are analyzed over 1-, 2-, and 4-hour
sliding windows.
- Ease of developing stream applications
- Aurora proved very convenient for sliding window
calculation.
- Aurora's GUI proved invaluable.
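A sliding-window calculation of the kind mentioned above can be sketched as follows. The `SlidingWindowMean` class is hypothetical, the choice of a mean as the aggregate is an assumption, and the sketch assumes timestamps arrive in non-decreasing order, which real streams may not guarantee.

```python
from collections import deque
from typing import Deque, Tuple


class SlidingWindowMean:
    """Mean of the readings whose timestamps fall in the last `width_sec` seconds."""

    def __init__(self, width_sec: float) -> None:
        self.width_sec = width_sec
        self._items: Deque[Tuple[float, float]] = deque()  # (timestamp, value)
        self._total = 0.0

    def add(self, timestamp: float, value: float) -> float:
        """Insert a reading (non-decreasing timestamps assumed); return the mean."""
        self._items.append((timestamp, value))
        self._total += value
        # Evict readings that have slid out of the window (timestamp - width, timestamp].
        while self._items[0][0] <= timestamp - self.width_sec:
            _, old_value = self._items.popleft()
            self._total -= old_value
        return self._total / len(self._items)


# One window per duration named on the slide: 1, 2, and 4 hours.
windows = {hours: SlidingWindowMean(hours * 3600.0) for hours in (1, 2, 4)}
for ts, temperature in [(0.0, 10.0), (1800.0, 20.0), (3600.0, 30.0)]:
    means = {hours: w.add(ts, temperature) for hours, w in windows.items()}
```

Keeping a running total makes each update cheap, since only the readings that leave the window are touched.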
12 Aurora Case Study 5: Medusa
- A distributed stream-processing system using
Aurora.
- Takes Aurora queries and distributes them across
multiple nodes.
- Offers several benefits
- Incremental scalability over multiple nodes.
- High availability by mutual monitoring between
nodes.
- Composition of stream feeds from different
participants.
- Handling load spikes through a federated system.
13 Lessons Learnt: Application Characteristics
- Common queries
- Historical data using open windows
- Last 10 weeks' worth of toll data for each driver
- Aggregate: How much has a driver spent on tolls
over the past 10 weeks?
- Tables of historical data with arbitrary update
patterns
- Synchronization
- Stream applications rely on shared data and
computation.
- WaitFor (P: Predicate, T: Timeout)
- Unpredictable stream behavior
- Financial services application detects the arrival
rate of a stream.
- Military application adjusts resources during
times of stress.
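The slide gives only the signature WaitFor (P: Predicate, T: Timeout). One plausible reading, sketched below, is a two-input operator that buffers tuples from one stream until a tuple on the other stream satisfies P with them, discarding buffered tuples once T has elapsed. The class, the method names `on_a`/`on_b`, and these semantics are assumptions made for illustration.

```python
from typing import Any, Callable, List, Tuple


class WaitFor:
    """Buffer tuples from input A until a tuple on input B satisfies P with them."""

    def __init__(self, predicate: Callable[[Any, Any], bool], timeout: float) -> None:
        self.predicate = predicate
        self.timeout = timeout
        self._buffer: List[Tuple[float, Any]] = []  # (arrival time, tuple)

    def _expire(self, now: float) -> None:
        # Discard buffered tuples that have waited longer than the timeout.
        self._buffer = [(t, a) for t, a in self._buffer if now - t <= self.timeout]

    def on_a(self, now: float, item: Any) -> None:
        """Buffer a tuple arriving on input A."""
        self._expire(now)
        self._buffer.append((now, item))

    def on_b(self, now: float, item: Any) -> List[Any]:
        """Release and return every buffered tuple that matches `item`."""
        self._expire(now)
        released = [a for _, a in self._buffer if self.predicate(a, item)]
        self._buffer = [
            (t, a) for t, a in self._buffer if not self.predicate(a, item)
        ]
        return released


# Synchronize two streams on a shared id, waiting at most 10 time units.
wait_for = WaitFor(lambda a, b: a["id"] == b["id"], timeout=10.0)
wait_for.on_a(0.0, {"id": 1})
wait_for.on_a(1.0, {"id": 2})
matched = wait_for.on_b(5.0, {"id": 2})
too_late = wait_for.on_b(20.0, {"id": 1})
```

The timeout bounds how much state the operator holds, which matters when one of the two streams stalls.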
14 Application Characteristics
- XML and other feed formats
- E.g., stock quote data in XML format
- In such cases, protocol stability and portability are
more important than a minor performance loss.
- Programmatic interfaces and globally accessible
catalogs
- Scripting Aurora networks
- Metadata
15 Lessons Learnt: Performance Tuning
- Requirements
- Main memory implementation
- Data movement across DFN elements
- Scheduling of DFN elements
- Performance decisions
- Memory copying: memcpy() implementations
- Scheduler
- Reduce scheduler overheads by aggressive
profiling
- Tight loops
- Keep unnecessary house-keeping out of tight loops
- Data structures
- Optimize data structures used to implement DFN
elements
16 Future Plans: Borealis
- Dynamic revision of query results
- Intelligently corrects query results that have
already been emitted, using the corrected data that
arrive later.
- Dynamic query modification
- E.g., traders wish to be alerted of interesting
events, where the definition of "interesting" varies.
- Distributed optimization
- Server-heavy and sensor-heavy optimization problems
are emerging.
- More flexible optimization to handle a very large
number of devices
- Implementation plans
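The idea of dynamically revising results that were already emitted can be sketched with a running sum that re-emits a corrected total when an earlier input is revised. The `RevisableSum` class and its record layout are hypothetical and say nothing about how Borealis itself represents revisions.

```python
from typing import Dict, List, Tuple


class RevisableSum:
    """Running sum per key that emits a revised total when an input is corrected."""

    def __init__(self) -> None:
        self._values: Dict[Tuple[str, int], float] = {}  # (key, seq) -> value
        self._totals: Dict[str, float] = {}
        self.emitted: List[Tuple[str, str, float]] = []  # (kind, key, total)

    def insert(self, key: str, seq: int, value: float) -> None:
        """Handle a normal input tuple and emit the updated total."""
        self._values[(key, seq)] = value
        self._totals[key] = self._totals.get(key, 0.0) + value
        self.emitted.append(("result", key, self._totals[key]))

    def revise(self, key: str, seq: int, corrected: float) -> None:
        """Handle a late correction to a previously seen tuple."""
        old = self._values[(key, seq)]
        self._values[(key, seq)] = corrected
        # Apply only the delta, then emit a revision that supersedes earlier results.
        self._totals[key] += corrected - old
        self.emitted.append(("revision", key, self._totals[key]))


totals = RevisableSum()
totals.insert("feed-a", seq=1, value=10.0)
totals.insert("feed-a", seq=2, value=5.0)
totals.revise("feed-a", seq=1, corrected=12.0)
```

Retaining the original inputs is what makes the correction possible, so a design like this trades memory for the ability to revise.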
17 Summary
- Paper's focus
- Identify the requirements of stream applications
from the experience of designing and
implementing the Aurora stream-processing engine
- Ideas
- Describe five applications and their
implementation in detail.
- Reflect on the design of Aurora based on the
experience.
- Discuss future ideas for the follow-on project.
- Contributions
- Identify key requirements of streaming
applications
- Analytical validation
- Case study
18 Assumptions, Rewrite today
- Assumptions
- Archiving is not necessary!
- Performance is more important than a declarative
query language
- Rewrite today
- Compare performance with competition, e.g., STREAM
- Allow archiving along with stream processing
- Consider other applications
- RFID, cell phone applications
- Include current status of Borealis implementation.