Title: Telegraph: Continuously Adaptive Dataflow
1 Telegraph: Continuously Adaptive Dataflow
2 Scenarios
- Ubiquitous computing: more than clients
- sensors and their data feeds are key
- smart dust, biomedical (MEMS sensors)
- each consumer good records (mis)use
- disposable computing
- video from surveillance cameras, broadcasts, etc.
- Global Data Federation
- all the data is online: what are we waiting for?
- The plumbing is coming
- XML/HTTP, etc. give lowest-common-denominator (LCD) communication
- but how do you flow, summarize, query and analyze data robustly over many sources in the wide area?
3 Dataflow in Volatile Environments
- Federated query processors: a reality
- Cohera, IBM DataJoiner
- No control over stats, performance, administration
- Large Cluster Systems: Scaling Out
- No control over system balance
- User CONTROL of running dataflows
- Long-running dataflow apps are interactive
- No control over user interaction
- Sensor Nets: the next killer app
- E.g. Smart Dust
- No control over anything!
- Telegraph
- Dataflow Engine for these environments
4 Data Flood: Main Features
- What does it look like?
- Never ends: interactivity required
- Online, controllable algorithms for all tasks!
- Big: data reduction/aggregation is key
- Volatile: this scale of devices and nets will not behave nicely
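As a rough illustration of the "online, controllable algorithms" and data-reduction bullets above, the sketch below keeps a running average over a feed that may never end and reports an approximate confidence interval as it goes. It is not Telegraph code: the names `RunningMean` and `online_average`, the `report_every` parameter, and the normal-approximation interval are assumptions made for this example, and the interval is only meaningful if values arrive in roughly random order.

```python
import math
import random


class RunningMean:
    """Running mean with a normal-approximation confidence interval (hypothetical helper)."""

    def __init__(self, z=1.96):
        self.z = z        # 1.96 gives roughly a 95% interval under the normal approximation
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0     # running sum of squared deviations (Welford's method)

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def estimate(self):
        """Return (mean, half_width); half_width is inf until two values are seen."""
        if self.n < 2:
            return self.mean, math.inf
        variance = self.m2 / (self.n - 1)
        return self.mean, self.z * math.sqrt(variance / self.n)


def online_average(stream, report_every=1000):
    """Yield (rows_seen, mean, half_width) periodically while consuming the stream."""
    agg = RunningMean()
    for x in stream:
        agg.add(x)
        if agg.n % report_every == 0:
            mean, half_width = agg.estimate()
            yield agg.n, mean, half_width


if __name__ == "__main__":
    rng = random.Random(0)  # synthetic readings, made up for the demo
    readings = (rng.gauss(20.0, 5.0) for _ in range(5000))
    for n, mean, half_width in online_average(readings):
        print(f"{n:5d} rows: {mean:.2f} +/- {half_width:.2f}")
```

A consumer can stop pulling from `online_average` as soon as the interval is tight enough, which is the kind of user control the slide asks for.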
5 The Telegraph Dataflow Engine
- Key technologies
- Interactive Control
- interactivity with early answers and examples
- online aggregation for data reduction
- Dataflow programming via paths/iterators
- Elevate query processing frameworks out of DBMSs
- Long tradition of static optimization here
- Suggestive, but not sufficient for volatile environments
- Continuously adaptive flow optimization
- massively parallel, adaptive dataflow via Rivers and Eddies
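The "dataflow programming via paths/iterators" bullet can be pictured with a pull-based iterator pipeline of the kind query processors use. The sketch below uses Python generators; the operator names (`scan`, `select`, `project`, `limit`) and the `hot_sensors` example are hypothetical and chosen for illustration, not taken from Telegraph.

```python
from itertools import islice


def scan(rows):
    """Source operator: hand out rows one at a time."""
    for row in rows:
        yield row


def select(source, predicate):
    """Pass along only the rows that satisfy the predicate."""
    for row in source:
        if predicate(row):
            yield row


def project(source, columns):
    """Keep only the named columns of each row."""
    for row in source:
        yield {c: row[c] for c in columns}


def limit(source, n):
    """Stop pulling from upstream after n rows."""
    return islice(source, n)


def hot_sensors(readings, threshold, n):
    """Compose a small pipeline: scan -> select -> project -> limit."""
    plan = scan(readings)
    plan = select(plan, lambda row: row["temp"] > threshold)
    plan = project(plan, ["sensor"])
    return limit(plan, n)


if __name__ == "__main__":
    readings = [
        {"sensor": "a", "temp": 19.5},
        {"sensor": "b", "temp": 31.0},
        {"sensor": "a", "temp": 33.2},
        {"sensor": "c", "temp": 28.4},
    ]
    for row in hot_sensors(readings, 25.0, 2):
        print(row)
```

Because each operator pulls from the one below it only on demand, the first answer appears after reading just enough input to produce it, which suits feeds that never end.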
6 CONTROL: Continuous Output and Navigation Technology with Refinement On Line
- Data-intensive jobs are long-running. How to give early answers and interactivity?
- online interactivity over feeds
- pipelining online operators, data juggle
- online data correlation algs: ripple joins, online mining and aggregation
- statistical estimators, and their performance implications
- Deliver data to satisfy statistical goals
- Appreciate interplay of massive data processing, stats, and HCI
- Of all men's miseries, the bitterest is this: to know so much and have control over nothing - Herodotus
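To make the ripple-join bullet above concrete, here is a minimal sketch of the basic idea: pull one row at a time from each input in alternation and join each new row against everything already seen from the other side, so matches stream out long before either input is exhausted. It is a simplified "square" variant with in-memory lists and an equality predicate; the function name and parameters are assumptions for this example, and the statistical estimators mentioned on the slide are left out.

```python
def ripple_join(left, right, key_left, key_right):
    """Yield (left_row, right_row) matches while alternating between the inputs."""
    seen_left, seen_right = [], []
    left_it, right_it = iter(left), iter(right)
    left_done = right_done = False
    while not (left_done and right_done):
        if not left_done:
            try:
                new_left = next(left_it)
            except StopIteration:
                left_done = True
            else:
                # Join the new left row with every right row seen so far.
                for old_right in seen_right:
                    if key_left(new_left) == key_right(old_right):
                        yield new_left, old_right
                seen_left.append(new_left)
        if not right_done:
            try:
                new_right = next(right_it)
            except StopIteration:
                right_done = True
            else:
                # Join the new right row with every left row seen so far.
                for old_left in seen_left:
                    if key_left(old_left) == key_right(new_right):
                        yield old_left, new_right
                seen_right.append(new_right)


if __name__ == "__main__":
    orders = [(1, "a"), (2, "b"), (3, "c")]
    payments = [(2, "x"), (1, "y"), (2, "z")]
    for pair in ripple_join(orders, payments, lambda r: r[0], lambda r: r[0]):
        print(pair)
```

Each matching pair is emitted exactly once, at the moment the later of its two rows arrives, so the full result equals that of an ordinary join once both inputs are consumed.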
7 Performance Regime for CONTROL
- New Greedy Performance Regime
- Maximize 1st derivative of the user-happiness function
[Figure: curves labeled CONTROL and Traditional plotted against Time; vertical axis marked 100]
8 CONTROL: Continuous Output and Navigation Technology with Refinement On Line
9 CONTROL: Continuous Output and Navigation Technology with Refinement On Line
10
11 Potter's Wheel: Anomaly Detection
12 River
- We built the world's fastest sorting machine
- On the NOW: 100 Sun workstations + SAN
- But it only beat the record under ideal conditions!
- River: performance adaptivity for data flows on clusters
- simplifies management and programming
- perfect for sensor-based streams
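The point about only beating the record "under ideal conditions" can be illustrated with a toy model: when records are split evenly up front, one slow node drags down the whole job, whereas handing each record to whichever consumer frees up first lets fast nodes absorb the slack. The simulation below is a sketch under assumed per-record costs; it is a stand-in for the idea of performance adaptivity, not a description of River's actual mechanism, and both function names are hypothetical.

```python
import heapq


def static_partition_time(num_records, seconds_per_record):
    """Finish time when records are divided evenly among consumers up front."""
    n = len(seconds_per_record)
    shares = [num_records // n + (1 if i < num_records % n else 0) for i in range(n)]
    return max(share * cost for share, cost in zip(shares, seconds_per_record))


def adaptive_time(num_records, seconds_per_record):
    """Finish time when each record goes to whichever consumer is free first.

    Returns (finish_time, records_handled_per_consumer).
    """
    free_at = [(0.0, i) for i in range(len(seconds_per_record))]
    heapq.heapify(free_at)
    counts = [0] * len(seconds_per_record)
    finish = 0.0
    for _ in range(num_records):
        start, consumer = heapq.heappop(free_at)
        done = start + seconds_per_record[consumer]
        counts[consumer] += 1
        finish = max(finish, done)
        heapq.heappush(free_at, (done, consumer))
    return finish, counts


if __name__ == "__main__":
    costs = [1.0, 1.0, 1.0, 4.0]  # assumed: the fourth node is 4x slower
    print("static  :", static_partition_time(100, costs))
    print("adaptive:", adaptive_time(100, costs))
```

In this made-up setting the slow node simply ends up with fewer records, and nobody had to know in advance which node would be slow.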
13 Declarative Dataflow: NOT new
- Database Systems have been doing this for years
- Translate declarative queries into an efficient dataflow plan
- query optimization considers
- Alternate data sources (access methods)
- Alternate implementations of operators
- Multiple orders of operators
- A space of alternatives defined by transformation rules
- Estimate costs and data rates, then search space
- But in a very static way!
- Gather statistics once a week
- Optimize query at submission time
- Run a fixed plan for the life of the query
- And these ideas are ripe to elevate out of DBMSs
- And outside of DBMSs, the world is very volatile
- There are surely going to be lessons outside the box
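The "estimate costs and data rates, then search space" step can be sketched for the simplest case, ordering a set of filters. Given a per-tuple cost and a selectivity estimate for each filter, the expected cost of an ordering follows from how many tuples survive to each stage, and a static optimizer picks the cheapest ordering once. The statistics, filter names, and the independence assumption below are all made up for illustration; exhaustive enumeration is used only because the example is tiny.

```python
from itertools import permutations


def expected_cost(order, stats):
    """Expected per-tuple cost of applying filters in the given order.

    stats maps filter name -> (cost_per_tuple, selectivity), where selectivity
    is the estimated fraction of tuples that pass. Filters are assumed independent.
    """
    total, surviving = 0.0, 1.0
    for name in order:
        cost, selectivity = stats[name]
        total += surviving * cost   # only surviving tuples reach this filter
        surviving *= selectivity
    return total


def optimize(stats):
    """Search the space of orderings and return the cheapest one."""
    return min(permutations(sorted(stats)), key=lambda order: expected_cost(order, stats))


if __name__ == "__main__":
    stats = {  # hypothetical estimates, as if gathered ahead of time
        "cheap_selective": (1.0, 0.1),
        "expensive": (10.0, 0.5),
        "cheap_loose": (1.0, 0.9),
    }
    plan = optimize(stats)
    print(plan, expected_cost(plan, stats))
```

The weakness the slide points at is visible here: `optimize` runs once on whatever estimates it was given, and the chosen order stays fixed even if the real selectivities drift while the query runs.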
14 Static Query Plans
- Volatile environments like sensors need to adapt at a much finer grain
15 Continuous Adaptivity: Eddies
[Figure: Eddy]
- How to order and reorder operators over time
- based on performance, economic/admin feedback
- Vs. River
- River optimizes each operator horizontally
- Eddies optimize a pipeline vertically
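The idea of reordering operators over time based on performance feedback can be sketched as a router that decides, tuple by tuple, which filter to try next. The version below uses a deliberately simple greedy policy: always try the filter with the lowest observed pass rate first, so tuples that are going to be dropped are dropped cheaply. The class name `Eddy`, the smoothing in `_pass_rate`, and the greedy policy itself are assumptions for this example and should not be read as Telegraph's actual routing scheme.

```python
class Eddy:
    """Toy adaptive router over a set of filters (illustrative sketch only)."""

    def __init__(self, filters):
        self.filters = dict(filters)                  # name -> predicate
        self.seen = {name: 0 for name in self.filters}
        self.passed = {name: 0 for name in self.filters}

    def _pass_rate(self, name):
        # Smoothed estimate so that filters with no history start at 0.5.
        return (self.passed[name] + 1) / (self.seen[name] + 2)

    def route(self, row):
        """Apply all filters to one row, choosing the order from current statistics."""
        remaining = set(self.filters)
        while remaining:
            # Greedy choice: the filter most likely to reject goes first.
            name = min(remaining, key=lambda n: (self._pass_rate(n), n))
            self.seen[name] += 1
            if not self.filters[name](row):
                return False
            self.passed[name] += 1
            remaining.remove(name)
        return True

    def run(self, rows):
        for row in rows:
            if self.route(row):
                yield row


if __name__ == "__main__":
    # Synthetic feed whose behaviour shifts halfway: first "a" rejects, then "b" does.
    rows = [{"a": 0, "b": 1}] * 50 + [{"a": 1, "b": 0}] * 50 + [{"a": 1, "b": 1}] * 5
    eddy = Eddy({"a": lambda r: r["a"] == 1, "b": lambda r: r["b"] == 1})
    out = list(eddy.run(rows))
    print(len(out), "rows passed;", sum(eddy.seen.values()), "filter calls")
```

On this feed either fixed order would make 160 filter calls, while the router switches its preferred first filter after the shift and makes fewer. A greedy rule like this learns only from the tuples it happens to route, so its statistics are biased; it is meant to show the feedback loop, not a tuned policy.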
16 Competitive Eddies
17 Telegraph: Putting it Together
- Scalable, adaptive dataflow infrastructure. Apps include
- sensor nets
- massively parallel and wide-area query engines
- net appliances: chaining transformation/aggregation/compression/etc. in proxies
- any volatile dataflow scenario
- Technology: a marriage of
- CONTROL, Rivers and Eddies
- Many research questions here
- E.g. how to combine River and Eddy adaptivity
- E.g. how to tune Eddies for statistical performance goals
- Combinations of browse/query/mine at UI
- Storage management to handle new hardware realities
- Look for a live service this summer!
18 Integration with Endeavour
- Give
- Be data-intensive backbone to diverse clients
- Be replication/delivery dataflow engine for OceanStore
- Telegraph Storage Manager provides storage (xactional/otherwise) for OceanStore
- Provide platform for data-intensive tacit info mining
- Take
- Leverage OceanStore to manage distributed metadata, security
- Leverage protocols out of TinyOS for sensors
19 Connectivity and Heterogeneity
- Lots of folks working on data format translation, parsing
- we will borrow, not build
- currently using JDBC and Cohera Net Query
- commercial tool, donated by Cohera Corp.
- gateways XML/HTML (via HTTP) to ODBC/JDBC
- we may write Teletalk gateways from sensors
- Heterogeneity
- never a simple problem
- CONTROL project developed an interactive, online data transformation tool: ABC
20 More Info
- Collaborators
- Mike Franklin, Eric Brewer, Christos Papadimitriou
- Sirish Chandrasekaran, Amol Deshpande, Kris Hildrum, Sam Madden, Vijayshankar Raman, Mehul Shah
- Me: jmh_at_cs.berkeley.edu
- Web
- http://db.cs.berkeley.edu/telegraph
- http://control.cs.berkeley.edu
21 Extra slides for backup