Title: Adaptive Dataflow: A Database/Networking Cosmic Convergence
1 Adaptive Dataflow: A Database/Networking Cosmic Convergence
- Joe Hellerstein
- UC Berkeley
2 Road Map
- How I got started on this
- CONTROL project
- Eddies
- Tie-ins to Networking Research
- Telegraph: ongoing adaptive dataflow research
- New arenas
- Sensor networks
- P2P networks
3 Background: CONTROL project
- Online/Interactive query processing
- Online aggregation
- Scalable spreadsheets, refining visualizations
- Online data cleaning (Potter's Wheel)
- Pipelining operators (ripple joins, online
reordering) over streaming samples
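A minimal sketch of the online aggregation idea, assuming tuples arrive in random order so that a running mean with a CLT-based confidence interval is meaningful. The function names and the 95% z-value are illustrative assumptions, not part of the CONTROL code.

```python
import math
import random


def online_average(stream, z=1.96):
    """Yield (n, running_mean, half_width) after each sampled value.

    half_width is z * s / sqrt(n), a CLT-based confidence half-interval.
    It is only meaningful if the stream is a random sample of the data.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1:
            std = math.sqrt(m2 / (n - 1))
            half_width = z * std / math.sqrt(n)
        else:
            half_width = float("inf")  # no variance estimate from one value
        yield n, mean, half_width


def demo(seed=0, size=10_000):
    """Run the estimator over synthetic, shuffled data (hypothetical input)."""
    rng = random.Random(seed)
    data = [rng.gauss(50.0, 10.0) for _ in range(size)]
    rng.shuffle(data)
    return data, list(online_average(data))
```

The user can stop at any point: the estimate is usable long before the scan completes, and the interval shrinks roughly as 1/sqrt(n).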
4 Example: Online Aggregation
5 Online Data Visualization
6 Potter's Wheel
7 Goals for Online Processing
- Performance metric ?
- Statistical (e.g. conf. intervals)
- User-driven (e.g. weighted by widgets)
- New greedy performance regime
- Maximize 1st derivative of the mirth index
- Mirth defined on-the-fly
- Therefore need FEEDBACK and CONTROL
[Figure: chart with a vertical axis up to 100 and a horizontal Time axis, comparing Online and Traditional processing]
8 CONTROL => Volatility
- Goals and data may change over time
- User feedback, sample variance
- Goals and data may be different in different regions
- Group-by, scrollbar position
- An aside: dependencies in selectivity estimation
- Q: Query optimization in this world?
- Or in any pipelining, volatile environment??
- Where else do we see volatility?
9 Continuous Adaptivity: Eddies
Eddy
- A little more state per tuple
- Ready/done bits (extensible a la Volcano/Starburst)
- Query processing = dataflow routing!!
- We'll come back to this!
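A minimal sketch of the per-tuple state described above, assuming a query made only of filters with no ordering constraints between them (so "ready" simply means "not yet done"). The class and function names are hypothetical; a real eddy also handles joins and richer ready-bit rules.

```python
import random


class RoutedTuple:
    """A tuple plus the extra routing state an eddy keeps for it."""

    def __init__(self, values, num_ops):
        self.values = values
        self.num_ops = num_ops
        self.done = 0  # bitmask: bit i is set once operator i has seen the tuple

    def ready_ops(self):
        # Assumption: any operator that is not done is ready.
        return [i for i in range(self.num_ops) if not self.done & (1 << i)]

    def finished(self):
        return self.done == (1 << self.num_ops) - 1


def run_eddy(rows, filters, choose=None, seed=0):
    """Route each row through all filters in an order picked per tuple."""
    rng = random.Random(seed)
    choose = choose or (lambda t, ready: rng.choice(ready))
    out = []
    for row in rows:
        t = RoutedTuple(row, len(filters))
        alive = True
        while alive and not t.finished():
            i = choose(t, t.ready_ops())
            alive = filters[i](t.values)  # a filter may drop the tuple
            t.done |= 1 << i
        if alive:
            out.append(t.values)
    return out
```

Because the done bits travel with the tuple, the result is the same whatever order the router picks; only the cost changes.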
10 Eddies: Two Key Observations
- Break the set-oriented boundary
- Usual DB model: algebra expressions, e.g. (R ⋈ S) ⋈ T
- Usual DB implementation: pipelining operators!
- Subexpressions never materialized
- Typical implementation is more flexible than algebra
- We can reorder in-flight operators
- Other gains possible by breaking the set-oriented boundary
- Don't rewrite graph. Impose a router
- Graph edge = absence of routing constraint
- Observe operator consumption/production rates
- Consumption: cost
- Production: cost and selectivity
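One way a router could observe consumption and production rates is a ticket lottery: an operator gains a ticket when it consumes a tuple and loses one when it hands a tuple back. The sketch below is a simplified illustration for filters only; the class name, the floor of one ticket, and the `route` helper are assumptions made for this example.

```python
import random


class LotteryRouter:
    """Picks the next operator with probability proportional to its tickets."""

    def __init__(self, num_ops, seed=0):
        self.tickets = [1] * num_ops  # floor of 1 keeps every operator drawable
        self.rng = random.Random(seed)

    def choose(self, ready):
        weights = [self.tickets[i] for i in ready]
        return self.rng.choices(ready, weights=weights, k=1)[0]

    def consumed(self, op):
        self.tickets[op] += 1

    def produced(self, op):
        self.tickets[op] = max(1, self.tickets[op] - 1)


def route(rows, filters, router):
    """Run rows through all filters; return (survivors, calls per filter)."""
    out = []
    calls = [0] * len(filters)
    for row in rows:
        pending = list(range(len(filters)))
        alive = True
        while alive and pending:
            op = router.choose(pending)
            pending.remove(op)
            calls[op] += 1
            router.consumed(op)
            alive = filters[op](row)
            if alive:
                router.produced(op)  # tuple came back: debit the operator
        if alive:
            out.append(row)
    return out, calls
```

A selective filter keeps most of its tickets, so it is drawn first more often and the unselective filter sees fewer tuples. No selectivity estimate is ever computed explicitly.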
11 Road Map
- How I got started on this
- CONTROL project
- Eddies
- Tie-ins to Networking Research
- Telegraph: ongoing adaptive dataflow research
- New arenas
- Sensor networks
- P2P networks
12 Coincidence: Eddie Comes to Berkeley
- CLICK: a NW router is a query plan!
- "The Click Modular Router", Robert Morris, Eddie Kohler, John Jannotti, and M. Frans Kaashoek, SOSP '99
13 Also: Scout
- Paths: the key to comm-centric OS
- "Making Paths Explicit in the Scout Operating System", David Mosberger and Larry L. Peterson, OSDI '96.
Figure 3: Example Router Graph
14 More Interaction: CS262 Experiment w/ Eric Brewer
- Merge OS & DBMS grad class, over a year
- Eric/Joe, point/counterpoint
- Some tie-ins were obvious
- memory mgmt, storage, scheduling, concurrency
- Surprising: QP and networks go well side by side
- E.g. eddies and TCP Congestion Control
- Both use back-pressure and simple Control Theory to learn in an unpredictable dataflow environment
- Eddies close to the n-armed bandit problem
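For DB readers, the control loop in TCP congestion control can be sketched as additive-increase/multiplicative-decrease (AIMD). This is a toy model with an assumed fixed capacity and a loss signal whenever the window exceeds it; real TCP has slow start, timeouts, and much more.

```python
def aimd(capacity, rounds, increase=1.0, decrease=0.5, start=1.0):
    """Return the window size used in each round of a toy AIMD loop.

    capacity is a hypothetical bottleneck: sending more than this in a
    round triggers back-pressure (modeled as a single loss signal).
    """
    window = start
    history = []
    for _ in range(rounds):
        history.append(window)
        if window > capacity:
            # Back-pressure observed: cut the rate multiplicatively.
            window = max(1.0, window * decrease)
        else:
            # No congestion observed: probe for more bandwidth additively.
            window += increase
    return history
```

The sender never learns the capacity directly. It only reacts to feedback, which is the same position an eddy is in when it learns operator costs and selectivities from observed rates.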
15 Networking Overview for DB People Like Me
- Core function of protocols: data xfer
- Data Manipulation (buffer, checksum, encryption, xfer to/fr app space, presentation)
- Transfer Control (flow/congestion ctl, detecting xmission probs, acks, muxing, timestamps, framing)
-- Clark & Tennenhouse, "Architectural Considerations for a New Generation of Protocols", SIGCOMM '90
- Basic Internet assumption
- "a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations" (Van Jacobson)
16 C&T's Wacky Ideas
- Thesis: nets are good at xfer control, not so good at data manipulation
- Some C&T wacky ideas for better data manipulation
- Xfer semantic units, not packets (ALF)
- Auto-rewrite layers to flatten them (ILP)
- Minimize cross-layer ordering constraints
- Control delivery in parallel via packet content
17 Wacky New Ideas in QP
- What if
- We had unbounded data producers and consumers (streams & continuous queries)
- We couldn't know our producers' behavior or contents?? (federation & mediators)
- We couldn't predict user behavior? (control)
- We couldn't predict behavior of components in the dataflow? (networked services)
- We had partial failure as a given? (oops, have we ignored this?)
- Yes, networking people have been here!
- Remember Van Jacobson's quote?
18 The Cosmic Convergence
Data Models, Query Opt, Data Scalability
DATABASE RESEARCH
Adaptive Query Processing
Continuous Queries
Approximate/Interactive QP
Sensor Databases
Content-Based Routing
Router Toolkits
Content Addressable Networks
Directed Diffusion
NETWORKING RESEARCH
Adaptivity, Federated Control, Geo Scalability
19 The Cosmic Convergence
Data Models, Query Opt, Data Scalability
Telegraph
Adaptivity, Federated Control, Geo Scalability
20 Road Map
- How I got started on this
- CONTROL project
- Eddies
- Tie-ins to Networking Research
- Telegraph: ongoing adaptive dataflow research
- New arenas
- Sensor networks
- P2P networks
21 What's in the Sweet Spot?
- Scenarios with
- Structured Content
- Volatility
- Rich Queries
- Clearly
- Long-running data analysis a la CONTROL
- Continuous queries
- Queries over Internet sources and services
- Two emerging scenarios
- Sensor networks
- P2P query processing
22 Telegraph: Engineering the Sweet Spot
- An adaptive dataflow system
- Dataflow programming model
- A la Volcano, CLICK. Push and pull: Fjords, ICDE '02
- Extensible set of pipelining operators, including relational ops, grouped filters (e.g. XFilter)
- SQL parser for convenience (looking at XQuery)
- Adaptivity operators
- Eddies
- Extensible rules for routing constraints, Competition
- SteMs (state modules)
- FLuX (Fault-tolerant Load-balancing eXchange)
- Bounded and continuous
- Data sources
- Queries
23 State Modules (SteMs)
- Goal: Further adaptivity through competition
- Multiple mirrored sources
- Handle rate changes, failures, parallelism
- Multiple alternate operators
- Join = Routing + State
- SteM operator manages tradeoffs
- State Module, unifies caches, rendezvous buffers, join state
- Competitive sources/operators share building/probing SteMs
- Join algorithm hybridization!
- Vijayshankar Raman
[Figure: dataflow diagrams labeled "static dataflow", "eddy", and "eddy stems"]
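A minimal sketch of a join expressed as state modules plus routing, assuming a two-way equijoin on a single key. Each arriving tuple is built into its own source's SteM and probed against the other's. The `SteM` class and `stem_join` function are hypothetical names for illustration, not the Telegraph API.

```python
from collections import defaultdict


class SteM:
    """Hypothetical state module: a hash index over one source's tuples."""

    def __init__(self, key):
        self.key = key
        self.index = defaultdict(list)

    def build(self, row):
        self.index[row[self.key]].append(row)

    def probe(self, value):
        return list(self.index.get(value, []))


def stem_join(arrivals, key):
    """Join sources "R" and "S" from an arbitrarily interleaved stream.

    arrivals is an iterable of (source, row) pairs. The routing here is
    fixed (build into own SteM, then probe the other); an eddy could
    make that choice per tuple instead.
    """
    stems = {"R": SteM(key), "S": SteM(key)}
    other = {"R": "S", "S": "R"}
    for source, row in arrivals:
        stems[source].build(row)
        for match in stems[other[source]].probe(row[key]):
            r, s = (row, match) if source == "R" else (match, row)
            yield r, s
```

With this fixed routing the behavior is that of a symmetric hash join, and the set of results does not depend on arrival order. Different routing over the same state would yield other join algorithms, which is the hybridization point on the slide.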
24 FLuX: Routing Across Cluster
- Fault Tolerance, Load Balancing
- Continuous/long-running flows need high availability
- Big flows need parallelism
- Adaptive Load-Balancing req'd
- FLuX operator: Exchange plus:
- Adaptive flow partitioning (River)
- Transient state replication & migration
- RAID for SteMs
- Needs to be extensible to different ops
- Content-sensitivity
- History-sensitivity
- Dataflow semantics
- Optimize based on edge semantics
- Networking tie-in again
- At-least-once delivery?
- Exactly-once delivery?
- In/Out of order?
- Migration policy: the ski rental analogy
- Mehul Shah
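The ski rental analogy can be sketched as a break-even rule: keep "renting" (tolerating the cost of an imbalanced partition) until the rent paid equals the price of "buying" (migrating state), then migrate. Mapping rent to per-step imbalance cost is an assumption made for this illustration; the slides do not spell out the actual FLuX policy.

```python
def break_even_migration(imbalance_costs, migration_cost):
    """Decide when to migrate under the classic break-even rule.

    imbalance_costs: per-step cost of staying put (hypothetical units).
    Returns (step_at_which_we_migrate_or_None, total_cost_paid).
    """
    paid = 0.0
    for step, cost in enumerate(imbalance_costs):
        if paid >= migration_cost:
            # Rent paid so far has reached the purchase price: buy now.
            return step, paid + migration_cost
        paid += cost
    return None, paid  # the imbalance ended before migrating paid off
```

For example, with a cost of 1 per step and a migration cost of 3, the rule migrates at step 3 for a total of 6, against an offline optimum of 3. The appeal of the rule is that it needs no prediction of how long the imbalance will last, while never paying more than about twice the optimum.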
25 Continuously Adaptive Continuous Queries (CACQ)
- Continuous Queries clearly need all this stuff! Address adaptivity 1st.
- 4 Ideas in CACQ
- Use eddies to allow reordering of ops.
- But one eddy will serve for all queries
- Explicit tuple lineage
- Mark each tuple with per-op ready/done bits
- Mark each tuple with per-query completed bits
- Queries are data: join with Grouped Filter
- Much like XFilter, but for relational queries
- Joins via SteMs, shared across all queries
- Note: mixed-lineage tuples in a SteM. I.e. shared state is not shared algebraic expressions!
- Delete a tuple from flow only if it matches no query
- Next: F.T. CACQ via FLuXen
- Sam Madden, Mehul Shah, Vijayshankar Raman
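A minimal sketch of the "queries are data" idea, assuming every registered query has a single predicate of the form `value > constant` on the same attribute. The predicates are indexed once, and each tuple is matched against all queries with one binary search. The class and method names are hypothetical.

```python
import bisect


class GroupedGreaterThan:
    """Indexes predicates of the form `value > constant`, one per query."""

    def __init__(self):
        self.constants = []  # kept sorted
        self.query_ids = []  # parallel to self.constants

    def add(self, query_id, constant):
        pos = bisect.bisect_left(self.constants, constant)
        self.constants.insert(pos, constant)
        self.query_ids.insert(pos, query_id)

    def matching(self, value):
        # Every constant strictly below value satisfies value > constant.
        end = bisect.bisect_left(self.constants, value)
        return set(self.query_ids[:end])


def filter_stream(values, grouped):
    """Yield (value, matching query ids); drop values that match no query."""
    for v in values:
        queries = grouped.matching(v)
        if queries:
            yield v, queries
```

The returned set plays the role of per-query bookkeeping on the tuple: it records which queries the tuple can still contribute to, and a tuple leaves the flow only when that set is empty.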
26 Road Map
- How I got started on this
- CONTROL project
- Eddies
- Tie-ins to Networking Research
- Telegraph: ongoing adaptive dataflow research
- New arenas
- Sensor networks
- P2P networks
27 Sensor Nets
- Smart Dust & TinyOS
- Thousands of motes
- Expensive communication
- Power constraints
- Query workload
- Aggregation & approximation
- Queries and Continuous Queries
- Challenges
- Push the processing into the network
- Deal with volatility & failure
- CONTROL issues: data variance, user desires
- Joint work with Ramesh Govindan, Sam Madden, Wei Hong and David Culler (Intel Berkeley Lab)
Simple example: Aggregation query
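A minimal sketch of pushing an AVG query into the network, assuming the motes form a routing tree and each one forwards a fixed-size partial state (sum, count) to its parent rather than raw readings. The tree layout, the readings, and the function name are made up for illustration.

```python
def aggregate_avg(tree, readings, root):
    """Compute AVG over all nodes by merging partial states up a tree.

    tree maps a node to its list of children; readings maps a node to
    its local sensor value. Returns (average, messages_sent).
    """
    messages = 0

    def partial(node):
        nonlocal messages
        total, count = readings[node], 1
        for child in tree.get(node, []):
            child_total, child_count = partial(child)
            messages += 1  # one fixed-size partial record per tree edge
            total += child_total
            count += child_count
        return total, count

    total, count = partial(root)
    return total / count, messages
```

Each mote sends exactly one small record no matter how many descendants it has. That matters when communication is the expensive, power-hungry operation.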
28 P2P QP
- Starting point: P2P as grassroots phenomenon
- Outrageous filesharing volume (1.8G files in October 2001)
- No business case to date
- Challenge: scale DDBMS QP ideas to P2P
- Motivate why
- Pick the right parts of DBMS research to focus on
- Storage: no! QP: yes.
- Make it work
- Scalability well beyond our usual target
- Admin constraints
- Unknown data distributions, load
- Heterogeneous comm/processing
- Partial failure
- Joint work with Scott Shenker, Ion Stoica, Matt
Harren, Ryan Huebsch, Nick Lanham, Boon Thau Loo
29 A Grassroots Example: TeleNap
30 Themes Throughout
- Adaptivity
- Requires clever system design
- The Exchange model: encapsulate in ops?
- Interesting adaptive policy problems
- E.g. eddy routing, FLuX migration
- Control Theory, Machine Learning
- Encompasses another CS goal?
- No-knobs, Autonomic, etc.
- New performance regimes
- Decent performance in the common case
- Mean/Variance more important than MAX
- Interactive Metrics
- Time to completion often unimportant/irrelevant
31 More Themes
- Set-valued thinking as albatross?
- E.g. eddies vs. Kabra/DeWitt or Tukwila
- E.g. SteMs vs. Materialized Views
- E.g. CACQ vs. NiagaraCQ
- Some clean theory here would be nice
- Current routing correctness proofs are inelegant
- Extensibility
- Model/language of choice is not clear
- SEQ? Relational? XQuery?
- Extensible operators, edge semantics
- A whine about VLDB's absurd "Specificity Factor"
32 Conclusions?
- Too early for technical conclusions
- Of this I'm sure
- The CS262 experiment is a success
- Our students are getting a bigger picture than before
- I'm learning, finding new connections
- May morph to OS/Nets, Nets/DB
- Eventually rethink the systems software curriculum at the undergraduate level too
- Nets folks are coming our way
- Doing relevant work, eager to collaborate
- DB community needs to branch out
- Outbound: Better proselytizing in CS
- Inbound: Need new ideas
33 Conclusions, cont.
- Sabbatical is a good invention
- Hasn't even started, I'm already grateful!