Adaptive Dataflow: A Database/Networking Cosmic Convergence

Title: Database/Network Convergence
Author: Joe Hellerstein
Last modified by: Joe Hellerstein
Created: 12/6/2001 9:51:31 PM

Slides: 34
Provided by: JoeHell9
Learn more at: https://dsf.berkeley.edu
Transcript and Presenter's Notes

1
Adaptive Dataflow: A Database/Networking Cosmic Convergence
  • Joe Hellerstein
  • UC Berkeley

2
Road Map
  • How I got started on this
  • CONTROL project
  • Eddies
  • Tie-ins to Networking Research
  • Telegraph: ongoing adaptive dataflow research
  • New arenas
  • Sensor networks
  • P2P networks

3
Background: CONTROL project
  • Online/Interactive query processing
  • Online aggregation
  • Scalable spreadsheets, refining visualizations
  • Online data cleaning (Potter's Wheel)
  • Pipelining operators (ripple joins, online
    reordering) over streaming samples

4
Example: Online Aggregation
5
Online Data Visualization
  • CLOUDS

6
Potter's Wheel
7
Goals for Online Processing
  • Performance metric?
  • Statistical (e.g. conf. intervals)
  • User-driven (e.g. weighted by widgets)
  • New "greedy" performance regime
  • Maximize 1st derivative of the "mirth index"
  • Mirth defined on-the-fly
  • Therefore need FEEDBACK and CONTROL
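The statistical metric above (an estimate whose confidence interval tightens as samples arrive) can be sketched as a running average. This is a minimal sketch, not the CONTROL implementation: the name `online_average`, the normal-approximation interval, and the default `z = 1.96` (roughly 95%) are assumptions made for illustration, and it presumes the values arrive in random order.

```python
import math


def online_average(stream, z=1.96):
    """Yield (n, running_mean, half_width) after each sampled value.

    half_width is z * standard error, a large-sample approximation
    (an assumption of this sketch, not a claim about the original system).
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1:
            std_err = math.sqrt(m2 / (n - 1) / n)
            yield n, mean, z * std_err
        else:
            # One sample gives no variance estimate yet.
            yield n, mean, float("inf")


if __name__ == "__main__":
    for n, mean, half in online_average([1.0, 2.0, 3.0, 4.0]):
        print(n, mean, half)
```

A user watching the interval shrink can stop the query as soon as the answer is good enough, which is the kind of feedback loop the slide asks for.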

[Figure: curves labeled Online and Traditional plotted against Time, on a vertical axis topping out at 100]
8
CONTROL => Volatility
  • Goals and data may change over time
  • User feedback, sample variance
  • Goals and data may be different in different
    regions
  • Group-by, scrollbar position
  • An aside: dependencies in selectivity
    estimation
  • Q: Query optimization in this world?
  • Or in any pipelining, volatile environment??
  • Where else do we see volatility?

9
Continuous Adaptivity: Eddies
Eddy
  • A little more state per tuple
  • Ready/done bits (extensible a la
    Volcano/Starburst)
  • Query processing = dataflow routing!!
  • We'll come back to this!

10
Eddies: Two Key Observations
  • Break the set-oriented boundary
  • Usual DB model: algebra expressions, e.g.
    (R ⋈ S) ⋈ T
  • Usual DB implementation: pipelining operators!
  • Subexpressions never materialized
  • Typical implementation is more flexible than
    algebra
  • We can reorder in-flight operators
  • Other gains possible by breaking the set-oriented
    boundary
  • Don't rewrite graph. Impose a router
  • Graph edge = absence of routing constraint
  • Observe operator consumption/production rates
  • Consumption ~ cost
  • Production ~ cost × selectivity
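The router idea above can be sketched with selection operators only. This is a toy under stated assumptions, not Telegraph code: `Filter` and `eddy` are hypothetical names, and the ticket lottery (an operator gains a ticket when handed a tuple and gives it back when it returns the tuple) is one possible routing policy chosen for illustration.

```python
import random


class Filter:
    """A hypothetical selection operator as seen by the eddy."""

    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate
        self.tickets = 1  # more tickets means the eddy routes here sooner


def eddy(tuples, operators, rng):
    """Route each tuple through every operator in a per-tuple order."""
    output = []
    for t in tuples:
        done = set()  # per-tuple "done bits", one per operator
        alive = True
        while alive and len(done) < len(operators):
            ready = [op for op in operators if op.name not in done]
            weights = [op.tickets for op in ready]
            op = rng.choices(ready, weights=weights, k=1)[0]
            op.tickets += 1  # credit: the operator consumed a tuple
            if op.predicate(t):
                done.add(op.name)
                op.tickets = max(1, op.tickets - 1)  # debit: it produced one
            else:
                alive = False  # tuple dropped, so the operator keeps the credit
        if alive:
            output.append(t)
    return output


if __name__ == "__main__":
    ops = [
        Filter("even", lambda x: x % 2 == 0),
        Filter("small", lambda x: x < 10),
    ]
    print(eddy(range(100), ops, random.Random(0)))
    print({op.name: op.tickets for op in ops})
```

The result is the same for any routing order; only the work done changes. Operators that drop many tuples accumulate tickets, so later tuples tend to visit them first.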

11
Road Map
  • How I got started on this
  • CONTROL project
  • Eddies
  • Tie-ins to Networking Research
  • Telegraph: ongoing adaptive dataflow research
  • New arenas
  • Sensor networks
  • P2P networks

12
Coincidence: Eddie Comes to Berkeley
  • CLICK: a NW router is a query plan!
  • "The Click Modular Router", Robert Morris, Eddie
    Kohler, John Jannotti, and M. Frans Kaashoek,
    SOSP '99

13
Also: Scout
  • Paths: the key to comm-centric OS
  • "Making Paths Explicit in the Scout Operating
    System", David Mosberger and Larry L. Peterson,
    OSDI '96

Figure 3: Example Router Graph
14
More Interaction: CS262 Experiment w/ Eric Brewer
  • Merge OS & DBMS grad class, over a year
  • Eric/Joe, point/counterpoint
  • Some tie-ins were obvious
  • memory mgmt, storage, scheduling, concurrency
  • Surprising: QP and networks go well side by side
  • E.g. eddies and TCP Congestion Control
  • Both use back-pressure and simple Control Theory
    to learn in an unpredictable dataflow
    environment
  • Eddies close to the n-armed bandit problem
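The TCP side of this comparison can be illustrated with additive-increase/multiplicative-decrease, the feedback rule at the heart of TCP congestion control. The sketch below is deliberately simplified: the function name `aimd`, the fixed `capacity`, and the idea that exceeding it is signalled immediately are all assumptions for illustration.

```python
def aimd(capacity, rounds, increase=1.0, decrease=0.5):
    """Return the send window used in each round of a toy AIMD loop."""
    window = 1.0
    history = []
    for _ in range(rounds):
        history.append(window)
        if window > capacity:
            # Back-pressure signal (e.g. a loss): back off multiplicatively.
            window *= decrease
        else:
            # No signal: probe for more bandwidth additively.
            window += increase
    return history


if __name__ == "__main__":
    print(aimd(capacity=10, rounds=30))
```

As with the eddy's routing policy, the sender never learns the capacity directly. It only reacts to what the dataflow pushes back.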

15
Networking Overview for DB People Like Me
  • Core function of protocols: data xfer
  • Data Manipulation (buffer, checksum, encryption,
    xfer to/fr app space, presentation)
  • Transfer Control (flow/congestion ctl, detecting
    xmission probs, acks, muxing, timestamps,
    framing) -- Clark & Tennenhouse, "Architectural
    Considerations for a New Generation of
    Protocols", SIGCOMM '90
  • Basic Internet assumption:
  • "a network of unknown topology and with an
    unknown, unknowable and constantly changing
    population of competing conversations" (Van
    Jacobson)

16
C&T's Wacky Ideas
  • Thesis: nets are good at xfer control, not so
    good at data manipulation
  • Some C&T wacky ideas for better data manipulation
  • Xfer semantic units, not packets (ALF)
  • Auto-rewrite layers to flatten them (ILP)
  • Minimize cross-layer ordering constraints
  • Control delivery in parallel via packet content

17
Wacky New Ideas in QP
  • What if...
  • We had unbounded data producers and consumers
    (streams & continuous queries)
  • We couldn't know our producers' behavior or
    contents?? (federation & mediators)
  • We couldn't predict user behavior? (control)
  • We couldn't predict behavior of components in the
    dataflow? (networked services)
  • We had partial failure as a given? (oops, have
    we ignored this?)
  • Yes... networking people have been here!
  • Remember Van Jacobson's quote?

18
The Cosmic Convergence
Data Models, Query Opt, Data Scalability
DATABASE RESEARCH
Adaptive Query Processing
Continuous Queries
Approximate/Interactive QP
Sensor Databases
Content-Based Routing
Router Toolkits
Content Addressable Networks
Directed Diffusion
NETWORKING RESEARCH
Adaptivity, Federated Control, Geo-Scalability
19
The Cosmic Convergence
Data Models, Query Opt, Data Scalability
Telegraph
Adaptivity, Federated Control, Geo-Scalability
20
Road Map
  • How I got started on this
  • CONTROL project
  • Eddies
  • Tie-ins to Networking Research
  • Telegraph: ongoing adaptive dataflow research
  • New arenas
  • Sensor networks
  • P2P networks

21
What's in the Sweet Spot?
  • Scenarios with:
  • Structured Content
  • Volatility
  • Rich Queries
  • Clearly:
  • Long-running data analysis a la CONTROL
  • Continuous queries
  • Queries over Internet sources and services
  • Two emerging scenarios:
  • Sensor networks
  • P2P query processing

22
Telegraph: Engineering the Sweet Spot
  • An adaptive dataflow system
  • Dataflow programming model
  • A la Volcano, CLICK: push and pull. "Fjords",
    ICDE '02
  • Extensible set of pipelining operators, including
    relational ops, grouped filters (e.g. XFilter)
  • SQL parser for convenience (looking at XQuery)
  • Adaptivity operators
  • Eddies
  • Extensible rules for routing constraints,
    Competition
  • SteMs (state modules)
  • FLuX (Fault-tolerant Load-balancing eXchange)
  • Bounded and continuous:
  • Data sources
  • Queries

23
State Modules (SteMs)
static dataflow
  • Goal: further adaptivity through competition
  • Multiple mirrored sources
  • Handle rate changes, failures, parallelism
  • Multiple alternate operators
  • Join = Routing + State
  • SteM operator manages tradeoffs
  • State Module, unifies caches, rendezvous buffers,
    join state
  • Competitive sources/operators share
    building/probing SteMs
  • Join algorithm hybridization!
  • Vijayshankar Raman
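The idea of a join as routing plus state can be sketched with two state modules that each tuple first builds into and then probes against. This is a hedged illustration, not the Telegraph code: the `SteM` class, its `build`/`probe` methods, and `symmetric_join` are hypothetical names, and a real eddy would choose the routing rather than hard-code it as done here.

```python
from collections import defaultdict


class SteM:
    """A hypothetical state module: keyed storage with build and probe."""

    def __init__(self, key):
        self.key = key
        self.index = defaultdict(list)

    def build(self, row):
        self.index[row[self.key]].append(row)

    def probe(self, value):
        return list(self.index.get(value, []))


def symmetric_join(arrivals, key):
    """Join R and S rows that arrive interleaved as (source, row) pairs."""
    stems = {"R": SteM(key), "S": SteM(key)}
    results = []
    for source, row in arrivals:
        other = "S" if source == "R" else "R"
        stems[source].build(row)  # store the row in its own SteM
        for match in stems[other].probe(row[key]):  # then probe the other
            pair = (row, match) if source == "R" else (match, row)
            results.append(pair)
    return results


if __name__ == "__main__":
    arrivals = [
        ("R", {"k": 1, "a": "r1"}),
        ("S", {"k": 1, "b": "s1"}),
        ("S", {"k": 2, "b": "s2"}),
        ("R", {"k": 2, "a": "r2"}),
        ("R", {"k": 1, "a": "r3"}),
    ]
    for r, s in symmetric_join(arrivals, "k"):
        print(r["a"], s["b"])
```

Because the state lives in the modules rather than inside a join operator, several sources or access methods could build into and probe the same SteM, which is what makes competition possible.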

[Figure labels: eddy; eddy + SteMs]
24
FLuX: Routing Across Cluster
  • Fault Tolerance, Load Balancing
  • Continuous/long-running flows need high
    availability
  • Big flows need parallelism
  • Adaptive Load-Balancing req'd
  • FLuX operator: Exchange plus...
  • Adaptive flow partitioning (River)
  • Transient state replication & migration
  • RAID for SteMs
  • Needs to be extensible to different ops
  • Content-sensitivity
  • History-sensitivity
  • Dataflow semantics
  • Optimize based on edge semantics
  • Networking tie-in again
  • At-least-once delivery?
  • Exactly-once delivery?
  • In/Out of order?
  • Migration policy: the ski rental analogy
  • Mehul Shah
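The ski rental analogy can be made concrete with the classic break-even rule: keep paying the recurring cost until the total paid would reach the one-time cost, then pay the one-time cost. Mapping "rent" to the ongoing cost of an imbalanced partition and "buy" to the cost of migrating state is an interpretation assumed here, and the function names are hypothetical.

```python
import math


def online_cost(days_needed, rent, buy):
    """Cost of the break-even strategy when the horizon is unknown."""
    k = math.ceil(buy / rent)  # buy on day k, after k - 1 days of renting
    if days_needed < k:
        return days_needed * rent
    return (k - 1) * rent + buy


def offline_cost(days_needed, rent, buy):
    """Cost of the best decision made with full knowledge of the horizon."""
    return min(days_needed * rent, buy)


if __name__ == "__main__":
    for days in (3, 10, 40):
        print(days, online_cost(days, 1, 10), offline_cost(days, 1, 10))
```

The break-even rule never pays more than twice the offline optimum, which is why it is a natural baseline for deciding when a migration is worth its cost.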

25
Continuously Adaptive Continuous Queries (CACQ)
  • Continuous Queries clearly need all this stuff!
    Address adaptivity 1st.
  • 4 Ideas in CACQ
  • Use eddies to allow reordering of ops.
  • But one eddy will serve for all queries
  • Explicit tuple lineage
  • Mark each tuple with per-op ready/done bits
  • Mark each tuple with per-query completed bits
  • Queries are data: join with Grouped Filter
  • Much like XFilter, but for relational queries
  • Joins via SteMs, shared across all queries
  • Note: mixed-lineage tuples in a SteM. I.e.
    shared state is not shared algebraic expressions!
  • Delete a tuple from flow only if it matches no
    query
  • Next: F.T. CACQ via FLuXen
  • Sam Madden, Mehul Shah, Vijayshankar Raman
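The "queries are data" idea can be sketched as an index over many queries' predicates that each tuple probes once. This is an illustrative sketch under assumptions: `GroupedFilter`, `add_query`, `matching_queries`, and `route` are hypothetical names, only predicates of the form `attr > constant` are handled, and the returned set stands in for the per-query bits carried by a tuple.

```python
import bisect


class GroupedFilter:
    """Hypothetical index over many queries' 'attr > constant' predicates."""

    def __init__(self):
        self.constants = []  # thresholds, kept sorted
        self.query_ids = []  # query_ids[i] owns constants[i]

    def add_query(self, query_id, constant):
        i = bisect.bisect_left(self.constants, constant)
        self.constants.insert(i, constant)
        self.query_ids.insert(i, query_id)

    def matching_queries(self, value):
        # Every threshold strictly below value belongs to a satisfied query.
        i = bisect.bisect_left(self.constants, value)
        return set(self.query_ids[:i])


def route(tuples, grouped_filter, attr):
    """Keep each tuple with its matching queries; drop it if none match."""
    delivered = []
    for t in tuples:
        matched = grouped_filter.matching_queries(t[attr])
        if matched:
            delivered.append((t, matched))
    return delivered


if __name__ == "__main__":
    gf = GroupedFilter()
    gf.add_query("q1", 10)
    gf.add_query("q2", 20)
    gf.add_query("q3", 30)
    for t, qs in route([{"temp": 5}, {"temp": 25}, {"temp": 31}], gf, "temp"):
        print(t, sorted(qs))
```

One binary search replaces a separate predicate evaluation per query, and a tuple leaves the flow only when its set of matching queries is empty.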

26
Road Map
  • How I got started on this
  • CONTROL project
  • Eddies
  • Tie-ins to Networking Research
  • Telegraph: ongoing adaptive dataflow research
  • New arenas
  • Sensor networks
  • P2P networks

27
Sensor Nets
  • Smart Dust & TinyOS
  • Thousands of "motes"
  • Expensive communication
  • Power constraints
  • Query workload
  • Aggregation & approximation
  • Queries and Continuous Queries
  • Challenges
  • Push the processing into the network
  • Deal with volatility & failure
  • CONTROL issues: data variance, user desires
  • Joint work with Ramesh Govindan, Sam Madden, Wei
    Hong and David Culler (Intel Berkeley Lab)

Simple example: Aggregation query
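Pushing an aggregation query into the network can be sketched as each node merging its children's partial results before forwarding a single message to its parent. The routing tree, the `(sum, count)` partial state, and the function name below are assumptions for illustration, not the system described on the slide.

```python
def partial_average(tree, readings, node, messages):
    """Return (sum, count) for the subtree rooted at node.

    tree maps a node to its children, readings maps a node to its sensor
    value, and messages collects one (child, parent) entry per radio send.
    """
    total, count = readings[node], 1
    for child in tree.get(node, []):
        child_sum, child_count = partial_average(tree, readings, child, messages)
        messages.append((child, node))  # one small message per tree edge
        total += child_sum
        count += child_count
    return total, count


if __name__ == "__main__":
    tree = {"root": ["a", "b"], "a": ["c"]}
    readings = {"root": 20, "a": 22, "b": 18, "c": 24}
    sent = []
    total, count = partial_average(tree, readings, "root", sent)
    print(total / count, len(sent))
```

Each node sends one fixed-size message regardless of how many motes sit below it, which matters when communication is the expensive resource.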
28
P2P QP
  • Starting point: P2P as grassroots phenomenon
  • Outrageous filesharing volume (1.8G files in
    October 2001)
  • No business case to date
  • Challenge: scale DDBMS QP ideas to P2P
  • Motivate why
  • Pick the right parts of DBMS research to focus on
  • Storage: no! QP: yes.
  • Make it work
  • Scalability well beyond our usual target
  • Admin constraints
  • Unknown data distributions, load
  • Heterogeneous comm/processing
  • Partial failure
  • Joint work with Scott Shenker, Ion Stoica, Matt
    Harren, Ryan Huebsch, Nick Lanham, Boon Thau Loo

29
A Grassroots Example: TeleNap
30
Themes Throughout
  • Adaptivity
  • Requires clever system design
  • The Exchange model: encapsulate in ops?
  • Interesting adaptive policy problems
  • E.g. eddy routing, flux migration
  • Control Theory, Machine Learning
  • Encompasses another CS goal?
  • No-knobs, Autonomic, etc.
  • New performance regimes
  • Decent performance in the common case
  • Mean/Variance more important than MAX
  • Interactive Metrics
  • Time to completion often unimportant/irrelevant

31
More Themes
  • Set-valued thinking as albatross?
  • E.g. eddies vs. Kabra/DeWitt or Tukwila
  • E.g. SteMs vs. Materialized Views
  • E.g. CACQ vs. NiagaraCQ
  • Some clean theory here would be nice
  • Current routing correctness proofs are inelegant
  • Extensibility
  • Model/language of choice is not clear
  • SEQ? Relational? XQuery?
  • Extensible operators, edge semantics
  • A whine about VLDB's absurd "Specificity
    Factor"

32
Conclusions?
  • Too early for technical conclusions
  • Of this I'm sure:
  • The CS262 experiment is a success
  • Our students are getting a bigger picture than
    before
  • I'm learning, finding new connections
  • May morph to OS/Nets, Nets/DB
  • Eventually rethink the systems software
    curriculum at the undergraduate level too
  • Nets folks are coming our way
  • Doing relevant work, eager to collaborate
  • DB community needs to branch out
  • Outbound: Better proselytizing in CS
  • Inbound: Need new ideas

33
Conclusions, cont.
  • Sabbatical is a good invention
  • Hasn't even started, I'm already grateful!