Telegraph Endeavour Retreat 2000 - PowerPoint PPT Presentation

About This Presentation

Title:

Telegraph Endeavour Retreat 2000

Description:

Incorporates Ninja 'path' project for caching ... Partner with Ninja to manage local metadata? Work with GUIR on interacting with streams? ... – PowerPoint PPT presentation

Number of Views:118

Avg rating:3.0/5.0

Slides: 32

Provided by: JosephMHe5

Learn more at: https://dsf.berkeley.edu

Category:

more less

Transcript and Presenter's Notes

Title: Telegraph Endeavour Retreat 2000

1
TelegraphEndeavour Retreat 2000

Joe Hellerstein

2
Roadmap

Motivation Goals
Application Scenarios
Quickie core technology overview
Adaptive dataflow
Event-based storage manager
Come hear more about these tonight/tomorrow!
Status and Plans
Dataflow infrastructure apps
Storage manager?

3
Motivations

Global Data Federation
All the data is online what are we waiting for?
The plumbing is coming
XML/HTTP, XML/WAP, etc. give LCD communication
but how do you flow, summarize, query and analyze
data robustly over many sources in the wide area?
Ubiquitous computing more than clients
sensors and their data feeds are key
smart dust, biomedical (MEMS sensors)
each consumer good records (mis)use
disposable computing
video from surveillance cameras, broadcasts, etc.
Huge Data flood acomin!
will it capsize the good ship Endeavour?

4
Initial Telegraph Goals

Unify data access dataflow apps
Commercial wrappers for most infosources
Most info-centric apps can be cast as dataflow
The data flood needs a big dataflow manager!
Goal a robust, adaptive dataflow engine
Unify storage
Currently lots of disparate data stores
Databases, Files, Email servers (and http access
on these)
Goal A single, clean storage manager that can
serve
DB records semantics
Files and semantics
Email folders, calendars, etc. and semantics

5
Challenge for Dataflow Volatility!

Federated query processors
A la Cohera, IBM DataJoiner
No control over stats, performance,
administration
Large Cluster Systems Scaling Out
No control over system balance
User CONTROL of running dataflows
Long-running dataflow apps are interactive
No control over user interaction
Sensor Nets
No control over anything!
Telegraph
Dataflow Engine for these environments

6
The Data Flood Main Features

What does it look like?
Never ends interactivity required
Online, controllable algorithms for all tasks!
Big data reduction/aggregation is key
Volatile this scale of devices and nets will not
behave nicely

7
The Telegraph Dataflow Engine

Key technologies
Interactive Control
interactivity with early answers and examples
online aggregation for data reduction
Dataflow programming via paths/iterators
Elevate query processing frameworks out of DBMSs
Long tradition of static optimization here
Suggestive, but not sufficient for volatile
environments
Continuously adaptive flow optimization
massively parallel, adaptive dataflow
Rivers and Eddies

8
Static Query Plans

Volatile environments like sensors need to adapt
at a much finer grain

9
Continuous Adaptivity Eddies
Eddy

How to order and reorder operators over time
based on performance, economic/admin feedback
Vs.River
River optimizes each operator horizontally
Eddies optimize a pipeline vertically

10
Unifying Storage

Storage management buried inside specific systems
Elevate and expose the core services semantic
options
Layout/indexing
Concurrent access/modification
Recovery
Design for clustered environments
Replicate for reliability (tie-ins with Ninja)
Cluster options your RAM vs. my disk
Events State Machines for scalability
Unify eventflow and dataflow?
Share optimization lessons?

11
Status Adaptive Dataflow

Initial Eddy results promising, well received
(SIGMOD 2K)
Finishing Telegraph v0 in Java/Jaguar
Prototype now running
Demo service to go live on web this summer
Analysis queries over web sites
Weve picked a provocative app to go live with
(stay tuned!)
Incorporates Ninja path project for caching
Goal Telegraph is to facts and figures as
search engines are to documents
Longer-term goals
Formalize optimize Eddy/River scheduling
policies
Study HCI/systems/stats issues for interaction
Crawl Dark Matter on the web
Attack streams from sensors
Sequence queries and mining, data reduction,
browsing, etc.

12
Status Unified Storage Manager

Prototype implementation in Java/Jaguar
ACID transactions (non-ACID) Java file access
Robust enough to get TPC-W numbers
Events/states vs. threads
Echoes Gribble/Welsh results better than
threaded under load, but Java complicates
detailed mesurement
Time to re-evaluate importance of this part
Interest? More mindshare in dataflow
infrastructure.
Vs. tuning an off-the-shelf solution (e.g.
Berkeley DB)?
Goal? unified lessons about dataflow/eventflow
optimization on clusters.

13
Integration with Rest of Endeavour

Give
Be dataflow backbone for diverse clients
Our own Telegraph apps (federated dataflow,
sensors)
Replication/delivery dataflow engine for
OceanStore
Scalable infrastructure for tacit info mining
algorithms?
Pipes for next version of Iceberg?
Telegraph Storage Manager provides storage
(xactional/otherwise) for OceanStore? Ninja?
Take
OceanStore to manage distributed metadata,
security
Leverage protocols out of TinyOS for sensors
Partner with Ninja to manage local metadata?
Work with GUIR on interacting with streams?

14
More Info

People
Joe Hellerstein, Mike Franklin, Eric Brewer,
Christos Papadimitriou
Sirish Chandrasekaran, Amol Deshpande, Kris
Hildrum, Sam Madden, Vijayshankar Raman, Mehul
Shah
Software
http//telegraph.cs.berkeley.edu coming soon
ABC interactive data anlysis/cleansing at
http//control.cs.berkeley.edu
Papers
See http//db.cs.berkeley.edu/telegraph

15
Extra slides for backup
16
Connectivity Heterogeneity

Lots of folks working on data format translation,
parsing
we will borrow, not build
currently using JDBC Cohera Net Query
commercial tool, donated by Cohera Corp.
gateways XML/HTML (via http) to ODBC/JDBC
we may write Teletalk gateways from sensors
Heterogeneity
never a simple problem
Control project developed interactive, online
data transformation tool ABC

17
CONTROLContinuous Output and Navigation
Technology with Refinement On Line

Data-intensive jobs are long-running. How to
give early answers and interactivity?
online interactivity over feeds
pipelining online operators, data juggle
online data correlation algs ripple joins,
online mining and aggregation
statistical estimators, and their performance
implications
Deliver data to satisfy statistical goals
Appreciate interplay of massive data processing,
stats, and HCI

Of all men's miseries, the bitterest is this to
know so much and have control over nothing
Herodotus

18
Performance Regime for CONTROL

New Greedy Performance Regime
Maximize 1st derivative of the user-happiness
function

100
CONTROL
?
Traditional
Time
19
CONTROLContinuous Output and Navigation
Technology with Refinement On Line
20
CONTROLContinuous Output and Navigation
Technology with Refinement On Line
21
River

We built the worlds fastest sorting machine
On the NOW 100 Sun workstations SAN
But it only beat the record under ideal
conditions!
River performance adaptivity for data flows on
clusters
simplifies management and programming
perfect for sensor-based streams

22
Declarative Dataflow NOT new

Database Systems have been doing this for years
Xlate declarative queries into an efficient
dataflow plan
query optimization considers
Alternate data sources (access methods)
Alternate implementations of operators
Multiple orders of operators
A space of alternatives defined by transformation
rules
Estimate costs and data rates, then search
space
But in a very static way!
Gather statistics once a week
Optimize query at submission time
Run a fixed plan for the life of the query
And these ideas are ripe to elevate out of DBMSs
And outside of DBMSs, the world is very volatile
There are surely going to be lessons outside the
box

23
Static Query Plans

Volatile environments like sensors need to adapt
at a much finer grain

24
Continuous Adaptivity Eddies
Eddy

How to order and reorder operators over time
based on performance, economic/admin feedback
Vs.River
River optimizes each operator horizontally
Eddies optimize a pipeline vertically

25
Competitive Eddies
26

27
Potters Wheel Anomaly Detection
28
The Data Flood is Real
Source J. Porter, Disk/Trend, Inc. http//www.dis
ktrend.com/pdf/portrpkg.pdf
29
Disk Appetite, cont.