Telegraph Endeavour Retreat 2000 - PowerPoint PPT Presentation

About This Presentation
Title:

Telegraph Endeavour Retreat 2000

Description:

Incorporates Ninja 'path' project for caching ... Partner with Ninja to manage local metadata? Work with GUIR on interacting with streams? ... – PowerPoint PPT presentation

Number of Views:118
Avg rating:3.0/5.0
Slides: 32
Provided by: JosephMHe5
Learn more at: https://dsf.berkeley.edu
Category:

less

Transcript and Presenter's Notes

Title: Telegraph Endeavour Retreat 2000


1
TelegraphEndeavour Retreat 2000
  • Joe Hellerstein

2
Roadmap
  • Motivation Goals
  • Application Scenarios
  • Quickie core technology overview
  • Adaptive dataflow
  • Event-based storage manager
  • Come hear more about these tonight/tomorrow!
  • Status and Plans
  • Dataflow infrastructure apps
  • Storage manager?

3
Motivations
  • Global Data Federation
  • All the data is online what are we waiting for?
  • The plumbing is coming
  • XML/HTTP, XML/WAP, etc. give LCD communication
  • but how do you flow, summarize, query and analyze
    data robustly over many sources in the wide area?
  • Ubiquitous computing more than clients
  • sensors and their data feeds are key
  • smart dust, biomedical (MEMS sensors)
  • each consumer good records (mis)use
  • disposable computing
  • video from surveillance cameras, broadcasts, etc.
  • Huge Data flood acomin!
  • will it capsize the good ship Endeavour?

4
Initial Telegraph Goals
  • Unify data access dataflow apps
  • Commercial wrappers for most infosources
  • Most info-centric apps can be cast as dataflow
  • The data flood needs a big dataflow manager!
  • Goal a robust, adaptive dataflow engine
  • Unify storage
  • Currently lots of disparate data stores
  • Databases, Files, Email servers (and http access
    on these)
  • Goal A single, clean storage manager that can
    serve
  • DB records semantics
  • Files and semantics
  • Email folders, calendars, etc. and semantics

5
Challenge for Dataflow Volatility!
  • Federated query processors
  • A la Cohera, IBM DataJoiner
  • No control over stats, performance,
    administration
  • Large Cluster Systems Scaling Out
  • No control over system balance
  • User CONTROL of running dataflows
  • Long-running dataflow apps are interactive
  • No control over user interaction
  • Sensor Nets
  • No control over anything!
  • Telegraph
  • Dataflow Engine for these environments

6
The Data Flood Main Features
  • What does it look like?
  • Never ends interactivity required
  • Online, controllable algorithms for all tasks!
  • Big data reduction/aggregation is key
  • Volatile this scale of devices and nets will not
    behave nicely

7
The Telegraph Dataflow Engine
  • Key technologies
  • Interactive Control
  • interactivity with early answers and examples
  • online aggregation for data reduction
  • Dataflow programming via paths/iterators
  • Elevate query processing frameworks out of DBMSs
  • Long tradition of static optimization here
  • Suggestive, but not sufficient for volatile
    environments
  • Continuously adaptive flow optimization
  • massively parallel, adaptive dataflow
  • Rivers and Eddies

8
Static Query Plans
  • Volatile environments like sensors need to adapt
    at a much finer grain

9
Continuous Adaptivity Eddies
Eddy
  • How to order and reorder operators over time
  • based on performance, economic/admin feedback
  • Vs.River
  • River optimizes each operator horizontally
  • Eddies optimize a pipeline vertically

10
Unifying Storage
  • Storage management buried inside specific systems
  • Elevate and expose the core services semantic
    options
  • Layout/indexing
  • Concurrent access/modification
  • Recovery
  • Design for clustered environments
  • Replicate for reliability (tie-ins with Ninja)
  • Cluster options your RAM vs. my disk
  • Events State Machines for scalability
  • Unify eventflow and dataflow?
  • Share optimization lessons?

11
Status Adaptive Dataflow
  • Initial Eddy results promising, well received
    (SIGMOD 2K)
  • Finishing Telegraph v0 in Java/Jaguar
  • Prototype now running
  • Demo service to go live on web this summer
  • Analysis queries over web sites
  • Weve picked a provocative app to go live with
    (stay tuned!)
  • Incorporates Ninja path project for caching
  • Goal Telegraph is to facts and figures as
    search engines are to documents
  • Longer-term goals
  • Formalize optimize Eddy/River scheduling
    policies
  • Study HCI/systems/stats issues for interaction
  • Crawl Dark Matter on the web
  • Attack streams from sensors
  • Sequence queries and mining, data reduction,
    browsing, etc.

12
Status Unified Storage Manager
  • Prototype implementation in Java/Jaguar
  • ACID transactions (non-ACID) Java file access
  • Robust enough to get TPC-W numbers
  • Events/states vs. threads
  • Echoes Gribble/Welsh results better than
    threaded under load, but Java complicates
    detailed mesurement
  • Time to re-evaluate importance of this part
  • Interest? More mindshare in dataflow
    infrastructure.
  • Vs. tuning an off-the-shelf solution (e.g.
    Berkeley DB)?
  • Goal? unified lessons about dataflow/eventflow
    optimization on clusters.

13
Integration with Rest of Endeavour
  • Give
  • Be dataflow backbone for diverse clients
  • Our own Telegraph apps (federated dataflow,
    sensors)
  • Replication/delivery dataflow engine for
    OceanStore
  • Scalable infrastructure for tacit info mining
    algorithms?
  • Pipes for next version of Iceberg?
  • Telegraph Storage Manager provides storage
    (xactional/otherwise) for OceanStore? Ninja?
  • Take
  • OceanStore to manage distributed metadata,
    security
  • Leverage protocols out of TinyOS for sensors
  • Partner with Ninja to manage local metadata?
  • Work with GUIR on interacting with streams?

14
More Info
  • People
  • Joe Hellerstein, Mike Franklin, Eric Brewer,
    Christos Papadimitriou
  • Sirish Chandrasekaran, Amol Deshpande, Kris
    Hildrum, Sam Madden, Vijayshankar Raman, Mehul
    Shah
  • Software
  • http//telegraph.cs.berkeley.edu coming soon
  • ABC interactive data anlysis/cleansing at
    http//control.cs.berkeley.edu
  • Papers
  • See http//db.cs.berkeley.edu/telegraph

15
Extra slides for backup
16
Connectivity Heterogeneity
  • Lots of folks working on data format translation,
    parsing
  • we will borrow, not build
  • currently using JDBC Cohera Net Query
  • commercial tool, donated by Cohera Corp.
  • gateways XML/HTML (via http) to ODBC/JDBC
  • we may write Teletalk gateways from sensors
  • Heterogeneity
  • never a simple problem
  • Control project developed interactive, online
    data transformation tool ABC

17
CONTROLContinuous Output and Navigation
Technology with Refinement On Line
  • Data-intensive jobs are long-running. How to
    give early answers and interactivity?
  • online interactivity over feeds
  • pipelining online operators, data juggle
  • online data correlation algs ripple joins,
    online mining and aggregation
  • statistical estimators, and their performance
    implications
  • Deliver data to satisfy statistical goals
  • Appreciate interplay of massive data processing,
    stats, and HCI
  • Of all men's miseries, the bitterest is this to
    know so much and have control over nothing
  • Herodotus

18
Performance Regime for CONTROL
  • New Greedy Performance Regime
  • Maximize 1st derivative of the user-happiness
    function

100
CONTROL
?
Traditional
Time
19
CONTROLContinuous Output and Navigation
Technology with Refinement On Line
20
CONTROLContinuous Output and Navigation
Technology with Refinement On Line
21
River
  • We built the worlds fastest sorting machine
  • On the NOW 100 Sun workstations SAN
  • But it only beat the record under ideal
    conditions!
  • River performance adaptivity for data flows on
    clusters
  • simplifies management and programming
  • perfect for sensor-based streams

22
Declarative Dataflow NOT new
  • Database Systems have been doing this for years
  • Xlate declarative queries into an efficient
    dataflow plan
  • query optimization considers
  • Alternate data sources (access methods)
  • Alternate implementations of operators
  • Multiple orders of operators
  • A space of alternatives defined by transformation
    rules
  • Estimate costs and data rates, then search
    space
  • But in a very static way!
  • Gather statistics once a week
  • Optimize query at submission time
  • Run a fixed plan for the life of the query
  • And these ideas are ripe to elevate out of DBMSs
  • And outside of DBMSs, the world is very volatile
  • There are surely going to be lessons outside the
    box

23
Static Query Plans
  • Volatile environments like sensors need to adapt
    at a much finer grain

24
Continuous Adaptivity Eddies
Eddy
  • How to order and reorder operators over time
  • based on performance, economic/admin feedback
  • Vs.River
  • River optimizes each operator horizontally
  • Eddies optimize a pipeline vertically

25
Competitive Eddies
26

27
Potters Wheel Anomaly Detection
28
The Data Flood is Real
Source J. Porter, Disk/Trend, Inc. http//www.dis
ktrend.com/pdf/portrpkg.pdf
29
Disk Appetite, cont.
  • Greg Papadopoulos, CTO Sun
  • Disk sales doubling every 9 months
  • Note only counts the data were saving!
  • Translate
  • Time to process all your data doubles every 18
    months
  • MOORES LAW INVERTED!
  • (and Moores Law may run out in the next couple
    decades?)
  • Big challenge (opportunity?) for SW systems
    research
  • Traditional scalability research wont help
  • Ideal linear scaleup is NOT NEARLY ENOUGH!

30
Data Volume Prognostications
  • Today
  • SwipeStream
  • E.g. Wal-Mart 24 Tb Data Warehouse
  • ClickStream
  • Web
  • Internet Archive ?? Tb
  • Replicated OS/Apps
  • Tomorrow
  • Sensors Galore
  • DARPA/Berkeley Smart Dust
  • Note the privacy issues onlyget more complex!
  • Both technically and ethically

Temperature, light, humidity, pressure,
accelerometer, magnetics
31
Explaining Disk Appetite
  • Areal density increases 60/yr
  • Yet Mb/ rises much faster!

Source J. Porter, Disk/Trend, Inc. http//www.dis
ktrend.com/pdf/portrpkg.pdf
Write a Comment
User Comments (0)
About PowerShow.com