Title: Telegraph Endeavour Retreat 2000
1TelegraphEndeavour Retreat 2000
2Roadmap
- Motivation Goals
- Application Scenarios
- Quickie core technology overview
- Adaptive dataflow
- Event-based storage manager
- Come hear more about these tonight/tomorrow!
- Status and Plans
- Dataflow infrastructure apps
- Storage manager?
3Motivations
- Global Data Federation
- All the data is online what are we waiting for?
- The plumbing is coming
- XML/HTTP, XML/WAP, etc. give LCD communication
- but how do you flow, summarize, query and analyze
data robustly over many sources in the wide area? - Ubiquitous computing more than clients
- sensors and their data feeds are key
- smart dust, biomedical (MEMS sensors)
- each consumer good records (mis)use
- disposable computing
- video from surveillance cameras, broadcasts, etc.
- Huge Data flood acomin!
- will it capsize the good ship Endeavour?
4Initial Telegraph Goals
- Unify data access dataflow apps
- Commercial wrappers for most infosources
- Most info-centric apps can be cast as dataflow
- The data flood needs a big dataflow manager!
- Goal a robust, adaptive dataflow engine
- Unify storage
- Currently lots of disparate data stores
- Databases, Files, Email servers (and http access
on these) - Goal A single, clean storage manager that can
serve - DB records semantics
- Files and semantics
- Email folders, calendars, etc. and semantics
5Challenge for Dataflow Volatility!
- Federated query processors
- A la Cohera, IBM DataJoiner
- No control over stats, performance,
administration - Large Cluster Systems Scaling Out
- No control over system balance
- User CONTROL of running dataflows
- Long-running dataflow apps are interactive
- No control over user interaction
- Sensor Nets
- No control over anything!
- Telegraph
- Dataflow Engine for these environments
6The Data Flood Main Features
- What does it look like?
- Never ends interactivity required
- Online, controllable algorithms for all tasks!
- Big data reduction/aggregation is key
- Volatile this scale of devices and nets will not
behave nicely
7The Telegraph Dataflow Engine
- Key technologies
- Interactive Control
- interactivity with early answers and examples
- online aggregation for data reduction
- Dataflow programming via paths/iterators
- Elevate query processing frameworks out of DBMSs
- Long tradition of static optimization here
- Suggestive, but not sufficient for volatile
environments - Continuously adaptive flow optimization
- massively parallel, adaptive dataflow
- Rivers and Eddies
8Static Query Plans
- Volatile environments like sensors need to adapt
at a much finer grain
9Continuous Adaptivity Eddies
Eddy
- How to order and reorder operators over time
- based on performance, economic/admin feedback
- Vs.River
- River optimizes each operator horizontally
- Eddies optimize a pipeline vertically
10Unifying Storage
- Storage management buried inside specific systems
- Elevate and expose the core services semantic
options - Layout/indexing
- Concurrent access/modification
- Recovery
- Design for clustered environments
- Replicate for reliability (tie-ins with Ninja)
- Cluster options your RAM vs. my disk
- Events State Machines for scalability
- Unify eventflow and dataflow?
- Share optimization lessons?
11Status Adaptive Dataflow
- Initial Eddy results promising, well received
(SIGMOD 2K) - Finishing Telegraph v0 in Java/Jaguar
- Prototype now running
- Demo service to go live on web this summer
- Analysis queries over web sites
- Weve picked a provocative app to go live with
(stay tuned!) - Incorporates Ninja path project for caching
- Goal Telegraph is to facts and figures as
search engines are to documents - Longer-term goals
- Formalize optimize Eddy/River scheduling
policies - Study HCI/systems/stats issues for interaction
- Crawl Dark Matter on the web
- Attack streams from sensors
- Sequence queries and mining, data reduction,
browsing, etc.
12Status Unified Storage Manager
- Prototype implementation in Java/Jaguar
- ACID transactions (non-ACID) Java file access
- Robust enough to get TPC-W numbers
- Events/states vs. threads
- Echoes Gribble/Welsh results better than
threaded under load, but Java complicates
detailed mesurement - Time to re-evaluate importance of this part
- Interest? More mindshare in dataflow
infrastructure. - Vs. tuning an off-the-shelf solution (e.g.
Berkeley DB)? - Goal? unified lessons about dataflow/eventflow
optimization on clusters.
13Integration with Rest of Endeavour
- Give
- Be dataflow backbone for diverse clients
- Our own Telegraph apps (federated dataflow,
sensors) - Replication/delivery dataflow engine for
OceanStore - Scalable infrastructure for tacit info mining
algorithms? - Pipes for next version of Iceberg?
- Telegraph Storage Manager provides storage
(xactional/otherwise) for OceanStore? Ninja? - Take
- OceanStore to manage distributed metadata,
security - Leverage protocols out of TinyOS for sensors
- Partner with Ninja to manage local metadata?
- Work with GUIR on interacting with streams?
14More Info
- People
- Joe Hellerstein, Mike Franklin, Eric Brewer,
Christos Papadimitriou - Sirish Chandrasekaran, Amol Deshpande, Kris
Hildrum, Sam Madden, Vijayshankar Raman, Mehul
Shah - Software
- http//telegraph.cs.berkeley.edu coming soon
- ABC interactive data anlysis/cleansing at
http//control.cs.berkeley.edu - Papers
- See http//db.cs.berkeley.edu/telegraph
15Extra slides for backup
16Connectivity Heterogeneity
- Lots of folks working on data format translation,
parsing - we will borrow, not build
- currently using JDBC Cohera Net Query
- commercial tool, donated by Cohera Corp.
- gateways XML/HTML (via http) to ODBC/JDBC
- we may write Teletalk gateways from sensors
- Heterogeneity
- never a simple problem
- Control project developed interactive, online
data transformation tool ABC
17CONTROLContinuous Output and Navigation
Technology with Refinement On Line
- Data-intensive jobs are long-running. How to
give early answers and interactivity? - online interactivity over feeds
- pipelining online operators, data juggle
- online data correlation algs ripple joins,
online mining and aggregation - statistical estimators, and their performance
implications - Deliver data to satisfy statistical goals
- Appreciate interplay of massive data processing,
stats, and HCI
- Of all men's miseries, the bitterest is this to
know so much and have control over nothing - Herodotus
18Performance Regime for CONTROL
- New Greedy Performance Regime
- Maximize 1st derivative of the user-happiness
function
100
CONTROL
?
Traditional
Time
19CONTROLContinuous Output and Navigation
Technology with Refinement On Line
20CONTROLContinuous Output and Navigation
Technology with Refinement On Line
21River
- We built the worlds fastest sorting machine
- On the NOW 100 Sun workstations SAN
- But it only beat the record under ideal
conditions! - River performance adaptivity for data flows on
clusters - simplifies management and programming
- perfect for sensor-based streams
22Declarative Dataflow NOT new
- Database Systems have been doing this for years
- Xlate declarative queries into an efficient
dataflow plan - query optimization considers
- Alternate data sources (access methods)
- Alternate implementations of operators
- Multiple orders of operators
- A space of alternatives defined by transformation
rules - Estimate costs and data rates, then search
space - But in a very static way!
- Gather statistics once a week
- Optimize query at submission time
- Run a fixed plan for the life of the query
- And these ideas are ripe to elevate out of DBMSs
- And outside of DBMSs, the world is very volatile
- There are surely going to be lessons outside the
box
23Static Query Plans
- Volatile environments like sensors need to adapt
at a much finer grain
24Continuous Adaptivity Eddies
Eddy
- How to order and reorder operators over time
- based on performance, economic/admin feedback
- Vs.River
- River optimizes each operator horizontally
- Eddies optimize a pipeline vertically
25Competitive Eddies
26 27Potters Wheel Anomaly Detection
28The Data Flood is Real
Source J. Porter, Disk/Trend, Inc. http//www.dis
ktrend.com/pdf/portrpkg.pdf
29Disk Appetite, cont.
- Greg Papadopoulos, CTO Sun
- Disk sales doubling every 9 months
- Note only counts the data were saving!
- Translate
- Time to process all your data doubles every 18
months - MOORES LAW INVERTED!
- (and Moores Law may run out in the next couple
decades?) - Big challenge (opportunity?) for SW systems
research - Traditional scalability research wont help
- Ideal linear scaleup is NOT NEARLY ENOUGH!
30Data Volume Prognostications
- Today
- SwipeStream
- E.g. Wal-Mart 24 Tb Data Warehouse
- ClickStream
- Web
- Internet Archive ?? Tb
- Replicated OS/Apps
- Tomorrow
- Sensors Galore
- DARPA/Berkeley Smart Dust
- Note the privacy issues onlyget more complex!
- Both technically and ethically
Temperature, light, humidity, pressure,
accelerometer, magnetics
31Explaining Disk Appetite
- Areal density increases 60/yr
- Yet Mb/ rises much faster!
Source J. Porter, Disk/Trend, Inc. http//www.dis
ktrend.com/pdf/portrpkg.pdf