Enabling The Fortune One Million - PowerPoint PPT Presentation

Title: Enabling The Fortune One Million

1
Enabling The Fortune One Million

Armando Fox, Stanford University
With Randy Katz, Michael Jordan, Dave Patterson,
Scott Shenker, Ion Stoica
IBM P3AD, April 2006

2
Enabling The Fortune One Million

Armando Fox, Berkeley RAD Lab
With Randy Katz, Michael Jordan, Dave Patterson,
Scott Shenker, Ion Stoica
IBM P3AD, April 2006

3
This talk

Untainted by results, proofs, theorems, etc.
BUT an attempt to formulate an operational
vision statement for self-* systems and a
strategy for attaining it,
presumed achievable based on bona fide previous
results

4
Steps vs. Process
Cap, Dado (the section of a pedestal between
cap and base), Base

Steps: Traditional, Static Handoff Model, N groups

Process: Support DADO Evolution, 1 group

1995: Pierre Omidyar develops & deploys eBay 1.0
over a long weekend
Has been rewritten twice since then.
2005: HousingMaps.com connects Google Maps and
Craigslist apartment listings
Spawned a whole bunch of map mashups -
GoogleMapsMania blog
2006: bug in Spell With Flickr mashup results
in author being slapped for driving too much traffic
Prototyping getting easier; deployment at scale
getting harder at least as fast

6
RAD Lab 5-year Mission

Provide tools, technology platform to allow a
single person to Develop, Assess, Deploy, and
Operate the next-generation IT service
Enables The Fortune 1 Million
Major partnership with industry
Rest of this talk
Early progress on DADO
Reflections on upcoming technical challenges &
opportunities
Lab organization, tie-in to course plans, etc.

RAD Lab: Robust, Adaptive, Distributed
systems
7
RAD Lab Challenges Center

The Challenges
Develop a new Service using tools that facilitate
rapid prototyping
Assess: Measuring, Testing, and Debugging the new
Service in a realistic distributed environment
Deploy: Scaling up a new, geographically
distributed Service
Operate a service that could quickly scale to
millions of users
The Vehicle
Interdisciplinary Center creates core technical
competency to demo 10X to 100X
Researchers are leaders in machine learning,
networking, and systems
Industrial Participants: leading companies in HW,
systems SW, and online services

8
Science is a way to create knowledge

But science is also about understanding complex
artifacts
What is the science in services science?
Going from raw observations of (complex) system
behavior to actionable interpretations
Improving and creating systematic methodologies
for the DADO steps
Creating systematic connections among the DADO
process stages

Jim Spohrer at IBM University Day, Almaden,
April 2006
9
DADO - Develop: Joy of Middleware

Dominant way to deploy services
Innovate below abstraction
Unmodified/proprietary apps can benefit (e.g.
from instrumentation)
Modern middleware tends to make apps more
declarative
CORBA > J2EE > Ruby on Rails
Can get things running immediately
But usually end up being rewritten
NCSA httpd > Apache, eBay 0.9 > eBay 2.0,
Google.stanford.edu > Google.com
Challenge: can we get the best of both worlds?

10
Examples: Understanding the curse of success
11
DADO - Assess example: Packet Annotations

Create new ways to collect information over
distributed network: Annotation Layer
Incrementally deployable on existing
infrastructure
iBoxes label packets at annotation layer but do
not change original packet payloads
Expose annotations to application layer

Application
Presentation
Session
Annotations
Transport
Network
Link
Phy
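The annotation-layer idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Packet`, `AnnotatedPacket`, `ibox_label`, and `deliver` names, the iBox identifier, and the label fields are all hypothetical and do not come from the talk. The sketch shows the two properties the slide names: labels are attached alongside the packet without changing its payload, and the application layer can read them.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Packet:
    """Original packet; frozen so no annotation step can alter it."""
    src: str
    dst: str
    payload: bytes


@dataclass
class AnnotatedPacket:
    """Packet plus an annotation map that travels next to it."""
    packet: Packet
    annotations: dict = field(default_factory=dict)


def ibox_label(ap: AnnotatedPacket, ibox_id: str, **labels) -> AnnotatedPacket:
    """Hypothetical iBox step: add labels keyed by iBox, leave payload as is."""
    merged = dict(ap.annotations)
    merged[ibox_id] = labels
    return AnnotatedPacket(ap.packet, merged)


def deliver(ap: AnnotatedPacket):
    """Expose both the untouched payload and the annotations to the app."""
    return ap.packet.payload, ap.annotations


pkt = Packet("10.0.0.1", "10.0.0.2", b"GET /index.html")
ap = ibox_label(AnnotatedPacket(pkt), "ibox-edge-1", latency_ms=3, suspicious=False)
payload, notes = deliver(ap)
```

Keeping the annotations in a separate map, rather than rewriting the packet, is what would make such a layer incrementally deployable: hosts that ignore the map still see the original packet.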
12
iBox Placement for Observation and Action
iBoxes strategically placed near entry/exit
points within the Enterprise network
13
DADO - Assess: Distributed Debugging

Allows inspection of snapshots of distributed
app state
Faithful replay of distributed apps
virtual (Lamport) clocks allow consistent replay
Has found bugs in Chord/I3 and itself
Works with existing toolchains
Transparently intercepts libc calls
Extends gdb UI

> replay 132.239.6.225
... running
> break update_state()
... 1 set line 75
> advance 10000000
... done
> fix bug for me
user
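The slide says only that virtual (Lamport) clocks allow consistent replay; it does not describe the debugger's implementation. As a minimal sketch of the general technique, assuming a hypothetical `LamportClock` class and a made-up two-node event log, the following shows why ordering logged events by Lamport timestamp never replays a receive before its matching send.

```python
class LamportClock:
    """Textbook Lamport logical clock (illustrative, not the tool's code)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the clock by one.
        self.time += 1
        return self.time

    def send(self):
        # Sending is a local event; the timestamp travels with the message.
        return self.tick()

    def receive(self, msg_time):
        # Jump past the sender's timestamp so the receive is ordered after the send.
        self.time = max(self.time, msg_time) + 1
        return self.time


log = []
a, b = LamportClock(), LamportClock()
log.append((a.tick(), "A", "local update"))
t = a.send()
log.append((t, "A", "send m"))
log.append((b.tick(), "B", "local update"))
log.append((b.receive(t), "B", "recv m"))

# Replay order: sort by (timestamp, node id); ties are concurrent events.
replay_order = sorted(log, key=lambda e: (e[0], e[1]))
```

Events with equal timestamps are causally unrelated, so breaking ties by node id gives one deterministic order among several that are all consistent.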
14
DADO - Deploy: RAMP

How can academics experiment with systems of
1000 nodes?
RAMP (Research Accelerator for Multiple
Processors) for parallel HW & SW research
Single FPGA: 25 CPUs + caches in 2005
$100k: 4 FPGAs / board, 4 DIMMs / FPGA, 10-20
boards, low-cost Storage Server over Ethernet
? 1000 CPUs, 256 MB DRAM/CPU, 20 GB disk
storage/CPU
Parts of RAMP-1 prototype already running
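The 1000-CPU figure follows from the per-FPGA and per-board numbers on this slide. The short calculation below is a sketch that assumes the 2005 density of 25 CPUs per FPGA holds for every FPGA on every board; the function name `ramp_capacity` and the unit conversions (1024 MB per GB, 1000 GB per TB) are choices made here, not figures from the talk.

```python
CPUS_PER_FPGA = 25       # from the slide (2005 figure)
FPGAS_PER_BOARD = 4      # from the slide
DRAM_MB_PER_CPU = 256    # from the slide
DISK_GB_PER_CPU = 20     # from the slide


def ramp_capacity(boards):
    """Aggregate CPU count, DRAM, and disk for a given number of boards."""
    cpus = CPUS_PER_FPGA * FPGAS_PER_BOARD * boards
    return {
        "cpus": cpus,
        "dram_gb": cpus * DRAM_MB_PER_CPU / 1024,
        "disk_tb": cpus * DISK_GB_PER_CPU / 1000,
    }


low = ramp_capacity(10)    # lower end of the 10-20 board range
high = ramp_capacity(20)   # upper end of the range
```

Under these assumptions, 10 boards already reach 1000 CPUs, and 20 boards would double that.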

15
Using RAMP

Current status
Smaller-size board, 4 machines
4 MicroBlaze cores, Micro-C/Linux, TCP/IP, NFS,
Telnet, httpdCGI, Python Ruby coming soon
Short term plans
Instrumentation plane: think OpenView, but we can
instrument whatever we want
Simulate run simple Web apps on many many 100MHz
CPUs
Longer term plans
Simulate wide-area networks at scales impractical
on PlanetLab
Understand Datacenter-in-a-Box model

16
DADO - Operate

Apply statistical machine learning to find
patterns in behavior of complex software
Example: correlate high-level site health metrics
with low-level fingerprints associated with bad
health > info retrieval
Example: sample annotated software features
(language-level constructs) and correlate feature
sets with failed runs to help pinpoint bugs
Combine SML with visualization so operator sees &
understands significance of anomalies
Promising early results but just the tip of the
iceberg
State of the art for visualization in operator
tools is very immature
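The second example above, correlating sampled feature sets with failed runs, can be illustrated with a very simple score. This is a sketch, not the method used in the RAD Lab work: the scoring rule (failure rate among runs where a feature was observed, minus the overall failure rate), the `rank_features` function, and the feature names are all assumptions made for illustration.

```python
from collections import defaultdict


def rank_features(runs):
    """Rank features by how much more often runs fail when the feature is seen.

    runs: list of (set_of_observed_features, failed_flag) pairs.
    """
    baseline = sum(1 for _, failed in runs if failed) / len(runs)
    seen = defaultdict(int)
    seen_failed = defaultdict(int)
    for features, failed in runs:
        for f in features:
            seen[f] += 1
            if failed:
                seen_failed[f] += 1
    scores = {f: seen_failed[f] / seen[f] - baseline for f in seen}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical sampled runs: feature names are made up.
runs = [
    ({"cache_miss", "retry"}, True),
    ({"cache_miss"}, False),
    ({"retry", "null_check_failed"}, True),
    ({"cache_miss"}, False),
    ({"null_check_failed"}, True),
    ({"retry"}, False),
]
ranking = rank_features(runs)
```

Here `null_check_failed` appears only in failed runs and ranks first, while `cache_miss` appears mostly in passing runs and ranks last. A real system would need far more runs and a proper statistical test before trusting such a ranking.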

17
Some results so far
18
Signatures - example

Metric has value 1 if it is attributed with the
violation, -1 if it is not attributed, 0 if it is
not relevant
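A signature as described here is a vector over metrics with entries 1, -1, or 0. The sketch below builds two such vectors and compares them. The metric names, the `make_signature` helper, and the choice of cosine similarity are assumptions for illustration; the slide does not say which similarity measure is used.

```python
import math

# Hypothetical metric names; the real metric set is not given in the talk.
METRICS = ["cpu_util", "db_latency", "queue_len", "net_in"]


def make_signature(attributed, relevant):
    """1 if attributed to the violation, -1 if relevant but not attributed, 0 if not relevant."""
    return [
        (1 if m in attributed else -1) if m in relevant else 0
        for m in METRICS
    ]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


past = make_signature(
    attributed={"db_latency", "queue_len"},
    relevant={"cpu_util", "db_latency", "queue_len"},
)
new = make_signature(
    attributed={"db_latency", "queue_len"},
    relevant={"db_latency", "queue_len", "net_in"},
)
similarity = cosine(past, new)
```

With these inputs `past` is `[-1, 1, 1, 0]`, `new` is `[0, 1, 1, -1]`, and the similarity is 2/3, reflecting agreement on the two attributed metrics.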

Attribution
19
DADO - Operate: Open-source-like database of
traces/logs

Goal: large trace-like database of failure
logs and other relevant failure data for
research use
So far:
Complete source & sanitized logs of 3 Flickr
mashup front-ends
Working with affiliates to make public a
sanitized version of data used in our early
results papers
Access to Microsoft desktop crash data collected
via BOINC (paper submission forthcoming)

20
Reflections: A good time to be using SML

Technology supports use of SML
Even building blocks of systems are
sufficiently complex, instrumentable, and have
large user bases
Advances in online algorithms research make good
fit for long-running systems
Moore's Law: nontrivial models can be induced and
evaluated in soft real time (seconds) for many
of these systems
Domain expertise in systems still needed
We will develop a corps of researchers whose
strength is SML/Systems crossover

21
Reflections: What's new in Services Science?
(or: SOAs becoming real... uh-oh!)

Workload challenges
AJAX and mashups change workload seen by back-end
servers
Nonlinear dynamics of changing workloads will
make spike provisioning more challenging (e.g.,
Flickr mashups)
Long tail management challenges
For every Amazon or Google, 1000s of smaller
services
This ratio will increase rapidly as ease of
deploying a meta-service increases
Managing 1000s of different services sharing
resources will be harder than managing one
mega-service
decoupled control loops (Jeff Mogul)
Even if each service is well-regulated, can we
say anything about the meta-services? (like
mashups)

22
Reflections: An interesting opportunity

Difficult-to-scale functionalities increasingly
being offered as utility services
This is mostly why SOA is taking off!
Storage: Google Base, OpenDHT, Amazon S3 (e.g., new
client-side Wiki software that uses S3 and no
other server!), Salesforce.com
Mapping/GIS: Google & Yahoo Maps
Build customized searches using search engine
APIs
Future: functions like MapReduce?
Indeed, mashups are often not much more than
front ends of computation + soft state
This should be easy to scale!! The rubber meets
the road here.
Experiments to be done soon on RAMP, which I'll
talk about shortly

23
RAD Lab Organization

$2.5M/year: 70% industry, 20% State, 10% Fed.
govt (NSF)
30 grad students, 10 undergrads, 6 faculty, 2
staff
Founding Companies: Sun, Google, Microsoft
Affiliate Members include Verisign, IBM,
Hewlett-Packard, NTT, Oracle, Nortel
Mid project review after 3 years by founding
partners
Benefits to Affiliates RAD Lab
Prefer founding partner technology in prototypes
Designate employees to act as consultants/liaisons
Real-life training for next generation of IT
researchers
Research based on real systems data (logs,
forensics, etc.)

24
Industrial collaboration

Intellectual property policy
Nonexclusive, royalty-free IP license so partners
not sued--BSD license (text available at
opensource.org)
Head start on research results for affiliates
(6-month embargo)
Impact from previous projects
RAID, RISC, NOW - multibillion-dollar industries
Berkeley regularly ranked in top 3 for systems
research (#1 this year, tied with MIT)

25
Education/Course plans

We're not teaching students to think in terms of
a hosted-service development model
Inheriting, understanding, extending other
people's long-lived code
How you do testing and upgrades for an online
24x7 service
Technologies we used to teach in detail are now
encapsulated as open source, running code
Symmetric multithreaded I/O-intensive apps
(Apache)
Transactions and concurrency control (MySQL)
What's important is understanding these as system
building blocks
What are their interactions?
What tradeoffs are involved in composing these in a
system? What price do I pay (in performance,
robustness, or whatever) for selecting a given
behavior?

26
Course plans

Year 1: Graduate project courses
Improve the RAD Lab platform, infrastructure,
technologies
Year 2: Undergrad courses
Develop, assess, deploy, operate new apps on RAD
Lab hosting service
Improve other people's existing services, all in
a hosted environment
Year 3: Joint courses between CS and
Business/Management
Design business model along with app
Understand how business concerns affect DADO
process
Consistent with IBM's SSME vision of creating a
multidisciplinary corps of service scientists

27
Summary

Technology bets: SML, visualization, FPGAs
Will help us better understand the behaviors of
these complex distributed systems
Will let us run credible experiments at scale
Will improve the tools available for operators
Eventual goal: Fortune 1 Million
1 person can design, deploy and operate next eBay
without building an eBay-sized organization
Strong ties to industry
Integrated with course offerings/curriculum
http://radlab.cs.berkeley.edu

28
Acknowledgments

RAD Lab sponsors & founders
Co-PIs: Patterson, Katz, Stoica, Shenker, Jordan
Students whose work was mentioned in this talk:
Archana Ganapathi, Peter Bodik, Dennis Geels, Wei
Xu, ...

29
BACKUP SLIDES
30
DADO - Develop: Back-end building-block services

Servers serve client programs--not people
Ideas like user-based anomaly detection don't
work
Workloads: higher volume & different profile
(e.g., prefetching for Google Maps)
Aggregational services multiply workloads (1
HousingMaps hit = N Craigslist hits + N Google
Maps hits)
Distributed debugging is tough because the other
sites are not under your control
Large sophisticated sites already deal with this
internally
but must now deal with less-predictable workload,
evil, etc

31
DADO - Operate: Capturing Operator Actions

Systematically capture, index, retrieve operator
actions during incident response
Operator's role largely ignored in most current
work
Goal: try to capture semantics of how operators
are thinking when they react to a problem
Gradually increase trust by suggesting actions
based on past history
Auditable recommendations: let operator explore
how the recommendation was made
Various techniques possible--reinforcement
learning, expert systems, collaborative
filtering...

32
Capturing operator actions

monitoring operators
web-based tools: look at web server access logs
Unix command line: sudo logs or history
stand-alone GUI tools: instrument them?
trouble ticket DB: operators involved, start/end,
type of problem
Challenge: extract sufficient semantics to allow
cross-analysis of sources (timestamps, intent
of an action, etc.)
similarity metric for failures
e.g., compare signatures of failures/problems
clickstream analysis and data mining
e-commerce sites already do this--for their
customers

33
RAD Lab Opportunity: New Research Model

Chance to Partner with the Top University in
Computer Systems on the Next Great Thing
National Academy of Engineering mentions Berkeley
in 7 of 19 $1B industries that came from IT
research
NAE mentions Berkeley 7 times, Stanford 5 times,
MIT 5, CMU 3: Timesharing (SDS 940), Client-Server
Computing (BSD Unix), Graphics, Entertainment,
Internet, LANs, Workstations, GUI, VLSI Design
(Spice) [ECAD $5B?/yr], RISC [$10B?/yr],
Relational DB (Ingres/Postgres) [RDB $15B?/yr],
Parallel DB, Data Mining, Parallel Computing,
RAID [$15B?/yr], Portable Communication (BWRC),
WWW, Speech Recognition, Broadband
Berkeley one of the top suppliers of systems
students to industry and academia
US News & World Report ranking of CS Systems
universities: 1 Berkeley, 2 CMU, 2 MIT, 4
Stanford, 5 Washington
For example: Quanta (Taiwan PC laptop clone
manufacturer) funds MIT CSAIL @ $4M/year for 5
years to reinvent PC, April 2005 (Tparty)
RAID project (4 faculty, 20 grads, 10 undergrads)
helped create $15B industry, but not fundable
today at DARPA, NSF

34
RAD Lab: Interdisciplinary Center for Reliable,
Adaptive, Distributed Systems

Working with different industries on long-range,
pre-competitive technology
Training of dozens of future leaders of IT, plus
their recruitment
Working with researchers with track records of
successful technology transfer

35
RAD Lab Timeline

2005: Launch RAD Lab
2006: Collect workloads, Internet in a Box
2007: SLT/CT distributed architectures, iBoxes,
annotative layer, class testing
2008: Development toolkit 1.0, tuple space, class
testing; Mid Project Review
2009: RAD Lab software suite 1.0, class testing
2010: End of Project Party

36
DADO - Operate

Others' ideas
Fast recovery means can afford false positives,
enabling automated recovery mechanisms for
servers via SLT algorithms
Microreboot exemplifies Repair as local
adaptation
Safety achieved by state separation
Linear Control Theory places constraints on SW
architectures
Will restricting systems to be controllable
make them easier to operate by humans as well as by
simple controllers?
Will cost-performance still be good enough for
controllable systems?
SLT helpful in diagnosing failed components

37
DADO - Develop: Control-Theory-Friendly Systems

Problem: server-like system consisting of stages
separated by queues
Lack of balance across stages results in
performance hiccups
Straightforward application of LTI control theory
to regulate queue lengths via combination of
admission control & filtering
Insight: build systems to allow the use of simple
linear controllers
Example: Farsite Scalability TR identifies
Farsite properties that prevent it from being a
good candidate for CT
Could Farsite be architected to avoid those
properties? At what cost?
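To make the queue-regulation idea concrete, here is a minimal discrete-time sketch of one stage with a proportional admission controller. Everything in it is an assumption for illustration: the single-queue model, the `simulate` function, the gain `kp`, and the rates are not from the talk, and the clamping of the admitted rate makes the loop only approximately linear.

```python
def simulate(arrival_rate, service_rate, steps, target=None, kp=0.5):
    """Simulate queue length per step; if target is set, apply admission control.

    Controller: admit service_rate + kp * (target - queue), clamped to
    the range [0, arrival_rate]. With target=None, everything is admitted.
    """
    queue = 0.0
    history = []
    for _ in range(steps):
        if target is None:
            admitted = arrival_rate
        else:
            allowed = service_rate + kp * (target - queue)
            admitted = min(arrival_rate, max(0.0, allowed))
        # Queue grows by admitted work and shrinks by served work.
        queue = max(0.0, queue + admitted - service_rate)
        history.append(queue)
    return history


# Overloaded stage: 15 requests arrive per step, 10 can be served.
uncontrolled = simulate(15.0, 10.0, steps=50)
controlled = simulate(15.0, 10.0, steps=50, target=20.0)
```

Without control the queue grows by 5 per step without bound; with the controller it settles at the target length of 20, because once the clamp is inactive the error shrinks by a factor of `1 - kp` each step. A stage whose dynamics are this simple is the kind of system the slide calls control-theory-friendly.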