Title: Berkeley RAD Lab: Robust, Adaptive, Distributed Systems
1. Berkeley RAD Lab: Robust, Adaptive, Distributed Systems
- Armando Fox, Randy Katz, Michael Jordan, Dave Patterson, Scott Shenker, Ion Stoica
- November 2005
2. RAD Lab
- The 5-year Vision
  - Single person can go from vision to a next-generation IT service (the "Fortune 1 million")
  - E.g., over a long holiday weekend in 1995, Pierre Omidyar created eBay v1.0
- The Vehicle
  - Interdisciplinary Center creates core technical competency to demo 10X to 100X
  - Researchers are leaders in machine learning, networking, and systems
  - Industrial Participants: leading companies in HW, systems SW, and online services
  - Called RAD Lab for Reliable, Adaptable, Distributed systems
3. RAD Lab
[Figure: pedestal labeled Cap, Dado (the section of a pedestal between cap and base), and Base]
- The Science
  - Both shorter-term and longer-term solutions
  - Develop using primitives → functions (MapReduce), services (Craigslist)
  - Assess/debug using deterministic replay and finding new metrics
  - Deploy using Internet-in-a-Box via FPGAs under failure/slowdown workloads
  - Operate using Statistical Learning Theory-friendly, Control Theory-friendly software architectures and visualization tools
- Added Value to Industrial Participants
  - Working with leading people and companies from different industries on long-range, pre-competitive technology
  - Training of dozens of future leaders of IT in multiple disciplines, and their recruitment by industrial participants
  - Working with researchers with a successful track record of rapid transfer of new technology
4. Steps vs. Process
- Steps: Traditional, Static Handoff Model, N groups
- Process: Support DADO Evolution, 1 group
5. DADO - Develop
- Create abstractions, primitives, toolkit for large scale systems that make it easy to invent/deploy functions (e.g., MapReduce)
  - For example, Distributed Hash Tables (OpenDHT)
  - Already setting the trend for IETF standards
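To make the idea of a reusable primitive concrete, here is a minimal single-process sketch of a MapReduce-style function in Python. It is only an illustration of the programming model named on the slide, not the RAD Lab toolkit or any production system; the names `map_reduce`, `count_words`, and `total` are hypothetical.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group emitted values by key,
    then reduce each group to a single result.

    Hypothetical in-memory stand-in for a distributed primitive.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def count_words(line):
    """Mapper: emit (word, 1) for each whitespace-separated word."""
    for word in line.lower().split():
        yield word, 1


def total(key, values):
    """Reducer: sum the counts emitted for one word."""
    return sum(values)


if __name__ == "__main__":
    lines = ["the quick fox", "the lazy dog", "The end"]
    print(map_reduce(lines, count_words, total))
```

The developer writes only the mapper and the reducer; grouping (and, in a real system, distribution and fault handling) stays inside the primitive.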
6. DADO - Assess
- We improve what we can measure
  - Inspect box: visibility into networks, usually data poor
  - Servers: data rich, data often discarded
- Statistical and Machine Learning (SML) to the rescue. It works well when:
  - You have lots of raw data
  - You have reason to believe the raw data is related to some high-level effect you're interested in
  - You don't have a model of what that relationship is
- Note: SML advances → fast analysis
7. DADO - Deploy
- Re-engineer RAMP to act like a 1000 node distributed system under realistic failure and slowdown workloads
  - RAMP emulates data center and wide area systems as well as MPP
  - Collect and apply failure data from the real world
- RAMP vs. Clusters: larger scale, easier to develop/debug, flexible HW/SW configuration, inexpensive so no need to share
  - Explore via repeatable experiments while varying parameters and configurations, vs. observations on a single (aging) cluster that is often idiosyncratic
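One way to picture a repeatable failure/slowdown workload is a fault schedule generated from a fixed random seed, so the same experiment can be replayed exactly while one parameter is varied. The sketch below is an assumption-laden illustration in Python, not RAMP or its tooling; `NodeEvent`, `generate_fault_schedule`, and all parameter names are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeEvent:
    step: int   # simulated time step at which the fault occurs
    node: int   # index of the affected node
    kind: str   # "fail" or "slow"


def generate_fault_schedule(num_nodes, num_steps, fail_prob, slow_prob, seed):
    """Return a deterministic list of fault events for the given seed.

    Each node at each step independently fails with probability fail_prob
    or slows down with probability slow_prob (hypothetical fault model).
    """
    rng = random.Random(seed)  # private generator: same seed, same schedule
    events = []
    for step in range(num_steps):
        for node in range(num_nodes):
            draw = rng.random()
            if draw < fail_prob:
                events.append(NodeEvent(step, node, "fail"))
            elif draw < fail_prob + slow_prob:
                events.append(NodeEvent(step, node, "slow"))
    return events


if __name__ == "__main__":
    schedule = generate_fault_schedule(
        num_nodes=1000, num_steps=10, fail_prob=0.001, slow_prob=0.005, seed=42
    )
    print(len(schedule), "fault events; first few:", schedule[:3])
```

Re-running with the same seed reproduces the schedule, which is what makes a comparison across configurations meaningful. In practice the probabilities would presumably come from the collected real-world failure data rather than fixed constants.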
8. DADO - Operate
- Idea: when site misbehaves, users notice and change their behavior; use as failure detector
- Approach: combine visualization with Statistical and Machine Learning analysis so operator sees anomalies too
- Experiment: does distribution of hits to various pages match the historical distribution?
  - Each minute, compare hit counts of top N pages to hit counts over last 6 hours using Bayesian networks and χ² test, real Ebates data
To learn more, see "Combining Visualization and Statistical Analysis to Improve Operator Confidence and Efficiency for Failure Detection and Localization," in Proc. 2nd IEEE Intl. Conf. on Autonomic Computing, June 2005, by Peter Bodik, Greg Friedman, Lukas Biewald, Helen Levine (Ebates.com), George Candea, Kayur Patel, Gilman Tolle, Jon Hui, Armando Fox, Michael I. Jordan, David Patterson.
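The per-minute comparison on this slide can be sketched as a Pearson χ² statistic: scale the historical page mix to the current minute's traffic and measure how far the observed hit counts deviate. The Python below is a minimal sketch under stated assumptions, not the analysis from the cited paper. It covers only the χ² part (not the Bayesian networks), and the function names, page names, counts, and threshold are all made up for illustration.

```python
def chi_square_statistic(current_hits, historical_hits):
    """Pearson chi-square statistic of the current per-page hit counts
    against the page mix observed in the historical window.

    Both arguments map page name -> hit count (hypothetical data layout).
    """
    pages = [page for page, hits in historical_hits.items() if hits > 0]
    total_hist = sum(historical_hits[page] for page in pages)
    total_now = sum(current_hits.get(page, 0) for page in pages)
    if total_hist == 0 or total_now == 0:
        # Nothing to compare; a caller would handle "no traffic" separately.
        return 0.0
    statistic = 0.0
    for page in pages:
        expected = total_now * historical_hits[page] / total_hist
        observed = current_hits.get(page, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic


def is_anomalous(current_hits, historical_hits, critical_value):
    """Flag the minute if the statistic exceeds a caller-chosen critical value."""
    return chi_square_statistic(current_hits, historical_hits) > critical_value


if __name__ == "__main__":
    last_six_hours = {"home": 6000, "search": 3000, "account": 1000}
    this_minute = {"home": 60, "search": 38, "account": 2}
    # 5.991 is the 5% critical value of chi-square with 2 degrees of freedom
    # (3 pages - 1); the choice of significance level is an assumption.
    print(chi_square_statistic(this_minute, last_six_hours))
    print(is_anomalous(this_minute, last_six_hours, critical_value=5.991))
```

In this made-up example the account page receives far fewer hits than its historical share predicts, which is the kind of shift in user behavior the slide proposes using as a failure detector.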
9. Account page problem
[Figure: anomaly score]
- Novel Visualization
- "I see and understand": winning operator trust
10. Founding the RAD Lab: Start 12/1
- Looking for 3 to 5 founding companies to fund 5 years @ cost of $0.5M / year
  - 25 grad students, 15 undergrads, 6 faculty, 2 staff
- Founding companies: Google, Microsoft, Sun Microsystems
- RADS Consortium model
  - Preference to founding partner technology in prototypes
  - Designate employees to act as consultants
  - Head start for participants on research results
  - Putting IP in Public Domain so partners who use it are not sued
- Press release of founding RAD Lab partners December 1?
- Mid project review after 3 years by founding partners
11. RAD Lab Opportunity: New Research Model
- Chance to Partner with the Top University in Computer Systems on the Next Great Thing
  - National Academy of Engineering mentions Berkeley in 7 of 19 $1B industries that came from IT research
  - NAE mentions Berkeley 7 times, Stanford 5 times, MIT 5, CMU 3
  - Timesharing (SDS 940), Client-Server Computing (BSD Unix), Graphics, Entertainment, Internet, LANs, Workstations, GUI, VLSI Design (Spice) [ECAD $5B?/yr], RISC [$10B?/yr], Relational DB (Ingres/Postgres) [RDB $15B?/yr], Parallel DB, Data Mining, Parallel Computing, RAID [$15B?/yr], Portable Communication (BWRC), WWW, Speech Recognition, Broadband
  - Berkeley one of the top suppliers of systems students to industry and academia
  - US News & World Report ranking of CS Systems universities: 1 Berkeley, 2 CMU, 2 MIT, 4 Stanford
12. RAD Lab: Interdisciplinary Center for Reliable, Adaptive, Distributed Systems
- Working with different industries on long-range, pre-competitive technology
- Training of dozens of future leaders of IT, plus their recruitment
- Working with researchers with track records of successful technology transfer
13. Backup Slides
14. References
To learn more, see:
- "Combining Visualization and Statistical Analysis to Improve Operator Confidence and Efficiency for Failure Detection and Localization," Peter Bodik, Greg Friedman, Lukas Biewald, Helen Levine (Ebates.com), George Candea, Kayur Patel, Gilman Tolle, Jon Hui, Armando Fox, Michael I. Jordan, and David Patterson, in Proc. 2nd IEEE Intl. Conf. on Autonomic Computing, June 2005.
- "Microreboot -- A Technique for Cheap Recovery," George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, and Armando Fox, in Proc. 6th Symp. on Operating Systems Design and Implementation (OSDI), San Francisco, CA, Dec. 2004.
- "Path-Based Failure and Evolution Management," Mike Y. Chen, Anthony Accardi, Emre Kiciman, Jim Lloyd, Dave Patterson, Armando Fox, and Eric Brewer, in Proc. 1st USENIX/ACM Symp. on Networked Systems Design and Implementation (NSDI '04), San Francisco, CA, March 2004.
- "Scalable Statistical Bug Isolation," Ben Liblit, M. Naik, Alice X. Zheng, Alex Aiken, and Michael I. Jordan, PLDI, 2005.
15. Sustaining Innovation/Training Engine in 21st Century
- Replicate research centers based primarily on industrial funding to expand IT market and to train next generation of IT leaders
  - Berkeley Wireless Research Center (BWRC): 50 grad students, 30 undergrads @ $5M per year
  - Stanford Network Research Center (SNRC): 50 grad students @ $5M per year
  - MIT Tparty: $4M per year (100% from Quanta)
- Industry largely funds
  - N companies, where N is 5?
- Exciting, long term technical vision
  - Demonstrated by prototype(s)
16. State of Research Funding Today
- Most industry research shorter term
- DARPA exiting long-term (exp.) IT research
  - '03-'05 BAAs, IPTO: 9 AI, 2 classified, 1 SW radio, 1 sensor net, 1 reliability; all have 12 to 18 month go/no go milestones
  - Academic led funding reduced 50% (so far) 2001 to 2004
  - Faculty are consultants in consortia led by defense contractor, get grants to support 1-2 students (NSF funding level)
- NSF swamped with proposals, conservative
  - 2000 to 6500 proposals in 5 years
  - IT has lowest acceptance rate at NSF (between 8% and 16%)
  - "Ambitious proposal" is a negative review
  - Even if get NSF funding, proposal reduced to stretch NSF funds; e.g., got 3 x 1/3 faculty, 6 grad students, 0 staff, 3 years
- (To learn more, see www.cra.org/research)
17. RAD Lab Timeline
- 2005: Launch RAD Lab 12/1
- 2006: Collect workloads, Internet in a Box
- 2007: SLT/CT distributed architectures, Iboxes, annotative layer, class testing
- 2008: Development toolkit 1.0, tuple space, class testing; Mid Project Review
- 2009: RAD Lab software suite 1.0, class testing
- 2010: End of Project Party