Title: Recovery-Oriented Computing
1 Recovery-Oriented Computing
- Dave Patterson and Aaron Brown
- University of California at Berkeley
- patterson,abrown_at_cs.berkeley.edu
- In cooperation with
  - Armando Fox, Stanford University, fox_at_cs.stanford.edu
- http://roc.CS.Berkeley.EDU/
- December 2001
2 The past: goals and assumptions of the last 15 years
- Goal 1: Improve performance
- Goal 2: Improve performance
- Goal 3: Improve cost-performance
- Assumptions
  - Humans are perfect (they don't make mistakes during installation, wiring, upgrade, maintenance, or repair)
  - Software will eventually be bug-free (good programmers write bug-free code, debugging works)
  - Hardware MTBF is already very large (100 years between failures), and will continue to increase
3 Today, after 15 years of improving performance
- Availability is now the vital metric for servers
  - near-100% availability is becoming mandatory
    - for e-commerce, enterprise apps, online services, ISPs
  - but service outages are frequent
    - 65% of IT managers report that their websites were unavailable to customers over a 6-month period
    - 25%: 3 or more outages
  - outage costs are high
    - social effects: negative press, loss of customers who click over to competitor
    - $500,000 to $5,000,000 per hour in lost revenues
Source: InternetWeek 4/3/2000
4 New goals: ACME
- Availability
  - 24x7 delivery of service to users
- Change
  - support rapid deployment of new software, apps, UI
- Maintainability
  - reduce burden on system administrators (cost of ownership 5X cost of purchase)
  - provide helpful, forgiving sysadmin environments
- Evolutionary Growth
  - allow easy system expansion over time without sacrificing availability or maintainability
5 Where does ACME stand today?
- Availability: failures are common
  - Traditional fault-tolerance doesn't solve the problems
- Change
  - In back-end system tiers, software upgrades difficult, failure-prone, or ignored
  - For application service over WWW, daily change
- Maintainability
  - human operator error is single largest failure source?
  - system maintenance environments are unforgiving
- Evolutionary growth
  - 1U-PC cluster front-ends scale, evolve well
  - back-end scalability still limited
6 Recovery-Oriented Computing Philosophy
- "If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time" - Shimon Peres
- Failures are a fact, and recovery/repair is how we cope with them
- Since a major sysadmin job is recovery after failure, ROC also helps with maintenance
- If necessary, start with a clean slate, sacrifice disk space and performance for ACME
7 ROC Approach
- Change system administration environment to make it forgiving
- Develop ACME benchmarks to test old systems and new ideas to measure improvement
  - "Fastest to Recover from Failures" v. "Fastest on SPEC"
- Work with companies to get real data on failure causes and patterns, feedback on approach
- Cluster technology that enables partitioning systems, inserting faults, testing outputs
8 One idea: Undo for Sysadmin
- Major goal of ROC is to provide an Undo for system administration
  - to create an environment that forgives operator error
  - to let sysadmins fix latent errors even after they're manifested
  - this is no ordinary word processor undo!
- The Three Rs: undo meets time travel
  - Rewind: roll system state backwards in time
  - Repair: fix latent or active error
    - automatically or via human intervention
  - Redo: roll system state forward, replaying user interactions lost during rewind
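
The Rewind, Repair, Redo sequence can be illustrated with a small sketch. This is not the ROC project's implementation; it is a toy mail store, and the class name `UndoableMailStore`, its methods, and the whole-state snapshot approach are assumptions made for illustration. It uses the deck's retroactive virus filter case: rewind to a checkpoint, install a filter, then replay the logged deliveries so the virus message is squashed on redo.

```python
from copy import deepcopy


class UndoableMailStore:
    """Toy mail store supporting Rewind, Repair, and Redo (hypothetical design)."""

    def __init__(self):
        self.inbox = []        # current system state
        self.filters = []      # predicates that reject a message
        self.log = []          # every user interaction, in arrival order
        self.checkpoints = {}  # log position -> snapshot of the inbox

    def deliver(self, message):
        """Normal operation: record the interaction, then apply it."""
        self.log.append(message)
        self._apply(message)

    def _apply(self, message):
        # A message reaches the inbox only if no installed filter rejects it.
        if not any(rejects(message) for rejects in self.filters):
            self.inbox.append(message)

    def checkpoint(self):
        """Snapshot the state so a later rewind can return to this point."""
        position = len(self.log)
        self.checkpoints[position] = deepcopy(self.inbox)
        return position

    def rewind(self, position):
        """Rewind: roll state back to the snapshot taken at `position`."""
        self.inbox = deepcopy(self.checkpoints[position])

    def repair(self, rejects):
        """Repair: fix the error, here by installing a filter predicate."""
        self.filters.append(rejects)

    def redo(self, position):
        """Redo: replay logged interactions that arrived after `position`."""
        for message in self.log[position:]:
            self._apply(message)


store = UndoableMailStore()
store.deliver("hello from alice")
mark = store.checkpoint()
store.deliver("VIRUS payload")      # the latent error enters the system
store.deliver("lunch on friday?")   # legitimate work done after the error

store.rewind(mark)                    # Rewind
store.repair(lambda m: "VIRUS" in m)  # Repair
store.redo(mark)                      # Redo
print(store.inbox)  # ['hello from alice', 'lunch on friday?']
```

The log keeps all three deliveries, so the legitimate message that arrived after the virus survives the rewind. A real system would also have to deal with interdependent state and events already visible to users, which this sketch ignores.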
9 Discussion Topics
- Focus on recovery: appropriate?
- How much of a problem is human error?
- Undo: would it be useful?
- How hard is it to diagnose problems?
- Benchmarks to evaluate progress in this area?
- Future events on this topic?
10 Backup Slides for Questions
11 Discussion Topics (long version)
- Focus on recovery: appropriate?
  - is it an important part of what you do?
  - what other parts of the sysadmin experience (for Internet services) should we look into?
- How much of a problem is human error?
  - are there more user errors or sysadmin errors?
- Undo: would it be useful?
  - how far back in time would it need to go?
  - should it be exported to users, e.g. undo for email?
- How hard is it to diagnose problems?
  - would you use or trust automated diagnosis tools?
12 Discussion Topics (long version), p2
- Benchmarks to evaluate progress in this area
  - measuring dependability of human-operated systems
  - how to measure sysadmin satisfaction with system?
- Future events on this topic
  - would you come to a workshop at the next LISA?
  - what do you think about a hands-on workshop where we spend the morning using various systems then spend the afternoon analyzing and discussing results?
13 ROC Enabler: ACME benchmarks
- Traditional benchmarks focus on performance
  - assume perfect hardware, software, human operators
- New benchmarks needed to drive progress toward ACME, evaluate ROC success
  - How else to convince skeptics to adopt new technology?
- Need workload of typical failures
[Figure labels: normal behavior (99% conf.), QoS degradation, failure, Repair Time]
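
One way to turn the quantities named on this slide into numbers is sketched below. It is an illustration under stated assumptions, not the benchmark the ROC project defined: the function names, the sample data, and the use of a mean plus or minus 2.576 standard deviations band as the "99% conf." normal range are all assumptions. Repair time is taken here as the time from fault injection until the QoS metric is back inside the normal band and stays there.

```python
from statistics import mean, stdev


def normal_band(baseline, z=2.576):
    """Band of 'normal' QoS estimated from fault-free samples.

    z=2.576 is the two-sided 99% point of a normal distribution; treating
    QoS samples as normally distributed is a simplifying assumption.
    """
    centre = mean(baseline)
    spread = z * stdev(baseline)
    return centre - spread, centre + spread


def repair_time(samples, fault_index, band, interval_s=1.0):
    """Seconds from fault injection until QoS stays inside the band.

    Returns 0.0 if QoS never left the band, or None if it had not
    recovered by the end of the run.
    """
    low, high = band
    last_bad = None
    for i in range(fault_index, len(samples)):
        if not (low <= samples[i] <= high):
            last_bad = i
    if last_bad is None:
        return 0.0
    if last_bad == len(samples) - 1:
        return None
    return (last_bad + 1 - fault_index) * interval_s


# Hypothetical QoS samples (e.g. requests served per interval), 10 s apart.
baseline = [100, 102, 98, 101, 99, 100]       # before the fault is injected
after_fault = [60, 40, 55, 80, 97, 100, 101]  # degradation, then recovery
run = baseline + after_fault

band = normal_band(baseline)
print(repair_time(run, fault_index=len(baseline), band=band, interval_s=10.0))  # 40.0
```

Reporting repair time next to throughput is what would let a benchmark rank systems as "fastest to recover" instead of only "fastest".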
14 Tools for Recovery: Detection
- System enables input insertion, output check of all modules (including fault insertion)
  - To check module sanity to find failures faster
  - To test correctness of recovery mechanisms
    - insert (random) faults and known-incorrect inputs
    - also enables availability benchmarks
  - To expose and remove latent errors from system
  - To train/expand experience of operator
    - Periodic reports to management on skills
  - To discover if warning systems are broken
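
The idea of inserting faults and known-incorrect inputs to test a recovery mechanism can be sketched as follows. Everything here is hypothetical: `FaultInjector`, `with_retry`, `parse_port`, and `check_module` are illustrative names, the "module" is a trivial function, and the recovery mechanism under test is a plain retry loop. The fault stream comes from a seeded random generator so a run can be repeated.

```python
import random


class FaultInjector:
    """Wraps a callable and makes it raise on a random fraction of calls."""

    def __init__(self, func, fault_rate, seed=0):
        self.func = func
        self.fault_rate = fault_rate
        self.rng = random.Random(seed)  # seeded so runs are repeatable
        self.injected = 0

    def __call__(self, *args):
        if self.rng.random() < self.fault_rate:
            self.injected += 1
            raise RuntimeError("injected fault")
        return self.func(*args)


def with_retry(func, attempts):
    """Recovery mechanism under test: retry on RuntimeError only."""
    def wrapper(*args):
        for _ in range(attempts - 1):
            try:
                return func(*args)
            except RuntimeError:
                continue
        return func(*args)  # last attempt: let any error propagate
    return wrapper


def parse_port(text):
    """Example module: parse a TCP port, rejecting out-of-range values."""
    port = int(text)
    if not 0 < port < 65536:
        raise ValueError("port out of range: " + text)
    return port


def check_module(module, good_cases, bad_inputs):
    """Output check: count wrong answers and bad inputs that were accepted."""
    wrong = sum(1 for text, expected in good_cases if module(text) != expected)
    accepted = 0
    for text in bad_inputs:
        try:
            module(text)
            accepted += 1  # a known-incorrect input slipped through
        except ValueError:
            pass
    return {"wrong_outputs": wrong, "bad_inputs_accepted": accepted}


faulty = FaultInjector(parse_port, fault_rate=0.3, seed=1)
recovered = with_retry(faulty, attempts=50)
good_cases = [(str(p), p) for p in range(1, 201)]
report = check_module(recovered, good_cases, bad_inputs=["0", "70000", "http"])
print(report, "faults injected:", faulty.injected)
```

If the retry wrapper were removed, the same harness would surface the injected faults as uncaught errors, which is how a harness like this could also reveal a recovery mechanism or warning system that has silently stopped working.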
15 Repairing the Past (2)
- 3 cases needing Undo
  - reverse the effects of a mistyped command (rm -rf)
  - roll back a software upgrade without losing user data
  - go back in time to retroactively install virus filter on email server; effects of virus are squashed on redo
- The 3 Rs vs. checkpointing, reboot, logging
  - checkpointing gives Rewind only
  - reboot may give Repair, but only for Heisenbugs
  - logging can give all 3 Rs
    - but need more than RDBMS logging, since system state changes are interdependent and non-transactional
    - 3R-logging requires careful dependency tracking, and attention to state granularity and externalized events
16 Organizations we're working with
- Traditional computer companies: HP, IBM, Microsoft, ...
- Internet companies: Amazon, Google, Yahoo, ...
- Startup companies (wouldn't recognize names)
- Stanford University (Armando Fox)
  - E.g., CS 294-4 class
- Visit them twice a year, 3-day retreats at Lake Tahoe to review progress, get new insights, build team spirit, have fun, ...
17 Systems used for ROC
- ROC-I: Cluster of 64 PCs modified with ability for HW isolation, fault insertion, monitoring
- Cluster of 40 IBM PCs, each with 2 GB DRAM, 1 gigabit Ethernet, gigabit switch, HW monitor, each running VMware virtual machine monitor (software layer)
18 Why ROC?
- Working on relevant, important systems topic with no jeopardy from near-term products
- New, exciting research field, so lots of opportunities
- Interacting with many interesting companies, research labs, Stanford University
- Contact Dave Patterson (patterson_at_cs) or Aaron Brown (abrown_at_cs) or David Oppenheimer (davidopp_at_cs)
- See http://ROC.cs.berkeley.edu