Recovery-Oriented Computing - PowerPoint PPT Presentation

Title:

Recovery-Oriented Computing

Description:

Recovery-Oriented Computing. Dave Patterson and Aaron Brown, University of California at Berkeley, {patterson,abrown}@cs.berkeley.edu. In cooperation with ...



1
Recovery-Oriented Computing
  • Dave Patterson and Aaron Brown
  • University of California at Berkeley
  • {patterson,abrown}@cs.berkeley.edu
  • In cooperation with
  • Armando Fox, Stanford University, fox@cs.stanford.edu
  • http://roc.CS.Berkeley.EDU/
  • December 2001

2
The past: goals and assumptions of the last 15 years
  • Goal 1: Improve performance
  • Goal 2: Improve performance
  • Goal 3: Improve cost-performance
  • Assumptions
  • Humans are perfect (they don't make mistakes
    during installation, wiring, upgrade, maintenance
    or repair)
  • Software will eventually be bug free (good
    programmers write bug-free code, debugging works)
  • Hardware MTBF is already very large (100 years
    between failures), and will continue to increase

3
Today, after 15 years of improving performance
  • Availability is now the vital metric for servers
  • near-100% availability is becoming mandatory
  • for e-commerce, enterprise apps, online services,
    ISPs
  • but, service outages are frequent
  • 65% of IT managers report that their websites
    were unavailable to customers over a 6-month
    period
  • 25%: 3 or more outages
  • outage costs are high
  • social effects: negative press, loss of customers
    who click over to competitor
  • $500,000 to $5,000,000 per hour in lost revenues

Source: InternetWeek 4/3/2000
4
New goals: ACME
  • Availability
  • 24x7 delivery of service to users
  • Change
  • support rapid deployment of new software, apps,
    UI
  • Maintainability
  • reduce burden on system administrators (cost of
    ownership is 5X cost of purchase)
  • provide helpful, forgiving sysadmin environments
  • Evolutionary Growth
  • allow easy system expansion over time without
    sacrificing availability or maintainability

5
Where does ACME stand today?
  • Availability: failures are common
  • Traditional fault-tolerance doesn't solve the
    problems
  • Change
  • In back-end system tiers, software upgrades are
    difficult, failure-prone, or ignored
  • For application service over WWW, daily change
  • Maintainability
  • human operator error is the single largest failure
    source?
  • system maintenance environments are unforgiving
  • Evolutionary growth
  • 1U-PC cluster front-ends scale, evolve well
  • back-end scalability still limited

6
Recovery-Oriented Computing Philosophy
  • "If a problem has no solution, it may not be a
    problem, but a fact, not to be solved, but to be
    coped with over time"
  • Shimon Peres
  • Failures are a fact, and recovery/repair is how
    we cope with them
  • Since a major sysadmin job is recovery after
    failure, ROC also helps with maintenance
  • If necessary, start with clean slate, sacrifice
    disk space and performance for ACME

7
ROC Approach
  • Change system administration environment to make
    it forgiving
  • Develop ACME benchmarks to test old systems and
    new ideas to measure improvement
  • "Fastest to Recover from Failures" vs. "Fastest on
    SPEC"
  • Work with companies to get real data on failure
    causes and patterns, feedback on approach
  • Cluster technology that enables partitioning
    systems, inserting faults, and testing outputs

8
One idea: Undo for Sysadmin
  • Major goal of ROC is to provide an Undo for
    system administration
  • to create an environment that forgives operator
    error
  • to let sysadmins fix latent errors even after
    they're manifested
  • this is no ordinary word processor undo!
  • The Three R's: undo meets time travel
  • Rewind: roll system state backwards in time
  • Repair: fix latent or active error
  • automatically or via human intervention
  • Redo: roll system state forward, replaying user
    interactions lost during rewind
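
The Three R's above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ROC implementation: the UndoableSystem class, its dict-based state, and the "user"/"admin" tags on log entries are all hypothetical names invented for this example.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]
Op = Callable[[State], None]
LogEntry = Tuple[int, str, Op]  # (logical time, "user" or "admin", operation)


@dataclass
class UndoableSystem:
    """Hypothetical toy system: a dict of state plus checkpoints and a log."""
    state: State = field(default_factory=dict)
    checkpoints: Dict[int, State] = field(default_factory=dict)
    log: List[LogEntry] = field(default_factory=list)
    clock: int = 0

    def checkpoint(self) -> int:
        """Save a copy of the current state; return the time it was taken."""
        self.checkpoints[self.clock] = copy.deepcopy(self.state)
        return self.clock

    def apply(self, kind: str, op: Op) -> None:
        """Run an operation and record it so it can be replayed later."""
        self.clock += 1
        self.log.append((self.clock, kind, op))
        op(self.state)

    def rewind(self, t: int) -> List[LogEntry]:
        """Rewind: restore the checkpoint taken at time t; return what was undone."""
        self.state = copy.deepcopy(self.checkpoints[t])
        undone = [entry for entry in self.log if entry[0] > t]
        self.log = [entry for entry in self.log if entry[0] <= t]
        self.clock = t
        return undone

    def redo(self, undone: List[LogEntry]) -> None:
        """Redo: replay only the user interactions lost during rewind."""
        for _, kind, op in undone:
            if kind == "user":
                self.apply(kind, op)


def demo() -> State:
    system = UndoableSystem()
    system.apply("user", lambda s: s.update(mail1="hello"))
    t = system.checkpoint()
    system.apply("admin", lambda s: s.clear())            # operator mistake
    system.apply("user", lambda s: s.update(mail2="world"))
    undone = system.rewind(t)                             # Rewind
    # Repair: in this toy case the fix is simply not re-running the admin action
    system.redo(undone)                                   # Redo
    return system.state
```

In this sketch the mail written after the operator's mistake survives, which is what separates the Three R's from simply restoring a checkpoint.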

9
Discussion Topics
  • Focus on recovery: appropriate?
  • How much of a problem is human error?
  • Undo: would it be useful?
  • How hard is it to diagnose problems?
  • Benchmarks to evaluate progress in this area?
  • Future events on this topic?

10
Backup Slides for Questions
11
Discussion Topics (long version)
  • Focus on recovery: appropriate?
  • is it an important part of what you do?
  • what other parts of the sysadmin experience (for
    Internet services) should we look into?
  • How much of a problem is human error?
  • are there more user errors or sysadmin errors?
  • Undo: would it be useful?
  • how far back in time would it need to go?
  • should it be exported to users, e.g. undo for
    email?
  • How hard is it to diagnose problems?
  • would you use or trust automated diagnosis tools?

12
Discussion Topics (long version), p2
  • Benchmarks to evaluate progress in this area
  • measuring dependability of human-operated systems
  • how to measure sysadmin satisfaction with system?
  • Future events on this topic
  • would you come to a workshop at the next LISA?
  • what do you think about a hands-on workshop where
    we spend the morning using various systems then
    spend the afternoon analyzing and discussing
    results?

13
ROC Enabler: ACME benchmarks
  • Traditional benchmarks focus on performance
  • assume perfect hardware, software, human
    operators
  • New benchmarks needed to drive progress toward
    ACME, evaluate ROC success
  • How else to convince skeptics to adopt new
    technology?
  • Need workload of typical failures

[Figure: QoS over time, marking normal behavior (99% conf.), failure, QoS degradation, and repair time]
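
One way to read the figure labels (normal behavior at 99% confidence, failure, QoS degradation, repair time) is as a measurement recipe. The sketch below is an assumption about how such a benchmark could be scored, not the authors' tool: the function names, the use of a mean plus or minus 2.576 standard deviations band, and the sample data are all hypothetical.

```python
from statistics import mean, stdev
from typing import List, Optional, Tuple

Z_99 = 2.576  # two-sided 99% quantile of the normal distribution


def normal_band(baseline: List[float]) -> Tuple[float, float]:
    """Band of 'normal behavior' estimated from fault-free QoS samples."""
    m, s = mean(baseline), stdev(baseline)
    return m - Z_99 * s, m + Z_99 * s


def repair_time(samples: List[float], fault_index: int,
                band: Tuple[float, float],
                interval_s: float = 1.0) -> Optional[float]:
    """Seconds from fault injection until QoS is back inside the band for good.

    Returns None if the trace ends while QoS is still outside the band.
    """
    lo, hi = band
    last_bad = None
    for i in range(fault_index, len(samples)):
        if not (lo <= samples[i] <= hi):
            last_bad = i
    if last_bad is None:
        return 0.0                      # the fault never degraded QoS
    if last_bad == len(samples) - 1:
        return None                     # no recovery observed in this trace
    return (last_bad - fault_index + 1) * interval_s


# Hypothetical data: requests/sec sampled once per second.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
trace = [100, 99, 101, 40, 0, 0, 55, 90, 98, 100, 101]  # fault injected at index 3
band = normal_band(baseline)
print(repair_time(trace, 3, band))
```

Scoring systems on this number rather than on peak throughput is the "fastest to recover" idea in miniature.
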
14
Tools for Recovery Detection
  • System enables input insertion, output check of
    all modules (including fault insertion)
  • To check module sanity to find failures faster
  • To test correctness of recovery mechanisms
  • insert (random) faults and known-incorrect inputs
  • also enables availability benchmarks
  • To expose and remove latent errors from the system
  • To train/expand experience of operator
  • Periodic reports to management on skills
  • To discover if warning systems are broken
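
The fault-insertion idea above can be illustrated with a small harness. It is a sketch under assumptions: the FaultInjector wrapper, the retry-based recovery mechanism, and the double module are hypothetical stand-ins, not part of the ROC cluster software.

```python
import random
from typing import Callable, Dict


class RecoveryFailed(Exception):
    """Raised when the recovery mechanism runs out of retries."""


class FaultInjector:
    """Wraps a module (any callable) and makes it fail at a chosen rate."""

    def __init__(self, module: Callable[[int], int], fault_rate: float, seed: int = 0):
        self.module = module
        self.fault_rate = fault_rate
        self.rng = random.Random(seed)   # seeded so runs are repeatable
        self.injected = 0

    def __call__(self, x: int) -> int:
        if self.rng.random() < self.fault_rate:
            self.injected += 1
            raise RuntimeError("injected fault")
        return self.module(x)


def with_retry(module: Callable[[int], int], x: int, retries: int = 3) -> int:
    """The recovery mechanism under test: retry on transient faults."""
    for _ in range(retries + 1):
        try:
            return module(x)
        except RuntimeError:
            continue
    raise RecoveryFailed(x)


def double(x):
    """Module under test; rejects known-incorrect (non-integer) inputs."""
    if not isinstance(x, int):
        raise ValueError("expected int")
    return 2 * x


def run_probes(fault_rate: float, probes: int = 50, retries: int = 3,
               seed: int = 0) -> Dict[str, object]:
    injector = FaultInjector(double, fault_rate, seed)
    report: Dict[str, object] = {"ok": 0, "wrong": 0, "unrecovered": 0}
    for x in range(probes):
        try:
            out = with_retry(injector, x, retries)    # input insertion
        except RecoveryFailed:
            report["unrecovered"] += 1
            continue
        report["ok" if out == 2 * x else "wrong"] += 1  # output check
    try:
        double("not a number")                        # known-incorrect input
        report["bad_input_rejected"] = False
    except ValueError:
        report["bad_input_rejected"] = True
    report["injected"] = injector.injected
    return report
```

Running run_probes at several fault rates gives a rough picture of how much injected failure the recovery path absorbs before probes start going unanswered.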

15
Repairing the Past (2)
  • 3 cases needing Undo
  • reverse the effects of a mistyped command (rm -rf)
  • roll back a software upgrade without losing user
    data
  • go back in time to retroactively install a virus
    filter on the email server; effects of the virus are
    squashed on redo
  • The 3 R's vs. checkpointing, reboot, logging
  • checkpointing gives Rewind only
  • reboot may give Repair, but only for Heisenbugs
  • logging can give all 3 R's
  • but need more than RDBMS logging, since system
    state changes are interdependent and
    non-transactional
  • 3R-logging requires careful dependency tracking,
    and attention to state granularity and
    externalized events
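
The retroactive virus filter case shows why logging can give all three R's while a checkpoint alone cannot. The sketch below is a toy model under assumptions: Event, MailServer, and retroactive_filter are hypothetical names, the log stores events as plain data, and "rewind" means replaying from an empty initial checkpoint.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    kind: str   # "deliver" (incoming mail) or "read" (user opened a message)
    msg: str


@dataclass
class MailServer:
    inbox: List[str] = field(default_factory=list)
    # Returns True if a message is infected; None means no filter installed.
    virus_filter: Optional[Callable[[str], bool]] = None

    def handle(self, ev: Event) -> None:
        if ev.kind == "deliver":
            if self.virus_filter is None or not self.virus_filter(ev.msg):
                self.inbox.append(ev.msg)


def replay(log: List[Event],
           virus_filter: Optional[Callable[[str], bool]] = None) -> MailServer:
    """Rebuild server state from the empty checkpoint by replaying the log."""
    server = MailServer(virus_filter=virus_filter)
    for ev in log:
        server.handle(ev)
    return server


def retroactive_filter(log: List[Event],
                       is_infected: Callable[[str], bool]) -> Tuple[List[str], List[str]]:
    """Rewind to the start, Repair by installing the filter, Redo the log.

    Also reports infected messages the user already read: those effects were
    externalized, so replay alone cannot take them back.
    """
    repaired = replay(log, is_infected)
    externalized = [ev.msg for ev in log
                    if ev.kind == "read" and is_infected(ev.msg)]
    return repaired.inbox, externalized


log = [
    Event("deliver", "hi"),
    Event("deliver", "VIRUS payload"),
    Event("read", "VIRUS payload"),
    Event("deliver", "lunch?"),
]
inbox, already_seen = retroactive_filter(log, lambda m: "VIRUS" in m)
```

The already_seen list is the part the last bullet warns about: the log can squash the virus in stored state, but something outside the log has to handle what the user has already seen.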

16
Organizations we're working with
  • Traditional computer companies: HP, IBM,
    Microsoft, ...
  • Internet companies: Amazon, Google, Yahoo, ...
  • Startup companies (wouldn't recognize names)
  • Stanford University (Armando Fox)
  • E.g., CS 294-4 class
  • Visit them twice a year, 3-day retreats at Lake
    Tahoe to review progress, get new insights, build
    team spirit, have fun, ...

17
Systems we're using for ROC
  • ROC-I: cluster of 64 PCs modified with ability
    for HW isolation, fault insertion, monitoring
  • Cluster of 40 IBM PCs, each with 2 GB DRAM, 1
    gigabit Ethernet, gigabit switch, HW monitor, each
    running VMware virtual machine monitor (software
    layer)

18
Why ROC?
  • Working on relevant, important systems topic with
    no jeopardy from near-term products
  • New, exciting, research field, so lots of
    opportunities
  • Interacting with many interesting companies,
    research labs, Stanford University
  • Contact Dave Patterson (patterson@cs) or Aaron
    Brown (abrown@cs) or David Oppenheimer
    (davidopp@cs)
  • See http://ROC.cs.berkeley.edu