Embracing Failure: A Case for Recovery-Oriented Computing

Embracing Failure: A Case for Recovery-Oriented Computing, by Aaron B. Brown and David A. Patterson. Presented by John Calandrino.

1
Embracing Failure: A Case for Recovery-Oriented Computing
  • Aaron B. Brown
  • David A. Patterson
  • Presented by John Calandrino

2
Motivation
  • A survey in 2000 (one year prior to the writing of
    this paper) found:
  • 65% of surveyed web sites had customer-visible
    downtime at least once every 6 months; 25% had
    downtime 3 times
  • Is this five-nines availability?
  • 259,200 minutes in 180 days
  • Five-nines: no more than about 2.6 minutes of
    downtime (barely customer visible)
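
The downtime arithmetic on this slide can be checked with a few lines of Python. This is only a sketch of the calculation; the function name downtime_budget_minutes is an illustrative assumption, not something from the paper or the slides. It treats "five-nines" as 99.999% availability over the same 180-day window.

```python
# Downtime budget implied by an availability target over a given window.
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability: float, days: int) -> float:
    """Maximum downtime (in minutes) allowed in `days` at the given availability."""
    total_minutes = days * MINUTES_PER_DAY
    return total_minutes * (1.0 - availability)


window_minutes = 180 * MINUTES_PER_DAY               # 259,200 minutes in 180 days
five_nines = downtime_budget_minutes(0.99999, 180)   # roughly 2.59 minutes

print(window_minutes, round(five_nines, 2))
```

A single customer-visible outage in six months would normally last far longer than this budget, which is the point the slide is making.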

3
In modern computer systems
  • Availability is more important than ever
  • Businesses can lose millions of dollars during a
    one hour web site outage
  • Availability is harder than ever to guarantee
  • Modern systems are distributed, heterogeneous,
    and complex, involving numerous interacting
    applications: web server, internal database, etc.
  • Availability limited by weakest link in system
  • In such an environment, failures are inevitable
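
One way to see the "weakest link" point: if every component must be up for the site to be up, and failures are assumed independent, the overall availability is the product of the component availabilities, so it can never exceed the worst one. The sketch below uses made-up component figures purely for illustration; neither the numbers nor the function name come from the paper.

```python
from math import prod


def serial_availability(component_availabilities):
    """Availability of a system that needs ALL components up at once.

    Assumes independent failures (a simplifying assumption).
    """
    return prod(component_availabilities)


# Hypothetical figures for a web server, an internal database, and a network link.
components = [0.999, 0.9999, 0.9995]
overall = serial_availability(components)

# The combined figure is never better than the weakest component.
print(overall, min(components))
```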

4
Traditional Solutions
  • Use fault-tolerant components
  • Employ rigorous software testing practices
  • Such solutions rely on outdated assumptions:
  • We can design hardware/software to have
    negligible failure rates
  • Maintenance and repair are error-free
  • We can predict and tolerate system failure
  • These assumptions emphasize failure avoidance
    rather than failure recovery
  • Such systems are unprepared when failures occur

5
Hardware/Software Failures
  • Fault-tolerant hardware may exist, but that does
    not mean it is used
  • Commodity hardware is cheap and ubiquitous
  • And error-prone: IDE disks, non-ECC memory, etc.
  • Even low per-node failure rates are substantial
    in larger clusters (e.g., Google cluster)
  • It may be possible to develop fault-tolerant
    software; however:
  • Software is being developed, updated, and
    deployed faster than ever in the Internet age
  • In Internet time, people get sloppy
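
The claim that low per-node failure rates add up in large clusters can be illustrated with a short calculation. Assuming each node fails independently with probability p over some period, the chance that at least one of n nodes fails is 1 - (1 - p)^n. The value of p and the cluster sizes below are hypothetical, chosen only to show the trend.

```python
def prob_any_node_fails(p_node: float, n_nodes: int) -> float:
    """Probability that at least one of n independent nodes fails in the period."""
    return 1.0 - (1.0 - p_node) ** n_nodes


# Hypothetical per-node failure probability of 0.1% per period.
P_NODE = 0.001
for n in (1, 100, 1_000, 10_000):
    print(n, round(prob_any_node_fails(P_NODE, n), 4))
```

Under these assumptions a failure somewhere in the cluster goes from rare on a single node to nearly certain at ten thousand nodes.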

6
Human Failures
  • Arise primarily during maintenance and repair
  • Consider trying to diagnose and fix a subtle bug
    in even a few thousand lines of code
  • Also arise during other activities:
    configuration, upgrading, performance tuning
  • Human error rates are nowhere near zero
  • Even highly-trained, intelligent people make
    mistakes, especially under pressure
  • Therefore, maintenance and repair are not
    error-free

7
Unanticipated Failures
  • Some failures cannot be anticipated
  • Humans are good at breaking systems, especially
    unintuitive ones
  • Systems are combined in unanticipated ways,
    generating unexpected interactions
  • These interactions generate "normal accidents"
  • In this environment, it is impossible to predict
    all types of failures

8
Recovery-Oriented Computing
  • As we cannot design a system with 100%
    availability, modern systems must accept failure
    as inevitable
  • Focus more on recovery and repair in addition to
    avoidance
  • Provides an essential failure safety net that
    complements failure avoidance methods
  • Focus on improving MTTR as well as MTTF
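
The MTTR/MTTF point follows from the standard steady-state formula, availability = MTTF / (MTTF + MTTR). The sketch below uses hypothetical hour figures to show that halving the repair time yields the same availability as doubling the time between failures.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time to failure and mean time to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical baseline: fails every 1000 hours, takes 1 hour to repair.
baseline = availability(1000.0, 1.0)
double_mttf = availability(2000.0, 1.0)   # failure avoidance: fail half as often
halve_mttr = availability(1000.0, 0.5)    # failure recovery: repair twice as fast

print(baseline, double_mttf, halve_mttr)
```

Both improvements give 2000/2001, so effort spent on faster recovery counts as much toward availability as effort spent on avoiding failures.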

9
Recovery-Oriented Computing
  • ROC relies on a system-integrated
    recovery-oriented framework that should:
  • Detect and repair failures as quickly as possible
  • Prevent propagation of errors through system
  • Helpful to have physically-partitioned system
  • Must tolerate errors during recovery/repair
  • Be trustworthy (seems to imply low failure
    rate)
  • Extensive (self-)testing of framework, failure
    fire drills
  • What about unanticipated failures?
  • Is availability a requirement of this framework?
  • How do we implement such a framework?
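
As a very rough illustration of the properties listed on this slide, here is a toy supervisor that detects failed components, repairs each one independently so a fault stays inside its partition, tolerates errors raised during repair, and supports a fault-injection "fire drill". Every class and method name is hypothetical; this is a sketch of the general idea, not the framework described in the paper.

```python
class Component:
    """One isolated partition of the system (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.restarts = 0

    def inject_fault(self):
        # "Fire drill": deliberately break the component to exercise recovery.
        self.healthy = False

    def restart(self):
        self.healthy = True
        self.restarts += 1


class Supervisor:
    """Detects failed components and repairs them one at a time."""

    def __init__(self, components):
        self.components = components

    def sweep(self):
        repaired = []
        for component in self.components:
            if component.healthy:
                continue
            try:
                component.restart()
                repaired.append(component.name)
            except Exception:
                # Tolerate errors during repair: leave this component
                # failed for the next sweep and keep checking the others.
                continue
        return repaired


web, db = Component("web"), Component("db")
supervisor = Supervisor([web, db])

db.inject_fault()              # run a fire drill against one partition
repaired = supervisor.sweep()  # only the failed partition is restarted
print(repaired, web.restarts, db.restarts)
```

The sketch leaves the slide's open questions untouched: it only handles failures it knows how to detect, and the supervisor itself is a component that can fail.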

10
Questions
  • What if there are errors in the recovery-oriented
    framework?
  • How are these failures handled?
  • Alternately, can the framework be guaranteed not
    to fail?
  • Probably could not be a 100% guarantee
  • If this is instead a five-nines style of
    guarantee, aren't we back where we started?
  • ROC may not be the catch-all safety net that we
    desired