Toward Recovery-Oriented Computing

1
Toward Recovery-Oriented Computing
  • Armando Fox, Stanford University; David Patterson,
    UC Berkeley; and a cast of tens

2
Outline
  • Whither recovery-oriented computing?
  • research/industry agenda of last 15 years
  • today's pressing problem: availability (we knew
    that), but what is new/different compared to
    previous F/T work, databases, etc.?
  • Recovery-Oriented Computing as an approach to
    availability
  • Motivation and philosophy
  • sampling of research avenues
  • what ROC is not

3
Reevaluating goals & assumptions
  • Goals of last 15 years
  • Goal #1: Improve performance
  • Goal #2: Improve performance
  • Goal #3: Improve cost-performance
  • Assumptions
  • Humans are perfect (they don't make mistakes
    during installation, wiring, upgrade, maintenance
    or repair)
  • Software will eventually be bug free (good
    programmers will write bug-free code, debugging
    works)
  • Hardware MTBF is already very large (100 years
    between failures), and will continue to increase

4
Results of this successful agenda
  • Good news: faster computers, denser disks,
    cheaper
  • computation faster by 3 orders of magnitude
  • disk capacity greater by 3 orders of magnitude
  • Result: TCO dominated by administration, not
    hardware cost
  • Bad news: complex, brittle systems that fail
    frequently
  • 65% of IT managers report that their websites
    were unavailable to customers over a 6-month
    period (25%: 3 or more outages) [InternetWeek,
    4/3/2000]
  • outage costs: negative press, click-overs to
    competitor, stock price, market cap
  • Yet availability is key metric for online
    services!

5
Direct Downtime Costs (per Hour)
  • Brokerage operations: $6,450,000
  • Credit card authorization: $2,600,000
  • Ebay (22-hour outage): $225,000
  • Amazon.com: $180,000
  • Package shipping services: $150,000
  • Home shopping channel: $113,000
  • Catalog sales center: $90,000
  • Airline reservation center: $89,000
  • Cellular service activation: $41,000
  • On-line network fees: $25,000
  • ATM service fees: $14,000

Sources: InternetWeek 4/3/2000; Fibre Channel: A
Comprehensive Introduction, R. Kembel 2000, p.8:
"...based on a survey done by Contingency
Planning Research."

6
So, what are today's challenges?
  • We all seem to agree on goals
  • Dave Patterson, IPTS 2002: ACME (availability,
    change, maintenance, evolution)
  • Jim Gray, HPTS 2001: FAASM (functionality,
    availability, agility, scalability,
    manageability)
  • Butler Lampson, SOSP 1999: always available,
    evolving while they run, growing without
    practical limit
  • John Hennessy, FCRC 1999: availability,
    maintainability and ease of upgrades,
    scalability
  • Fox & Brewer, HotOS 1997: BASE (best-effort
    service, availability, soft state, eventual
    consistency)
  • We're all singing the same tune, but what is new?

7
What's New and Different
  • Evolution and change are integral
  • not true of many traditional "five nines"
    systems: long design cycle, changes incur high
    overhead for design/spec/testing
  • Last version of space shuttle software: 1 bug in
    420 KLOC, cost $35M/yr to maintain (good-quality
    commercial SW: 1 bug/KLOC)
  • But, recent upgrade for GPS support required
    generating 2,500 pages of specs before changing
    anything in 6.3 KLOC (1.5%)
  • Performance still important, but focus changed
  • Interactive performance and availability to end
    users is key
  • Users appear willing to occasionally tolerate
    temporary degradation (service quality) in
    exchange for improved availability
  • How to capture this tradeoff: soft/stale state,
    partial performance degradation, imprecise
    answers

8
ROC Philosophy
  • ROC philosophy (Peres's Law)
  • "If a problem has no solution, it may not be a
    problem, but a fact, not to be solved, but to be
    coped with over time" (Shimon Peres)
  • Failures (hardware, software, operator-induced)
    are a fact; recovery is how we cope with them
    over time
  • Availability = MTTF/MTBF = MTTF / (MTTF + MTTR).
    Rather than just making MTTF very large, make
    MTTR ≪ MTTF
  • Why?
  • Human errors will still cause outages: minimize
    recovery time
  • Recovery time is directly measurable, and
    directly captures impact on users of a specific
    outage incident (MTTF doesn't)
  • Rapid evolution makes exhaustive
    testing/validation impossible;
    unexpected/transient failures will still occur
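
The availability argument on this slide can be spelled out as a short worked equation. This is an editorial sketch, not part of the original slides; the numbers in the note below are illustrative assumptions.

```latex
\begin{align*}
A &= \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}, \\
1 - A &= \frac{\mathrm{MTTR}}{\mathrm{MTTF} + \mathrm{MTTR}}
       \approx \frac{\mathrm{MTTR}}{\mathrm{MTTF}}
       \qquad \text{when } \mathrm{MTTR} \ll \mathrm{MTTF}.
\end{align*}
```

Under that approximation, dividing MTTR by some factor k shrinks unavailability about as much as multiplying MTTF by k. For example, assuming MTTF = 1000 hours and MTTR = 1 hour, unavailability is roughly 0.1%; cutting MTTR to 0.1 hour or raising MTTF to 10,000 hours both bring it to roughly 0.01%.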

9
1. Human Error Is Inevitable
  • Human error: major factor in downtime
  • PSTN: half of all outage incidents and
    outage-minutes from 1992-1994 were due to human
    error (including errors by phone company
    maintenance workers)
  • Oracle: up to half of DB failures due to human
    error (1999)
  • Microsoft blamed human error for 24-hour outage
    in Jan 2001
  • Approach
  • Learn from psychology of human error and disaster
    case studies
  • Build in system support for recovery from human
    errors
  • Use tools such as error injection, virtual
    machine technology to provide "flight simulator"
    training for operators

10
The 3R undo model
  • Undo: time travel for system operators
  • Three Rs for recovery
  • Rewind: roll system state backwards in time
  • Repair: change system to prevent failure
  • e.g., edit history, fix latent error, retry
    unsuccessful operation, install preventative
    patch
  • Replay: roll system state forward, replaying
    end-user interactions lost during rewind
  • All three Rs are critical
  • rewind enables undo
  • repair lets user/administrator fix problems
  • replay preserves updates, propagates fixes forward
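
The three Rs above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `UndoableStore` class, its key/value operations, and the filter-style repair are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List, Tuple

Op = Tuple[str, str]            # (key, value): a hypothetical end-user write
Filter = Callable[[Op], bool]   # returns True if the operation may be applied


class UndoableStore:
    """Toy key/value service that logs every end-user operation."""

    def __init__(self) -> None:
        self.state: Dict[str, str] = {}
        self.log: List[Op] = []          # end-user operations, in arrival order
        self.filters: List[Filter] = []  # repairs installed by the operator
        self.pending: List[Op] = []      # operations cut off by a rewind

    def _apply(self, op: Op) -> None:
        # An operation only changes state if every installed repair accepts it.
        if all(accept(op) for accept in self.filters):
            key, value = op
            self.state[key] = value

    def execute(self, op: Op) -> None:
        self.log.append(op)
        self._apply(op)

    def rewind(self, position: int) -> None:
        """Rewind: rebuild state from only the first `position` logged operations."""
        self.pending = self.log[position:]
        self.log = self.log[:position]
        self.state = {}
        for op in self.log:
            self._apply(op)

    def repair(self, accept: Filter) -> None:
        """Repair: change the system so the failure does not recur."""
        self.filters.append(accept)

    def replay(self) -> None:
        """Replay: re-run the operations that the rewind removed."""
        pending, self.pending = self.pending, []
        for op in pending:
            self.execute(op)


store = UndoableStore()
store.execute(("greeting", "hello"))
store.execute(("motd", "BAD payload"))   # the operation we later regret
store.execute(("footer", "bye"))

store.rewind(1)                                 # back to just after "greeting"
store.repair(lambda op: "BAD" not in op[1])     # install a preventative filter
store.replay()                                  # later updates are preserved
print(store.state)
```

Rebuilding state from the start of the log keeps the sketch short; a real system would more likely rewind from checkpoints or non-overwriting storage.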

11
Example: e-mail scenario
  • Before undo
  • virus-laden message arrives
  • user copies it into a folder without looking at
    it
  • Operator invokes undo (rewind) to install virus
    filter (repair)
  • During replay
  • message is redelivered but now discarded by virus
    filter
  • copy operation is now unsafe (source message
    doesn't exist)
  • compensating action: insert placeholder for
    message
  • now copy command can be executed, making history
    replay-acceptable
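
The replay step in this scenario can be sketched as follows. This is an illustrative assumption of how a compensating action might look, not the actual undo wrapper; the operation tuples, the `replay` function, and the placeholder text are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

PLACEHOLDER = "[message removed by virus filter]"

# Each history entry is (verb, message_id, argument):
#   ("deliver", id, body)  or  ("copy", id, destination_folder)
Operation = Tuple[str, str, str]


def replay(
    history: List[Operation],
    is_virus: Callable[[str], bool],
) -> Tuple[Dict[str, str], Dict[str, Dict[str, str]]]:
    """Replay logged operations against a mail store that now has a virus filter."""
    inbox: Dict[str, str] = {}
    folders: Dict[str, Dict[str, str]] = {}
    for verb, msg_id, arg in history:
        if verb == "deliver":
            if is_virus(arg):
                continue  # the repaired system discards the message
            inbox[msg_id] = arg
        elif verb == "copy":
            if msg_id not in inbox:
                # Compensating action: the source no longer exists, so insert
                # a placeholder to make the logged copy executable again.
                inbox[msg_id] = PLACEHOLDER
            folders.setdefault(arg, {})[msg_id] = inbox[msg_id]
    return inbox, folders


history = [
    ("deliver", "m1", "lunch at noon?"),
    ("deliver", "m2", "EVIL attachment"),
    ("copy", "m2", "Saved"),
]
inbox, folders = replay(history, is_virus=lambda body: "EVIL" in body)
print(inbox)
print(folders)
```

The user's folder still gets an entry for the copied message, so the replayed history stays consistent with what the user saw, while the virus body itself never re-enters the store.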

12
First implementation attempt
  • Undo wrapper for open source IMAP email store

[Diagram: a 3R Layer (3R Proxy, Undo Log, State Tracker,
Non-overwriting Storage, control) wraps the Email Server and
sits on its SMTP and IMAP paths. Email Server state includes
user state, mailboxes, application, and operating system.]

13
3. Handling Transient Failures via Restart
  • Many failures are either (a) transient and
    fixable through reboot, or (b) non-transient, but
    reboot is the lowest-MTTR fix
  • Recursive restarts: to minimize MTTR, restart
    the minimal set of subsystems that could cure a
    failure; if that doesn't help, restart the
    next-higher containing set, etc.
  • Partial restarts/reboots
  • Return system (mostly) to well-tested,
    well-understood start state
  • High confidence way to reclaim stale/leaked
    resources
  • Unlike true checkpointing, reboot more likely to
    avoid repeated failure due to corrupted state
  • We focus on proactive restarts; can also be
    reactive (SW rejuvenation)
  • Easier to run a system 365 times for 1 day than
    365 days
  • Goals
  • What is the software structure that can best
    accommodate such failure management while still
    preserving all other requirements (functionality,
    performance, consistency, etc.)?
  • Develop methodology for building and managing RR
    systems (concrete engineering methods)
  • Develop the tools for building, testing,
    deploying, and managing RR systems
  • Design for fast restartability in online-service
    building blocks

14
A Hierarchy of Restartable Units
  • Siblings: highly fault-isolated
  • low level: by high-confidence, low-level,
    HW-assisted machinery (e.g., MMU, physical
    isolation)
  • higher level: by VM-level abstractions based on
    the above machinery (e.g., JVM, HW VM, process)
  • R-map (hierarchy of restartable component
    groups) captures restart dependencies
  • Groups of restart units can be restarted by
    common parent
  • Restarting a node restarts everything in its
    subtree
  • A failure is minimally curable at a specific node
  • Restarts farther up tree are more expensive, but
    higher confidence for curing transients
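
The restart hierarchy and escalation policy described above can be sketched as a small tree walk. This is a simplified illustration, not the authors' tooling: `RestartNode`, `recursive_restart`, the `is_cured` callback, and the component names are hypothetical.

```python
from typing import Callable, List, Optional


class RestartNode:
    """One node of a hypothetical restart hierarchy (R-map)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.parent: Optional["RestartNode"] = None
        self.children: List["RestartNode"] = []

    def add(self, child: "RestartNode") -> "RestartNode":
        child.parent = self
        self.children.append(child)
        return child

    def subtree(self) -> List[str]:
        """Restarting a node restarts everything in its subtree."""
        names = [self.name]
        for child in self.children:
            names.extend(child.subtree())
        return names


def recursive_restart(
    failed: RestartNode,
    is_cured: Callable[[List[str]], bool],
) -> List[List[str]]:
    """Restart the smallest subtree first, escalating to the parent until cured.

    `is_cured` stands in for a real failure detector: it receives the names of
    the components just restarted and reports whether the failure went away.
    """
    attempts: List[List[str]] = []
    node: Optional[RestartNode] = failed
    while node is not None:
        restarted = node.subtree()
        attempts.append(restarted)
        if is_cured(restarted):
            break
        node = node.parent  # more expensive, but higher confidence
    return attempts


station = RestartNode("station")
radio = station.add(RestartNode("radio"))
tuner = radio.add(RestartNode("tuner"))
tracker = station.add(RestartNode("tracker"))

# Assume the failure shows up in the tuner but is only cured by restarting the radio.
attempts = recursive_restart(tuner, is_cured=lambda names: "radio" in names)
print(attempts)
```

In this example the failure is minimally curable at the radio node, so the walk stops there without restarting the tracker or the whole station.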

15
RR-ifying a satellite ground station
  • Biggest improvement: MTTF/MTTR-based boundary
    redrawing
  • Ability to isolate unstable components without
    penalizing whole system
  • Achieve a balanced MTTF/MTTR ratio across
    components at the same level
  • Lower MTTR may be strictly better than higher
    MTTF
  • unplanned downtime is more expensive than planned
    downtime, and downtime under a heavy/critical
    workload (e.g., satellite pass) is more expensive
    than downtime under a light/non-critical
    workload
  • high MTTF doesn't guarantee a failure-free
    operation interval, but sufficiently low MTTR may
    mitigate impact of failure
  • Current work is applying RR to a ubiquitous
    computing environment, a J2EE application server,
    and an OSGi-based platform for cars; new lessons
    will emerge (e.g., r-tree needs to be an r-DAG)
  • Most of these lessons are not surprising, but RR
    provides a uniform framework within which to
    discuss them

16
MTTR Captures Outage Costs
  • Recent software-related outages at Ebay: 4.5
    hours in Apr02, 22 hours Jun99, 7 hours May99, 9
    hours Dec98
  • Assume two 4-hour (newsworthy) outages/year
  • A = (182×24 hours)/(182×24 + 4 hours) = 99.9%
  • Dollar cost: Ebay policy for 2 hour outage, fees
    credited to all affected users (US$3-5M for
    Jun99)
  • Customer loyalty: after Jun99 outage, Yahoo
    Auctions reported statistically significant
    increase in users
  • Ebay's market cap dropped US$4B after Jun99
    outage, stock price dropped 25%
  • Newsworthy due to number of users affected, given
    length of outage

17
Outage costs, cont.
  • What about a 10-minute outage once per week?
  • A = (7×24 hours)/(7×24 + 1/6 hours) = 99.9%, the
    same
  • Can we quantify savings over the previous
    scenario?
  • Shorter outages affect fewer users at a time
  • Typical AOL email outage affects 1-2% of users
  • Many short outages may affect different subsets
    of users
  • Shorter outages typically not news-worthy
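
The two availability figures (two 4-hour outages per year versus one 10-minute outage per week) can be checked with a few lines of Python. This is just the slides' arithmetic written out; the function and variable names are made up for the example.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Availability over one failure/repair cycle: uptime / (uptime + downtime)."""
    return uptime_hours / (uptime_hours + downtime_hours)


# Scenario 1: a 4-hour outage roughly every 182 days (two per year).
two_long_outages = availability(182 * 24, 4)

# Scenario 2: a 10-minute (1/6 hour) outage once per week.
weekly_short_outage = availability(7 * 24, 10 / 60)

print(f"two 4-hour outages/year:  {two_long_outages:.1%}")
print(f"one 10-minute outage/week: {weekly_short_outage:.1%}")
```

Both scenarios round to 99.9%, which is the point of the slide: the availability number alone cannot distinguish a few long outages from many short ones.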

18
When Low MTTR Trumps High MTTF
  • MTTR is directly measurable; MTTF usually not
  • Component MTTFs: tens of years
  • Software MTTF ceiling: 30 yrs (Gray, HDCC 01)
  • Result: measuring MTTF requires 100s of
    system-years
  • But, MTTRs are minutes to hours, even for
    complex SW components
  • MTTR more directly captures impact of a specific
    outage
  • Very low MTTR (10 seconds) achievable with
    redundancy and failover
  • Keeps response time below user threshold of
    distraction [Miller 1968, Bhatti et al 2001, Zona
    Research 1999]

19
Degraded Service vs. Outage
  • How about longer MTTRs (minutes or hours)?
  • Can service be designed so that short outages
    appear to users as temporary degradation instead?
  • How much degradation will users tolerate?
  • For how long? (Until they abandon the site because
    it feels like a true outage; abandonment can be
    measured)
  • How frequently?
  • Even if above thresholds can be deduced, how to
    design service so that transient failures can be
    mapped onto degraded quality?
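
One possible way to map a transient failure onto degraded quality is to serve a stale answer while the backend is down. The sketch below is only an illustration of that idea, not something the slides propose; `DegradingReader`, `flaky_fetch`, and the staleness limit are hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class DegradingReader:
    """Serve fresh data when possible, bounded-staleness data during failures."""

    def __init__(self, fetch: Callable[[str], str], max_staleness_s: float) -> None:
        self._fetch = fetch
        self._max_staleness_s = max_staleness_s
        self._cache: Dict[str, Tuple[str, float]] = {}  # key -> (value, fetch time)

    def get(self, key: str, now: Optional[float] = None) -> Tuple[str, str]:
        now = time.monotonic() if now is None else now
        try:
            value = self._fetch(key)
        except Exception:
            cached = self._cache.get(key)
            if cached is not None and now - cached[1] <= self._max_staleness_s:
                return cached[0], "degraded"  # stale but still useful answer
            raise  # too stale or never seen: this is a true outage
        self._cache[key] = (value, now)
        return value, "fresh"


backend = {"up": True}  # stand-in for a real backend's health


def flaky_fetch(key: str) -> str:
    if not backend["up"]:
        raise ConnectionError("backend down")
    return f"price of {key}"


reader = DegradingReader(flaky_fetch, max_staleness_s=60.0)
first = reader.get("item-1", now=0.0)     # backend healthy
backend["up"] = False
second = reader.get("item-1", now=30.0)   # transient failure, cache still valid
print(first, second)
```

The staleness limit is the parameter that would have to come from the user-tolerance measurements the slide asks about.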

20
Examples of degraded service
  • Goal: derive a set of service primitives that
    directly reflect parameterizable degradation due
    to transient failure (theory is too strong)

21
Two Frequently Asked Questions
  • Is ROC the same as autonomic computing?
  • Are you saying we should build lousy hardware and
    software and mask all those failures with ROC
    mechanisms?

22
1. Does ROC = autonomic computing?
  • Self-administering?
  • For now, focus on empowering administrators, not
    eliminating them
  • Humans are good at detecting and learning from
    own mistakes, so why not? (avoiding automation
    irony)
  • We're not sure we understand sysadmins' current
    techniques well enough to think about automation
  • Self-healing, self-reprovisioning,
    self-load-balancing?
  • Sure: Web services and datacenters already do
    this for many situations; many techniques and
    tools are well known
  • But: do we know how (theory) to design the app
    software to make these techniques possible?
  • Digital immune system: it's in WinXP

23
2. What ROC is not
  • We do not advocate for
  • producing buggy software
  • building lousy hardware
  • slacking on design, testing, or careful
    administration
  • discarding existing useful techniques or tools
  • We do advocate for
  • an increased focus on lowering MTTR specifically
  • increased examination of when some guarantees can
    be traded for lower MTTR
  • systematic exploration of design for fast
    recovery in the context of a variety of
    applications
  • stealing great ideas from systems, Internet
    protocols, psychology, safety-critical systems
    design

24
Summary: ROC and Online Services
  • Current software realities lead to new foci
  • Rapid evolution: traditional FT methodologies
    difficult to apply
  • Human error inevitable, but humans are good at
    identifying own errors: provide facilities to
    allow recovery from these
  • HW and SW failure inevitable: use redundancy
    and designed-in ability to substitute temporary
    degradation for outages (design for recovery)
  • Trying to stay relevant via direct contact with
    designers/operators of large systems
  • Need real data on how large systems fail
  • Need real data on how different kinds of failures
    are perceived by users

25
Interested in ROCing?
  • Are you willing to anonymously share failure
    data?
  • Already great relationships (and in some cases
    data-sharing agreements) with BEA, IBM, HP,
    Keynote, Microsoft, Oracle, Tellme, Yahoo!,
    others
  • See http://roc.stanford.edu or
    http://roc.cs.berkeley.edu for publications,
    talks, research areas, etc.
  • Contact Armando Fox (fox_at_cs.stanford.edu) or
    Dave Patterson (patterson_at_cs.berkeley.edu)

26
Discussion Question
  • For discussion: so what if you pick the
    low-hanging fruit? The challenge is in reaching
    the highest leaves.