Why Recovery Should Be Free, and Often Can Be

1
Why Recovery Should Be Free, and Often Can Be
  • Armando Fox, Stanford University
  • June 2003 ROC Retreat

2
Recovery Should Be Free, and Can Be
  • Already espouse arguments about lowering MTTR
  • Mitigates impact on service as a whole (Fox &
    Patterson, 2002)
  • Results in higher end-user-perceived
    availability, given same overall availability
    (Xie et al. 2002)
  • etc.
  • Tim Chou, Oracle: maybe more important to make
    recovery predictable (so you can plan provisioning,
    anticipate impact of outage, etc.)... if we
    understand it, we can optimize its speed

3
Real win: recovery management is hard
  • Determining when to recover is hard
  • How to detect that something's wrong?
  • How do you know when recovery is really
    necessary? (fail-stutter, etc.)
  • Will recovery make things worse? (cascading
    recovery)
  • Knowing what happens when you recover is hard
  • Will a particular recovery technique work? (the
    machinery needed to perform the recovery may also
    be broken)
  • What is the effect on online performance?
    (recovery can be expensive)
  • What if you needlessly over-recover? (cost of
    making a mistake is high)
  • If recovery were predictable and fast, it would
    simplify both failure detection and recovery
    management.

4
Simplifying Recovery Management: Crash-Only Software
  • Goal: enforce simple invariants on recovery
    behavior, from outside the component(s) being
    recovered
  • Crash-only component provides a PWR switch: stop =
    crash
  • clean shutdown = loss of power = kernel panic =
    ...
  • One way to go down => one way to come up: start =
    recover
  • Power switch is external => uniform behavior
  • kill -9, turning off (process kill) a VM, pulling
    the power cord
  • Intuition: the infrastructure supporting the
    power switch is usually simpler than the
    applications using it, and common across all
    those applications
  • Can crash-only software actually be built, and if
    so, how?
  • (a) provide building blocks
  • (b) formalize C/O definition and provide
    developer
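
A minimal sketch of the "one way to go down, one way to come up" idea, assuming a hypothetical component (`CrashOnlyCounter`, not from the talk) whose only state is a small JSON file. There is no shutdown method; every start runs the recovery path, and each update is committed with an atomic rename so that a kill at any instant leaves either the old or the new state on disk.

```python
import json
import os
import tempfile


class CrashOnlyCounter:
    """Hypothetical component with no shutdown method: stopping means crashing."""

    def __init__(self, path):
        self.path = path
        self.value = self._recover()  # start == recover, on every start

    def _recover(self):
        # The only startup path: rebuild state from the last committed file.
        try:
            with open(self.path) as f:
                return json.load(f)["value"]
        except FileNotFoundError:
            return 0  # first start ever: nothing to recover

    def increment(self):
        new_value = self.value + 1
        # Commit atomically: write a temp file, then rename it over the old one.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"value": new_value}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)
        self.value = new_value
        return new_value


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "counter.json")
        c = CrashOnlyCounter(path)
        c.increment()
        c.increment()
        del c  # "crash": drop the instance with no cleanup of any kind
        return CrashOnlyCounter(path).value  # restart runs recovery
```

Because the recovery path is also the normal startup path, it is exercised on every start rather than only after rare failures. A real component would also have to deal with in-flight requests and external resources, which this sketch ignores.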

5
Crash-only Building Blocks
  • JAGR/ROC-2, a self-recovering J2EE app server
    (Candea et al., WIAPP 2003)
  • Micro-reboots used for recovery;
    application-generic failure-path inference used
    for determining recovery strategy
  • Significantly improves performability relative to
    whole-app redeploy
  • SSM: a crash-only (CO) session state manager
    (Ling, Fox, AMS 2003)
  • DStore: a CO persistent single-key state manager
    (Huang, Fox, submitted to SRDS 2003)
  • Similar in spirit to HP Labs FAB (Frolund, Saito
    et al., 2003)
  • Common features of both SSM and DStore:
  • Redundancy used for persistence
  • Workload semantics exploited to simplify
    consistency model and recovery
  • Recovery = restart; safe to reboot any node at any
    time
  • Safe to coerce any failure to a crash (fail-stop)
    at any time

6
Building blocks, cont.
  • Pinpoint, statistical-anomaly-based failure
    detection
  • Standard tension: accuracy vs. precision (false
    positives problem)
  • Different clustering techniques seem to be good
    at detecting different kinds of problems
  • Surprising result from a CS241 project:
    character-frequency histograms are a good
    app-generic way to detect end-user-visible
    failures
  • Mostly integrated with JAGR and SSM
  • On burner: discussions with BEA Systems for
    integrating into WebLogic Server
  • Insight: if cost of over-recovering is low,
    aggressive statistics-based failure detection
    becomes more appealing
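
The character-frequency idea can be illustrated roughly as follows. This is a sketch under stated assumptions, not the CS241 project's or Pinpoint's actual method: the L1 distance, the averaged baseline, the 0.5 threshold, and all function names are choices made here for illustration.

```python
from collections import Counter


def char_histogram(text):
    """Normalized character-frequency histogram of a response body."""
    counts = Counter(text)
    total = sum(counts.values()) or 1
    return {ch: n / total for ch, n in counts.items()}


def l1_distance(h1, h2):
    """Sum of absolute frequency differences; ranges from 0.0 to 2.0."""
    return sum(abs(h1.get(ch, 0.0) - h2.get(ch, 0.0)) for ch in set(h1) | set(h2))


def mean_histogram(texts):
    """Baseline built by averaging histograms of known-good responses."""
    hists = [char_histogram(t) for t in texts]
    keys = set().union(*hists)
    return {ch: sum(h.get(ch, 0.0) for h in hists) / len(hists) for ch in keys}


def looks_anomalous(baseline, text, threshold=0.5):
    # The threshold is an assumed tuning knob; a lower value flags more
    # responses, which is only acceptable if over-recovery is cheap.
    return l1_distance(baseline, char_histogram(text)) > threshold
```

For example, a baseline built from a few normal HTML pages will sit far from a page that consists of a Java stack trace, since the trace has no angle brackets and a very different letter mix. The detector needs no knowledge of the application, which is what makes it app-generic.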

7
Toward a crash-only formalism
  • Component frameworks force you into certain
    app-writing patterns
  • Inter-EJB calls through runtime-managed level of
    indirection
  • Restrictions on how persistent state management
    can be expressed
  • Restrictions on state sharing: difficult to do
    without using explicit external store
  • Hypothesis: these are the elements that allow C/O
    to work
  • Ongoing work: formalize crash-only SW
  • One possibility: observational equivalence with
    respect to a request stream
  • Can be expressed using a design pattern or
    denotational semantics
  • Ideally, will lead to a tool (co-lint) telling
    you whether your component is crash-only
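
One way to picture "observational equivalence with respect to a request stream" is as a test harness: replay the same requests with and without injected crashes and compare the responses. The sketch below is illustrative only. The class and function names are hypothetical, the external store is a plain dict, and crashes are only injected between requests, which is a much weaker check than a real formalism would need.

```python
class ExternalStore(dict):
    """Stand-in for an explicit external state store that survives crashes."""


class StoreBackedCart:
    """Keeps all state in the external store, so a restart loses nothing."""

    def __init__(self, store):
        self.store = store

    def handle(self, item):
        self.store[item] = self.store.get(item, 0) + 1
        return self.store[item]


class InMemoryCart:
    """Keeps state inside the instance, so a crash silently resets it."""

    def __init__(self, store):
        self.counts = {}  # ignores the external store on purpose

    def handle(self, item):
        self.counts[item] = self.counts.get(item, 0) + 1
        return self.counts[item]


def run(component_cls, requests, crash_before=()):
    """Replay requests; crash and restart before each index in crash_before."""
    store = ExternalStore()
    component = component_cls(store)
    responses = []
    for i, request in enumerate(requests):
        if i in crash_before:
            component = component_cls(store)  # crash + restart
        responses.append(component.handle(request))
    return responses


def observationally_equivalent(component_cls, requests, crash_before):
    """True if the crash-injected run returns the same responses as a clean run."""
    clean = run(component_cls, requests)
    crashed = run(component_cls, requests, crash_before)
    return clean == crashed
```

With the request stream `["a", "b", "a", "a"]` and a crash before the third request, `StoreBackedCart` passes the check while `InMemoryCart` does not, which mirrors the hypothesis that pushing state into an explicit external store is part of what makes crash-only work.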

8
Summary: Toward a Crash-only World
  • Goal: simplify recovery management
  • Diagnosis: statistical methods even more
    appealing if the cost of making a mistake is low
  • Recovery: crash-only enforces invariants about
    what happens when recovery is attempted
  • Allows aggressive use of fault model enforcement
    (Martin et al. 2002)
  • Good progress on providing building blocks for
    app writers
  • JAGR: J2EE app server that allows fast recovery
    via micro-reboots and application-generic fault
    injection
  • SSM: a crash-only session state store (in process
    of integrating with JAGR)
  • DStore: a crash-only persistent single-key store
  • Pinpoint: statistics-based failure detection
    (integrated with JAGR, mostly integrated with SSM)

9
Xie et al.: MTTR and End-User Availability
  • Let A_U = user-perceived unavailability, A_S = system
    unavailability
  • Hypothesis: if users retry failed requests, and
    retry succeeds because system had fast recovery,
    they will perceive higher availability
  • When retries are sufficiently frequent, A_U
    approaches A_S (for a system availability of 99.3%,
    this threshold is 200-300 sec)
  • Method: model user retry behavior and system
    failure/recovery using Markov models; solve using
    numerical methods
  • Finding: given 2 systems with same A_S, the one
    with shorter MTTR (even though it also has lower
    MTTF) appears better to the user.
  • Goal of this project: validate that result
    empirically (Jeff Raymakers, Yee-Jiun Song, Wendy
    Tobagus)
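
To make the finding concrete, here is a deliberately simplified stand-in for the Markov model, not the model of Xie et al. It assumes a two-state (up/down) continuous-time chain with exponential failure and repair times, and a user who retries exactly once after a fixed delay; the user perceives a failure only if both attempts hit a down system. The function names are hypothetical.

```python
import math


def steady_state_unavailability(mttf, mttr):
    """Long-run fraction of time the system is down: MTTR / (MTTF + MTTR)."""
    return mttr / (mttf + mttr)


def p_still_down(mttf, mttr, delay):
    """P(system down at time `delay` | down at time 0) for a two-state chain."""
    fail_rate, repair_rate = 1.0 / mttf, 1.0 / mttr
    pi_down = steady_state_unavailability(mttf, mttr)
    # Transient solution: decays from 1 toward pi_down at rate (fail + repair).
    return pi_down + (1.0 - pi_down) * math.exp(-(fail_rate + repair_rate) * delay)


def user_perceived_unavailability(mttf, mttr, retry_delay):
    """Probability that a request and its single retry both fail."""
    first_fails = steady_state_unavailability(mttf, mttr)
    return first_fails * p_still_down(mttf, mttr, retry_delay)
```

Under these assumptions, two systems with identical steady-state availability (say MTTF/MTTR of 1000 s/10 s and 100 s/1 s) differ sharply once a retry delay of a few seconds is factored in: the system with the shorter MTTR has usually recovered by the time the retry arrives. This toy model does not capture the later result in this deck that very low MTTF eventually hurts.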

10
User-perceived unavailability vs. retry rate
  • Higher user retry rates yield little improvement
    in perceived availability.
  • [Figure: plot with the "sweet spot" marked]

11
Surprise! MTTF eventually catches up with you
  • At low MTTR, lowering MTTR and MTTF at the same
    time results in worse user-perceived
    unavailability!
  • [Figure: variable MTTR, but fixed system
    availability (low MTTR -> low MTTF), with the
    "sweet spot" marked]

12
Optimization Choices
  • [Figure: labels "User Perceived Unavailability",
    "System Unavailability", "Fixed MTTF", and
    "Fixed MTTR"]

13
Results Summary
  • We can find a sweet spot (for a given system
    availability) beyond which higher user retry
    rates yield little benefit.
  • For two systems of a given availability, the one
    with lower MTTR does not always yield better
    user-perceived availability.
  • For a given system, we can determine whether
    improving MTTR or MTTF will yield more
    user-visible benefits.

14
Clean shutdown vs. restart?
  • Impractical to guarantee zero crashes => robust
    systems must be crash-safe anyway
  • In that case, why support any other kind of
    shutdown?
  • Historically, for performance (avoid synchronous
    writes, do buffering/caching, etc.), which leads to
    replicated/mirrored state, more code, special
    recovery code paths...
  • Total recovery time may be shorter even if crash
    is forced
  • WinXP can be (mostly) crash-rebooted for upgrades
  • VMS sysadmins would sometimes crash the system
    rather than shut it down (if no users were logged
    on)

Crash-only software must (a) be crash-safe and
(b) recover quickly

15
Why Crash-Only Simplifies Recovery
  • Hardware works, software doesn't
  • Hardware interlocks, timers, etc. have small
    state spaces of behavior, hence high confidence
    they will work as designed
  • Crash-only PWR switch is a way to approach that
    same property for software
  • Crash-only makes recovery policies easier to
    reason about
  • Opportunity to aggressively apply SW rejuvenation
  • Recovery code exercised on every restart: no
    exotic-but-rarely-used code paths
  • Over-recovery may be OK from performability
    standpoint: if recovery is free (performance and
    correctness), you stop thinking about it as
    recovery and start thinking about it as a normal
    aspect of operation

16
Towards a Crash-Only World
  • Existing software that is crash-only or
    near-crash-only
  • Stateless apps: most Web servers
  • Most RDBMSs: crash-safe, but long recovery
  • Postgres, BerkeleyDB/Sleepycat: recovery
    codepath is the main codepath
  • Some appliance storage devices: separate but
    pretty fast recovery path
  • Our goals...
  • Focus on Internet (3-tier) applications:
    already crash-mostly except for persistence
    tier(s)
  • Make the app server, middle-tier persistence, and
    back-end tier (to the extent possible) truly
    crash-only
  • Deploy application-generic failure detection
    techniques (which may over-recover, but the goal
    is to make that OK)
  • Quantify improvement (we hope!) in performability
    resulting from these changes
  • By doing it in the middleware, any app on that
    middleware can benefit