Title: Why Recovery Should Be Free, And Often Can Be
1. Why Recovery Should Be Free, And Often Can Be
- Armando Fox, Stanford University
- June 2003 ROC Retreat
2. Recovery Should Be Free, and Can Be
- Already espouse arguments about lowering MTTR
  - Mitigates impact on service as a whole [Fox & Patterson, 2002]
  - Results in higher end-user-perceived availability, given same overall availability [Xie et al., 2002]
  - etc.
- Tim Chou, Oracle: maybe more important to make recovery predictable (so can plan provisioning, anticipate impact of outage, etc.)... if we understand it, we can optimize its speed
3. Real win: Recovery management is hard
- Determining when to recover is hard
  - How to detect that something's wrong?
  - How do you know when recovery is really necessary? (fail-stutter, etc.)
  - Will recovery make things worse? (cascading recovery)
- Knowing what happens when you recover is hard
  - Will a particular recovery technique work? (the machinery needed to perform the recovery may also be broken)
  - What is the effect on online performance? (recovery can be expensive)
  - What if you needlessly over-recover? (cost of making a mistake is high)
- If recovery were predictable and fast, it would simplify both failure detection and recovery management.
4. Simplifying Recovery Management: Crash-Only Software
- Goal: enforce simple invariants on recovery behavior, from outside the component(s) being recovered
- Crash-only component provides PWR switch: stop = crash
  - clean shutdown = loss of power = kernel panic = ...
- One way to go down => one way to come up: start = recover
- Power switch is external => uniform behavior
  - kill -9, turning off (process kill) a VM, pull power cord
  - Intuition: the infrastructure supporting the power switch is usually simpler than the applications using it, and common across all those applications
- Can crash-only software actually be built, and if so, how?
  - (a) provide building blocks
  - (b) formalize C/O definition and provide developer
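
The external power switch can be sketched as a small supervisor. This is a minimal illustration assuming a POSIX system; the `PowerSwitch` class and its method names are hypothetical and not part of any ROC tool. Turning the switch off always delivers SIGKILL (the `kill -9` from the slide), and turning it on always runs the same command, so start and recover share one code path.

```python
import subprocess
import sys


class PowerSwitch:
    """Hypothetical external power switch for one component (POSIX assumed)."""

    def __init__(self, argv):
        self.argv = argv   # the one command that both starts and recovers
        self.proc = None

    def on(self):
        # start == recover: there is exactly one way to come up.
        self.proc = subprocess.Popen(self.argv)
        return self.proc.pid

    def off(self):
        # stop == crash: Popen.kill() sends SIGKILL on POSIX, which the
        # component cannot catch, so no clean-shutdown path ever runs.
        if self.proc is None:
            return None
        if self.proc.poll() is None:
            self.proc.kill()
        return self.proc.wait()

    def cycle(self):
        # A recovery policy only ever needs this one operation.
        self.off()
        return self.on()


if __name__ == "__main__":
    # Stand-in component: a child Python process that just sleeps.
    switch = PowerSwitch([sys.executable, "-c", "import time; time.sleep(60)"])
    switch.on()
    print("exit status after power-off:", switch.off())
```

Because the switch knows nothing about the component beyond its start command, the same supervisor works unchanged for any component that is crash-only.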
5. Crash-only Building Blocks
- JAGR/ROC-2, a self-recovering J2EE app server [Candea et al., WIAPP 2003]
  - Micro-reboots used for recovery, application-generic failure-path inference used for determining recovery strategy
  - Significantly improves performability relative to whole-app redeploy
- SSM: a CO session state manager [Ling, Fox, AMS 2003]
- DStore: a CO persistent single-key state manager [Huang, Fox, submitted to SRDS 2003]
  - Similar in spirit to HP Labs FAB [Frolund, Saito et al., 2003]
- Common features of both SSM and DStore
  - Redundancy used for persistence
  - Workload semantics exploited to simplify consistency model & recovery
  - Recovery = restart, safe to reboot any node at any time
  - Safe to coerce any failure to a crash (fail-stop) at any time
6. Building blocks, cont.
- Pinpoint, statistical-anomaly-based failure detection
  - Standard tension: accuracy vs. precision (false positives problem)
  - Different clustering techniques seem to be good at detecting different kinds of problems
  - Surprising result from a CS241 project: character-frequency histograms are a good app-generic way to detect end-user-visible failures
  - Mostly integrated with JAGR and SSM
  - On burner: discussions with BEA Systems for integrating into WebLogic Server
- Insight: if cost of over-recovering is low, aggressive statistics-based failure detection becomes more appealing
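
One way a character-frequency histogram could flag end-user-visible failures is sketched below. The slides do not describe the CS241 project's actual method, so the cosine distance, the 0.2 threshold, the function names, and the sample pages are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_histogram(text):
    """Relative frequency of each character in a response body."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def cosine_distance(h1, h2):
    """1 - cosine similarity between two sparse histograms."""
    dot = sum(v * h2.get(ch, 0.0) for ch, v in h1.items())
    norm1 = math.sqrt(sum(v * v for v in h1.values()))
    norm2 = math.sqrt(sum(v * v for v in h2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 1.0  # an empty response is treated as maximally anomalous
    return 1.0 - dot / (norm1 * norm2)


def looks_anomalous(response, baseline_hist, threshold=0.2):
    """Flag a response whose character mix is far from a known-good baseline."""
    return cosine_distance(char_histogram(response), baseline_hist) > threshold


# Baseline built from a page assumed to be good (illustrative strings only).
good_page = "<html><body><h1>Your cart</h1><p>3 items, total $42.00</p></body></html>"
baseline = char_histogram(good_page)

similar_page = "<html><body><h1>Your cart</h1><p>5 items, total $17.50</p></body></html>"
error_page = "java.lang.NullPointerException\n\tat com.example.Cart.total(Cart.java:88)"

print(looks_anomalous(similar_page, baseline))  # False
print(looks_anomalous(error_page, baseline))    # True
```

The detector needs no knowledge of the application: a stack trace or an empty page simply has a different character mix from normal HTML. A low threshold will produce false positives, which is tolerable only if over-recovering is cheap.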
7. Toward a crash-only formalism
- Component frameworks force you into certain app-writing patterns
  - Inter-EJB calls through runtime-managed level of indirection
  - Restrictions on how persistent state mgt can be expressed
  - Restrictions on state sharing: difficult to do without using explicit external store
  - Hypothesis: these are the elements that allow C/O to work
- Ongoing work: formalize crash-only SW
  - One possibility: observational equivalence with respect to a request stream
  - Can be expressed using a design pattern or denotational semantics
  - Ideally, will lead to a tool (co-lint) telling you whether your component is crash-only
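
A toy reading of "observational equivalence with respect to a request stream" is sketched below. It is not the formalism under development. It only replays one request stream twice, once without crashes and once with a crash and restart before every request, and checks that a client sees identical responses. All class and function names are hypothetical, and crashes in the middle of a request are not modeled.

```python
class CounterService:
    """Toy component whose only state lives in an external store (a dict here)."""

    def __init__(self, store):
        self.store = store  # outlives any one component instance

    def handle(self, request):
        op, key = request
        if op == "incr":
            self.store[key] = self.store.get(key, 0) + 1
        return self.store.get(key, 0)


class VolatileCounterService:
    """Same interface, but counts live only in process memory."""

    def __init__(self, store):
        self.counts = {}  # lost on every crash

    def handle(self, request):
        op, key = request
        if op == "incr":
            self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts.get(key, 0)


def replay(component_class, requests, crash_before_each_request):
    """Feed a request stream to a component and collect what a client would see."""
    store = {}
    component = component_class(store)
    responses = []
    for request in requests:
        if crash_before_each_request:
            # Crash + restart: discard the instance, build a new one on the same store.
            component = component_class(store)
        responses.append(component.handle(request))
    return responses


def is_observationally_crash_only(component_class, requests):
    """Compare a crash-free run against a run that crashes before every request."""
    clean = replay(component_class, requests, False)
    crashed = replay(component_class, requests, True)
    return clean == crashed


stream = [("incr", "a"), ("incr", "a"), ("get", "a"), ("incr", "b"), ("get", "b")]
print(is_observationally_crash_only(CounterService, stream))          # True
print(is_observationally_crash_only(VolatileCounterService, stream))  # False
```

The second component fails the check because it shares no explicit external store, which is the kind of restriction the slide hypothesizes is needed for C/O to work.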
8. Summary: Toward a Crash-only World
- Goal: simplify recovery management
  - diagnosis: statistical methods even more appealing if the cost of making a mistake is low
  - recovery: crash-only enforces invariants about what happens when recovery is attempted
  - allows aggressive use of fault model enforcement [Martin et al., 2002]
- Good progress on providing building blocks for app writers
  - JAGR: J2EE app server that allows fast recovery via micro-reboots and application-generic fault injection
  - SSM: a crash-only session state store (in process of integrating with JAGR)
  - DStore: a crash-only persistent single-key store
  - Pinpoint: statistics-based failure detection (integrated with JAGR, mostly integrated with SSM)
9. Xie et al.: MTTR and End-User Availability
- Let A_U = user-perceived unavailability, A_S = system unavailability
- Hypothesis: if users retry failed requests, and retry succeeds because system had fast recovery, they will perceive higher availability
  - When retry rate is sufficiently frequent, A_U approaches A_S (for 99.3% system availability, this threshold is 200-300 sec)
- Method: model user retry behavior and system failure/recovery using Markov models; solve using numerical methods
- Finding: given 2 systems with same A_S, the one with shorter MTTR (even though it also has lower MTTF) appears better to the user.
- Goal of this project: validate that result empirically (Jeff Raymakers, Yee-Jiun Song, Wendy Tobagus)
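
The hypothesis can be illustrated with a much simpler model than the one in Xie et al. This sketch assumes a two-state Markov system (up or down, exponentially distributed times) and a user who retries exactly once after a fixed delay. The function name and the MTTF/MTTR values are made up for illustration, and both example systems have 99.3% availability.

```python
import math


def perceived_unavailability(mttf, mttr, retry_delay):
    """P(first attempt fails AND the single retry also fails).

    Assumed two-state Markov model: up -> down at rate 1/mttf,
    down -> up at rate 1/mttr. All times are in seconds.
    """
    fail_rate = 1.0 / mttf
    repair_rate = 1.0 / mttr
    # Steady-state system unavailability, equal to mttr / (mttf + mttr).
    system_unavail = fail_rate / (fail_rate + repair_rate)
    # Transient solution of the two-state chain: P(down at t | down at 0).
    still_down = system_unavail + (1.0 - system_unavail) * math.exp(
        -(fail_rate + repair_rate) * retry_delay
    )
    return system_unavail * still_down


RETRY_DELAY = 30.0  # seconds; an arbitrary choice
slow = perceived_unavailability(mttf=9930.0, mttr=70.0, retry_delay=RETRY_DELAY)
fast = perceived_unavailability(mttf=993.0, mttr=7.0, retry_delay=RETRY_DELAY)
print(f"MTTR 70 s: {slow:.6f}   MTTR 7 s: {fast:.6f}")
```

With a retry delay of zero the result equals the system unavailability, matching the observation that very frequent retries bring A_U back to A_S. This simplified model improves monotonically as MTTR shrinks, so it does not reproduce the low-MTTR effect reported on slide 11.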
10. User perceived unavailability vs retry rate
Higher user retry rates yield little improvement in perceived availability.
[Figure: user-perceived unavailability vs. retry rate, with the "sweet spot" marked]
11. Surprise! MTTF eventually catches up with you
At low MTTR, lowering MTTR and MTTF at the same time results in worse user-perceived unavailability!
[Figure: variable MTTR, but fixed system availability (low MTTR -> low MTTF), with the "sweet spot" marked]
12. Optimization Choices
[Figure labels: User Perceived Unavailability; System Unavailability; curves for Fixed MTTF and Fixed MTTR]
13. Results Summary
- We can find a sweet spot (for a given system availability) beyond which higher user retry rates yield little benefit.
- For two systems of a given availability, the one with lower MTTR does not always yield better user-perceived availability.
- For a given system, we can determine whether improving MTTR or MTTF will yield more user-visible benefits.
14. Clean shutdown vs. restart?
- Impractical to guarantee zero crashes => robust systems must be crash-safe anyway
- In that case, why support any other kind of shutdown?
  - Historically, for performance (avoid synchronous writes, do buffering/caching, etc.)
  - leads to replicated/mirrored state, more code, special recovery code paths...
- Total recovery time may be shorter even if crash is forced
  - WinXP can be (mostly) crash-rebooted for upgrades
  - VMS sysadmins would sometimes crash the system rather than shut it down (if no users were logged on)
Crash-only software must (a) be crash-safe, (b) recover quickly
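
The crash-safe half of that requirement can be sketched with a single-file store that has no shutdown method at all. The `CrashSafeState` class is hypothetical. Every update is written synchronously and swapped in with an atomic rename, so startup runs the same recovery path whether the previous run ended in a crash or not. For brevity the sketch does not fsync the containing directory, which a fully durable version would need.

```python
import json
import os
import tempfile


class CrashSafeState:
    """Hypothetical single-file state store with no clean-shutdown path."""

    def __init__(self, path):
        self.path = path
        self.data = self._recover()  # startup always runs recovery

    def _recover(self):
        try:
            with open(self.path, "r", encoding="utf-8") as f:
                return json.load(f)
        except FileNotFoundError:
            return {}  # first boot looks like recovering an empty store

    def put(self, key, value):
        new_data = dict(self.data)
        new_data[key] = value
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as f:
                json.dump(new_data, f)
                f.flush()
                os.fsync(f.fileno())  # synchronous write, paid on every update
            # Atomic rename: a crash leaves the old file or the new one, never half.
            os.replace(tmp_path, self.path)
        except BaseException:
            if os.path.exists(tmp_path):
                os.unlink(tmp_path)
            raise
        self.data = new_data
```

The synchronous write on every `put` is the performance cost that clean shutdown was historically meant to avoid. In exchange there is no buffered state to flush and no separate recovery code path.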
15. Why Crash-Only Simplifies Recovery
- Hardware works, software doesn't
  - Hardware interlocks, timers, etc. have small state spaces of behavior, hence high confidence they will work as designed
  - Crash-only PWR switch is a way to approach that same property for software
- Crash-only makes recovery policies easier to reason about
  - Opportunity to aggressively apply SW rejuvenation
  - Recovery code exercised on every restart: no exotic-but-rarely-used code paths
  - Over-recovery may be OK from performability standpoint: if recovery is free (performance & correctness), you stop thinking about it as recovery and start thinking about it as normal aspect of operation
16. Towards a Crash-Only World
- Existing software that is crash-only or near-crash-only
  - Stateless apps: most Web servers
  - Most RDBMSs: crash-safe, but long recovery
  - Postgres, BerkeleyDB/Sleepycat: recovery codepath is the main codepath
  - Some appliance storage devices: separate but pretty fast recovery path
- Our goals...
  - Focus on Internet (3 tier) applications: already crash-mostly except for persistence tier(s)
  - Make the app server, middle-tier persistence, and back-end tier (to the extent possible) truly crash-only
  - Deploy application-generic failure detection techniques (which may over-recover, but the goal is to make that OK)
  - Quantify improvement (we hope!) in performability resulting from these changes
  - By doing it in the middleware, any app on that middleware can benefit