Why Recovery Should Be Free, And Often Can Be

Slides: 17

Provided by: fox66

Learn more at: http://roc.cs.berkeley.edu

1
Why Recovery Should Be Free, And Often Can Be

Armando Fox, Stanford University
June 2003 ROC Retreat

2
Recovery Should Be Free, and Can Be

We already espouse arguments for lowering MTTR
  Mitigates impact on service as a whole (Fox & Patterson, 2002)
  Results in higher end-user-perceived availability, given same overall availability (Xie et al., 2002)
  etc.
Tim Chou, Oracle: maybe more important to make recovery predictable (so can plan provisioning, anticipate impact of outage, etc.)... if we understand it, we can optimize its speed

3
Real win: recovery management is hard

Determining when to recover is hard
  How to detect that something's wrong?
  How do you know when recovery is really necessary? (fail-stutter, etc.)
  Will recovery make things worse? (cascading recovery)
Knowing what happens when you recover is hard
  Will a particular recovery technique work? (the machinery needed to perform the recovery may also be broken)
  What is the effect on online performance? (recovery can be expensive)
  What if you needlessly over-recover? (cost of making a mistake is high)
If recovery were predictable and fast, it would simplify both failure detection and recovery management.

4
Simplifying Recovery Management: Crash-Only Software

Goal: enforce simple invariants on recovery behavior, from outside the component(s) being recovered
Crash-only component provides a PWR switch: stop = crash
  clean shutdown = loss of power = kernel panic = ...
  One way to go down => one way to come up: start = recover
Power switch is external => uniform behavior
  kill -9, turning off (process kill) a VM, pull power cord
Intuition: the infrastructure supporting the power switch is usually simpler than the applications using it, and common across all those applications
Can crash-only software actually be built, and if so, how?
  (a) provide building blocks
  (b) formalize C/O definition and provide developer
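
The "stop = crash, start = recover" idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the ROC project: the class name `CrashOnlyCounter`, the JSON-lines journal format, and the `increment` operation are all hypothetical. The component has no shutdown method at all, and its only startup path is the recovery path.

```python
import json
import os


class CrashOnlyCounter:
    """Hypothetical crash-only component: there is no stop() or close().

    Every update is appended and fsync'ed to a journal before it is
    acknowledged, so the only way down is a crash and the only way up
    is recovery (replaying the journal).
    """

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.counts = {}
        self._recover()  # start = recover: same code path on every start
        self._journal = open(journal_path, "ab")

    def _recover(self):
        if not os.path.exists(self.journal_path):
            return
        good_bytes = 0
        with open(self.journal_path, "rb") as f:
            for raw in f:
                if not raw.endswith(b"\n"):
                    break  # torn final write left behind by a crash
                try:
                    record = json.loads(raw)
                except ValueError:
                    break  # corrupt record: stop replaying here
                self.counts[record["key"]] = record["value"]
                good_bytes += len(raw)
        # Drop any torn tail so later appends start on a clean line.
        os.truncate(self.journal_path, good_bytes)

    def increment(self, key):
        value = self.counts.get(key, 0) + 1
        line = json.dumps({"key": key, "value": value}) + "\n"
        self._journal.write(line.encode("utf-8"))
        self._journal.flush()
        os.fsync(self._journal.fileno())  # durable before acknowledging
        self.counts[key] = value
        return value
```

Abandoning an instance without any cleanup (or killing the process with `kill -9`) and constructing a new one on the same journal path models the external power switch. The synchronous `fsync` on every update is the performance price the sketch pays for never needing a clean shutdown.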

5
Crash-only Building Blocks

JAGR/ROC-2, a self-recovering J2EE app server (Candea et al., WIAPP 2003)
  Micro-reboots used for recovery, application-generic failure-path inference used for determining recovery strategy
  Significantly improves performability relative to whole-app redeploy
SSM: a C/O session state manager (Ling, Fox, AMS 2003)
DStore: a C/O persistent single-key state manager (Huang, Fox, submitted to SRDS 2003)
  Similar in spirit to HP Labs FAB (Frolund, Saito et al., 2003)
Common features of both SSM and DStore
  Redundancy used for persistence
  Workload semantics exploited to simplify consistency model and recovery
  Recovery = restart, safe to reboot any node at any time
  Safe to coerce any failure to a crash (fail-stop) at any time
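
A minimal sketch of "redundancy used for persistence" with "recovery = restart" follows. It is not the SSM or DStore protocol; the names `Brick` and `RedundantSessionStore`, the number of copies, and the placement list returned to the caller are assumptions made for illustration. Each session is written to several in-memory nodes, so any single node can be rebooted (coming back empty) without losing the session.

```python
import random


class Brick:
    """One in-memory storage node; a reboot loses everything it held."""

    def __init__(self):
        self.data = {}

    def reboot(self):
        # Recovery = restart: the node comes back empty, with no repair step.
        self.data = {}


class RedundantSessionStore:
    """Hypothetical store that keeps `copies` replicas of each session."""

    def __init__(self, n_bricks=5, copies=3, seed=0):
        self.bricks = [Brick() for _ in range(n_bricks)]
        self.copies = copies
        self.rng = random.Random(seed)

    def write(self, session_id, state):
        # Pick distinct bricks and store a full copy on each one.
        placement = self.rng.sample(range(len(self.bricks)), self.copies)
        for i in placement:
            self.bricks[i].data[session_id] = state
        return placement  # the caller keeps this, e.g. in a cookie

    def read(self, session_id, placement):
        # Any surviving replica is good enough.
        for i in placement:
            if session_id in self.bricks[i].data:
                return self.bricks[i].data[session_id]
        raise KeyError(session_id)
```

Because session state is keyed by a single ID and read and written as a whole, the sketch needs no cross-key consistency, which is one way to read "workload semantics exploited to simplify consistency model and recovery".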

6
Building blocks, cont.

Pinpoint, statistical-anomaly-based failure detection
  Standard tension: accuracy vs. precision (false positives problem)
  Different clustering techniques seem to be good at detecting different kinds of problems
  Surprising result from a CS241 project: character-frequency histograms are a good app-generic way to detect end-user-visible failures
Mostly integrated with JAGR and SSM
On burner: discussions with BEA Systems for integrating into WebLogic Server
Insight: if cost of over-recovering is low, aggressive statistics-based failure detection becomes more appealing
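
The character-frequency idea can be sketched as follows. The slides do not say how the histograms were compared, so the cosine distance, the `threshold` value, and the function names here are assumptions, not the CS241 project's method.

```python
import math
from collections import Counter


def char_histogram(text):
    """Relative frequency of each character in a response body."""
    counts = Counter(text)
    total = sum(counts.values()) or 1
    return {ch: n / total for ch, n in counts.items()}


def cosine_distance(h1, h2):
    """0.0 for identical distributions, 1.0 for no characters in common."""
    dot = sum(v * h2.get(ch, 0.0) for ch, v in h1.items())
    norm1 = math.sqrt(sum(v * v for v in h1.values()))
    norm2 = math.sqrt(sum(v * v for v in h2.values()))
    if norm1 == 0 or norm2 == 0:
        return 1.0  # an empty response is maximally far from any baseline
    return 1.0 - dot / (norm1 * norm2)


def looks_anomalous(response, baseline_hist, threshold=0.2):
    """Flag a response whose character mix drifts far from the baseline."""
    return cosine_distance(char_histogram(response), baseline_hist) > threshold
```

A detector like this knows nothing about the application, which is what makes it app-generic; it will also raise false alarms, which matters less if over-recovering is cheap.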

7
Toward a crash-only formalism

Component frameworks force you into certain app-writing patterns
  Inter-EJB calls through runtime-managed level of indirection
  Restrictions on how persistent state management can be expressed
  Restrictions on state sharing: difficult to do without using explicit external store
Hypothesis: these are the elements that allow C/O to work
Ongoing work: formalize crash-only SW
  One possibility: observational equivalence with respect to a request stream
  Can be expressed using a design pattern or denotational semantics
  Ideally, will lead to a tool (co-lint) telling you whether your component is crash-only
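
One way to make "observational equivalence with respect to a request stream" concrete is a test harness like the sketch below. It is not the authors' formalism or the co-lint tool; the `handle(request)` interface, the two toy components, and the choice to inject crashes only between requests are assumptions for illustration.

```python
def run_stream(component_cls, requests, crash_before=None):
    """Feed requests to a component, optionally crash-restarting it once.

    `store` stands in for an explicit external store that survives crashes;
    constructing a new component on it models "start = recover".
    """
    store = {}
    component = component_cls(store)
    responses = []
    for i, request in enumerate(requests):
        if i == crash_before:
            component = component_cls(store)  # in-memory state is lost
        responses.append(component.handle(request))
    return responses


def observationally_equivalent(component_cls, requests):
    """True if a crash before any request leaves the responses unchanged."""
    reference = run_stream(component_cls, requests)
    return all(
        run_stream(component_cls, requests, crash_before=i) == reference
        for i in range(len(requests))
    )


class DurableCounter:
    """Keeps all state in the external store."""

    def __init__(self, store):
        self.store = store

    def handle(self, request):
        self.store[request] = self.store.get(request, 0) + 1
        return self.store[request]


class VolatileCounter:
    """Keeps state in memory only, so a crash is visible to clients."""

    def __init__(self, store):
        self.counts = {}

    def handle(self, request):
        self.counts[request] = self.counts.get(request, 0) + 1
        return self.counts[request]
```

Under this check `DurableCounter` passes and `VolatileCounter` fails, which mirrors the hypothesis above that pushing state into an explicit external store is one of the elements that lets crash-only work.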

8
Summary: Toward a Crash-only World

Goal: simplify recovery management
  diagnosis: statistical methods even more appealing if the cost of making a mistake is low
  recovery: crash-only enforces invariants about what happens when recovery is attempted
  allows aggressive use of fault model enforcement (Martin et al., 2002)
Good progress on providing building blocks for app writers
  JAGR: J2EE app server that allows fast recovery via micro-reboots and application-generic fault injection
  SSM: a crash-only session state store (in process of integrating with JAGR)
  DStore: a crash-only persistent single-key store
  Pinpoint: statistics-based failure detection (integrated with JAGR, mostly integrated with SSM)

9
Xie et al.: MTTR and End-User Availability

Let A_U = user-perceived unavailability, A_S = system unavailability
Hypothesis: if users retry failed requests, and the retry succeeds because the system had fast recovery, they will perceive higher availability
When retries are sufficiently frequent, A_U approaches A_S (for 99.3% system availability, this threshold is 200-300 sec)
Method: model user retry behavior and system failure/recovery using Markov models; solve using numerical methods
Finding: given 2 systems with the same A_S, the one with shorter MTTR (even though it also has lower MTTF) appears better to the user.
Goal of this project: validate that result empirically (Jeff Raymakers, Yee-Jiun Song, Wendy Tobagus)
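
The finding about MTTR can be illustrated with a much simpler model than the one in Xie et al. The sketch below is an assumption-laden toy, not their model: the system is a two-state (up/down) continuous-time Markov chain with exponential failure and repair times, and a user who hits an outage retries at a fixed interval a fixed number of times. The function names and parameters are hypothetical.

```python
import math


def system_unavailability(mttf, mttr):
    """Steady-state fraction of time a two-state (up/down) system is down."""
    return mttr / (mttf + mttr)


def user_perceived_unavailability(mttf, mttr, retry_interval, retries):
    """Probability that a request and all of its retries find the system down.

    Assumes failure rate 1/MTTF and repair rate 1/MTTR (both exponential),
    and a user who retries every `retry_interval` seconds, `retries` times.
    """
    lam, mu = 1.0 / mttf, 1.0 / mttr
    down = lam / (lam + mu)  # same value as system_unavailability(mttf, mttr)
    # Two-state chain transient: P(down at t + r | down at t)
    still_down = down + (1.0 - down) * math.exp(-(lam + mu) * retry_interval)
    # Markov property: each retry depends only on the previous observation.
    return down * still_down ** retries


# Two systems with identical steady-state unavailability (times in seconds).
slow = user_perceived_unavailability(10000, 100, retry_interval=30, retries=3)
fast = user_perceived_unavailability(1000, 10, retry_interval=30, retries=3)
```

In this toy model `fast` comes out smaller than `slow`, matching the finding that shorter MTTR looks better to a retrying user at equal system availability. The model has no notion of in-progress work being lost to more frequent failures, so it cannot show any penalty from a lower MTTF.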

10
User-perceived unavailability vs. retry rate

Higher user retry rates yield little improvement in perceived availability beyond the sweet spot.
[Figure: user-perceived unavailability vs. retry rate, with the sweet spot marked]

11
Surprise! MTTF eventually catches up with you

At low MTTR, lowering MTTR and MTTF at the same time results in worse user-perceived unavailability!
[Figure: variable MTTR at fixed system availability (low MTTR -> low MTTF), with the sweet spot marked]

12
Optimization Choices

[Figure: user-perceived unavailability and system unavailability, with curves for fixed MTTF and fixed MTTR]

13
Results Summary

We can find a sweet spot (for a given system availability) beyond which higher user retry rates yield little benefit.
For two systems of a given availability, the one with lower MTTR does not always yield better user-perceived availability.
For a given system, we can determine whether improving MTTR or MTTF will yield more user-visible benefits.

14
Clean shutdown vs. restart?

Impractical to guarantee zero crashes => robust systems must be crash-safe anyway
In that case, why support any other kind of shutdown?
  Historically, for performance (avoid synchronous writes, do buffering/caching, etc.): leads to replicated/mirrored state, more code, special recovery code paths...

Total recovery time may be shorter even if crash is forced
  WinXP can be (mostly) crash-rebooted for upgrades
  VMS sysadmins would sometimes crash the system rather than shut it down (if no users were logged on)

Crash-only software must (a) be crash-safe, (b) recover quickly
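
As one concrete example of being crash-safe without a clean-shutdown path, the sketch below uses the common write-to-temp-file, fsync, then rename pattern. It is a generic technique, not something taken from the talk, and the function name `crash_safe_write` is hypothetical. On POSIX filesystems the rename is atomic, so a crash at any point leaves either the old file or the new one, never a half-written mix.

```python
import os
import tempfile


def crash_safe_write(path, data):
    """Replace `path` with `data` (bytes) so a crash never exposes a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays
    # within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # synchronous write: the cost of crash safety
        os.replace(tmp_path, path)  # atomic swap of old contents for new
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

The sketch does not fsync the containing directory, so on some filesystems the rename itself may not yet be durable after a power loss; the file contents are still never torn. Recovery is also fast here because there is nothing to repair on restart beyond ignoring stray temp files.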

15
Why Crash-Only Simplifies Recovery

Hardware works, software doesn't
  Hardware interlocks, timers, etc. have small state spaces of behavior, hence high confidence they will work as designed
  Crash-only PWR switch is a way to approach that same property for software
Crash-only makes recovery policies easier to reason about
  Opportunity to aggressively apply SW rejuvenation
  Recovery code exercised on every restart: no exotic-but-rarely-used code paths
  Over-recovery may be OK from performability standpoint: if recovery is free (performance and correctness), you stop thinking about it as recovery and start thinking about it as a normal aspect of operation

16
Towards a Crash-Only World

Existing software that is crash-only or near-crash-only
  Stateless apps: most Web servers
  Most RDBMSs: crash-safe, but long recovery
  Postgres, BerkeleyDB/Sleepycat: recovery codepath is the main codepath
  Some appliance storage devices: separate but pretty fast recovery path
Our goals...
  Focus on Internet (3-tier) applications: already crash-mostly except for persistence tier(s)
  Make the app server, middle-tier persistence, and back-end tier (to the extent possible) truly crash-only
  Deploy application-generic failure detection techniques (which may over-recover, but the goal is to make that OK)
  Quantify improvement (we hope!) in performability resulting from these changes
  By doing it in the middleware, any app on that middleware can benefit