Toward Recovery-Oriented Computing
- Armando Fox, Stanford University; David Patterson, UC Berkeley; and a cast of tens
Outline
- Whither recovery-oriented computing?
- The research/industry agenda of the last 15 years
- Today's pressing problem: availability (we knew that) - but what is new/different compared to previous fault-tolerance work, databases, etc.?
- Recovery-Oriented Computing as an approach to availability
- Motivation and philosophy
- A sampling of research avenues
- What ROC is not
Reevaluating goals and assumptions
- Goals of the last 15 years:
- Goal 1: Improve performance
- Goal 2: Improve performance
- Goal 3: Improve cost-performance
- Assumptions:
- Humans are perfect (they don't make mistakes during installation, wiring, upgrade, maintenance, or repair)
- Software will eventually be bug-free (good programmers will write bug-free code; debugging works)
- Hardware MTBF is already very large (100 years between failures), and will continue to increase
Results of this successful agenda
- Good news: faster computers, denser disks, all cheaper
- computation faster by 3 orders of magnitude
- disk capacity greater by 3 orders of magnitude
- Result: TCO is dominated by administration, not hardware cost
- Bad news: complex, brittle systems that fail frequently
- 65% of IT managers report that their websites were unavailable to customers over a 6-month period (25%: 3 or more outages) [Internet Week, 4/3/2000]
- Outage costs: negative press, click-overs to a competitor, stock price, market cap
- Yet availability is the key metric for online services!
Direct Downtime Costs (per Hour)
- Brokerage operations: $6,450,000
- Credit card authorization: $2,600,000
- Ebay (22-hour outage): $225,000
- Amazon.com: $180,000
- Package shipping services: $150,000
- Home shopping channel: $113,000
- Catalog sales center: $90,000
- Airline reservation center: $89,000
- Cellular service activation: $41,000
- On-line network fees: $25,000
- ATM service fees: $14,000
Sources: InternetWeek 4/3/2000; Fibre Channel: A Comprehensive Introduction, R. Kembel 2000, p. 8 ("...based on a survey done by Contingency Planning Research.")
So, what are today's challenges?
- We all seem to agree on goals:
- Dave Patterson, IPTS 2002: ACME (availability, change, maintenance, evolution)
- Jim Gray, HPTS 2001: FAASM (functionality, availability, agility, scalability, manageability)
- Butler Lampson, SOSP 1999: always available, evolving while they run, growing without practical limit
- John Hennessy, FCRC 1999: availability, maintainability and ease of upgrades, scalability
- Fox & Brewer, HotOS 1997: BASE (best-effort service, availability, soft state, eventual consistency)
- We're all singing the same tune, but what is new?
What's New and Different
- Evolution and change are integral
- not true of many traditional "five nines" systems: long design cycles, and changes incur high overhead for design/spec/testing
- Last version of the space shuttle software: 1 bug in 420 KLOC, cost $35M/yr to maintain (good-quality commercial SW: 1 bug/KLOC)
- But a recent upgrade for GPS support required generating 2,500 pages of specs before changing anything in 6.3 KLOC (1.5% of the code)
- Performance is still important, but the focus has changed
- Interactive performance and availability to end users is key
- Users appear willing to occasionally tolerate temporary degradation (service quality) in exchange for improved availability
- How to capture this tradeoff: soft/stale state, partial performance degradation, imprecise answers
ROC Philosophy
- ROC philosophy (Peres's Law):
- "If a problem has no solution, it may not be a problem, but a fact - not to be solved, but to be coped with over time" - Shimon Peres
- Failures (hardware, software, operator-induced) are a fact; recovery is how we cope with them over time
- Availability = MTTF/MTBF = MTTF / (MTTF + MTTR). Rather than just making MTTF very large, make MTTR very small (see the sketch after this list)
- Why?
- Human errors will still cause outages; minimize recovery time
- Recovery time is directly measurable, and directly captures the impact on users of a specific outage incident (MTTF doesn't)
- Rapid evolution makes exhaustive testing/validation impossible; unexpected/transient failures will still occur
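A minimal illustration of the availability formula above, using made-up MTTF/MTTR numbers: cutting MTTR by a factor of ten buys roughly the same availability as growing MTTF by a factor of ten.

    # Availability = MTTF / (MTTF + MTTR); the numbers are illustrative, not measured.
    def availability(mttf_hours: float, mttr_hours: float) -> float:
        return mttf_hours / (mttf_hours + mttr_hours)

    print(f"baseline       : {availability(1000, 1.0):.4%}")   # ~99.90%
    print(f"10x MTTF       : {availability(10000, 1.0):.4%}")  # ~99.99%
    print(f"10x lower MTTR : {availability(1000, 0.1):.4%}")   # ~99.99%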
1. Human Error Is Inevitable
- Human error is a major factor in downtime
- PSTN: half of all outage incidents and outage-minutes from 1992-1994 were due to human error (including errors by phone company maintenance workers)
- Oracle: up to half of DB failures are due to human error (1999)
- Microsoft blamed human error for a 24-hour outage in Jan 2001
- Approach:
- Learn from the psychology of human error and from disaster case studies
- Build in system support for recovery from human errors
- Use tools such as error injection and virtual machine technology to provide "flight simulator" training for operators
The 3R undo model
- Undo: time travel for system operators
- Three R's for recovery:
- Rewind: roll system state backwards in time
- Repair: change the system to prevent the failure
- e.g., edit history, fix a latent error, retry an unsuccessful operation, install a preventative patch
- Replay: roll system state forward, replaying end-user interactions lost during rewind
- All three R's are critical (a minimal sketch of the loop follows this list)
- rewind enables undo
- repair lets the user/administrator fix problems
- replay preserves updates and propagates fixes forward
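A minimal sketch of how the three steps compose, assuming hypothetical interfaces (restore_snapshot, a repair callback, a log of replayable user operations); it is not the authors' implementation.

    # Hypothetical Rewind/Repair/Replay loop; all interfaces here are assumptions.
    from typing import Callable, Iterable

    def three_r_undo(system,
                     snapshot_id: str,
                     repair: Callable[[object], None],
                     user_log: Iterable) -> None:
        system.restore_snapshot(snapshot_id)  # Rewind: roll system state back in time
        repair(system)                        # Repair: e.g., install a virus filter or patch
        for op in user_log:                   # Replay: re-execute end-user interactions lost
            op.apply(system)                  #   by the rewind, under the repaired system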
Example: e-mail scenario
- Before undo:
- a virus-laden message arrives
- the user copies it into a folder without looking at it
- The operator invokes undo (rewind) to install a virus filter (repair)
- During replay:
- the message is redelivered but is now discarded by the virus filter
- the copy operation is now unsafe (the source message doesn't exist)
- compensating action: insert a placeholder for the message
- now the copy command can be executed, making the history replay-acceptable (a sketch follows)
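A sketch of that compensating action during replay; the mail-store methods (exists, insert_placeholder, copy) are hypothetical names used only for illustration.

    # Hypothetical replay of a logged "copy message" operation with a compensating action.
    def replay_copy(mailstore, msg_id: str, src_folder: str, dst_folder: str) -> None:
        if not mailstore.exists(src_folder, msg_id):
            # The repair (virus filter) discarded the message on redelivery, so insert a
            # placeholder to make the logged copy operation replay-acceptable.
            mailstore.insert_placeholder(src_folder, msg_id,
                                         note="message removed by post-undo repair")
        mailstore.copy(src_folder, dst_folder, msg_id)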
First implementation attempt
- Undo wrapper for an open-source IMAP e-mail store (a small proxying sketch follows)
- [Architecture diagram: a 3R proxy sits in front of the e-mail server, intercepting SMTP and IMAP traffic; the 3R layer's state tracker controls the server and records state (user state, mailboxes, application, operating system) to non-overwriting storage and an undo log.]
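A minimal sketch of the proxying idea, assuming a simplified verb-level interface; the real wrapper's protocol handling and state tracking are not shown.

    # Hypothetical sketch: the proxy appends each user-visible verb to an append-only undo
    # log before forwarding it to the e-mail server, so rewind/replay can recover user intent.
    import json, time

    class UndoLog:
        def __init__(self, path: str):
            self.path = path
        def append(self, verb: str, args: dict) -> None:
            with open(self.path, "a") as f:          # non-overwriting, append-only record
                f.write(json.dumps({"t": time.time(), "verb": verb, "args": args}) + "\n")

    class ThreeRProxy:
        def __init__(self, server, log: UndoLog):
            self.server, self.log = server, log
        def handle(self, verb: str, args: dict):
            self.log.append(verb, args)              # record intent first
            return self.server.execute(verb, args)   # then forward to the real server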
3. Handling Transient Failures via Restart
- Many failures are either (a) transient and fixable through reboot, or (b) non-transient, but reboot is the lowest-MTTR fix
- Recursive Restarts: to minimize MTTR, restart the minimal set of subsystems that could cure a failure; if that doesn't help, restart the next-higher containing set, etc.
- Partial restarts/reboots:
- Return the system (mostly) to a well-tested, well-understood start state
- High-confidence way to reclaim stale/leaked resources
- Unlike true checkpointing, a reboot is more likely to avoid repeated failure due to corrupted state
- We focus on proactive restarts; they can also be reactive (SW rejuvenation)
- Easier to run a system 365 times for 1 day each than once for 365 days
- Goals:
- What is the software structure that can best accommodate such failure management while still preserving all other requirements (functionality, performance, consistency, etc.)?
- Develop a methodology for building and managing RR systems (concrete engineering methods)
- Develop the tools for building, testing, deploying, and managing RR systems
- Design for fast restartability in online-service building blocks
A Hierarchy of Restartable Units
- Siblings are highly fault-isolated:
- at the low level, by high-confidence, HW-assisted machinery (e.g., MMU, physical isolation)
- at higher levels, by VM-level abstractions built on that machinery (e.g., JVM, HW VM, process)
- The R-map (hierarchy of restartable component groups) captures restart dependencies
- Groups of restart units can be restarted by a common parent
- Restarting a node restarts everything in its subtree
- A failure is minimally curable at a specific node
- Restarts farther up the tree are more expensive, but give higher confidence for curing transients (see the sketch after this list)
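A minimal sketch of an r-map node with recursive-restart escalation; the node structure and the cured() check are illustrative assumptions, not the project's actual API.

    # Hypothetical r-map: restarting a node restarts its whole subtree; if the failure
    # persists, escalate to the parent (a larger, costlier, higher-confidence restart).
    from typing import Callable, List, Optional

    class RNode:
        def __init__(self, name: str, children: Optional[List["RNode"]] = None):
            self.name = name
            self.children = children or []
            self.parent: Optional["RNode"] = None
            for c in self.children:
                c.parent = self

        def restart(self) -> None:
            for c in self.children:       # a node's restart covers everything in its subtree
                c.restart()
            print(f"restarting {self.name}")

    def recursive_restart(failing: RNode, cured: Callable[[], bool]) -> bool:
        node: Optional[RNode] = failing
        while node is not None:
            node.restart()                # try the minimal restart group first
            if cured():
                return True
            node = node.parent            # escalate to the next-higher containing group
        return False                      # even a whole-system restart did not cure it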
RR-ifying a satellite ground station
- Biggest improvement: MTTF/MTTR-based boundary redrawing
- Ability to isolate unstable components without penalizing the whole system
- Achieve a balanced MTTF/MTTR ratio across components at the same level
- Lower MTTR may be strictly better than higher MTTF:
- unplanned downtime is more expensive than planned downtime, and downtime under a heavy/critical workload (e.g., a satellite pass) is more expensive than downtime under a light/non-critical workload
- high MTTF doesn't guarantee a failure-free operation interval, but sufficiently low MTTR may mitigate the impact of a failure
- Current work is applying RR to a ubiquitous computing environment, a J2EE application server, and an OSGi-based platform for cars; new lessons will emerge (e.g., the r-tree needs to become an r-DAG)
- Most of these lessons are not surprising, but RR provides a uniform framework within which to discuss them
MTTR Captures Outage Costs
- Recent software-related outages at Ebay: 4.5 hours in Apr 02, 22 hours in Jun 99, 7 hours in May 99, 9 hours in Dec 98
- Assume two 4-hour (newsworthy) outages per year
- A = (182 × 24 hours) / (182 × 24 + 4 hours) ≈ 99.9%
- Dollar cost: Ebay's policy for a 2-hour outage is to credit fees to all affected users (US$3-5M for Jun 99)
- Customer loyalty: after the Jun 99 outage, Yahoo Auctions reported a statistically significant increase in users
- Ebay's market cap dropped US$4B after the Jun 99 outage, and the stock price dropped 25%
- Newsworthy due to the number of users affected, given the length of the outage
Outage costs, cont.
- What about a 10-minute outage once per week?
- A = (7 × 24 hours) / (7 × 24 + 1/6 hours) ≈ 99.9% - the same
- Can we quantify the savings over the previous scenario? (a small calculation follows this list)
- Shorter outages affect fewer users at a time
- A typical AOL email outage affects 1-2% of users
- Many short outages may affect different subsets of users
- Shorter outages are typically not newsworthy
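A small worked check of the two availability figures above (illustrative arithmetic only; it ignores which, and how many, users each outage actually touches):

    # Both outage profiles round to "three nines", yet their per-incident impact differs.
    HOURS_PER_YEAR = 365 * 24

    def yearly_availability(outages_per_year: float, hours_per_outage: float) -> float:
        downtime = outages_per_year * hours_per_outage
        return (HOURS_PER_YEAR - downtime) / HOURS_PER_YEAR

    print(f"two 4-hour outages/year : {yearly_availability(2, 4):.4%}")       # ~99.91%
    print(f"10-minute outage weekly : {yearly_availability(52, 1 / 6):.4%}")  # ~99.90%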
When Low MTTR Trumps High MTTF
- MTTR is directly measurable; MTTF usually is not
- Component MTTFs: tens of years
- Software MTTF ceiling: roughly 30 yrs (Gray, HDCC 01)
- Result: measuring MTTF requires hundreds of system-years
- But MTTRs are minutes to hours, even for complex SW components
- MTTR more directly captures the impact of a specific outage
- Very low MTTR (on the order of 10 seconds) is achievable with redundancy and failover (see the sketch below)
- Keeps response time below the user's threshold of distraction [Miller 1968; Bhatti et al. 2001; Zona Research 1999]
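A minimal failover sketch with placeholder replica URLs; the point is that a short per-try deadline bounds the user-visible delay while one replica is down or recovering.

    # Hypothetical redundancy/failover: try each replica with a short deadline so the end
    # user sees a brief delay rather than an outage while a failed replica recovers.
    import urllib.request
    from typing import Optional
    from urllib.error import URLError

    REPLICAS = ["http://replica1.example.com", "http://replica2.example.com"]  # placeholders

    def fetch_with_failover(path: str, per_try_timeout_s: float = 2.0) -> bytes:
        last_error: Optional[Exception] = None
        for base in REPLICAS:
            try:
                with urllib.request.urlopen(base + path, timeout=per_try_timeout_s) as resp:
                    return resp.read()
            except (URLError, OSError) as exc:  # replica down or too slow: fail over
                last_error = exc
        raise RuntimeError("all replicas failed") from last_error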
Degraded Service vs. Outage
- What about longer MTTRs (minutes or hours)?
- Can the service be designed so that short outages appear to users as temporary degradation instead?
- How much degradation will users tolerate?
- For how long (until they abandon the site because it feels like a true outage - abandonment can be measured)?
- How frequently?
- Even if the above thresholds can be deduced, how do we design the service so that transient failures can be mapped onto degraded quality?
Examples of degraded service
- Goal: derive a set of service primitives that directly reflect parameterizable degradation due to transient failure ("theory" would be too strong a word); one common pattern is sketched below
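One familiar pattern of such a primitive, sketched under assumed cache/backend helper objects: serve slightly stale data when the authoritative backend is unavailable, so a transient failure appears to users as degraded freshness rather than an outage.

    # Hypothetical degraded-mode read: prefer fresh data, fall back to stale cached data
    # (flagged as such), and surface a failure only if neither is available.
    from typing import Any, Optional, Tuple

    def degraded_read(key: str, cache, backend, timeout_s: float = 1.0) -> Tuple[Any, bool]:
        """Returns (value, is_stale)."""
        try:
            value = backend.query(key, timeout=timeout_s)  # authoritative answer
            cache.put(key, value)
            return value, False
        except TimeoutError:
            stale: Optional[Any] = cache.get(key)          # possibly-stale degraded answer
            if stale is not None:
                return stale, True
            raise                                          # no degraded answer: a real outage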
Two Frequently Asked Questions
- Is ROC the same as autonomic computing?
- Are you saying we should build lousy hardware and
software and mask all those failures with ROC
mechanisms?
1. Does ROC = autonomic computing?
- Self-administering?
- For now, the focus is on empowering administrators, not eliminating them
- Humans are good at detecting and learning from their own mistakes, so why not? (avoiding the automation irony)
- We're not sure we understand sysadmins' current techniques well enough to think about automation
- Self-healing, self-reprovisioning, self-load-balancing?
- Sure - Web services and datacenters already do this for many situations; many techniques and tools are well known
- But do we know how (in theory) to design the application software to make these techniques possible?
- "Digital immune system" - it's in WinXP
2. What ROC is not
- We do not advocate:
- producing buggy software
- building lousy hardware
- slacking on design, testing, or careful administration
- discarding existing useful techniques or tools
- We do advocate:
- an increased focus specifically on lowering MTTR
- increased examination of when some guarantees can be traded for lower MTTR
- systematic exploration of design for fast recovery in the context of a variety of applications
- stealing great ideas from systems, Internet protocols, psychology, and safety-critical systems design
Summary: ROC and Online Services
- Current software realities lead to new foci
- Rapid evolution: traditional FT methodologies are difficult to apply
- Human error is inevitable, but humans are good at identifying their own errors; provide facilities that allow recovery from these
- HW and SW failure is inevitable: use redundancy and a designed-in ability to substitute temporary degradation for outages (design for recovery)
- Trying to stay relevant via direct contact with designers/operators of large systems
- Need real data on how large systems fail
- Need real data on how different kinds of failures are perceived by users
Interested in ROCing?
- Are you willing to anonymously share failure data?
- We already have great relationships (and in some cases data-sharing agreements) with BEA, IBM, HP, Keynote, Microsoft, Oracle, Tellme, Yahoo!, and others
- See http://roc.stanford.edu or http://roc.cs.berkeley.edu for publications, talks, research areas, etc.
- Contact Armando Fox (fox@cs.stanford.edu) or Dave Patterson (patterson@cs.berkeley.edu)
Discussion Question
- For discussion: So what if you pick the low-hanging fruit? The challenge is in reaching the highest leaves.