Title: Rx: Treating Bugs As Allergies
1. Rx: Treating Bugs As Allergies: A Safe Method to Survive Software Failures
Qin, Tucek, Sundaresan, Zhou (UIUC). SOSP'05
- Shimin Chen, LBA Reading Group Presentation
2. Motivation
- High availability is important
- Critical applications: process control, etc.
- Financial company: an hour of downtime costs $6 million
- SW defects account for up to 40% of system failures
- Common: memory-related bugs and concurrency bugs
- Bugs still occur in production runs
- Even after the SW company spends enormous effort on testing
- Hence the need for mechanisms for surviving software bugs
3. Previous Work on Surviving SW Failures
- Four categories
- Rebooting
- Checkpointing and recovery
- Application-specific mechanisms
- Recent proposals
- Failure-oblivious computing
- Reactive immune system
4. Previous Work 1: Rebooting
- Schemes
- Whole program restart
- Micro-rebooting of partial system components
- SW rejuvenation (proactively restart processes)
- Problem
- Cannot deal with deterministic bugs
- Restart time
5. Previous Work 2: General Checkpointing and Recovery
- Schemes
- Checkpoint, rollback, re-execute
- Or use a backup server
- Problems
- Cannot deal with deterministic bugs
- Progressive retry in distributed systems: reorders messages to get around SW bugs, but does not help with bugs on a single system
- N-version programming: too expensive
6. Previous Work 3: Application-Specific Recovery Mechanisms
- Multi-process model (MPM)
- Kill a request-handling process and start a new one
- Problems
- Cannot handle deterministic bugs
- What if a shared data structure is corrupted?
7. Previous Work 4: Recent Non-Conventional Proposals
- Failure-oblivious computing
- Manufacture values for out-of-bound reads
- Discard out-of-bound writes
- Reactive immune system
- Detect failures of function calls
- Forcefully return from the function with a manufactured error return value (e.g. -1 for int, 0 for unsigned int, etc.)
- Problem
- Unsafe for correctness-critical applications (e.g. banking)
8. New Proposal: Rx
- Roll back the program to a recent checkpoint when a bug is detected
- Dynamically change the execution environment based on the failure symptoms
- Re-execute the buggy code in the new environment
- Features
- Comprehensive: can deal with deterministic bugs
- Safe: does not speculatively fix bugs, only changes the environment
- Noninvasive: no changes to app source code
- Efficient
- Informative: helps locate the bugs
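The rollback, change, re-execute cycle can be illustrated with a minimal Python sketch. It is not the Rx implementation: `copy.deepcopy` stands in for a real process checkpoint, a Python exception stands in for a detected failure, and `run_with_rx`, `buggy_copy`, and the `padding` and `zero_fill` keys are hypothetical names chosen for illustration.

```python
import copy

def run_with_rx(state, region, env_changes):
    """Run region(state, env); on failure, roll back and retry under each change."""
    checkpoint = copy.deepcopy(state)          # stand-in for a process checkpoint
    env = {}                                   # default environment: nothing changed
    tried = []                                 # changes applied, kept for diagnosis
    for change in [None] + list(env_changes):
        if change is not None:
            state = copy.deepcopy(checkpoint)  # roll back to the checkpoint
            env = dict(change)                 # re-execute in a modified environment
            tried.append(change)
        try:
            return region(state, env), tried
        except Exception:
            continue                           # failure detected: try the next change
    raise RuntimeError("no environment change avoided the failure")

def buggy_copy(state, env):
    """Hypothetical buggy region: writes 9 elements into an 8-slot buffer."""
    buf = [0] * (8 + env.get("padding", 0))
    for i in range(9):                         # off-by-one overflow
        buf[i] = i
    state["copied"] = 9
    return state["copied"]

result, tried = run_with_rx(
    {"copied": 0}, buggy_copy, [{"zero_fill": True}, {"padding": 4}]
)
```

Here the first run and the zero-fill retry both fail, and the padded retry succeeds. The application code is never patched; only its environment changes, and `tried` records which changes were attempted.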
9. Outline
- Introduction
- Main Idea of Rx
- Rx Design and Implementation
- Evaluation
- Summary
10. Main Idea
- Record the changes for offline diagnosis
11. Useful Execution Environmental Changes
- Must be safe and may avoid bugs
- Memory management based
- Buffer overflows, dangling pointers, etc.
- Timing based
- Concurrency bugs
- User request based
- Dropping unexpected (malicious) user requests
- As a last resort
12. (No Transcript)
13. Outline
- Introduction
- Main Idea of Rx
- Rx Design and Implementation
- Evaluation
- Summary
14. Rx Components Overview
(Figure: Rx components diagram with callouts 1 to 5; not transcribed)
15. Sensors for Detecting SW Failures
- OS-raised exceptions
- Assertion failures, segfault, divide-by-zero, etc.
- Fine-grain detection
- Buffer overflows, accesses to freed memory, etc.
- Only OS-raised exceptions are implemented
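As a rough analogy for an exception-based sensor, the sketch below catches a few Python exception types and turns them into a failure symptom. Python exceptions are only a stand-in for OS-raised exceptions, and `sense_failure`, `SENSED`, `parse`, and `handle` are hypothetical names, not part of Rx.

```python
import traceback

# Python exception types standing in for OS-raised exceptions (an assumption).
SENSED = (AssertionError, ZeroDivisionError, MemoryError)

def sense_failure(region, *args):
    """Run region; return ("ok", result) or ("failed", symptom)."""
    try:
        return ("ok", region(*args))
    except SENSED as exc:
        # Skip the first frame, which is sense_failure itself.
        frames = traceback.extract_tb(exc.__traceback__)[1:]
        call_chain = tuple(frame.name for frame in frames)
        return ("failed", (type(exc).__name__, call_chain))

def parse(x):
    return 10 // x          # hypothetical bug: divides by zero when x == 0

def handle(x):
    return parse(x)

status, symptom = sense_failure(handle, 0)
```

The symptom pairs the exception type with the call chain that led to it, which is the kind of key a recovery controller could use to recognize a repeated failure.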
16. Checkpoint and Rollback (Flashback)
- Memory state: fork-like operation
- Files: keep a copy of each accessed file and its file pointer for a checkpoint
- Checkpoint management
- Equal intervals or exponential landmarks
- Limit the oldest checkpoint by considering the recovery time goal
- Multi-threaded process checkpointing
- Send a signal to all threads to make them exit from blocked syscalls with EINTR
- Take checkpoint
- Library wrapper in Rx retries syscalls
- High cost, so cannot be frequent
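The equal-interval policy with a bound on the oldest checkpoint might look like the following sketch. It covers only the bookkeeping, not the fork-like snapshot itself; `CheckpointLog`, `max_rollback_ms`, and the newest-first rollback order are assumptions made for illustration.

```python
from collections import deque

class CheckpointLog:
    """Keep equally spaced checkpoints, discarding ones too old to be useful."""

    def __init__(self, interval_ms, max_rollback_ms):
        self.interval_ms = interval_ms
        self.max_rollback_ms = max_rollback_ms   # stand-in for the recovery time goal
        self.checkpoints = deque()               # (timestamp_ms, snapshot), oldest first

    def maybe_checkpoint(self, now_ms, snapshot):
        """Take a checkpoint if a full interval has elapsed; return True if taken."""
        if self.checkpoints and now_ms - self.checkpoints[-1][0] < self.interval_ms:
            return False
        self.checkpoints.append((now_ms, snapshot))
        # Re-executing from a very old checkpoint would exceed the recovery goal.
        while now_ms - self.checkpoints[0][0] > self.max_rollback_ms:
            self.checkpoints.popleft()
        return True

    def rollback_candidates(self):
        """Timestamps to try, newest first (an assumed rollback order)."""
        return [ts for ts, _ in reversed(self.checkpoints)]

log = CheckpointLog(interval_ms=200, max_rollback_ms=600)
for t in range(0, 1001, 100):
    log.maybe_checkpoint(t, snapshot={"t": t})
```

With a 200ms interval and a 600ms bound, the log ends the loop holding the checkpoints taken at 400, 600, 800, and 1000ms.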
17. Environment Wrappers
- Memory wrapper (intercepting library calls)
- Delaying free
- Keep a freed buffer for a threshold (process) time
- FIFO recycling
- Padding buffers
- Add fixed-size padding to both ends of allocated buffers
- Allocation isolation
- Put allocated buffers in isolated locations
- Zero-filling
- Do the above during re-execution, for the failed code region only
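To see why padding and delayed free can mask memory bugs, consider a toy allocator over a `bytearray`. This is a simulation under simplifying assumptions, not the Rx memory wrapper: offsets stand in for pointers, delayed free is counted in number of frees rather than process time, and `ToyHeap` and the two helper functions are hypothetical.

```python
from collections import deque

class ToyHeap:
    """Toy allocator over a bytearray; integer offsets stand in for pointers."""

    def __init__(self, size, padding=0, zero_fill=False, delay_free=0):
        self.mem = bytearray(b"\xAA" * size)  # 0xAA marks never-initialized bytes
        self.top = 0                          # start of the unused tail of the heap
        self.padding = padding                # bytes added on each side of a buffer
        self.zero_fill = zero_fill
        self.delay_free = delay_free          # how many freed blocks to hold back
        self.quarantine = deque()             # FIFO of freed blocks not yet reusable
        self.free_list = []                   # (offset, length) blocks ready for reuse
        self.blocks = {}                      # pointer -> (offset, length)

    def malloc(self, n):
        total = n + 2 * self.padding
        for i, (off, length) in enumerate(self.free_list):
            if length >= total:               # first fit among recycled blocks
                del self.free_list[i]
                break
        else:
            if self.top + total > len(self.mem):
                raise MemoryError("toy heap exhausted")
            off, length = self.top, total
            self.top += total
        ptr = off + self.padding
        if self.zero_fill:
            self.mem[ptr:ptr + n] = bytes(n)
        self.blocks[ptr] = (off, length)
        return ptr

    def free(self, ptr):
        self.quarantine.append(self.blocks.pop(ptr))
        while len(self.quarantine) > self.delay_free:
            self.free_list.append(self.quarantine.popleft())

def overflow_corrupts_neighbor(heap):
    """Write 9 bytes into an 8-byte buffer; report whether the neighbor changed."""
    a = heap.malloc(8)
    b = heap.malloc(8)
    heap.mem[b:b + 8] = b"NEIGHBOR"
    heap.mem[a:a + 9] = b"123456789"          # bug: one byte too many
    return bytes(heap.mem[b:b + 8]) != b"NEIGHBOR"

def dangling_pointer_aliases(heap):
    """Free a buffer, allocate again; report whether the stale pointer is reused."""
    a = heap.malloc(8)
    heap.free(a)
    b = heap.malloc(8)
    return a == b
```

In this model the one-byte overflow corrupts the neighboring buffer on a plain `ToyHeap(64)` but lands in padding with `ToyHeap(64, padding=4)`. Likewise, a freed block is handed out again immediately unless `delay_free` keeps it in the FIFO quarantine.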
18. Other Wrappers
- Message wrapper (in proxy)
- Randomly shuffle the message order across different connections while keeping the message order within each connection
- Randomize packet sizes
- Process scheduling: change process priority
- Signal delivery: randomize hw interrupt delivery time while preserving order
- Dropping user requests
- Binary search for bad requests
- Drop at most 10% of requests
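Two of these ideas lend themselves to a short sketch: shuffling messages across connections without reordering any single connection, and binary-searching a request log for an offending request. Both functions are hypothetical illustrations. `fails(batch)` stands in for "re-execute from the checkpoint replaying only this batch", and the search assumes exactly one bad request and a deterministic failure.

```python
import random

def shuffle_across_connections(queues, rng):
    """Interleave per-connection queues randomly, keeping each queue's own order."""
    pending = {conn: list(msgs) for conn, msgs in queues.items() if msgs}
    out = []
    while pending:
        conn = rng.choice(sorted(pending))    # pick any connection with messages left
        out.append((conn, pending[conn].pop(0)))
        if not pending[conn]:
            del pending[conn]
    return out

def find_bad_request(requests, fails):
    """Binary-search for the single request that makes fails(batch) true."""
    lo, hi = 0, len(requests)                 # invariant: bad request is in [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(requests[lo:mid]):
            hi = mid
        else:
            lo = mid
    return requests[lo]

queues = {"c1": ["a1", "a2", "a3"], "c2": ["b1", "b2"]}
delivery = shuffle_across_connections(queues, random.Random(0))

requests = ["r%d" % i for i in range(8)]
bad = find_bad_request(requests, lambda batch: "r5" in batch)
```

Each loop iteration of the search halves the candidate range, so locating one bad request among n takes about log2(n) re-executions in this model.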
19. Proxy
20. Control Unit
- Coordinate checkpoint/rollback, environment changes, etc.
- Failure vector <S1, S2, ..., Sm> per failure symptom (exception type, PC address, call chain, etc.)
- Si is the score for environmental change i
- If change i is successful, Si++; if failed, Si--
- Try the changes with scores greater than a certain threshold first
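The score bookkeeping could be sketched as below. The class name `FailureTable`, the change names, the zero threshold, and the example symptom are all assumptions made for illustration; ordering the remaining changes by score is also a choice of this sketch rather than something the slide specifies.

```python
from collections import defaultdict

# Hypothetical names for the environmental changes, indexed 0..m-1.
CHANGES = ["pad_buffers", "delay_free", "zero_fill", "shuffle_messages"]

class FailureTable:
    """One score vector <S1, ..., Sm> per failure symptom."""

    def __init__(self, changes, threshold=0):
        self.changes = list(changes)
        self.threshold = threshold
        self.scores = defaultdict(lambda: [0] * len(self.changes))

    def record(self, symptom, change, succeeded):
        """Si++ if change i avoided the failure, Si-- otherwise."""
        i = self.changes.index(change)
        self.scores[symptom][i] += 1 if succeeded else -1

    def order(self, symptom):
        """Changes scoring above the threshold first, then the rest, best first."""
        s = self.scores[symptom]
        idx = range(len(self.changes))
        preferred = sorted((i for i in idx if s[i] > self.threshold), key=lambda i: -s[i])
        rest = sorted((i for i in idx if s[i] <= self.threshold), key=lambda i: -s[i])
        return [self.changes[i] for i in preferred + rest]

# Hypothetical symptom: (exception type, PC address, call chain).
symptom = ("SIGSEGV", 0x4005D0, ("handle_request", "copy_header"))
table = FailureTable(CHANGES)
table.record(symptom, "pad_buffers", succeeded=False)
table.record(symptom, "zero_fill", succeeded=True)
plan = table.order(symptom)
```

After one failed attempt with padding and one successful attempt with zero-filling, the next occurrence of the same symptom tries zero-filling first and padding last.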
21. Outline
- Introduction
- Main Idea of Rx
- Rx Design and Implementation
- Evaluation
- Summary
22. Setup
- A client machine and a server machine
- 2.4GHz x86 CPU, 512KB L2 cache, 1GB DRAM
- 100Mbps Ethernet
Injected bugs
23. Overall Results
24. Checkpoint Overhead
- Time: with a checkpoint interval of 200ms, 5% overhead (MySQL)
- Workloads
- apache, squid: 5 threads, GET files with sizes uniform in [1KB, 512KB]
- CVS: client exports a 30KB file
- MySQL: 5 client threads, transactions on a small table
25. Summary
- Rx: re-executing the buggy program region in a modified execution environment
- Not a panacea
- Semantic bugs, resource leaks
- Latent bugs (long delay from bug to symptom)