Rx: Treating Bugs As Allergies

Rx: Treating Bugs As Allergies: A Safe Method to Survive Software Failures. Qin, Tucek, Sundaresan, Zhou (UIUC). SOSP '05. Shimin Chen, LBA Reading Group Presentation.

1
Rx: Treating Bugs As Allergies: A Safe Method to
Survive Software Failures. Qin, Tucek,
Sundaresan, Zhou (UIUC). SOSP '05
  • Shimin Chen, LBA Reading Group Presentation

2
Motivation
  • High availability is important
    • Critical applications: process control, etc.
    • Financial company: an hour of downtime costs $6
      million
  • SW defects account for up to 40% of system
    failures
    • Common: memory-related bugs and concurrency bugs
  • Bugs still occur in production runs
    • Even after SW companies spend enormous effort on
      testing
  • => Need mechanisms for surviving software bugs

3
Previous Work on Surviving SW Failures
  • Four categories
    • Rebooting
    • Checkpointing and recovery
    • Application-specific mechanisms
    • Recent proposals
      • Failure-oblivious computing
      • Reactive immune system

4
Previous Work 1: Rebooting
  • Schemes
    • Whole program restart
    • Micro-rebooting of partial system components
    • SW rejuvenation (proactively restart processes)
  • Problems
    • Cannot deal with deterministic bugs
    • Restart time

5
Previous Work 2: General Checkpointing and
Recovery
  • Schemes
    • Checkpoint, rollback, re-execute
    • Or use a backup server
  • Problems
    • Cannot deal with deterministic bugs
  • Progressive retry in distributed systems
    • Reorders messages to get around SW bugs, but not
      bugs on a single system
  • N-version programming
    • Too expensive

6
Previous Work 3: Application-Specific Recovery
Mechanisms
  • Multi-process model (MPM)
    • Kill a request-handling process and start a new
      one
  • Problems
    • Cannot handle deterministic bugs
    • What if a shared data structure is corrupted?

7
Previous Work 4: Recent Non-Conventional Proposals
  • Failure-oblivious computing
    • Manufacture values for out-of-bound reads
    • Discard out-of-bound writes
  • Reactive immune system
    • Detect failures of function calls
    • Forcefully return from the function with a
      manufactured error return value (e.g. -1 for int,
      0 for unsigned int, etc.)
  • Problem
    • Unsafe for correctness-critical applications
      (e.g. banking)
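
To make the failure-oblivious idea concrete, here is a minimal Python sketch. It is only an illustration, not the original system: the class name `ObliviousBuffer` and the manufactured value `0` are assumptions made for this example.

```python
class ObliviousBuffer:
    """Fixed-size buffer that hides out-of-bound accesses instead of failing."""

    def __init__(self, size, manufactured=0):
        self._data = [0] * size
        self._manufactured = manufactured  # assumed made-up value for bad reads
        self.discarded_writes = 0

    def read(self, index):
        if 0 <= index < len(self._data):
            return self._data[index]
        # Out-of-bound read: manufacture a value rather than raise.
        return self._manufactured

    def write(self, index, value):
        if 0 <= index < len(self._data):
            self._data[index] = value
        else:
            # Out-of-bound write: silently discard it.
            self.discarded_writes += 1


buf = ObliviousBuffer(4)
buf.write(2, 7)
buf.write(9, 99)             # out of bounds, discarded
in_bounds = buf.read(2)
out_of_bounds = buf.read(9)  # out of bounds, manufactured value
```

The program keeps running on a value nobody wrote, which is why the slide calls this approach unsafe for correctness-critical applications.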

8
New Proposal: Rx
  • Roll back the program to a recent checkpoint when
    a bug is detected
  • Dynamically change the execution environment
    based on the failure symptoms
  • Re-execute the buggy code in the new environment
  • Features
    • Comprehensive: can deal with deterministic bugs
    • Safe: does not speculatively fix bugs, only
      changes the environment
    • Noninvasive: no changes to app source code
    • Efficient
    • Informative: helps locate the bugs
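
The rollback, change-environment, re-execute cycle can be sketched in Python as follows. This is a toy model under stated assumptions: `run_with_rx`, the `handler` signature, and the `env` dictionary are hypothetical names for this example, and `copy.deepcopy` stands in for a real process checkpoint.

```python
import copy


def run_with_rx(state, request, handler, env_changes):
    """Run handler; on failure, roll back and retry under each environment change.

    Returns (result, applied_change); applied_change is None if no change was needed.
    """
    checkpoint = copy.deepcopy(state)   # stand-in for a real checkpoint
    try:
        return handler(state, request, env={}), None
    except Exception:
        pass                            # failure detected, start recovery
    for change in env_changes:
        state.clear()
        state.update(copy.deepcopy(checkpoint))   # roll back to the checkpoint
        try:
            return handler(state, request, env=change), change
        except Exception:
            continue                    # this change did not help, try the next
    raise RuntimeError("no environment change avoided the failure")


def handler(state, request, env):
    state["handled"] += 1                  # side effect before the failure point
    capacity = 4 + env.get("padding", 0)   # toy buffer; padding widens it
    if len(request) > capacity:
        raise IndexError("buffer overflow")
    return request.upper()


state = {"handled": 0}
result, change = run_with_rx(
    state, "abcdef", handler, [{"zero_fill": True}, {"padding": 8}]
)
```

Here the first change does not help, the second one does, and the rollback ensures the side effect on `state` is counted once. The change that worked is returned so it can be recorded for later diagnosis.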

9
Outline
  • Introduction
  • Main Idea of Rx
  • Rx Design & Implementation
  • Evaluation
  • Summary

10
Main Idea
Record the changes for offline diagnosis
11
Useful Execution Environmental Changes
  • Must be safe and may avoid bugs
  • Memory management based
    • Buffer overflows, dangling pointers, etc.
  • Timing based
    • Concurrency bugs
  • User request based
    • Dropping unexpected (malicious) user requests
    • As a last resort

12
(No Transcript)
13
Outline
  • Introduction
  • Main Idea of Rx
  • Rx Design & Implementation
  • Evaluation
  • Summary

14
Rx Components Overview
(Figure not transcribed: components numbered 1 to 5)

15
Sensors for Detecting SW Failures
  • OS-raised exceptions
    • Assertion failures, segfault, divide-by-zero,
      etc.
  • Fine-grain detection
    • Buffer overflows, accesses to freed memory, etc.
  • Only OS-raised exceptions are implemented

16
Checkpoint and Rollback (Flashback)
  • Memory state: fork-like operation
  • Files: keep a copy of each accessed file and
    file pointers for a checkpoint
  • Checkpoint management
    • Equal intervals or exponential landmarks
    • Limit the oldest checkpoint by considering the
      recovery time goal
  • Multi-threaded process checkpointing
    • Send a signal to all threads to make them exit
      from blocked syscalls with EINTR
    • Take checkpoint
    • Library wrapper in Rx retries syscalls
    • High cost, so cannot be frequent
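
The checkpoint-management bullets (equal intervals, bounded age) can be illustrated with a small Python sketch. It is a simplified model, not Flashback itself: `CheckpointStore`, `interval`, and `max_age` are names assumed for this example, `copy.deepcopy` replaces the fork-like memory snapshot, and `max_age` is a stand-in for the recovery time goal.

```python
import copy
from collections import deque


class CheckpointStore:
    """Keeps equally spaced snapshots and forgets ones that are too old."""

    def __init__(self, interval, max_age):
        self.interval = interval      # minimum time between checkpoints
        self.max_age = max_age        # stand-in for the recovery time goal
        self._checkpoints = deque()   # (timestamp, snapshot), oldest first

    def maybe_checkpoint(self, now, state):
        """Snapshot `state` if an interval has elapsed; return True if taken."""
        if self._checkpoints and now - self._checkpoints[-1][0] < self.interval:
            return False
        self._checkpoints.append((now, copy.deepcopy(state)))
        # Drop checkpoints older than anything we would roll back to.
        while now - self._checkpoints[0][0] > self.max_age:
            self._checkpoints.popleft()
        return True

    def timestamps(self):
        return [ts for ts, _ in self._checkpoints]

    def latest_before(self, when):
        """Return a copy of the newest snapshot taken at or before `when`."""
        for ts, snapshot in reversed(self._checkpoints):
            if ts <= when:
                return copy.deepcopy(snapshot)
        raise LookupError("no checkpoint old enough")


store = CheckpointStore(interval=200, max_age=1000)   # times in ms
for now in range(0, 2001, 100):
    store.maybe_checkpoint(now, {"counter": now})

kept = store.timestamps()
restored = store.latest_before(1500)
```

An exponential-landmark policy would replace the fixed `interval` with spacing that grows for older checkpoints; that variant is not shown here.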

17
Environment Wrappers
  • Memory wrapper (intercepting library calls)
    • Delaying free
      • Keep a freed buffer for a threshold (process)
        time
      • FIFO recycling
    • Padding buffers
      • Add fixed-size padding to both ends of
        allocated buffers
    • Allocation isolation
      • Put allocated buffers in isolated locations
    • Zero-filling
  • Do the above during re-execution for the failed
    code region only
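
The memory-wrapper changes can be mimicked with a toy allocator in Python. This is a sketch under stated assumptions rather than the Rx wrapper: `ToyAllocator` and its parameters are hypothetical, memory is a `bytearray`, and delayed free is modeled by a count of quarantined blocks instead of a time threshold.

```python
from collections import deque


class ToyAllocator:
    """Bump allocator over a bytearray, with Rx-style environment knobs."""

    def __init__(self, arena_size, padding=0, zero_fill=False, delay_free=0):
        self.arena = bytearray(b"\xAA" * arena_size)  # 0xAA marks stale memory
        self.padding = padding         # bytes added on both ends of a buffer
        self.zero_fill = zero_fill
        self.delay_free = delay_free   # freed blocks held back before reuse
        self._next = 0
        self._quarantine = deque()     # FIFO of recently freed blocks
        self.reusable = []             # blocks actually released for reuse

    def malloc(self, size):
        start = self._next + self.padding
        end = start + size
        if end + self.padding > len(self.arena):
            raise MemoryError("arena exhausted")
        self._next = end + self.padding
        if self.zero_fill:
            self.arena[start:end] = bytes(size)
        return start

    def free(self, addr):
        self._quarantine.append(addr)
        while len(self._quarantine) > self.delay_free:
            self.reusable.append(self._quarantine.popleft())


def overflow_corrupts_neighbor(alloc):
    a = alloc.malloc(8)
    b = alloc.malloc(8)
    alloc.arena[b:b + 8] = b"NEIGHBOR"
    alloc.arena[a:a + 10] = b"0123456789"   # bug: 2 bytes past the end of a
    return bytes(alloc.arena[b:b + 8]) != b"NEIGHBOR"


normal = overflow_corrupts_neighbor(ToyAllocator(64))
padded = overflow_corrupts_neighbor(ToyAllocator(64, padding=4))

delayed = ToyAllocator(64, delay_free=2)
for addr in (0, 8, 16):
    delayed.free(addr)
```

The same two-byte overflow corrupts the neighboring buffer in the normal environment but lands in the padding in the changed one. With `delay_free=2`, a freed block becomes reusable only after two later frees, which keeps dangling pointers pointing at unrecycled memory for longer.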

18
Other Wrappers
  • Message wrapper (in proxy)
    • Randomly shuffle the message order across
      different connections while keeping the message
      order within the same connection
    • Randomize packet sizes
  • Process scheduling: change process priority
  • Signal delivery: randomize hw interrupt delivery
    time while preserving order
  • Dropping user requests
    • Binary search for bad requests
    • Drop at most 10% of requests
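
Two of the wrappers above lend themselves to a short Python sketch. Both functions are illustrative assumptions, not the Rx proxy code: `shuffle_across_connections` and `find_bad_request` are hypothetical names, messages are `(conn_id, payload)` pairs, and the binary search assumes exactly one bad request whose effect does not depend on the other requests.

```python
import random
from collections import deque


def shuffle_across_connections(messages, rng):
    """Reorder messages across connections, keeping each connection's order."""
    queues = {}
    for conn_id, payload in messages:
        queues.setdefault(conn_id, deque()).append(payload)
    shuffled = []
    while queues:
        conn_id = rng.choice(list(queues))   # pick any connection with pending data
        shuffled.append((conn_id, queues[conn_id].popleft()))
        if not queues[conn_id]:
            del queues[conn_id]
    return shuffled


def find_bad_request(requests, fails):
    """Binary search for the single request that triggers the failure.

    `fails(subset)` is assumed to re-execute `subset` from a checkpoint and
    report whether the failure reappeared.
    """
    lo, hi = 0, len(requests)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(requests[lo:mid]):
            hi = mid      # culprit is in the first half
        else:
            lo = mid      # culprit is in the second half
    return requests[lo]


messages = [("A", 1), ("A", 2), ("B", 1), ("A", 3), ("B", 2), ("C", 1)]
shuffled = shuffle_across_connections(messages, random.Random(0))

requests = ["r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7"]
culprit = find_bad_request(requests, lambda subset: "r5" in subset)
```

Taking the next message only from the head of a per-connection queue is what preserves the order within a connection, while the random choice of connection changes the interleaving.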

19
Proxy
20
Control Unit
  • Coordinates checkpoint/rollback, environment
    changes, etc.
  • Failure vector <S1, S2, ..., Sm> per failure
    symptom (exception type, PC address, call chain,
    etc.)
    • Si is the score for environmental change i
    • If change i is successful, Si++; if it failed,
      Si--
  • Try the changes with scores greater than a
    certain threshold first
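
The failure-vector bookkeeping can be sketched in Python as below. This is a minimal model with assumed details: `ControlUnit`, the change names, and the example symptom tuple are hypothetical, and ordering the above-threshold changes by descending score is a choice made for this example, since the slide only says they are tried first.

```python
class ControlUnit:
    """Tracks one score per environmental change for each failure symptom."""

    def __init__(self, changes, threshold=0):
        self.changes = list(changes)
        self.threshold = threshold
        self._vectors = {}   # symptom -> {change: score}

    def _vector(self, symptom):
        return self._vectors.setdefault(symptom, {c: 0 for c in self.changes})

    def record(self, symptom, change, succeeded):
        """Si++ when change i avoided the failure, Si-- when it did not."""
        self._vector(symptom)[change] += 1 if succeeded else -1

    def candidates(self, symptom):
        """Changes scoring above the threshold first, then the rest in order."""
        vec = self._vector(symptom)
        above = [c for c in self.changes if vec[c] > self.threshold]
        rest = [c for c in self.changes if vec[c] <= self.threshold]
        return sorted(above, key=lambda c: -vec[c]) + rest


unit = ControlUnit(["padding", "delay_free", "zero_fill", "drop_request"])
symptom = ("SIGSEGV", 0x4005D0, ("main", "handle_request"))  # made-up symptom

unit.record(symptom, "delay_free", succeeded=True)
unit.record(symptom, "delay_free", succeeded=True)
unit.record(symptom, "padding", succeeded=False)

order = unit.candidates(symptom)
```

Keeping the untried changes in their original order leaves `drop_request` at the end, matching its role as a last resort.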

21
Outline
  • Introduction
  • Main Idea of Rx
  • Rx Design & Implementation
  • Evaluation
  • Summary

22
Setup
  • A client machine and a server machine
  • 2.4GHz x86 CPU, 512KB L2 cache, 1GB DRAM
  • 100Mbps Ethernet

Injected bugs
23
Overall Results
24
Checkpoint Overhead
  • Time: with a checkpoint interval of 200ms, 5%
    overhead (MySQL)
  • Workloads
    • apache, squid: 5 threads, GET files with sizes
      uniform in [1KB, 512KB]
    • CVS: client exports a 30KB file
    • MySQL: 5 client threads, transactions on a small
      table

25
Summary
  • Rx: re-executing the buggy program region in a
    modified execution environment
  • Not a panacea
    • Semantic bugs, resource leaks
    • Latent bugs (long delay from bug to symptom)