Recursive Restartability In a Networked Ground Station (RRINGS)

Fall 2001 - CS444A

1
Recursive Restartability In a Networked Ground
Station (RRINGS)
  • Rushabh Doshi and Rakesh Gowda
  • Computer Science Department, Stanford University

2
Introduction
  • Hypothesis
  • In conjunction with fault detection, enabling the
    ground station (GS) for Recursive Restartability
    (RR) will increase system availability
  • Approach
  • Verify the applicability of RR to a single GS
    node.
  • Design a framework for enabling RR in
    new/existing GS modules and systems.
  • Integrate with Fault Detection (FD) component.

3
Current State of the Art
  • The restart scalpel is a novel approach (Candea,
    Fox).
  • Sledgehammer restarts
  • Microsoft Cluster Server (formerly Wolfpack) uses
    clustering and application-level restarts to
    achieve higher availability.
  • An unnamed Internet portal does prophylactic
    restarts on Apache
  • However, none of the above use an RR scalpel
  • We are developing RR scalpel techniques

4
Program Flow
  • Wait for a fault message from Fault Detector
  • Consult an oracle to tell you what to restart
  • Restart those components
  • A decision tree is the oracle
  • Construct the decision tree
  • Capturing restart dependency information
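
The flow above can be sketched as a small loop. This is a minimal sketch, not the RRINGS code: the class RestartLoop, its methods, and the use of plain strings for fault messages are all assumptions made for illustration, and the oracle is passed in as a function so that any decision tree could sit behind it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

// All names here are hypothetical; the slides do not show the RRINGS classes.
final class RestartLoop {
    private final BlockingQueue<String> faults;           // fault messages from the Fault Detector
    private final Function<String, List<String>> oracle;  // faulty component -> components to restart
    final List<String> restarted = new ArrayList<>();     // what has been restarted, in order

    RestartLoop(BlockingQueue<String> faults, Function<String, List<String>> oracle) {
        this.faults = faults;
        this.oracle = oracle;
    }

    // Consult the oracle, then restart every component it names.
    void onFault(String faultyComponent) {
        for (String component : oracle.apply(faultyComponent)) {
            restart(component);
        }
    }

    // Block until the Fault Detector reports a fault, then handle it; repeat.
    void run() {
        try {
            while (true) {
                onFault(faults.take());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop the loop when interrupted
        }
    }

    // Placeholder: a real system would kill and relaunch the component here.
    void restart(String component) {
        restarted.add(component);
    }
}
```

Keeping the oracle behind a function keeps the loop independent of how the decision tree is built or restructured.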

5
RR Tree
  • RR Tree captures Restart dependency information
  • Parents must be able to restart children
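
One way to represent such a tree is sketched below. The class RRNode and its methods are hypothetical, and the shape built in the usage note is illustrative only; it is not the tree from the slide's figure. The one property the sketch encodes is the one stated above: restarting a node means restarting everything beneath it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical representation of a restart-dependency tree.
final class RRNode {
    final String name;
    final RRNode parent;                              // null for the root
    final List<RRNode> children = new ArrayList<>();

    RRNode(String name, RRNode parent) {
        this.name = name;
        this.parent = parent;
        if (parent != null) {
            parent.children.add(this);                // a parent can reach its children
        }
    }

    // Names of all components restarted when this node is restarted:
    // the node itself followed by its descendants, in pre-order.
    List<String> restartSet() {
        List<String> out = new ArrayList<>();
        collect(this, out);
        return out;
    }

    private static void collect(RRNode node, List<String> out) {
        out.add(node.name);
        for (RRNode child : node.children) {
            collect(child, out);
        }
    }
}
```

For example, with a root node, fedr under the root, and pbcom under fedr, calling restartSet() on fedr yields fedr and pbcom, while calling it on pbcom yields pbcom alone.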

[Figure: RR tree with nodes Pipeline, ise, istr,
istu, pbcom, fedr]
Legend: ise = IServiceEstimator, istr =
IserviceTracker, istu = IServiceTuner, fedr =
FedRadio, pbcom = PipelineByteCOMPort

6
From RR Trees to Decision Trees
  • Components have different restart times
  • Components have different failure rates
  • Use this information to augment Decision Tree
  • Preserve dependencies
  • Reduce MTTR
  • Move slower components up, push faster components
    down
  • Capture historical information: groups of
    components that fail together
  • Move high-failure components to single nodes
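
A rough cost model shows why moving slow components up helps. The sketch below is an illustration under stated assumptions, not the authors' method: it assumes a fault in a node restarts that node's whole subtree, that restarts within a subtree run one after another, and that each node has a known share of all faults. The class CostNode and the numbers in the usage note are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical cost model for comparing two tree shapes.
final class CostNode {
    final String name;
    final double restartSeconds;  // time to restart this component alone
    final double faultShare;      // fraction of all faults that originate here
    final List<CostNode> children = new ArrayList<>();

    CostNode(String name, double restartSeconds, double faultShare) {
        this.name = name;
        this.restartSeconds = restartSeconds;
        this.faultShare = faultShare;
    }

    CostNode add(CostNode child) {
        children.add(child);
        return this;
    }

    // Time to restart this node and everything below it (sequential restarts assumed).
    double subtreeRestartSeconds() {
        double total = restartSeconds;
        for (CostNode child : children) {
            total += child.subtreeRestartSeconds();
        }
        return total;
    }

    // Expected recovery time: each node's fault share times the cost of restarting its subtree.
    double expectedRecoverySeconds() {
        double expected = faultShare * subtreeRestartSeconds();
        for (CostNode child : children) {
            expected += child.expectedRecoverySeconds();
        }
        return expected;
    }
}
```

Take a fast component (1 s restart, 90% of faults) and a slow one (20 s restart, 10% of faults). With the fast component as parent, the expected recovery time is 0.9 * 21 + 0.1 * 20 = 20.9 s. With the slow component as parent it is 0.1 * 21 + 0.9 * 1 = 3.0 s, because the frequent fault no longer drags the slow restart along with it.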

7
Restructuring helps!
  • Sample Restart times for different components

8
Better RR Tree
[Figure: restructured RR tree with nodes Pipeline,
ise, istr, istu, pbcom, fedr]
Legend: ise = IServiceEstimator, istr =
IserviceTracker, istu = IServiceTuner, fedr =
FedRadio, pbcom = PipelineByteCOMPort

9
Making the Decision
  • Algorithm
  • Get a fault, restart the node and children
  • May not be able to kill the node
  • Restart may not solve the problem
  • If this does not fix the problem
  • Retry a constant number of times
  • Go up one level
  • Repeat
  • Log all faults and restarts
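
The algorithm above might look like the following. This is a sketch under assumptions: the class Escalator, the parent lookup map, and the retry limit of 3 are hypothetical (the slide only says "a constant number of times"), and the actual restart is abstracted as a caller-supplied check that restarts a node with its children and reports whether the fault cleared.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical escalation logic; not the RRINGS implementation.
final class Escalator {
    static final int MAX_RETRIES = 3; // arbitrary constant chosen for the sketch

    // parentOf maps each node to its parent; the root has no entry.
    // restartFixes restarts a node (and its children) and returns true if the fault is gone.
    // Returns a log of every restart attempt, ending with "unrecovered" if nothing worked.
    static List<String> recover(String faultyNode,
                                Map<String, String> parentOf,
                                Predicate<String> restartFixes) {
        List<String> log = new ArrayList<>();
        String node = faultyNode;
        while (node != null) {
            for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
                log.add("restart " + node + " attempt " + attempt);
                if (restartFixes.test(node)) {
                    return log;               // fault cleared, stop escalating
                }
            }
            node = parentOf.get(node);        // go up one level; null once past the root
        }
        log.add("unrecovered");
        return log;
    }
}
```

Returning the log from recover keeps the "log all faults and restarts" step visible in the sketch; a real system would write to persistent storage instead.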

10
Kill Restart mechanism
  • Need for a softer kill
  • All components may not be misbehaving
  • Give components a chance to free resources
  • If soft kill fails, follow with hard kill
  • Hard kill is kill -9 (SIGKILL) on Linux
  • Restart implemented as a Java Runtime.exec() call
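
The soft-then-hard kill sequence can be sketched with the standard Java Process API (Java 8 or later). This swaps in Process.destroy() and Process.destroyForcibly() in place of the slide's kill -9 and exec call, so it is an analogue and not the authors' mechanism. Whether destroy() terminates a process gracefully is implementation dependent. The class KillRestart, its method names, and the grace period are assumptions.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; shows soft kill, then hard kill, then relaunch.
final class KillRestart {

    // Try a soft kill first so the component can free its resources.
    // If it is still alive after the grace period, follow with a hard kill.
    // Returns true if the hard kill was needed.
    static boolean stop(Process process, long graceMillis) {
        process.destroy();                                   // soft kill
        try {
            if (process.waitFor(graceMillis, TimeUnit.MILLISECONDS)) {
                return false;                                // exited on its own
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();              // keep the interrupt, fall through
        }
        process.destroyForcibly();                           // hard kill
        return true;
    }

    // Stop the old process, then launch the component again with the given command line.
    static Process restart(Process old, List<String> command, long graceMillis)
            throws IOException {
        stop(old, graceMillis);
        return new ProcessBuilder(command).inheritIO().start();
    }
}
```

The grace period is the tuning knob here: a longer one gives well-behaved components time to clean up, and a shorter one keeps MTTR low when a component is wedged.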

11
Designing a system for RR
  • Goal is to decrease MTTR
  • Decompose components into smaller pieces
  • Advantages
  • Fault isolation
  • Move slow-restart pieces up (and fast-restart
    down)
  • Significantly decreases MTTR
  • Example: fedr and pbom4
  • Disadvantages
  • Some components may not be decomposable
  • IPC can make things difficult (the pieces were
    together for a reason): coordination overhead
  • State management

12
State Management
  • Stateful components need to resynchronize after
    restart
  • Resynch complexity is a function of system design
  • GS Resynchronization
  • All components keep soft state
  • Hard state lives in the control GUI, which we
    are not modeling here.
  • Future GS Resynchronization
  • Protect the system goal state in safe, stable
    storage.
  • Components refresh from this stable storage
  • Details not yet defined.
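
Since the slides leave the storage design open, the following is only one possible shape for it, offered as a sketch. It assumes the goal state fits in a flat key/value file and that a write-to-temp-then-rename pattern is an acceptable way to protect it. The class GoalStateStore, the file layout, and the property key in the usage note are all hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Properties;

// Hypothetical stable store for the system goal state.
final class GoalStateStore {
    private final Path file;

    GoalStateStore(Path file) {
        this.file = file;
    }

    // Write to a temporary sibling file, then rename it over the real one,
    // so a crash mid-write never leaves a half-written goal state behind.
    // Note: whether ATOMIC_MOVE replaces an existing target is platform specific.
    void save(Properties goalState) {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        try {
            try (OutputStream out = Files.newOutputStream(tmp)) {
                goalState.store(out, "GS goal state");
            }
            Files.move(tmp, file, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Called by a component after restart to refresh its soft state.
    // Returns an empty Properties object if nothing has been saved yet.
    Properties load() {
        Properties goalState = new Properties();
        if (Files.exists(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                goalState.load(in);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return goalState;
    }
}
```

A restarted tuner component, for instance, could call load() and read a key such as tuner.frequencyHz to recover the frequency it should be tracking.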

13
Results
  • Increased reliability in GS through RR
  • Developed framework for enabling new GS modules
  • Future work
  • Develop protected stable storage techniques
  • Extend framework for a multi-component GS
  • Extend framework to a federated Virtual GS