Recursive Restartability In a Networked Ground Station (RRINGS)

In conjunction with fault detection, enabling the ground station (GS) for ... Restart implemented as a Java Runtime.exec(...) call. Fall 2001 - CS444A ...


1
Recursive Restartability In a Networked Ground
Station (RRINGS)

Rushabh Doshi and Rakesh Gowda
Computer Science Department, Stanford University

2
Introduction

Hypothesis
In conjunction with fault detection, enabling the
ground station (GS) for Recursive Restartability
(RR) will increase system availability
Approach
Verify the applicability of RR to a single GS
node.
Design a framework for enabling RR in
new/existing GS modules and systems.
Integrate with Fault Detection (FD) component.

3
Current State of Art

Restart scalpel is a novel approach (Candea, Fox).
Sledgehammer Restarts
MS Cluster Server (formerly Wolfpack) uses
clustering and application level restarts to
achieve higher availability.
Unnamed Internet portal does prophylactic
restarts on Apache
However, none of the above uses an RR scalpel
We are developing RR scalpel techniques

4
Program Flow

Wait for a fault message from Fault Detector
Consult an oracle to tell you what to restart
Restart those components
A decision tree is the oracle
Construct the decision tree
Capturing restart dependency information
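
The program flow above can be sketched as a small loop. This is a minimal illustration in Python, not the authors' implementation: the function name run_recovery_loop, the queue-based fault channel, and the lookup table standing in for the oracle are all assumptions made for the example.

```python
import queue

def run_recovery_loop(faults, oracle, restart, poll_timeout=0.1):
    """Take fault messages, ask the oracle what to restart, restart it.

    faults  -- queue.Queue of failed component names (assumed channel)
    oracle  -- callable: failed component -> list of components to restart
    restart -- callable that restarts one component by name
    The loop returns once no fault arrives within poll_timeout seconds;
    a real recovery manager would keep waiting instead.
    """
    handled = []
    while True:
        try:
            failed = faults.get(timeout=poll_timeout)
        except queue.Empty:
            return handled
        targets = oracle(failed)          # consult the oracle
        for name in targets:              # restart those components
            restart(name)
        handled.append((failed, targets))

# Hypothetical oracle: a lookup table standing in for the decision tree.
restart_table = {"fedr": ["fedr", "pbcom"]}

def oracle(failed):
    return restart_table.get(failed, [failed])

faults = queue.Queue()
faults.put("pbcom")
faults.put("fedr")
restarted = []
handled = run_recovery_loop(faults, oracle, restarted.append)
```

Passing the oracle in as a callable keeps the loop unchanged if the lookup table is later replaced by a real decision tree.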

5
RR Tree

RR Tree captures Restart dependency information
Parents must be able to restart children
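
One way to hold this dependency information is a plain tree in which restarting a node also restarts everything beneath it. The sketch below is illustrative only: the class name RRNode and the flat tree shape are assumptions, since the slide's figure layout is not recoverable from the transcript.

```python
class RRNode:
    """A node in a restart tree: restarting it restarts its whole subtree."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def add(self, child):
        """Attach child under this node and return the child."""
        child.parent = self
        self.children.append(child)
        return child

    def restart_group(self):
        """Names restarted together: this node first, then all descendants."""
        names = [self.name]
        for child in self.children:
            names.extend(child.restart_group())
        return names

# Assumed shape: every component hangs directly off Pipeline.
pipeline = RRNode("Pipeline")
for name in ("ise", "istr", "istu", "pbcom"):
    pipeline.add(RRNode(name))
fedr = pipeline.add(RRNode("fedr"))

whole_tree = pipeline.restart_group()
leaf_only = fedr.restart_group()
```

The parent link is what later allows a recovery procedure to move up one level when restarting a node alone does not help.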

[Figure: RR tree with nodes Pipeline, ise, istr, istu, pbcom, fedr]
Legend: ise = IServiceEstimator, istr = IserviceTracker, istu = IServiceTuner, fedr = FedRadio, pbcom = PipelineByteCOMPort

6
From RR Trees to Decision Trees

Components have different restart times
Components have different failure rates
Use this information to augment Decision Tree
Preserve dependencies
Reduce MTTR
Move slower components up, push faster components
down
Capture historical information: groups of
components that fail together
Move high-failure components to single nodes
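
A small worked example shows why giving a high-failure component its own node reduces MTTR. The numbers and component names below are made up for illustration, and the model assumes that a restart group restarts in parallel, so a group takes as long as its slowest member.

```python
def expected_mttr(failure_share, restart_time, groups):
    """Expected recovery time per fault.

    failure_share -- component -> fraction of all faults it causes
    restart_time  -- component -> seconds needed to restart it alone
    groups        -- list of restart groups (lists of component names);
                     a fault in any member restarts the whole group
    Assumes group members restart in parallel (slowest member dominates).
    """
    group_time = {}
    for members in groups:
        slowest = max(restart_time[m] for m in members)
        for m in members:
            group_time[m] = slowest
    return sum(share * group_time[c] for c, share in failure_share.items())

# Hypothetical figures: "fast" fails often, "slow" rarely.
restart_time = {"fast": 2.0, "slow": 20.0}
failure_share = {"fast": 0.75, "slow": 0.25}

joint = expected_mttr(failure_share, restart_time, [["fast", "slow"]])
split = expected_mttr(failure_share, restart_time, [["fast"], ["slow"]])
```

Under these assumed figures, restarting the two components together costs 20 seconds per fault on average, while giving each its own node brings the average down to 6.5 seconds.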

7
Restructuring helps!

Sample Restart times for different components

8
Better RR Tree

[Figure: restructured RR tree with nodes Pipeline, ise, istr, istu, pbcom, fedr]
Legend: ise = IServiceEstimator, istr = IserviceTracker, istu = IServiceTuner, fedr = FedRadio, pbcom = PipelineByteCOMPort

9
Making the Decision

Algorithm
Get a fault, restart the node and children
May not be able to kill the node
Restart may not solve the problem
If this does not fix the problem
Retry a constant number of times
Go up one level
Repeat
Log all faults and restarts
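
The escalation steps above can be sketched as follows. This is a hedged illustration, not the authors' code: the function recover, the parent_of mapping, and the restart and is_healthy callbacks are hypothetical names, and the parent chain used in the example is assumed.

```python
def recover(failed, parent_of, restart, is_healthy, max_retries=3):
    """Restart the failed node; if that does not help, move up the tree.

    parent_of  -- node name -> parent name (root has no entry)
    restart    -- callable(node) -> True if the kill and restart succeeded
    is_healthy -- callable(failed) -> True once the original fault is gone
    Returns (node whose restart fixed the fault, or None; log of attempts).
    """
    log = []
    current = failed
    while current is not None:
        for attempt in range(1, max_retries + 1):   # retry a constant number of times
            ok = restart(current)                   # restarts the node and its children
            log.append((current, attempt, ok))      # log every restart
            if ok and is_healthy(failed):
                return current, log
        current = parent_of.get(current)            # go up one level and repeat
    return None, log

# Assumed parent chain for the example.
parent_of = {"pbcom": "fedr", "fedr": "Pipeline"}
restarted = []

def restart(node):
    restarted.append(node)
    return True

def is_healthy(failed):
    # Pretend the fault only clears once fedr itself has been restarted.
    return restarted[-1] == "fedr"

fixed_at, log = recover("pbcom", parent_of, restart, is_healthy)
```

A None result means every level up to the root was tried without success, which is the point at which an operator would need to step in.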

10
Kill Restart mechanism

Need for a softer kill
All components may not be misbehaving
Give components a chance to free resources
If soft kill fails, follow with hard kill
kill -9 (SIGKILL) on Linux
Restart implemented as a Java Runtime.exec() call
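
The soft-then-hard kill sequence can be sketched with Python's subprocess module, used here as a stand-in for the Java exec call mentioned on the slide. The function names and the two-second grace period are assumptions for the example.

```python
import subprocess
import sys

def stop_process(proc, grace_seconds=2.0):
    """Soft kill first; fall back to a hard kill if the process lingers.

    On POSIX, terminate() sends SIGTERM (the process may clean up) and
    kill() sends SIGKILL (the equivalent of kill -9).
    Returns "soft" or "hard" depending on which one was needed.
    """
    proc.terminate()
    try:
        proc.wait(timeout=grace_seconds)   # give it a chance to free resources
        return "soft"
    except subprocess.TimeoutExpired:
        proc.kill()                        # hard kill
        proc.wait()
        return "hard"

def restart_process(proc, argv, grace_seconds=2.0):
    """Stop the old process, then launch a fresh one from the same command line."""
    stop_process(proc, grace_seconds)
    return subprocess.Popen(argv)

# Usage: a child that just sleeps stands in for a ground station component.
argv = [sys.executable, "-c", "import time; time.sleep(60)"]
child = subprocess.Popen(argv)
how = stop_process(child)
```

A component that ignores the soft kill is still stopped once the grace period expires, so the restart is never blocked indefinitely.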

11
Designing a system for RR

Goal is to decrease MTTR
Decompose components into smaller pieces
Advantages
Fault isolation
Move slow-restart pieces up (and fast-restart
down)
Significantly decreases MTTR
Example: fedr and pbcom
Disadvantages
Some components may not be decomposable
IPC can make things difficult (the pieces were together
for a reason): the coordination aspect
State management

12
State Management