Crash Detection - PowerPoint PPT Presentation

About This Presentation

Title:

Crash Detection

Slides: 33

Provided by: Syste98

Learn more at: https://pages.cs.wisc.edu

Transcript and Presenter's Notes

Title: Crash Detection

1
Middleware for Active Reduction Operations in Distributed Systems
By Nitin Bahadur and Gokul Nadathur
Department of Computer Sciences, University of Wisconsin-Madison
Spring 2000
2
Talk Outline

Motivation and Goals
General Architecture of the middleware
Components of the middleware
Providing reliability - handling of node failures
Applications developed using the middleware
Performance
Conclusions and possible extensions

3
Motivation and Goals

A middleware for applications that follow the Master-Worker paradigm
Scalable framework for communication and for computing the client response (Reduction)
Unicast does not scale, so use multicast
Introducing reduction operations dynamically in clients
A general framework for communication among clients

4
The Big Picture...
Master App
ARTL
Client App
Client App
ARTL
ARTL
Client App
ARTL
5
ART - Library Architecture
Application specific callbacks
Application
Application API
Reduction functions
Framework for processing messages
ARTL specific message
Event Handler
Outgoing message
ARTL Communication Layer
Incoming Packet
Network
ARTL messages: 1. Query from master, 2. Response from downstream nodes
7
Communication Subsystem

Connection Setup
Connect nodes as a Binomial tree
Send and receive ARTL and application messages
Detect node failure and act accordingly
Integrate restarted node in current tree structure

8
Why use Binomial Tree
[Figure: two diagrams, each with one Master App and three Client Apps; edges are numbered with the step at which the query is sent]
Binomial tree: query propagation time 2
Unicast mechanism: query propagation time 3
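
The propagation times on this slide can be reproduced with a small sketch. It assumes nodes are ranked 0..n-1 with the master at rank 0, and that in each step every node that already holds the query forwards it to one new node; the function names are illustrative and are not part of ARTL.

```python
from typing import Optional


def binomial_parent(rank: int) -> Optional[int]:
    """Parent of `rank` in a binomial tree rooted at 0 (clear the highest set bit)."""
    if rank == 0:
        return None
    return rank & ~(1 << (rank.bit_length() - 1))


def binomial_rounds(n_nodes: int) -> int:
    """Steps for a query from rank 0 to reach all nodes: ceil(log2(n_nodes))."""
    return (n_nodes - 1).bit_length()


def unicast_rounds(n_nodes: int) -> int:
    """Steps when the master sends to each client one after another."""
    return n_nodes - 1


# One master plus three clients, as in the figure.
print(binomial_rounds(4), unicast_rounds(4))
```

The gap widens with the number of nodes: under these assumptions 64 nodes need 6 steps with the tree and 63 with unicast.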
9
Reduction
Reduction at 5 and 3
Example reduction operations: Min(), Max()
Responses
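
A minimal sketch of reduction on the way back up the tree: each node combines its own response with the already reduced responses of its children and forwards a single value. The tree shape, the node ids and the assumption that every node (including the root) contributes a value are illustrative only.

```python
from typing import Callable, Dict, List


def reduce_up_tree(
    children: Dict[int, List[int]],
    responses: Dict[int, float],
    combine: Callable[[float, float], float],
    node: int = 0,
) -> float:
    """Return the reduced value that `node` would send to its parent."""
    result = responses[node]
    for child in children.get(node, []):
        # The child's subtree is reduced first, then folded into this node's value.
        result = combine(result, reduce_up_tree(children, responses, combine, child))
    return result


# Hypothetical tree: 0 is the root, 1 and 2 are its children, 3 is a child of 1.
children = {0: [1, 2], 1: [3]}
responses = {0: 0.4, 1: 1.2, 2: 0.7, 3: 2.5}
lowest = reduce_up_tree(children, responses, min)
highest = reduce_up_tree(children, responses, max)
```

Because intermediate nodes reduce before forwarding, the master receives one value per child instead of one per client.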
10
Tree connection setup
11
Tree Setup - Phase I
TCP connection setup
12
Tree Setup - Phase II
TCP connection setup
13
Tree Setup - Phase III
TCP connection setup
14
Inter node communication
Data
ARTL Header

Unicast and multicast data transmission
ARTL receives application messages for which no receive has been posted; these are sent to a callback function registered by the application
ARTL receives data on behalf of the application when the application explicitly posts a receive
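
The two delivery paths can be sketched as follows. The class, the method names and the integer `tag` used to match a posted receive to a message are assumptions made for illustration; the slides do not describe the ARTL header fields or API.

```python
from typing import Callable, Dict, List


class MessageDispatcher:
    """Routes an incoming application message to a posted receive or to a callback."""

    def __init__(self, on_unexpected: Callable[[int, bytes], None]) -> None:
        self._on_unexpected = on_unexpected          # callback registered by the application
        self._posted: Dict[int, List[bytes]] = {}    # tag -> buffer awaiting one message

    def post_receive(self, tag: int) -> List[bytes]:
        """Application explicitly posts a receive; the returned buffer is filled later."""
        buffer: List[bytes] = []
        self._posted[tag] = buffer
        return buffer

    def deliver(self, tag: int, payload: bytes) -> None:
        """Called by the communication layer for each incoming application message."""
        buffer = self._posted.pop(tag, None)
        if buffer is not None:
            buffer.append(payload)                   # a receive was posted: hand over the data
        else:
            self._on_unexpected(tag, payload)        # no receive posted: use the callback


unexpected: List[tuple] = []
dispatcher = MessageDispatcher(lambda tag, payload: unexpected.append((tag, payload)))
inbox = dispatcher.post_receive(7)
dispatcher.deliver(7, b"first")    # satisfies the posted receive
dispatcher.deliver(7, b"second")   # nothing posted any more, goes to the callback
```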

15
ART - Library Architecture
Application specific callbacks
Application
Application API
Reduction functions
Framework for processing messages
ARTL Encapsulated message
Event Handler
Outgoing message
ARTL Communication Layer
Incoming Packet
Network
ARTL messages: 1. Query from master, 2. Response from downstream nodes
16
Reduction Functions

Implemented as shared objects
Sent to client during Setup phase
Each reduction function is associated with the particular response it reduces
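
One way to picture the association between a response type and its reduction function is a lookup table. In this sketch plain Python callables stand in for the shared objects, and the string keys such as "min_load" are hypothetical names, not ARTL identifiers.

```python
from functools import reduce
from typing import Callable, Dict, Sequence

# response type -> binary reduction function (stand-in for a loaded shared object)
REDUCERS: Dict[str, Callable[[float, float], float]] = {}


def register_reduction(response_type: str, fn: Callable[[float, float], float]) -> None:
    """Associate a reduction function with the response type it reduces."""
    REDUCERS[response_type] = fn


def reduce_responses(response_type: str, values: Sequence[float]) -> float:
    """Fold all responses of one type with the function registered for that type."""
    return reduce(REDUCERS[response_type], values)


register_reduction("min_load", min)
register_reduction("max_load", max)
```

A node that receives responses of type "min_load" from its children would call `reduce_responses("min_load", values)` before forwarding the result.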

17
Event Handler
Network
Thread Pool
Event Handler
Application
18
Multithreaded Architecture

No prior knowledge about the behavior of a reduction function
Exploit concurrency: multiple processors per node
Static pool of threads: creating and destroying threads is expensive (Firefly RPC)
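
The static-pool idea can be sketched with the Python standard library. The pool size and the task format are assumptions; `ThreadPoolExecutor` starts at most `max_workers` threads and reuses them, which approximates a pool created once at startup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

POOL_SIZE = 4  # assumed size; fixed for the lifetime of the process

Task = Tuple[Callable[[Sequence[float]], float], Sequence[float]]


def run_reductions(pool: ThreadPoolExecutor, tasks: List[Task]) -> List[float]:
    """Run each (reduction function, values) pair on a pooled worker thread."""
    futures = [pool.submit(fn, values) for fn, values in tasks]
    # Results are collected in submission order, whichever task finishes first.
    return [future.result() for future in futures]


with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
    results = run_reductions(pool, [(min, [3, 1, 2]), (max, [3, 1, 2]), (sum, [3, 1, 2])])
```

Since nothing is known in advance about how long a reduction function runs, handing it to a worker keeps the event handler free to accept the next message.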

19
Crash Reconfiguration
20
Crash Reconfiguration
Crash Reconfiguration at depth 1
21
Crash Reconfiguration
Crash Reconfiguration at depth 2
22
Crash Reconfiguration
Crash Reconfiguration at depth 1
23
Crash Reconfiguration
Crash Reconfiguration at depth 1
24
Crash Detection

Break in TCP connection with parent/child: a signal is received at the other end of the connection
Use of periodic refresh messages to inform the parent that the child is up and running: useful in WAN environments
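
The refresh-message check can be sketched as a timeout on the last refresh seen from each child. The class name, the timeout value and passing the current time in explicitly are illustrative choices, not details taken from ARTL.

```python
from typing import Dict, List


class RefreshMonitor:
    """Tracks the last refresh from each child and reports children that went silent."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: Dict[int, float] = {}

    def refresh(self, child: int, now: float) -> None:
        """Record a refresh message from `child` received at time `now`."""
        self.last_seen[child] = now

    def crashed(self, now: float) -> List[int]:
        """Children whose last refresh is older than the timeout."""
        return sorted(c for c, seen in self.last_seen.items() if now - seen > self.timeout)


monitor = RefreshMonitor(timeout=3.0)
monitor.refresh(1, now=0.0)
monitor.refresh(2, now=0.0)
monitor.refresh(1, now=2.5)   # child 1 keeps refreshing, child 2 goes silent
```

In a real deployment `now` would come from a monotonic clock such as `time.monotonic()`.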

25
Crash Handling

Parent of the failed node informs the master
All nodes are informed of the node failure
Master recomputes the tree
If a leaf node is down, no reconfiguration is needed
If an intermediate node is down, some reconfiguration is required
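
A sketch of the leaf versus intermediate distinction. The slides do not say how the master recomputes the tree, so this example assumes one simple policy: the children of a failed node are adopted by that node's parent. Failure of the master itself is not handled here.

```python
from typing import Dict, List, Optional, Tuple


def handle_crash(
    parents: Dict[int, Optional[int]], failed: int
) -> Tuple[Dict[int, Optional[int]], List[int]]:
    """Return the new child -> parent map and the nodes that had to be re-parented."""
    adopter = parents[failed]          # assumed policy: the failed node's parent adopts
    new_parents: Dict[int, Optional[int]] = {}
    moved: List[int] = []
    for node, parent in parents.items():
        if node == failed:
            continue                   # the failed node leaves the tree
        if parent == failed:
            new_parents[node] = adopter
            moved.append(node)
        else:
            new_parents[node] = parent
    return new_parents, sorted(moved)


# Hypothetical tree: 0 is the master, 1 and 2 are its children, 3 is a child of 1.
tree = {0: None, 1: 0, 2: 0, 3: 1}
after_leaf, moved_leaf = handle_crash(tree, failed=3)    # leaf: nothing moves
after_inner, moved_inner = handle_crash(tree, failed=1)  # intermediate: node 3 moves
```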

26
Node Restart

Restarted node contacts the master to report its restart
Master sends it the current state of the network and the shared object(s)
All nodes are informed of the node restart
Master recomputes the tree and informs the new node's parent about its new child
Parent and child establish connections
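
The restart steps can be sketched as a plan the master builds when a node reports back. The message names, the dictionary layout and the parent-selection policy (the node with the fewest children, lowest id on ties) are all assumptions made for illustration; the slides only say that the master recomputes the tree.

```python
from typing import Dict, List, Optional


def handle_restart(
    parents: Dict[int, Optional[int]], node: int, shared_objects: List[str]
) -> dict:
    """Build the messages the master would send after `node` reports a restart."""
    child_count = {n: 0 for n in parents}
    for parent in parents.values():
        if parent is not None:
            child_count[parent] += 1
    # Hypothetical policy: attach under the node with the fewest children.
    new_parent = min(child_count, key=lambda n: (child_count[n], n))
    new_tree = dict(parents)
    new_tree[node] = new_parent
    return {
        # current network state and the shared object(s) for the restarted node
        "to_restarted": {"tree": new_tree, "shared_objects": list(shared_objects)},
        # notification sent to all nodes
        "broadcast": ("NODE_RESTARTED", node),
        # the new parent is told about its new child, then the two connect
        "to_parent": (new_parent, ("NEW_CHILD", node)),
    }


plan = handle_restart({0: None, 2: 0, 3: 0}, node=1, shared_objects=["min.so", "max.so"])
```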

27
SysMon - A System monitor
Monitors the load average from /proc
Displays Min, Max and average loads
Per-node load is also displayed
ARTL reduction operations: Min, Max and Average
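
A sketch of how the three SysMon reductions could be combined node by node. Min and Max reduce directly, but an average of averages is wrong when subtrees differ in size, so this example carries a (min, max, sum, count) tuple up the tree. The tuple layout and function names are assumptions; the only external fact used is that the first field of /proc/loadavg is the 1-minute load average.

```python
from functools import reduce
from typing import Tuple

Partial = Tuple[float, float, float, int]  # (min, max, sum, count)


def parse_loadavg(text: str) -> float:
    """Extract the 1-minute load average from the contents of /proc/loadavg."""
    return float(text.split()[0])


def to_partial(load: float) -> Partial:
    """Wrap one node's load as a partial result."""
    return (load, load, load, 1)


def combine(a: Partial, b: Partial) -> Partial:
    """Merge two partial results; safe to apply at any level of the tree."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2], a[3] + b[3])


def summarize(p: Partial) -> Tuple[float, float, float]:
    """Final (min, max, average) as displayed at the master."""
    return (p[0], p[1], p[2] / p[3])


samples = ["0.50 0.40 0.30 1/80 100", "1.50 1.20 1.00 2/81 101", "1.00 0.90 0.80 1/79 102"]
summary = summarize(reduce(combine, (to_partial(parse_loadavg(s)) for s in samples)))
```

On a Linux node the sample string would come from `open("/proc/loadavg").read()`.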
28
SysMon - A System monitor
Node failures are detected and SysMon pops up an alert
29
File Transfer Application

Transfers a file from master to all clients
File can be executed at clients (if required)
Execution can be immediate on receiving the file
Execution can be delayed until all nodes have received the file
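
The two execution modes can be sketched as a small coordinator that decides who may execute after each client confirms receipt. The class name and the acknowledgement-based approach are assumptions; the slides do not say how the delayed mode is synchronised.

```python
from typing import Iterable, List, Set


class ExecutionCoordinator:
    """Decides which clients may execute the file after each receipt confirmation."""

    def __init__(self, clients: Iterable[int], delayed: bool) -> None:
        self._clients: Set[int] = set(clients)
        self._received: Set[int] = set()
        self._delayed = delayed

    def file_received(self, client: int) -> List[int]:
        """Record that `client` has the file; return the clients that may execute now."""
        self._received.add(client)
        if not self._delayed:
            return [client]                     # immediate mode: execute on receipt
        if self._received == self._clients:
            return sorted(self._clients)        # delayed mode: everyone starts together
        return []                               # delayed mode: still waiting for others


immediate = ExecutionCoordinator([1, 2, 3], delayed=False)
deferred = ExecutionCoordinator([1, 2, 3], delayed=True)
```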

30
File Transfer Performance
31
Total Startup Time vs Number of Nodes
Client processes started using ssh on different machines
32
Conclusions and Extensions