Title: Parallel Checkpointing
1. Parallel Checkpointing
2. Introduction
- Checkpointing?
- Storing an application's state so that it can be resumed later
3. Motivation
- The largest parallel systems in the world have from 130,000 to 213,000 parallel processing elements. (http://www.top500.org)
- Large-scale applications that can use these large numbers of processes are continuously being built.
- These applications are also long-running.
- As the number of processing elements increases, the Mean Time Between Failures (MTBF) decreases (a rough worked example follows this list).
- So, how can long-running applications execute in this highly failure-prone environment? Checkpointing and fault tolerance.
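A rough worked example, under the assumption that node failures are independent so that system MTBF scales as node MTBF divided by node count: with a node MTBF of 5 years (about 43,800 hours), a 131,072-node system has an MTBF of roughly 43,800 / 131,072, about 0.33 hours, i.e., a failure about every 20 minutes.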
4. Uses of Checkpointing
- Fault tolerance: rollback recovery
- Process migration, job swapping
- Debugging
5. Types of Checkpointing
- OS-level
- Process preemption
- e.g., Berkeley's checkpointing Linux (BLCR)
- User-level, transparent
- Checkpointing performed by the program itself
- Transparency achieved by linking the application with a special library
- e.g., Plank's libckpt
- User-level, non-transparent
- Users insert checkpointing calls in their programs (a minimal sketch follows this list)
- e.g., SRS (Vadhiyar)
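A minimal single-process sketch of the non-transparent style (illustrative only: the file name, state layout, and checkpoint frequency are assumptions, not taken from libckpt or SRS):

    #include <stdio.h>

    #define N 1000000
    #define CKPT_FILE "app.ckpt"          /* illustrative file name */

    int main(void) {
        static double data[N];
        long i = 0;

        /* On restart, recover the loop counter and data, if a checkpoint exists. */
        FILE *f = fopen(CKPT_FILE, "rb");
        if (f) {
            fread(&i, sizeof i, 1, f);
            fread(data, sizeof data[0], N, f);
            fclose(f);
        }

        for (; i < N; i++) {
            data[i] = i * 0.5;            /* the "computation" */
            if (i % 100000 == 0) {        /* user-inserted checkpoint call */
                f = fopen(CKPT_FILE, "wb");
                fwrite(&i, sizeof i, 1, f);
                fwrite(data, sizeof data[0], N, f);
                fclose(f);
            }
        }
        remove(CKPT_FILE);                /* run finished: discard checkpoint */
        return 0;
    }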
6. Rules for Consistent Checkpointing
- In a parallel program, each process has events and a local state
- An event changes the local state of a process
- Global state: an external view of the parallel application (e.g., lines S, S, S), used for checkpointing and restarting
- Consists of the local states and the messages in transit
7. Rules for Consistent Checkpointing
- Types of global states
- Consistent: a global state from which the program can be restarted correctly
- Inconsistent: otherwise
8. Rules for Consistent Checkpointing
- Chandy-Lamport: two rules for consistent global states
- 1. If a receive event is part of the local state of a process, the corresponding send event must be part of the local state of the sender.
- 2. If a send event is part of the local state of a process and the matching receive is not part of the local state of the receiver, then the message must be part of the state of the network.
- S violates rule 1 and hence cannot lead to a consistent global state (a checking sketch follows this list).
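A minimal sketch of checking the two rules against a candidate cut (the struct and field names are illustrative, not from the Chandy-Lamport paper):

    #include <stdio.h>
    #include <stdbool.h>

    /* One record per message: whether its send and its matching receive fall
       inside the cut, and whether it is logged as part of the network state. */
    struct msg {
        bool send_in_cut;
        bool recv_in_cut;
        bool logged_in_transit;
    };

    /* Returns true iff the cut satisfies both rules. */
    bool cut_is_consistent(const struct msg *m, int n) {
        for (int k = 0; k < n; k++) {
            /* Rule 1: a recorded receive requires the recorded send. */
            if (m[k].recv_in_cut && !m[k].send_in_cut)
                return false;
            /* Rule 2: a recorded send with an unrecorded receive requires the
               message to be captured in the channel (network) state. */
            if (m[k].send_in_cut && !m[k].recv_in_cut && !m[k].logged_in_transit)
                return false;
        }
        return true;
    }

    int main(void) {
        struct msg cut[] = { { false, true, false } };   /* receive without send */
        printf("%s\n", cut_is_consistent(cut, 1) ? "consistent" : "inconsistent");
        return 0;
    }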
9. Independent Checkpointing
- Processors checkpoint periodically without coordination
- Can lead to the domino effect: each rollback of a processor to a previous checkpoint forces another processor to roll back even further
10. Checkpointing Methods
- Coordinated checkpointing
- All processes coordinate to take a consistent checkpoint (e.g., using a barrier); a sketch follows this list
- Always leads to consistent checkpoints
- Checkpointing with message logging
- Independent checkpoints are taken by the processes, and each process logs the messages it receives after its last checkpoint
- Thus recovery uses the previous checkpoint plus the logged messages.
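A minimal barrier-based sketch (illustrative: it assumes the application has quiesced its own communication before the call, since MPI_Barrier alone does not flush in-transit point-to-point messages; the file name is an assumption):

    #include <mpi.h>
    #include <stdio.h>

    /* Every process enters together, writes its local state, and leaves
       together, so no application message crosses the checkpoint line. */
    void coordinated_checkpoint(const void *state, size_t nbytes, int rank) {
        char fname[64];

        MPI_Barrier(MPI_COMM_WORLD);      /* everyone reaches the cut */

        snprintf(fname, sizeof fname, "ckpt.%d", rank);
        FILE *f = fopen(fname, "wb");
        fwrite(state, 1, nbytes, f);
        fclose(f);

        MPI_Barrier(MPI_COMM_WORLD);      /* nobody runs ahead before the
                                             checkpoint is globally complete */
    }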
11. Message Logging
12. Message Logging: MPICH-V, MPICH-V2 (George Bosilca et al., SC 2002; Aurelien Bouteiller et al., SC 2003)
- MPICH-V: uncoordinated, pessimistic, receiver-based message logging
- Architecture consists of computational nodes, a dispatcher, channel memories, and checkpoint servers
- MPICH-V2: uncoordinated, pessimistic, sender-based message logging
- Uses message ids and logical clocks
- Senders log message payloads and resend messages
13. MPICH-V: Receiver-Based
- Each process has an associated channel memory (CM) called its home CM
- When P1 wants to send a message to P2, it contacts P2's home CM and sends the message to it
- P2 retrieves the message from its home CM
- During a checkpoint, the process state is stored to a CS (checkpoint server)
- During restart, the checkpoint is retrieved from the CS and the messages from the CM (a sketch of the indirection follows this list)
- A dispatcher manages all services
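A rough sketch of the indirection (the rank mapping, tag, and function names are assumptions; the CM process's own receive-log-forward loop is omitted):

    #include <mpi.h>

    #define TAG_DATA 1

    /* Hypothetical mapping from a compute process to its home CM's rank. */
    static int cm_of(int p) { return p + 1000; }

    /* Sender side: P1 never sends to P2 directly; it sends to P2's home CM,
       which logs the message before delivery. */
    void cm_send(const void *buf, int count, int dest) {
        MPI_Send(buf, count, MPI_BYTE, cm_of(dest), TAG_DATA, MPI_COMM_WORLD);
    }

    /* Receiver side: P2 pulls its next message from its own home CM; after a
       crash, the CM replays the same logged stream. */
    void cm_recv(void *buf, int count, int me) {
        MPI_Recv(buf, count, MPI_BYTE, cm_of(me), TAG_DATA, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }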
14. MPICH-V (architecture figure)
15. MPICH-V2: Sender-Based
- Sender-based logging avoids the channel memories
- When q sends m to p, q logs m
- m is associated with an id
- When p receives m, it stores (id, l), where l is its logical clock
- When p crashes and restarts from a checkpointed state, it retrieves the (id, l) sets from storage and asks the other processes to resend the messages in the sets (a sketch follows this list)
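A rough sketch of the sender side (structure, sizes, and names are assumptions of the sketch, not MPICH-V2's actual implementation):

    #include <string.h>

    #define MAX_LOG 1024
    #define MAX_PAYLOAD 256

    struct log_entry {
        long id;                          /* unique message id */
        int  dest;
        int  len;
        char payload[MAX_PAYLOAD];        /* kept in the sender's memory */
    };

    static struct log_entry log_buf[MAX_LOG];
    static int  log_len = 0;
    static long next_id = 0;

    /* q sends m to p: log the payload under a fresh id, then send. */
    long log_and_send(const void *m, int len, int dest) {
        struct log_entry *e = &log_buf[log_len++];
        e->id = next_id++;
        e->dest = dest;
        e->len = len;
        memcpy(e->payload, m, len);
        /* ... MPI_Send(m, len, MPI_BYTE, dest, ...) would go here ... */
        return e->id;
    }

    /* After p restarts, it sends back its recorded ids in logical-clock
       order; q replays the matching payloads. */
    void resend_logged(const long *ids, int n, int requester) {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < log_len; k++)
                if (log_buf[k].id == ids[i] && log_buf[k].dest == requester) {
                    /* ... MPI_Send(log_buf[k].payload, log_buf[k].len, ...) ... */
                }
    }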
16. Coordinated Checkpointing
17. Coordinated Checkpointing: SRS
- The user inserts checkpointing calls specifying the data to be checkpointed
- Advantage: small amount of checkpointed data
- Disadvantage: user burden
- SRS (Stop-ReStart): a user-level checkpointing library that allows reconfiguration of applications
- Reconfiguration of the number of processors and/or the data distribution
18. Internals
(Diagram: the MPI application linked with SRS polls the Runtime Support System (RSS); on a STOP request the application saves its state and stops, and on ReStart it reads the data back with possible redistribution before starting.)
19. SRS Example: Original Code

    MPI_Init(&argc, &argv);
    local_size = global_size/size;
    if (rank == 0)
        for (i = 0; i < global_size; i++)
            global_A[i] = i;
    MPI_Scatter(global_A, local_size, MPI_INT,
                local_A, local_size, MPI_INT, 0, comm);
    iter_start = 0;
    for (i = iter_start; i < global_size; i++) {
        proc_number = i/local_size;       /* owner of element i */
        local_index = i%local_size;       /* its index on that process */
        if (rank == proc_number)
            local_A[local_index] += 10;
    }
    MPI_Finalize();
20. SRS Example: Modified Code

    MPI_Init(&argc, &argv);
    SRS_Init();
    local_size = global_size/size;
    restart_value = SRS_Restart_Value();
    if (restart_value == 0) {             /* fresh start */
        if (rank == 0)
            for (i = 0; i < global_size; i++)
                global_A[i] = i;
        MPI_Scatter(global_A, local_size, MPI_INT,
                    local_A, local_size, MPI_INT, 0, comm);
        iter_start = 0;
    } else {                              /* restarted run */
        SRS_Read("A", local_A, BLOCK, NULL);
        SRS_Read("iterator", &iter_start, SAME, NULL);
    }
    SRS_Register("A", local_A, GRADS_INT, local_size, BLOCK, NULL);
    SRS_Register("iterator", &i, GRADS_INT, 1, 0, NULL);
21. SRS Example: Modified Code (Contd.)

    for (i = iter_start; i < global_size; i++) {
        stop_value = SRS_Check_Stop();    /* poll the RSS for a STOP request */
        if (stop_value == 1) {            /* state saved: exit cleanly */
            MPI_Finalize();
            exit(0);
        }
        proc_number = i/local_size;
        local_index = i%local_size;
        if (rank == proc_number)
            local_A[local_index] += 10;
    }
    SRS_Finish();
    MPI_Finalize();
22. Challenges: A Precompiler for SRS
- A precompiler that reads plain source code and converts it into SRS-instrumented source code
- Needs to automatically analyze which variables have to be checkpointed
- Needs to automatically determine data distributions
- Should locate the phases between which checkpoints ought to be taken, i.e., automatically determine the phases of an application
From Schulz et al., SC 2004
23. Challenges: Determining Checkpointing Intervals
- How frequently should checkpoints be taken?
- Too frequent: high checkpointing overhead
- Too sporadic: high recovery cost
- Should model applications using a Markov model consisting of running, recovery, and failure states
- Should use failure distributions to determine the weights of the transitions between the states (a simpler rule of thumb follows this list)
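A simpler rule of thumb than the full Markov model is Young's (1974) first-order approximation: T_opt is about sqrt(2 x C x MTBF), where C is the time to write one checkpoint. For example, with C = 5 minutes and an MTBF of 24 hours (1,440 minutes), T_opt = sqrt(2 x 5 x 1,440) = 120 minutes, i.e., a checkpoint every two hours.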
24. Challenges: Determining Checkpointing Intervals (figure)
25. Checkpointing Performance
- Reducing the times for checkpointing, recovery, etc.
- Checkpoint latency hiding
- Checkpoint buffering: during checkpointing, copy the data to a local buffer, then store the buffer to disk in parallel with application progress
- Copy-on-write buffering: only modified pages are copied to a buffer and stored. Can be implemented using fork(): forked checkpointing (a sketch follows this list)
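A minimal sketch of forked checkpointing (the file name is an assumption; error handling and reaping the child are omitted):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>

    /* fork() gives the child a copy-on-write snapshot of the parent's
       memory: the parent keeps computing, and only pages it subsequently
       modifies are physically copied. */
    void forked_checkpoint(const void *state, size_t nbytes) {
        pid_t pid = fork();
        if (pid == 0) {                   /* child: write the frozen snapshot */
            FILE *f = fopen("app.ckpt", "wb");
            fwrite(state, 1, nbytes, f);
            fclose(f);
            _exit(0);
        }
        /* parent: returns immediately, overlapping I/O with computation;
           a real implementation would later reap the child with waitpid(). */
    }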
26. Checkpointing Performance
- Reducing checkpoint size
- Memory exclusion: no need to store dead and read-only variables
- Incremental checkpointing: store only the part of the data that has been modified since the previous checkpoint (a sketch follows this list)
- Checkpoint compression
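A minimal sketch of incremental checkpointing by block comparison, using a shadow copy in place of page-level dirty tracking (the names and the 4 KB block size are assumptions):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BLOCK 4096

    static char *shadow = NULL;           /* contents at the last checkpoint */

    /* Writes (offset, length, data) records only for blocks that changed
       since the previous call. The zero-filled initial shadow means restore
       must start from zero-initialized memory. */
    void incremental_checkpoint(const char *data, size_t nbytes, FILE *out) {
        if (!shadow)
            shadow = calloc(nbytes, 1);
        for (size_t off = 0; off < nbytes; off += BLOCK) {
            size_t len = (nbytes - off < BLOCK) ? nbytes - off : BLOCK;
            if (memcmp(data + off, shadow + off, len) != 0) {
                fwrite(&off, sizeof off, 1, out);
                fwrite(&len, sizeof len, 1, out);
                fwrite(data + off, 1, len, out);
                memcpy(shadow + off, data + off, len);
            }
        }
    }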
27. References
- James S. Plank. "An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance." University of Tennessee Technical Report CS-97-372, July 1997.
- James S. Plank and Michael G. Thomason. "Processor Allocation and Checkpointing Interval Selection in Cluster Computing Systems." JPDC, 2001.
28. References
- "MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes." George Bosilca, Aurélien Bouteiller, Franck Cappello, Samir Djilali, Gilles Fédak, Cécile Germain, Thomas Hérault, Pierre Lemarinier, Oleg Lodygensky, Frédéric Magniette, Vincent Néri, Anton Selikhov. SuperComputing 2002, Baltimore, USA, November 2002.
- "MPICH-V2: A Fault Tolerant MPI for Volatile Nodes Based on the Pessimistic Sender Based Message Logging." Aurélien Bouteiller, Franck Cappello, Thomas Hérault, Géraud Krawezik, Pierre Lemarinier, Frédéric Magniette. SuperComputing 2003, Phoenix, USA, November 2003.
29. References
- Vadhiyar, S. and Dongarra, J. "SRS: A Framework for Developing Malleable and Migratable Parallel Applications for Distributed Systems." Parallel Processing Letters, Vol. 13, No. 2, pp. 291-312, June 2003.
30. References
- Schulz et al. "Implementation and Evaluation of a Scalable Application-level Checkpoint-Recovery Scheme for MPI Programs." SC 2004.
31. Thank You