Title: Parallel Checkpointing
1. Parallel Checkpointing
2. Introduction
- Checkpointing?
- Storing an application's state so that it can be resumed later
3. Motivation
- The largest parallel systems in the world have from 130,000 to 213,000 parallel processing elements. (http://www.top500.org)
- Large-scale applications that can use these large numbers of processes are continuously being built.
- These applications are also long-running.
- As the number of processing elements increases, the Mean Time Between Failures (MTBF) decreases (a rough worked example follows this list).
- So, how can long-running applications execute in this highly failure-prone environment? Checkpointing and fault tolerance.
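A rough worked example, under the assumption that node failures are independent so that system MTBF scales as node MTBF divided by node count: with a node MTBF of 5 years (about 43,800 hours), a 131,072-node system has an MTBF of roughly 43,800 / 131,072, about 0.33 hours, i.e., a failure about every 20 minutes.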
4. Uses of Checkpointing
- Fault tolerance: rollback recovery
- Process migration, job swapping
- Debugging
5. Types of Checkpointing
- OS-level
- Process preemption
- e.g., Berkeley's checkpointing Linux (BLCR)
- User-level, transparent
- Checkpointing performed by the program itself
- Transparency achieved by linking the application with a special library
- e.g., Plank's libckpt
- User-level, non-transparent
- Users insert checkpointing calls in their programs (a minimal sketch follows this list)
- e.g., SRS (Vadhiyar)
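A minimal single-process sketch of the non-transparent style (illustrative only: the file name, state layout, and checkpoint frequency are assumptions, not taken from libckpt or SRS):

    #include <stdio.h>

    #define N 1000000
    #define CKPT_FILE "app.ckpt"          /* illustrative file name */

    int main(void) {
        static double data[N];
        long i = 0;

        /* On restart, recover the loop counter and data, if a checkpoint exists. */
        FILE *f = fopen(CKPT_FILE, "rb");
        if (f) {
            fread(&i, sizeof i, 1, f);
            fread(data, sizeof data[0], N, f);
            fclose(f);
        }

        for (; i < N; i++) {
            data[i] = i * 0.5;            /* the "computation" */
            if (i % 100000 == 0) {        /* user-inserted checkpoint call */
                f = fopen(CKPT_FILE, "wb");
                fwrite(&i, sizeof i, 1, f);
                fwrite(data, sizeof data[0], N, f);
                fclose(f);
            }
        }
        remove(CKPT_FILE);                /* run finished: discard checkpoint */
        return 0;
    }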
6. Rules for Consistent Checkpointing
- In a parallel program, each process has events and a local state
- An event changes the local state of a process
- Global state: an external view of the parallel application (e.g., lines S, S, S), used for checkpointing and restarting
- Consists of the local states and the messages in transit
7. Rules for Consistent Checkpointing
- Types of global states
- Consistent: a global state from which the program can be restarted correctly
- Inconsistent: otherwise
8. Rules for Consistent Checkpointing
- Chandy-Lamport: two rules for consistent global states
- 1. If a receive event is part of the local state of a process, the corresponding send event must be part of the local state of the sender.
- 2. If a send event is part of the local state of a process and the matching receive is not part of the local state of the receiver, then the message must be part of the state of the network.
- S violates rule 1 and hence cannot lead to a consistent global state (a checking sketch follows this list).
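A minimal sketch of checking the two rules against a candidate cut (the struct and field names are illustrative, not from the Chandy-Lamport paper):

    #include <stdio.h>
    #include <stdbool.h>

    /* One record per message: whether its send and its matching receive fall
       inside the cut, and whether it is logged as part of the network state. */
    struct msg {
        bool send_in_cut;
        bool recv_in_cut;
        bool logged_in_transit;
    };

    /* Returns true iff the cut satisfies both rules. */
    bool cut_is_consistent(const struct msg *m, int n) {
        for (int k = 0; k < n; k++) {
            /* Rule 1: a recorded receive requires the recorded send. */
            if (m[k].recv_in_cut && !m[k].send_in_cut)
                return false;
            /* Rule 2: a recorded send with an unrecorded receive requires the
               message to be captured in the channel (network) state. */
            if (m[k].send_in_cut && !m[k].recv_in_cut && !m[k].logged_in_transit)
                return false;
        }
        return true;
    }

    int main(void) {
        struct msg cut[] = { { false, true, false } };   /* receive without send */
        printf("%s\n", cut_is_consistent(cut, 1) ? "consistent" : "inconsistent");
        return 0;
    }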
9. Independent Checkpointing
- Processors checkpoint periodically without coordination
- Can lead to the domino effect: each rollback of a processor to a previous checkpoint forces another processor to roll back even further
10. Checkpointing Methods
- Coordinated checkpointing
- All processes coordinate to take a consistent checkpoint (e.g., using a barrier); a sketch follows this list
- Always leads to consistent checkpoints
- Checkpointing with message logging
- Independent checkpoints are taken by the processes, and each process logs the messages it receives after its last checkpoint
- Thus recovery uses the previous checkpoint plus the logged messages.
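A minimal barrier-based sketch (illustrative: it assumes the application has quiesced its own communication before the call, since MPI_Barrier alone does not flush in-transit point-to-point messages; the file name is an assumption):

    #include <mpi.h>
    #include <stdio.h>

    /* Every process enters together, writes its local state, and leaves
       together, so no application message crosses the checkpoint line. */
    void coordinated_checkpoint(const void *state, size_t nbytes, int rank) {
        char fname[64];

        MPI_Barrier(MPI_COMM_WORLD);      /* everyone reaches the cut */

        snprintf(fname, sizeof fname, "ckpt.%d", rank);
        FILE *f = fopen(fname, "wb");
        fwrite(state, 1, nbytes, f);
        fclose(f);

        MPI_Barrier(MPI_COMM_WORLD);      /* nobody runs ahead before the
                                             checkpoint is globally complete */
    }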
11. Message Logging
12. Message Logging: MPICH-V, MPICH-V2 (George Bosilca et al., SC 2002; Aurelien Bouteiller et al., SC 2003)
- MPICH-V: uncoordinated, pessimistic, receiver-based message logging
- Architecture consists of computational nodes, a dispatcher, channel memories, and checkpoint servers
- MPICH-V2: uncoordinated, pessimistic, sender-based message logging
- Uses message ids and logical clocks
- Senders log message payloads and resend messages
13. MPICH-V: Receiver-Based
- Each process has an associated channel memory (CM) called its home CM
- When P1 wants to send a message to P2, it contacts P2's home CM and sends the message to it
- P2 retrieves the message from its home CM
- During a checkpoint, the process state is stored to a CS (checkpoint server)
- During restart, the checkpoint is retrieved from the CS and the messages from the CM (a sketch of the indirection follows this list)
- A dispatcher manages all services
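A rough sketch of the indirection (the rank mapping, tag, and function names are assumptions; the CM process's own receive-log-forward loop is omitted):

    #include <mpi.h>

    #define TAG_DATA 1

    /* Hypothetical mapping from a compute process to its home CM's rank. */
    static int cm_of(int p) { return p + 1000; }

    /* Sender side: P1 never sends to P2 directly; it sends to P2's home CM,
       which logs the message before delivery. */
    void cm_send(const void *buf, int count, int dest) {
        MPI_Send(buf, count, MPI_BYTE, cm_of(dest), TAG_DATA, MPI_COMM_WORLD);
    }

    /* Receiver side: P2 pulls its next message from its own home CM; after a
       crash, the CM replays the same logged stream. */
    void cm_recv(void *buf, int count, int me) {
        MPI_Recv(buf, count, MPI_BYTE, cm_of(me), TAG_DATA, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }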
14. MPICH-V (architecture figure)
15. MPICH-V2: Sender-Based
- Sender-based logging avoids the channel memories
- When q sends m to p, q logs m
- m is associated with an id
- When p receives m, it stores (id, l), where l is its logical clock
- When p crashes and restarts from a checkpointed state, it retrieves the (id, l) sets from storage and asks the other processes to resend the messages in the sets (a sketch follows this list)
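A rough sketch of the sender side (structure, sizes, and names are assumptions of the sketch, not MPICH-V2's actual implementation):

    #include <string.h>

    #define MAX_LOG 1024
    #define MAX_PAYLOAD 256

    struct log_entry {
        long id;                          /* unique message id */
        int  dest;
        int  len;
        char payload[MAX_PAYLOAD];        /* kept in the sender's memory */
    };

    static struct log_entry log_buf[MAX_LOG];
    static int  log_len = 0;
    static long next_id = 0;

    /* q sends m to p: log the payload under a fresh id, then send. */
    long log_and_send(const void *m, int len, int dest) {
        struct log_entry *e = &log_buf[log_len++];
        e->id = next_id++;
        e->dest = dest;
        e->len = len;
        memcpy(e->payload, m, len);
        /* ... MPI_Send(m, len, MPI_BYTE, dest, ...) would go here ... */
        return e->id;
    }

    /* After p restarts, it sends back its recorded ids in logical-clock
       order; q replays the matching payloads. */
    void resend_logged(const long *ids, int n, int requester) {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < log_len; k++)
                if (log_buf[k].id == ids[i] && log_buf[k].dest == requester) {
                    /* ... MPI_Send(log_buf[k].payload, log_buf[k].len, ...) ... */
                }
    }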
16. Coordinated Checkpointing
17. Coordinated Checkpointing: SRS
- The user inserts checkpointing calls specifying the data to be checkpointed
- Advantage: small amount of checkpointed data
- Disadvantage: user burden
- SRS (Stop-ReStart): a user-level checkpointing library that allows reconfiguration of applications
- Reconfiguration of the number of processors and/or the data distribution
18. Internals
(Diagram: the MPI application linked with SRS polls the Runtime Support System (RSS); on a STOP request the application saves its state and stops, and on ReStart it reads the data back with possible redistribution before starting.)
19. SRS Example: Original Code

    MPI_Init(&argc, &argv);
    local_size = global_size/size;
    if (rank == 0)
        for (i = 0; i < global_size; i++)
            global_A[i] = i;
    MPI_Scatter(global_A, local_size, MPI_INT,
                local_A, local_size, MPI_INT, 0, comm);
    iter_start = 0;
    for (i = iter_start; i < global_size; i++) {
        proc_number = i/local_size;       /* owner of element i */
        local_index = i%local_size;       /* its index on that process */
        if (rank == proc_number)
            local_A[local_index] += 10;
    }
    MPI_Finalize();
20. SRS Example: Modified Code

    MPI_Init(&argc, &argv);
    SRS_Init();
    local_size = global_size/size;
    restart_value = SRS_Restart_Value();
    if (restart_value == 0) {             /* fresh start */
        if (rank == 0)
            for (i = 0; i < global_size; i++)
                global_A[i] = i;
        MPI_Scatter(global_A, local_size, MPI_INT,
                    local_A, local_size, MPI_INT, 0, comm);
        iter_start = 0;
    } else {                              /* restarted run */
        SRS_Read("A", local_A, BLOCK, NULL);
        SRS_Read("iterator", &iter_start, SAME, NULL);
    }
    SRS_Register("A", local_A, GRADS_INT, local_size, BLOCK, NULL);
    SRS_Register("iterator", &i, GRADS_INT, 1, 0, NULL);
21. SRS Example: Modified Code (Contd.)

    for (i = iter_start; i < global_size; i++) {
        stop_value = SRS_Check_Stop();    /* poll the RSS for a STOP request */
        if (stop_value == 1) {            /* state saved: exit cleanly */
            MPI_Finalize();
            exit(0);
        }
        proc_number = i/local_size;
        local_index = i%local_size;
        if (rank == proc_number)
            local_A[local_index] += 10;
    }
    SRS_Finish();
    MPI_Finalize();
22. Challenges: A Precompiler for SRS
- A precompiler that reads plain source code and converts it into SRS-instrumented source code
- Needs to automatically analyze which variables have to be checkpointed
- Needs to automatically determine data distributions
- Should locate the phases between which checkpoints ought to be taken, i.e., automatically determine the phases of an application
From Schulz et al., SC 2004
23. Challenges: Determining Checkpointing Intervals
- How frequently should checkpoints be taken?
- Too frequent: high checkpointing overhead
- Too sporadic: high recovery cost
- Should model applications using a Markov model consisting of running, recovery, and failure states
- Should use failure distributions to determine the weights of the transitions between the states (a simpler rule of thumb follows this list)
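A simpler rule of thumb than the full Markov model is Young's (1974) first-order approximation: T_opt is about sqrt(2 x C x MTBF), where C is the time to write one checkpoint. For example, with C = 5 minutes and an MTBF of 24 hours (1,440 minutes), T_opt = sqrt(2 x 5 x 1,440) = 120 minutes, i.e., a checkpoint every two hours.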
24. Challenges: Determining Checkpointing Intervals (figure)
25. Checkpointing Performance
- Reducing the times for checkpointing, recovery, etc.
- Checkpoint latency hiding
- Checkpoint buffering: during checkpointing, copy the data to a local buffer, then store the buffer to disk in parallel with application progress
- Copy-on-write buffering: only modified pages are copied to a buffer and stored. Can be implemented using fork(): forked checkpointing (a sketch follows this list)
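A minimal sketch of forked checkpointing (the file name is an assumption; error handling and reaping the child are omitted):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>

    /* fork() gives the child a copy-on-write snapshot of the parent's
       memory: the parent keeps computing, and only pages it subsequently
       modifies are physically copied. */
    void forked_checkpoint(const void *state, size_t nbytes) {
        pid_t pid = fork();
        if (pid == 0) {                   /* child: write the frozen snapshot */
            FILE *f = fopen("app.ckpt", "wb");
            fwrite(state, 1, nbytes, f);
            fclose(f);
            _exit(0);
        }
        /* parent: returns immediately, overlapping I/O with computation;
           a real implementation would later reap the child with waitpid(). */
    }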
26. Checkpointing Performance
- Reducing checkpoint size
- Memory exclusion: no need to store dead and read-only variables
- Incremental checkpointing: store only the part of the data that has been modified since the previous checkpoint (a sketch follows this list)
- Checkpoint compression
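A minimal sketch of incremental checkpointing by block comparison, using a shadow copy in place of page-level dirty tracking (the names and the 4 KB block size are assumptions):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BLOCK 4096

    static char *shadow = NULL;           /* contents at the last checkpoint */

    /* Writes (offset, length, data) records only for blocks that changed
       since the previous call. The zero-filled initial shadow means restore
       must start from zero-initialized memory. */
    void incremental_checkpoint(const char *data, size_t nbytes, FILE *out) {
        if (!shadow)
            shadow = calloc(nbytes, 1);
        for (size_t off = 0; off < nbytes; off += BLOCK) {
            size_t len = (nbytes - off < BLOCK) ? nbytes - off : BLOCK;
            if (memcmp(data + off, shadow + off, len) != 0) {
                fwrite(&off, sizeof off, 1, out);
                fwrite(&len, sizeof len, 1, out);
                fwrite(data + off, 1, len, out);
                memcpy(shadow + off, data + off, len);
            }
        }
    }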
27. References
- James S. Plank. "An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance." University of Tennessee Technical Report CS-97-372, July 1997.
- James S. Plank and Michael G. Thomason. "Processor Allocation and Checkpointing Interval Selection in Cluster Computing Systems." JPDC, 2001.
28. References
- "MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes." George Bosilca, Aurélien Bouteiller, Franck Cappello, Samir Djilali, Gilles Fédak, Cécile Germain, Thomas Hérault, Pierre Lemarinier, Oleg Lodygensky, Frédéric Magniette, Vincent Néri, Anton Selikhov. SuperComputing 2002, Baltimore, USA, November 2002.
- "MPICH-V2: A Fault Tolerant MPI for Volatile Nodes Based on the Pessimistic Sender Based Message Logging." Aurélien Bouteiller, Franck Cappello, Thomas Hérault, Géraud Krawezik, Pierre Lemarinier, Frédéric Magniette. SuperComputing 2003, Phoenix, USA, November 2003.
29. References
- Vadhiyar, S. and Dongarra, J. "SRS: A Framework for Developing Malleable and Migratable Parallel Applications for Distributed Systems." Parallel Processing Letters, Vol. 13, No. 2, pp. 291-312, June 2003.
30. References
- Schulz et al. "Implementation and Evaluation of a Scalable Application-level Checkpoint-Recovery Scheme for MPI Programs." SC 2004.
31. Thank You