1
Parallel Checkpointing
  • Sathish Vadhiyar

2
Introduction
  • Checkpointing?
  • Storing an application's state so that it can be
    resumed later

3
Motivation
  • The largest parallel systems in the world have
    from 130,000 to 213,000 parallel processing
    elements (http://www.top500.org)
  • Large-scale applications that can use such large
    numbers of processes are continuously being built
  • These applications are also long-running
  • As the number of processing elements increases,
    the Mean Time Between Failures (MTBF) decreases
  • So, how can long-running applications execute in
    this highly failure-prone environment?
    Checkpointing and fault tolerance

4
Uses of Checkpointing
  • Fault tolerance: rollback recovery
  • Process migration and job swapping
  • Debugging

5
Types of Checkpointing
  • OS-level
    • Process preemption
    • e.g. Berkeley's checkpointing Linux
  • User-level, transparent
    • Checkpointing performed by the program itself
    • Transparency achieved by linking the
      application with a special library
    • e.g. Plank's libckpt
  • User-level, non-transparent
    • Users insert checkpointing calls in their
      programs
    • e.g. SRS (Vadhiyar)

6
Rules for Consistent Checkpointing
  • In a parallel program, each process has events
    and a local state
  • An event changes the local state of a process
  • Global state: an external view of the parallel
    application (e.g. the cut lines labeled S in the
    slide's figure), used for checkpointing and
    restarting
  • Consists of the local states and the messages in
    transit

7
Rules for Consistent Checkpointing
  • Types of global states
  • Consistent: a global state from which the
    program can be restarted correctly
  • Inconsistent: otherwise

8
Rules for Consistent Checkpointing
  • Chandy and Lamport's two rules for consistent
    global states
  • 1. If a receive event is part of the local state
    of a process, the corresponding send event must
    be part of the local state of the sender.
  • 2. If a send event is part of the local state of
    a process and the matching receive is not part of
    the local state of the receiver, then the message
    must be part of the state of the network.
  • The cut line S in the figure violates rule 1 and
    hence cannot lead to a consistent global state

9
Independent checkpointing
  • Processors checkpoint periodically without
    coordination
  • Can lead to the domino effect: each rollback of a
    processor to a previous checkpoint forces another
    processor to roll back even further

10
Checkpointing Methods
  • Coordinated checkpointing
    • All processes coordinate to take a consistent
      checkpoint (e.g. using a barrier)
    • Always leads to consistent checkpoints
  • Checkpointing with message logging
    • Processes take independent checkpoints, and a
      process logs the messages it has received since
      its last checkpoint
    • Recovery then uses the previous checkpoint plus
      the logged messages

11
  • Message Logging

12
Message Logging: MPICH-V, MPICH-V2 (George Bosilca
et al., SC 2002; Aurelien Bouteiller et al., SC
2003)
  • MPICH-V: uncoordinated, pessimistic,
    receiver-based message logging
  • Architecture consists of computational nodes, a
    dispatcher, channel memories, and checkpoint
    servers
  • MPICH-V2: uncoordinated, pessimistic,
    sender-based message logging
  • Uses message ids and logical clocks
  • Senders log message payloads and resend messages

13
MPICH-V: receiver-based
  • Each process has an associated channel memory
    (CM) called its home CM
  • When P1 wants to send a message to P2, it
    contacts P2's home CM and sends the message to it
  • P2 retrieves the message from its home CM
  • During a checkpoint, the process state is stored
    on a CS (checkpoint server)
  • During restart, the checkpoint is retrieved from
    the CS and the messages from the CM
  • A dispatcher manages all services

14
MPICH-V
15
MPICH-V2: sender-based
  • Sender-based logging avoids the channel memories
  • When q sends m to p, q logs m
  • m is associated with an id
  • When p receives m, it stores (id, l), where l is
    its logical clock
  • When p crashes and restarts from a checkpointed
    state, it retrieves the (id, l) sets from storage
    and asks the other processes to resend the
    corresponding messages

16
  • Coordinated Checkpointing

17
Coordinated Checkpointing: SRS
  • The user inserts checkpointing calls specifying
    the data to be checkpointed
  • Advantage: a small amount of checkpointed data
  • Disadvantage: user burden
  • SRS (Stop-ReStart): a user-level checkpointing
    library that allows reconfiguration of
    applications
  • Reconfiguration of the number of processors
    and/or the data distribution

18
INTERNALS
[Diagram: an MPI application linked with the SRS
library polls the Runtime Support System (RSS); on
a STOP request it checkpoints and stops, and on
ReStart it reads the checkpointed data, with
possible redistribution, before resuming.]
19
SRS Example: Original Code

    MPI_Init(&argc, &argv);
    local_size = global_size / size;
    if (rank == 0) {
        for (i = 0; i < global_size; i++)
            global_A[i] = i;
    }
    MPI_Scatter(global_A, local_size, MPI_INT, local_A,
                local_size, MPI_INT, 0, comm);
    iter_start = 0;
    for (i = iter_start; i < global_size; i++) {
        proc_number = i / local_size;
        local_index = i % local_size;
        if (rank == proc_number)
            local_A[local_index] = 10;
    }
    MPI_Finalize();
20
SRS Example: Modified Code

    MPI_Init(&argc, &argv);
    SRS_Init();
    local_size = global_size / size;
    restart_value = SRS_Restart_Value();
    if (restart_value == 0) {
        if (rank == 0) {
            for (i = 0; i < global_size; i++)
                global_A[i] = i;
        }
        MPI_Scatter(global_A, local_size, MPI_INT, local_A,
                    local_size, MPI_INT, 0, comm);
        iter_start = 0;
    } else {
        SRS_Read("A", local_A, BLOCK, NULL);
        SRS_Read("iterator", &iter_start, SAME, NULL);
    }
    SRS_Register("A", local_A, GRADS_INT, local_size, BLOCK, NULL);
    SRS_Register("iterator", &i, GRADS_INT, 1, 0, NULL);
21
SRS Example: Modified Code (Contd.)

    for (i = iter_start; i < global_size; i++) {
        stop_value = SRS_Check_Stop();
        if (stop_value == 1) {
            MPI_Finalize();
            exit(0);
        }
        proc_number = i / local_size;
        local_index = i % local_size;
        if (rank == proc_number)
            local_A[local_index] = 10;
    }
    SRS_Finish();
    MPI_Finalize();
22
Challenges: a precompiler for SRS
  • A precompiler that reads plain source code and
    converts it into SRS-instrumented source code
  • Needs to automatically analyze which variables
    must be checkpointed
  • Needs to automatically determine data
    distributions
  • Should locate the phases between which
    checkpoints should be taken, i.e. automatically
    determine the phases of an application

From Schulz et al., SC 2004
23
Challenges: determining checkpointing intervals
  • How frequently should checkpoints be taken?
  • Too frequent: high checkpointing overhead
  • Too sporadic: high recovery cost
  • Can model the application with a Markov model
    consisting of running, recovery, and failure
    states
  • Failure distributions determine the weights of
    the transitions between states

24
Challenges: determining checkpointing intervals
25
Checkpointing Performance
  • Reducing the times for checkpointing, recovery,
    etc.
  • Checkpoint latency hiding
  • Checkpoint buffering: during checkpointing, copy
    the data to a local buffer, then store the buffer
    to disk in parallel with application progress
  • Copy-on-write buffering: only modified pages are
    copied to a buffer and stored; can be implemented
    using fork() ("forked checkpointing")

26
Checkpointing Performance
  • Reducing checkpoint size
  • Memory exclusion: no need to store dead and
    read-only variables
  • Incremental checkpointing: store only the data
    that has been modified since the previous
    checkpoint
  • Checkpoint compression

27
References
  • James S. Plank. "An Overview of Checkpointing in
    Uniprocessor and Distributed Systems, Focusing on
    Implementation and Performance". University of
    Tennessee Technical Report CS-97-372, July 1997.
  • James S. Plank and Thomason. "Processor
    Allocation and Checkpointing Interval Selection
    in Cluster Computing Systems". JPDC, 2001.

28
References
  • "MPICH-V: Toward a Scalable Fault Tolerant MPI
    for Volatile Nodes". George Bosilca, Aurélien
    Bouteiller, Franck Cappello, Samir Djilali,
    Gilles Fédak, Cécile Germain, Thomas Hérault,
    Pierre Lemarinier, Oleg Lodygensky, Frédéric
    Magniette, Vincent Néri, Anton Selikhov.
    SuperComputing 2002, Baltimore, USA, November
    2002.
  • "MPICH-V2: a Fault Tolerant MPI for Volatile
    Nodes Based on Pessimistic Sender-Based Message
    Logging". Aurélien Bouteiller, Franck Cappello,
    Thomas Hérault, Géraud Krawezik, Pierre
    Lemarinier, Frédéric Magniette. SuperComputing
    2003, Phoenix, USA, November 2003.

29
References
  • S. Vadhiyar and J. Dongarra. "SRS - A Framework
    for Developing Malleable and Migratable Parallel
    Applications for Distributed Systems". Parallel
    Processing Letters, Vol. 13, No. 2, pp. 291-312,
    June 2003.

30
References
  • Schulz et al. Implementation and Evaluation of a
    Scalable Application-level Checkpoint-Recovery
    Scheme for MPI Programs. SC 2004.

31
Thank You