Title: Towards a Portable Fault Tolerant Component Framework for MPI
1. Towards a Portable Fault Tolerant Component Framework for MPI
- Hideyuki Jitsumoto and Satoshi Matsuoka
- <jitsumo0_at_is.titech.ac.jp>
- Dept. of Mathematical and Computing Sciences, Tokyo Institute of Technology
2. The MPI we want
- An ideal MPI
  - Does not stop on a fault
    - Scientific calculations take a long time
    - Clusters and Grids have low reliability
  - Can be used easily
  - Can be used everywhere
    - There are various OS/HW combinations (Linux, Windows / GbE, Myrinet, InfiniBand)
3. Related work (1/3)
- LAM/MPI (Burns et al. 94) / OpenMPI (Gabriel et al. 03)
  - Modularized
    - HW-dependent code is replaceable
    - But the fault/recovery model gets little consideration (the modules mainly cover the method of communication)
  - Weak recovery protocol
    - All failures are recovered by checkpointing/restart
    - Restart is not automatic (there is no daemon or monitor)
- FT-MPI (Fagg et al. 99)
  - User code is modified for FT
    - Users can implement their own recovery model
    - But the coding cost for the user is high
4. Related work (2/3): conclusion
- Current fault tolerant MPI implementations
  - Cannot be used easily
    - Users must implement an FT method adapted to each execution environment
  - Cannot be used everywhere
    - Only a single recovery method is available
5. Related work (3/3): comparison
- Interoperable FT technique: has no HW-dependent FT code, or that code can be replaced
- Transparent: MPI applications run unchanged
- Various Fault/Recovery Models: deals with the various faults of each user environment
- Note 1: MPICH-V can be adapted to various environments by using V1, V2, V/CL and V3 appropriately
We need the extensibility and facility of LAM/MPI and the flexibility of FT-MPI.
6. Our Goal
- Cuckoo MPI System: a fault/recovery-model-aware, component-based fault tolerant MPI
  - Portable
    - components for adapting to different underlying computing environments
  - Flexible
    - components for handling different fault and recovery models
  - Transparent
    - transparent to user code; components handle the different execution phases and the appropriate recovery
7. Recovery Protocol (1/2): Protocols
- Cuckoo MPI uses the following recovery protocols (RP)
  - IGNORE: ignore the fault
  - Checkpointing/Restart, Migration
    - RESTART: restart on the same node
    - MIGRATION: migrate to a different node
  - Process Replication
    - TRANSFER: promote a replica process to primary
8. Recovery Protocol (2/2)
- Recovery cost and the set of recoverable faults differ for each protocol
The recovery protocol must be selected according to the kind of fault (the fault model) so that overhead is reduced.
9. Fault Model on MPI process
- Physical Fault
  - A fault occurred in the HW
  - Recoverable by MIGRATION/TRANSFER
- Network Fault
  - The last redundant path was cut, or performance degradation was large
  - Recoverable by IGNORE/MIGRATION/TRANSFER
- Process Fault
  - Process ABEND
  - Recoverable by RESTART/TRANSFER
- Etc.
Cuckoo MPI has modules that select the appropriate RP for each fault.
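The fault-to-protocol mapping listed above can be written down as a small lookup table. The following is a minimal sketch, not Cuckoo MPI code: the type names, the bit-flag encoding, and the helpers `recoverable_by` and `can_recover` are all assumptions made for illustration; only the fault classes and protocol names come from the slides.

```c
#include <assert.h>
#include <stdbool.h>

/* Recovery protocols named on the slides, encoded as bit flags
 * (the encoding itself is an assumption of this sketch). */
enum recovery_protocol {
    RP_IGNORE    = 1 << 0,
    RP_RESTART   = 1 << 1,
    RP_MIGRATION = 1 << 2,
    RP_TRANSFER  = 1 << 3
};

/* Fault classes from the fault model slide. */
enum fault_kind {
    FAULT_PHYSICAL,
    FAULT_NETWORK,
    FAULT_PROCESS,
    FAULT_KIND_COUNT
};

/* Hypothetical helper: the set of protocols able to recover a fault. */
unsigned recoverable_by(enum fault_kind kind)
{
    static const unsigned table[FAULT_KIND_COUNT] = {
        [FAULT_PHYSICAL] = RP_MIGRATION | RP_TRANSFER,
        [FAULT_NETWORK]  = RP_IGNORE | RP_MIGRATION | RP_TRANSFER,
        [FAULT_PROCESS]  = RP_RESTART | RP_TRANSFER
    };
    assert((int)kind >= 0 && (int)kind < FAULT_KIND_COUNT);
    return table[kind];
}

/* Hypothetical helper: can protocol `rp` recover a fault of this kind? */
bool can_recover(enum fault_kind kind, enum recovery_protocol rp)
{
    return (recoverable_by(kind) & (unsigned)rp) != 0;
}
```

A selection module could first narrow the candidates with `recoverable_by` and then pick the cheapest remaining protocol; the cost ordering is not given on the slides, so it is left out here.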
10. Components
[Figure: component diagram. Labels: Parallel FT Protocol, FT Daemon, Monitoring Tools Interface, RP, Fault Detector; Special Network, Replicator, Checkpointer; Process FD, Physical FD, Network FD; Restart, Ignore, Migration]
- components for supporting parallel FT algorithms
- components for adapting to different HW
- components for defining the fault model
- components to recover a process
11. What flexibility do the components give?
[Figure: component diagram. Labels: Parallel FT Protocol, FT Daemon, Monitoring Tools Interface, RP, Fault Detector]
- A method for dealing effectively with communication frequency, bandwidth, number of processes, and so on
- The fault models that the system can deal with
- The method of recovery (e.g., how the migration node is selected)
12. Recovery Model (1/3)
[Figure: Monitor, FT Daemon, and Process, with the recovery protocols RESTART, MIGRATION, TRANSFER, IGNORE]
1. Invoke the checking function periodically
2. Acquire node status
3. Fault detection
4. Select the appropriate recovery protocol
5. Recovery
13. Recovery Model (2/3)
Scenario: the process goes ABEND while the node is alive.
- If (process != alive, node == alive), select the RESTART protocol
- If the fault has occurred many times (e.g., twice), select the MIGRATION protocol
Even when a process fault occurs, the node is still alive, so RESTART is preferred. But if the process fault occurs many times, it may actually be a physical fault, so MIGRATION is preferred.
[Figure: Monitor, FT Daemon, and Process. The FT Daemon selects Restart first, then Migration once the fault has occurred twice. Protocols shown: RESTART, MIGRATION, TRANSFER, IGNORE]
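The escalation rule on this slide (RESTART first, MIGRATION after repeated process faults) can be sketched as a small selection function. This is an illustrative sketch only: `struct proc_state`, `select_protocol`, and `MIGRATION_THRESHOLD` are hypothetical names, the threshold of two follows the "e.g., twice" on the slide, and choosing MIGRATION when the node itself is down is an assumption taken from the physical-fault entry of the fault model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-process record kept by the FT daemon. */
struct proc_state {
    bool process_alive;
    bool node_alive;
    int  fault_count;   /* process faults seen so far on this node */
};

/* "Many times" is taken as two, following the slide's example. */
#define MIGRATION_THRESHOLD 2

enum protocol { PROTO_NONE, PROTO_RESTART, PROTO_MIGRATION };

/* Record the observed state and pick a recovery protocol. */
enum protocol select_protocol(struct proc_state *st)
{
    assert(st != NULL);
    if (st->process_alive)
        return PROTO_NONE;        /* nothing to recover */
    if (!st->node_alive)
        return PROTO_MIGRATION;   /* assumption: a dead node cannot host a restart */
    st->fault_count++;
    if (st->fault_count >= MIGRATION_THRESHOLD)
        return PROTO_MIGRATION;   /* repeated faults may indicate a physical fault */
    return PROTO_RESTART;         /* first fault on a live node */
}
```

Keeping the counter per process and per node means a migrated process starts again from zero on its new node; whether Cuckoo MPI does this is not stated on the slides.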
14. Recovery Model (3/3)
Scenario: there is no free node when a process must migrate.
- Make cold-swap nodes hot and assign processes to them?
- Suspend the MPI process?
- Assign processes to a busy node?
[Figure: Monitor, FT Daemon, and Process, with the recovery protocols RESTART, MIGRATION, TRANSFER, IGNORE]
15. Impl. - MPI Process (1/3)
- In p4mpd, all MPI communication uses only p4_sendx, p4_recv, and p4_message_available
  - wrap these functions (e.g., for logging and message draining)
- The process handles messages from mpdman in a signal handler
  - add a function to the handler that parses extra operations (for FT)
[Figure: layer diagram of the MPI process. Labels: Application, MPI, ADI, MPICH, p4mpd, CH interface, Cuckoo IF, Cuckoo Component, FT Protocol, Checkpointer, Cuckoo]
16. Impl. - MPD/mpdman (2/3)
- MPD/mpdman handle messages from lhs/rhs
  - Add a function to parse extra operations sent
    - from lhs mpdman to mpdman
    - from rhs mpdman to mpdman
    - from MPD to mpdman
    - from lhs MPD to MPD
    - from rhs MPD to MPD
    - from mpdman to MPD
  - Add a function that reconstructs the mpdman ring
[Figure: layer diagram of MPD/mpdman. Labels: MPD/mpdman, CH interface, Cuckoo IF, Cuckoo Component, Parallel FT Protocol, Monitoring Tools, Fault Detector, Recovery Protocol]
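Parsing an extra operation out of a daemon message can be sketched as follows. The sketch assumes messages are space-separated key=value pairs with a "cmd" key; `msg_getval`, `is_ft_extra_operation`, the `FT_CMD_PREFIX` value, and the example command name `addin_ckpt` are hypothetical and are not taken from MPD or Cuckoo MPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the value of `key` from a message made of
 * space-separated key=value pairs into `out`. Returns true on success. */
bool msg_getval(const char *msg, const char *key, char *out, size_t outlen)
{
    assert(msg != NULL && key != NULL && out != NULL && outlen > 0);
    size_t keylen = strlen(key);
    const char *p = msg;
    while (*p != '\0') {
        while (*p == ' ')
            p++;                                  /* skip separators */
        const char *end = strchr(p, ' ');
        size_t toklen = end ? (size_t)(end - p) : strlen(p);
        if (toklen > keylen && strncmp(p, key, keylen) == 0 && p[keylen] == '=') {
            size_t vallen = toklen - keylen - 1;
            if (vallen >= outlen)
                return false;                     /* value does not fit */
            memcpy(out, p + keylen + 1, vallen);
            out[vallen] = '\0';
            return true;
        }
        p += toklen;                              /* move to the next token */
    }
    return false;
}

/* Assumed convention: commands with this prefix are FT extra operations. */
#define FT_CMD_PREFIX "addin_"

/* Decide whether a message should be handed to the FT component. */
bool is_ft_extra_operation(const char *msg)
{
    char cmd[64];
    if (!msg_getval(msg, "cmd", cmd, sizeof cmd))
        return false;
    return strncmp(cmd, FT_CMD_PREFIX, strlen(FT_CMD_PREFIX)) == 0;
}
```

A handler would call `is_ft_extra_operation` before its normal command dispatch and forward matching messages to the FT component, leaving every other message on the existing path.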
17. Impl. (3/3)
Layers shown: Cuckoo Interface, p4mpd, FT / Parallel FT Protocol, CH Interface
(Argument lists are omitted, as on the slide.)

Communication

    int p_p4_sendx() {
        if (ck_rbif->send != NULL)
            return ck_rbif->send();
        else
            return p4_sendx();
    }

    int init() {
        ck_rbif->send = dlsym(..., "RB_send");  /* library handle omitted on the slide */
    }

    int RB_send() {
        int res;
        /* FT code, e.g. logging */
        res = ck_ch_send();
        /* FT code */
        return res;
    }

    int ck_ch_send() {
        return p4_sendx();
    }

Extra operation parsing

    void Mpd_man_msg_handler() {
        if (strncmp(buf, ...) ...) {
            ...  /* existing command handling, elided on the slide */
        } else if ((strcmp(buf, "cmd=addin_") == 0) &&
                   (ck_rbif->eman != NULL)) {
            ck_rbif->eman();
        }
    }

    int RB_eman() {
        cmd = ck_ch_getval("cmd");
        if (strcmp(cmd, "rdy") == 0)
            ...
    }

    int init() {
        ck_rbif->eman = dlsym(..., "RB_eman");  /* library handle omitted on the slide */
    }

    int ck_ch_getval() {
        return mpd_getval_r();
    }
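The wrapping pattern on this slide (call the component's hook if one is installed, otherwise fall through to the original function) can be shown in a self-contained form. This is a sketch under assumptions: `base_send` is a stub standing in for p4_sendx, whose real arguments the slides do not show, and the hook is installed by a direct call rather than resolved with dlsym so that the example needs no shared library. All names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stub standing in for the underlying send call (illustrative only). */
int base_send(const char *buf, int len)
{
    (void)buf;
    return len;                      /* pretend every byte was sent */
}

/* Table of optional hooks supplied by an FT component. */
struct ft_hooks {
    int (*send)(const char *buf, int len);
};

struct ft_hooks hooks = { NULL };
int logged_messages = 0;

/* Channel-side entry point that FT code uses to reach the real send. */
int ch_send(const char *buf, int len)
{
    return base_send(buf, len);
}

/* Example FT component: count the message, then forward it. */
int logging_send(const char *buf, int len)
{
    assert(buf != NULL && len >= 0);
    logged_messages++;               /* FT code before the send, e.g. logging */
    return ch_send(buf, len);
}

/* Wrapper called in place of the original send. */
int wrapped_send(const char *buf, int len)
{
    if (hooks.send != NULL)
        return hooks.send(buf, len); /* component installed: delegate to it */
    return base_send(buf, len);      /* no component: original behavior */
}

/* Install (or, with NULL, remove) the send hook. */
void install_send_hook(int (*fn)(const char *buf, int len))
{
    hooks.send = fn;
}
```

Because the FT component reaches the real send only through `ch_send`, it never depends on the underlying device directly, which is the property the CH Interface layer on the slide appears to provide.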
18. Evaluation (1/3): Facility
- FT-MPI
  - The user must implement fault tolerance
  - The application program is modified
[Figure: the User and Cluster Administrator implements fault tolerance in the Application to obtain the FT Application]
19. Evaluation (2/3): Facility
- Cuckoo FTMPI
  - The user only selects the components
  - The application program is not modified
[Figure: a Component publisher (IT Specialist) implements Parallel FT Protocol, Fault Detector, and Recovery Protocol components. The User and Cluster Administrator combines the Application with the selected FT Components to obtain the FT Application ("Easy!")]
The load on the user is reduced while flexibility is kept.
20. Evaluation (3/3): Performance
- Implementation is in progress.
- I will report results as soon as implementation and evaluation are finished.
21. Future Work
- Apply to MPICH-2
- Interaction with other software
  - RI2N (Boku_at_Tsukuba Univ.)
  - Speculative CKPT (Yamagata_at_Tokyo Institute of Tech.)
  - Fault Injector (Maruyama_at_Tokyo Institute of Tech.)
  - An overview is available at http://www.para.tutics.tut.ac.jp/megascale/research.html
- Dynamic FT processing (e.g., changing the checkpointing cycle)