Title: An Early Experience on Job Checkpoint/Restart - Working with SGI Irix OS and the Portable Batch System (PBS)
1An Early Experience on Job Checkpoint/Restart - Working with SGI Irix OS and the Portable Batch System (PBS)
Sherry Chang
schang_at_nas.nasa.gov
Scientific Consultant, NASA Advanced Supercomputing Division
NASA Ames Research Center, Moffett Field, CA
2Helping Our Users
- Scientific Consultants at NAS help users run their jobs successfully, with little or no impact on other users
- This presentation originated from helping with a user's case
3Outline
- Examples: a Gaussian job and an MPI job
- Four methods for checkpoint/restart
- Future testing and wish list
4Why Do Checkpointing?
- Halt and restart resource-intensive codes that take a long time to run
- Prevent job loss if the system fails
- Improve a system's load balancing and scheduling
- Allow hardware replacement and maintenance
5How to Do Checkpoint/Restart ?
- User code has its own checkpoint capability
  Example: many CFD codes, certain Gaussian jobs
- OS has a built-in checkpoint/restart utility
  Example: the Cray Unicos OS - chkpnt and restart
           the SGI Irix OS - cpr, implemented in 6.n releases
- Batch systems (NQE, LSF, PBS) support checkpoint/restart
6Who Can Checkpoint and Restart ?
- owner of process(es)
- superuser
Rules
- Only the checkpoint owner or superuser is permitted to perform a checkpoint.
- If the processes have multiple owners, only the superuser is permitted to checkpoint them.
- Only the checkpoint owner or superuser can restart checkpointed process(es) from a statefile.
- If the superuser performed a checkpoint, only the superuser can restart it.
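The four rules above can be written as two small predicates. This is only an illustrative sketch: the function names are hypothetical, the superuser is assumed to be the account named root, and nothing here calls cpr itself.

```sh
# may_checkpoint REQUESTER OWNER...
# Succeeds if REQUESTER may checkpoint processes owned by the listed users.
# Assumption: the superuser is the account named "root".
may_checkpoint() {
    requester=$1
    shift
    if [ "$requester" = root ]; then
        return 0
    fi
    # A non-superuser must own every process in the set.
    if [ "$#" -eq 0 ]; then
        return 1
    fi
    for owner in "$@"; do
        if [ "$owner" != "$requester" ]; then
            return 1
        fi
    done
    return 0
}

# may_restart REQUESTER CHECKPOINTER
# Succeeds if REQUESTER may restart a statefile created by CHECKPOINTER.
# A checkpoint made by root matches only root, which covers the last rule.
may_restart() {
    [ "$1" = root ] || [ "$1" = "$2" ]
}
```

For example, `may_checkpoint schang schang other` fails because the process set has more than one owner, while `may_checkpoint root schang other` succeeds.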
7Sample Gaussian Job
Gaussian script o2.com:
    %nproc=2
    %chk=o2.chk
    #p CCSD/6-31g OPT

    O2 Geometry Optimization

    0 1
    O
    O 1 r

    r 1.500
8Sequence of calculations (links) of this Gaussian job
Link 1     initialization
Link 101   read title and molecule specification
Link 114   EF numerical optimization
Link 202   reorient coordinates, etc.
Link 301   generate basis set information
Link 302   calculate overlap, kinetic, and potential integrals
Link 303   calculate multipole integrals
Link 401   form initial MO guess
Link 502   iteratively solve SCF
Link 801   initialize transformation of 2-e integrals
Link 804   integral transformation
Link 913   calculate post-SCF energies and gradient terms
Link 601   population and related analysis
9Sample MPI Job
- Program pi.f calculates the value of π
    mpirun -np 3 ./pi > pi.out
- Use the -miser or -cpr option to allow checkpoint/restart
    mpirun -miser -np 3 ./pi > pi.out
10SGI's CPR Commands
The cpr command provides a command-line interface for:
- checkpoint
    cpr -c statefile -p id[:type],[id[:type]...] [-fgku]
- find information about an existing checkpoint statefile
    cpr -i statefile
- restart
    cpr [-j] -r statefile
- delete a checkpoint statefile
    cpr -D statefile
11Checkpoint: cpr -c statefile -p id[:type]
-p specifies the process or set of processes to checkpoint.
Processes may have any type in the following list:
    PID   for Unix process and POSIX pthread ID (default); use the 'ps' command to find the PID
    HID   for process hierarchy (tree) rooted at that PID
    GID   for Unix process group ID
    SID   for Unix process session ID; see termio(7)
    ASH   for IRIX Array Session ID; see array_services(5); use the 'array ps' command to find the ASH ID
    SGP   for IRIX sproc shared group; see sproc(2)
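A small helper can guard against mistyped id types before a checkpoint is requested. The sketch below only prints the command line it would use. The function name is hypothetical, and it assumes the id and type are joined with a colon (id:type), with PID as the default type as listed above.

```sh
# cpr_checkpoint_cmd STATEFILE ID [TYPE]
# Prints a cpr checkpoint command line; does not run cpr.
# Assumption: id and type are written as "id:type"; TYPE defaults to PID.
cpr_checkpoint_cmd() {
    statefile=$1
    id=$2
    type=${3:-PID}
    case $type in
        PID|HID|GID|SID|ASH|SGP) ;;
        *)
            echo "unknown id type: $type" >&2
            return 1
            ;;
    esac
    printf 'cpr -c %s -p %s:%s\n' "$statefile" "$id" "$type"
}

# Example: a process hierarchy rooted at PID 19432
cpr_checkpoint_cmd chk1 19432 HID
```

Printing instead of executing keeps the helper usable on machines without cpr; on an IRIX host the printed line could be reviewed and then run by hand.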
12Using CPR Interactively
    g98 o2.com o2.out &
    [1] 19432                        (parent process ID)
    ps
      PID TTY     TIME CMD
    19431 ttyq2   0:01 l101.exe      (child process)
    19432 ttyq2   0:00 g98           (parent process)
    19435 ttyq2   0:02 l101.exe      (child process)
- Do the first checkpoint
    cpr -c chk1 -p 19432:HID -k
    Checkpointing id 19432 (type HID) to directory chk1
    Checkpoint done
Caveat: Multiple-processor Gaussian jobs do not automatically clear their 'shared memory segments' when the job is checkpointed.
13A Caveat with Multiple-processor Gaussian Job
Multiple-processor Gaussian jobs do not automatically clear their 'shared memory segments' when the job is aborted.
- memory is not freed, which causes problems for other jobs
- cpr -r may fail if an older version of the OS is used
    ipcs -a
    IPC status from /dev/kmem as of Tue Nov 28 13:24:36 2000
    Shared Memory:
    T  ID   KEY         MODE         OWNER   GROUP  CREATOR  CGROUP  NATTCH  SEGSZ     CPID   LPID   ATIME     DTIME     CTIME
    m  0    0x53637445  --rw-r--r--  root    root   root     root    1       48124     845    845    17:04:57  no-entry  17:04:57
    m  73   0x0000116d  --rw-------  schang  g1179  schang   g1179   0       50331664  19340  19340  13:03:56  no-entry  13:03:56
    m  144  0x000072ea  --rw-------  schang  g1179  schang   g1179   0       50331664  19431  19431  13:20:51  13:21:09  13:17:35
    ipcrm -m 73
    ipcrm -m 144
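Picking the leftover segments out of the ipcs listing by eye is error-prone, so a filter along the following lines may help. It is a sketch that assumes the column order shown on the slide (T, ID, KEY, MODE, OWNER, ..., NATTCH as the ninth field) and that the mode is printed as a single field; the function name is hypothetical.

```sh
# list_orphan_shm USER
# Reads ipcs-style output on stdin and prints the IDs of shared memory
# segments owned by USER that no process is attached to (NATTCH = 0).
# Assumption: fields are T ID KEY MODE OWNER GROUP CREATOR CGROUP NATTCH ...
list_orphan_shm() {
    awk -v user="$1" '$1 == "m" && $5 == user && $9 == 0 { print $2 }'
}
```

A cautious way to use it is to print the removal commands first and inspect them, for example `ipcs -a | list_orphan_shm schang | while read id; do echo ipcrm -m "$id"; done`, and only then run them.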
14Using CPR Interactively - continued
    cpr -r chk1 &
    [1] 19458
    Restarting processes from directory chk1
    Process restarted successfully.
    [1] Done    cpr -r chk1
    ps
      PID TTY     TIME CMD
    19431 ttyq2   0:27 l114.exe
    19432 ttyq2   0:00 g98
15Failure - 1
- checkpoint stalled using cpr interactively for MPI jobs
    mpirun -miser -np 3 ./pi > pi.out &
    [1] 3527372
    cpr -c chk1 -p 3527372:HID -k &
    [2] 3539292                      (no progress at all)
Production systems: checkpoint stalled most of the time
Test-bed systems: successful
16Using CPR within PBS
- PBS script for first checkpoint
    #PBS -l ncpus=2
    #PBS -l mem=50mw
    #PBS -l walltime=1:00:00
    setenv g98root /usr/local/pkg
    source $g98root/g98/bsd/g98.login
    cd $PBS_O_WORKDIR
    g98 o2.com o2.out &
    sleep 20
    cpr -c chk1 -p `ps -u schang | grep g98 | awk '{print $1}'`:HID -k
  The ps pipeline finds the PID of the parent process.
- PBS script for subsequent cpr
    #PBS -l ncpus=2
    #PBS -l mem=50mw
    #PBS -l walltime=1:00:00
    setenv g98root /usr/local/pkg
    source $g98root/g98/bsd/g98.login
    cd $PBS_O_WORKDIR
    cpr -r chk1
    sleep 60
    cpr -c chk2 -p 3971074:HID -k
- Alternative: start the job and do the first checkpoint interactively
- Caveat: restart will fail if the PBS stdout/stderr files are not present
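The ps, grep, and awk pipeline in the first script selects any line that contains the string g98, which can return more than one PID if the user has several matching processes. A slightly more defensive variant, sketched below, matches the command column exactly and stops at the first hit. It assumes the ps layout shown earlier (a header line, PID in the first column, command name in the last); the function name is hypothetical.

```sh
# find_parent_pid CMD
# Reads ps-style output on stdin and prints the PID of the first process
# whose command name (last column) is exactly CMD.
# Assumption: line 1 is the header; PID is field 1; CMD is the last field.
find_parent_pid() {
    awk -v cmd="$1" 'NR > 1 && $NF == cmd { print $1; exit }'
}
```

In a job script this could be used as `pid=$(ps -u schang | find_parent_pid g98)`, followed by a check that `$pid` is non-empty before calling cpr.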
17A Caveat with PBS
If a job is started within PBS instead of interactively, the PBS standard output/error files associated with the first PBS script have to be present in /PBS/spool in order for subsequent cpr to succeed.
The PBS standard output and error files are copied over to PBS_O_WORKDIR and removed from /PBS/spool when the PBS job is completed. Subsequent cpr will fail if such files are not present in /PBS/spool.
    CPR Error: ckpt_fstat open /PBS/spool/131.t3.nas..ER (No such file or directory)
Solutions
- Simply use 'touch' to recreate the needed stdout and stderr files in /PBS/spool, even though they have no content.
    cd /PBS/spool
    touch 131.t3.nas..ER
- An alternative that avoids this trouble is to run your job and do the first checkpoint through an interactive session.
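The touch workaround can be wrapped so that it only creates files that are actually missing and reports what it did. This is a sketch with a hypothetical function name; the spool directory and file names are passed in by the caller, so nothing about the real PBS naming scheme is assumed.

```sh
# recreate_spool_files SPOOL_DIR FILE...
# Creates each named file in SPOOL_DIR as an empty placeholder if it is
# missing. Files that already exist are left untouched.
recreate_spool_files() {
    spool_dir=$1
    shift
    for name in "$@"; do
        if [ ! -e "$spool_dir/$name" ]; then
            touch "$spool_dir/$name" && echo "created $spool_dir/$name"
        fi
    done
}
```

The file names to pass are the ones cpr reports as missing in its error message. Writing into /PBS/spool may require appropriate permissions on the system in question.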
18Failure - 2
- restart failed using cpr in a PBS script
- both for MPI and Gaussian jobs
- checkpoint/restart successful for a few cycles, restart failed in a later cycle
    cpr -c chk1 -p xxxx:HID -k
    cpr -r chk1
    cpr -c chk2 -p xxxx:HID -k
    cpr -r chk2                      (cycles up to here: successful)
    ...
    cpr -c chkn -p xxxx:HID -k
    cpr -r chkn                      (failed)
Error Messages
    CPR Error: Failed to place mld 0 (Invalid argument)
    CPR Error: Unexpected status EOF
    CPR Error: Cleaning up the failed restart
19Failure - 3
- restart failed from a checkpoint state which was once successfully restarted
- both for MPI and Gaussian jobs
    cpr -c chk1 -p xxxx:HID -k
    cpr -r chk1
    cpr -c chk2 -p xxxx:HID -k
    cpr -r chk2                      (cycles up to here: successful)
    ...
    cpr -c chkn -p xxxx:HID -k
    cpr -r chkn                      (failed)
    cpr -r chk2                      (failed)
Error Message: same as previous page
20Using qhold and qrls of PBS
PBS script: o2.script
    qsub o2.script
    1121.evelyn.nas.nasa.gov
    qhold 1121
    ls -l /PBS/checkpoint
    drwxr-xr-x  3 root root 28 May 1 12:20 1121.evelyn.CK
    qstat -a
    Job ID        S
    1121.evelyn   H                  (held after qhold)
    qrls 1121
    Job ID        S
    1121.evelyn   R                  (running after qrls)
21Failure - 4 (hopper)
    Turing: qsub mpi.pbs
    8128.fermi.nas.nasa.gov
22Failure on hopper: qhold/qrls - continued
Error Message from /PBS/mom_logs
    pbs_mom;File exists (17) in create_cpuset, failed to create cpuset 8128.fer
    pbs_mom;Svr;pbs_mom;File exists (17) in mach_restart, Cannot assign cpuset to 8128.fermi.nas.nasa.gov
    pbs_mom;Svr;pbs_mom;Error 0 (0) in mom_restart_job, 8128.fermi.nas.nasa.gov task 1 failed from file /PBS/checkpoint/8128.fermi..CK/0000000001
    pbs_mom;Job;8128.fermi.nas.nasa.gov;Restart failed, error 0
    pbs_mom;Job;8128.fermi.nas.nasa.gov;kill_job
Ed Hook: What might cause the function 'createCpuset()' (from libcpuset.so) to error out, with 'errno' set to 'EEXIST' (17)?
Bron Nelson: Looking at the kernel sources, this error is only returned when the kernel believes that there is already an existing cpuset with the requested name.
23Failure - 5 (t3)
    t3: qsub mpi.pbs
    148.t3.nas.nasa.gov
33 sec
72 sec
Job ran for 40 seconds and then got killed
24Failure on t3: qhold/qrls - continued
Job failed due to PBS time-tracking error
Walltime used (t1 + t2) = time job completed - adjusted wall start time < allowed walltime ?
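The walltime relation can be checked with a little arithmetic. The sketch below assumes that t1 and t2 are the run times before the hold and after the release, so that the adjusted wall start is the original start shifted later by the time spent on hold. The function names and the timestamps (seconds on an arbitrary clock) are hypothetical and are not taken from the failing job.

```sh
# adjusted_wall_start START HOLD RELEASE
# Assumption: the start time is shifted later by the held interval.
adjusted_wall_start() {
    echo $(( $1 + ($3 - $2) ))
}

# walltime_used START HOLD RELEASE END
# Walltime used = time job completed - adjusted wall start time.
walltime_used() {
    echo $(( $4 - $(adjusted_wall_start "$1" "$2" "$3") ))
}

# Hypothetical job: starts at 1000, held at 1030 (t1 = 30),
# released at 1100 (held for 70), completes at 1150 (t2 = 50).
adjusted_wall_start 1000 1030 1100     # prints 1070
walltime_used 1000 1030 1100 1150      # prints 80, which is t1 + t2
```

If the adjustment is applied with the wrong sign or size, the computed walltime no longer equals t1 + t2, and the comparison against the allowed walltime can go wrong even for a short job.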
25Failure on t3: qhold/qrls - continued
Message from /PBS/bin/tracejob
[tracejob output annotated with the qsub, qhold, and qrls events]
Bug in pbs_mom
PBS Mom should have adjusted wall start 34 sec earlier than 11:19:01.
It mistakenly adjusted wall start 150 sec later than 11:19:01.
26Using qsub -c of PBS for Automatic Checkpointing Periodically
    qsub o2.script    or    qsub -c c=3 o2.script
    1120.evelyn.nas.nasa.gov
PBS script o2.script
    #PBS -l ncpus=2
    #PBS -l mem=50mw
    #PBS -l walltime=1:00:00
    #PBS -c c=3
    setenv g98root /usr/local/pkg
    source $g98root/g98/bsd/g98.login
    cd $PBS_O_WORKDIR
    g98 o2.com o2.out
If PBS mom or the system crashes, PBS should automatically restart a job that has a checkpoint directory associated with it after the system is back.
27Summary of cpr, qhold/qrls and qsub -c
cpr interactively
- for interactive jobs, prevent loss of valuable results if the system crashes
- typically by job owner
cpr within PBS
- job needs a very long walltime which exceeds the limit of any queue
- prevent loss of valuable results if the system crashes
- by job owner
qhold/qrls
qsub -c
- system crash: prevent loss of valuable results
- by job owner
- PBS should automatically restart the job that has a checkpoint directory
28Future Testing and Wish List
Future Testing
- A wide variety of user applications - OpenMP, PVM, MPI
- System-wide checkpoint/restart
Ultimate Goal
Make sure checkpoint/restart is reliable in a
real production environment
29IRIX/CPR - A Popular Topic
Recent email exchanges on this topic on sgi-tech_at_cug.org
Barry Sharp - Boeing
Paul White - CSC
Miroslaw Kupczyk - Poznan Supercomputing and
Network Center
Torgny Faxen - National Supercomputing Center,
Sweden
- MPI, OpenMP, Gaussian - no PVM yet
SGI Supportfolio bug reports provide limited information.
We would like to see more exchange on the details of the success and failure cases.
30Acknowledgement
- Ed Hook - SciCon, PBS expert
- Lorraine Freeman - sysadmin
- Bron Nelson - SGI on-site analyst
- Chuck Niggley - SciCon Group Lead
- NASA Advanced Supercomputing Division