An Early Experience on

Slides: 31

Provided by: sch118

1
An Early Experience on Job Checkpoint/Restart -
Working with SGI Irix OS and the Portable Batch
System (PBS)
Sherry Chang
schang_at_nas.nasa.gov
Scientific Consultant
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA
2
Helping Our Users

Scientific Consultants at NAS help users to run
their jobs
successfully with little or no impact on other
users

This presentation originated from helping with a
user's case

3
Outline

Why, How, and Who?

Examples: a Gaussian job and an MPI job

Introduction to SGI's cpr

Four Methods for checkpoint/restart

Successes and Failures

Future testing and wish list

4
Why Do Checkpointing?

Halt and restart resource-intensive codes that
take a long time to run

- Prevent job loss if system fails

Improve a system's load balancing and scheduling

Allow hardware replacement and maintenance

5
How to Do Checkpoint/Restart?

User code has its own checkpoint capability

Example: many CFD codes, certain Gaussian jobs

OS has built-in checkpoint/restart utility

Examples: the Cray Unicos OS - chkpnt and restart
the SGI Irix OS - cpr, implemented in 6.n releases

Batch systems (NQE, LSF, PBS) support checkpoint/restart
6
Who Can Checkpoint and Restart?

owner of process(es)

superuser

Rules

Only the checkpoint owner or superuser is permitted to perform a checkpoint.

If the processes have multiple owners, only the superuser is permitted to checkpoint them.

Only the checkpoint owner or superuser can restart checkpointed process(es) from a statefile.

If the superuser performed a checkpoint, only the superuser can restart it.
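
The rules above can be read as two small predicates. The following is only a sketch of that reading in POSIX sh: the function names are hypothetical, users are plain strings, and "root" stands in for the superuser. It does not query cpr or the kernel.

```shell
#!/bin/sh
# Sketch of the permission rules on this slide (names are hypothetical).

# may_checkpoint USER OWNER...
# USER may checkpoint if it is root, or if every listed process owner is USER
# (processes with multiple owners can only be checkpointed by root).
may_checkpoint() {
    user=$1; shift
    [ "$user" = root ] && return 0
    for owner in "$@"; do
        [ "$owner" = "$user" ] || return 1
    done
    [ $# -gt 0 ]
}

# may_restart USER CHECKPOINT_OWNER
# Only the checkpoint owner or root may restart; a checkpoint taken by root
# therefore can only be restarted by root.
may_restart() {
    [ "$1" = root ] || [ "$1" = "$2" ]
}
```

For example, `may_checkpoint schang schang other` fails because the processes have two owners, while `may_checkpoint root schang other` succeeds.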
7
Sample Gaussian Job
Gaussian Script: o2.com

%nproc=2
%chk=o2.chk
#p CCSD/6-31g OPT

O2 Geometry Optimization

0 1
O
O 1 r

r 1.500
8
Sequence of calculations (links) of this
Gaussian job

Link 1    initialization
Link 101  read title and molecule specification
Link 114  EF numerical optimization
Link 202  reorientate coordinates, etc.
Link 301  generate basis set info
Link 302  calculate overlap, kinetic, and potential integrals
Link 303  calculate multipole integrals
Link 401  form initial MO guess
Link 502  iteratively solve SCF
Link 801  initialize transformation of 2-e integrals
Link 804  integral transformation
Link 913  calculate post-SCF energies and gradient terms
Link 601  population and related analysis
9
Sample MPI Job

Program pi.f calculates the value of π

mpirun -np 3 ./pi > pi.out

Use the -miser or -cpr option to allow
checkpoint/restart

mpirun -miser -np 3 ./pi > pi.out

10
SGI's CPR Commands
The cpr command provides a command-line interface
for:

checkpoint
cpr -c statefile -p id[:type],[id[:type]...] [-fgku]

find information about an existing checkpoint statefile
cpr -i statefile

restart
cpr [-j] -r statefile

delete a checkpoint statefile
cpr -D statefile
11
Checkpoint: cpr -c statefile -p id[:type]
-p specifies the process or set of processes to
checkpoint
Processes may have any type in the following list:
PID  for Unix process and POSIX pthread ID (default); use the 'ps' command to find the PID
HID  for process hierarchy (tree) rooted at that PID
GID  for Unix process group ID
SID  for Unix process session ID; see termio(7)
ASH  for IRIX Array Session ID; see array_services(5); use the 'array ps' command to find the ASH ID
SGP  for IRIX sproc shared group; see sproc(2)
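
As a small illustration of the id:type argument, the sketch below builds the string passed to -p and rejects any type not in the list above. The helper name cpr_target is hypothetical; it only formats text and does not call cpr.

```shell
#!/bin/sh
# cpr_target ID [TYPE]: print an "id:type" argument for `cpr -p`.
# TYPE defaults to PID; only the six types listed on the slide are accepted.
cpr_target() {
    id=$1
    idtype=${2:-PID}
    case $idtype in
        PID|HID|GID|SID|ASH|SGP)
            printf '%s:%s\n' "$id" "$idtype" ;;
        *)
            echo "unknown id type: $idtype" >&2
            return 1 ;;
    esac
}
```

A possible use would be `cpr -c chk1 -p "$(cpr_target 19432 HID)" -k`, which checkpoints the process tree rooted at PID 19432.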
12
Using CPR Interactively

Start a Gaussian Job

g98 < o2.com > o2.out &
[1] 19432 (parent process ID)
ps
  PID TTY    TIME CMD
19431 ttyq2  0:01 l101.exe   (child process)
19432 ttyq2  0:00 g98        (parent process)
19435 ttyq2  0:02 l101.exe   (child process)

Do the first checkpoint

cpr -c chk1 -p 19432:HID -k
Checkpointing id 19432 (type HID) to directory chk1
Checkpoint done

Caveat: Multiple-processor Gaussian jobs do not
automatically clear their 'shared memory segments'
when the job is checkpointed.
13
A Caveat with Multiple-processor Gaussian Job
Multiple-processor Gaussian jobs do not automatically
clear their 'shared memory segments' when the job
is aborted.
- memory is not freed, which causes problems for other jobs
- cpr -r may fail if an older version of the OS is used

ipcs -a
IPC status from /dev/kmem as of Tue Nov 28 13:24:36 2000
Shared Memory:
T  ID   KEY         MODE         OWNER   GROUP  CREATOR  CGROUP  NATTCH  SEGSZ     CPID   LPID   ATIME     DTIME     CTIME
m  0    0x53637445  --rw-r--r--  root    root   root     root    1       48124     845    845    17:04:57  no-entry  17:04:57
m  73   0x0000116d  --rw-------  schang  g1179  schang   g1179   0       50331664  19340  19340  13:03:56  no-entry  13:03:56
m  144  0x000072ea  --rw-------  schang  g1179  schang   g1179   0       50331664  19431  19431  13:20:51  13:21:09  13:17:35

ipcrm -m 73
ipcrm -m 144
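
Picking out the leftover segments by eye gets tedious when there are many. The sketch below filters an `ipcs -a` listing for shared memory segments that belong to one user and have no attached processes (NATTCH is 0). It assumes the column order shown on this slide (ID in column 2, OWNER in column 5, NATTCH in column 9); other systems lay out `ipcs` output differently. The function name is hypothetical.

```shell
#!/bin/sh
# stale_shm_ids OWNER: read `ipcs -a` output on stdin and print the IDs of
# shared memory segments ("m" rows) owned by OWNER with NATTCH equal to 0.
# Column positions follow the IRIX listing on the slide (an assumption).
stale_shm_ids() {
    awk -v owner="$1" '$1 == "m" && $5 == owner && $9 == 0 { print $2 }'
}
```

One way to use it is `ipcs -a | stale_shm_ids schang`, reviewing the printed IDs before passing each one to `ipcrm -m`. Printing rather than deleting keeps the destructive step manual.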
14
Using CPR Interactively - continued

Restart

cpr -r chk1 &
[1] 19458
Restarting processes from directory chk1
Process restarted successfully.
[1] Done cpr -r chk1
ps
  PID TTY    TIME CMD
19431 ttyq2  0:27 l114.exe
19432 ttyq2  0:00 g98
15
Failure - 1

checkpoint stalled using cpr interactively

- for MPI jobs

mpirun -miser -np 3 ./pi > pi.out &
[1] 3527372
cpr -c chk1 -p 3527372:HID -k &
[2] 3539292 (no progress at all)

Production systems: checkpoint stalled most of the time
Test-bed systems: successful
16
Using CPR within PBS

PBS script for first checkpoint

#PBS -l ncpus=2
#PBS -l mem=50mw
#PBS -l walltime=1:00:00
setenv g98root /usr/local/pkg
source $g98root/g98/bsd/g98.login
cd $PBS_O_WORKDIR
g98 < o2.com > o2.out &
sleep 20
cpr -c chk1 -p `ps -u schang | grep g98 | awk '{print $1}'`:HID -k

PBS script for subsequent cpr

#PBS -l ncpus=2
#PBS -l mem=50mw
#PBS -l walltime=1:00:00
setenv g98root /usr/local/pkg
source $g98root/g98/bsd/g98.login
cd $PBS_O_WORKDIR
cpr -r chk1 &
sleep 60
cpr -c chk2 -p 3971074:HID -k

The ps pipeline in the first script finds the PID of the parent process

Alternative: start the job and do the first
checkpoint interactively

Caveat: Restart will fail if PBS
stdout/stderr not present
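
The PID lookup in the first script relies on `grep g98`, which matches any line containing that string and may return more than one PID. A slightly stricter variant is sketched below: it matches the command name exactly and stops at the first hit. It assumes the command name is the last column of the `ps` output, as in the listings earlier in this talk; the function name is hypothetical.

```shell
#!/bin/sh
# parent_pid NAME: read `ps` output on stdin and print the PID (column 1) of
# the first process whose command name (last column) is exactly NAME.
parent_pid() {
    awk -v name="$1" '$NF == name { print $1; exit }'
}
```

Inside a job script this might be used as `pid=$(ps -u schang | parent_pid g98)` followed by `cpr -c chk1 -p "$pid:HID" -k`, after checking that `$pid` is not empty.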

17
A Caveat with PBS
If a job is started within PBS instead of interactively, the PBS
standard output/error files associated with the first PBS script
have to be present in /PBS/spool in order for subsequent cpr
to succeed.

The PBS standard output and error files are copied over to PBS_O_WORKDIR
and removed from /PBS/spool when the PBS job is completed.

Subsequent cpr will fail if such files are not present in /PBS/spool

CPR Error: ckpt_fstat open /PBS/spool/131.t3.nas..ER (No such file or directory)

Solutions
- simply use 'touch' to recreate the needed stdout and stderr files
in /PBS/spool, even though they have no content.

cd /PBS/spool
touch 131.t3.nas..ER

- An alternative to avoid this trouble is to
run your job and do the first
checkpoint through an interactive session
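
The 'touch' workaround can be wrapped in a small helper that creates only the files that are missing and never overwrites existing output. This is a sketch: the function name is hypothetical, and the file names must be supplied by the caller because the real spool names depend on the job ID and host.

```shell
#!/bin/sh
# recreate_spool_files SPOOL_DIR FILE...: create each named file, empty, in
# SPOOL_DIR if it does not already exist. Existing files are left untouched.
recreate_spool_files() {
    spool=$1; shift
    for f in "$@"; do
        [ -e "$spool/$f" ] || touch "$spool/$f" || return 1
    done
}
```

The caller passes the spool directory followed by the exact stdout and stderr names that cpr reports as missing.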
18
Failure - 2

restart failed using cpr in PBS script

- both for MPI and Gaussian jobs
- checkpoint/restart successful for a few cycles,
restart failed in a later cycle

cpr -c chk1 -p xxxx:HID -k
cpr -r chk1
cpr -c chk2 -p xxxx:HID -k
cpr -r chk2                  (successful)
...
cpr -c chkn -p xxxx:HID -k
cpr -r chkn                  (failed)

Error Messages

CPR Error: Failed to place mld 0 (Invalid argument)
CPR Error: Unexpected status EOF
CPR Error: Cleaning up the failed restart
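
For reference, the cycle pattern tested here can be written out mechanically. The sketch below only prints the cpr commands for N checkpoint/restart cycles (a dry run); it does not execute them. The function name is hypothetical, and it assumes the same root PID is reused in every cycle, as the xxxx placeholder on the slide suggests.

```shell
#!/bin/sh
# print_cycles N ID: print the cpr commands for N checkpoint/restart cycles
# of the process hierarchy rooted at ID. Nothing is executed (dry run).
print_cycles() {
    i=1
    while [ "$i" -le "$1" ]; do
        echo "cpr -c chk$i -p $2:HID -k"
        echo "cpr -r chk$i"
        i=$((i + 1))
    done
}
```

In the PBS tests each checkpoint/restart pair ran in its own job script with a sleep in between, so the printed list is a summary of the sequence rather than a script to run as is.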
19
Failure - 3

restart failed from a checkpoint state which was
once successfully restarted

- both for MPI and Gaussian jobs

cpr -c chk1 -p xxxx:HID -k
cpr -r chk1
cpr -c chk2 -p xxxx:HID -k
cpr -r chk2                  (successful)
...
cpr -c chkn -p xxxx:HID -k
cpr -r chkn                  (failed)
cpr -r chk2                  (failed)

Error Message: same as previous page
20
Using qhold and qrls of PBS

PBS script: o2.script

qsub o2.script
1121.evelyn.nas.nasa.gov
qhold 1121
ls -l /PBS/checkpoint
drwxr-xr-x 3 root root 28 May 1 12:20 1121.evelyn.CK
qstat -a
Job ID        S
1121.evelyn   H    (held by qhold)
qrls 1121
Job ID        S
1121.evelyn   R    (released by qrls)
21
Failure - 4 (hopper)
Turing qsub mpi.pbs
8128.fermi.nas.nasa.gov
22
Failure on hopper qhold/qrls - continued
Error Message from /PBS/mom_logs

pbs_mom;File exists (17) in create_cpuset, failed to create cpuset 8128.fer
pbs_mom;Svr;pbs_mom;File exists (17) in mach_restart, Cannot assign cpuset to 8128.fermi.nas.nasa.gov
pbs_mom;Svr;pbs_mom;Error 0 (0) in mom_restart_job, 8128.fermi.nas.nasa.gov task 1 failed from file /PBS/checkpoint/8128.fermi..CK/0000000001
pbs_mom;Job;8128.fermi.nas.nasa.gov;Restart failed, error 0
pbs_mom;Job;8128.fermi.nas.nasa.gov;kill_job

Ed Hook: What might cause the function 'createCpuset()' (from libcpuset.so) to error out, with 'errno' set to 'EEXIST' (17)?

Bron Nelson: Looking at the kernel sources, this error is only returned when the kernel believes that there is already an existing cpuset with the requested name.
23
Failure - 5 (t3)
t3 qsub mpi.pbs
148.t3.nas.nasa.gov
33 sec
72 sec
Job ran for 40 seconds and then got killed
24
Failure on t3 qhold/qrls - continued
Job failed due to PBS time-tracking error

Walltime used (t1 + t2) = time job completed - adjusted wall start time < allowed walltime?
25
Failure on t3 qhold/qrls - continued
Message from /PBS/bin/tracejob
qsub
qhold
qrls
Bug in pbs_mom
PBS Mom should have adjusted wall start 34 sec
earlier than 11:19:01.
It mistakenly adjusted wall start 150 sec later
than 11:19:01.
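
The time bookkeeping behind this bug can be checked with a little arithmetic. The sketch below reads the slide's timestamp as the clock time 11:19:01 and treats the 34 seconds as run time to be credited before that reference time; that interpretation is an assumption, as are the function names. It works in seconds since midnight and does not handle jobs that cross midnight.

```shell
#!/bin/sh
# to_seconds HH:MM:SS: print seconds since midnight.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# to_clock SECONDS: print HH:MM:SS.
to_clock() {
    awk -v s="$1" 'BEGIN {
        printf "%02d:%02d:%02d\n", int(s / 3600), int(s % 3600 / 60), s % 60
    }'
}

# adjusted_start REFERENCE_TIME USED_SECONDS: wall start time moved back so
# that the seconds already used before the hold still count as walltime.
adjusted_start() {
    to_clock $(( $(to_seconds "$1") - $2 ))
}

# walltime_used COMPLETED ADJUSTED_START: seconds charged to the job.
walltime_used() {
    echo $(( $(to_seconds "$1") - $(to_seconds "$2") ))
}
```

With the slide's numbers, `adjusted_start 11:19:01 34` gives 11:18:27. For a hypothetical completion time of 11:20:00, `walltime_used 11:20:00 11:18:27` gives 93 seconds, which is what would then be compared against the allowed walltime.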
26
Using qsub -c of PBS for Automatic Periodic
Checkpointing
qsub o2.script   or   qsub -c c=3 o2.script
1120.evelyn.nas.nasa.gov

PBS script: o2.script

#PBS -l ncpus=2
#PBS -l mem=50mw
#PBS -l walltime=1:00:00
#PBS -c c=3
setenv g98root /usr/local/pkg
source $g98root/g98/bsd/g98.login
cd $PBS_O_WORKDIR
g98 < o2.com > o2.out

If PBS mom or the system crashes,
PBS should automatically restart a job that has a
checkpoint directory associated with it after
the system is back
27
Summary of cpr, qhold/qrls and qsub -c

cpr interactively
For interactive jobs; prevents loss of valuable results if the system crashes
- typically by job owner

cpr within PBS
Job needs a very long walltime which exceeds the limit of any queue;
prevents loss of valuable results if the system crashes
- by job owner

qhold/qrls

qsub -c
System crash: prevents loss of valuable results
- by job owner
PBS should automatically restart the job that has
a checkpoint directory
28
Future Testing and Wish List
Future Testing

A wide variety of user applications - OpenMP,
PVM, MPI

Large parallel jobs

System-wide checkpoint/restart

System crash simulation

Efficiency

Ultimate Goal
Make sure checkpoint/restart is reliable in a
real production environment
29
IRIX/CPR - A Popular Topic
Recent email-exchanges on this topic,
sgi-tech_at_cug.org
Barry Sharp - Boeing
Paul White - CSC
Miroslaw Kupczyk - Poznan Supercomputing and
Network Center
Torgny Faxen - National Supercomputing Center,
Sweden

Irix OS

NQE, LSF

MPI, OpenMP, Gaussian - no pvm yet

Irix vs Unicos

SGI Supportfolio bug reports provide limited
information

Would like to see more exchange on the details of
the success and failure cases
30
Acknowledgement

Ed Hook - SciCon, PBS expert

Lorraine Freeman - sysadmin

Bron Nelson - SGI on-site analyst

Chuck Niggley - SciCon Group Lead

NASA Advanced Supercomputing Division
