Title: Batch Systems
1. Batch Systems
- In many scientific computing environments, multiple users must share a compute resource:
  - research clusters
  - supercomputing centers
- On multi-user HPC clusters, the batch system is a key component for aggregating compute nodes into a single, sharable computing resource
- The batch system becomes the nerve center for coordinating the use of resources and controlling the state of the system in a way that is fair to its users
- As current and future expert users of large-scale compute resources, you need to be familiar with the basics of a batch system
2. Batch Systems
- The core functionality of all batch systems is essentially the same, regardless of the size or specific configuration of the compute hardware
- Multiple Job Queues
  - queues provide an orderly environment for managing a large number of jobs
  - queues are defined with a variety of limits for maximum run times, memory usage, and processor counts; they are often assigned different priority levels as well
  - may be interactive or non-interactive
- Job Control
  - submission of individual jobs to do some work (e.g., serial or parallel HPC applications)
  - simple monitoring and manipulation of individual jobs, and collection of resource usage statistics (e.g., memory usage, CPU usage, and elapsed wall-clock time per job)
- Job Scheduling
  - policy which decides priority between individual user jobs
  - allocates resources to scheduled jobs
3. Batch Systems
- Job Scheduling Policies
  - the scheduler must decide how to prioritize all the jobs on the system and allocate the necessary resources for each job (processors, memory, file systems, etc.)
  - the scheduling process can be easy or non-trivial depending on the size of the system and the desired functionality
  - first in, first out (FIFO) scheduling: jobs are simply scheduled in the order in which they are submitted
  - political scheduling: some users are given more priority than others
  - fairshare scheduling: the scheduler ensures users have equal access over time
- Additional features may also impact scheduling order
  - advanced reservations: resources can be reserved in advance for a particular user or job
  - backfill: can be combined with any of the scheduling paradigms to allow smaller jobs to run while waiting for enough resources to become available for larger jobs
    - backfill of smaller jobs helps maximize overall resource utilization
    - backfill can be your friend for short-duration jobs
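The interaction between FIFO ordering and backfill can be sketched in a few lines. This is a toy model, not a real scheduler: the job names and the 8-processor machine are hypothetical, and it performs "aggressive" backfill without the reservation bookkeeping (e.g., EASY backfill's guarantee not to delay the blocked job) that production schedulers add.

```python
# Minimal sketch of FIFO scheduling with a simple backfill pass.
# Jobs are (name, processors-requested) pairs in submission order.

def schedule_step(queue, free_procs):
    """Pick jobs to start now: FIFO first, then backfill smaller jobs."""
    started = []
    blocked = None
    for job in list(queue):
        name, procs = job
        if blocked is None and procs <= free_procs:
            # FIFO: start jobs in submission order while they fit
            started.append(name)
            free_procs -= procs
            queue.remove(job)
        elif blocked is None:
            # The first job that does not fit blocks the FIFO scan...
            blocked = job
        elif procs <= free_procs:
            # ...but smaller jobs behind it may be backfilled
            started.append(name)
            free_procs -= procs
            queue.remove(job)
    return started, free_procs

queue = [("A", 4), ("B", 6), ("C", 2)]   # (job, processors requested)
started, free = schedule_step(queue, free_procs=8)
print(started)  # ['A', 'C']: A starts, B (6 procs) blocks, C backfills
```

Without the backfill branch, C would sit idle behind B even though 4 processors are free; with it, utilization rises from 4/8 to 6/8 while B waits.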
4. Batch Systems
- Common batch systems you may encounter in scientific computing
  - Platform LSF
  - PBS
  - LoadLeveler (IBM)
  - SGE
- All have similar functionality but different syntax
- Reasonably straightforward to convert your job scripts from one system to another
- All of the above include specific batch system directives which can be placed in a shell script to request certain resources (processors, queues, etc.)
- We will focus on LSF primarily since it is the system running on Lonestar
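To make the "straightforward to convert" claim concrete, here is a sketch of translating a few LSF directives into their PBS counterparts. The mapping covers only an illustrative subset of flags (queue, stdout/stderr, job name); resource requests like processor counts and wall-clock limits use different syntaxes and need real per-flag logic.

```python
# Hedged sketch: translate a few common '#BSUB' directive lines to '#PBS'.
# Illustrative subset only; not a complete converter.
DIRECTIVE_MAP = {
    "-q": "-q",   # queue name
    "-o": "-o",   # stdout file
    "-e": "-e",   # stderr file
    "-J": "-N",   # job name (LSF -J, PBS -N)
}

def lsf_to_pbs(line):
    """Convert one '#BSUB <flag> <value>' line to a '#PBS' line."""
    parts = line.split()
    if len(parts) >= 2 and parts[0] == "#BSUB" and parts[1] in DIRECTIVE_MAP:
        return " ".join(["#PBS", DIRECTIVE_MAP[parts[1]]] + parts[2:])
    return line  # leave shell commands and unmapped directives untouched

print(lsf_to_pbs("#BSUB -J hello"))   # #PBS -N hello
print(lsf_to_pbs("#BSUB -q normal"))  # #PBS -q normal
```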
5Batch Submission Process
internet
Compute Nodes
Server
Head
Submission bsub lt job
C1
C2
C3
C4
Queue Job Script waits for resources on
Server Master Compute Node that executes
the job script, launches ALL MPI processes
Launch ssh to each compute node to start
executable (e.g. a.out)
ibrun ./a.out
mpirun np ./a.out
6. LSF Batch System
- Lonestar uses Platform LSF for both the batch queuing system and the scheduling mechanism (provides similar functionality to PBS, but requires different commands for job submission and monitoring)
- LSF includes global fairshare, a mechanism for ensuring no one user monopolizes the computing resources
- Batch jobs are submitted on the front end and are subsequently executed on compute nodes as resources become available
- Order of job execution depends on a variety of parameters
  - Submission Time
  - Queue Priority: some queues have higher priorities than others
  - Backfill Opportunities: small jobs may be backfilled while waiting for bigger jobs to complete
  - Fairshare Priority: users who have recently used a lot of compute resources will have a lower priority than those who are submitting new jobs
  - Advanced Reservations: jobs may be blocked in order to accommodate advanced reservations (for example, during maintenance windows)
  - Number of Actively Scheduled Jobs: there are limits on the maximum number of concurrent processors used by each user
7. Lonestar Queue Definitions
8. Lonestar Queue Definitions
- Additional Queue Limits
  - In the normal and high queues, a maximum of 512 processors can be used at one time. Jobs requiring more processors are deferred for possible scheduling until running jobs complete. For example, a single user can have the following job combinations eligible for scheduling:
    - 2 jobs requiring 256 procs each
    - 4 jobs requiring 128 procs each
    - 8 jobs requiring 64 procs each
    - 16 jobs requiring 32 procs each
  - A maximum of 25 queued jobs per user is allowed at one time
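A quick sanity check of the arithmetic behind the per-user cap: each combination listed above saturates the 512-processor limit exactly.

```python
# Per-user concurrency cap in the normal and high queues (from the slide).
MAX_PROCS = 512

def eligible(jobs, procs_each, cap=MAX_PROCS):
    """True if `jobs` concurrent jobs of `procs_each` processors fit the cap."""
    return jobs * procs_each <= cap

# Each listed combination uses exactly jobs * procs = 512 processors:
for jobs, procs in [(2, 256), (4, 128), (8, 64), (16, 32)]:
    assert jobs * procs == MAX_PROCS and eligible(jobs, procs)

print(eligible(3, 256))  # False: 768 processors would exceed the cap
```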
9. LSF Fairshare
- A global fairshare mechanism is implemented on Lonestar to provide fair access to its substantial compute resources
- Fairshare computes a dynamic priority for each user and uses this priority in making scheduling decisions
- Dynamic priority is based on the following criteria
  - Number of shares assigned
  - Resources used by jobs belonging to the user
  - Number of job slots reserved
  - Run time of running jobs
  - Cumulative actual CPU time (not normalized), adjusted so that recently used CPU time is weighted more heavily than CPU time used in the distant past
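The criteria above combine into a single number along these lines: assigned shares divided by a weighted sum of recent usage. This is a schematic sketch, not LSF's exact formula; the factor values are illustrative (and site-configurable in real LSF), and the decay-weighting of old CPU time is omitted.

```python
# Schematic LSF-style fairshare dynamic priority (illustrative factors).
CPU_TIME_FACTOR = 0.7
RUN_TIME_FACTOR = 0.7
RUN_JOB_FACTOR = 3.0

def dynamic_priority(shares, cpu_time, run_time, job_slots):
    """Higher recent usage -> larger denominator -> lower priority."""
    denom = (cpu_time * CPU_TIME_FACTOR
             + run_time * RUN_TIME_FACTOR
             + (1 + job_slots) * RUN_JOB_FACTOR)
    return shares / denom

# A user with no recent usage outranks one who has burned lots of CPU time:
idle_user = dynamic_priority(shares=1, cpu_time=0.0, run_time=0, job_slots=0)
busy_user = dynamic_priority(shares=1, cpu_time=51203.4, run_time=0, job_slots=0)
print(round(idle_user, 3))     # 0.333
print(idle_user > busy_user)   # True
```

Note that with these illustrative factors, an idle user holding one share gets priority 1/3.0 ≈ 0.333, the same shape of result as the PRIORITY column in the bhpart listing on the next slide.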
10. LSF Fairshare
- bhpart: command to see current fairshare priority. For example:

lslogin1> bhpart -r
HOST_PARTITION_NAME: GlobalPartition
HOSTS: all
SHARE_INFO_FOR: GlobalPartition/
USER/GROUP  SHARES  PRIORITY  STARTED  RESERVED   CPU_TIME  RUN_TIME
avijit           1    0.333        0         0         0.0         0
chona            1    0.333        0         0         0.0         0
ewalker          1    0.333        0         0         0.0         0
minyard          1    0.333        0         0         0.0         0
phaa406          1    0.333        0         0         0.0         0
bbarth           1    0.333        0         0         0.0         0
milfeld          1    0.333        0         0         2.9         0
karl             1    0.077        0         0     51203.4         0
vmcalo           1    0.000      320         0   2816754.8   7194752
11. Commonly Used LSF Commands
Note: most of these commands support a -l argument for long listings. For example, bhist -l <jobID> will give a detailed history of a specific job. Consult the man pages for each of these commands for more information.
12. LSF Batch System
- LSF Defined Environment Variables
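As a small illustration (the variable table itself is not reproduced here), a job script or program can inspect a few of the environment variables LSF defines for a running job, such as LSB_JOBID, LSB_HOSTS, and LS_SUBCWD (the latter also appears in the job scripts later in this deck). The fallback strings below are only for running the sketch outside an LSF job.

```python
# Peek at a few LSF-defined environment variables from inside a job.
import os

job_id = os.environ.get("LSB_JOBID", "<not under LSF>")
submit_dir = os.environ.get("LS_SUBCWD", "<not under LSF>")
hosts = os.environ.get("LSB_HOSTS", "").split()  # one entry per job slot

print("job id:", job_id)
print("submitted from:", submit_dir)
print("hosts/slots:", hosts)
```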
13. LSF Batch System
- Comparison of LSF, PBS and Loadleveler commands
that provide similar functionality
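The comparison table itself is not reproduced here, but the best-known command equivalents can be sketched as a lookup. This is an illustrative subset of the mapping, not the full table; consult each system's man pages for options and semantics.

```python
# Hedged sketch: well-known command equivalents across batch systems.
EQUIVALENT_COMMANDS = {
    # action:         (LSF,      PBS,      LoadLeveler)
    "submit a job":   ("bsub",   "qsub",   "llsubmit"),
    "list jobs":      ("bjobs",  "qstat",  "llq"),
    "delete a job":   ("bkill",  "qdel",   "llcancel"),
}

def equivalent(command, target):
    """Look up the counterpart of a known command in another batch system."""
    systems = ["LSF", "PBS", "LoadLeveler"]
    for row in EQUIVALENT_COMMANDS.values():
        if command in row:
            return row[systems.index(target)]
    return None  # command not in this illustrative subset

print(equivalent("bsub", "PBS"))          # qsub
print(equivalent("qdel", "LSF"))          # bkill
```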
14. Batch System Concerns
- Submission (need to know)
- Required Resources
- Run-time Environment
- Directory of Submission
- Directory of Execution
- Files for stdout/stderr Return
- Email Notification
- Job Monitoring
- Job Deletion
- Queued Jobs
- Running Jobs
15. LSF Basic MPI Job Script

#!/bin/csh
#BSUB -n 32          # Total number of processes
#BSUB -J hello
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal
#BSUB -P A-ccsc
#BSUB -W 0:15

echo "Master Host = `hostname`"
echo "LSF_SUBMIT_DIR = $LS_SUBCWD"
echo "PWD_DIR = `pwd`"

# Execution command: ibrun is the parallel application manager and
# mpirun wrapper script that launches the executable
ibrun ./hello
16. LSF Extended MPI Job Script

#!/bin/csh
#BSUB -n 32                    # Total number of processes
#BSUB -J hello
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal
#BSUB -P A-ccsc
#BSUB -W 0:15
#BSUB -w 'ended(1123)'         # Dependency on Job <1123>
#BSUB -u karl@tacc.utexas.edu  # Email address
#BSUB -B                       # Email when job begins execution
#BSUB -N                       # Email job report information upon completion

echo "Master Host = `hostname`"
echo "LSF_SUBMIT_DIR = $LS_SUBCWD"

ibrun ./hello
17. LSF Job Script Submission
- When submitting jobs to LSF using a job script, a redirection is required for bsub to read the directives. Consider the following script:

lslogin1> cat job.script
#!/bin/csh
#BSUB -n 32
#BSUB -J hello
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal
#BSUB -W 0:15
echo "Master Host = `hostname`"
echo "LSF_SUBMIT_DIR = $LS_SUBCWD"
echo "PWD_DIR = `pwd`"
ibrun ./hello

- To submit the job:

lslogin1> bsub < job.script

Redirection is required!
18. LSF Interactive Execution
- Several ways to run interactively
- Submit the entire command to bsub directly:

> bsub -q development -I -n 2 -W 0:15 ibrun ./hello
Your job is being routed to the development queue
Job <11822> is submitted to queue <development>.
<<Waiting for dispatch ...>>
<<Starting on compute-1-0>>
Hello, world!
--> Process 0 of 2 is alive. -> compute-1-0
--> Process 1 of 2 is alive. -> compute-1-0

- Or submit using a normal job script and include an additional -I directive:

> bsub -I < job.script
19. Batch Script Suggestions
- Echo issued commands
  - (set -x for ksh, set echo for csh)
- Avoid absolute pathnames
  - Use relative path names or environment variables ($HOME, $WORK)
- Abort the job when a critical command fails
- Print the environment
  - Include the "env" command if your batch job doesn't execute the same as in an interactive execution
- Use the ./ prefix for executing commands in the current directory
  - The dot means to look for commands in the present working directory. Not all systems include "." in your PATH variable. (usage: ./a.out)
- Track your CPU time
20. LSF Job Monitoring (showq utility)

lslogin1> showq
ACTIVE JOBS--------------------
JOBID  JOBNAME       USERNAME  STATE    PROC  REMAINING  STARTTIME
11318  1024_90_96x6  vmcalo    Running    64   18:09:19  Fri Jan  9 10:43:53
11352  naf           phaa406   Running    16   17:51:15  Fri Jan  9 10:25:49
11357  24N           phaa406   Running    16   18:19:12  Fri Jan  9 10:53:46

23 Active jobs    504 of 556 Processors Active (90.65%)

IDLE JOBS----------------------
JOBID  JOBNAME       USERNAME  STATE    PROC  WCLIMIT    QUEUETIME
11169  poroe8        xgai      Idle      128   10:00:00  Thu Jan  8 10:17:06
11645  meshconv019   bbarth    Idle       16   24:00:00  Fri Jan  9 16:24:18

3 Idle jobs

BLOCKED JOBS-------------------
JOBID  JOBNAME       USERNAME  STATE     PROC  WCLIMIT    QUEUETIME
11319  1024_90_96x6  vmcalo    Deferred    64   24:00:00  Thu Jan  8 18:09:11
11320  1024_90_96x6  vmcalo    Deferred    64   24:00:00  Thu Jan  8 18:09:11

17 Blocked jobs

Total Jobs: 43   Active Jobs: 23   Idle Jobs: 3   Blocked Jobs: 17
21. LSF Job Monitoring (bjobs command)

lslogin1> bjobs
JOBID  USER    STAT  QUEUE   FROM_HOST  EXEC_HOST       JOB_NAME   SUBMIT_TIME
11635  bbarth  RUN   normal  lonestar   2*compute-8     shconv009  Jan  9 16:24
                                        2*compute-9-22
                                        2*compute-3-25
                                        2*compute-8-30
                                        2*compute-1-27
                                        2*compute-4-2
                                        2*compute-3-9
                                        2*compute-6-13
11640  bbarth  RUN   normal  lonestar   2*compute-3     shconv014  Jan  9 16:24
                                        2*compute-6-2
                                        2*compute-6-5
                                        2*compute-3-12
                                        2*compute-4-27
                                        2*compute-7-28
                                        2*compute-3-5
                                        2*compute-7-5
11657  bbarth  PEND  normal  lonestar                   shconv028  Jan  9 16:38
11658  bbarth  PEND  normal  lonestar                   shconv029  Jan  9 16:38
11662  bbarth  PEND  normal  lonestar                   shconv033  Jan  9 16:38
11663  bbarth  PEND  normal  lonestar                   shconv034  Jan  9 16:38
11667  bbarth  PEND  normal  lonestar                   shconv038  Jan  9 16:38
11668  bbarth  PEND  normal  lonestar                   shconv039  Jan  9 16:38

Note: Use bjobs -u all to see jobs from all users.
22. LSF Job Monitoring (lsuser utility)

lslogin1> lsuser -u vap
JOBID   QUEUE   USER  NAME           PROCS  SUBMITTED
547741  normal  vap   vap_hd_sh_p96     14  Tue Jun  7 10:37:01 2005

HOST           R15s  R1m  R15m  PAGES    MEM    SWAP   TEMP
compute-11-11   2.0  2.0   1.4  4.9P/s  1840M  2038M  24320M
compute-8-3     2.0  2.0   2.0  1.9P/s  1839M  2041M  23712M
compute-7-23    2.0  2.0   1.9  2.3P/s  1838M  2038M  24752M
compute-3-19    2.0  2.0   2.0  2.6P/s  1847M  2041M  23216M
compute-14-19   2.0  2.0   2.0  2.1P/s  1851M  2040M  24752M
compute-3-21    2.0  2.0   1.7  2.0P/s  1845M  2038M  24432M
compute-13-11   2.0  2.0   1.5  1.8P/s  1841M  2040M  24752M
23. LSF Job Manipulation/Monitoring
- To kill a running or queued job (takes ~30 seconds to complete):
  bkill <jobID>
  bkill -r <jobID>   (use when bkill alone won't delete the job)
- To suspend a queued job: bstop <jobID>
- To resume a suspended job: bresume <jobID>
- To see more information on why a job is pending: bjobs -p <jobID>
- To see a historical summary of a job: bhist <jobID>

lslogin1> bhist 11821
Summary of time in seconds spent in various states:
JOBID  USER  JOB_NAME  PEND  PSUSP  RUN  USUSP  SSUSP  UNKWN  TOTAL
11821  karl  hello      131      0  127      0      0      0    258
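The bhist summary partitions a job's lifetime among the listed states, so the TOTAL column is just their sum. A quick check using job 11821 from the listing above:

```python
# Seconds spent in each state for job 11821 (values from the bhist listing).
states = {"PEND": 131, "PSUSP": 0, "RUN": 127, "USUSP": 0, "SSUSP": 0, "UNKWN": 0}

total = sum(states.values())
print(total)  # 258, matching the TOTAL column (131 pending + 127 running)
```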