Title: SGE and Modules: Getting Started in CRC/HPCC
1. SGE and Modules: Getting Started in CRC/HPCC
- In-Saeng Suh and Rich Sudlow
- OIT, Univ. of Notre Dame
2. Contents
- Overview: http://crc.nd.edu
- Modules
- Examples in Modules
- ND-HPCC Batch System
- SGE in ND
- SGE Commands
- Examples in SGE Client Commands
3. Overview
- Introduction to ND-HPCC
  - Provides faculty, staff, and graduate students at ND with high-end facilities and applications for their research.
- Facilities
  - http://crc.nd.edu/resources/facilities.shtml
- Software
  - http://crc.nd.edu/resources/software.shtml
4. Using Modules
- What is a module?
  - The module command is the user interface to the Modules package.
  - The Modules package provides for the dynamic modification of the user's environment via modulefiles.
  - Modules provide a lot of flexibility, but also more complexity. HPCC staff try to minimize this complexity for users by loading frequently used modules by default, e.g., SGE.
5. What is a module? (Continued)
- A modulefile contains the information needed to configure the shell for a specific application.
- Modules typically add to the PATH, MANPATH, and LD_LIBRARY_PATH variables.
- Modules allow a user to access multiple versions of software, e.g., matlab/7.0_SP2 (default), 6.1, 6.5, 7.0, 7.0_SP3.
- Usually, system administrators set the default to the latest version.
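
The effect of loading a module on PATH, MANPATH, and LD_LIBRARY_PATH can be sketched by hand in a Bourne-style shell. This is only an illustration of the idea, not how the Modules package is implemented; the install prefix APP_ROOT and the helper prepend_path are hypothetical.

```shell
# Hypothetical install prefix; a real modulefile names the actual location.
APP_ROOT="/opt/example/matlab/7.0_SP2"

# prepend_path DIR LIST: print LIST with DIR placed in front,
# without leaving a stray colon when LIST is empty.
prepend_path() {
    if [ -z "$2" ]; then
        printf '%s\n' "$1"
    else
        printf '%s:%s\n' "$1" "$2"
    fi
}

# Roughly what "module load" arranges for a typical application.
PATH="$(prepend_path "$APP_ROOT/bin" "$PATH")"
MANPATH="$(prepend_path "$APP_ROOT/man" "${MANPATH:-}")"
LD_LIBRARY_PATH="$(prepend_path "$APP_ROOT/lib" "${LD_LIBRARY_PATH:-}")"
export PATH MANPATH LD_LIBRARY_PATH
```

Because the new directories go in front, the selected version is found before any other copy of the same program later in the search path.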
6. Modules at ND-HPCC
- ND-HPCC has implemented and enhanced modules to manage the user environment.
- Modules have been used for a number of years on the SGI architecture; they were added to the Sun HPCC environment around September 2001, and later to the Linux systems.
- In general, ND users are not required to explicitly set environment variables for applications installed in ND-AFS.
7. Module Initialization
- The module package and the module command are initialized when a shell-specific initialization script is sourced into the shell.
  - This is already done in /usr/local/Startup/Cshrc
- Most modulefiles are created by the system administrators, although users may create their own.
- Many modules are loaded by default when a user logs in (or uses the batch system).
8. Module Initialization
- Modules can be written to prevent conflicts, e.g., loading two different versions of an application. They may also be written to force a user to specifically request (and understand) their changes.
- For example, two versions of matlab cannot be loaded simultaneously.
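
A rough sketch of such a conflict check, done from the shell rather than inside a modulefile. It assumes the Modules package records loaded modules in the colon-separated LOADEDMODULES variable; the helper has_conflict is hypothetical, and real modulefiles would normally declare conflicts themselves.

```shell
# has_conflict APP LOADED: succeed if a module named APP/<version> appears
# in the colon-separated list LOADED (the format assumed for LOADEDMODULES).
has_conflict() {
    case ":$2:" in
        *":$1/"*) return 0 ;;
        *)        return 1 ;;
    esac
}

loaded="${LOADEDMODULES:-}"
if has_conflict matlab "$loaded"; then
    echo "a matlab module is already loaded; unload it before loading another version"
else
    echo "no matlab module loaded"
fi
```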
9. Module Subcommands
module ____
- avail
  - List all available modulefiles in the current MODULEPATH
- list
  - List all modules currently loaded.
- load | add
  - Load modulefile into the shell environment
- unload | rm
  - Remove modulefile from the shell environment
10. Module Subcommands (II)
- swap | switch modulefile1 modulefile2
  - Switch loaded modulefile1 to modulefile2
- show | display modulefile
  - Display information about a modulefile, e.g., the full path and environment changes.
- help
  - Print the usage of each sub-command.
    > man module
- The which command searches your PATH for a match for the name specified after which
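
A short session using the subcommands above might look like the following sketch. The matlab version names are the ones listed earlier in this deck; the guard around the module calls and the helper find_in_path (a hand-written stand-in that shows what which does) are illustrative assumptions.

```shell
# Run the module subcommands only where the module command is defined.
if type module >/dev/null 2>&1; then
    module avail                               # what can be loaded
    module list                                # what is loaded now
    module load matlab                         # load the default version
    module swap matlab/7.0_SP2 matlab/7.0_SP3  # switch versions
    module show matlab/7.0_SP3                 # inspect path and environment changes
    module unload matlab
    modules_found=yes
else
    modules_found=no
    echo "module command not available in this shell"
fi

# find_in_path NAME: print the first executable called NAME found on PATH,
# which is the search that the which command performs.
find_in_path() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        [ -z "$dir" ] && dir=.
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

find_in_path sh
```

Running find_in_path before and after a module load is a quick way to confirm which copy of a program the shell will pick up.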
11. Modulefiles
- Modulefiles are written in Tcl (Tool Command Language) and are interpreted by modulecmd.
- Environment variables are unset when unloading a modulefile.
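
For path-like variables, undoing a load means taking the added directory back out of the list. A minimal sketch of that reversal, assuming a hypothetical helper remove_path and a made-up directory; the Modules package does this internally when a modulefile is unloaded.

```shell
# remove_path DIR LIST: print the colon-separated LIST with every
# entry equal to DIR removed.
remove_path() {
    result=""
    old_ifs=$IFS
    IFS=:
    for entry in $2; do
        if [ "$entry" != "$1" ]; then
            result="${result:+$result:}$entry"
        fi
    done
    IFS=$old_ifs
    printf '%s\n' "$result"
}

# Made-up search path containing one directory added by an earlier load.
demo_path="/opt/example/matlab/7.0_SP2/bin:/usr/bin:/bin"
remove_path "/opt/example/matlab/7.0_SP2/bin" "$demo_path"
```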
12. Modulefile Sample
13. The HPCC Batch System
- Serial queue
  - x86: 64 dual-CPU Xeon (2-way SMP each)
  - x86_64: 144 dual-core Opteron (dcopt)
- Parallel queue: xeon (64), opteron (16), dcopt (288)
- Serial/Parallel SMP queue: Sun Solaris 5 V880
- Batch queuing system
  - We use SGE (Sun Grid Engine) 5.3, which is released as SGEEE (Sun Grid Engine Enterprise Edition).
15. SGE in ND-AFS
- All SGE commands and environment setup are contained in the sge/5.3 module, which is loaded by default.
- It allows for transparent use of the AFS file system used extensively at Notre Dame.
- The token lifetime used for all batch jobs
  - 720 hours or 30 days
16. SGE Commands
- SGE's command line user interface
  - Manage queues, submit and delete jobs, and check the status of jobs and queues.
- Prerequisite
  - Skip commands that are not appropriate for batch work (configuring the prompt, setting the delete character, etc.).
  - Near the top of your .login and/or .cshrc files, add
      if ( $?ENVIRONMENT != 0 ) exit 0
  - In a batch job, this exits before your interactive commands are executed.
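
For users of a Bourne-style shell, the same guard on ENVIRONMENT could be written roughly as follows. This is a sketch under the same assumption as the csh line, namely that ENVIRONMENT is set only inside batch jobs; the function name in_batch_job and the prompt value are hypothetical.

```shell
# Succeeds when ENVIRONMENT is set (even to an empty value),
# mirroring the csh test on $?ENVIRONMENT.
in_batch_job() {
    [ -n "${ENVIRONMENT+set}" ]
}

if in_batch_job; then
    : # batch job: skip interactive-only setup
else
    # interactive-only setup goes here, e.g. prompt configuration
    PS1='hpcc> '
fi
```

Wrapping the interactive part in an if block avoids calling exit from a startup file, which in a Bourne-style login shell would end the session.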
17. SGE client commands
- qsub
  - The user interface for submitting a job to SGE.
- qstat
  - A status listing of all jobs and queues associated with the cluster.
- qdel
  - Deletes SGE jobs, regardless of whether they are running or spooled.
- qalter
  - Alters the attributes of already submitted but still pending jobs.
18. Examples: Job submission
- To run the job on the command line, type
    > qsub -l arch=solaris64 -M afs_id@nd.edu -m ae -r y a.out
- To submit a batch script job file, e.g., sample.job
    #!/bin/csh
    #$ -l arch=solaris64
    #$ -M afs_id@nd.edu
    #$ -m ae
    #$ -r y
    a.out
  (Executable compiled on Solaris)
    > qsub sample.job
19. qsub Options to SGE (I)
The following options can be given to qsub on the command line, or preceded with #$ in batch scripts.
- -M your_afs_id@nd.edu (Optional)
  - Specify an address where SGE should send email about your job.
- -m abe (Optional)
  - Tell SGE to send email to the specified address if the job aborts, begins, or ends.
- -r y or n (Optional)
  - Tell SGE whether your job is rerunnable. Most jobs are rerunnable, but application jobs such as Gaussian are not, so for those you should specify -r n.
20. qsub Options to SGE (II)
- -j y or n (Optional)
  - Specify whether or not the standard error stream of the job is merged with the standard output stream.
  - Default is to merge the standard error and the standard output stream.
- -pe <parallel environment> <# of processors> (Required for parallel jobs!)
  - Specify the number of processors your job will need for parallel jobs. The default is one CPU. The maximum number of CPUs that can be requested is specified in the parallel queue.
!! Jobs requesting a large number of CPUs might spend a long time waiting in the queue.
21. qsub Options to SGE (III)
- -l Request a resource
  - cput=hh:mm:ss (Optional)
    - Requests this much CPU time to run your job. cput is the sum of the time used by all threads in parallel jobs.
    - Not currently used, but may be beneficial in the future; the limit is the token lifetime.
!! Note that it is better to run several shorter jobs than one long job.
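
To compare a cput request against the token lifetime, it helps to convert hh:mm:ss to seconds. A small sketch, assuming the 720-hour token lifetime stated earlier in this deck; the helper hms_to_seconds and the 36:00:00 request are illustrative only.

```shell
# hms_to_seconds HH:MM:SS: print the total number of seconds.
# awk reads the fields as decimal numbers, so leading zeros are harmless.
hms_to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

TOKEN_LIFETIME_SECONDS=$((720 * 3600))   # 720 hours, the batch token lifetime

requested=$(hms_to_seconds "36:00:00")   # example request
if [ "$requested" -le "$TOKEN_LIFETIME_SECONDS" ]; then
    echo "cput request of $requested s fits within the token lifetime"
else
    echo "cput request of $requested s exceeds the token lifetime"
fi
```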
22. -l Request a resource (Continued)
- arch=glinux or solaris64 (Optional, depends on architecture)
  - Requests an architecture type to run your job on.
  - glinux specifies the Linux architecture (32/64 bit), while solaris64 specifies the Sun Solaris 64 bit architecture.
  - Specifying an architecture is unnecessary if you are running an application which is available on both architectures, e.g., Gaussian. (The job may run quicker if no architecture is specified.)
23. arch=glinux or solaris64 (Continued)
- Compilation and execution depend on the architecture used.
!! Note that by default the batch queuing system will run the job on the fastest system which meets the requirements that you specify.
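
Since an executable only runs on the architecture it was compiled for, one way to choose the arch value is to derive it from the machine where the build was done. The sketch below maps the output of uname -s to the two arch names used in this deck; the helper sge_arch_for is hypothetical, and treating every SunOS host as solaris64 is an assumption. It prints the qsub command rather than running it.

```shell
# sge_arch_for KERNEL: map a `uname -s` kernel name to the arch
# values named in this deck; fail for anything else.
sge_arch_for() {
    case "$1" in
        Linux) echo glinux ;;
        SunOS) echo solaris64 ;;
        *)     return 1 ;;
    esac
}

if arch=$(sge_arch_for "$(uname -s)"); then
    echo "qsub -l arch=$arch a.out"
else
    echo "no arch mapping for $(uname -s); omit -l arch to let SGE choose"
fi
```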
25. Examples: Job submission
- Application job: gaussian, matlab, ...
  E.g., sample_qsub.job
    #!/bin/csh
    #$ -l cput=36:00:00
    #$ -M your_afs_id@nd.edu
    #$ -m ae
    #$ -r n
    g03 < testDFT.com
    > qsub sample_qsub.job
26. Parallel Job submission
- To submit a batch script job file, e.g., sample_parallel.job
    #!/bin/csh
    #$ -l cput=36:00:00
    #$ -l arch=glinux
    #$ -pe smp-xeon 4
    #$ -M afs_id@nd.edu
    #$ -m ae
    module load mpich/.
    mpirun -np 4 ./a.out
  (a.out is compiled with the MPI or OpenMP library on the x86 Xeon architecture)
    > qsub sample_parallel.job
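
The process count given to mpirun has to agree with the slot count requested with -pe. One way to keep the two in step, assuming your SGE version sets the NSLOTS variable inside parallel jobs, is to read the count from the environment instead of repeating it. The helper slots_for_job is hypothetical, and the sketch only prints the command.

```shell
# slots_for_job: number of processes to start. NSLOTS is assumed to be set
# by SGE inside a parallel job; 1 is the fallback outside the batch system.
slots_for_job() {
    echo "${NSLOTS:-1}"
}

mpirun_command="mpirun -np $(slots_for_job) ./a.out"
echo "$mpirun_command"
```

With this pattern, changing the number after -pe is the only edit needed when resizing a job.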
27. Parallel Job Submission
E.g., gaussian.job
    #!/bin/csh
    #$ -l cput=36:00:00
    #$ -pe smp-xeon 4
    #$ -M your_afs_id@nd.edu
    #$ -m ae
    #$ -r n
    g03l < testDFT.com
    > qsub gaussian.job
28. SGE client commands
- qsub
  - The user interface for submitting a job to SGE.
- qstat
  - The status of all jobs and queues associated with the cluster.
- qdel
  - Cancels SGE jobs, regardless of whether they are running or spooled.
- qalter
  - Changes the attributes of already submitted but still pending jobs.
29. qstat: Options on qstat (I)
Jobs can be monitored using the qstat command.
- qstat without arguments prints the status of all jobs:
  - The job ID number
  - Priority of job
  - Name of job
  - ID of user who submitted job
  - Submit or start time and date of the job
  - If running, the queue in which the job is running
  - The function of the running job (MASTER or SLAVE)
  - The job array task ID
30. qstat: Options on qstat (II)
- State of the job. States can be
  - t(ransferring)
  - r(unning)
  - s(uspended)
  - S(uspended)
  - T(hreshold)
  - R(estarted)
- qstat -j Job-ID
  - Prints, either for all pending jobs or for the jobs contained in the job list, the reason for not being scheduled.
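
When scanning qstat output in a script, the single-letter job states listed above can be expanded with a small lookup. A minimal sketch covering only the letters named on this slide; the function describe_job_state is hypothetical.

```shell
# describe_job_state LETTER: print the meaning of one qstat job state letter.
describe_job_state() {
    case "$1" in
        t)   echo transferring ;;
        r)   echo running ;;
        s|S) echo suspended ;;
        T)   echo threshold ;;
        R)   echo restarted ;;
        *)   echo "unknown state: $1"; return 1 ;;
    esac
}

describe_job_state r
describe_job_state T
```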
31. qstat: Options on qstat (III)
- qstat -f Job-ID
  - Provides a full listing of the job which has the listed Job-ID (or all jobs if no Job-ID is given).
  - The printed information for each queue:
    - The queue name
    - The queue type. Types or combinations of types can be
      - B(atch)
      - I(nteractive)
      - C(heckpointing)
      - P(arallel)
      - T(ransfer)
    - The number of used and available job slots
    - The load average on the queue host
    - The architecture of the queue host
32. qstat: Options on qstat (IV)
- qstat -f Job-ID (Continued)
  - The state of the queue. Queue states or combinations of states can be
    - u(nknown)
    - a(larm)
    - A(larm)
    - C(alendar suspended)
    - s(uspended)
    - S(ubordinate)
    - d(isabled)
    - D(isabled)
    - E(rror)
33. SGE client commands
- qdel
  - Deletes SGE jobs, regardless of whether they are running or pending.
  - Usage: qdel Job_ID
    - Deletes the job that matches the Job ID
- qalter
  - Usage: qalter Job_ID
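
Both commands need the job ID that qsub reports at submission time. The sketch below pulls the ID out of a submission message and prints the follow-up commands. The message text is a made-up sample and its format is an assumption, so check what your qsub actually prints; extract_job_id is a hypothetical helper.

```shell
# extract_job_id MESSAGE: print the numeric job ID from a qsub
# confirmation line of the assumed form: your job <id> ("name") ...
extract_job_id() {
    printf '%s\n' "$1" | sed -n 's/^[Yy]our job \([0-9][0-9]*\) .*/\1/p'
}

sample_message='your job 1234 ("sample.job") has been submitted'   # made-up example
job_id=$(extract_job_id "$sample_message")

echo "qstat -j $job_id"   # ask why the job is not scheduled yet
echo "qdel $job_id"       # delete the job
```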
34. SGE client commands
- qhost
  - Displays status information about SGE execution hosts.
- qmon
  - An X-windows Motif command interface and monitoring facility.
35. qmon GUI
36. Thank you!