Title: CRC Basics
1. CRC Basics
- University of Notre Dame
- Center for Research Computing
- January 9th, 2007
2. Training Outline
- Overview of the CRC
- Resources Available
- User Accounts
- Accessing the Front-end Machines
- Modules
- Batch Submission / Monitoring Process
- Storage
- Examples
3. Overview of the CRC
- The University of Notre Dame Center for Research Computing provides:
  - Resources
  - Expertise
  - Outreach
- Web page: http://crc.nd.edu/
- Read the motd, or Message of the Day
  - Automatic on login
  - Also provided via the web at http://www.nd.edu/~rich/motd
  - Updated when there is news to share; read daily at login
4. Resources Available
- Hardware (click on Facilities)
- Software (click on Software)
- Staff (click Contact Information)
  - Engineering Support
  - System Administration
  - Administrative Assistance
Link references above are off the main CRC web page: http://crc.nd.edu
5. User Accounts
- To establish a new account, send an email to CRCsupport@nd.edu
- Include the following information:
  - Name
  - ND netID
  - Status (e.g., undergrad or grad student, faculty, staff)
  - Advisor's Name, Title, and netID
- Please do not use hpcc@nd.edu (old address)
  - Use the address listed above: CRCsupport@nd.edu
- Once the new user account is established, you will receive a follow-up email with instructions
6. Accessing the Front-end Machines
- Direct access to the compute nodes is not permitted
- Front-end machines are provided for:
  - Compiling code
  - Testing smaller jobs or running smaller jobs interactively
  - Submitting larger jobs into the queue
- There is a 6-hour time limit on jobs running on front-end machines
- Front-end machines and compute clusters are located at Union Station
  - Details provided at http://crc.nd.edu/facilities/union_station.shtml
  - Under the Facilities link, click on Webcam 1 or 2 to see the nodes
- Front-end machines:
  - stats.hpcc.nd.edu (Sun Fire V880, Solaris OS)
  - xeon.hpcc.nd.edu (Sun Fire V60x, Linux OS)
  - opteron.hpcc.nd.edu (Sun Fire V20z, Linux OS)
7. CRC Compute Nodes
- CRC compute nodes are in Linux clusters
  - opteron01-16.hpcc.nd.edu (Sun Fire V20z, with AMD Opteron processors)
  - xeon001-128.hpcc.nd.edu (128-node Sun Fire V60x with Intel Xeon processors)
  - dcopt001-288.hpcc.nd.edu (288-node Sun Fire X2100, dual-core AMD Opteron)
- Users access front-end nodes through secure shell (SSH) login to one of the three front-end machines listed above
  - prompt> ssh your_netID@opteron.hpcc.nd.edu
- Users can access their files through the Linux clusters throughout campus, but must be logged into one of the front-end machines to submit jobs to the HPC clusters, via qsub.
- There is a one-month time limit on total run time for each job, starting from the moment it is submitted into the queue with the qsub command
8. Parallel Architectures
[Figure: parallel architecture diagram. Source: RS/6000 SP: Practical MPI Programming, SG24-5380-00, Aug 1999, www.redbooks.ibm.com]
9. SMP Architecture
[Figure: SMP architecture diagram; source as on the previous slide]
10. MPP Architecture
[Figure: MPP architecture diagram; source as on the previous slide]
11. OpenMP and MPI
- CRC hardware/software supports both OpenMP and MPI
- If you have access to the source code, look at it
- The OpenMP API typically includes directives such as:
  - !$OMP PARALLEL DO
  - !$OMP PRIVATE ()
  - !$OMP END PARALLEL DO
- MPI requires libraries such as MPICH
  - MPICH libraries are loaded as a module on the CRC nodes. Try this: prompt> module avail mpich
- MPI-ready source code will include calls such as:
  - MPI_COMM
  - MPI_SCATTER
  - MPI_GATHER
  - MPI_RECV
  - etc.
- For more information on parallel programming, attend the training session on Parallel Computing on April 12th, 2007, from 10:30-11:30
12. Modules
- The module package provides for the dynamic modification of the user's environment via modulefiles
- CRC staff reduce complexity for users by loading frequently used modules by default (e.g., amber, gaussian)
- A modulefile contains the information needed to configure the shell for a specific application.
- Modules typically add to PATH, MANPATH, and LD_LIBRARY_PATH
- Modules allow a user access to multiple versions of software
  - e.g., module avail matlab returns:
  - matlab/6.1, 6.5, 7.0, 7.0_SP2, 7.0_SP3, 7.2, 7.3 (default)
- Usually, the default is set to the latest version by CRC system administrators
- Our intent is for users to understand modulefiles, but typically modify them as little as possible, and even then only when necessary
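As a mental model, loading a module mostly amounts to prepending the application's directories to the search paths named above. A minimal sh sketch of that effect (the matlab bin directory is illustrative, borrowed from the module session later in this deck):

```shell
# Rough equivalent of what `module load matlab` does to the shell:
# prepend the application's bin directory to PATH (MANPATH and
# LD_LIBRARY_PATH are handled the same way).
APP_BIN=/afs/nd.edu/i386_linux24/opt/und/matlab/7.2/bin
PATH="$APP_BIN:$PATH"
export PATH
# The shell now resolves this version first:
echo "$PATH" | cut -d: -f1
```

The real module command also records this state, which is what lets `module unload` undo exactly these changes later.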
13. Modules
- The module package and the module command are initialized when a shell-specific initialization script is sourced into the shell.
- The generic shell startup file is in /usr/local/Startup/Cshrc
  - When your account is activated, this is placed in your HOME directory as .cshrc
  - After making changes to .cshrc, always remember to source the file: prompt> source .cshrc
- Most modulefiles are created by the system administrators, although users may create their own.
- Many modules are loaded by default when a user logs in (or uses the batch system).
14. Modules
- Modules can be written to prevent conflicts, e.g., loading two different versions of an application.
- They may be written to force a user to specifically request (and understand) their changes.
  - For example, two versions of matlab cannot be loaded simultaneously.
- Modulefiles are written in Tcl (Tool Command Language) and are interpreted by modulecmd.
- Environment variables are unset when unloading a modulefile
15. Modules
prompt> module ____
- avail
  - List all available modulefiles in the current MODULEPATH
- list
  - List all modules currently loaded.
- load / add
  - Load a modulefile into the shell environment
- unload / rm
  - Remove a modulefile from the shell environment
16. Modules
prompt> module ____
- swap / switch modulefile1 modulefile2
  - Switch loaded modulefile1 to modulefile2
- show / display modulefile
  - Display information about a modulefile (e.g., the full path and environment changes).
- help
  - Print the usage of each sub-command.
prompt> man module <return>
- The which command searches your PATH for a match to the module name specified after which ____
17. Modules
prompt> module avail              (shows all modulefiles available)
prompt> module avail matlab       (shows matlab versions)
prompt> module load matlab
prompt> which matlab
/afs/nd.edu/i386_linux24/opt/und/matlab/7.2/bin/matlab
prompt> module unload matlab
prompt> which matlab              (should return: matlab: Command not found)
prompt> module load matlab/7.3
prompt> which matlab              (should return the full path to 7.3)
18. Batch Submission / Monitoring Process
- Hardware
  - Serial queue: x86 64 dual-CPU Xeon (each 2-way SMP); x86_64 144 dual-core Opteron (dcopt)
  - Parallel queue: xeon (64), opteron (16), dcopt (288)
  - Serial/Parallel SMP queue: Sun Solaris 5 V880
  - The qhost command will show you the various architectures. Try this: prompt> qhost <return>
- Batch queuing system
  - We use SGE (Sun Grid Engine) 5.3p7, which is released as SGEEE (Sun Grid Engine Enterprise Edition).
19. SGE at the CRC
- All SGE commands and environment setup are contained in the sge/5.3 module, which is loaded by default.
- It allows for transparent use of the AFS file system used extensively at Notre Dame.
- The token lifetime used for all batch jobs:
  - 720 hours, or 30 days
20. SGE Commands
- SGE's command line user interface
  - Manage queues, submit and delete jobs, check status of queues and jobs.
- Prerequisite
  - Skip commands that are not appropriate for batch work (configuring the prompt, setting the delete character, etc.)
  - Near the top of your .cshrc file, you will see
    - if ( $?ENVIRONMENT != 0 ) exit 0
    - or: if ( $?USER == 0 || $?prompt == 0 ) exit
  - This stops execution of your interactive commands in batch shells.
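The same guard can be written for a POSIX shell; a minimal sketch (using the absence of a prompt string as the batch-shell signal):

```shell
# sh analogue of the csh guard above: batch (non-interactive) shells
# skip interactive-only setup such as prompt and alias configuration.
if [ -z "$PS1" ]; then
    MODE=batch            # e.g., an SGE job shell: stop before prompt setup
else
    MODE=interactive      # a login shell: go on to configure the prompt
fi
echo "$MODE"
```

Without such a guard, interactive-only commands in a startup file can break batch jobs that source it.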
21. SGE client commands
- qsub
  - The user interface for submitting a job to SGE.
- qstat
  - A status listing of all jobs and queues associated with the cluster.
- qdel
  - Deletes SGE jobs, regardless of whether they are running or spooled.
- qalter
  - Alters the attributes of already submitted but still pending jobs.
22. Example Job Submission
- To run the job from the command line, type
  - prompt> qsub -l arch=lx24-amd64 -M afs_id@nd.edu -m ae -r y a.out
- To submit a batch script job file, first build sample.sh:

  #!/bin/csh
  #$ -l arch=lx24-amd64
  #$ -M your_netID@nd.edu
  #$ -m ae
  #$ -r y
  a.out          (executable compiled for x86_64)

Let's look at each of the options in the shell script above
23. qsub options in SGE
The following options can be given to qsub on the command line, or preceded with #$ in batch scripts.
- -M your_netID@nd.edu (Optional)
  - Specify an address where SGE should send email about your job.
- -m abe (Optional)
  - Tell SGE to send email to the specified address if the job aborts (a), begins (b), or ends (e).
- -r y or n (Optional)
  - Tell SGE whether your job is rerunnable. Most jobs are rerunnable, but application jobs such as Gaussian are not, so you should specify -r n for Gaussian jobs
24. qsub options in SGE
- -j y or n (Optional)
  - Specify whether or not the standard error stream of the job is merged with the standard output stream.
  - The default is to merge the standard error and standard output streams.
- -pe (parallel_queue_name) (#_of_processors) (Required for parallel jobs!!)
  - Try this: qconf -spl <enter> (lists queue names)
  - Specify the # of processors your job will need for parallel jobs. The default is one CPU. The maximum # of CPUs that can be requested is specified in the parallel queue.
  - Ex. -pe ompi8-8 8
!! Jobs requesting a large # of CPUs might wait longer in the queue
25. qsub options in SGE
- -l: Request a resource; see options below
  - arch=lx24-amd64 or arch=glinux or arch=solaris64
    - Requests resources of this architecture type
    - lx24-amd64 specifies the 64-bit AMD Opteron architecture
    - glinux specifies the Linux architecture (32/64-bit)
    - solaris64 specifies the Sun Solaris 64-bit architecture.
  - Specifying an architecture is unnecessary if you are running an application which is available on both architectures, e.g., Gaussian. (The job may start sooner if no architecture is specified)
  - Compilation and run time depend upon what kind of architecture is used
26. Example Job Submission
- To run the job from the command line, type
  - prompt> qsub -l arch=lx24-amd64 -M afs_id@nd.edu -m ae -r y a.out
- To submit a batch script job file, first build sample.sh:

  #!/bin/csh
  #$ -l arch=lx24-amd64
  #$ -M your_netID@nd.edu
  #$ -m ae
  #$ -r y
  a.out          (executable compiled for x86_64)

To submit the job into the queue, type: qsub sample.sh <enter>
27. Real Example
Don't try this at home; this is from my files:
prompt> cd test/sundials/cvode/examples_par
prompt> vi job_submit.sh <enter>      (then edit the file)
prompt> qsub job_submit.sh <enter>
prompt> qstat -f <enter>
   .or.
prompt> qstat -u ebensman <enter>
Look for the job_id. Output will be in job_submit.sh.o<job_id>
28. Sample emails
- Email at job start:
  - Job 256616 (job_submit.sh) Started
  - User: ebensman
  - Queue: dcopt145.q
  - Host: dcopt145.hpcc.nd.edu
  - Start Time: 01/08/2007 16:31:20
- Email when job completes:
  - Job 256616 (job_submit.sh) Complete
  - User: ebensman
  - Queue: dcopt145.q
  - Host: dcopt145.hpcc.nd.edu
  - Start Time: 01/08/2007 16:31:21
  - End Time: 01/08/2007 16:31:25
  - User Time: 00:00:00
  - System Time: 00:00:01
  - Wallclock Time: 00:00:04
  - CPU: 00:00:01
  - Max vmem: 139.21M
29. Examples: Job Submission
- Application jobs: gaussian, matlab, ...
- sample_qsub.sh:

  #!/bin/csh
  #$ -l arch=glinux
  #$ -M your_netID@nd.edu
  #$ -m ae
  #$ -r n
  g03 < testDFT.com

prompt> qsub sample_qsub.sh
30. Parallel Job Submission
- To submit a batch script job file, e.g., sample_parallel.sh:

  #!/bin/csh
  #$ -l arch=lx24-amd64
  #$ -pe ompi4-4 4          (list available PEs using the command: qconf -spl)
  #$ -M afs_id@nd.edu
  #$ -m ae
  module load mpich/.       (if not already done)
  mpirun -np $NSLOTS -machinefile $TMPDIR/machines /afs/nd.edu/<your_user_in_afs>/<your_netid>/<path to executable program>/a.out

To see available nodes in a queue, type: qconf -sp <pe>
where <pe> is from the output of qconf -spl (e.g., ompi4-4)

prompt> qsub sample_parallel.sh
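The mpirun line depends on variables that SGE exports to the job itself: NSLOTS holds the number of granted slots and $TMPDIR/machines lists the assigned hosts. A sketch of writing such a script from the shell (the quoted heredoc keeps $NSLOTS and $TMPDIR literal so SGE, not this shell, expands them; ./a.out is a placeholder executable):

```shell
# Generate a parallel SGE submission script. Everything between the
# quoted EOF markers is written out verbatim, so the SGE run-time
# variables $NSLOTS and $TMPDIR survive unexpanded.
cat > sample_parallel.sh <<'EOF'
#!/bin/csh
#$ -l arch=lx24-amd64
#$ -pe ompi4-4 4
#$ -M afs_id@nd.edu
#$ -m ae
mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./a.out
EOF
grep -c '^#\$' sample_parallel.sh    # counts the embedded SGE directives
```

The actual submission step (qsub sample_parallel.sh) is unchanged; only then do NSLOTS and the machines file get their values.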
31. Parallel Job Submission
First, build a script file called gaussian.sh:

  #!/bin/csh
  (then pick only 1 of the next 3 lines)
  #$ -pe smp-xeon 2        (on the Xeon cluster)      or
  #$ -pe smp-opteron 2     (on the Opteron cluster)   or
  #$ -pe smp 4             (on the Solaris nodes)
  #$ -M your_afs_id@nd.edu
  #$ -m ae
  #$ -r n
  g03l < testDFT.com       (testDFT.com is the input file)

prompt> qsub gaussian.sh
32. SGE client commands
- qsub
  - The user interface for submitting a job to SGE.
- qstat
  - The status of all jobs and queues associated with the cluster.
- qdel
  - Cancels SGE jobs, regardless of whether they are running or spooled.
- qalter
  - Changes the attributes of already submitted but still pending jobs.
33. qstat options
Jobs can be monitored using the qstat command.
- qstat without arguments will print the status of all jobs:
  - The job ID number
  - Priority of the job
  - Name of the job
  - ID of the user who submitted the job
  - Submit or start time and date of the job
  - If running, the queue in which the job is running
  - The function of the running job (MASTER or SLAVE)
  - The job array task ID
34. qstat options
- State of the job. States can be:
  - t(ransferring)
  - r(unning)
  - s(uspended)
  - S(uspended)
  - T(hreshold)
  - R(estarted)
- qstat -j Job-ID
  - Prints, either for all pending jobs or for the jobs in the job list, the reason for not being scheduled
35. qstat options
- qstat -f Job-ID
  - Provides a full listing of the job with the listed Job-ID (or all jobs if no Job-ID is given).
- The printed information for each queue:
  - The queue name
  - The queue type (qtype). Types or combinations of types can be:
    - B(atch)
    - I(nteractive)
    - C(heckpointing)
    - P(arallel)
    - T(ransfer)
    - BP: Batch, Parallel
    - BIP: Batch, Interactive, Parallel
  - The number of used and available job slots (used/tot.)
  - The load average (load_avg) on the queue host
  - The architecture (arch) of the queue host
36. qstat options
- qstat -f Job-ID
  - The state of the queue. Queue states or combinations of states can be:
    - u(nknown)
    - a(larm)
    - A(larm)
    - C(alendar suspended)
    - s(uspended)
    - S(ubordinate)
    - d(isable)
    - D(isable)
    - E(rror)
37. SGE client commands
- qdel
  - Deletes SGE jobs, regardless of whether they are running or pending.
  - Usage: qdel Job_ID
    - Deletes the job that matches the Job_ID
  - To find your Job_ID, type: qstat -u <your_netid>
    - Example: qstat -u ebensman
- qalter
  - Usage: qalter Job_ID
- qmon
  - An X Windows Motif command interface and monitoring facility.
- qhost
  - Displays status information about SGE execution hosts
38. qmon GUI
At a command prompt, type: qmon <return>
39. Storage
- CRC currently has three modes of storage:
- Local scratch (/scratch) on the compute nodes
  - Fastest access; not available for parallel processes, serial only
  - On Sun nodes: fibre channel
  - On Xeon nodes: SCSI
  - On Opteron nodes: SCSI/SATA
- Distributed scratch space (/dscratch)
  - NetApp appliance (installed fall 2006), 30 TB available
  - /dscratch is not backed up like /afs/ space
    - Users are encouraged to back up critical data
    - /dscratch is meant for short-term storage of non-critical data
- Andrew File System (AFS), distributed, 6 TB available
  - Allows users the flexibility of common storage with other campus systems
  - Backed up on a routine basis; allows for recovery of lost data
- We continue to look at alternative filesystems
  - Currently reviewing faster OpenAFS and Lustre
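Because local scratch is fast but unmanaged, jobs typically stage input in, compute on the node, and copy results back before cleaning up. A minimal sh sketch of that pattern (a temporary directory stands in for node-local /scratch, and the tr call stands in for the real job):

```shell
# Local-scratch staging pattern:
# 1) stage input onto fast node-local disk,
# 2) run the job against the local copy,
# 3) copy results back to backed-up storage, then clean scratch.
SCRATCH=$(mktemp -d)                     # stand-in for /scratch on a node
echo "input data" > "$SCRATCH/in.dat"    # stage in
tr 'a-z' 'A-Z' < "$SCRATCH/in.dat" > "$SCRATCH/out.dat"   # the "job"
cp "$SCRATCH/out.dat" ./result.dat       # stage results back out
rm -rf "$SCRATCH"                        # scratch is not backed up: clean it
cat ./result.dat
```

On the real clusters, the stage-out step is what protects results: anything left only in /scratch or /dscratch is not recoverable from backup.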
40. Feedback
- Give us your feedback
- Was this class beneficial?
- What material should be added?
- What material should be deleted?
- What other training would you like to receive
from the CRC?