Transcript and Presenter's Notes

Title: CRC Basics


1
CRC Basics
  • University of Notre Dame
  • Center for Research Computing
  • January 9th, 2007

2
Training Outline
  • Overview of the CRC
  • Resources Available
  • User Accounts
  • Accessing the Front-end Machines
  • Modules
  • Batch Submission and Monitoring Process
  • Storage
  • Examples

3
Overview of the CRC
  • The University of Notre Dame Center for Research
    Computing provides
  • Resources
  • Expertise
  • Outreach
  • Web page: http://crc.nd.edu/
  • Read the motd, or Message of the Day
  • Automatic on login
  • Also provided via the web at
    http://www.nd.edu/~rich/motd
  • Updated when there is news to share; read it daily
    at login

4
Resources Available
  • Hardware (click on Facilities)
  • Software (click on Software)
  • Staff (click Contact
    Information)
  • Engineering Support
  • System Administration
  • Administrative Assistance

Link references above are off the main CRC web
page, http://crc.nd.edu
5
User Accounts
  • To establish a new account, send an email to
    CRCsupport@nd.edu
  • Include the following information:
  • Name
  • ND netID
  • Status (e.g., undergrad or grad student, faculty,
    staff)
  • Advisor's Name, Title, and netID
  • Please do not use hpcc@nd.edu (old address)
  • Use the address listed above: CRCsupport@nd.edu
  • Once the new user account is established, you will
    receive a follow-up email with instructions

6
Accessing the Front-end Machines
  • Direct access to the compute nodes is not
    permitted
  • Front-end machines are provided for
  • Compiling code
  • Testing smaller jobs or running smaller jobs
    interactively
  • Submitting larger jobs into the queue
  • There is a 6-hour time limit on jobs running on
    front-end machines
  • Front-end machines and compute clusters are
    located at Union Station
  • Details provided at
    http://crc.nd.edu/facilities/union_station.shtml
  • Under the Facilities link, click on Webcam 1 or 2
    to see the nodes
  • Front-end machines
  • stats.hpcc.nd.edu (Sun Fire V880, Solaris OS)
  • xeon.hpcc.nd.edu (Sun Fire V60x, Linux OS)
  • opteron.hpcc.nd.edu (Sun Fire V20z, Linux OS)

7
CRC Compute Nodes
  • CRC compute nodes are in Linux clusters
  • opteron01-16.hpcc.nd.edu (Sun V20x, with AMD
    Opteron processors)
  • xeon001-128.hpcc.nd.edu (128-node Sun Fire V60x
    with Intel Xeon processor)
  • dcopt001-288.hpcc.nd.edu (288-node Sun Fire X2100
    dual-core AMD Opteron)
  • Users access front-end nodes through secure shell
    (SSH) login to one of the three front-end
    machines listed above
  • prompt> ssh your_netID@opteron.hpcc.nd.edu
  • Users can access their files through the Linux
    clusters throughout campus, but must be logged
    into one of the front-end machines to submit jobs
    to the HPC clusters, via qsub.
  • There is a one-month time limit on total run time
    for each job, starting from the moment it is
    submitted into the queue with the qsub command
    (a minimal sketch of the login-and-submit workflow
    follows)
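A minimal sketch of that workflow, from your own machine to
the queue (the script name my_job.sh is hypothetical):

yourpc> ssh your_netID@opteron.hpcc.nd.edu   # log in to a front-end machine
prompt> qsub my_job.sh                       # submit a job script to the cluster queue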

8
Parallel Architectures
Source: RS/6000 SP: Practical MPI
Programming, SG24-5380-00, Aug 1999,
www.redbooks.ibm.com
9
SMP Architecture
Source: RS/6000 SP: Practical MPI
Programming, SG24-5380-00, Aug 1999,
www.redbooks.ibm.com
10
MPP Architecture
Source: RS/6000 SP: Practical MPI
Programming, SG24-5380-00, Aug 1999,
www.redbooks.ibm.com
11
OpenMP and MPI
  • CRC hardware/software supports both OpenMP and
    MPI
  • If you have access to the source code, look at it
  • OpenMP code typically includes directives such as
  • !$OMP PARALLEL DO
  • !$OMP PRIVATE ()
  • !$OMP END PARALLEL DO
  • MPI requires libraries such as MPICH
  • MPICH libraries are loaded as a module on the CRC
    nodes. Try this: prompt> module avail mpich
    (a compile-and-run sketch follows this list)
  • MPI-ready source code will include calls such as
  • MPI_COMM
  • MPI_SCATTER
  • MPI_GATHER
  • MPI_RECV
  • etc.
  • For more information on parallel programming,
    attend the training session on Parallel Computing
    on April 12th, 2007 from 10:30-11:30
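A minimal sketch of compiling and running an MPI program
interactively on a front-end machine, assuming the mpich module
puts mpicc and mpirun on your PATH (the source file name
hello_mpi.c is hypothetical):

prompt> module load mpich                # make the MPICH compiler wrappers available
prompt> mpicc -o hello_mpi hello_mpi.c   # compile the MPI C source
prompt> mpirun -np 4 ./hello_mpi         # small interactive test run (4 processes)

Larger runs should go through the batch queue with qsub, as
shown later.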

12
Modules
  • The module package provides for the dynamic
    modification of the user's environment via
    modulefiles
  • CRC staff reduce complexity for users by
    loading frequently used modules by default (e.g.,
    amber, gaussian)
  • A modulefile contains the information needed to
    configure the shell for a specific application.
  • Modules typically add to the PATH, MANPATH, and
    LD_LIBRARY_PATH environment variables
  • Modules allow a user access to multiple versions
    of software,
  • e.g., module avail matlab returns
  • matlab/6.1, 6.5, 7.0, 7.0_SP2, 7.0_SP3, 7.2, 7.3
    (default).
  • Usually, the default is set to the latest
    version by CRC system administrators
  • Our intent is for users to understand
    modulefiles, but modify them as little as
    possible, and even then only when necessary

13
Modules
  • The module package and the module command are
    initialized when a shell-specific initialization
    script is sourced into the shell.
  • The generic shell startup file is in
    /usr/local/Startup/Cshrc
  • When your account is activated, this is placed in
    your HOME directory as .cshrc
  • After making changes to .cshrc, always remember
    to source the file: prompt> source .cshrc
  • Most module files are created by the system
    administrators, although users may create their
    own.
  • Many modules are loaded by default when a user
    logs in (or uses the batch system).

14
Modules
  • Modules can be written to prevent conflicts,
    e.g., loading two different versions of an
    application.
  • May be written to force a user to specifically
    request (and understand) their changes.
  • For example, two versions of matlab cannot be
    loaded simultaneously.
  • Modulefiles are written in Tcl (Tool Command
    Language) and are interpreted by modulecmd.
  • Environment variables are unset when unloading a
    modulefile.

15
Modules
prompt> module ____
  • avail
  • List all available modulefiles in the current
    MODULEPATH
  • list
  • List all modules currently loaded.
  • load | add
  • Load modulefile into the shell environment
  • unload | rm
  • Remove modulefile from the shell environment

16
Modules
prompt> module ____
  • swap | switch modulefile1 modulefile2
  • Switch loaded modulefile1 to modulefile2
  • show | display modulefile
  • Display information about a modulefile
  • (e.g. the full path and environment changes).
  • help
  • Print the usage of each sub-command.

prompt> man module <return>
  • The which command searches your PATH for a match
    to the name specified after which ____
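For example, using the matlab versions listed earlier (a
minimal sketch):

prompt> module swap matlab/7.2 matlab/7.3   # replace the loaded 7.2 with 7.3
prompt> module show matlab                  # display the modulefile's path and environment changes
prompt> which matlab                        # confirm which matlab binary is now on your PATH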

17
Modules
prompt> module avail              # shows all modulefiles available
prompt> module avail matlab       # shows matlab versions
prompt> module load matlab
prompt> which matlab
/afs/nd.edu/i386_linux24/opt/und/matlab/7.2/bin/matlab
prompt> module unload matlab
prompt> which matlab              # should return "matlab: Command not found"
prompt> module load matlab/7.3
prompt> which matlab              # should return the full path to 7.3
18
Batch Submission and Monitoring Process
  • Hardware
  • Serial queue - x86: 64 dual-CPU Xeon (each 2
    SMP),
  • x86_64: 144 dual-core Opteron
    (dcopt)
  • Parallel queue: xeon (64), opteron (16),
    dcopt (288)
  • Serial/Parallel SMP queue: Sun Solaris 5 V880
  • The qhost command will show you the various
    architectures. Try this: prompt> qhost
    <return>
  • Batch queuing system
  • We use SGE (Sun Grid Engine) 5.3p7 which is
    released as SGEEE (Sun Grid Engine Enterprise
    Edition).

19
SGE at the CRC
  • All SGE commands and environment setup are
    contained in the sge/5.3 module which is loaded
    by default.
  • It allows for transparent use of the AFS file
    system used extensively at Notre Dame.
  • The token lifetime used for all batch jobs is
    720 hours, or 30 days

20
SGE Commands
  • SGE's command-line user interface
  • Manage queues, submit and delete jobs, check
    status of queues and jobs.
  • Prerequisite:
  • Skipping commands that are not appropriate for
    batch work (configuring the prompt, setting the
    delete character, etc.)
  • Near the top of your .cshrc file, you will see
  • if ( $?ENVIRONMENT != 0 ) exit 0
  • or if ( $?USER == 0 || $?prompt == 0 ) exit
  • This stops execution of your interactive commands
    when the shell is non-interactive (e.g., in batch
    jobs).
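A minimal sketch of how such a .cshrc is typically laid out
(the matlab and prompt lines are only illustrations, not CRC
defaults); anything a batch job also needs must appear before
the early-exit test, while interactive-only settings go after
it:

# excerpt from ~/.cshrc
module load matlab                         # environment setup that batch jobs also need

if ( $?USER == 0 || $?prompt == 0 ) exit   # non-interactive (batch) shells stop here

set prompt = "%n@%m> "                     # interactive-only settings below this point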

21
SGE client commands
  • qsub
  • The user interface for submitting a job to SGE.
  • qstat
  • A status listing of all jobs and queues
    associated with the cluster.
  • qdel
  • To delete SGE jobs, regardless of whether they
    are running or spooled.
  • qalter
  • Alters the attributes of already submitted but
    still pending jobs.
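Typical invocations of these commands (the job ID and script
name are hypothetical):

prompt> qsub sample.sh            # submit a job script to the queue
prompt> qstat -u your_netID       # list your jobs and their states
prompt> qalter -m ae 256616       # change the mail options of a still-pending job
prompt> qdel 256616               # delete a running or pending job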

22
Example Job submission
  • To run the job on the command line, type
  • prompt> qsub -l arch=lx24-amd64 -M
    afs_id@nd.edu -m ae -r y a.out
  • To submit a batch script job file, first build
    sample.sh

#!/bin/csh
#$ -l arch=lx24-amd64
#$ -M your_netID@nd.edu
#$ -m ae
#$ -r y
a.out
(Executable compiled for x86_64)
Let's look at each of the options in the shell
script above.
23
qsub options in SGE
The following options can be given to qsub on the
command line, or preceded with #$ in batch
scripts (a minimal sketch of both forms follows this list).
  • -M your_netID@nd.edu (Optional)
  • Specify an address where SGE should send email
    about your job.
  • -m abe (Optional)
  • Tell SGE to send email to the specified address
    if the job aborts (a), begins (b), or ends (e).
  • -r y or n (Optional)
  • Tell SGE if your job is rerunnable. Most jobs
    are rerunnable, but application jobs such as
    Gaussian are not, so you should specify -r n for
    Gaussian jobs
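The same options work either way; a minimal sketch (the
addresses and script name are placeholders):

# on the command line:
prompt> qsub -M your_netID@nd.edu -m ae -r n my_job.sh

# or embedded near the top of my_job.sh:
#$ -M your_netID@nd.edu
#$ -m ae
#$ -r n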

24
qsub options in SGE
  • -j y or n (Optional )
  • Specify whether or not the standard error stream
    of the job is merged with the standard output
    stream.
  • Default is to merge the standard error and the
    standard output stream.
  • -pe (parallel_queue_name) (#_of_processors)
    (Required for parallel jobs!!)
  • Try this: qconf -spl <enter> to list queue
    names
  • Specify the # of processors your job will need for
    parallel jobs. The default is one CPU. The
    maximum # of CPUs that can be requested is
    specified in the parallel queue.
  • Ex. -pe ompi-8 8

!! Jobs requesting a large # of CPUs might wait
longer in the queue
25
qsub options in SGE
  • -l: Request a resource; see options below
  • arch=lx24-amd64 or glinux or solaris64
  • Requests resources of this architecture type
  • lx24-amd64 specifies the 64-bit AMD Opteron
    architecture
  • glinux specifies the Linux architecture (32/64
    bit)
  • solaris64 specifies the Sun Solaris 64-bit
    architecture.
  • Specifying an architecture is unnecessary if you
    are running an application which is available on
    both architectures, e.g., Gaussian. (The job may
    start sooner if no architecture is specified.)
  • Compilation and run time depend upon what kind
    of architecture is used

26
Example Job submission
  • To run the job on the command line, type
  • prompt> qsub -l arch=lx24-amd64 -M
    afs_id@nd.edu -m ae -r y a.out
  • To submit a batch script job file, first build
    sample.sh

#!/bin/csh
#$ -l arch=lx24-amd64
#$ -M your_netID@nd.edu
#$ -m ae
#$ -r y
a.out
(Executable compiled for x86_64)
To submit the job into the queue, type: qsub
sample.sh <enter>
27
Real Example
Don't try this at home; this is from my files:
prompt> cd test/sundials/cvode/examples_par
prompt> vi job_submit.sh <enter>   (then edit the file)
prompt> qsub job_submit.sh <enter>
prompt> qstat -f <enter>
   ...or...
prompt> qstat -u ebensman <enter>
Look for the job_id. Output will be in
job_submit.sh.o<job_id>
28
Sample emails
  • Email at job start.
  • Job 256616 (job_submit.sh) Started
  • User = ebensman
  • Queue = dcopt145.q
  • Host = dcopt145.hpcc.nd.edu
  • Start Time = 01/08/2007 16:31:20
  • Email when job completes.
  • Job 256616 (job_submit.sh) Complete
  • User = ebensman
  • Queue = dcopt145.q
  • Host = dcopt145.hpcc.nd.edu
  • Start Time = 01/08/2007 16:31:21
  • End Time = 01/08/2007 16:31:25
  • User Time = 00:00:00
  • System Time = 00:00:01
  • Wallclock Time = 00:00:04
  • CPU = 00:00:01
  • Max vmem = 139.21M

29
Examples Job submission
  • Application jobs: gaussian, matlab, ...

sample_qsub.sh:
#!/bin/csh
#$ -l arch=glinux
#$ -M your_netID@nd.edu
#$ -m ae
#$ -r n
g03 < testDFT.com

prompt> qsub sample_qsub.sh
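A comparable sketch for a MATLAB batch job; the script name
mytest.m, the chosen architecture, and the -nodisplay/-nojvm
flags are illustrative assumptions, not taken from the slides:

#!/bin/csh
#$ -l arch=lx24-amd64
#$ -M your_netID@nd.edu
#$ -m ae
#$ -r n
module load matlab                          # if not already loaded by default
matlab -nodisplay -nojvm < mytest.m > mytest.out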
30
Parallel Job submission
  • To submit a batch script job file, e.g.,
    sample_parallel.sh

#!/bin/csh
#$ -l arch=lx24-amd64
#$ -pe ompi-4 4              # list available parallel environments with: qconf -spl
#$ -M afs_id@nd.edu
#$ -m ae
module load mpich            # if not already done
mpirun -np $NSLOTS -machinefile $TMPDIR/machines \
  /afs/nd.edu/<your_user_in_afs>/<your_netid>/<path to executable program>/a.out

To see available nodes in a queue, type: qconf
-sp <pe>, where <pe> is from the output of qconf
-spl (ex. ompi-4)

prompt> qsub sample_parallel.sh
31
Parallel Job Submission
First, build a script file called gaussian.sh:
#!/bin/csh
# pick only 1 of the next 3 lines
#$ -pe smp-xeon 2            # on Xeon cluster, or
#$ -pe smp-opteron 2         # on Opteron cluster, or...
#$ -pe smp 4                 # on Solaris nodes
#$ -M your_afs_id@nd.edu
#$ -m ae
#$ -r n
g03 < testDFT.com            # testDFT.com is the input file

prompt> qsub gaussian.sh
32
SGE client commands
  • qsub
  • The user interface for submitting a job to SGE.
  • qstat
  • The status of all jobs and queues associated with
    the cluster.
  • qdel
  • To cancel SGE jobs, regardless of whether they
    are running or spooled.
  • qalter
  • Changes the attributes of already submitted but
    still pending jobs.

33
qstat options
Jobs can be monitored using the qstat command.
  • qstat without arguments will print the
    status of all jobs, including:
  • The job ID number
  • Priority of job
  • Name of job
  • ID of user who submitted job
  • Submit or start time and date of the job
  • If running, the queue in which the job is running
  • The function of the running job (MASTER or SLAVE)
  • The job array task ID

34
qstat options
  • State of the job. States can be:
  • t(ransferring)
  • r(unning)
  • s(uspended)
  • S(uspended)
  • T(hreshold)
  • R(estarted)
  • qstat -j Job-ID
  • Prints, either for all pending jobs or for the
    jobs contained in the job list, the reason for
    not being scheduled

35
qstat options
  • qstat -f Job-ID
  • Provides a full listing of the job which has the
    listed Job-ID (or all jobs if no Job-ID is
    given).
  • The printed information for each queue includes:
  • The queue name
  • The queue type (qtype). Types or combinations of
    types can be:
  • B(atch)
  • I(nteractive)
  • C(heckpointing)
  • P(arallel)
  • T(ransfer)
  • BP: Batch, Parallel
  • BIP: Batch, Interactive, Parallel
  • The number of used and available job slots
    (used/tot.)
  • The load average (load_avg) on the queue host
  • The architecture (arch) of the queue host

36
qstat options
  • qstat -f Job-ID
  • The state of the queue. Queue states or
    combinations of states can be:
  • u(nknown)
  • a(larm)
  • A(larm)
  • C(alendar suspended)
  • s(uspended)
  • S(ubordinate)
  • d(isable)
  • D(isable)
  • E(rror)

37
SGE client commands
  • qdel
  • To delete SGE jobs, regardless of whether they
    are running or pending.
  • Usage: qdel Job_ID
  • Deletes the job that matches the Job_ID
  • To find your Job_ID, type qstat -u <your_netid>
  • Example: qstat -u ebensman
  • qalter
  • Usage: qalter Job_ID
  • qmon
  • An X-windows Motif command interface and
    monitoring facility.
  • qhost
  • displays status information about SGE execution
    hosts
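For example, to change a still-pending job's mail settings and
then check host status (the job ID below is hypothetical):

prompt> qalter -m ae -M your_netID@nd.edu 256616
prompt> qhost                     # shows load and architecture of the execution hosts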

38
qmon GUI
At a command prompt, type: qmon <return>
39
Storage
  • CRC currently has three modes for storage
  • Local scratch (/scratch) on the compute nodes
    (see the job-script sketch after this list)
  • Fastest access; not available for parallel
    processes, serial only
  • On Sun nodes: fibre channel
  • On Xeon nodes: SCSI
  • On Opteron nodes: SCSI/SATA
  • Distributed scratch space (/dscratch)
  • NetApp appliance (installed fall 2006), 30 TB
    available
  • /dscratch is not backed up like /afs/ space is
  • Users are encouraged to back up critical data
  • /dscratch is meant for short-term storage of
    non-critical data
  • Andrew File System (AFS), distributed, 6 TB
    available
  • Allows users the flexibility of common storage
    with other campus systems
  • Backed up on a routine basis; allows for
    recovery of lost data
  • Continue to look at alternative filesystems
  • Currently reviewing faster OpenAFS and Lustre
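A minimal sketch of a serial job script that stages data
through node-local /scratch; the directory layout and file
names are hypothetical, and $JOB_ID is set by SGE:

#!/bin/csh
#$ -l arch=lx24-amd64
#$ -m ae
mkdir -p /scratch/your_netID/$JOB_ID            # work in node-local scratch (fastest access)
cp ~/mydata.in /scratch/your_netID/$JOB_ID/     # stage input from your AFS home directory
cd /scratch/your_netID/$JOB_ID
~/bin/a.out < mydata.in > mydata.out            # run the serial executable
cp mydata.out ~/results/                        # copy results back; /scratch is not backed up
rm -rf /scratch/your_netID/$JOB_ID              # clean up local scratch when finished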

40
Feedback
  • Give us your feedback
  • Was this class beneficial?
  • What material should be added?
  • What material should be deleted?
  • What other training would you like to receive
    from the CRC?