Title: Working Effectively at NERSC
1 - Working Effectively at NERSC
- In-Depth Advice for New Users
- October 6, 2003
- Thomas M. DeBoni
- NERSC User Services Group
- TMDeBoni@LBL.GOV
- 510-486-8617
2 - Topics
- IBM SP system
- About seaborg
- Hardware
- Resources
- Security
- Using seaborg
- Interactively
- Batch jobs
- Programming, optimization, and performance
- HPSS Mass Storage Systems
- About HPSS
- Hardware and file systems
- Using HPSS
- Security and access utilities
3 - IBM SP System - seaborg
- A capability system
- Intended for large jobs, rather than mass throughput
- Goals are high availability, high utilization, and effective shared use by a large user community
- This means mostly batch-oriented use
- Some resources are dedicated to software development, pre- and post-processing, interaction with servers and mass storage, remote logins, etc.
- So, some interactive and fast-turnaround usage is possible, too
4 - SP Hardware, 1
5 - SP Hardware, 2
- Each Nighthawk II node contains
- 16 Power3 processors
- 375 MHz, 1.5 GF/s peak performance (4 flops/cpu/clock, via the FMA floating-point multiply-add instruction)
- L1 cache: 64 KB data, 32 KB instructions
- L1 line size: 128 bytes
- L2 cache: 8192 KB
- Colony switch, two adapters/node
- Most nodes contain 16 GB memory
- 64 nodes have 32 GB
- 4 nodes have 64 GB memory
6 - Useful SP References
- IBM Power 3 Documentation
- http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts/sg245155.html?Open
- NERSC Guide for New Users
- http://hpcf.nersc.gov/help/new_user/
- NERSC Document on Running on seaborg
- http://hpcf.nersc.gov/computers/SP/running_jobs/
7 - SP User Resources, 1
- GPFS (General Parallel File System) is used more extensively on seaborg than on any other SP
- HOME
- Quota: 10 GB, 15000 inodes
- Not backed up
- SCRATCH
- Quota: 250 GB, 50000 inodes
- Persistent, but not permanent (may be purged)
- The best target for parallel I/O by large-scale high-performance programs
- Special-request, temporary expansions of scratch space are possible, but are reviewed by management
- Not backed up
8 - SP User Resources, 2
- Use the myquota command to see where you stand with regard to your limits:

  myquota
                ------- Block (MB) ------    --------- Inode ---------
  FileSystem      Usage    Quota  InDoubt      Usage    Quota  InDoubt
  ----------    ------- -------- --------    ------- -------- --------
  /u4              4525    10240       75       6038    15000      175
  /scratch          388   256000        0        143    50000        0

- Please use only the HOME and SCRATCH file systems; others exist but are required to keep the system running, and their overuse may crash nodes
9 - SP Security, 1
- Newly assigned passwords are single-use only; they must be changed on first use
- Password and shell changes are made on the special seaborg node sadmin.nersc.gov (see the sketch below)
- Connect via ssh, using the initial password
- Use the passwd or chsh command
- You should be disconnected when the password change is done
- The change must propagate to all nodes; this usually takes no more than an hour (it usually happens at ten minutes after the hour)
- Support initializes new accounts and passwords; afterwards, consultants can reset passwords
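- A minimal sketch of the password-change session described above (the username is a placeholder):

  ssh myusername@sadmin.nersc.gov   # connect with the initial, single-use password
  passwd                            # change the login password (prompts for old and new)
  chsh                              # optionally change the login shell
  exit                              # log out; the change propagates to all nodes within about an hour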
10 - SP Security, 2
- Use SSH to connect to login nodes
- NERSC recommends openssh
- Keep it up to date
- Terminal session connections should be made to seaborg.nersc.gov
- Do not connect to specific login nodes
- Do not connect directly to any compute nodes
- No incoming ftp allowed
- Use scp or sftp (see the example below)
- Outgoing ftp is allowed
- Don't expose cleartext passwords
- Don't share accounts
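- A minimal sketch of file transfer with scp and sftp (the username and file name are placeholders):

  scp input.dat myusername@seaborg.nersc.gov:   # copy a local file to your seaborg home directory
  sftp myusername@seaborg.nersc.gov             # or open an interactive sftp session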
11 - Interactive Use of seaborg, 1
- Everything on seaborg is done in terms of nodes - the units of compute resources
- A node is a shared-memory computer
- 16 CPUs
- 16, 32, or 64 GB RAM
- Access to GPFS (the parallel file system), the switch, and networks
- You are charged for all 16 CPUs in each node you use, except for login sessions
- A one-CPU batch job costs as much as a 16-CPU parallel batch job
- 17 CPUs cost as much as 32, etc.
12 - Interactive Use of seaborg, 2
- Two node pools available to users
- Login nodes (6)
- Terminal sessions, interactive serial jobs
- Compute nodes (380)
- Everything else
- There are also GPFS nodes, network nodes, and spare nodes
- Good general reference
- http://hpcf.nersc.gov/computers/SP/
13 - Interactive Use of seaborg, 3
- Login sessions are the same as on any other Unix system; no need to specify or know which node you're on
- Serial code execution is easy
- ./a.out
- Limits: 128 MB, 3600 CPU seconds
- Parallel execution is slightly harder
- poe ./a.out -nodes 2 -procs 32
- Limits: 8 nodes (128 processors), 30 minutes
- These limits are intentional; large/long runs should be run as batch jobs
- Note: Use of POE is optional if the job was compiled for parallel execution
14 - Interactive Use of seaborg, 4
- When you use POE (explicitly or implicitly), you get resources from LoadLeveler, which manages the compute nodes
- POE gets these resources from the INTERACTIVE class
- You may not succeed if resources are not immediately available:

  llsubmit: Processed command file through Submit Filter: "/usr/common/nsg/etc/subfilter".
  ERROR: 0031-365 LoadLeveler unable to run job, reason:
  LoadL_negotiator: 2544-870 Step s00509.nersc.gov.199123.0 was not considered to be run in this
  scheduling cycle due to its relatively low priority or because there are not enough free resources.
15 - Batch Use of seaborg
- You need to submit your run as a batch job if
- You want to make sure your job runs
- You need more resources (CPU, memory) than the interactive class allows
- You need longer run times than the interactive class allows
- You don't want to share your CPU with other users
- You can monitor a batch job while it runs
- Use llqs, watch the files in SCRATCH
- There is no straightforward way to steer a batch job
16 - Batch Execution, 1
- LoadLeveler is your friend
- llsubmit, llqs, llcancel
17 - Batch Execution, 2
- When you think ...        It really means ...
- Class                     Queue
- Node                      16 processors
- Process                   Processor
- Task                      Processor
- Execution time            Wall-clock time
- Thread                    Loop slice (usually)
- Queue limits              TODAY's queue limits
- These definitions can be deviated from, but not usually to any useful effect
- E.g., running more than 16 threads or tasks per node
18 - Batch Execution, 3
- Batch jobs up to 8 hours long are safe from system downtime
- 8-hour warning of scheduled downtimes
- Jobs will not be allowed to begin execution after T(downtime) - 8:00:00
- Exception: the backfill class
- Batch jobs longer than 8 hours, and backfill-class jobs, will be killed at a scheduled system downtime
- You will be charged
- Protect yourself with frequent checkpoints
- There are no refunds given on seaborg
19 - Batch Execution, 4
- Policies exist governing how jobs may be submitted, how classes may be used, etc.
- Premium class may be useful in urgent situations
- Just before a publication deadline
- Just before a major meeting
- Charges accrue rapidly
- New users and students tend to overuse this class
- There are no refunds given on seaborg
- NERSC Batch Policy Guide
- http://hpcf.nersc.gov/computers/SP/running_jobs/batch.html#policy
20 - Batch Execution, 5
- Misbehaving jobs will be killed
- Misbehavior means anything that inhibits other jobs from using their fair share of the machine, such as
- Generating LOTS of files (e.g., filling system logs)
- Making LOTS of calls to system()
- Doing LOTS of small-block I/O
- Doing LOTS of small-message communication
- Chained or self-submission to short-runtime classes
- Using compute nodes for interactive work
- There are no refunds given on seaborg
21 - Batch Execution, 6
- Misbehavior does not (necessarily) mean
- Idling your processors during serial work
- Doing inefficient calculation
- Doing inefficient I/O
- Doing inefficient intertask communication
- There are no refunds given on seaborg
22 - Batch Scripting, 1
- A batch job consists of a script file and its computational requirements
- Codes to execute
- Files to manipulate
- Shell commands to execute
- A batch script is a shell script with some initial comment lines that are significant to LoadLeveler
- The LoadLeveler lines characterize the job and its resource needs
- Submit a job by naming its script file in a submission command
- llsubmit myjob
23 - Batch Scripting, 2
- cat myjob

  # @ job_name         = myjob                 (optional)
  # @ account_no       = repo_name             (optional)
  # @ requirements     = (Memory >= 65536)     (optional)
  # @ output           = myjob.out             (advised)
  # @ error            = myjob.err             (advised)
  # @ environment      = COPY_ALL              (advised)
  # @ notification     = complete              (advised)
  # @ network.MPI      = csss,not_shared,us    (default)
  # @ node_usage       = not_shared            (default)
  # @ job_type         = parallel              (necessary)
  # @ class            = regular               (necessary)
  # @ tasks_per_node   = 16                    (necessary)
  # @ node             = 4                     (necessary)
  # @ wall_clock_limit = 01:00:00              (necessary)
  # @ queue                                    (necessary)
  ./a.out < input_file > output_file
24 - Batch Scripting, 3
- Batch (shell) scripts can be, essentially, complete programs, containing
- Sequential commands
- Conditional operations
- Loops
- They can require debugging
- They can be executed interactively
- Large parallel executions will not occur if resource requirements exceed interactive limits
- Good advice: keep it simple
25 - Batch Jobs, 1
- Potentially useful things to do in a batch job include (see the sketch after this list)
- Moving files to/from storage
- Moving files to/from SCRATCH
- Creating and listing directories
- Renaming files to identify their origins (appending dates, times, etc.)
- Echoing messages for audit-trail purposes
- Writing restart files
- Checking command completion status
- Moving files to/from other systems
- Multiple code executions
- But watch out for expensive serial operations
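- A minimal sketch of a batch-script body that combines several of these steps; file names and the executable are placeholders, and hsi credentials are assumed to be initialized (see the HPSS slides):

  cd $SCRATCH
  hsi "get run42.in"                    # prefetch input from HPSS (tape mounts can be slow)
  echo "input staged at `date`"         # audit-trail message in the job's output file

  poe ./a.out < run42.in > run42.out    # the parallel execution
  status=$?
  echo "a.out exited with status $status at `date`"

  STAMP=`date +%Y%m%d.%H%M`
  mv run42.out run42.out.$STAMP         # rename output to identify its origin
  hsi "put run42.out.$STAMP"            # post-store the result in HPSS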
26 - Batch Jobs, 2
- Potentially troublesome things to do in a batch job include
- Fetching files from storage (slow tape mounts)
- Moving files to/from other systems (slow network transfers)
- Compilation
- Batch file editing
- Post-processing output data
- I/O to HOME (might exceed quotas)
- Steering - anything requiring interaction with the job
- Anything significant that uses less than your full set of parallel resources
27 - Monitoring Batch Jobs, 1
- llq
- llqs is more nicely formatted
- llqs -u username
- Status (ST) field values:
- R  - Running
- I  - Idle
- NQ - Not Queued
- ST - Starting
- RP - Remove Pending
- HU - User Hold
- HS - System Hold
28 - Monitoring Batch Jobs, 2

  s00513 239 llsubmit moldy.scr
  --------------------------------------------------------------
  User: deboni                     Repo: mpccc
  Job Name: moldyjob.216e.26       Group: mpccc
  Class Of Service: debug          Job Class: debug
  Job Accepted: Mon Aug 25 16:21:51 2003
  --------------------------------------------------------------
  llsubmit: Processed command file through Submit Filter: "/usr/common/nsg/etc/subfilter".
  llsubmit: The job "s00613.nersc.gov.69612" has been submitted.

  s00513 240 llqs -u deboni
  Step Id          JobName    UserName Class     ST NDS TK WallClck Submit Time
  ---------------  ---------- -------- --------- -- --- -- -------- -----------
  s00613.69612.0   moldyjob.2 deboni   debug     I    4 16 00:29:00 8/25 16:21

  s00513 241 llqs -u deboni
  Step Id          JobName    UserName Class     ST NDS TK WallClck Submit Time
29 - Programming seaborg, 1
- Languages
- IBM's compilers are organized into compiler sets, with separate front ends (names)
- Fortran 77, 90, 95, HPF: xlf, pghpf
- C: xlc, gcc
- C++: xlC, KCC
- Some exist in multiple versions (xlf 7, xlf 8)
- Some have dubious futures (KCC)
- Some compile different language versions (C++)
- Special versions for shared memory: xlf_r, xlc_r, etc.
- Special versions for MPI: mpxlf90, mpCC, etc. (see the compile-line sketch below)
- Note: Some of us recommend routine use of the _r versions, since they are compatible with 64-bit addressing
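- A minimal sketch of typical compile lines using the compilers named above (source and output names are placeholders):

  xlf90_r -o mycode mycode.f90             # serial Fortran 90, thread-safe (_r) front end
  mpxlf90_r -o mycode_mpi mycode_mpi.f90   # MPI Fortran 90
  mpCC_r -o mycode_mpi mycode_mpi.C        # MPI C++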
30 - Programming seaborg, 2
- 32-bit addressing is the default
- Gives your code access to a 2 GB address space
- You must specify your heap and stack sizes during compilation (see the sketch below)
- 64-bit addressing is available
- Gives your code access to all the RAM on the large-memory nodes
- No heap or stack size specs needed
- Code must be fully compiled, using the _r compilers, and relinked with 64-bit libraries
- NERSC Guide to SP Memory Management
- http://hpcf.nersc.gov/software/ibm/sp_memory.html
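- A minimal sketch of the relevant compile/link flags; the sizes shown are illustrative, not NERSC recommendations:

  # 32-bit (default): request the heap and stack sizes at link time
  xlf90_r -o mycode mycode.f90 -bmaxdata:0x40000000 -bmaxstack:0x10000000
  # 64-bit: compile and link everything with -q64; no heap/stack specs are needed
  xlf90_r -q64 -o mycode mycode.f90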
31 - Programming seaborg, 3
- Parallelism
- Shared memory with Pthreads, OpenMP, and IBM SMP directives
- Single-node only
- Distributed memory with MPI, LAPI
- Single and multiple nodes
- Max 4096 tasks (CPUs)
- Hybrid, with both OpenMP and MPI (see the sketch below)
- Max 6080 CPUs (entire compute pool)
- http://hpcf.nersc.gov/computers/SP/programming.html
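- A minimal sketch of building and launching a hybrid MPI/OpenMP code (counts are illustrative; the node count and tasks per node come from the LoadLeveler keywords):

  mpxlf90_r -qsmp=omp -o hybrid hybrid.f90   # thread-safe MPI wrapper plus OpenMP support
  export OMP_NUM_THREADS=4                   # 4 OpenMP threads per MPI task
  poe ./hybrid                               # e.g. with "# @ tasks_per_node = 4" in the batch script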
32 - Programming seaborg, 4
- Libraries
- Math
- ESSL, PESSL, MASS, WSMP, IMSL, NAG, NAG-SMP, NAG-MPI, LAPACK, PARPACK, SuperLU, ccSHT, FFTW, ACTS (Aztec, PETSc, ScaLAPACK), Sparse Solvers, Random Numbers
- Graphics
- NCAR
- I/O
- netCDF, HDF, HDF5, MPI I/O
- Physics
- CERNLIB
- A number of canned applications are also available (Gaussian 98, NWChem, etc.)
- http://hpcf.nersc.gov/software/ibm/
33 - Debugging on seaborg
- Debugging - don't optimize an incorrect program
- You may need to compile with the -g or -G options
- TotalView - a visual debugger for serial and parallel programs
- http://hpcf.nersc.gov/software/ibm/totalview.php
- pdbx - an IBM text-based parallel debugger
- http://hpcf.nersc.gov/vendor_docs/ibm/pe/am103mst12.html#HDRUPDBX
- gdb - the GNU debugger
- http://hpcf.nersc.gov/software/tools/GNU.html
- Assure - a source tool for checking OpenMP usage
- http://hpcf.nersc.gov/software/tools/kap.html
- ZeroFault - a tool for analyzing memory usage in running code
- http://hpcf.nersc.gov/software/tools/zerofault.html
34 - Optimizing on seaborg
- Optimization is essential
- The compilers can do a lot for you, but you may have to experiment to find the best set of options (see the example compile line below)
- -On               start at -O3, try -O4, -O5
- -qstrict          strict arithmetic           (highly advised)
- -qarch=pwr3       specific for seaborg        (highly advised)
- -qtune=pwr3       specific for seaborg        (highly advised)
- -qhot             high-order transforms       (try it)
- -qipa             interprocedural analysis    (try it)
- -qessl, -lessl, -lmass   high-performance libraries for intrinsics and ordinary arithmetic   (try them)
- Note: By default, arithmetic on seaborg is slightly better than the IEEE standard; this can be overridden, if desired, for compatibility with other machines (-qfloat=nomaf)
- http://hpcf.nersc.gov/computers/SP/options.html
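- A minimal sketch of a tuned compile line combining the options above (the source file name is a placeholder):

  xlf90_r -O3 -qstrict -qarch=pwr3 -qtune=pwr3 -o mycode mycode.f90 -lessl -lmass
  # then experiment with -O4/-O5, -qhot, and -qipa, checking both speed and answers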
35 - Tuning Code on seaborg, 1
- Measure your code's performance
- Processor performance counters are built into the CPU chips, and are accessible in sets
- Access these counters through three interfaces
- hpmcount - a preamble command
- hpmcount a.out
- poe hpmcount a.out -nodes x -procs y
- poe+ - a special utility for aggregating the counts for parallel jobs
- poe+ a.out -nodes x -procs y
- hpmlib - a library for instrumenting regions of code
- http://hpcf.nersc.gov/software/ibm/hpmcount/
36 - Tuning Code on seaborg, 2 - POE Output from a 4-Node Run
- hpmcount (V 2.4.2) summary (aggregate of 64 POE tasks)

  Average execution time (wall clock time):     976.545 seconds
  Average amount of time in user mode:          964.780313 seconds
  Average amount of time in system mode:        1.516094 seconds
  Total maximum resident set size:              0.525768 Gbytes
  Total shared memory use in text segment:      1476910504 Kbytes*sec
  Total unshared memory use in data segment:    51641993652 Kbytes*sec
  PM_CYC (Cycles):                              23086152670268
  PM_INST_CMPL (Instructions completed):        26765551984318
  PM_TLB_MISS (TLB misses):                     7839521349
  PM_ST_CMPL (Stores completed):                4334617829585
  PM_LD_CMPL (Loads completed):                 10148999201076
  PM_FPU0_CMPL (FPU 0 instructions):            3387695961706
  PM_FPU1_CMPL (FPU 1 instructions):            2085096714987
  PM_EXEC_FMA (FMAs executed):                  2695124470750
  Utilization rate:                             98.48678125 %
  Avg number of loads per TLB miss:             3198.709015625
  Load and store operations:                    14483617.031 M
37 - Tuning Code on seaborg, 3
- How good is good performance?
- Peak floating-point rate is 1.5 Gflops/processor, or 24 Gflops/node
- This is not achievable, as memory cannot keep up with the operand demand it would generate
- You should be able to get
- 100 Mflops/processor with compilation options
- 200 - 300 Mflops/processor with the right library choice
- 400 - 500 Mflops/processor with careful engineering
- 500+ Mflops/processor with heroic effort
- Targets for optimization include
- Memory - strides and cache use
- Virtual memory - TLB use
- Communications organization
- I/O organization
- The result may be a code highly customized for our SP system
38 - Tuning Code on seaborg, 4
- Measurement in depth
- Profile your code with xprofiler
- http://hpcf.nersc.gov/software/ibm/xprofiler/
- Measure your code's MPI performance with vampir
- http://hpcf.nersc.gov/software/tools/vampir.html
- Analyze and tune code performance with tau
- http://acts.nersc.gov/tau/at-nersc.html
- Analyze execution traces with paraver, dimemas
- http://hpcf.nersc.gov/software/tools/cepba.html
39 - A Word About Modules
- module - a Unix utility for managing libraries, search paths, and environment variables used in compilation and loading
- Allows easy management of software packages in the face of evolving file systems, etc.
- Most seaborg modules and the module utility are maintained by the User Services Group
- Modules are in use on all NERSC computers
- Software packages, utilities, libraries, etc. are installed into modules, which can be made available with a single command
- Loading a module makes it available, and you need not know where the software components are stored
- Obviates hard-coded paths to software in your makefiles
- Useful module commands (see the example session below)
- module avail - shows all installed software packages
- module load module_name - loads the named package
- module list - shows all loaded packages
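- A minimal sketch of a module session (the package name is illustrative; use module avail to see what is actually installed):

  module avail           # list all installed software packages
  module load nag        # load one of them, e.g. the NAG library module
  module list            # confirm what is currently loaded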
40 - HPSS Mass Storage Systems
- HPSS: High Performance Storage System
- Designed/developed by a government/industry consortium (including IBM)
- Used at NERSC for archival storage
- 35 TB disk cache, 8.5 PB tape storage
- Connects to NERSC systems at up to 150 MB/sec
- 4 TB/day transferred in or out
- Two systems: archive.nersc.gov, hpss.nersc.gov
- Access
- On-site or off-site
- Use hsi, ftp, or pftp
- No clear-text passwords allowed
- Don't use full domain names for on-site access
- Accounts
- Allocations by Storage Resource Units (SRUs)
- http://hpcf.nersc.gov/storage/hpss
41 - About HPSS, 1
- archive, aka the user system, is primarily intended for NERSC users
- hpss, the other system, is primarily intended for disaster recovery, full file system backups, etc.
- You have space on both, but archive has more capacity and will likely be less busy and more responsive
- HPSS is not a normal Unix file system; it is a separate ensemble of systems accessed by special utilities
- a hierarchy of disk and tape hardware
- a database subsystem to keep track of files, tapes, disk caches, etc.
- Access times are unpredictable, due to tape mount latency, but are typically short; be careful in batch jobs
42 - About HPSS, 2
- HPSS is not a normal Unix file system
- It should not be used as an I/O target of running codes (difficult to do, in any case)
- It can be used as an I/O target of a batch job
- Prefetch files (potentially dangerous, due to tape mount latency)
- Post-store files (good idea, but allow a bit of extra time in a batch job to complete it)
- It has very fast network connections to all NERSC computers, but is not connected as an I/O device
- It can do third-party transfers
- It can do simultaneous connections to, and transfers between, multiple HPSS systems or sites
43 - About HPSS, 3
- There are no backups of HPSS; it is the backup
- Files and directories can be shared
- Via normal permissions
- Via project directories
- Project directories are available on request
- A number of users need to share files
- A special group is created, and they are made members of it
- A special directory, owned by that group, is created in HPSS
- The group members can move files into it, and share them there
- Somebody pays for this usage, usually the requester
44 - About HPSS, 4
- HPSS is undergoing more or less constant upgrades and improvements
- System software
- Access utilities
- Tape drives
- Tapes
- This tends to keep the NERSC systems ahead of demand
- There is a regular debug/maintenance period every Tuesday from 10:00 AM to 12:00 noon
- Access to HPSS will stall during this period
- Watch out for this in batch jobs!
- The hpss_avail inquiry utility tests for availability (see the sketch below)
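- A minimal sketch of guarding HPSS access in a batch script; the exact hpss_avail arguments and exit convention are assumptions, so check its documentation before relying on them:

  if hpss_avail archive      # assumed: exits 0 when the archive system is up
  then
      hsi "put results.tar"
  else
      echo "HPSS not available (e.g. Tuesday maintenance); archiving skipped"
  fi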
45 - Using HPSS, 1
- HPSS system software does not allow shells
- No incoming ssh is possible (so, no sftp or scp)
- Incoming ftp is allowed
- Cleartext login names and passwords are refused
- In-house access utilities (hsi, pftp) auto-authenticate after credentials are initialized
- NERSC uses DCE authentication on HPSS
- Support assigns initial passwords; afterwards, consultants can reset them
- All authentication services are provided by a special authentication server, auth.nersc.gov
- Change passwords by connecting to the special authentication server (see next slide)
- There is no delay in usability of new authentication info, once set up or changed
- http://hpcf.nersc.gov/storage/hpss/passwords
46 - Using HPSS, 2
- Access the authentication server by
- ssh -l auth auth.nersc.gov
- Use the password exposed by "module load www" on any NERSC computer
- Change your DCE password by using the chpass command
- Get a combo string by using the ftppass command
- Store combo strings in a .netrc file to allow auto-authentication; set file permissions to 600
- http://hpcf.nersc.gov/storage/hpss/ftp_nopass.html
47 - Using HPSS, 3
- Example .netrc file

  ls -l .netrc
  -rw-------  1 fubar  ccc  518 Apr 17  2002 .netrc
  cat .netrc
  machine hpss.nersc.gov
  login 0X2g19BrJ2tGSDF72j9NMHs5jS5Cwn2Bdlxn4zKisrk
  password 0X2g19BrJ2tGSDF72j9NMHs5jS5Cwn2Bdlxn4zKisrk
  machine archive.nersc.gov
  login 0R0gH3BrJ2tGSDF72j9NMHs5jS5Cwn2Bdlxn4zKisrk
  password 0R0gH3BrJ2tGSDF72j9NMHs5jS5Cwn2Bdlxn4zKisrk
48 - Using HPSS, 4
- Access utilities
- From outside NERSC, use ftp
- Special authentication: encrypted combo strings are used for the login name and password
- Each combo string is good for use on one remote machine; as many combo strings can be generated as are needed
- From inside NERSC, ftp may be an alias for pftp
- A locally-produced parallel version of ftp
- Familiar command structure and syntax
- Auto-authenticates, once credentials are initialized
- Generate credentials by connecting and logging in normally
- ftp -l archive
- http://hpcf.nersc.gov/storage/hpss/ftpaccess
49 - Using HPSS, 5
- Access utilities
- hsi
- Uses DCE authentication (can use other forms)
- Connects to archive by default
- Rich, powerful, convenient command set (see the sketch below)
- Usable as an interactive session or a one-line command
- Efficient at recursive and meta-data operations
- get -R, put -R, cp -R
- chmod, chown, mv, mdel
- Allows simultaneous multi-site connections and transfers
- Generate credentials for auto-authentication
- hsi -l
- http://hpcf.nersc.gov/storage/hpss/hsiaccess
- http://hpcf.nersc.gov/storage/hpss/hsi/
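- A minimal sketch of hsi usage in one-line mode (directory and file names are placeholders; credentials are assumed to be initialized):

  hsi "mkdir run42; cd run42; put results.tar"    # several commands in one session
  hsi "get -R run42"                              # recursive retrieval
  hsi "chmod 640 run42/results.tar"               # metadata operation
  hsi                                             # or start an interactive session and issue the same commands at its prompt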
50 - Using HPSS, 6
- Dos and Don'ts of HPSS
- Don't open and close an access session within a loop; this beats up on the servers
- Do multiple operations in a single session; e.g., build a list of files to access, and get/put them all in one session
- Don't store lots of small files; HPSS is optimized for large files
- Do aggregate small files into larger ones for storage; e.g., tar can be used with hsi in a Unix pipe for reads or writes (see man hsi for details, and the sketch after this list)
- Don't use ftp to move files around within HPSS
- Do use hsi to rename, move, or change permissions
- Don't let others access your HPSS space directly
- Do tell us if you have special sharing needs
- Broad hierarchies are (arguably) more efficient than deep ones
- Watch out for name collisions from truncation (HPSS allows longer names than some Unix systems)
- Watch out for your gets and puts - avoid accidental overwrites
- Large recursive operations can fail from resource exhaustion
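- A minimal sketch of aggregating small files through a Unix pipe with tar and hsi (the names are placeholders; see man hsi for the exact syntax supported):

  tar cf - small_files_dir | hsi "put - : small_files.tar"   # write: store many small files as one HPSS file
  hsi "get - : small_files.tar" | tar xf -                   # read: retrieve and unpack in one step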