Title: Working Effectively at NERSC


1
  • Working Effectively at NERSC
  • In-Depth Advice for New Users
  • October 6, 2003
  • Thomas M. DeBoni
  • NERSC User Services Group
  • TMDeBoni@LBL.GOV
  • 510-486-8617

2
Topics
  • IBM SP system
  • About seaborg
  • Hardware
  • Resources
  • Security
  • Using seaborg
  • Interactively
  • Batch jobs
  • Programming, optimization, and performance
  • HPSS Mass Storage Systems
  • About HPSS
  • Hardware and file systems
  • Using HPSS
  • Security and access utilities

3
IBM SP System - seaborg
  • A capability system
  • Intended for large jobs, rather than mass
    throughput
  • Goals are high availability, high utilization,
    and effective shared use by a large user community
  • This means the system is mostly batch-oriented
  • Some resources are dedicated to software
    development, pre- and post-processing,
    interaction with servers and mass storage, remote
    logins, etc.
  • So some interactive and fast-turnaround usage is
    possible, too

4
SP Hardware, 1
  • IBM SP seaborg.nersc.gov

5
SP Hardware, 2
  • Each Nighthawk II node contains
  • 16 Power3 processors
  • 375 MHz, 1.5 GF/s peak performance per processor
    (4 flops/cpu/clock, via the FMA floating-point
    multiply-add instruction)
  • L1 cache: 64 KB data, 32 KB instructions
  • L1 line size: 128 bytes
  • L2 cache: 8192 KB
  • Colony switch: two adapters/node
  • Most nodes contain 16 GB memory
  • 64 nodes have 32 GB
  • 4 nodes have 64 GB memory

6
Useful SP References
  • IBM Power 3 Documentation
  • http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts/sg245155.html?Open
  • NERSC Guide for New Users
  • http://hpcf.nersc.gov/help/new_user/
  • NERSC Document on Running on seaborg
  • http://hpcf.nersc.gov/computers/SP/running_jobs/

7
SP User Resources, 1
  • GPFS (General Parallel File System) is used more
    extensively on seaborg than on any other SP
  • HOME
  • Quota: 10 GB, 15000 inodes
  • Not backed up
  • SCRATCH
  • Quota: 250 GB, 50000 inodes
  • Persistent, but not permanent (may be purged)
  • The best target for parallel I/O by large-scale,
    high-performance programs
  • Special-request, temporary expansions of scratch
    space are possible, but are reviewed by management
  • Not backed up

8
SP User Resources, 2
  • Use the myquota command to see where you stand with
    regard to your limits
  • myquota
                 ------- Block (MB) -------    --------- Inode ---------
    FileSystem    Usage    Quota  InDoubt      Usage    Quota  InDoubt
    ----------  -------  -------  -------    -------  -------  -------
    /u4            4525    10240       75       6038    15000      175
    /scratch        388   256000        0        143    50000        0
  • Please use only the HOME and SCRATCH file systems;
    others exist, but they are required to keep the
    system running, and their overuse may crash nodes

9
SP Security, 1
  • Newly assigned passwords are single-use only and
    must be changed on first use
  • Password and shell changes are made on the special
    seaborg node sadmin.nersc.gov
  • Connect via ssh, using your initial password
  • Use the passwd or chsh command (example below)
  • You should be disconnected when the password change
    is done
  • The change must propagate to all nodes; this usually
    takes no more than an hour (propagation usually
    happens at ten minutes after the hour)
  • Support initializes new accounts and passwords;
    afterwards, consultants can reset passwords
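  • For example, a first password change might look like
    this (u123 is a hypothetical username)
  • ssh u123@sadmin.nersc.gov
  • passwd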

10
SP Security, 2
  • Use SSH to connect to login nodes (examples below)
  • NERSC recommends OpenSSH
  • Keep it up to date
  • Terminal session connections should be made to
    seaborg.nersc.gov
  • Do not connect to specific login nodes
  • Do not connect directly to any compute nodes
  • No incoming ftp is allowed
  • Use scp or sftp
  • Outgoing ftp is allowed
  • Don't expose cleartext passwords
  • Don't share accounts
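  • For example (u123 is a hypothetical username and
    /scratch/u123 a hypothetical scratch path)
  • ssh u123@seaborg.nersc.gov
  • scp results.dat u123@seaborg.nersc.gov:/scratch/u123/
  • sftp u123@seaborg.nersc.gov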

11
Interactive Use of seaborg, 1
  • Everything on seaborg is done in terms of nodes
    - units of compute resources
  • A node is a shared-memory computer
  • 16 cpus
  • 16, 32, or 64 GB RAM
  • Access to GPFS (parallel file system), switch,
    and networks
  • You are charged for all 16 cpus in each node you
    use, except for login sessions
  • A one-cpu batch job costs as much as a 16-cpu
    parallel batch job
  • 17 cpus cost as much as 32, etc.

12
Interactive Use of seaborg, 2
  • Two node pools available to users
  • Login nodes (6)
  • Terminal sessions, interactive serial jobs
  • Compute nodes (380)
  • Everything else
  • There are also GPFS nodes, network nodes, and
    spare nodes
  • Good general reference
  • http://hpcf.nersc.gov/computers/SP/

13
Interactive Use of seaborg, 3
  • Login sessions are the same as on any other Unix
    system; no need to specify or know which node
    you're on
  • Serial code execution is easy
  • ./a.out
  • Limits: 128 MB, 3600 CPU seconds
  • Parallel execution is slightly harder
  • poe ./a.out -nodes 2 -procs 32
  • Limits: 8 nodes (128 processors), 30 minutes
  • These limits are intentional; large/long runs
    should be run as batch jobs
  • Note: Use of poe is optional if the job was compiled
    for parallel execution

14
Interactive Use of seaborg, 4
  • When you use POE (explicitly or implicitly), you
    get resources from LoadLeveler, which manages the
    compute nodes
  • POE gets these resources from the INTERACTIVE
    class
  • You may not succeed if resources are not
    immediately available:
  • llsubmit: Processed command file through Submit
    Filter "/usr/common/nsg/etc/subfilter".
  • ERROR: 0031-365 LoadLeveler unable to run job,
    reason:
  • LoadL_negotiator: 2544-870 Step
    s00509.nersc.gov.199123.0 was not considered to
    be run in this scheduling cycle due to its
    relatively low priority or because there are not
    enough free resources.

15
Batch Use of seaborg
  • You need to submit your run as a batch job if
  • You want to make sure your job runs
  • You need more resources (cpu, memory) than the
    interactive class allows
  • You need longer run times than the interactive
    class allows
  • You don't want to share your CPU with other users
  • You can monitor a batch job while it runs
  • Use llqs, and watch the files in SCRATCH (see the
    example below)
  • There is no straightforward way to "steer" a
    batch job
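  • For example, to check queue status and watch a
    growing output file (hypothetical username and file
    name)
  • llqs -u u123
  • tail -f /scratch/u123/myjob.out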

16
Batch Execution, 1
  • LoadLeveler is your friend
  • llsubmit, llqs, llcancel

17
Batch Execution, 2
  • When you think ...        It really means ...
  •   Class                   Queue
  •   Node                    16 processors
  •   Process                 Processor
  •   Task                    Processor
  •   Execution time          Wall-clock time
  •   Thread                  Loop slice (usually)
  •   Queue limits            TODAY's queue limits
  • These definitions can be deviated from, but not
    usually to any useful effect
  • E.g., running more than 16 threads or tasks per
    node

18
Batch Execution, 3
  • Batch jobs up to 8 hours long are safe from
    system downtime
  • 8-hour warning of scheduled downtimes
  • Jobs will not be allowed to begin execution after
    T(downtime) - 8:00:00
  • Exception: the backfill class
  • Batch jobs longer than 8 hours, and backfill-class
    jobs, will be killed at a scheduled system
    downtime
  • You will be charged
  • Protect yourself with frequent checkpoints
  • There are no refunds given on seaborg

19
Batch Execution, 4
  • Policies exist governing how jobs may be
    submitted, how classes may be used, etc.
  • Premium class may be useful in urgent situations
  • Just before a publication deadline
  • Just before a major meeting
  • Charges accrue rapidly
  • New users and students tend to overuse this class
  • There are no refunds given on seaborg
  • NERSC Batch Policy Guide
  • http://hpcf.nersc.gov/computers/SP/running_jobs/batch.html#policy

20
Batch Execution, 5
  • Misbehaving jobs will be killed
  • Misbehavior means anything that inhibits other
    jobs from using their fair share of the machine,
    such as
  • Generating LOTS of files (e.g., filling system
    logs)
  • Making LOTS of calls to system()
  • Doing LOTS of small-block I/O
  • Doing LOTS of small-message communication
  • Chained or self-submission to short-runtime
    classes
  • Using compute nodes for interactive work
  • There are no refunds given on seaborg

21
Batch Execution, 6
  • Misbehavior does not (necessarily) mean
  • Idling your processors during serial work
  • Doing inefficient calculation
  • Doing inefficient I/O
  • Doing inefficient intertask communication
  • There are no refunds given on seaborg

22
Batch Scripting, 1
  • A batch job consists of a script file and its
    computational requirements
  • Codes to execute
  • Files to manipulate
  • Shell commands to execute
  • A batch script is a shell script with some
    initial comment lines that are significant to
    LoadLeveler
  • The LoadLeveler lines characterize the job and
    its resource needs
  • Submit a job by naming its script file in a
    submission command
  • llsubmit myjob

23
Batch Scripting, 2
  • cat myjob
  • # @ job_name = myjob                     (optional)
  • # @ account_no = repo_name               (optional)
  • # @ requirements = (Memory >= 65536)     (optional)
  • # @ output = myjob.out                   (advised)
  • # @ error = myjob.err                    (advised)
  • # @ environment = COPY_ALL               (advised)
  • # @ notification = complete              (advised)
  • # @ network.MPI = csss,not_shared,us     (default)
  • # @ node_usage = not_shared              (default)
  • # @ job_type = parallel                  (necessary)
  • # @ class = regular                      (necessary)
  • # @ tasks_per_node = 16                  (necessary)
  • # @ node = 4                             (necessary)
  • # @ wall_clock_limit = 01:00:00          (necessary)
  • # @ queue                                (necessary)
  • ./a.out < input_file > output_file

24
Batch Scripting, 3
  • Batch (shell) scripts can be, essentially,
    complete programs, containing
  • Sequential commands
  • Conditional operations
  • Loops
  • They can require debugging
  • They can be executed interactively
  • Large parallel executions will not occur if
    resource requirements exceed interactive limits
  • Good advice: keep it simple (a minimal sketch of
    such script logic follows)
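  • A minimal sketch of such logic, placed after the
    last LoadLeveler keyword line (the input file names
    are hypothetical)
  • for f in run1.in run2.in ; do
  •     poe ./a.out < $f > $f.out
  •     if [ $? -ne 0 ] ; then echo "failed on $f" ; exit 1 ; fi
  • done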

25
Batch Jobs, 1
  • Potentially useful things to do in a batch job
    include (a sketch follows this list)
  • Moving files to/from storage
  • Moving files to/from SCRATCH
  • Creating and listing directories
  • Renaming files to identify their origins
    (appending dates, times, etc.)
  • Echoing messages for audit-trail purposes
  • Writing restart files
  • Checking command completion status
  • Moving files to/from other systems
  • Multiple code executions
  • But watch out for expensive serial operations
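  • A hedged sketch of such a staging pattern, assuming
    hypothetical file names and the hsi utility described
    later in this talk
  • hsi "get big_input.dat"
  • ./big_job.x < big_input.dat > out.dat
  • mv out.dat out.dat.`date +%Y%m%d`
  • hsi "put out.dat.`date +%Y%m%d`"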

26
Batch Jobs, 2
  • Potentially troublesome things to do in a batch
    job include
  • Fetching files from storage (slow tape mounts)
  • Moving files to/from other systems (slow
    network transfers)
  • Compilation
  • Batch file editing
  • Post-processing output data
  • I/O to HOME (might exceed quotas)
  • Steering - anything requiring interaction with
    the job
  • Anything significant that uses less than your
    full set of parallel resources

27
Monitoring Batch Jobs, 1
  • llq
  • llqs output is formatted more nicely
  • llqs -u username
  • Status (ST) field values:
  • R  - Running
  • I  - Idle
  • NQ - Not Queued
  • ST - Starting
  • RP - Remove Pending
  • HU - User Hold
  • HS - System Hold

28
Monitoring Batch Jobs, 2
  • s00513 239> llsubmit moldy.scr
  • --------------------------------------------------------------
  •  User: deboni                       Repo: mpccc
  •  Job Name: moldyjob.216e.26         Group: mpccc
  •  Class Of Service: debug            Job Class: debug
  •  Job Accepted: Mon Aug 25 16:21:51 2003
  • --------------------------------------------------------------
  • llsubmit: Processed command file through Submit
    Filter "/usr/common/nsg/etc/subfilter".
  • llsubmit: The job "s00613.nersc.gov.69612" has
    been submitted.
  • s00513 240> llqs -u deboni
  • Step Id         JobName     UserName  Class  ST  NDS  TK  WallClck  Submit Time
  • --------------  ----------  --------  -----  --  ---  --  --------  -----------
  • s00613.69612.0  moldyjob.2  deboni    debug  I     4  16  00:29:00  8/25 16:21
  • s00513 241> llqs -u deboni
  • Step Id         JobName     UserName  Class  ST  NDS  TK  WallClck  Submit Time

29
Programming seaborg, 1
  • Languages
  • IBM's compilers are organized into compiler sets,
    with separate front ends (names)
  • Fortran 77, 90, 95, HPF: xlf, pghpf
  • C: xlc, gcc
  • C++: xlC, KCC
  • Some exist in multiple versions (xlf 7, xlf 8)
  • Some have dubious futures (KCC)
  • Some compile different language versions (C++)
  • Special versions for shared memory: xlf_r, xlc_r,
    etc.
  • Special versions for MPI: mpxlf90, mpCC, etc.
  • Note: Some of us recommend routine use of the
    _r versions, since they are compatible with
    64-bit addressing

30
Programming seaborg, 2
  • 32-bit addressing is the default
  • Gives your code access to a 2 GB address space
  • You must specify your heap and stack sizes during
    compilation
  • 64-bit addressing is available
  • Gives your code access to all the RAM on the
    large-memory nodes
  • No heap or stack size specs needed
  • Code must be fully recompiled, using the _r
    compilers, and relinked with 64-bit libraries
    (a compile sketch follows)
  • NERSC Guide to SP Memory Management
  • http://hpcf.nersc.gov/software/ibm/sp_memory.html
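  • A minimal sketch of the two modes (the heap/stack
    sizes shown are assumed example values; see the guide
    above for current recommendations)
  • 32-bit: xlf90_r -o a32.x code.f -bmaxdata:0x70000000 -bmaxstack:0x10000000
  • 64-bit: xlf90_r -q64 -o a64.x code.f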

31
Programming seaborg, 3
  • Parallelism
  • Shared memory with Pthreads, OpenMP, and IBM SMP
    directives
  • Single-node only
  • Distributed memory with MPI, LAPI
  • Single and multiple nodes
  • Max 4096 tasks (cpus)
  • Hybrid, with both OpenMP and MPI (a build/launch
    sketch follows)
  • Max 6080 CPUs (entire compute pool)
  • http://hpcf.nersc.gov/computers/SP/programming.html
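  • A hedged sketch of a hybrid build and launch,
    assuming a split of 4 MPI tasks x 4 OpenMP threads on
    one 16-cpu node
  • mpxlf90_r -qsmp=omp -O3 -o hybrid.x code.f
  • export OMP_NUM_THREADS=4
  • poe ./hybrid.x -nodes 1 -tasks_per_node 4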

32
Programming seaborg, 4
  • Libraries
  • Math
  • ESSL, PESSL, MASS, WSMP, IMSL, NAG, NAG-SMP,
    NAG-MPI, LAPACK, PARPACK, SuperLU, ccSHT, FFTW,
    ACTS (Aztec, PETSc, ScaLAPACK), Sparse Solvers,
    Random Numbers
  • Graphics
  • NCAR
  • I/O
  • netCDF, HDF, HDF5, MPI I/O
  • Physics
  • CERNLIB
  • A number of canned applications are also
    available (Gaussian 98, NWChem, etc.)
  • http://hpcf.nersc.gov/software/ibm/

33
Debugging on seaborg
  • Debugging - don't optimize an incorrect program
  • You may need to compile with the -g or -G options
  • TotalView - a visual debugger for serial and
    parallel programs
  • http://hpcf.nersc.gov/software/ibm/totalview.php
  • pdbx - an IBM text-based parallel debugger
  • http://hpcf.nersc.gov/vendor_docs/ibm/pe/am103mst12.html#HDRUPDBX
  • gdb - the GNU debugger
  • http://hpcf.nersc.gov/software/tools/GNU.html
  • Assure - a source tool for checking OpenMP usage
  • http://hpcf.nersc.gov/software/tools/kap.html
  • ZeroFault - a tool for analyzing memory usage in
    running code
  • http://hpcf.nersc.gov/software/tools/zerofault.html

34
Optimizing on seaborg
  • Optimization is essential
  • The compilers can do a lot for you, but you may
    have to experiment to find the best set of options
    (an example compile line follows this list)
  • -O<n>: start at -O3, try -O4, -O5
  • -qstrict: strict arithmetic (highly advised)
  • -qarch=pwr3: specific to seaborg (highly advised)
  • -qtune=pwr3: specific to seaborg (highly advised)
  • -qhot: high-order transforms (try it)
  • -qipa: interprocedural analysis (try it)
  • -qessl and -lessl, -lmass: high-performance
    libraries for intrinsics and ordinary arithmetic
    (try them)
  • Note: By default, arithmetic on seaborg is
    slightly better than the IEEE standard; this can be
    overridden, if desired, for compatibility with
    other machines: -qfloat=nomaf
  • http://hpcf.nersc.gov/computers/SP/options.html
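  • For example, a reasonable starting compile line
    combining the advice above (code.f is a hypothetical
    source file)
  • xlf90_r -O3 -qstrict -qarch=pwr3 -qtune=pwr3 -o a.out code.f -lessl -lmass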

35
Tuning Code on seaborg, 1
  • Measure your code's performance
  • Processor performance counters are built into the
    cpu chips, and are accessible in sets
  • Access these counters through three interfaces
  • hpmcount - a preamble command
  • hpmcount a.out
  • poe hpmcount a.out -nodes x -procs y
  • poe+ - a special utility for aggregating the
    counts for parallel jobs
  • poe+ a.out -nodes x -procs y
  • hpmlib - a library for instrumenting regions of
    code
  • http://hpcf.nersc.gov/software/ibm/hpmcount/

36
Tuning Code on seaborg, 2 - poe+ Output from a 4-Node Run

  • hpmcount (V 2.4.2) summary (aggregate of 64 POE tasks)
  • Average execution time (wall clock time):    976.545 seconds
  • Average amount of time in user mode:         964.780313 seconds
  • Average amount of time in system mode:       1.516094 seconds
  • Total maximum resident set size:             0.525768 Gbytes
  • Total shared memory use in text segment:     1476910504 KBytes*sec
  • Total unshared memory use in data segment:   51641993652 KBytes*sec
  • PM_CYC (Cycles):                             23086152670268
  • PM_INST_CMPL (Instructions completed):       26765551984318
  • PM_TLB_MISS (TLB misses):                    7839521349
  • PM_ST_CMPL (Stores completed):               4334617829585
  • PM_LD_CMPL (Loads completed):                10148999201076
  • PM_FPU0_CMPL (FPU 0 instructions):           3387695961706
  • PM_FPU1_CMPL (FPU 1 instructions):           2085096714987
  • PM_EXEC_FMA (FMAs executed):                 2695124470750
  • Utilization rate:                            98.48678125 %
  • Avg number of loads per TLB miss:            3198.709015625
  • Load and store operations:                   14483617.031 M

37
Tuning Code on seaborg, 3
  • How good is good performance?
  • Peak floating-point rate is 1.5 Gflips/processor,
    or 24 Gflips/node
  • This is not achievable, as memory cannot keep up
    with the operand demand it would generate
  • You should be able to get
  • 100 Mflips/processor with compilation options
  • 200 - 300 Mflips/processor with the right library
    choice
  • 400 - 500 Mflips/processor with careful
    engineering
  • 500+ Mflips/processor with heroic effort
  • Targets for optimization include
  • Memory - strides and cache use
  • Virtual memory - TLB use
  • Communications organization
  • I/O organization
  • The result may be a code highly customized for
    our SP system

38
Tuning Code on seaborg, 4
  • Measurement in depth
  • Profile your code with xprofiler
  • http://hpcf.nersc.gov/software/ibm/xprofiler/
  • Measure your code's MPI performance with vampir
  • http://hpcf.nersc.gov/software/tools/vampir.html
  • Analyze and tune code performance with tau
  • http://acts.nersc.gov/tau/at-nersc.html
  • Analyze execution traces with paraver, dimemas
  • http://hpcf.nersc.gov/software/tools/cepba.html

39
A Word About Modules
  • module - a Unix utility for managing libraries,
    search paths, and environment variables used in
    compilation and loading
  • Allows easy management of software packages in
    the face of evolving file systems, etc.
  • Most seaborg modules and the module utility are
    maintained by User Services Group
  • Modules are in use on all NERSC computers
  • Software packages, utilities, libraries, etc. are
    installed into modules which can be made
    available with a single command
  • Loading a module makes it available, and you
    need not know where the software components are
    stored
  • Obviates hard-coded paths to software in your
    makefiles
  • Useful module commands (example below)
  • module avail - shows all installed software
    packages
  • module load module_name - loads the named package
  • module list - shows all loaded packages
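  • For example (netcdf is used here only as a
    hypothetical package name; module avail shows what is
    actually installed)
  • module avail
  • module load netcdf
  • module list
  • After the load, the package's paths and environment
    settings are in effect for the current session, with
    no hard-coded paths needed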

40
HPSS Mass Storage Systems
  • HPSS: High Performance Storage System
  • Designed/developed by government/industry
    consortium (including IBM)
  • Used at NERSC for archival storage
  • 35 TB disk cache, 8.5 PB tape storage
  • Connects to NERSC systems at up to 150 MB/sec.
  • 4 TB/day transferred in or out
  • Two systems: archive.nersc.gov, hpss.nersc.gov
  • Access
  • On-site or off-site
  • Use hsi, ftp, or pftp
  • No clear-text passwords allowed
  • Don't use full domain names for on-site access
  • Accounts
  • Allocations by Storage Resource Units (SRUs)
  • http://hpcf.nersc.gov/storage/hpss

41
About HPSS, 1
  • archive, a.k.a. the user system, is primarily
    intended for NERSC users
  • hpss, the other system, is primarily intended for
    disaster recovery, full file system backups, etc.
  • You have space on both, but archive has more
    capacity and will likely be less busy and more
    responsive
  • HPSS is not a normal Unix file system; it is a
    separate ensemble of systems accessed by special
    utilities:
  • a hierarchy of disk and tape hardware
  • a database subsystem to keep track of files,
    tapes, disk caches, etc.
  • Access times are unpredictable, due to tape mount
    latency, but are typically short; be careful in
    batch jobs

42
About HPSS, 2
  • HPSS is not a normal Unix file system
  • It should not be used as an I/O target of running
    codes (difficult to do, in any case)
  • It can be used as an I/O target of a batch job
  • Prefetch files (potentially dangerous, due to
    tape mount latency)
  • Post-store files (good idea, but allow a bit of
    extra time in a batch job to complete it)
  • It has very fast network connections to all NERSC
    computers, but is not connected as an I/O device
  • It can do third-party transfers
  • It can do simultaneous connections to, and
    transfers between, multiple HPSS systems or sites

43
About HPSS, 3
  • There are no backups of HPSS; it is the backup
  • Via normal permissions
  • Via project directories
  • Project directories are available on request
  • A number of users need to share files
  • A special group is created, and they are made
    members of it
  • A special directory, owned by that group, is
    created in HPSS
  • The group members can move files into it, and
    share them there
  • Somebody pays for this usage, usually the
    requester

44
About HPSS, 4
  • HPSS is undergoing more or less constant upgrades
    and improvements
  • System software
  • Access utilities
  • Tape drives
  • Tapes
  • This tends to keep the NERSC systems ahead of
    demand
  • There is a regular debug/maintenance period every
    Tuesday from 10:00 AM to 12:00 noon
  • Access to HPSS will stall during this period
  • Watch out for this in batch jobs!
  • The hpss_avail inquiry utility tests for
    availability (a sketch follows)
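  • A hedged sketch of guarding an HPSS step in a batch
    script; whether hpss_avail takes a system name as an
    argument is an assumption, so check its usage on
    seaborg first
  • if hpss_avail archive ; then
  •     hsi "put out.dat"
  • else
  •     echo "HPSS unavailable - skipping archive step"
  • fi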

45
Using HPSS, 1
  • HPSS system software does not allow shells
  • No incoming ssh is possible (so, no sftp or scp)
  • Incoming ftp is allowed
  • Cleartext login names and passwords are refused
  • In-house access utilities (hsi, pftp)
    auto-authenticate after credentials are initialized
  • NERSC uses DCE authentication on HPSS
  • Support assigns initial passwords; afterwards,
    consultants can reset them
  • All authentication services are provided by a
    special authentication server, auth.nersc.gov
  • Change passwords by connecting to the special
    authentication server (see next slide)
  • There is no delay in usability of new authentication
    info, once it is set up or changed
  • http://hpcf.nersc.gov/storage/hpss/passwords

46
Using HPSS, 2
  • Access the authentication server with
  • ssh -l auth auth.nersc.gov
  • Use the password exposed by "module load www" on any
    NERSC computer
  • Change your DCE password using the chpass command
  • Get a combo string using the ftppass command
  • Store combo strings in a .netrc file to allow
    auto-authentication; set file permissions to 600
  • http://hpcf.nersc.gov/storage/hpss/ftp_nopass.html

47
Using HPSS, 3
  • Example .netrc file
  • ls -l .netrc
  • -rw-------  1 fubar  ccc  518 Apr 17  2002 .netrc
  • cat .netrc
  • machine hpss.nersc.gov
  •   login 0X2g19BrJ2tGSDF72j9NMHs5jS5Cwn2Bdlxn4zKisrk
  •   password 0X2g19BrJ2tGSDF72j9NMHs5jS5Cwn2Bdlxn4zKisrk
  • machine archive.nersc.gov
  •   login 0R0gH3BrJ2tGSDF72j9NMHs5jS5Cwn2Bdlxn4zKisrk
  •   password 0R0gH3BrJ2tGSDF72j9NMHs5jS5Cwn2Bdlxn4zKisrk

48
Using HPSS, 4
  • Access utilities
  • From outside NERSC, use ftp
  • Special authentication: encrypted combo strings
    are used for login name and password
  • Each combo string is good for use on one remote
    machine; as many combo strings can be generated
    as are needed
  • From inside NERSC, ftp may be an alias for pftp
  • A locally produced parallel version of ftp
  • Familiar command structure and syntax
  • Auto-authenticates, once credentials are
    initialized
  • Generate credentials by connecting and logging in
    normally
  • ftp -l archive
  • http://hpcf.nersc.gov/storage/hpss/ftpaccess

49
Using HPSS, 5
  • Access utilities
  • hsi
  • Uses DCE authentication (can use other forms)
  • Connects to archive by default
  • Rich, powerful, convenient command set
  • Usable as an interactive session or a one-line
    command (example below)
  • Efficient at recursive and metadata operations
  • get -R, put -R, cp -R
  • chmod, chown, mv, mdel
  • Allows simultaneous multi-site connections and
    transfers
  • Generate credentials for auto-authentication
  • hsi -l
  • http://hpcf.nersc.gov/storage/hpss/hsiaccess
  • http://hpcf.nersc.gov/storage/hpss/hsi/
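  • For example, several operations in a single one-line
    session (hypothetical file and directory names)
  • hsi "mkdir run42 ; put out.dat : run42/out.dat ; ls -l run42"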

50
Using HPSS, 6
  • Dos and Don'ts of HPSS
  • Don't open and close an access session within a
    loop; this beats up on the servers
  • Do multiple operations in a single session; e.g.,
    build a list of files to access, and get/put them
    all in one session
  • Don't store lots of small files; HPSS is
    optimized for large files
  • Do aggregate small files into larger ones for
    storage; e.g., tar can be used with hsi in a Unix
    pipe for reads or writes (see man hsi for
    details, and the sketch after this list)
  • Don't use ftp to move files around within HPSS
  • Do use hsi to rename, move, or change permissions
  • Don't let others access your HPSS space directly
  • Do tell us if you have special sharing needs
  • Broad hierarchies are (arguably) more efficient
    than deep ones
  • Watch out for name collisions from truncation
    (HPSS allows longer names than some Unix systems)
  • Watch out for your gets and puts - avoid
    accidental overwrites
  • Large recursive operations can fail from resource
    exhaustion
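  • A hedged sketch of the tar-through-hsi aggregation
    mentioned above (run_dir is a hypothetical directory;
    see man hsi for the exact pipe form, and reads work
    similarly)
  • tar cf - run_dir | hsi "put - : run_dir.tar"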
