Title: Running on the SDSC Blue Gene
1. Running on the SDSC Blue Gene
- Mahidhar Tatineni
- Blue Gene Workshop
- SDSC, April 5, 2007
2. BG System Overview: SDSC's three-rack system
3. BG System Overview: Integrated system
4. BG System Overview: Multiple operating systems and functions
- Compute nodes run the Compute Node Kernel (CNK, blrts)
- Each runs only one job at a time
- Each uses very little memory for the CNK
- I/O nodes run Embedded Linux
- Run CIOD to manage compute nodes
- Perform file I/O
- Run GPFS
- Front-end nodes run SuSE Linux
- Support user logins
- Run cross compilers and linker
- Run parts of mpirun to submit jobs and LoadLeveler to manage jobs
- Service node runs SuSE Linux
- Uses DB2 to manage four system databases
- Runs control system software, including MMCS
- Runs other parts of mpirun and LoadLeveler
- Software comes in drivers; we are currently running Driver V1R3M1
5. SDSC Blue Gene Getting Started: Logging on and moving files
- Logging on
- ssh bglogin.sdsc.edu
- or
- ssh -l username bglogin.sdsc.edu
- Alternate login node bg-login4.sdsc.edu
- (We will use bg-login4 for the workshop)
- Moving files
- scp file username@bglogin.sdsc.edu:
- or
- scp -r directory username@bglogin.sdsc.edu:
6. SDSC Blue Gene Getting Started: Places to store your files
- /users (home directory)
- 1.1 TB NFS-mounted file system
- Recommended for storing source / important files.
- Do not write data/output to this area: slow and limited in size!
- Regular backups
- /bggpfs available for parallel I/O via GPFS
- 18.5 TB accessed via IA-64 NSD servers
- No backups
- 700 TB /gpfs-wan available for parallel I/O and
shared with DataStar and TG IA-64 cluster.
7. SDSC Blue Gene: Checking your allocation
- Use the reslist command to check your allocation on the SDSC Blue Gene
- Sample output is as follows:

    bg-login1 mahidhar/bg_workshop> reslist -u ux452208
    Querying database, this may take several seconds ...
    Output shown is local machine usage. For full usage on
    roaming accounts, please use tgusage.

    SBG  Blue Gene at SDSC
                                          SU Hours    SU Hours
    Name      UID     ACID  ACC  PCTG     ALLOCATED   USED      USER
    ux452208  452208  1606  U    100      99999       0         Guest8, Hpc
    MKG000            1606                99999       40
8. Accessing HPSS from the Blue Gene
- What is HPSS?
- The centralized, long-term data storage system at SDSC is the High Performance Storage System (HPSS)
- Set up your authentication
- Run the get_hpss_keytab script
- Use the hsi and htar clients to connect to HPSS; for example (a fuller session follows):
- hsi put mytar.tar
- htar -c -f mytar.tar -L file_or_directory
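- A minimal end-to-end session, assuming the keytab setup has already been run (file and directory names are illustrative):

    hsi put results.tar             # store a local tar file in HPSS
    hsi get results.tar             # retrieve it later
    htar -c -f run01.tar run01      # bundle a directory directly into an HPSS tar
    htar -x -f run01.tar            # extract it back from HPSS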
9. Using the compilers: Important programming considerations
- Front-end nodes have different processors and run a different OS than the compute nodes
- Hence codes must be cross-compiled
- Care must be taken with configure scripts
- Discovery of system characteristics during compilation (e.g., via configure) may require modifications to the configure script
- Make sure that if code has to be executed during the configure, it runs on the compute nodes
- Alternately, system characteristics can be specified by the user and the configure script modified to take this into account (see the sketch after this list)
- Some system calls are not supported by the compute node kernel
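- A hedged sketch of a cross-compiling configure invocation; the host triplet and pre-seeded cache variable are assumptions and vary by package:

    # Tell configure not to run test binaries on the front end, and
    # pre-seed a result it would otherwise discover by executing code.
    ./configure --host=powerpc-bgl-blrts-gnu \
                CC=blrts_xlc F77=blrts_xlf \
                ac_cv_func_mmap_fixed_mapped=yes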
10. Using the compilers: Compiler versions, paths, wrappers
- Compilers (version numbers the same as on DataStar)
- XL Fortran V10.1: blrts_xlf, blrts_xlf90
- XL C/C++ V8.0: blrts_xlc, blrts_xlC
- Paths to compilers in default .bashrc
- export PATH=/opt/ibmcmp/xlf/bg/10.1/bin:$PATH
- export PATH=/opt/ibmcmp/vac/bg/8.0/bin:$PATH
- export PATH=/opt/ibmcmp/vacpp/bg/8.0/bin:$PATH
- Compilers with MPI wrappers (recommended)
- mpxlf, mpxlf90, mpcc, mpCC
- Path to MPI-wrapped compilers in default .bashrc
- export PATH=/usr/local/apps/bin:$PATH
11. Using the compilers: Options
- Compiler options
- -qarch=440 uses only a single FPU per processor (minimum option)
- -qarch=440d allows both FPUs per processor (alternate option)
- -qtune=440 tunes for the 440 processor
- -O3 gives minimal optimization with no SIMDization
- -O3 -qarch=440d adds back-end SIMDization
- -O3 -qhot adds TPO (a high-level inter-procedural optimizer), SIMDization, and more loop optimization
- -O4 adds compile-time interprocedural analysis
- -O5 adds link-time interprocedural analysis
- (TPO SIMDization is the default with -O4 and -O5)
- Current recommendation (see the example below)
- Start with -O3 -qarch=440d -qtune=440
- Try -O4, -O5 next
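- For example, a compile using the recommended starting options (file and program names are illustrative):

    mpcc -O3 -qarch=440d -qtune=440 -o mycode mycode.c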
12. Using libraries
- ESSL
- Version 4.2 is available in /usr/local/apps/lib
- MASS/MASSV
- Version 4.3 is available in /usr/local/apps/lib
- FFTW
- Versions 2.1.5 and 3.1.2 are available in both single and double precision. The libraries are located in /usr/local/apps/V1R3
- NETCDF
- Versions 3.6.0p1 and 3.6.1 are available in /usr/local/apps/V1R3
- Example link paths (a full link line follows)
- -Wl,--allow-multiple-definition -L/usr/local/apps/lib -lmassv -lmass -lesslbg -L/usr/local/apps/V1R3/fftw-3.1.2s/lib -lfftw3f
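- A complete link line using these paths might look as follows (the program name is illustrative):

    mpxlf -o myapp myapp.f \
          -Wl,--allow-multiple-definition \
          -L/usr/local/apps/lib -lmassv -lmass -lesslbg \
          -L/usr/local/apps/V1R3/fftw-3.1.2s/lib -lfftw3f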
13. Running jobs: Overview
- There are two compute modes
- Coprocessor (CO) mode: one compute processor per node
- Virtual node (VN) mode: two compute processors per node
- Jobs run in partitions or blocks
- These are typically powers of two
- Blocks must be allocated (or booted) before a run and are restricted to a single user at a time
- Only batch jobs are supported
- Batch jobs are managed by LoadLeveler
- Users can monitor jobs using llq -b and llq -x
14. Running jobs: LoadLeveler for batch jobs
- Here is an example LoadLeveler run script (test.cmd):

    #!/usr/bin/ksh
    # @ environment = COPY_ALL
    # @ job_type = BlueGene
    # @ account_no = <your user account>
    # @ class = parallel
    # @ bg_partition = <partition name, for example top>
    # @ output = file.$(jobid).out
    # @ error = file.$(jobid).err
    # @ notification = complete
    # @ notify_user = <your email address>
    # @ wall_clock_limit = 00:10:00
    # @ queue
    mpirun -mode VN -np <number of procs> -exe <your executable> -cwd <working directory>

- Submit as follows:
- llsubmit test.cmd
15. Running jobs: mpirun options
- Key mpirun options (combined in the example below) are
- -mode: compute mode, CO or VN
- -np: number of compute processors
- -mapfile: logical mapping of processors
- -cwd: full path of the current working directory
- -exe: full path of the executable
- -args: arguments of the executable (in double quotes)
- -env: environment variables (in double quotes)
- (These are mostly different from the TeraGrid options)
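- Putting several options together; all values below, including the environment variable, are illustrative:

    mpirun -mode VN -np 128 \
           -cwd /bggpfs/projects/myproj \
           -exe /bggpfs/projects/myproj/mycode \
           -args "-n 1000 -niter 50" \
           -env "MYVAR=1"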
16. Running jobs: Partition Layout and Usage Guidelines
- To make effective use of the Blue Gene, production runs should generally use one-fourth or more of the machine, i.e., 256 or more compute nodes. Thus predefined partitions are provided for production runs.
- SDSC: all 3,072 nodes
- R01R02: 2,048 nodes combining racks 1 and 2
- rack, R01, R02: all 1,024 nodes of rack 0, rack 1, and rack 2, respectively
- top, bot; R01-top, R01-bot; R02-top, R02-bot: 512 nodes each
- top256-1, top256-2: 256 nodes in each half of the top midplane of rack 0
- bot256-1, bot256-2: 256 nodes in each half of the bottom midplane of rack 0
- Smaller 64-node (bot64-1, ..., bot64-8) and 128-node (bot128-1, ..., bot128-4) partitions are available for test runs.
- Use the /usr/local/apps/utils/showq command to get more information on the partition requests of jobs in the queue.
17. Running jobs: Partition Layout
18. Running Jobs: Reservation
- There is a reservation in place for today's workshop for all the guest users.
- The reservation ID is bgsn.76.r
- Set the LL_RES_ID variable to bgsn.76.r. This will automatically bind jobs to the reservation.
- csh/tcsh: setenv LL_RES_ID bgsn.76.r
- bash: export LL_RES_ID=bgsn.76.r
19. Running Jobs: Example 1
- The examples featured in today's talk are included in the following directory:
- /bggpfs/projects/bg_workshop
- Copy them to your directory by using the following command:
- cp -r /bggpfs/projects/bg_workshop /users/<your_dir>
- In the first example we will compile a simple MPI program (mpi_hello_c.c/mpi_hello_f.f) and use the sample LoadLeveler script (example1.cmd) to submit and run the job. A minimal sketch of such a program follows.
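- A minimal MPI "hello" in C, sketching what mpi_hello_c.c likely contains (the workshop file itself may differ):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);                 /* start MPI */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's rank */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of tasks */
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }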
20. Example 1 (contd.)
- Compile the example files using the mpcc/mpxlf wrappers:
- mpcc -o hello mpi_hello_c.c
- mpxlf -o hello mpi_hello_f.f
- Modify the LoadLeveler submit file (example1.cmd): add the account number, partition name, email address, and mpirun options
- Use llsubmit to put the job in the queue:
- llsubmit example1.cmd
21. Running Jobs: Example 2
- In example 2 we will use an I/O benchmark (IOR) to illustrate the use of arguments with mpirun
- The mpirun line is as follows:
- mpirun -np 64 -mode CO -cwd /bggpfs/projects/bg_workshop -exe /bggpfs/projects/bg_workshop/IOR -args "-a MPIIO -b 32m -t 4m -i 3"
- The -mode, -exe, and -args options are used in this example. The -args option is used to pass options to the IOR executable.
22. Checkpoint-Restart on the Blue Gene
- Checkpoint and restart are among the primary techniques for fault recovery on the Blue Gene.
- The current version of the checkpoint library requires users to manually insert checkpoint calls at the proper places in their code.
- The process is initialized by calling the BGLCheckpointInit() function.
- Checkpoint files can be written by making a call to BGLCheckpoint(). This can be done any number of times, and the checkpoint files are distinguished by a sequence number.
- The environment variables BGL_CHKPT_RESTART_SEQNO and BGL_CHKPT_DIR_PATH control the restart sequence number and the checkpoint file location. A sketch of this calling pattern follows.
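- A sketch of the manual checkpoint pattern described above; the prototypes are assumptions based on this slide, not the authoritative library header:

    #include <mpi.h>

    /* Assumed prototypes from the checkpoint library (libchkpt.rts.a) */
    void BGLCheckpointInit(char *ckpt_dir);  /* assumption: NULL means use BGL_CHKPT_DIR_PATH */
    void BGLCheckpoint(void);

    static void do_timestep(int step) { /* application work goes here */ }

    int main(int argc, char **argv)
    {
        int step;
        MPI_Init(&argc, &argv);
        BGLCheckpointInit(NULL);
        for (step = 1; step <= 4000; step++) {
            do_timestep(step);
            if (step % 1000 == 0)   /* checkpoint every 1000 steps, as in the example */
                BGLCheckpoint();    /* each call writes files with a new sequence number */
        }
        MPI_Finalize();
        return 0;
    }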
23. Example for Checkpoint-Restart
- Let us look at the entire checkpoint-restart process using the example provided in the /bggpfs/projects/bg_workshop directory.
- We are using a simple Poisson solver to illustrate the checkpoint process (file poisson-chkpt.f)
- Compile the program using mpxlf, including the checkpoint library:
- mpxlf -o pchk poisson-chkpt.f /bgl/BlueLight/ppcfloor/bglsys/lib/libchkpt.rts.a
- Use the chkpt.cmd file to submit the job
- The program writes checkpoint files after every 1000 steps. The checkpoint files are tagged with the node IDs and the sequence number. For example:
- ckpt.x06-y01-z00.1.2
24. Example for Checkpoint-Restart (contd.)
- Verify that the checkpoint restart works
- From the first run (when the checkpoint files were written):

    Done Step 3997 Error 1.83992678887004613
    Done Step 3998 Error 1.83991115295111185
    Done Step 3999 Error 1.83989551716504351
    Done Step 4000 Error 1.83987988151185511
    Done Step 4001 Error 1.83986424599153198
    Done Step 4002 Error 1.83984861060408078
    Done Step 4003 Error 1.83983297534951951

- From the second run (continued from step 4000, sequence 4):

    Done Step 4000 Error 1.83987988151185511
    Done Step 4001 Error 1.83986424599153198
    Done Step 4002 Error 1.83984861060408078

- We get identical results from both runs
25. BG System Overview: References
- Blue Gene Web site at SDSC
- http://www.sdsc.edu/us/resources/bluegene
- LoadLeveler guide
- http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.loadl.doc/loadl331/am2ug30305.html
- Blue Gene Application development guide (from IBM Redbooks)
- http://www.redbooks.ibm.com/abstracts/sg247179.html