Title: Running on the SDSC Blue Gene
1. Running on the SDSC Blue Gene
- Mahidhar Tatineni
- Blue Gene Workshop
- SDSC, April 5, 2007
2. BG System Overview: SDSC's three-rack system
3. BG System Overview: Integrated system
4. BG System Overview: Multiple operating systems and functions
- Compute nodes run the Compute Node Kernel (CNK, blrts)
- Each runs only one job at a time
- Each uses very little memory for the CNK
- I/O nodes run Embedded Linux
- Run CIOD to manage compute nodes
- Perform file I/O
- Run GPFS
- Front-end nodes run SuSE Linux
- Support user logins
- Run cross compilers and linker
- Run parts of mpirun to submit jobs and LoadLeveler to manage jobs
- Service node runs SuSE Linux
- Uses DB2 to manage four system databases
- Runs control system software, including MMCS
- Runs other parts of mpirun and LoadLeveler
- Software comes in drivers; we are currently running Driver V1R3M1
5. SDSC Blue Gene Getting Started: Logging on and moving files
- Logging on
- ssh bglogin.sdsc.edu
- or
- ssh -l username bglogin.sdsc.edu
- Alternate login node bg-login4.sdsc.edu
- (We will use bg-login4 for the workshop)
- Moving files
- scp file username@bglogin.sdsc.edu:
- or
- scp -r directory username@bglogin.sdsc.edu:
6. SDSC Blue Gene Getting Started: Places to store your files
- /users (home directory)
- 1.1 TB NFS-mounted file system
- Recommended for storing source / important files.
- Do not write data/output to this area: slow and limited in size!
- Regular backups
- /bggpfs available for parallel I/O via GPFS
- 18.5 TB accessed via IA-64 NSD servers
- No backups
- 700 TB /gpfs-wan available for parallel I/O and
shared with DataStar and TG IA-64 cluster.
7. SDSC Blue Gene: Checking your allocation
- Use the reslist command to check your allocation on the SDSC Blue Gene
- Sample output is as follows:

    bg-login1 mahidhar/bg_workshop> reslist -u ux452208
    Querying database, this may take several seconds ...
    Output shown is local machine usage. For full usage on
    roaming accounts, please use tgusage.

    SBG  Blue Gene at SDSC
                                          SU Hours    SU Hours
    Name      UID     ACID  ACC  PCTG     ALLOCATED   USED      USER
    ux452208  452208  1606  U    100      99999       0         Guest8, Hpc
    MKG000            1606                99999       40
8. Accessing HPSS from the Blue Gene
- What is HPSS?
- The centralized, long-term data storage system at SDSC is the High Performance Storage System (HPSS)
- Set up your authentication
- Run the get_hpss_keytab script
- Use the hsi and htar clients to connect to HPSS; for example (a fuller session follows):
- hsi put mytar.tar
- htar -c -f mytar.tar -L file_or_directory
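- A minimal end-to-end session, assuming the keytab setup has already been run (file and directory names are illustrative):

    hsi put results.tar             # store a local tar file in HPSS
    hsi get results.tar             # retrieve it later
    htar -c -f run01.tar run01      # bundle a directory directly into an HPSS tar
    htar -x -f run01.tar            # extract it back from HPSS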
9. Using the compilers: Important programming considerations
- Front-end nodes have different processors and run a different OS than the compute nodes
- Hence codes must be cross-compiled
- Care must be taken with configure scripts
- Discovery of system characteristics during compilation (e.g., via configure) may require modifications to the configure script
- Make sure that if code has to be executed during the configure, it runs on the compute nodes
- Alternately, system characteristics can be specified by the user and the configure script modified to take this into account (see the sketch after this list)
- Some system calls are not supported by the compute node kernel
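- A hedged sketch of a cross-compiling configure invocation; the host triplet and pre-seeded cache variable are assumptions and vary by package:

    # Tell configure not to run test binaries on the front end, and
    # pre-seed a result it would otherwise discover by executing code.
    ./configure --host=powerpc-bgl-blrts-gnu \
                CC=blrts_xlc F77=blrts_xlf \
                ac_cv_func_mmap_fixed_mapped=yes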
10. Using the compilers: Compiler versions, paths, wrappers
- Compilers (version numbers the same as on DataStar)
- XL Fortran V10.1: blrts_xlf, blrts_xlf90
- XL C/C++ V8.0: blrts_xlc, blrts_xlC
- Paths to compilers in default .bashrc
- export PATH=/opt/ibmcmp/xlf/bg/10.1/bin:$PATH
- export PATH=/opt/ibmcmp/vac/bg/8.0/bin:$PATH
- export PATH=/opt/ibmcmp/vacpp/bg/8.0/bin:$PATH
- Compilers with MPI wrappers (recommended)
- mpxlf, mpxlf90, mpcc, mpCC
- Path to MPI-wrapped compilers in default .bashrc
- export PATH=/usr/local/apps/bin:$PATH
11. Using the compilers: Options
- Compiler options
- -qarch=440 uses only a single FPU per processor (minimum option)
- -qarch=440d allows both FPUs per processor (alternate option)
- -qtune=440 tunes for the 440 processor
- -O3 gives minimal optimization with no SIMDization
- -O3 -qarch=440d adds back-end SIMDization
- -O3 -qhot adds TPO (a high-level inter-procedural optimizer), SIMDization, and more loop optimization
- -O4 adds compile-time interprocedural analysis
- -O5 adds link-time interprocedural analysis
- (TPO SIMDization is the default with -O4 and -O5)
- Current recommendation (see the example below)
- Start with -O3 -qarch=440d -qtune=440
- Try -O4, -O5 next
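- For example, a compile using the recommended starting options (file and program names are illustrative):

    mpcc -O3 -qarch=440d -qtune=440 -o mycode mycode.c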
12. Using libraries
- ESSL
- Version 4.2 is available in /usr/local/apps/lib
- MASS/MASSV
- Version 4.3 is available in /usr/local/apps/lib
- FFTW
- Versions 2.1.5 and 3.1.2 are available in both single and double precision. The libraries are located in /usr/local/apps/V1R3
- NETCDF
- Versions 3.6.0p1 and 3.6.1 are available in /usr/local/apps/V1R3
- Example link paths (a full link line follows)
- -Wl,--allow-multiple-definition -L/usr/local/apps/lib -lmassv -lmass -lesslbg -L/usr/local/apps/V1R3/fftw-3.1.2s/lib -lfftw3f
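- A complete link line using these paths might look as follows (the program name is illustrative):

    mpxlf -o myapp myapp.f \
          -Wl,--allow-multiple-definition \
          -L/usr/local/apps/lib -lmassv -lmass -lesslbg \
          -L/usr/local/apps/V1R3/fftw-3.1.2s/lib -lfftw3f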
13. Running jobs: Overview
- There are two compute modes
- Coprocessor (CO) mode: one compute processor per node
- Virtual node (VN) mode: two compute processors per node
- Jobs run in partitions or blocks
- These are typically powers of two
- Blocks must be allocated (or booted) before a run and are restricted to a single user at a time
- Only batch jobs are supported
- Batch jobs are managed by LoadLeveler
- Users can monitor jobs using llq -b and llq -x
14. Running jobs: LoadLeveler for batch jobs
- Here is an example LoadLeveler run script (test.cmd):

    #!/usr/bin/ksh
    # @ environment = COPY_ALL
    # @ job_type = BlueGene
    # @ account_no = <your user account>
    # @ class = parallel
    # @ bg_partition = <partition name, for example top>
    # @ output = file.$(jobid).out
    # @ error = file.$(jobid).err
    # @ notification = complete
    # @ notify_user = <your email address>
    # @ wall_clock_limit = 00:10:00
    # @ queue
    mpirun -mode VN -np <number of procs> -exe <your executable> -cwd <working directory>

- Submit as follows:
- llsubmit test.cmd
15. Running jobs: mpirun options
- Key mpirun options (combined in the example below) are
- -mode: compute mode, CO or VN
- -np: number of compute processors
- -mapfile: logical mapping of processors
- -cwd: full path of the current working directory
- -exe: full path of the executable
- -args: arguments of the executable (in double quotes)
- -env: environment variables (in double quotes)
- (These are mostly different from the TeraGrid options)
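- Putting several options together; all values below, including the environment variable, are illustrative:

    mpirun -mode VN -np 128 \
           -cwd /bggpfs/projects/myproj \
           -exe /bggpfs/projects/myproj/mycode \
           -args "-n 1000 -niter 50" \
           -env "MYVAR=1"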
16. Running jobs: Partition Layout and Usage Guidelines
- To make effective use of the Blue Gene, production runs should generally use one-fourth or more of the machine, i.e., 256 or more compute nodes. Thus predefined partitions are provided for production runs.
- SDSC: all 3,072 nodes
- R01R02: 2,048 nodes combining racks 1 and 2
- rack, R01, R02: all 1,024 nodes of rack 0, rack 1, and rack 2, respectively
- top, bot; R01-top, R01-bot; R02-top, R02-bot: 512 nodes each
- top256-1, top256-2: 256 nodes in each half of the top midplane of rack 0
- bot256-1, bot256-2: 256 nodes in each half of the bottom midplane of rack 0
- Smaller 64-node (bot64-1, ..., bot64-8) and 128-node (bot128-1, ..., bot128-4) partitions are available for test runs.
- Use the /usr/local/apps/utils/showq command to get more information on the partition requests of jobs in the queue.
17. Running jobs: Partition Layout
18. Running Jobs: Reservation
- There is a reservation in place for today's workshop for all the guest users.
- The reservation ID is bgsn.76.r
- Set the LL_RES_ID variable to bgsn.76.r. This will automatically bind jobs to the reservation.
- csh/tcsh: setenv LL_RES_ID bgsn.76.r
- bash: export LL_RES_ID=bgsn.76.r
19. Running Jobs: Example 1
- The examples featured in today's talk are included in the following directory:
- /bggpfs/projects/bg_workshop
- Copy them to your directory by using the following command:
- cp -r /bggpfs/projects/bg_workshop /users/<your_dir>
- In the first example we will compile a simple MPI program (mpi_hello_c.c/mpi_hello_f.f) and use the sample LoadLeveler script (example1.cmd) to submit and run the job. A minimal sketch of such a program follows.
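- A minimal MPI "hello" in C, sketching what mpi_hello_c.c likely contains (the workshop file itself may differ):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);                 /* start MPI */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's rank */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of tasks */
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }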
20. Example 1 (contd.)
- Compile the example files using the mpcc/mpxlf wrappers:
- mpcc -o hello mpi_hello_c.c
- mpxlf -o hello mpi_hello_f.f
- Modify the LoadLeveler submit file (example1.cmd): add the account number, partition name, email address, and mpirun options
- Use llsubmit to put the job in the queue:
- llsubmit example1.cmd
21. Running Jobs: Example 2
- In example 2 we will use an I/O benchmark (IOR) to illustrate the use of arguments with mpirun
- The mpirun line is as follows:
- mpirun -np 64 -mode CO -cwd /bggpfs/projects/bg_workshop -exe /bggpfs/projects/bg_workshop/IOR -args "-a MPIIO -b 32m -t 4m -i 3"
- The -mode, -exe, and -args options are used in this example. The -args option is used to pass options to the IOR executable.
22. Checkpoint-Restart on the Blue Gene
- Checkpoint and restart are among the primary techniques for fault recovery on the Blue Gene.
- The current version of the checkpoint library requires users to manually insert checkpoint calls at the proper places in their code.
- The process is initialized by calling the BGLCheckpointInit() function.
- Checkpoint files can be written by making a call to BGLCheckpoint(). This can be done any number of times, and the checkpoint files are distinguished by a sequence number.
- The environment variables BGL_CHKPT_RESTART_SEQNO and BGL_CHKPT_DIR_PATH control the restart sequence number and the checkpoint file location. A sketch of this calling pattern follows.
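- A sketch of the manual checkpoint pattern described above; the prototypes are assumptions based on this slide, not the authoritative library header:

    #include <mpi.h>

    /* Assumed prototypes from the checkpoint library (libchkpt.rts.a) */
    void BGLCheckpointInit(char *ckpt_dir);  /* assumption: NULL means use BGL_CHKPT_DIR_PATH */
    void BGLCheckpoint(void);

    static void do_timestep(int step) { /* application work goes here */ }

    int main(int argc, char **argv)
    {
        int step;
        MPI_Init(&argc, &argv);
        BGLCheckpointInit(NULL);
        for (step = 1; step <= 4000; step++) {
            do_timestep(step);
            if (step % 1000 == 0)   /* checkpoint every 1000 steps, as in the example */
                BGLCheckpoint();    /* each call writes files with a new sequence number */
        }
        MPI_Finalize();
        return 0;
    }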
23. Example for Checkpoint-Restart
- Let us look at the entire checkpoint-restart process using the example provided in the /bggpfs/projects/bg_workshop directory.
- We are using a simple Poisson solver to illustrate the checkpoint process (file poisson-chkpt.f)
- Compile the program using mpxlf, including the checkpoint library:
- mpxlf -o pchk poisson-chkpt.f /bgl/BlueLight/ppcfloor/bglsys/lib/libchkpt.rts.a
- Use the chkpt.cmd file to submit the job
- The program writes checkpoint files after every 1000 steps. The checkpoint files are tagged with the node IDs and the sequence number. For example:
- ckpt.x06-y01-z00.1.2
24. Example for Checkpoint-Restart (contd.)
- Verify that the checkpoint restart works
- From the first run (when the checkpoint files were written):

    Done Step 3997 Error 1.83992678887004613
    Done Step 3998 Error 1.83991115295111185
    Done Step 3999 Error 1.83989551716504351
    Done Step 4000 Error 1.83987988151185511
    Done Step 4001 Error 1.83986424599153198
    Done Step 4002 Error 1.83984861060408078
    Done Step 4003 Error 1.83983297534951951

- From the second run (continued from step 4000, sequence 4):

    Done Step 4000 Error 1.83987988151185511
    Done Step 4001 Error 1.83986424599153198
    Done Step 4002 Error 1.83984861060408078

- We get identical results from both runs
25. BG System Overview: References
- Blue Gene Web site at SDSC
- http://www.sdsc.edu/us/resources/bluegene
- LoadLeveler guide
- http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.loadl.doc/loadl331/am2ug30305.html
- Blue Gene Application development guide (from IBM Redbooks)
- http://www.redbooks.ibm.com/abstracts/sg247179.html