Title: Single Node Optimization Techniques on the NERSC SP
1Single Node Optimization Techniques on the NERSC SP
December 17, 2002
Michael Stewart, NERSC User Services Group
pmstewart_at_lbl.gov, 510-486-6648
2Introduction
- Single processor optimization techniques using compiler arguments and library routines.
- Multi-threading optimization techniques to partition work among several processors.
3IBM Default: No Optimization!
- Can have very bad consequences:
    do i=1,bignum
      x=x+a(i)
    enddo
- With the default (no optimization), bignum stores of x are done.
- When optimized, store motion is done: intermediate values of x are kept in registers and the actual store is done only once, outside the loop.
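The effect of store motion can be sketched by hand in C. This is only an illustration of the transformation the slide describes, not compiler output; the function names are hypothetical and do not come from the presentation.

```c
#include <assert.h>
#include <stddef.h>

/* Shape of the unoptimized loop: *x is read from and written to memory
   on every iteration, so n stores are issued. */
void sum_store_each_iteration(double *x, const double *a, size_t n)
{
    assert(x != NULL && (a != NULL || n == 0));
    for (size_t i = 0; i < n; i++) {
        *x = *x + a[i];
    }
}

/* Store motion applied by hand: the running value lives in a local
   variable (a register candidate) and is stored once, after the loop. */
void sum_store_once(double *x, const double *a, size_t n)
{
    assert(x != NULL && (a != NULL || n == 0));
    double acc = *x;
    for (size_t i = 0; i < n; i++) {
        acc += a[i];
    }
    *x = acc;
}
```

Both functions add the elements in the same order, so they produce the same value; only the number of stores to x differs.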
4NERSC/IBM Recommendation
- For all compiles (Fortran, C, C++):
    -O3 -qstrict -qarch=pwr3 -qtune=pwr3
- Compromise between minimizing compile time and maximizing compiler optimization while keeping numeric results bitwise identical to those produced by an unoptimized compile.
- With these options, optimization is only done within a procedure (e.g. subroutine, function) and the executable produced will be single threaded.
5Higher levels of Single Threaded Optimizations
- -qhot
  - Fortran only.
  - Do-loop specific optimizations: padding to minimize cache misses, "vectorizing" functions like sqrt, loop unrolling, etc.
  - Can double or triple compile time; use only with selected do-loop dominated routines.
- -O4/-O5
  - Interprocedural optimizations.
  - Specify for both compiles and links.
  - (Fortran compiles) Use with -qnohot.
6Library Routines from the ESSL Library
- Replacing code with ESSL routines is the single most useful optimization.
- The ESSL library contains a large number of mathematical functions tuned for optimal Power3 performance.
- Supports both 32-bit and 64-bit data types.
- Runs much faster than user-written code regardless of the compiler options used.
- Has both a single threaded (-lessl) and a multi-threaded (-lesslsmp) version.
- Can replace the slow Fortran intrinsics (matmul, dot_product, etc.) with much faster ESSL versions with the -qessl option.
7Single Threaded Matrix Multiplication Example
- A, B, C: 2500 by 2500 real*8 matrices.
- Direct method:
    do i,j,k
      C(i,j)=C(i,j)+A(i,k)*B(k,j)
  - Unoptimized: 16 MFlops.
  - -O3 -qarch=pwr3 -qtune=pwr3: 175 MFlops.
  - -qhot: 440 MFlops.
- ESSL routine DGEMM: 1249 MFlops.
- Fortran intrinsic MATMUL: 148 MFlops (1240 MFlops with -qessl).
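For reference, the direct method can be sketched in C as follows. The function and helper names are hypothetical, and the column-major indexing is an assumption chosen to mirror Fortran array storage; the sketch shows only the arithmetic and does not reproduce the timings quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major index of element (i,j) in an n-by-n matrix,
   matching Fortran storage order. */
static size_t idx(size_t i, size_t j, size_t n)
{
    return i + j * n;
}

/* Direct triple loop: C = C + A*B for n-by-n matrices. */
void matmul_direct(size_t n, const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            for (size_t k = 0; k < n; k++) {
                C[idx(i, j, n)] += A[idx(i, k, n)] * B[idx(k, j, n)];
            }
        }
    }
}
```

DGEMM computes the same update (C = alpha*A*B + beta*C with alpha = 1 and beta = 1); the call itself is not shown because it needs ESSL or another BLAS library at link time.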
8Processes and Threads
- Under AIX, a process is an execution frame.
- Not a schedulable entity.
- Common ID, execution environment, address space and system resources.
- Number of processes in a job is specified by -procs or @ total_tasks.
- Threads are schedulable entities associated with a specific process.
- Each process has at least one thread associated with it.
- Each thread has the minimal set of properties necessary for it to be scheduled, such as a stack, priority, etc.
- All threads of a process have the same address space, and any change in the common environment made by one thread affects all of the process's threads.
9Multi-threaded Optimization Techniques
- Applicable primarily when the number of processes on a node is less than the number of processors on the node (16 on seaborg).
- Goal is to increase the number of threads running on a node until it is equal to, or slightly greater than, the number of processors on the node.
- Reduce the amount of processor idle time and decrease the elapsed time of a job.
10Methods of Multi-threading Codes
- Insert OpenMP directives into the source code, use -qsmp=omp to activate them.
- Let the compiler multi-thread the code, -qsmp=auto.
- Use the multi-threaded version of the ESSL library, -lesslsmp. No source code changes required.
- When -qessl is specified, the Fortran intrinsics will be replaced with multi-threaded ESSL routines if -lesslsmp is also specified.
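As an illustration of the first method, the sketch below puts a single OpenMP directive on a loop in C (the Fortran counterpart is the !$OMP PARALLEL DO directive). The function name is hypothetical. The directive only takes effect when the compiler's OpenMP option is enabled, for example -qsmp=omp with a thread-safe IBM compiler invocation such as xlc_r; otherwise it is ignored and the loop runs single threaded.

```c
#include <assert.h>

/* y = y + alpha*x. Iterations are independent, so the loop can be
   divided among threads without any synchronization. */
void scale_add(int n, double alpha, const double *x, double *y)
{
    assert(n >= 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        y[i] += alpha * x[i];
    }
}
```

The result is the same whether or not the directive is active, which makes it easy to compare a threaded build against a single-threaded one.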
11Run Time Behavior of Multi-threaded Codes
- Default is to run with a number of threads equal to the number of processors on a node of the system, 16 on seaborg.
- Can change the number of threads at run time by setting the OMP_NUM_THREADS environment variable to the desired number of threads.
- These environment variable settings can improve the performance of a multithreaded code significantly:
    XLSMPOPTS "schedule=static:spins=0:yields=0"
    AIXTHREAD_MNRATIO "1:1"
12Multi-threaded Matrix Multiply
- Running matrix multiply on a dedicated node (unrealistic).
- Inserting OpenMP directives with -qsmp=omp: maximum speedup of 8.7 to 1520 MFlops. Speedup levels off at 10 threads.
- Automatic parallelization (-qsmp=auto): maximum speedup of 6.4 to 2670 MFlops. Speedup levels off at 9 threads.
- DGEMM with -lesslsmp: maximum speedup of 15.8 to 19,700 MFlops. Speedup does not level off until the number of threads exceeds 16.
- MATMUL with -qessl -lesslsmp: maximum speedup of 15.8 to 19,500 MFlops. Speedup does not level off until the number of threads exceeds 16.
13Hybrid Codes: Mixed MPI and OpenMP
- Job runs with one or more MPI processes on a node with 1 or more OpenMP threads per process such that (number of threads) * (number of processes) = 16.
- Optimum choice of number of processes depends on the code.
- Publicly available ASCI Purple hybrid benchmarks run on 1 node with various combinations of processes and threads:
  - SPhot: Best performance with 16 single threaded MPI processes.
  - sPPM: Best performance with 1 process with 16 OpenMP threads.
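The process and thread combinations that exactly fill a node can be enumerated with a few lines of C. This is a sketch under the assumption that only layouts whose product equals the processor count are of interest; the function name and the array-based interface are hypothetical.

```c
#include <assert.h>

/* Writes every (processes, threads) pair whose product equals
   processors_per_node into procs[] and threads[], in order of
   increasing process count. Returns the number of pairs written,
   which never exceeds capacity. */
int hybrid_layouts(int processors_per_node, int *procs, int *threads,
                   int capacity)
{
    assert(processors_per_node > 0 && capacity >= 0);
    int count = 0;
    for (int p = 1; p <= processors_per_node && count < capacity; p++) {
        if (processors_per_node % p == 0) {
            procs[count] = p;
            threads[count] = processors_per_node / p;
            count++;
        }
    }
    return count;
}
```

For a 16-processor node this gives the layouts 1x16, 2x8, 4x4, 8x2 and 16x1, which are the candidates to time with a representative data set.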
14Recommendations
- Always use ESSL routines when you can.
- If you use Fortran intrinsics, always compile with -qessl.
- If your non-OpenMP code uses fewer than 16 processes per node, consider compiling with -lesslsmp and running with OMP_NUM_THREADS set to a value such that threads * processes is the least integer greater than or equal to 16.
- For hybrid codes, experiment with a representative data set to determine the best combination of threads and processes on a node.
- Remember that you will be charged for all 16 processors on a node whether you use them or not.
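The OMP_NUM_THREADS rule above is a ceiling division. A minimal sketch in C, with a hypothetical function name, assuming the same number of threads is used by every process on the node:

```c
#include <assert.h>

/* Smallest thread count per process such that
   threads * processes_per_node >= processors_per_node. */
int threads_per_process(int processors_per_node, int processes_per_node)
{
    assert(processors_per_node > 0 && processes_per_node > 0);
    return (processors_per_node + processes_per_node - 1) / processes_per_node;
}
```

With 16 processors per node, 4 processes give 4 threads each (16 threads in total), while 5 processes also give 4 threads each (20 threads in total, slightly oversubscribing the node).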
15References
- See the web page http://hpcf.nersc.gov/computers/SP/nodopt.html for an expanded version of this presentation and associated links.
16Finis
- End of this presentation.