Single Node Optimization Techniques on the NERSC SP

1
Single Node Optimization Techniques on the NERSC SP
December 17, 2002
Michael Stewart, NERSC User Services Group
pmstewart@lbl.gov, 510-486-6648

2
Introduction

Single processor optimization techniques using
compiler arguments and library routines.
Multi-threading optimization techniques to
partition work among several processors.

3
IBM Default: No Optimization!

Can have very bad consequences:
do i=1,bignum
x=x+a(i)
enddo
With the default (no optimization), bignum stores
of x are done.
When optimized, store motion is done:
intermediate values of x are kept in registers
and the actual store is done only once, outside
the loop.
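
As a rough illustration of what store motion means, here is a minimal C analogue of the loop above, written by hand. The function names are hypothetical and not from the presentation. The first version keeps the running sum in memory, as unoptimized code would; the second does by hand what the slide says the optimizer does, keeping the intermediate value in a local variable and storing it once after the loop.

```c
#include <stddef.h>

/* Unoptimized shape: the running sum lives in memory (*x), so each
   iteration loads *x, adds a[i], and stores the result back.
   That is n stores for n elements. */
void sum_store_each_iteration(double *x, const double *a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        *x = *x + a[i];
    }
}

/* Store motion done by hand: the intermediate sum stays in a local
   variable (a register candidate) and is stored once after the loop. */
void sum_store_once(double *x, const double *a, size_t n)
{
    double acc = *x;
    for (size_t i = 0; i < n; i++) {
        acc = acc + a[i];
    }
    *x = acc;   /* the only store to *x */
}
```

Both functions compute the same result; only the number of stores to memory differs.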

4
NERSC/IBM Recommendation

For all compiles (Fortran, C, C++):
-O3 -qstrict -qarch=pwr3 -qtune=pwr3
Compromise between minimizing compile time and
maximizing compiler optimization while keeping
numeric results bitwise identical to those
produced by an unoptimized compile.
With these options, optimization only done within
a procedure (e.g. subroutine, function) and the
executable produced will be single threaded.

5
Higher levels of Single Threaded Optimizations

-qhot
Fortran only
Do-loop-specific optimizations: padding to
minimize cache misses, "vectorizing" functions
like sqrt, loop unrolling, etc.
Can double or triple compile time; use only with
selected do-loop-dominated routines.
-O4/-O5
Interprocedural optimizations.
Specify for both compiles and links.
(Fortran compiles) Use with -qnohot.

6
Library Routines from the ESSL Library

Replacing code with ESSL routines is the single
most useful optimization.
The ESSL library contains a large number of
mathematical functions tuned for optimal Power3
performance.
Supports both 32- and 64-bit data types.
Runs much faster than user-written code
regardless of the compiler options used.
Has both a single-threaded (-lessl) and a
multi-threaded (-lesslsmp) version.
Can replace the slow Fortran intrinsics (matmul,
dot_product, etc.) with much faster ESSL versions
with the -qessl option.

7
Single Threaded Matrix Multiplication Example

A, B, C: 2500 by 2500 real*8 matrices.
Direct:
do i,j,k
C(i,j)=C(i,j)+A(i,k)*B(k,j)
Unoptimized: 16 MFlops.
-O3 -qarch=pwr3 -qtune=pwr3: 175 MFlops.
-qhot: 440 MFlops.
ESSL routine DGEMM: 1249 MFlops.
Fortran intrinsic MATMUL: 148 MFlops (1240
MFlops with -qessl).
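
For readers who want to see the "direct" loop nest spelled out, here is a minimal C sketch of it, together with one common way of turning a run time into an MFlops figure. The function names are hypothetical, the matrices are stored row-major in flat arrays (unlike Fortran's column-major layout), and the flop count of 2*n^3 (one multiply and one add per innermost iteration) is an assumption: the presentation does not say how its MFlops numbers were computed.

```c
#include <stddef.h>

/* Direct matrix multiply, C = C + A*B, for n-by-n matrices stored
   row-major in flat arrays. This is the i,j,k loop nest with no
   blocking or other tuning. */
void matmul_direct(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            for (size_t k = 0; k < n; k++) {
                C[i * n + j] += A[i * n + k] * B[k * n + j];
            }
        }
    }
}

/* MFlops for an n-by-n multiply that took `seconds`, assuming
   2*n^3 floating-point operations in total. */
double matmul_mflops(size_t n, double seconds)
{
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1.0e6;
}
```

Under that assumption, a 2500 by 2500 multiply is about 3.1e10 operations, so a rate near 1250 MFlops corresponds to a run of roughly 25 seconds.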

8
Processes and Threads

Under AIX, a process is an execution frame.
Not a schedulable entity.
Common ID, execution environment, address space
and system resources.
Number of processes in a job specified by -procs
or #@ total_tasks.
Threads are schedulable entities associated with
a specific process.
Each process has at least one thread associated
with it.
Each thread has the minimal set of properties
necessary for it to be scheduled, such as a stack,
priority, etc.
All threads of a process have the same address
space, and any change in the common environment
made by one thread affects all of the process's
threads.

9
Multi-threaded Optimization Techniques

Applicable primarily when the number of processes
on a node is less than the number of the
processors on the node (16 on seaborg).
Goal is to increase the number of threads running
on a node until it is equal to or slightly greater
than the number of processors on the node.
Reduce the amount of processor idle time and
decrease the elapsed time of a job.

10
Methods of Multi-threading Codes

Insert OpenMP directives into the source code;
use -qsmp=omp to activate them.
Let the compiler multi-thread the code:
-qsmp=auto.
Use the multi-threaded version of the ESSL
library, -lesslsmp. No source code changes
required.
When -qessl is specified, the Fortran intrinsics
will be replaced with multi-threaded ESSL
routines if -lesslsmp is also specified.
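
To make the first method concrete, here is a minimal sketch of what "inserting an OpenMP directive" looks like in C. The function name and the choice of loop (a scaled vector update) are hypothetical and not from the presentation. The directive only takes effect when OpenMP is enabled at compile time (the -qsmp=omp option named above for the IBM compilers); otherwise the pragma is ignored and the loop runs single-threaded with the same result.

```c
#include <stddef.h>

/* y = alpha*x + y. Each iteration is independent, so the loop can be
   split among threads with a single directive. */
void scaled_update(size_t n, double alpha, const double *x, double *y)
{
    long count = (long)n;   /* signed index, accepted by all OpenMP versions */

    #pragma omp parallel for
    for (long i = 0; i < count; i++) {
        y[i] = alpha * x[i] + y[i];
    }
}
```

No other source change is needed: the loop body is identical to the serial version.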

11
Run Time Behavior of Multi-threaded Codes

Default is to run with a number of threads equal
to the number of processors on a node of the
system, 16 on seaborg.
Can change the number of threads at run time by
setting the OMP_NUM_THREADS environment
variable to the desired number of threads.
These environment variable settings can improve
the performance of a multi-threaded code
significantly:
XLSMPOPTS="schedule=static:spins=0:yields=0"
AIXTHREAD_MNRATIO="1:1"

12
Multi-threaded Matrix Multiply

Running matrix multiply on a dedicated node
(unrealistic).
Inserting OpenMP directives with -qsmp=omp:
maximum speedup of 8.7 to 1520 MFlops. Speedup
levels off at 10 threads.
Automatic parallelization (-qsmp=auto): maximum
speedup of 6.4 to 2670 MFlops. Speedup levels
off at 9 threads.
DGEMM with -lesslsmp: maximum speedup of 15.8 to
19,700 MFlops. Speedup does not level off until
number of threads exceeds 16.
MATMUL with -qessl -lesslsmp: maximum speedup of
15.8 to 19,500 MFlops. Speedup does not level off
until number of threads exceeds 16.
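
The speedup figures above can be related to the single-threaded numbers with a little arithmetic. The sketch below is only a reading aid with hypothetical function names: it assumes speedup means multi-threaded rate divided by single-threaded rate, and that the 15.8 figure was reached with 16 threads, which the slide implies but does not state.

```c
/* Single-threaded rate implied by a multi-threaded rate and its
   speedup, assuming speedup = multi-threaded rate / single-threaded rate. */
double implied_baseline_mflops(double threaded_mflops, double speedup)
{
    return threaded_mflops / speedup;
}

/* Parallel efficiency: the fraction of ideal linear speedup achieved. */
double parallel_efficiency(double speedup, int threads)
{
    return speedup / (double)threads;
}
```

For DGEMM, 19,700 MFlops at a speedup of 15.8 implies a baseline of about 1247 MFlops, which agrees with the 1249 MFlops single-threaded DGEMM figure, and an efficiency of roughly 0.99 on 16 threads.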

13
Hybrid Codes: Mixed MPI and OpenMP

Job runs with one or more MPI processes on a node
with 1 or more OpenMP threads per process such
that (number of threads) * (number of
processes) = 16.
Optimum choice of number of processes depends on
the code.
Publicly available ASCI Purple hybrid benchmarks
run on 1 node with various combinations of
processes and threads:
SPhot: best performance with 16 single-threaded
MPI processes.
sPPM: best performance with 1 process with 16
OpenMP threads.

14
Recommendations

Always use ESSL routines when you can.
If you use Fortran intrinsics, always compile
with -qessl.
If your non-OpenMP code uses fewer than 16
processes per node, consider compiling with
-lesslsmp and running with OMP_NUM_THREADS set to
a value such that threads * processes is the least
integer greater than or equal to 16.
For hybrid codes, experiment with a
representative data set to determine the best
combination of threads and processes on a node.
Remember you will be charged for all 16
processors on a node whether you use them or not.
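
The thread-count rule above amounts to a ceiling division. Here is a minimal sketch with a hypothetical function name; it takes "threads" to mean threads per process, so that threads * processes is the smallest such product that is at least the processor count.

```c
/* Smallest number of threads per process such that
   threads * procs_per_node >= cpus_per_node (ceiling division).
   Assumes both arguments are positive. */
int threads_per_process(int cpus_per_node, int procs_per_node)
{
    return (cpus_per_node + procs_per_node - 1) / procs_per_node;
}
```

On a 16-processor node, 4 processes per node gives 4 threads each (16 in total), while 5 processes per node also gives 4 threads each (20 in total, slightly oversubscribing the node).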
