Single Node Optimization Techniques on the NERSC SP

1
Single Node Optimization Techniques on the NERSC SP
December 17, 2002
Michael Stewart, NERSC User Services Group
pmstewart@lbl.gov, 510-486-6648
2
Introduction
  • Single processor optimization techniques using
    compiler arguments and library routines.
  • Multi-threading optimization techniques to
    partition work among several processors.

3
IBM Default: No Optimization!
  • Can have very bad consequences.
  • Example:
      do i=1,bignum
        x=x+a(i)
      enddo
  • With the default (no optimization), bignum stores
    of x are done.
  • When optimized, store motion is done:
    intermediate values of x are kept in registers
    and the actual store is done only once, outside
    the loop.
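
The effect of store motion can be sketched by hand in C (the slide's loop is Fortran). This is only an illustration of what the optimizer does, not compiler output. The function names are hypothetical, and the sketch assumes x does not alias any element of a.

```c
#include <assert.h>
#include <stddef.h>

/* Unoptimized shape: the running sum lives in memory, so every
   iteration loads *x, adds a[i], and stores *x back. */
void sum_store_each_iteration(double *x, const double *a, size_t n)
{
    assert(x != NULL);
    for (size_t i = 0; i < n; i++) {
        *x = *x + a[i];
    }
}

/* Hand-applied store motion: accumulate in a local variable (a
   register candidate) and store to *x once, after the loop.
   Assumes x does not point into a[]. */
void sum_store_once(double *x, const double *a, size_t n)
{
    assert(x != NULL);
    double acc = *x;
    for (size_t i = 0; i < n; i++) {
        acc += a[i];
    }
    *x = acc;
}
```

Both functions add in the same order and so produce the same sum. The second performs one store instead of n.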

4
NERSC/IBM Recommendation
  • For all compiles (Fortran, C, C++):
    -O3 -qstrict -qarch=pwr3 -qtune=pwr3
  • Compromise between minimizing compile time and
    maximizing compiler optimization while keeping
    numeric results bitwise identical to those
    produced by an unoptimized compile.
  • With these options, optimization is done only
    within a procedure (e.g. subroutine, function) and
    the executable produced will be single threaded.

5
Higher levels of Single Threaded Optimizations
  • -qhot
      • Fortran only.
      • Do-loop-specific optimizations: padding to
        minimize cache misses, "vectorizing" functions
        like sqrt, loop unrolling, etc.
      • Can double or triple compile time; use only
        with selected do-loop-dominated routines.
  • -O4/-O5
      • Interprocedural optimizations.
      • Specify for both compiles and links.
      • (Fortran compiles) Use with -qnohot.

6
Library Routines from the ESSL Library
  • Replacing code with ESSL routines is the single
    most useful optimization.
  • The ESSL library contains a large number of
    mathematical functions tuned for optimal Power3
    performance.
  • Supports both 32-bit and 64-bit data types.
  • Runs much faster than user written code
    regardless of the compiler options used.
  • Has both a single threaded (-lessl) and
    multi-threaded version (-lesslsmp).
  • Can replace the slow Fortran intrinsics (matmul,
    dot_product, etc.) with much faster ESSL versions
    with the -qessl option.

7
Single Threaded Matrix Multiplication Example
  • A, B, C: 2500 by 2500 real*8 matrices.
  • Direct:
      do i,j,k
        C(i,j)=C(i,j)+A(i,k)*B(k,j)
  • Unoptimized: 16 MFlops.
  • -O3 -qarch=pwr3 -qtune=pwr3: 175 MFlops.
  • -qhot: 440 MFlops.
  • ESSL routine DGEMM: 1249 MFlops.
  • Fortran intrinsic MATMUL: 148 MFlops (1240
    MFlops with -qessl).
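
To put the rates above in terms of run time, here is a minimal C sketch of the direct triple loop together with its operation count. It assumes row-major flat arrays (the slide's Fortran arrays are column-major) and counts one multiply and one add per innermost iteration. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Direct triple loop: C = C + A*B for n-by-n matrices stored
   row-major in flat arrays. */
void matmul_direct(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* One multiply and one add per innermost iteration: 2*n^3 flops. */
double matmul_flops(size_t n)
{
    double d = (double)n;
    return 2.0 * d * d * d;
}

/* Seconds needed for a flop count at a sustained rate in MFlops. */
double seconds_at_rate(double flops, double mflops)
{
    assert(mflops > 0.0);
    return flops / (mflops * 1.0e6);
}
```

For n = 2500 the count is 3.125e10 flops. That is roughly 1950 seconds at the unoptimized 16 MFlops, against roughly 25 seconds at DGEMM's 1249 MFlops.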

8
Processes and Threads
  • Under AIX, a process is an execution frame.
  • Not a schedulable entity.
  • Common ID, execution environment, address space
    and system resources.
  • Number of processes in a job specified by -procs
    or @total_tasks.
  • Threads are schedulable entities associated with
    a specific process.
  • Each process has at least one thread associated
    with it.
  • Each thread has the minimal set of properties
    necessary for it to be scheduled, such as a stack,
    priority, etc.
  • All threads of a process have the same address
    space, and any change in the common environment
    made by one thread affects all of the process's
    threads.

9
Multi-threaded Optimization Techniques
  • Applicable primarily when the number of processes
    on a node is less than the number of the
    processors on the node (16 on seaborg).
  • Goal is to increase the number of threads running
    on a node until it is equal to, or slightly
    greater than, the number of processors on the node.
  • Reduce the amount of processor idle time and
    decrease the elapsed time of a job.

10
Methods of Multi-threading Codes
  • Insert OpenMP directives into the source code;
    use -qsmp=omp to activate them.
  • Let the compiler multi-thread the code with
    -qsmp=auto.
  • Use the multi-threaded version of the ESSL
    library, -lesslsmp. No source code changes
    required.
  • When -qessl is specified, the Fortran intrinsics
    will be replaced with multi-threaded ESSL
    routines if -lesslsmp is also specified.
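
As an illustration of the first method, here is a minimal C sketch of a matrix multiply with one OpenMP directive on the outer loop. It is not the code that was benchmarked. The function name is hypothetical and the arrays are assumed to be row-major and flat. With the IBM compilers the directive is activated by -qsmp=omp as noted above. If the compiler's OpenMP option is not given, the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* C = C + A*B for n-by-n row-major matrices. Iterations of the
   outer loop are divided among threads; each thread writes only
   its own rows of c, so no synchronization is needed. */
void matmul_omp(int n, const double *a, const double *b, double *c)
{
    assert(n >= 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        /* j, k and sum are declared inside the parallel loop,
           so each thread gets private copies. */
        for (int j = 0; j < n; j++) {
            double sum = c[i * n + j];
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```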

11
Run Time Behavior of Multi-threaded Codes
  • Default is to run with a number of threads equal
    to the number of processors on a node of the
    system, 16 on seaborg.
  • Can change the number of threads at run time by
    setting the OMP_NUM_THREADS environment
    variable to the desired number of threads.
  • These environment variable settings can improve
    the performance of a multithreaded code
    significantly:
      XLSMPOPTS="schedule=static:spins=0:yields=0"
      AIXTHREAD_MNRATIO="1:1"

12
Multi-threaded Matrix Multiply
  • Running matrix multiply on a dedicated node
    (unrealistic).
  • Inserting OpenMP directives with -qsmp=omp:
    maximum speedup of 8.7 to 1520 MFlops. Speedup
    levels off at 10 threads.
  • Automatic parallelization (-qsmp=auto): maximum
    speedup of 6.4 to 2670 MFlops. Speedup levels
    off at 9 threads.
  • DGEMM with -lesslsmp: maximum speedup of 15.8 to
    19,700 MFlops. Speedup does not level off until
    number of threads exceeds 16.
  • MATMUL with -qessl -lesslsmp: maximum speedup of
    15.8 to 19,500 MFlops. Speedup does not level off
    until number of threads exceeds 16.
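
Each result above pairs a speedup with a threaded rate, so the single-threaded rate it was measured against can be recovered by division. A small C sketch of that arithmetic follows. The function name is hypothetical, and which build served as the baseline in each case is an inference, not something the slides state.

```c
#include <assert.h>

/* Single-thread rate implied by a multi-threaded rate and its
   reported speedup: baseline = threaded rate / speedup. */
double implied_baseline_mflops(double threaded_mflops, double speedup)
{
    assert(speedup > 0.0);
    return threaded_mflops / speedup;
}
```

For the OpenMP case, 1520 / 8.7 is about 175 MFlops, which matches the single-threaded -O3 figure. For DGEMM, 19,700 / 15.8 is about 1247 MFlops, close to the single-threaded 1249. For automatic parallelization, 2670 / 6.4 is about 417 MFlops, which does not exactly match any single-threaded figure quoted earlier.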

13
Hybrid Codes: Mixed MPI and OpenMP
  • Job runs with one or more MPI processes on a node
    with 1 or more OpenMP threads per process such
    that (number of threads) * (number of
    processes) = 16.
  • Optimum choice of number of processes depends on
    the code.
  • Publicly available ASCI Purple hybrid benchmarks
    run on 1 node with various combinations of
    processes and threads:
      • SPhot: best performance with 16 single
        threaded MPI processes.
      • sPPM: best performance with 1 process with 16
        OpenMP threads.
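
The process and thread combinations that fill a 16-processor node exactly can be listed mechanically. The C sketch below is illustrative only. The function and macro names are hypothetical, and the 16 is the seaborg node size from the slides.

```c
#include <assert.h>

#define CPUS_PER_NODE 16  /* seaborg node size from the slides */

/* Fills procs[] and threads[] with every pair whose product is
   exactly CPUS_PER_NODE and returns the number of pairs written.
   Both arrays must have room for CPUS_PER_NODE entries. */
int hybrid_layouts(int procs[], int threads[])
{
    int count = 0;
    for (int p = 1; p <= CPUS_PER_NODE; p++) {
        if (CPUS_PER_NODE % p == 0) {
            procs[count] = p;
            threads[count] = CPUS_PER_NODE / p;
            count++;
        }
    }
    assert(count > 0);  /* p = 1 always qualifies */
    return count;
}
```

This yields five layouts: 1x16, 2x8, 4x4, 8x2 and 16x1 (processes x threads). The two benchmark results above sit at opposite ends of that list.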

14
Recommendations
  • Always use ESSL routines when you can.
  • If you use Fortran intrinsics, always compile
    with -qessl.
  • If your non-OpenMP code uses fewer than 16
    processes per node, consider compiling with
    -lesslsmp and running with OMP_NUM_THREADS set to
    a value such that threads*processes is the least
    integer greater than or equal to 16.
  • For hybrid codes, experiment with a
    representative data set to determine the best
    combination of threads and processes on a node.
  • Remember you will be charged for all 16
    processors on a node whether you use them or not.
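
The OMP_NUM_THREADS rule in the third recommendation amounts to a ceiling division. Here is a minimal C sketch with a hypothetical function name. The node size is passed in as a parameter and is 16 on seaborg.

```c
#include <assert.h>

/* Smallest thread count per process such that
   threads * procs_per_node >= cpus_per_node (ceiling division). */
int threads_per_process(int procs_per_node, int cpus_per_node)
{
    assert(procs_per_node >= 1 && cpus_per_node >= 1);
    return (cpus_per_node + procs_per_node - 1) / procs_per_node;
}
```

With 4 processes per node this gives 4 threads each (16 in total). With 3 processes it gives 6 threads each (18 in total, slightly oversubscribing the node).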

15
References
  • See the web page
    http://hpcf.nersc.gov/computers/SP/nodopt.html
    for an expanded version of this presentation and
    associated links.

16
Finis
  • End of this presentation.
