Build Environment

HP-UX hardware and runtime environment (7) ARIES ... read man mpidebug for MPI debugging instructions. page 36. 6/10/09. Queens University Belfast ... – PowerPoint PPT presentation

Slides: 54

Provided by: sabinebu

2
Build Environment Performance Tuning for
Itanium 2 Processor with emphasis on HP-UX

Frank Haase
Michael Riedmann
European Performance Centre
(aka Benchmark Centre)
Boeblingen - Germany

3
Agenda

Compilers - understand how to use them
effectively for high performance
HP and Intel compilers
compiler options
directives/pragmas
Debugging
WDB/GDB, HP Caliper
MPI
Using HP MPI on SMP and Clusters

4
HP and Intel Compilers

see http://www.hp.com/go/lang

5
Performance Expectations

ITANIUM-2 vs PA-RISC performance ratio: 1.5X to 2.5X
ITANIUM-2 vs ALPHA performance ratio: 1.0X to 1.2X
If these ratios can't be achieved in a benchmark, then something is wrong.

6
HP-UX hardware and runtime environment (1): Data models

Evolution from 32 bit to 64 bit. Kernel is 64 bit. User processes can be either 32 or 64 bit. Benefit: no forced 64 bit migration from 32 bit platforms.
Compiler default is 32 bit. For the 64 bit data model use +DD64 on IPF, +DA2.0W on PA-RISC.
64 bit model is LP64, 32 bit model is ILP32. Caution with long in C! Use long long to always get 64 bit integers, or use size_t whenever possible.
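
The caution about long can be illustrated with a small C sketch. The function names are hypothetical, and int64_t from <stdint.h> is used here as a fixed-width alternative to the long long recommended on the slide; whether that header is available on a given older HP-UX compiler release is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Name the data model from the widths of int, long and pointers. */
const char *data_model(void)
{
    if (sizeof(long) == 8 && sizeof(void *) == 8)
        return "LP64";
    if (sizeof(int) == 4 && sizeof(long) == 4 && sizeof(void *) == 4)
        return "ILP32";
    return "other";
}

/* Total payload in bytes. int64_t is 64 bit wide in both ILP32 and LP64,
   whereas a plain long would overflow above 2 GB in an ILP32 build. */
int64_t total_bytes(int64_t records, int64_t bytes_per_record)
{
    assert(records >= 0 && bytes_per_record >= 0);
    return records * bytes_per_record;
}
```

Calling total_bytes(3000000, 1000) yields 3000000000, a value that does not fit into a 32 bit long.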

7
HP-UX hardware and runtime environment (2): Data models and system libraries

Each system library and API is available in 4 flavours in separate subdirectories:
  /usr/lib/pa1.1     PA-RISC 32 bit
  /usr/lib/pa20_64   PA-RISC 64 bit
  /usr/lib/hpux32    IPF 32 bit
  /usr/lib/hpux64    IPF 64 bit
Same thing for extra products like MPI, MLIB, ...
Object format is ELF32 and ELF64, except for PA-RISC 32 bit (+DA2.0), which uses SOM.
Mixed linking is impossible. The linker returns an explicit message.

8
HP-UX hardware and runtime environment (3): Data alignment

LINUX is little endian.
HP-UX is big endian with both IPF and PA-RISC.
Binary data compatibility with MIPS, SPARC, POWER. Binary data incompatibility with ALPHA, IA-32, LINUX.
Data alignment is very similar between Tru64 and HP-UX. Unaligned access on HP-UX causes SIGBUS.
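
Both points, byte order and alignment, matter when binary data files move between platforms. The following C sketch (all function names are hypothetical) shows one common way to read such data without depending on host byte order or on aligned addresses; it is an illustration, not something taken from the slides.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 on a big endian host (e.g. HP-UX), 0 on a little endian one. */
int host_is_big_endian(void)
{
    const uint16_t probe = 0x0102;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 0x01;
}

/* Decode a 32 bit big endian value byte by byte. This works at any
   address and on any host, because no aligned word load is issued. */
uint32_t load_be32(const unsigned char *p)
{
    assert(p != NULL);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Copy a double out of a possibly unaligned buffer. memcpy avoids the
   misaligned load that would raise SIGBUS on HP-UX. */
double load_double_unaligned(const unsigned char *p)
{
    double value;
    assert(p != NULL);
    memcpy(&value, p, sizeof value);
    return value;
}
```

Casting the byte pointer to double * and dereferencing it directly is the pattern that tends to work on IA-32 but fails with SIGBUS on HP-UX.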

9
HP-UX hardware and runtime environment (4): Exception handling

HP-UX ignores FP exceptions by default. Link with +FPDVONZ for Tru64-like behaviour. The API for runtime control of the FPU is defined in /usr/include/fenv.h
NULL pointer dereference by default returns 0. Link with -z for Tru64-like SIGSEGV generation.
malloc(0) returns a valid pointer.

10
HP-UX hardware and runtime environment (5): Memory page size

HP-UX supports variable sized pages / large pages. HW page size is still 4k.
Page size is a property of the executable and can be modified with the chatr command:
  chatr +pd 4M +pi 4M <executable>
  chatr +pd L +pi D <executable>
Large pages can drastically reduce TLB misses. Many HPTC apps get a huge performance boost from large pages.

11
HP-UX hardware and runtime environment (6): Large files

Large files are default for HP-UX / IPF but not PA-RISC.
On PA-RISC use -o largefiles for newfs and mount.
Some HP-UX commands don't support large files, e.g. tar, cpio and pax fail to back up large files. Same problem with some open source tools.
Rebuild 32 bit programs with -D_LARGEFILE64_SOURCE.
No problem with 64 bit programs.

12
HP-UX hardware and runtime environment (7): ARIES

PA-RISC binaries can be run on IPF through dynamic translation.
Slowdown is 3X for GUIs, up to 10X for solvers.
The slowdown is hardly noticeable with interactive tools like Vim, Netscape, Acroread.
Many HP-UX tools on IPF, like SAM, are still PA-RISC binaries.
Use the file command to identify the nature of an executable.
IPF migration approach with ISVs:
  Rebuild solvers first
  Rebuild GUIs, pre- and post-processors later

13
HP-UX Compilers: C/C++

/opt/aCC/bin/aCC and /opt/ansic/bin/cc are the ANSI C/C++ compilers.
  -AA  ANSI C++ with namespace std and new C++ standard library. This is the default.
  -AP  Turn off -AA and use older classic C++ runtime libraries. Very useful for porting legacy and open source codes.
  -Aa  strict ANSI C (TRU64: -std1)
  -Ae  ANSI C with extensions (TRU64: -std)
No support for K&R mode.
Instantiation files are written to a repository on TRU64, to the object file on HP-UX.

14
HP-UX Compilers: Fortran (77), 90, 95

/opt/fortran90/bin/f90 is the Fortran 90/95 compiler.
Supports native OpenMP 1.1 and legacy CONVEX/HP directives:
  +Oopenmp        enable OpenMP directives
  +Oparallel +O3  enable legacy CONVEX/HP directives
Legacy f77 compiler is obsolete; f90 handles f77 codes very well. Use +U77 to enable BSD 3f intrinsics.
f90 adds a trailing underscore to function names on IPF and PA-RISC 64 bit. No trailing underscores for PA-RISC 32 bit. Explicit control with:
  +ppu    add trailing underscore
  +noppu  do not add trailing underscore
Pragma example: DEC ALIAS -> HP ALIAS

15
HP-UX Compilers: Mixing Languages

HP-UX compiler drivers do NOT recognize other languages; C and Fortran programs need to be compiled separately.
Make sure C symbols are lowercase and have a trailing underscore, or compile Fortran sources with +noppu.
If aCC or ld is used for linking, FORTRAN libraries have to be passed explicitly to the linker (libF90.a, libIO77.a, -lm, -lc).
If f90 is used for linking it will find its libraries automatically.
The what command returns the exact compiler version string.
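
As an illustration of the naming rule, here is a minimal sketch of a C routine written so that Fortran code compiled with trailing underscores can call it. The routine name scaled_add_ is hypothetical; the by-reference argument passing reflects the usual Fortran calling convention and is an assumption not stated on the slide.

```c
#include <assert.h>

/* y := alpha*x + y.
   Lowercase name with trailing underscore so the f90 default symbol
   matches; every argument arrives by reference, as Fortran passes it. */
void scaled_add_(const int *n, const double *alpha,
                 const double *x, double *y)
{
    int i;
    assert(n != 0 && alpha != 0 && x != 0 && y != 0);
    for (i = 0; i < *n; ++i)
        y[i] += *alpha * x[i];
}
```

On the Fortran side this would be invoked as call scaled_add(n, alpha, x, y). When the Fortran sources are built without trailing underscores, the C symbol would be named scaled_add instead.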

16
Intel V7.0 Linux Compilers

/opt/intel/compiler points to the latest compiler.
  efc  Fortran  /opt/intel/compiler70/ia64/bin/efc
  ecc  C, C++   /opt/intel/compiler70/ia64/bin/ecc
Source a subset of the following to set up the environment:
  /opt/intel/compiler70/ia64/bin/eccvars.csh (or .sh)
  /opt/intel/compiler70/ia64/bin/efcvars.csh (or .sh)
Useful web page: http://www.intel.com/software/products/compilers/

17
Compilers: Directives and Pragmas

HP Fortran compiler directive
  c$dir
HP C compiler pragma
  #pragma _cnx
  (note blank between pragma and _cnx above)
OpenMP directives
  c$omp
  Preferred for directive based parallelism
  C$OMP parallel do private(x,y) shared(z)
Intel compiler directives
  cdec$
  cdir$

18
HP-UX and Intel Linux Compilers: Architecture and Data Model Switches

  +DA2.0N +DS2.0     PA-RISC (2.0) 32 bit
  +DA2.0W +DS2.0     PA-RISC (2.0) 64 bit
  +DSmckinley +DD32  ITANIUM-2 32 bit
  +DSmckinley +DD64  ITANIUM-2 64 bit
  (+DSmckinley = +DSitanium2)
  -tpp2              ITANIUM-2 64 bit with INTEL compiler
Recommendations
+DSmckinley and -tpp2 are also the best choice for Madison code.
+DSitanium, +DSblended and -tpp1 should be used only if the target is ITANIUM-1. This code performs 20% slower on ITANIUM-2.

19
Intel Linux Compilers: Optimisation Levels

-O2
very safe, register rotation, no extra unrolling,
no prefetch instructions
-O3
usually safe, lots of optimizations including
load word pair generation, up to 8-way unrolling,
prefetch instructions

20
HP-UX Compilers: Optimisation Levels (1)

+O0  Default, minimal optimisation. Fastest compile time. Good debugging support.
+O1  Basic block level optimisation. Pretty fast compile time. Improved runtime performance. Good debugging support.
+O2  Full routine level optimisation. Register rotation and data prefetching. Limited debugging support. Good runtime performance. Inlining for sqrt.
+O2 is sufficient for most FORTRAN codes.

21
HP-UX Compilers: Optimisation Levels (2)

+O3  Full source file level optimisation. No debugging support (-g is invalid). Adds subroutine cloning and inlining (only within the source file). Adds transformations for nested loops. Inlines all math intrinsics on IPF. Matches and inlines inverse square roots if +Ofltacc=relaxed. Use +Oinfo or +Oreport=all for an optimisation report.
+O3 is not always better than +O2. Use it deliberately for:
  inlining of math intrinsics and frequently called routines
  transformation of nested loops
  optimized inverse square roots (e.g. quantum chemistry)

22
HP-UX Compilers: Global and Profile Based Optimisation

+O4  Performs global optimisation at link time. Can be combined with Profile Based Optimisation (PBO).
+Oprofile=collect  Make an instrumented executable for profiling. After execution it will dump the data in flow.data.
+Oprofile=use  Use the profile data from flow.data for global optimisation.
+O4 and PBO are most useful for C and C++, as they provide global inlining capability and reduce branch misprediction. The benefit for FORTRAN codes is very limited due to common programming practices.

23
HP-UX Compilers: Prefetching

+O[no]dataprefetch[=direct|indirect|none]
Control generation of data prefetch instructions for data structures referenced within innermost loops. The defined values for kind are:
  direct    Enable generation of data prefetch instructions for the benefit of direct memory accesses, but not indirect memory accesses.
  indirect  Enable generation of data prefetch instructions for the benefit of both direct and indirect memory accesses. This is the default at optimization levels +O2 and above.
  none      Disable generation of data prefetch instructions. This is the default at optimization levels +O1 and below.

24
HP-UX and Intel Linux Compilers: Fortran prefetch directives

HP-UX: c$dir prefetch (expression)
  no special compile options needed
Intel: cdir$ noprefetch A,B,..
Allows the user to prefetch explicitly where the compiler fails, e.g. when addresses are computed:
      do i = 1,n
        ia = func(i)
c$dir prefetch b(func(i+50))
        b(ia) = b(ia) + a(i)
      enddo
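
For comparison, the same idea as in the Fortran loop above can be sketched in C. This is only an analogy: __builtin_prefetch is a GCC/Clang builtin, not an HP or Intel directive from the slides, and the function name, the index array (standing in for func) and the guard against running past the end are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 50  /* iterations ahead, as in func(i+50) */

#if defined(__GNUC__)
#define PREFETCH_FOR_WRITE(addr) __builtin_prefetch((addr), 1)
#else
#define PREFETCH_FOR_WRITE(addr) ((void)0)  /* no-op on other compilers */
#endif

/* b[index[i]] += a[i] for all i. The target element needed
   PREFETCH_DISTANCE iterations later is requested early, because the
   compiler cannot predict an address that is computed at run time. */
void scatter_add(double *b, const double *a, const size_t *index, size_t n)
{
    size_t i;
    assert(b != NULL && a != NULL && index != NULL);
    for (i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n)
            PREFETCH_FOR_WRITE(&b[index[i + PREFETCH_DISTANCE]]);
        b[index[i]] += a[i];
    }
}
```

The prefetch is only a hint, so the result is identical with or without it; the best distance depends on memory latency and loop body cost and has to be found by measurement.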

25
HP-UX Compilers: Floating-Point Accuracy

+Ofltacc=[strict|default|limited|relaxed]
Control the level of FP optimizations that the compiler may perform.
Useful for debugging when there are numerical instabilities.
  default  Allow contractions, such as fused multiply-add (FMA), but disallow any other optimization that can result in numerical differences.
  limited  Like default, but also allow floating point optimizations which may affect the generation and propagation of infinities, NaNs, and the sign of zero.
  relaxed  In addition to the optimizations allowed by limited, permit optimizations, such as reordering of expressions, even if parenthesized, that may affect rounding error. This is the same as +Onofltacc.
  strict   Disallow any floating point optimization that can result in numerical differences. This is the same as +Ofltacc.
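
Why reordering of expressions can change results is easy to show with a small C sketch. The function names and the chosen values are illustrative assumptions; the volatile intermediates only serve to force each partial sum to be rounded to double.

```c
#include <assert.h>

/* (a + b) + c, with the intermediate rounded to double. */
double sum_left_to_right(double a, double b, double c)
{
    volatile double partial = a + b;
    return partial + c;
}

/* a + (b + c): a reordering that a relaxed mode is allowed to choose. */
double sum_right_to_left(double a, double b, double c)
{
    volatile double partial = b + c;
    return a + partial;
}

/* Self-check with values for which the two orders differ:
   1.0 is far below the rounding granularity of 1e20, so it is lost
   when it is added to -1e20 first. */
void demo_reordering(void)
{
    assert(sum_left_to_right(1e20, -1e20, 1.0) == 1.0);
    assert(sum_right_to_left(1e20, -1e20, 1.0) == 0.0);
}
```

Codes that rely on a specific summation order, such as compensated sums, are the typical candidates for the strict setting.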

26
Intel Linux Compilers: Floating-Point Accuracy

-IPF_fma[-] (-IPF_fma- to turn off fma generation)
  Enable / disable the combining of floating point multiplies and add / subtract operations. Note: fmas are still generated, but each corresponds to either an fmpy (fma x,y,f0) instruction or an fadd (fma x,f1,y) instruction.
-IPF_fltacc[-]
  Enable / disable optimizations that affect floating point accuracy.

27
Inline Math Intrinsics with +Olibcalls

Not all intrinsics are treated equal:
  abs is inlined at all optimisation levels
  sqrt is inlined at +O2 and above
  Other math intrinsics like exp, log, pow, sin are inlined at +O3
Reciprocal square roots (y = 1./sqrt(x)):
  IPF can compute rsqrt directly (no separate div/sqrt)
  HP-UX comes with a nonstandard rsqrt intrinsic
  With +Ofltacc=relaxed the f90 compiler matches and calls rsqrt at +O2 but inlines rsqrt only at +O3. Use it carefully!
  Nice performance boost in quantum chemistry (Coulomb forces)

28
Important Linker Options

Flush denormalized values to zero
  HP-UX: Linking with +FPD flushes denormalized values to zero
  Linux: Compile the main routine with -ftz and link normally
Archived libraries
  HP-UX: -Wl,-a,archive or -Wl,-a,archive_shared to ensure archived libraries are used as much as possible
  Linux: -static prevents linking with shared libraries
    ld default corresponds to HP-UX's -a shared_archive
    -Bstatic to use archived libraries
    -Bdynamic to use shared libraries

29
Compiler Flags for Parallelism

HP-UX
  +Oopenmp  Enable OpenMP directives. Available at any optimisation level.
  +Oparallel +O3  Enable HP/Convex directives and automatic parallelisation. Requires +O3.
  +Oparallel +O3 +Onoautopar  Disable automatic parallelisation. Keep directive based parallelism.
  +Oparallel +O3 +Onodynsel  Disable dynamic loop selection.

Intel Linux
  -openmp  Enable OpenMP directives. Available at any optimisation level.
  -parallel  Enable automatic parallelisation.
OpenMP is process based with Intel Linux while it is pthread based with HP-UX.
Linking on HP-UX involves libomp, libcps, libpthread.

30
Environment Variables for Parallelism

HP-UX: MP_NUMBER_OF_THREADS sets the number of threads with HP / CONVEX directives.
HP-UX: MP_IDLE_THREADS_WAIT sets the number of milliseconds a thread spins before suspending itself. If the number is less than 0, the threads will spin wait. Useful to prevent context switches and thread migration.
HP-UX: MP_GANG=[ON|OFF] enables / disables gang scheduling for multithreaded and MPI apps. Useful for oversubscribed and throughput scenarios.
HP-UX and Linux: OMP_NUM_THREADS sets OpenMP parallelism.
HP-UX and Linux: MLIB_NUMBER_OF_THREADS sets MLIB shared memory parallelism.
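
A minimal C sketch of directive based parallelism whose thread count is governed by OMP_NUM_THREADS follows. The function name is hypothetical. The pragma takes effect only when the compiler's OpenMP switch is given; without it the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>

/* Dot product of two integer vectors. With OpenMP enabled, the loop
   iterations are shared among OMP_NUM_THREADS threads and the partial
   sums are combined by the reduction clause. */
long dot_product(const int *x, const int *y, int n)
{
    long sum = 0;
    int i;
    assert(n >= 0);
#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; ++i)
        sum += (long)x[i] * y[i];
    return sum;
}
```

Integer arithmetic is used here so that the result does not depend on the order in which threads combine their partial sums; with floating point data the reduction order can change the last digits.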

31
HP-UX Compilers: Dangerous and Useless Switches

Wrong floating point answers can be caused by +O3, +O4, +Ofltacc=[default|relaxed], +Onoparmsoverlap, +FPD. Use with caution and check your answers.
Useless switches, don't waste your time!
  +Ovectorize  Matches specific loop patterns and replaces them with optimized library calls. Useful only for SPECfp.
  +Oaggressive, +Oall  Lots of aggressive optimisations including +Ovectorize.
  +fastallocatable  was never observed to improve anything.

32
Recommended Build Approach

Get reference timings/outputs from PA-RISC or whatever.
Set the right architecture and data model switches.
Start with +O2 +Odataprefetch +Onolimit -g -Wl,+pd,L
In case of wrong answers or divergence add +Ofltacc=strict. In case of right answers you can add +FPD +Ofltacc=relaxed and check answers again.
For C/C++ try +Onoparmsoverlap and check answers.
Now make a profile with prospect or caliper.
Try +O3 and +Oloop_block for selected hotspot routines.
Start trying source changes.
For C/C++ try profile based optimisation.

33
HP-UX Debuggers and Profilers

see http://www.hp.com/go/wdb
http://www.hp.com/go/hpcaliper

34
WDB/GDB Debugger

has replaced all previous debuggers on HP-UX
(xdb, dde) for both PA-RISC and IPF.
Choice of user interfaces
gdb for command line use
vdb runs gdb in a split terminal window like xdb
wdb is a Motif GUI on top of gdb
Location /opt/langtools/bin

35
HP WDB Features

Support for 32 bit and 64 bit data models and all
languages
Debugs optimized code up to +O2
Support for pthreads and consequently OpenMP
HW watchpoints
Memory checking (currently only on PA-RISC)
Array browsing
User definable buttons
Basic MPI support: An extra wdb instance is started for every MPI process, so this is usable up to maybe 4 processes. Read man mpidebug for MPI debugging instructions.

36
Debug Preparation

On PA-RISC
  Compile and link with -g. Debug information is in the executable.
  No support for 64 bit optimized programs. Compile at +O0.
  32 bit programs can be debugged at +O1 (good) and +O2 (limited).
  Trouble with NFS mounted executables. Try to work on local disk.
On IPF
  Compile and link with -g. Debug information stays in the object files. Compile with +noobjdebug for debug info in the executable.
  Both 32 bit and 64 bit programs can be debugged at +O1 (good) and +O2 (limited). Note that register variables can't be viewed at +O2.

37
Profiling with CALIPER (1)

Highlights
Dynamic instrumentation requires no preparation
for executables
All Itanium PMU counters are available for event
counts
Supports multithreaded execution
Wish List
Support for MPI profiling and 3D charts for
parallel profiles
Better source line mapping
Two modes of operation
Low intrusion hardware counters based
measurements
Sample based/ clock driven profiling
Measure execution cycles, cache misses, branch
mispredictions, etc.
Dynamic binary instrumentation for precise counts
e.g. function call graph, basic block arc counts
accurately
event driven (Gprof or Cxperf type)

38
Profiling with CALIPER (2)

Usage: /opt/caliper/bin/caliper config-file
Common config-files (many more available)
  cgprof       call graph profile (intrusive)
  sample_ip    sampling profile like PROSPECT (non intrusive)
  pbo          create flow.data for PBO (requires +O1 build)
  dcache_miss  cache metrics
Output options (many more available)
  -o      single file ASCII output
  --html  HTML output into directory

39
Profiling with CALIPER (3): measurements

type                  config_file        comments
optimization          pbo                black box
total measurements    total_cpu          exact totals, no impact
sample measurements   branch_prediction  sampled details, low impact
                      dicache_miss
                      ditlb_miss
                      sample_cpuip
precise measurements  arc_count          exact details, high impact
                      func_count
                      func_cover
hybrid                cgprof             sampled + exact, high impact

40
Profiling with CALIPER (4): measurement strategy

goal                          config file
automatic optimization        pbo
call graph profile            cgprof
correctness testing           arc_count, func_count, func_cover
determine what takes time     total_cpu w/ various events
determine where event happens branch_prediction, dicache_miss, ditlb_miss, sample_ip w/ select trigger
determine when event happens  sample_cpu w/ select trigger

41
Profiling with CALIPER (5)

All PMU counters of ITANIUM can be used with
CALIPER config files. See the list in
/opt/caliper/doc/text/itanium2_cpu_counters
For counter description see the Intel Itanium-2
Processor Reference Manual For Software
Development and Optimisation
Measure the total number of Flops, Underflows and Clock Cycles:
  caliper total_cpu global-counters \
    FP_OPS_RETIRED, FP_FLUSH_TO_ZERO, CPU_CYCLES \
Measure the number of MFlops per subroutine:
  caliper sample_ip \
    sampling-counter FP_OPS_RETIRED,,1000000 \

42
total_cpu example

caliper total_cpu crafty.O3
HP Caliper Total CPU Counts Report
Target Application
Program /home/daveb/crafty.O3
Invocation crafty.O3
Process ID 8859 (started by Caliper)
Start time 07:56:39 AM
End time 07:56:43 AM
Last modified June 01, 2002 at 07:56 AM
Processor Information
Machine name longsp8
Number of processors 2
Processor type Itanium2
Processor speed 798 MHz

Report Help
Use the caliper option --info to append help to
this report,
or see /opt/caliper/doc/text/total_cpu.help.
-----------------------------------------
Counter Priv. Mask Count
-----------------------------------------
INST_DISPERSED 8 (USER) 5029272413
NOPS_RETIRED 8 (USER) 690880642
CPU_CYCLES 8 (USER) 3140760527
-----------------------------------------
CPI
0.6245 CPU_CYCLES / INST_DISPERSED
Useful CPI
0.7239 CPU_CYCLES / (INST_DISPERSED -
NOPS_RETIRED)
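
The two ratios at the end of the report can be re-derived from the three counter values. The following C sketch uses hypothetical function names and simply restates the formulas printed by Caliper.

```c
#include <assert.h>

/* Cycles per dispersed instruction, NOPs included. */
double cpi(double cpu_cycles, double inst_dispersed)
{
    assert(inst_dispersed > 0.0);
    return cpu_cycles / inst_dispersed;
}

/* Cycles per useful instruction: NOPs are subtracted first,
   so this value is never smaller than the plain CPI. */
double useful_cpi(double cpu_cycles, double inst_dispersed,
                  double nops_retired)
{
    assert(inst_dispersed > nops_retired);
    return cpu_cycles / (inst_dispersed - nops_retired);
}
```

With the counts from the report, cpi(3140760527, 5029272413) gives about 0.6245 and useful_cpi(3140760527, 5029272413, 690880642) gives about 0.7239, matching the printed values. Roughly 14% of the dispersed instructions in this run were NOPs.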

43
branch_prediction (html) example
44
More Profiling Tools ...

TUSC aka TRUSS is a system call tracer. Not supported but usable and robust. Get it here: ftp://ftp.cup.hp.com/dist/networking/tools
Visual Threads thread tracer was recently ported from Tru64 to HP-UX (IPF only). Currently in beta status. See http://shvlhd.zko.cpqcorp.net/ThreadTools
GlancePlus, the official HP-UX system monitor with GUI, offers detailed statistics and graphs on all relevant system resources like CPUs, Memory, Swap, Disks, Network, Processes. Invocation: /opt/perf/bin/gpm

45
Using HP MPI
46
Using HP MPI (1): Overview

Is part of HP-UX TCOE
Installs in /opt/mpi
Implements full MPI 1.2 and 90% of the MPI 2 standard definition. Missing only dynamic process deletion.
Optimized for both SMP and cluster use
Supports hybrid OMP + MPI programs by thread safe library
Supports up to 8 byte integers and 16 byte reals by +i8 and +autodbl switches
Integrated with LSF PAM
Supported with TotalView (currently PA-RISC only)

47
Using HP MPI (2): Build

Add /opt/mpi/bin to PATH.
Make sure that the HyperFabric2 fileset is installed in /opt/clic even if no HF2 cards are there. This is not part of HP MPI.
Use the convenience scripts mpif90 and mpicc for compiling and linking. No need to worry about include and library paths.
Linking for HMP is tricky. Use archive libraries. Suggestion:
  -Wl,-a,archive_shared \
  -lmpi_hmp -lclic_csi

48
Using HP MPI (3): Run

Invocation on SMP systems
  export MPI_FLAGS
  /opt/mpi/bin/mpirun -np 64
Cluster execution uses the remsh mechanism -> $HOME/.rhosts required.
Creation of an appfile with 1 entry for each host. All necessary env variables must be distributed through the appfile:
  -h rx01 -np 2 -e MPI_HMP=ON -e MPI_FLAGS
  -h rx02 -np 2 -e MPI_HMP=ON -e MPI_FLAGS
  -h rx03 -np 2 -e MPI_HMP=ON -e MPI_FLAGS
  -h rx..
Invocation from master node
  /opt/mpi/bin/mpirun -f <appfile>

49
Using HP MPI (4): Environment variables

Yield a waiting CPU after 1s of spinwaiting. Useful on SMP systems to avoid context switches and process migration if not oversubscribed.
  export MPI_FLAGS=y1000
Set gang scheduling (MPI + OMP), useful on oversubscribed SMPs.
  export MP_GANG=ON
Run with the chosen debugger (see man mpidebug for details).
  export MPI_FLAGS=ewdb
Choose HMP instead of TCP/IP if HF2 is available. HMP will identify the HF2 interface even if it does not correspond to the hostname.
  export MPI_HMP=ON

50
Using HP MPI (5): Lightweight instrumentation

Lightweight instrumentation dumps messaging statistics into a .instr text file at termination.
  -e MPI_INSTR=<prefix>
  mpirun -i <prefix>
Statistics include:
  User time, MPI overhead, blocking time, system time for each rank
  Number of calls for each MPI call for each rank
  Message sizes for each pair of ranks
Despite the missing GUI these statistics are very useful to determine load imbalance and total MPI overhead. Intrusion is far below 10%. No relinking required.

51
Using HP MPI (6): Profiling with CALIPER on a cluster

Step 1: Create a little command script
  #!/bin/sh
  /opt/caliper/bin/caliper sample_ip \
  -html.(hostname).
Step 2: Invoke the command script instead of the program from the appfile
  -h rx01 -np 2 -e MPI_HMP=ON -e MPI_FLAGS
  -h rx02 -np 2 -e MPI_HMP=ON -e MPI_FLAGS
  -h rx03 -np 2 -e MPI_HMP=ON -e MPI_FLAGS
  -h rx..
Step 3: Activate lightweight instrumentation and run it
  /opt/mpi/bin/mpirun -i <prefix> -f <appfile>
MPI will dump its statistics and CALIPER will create an HTML report directory for each rank.
This method works on clusters as well as on SMPs.

52
Using HP MPI (7): PALLAS MPI-1 key results on RX2600

(*) Due to protocol complexity HMP has higher latency than MYRICOM GM. Kernel bypass might improve HMP latency to 15 µsec but is not committed, as Infiniband is already on the horizon to replace HyperFabric2.