Title: Parallel Programming on the SGI Origin2000

1. Parallel Programming on the SGI Origin2000
Taub Computer Center, Technion
Anne Weill-Zrahia
With thanks to Moshe Goldberg (TCC) and Igor Zacharov (SGI)
Mar 2005
2. Parallel Programming on the SGI Origin2000
- Parallelization Concepts
- SGI Computer Design
- Efficient Scalar Design
- Parallel Programming - OpenMP
- Parallel Programming - MPI
3. Efficient Scalar Programming
4. Remember to use make
Why make: implicit documentation, minimized compile time, 100% "oops"-free builds; variants: make / pmake / smake.

OBJS = f1.o f2.o f3.o
FFLAGS = -O3 -r12k
LDFLAGS = -lm -r12k

all: pgm1

pgm1: $(OBJS)
<tab>f77 -o pgm1 $(OBJS) $(LDFLAGS)

f2.o: f2.f
<tab>f77 $(FFLAGS) -static -c f2.f

clean:
<tab>-rm -f $(OBJS) pgm1 core
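A usage sketch (assuming the makefile above is saved as Makefile in the current directory):

    make          # builds pgm1, recompiling only files whose sources changed
    make clean    # removes the objects, the executable and core files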
5. Speedup opportunities
A program may run slowly because not all resources are used efficiently:
- on the processor: non-optimal scheduling of instructions (too many wait states)
- memory access: the memory access pattern is not optimized for the architecture
  - not all data in a cache line is used (spatial locality)
  - data in the cache is not reused (temporal locality)
Performance analysis is used to diagnose the problem.
The compiler will attempt to optimize the program; however, this might not be possible:
- the data representation can inhibit compiler optimization
- the algorithm presentation can inhibit optimization
Often it is necessary to rewrite critical parts of the code (loops) so that the compiler can do better performance optimization. Understanding the optimization techniques helps the compiler to use them effectively.
6. Compiler optimization techniques
Some optimization techniques built into the compiler:
- Loop based:
  - loop interchange
  - outer and inner loop unrolling
  - cache blocking
  - loop fusion (merge) and fission (split)
- General:
  - procedure inlining
  - data and array padding
The algorithm should be presented in the program in such a way that the compiler can apply these optimization techniques, leading to the best performance on the specific computer.
7. Some simple arithmetic replacements

      do i = 1,n
        a = sin(x(i))
        v(i) = 2.0*a
      enddo

Replace by:

      do i = 1,n
        v(i) = 2.0*sin(x(i))
      enddo

      do j = 1,m
        do k = 1,n
          v(k,j) = 2.0*(a(k)/b(j))
        enddo
      enddo

Replace by (the expensive division is hoisted out of the inner loop):

      do j = 1,m
        btemp = 2.0/b(j)
        do k = 1,n
          v(k,j) = btemp*a(k)
        enddo
      enddo
8. Array Indexing
Arrays can be indexed in several ways. For example:

Explicit addressing:
      do j = 1,m
        do k = 1,n
          .. A(k+(j-1)*n) ..
        enddo
      enddo

Direct addressing:
      do j = 1,m
        do k = 1,n
          .. A(k,j) ..
        enddo
      enddo

Loop-carried addressing:
      do j = 1,m
        do k = 1,n
          kk = kk+1
          .. A(kk) ..
        enddo
      enddo

Indirect addressing:
      do j = 1,m
        do k = 1,n
          .. A(index(k,j)) ..
        enddo
      enddo

- The addressing scheme will have an impact on performance
- Arrays should be accessed in the most natural, direct way for the compiler to apply loop optimization techniques
9. Data storage in memory
Data storage is language dependent.

Fortran stores multi-dimensional arrays in column order: in memory the leftmost index of a(i,j) changes first, so a column of consecutive i values is contiguous, and columns j, j+1, j+2 follow one another.

C stores multi-dimensional arrays in row order: in memory the rightmost index of a[i][j] changes first, so a row of consecutive j values is contiguous, and rows i, i+1, i+2 follow one another.

For arrays that do not fit in the cache, accessing elements in storage order gives much faster performance.
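A minimal sketch of the difference (an assumed example, not from the slides): the same summation done in storage order and against it.

      program stride
      integer n
      parameter (n=2000)
      real a(n,n), s
      integer i, j
      do j = 1,n
        do i = 1,n
          a(i,j) = 1.0
        enddo
      enddo
      s = 0.0
c     storage order: leftmost index i varies fastest (stride-1 access)
      do j = 1,n
        do i = 1,n
          s = s + a(i,j)
        enddo
      enddo
c     against storage order: consecutive accesses are n elements apart
      do i = 1,n
        do j = 1,n
          s = s + a(i,j)
        enddo
      enddo
      print *, s
      end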
10. Fortran loop interchange

Original loop:
      do i = 1,n
        do j = 1,m
          c(i,j) = a(i,j) + b(i,j)
        enddo
      enddo

Interchanged loop:
      do j = 1,m
        do i = 1,n
          c(i,j) = a(i,j) + b(i,j)
        enddo
      enddo

In the original loop the access order runs against the storage order; after the interchange the two coincide. The distribution of data in memory is not changed, only the access pattern changes.

The compiler can do this automatically, but there are complicated cases.
11. Index reversal

Original loop:
      do i = 1,n
        do j = 1,m
          c(i,j) = a(i,j) + b(j,i)
        enddo
      enddo

The access is wrong for A and C, but it is right for B. Interchange would be good for A and C, but bad for B.

Possible solution: index reversal of B, that is, b(j,i) is replaced by b(i,j). But this must be done everywhere in the program. (It must be done manually; the compiler will not do it.)

Interchanged loop + index reversal:
      do j = 1,m
        do i = 1,n
          c(i,j) = a(i,j) + b(i,j)
        enddo
      enddo
12. Loop interchange in C
In C, the situation is the opposite of what it is in Fortran.

Original loop:
      for (j=0; j<m; j++)
        for (i=0; i<n; i++)
          c[i][j] = a[i][j] + b[j][i];

Addressing of c and a is wrong; addressing of b is correct.

Index reversal loop:
      for (j=0; j<m; j++)
        for (i=0; i<n; i++)
          c[j][i] = a[j][i] + b[j][i];

Interchanged loop:
      for (i=0; i<n; i++)
        for (j=0; j<m; j++)
          c[i][j] = a[i][j] + b[j][i];

The performance benefits in C are the same as in Fortran. In most practical situations, loop interchange (supported by the compiler) is easier to achieve than index reversal.
13. Loop fusion
Loop fusion (merging two or more loops together):
- fusing loops that refer to the same data enhances temporal locality
- a larger loop body allows more effective scalar optimizations and instruction scheduling

Original loops:
      for (i=0; i<n; i++) a[i] = b[i] + 1;
      for (i=0; i<n; i++) c[i] = a[i]/2;
      for (i=0; i<n; i++) d[i] = 1/c[i+1];

Fused loops:
      for (i=0; i<n; i++) {
        a[i] = b[i] + 1;
        c[i] = a[i]/2;
      }
      for (i=0; i<n; i++) d[i] = 1/c[i+1];

More fusion, with peeling:
      a[0] = b[0] + 1;
      c[0] = a[0]/2;
      for (i=1; i<n; i++) {
        a[i] = b[i] + 1;
        c[i] = a[i]/2;
        d[i-1] = 1/c[i];
      }
      d[n-1] = 1/c[n];

- loop peeling can break data dependencies when fusing loops
- sometimes temporary arrays can be replaced by scalars (manual only)
- the compiler will attempt to fuse loops if they are adjacent, that is, if there is no code between the loops to be fused
14. Loop fission
Loop fission (splitting), or loop distribution: improve memory locality by splitting out loops that refer to different, independent arrays.

Original loop:
      for (i=1; i<n; i++) {
        a[i] = a[i] + b[i-1];
        b[i] = c[i-1]*x + y;
        c[i] = 1/b[i];
        d[i] = sqrt(c[i]);
      }

After fission:
      for (i=0; i<n-1; i++) {
        b[i+1] = c[i]*x + y;
        c[i+1] = 1/b[i+1];
      }
      for (i=0; i<n-1; i++)
        a[i+1] = a[i+1] + b[i];
      for (i=0; i<n-1; i++)
        d[i+1] = sqrt(c[i+1]);
15. Array placement effects
Wrong data placement in memory can lead to cache thrashing. The compiler has two techniques built in to avoid thrashing:
- array padding
- leading dimension extension

In principle, the leading dimension of an array should be an odd number: if a multi-dimensional array has small dimensions (such as a(32,32,32)), the leading dimension should be an odd number, never a power of 2.
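A minimal sketch of leading dimension extension (an assumed example): declare one extra row so that consecutive columns no longer start at power-of-2 strides, and loop only over the part actually used.

c     padded: leading dimension 33 instead of 32
      real a(33,32,32)
      integer i, j, k
      do k = 1,32
        do j = 1,32
          do i = 1,32
            a(i,j,k) = 0.0
          enddo
        enddo
      enddo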
16. RISC memory levels
[Diagram: single CPU; data flows CPU <-> cache <-> main memory]

17. RISC memory levels
[Same diagram, repeated]
18. Direct mapped cache thrashing

      common // a(4096),b(4096)
      do i = 1,n
        prod = prod + a(i)*b(i)
      enddo

With 4-byte words, a(4096) occupies exactly 16 KB of virtual memory, so in a 16 KB direct-mapped cache (cache line = 4 words) every b(i) maps to the same cache line as a(i).

Thrashing: every memory reference is a cache miss.

The rule: avoid leading dimensions that are a power of 2!
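A sketch of the usual fix (an assumed example, not from the slide): insert a pad between the arrays in the common block so that a(i) and b(i) no longer map to the same cache line.

c     a pad of one cache line (4 words) shifts b relative to a
      common // a(4096), pad(4), b(4096)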
19. Array padding example

      common // a(1024,1024),b(1024,1024),c(1024,1024)
      do j = 1,1024
        do i = 1,1024
          a(i,j) = a(i,j) + b(i,j)*c(i,j)
        enddo
      enddo

Addr{C(1,1)} = Addr{B(1,1)} + 1024*1024*4. Position in the cache: C(1,1) collides with B(1,1), since (1024*1024*4) mod 32K = 0.

      common // a(1024,1024),pad1(129),
     &          b(1024,1024),pad2(129),
     &          c(1024,1024)
      do j = 1,1024
        do i = 1,1024
          a(i,j) = a(i,j) + b(i,j)*c(i,j)
        enddo
      enddo

- Padding causes the cache lines to be placed in different places
- The compiler will try to do the padding automatically

Addr{C(1,1)} = Addr{B(1,1)} + 1024*1024*4 + 129*4. Position in the cache: C(1,1) is now 129 words away from B(1,1), so they no longer collide.
20. Dangers of array padding
- The compiler will automatically pad local data
- At -O3 optimization, the compiler will pad common blocks:
  - all routines with common blocks must be compiled with -O3, otherwise the compiler will not perform this optimization
  - padding of common blocks is safe as long as the Fortran standard is not violated

Example of a violation (indexing a beyond its declared size to sweep through b, relying on the two arrays being contiguous):

      subroutine sub
      common // a(512,512),b(512,512)
      do i = 1,2*512*512
        a(i) = 0.0
      enddo
      return
      end

The remedy is to fix the violation, or not to use this optimization, either by compiling with lower optimization or by using the compiler flag -OPT:reorg_common=off.
21. Loop unrolling
Loop unrolling: perform multiple loop iterations at the same time.

      do i = 1,n,1
        ..(i)..
      enddo

Unrolled:

      do i = 1,(n-unroll)+1,unroll
        ..(i)..
        ..(i+1)..
        ..(i+2)..
        ..
        ..(i+unroll-1)..
      enddo

Cleanup:

      do i = n-mod(n,unroll)+1,n
        ..(i)..
      enddo

Advantages of loop unrolling:
- more opportunities for super-scalar code
- more data reuse
- exploits the presence of cache lines
- reduction in loop overhead

Disadvantages of loop unrolling:
- cleanup code required

NOTE: the compiler will unroll code automatically, based on an estimate of how much time the loop body will take.
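A concrete sketch (an assumed example): a dot product unrolled by 4, with the cleanup loop handling the leftover iterations.

      s = 0.0
      m4 = n - mod(n,4)
c     main unrolled loop: four partial products per iteration
      do i = 1,m4,4
        s = s + x(i)*y(i) + x(i+1)*y(i+1)
     &        + x(i+2)*y(i+2) + x(i+3)*y(i+3)
      enddo
c     cleanup loop for the remaining mod(n,4) elements
      do i = m4+1,n
        s = s + x(i)*y(i)
      enddo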
22. Blocking for cache (tiling)
Blocking for cache:
- an optimization that is good for data sets that do not fit into the data cache
- a way to increase spatial locality of reference (that is, to exploit full cache lines)
- a way to increase temporal locality of reference (that is, to improve data reuse)
- mostly beneficial with multi-dimensional arrays

      do i = 1,n
        ..(i)..
      enddo

Blocked; only nb elements at a time of the inner loop are active:

      do il = 1,n,nb
        do i = il,min(il+nb-1,n)
          ..(i)..
        enddo
      enddo
23. Principle of blocking

      do i = 1,n
        a(i) = i
      enddo

Blocked (incorrect if iblk does not divide n: the last block overruns):

      do i1 = 1,n,iblk
        i2 = i1+iblk-1
        do i = i1,i2
          a(i) = i
        enddo
      enddo

Blocked, with the last block clamped to n:

      do i1 = 1,n,iblk
        i2 = i1+iblk-1
        if (i2.gt.n) i2 = n
        do i = i1,i2
          a(i) = i
        enddo
      enddo
24. Blocking example: transpose

      do j = 1,n
        do i = 1,n
          a(i,j) = b(j,i)
        enddo
      enddo

Either A or B is accessed in non-unit stride: bad reuse of data.

Blocking the loops for cache will do the transpose block by block, reusing the elements in the blocks:

      do jt = 1,n,jtblk
        do it = 1,n,itblk
          do j = jt,jt+jtblk-1
            do i = it,it+itblk-1
              a(i,j) = b(j,i)
            enddo
          enddo
        enddo
      enddo
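The block bounds above assume that jtblk and itblk divide n evenly; a sketch of the general form clamps each block with min(), as on the previous slide:

      do jt = 1,n,jtblk
        do it = 1,n,itblk
          do j = jt,min(jt+jtblk-1,n)
            do i = it,min(it+itblk-1,n)
              a(i,j) = b(j,i)
            enddo
          enddo
        enddo
      enddo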
25. A recent example: matrix multiply
26. Matrix multiply: remove the if from the loop
27. Profile -- original
28. Profile -- if statement moved
29. Exercise 2 -- matrix loop order
30. (image only)
31. (image only)
32. Procedure inlining
Inlining: replace a function (or subroutine) call by its source.

      do i = 1,n
        call dowork(a(i),c(i))
      enddo

      subroutine dowork(x,y)
      y = 1.0 + x*(1.0 + x*0.5)
      end

After inlining:

      do i = 1,n
        c(i) = 1.0 + a(i)*(1.0 + a(i)*0.5)
      enddo

Advantages:
- increased opportunities for optimizations
- more opportunities for loop nest optimizations
- reduced call overhead (minor)

Inhibitions to inlining:
- mismatched arguments (type or shape)
- no inlining across languages (C to Fortran)
- no static (SAVE) variables
- no recursive routines
- no functions with alternate entries
- no nested subroutines (as in F90)

Candidates for inlining are modules that:
- are small (not much source code)
- are called many times (say, in a loop)
- do not take much time per call
33. A simple matrix multiplication (triple-nested loop)

      subroutine mm(m,n,p,a,lda,b,ldb,c,ldc)
      integer m,n,p,lda,ldb,ldc
      dimension a(lda,p),b(ldb,n),c(ldc,n)
      do i = 1,m
        do j = 1,n
          do k = 1,p
            c(i,j) = c(i,j) + a(i,k)*b(k,j)
          end do
        end do
      end do
      end

Try to speed it up!
34. 1) Loop reversal

Original order:

      do i = 1,m
        do j = 1,n
          do k = 1,p
            c(i,j) = c(i,j) + a(i,k)*b(k,j)
          end do
        end do
      end do

Reversed loop order, with the loop constant b(k,j) held in t:

      do j = 1,n
        do k = 1,p
          t = b(k,j)
          do i = 1,m
            c(i,j) = c(i,j) + a(i,k)*t
          end do
        end do
      end do
35. 2) Inner loop unrolling

Before:

      do j = 1,n
        do k = 1,p
          t = b(k,j)
          do i = 1,m
            c(i,j) = c(i,j) + a(i,k)*t
          end do
        end do
      end do

Inner loop unrolled by 4, with a cleanup loop:

      do j = 1,n
        do k = 1,p
          t = b(k,j)
          do i = 1,(m-4)+1,4
            c(i+0,j) = c(i+0,j) + a(i+0,k)*t
            c(i+1,j) = c(i+1,j) + a(i+1,k)*t
            c(i+2,j) = c(i+2,j) + a(i+2,k)*t
            c(i+3,j) = c(i+3,j) + a(i+3,k)*t
          end do
          do i = i,m
            c(i,j) = c(i,j) + a(i,k)*t
          end do
        end do
      end do

(1) Reduces loop overhead. (2) Sometimes improves data reuse.
36. 3) Middle loop unrolling

Before:

      do j = 1,n
        do k = 1,p
          t = b(k,j)
          do i = 1,m
            c(i,j) = c(i,j) + a(i,k)*t
          end do
        end do
      end do

Middle (k) loop unrolled by 4, with a cleanup loop:

      do j = 1,n
        do k = 1,(p-4)+1,4
          t0 = b(k+0,j)
          t1 = b(k+1,j)
          t2 = b(k+2,j)
          t3 = b(k+3,j)
          do i = 1,m
            c(i,j) = c(i,j) + a(i,k+0)*t0
     &                      + a(i,k+1)*t1
     &                      + a(i,k+2)*t2
     &                      + a(i,k+3)*t3
          end do
        end do
        do k = k,p
          t0 = b(k,j)
          do i = 1,m
            c(i,j) = c(i,j) + a(i,k)*t0
          end do
        end do
      end do

(1) Fewer c(i,j) load/store operations. (2) Better locality of the b(k,j) references.
37. 4) Outer loop unrolling

Before:

      do j = 1,n
        do k = 1,p
          t = b(k,j)
          do i = 1,m
            c(i,j) = c(i,j) + a(i,k)*t
          end do
        end do
      end do

Outer (j) loop unrolled by 4, with a cleanup loop:

      do j = 1,(n-4)+1,4
        do k = 1,p
          t0 = b(k,j+0)
          t1 = b(k,j+1)
          t2 = b(k,j+2)
          t3 = b(k,j+3)
          do i = 1,m
            c(i,j+0) = c(i,j+0) + a(i,k)*t0
            c(i,j+1) = c(i,j+1) + a(i,k)*t1
            c(i,j+2) = c(i,j+2) + a(i,k)*t2
            c(i,j+3) = c(i,j+3) + a(i,k)*t3
          end do
        end do
      end do
      do j = j,n
        do k = 1,p
          t0 = b(k,j)
          do i = 1,m
            c(i,j) = c(i,j) + a(i,k)*t0
          end do
        end do
      end do

Improvement because of the reuse of a(i,k) in the inner loop.
38. 5) Loop blocking

Original code:

      do j = 1,n
        do k = 1,p
          t = b(k,j)
          do i = 1,m
            c(i,j) = c(i,j) + a(i,k)*t
          end do
        end do
      end do

Blocked loop:

      do j1 = 1,n,jblk
        j2 = j1+jblk-1
        if (j2.gt.n) j2 = n
        do k1 = 1,p,kblk
          k2 = k1+kblk-1
          if (k2.gt.p) k2 = p
          do i1 = 1,m,iblk
            i2 = i1+iblk-1
            if (i2.gt.m) i2 = m
            do j = j1,j2
              do k = k1,k2
                t = b(k,j)
                do i = i1,i2
                  c(i,j) = c(i,j) + a(i,k)*t
                end do
              end do
            end do
          end do
        end do
      end do

Improves locality of reference (removes out-of-cache memory references).
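A rough sizing sketch (assumed numbers, not from the slides): the three tiles touched inside the block loops should together fit in the cache. With 4-byte reals and a 32 KB cache, iblk = jblk = kblk = 50 gives about 3*50*50*4 = 30 KB, just under the limit.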
39. Optimizations - summary
Scalar optimization:
- improving instruction execution by code transformation and by grouping independent instructions
- improving memory access by modifying loop nests to take better advantage of the memory hierarchy

Compilers are good at instruction-level optimizations and loop transformations. However, there are differences between the languages:
- F77 is easiest for the compiler to work with
- C is more difficult
- F90/C++ are the most complex to optimize

The user is responsible for presenting the code in a way that allows compiler optimizations:
- don't violate the language standard
- write clean and clear code
- consider the data structures for (false) sharing and alignment
- consider the data structures in terms of data dependencies
- use the most natural presentation of algorithms in multi-dimensional arrays
40. Exercise 3 -- loop unroll/block
41. (image only)
42. Compiler switches and options
The compiler is the primary tool of program optimization:
- structure of the compiler and the compilation process
- compiler optimizations
- steering the compilation: compiler options
- structure of the run-time libraries and scientific libraries
- computational domain and computation accuracy
43. The Compiler
The compiler manages the resources of the computer:
- registers
- integer/floating-point execution units
- load/store/prefetch for data flow in/out of the processor
Knowledge of the implementation details of the processor and of the system architecture is built into the compiler.

Compilation pipeline: user program (C/C++/Fortran) -> high-level representation -> intermediate representation -> low-level representation -> machine instructions. Along the way the compiler resolves data dependencies, control-flow dependencies and parallelism, compacts the code, and performs optimal scheduling.
44. MIPSpro compiler components
Components: front-end (f77/f90/cc/CC, source to WHIRL) -> inter-procedure analyzer -> loop nest analyzer -> parallel optimizer -> code generator -> linker -> executable object.

- There are no source-to-source optimizers or compilers: source code is translated to the WHIRL intermediate language
- the same intermediate is used for the different levels of interpretation
- WHIRL2F and WHIRL2C translate back into Fortran or C
- the inter-procedural analyzer requires a final translation at link time
45. Compiler optimizations
- Global optimizer: dead code elimination, copy propagation, loop normalization, memory alias analysis, strength reduction
- Inter-procedural analyzer: cross-file function inlining, dead function elimination, dead variable elimination, padding of common variables, constant propagation
- Automatic parallelizer: loop-level work distribution
- Loop nest optimizer: loop unrolling, loop interchange, loop fusion/fission, loop blocking, memory prefetch, padding of local variables
- Code generation: inner loop unrolling, if-conversion, read/write optimization, recurrence breaking, instruction scheduling
46. SGI architecture, ABI, languages
- Instruction Set Architecture (ISA):
  - mips4 (r10000, r12000, r14000 processors)
  - mips3 (r4400)
  - mips2 (r3000, r4000), uses the old compilers
- ABI (Application Binary Interface):
  - -n32 (32-bit pointers, 4-byte integers, 4-byte reals)
  - -64 (64-bit pointers, 4-byte integers, 4-byte reals)
- Languages:
  - Fortran 77
  - Fortran 90
  - C
  - C++
47. Optimization levels
-O0           turn off all optimizations
-O1           only local optimizations
-O2 or -O     extensive but conservative optimizations
-O3           extensive optimizations, sometimes introduces errors
-ipa          inter-procedural analysis (-O2 and -O3 only)
-pfa or -mp   automatic parallelization
-g0 to -g3    debugging switch (-g0 forces -O0; -g3 allows debugging with -O3)
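A short usage sketch (hypothetical program name; the flags are those listed above):

    f77 -n32 -mips4 -O3 -o prog prog.f         # aggressive scalar optimization
    f77 -n32 -mips4 -O3 -ipa -o prog prog.f    # add inter-procedural analysis
    f77 -n32 -mips4 -O0 -g -o prog prog.f      # debugging build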
48. Compiler man pages
Primary man pages:
  man f77(1), f90(1), cc(1), CC(1), ld(1)
Some of the option groups are large and have been given their own man pages:
  man opt(5), man lno(5), man ipa(5), man DEBUG_GROUP, man mp(3F), man pe_environ(5), man sigfpe(3C)
49. Options: ABI and ISA
Option           Functionality
-n32             MIPSpro compiler, 32-bit addressing
-64              MIPSpro compiler, 64-bit addressing
-o32 / -32       old ucode compiler, 32-bit addressing
-mips[1|2|3|4]   ISA (-mips1/-mips2 imply the old ucode compiler)

Two other ways to define the ABI and ISA:
- the environment variable SGI_ABI can be set to -n32 or -64
- ABI/ISA/processor/optimization can be set in a file ~/compiler.defaults or /etc/compiler.defaults; the location of the file can also be defined by the COMPILER_DEFAULTS_PATH environment variable.

A typical line in the defaults file:
  DEFAULT:abi=n32:isa=mips4:proc=r14000:arit=3:opt=O3

For example:
  f77 -o prog -n32 -mips4 -r12000 -O3 source.f
50. Some compiler options
Option            Functionality
-d8 / -d16        DOUBLE PRECISION variables as 8 or 16 bytes
-r8               REAL becomes REAL*8 and COMPLEX becomes COMPLEX*16
                  (explicit sizes are preserved: REAL*4 remains 32-bit)
-i8               convert INTEGER to INTEGER*8 and LOGICAL to 8 bytes
-static           local variables are placed in fixed locations
-col72 / -col120  source line is 72 or 120 columns
-g or -g3         create a symbol table for debugging
-Dname            define name for the pre-processor
-Idir             define include directory dir
-align[N]         force alignment on bit boundary N (N = 8, 16, ...)
-version          show the compiler version
-show             compiler in verbose mode, display all switches
51. Scientific Libraries
The standard scientific library contains basic linear algebra operations and algorithms:
- BLAS1, BLAS2, BLAS3 (man intro_blas1, intro_blas2, intro_blas3)
- LAPACK (man intro_lapack)
- Fast Fourier Transform (FFT): 1D, 2D, 3D, multiple 1D transformations (man intro_fft)
- Convolutions (signal processing)
- Sparse solvers (man intro_solvers)

To use:
- -lscs       serial version
- -lscs_mp    parallel (mp) version
    f77 -o laprog -O -mp lapprog.f -lscs_mp
- man intro_scsl for a detailed description
- older versions: -lcomplib.sgimath or -lcomplib.sgimath_mp
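As a sketch of what "use a library program" means here (assuming the arrays are default REAL, as in the mm subroutine of slide 33): the whole hand-tuned matrix multiply of slides 33-38 can be replaced by one call to the BLAS3 routine sgemm, which computes C = alpha*A*B + beta*C; dgemm is the double precision analogue.

c     c(m,n) = 1.0*a(m,p)*b(p,n) + 1.0*c(m,n), single precision BLAS3
      call sgemm('N','N',m,n,p,1.0,a,lda,b,ldb,1.0,c,ldc)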
52. (image only)
53. File lp.m
54. File lp.anl
55. Computational domain
Range of numbers (from /usr/include/internal/limits_core.h):

FLT_DIG        6         /* digits of precision of a "float" */
FLT_MAX        3.40282347E+38F
FLT_MIN        1.17549435E-38F
DBL_DIG        15        /* digits of precision of a "double" */
DBL_MAX        1.7976931348623157E+308
DBL_MIN        2.2250738585072014E-308
LONGLONG_MIN   -9223372036854775807LL - 1LL
LONGLONG_MAX   9223372036854775807LL
ULONGLONG_MAX  18446744073709551615LLU

Extended precision (REAL*16) is supported by the compiler, but it is implemented in software and is very slow (by a factor of about 40).
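A small sketch (assumes an f90 compiler is available): the Fortran 90 numeric inquiry intrinsics report the same limits portably, without reading header files.

      program limits
      integer, parameter :: i8 = selected_int_kind(18)
      print *, 'float:  digits', precision(1.0),
     &         '  max', huge(1.0), '  min', tiny(1.0)
      print *, 'double: digits', precision(1.0d0),
     &         '  max', huge(1.0d0), '  min', tiny(1.0d0)
      print *, 'integer*8 max', huge(1_i8)
      end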
56. Compiler summary
The compiler is the primary tool of program optimization:
- compilation is the process of transforming code from a high level to a low level (processor code)
- the MIPSpro compiler targets the MIPS rN000 processors; the features of the processor and of the Origin2000 architecture are built in
- a large number of options control the compilation process: ABI, ISA, selection of optimization options, and setting assumptions about program behavior
- there are optimized and parallelized libraries for scientific computations
- when programming a computer, it is important to remember the limitations stemming from the limited validity range of numerical values.
57. A recent example: matrix multiply
58. Matrix multiply: remove the if from the loop
59. Original; matrix multiply subroutine calls; library routines
60. Profile -- original
61. Profile -- if statement moved
62. Profile -- library subroutine
63. Exercise 4 -- use a library program
64. (image only)
65. Initial values; initialize; main loop
66. Initial values; initialize; main loop (repeated)
67. (image only)
68. (image only)