Title: Parallel Programming on the SGI Origin2000

1. Parallel Programming on the SGI Origin2000
Taub Computer Center, Technion
Anne Weill-Zrahia
With thanks to Moshe Goldberg (TCC) and Igor Zacharov (SGI)
Mar 2005
2. Parallel Programming on the SGI Origin2000
- Parallelization Concepts
- SGI Computer Design
- Efficient Scalar Design
- Parallel Programming - OpenMP
- Parallel Programming - MPI
3. Efficient Scalar Programming
4. Remember to use make
Why make: implicit documentation, minimized compile time, 100% "oops"-free builds; variants: make / pmake / smake.

OBJS = f1.o f2.o f3.o
FFLAGS = -O3 -r12k
LDFLAGS = -lm -r12k

all: pgm1

pgm1: $(OBJS)
<tab>f77 -o pgm1 $(OBJS) $(LDFLAGS)

f2.o: f2.f
<tab>f77 $(FFLAGS) -static -c f2.f

clean:
<tab>-rm -f $(OBJS) pgm1 core
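A usage sketch (assuming the makefile above is saved as Makefile in the current directory):

    make          # builds pgm1, recompiling only files whose sources changed
    make clean    # removes the objects, the executable and core files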
5. Speedup opportunities
A program may run slowly because not all resources are used efficiently:
- on the processor: non-optimal scheduling of instructions (too many wait states)
- memory access: the memory access pattern is not optimized for the architecture
  - not all data in a cache line is used (spatial locality)
  - data in the cache is not reused (temporal locality)
Performance analysis is used to diagnose the problem.
The compiler will attempt to optimize the program; however, this might not be possible:
- the data representation can inhibit compiler optimization
- the algorithm presentation can inhibit optimization
Often it is necessary to rewrite critical parts of the code (loops) so that the compiler can do better performance optimization. Understanding the optimization techniques helps the compiler to use them effectively.
6. Compiler optimization techniques
Some optimization techniques built into the compiler:
- Loop based:
  - loop interchange
  - outer and inner loop unrolling
  - cache blocking
  - loop fusion (merge) and fission (split)
- General:
  - procedure inlining
  - data and array padding
The algorithm should be presented in the program in such a way that the compiler can apply these optimization techniques, leading to the best performance on the specific computer.
7. Some simple arithmetic replacements

      do i = 1,n
        a = sin(x(i))
        v(i) = 2.0*a
      enddo

Replace by:

      do i = 1,n
        v(i) = 2.0*sin(x(i))
      enddo

      do j = 1,m
        do k = 1,n
          v(k,j) = 2.0*(a(k)/b(j))
        enddo
      enddo

Replace by (the expensive division is hoisted out of the inner loop):

      do j = 1,m
        btemp = 2.0/b(j)
        do k = 1,n
          v(k,j) = btemp*a(k)
        enddo
      enddo
8. Array Indexing
Arrays can be indexed in several ways. For example:

Explicit addressing:
      do j = 1,m
        do k = 1,n
          .. A(k+(j-1)*n) ..
        enddo
      enddo

Direct addressing:
      do j = 1,m
        do k = 1,n
          .. A(k,j) ..
        enddo
      enddo

Loop-carried addressing:
      do j = 1,m
        do k = 1,n
          kk = kk+1
          .. A(kk) ..
        enddo
      enddo

Indirect addressing:
      do j = 1,m
        do k = 1,n
          .. A(index(k,j)) ..
        enddo
      enddo

- The addressing scheme will have an impact on performance
- Arrays should be accessed in the most natural, direct way for the compiler to apply loop optimization techniques
9. Data storage in memory
Data storage is language dependent.

Fortran stores multi-dimensional arrays in column order: in memory the leftmost index of a(i,j) changes first, so a column of consecutive i values is contiguous, and columns j, j+1, j+2 follow one another.

C stores multi-dimensional arrays in row order: in memory the rightmost index of a[i][j] changes first, so a row of consecutive j values is contiguous, and rows i, i+1, i+2 follow one another.

For arrays that do not fit in the cache, accessing elements in storage order gives much faster performance.
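A minimal sketch of the difference (an assumed example, not from the slides): the same summation done in storage order and against it.

      program stride
      integer n
      parameter (n=2000)
      real a(n,n), s
      integer i, j
      do j = 1,n
        do i = 1,n
          a(i,j) = 1.0
        enddo
      enddo
      s = 0.0
c     storage order: leftmost index i varies fastest (stride-1 access)
      do j = 1,n
        do i = 1,n
          s = s + a(i,j)
        enddo
      enddo
c     against storage order: consecutive accesses are n elements apart
      do i = 1,n
        do j = 1,n
          s = s + a(i,j)
        enddo
      enddo
      print *, s
      end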
10. Fortran loop interchange

Original loop:
      do i = 1,n
        do j = 1,m
          c(i,j) = a(i,j) + b(i,j)
        enddo
      enddo

Interchanged loop:
      do j = 1,m
        do i = 1,n
          c(i,j) = a(i,j) + b(i,j)
        enddo
      enddo

In the original loop the access order runs against the storage order; after the interchange the two coincide. The distribution of data in memory is not changed, only the access pattern changes.

The compiler can do this automatically, but there are complicated cases.
11. Index reversal

Original loop:
      do i = 1,n
        do j = 1,m
          c(i,j) = a(i,j) + b(j,i)
        enddo
      enddo

The access is wrong for A and C, but it is right for B. Interchange would be good for A and C, but bad for B.

Possible solution: index reversal of B, that is, b(j,i) is replaced by b(i,j). But this must be done everywhere in the program. (It must be done manually; the compiler will not do it.)

Interchanged loop + index reversal:
      do j = 1,m
        do i = 1,n
          c(i,j) = a(i,j) + b(i,j)
        enddo
      enddo
12. Loop interchange in C
In C, the situation is the opposite of what it is in Fortran.

Original loop:
      for (j=0; j<m; j++)
        for (i=0; i<n; i++)
          c[i][j] = a[i][j] + b[j][i];

Addressing of c and a is wrong; addressing of b is correct.

Index reversal loop:
      for (j=0; j<m; j++)
        for (i=0; i<n; i++)
          c[j][i] = a[j][i] + b[j][i];

Interchanged loop:
      for (i=0; i<n; i++)
        for (j=0; j<m; j++)
          c[i][j] = a[i][j] + b[j][i];

The performance benefits in C are the same as in Fortran. In most practical situations, loop interchange (supported by the compiler) is easier to achieve than index reversal.
13. Loop fusion
Loop fusion (merging two or more loops together):
- fusing loops that refer to the same data enhances temporal locality
- a larger loop body allows more effective scalar optimizations and instruction scheduling

Original loops:
      for (i=0; i<n; i++) a[i] = b[i] + 1;
      for (i=0; i<n; i++) c[i] = a[i]/2;
      for (i=0; i<n; i++) d[i] = 1/c[i+1];

Fused loops:
      for (i=0; i<n; i++) {
        a[i] = b[i] + 1;
        c[i] = a[i]/2;
      }
      for (i=0; i<n; i++) d[i] = 1/c[i+1];

More fusion, with peeling:
      a[0] = b[0] + 1;
      c[0] = a[0]/2;
      for (i=1; i<n; i++) {
        a[i] = b[i] + 1;
        c[i] = a[i]/2;
        d[i-1] = 1/c[i];
      }
      d[n-1] = 1/c[n];

- loop peeling can break data dependencies when fusing loops
- sometimes temporary arrays can be replaced by scalars (manual only)
- the compiler will attempt to fuse loops if they are adjacent, that is, if there is no code between the loops to be fused
14. Loop fission
Loop fission (splitting), or loop distribution: improve memory locality by splitting out loops that refer to different, independent arrays.

Original loop:
      for (i=1; i<n; i++) {
        a[i] = a[i] + b[i-1];
        b[i] = c[i-1]*x + y;
        c[i] = 1/b[i];
        d[i] = sqrt(c[i]);
      }

After fission:
      for (i=0; i<n-1; i++) {
        b[i+1] = c[i]*x + y;
        c[i+1] = 1/b[i+1];
      }
      for (i=0; i<n-1; i++)
        a[i+1] = a[i+1] + b[i];
      for (i=0; i<n-1; i++)
        d[i+1] = sqrt(c[i+1]);
15. Array placement effects
Wrong data placement in memory can lead to cache thrashing. The compiler has two techniques built in to avoid thrashing:
- array padding
- leading dimension extension

In principle, the leading dimension of an array should be an odd number: if a multi-dimensional array has small dimensions (such as a(32,32,32)), the leading dimension should be an odd number, never a power of 2.
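A minimal sketch of leading dimension extension (an assumed example): declare one extra row so that consecutive columns no longer start at power-of-2 strides, and loop only over the part actually used.

c     padded: leading dimension 33 instead of 32
      real a(33,32,32)
      integer i, j, k
      do k = 1,32
        do j = 1,32
          do i = 1,32
            a(i,j,k) = 0.0
          enddo
        enddo
      enddo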
16. RISC memory levels
[Diagram: single CPU; data flows CPU <-> cache <-> main memory]

17. RISC memory levels
[Same diagram, repeated]
18. Direct mapped cache thrashing

      common // a(4096),b(4096)
      do i = 1,n
        prod = prod + a(i)*b(i)
      enddo

With 4-byte words, a(4096) occupies exactly 16 KB of virtual memory, so in a 16 KB direct-mapped cache (cache line = 4 words) every b(i) maps to the same cache line as a(i).

Thrashing: every memory reference is a cache miss.

The rule: avoid leading dimensions that are a power of 2!
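A sketch of the usual fix (an assumed example, not from the slide): insert a pad between the arrays in the common block so that a(i) and b(i) no longer map to the same cache line.

c     a pad of one cache line (4 words) shifts b relative to a
      common // a(4096), pad(4), b(4096)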
19. Array padding example

      common // a(1024,1024),b(1024,1024),c(1024,1024)
      do j = 1,1024
        do i = 1,1024
          a(i,j) = a(i,j) + b(i,j)*c(i,j)
        enddo
      enddo

Addr{C(1,1)} = Addr{B(1,1)} + 1024*1024*4. Position in the cache: C(1,1) collides with B(1,1), since (1024*1024*4) mod 32K = 0.

      common // a(1024,1024),pad1(129),
     &          b(1024,1024),pad2(129),
     &          c(1024,1024)
      do j = 1,1024
        do i = 1,1024
          a(i,j) = a(i,j) + b(i,j)*c(i,j)
        enddo
      enddo

- Padding causes the cache lines to be placed in different places
- The compiler will try to do the padding automatically

Addr{C(1,1)} = Addr{B(1,1)} + 1024*1024*4 + 129*4. Position in the cache: C(1,1) is now 129 words away from B(1,1), so they no longer collide.
20. Dangers of array padding
- The compiler will automatically pad local data
- At -O3 optimization, the compiler will pad common blocks:
  - all routines with common blocks must be compiled with -O3, otherwise the compiler will not perform this optimization
  - padding of common blocks is safe as long as the Fortran standard is not violated

Example of a violation (indexing a beyond its declared size to sweep through b, relying on the two arrays being contiguous):

      subroutine sub
      common // a(512,512),b(512,512)
      do i = 1,2*512*512
        a(i) = 0.0
      enddo
      return
      end

The remedy is to fix the violation, or not to use this optimization, either by compiling with lower optimization or by using the compiler flag -OPT:reorg_common=off.
21. Loop unrolling
Loop unrolling: perform multiple loop iterations at the same time.

      do i = 1,n,1
        ..(i)..
      enddo

Unrolled:

      do i = 1,(n-unroll)+1,unroll
        ..(i)..
        ..(i+1)..
        ..(i+2)..
        ..
        ..(i+unroll-1)..
      enddo

Cleanup:

      do i = n-mod(n,unroll)+1,n
        ..(i)..
      enddo

Advantages of loop unrolling:
- more opportunities for super-scalar code
- more data reuse
- exploits the presence of cache lines
- reduction in loop overhead

Disadvantages of loop unrolling:
- cleanup code required

NOTE: the compiler will unroll code automatically, based on an estimate of how much time the loop body will take.
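A concrete sketch (an assumed example): a dot product unrolled by 4, with the cleanup loop handling the leftover iterations.

      s = 0.0
      m4 = n - mod(n,4)
c     main unrolled loop: four partial products per iteration
      do i = 1,m4,4
        s = s + x(i)*y(i) + x(i+1)*y(i+1)
     &        + x(i+2)*y(i+2) + x(i+3)*y(i+3)
      enddo
c     cleanup loop for the remaining mod(n,4) elements
      do i = m4+1,n
        s = s + x(i)*y(i)
      enddo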
22. Blocking for cache (tiling)
Blocking for cache:
- an optimization that is good for data sets that do not fit into the data cache
- a way to increase spatial locality of reference (that is, to exploit full cache lines)
- a way to increase temporal locality of reference (that is, to improve data reuse)
- mostly beneficial with multi-dimensional arrays

      do i = 1,n
        ..(i)..
      enddo

Blocked; only nb elements at a time of the inner loop are active:

      do il = 1,n,nb
        do i = il,min(il+nb-1,n)
          ..(i)..
        enddo
      enddo
23. Principle of blocking

      do i = 1,n
        a(i) = i
      enddo

Blocked (incorrect if iblk does not divide n: the last block overruns):

      do i1 = 1,n,iblk
        i2 = i1+iblk-1
        do i = i1,i2
          a(i) = i
        enddo
      enddo

Blocked, with the last block clamped to n:

      do i1 = 1,n,iblk
        i2 = i1+iblk-1
        if (i2.gt.n) i2 = n
        do i = i1,i2
          a(i) = i
        enddo
      enddo
24. Blocking example: transpose

      do j = 1,n
        do i = 1,n
          a(i,j) = b(j,i)
        enddo
      enddo

Either A or B is accessed in non-unit stride: bad reuse of data.

Blocking the loops for cache will do the transpose block by block, reusing the elements in the blocks:

      do jt = 1,n,jtblk
        do it = 1,n,itblk
          do j = jt,jt+jtblk-1
            do i = it,it+itblk-1
              a(i,j) = b(j,i)
            enddo
          enddo
        enddo
      enddo
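The block bounds above assume that jtblk and itblk divide n evenly; a sketch of the general form clamps each block with min(), as on the previous slide:

      do jt = 1,n,jtblk
        do it = 1,n,itblk
          do j = jt,min(jt+jtblk-1,n)
            do i = it,min(it+itblk-1,n)
              a(i,j) = b(j,i)
            enddo
          enddo
        enddo
      enddo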
25. A recent example: matrix multiply
26. Matrix multiply: remove the if from the loop
27. Profile -- original
28. Profile -- if statement moved
29. Exercise 2 -- matrix loop order
30. (image only)
31. (image only)
32. Procedure inlining
Inlining: replace a function (or subroutine) call by its source.

      do i = 1,n
        call dowork(a(i),c(i))
      enddo

      subroutine dowork(x,y)
      y = 1.0 + x*(1.0 + x*0.5)
      end

After inlining:

      do i = 1,n
        c(i) = 1.0 + a(i)*(1.0 + a(i)*0.5)
      enddo

Advantages:
- increased opportunities for optimizations
- more opportunities for loop nest optimizations
- reduced call overhead (minor)

Inhibitions to inlining:
- mismatched arguments (type or shape)
- no inlining across languages (C to Fortran)
- no static (SAVE) variables
- no recursive routines
- no functions with alternate entries
- no nested subroutines (as in F90)

Candidates for inlining are modules that:
- are small (not much source code)
- are called many times (say, in a loop)
- do not take much time per call
33. A simple matrix multiplication (triple-nested loop)

      subroutine mm(m,n,p,a,lda,b,ldb,c,ldc)
      integer m,n,p,lda,ldb,ldc
      dimension a(lda,p),b(ldb,n),c(ldc,n)
      do i = 1,m
        do j = 1,n
          do k = 1,p
            c(i,j) = c(i,j) + a(i,k)*b(k,j)
          end do
        end do
      end do
      end

Try to speed it up!
34. 1) Loop reversal

Original order:

      do i = 1,m
        do j = 1,n
          do k = 1,p
            c(i,j) = c(i,j) + a(i,k)*b(k,j)
          end do
        end do
      end do

Reversed loop order, with the loop constant b(k,j) held in t:

      do j = 1,n
        do k = 1,p
          t = b(k,j)
          do i = 1,m
            c(i,j) = c(i,j) + a(i,k)*t
          end do
        end do
      end do
35. 2) Inner loop unrolling

Before:

      do j = 1,n
        do k = 1,p
          t = b(k,j)
          do i = 1,m
            c(i,j) = c(i,j) + a(i,k)*t
          end do
        end do
      end do

Inner loop unrolled by 4, with a cleanup loop:

      do j = 1,n
        do k = 1,p
          t = b(k,j)
          do i = 1,(m-4)+1,4
            c(i+0,j) = c(i+0,j) + a(i+0,k)*t
            c(i+1,j) = c(i+1,j) + a(i+1,k)*t
            c(i+2,j) = c(i+2,j) + a(i+2,k)*t
            c(i+3,j) = c(i+3,j) + a(i+3,k)*t
          end do
          do i = i,m
            c(i,j) = c(i,j) + a(i,k)*t
          end do
        end do
      end do

(1) Reduces loop overhead. (2) Sometimes improves data reuse.
36. 3) Middle loop unrolling

Before:

      do j = 1,n
        do k = 1,p
          t = b(k,j)
          do i = 1,m
            c(i,j) = c(i,j) + a(i,k)*t
          end do
        end do
      end do

Middle (k) loop unrolled by 4, with a cleanup loop:

      do j = 1,n
        do k = 1,(p-4)+1,4
          t0 = b(k+0,j)
          t1 = b(k+1,j)
          t2 = b(k+2,j)
          t3 = b(k+3,j)
          do i = 1,m
            c(i,j) = c(i,j) + a(i,k+0)*t0
     &                      + a(i,k+1)*t1
     &                      + a(i,k+2)*t2
     &                      + a(i,k+3)*t3
          end do
        end do
        do k = k,p
          t0 = b(k,j)
          do i = 1,m
            c(i,j) = c(i,j) + a(i,k)*t0
          end do
        end do
      end do

(1) Fewer c(i,j) load/store operations. (2) Better locality of the b(k,j) references.
37. 4) Outer loop unrolling

Before:

      do j = 1,n
        do k = 1,p
          t = b(k,j)
          do i = 1,m
            c(i,j) = c(i,j) + a(i,k)*t
          end do
        end do
      end do

Outer (j) loop unrolled by 4, with a cleanup loop:

      do j = 1,(n-4)+1,4
        do k = 1,p
          t0 = b(k,j+0)
          t1 = b(k,j+1)
          t2 = b(k,j+2)
          t3 = b(k,j+3)
          do i = 1,m
            c(i,j+0) = c(i,j+0) + a(i,k)*t0
            c(i,j+1) = c(i,j+1) + a(i,k)*t1
            c(i,j+2) = c(i,j+2) + a(i,k)*t2
            c(i,j+3) = c(i,j+3) + a(i,k)*t3
          end do
        end do
      end do
      do j = j,n
        do k = 1,p
          t0 = b(k,j)
          do i = 1,m
            c(i,j) = c(i,j) + a(i,k)*t0
          end do
        end do
      end do

Improvement because of the reuse of a(i,k) in the inner loop.
38. 5) Loop blocking

Original code:

      do j = 1,n
        do k = 1,p
          t = b(k,j)
          do i = 1,m
            c(i,j) = c(i,j) + a(i,k)*t
          end do
        end do
      end do

Blocked loop:

      do j1 = 1,n,jblk
        j2 = j1+jblk-1
        if (j2.gt.n) j2 = n
        do k1 = 1,p,kblk
          k2 = k1+kblk-1
          if (k2.gt.p) k2 = p
          do i1 = 1,m,iblk
            i2 = i1+iblk-1
            if (i2.gt.m) i2 = m
            do j = j1,j2
              do k = k1,k2
                t = b(k,j)
                do i = i1,i2
                  c(i,j) = c(i,j) + a(i,k)*t
                end do
              end do
            end do
          end do
        end do
      end do

Improves locality of reference (removes out-of-cache memory references).
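A rough sizing sketch (assumed numbers, not from the slides): the three tiles touched inside the block loops should together fit in the cache. With 4-byte reals and a 32 KB cache, iblk = jblk = kblk = 50 gives about 3*50*50*4 = 30 KB, just under the limit.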
39. Optimizations - summary
Scalar optimization:
- improving instruction execution by code transformation and by grouping independent instructions
- improving memory access by modifying loop nests to take better advantage of the memory hierarchy

Compilers are good at instruction-level optimizations and loop transformations. However, there are differences between the languages:
- F77 is easiest for the compiler to work with
- C is more difficult
- F90/C++ are the most complex to optimize

The user is responsible for presenting the code in a way that allows compiler optimizations:
- don't violate the language standard
- write clean and clear code
- consider the data structures for (false) sharing and alignment
- consider the data structures in terms of data dependencies
- use the most natural presentation of algorithms in multi-dimensional arrays
40. Exercise 3 -- loop unroll/block
41. (image only)
42. Compiler switches and options
The compiler is the primary tool of program optimization:
- structure of the compiler and the compilation process
- compiler optimizations
- steering the compilation: compiler options
- structure of the run-time libraries and scientific libraries
- computational domain and computation accuracy
43. The Compiler
The compiler manages the resources of the computer:
- registers
- integer/floating-point execution units
- load/store/prefetch for data flow in/out of the processor
Knowledge of the implementation details of the processor and of the system architecture is built into the compiler.

Compilation pipeline: user program (C/C++/Fortran) -> high-level representation -> intermediate representation -> low-level representation -> machine instructions. Along the way the compiler resolves data dependencies, control-flow dependencies and parallelism, compacts the code, and performs optimal scheduling.
44. MIPSpro compiler components
Components: front-end (f77/f90/cc/CC, source to WHIRL) -> inter-procedure analyzer -> loop nest analyzer -> parallel optimizer -> code generator -> linker -> executable object.

- There are no source-to-source optimizers or compilers: source code is translated to the WHIRL intermediate language
- the same intermediate is used for the different levels of interpretation
- WHIRL2F and WHIRL2C translate back into Fortran or C
- the inter-procedural analyzer requires a final translation at link time
45. Compiler optimizations
- Global optimizer: dead code elimination, copy propagation, loop normalization, memory alias analysis, strength reduction
- Inter-procedural analyzer: cross-file function inlining, dead function elimination, dead variable elimination, padding of common variables, constant propagation
- Automatic parallelizer: loop-level work distribution
- Loop nest optimizer: loop unrolling, loop interchange, loop fusion/fission, loop blocking, memory prefetch, padding of local variables
- Code generation: inner loop unrolling, if-conversion, read/write optimization, recurrence breaking, instruction scheduling
46. SGI architecture, ABI, languages
- Instruction Set Architecture (ISA):
  - mips4 (r10000, r12000, r14000 processors)
  - mips3 (r4400)
  - mips2 (r3000, r4000), uses the old compilers
- ABI (Application Binary Interface):
  - -n32 (32-bit pointers, 4-byte integers, 4-byte reals)
  - -64 (64-bit pointers, 4-byte integers, 4-byte reals)
- Languages:
  - Fortran 77
  - Fortran 90
  - C
  - C++
47. Optimization levels
-O0           turn off all optimizations
-O1           only local optimizations
-O2 or -O     extensive but conservative optimizations
-O3           extensive optimizations, sometimes introduces errors
-ipa          inter-procedural analysis (-O2 and -O3 only)
-pfa or -mp   automatic parallelization
-g0 to -g3    debugging switch (-g0 forces -O0; -g3 allows debugging with -O3)
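A short usage sketch (hypothetical program name; the flags are those listed above):

    f77 -n32 -mips4 -O3 -o prog prog.f         # aggressive scalar optimization
    f77 -n32 -mips4 -O3 -ipa -o prog prog.f    # add inter-procedural analysis
    f77 -n32 -mips4 -O0 -g -o prog prog.f      # debugging build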
48. Compiler man pages
Primary man pages:
  man f77(1), f90(1), cc(1), CC(1), ld(1)
Some of the option groups are large and have been given their own man pages:
  man opt(5), man lno(5), man ipa(5), man DEBUG_GROUP, man mp(3F), man pe_environ(5), man sigfpe(3C)
49. Options: ABI and ISA
Option           Functionality
-n32             MIPSpro compiler, 32-bit addressing
-64              MIPSpro compiler, 64-bit addressing
-o32 / -32       old ucode compiler, 32-bit addressing
-mips[1|2|3|4]   ISA (-mips1/-mips2 imply the old ucode compiler)

Two other ways to define the ABI and ISA:
- the environment variable SGI_ABI can be set to -n32 or -64
- ABI/ISA/processor/optimization can be set in a file ~/compiler.defaults or /etc/compiler.defaults; the location of the file can also be defined by the COMPILER_DEFAULTS_PATH environment variable.

A typical line in the defaults file:
  DEFAULT:abi=n32:isa=mips4:proc=r14000:arit=3:opt=O3

For example:
  f77 -o prog -n32 -mips4 -r12000 -O3 source.f
50. Some compiler options
Option            Functionality
-d8 / -d16        DOUBLE PRECISION variables as 8 or 16 bytes
-r8               REAL becomes REAL*8 and COMPLEX becomes COMPLEX*16
                  (explicit sizes are preserved: REAL*4 remains 32-bit)
-i8               convert INTEGER to INTEGER*8 and LOGICAL to 8 bytes
-static           local variables are placed in fixed locations
-col72 / -col120  source line is 72 or 120 columns
-g or -g3         create a symbol table for debugging
-Dname            define name for the pre-processor
-Idir             define include directory dir
-align[N]         force alignment on bit boundary N (N = 8, 16, ...)
-version          show the compiler version
-show             compiler in verbose mode, display all switches
51. Scientific Libraries
The standard scientific library contains basic linear algebra operations and algorithms:
- BLAS1, BLAS2, BLAS3 (man intro_blas1, intro_blas2, intro_blas3)
- LAPACK (man intro_lapack)
- Fast Fourier Transform (FFT): 1D, 2D, 3D, multiple 1D transformations (man intro_fft)
- Convolutions (signal processing)
- Sparse solvers (man intro_solvers)

To use:
- -lscs       serial version
- -lscs_mp    parallel (mp) version
    f77 -o laprog -O -mp lapprog.f -lscs_mp
- man intro_scsl for a detailed description
- older versions: -lcomplib.sgimath or -lcomplib.sgimath_mp
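As a sketch of what "use a library program" means here (assuming the arrays are default REAL, as in the mm subroutine of slide 33): the whole hand-tuned matrix multiply of slides 33-38 can be replaced by one call to the BLAS3 routine sgemm, which computes C = alpha*A*B + beta*C; dgemm is the double precision analogue.

c     c(m,n) = 1.0*a(m,p)*b(p,n) + 1.0*c(m,n), single precision BLAS3
      call sgemm('N','N',m,n,p,1.0,a,lda,b,ldb,1.0,c,ldc)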
52. (image only)
53. File lp.m
54. File lp.anl
55. Computational domain
Range of numbers (from /usr/include/internal/limits_core.h):

FLT_DIG        6         /* digits of precision of a "float" */
FLT_MAX        3.40282347E+38F
FLT_MIN        1.17549435E-38F
DBL_DIG        15        /* digits of precision of a "double" */
DBL_MAX        1.7976931348623157E+308
DBL_MIN        2.2250738585072014E-308
LONGLONG_MIN   -9223372036854775807LL - 1LL
LONGLONG_MAX   9223372036854775807LL
ULONGLONG_MAX  18446744073709551615LLU

Extended precision (REAL*16) is supported by the compiler, but it is implemented in software and is very slow (by a factor of about 40).
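A small sketch (assumes an f90 compiler is available): the Fortran 90 numeric inquiry intrinsics report the same limits portably, without reading header files.

      program limits
      integer, parameter :: i8 = selected_int_kind(18)
      print *, 'float:  digits', precision(1.0),
     &         '  max', huge(1.0), '  min', tiny(1.0)
      print *, 'double: digits', precision(1.0d0),
     &         '  max', huge(1.0d0), '  min', tiny(1.0d0)
      print *, 'integer*8 max', huge(1_i8)
      end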
56. Compiler summary
The compiler is the primary tool of program optimization:
- compilation is the process of transforming code from a high level to a low level (processor code)
- the MIPSpro compiler targets the MIPS rN000 processors; the features of the processor and of the Origin2000 architecture are built in
- a large number of options control the compilation process: ABI, ISA, selection of optimization options, and setting assumptions about program behavior
- there are optimized and parallelized libraries for scientific computations
- when programming a computer, it is important to remember the limitations stemming from the limited validity range of numerical values.
57. A recent example: matrix multiply
58. Matrix multiply: remove the if from the loop
59. Original; matrix multiply subroutine calls; library routines
60. Profile -- original
61. Profile -- if statement moved
62. Profile -- library subroutine
63. Exercise 4 -- use a library program
64. (image only)
65. Initial values; initialize; main loop
66. Initial values; initialize; main loop (repeated)
67. (image only)
68. (image only)