Title: Introduction of new services for the public
1Optimization for the Cray XT3 MPP Supercomputer
John M. Levesque, June 2005
2Outline of Optimization Section
- Single node (socket) Optimization
- I/O optimization
- MPI optimization
- SHMEM
- Scaling to large processor counts
3Bandwidth as a function of Array Size
4AMD Opteron Processor
- 36 entry FPU instruction scheduler
- 64-bit/80-bit FP: realized throughput (1 Mul + 1 Add)/cycle, 1.9 FLOPs/cycle
- 32-bit FP: realized throughput (2 Mul + 2 Add)/cycle, 3.4 FLOPs/cycle
5Simplified memory hierarchy on the AMD Opteron
registers
- 16 SSE2 128-bit registers, 16 64-bit registers
- registers <-> L1 data cache: 2 x 8 Bytes per clock, i.e. either 2 loads, 1 load and 1 store, or 2 stores (38 GB/s at 2.4 GHz)
L1 data cache
- 64 Byte cache line
- complete data cache lines are loaded from main memory if not in L2 cache
- if the L1 data cache needs to be refilled, data are stored back to the L2 cache
- L1 data cache <-> L2 cache: 8 Bytes per clock
L2 cache
- 64 Byte cache line
- write-back cache: data offloaded from the L1 data cache are stored here first, until they are flushed out to main memory
- L2 cache <-> main memory: 16 Bytes wide data bus => 6.4 GB/s for DDR400
Main memory
6
C     11 OPERATIONS - 2 OPERANDS, RATIO 11/2
      II = 1
      ISTRIDE = 128
      DO 41075 I = 1, N
        Y(II) = c0 + X(II)*(C1 + X(II)*(C2 + X(II)
     &        *(C3 + X(II)*(C4 + X(II)*(C5 + X(II))))))
        II = II + ISTRIDE
41075 CONTINUE
7
C     3 OPERATIONS - 5 OPERANDS, RATIO 3/5
      DO 41023 I = 1, N
        A(I) = B(I)*C(I) + D(I)*E(I)
41023 CONTINUE
8
C     17 OPERATIONS - 2 OPERANDS, RATIO 17/2
      DO LLLLL = 1, NREPS
      DO 41018 I = 1, N
        Y(IY(I)) = c0 + X(IX(I))*(C1 + X(IX(I))*(C2 + X(IX(I))
     &           *(C3 + X(IX(I))*(C4 + X(IX(I))*(C5 + X(IX(I))
     &           *(C6 + X(IX(I))*(C7 + X(IX(I))*(C8 + X(IX(I))
     &           ))))))))
41018 CONTINUE
      ENDDO
11
C     DIMENSION A(128,N)
      DO 41080 I = 1, N
        A( 1,I) = C1*A(13,I) + C2*A(12,I) + C3*A(11,I)
     &          + C4*A(10,I) + C5*A( 9,I) + C6*A( 8,I)
     &          + C7*A( 7,I) + C0*(A( 5,I) + A( 6,I)) + A( 3,I)
41080 CONTINUE
12
C     DIMENSION B(13,N)
      DO 41081 I = 1, N
        B( 1,I) = C1*B(13,I) + C2*B(12,I) + C3*B(11,I)
     &          + C4*B(10,I) + C5*B( 9,I) + C6*B( 8,I)
     &          + C7*B( 7,I) + C0*(B( 5,I) + B( 6,I)) + B( 3,I)
41081 CONTINUE
14
C     THE ORIGINAL
      DO 41090 K = KA, KE, -1
      DO 41090 J = JA, JE
      DO 41090 I = IA, IE
        A(K,L,I,J) = A(K,L,I,J) - B(J,1,i,k)*A(K+1,L,I,1)
     &             - B(J,2,i,k)*A(K+1,L,I,2) - B(J,3,i,k)*A(K+1,L,I,3)
     &             - B(J,4,i,k)*A(K+1,L,I,4) - B(J,5,i,k)*A(K+1,L,I,5)
41090 CONTINUE
15
C     THE RESTRUCTURED
      DO 41091 K = KA, KE, -1
      DO 41091 J = JA, JE
      DO 41091 I = IA, IE
        AA(I,K,L,J) = AA(I,K,L,J) - BB(I,J,1,K)*AA(I,K+1,L,1)
     &    - BB(I,J,2,K)*AA(I,K+1,L,2) - BB(I,J,3,K)*AA(I,K+1,L,3)
     &    - BB(I,J,4,K)*AA(I,K+1,L,4) - BB(I,J,5,K)*AA(I,K+1,L,5)
41091 CONTINUE
17
C     GAUSS ELIMINATION
      DO 43020 I = 1, MATDIM
        A(I,I) = 1. / A(I,I)
      DO 43020 J = I+1, MATDIM
        A(J,I) = A(J,I) * A(I,I)
      DO 43020 K = I+1, MATDIM
        A(J,K) = A(J,K) - A(J,I) * A(I,K)
43020 CONTINUE
18
C     GAUSS ELIMINATION
      DO 43020 I = 1, MATDIM
        A(I,I) = 1. / A(I,I)
      DO 43020 J = I+1, MATDIM
        A(J,I) = A(J,I) * A(I,I)
cpgi$l nodepchk
      DO 43020 K = I+1, MATDIM
        A(J,K) = A(J,K) - A(J,I) * A(I,K)
43020 CONTINUE
20
C     THE ORIGINAL
      DO 43030 I = 2, N
      DO 43030 K = 1, I-1
        A(I) = A(I) + B(I,K) * A(I-K)
43030 CONTINUE
21
      DO 43031 I = 2, N
cpgi$l nodepchk
      DO 43031 K = 1, I-1
        A(I) = A(I) + B(I,K) * A(I-K)
43031 CONTINUE
23
      DO 43040 J = 2, 8
        N1 = J
        N2 = J - 1
      DO 43040 I = 2, N
        A(I,N1) = A(I-1,N2) * B(I,J) + C(I)
43040 CONTINUE
24
      DO 43041 J = 2, 8
      DO 43041 I = 2, N
        A(I,J) = A(I-1,J-1) * B(I,J) + C(I)
43041 CONTINUE
26
C     THE ORIGINAL
      DO 43060 KX = 2, 3
      DO 43060 KY = 2, N
        D(KY) = A(KX,KY+1,NL12) - A(KX,KY-1,NL12)
        E(KY) = B(KX,KY+1,NL22) - B(KX,KY-1,NL22)
        F(KY) = C(KX,KY+1,NL32) - C(KX,KY-1,NL32)
        A(KX,KY,NL11) = A(KX,KY,NL11)
     &    + C1*D(KY) + C2*E(KY) + C3*F(KY)
     &    + C0*(A(KX+1,KY,NL1) - 2.*A(KX,KY,NL1) + A(KX-1,KY,NL1))
        B(KX,KY,NL21) = B(KX,KY,NL21)
     &    + C4*D(KY) + C5*E(KY) + C6*F(KY)
     &    + C0*(B(KX+1,KY,NL1) - 2.*B(KX,KY,NL1) + B(KX-1,KY,NL1))
        C(KX,KY,NL31) = C(KX,KY,NL31)
     &    + C7*D(KY) + C8*E(KY) + C9*F(KY)
     &    + C0*(C(KX+1,KY,NL1) - 2.*C(KX,KY,NL1) + C(KX-1,KY,NL1))
43060 CONTINUE
27
      DO 43061 KX = 2, 3
cpgi$l nodepchk
      DO 43061 KY = 2, N
        D(KY) = A(KX,KY+1,NL12) - A(KX,KY-1,NL12)
        E(KY) = B(KX,KY+1,NL22) - B(KX,KY-1,NL22)
        F(KY) = C(KX,KY+1,NL32) - C(KX,KY-1,NL32)
        A(KX,KY,NL11) = A(KX,KY,NL11)
     &    + C1*D(KY) + C2*E(KY) + C3*F(KY)
     &    + C0*(A(KX+1,KY,NL1) - 2.*A(KX,KY,NL1) + A(KX-1,KY,NL1))
        B(KX,KY,NL21) = B(KX,KY,NL21)
     &    + C4*D(KY) + C5*E(KY) + C6*F(KY)
     &    + C0*(B(KX+1,KY,NL1) - 2.*B(KX,KY,NL1) + B(KX-1,KY,NL1))
        C(KX,KY,NL31) = C(KX,KY,NL31)
     &    + C7*D(KY) + C8*E(KY) + C9*F(KY)
     &    + C0*(C(KX+1,KY,NL1) - 2.*C(KX,KY,NL1) + C(KX-1,KY,NL1))
43061 CONTINUE
29
C     THE ORIGINAL
      DO 43070 I = 1, N
        A(IA(I)) = A(IA(I)) + C0 * B(I)
43070 CONTINUE
30
cpgi$l nodepchk
      DO 43071 I = 1, N
        A(IA(I)) = A(IA(I)) + C0 * B(I)
43071 CONTINUE
32
C     THE ORIGINAL
      DO 43100 J = 2, N
        AH = B(J) - B(J-1)
      DO 43100 I = 2, N
        A(I,J) = AH * A(I-1,J) + C(I,J)
43100 CONTINUE
33
      DO 43101 J = 2, N
        VAH(J) = B(J) - B(J-1)
43101 CONTINUE
      DO 43102 I = 2, N
      DO 43102 J = 2, N
        A(I,J) = VAH(J) * A(I-1,J) + C(I,J)
43102 CONTINUE
35
C     THE ORIGINAL
      DO 43111 J = 2, N
        AH = B(J) - B(J-1)
        DO 43110 I = 2, N
          A(I,J) = AH * A(I-1,J) + C(I,J)
43110   CONTINUE
        BH = D(J) - D(J-1)
        DO 43112 I = N, 2, -1
          A(I,J) = BH * A(I+1,J) + C(I,J)
43112   CONTINUE
43111 CONTINUE
36
C     THE RESTRUCTURED
      DO 43113 J = 2, N
        VAH(J) = B(J) - B(J-1)
43113 CONTINUE
      DO 43114 I = 2, N
      DO 43114 J = 2, N
        A(I,J) = VAH(J) * A(I-1,J) + C(I,J)
43114 CONTINUE
      DO 43115 J = 2, N
        VBH(J) = D(J) - D(J-1)
43115 CONTINUE
      DO 43116 I = N, 2, -1
      DO 43116 J = 2, N
        A(I,J) = VBH(J) * A(I+1,J) + C(I,J)
43116 CONTINUE
38
      DO 43140 J = 2, N
      DO 43140 I = 2, N
        A(I,J,1) = A(I,J,1) - B(I,J) * A(I-1,J,1)
     &                      - C(I,J) * A(I,J-1,1)
        A(I,J,2) = A(I,J,2) - B(I,J) * A(I-1,J,2)
     &                      - C(I,J) * A(I,J-1,2)
        A(I,J,3) = A(I,J,3) - B(I,J) * A(I-1,J,3)
     &                      - C(I,J) * A(I,J-1,3)
43140 CONTINUE
39
C     THE RESTRUCTURED
      NDIAGS = 2 * N - 3
      ISTART = 1
      JSTART = 2
      LDIAG  = 0
      DO 43141 IDIAGS = 1, NDIAGS
        IF (IDIAGS .LE. N-1) THEN
          ISTART = ISTART + 1
          LDIAG  = LDIAG + 1
        ELSE
          JSTART = JSTART + 1
          LDIAG  = LDIAG - 1
        ENDIF
        I = ISTART + 1
        J = JSTART - 1
!pgi$l nodepchk
        DO 43142 IPOINT = 1, LDIAG
          I = I - 1
          J = J + 1
          A(I,J,1) = A(I,J,1) - B(I,J) * A(I-1,J,1)
     &                        - C(I,J) * A(I,J-1,1)
          A(I,J,2) = A(I,J,2) - B(I,J) * A(I-1,J,2)
     &                        - C(I,J) * A(I,J-1,2)
          A(I,J,3) = A(I,J,3) - B(I,J) * A(I-1,J,3)
     &                        - C(I,J) * A(I,J-1,3)
43142   CONTINUE
43141 CONTINUE
41
C     THE ORIGINAL
      BSQ(1) = 0.0
      A(1) = 0.0
      B = 0.0
      DO 44022 I = 2, N
        B = B + DELB
        BSQ(I) = B ** 2
        A(I) = C(I) * (DELB + C(I) * (BSQ(I) - BSQ(I-1)))
44022 CONTINUE
C
42
C     THE ORIGINAL
      BR = 0.0
      DO 44020 I = 1, N
        BL = BR
        BR = (I-1) * DELB
        A(I) = (BR - BL) * C(I) + (BR**2 - BL**2) * C(I)**2
44020 CONTINUE
43
      BSQ(1) = 0.0
      A(1) = 0.0
      B = 0.0
      DO 44022 I = 2, N
        B = B + DELB
        BSQ(I) = B ** 2
        A(I) = C(I) * (DELB + C(I) * (BSQ(I) - BSQ(I-1)))
44022 CONTINUE
C
45
      PF = 0.0
      DO 44030 I = 2, N
        AV = B(I) * RV
        PB = PF
        PF = C(I)
        IF ((D(I) + D(I+1)) .LT. 0.) PF = -C(I+1)
        AA = E(I) - E(I-1) + F(I) - F(I-1)
     1     + G(I) + G(I-1) - H(I) - H(I-1)
        BB = R(I) + S(I-1) + T(I) + T(I-1)
     1     - U(I) - U(I-1) + V(I) + V(I-1)
     2     - W(I) + W(I-1) - X(I) + X(I-1)
        A(I) = AV * (AA + BB + PF - PB + Y(I) - Z(I)) + A(I)
44030 CONTINUE
46
      VPF(1) = 0.0
      DO 44031 I = 2, N
        AV = B(I) * RV
        VPF(I) = C(I)
        IF ((D(I) + D(I+1)) .LT. 0.) VPF(I) = -C(I+1)
        AA = E(I) - E(I-1) + F(I) - F(I-1)
     1     + G(I) + G(I-1) - H(I) - H(I-1)
        BB = R(I) + S(I-1) + T(I) + T(I-1)
     1     - U(I) - U(I-1) + V(I) + V(I-1)
     2     - W(I) + W(I-1) - X(I) + X(I-1)
        A(I) = AV * (AA + BB + VPF(I) - VPF(I-1) + Y(I) - Z(I)) + A(I)
44031 CONTINUE
48
      DO 44050 I = 1, N
      DO 44050 J = 1, N
        A(I,J) = 0.0
      DO 44050 K = 1, N
        A(I,J) = A(I,J) + B(I,K) * C(K,J)
44050 CONTINUE
49
      DO 44051 J = 1, N
      DO 44051 I = 1, N
        A(I,J) = 0.0
44051 CONTINUE
      DO 44052 K = 1, N
      DO 44052 J = 1, N
      DO 44052 I = 1, N
        A(I,J) = A(I,J) + B(I,K) * C(K,J)
44052 CONTINUE
51
      DO 44060 I = 1, N
        A(I) = 0.0
      DO 44060 J = 1, I
        A(I) = A(I) + B(I,J) * C(J,I)
44060 CONTINUE
52
      DO 44061 I = 1, N
        A(I) = 0.0
44061 CONTINUE
      DO 44062 J = 1, N
      DO 44062 I = J, N
        A(I) = A(I) + B(I,J) * C(J,I)
44062 CONTINUE
54
C     THE ORIGINAL
      DO 46011 J = 1, 4
        DO 46010 I = 1, N
          C(J,I) = 0.0
46010   CONTINUE
        DO 46011 K = 1, 4
        DO 46011 I = 1, N
          C(J,I) = C(J,I) + A(J,K) * B(K,I)
46011 CONTINUE
55
C     THE RESTRUCTURED
      DO 46012 I = 1, N
        C(1,I) = A(1,1) * B(1,I) + A(1,2) * B(2,I)
     &         + A(1,3) * B(3,I) + A(1,4) * B(4,I)
        C(2,I) = A(2,1) * B(1,I) + A(2,2) * B(2,I)
     &         + A(2,3) * B(3,I) + A(2,4) * B(4,I)
        C(3,I) = A(3,1) * B(1,I) + A(3,2) * B(2,I)
     &         + A(3,3) * B(3,I) + A(3,4) * B(4,I)
        C(4,I) = A(4,1) * B(1,I) + A(4,2) * B(2,I)
     &         + A(4,3) * B(3,I) + A(4,4) * B(4,I)
46012 CONTINUE
57
C     THE ORIGINAL
      DO 46020 I = 1, N
      DO 46020 J = 1, 4
        A(I,J) = 0.
      DO 46020 K = 1, 4
        A(I,J) = A(I,J) + B(I,K) * C(K,J)
46020 CONTINUE
58
      DO 46021 I = 1, N
        A(I,1) = B(I,1) * C(1,1) + B(I,2) * C(2,1)
     &         + B(I,3) * C(3,1) + B(I,4) * C(4,1)
        A(I,2) = B(I,1) * C(1,2) + B(I,2) * C(2,2)
     &         + B(I,3) * C(3,2) + B(I,4) * C(4,2)
        A(I,3) = B(I,1) * C(1,3) + B(I,2) * C(2,3)
     &         + B(I,3) * C(3,3) + B(I,4) * C(4,3)
        A(I,4) = B(I,1) * C(1,4) + B(I,2) * C(2,4)
     &         + B(I,3) * C(3,4) + B(I,4) * C(4,4)
46021 CONTINUE
60
      DO 46030 J = 1, N
      DO 46030 I = 1, N
        A(I,J) = 0.
46030 CONTINUE
      DO 46031 K = 1, N
      DO 46031 J = 1, N
      DO 46031 I = 1, N
        A(I,J) = A(I,J) + B(I,K) * C(K,J)
46031 CONTINUE
61
C     THE RESTRUCTURED
      DO 46032 J = 1, N
      DO 46032 I = 1, N
        A(I,J) = 0.
46032 CONTINUE
C
      DO 46033 K = 1, N-5, 6
      DO 46033 J = 1, N
      DO 46033 I = 1, N
        A(I,J) = A(I,J) + B(I,K  ) * C(K  ,J)
     &                  + B(I,K+1) * C(K+1,J)
     &                  + B(I,K+2) * C(K+2,J)
     &                  + B(I,K+3) * C(K+3,J)
     &                  + B(I,K+4) * C(K+4,J)
     &                  + B(I,K+5) * C(K+5,J)
46033 CONTINUE
C
      DO 46034 KK = K, N
      DO 46034 J = 1, N
      DO 46034 I = 1, N
        A(I,J) = A(I,J) + B(I,KK) * C(KK,J)
46034 CONTINUE
63I/O Optimization
- Lustre
- Stripe Setting
- I/O Buffering
64Determining Stripe Size
- lfs find -v /lustre/scratch/jlbeck
- Notice this directory has no stripe information.
This means the width is ONE. If I were to create
a file in this directory it would inherit the
properties of the directory.
OBDS
0 ost1_UUID ACTIVE
...
15 ost16_UUID ACTIVE
/lustre/scratch/jlbeck/ has no stripe info
65Stripe setting
- What should my stripe width be?
- Lots of processors writing to individual files: sw = 1
- 1 I/O process writing for all: sw small (> 1)
- Lots of processes all writing to 1 file: sw large
- Set your stripe width
- lfs setstripe <filename> <stripe-size> <start-ost> <stripe count>
- lfs setstripe file 0 1 5
- lfs setstripe file 0 0 -1
66IOBUF
- iobuf is an I/O buffering library. iobuf intercepts the I/O calls (open, read, etc.) from a program and provides an additional layer of buffering. In the case of XT3, iobuf replaces the stdio (glibc/libio) layer of buffering. By asynchronously prefetching and caching file data, I/O wait time can be reduced for programs which read or write large files sequentially.
- iobuf can gather run time statistics and print a summary report of I/O activity for each file.
- If a memory allocation error occurs, buffering is reduced or disabled for that file and a diagnostic is printed to stderr. When the file is opened, a single buffer is allocated if buffering is enabled. Additional buffers are allocated when a buffer is needed. When a file is closed, its buffers are freed unless asynchronous I/O is still pending on the buffer.
67IOBUF
- File selection and buffering parameters are set through the IOBUF_PARAMS environment variable. If IOBUF_PARAMS is not set, the default is no buffering (the I/O call is passed on to the next layer without intervention). The general format is a comma-separated list of specifications.
- IOBUF_PARAMS=spec1,spec2,spec3,...
68IOBUF
69Example
- export IOBUF_SIZE=32768
- export IOBUF_COUNT=5
- export IOBUF_PARAMS='//eagerflushlazyclose,stdouteagerflushlazyclose'
70 1. Introduction to SHMEM
- Programming Model
- Memory is private to each process: remotely accessible, not shared
- SHMEM is a one-sided message passing model: put and get operations
- SHMEM is an SPMD programming model
- A SHMEM application can be part of an MPMD MPI job
71Introduction(cont.)
- Symmetric Data Objects
- Primary concept in SHMEM
- Virtual addresses of a symmetric data object on different processes have a definite, known relationship
- Access remote symmetric data objects by using the address of the corresponding local data object
- C: global, static or shmalloc'd data
- Fortran: common block, SAVE attribute or shpalloc'd data
72Introduction (cont.)
long A[10];
void foobar(void) {
    long S[10];
    if (my_pe == 0) shmem_put64(A, S, 10, 1);
}
Memory layout, identical on PE 0 and PE 1: stack (holds S), heap, symmetric heap, data (holds A), text
73Introduction(cont.)
- Goal
- Deliver best possible communication performance by minimizing overhead associated with data transfer
74 2. Cray SHMEM Implementation on XT3
- XT3 uses Portals Networking Protocol
- One-sided RMA protocol
- Guarantees reliable, ordered message delivery between pairs of processes
- Connection-less
- Designed specifically for scalability
- Cray SHMEM layered on top of Portals 3.3
75Cray SHMEM on XT3 (cont.)
- Portals resources
- Memory Descriptor (MD) identifies a memory region to be used in an operation
- Event Queue (EQ) used to record information about an operation
- SHMEM start-up
- Set up Portals resources
- MDs to describe four memory regions
- EQ to monitor transfer completions
76Cray SHMEM on XT3 (cont.)
- SHMEM data transfer
- Source and target addresses determine which MDs and EQs to supply to the Portals call
- Execute Portals put or get command
- Monitor EQ for completion event if necessary
- Persistent Portals resources => low overhead on transmit path
77 3. Cray SHMEM 1.0 Release
- Functionality Supported
- Initialization and Clean up
- shmem_init or start_pes
- shmem_finalize
- Queries
- shmem_my_pe, shmem_n_pes
- Puts and Gets
- shmem_xxx_{put,get} (generic different types)
- shmem_{put,get}xxx (different bit counts)
- shmem_{put,get}mem
78Cray SHMEM 1.0 Release (cont.)
- Functionality Supported (cont.)
- Synchronization
- shmem_fence
- shmem_quiet
- shmem_barrier_all
- shmem_barrier
- Wait
- shmem_xxx_wait (generic different integer types)
- shmem_xxx_wait_until (generic different integer types)
79Cray SHMEM 1.0 Release (cont.)
- Functionality Supported (cont.)
- Broadcast
- shmem_broadcastxxx (generic different bit counts)
- Reductions
- shmem_xxx_yyy_to_all for operations sum, prod, max, min, and, or, xor (different types)
- Currently supported on all PEs only
80Cray SHMEM 1.0 Release (cont.)
- Functionality Supported (cont.)
- Events
- shmem_{clear,set,test,wait}_event
- Strided Puts and Gets
- shmem_xxx_i{put,get} (generic different types)
- shmem_i{put,get}xxx (different bit counts)
81Cray SHMEM 1.0 Release (cont.)
- Functionality Supported (cont.)
- Symmetric Heap management
- shmalloc
- shfree
- shrealloc
- Fortran Interface
- Functions corresponding to C interface
- include mpp/shmem.fh
82Cray SHMEM 1.0 Release (cont.)
- Preliminary Performance Data
- Simple SHMEM get/put operations map well onto XT3 architecture
- Advanced SHMEM operations do not map well onto XT3 architecture
- Portals not tuned yet, e.g. no OS-bypass, most Portals calls require a system call