Title: IFS Benchmark with Federation Switch
1  IFS Benchmark with Federation Switch
2  Introduction
- Federation has dramatically improved pwr4 p690 communication, so:
  - Measure Federation performance with Small Pages and Large Pages using a simulation program
  - Compare Federation and pre-Federation (Colony) performance of IFS
  - Compare Federation performance of IFS with and without Large Pages and Memory Affinity
  - Examine IFS communication using mpi profiling
3  Colony v Federation
- Colony (hpca)
  - 1.3GHz 32-processor p690s
  - Four 8-processor Affinity LPARs per p690
    - Needed to get communication performance
  - Two 180MB/s adapters per LPAR
- Federation (hpcu)
  - 1.7GHz p690s
  - One 32-processor LPAR per p690
  - Memory and MPI MCM Affinity
    - MPI task and memory from same MCM
    - Slightly better than binding task to specific processor
  - Two 2-link 1.2GB/s Federation adapters per p690
    - Four 1.2GB/s links per node
4  IFS Communication: transpositions
[Diagram: grid of MPI tasks laid out across nodes]
- MPI Alltoall in all rows simultaneously
  - Mostly shared memory
- MPI Alltoall in all columns simultaneously (a sketch of the row/column communicator pattern follows below)
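
A minimal sketch of the row/column Alltoall pattern, assuming an nrows x ncols task grid; the communicator splitting, grid shape, message size and buffer contents are illustrative only, not the IFS code.

    /* Sketch: Alltoall over row and column sub-communicators for an
       nrows x ncols task grid (illustrative only, not IFS code).
       Run with a task count that is a multiple of nrows. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size, nrows = 4, ncols, npart, n = 4096;
        MPI_Comm row_comm, col_comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        ncols = size / nrows;              /* assumes size is a multiple of nrows */

        int row = rank / ncols;            /* tasks in a row share a node  */
        int col = rank % ncols;            /* tasks in a column span nodes */
        MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);
        MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);

        npart = (nrows > ncols) ? nrows : ncols;
        double *sbuf = calloc((size_t)n * npart, sizeof(double));
        double *rbuf = calloc((size_t)n * npart, sizeof(double));

        /* "MPI Alltoall in all rows simultaneously" - mostly shared memory */
        MPI_Alltoall(sbuf, n, MPI_DOUBLE, rbuf, n, MPI_DOUBLE, row_comm);
        /* "MPI Alltoall in all columns simultaneously" - over the switch   */
        MPI_Alltoall(sbuf, n, MPI_DOUBLE, rbuf, n, MPI_DOUBLE, col_comm);

        free(sbuf); free(rbuf);
        MPI_Comm_free(&row_comm); MPI_Comm_free(&col_comm);
        MPI_Finalize();
        return 0;
    }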
5  Simulation of transpositions
- All transpositions in a row use shared memory
- All transpositions in a column use the switch
- Number of MPI tasks per node varied
  - But all processors kept busy by using OpenMP threads
- Bandwidth measured for MPI Sendrecv calls (see the sketch below)
  - Buffers allocated and filled by threads between each call
- Large Pages give best switch performance
  - With current switch software
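
A minimal sketch of the bandwidth measurement, assuming paired tasks exchanging fixed-size messages with MPI_Sendrecv and OpenMP threads refilling the send buffer between calls; partner choice, message size and repeat count are placeholders, not the actual simulation program.

    /* Sketch: per-task Sendrecv bandwidth with OpenMP threads touching
       the buffers between calls. Assumes an even number of MPI tasks. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        long nbytes  = 4L << 20;                 /* 4 MB per message   */
        int  partner = rank ^ 1;                 /* pair up neighbours */
        int  nrep    = 50;
        char *sbuf = malloc(nbytes), *rbuf = malloc(nbytes);

        double t0 = MPI_Wtime();
        for (int it = 0; it < nrep; it++) {
            /* threads fill the send buffer between calls, as in the simulation */
            #pragma omp parallel for
            for (long i = 0; i < nbytes; i++)
                sbuf[i] = (char)(i + it);

            MPI_Sendrecv(sbuf, (int)nbytes, MPI_BYTE, partner, 0,
                         rbuf, (int)nbytes, MPI_BYTE, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        double t = MPI_Wtime() - t0;

        /* counts bytes sent + received by this task */
        printf("task %d: %.1f MB/s\n", rank, 2.0 * nbytes * nrep / t / 1.0e6);

        free(sbuf); free(rbuf);
        MPI_Finalize();
        return 0;
    }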
6  Transposition Bandwidth per link (8 nodes, 4 links/node, 8 tasks/node, 4 threads/task, 2 tasks/link)
[Chart: bandwidth per link; SP = Small Pages, LP = Large Pages]
7  Transposition Bandwidth per link (8 nodes, 4 links/node)
[Chart: bandwidth per link]
- Multiple threads ensure all processors are used
8  hpcu v hpca with IFS
- Benchmark jobs (provided 3 years ago)
- Same executable used on hpcu and hpca
- 256 processors used
- All jobs run with mpi_profiling (and barriers before data exchange, so load imbalance shows up as MPI_Barrier time in the profiles)
                      Procs    Grid Points   hpca   hpcu   Speedup
  T399                10x1_4        213988   5828   3810      1.52
  T799                16x8_2        843532   9907   5527      1.79
  4D-Var T511/T255    16x8_2                 4869   2737      1.78
9  IFS Speedups hpcu v hpca
[Chart: LP = Large Pages, SP = Small Pages, MA = Memory Affinity]
10  LP/SP MA/noMA CPU comparison
11  LP/SP MA/noMA Comms comparison
12  Percentage Communication
[Chart: hpca v hpcu]
13  Extra Memory needed by Large Pages
- Large Pages are allocated in Real Memory in segments of 256 MB (see the illustration below)
- MPI_INIT
  - 80MB which may not be used
  - MP_BUFFER_MEM (default 64MB) can be reduced
  - MPI_BUFFER_ALLOCATE needs memory which may not be used
- OpenMP threads
  - Stack allocated with the XLSMPOPTS stack option may not be used
- Fragmentation
  - Memory is "wasted"
  - Last 256 MB segment
    - Only a small part of it may be used
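
A small illustration of the 256 MB segment rounding; the 80 MB and 64 MB figures are the ones quoted above, while the 300 MB data array is invented for the example.

    /* Illustration only: large-page data is reserved in whole 256 MB
       segments, so the tail of the last segment is "wasted". */
    #include <stdio.h>

    int main(void)
    {
        long seg  = 256;                      /* MB per large-page segment             */
        long need = 80 + 64 + 300;            /* MPI_INIT + MP_BUFFER_MEM + data array */
        long segs = (need + seg - 1) / seg;   /* round up to whole segments            */

        printf("%ld MB requested -> %ld MB of real memory in %ld segments"
               " (%ld MB of the last segment unused)\n",
               need, segs * seg, segs, segs * seg - need);
        return 0;
    }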
14  mpi_profile
- Examine IFS communication using mpi profiling
- Use libmpiprof.a (an illustrative wrapper sketch follows below)
- Calls and MB/s rate for each type of call
  - Overall
  - For each higher level subroutine
- Histogram of blocksize for each type of call
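
libmpiprof.a itself is not reproduced here; as an illustration of the usual technique, a PMPI interposition wrapper for one routine might look like the following (the counter names are invented, and one such wrapper per routine plus a report at MPI_Finalize gives tables like those on the next slides).

    /* Sketch of a PMPI profiling wrapper: counts calls, bytes and time
       spent in MPI_Send; the real work is delegated to PMPI_Send. */
    #include <mpi.h>

    static long   send_calls = 0;
    static double send_bytes = 0.0;
    static double send_time  = 0.0;

    int MPI_Send(const void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        int    tsize, rc;
        double t0 = MPI_Wtime();

        rc = PMPI_Send(buf, count, type, dest, tag, comm);

        PMPI_Type_size(type, &tsize);
        send_calls += 1;
        send_bytes += (double)count * tsize;
        send_time  += MPI_Wtime() - t0;
        return rc;
    }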
15  mpi_profile for T799

128 MPI tasks, 2 threads, WALL time 5495 sec

  MPI Routine       calls    avg. bytes      Mbytes    time(sec)
  ------------------------------------------------------------
  MPI_Send          49784       52733.2      2625.3        7.873
  MPI_Bsend          6171      454107.3      2802.3        1.331
  MPI_Isend         84524     1469867.4    124239.1        1.202
  MPI_Recv          91940     1332252.1    122487.3      359.547
  MPI_Waitall       75884           0.0         0.0       59.772
  MPI_Bcast           362          26.6         0.0        0.028
  MPI_Barrier        9451           0.0         0.0      436.818
  ------------------------------------------------------------
  TOTAL                                                  866.574

Barrier indicates load imbalance
16  mpi_profile for 4D-Var min0

128 MPI tasks, 2 threads, WALL time 1218 sec

  MPI Routine       calls    avg. bytes      Mbytes    time(sec)
  ------------------------------------------------------------
  MPI_Send          43995        7222.9       317.8        1.033
  MPI_Bsend         38473       13898.4       534.7        0.843
  MPI_Isend        326703      168598.3     55081.6        6.368
  MPI_Recv         432364      127061.8     54936.9      220.877
  MPI_Waitall      276222           0.0         0.0       23.166
  MPI_Bcast           288      374491.7       107.9        0.490
  MPI_Barrier       27062           0.0         0.0       94.168
  MPI_Allgatherv      466      285958.8       133.3       26.250
  MPI_Allreduce      1325          73.2         0.1        1.027
  ------------------------------------------------------------
  TOTAL                                                  374.223

Barrier indicates load imbalance
17  MPI Profiles for send/recv
18  mpi_profiles for recv/send

                                      MB/s per task   MB/s per task
                            Avg MB        (hpca)          (hpcu)
  T799 (4 tasks per link)
    trltom (inter node)       1.84           35             224
    trltog (shrd memory)      4.00          116             890
    slcomm2 (halo)            0.66           65             363
  4D-Var min0 (4 tasks per link)
    trltom (inter node)      0.167                          160
    trltog (shrd memory)     0.373                          490
    slcomm2 (halo)           0.088                          222
19  Conclusions

- Speedups of hpcu over hpca

    Large Pages   Memory Affinity     Speedup
        N               N          1.32   1.60
        Y               N          1.43   1.62
        N               Y          1.47   1.78
        Y               Y          1.52   1.85
- Best Environment Variables
  - MPI.network = ccc0 (instead of cccs)
  - MEMORY_AFFINITY=yes
  - MP_AFFINITY=MCM  ! with new pmd (pmdv4)
  - MP_BULK_MIN_MSG_SIZE=50000
  - LDR_CNTRL="LARGE_PAGE_DATA=Y"
    - don't make Large Pages mandatory, else system calls in LP are very slow
  - MP_EAGER_LIMIT=64K
20  hpca v hpcu

                             --------Time--------    ------Speedup------
               LP   Aff      Total    CPU   Comms    Total   CPU   Comms   Comms %
  min0
    hpca        N    N        2499   1408    1091                            43.6
    hpcu H/22   N    N        1502   1119     383     1.66   1.26   2.85     25.5
         H/21   N    Y        1321    951     370     1.89   1.48   2.95     28.0
         H/20   Y    N        1444   1165     279     1.73   1.21   3.91     19.3
         H/19   Y    Y        1229    962     267     2.03   1.46   4.08     21.7
  min1
    hpca        N    N        1649   1065     584                            43.6
    hpcu H/22   N    N        1033    825     208     1.60   1.29   2.81     20.1
         H/21   N    Y         948    734     214     1.74   1.45   2.73     22.5
         H/15   Y    N        1019    856     163     1.62   1.24   3.58     16.0
         H/19   Y    Y         914    765     149     1.80   1.39   3.91     16.3
23  Conclusions

- Memory Affinity with binding
  - Program binds to MOD(task_id*nthrds + thrd_id, 32) (see the sketch below), or
  - Use new /usr/lpp/ppe.poe/bin/pmdv4
  - How to bind if whole node not used?
    - Try VSRAC code from Montpellier
  - Bind adapter link to MCM?
- Large Pages
  - Advantages
    - Need LP for best communication B/W with current software
  - Disadvantages
    - Uses extra memory (4GB more per node in 4D-Var min1)
    - Load Leveler Scheduling
  - Prototype switch software indicates Large Pages not necessary
- Collective Communication
  - To be investigated
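
A sketch of the binding computation, assuming the AIX bindprocessor() call and 32 logical CPUs per node; the function name and surrounding code are illustrative only, not the program's actual binding routine.

    /* Sketch, AIX assumed: bind each OpenMP thread of each MPI task to
       CPU MOD(task_id*nthrds + thrd_id, 32). No error handling shown. */
    #include <sys/processor.h>
    #include <sys/thread.h>
    #include <omp.h>
    #include <mpi.h>

    void bind_my_threads(void)
    {
        int task_id;
        MPI_Comm_rank(MPI_COMM_WORLD, &task_id);

        #pragma omp parallel
        {
            int nthrds  = omp_get_num_threads();
            int thrd_id = omp_get_thread_num();
            int cpu     = (task_id * nthrds + thrd_id) % 32;   /* MOD(...,32) */

            /* thread_self() returns this thread's kernel thread id on AIX */
            bindprocessor(BINDTHREAD, thread_self(), cpu);
        }
    }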
24  Linux compared to PWR4 for IFS

- Linux (run by Peter Mayes)
  - Opteron, 2GHz, 2 CPUs/node, 6GB/node, Myrinet switch
  - Portland Group compiler
    - Compiler flags -O3 -Mvect=sse
  - No code optimisation or OpenMP
  - Linux 1: 1 CPU/node, Myrinet IP
  - Linux 1A: 1 CPU/node, Myrinet GM
  - Linux 2: using 2 CPUs/node
- IBM Power4
  - MPI (intra-node shared memory) and OpenMP
  - Compiler flags -O3 -qstrict
  - hpca: 1.3GHz p690, 8 CPUs/node, 8GB/node, Colony switch
  - hpcu: 1.7GHz p690, 32 CPUs/node, 32GB/node, Federation switch
25  Linux compared to Pwr4