1
IFS Benchmark with Federation Switch
  • John Hague, IBM

2
Introduction
  • Federation has dramatically improved pwr4 p690
    communication, so:
  • Measure Federation performance with Small Pages
    and Large Pages using simulation program
  • Compare Federation and pre-Federation (Colony)
    performance of IFS
  • Compare Federation performance of IFS with and
    without Large Pages and Memory Affinity
  • Examine IFS communication using mpi profiling

3
Colony v Federation
  • Colony (hpca)
    • 1.3GHz 32-processor p690s
    • Four 8-processor Affinity LPARs per p690
      • Needed to get communication performance
    • Two 180MB/s adapters per LPAR
  • Federation (hpcu)
    • 1.7GHz p690s
    • One 32-processor LPAR per p690
    • Memory and MPI MCM Affinity
      • MPI Task and Memory from same MCM
      • Slightly better than binding task to specific processor
    • Two 2-link 1.2GB/s Federation adapters per p690
      • Four 1.2GB/s links per node

4
IFS Communication: transpositions
(Diagram: MPI tasks arranged in rows and columns across nodes 0, 1, ...)
  • MPI Alltoall in all rows simultaneously
    • Mostly shared memory
  • MPI Alltoall in all columns simultaneously
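
This row/column pattern is the usual way a 2-D task decomposition performs its
transpositions. The sketch below is an illustration only (not the IFS source):
it splits MPI_COMM_WORLD into row and column communicators for a hypothetical
16x8 task grid and runs the two MPI_Alltoall phases; grid dimensions, message
sizes and variable names are assumptions.

    /* Illustrative sketch only: tasks on an assumed nprow x npcol grid do an
     * Alltoall within their row, then within their column, using
     * communicators built with MPI_Comm_split.                              */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, nprow = 16, npcol = 8;      /* e.g. 16x8 = 128 tasks (assumed) */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int row = rank / npcol, col = rank % npcol;
        MPI_Comm row_comm, col_comm;
        MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);  /* same row    */
        MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);  /* same column */

        int n = 10000;                                 /* doubles per partner */
        double *sbuf  = calloc((size_t)npcol * n, sizeof(double));
        double *rbuf  = calloc((size_t)npcol * n, sizeof(double));
        double *sbuf2 = calloc((size_t)nprow * n, sizeof(double));
        double *rbuf2 = calloc((size_t)nprow * n, sizeof(double));

        /* Row transposition: tasks in a row sit on the same node,
         * so this is mostly shared-memory traffic.                 */
        MPI_Alltoall(sbuf, n, MPI_DOUBLE, rbuf, n, MPI_DOUBLE, row_comm);

        /* Column transposition: crosses nodes, so it exercises the switch. */
        MPI_Alltoall(sbuf2, n, MPI_DOUBLE, rbuf2, n, MPI_DOUBLE, col_comm);

        free(sbuf); free(rbuf); free(sbuf2); free(rbuf2);
        MPI_Finalize();
        return 0;
    }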

5
Simulation of transpositions
  • All transpositions in a row use shared memory
  • All transpositions in a column use the switch
  • Number of MPI tasks per node varied
    • But all processors used via OpenMP threads
  • Bandwidth measured for MPI Sendrecv calls
    • Buffers allocated and filled by threads between each call
  • Large Pages give best switch performance
    • With current switch software
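
A minimal sketch of this kind of measurement (not the actual simulation
program): each task repeatedly exchanges a buffer with a partner via
MPI_Sendrecv while OpenMP threads fill the send buffer between calls so that
all processors stay busy. The pairing, message size and repetition count are
arbitrary assumptions.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int nbytes = 4 * 1024 * 1024;       /* 4 MB message (arbitrary) */
        char *sbuf = malloc(nbytes), *rbuf = malloc(nbytes);
        int partner = rank ^ 1;                   /* pair tasks 0-1, 2-3, ... */
        const int reps = 50;

        double t0 = MPI_Wtime();
        for (int it = 0; it < reps; it++) {
            /* Threads fill the send buffer before each exchange,
             * keeping all processors of the node busy.            */
            #pragma omp parallel for
            for (int i = 0; i < nbytes; i++)
                sbuf[i] = (char)(it + i);

            MPI_Sendrecv(sbuf, nbytes, MPI_BYTE, partner, 0,
                         rbuf, nbytes, MPI_BYTE, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        double t = MPI_Wtime() - t0;

        if (rank == 0)
            printf("approx. bandwidth per task: %.1f MB/s\n",
                   reps * (double)nbytes / t / 1.0e6);

        free(sbuf); free(rbuf);
        MPI_Finalize();
        return 0;
    }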

6
Transposition Bandwidth per link
(8 nodes, 4 links/node, 8 tasks/node, 4 threads/task, 2 tasks/link)
SP = Small Pages, LP = Large Pages
7
Transposition Bandwidth per link
(8 nodes, 4 links/node)
Multiple threads ensure all processors are used
8
hpcu v hpca with IFS
  • Benchmark jobs (provided 3 years ago)
  • Same executable used on hpcu and hpca
  • 256 processors used
  • All jobs run with mpi_profiling (and barriers
    before data exchange)

              Procs    Grid Points    hpca    hpcu   Speedup
  T399        10x1_4      213988      5828    3810     1.52
  T799        16x8_2      843532      9907    5527     1.79
  4D-Var
  T511/T255   16x8_2                  4869    2737     1.78
9
IFS Speedups hpcu v hpca
LP = Large Pages, SP = Small Pages, MA = Memory Affinity
10
LP/SP MA/noMA CPU comparison
11
LP/SP MA/noMA Comms comparison
12
Percentage Communication
(Chart: hpca vs hpcu)
13
Extra Memory needed by Large Pages
  • Large Pages are allocated in Real Memory in segments of 256 MB
  • MPI_INIT
    • 80MB which may not be used
    • MP_BUFFER_MEM (default 64MB) can be reduced
    • MPI_BUFFER_ALLOCATE needs memory which may not be used
  • OpenMP threads
    • Stack allocated with the XLSMPOPTS stack setting may not be used
  • Fragmentation
    • Memory is "wasted"
  • Last 256 MB segment
    • Only a small part of it may be used

14
mpi_profile
  • Examine IFS communication using mpi profiling
  • Use libmpiprof.a
    • Calls and MB/s rate for each type of call
      • Overall
      • For each higher level subroutine
    • Histogram of blocksize for each type of call
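
For reference, profiling libraries of this kind are typically built on MPI's
standard PMPI interface. The sketch below is an illustration only (not the
libmpiprof.a source): it wraps MPI_Send to accumulate the call count, bytes
and time of the sort reported in the profiles on the following slides.

    #include <mpi.h>
    #include <stdio.h>

    static long   send_calls = 0;
    static double send_bytes = 0.0, send_time = 0.0;

    /* Interpose on MPI_Send; the real routine is reached through PMPI_Send. */
    int MPI_Send(const void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        int tsize;
        MPI_Type_size(type, &tsize);

        double t0 = MPI_Wtime();
        int rc = PMPI_Send(buf, count, type, dest, tag, comm);
        send_time += MPI_Wtime() - t0;

        send_calls += 1;
        send_bytes += (double)count * tsize;
        return rc;
    }

    /* A wrapped MPI_Finalize prints, per routine, the calls, avg. bytes,
     * Mbytes and time in the style of the tables below.                  */
    int MPI_Finalize(void)
    {
        printf("MPI_Send %8ld calls  avg. bytes %12.1f  Mbytes %10.1f  time %8.3f s\n",
               send_calls,
               send_calls ? send_bytes / send_calls : 0.0,
               send_bytes / 1.0e6, send_time);
        return PMPI_Finalize();
    }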

15
mpi_profile for T799
128 MPI tasks, 2 threads   WALL time 5495 sec
------------------------------------------------------------------
MPI Routine        calls    avg. bytes      Mbytes    time(sec)
------------------------------------------------------------------
MPI_Send           49784       52733.2      2625.3       7.873
MPI_Bsend           6171      454107.3      2802.3       1.331
MPI_Isend          84524     1469867.4    124239.1       1.202
MPI_Recv           91940     1332252.1    122487.3     359.547
MPI_Waitall        75884           0.0         0.0      59.772
MPI_Bcast            362          26.6         0.0       0.028
MPI_Barrier         9451           0.0         0.0     436.818
------------------------------------------------------------------
TOTAL                                                  866.574
------------------------------------------------------------------
Barrier indicates load imbalance
16
mpi_profile for 4D-Var min0
128 MPI tasks, 2 threads   WALL time 1218 sec
------------------------------------------------------------------
MPI Routine         calls    avg. bytes     Mbytes    time(sec)
------------------------------------------------------------------
MPI_Send            43995        7222.9      317.8       1.033
MPI_Bsend           38473       13898.4      534.7       0.843
MPI_Isend          326703      168598.3    55081.6       6.368
MPI_Recv           432364      127061.8    54936.9     220.877
MPI_Waitall        276222           0.0        0.0      23.166
MPI_Bcast             288      374491.7      107.9       0.490
MPI_Barrier         27062           0.0        0.0      94.168
MPI_Allgatherv        466      285958.8      133.3      26.250
MPI_Allreduce        1325          73.2        0.1       1.027
------------------------------------------------------------------
TOTAL                                                  374.223
------------------------------------------------------------------
Barrier indicates load imbalance
17
MPI Profiles for send/recv
18
mpi_profiles for recv/send
                                 Avg MB   MB/s per task   MB/s per task
                                               hpca            hpcu
T799 (4 tasks per link)
  trltom  (inter node)             1.84          35             224
  trltog  (shrd memory)            4.00         116             890
  slcomm2 (halo)                   0.66          65             363
4D-Var min0 (4 tasks per link)
  trltom  (inter node)            0.167                         160
  trltog  (shrd memory)           0.373                         490
  slcomm2 (halo)                  0.088                         222
19
Conclusions
  • Speedups of hpcu over hpca

      Large    Memory
      Pages    Affinity    Speedup
        N         N        1.32   1.60
        Y         N        1.43   1.62
        N         Y        1.47   1.78
        Y         Y        1.52   1.85
  • Best Environment Variables
    • MPI.network=ccc0 (instead of cccs)
    • MEMORY_AFFINITY=yes
    • MP_AFFINITY=MCM          ! with new pmd
    • MP_BULK_MIN_MSG_SIZE=50000
    • LDR_CNTRL="LARGE_PAGE_DATA=Y"
      • don't use =M, else system calls in LP are very slow
    • MP_EAGER_LIMIT=64K

20
hpca v hpcu
                            --------Time--------    ------Speedup------    I/O
                  LP  Aff   Total    CPU   Comms    Total    CPU   Comms  Comms
min0  hpca         N   N     2499   1408    1091                           43.6
      hpcu  H/22   N   N     1502   1119     383     1.66   1.26    2.85   25.5
            H/21   N   Y     1321    951     370     1.89   1.48    2.95   28.0
            H/20   Y   N     1444   1165     279     1.73   1.21    3.91   19.3
            H/19   Y   Y     1229    962     267     2.03   1.46    4.08   21.7
min1  hpca         N   N     1649   1065     584                           43.6
      hpcu  H/22   N   N     1033    825     208     1.60   1.29    2.81   20.1
            H/21   N   Y      948    734     214     1.74   1.45    2.73   22.5
            H/15   Y   N     1019    856     163     1.62   1.24    3.58   16.0
            H/19   Y   Y      914    765     149     1.80   1.39    3.91   16.3
21
hpca v hpcu
                            --------Time--------    ------Speedup------    I/O
                  LP  Aff   Total    CPU   Comms    Total    CPU   Comms  Comms
min0  hpca         N   N     2499   1408    1091                           43.6
      hpcu  H/22   N   N     1502   1119     383     1.66   1.26    2.85   25.5
            H/21   N   Y     1321    951     370     1.89   1.48    2.95   28.0
            H/20   Y   N     1444   1165     279     1.73   1.21    3.91   19.3
            H/19   Y   Y     1229    962     267     2.03   1.46    4.08   21.7
min1  hpca         N   N     1649   1065     584                           43.6
      hpcu  H/22   N   N     1033    825     208     1.60   1.29    2.81   20.1
            H/21   N   Y      948    734     214     1.74   1.45    2.73   22.5
            H/15   Y   N     1019    856     163     1.62   1.24    3.58   16.0
            H/19   Y   Y      914    765     149     1.80   1.39    3.91   16.3
22
mpi_profiles for recv/send
                                 Avg MB   MB/s per task   MB/s per task
                                               hpca            hpcu
T799 (4 tasks per link)
  trltom  (inter node)             1.84          35             224
  trltog  (shrd memory)            4.00         116             890
  slcomm2 (halo)                   0.66          65             363
4D-Var min0 (4 tasks per link)
  trltom  (inter node)            0.167                         160
  trltog  (shrd memory)           0.373                         490
  slcomm2 (halo)                  0.088                         222
23
Conclusions
  • Memory Affinity with binding
    • Program binds to MOD(task_id*nthrds + thrd_id, 32)
      (see the sketch after this list), or
    • Use new /usr/lpp/ppe.poe/bin/pmdv4
    • How to bind if whole node not used?
      • Try VSRAC code from Montpellier
      • Bind adapter link to MCM?
  • Large Pages
    • Advantages
      • Need LP for best communication B/W with current software
    • Disadvantages
      • Uses extra memory (4GB more per node in 4D-Var min1)
      • Load Leveler Scheduling
    • Prototype switch software indicates Large Pages not necessary
  • Collective Communication
    • To be investigated
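
A minimal sketch of the binding scheme mentioned above, assuming AIX's
bindprocessor() interface and a 32-way node; the function name and the use of
OpenMP calls to obtain nthrds and thrd_id are illustrative assumptions, not
the program's actual code.

    #include <mpi.h>
    #include <omp.h>
    #include <sys/processor.h>   /* bindprocessor(), BINDTHREAD (AIX) */
    #include <sys/thread.h>      /* thread_self() (AIX)               */

    /* Each OpenMP thread of each MPI task binds itself to logical CPU
     * MOD(task_id*nthrds + thrd_id, 32).                               */
    void bind_my_threads(void)
    {
        int task_id;
        MPI_Comm_rank(MPI_COMM_WORLD, &task_id);

        #pragma omp parallel
        {
            int nthrds  = omp_get_num_threads();
            int thrd_id = omp_get_thread_num();
            int cpu     = (task_id * nthrds + thrd_id) % 32;

            /* Bind the calling kernel thread to that logical CPU. */
            bindprocessor(BINDTHREAD, thread_self(), cpu);
        }
    }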

24
Linux compared to PWR4 for IFS
  • Linux (run by Peter Mayes)
    • Opteron, 2GHz, 2 CPUs/node, 6GB/node, Myrinet switch
    • Portland Group compiler
      • Compiler flags -O3 -Mvect=sse
    • No code optimisation or OpenMP
    • Linux 1:  1 CPU/node, Myrinet IP
    • Linux 1A: 1 CPU/node, Myrinet GM
    • Linux 2:  2 CPUs/node
  • IBM Power4
    • MPI (intra-node shared memory) and OpenMP
    • Compiler flags -O3 -qstrict
    • hpca: 1.3GHz p690, 8 CPUs/node, 8GB/node, Colony switch
    • hpcu: 1.7GHz p690, 32 CPUs/node, 32GB/node, Federation switch

25
Linux compared to Pwr4