1
IFS Benchmark with Federation Switch
  • John Hague, IBM

2
Introduction
  • Federation has dramatically improved pwr4 p690
    communication, so:
  • Measure Federation performance with Small Pages
    and Large Pages using simulation program
  • Compare Federation and pre-Federation (Colony)
    performance of IFS
  • Compare Federation performance of IFS with and
    without Large Pages and Memory Affinity
  • Examine IFS communication using mpi profiling

3
Colony v Federation
  • Colony (hpca)
    • 1.3GHz 32-processor p690s
    • Four 8-processor Affinity LPARs per p690
      • Needed to get communication performance
    • Two 180MB/s adapters per LPAR
  • Federation (hpcu)
    • 1.7GHz p690s
    • One 32-processor LPAR per p690
    • Memory and MPI MCM Affinity
      • MPI Task and Memory from same MCM
      • Slightly better than binding task to specific processor
    • Two 2-link 1.2GB/s Federation adapters per p690
      • Four 1.2GB/s links per node

4
IFS Communication: transpositions
(Diagram: MPI tasks arranged in rows and columns across nodes 0, 1, ...)
  • MPI Alltoall in all rows simultaneously
    • Mostly shared memory
  • MPI Alltoall in all columns simultaneously
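
This row/column pattern is the usual way a 2-D task decomposition performs its
transpositions. The sketch below is an illustration only (not the IFS source):
it splits MPI_COMM_WORLD into row and column communicators for a hypothetical
16x8 task grid and runs the two MPI_Alltoall phases; grid dimensions, message
sizes and variable names are assumptions.

    /* Illustrative sketch only: tasks on an assumed nprow x npcol grid do an
     * Alltoall within their row, then within their column, using
     * communicators built with MPI_Comm_split.                              */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, nprow = 16, npcol = 8;      /* e.g. 16x8 = 128 tasks (assumed) */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int row = rank / npcol, col = rank % npcol;
        MPI_Comm row_comm, col_comm;
        MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);  /* same row    */
        MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);  /* same column */

        int n = 10000;                                 /* doubles per partner */
        double *sbuf  = calloc((size_t)npcol * n, sizeof(double));
        double *rbuf  = calloc((size_t)npcol * n, sizeof(double));
        double *sbuf2 = calloc((size_t)nprow * n, sizeof(double));
        double *rbuf2 = calloc((size_t)nprow * n, sizeof(double));

        /* Row transposition: tasks in a row sit on the same node,
         * so this is mostly shared-memory traffic.                 */
        MPI_Alltoall(sbuf, n, MPI_DOUBLE, rbuf, n, MPI_DOUBLE, row_comm);

        /* Column transposition: crosses nodes, so it exercises the switch. */
        MPI_Alltoall(sbuf2, n, MPI_DOUBLE, rbuf2, n, MPI_DOUBLE, col_comm);

        free(sbuf); free(rbuf); free(sbuf2); free(rbuf2);
        MPI_Finalize();
        return 0;
    }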

5
Simulation of transpositions
  • All transpositions in a row use shared memory
  • All transpositions in a column use the switch
  • Number of MPI tasks per node varied
    • But all processors used via OpenMP threads
  • Bandwidth measured for MPI Sendrecv calls
    • Buffers allocated and filled by threads between each call
  • Large Pages give best switch performance
    • With current switch software
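
A minimal sketch of this kind of measurement (not the actual simulation
program): each task repeatedly exchanges a buffer with a partner via
MPI_Sendrecv while OpenMP threads fill the send buffer between calls so that
all processors stay busy. The pairing, message size and repetition count are
arbitrary assumptions.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int nbytes = 4 * 1024 * 1024;       /* 4 MB message (arbitrary) */
        char *sbuf = malloc(nbytes), *rbuf = malloc(nbytes);
        int partner = rank ^ 1;                   /* pair tasks 0-1, 2-3, ... */
        const int reps = 50;

        double t0 = MPI_Wtime();
        for (int it = 0; it < reps; it++) {
            /* Threads fill the send buffer before each exchange,
             * keeping all processors of the node busy.            */
            #pragma omp parallel for
            for (int i = 0; i < nbytes; i++)
                sbuf[i] = (char)(it + i);

            MPI_Sendrecv(sbuf, nbytes, MPI_BYTE, partner, 0,
                         rbuf, nbytes, MPI_BYTE, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        double t = MPI_Wtime() - t0;

        if (rank == 0)
            printf("approx. bandwidth per task: %.1f MB/s\n",
                   reps * (double)nbytes / t / 1.0e6);

        free(sbuf); free(rbuf);
        MPI_Finalize();
        return 0;
    }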

6
Transposition Bandwidth per link
(8 nodes, 4 links/node, 8 tasks/node, 4 threads/task, 2 tasks/link)
SP = Small Pages, LP = Large Pages
7
Transposition Bandwidth per link
(8 nodes, 4 links/node)
Multiple threads ensure all processors are used
8
hpcu v hpca with IFS
  • Benchmark jobs (provided 3 years ago)
  • Same executable used on hpcu and hpca
  • 256 processors used
  • All jobs run with mpi_profiling (and barriers
    before data exchange)

              Procs    Grid Points    hpca    hpcu   Speedup
  T399        10x1_4      213988      5828    3810     1.52
  T799        16x8_2      843532      9907    5527     1.79
  4D-Var
  T511/T255   16x8_2                  4869    2737     1.78
9
IFS Speedups hpcu v hpca
LP = Large Pages, SP = Small Pages, MA = Memory Affinity
10
LP/SP MA/noMA CPU comparison
11
LP/SP MA/noMA Comms comparison
12
Percentage Communication
(Chart: hpca vs hpcu)
13
Extra Memory needed by Large Pages
  • Large Pages are allocated in Real Memory in segments of 256 MB
  • MPI_INIT
    • 80MB which may not be used
    • MP_BUFFER_MEM (default 64MB) can be reduced
    • MPI_BUFFER_ALLOCATE needs memory which may not be used
  • OpenMP threads
    • Stack allocated with the XLSMPOPTS stack setting may not be used
  • Fragmentation
    • Memory is "wasted"
  • Last 256 MB segment
    • Only a small part of it may be used

14
mpi_profile
  • Examine IFS communication using mpi profiling
  • Use libmpiprof.a
    • Calls and MB/s rate for each type of call
      • Overall
      • For each higher level subroutine
    • Histogram of blocksize for each type of call
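
For reference, profiling libraries of this kind are typically built on MPI's
standard PMPI interface. The sketch below is an illustration only (not the
libmpiprof.a source): it wraps MPI_Send to accumulate the call count, bytes
and time of the sort reported in the profiles on the following slides.

    #include <mpi.h>
    #include <stdio.h>

    static long   send_calls = 0;
    static double send_bytes = 0.0, send_time = 0.0;

    /* Interpose on MPI_Send; the real routine is reached through PMPI_Send. */
    int MPI_Send(const void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        int tsize;
        MPI_Type_size(type, &tsize);

        double t0 = MPI_Wtime();
        int rc = PMPI_Send(buf, count, type, dest, tag, comm);
        send_time += MPI_Wtime() - t0;

        send_calls += 1;
        send_bytes += (double)count * tsize;
        return rc;
    }

    /* A wrapped MPI_Finalize prints, per routine, the calls, avg. bytes,
     * Mbytes and time in the style of the tables below.                  */
    int MPI_Finalize(void)
    {
        printf("MPI_Send %8ld calls  avg. bytes %12.1f  Mbytes %10.1f  time %8.3f s\n",
               send_calls,
               send_calls ? send_bytes / send_calls : 0.0,
               send_bytes / 1.0e6, send_time);
        return PMPI_Finalize();
    }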

15
mpi_profile for T799
128 MPI tasks, 2 threads   WALL time 5495 sec
------------------------------------------------------------------
MPI Routine        calls    avg. bytes      Mbytes    time(sec)
------------------------------------------------------------------
MPI_Send           49784       52733.2      2625.3       7.873
MPI_Bsend           6171      454107.3      2802.3       1.331
MPI_Isend          84524     1469867.4    124239.1       1.202
MPI_Recv           91940     1332252.1    122487.3     359.547
MPI_Waitall        75884           0.0         0.0      59.772
MPI_Bcast            362          26.6         0.0       0.028
MPI_Barrier         9451           0.0         0.0     436.818
------------------------------------------------------------------
TOTAL                                                  866.574
------------------------------------------------------------------
Barrier indicates load imbalance
16
mpi_profile for 4D-Var min0
128 MPI tasks, 2 threads   WALL time 1218 sec
------------------------------------------------------------------
MPI Routine         calls    avg. bytes     Mbytes    time(sec)
------------------------------------------------------------------
MPI_Send            43995        7222.9      317.8       1.033
MPI_Bsend           38473       13898.4      534.7       0.843
MPI_Isend          326703      168598.3    55081.6       6.368
MPI_Recv           432364      127061.8    54936.9     220.877
MPI_Waitall        276222           0.0        0.0      23.166
MPI_Bcast             288      374491.7      107.9       0.490
MPI_Barrier         27062           0.0        0.0      94.168
MPI_Allgatherv        466      285958.8      133.3      26.250
MPI_Allreduce        1325          73.2        0.1       1.027
------------------------------------------------------------------
TOTAL                                                  374.223
------------------------------------------------------------------
Barrier indicates load imbalance
17
MPI Profiles for send/recv
18
mpi_profiles for recv/send
                                 Avg MB   MB/s per task   MB/s per task
                                               hpca            hpcu
T799 (4 tasks per link)
  trltom  (inter node)             1.84          35             224
  trltog  (shrd memory)            4.00         116             890
  slcomm2 (halo)                   0.66          65             363
4D-Var min0 (4 tasks per link)
  trltom  (inter node)            0.167                         160
  trltog  (shrd memory)           0.373                         490
  slcomm2 (halo)                  0.088                         222
19
Conclusions
  • Speedups of hpcu over hpca

      Large    Memory
      Pages    Affinity    Speedup
        N         N        1.32   1.60
        Y         N        1.43   1.62
        N         Y        1.47   1.78
        Y         Y        1.52   1.85
  • Best Environment Variables
    • MPI.network=ccc0 (instead of cccs)
    • MEMORY_AFFINITY=yes
    • MP_AFFINITY=MCM          ! with new pmd
    • MP_BULK_MIN_MSG_SIZE=50000
    • LDR_CNTRL="LARGE_PAGE_DATA=Y"
      • don't use =M, else system calls in LP are very slow
    • MP_EAGER_LIMIT=64K

20
hpca v hpcu
                            --------Time--------    ------Speedup------    I/O
                  LP  Aff   Total    CPU   Comms    Total    CPU   Comms  Comms
min0  hpca         N   N     2499   1408    1091                           43.6
      hpcu  H/22   N   N     1502   1119     383     1.66   1.26    2.85   25.5
            H/21   N   Y     1321    951     370     1.89   1.48    2.95   28.0
            H/20   Y   N     1444   1165     279     1.73   1.21    3.91   19.3
            H/19   Y   Y     1229    962     267     2.03   1.46    4.08   21.7
min1  hpca         N   N     1649   1065     584                           43.6
      hpcu  H/22   N   N     1033    825     208     1.60   1.29    2.81   20.1
            H/21   N   Y      948    734     214     1.74   1.45    2.73   22.5
            H/15   Y   N     1019    856     163     1.62   1.24    3.58   16.0
            H/19   Y   Y      914    765     149     1.80   1.39    3.91   16.3
21
hpca v hpcu
                            --------Time--------    ------Speedup------    I/O
                  LP  Aff   Total    CPU   Comms    Total    CPU   Comms  Comms
min0  hpca         N   N     2499   1408    1091                           43.6
      hpcu  H/22   N   N     1502   1119     383     1.66   1.26    2.85   25.5
            H/21   N   Y     1321    951     370     1.89   1.48    2.95   28.0
            H/20   Y   N     1444   1165     279     1.73   1.21    3.91   19.3
            H/19   Y   Y     1229    962     267     2.03   1.46    4.08   21.7
min1  hpca         N   N     1649   1065     584                           43.6
      hpcu  H/22   N   N     1033    825     208     1.60   1.29    2.81   20.1
            H/21   N   Y      948    734     214     1.74   1.45    2.73   22.5
            H/15   Y   N     1019    856     163     1.62   1.24    3.58   16.0
            H/19   Y   Y      914    765     149     1.80   1.39    3.91   16.3
22
mpi_profiles for recv/send
                                 Avg MB   MB/s per task   MB/s per task
                                               hpca            hpcu
T799 (4 tasks per link)
  trltom  (inter node)             1.84          35             224
  trltog  (shrd memory)            4.00         116             890
  slcomm2 (halo)                   0.66          65             363
4D-Var min0 (4 tasks per link)
  trltom  (inter node)            0.167                         160
  trltog  (shrd memory)           0.373                         490
  slcomm2 (halo)                  0.088                         222
23
Conclusions
  • Memory Affinity with binding
    • Program binds to MOD(task_id*nthrds + thrd_id, 32)
      (see the sketch after this list), or
    • Use new /usr/lpp/ppe.poe/bin/pmdv4
    • How to bind if whole node not used?
      • Try VSRAC code from Montpellier
      • Bind adapter link to MCM?
  • Large Pages
    • Advantages
      • Need LP for best communication B/W with current software
    • Disadvantages
      • Uses extra memory (4GB more per node in 4D-Var min1)
      • Load Leveler Scheduling
    • Prototype switch software indicates Large Pages not necessary
  • Collective Communication
    • To be investigated
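
A minimal sketch of the binding scheme mentioned above, assuming AIX's
bindprocessor() interface and a 32-way node; the function name and the use of
OpenMP calls to obtain nthrds and thrd_id are illustrative assumptions, not
the program's actual code.

    #include <mpi.h>
    #include <omp.h>
    #include <sys/processor.h>   /* bindprocessor(), BINDTHREAD (AIX) */
    #include <sys/thread.h>      /* thread_self() (AIX)               */

    /* Each OpenMP thread of each MPI task binds itself to logical CPU
     * MOD(task_id*nthrds + thrd_id, 32).                               */
    void bind_my_threads(void)
    {
        int task_id;
        MPI_Comm_rank(MPI_COMM_WORLD, &task_id);

        #pragma omp parallel
        {
            int nthrds  = omp_get_num_threads();
            int thrd_id = omp_get_thread_num();
            int cpu     = (task_id * nthrds + thrd_id) % 32;

            /* Bind the calling kernel thread to that logical CPU. */
            bindprocessor(BINDTHREAD, thread_self(), cpu);
        }
    }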

24
Linux compared to PWR4 for IFS
  • Linux (run by Peter Mayes)
    • Opteron, 2GHz, 2 CPUs/node, 6GB/node, Myrinet switch
    • Portland Group compiler
      • Compiler flags -O3 -Mvect=sse
    • No code optimisation or OpenMP
    • Linux 1:  1 CPU/node, Myrinet IP
    • Linux 1A: 1 CPU/node, Myrinet GM
    • Linux 2:  2 CPUs/node
  • IBM Power4
    • MPI (intra-node shared memory) and OpenMP
    • Compiler flags -O3 -qstrict
    • hpca: 1.3GHz p690, 8 CPUs/node, 8GB/node, Colony switch
    • hpcu: 1.7GHz p690, 32 CPUs/node, 32GB/node, Federation switch

25
Linux compared to Pwr4