Title: Matrix Transpose Results with Hybrid OpenMP / MPI
1 Matrix Transpose Results with Hybrid OpenMP / MPI
- O. Haan
- Gesellschaft für wissenschaftliche Datenverarbeitung Göttingen (GWDG), Germany
SCICOMP 2000, SDSC, La Jolla
2 Overview
- Hybrid Programming Model
- Distributed Matrix Transpose
- Performance Measurements
- Summary of Results
3 Architecture of Scalable Parallel Computers
- Two-level hierarchy:
  - cluster of SMP nodes: distributed memory, high-speed interconnect
  - SMP nodes with multiple processors: shared memory, bus or switch connected
4 Programming Models
- message passing over all processors: MPI implementation for shared memory, multiple access to switch adapters; on the SP: 4-way Winterhawk2, 8-way Nighthawk nodes
- shared memory over all processors: virtual global address space
- hybrid message passing / shared memory: message passing between nodes, shared memory within nodes
5 Hybrid Programming Model
SPMD program with MPI tasks
OpenMP threads within each task
communication between MPI tasks
6 Example of Hybrid Program
      program hybrid_example
      implicit none
      include 'mpif.h'
      integer ierr, com, nk, my_task, kp, my_thread, i
      integer OMP_GET_NUM_PROCS, OMP_GET_THREAD_NUM
      real thread_res(0:63), node_res, glob_res
      com = MPI_COMM_WORLD
      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(com, nk, ierr)
      call MPI_COMM_RANK(com, my_task, ierr)
      kp = OMP_GET_NUM_PROCS()
!$OMP PARALLEL PRIVATE(my_thread)
      my_thread = OMP_GET_THREAD_NUM()
      call work(my_thread, kp, my_task, nk, thread_res)
!$OMP END PARALLEL
      node_res = 0.0
      do i = 0, kp-1
         node_res = node_res + thread_res(i)
      end do
      call MPI_REDUCE(node_res, glob_res, 1,
     &                MPI_REAL, MPI_SUM, 0, com, ierr)
      call MPI_FINALIZE(ierr)
      stop
      end
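In this structure every MPI call lies outside the OpenMP parallel region, so only one thread per task ever calls MPI: the threads compute their partial results concurrently, join at the implicit barrier of !$OMP END PARALLEL, and only then is the node result summed and reduced across the MPI tasks.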
7 Hybrid Programming vs. Pure Message Passing
- (+) works on all SP configurations
- (+) coarser internode communication granularity
- (+) faster intranode communication
- (-) larger programming effort
- (-) additional synchronization steps
- (-) reduced reuse of cached data
The net score depends on the problem.
8 Distributed Matrix Transpose
9 3-step Transpose
n1 x n2 matrix A(i1, i2) -> n2 x n1 matrix B(i2, i1)
decompose n1, n2 into local and global parts: n1 = n1l * np, n2 = n2l * np
write matrices A, B as 4-dim arrays A(i1l, i1g, i2l, i2g), B(i2l, i2g, i1l, i1g)
step 1, local reorder: A(i1l, i1g, i2l, i2g) -> a1(i1l, i2l, i1g, i2g)
step 2, global reorder: a1(i1l, i2l, i1g, i2g) -> a2(i1l, i2l, i2g, i1g)
step 3, local transpose: a2(i1l, i2l, i2g, i1g) -> B(i2l, i2g, i1l, i1g)
(a code sketch of the three steps follows below)
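A minimal sketch of the three steps for the pure message passing case, with one MPI task per global index value: the routine and array names are illustrative, real*8 data is assumed, and MPI_ALLTOALL stands in for the global reorder of step 2, which slide 14 realizes as an explicit np-step exchange.

      subroutine transpose_3step(a, b, n1l, n2l, np, com)
      implicit none
      include 'mpif.h'
      integer n1l, n2l, np, com, ierr
      integer i1l, i2l, ig
      real*8 a(n1l, np, n2l), b(n2l, np, n1l)
      real*8 a1(n1l, n2l, np), a2(n1l, n2l, np)

! step 1: local reorder  A(i1l,i1g,i2l) -> a1(i1l,i2l,i1g)
      do ig = 1, np
         do i2l = 1, n2l
            do i1l = 1, n1l
               a1(i1l, i2l, ig) = a(i1l, ig, i2l)
            end do
         end do
      end do

! step 2: global reorder  a1(i1l,i2l,i1g) -> a2(i1l,i2l,i2g)
!         block ig of a1 goes to task ig-1
      call MPI_ALLTOALL(a1, n1l*n2l, MPI_DOUBLE_PRECISION,
     &       a2, n1l*n2l, MPI_DOUBLE_PRECISION, com, ierr)

! step 3: local transpose  a2(i1l,i2l,i2g) -> B(i2l,i2g,i1l)
      do ig = 1, np
         do i2l = 1, n2l
            do i1l = 1, n1l
               b(i2l, ig, i1l) = a2(i1l, i2l, ig)
            end do
         end do
      end do
      end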
10 Local Steps: Copy with Reorder
- data in memory: speed limited by the performance of the bus and memory subsystem; on Winterhawk2 all processors share the same bus, bandwidth 1.6 GB/s
- data in cache: speed limited by processor performance; Winterhawk2 sustains one load plus one store per cycle, i.e. 8 bytes per 1/(375 MHz) cycle, a bandwidth of 3 GB/s
(a minimal copy kernel is sketched below)
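As an illustration of the kind of kernel these limits apply to, a minimal copy loop; the routine name and the OpenMP parallelization over the node's processors are assumptions, not the measured code.

      subroutine copy_kernel(a, b, n)
! plain copy a -> b: memory-bus limited for large n,
! cache limited when both arrays fit into the cache
      implicit none
      integer n, i
      real*8 a(n), b(n)
!$OMP PARALLEL DO PRIVATE(i)
      do i = 1, n
         b(i) = a(i)
      end do
!$OMP END PARALLEL DO
      end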
11 Copy Data in Memory
12 Copy Prefetch
13 Copy Data in Cache
14 Global Reorder
a1( , , i1g, i2g) -> a2( , , i2g, i1g)
global reorder on np processors in np steps
[figure: exchange schedule among processors p0, p1, p2 over steps 0, 1, 2]
(a sketch of such a schedule follows below)
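A minimal sketch of such an np-step schedule, assuming that in step s each task exchanges one block with partner (my_task + s) mod np via MPI_SENDRECV; the names are illustrative and the schedule actually measured may differ.

      subroutine global_reorder(a1, a2, nblk, np, my_task, com)
      implicit none
      include 'mpif.h'
      integer nblk, np, my_task, com, ierr, s, to, from
      integer status(MPI_STATUS_SIZE)
      real*8 a1(nblk, 0:np-1), a2(nblk, 0:np-1)

! step s: send block "to" to task "to",
!         receive block "from" from task "from"
      do s = 0, np-1
         to   = mod(my_task + s, np)
         from = mod(my_task - s + np, np)
         call MPI_SENDRECV(a1(1, to), nblk, MPI_DOUBLE_PRECISION,
     &        to, 0, a2(1, from), nblk, MPI_DOUBLE_PRECISION,
     &        from, 0, com, status, ierr)
      end do
      end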
15 Performance Modelling
- Hardware model: nk nodes with kp procs each; np = nk * kp is the total proc count
- Switch model: nk concurrent links between nodes; latency tlat, bandwidth c
- Execution model for hybrid reorder on nk nodes: nk steps with n1*n2 / nk^2 data per node
- Execution model for MPI reorder on np processors: np steps with n1*n2 / np^2 data per node; switch links shared between kp procs
(one plausible form of the resulting timing model is sketched below)
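The timing formulas on the following slides survive only as figures; one plausible form, assuming w bytes per matrix element, one latency per step, and the kp tasks of a node sharing a single switch link, would be

$$ t_{\mathrm{hybrid}} \approx n_k \Bigl( t_{\mathrm{lat}} + \frac{n_1 n_2\, w}{n_k^{2}\, c} \Bigr),
   \qquad
   t_{\mathrm{MPI}} \approx n_p \Bigl( t_{\mathrm{lat}} + \frac{k_p\, n_1 n_2\, w}{n_p^{2}\, c} \Bigr) $$

In this form both variants move the same total data n1*n2*w/(nk*c) per node, but pure MPI pays kp times as many latencies, which is the coarser internode granularity advantage listed on slide 7.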
16 Performance Modelling
MPI timing model
17 Timing of Global Reorder (internode part)
18 Timing of Global Reorder (internode part)
19 Timing of Global Reorder
20 Timing of Transpose
21 Scaling of Transpose
22 Timing of Transpose Steps
23 Summary of Results: Hardware
- Memory access in Winterhawk2 is not adequate: a copy rate of 400 MB/s = 50 Mword/s against a peak CPU rate of 6000 Mflop/s, a factor of 100 between computational speed and memory speed
- Sharing of the switch link by 4 processors degrades communication speed: bandwidth smaller by more than a factor of 4 (factor of 4 expected); latency larger by nearly a factor of 4 (factor of 1 expected)
24 Summary of Results: Hybrid vs. MPI
- Hybrid OpenMP / MPI programming is profitable for the distributed matrix transpose: a 1000 x 1000 matrix on 16 nodes is 2.3 times faster, a 10000 x 10000 matrix on 16 nodes 1.1 times faster
- Competing influences: MPI programming enhances the use of cached data; hybrid programming has lower communication latency and coarser communication granularity
25 Summary of Results: Use of Transpose in FFT
- 2-dim complex array of size n = n1 x n2
- Execution time on nk nodes:
  t = 5 n log2(n) / (nk r) + 2 n / (nk c)
  where r = computational speed per node, c = transpose speed per node (in words; one complex element = 2 words)
- effective execution speed per node: r_eff = 5 n log2(n) / (nk t)
26 Summary of Results: Use of Transpose in FFT - Example SP
- r = 4 x 200 Mflop/s = 800 Mflop/s
- c depends on n, nk and the programming model; for nk = 16:
              n = 10^6   n = 10^9
  hybrid c:   5.6        7.8      Mword/s
  MPI c:      2.5        7.0      Mword/s
- effective execution speed per node:
  hybrid:     208        338      Mflop/s
  MPI:        108        317      Mflop/s
(the model evaluation sketched below reproduces these per-node speeds)
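As a cross-check, a small program (not from the original slides) that evaluates the model reconstructed on slide 25 with the values quoted above; it reproduces the listed per-node speeds to within rounding.

      program fft_speed_check
! evaluates the model above for the values quoted on this slide:
! effective per-node speed = 5*n*log2(n) / ( 5*n*log2(n)/r + 2*n/c )
      implicit none
      real*8 r, n(2), chyb(2), cmpi(2), flops
      integer i
      data n    / 1.d6, 1.d9 /
      data chyb / 5.6d6, 7.8d6 /     ! hybrid c per node
      data cmpi / 2.5d6, 7.0d6 /     ! MPI c per node
      r = 800.d6                     ! computational speed per node
      do i = 1, 2
         flops = 5.d0 * n(i) * log(n(i)) / log(2.d0)
         print *, 'n =', n(i),
     &     ' hybrid:', flops / (flops/r + 2.d0*n(i)/chyb(i)) / 1.d6,
     &     ' MPI:',    flops / (flops/r + 2.d0*n(i)/cmpi(i)) / 1.d6,
     &     ' Mflop/s per node'
      end do
      end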