1
Matrix Transpose Results with Hybrid OpenMP / MPI
  • O. Haan
  • Gesellschaft für wissenschaftliche Datenverarbeitung
    Göttingen, Germany ( GWDG )

SCICOMP 2000, SDSC, La Jolla
2
Overview
  • Hybrid Programming Model
  • Distributed Matrix Transpose
  • Performance Measurements
  • Summary of Results

3
Architecture of Scalable Parallel Computers
  • Two-level hierarchy
  • cluster of SMP nodes: distributed memory, high-speed
    interconnect
  • SMP nodes with multiple processors: shared memory,
    bus- or switch-connected

4
Programming Models
  • message passing over all processors: requires an MPI
    implementation for shared memory and multiple access
    to switch adapters; on the SP supported on the 4-way
    Winterhawk2, not on the 8-way Nighthawk
  • shared memory over all processors: requires a virtual
    global address space; not available on the SP
  • hybrid message passing - shared memory: message
    passing between nodes, shared memory within
    nodes; available on the SP

5
Hybrid Programming Model
  • SPMD program with MPI tasks
  • OpenMP threads within each task
  • communication between MPI tasks
6
Example of Hybrid Program
  program hybrid_example
  include 'mpif.h'
  integer com, ierr, nk, my_task, kp, i, my_thread
  integer OMP_GET_NUM_PROCS, OMP_GET_THREAD_NUM
  real node_res, glob_res, thread_res(0:7)     ! bound 0:7 is an assumed maximum thread count
  com = MPI_COMM_WORLD
  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(com,nk,ierr)              ! nk = number of MPI tasks (one per node)
  call MPI_COMM_RANK(com,my_task,ierr)
  kp = OMP_GET_NUM_PROCS()                     ! kp = processors available to this task
!$OMP PARALLEL PRIVATE(my_thread)
  my_thread = OMP_GET_THREAD_NUM()
  call work(my_thread,kp,my_task,nk,thread_res)
!$OMP END PARALLEL
  node_res = 0.0                               ! sum the per-thread results on this node
  do i = 0, kp-1
     node_res = node_res + thread_res(i)
  end do
  call MPI_REDUCE(node_res,glob_res,1,MPI_REAL,MPI_SUM,0,com,ierr)   ! global sum on task 0
  call MPI_FINALIZE(ierr)
  stop
  end

7
Hybrid Programming vs. Pure Message Passing
  • advantages:
    works on all SP configurations
    coarser internode communication granularity
    faster intranode communication
  • drawbacks:
    larger programming effort
    additional synchronization steps
    reduced reuse of cached data
the net score depends on the problem
8
Distributed Matrix Transpose
9
3-step Transpose
n1 x n2 matrix A( i1 , i2 )  -->  n2 x n1 matrix B( i2 , i1 )

decompose n1, n2 into local and global parts:  n1 = n1l * np ,  n2 = n2l * np
write matrices A, B as 4-dim arrays:
A( i1l , i1g , i2l , i2g ) ,  B( i2l , i2g , i1l , i1g )

step 1  local reorder:    A( i1l , i1g , i2l , i2g )   -->  a1( i1l , i2l , i1g , i2g )
step 2  global reorder:   a1( i1l , i2l , i1g , i2g )  -->  a2( i1l , i2l , i2g , i1g )
step 3  local transpose:  a2( i1l , i2l , i2g , i1g )  -->  B( i2l , i2g , i1l , i1g )
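A minimal sketch of the three steps in Fortran follows. This is not the author's
code: it assumes one MPI task per i2g block (np tasks), n1 = n1l * np,
n2 = n2l * np, double precision data, and MPI_ALLTOALL standing in for the
global reorder; the local steps are threaded with OpenMP as in the hybrid model.

  ! hedged sketch of the 3-step distributed transpose
  subroutine transpose_3step( A, B, n1l, n2l, np, com )
    implicit none
    include 'mpif.h'
    integer, intent(in) :: n1l, n2l, np, com
    double precision, intent(in)  :: A(n1l*np, n2l)   ! this task's block of columns of A
    double precision, intent(out) :: B(n2l*np, n1l)   ! this task's block of columns of B
    double precision :: a1(n1l, n2l, np), a2(n1l, n2l, np)
    integer :: i1l, i2l, ig, ierr

    ! step 1: local reorder  A(i1l,i1g,i2l) -> a1(i1l,i2l,i1g)
!$OMP PARALLEL DO PRIVATE(i2l, i1l)
    do ig = 1, np
       do i2l = 1, n2l
          do i1l = 1, n1l
             a1(i1l, i2l, ig) = A((ig-1)*n1l + i1l, i2l)
          end do
       end do
    end do
!$OMP END PARALLEL DO

    ! step 2: global reorder, block i1g of a1 goes to task i1g-1 (swaps i1g and i2g)
    call MPI_ALLTOALL( a1, n1l*n2l, MPI_DOUBLE_PRECISION, &
                       a2, n1l*n2l, MPI_DOUBLE_PRECISION, com, ierr )

    ! step 3: local transpose  a2(i1l,i2l,i2g) -> B(i2l,i2g,i1l)
!$OMP PARALLEL DO PRIVATE(i2l, i1l)
    do ig = 1, np
       do i2l = 1, n2l
          do i1l = 1, n1l
             B((ig-1)*n2l + i2l, i1l) = a2(i1l, i2l, ig)
          end do
       end do
    end do
!$OMP END PARALLEL DO
  end subroutine transpose_3step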
10
Local Steps Copy with Reorder
  • data in memory: speed limited by the performance of the
    bus and memory subsystem; on the Winterhawk2 all
    processors share the same bus, bandwidth
    1.6 GB/s
  • data in cache: speed limited by processor
    performance; the Winterhawk2 can issue one load plus one
    store per cycle, bandwidth 8 B x 375 MHz = 3 GB/s
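
As a consistency check (my arithmetic, not on the slide): when all four CPUs of
a Winterhawk2 node copy out of memory at once, each one gets roughly

  1.6 GB/s / 4  =  400 MB/s  =  50 Mword/s   (8-byte words)

which is the copy rate quoted in the hardware summary at the end of the talk.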

11
Copy Data in Memory
12
Copy Prefetch
13
Copy Data in Cache
14
Global Reorder
a1( : , : , i1g , i2g )  -->  a2( : , : , i2g , i1g )
global reorder on np processors in np steps
(diagram: processors p0, p1, p2 exchanging blocks in steps step0, step1, step2)
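One way to organize the global reorder in np explicit steps is sketched below.
This is an illustration, not the author's code; the (task + step) mod np exchange
schedule is an assumption suggested by the diagram, and blocklen = n1l * n2l.

  ! hedged sketch: global reorder as np pairwise exchange steps
  subroutine global_reorder( a1, a2, blocklen, np, my_task, com )
    implicit none
    include 'mpif.h'
    integer, intent(in) :: blocklen, np, my_task, com
    double precision, intent(in)  :: a1(blocklen, np)
    double precision, intent(out) :: a2(blocklen, np)
    integer :: step, dest, src, ierr, status(MPI_STATUS_SIZE)

    do step = 0, np-1
       dest = mod(my_task + step, np)        ! task that receives my block dest+1
       src  = mod(my_task - step + np, np)   ! task whose block arrives here
       if (step == 0) then
          a2(:, my_task+1) = a1(:, my_task+1)            ! step 0: local copy, no message
       else
          call MPI_SENDRECV( a1(1, dest+1), blocklen, MPI_DOUBLE_PRECISION, dest, 0, &
                             a2(1, src+1),  blocklen, MPI_DOUBLE_PRECISION, src,  0, &
                             com, status, ierr )
       end if
    end do
  end subroutine global_reorder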
15
Performance Modelling
  • Hardware model: nk nodes with kp procs each;
    np = nk * kp is the total processor count
  • Switch model: nk concurrent links between nodes,
    latency tlat , bandwidth c
  • execution model for Hybrid reorder on nk nodes:
    nk steps with n1*n2 / nk^2 data per node and step
  • execution model for MPI reorder on np processors:
    np steps with n1*n2 / np^2 data per task and step;
    switch links shared between the kp procs of a node

16
Performance Modelling
  • Hybrid timing model
  • MPI timing model
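
The formulas of the two timing models did not survive the transcript. A plausible
reconstruction from the execution models on the previous slide, assuming each step
costs one latency tlat plus the per-step data volume (w bytes per word) over the
link bandwidth c, with the link shared by the kp tasks of a node in the MPI case:

  t_hybrid  ≈  nk * ( tlat + n1*n2*w / (nk^2 * c) )       =  nk * tlat + n1*n2*w / (nk * c)
  t_MPI     ≈  np * ( tlat + kp * n1*n2*w / (np^2 * c) )  =  np * tlat + n1*n2*w / (nk * c)

In this reading both models share the same bandwidth term and the MPI model pays
kp times as many latency terms; the measurements summarized later show that the
shared link in fact loses more bandwidth than the factor kp assumed here.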
17
Timing of Global Reorder (internode part)
18
Timing of Global Reorder (internode part)
19
Timing of Global Reorder
20
Timing of Transpose
21
Scaling of Transpose
22
Timing of Transpose Steps
23
Summary of Results: Hardware
  • Memory access in the Winterhawk2 is not
    adequate: a copy rate of 400 MB/s = 50
    Mword/s against a peak CPU rate of 6000 Mflop/s is
    a factor of about 100 between computational speed
    and memory speed
  • Sharing of a switch link by 4 processors degrades
    communication speed: bandwidth smaller by more
    than a factor of 4 ( factor of 4 expected ),
    latency larger by nearly a factor of 4 (
    factor of 1 expected )

24
Summary of Results: Hybrid vs. MPI
  • hybrid OpenMP / MPI programming is profitable
    for the distributed matrix transpose: a 1000 x 1000
    matrix on 16 nodes is 2.3 times faster, a 10000 x
    10000 matrix on 16 nodes 1.1 times faster
  • competing influences: MPI programming enhances the
    reuse of cached data; hybrid programming has lower
    communication latency and coarser communication
    granularity

25
Summary of Results: Use of Transpose in FFT
  • 2-dim complex array of size n

Execution time on nk nodes (formula shown on the slide),
where r = computational speed per node and
c = transpose speed per node, gives the
effective execution speed per node.
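
The formulas themselves are missing from the transcript. A reconstruction that
reproduces the numbers on the following slide, assuming 5*n*log2(n) floating
point operations for an n-point complex FFT and 2*n real words moved by the
transpose:

  t(nk)  ≈  5*n*log2(n) / (nk*r)  +  2*n / (nk*c)

  effective speed per node  =  5*n*log2(n) / (nk*t(nk))  =  r / ( 1 + 2*r / (5*c*log2(n)) )

For example, r = 800 Mflop/s, n = 10^6 and the hybrid c = 5.6 Mword/s give
800 / (1 + 1600/560) ≈ 208 Mflop/s, matching the table on the next slide.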
26
Summary of Results: Use of Transpose in FFT - Example SP

r = 4 x 200 Mflop/s = 800 Mflop/s
c depends on n, nk and the programming model

nk = 16 ,  n = 10^6 ... 10^9
  hybrid:  c = 5.6 ... 7.8 Mword/s
  MPI:     c = 2.5 ... 7.0 Mword/s

effective execution speed per node
  hybrid:  208 ... 338 Mflop/s
  MPI:     108 ... 317 Mflop/s