Experiences with Coarray Fortran on Hardware Shared Memory Platforms

1
Experiences with Co-array Fortran on Hardware
Shared Memory Platforms

Yuri Dotsenko Cristian Coarfa
John Mellor-Crummey Daniel Chavarria-Miranda
Rice University, Houston, TX

2
Co-array Fortran

Global Address Space (GAS) language
SPMD programming model
Simple extension of Fortran 90
Explicit control over data placement and computation distribution
Private data
Shared data: both local and remote
One-sided communication (PUT and GET)
Team and point-to-point synchronization

3
Co-array Fortran Example
    integer :: a(10,20)[*]
    if (this_image() > 1) &
        a(1:10,1:2) = a(1:10,19:20)[this_image()-1]
Copies from left neighbor
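The left-neighbor copy can be tried as a complete program. This is a minimal sketch under stated assumptions: it uses Fortran 2008 coarray syntax (`sync all`) rather than the earlier CAF dialect accepted by cafc, and the initial values and checks are illustrative additions that are not part of the slide.

```fortran
program left_neighbor_copy
  implicit none
  integer :: a(10,20)[*]      ! one 10x20 array per image
  integer :: me

  me = this_image()
  a = me                      ! illustrative fill: each image stores its own index
  sync all                    ! all images have initialized a before any GET

  ! Statement from the slide: columns 19:20 of the left neighbor
  ! are copied into columns 1:2 of the local array.
  if (me > 1) a(1:10,1:2) = a(1:10,19:20)[me-1]
  sync all

  ! Self-check (illustrative): the first two columns now hold the neighbor's index.
  if (me > 1) then
    if (any(a(:,1:2) /= me-1)) error stop "unexpected values in copied columns"
  end if
  if (any(a(:,3:20) /= me)) error stop "columns outside the copy were modified"
  if (me == 1) print *, "copy completed on", num_images(), "image(s)"
end program left_neighbor_copy
```

With GNU Fortran this should build with `-fcoarray=single` for a one-image run, in which case the copy is skipped because no image has a left neighbor.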

4
Compiling CAF

Source-to-source translation
Prototype: Rice cafc
Fortran 90 pointer-based co-array representation
ARMCI-based data movement
Goal: performance transparency

Challenges
Retain CAF source-level information
Array contiguity, array bounds, lack of aliasing
Exploit efficient fine-grain communication on SMPs

5
Outline

Co-array representation and data access
Local data
Remote data
Experimental evaluation
Conclusions

6
Representation and Access for Local Data

Efficient local access to SAVE/COMMON co-arrays
is crucial to achieving best performance on a
target architecture
Fortran 90 pointer
Fortran 90 pointer to structure
Cray pointer
Subroutine argument
COMMON block (need support for symmetric shared
objects)

7
Fortran 90 Pointer Representation

CAF declaration:
    real, save :: a(10,20)[*]
After translation:
    type T1
      integer(PtrSize) :: handle
      real, pointer :: local(:,:)
    end type T1
    type (T1) :: ca
Local access: ca%local(2,3)
Portable representation
Back-end compiler has no knowledge about:
    Potential aliasing (no-alias flags for some compilers)
    Contiguity
    Bounds
Implemented in cafc
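A minimal runnable sketch of this representation follows. The names `T1`, `ca`, `handle`, `local`, and `PtrSize` come from the slide; the value chosen for `PtrSize`, the `storage` array standing in for memory that the runtime would allocate, and the zero handle are assumptions made only so the example runs on its own.

```fortran
program f90_pointer_representation
  implicit none
  ! Assumption: a 64-bit integer kind for the opaque handle.
  integer, parameter :: PtrSize = selected_int_kind(18)

  type T1
    integer(PtrSize) :: handle        ! opaque runtime handle for the co-array
    real, pointer :: local(:,:)       ! local part of the co-array
  end type T1

  type (T1) :: ca
  real, target :: storage(10,20)      ! hypothetical stand-in for runtime-allocated memory

  storage = 0.0
  ca%handle = 0                       ! placeholder value, not a real handle
  ca%local => storage                 ! bounds and contiguity are known only at run time

  ! Local access a(2,3) in CAF source becomes ca%local(2,3).
  ca%local(2,3) = 4.5
  if (storage(2,3) /= 4.5) error stop "pointer does not refer to the storage"
  print *, "shape seen through the pointer:", shape(ca%local)
end program f90_pointer_representation
```

Because `local` is a deferred-shape pointer, the compiler sees neither the constant bounds nor a guarantee that the target is contiguous or unaliased, which is the limitation listed on this slide.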

8
Fortran 90 Pointer to Structure Representation

CAF declaration:
    real, save :: a(10,20)[*]
After translation:
    type T1
      real :: local(10,20)
    end type T1
    type (T1), pointer :: ca
Conveys constant bounds and contiguity
Potential aliasing is still a problem

9
Cray Pointer Representation

CAF declaration:
    real, save :: a(10,20)[*]
After translation:
    real :: a_local(10,20)
    pointer (a_ptr, a_local)
Conveys constant bounds and contiguity
Potential aliasing is still a problem
Cray pointer is not in the Fortran 90 standard

10
Subroutine Argument Representation

CAF source:
    subroutine foo()
      real, save :: a(10,20)[*]
      a(i,j) = a(i-1,j)
    end subroutine foo
After translation:
    subroutine foo()
      ! F90 representation for co-array a
      call foo_body(ca%local(1,1), ca%handle, ...)
    end subroutine foo
    subroutine foo_body(a_local, a_handle, ...)
      real :: a_local(10,20)
      a_local(i,j) = a_local(i-1,j)
    end subroutine foo_body

11
Subroutine Argument Representation (cont.)

Avoid conservative assumptions about co-array
aliasing by the back-end compiler
Performance is close to optimal
Extra procedures and procedure calls
Implemented in cafc

12
COMMON Block Representation

CAF declaration:
    real :: a(10,20)[*]
    common /a_cb/ a
After translation:
    real :: ca(10,20)
    common /ca_cb/ ca
Yields best performance for local accesses
OS must support symmetric data objects

13
Outline

Co-array representation and data access
Local data
Remote data
Experimental evaluation
Conclusions

14
Generating CAF Communication

Generic parallel architectures
Library function calls to move data
Shared memory architectures (load/store)
Fortran 90 pointers
Vector of Fortran 90 pointers
Cray pointers

15
Communication Generation for Generic Parallel
Architectures

CAF code:
    a(:) = b(:)[p]
Translated code:
    allocate b_temp(:)
    call GET( b, p, b_temp, ... )
    a(:) = b_temp(:)
    deallocate b_temp
Portable: works on clusters and SMPs
Function call overhead per fine-grain access
Uses a temporary to hold off-processor data
Implemented in cafc

16
Communication Generation Using Fortran 90 Pointers

CAF code:
    do j = 1, N
      C(j) = A(j)[p]
    end do
Translated code:
    do j = 1, N
      ptrA => A(j)
      call CafSetPtr(ptrA, p, A_handle)
      C(j) = ptrA
    end do
Function call overhead for each reference
Implemented in cafc

17
Pointer Initialization Hoisting

Naïvely translated code:
    do j = 1, N
      ptrA => A(j)
      call CafSetPtr(ptrA, p, A_handle)
      C(j) = ptrA
    end do
Code with hoisted pointer initialization:
    ptrA => A(1:N)
    call CafSetPtr(ptrA, p, A_handle)
    do j = 1, N
      C(j) = ptrA(j)
    end do
Pointer initialization hoisting is not yet implemented in cafc
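The effect of hoisting can be illustrated with ordinary Fortran pointers. This is only a sketch: the cafc runtime call `CafSetPtr`, which retargets the pointer to image `p`, is left out because its interface is not given beyond the slide, so both loops here read local data. The names `ptrElem`, `C1`, and `C2` are hypothetical.

```fortran
program pointer_hoisting_sketch
  implicit none
  integer, parameter :: N = 8
  real, target :: A(N)
  real :: C1(N), C2(N)
  real, pointer :: ptrElem        ! scalar pointer, re-associated on every iteration
  real, pointer :: ptrA(:)        ! array pointer, associated once before the loop
  integer :: j

  A = [(real(j), j = 1, N)]

  ! Naive form: one pointer association (plus, in cafc, one runtime
  ! call to retarget the pointer) per element access.
  do j = 1, N
    ptrElem => A(j)
    C1(j) = ptrElem
  end do

  ! Hoisted form: the association (and the runtime call) is done once
  ! for the whole section A(1:N); the loop body is a plain indexed load.
  ptrA => A(1:N)
  do j = 1, N
    C2(j) = ptrA(j)
  end do

  if (any(C1 /= C2)) error stop "naive and hoisted loops disagree"
  if (any(C2 /= A))  error stop "copied values are wrong"
  print *, "both loops copied", N, "elements"
end program pointer_hoisting_sketch
```

Both loops produce the same result; the hoisted form removes the per-iteration setup, which is the overhead that the function-call-per-access strategy pays.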

18
Communication Generation Using Vector of Fortran
90 Pointers

CAF code:
    do j = 1, N
      C(j) = A(j)[p]
    end do
Translated code:
    ! initialization
    do j = 1, N
      C(j) = ptrVectorA(p)%ptrA(j)
    end do
Does not require pointer initialization hoisting and avoids function calls
Worse performance than that of hoisted pointer initialization

19
Communication Generation Using Cray Pointers

CAF code:
    do j = 1, N
      C(j) = A(j)[p]
    end do
Translated code:
    integer(PtrSize) :: addrA(:)
    ! addrA initialization
    do j = 1, N
      ptrA = addrA(p)
      C(j) = A_rem(j)
    end do
addrA(p): address of co-array A on image p
Cray pointer initialization hoisting yields only marginal improvement

20
Outline

Co-array representation and data access
Local data
Remote data
Experimental evaluation
Conclusions

21
Experimental Platforms

SGI Altix 3000
128 Itanium2 processors (1.5 GHz, 6 MB L3 cache)
Linux (2.4.21 kernel)
Intel Fortran Compiler 8.0
SGI Origin 2000
16 MIPS R12000 processors (350 MHz, 8 MB L2 cache)
IRIX64 6.5
MIPSpro Compiler 7.3.1.3m

22
Benchmarks

STREAM
Random Access
Spark98
NAS MG and SP

23
STREAM

Copy kernel
    DO J = 1, N               DO J = 1, N
      C(J) = A(J)               C(J) = A(J)[p]
    END DO                    END DO
Triad kernel
    DO J = 1, N               DO J = 1, N
      A(J) = B(J)+s*C(J)        A(J) = B(J)[p]+s*C(J)[p]
    END DO                    END DO
Goal: investigate how well the architecture's bandwidth can be
delivered up to the language level
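The remote variants of the two kernels can be run as a small coarray program. This is a sketch under assumptions, not the benchmark used in the talk: it uses Fortran 2008 syntax, a small `N`, constant fill values, and a right-neighbor choice for `p`, all of which are illustrative, and it performs no timing.

```fortran
program stream_remote_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: N = 1000            ! illustrative size, far below benchmark scale
  real(dp) :: A(N)[*], B(N)[*], C(N)[*]
  real(dp), parameter :: s = 3.0_dp
  integer :: j, p

  ! Assumption: each image reads from its right neighbor, wrapping around.
  p = mod(this_image(), num_images()) + 1

  A = 1.0_dp
  B = 2.0_dp
  C = 0.0_dp
  sync all                                   ! remote data is initialized before any GET

  ! Remote Copy kernel: one fine-grain remote read per iteration.
  do j = 1, N
    C(j) = A(j)[p]
  end do
  sync all                                   ! C is complete everywhere before it is read remotely

  ! Remote Triad kernel: two fine-grain remote reads per iteration.
  do j = 1, N
    A(j) = B(j)[p] + s*C(j)[p]
  end do
  sync all

  ! Self-check with the constant fill values: C = 1 and A = 2 + 3*1 = 5.
  if (any(C /= 1.0_dp)) error stop "Copy kernel produced wrong values"
  if (any(A /= 5.0_dp)) error stop "Triad kernel produced wrong values"
  if (this_image() == 1) print *, "kernels verified on", num_images(), "image(s)"
end program stream_remote_sketch
```

On a single image `p` equals the local image, so the same code exercises the coindexed syntax without any actual remote traffic.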

24
STREAM Local Accesses

COMMON block is the best, if the platform allows it
Subroutine parameter has similar performance to
COMMON block representation
Pointer-based representations have performance
within 5% of the best on the Altix (with
no-aliasing flag), and within 15% on the Origin
Fortran 90 pointer representation yields 30% of
performance on the Altix without using the flag
to specify lack of pointer aliasing
Array section statements with Fortran 90 pointer
representation yield 40-50% performance on the
Origin

25
STREAM Remote Accesses

COMMON block representation for local access +
Cray pointer for remote accesses is the best
Subroutine argument + Cray pointer for remote
accesses has similar performance
Remote accesses with a function call per access
yield very poor performance (24 times slower than
the best on the Altix, five times slower on the
Origin)
Generic strategy (with intermediate temporaries)
delivers only 50-60% of performance on the Altix
and 30-40% of performance on the Origin for
vectorized code (except for the Copy kernel)
Pointer initialization hoisting is crucial for
Fortran 90 pointer remote accesses and desirable
for Cray pointers
Similarly coded OpenMP version has comparable
performance on the Altix (90% for the scale
kernel) and 86-90% on the Origin

26
Spark98

Based on CMU's earthquake simulation code
Computes sparse matrix-vector product
Irregular application with fine-grain accesses
Matrix distribution and computation partitioning
are done offline (sf2 traces)
Spark98 computes partial product locally, then
assembles the result across processors

27
Spark98 (cont.)

Versions
Serial (Fortran kernel, ported from C)
MPI (Fortran kernel, ported from C)
Hybrid (best shared memory threaded version)
CAF versions (based on MPI version)
CAF Packed PUTs
CAF Packed GETs
CAF GETs (computation with remote data accessed
in place)

28
Spark98 GETs Result Assembly

    v2(:,:) = v(:,:)
    call sync_all()
    do s = 0, subdomains-1
      if (commindex(s) < commindex(s+1)) then
        pos = commindex(s)
        comm_len = commindex(s+1) - pos
        v(:, comm(pos:pos+comm_len-1)) = &
          v(:, comm(pos:pos+comm_len-1)) + &
          v2(:, comm_gets(pos:pos+comm_len-1))[s]
      end if
    end do
    call sync_all()

30
Spark98 Performance on Altix

Performance of all CAF versions is comparable to
that of MPI, and better on large numbers of CPUs
CAF GETs is simpler and more natural to code,
but up to 13% slower
Without considering locality, applications do not
scale on NUMA architectures (Hybrid)
ARMCI library is more efficient than MPI

31
NAS MG and SP

Versions
MPI (NPB 2.3)
CAF (based on MPI NPB 2.3)
Generic code generation with subroutine argument
co-array representation (procedure splitting)
Shared memory code generation (Fortran 90
pointers vectorized source code) with subroutine
argument co-array representation
OpenMP (NPB 3.0)
Class C

32
NAS SP Performance on Altix

Performance of CAF versions is comparable to that
of MPI
CAF-generic has better performance than CAF-shm
because it uses memcpy, which hides latency by
keeping an optimal number of memory operations in flight
OpenMP scales poorly

33
NAS MG Performance on Altix

34
Conclusions

Direct load/store communication improves
performance of fine-grain accesses by a factor of
24 on the Altix 3000 and five on the Origin 2000
In-place data use in CAF statements incurs
acceptable abstraction overhead
Performance comparable to that of MPI codes for
fine- and coarse-grain applications
We plan to implement optimal, architecture-dependent
code generation for local and remote co-array
accesses in cafc