Title: Experiences with Coarray Fortran on Hardware Shared Memory Platforms
1. Experiences with Co-array Fortran on Hardware Shared Memory Platforms
- Yuri Dotsenko Cristian Coarfa
- John Mellor-Crummey Daniel Chavarria-Miranda
- Rice University, Houston, TX
2. Co-array Fortran
- Global Address Space (GAS) language
- SPMD programming model
- Simple extension of Fortran 90
- Explicit control over data placement and computation distribution
- Private data
- Shared data: both local and remote
- One-sided communication (PUT and GET)
- Team and point-to-point synchronization
3. Co-array Fortran Example
    integer a(10,20)[*]
    if (this_image() > 1) &
      a(1:10,1:2) = a(1:10,19:20)[this_image()-1]
- Copies from left neighbor
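
The fragment above can be expanded into a complete program. What follows is a minimal sketch, assuming Fortran 2008 coarray syntax (the standardized descendant of CAF, in which synchronization is spelled `sync all`) and a coarray-capable compiler, for example gfortran with `-fcoarray=single`. The program name, the fill values, and the checks are illustrative and do not come from the slides.

```fortran
program left_neighbor_copy
  implicit none
  integer :: a(10,20)[*]   ! each image owns one 10x20 block
  integer :: me

  me = this_image()
  a = me                   ! illustrative data: fill the block with the image index

  sync all                 ! every block is initialized before any remote read

  ! One-sided GET: images 2..n copy the last two columns of the
  ! left neighbor's block into their own first two columns.
  if (me > 1) a(1:10,1:2) = a(1:10,19:20)[me-1]

  sync all                 ! all copies are complete

  ! Self-check: the two halo columns hold the neighbor's index,
  ! the remaining columns still hold this image's own index.
  if (me > 1) then
     if (any(a(:,1:2) /= me - 1)) error stop "unexpected halo values"
  end if
  if (any(a(:,3:20) /= me)) error stop "interior was modified"
  if (me == 1) print *, "halo copy completed on", num_images(), "image(s)"
end program left_neighbor_copy
```

The remote read touches columns 19 and 20 of the neighbor while the neighbor writes only its own columns 1 and 2, so the first `sync all` is the only ordering the copy needs.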
4. Compiling CAF
- Source-to-source translation
- Prototype: Rice cafc
- Fortran 90 pointer-based co-array representation
- ARMCI-based data movement
- Goal: performance transparency
- Challenges
- Retain CAF source-level information
- Array contiguity, array bounds, lack of aliasing
- Exploit efficient fine-grain communication on SMPs
5. Outline
- Co-array representation and data access
- Local data
- Remote data
- Experimental evaluation
- Conclusions
6. Representation and Access for Local Data
- Efficient local access to SAVE/COMMON co-arrays is crucial to achieving best performance on a target architecture
- Fortran 90 pointer
- Fortran 90 pointer to structure
- Cray pointer
- Subroutine argument
- COMMON block (needs support for symmetric shared objects)
7. Fortran 90 Pointer Representation
- CAF declaration: real, save :: a(10,20)[*]
- After translation:
    type T1
      integer(PtrSize) handle
      real, pointer :: local(:,:)
    end type T1
    type (T1) ca
- Local access: ca%local(2,3)
- Portable representation
- Back-end compiler has no knowledge about:
- Potential aliasing (no-alias flags for some compilers)
- Contiguity
- Bounds
- Implemented in cafc
8. Fortran 90 Pointer to Structure Representation
- CAF declaration: real, save :: a(10,20)[*]
- After translation:
    type T1
      real local(10,20)
    end type T1
    type (T1), pointer :: ca
- Conveys constant bounds and contiguity
- Potential aliasing is still a problem
9. Cray Pointer Representation
- CAF declaration: real, save :: a(10,20)[*]
- After translation:
    real a_local(10,20)
    pointer (a_ptr, a_local)
- Conveys constant bounds and contiguity
- Potential aliasing is still a problem
- Cray pointers are not part of the Fortran 90 standard
10. Subroutine Argument Representation
- CAF source:
    subroutine foo(...)
      real, save :: a(10,20)[*]
      a(i,j) = a(i-1,j)
    end subroutine foo
- After translation:
    subroutine foo(...)
      ! F90 representation for co-array a
      call foo_body(ca%local(1,1), ca%handle, ...)
    end subroutine foo
    subroutine foo_body(a_local, a_handle, ...)
      real a_local(10,20)
      a_local(i,j) = a_local(i-1,j)
    end subroutine foo_body
11. Subroutine Argument Representation (cont.)
- Avoids conservative assumptions about co-array aliasing by the back-end compiler
- Performance is close to optimal
- Extra procedures and procedure calls
- Implemented in cafc
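
The wrapper-and-body split described on the two slides above can be tried without any co-array runtime. The sketch below is a simplified illustration, not cafc output: the handle argument is dropped, the array extents are passed explicitly, and the whole pointer array is passed rather than its first element. The module name, the loop order, and the test data are assumptions made for the example.

```fortran
module foo_split
  implicit none
  ! Pointer-based representation of the local part of a co-array
  type :: T1
     real, pointer :: local(:,:) => null()
  end type T1
contains
  ! Wrapper: keeps the pointer representation and forwards the data
  subroutine foo(ca)
    type(T1), intent(inout) :: ca
    call foo_body(ca%local, size(ca%local, 1), size(ca%local, 2))
  end subroutine foo

  ! Body: sees an ordinary explicit-shape dummy array, so the back-end
  ! compiler knows its bounds and may assume it is contiguous and
  ! not aliased with other dummy arguments.
  subroutine foo_body(a_local, n1, n2)
    integer, intent(in)    :: n1, n2
    real,    intent(inout) :: a_local(n1, n2)
    integer :: i, j
    do j = 1, n2
       do i = n1, 2, -1          ! downward sweep: row i takes the old row i-1
          a_local(i, j) = a_local(i-1, j)
       end do
    end do
  end subroutine foo_body
end module foo_split

program foo_split_demo
  use foo_split
  implicit none
  type(T1) :: ca
  integer :: i

  allocate(ca%local(10, 20))
  do i = 1, 10
     ca%local(i, :) = real(i)    ! row i holds the value i
  end do

  call foo(ca)

  ! Every row from 2 to 10 should now hold the value of the row above it
  do i = 2, 10
     if (any(ca%local(i, :) /= real(i - 1))) error stop "shift failed"
  end do
  print *, "rows shifted down by one"
  deallocate(ca%local)
end program foo_split_demo
```

The cost of the approach is visible in the sketch: each procedure that touches a co-array becomes two procedures and gains one extra call.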
12. COMMON Block Representation
- CAF declaration:
    real a(10,20)[*]
    common /a_cb/ a
- After translation:
    real ca(10,20)
    common /ca_cb/ ca
- Yields best performance for local accesses
- OS must support symmetric data objects
13. Outline
- Co-array representation and data access
- Local data
- Remote data
- Experimental evaluation
- Conclusions
14. Generating CAF Communication
- Generic parallel architectures
- Library function calls to move data
- Shared memory architectures (load/store)
- Fortran 90 pointers
- Vector of Fortran 90 pointers
- Cray pointers
15. Communication Generation for Generic Parallel Architectures
- CAF code: a(:) = b(:)[p]
- Translated code:
    allocate b_temp(:)
    call GET( b, p, b_temp, ... )
    a(:) = b_temp(:)
    deallocate b_temp
- Portable: works on clusters and SMPs
- Function call overhead per fine-grain access
- Uses a temporary to hold off-processor data
- Implemented in cafc
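
The temporary-based translation above can be mimicked in standard Fortran 2008. In the sketch below, `get_stub` is a hypothetical stand-in for the runtime's GET library call; it is implemented with a coarray reference only so that the example is self-contained, and its argument list is an assumption rather than the cafc interface.

```fortran
program generic_get_sketch
  implicit none
  integer, parameter :: n = 8
  real :: a(n), b(n)[*]
  real, allocatable :: b_temp(:)
  integer :: p

  b = real(this_image())           ! illustrative data: b holds the owner's index

  ! p: right neighbor, wrapping around to image 1
  p = this_image() + 1
  if (p > num_images()) p = 1
  sync all                         ! b is initialized on every image

  ! Translated form of  a(:) = b(:)[p]
  allocate(b_temp(n))              ! temporary for the off-processor data
  call get_stub(p, b_temp)         ! one library-style call moves the whole section
  a(:) = b_temp(:)                 ! local copy out of the temporary
  deallocate(b_temp)

  if (any(a /= real(p))) error stop "GET returned unexpected data"
  sync all
  if (this_image() == 1) print *, "generic GET sketch finished"

contains

  ! Hypothetical stand-in for the communication library's GET
  subroutine get_stub(image, dst)
    integer, intent(in)  :: image
    real,    intent(out) :: dst(:)
    dst = b(:)[image]
  end subroutine get_stub

end program generic_get_sketch
```

For a whole-section transfer the call and the extra copy are amortized over `n` elements; the overhead listed on the slide applies when the same pattern is used for a single element inside a loop.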
16. Communication Generation Using Fortran 90 Pointers
- CAF code:
    do j = 1, N
      C(j) = A(j)[p]
    end do
- Translated code:
    do j = 1, N
      ptrA => A(j)
      call CafSetPtr(ptrA,p,A_handle)
      C(j) = ptrA
    end do
- Function call overhead for each reference
- Implemented in cafc
17. Pointer Initialization Hoisting
- Naïvely translated code:
    do j = 1, N
      ptrA => A(j)
      call CafSetPtr(ptrA,p,A_handle)
      C(j) = ptrA
    end do
- Code with hoisted pointer initialization:
    ptrA => A(1:N)
    call CafSetPtr(ptrA,p,A_handle)
    do j = 1, N
      C(j) = ptrA(j)
    end do
- Pointer initialization hoisting is not yet implemented in cafc
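
The effect of hoisting can be seen by counting pointer-setup calls. The sketch below runs in a single process: column `p` of a local array plays the role of co-array A on image `p`, and `set_ptr_elem` and `set_ptr_section` are hypothetical stand-ins for CafSetPtr whose signatures are assumptions, not the cafc runtime interface.

```fortran
module hoist_sketch
  implicit none
  integer, parameter :: n = 1000, nimg = 4
  real, target :: storage(n, nimg)   ! column p simulates A on image p
  integer :: calls = 0               ! number of pointer-setup calls made
contains
  ! Stand-in for CafSetPtr on a single element
  subroutine set_ptr_elem(ptr, j, p)
    real, pointer       :: ptr
    integer, intent(in) :: j, p
    ptr => storage(j, p)
    calls = calls + 1
  end subroutine set_ptr_elem

  ! Stand-in for CafSetPtr on the whole section A(1:N)
  subroutine set_ptr_section(ptr, p)
    real, pointer       :: ptr(:)
    integer, intent(in) :: p
    ptr => storage(:, p)
    calls = calls + 1
  end subroutine set_ptr_section
end module hoist_sketch

program hoisting_demo
  use hoist_sketch
  implicit none
  real :: c_naive(n), c_hoisted(n)
  real, pointer :: ptr_elem, ptr_sec(:)
  integer :: j, p, naive_calls, hoisted_calls

  call random_number(storage)
  p = 3

  ! Naive translation: one setup call per remote reference
  calls = 0
  do j = 1, n
     call set_ptr_elem(ptr_elem, j, p)
     c_naive(j) = ptr_elem
  end do
  naive_calls = calls

  ! Hoisted translation: one setup call, then plain loads in the loop
  calls = 0
  call set_ptr_section(ptr_sec, p)
  do j = 1, n
     c_hoisted(j) = ptr_sec(j)
  end do
  hoisted_calls = calls

  if (any(c_naive /= c_hoisted)) error stop "results differ"
  print *, "setup calls: naive =", naive_calls, " hoisted =", hoisted_calls
end program hoisting_demo
```

Both loops read the same data; the hoisted form replaces `n` setup calls with one and leaves a loop body that contains only loads and stores.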
18. Communication Generation Using Vector of Fortran 90 Pointers
- CAF code:
    do j = 1, N
      C(j) = A(j)[p]
    end do
- Translated code:
    ... initialization ...
    do j = 1, N
      C(j) = ptrVectorA(p)%ptrA(j)
    end do
- Does not require pointer initialization hoisting and avoids function calls
- Worse performance than that of hoisted pointer initialization
19. Communication Generation Using Cray Pointers
- CAF code:
    do j = 1, N
      C(j) = A(j)[p]
    end do
- Translated code:
    integer(PtrSize) addrA(:)
    ... addrA initialization ...
    do j = 1, N
      ptrA = addrA(p)
      C(j) = A_rem(j)
    end do
- addrA(p): address of co-array A on image p
- Cray pointer initialization hoisting yields only marginal improvement
20. Outline
- Co-array representation and data access
- Local data
- Remote data
- Experimental evaluation
- Conclusions
21. Experimental Platforms
- SGI Altix 3000
- 128 Itanium2 1.5 GHz, 6 MB L3 cache processors
- Linux (2.4.21 kernel)
- Intel Fortran Compiler 8.0
- SGI Origin 2000
- 16 MIPS R12000 350 MHz, 8 MB L2 cache processors
- IRIX64 6.5
- MIPSpro Compiler 7.3.1.3m
22. Benchmarks
- STREAM
- Random Access
- Spark98
- NAS MG and SP
23. STREAM
- Copy kernel (local and remote versions):
    DO J = 1, N                 DO J = 1, N
      C(J) = A(J)                 C(J) = A(J)[p]
    END DO                      END DO
- Triad kernel (local and remote versions):
    DO J = 1, N                 DO J = 1, N
      A(J) = B(J) + s*C(J)        A(J) = B(J)[p] + s*C(J)[p]
    END DO                      END DO
- Goal: investigate how well architecture bandwidth can be delivered up to the language level
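
A rough way to try the remote kernels is sketched below, assuming Fortran 2008 coarray syntax. The array size, the choice of the right neighbor as image `p`, the timing with `system_clock`, and the bytes-per-element accounting (16 for Copy, 24 for Triad with 8-byte reals) are assumptions for illustration; this is not the benchmark harness used for the measurements in the slides.

```fortran
program stream_remote_sketch
  use iso_fortran_env, only: int64
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000000          ! illustrative problem size
  real(dp) :: a(n)[*], b(n)[*], c(n)[*]
  real(dp) :: s, t_copy, t_triad
  integer(int64) :: t0, t1, rate
  integer :: j, p

  s = 3.0_dp
  a = 1.0_dp;  b = 2.0_dp;  c = 0.0_dp

  ! p: right neighbor, wrapping around to image 1
  p = this_image() + 1
  if (p > num_images()) p = 1
  sync all

  ! Remote Copy: C(J) = A(J)[p]
  call system_clock(t0, rate)
  do j = 1, n
     c(j) = a(j)[p]
  end do
  call system_clock(t1)
  t_copy = real(t1 - t0, dp) / real(rate, dp)
  sync all                                   ! all reads of a finish before a is overwritten

  ! Remote Triad: A(J) = B(J)[p] + s*C(J)[p]
  call system_clock(t0)
  do j = 1, n
     a(j) = b(j)[p] + s*c(j)[p]
  end do
  call system_clock(t1)
  t_triad = real(t1 - t0, dp) / real(rate, dp)
  sync all

  ! With the fill values above, every element of a is 2 + 3*1 = 5
  if (any(a /= 5.0_dp)) error stop "Triad produced unexpected values"

  if (this_image() == 1) then
     ! max() guards against a zero interval on coarse clocks
     print *, "Copy  MB/s:", 16.0_dp * n / 1.0e6_dp / max(t_copy,  1.0e-9_dp)
     print *, "Triad MB/s:", 24.0_dp * n / 1.0e6_dp / max(t_triad, 1.0e-9_dp)
  end if
end program stream_remote_sketch
```

The `sync all` between the two kernels matters for correctness: Copy reads `a` on the neighbor, and Triad then overwrites `a` locally.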
24. STREAM Local Accesses
- COMMON block is the best, if the platform allows it
- Subroutine parameter has similar performance to the COMMON block representation
- Pointer-based representations have performance within 5% of the best on the Altix (with no-aliasing flag), and within 15% on the Origin
- Fortran 90 pointer representation yields 30% of the best performance on the Altix without using the flag to specify lack of pointer aliasing
- Array section statements with Fortran 90 pointer representation yield 40-50% of the best performance on the Origin
25. STREAM Remote Accesses
- COMMON block representation for local accesses + Cray pointer for remote accesses is the best
- Subroutine argument + Cray pointer for remote accesses has similar performance
- Remote accesses with a function call per access yield very poor performance (24 times slower than the best on the Altix, five times slower on the Origin)
- Generic strategy (with intermediate temporaries) delivers only 50-60% of performance on the Altix and 30-40% of performance on the Origin for vectorized code (except for the Copy kernel)
- Pointer initialization hoisting is crucial for Fortran 90 pointer remote accesses and desirable for Cray pointers
- Similarly coded OpenMP version has comparable performance on the Altix (90% for the scale kernel) and 86-90% on the Origin
26. Spark98
- Based on CMU's earthquake simulation code
- Computes sparse matrix-vector product
- Irregular application with fine-grain accesses
- Matrix distribution and computation partitioning are done offline (sf2 traces)
- Spark98 computes partial product locally, then assembles the result across processors
27. Spark98 (cont.)
- Versions
- Serial (Fortran kernel, ported from C)
- MPI (Fortran kernel, ported from C)
- Hybrid (best shared memory threaded version)
- CAF versions (based on MPI version)
- CAF Packed PUTs
- CAF Packed GETs
- CAF GETs (computation with remote data accessed in place)
28. Spark98 GETs Result Assembly
    v2(:,:) = v(:,:)
    call sync_all()
    do s = 0, subdomains-1
      if (commindex(s) < commindex(s+1)) then
        pos = commindex(s)
        comm_len = commindex(s+1) - pos
        v(:, comm(pos:pos+comm_len-1)) =               &
          v(:, comm(pos:pos+comm_len-1)) +             &
          v2(:, comm_gets(pos:pos+comm_len-1))[s]
      end if
    end do
    call sync_all()
30. Spark98 Performance on Altix
- Performance of all CAF versions is comparable to that of MPI, and better on large numbers of CPUs
- CAF GETs is simpler and more natural to code, but up to 13% slower
- Without considering locality, applications do not scale on NUMA architectures (Hybrid)
- ARMCI library is more efficient than MPI
31. NAS MG and SP
- Versions
- MPI (NPB 2.3)
- CAF (based on MPI NPB 2.3)
- Generic code generation with subroutine argument co-array representation (procedure splitting)
- Shared memory code generation (Fortran 90 pointers; vectorized source code) with subroutine argument co-array representation
- OpenMP (NPB 3.0)
- Class C
32. NAS SP Performance on Altix
- Performance of CAF versions is comparable to that of MPI
- CAF-generic has better performance than CAF-shm because it uses memcpy, which hides latency by keeping the optimal number of memory operations in flight
- OpenMP scales poorly
33. NAS MG Performance on Altix
34. Conclusions
- Direct load/store communication improves performance of fine-grain accesses by a factor of 24 on the Altix 3000 and five on the Origin 2000
- In-place data use in CAF statements incurs acceptable abstraction overhead
- Performance is comparable to that of MPI codes for fine- and coarse-grain applications
- We plan to implement optimal, architecture-dependent code generation for local and remote co-array accesses in cafc
35. www.hipersoft.rice.edu/caf