Title: A Multi-platform Co-Array Fortran Compiler
1. A Multi-platform Co-Array Fortran Compiler
- Yuri Dotsenko, Cristian Coarfa
- John Mellor-Crummey
- Department of Computer Science
- Rice University
- Houston, TX USA
2. Motivation
- Parallel programming models
  - MPI: de facto standard
    - difficult to program
  - OpenMP: inefficient to map on distributed-memory platforms
    - lack of locality control
  - HPF: hard to obtain high performance
    - heroic compilers needed!
- Global address space languages (CAF, Titanium, UPC): an appealing middle ground
3. Co-Array Fortran
- Global address space programming model
  - one-sided communication (GET/PUT)
- Programmer has control over performance-critical factors
  - data distribution
  - computation partitioning
  - communication placement
- Data movement and synchronization as language primitives
  - amenable to compiler-based communication optimization
4. CAF Programming Model Features
- SPMD process images
  - fixed number of images during execution
  - images operate asynchronously
- Both private and shared data
  - real x(20,20): a private 20x20 array in each image
  - real y(20,20)[*]: a shared 20x20 array in each image
- Simple one-sided shared-memory communication
  - x(:,j:j+2) = y(:,p:p+2)[r]: copy columns from image r into local columns
- Synchronization intrinsic functions
  - sync_all: a barrier and a memory fence
  - sync_mem: a memory fence
  - sync_team([team members to notify], [team members to wait for])
- Pointers and (perhaps asymmetric) dynamic allocation
5. One-sided Communication with Co-Arrays
[Figure: two diagrams, each showing image 1, image 2, ..., image N]
6. Rice Co-Array Fortran Compiler (cafc)
- First CAF multi-platform compiler
  - previous compiler only for Cray shared-memory systems
- Implements core of the language
  - currently lacks support for derived-type and dynamic co-arrays
- Core sufficient for non-trivial codes
- Performance comparable to that of hand-tuned MPI codes
- Open source
7. Outline
- CAF programming model
- cafc
  - Core language implementation
  - Optimizations
- Experimental evaluation
- Conclusions
8. Implementation Strategy
- Source-to-source compilation of CAF codes
  - uses Open64/SL Fortran 90 infrastructure
  - CAF -> Fortran 90 + communication operations
- Communication
  - ARMCI library for one-sided communication on clusters
  - load/store communication on shared-memory platforms
- Goals
  - portability
  - high performance on a wide range of platforms
9. Co-Array Descriptors
- Initialize and manipulate Fortran 90 dope vectors
- Source declaration:

    real :: a(10,10,10)[*]

- Descriptor type and translated declaration:

    type CAFDesc_real_3
      integer(ptrkind) :: handle      ! Opaque handle to CAF runtime representation
      real, pointer    :: ptr(:,:,:)  ! Fortran 90 pointer to local co-array data
    end type CAFDesc_real_3

    type(CAFDesc_real_3) :: a
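To make the descriptor idea concrete, here is a minimal plain Fortran 90/95 sketch of how such a descriptor might be bound to local storage. The value chosen for ptrkind and the routine desc_attach are assumptions made for illustration; they are not part of cafc's actual runtime interface.

```fortran
module caf_desc_sketch
  implicit none
  ! Assumption: an integer kind with at least 18 decimal digits can hold a handle.
  integer, parameter :: ptrkind = selected_int_kind(18)

  type CAFDesc_real_3
    integer(ptrkind) :: handle = 0           ! opaque handle to the runtime representation
    real, pointer :: ptr(:,:,:) => null()    ! pointer to the local co-array data
  end type CAFDesc_real_3

contains

  ! Hypothetical stand-in for the runtime call that binds a descriptor
  ! to storage allocated for the local part of a co-array.
  subroutine desc_attach(desc, storage, handle)
    type(CAFDesc_real_3), intent(inout) :: desc
    real, target, intent(inout) :: storage(:,:,:)
    integer(ptrkind), intent(in) :: handle
    desc%handle = handle
    desc%ptr => storage
  end subroutine desc_attach

end module caf_desc_sketch
```

After desc_attach, a local reference such as a(2,3,4) in the CAF source would correspond to a%ptr(2,3,4) in the translated code.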
10. Allocating COMMON and SAVE Co-Arrays
- Compiler
  - generates static initializer for each common/save variable
- Linker
  - collects calls to all initializers
  - generates global initializer that calls all others
  - compiles global initializer and links into program
- Launch
  - invokes global initializer before main program begins
    - allocates co-array storage outside Fortran 90 runtime system
    - associates co-array descriptors with allocated memory
- Similar to handling for C++ static constructors
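The initializer scheme above can be sketched in plain Fortran 90/95 as follows. All names are hypothetical, and the allocate statements merely stand in for storage that cafc obtains outside the Fortran 90 runtime system; this illustrates the structure, not the generated code.

```fortran
module static_init_sketch
  implicit none

  type CAFDesc_real_1
    real, pointer :: ptr(:) => null()   ! pointer to the local co-array data
  end type CAFDesc_real_1

  ! Stand-ins for the descriptors of two SAVE co-arrays, u(100)[*] and v(50)[*].
  type(CAFDesc_real_1), save :: u_desc, v_desc

contains

  ! One initializer per common/save co-array (generated by the compiler).
  subroutine init_u()
    allocate(u_desc%ptr(100))
    u_desc%ptr = 0.0
  end subroutine init_u

  subroutine init_v()
    allocate(v_desc%ptr(50))
    v_desc%ptr = 0.0
  end subroutine init_v

  ! Global initializer (generated at link time) that calls all the others;
  ! it would be invoked once at launch, before the main program begins.
  subroutine caf_global_init()
    call init_u()
    call init_v()
  end subroutine caf_global_init

end module static_init_sketch
```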
11. Parameter Passing
- Call-by-value convention (copy-in, copy-out)
  - pass remote co-array data to procedures only as values

    call f((a(I)[p]))

- Call-by-co-array convention
  - argument declared as a co-array by callee
  - enables access to local and remote co-array data
  - requires an explicit interface

    subroutine f(a)
      real :: a(10)[*]

- Call-by-reference convention (cafc)
  - argument declared as an explicit-shape array
  - enables access to local co-array data only
  - enables reuse of existing Fortran code

    real :: x(10)[*]
    call f(x)

    subroutine f(a)
      real :: a(10)
12. Multiple Co-dimensions
- Managing processors as a logical multi-dimensional grid
  - integer a(10,10)[5,4,*]: 3D processor grid 5 x 4 x ... (last extent set by the number of images)
- Support co-space reshaping at procedure calls
  - change number of co-dimensions
  - co-space bounds as procedure arguments
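As an illustration of the logical processor grid, the sketch below maps an image index to co-subscripts and back for co-dimensions [n1,n2,*]. It assumes the column-major image ordering used by standard Fortran coarrays; the function names are hypothetical.

```fortran
module cogrid_sketch
  implicit none
contains

  ! Co-subscripts (1-based) of image img in a grid declared as [n1,n2,*].
  pure function image_to_cosub(img, n1, n2) result(cs)
    integer, intent(in) :: img, n1, n2
    integer :: cs(3)
    integer :: r
    r = img - 1
    cs(1) = mod(r, n1) + 1
    cs(2) = mod(r / n1, n2) + 1
    cs(3) = r / (n1 * n2) + 1
  end function image_to_cosub

  ! Inverse mapping: image index of co-subscripts cs in a grid [n1,n2,*].
  pure function cosub_to_image(cs, n1, n2) result(img)
    integer, intent(in) :: cs(3), n1, n2
    integer :: img
    img = 1 + (cs(1) - 1) + n1 * (cs(2) - 1) + n1 * n2 * (cs(3) - 1)
  end function cosub_to_image

end module cogrid_sketch
```

With the [5,4,*] declaration above and 40 images, image 6 sits at co-subscripts (1,2,1) and image 40 at (5,4,2).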
13. Implementing Communication
- x(1:n) = a(1:n)[p] + ...
- Use a temporary buffer to hold off-processor data
  - allocate buffer
  - perform GET to fill buffer
  - perform computation: x(1:n) = buffer(1:n) + ...
  - deallocate buffer
- Optimizations
  - no temporary storage for co-array to co-array copies
  - load/store communication on shared-memory systems
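The buffer-based translation can be mimicked in plain Fortran 90/95. In this sketch the data of all images is modeled as the columns of one array, and sim_get is a hypothetical stand-in for the one-sided GET that cafc issues through its communication layer; neither name comes from cafc or ARMCI.

```fortran
module comm_sketch
  implicit none
contains

  ! Hypothetical stand-in for a one-sided GET: copy the first n elements
  ! held by image p into buf. Column p of all_images models image p's data.
  subroutine sim_get(all_images, p, n, buf)
    real, intent(in) :: all_images(:,:)
    integer, intent(in) :: p, n
    real, intent(out) :: buf(n)
    buf = all_images(1:n, p)
  end subroutine sim_get

  ! Models the translation of  x(1:n) = a(1:n)[p] + y(1:n).
  subroutine remote_add(all_images, p, n, y, x)
    real, intent(in) :: all_images(:,:)
    integer, intent(in) :: p, n
    real, intent(in) :: y(n)
    real, intent(out) :: x(n)
    real, allocatable :: buffer(:)
    allocate(buffer(n))                       ! allocate buffer
    call sim_get(all_images, p, n, buffer)    ! perform GET to fill buffer
    x(1:n) = buffer(1:n) + y(1:n)             ! perform computation
    deallocate(buffer)                        ! deallocate buffer
  end subroutine remote_add

end module comm_sketch
```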
14. Synchronization
- Original CAF specification: team synchronization only
  - sync_all, sync_team
  - limits performance on loosely-coupled architectures
- Point-to-point extensions
  - sync_notify(q)
  - sync_wait(p)
- Point-to-point synchronization semantics
  - delivery of a notify to q from p implies that all communication from p to q issued before the notify has been delivered to q
15. Outline
- CAF programming model
- cafc
  - Core language implementation
  - Optimizations
    - procedure splitting
    - supporting hints for non-blocking communication
    - packing strided communications
- Experimental evaluation
- Conclusions
16. An Impediment to Code Efficiency
- Original reference
  - rhs(1,i,j,k,c) = ... + u(1,i-1,j,k,c) - ...
- Transformed reference
  - rhs%ptr(1,i,j,k,c) = ... + u%ptr(1,i-1,j,k,c) - ...
- Fortran 90 pointer-based co-array representation does not convey
  - the lack of co-array aliasing
  - co-array contiguity
  - co-array bounds
- Lack of knowledge inhibits important code optimizations
17. Procedure Splitting
- CAF to CAF preprocessing
- Before:

    subroutine f(...)
      real, save :: c(100)[*]
      ... = c(50) ...
    end subroutine f

- After:

    subroutine f(...)
      real, save :: c(100)[*]
      interface
        subroutine f_inner(..., c_arg)
          real :: c_arg[*]
        end subroutine f_inner
      end interface
      call f_inner(..., c)
    end subroutine f

    subroutine f_inner(..., c_arg)
      real :: c_arg(100)[*]
      ... = c_arg(50) ...
    end subroutine f_inner
18. Benefits of Procedure Splitting
- Generated code conveys
  - lack of co-array aliasing
  - co-array contiguity
  - co-array bounds
- Enables back-end compiler to generate better code
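A plain Fortran 90/95 analogue may help show why splitting pays off. It covers only the local part of a co-array, and the type and routine names are hypothetical. The outer routine sees the data through a descriptor pointer, while the inner routine receives it as an explicit-shape dummy argument, which a Fortran compiler may assume to be contiguous, to have the declared bounds, and not to alias other dummy arguments.

```fortran
module split_sketch
  implicit none

  type CAFDesc_real_1
    real, pointer :: ptr(:) => null()   ! pointer to the local co-array data
  end type CAFDesc_real_1

contains

  ! Outer routine: only unwraps the descriptor and forwards the data.
  subroutine scale_outer(desc, n, factor)
    type(CAFDesc_real_1), intent(inout) :: desc
    integer, intent(in) :: n
    real, intent(in) :: factor
    call scale_inner(desc%ptr, n, factor)
  end subroutine scale_outer

  ! Inner routine: all the real work happens on an explicit-shape dummy,
  ! so the loop is free of pointer dereferences.
  subroutine scale_inner(c_arg, n, factor)
    integer, intent(in) :: n
    real, intent(inout) :: c_arg(n)
    real, intent(in) :: factor
    integer :: i
    do i = 1, n
      c_arg(i) = factor * c_arg(i)
    end do
  end subroutine scale_inner

end module split_sketch
```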
19. Hiding Communication Latency
- Goal: enable communication/computation overlap
- Impediments to generating non-blocking communication
  - use of indexed subscripts in co-dimensions
  - lack of whole-program analysis
- Approach: support hints for non-blocking communication
  - overcome conservative compiler analysis
  - enable sophisticated programmers to achieve good performance today
20. Hints for Non-blocking PUTs
- Hints for CAF run-time system to issue non-blocking PUTs

    region_id = open_nb_put_region()
    ...
    Put_Stmt_1
    ...
    Put_Stmt_N
    ...
    call close_nb_put_region(region_id)

- Complete non-blocking PUTs

    call complete_nb_put_region(region_id)

- Open problem: exploiting non-blocking GETs?
21. Strided vs. Contiguous Transfers
- Problem
  - CAF remote reference might induce many small data transfers
  - a(i,1:n)[p] = b(j,1:n)
- Solution
  - pack strided data on source and unpack it on destination
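The pack/unpack idea can be sketched in plain Fortran 90/95. Because Fortran arrays are column-major, a row such as b(j,1:n) is strided in memory; packing gathers it into one contiguous buffer that could travel as a single message. The routine names are hypothetical, and in cafc this work is delegated to the communication library rather than written by hand.

```fortran
module pack_sketch
  implicit none
contains

  ! Source side: gather strided row j of b into a contiguous buffer.
  function pack_row(b, j) result(buf)
    real, intent(in) :: b(:,:)
    integer, intent(in) :: j
    real :: buf(size(b, 2))
    buf = b(j, :)
  end function pack_row

  ! Destination side: scatter the contiguous buffer into strided row i of a.
  subroutine unpack_row(buf, a, i)
    real, intent(in) :: buf(:)
    real, intent(inout) :: a(:,:)
    integer, intent(in) :: i
    a(i, 1:size(buf)) = buf
  end subroutine unpack_row

end module pack_sketch
```

One transfer of n elements replaces n transfers of one element each, at the cost of a copy on both sides.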
22. Pragmatics of Packing
- Who should implement packing?
  - The CAF programmer
    - difficult to program
  - The CAF compiler
    - unpacking requires conversion of PUTs into two-sided communication (a difficult whole-program transformation)
  - The communication library
    - most natural place
    - ARMCI currently performs packing on Myrinet
23. CAF Compiler Targets (Sept 2004)
- Processors
  - Pentium, Alpha, Itanium2, MIPS
- Interconnects
  - Quadrics, Myrinet, Gigabit Ethernet, shared memory
- Operating systems
  - Linux, Tru64, IRIX
24. Outline
- CAF programming model
- cafc
  - Core language implementation
  - Optimizations
- Experimental evaluation
- Conclusions
25. Experimental Evaluation
- Platforms
  - Alpha+Quadrics QSNet (Elan3)
  - Itanium2+Quadrics QSNet II (Elan4)
  - Itanium2+Myrinet 2000
- Codes
  - NAS Parallel Benchmarks (NPB 2.3) from NASA Ames
26. NAS BT Efficiency (Class C)
27. NAS SP Efficiency (Class C)
- Lack of a non-blocking notify implementation blocks CAF communication/computation overlap
28. NAS MG Efficiency (Class C)
- ARMCI communication is efficient
- Point-to-point synchronization boosts CAF performance by 30%
29. NAS CG Efficiency (Class C)
30. NAS LU Efficiency (Class C)
31. Impact of Optimizations
- Assorted results
  - Procedure splitting
    - 42-60% improvement for BT on Itanium2+Myrinet cluster
    - 15-33% improvement for LU on Alpha+Quadrics
  - Non-blocking communication generation
    - 5% improvement for BT on Itanium2+Quadrics cluster
    - 3% improvement for MG on all platforms
  - Packing of strided data
    - 31% improvement for BT on Alpha+Quadrics cluster
    - 37% improvement for LU on Itanium2+Quadrics cluster
- See paper for more details
32. Conclusions
- CAF boosts programming productivity
  - simplifies the development of SPMD parallel programs
  - shifts details of managing communication to compiler
- cafc delivers performance comparable to hand-tuned MPI
- cafc implements effective optimizations
  - procedure splitting
  - non-blocking communication
  - packing of strided communication (in ARMCI)
- Vectorization needed to achieve true performance portability with machines like Cray X1
- http://www.hipersoft.rice.edu/caf