A Multi-platform Co-Array Fortran Compiler

Slides: 33

Provided by: hiperso

Learn more at: http://www.hipersoft.rice.edu

1
A Multi-platform Co-Array Fortran Compiler

Yuri Dotsenko, Cristian Coarfa, John Mellor-Crummey
Department of Computer Science
Rice University
Houston, TX USA

2
Motivation

Parallel Programming Models
MPI: de facto standard
difficult to program
OpenMP: inefficient to map on distributed memory platforms
lack of locality control
HPF: hard to obtain high performance
heroic compilers needed!

Global address space languages (CAF, Titanium, UPC): an appealing middle ground
3
Co-Array Fortran

Global address space programming model
one-sided communication (GET/PUT)
Programmer has control over performance-critical factors
data distribution
computation partitioning
communication placement
Data movement and synchronization as language primitives
amenable to compiler-based communication optimization

4
CAF Programming Model Features

SPMD process images
fixed number of images during execution
images operate asynchronously
Both private and shared data
real :: x(20,20)      ! a private 20x20 array in each image
real :: y(20,20)[*]   ! a shared 20x20 array in each image
Simple one-sided shared-memory communication
x(:,j:j+2) = y(:,p:p+2)[r]   ! copy columns from image r into local columns
Synchronization intrinsic functions
sync_all: a barrier and a memory fence
sync_mem: a memory fence
sync_team(team members to notify, team members to wait for)
Pointers and (perhaps asymmetric) dynamic allocation

5
One-sided Communication with Co-Arrays
[Figure: one-sided communication between co-arrays on image 1, image 2, ..., image N]
6
Rice Co-Array Fortran Compiler (cafc)

First CAF multi-platform compiler
previous compiler only for Cray shared memory systems
Implements core of the language
currently lacks support for derived type and dynamic co-arrays
Core sufficient for non-trivial codes
Performance comparable to that of hand-tuned MPI codes
Open source

7
Outline

CAF programming model
cafc
Core language implementation
Optimizations
Experimental evaluation
Conclusions

8
Implementation Strategy

Source-to-source compilation of CAF codes
uses Open64/SL Fortran 90 infrastructure
CAF → Fortran 90 + communication operations
Communication
ARMCI library for one-sided communication on clusters
load/store communication on shared-memory platforms

Goals
portability
high performance on a wide range of platforms

9
Co-Array Descriptors

Initialize and manipulate Fortran 90 dope vectors

real :: a(10,10,10)[*]

type CAFDesc_real_3
  integer(ptrkind) :: handle    ! opaque handle to CAF runtime representation
  real, pointer :: ptr(:,:,:)   ! Fortran 90 pointer to local co-array data
end type CAFDesc_real_3

type(CAFDesc_real_3) :: a
10
Allocating COMMON and SAVE Co-Arrays

Compiler
generates static initializer for each common/save variable
Linker
collects calls to all initializers
generates global initializer that calls all others
compiles global initializer and links into program
Launch
invokes global initializer before main program begins
allocates co-array storage outside Fortran 90 runtime system
associates co-array descriptors with allocated memory

Similar to handling for C++ static constructors
11
Parameter Passing

Call-by-value convention (copy-in, copy-out)
  call f((a(I)[p]))
pass remote co-array data to procedures only as values
Call-by-co-array convention
  real :: x(10)[*]
  call f(x)
  subroutine f(a)
  real :: a(10)[*]
argument declared as a co-array by callee
enables access to local and remote co-array data
Call-by-reference convention (cafc)
  real :: x(10)[*]
  call f(x)
  subroutine f(a)
  real :: a(10)
argument declared as an explicit shape array
enables access to local co-array data only
enables reuse of existing Fortran code

requires an explicit interface
12
Multiple Co-dimensions

Managing processors as a logical multi-dimensional grid
integer :: a(10,10)[5,4,*]   ! 3D processor grid: 5 x 4 x ...
Support co-space reshaping at procedure calls
change number of co-dimensions
co-space bounds as procedure arguments
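
As a worked illustration of a multi-dimensional co-space, the sketch below maps an image number to co-subscripts for a `[5,4,*]` declaration and back. It assumes lower co-bounds of 1 and column-major image numbering (first co-subscript varies fastest); the function names are hypothetical and not part of cafc.

```python
def image_to_cosubscripts(image, extents):
    """Map a 1-based image number to co-subscripts.

    `extents` holds every co-extent except the last (e.g. (5, 4) for [5,4,*]).
    Assumes lower co-bounds of 1 and column-major image ordering.
    """
    rem = image - 1
    subs = []
    for extent in extents:
        subs.append(rem % extent + 1)
        rem //= extent
    subs.append(rem + 1)        # last co-dimension absorbs what is left
    return tuple(subs)


def cosubscripts_to_image(subs, extents):
    """Inverse mapping: co-subscripts back to a 1-based image number."""
    index = 0
    stride = 1
    for sub, extent in zip(subs, extents):
        index += (sub - 1) * stride
        stride *= extent
    index += (subs[-1] - 1) * stride
    return index + 1


grid = (5, 4)                                  # from a(10,10)[5,4,*]
first = image_to_cosubscripts(1, grid)         # (1, 1, 1)
sixth = image_to_cosubscripts(6, grid)         # (1, 2, 1)
last_of_40 = image_to_cosubscripts(40, grid)   # (5, 4, 2)
```

Reshaping the co-space at a procedure call amounts to applying the same mapping with a different `extents` tuple to the same image numbers.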

13
Implementing Communication

x(1:n) = a(1:n)[p] + ...
Use a temporary buffer to hold off-processor data
allocate buffer
perform GET to fill buffer
perform computation: x(1:n) = buffer(1:n) + ...
deallocate buffer
Optimizations
no temporary storage for co-array to co-array copies
load/store communication on shared-memory systems
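
The buffer-based translation of a remote read can be sketched as a toy Python model. Images are plain dictionaries, `get` stands in for a one-sided GET, and `local_term` is a hypothetical placeholder for the part of the right-hand side the slide elides; none of these names come from cafc or ARMCI.

```python
def get(images, p, name, lo, hi):
    """Model of a one-sided GET: copy name(lo:hi) from image p (1-based, inclusive)."""
    return list(images[p][name][lo - 1:hi])


def assign_from_remote(images, me, p, n, local_term):
    """Model of x(1:n) = a(1:n)[p] + local_term(1:n), executed on image `me`."""
    buffer = get(images, p, "a", 1, n)                 # allocate buffer, GET fills it
    x = [buffer[i] + local_term[i] for i in range(n)]  # compute from the buffer
    del buffer                                         # deallocate buffer
    images[me]["x"] = x
    return x


images = {
    1: {"a": [1.0, 2.0, 3.0]},
    2: {"a": [10.0, 20.0, 30.0]},
}
x = assign_from_remote(images, me=1, p=2, n=3, local_term=[0.5, 0.5, 0.5])
```

For a pure co-array to co-array copy the buffer is unnecessary, which is the first optimization listed above.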

14
Synchronization

Original CAF specification: team synchronization only
sync_all, sync_team
Limits performance on loosely-coupled architectures
Point-to-point extensions
sync_notify(q)
sync_wait(p)

Point-to-point synchronization semantics
Delivery of a notify to q from p ⇒ all communication from p to q issued before the notify has been delivered to q
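
The ordering guarantee above can be illustrated with a single-threaded toy model in Python. It assumes one FIFO channel per (source, destination) pair, which is one simple way to obtain the guarantee and not necessarily how cafc or ARMCI implement it. The `Machine` class is hypothetical, and images are numbered from 0 here for brevity.

```python
from collections import deque


class Machine:
    """Toy model: one FIFO channel per (source, destination) pair."""

    def __init__(self, n_images):
        self.memory = [dict() for _ in range(n_images)]
        # pending[q][p] counts notifies from p delivered to q, not yet consumed
        self.pending = [[0] * n_images for _ in range(n_images)]
        self.channel = {(p, q): deque()
                        for p in range(n_images) for q in range(n_images)}

    def put(self, p, q, key, value):
        """Image p issues a PUT to image q; it is in flight until delivered."""
        self.channel[(p, q)].append(("put", key, value))

    def sync_notify(self, p, q):
        """Image p notifies q; the notify queues behind p's earlier PUTs to q."""
        self.channel[(p, q)].append(("notify", None, None))

    def sync_wait(self, q, p):
        """Image q waits for one notify from p, delivering messages in order."""
        chan = self.channel[(p, q)]
        while self.pending[q][p] == 0:
            if not chan:
                raise RuntimeError("no notify in flight: q would block forever")
            kind, key, value = chan.popleft()
            if kind == "put":
                self.memory[q][key] = value
            else:
                self.pending[q][p] += 1
        self.pending[q][p] -= 1


m = Machine(2)
m.put(0, 1, "halo", [1, 2, 3])   # issued before the notify
m.sync_notify(0, 1)
m.put(0, 1, "late", 99)          # issued after the notify
m.sync_wait(1, 0)
halo_visible = "halo" in m.memory[1]
late_visible = "late" in m.memory[1]
```

In this model the PUT issued after the notify is still in flight when the wait returns; the semantics above promise nothing about it either way.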

15
Outline

CAF programming model
cafc
Core language implementation
Optimizations
procedure splitting
supporting hints for non-blocking communication
packing strided communications
Experimental evaluation
Conclusions

16
An Impediment to Code Efficiency

Original reference
rhs(1,i,j,k,c) = ... + u(1,i-1,j,k,c) - ...
Transformed reference
rhs%ptr(1,i,j,k,c) = ... + u%ptr(1,i-1,j,k,c) - ...
Fortran 90 pointer-based co-array representation does not convey
the lack of co-array aliasing
co-array contiguity
co-array bounds
Lack of knowledge inhibits important code optimizations

17
Procedure Splitting
CAF to CAF preprocessing
Original
subroutine f(...)
  real, save :: c(100)[*]
  ... c(50) ...
end subroutine f
Transformed
subroutine f(...)
  real, save :: c(100)[*]
  interface
    subroutine f_inner(..., c_arg)
      real :: c_arg[*]
    end subroutine f_inner
  end interface
  call f_inner(..., c)
end subroutine f
subroutine f_inner(..., c_arg)
  real :: c_arg(100)[*]
  ... c_arg(50) ...
end subroutine f_inner
18
Benefits of Procedure Splitting

Generated code conveys
lack of co-array aliasing
co-array contiguity
co-array bounds
Enables back-end compiler to generate better code

19
Hiding Communication Latency

Goal: enable communication/computation overlap
Impediments to generating non-blocking communication
use of indexed subscripts in co-dimensions
lack of whole program analysis
Approach: support hints for non-blocking communication
overcome conservative compiler analysis
enable sophisticated programmers to achieve good performance today

20
Hints for Non-blocking PUTs

Hints for CAF run-time system to issue non-blocking PUTs
region_id = open_nb_put_region()
...
Put_Stmt_1
...
Put_Stmt_N
...
call close_nb_put_region(region_id)
Complete non-blocking PUTs
call complete_nb_put_region(region_id)
Open problem: exploiting non-blocking GETs?
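
The region hints can be modeled in a few lines of Python. The three method names follow the slide; everything else (the `PutRuntime` class, the `put` method, the one-open-region-at-a-time rule) is an assumption made for this sketch. The model only delays visibility of PUTs; a real runtime would overlap them with computation.

```python
class PutRuntime:
    """Toy model of non-blocking PUT regions; not the cafc runtime."""

    def __init__(self):
        self.remote = {}        # models remote memory: key -> value
        self._current = None    # id of the open region, if any
        self._in_flight = {}    # region id -> PUTs issued but not completed
        self._next_id = 0

    def open_nb_put_region(self):
        if self._current is not None:
            raise RuntimeError("this model allows one open region at a time")
        rid = self._next_id
        self._next_id += 1
        self._current = rid
        self._in_flight[rid] = []
        return rid

    def put(self, key, value):
        if self._current is None:
            self.remote[key] = value    # blocking PUT: done on return
        else:
            self._in_flight[self._current].append((key, value))

    def close_nb_put_region(self, rid):
        if rid != self._current:
            raise RuntimeError("region is not open")
        self._current = None

    def complete_nb_put_region(self, rid):
        # Wait for every PUT issued inside the region.
        for key, value in self._in_flight.pop(rid):
            self.remote[key] = value


rt = PutRuntime()
region_id = rt.open_nb_put_region()
rt.put("a", 1)                  # Put_Stmt_1
rt.put("b", 2)                  # Put_Stmt_N
rt.close_nb_put_region(region_id)
rt.put("c", 3)                  # outside the region: blocking
before = dict(rt.remote)
rt.complete_nb_put_region(region_id)
after = dict(rt.remote)
```

Any computation placed between the close and complete calls is what would overlap with the in-flight PUTs.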

21
Strided vs. Contiguous Transfers

Problem
CAF remote reference might induce many small data transfers
a(i,1:n)[p] = b(j,1:n)
Solution
pack strided data on source and unpack it on destination
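
A minimal Python sketch of the idea, assuming column-major storage with leading dimension `ld`, so that a row section such as `b(j,1:n)` is strided in memory. The helper names are hypothetical, and the message counts are a simple model, not measurements.

```python
def row_offsets(i, n, ld):
    """0-based offsets of a(i, 1:n) in a column-major array (1-based row i)."""
    return [(i - 1) + k * ld for k in range(n)]


def put_elementwise(src, src_offs, dst, dst_offs):
    """Unpacked transfer: one small message per element."""
    messages = 0
    for s, d in zip(src_offs, dst_offs):
        dst[d] = src[s]
        messages += 1
    return messages


def put_packed(src, src_offs, dst, dst_offs):
    """Packed transfer: gather at source, one message, scatter at destination."""
    buffer = [src[s] for s in src_offs]       # pack on source
    messages = 1                              # one contiguous transfer
    for d, value in zip(dst_offs, buffer):    # unpack on destination
        dst[d] = value
    return messages


ld, n = 4, 3
b = [float(v) for v in range(ld * n)]         # local 4x3 array, column-major
a_strided = [0.0] * (ld * n)                  # remote copy, element-wise PUTs
a_packed = [0.0] * (ld * n)                   # remote copy, packed PUT

src = row_offsets(3, n, ld)                   # b(3,1:3) -> offsets [2, 6, 10]
dst = row_offsets(2, n, ld)                   # a(2,1:3) -> offsets [1, 5, 9]
msgs_strided = put_elementwise(b, src, a_strided, dst)
msgs_packed = put_packed(b, src, a_packed, dst)
```

Both variants leave the destination in the same state; only the number of transfers differs (n versus 1 in this model).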

22
Pragmatics of Packing

Who should implement packing?
The CAF programmer
difficult to program
The CAF compiler
unpacking requires conversion of PUTs into two-sided communication (a difficult whole-program transformation)
The communication library
most natural place
ARMCI currently performs packing on Myrinet

23
CAF Compiler Targets (Sept 2004)

Processors
Pentium, Alpha, Itanium2, MIPS
Interconnects
Quadrics, Myrinet, Gigabit Ethernet, shared memory
Operating systems
Linux, Tru64, IRIX

24
Outline

CAF programming model
cafc
Core language implementation
Optimizations
Experimental evaluation
Conclusions

25
Experimental Evaluation

Platforms
Alpha + Quadrics QSNet (Elan3)
Itanium2 + Quadrics QSNet II (Elan4)
Itanium2 + Myrinet 2000
Codes
NAS Parallel Benchmarks (NPB 2.3) from NASA Ames

26
NAS BT Efficiency (Class C)
27
NAS SP Efficiency (Class C)
lack of non-blocking notify implementation blocks CAF comm/comp overlap
28
NAS MG Efficiency (Class C)

ARMCI comm is efficient
pt-2-pt synch in CAF boosts performance 30%

29
NAS CG Efficiency (Class C)
30
NAS LU Efficiency (Class C)
31
Impact of Optimizations

Assorted Results
Procedure splitting
42-60% improvement for BT on Itanium2+Myrinet cluster
15-33% improvement for LU on Alpha+Quadrics
Non-blocking communication generation
5% improvement for BT on Itanium2+Quadrics cluster
3% improvement for MG on all platforms
Packing of strided data
31% improvement for BT on Alpha+Quadrics cluster
37% improvement for LU on Itanium2+Quadrics cluster

See paper for more details
32
Conclusions

CAF boosts programming productivity
simplifies the development of SPMD parallel programs
shifts details of managing communication to compiler
cafc delivers performance comparable to hand-tuned MPI
cafc implements effective optimizations
procedure splitting
non-blocking communication
packing of strided communication (in ARMCI)
Vectorization needed to achieve true performance portability with machines like Cray X1

http://www.hipersoft.rice.edu/caf
