Title: Co-array Fortran: Compilation, Performance, Language Issues
1. Co-array Fortran: Compilation, Performance, Language Issues
- John Mellor-Crummey
- Cristian Coarfa, Yuri Dotsenko
- Department of Computer Science
- Rice University
2. Outline
- Co-array Fortran language recap
- Compilation approach
- Co-array storage management
- Communication
- A preliminary performance study
- Platforms
- Benchmarks, results, and lessons
- Language refinement issues
- Conclusions
3. CAF Language Assessment
- Strengths
- offloads communication management to the compiler
- choreographing data transfer
- managing mechanics of synchronization
- gives user full control of parallelization
- data movement and synchronization as language primitives: amenable to compiler optimization
- array syntax supports natural user-level vectorization
- modest compiler technology can yield good performance
- more abstract than MPI → better performance portability
- Weaknesses
- user manages partitioning of work
- user specifies data movement
- user codes necessary synchronization
4. Compiler Goals
- Portable compiler
- Multi-platform code generation
- High performance generated code
5. Compilation Approach
- Source-to-source translation
- translate CAF into Fortran 90 plus communication calls
- One-sided communication layer
- strided communication
- gather/scatter
- synchronization: barriers, notify/wait
- split-phase non-blocking primitives
- today: ARMCI remote memory copy interface (Nieplocha @ PNL)
- Benefits
- wide portability
- leverage vendor F90 compilers for good node performance
6. Co-array Data
- Co-array representation
- F90 pointer to data + opaque handle for the communication layer (see the sketch below)
- Co-array access
- read/write local co-array data using F90 pointer dereference
- remote accesses translate into ARMCI GET/PUT calls
- Co-array allocation
- storage allocation by the communication layer, as appropriate
- on shared-memory hardware: in a shared memory segment
- on Myrinet: in pinned memory for direct DMA access
- dope vector initialization using CHASM (Rasmussen @ LANL)
- set F90 pointer to point to externally managed memory
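A minimal sketch of what the per-co-array representation might look like, assuming one descriptor per co-array; the type and field names (caf_descriptor, data, comm_handle) are illustrative stand-ins, not the compiler's actual data structures.

    ! Illustrative only: local accesses go through the F90 pointer,
    ! remote accesses go through the opaque communication-layer handle.
    module caf_runtime_types
      implicit none
      type caf_descriptor
        real(8), pointer :: data(:,:)      ! F90 pointer used for local reads/writes
        integer(8)       :: comm_handle    ! opaque handle for the communication layer
        integer          :: co_lbound(7), co_ubound(7)  ! bounds of each co-dimension
      end type caf_descriptor
    end module caf_runtime_types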
7. Allocating Static Co-arrays (COMMON/SAVE)
- Compiler
- generate a static initializer for each COMMON/SAVE variable
- Linker
- collect calls to all initializers
- generate a global initializer that calls all others
- compile global initializer and link into program
- Launch
- call global initializer before the main program begins
Similar to the handling of C static constructors (sketched below)
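A hedged sketch of the initializer scheme described above; the module and subroutine names (caf_static_init, caf_init_common_field, caf_global_init) are hypothetical, and a plain ALLOCATE stands in for the communication layer's storage request.

    ! Illustrative only: names and the allocation call are assumptions.
    module caf_static_init
      implicit none
      real(8), pointer, save :: field(:,:)      ! stands in for a COMMON co-array
    contains
      subroutine caf_init_common_field          ! generated per COMMON/SAVE co-array
        ! in the real system the communication layer supplies the storage
        ! (shared segment or pinned memory); plain ALLOCATE stands in here
        allocate(field(100, 100))
      end subroutine caf_init_common_field

      subroutine caf_global_init                ! generated at link time
        call caf_init_common_field              ! one call per static co-array
      end subroutine caf_global_init
    end module caf_static_init

The launch code would call caf_global_init once before the main program begins, analogous to how C static constructors run before main.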
8. COMMON Block Sequence Association
- Problem
- each procedure may have a different view of a COMMON block
- Solution
- allocate a contiguous pool of co-array storage per COMMON block
- each procedure has a private set of view variables (F90 pointers; see the sketch below)
- initialize all per-procedure view variables only once, at launch, after the COMMON allocation
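A minimal sketch of the per-procedure view idea, assuming a single pool array per COMMON block; the names (pool, a_view, b_view, c_view) and the example layouts are illustrative.

    ! Illustrative only: two procedures sequence-associate the same COMMON
    ! block differently, so each gets its own F90 pointer views of one pool.
    module caf_common_views
      implicit none
      real(8), pointer, save :: pool(:)                  ! contiguous pool for COMMON /blk/
      real(8), pointer, save :: a_view(:)                ! procedure A: COMMON /blk/ a(100)
      real(8), pointer, save :: b_view(:), c_view(:)     ! procedure B: COMMON /blk/ b(40), c(60)
    contains
      subroutine caf_init_views                          ! run once at launch
        allocate(pool(100))
        a_view => pool(1:100)
        b_view => pool(1:40)
        c_view => pool(41:100)
      end subroutine caf_init_views
    end module caf_common_views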
9. Porting to a New Compiler / Architecture
- Synthesize dope vectors for co-array storage
- compiler/architecture-specific details: CHASM library
- Tailor communication to the architecture
- design supports alternate communication libraries
- status
- today: ARMCI (PNL)
- ongoing work: compiler-tailored communication
- direct load/store on shared-memory architectures
- future
- other portable libraries (e.g. GASNet)
- custom communication library for an architecture
10. Supporting Multiple Co-dimensions
- A(:,:)[N,M,*]
- Add precomputed coefficients to co-array meta-data (index computation sketched below)
- lower, upper bounds for each co-dimension
- this_image_cache for each co-dimension
- e.g., this_image(a,1) yields my co-row index
- cum_hyperplane_size for each co-dimension
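A hedged sketch of how co-subscripts might be flattened to a linear image number using the precomputed bounds and cumulative hyperplane sizes; the routine and argument names are assumptions, not the compiler's actual meta-data layout.

    ! Illustrative only: map co-subscripts c(1..k) of A[N,M,*] to a 1-based
    ! image number, where cum_size(d) is the product of the co-extents of
    ! dimensions 1..d-1 (cum_size(1) = 1).
    integer function image_from_cosubscripts(c, co_lbound, cum_size, ncodim)
      implicit none
      integer, intent(in) :: ncodim
      integer, intent(in) :: c(ncodim), co_lbound(ncodim), cum_size(ncodim)
      integer :: d
      image_from_cosubscripts = 1
      do d = 1, ncodim
        image_from_cosubscripts = image_from_cosubscripts + &
            (c(d) - co_lbound(d)) * cum_size(d)
      end do
    end function image_from_cosubscripts

For A(:,:)[N,M,*] with default lower bounds, cum_size = (/ 1, N, N*M /), so image p corresponds to co-subscripts satisfying p = 1 + (c1-1) + (c2-1)*N + (c3-1)*N*M.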
11. Implementing Communication
- Given a statement
- X(1:n) = A(1:n)[p]
- A temporary buffer is used for off-processor data (translation sketched below)
- invoke the communication library to allocate tmp in suitable temporary storage
- dope vector filled in so tmp can be accessed as an F90 pointer
- call the communication library to fill in tmp (ARMCI GET)
- X(1:n) = tmp(1:n)
- deallocate tmp
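A hedged Fortran-flavored sketch of the generated code for X(1:n) = A(1:n)[p]; the runtime entry point caf_get is a hypothetical wrapper around the communication layer (e.g. an ARMCI GET), not an actual routine of the translator.

    ! Illustrative translation only; caf_get is a hypothetical runtime call.
    subroutine translated_copy(x, a_handle, p, n)
      implicit none
      integer, intent(in)    :: p, n          ! source image and element count
      real(8), intent(inout) :: x(n)
      integer(8), intent(in) :: a_handle      ! opaque handle for co-array A
      real(8), allocatable   :: tmp(:)        ! temporary buffer for off-image data
      interface                               ! hypothetical runtime entry point
        subroutine caf_get(handle, image, first, count, dest)
          integer(8), intent(in) :: handle
          integer, intent(in)    :: image, first, count
          real(8), intent(out)   :: dest(count)
        end subroutine caf_get
      end interface

      allocate(tmp(n))                        ! in practice: registered/pinned storage
      call caf_get(a_handle, p, 1, n, tmp)    ! one-sided GET of A(1:n) from image p
      x(1:n) = tmp(1:n)                       ! local assignment
      deallocate(tmp)
    end subroutine translated_copy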
12. CAF Compiler Status
- Near production-quality F90 front end from Open64
- being enhanced to meet the needs of this project and others
- Working prototype for CAF core features
- Co-array communication
- inserted around statements with co-array accesses
- currently no optimization
13. Supported Features
- Declarations
- co-objects: scalars and arrays
- COMMON and SAVE co-objects of primitive types
- INTEGER(4), REAL(4) and REAL(8)
- COMMON blocks: variables and co-objects intermixed
- co-objects with multiple co-dimensions
- procedure interface blocks with co-array arguments
- Executable code
- array section notation for co-array data indices
- local and remote co-arrays
- co-array argument passing
- co-array dummy arguments require an explicit interface
- co-array pointer + communication handle
- co-array reshaping supported
- CAF intrinsics
- image inquiry: this_image(), num_images()
- synchronization: sync_all, sync_team, sync_notify, sync_wait
14. Coming Attractions
- Allocatable co-arrays
- REAL(8), ALLOCATABLE :: X(:)[*]
- ALLOCATE(X(MYX_NUM)[*])
- Co-arrays of user-defined types (see the sketch below)
- allocatable co-array components
- user-defined type with pointer components
- Triplets in co-dimensions
- A(j,k)[p1:p4]
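A small illustrative example of the planned feature: a co-array of a user-defined type whose pointer component lets each image hold a different-sized local part. The syntax follows the deck's CAF dialect but is an assumption about the eventual implementation, not its definition.

    ! Illustrative only: user-defined type with a pointer component,
    ! used as a co-array so each image can allocate a different size.
    program udt_coarray_sketch
      implicit none
      type field
        real(8), pointer :: u(:)      ! per-image data, sizes may differ
      end type field
      type(field), save :: f[*]       ! one descriptor per image
      integer :: me

      me = this_image()
      allocate(f%u(100 * me))         ! asymmetric allocation: size depends on image
      f%u = 0.0d0
      call sync_all()
      ! a neighbor could now read f[me]%u through the co-array descriptor
    end program udt_coarray_sketch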
15. CAF Compiler Targets (May 2004)
- Pentium + Ethernet workstations, Linux32 RedHat 7.1
- Itanium2 + Myrinet, Linux64 RedHat 7.1
- Itanium2 + Quadrics, Linux64 RedHat 7.1
- SGI Altix 3000 / Itanium2, Linux64 RedHat 7.2
- AlphaServer SC + Quadrics, OSF1 Tru64 V5.1A
- SGI Origin 2000 / MIPS, IRIX64 6.5
16. A Preliminary Performance Study
- Platforms
- Alpha + Quadrics QSNet (Elan3)
- Itanium2 + Quadrics QSNet II (Elan4)
- Itanium2 + Myrinet 2000
- Codes
- NAS Parallel Benchmarks (NPB) from NASA Ames
17. Alpha + Quadrics Platform (Lemieux)
- Nodes: 750 Compaq AlphaServer ES45 4-way nodes
- 1-GHz Alpha EV68 (21264C), 64KB/8MB L1/L2 cache
- 4 GB RAM/node
- Interconnect: Quadrics QSNet (Elan3)
- 340 MB/s peak and 210 MB/s sustained, x2 rails
- Operating system: Tru64 UNIX 5.1A + SC 2.5
- Compiler: HP Fortran Compiler V5.5A
- Communication middleware: ARMCI 1.1-beta
18. Itanium2 + Quadrics Platform (PNNL)
- Nodes: 944 HP "Longs Peak" dual-CPU workstations
- 1.5 GHz Itanium2, 32KB/256KB/6MB L1/L2/L3 cache
- 6 GB RAM/node
- Interconnect: Quadrics QSNet II
- 905 MB/s
- Operating system: Red Hat Linux, kernel 2.4.20
- Compiler: Intel Fortran Compiler v7.1
- Communication middleware: ARMCI 1.1-beta
19. Itanium2 + Myrinet Platform (Rice)
- Nodes: 96 HP zx6000 dual-CPU workstations
- 900 MHz Itanium2, 32KB/256KB/1.5MB L1/L2/L3 cache
- 4 GB RAM/node
- Interconnect: Myrinet 2000
- 240 MB/s
- GM version 1.6.5
- MPICH-GM version 1.2.5
- Operating system: Red Hat Linux, kernel 2.4.18 with patches
- Compiler: Intel Fortran Compiler v7.1
- Communication middleware: ARMCI 1.1-beta
20. NAS Parallel Benchmarks (NPB) 2.3
- Benchmarks by NASA Ames
- 2-3K lines each (Fortran 77)
- widely used to test parallel compiler performance
- NAS versions
- NPB2.3b2: hand-coded MPI
- NPB2.3-serial: serial code extracted from the MPI version
- Our version
- NPB2.3-CAF: CAF implementation, based on the MPI version
21. NAS BT
- Block tridiagonal solve of 3D Navier-Stokes
- dense matrix
- Parallelization
- alternating line sweeps along 3 dimensions
- multipartitioning data distribution for full parallelism
- MPI implementation
- asynchronous send/receive
- communication/computation overlap
- CAF communication
- strided blocks transferred using vector PUTs (triplet notation; see the sketch below)
- no user-declared communication buffers
- large messages, relatively infrequent communication
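A hedged illustration of the kind of vector PUT used in the CAF BT code; the array name, extents, and strides are invented for the sketch, not taken from the benchmark.

    ! Illustrative only; names and extents are assumptions, not from NAS BT.
    subroutine push_plane(lhs, k, next_image)
      implicit none
      real(8), intent(inout) :: lhs(5, 64, 64, 64)[*]   ! hypothetical solver state
      integer, intent(in)    :: k, next_image
      ! Triplet notation on both sides lets the translator issue one strided
      ! vector PUT to the neighbor instead of many small transfers.
      lhs(1:5, 1:64, 1:64:2, k)[next_image] = lhs(1:5, 1:64, 1:64:2, k)
    end subroutine push_plane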
22. NAS BT Efficiency (Class C)
- Lesson
- tradeoff: buffers vs. synchronization
- more buffers → less synchronization
- less synchronization → improved performance
23. NAS SP
- Scalar pentadiagonal solve of 3D Navier-Stokes
- dense matrix
- Parallelization
- alternating line sweeps along 3 dimensions
- multipartitioning data distribution for full parallelism
- MPI implementation
- asynchronous send/receive
- communication/computation overlap
- CAF communication
- pack into a buffer; separate buffer for each plane of the sweep
- transfer using PUTs
- smaller, more frequent messages: 1.5x the communication of BT
24. NAS SP Efficiency (Class C)
- Lesson
- inability to overlap communication with computation in procedure calls hurts performance
25. NAS MG
- 3D multigrid solver with periodic boundary conditions
- dense matrix
- grid size and levels are compile-time constants
- Communication
- nearest neighbor, with possibly 6 neighbors
- MPI: asynchronous send/receive
- CAF
- pairwise sync_notify/sync_wait to coordinate with neighbors
- four communication buffers (co-arrays) used: 2 sender, 2 receiver
- pack and transfer contiguous data using PUTs
- for each dimension (exchange sketched below)
- notify my neighbors that my buffers are free
- wait for my neighbors to notify me that their buffers are free
- PUT data into right buffer, notify neighbor
- PUT data into left buffer, notify neighbor
- wait for both to complete
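A hedged sketch of one dimension of the exchange described above; the buffer names and packing are illustrative, and sync_notify/sync_wait follow the deck's CAF dialect rather than any particular implementation.

    ! Illustrative only: one dimension of the MG boundary exchange.
    subroutine exchange_dim(send_l, send_r, recv_l, recv_r, left, right)
      implicit none
      real(8), intent(in)    :: send_l(:), send_r(:)          ! packed local faces
      real(8), intent(inout) :: recv_l(:)[*], recv_r(:)[*]    ! co-array receive buffers
      integer, intent(in)    :: left, right                   ! neighbor image numbers

      ! tell both neighbors that my receive buffers may be overwritten
      call sync_notify(left)
      call sync_notify(right)
      ! wait until both neighbors say their receive buffers are free
      call sync_wait(left)
      call sync_wait(right)

      ! one-sided PUTs of packed boundary data, each followed by a notify
      recv_l(:)[right] = send_r(:)          ! my right face fills right's left buffer
      call sync_notify(right)
      recv_r(:)[left]  = send_l(:)          ! my left face fills left's right buffer
      call sync_notify(left)

      ! wait for the neighbors' PUTs into my buffers to complete
      call sync_wait(left)
      call sync_wait(right)
    end subroutine exchange_dim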
26. NAS MG Efficiency (Class C)
- Lessons
- replacing barriers with point-to-point synchronization can boost performance by 30%
- converting GETs into PUTs also improved performance
27. NAS LU
- Solve 3D Navier-Stokes using SSOR
- dense matrix
- Parallelization on a power-of-2 number of processors
- repeated decomposition in x and y until all processors are assigned
- wavefront parallelism; small messages, 5 words each
- MPI implementation
- asynchronous send/receive
- communication/computation overlap
- CAF
- two-dimensional co-arrays
- morphed code to pack data for higher communication efficiency
- uses PUTs
28. NAS LU Efficiency (Class C)
- Lessons
- morphing the code to pack data and use PUTs was hard!
- the compiler had better pitch in and handle the dirty work
29. NAS CG
- Conjugate gradient solve to compute an eigenvector of a large, sparse, symmetric, positive definite matrix
- MPI
- irregular point-to-point messaging
- CAF structure follows MPI
- irregular notify/wait
- vector assignments for data transfer
- No communication/computation overlap for either
30. NAS CG Efficiency (Class C)
- Lessons
- aggregation and vectorization are critical for high-performance communication
- memory layout of buffers and arrays might require thorough analysis and optimization
31. CAF GET vs. PUT Communication
- Definitions
- GET: q_caf(n1:n2) = w(m1:m2)[reduce_exch_proc_noncaf(i)]
- PUT: q_caf(n1:n2)[reduce_exch_proc_noncaf(i)] = w(m1:m2)
- Study
- 64 procs, NAS CG class C
- Alpha + Quadrics Elan3 (Lemieux)
- Performance
- GET: 12.9% slower than MPI
- PUT: 4.0% slower than MPI
In general, PUT is faster than GET
32. Experiments Summary
- On cluster-based architectures, to achieve the best performance with CAF, a user or compiler must
- vectorize (and perhaps aggregate) communication
- reduce synchronization strength
- replace all-to-all with point-to-point where sensible
- overlap communication with computation
- convert GETs into PUTs where GETs are not a h/w primitive
- consider memory layout conflicts: co-array vs. regular data
- generate code amenable to back-end compiler optimizations
- CAF language: many optimizations possible at the source level
- Compiler optimizations NECESSARY for a portable coding style
- might need user hints where synchronization analysis falls short
- Runtime issues
- on Myrinet: pin co-array memory for direct transfers
33. CAF Language Refinement Issues
- Initial implementations on Cray T3E and X1 led to features not suited for distributed-memory platforms
- Key problems and suggested solutions
- restrictive memory fence semantics for procedure calls
- pragmas to enable the programmer to overlap one-sided communication with procedure calls
- overly restrictive synchronization primitives
- add unidirectional, point-to-point synchronization
- rework the team model (next slide)
- no collective operations
- leads to home-brew, non-portable implementations
- add CAF intrinsics for reductions, broadcast, etc.
34. CAF Language Refinement Issues
- CAF dynamic teams lead to code that doesn't scale
- pre-arranged, communicator-like teams
- would help collectives: O(log P) rather than O(P^2)
- reordering logical numbering of images for topology
- add shape information to image teams?
- Blocking communication reduces scalability
- user mechanisms to delay completion to enable overlap?
- Synchronization is not paired with data movement
- synchronization hint tags to help analysis
- synchronization tags at run time to track completion?
- How relaxed should the memory model be for performance?
35. Conclusions
- Tuned CAF performance is comparable to tuned MPI
- even without compiler-based communication optimizations!
- The CAF programming model enables source-level optimization
- communication vectorization
- synchronization strength reduction
- achieve performance today rather than waiting for tomorrow's compilers
- CAF is amenable to compiler analysis and optimization
- significant communication optimization is feasible, unlike for MPI
- optimizing compilers will help a wider range of programs achieve high performance
- applications can be tailored to fully exploit architectural characteristics
- e.g., shared memory vs. distributed memory vs. hybrid
- However, more abstract programming models would simplify code development (e.g. HPF)
36. Project URL
- http://www.hipersoft.rice.edu/caf
41. Parallel Programming Models
- Goals
- expressiveness
- ease of use
- performance
- portability
- Current models
- OpenMP: difficult to map onto distributed-memory platforms
- HPF: difficult to obtain high performance on a broad range of programs
- MPI: de facto standard; hard to program; assumptions about communication granularity are hard-coded
- UPC: global address space language similar to CAF but with location transparency
42. Finite Element Example (Numrich)

    subroutine assemble(start, prin, ghost, neib, x)
      integer :: start(:), prin(:), ghost(:), neib(:), k1, k2, p
      real :: x(:)[*]
      call sync_all(neib)
      do p = 1, size(neib)              ! Add contributions from neighbors
        k1 = start(p); k2 = start(p+1) - 1
        x(prin(k1:k2)) = x(prin(k1:k2)) + x(ghost(k1:k2))[neib(p)]
      enddo
      call sync_all(neib)
      do p = 1, size(neib)              ! Update the neighbors
        k1 = start(p); k2 = start(p+1) - 1
        x(ghost(k1:k2))[neib(p)] = x(prin(k1:k2))
      enddo
      call sync_all()
    end subroutine assemble
43. Communicating Private Data
- Example
- REAL A(100,100)[*], B(100)
- A(:,j)[p] = B(:)
- Issue
- B is a local array
- B is sent to a partner
- will require a copy into shared space before transfer
- for higher efficiency, want B in shared storage
- Alternatives (see the sketch below)
- declare communicated arrays as co-arrays
- add a "communicated" attribute to B's declaration
- mark it for allocation in shared storage
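A minimal sketch of the first alternative: declaring the communicated array as a co-array so it already lives in communication-layer (shared or pinned) storage and the PUT needs no extra staging copy. The subroutine and variable names are illustrative.

    ! Illustrative only: B declared as a co-array avoids the staging copy.
    subroutine send_column(A, j, p)
      implicit none
      real, intent(inout) :: A(100,100)[*]   ! destination co-array
      integer, intent(in) :: j, p
      real, save :: B(100)[*]                ! communicated buffer, now a co-array
      ! ... fill B locally ...
      A(:, j)[p] = B(:)                      ! one-sided PUT straight from B's storage
    end subroutine send_column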
44. Passing Co-arrays as Arguments
- Language restriction: pass co-arrays by whole array
- REAL A(100,100)[*]
- CALL FOO(A)
- callee must declare an explicit subroutine interface
- Proposed option: F90 assumed-shape co-array arguments (interface sketched below)
- allow passing of Fortran 90-style array sections of a local co-array
- REAL A(100,100)[*]
- CALL FOO(A(1:10:2, 3:25))
- callee must declare an explicit subroutine interface
- if the matching dummy argument is declared as a co-array, then
- must declare assumed-shape data dimensions
- must declare assumed-size co-dimensions
- avoids copy-in, copy-out for co-array data
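A hedged sketch of what the caller and callee might look like under the proposed assumed-shape option; the dummy-argument syntax is an illustration of the proposal, not settled language.

    ! Illustrative only: proposed assumed-shape co-array dummy argument.
    program coarray_arg_sketch
      implicit none
      interface
        subroutine foo(s)
          real :: s(:,:)[*]            ! assumed-shape data dims, assumed-size co-dim
        end subroutine foo
      end interface
      real, save :: a(100,100)[*]      ! whole co-array declared by the caller
      call foo(a(1:10:2, 3:25))        ! pass a section; no copy-in, copy-out needed
    end program coarray_arg_sketch

    subroutine foo(s)
      implicit none
      real :: s(:,:)[*]                ! callee view of the passed section
      s(:,:) = 0.0                     ! operate on the caller's data in place
    end subroutine foo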
45. Co-array Fortran (CAF)
- Explicitly-parallel extension of Fortran 90/95
- defined by Numrich & Reid
- Global address space SPMD parallel programming model
- one-sided communication
- Simple, two-level memory model for locality management
- local vs. remote memory
- Programmer control over performance-critical decisions
- data partitioning
- communication
- Suitable for mapping to a range of parallel architectures
- shared memory, message passing, hybrid, PIM
46. CAF Programming Model Features
- SPMD process images
- fixed number of images during execution
- images operate asynchronously
- Both private and shared data
- real :: x(20, 20)      ! a private 20x20 array in each image
- real :: y(20, 20)[*]   ! a shared 20x20 array in each image
- Simple one-sided shared-memory communication
- x(:, j:j+2) = y(:, p:p+2)[r]   ! copy columns p:p+2 of y on image r into local columns j:j+2 of x
- Synchronization intrinsic functions
- sync_all: a barrier and a memory fence
- sync_mem: a memory fence
- sync_team(notify, wait)
- notify: a vector of process ids to signal
- wait: a vector of process ids to wait for, a subset of notify
- Pointers and (perhaps asymmetric) dynamic allocation
- Parallel I/O