Transcript and Presenter's Notes

Title: Co-array Fortran: Compilation, Performance, Language Issues


1
Co-array Fortran: Compilation, Performance, Language Issues
  • John Mellor-Crummey
  • Cristian Coarfa, Yuri Dotsenko
  • Department of Computer Science
  • Rice University

2
Outline
  • Co-array Fortran language recap
  • Compilation approach
  • Co-array storage management
  • Communication
  • A preliminary performance study
  • Platforms
  • Benchmarks, results, and lessons
  • Language refinement issues
  • Conclusions

3
CAF Language Assessment
  • Strengths
  • offloads communication management to the compiler
  • choreographing data transfer
  • managing mechanics of synchronization
  • gives user full control of parallelization
  • data movement and synchronization as language
    primitives
  • amenable to compiler optimization
  • array syntax supports natural user-level
    vectorization
  • modest compiler technology can yield good
    performance
  • more abstract than MPI → better performance portability
  • Weaknesses
  • user manages partitioning of work
  • user specifies data movement
  • user codes necessary synchronization

4
Compiler Goals
  • Portable compiler
  • Multi-platform code generation
  • High performance generated code

5
Compilation Approach
  • Source-to-source Translation
  • Translate CAF into Fortran 90 + communication calls
  • One-sided communication layer
  • strided communication
  • gather/scatter
  • synchronization: barriers, notify/wait
  • split-phase non-blocking primitives
  • Today: ARMCI remote memory copy interface (Nieplocha @ PNL)
  • Benefits
  • wide portability
  • leverage vendor F90 compilers for good node
    performance

6
Co-array Data
  • Co-array representation
  • F90 pointer to data + opaque handle for the communication layer (see the sketch below)
  • Co-array access
  • read/write local co-array data using F90 pointer
    dereference
  • remote accesses translate into ARMCI GET/PUT
    calls
  • Co-array allocation
  • storage allocation by the communication layer, as appropriate
  • on shared-memory hardware: in a shared memory segment
  • on Myrinet: in pinned memory for direct DMA access
  • dope vector initialization using CHASM (Rasmussen @ LANL)
  • set F90 pointer to point to externally managed
    memory
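
A minimal sketch of this representation, with an invented type name and fields (the actual cafc runtime descriptor is not shown on these slides):

      ! Illustrative only: pair an F90 pointer with an opaque handle
      ! that the communication layer (e.g., ARMCI) uses to name the data.
      type caf_var
        real, pointer :: local_data(:,:)   ! local reads/writes via pointer
        integer(8)    :: comm_handle       ! opaque handle for GET/PUT calls
      end type caf_var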

7
Allocating Static Co-arrays (COMMON/SAVE)
  • Compiler
  • generate static initializer for each common/save
    variable
  • Linker
  • collect calls to all initializers
  • generate global initializer that calls all others
  • compile global initializer and link into program
  • Launch
  • call global initializer before main program
    begins

Similar to the handling of C static constructors
8
COMMON Block Sequence Association
  • Problem
  • each procedure may have a different view of a common block
  • Solution
  • allocate a contiguous pool of co-array storage per common block
  • each procedure has a private set of view variables (F90 pointers), as illustrated below
  • initialize all per-procedure view variables once at launch, after the common pool is allocated
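
A hedged sketch of the sequence-association problem (procedure and common-block names invented); the translation gives each procedure its own F90 pointer view into a single co-array pool allocated for the common block:

      ! Two legal but different views of the same COMMON storage:
      subroutine sweep_x()
        real :: a(100)[*]
        common /field/ a            ! sees /field/ as a 100-element co-array
      end subroutine sweep_x

      subroutine sweep_xy()
        real :: b(10,10)[*]
        common /field/ b            ! sees the same storage as a 10x10 co-array
      end subroutine sweep_xy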

9
Porting to a new Compiler / Architecture
  • Synthesize dope vectors for co-array storage
  • compiler/architecture-specific details handled by the CHASM library
  • Tailor communication to architecture
  • design supports alternate communication libraries
  • status
  • today: ARMCI (PNL)
  • ongoing work: compiler-tailored communication
  • direct load/store on shared-memory architectures
  • future
  • other portable libraries (e.g., GASNet)
  • custom communication library for an architecture

10
Supporting Multiple Co-dimensions
  • A(:,:)[N,M,*]
  • Add precomputed coefficients to co-array
    meta-data
  • Lower, upper bounds for each co-dimension
  • this_image_cache for each co-dimension
  • e.g., this_image(a,1) yields my co-row index
  • cum_hyperplane_size for each co-dimension (see the sketch below)
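
A self-contained sketch of the index arithmetic such cached metadata supports, assuming co-lower bounds of 1 and images numbered from 1 (variable names are illustrative, not cafc internals):

      program codim_demo
      implicit none
      integer, parameter :: n = 4, m = 3   ! co-shape [n,m,*]
      integer :: me, c1, c2, c3, rem
      me  = 7                              ! pretend this_image() returned 7
      rem = me - 1
      c1  = mod(rem, n) + 1                ! this_image(a,1): my co-row index
      rem = rem / n
      c2  = mod(rem, m) + 1                ! this_image(a,2)
      c3  = rem / m + 1                    ! this_image(a,3)
      print *, 'image', me, 'has co-indices', c1, c2, c3
      end program codim_demo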

11
Implementing Communication
  • Given a statement
  • X(1:n) = A(1:n)[p]
  • A temporary buffer is used for off-processor data
  • invoke the communication library to allocate tmp in suitable temporary storage
  • dope vector filled in so tmp can be accessed as an F90 pointer
  • call the communication library to fill in tmp (ARMCI GET)
  • X(1:n) = tmp(1:n)
  • deallocate tmp (see the sketch below)
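
A hedged sketch of the shape of the generated code; the runtime entry points (caf_allocate_tmp, caf_get, caf_free_tmp) are hypothetical stand-ins for the cafc/ARMCI interface, not its actual API:

      ! original CAF statement:   X(1:n) = A(1:n)[p]
      real, pointer :: tmp(:)
      call caf_allocate_tmp(tmp, n)       ! tmp placed in registered memory,
                                          ! dope vector set so tmp is an F90 pointer
      call caf_get(tmp, a_handle, p, n)   ! one-sided GET from image p (ARMCI)
      X(1:n) = tmp(1:n)                   ! plain F90 copy into X
      call caf_free_tmp(tmp)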

12
CAF Compiler Status
  • Near production-quality F90 front end from Open64
  • being enhanced to meet needs of this project and
    others
  • Working prototype for CAF core features
  • Co-array communication
  • inserted around statements with co-array accesses
  • currently no optimization

13
Supported Features
  • Declarations
  • co-objects: scalars and arrays
  • COMMON and SAVE co-objects of primitive types
  • INTEGER(4), REAL(4) and REAL(8)
  • COMMON blocks: variables and co-objects intermixed
  • co-objects with multiple co-dimensions
  • procedure interface blocks with co-array
    arguments
  • Executable code
  • array section notation for co-array data indices
  • local and remote co-arrays
  • co-array argument passing
  • co-array dummy arguments require explicit
    interface
  • co-array passed as F90 pointer + communication handle
  • co-array reshaping supported
  • CAF intrinsics
  • Image inquiry: this_image(), num_images()
  • Synchronization: sync_all, sync_team, sync_notify, sync_wait

14
Coming Attractions
  • Allocatable co-arrays
  • REAL(8), ALLOCATABLE :: X(:)[:]
  • ALLOCATE(X(MYX_NUM)[*])
  • Co-arrays of user-defined types
  • Allocatable co-array components
  • user defined type with pointer components
  • Triplets in co-dimensions
  • A(j,k)[p1:p4]

15
CAF Compiler Targets (May 2004)
  • Pentium + Ethernet workstations, Linux32 RedHat 7.1
  • Itanium2 + Myrinet, Linux64 RedHat 7.1
  • Itanium2 + Quadrics, Linux64 RedHat 7.1
  • SGI Altix 3000/Itanium2, Linux64 RedHat 7.2
  • Alphaserver SC Quadrics, OSF1 Tru64 V5.1A
  • SGI Origin 2000/MIPS, IRIX64 6.5

16
A Preliminary Performance Study
  • Platforms
  • Alpha + Quadrics QSNet (Elan3)
  • Itanium2 + Quadrics QSNet II (Elan4)
  • Itanium2 + Myrinet 2000
  • Codes
  • NAS Parallel Benchmarks (NPB) from NASA Ames

17
Alpha + Quadrics Platform (Lemieux)
  • Nodes: 750 Compaq AlphaServer ES45 4-way SMP nodes
  • 1-GHz Alpha EV68 (21264C), 64KB/8MB L1/L2 cache
  • 4 GB RAM/node
  • Interconnect: Quadrics QSNet (Elan3)
  • 340 MB/s peak and 210 MB/s sustained x 2 rails
  • Operating System: Tru64 Unix 5.1A, SC 2.5
  • Compiler: HP Fortran Compiler V5.5A
  • Communication Middleware: ARMCI 1.1-beta

18
Itanium2 + Quadrics Platform (PNNL)
  • Nodes: 944 HP Longs Peak dual-CPU workstations
  • 1.5 GHz Itanium2, 32KB/256KB/6MB L1/L2/L3 cache
  • 6 GB RAM/node
  • Interconnect: Quadrics QSNet II
  • 905 MB/s
  • Operating System: Red Hat Linux, kernel 2.4.20
  • Compiler: Intel Fortran Compiler v7.1
  • Communication Middleware: ARMCI 1.1-beta

19
Itanium2 + Myrinet Platform (Rice)
  • Nodes: 96 HP zx6000 dual-CPU workstations
  • 900 MHz Itanium2, 32KB/256KB/1.5MB L1/L2/L3 cache
  • 4 GB RAM/node
  • Interconnect: Myrinet 2000
  • 240 MB/s
  • GM version 1.6.5
  • MPICH-GM version 1.2.5
  • Operating System: Red Hat Linux, kernel 2.4.18 + patches
  • Compiler: Intel Fortran Compiler v7.1
  • Communication Middleware: ARMCI 1.1-beta

20
NAS Parallel Benchmarks (NPB) 2.3
  • Benchmarks by NASA Ames
  • 2-3K lines each (Fortran 77)
  • Widely used to test parallel compiler performance
  • NAS versions
  • NPB2.3b2: hand-coded MPI
  • NPB2.3-serial: serial code extracted from the MPI version
  • Our version
  • NPB2.3-CAF: CAF implementation, based on the MPI version

21
NAS BT
  • Block tridiagonal solve of the 3D Navier-Stokes equations
  • Dense matrix
  • Parallelization
  • alternating line sweeps along 3 dimensions
  • multipartitioning data distribution for full
    parallelism
  • MPI implementation
  • asynchronous send/receive
  • communication/computation overlap
  • CAF communication
  • strided blocks transferred using vector PUTs
    (triplet notation)
  • no user-declared communication buffers
  • Large messages, relatively infrequent
    communication

22
NAS BT Efficiency (Class C)
  • Lesson
  • Trade-off: buffers vs. synchronization
  • more buffers → less synchronization
  • less synchronization → improved performance

23
NAS SP
  • Scalar pentadiagonal solve of the 3D Navier-Stokes equations
  • Dense matrix
  • Parallelization
  • alternating line sweeps along 3 dimensions
  • multipartitioning data distribution for full
    parallelism
  • MPI implementation
  • asynchronous send/receive
  • communication/computation overlap
  • CAF communication
  • pack into buffers: a separate buffer for each plane of the sweep
  • transfer using PUTs
  • smaller, more frequent messages: 1.5x the communication volume of BT

24
NAS SP Efficiency (Class C)
  • Lesson
  • Inability to overlap communication with computation across procedure calls hurts performance

25
NAS MG
  • 3D Multigrid solver with periodic boundary
    conditions
  • Dense matrix
  • Grid size and levels are compile-time constants
  • Communication
  • nearest-neighbor, with up to 6 neighbors
  • MPI: asynchronous send/receive
  • CAF (see the sketch below)
  • pairwise sync_notify/sync_wait to coordinate with neighbors
  • four communication buffers (co-arrays) used: 2 sender-side, 2 receiver-side
  • pack and transfer contiguous data using PUTs
  • for each dimension:
  • notify my neighbors that my buffers are free
  • wait for my neighbors to notify me that their buffers are free
  • PUT data into the right neighbor's buffer, notify that neighbor
  • PUT data into the left neighbor's buffer, notify that neighbor
  • wait for both transfers to complete
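
A hedged sketch of this exchange pattern in the CAF dialect used in this talk (sync_notify/sync_wait intrinsics); buffer names, sizes, and the packing step are illustrative, not the actual NPB2.3-CAF source:

      integer, parameter :: NB = 64                    ! assumed face size
      real, save :: send_l(NB)[*], send_r(NB)[*]       ! packed outgoing faces
      real, save :: recv_l(NB)[*], recv_r(NB)[*]       ! co-array receive buffers
      integer :: left, right                           ! neighbor image numbers

      call sync_notify(left)                     ! my receive buffers are free
      call sync_notify(right)
      call sync_wait(left)                       ! neighbors' buffers are free
      call sync_wait(right)
      recv_l(:)[right] = send_r(:)               ! PUT my right face rightward
      call sync_notify(right)                    ! tell neighbor data arrived
      recv_r(:)[left]  = send_l(:)               ! PUT my left face leftward
      call sync_notify(left)
      call sync_wait(left)                       ! wait for both incoming faces
      call sync_wait(right)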

26
NAS MG Efficiency (Class C)
  • Lessons
  • Replacing barriers with point-to-point synchronization can boost performance by 30%
  • Converting GETs into PUTs also improved
    performance

27
NAS LU
  • Solve the 3D Navier-Stokes equations using SSOR
  • Dense matrix
  • Parallelization on a power-of-2 number of processors
  • repeated decomposition in x and y until all processors are assigned
  • wavefront parallelism; small messages, 5 words each
  • MPI implementation
  • asynchronous send/receive
  • communication/computation overlap
  • CAF
  • two dimensional co-arrays
  • morphed code to pack data for higher
    communication efficiency
  • uses PUTs

28
NAS LU Efficiency (Class C)
  • Lessons
  • Morphing to pack data and use PUTs was hard!
  • Compiler had better pitch in and handle the dirty
    work

29
NAS CG
  • Conjugate gradient solve to compute an eigenvector of a large, sparse, symmetric, positive definite matrix
  • MPI
  • Irregular point-to-point messaging
  • CAF structure follows MPI
  • Irregular notify/wait
  • vector assignments for data transfer
  • No communication/computation overlap for either

30
NAS CG Efficiency (Class C)
  • Lessons
  • aggregation and vectorization are critical for
    high performance communication
  • memory layout of buffers and arrays might require
    thorough analysis and optimization

31
CAF GET vs. PUT Communication
  • Definitions
  • GET: q_caf(n1:n2) = w(m1:m2)[reduce_exch_proc_noncaf(i)]
  • PUT: q_caf(n1:n2)[reduce_exch_proc_noncaf(i)] = w(m1:m2)
  • Study
  • 64 procs, NAS CG class C
  • Alpha + Quadrics Elan3 (Lemieux)
  • Performance
  • GET: 12.9% slower than MPI
  • PUT: 4.0% slower than MPI

In general, PUT is faster than GET
32
Experiments Summary
  • On cluster-based architectures, to achieve best
    performance with CAF, a user or compiler must
  • vectorize (and perhaps aggregate) communication (see the sketch after this list)
  • reduce synchronization strength
  • replace all-to-all with point-to-point where
    sensible
  • overlap communication with computation
  • convert GETs into PUTs where GETs are not a hardware primitive
  • consider memory layout conflicts: co-array vs. regular data
  • generate code amenable to back-end compiler optimizations
  • CAF language: many optimizations are possible at the source level
  • Compiler optimizations are NECESSARY for a portable coding style
  • might need user hints where synchronization analysis falls short
  • Runtime issues
  • on Myrinet: pin co-array memory for direct transfers
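
A minimal sketch of what vectorizing communication means at the source level; the array names, sizes, and image index p are illustrative:

      ! Fine-grained: one small GET per iteration (slow on clusters)
      do i = 1, n
        x(i) = y(i)[p]
      end do

      ! Vectorized: a single bulk GET for the whole section
      x(1:n) = y(1:n)[p]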

33
CAF Language Refinement Issues
  • Initial implementations on Cray T3E and X1 led to
    features not suited for distributed memory
    platforms
  • Key problems and solution suggestions
  • Restrictive memory fence semantics for procedure
    calls
  • pragmas to enable programmer to overlap one-sided
    communication with procedure calls
  • Overly restrictive synchronization primitives
  • add unidirectional, point-to-point
    synchronization
  • rework team model (next slide)
  • No collective operations
  • Leads to home-brew non-portable implementations
  • add CAF intrinsics for reductions, broadcast,
    etc.

34
CAF Language Refinement Issues
  • CAF dynamic teams do not scale
  • pre-arranged communicator-like teams
  • would help collectives: O(log P) rather than O(P^2)
  • reordering logical numbering of images for
    topology
  • add shape information to image teams?
  • Blocking communication reduces scalability
  • user mechanisms to delay completion to enable
    overlap?
  • Synchronization is not paired with data movement
  • synchronization hint tags to help analysis
  • synchronization tags at run-time to track
    completion?
  • How relaxed should the memory model be for
    performance?

35
Conclusions
  • Tuned CAF performance is comparable to tuned MPI
  • even without compiler-based communication
    optimizations!
  • CAF programming model enables source-level
    optimization
  • communication vectorization
  • synchronization strength reduction
  • achieve performance today rather than waiting for tomorrow's compilers
  • CAF is amenable to compiler analysis and
    optimization
  • significant communication optimization is
    feasible, unlike for MPI
  • optimizing compilers will help a wider range of
    programs achieve high performance
  • applications can be tailored to fully exploit
    architectural characteristics
  • e.g., shared memory vs. distributed memory vs.
    hybrid
  • However, more abstract programming models would
    simplify code development (e.g. HPF)

36
Project URL
  • http://www.hipersoft.rice.edu/caf

41
Parallel Programming Models
  • Goals
  • Expressiveness
  • Ease of use
  • Performance
  • Portability
  • Current models
  • OpenMP: difficult to map onto distributed memory platforms
  • HPF: difficult to obtain high performance on a broad range of programs
  • MPI: the de facto standard; hard to program, and assumptions about communication granularity are hard-coded
  • UPC: a global address space language similar to CAF, but with location transparency

42
Finite Element Example (Numrich)
      subroutine assemble(start, prin, ghost, neib, x)
      integer :: start(:), prin(:), ghost(:), neib(:), k1, k2, p
      real    :: x(:)[*]
      call sync_all(neib)
      do p = 1, size(neib)              ! Add contributions from neighbors
        k1 = start(p); k2 = start(p+1) - 1
        x(prin(k1:k2)) = x(prin(k1:k2)) + x(ghost(k1:k2))[neib(p)]
      enddo
      call sync_all(neib)
      do p = 1, size(neib)              ! Update the neighbors
        k1 = start(p); k2 = start(p+1) - 1
        x(ghost(k1:k2))[neib(p)] = x(prin(k1:k2))
      enddo
      call sync_all
      end subroutine assemble

43
Communicating Private Data
  • Example
  • REAL A(100,100)[*], B(100)
  • A(:,j)[p] = B(:)
  • Issue
  • B is a local array
  • B is sent to a partner
  • Will require a copy into shared space before the transfer
  • For higher efficiency, we want B in shared storage
  • Alternatives (see the sketch after this list)
  • Declare communicated arrays as co-arrays
  • Add a "communicated" attribute to B's declaration
  • mark it for allocation in shared storage
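
A minimal sketch of the first alternative, declaring the communicated array as a co-array so no staging copy is needed; shapes and names follow the example above:

      REAL :: A(100,100)[*]
      REAL :: B(100)[*]          ! B now lives in remotely accessible storage
      A(:,j)[p] = B(:)           ! transfer directly, no intermediate copy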

44
Passing Co-arrays as Arguments
  • Language restriction: co-arrays are passed as whole arrays
  • REAL A(100,100)[*]
  • CALL FOO(A)
  • Callee must declare an explicit subroutine
    interface
  • Proposed option: F90 assumed-shape co-array arguments
  • Allow passing of Fortran 90 style array sections
    of local co-array
  • REAL A(100,100)[*]
  • CALL FOO(A(1:10:2, 3:25))
  • Callee must declare an explicit subroutine
    interface
  • If the matching dummy argument is declared as a co-array, then (see the sketch after this list):
  • must declare assumed-size data dimensions
  • must declare assumed-size co-dimensions
  • avoids copy-in/copy-out for co-array data
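
A hedged sketch of the explicit interface the callee would declare under the rules listed above; FOO comes from the example, and the dummy name D is invented:

      INTERFACE
        SUBROUTINE FOO(D)
          REAL :: D(100,*)[*]    ! assumed-size data and co-dimensions
        END SUBROUTINE FOO
      END INTERFACE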

45
Co-array Fortran (CAF)
  • Explicitly-parallel extension of Fortran 90/95
  • defined by Numrich & Reid
  • Global address space SPMD parallel programming
    model
  • one-sided communication
  • Simple, two-level memory model for locality
    management
  • local vs. remote memory
  • Programmer control over performance critical
    decisions
  • data partitioning
  • communication
  • Suitable for mapping to a range of parallel
    architectures
  • shared memory, message passing, hybrid, PIM

46
CAF Programming Model Features
  • SPMD process images
  • fixed number of images during execution
  • images operate asynchronously
  • Both private and shared data
  • real :: x(20,20)      ! a private 20x20 array in each image
  • real :: y(20,20)[*]   ! a shared 20x20 array in each image
  • Simple one-sided shared-memory communication (see the example after this list)
  • x(:,j:j+2) = y(:,p:p+2)[r]   ! copy columns p:p+2 from image r into local columns
  • Synchronization intrinsic functions
  • sync_all: a barrier and a memory fence
  • sync_mem: a memory fence
  • sync_team(notify, wait)
  • notify: a vector of process ids to signal
  • wait: a vector of process ids to wait for, a subset of notify
  • Pointers and (perhaps asymmetric) dynamic
    allocation
  • Parallel I/O
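
A minimal, hedged example tying these features together, written in the original CAF dialect used throughout this talk (sync_all as an intrinsic call rather than the later Fortran 2008 statement); the neighbor choice and array sizes are illustrative:

      program caf_features_demo
      implicit none
      real    :: y(20,20)[*]        ! shared: one copy per image
      real    :: x(20,20)           ! private to each image
      integer :: me, np, r
      me = this_image()
      np = num_images()
      r  = mod(me, np) + 1          ! circular right neighbor
      y  = real(me)                 ! initialize my shared copy
      call sync_all()               ! barrier + fence: everyone's y is ready
      x(:,1:3) = y(:,1:3)[r]        ! one-sided copy of columns from image r
      call sync_all()
      print *, 'image', me, 'of', np, 'read', x(1,1), 'from image', r
      end program caf_features_demo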