A Multi-platform Co-Array Fortran Compiler

1
A Multi-platform Co-Array Fortran Compiler
  • Yuri Dotsenko, Cristian Coarfa
  • John Mellor-Crummey
  • Department of Computer Science
  • Rice University
  • Houston, TX USA

2
Motivation
  • Parallel programming models
    • MPI: de facto standard
      • difficult to program
    • OpenMP: inefficient to map onto distributed-memory platforms
      • lack of locality control
    • HPF: hard to obtain high performance
      • heroic compilers needed!

Global address space languages (CAF, Titanium, UPC): an appealing middle ground
3
Co-Array Fortran
  • Global address space programming model
  • one-sided communication (GET/PUT)
  • Programmer has control over performance-critical
    factors
  • data distribution
  • computation partitioning
  • communication placement
  • Data movement and synchronization as language
    primitives
  • amenable to compiler-based communication
    optimization

4
CAF Programming Model Features
  • SPMD process images
    • fixed number of images during execution
    • images operate asynchronously
  • Both private and shared data
    • real x(20,20)        ! a private 20x20 array in each image
    • real y(20,20)[*]     ! a shared 20x20 co-array in each image
  • Simple one-sided shared-memory communication
    • x(:,j:j+2) = y(:,p:p+2)[r]   ! copy columns from image r into local columns
  • Synchronization intrinsic functions
    • sync_all: a barrier and a memory fence
    • sync_mem: a memory fence
    • sync_team(team members to notify, team members to wait for)
  • Pointers and (perhaps asymmetric) dynamic allocation

5
One-sided Communication with Co-Arrays
[Figure: images 1, 2, ..., N, drawn in two rows]
6
Rice Co-Array Fortran Compiler (cafc)
  • First CAF multi-platform compiler
  • previous compiler only for Cray shared memory
    systems
  • Implements core of the language
  • currently lacks support for derived type and
    dynamic co-arrays
  • Core sufficient for non-trivial codes
  • Performance comparable to that of hand-tuned MPI
    codes
  • Open source

7
Outline
  • CAF programming model
  • cafc
  • Core language implementation
  • Optimizations
  • Experimental evaluation
  • Conclusions

8
Implementation Strategy
  • Source-to-source compilation of CAF codes
  • uses Open64/SL Fortran 90 infrastructure
  • CAF → Fortran 90 + communication operations
  • Communication
  • ARMCI library for one-sided communication on
    clusters
  • load/store communication on shared-memory
    platforms
  • Goals
  • portability
  • high performance on a wide range of platforms

9
Co-Array Descriptors
  • Initialize and manipulate Fortran 90 dope vectors

real a(10,10,10)[*]

type CAFDesc_real_3
  integer(ptrkind) handle          ! opaque handle to CAF runtime representation
  real, pointer :: ptr(:,:,:)      ! Fortran 90 pointer to local co-array data
end type CAFDesc_real_3

type(CAFDesc_real_3) a
10
Allocating COMMON and SAVE Co-Arrays
  • Compiler
  • generates static initializer for each common/save
    variable
  • Linker
  • collects calls to all initializers
  • generates global initializer that calls all
    others
  • compiles global initializer and links into
    program
  • Launch
  • invokes global initializer before main program
    begins
  • allocates co-array storage outside Fortran 90
    runtime system
  • associates co-array descriptors with allocated
    memory

Similar to handling for C static constructors
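
The compiler, linker, and launch steps above can be pictured as a registry of per-variable initializers plus one global initializer that runs before the main program. The following is a minimal Python model of that idea, not cafc's generated code; `CoArrayDescriptor`, `runtime_heap`, `static_initializer`, `global_initializer`, and `launch` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical model: storage owned by the CAF runtime, outside the
# Fortran 90 runtime system.
runtime_heap = {}


@dataclass
class CoArrayDescriptor:
    name: str
    size: int
    handle: int = -1                          # opaque handle into runtime_heap
    data: list = field(default_factory=list)  # stands in for the F90 pointer


_initializers = []   # what the "linker" step collects


def static_initializer(desc):
    """Compiler step: emit one initializer per COMMON/SAVE co-array."""
    def init():
        handle = len(runtime_heap)
        runtime_heap[handle] = [0.0] * desc.size   # allocate storage
        desc.handle = handle
        desc.data = runtime_heap[handle]           # associate descriptor
    _initializers.append(init)
    return init


def global_initializer():
    """Linker step: one routine that calls every collected initializer."""
    for init in _initializers:
        init()


def launch(main):
    """Launch step: run the global initializer before the main program."""
    global_initializer()
    return main()


a = CoArrayDescriptor("a", 100)
b = CoArrayDescriptor("b", 10)
static_initializer(a)
static_initializer(b)


def main():
    a.data[49] = 3.5   # main sees storage that is already allocated
    return runtime_heap[a.handle][49]


result = launch(main)
```

The point of the model is ordering: by the time `main` runs, every descriptor already refers to storage it did not allocate itself.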
11
Parameter Passing
  • Call-by-value convention (copy-in, copy-out)
    • call f((a(i)[p]))
    • pass remote co-array data to procedures only as values
  • Call-by-co-array convention
    • argument declared as a co-array by callee
    • subroutine f(a); real a(10)[*]
    • requires an explicit interface
    • enables access to local and remote co-array data
  • Call-by-reference convention (cafc)
    • argument declared as an explicit shape array
    • real x(10)[*]; call f(x); subroutine f(a); real a(10)
    • enables access to local co-array data only
    • enables reuse of existing Fortran code
12
Multiple Co-dimensions
  • Managing processors as a logical
    multi-dimensional grid
  • integer a(10,10)[5,4,*]: 3D processor grid 5 x 4 x ...
  • Support co-space reshaping at procedure calls
  • change number of co-dimensions
  • co-space bounds as procedure arguments
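
To make the processor-grid view concrete, here is a small Python sketch of how co-subscripts for a co-shape such as [5,4,*] can be mapped to image numbers and back. It assumes lower co-bounds of 1 and column-major ordering (the first co-subscript varies fastest, as with Fortran array elements); the function names `image_index` and `cosubscripts` are illustrative choices here, not cafc routines.

```python
def image_index(cosubs, coshape):
    """Map 1-based co-subscripts to a 1-based image number.

    coshape lists the extents of all co-dimensions except the last,
    e.g. [5, 4] for a co-array declared with [5,4,*].
    Column-major: the first co-subscript varies fastest.
    """
    if len(cosubs) != len(coshape) + 1:
        raise ValueError("need one co-subscript per co-dimension")
    index, stride = 0, 1
    for sub, extent in zip(cosubs, list(coshape) + [None]):
        if sub < 1 or (extent is not None and sub > extent):
            raise ValueError("co-subscript out of bounds")
        index += (sub - 1) * stride
        if extent is not None:
            stride *= extent
    return index + 1


def cosubscripts(image, coshape):
    """Inverse mapping: 1-based image number to 1-based co-subscripts."""
    if image < 1:
        raise ValueError("image numbers start at 1")
    rest = image - 1
    subs = []
    for extent in coshape:
        subs.append(rest % extent + 1)
        rest //= extent
    subs.append(rest + 1)   # the last co-dimension is unbounded
    return subs


first = image_index([1, 1, 1], [5, 4])
second_plane = image_index([1, 1, 2], [5, 4])
round_trip = cosubscripts(image_index([3, 2, 2], [5, 4]), [5, 4])
```

Reshaping the co-space at a procedure call amounts to calling the same mapping with a different `coshape` over the same set of images.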

13
Implementing Communication
  • x(1:n) = a(1:n)[p]
  • Use a temporary buffer to hold off-processor data
    • allocate buffer
    • perform GET to fill buffer
    • perform computation: x(1:n) = buffer(1:n)
    • deallocate buffer
  • Optimizations
    • no temporary storage for co-array to co-array copies
    • load/store communication on shared-memory systems
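
The buffer-based translation of a remote read can be modeled in a few lines. This is a plain-Python sketch under stated assumptions: `images`, `get`, and `assign_from_remote` are hypothetical names, and `get` only stands in for a one-sided GET from a communication library; it is not the ARMCI API.

```python
# Each image owns its own copy of co-array `a`; images are numbered from 1.
images = {
    1: {"a": [1.0, 2.0, 3.0, 4.0]},
    2: {"a": [10.0, 20.0, 30.0, 40.0]},
}


def get(remote_image, name, lo, hi, buffer):
    """One-sided GET: copy a(lo:hi) from remote_image into buffer.

    Only the caller takes part; the remote image runs no matching code.
    Bounds are 1-based and inclusive, as in Fortran.
    """
    buffer[: hi - lo + 1] = images[remote_image][name][lo - 1 : hi]


def assign_from_remote(x, n, p):
    """Model of the generated code for a read of a(1:n) on image p."""
    buffer = [0.0] * n              # allocate buffer
    get(p, "a", 1, n, buffer)       # perform GET to fill buffer
    x[:n] = buffer                  # perform computation on local data
    del buffer                      # deallocate buffer


x = [0.0] * 4
assign_from_remote(x, 3, 2)
```

On a shared-memory platform the same assignment could skip the buffer entirely and read the remote data with ordinary loads, which is the second optimization listed above.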

14
Synchronization
  • Original CAF specification: team synchronization only
    • sync_all, sync_team
    • limits performance on loosely-coupled architectures
  • Point-to-point extensions
    • sync_notify(q)
    • sync_wait(p)
  • Point-to-point synchronization semantics
    • delivery of a notify to q from p implies that all communication from p
      to q issued before the notify has been delivered to q
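
One way to picture the notify semantics is a toy network in which PUTs and notifies from p to q travel through a single in-order channel, so a notify can never overtake an earlier PUT. The Python sketch below is a deterministic, single-threaded model under that assumption; `Network`, `deliver_one`, and `try_sync_wait` are hypothetical names, and a real runtime may enforce the ordering differently.

```python
from collections import deque


class Network:
    """Toy network with one in-order channel per (source, target) pair."""

    def __init__(self, num_images):
        self.memory = {i: {} for i in range(1, num_images + 1)}
        self.channels = {}   # (p, q) -> deque of pending messages
        self.notifies = {}   # (p, q) -> count of delivered notifies

    def put(self, p, q, key, value):
        """Image p issues a PUT into image q's memory (not yet delivered)."""
        self.channels.setdefault((p, q), deque()).append(("put", key, value))

    def sync_notify(self, p, q):
        """Image p notifies q; the notify queues behind p's earlier PUTs."""
        self.channels.setdefault((p, q), deque()).append(("notify",))

    def deliver_one(self, p, q):
        """Deliver the oldest pending message from p to q, if any."""
        channel = self.channels.get((p, q))
        if not channel:
            return False
        message = channel.popleft()
        if message[0] == "put":
            _, key, value = message
            self.memory[q][key] = value
        else:
            self.notifies[(p, q)] = self.notifies.get((p, q), 0) + 1
        return True

    def try_sync_wait(self, q, p):
        """Non-blocking form of sync_wait(p) on image q: consume one notify."""
        if self.notifies.get((p, q), 0) > 0:
            self.notifies[(p, q)] -= 1
            return True
        return False


net = Network(2)
net.put(1, 2, "halo", [1, 2, 3])
net.sync_notify(1, 2)

net.deliver_one(1, 2)               # the PUT arrives first
early = net.try_sync_wait(2, 1)     # notify not delivered yet
net.deliver_one(1, 2)               # now the notify arrives
ready = net.try_sync_wait(2, 1)
halo = net.memory[2].get("halo")
```

Once the wait on image 2 succeeds, the data written by image 1 before its notify is guaranteed to be visible, and no other image had to take part.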

15
Outline
  • CAF programming model
  • cafc
  • Core language implementation
  • Optimizations
  • procedure splitting
  • supporting hints for non-blocking communication
  • packing strided communications
  • Experimental evaluation
  • Conclusions

16
An Impediment to Code Efficiency
  • Original reference
    • rhs(1,i,j,k,c) = ... + u(1,i-1,j,k,c) - ...
  • Transformed reference
    • rhs%ptr(1,i,j,k,c) = ... + u%ptr(1,i-1,j,k,c) - ...
  • Fortran 90 pointer-based co-array representation does not convey
    • the lack of co-array aliasing
    • co-array contiguity
    • co-array bounds
  • Lack of knowledge inhibits important code optimizations

17
Procedure Splitting
Original:

subroutine f(...)
  real, save :: c(100)[*]
  ... c(50) ...
end subroutine f

After CAF to CAF preprocessing:

subroutine f(...)
  real, save :: c(100)[*]
  interface
    subroutine f_inner(..., c_arg)
      real c_arg[*]
    end subroutine f_inner
  end interface
  call f_inner(..., c)
end subroutine f

subroutine f_inner(..., c_arg)
  real c_arg(100)[*]
  ... c_arg(50) ...
end subroutine f_inner
18
Benefits of Procedure Splitting
  • Generated code conveys
  • lack of co-array aliasing
  • co-array contiguity
  • co-array bounds
  • Enables back-end compiler to generate better code

19
Hiding Communication Latency
  • Goal: enable communication/computation overlap
  • Impediments to generating non-blocking communication
    • use of indexed subscripts in co-dimensions
    • lack of whole program analysis
  • Approach: support hints for non-blocking communication
    • overcome conservative compiler analysis
    • enable sophisticated programmers to achieve good performance today

20
Hints for Non-blocking PUTs
  • Hints for CAF run-time system to issue non-blocking PUTs
    • region_id = open_nb_put_region()
    • ...
    • Put_Stmt_1
    • ...
    • Put_Stmt_N
    • ...
    • call close_nb_put_region(region_id)
  • Complete non-blocking PUTs
    • call complete_nb_put_region(region_id)
  • Open problem: exploiting non-blocking GETs?
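
The intended life cycle of a non-blocking PUT region can be modeled as follows. This Python sketch reuses the three call names from the slide but is only a toy: the `PutRuntime` class, its `put` method, and the data structures are assumptions for illustration and say nothing about how the cafc runtime is actually implemented.

```python
class PutRuntime:
    """Toy runtime that models the three hint calls named on the slide."""

    def __init__(self):
        self.remote = {}          # (image, name) -> delivered value
        self.open_region = None   # id of the region currently open, if any
        self.pending = {}         # region id -> PUTs issued but not complete
        self.next_id = 0

    def open_nb_put_region(self):
        self.next_id += 1
        self.open_region = self.next_id
        self.pending[self.open_region] = []
        return self.open_region

    def put(self, image, name, value):
        if self.open_region is None:
            self.remote[(image, name)] = value   # blocking PUT
        else:
            # Inside a region: start the PUT and return immediately.
            self.pending[self.open_region].append((image, name, value))

    def close_nb_put_region(self, region_id):
        if region_id != self.open_region:
            raise ValueError("region is not open")
        self.open_region = None

    def complete_nb_put_region(self, region_id):
        """Block until every PUT issued in the region has completed."""
        for image, name, value in self.pending.pop(region_id):
            self.remote[(image, name)] = value


rt = PutRuntime()
region_id = rt.open_nb_put_region()
rt.put(2, "a", 1.5)                      # Put_Stmt_1
rt.put(3, "a", 2.5)                      # Put_Stmt_N
rt.close_nb_put_region(region_id)
in_flight = len(rt.pending[region_id])   # PUTs may still be in flight
overlapped_work = sum(range(10))         # computation overlaps communication
rt.complete_nb_put_region(region_id)
```

The useful window is between closing and completing the region: any computation placed there can overlap with the outstanding PUTs.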

21
Strided vs. Contiguous Transfers
  • Problem
    • CAF remote reference might induce many small data transfers
    • a(i,1:n)[p] = b(j,1:n)
  • Solution
    • pack strided data on source and unpack it on destination
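
The cost difference is easy to see in a small model. In column-major storage a row section such as b(j,1:n) is strided, so a naive PUT sends n one-element messages, while packing sends one. The Python sketch below counts messages for both strategies; `Link`, `put_row_unpacked`, `put_row_packed`, and the scratch area at the end of remote memory are hypothetical devices for illustration, and the unpack loop stands in for work done on the destination side.

```python
def offset(i, k, rows):
    """Column-major offset of element (i, k), 1-based, in a rows-by-n array."""
    return (i - 1) + (k - 1) * rows


class Link:
    """Counts the messages sent to the remote image's memory."""

    def __init__(self, remote_memory):
        self.remote = remote_memory
        self.messages = 0

    def send(self, start, values):
        """One message: write a contiguous run starting at `start`."""
        self.remote[start:start + len(values)] = values
        self.messages += 1


def put_row_unpacked(link, b, i, j, n, rows):
    """Row i of remote a receives row j of b as n one-element transfers."""
    for k in range(1, n + 1):
        link.send(offset(i, k, rows), [b[offset(j, k, rows)]])


def put_row_packed(link, b, i, j, n, rows, scratch_start):
    """Same assignment: pack at the source, send once, unpack at the target."""
    packed = [b[offset(j, k, rows)] for k in range(1, n + 1)]   # pack
    link.send(scratch_start, packed)                            # one message
    for k in range(1, n + 1):                                   # unpack
        link.remote[offset(i, k, rows)] = link.remote[scratch_start + k - 1]


rows, n = 3, 4
b = [float(v) for v in range(rows * n)]   # local array b(3,4)
scratch = rows * n                        # scratch area placed after a(3,4)

slow = Link([0.0] * (rows * n + n))
put_row_unpacked(slow, b, 2, 1, n, rows)

fast = Link([0.0] * (rows * n + n))
put_row_packed(fast, b, 2, 1, n, rows, scratch)
```

Both strategies leave the remote array in the same state; only the number of messages differs, which matters most on interconnects with high per-message overhead.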

22
Pragmatics of Packing
  • Who should implement packing?
  • The CAF programmer
  • difficult to program
  • The CAF compiler
  • unpacking requires conversion of PUTs into
    two-sided communication (a difficult
    whole-program transformation)
  • The communication library
  • most natural place
  • ARMCI currently performs packing on Myrinet

23
CAF Compiler Targets (Sept 2004)
  • Processors
  • Pentium, Alpha, Itanium2, MIPS
  • Interconnects
  • Quadrics, Myrinet, Gigabit Ethernet, shared
    memory
  • Operating systems
  • Linux, Tru64, IRIX

24
Outline
  • CAF programming model
  • cafc
  • Core language implementation
  • Optimizations
  • Experimental evaluation
  • Conclusions

25
Experimental Evaluation
  • Platforms
  • Alpha + Quadrics QSNet (Elan3)
  • Itanium2 + Quadrics QSNet II (Elan4)
  • Itanium2 + Myrinet 2000
  • Codes
  • NAS Parallel Benchmarks (NPB 2.3) from NASA Ames

26
NAS BT Efficiency (Class C)
27
NAS SP Efficiency (Class C)
Lack of a non-blocking notify implementation blocks CAF
communication/computation overlap
28
NAS MG Efficiency (Class C)
  • ARMCI communication is efficient
  • point-to-point synchronization boosts CAF performance by 30%

29
NAS CG Efficiency (Class C)
30
NAS LU Efficiency (Class C)
31
Impact of Optimizations
  • Assorted results
  • Procedure splitting
    • 42-60% improvement for BT on Itanium2+Myrinet cluster
    • 15-33% improvement for LU on Alpha+Quadrics
  • Non-blocking communication generation
    • 5% improvement for BT on Itanium2+Quadrics cluster
    • 3% improvement for MG on all platforms
  • Packing of strided data
    • 31% improvement for BT on Alpha+Quadrics cluster
    • 37% improvement for LU on Itanium2+Quadrics cluster

See paper for more details
32
Conclusions
  • CAF boosts programming productivity
  • simplifies the development of SPMD parallel
    programs
  • shifts details of managing communication to
    compiler
  • cafc delivers performance comparable to
    hand-tuned MPI
  • cafc implements effective optimizations
  • procedure splitting
  • non-blocking communication
  • packing of strided communication (in ARMCI)
  • Vectorization needed to achieve true performance
    portability with machines like Cray X1

http://www.hipersoft.rice.edu/caf