Experiences with Coarray Fortran on Hardware Shared Memory Platforms

1
Experiences with Co-array Fortran on Hardware
Shared Memory Platforms
  • Yuri Dotsenko, Cristian Coarfa,
    John Mellor-Crummey, Daniel Chavarria-Miranda
  • Rice University, Houston, TX

2
Co-array Fortran
  • Global Address Space (GAS) language
  • SPMD programming model
  • Simple extension of Fortran 90
  • Explicit control over data placement and
    computation distribution
  • Private data
  • Shared data both local and remote
  • One-sided communication (PUT and GET)
  • Team and point-to-point synchronization (see the
    sketch below)
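
A sketch of these primitives: sync_all() is the team barrier that appears
in the Spark98 slides later in this deck; the point-to-point pair follows
the cafc dialect and is not shown on this slide.

    call sync_all()        ! team barrier: all images wait here
    call sync_notify(p)    ! signal image p that our data is ready
    call sync_wait(q)      ! block until image q notifies this image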

3
Co-array Fortran Example
integer :: a(10,20)[*]
if (this_image() > 1) &
    a(1:10,1:2) = a(1:10,19:20)[this_image()-1]
Copies from the left neighbor
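
The co-array reference on the right-hand side makes this a GET. A
complementary PUT places the co-array reference on the left-hand side; a
sketch, not on the original slide:

    ! PUT: push this image's rightmost columns into the right
    ! neighbor's leftmost columns
    if (this_image() < num_images()) &
        a(1:10,1:2)[this_image()+1] = a(1:10,19:20)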
4
Compiling CAF
  • Source-to-source translation
  • Prototype: Rice cafc
  • Fortran 90 pointer-based co-array representation
  • ARMCI-based data movement
  • Goal: performance transparency
  • Challenges
  • Retain CAF source-level information
  • Array contiguity, array bounds, lack of aliasing
  • Exploit efficient fine-grain communication on
    SMPs

5
Outline
  • Co-array representation and data access
  • Local data
  • Remote data
  • Experimental evaluation
  • Conclusions

6
Representation and Access for Local Data
  • Efficient local access to SAVE/COMMON co-arrays
    is crucial to achieving best performance on a
    target architecture
  • Fortran 90 pointer
  • Fortran 90 pointer to structure
  • Cray pointer
  • Subroutine argument
  • COMMON block (need support for symmetric shared
    objects)

7
Fortran 90 Pointer Representation
  • CAF declaration: real, save :: a(10,20)[*]
  • After translation:
      type T1
        integer(PtrSize) :: handle
        real, pointer :: local(:,:)
      end type T1
      type (T1) :: ca
  • Local access: ca%local(2,3)
  • Portable representation
  • Back-end compiler has no knowledge about:
  • Potential aliasing (no-alias flags for some
    compilers)
  • Contiguity
  • Bounds
  • Implemented in cafc
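
A sketch of how such a pointer might be bound to runtime-allocated
co-array storage. Here caf_allocate is a hypothetical stand-in for cafc's
ARMCI-based allocator, and c_f_pointer is Fortran 2003, which postdates
this work:

    use iso_c_binding, only: c_ptr, c_f_pointer
    type(c_ptr) :: addr
    ! Hypothetical runtime call: allocate 10x20 default reals,
    ! register them with the communication layer, and return the
    ! local base address plus an opaque handle for remote access.
    call caf_allocate(addr, ca%handle, 10*20*4)
    ! Associate the Fortran 90 pointer component with that storage.
    call c_f_pointer(addr, ca%local, [10, 20])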

8
Fortran 90 Pointer to Structure Representation
  • CAF declaration: real, save :: a(10,20)[*]
  • After translation:
      type T1
        real :: local(10,20)
      end type T1
      type (T1), pointer :: ca
  • Conveys constant bounds and contiguity
  • Potential aliasing is still a problem

9
Cray Pointer Representation
  • CAF declaration: real, save :: a(10,20)[*]
  • After translation:
      real :: a_local(10,20)
      pointer (a_ptr, a_local)
  • Conveys constant bounds and contiguity
  • Potential aliasing is still a problem
  • The Cray pointer is not part of the Fortran 90
    standard

10
Subroutine Argument Representation
  • CAF source:
      subroutine foo()
        real, save :: a(10,20)[*]
        a(i,j) = a(i-1,j)
      end subroutine foo
  • After translation:
      subroutine foo()
        ! F90 representation for co-array a
        call foo_body(ca%local(1,1), ca%handle, ...)
      end subroutine foo

      subroutine foo_body(a_local, a_handle, ...)
        real :: a_local(10,20)
        a_local(i,j) = a_local(i-1,j)
      end subroutine foo_body

11
Subroutine Argument Representation (cont.)
  • Avoid conservative assumptions about co-array
    aliasing by the back-end compiler
  • Performance is close to optimal
  • Extra procedures and procedure calls
  • Implemented in cafc

12
COMMON Block Representation
  • CAF declaration:
      real :: a(10,20)[*]
      common /a_cb/ a
  • After translation:
      real :: ca(10,20)
      common /ca_cb/ ca
  • Yields the best performance for local accesses
  • OS must support symmetric data objects (the
    co-array resides at the same address in every
    process image)

13
Outline
  • Co-array representation and data access
  • Local data
  • Remote data
  • Experimental evaluation
  • Conclusions

14
Generating CAF Communication
  • Generic parallel architectures
  • Library function calls to move data
  • Shared memory architectures (load/store)
  • Fortran 90 pointers
  • Vector of Fortran 90 pointers
  • Cray pointers

15
Communication Generation for Generic Parallel
Architectures
  • CAF code:
      a(:) = b(:)[p]
  • Translated code:
      allocate b_temp(:)
      call GET( b, p, b_temp, ... )
      a(:) = b_temp(:)
      deallocate b_temp
  • Portable: works on clusters and SMPs
  • Function overhead per fine-grain access
  • Uses temporary to hold off-processor data
  • Implemented in cafc

16
Communication Generation Using Fortran 90 Pointers
  • CAF code:
      do j = 1, N
        C(j) = A(j)[p]
      end do
  • Translated code:
      do j = 1, N
        ptrA => A(j)
        call CafSetPtr(ptrA, p, A_handle)
        C(j) = ptrA
      end do
  • Function call overhead for each reference
  • Implemented in cafc

17
Pointer Initialization Hoisting
  • Naïvely translated code:
      do j = 1, N
        ptrA => A(j)
        call CafSetPtr(ptrA, p, A_handle)
        C(j) = ptrA
      end do
  • Code with hoisted pointer initialization:
      ptrA => A(1:N)
      call CafSetPtr(ptrA, p, A_handle)
      do j = 1, N
        C(j) = ptrA(j)
      end do
  • Pointer initialization hoisting is not yet
    implemented in cafc

18
Communication Generation Using Vector of Fortran
90 Pointers
  • CAF code:
      do j = 1, N
        C(j) = A(j)[p]
      end do
  • Translated code (initialization elided; see the
    sketch after this slide's bullets):
      do j = 1, N
        C(j) = ptrVectorA(p)%ptrA(j)
      end do
  • end do
  • Does not require pointer initialization hoisting
    and avoids function calls
  • Worse performance than that of hoisted pointer
    initialization
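
The slides elide the initialization; a minimal sketch of the per-image
pointer table the loop implies (the type name ptr_t is hypothetical; only
ptrVectorA and ptrA come from the slide):

    ! One element per image: each ptrA is associated once, at startup,
    ! with image p's co-array A, so the loop body needs no function
    ! call and no per-iteration pointer setup.
    type ptr_t
      real, pointer :: ptrA(:)
    end type ptr_t
    type(ptr_t), allocatable :: ptrVectorA(:)   ! indexed by image number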

19
Communication Generation Using Cray Pointers
  • CAF code:
      do j = 1, N
        C(j) = A(j)[p]
      end do
  • Translated code:
      integer(PtrSize) :: addrA(:)
      pointer (ptrA, A_rem)   ! Cray pointer: A_rem aliases address ptrA
      ! ... addrA initialization ...
      do j = 1, N
        ptrA = addrA(p)
        C(j) = A_rem(j)
      end do
  • addrA(p) is the address of co-array A on image p
  • Cray pointer initialization hoisting yields only
    marginal improvement

20
Outline
  • Co-array representation and data access
  • Local data
  • Remote data
  • Experimental evaluation
  • Conclusions

21
Experimental Platforms
  • SGI Altix 3000
  • 128 Itanium2 processors (1.5 GHz, 6 MB L3 cache)
  • Linux (2.4.21 kernel)
  • Intel Fortran Compiler 8.0
  • SGI Origin 2000
  • 16 MIPS R12000 processors (350 MHz, 8 MB L2 cache)
  • IRIX64 6.5
  • MIPSpro Compiler 7.3.1.3m

22
Benchmarks
  • STREAM
  • Random Access
  • Spark98
  • NAS MG and SP

23
STREAM
  • Copy kernel (local and remote versions):
      DO J = 1, N                DO J = 1, N
        C(J) = A(J)                C(J) = A(J)[p]
      END DO                     END DO
  • Triad kernel (local and remote versions):
      DO J = 1, N                DO J = 1, N
        A(J) = B(J) + s*C(J)       A(J) = B(J)[p] + s*C(J)[p]
      END DO                     END DO
  • Goal: investigate how well architecture bandwidth
    can be delivered up to the language level

24
STREAM Local Accesses
  • COMMON block representation is the best, where the
    platform allows it
  • Subroutine argument representation has performance
    similar to the COMMON block representation
  • Pointer-based representations perform within 5% of
    the best on the Altix (with the no-aliasing flag)
    and within 15% on the Origin
  • The Fortran 90 pointer representation yields only
    30% of the best performance on the Altix without a
    flag asserting the lack of pointer aliasing
  • Array section statements with the Fortran 90 pointer
    representation yield 40-50% of the best performance
    on the Origin

25
STREAM Remote Accesses
  • COMMON block representation for local accesses
    combined with Cray pointers for remote accesses is
    the best
  • Subroutine argument representation combined with
    Cray pointers for remote accesses has similar
    performance
  • Remote accesses with a function call per access
    yield very poor performance (24 times slower than
    the best on the Altix, five times slower on the
    Origin)
  • The generic strategy (with intermediate temporaries)
    delivers only 50-60% of the best performance on the
    Altix and 30-40% on the Origin for vectorized code
    (except for the Copy kernel)
  • Pointer initialization hoisting is crucial for
    Fortran 90 pointer remote accesses and desirable
    for Cray pointers
  • A similarly coded OpenMP version has comparable
    performance on the Altix (90% for the Scale kernel)
    and 86-90% on the Origin
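
For reference, a "similarly coded" OpenMP Copy kernel presumably looks
like the sketch below; the slides do not show the OpenMP source:

    !$omp parallel do
    do j = 1, n
      c(j) = a(j)   ! a and c are shared; data placement is left
    end do          ! to the runtime rather than the program
    !$omp end parallel do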

26
Spark98
  • Based on CMU's earthquake simulation code
  • Computes a sparse matrix-vector product
  • Irregular application with fine-grain accesses
  • Matrix distribution and computation partitioning
    are done offline (sf2 traces)
  • Spark98 computes the partial product locally, then
    assembles the result across processors

27
Spark98 (cont.)
  • Versions
  • Serial (Fortran kernel, ported from C)
  • MPI (Fortran kernel, ported from C)
  • Hybrid (best shared memory threaded version)
  • CAF versions (based on MPI version)
  • CAF Packed PUTs
  • CAF Packed GETs
  • CAF GETs (computation with remote data accessed
    in place)

28
Spark98 GETs Result Assembly
      v2(:,:) = v(:,:)
      call sync_all()
      do s = 0, subdomains-1
        if (commindex(s) < commindex(s+1)) then
          pos = commindex(s)
          comm_len = commindex(s+1) - pos
          v(:, comm(pos:pos+comm_len-1)) =          &
            v(:, comm(pos:pos+comm_len-1)) +        &
            v2(:, comm_gets(pos:pos+comm_len-1))[s]
        end if
      end do
      call sync_all()

30
Spark98 Performance on Altix
  • Performance of all CAF versions is comparable to
    that of MPI, and better at large CPU counts
  • CAF GETs is simpler and more natural to code,
    but up to 13% slower
  • Without attention to locality, applications do not
    scale on NUMA architectures (Hybrid version)
  • The ARMCI library is more efficient than MPI

31
NAS MG and SP
  • Versions
  • MPI (NPB 2.3)
  • CAF (based on MPI NPB 2.3)
  • Generic code generation with the subroutine-argument
    co-array representation (procedure splitting)
  • Shared-memory code generation (Fortran 90 pointers,
    vectorized source code) with the subroutine-argument
    co-array representation
  • OpenMP (NPB 3.0)
  • Class C

32
NAS SP Performance on Altix
  • Performance of CAF versions is comparable to that
    of MPI
  • CAF-generic outperforms CAF-shm because it uses
    memcpy, which hides latency by keeping an optimal
    number of memory operations in flight
  • OpenMP scales poorly

33
NAS MG Performance on Altix
34
Conclusions
  • Direct load/store communication improves the
    performance of fine-grain accesses by a factor of
    24 on the Altix 3000 and five on the Origin 2000
  • In-place use of remote data in CAF statements
    incurs acceptable abstraction overhead
  • Performance is comparable to that of MPI codes for
    fine- and coarse-grain applications
  • We plan to implement optimal, architecture-dependent
    code generation for local and remote co-array
    accesses in cafc

35
  • www.hipersoft.rice.edu/caf