Transcript and Presenter's Notes

Title: Co-array Fortran: Compilation, Performance, Language Issues


1
Co-array Fortran: Compilation, Performance, Language Issues
  • John Mellor-Crummey
  • Cristian Coarfa, Yuri Dotsenko
  • Department of Computer Science
  • Rice University

2
Outline
  • Co-array Fortran language recap
  • Compilation approach
  • Co-array storage management
  • Communication
  • A preliminary performance study
  • Platforms
  • Benchmarks, results, and lessons
  • Language refinement issues
  • Conclusions

3
CAF Language Assessment
  • Strengths
  • offloads communication management to the compiler
  • choreographing data transfer
  • managing mechanics of synchronization
  • gives user full control of parallelization
  • data movement and synchronization as language
    primitives
  • amenable to compiler optimization
  • array syntax supports natural user-level
    vectorization
  • modest compiler technology can yield good
    performance
  • more abstract than MPI → better performance portability
  • Weaknesses
  • user manages partitioning of work
  • user specifies data movement
  • user codes necessary synchronization

4
Compiler Goals
  • Portable compiler
  • Multi-platform code generation
  • High performance generated code

5
Compilation Approach
  • Source-to-source Translation
  • Translate CAF into Fortran 90 + communication calls
  • One-sided communication layer
  • strided communication
  • gather/scatter
  • synchronization: barriers, notify/wait
  • split-phase non-blocking primitives
  • Today: ARMCI remote memory copy interface (Nieplocha @ PNL)
  • Benefits
  • wide portability
  • leverage vendor F90 compilers for good node
    performance

6
Co-array Data
  • Co-array representation
  • F90 pointer to data + opaque handle for the communication layer (see the sketch below)
  • Co-array access
  • read/write local co-array data using F90 pointer
    dereference
  • remote accesses translate into ARMCI GET/PUT
    calls
  • Co-array allocation
  • storage allocation by the communication layer, as appropriate
  • on shared-memory hardware: in a shared memory segment
  • on Myrinet: in pinned memory for direct DMA access
  • dope vector initialization using CHASM (Rasmussen @ LANL)
  • set F90 pointer to point to externally managed
    memory
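
A minimal sketch of this representation, with an invented type name and fields (the actual cafc runtime descriptor is not shown on these slides):

      ! Illustrative only: pair an F90 pointer with an opaque handle
      ! that the communication layer (e.g., ARMCI) uses to name the data.
      type caf_var
        real, pointer :: local_data(:,:)   ! local reads/writes via pointer
        integer(8)    :: comm_handle       ! opaque handle for GET/PUT calls
      end type caf_var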

7
Allocating Static Co-arrays (COMMON/SAVE)
  • Compiler
  • generate static initializer for each common/save
    variable
  • Linker
  • collect calls to all initializers
  • generate global initializer that calls all others
  • compile global initializer and link into program
  • Launch
  • call global initializer before main program
    begins

Similar to the handling of C static constructors
8
COMMON Block Sequence Association
  • Problem
  • each procedure may have a different view of a common block
  • Solution
  • allocate a contiguous pool of co-array storage per common block
  • each procedure has a private set of view variables (F90 pointers), as illustrated below
  • initialize all per-procedure view variables once at launch, after the common pool is allocated
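
A hedged sketch of the sequence-association problem (procedure and common-block names invented); the translation gives each procedure its own F90 pointer view into a single co-array pool allocated for the common block:

      ! Two legal but different views of the same COMMON storage:
      subroutine sweep_x()
        real :: a(100)[*]
        common /field/ a            ! sees /field/ as a 100-element co-array
      end subroutine sweep_x

      subroutine sweep_xy()
        real :: b(10,10)[*]
        common /field/ b            ! sees the same storage as a 10x10 co-array
      end subroutine sweep_xy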

9
Porting to a new Compiler / Architecture
  • Synthesize dope vectors for co-array storage
  • compiler/architecture-specific details handled by the CHASM library
  • Tailor communication to architecture
  • design supports alternate communication libraries
  • status
  • today: ARMCI (PNL)
  • ongoing work: compiler-tailored communication
  • direct load/store on shared-memory architectures
  • future
  • other portable libraries (e.g., GASNet)
  • custom communication library for an architecture

10
Supporting Multiple Co-dimensions
  • A(:,:)[N,M,*]
  • Add precomputed coefficients to co-array
    meta-data
  • Lower, upper bounds for each co-dimension
  • this_image_cache for each co-dimension
  • e.g., this_image(a,1) yields my co-row index
  • cum_hyperplane_size for each co-dimension (see the sketch below)
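
A self-contained sketch of the index arithmetic such cached metadata supports, assuming co-lower bounds of 1 and images numbered from 1 (variable names are illustrative, not cafc internals):

      program codim_demo
      implicit none
      integer, parameter :: n = 4, m = 3   ! co-shape [n,m,*]
      integer :: me, c1, c2, c3, rem
      me  = 7                              ! pretend this_image() returned 7
      rem = me - 1
      c1  = mod(rem, n) + 1                ! this_image(a,1): my co-row index
      rem = rem / n
      c2  = mod(rem, m) + 1                ! this_image(a,2)
      c3  = rem / m + 1                    ! this_image(a,3)
      print *, 'image', me, 'has co-indices', c1, c2, c3
      end program codim_demo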

11
Implementing Communication
  • Given a statement
  • X(1:n) = A(1:n)[p]
  • A temporary buffer is used for off-processor data
  • invoke the communication library to allocate tmp in suitable temporary storage
  • dope vector filled in so tmp can be accessed as an F90 pointer
  • call the communication library to fill in tmp (ARMCI GET)
  • X(1:n) = tmp(1:n)
  • deallocate tmp (see the sketch below)
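
A hedged sketch of the shape of the generated code; the runtime entry points (caf_allocate_tmp, caf_get, caf_free_tmp) are hypothetical stand-ins for the cafc/ARMCI interface, not its actual API:

      ! original CAF statement:   X(1:n) = A(1:n)[p]
      real, pointer :: tmp(:)
      call caf_allocate_tmp(tmp, n)       ! tmp placed in registered memory,
                                          ! dope vector set so tmp is an F90 pointer
      call caf_get(tmp, a_handle, p, n)   ! one-sided GET from image p (ARMCI)
      X(1:n) = tmp(1:n)                   ! plain F90 copy into X
      call caf_free_tmp(tmp)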

12
CAF Compiler Status
  • Near production-quality F90 front end from Open64
  • being enhanced to meet needs of this project and
    others
  • Working prototype for CAF core features
  • Co-array communication
  • inserted around statements with co-array accesses
  • currently no optimization

13
Supported Features
  • Declarations
  • co-objects: scalars and arrays
  • COMMON and SAVE co-objects of primitive types
  • INTEGER(4), REAL(4) and REAL(8)
  • COMMON blocks: variables and co-objects intermixed
  • co-objects with multiple co-dimensions
  • procedure interface blocks with co-array
    arguments
  • Executable code
  • array section notation for co-array data indices
  • local and remote co-arrays
  • co-array argument passing
  • co-array dummy arguments require explicit
    interface
  • co-array passed as F90 pointer + communication handle
  • co-array reshaping supported
  • CAF intrinsics
  • Image inquiry: this_image(), num_images()
  • Synchronization: sync_all, sync_team, sync_notify, sync_wait

14
Coming Attractions
  • Allocatable co-arrays
  • REAL(8), ALLOCATABLE :: X(:)[:]
  • ALLOCATE(X(MYX_NUM)[*])
  • Co-arrays of user-defined types
  • Allocatable co-array components
  • user defined type with pointer components
  • Triplets in co-dimensions
  • A(j,k)[p1:p4]

15
CAF Compiler Targets (May 2004)
  • Pentium + Ethernet workstations, Linux32 RedHat 7.1
  • Itanium2 + Myrinet, Linux64 RedHat 7.1
  • Itanium2 + Quadrics, Linux64 RedHat 7.1
  • SGI Altix 3000/Itanium2, Linux64 RedHat 7.2
  • Alphaserver SC Quadrics, OSF1 Tru64 V5.1A
  • SGI Origin 2000/MIPS, IRIX64 6.5

16
A Preliminary Performance Study
  • Platforms
  • Alpha + Quadrics QSNet (Elan3)
  • Itanium2 + Quadrics QSNet II (Elan4)
  • Itanium2 + Myrinet 2000
  • Codes
  • NAS Parallel Benchmarks (NPB) from NASA Ames

17
Alpha + Quadrics Platform (Lemieux)
  • Nodes: 750 Compaq AlphaServer ES45 4-way SMP nodes
  • 1-GHz Alpha EV68 (21264C), 64KB/8MB L1/L2 cache
  • 4 GB RAM/node
  • Interconnect: Quadrics QSNet (Elan3)
  • 340 MB/s peak and 210 MB/s sustained x 2 rails
  • Operating System: Tru64 Unix 5.1A, SC 2.5
  • Compiler: HP Fortran Compiler V5.5A
  • Communication Middleware: ARMCI 1.1-beta

18
Itanium2 + Quadrics Platform (PNNL)
  • Nodes: 944 HP Longs Peak dual-CPU workstations
  • 1.5 GHz Itanium2, 32KB/256KB/6MB L1/L2/L3 cache
  • 6 GB RAM/node
  • Interconnect: Quadrics QSNet II
  • 905 MB/s
  • Operating System: Red Hat Linux, kernel 2.4.20
  • Compiler: Intel Fortran Compiler v7.1
  • Communication Middleware: ARMCI 1.1-beta

19
Itanium2 + Myrinet Platform (Rice)
  • Nodes: 96 HP zx6000 dual-CPU workstations
  • 900 MHz Itanium2, 32KB/256KB/1.5MB L1/L2/L3 cache
  • 4 GB RAM/node
  • Interconnect: Myrinet 2000
  • 240 MB/s
  • GM version 1.6.5
  • MPICH-GM version 1.2.5
  • Operating System: Red Hat Linux, kernel 2.4.18 + patches
  • Compiler: Intel Fortran Compiler v7.1
  • Communication Middleware: ARMCI 1.1-beta

20
NAS Parallel Benchmarks (NPB) 2.3
  • Benchmarks by NASA Ames
  • 2-3K lines each (Fortran 77)
  • Widely used to test parallel compiler performance
  • NAS versions
  • NPB2.3b2: hand-coded MPI
  • NPB2.3-serial: serial code extracted from the MPI version
  • Our version
  • NPB2.3-CAF: CAF implementation, based on the MPI version

21
NAS BT
  • Block tridiagonal solve of the 3D Navier-Stokes equations
  • Dense matrix
  • Parallelization
  • alternating line sweeps along 3 dimensions
  • multipartitioning data distribution for full
    parallelism
  • MPI implementation
  • asynchronous send/receive
  • communication/computation overlap
  • CAF communication
  • strided blocks transferred using vector PUTs
    (triplet notation)
  • no user-declared communication buffers
  • Large messages, relatively infrequent
    communication

22
NAS BT Efficiency (Class C)
  • Lesson
  • Trade-off: buffers vs. synchronization
  • more buffers → less synchronization
  • less synchronization → improved performance

23
NAS SP
  • Scalar pentadiagonal solve of the 3D Navier-Stokes equations
  • Dense matrix
  • Parallelization
  • alternating line sweeps along 3 dimensions
  • multipartitioning data distribution for full
    parallelism
  • MPI implementation
  • asynchronous send/receive
  • communication/computation overlap
  • CAF communication
  • pack into buffers: a separate buffer for each plane of the sweep
  • transfer using PUTs
  • smaller, more frequent messages: 1.5x the communication volume of BT

24
NAS SP Efficiency (Class C)
  • Lesson
  • Inability to overlap communication with computation across procedure calls hurts performance

25
NAS MG
  • 3D Multigrid solver with periodic boundary
    conditions
  • Dense matrix
  • Grid size and levels are compile-time constants
  • Communication
  • nearest-neighbor, with up to 6 neighbors
  • MPI: asynchronous send/receive
  • CAF (see the sketch below)
  • pairwise sync_notify/sync_wait to coordinate with neighbors
  • four communication buffers (co-arrays) used: 2 sender-side, 2 receiver-side
  • pack and transfer contiguous data using PUTs
  • for each dimension:
  • notify my neighbors that my buffers are free
  • wait for my neighbors to notify me that their buffers are free
  • PUT data into the right neighbor's buffer, notify that neighbor
  • PUT data into the left neighbor's buffer, notify that neighbor
  • wait for both transfers to complete
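
A hedged sketch of this exchange pattern in the CAF dialect used in this talk (sync_notify/sync_wait intrinsics); buffer names, sizes, and the packing step are illustrative, not the actual NPB2.3-CAF source:

      integer, parameter :: NB = 64                    ! assumed face size
      real, save :: send_l(NB)[*], send_r(NB)[*]       ! packed outgoing faces
      real, save :: recv_l(NB)[*], recv_r(NB)[*]       ! co-array receive buffers
      integer :: left, right                           ! neighbor image numbers

      call sync_notify(left)                     ! my receive buffers are free
      call sync_notify(right)
      call sync_wait(left)                       ! neighbors' buffers are free
      call sync_wait(right)
      recv_l(:)[right] = send_r(:)               ! PUT my right face rightward
      call sync_notify(right)                    ! tell neighbor data arrived
      recv_r(:)[left]  = send_l(:)               ! PUT my left face leftward
      call sync_notify(left)
      call sync_wait(left)                       ! wait for both incoming faces
      call sync_wait(right)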

26
NAS MG Efficiency (Class C)
  • Lessons
  • Replacing barriers with point-to-point synchronization can boost performance by 30%
  • Converting GETs into PUTs also improved
    performance

27
NAS LU
  • Solve the 3D Navier-Stokes equations using SSOR
  • Dense matrix
  • Parallelization on a power-of-2 number of processors
  • repeated decomposition in x and y until all processors are assigned
  • wavefront parallelism; small messages, 5 words each
  • MPI implementation
  • asynchronous send/receive
  • communication/computation overlap
  • CAF
  • two dimensional co-arrays
  • morphed code to pack data for higher
    communication efficiency
  • uses PUTs

28
NAS LU Efficiency (Class C)
  • Lessons
  • Morphing to pack data and use PUTs was hard!
  • Compiler had better pitch in and handle the dirty
    work

29
NAS CG
  • Conjugate gradient solve to compute an eigenvector of a large, sparse, symmetric, positive definite matrix
  • MPI
  • Irregular point-to-point messaging
  • CAF structure follows MPI
  • Irregular notify/wait
  • vector assignments for data transfer
  • No communication/computation overlap for either

30
NAS CG Efficiency (Class C)
  • Lessons
  • aggregation and vectorization are critical for
    high performance communication
  • memory layout of buffers and arrays might require
    thorough analysis and optimization

31
CAF GET vs. PUT Communication
  • Definitions
  • GET: q_caf(n1:n2) = w(m1:m2)[reduce_exch_proc_noncaf(i)]
  • PUT: q_caf(n1:n2)[reduce_exch_proc_noncaf(i)] = w(m1:m2)
  • Study
  • 64 procs, NAS CG class C
  • Alpha + Quadrics Elan3 (Lemieux)
  • Performance
  • GET: 12.9% slower than MPI
  • PUT: 4.0% slower than MPI

In general, PUT is faster than GET
32
Experiments Summary
  • On cluster-based architectures, to achieve best
    performance with CAF, a user or compiler must
  • vectorize (and perhaps aggregate) communication (see the sketch after this list)
  • reduce synchronization strength
  • replace all-to-all with point-to-point where
    sensible
  • overlap communication with computation
  • convert GETs into PUTs where GETs are not a hardware primitive
  • consider memory layout conflicts: co-array vs. regular data
  • generate code amenable to back-end compiler optimizations
  • CAF language: many optimizations are possible at the source level
  • Compiler optimizations are NECESSARY for a portable coding style
  • might need user hints where synchronization analysis falls short
  • Runtime issues
  • on Myrinet: pin co-array memory for direct transfers
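
A minimal sketch of what vectorizing communication means at the source level; the array names, sizes, and image index p are illustrative:

      ! Fine-grained: one small GET per iteration (slow on clusters)
      do i = 1, n
        x(i) = y(i)[p]
      end do

      ! Vectorized: a single bulk GET for the whole section
      x(1:n) = y(1:n)[p]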

33
CAF Language Refinement Issues
  • Initial implementations on Cray T3E and X1 led to
    features not suited for distributed memory
    platforms
  • Key problems and solution suggestions
  • Restrictive memory fence semantics for procedure
    calls
  • pragmas to enable programmer to overlap one-sided
    communication with procedure calls
  • Overly restrictive synchronization primitives
  • add unidirectional, point-to-point
    synchronization
  • rework team model (next slide)
  • No collective operations
  • Leads to home-brew non-portable implementations
  • add CAF intrinsics for reductions, broadcast,
    etc.

34
CAF Language Refinement Issues
  • CAF dynamic teams do not scale
  • pre-arranged communicator-like teams
  • would help collectives: O(log P) rather than O(P^2)
  • reordering logical numbering of images for
    topology
  • add shape information to image teams?
  • Blocking communication reduces scalability
  • user mechanisms to delay completion to enable
    overlap?
  • Synchronization is not paired with data movement
  • synchronization hint tags to help analysis
  • synchronization tags at run-time to track
    completion?
  • How relaxed should the memory model be for
    performance?

35
Conclusions
  • Tuned CAF performance is comparable to tuned MPI
  • even without compiler-based communication
    optimizations!
  • CAF programming model enables source-level
    optimization
  • communication vectorization
  • synchronization strength reduction
  • achieve performance today rather than waiting for tomorrow's compilers
  • CAF is amenable to compiler analysis and
    optimization
  • significant communication optimization is
    feasible, unlike for MPI
  • optimizing compilers will help a wider range of
    programs achieve high performance
  • applications can be tailored to fully exploit
    architectural characteristics
  • e.g., shared memory vs. distributed memory vs.
    hybrid
  • However, more abstract programming models would
    simplify code development (e.g. HPF)

36
Project URL
  • http://www.hipersoft.rice.edu/caf

41
Parallel Programming Models
  • Goals
  • Expressiveness
  • Ease of use
  • Performance
  • Portability
  • Current models
  • OpenMP: difficult to map onto distributed memory platforms
  • HPF: difficult to obtain high performance on a broad range of programs
  • MPI: the de facto standard; hard to program, and assumptions about communication granularity are hard-coded
  • UPC: a global address space language similar to CAF, but with location transparency

42
Finite Element Example (Numrich)
      subroutine assemble(start, prin, ghost, neib, x)
      integer :: start(:), prin(:), ghost(:), neib(:), k1, k2, p
      real    :: x(:)[*]
      call sync_all(neib)
      do p = 1, size(neib)              ! Add contributions from neighbors
        k1 = start(p); k2 = start(p+1) - 1
        x(prin(k1:k2)) = x(prin(k1:k2)) + x(ghost(k1:k2))[neib(p)]
      enddo
      call sync_all(neib)
      do p = 1, size(neib)              ! Update the neighbors
        k1 = start(p); k2 = start(p+1) - 1
        x(ghost(k1:k2))[neib(p)] = x(prin(k1:k2))
      enddo
      call sync_all
      end subroutine assemble

43
Communicating Private Data
  • Example
  • REAL A(100,100)[*], B(100)
  • A(:,j)[p] = B(:)
  • Issue
  • B is a local array
  • B is sent to a partner
  • Will require a copy into shared space before the transfer
  • For higher efficiency, we want B in shared storage
  • Alternatives (see the sketch after this list)
  • Declare communicated arrays as co-arrays
  • Add a "communicated" attribute to B's declaration
  • mark it for allocation in shared storage
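
A minimal sketch of the first alternative, declaring the communicated array as a co-array so no staging copy is needed; shapes and names follow the example above:

      REAL :: A(100,100)[*]
      REAL :: B(100)[*]          ! B now lives in remotely accessible storage
      A(:,j)[p] = B(:)           ! transfer directly, no intermediate copy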

44
Passing Co-arrays as Arguments
  • Language restriction: co-arrays are passed as whole arrays
  • REAL A(100,100)[*]
  • CALL FOO(A)
  • Callee must declare an explicit subroutine
    interface
  • Proposed option: F90 assumed-shape co-array arguments
  • Allow passing of Fortran 90 style array sections
    of local co-array
  • REAL A(100,100)[*]
  • CALL FOO(A(1:10:2, 3:25))
  • Callee must declare an explicit subroutine
    interface
  • If the matching dummy argument is declared as a co-array, then (see the sketch after this list):
  • must declare assumed-size data dimensions
  • must declare assumed-size co-dimensions
  • avoids copy-in/copy-out for co-array data
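
A hedged sketch of the explicit interface the callee would declare under the rules listed above; FOO comes from the example, and the dummy name D is invented:

      INTERFACE
        SUBROUTINE FOO(D)
          REAL :: D(100,*)[*]    ! assumed-size data and co-dimensions
        END SUBROUTINE FOO
      END INTERFACE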

45
Co-array Fortran (CAF)
  • Explicitly-parallel extension of Fortran 90/95
  • defined by Numrich & Reid
  • Global address space SPMD parallel programming
    model
  • one-sided communication
  • Simple, two-level memory model for locality
    management
  • local vs. remote memory
  • Programmer control over performance critical
    decisions
  • data partitioning
  • communication
  • Suitable for mapping to a range of parallel
    architectures
  • shared memory, message passing, hybrid, PIM

46
CAF Programming Model Features
  • SPMD process images
  • fixed number of images during execution
  • images operate asynchronously
  • Both private and shared data
  • real :: x(20,20)      ! a private 20x20 array in each image
  • real :: y(20,20)[*]   ! a shared 20x20 array in each image
  • Simple one-sided shared-memory communication (see the example after this list)
  • x(:,j:j+2) = y(:,p:p+2)[r]   ! copy columns p:p+2 from image r into local columns
  • Synchronization intrinsic functions
  • sync_all: a barrier and a memory fence
  • sync_mem: a memory fence
  • sync_team(notify, wait)
  • notify: a vector of process ids to signal
  • wait: a vector of process ids to wait for, a subset of notify
  • Pointers and (perhaps asymmetric) dynamic
    allocation
  • Parallel I/O
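
A minimal, hedged example tying these features together, written in the original CAF dialect used throughout this talk (sync_all as an intrinsic call rather than the later Fortran 2008 statement); the neighbor choice and array sizes are illustrative:

      program caf_features_demo
      implicit none
      real    :: y(20,20)[*]        ! shared: one copy per image
      real    :: x(20,20)           ! private to each image
      integer :: me, np, r
      me = this_image()
      np = num_images()
      r  = mod(me, np) + 1          ! circular right neighbor
      y  = real(me)                 ! initialize my shared copy
      call sync_all()               ! barrier + fence: everyone's y is ready
      x(:,1:3) = y(:,1:3)[r]        ! one-sided copy of columns from image r
      call sync_all()
      print *, 'image', me, 'of', np, 'read', x(1,1), 'from image', r
      end program caf_features_demo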