Ernest Orlando Lawrence Berkeley National Laboratory
1. Implementing a Global Address Space Language on the Cray X1: the Berkeley UPC Experience
Christian Bell and Wei Chen
CS252 Class Project, December 10, 2003
2. Outline
- An Overview of UPC and the Berkeley UPC Compiler
- Overview of the Cray X1
- Implementing the GASNet layer on the X1
- Implementing the runtime layer on the X1
- Serial performance
- Evaluation of compiler optimizations
3. Unified Parallel C (UPC)
- UPC is an explicitly parallel global address space language with SPMD parallelism
- An extension of ISO C
- User-level shared memory, partitioned by threads
- One-sided (bulk and fine-grained) communication through reads/writes of shared variables
[Figure: the UPC memory model: the shared portion of the global address space is partitioned among threads (x0, x1, ..., xP), while each thread also keeps private memory]
4. Shared Arrays and Pointers in UPC
[Figure: layout of A, B, and C over two threads:
  T0: A[0] A[2] A[4] | B[0] B[1] B[4] B[5] | C[0] C[1] C[2]
  T1: A[1] A[3] A[5] | B[2] B[3] B[6] B[7] |]
- Cyclic: shared int A[n]
- Block-cyclic: shared [2] int B[n]
- Indefinite: shared [0] int *C = (shared [0] int *) upc_alloc(n)
- Use a pointer-to-shared to access shared data (see the sketch below)
- The block size is part of the pointer type
- A generic pointer-to-shared contains: address, thread id, phase
- Cyclic and indefinite pointers are phaseless
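To make the layouts concrete, here is a minimal UPC sketch of the three declarations and of pointer arithmetic over the block-cyclic array (assuming 2 threads to match the figure; this example is ours, not from the slides):

    #include <upc.h>

    #define N 8
    shared int A[N];        /* cyclic: block size 1 */
    shared [2] int B[N];    /* block-cyclic: block size 2 */

    int main(void) {
        /* Indefinite: block size 0, all elements with affinity to the caller */
        shared [0] int *C = (shared [0] int *) upc_alloc(3 * sizeof(int));

        shared [2] int *p = &B[0];  /* thread 0, phase 0 */
        p++;                        /* thread 0, phase 1 */
        p++;                        /* block exhausted: thread 1, phase 0 */

        upc_free(C);                /* each thread frees its own allocation */
        return 0;
    }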
5. Accessing Shared Memory in UPC
[Figure: resolving a pointer-to-shared in shared memory: the thread id selects among Thread 0 .. Thread N-1, addr locates the start of the block relative to the start of the array object, and the phase (0 .. block size - 1, here 2) is the element offset within that block]
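As a rough illustration of the generic representation, a struct pointer-to-shared and its address resolution might look like the following (field and helper names are hypothetical, not the Berkeley UPC runtime's actual identifiers):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uintptr_t addr;    /* offset of the current block in the owner's shared segment */
        unsigned  thread;  /* owning thread id */
        unsigned  phase;   /* element offset within the block */
    } sptr_t;

    /* shared_base[t]: where thread t's shared segment is mapped locally
       (a hypothetical table; on shared-memory hardware such a mapping exists). */
    extern char *shared_base[];

    static void *sptr_to_local(sptr_t p, size_t elem_size) {
        return shared_base[p.thread] + p.addr + p.phase * elem_size;
    }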
6. UPC Programming Model Features
- Block-cyclically distributed arrays
- Shared and private pointers
- Global synchronization: barriers
- Pair-wise synchronization: locks
- Parallel loops (upc_forall); see the fragment below
- Dynamic shared memory allocation
- Bulk shared memory accesses
- Strict vs. relaxed memory consistency models
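A small UPC fragment exercising several of these features (our sketch, not from the slides):

    #include <upc.h>

    #define N 100
    shared int v[N];          /* cyclically distributed across threads */

    int main(void) {
        /* Parallel loop: iteration i runs on the thread with affinity to v[i] */
        upc_forall (int i = 0; i < N; i++; &v[i])
            v[i] = i;

        upc_barrier;          /* global synchronization */
        return 0;
    }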
7. Overview of the Berkeley UPC Compiler
- Two goals: portability and high performance
[Figure: the compilation stack:
  UPC Code
    -> Translator (platform-independent; lowers UPC into ANSI C)
  Translator-Generated C Code
    -> Berkeley UPC Runtime System (network- and compiler-independent; shared memory management and pointer operations)
    -> GASNet Communication System (language-independent; uniform get/put interface for the underlying networks)
  Network Hardware]
8. A Layered Design
- Portable
  - C is our intermediate language (see the schematic below)
  - Can run on top of MPI (with a performance penalty)
  - GASNet has a layered design with a small core
- High-performance
  - The native C compiler optimizes serial code
  - The translator can perform high-level communication optimizations
  - GASNet can access network hardware directly and provides a rich set of communication/synchronization primitives
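To see what the layering means for generated code, consider a shared read x = v[i]: the translator emits C that asks the runtime to resolve the pointer-to-shared, and the runtime performs the transfer through GASNet. A hand-written analogue of that bottom layer (standard GASNet-1 call; the node/address resolution stands in for runtime logic we do not reproduce):

    #include <gasnet.h>

    /* Read one int from 'node' at remote address 'raddr' (the runtime would
       compute both from the pointer-to-shared before reaching this point). */
    static int get_shared_int(gasnet_node_t node, void *raddr) {
        int x;
        gasnet_get(&x, node, raddr, sizeof(int));   /* blocking get */
        return x;
    }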
9. Outline
- An Overview of UPC and the Berkeley UPC Compiler
- Overview of the Cray X1
- Implementing the GASNet layer on the X1
- Implementing the runtime layer on the X1
- Serial performance
- Evaluation of compiler optimizations
10. The Cray X1 Architecture
- A new line of vector architecture
- Two modes of operation
  - SSP: up to 16 CPUs/node
  - MSP: multistreams long loops
- Single-node UMA, multi-node NUMA (no caching of remote data)
- Global pointers
  - Low latency, high bandwidth
  - All gets/puts must be loads/stores (directly or via the shmem interface)
  - Only puts are non-blocking; gets are blocking
- Vectorization is crucial (see the sketch below)
  - Vector pipeline is 2x faster than scalar
  - Utilization of memory bandwidth
  - Strided accesses, scatter-gather, reductions, etc.
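For instance, a put on the X1 can be an ordinary store through a pointer to remote memory, and the whole loop can vectorize. A minimal sketch (we assume the Cray #pragma ivdep and _gsync() intrinsic spellings; treat those as assumptions about the toolchain):

    /* remote points into another node's memory via a global pointer */
    void put_block(double *restrict remote, const double *restrict src, int n) {
    #pragma ivdep
        for (int i = 0; i < n; i++)
            remote[i] = src[i];   /* each store is a non-blocking remote put */
        _gsync();                 /* fence: wait for all outstanding puts */
    }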
11. Outline
- An Overview of UPC and the Berkeley UPC Compiler
- Overview of the Cray X1
- Implementing the GASNet layer on the X1
- Implementing the runtime layer on the X1
- Serial performance
- Evaluation of compiler optimizations
12. GASNet Communication System: Architecture
- Two-level architecture to ease implementation
- Core API
  - Based heavily on Active Messages
  - Compatibility layer
  - Ported to the X1 in 2 days, with a new algorithm to manipulate queues in shared memory
- Extended API
  - Wider interface that includes more complicated operations (puts, gets)
  - A reference implementation of the extended API in terms of the core API is provided
  - The current revision is tuned especially for the X1, with shared memory as the primary focus (minimal overhead)
[Figure: the GASNet stack: compiler-generated code, over a compiler-specific runtime system, over the GASNet Extended API, over the GASNet Core API, over the network hardware]
13. GASNet Extended API: Remote Memory Operations
- GASNet offers expressive put/get primitives (see the sketch below)
  - All gets/puts can be blocking or non-blocking
  - Non-blocking can be explicit (handle-based)
  - Non-blocking can be implicit (global or region-based)
  - Synchronization can poll or block
  - Paves the way for complex split-phase communication (compiler optimizations)
- The Cray X1 uses shared memory exclusively
  - All gets/puts must be loads/stores
  - Only puts are non-blocking; gets are blocking
  - Very limited synchronization mechanisms
  - Efficient communication only through vectors (one order of magnitude between scalar and vector communication)
  - Vectorization instead of split-phase?
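A split-phase get in GASNet's explicit handle-based style, of the kind the compiler would like to emit (standard GASNet-1 extended API; do_independent_work is a placeholder):

    #include <gasnet.h>

    extern void do_independent_work(void);

    void fetch_with_overlap(void *dst, gasnet_node_t node, void *src, size_t n) {
        gasnet_handle_t h = gasnet_get_nb(dst, node, src, n);  /* initiate */
        do_independent_work();     /* overlap with computation not using dst */
        gasnet_wait_syncnb(h);     /* complete: block until the get is done */
    }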
14. GASNet and the Cray X1: Remote Memory Operations

GASNet operation                         | Cray X1 instruction | Comment
-----------------------------------------|---------------------|------------------------------------------
Bulk operations                          | Vector bcopy()      | Fully vectorized, suitable for GASNet/UPC
Non-bulk blocking puts                   | Store + gsync       | No vectorization
Non-bulk blocking gets                   | Load                |
Non-bulk non-blocking explicit puts/gets | Store/load + gsync  | No vectorization if sync is done in the loop
Non-bulk non-blocking implicit puts/gets | Store/load + gsync  | No vectorization if sync is done in the loop
- Flexible communication provides no benefit without vectorization (a factor of 10 between vector and scalar)
- It is difficult to expose vectorization through a layered software stack: the native C compiler now has to optimize parallel code!
- The Cray X1's big-hammer gsync() prevents interesting communication optimizations (see the sketch below)
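The "big hammer" in miniature: gsync() fences all outstanding remote stores at once, so independent puts cannot be completed separately (we write it as the _gsync() intrinsic, an assumption about the spelling):

    extern double *remote_a, *remote_b;   /* global pointers to remote memory */

    void two_puts(double x, double y) {
        remote_a[0] = x;   /* put 1: non-blocking remote store */
        remote_b[0] = y;   /* put 2: independent of put 1 */
        _gsync();          /* completes BOTH puts; there is no way to wait for
                              put 1 alone, unlike GASNet's per-handle sync */
    }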
15. GASNet/X1 Performance
- GASNet/X1 improves small-message performance
  - Minimal overhead as a portable network assembly language
- The Core API (Active Messages) solves the Cray problem of upc_global_alloc (non-collective memory allocation; see the example below)
- Synthetic benchmarks show no GASNet interference, but that is not necessarily the case for application benchmarks
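upc_global_alloc is the non-collective allocation in question: one thread allocates a shared array that spans all threads, which requires asking other nodes to reserve memory (hence Active Messages on the X1). A minimal usage sketch:

    #include <upc.h>

    shared [64] int * shared gp;   /* the pointer itself lives in shared memory */

    int main(void) {
        if (MYTHREAD == 0)
            /* Non-collective: only thread 0 calls, yet THREADS blocks of
               64 ints are carved out across all threads. */
            gp = (shared [64] int *) upc_global_alloc(THREADS, 64 * sizeof(int));
        upc_barrier;               /* everyone waits until gp is valid */
        return 0;
    }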
16. Outline
- An Overview of UPC and the Berkeley UPC Compiler
- Overview of the Cray X1
- Implementing the GASNet layer on the X1
- Implementing the runtime layer on the X1
- Serial performance
- Evaluation of compiler optimizations
17. Shared Pointer Representations
- The Cray X1 memory centrifuge is useful for UPC
  - UPC phaseless pointers can be manipulated directly as X1 global pointers allocated from the symmetric heap (see the schematic below)
- Heavy function inlining and macros remove all traces of UPC Runtime and GASNet calls
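The payoff: for phaseless pointers, arithmetic and dereference collapse to ordinary address operations on a single global pointer, since the memory centrifuge encodes the owning node in the address bits. A schematic (types and macros are illustrative, not the runtime's actual definitions):

    /* A phaseless pointer-to-shared as a bare X1 global pointer into the
       symmetric heap: no thread or phase fields are needed. */
    typedef int *phaseless_sptr_t;

    #define PHASELESS_INDEX(p, i)  ((p) + (i))   /* plain pointer arithmetic */
    #define PHASELESS_DEREF(p)     (*(p))        /* plain global load/store  */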
18. Cost of Shared Pointer Arithmetic and Accesses
19. Outline
- An Overview of UPC and the Berkeley UPC Compiler
- Overview of the Cray X1
- Implementing the GASNet layer on the X1
- Implementing the runtime layer on the X1
- Serial performance
- Evaluation of compiler optimizations
20. Serial Performance
- It's all about vectorization
  - Cray C is highly sensitive to changes in inner loops
  - We want the translator's output to be as vectorizable as the original C source
- Strategy: keep translated code syntactically close to the source (see the example below)
  - Preserve high-level loops
  - a[exp] becomes (a[exp])
  - Multidimensional arrays are linearized
  - Preserve the restrict qualifier and ivdep pragmas
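For example, an inner loop should come out of the translator still looking like something Cray C's vectorizer recognizes (a hypothetical before/after; the code Berkeley UPC actually emits differs in detail):

    /* Source-level inner loop; the translated output should preserve the
       loop shape, the array syntax, the restrict qualifiers, and the pragma. */
    void scale(double *restrict a, const double *restrict b, double s, int n) {
    #pragma ivdep
        for (int i = 0; i < n; i++)
            a[i] = s * b[i];
    }

    /* What must be avoided: lowering a[i] into chains of temporaries and
       pointer arithmetic, or splitting the loop with runtime calls, either
       of which can defeat vectorization. */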
21. Livermore Loop Kernels
22. Evaluating Communication Optimizations on the Cray X1
- Message aggregation
  - LogGP model: fewer messages mean less overhead
  - Techniques: message vectorization, coalescing, bulk prefetching
- Is this still true on the Cray X1?
  - Remote access latency is comparable to local accesses
  - Vectorization should hide most of the overhead of small messages
  - Remote data is not cache-coherent, so it may still help to store it in local buffers
- Essentially, a question of fine-grained vs. coarse-grained programming models (see the sketch below)
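Message vectorization in miniature: a loop of one-element remote reads is replaced by a single bulk get into a private buffer (a minimal UPC sketch; v is placed entirely on thread 0 so that upc_memget is legal):

    #include <upc.h>

    #define N 1024
    shared [N] double v[N];   /* one block: all of v has affinity to thread 0 */
    double local[N];

    double sum_fine(void) {            /* fine-grained: N small remote reads */
        double s = 0;
        for (int i = 0; i < N; i++)
            s += v[i];
        return s;
    }

    double sum_bulk(void) {            /* coarse-grained: one aggregated transfer */
        double s = 0;
        upc_memget(local, v, N * sizeof(double));
        for (int i = 0; i < N; i++)
            s += local[i];
        return s;
    }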
23. NAS CG: OpenMP Style vs. MPI Style
- The fine-grained (OpenMP-style) version is still slower
  - The shared-memory programming style leads to more overhead (redundant boundary computation)
- UPC's hybrid programming model can really help
24. More Optimizations
- Overlapping communication and computation (see the sketch below)
  - Hides communication latency behind independent computation
  - Examples: communication scheduling, message pipelining
  - Requires split-phase operations: move the sync() as far as possible from the non-blocking get/put
- But the Cray X1 lacks support for non-blocking gets
  - No user- or compiler-level overlapping
  - All communication optimizations rely on vectorization (e.g., gups)
- Vectorization is too restrictive in our opinion: it gives up on pointer code and sync(), bulk-synchronous programs, etc.
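What communication scheduling would like to emit, in GASNet's implicit-handle style; the put half works on the X1, but the missing non-blocking get rules out the symmetric read-side pattern (compute_independent is a placeholder):

    #include <gasnet.h>

    extern void compute_independent(void);

    void pipelined_puts(gasnet_node_t node, char *remote, char *buf,
                        int nmsgs, size_t msg_bytes) {
        for (int i = 0; i < nmsgs; i++)
            gasnet_put_nbi(node, remote + i * msg_bytes,   /* initiate puts */
                           buf + i * msg_bytes, msg_bytes);
        compute_independent();         /* overlap while puts are in flight */
        gasnet_wait_syncnbi_puts();    /* one sync for all pipelined puts */
    }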
25. Conclusion
- We have an efficient UPC implementation on the Cray X1
- Evaluation of the Cray X1 for GAS languages:
  - Great latency/bandwidth for both local and remote memory operations
  - Remote communication is transparent through global loads and stores
  - The lack of split-phase gets means lost optimization opportunities
  - Poor user-level support for communication and synchronization of remote operations (no prefetching, no non-binding or per-operation completion mechanisms)
  - Heavy reliance on vectorization for performance: great when it happens, not so much otherwise (slow scalar processor)
  - The first platform we have seen that is more sensitive to translated code than to communication/computation scheduling
  - The first possible mismatch between GASNet's semantics and a platform; we're hoping the X2 can address our concerns