Ernest Orlando Lawrence Berkeley National Laboratory
1. The Berkeley UPC Compiler: Implementation and Performance
Wei Chen, Dan Bonachea, Jason Duell, Parry Husbands, Costin Iancu, Kathy Yelick
The LBNL/Berkeley UPC Group
http://upc.lbl.gov
2. Outline
- An Overview of UPC
- Design and Implementation of the Berkeley UPC Compiler
- Preliminary Performance Results
- Communication Optimizations
3. Unified Parallel C (UPC)
- UPC is an explicitly parallel global address space language with SPMD parallelism
- An extension of C
- Shared memory is partitioned by threads
- One-sided (bulk and fine-grained) communication through reads/writes of shared variables
- A collective effort by industry, academia, and government
- http://upc.gwu.edu
[Figure: the UPC memory model. Each thread owns one partition of the shared space (holding x0, x1, ..., xP) plus its own private memory; together the shared partitions form one global address space.]
4. UPC Programming Model Features
- Block-cyclically distributed arrays
- Shared and private pointers
- Global synchronization: barriers
- Pair-wise synchronization: locks
- Parallel loops
- Dynamic shared memory allocation
- Bulk shared memory accesses
- Strict vs. relaxed memory consistency models
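
A minimal UPC sketch touching several of these features (the array name, block size, and N are illustrative): a block-cyclic shared array, a parallel loop, a lock, and a barrier.

    #include <upc.h>
    #define N 1024

    shared [4] int a[N];   /* block-cyclic: blocks of 4 elements per thread */
    shared int sum;        /* shared scalar, with affinity to thread 0 */

    int main(void) {
        upc_lock_t *lock = upc_all_lock_alloc();  /* pair-wise synchronization */
        int i, local = 0;

        upc_forall (i = 0; i < N; i++; &a[i])     /* loop over owned elements */
            local += a[i];

        upc_lock(lock);                           /* serialize the shared update */
        sum += local;
        upc_unlock(lock);

        upc_barrier;                              /* global synchronization */
        return 0;
    }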
5. Design and Implementation of the Berkeley UPC Compiler
6. Overview of the Berkeley UPC Compiler
- Two goals: portability and high performance
- Lowers UPC code into ANSI C code
- Compilation stack (top to bottom):

    UPC Code
    -> Translator (platform-independent)
    -> Translator-Generated C Code
    -> Berkeley UPC Runtime System (network- and compiler-independent;
       shared memory management and pointer operations)
    -> GASNet Communication System (language-independent; uniform get/put
       interface for the underlying networks)
    -> Network Hardware
7. A Layered Design
- Portable
  - C is our intermediate language
  - Can run on top of MPI (with a performance penalty)
  - GASNet has a layered design with a small core
- High-performance
  - Native C compiler optimizes serial code
  - Translator can perform high-level communication optimizations
  - GASNet can access network hardware directly
8. Implementing the UPC-to-C Translator
- Based on the Open64 compiler
- Source-to-source transformation
  - Converts shared memory operations into runtime library calls
- Designed to incorporate the existing optimization framework in Open64
- Communicates with the runtime via a standard API
- Translation pipeline:

    Preprocessed File
    -> C front end
    -> WHIRL w/ shared types
    -> Backend lowering
    -> WHIRL w/ runtime calls
    -> Whirl2c
    -> ANSI-compliant C Code
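
As an illustration, a shared increment lowers to runtime calls roughly as below; the call names and signatures are illustrative stand-ins for the runtime's read/write entry points, not the Berkeley runtime's exact API.

    /* UPC source: */
    shared int x;
    /* ... */
    x = x + 1;

    /* Translated C (sketch; x_sptr is a hypothetical pointer-to-shared
       handle for x generated by the translator): */
    int tmp;
    upcr_get_pshared(&tmp, x_sptr, 0, sizeof(int));  /* read shared x */
    tmp = tmp + 1;
    upcr_put_pshared(x_sptr, 0, &tmp, sizeof(int));  /* write it back */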
9. Shared Arrays and Pointers in UPC

[Figure: layouts across threads T0 and T1. Cyclic array A alternates elements (A0, A2, A4 on T0; A1, A3, A5 on T1); block-cyclic array B is distributed in blocks of two (B0, B1, B4, B5 on T0; B2, B3, B6, B7 on T1); indefinite array C lives entirely on T0 (C0, C1, C2).]

- Cyclic: shared int A[n];
- Block-cyclic: shared [2] int B[n];
- Indefinite: shared [0] int *C = (shared [0] int *)upc_alloc(n);
- Use global pointers (pointers-to-shared) to access shared (possibly remote) data
  - The block size is part of the pointer type
- A generic pointer-to-shared contains an address, a thread id, and a phase
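
In code, the generic representation can be pictured as a three-field struct; the field names and types here are illustrative, not the Berkeley runtime's actual definition.

    #include <stdint.h>

    /* Sketch of a generic (struct-based) pointer-to-shared. */
    typedef struct {
        uintptr_t addr;    /* address within the owning thread's partition */
        uint32_t  thread;  /* id of the thread the data has affinity to */
        uint32_t  phase;   /* element offset within the current block */
    } sptr_t;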
10. Accessing Shared Memory in UPC

[Figure: shared memory divided among threads 0 .. N-1. A pointer-to-shared resolves an access by combining its addr field (the start of the block within the owning thread's partition), the block size, and the phase (the offset into the current block), all relative to the start of the array object.]
11. Phaseless Pointer-to-Shared
- A pointer needs a phase to keep track of where it is within a block
  - A source of overhead for pointer arithmetic
- Special cases of phaseless pointers: cyclic and indefinite
  - Cyclic pointers always have phase 0
  - Indefinite pointers only have one block
- No need to keep the phase in pointer operations for cyclic and indefinite pointers
- No need to update the thread id for indefinite pointer arithmetic
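
Using the struct sketch from slide 9, incrementing a pointer by one element might look like the fragment below (blocksize and the int element type are illustrative):

    /* Generic block-cyclic increment: must carry the phase. */
    sp.phase++;
    if (sp.phase == blocksize) {      /* crossed a block boundary */
        sp.phase = 0;
        sp.thread++;
        if (sp.thread == THREADS) {   /* wrapped past the last thread */
            sp.thread = 0;
            sp.addr += blocksize * sizeof(int);
        }
    }

    /* Cyclic increment: phase is always 0, so only the thread id moves. */
    sp.thread++;
    if (sp.thread == THREADS) {
        sp.thread = 0;
        sp.addr += sizeof(int);
    }

    /* Indefinite increment: one block on one thread; plain address bump. */
    sp.addr += sizeof(int);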
12. Pointer-to-Shared Representation
- Pointer size
  - Want to allow pointers to reside in a register
  - But very large machines may require a longer representation
- Datatype
  - Use of a scalar type (long) rather than a struct may improve backend code quality
  - Faster pointer manipulation (e.g., ptr + int) as well as dereferencing
- Portability/performance balance in the UPC compiler
  - 8-byte scalar vs. struct format (a configuration-time option)
  - The pointer representation is hidden in the runtime layer
  - The modular design makes it easy to add new representations
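
A packed 8-byte scalar might carve the 64 bits into phase, thread, and address fields as sketched below; the 12/14/38 split is purely illustrative, since the actual layout is a configuration choice.

    #include <stdint.h>

    typedef uint64_t sptr_packed_t;   /* fits in a single register */

    #define PHASE_BITS  12            /* illustrative field widths */
    #define THREAD_BITS 14
    #define ADDR_BITS   38            /* 12 + 14 + 38 = 64 bits */

    #define SPTR_PHASE(p)  ((unsigned)((p) & ((1ULL << PHASE_BITS) - 1)))
    #define SPTR_THREAD(p) ((unsigned)(((p) >> PHASE_BITS) & \
                                       ((1ULL << THREAD_BITS) - 1)))
    #define SPTR_ADDR(p)   ((uintptr_t)((p) >> (PHASE_BITS + THREAD_BITS)))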
13. Performance Results
14. Performance Evaluation
- Testbed
  - HP AlphaServer (1 GHz processors) with a Quadrics interconnect
  - Compaq C compiler for the translated C code
  - Compared with HP UPC 2.1
- Cost of language features
  - Shared pointer arithmetic, shared memory accesses, parallel loops, etc.
- Application benchmarks
  - EP: no communication
  - IS: large bulk memory operations
  - MG: bulk memget
  - CG: fine-grained vs. bulk memput
- Potential of optimizations
  - Measure the effectiveness of various communication optimizations
15. Performance of Shared Pointer Arithmetic

[Chart: cycle counts for shared pointer operations; 1 cycle is about 1 ns, and the struct representation is 16 bytes.]

- The phaseless pointer is an important optimization
- The packed representation also helps
16. Cost of Shared Memory Access
- Local shared accesses are somewhat slower than private accesses
- The layered design does not add additional overhead
- Remote accesses are a few orders of magnitude slower than local ones
17. Parallel Loops in UPC
- UPC has a forall construct for distributing computation:

    shared int v1[N], v2[N], v3[N];
    upc_forall (i = 0; i < N; i++; &v3[i])
        v3[i] = v2[i] + v1[i];

- An affinity test is performed on every iteration to decide whether it should execute
- Two kinds of affinity expressions:
  - Integer (compared with the thread id)
  - Shared address (check the affinity of the address)
18. Application Performance
19. NAS Benchmarks (EP and IS)
- EP shows that the backend C compiler can still successfully optimize translated C code
- IS shows that the Berkeley UPC compiler is effective for communication operations
20. NAS Benchmarks (CG and MG)
- The Berkeley UPC compiler scales well
21. Performance of Fine-Grained Applications
- Doesn't scale well due to the nature of the benchmark (lots of small reads)
- HP UPC's software caching helps its performance
22. Observations on the Results
- Acceptable worst-case overhead for shared memory access latencies
  - < 10 cycles of overhead for shared local accesses
  - 1.5 usec of overhead relative to end-to-end network latency
- Optimizations on the pointer-to-shared representation are effective
  - Both phaseless pointers and the packed 8-byte format
- Good performance compared to HP UPC 2.1
23. Communication Optimizations for UPC
24. Communication Optimizations
- Hiding communication latencies
  - Use of non-blocking operations (see the split-phase sketch below)
  - Possible placement analysis, separating get() and put() as far as possible from sync()
  - Message pipelining to overlap communication with more communication
- Optimizing shared memory accesses
  - Eliminating locality tests for local shared pointers: flow- and context-sensitive analysis
  - Transforming forall loops into equivalent for loops
  - Eliminating redundant pointer arithmetic for pointers with the same thread and phase
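
A minimal sketch of the split-phase pattern using GASNet's non-blocking get, assuming GASNet is already initialized; the buffer, node, and helper names are placeholders.

    #include <gasnet.h>

    /* Initiate the transfer early... */
    gasnet_handle_t h = gasnet_get_nb(local_buf, remote_node,
                                      remote_addr, nbytes);

    do_independent_work();   /* ...overlap it with unrelated computation... */

    gasnet_wait_syncnb(h);   /* ...and sync only when the data is needed */
    consume(local_buf);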
25. More Optimizations
- Message vectorization and aggregation (see the memget sketch below)
  - Scatter/gather techniques
  - Packing generally pays off for small (< 500 byte) messages
- Software caching and prefetching
  - A prototype implementation
  - Local knowledge only: no coherence messages
  - Caches remote reads and buffers outgoing writes
  - Based on the weak-ordering consistency model
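
A sketch of message vectorization, assuming `remote` points at a contiguous region with affinity to a single thread (e.g., an indefinite-blocked array); the names are illustrative.

    /* Before: N fine-grained remote reads, one network round-trip each. */
    for (i = 0; i < N; i++)
        use(remote[i]);                  /* remote is a pointer-to-shared */

    /* After: one bulk upc_memget into a private buffer, then local reads. */
    int local[N];
    upc_memget(local, remote, N * sizeof(int));
    for (i = 0; i < N; i++)
        use(local[i]);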
26. Example: Optimizing Loops
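
The slide's code did not survive extraction; the sketch below reconstructs the forall-to-for transformation named on slide 24, reusing the vector-add loop from slide 17 (with a cyclic layout, &v3[i] has affinity to thread i % THREADS).

    /* Original: the affinity test runs on every iteration. */
    upc_forall (i = 0; i < N; i++; &v3[i])
        v3[i] = v2[i] + v1[i];

    /* Optimized: stride directly over the iterations this thread owns,
       leaving no per-iteration affinity test. */
    for (i = MYTHREAD; i < N; i += THREADS)
        v3[i] = v2[i] + v1[i];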
27. Experimental Results
- Computation/communication overlap works better than communication/communication overlap on Quadrics
- Results are likely different for other networks
28. Example: Optimizing Local Shared Accesses
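
Again the slide's code was lost in extraction; here is a sketch of the privatization idea discussed on the next slide, with an illustrative block distribution.

    #define B (N / THREADS)     /* illustrative block size */
    shared [B] int a[N];        /* each thread owns one contiguous block */

    /* Naive: every access goes through pointer-to-shared machinery,
       even though these elements are local to this thread. */
    for (i = 0; i < B; i++)
        a[MYTHREAD * B + i] += 1;

    /* Privatized: cast the local block to an ordinary C pointer once,
       then run at private-pointer speed. */
    int *la = (int *)&a[MYTHREAD * B];
    for (i = 0; i < B; i++)
        la[i] += 1;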
29. Experimental Results
- Neither compiler performs well on the naïve version
  - Culprits: pointer-to-shared operations and affinity tests
- Privatizing local shared accesses improves performance by an order of magnitude
30. Compiler Status
- A fully UPC 1.1-compliant public release in April
- Supported platforms
  - HP AlphaServer, IBM SP, Linux x86/Itanium, SGI Origin2000, Solaris Sparc/x86, Mac OS X PowerPC
- Supported networks
  - Quadrics/Elan, Myrinet/GM, IBM/LAPI, and MPI
- A release this summer will include
  - Pthreads/System V shared memory support
  - GASNet Infiniband support
31. Conclusion
- The Berkeley UPC Compiler achieves both portability and good performance
  - Layered, modular design
  - Effective pointer-to-shared optimizations
  - Good performance compared to a commercially available UPC compiler
- Still lots of opportunities for communication optimizations
- Available for download at http://upc.lbl.gov