Title: Unified Parallel C (UPC) and the Berkeley UPC Compiler
1. Unified Parallel C (UPC) and the Berkeley UPC Compiler
Wei Chen, the Berkeley UPC Group, 3/11/07
2. Parallel Programming
- Most parallel programs are written using either:
  - Message passing with a SPMD model
    - Usually for scientific applications with C/Fortran
    - Scales easily: user-controlled data layout
    - Hard to use: send/receive matching, message packing/unpacking
  - Shared memory with OpenMP/pthreads/Java
    - Usually for non-scientific applications
    - Easier to program: direct reads and writes to shared data
    - Hard to scale: (mostly) limited to SMPs, no concept of locality
- PGAS: an alternative hybrid model
3. Partitioned Global Address Space
- PGAS model uses a global address space abstraction
  - Shared memory is partitioned by processors
  - User-controlled data layout (global pointers and distributed arrays)
- One-sided communication
  - Uses RDMA support for reads/writes of shared variables
  - Much faster than message passing for small/medium-size messages
- Hybrid model works for both SMPs and clusters
- Languages: Titanium, Co-Array Fortran, UPC
[Figure: global address space with a shared region (X0, X1, ..., XP) and a private region (one ptr per thread)]
4. Unified Parallel C
- A SPMD parallel extension of C
- PGAS: adds a shared qualifier to the type system
- Several kinds of shared array distributions
- Fine-grained and bulk communication
- Commercial compilers from Cray/HP/IBM
- Open-source compiler: Berkeley UPC
Vector Addition in UPC
#define N (100*THREADS)
shared int v1[N], v2[N], sum[N]; //cyclic layout
int main(void) {
  for (int i = 0; i < N; i++)
    if (MYTHREAD == i % THREADS) //SPMD
      sum[i] = v1[i] + v2[i];
  return 0;
}
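The ownership test in the vector addition can be emulated in plain C, without a UPC compiler. The sketch below is only an illustration: it assumes the default cyclic layout (element i has affinity to thread i % THREADS), and the names cyclic_owner and vector_add_as_thread are hypothetical helpers, not part of UPC.

```c
#include <assert.h>
#include <stddef.h>

/* Owner of element i under a cyclic (block size 1) layout:
   element i has affinity to thread i % threads. */
int cyclic_owner(size_t i, int threads) {
    assert(threads > 0);
    return (int)(i % (size_t)threads);
}

/* Emulate the SPMD loop as seen by one thread: the thread walks the
   whole index range but only updates the elements it owns.
   Returns the number of elements this thread updated. */
size_t vector_add_as_thread(int mythread, int threads,
                            const int *v1, const int *v2,
                            int *sum, size_t n) {
    size_t updated = 0;
    for (size_t i = 0; i < n; i++) {
        if (cyclic_owner(i, threads) == mythread) {
            sum[i] = v1[i] + v2[i];
            updated++;
        }
    }
    return updated;
}
```

Calling vector_add_as_thread once for each thread id from 0 to threads - 1 updates every element exactly once, which is what the SPMD version achieves when all threads run the same loop.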
5. Overview of the Berkeley UPC Compiler
Two goals: portability and high performance
[Figure: layered software stack, top to bottom:
  UPC Code
  Translator: lowers UPC code into ISO C code
  Translator-Generated C Code
  Berkeley UPC Runtime System: shared memory management and pointer operations
  GASNet Communication System: uniform get/put interface for underlying networks
  Network Hardware
Side labels mark layers as platform-independent, network-independent, compiler-independent, and language-independent.]
6. UPC to C Translator
- Based on Open64
  - Extend with shared type
  - Reuse analysis framework
  - Add UPC-specific optimizations
- Portable translation
  - High-level IR
  - Config file for platform-dependent information
  - Reinclude library headers
- Convert shared memory operations into runtime calls
[Figure: translation pipeline: Preprocessed UPC Source -> Parsing -> WHIRL with shared types -> Optimizer -> Optimized WHIRL -> Lowering -> WHIRL with runtime calls -> Lowering -> WHIRL2C -> ISO C code -> Backend C compiler]
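Converting shared memory operations into runtime calls can be sketched in plain C. Everything named here is an assumption made for illustration: sim_sptr, shared_index, rt_get_int and rt_put_int are hypothetical stand-ins, not the Berkeley UPC runtime API, and the shared segment is simulated with an ordinary array using a fixed thread count.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_THREADS 4   /* hypothetical fixed thread count */
#define SIM_SLOTS   64  /* ints of shared space per simulated thread */

/* Simulated shared segment: one partition per thread. */
static int segment[SIM_THREADS][SIM_SLOTS];

/* Hypothetical pointer-to-shared: owning thread plus offset in its partition. */
typedef struct {
    int    thread;
    size_t offset;
} sim_sptr;

/* Locate element i of a cyclically distributed int array whose
   partition starts at slot `base` on every thread. */
sim_sptr shared_index(size_t base, size_t i) {
    sim_sptr p;
    p.thread = (int)(i % SIM_THREADS);
    p.offset = base + i / SIM_THREADS;
    assert(p.offset < SIM_SLOTS);
    return p;
}

/* Hypothetical runtime calls standing in for a get/put of one int. */
int rt_get_int(sim_sptr p) { return segment[p.thread][p.offset]; }
void rt_put_int(sim_sptr p, int value) { segment[p.thread][p.offset] = value; }

/* UPC source:   sum[i] = v1[i] + v2[i];
   Lowered form: each shared access becomes an explicit runtime call. */
void lowered_add(size_t v1_base, size_t v2_base, size_t sum_base, size_t i) {
    int a = rt_get_int(shared_index(v1_base, i));
    int b = rt_get_int(shared_index(v2_base, i));
    rt_put_int(shared_index(sum_base, i), a + b);
}
```

The point of the sketch is the shape of the output: after lowering, the generated C contains no shared types, only ordinary values and calls into a runtime layer, which is what lets any backend C compiler build it.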
7. Optimization framework
- Combination of language/compiler/runtime support
- Transparent to the user
- Performance portable
- Short-term goal: effective on different cluster networks
- Long-term goal: code designed for SMPs gets good performance on clusters
- Optimize regular array accesses, e.g. A[i][j][k]: loop framework for message vectorization, strip mining
- Optimize irregular pointer accesses, e.g. p->x->y: PRE framework with split-phase access and coalescing
- Nonblocking bulk communication, e.g. upc_memget(dst, src, size): runtime framework for communication overlap
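Message vectorization and strip mining can be illustrated with a small plain-C simulation. This is a sketch under stated assumptions, not the compiler's actual transformation: sim_memget is a hypothetical stand-in for a bulk get such as upc_memget (a local memcpy that counts one message per call), and the strip length STRIP is an arbitrary illustrative value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static size_t messages_sent = 0;  /* counts simulated network operations */

/* Stand-in for a bulk get: copies locally and counts one message. */
void sim_memget(void *dst, const void *src, size_t size) {
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, size);
    messages_sent++;
}

/* Fine-grained version: one simulated message per element. */
long sum_fine_grained(const int *remote, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        int x;
        sim_memget(&x, &remote[i], sizeof x);
        total += x;
    }
    return total;
}

#define STRIP 16  /* hypothetical strip length, in elements */

/* Vectorized, strip-mined version: fetch up to STRIP elements per
   message into a local buffer, then compute on the local copy. */
long sum_strip_mined(const int *remote, size_t n) {
    int buf[STRIP];
    long total = 0;
    for (size_t start = 0; start < n; start += STRIP) {
        size_t len = (n - start < STRIP) ? n - start : STRIP;
        sim_memget(buf, &remote[start], len * sizeof buf[0]);
        for (size_t j = 0; j < len; j++) {
            total += buf[j];
        }
    }
    return total;
}
```

Both functions compute the same sum, but the strip-mined version issues one message per strip instead of one per element, while the fixed-size buffer keeps local memory use bounded.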
8. Application Performance: LU Decomposition
- UPC performance comparable to MPI/HPL (Linpack) with < ½ the code size
- Uses light-weight multi-threading atop SPMD, making it latency tolerant
- Highly adaptable to different problem and machine sizes
9. Application Performance: 3D FFT
[Figure: performance chart, y-axis MFLOPS / Proc (up is good)]
- One-sided UPC approach sends more, smaller messages
- Same total volume of data, but sent earlier and more often
- Aggressively overlaps the transpose with the 2nd 1-D FFT
- Same approach is less effective in MPI due to higher per-message cost
- Consistently outperforms MPI-based implementations by as much as 2X
10. Current Status
- Public release v2.4 in November 2006
- Fully compliant with UPC 1.2 specification
- Communication optimizations
- Extensions for performance and programmability
- Support from laptops to supercomputers
  - OS: UNIX (Linux, BSD, AIX, Solaris, etc.), Mac, Cygwin
  - Arch: x86, Itanium, Opteron, Alpha, PPC, SPARC, Cray X1, NEC SX-6, Blue Gene, etc.
  - Network: SMP, Myrinet, Quadrics, Infiniband, IBM LAPI, MPI, Ethernet, SHMEM, etc.
- Give us a try at http://upc.lbl.gov
11. Summary
- UPC designed to be consistent with C
- Expose memory layout
- Flexible communication with pointers and arrays
- Give users more control to achieve high performance
- Berkeley UPC compiler provides an open-source and portable implementation
- Hand-optimized UPC programs match and often beat MPI's performance
- Research goal: productive user, efficient compiler