Ernest Orlando Lawrence Berkeley National Laboratory
1. The Berkeley UPC Compiler: Implementation and Performance
Wei Chen, Dan Bonachea, Jason Duell, Parry Husbands, Costin Iancu, Kathy Yelick
The LBNL/Berkeley UPC Group
http://upc.lbl.gov
2. Outline
- An Overview of UPC
- Design and Implementation of the Berkeley UPC Compiler
- Preliminary Performance Results
- Communication Optimizations
3. Unified Parallel C (UPC)
- UPC is an explicitly parallel global address space language with SPMD parallelism
- An extension of C
- Shared memory is partitioned by threads
- One-sided (bulk and fine-grained) communication through reads/writes of shared variables
- A collective effort by industry, academia, and government
- http://upc.gwu.edu
[Figure: the UPC memory model. Each thread owns one partition of the shared space (holding x0, x1, ..., xP) plus its own private memory; together the shared partitions form one global address space.]
4. UPC Programming Model Features
- Block-cyclically distributed arrays
- Shared and private pointers
- Global synchronization: barriers
- Pair-wise synchronization: locks
- Parallel loops
- Dynamic shared memory allocation
- Bulk shared memory accesses
- Strict vs. relaxed memory consistency models
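
A minimal UPC sketch touching several of these features (the array name, block size, and N are illustrative): a block-cyclic shared array, a parallel loop, a lock, and a barrier.

    #include <upc.h>
    #define N 1024

    shared [4] int a[N];   /* block-cyclic: blocks of 4 elements per thread */
    shared int sum;        /* shared scalar, with affinity to thread 0 */

    int main(void) {
        upc_lock_t *lock = upc_all_lock_alloc();  /* pair-wise synchronization */
        int i, local = 0;

        upc_forall (i = 0; i < N; i++; &a[i])     /* loop over owned elements */
            local += a[i];

        upc_lock(lock);                           /* serialize the shared update */
        sum += local;
        upc_unlock(lock);

        upc_barrier;                              /* global synchronization */
        return 0;
    }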
5. Design and Implementation of the Berkeley UPC Compiler
6. Overview of the Berkeley UPC Compiler
- Two goals: portability and high performance
- Lowers UPC code into ANSI C code
- Compilation stack (top to bottom):

    UPC Code
    -> Translator (platform-independent)
    -> Translator-Generated C Code
    -> Berkeley UPC Runtime System (network- and compiler-independent;
       shared memory management and pointer operations)
    -> GASNet Communication System (language-independent; uniform get/put
       interface for the underlying networks)
    -> Network Hardware
7. A Layered Design
- Portable
  - C is our intermediate language
  - Can run on top of MPI (with a performance penalty)
  - GASNet has a layered design with a small core
- High-performance
  - Native C compiler optimizes serial code
  - Translator can perform high-level communication optimizations
  - GASNet can access network hardware directly
8. Implementing the UPC-to-C Translator
- Based on the Open64 compiler
- Source-to-source transformation
  - Converts shared memory operations into runtime library calls
- Designed to incorporate the existing optimization framework in Open64
- Communicates with the runtime via a standard API
- Translation pipeline:

    Preprocessed File
    -> C front end
    -> WHIRL w/ shared types
    -> Backend lowering
    -> WHIRL w/ runtime calls
    -> Whirl2c
    -> ANSI-compliant C Code
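
As an illustration, a shared increment lowers to runtime calls roughly as below; the call names and signatures are illustrative stand-ins for the runtime's read/write entry points, not the Berkeley runtime's exact API.

    /* UPC source: */
    shared int x;
    /* ... */
    x = x + 1;

    /* Translated C (sketch; x_sptr is a hypothetical pointer-to-shared
       handle for x generated by the translator): */
    int tmp;
    upcr_get_pshared(&tmp, x_sptr, 0, sizeof(int));  /* read shared x */
    tmp = tmp + 1;
    upcr_put_pshared(x_sptr, 0, &tmp, sizeof(int));  /* write it back */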
9. Shared Arrays and Pointers in UPC

[Figure: layouts across threads T0 and T1. Cyclic array A alternates elements (A0, A2, A4 on T0; A1, A3, A5 on T1); block-cyclic array B is distributed in blocks of two (B0, B1, B4, B5 on T0; B2, B3, B6, B7 on T1); indefinite array C lives entirely on T0 (C0, C1, C2).]

- Cyclic: shared int A[n];
- Block-cyclic: shared [2] int B[n];
- Indefinite: shared [0] int *C = (shared [0] int *)upc_alloc(n);
- Use global pointers (pointers-to-shared) to access shared (possibly remote) data
  - The block size is part of the pointer type
- A generic pointer-to-shared contains an address, a thread id, and a phase
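
In code, the generic representation can be pictured as a three-field struct; the field names and types here are illustrative, not the Berkeley runtime's actual definition.

    #include <stdint.h>

    /* Sketch of a generic (struct-based) pointer-to-shared. */
    typedef struct {
        uintptr_t addr;    /* address within the owning thread's partition */
        uint32_t  thread;  /* id of the thread the data has affinity to */
        uint32_t  phase;   /* element offset within the current block */
    } sptr_t;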
10. Accessing Shared Memory in UPC

[Figure: shared memory divided among threads 0 .. N-1. A pointer-to-shared resolves an access by combining its addr field (the start of the block within the owning thread's partition), the block size, and the phase (the offset into the current block), all relative to the start of the array object.]
11. Phaseless Pointer-to-Shared
- A pointer needs a phase to keep track of where it is within a block
  - A source of overhead for pointer arithmetic
- Special cases of phaseless pointers: cyclic and indefinite
  - Cyclic pointers always have phase 0
  - Indefinite pointers only have one block
- No need to keep the phase in pointer operations for cyclic and indefinite pointers
- No need to update the thread id for indefinite pointer arithmetic
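
Using the struct sketch from slide 9, incrementing a pointer by one element might look like the fragment below (blocksize and the int element type are illustrative):

    /* Generic block-cyclic increment: must carry the phase. */
    sp.phase++;
    if (sp.phase == blocksize) {      /* crossed a block boundary */
        sp.phase = 0;
        sp.thread++;
        if (sp.thread == THREADS) {   /* wrapped past the last thread */
            sp.thread = 0;
            sp.addr += blocksize * sizeof(int);
        }
    }

    /* Cyclic increment: phase is always 0, so only the thread id moves. */
    sp.thread++;
    if (sp.thread == THREADS) {
        sp.thread = 0;
        sp.addr += sizeof(int);
    }

    /* Indefinite increment: one block on one thread; plain address bump. */
    sp.addr += sizeof(int);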
12. Pointer-to-Shared Representation
- Pointer size
  - Want to allow pointers to reside in a register
  - But very large machines may require a longer representation
- Datatype
  - Use of a scalar type (long) rather than a struct may improve backend code quality
  - Faster pointer manipulation (e.g., ptr + int) as well as dereferencing
- Portability/performance balance in the UPC compiler
  - 8-byte scalar vs. struct format (a configuration-time option)
  - The pointer representation is hidden in the runtime layer
  - The modular design makes it easy to add new representations
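
A packed 8-byte scalar might carve the 64 bits into phase, thread, and address fields as sketched below; the 12/14/38 split is purely illustrative, since the actual layout is a configuration choice.

    #include <stdint.h>

    typedef uint64_t sptr_packed_t;   /* fits in a single register */

    #define PHASE_BITS  12            /* illustrative field widths */
    #define THREAD_BITS 14
    #define ADDR_BITS   38            /* 12 + 14 + 38 = 64 bits */

    #define SPTR_PHASE(p)  ((unsigned)((p) & ((1ULL << PHASE_BITS) - 1)))
    #define SPTR_THREAD(p) ((unsigned)(((p) >> PHASE_BITS) & \
                                       ((1ULL << THREAD_BITS) - 1)))
    #define SPTR_ADDR(p)   ((uintptr_t)((p) >> (PHASE_BITS + THREAD_BITS)))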
13. Performance Results
14. Performance Evaluation
- Testbed
  - HP AlphaServer (1 GHz processors) with a Quadrics interconnect
  - Compaq C compiler for the translated C code
  - Compared with HP UPC 2.1
- Cost of language features
  - Shared pointer arithmetic, shared memory accesses, parallel loops, etc.
- Application benchmarks
  - EP: no communication
  - IS: large bulk memory operations
  - MG: bulk memget
  - CG: fine-grained vs. bulk memput
- Potential of optimizations
  - Measure the effectiveness of various communication optimizations
15. Performance of Shared Pointer Arithmetic

[Chart: cycle counts for shared pointer operations; 1 cycle is about 1 ns, and the struct representation is 16 bytes.]

- The phaseless pointer is an important optimization
- The packed representation also helps
16. Cost of Shared Memory Access
- Local shared accesses are somewhat slower than private accesses
- The layered design does not add additional overhead
- Remote accesses are a few orders of magnitude slower than local ones
17. Parallel Loops in UPC
- UPC has a forall construct for distributing computation:

    shared int v1[N], v2[N], v3[N];
    upc_forall (i = 0; i < N; i++; &v3[i])
        v3[i] = v2[i] + v1[i];

- An affinity test is performed on every iteration to decide whether it should execute
- Two kinds of affinity expressions:
  - Integer (compared with the thread id)
  - Shared address (check the affinity of the address)
18. Application Performance
19. NAS Benchmarks (EP and IS)
- EP shows that the backend C compiler can still successfully optimize translated C code
- IS shows that the Berkeley UPC compiler is effective for communication operations
20. NAS Benchmarks (CG and MG)
- The Berkeley UPC compiler scales well
21. Performance of Fine-Grained Applications
- Doesn't scale well due to the nature of the benchmark (lots of small reads)
- HP UPC's software caching helps its performance
22. Observations on the Results
- Acceptable worst-case overhead for shared memory access latencies
  - < 10 cycles of overhead for shared local accesses
  - 1.5 usec of overhead relative to end-to-end network latency
- Optimizations on the pointer-to-shared representation are effective
  - Both phaseless pointers and the packed 8-byte format
- Good performance compared to HP UPC 2.1
23. Communication Optimizations for UPC
24. Communication Optimizations
- Hiding communication latencies
  - Use of non-blocking operations (see the split-phase sketch below)
  - Possible placement analysis, separating get() and put() as far as possible from sync()
  - Message pipelining to overlap communication with more communication
- Optimizing shared memory accesses
  - Eliminating locality tests for local shared pointers: flow- and context-sensitive analysis
  - Transforming forall loops into equivalent for loops
  - Eliminating redundant pointer arithmetic for pointers with the same thread and phase
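
A minimal sketch of the split-phase pattern using GASNet's non-blocking get, assuming GASNet is already initialized; the buffer, node, and helper names are placeholders.

    #include <gasnet.h>

    /* Initiate the transfer early... */
    gasnet_handle_t h = gasnet_get_nb(local_buf, remote_node,
                                      remote_addr, nbytes);

    do_independent_work();   /* ...overlap it with unrelated computation... */

    gasnet_wait_syncnb(h);   /* ...and sync only when the data is needed */
    consume(local_buf);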
25. More Optimizations
- Message vectorization and aggregation (see the memget sketch below)
  - Scatter/gather techniques
  - Packing generally pays off for small (< 500 byte) messages
- Software caching and prefetching
  - A prototype implementation
  - Local knowledge only: no coherence messages
  - Caches remote reads and buffers outgoing writes
  - Based on the weak-ordering consistency model
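
A sketch of message vectorization, assuming `remote` points at a contiguous region with affinity to a single thread (e.g., an indefinite-blocked array); the names are illustrative.

    /* Before: N fine-grained remote reads, one network round-trip each. */
    for (i = 0; i < N; i++)
        use(remote[i]);                  /* remote is a pointer-to-shared */

    /* After: one bulk upc_memget into a private buffer, then local reads. */
    int local[N];
    upc_memget(local, remote, N * sizeof(int));
    for (i = 0; i < N; i++)
        use(local[i]);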
26. Example: Optimizing Loops
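
The slide's code did not survive extraction; the sketch below reconstructs the forall-to-for transformation named on slide 24, reusing the vector-add loop from slide 17 (with a cyclic layout, &v3[i] has affinity to thread i % THREADS).

    /* Original: the affinity test runs on every iteration. */
    upc_forall (i = 0; i < N; i++; &v3[i])
        v3[i] = v2[i] + v1[i];

    /* Optimized: stride directly over the iterations this thread owns,
       leaving no per-iteration affinity test. */
    for (i = MYTHREAD; i < N; i += THREADS)
        v3[i] = v2[i] + v1[i];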
27. Experimental Results
- Computation/communication overlap works better than communication/communication overlap on Quadrics
- Results are likely different for other networks
28. Example: Optimizing Local Shared Accesses
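
Again the slide's code was lost in extraction; here is a sketch of the privatization idea discussed on the next slide, with an illustrative block distribution.

    #define B (N / THREADS)     /* illustrative block size */
    shared [B] int a[N];        /* each thread owns one contiguous block */

    /* Naive: every access goes through pointer-to-shared machinery,
       even though these elements are local to this thread. */
    for (i = 0; i < B; i++)
        a[MYTHREAD * B + i] += 1;

    /* Privatized: cast the local block to an ordinary C pointer once,
       then run at private-pointer speed. */
    int *la = (int *)&a[MYTHREAD * B];
    for (i = 0; i < B; i++)
        la[i] += 1;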
29. Experimental Results
- Neither compiler performs well on the naïve version
  - Culprits: pointer-to-shared operations and affinity tests
- Privatizing local shared accesses improves performance by an order of magnitude
30. Compiler Status
- A fully UPC 1.1-compliant public release in April
- Supported platforms
  - HP AlphaServer, IBM SP, Linux x86/Itanium, SGI Origin2000, Solaris Sparc/x86, Mac OS X PowerPC
- Supported networks
  - Quadrics/Elan, Myrinet/GM, IBM/LAPI, and MPI
- A release this summer will include
  - Pthreads/System V shared memory support
  - GASNet Infiniband support
31. Conclusion
- The Berkeley UPC Compiler achieves both portability and good performance
  - Layered, modular design
  - Effective pointer-to-shared optimizations
  - Good performance compared to a commercially available UPC compiler
- Still lots of opportunities for communication optimizations
- Available for download at http://upc.lbl.gov