A UPC Runtime System Based on MPI and POSIX Threads
- Zhang Zhang, Jeevan Savant, Steve Seidel
- Department of Computer Science
- Michigan Technological University
- {zhazhang, jvsavant, steve}_at_mtu.edu
- http://www.upc.mtu.edu
Outline
- Introduction
- The UPC programming model
- MuPC overview
- Related work
- Runtime system design
- Performance features
- Benchmark measurements
- Summary and continuing work
1. Introduction
- Unified Parallel C (UPC) is an extension of ANSI C that provides a partitioned shared memory model for parallel programming.
- UPC programs are SPMD.
- UPC compilers are available for platforms ranging from Linux clusters to the Cray X1.
- MuPC is a runtime system that manages the execution of the user's UPC program.
- The design and performance of MuPC will be discussed.
2. The UPC programming model
- A short history
- An overview of UPC
A short history
- PCP: Eugene Brooks, et al., 1991, Lawrence Livermore Lab
- Split-C: Culler, Yelick, et al., 1993, UC Berkeley
- AC: Draper, Carlson, 1995, IDA
- These merged into Unified Parallel C: El-Ghazawi, Carlson, Draper
  - v1.0 February 2001
  - v1.1 July 2003
  - v1.2 June 2005
UPC, the language
- UPC is an extension of ISO C99.
- Every C program is a UPC program.
- UPC is not a library like MPI, though UPC has libraries.
- UPC processes are called threads.
- Predefined identifiers THREADS and MYTHREAD are provided.
- UPC programs are SPMD.
- UPC is based on a partitioned shared memory model.
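The predefined identifiers above can be seen in a minimal UPC program. This is a sketch: it requires a UPC compiler (such as MuPC's) to build, and simply has each SPMD thread report its rank.

```c
#include <upc.h>
#include <stdio.h>

int main(void)
{
    /* Every one of the THREADS SPMD processes executes this. */
    printf("hello from thread %d of %d\n", MYTHREAD, THREADS);
    upc_barrier;   /* wait for all threads before exiting */
    return 0;
}
```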
Partitioned shared memory model
- Shared memory: A single address space shared by all processors. (Sun Enterprise, Cray T3E)
- Distributed memory: Each process has its own, private address space. (Beowulf clusters)
- Distributed shared memory: A single address space that is distributed among processors. (An illusion usually provided by a runtime system.)
- Partitioned shared memory: A single address space that is logically partitioned among processors. The distribution of memory is built into the language. A physical partition may or may not be present on the hardware.
UPC partitioned shared address space
- Each thread has a private (local) address space.
- All threads share a global address space that is partitioned among the threads.
- A shared object in thread i's region of the partition is said to have affinity to thread i.
- If thread i has affinity to a shared object x, it is likely that accesses to x take less time than accesses to shared objects to which thread i does not have affinity.
- Shared objects must be declared at file scope.
UPC programming model
Shared arrays and pointers-to-shared
- Arrays are the fundamental shared objects in UPC.
- Shared arrays are distributed block-cyclically (round robin, in blocksize chunks).
- shared [5] int *p can be used to point to the array in the previous example.
- Pointer arithmetic on p is transparent, that is, if p points to A[4], then p+1 points to A[5].
- int *q; q = (int *)p; casts p to a true private pointer that can access elements of A that have affinity to this thread.
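The block-cyclic rule can be made concrete with a small helper. This is a hypothetical illustration in plain C; `affinity_of` is not part of UPC or MuPC.

```c
#include <stddef.h>

/* For a declaration like: shared [blocksize] int A[N];
 * element A[i] lives in block i/blocksize, and blocks are dealt
 * round-robin across threads, so A[i] has affinity to thread
 * (i/blocksize) mod THREADS. */
int affinity_of(size_t i, size_t blocksize, int threads)
{
    return (int)((i / blocksize) % (size_t)threads);
}
```

With blocksize 5 and 4 threads, elements 0..4 land on thread 0, 5..9 on thread 1, and element 20 wraps back to thread 0; blocksize 1 gives a purely cyclic layout.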
Parallel matrix multiply in UPC

    #include <upc.h>
    shared [100] double a[100][100];
    shared [100] double b[100][100];
    shared [100] double c[100][100];
    int i, j, k;

    upc_forall (i=0; i<100; i++; &a[i][0])
        for (j=0; j<100; j++) {
            c[i][j] = 0.0;
            for (k=0; k<100; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
3. MuPC Overview
- The complete MuPC system consists of a UPC compiler and a runtime system based on MPI-1 and POSIX threads.
- The user's code app.c is translated by the EDG front end into an intermediate language (IL) tree.
- The IL tree is lowered to pure C, which includes
  - local structures representing shared data objects
  - calls to MuPC functions to perform remote accesses and other nonlocal actions, such as synchronization.
- The result is app.int.c.
- mupcc compiles app.int.c and links with libmupc.a.
- Run using mpirun -np n ./app
MuPC Overview
- The MuPC runtime system is portable to any system that supports POSIX threads and MPI-1.
- MuPC has been ported to Linux clusters, AlphaServer clusters, and Sun Enterprise platforms.
- MuPC currently supports UPC v1.1.1 (not yet at v1.2, though the differences are minor).
- MuPC is open source. The EDG front end is distributed as a binary.
4. Related work
- Berkeley UPC (Yelick, Bonachea, et al.)
  - Core and extended GASNet communication APIs allow mating UPC and Titanium runtime systems with many different transport layers, e.g., Myrinet, Quadrics, and even MPI.
  - The front end is the Open64 source-to-source translator.
  - The runtime system is platform independent.
  - Highly portable.
  - The Open64 translator has a large footprint that encourages remote compilation at Berkeley.
Other UPC compilers
- Hewlett-Packard
  - the first commercial UPC compiler
  - supports Tru64 Unix, HP-UX, and XC Linux clusters
  - offers a runtime cache and a trainable prefetcher
- Intrepid UPC
  - extends the GNU GCC compiler
  - only for shared memory platforms such as SGI Irix and Cray T3E
- Cray UPC for the X1 and other current Cray platforms
5. Runtime system design
- Runtime objects
- Shared memory management
- Shared memory accesses
- Synchronization
- Shared memory consistency
- 2-threaded design
a) Runtime objects
- Constants THREADS and MYTHREAD
- A pointer to shared type is a structure
- 64 address bits
- 32 phase (offset) bits
- 32 thread number bits
- linked lists of shared object attributes
- init and fini routines handle startup/shutdown
protocol.
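A sketch in plain C of how such a pointer-to-shared might be laid out and advanced. The field names and the wrap-around rule shown here are illustrative assumptions, not MuPC's actual code; they follow the bit widths listed above and the transparent pointer arithmetic described on the shared-arrays slide.

```c
#include <stdint.h>

/* Pointer-to-shared as described above: 64 address bits,
 * 32 phase bits, 32 thread bits. Field names are assumed. */
typedef struct {
    uint64_t addr;   /* local address on the owning thread            */
    uint32_t phase;  /* element offset within the current block       */
    uint32_t thread; /* thread the referenced element has affinity to */
} shared_ptr_t;

/* Advance p by one element of an array declared with the given
 * blocksize: when a block is exhausted, move to the same block on
 * the next thread; after the last thread, move to the next block
 * on thread 0. */
void shared_ptr_inc(shared_ptr_t *p, uint32_t blocksize,
                    uint64_t elem_size, uint32_t threads)
{
    p->phase++;
    p->addr += elem_size;
    if (p->phase == blocksize) {
        p->phase = 0;
        p->thread = (p->thread + 1) % threads;
        if (p->thread != 0)   /* same block, next thread */
            p->addr -= (uint64_t)blocksize * elem_size;
    }
}
```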
b) Shared memory management
- The front end allocates static shared objects.
- On distributed memory platforms corresponding elements have the same local address in each thread.
- The memory image of shared objects is the same for each thread. Some space may be wasted.
- At startup, part of the heap is allocated for dynamically created shared objects.
- The same approach to object allocation and addressing is used with dynamically created shared objects.
c) Shared memory accesses
- Reads and writes of shared memory are performed by corresponding get and put functions.
- get functions are synchronous (blocking).
- put functions are asynchronous, in general.
- Nonscalar shared objects are always moved synchronously as blocks of raw bytes.
- UPC provides one-sided message passing functions such as upc_memcpy(). MuPC implements these with its block get and put functions.
d) Synchronization
- upc_barrier and its variants are implemented with a tree-based synchronization routine. (MPI_Barrier() cannot be used to implement all of the variants provided in UPC.)
- Fences force completion of shared memory accesses. MuPC implements fences by blocking until pending accesses are complete.
- UPC provides locks so the programmer can synchronize accesses to shared memory. MuPC implements a lock with a shared array of THREADS bits, one per thread.
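From the programmer's side, the locks mentioned above are reached through UPC's standard lock API. A minimal sketch (requires a UPC compiler to build): all threads collectively allocate one lock, then use it to serialize a read-modify-write on shared data.

```c
#include <upc.h>

shared int counter;     /* single shared scalar, affinity to thread 0 */
upc_lock_t *lock;

int main(void)
{
    lock = upc_all_lock_alloc();  /* collective: every thread gets the same lock */

    upc_lock(lock);
    counter++;                    /* shared read-modify-write, now safe */
    upc_unlock(lock);

    upc_barrier;
    return 0;
}
```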
e) Shared memory consistency
- UPC supports a noncoherent memory model.
- relaxed accesses are the default.
- The programmer can force consistency by explicitly stating that a memory operation is strict.
- All threads must see strict operations occur in the same order.
- A fence forces the completion of all outstanding memory operations in this thread.
- Strict accesses consist of a relaxed access surrounded by fences.
- A fence requires an ack from all threads written to since the last fence.
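The fence bookkeeping in the last bullet can be modeled with a toy tracker. This is an illustration under assumed names, not MuPC's implementation: each relaxed put records its destination, and a fence must then collect one ack per distinct peer written since the previous fence.

```c
#include <stdbool.h>

#define MAX_THREADS 64

static bool written_to[MAX_THREADS];  /* peers written since last fence */

/* Record a relaxed put issued to thread dest. */
void record_put(int dest) { written_to[dest] = true; }

/* A fence blocks until every peer written since the last fence acks.
 * Returns how many acks this fence must wait for, and resets the set. */
int fence_acks_needed(int threads)
{
    int n = 0;
    for (int t = 0; t < threads; t++)
        if (written_to[t]) { n++; written_to[t] = false; }
    return n;
}
```

Two puts to thread 1 and one to thread 3 cost a fence only two acks; a second fence with no intervening puts waits for none.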
f) 2-threaded design
- Each UPC thread is implemented as two POSIX threads.
- The user Pthread is the user's UPC program.
  - the compiled C program with calls to MuPC functions
- The communication Pthread handles remote accesses.
  - an event-driven MPI program servicing requests for operations on shared objects
  - includes a persistent MPI_Irecv() to catch requests from other threads
  - yields when there are no requests on the queue
- Thread safety is guaranteed by isolating all MPI calls within the communication Pthread.
6. Performance features
- Detecting accesses to local shared memory
- easy, and critical to good performance
- Runtime software cache
- improves performance in some cases
Detecting local shared accesses
- Accesses to local shared memory are detected at run time.

    shared [10] int a[10*THREADS];
    int i, b[10];
    int *p;

    // detected by MuPC
    i = 10*MYTHREAD;
    b[i] = a[i];

    // detected by user
    p = (int *)&a[i];
    b[i] = *p;
Runtime software cache
- Scalar remote references can be cached.
- Direct mapped, write back
- LRU replacement
- small victim cache
- THREADS cache segments in each thread (own is unused)
- bit vector used to avoid false sharing
- Performance of stride-1 reads and writes improved by more than a factor of 10.
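Why stride-1 accesses benefit so much can be sketched with the usual direct-mapped index/tag split. The line size and line count below are assumptions for illustration; MuPC's actual cache geometry may differ.

```c
#include <stdint.h>

#define LINE_BYTES  64   /* bytes fetched per remote miss (assumed)   */
#define CACHE_LINES 256  /* lines per direct-mapped segment (assumed) */

/* A remote address maps to exactly one line (direct mapped); the tag
 * distinguishes the different addresses that alias to that line. */
unsigned cache_index(uint64_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % CACHE_LINES);
}

uint64_t cache_tag(uint64_t addr)
{
    return addr / ((uint64_t)LINE_BYTES * CACHE_LINES);
}
```

With these parameters, a stride-1 stream of 8-byte scalars misses once and then hits the same line for the next seven references, while random references scatter across lines and gain little, matching the measurements summarized later.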
7. Benchmark measurements
- Streaming remote access
  - measures single-thread remote access rate
  - stride-1 reads and writes
  - random reads and writes
- Natural ring
  - all threads read and write to a neighbor in a ring
  - similar to the all-processes-in-a-ring HPC benchmark
- Units are thousands of references per second.
Parallel systems
- Compiler and RTS
  - MuPC V1.1 without cache
  - MuPC V1.1 with cache
  - Berkeley V2.0 (has no cache)
  - HP V2.2 with cache
  - caches configured for 256 x THREADS 1K blocks
- Platforms
  - HP AlphaServer SC
    - 8 4-way 667MHz EV67 nodes
    - runs HP UPC and MuPC
  - Linux/Myrinet cluster
    - 16 2-way 2GHz Pentiums
    - runs MuPC and Berkeley UPC
Streaming remote accesses
- Stride-1 reads
- Stride-1 writes
- Random reads
- Random writes
Single stream accesses: no cache, Pentium cluster (chart; units: 10^3 references/sec)
Single stream accesses: cache, AlphaServer cluster (chart; units: 10^3 references/sec)
Single stream accesses: Pentium cluster, MuPC with and without cache (chart; units: 10^3 references/sec)
Natural ring: no cache, Pentium cluster (chart; units: 10^3 references/sec)
Natural ring: cache, AlphaServer cluster (chart; units: 10^3 references/sec)
Natural ring: Pentium cluster, MuPC with and without cache (chart; units: 10^3 references/sec)
Performance summary
- The MuPC runtime cache significantly improves performance for stride-1 accesses.
- The MuPC runtime cache penalizes random accesses much less than stride-1 accesses benefit.
- The HP runtime cache is much slower for writes than reads, perhaps due to its write-through design.
- Without a runtime cache, Berkeley cannot match the stride-1 access performance of the other systems.
Performance summary (continued)
- Heavy network traffic increased MuPC and Berkeley access times by as much as a factor of 10. HP held up better except for random reads.
- HP performs better, sometimes much better, than MuPC except for stride-1 writes.
- When the cache is turned off or is not available, MuPC and Berkeley performance is a toss-up.
Other MuPC performance results
- NAS Parallel Benchmarks
  - EP, CG, FT, IS, MG
  - IPDPS'05 PMEO Workshop
- A UPC performance model
  - histogramming, matrix multiply, Sobel edge detection
  - IPDPS'06 (to appear)
8. Summary and continuing work
- MuPC is a portable, open source implementation of UPC that provides performance comparable to other systems of similar design.
- MuPC is a practical testbed for experimental features of partitioned shared memory languages.
- Work on MuPC is continuing in the areas of
  - performance improvements
  - atomic memory operations
  - one-sided collective operations