Title: An Introduction to Unified Parallel C (UPC)
1. An Introduction to Unified Parallel C (UPC)
- James Dinan
- PhD Intern from Ohio State
- MCS Seminar at Argonne May 4, 2009
- Adapted from slides by Kathy Yelick (LBNL/UCB) and Tarek El-Ghazawi (GWU)
2. UPC Outline
- Background
- UPC Programming Model
- Memory Consistency and Synchronization
- Work Distribution
- Distributed Shared Arrays
- Pointers and Dynamic Memory Management
- Performance Results
3. Context
- Most parallel programs are written using either:
- SPMD parallel message passing (MPI)
- Many scientific applications
- Good scaling
- Requires attention to data distribution and communication
- Shared memory (OpenMP, Pthreads)
- Easier to program, but less scalable performance
- Few scientific applications
- Global Address Space languages take the best of both:
- Shared memory like threads (programmability)
- SPMD parallelism like MPI (performance)
- Adds locality to bridge the gap
4. Partitioned Global Address Space Models
[Figure: PGAS memory model. Threads 0..n each contribute a shared partition (holding x0..xP and a linked list head and nodes) to one global address space, and each thread keeps a private partition holding its own ptr variables.]
- Explicitly-parallel SPMD programming model
- Global Address Space model of memory
- Address space is logically partitioned
- Local vs. remote memory
- Local shared vs local private
- Enables creation of distributed shared data structures
- Programmer control over data layout and locality
- Multiple PGAS models: UPC (C), CAF (Fortran), Titanium (Java), Global Arrays (library)
5. UPC Overview
- Unified Parallel C (UPC) is:
- An explicit parallel extension of ANSI C
- A partitioned global address space language
- Similar to the C language philosophy
- Programmers are clever and careful, and may need to get close to the hardware to get performance, but can get in trouble
- Concise and efficient syntax
- Tunable approach to performance
- High level: sequential C → shared memory
- Medium level: locality, data distribution, consistency
- Low level: explicit one-sided communication
- Based on ideas from Split-C, AC, and PCP
6. Who is UPC?
- UPC is an open standard; the latest spec is v1.2 from May 2005
- Academic and government institutions
- George Washington University
- Lawrence Berkeley National Laboratory
- University of California, Berkeley
- University of Florida
- Michigan Technological University
- U.S. Department of Energy
- Army High Performance Computing Research Center
- Commercial Institutions
- Hewlett-Packard (HP)
- Cray, Inc
- Intrepid Technology, Inc.
- IBM
- Etnus, LLC (Totalview)
7. UPC Programming Model
8. UPC Execution Model
- A number of threads (i.e., processes) working independently in an SPMD fashion
- THREADS gives the number of threads
- MYTHREAD specifies thread index (0..THREADS-1)
- There are two compilation modes
- Static Threads mode
- THREADS is specified at compile time by the user
- The program may use THREADS as a compile-time constant
- Dynamic Threads mode
- Number of threads chosen when app is launched
9. Hello World in UPC
- Any legal C program is also a legal UPC program
- If you compile and run it as UPC with N threads, it will run N copies of the program.

    #include <upc.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        printf("Thread %d of %d: hello UPC world\n",
               MYTHREAD, THREADS);
        return 0;
    }
10. Private vs. Shared Variables in UPC
- Normal C variables and objects are allocated in the private memory space of each thread (the stack is private)
- Shared variables are allocated only once, by thread 0:

    shared int ours;   // use sparingly: performance
    int mine;

- Shared variables may not have dynamic lifetime: they may not occur in a function definition, except as static. Why?
[Figure: ours lives once in the shared space, with affinity to thread 0; each of threads 0..n has its own private mine.]
11. Memory Consistency and Synchronization
12. Memory Consistency in UPC
- The consistency model defines the order in which one thread may see another thread's accesses to memory
- If you write a program with unsynchronized accesses, what happens? Does this work?

    // Thread A             // Thread B
    data = ...;             while (!flag) { };
    flag = 1;               mydata = data;

- UPC has two types of accesses:
- Strict: sequential consistency; all threads see the same ordering
- Relaxed: may appear out of order to other threads; allows concurrency
- Both can be combined in the same program; the default is strict
- There are several ways of specifying the consistency model:

    #include <upc_relaxed.h>
    #pragma upc strict
    strict shared int flag;
13. Synchronization: Fence
- upc_fence
- Non-collective
- UPC ensures that all shared references issued before the upc_fence are complete
- Allows you to force an ordering between shared accesses
- Important when using relaxed semantics (see the sketch below)
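A minimal sketch (not from the slides) of using upc_fence to order a relaxed data write before raising a flag; the file-scope names here are illustrative:

    #include <upc_relaxed.h>   /* relaxed accesses by default in this file */

    shared int data;
    shared int flag;

    void producer(void) {
        data = 42;    /* relaxed write */
        upc_fence;    /* all earlier shared accesses complete before flag is set */
        flag = 1;
        /* A consumer thread would spin on flag (with strict or fenced
           accesses) before reading data. */
    }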
14. UPC Global Synchronization
- UPC has two basic forms of barriers:
- Barrier: block until all other threads arrive

    upc_barrier;

- Split-phase barriers:

    upc_notify;   // this thread is ready for the barrier
    // do computation unrelated to the barrier
    upc_wait;     // wait for the others to be ready

- Optional labels allow for debugging:

    #define MERGE_BARRIER 12
    if (MYTHREAD % 2 == 0) {
        ...
        upc_barrier MERGE_BARRIER;
    } else {
        ...
        upc_barrier MERGE_BARRIER;
    }
15. Synchronization: Locks
- Locks in UPC are represented by an opaque type:

    upc_lock_t

- Locks must be allocated before use:
- Collective: returns the same pointer to all threads

    upc_lock_t *upc_all_lock_alloc(void);

- Non-collective: returns different pointers

    upc_lock_t *upc_global_lock_alloc(void);

- To use a lock:

    void upc_lock(upc_lock_t *l);
    void upc_unlock(upc_lock_t *l);

- Locks can be freed when no longer in use:

    void upc_lock_free(upc_lock_t *ptr);
16. Example: Monte Carlo Pi Calculation
- Estimate π by throwing darts at a unit square
- Calculate the percentage that fall inside the unit circle
- Area of square: r^2 = 1
- Area of circle quadrant: ¼ π r^2 = π/4
- Randomly throw darts at (x, y) positions
- If x^2 + y^2 < 1, then the point is inside the circle
- Compute the ratio:
- Area of quadrant = points inside / points total
- π = 4 × area
17. Helper Code for Pi in UPC
- Function to throw a dart and calculate where it hits:

    int hit() {
        double x = ((double) rand()) / RAND_MAX;
        double y = ((double) rand()) / RAND_MAX;
        if ((x*x + y*y) < 1.0)
            return 1;
        else
            return 0;
    }
18. Pi in UPC: Shared Memory Style

    shared int hits;

    main(int argc, char **argv) {
        int i, my_hits, my_trials = 0;
        upc_lock_t *hit_lock = upc_all_lock_alloc();   // create a lock
        int trials = atoi(argv[1]);
        my_trials = (trials + THREADS - 1) / THREADS;
        srand(MYTHREAD*17);
        for (i = 0; i < my_trials; i++)
            my_hits += hit();                          // accumulate hits locally
        upc_lock(hit_lock);
        hits += my_hits;                               // accumulate across threads
        upc_unlock(hit_lock);
        upc_barrier;
        if (MYTHREAD == 0)
            printf("PI: %f", 4.0*hits/trials);
    }
19. Work Distribution Using upc_forall
20. Shared Arrays Are Cyclic By Default
- Shared scalars always live in thread 0
- Shared arrays are spread over the threads cyclically:

    shared int x[THREADS];      /* 1 element per thread */
    shared int y[3][THREADS];   /* 3 elements per thread */
    shared int z[3][3];         /* 2 or 3 elements per thread */

- In the picture below, assume THREADS = 4
- Red elements have affinity to thread 0
- Think of the linearized C array, then map elements round-robin across threads
[Figure: element-to-thread mapping for x, y, and z. As a 2D array, y is logically blocked by columns; z is not.]
21. Example: Vector Addition
- Questions about parallel vector addition:
- How to lay out the data (here it is cyclic)
- Which processor does what (here it is "owner computes")

    /* vadd.c */
    #include <upc_relaxed.h>
    #define N 100*THREADS

    shared int v1[N], v2[N], sum[N];       /* cyclic layout */

    void main() {
        int i;
        for (i = 0; i < N; i++)
            if (MYTHREAD == i%THREADS)     /* owner computes */
                sum[i] = v1[i] + v2[i];
    }
22. Work Sharing with upc_forall()
- The idiom on the previous slide is very common:
- Loop over all elements; work on those owned by this processor
- UPC adds a special type of loop:

    upc_forall(init; test; loop; affinity)

- The programmer asserts that the iterations are independent
- Behavior is undefined if there are dependencies across threads
- The affinity expression indicates which iterations run on each thread. It may have one of two types:
- Integer: affinity%THREADS == MYTHREAD
- Pointer: upc_threadof(affinity) == MYTHREAD
- Syntactic sugar for the loop on the previous slide
- Some compilers may do better than this, e.g.,

    for (i = MYTHREAD; i < N; i += THREADS)

- rather than having all threads iterate N times:

    for (i = 0; i < N; i++) if (MYTHREAD == i%THREADS) ...
23. Vector Addition with upc_forall
- The vadd example can be rewritten as follows
- Equivalent code could use &sum[i] for the affinity
- The code would be correct but slow if the affinity expression were i+1 rather than i

    #define N 100*THREADS
    shared int v1[N], v2[N], sum[N];

    void main() {
        int i;
        upc_forall(i = 0; i < N; i++; i)
            sum[i] = v1[i] + v2[i];
    }

- The cyclic data distribution may perform poorly on some machines: cache effects!
24. Distributed Arrays in UPC
25. Blocked Layouts in UPC
- The cyclic layout is typically stored in one of two ways:
- Distributed memory: each processor has a chunk of memory
- Thread 0 would have 0, THREADS, THREADS*2, ... in a chunk
- Shared memory machine: each thread has a logical chunk
- Shared memory would have 0, 1, 2, ..., THREADS, THREADS+1, ...
- What performance problem is there with the latter?
- The vector addition example can be rewritten as follows:

    #define N 100*THREADS
    shared [*] int v1[N], v2[N], sum[N];   /* blocked layout */

    void main() {
        int i;
        upc_forall(i = 0; i < N; i++; &sum[i])
            sum[i] = v1[i] + v2[i];
    }
26. Layouts in General
- All non-array objects have affinity with thread zero
- Array layouts are controlled by layout specifiers (examples sketched below):
- None: cyclic layout, i.e. block size of 1
- [*]: blocked layout
- [0] or []: indefinite layout, all on one thread
- [b] or [b1][b2]...[bn] (= [b1*b2*...*bn]): fixed block size
- The affinity of an array element is defined in terms of:
- the block size, a compile-time constant
- and THREADS
- Element i has affinity with thread

    (i / block_size) % THREADS

- In 2D and higher, linearize the elements as in a C row-major representation, and then use the above mapping
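A minimal sketch of the four specifiers above (the array names are illustrative, and static THREADS compilation mode is assumed so THREADS can appear in the declarations):

    shared        int cyc[8*THREADS];   /* no specifier: cyclic, block size 1 */
    shared [*]    int blk[8*THREADS];   /* blocked: one contiguous chunk per thread */
    shared []     int one[100];         /* indefinite: whole array on thread 0 */
    shared [4]    int b4[8*THREADS];    /* fixed block size of 4 elements */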
27. 2D Array Layouts in UPC

    shared [m]   int a1[n][m];          /* a1 has a row layout */
    shared [k*m] int a2[n][m];          /* a2 has a block row layout */
    shared       int a3[n][m*THREADS];  /* a3 has a column layout */

- To get more general HPF- and ScaLAPACK-style 2D blocked layouts, one needs to add dimensions.
- Assume r*c = THREADS:

    shared [b1][b2] int a5[m][n][r][c][b1][b2];

- or equivalently

    shared [b1*b2] int a5[m][n][r][c][b1][b2];

[Figure: layouts of a1, a2, and a3 across threads.]
28. UPC Matrix Multiplication Code

    #include <upc_relaxed.h>

    #define N 4
    #define P 4
    #define M 4

    // a and c are row-wise blocked shared matrices
    // b is column-wise blocked
    shared [N*P/THREADS] int a[N][P], c[N][M];
    shared [M/THREADS]   int b[P][M];

    void main(void) {
        int i, j, k;   // private variables
        upc_forall(i = 0; i < N; i++; &c[i][0]) {
            for (j = 0; j < M; j++) {
                c[i][j] = 0;
                for (k = 0; k < P; k++)
                    c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
29. Domain Decomposition for UPC
- Exploits locality in matrix multiplication
- A (N × P) is decomposed row-wise into blocks of size (N × P) / THREADS as shown below
- B (P × M) is decomposed column-wise into M/THREADS blocks as shown below
[Figure: A (N × P) is split row-wise, with thread t owning elements t*N*P/THREADS .. (t+1)*N*P/THREADS - 1; B (P × M) is split column-wise, with thread t owning columns t*M/THREADS .. (t+1)*M/THREADS - 1.]
- Note: N and M are assumed to be multiples of THREADS
30. Observations on Matrix Multiplication Code
- The UPC code is almost the same size as the sequential code
- Convert sequential C to parallel UPC code by adding shared to the matrices and a work-sharing loop
- Distributions are an incremental optimization that allows us to improve locality
- We would still get a correct result without the distributions
- Further improvement:
- We may not have all needed elements of B locally
- Making a private copy of B in each thread might result in better performance
- This can be done with the help of upc_memget (see the sketch below)
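A minimal sketch (not from the slides) of using upc_memget to gather one thread's column block of b into private memory, reusing the declarations from the matrix-multiplication slide and assuming static THREADS compilation mode:

    #include <upc_relaxed.h>

    #define P 4
    #define M 4

    shared [M/THREADS] int b[P][M];   /* column-wise blocked, as above */

    int b_priv[P][M/THREADS];         /* private copy of thread t's column block */

    void fetch_b_block(int t) {
        int k;
        /* Each row's block for thread t is contiguous and has affinity to
           a single thread, so one upc_memget per row is legal. */
        for (k = 0; k < P; k++)
            upc_memget(&b_priv[k][0], &b[k][t * (M/THREADS)],
                       (M/THREADS) * sizeof(int));
    }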
31. Pointers and Dynamic Memory Management
32. Pointers to Shared vs. Arrays
- In the C tradition, arrays can be accessed through pointers
- Here is the vector addition example using pointers:

    #define N 100*THREADS
    shared int v1[N], v2[N], sum[N];

    void main() {
        int i;
        shared int *p1, *p2;
        p1 = v1; p2 = v2;
        upc_forall(i = 0; i < N; i++, p1++, p2++; i)
            sum[i] = *p1 + *p2;
    }

[Figure: p1 walks the cyclically distributed array v1.]
33. UPC Pointers
- Where does the pointer reside? Where does it point?

    int *p1;                 /* private pointer to local memory */
    shared int *p2;          /* private pointer to shared space */
    int *shared p3;          /* shared pointer to local memory */
    shared int *shared p4;   /* shared pointer to shared space */

- Shared pointers to private memory (p3) are not recommended.
34. UPC Pointers
[Figure: p3 and p4 live in the shared space; each thread holds its own p1 and p2 in its private space.]
    int *p1;                 /* private pointer to local memory */
    shared int *p2;          /* private pointer to shared space */
    int *shared p3;          /* shared pointer to local memory */
    shared int *shared p4;   /* shared pointer to shared space */

- Pointers to shared often require more storage and are more costly to dereference; they may refer to local or remote memory.
35. UPC Pointers
- In UPC, pointers to shared objects have three fields:
- thread number
- local address of the block
- phase (position within the block)

    <Thread, Phase, Local Address>

- Example: Cray T3E implementation
[Figure: 64-bit pointer-to-shared format on the Cray T3E, with separate bit fields at bits 0-37, 38-48, and 49-63.]
36. UPC Pointers
- Pointer arithmetic supports blocked and non-blocked array distributions
- Casting pointers:
- shared to local is allowed
- private to shared is not allowed
- Casting a pointer-to-shared to a local pointer is well defined only if the object pointed to has affinity with the thread performing the cast (see the sketch below)
- In general it will otherwise result in an error
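A minimal sketch of the casting rule above (the names are illustrative):

    #include <upc.h>

    shared int x[THREADS];   /* cyclic: x[i] has affinity to thread i */

    void touch_my_element(void) {
        /* Legal: x[MYTHREAD] has affinity to the calling thread,
           so its address may be cast to an ordinary C pointer. */
        int *lp = (int *) &x[MYTHREAD];
        *lp = 42;

        /* Not well defined in general: x[(MYTHREAD+1)%THREADS] normally
           has affinity to another thread, so casting it is an error. */
    }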
37. Special Functions

    size_t upc_threadof(shared void *ptr);

- Returns the thread that the data at ptr has affinity to

    size_t upc_phaseof(shared void *ptr);

- Returns the phase (position within the block) field of the pointer-to-shared

    shared void *upc_resetphase(shared void *ptr);

- Returns a pointer like ptr with the phase reset to zero
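A minimal sketch (not from the slides) of querying these fields; it assumes at least two threads, so the expected output is thread 1, phase 1:

    #include <upc.h>
    #include <stdio.h>

    shared [3] int a[3*THREADS];   /* block size 3 */

    void inspect(void) {
        if (MYTHREAD == 0) {
            shared [3] int *p = &a[4];   /* element 4: block 1, offset 1 */
            printf("a[4] lives on thread %zu at phase %zu\n",
                   upc_threadof(p), upc_phaseof(p));
        }
    }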
38. UPC Memory Allocation

    shared void *upc_alloc(size_t nbytes);

- Allocates nbytes of shared memory with affinity to the calling thread
- upc_alloc is not collective

    void upc_free(shared void *ptr);

- Frees the dynamically allocated shared memory pointed to by ptr
- upc_free is not collective
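A minimal sketch (not from the slides) of a per-thread scratch buffer allocated with upc_alloc; the names are illustrative:

    #include <upc.h>
    #include <stddef.h>

    void scratch_example(size_t n) {
        /* Each calling thread gets its own buffer, all of it with
           affinity to that thread (hence the [] indefinite layout). */
        shared [] double *buf = (shared [] double *) upc_alloc(n * sizeof(double));
        double *lbuf = (double *) buf;   /* legal cast: the data is local */
        lbuf[0] = 1.0;
        upc_free(buf);                   /* non-collective */
    }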
39. Global Memory Allocation

    shared void *upc_global_alloc(size_t nblocks, size_t nbytes);

- nblocks: number of blocks; nbytes: block size
- Non-collective: called by one thread
- Allocates memory in the shared space
- If called by more than one thread, multiple regions are allocated and each thread that makes the call gets a different pointer
- Space allocated per calling thread is equivalent to:

    shared [nbytes] char [nblocks * nbytes]
40. Collective Global Memory Allocation

    shared void *upc_all_alloc(size_t nblocks, size_t nbytes);

- nblocks: number of blocks; nbytes: block size
- This function has the same result as upc_global_alloc, but it is a collective function
- All the threads get the same pointer
- Equivalent to:

    shared [nbytes] char [nblocks * nbytes]
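A minimal sketch (not from the slides) of a collectively allocated distributed array; the name data and the block size of 10 are illustrative:

    #include <upc.h>

    shared [10] int *data;   /* 10 ints per block, one block per thread */

    void setup(void) {
        /* Collective call: every thread receives the same pointer to
           THREADS blocks of 10 ints, one block with affinity to each thread. */
        data = (shared [10] int *) upc_all_alloc(THREADS, 10 * sizeof(int));
    }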
41. One-Sided Communication
- Semantics are similar to the libc memcpy(), but operate in the global address space
- Allows the programmer to explicitly manage communication

    upc_memget(void *dst, shared void *src, size_t size);

- One-sided get: copies data from shared to local

    upc_memput(shared void *dst, void *src, size_t size);

- One-sided put: copies data from local to shared

    upc_memcpy(shared void *dst, shared void *src, size_t size);

- One-sided copy: copies data from shared to shared
42. Distributed Arrays, Directory Style
- Some high-performance UPC programmers avoid the UPC-style arrays
- Instead, they build directories of distributed objects
- This is also more general

    typedef shared [] double *sdblptr;
    shared sdblptr directory[THREADS];

    directory[MYTHREAD] = upc_alloc(local_size*sizeof(double));
    upc_barrier;
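An illustrative follow-up (not from the slides), assuming the declarations above: any thread can index another thread's block through the directory, and a thread may cast its own entry to a plain C pointer for fast local access.

    double read_remote(int t, int i) {
        return directory[t][i];          /* element i of thread t's block */
    }

    void write_local(int i, double v) {
        double *mine = (double *) directory[MYTHREAD];   /* legal: local affinity */
        mine[i] = v;
    }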
43. UPC Performance
- Study conducted by Kathy Yelick, Chris Bell, Rajesh Nishtala, and Dan Bonachea at UC Berkeley
44. One-Sided vs. Two-Sided Messaging
[Figure: a one-sided put message carries a destination address plus the data payload and can be deposited directly into memory by the network interface; a two-sided message carries a message id plus the data payload and must be matched by the host CPU.]
- A one-sided put/get message can be handled directly by a network interface with RDMA support
- Avoids interrupting the CPU or storing data from the CPU (pre-posts)
- A two-sided message needs to be matched with a receive to identify the memory address where the data goes
- Matching can be offloaded to the network interface in networks like Quadrics
- Need to download match tables to the interface (from the host)
- Ordering requirements on messages can also hinder bandwidth
45. Performance Advantage of One-Sided Communication
- Opteron/InfiniBand (Jacquard at NERSC)
- GASNet's vapi-conduit and OSU MPI 0.9.5 (MVAPICH)
- This is a very good MPI implementation; it is limited by the semantics of message matching, ordering, etc.
- The half-power point (N½) differs by an order of magnitude
- Kathy Yelick with Paul Hargrove and Dan Bonachea
46. Case Study 2: NAS FT
- Performance of the Exchange (all-to-all) is critical
- 1D FFTs in each dimension, 3 phases
- Transpose after the first 2 for locality
- Bisection bandwidth-limited
- Problem as the number of processors grows
- Three approaches to the transpose:
- Exchange:
- wait for the 2nd-dimension FFTs to finish, send 1 message per processor pair
- Slab:
- wait for a chunk of rows destined for 1 processor, send when ready
- Pencil:
- send each row as it completes
- Kathy Yelick, Chris Bell, Rajesh Nishtala, Dan Bonachea
47. Overlapping Communication
- Goal: make use of all the wires all the time
- Schedule communication to avoid network backup
- Trade-off: overhead vs. overlap
- Exchange has the fewest messages and the least message overhead
- Slabs and pencils have more overlap; pencils the most
- Example: Class D problem on 256 processors
- Kathy Yelick with Chris Bell, Rajesh Nishtala, Dan Bonachea
48. NAS FT Variants Performance Summary
[Figure: performance summary chart; the best variants reach about 0.5 Tflop/s.]
- Slab is always best for MPI: the small-message cost is too high
- Pencil is always best for UPC: more overlap
- Kathy Yelick with Chris Bell, Rajesh Nishtala, Dan Bonachea
49. Summary
- UPC extends C with explicit parallel constructs
- UPC is PGAS:
- Defines a global address space
- Every shared object has affinity to a UPC thread
- UPC promises high productivity and high performance
- Tunable approach to performance:
- High level: sequential C → shared memory
- Medium level: locality, data distribution, consistency
- Low level: explicit one-sided communication
- Gives high performance on modern capability systems
- NAS FT benchmark on an InfiniBand cluster
- 16k Processors on
50. UPC Resources
- UPC language spec, user's guide, resources:
- http://upc.gwu.edu
- Berkeley UPC (RDMA clusters, shared memory, ...)
- Source-to-source UPC → C, GASNet runtime
- http://upc.lbl.gov
- http://upc-wiki.lbl.gov
- Intrepid UPC/GCC-UPC (shared memory, or the BUPC runtime)
- Direct UPC → binary; can also use the BUPC runtime
- http://www.intrepid.com
- MuPC from Michigan Tech (Linux clusters)
- EDG-based UPC → C translator, MPI runtime, reference implementation
- http://www.upc.mtu.edu
- Commercial UPC: IBM, SGI, Cray, HP
51. Backup Slides
52. Proposed Extensions to UPC
- UPC Collectives
- Thread groups/teams
- UPC I/O
- Nonblocking communication
- Point-to-point synchronization
- Atomic memory operations
- Variable blocksize pointers
- High performance timers
- Hierarchical thread layout query and control
- Non-contiguous data transfer
- UPC
- Active Messages
53. UPC Distributed Shared Linked List

    #include <upc.h>

    struct node_s {
        int value;
        shared struct node_s *next;
    };
    typedef struct node_s node_t;

    shared node_t *shared head = NULL;
    shared int turn = 0;

    int main(int argc, char **argv) {
        shared node_t *cur;
        shared node_t *mynode;

        mynode = upc_alloc(sizeof(node_t));
        mynode->value = MYTHREAD;
        mynode->next  = NULL;
        /* ... */
    }
54. UPC Collectives in General
- Collectives are a proposed extension
- Implemented in MuPC, Berkeley UPC, and others
- The UPC collectives interface is available from:
- http://www.gwu.edu/upc/docs/
- It contains typical functions:
- Data movement: broadcast, scatter, gather, ...
- Computational: reduce, prefix, ...
- The interface has synchronization modes:
- Avoid over-synchronizing (a barrier before/after is the simplest semantics, but may be unnecessary)
- Data being collected may be read/written by any thread simultaneously
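A minimal sketch (not from the slides) of one such data-movement collective, upc_all_broadcast from the UPC collectives library; the array names and NELEMS are illustrative:

    #include <upc.h>
    #include <upc_collective.h>

    #define NELEMS 4

    shared []       int src[NELEMS];           /* whole source on thread 0 */
    shared [NELEMS] int dst[NELEMS*THREADS];   /* one block of NELEMS per thread */

    void bcast_example(void) {
        int i;
        if (MYTHREAD == 0)
            for (i = 0; i < NELEMS; i++) src[i] = i;
        /* Every thread receives a copy of src in its own block of dst;
           the flags request barrier-like synchronization on entry and exit. */
        upc_all_broadcast(dst, src, NELEMS * sizeof(int),
                          UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
    }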
55. Pi in UPC: Data Parallel Style
- The previous version of Pi works, but it is not scalable
- On a large number of threads, the locked region will be a bottleneck
- Use a reduction for better scalability

    #include <bupc_collectivev.h>   // Berkeley value-based collectives
    // shared int hits;             // no shared variables needed
    main(int argc, char **argv) {
        ...
        for (i = 0; i < my_trials; i++)
            my_hits += hit();
        // type, input, thread, op
        my_hits = bupc_allv_reduce(int, my_hits, 0, UPC_ADD);
        // upc_barrier;   // not needed: barrier implied by the collective
        if (MYTHREAD == 0)
            printf("PI: %f", 4.0*my_hits/trials);
    }
56. GASNet Portability and High-Performance
- GASNet is better for latency across machines
- Kathy Yelick with the UPC group; GASNet design by Dan Bonachea
57. GASNet Portability and High-Performance
- GASNet is at least as high as (comparable to) MPI for large messages
- Kathy Yelick with the UPC group; GASNet design by Dan Bonachea
58. GASNet Portability and High-Performance
- GASNet excels at mid-range sizes, which are important for overlap
- Kathy Yelick with the UPC group; GASNet design by Dan Bonachea