An Introduction to Unified Parallel C (UPC) - PowerPoint PPT Presentation (Transcript)

1
An Introduction to Unified Parallel C (UPC)
  • James Dinan
  • PhD Intern from Ohio State
  • MCS Seminar at Argonne May 4, 2009
  • Adapted from slides by
  • Kathy Yelick (LBNL/UCB) and Tarek El-Ghazawi (GWU)

2
UPC Outline
  • Background
  • UPC Programming Model
  • Memory Consistency and Synchronization
  • Work Distribution
  • Distributed Shared Arrays
  • Pointers and Dynamic Memory Management
  • Performance Results

3
Context
  • Most parallel programs are written using either
  • SPMD parallel message passing (e.g., MPI)
  • Many scientific applications
  • Good scaling
  • Requires attention to data distribution and communication
  • Shared memory (OpenMP, Pthreads)
  • Easier to program, but less scalable performance
  • Few scientific applications
  • Global Address Space Languages take the best of
    both
  • Shared memory like threads (programmability)
  • SPMD parallelism like MPI (performance)
  • Adds Locality to bridge the gap

4
Partitioned Global Address Space Models
Thread0 Thread1
Threadn
X0
X1
XP
Shared
head
node
node
Global address space
Private
ptr
ptr
ptr
  • Explicitly-parallel SPMD programming model
  • Global Address Space model of memory
  • Address space is logically partitioned
  • Local vs. remote memory
  • Local shared vs local private
  • Enables creation of distributed shared data
    structures
  • Programmer control over data layout and locality
  • Multiple PGAS models: UPC (C), CAF (Fortran),
    Titanium (Java), Global Arrays (library)

5
UPC Overview
  • Unified Parallel C (UPC) is
  • An explicit parallel extension of ANSI C
  • A partitioned global address space language
  • Similar to the C language philosophy
  • Programmers are clever and careful, and may need
    to get close to the hardware to get performance,
    but can get themselves into trouble
  • Concise and efficient syntax
  • Tunable approach to performance
  • High level: Sequential C → shared memory
  • Medium level: locality, data distribution,
    consistency
  • Low level: explicit one-sided communication
  • Based on ideas from Split-C, AC, and PCP

6
Who is UPC
  • UPC is an open standard; the latest is v1.2 from
    May 2005
  • Academic and Government Institutions
  • George Washington University
  • Lawrence Berkeley National Laboratory
  • University of California, Berkeley
  • University of Florida
  • Michigan Technological University
  • U.S. Department of Energy
  • Army High Performance Computing Research Center
  • Commercial Institutions
  • Hewlett-Packard (HP)
  • Cray, Inc
  • Intrepid Technology, Inc.
  • IBM
  • Etnus, LLC (Totalview)

7
UPC Programming Model
8
UPC Execution Model
  • A number of threads (i.e. processes) working
    independently in an SPMD fashion
  • Number of threads: THREADS
  • MYTHREAD specifies the thread index (0..THREADS-1)
  • There are two compilation modes
  • Static Threads mode
  • THREADS is specified at compile time by the user
  • The program may use THREADS as a compile-time
    constant
  • Dynamic Threads mode
  • Number of threads chosen when app is launched

9
Hello World in UPC
  • Any legal C program is also a legal UPC program
  • If you compile and run it as UPC with N threads,
    it will run N copies of the program.
    #include <upc.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        printf("Thread %d of %d: hello UPC world\n",
               MYTHREAD, THREADS);
        return 0;
    }

10
Private vs. Shared Variables in UPC
  • Normal C variables and objects are allocated in
    the private memory space for each thread (stack
    is private)
  • Shared variables are allocated only once, by
    thread 0
    shared int ours;   // use sparingly: performance
    int mine;
  • Shared variables may not have dynamic lifetime:
    they may not occur in a function definition,
    except as static. Why?
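A tiny sketch (mine, not from the slides) of what the rule allows and forbids:

    shared int counter;            /* OK: file scope, static storage duration */

    void f(void) {
        static shared int ncalls;  /* OK: static, so not on the stack         */
        /* shared int tmp; */      /* illegal: automatic variables live on
                                      each thread's private stack             */
    }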

[Figure: ours is allocated once, in the shared region of the global address
space; each thread has its own private copy of mine.]
11
Memory Consistency and Synchronization
12
Memory Consistency in UPC
  • The consistency model defines the order in which
    one thread may see another thread's accesses to
    memory
  • If you write a program with unsynchronized
    accesses, what happens?
  • Does this work?
      // Thread 1: data = ...;    // Thread 2: while (!flag) { }
      //           flag = 1;      //           mydata = data;
  • UPC has two types of accesses
  • Strict: sequential consistency; all threads see
    the same ordering
  • Relaxed: accesses may appear out of order to other
    threads, allowing more concurrency
  • Can be combined in the same program; default is
    strict
  • There are several ways of specifying the
    consistency model
      #include <upc_relaxed.h>   /* whole file      */
      #pragma upc strict         /* block of code   */
      strict shared int flag;    /* single variable */
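A minimal sketch (mine, not from the slides) of how declaring the flag strict
makes the handoff above safe under otherwise relaxed accesses:

    #include <upc_relaxed.h>

    shared int data;
    strict shared int flag;   /* strict: ordered w.r.t. other shared accesses */

    /* Thread 1 (producer)          Thread 2 (consumer)                     */
    /*   data = compute();            while (!flag) ;     // spin           */
    /*   flag = 1;                    mydata = data;      // sees new data  */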

13
Synchronization - Fence
  • upc_fence
  • Non-collective
  • UPC ensures that all shared references issued
    before the upc_fence are complete
  • Allows you to force an ordering between shared
    accesses
  • Important when using relaxed semantics
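A short sketch (mine, not from the slides) of a fence ordering two relaxed
writes on the producing thread:

    #include <upc_relaxed.h>

    shared int result;
    shared int ready;

    void produce(void) {
        result = 42;   /* relaxed write                                 */
        upc_fence;     /* all earlier shared accesses complete here ... */
        ready = 1;     /* ... before this signal is issued              */
    }
    /* a consumer should also fence (or use strict accesses) before reading result */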

14
UPC Global Synchronization
  • UPC has two basic forms of barriers
  • Barrier block until all other threads arrive
  • upc_barrier
  • Split-phase barriers
  • upc_notify this thread is ready for barrier
  • do computation unrelated to barrier
  • upc_wait wait for others to be ready
  • Optional labels allow for debugging
    #define MERGE_BARRIER 12
    if (MYTHREAD % 2 == 0) {
        ...
        upc_barrier MERGE_BARRIER;
    } else {
        ...
        upc_barrier MERGE_BARRIER;
    }
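A short sketch (mine; do_local_work is a hypothetical helper) of the
split-phase form listed above:

    void exchange_phase(void) {
        upc_notify;        /* signal: this thread has reached the barrier    */
        do_local_work();   /* computation that does not depend on others     */
        upc_wait;          /* block until every thread has issued upc_notify */
    }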

15
Synchronization - Locks
  • Locks in UPC are represented by an opaque type
  • upc_lock_t
  • Locks must be allocated before use
  • Collective: returns the same pointer to all threads
  • upc_lock_t *upc_all_lock_alloc(void);
  • Non-collective: returns a different pointer per call
  • upc_lock_t *upc_global_lock_alloc(void);
  • To use a lock
  • void upc_lock(upc_lock_t *l);
  • void upc_unlock(upc_lock_t *l);
  • Locks can be freed when not in use
  • void upc_lock_free(upc_lock_t *ptr);

16
Example Monte Carlo Pi Calculation
  • Estimate Pi by throwing darts at a unit square
  • Calculate percentage that fall in the unit circle
  • Area of square: r² = 1
  • Area of circle quadrant: ¼ π r² = π/4
  • Randomly throw darts at (x, y) positions
  • If x² + y² < 1, then the point is inside the circle
  • Compute the ratio
  • Area of quadrant = points inside / points total
  • π = 4 × area

17
Helper Code for Pi in UPC
  • Function to throw dart and calculate where it
    hits
    int hit() {
        double x = ((double) rand()) / RAND_MAX;
        double y = ((double) rand()) / RAND_MAX;
        if ((x*x + y*y) < 1.0)
            return 1;
        else
            return 0;
    }

18
Pi in UPC Shared Memory Style
    shared int hits;

    int main(int argc, char **argv) {
        int i, my_hits = 0, my_trials = 0;
        upc_lock_t *hit_lock = upc_all_lock_alloc();   /* create a lock */
        int trials = atoi(argv[1]);
        my_trials = (trials + THREADS - 1) / THREADS;
        srand(MYTHREAD * 17);
        for (i = 0; i < my_trials; i++)
            my_hits += hit();                    /* accumulate hits locally */
        upc_lock(hit_lock);
        hits += my_hits;                         /* accumulate across threads */
        upc_unlock(hit_lock);
        upc_barrier;
        if (MYTHREAD == 0)
            printf("PI: %f\n", 4.0 * hits / trials);
    }
19
Work Distribution Using upc_forall
20
Shared Arrays Are Cyclic By Default
  • Shared scalars always live in thread 0
  • Shared arrays are spread over threads cyclically
    shared int x[THREADS];        /* 1 element per thread       */
    shared int y[3][THREADS];     /* 3 elements per thread      */
    shared int z[3][3];           /* 2 or 3 elements per thread */
  • In the figure below, assume THREADS = 4 (the original
    slide highlights the elements with affinity to thread 0)

[Figure: think of the linearized C array mapped round-robin over threads.
Viewed as a 2D array, y is logically blocked by columns; z is not.]
21
Example Vector Addition
  • Questions about parallel vector addition
  • How to lay out the data (here: cyclic)
  • Which processor does what (here: owner computes)
    /* vadd.c */
    #include <upc_relaxed.h>
    #define N 100*THREADS

    shared int v1[N], v2[N], sum[N];           /* cyclic layout */

    void main() {
        int i;
        for (i = 0; i < N; i++)
            if (MYTHREAD == i % THREADS)       /* owner computes */
                sum[i] = v1[i] + v2[i];
    }
22
Work Sharing with upc_forall()
  • The idiom in the previous slide is very common
  • Loop over all elements; work on those owned by this
    processor
  • UPC adds a special type of loop
  • upc_forall(init; test; loop; affinity)
  • Programmer indicates the iterations are
    independent
  • Behavior undefined if there are dependencies
    across threads
  • Affinity expression indicates which iterations to
    run on each thread. It may have one of two
    types
  • Integer: affinity % THREADS == MYTHREAD
  • Pointer: upc_threadof(affinity) == MYTHREAD
  • Syntactic sugar for the loop on the previous slide
  • Some compilers may do better than this, e.g.,
        for (i = MYTHREAD; i < N; i += THREADS)
  • rather than having all threads iterate N times:
        for (i = 0; i < N; i++)
            if (MYTHREAD == i % THREADS) ...

23
Vector Addition with upc_forall
  • The vadd example can be rewritten as follows
  • Equivalent code could use &sum[i] for the affinity
  • The code would be correct but slow if the
    affinity expression were i+1 rather than i.

    #define N 100*THREADS
    shared int v1[N], v2[N], sum[N];

    void main() {
        int i;
        upc_forall (i = 0; i < N; i++; i)
            sum[i] = v1[i] + v2[i];
    }

The cyclic data distribution may perform poorly
on some machines. Cache effects!
24
Distributed Arrays in UPC
25
Blocked Layouts in UPC
  • The cyclic layout is typically stored in one of
    two ways
  • Distributed memory: each processor has a chunk of
    memory
  • Thread 0 would have 0, THREADS, THREADS*2, ... in a
    chunk
  • Shared memory machine: each thread has a logical
    chunk
  • Shared memory would have 0, 1, 2, ..., THREADS,
    THREADS+1, ...
  • What performance problem is there with the
    latter?
  • The vector addition example can be rewritten as
    follows with a blocked layout

    #define N 100*THREADS
    shared [*] int v1[N], v2[N], sum[N];   /* [*] gives a blocked layout */

    void main() {
        int i;
        upc_forall (i = 0; i < N; i++; &sum[i])
            sum[i] = v1[i] + v2[i];
    }
26
Layouts in General
  • All non-array objects have affinity with thread
    zero.
  • Array layouts are controlled by layout
    specifiers
  • None (cyclic layout, i.e. block size of 1)
  • [*] (blocked layout)
  • [0] or [] (indefinite layout, all on one thread)
  • [b] or [b1][b2]...[bn] = [b1*b2*...*bn] (fixed block
    size)
  • The affinity of an array element is defined in
    terms of
  • block size, a compile-time constant
  • and THREADS.
  • Element i has affinity with thread
  • (i / block_size) % THREADS
  • In 2D and higher, linearize the elements as in a
    C row major representation, and then use above
    mapping
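A small illustrative sketch (array names are mine), assuming the static
THREADS compilation mode with THREADS = 4:

    shared        int a[16];   /* cyclic: block size 1                 */
    shared [*]    int b[16];   /* blocked: block size 16/THREADS = 4   */
    shared [0]    int c[16];   /* indefinite: everything on one thread */
    shared [2]    int d[16];   /* fixed block size 2                   */

    /* affinity of d[5]:  (5 / 2) % THREADS  =  2 % 4  =  thread 2 */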

27
2D Array Layouts in UPC
    shared [m]   int a1[n][m];
    shared [k*m] int a2[n][m];
    shared       int a3[n][m*THREADS];
  • a1 has a row layout
  • a2 has a block row layout.
  • a3 has a column layout
  • To get more general HPF and ScaLAPACK style 2D
    blocked layouts, one needs to add dimensions.
  • Assume r*c = THREADS
  • shared [b1][b2] int a5[m][n][r][c][b1][b2];
  • or equivalently
  • shared [b1*b2] int a5[m][n][r][c][b1][b2];
28
UPC Matrix Multiplication Code
    #include <upc_relaxed.h>
    #define N 4
    #define P 4
    #define M 4

    /* a and c are row-wise blocked shared matrices; b is column-wise blocked */
    shared [N*P/THREADS] int a[N][P], c[N][M];
    shared [M/THREADS]   int b[P][M];

    void main (void) {
        int i, j, k;   /* private variables */
        upc_forall (i = 0; i < N; i++; &c[i][0]) {
            for (j = 0; j < M; j++) {
                c[i][j] = 0;
                for (k = 0; k < P; k++)
                    c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
29
Domain Decomposition for UPC
  • Exploits locality in matrix multiplication
  • A (N × P) is decomposed row-wise into blocks of
    size (N × P) / THREADS as shown below
  • B (P × M) is decomposed column-wise into M /
    THREADS blocks as shown below

[Figure: A's N×P elements are split row-wise into THREADS contiguous chunks:
thread 0 owns elements 0 .. (N*P/THREADS)-1, thread 1 owns (N*P/THREADS) ..
(2*N*P/THREADS)-1, ..., thread THREADS-1 owns ((THREADS-1)*N*P)/THREADS ..
(THREADS*N*P/THREADS)-1. B's M columns are split into THREADS groups: thread 0
owns columns 0 .. (M/THREADS)-1, ..., the last thread owns columns
((THREADS-1)*M)/THREADS .. M-1.]
Note: N and M are assumed to be multiples of THREADS.
30
Observations on Matrix Multiplication Code
  • The UPC code is almost the same size as the sequential code
  • Convert sequential C to parallel UPC code by
    adding shared to matrices and work sharing loop
  • Distributions are an incremental optimization
    that allow us to improve locality
  • Would still get correct result without
    distributions
  • Further Improvement
  • We may not have all needed elements of B locally
  • Making a private copy of B in each thread might
    result in better performance
  • Can be done with the help of upc_memget
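A hedged sketch of that optimization (the buffer b_local and the loop are
mine, not from the slides), assuming M is a multiple of THREADS:

    /* inside main(), before the multiply loop */
    int b_local[P][M];                 /* private copy of b */
    int k, t, blk = M / THREADS;

    for (k = 0; k < P; k++)
        for (t = 0; t < THREADS; t++)
            /* each (row k, block t) chunk of b is contiguous and has
               affinity to thread t, so one bulk get fetches it */
            upc_memget(&b_local[k][t * blk], &b[k][t * blk],
                       blk * sizeof(int));

    /* ... then read b_local[k][j] instead of b[k][j] in the inner product */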

31
Pointers and Dynamic Memory Management
32
Pointers to Shared vs. Arrays
  • In the C tradition, arrays can be accessed through
    pointers
  • Here is the vector addition example using pointers

    #define N 100*THREADS
    shared int v1[N], v2[N], sum[N];

    void main() {
        int i;
        shared int *p1, *p2;
        p1 = v1; p2 = v2;
        upc_forall (i = 0; i < N; i++, p1++, p2++; i)
            sum[i] = *p1 + *p2;
    }

33
UPC Pointers
Where does the pointer reside, and where does it point?

    int *p1;                 /* private pointer to local memory */
    shared int *p2;          /* private pointer to shared space */
    int * shared p3;         /* shared pointer to local memory  */
    shared int * shared p4;  /* shared pointer to shared space  */

Shared pointers to private memory (p3) are not recommended.
34
UPC Pointers
[Figure: p1 and p2 exist once per thread in private memory; p3 and p4 live in
the shared region of the global address space.]

    int *p1;                 /* private pointer to local memory */
    shared int *p2;          /* private pointer to shared space */
    int * shared p3;         /* shared pointer to local memory  */
    shared int * shared p4;  /* shared pointer to shared space  */

Pointers to shared often require more storage and are more costly to
dereference; they may refer to local or remote memory.
35
UPC Pointers
  • In UPC pointers to shared objects have three
    fields
  • thread number
  • local address of block
  • phase (specifies position in the block)
  • < Thread, Phase, Local Address >
  • Example: the Cray T3E implementation

[Figure: a 64-bit pointer-to-shared on the T3E packs the three fields into
bit ranges 0-37, 38-48, and 49-63.]
36
UPC Pointers
  • Pointer arithmetic supports blocked and
    non-blocked array distributions
  • Casting pointers
  • shared to local is allowed
  • private to shared is not allowed
  • Casting of shared to local is well defined only
    if the object pointed to by the pointer to shared
    has affinity with the thread performing the cast
  • In general, a cast without affinity will result in
    an error
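A small sketch (mine, not from the slides) of a well-defined shared-to-local
cast:

    shared int x[THREADS];

    void touch_mine(void) {
        int *p = (int *)&x[MYTHREAD];  /* legal: x[MYTHREAD] has affinity
                                          to the casting thread           */
        *p = 42;                       /* fast local access through an
                                          ordinary C pointer              */
    }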

37
Special Functions
  • size_t upc_threadof(shared void *ptr);
  • Returns the id of the thread that the data at ptr
    has affinity to
  • size_t upc_phaseof(shared void *ptr);
  • Returns the phase (position within the block) field
    of the pointer to shared
  • shared void *upc_resetphase(shared void *ptr);
  • Resets the phase to zero
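A small illustrative sketch (the array name is mine), assuming the static
THREADS mode with THREADS = 4:

    shared [2] int a[16];                 /* fixed block size 2 */

    void query(void) {
        size_t t = upc_threadof(&a[5]);   /* (5 / 2) % THREADS = 2        */
        size_t p = upc_phaseof(&a[5]);    /* 5 % 2 = 1, position in block */
        shared void *q = upc_resetphase(&a[5]);  /* same block, phase 0,
                                                    i.e. it points at a[4] */
    }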

38
UPC Memory Allocation
  • shared void *upc_alloc(size_t nbytes);
  • Allocate nbytes of shared memory with affinity to
    the calling thread
  • upc_alloc is not collective
  • void upc_free(shared void *ptr);
  • Frees the dynamically allocated shared memory
    pointed to by ptr
  • upc_free is not collective

39
Global Memory Allocation
  • shared void *upc_global_alloc(size_t nblocks,
    size_t nbytes);
  • nblocks: number of blocks; nbytes: block size
  • Non-collective: called by one thread
  • Allocates memory in the shared space
  • If called by more than one thread, multiple
    regions are allocated and each thread which makes
    the call gets a different pointer
  • Space allocated per calling thread is equivalent to
    shared [nbytes] char [nblocks * nbytes]

40
Collective Global Memory Allocation
  • shared void *upc_all_alloc(size_t nblocks,
    size_t nbytes);
  • nblocks: number of blocks; nbytes: block size
  • This function has the same result as
    upc_global_alloc, but it is a collective
    function
  • All the threads will get the same pointer
  • Equivalent to
    shared [nbytes] char [nblocks * nbytes]
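A minimal sketch (sizes, names, and casts are mine) contrasting the three
allocation calls; the statements would sit inside main():

    shared [10] int *all, *one = NULL;
    shared int *mine;

    /* collective: every thread makes the call and gets the same pointer */
    all = (shared [10] int *) upc_all_alloc(THREADS, 10 * sizeof(int));

    /* non-collective: here only thread 0 allocates, so it would have to
       publish the pointer (e.g. through a shared variable) for others   */
    if (MYTHREAD == 0)
        one = (shared [10] int *) upc_global_alloc(THREADS, 10 * sizeof(int));

    /* per-thread: nbytes with affinity to the calling thread (cf. slide 42) */
    mine = (shared int *) upc_alloc(10 * sizeof(int));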

41
One-Sided Communication
  • Semantics similar to libc memcpy(), but operating
    on the global address space
  • Allow programmer to explicitly manage
    communication
  • upc_memget(void *dst, shared void *src, size_t size)
  • One-sided get: copies data from shared to local
  • upc_memput(shared void *dst, void *src, size_t size)
  • One-sided put: copies data from local to shared
  • upc_memcpy(shared void *dst, shared void *src, size_t size)
  • One-sided copy: copies data from shared to shared
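A short sketch (names and sizes are mine) of a bulk one-sided put into a
block-distributed array:

    #define CHUNK 256
    shared [CHUNK] double out[CHUNK * THREADS];
    double local[CHUNK];

    void publish(void) {
        /* ... fill local[] ... */
        /* one bulk put: this thread's CHUNK values land in its own block */
        upc_memput(&out[MYTHREAD * CHUNK], local, CHUNK * sizeof(double));
        upc_barrier;   /* make the data visible before others read it */
    }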

42
Distributed Arrays Directory Style
  • Some high performance UPC programmers avoid the
    UPC style arrays
  • Instead, build directories of distributed objects
  • Also more general
    typedef shared [] double *sdblptr;
    shared sdblptr directory[THREADS];

    directory[i] = upc_alloc(local_size * sizeof(double));  /* i == MYTHREAD */
    upc_barrier;

43
UPC Performance
Study conducted by Kathy Yelick, Chris Bell, Rajesh Nishtala, and Dan
Bonachea at UC Berkeley
44
One-Sided vs Two-Sided Messaging
[Figure: a one-sided put message carries a destination address plus the data
payload, so the network interface can deposit it directly into memory; a
two-sided message carries only a message id plus the data payload, so the
receiving host CPU must match it against a posted receive to find the
destination address.]
  • A one-sided put/get message can be handled
    directly by a network interface with RDMA support
  • Avoid interrupting the CPU or storing data from
    CPU (preposts)
  • A two-sided message needs to be matched with a
    receive to identify the memory address to put the data
  • Offloaded to Network Interface in networks like
    Quadrics
  • Need to download match tables to interface (from
    host)
  • Ordering requirements on messages can also hinder
    bandwidth

45
Performance Advantage of One-Sided Communication
  • Opteron/InfiniBand (Jacquard at NERSC)
  • GASNet's vapi-conduit and OSU MPI 0.9.5 (MVAPICH)
  • This is a very good MPI implementation; it is
    limited by the semantics of message matching,
    ordering, etc.
  • The half-power point (N½) differs by an order of
    magnitude

Kathy Yelick with Paul Hargrove and Dan Bonachea
46
Case Study 2 NAS FT
  • Performance of Exchange (All-to-all) is critical
  • 1D FFTs in each dimension, 3 phases
  • Transpose after first 2 for locality
  • Bisection bandwidth-limited
  • This becomes a problem as the number of processors grows
  • Three approaches to transpose
  • Exchange
  • wait for 2nd dim FFTs to finish, send 1 message
    per processor pair
  • Slab
  • wait for chunk of rows destined for 1 proc, send
    when ready
  • Pencil
  • send each row as it completes

Kathy Yelick, Chris Bell, Rajesh Nishtala, Dan
Bonachea
47
Overlapping Communication
  • Goal: make use of all the wires, all the time
  • Schedule communication to avoid network backup
  • Trade-off: overhead vs. overlap
  • Exchange has the fewest messages and the least
    message overhead
  • Slabs and pencils have more overlap; pencils the
    most
  • Example: Class D problem on 256 processors

Kathy Yelick with Chris Bell, Rajesh Nishtala,
Dan Bonachea
48
NAS FT Variants Performance Summary
0.5 Tflops
  • Slab is always best for MPI: the small-message cost
    is too high
  • Pencil is always best for UPC: more overlap

Kathy Yelick with Chris Bell, Rajesh Nishtala,
Dan Bonachea
49
Summary
  • UPC extends C with explicit parallel constructs
  • UPC is a PGAS language
  • Defines a global address space
  • Every shared object has affinity to a UPC
    thread
  • UPC promises high productivity and high
    performance
  • Tunable approach to performance
  • High level: Sequential C → shared memory
  • Medium level: locality, data distribution,
    consistency
  • Low level: explicit one-sided communication
  • Gives high performance on modern capability
    systems
  • NAS FT Benchmark on IB Cluster
  • 16k Processors on

50
UPC Resources
  • UPC Language Spec, user's guide, resources
  • http://upc.gwu.edu
  • Berkeley UPC (RDMA clusters, shared memory, ...)
  • Source-to-source UPC → C translator, GASNet runtime
  • http://upc.lbl.gov
  • http://upc-wiki.lbl.gov
  • Intrepid UPC/GCC-UPC (shared memory or BUPCR)
  • Direct UPC → binary; can also use the BUPC runtime
  • http://www.intrepid.com
  • MuPC from Michigan Tech (Linux clusters)
  • EDG UPC → C translator, MPI runtime, reference
    implementation
  • http://www.upc.mtu.edu
  • Commercial UPC: IBM, SGI, Cray, HP

51
Backup Slides
52
Proposed Extensions to UPC
  • UPC Collectives
  • Thread groups/teams
  • UPC I/O
  • Nonblocking communication
  • Point-to-point synchronization
  • Atomic memory operations
  • Variable blocksize pointers
  • High performance timers
  • Hierarchical thread layout query and control
  • Non-contiguous data transfer
  • UPC
  • Active Messages

53
UPC Distributed Shared Linked List
    #include <upc.h>

    struct node_s {
        int value;
        shared struct node_s *next;
    };
    typedef struct node_s node_t;

    shared node_t *shared head = NULL;
    shared int turn = 0;

    int main(int argc, char **argv) {
        shared node_t *cur;
        shared node_t *mynode;

        mynode = upc_alloc(sizeof(node_t));
        mynode->value = MYTHREAD;
        mynode->next = NULL;
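        /* Hedged continuation (mine, not from the slides): link this
           thread's node in, one thread at a time, using the shared turn
           counter; a real code might use a lock or strict variables.     */
        while (turn != MYTHREAD)
            upc_fence;             /* spin; the fence forces fresh reads     */
        mynode->next = head;
        head = mynode;
        upc_fence;                 /* publish the new head before ...        */
        turn = turn + 1;           /* ... handing the turn to the next thread */
        upc_barrier;

        /* any thread can now traverse the list through pointers to shared */
        for (cur = head; cur != NULL; cur = cur->next)
            ;                      /* e.g. inspect cur->value */
        return 0;
    }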

54
UPC Collectives in General
  • Collectives are a proposed extension
  • Implemented in MuPC, Berkeley UPC and others
  • The UPC collectives interface is available from
  • http://www.gwu.edu/upc/docs/
  • It contains the typical functions
  • Data movement: broadcast, scatter, gather, ...
  • Computational: reduce, prefix, ...
  • Interface has synchronization modes
  • Avoid over-synchronizing (barrier before/after is
    simplest semantics, but may be unnecessary)
  • Data being collected may be read/written by any
    thread simultaneously

55
Pi in UPC Data Parallel Style
  • The previous version of Pi works, but is not
    scalable
  • On a large number of threads, the locked region
    will be a bottleneck
  • Use a reduction for better scalability
    #include <bupc_collectivev.h>   /* Berkeley collectives */

    /* shared int hits;  -- no shared variables needed any more */
    int main(int argc, char **argv) {
        ...
        for (i = 0; i < my_trials; i++)
            my_hits += hit();
        /* reduce: type, input, destination thread, operation */
        my_hits = bupc_allv_reduce(int, my_hits, 0, UPC_ADD);
        /* upc_barrier;  -- barrier implied by the collective */
        if (MYTHREAD == 0)
            printf("PI: %f\n", 4.0 * my_hits / trials);
    }
56
GASNet Portability and High-Performance
GASNet better for latency across machines
Kathy Yelick with UPC Group GASNet design by Dan
Bonachea
57
GASNet Portability and High-Performance
GASNet at least as high (comparable) for large
messages
Kathy Yelick with UPC Group GASNet design by Dan
Bonachea
58
GASNet Portability and High-Performance
GASNet excels at mid-range sizes important for
overlap
Kathy Yelick with UPC Group GASNet design by Dan
Bonachea