1
SC05 Tutorial
High Performance Parallel Programming with Unified Parallel C (UPC)

Tarek El-Ghazawi (tarek@gwu.edu), The George Washington U.
Phillip Merkey and Steve Seidel ({merk,steve}@mtu.edu), Michigan Technological U.
2
UPC Tutorial Web Site
This site contains the UPC code segments
discussed in this tutorial.
http://www.upc.mtu.edu/SC05-tutorial
3
UPC Home Page: http://www.upc.gwu.edu
4
UPC textbook now available
http://www.upcworld.org
  • UPC Distributed Shared Memory Programming
  • Tarek El-Ghazawi
  • William Carlson
  • Thomas Sterling
  • Katherine Yelick
  • Wiley, May 2005
  • ISBN 0-471-22048-5

5
Section 1 The UPC Language
  • Introduction
  • UPC and the PGAS Model
  • Data Distribution
  • Pointers
  • Worksharing and Exploiting Locality
  • Dynamic Memory Management
  • (10:15am - 10:30am break)
  • Synchronization
  • Memory Consistency

El-Ghazawi
6
Section 2 UPC Systems
Merkey Seidel
  • Summary of current UPC systems
  • Cray X-1
  • Hewlett-Packard
  • Berkeley
  • Intrepid
  • MTU
  • UPC application development tools
  • totalview
  • upc_trace
  • performance toolkit interface
  • performance model

7
Section 3 UPC Libraries
Seidel
  • Collective Functions
  • Bucket sort example
  • UPC collectives
  • Synchronization modes
  • Collectives performance
  • Extensions
  • (Noon - 1:00pm lunch)
  • UPC-IO
  • Concepts
  • Main Library Calls
  • Library Overview

El-Ghazawi
8
Sec. 4 UPC Applications Development
  • Two case studies of application design
  • histogramming
  • locks revisited
  • generalizing the histogram problem
  • programming the sparse case
  • implications of the memory model
  • (2:30pm - 2:45pm break)
  • generic science code (advection)
  • shared multi-dimensional arrays
  • implications of the memory model
  • UPC tips, tricks, and traps

Merkey
Seidel
9
Introduction
  • UPC = Unified Parallel C
  • Set of specs for a parallel C
  • v1.0 completed February of 2001
  • v1.1.1 in October of 2003
  • v1.2 in May of 2005
  • Compiler implementations by vendors and others
  • Consortium of government, academia, and HPC
    vendors including IDA CCS, GWU, UCB, MTU, UMN,
    ARSC, UMCP, U of Florida, ANL, LBNL, LLNL, DoD,
    DoE, HP, Cray, IBM, Sun, Intrepid, Etnus, and others

10
Introduction (cont.)
  • UPC compilers are now available for most HPC
    platforms and clusters
  • Some are open source
  • A debugger is available and a performance
    analysis tool is in the works
  • Benchmarks, programming examples, and compiler
    testing suite(s) are available
  • Visit www.upcworld.org or upc.gwu.edu for more
    information

11
Parallel Programming Models
  • What is a programming model?
  • An abstract virtual machine
  • A view of data and execution
  • The logical interface between architecture and
    applications
  • Why Programming Models?
  • Decouple applications and architectures
  • Write applications that run effectively across
    architectures
  • Design new architectures that can effectively
    support legacy applications
  • Programming Model Design Considerations
  • Expose modern architectural features to exploit
    machine power and improve performance
  • Maintain Ease of Use

12
Programming Models
  • Common Parallel Programming models
  • Data Parallel
  • Message Passing
  • Shared Memory
  • Distributed Shared Memory
  • Hybrid models
  • Shared Memory under Message Passing

13
Programming Models
[Table: Message Passing (MPI), Shared Memory (OpenMP), and DSM/PGAS (UPC) compared by process/thread model and address-space organization]
14
The Partitioned Global Address Space (PGAS) Model
  • Aka the DSM model
  • Concurrent threads with a partitioned shared
    space
  • Similar to the shared memory model
  • Memory partition Mi has affinity to thread Thi
  • (+)ive
  • Helps exploiting locality
  • Simple statements as in SM
  • (-)ive
  • Synchronization
  • UPC, also CAF and Titanium

[Figure: threads Th0 ... Thn-1, each with affinity to its memory partition M0 ... Mn-1 within one partitioned shared address space; any thread may access a location x in another partition]
15
What is UPC?
  • Unified Parallel C
  • An explicit parallel extension of ISO C
  • A partitioned shared memory parallel programming
    language

16
UPC Execution Model
  • A number of threads working independently in a
    SPMD fashion
  • MYTHREAD specifies thread index (0..THREADS-1)
  • Number of threads specified at compile-time or
    run-time
  • Synchronization when needed
  • Barriers
  • Locks
  • Memory consistency control
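
A minimal sketch of the SPMD model described above (the file name and message text are illustrative, not from the tutorial):

/* hello_upc.c - every thread runs main(); MYTHREAD and THREADS identify it */
#include <upc_relaxed.h>
#include <stdio.h>

int main(void)
{
  printf("Hello from thread %d of %d\n", MYTHREAD, THREADS);
  upc_barrier;            /* synchronize all threads before exiting */
  return 0;
}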

17
UPC Memory Model
[Figure: threads 0 .. THREADS-1 each own a private space plus one partition of the shared (partitioned global) address space]
  • A pointer-to-shared can reference all locations
    in the shared space, but there is data-thread
    affinity
  • A private pointer may reference addresses in its
    private space or its local portion of the shared
    space
  • Static and dynamic memory allocations are
    supported for both shared and private memory

18
User's General View
  • A collection of threads operating in a single
    global address space, which is logically
    partitioned among threads. Each thread has
    affinity with a portion of the globally shared
    address space. Each thread also has a private
    space.

19
A First Example: Vector Addition
// vect_add.c
#include <upc_relaxed.h>
#define N 100*THREADS
shared int v1[N], v2[N], v1plusv2[N];
void main() {
  int i;
  for(i=0; i<N; i++)
    if (MYTHREAD == i%THREADS)
      v1plusv2[i] = v1[i] + v2[i];
}

[Figure: v1, v2, and v1plusv2 laid out cyclically across Thread 0 and Thread 1, showing which iterations each thread executes]
20
2nd Example: A More Efficient Implementation
// vect_add.c
#include <upc_relaxed.h>
#define N 100*THREADS
shared int v1[N], v2[N], v1plusv2[N];
void main() {
  int i;
  for(i=MYTHREAD; i<N; i+=THREADS)
    v1plusv2[i] = v1[i] + v2[i];
}

[Figure: same cyclic data layout; each thread now loops only over the iterations it owns]
21
3rd Example: A More Convenient Implementation with upc_forall
// vect_add.c
#include <upc_relaxed.h>
#define N 100*THREADS
shared int v1[N], v2[N], v1plusv2[N];
void main() {
  int i;
  upc_forall(i=0; i<N; i++; i)
    v1plusv2[i] = v1[i] + v2[i];
}

[Figure: same cyclic data layout; upc_forall assigns iteration i to thread i%THREADS]
22
Example: UPC Matrix-Vector Multiplication - Default Distribution
// vect_mat_mult.c
#include <upc_relaxed.h>
shared int a[THREADS][THREADS];
shared int b[THREADS], c[THREADS];
void main (void) {
  int i, j;
  upc_forall(i = 0; i < THREADS; i++; i) {
    c[i] = 0;
    for (j = 0; j < THREADS; j++)
      c[i] += a[i][j]*b[j];
  }
}
23
Data Distribution
[Figure: default element-by-element (round-robin) distribution - the elements of A, B, and C are scattered across Threads 0-2, so a row of A spans several threads]
24
A Better Data Distribution
[Figure: a better distribution - thread i holds row i of A together with c[i]]
25
Example: UPC Matrix-Vector Multiplication - The Better Distribution
// vect_mat_mult.c
#include <upc_relaxed.h>
shared [THREADS] int a[THREADS][THREADS];
shared int b[THREADS], c[THREADS];
void main (void) {
  int i, j;
  upc_forall(i = 0; i < THREADS; i++; i) {
    c[i] = 0;
    for (j = 0; j < THREADS; j++)
      c[i] += a[i][j]*b[j];
  }
}
26
Shared and Private Data
  • Examples of Shared and Private Data Layout
  • Assume THREADS = 3
  • shared int x;    /* x will have affinity to thread 0 */
  • shared int y[THREADS];
  • int z;
  • will result in the layout

[Figure: x and y[0] reside in the shared space of Thread 0, y[1] on Thread 1, y[2] on Thread 2; each thread has its own private copy of z]
27
Shared and Private Data
  • shared int A[4][THREADS];
  • will result in the following data layout

[Figure: column j of A, i.e. A[0][j] .. A[3][j], has affinity to thread j]
28
Shared and Private Data
  • shared int A[2][2*THREADS];
  • will result in the following data layout

[Figure: with the default block size of 1, A[i][j] has affinity to thread j%THREADS, so each row of A wraps around the threads twice]
29
Blocking of Shared Arrays
  • Default block size is 1
  • Shared arrays can be distributed on a block per
    thread basis, round robin with arbitrary block
    sizes.
  • A block size is specified in the declaration as
    follows
  • shared [block-size] type array[N];
  • e.g. shared [4] int a[16];

30
Blocking of Shared Arrays
  • Block size and THREADS determine affinity
  • The term affinity refers to the thread in whose
    local shared-memory space a shared data item
    resides
  • Element i of an array with block size b has
    affinity to thread (i/b) mod THREADS
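
A quick worked illustration of that rule (the array is just an example), assuming THREADS = 2:

shared [4] int a[16];    /* block size 4 */
/* a[0..3]   -> thread 0  (block 0: 0/4 mod 2 = 0) */
/* a[4..7]   -> thread 1  (block 1) */
/* a[8..11]  -> thread 0  (block 2) */
/* a[12..15] -> thread 1  (block 3) */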

31
Shared and Private Data
  • Shared objects placed in memory based on affinity
  • Affinity can also be defined based on the ability
    of a thread to refer to an object by a private
    pointer
  • All non-array shared qualified objects, i.e.
    shared scalars, have affinity to thread 0
  • Threads access shared and private data

32
Shared and Private Data
  • Assume THREADS = 4
  • shared [3] int A[4][THREADS];
  • will result in the following data layout

[Figure: blocks of 3 consecutive elements of A (taken in row-major order) are dealt round-robin to threads 0-3]
33
Special Operators
  • upc_localsizeof(type-name or expression) returns
    the size of the local portion of a shared object
  • upc_blocksizeof(type-name or expression) returns
    the blocking factor associated with the argument
  • upc_elemsizeof(type-name or expression) returns
    the size (in bytes) of the left-most type that is
    not an array

34
Usage Example of Special Operators
  • typedef shared int sharray[10*THREADS];
  • sharray a;
  • char i;
  • upc_localsizeof(sharray)  -> 10*sizeof(int)
  • upc_localsizeof(a)        -> 10*sizeof(int)
  • upc_localsizeof(i)        -> 1
  • upc_blocksizeof(a)        -> 1
  • upc_elemsizeof(a)         -> sizeof(int)

35
String functions in UPC
  • UPC provides standard library functions to move
    data to/from shared memory
  • Can be used to move chunks in the shared space or
    between shared and private spaces

36
String functions in UPC
  • Equivalent of memcpy
  • upc_memcpy(dst, src, size)
  • copy from shared to shared
  • upc_memput(dst, src, size)
  • copy from private to shared
  • upc_memget(dst, src, size)
  • copy from shared to private
  • Equivalent of memset
  • upc_memset(dst, char, size)
  • initializes shared memory with a character
  • The shared region must be contiguous, with all of
    its elements having the same affinity
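
A minimal sketch of moving data between private and shared space (the array name, block size, and helper function are illustrative, not from the tutorial):

#include <upc_relaxed.h>
#define B 64
shared [B] int sh[B*THREADS];        /* one block of B ints per thread */

void stage_data(void)
{
  int buf[B];
  /* ... fill buf with local results ... */
  /* private -> shared: write into this thread's own block */
  upc_memput(&sh[MYTHREAD*B], buf, B*sizeof(int));
  /* shared -> private: read the same block back */
  upc_memget(buf, &sh[MYTHREAD*B], B*sizeof(int));
}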

37
UPC Pointers
Where does it point to?
Where does it reside?
38
UPC Pointers
  • How to declare them?
  • int *p1;                /* private pointer pointing locally */
  • shared int *p2;         /* private pointer pointing into the shared space */
  • int *shared p3;         /* shared pointer pointing locally */
  • shared int *shared p4;  /* shared pointer pointing into the shared space */
  • You may find many people using "shared pointer" to
    mean a pointer pointing to a shared object, e.g.
    equivalent to p2, but it could be p4 as well.

39
UPC Pointers
[Figure: where each pointer kind resides (private vs. shared space of Thread 0) and where it may point]
40
UPC Pointers
  • What are the common usages?
  • int *p1;                /* access to private data or to local shared data */
  • shared int *p2;         /* independent access of threads to data in shared space */
  • int *shared p3;         /* not recommended */
  • shared int *shared p4;  /* common access of all threads to data in the shared space */

41
UPC Pointers
  • In UPC pointers to shared objects have three
    fields
  • thread number
  • local address of block
  • phase (specifies position in the block)
  • Example: Cray T3E implementation

[Figure: the three fields packed into a 64-bit pointer-to-shared, with bit-field boundaries at bits 37/38 and 48/49]
42
UPC Pointers
  • Pointer arithmetic supports blocked and
    non-blocked array distributions
  • Casting of shared to private pointers is allowed
    but not vice versa !
  • When casting a pointer-to-shared to a private
    pointer, the thread number of the
    pointer-to-shared may be lost
  • Casting of a pointer-to-shared to a private
    pointer is well defined only if the pointed to
    object has affinity with the local thread

43
Special Functions
  • size_t upc_threadof(shared void *ptr); returns
    the thread number that has affinity to the object
    pointed to by ptr
  • size_t upc_phaseof(shared void *ptr); returns the
    index (position within the block) of the object
    which is pointed to by ptr
  • size_t upc_addrfield(shared void *ptr); returns
    the address of the block which is pointed at by
    the pointer-to-shared
  • shared void *upc_resetphase(shared void *ptr);
    resets the phase to zero
  • size_t upc_affinitysize(size_t ntotal, size_t
    nbytes, size_t thr); returns the exact size of
    the local portion of the data in a shared object
    with affinity to a given thread
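
A small sketch of what the first two query functions report for a blocked array (the array and the loop are illustrative, not from the tutorial):

#include <upc_relaxed.h>
#include <stdio.h>

shared [3] int a[3*THREADS];

int main(void)
{
  if (MYTHREAD == 0) {
    int i;
    for (i = 0; i < 3*THREADS; i++)
      printf("a[%d]: thread %d, phase %d\n", i,
             (int)upc_threadof(&a[i]), (int)upc_phaseof(&a[i]));
  }
  return 0;
}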

44
UPC Pointers
  • Pointer-to-shared arithmetic examples
  • Assume THREADS = 4
  • #define N 16
  • shared int x[N];
  • shared int *dp = &x[5], *dp1;
  • dp1 = dp + 9;

45
UPC Pointers
[Figure: x[0..15] distributed cyclically over 4 threads; dp points to x[5] on thread 1, and dp+1 ... dp+9 advance through consecutive array elements, so dp1 points to x[14]]
46
UPC Pointers
  • Assume THREADS = 4
  • shared [3] int x[N], *dp = &x[5], *dp1;
  • dp1 = dp + 9;

47
UPC Pointers
[Figure: x distributed in blocks of 3; dp points to x[5] on thread 1, and dp+1 ... dp+9 traverse the array in index order, so dp1 points to x[14]]
48
UPC Pointers
  • Example: Pointer Casting and Mismatched Assignments
  • Pointer Casting
  • shared int x[THREADS];
  • int *p;
  • p = (int *) &x[MYTHREAD];  /* p points to x[MYTHREAD] */
  • Each of the private pointers will point at the x
    element which has affinity with its thread, i.e.
    MYTHREAD

49
UPC Pointers
  • Mismatched Assignments
  • Assume THREADS = 4
  • shared int x[N];
  • shared [3] int *dp = &x[5], *dp1;
  • dp1 = dp + 9;
  • The last statement assigns to dp1 a value that is
    9 positions beyond dp
  • The pointer will follow its own blocking and not
    that of the array

50
UPC Pointers
[Figure: x is laid out with the default block size of 1, but dp follows its own blocking of 3, so dp+1 ... dp+9 do not visit consecutive elements of x; the figure traces where each dp+k lands, with dp+9 falling at the position labeled x[16]]
51
UPC Pointers
  • Given the declarations
  • shared [3] int *p;
  • shared [5] int *q;
  • Then
  • p = q;  /* is acceptable (an implementation may
    require an explicit cast, e.g.
    p = (shared [3] int *)q;) */
  • Pointer p, however, will follow pointer
    arithmetic for blocks of 3, not 5!
  • A pointer cast sets the phase to 0

52
Worksharing with upc_forall
  • Distributes independent iterations across threads
    in the way you wish; typically used to exploit
    locality in a convenient way
  • Simple C-like syntax and semantics
  • upc_forall(init; test; loop; affinity)
  •   statement;
  • Affinity could be an integer expression, or a
  • Reference to (address of) a shared object

53
Work Sharing and Exploiting Locality via
upc_forall()
  • Example 1: explicit affinity using shared references
  • shared int a[100], b[100], c[100];
  • int i;
  • upc_forall (i=0; i<100; i++; &a[i])
  •   a[i] = b[i] * c[i];
  • Example 2: implicit affinity with integer
    expressions and distribution in a round-robin
    fashion
  • shared int a[100], b[100], c[100];
  • int i;
  • upc_forall (i=0; i<100; i++; i)
  •   a[i] = b[i] * c[i];
Note Examples 1 and 2 result in the same
distribution
54
Work Sharing upc_forall()
  • Example 3: implicit affinity with distribution by chunks
  • shared int a[100], b[100], c[100];
  • int i;
  • upc_forall (i=0; i<100; i++; (i*THREADS)/100)
  •   a[i] = b[i] * c[i];
  • Assuming 4 threads, iterations are assigned in
    chunks of 25: thread 0 executes i = 0..24, thread 1
    executes i = 25..49, and so on

55
Distributing Multidimensional Data
  • Uses the inherent contiguous memory layout of C
    multidimensional arrays
  • shared [BLOCKSIZE] double grids[N][N];
  • Distribution depends on the value of BLOCKSIZE

[Figure: partitionings of grids[N][N] - element by element with the default BLOCKSIZE=1, by row blocks with BLOCKSIZE=N, by larger column/row blocks, and one partition per thread with BLOCKSIZE=N*N or an indefinite block size]
56
2D Heat Conduction Problem
  • Based on the 2D partial differential equation (1),
    the 2D heat conduction problem reduces to the
    4-point stencil update shown in (2)

[Equations (1) and (2): the heat equation and its discretization,
 T[y][x] = (T[y-1][x] + T[y+1][x] + T[y][x-1] + T[y][x+1]) / 4]

Because of the time steps, typically two grids are used
57
2D Heat Conduction Problem
shared [BLOCKSIZE] double grids[2][N][N];
shared double dTmax_local[THREADS];
int nr_iter, i, x, y, dg, sg, finished;
double dTmax, dT, T;

do {
  dTmax = 0.0;
  for( y=1; y<N-1; y++ )
    upc_forall( x=1; x<N-1; x++; &grids[sg][y][x] ) {
      /* work distribution follows the declared BLOCKSIZE of grids;
         this affinity expression is generic and works for any BLOCKSIZE */
      T = (grids[sg][y-1][x] + grids[sg][y+1][x] +
           grids[sg][y][x-1] + grids[sg][y][x+1]) / 4.0;   /* 4-point stencil */
      dT = T - grids[sg][y][x];
      grids[dg][y][x] = T;
      if( dTmax < fabs(dT) )
        dTmax = fabs(dT);
    }
58
  upc_barrier;
  dTmax_local[MYTHREAD] = dTmax;
  upc_barrier;
  dTmax = dTmax_local[0];              /* reduction operation: */
  for( i=1; i<THREADS; i++ )           /* find the global maximum dT */
    if( dTmax < dTmax_local[i] )
      dTmax = dTmax_local[i];
  upc_barrier;

  if( dTmax < epsilon )
    finished = 1;
  else {
    /* swapping the source & destination indices */
    dg = sg;
    sg = !sg;
  }
  nr_iter++;
} while( !finished );
59
Dynamic Memory Allocation in UPC
  • Dynamic memory allocation of shared memory is
    available in UPC
  • Functions can be collective or not
  • A collective function has to be called by every
    thread and will return the same value to all of
    them
  • As a convention, the name of a collective
    function typically includes "all"

60
Collective Global Memory Allocation
  • shared void *upc_all_alloc(size_t nblocks, size_t nbytes);
  • nblocks: number of blocks; nbytes: block size
  • This function has the same result as
    upc_global_alloc, but it is a collective
    function, expected to be called by all threads
  • All the threads will get the same pointer
  • Equivalent to: shared [nbytes] char [nblocks * nbytes]

61
Collective Global Memory Allocation

[Figure: upc_all_alloc creates one block of N ints with affinity to each of threads 0 .. THREADS-1; every thread's private pointer ptr points to the same shared region]

shared [N] int *ptr;
ptr = (shared [N] int *) upc_all_alloc( THREADS, N*sizeof( int ) );
62
2D Heat Conduction Example
/* heat_conduction(): same loops as before, now operating on grids
   passed in as a parameter */
  for( y=1; y<N-1; y++ )
    upc_forall( x=1; x<N-1; x++; &grids[sg][y][x] ) {
      T = (grids[sg][y+1][x] + ...
  ...
  } while( finished == 0 );
  return nr_iter;
}

#define CHECK_MEM(var) \
  if( var == NULL ) \
  { \
    printf("TH%02d: ERROR: %s == NULL\n", \
           MYTHREAD, #var ); \
    upc_global_exit(1); \
  }

shared [BLOCKSIZE] double *sh_grids;
int heat_conduction(shared [BLOCKSIZE] double (*grids)[N][N]);

63
int main(void)
{
  int nr_iter;

  /* allocate the memory required for grids[2][N][N] */
  sh_grids = (shared [BLOCKSIZE] double *)
               upc_all_alloc( 2*N*N/BLOCKSIZE, BLOCKSIZE*sizeof(double) );
  CHECK_MEM(sh_grids);

  /* perform the heat conduction computations;
     note the cast to a 2-D shared pointer! */
  nr_iter = heat_conduction( (shared [BLOCKSIZE] double (*)[N][N]) sh_grids );
  ...
}
64
Global Memory Allocation
  • shared void *upc_global_alloc(size_t nblocks,
                                  size_t nbytes);
  • nblocks: number of blocks; nbytes: block size
  • Non-collective, expected to be called by one
    thread
  • The calling thread allocates a contiguous memory
    region in the shared space
  • Space allocated per calling thread is equivalent
    to: shared [nbytes] char [nblocks * nbytes]
  • If called by more than one thread, multiple
    regions are allocated and each calling thread
    gets a different pointer

65
Global Memory Allocation
shared [N] int *ptr;
ptr = (shared [N] int *) upc_global_alloc( THREADS, N*sizeof( int ) );

shared [N] int *shared myptr[THREADS];
myptr[MYTHREAD] = (shared [N] int *) upc_global_alloc( THREADS, N*sizeof( int ) );
66
(No Transcript)
67
(No Transcript)
68
Local-Shared Memory Allocation
  • shared void *upc_alloc(size_t nbytes);
  • nbytes: block size
  • Non-collective, expected to be called by one
    thread
  • The calling thread allocates a contiguous memory
    region in the local-shared space of the calling
    thread
  • Space allocated per calling thread is equivalent
    to: shared [] char [nbytes]
  • If called by more than one thread, multiple
    regions are allocated and each calling thread
    gets a different pointer

69
Local-Shared Memory Allocation
shared [] int *ptr;
ptr = (shared [] int *) upc_alloc( N*sizeof( int ) );
70
Blocking Multidimensional Data by Cells
  • Blocking can also be done by 2D cells, of equal
    size across THREADS
  • Works best with N being a power of 2

[Figure: cell decompositions of the N x N grid (each cell is DIMX x DIMY) -
 THREADS=1: NO_COLS=1, NO_ROWS=1;  THREADS=2: NO_COLS=2, NO_ROWS=1;
 THREADS=4: NO_COLS=2, NO_ROWS=2;  THREADS=8: NO_COLS=4, NO_ROWS=2;
 THREADS=16: NO_COLS=4, NO_ROWS=4]
71
Blocking Multidimensional Data by Cells
  • Determining DIMX and DIMY
  • NO_COLS = NO_ROWS = 1;
  • for( i=2, j=0; i<=THREADS; i<<=1, j++ )
  •   if( (j%2)==0 )      NO_COLS <<= 1;   /* alternate doubling of */
  •   else if( (j%2)==1 ) NO_ROWS <<= 1;   /* columns and rows      */
  • DIMX = N / NO_COLS;
  • DIMY = N / NO_ROWS;

72
  • Accessing one element of those 3D shared cells
    (by a macro)
  #define CELL_SIZE DIMY*DIMX
  struct gridcell_s {                        /* definition of a cell */
    double cell[CELL_SIZE];
  };
  typedef struct gridcell_s gridcell_t;

  shared gridcell_t cell_grids[2][THREADS];  /* 2 grids, one cell per thread */

  /* linearization of the 2D index into 1D (using a C macro):
     the second subscript picks which THREAD's cell, the .cell
     subscript picks the offset within that cell */
  #define grids(gridno, y, x) \
    cell_grids[gridno][((y)/DIMY)*NO_COLS + ((x)/DIMX)].cell \
              [((y)%DIMY)*DIMX + ((x)%DIMX)]
73
2D Heat Conduction Example w 2D-Cells
  typedef struct chunk_s chunk_t;
  struct chunk_s {
    shared [] double *chunk;
  };
  int N;

  shared chunk_t sh_grids[2][THREADS];
  shared double dTmax_local[THREADS];

  #define grids(no,y,x) \
    sh_grids[no][((y)*N+(x))/(N*N/THREADS)].chunk \
            [((y)*N+(x))%(N*N/THREADS)]

  int heat_conduction(shared chunk_t (*sh_grids)[THREADS]);
  // grids[..][..][..] has to be changed to grids(..,..,..)

74
int main(int argc, char *argv[])
{
  int nr_iter, no;

  // get N as parameter
  for( no=0; no<2; no++ ) {
    /* allocate */
    sh_grids[no][MYTHREAD].chunk = (shared [] double *)
        upc_alloc( N*N/THREADS*sizeof( double ) );
    CHECK_MEM( sh_grids[no][MYTHREAD].chunk );
  }

  /* perform the heat conduction computation */
  nr_iter = heat_conduction( sh_grids );
  ...
}
75
Memory Space Clean-up
  • void upc_free(shared void *ptr);
  • The upc_free function frees the dynamically
    allocated shared memory pointed to by ptr
  • upc_free is not collective
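
A minimal usage sketch (the size and barrier placement are illustrative, not from the tutorial), pairing a collective allocation with a single free:

shared [10] int *p;

p = (shared [10] int *) upc_all_alloc( THREADS, 10*sizeof(int) );
/* ... all threads use p ... */
upc_barrier;                /* make sure no thread is still using p */
if( MYTHREAD == 0 )
  upc_free( p );            /* not collective: exactly one thread frees */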

76
Example Matrix Multiplication in UPC
  • Given two integer matrices A(NxP) and B(PxM), we
    want to compute C = A x B
  • Entries c_ij in C are computed by the formula
    c_ij = sum over l = 1..P of a_il * b_lj

77
Doing it in C
#include <stdlib.h>
#define N 4
#define P 4
#define M 4
int a[N][P] = {1,2,3,4,5,6,7,8,9,10,11,12,14,14,15,16}, c[N][M];
int b[P][M] = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1};
void main (void)
{
  int i, j, l;
  for (i = 0; i<N; i++)
    for (j=0; j<M; j++) {
      c[i][j] = 0;
      for (l = 0; l<P; l++) c[i][j] += a[i][l]*b[l][j];
    }
}

78
Domain Decomposition for UPC
Exploiting locality in matrix multiplication
  • A (N x P) is decomposed row-wise into blocks of
    size (N x P) / THREADS as shown below
  • B (P x M) is decomposed column-wise into
    M/THREADS blocks as shown below

[Figure: A is divided row-wise into THREADS blocks of (N*P)/THREADS elements - thread 0 holds elements 0 .. (N*P/THREADS)-1, thread 1 the next block, and so on; B is divided column-wise so thread i holds columns (i*M)/THREADS .. ((i+1)*M)/THREADS - 1]

  • Note: N and M are assumed to be multiples of
    THREADS
79
UPC Matrix Multiplication Code
#include <upc_relaxed.h>
#define N 4
#define P 4
#define M 4
shared [N*P/THREADS] int a[N][P];
shared [N*M/THREADS] int c[N][M];
// a and c are blocked shared matrices; initialization is not currently implemented
shared [M/THREADS] int b[P][M];
void main (void)
{
  int i, j, l;   // private variables
  upc_forall(i = 0; i<N; i++; &c[i][0])
    for (j=0; j<M; j++) {
      c[i][j] = 0;
      for (l = 0; l<P; l++) c[i][j] += a[i][l]*b[l][j];
    }
}
80
UPC Matrix Multiplication Code with Privatization
#include <upc_relaxed.h>
#define N 4
#define P 4
#define M 4
shared [N*P/THREADS] int a[N][P];   // N, P and M divisible by THREADS
shared [N*M/THREADS] int c[N][M];
shared [M/THREADS] int b[P][M];
int *a_priv, *c_priv;
void main (void) {
  int i, j, l;   // private variables
  upc_forall(i = 0; i<N; i++; &c[i][0]) {
    a_priv = (int *)a[i]; c_priv = (int *)c[i];
    for (j=0; j<M; j++) {
      c_priv[j] = 0;
      for (l = 0; l<P; l++) c_priv[j] += a_priv[l]*b[l][j];
    }
  }
}
81
UPC Matrix Multiplication Code with block copy
#include <upc_relaxed.h>
shared [N*P/THREADS] int a[N][P];
shared [N*M/THREADS] int c[N][M];
// a and c are blocked shared matrices; initialization is not currently implemented
shared [M/THREADS] int b[P][M];
int b_local[P][M];
void main (void) {
  int i, j, l;   // private variables
  for( i=0; i<P; i++ )
    for( j=0; j<THREADS; j++ )
      upc_memget(&b_local[i][j*(M/THREADS)], &b[i][j*(M/THREADS)],
                 (M/THREADS)*sizeof(int));
  upc_forall(i = 0; i<N; i++; &c[i][0])
    for (j=0; j<M; j++) {
      c[i][j] = 0;
      for (l = 0; l<P; l++) c[i][j] += a[i][l]*b_local[l][j];
    }
}
82
UPC Matrix Multiplication Code with Privatization
and Block Copy
#include <upc_relaxed.h>
shared [N*P/THREADS] int a[N][P];   // N, P and M divisible by THREADS
shared [N*M/THREADS] int c[N][M];
shared [M/THREADS] int b[P][M];
int *a_priv, *c_priv, b_local[P][M];
void main (void) {
  int i, priv_i, j, l;   // private variables
  for( i=0; i<P; i++ )
    for( j=0; j<THREADS; j++ )
      upc_memget(&b_local[i][j*(M/THREADS)], &b[i][j*(M/THREADS)],
                 (M/THREADS)*sizeof(int));
  upc_forall(i = 0; i<N; i++; &c[i][0]) {
    a_priv = (int *)a[i]; c_priv = (int *)c[i];
    for (j=0; j<M; j++) {
      c_priv[j] = 0;
      for (l = 0; l<P; l++) c_priv[j] += a_priv[l]*b_local[l][j];
    }
  }
}
83
Matrix Multiplication with dynamic memory
#include <upc_relaxed.h>
shared [N*P/THREADS] int *a;
shared [N*M/THREADS] int *c;
shared [M/THREADS] int *b;
void main (void) {
  int i, j, l;   // private variables
  a = upc_all_alloc(THREADS, (N*P/THREADS)*upc_elemsizeof(*a));
  c = upc_all_alloc(THREADS, (N*M/THREADS)*upc_elemsizeof(*c));
  b = upc_all_alloc(P*THREADS, (M/THREADS)*upc_elemsizeof(*b));
  upc_forall(i = 0; i<N; i++; &c[i*M]) {
    for (j=0; j<M; j++) {
      c[i*M+j] = 0;
      for (l = 0; l<P; l++) c[i*M+j] += a[i*P+l]*b[l*M+j];
    }
  }
}
84
Synchronization
  • No implicit synchronization among the threads
  • UPC provides the following synchronization
    mechanisms
  • Barriers
  • Locks

85
Synchronization - Barriers
  • No implicit synchronization among the threads
  • UPC provides the following barrier
    synchronization constructs
  • Barriers (Blocking)
  • upc_barrier expr_opt;
  • Split-Phase Barriers (Non-blocking)
  • upc_notify expr_opt;
  • upc_wait expr_opt;
  • Note: upc_notify is not blocking; upc_wait is
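
A sketch of the split-phase pattern (the local work is only a placeholder), which overlaps independent computation with the barrier:

upc_notify;    /* signal arrival, do not block             */
/* ... do work that touches only private/local data ...    */
upc_wait;      /* block until all threads have notified    */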

86
Synchronization - Locks
  • In UPC, shared data can be protected against
    multiple writers
  • void upc_lock(upc_lock_t *l);
  • int upc_lock_attempt(upc_lock_t *l); // returns 1
    on success and 0 on failure
  • void upc_unlock(upc_lock_t *l);
  • Locks are allocated dynamically, and can be freed
  • Locks are properly initialized after they are
    allocated
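
A minimal sketch of protecting a shared counter (the variable names are illustrative, not from the tutorial):

shared int counter;            /* has affinity to thread 0 */
upc_lock_t *lock;

lock = upc_all_lock_alloc();   /* collective: every thread gets the same lock */
upc_lock(lock);
counter += 1;                  /* critical section: one thread at a time */
upc_unlock(lock);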

87
Dynamic lock allocation
  • The locks can be managed using the following
    functions
  • collective lock allocation (à la upc_all_alloc)
  • upc_lock_t *upc_all_lock_alloc(void);
  • global lock allocation (à la upc_global_alloc)
  • upc_lock_t *upc_global_lock_alloc(void);
  • lock freeing
  • void upc_lock_free(upc_lock_t *ptr);

88
Collective lock allocation
  • collective lock allocation
  • upc_lock_t *upc_all_lock_alloc(void);
  • Needs to be called by all the threads
  • Returns a single lock to all calling threads

89
Global lock allocation
  • global lock allocation
  • upc_lock_t *upc_global_lock_alloc(void);
  • Returns one lock pointer to the calling thread
  • This is not a collective function

90
Lock freeing
  • Lock freeing
  • void upc_lock_free(upc_lock_t *l);
  • This is not a collective function

91
Numerical Integration (computation of π)
  • Integrate the function f(x) = 4/(1+x²) over [0,1];
    the value of the integral is π
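
For reference, the midpoint-rule approximation used in the code on the next slide (a standard identity, stated here in LaTeX):

\pi = \int_0^1 \frac{4}{1+x^2}\,dx
    \approx \frac{4}{N}\sum_{i=0}^{N-1}\frac{1}{1+\left(\frac{i+0.5}{N}\right)^2}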

92
Example: Using Locks in Numerical Integration
// Example: The Famous PI - Numerical Integration
#include <upc_relaxed.h>
#include <stdio.h>
#define N 1000000
#define f(x) 1/(1+x*x)

upc_lock_t *l;
shared float pi;

void main(void)
{
  float local_pi = 0.0;
  int i;
  l = upc_all_lock_alloc();
  upc_barrier;

  upc_forall(i=0; i<N; i++; i)
    local_pi += (float) f((.5+i)/(N));
  local_pi *= (float) (4.0 / N);

  upc_lock(l);               /* better with collectives */
  pi += local_pi;
  upc_unlock(l);

  upc_barrier;               // ensure all is done
  upc_lock_free( l );
  if( MYTHREAD==0 ) printf("PI=%f\n", pi);
}

93
Memory Consistency Models
  • Has to do with ordering of shared operations, and
    when a change of a shared object by a thread
    becomes visible to others
  • Consistency can be strict or relaxed
  • Under the relaxed consistency model, the shared
    operations can be reordered by the compiler /
    runtime system
  • The strict consistency model enforces sequential
    ordering of shared operations. (No operation on
    shared can begin before the previous ones are
    done, and changes become visible immediately)

94
Memory Consistency
  • Default behavior can be controlled by the
    programmer and set at the program level
  • To have strict memory consistency
  • #include <upc_strict.h>
  • To have relaxed memory consistency
  • #include <upc_relaxed.h>

95
Memory Consistency
  • Default behavior can be altered for a variable
    definition in the declaration using
  • Type qualifiers: strict & relaxed
  • Default behavior can be altered for a statement
    or a block of statements using
  • #pragma upc strict
  • #pragma upc relaxed
  • Highest precedence is at declarations, then
    pragmas, then program level
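
A small sketch combining the three levels (variable names and values are illustrative, not from the tutorial):

#include <upc_relaxed.h>       /* program level: relaxed by default */

strict shared int flag;         /* declaration level: flag is always strict */
shared int data;                /* relaxed by default */

void produce(void)
{
  data = 42;                    /* relaxed shared write */
  flag = 1;                     /* strict write, ordered after the line above */
  {
#pragma upc strict              /* block level: shared accesses here are strict */
    data = 43;
  }
}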

96
Memory Consistency- Fence
  • UPC provides a fence construct
  • Equivalent to a null strict reference, and has
    the syntax
  • upc_fence
  • UPC ensures that all shared references are issued
    before the upc_fence is completed

97
Memory Consistency Example
strict shared int flag_ready = 0;
shared int result0, result1;

if (MYTHREAD==0)
{
  result0 = expression1;
  flag_ready = 1;      // if not strict, it could be
                       // switched with the above statement
}
else if (MYTHREAD==1)
{
  while(!flag_ready);  // same note
  result1 = expression2 + result0;
}
  • We could have used a barrier between the first
    and second statement in the if and the else code
    blocks. Expensive!! Affects all operations at all
    threads.
  • We could have used a fence in the same places.
    Affects shared references at all threads!
  • The above works as an example of point to point
    synchronization.

98
Section 2 UPC Systems
Merkey Seidel
  • Summary of current UPC systems
  • Cray X-1
  • Hewlett-Packard
  • Berkeley
  • Intrepid
  • MTU
  • UPC application development tools
  • totalview
  • upc_trace
  • work in progress
  • performance toolkit interface
  • performance model

99
Cray UPC
  • Platform Cray X1 supporting UPC v1.1.1
  • Features shared memory architecture
  • UPC is a compiler option, so all of the ILP
    optimization is available in UPC.
  • The processors are designed with 4 SSPs per MSP.
  • A UPC thread can run on an SSP or an MSP; an
    SSP-mode vs. MSP-mode performance analysis is
    required before making a choice.
  • There are no virtual processors.
  • This is a high-bandwidth, low-latency system.
  • The SSPs are vector processors; the key to
    performance is exploiting ILP through
    vectorization.
  • The MSPs run at a higher clock speed; the key to
    performance is having enough independent work to
    be multi-streamed.

100
Cray UPC
  • Usage
  • Compiling for arbitrary numbers of threads
  • cc -hupc filename.c (MSP mode, one thread per
    MSP)
  • cc -hupc,ssp filename.c (SSP mode, one
    thread per SSP)
  • Running
  • aprun -n THREADS ./a.out
  • Compiling for fixed number of threads
  • cc -hssp,upc -X THREADS filename.c -o a.out
  • Running
  • ./a.out
  • URL
  • http://docs.cray.com
  • Search for UPC under Cray X1

101
Hewlett-Packard UPC
  • Platforms Alphaserver SC, HP-UX IPF, PA-RISC, HP
    XC ProLiant DL360 or 380.
  • Features
  • UPC version 1.2 compliant
  • UPC-specific performance optimization
  • Write-through software cache for remote accesses
  • Cache configurable at run time
  • Takes advantage of same-node shared memory when
    running on SMP clusters
  • Rich diagnostic and error-checking facilities

102
Hewlett-Packard UPC
  • Usage
  • Compiling for arbitrary number of threads
  • upc filename.c
  • Compiling for fixed number of threads
  • upc -fthreads THREADS filename.c
  • Running
  • prun -n THREADS ./a.out
  • URL: http://h30097.www3.hp.com/upc

103
Berkeley UPC (BUPC)
  • Platforms Supports a wide range of
    architectures, interconnects and operating
    systems
  • Features
  • Open64 open source compiler as front end
  • Lightweight runtime and networking layers built
    on GASNet
  • Full UPC version 1.2 compliant, including UPC
    collectives and a reference implementation of UPC
    parallel I/O
  • Can be debugged by Totalview
  • Trace analysis upc_trace

104
Berkeley UPC (BUPC)
  • Usage
  • Compiling for arbitrary number of threads
  • upcc filename.c
  • Compiling for fixed number of threads
  • upcc -Tthreads THREADS filename.c
  • Compiling with optimization enabled
    (experimental)
  • upcc -opt filename.c
  • Running
  • upcrun -n THREADS ./a.out
  • URL: http://upc.nersc.gov

105
Intrepid GCC/UPC
  • Platforms shared memory platforms only
  • Itanium, AMD64, Intel x86 uniprocessor and SMPs
  • SGI IRIX
  • Cray T3E
  • Features
  • Based on GNU GCC compiler
  • UPC version 1.1 compliant
  • Can be a front-end of the Berkeley UPC runtime

106
Intrepid GCC/UPC
  • Usage
  • Compiling for arbitrary number of threads
  • upc -x upc filename.c
  • Running
  • mpprun ./a.out
  • Compiling for fixed number of threads
  • upc -x upc -fupc-threads-THREADS filename.c
  • Running
  • ./a.out
  • URL: http://www.intrepid.com/upc

107
MTU UPC (MuPC)
  • Platforms Intel x86 Linux clusters and
    AlphaServer SC clusters with MPI-1.1 and Pthreads
  • Features
  • EDG front end source-to-source translator
  • UPC version 1.1 compliant
  • Generates 2 Pthreads for each UPC thread
  • one Pthread runs the user code
  • an MPI-1 Pthread handles remote accesses
  • Write-back software cache for remote accesses
  • Cache configurable at run time
  • Reference implementation of UPC collectives

108
MTU UPC (MuPC)
  • Usage
  • Compiling for arbitrary number of threads
  • mupcc filename.c
  • Compiling for fixed number of threads
  • mupcc -f THREADS filename.c
  • Running
  • mupcrun -n THREADS ./a.out
  • URL: http://www.upc.mtu.edu

109
UPC Tools
  • Etnus Totalview
  • Berkeley UPC trace tool
  • U. of Florida performance tool interface
  • MTU performance modeling project

110
Totalview
  • Platforms
  • HP UPC on Alphaservers
  • Berkeley UPC on x86 architectures with MPICH or
    Quadrics elan as network.
  • Must be Totalview version 7.0.1 or above
  • BUPC runtime must be configured with
    --enable-trace
  • BUPC back end must be GNU GCC
  • Features
  • UPC-level source examination, steps through UPC
    code
  • Examines shared variable values at run time

111
Totalview
  • Usage
  • Compiling for totalview debugging
  • upcc -tv filename.c
  • Running when MPICH is used
  • mpirun -tv -np THREADS ./a.out
  • Running when Quadrics elan is used
  • totalview prun -a -n THREADS ./a.out
  • URL
  • http://upc.lbl.gov/docs/user/totalview.html
  • http://www.etnus.com/TotalView/

112
UPC trace
  • upc_trace analyzes the communication behavior of
    UPC programs.
  • A tool available for Berkeley UPC
  • Usage
  • upcc must be configured with --enable-trace.
  • Run your application with
  • upcrun -trace ... or
  • upcrun -tracefile TRACE_FILE_NAME ...
  • Run upc_trace on trace files to retrieve
    statistics of runtime communication events.
  • Finer tracing control by manually instrumenting
    programs
  • bupc_trace_setmask(), bupc_trace_getmask(),
    bupc_trace_gettracelocal(), bupc_trace_settraceloc
    al(), etc.

113
UPC trace
  • upc_trace provides information on
  • Which lines of code generated network traffic
  • How many messages each line caused
  • The type (local and/or remote gets/puts) of
    messages
  • The maximum/minimum/average/combined sizes of the
    messages
  • Local shared memory accesses
  • Lock-related events, memory allocation events,
    and strict operations
  • URL: http://upc.nersc.gov/docs/user/upc_trace.html

114
Performance tool interface
  • A platform independent interface for toolkit
    developers
  • A callback mechanism notifies performance tool
    when certain events, such as remote accesses,
    occur at runtime
  • Relates runtime events to source code
  • Events Initialization/completion, shared memory
    accesses, synchronization, work-sharing, library
    function calls, user-defined events
  • Interface proposal is under development
  • URL: http://www.hcs.ufl.edu/leko/upctoolint/

115
Performance model
  • Application-level analytical performance model
  • Models the performance of UPC fine-grain accesses
    through platform benchmarking and code analysis
  • Platform abstraction
  • Identify a common set of optimizations performed
    by a high-performance UPC platform: aggregation,
    vectorization, pipelining, local shared access
    optimization, communication/computation
    overlapping
  • Design microbenchmarks to determine the
    platform's optimization potential

116
Performance model
  • Code analysis
  • High performance achievable by exploiting
    concurrency in shared references
  • Reference partitioning
  • A dependence-based analysis to determine
    concurrency in shared access scheduling
  • References are partitioned into groups, accesses
    of references in a group are subject to one type
    of envisioned optimization
  • Run time prediction

117
Section 3 UPC Libraries
  • Collective Functions
  • Bucket sort example
  • UPC collectives
  • Synchronization modes
  • Collectives performance
  • Extensions
  • UPC-IO
  • Concepts
  • Main Library Calls
  • Library Overview

118
Collective functions
  • A collective function performs an operation in
    which all threads participate.
  • Recall that UPC includes the collectives
  • upc_barrier, upc_notify, upc_wait, upc_all_alloc,
    upc_all_lock_alloc
  • Collectives covered here are for bulk data
    movement and computation.
  • upc_all_broadcast, upc_all_exchange,
    upc_all_prefix_reduce, etc.
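
As a minimal sketch of one such call (array names and sizes are illustrative; the signature follows the UPC 1.2 collectives library):

#include <upc_relaxed.h>
#include <upc_collective.h>
#define NELEMS 16

shared [NELEMS] int dst[NELEMS*THREADS];   /* one destination block per thread */
shared [] int src[NELEMS];                 /* source block, all on thread 0 */

void bcast_example(void)
{
  /* every thread receives a copy of src into its own block of dst */
  upc_all_broadcast( dst, src, NELEMS*sizeof(int),
                     UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC );
}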

119
A quick example Parallel bucketsort
  • shared [N] int A[N*THREADS];
  • Assume the keys in A are uniformly distributed.
  • Find global min and max values in A.
  • Determine max bucket size.
  • Allocate bucket array and exchange array.
  • Bucketize A into local shared buckets.
  • Exchange buckets and merge.
  • Rebalance and return data to A if desired.

120
Sort shared array A
  • shared [N] int A[N*THREADS];

[Figure: A laid out with one block of N keys in each thread's shared space; each thread also holds private pointers-to-shared]
121
1. Find global min and max values
  • shared [] int minmax0[2];  // only on Thread 0
  • shared [2] int MinMax[2*THREADS];
  • // Thread 0 receives min and max values
  • upc_all_reduce(&minmax0[0], A, ..., UPC_MIN, ...);
  • upc_all_reduce(&minmax0[1], A, ..., UPC_MAX, ...);
  • // Thread 0 broadcasts min and max
  • upc_all_broadcast(MinMax, minmax0,
    2*sizeof(int), NULL);

122
1. Find global min and max values
(animation)
shared [] int minmax0[2];   // only on Thread 0
shared [2] int MinMax[2*THREADS];

upc_all_reduce(&minmax0[0], A, ..., UPC_MIN, ...);
upc_all_reduce(&minmax0[1], A, ..., UPC_MAX, ...);
upc_all_broadcast(MinMax, minmax0, 2*sizeof(int), NULL);

[Figure: the reductions leave min = -92 and max = 151 in minmax0 on thread 0; the broadcast then copies the (min, max) pair into every thread's block of MinMax]
123
2. Determine max bucket size
  • shared [THREADS] int Bsizes[THREADS][THREADS];
  • shared int bmax0;   // only on Thread 0
  • shared int Bmax[THREADS];
  • // determine splitting keys (not shown)
  • // initialize Bsizes to 0, then
  • upc_forall(i=0; i<N*THREADS; i++; &A[i])
  •   if (A[i] will go in bucket j)
  •     Bsizes[MYTHREAD][j]++;
  • upc_all_reduceI(&bmax0, Bsizes, ..., UPC_MAX, ...);
  • upc_all_broadcast(Bmax, &bmax0, sizeof(int), ...);

124
2. Find max bucket size required
(animation)
shared [THREADS] int Bsizes[THREADS][THREADS];
shared int bmax0;   // only on Thread 0
shared int Bmax[THREADS];

upc_all_reduceI(&bmax0, Bsizes, ..., UPC_MAX, ...);
upc_all_broadcast(Bmax, &bmax0, sizeof(int), NULL);

[Figure: the reduction finds the largest bucket count (31 in the example) and places it in bmax0 on thread 0; the broadcast copies it into each thread's element of Bmax]
125
3. Allocate bucket and exchange arrays
  • shared int *BuckAry;
  • shared int *BuckDst;
  • int Blen;
  • Blen = (int)Bmax[MYTHREAD]*sizeof(int);
  • BuckAry = upc_all_alloc(THREADS,
      Blen*THREADS);
  • BuckDst = upc_all_alloc(THREADS,
      Blen*THREADS);
  • Blen = Blen/sizeof(int);

126
3. Allocate bucket and exchange arrays
(animation)
int Blen;
Blen = (int)Bmax[MYTHREAD]*sizeof(int);

[Figure: with Bmax = 31 on every thread, Blen = 124 bytes; BuckAry and BuckDst each receive one region per thread large enough for THREADS buckets, and all threads hold the same pointers to them]
127
4. Bucketize A
  • int *Bptr;   // local ptr to BuckAry
  • shared [THREADS] int Bcnt[THREADS][THREADS];
  • // cast to local pointer
  • Bptr = (int *)&BuckAry[MYTHREAD];
  • // init bucket counters: Bcnt[MYTHREAD][i] = 0;
  • upc_forall (i=0; i<N*THREADS; i++; &A[i])
  •   if (A[i] belongs in bucket j) {
  •     Bptr[Blen*j + Bcnt[MYTHREAD][j]] = A[i];
  •     Bcnt[MYTHREAD][j]++;
  •   }
128
4. Bucketize A
(animation)
Bptr = (int *)&BuckAry[MYTHREAD];
if (A[i] belongs in bucket j) {
  Bptr[Blen*j + Bcnt[MYTHREAD][j]] = A[i];
  Bcnt[MYTHREAD][j]++;
}

[Figure: each thread scatters its keys from A into the THREADS local buckets of its own region of BuckAry]
129
5. Exchange buckets
(animation)
upc_all_exchange(BuckDst, BuckAry, Blen*sizeof(int), NULL);

[Figure: bucket j of thread i's BuckAry region moves to thread j's BuckDst region]
130
6. Merge and rebalance
  • Bptr = (int *)&BuckDst[MYTHREAD];
  • // Each thread locally merges its part of
  • // BuckDst. Rebalance and return to A
  • // if desired.
  • if (MYTHREAD==0) {
  •   upc_free(BuckAry);
  •   upc_free(BuckDst);
  • }

131
Collectives in UPC 1.2
  • Relocalization collectives change data
    affinity.
  • upc_all_broadcast
  • upc_all_scatter
  • upc_all_gather
  • upc_all_gather_all
  • upc_all_exchange
  • upc_all_permute
  • Computational collectives for data reduction
  • upc_all_reduce
  • upc_all_prefix_reduce

132
Why have collectives in UPC?
  • Sometimes bulk data movement is the right thing
    to do.
  • Built-in collectives offer better performance.
  • Caution: UPC programs can come out looking like
    MPI code.

133
An animated tour of UPC collectives
  • The following illustrations serve to define the
    UPC collective functions.
  • High performance implementations of the
    collectives use more sophisticated algorithms.

134
upc_all_broadcast
(animation)
Thread 0 sends the same block of data to each
thread.

shared [blk] char dst[blk*THREADS];
shared [] char src[blk];

[Figure: the block src on thread 0 is copied into every thread's block of dst]
135
upc_all_scatter
(animation)
Thread 0 sends a unique block of data to each
thread.

shared [blk] char dst[blk*THREADS];
shared [] char src[blk*THREADS];

[Figure: block i of src (all on thread 0) is copied into thread i's block of dst]
136
upc_all_gather
(animation)
Each thread sends a block of data to thread 0.

shared [] char dst[blk*THREADS];
shared [blk] char src[blk*THREADS];

[Figure: thread i's block of src is copied into block i of dst on thread 0]
137
upc_all_gather_all
(animation)
Each thread sends one block of data to all
threads.
138
upc_all_exchange
(animation)
Each thread sends a unique block of data to each
thread.
139
upc_all_permute
(animation)
Thread i sends a block of data to thread perm(i).
140
Reduce and prefix_reduce
  • One function for each C scalar type, e.g.,
  • upc_all_reduceI() returns an integer
  • Operations:
  • +, *, &, |, xor, &&, ||, min, max
  • user-defined binary function
  • non-commutative function option
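
A minimal sketch of an integer sum reduction (names and sizes are illustrative; the argument order follows the UPC 1.2 collectives library):

#include <upc_relaxed.h>
#include <upc_collective.h>
#define NELEMS 8

shared [NELEMS] int src[NELEMS*THREADS];
shared int sum;                       /* result has affinity to thread 0 */

void reduce_example(void)
{
  upc_all_reduceI( &sum, src, UPC_ADD, NELEMS*THREADS, NELEMS,
                   NULL, UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC );
}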

141
upc_all_reduceTYPE
(animation)
Thread 0 receives UPC_OP applied over src[0] .. src[n-1].

[Figure: the values 1, 2, 4, ..., 2048 spread across the threads are combined (here with addition); each thread forms a partial sum and thread 0 receives the total, 4095]
142
upc_all_prefix_reduceTYPE
(animation)
Thread k receives UPC_OP applied over src[0] .. src[k].

[Figure: with addition over the values 1, 2, 4, ..., element k of the result holds the running prefix sum 1, 3, 7, 15, 31, 63, 127, 255, 511, ...]
143
Common collectives properties
  • Collective function arguments are single-valued:
    corresponding arguments must match across all
    threads.
  • Data blocks must have identical sizes.
  • Source data blocks must be in the same array and
    at the same relative location in each thread.
  • The same is true for destination data blocks.
  • Various synchronization modes are provided.

144
Synchronization modes
  • Sync modes determine the strength of
    synchronization between threads executing a
    collective function.
  • Sync modes are specified by flags for function
    entry and for function exit
  • UPC_IN_
  • UPC_OUT_
  • Sync modes have three strengths
  • ALLSYNC
  • MYSYNC
  • NOSYNC

145
ALLSYNC
  • ALLSYNC provides barrier-like synchronization.
    It is the strongest and most convenient mode.
  • upc_all_broadcast(dst, src, nbytes,
        UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
  • No thread will access collective data until all
    threads have reached the collective call.
  • No thread will exit the collective call until all
    threads have completed accesses to collective
    data.

146
NOSYNC
  • NOSYNC provides weak synchronization. The
    programmer is responsible for synchronization.
  • Assume there are no data dependencies between
    the arguments in the following two calls:
  • upc_all_broadcast(dst0, src0, nbytes,
        UPC_IN_ALLSYNC | UPC_OUT_NOSYNC);
  • upc_all_broadcast(dst1, src1, mbytes,
        UPC_IN_NOSYNC | UPC_OUT_ALLSYNC);
  • Chaining independent calls by using NOSYNC
    eliminates the need for synchronization between
    calls.

147
MYSYNC
  • Synchronization is provided with respect to data
    read (UPC_IN) and written (UPC_OUT) by each
    thread.
  • MYSYNC provides an intermediate level of
    synchronization.
  • Assume thread 0 is the source thread. Each
    thread needs
  • to synchronize only with thread 0.
  • upc_all_broadcast(dst, src, nbytes,
  • UPC_IN_MYSYNC UPC_OUT_MYSYNC)

148
MYSYNC example
(animation)
  • Each thread synchronizes with thread 0.
  • Threads 1 and 2 exit as soon as they receive the
    data.
  • It is not likely that thread 2 needs to read
    thread 1's data.

149
ALLSYNC vs. MYSYNC performance
upc_all_broadcast() on a Linux/Myrinet cluster, 8
nodes
150
Sync mode summary
  • ALLSYNC is the most expensive because it
    provides barrier-like synchronization.
  • NOSYNC is the most dangerous but it is almost
    free.
  • MYSYNC provides synchronization only between
    threads which need it. It is likely to be strong
    enough for most programmers' needs, and it is
    more efficient.

151
Collectives performance
  • UPC-level implementations can be improved.
  • Algorithmic approaches
  • tree-based algorithms
  • message combining (cf chained broadcasts)
  • Platform-specific approaches
  • RDMA put and get (e.g., Myrinet and Quadrics)
  • broadcast and barrier primitives may be a benefit
  • buffer management
  • static - permanent but of fixed size
  • dynamic - expensive if allocated for each use
  • pinned - defined RDMA memory area, best solution

152
Push and pull animations
  • The next two slides ...