Title: SC05 Tutorial
1. SC05 Tutorial: High Performance Parallel Programming with Unified Parallel C (UPC)
Tarek El-Ghazawi, The George Washington U., tarek@gwu.edu
Phillip Merkey and Steve Seidel, Michigan Technological U., {merk,steve}@mtu.edu
2. UPC Tutorial Web Site
This site contains the UPC code segments discussed in this tutorial: http://www.upc.mtu.edu/SC05-tutorial
3. UPC Home Page: http://www.upc.gwu.edu
4. UPC textbook now available
http://www.upcworld.org
- UPC Distributed Shared Memory Programming
- Tarek El-Ghazawi
- William Carlson
- Thomas Sterling
- Katherine Yelick
- Wiley, May 2005
- ISBN 0-471-22048-5
5. Section 1: The UPC Language
- Introduction
- UPC and the PGAS Model
- Data Distribution
- Pointers
- Worksharing and Exploiting Locality
- Dynamic Memory Management
- (10:15am - 10:30am break)
- Synchronization
- Memory Consistency
El-Ghazawi
6. Section 2: UPC Systems
Merkey, Seidel
- Summary of current UPC systems
- Cray X-1
- Hewlett-Packard
- Berkeley
- Intrepid
- MTU
- UPC application development tools
- TotalView
- upc_trace
- performance toolkit interface
- performance model
7. Section 3: UPC Libraries
Seidel
- Collective Functions
- Bucket sort example
- UPC collectives
- Synchronization modes
- Collectives performance
- Extensions
- (Noon - 1:00pm lunch)
- UPC-IO
- Concepts
- Main Library Calls
- Library Overview
El-Ghazawi
8. Section 4: UPC Applications Development
- Two case studies of application design
- histogramming
- locks revisited
- generalizing the histogram problem
- programming the sparse case
- implications of the memory model
- (2:30pm - 2:45pm break)
- generic science code (advection)
- shared multi-dimensional arrays
- implications of the memory model
- UPC tips, tricks, and traps
Merkey, Seidel
9. Introduction
- UPC = Unified Parallel C
- Set of specs for a parallel C
- v1.0 completed February of 2001
- v1.1.1 in October of 2003
- v1.2 in May of 2005
- Compiler implementations by vendors and others
- Consortium of government, academia, and HPC vendors including IDA CCS, GWU, UCB, MTU, UMN, ARSC, UMCP, U of Florida, ANL, LBNL, LLNL, DoD, DoE, HP, Cray, IBM, Sun, Intrepid, Etnus, and others
10. Introduction cont.
- UPC compilers are now available for most HPC platforms and clusters
- Some are open source
- A debugger is available, and a performance analysis tool is in the works
- Benchmarks, programming examples, and compiler testing suite(s) are available
- Visit www.upcworld.org or upc.gwu.edu for more information
11. Parallel Programming Models
- What is a programming model?
- An abstract virtual machine
- A view of data and execution
- The logical interface between architecture and applications
- Why programming models?
- Decouple applications and architectures
- Write applications that run effectively across architectures
- Design new architectures that can effectively support legacy applications
- Programming model design considerations
- Expose modern architectural features to exploit machine power and improve performance
- Maintain ease of use
12. Programming Models
- Common parallel programming models
- Data Parallel
- Message Passing
- Shared Memory
- Distributed Shared Memory
- Hybrid models
- Shared Memory under Message Passing
13. Programming Models
[Figure: processes/threads and address spaces under three models - Message Passing (MPI), Shared Memory (OpenMP), and DSM/PGAS (UPC)]
14. The Partitioned Global Address Space (PGAS) Model
- Aka the DSM model
- Concurrent threads with a partitioned shared space
- Similar to the shared memory model
- Memory partition Mi has affinity to thread Thi
- (+)ive:
- Helps exploiting locality
- Simple statements, as in shared memory
- (-)ive:
- Synchronization
- UPC, also CAF and Titanium
[Figure: threads Th0 .. Thn-1 over memory partitions M0 .. Mn-1 of a single address space, with memory accesses crossing partitions]
15. What is UPC?
- Unified Parallel C
- An explicit parallel extension of ISO C
- A partitioned shared-memory parallel programming language
16. UPC Execution Model
- A number of threads working independently in a SPMD fashion (see the sketch below)
- MYTHREAD specifies the thread index (0..THREADS-1)
- Number of threads specified at compile time or run time
- Synchronization only when needed
- Barriers
- Locks
- Memory consistency control
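For instance, a minimal SPMD sketch (an illustrative addition, not from the original slides; assumes any conforming UPC compiler) in which each thread identifies itself:

/* hello.upc -- minimal SPMD sketch */
#include <upc.h>      /* MYTHREAD, THREADS, upc_barrier */
#include <stdio.h>
int main(void)
{
    /* every thread executes the same program (SPMD) */
    printf("Hello from thread %d of %d\n", MYTHREAD, THREADS);
    upc_barrier;      /* synchronize before exiting */
    return 0;
}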
17. UPC Memory Model
[Figure: a partitioned global (shared) address space spanning Thread 0 .. Thread THREADS-1, plus a private space per thread]
- A pointer-to-shared can reference all locations in the shared space, but there is data-thread affinity
- A private pointer may reference addresses in its private space or its local portion of the shared space
- Static and dynamic memory allocations are supported for both shared and private memory
18. User's General View
- A collection of threads operating in a single global address space, which is logically partitioned among threads. Each thread has affinity with a portion of the globally shared address space. Each thread also has a private space.
19. A First Example: Vector addition

// vect_add.c
#include <upc_relaxed.h>
#define N 100*THREADS
shared int v1[N], v2[N], v1plusv2[N];
void main() {
    int i;
    for(i=0; i<N; i++)
        if (MYTHREAD == i%THREADS)
            v1plusv2[i] = v1[i] + v2[i];
}

[Figure: with the default layout, v1[], v2[], and v1plusv2[] are distributed round-robin; with 2 threads, thread 0 holds the even elements and thread 1 the odd ones, and each thread executes the iterations whose elements it owns]
20. 2nd Example: A More Efficient Implementation

// vect_add.c
#include <upc_relaxed.h>
#define N 100*THREADS
shared int v1[N], v2[N], v1plusv2[N];
void main() {
    int i;
    for(i=MYTHREAD; i<N; i+=THREADS)
        v1plusv2[i] = v1[i] + v2[i];
}

[Figure: each thread now loops only over its own elements, avoiding the wasted iterations of the first version]
21. 3rd Example: A More Convenient Implementation with upc_forall

// vect_add.c
#include <upc_relaxed.h>
#define N 100*THREADS
shared int v1[N], v2[N], v1plusv2[N];
void main() {
    int i;
    upc_forall(i=0; i<N; i++; i)
        v1plusv2[i] = v1[i] + v2[i];
}

[Figure: the affinity expression i assigns iteration i to thread i%THREADS, matching the default data layout]
22. Example: UPC Matrix-Vector Multiplication - Default Distribution

// vect_mat_mult.c
#include <upc_relaxed.h>
shared int a[THREADS][THREADS];
shared int b[THREADS], c[THREADS];
void main (void) {
    int i, j;
    upc_forall( i = 0; i < THREADS; i++; i) {
        c[i] = 0;
        for ( j = 0; j < THREADS; j++)
            c[i] += a[i][j]*b[j];
    }
}
23. Data Distribution
[Figure: with the default round-robin element distribution over threads 0-2, the elements of row i of A do not all reside with the thread that computes C[i], so most accesses are remote]
24. A Better Data Distribution
[Figure: each row of A placed entirely on one thread, aligned with the corresponding elements of B and C, so computing C[i] touches mostly local data]
25. Example: UPC Matrix-Vector Multiplication - The Better Distribution

// vect_mat_mult.c
#include <upc_relaxed.h>
shared [THREADS] int a[THREADS][THREADS];
shared int b[THREADS], c[THREADS];
void main (void) {
    int i, j;
    upc_forall( i = 0; i < THREADS; i++; i) {
        c[i] = 0;
        for ( j = 0; j < THREADS; j++)
            c[i] += a[i][j]*b[j];
    }
}
26. Shared and Private Data
- Examples of shared and private data layout
- Assume THREADS = 3
- shared int x;  /* x will have affinity to thread 0 */
- shared int y[THREADS];
- int z;
- will result in the layout:
[Figure: x and y[0] on thread 0, y[1] on thread 1, y[2] on thread 2; each thread holds its own private copy of z]
27. Shared and Private Data
- shared int A[4][THREADS];
- will result in the following data layout:
[Figure: column j of A has affinity to thread j - thread 0 holds A[0][0], A[1][0], A[2][0], A[3][0]; thread 1 holds A[0][1], A[1][1], A[2][1], A[3][1]; and so on]
28. Shared and Private Data
- shared int A[2][2*THREADS];
- will result in the following data layout:
[Figure: thread i holds A[0][i], A[0][i+THREADS], A[1][i], and A[1][i+THREADS]; e.g. thread THREADS-1 holds A[0][THREADS-1], A[0][2*THREADS-1], A[1][THREADS-1], A[1][2*THREADS-1]]
29. Blocking of Shared Arrays
- Default block size is 1
- Shared arrays can be distributed on a block-per-thread basis, round robin with arbitrary block sizes
- A block size is specified in the declaration as follows:
- shared [block-size] type array[N];
- e.g. shared [4] int a[16];
30. Blocking of Shared Arrays
- Block size and THREADS determine affinity
- The term affinity means in which thread's local shared-memory space a shared data item will reside
- Element i of an array blocked with block size B has affinity to thread floor(i/B) mod THREADS (a quick check of this rule follows below)
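As a quick check of the rule (an illustrative sketch, not from the original slides; the rule itself is just arithmetic, so this compiles as plain C or UPC):

/* owner.c -- evaluating the affinity rule directly */
#include <stdio.h>
/* thread owning element i of an array declared shared [B] under T threads */
unsigned owner(unsigned i, unsigned B, unsigned T) { return (i / B) % T; }
int main(void)
{
    /* e.g. shared [4] int a[16] with THREADS == 2:
       a[0..3] -> thread 0, a[4..7] -> thread 1, a[8..11] -> thread 0, ... */
    unsigned i;
    for (i = 0; i < 16; i++)
        printf("a[%u] -> thread %u\n", i, owner(i, 4, 2));
    return 0;
}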
31. Shared and Private Data
- Shared objects are placed in memory based on affinity
- Affinity can also be defined based on the ability of a thread to refer to an object by a private pointer
- All non-array shared-qualified objects, i.e. shared scalars, have affinity to thread 0
- Threads access shared and private data
32. Shared and Private Data
- Assume THREADS = 4
- shared [3] int A[4][THREADS];
- will result in the following data layout:
[Figure: blocks of 3 consecutive elements of the linearized array are dealt round-robin - thread 0 gets A[0][0..2] and A[3][0..2], thread 1 gets A[0][3], A[1][0..1], and A[3][3], thread 2 gets A[1][2..3] and A[2][0], thread 3 gets A[2][1..3]]
33. Special Operators
- upc_localsizeof(type-name or expression): returns the size of the local portion of a shared object
- upc_blocksizeof(type-name or expression): returns the blocking factor associated with the argument
- upc_elemsizeof(type-name or expression): returns the size (in bytes) of the left-most type that is not an array
34. Usage Example of Special Operators
- typedef shared int sharray[10*THREADS];
- sharray a;
- char i;
- upc_localsizeof(sharray) -> 10*sizeof(int)
- upc_localsizeof(a) -> 10*sizeof(int)
- upc_localsizeof(i) -> 1
- upc_blocksizeof(a) -> 1
- upc_elemsizeof(a) -> sizeof(int)
35. String functions in UPC
- UPC provides standard library functions to move data to/from shared memory
- Can be used to move chunks in the shared space or between shared and private spaces
36. String functions in UPC
- Equivalent of memcpy:
- upc_memcpy(dst, src, size)
- copies from shared to shared
- upc_memput(dst, src, size)
- copies from private to shared
- upc_memget(dst, src, size)
- copies from shared to private
- Equivalent of memset:
- upc_memset(dst, char, size)
- initializes shared memory with a character
- The shared block must be contiguous, with all of its elements having the same affinity (a usage sketch follows below)
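A small usage sketch (an illustrative addition with hypothetical buffer names; assumes the declaration gives each thread one contiguous local block):

/* bulkcopy.upc -- sketch of upc_memget / upc_memput */
#include <upc_relaxed.h>
#define B 64
shared [B] int data[B*THREADS];   /* one contiguous block per thread */
int main(void)
{
    int local[B];                 /* private buffer */
    /* pull this thread's shared block into private memory */
    upc_memget(local, &data[MYTHREAD*B], B*sizeof(int));
    /* ... compute on local[] ... */
    /* push the results back into the shared space */
    upc_memput(&data[MYTHREAD*B], local, B*sizeof(int));
    return 0;
}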
37. UPC Pointers
- Where does it point to?
- Where does it reside?
38. UPC Pointers
- How to declare them?
- int *p1;                /* private pointer pointing locally */
- shared int *p2;         /* private pointer pointing into the shared space */
- int *shared p3;         /* shared pointer pointing locally */
- shared int *shared p4;  /* shared pointer pointing into the shared space */
- You may find many using "shared pointer" to mean a pointer pointing to a shared object, e.g. equivalent to p2, but it could be p4 as well
39. UPC Pointers
[Figure: where each of p1-p4 resides (shared vs. private space of thread 0) and where it may point]
40. UPC Pointers
- What are the common usages?
- int *p1;                /* access to private data or to local shared data */
- shared int *p2;         /* independent access of threads to data in the shared space */
- int *shared p3;         /* not recommended */
- shared int *shared p4;  /* common access of all threads to data in the shared space */
41. UPC Pointers
- In UPC, pointers to shared objects have three fields:
- thread number
- local address of block
- phase (specifies position in the block)
- Example: Cray T3E implementation
[Figure: 64-bit pointer format - virtual address in bits 0-37, thread in bits 38-48, phase in bits 49-63]
42. UPC Pointers
- Pointer arithmetic supports blocked and non-blocked array distributions
- Casting of shared to private pointers is allowed, but not vice versa!
- When casting a pointer-to-shared to a private pointer, the thread number of the pointer-to-shared may be lost
- Casting of a pointer-to-shared to a private pointer is well defined only if the pointed-to object has affinity with the local thread
43. Special Functions
- size_t upc_threadof(shared void *ptr): returns the thread number that has affinity to the object pointed to by ptr
- size_t upc_phaseof(shared void *ptr): returns the index (position within the block) of the object pointed to by ptr
- size_t upc_addrfield(shared void *ptr): returns the address of the block pointed to by the pointer-to-shared
- shared void *upc_resetphase(shared void *ptr): resets the phase to zero
- size_t upc_affinitysize(size_t ntotal, size_t nbytes, size_t thr): returns the exact size of the local portion of the data in a shared object with affinity to a given thread (a query sketch follows below)
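A quick illustrative sketch (an addition, not from the original slides) querying two of these fields:

/* fields.upc -- printing the fields of a pointer-to-shared */
#include <upc_relaxed.h>
#include <stdio.h>
shared [3] int a[3*THREADS];
int main(void)
{
    if (MYTHREAD == 0) {
        shared [3] int *p = &a[4];   /* element 4 sits in block 1, position 1 */
        printf("thread = %u, phase = %u\n",
               (unsigned)upc_threadof(p), (unsigned)upc_phaseof(p));
        /* expected: thread = 1, phase = 1 (for THREADS >= 2) */
    }
    return 0;
}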
44. UPC Pointers
- Pointer-to-shared arithmetic examples
- Assume THREADS = 4
- #define N 16
- shared int x[N];
- shared int *dp = &x[5], *dp1;
- dp1 = dp + 9;
45. UPC Pointers
[Figure: x[0..15] laid out round-robin over threads 0-3; dp = &x[5] sits on thread 1, and dp+1 .. dp+9 advance thread-by-thread through x[6] .. x[14], so dp1 = dp + 9 points at x[14] on thread 2]
46. UPC Pointers
- Assume THREADS = 4
- shared [3] int x[N], *dp = &x[5], *dp1;
- dp1 = dp + 9;
47. UPC Pointers
[Figure: x[] distributed in blocks of 3 per thread; dp = &x[5] is on thread 1 at phase 2, dp+1 moves into thread 2's block, and dp+9 lands on x[14] (thread 0, phase 2), where dp1 now points]
48. UPC Pointers
- Example: Pointer Casting and Mismatched Assignments
- Pointer casting:
- shared int x[THREADS];
- int *p;
- p = (int *) &x[MYTHREAD];  /* p points to x[MYTHREAD] */
- Each of the private pointers will point at the x element which has affinity with its thread, i.e. MYTHREAD
49. UPC Pointers
- Mismatched assignments:
- Assume THREADS = 4
- shared int x[N];
- shared [3] int *dp = &x[5], *dp1;
- dp1 = dp + 9;
- The last statement assigns to dp1 a value that is 9 positions beyond dp
- The pointer will follow its own blocking and not that of the array
50. UPC Pointers
[Figure: dp advances in blocks of 3 even though x[] was declared with block size 1; dp+1 .. dp+9 therefore skip through x in a pattern that does not match the array's round-robin layout, with dp+9 landing on x[16]]
51. UPC Pointers
- Given the declarations:
- shared [3] int *p;
- shared [5] int *q;
- Then:
- p = q;  /* is acceptable (an implementation may require an explicit cast, e.g. p = (shared [3] int *)q;) */
- Pointer p, however, will follow pointer arithmetic for blocks of 3, not 5!
- A pointer cast sets the phase to 0
52. Worksharing with upc_forall
- Distributes independent iterations across threads in the way you wish; typically used to boost locality exploitation in a convenient way
- Simple C-like syntax and semantics:
- upc_forall(init; test; loop; affinity)
-   statement;
- Affinity could be an integer expression, or a
- reference to (address of) a shared object
53. Work Sharing and Exploiting Locality via upc_forall()
- Example 1: explicit affinity using shared references
- shared int a[100], b[100], c[100];
- int i;
- upc_forall (i=0; i<100; i++; &a[i])
-   a[i] = b[i] * c[i];
- Example 2: implicit affinity with integer expressions and distribution in a round-robin fashion
- shared int a[100], b[100], c[100];
- int i;
- upc_forall (i=0; i<100; i++; i)
-   a[i] = b[i] * c[i];
Note: Examples 1 and 2 result in the same distribution
54. Work Sharing: upc_forall()
- Example 3: implicit affinity with distribution by chunks
- shared int a[100], b[100], c[100];
- int i;
- upc_forall (i=0; i<100; i++; (i*THREADS)/100)
-   a[i] = b[i] * c[i];
- Assuming 4 threads, iterations 0-24 go to thread 0, 25-49 to thread 1, 50-74 to thread 2, and 75-99 to thread 3
55. Distributing Multidimensional Data
- Uses the inherent contiguous memory layout of C multidimensional arrays
- shared [BLOCKSIZE] double grids[N][N];
- Distribution depends on the values of BLOCKSIZE and N
[Figure: four layouts of the N x N grid - default (BLOCKSIZE=1), column blocks (BLOCKSIZE=N/THREADS), distribution by row block (BLOCKSIZE=N), and BLOCKSIZE=N*N or indefinite]
56. 2D Heat Conduction Problem
- Based on the 2D partial differential equation (1), the 2D heat conduction problem is similar to a 4-point stencil operation, as seen in (2)
[Equations (1) and (2) were rendered as images: the 2D heat equation, and its discrete 4-point stencil update T_new[y][x] = (T[y-1][x] + T[y+1][x] + T[y][x-1] + T[y][x+1]) / 4]
- Because of the time steps, typically two grids are used
57. 2D Heat Conduction Problem

shared [BLOCKSIZE] double grids[2][N][N];
shared double dTmax_local[THREADS];
int nr_iter, i, x, y, dg, sg, finished;
double dTmax, dT, T;

do
{
    dTmax = 0.0;
    for( y=1; y<N-1; y++ )
    {
        /* work distribution according to the defined BLOCKSIZE of grids;
           this affinity expression is generic, working for any BLOCKSIZE */
        upc_forall( x=1; x<N-1; x++; &grids[sg][y][x] )
        {
            /* 4-point stencil */
            T = (grids[sg][y-1][x] + grids[sg][y+1][x] +
                 grids[sg][y][x-1] + grids[sg][y][x+1]) / 4.0;
            dT = T - grids[sg][y][x];
            grids[dg][y][x] = T;
            if( dTmax < fabs(dT) )
                dTmax = fabs(dT);
        }
    }
58. 2D Heat Conduction Problem (cont.)

    /* reduction operation over the per-thread dTmax values */
    upc_barrier;
    dTmax_local[MYTHREAD] = dTmax;
    upc_barrier;
    dTmax = dTmax_local[0];
    for( i=1; i<THREADS; i++ )
        if( dTmax < dTmax_local[i] )
            dTmax = dTmax_local[i];
    upc_barrier;

    if( dTmax < epsilon )
        finished = 1;
    else
    {
        /* swapping the source & destination pointers */
        dg = sg;
        sg = !sg;
    }
    nr_iter++;
} while( !finished );
59. Dynamic Memory Allocation in UPC
- Dynamic memory allocation of shared memory is available in UPC
- Functions can be collective or not
- A collective function has to be called by every thread and will return the same value to all of them
- As a convention, the name of a collective function typically includes "all"
60. Collective Global Memory Allocation
- shared void *upc_all_alloc(size_t nblocks, size_t nbytes);
- nblocks: number of blocks; nbytes: block size
- This function has the same result as upc_global_alloc, but this is a collective function, which is expected to be called by all threads
- All the threads will get the same pointer
- Equivalent to: shared [nbytes] char [nblocks * nbytes]
61. Collective Global Memory Allocation
[Figure: upc_all_alloc places one block of N ints in the shared space of each of threads 0 .. THREADS-1; every thread's private ptr receives the same pointer-to-shared]

shared [N] int *ptr;
ptr = (shared [N] int *) upc_all_alloc( THREADS, N*sizeof( int ) );
62. 2D Heat Conduction Example

#define CHECK_MEM(var) \
    if( (var) == NULL ) \
    { \
        printf("TH%02d: ERROR: %s == NULL\n", \
               MYTHREAD, #var ); \
        upc_global_exit(1); \
    }

shared [BLOCKSIZE] double *sh_grids;

int heat_conduction(shared [BLOCKSIZE] double (*grids)[N][N])
{
    /* same computation as before, now through the grids parameter:
       for( y=1; y<N-1; y++ )
           upc_forall( x=1; x<N-1; x++; &grids[sg][y][x] )
               T = (grids[sg][y+1][x] + ...
       ...
       while( finished == 0 );                                    */
    return nr_iter;
}
63. 2D Heat Conduction Example (cont.)

int main(void)
{
    int nr_iter;
    /* allocate the memory required for grids[2][N][N] */
    sh_grids = (shared [BLOCKSIZE] double *)
        upc_all_alloc( 2*N*N/BLOCKSIZE, BLOCKSIZE*sizeof(double));
    CHECK_MEM(sh_grids);
    /* perform the heat conduction computations */
    nr_iter = heat_conduction(
        (shared [BLOCKSIZE] double (*)[N][N]) sh_grids);  /* casting here to a 2-D shared pointer! */
}
64. Global Memory Allocation
- shared void *upc_global_alloc(size_t nblocks, size_t nbytes);
- nblocks: number of blocks; nbytes: block size
- Non-collective; expected to be called by one thread
- The calling thread allocates a contiguous memory region in the shared space
- Space allocated per calling thread is equivalent to: shared [nbytes] char [nblocks * nbytes]
- If called by more than one thread, multiple regions are allocated and each calling thread gets a different pointer
65. Global Memory Allocation

shared [N] int *ptr;
ptr = (shared [N] int *) upc_global_alloc( THREADS, N*sizeof( int ));

shared [N] int *shared myptr[THREADS];
myptr[MYTHREAD] = (shared [N] int *) upc_global_alloc( THREADS, N*sizeof( int ));
68. Local-Shared Memory Allocation
- shared void *upc_alloc(size_t nbytes);
- nbytes: block size
- Non-collective; expected to be called by one thread
- The calling thread allocates a contiguous memory region in the local-shared space of the calling thread
- Space allocated per calling thread is equivalent to: shared char [nbytes]
- If called by more than one thread, multiple regions are allocated and each calling thread gets a different pointer
69. Local-Shared Memory Allocation

shared int *ptr;
ptr = (shared int *)upc_alloc(N*sizeof( int ));
70. Blocking Multidimensional Data by Cells
- Blocking can also be done by 2D cells of equal size across THREADS
- Works best with N being a power of 2
[Figure: an N x N grid partitioned into DIMX x DIMY cells - THREADS=1: NO_COLS=1, NO_ROWS=1; THREADS=2: NO_COLS=2, NO_ROWS=1; THREADS=4: NO_COLS=2, NO_ROWS=2; THREADS=8: NO_COLS=4, NO_ROWS=2; THREADS=16: NO_COLS=4, NO_ROWS=4]
71. Blocking Multidimensional Data by Cells
- Determining DIMX and DIMY:
- NO_COLS = NO_ROWS = 1;
- for( i=2, j=0; i<=THREADS; i<<=1, j++ )
-   if( (j%2)==0 ) NO_COLS <<= 1;
-   else NO_ROWS <<= 1;
- DIMX = N / NO_COLS;
- DIMY = N / NO_ROWS;
72. Accessing one element of the 2D shared cells (by a macro)

#define CELL_SIZE DIMY*DIMX
struct gridcell_s {               /* definition of a cell */
    double cell[CELL_SIZE];
};
typedef struct gridcell_s gridcell_t;
shared gridcell_t cell_grids[2][THREADS];   /* 2 grids, one cell per thread */

/* linearization of 2D into 1D using a C macro */
#define grids(gridno, y, x) \
    cell_grids[gridno][((y)/DIMY)*NO_COLS + ((x)/DIMX)]  /* which thread? */ \
        .cell[((y)%DIMY)*DIMX + ((x)%DIMX)]              /* which offset in the cell? */
73. 2D Heat Conduction Example w/ 2D Cells

typedef struct chunk_s chunk_t;
struct chunk_s {
    shared double *chunk;
};
int N;
shared chunk_t sh_grids[2][THREADS];
shared double dTmax_local[THREADS];
#define grids(no,y,x) \
    sh_grids[no][((y)*N+(x))/(N*N/THREADS)] \
        .chunk[((y)*N+(x))%(N*N/THREADS)]
int heat_conduction(shared chunk_t (*sh_grids)[THREADS])
{
    /* grids[...][...] accesses have to be changed to grids(no,y,x) */
}
74. 2D Heat Conduction Example w/ 2D Cells (cont.)

int main(int argc, char *argv[])
{
    int nr_iter, no;
    /* get N as a parameter */
    for( no=0; no<2; no++ )
    {
        /* allocate */
        sh_grids[no][MYTHREAD].chunk = (shared double *)
            upc_alloc( N*N/THREADS*sizeof( double ));
        CHECK_MEM( sh_grids[no][MYTHREAD].chunk );
    }
    /* perform the heat conduction computation */
    nr_iter = heat_conduction(sh_grids);
}
75. Memory Space Clean-up
- void upc_free(shared void *ptr);
- The upc_free function frees the dynamically allocated shared memory pointed to by ptr
- upc_free is not collective (see the pairing sketch below)
76. Example: Matrix Multiplication in UPC
- Given two integer matrices A(NxP) and B(PxM), we want to compute C = A x B
- Entries c_ij in C are computed by the formula: c_ij = sum over l = 1..P of a_il * b_lj
77. Doing it in C

#include <stdlib.h>
#define N 4
#define P 4
#define M 4
int a[N][P] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}, c[N][M];
int b[P][M] = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1};
void main (void)
{
    int i, j, l;
    for (i = 0; i<N; i++)
        for (j=0; j<M; j++)
        {
            c[i][j] = 0;
            for (l = 0; l<P; l++) c[i][j] += a[i][l]*b[l][j];
        }
}
78. Domain Decomposition for UPC
- Exploiting locality in matrix multiplication:
- A (N x P) is decomposed row-wise into blocks of size (N x P) / THREADS as shown below
- B (P x M) is decomposed column-wise into M/THREADS blocks as shown below
[Figure: thread 0 holds elements 0 .. (N*P/THREADS)-1 of A, thread 1 holds (N*P/THREADS) .. (2*N*P/THREADS)-1, up to thread THREADS-1; thread 0 holds columns 0 .. (M/THREADS)-1 of B, and thread THREADS-1 holds columns ((THREADS-1)*M)/THREADS .. M-1]
- Note: N and M are assumed to be multiples of THREADS
79. UPC Matrix Multiplication Code

#include <upc_relaxed.h>
#define N 4
#define P 4
#define M 4
shared [N*P/THREADS] int a[N][P];
shared [N*M/THREADS] int c[N][M];
// a and c are blocked shared matrices; initialization is not currently implemented
shared [M/THREADS] int b[P][M];
void main (void)
{
    int i, j, l;  // private variables
    upc_forall(i = 0; i<N; i++; &c[i][0])
        for (j=0; j<M; j++)
        {
            c[i][j] = 0;
            for (l = 0; l<P; l++) c[i][j] += a[i][l]*b[l][j];
        }
}
80. UPC Matrix Multiplication Code with Privatization

#include <upc_relaxed.h>
#define N 4
#define P 4
#define M 4
shared [N*P/THREADS] int a[N][P];  // N, P and M divisible by THREADS
shared [N*M/THREADS] int c[N][M];
shared [M/THREADS] int b[P][M];
int *a_priv, *c_priv;
void main (void)
{
    int i, j, l;  // private variables
    upc_forall(i = 0; i<N; i++; &c[i][0])
    {
        a_priv = (int *)a[i];
        c_priv = (int *)c[i];
        for (j=0; j<M; j++)
        {
            c_priv[j] = 0;
            for (l = 0; l<P; l++) c_priv[j] += a_priv[l]*b[l][j];
        }
    }
}
81. UPC Matrix Multiplication Code with Block Copy

#include <upc_relaxed.h>
shared [N*P/THREADS] int a[N][P];
shared [N*M/THREADS] int c[N][M];
// a and c are blocked shared matrices; initialization is not currently implemented
shared [M/THREADS] int b[P][M];
int b_local[P][M];
void main (void)
{
    int i, j, l;  // private variables
    for( i=0; i<P; i++ )
        for( j=0; j<THREADS; j++ )
            upc_memget(&b_local[i][j*(M/THREADS)], &b[i][j*(M/THREADS)],
                       (M/THREADS)*sizeof(int));
    upc_forall(i = 0; i<N; i++; &c[i][0])
        for (j=0; j<M; j++)
        {
            c[i][j] = 0;
            for (l = 0; l<P; l++) c[i][j] += a[i][l]*b_local[l][j];
        }
}
82. UPC Matrix Multiplication Code with Privatization and Block Copy

#include <upc_relaxed.h>
shared [N*P/THREADS] int a[N][P];  // N, P and M divisible by THREADS
shared [N*M/THREADS] int c[N][M];
shared [M/THREADS] int b[P][M];
int *a_priv, *c_priv, b_local[P][M];
void main (void)
{
    int i, j, l;  // private variables
    for( i=0; i<P; i++ )
        for( j=0; j<THREADS; j++ )
            upc_memget(&b_local[i][j*(M/THREADS)], &b[i][j*(M/THREADS)],
                       (M/THREADS)*sizeof(int));
    upc_forall(i = 0; i<N; i++; &c[i][0])
    {
        a_priv = (int *)a[i];
        c_priv = (int *)c[i];
        for (j=0; j<M; j++)
        {
            c_priv[j] = 0;
            for (l = 0; l<P; l++) c_priv[j] += a_priv[l]*b_local[l][j];
        }
    }
}
83. Matrix Multiplication with Dynamic Memory

#include <upc_relaxed.h>
shared [N*P/THREADS] int *a;
shared [N*M/THREADS] int *c;
shared [M/THREADS] int *b;
void main (void)
{
    int i, j, l;  // private variables
    a = upc_all_alloc(THREADS, (N*P/THREADS)*upc_elemsizeof(*a));
    c = upc_all_alloc(THREADS, (N*M/THREADS)*upc_elemsizeof(*c));
    b = upc_all_alloc(P*THREADS, (M/THREADS)*upc_elemsizeof(*b));
    upc_forall(i = 0; i<N; i++; &c[i*M])
        for (j=0; j<M; j++)
        {
            c[i*M+j] = 0;
            for (l = 0; l<P; l++) c[i*M+j] += a[i*P+l]*b[l*M+j];
        }
}
84. Synchronization
- No implicit synchronization among the threads
- UPC provides the following synchronization mechanisms:
- Barriers
- Locks
85. Synchronization - Barriers
- No implicit synchronization among the threads
- UPC provides the following barrier synchronization constructs:
- Barriers (blocking)
- upc_barrier expr_opt;
- Split-phase barriers (non-blocking)
- upc_notify expr_opt;
- upc_wait expr_opt;
- Note: upc_notify is not blocking; upc_wait is (see the sketch below)
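A sketch of the split-phase idiom (an illustrative addition; the "local computation" is a placeholder): work that touches only private data can be overlapped with the barrier:

/* split.upc -- overlapping local work with a split-phase barrier */
#include <upc_relaxed.h>
shared int flags[THREADS];
int main(void)
{
    flags[MYTHREAD] = 1;   /* a shared write others will later read */
    upc_notify;            /* signal arrival; does not block */
    /* ... purely private computation may proceed here ... */
    upc_wait;              /* block until all threads have notified */
    /* now it is safe to read the other threads' flags */
    return 0;
}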
86. Synchronization - Locks
- In UPC, shared data can be protected against multiple writers:
- void upc_lock(upc_lock_t *l);
- int upc_lock_attempt(upc_lock_t *l);  // returns 1 on success and 0 on failure
- void upc_unlock(upc_lock_t *l);
- Locks are allocated dynamically, and can be freed
- Locks are properly initialized after they are allocated
87. Dynamic lock allocation
- Locks can be managed using the following functions:
- collective lock allocation (à la upc_all_alloc):
- upc_lock_t *upc_all_lock_alloc(void);
- global lock allocation (à la upc_global_alloc):
- upc_lock_t *upc_global_lock_alloc(void);
- lock freeing:
- void upc_lock_free(upc_lock_t *ptr);
88. Collective lock allocation
- collective lock allocation:
- upc_lock_t *upc_all_lock_alloc(void);
- Needs to be called by all the threads
- Returns a single lock to all calling threads
89. Global lock allocation
- global lock allocation:
- upc_lock_t *upc_global_lock_alloc(void);
- Returns one lock pointer to the calling thread
- This is not a collective function
90. Lock freeing
- void upc_lock_free(upc_lock_t *l);
- This is not a collective function
91. Numerical Integration (computation of π)
- Integrate the function f(x) = 4/(1+x²) from 0 to 1; the integral equals π
92. Example: Using Locks in Numerical Integration

// Example: The Famous PI - Numerical Integration
#include <upc_relaxed.h>
#include <stdio.h>
#define N 1000000
#define f(x) 1/(1+(x)*(x))
upc_lock_t *l;
shared float pi;
void main(void)
{
    float local_pi = 0.0;
    int i;
    l = upc_all_lock_alloc();
    upc_barrier;
    upc_forall(i=0; i<N; i++; i)
        local_pi += (float) f((.5+i)/N);
    local_pi *= (float) (4.0 / N);
    upc_lock(l);       /* better with collectives */
    pi += local_pi;
    upc_unlock(l);
    upc_barrier;       // ensure all is done
    if (MYTHREAD==0)
    {
        upc_lock_free( l );   /* freed once; upc_lock_free is not collective */
        printf("PI = %f\n", pi);
    }
}
93. Memory Consistency Models
- Has to do with the ordering of shared operations, and when a change to a shared object by a thread becomes visible to others
- Consistency can be strict or relaxed
- Under the relaxed consistency model, shared operations can be reordered by the compiler / runtime system
- The strict consistency model enforces sequential ordering of shared operations (no operation on shared data can begin before the previous ones are done, and changes become visible immediately)
94. Memory Consistency
- Default behavior can be controlled by the programmer and set at the program level:
- To have strict memory consistency:
- #include <upc_strict.h>
- To have relaxed memory consistency:
- #include <upc_relaxed.h>
95. Memory Consistency
- Default behavior can be altered for a variable definition in the declaration using:
- Type qualifiers: strict & relaxed
- Default behavior can be altered for a statement or a block of statements using:
- #pragma upc strict
- #pragma upc relaxed
- Highest precedence is at declarations, then pragmas, then the program level (a sketch of all three levels follows below)
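A sketch of the three levels together (an illustrative addition; the variable names are hypothetical):

/* levels.upc -- qualifier- and pragma-level consistency control */
#include <upc_relaxed.h>        /* program level: relaxed by default */
strict shared int flag;          /* declaration level: always strict */
shared int data;
int main(void)
{
    if (MYTHREAD == 0) {
#pragma upc strict               /* statement level: this block is strict */
        {
            data = 42;
            flag = 1;            /* ordered after the write to data */
        }
    }
    return 0;
}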
96. Memory Consistency - Fence
- UPC provides a fence construct
- Equivalent to a null strict reference, and has the syntax:
- upc_fence;
- UPC ensures that all shared references are issued before the upc_fence is completed (a producer sketch follows below)
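A producer-side sketch of the fence (an illustrative addition; names are hypothetical). The fence orders two relaxed writes without declaring either variable strict; a consumer would pair it with its own fence after polling ready:

/* fence.upc -- ordering relaxed writes with upc_fence */
#include <upc_relaxed.h>
shared int result;
shared int ready;
int main(void)
{
    if (MYTHREAD == 0) {
        result = 99;   /* relaxed write */
        upc_fence;     /* all earlier shared accesses complete here */
        ready = 1;     /* safe to publish */
    }
    return 0;
}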
97. Memory Consistency Example

strict shared int flag_ready = 0;
shared int result0, result1;
if (MYTHREAD==0)
{
    result0 = expression1;
    flag_ready = 1;      // if not strict, it could be
}                        // switched with the above statement
else if (MYTHREAD==1)
{
    while(!flag_ready);  // same note
    result1 = expression2 + result0;
}

- We could have used a barrier between the first and second statement in the if and the else code blocks. Expensive!! Affects all operations at all threads.
- We could have used a fence in the same places. Affects shared references at all threads!
- The above works as an example of point-to-point synchronization.
98. Section 2: UPC Systems
Merkey, Seidel
- Summary of current UPC systems
- Cray X-1
- Hewlett-Packard
- Berkeley
- Intrepid
- MTU
- UPC application development tools
- TotalView
- upc_trace
- work in progress
- performance toolkit interface
- performance model
99. Cray UPC
- Platform: Cray X1 supporting UPC v1.1.1
- Features: shared memory architecture
- UPC is a compiler option => all of the ILP optimization is available in UPC
- The processors are designed with 4 SSPs per MSP
- A UPC thread can run on an SSP or an MSP; an SSP-mode vs. MSP-mode performance analysis is required before making a choice
- There are no virtual processors
- This is a high-bandwidth, low-latency system
- The SSPs are vector processors; the key to performance is exploiting ILP through vectorization
- The MSPs run at a higher clock speed; the key to performance is having enough independent work to be multi-streamed
100. Cray UPC
- Usage
- Compiling for arbitrary numbers of threads:
- cc -hupc filename.c       (MSP mode, one thread per MSP)
- cc -hupc,ssp filename.c   (SSP mode, one thread per SSP)
- Running:
- aprun -n THREADS ./a.out
- Compiling for a fixed number of threads:
- cc -hssp,upc -X THREADS filename.c -o a.out
- Running:
- ./a.out
- URL: http://docs.cray.com
- Search for UPC under Cray X1
101. Hewlett-Packard UPC
- Platforms: AlphaServer SC, HP-UX IPF, PA-RISC, HP XC ProLiant DL360 or 380
- Features
- UPC version 1.2 compliant
- UPC-specific performance optimization
- Write-through software cache for remote accesses
- Cache configurable at run time
- Takes advantage of same-node shared memory when running on SMP clusters
- Rich diagnostic and error-checking facilities
102. Hewlett-Packard UPC
- Usage
- Compiling for an arbitrary number of threads:
- upc filename.c
- Compiling for a fixed number of threads:
- upc -fthreads THREADS filename.c
- Running:
- prun -n THREADS ./a.out
- URL: http://h30097.www3.hp.com/upc
103. Berkeley UPC (BUPC)
- Platforms: supports a wide range of architectures, interconnects, and operating systems
- Features
- Open64 open source compiler as front end
- Lightweight runtime and networking layers built on GASNet
- Fully UPC version 1.2 compliant, including UPC collectives and a reference implementation of UPC parallel I/O
- Can be debugged with TotalView
- Trace analysis: upc_trace
104. Berkeley UPC (BUPC)
- Usage
- Compiling for an arbitrary number of threads:
- upcc filename.c
- Compiling for a fixed number of threads:
- upcc -T THREADS filename.c
- Compiling with optimization enabled (experimental):
- upcc -opt filename.c
- Running:
- upcrun -n THREADS ./a.out
- URL: http://upc.nersc.gov
105. Intrepid GCC/UPC
- Platforms: shared memory platforms only
- Itanium, AMD64, Intel x86 uniprocessor and SMPs
- SGI IRIX
- Cray T3E
- Features
- Based on the GNU GCC compiler
- UPC version 1.1 compliant
- Can be a front end for the Berkeley UPC runtime
106. Intrepid GCC/UPC
- Usage
- Compiling for an arbitrary number of threads:
- upc -x upc filename.c
- Running:
- mpprun ./a.out
- Compiling for a fixed number of threads:
- upc -x upc -fupc-threads-THREADS filename.c
- Running:
- ./a.out
- URL: http://www.intrepid.com/upc
107. MTU UPC (MuPC)
- Platforms: Intel x86 Linux clusters and AlphaServer SC clusters with MPI-1.1 and Pthreads
- Features
- EDG front-end source-to-source translator
- UPC version 1.1 compliant
- Generates 2 Pthreads for each UPC thread:
- one runs the user code
- an MPI-1 Pthread handles remote accesses
- Write-back software cache for remote accesses
- Cache configurable at run time
- Reference implementation of UPC collectives
108. MTU UPC (MuPC)
- Usage
- Compiling for an arbitrary number of threads:
- mupcc filename.c
- Compiling for a fixed number of threads:
- mupcc -f THREADS filename.c
- Running:
- mupcrun -n THREADS ./a.out
- URL: http://www.upc.mtu.edu
109. UPC Tools
- Etnus TotalView
- Berkeley UPC trace tool
- U. of Florida performance tool interface
- MTU performance modeling project
110. TotalView
- Platforms
- HP UPC on AlphaServers
- Berkeley UPC on x86 architectures with MPICH or Quadrics elan as the network
- Must be TotalView version 7.0.1 or above
- BUPC runtime must be configured with --enable-trace
- BUPC back end must be GNU GCC
- Features
- UPC-level source examination; steps through UPC code
- Examines shared variable values at run time
111. TotalView
- Usage
- Compiling for TotalView debugging:
- upcc -tv filename.c
- Running when MPICH is used:
- mpirun -tv -np THREADS ./a.out
- Running when Quadrics elan is used:
- totalview prun -a -n THREADS ./a.out
- URLs
- http://upc.lbl.gov/docs/user/totalview.html
- http://www.etnus.com/TotalView/
112. upc_trace
- upc_trace analyzes the communication behavior of UPC programs
- A tool available for Berkeley UPC
- Usage
- upcc must be configured with --enable-trace
- Run your application with:
- upcrun -trace ... or
- upcrun -tracefile TRACE_FILE_NAME ...
- Run upc_trace on the trace files to retrieve statistics of runtime communication events
- Finer tracing control by manually instrumenting programs:
- bupc_trace_setmask(), bupc_trace_getmask(), bupc_trace_gettracelocal(), bupc_trace_settracelocal(), etc.
113. upc_trace
- upc_trace provides information on:
- Which lines of code generated network traffic
- How many messages each line caused
- The type (local and/or remote gets/puts) of messages
- The maximum/minimum/average/combined sizes of the messages
- Local shared memory accesses
- Lock-related events, memory allocation events, and strict operations
- URL: http://upc.nersc.gov/docs/user/upc_trace.html
114. Performance tool interface
- A platform-independent interface for toolkit developers
- A callback mechanism notifies the performance tool when certain events, such as remote accesses, occur at runtime
- Relates runtime events to source code
- Events: initialization/completion, shared memory accesses, synchronization, work-sharing, library function calls, user-defined events
- Interface proposal is under development
- URL: http://www.hcs.ufl.edu/leko/upctoolint/
115. Performance model
- Application-level analytical performance model
- Models the performance of UPC fine-grain accesses through platform benchmarking and code analysis
- Platform abstraction
- Identify a common set of optimizations performed by a high-performance UPC platform: aggregation, vectorization, pipelining, local shared access optimization, communication/computation overlapping
- Design microbenchmarks to determine the platform's optimization potential
116. Performance model
- Code analysis
- High performance achievable by exploiting concurrency in shared references
- Reference partitioning
- A dependence-based analysis to determine concurrency in shared access scheduling
- References are partitioned into groups; accesses of references in a group are subject to one type of envisioned optimization
- Run time prediction
117. Section 3: UPC Libraries
- Collective Functions
- Bucket sort example
- UPC collectives
- Synchronization modes
- Collectives performance
- Extensions
- UPC-IO
- Concepts
- Main Library Calls
- Library Overview
118. Collective functions
- A collective function performs an operation in which all threads participate
- Recall that UPC includes the collectives:
- upc_barrier, upc_notify, upc_wait, upc_all_alloc, upc_all_lock_alloc
- Collectives covered here are for bulk data movement and computation:
- upc_all_broadcast, upc_all_exchange, upc_all_prefix_reduce, etc.
119. A quick example: Parallel bucketsort
- shared [N] int A[N*THREADS];
- Assume the keys in A are uniformly distributed
- Find global min and max values in A
- Determine max bucket size
- Allocate bucket array and exchange array
- Bucketize A into local shared buckets
- Exchange buckets and merge
- Rebalance and return data to A if desired
120. Sort shared array A
[Figure: the blocked shared array A with per-thread pointers-to-shared into it; shared space above, private space below]
121. 1. Find global min and max values
- shared int minmax0[2];  // only on Thread 0
- shared [2] int MinMax[2*THREADS];
- // Thread 0 receives min and max values
- upc_all_reduce(&minmax0[0], A, ..., UPC_MIN, ...);
- upc_all_reduce(&minmax0[1], A, ..., UPC_MAX, ...);
- // Thread 0 broadcasts min and max
- upc_all_broadcast(MinMax, minmax0, 2*sizeof(int), NULL);
122. 1. Find global min and max values (animation)

shared int minmax0[2];  // only on Thread 0
shared [2] int MinMax[2*THREADS];
upc_all_reduce(&minmax0[0], A, ..., UPC_MIN, ...);
upc_all_reduce(&minmax0[1], A, ..., UPC_MAX, ...);
upc_all_broadcast(MinMax, minmax0, 2*sizeof(int), NULL);

[Figure: the reductions deliver min = -92 and max = 151 to minmax0 on thread 0; the broadcast then copies the pair into every thread's MinMax block]
123. 2. Determine max bucket size
- shared [THREADS] int Bsizes[THREADS][THREADS];
- shared int bmax0;  // only on Thread 0
- shared int Bmax[THREADS];
- // determine splitting keys (not shown)
- // initialize Bsizes to 0, then:
- upc_forall(i=0; i<N*THREADS; i++; &A[i])
-   if (A[i] will go in bucket j)
-     Bsizes[MYTHREAD][j]++;
- upc_all_reduceI(&bmax0, Bsizes, ..., UPC_MAX, ...);
- upc_all_broadcast(Bmax, &bmax0, sizeof(int), ...);
124. 2. Find max bucket size required (animation)

shared [THREADS] int Bsizes[THREADS][THREADS];
shared int bmax0;  // only on Thread 0
shared int Bmax[THREADS];
upc_all_reduceI(&bmax0, Bsizes, ..., UPC_MAX, ...);
upc_all_broadcast(Bmax, &bmax0, sizeof(int), NULL);

[Figure: the per-thread bucket counts (30, 9, 12, 3, 21, 27, 8, 31, 12, ...) reduce to max = 31 in bmax0 on thread 0, which is then broadcast into each thread's Bmax element]
125. 3. Allocate bucket and exchange arrays
- shared int *BuckAry;
- shared int *BuckDst;
- int Blen;
- Blen = (int)Bmax[MYTHREAD]*sizeof(int);
- BuckAry = upc_all_alloc(THREADS, Blen*THREADS);  /* per thread: THREADS buckets of Blen bytes */
- BuckDst = upc_all_alloc(THREADS, Blen*THREADS);
- Blen = Blen/sizeof(int);
126. 3. Allocate bucket and exchange arrays (animation)

int Blen;
Blen = (int)Bmax[MYTHREAD]*sizeof(int);

[Figure: BuckAry and BuckDst are allocated across the shared space (124 bytes per bucket here, since Bmax = 31); after the division by sizeof(int), every thread's private Blen holds 31]
127. 4. Bucketize A
- int *Bptr;  // local ptr to BuckAry
- shared [THREADS] int Bcnt[THREADS][THREADS];
- // cast to a local pointer
- Bptr = (int *)&BuckAry[MYTHREAD];
- // init bucket counters: Bcnt[MYTHREAD][i]=0;
- upc_forall (i=0; i<N*THREADS; i++; &A[i])
-   if (A[i] belongs in bucket j) {
-     Bptr[Blen*j + Bcnt[MYTHREAD][j]] = A[i];
-     Bcnt[MYTHREAD][j]++;
-   }
128. 4. Bucketize A (animation)

Bptr = (int *)&BuckAry[MYTHREAD];
if (A[i] belongs in bucket j) {
    Bptr[Blen*j + Bcnt[MYTHREAD][j]] = A[i];
    Bcnt[MYTHREAD][j]++;
}

[Figure: each thread scatters its elements of A into its local shared buckets within BuckAry]
129. 5. Exchange buckets (animation)

upc_all_exchange(BuckDst, BuckAry, Blen*sizeof(int), NULL);

[Figure: bucket j on thread i moves to slot i on thread j, so each thread ends up holding all keys for its key range in BuckDst]
130. 6. Merge and rebalance
- Bptr = (int *)&BuckDst[MYTHREAD];
- // Each thread locally merges its part of
- // BuckDst. Rebalance and return to A
- // if desired.
- if (MYTHREAD==0) {
-   upc_free(BuckAry);
-   upc_free(BuckDst);
- }
131. Collectives in UPC 1.2
- Relocalization collectives change data affinity:
- upc_all_scatter
- upc_all_gather
- upc_all_gather_all
- upc_all_exchange
- upc_all_permute
- Computational collectives for data reduction
- upc_all_reduce
- upc_all_prefix_reduce
132. Why have collectives in UPC?
- Sometimes bulk data movement is the right thing to do
- Built-in collectives offer better performance
- Caution: UPC programs can come out looking like MPI code
133. An animated tour of UPC collectives
- The following illustrations serve to define the UPC collective functions
- High-performance implementations of the collectives use more sophisticated algorithms
134. upc_all_broadcast (animation)
Thread 0 sends the same block of data to each thread (a call sketch follows below).

shared [blk] char dst[blk*THREADS];
shared [] char src[blk];

[Figure: the src block on thread 0 is copied into each thread's dst block]
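A self-contained call sketch matching these declarations (an illustrative addition; the flag choice is for demonstration). Every thread makes the same call with the same arguments:

/* bcast.upc -- invoking upc_all_broadcast */
#include <upc_relaxed.h>
#include <upc_collective.h>
#define blk 16
shared [blk] char dst[blk*THREADS];
shared [] char src[blk];
int main(void)
{
    upc_all_broadcast(dst, src, blk,
                      UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
    return 0;
}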
135. upc_all_scatter (animation)
Thread 0 sends a unique block of data to each thread.

shared [blk] char dst[blk*THREADS];
shared [] char src[blk*THREADS];

[Figure: src resides entirely on thread 0; block i of src is copied to thread i's dst block]
136. upc_all_gather (animation)
Each thread sends a block of data to thread 0.

shared [] char dst[blk*THREADS];
shared [blk] char src[blk*THREADS];

[Figure: thread i's src block lands in block i of dst, all with affinity to thread 0]
137. upc_all_gather_all (animation)
Each thread sends one block of data to all threads.

[Figure: every thread ends up with a copy of all THREADS blocks]
138. upc_all_exchange (animation)
Each thread sends a unique block of data to each thread.

[Figure: block j on thread i moves to block i on thread j - a transpose of blocks]
139. upc_all_permute (animation)
Thread i sends a block of data to thread perm(i).

[Figure: blocks move according to the permutation array]
140. Reduce and prefix_reduce
- One function for each C scalar type, e.g.,
- upc_all_reduceI() returns an Integer
- Operations:
- +, *, &, |, xor, &&, ||, min, max
- user-defined binary function
- non-commutative function option
- (a worked sketch follows below)
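A self-contained sketch using the integer variant (an illustrative addition; the data are trivial). The signature follows the UPC collectives specification:

/* reduce.upc -- summing a blocked shared array with upc_all_reduceI */
#include <upc_relaxed.h>
#include <upc_collective.h>
#include <stdio.h>
#define B 4
shared [B] int src[B*THREADS];
shared int sum;                 /* result has affinity to thread 0 */
int main(void)
{
    int i;
    upc_forall (i = 0; i < B*THREADS; i++; &src[i])
        src[i] = 1;             /* every element is 1 */
    upc_barrier;
    upc_all_reduceI(&sum, src, UPC_ADD, B*THREADS, B,
                    NULL, UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
    if (MYTHREAD == 0)
        printf("sum = %d\n", sum);   /* prints B*THREADS */
    return 0;
}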
141. upc_all_reduceTYPE (animation)
Thread 0 receives UPC_OP applied over src[i], i = 0 .. n-1.

[Figure: src holds the powers of two 1, 2, 4, ..., 2048 spread across four threads; the per-thread partial sums (7, 56, 448, 3584) combine to 4095, delivered to thread 0]
142. upc_all_prefix_reduceTYPE (animation)
Thread k receives UPC_OP applied over src[i], i = 0 .. k.

[Figure: with src again the powers of two, the destination holds the running totals 1, 3, 7, 15, 31, 63, 127, 255, 511, ..., 4095, distributed across the threads]
143. Common collectives properties
- Collective function arguments are single-valued: corresponding arguments must match across all threads
- Data blocks must have identical sizes
- Source data blocks must be in the same array and at the same relative location in each thread
- The same is true for destination data blocks
- Various synchronization modes are provided
144. Synchronization modes
- Sync modes determine the strength of synchronization between threads executing a collective function
- Sync modes are specified by flags for function entry and for function exit:
- UPC_IN_
- UPC_OUT_
- Sync modes have three strengths:
- ALLSYNC
- MYSYNC
- NOSYNC
145. ALLSYNC
- ALLSYNC provides barrier-like synchronization. It is the strongest and most convenient mode.
- upc_all_broadcast(dst, src, nbytes, UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
- No thread will access collective data until all threads have reached the collective call
- No thread will exit the collective call until all threads have completed accesses to collective data
146. NOSYNC
- NOSYNC provides weak synchronization. The programmer is responsible for synchronization.
- Assume there are no data dependencies between the arguments in the following two calls:
- upc_all_broadcast(dst0, src0, nbytes, UPC_IN_ALLSYNC | UPC_OUT_NOSYNC);
- upc_all_broadcast(dst1, src1, mbytes, UPC_IN_NOSYNC | UPC_OUT_ALLSYNC);
- Chaining independent calls by using NOSYNC eliminates the need for synchronization between calls
147. MYSYNC
- Synchronization is provided with respect to data read (UPC_IN) and written (UPC_OUT) by each thread
- MYSYNC provides an intermediate level of synchronization
- Assume thread 0 is the source thread. Each thread needs to synchronize only with thread 0:
- upc_all_broadcast(dst, src, nbytes, UPC_IN_MYSYNC | UPC_OUT_MYSYNC);
148. MYSYNC example (animation)
- Each thread synchronizes with thread 0
- Threads 1 and 2 exit as soon as they receive the data
- It is not likely that thread 2 needs to read thread 1's data

[Figure: a broadcast from thread 0; each receiver departs independently once its data arrives]
149. ALLSYNC vs. MYSYNC performance
[Figure: upc_all_broadcast() timings on a Linux/Myrinet cluster, 8 nodes]
150. Sync mode summary
- ALLSYNC is the most expensive because it provides barrier-like synchronization
- NOSYNC is the most dangerous, but it is almost free
- MYSYNC provides synchronization only between threads which need it. It is likely to be strong enough for most programmers' needs, and it is more efficient.
151. Collectives performance
- UPC-level implementations can be improved
- Algorithmic approaches
- tree-based algorithms
- message combining (cf. chained broadcasts)
- Platform-specific approaches
- RDMA put and get (e.g., Myrinet and Quadrics)
- broadcast and barrier primitives may be a benefit
- buffer management
- static: permanent but of fixed size
- dynamic: expensive if allocated for each use
- pinned: defined RDMA memory area, best solution
152. Push and pull animations