Title: CS 267 Applications of Parallel Computers Lecture 9: Split-C
1 CS 267 Applications of Parallel Computers, Lecture 9: Split-C
- James Demmel
- http://www.cs.berkeley.edu/~demmel/cs267_Spr99
2 Comparison of Programming Models
- Data Parallel (HPF)
  - Good for regular applications; compiler controls performance
- Message Passing SPMD (MPI)
  - Standard and portable
  - Needs low-level programmer control; no global data structures
- Shared Memory with Dynamic Threads
  - Shared data is easy, but locality cannot be ignored
  - Virtual processor model adds overhead
- Shared Address Space SPMD
  - Single thread per processor
  - Address space is partitioned, but shared
  - Encourages shared data structures matched to the architecture
  - Titanium targets (adaptive) grid computations
  - Split-C is a simple parallel extension to C
- F77 + Heroic Compiler
  - Depends on the compiler to discover parallelism
  - Hard to do except for fine-grain parallelism, usually in loops
3 Overview
- Split-C
  - Systems programming language based on C
  - Creating parallelism: SPMD
  - Communication: global pointers and spread arrays
  - Memory consistency model
  - Synchronization
  - Optimization opportunities
4 Split-C: Systems Programming
- Widely used parallel extension to C
- Supported on most large-scale parallel machines
- Tunable performance
- Consistent with C
5 Split-C Overview
Figure: a global address space spanning processors P0..P3; each processor's memory contains globally-addressable local memory, and global pointers (e.g., g_P) can reach globally-addressable remote memory on other processors.
- Adds two new levels to the memory hierarchy
  - Local in the global address space
  - Remote in the global address space
- Model is a collection of processors plus a global address space
- SPMD model: the same program runs on each node
6 SPMD Control Model
- PROCS threads of control
  - independent
  - explicit synchronization
- Synchronization
  - global barrier
  - locks
Figure: the PEs run independently and meet at a global barrier().
7 C Pointers
- &x read as "pointer to x"
- Types read right to left
  - int * read as "pointer to int"
  - *P read as "value at P"
- /* assign the value 6 to x */
- int x;
- int *P = &x;
- *P = 6;
8 Global Pointers
A global pointer may refer to an object anywhere in the machine. Each object (C structure) lives on one processor. Global pointers can be dereferenced, incremented, and indexed just like local pointers.
- int *global gp1;           /* global ptr to an int */
- typedef int *global g_ptr;
- g_ptr gp2;                 /* same */
- typedef double foo;
- foo *global *global gp3;   /* global ptr to a global ptr to a foo */
- int *global *gp4;          /* local ptr to a global ptr to an int */
9 Memory Model
Figure: processor 0 writes through a global pointer g_P into x on processor 2.
- on_one {
-   double *global g_P = toglobal(2, &x);
-   *g_P = 6;
- }
10 C Arrays
- Set 4 values to 0, 2, 4, 6
- Origin is 0
- for (i = 0; i < 4; i++)
-   A[i] = i*2;
- Pointers and arrays:
- A[i] == *(A+i)
11 Spread Arrays
- Spread arrays are spread over the entire machine
  - the spreader "::" determines which dimensions are spread
  - dimensions to the right of :: define the objects on individual processors
  - dimensions to the left are linearized and spread in a cyclic map
- Example:
- double A[n][r]::[b][b];
- Per-processor blocks; the high dimensions are spread
- A[i][j] => A + i*r + j, in units of sizeof(double)*b*b
The traditional C duality between arrays and pointers is preserved through spread pointers.
12 Spread Pointers
Global pointers, but with index arithmetic across processors (cyclic): a 1-dimensional address space, i.e., wrap and increment. The processor component varies fastest.
- No communication
- double A[PROCS]::;
- for_my_1d (i, PROCS) A[i] = i*2;
13 Blocked Matrix Multiply
- void all_mat_mult_blk(int n, int r, int m, int b,
-     double C[n][m]::[b][b],
-     double A[n][r]::[b][b],
-     double B[r][m]::[b][b])
- {
-   int i, j, k, l;
-   double la[b][b], lb[b][b];
-   for_my_2D(i, j, l, n, m) {
-     double (*lc)[b] = tolocal(C[i][j]);
-     for (k = 0; k < r; k++) {
-       bulk_read(la, A[i][k], b*b*sizeof(double));
-       bulk_read(lb, B[k][j], b*b*sizeof(double));
-       matrix_mult(b, b, b, lc, la, lb);
-     }
-   }
-   barrier();
- }
- Configuration-independent use of spread arrays
- Local copies of subblocks
- Highly optimized local routine
Blocking improves performance because the number of remote accesses is reduced.
14 An Irregular Problem: EM3D
Maxwell's equations on an unstructured 3D mesh: an irregular bipartite graph of varying degree (about 20) with weighted edges.
Figure: E nodes and H nodes connected by weighted edges (values v1, v2 with weights w1, w2).
The basic operation is to subtract the weighted sum of neighboring values, for all E nodes, then for all H nodes.
15 EM3D Uniprocessor Version
- typedef struct node_t {
-   double value;
-   int edge_count;
-   double *coeffs;
-   double **values;
-   struct node_t *next;
- } node_t;
- void all_compute_E()
- {
-   node_t *n;
-   int i;
-   for (n = e_nodes; n; n = n->next) {
-     for (i = 0; i < n->edge_count; i++)
-       n->value = n->value - *(n->values[i]) * (n->coeffs[i]);
-   }
- }
Figure: an E node's coeffs and values arrays point at the values of its neighboring H nodes.
How would you optimize this for a uniprocessor? Minimize cache misses by organizing the list such that neighboring nodes are visited in order.
16 EM3D Simple Parallel Version
Each processor has a list of local nodes.
- typedef struct node_t {
-   double value;
-   int edge_count;
-   double *coeffs;
-   double *global *values;
-   struct node_t *next;
- } node_t;
- void all_compute_e()
- {
-   node_t *n;
-   int i;
-   for (n = e_nodes; n; n = n->next) {
-     for (i = 0; i < n->edge_count; i++)
-       n->value = n->value - *(n->values[i]) * (n->coeffs[i]);
-   }
-   barrier();
- }
How do you optimize this? Minimize remote edges; balance load across processors: C(p) = a*Nodes + b*Edges + c*Remotes.
17 EM3D: Eliminate Redundant Remote Accesses
- void all_compute_e()
- {
-   ghost_node_t *g;
-   node_t *n;
-   int i;
-   for (g = h_ghost_nodes; g; g = g->next)
-     g->value = *(g->rval);
-   for (n = e_nodes; n; n = n->next) {
-     for (i = 0; i < n->edge_count; i++)
-       n->value = n->value - *(n->values[i]) * (n->coeffs[i]);
-   }
-   barrier();
- }
18 EM3D: Overlap Global Reads with GET
- void all_compute_e()
- {
-   ghost_node_t *g;
-   node_t *n;
-   int i;
-   for (g = h_ghost_nodes; g; g = g->next)
-     g->value := *(g->rval);   /* split-phase get */
-   sync();                     /* wait for all outstanding gets */
-   for (n = e_nodes; n; n = n->next) {
-     for (i = 0; i < n->edge_count; i++)
-       n->value = n->value - *(n->values[i]) * (n->coeffs[i]);
-   }
-   barrier();
- }
19 Split-C: Systems Programming
- Tuning affects application performance
Figure: performance chart (µsec per edge).
20 Global Operations and Shared Memory
- int all_bcast(int val)
- {
-   int left = 2*MYPROC + 1;
-   int right = 2*MYPROC + 2;
-   if (MYPROC > 0) {
-     while (spread_lock[MYPROC] == 0) ;
-     spread_lock[MYPROC] = 0;
-     val = spread_buf[MYPROC];
-   }
-   if (left < PROCS) {
-     spread_buf[left] = val;
-     spread_lock[left] = 1;
-   }
-   if (right < PROCS) {
-     spread_buf[right] = val;
-     spread_lock[right] = 1;
-   }
-   return val;
- }
Requires sequential consistency: the write of spread_buf must be visible before the write of spread_lock.
21 Global Operations and Signaling Store
- int all_bcast(int val)
- {
-   int left = 2*MYPROC + 1;
-   int right = 2*MYPROC + 2;
-   if (MYPROC > 0) {
-     store_sync(4);   /* wait until 4 bytes (the int) have been stored here */
-     val = spread_buf[MYPROC];
-   }
-   if (left < PROCS)
-     spread_buf[left] :- val;
-   if (right < PROCS)
-     spread_buf[right] :- val;
-   return val;
- }
The signaling store :- replaces the separate data and lock writes: arrival of the data itself is the signal.
22 Signaling Store and Global Communication
- void all_block_to_cyclic(int m,
-     double B[PROCS*m]::,
-     double A[PROCS]::[m])
- {
-   int i;
-   double *a = &A[MYPROC][0];   /* this processor's local block */
-   for (i = 0; i < m; i++)
-     B[m*MYPROC + i] :- a[i];
-   all_store_sync();
- }
Figure: each PE stores its local block into the cyclic layout of B.
23 Split-C Summary
- Performance tuning capabilities of message passing
- Support for shared data structures
- Installed on NOW and available on most platforms
  - http://www.cs.berkeley.edu/projects/split-c
- Consistent with C design
  - arrays are simply blocks of memory
  - no linguistic support for data abstraction
  - interfaces difficult for complex data structures
  - explicit memory management