Title: PL/B: A Programming Language for the Blues

1. PL/B: A Programming Language for the Blues
- George Almasi, Luiz DeRose, Jose Moreira, and David Padua
2. Objective of this Project
- To develop a programming system for the Blue Gene series (and distributed-memory machines in general) that is
  - Convenient to use for program development and maintenance. Resulting parallel programs should be very readable and easy to modify.
  - Not too difficult to implement, so that the system could be completed in a reasonable time.
- Main focus is numerical computing.
3. Most Popular Parallel Programming Models Today
- SPMD
  - Library-based
    - MPI
    - BSP
  - Compiler-based
    - Co-Array Fortran
    - UPC
  - Explicit code partitioning and communication.
  - Simple compiler transformations.
- Single thread of execution, loop/array-based
  - HPF
  - ZPL
  - Implicit code partitioning and communication.
  - Sophisticated compiler transformations.
- OpenMP
4. Design Principles (1 of 2)
- Wanted to avoid the SPMD programming model.
  - In its more general form it can lead to 4D spaghetti code. More structure is needed.
  - The most popular of the SPMD implementations, MPI, is cumbersome to use, in part due to the lack of compiler support. It has been called the assembly language of parallel computing.
  - It is difficult to reason about SPMD programs. When we reason about them, we rely on an (implicit) global view of communication and computation.
- Wanted to avoid the compiler complexity and limitations that led to the failure of HPF.
5. Design Principles (2 of 2)
- OpenMP is a good model, but Blue Gene is not a shared-memory machine. OpenMP could be implemented on top of TreadMarks, but efficient implementations would require untested, sophisticated compilers.
- We propose a few language extensions and a programming model based on a single thread of execution. We would like the extensions to require only simple transformations similar to those used by OpenMP compilers.
- We would like the extensions to be of a general nature so that they can be applied to any conventional language.
6. MATLAB Implementation
- We will present next our solution in the context of MATLAB.
- There are at least two reasons why developing a parallel extension to MATLAB is a good idea.
  - MATLAB is an excellent language for prototyping conventional algorithms. There is nothing equivalent for parallel algorithms.
    - Programmers of parallel machines should appreciate the MATLAB environment as much as conventional programmers.
    - In fact, during a site visit to TJ Watson, several government representatives requested that a parallel MATLAB be included in the PERCS project.
  - It is relatively simple to develop a prototype of the proposed extensions in MATLAB.
7. PL/B Data Types and Statements
- The constructs of PL/B are
  - A new type of object: hierarchically tiled arrays.
  - Array (and hierarchical array) operations and assignment statements similar to those found in MATLAB, Fortran 90, and APL.
- Hierarchically tiled arrays (HTAs) are used
  - to specify data distribution in parallel programs, and
  - to facilitate the writing of blocked sequential algorithms (for locality).
8. Example 1: Matrix Multiplication (1 of 4)
1. Tiled Matrix Multiplication in a Conventional Language

  for I=1:q:n
    for J=1:q:n
      for K=1:q:n
        for i=I:I+q-1
          for j=J:J+q-1
            for k=K:K+q-1
              c(i,j) = c(i,j) + a(i,k)*b(k,j);
            end
          end
        end
      end
    end
  end
9. Example 1: Matrix Multiplication (2 of 4)
2. Tiled Matrix Multiplication Using HTAs
- Here c{i,j}, a{i,k}, b{k,j}, and T represent submatrices.
- The operator * represents matrix multiplication in MATLAB.

  for i=1:m
    for j=1:m
      T = 0;
      for k=1:m
        T = T + a{i,k} * b{k,j};
      end
      c{i,j} = T;
    end
  end
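As a sanity check, the loop nest above can be emulated in plain MATLAB, with cell arrays standing in for the HTA tiles; this is a sketch with assumed tile count m and tile size p, not the proposed PL/B syntax:

  m = 3; p = 4;
  Af = rand(m*p); Bf = rand(m*p);
  a = mat2cell(Af, repmat(p,1,m), repmat(p,1,m));   % tiled views of the inputs
  b = mat2cell(Bf, repmat(p,1,m), repmat(p,1,m));
  c = cell(m,m);
  for i = 1:m
    for j = 1:m
      T = zeros(p);
      for k = 1:m
        T = T + a{i,k} * b{k,j};
      end
      c{i,j} = T;
    end
  end
  norm(cell2mat(c) - Af*Bf)    % ~0: tiled and flat products agree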
10. Example 1: Matrix Multiplication (3 of 4)
3. Parallel Matrix Multiplication Using HTAs

  for j=1:m
    for k=1:m
      c{:,j} = c{:,j} + a{:,k} * b{k,j};
    end
  end

- Use the middle product method to do parallel matrix multiply.
- Assume a two-dimensional processor mesh.
- Assume {i,j} submatrices are stored in processor (i,j).
- All communications are explicit in separate array assignment statements.
- Parallel computations are also explicit in separate array assignment statements and involve no communication.

  % block {i,j} is in processor (i,j)
  for j=1:m
    for k=1:m
      r{:,j} = a{:,k};                      % copy kth column of a
      s{:,j} = b{k,j};                      % broadcast b{k,j}
      c{:,j} = c{:,j} + r{:,j} * s{:,j};    % compute w/o communication
    end
  end
11. Example 1: Matrix Multiplication (4 of 4)
4. A Second Version of Parallel Matrix Multiplication

  for i=1:m
    for j=1:m
      c{i,j} = dotproduct(a,b,i,j);
    end
  end

  function x = dotproduct(a,b,i,j)
    t{i,:} = b{:,j};                         % communication
    t{i,:} = t{i,:} * a{i,:};                % computation
    s{i,:} = t{i,:};
    for i=1:ceil(log2(size(a)))
      s{i,:} = eoshift(t{i,:}, 2^i, eye);
      t{i,:} = s{i,:} + t{i,:};
    end
    x = t{i,1};
  end

- Use now the inner product method.
- Collective operations that involve both computations and communications can be represented using intrinsic functions or user functions.
12. HTAs
- HTAs are n-dimensional arrays that have been tiled (tiles are d-dimensional, d ≤ n).
- The tiles can be recursively tiled, with the tiles at each level having the same shape except for boundary cases.
- The topmost levels can be distributed across modules of a parallel system.
13. Examples of HTA
[Figure: two examples relating flat and tiled references. The array has two top-level tiles, each a 4 x 3 arrangement of 1 x 3 leaf tiles, so A(1:4,1:18) corresponds to A{1:2}{1:4,1:3}(1:3); the second example, A(1:4,1:17), does not align with the tile boundaries and is shown split across the two top-level tiles.]
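To make the structure in the figure concrete, here is a plain MATLAB sketch that emulates the two-level tiling with nested cell arrays (an emulation for illustration only, not the proposed PL/B syntax):

  A = cell(1,2);                      % top level: 1 x 2 tiles
  for I = 1:2
    A{I} = cell(4,3);                 % each top-level tile: 4 x 3 inner tiles
    for i = 1:4
      for j = 1:3
        A{I}{i,j} = zeros(1,3);       % each inner tile: a 1 x 3 vector
      end
    end
  end
  A{2}{4,3}(3) = 7;                   % addresses the bottom right corner element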
14. Flattening
- Elements of an HTA are referenced using a tile index for each level in the hierarchy followed by an array index. Each tile index tuple is enclosed within {}s and the array index is enclosed within parentheses.
  - In the matrix multiplication code, c{i,j}(3,4) would represent element (3,4) of submatrix {i,j}.
  - In the first example of the previous page, A{2}{4,3}(3) represents the element in the bottom right corner.
- Alternatively, the tiled array can be accessed as a flat array.
  - In the matrix multiplication example, assuming that each tile is p x p, c(r,s) would represent element (3,4) of submatrix {i,j} if r = (i-1)p + 3 and s = (j-1)p + 4.
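The relation between the two views can be checked in plain MATLAB with a cell-array emulation of the tiles (a sketch with assumed sizes m and p):

  m = 4; p = 5;                                             % assumed sizes
  cflat  = rand(m*p, m*p);                                  % flat view
  ctiled = mat2cell(cflat, repmat(p,1,m), repmat(p,1,m));   % tiled view
  i = 2; j = 3;
  r = (i-1)*p + 3;  s = (j-1)*p + 4;
  isequal(ctiled{i,j}(3,4), cflat(r,s))                     % returns true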
15. Two Ways of Referencing the Elements of an 8 x 8 Array
16. Parallel Communication and Computation in PL/B (1 of 2)
- PL/B programs are single-threaded and contain array operations on HTAs. The sequential part of the program
- Like in HPF and MPI, processors can be arranged in different forms. In HPF and MPI, processors are arranged into meshes; in PL/B we also consider (virtual) nodes, so that processors can be arranged into hierarchical meshes.
- The top levels of HTAs can be distributed onto a subset of nodes.
17. Parallel Communication and Computation in PL/B (2 of 2)
- Array operations on HTAs can represent communication or computation.
  - Assignment statements where all HTA indices are identical are computations executed in the home of each of the HTA elements involved.
  - Assignment statements where this is not the case represent communication operations (see the sketch below).
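A plain MATLAB cell-array sketch of the distinction, with cells standing in for tiles (sizes are assumed):

  a = {rand(3), rand(3); rand(3), rand(3)};
  b = {rand(3), rand(3); rand(3), rand(3)};
  c = cell(2,2);  r = cell(2,2);
  % Computation: identical tile indices on both sides, so each result tile
  % can be produced in the home of its operands with no data movement.
  for i = 1:2
    for j = 1:2
      c{i,j} = a{i,j} + b{i,j};
    end
  end
  % Communication: tile indices differ, so tiles move between homes
  % (cf. the statement r{:,j} = a{:,k} in Example 1).
  k = 1;
  for j = 1:2
    for i = 1:2
      r{i,j} = a{i,k};
    end
  end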
18. HTA Distribution Examples (1 of 4)
- Consider a 3x6 processor arrangement.
- An HTA with shape {1:3,1:6}(1:10) will be distributed on the processor mesh by assigning each 10-element vector to a different processor.
- An HTA with shape {1:6,1:18}(1:10) will be distributed cyclically on both dimensions (see the sketch below).
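A sketch of the cyclic tile-to-processor mapping assumed here; the exact formula is an assumption for illustration, since the slides only state that the distribution is cyclic:

  P = 3; Q = 6;                                   % 3x6 processor mesh
  owner = @(i,j) [mod(i-1,P)+1, mod(j-1,Q)+1];    % assumed owner of tile {i,j}
  owner(4,7)                                      % tile {4,7} -> processor (1,1)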
19. HTA Distribution Examples (2 of 4)
- Consider the two-level processor arrangement shown on the left. It is a 3x2 arrangement of 2x2 meshes of processors.
- An HTA with shape {1:3,1:2}{1:2,1:2}(1:10,1:10) could be distributed by assigning each 10x10 array to a different processor.
- Also, the processor arrangement shown on the left could be flattened, so that an HTA with shape {1:6,1:4}(1:10,1:10) could be distributed by assigning each 10x10 array to a different processor.
20. HTA Distribution Examples (3 of 4)
- An HTA b with shape {1:3,1:2}{1:2,1:20}(1:5) would be distributed cyclically on the arrangement on the left. The pair of processors {i,j}{k,1:2} would host the following vectors:

  b{i,j}{k,1}(1:5)  b{i,j}{k,3}(1:5)  ...  b{i,j}{k,19}(1:5)
  b{i,j}{k,2}(1:5)  b{i,j}{k,4}(1:5)  ...  b{i,j}{k,20}(1:5)

- An HTA b with shape {1:6,1:40}(1:5) would be distributed cyclically on the flattened arrangement, so that processors {i,1:4} would host the following vectors:

  b{i,1}(1:5)  b{i,5}(1:5)  ...  b{i,17}(1:5)
  b{i,2}(1:5)  b{i,6}(1:5)  ...  b{i,18}(1:5)
  b{i,3}(1:5)  b{i,7}(1:5)  ...  b{i,19}(1:5)
  b{i,4}(1:5)  b{i,8}(1:5)  ...  b{i,20}(1:5)
21. Advantages of the PL/B Programming Model
- We believe that by relying on a single thread, our programming model facilitates the transition from sequential to parallel form.
- In fact, a conventional program can be incrementally modified by converting one data structure at a time into distributed cell array form (cf. OpenMP).
22. Additional Examples
- We now present several additional examples of PL/B.
- Except where indicated, we assume that
  - each HTA has a single level of tiling,
  - all HTAs have identical sizes and shapes, and
  - the HTAs are identically distributed on a processor mesh that matches all HTAs in size and shape.
- Our objective in these examples is to show the expressiveness of the language for both dense and sparse computations.
23. Example 2: Cannon's Algorithm (1 of 3)
24. Example 2: Cannon's Algorithm (2 of 3)
Forall version:

  c{1:n,1:n} = zeros(p,p);               % communication
  forall i=1:n
    a{i,:} = cshift(a{i,:}, i-1);        % communication
  end
  forall i=1:n
    b{:,i} = cshift(b{:,i}, i-1);        % communication
  end
  for k=1:n
    forall i=1:n, j=1:n
      c{i,j} = c{i,j} + a{i,j} * b{i,j}; % computation
    end
    forall i=1:n
      a{i,:} = cshift(a{i,:}, 1);        % communication
    end
    forall i=1:n
      b{:,i} = cshift(b{:,i}, 1);        % communication
    end
  end
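A plain MATLAB emulation of the forall version above, using cell arrays for the tiles and circshift in place of cshift (an assumed correspondence, with assumed sizes), checking the result against the flat product:

  n = 3; p = 4;
  A = mat2cell(rand(n*p), repmat(p,1,n), repmat(p,1,n));
  B = mat2cell(rand(n*p), repmat(p,1,n), repmat(p,1,n));
  a = A; b = B;
  c = repmat({zeros(p)}, n, n);
  for i = 1:n, a(i,:) = circshift(a(i,:), [0, -(i-1)]); end   % skew rows of a
  for i = 1:n, b(:,i) = circshift(b(:,i), [-(i-1), 0]); end   % skew columns of b
  for k = 1:n
    for i = 1:n
      for j = 1:n
        c{i,j} = c{i,j} + a{i,j} * b{i,j};                    % local multiply-add
      end
    end
    for i = 1:n, a(i,:) = circshift(a(i,:), [0, -1]); end     % shift a left
    for i = 1:n, b(:,i) = circshift(b(:,i), [-1, 0]); end     % shift b up
  end
  norm(cell2mat(c) - cell2mat(A)*cell2mat(B))                 % ~0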
25. Example 2: Cannon's Algorithm (3 of 3)
Triplet version:

  c{1:n,1:n} = zeros(p,p);                          % communication
  for i=2:n
    a{i:n,:} = cshift(a{i:n,:}, dim=2, shift=1);    % communication
    b{:,i:n} = cshift(b{:,i:n}, dim=1, shift=1);    % communication
  end
  for k=1:n
    c{:,:} = c{:,:} + a{:,:} * b{:,:};              % computation
    a{:,:} = cshift(a{:,:}, dim=2, shift=1);        % communication
    b{:,:} = cshift(b{:,:}, dim=1, shift=1);        % communication
  end
26. Example 3: The SUMMA Algorithm (1 of 6)
Use now the outer-product method (n^2 parallelism). Interchanging the loop headers of the loop in Example 1 produces:

  for k=1:n
    for i=1:n
      for j=1:n
        C{i,j} = C{i,j} + A{i,k} * B{k,j};
      end
    end
  end

To obtain n^2 parallelism, the inner two loops should take the form of a block operation:

  for k=1:n
    C{:,:} = C{:,:} + A{:,k} ⊗ B{k,:};
  end

where the operator ⊗ represents the outer product operation.
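A quick plain MATLAB check that accumulating outer products reproduces the matrix product (flat arrays, small assumed size):

  n = 5;
  A = rand(n); B = rand(n);
  C = zeros(n);
  for k = 1:n
    C = C + A(:,k) * B(k,:);   % rank-1 (outer product) update
  end
  norm(C - A*B)                % ~0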
27. Example 3: The SUMMA Algorithm (2 of 6)
[Figure: switch orientation. By using a column of A and a row of B broadcast to all, compute the next terms of the dot products: the column (a11, a21) and the row (b11, b12) produce the partial products a11*b11, a11*b12, a21*b11, a21*b12.]
28. Example 3: The SUMMA Algorithm (3 of 6)

  c{1:n,1:n} = zeros(p,p);                          % communication
  for i=1:n
    t1{:,:} = spread(a{:,i}, dim=2, ncopies=N);     % communication
    t2{:,:} = spread(b{i,:}, dim=1, ncopies=N);     % communication
    c{:,:} = c{:,:} + t1{:,:} * t2{:,:};            % computation
  end
29. Example 3: The SUMMA Algorithm (4 of 6)
30. Example 3: The SUMMA Algorithm (5 of 6)
31. Example 3: The SUMMA Algorithm (6 of 6)
32. Example 4: Jacobi Relaxation

  while dif > epsilon
    v{2:n,:}(0,:)     = v{1:n-1,:}(p,:);     % communication
    v{1:n-1,:}(p+1,:) = v{2:n,:}(1,:);       % communication
    v{:,2:n}(:,0)     = v{:,1:n-1}(:,p);     % communication
    v{:,1:n-1}(:,p+1) = v{:,2:n}(:,1);       % communication
    u{:,:}(1:p,1:p) = a * (v{:,:}(1:p,0:p-1) + v{:,:}(0:p-1,1:p) + ...
                           v{:,:}(1:p,2:p+1) + v{:,:}(2:p+1,1:p));   % computation
    dif = max(max(abs(v - u)));
    v = u;
  end
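For reference, a minimal untiled MATLAB sketch of the same four-point relaxation; the sizes, tolerance, and boundary handling are assumptions for illustration:

  n = 32; a = 0.25; epsilon = 1e-4;
  v = rand(n);  v([1 n],:) = 0;  v(:,[1 n]) = 0;     % fixed boundary
  dif = inf;
  while dif > epsilon
    u = v;
    u(2:n-1,2:n-1) = a * (v(1:n-2,2:n-1) + v(3:n,2:n-1) + ...
                          v(2:n-1,1:n-2) + v(2:n-1,3:n));
    dif = max(max(abs(v - u)));
    v = u;
  end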
33. Example 5: Sparse MVM / Copy Vector (1 of 2)
[Figure: sparse matrix A distributed by blocks of rows across processors P1-P4, together with vector b.]
34. Example 5: Sparse MVM / Copy Vector (2 of 2)

  % Distribute a
  c{1:n} = a(DIST(1:n):DIST(2:n+1)-1, :);
  % Broadcast vector b
  v{1:n} = b;
  % Local multiply
  t = c * v;
  % Get result
  forall i=1:N
    v{i} = t(:);    % flattened t
  end

Important observation: in MATLAB, sparse computations can be represented as dense computations; the interpreter only performs the necessary operations.
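A plain MATLAB illustration of the observation (size and density are assumed): the same expression works whether the operands are dense or sparse, and sparse storage ensures only the necessary operations are performed.

  n = 1000;
  A = sprand(n, n, 0.01);    % sparse random matrix, about 1% nonzeros
  b = rand(n, 1);
  t = A * b;                 % written as if dense, executed sparsely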
35. Example 6: Sparse MVM / Distribute Vector (1 of 7)
[Figure: sparse matrix A and vector b both distributed by blocks of rows across processors P1-P4.]
36. Example 6: Sparse MVM / Distribute Vector (2 of 7)
Step 1: Compute the set of indices needed in processor i from the chunk of the vector stored in processor j.
37. Example 6: Sparse MVM / Distribute Vector (3 of 7)
Step 2: Perform collective communication by transposing the matrix of sets of indices.
38. Example 6: Sparse MVM / Distribute Vector (4 of 7)
Step 3: In each processor, collect the vector elements needed by the other processors.
39. Example 6: Sparse MVM / Distribute Vector (5 of 7)
Step 4: Transpose again to communicate the vector elements to the processors where they are needed.
40. Example 6: Sparse MVM / Distribute Vector (6 of 7)

  c{1:n} = a(DIST(1:n):DIST(2:n+1)-1, :);
  v{1:n}(DIST(1:n):DIST(2:n+1)-1, 1) = b(DIST(1:n):DIST(2:n+1)-1);

  % Step 1: P{I}{J} contains the indices of the elements of vector v from
  % processor J needed by processor I. The first component of P is
  % distributed and the second is a regular cell array.
  forall I=1:N
    P{I} = fun(c{I}, DIST);
  end
41. Example 6: Sparse MVM / Distribute Vector (7 of 7)

  % Step 2: R{I}{J} contains the indices of the elements of vector v from
  % processor I needed by processor J.
  R = transpose(P);

  % Step 3: Q{I}{J} contains all elements of vector v from processor I
  % needed by processor J.
  forall I=1:N, k=1:N, (k ~= I)
    Q{I}{k} = v{I}(R{I}{k}(:));
  end

  % Step 4: R{I}{J} contains all elements of vector v from processor J
  % needed by processor I.
  R = transpose(Q);
  forall I=1:N
    v{I}(P{I}(:)) = R{I}(:);
  end

  v = c * v;
42. Example 7: SuperLU_DIST
Step 1: Apply LU decomposition to the corner block C{K,K}.
43. Example 7: SuperLU_DIST
Step 2: Factorize the block column: L{:,K} = C{:,K} / U{K,K}.
44. Example 7: SuperLU_DIST
Step 3: Factorize the block row U{K,:}.
45. Example 7: SuperLU_DIST
Step 4: Update the trailing submatrix: C{I,J} = C{I,J} - L{I,K} * U{K,J}.
46. Example 7: SuperLU_DIST

  for K=1:N
    [L{K,K}, U{K,K}] = lu(C{K,K});                      % Step 1
    L{K+1:N,K} = C{K+1:N,K} / U{K,K};                   % Step 2
    U{K,K+1:N} = L{K,K} \ C{K,K+1:N};                   % Step 3
    forall I=K+1:N
      L{I,K+1:N} = L{I,K};
      U{K+1:N,I} = U{K,I};
    end
    C{K+1:N,K+1:N} = C{K+1:N,K+1:N} - L{K+1:N,K+1:N} * U{K+1:N,K+1:N};   % Step 4
  end
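A plain MATLAB sketch of the same right-looking block LU, written directly from the Step 1-4 descriptions above; the sizes and the well-conditioned test matrix are assumptions, and MATLAB's two-output lu folds row pivoting into its first output, which does not affect the reconstruction check:

  N = 3; p = 4; n = N*p;
  Cfull = rand(n) + n*eye(n);                            % assumed test matrix
  C = mat2cell(Cfull, repmat(p,1,N), repmat(p,1,N));
  L = repmat({zeros(p)}, N, N);
  U = repmat({zeros(p)}, N, N);
  for K = 1:N
    [L{K,K}, U{K,K}] = lu(C{K,K});                       % Step 1
    for I = K+1:N, L{I,K} = C{I,K} / U{K,K}; end         % Step 2
    for J = K+1:N, U{K,J} = L{K,K} \ C{K,J}; end         % Step 3
    for I = K+1:N
      for J = K+1:N
        C{I,J} = C{I,J} - L{I,K} * U{K,J};               % Step 4
      end
    end
  end
  norm(cell2mat(L) * cell2mat(U) - Cfull)                % ~0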
47. Example 7: SuperLU_DIST
Avoiding unnecessary communication.
48. Example 7: SuperLU_DIST
[Figure: "Transpose Send" and "Spread" patterns of true/false flags marking which blocks actually need to be communicated.]
49. Example 7: SuperLU_DIST

  for K=1:N
    [L{K,K}, U{K,K}] = lu(C{K,K});                      % Step 1
    T{K:N,K} = (C{K:N,K} ~= 0);
    S{K:N,K}(K:N) = spread(T{K:N,K}, :);
    R{K,K:N} = transpose(S{K+1:N,K}(K:N));
    L{K+1:N,K} = C{K+1:N,K} / U{K,K};                   % Step 2
    U{K,K+1:N} = L{K,K} \ C{K,K+1:N};                   % Step 3
    forall I=K+1:N
      L{I,K+1:N} = L{I,K};
    end
    forall I=K+1:N
      U{R{K,I},I} = U{K,I};
    end
    C{K+1:N,K+1:N} = C{K+1:N,K+1:N} - L{K+1:N,K+1:N} * U{K+1:N,K+1:N};   % Step 4
  end