Title: MATLAB HPCS Extensions
1. MATLAB HPCS Extensions
- Presented by David Padua
- University of Illinois at Urbana-Champaign
2. Senior Team Members
- Gheorghe Almasi - IBM Research
- Calin Cascaval - IBM Research
- Basilio Fraguela - University of Illinois
- Jose Moreira - IBM Research
- David Padua - University of Illinois
3. Objectives
- To develop MATLAB extensions for accessing, prototyping, and implementing scalable parallel algorithms.
- As a result, to give programmers of high-end machines access to all the powerful features of MATLAB:
  - Array operations / kernels.
  - Interactive interface.
  - Rendering.
4. Goals of the Design
- Minimal extension.
- A natural extension to MATLAB that is easy to use.
- Extensions for direct control of parallelism and communication, on top of the ability to access parallel library routines. It does not seem possible to encapsulate all the important parallelism in library routines.
- Extensions that provide the necessary information and can be automatically and effectively analyzed for compilation and translation.
5. Uses of the Language Extension
- Parallel library user.
- Parallel library developer.
- Input to a type-inference compiler.
- No existing MATLAB extension had the necessary characteristics.
6. Approach
- In our approach, the programmer interacts with a copy of MATLAB running on a workstation.
- The workstation controls parallel computation on servers.
7. Approach (Cont.)
- All conventional MATLAB operations are executed on the workstation.
- The parallel server operates on a new class of MATLAB objects: hierarchically tiled arrays (HTAs).
- HTAs are implemented as a MATLAB toolbox. This enables implementation as a language extension and simplifies porting to future versions of MATLAB.
8. Hierarchically Tiled Arrays
- Array tiles are a powerful mechanism to enhance locality in sequential computations and to represent data distribution across parallel systems.
- Our main extension to MATLAB is arrays that are hierarchically tiled.
- Several levels of tiling are useful to distribute data across parallel machines with a hierarchical organization and to simultaneously represent both data distribution and memory layout.
- For example, a two-level hierarchy of tiles can be used to represent
  - the data distribution on a parallel system, and
  - the memory layout within each component (see the sketch below).
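A minimal sketch of such a two-level HTA, assuming the hta constructor used later in these slides (slide 25) and that a tile of an HTA may itself be an HTA, as the definition on slide 11 allows; the sizes and meshes are illustrative only:

A = rand(64);                                % 64 x 64 dense matrix
H = hta(A, {1:32:64, 1:32:64}, [2 2]);       % outer tiling: distribution on a 2 x 2 mesh
for i = 1:2
  for j = 1:2
    H{i,j} = hta(H{i,j}, {1:8:32, 1:8:32});  % inner tiling: memory layout within each tile
  end
end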
9. Hierarchically Tiled Arrays (Cont.)
- Computation and communication are represented as array operations on HTAs.
- Using array operations for communication and computation raises the level of abstraction and, at the same time, facilitates optimization (see the sketch below).
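For instance, a minimal sketch (the tile indices and variable names are assumptions, not taken from the slides) of how ordinary-looking array statements express both kinds of work:

c = a + b;        % computation: corresponding tiles are added in parallel on their servers
a{2,:} = a{1,:};  % communication: the tiles of row 1 are copied to the servers owning row 2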
10. Semantics
11. HTA Definition
An array partitioned into tiles in such a way that adjacent tiles have the same size along the dimension of adjacency, each tile being either an unpartitioned array or an HTA (see the sketch below).
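A small sketch of the definition, using the constructor syntax of slide 25 with illustrative partition points. Rows are split as 2+4 and columns as 3+3, so every pair of adjacent tiles matches in size along the dimension of adjacency and the result is a valid (irregular) HTA:

M = rand(6);
H = hta(M, {[1 3], [1 4]}, [2 2]);   % row tiles start at rows 1 and 3, column tiles at columns 1 and 4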
12. More HTAs
13. Example of a non-HTA
14. Addressing the HTAs
- We want to be able to address any level of an HTA A.
- A.getTile(i,j): get tile (i,j) of the HTA.
- A.getElem(a,b): get element (a,b) of the HTA, disregarding the tiles (flattening).
- A.getTile(i,j).getElem(c,d): get element (c,d) within tile (i,j).
- Triplets are supported in the indexes (see the sketch below).
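An illustrative use of the accessors named above (the variable names and sizes are assumed, not from the slides):

t = A.getTile(1,2);               % tile (1,2) of A
x = A.getElem(10,20);             % element (10,20) of the flattened A
y = A.getTile(1,2).getElem(3,4);  % element (3,4) within tile (1,2)
s = A.getTile(1:2,1:3);           % triplets (ranges) can be used as indexes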
15. HTA Indexing Example
(Figure: indexing into the tiles and elements of an HTA A.)
16. Assigning to an HTA
- Replication of scalars:
  - A(1:3,4:5) = scalar
- Replication of matrices:
  - A{1:3,4:5} = matrix
- Replication of HTAs:
  - A{1:3,4:5} = single_tile_HTA
- Copies tile to tile:
  - A{1:3,4:5} = B{[3 8 9],2:3}
17. Operation Legality
- A scalar can be operated with every scalar (or set of them) of an HTA.
- A matrix can be operated with any (sub)tiles of an HTA.
- Two HTAs can be operated if their corresponding tiles can be operated.
- Notice that this depends on the operation (see the sketch below).
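A sketch of these rules (the HTAs A and B and the tile size p are assumed):

C = A + 1;                  % a scalar operates with every scalar of the HTA
A{1,1} = A{1,1} + eye(p);   % a p x p matrix operates with a conforming tile
D = A + B;                  % two HTAs with matching tilings operate tile by tile
E = A * B;                  % legal only if the tile-level products are legal:
                            % legality depends on the operation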
18. Tile Usage Basics
- Tiles are used in many algorithms to express locality.
- Tiles can be distributed among processors.
- Block-cyclic, always.
- Operations on tiles take place on the server where they reside (see the sketch below).
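A minimal sketch (constructor as on slide 25; the sizes and mesh are assumptions). The last argument of the constructor names the processor mesh; tiles are dealt out block-cyclically on it, and operations execute on the servers that own the tiles:

M = rand(8);
H = hta(M, {1:2:8, 1:2:8}, [2 2]);  % 4 x 4 tiles distributed block-cyclically on a 2 x 2 mesh
G = H + H;                          % each tile addition runs on the server holding the tiles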
19. Using HTAs for Locality Enhancement
Tiled matrix multiplication using conventional arrays:

for I = 1:q:n, for J = 1:q:n, for K = 1:q:n
  for i = I:I+q-1, for j = J:J+q-1, for k = K:K+q-1
    C(i,j) = C(i,j) + A(i,k) * B(k,j);
  end, end, end
end, end, end
20. Using HTAs for Locality Enhancement
Tiled matrix multiplication using HTAs:
- Here, C{i,j}, A{i,k}, B{k,j} represent submatrices.
- The * operator represents matrix multiplication in MATLAB.

for i = 1:m, for j = 1:m, for k = 1:m
  C{i,j} = C{i,j} + A{i,k} * B{k,j};
end, end, end
21. Using HTAs to Represent Data Distribution and Parallelism
Cannon's Algorithm
22-24. (Figures illustrating the shift-and-multiply steps of Cannon's algorithm; no transcript.)
25. Cannon's Algorithm in MATLAB with HPCS Extensions

a = hta(A, {1:SZ/n:SZ, 1:SZ/n:SZ}, [n n]);
b = hta(B, {1:SZ/n:SZ, 1:SZ/n:SZ}, [n n]);
c = hta(n, n, [n n]);  c{:,:} = zeros(p,p);
for i = 2:n
  a{i,:} = circshift(a{i,:}, [0 -(i-1)]);
  b{:,i} = circshift(b{:,i}, [-(i-1) 0]);
end
for k = 1:n
  c = c + a * b;
  a = circshift(a, [0 -1]);  b = circshift(b, [-1 0]);
end
26. Cannon's Algorithm in C + MPI
for (km = 0; km < m; km++) {
  char *chn = "T";
  dgemm(chn, chn, lclMxSz, lclMxSz, lclMxSz, 1.0, a, lclMxSz,
        b, lclMxSz, 1.0, c, lclMxSz);
  MPI_Isend(a, lclMxSz * lclMxSz, MPI_DOUBLE, destrow,
            ROW_SHIFT_TAG, MPI_COMM_WORLD, &requestrow);
  MPI_Isend(b, lclMxSz * lclMxSz, MPI_DOUBLE, destcol,
            COL_SHIFT_TAG, MPI_COMM_WORLD, &requestcol);
  MPI_Recv(abuf, lclMxSz * lclMxSz, MPI_DOUBLE, MPI_ANY_SOURCE,
           ROW_SHIFT_TAG, MPI_COMM_WORLD, &status);
  MPI_Recv(bbuf, lclMxSz * lclMxSz, MPI_DOUBLE, MPI_ANY_SOURCE,
           COL_SHIFT_TAG, MPI_COMM_WORLD, &status);
  MPI_Wait(&requestrow, &status);
  aptr = a;  a = abuf;  abuf = aptr;
  MPI_Wait(&requestcol, &status);
  bptr = b;  b = bbuf;  bbuf = bptr;
}
27. Speedups on a four-processor IBM SP-2
28. Speedups on a nine-processor IBM SP-2
29. The SUMMA Algorithm (1 of 6)
Use now the outer-product method (n^2 parallelism). Interchanging the loop headers of the loop in Example 1 produces:

for k = 1:n
  for i = 1:n, for j = 1:n
    C{i,j} = C{i,j} + A{i,k} * B{k,j};
  end, end
end

To obtain n^2 parallelism, the inner two loops should take the form of a block operation:

for k = 1:n
  C{:,:} = C{:,:} + A{:,k} ⊗ B{k,:};
end

where the operator ⊗ represents the outer-product operation.
30. The SUMMA Algorithm (2 of 6)
(Figure: switch orientation -- by using a column of A (a11, a21) and a row of B (b11, b12), broadcast to all, compute the next terms of the dot products: a11*b11, a11*b12, a21*b11, a21*b12.)
31. The SUMMA Algorithm (3 of 6)

c{1:n,1:n} = zeros(p,p);                        % communication
for i = 1:n
  t1{:,:} = spread(a(:,i), dim=2, ncopies=N);   % communication
  t2{:,:} = spread(b(i,:), dim=1, ncopies=N);   % communication
  c{:,:}  = c{:,:} + t1{:,:} * t2{:,:};         % computation
end
32. The SUMMA Algorithm (4 of 6)
33. The SUMMA Algorithm (5 of 6)
34. The SUMMA Algorithm (6 of 6)
35. Matrix Multiplication with Columnwise Block Striped Matrix

for i = 0:p-1, block(i) = i*n/p : (i+1)*n/p - 1; end
for i = 0:p-1, M{i} = A(:, block(i)); v{i} = b(block(i)); end
forall j = 0:p-1, c{j} = M{j} * v{j}; end
forall i = 0:p-1, j = 0:p-1, d{i}(j) = c{j}(i); end
forall i = 0:p-1, b{i} = sum(d{i}); end
36. Flattening
- Elements of an HTA are referenced using a tile index for each level in the hierarchy followed by an array index. Each tile index tuple is enclosed within {}s and the array index is enclosed within parentheses.
- In the matrix multiplication code, C{i,j}(3,4) would represent element (3,4) of submatrix {i,j}.
- Alternatively, the tiled array could be accessed as a flat array, as shown in the next slide.
- This feature is useful when a global view of the array is needed in the algorithm. It is also useful while transforming a sequential code into parallel form (see the sketch below).
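A sketch of the two views, assuming an 8 x 8 array C split into 2 x 2 tiles of size 4 x 4 (as in the next slide); the tile sizes are an assumption:

x = C{2,1}(3,4);   % hierarchical view: element (3,4) of tile {2,1}
y = C(7,4);        % flattened view: the same element addressed globally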
37. Two Ways of Referencing the Elements of an 8 x 8 Array
38. Status
- We have completed the implementation of practically all of our initial language extensions (for the IBM SP-2 and Linux clusters).
- Following the toolbox approach has been a challenge, but we have been able to overcome all obstacles.
39. Jacobi Relaxation

while dif > epsilon
  v{2:n,:}(0,:)     = v{1:n-1,:}(p,:);    % communication
  v{1:n-1,:}(p+1,:) = v{2:n,:}(1,:);      % communication
  v{:,2:n}(:,0)     = v{:,1:n-1}(:,p);    % communication
  v{:,1:n-1}(:,p+1) = v{:,2:n}(:,1);      % communication
  u{:,:}(1:p,1:p) = a * (v{:,:}(1:p,0:p-1) + v{:,:}(0:p-1,1:p) + ...
                         v{:,:}(1:p,2:p+1) + v{:,:}(2:p+1,1:p));   % computation
  dif = max(max(abs(v - u)));
  v = u;
end
40. Sparse Matrix Vector Product
(Figure: the sparse matrix A is distributed by blocks of rows across processors P1-P4; the vector b is replicated.)
41. Sparse Matrix Vector Product: HTA Implementation

c = hta(a, {dist, 1}, [4 1]);  v = hta(4, 1, [4 1]);
v{:} = b;      % replicate the vector on all servers
t = c * v;     % local sparse matrix-vector products
r = t(:);      % flatten the distributed result
42. Implementation
43. Tools
- MATLAB™ allows user-defined classes.
- Functions can be written in C or FORTRAN (mex).
- Obvious communication tool: MPI.
- MATLAB™ currently provides us with both the language and the computation engine, in both the client and the servers.
- Standard MATLAB™ extension mechanism: toolbox.
44. MATLAB™ Classes
- Classes have constructors and methods.
- The methods are standard MATLAB™ functions (possibly mex files) that can access the internals of the class.
- Classes allow overloading of standard operators and functions.
- Big problem: no destructors!
- Solution: register created HTAs and apply garbage collection (see the sketch below).
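A minimal sketch of this class mechanism, illustrative only and not the actual HTA toolbox code (field names and method bodies are assumptions). Old-style MATLAB classes live in an @hta directory, with the constructor and each overloaded function in their own files:

% @hta/hta.m -- constructor
function h = hta(tiles)
s.tiles = tiles;            % cell array holding the tiles
h = class(s, 'hta');

% @hta/plus.m -- overloads A + B, applied tile by tile
function c = plus(a, b)
c = a;
for k = 1:numel(a.tiles)
  c.tiles{k} = a.tiles{k} + b.tiles{k};
end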
45. Architecture
(Diagram: a MATLAB™ client and MATLAB™ servers, each containing an HTA subsystem built from MATLAB™ functions and mex functions, connected through an MPI interface.)
46. HTAs Currently Supported
- Dense and sparse matrices.
- Real and complex double precision.
- Any number of levels of tiling.
- HTAs must be either completely empty or completely full.
- Every tile in an HTA must have the same number of levels of tiling.
47. Communication Patterns
- Client:
  - Broadcasts commands and data.
  - Sometimes synchronizes the servers.
- Still, servers can swap tiles among themselves following commands issued by the client.
- So in some operations the client is a bottleneck in the current implementation, while in others the system is efficient.
48. Communication Implementation
- No MPI collective communications are used.
- Individual messages.
- Extensive usage of non-blocking sends.
- Every message is a vector of doubles.
- Internally, MATLAB™ stores most of its data as doubles.
- Faster/easier development.
49. Conclusions
- We have developed parallel extensions to MATLAB.
- It is possible to write highly readable parallel code for both dense and sparse computations with these extensions.
- The HTA objects and operations have been implemented as a MATLAB toolbox, which enabled their implementation as language extensions.