Title: MATLAB HPCS Extensions
1. MATLAB HPCS Extensions
- Presented by David Padua
- University of Illinois at Urbana-Champaign
2. Senior Team Members
- Gheorghe Almasi - IBM Research
- Calin Cascaval - IBM Research
- Basilio Fraguela - University of Illinois
- Jose Moreira - IBM Research
- David Padua - University of Illinois
3. Objectives
- To develop MATLAB extensions for accessing, prototyping, and implementing scalable parallel algorithms.
- As a result, to give programmers of high-end machines access to all the powerful features of MATLAB:
  - Array operations / kernels.
  - Interactive interface.
  - Rendering.
4. Goals of the Design
- Minimal extension.
- A natural extension to MATLAB that is easy to use.
- Extensions for direct control of parallelism and communication, on top of the ability to access parallel library routines. It does not seem possible to encapsulate all the important parallelism in library routines.
- Extensions that provide the necessary information and can be automatically and effectively analyzed for compilation and translation.
5. Uses of the Language Extension
- Parallel library user.
- Parallel library developer.
- Input to a type-inference compiler.
- No existing MATLAB extension had the necessary characteristics.
6. Approach
- In our approach, the programmer interacts with a copy of MATLAB running on a workstation.
- The workstation controls parallel computation on servers.
7. Approach (Cont.)
- All conventional MATLAB operations are executed on the workstation.
- The parallel server operates on a new class of MATLAB objects: hierarchically tiled arrays (HTAs).
- HTAs are implemented as a MATLAB toolbox. This enables implementation as a language extension and simplifies porting to future versions of MATLAB.
8. Hierarchically Tiled Arrays
- Array tiles are a powerful mechanism to enhance locality in sequential computations and to represent data distribution across parallel systems.
- Our main extension to MATLAB is arrays that are hierarchically tiled.
- Several levels of tiling are useful to distribute data across parallel machines with a hierarchical organization and to simultaneously represent both data distribution and memory layout.
- For example, a two-level hierarchy of tiles can be used to represent
  - the data distribution on a parallel system, and
  - the memory layout within each component (see the sketch below).
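A minimal sketch of such a two-level HTA, assuming the hta constructor used later in these slides (slide 25) and that a tile of an HTA may itself be an HTA, as the definition on slide 11 allows; the sizes and meshes are illustrative only:

A = rand(64);                                % 64 x 64 dense matrix
H = hta(A, {1:32:64, 1:32:64}, [2 2]);       % outer tiling: distribution on a 2 x 2 mesh
for i = 1:2
  for j = 1:2
    H{i,j} = hta(H{i,j}, {1:8:32, 1:8:32});  % inner tiling: memory layout within each tile
  end
end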
9. Hierarchically Tiled Arrays (Cont.)
- Computation and communication are represented as array operations on HTAs.
- Using array operations for communication and computation raises the level of abstraction and, at the same time, facilitates optimization (see the sketch below).
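For instance, a minimal sketch (the tile indices and variable names are assumptions, not taken from the slides) of how ordinary-looking array statements express both kinds of work:

c = a + b;        % computation: corresponding tiles are added in parallel on their servers
a{2,:} = a{1,:};  % communication: the tiles of row 1 are copied to the servers owning row 2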
10. Semantics
11. HTA Definition
An array partitioned into tiles in such a way that adjacent tiles have the same size along the dimension of adjacency, each tile being either an unpartitioned array or an HTA (see the sketch below).
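A small sketch of the definition, using the constructor syntax of slide 25 with illustrative partition points. Rows are split as 2+4 and columns as 3+3, so every pair of adjacent tiles matches in size along the dimension of adjacency and the result is a valid (irregular) HTA:

M = rand(6);
H = hta(M, {[1 3], [1 4]}, [2 2]);   % row tiles start at rows 1 and 3, column tiles at columns 1 and 4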
12. More HTAs
13. Example of a non-HTA
14. Addressing the HTAs
- We want to be able to address any level of an HTA A.
- A.getTile(i,j): get tile (i,j) of the HTA.
- A.getElem(a,b): get element (a,b) of the HTA, disregarding the tiles (flattening).
- A.getTile(i,j).getElem(c,d): get element (c,d) within tile (i,j).
- Triplets are supported in the indexes (see the sketch below).
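An illustrative use of the accessors named above (the variable names and sizes are assumed, not from the slides):

t = A.getTile(1,2);               % tile (1,2) of A
x = A.getElem(10,20);             % element (10,20) of the flattened A
y = A.getTile(1,2).getElem(3,4);  % element (3,4) within tile (1,2)
s = A.getTile(1:2,1:3);           % triplets (ranges) can be used as indexes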
15. HTA Indexing Example
(Figure: indexing into the tiles and elements of an HTA A.)
16. Assigning to an HTA
- Replication of scalars:
  - A(1:3,4:5) = scalar
- Replication of matrices:
  - A{1:3,4:5} = matrix
- Replication of HTAs:
  - A{1:3,4:5} = single_tile_HTA
- Copies tile to tile:
  - A{1:3,4:5} = B{[3 8 9],2:3}
17. Operation Legality
- A scalar can be operated with every scalar (or set of them) of an HTA.
- A matrix can be operated with any (sub)tiles of an HTA.
- Two HTAs can be operated if their corresponding tiles can be operated.
- Notice that this depends on the operation (see the sketch below).
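A sketch of these rules (the HTAs A and B and the tile size p are assumed):

C = A + 1;                  % a scalar operates with every scalar of the HTA
A{1,1} = A{1,1} + eye(p);   % a p x p matrix operates with a conforming tile
D = A + B;                  % two HTAs with matching tilings operate tile by tile
E = A * B;                  % legal only if the tile-level products are legal:
                            % legality depends on the operation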
18. Tile Usage Basics
- Tiles are used in many algorithms to express locality.
- Tiles can be distributed among processors.
- Block-cyclic, always.
- Operations on tiles take place on the server where they reside (see the sketch below).
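A minimal sketch (constructor as on slide 25; the sizes and mesh are assumptions). The last argument of the constructor names the processor mesh; tiles are dealt out block-cyclically on it, and operations execute on the servers that own the tiles:

M = rand(8);
H = hta(M, {1:2:8, 1:2:8}, [2 2]);  % 4 x 4 tiles distributed block-cyclically on a 2 x 2 mesh
G = H + H;                          % each tile addition runs on the server holding the tiles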
19. Using HTAs for Locality Enhancement
Tiled matrix multiplication using conventional arrays:

for I = 1:q:n, for J = 1:q:n, for K = 1:q:n
  for i = I:I+q-1, for j = J:J+q-1, for k = K:K+q-1
    C(i,j) = C(i,j) + A(i,k) * B(k,j);
  end, end, end
end, end, end
20. Using HTAs for Locality Enhancement
Tiled matrix multiplication using HTAs:
- Here, C{i,j}, A{i,k}, B{k,j} represent submatrices.
- The * operator represents matrix multiplication in MATLAB.

for i = 1:m, for j = 1:m, for k = 1:m
  C{i,j} = C{i,j} + A{i,k} * B{k,j};
end, end, end
21. Using HTAs to Represent Data Distribution and Parallelism
Cannon's Algorithm
22-24. (Figures illustrating the shift-and-multiply steps of Cannon's algorithm; no transcript.)
25. Cannon's Algorithm in MATLAB with HPCS Extensions

a = hta(A, {1:SZ/n:SZ, 1:SZ/n:SZ}, [n n]);
b = hta(B, {1:SZ/n:SZ, 1:SZ/n:SZ}, [n n]);
c = hta(n, n, [n n]);  c{:,:} = zeros(p,p);
for i = 2:n
  a{i,:} = circshift(a{i,:}, [0 -(i-1)]);
  b{:,i} = circshift(b{:,i}, [-(i-1) 0]);
end
for k = 1:n
  c = c + a * b;
  a = circshift(a, [0 -1]);  b = circshift(b, [-1 0]);
end
26. Cannon's Algorithm in C + MPI
for (km = 0; km < m; km++) {
  char *chn = "T";
  dgemm(chn, chn, lclMxSz, lclMxSz, lclMxSz, 1.0, a, lclMxSz,
        b, lclMxSz, 1.0, c, lclMxSz);
  MPI_Isend(a, lclMxSz * lclMxSz, MPI_DOUBLE, destrow,
            ROW_SHIFT_TAG, MPI_COMM_WORLD, &requestrow);
  MPI_Isend(b, lclMxSz * lclMxSz, MPI_DOUBLE, destcol,
            COL_SHIFT_TAG, MPI_COMM_WORLD, &requestcol);
  MPI_Recv(abuf, lclMxSz * lclMxSz, MPI_DOUBLE, MPI_ANY_SOURCE,
           ROW_SHIFT_TAG, MPI_COMM_WORLD, &status);
  MPI_Recv(bbuf, lclMxSz * lclMxSz, MPI_DOUBLE, MPI_ANY_SOURCE,
           COL_SHIFT_TAG, MPI_COMM_WORLD, &status);
  MPI_Wait(&requestrow, &status);
  aptr = a;  a = abuf;  abuf = aptr;
  MPI_Wait(&requestcol, &status);
  bptr = b;  b = bbuf;  bbuf = bptr;
}
27. Speedups on a four-processor IBM SP-2
28. Speedups on a nine-processor IBM SP-2
29. The SUMMA Algorithm (1 of 6)
Use now the outer-product method (n^2 parallelism). Interchanging the loop headers of the loop in Example 1 produces:

for k = 1:n
  for i = 1:n, for j = 1:n
    C{i,j} = C{i,j} + A{i,k} * B{k,j};
  end, end
end

To obtain n^2 parallelism, the inner two loops should take the form of a block operation:

for k = 1:n
  C{:,:} = C{:,:} + A{:,k} ⊗ B{k,:};
end

where the operator ⊗ represents the outer-product operation.
30. The SUMMA Algorithm (2 of 6)
(Figure: switch orientation -- by using a column of A (a11, a21) and a row of B (b11, b12), broadcast to all, compute the next terms of the dot products: a11*b11, a11*b12, a21*b11, a21*b12.)
31. The SUMMA Algorithm (3 of 6)

c{1:n,1:n} = zeros(p,p);                        % communication
for i = 1:n
  t1{:,:} = spread(a(:,i), dim=2, ncopies=N);   % communication
  t2{:,:} = spread(b(i,:), dim=1, ncopies=N);   % communication
  c{:,:}  = c{:,:} + t1{:,:} * t2{:,:};         % computation
end
32. The SUMMA Algorithm (4 of 6)
33. The SUMMA Algorithm (5 of 6)
34. The SUMMA Algorithm (6 of 6)
35. Matrix Multiplication with Columnwise Block Striped Matrix

for i = 0:p-1, block(i) = i*n/p : (i+1)*n/p - 1; end
for i = 0:p-1, M{i} = A(:, block(i)); v{i} = b(block(i)); end
forall j = 0:p-1, c{j} = M{j} * v{j}; end
forall i = 0:p-1, j = 0:p-1, d{i}(j) = c{j}(i); end
forall i = 0:p-1, b{i} = sum(d{i}); end
36. Flattening
- Elements of an HTA are referenced using a tile index for each level in the hierarchy followed by an array index. Each tile index tuple is enclosed within {}s and the array index is enclosed within parentheses.
- In the matrix multiplication code, C{i,j}(3,4) would represent element (3,4) of submatrix {i,j}.
- Alternatively, the tiled array could be accessed as a flat array, as shown in the next slide.
- This feature is useful when a global view of the array is needed in the algorithm. It is also useful while transforming a sequential code into parallel form (see the sketch below).
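A sketch of the two views, assuming an 8 x 8 array C split into 2 x 2 tiles of size 4 x 4 (as in the next slide); the tile sizes are an assumption:

x = C{2,1}(3,4);   % hierarchical view: element (3,4) of tile {2,1}
y = C(7,4);        % flattened view: the same element addressed globally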
37. Two Ways of Referencing the Elements of an 8 x 8 Array
38. Status
- We have completed the implementation of practically all of our initial language extensions (for the IBM SP-2 and Linux clusters).
- Following the toolbox approach has been a challenge, but we have been able to overcome all obstacles.
39. Jacobi Relaxation

while dif > epsilon
  v{2:n,:}(0,:)     = v{1:n-1,:}(p,:);    % communication
  v{1:n-1,:}(p+1,:) = v{2:n,:}(1,:);      % communication
  v{:,2:n}(:,0)     = v{:,1:n-1}(:,p);    % communication
  v{:,1:n-1}(:,p+1) = v{:,2:n}(:,1);      % communication
  u{:,:}(1:p,1:p) = a * (v{:,:}(1:p,0:p-1) + v{:,:}(0:p-1,1:p) + ...
                         v{:,:}(1:p,2:p+1) + v{:,:}(2:p+1,1:p));   % computation
  dif = max(max(abs(v - u)));
  v = u;
end
40. Sparse Matrix Vector Product
(Figure: the sparse matrix A is distributed by blocks of rows across processors P1-P4; the vector b is replicated.)
41. Sparse Matrix Vector Product: HTA Implementation

c = hta(a, {dist, 1}, [4 1]);  v = hta(4, 1, [4 1]);
v{:} = b;      % replicate the vector on all servers
t = c * v;     % local sparse matrix-vector products
r = t(:);      % flatten the distributed result
42. Implementation
43. Tools
- MATLAB™ allows user-defined classes.
- Functions can be written in C or FORTRAN (mex).
- Obvious communication tool: MPI.
- MATLAB™ currently provides us with both the language and the computation engine, in both the client and the servers.
- Standard MATLAB™ extension mechanism: toolbox.
44. MATLAB™ Classes
- Classes have constructors and methods.
- The methods are standard MATLAB™ functions (possibly mex files) that can access the internals of the class.
- Classes allow overloading of standard operators and functions.
- Big problem: no destructors!
- Solution: register created HTAs and apply garbage collection (see the sketch below).
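A minimal sketch of this class mechanism, illustrative only and not the actual HTA toolbox code (field names and method bodies are assumptions). Old-style MATLAB classes live in an @hta directory, with the constructor and each overloaded function in their own files:

% @hta/hta.m -- constructor
function h = hta(tiles)
s.tiles = tiles;            % cell array holding the tiles
h = class(s, 'hta');

% @hta/plus.m -- overloads A + B, applied tile by tile
function c = plus(a, b)
c = a;
for k = 1:numel(a.tiles)
  c.tiles{k} = a.tiles{k} + b.tiles{k};
end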
45. Architecture
(Diagram: a MATLAB™ client and MATLAB™ servers, each containing an HTA subsystem built from MATLAB™ functions and mex functions, connected through an MPI interface.)
46. HTAs Currently Supported
- Dense and sparse matrices.
- Real and complex double precision.
- Any number of levels of tiling.
- HTAs must be either completely empty or completely full.
- Every tile in an HTA must have the same number of levels of tiling.
47. Communication Patterns
- Client:
  - Broadcasts commands and data.
  - Sometimes synchronizes the servers.
- Still, servers can swap tiles among themselves following commands issued by the client.
- So in some operations the client is a bottleneck in the current implementation, while in others the system is efficient.
48. Communication Implementation
- No MPI collective communications are used.
- Individual messages.
- Extensive usage of non-blocking sends.
- Every message is a vector of doubles.
- Internally, MATLAB™ stores most of its data as doubles.
- Faster/easier development.
49. Conclusions
- We have developed parallel extensions to MATLAB.
- It is possible to write highly readable parallel code for both dense and sparse computations with these extensions.
- The HTA objects and operations have been implemented as a MATLAB toolbox, which enabled their implementation as language extensions.