MATLAB HPCS Extensions - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: MATLAB HPCS Extensions


1
MATLAB HPCS Extensions
  • Presented by David Padua
  • University of Illinois at Urbana-Champaign

2
Senior Team Members
  • Gheorghe Almasi - IBM Research
  • Calin Cascaval - IBM Research
  • Basilio Fraguela - University of Illinois
  • Jose Moreira - IBM Research
  • David Padua - University of Illinois

3
Objectives
  • To develop MATLAB extensions for accessing,
    prototyping, and implementing scalable parallel
    algorithms.
  • As a result, to give programmers of high-end
    machines access to all the powerful features of
    MATLAB:
  • Array operations / kernels.
  • Interactive interface.
  • Rendering.

4
Goals of the Design
  • Minimal extension.
  • A natural extension to MATLAB that is easy to
    use.
  • Extensions for direct control of parallelism and
    communication, on top of the ability to access
    parallel library routines. It does not seem
    possible to encapsulate all the important
    parallelism in library routines.
  • Extensions that provide the necessary information
    and can be automatically and effectively analyzed
    for compilation and translation.

5
Uses of the Language Extension
  • Parallel libraries user.
  • Parallel library developer.
  • Input to type-inference compiler.
  • No existing MATLAB extension had the necessary
    characteristics.

6
Approach
  • In our approach, the programmer interacts with a
    copy of MATLAB running on a workstation.
  • The workstation controls parallel computation on
    servers.

7
Approach (Cont.)
  • All conventional MATLAB operations are executed
    on the workstation.
  • The parallel server operates on a new class of
    MATLAB objects: hierarchically tiled arrays
    (HTAs).
  • HTAs are implemented as a MATLAB toolbox. This
    enables implementation as a language extension
    and simplifies porting to future versions of
    MATLAB.

8
Hierarchically Tiled Arrays
  • Array tiles are a powerful mechanism to enhance
    locality in sequential computations and to
    represent data distribution across parallel
    systems.
  • Our main extension to MATLAB is arrays that are
    hierarchically tiled.
  • Several levels of tiling are useful to distribute
    data across parallel machines with a hierarchical
    organization and to simultaneously represent both
    data distribution and memory layout.
  • For example, a two-level hierarchy of tiles can
    be used to represent
  • the data distribution on a parallel system and
  • the memory layout within each component.

9
Hierarchically Tiled Arrays (Cont.)
  • Computation and communication are represented as
    array operations on HTAs.
  • Using array operations for communication and
    computation raises the level of abstraction and,
    at the same time, facilitates optimization.

10
Semantics
11
HTA Definition
An array partitioned into tiles in such a way that
adjacent tiles have the same size along the
dimension of adjacency, each tile being either
an unpartitioned array or an HTA.
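The structure defined above can be emulated with a plain NumPy reshape. This is only an illustrative sketch (the HTA toolbox itself is a MATLAB extension; the 8x8 size and 2x2 tiling here are hypothetical, and a deeper hierarchy would repeat the split inside each tile):

```python
import numpy as np

# An 8x8 array viewed as a one-level HTA: a 2x2 arrangement of 4x4 tiles.
A = np.arange(64).reshape(8, 8)

# reshape + transpose yields a (2, 2, 4, 4) tiled view:
# tiles[i, j] is tile (i, j); tiles[i, j][a, b] is an element within it.
tiles = A.reshape(2, 4, 2, 4).transpose(0, 2, 1, 3)

print(tiles[1, 0].shape)   # one 4x4 tile
print(tiles[1, 0][0, 0])   # the same element as A[4, 0]
```

Adjacent tiles automatically have matching sizes along the dimension of adjacency here because the split points are uniform; the real HTA constructor also allows non-uniform partition points.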
12
More HTAs
13
Example of non-HTA
14
Addressing the HTAs
  • We want to be able to address any level of an HTA
    A:
  • A.getTile(i,j): get tile (i,j) of the HTA.
  • A.getElem(a,b): get element (a,b) of the HTA,
    disregarding the tiles (flattening).
  • A.getTile(i,j).getElem(c,d): get element (c,d)
    within tile (i,j).
  • Triplets (start:step:end ranges) are supported in
    the indices.

15
HTA Indexing Example
(Figure: an indexing example on an HTA A, combining tile subscripts and element subscripts.)
16
Assigning to an HTA
  • Replication of scalars:
  • A(1:3,4:5) = scalar
  • Replication of matrices:
  • A{1:3,4:5} = matrix
  • Replication of HTAs:
  • A{1:3,4:5} = single_tile_HTA
  • Copies tile to tile:
  • A{1:3,4:5} = B{[3 8 9],2:3}
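The same replication rules can be mimicked with NumPy broadcasting on a 4-D array of tiles. This is a sketch only: the (2, 2, 3, 3) shape and the assigned values are hypothetical, and NumPy's `A[...]` stands in for the HTA's `A{...}`:

```python
import numpy as np

# A 2x2 arrangement of 3x3 tiles, stored as a 4-D array.
A = np.zeros((2, 2, 3, 3))

A[0, 0] = 7.0          # scalar replicated over one tile
A[1, :] = np.eye(3)    # matrix replicated over a row of tiles
A[:, 1] = A[0, 0]      # one tile copied into a column of tiles
```

Each assignment fills every addressed tile with a copy of the right-hand side, which is exactly the replication semantics the slide describes.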

17
Operation Legality
  • A scalar can be operated with every scalar (or
    set of scalars) of an HTA.
  • A matrix can be operated with any (sub)tiles of
    an HTA.
  • Two HTAs can be operated together if their
    corresponding tiles can be operated together.
  • Notice that this depends on the operation.

18
Tile Usage Basics
  • Tiles are used in many algorithms to express
    locality.
  • Tiles can be distributed among processors.
  • The distribution is always block-cyclic.
  • Operations on tiles take place on the server
    where they reside.

19
Using HTAs for Locality Enhancement
Tiled matrix multiplication using conventional
arrays
for I=1:q:n
  for J=1:q:n
    for K=1:q:n
      for i=I:I+q-1
        for j=J:J+q-1
          for k=K:K+q-1
            C(i,j) = C(i,j) + A(i,k)*B(k,j);
          end
        end
      end
    end
  end
end
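The same six-loop structure can be sketched in NumPy for concreteness; n and q are hypothetical sizes with q dividing n:

```python
import numpy as np

n, q = 8, 4
rng = np.random.default_rng(0)
A, B = rng.random((n, n)), rng.random((n, n))
C = np.zeros((n, n))

# Outer loops walk tile origins; inner loops do the scalar updates
# within one (I, J, K) tile triplet, so each tile is reused q times
# while it is hot in cache.
for I in range(0, n, q):
    for J in range(0, n, q):
        for K in range(0, n, q):
            for i in range(I, I + q):
                for j in range(J, J + q):
                    for k in range(K, K + q):
                        C[i, j] += A[i, k] * B[k, j]
```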
20
Using HTAs for Locality Enhancement
Tiled matrix multiplication using HTAs
  • Here, C{i,j}, A{i,k}, B{k,j} represent
    submatrices.
  • The * operator represents matrix multiplication
    in MATLAB.

for i=1:m
  for j=1:m
    for k=1:m
      C{i,j} = C{i,j} + A{i,k}*B{k,j};
    end
  end
end
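With tiles, the innermost three loops collapse to one matrix multiply per tile triplet. A NumPy sketch using array slices in place of HTA tiles (sizes hypothetical):

```python
import numpy as np

n, q = 8, 4
m = n // q                      # number of tiles per dimension
rng = np.random.default_rng(1)
A, B = rng.random((n, n)), rng.random((n, n))
C = np.zeros((n, n))

# C{i,j} = C{i,j} + A{i,k} * B{k,j}, with slices standing in for tiles.
for i in range(m):
    for j in range(m):
        for k in range(m):
            C[i*q:(i+1)*q, j*q:(j+1)*q] += (
                A[i*q:(i+1)*q, k*q:(k+1)*q] @
                B[k*q:(k+1)*q, j*q:(j+1)*q])
```

Raising the unit of work from scalars to tiles is what lets the same notation express both cache blocking and distributed execution.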
21
Using HTAs to Represent Data Distribution and
Parallelism
Cannon's Algorithm
25
Cannon's Algorithm in MATLAB with HPCS Extensions
a = hta(A, {1:SZ/n:SZ, 1:SZ/n:SZ}, [n n]);
b = hta(B, {1:SZ/n:SZ, 1:SZ/n:SZ}, [n n]);
c = hta(n, n, [n n]);
c{:,:} = zeros(p, p);
for i = 2:n
  a{i,:} = circshift(a{i,:}, [0 -(i-1)]);
  b{:,i} = circshift(b{:,i}, [-(i-1) 0]);
end
for k = 1:n
  c = c + a * b;
  a = circshift(a, [0 -1]);
  b = circshift(b, [-1 0]);
end
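A sequential NumPy simulation of the same algorithm may clarify the shifts: `np.roll` plays the role of `circshift`, and an n x n x p x p array stands in for the n x n HTA of p x p tiles (all sizes hypothetical):

```python
import numpy as np

n, p = 4, 2                      # n x n grid of p x p tiles
SZ = n * p
rng = np.random.default_rng(2)
A, B = rng.random((SZ, SZ)), rng.random((SZ, SZ))

# Split into an n x n arrangement of p x p tiles.
a = A.reshape(n, p, n, p).transpose(0, 2, 1, 3).copy()
b = B.reshape(n, p, n, p).transpose(0, 2, 1, 3).copy()
c = np.zeros_like(a)

# Initial skew: tile row i shifts left by i, tile column i shifts up by i.
for i in range(1, n):
    a[i] = np.roll(a[i], -i, axis=0)        # a[i] has shape (n, p, p)
    b[:, i] = np.roll(b[:, i], -i, axis=0)

# Main loop: multiply the aligned tiles, then shift every tile by one.
for k in range(n):
    for i in range(n):
        for j in range(n):
            c[i, j] += a[i, j] @ b[i, j]
    a = np.roll(a, -1, axis=1)              # circshift(a, [0 -1])
    b = np.roll(b, -1, axis=0)              # circshift(b, [-1 0])

C = c.transpose(0, 2, 1, 3).reshape(SZ, SZ)
```

In the HTA version each tile lives on its own server, so every `circshift` is a round of nearest-neighbour communication and each `a * b` is a local tile multiply.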
26
Cannon's Algorithm in C + MPI
for (km = 0; km < m; km++) {
  char *chn = "T";
  dgemm(chn, chn, lclMxSz, lclMxSz, lclMxSz, 1.0, a, lclMxSz,
        b, lclMxSz, 1.0, c, lclMxSz);
  MPI_Isend(a, lclMxSz * lclMxSz, MPI_DOUBLE, destrow,
            ROW_SHIFT_TAG, MPI_COMM_WORLD, &requestrow);
  MPI_Isend(b, lclMxSz * lclMxSz, MPI_DOUBLE, destcol,
            COL_SHIFT_TAG, MPI_COMM_WORLD, &requestcol);
  MPI_Recv(abuf, lclMxSz * lclMxSz, MPI_DOUBLE, MPI_ANY_SOURCE,
           ROW_SHIFT_TAG, MPI_COMM_WORLD, &status);
  MPI_Recv(bbuf, lclMxSz * lclMxSz, MPI_DOUBLE, MPI_ANY_SOURCE,
           COL_SHIFT_TAG, MPI_COMM_WORLD, &status);
  MPI_Wait(&requestrow, &status);
  aptr = a; a = abuf; abuf = aptr;
  MPI_Wait(&requestcol, &status);
  bptr = b; b = bbuf; bbuf = bptr;
}
27
Speedups on a four-processor IBM SP-2
28
Speedups on a nine-processor IBM SP-2
29
The SUMMA Algorithm (1 of 6)
Now use the outer-product method (n²-parallelism).
Interchanging the loop headers of the loop in
Example 1 produces:

for k=1:n
  for i=1:n
    for j=1:n
      C{i,j} = C{i,j} + A{i,k}*B{k,j};
    end
  end
end

To obtain n² parallelism, the inner two loops should
take the form of a block operation:

for k=1:n
  C{:,:} = C{:,:} + A{:,k} ⊗ B{k,:};
end

where the
operator ⊗ represents the outer product operation.
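The outer-product formulation can be checked with a small NumPy sketch, where `np.outer` stands in for the ⊗ operator (the size n is hypothetical):

```python
import numpy as np

n = 6
rng = np.random.default_rng(3)
A, B = rng.random((n, n)), rng.random((n, n))
C = np.zeros((n, n))

# Each iteration adds one rank-1 update: the outer product of column k
# of A with row k of B. All n^2 elements of C update independently,
# which is the n^2-parallelism the slide refers to.
for k in range(n):
    C += np.outer(A[:, k], B[k, :])
```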
30
The SUMMA Algorithm (2 of 6)
  • The SUMMA Algorithm

(Figure: switch orientation. A column of A, here (a11, a21), and a row of B, here (b11, b12), are broadcast to all processors, which compute the next terms a11*b11, a11*b12, a21*b11, a21*b12 of the running dot products.)
31
The SUMMA Algorithm (3 of 6)
c{1:n,1:n} = zeros(p,p);                       % communication
for i=1:n
  t1{:,:} = spread(a(:,i), dim=2, ncopies=N);  % communication
  t2{:,:} = spread(b(i,:), dim=1, ncopies=N);  % communication
  c{:,:} = c{:,:} + t1{:,:} .* t2{:,:};        % computation
end
32
The SUMMA Algorithm (4 of 6)
33
The SUMMA Algorithm (5 of 6)
34
The SUMMA Algorithm (6 of 6)
35
Matrix Multiplication with Columnwise Block
Striped Matrix
for i=0:p-1
  block(i) = i*n/p : (i+1)*n/p-1;
end
for i=0:p-1
  M{i} = A(:,block(i));
  v{i} = b(block(i));
end
forall j=0:p-1
  c{j} = M{j}*v{j};
end
forall i=0:p-1, j=0:p-1
  d{i}(j) = c{j}(i);
end
forall i=0:p-1
  b{i} = sum(d{i});
end
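A sequential NumPy sketch of the columnwise striping: the forall loops and the redistribution of partial results are collapsed into a plain sum, and p and n are hypothetical (with p dividing n):

```python
import numpy as np

n, p = 8, 4
rng = np.random.default_rng(4)
A, b = rng.random((n, n)), rng.random(n)

# Processor i owns columns block(i) of A and the matching slice of b.
blocks = [np.arange(i * n // p, (i + 1) * n // p) for i in range(p)]

# Local matrix-vector product per "processor", then combine the
# p partial result vectors.
c = [A[:, blk] @ b[blk] for blk in blocks]
result = np.sum(c, axis=0)
```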
36
Flattening
  • Elements of an HTA are referenced using a tile
    index for each level in the hierarchy followed by
    an array index. Each tile index tuple is enclosed
    within curly brackets and the array index is
    enclosed within parentheses.
  • In the matrix multiplication code, C{i,j}(3,4)
    would represent element (3,4) of submatrix {i,j}.
  • Alternatively, the tiled array could be accessed
    as a flat array as shown in the next slide.
  • This feature is useful when a global view of the
    array is needed in the algorithm. It is also
    useful while transforming sequential code into
    parallel form.
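The flat and tiled views address the same storage. A NumPy sketch of the index correspondence for an 8 x 8 array with 4 x 4 tiles (hypothetical sizes):

```python
import numpy as np

A = np.arange(64, dtype=float).reshape(8, 8)
tiles = A.reshape(2, 4, 2, 4).transpose(0, 2, 1, 3)

i, j = 1, 0     # tile index
a, b = 2, 3     # index within the tile

# Tiled access C{i,j}(a,b) reaches the same element as flattened
# access C(4*i+a, 4*j+b).
print(tiles[i, j][a, b] == A[4 * i + a, 4 * j + b])   # True
```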

37
Two Ways of Referencing the Elements of an 8 x 8
Array.
38
Status
  • We have completed the implementation of
    practically all of our initial language
    extensions (for IBM SP-2 and Linux Clusters).
  • Following the toolbox approach has been a
    challenge, but we have been able to overcome all
    obstacles.

39
Jacobi Relaxation
while dif > epsilon
  v{2:n,:}(0,:)     = v{1:n-1,:}(p,:);    % communication
  v{1:n-1,:}(p+1,:) = v{2:n,:}(1,:);      % communication
  v{:,2:n}(:,0)     = v{:,1:n-1}(:,p);    % communication
  v{:,1:n-1}(:,p+1) = v{:,2:n}(:,1);      % communication
  u{:,:}(1:p,1:p) = a * (v{:,:}(1:p,0:p-1) + v{:,:}(0:p-1,1:p) + ...
                         v{:,:}(1:p,2:p+1) + v{:,:}(2:p+1,1:p));  % computation
  dif = max(max(abs(v - u)));
  v = u;
end
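A plain NumPy version of the computation step may help in reading the HTA code: the four shadow-region copies above become shifted slices on a single global grid (grid size, boundary values, and epsilon are hypothetical):

```python
import numpy as np

N, a, epsilon = 32, 0.25, 1e-6
v = np.zeros((N, N))
v[0, :] = 1.0                  # a fixed boundary row

dif = np.inf
while dif > epsilon:
    u = v.copy()
    # Interior points: weighted sum of the four neighbours
    # (west, north, east, south), as in the HTA update.
    u[1:-1, 1:-1] = a * (v[1:-1, :-2] + v[:-2, 1:-1] +
                         v[1:-1, 2:] + v[2:, 1:-1])
    dif = np.max(np.abs(v - u))
    v = u
```

In the distributed HTA version each tile holds a p x p block plus a one-element shadow border (rows/columns 0 and p+1), and the four assignments before the update exchange those borders between neighbouring servers.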
40
Sparse Matrix Vector Product
(Figure: sparse matrix A distributed by blocks of rows across servers P1-P4; vector b replicated on every server.)
41
Sparse Matrix Vector Product HTA Implementation
c = hta(a, {dist, 1}, [4 1]);
v = hta(4, 1, [4 1]);
v{:} = b;
t = c * v;
r = t(:);
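The structure of the HTA code can be sketched with NumPy: row blocks of A act as the distributed tiles, b is replicated, and each block computes its slice of the result (a dense stand-in for the sparse case; sizes hypothetical):

```python
import numpy as np

n, p = 8, 4
A = np.diag(np.arange(1.0, n + 1))   # a simple, mostly-zero matrix
b = np.ones(n)

# Per-"server" products on row blocks (the t = c * v of the HTA code)...
t = [A[i * n // p:(i + 1) * n // p, :] @ b for i in range(p)]
# ...then flatten the partial results back into one vector (r = t(:)).
r = np.concatenate(t)
```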
42
Implementation
43
Tools
  • MATLAB™ allows user-defined classes.
  • Functions can be written in C or FORTRAN (mex
    files).
  • Obvious communication tool: MPI.
  • MATLAB™ currently provides us with both the
    language and the computation engine, on both the
    client and the servers.
  • Standard MATLAB™ extension mechanism: toolbox.

44
MATLAB™ Classes
  • Have constructors and methods.
  • They are standard MATLAB™ functions (possibly
    mex files) that can access the internals of the
    class.
  • Allow overloading of standard operators and
    functions.
  • Big problem: no destructors!
  • Solution: register created HTAs and apply garbage
    collection.

45
Architecture
(Figure: the MATLAB™ client and each MATLAB™ server contain an HTA subsystem, built from MATLAB™ functions and mex functions; client and servers communicate through an MPI interface.)
46
HTAs Currently Supported
  • Dense and sparse matrices
  • Real and complex double precision
  • Any number of levels of tiling
  • HTAs must be either completely empty or
    completely full
  • Every tile in an HTA must have the same number of
    levels of tiling

47
Communication Patterns
  • Client
  • Broadcasts commands and data.
  • Synchronizes the servers at times.
  • Still, the servers can swap tiles among
    themselves following commands issued by the
    client.
  • So in some operations the client is a bottleneck
    in the current implementation, while in others
    the system is efficient.

48
Communication Implementation
  • No MPI collective communications are used;
    messages are sent individually.
  • Extensive use of non-blocking sends.
  • Every message is a vector of doubles:
  • Internally, MATLAB™ stores most of its data as
    doubles.
  • This makes development faster and easier.

49
Conclusions
  • We have developed parallel extensions to MATLAB.
  • It is possible to write highly readable parallel
    code for both dense and sparse computations with
    these extensions.
  • The HTA objects and operations have been
    implemented as a MATLAB toolbox, which enabled
    their implementation as language extensions.