Title: Chapter 5, CLR Textbook
1. Chapter 5, CLR Textbook
- Algorithms on Grids of Processors
2. Logical 2-D Grids of Processors
- In this chapter, we develop algorithms for a 2-D grid (often just called a grid).
- See Figure 5.1(a) for an example of a square grid with p = q² processors.
- Processors are indexed by their row and column, Pi,j with 0 ≤ i, j < q.
- One popular variation of the grid topology is obtained by adding wraparound links (loops) to form what is called a 2-D torus (or simply torus).
- In this case, every processor belongs to two rings.
- The bidirectional torus is very convenient, and it will be our default version.
- For simplicity, we will always assume a square grid, although the algorithms here can be adapted to rectangular grids using somewhat more cumbersome notation.
3. [Figure 5.1: a q×q grid of processors; the torus variant adds wraparound links]
4. Logical 2-D Grids of Processors (cont.)
- We assume that communication can occur on several links at the same time.
- The standard assumptions about concurrent sending, receiving, and computing apply (see Section 3.3).
- We assume that links are full-duplex, allowing communication to flow in both directions without contention.
- This assumption may or may not hold for the platform being used.
- Algorithms can easily be adjusted if full-duplex is not supported.
- A processor could conceivably be involved in one send and one receive on each of its network links concurrently; this assumption for bidirectional links is the multi-port model.
5. Logical 2-D Grids of Processors (cont.)
- The assumption that all these communications can occur concurrently at a processor, with no decrease in communication speed relative to a single communication, is the multi-port model.
- If only two concurrent operations are allowed, one send and one receive, this is the 1-port model.
- This chapter includes performance analyses for both the 1-port and the 4-port model.
- There are actual platforms whose physical topologies include grids and/or rings:
  - Intel Paragon: a 2-D grid
  - IBM Blue Gene/L: a 3-D torus topology, which contains rings and grids.
6. Logical 2-D Grids of Processors (cont.)
- When both a ring and a grid map well to the physical platform, the grid is often preferable.
- Given p processors (counts checked numerically in the sketch after this list),
  - a torus has 2p network links,
  - a grid has 2(p − √p) network links,
  - a ring has p network links.
- As a result, the torus and grid can support more concurrent communications.
- Even on platforms without a physical grid, writing algorithms that assume a grid topology is useful.
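
As a quick sanity check on these link counts, here is a minimal Python sketch that evaluates the three formulas from the list above for a few square configurations (the formulas are from the slide; the code is only illustrative):

```python
import math

def link_counts(p):
    """Return (torus, grid, ring) link counts for p = q*q processors."""
    q = math.isqrt(p)
    assert q * q == p, "p must be a perfect square"
    torus = 2 * p        # q rows of q links plus q columns of q links
    grid = 2 * (p - q)   # each row/column of q processors has q-1 links
    ring = p             # one link per processor
    return torus, grid, ring

for p in (4, 16, 64):
    print(p, link_counts(p))
# For p = 16: torus 32, grid 24, ring 16
```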
7. Grid Communication Details
- The processor in row i and column j of a q×q mesh, for 0 ≤ i, j < q, is denoted Pi,j or P(i,j).
- A processor can find the indices of its row and column using the following functions (sketched below):
  - My_Proc_Row() and My_Proc_Col()
- A processor can determine the total number of processors p = q² by calling Num_Proc().
- Rectangular grids require two functions to give the total number of rows and the total number of columns.
- A processor can send a message of L data items stored at address addr to one of its neighbors by calling
  - Send(dest, addr, L)
  - where dest has value North, South, West, or East.
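
A minimal sketch of how these addressing helpers might be realized, assuming processors are numbered in row-major order by a linear rank (the rank-to-coordinate mapping and the globals below are assumptions of this sketch, not something the slides specify):

```python
import math

# Hypothetical globals standing in for the runtime's knowledge
# of this processor's linear rank and the machine size.
MY_RANK = 5        # assumed row-major rank of this processor
NUM_PROCS = 16     # p = q * q

def Num_Proc():
    return NUM_PROCS

def My_Proc_Row():
    q = math.isqrt(Num_Proc())
    return MY_RANK // q    # row index under row-major numbering

def My_Proc_Col():
    q = math.isqrt(Num_Proc())
    return MY_RANK % q     # column index under row-major numbering

# With p = 16 (q = 4), rank 5 is P(1,1).
print(My_Proc_Row(), My_Proc_Col())   # -> 1 1
```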
8. Grid Communication Details (cont.)
- With a grid topology, some dest values are not allowed.
- The torus topology is used in the majority of the algorithms.
- The neighbors of Pi,j are (see the sketch after this list):
  - North neighbor: P(i−1 mod q, j)
  - South neighbor: P(i+1 mod q, j)
  - West neighbor: P(i, j−1 mod q)
  - East neighbor: P(i, j+1 mod q)
- Often the modulo is omitted and modulo q is assumed.
- Each Send call has a matching Recv call:
  - Recv(src, addr, L)
- As in Chapter 4, the following are used:
  - Non-blocking sends
  - Both blocking and non-blocking receives
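
The neighbor arithmetic above is easy to get wrong at the wraparound edges, so here is a minimal sketch (pure Python, no message passing) of the torus neighbor computation:

```python
def torus_neighbors(i, j, q):
    """Return the (row, col) coordinates of the four neighbors
    of P(i, j) on a q x q torus, using wraparound (mod q)."""
    return {
        "North": ((i - 1) % q, j),
        "South": ((i + 1) % q, j),
        "West":  (i, (j - 1) % q),
        "East":  (i, (j + 1) % q),
    }

# On a 4x4 torus, the north neighbor of P(0, 2) wraps to P(3, 2).
print(torus_neighbors(0, 2, 4))
```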
9. Grid Communication Details (cont.)
- Broadcast command from Pi,j to all processors in row i (a concrete MPI-based sketch follows this list):
  - BroadcastRow(i, j, srcaddr, dstaddr, L)
  - srcaddr is the address in Pi,j of the message.
  - dstaddr is where the message is stored in the receiving processors.
  - L is the length of the message.
- Broadcast command from Pi,j to all PEs in column j:
  - BroadcastCol(i, j, srcaddr, dstaddr, L)
- Technically, a row/column broadcast is a multi-cast.
- With a torus, each row and column is a ring, so we can use the pipelined implementation of a ring broadcast from Section 3.3.4.
- If links are bidirectional, then the broadcast can be sped up by sending it in both directions.
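
As one concrete way to obtain row/column collectives on a real platform (a sketch only: the slides' primitives are abstract, and this substitutes MPI's built-in broadcast for the pipelined ring broadcast of Section 3.3.4), mpi4py Cartesian communicators can be split into row and column sub-communicators:

```python
# Sketch assuming mpi4py is installed and launched as, e.g.:
#   mpirun -np 16 python grid_bcast.py   (hypothetical script name)
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
q = int(comm.Get_size() ** 0.5)          # assume p = q * q
cart = comm.Create_cart(dims=[q, q], periods=[True, True])
i, j = cart.Get_coords(cart.Get_rank())

row_comm = cart.Sub([False, True])       # communicator over row i
col_comm = cart.Sub([True, False])       # communicator over column j

buf = np.empty(4, dtype="d")
if j == 0:
    buf[:] = i                           # row leaders fill the buffer
# BroadcastRow(i, 0, ...): the root is the processor in column 0 of each row
row_comm.Bcast(buf, root=0)
```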
10. Grid Communication Details (cont.)
- If the topology is not a torus, but links are bidirectional, then row and column broadcasts can still be implemented by sending the message in both directions.
- If the topology is not a torus and links are not bidirectional, then these broadcast functions cannot be implemented.
- Simplifying assumption: if a processor calls a broadcast function but is not in the row/column of the broadcast, the processor returns immediately.
- This allows us to omit the column/row processor number in calls.
11. Matrix Multiplication on a Grid
- Assume that the matrix is stored on a square q×q grid with p = q² processors.
- Assume the matrix is also square with dimensions n×n and that q divides n.
- If m = n/q, the standard approach is to partition each matrix over the grid by assigning an m×m block of each matrix to each processor (a numpy sketch of this block indexing follows Figure 5.2).
- Specifically, processor Pi,j for 0 ≤ i, j < q holds the matrix elements Ak,l, Bk,l, and Ck,l with im ≤ k < (i+1)m and jm ≤ l < (j+1)m.
- This is illustrated on the next slide.
12. [Figure 5.2: block distribution of the n×n matrices over the q×q grid of processors]
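
A minimal numpy sketch of this block assignment (the names are illustrative; the helper just extracts the m×m block owned by Pi,j):

```python
import numpy as np

n, q = 8, 4            # matrix size and grid size; q divides n
m = n // q             # block size per processor

A = np.arange(n * n, dtype=float).reshape(n, n)

def block(M, i, j):
    """The m x m block of M owned by processor P(i, j)."""
    return M[i * m:(i + 1) * m, j * m:(j + 1) * m]

print(block(A, 1, 2))  # rows 2..3, columns 4..5 of A
```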
13. Outer-Product Algorithm
- While standard matrix multiplication is computed using a sequence of inner-product computations, we consider the outer-product order of computing these products.
- Assuming all Ci,j are initialized to 0, the outer-product order is (a numpy sketch follows this list):
  - for k = 0 to n−1 do
  -   for i = 0 to n−1 do
  -     for j = 0 to n−1 do
  -       Ci,j = Ci,j + Ai,k × Bk,j
- This outer-product order leads to a simple and elegant parallelization on a torus of processors.
- At each step k, all Ci,j are updated.
- Since all matrices are partitioned into q² blocks of size m×m, the same order can be applied at the level of blocks.
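
A minimal numpy sketch of the outer-product loop order (each iteration of k adds one rank-1 update; this is the sequential order, not yet the parallel algorithm):

```python
import numpy as np

n = 4
rng = np.random.default_rng(0)
A, B = rng.random((n, n)), rng.random((n, n))

C = np.zeros((n, n))
for k in range(n):
    # Rank-1 update: the outer product of column k of A and row k of B
    C += np.outer(A[:, k], B[k, :])

assert np.allclose(C, A @ B)  # same result as standard multiplication
```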
14. Outer-Product Algorithm
- This algorithm can be summarized in terms of matrix blocks and block matrix multiplications as: at step k, Ci,j ← Ci,j + Ai,k × Bk,j for all 0 ≤ i, j < q, where the indices now refer to m×m blocks.
- Next we consider executing this algorithm on a torus of p = q² processors.
- Processor Pi,j holds block Ci,j and updates it at each step.
- To perform step k, Pi,j needs blocks Ai,k and Bk,j.
- At step k = j, Pi,j already holds block Ai,k.
- For all other steps, Pi,j must obtain Ai,k from Pi,k.
15. Outer-Product Algorithm
- This is true for all processors Pi,j with j ≠ k.
- Note this means that at step k, processor Pi,k must broadcast its block of matrix A to all processors Pi,j in its row.
- This is true for all rows i as well.
- Similarly, blocks of matrix B must be broadcast at step k by Pk,j to all processors in column j, for all j.
- The resulting communication pattern is shown on the next slide.
- The outer-product algorithm is given as Algorithm 5.1 on the slide after that.
16. [Figure: communication pattern at step k, with row broadcasts of A blocks from column k and column broadcasts of B blocks from row k]
17. [Algorithm 5.1: the outer-product matrix multiplication algorithm on a torus of processors]
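
Since the slide containing Algorithm 5.1 did not survive the transcript, here is a minimal sequential numpy simulation of its structure (an illustrative sketch, not the book's pseudocode): the row/column broadcasts are modeled by reading the broadcasting processor's block directly, and MatrixMultiplyAdd is a plain block multiply-accumulate.

```python
import numpy as np

def outer_product_matmul(A, B, q):
    """Simulate Algorithm 5.1 on a q x q torus: at step k, column k of
    the grid 'broadcasts' its A blocks along rows, row k 'broadcasts'
    its B blocks along columns, then every Pi,j updates its C block."""
    n = A.shape[0]
    m = n // q
    C = np.zeros((n, n))

    def blk(M, i, j):                    # m x m block owned by P(i, j)
        return M[i*m:(i+1)*m, j*m:(j+1)*m]

    for k in range(q):                   # the q steps of the algorithm
        for i in range(q):
            for j in range(q):
                bufferA = blk(A, i, k)   # received via BroadcastRow from Pi,k
                bufferB = blk(B, k, j)   # received via BroadcastCol from Pk,j
                blk(C, i, j)[...] += bufferA @ bufferB   # MatrixMultiplyAdd
    return C

rng = np.random.default_rng(1)
n, q = 8, 4
A, B = rng.random((n, n)), rng.random((n, n))
assert np.allclose(outer_product_matmul(A, B, q), A @ B)
```

In the real algorithm the three nested loops over (i, j) run concurrently, one iteration per processor; only the loop over k is sequential.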
18. Outer-Product Algorithm Steps
- Statement 1 declares the square blocks of the three matrices stored by each processor.
- The matrix C is assumed to be initialized to zero.
- Arrays A and B contain the sub-matrices assigned to the PEs in Figure 5.2.
- Statement 2 declares two helper buffers used by the PEs.
- In Statement 3, each PE determines the value of q.
- In Statements 4-5, PEs determine their location on the torus.
- The q steps of the program occur in lines 7-19, inside the loop of line 6.
- In Statements 7-8, all q processors in column k broadcast (in parallel) their block of A to the processors in their rows.
- Statements 9-10 implement similar broadcasts of blocks of matrix B along processor columns.
19. Outer-Product Algorithm Steps (cont.)
- Comments:
  - When the preceding broadcasts are complete, each PE holds all the blocks it needs.
  - Each processor multiplies a block of A by a block of B and adds the result to the block of C for which it is responsible.
  - The algorithm uses the notation MatrixMultiplyAdd() for the PE block operation Ci,j ← Ci,j + Ai,k × Bk,j.
- In lines 12-13, if the PE is on both row k and column k, then it can just multiply the two blocks of A and B that it holds.
- Lines 14-15: if the PE is on row k but not on column k, then it multiplies the block of A that it receives with the block of B that it holds.
20. Outer-Product Algorithm Steps (cont.)
- Lines 16-17: similarly, if a PE is on column k but not on row k, then it multiplies the block of A it holds with the block of B it just received.
- Lines 18-19 (general case): if a PE is on neither row k nor column k, then it multiplies the block of A it receives with the block of B it receives.
- Generalization of the matrix multiply:
  - By allotting rectangular blocks of matrices A and B to processors, the preceding algorithm can be adapted to non-square matrix products.
21. Performance Analysis of the Algorithm
- At each of the q passes through the loop, each processor is involved in two broadcasts of messages containing m² elements to q−1 processors.
- Using the pipelined broadcast implementation on a ring from Section 3.3.4, the time for each broadcast is
  - Tbcast = (√((q−2)L) + √(m²b))²,
  - where L is the communication startup cost and b is the time to communicate one matrix element.
- After the two broadcasts, each processor computes an m×m matrix multiplication, which takes m³w time, where w is the computation time for a basic matrix operation.
22. Performance Analysis (cont.)
- After the 0th step of the loop, communication at step k can always occur in parallel with computation at step k−1.
- No communication occurs during the last computation step.
- The total execution time (for the 1-port model) is therefore
  - T(n) = 2Tbcast + (q−1) × max(2Tbcast, m³w) + m³w.
- For the 4-port model, both broadcasts can occur concurrently; the execution time is obtained by removing the factor of 2 in front of each Tbcast term.
- Recalling p = q² and m = n/q, as n becomes large, T(n) ≈ q × m³w = n³w/p, and the speedup approaches p.
- This indicates that the algorithm achieves an asymptotic efficiency of 1 (evaluated numerically in the sketch below).
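
A small sketch evaluating this performance model (the parameter values L, b, w below are made up for illustration; the formulas are the ones from the two slides above):

```python
import math

def t_bcast(q, m, L, b):
    """Pipelined ring broadcast time from Section 3.3.4 for m*m elements."""
    return (math.sqrt((q - 2) * L) + math.sqrt(m * m * b)) ** 2

def total_time(n, q, L, b, w, ports=1):
    m = n // q
    comm = (2 if ports == 1 else 1) * t_bcast(q, m, L, b)  # 4-port drops the 2
    comp = m ** 3 * w
    # First communication is exposed; the rest overlap the previous computation.
    return comm + (q - 1) * max(comm, comp) + comp

L, b, w = 1e-4, 1e-8, 1e-9       # illustrative startup, per-element, per-op costs
q = 8                            # p = 64 processors
for n in (1024, 8192):
    T = total_time(n, q, L, b, w)
    seq = n ** 3 * w             # sequential time
    print(n, "efficiency:", round(seq / (q * q * T), 3))
# Efficiency rises toward 1 as n grows, as the analysis predicts.
```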
23. Grid vs. Ring
- An asymptotically optimal matrix multiplication algorithm was already given for the ring.
- The ring is a simpler topology, so why bother implementing another asymptotically optimal matrix multiplication algorithm for the grid?
- Since matrix multiplication has O(n³) computation and only O(n²) data, obtaining an asymptotically optimal algorithm is relatively easy.
- However, communication costs that become negligible as n becomes large do matter for practical values of n.
- As discussed below, the grid topology is better than the ring topology for reducing this practical communication cost.
- A detailed analysis (CLR, p. 155) shows that the algorithm on the grid spends a factor of √p/2 less time communicating than the algorithm on a ring.
24. Grid vs. Ring (cont.)
- With the 4-port model, this factor is √p.
- This advantage can be attributed to the presence of more network links and to the fact that many of these links can be used concurrently.
- For matrix multiplication, the 2-D data distribution induced by the grid topology is inherently better than the 1-D distribution induced by a ring, regardless of the underlying physical topology.
- In particular, the total number of elements sent over the network is lower by at least a factor of √p/2 than with the algorithm on a ring.
- The implication is that, for purposes of matrix multiplication, the grid topology and its induced 2-D data distribution are at least as good as, and possibly better than, the ring topology.
25. Grid vs. Ring (cont.)
- As a result, when implementing parallel matrix multiplication on a physical topology on which all communications are serialized (e.g., a bus architecture), one should opt for a logical grid topology with a 2-D data distribution to reduce the amount of transferred data.