1. On Grid-based Matrix Partitioning for Networks of Heterogeneous Processors
Alexey Lastovetsky
School of Computer Science and Informatics, University College Dublin
Alexey.Lastovetsky_at_ucd.ie
2. Heterogeneous parallel computing
- Heterogeneity of processors
- The processors run at different speeds
- An even distribution of computations does not balance the processors' load
- The performance is determined by the slowest processor
- Data must be distributed unevenly
- So that each processor performs a volume of computation proportional to its speed
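As an illustrative sketch (the function name and the rounding scheme are my own, not from the slides), a speed-proportional distribution of n computational units could look like:

```python
def proportional_distribution(n, speeds):
    """Distribute n computational units among processors so that each
    processor's share is proportional to its speed (illustrative sketch)."""
    total = sum(speeds)
    # Ideal real-valued shares, rounded down first.
    shares = [int(n * s / total) for s in speeds]
    # Hand the remaining units to the processors with the largest
    # fractional parts, so the counts sum exactly to n.
    by_fraction = sorted(range(len(speeds)),
                         key=lambda i: (n * speeds[i] / total) - shares[i],
                         reverse=True)
    for i in by_fraction[:n - sum(shares)]:
        shares[i] += 1
    return shares
```

The largest-fractional-part rounding keeps every processor within one unit of its ideal share.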
3. Constant performance models of heterogeneous processors
- The simplest performance model of heterogeneous processors
- p, the number of processors
- s1, s2, ..., sp, the speeds of the processors (positive constants)
- The speed can be absolute: the number of computational units performed by the processor per time unit
- Or relative; some models use the execution time instead
4. Data distribution problems with constant models of heterogeneous processors
- Typical design of heterogeneous parallel algorithms
- The problem of distribution of computations in proportion to the speed of processors
- The problem of partitioning of some mathematical objects
- Sets, matrices, graphs, geometric figures, etc.
5. Partitioning matrices with constant models of heterogeneous processors
- Matrices are the most widely used mathematical objects in scientific computing
- The partitioning problems studied so far mainly deal with matrices
- Matrix partitioning in one dimension over a 1D arrangement of processors is often reduced to partitioning sets or well-ordered sets
- The design of algorithms often results in matrix partitioning problems that do not impose the restriction of partitioning in one dimension
- E.g., in parallel linear algebra for heterogeneous platforms
- We will use matrix multiplication, a simple but very important linear algebra kernel
6. Partitioning matrices with constant models of heterogeneous processors (ctd)
- A heterogeneous matrix multiplication algorithm is typically a modification of some homogeneous one
- Most often, of the 2D block cyclic ScaLAPACK algorithm
7. Partitioning matrices with constant models of heterogeneous processors (ctd)
- 2D block cyclic ScaLAPACK MM algorithm (ctd)
8. Partitioning matrices with constant models of heterogeneous processors (ctd)
- 2D block cyclic ScaLAPACK MM algorithm (ctd)
- The matrices are identically partitioned into rectangular generalized blocks of size (p·r)×(q·r)
- Each generalized block forms a 2D p×q grid of r×r blocks
- There is a 1-to-1 mapping between this grid of blocks and the p×q processor grid
- At each step of the algorithm
- Each processor not owning the pivot row and column receives horizontally (n/p)·r elements of matrix A and vertically (n/q)·r elements of matrix B
- => in total, (n/p)·r + (n/q)·r elements, i.e., proportional to the half-perimeter of the rectangle area allocated to the processor
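As a tiny illustrative sketch (function name is my own), the per-step receive count stated above is:

```python
def per_step_elements(n, p, q, r):
    """Elements received per step by a processor not owning the pivot
    row/column: (n/p)*r from matrix A plus (n/q)*r from matrix B.
    Assumes p and q divide n, as in the block-cyclic layout."""
    return (n // p) * r + (n // q) * r
```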
9. Partitioning matrices with constant models of heterogeneous processors (ctd)
- General design of heterogeneous modifications
- Matrices A, B, and C are identically partitioned into equal rectangular generalized blocks
- The generalized blocks are identically partitioned into rectangles so that
- There is a one-to-one mapping between the rectangles and the processors
- The area of each rectangle is (approximately) proportional to the speed of the processor that owns the rectangle
- Then, the algorithm follows the steps of its homogeneous prototype
10. Partitioning matrices with constant models of heterogeneous processors (ctd)
11. Partitioning matrices with constant models of heterogeneous processors (ctd)
- Why partition the GBs in proportion to the speed?
- At each step, updating one r×r block of matrix C needs the same amount of computation for all blocks
- => the load will be perfectly balanced if the number of blocks updated by each processor is proportional to its speed
- The number of blocks updated by the i-th processor is n_i × N_GB, where n_i is the area of the GB partition allocated to the i-th processor (measured in r×r blocks) and N_GB is the number of generalized blocks
- => if the area of each GB partition is proportional to the speed of the owning processor, the load will be perfectly balanced
12. Partitioning matrices with constant models of heterogeneous processors (ctd)
- A generalized block from the partitioning point of view
- An integer-valued rectangular partitioning problem
- If we need an asymptotically optimal solution, the problem can be reduced to a geometrical problem of optimal partitioning of a real-valued rectangle
- The asymptotically optimal integer-valued solution can be obtained by rounding off an optimal real-valued solution of the geometrical partitioning problem
13. Geometrical partitioning problem
- The general geometrical partitioning problem
- Given a set of p processors P1, P2, ..., Pp, the relative speed of each of which is characterized by a positive constant s_i (with s_1 + s_2 + ... + s_p = 1)
- Partition a unit square into p rectangles so that
- There is a one-to-one mapping between the rectangles and the processors
- The area of the rectangle allocated to processor Pi is equal to s_i
- The partitioning minimizes Σ_i (w_i + h_i), where w_i is the width and h_i is the height of the rectangle allocated to processor Pi
14. Geometrical partitioning problem (ctd)
- Motivation behind the formulation
- Proportionality of the areas to the speeds balances the load of the processors
- Minimization of the sum of half-perimeters
- Multiple partitionings can balance the load
- This criterion minimizes the total volume of communications
- At each step of MM, each receiving processor receives data proportional to the half-perimeter of its rectangle
- => in total, the communicated data is proportional to Σ_i (w_i + h_i)
15. Geometrical partitioning problem (ctd)
- Motivation behind the formulation (ctd)
- An alternative option: minimizing the maximal half-perimeter (relevant for parallel communications)
- The use of a unit square instead of a rectangle entails no loss of generality
- The optimal solution for an arbitrary rectangle is obtained by straightforward scaling of that for the unit square
- Proposition. The general geometrical partitioning problem is NP-complete.
16. Restricted geometrical partitioning problems
- Restricted problems having polynomial solutions
- Column-based
- Grid-based
- Column-based partitioning
- The rectangles make up columns
- Has an optimal solution of complexity O(p^3)
17. Column-based partitioning problem
18. Column-based partitioning problem (ctd)
- A more restricted form of the column-based partitioning problem
- The processors are already arranged into a set of columns
- Algorithm 1: Optimal partitioning of a unit square between p heterogeneous processors arranged into c columns, each of which is made of r_j processors, j = 1, ..., c
- Let the relative speed of the i-th processor from the j-th column, P_ij, be s_ij
- Then, we first partition the unit square into c vertical rectangular slices such that the width of the j-th slice is w_j = Σ_{i=1..r_j} s_ij
19. Column-based partitioning problem (ctd)
- Algorithm 1 (ctd)
- Second, each vertical slice is partitioned independently into rectangles in proportion to the speeds of the processors in the corresponding processor column
- Algorithm 1 is of linear complexity
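A minimal runnable sketch of Algorithm 1 (assuming the relative speeds are normalized to sum to 1; the function name is illustrative):

```python
def column_based_partition(speed_columns):
    """speed_columns[j][i] = relative speed of the i-th processor in the
    j-th column; speeds are assumed to sum to 1 over all processors.
    Returns one (x, y, w, h) rectangle of the unit square per processor."""
    rects = []
    x = 0.0
    for col in speed_columns:
        w = sum(col)          # slice width = total speed of the column
        y = 0.0
        for s in col:
            h = s / w         # height proportional to speed within the slice
            rects.append((x, y, w, h))
            y += h
        x += w
    return rects
```

Each rectangle's area w·h then equals the speed of its processor, which is exactly the load-balance condition of the geometrical problem.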
20. Grid-based partitioning problem
- Grid-based partitioning problem
- The heterogeneous processors form a two-dimensional grid
- There exist r and c such that any vertical line crossing the unit square will pass through exactly r rectangles and any horizontal line crossing the square will pass through exactly c rectangles
21. Grid-based partitioning problem (ctd)
- Proposition. Let a grid-based partitioning of the unit square between p heterogeneous processors form c columns, each of which consists of r processors, p = r·c. Then, the sum of half-perimeters of the rectangles of the partitioning will be equal to (r + c).
- The shape r×c of the processor grid formed by any optimal grid-based partitioning will minimize (r + c).
- The sum of half-perimeters of the rectangles of the optimal grid-based partitioning does not depend on the mapping of the processors onto the nodes of the grid.
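The proposition can be checked numerically with a small sketch (helper name is my own): for c columns of r rectangles each, the column widths sum to 1 and each column's heights sum to 1, so the half-perimeter sum is r·1 + c·1 = r + c regardless of the individual widths and heights.

```python
def half_perimeter_sum(widths, heights_per_column):
    """Sum of half-perimeters (w + h) over all rectangles of a
    column-based partitioning of the unit square."""
    total = 0.0
    for w, heights in zip(widths, heights_per_column):
        for h in heights:
            total += w + h
    return total
```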
22. Grid-based partitioning problem (ctd)
- Algorithm 2: Optimal grid-based partitioning of a unit square between p heterogeneous processors
- Step 1: Find the optimal shape r×c of the processor grid such that p = r·c and (r + c) is minimal
- Step 2: Map the processors onto the nodes of the grid
- Step 3: Apply Algorithm 1 for the optimal column-based partitioning of the unit square to this r×c arrangement of the p heterogeneous processors
- The correctness of Algorithm 2 is obvious
- Algorithm 2 returns a column-based partitioning
23. Grid-based partitioning problem (ctd)
- The optimal grid-based partitioning can be seen as a restricted form of column-based partitioning.
24. Grid-based partitioning problem (ctd)
- Algorithm 3: Finding r and c such that p = r·c and (r + c) is minimal
- r = floor(sqrt(p))
- while (r > 1)
-   if (p mod r == 0)
-     goto stop
-   else
-     r--
- stop: c = p / r
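A runnable version of Algorithm 3 (starting r at floor(√p) is the natural choice, since for p = r·c the sum r + c is minimized by the largest divisor of p not exceeding √p):

```python
import math

def optimal_grid_shape(p):
    """Algorithm 3: find r, c with p == r * c minimizing r + c."""
    r = math.isqrt(p)          # floor(sqrt(p))
    while r > 1 and p % r != 0:
        r -= 1                 # step down to the nearest divisor of p
    return r, p // r
```

For a prime p the loop falls through to r = 1, so the grid degenerates into a single column, as expected.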
25. Grid-based partitioning problem (ctd)
- Proposition. Algorithm 3 is correct.
- Proposition. The complexity of Algorithm 2 can be bounded by O(p^(3/2)).
26. Experimental results
- Specifications of sixteen Linux computers on which the matrix multiplication is executed
27. Experimental results (ctd)
28. Application to Cartesian partitioning
- Cartesian partitioning
- A column-based partitioning, the rectangles of which also make up rows.
29. Application to Cartesian partitioning (ctd)
- Cartesian partitioning
- Plays an important role in the design of heterogeneous parallel algorithms (e.g., in scalable algorithms)
- The Cartesian partitioning problem
- Very difficult
- There may be no Cartesian partitioning perfectly balancing the load of the processors
30. Application to Cartesian partitioning (ctd)
- Cartesian partitioning problem in general form
- Given p processors, the speed of each of which is characterized by a given positive constant
- Find a Cartesian partitioning of a unit square such that
- There is a 1-to-1 mapping between the rectangles and the processors
- The partitioning minimizes the sum of half-perimeters, Σ_i (w_i + h_i)
31. Application to Cartesian partitioning (ctd)
- The Cartesian partitioning problem
- Not even studied in the general form
- If the shape r×c is given, it is proved NP-complete
- It is unclear whether there exists a polynomial algorithm when both the shape and the processor mapping are given
- There exists an optimal Cartesian partitioning with the processors arranged in a non-increasing order of speed
32. Application to Cartesian partitioning (ctd)
- Approximate solutions of the Cartesian partitioning problem are based on the following observation
- Let the speed matrix s_ij of the given r×c processor arrangement be of rank one
- Then there exists a Cartesian partitioning perfectly balancing the load of the processors
33. Application to Cartesian partitioning (ctd)
- Algorithm 5: Finding an approximate solution of the simplified Cartesian problem (when only the shape r×c is given)
- Step 1: Arrange the processors in a non-increasing order of speed
- Step 2: For this arrangement, let h_i = Σ_j s_ij / Σ_{i,j} s_ij and w_j = Σ_i s_ij / Σ_{i,j} s_ij be the parameters of the partitioning
- Step 3: Calculate the areas h_i × w_j of the rectangles of this partitioning
34. Application to Cartesian partitioning (ctd)
- Algorithm 5: Finding an approximate solution of the simplified Cartesian problem when only the shape r×c is given (ctd)
- Step 4: Re-arrange the processors so that the faster processors own the larger rectangles (the speeds s_ij follow the same ordering as the areas h_i × w_j)
- Step 5: If Step 4 does not change the arrangement of the processors, then return the current partitioning and stop; else go to Step 2
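The steps above can be sketched as follows (assumptions: speeds need not be normalized, the re-arrangement in Step 4 maps faster processors to larger rectangles, and an iteration cap guards against non-convergence; names are my own):

```python
def cartesian_heuristic(speeds, r, c):
    """Iterative heuristic for the simplified Cartesian problem with a
    given r x c shape. Returns (h, w) per rectangle (row-major) and the
    final row-major arrangement of speeds."""
    assert len(speeds) == r * c
    order_desc = sorted(speeds, reverse=True)
    # Step 1: initial row-major arrangement in non-increasing speed order.
    arrangement = list(order_desc)
    for _ in range(100):   # guard: the heuristic is not guaranteed to converge
        grid = [arrangement[i * c:(i + 1) * c] for i in range(r)]
        total = sum(arrangement)
        # Step 2: row heights and column widths from the current arrangement.
        h = [sum(row) / total for row in grid]
        w = [sum(col) / total for col in zip(*grid)]
        # Step 3: areas of the rectangles, row-major.
        areas = [h[i] * w[j] for i in range(r) for j in range(c)]
        # Step 4: re-arrange so the fastest processor owns the largest area.
        pos_by_area = sorted(range(r * c), key=areas.__getitem__, reverse=True)
        new = [0.0] * (r * c)
        for rank, pos in enumerate(pos_by_area):
            new[pos] = order_desc[rank]
        # Step 5: stop when the arrangement is stable.
        if new == arrangement:
            break
        arrangement = new
    rects = [(h[k // c], w[k % c]) for k in range(r * c)]
    return rects, arrangement
```

On a rank-one speed matrix the very first arrangement is already stable and the resulting areas are exactly proportional to the speeds.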
35. Application to Cartesian partitioning (ctd)
- Proposition. Let a Cartesian partitioning of the unit square between p heterogeneous processors form c columns, each of which consists of r processors, p = r·c. Then, the sum of half-perimeters of the rectangles of the partitioning will be (r + c).
- The proof is a trivial exercise
- Minimization of the communication cost does not depend on the speeds of the processors but only on their number
- => minimization of communication cost and minimization of computation cost are two independent problems
- Any Cartesian partitioning minimizing (r + c) will optimize the communication cost
36. Application to Cartesian partitioning (ctd)
- Now we can extend Algorithm 5 by adding a 0-th step: finding the optimal shape r×c
- The modified algorithm returns an approximate solution of the extended Cartesian problem, aimed at minimizing both computation and communication cost
- The modified Algorithm 5 will return an optimal solution if the speed matrix for the arrangement is a rank-one matrix