1. On Grid-based Matrix Partitioning for Networks of Heterogeneous Processors
Alexey Lastovetsky
School of Computer Science and Informatics, University College Dublin
Alexey.Lastovetsky_at_ucd.ie
2. Heterogeneous parallel computing
- Heterogeneity of processors
- The processors run at different speeds
- An even distribution of computations does not balance the processors' load
- The performance is determined by the slowest processor
- Data must be distributed unevenly
- So that each processor performs a volume of computation proportional to its speed
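As an illustrative sketch (the function name and the rounding scheme are my own, not from the slides), a speed-proportional distribution of n computational units could look like:

```python
def proportional_distribution(n, speeds):
    """Distribute n computational units among processors so that each
    processor's share is proportional to its speed (illustrative sketch)."""
    total = sum(speeds)
    # Ideal real-valued shares, rounded down first.
    shares = [int(n * s / total) for s in speeds]
    # Hand the remaining units to the processors with the largest
    # fractional parts, so the counts sum exactly to n.
    by_fraction = sorted(range(len(speeds)),
                         key=lambda i: (n * speeds[i] / total) - shares[i],
                         reverse=True)
    for i in by_fraction[:n - sum(shares)]:
        shares[i] += 1
    return shares
```

The largest-fractional-part rounding keeps every processor within one unit of its ideal share.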
3. Constant performance models of heterogeneous processors
- The simplest performance model of heterogeneous processors
- p, the number of processors
- s1, s2, ..., sp, the speeds of the processors (positive constants)
- The speed can be absolute: the number of computational units performed by the processor per time unit
- Or relative; some models use the execution time instead
4. Data distribution problems with constant models of heterogeneous processors
- Typical design of heterogeneous parallel algorithms
- The problem of distribution of computations in proportion to the speed of processors
- The problem of partitioning of some mathematical objects
- Sets, matrices, graphs, geometric figures, etc.
5. Partitioning matrices with constant models of heterogeneous processors
- Matrices are the most widely used mathematical objects in scientific computing
- The partitioning problems studied so far mainly deal with matrices
- Matrix partitioning in one dimension over a 1D arrangement of processors is often reduced to partitioning sets or well-ordered sets
- The design of algorithms often results in matrix partitioning problems that do not impose the restriction of partitioning in one dimension
- E.g., in parallel linear algebra for heterogeneous platforms
- We will use matrix multiplication, a simple but very important linear algebra kernel
6. Partitioning matrices with constant models of heterogeneous processors (ctd)
- A heterogeneous matrix multiplication algorithm is typically a modification of some homogeneous one
- Most often, of the 2D block cyclic ScaLAPACK algorithm
7. Partitioning matrices with constant models of heterogeneous processors (ctd)
- 2D block cyclic ScaLAPACK MM algorithm (ctd)
8. Partitioning matrices with constant models of heterogeneous processors (ctd)
- 2D block cyclic ScaLAPACK MM algorithm (ctd)
- The matrices are identically partitioned into rectangular generalized blocks of size (p·r)×(q·r)
- Each generalized block forms a 2D p×q grid of r×r blocks
- There is a 1-to-1 mapping between this grid of blocks and the p×q processor grid
- At each step of the algorithm
- Each processor not owning the pivot row and column receives horizontally (n/p)·r elements of matrix A and vertically (n/q)·r elements of matrix B
- => in total, (n/p)·r + (n/q)·r elements, i.e., proportional to the half-perimeter of the rectangle area allocated to the processor
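As a tiny illustrative sketch (function name is my own), the per-step receive count stated above is:

```python
def per_step_elements(n, p, q, r):
    """Elements received per step by a processor not owning the pivot
    row/column: (n/p)*r from matrix A plus (n/q)*r from matrix B.
    Assumes p and q divide n, as in the block-cyclic layout."""
    return (n // p) * r + (n // q) * r
```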
9. Partitioning matrices with constant models of heterogeneous processors (ctd)
- General design of heterogeneous modifications
- Matrices A, B, and C are identically partitioned into equal rectangular generalized blocks
- The generalized blocks are identically partitioned into rectangles so that
- There is a one-to-one mapping between the rectangles and the processors
- The area of each rectangle is (approximately) proportional to the speed of the processor that owns the rectangle
- Then, the algorithm follows the steps of its homogeneous prototype
10. Partitioning matrices with constant models of heterogeneous processors (ctd)
11. Partitioning matrices with constant models of heterogeneous processors (ctd)
- Why partition the GBs in proportion to the speed?
- At each step, updating one r×r block of matrix C needs the same amount of computation for all blocks
- => the load will be perfectly balanced if the number of blocks updated by each processor is proportional to its speed
- The number of blocks updated by the i-th processor is n_i × N_GB, where n_i is the area of the GB partition allocated to the i-th processor (measured in r×r blocks) and N_GB is the number of generalized blocks
- => if the area of each GB partition is proportional to the speed of the owning processor, the load will be perfectly balanced
12. Partitioning matrices with constant models of heterogeneous processors (ctd)
- A generalized block from the partitioning point of view
- An integer-valued rectangular partitioning problem
- If we need an asymptotically optimal solution, the problem can be reduced to a geometrical problem of optimal partitioning of a real-valued rectangle
- The asymptotically optimal integer-valued solution can be obtained by rounding off an optimal real-valued solution of the geometrical partitioning problem
13. Geometrical partitioning problem
- The general geometrical partitioning problem
- Given a set of p processors P1, P2, ..., Pp, the relative speed of each of which is characterized by a positive constant s_i (with s_1 + s_2 + ... + s_p = 1)
- Partition a unit square into p rectangles so that
- There is a one-to-one mapping between the rectangles and the processors
- The area of the rectangle allocated to processor Pi is equal to s_i
- The partitioning minimizes Σ_i (w_i + h_i), where w_i is the width and h_i is the height of the rectangle allocated to processor Pi
14. Geometrical partitioning problem (ctd)
- Motivation behind the formulation
- Proportionality of the areas to the speeds balances the load of the processors
- Minimization of the sum of half-perimeters
- Multiple partitionings can balance the load
- This criterion minimizes the total volume of communications
- At each step of MM, each receiving processor receives data proportional to the half-perimeter of its rectangle
- => in total, the communicated data is proportional to Σ_i (w_i + h_i)
15. Geometrical partitioning problem (ctd)
- Motivation behind the formulation (ctd)
- An alternative option: minimizing the maximal half-perimeter (relevant for parallel communications)
- The use of a unit square instead of a rectangle entails no loss of generality
- The optimal solution for an arbitrary rectangle is obtained by straightforward scaling of that for the unit square
- Proposition. The general geometrical partitioning problem is NP-complete.
16. Restricted geometrical partitioning problems
- Restricted problems having polynomial solutions
- Column-based
- Grid-based
- Column-based partitioning
- The rectangles make up columns
- Has an optimal solution of complexity O(p^3)
17. Column-based partitioning problem
18. Column-based partitioning problem (ctd)
- A more restricted form of the column-based partitioning problem
- The processors are already arranged into a set of columns
- Algorithm 1: Optimal partitioning of a unit square between p heterogeneous processors arranged into c columns, each of which is made of r_j processors, j = 1, ..., c
- Let the relative speed of the i-th processor from the j-th column, P_ij, be s_ij
- Then, we first partition the unit square into c vertical rectangular slices such that the width of the j-th slice is w_j = Σ_{i=1..r_j} s_ij
19. Column-based partitioning problem (ctd)
- Algorithm 1 (ctd)
- Second, each vertical slice is partitioned independently into rectangles in proportion to the speeds of the processors in the corresponding processor column
- Algorithm 1 is of linear complexity
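A minimal runnable sketch of Algorithm 1 (assuming the relative speeds are normalized to sum to 1; the function name is illustrative):

```python
def column_based_partition(speed_columns):
    """speed_columns[j][i] = relative speed of the i-th processor in the
    j-th column; speeds are assumed to sum to 1 over all processors.
    Returns one (x, y, w, h) rectangle of the unit square per processor."""
    rects = []
    x = 0.0
    for col in speed_columns:
        w = sum(col)          # slice width = total speed of the column
        y = 0.0
        for s in col:
            h = s / w         # height proportional to speed within the slice
            rects.append((x, y, w, h))
            y += h
        x += w
    return rects
```

Each rectangle's area w·h then equals the speed of its processor, which is exactly the load-balance condition of the geometrical problem.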
20. Grid-based partitioning problem
- Grid-based partitioning problem
- The heterogeneous processors form a two-dimensional grid
- There exist r and c such that any vertical line crossing the unit square will pass through exactly r rectangles and any horizontal line crossing the square will pass through exactly c rectangles
21. Grid-based partitioning problem (ctd)
- Proposition. Let a grid-based partitioning of the unit square between p heterogeneous processors form c columns, each of which consists of r processors, p = r·c. Then, the sum of half-perimeters of the rectangles of the partitioning will be equal to (r + c).
- The shape r×c of the processor grid formed by any optimal grid-based partitioning will minimize (r + c).
- The sum of half-perimeters of the rectangles of the optimal grid-based partitioning does not depend on the mapping of the processors onto the nodes of the grid.
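The proposition can be checked numerically with a small sketch (helper name is my own): for c columns of r rectangles each, the column widths sum to 1 and each column's heights sum to 1, so the half-perimeter sum is r·1 + c·1 = r + c regardless of the individual widths and heights.

```python
def half_perimeter_sum(widths, heights_per_column):
    """Sum of half-perimeters (w + h) over all rectangles of a
    column-based partitioning of the unit square."""
    total = 0.0
    for w, heights in zip(widths, heights_per_column):
        for h in heights:
            total += w + h
    return total
```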
22. Grid-based partitioning problem (ctd)
- Algorithm 2: Optimal grid-based partitioning of a unit square between p heterogeneous processors
- Step 1: Find the optimal shape r×c of the processor grid such that p = r·c and (r + c) is minimal
- Step 2: Map the processors onto the nodes of the grid
- Step 3: Apply Algorithm 1 for the optimal column-based partitioning of the unit square to this r×c arrangement of the p heterogeneous processors
- The correctness of Algorithm 2 is obvious
- Algorithm 2 returns a column-based partitioning
23. Grid-based partitioning problem (ctd)
- The optimal grid-based partitioning can be seen as a restricted form of column-based partitioning.
24. Grid-based partitioning problem (ctd)
- Algorithm 3: Finding r and c such that p = r·c and (r + c) is minimal
- r = floor(sqrt(p))
- while (r > 1)
-   if (p mod r == 0)
-     goto stop
-   else
-     r--
- stop: c = p / r
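A runnable version of Algorithm 3 (starting r at floor(√p) is the natural choice, since for p = r·c the sum r + c is minimized by the largest divisor of p not exceeding √p):

```python
import math

def optimal_grid_shape(p):
    """Algorithm 3: find r, c with p == r * c minimizing r + c."""
    r = math.isqrt(p)          # floor(sqrt(p))
    while r > 1 and p % r != 0:
        r -= 1                 # step down to the nearest divisor of p
    return r, p // r
```

For a prime p the loop falls through to r = 1, so the grid degenerates into a single column, as expected.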
25. Grid-based partitioning problem (ctd)
- Proposition. Algorithm 3 is correct.
- Proposition. The complexity of Algorithm 2 can be bounded by O(p^(3/2)).
26. Experimental results
- Specifications of sixteen Linux computers on which the matrix multiplication is executed
27. Experimental results (ctd)
28. Application to Cartesian partitioning
- Cartesian partitioning
- A column-based partitioning, the rectangles of which also make up rows.
29. Application to Cartesian partitioning (ctd)
- Cartesian partitioning
- Plays an important role in the design of heterogeneous parallel algorithms (e.g., in scalable algorithms)
- The Cartesian partitioning problem
- Very difficult
- There may be no Cartesian partitioning perfectly balancing the load of the processors
30. Application to Cartesian partitioning (ctd)
- Cartesian partitioning problem in general form
- Given p processors, the speed of each of which is characterized by a given positive constant
- Find a Cartesian partitioning of a unit square such that
- There is a 1-to-1 mapping between the rectangles and the processors
- The partitioning minimizes the sum of half-perimeters, Σ_i (w_i + h_i)
31. Application to Cartesian partitioning (ctd)
- The Cartesian partitioning problem
- Not even studied in the general form
- If the shape r×c is given, it is proved NP-complete
- It is unclear whether there exists a polynomial algorithm when both the shape and the processor mapping are given
- There exists an optimal Cartesian partitioning with the processors arranged in a non-increasing order of speed
32. Application to Cartesian partitioning (ctd)
- Approximate solutions of the Cartesian partitioning problem are based on the following observation
- Let the speed matrix s_ij of the given r×c processor arrangement be of rank one
- Then there exists a Cartesian partitioning perfectly balancing the load of the processors
33. Application to Cartesian partitioning (ctd)
- Algorithm 5: Finding an approximate solution of the simplified Cartesian problem (when only the shape r×c is given)
- Step 1: Arrange the processors in a non-increasing order of speed
- Step 2: For this arrangement, let h_i = Σ_j s_ij / Σ_{i,j} s_ij and w_j = Σ_i s_ij / Σ_{i,j} s_ij be the parameters of the partitioning
- Step 3: Calculate the areas h_i × w_j of the rectangles of this partitioning
34. Application to Cartesian partitioning (ctd)
- Algorithm 5: Finding an approximate solution of the simplified Cartesian problem when only the shape r×c is given (ctd)
- Step 4: Re-arrange the processors so that the faster processors own the larger rectangles (the speeds s_ij follow the same ordering as the areas h_i × w_j)
- Step 5: If Step 4 does not change the arrangement of the processors, then return the current partitioning and stop; else go to Step 2
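The steps above can be sketched as follows (assumptions: speeds need not be normalized, the re-arrangement in Step 4 maps faster processors to larger rectangles, and an iteration cap guards against non-convergence; names are my own):

```python
def cartesian_heuristic(speeds, r, c):
    """Iterative heuristic for the simplified Cartesian problem with a
    given r x c shape. Returns (h, w) per rectangle (row-major) and the
    final row-major arrangement of speeds."""
    assert len(speeds) == r * c
    order_desc = sorted(speeds, reverse=True)
    # Step 1: initial row-major arrangement in non-increasing speed order.
    arrangement = list(order_desc)
    for _ in range(100):   # guard: the heuristic is not guaranteed to converge
        grid = [arrangement[i * c:(i + 1) * c] for i in range(r)]
        total = sum(arrangement)
        # Step 2: row heights and column widths from the current arrangement.
        h = [sum(row) / total for row in grid]
        w = [sum(col) / total for col in zip(*grid)]
        # Step 3: areas of the rectangles, row-major.
        areas = [h[i] * w[j] for i in range(r) for j in range(c)]
        # Step 4: re-arrange so the fastest processor owns the largest area.
        pos_by_area = sorted(range(r * c), key=areas.__getitem__, reverse=True)
        new = [0.0] * (r * c)
        for rank, pos in enumerate(pos_by_area):
            new[pos] = order_desc[rank]
        # Step 5: stop when the arrangement is stable.
        if new == arrangement:
            break
        arrangement = new
    rects = [(h[k // c], w[k % c]) for k in range(r * c)]
    return rects, arrangement
```

On a rank-one speed matrix the very first arrangement is already stable and the resulting areas are exactly proportional to the speeds.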
35. Application to Cartesian partitioning (ctd)
- Proposition. Let a Cartesian partitioning of the unit square between p heterogeneous processors form c columns, each of which consists of r processors, p = r·c. Then, the sum of half-perimeters of the rectangles of the partitioning will be (r + c).
- The proof is a trivial exercise
- Minimization of the communication cost does not depend on the speeds of the processors but only on their number
- => minimization of communication cost and minimization of computation cost are two independent problems
- Any Cartesian partitioning minimizing (r + c) will optimize the communication cost
36. Application to Cartesian partitioning (ctd)
- Now we can extend Algorithm 5 by adding a 0-th step: finding the optimal shape r×c
- The modified algorithm returns an approximate solution of the extended Cartesian problem, aimed at minimizing both computation and communication cost
- The modified Algorithm 5 will return an optimal solution if the speed matrix for the arrangement is a rank-one matrix