1
On Grid-based Matrix Partitioning for Networks of
Heterogeneous Processors
Alexey Lastovetsky
School of Computer Science and Informatics
University College Dublin
Alexey.Lastovetsky@ucd.ie
2
Heterogeneous parallel computing
  • Heterogeneity of processors
  • The processors run at different speeds
  • An even distribution of computations does not
    balance the processors' load
  • The performance is determined by the slowest
    processor
  • Data must be distributed unevenly
  • So that each processor performs a volume of
    computation proportional to its speed

3
Constant performance models of heterogeneous
processors
  • The simplest performance model of heterogeneous
    processors (a code sketch follows below)
  • p, the number of processors
  • S = {s1, s2, ..., sp}, the speeds of the
    processors (positive constants)
  • The speed can be
  • Absolute: the number of computational units
    performed by the processor per one time unit
  • Relative: the absolute speed normalized so that
    the relative speeds sum to 1
  • Some models use the execution time instead
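  A minimal Python sketch of distributing independent computational units under
  the constant performance model described above. The function names and the
  largest-remainder rounding are illustrative assumptions, not part of the
  original slides.

    from math import floor

    def relative_speeds(absolute_speeds):
        """Normalize absolute speeds (units per time unit) so they sum to 1."""
        total = sum(absolute_speeds)
        return [s / total for s in absolute_speeds]

    def distribute(n_units, absolute_speeds):
        """Give each processor a share of n_units proportional to its speed."""
        exact = [n_units * s for s in relative_speeds(absolute_speeds)]
        shares = [floor(x) for x in exact]
        # Hand the few units lost to rounding to the largest remainders.
        remainders = sorted(range(len(exact)),
                            key=lambda i: exact[i] - shares[i], reverse=True)
        for i in remainders[:n_units - sum(shares)]:
            shares[i] += 1
        return shares

    # Example: 4 processors with speeds 10, 20, 30, 40 units/s and 1000 units of work.
    print(distribute(1000, [10, 20, 30, 40]))  # -> [100, 200, 300, 400]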

4
Data distribution problems with constant models
of heterogeneous processors
  • Typical design of heterogeneous parallel
    algorithms
  • Problem of distribution of computations in
    proportion to the speed of processors
  • Problem of partitioning of some mathematical
    objects
  • Sets, matrices, graphs, geometric figures, etc.

5
Partitioning matrices with constant models of
heterogeneous processors
  • Matrices
  • Most widely used math. objects in scientific
    computing
  • Studied partitioning problems mainly deal with
    matrices
  • Matrix partitioning in one dimension over a 1D
    arrangement of processors
  • Often reduced to partitioning sets or
    well-ordered sets
  • The design of algorithms often leads to matrix
    partitioning problems that do not impose the
    restriction of partitioning in one dimension
  • E.g., in parallel linear algebra for
    heterogeneous platforms
  • We will use matrix multiplication
  • A simple but very important linear algebra kernel

6
Partitioning matrices with constant models of
heterogeneous processors (ctd)
  • A heterogeneous matrix multiplication algorithm
  • A modification of some homogeneous one
  • Most often, of the 2D block cyclic ScaLAPACK
    algorithm

7
Partitioning matrices with constant models of
heterogeneous processors (ctd)
  • 2D block cyclic ScaLAPACK MM algorithm (ctd)

8
Partitioning matrices with constant models of
heterogeneous processors (ctd)
  • 2D block cyclic ScaLAPACK MM algorithm (ctd)
  • The matrices are identically partitioned into
    rectangular generalized blocks of size
    (p·r)×(q·r)
  • Each generalized block forms a 2D p×q grid of
    r×r blocks
  • There is 1-to-1 mapping between this grid of
    blocks and the p×q processor grid
  • At each step of the algorithm
  • Each processor not owning the pivot row and
    column receives horizontally (n/p)·r elements of
    matrix A and vertically (n/q)·r elements of
    matrix B
  • => in total, (n/p)·r + (n/q)·r = r·(n/p + n/q)
    elements, i.e., an amount proportional to the
    half-perimeter of the rectangle area allocated
    to the processor

9
Partitioning matrices with constant models of
heterogeneous processors (ctd)
  • General design of heterogeneous modifications
  • Matrices A, B, and C are identically partitioned
    into equal rectangular generalized blocks
  • The generalized blocks are identically
    partitioned into rectangles so that
  • There is one-to-one mapping between the
    rectangles and the processors
  • The area of each rectangle is (approximately)
    proportional to the speed of the processor that
    owns the rectangle
  • Then, the algorithm follows the steps of its
    homogeneous prototype

10
Partitioning matrices with constant models of
heterogeneous processors (ctd)
11
Partitioning matrices with constant models of
heterogeneous processors (ctd)
  • Why partition the generalized blocks (GBs) in
    proportion to the speeds
  • At each step, updating one r×r block of matrix C
    needs the same amount of computation for all the
    blocks
  • => the load will be perfectly balanced if the
    number of blocks updated by each processor is
    proportional to its speed
  • The number of blocks updated by the i-th
    processor is ni·NGB, where NGB is the number of
    generalized blocks and
  • ni is the area of the GB partition allocated to
    the i-th processor (measured in r×r blocks)
  • => if the area of each GB partition is
    proportional to the speed of the owning
    processor, the load will be perfectly balanced
    (see the worked equation below)
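  A worked restatement of the argument above (t_i, the execution time of the
  i-th processor, is introduced here for illustration):

    t_i = \frac{n_i \, N_{GB}}{s_i}, \qquad
    t_1 = t_2 = \dots = t_p
      \iff \frac{n_1}{s_1} = \dots = \frac{n_p}{s_p}
      \iff n_i \propto s_i .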

12
Partitioning matrices with constant models of
heterogeneous processors (ctd)
  • A generalized block, from the partitioning point
    of view
  • An integer-valued rectangle (its dimensions are
    measured in r×r blocks)
  • If we need an asymptotically optimal solution,
    the problem can be reduced to a geometrical
    problem of optimal partitioning of a real-valued
    rectangle
  • The asymptotically optimal integer-valued
    solution can be obtained by rounding off an
    optimal real-valued solution of the geometrical
    partitioning problem

13
Geometrical partitioning problem
  • The general geometrical partitioning problem
  • Given a set of p processors P1, P2, ..., Pp, the
    relative speed of each of which is characterized
    by a positive constant si (Σi si = 1),
  • Partition a unit square into p rectangles so that
  • There is one-to-one mapping between the
    rectangles and the processors
  • The area of the rectangle allocated to processor
    Pi is equal to si
  • The partitioning minimizes Σi (wi + hi),
    where wi is the width and hi is the
    height of the rectangle allocated to processor Pi

14
Geometrical partitioning problem (ctd)
  • Motivation behind the formulation
  • Proportionality of the areas to the speeds
  • Balancing the load of the processors
  • Minimization of the sum of half-perimeters
  • Multiple partitionings can balance the load
  • Minimizes the total volume of communications
  • At each step of MM, each receiving processor
    receives data proportional to the half-perimeter
    of its rectangle
  • => In total, the communicated data is
    proportional to Σi (wi + hi)

15
Geometrical partitioning problem (ctd)
  • Motivation behind the formulation (ctd)
  • An alternative option is to minimize the maximal
    half-perimeter
  • Appropriate when the communications are parallel
  • The use of a unit square instead of a rectangle
  • No loss of generality
  • The optimal solution for an arbitrary rectangle
    is obtained by straightforward scaling of that
    for the unit square
  • Proposition. The general geometrical partitioning
    problem is NP-complete.

16
Restricted geometrical partitioning problems
  • Restricted problems having polynomial solutions
  • Column-based
  • Grid-based
  • Column-based partitioning
  • Rectangles make up columns
  • Has an optimal solution of complexity O(p^3)

17
Column-based partitioning problem
18
Column-based partitioning problem (ctd)
  • A more restricted form of the column-based
    partitioning problem
  • The processors are already arranged into a set of
    columns
  • Algorithm 1: Optimal partitioning of a unit
    square between p heterogeneous processors
    arranged into c columns, each of which is made of
    rj processors, j = 1, ..., c
  • Let the relative speed of the i-th processor from
    the j-th column, Pij, be sij.
  • Then, we first partition the unit square into c
    vertical rectangular slices such that the width
    of the j-th slice is wj = Σi sij (the total
    relative speed of the j-th processor column)

19
Column-based partitioning problem (ctd)
  • Algorithm 1 (ctd)
  • Second, each vertical slice is partitioned
    independently into horizontal rectangles whose
    heights are proportional to the speeds of the
    processors in the corresponding processor column
    (a Python sketch follows below).
  • Algorithm 1 is of linear complexity.
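  A minimal Python sketch of Algorithm 1, assuming the relative speeds sum to 1
  and the arrangement of the processors into columns is given. The function name
  partition_columns and the returned (x, y, width, height) tuples are assumptions
  of this sketch.

    def partition_columns(speeds_by_column):
        """speeds_by_column[j][i] is the relative speed sij of the i-th processor
        in the j-th column.  Returns (x, y, width, height) of each rectangle
        inside the unit square, column by column."""
        rectangles = []
        x = 0.0
        for column in speeds_by_column:
            width = sum(column)          # slice width = total speed of the column
            y = 0.0
            for s in column:
                height = s / width       # so that area = width * height = s
                rectangles.append((x, y, width, height))
                y += height
            x += width
        return rectangles

    # Example: two columns of two processors, relative speeds summing to 1.
    for rect in partition_columns([[0.1, 0.2], [0.3, 0.4]]):
        print(rect)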

20
Grid-based partitioning problem
  • Grid-based partitioning problem
  • The heterogeneous processors form a
    two-dimensional grid

  • There exist r and c such that any vertical line
    crossing the unit square passes through exactly
    r rectangles and any horizontal line crossing the
    square passes through exactly c rectangles
21
Grid-based partitioning problem (ctd)
  • Proposition. Let a grid-based partitioning of the
    unit square between p heterogeneous processors
    form c columns, each of which consists of r
    processors, p = r·c. Then, the sum of
    half-perimeters of the rectangles of the
    partitioning will be equal to (r + c) (a
    one-line derivation is given below).
  • The shape r×c of the processor grid formed by
    any optimal grid-based partitioning will minimize
    (r + c).
  • The sum of half-perimeters of the rectangles of
    the optimal grid-based partitioning does not
    depend on the mapping of the processors onto the
    nodes of the grid.
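  A one-line derivation of the proposition, writing wj for the width of the j-th
  column and h1j, ..., hrj for the heights of its r rectangles (the column widths
  sum to 1, and the heights within each column sum to 1):

    \sum_{j=1}^{c}\sum_{i=1}^{r}\left(w_j + h_{ij}\right)
      = \sum_{j=1}^{c}\Big(r\,w_j + \sum_{i=1}^{r} h_{ij}\Big)
      = r\sum_{j=1}^{c} w_j + c
      = r + c .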

22
Grid-based partitioning problem (ctd)
  • Algorithm 2: Optimal grid-based partitioning of a
    unit square between p heterogeneous processors
  • Step 1: Find the optimal shape r×c of the
    processor grid such that p = r·c and (r + c) is
    minimal.
  • Step 2: Map the processors onto the nodes of the
    grid.
  • Step 3: Apply Algorithm 1, the optimal
    partitioning of the unit square for processors
    arranged into columns, to this r×c arrangement
    of the p heterogeneous processors.
  • The correctness of Algorithm 2 is obvious.
  • Algorithm 2 returns a column-based partitioning.

23
Grid-based partitioning problem (ctd)
  • The optimal grid-based partitioning can be seen
    as a restricted form of column-based partitioning.

24
Grid-based partitioning problem (ctd)
  • Algorithm 3: Finding r and c such that p = r·c
    and (r + c) is minimal (a runnable version
    follows the pseudocode)

    r = floor(sqrt(p));   // largest candidate r with r <= c
    while (r > 1)
      if ((p mod r) == 0)
        goto stop;
      else
        r--;
    stop: c = p / r;
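  A runnable Python version of Algorithm 3, followed by a usage line sketching
  how Algorithm 2 composes it with Algorithm 1. The names find_grid_shape and
  partition_columns (from the sketch after Algorithm 1) are assumptions of these
  sketches, not names from the original slides.

    from math import isqrt

    def find_grid_shape(p):
        """Return the shape (r, c) with p = r * c that minimizes r + c."""
        r = isqrt(p)                 # largest candidate r <= sqrt(p)
        while r > 1 and p % r != 0:  # decrease r until it divides p
            r -= 1
        return r, p // r             # r <= c by construction

    # Algorithm 2, sketched: find the shape, arrange the speeds into c columns
    # of r processors each, then apply the column-based partitioning (Algorithm 1).
    speeds = [0.05, 0.10, 0.15, 0.20, 0.10, 0.10, 0.10, 0.10, 0.10]  # 9 relative speeds
    r, c = find_grid_shape(len(speeds))                              # -> (3, 3)
    columns = [speeds[j * r:(j + 1) * r] for j in range(c)]
    print(partition_columns(columns))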

25
Grid-based partitioning problem (ctd)
  • Proposition. Algorithm 3 is correct.
  • Proposition. The complexity of Algorithm 2 can be
    bounded by O(p^(3/2)).

26
Experimental results
Specifications of sixteen Linux computers on
which the matrix multiplication is executed
27
Experimental results (ctd)
28
Application to Cartesian partitioning
  • Cartesian partitioning
  • A column-based partitioning, the rectangles of
    which make up rows.

29
Application to Cartesian partitioning (ctd)
  • Cartesian partitioning
  • Plays an important role in the design of
    heterogeneous parallel algorithms (e.g., in
    scalable algorithms)
  • The Cartesian partitioning problem
  • Very difficult
  • There may be no Cartesian partitioning that
    perfectly balances the load of the processors

30
Application to Cartesian partitioning (ctd)
  • Cartesian partitioning problem in general form
  • Given p processors, the speed of each of which is
    characterized by a given positive constant,
  • Find a Cartesian partitioning of a unit square
    such that
  • There is 1-to-1 mapping between the rectangles
    and the processors
  • The partitioning minimizes the computation cost,
    i.e., the largest ratio of rectangle area to
    processor speed over all the processors (the
    execution time of the slowest processor)

31
Application to Cartesian partitioning (ctd)
  • The Cartesian partitioning problem
  • Has not even been studied in the general form
  • If the shape r×c is given, it has been proved
    NP-complete
  • It is unclear whether there exists a polynomial
    algorithm when both the shape and the processor
    mapping are given
  • There exists an optimal Cartesian partitioning
    with processors arranged in a non-increasing
    order of speed

32
Application to Cartesian partitioning (ctd)
  • Approximate solutions of the Cartesian
    partitioning problem are based on the following
    observation
  • Let the speed matrix (sij) of the given r×c
    processor arrangement be rank-one
  • Then there exists a Cartesian partitioning
    perfectly balancing the load of the processors
    (a short derivation follows)
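  A short justification, assuming rank-one means sij = ai·bj with positive ai, bj
  (the factors ai, bj are introduced here for illustration):

    s_{ij} = a_i b_j, \qquad
    h_i = \frac{a_i}{\sum_k a_k}, \qquad
    w_j = \frac{b_j}{\sum_k b_k}
    \;\Rightarrow\;
    \frac{h_i w_j}{s_{ij}} = \frac{1}{\big(\sum_k a_k\big)\big(\sum_k b_k\big)}
    \text{ for all } (i, j),

  so every rectangle's area is proportional to the speed of its processor and the
  load is perfectly balanced.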

33
Application to Cartesian partitioning (ctd)
  • Algorithm 4: Finding an approximate solution of
    the simplified Cartesian problem (when only the
    shape r×c is given)
  • Step 1: Arrange the processors in a
    non-increasing order of speed
  • Step 2: For this arrangement, let
    hi = Σj sij / Σi,j sij and wj = Σi sij / Σi,j sij
    be the parameters of the partitioning
  • Step 3: Calculate the areas hi·wj of the
    rectangles of this partitioning

34
Application to Cartesian partitioning (ctd)
  • Algorithm 5: Finding an approximate solution of
    the simplified Cartesian problem when only the
    shape r×c is given (ctd)
  • Step 4: Re-arrange the processors so that faster
    processors are assigned to rectangles of larger
    area (the ordering of the speeds matches the
    ordering of the areas hi·wj)
  • Step 5: If Step 4 does not change the arrangement
    of the processors, then return the current
    partitioning and stop the procedure; else go to
    Step 2 (a Python sketch of the iteration follows)
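  A sketch of the whole iteration of Algorithms 4/5 under the assumptions stated
  above: the row and column parameters are the normalized row and column sums of
  the speed matrix, and re-arranging assigns faster processors to larger
  rectangles. The name cartesian_approx is illustrative.

    import numpy as np

    def cartesian_approx(speeds, r, c, max_iter=100):
        """speeds: p = r*c positive processor speeds.  Returns (h, w, grid), where
        grid[i][j] is the speed placed on node (i, j) of the r x c grid."""
        s = np.sort(np.asarray(speeds, dtype=float))[::-1]   # Step 1: sort by speed
        for _ in range(max_iter):
            S = s.reshape(r, c)
            h = S.sum(axis=1) / S.sum()                      # Step 2: row heights
            w = S.sum(axis=0) / S.sum()                      # Step 2: column widths
            areas = np.outer(h, w).ravel()                   # Step 3: rectangle areas
            new_s = np.empty_like(s)                         # Step 4: re-arrange
            new_s[np.argsort(-areas)] = np.sort(s)[::-1]     # larger area, faster proc
            if np.array_equal(new_s, s):                     # Step 5: fixed point
                break
            s = new_s
        return h, w, s.reshape(r, c)

    h, w, grid = cartesian_approx([5, 5, 4, 3, 2, 1], r=2, c=3)
    print(h, w, grid, sep="\n")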

35
Application to Cartesian partitioning (ctd)
  • Proposition. Let a Cartesian partitioning of the
    unit square between p heterogeneous processors
    form c columns, each of which consists of r
    processors, p = r·c. Then, the sum of
    half-perimeters of the rectangles of the
    partitioning will be (r + c).
  • The proof is a trivial exercise
  • Minimization of the communication cost does not
    depend on the speeds of the processors but only
    on their number
  • => minimization of communication cost and
    minimization of computation cost are two
    independent problems
  • Any Cartesian partitioning minimizing (r + c)
    will optimize the communication cost

36
Application to Cartesian partitioning (ctd)
  • Now we can extend Algorithm 5
  • By adding a 0-th step that finds the optimal
    shape r×c (as Algorithm 3 does)
  • The modified algorithm returns an approximate
    solution of the extended Cartesian problem
  • Aimed at minimization of both computation and
    communication cost
  • The modified Algorithm 5 will return an optimal
    solution if the speed matrix for the arrangement
    is a rank-one matrix