Title: Parallelisation of Grid-oriented Problems
1. Parallelisation of Grid-oriented Problems
- Algorithms for Parallel Computers (Algoritmen voor parallelle computers)
- 8/11/2000
2. Grid-oriented problems
- PDEs, image processing, ...: data set defined on a grid
- local computations with small stencils → data dependencies between neighbouring grid points
- "grid point" is a generic name for the data associated with a grid point, pixel, cell, finite element, ...
- the grid, the data set and the associated work are partitioned in subdomains; the subdomains are assigned (mapped) to processors
3. Grid-oriented problems (cont.)
- extra tasks (compared with sequential code):
- partitioning and mapping, to ensure work load balance and communication minimisation
- communication between neighbouring subdomains
4. Model problems
- PDEs
- explicit time integration (forward Euler)
- relaxation methods (Jacobi, Gauss-Seidel, SOR, ...)
- on a structured (regular) 2D grid
- image processing
- convolution
- on a 2D pixel matrix
- same data-dependency pattern
- → same parallelisation strategy
5. Explicit time integration and convolution
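As an illustration of why the two model problems parallelise the same way (this sketch is not part of the original slides), here one forward-Euler step and one convolution both read only the 5-point neighbourhood of each grid point; the weights, step size and list-of-lists grid representation are illustrative:

```python
def euler_step(u, alpha=0.1):
    """One explicit time step: u_new = u + alpha * (5-point Laplacian)."""
    ny, nx = len(u), len(u[0])
    new = [row[:] for row in u]
    for i in range(1, ny - 1):
        for j in range(1, nx - 1):
            lap = u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1] - 4 * u[i][j]
            new[i][j] = u[i][j] + alpha * lap
    return new

def convolve5(img, w_c=0.5, w_n=0.125):
    """Convolution restricted to the same 5-point stencil."""
    ny, nx = len(img), len(img[0])
    out = [row[:] for row in img]
    for i in range(1, ny - 1):
        for j in range(1, nx - 1):
            out[i][j] = (w_c * img[i][j]
                         + w_n * (img[i-1][j] + img[i+1][j]
                                  + img[i][j-1] + img[i][j+1]))
    return out
```

Both loops have identical data dependencies, so the same partitioning and overlap-exchange scheme serves both.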
6. Computational molecules
- 5-point stencil and 9-point stencil
7. Computational molecules (cont.)
- two different 9-point stencils
8. Subdomains and overlap regions
- Note: the overlap region can have a width > 1
9. Skeleton of a typical program
- in every subdomain (processor):
- exchange data in the overlap region
- communication with procs. holding neighbouring subdomains
- do calculations for all grid points in the subdomain
- check the stopping criterion (e.g. convergence check)
- global communication (reduction)
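The skeleton can be sketched serially (an illustration, not MPI code from the slides): each "processor" owns a strip of a 1D grid plus one ghost cell per side, and the three steps above appear in order. The Jacobi update and tolerance are assumed for the example:

```python
def run(strips, steps=50, tol=1e-6):
    """strips[k] = [left ghost, interior..., right ghost] for 'processor' k."""
    for _ in range(steps):
        # 1) exchange data in the overlap region (width-1 ghost cells)
        for k in range(len(strips) - 1):
            strips[k][-1] = strips[k + 1][1]   # right ghost <- neighbour's edge
            strips[k + 1][0] = strips[k][-2]   # left ghost  <- neighbour's edge
        # 2) do calculations for all interior grid points (Jacobi averaging)
        change = 0.0
        for s in strips:
            old = s[:]
            for i in range(1, len(s) - 1):
                s[i] = 0.5 * (old[i - 1] + old[i + 1])
                change = max(change, abs(s[i] - old[i]))
        # 3) stopping criterion via a global reduction (here: max over strips)
        if change < tol:
            break
    return strips
```

In a real distributed-memory code step 1 becomes message passing and step 3 a global reduction; the structure per iteration is the same.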
10. Exchange of overlap regions
11. Analysis of communication overhead
- assume p processors, n = nx × ny points per subdomain
- only communication overhead: no sequential part, no load imbalance
- T(n,p) = parallel execution time; T(n,1) = execution time on 1 proc.
- Tcalc = calculation time; Tcomm = communication time
- T(n,p) = Tcalc + Tcomm
- Speedup: S(n,p) = T(n,1) / T(n,p)
- Efficiency: E(n,p) = S(n,p) / p
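Under these assumptions T(n,1) = p · Tcalc, so speedup and efficiency reduce to simple formulas; a sketch (function names and sample numbers are mine, not the slides'):

```python
# Speedup and efficiency assuming no sequential part and no load
# imbalance: T(n,1) = p * Tcalc and T(n,p) = Tcalc + Tcomm.

def speedup(p, t_calc, t_comm):
    return p * t_calc / (t_calc + t_comm)

def efficiency(p, t_calc, t_comm):
    return speedup(p, t_calc, t_comm) / p  # = 1 / (1 + Tcomm/Tcalc)
```

For example, with Tcomm a quarter of Tcalc the efficiency is 0.8 regardless of p.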
12. Analysis of communication overhead (cont.)
- Communication overhead: fc = Tcomm / Tcalc
- relative to the calculation cost!
- For the model problems: E(n,p) = 1 / (1 + fc)
- tcalc = time to perform a floating point operation
- tcomm = average time to communicate one floating point number
- Note: in case of 1 message of length m, tcomm = (ts + m·tw) / m !!
13. Analysis of communication overhead (cont.)
- Communication overhead fc depends on
- the size of the subdomain: large subdomains have a small perimeter-to-surface ratio
- the machine characteristic tcomm/tcalc: indicates how fast communication can be performed compared with floating point operations
- the algorithm, via the ratio cc/cf: fc is small when there are many flops per grid point (cf) compared with the amount of data associated with a grid point (cc)
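Putting the three factors together gives a simple overhead model (my reading of the slide, assuming a square √n × √n subdomain that exchanges its four edges):

```python
import math

# Assumed model: fc ≈ (tcomm/tcalc) * (cc/cf) * (4/sqrt(n)).
# Sample values below are illustrative, e.g. a 5-point Jacobi sweep
# with cf = 5 flops and cc = 1 number per grid point.

def fc(n, t_comm, t_calc, c_c, c_f):
    return (t_comm / t_calc) * (c_c / c_f) * (4.0 / math.sqrt(n))
```

With n = 10000 points per subdomain, tcomm/tcalc = 10 and cc/cf = 1/5, this gives fc = 0.08, i.e. roughly 7% efficiency loss.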
14. Partitioning strategies
- 2D grid, M grid points, n = M/p grid points per proc.
- blockwise partitioning vs. stripwise partitioning
- n = nx × ny; square blockwise partitioning if nx = ny = √n
15. Partitioning strategies (cont.)
- communication volume ∝ perimeter of the subdomain
- square subdomains (nx = ny): minimal perimeter
- → blockwise partitioning is to be preferred
- BUT:
- stripwise partitioning
- higher communication volume
- fewer neighbours → fewer messages
- the choice depends on problem and machine characteristics
- stripwise partitioning may also be better when communication is mainly in one direction (anisotropic communication)
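A back-of-the-envelope comparison of the two strategies (a sketch, assuming idealised interior subdomains and overlap width 1; counts are per processor):

```python
import math

# Per-processor communication volume and message count for an
# M-point square grid partitioned over p processors.

def blockwise(M, p):
    side = math.sqrt(M / p)              # subdomain is side x side
    return {"volume": 4 * side, "messages": 4}

def stripwise(M, p):
    width = math.sqrt(M)                 # strip spans the full grid width
    return {"volume": 2 * width, "messages": 2}
```

For M = 10000 and p = 16 the block exchanges 100 values in 4 messages while the strip exchanges 200 values in 2 messages: exactly the volume-versus-message-count trade-off described above.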
16. Comm. overhead: dependence on problem size
- 2D grid, M grid points, n = M/p grid points per proc.
- blockwise partitioning: √n × √n points per proc.
- fc ∝ (tcomm/tcalc)(cc/cf)(4/√n)
- fc (and speedup, efficiency) is constant when n (problem size per proc.) is constant and p grows
- fc ↑ (speedup, efficiency ↓) when the total problem size is constant and p grows
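The two scaling regimes can be checked numerically with the proportionality fc ∝ 4/√n (constants dropped; the sample sizes are illustrative):

```python
import math

# Blockwise partitioning: fc depends only on the per-processor size n.
def fc_block(n):
    return 4.0 / math.sqrt(n)

# Weak scaling: n fixed per processor -> fc constant as p grows.
weak = [fc_block(4096) for p in (4, 16, 64)]

# Strong scaling: total size M fixed -> n = M/p shrinks, fc grows.
strong = [fc_block(65536 / p) for p in (4, 16, 64)]
```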
17. Comm. overhead: dependence on problem size (cont.)
- 2D grid, M grid points, n = M/p grid points per proc.
- stripwise partitioning: √M × (√M/p) points per proc.
- fc ∝ (tcomm/tcalc)(cc/cf)(2√M/n) = (tcomm/tcalc)(cc/cf)(2p/√M)
- fc ↑ (speedup, efficiency ↓) when n is constant and p grows
- fc ↑↑ (speedup, efficiency ↓↓) when the total problem size is constant and p grows
18. Comm. overhead: dependence on problem size (cont.)
- 3D problems: communication volume
- ∝ surface-to-volume ratio of the subdomains
- blockwise partitioning: n^(1/3) × n^(1/3) × n^(1/3) points per proc., so fc ∝ 1/n^(1/3)
- fc decreases more slowly as a function of n than in the 2D case
- d-dimensional problems: fc ∝ 1/n^(1/d)
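The dimension dependence is easy to check numerically (proportionality constants dropped; the sample sizes are mine):

```python
# fc ∝ 1 / n**(1/d) for blockwise partitioning in d dimensions.
def fc_dim(n, d):
    return 1.0 / n ** (1.0 / d)

# Growing n by 100x cuts fc by 10x in 2D but only ~4.6x in 3D.
ratio_2d = fc_dim(10**6, 2) / fc_dim(10**4, 2)
ratio_3d = fc_dim(10**6, 3) / fc_dim(10**4, 3)
```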
19. Comm. overhead: dependence on the computational molecule
- computational molecules of increasing size
- when the molecule covers the whole domain (i.e. the new value of a grid point depends on all other grid points), every processor needs data from all other processors!!
- (figure: computational molecules of increasing size, from a small stencil around a grid point up to one covering the whole domain)
20. Analysis of load imbalance
- Let Ti = calculation time for processor i, i = 1, ..., p
- Tavg = (1/p) Σi Ti : average calculation time
- Tmax = maxi Ti : maximal calculation time (over all procs.)
- Assume:
- the number of operations (counted sequentially) is independent of p
- communication time and sequential fraction can be neglected
- Execution time of the parallel program is determined by Tmax
- Efficiency: E = T(n,1) / (p · T(n,p)) = (p · Tavg) / (p · Tmax) = Tavg/Tmax
- Load balance factor: LB = Tavg/Tmax, so E = LB
21. Analysis of load imbalance (cont.)
- The load balance factor does not depend on the time per operation: scaling every Ti by the same factor leaves Tavg/Tmax unchanged!
22. Analysis of load imbalance (cont.)
- Assume in addition:
- the amount of work is equal for each grid point
- procs. are (implicitly) synchronised by the communication at the end of each iteration
- Let Nmax = maximum number of grid points per subdomain
- Naverage = M/p = average number of grid points per subdomain
- then E = LB = Naverage/Nmax
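With equal work per grid point, the load balance factor is just a ratio of grid-point counts; a minimal sketch (the sample distributions are made up):

```python
# Load balance factor LB = N_average / N_max; under the assumptions
# above it equals the parallel efficiency.
def load_balance(points_per_proc):
    n_avg = sum(points_per_proc) / len(points_per_proc)
    return n_avg / max(points_per_proc)
```

A perfectly even distribution gives LB = 1; one processor with 20% extra points caps the efficiency at Naverage/Nmax.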
23. Load imbalance and partitioning
- If the computational cost is NOT equal for each grid point
- different physics in different regions
- grid points corresponding to boundary conditions
- then the optimal partitioning w.r.t. work load balance is difficult to compute
- If the work load imbalance is due only to the boundary conditions,
- then blockwise partitioning ensures that the boundary conditions are well distributed over the processors
- → good load balance
- blockwise partitioning vs. stripwise partitioning
24. Load imbalance and partitioning (cont.)
- If, in a rectangular grid, the number of grid lines is not a multiple of p, then typically the grid is partitioned in (unequal) rectangles
- not optimal w.r.t. work load balance, but easy
- Also in this case, blockwise partitioning leads to minimal work load imbalance
- blockwise partitioning vs. stripwise partitioning