Parallel Algorithm Design
1
Parallel Algorithm Design
  • Involves all of the following
  • identifying the portions of the work that can be
    performed concurrently
  • mapping the concurrent pieces of work onto
    multiple processes running in parallel
  • distributing the input, output and intermediate
    data associated with the program
  • managing access to data shared by multiple
    processes
  • synchronizing the processes in various stages of
    parallel program execution
  • Optimal choices depend on the parallel
    architecture

2
Platform dependency example
  • Problem: process each element of an array, with
    interaction between neighbouring elements
  • Setting: message-passing computer
  • Solution: distribute the array into blocks of
    size n/p (see the sketch below)
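A minimal sketch of this block distribution, assuming each of the p
processes only needs to know which contiguous index range it owns (the
helper name blockRange is ours, not from the slides):

    #include <utility>

    // Compute the [lo, hi) index range owned by process `rank` when an
    // array of n elements is split into p nearly equal contiguous blocks.
    // The first n % p blocks receive one extra element.
    std::pair<int, int> blockRange(int n, int p, int rank) {
        int base  = n / p;        // minimum block size
        int extra = n % p;        // number of blocks with one extra element
        int lo = rank * base + (rank < extra ? rank : extra);
        int hi = lo + base + (rank < extra ? 1 : 0);
        return {lo, hi};
    }

For example, with n = 10 and p = 3 the blocks are [0,4), [4,7) and [7,10).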

3
Overview
  • Preliminaries
  • Decomposition Techniques
  • Characteristics of Tasks and Interactions
  • Load Balancing
  • Minimizing Communication Overhead
  • Parallel Algorithm Models
4
Overview
  • Preliminaries
  • Decomposition Techniques
  • Characteristics of Tasks and Interactions
  • Load Balancing
  • Minimizing Communication Overhead
  • Parallel Algorithm Models
5
Preliminaries
  • Task Dependency Graph
  • directed acyclic graph capturing the causal
    dependencies between tasks
  • a task corresponding to a node can be executed
    only after all tasks at the tails of its incoming
    edges have already been executed

6
Preliminaries
  • Granularity
  • fine-grained: large number of small tasks
  • coarse-grained: small number of large tasks
  • Degree of Concurrency
  • the maximum number of tasks that can be executed
    simultaneously
  • Critical Path
  • the costliest directed path between any pair of
    start and finish nodes in the task dependency
    graph
  • the cost of a path is the sum of the weights of
    its nodes
  • Task Interaction Graph
  • tasks correspond to nodes and an edge connects
    two tasks if they communicate/interact with each
    other

7
Exercises
[Weighted task dependency graph; node weights: 3, 4, 1, 5, 2, 6, 5, 4, 3, 3, 2]
  • For this task graph, determine
  • degree of concurrency
  • critical path

8
Overview
  • Preliminaries
  • Decomposition Techniques
  • Characteristics of Tasks and Interactions
  • Load Balancing
  • Minimizing Communication Overhead
  • Parallel Algorithm Models
9
Decomposition techniques
  • Why?
  • find/create parallelism
  • How?
  • Recursive Decomposition
  • Data Decomposition
  • Task Decomposition
  • Exploratory Decomposition
  • Speculative Decomposition

10
Recursive Decomposition
  • Divide and conquer leads to natural concurrency.
  • quick sort

[Quicksort example: the array 6 2 5 8 9 5 1 7 3 4 3 0 is partitioned
around a pivot, and the resulting sub-arrays are then sorted concurrently
by the recursive calls]
  • finding minimum recursively (see the sketch below)
  • rMin(A[0..n-1]) = min(rMin(A[0..n/2-1]),
    rMin(A[n/2..n-1]))
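A minimal sketch of this recursive decomposition in C++, running the two
halves as concurrent tasks with std::async; the 1024-element sequential
cutoff is an arbitrary threshold of ours:

    #include <algorithm>
    #include <future>
    #include <vector>

    // Recursively find the minimum of A[lo..hi-1]; the two halves are
    // evaluated as independent tasks that may run in parallel.
    int rMin(const std::vector<int>& A, int lo, int hi) {
        if (hi - lo <= 1024)                  // small base case: run sequentially
            return *std::min_element(A.begin() + lo, A.begin() + hi);
        int mid = lo + (hi - lo) / 2;
        auto left = std::async(std::launch::async, rMin, std::cref(A), lo, mid);
        int right = rMin(A, mid, hi);         // right half computed by this task
        return std::min(left.get(), right);
    }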

11
Data Decomposition
  • Data Decomposition
  • Begin by focusing on the largest data
    structures, or the ones that are accessed most
    frequently.
  • Divide the data into small pieces, if possible
    of similar size.
  • Strive for more aggressive partitioning than
    your target computer will allow.
  • Use the data partitioning as a guideline for
    partitioning the computation into separate tasks:
    associate some computation with each data
    element.
  • Take communication requirements into account
    when partitioning data.

12
Data Partitioning
  • Partitioning according to:
  • input data (e.g. find minimum, sorting)
  • output data (e.g. matrix multiplication)
  • intermediate data (e.g. bucket sort)
  • Associate tasks with the data
  • do as much as you can with the data before
    further communication
  • owner-computes rule
  • Partition in a way that minimizes the cost of
    communication

13
Task Decomposition
  • Partition the computation into many small tasks
    of approximately uniform computational
    requirements.
  • Associate data with each task.
  • Common in problems where there are no obvious
    data structures to partition, or where the data
    structures are highly unstructured.

14
Exploratory Decomposition
  • commonly used in search-space exploration
  • unlike data decomposition, the search space is
    not known beforehand
  • the computation can terminate as soon as a
    solution is found
  • the amount of work can be more or less than in
    the sequential case

15
Speculative Decomposition
  • Example: discrete event simulation
  • execute both branches concurrently, then keep
    the one that was actually taken
  • the total amount of work is always more than in
    the sequential case, but execution time can be less

16
Overview
  • Preliminaries
  • Decomposition Techniques
  • Characteristics of Tasks and Interactions
  • Load Balancing
  • Minimizing Communication Overhead
  • Parallel Algorithm Models
17
Task Characteristics
  • Influence load balancing and decomposition
    choices
  • Characteristics of Tasks
  • task generation: static vs dynamic
  • task sizes: uniform or non-uniform, known or
    unknown
  • size of data associated with tasks: influences
    mapping decisions and input/output sizes
  • Characteristics of Inter-Task Communication
  • static vs dynamic
  • regular vs irregular
  • read-only vs read-write
  • one-way vs two-way

18
Overview
  • Preliminaries
  • Decomposition Techniques
  • Characteristics of Tasks and Interactions
  • Load Balancing
  • Minimizing Communication Overhead
  • Parallel Algorithm Models
19
Load Balancing
Efficiency adversely affected by uneven workload
[Gantt chart of processors P0 to P4 showing computation time and idle
(wasted) time]
20
Load Balancing (cont.)
Load balancing: shifting work from heavily loaded
processors to lightly loaded ones.
[Gantt chart of processors P0 to P4 showing computation time, idle
(wasted) time, and work moved between processors]
  • Static load balancing
  • before execution
  • Dynamic load balancing
  • during execution

21
Static Load Balancing
  • Map data and tasks onto processors prior to
    execution
  • the tasks must be known beforehand (static task
    generation)
  • usually the task sizes need to be known in order
    for this to work well
  • even if the sizes are known (but non-uniform),
    the problem of optimal mapping is NP-hard (but
    there are reasonable approximation schemes)

22
1D Array Partitioning
23
2D Array Partitioning
[Diagrams of 2D block and block-cyclic partitionings of an array over
processors p0 to p5]
24
Example: Geometric Operations
  • Image filtering, geometric transformations, etc.
  • Trivial observation:
  • Workload is directly proportional to the number
    of objects.
  • If dealing with pixels, the workload is
    proportional to area.

25
Dynamic Load Balancing
  • Centralized Schemes
  • master-slave
  • master generates tasks and distributes the
    workload
  • easy to program, but prone to the master becoming
    a bottleneck
  • self-scheduling
  • take a task from the work pool when you are
    ready
  • chunk scheduling
  • self-scheduling where each request takes a single
    task can be costly
  • take a chunk of tasks at once
  • when there are few tasks left, the chunk size
    decreases (see the sketch below)
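A minimal shared-memory sketch of chunk self-scheduling; the
shrinking-chunk rule remaining / (2 * n_threads) is one common heuristic,
chosen here for illustration rather than prescribed by the slides:

    #include <algorithm>
    #include <atomic>
    #include <functional>

    std::atomic<int> next_task{0};   // shared counter over tasks 0..n_tasks-1

    // Each worker repeatedly grabs a chunk of task ids; chunks shrink as the
    // pool empties, so the tail end of the work is balanced more finely.
    void worker(int n_tasks, int n_threads,
                const std::function<void(int)>& processTask) {
        while (true) {
            int remaining = n_tasks - next_task.load();
            if (remaining <= 0) return;
            int chunk = std::max(1, remaining / (2 * n_threads));
            int start = next_task.fetch_add(chunk);   // claim the chunk atomically
            if (start >= n_tasks) return;
            int end = std::min(start + chunk, n_tasks);
            for (int t = start; t < end; ++t)
                processTask(t);                        // do the work for task t
        }
    }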

26
Dynamic Load Balancing
  • Distributed Schemes
  • Distributively share your workload with other
    processors.
  • Issues
  • how to pair sending and receiving processors
  • transfer of workload initiated by sender or
    receiver?
  • how much work to transfer?
  • when to decide to transfer?

27
Example: Computing Mandelbrot Set
The colour of each pixel c is determined solely by its coordinates:

    int getColour(Complex c) {
        int colour = 0;
        Complex z(0, 0);
        while ((abs(z) < 2) && (colour < max)) {
            z = z*z + c;
            colour++;
        }
        return colour;
    }

[Plot of the Mandelbrot set: real axis from -2 to 1, imaginary axis from
-1.5 to 1.5]
28
Mandelbrot Set Example (cont.)
  • Possible partitioning strategies
  • partition by individual output pixels (the most
    aggressive partitioning)
  • partition by rows
  • partition by columns
  • partition by 2-D blocks
  • Want to explore the Mandelbrot set? See
  • http://aleph0.clarku.edu/djoyce/julia/explorer.html

29
Mandelbrot Bad Static Load Balancing
[Figure: the image split into four contiguous blocks assigned to p0, p1,
p2, p3; the workload bar chart is strongly uneven]
30
Mandelbrot Dynamic Load Balancing
  • The task size can be anything from a single
    pixel to sizeable blocks.
  • blocks and/or stripes can be used
  • The optimal size depends on the overhead
    associated with task distribution (mostly network
    properties).
  • one pixel is usually too small

31
Mandelbrot Load Balancing Issues
Do we really need dynamic load balancing?
32
Mandelbrot Good Static Load Balancing
[Figure: the image divided into many thin stripes assigned to p0 to p3 in
an interleaved fashion; the workload bar chart is nearly even]
The same can be achieved by random distribution
of small areas.
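A minimal sketch of such a static assignment, interleaving image rows
cyclically over the p processes (image, pixelToComplex and the loop bounds
are hypothetical; getColour is the routine from the earlier slide):

    // Static interleaved ("cyclic") assignment of image rows: process `rank`
    // (0 <= rank < p) computes rows rank, rank + p, rank + 2p, ...
    // Expensive and cheap rows are thereby spread evenly over all processes.
    void computeMyRows(int rank, int p, int height, int width) {
        for (int row = rank; row < height; row += p)
            for (int col = 0; col < width; ++col) {
                // image[row][col] = getColour(pixelToComplex(row, col));
            }
    }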
33
Mandelbrot again!
  • A more advanced algorithm might be more
    complicated to load balance
  • if the border of an area is of the same colour,
    it can be filled without computing the inner area
  • recursive decomposition can be used

36
Load Balancing Summary
  • Static vs Dynamic
  • what?
  • advantages/disadvantages
  • when to use them?
  • Techniques for Static LB
  • Techniques for Dynamic LB

37
Overview
  • Preliminaries
  • Decomposition Techniques
  • Characteristics of Tasks and Interactions
  • Load Balancing
  • Minimizing Communication Overhead
  • Parallel Algorithm Models
38
Minimizing Communication Overhead
  • Maximizing Data Locality
  • Minimizing Contention and Hot Spots
  • Overlapping Computation with Communication
  • Replicating Data or Computation
  • Using Optimized Collective Communication Routines
  • Overlapping Communication
39
Maximizing data locality
  • Example: matrix multiplication C = A x B
  • c(i,j) = Σ_{k=0..n-1} a(i,k) * b(k,j)
  • straightforward data partitioning without data
    replication

[Figure: matrices A and B each partitioned into 4 x 4 blocks assigned to
p0 through p15 row by row]
When p0 computes c(0,0), which processors does it have to communicate
with? When p6 computes c(1,2), which processors does it have to
communicate with?
40
Maximizing data locality
  • Example: matrix multiplication C = A x B
  • c(i,j) = Σ_{k=0..n-1} a(i,k) * b(k,j)
  • straightforward data partitioning without data
    replication

[Figure: matrix A partitioned into 4 x 4 blocks assigned to p0 through
p15 row by row; matrix B partitioned into 4 x 4 blocks assigned column
by column]
When p0 computes c(0,0), which processors does it have to communicate
with? When p6 computes c(1,2), which processors does it have to
communicate with?
41
Maximizing data locality I
Minimizing the volume of data exchange
Matrix multiplication example
[Figure: two partitionings of C = A x B compared]
Partitioning 1: communication of n^2 - n^2/p elements per processor
Partitioning 2 (2D blocks): communication of 2n^2/sqrt(p) elements per
processor
42
Maximizing data locality II
Minimizing the volume of data exchange: the surface-to-volume ratio
2D simulation example
  • 1D strip partitioning: communication volume 2n per processor
  • 2D block partitioning: communication volume 4n/sqrt(p) per processor
43
Maximizing data locality III
  • Minimizing the frequency of interaction
  • communication start-up time is much greater than
    the per-byte transfer time
  • Sparse matrix multiplication example (see the
    sketch below)
  • examine your vectors and figure out which entries
    are non-zero
  • request all the data you need in one block (or as
    much as fits into your memory)
  • process locally with the received data
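A minimal sketch of the idea for one sparse row times a distributed dense
vector; fetchRemote stands in for whatever single bulk transfer the real
system would use and is passed in as a callback:

    #include <cstddef>
    #include <functional>
    #include <vector>

    // One aggregated request for all needed remote entries, instead of one
    // request per entry: the start-up cost is paid once, not cols.size() times.
    double sparseRowTimesVector(
            const std::vector<int>& cols,      // column indices of the non-zeros
            const std::vector<double>& vals,   // the non-zero values themselves
            const std::function<std::vector<double>(const std::vector<int>&)>& fetchRemote) {
        std::vector<double> x = fetchRemote(cols);  // single bulk transfer of the needed entries
        double sum = 0.0;
        for (std::size_t k = 0; k < cols.size(); ++k)
            sum += vals[k] * x[k];                  // process locally with the received data
        return sum;
    }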

44
Trading Memory vs Communication
  • If you have enough memory, you can sometimes
    trade memory for communication by storing the
    input data in multiple places.
  • Example 1: low-level image filtering
  • each pixel x is replaced by a function of x and
    of its neighbourhood
  • if each processor stores not only its own pixels
    but also their neighbourhoods, no communication
    is needed (see the sketch below)
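A minimal sketch of this replication, assuming a row-partitioned image
and one replicated ghost row above and below each block (the simple
5-point average is just an illustrative filter, not one from the slides):

    #include <vector>

    // Each process stores its n_local rows plus one replicated "ghost" row
    // above and below, copied from the neighbouring blocks when the image
    // is first distributed. With the neighbourhood available locally, the
    // filter itself needs no communication at all.
    void filterLocal(const std::vector<std::vector<float>>& in,  // n_local + 2 rows
                     std::vector<std::vector<float>>& out,       // n_local rows
                     int n_local, int width) {
        for (int r = 1; r <= n_local; ++r)          // skip the two ghost rows
            for (int c = 1; c + 1 < width; ++c)     // interior columns only, for brevity
                out[r - 1][c] = (in[r][c] + in[r - 1][c] + in[r + 1][c]
                                 + in[r][c - 1] + in[r][c + 1]) / 5.0f;
    }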

45
Trading Memory vs Communication II
  • Example 2: matrix multiplication C = A x B
  • replicating rows/columns makes it possible to
    eliminate communication

[Figure: matrix A and matrix B with the rows/columns stored by p0 and by
p6 highlighted]
  • The processor computing c(i,j) stores row i of matrix A
    and column j of matrix B.
  • no further communication needed
  • what is the memory overhead? How many processors
    store row i of matrix A? How many processors
    store column j of matrix B?

46
Overlapping Computation with Communication
  • Mandelbrot with dynamic load balancing
  • request the next block while the current one is
    nearing completion (see the sketch below)
  • e.g. with k-row blocks, ask when you start
    computing the last row of the current block
  • analysis/experimentation is needed to identify
    the optimal prefetch distance
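A minimal sketch of the prefetching idea using C++ futures; requestBlock
and computeBlock are hypothetical stand-ins (stubbed out here) for the
real task-request and rendering code:

    #include <future>
    #include <optional>

    struct Block { int firstRow, numRows; };

    // Stubs standing in for the real master/worker communication and the
    // actual Mandelbrot rendering (both hypothetical).
    std::optional<Block> requestBlock() { return std::nullopt; }
    void computeBlock(const Block&) {}

    void workerLoop() {
        std::optional<Block> current = requestBlock();
        while (current) {
            // Ask for the next block *before* finishing the current one,
            // so the request overlaps with the remaining computation.
            auto next = std::async(std::launch::async, requestBlock);
            computeBlock(*current);
            current = next.get();   // by now the reply has (hopefully) arrived
        }
    }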

47
Overlapping Communication
Example: 4 consecutive broadcasts
[Figure: processors p0 to p7 performing the four broadcasts one after
another: 12 steps in total]
[Figure: the same four broadcasts pipelined over p0 to p7: 10 steps]
48
Overview
  • Preliminaries
  • Decomposition Techniques
  • Characteristics of Tasks and Interactions
  • Load Balancing
  • Minimizing Communication Overhead
  • Parallel Algorithm Models
49
Parallel Algorithm Models
  • Data-Parallel Model
  • Task Graph Model
  • Work Pool Model
  • Master-Slave Model
  • Pipeline (Producer-Consumer) Model

50
Data Parallel Model
  • divide the data up amongst the processors
  • process different data segments in parallel
  • communicate boundary information, if necessary
  • includes loop parallelism (see the sketch below)
  • well suited for SIMD machines
  • communication is often implicit (HPF)
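A minimal loop-parallel sketch using the C++17 parallel algorithms;
squaring each element is just an illustrative per-element operation:

    #include <algorithm>
    #include <execution>
    #include <vector>

    // The same operation is applied to every element; the library is free
    // to split the range over the available processors.
    void squareAll(std::vector<double>& data) {
        std::transform(std::execution::par, data.begin(), data.end(),
                       data.begin(), [](double x) { return x * x; });
    }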

51
Task Graph Model
  • decompose the algorithm into different sections
  • assign the sections to different processors
  • often uses fork()/join()/spawn()
  • usually does not lend itself to a high degree of
    parallelism

52
Work Pool Model
  • dynamic mapping of tasks to processes
  • typically a small amount of data per task
  • the pool of tasks (priority queue, hash table,
    tree) can be centralized or distributed (see the
    sketch below)

[Figure: processes P0 to P3 around a central work pool containing tasks
t0, t2, t3, t7, t8; each process gets a task, processes it, and possibly
adds new tasks to the pool]
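A minimal shared-memory sketch of a centralized work pool protected by a
mutex; representing tasks by plain integer ids is our simplification:

    #include <mutex>
    #include <optional>
    #include <queue>

    class WorkPool {
        std::queue<int> tasks;                // pending task ids
        std::mutex m;
    public:
        void add(int task) {                  // "possibly add task"
            std::lock_guard<std::mutex> g(m);
            tasks.push(task);
        }
        std::optional<int> get() {            // "get task"; empty result = no work left
            std::lock_guard<std::mutex> g(m);
            if (tasks.empty()) return std::nullopt;
            int t = tasks.front();
            tasks.pop();
            return t;
        }
    };

Each worker loops on get(), processes the returned task, and may call
add() with any newly generated tasks; a distributed variant would replace
the single queue with per-process queues.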
53
Master-Slave Model
  • the master generates and allocates tasks
  • can also be hierarchical / multi-level
  • the master is potentially a bottleneck
  • overlapping communication and computation at the
    master is often useful

54
Pipelining
  • a sequence of tasks whose execution can overlap
  • a sequential processor must execute them one
    after another, without overlap
  • a parallel computer can overlap the tasks,
    increasing throughput (but not decreasing latency)

55
5-step Guide to Parallelization
  • Identify computational hotspots
  • find what is worth parallelizing
  • Partition the problem into smaller
    semi-independent tasks
  • find/create parallelism
  • Identify the communication requirements between
    these tasks
  • realize the constraints communication places on
    parallelism
  • Agglomerate smaller tasks into larger tasks
  • group the basic tasks together so that
    communication is minimized, while still allowing
    good load balancing
  • Map tasks/data to actual processors
  • balance the load across processors, while trying
    to minimize communication