Title: Parallel Algorithm Design
1. Parallel Algorithm Design
- Involves all of the following:
  - identifying the portions of the work that can be performed concurrently
  - mapping the concurrent pieces of work onto multiple processes running in parallel
  - distributing the input, output and intermediate data associated with the program
  - managing access to data shared by multiple processes
  - synchronizing the processes in various stages of parallel program execution
- Optimal choices depend on the parallel architecture.
2. Platform dependency example
- Problem
  - process each element of an array, with interaction between neighbouring elements
- Setting: message-passing computer
- Solution: distribute the array into blocks of size n/p (see the sketch below)
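A minimal sketch of this setting, assuming an MPI message-passing machine and that n is divisible by p; the array size, halo width and initial values are made up for illustration:

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        const int n = 1 << 20;                    // hypothetical global array size
        const int local_n = n / p;                // block of size n/p per process
        std::vector<double> a(local_n + 2, rank); // local block plus one ghost cell per side

        int left  = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < p - 1) ? rank + 1 : MPI_PROC_NULL;

        // exchange boundary elements with both neighbours
        MPI_Sendrecv(&a[1], 1, MPI_DOUBLE, left, 0,
                     &a[local_n + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&a[local_n], 1, MPI_DOUBLE, right, 1,
                     &a[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // each local element can now be updated from its neighbours
        MPI_Finalize();
        return 0;
    }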
3. Overview
- Preliminaries
- Decomposition Techniques
- Characteristics of Tasks and Interactions
- Load Balancing
- Minimizing Communication Overhead
- Parallel Algorithm Models
5. Preliminaries
- Task Dependency Graph
  - directed acyclic graph capturing causal dependency between tasks
  - a task corresponding to a node can be executed only after all tasks at the other ends of its incoming edges have already been executed
6. Preliminaries
- Granularity
  - fine-grained: large number of small tasks
  - coarse-grained: small number of large tasks
- Degree of Concurrency
  - the maximum number of tasks that can be executed simultaneously
- Critical Path
  - the costliest directed path between any pair of start and finish nodes in the task dependency graph
  - the cost of a path is the sum of the weights of its nodes (see the sketch below)
- Task Interaction Graph
  - tasks correspond to nodes and an edge connects two tasks if they communicate/interact with each other
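To make the critical-path definition concrete, here is a minimal sketch (not from the slides) that computes the critical-path cost of a weighted task dependency graph; the weights and edges are made-up example values:

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    int main() {
        // node weights of tasks 0..5 (hypothetical values)
        std::vector<int> w = {4, 2, 3, 5, 1, 6};
        // edges u -> v mean "v depends on u"; listed in topological order of u
        std::vector<std::pair<int, int>> edges = {{0,2}, {1,2}, {2,4}, {3,4}, {4,5}};

        // cost[v] = cost of the costliest path ending at v (including v itself)
        std::vector<int> cost(w);
        for (auto [u, v] : edges)
            cost[v] = std::max(cost[v], cost[u] + w[v]);

        std::printf("critical path cost = %d\n",
                    *std::max_element(cost.begin(), cost.end()));
        return 0;
    }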
7. Exercises
(figure: task dependency graph with node weights 3, 4, 1, 5, 2, 6, 5, 4, 3, 3 and 2)
- For this task graph, determine
  - degree of concurrency
  - critical path
8. Overview
- Preliminaries
- Decomposition Techniques
- Characteristics of Tasks and Interactions
- Load Balancing
- Minimizing Communication Overhead
- Parallel Algorithm Models
9. Decomposition techniques
- Why?
- find/create parallelism
- How?
- Recursive Decomposition
- Data Decomposition
- Task Decomposition
- Exploratory Decomposition
- Speculative Decomposition
10. Recursive Decomposition
- Divide and conquer leads to natural concurrency.
- quick sort
(figure: quicksort recursion tree; each pivot splits the array into sub-arrays that can be sorted concurrently)
- finding minimum recursively (see the sketch below)
  - rMin(A[0..n-1]) = min(rMin(A[0..n/2-1]), rMin(A[n/2..n-1]))
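A minimal sketch of rMin using C++ tasks (std::async); the cutoff value is an assumption, not part of the slide:

    #include <algorithm>
    #include <future>
    #include <vector>

    // recursive minimum of a[lo..hi), spawning the left half as a concurrent task
    int rMin(const std::vector<int>& a, int lo, int hi) {
        if (hi - lo < 1024)                       // assumed cutoff: small ranges run sequentially
            return *std::min_element(a.begin() + lo, a.begin() + hi);
        int mid = lo + (hi - lo) / 2;
        auto left = std::async(std::launch::async, rMin, std::cref(a), lo, mid);
        int right = rMin(a, mid, hi);             // recurse on the right half in this thread
        return std::min(left.get(), right);
    }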
11. Data Decomposition
- Data Decomposition
  - Begin by focusing on the largest data structures, or the ones that are accessed most frequently.
  - Divide the data into small pieces, if possible of similar size.
  - Strive for more aggressive partitioning than your target computer will allow.
  - Use the data partitioning as a guideline for partitioning the computation into separate tasks: associate some computation with each data element.
  - Take communication requirements into account when partitioning data.
12. Data Partitioning
- Partitioning according to
  - input data (e.g. finding the minimum, sorting)
  - output data (e.g. matrix multiplication)
  - intermediate data (e.g. bucket sort)
- Associate tasks with the data
  - do as much as you can with the data before further communication
  - owner-computes rule
- Partition in a way that minimizes the cost of communication
13. Task Decomposition
- Partition the computation into many small tasks of approximately uniform computational requirements.
- Associate data with each task.
- Common in problems where there are no obvious data structures to partition, or where the data structures are highly unstructured.
14. Exploratory Decomposition
- commonly used in search space exploration
- unlike data decomposition, the search space is not known beforehand
- computation can terminate as soon as a solution is found (see the sketch below)
- the amount of work can be more or less than in the sequential case
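A minimal sketch of exploratory decomposition (the names and the test() predicate are hypothetical): every thread scans its own slice of the search space and all of them stop as soon as any one finds a solution:

    #include <atomic>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // returns true and sets 'solution' if any thread finds a candidate passing test()
    bool parallelSearch(const std::vector<int>& space, int nthreads,
                        bool (*test)(int), int& solution) {
        std::atomic<bool> found{false};
        std::atomic<int> result{0};
        std::vector<std::thread> workers;
        for (int t = 0; t < nthreads; ++t)
            workers.emplace_back([&, t] {
                // thread t explores every nthreads-th candidate, stopping early if done
                for (std::size_t i = t; i < space.size() && !found.load(); i += nthreads)
                    if (test(space[i])) { result = space[i]; found = true; }
            });
        for (auto& w : workers) w.join();
        solution = result;
        return found;
    }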
15. Speculative Decomposition
- Example
  - discrete event simulation
  - execute both branches concurrently, then keep the one that was taken (see the sketch below)
- total amount of work is always more than in the sequential case, but execution time can be less
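A minimal sketch of speculative decomposition in C++ (the function names are placeholders): both branches of an expensive condition are started concurrently and the untaken result is thrown away:

    #include <future>

    int speculate(int x, bool (*slowCondition)(int),
                  int (*thenBranch)(int), int (*elseBranch)(int)) {
        auto t = std::async(std::launch::async, thenBranch, x);  // speculatively run both
        auto e = std::async(std::launch::async, elseBranch, x);  // branches in parallel
        bool taken = slowCondition(x);       // evaluated concurrently with the branches
        return taken ? t.get() : e.get();    // keep only the branch that was actually taken
    }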
16. Overview
- Preliminaries
- Decomposition Techniques
- Characteristics of Tasks and Interactions
- Load Balancing
- Minimizing Communication Overhead
- Parallel Algorithm Models
17. Task Characteristics
- Influence load balancing and decomposition choices
- Characteristics of Tasks
  - Task generation: static vs. dynamic
  - Task sizes: uniform, non-uniform, known, unknown
  - Size of data associated with tasks: influences mapping decisions, input/output sizes
- Characteristics of Inter-Task Communication
  - static vs. dynamic
  - regular vs. irregular
  - read-only vs. read-write
  - one-way vs. two-way
18. Overview
- Preliminaries
- Decomposition Techniques
- Characteristics of Tasks and Interactions
- Load Balancing
- Minimizing Communication Overhead
- Parallel Algorithm Models
19. Load Balancing
Efficiency is adversely affected by an uneven workload.
(figure: timeline of processors P0-P4, showing computation time and idle (wasted) time)
20. Load Balancing (cont.)
Load balancing: shifting work from heavily loaded processors to lightly loaded ones.
(figure: the same timeline for P0-P4, with some computation moved from heavily loaded to lightly loaded processors, reducing idle time)
- Static load balancing
  - before execution
- Dynamic load balancing
  - during execution
21. Static Load Balancing
- Map data and tasks onto processors prior to execution.
- The tasks must be known beforehand (static task generation).
- Usually task sizes need to be known in order to work well.
- Even if the sizes are known (but non-uniform), the problem of optimal mapping is NP-hard (but there are reasonable approximation schemes).
22. 1D Array Partitioning
23. 2D Array Partitioning
(figure: several ways of mapping the blocks of a 2-D partitioned array onto processors p0-p5; a sketch of computing block ownership follows)
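A minimal sketch (assuming p is a perfect square whose root divides n) of which index range each process owns under a 2-D block partitioning:

    #include <cmath>
    #include <cstdio>

    int main() {
        const int n = 12, p = 9;                                    // illustrative sizes
        const int q = static_cast<int>(std::lround(std::sqrt(p)));  // q x q process grid
        const int block = n / q;                                    // each block is block x block
        for (int rank = 0; rank < p; ++rank) {
            int pr = rank / q, pc = rank % q;                       // process coordinates in the grid
            std::printf("p%d owns rows %d..%d, columns %d..%d\n",
                        rank, pr * block, (pr + 1) * block - 1,
                        pc * block, (pc + 1) * block - 1);
        }
        return 0;
    }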
24. Example: Geometric Operations
- Image filtering, geometric transformations, ...
- Trivial observation
  - Workload is directly proportional to the number of objects.
  - If dealing with pixels, the workload is proportional to area.
25. Dynamic Load Balancing
- Centralized Schemes
  - master-slave
    - master generates tasks and distributes workload
    - easy to program, prone to the master becoming a bottleneck
  - self-scheduling
    - take a task from the work pool when you are ready
  - chunk scheduling
    - self-scheduling taking a single task at a time can be costly
    - take a chunk of tasks at once
    - when there are few tasks left, the chunk size decreases (see the sketch below)
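A minimal sketch of a centralized pool with shrinking chunks (guided self-scheduling); the chunk-size formula and the types are assumptions, not from the slide:

    #include <algorithm>
    #include <atomic>
    #include <thread>
    #include <vector>

    // every worker repeatedly claims a chunk of task indices from a shared counter;
    // the chunk size shrinks as fewer tasks remain in the pool
    void runPool(int ntasks, int nworkers, void (*process)(int)) {
        std::atomic<int> next{0};
        std::vector<std::thread> workers;
        for (int w = 0; w < nworkers; ++w)
            workers.emplace_back([&] {
                for (;;) {
                    int begin = next.load();
                    int remaining = ntasks - begin;
                    if (remaining <= 0) break;
                    int chunk = std::max(1, remaining / (2 * nworkers));  // assumed heuristic
                    if (!next.compare_exchange_weak(begin, begin + chunk))
                        continue;                          // another worker won the race; retry
                    for (int t = begin; t < begin + chunk; ++t)
                        process(t);
                }
            });
        for (auto& w : workers) w.join();
    }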
26. Dynamic Load Balancing
- Distributed Schemes
  - Distributively share your workload with other processors.
- Issues
  - how to pair sending and receiving processors?
  - transfer of workload initiated by sender or receiver?
  - how much work to transfer?
  - when to decide to transfer?
27. Example: Computing the Mandelbrot Set
The colour of each pixel c is determined solely by its coordinates:

    int getColour(Complex c) {
        int colour = 0;
        Complex z(0, 0);
        while ((|z| < 2) && (colour < max)) {
            z = z*z + c;
            colour++;
        }
        return colour;
    }

(figure: the Mandelbrot set plotted over the complex plane, real axis from -2 to 1, imaginary axis from -1.5 to 1.5)
28. Mandelbrot Set Example (cont.)
- Possible partitioning strategies
  - Partition by individual output pixels (most aggressive partitioning).
  - Partition by rows.
  - Partition by columns.
  - Partition by 2-D blocks.
- Want to explore the Mandelbrot set? See
  - http://aleph0.clarku.edu/djoyce/julia/explorer.html
29. Mandelbrot: Bad Static Load Balancing
(figure: the image split into four contiguous blocks assigned to p0-p3, with a workload chart showing a very uneven load across p0-p3)
30. Mandelbrot: Dynamic Load Balancing
- The task size can be anything from a single node to sizeable blocks.
  - blocks and/or stripes can be used
- Optimal size depends on the overhead associated with task distribution (mostly network properties).
  - one pixel is usually too small
31. Mandelbrot: Load Balancing Issues
Do we really need dynamic load balancing?
32. Mandelbrot: Good Static Load Balancing
(figure: the image rows dealt out to p0-p3 in an interleaved fashion, with a workload chart showing a nearly even load across p0-p3)
The same can be achieved by random distribution of small areas (see the sketch below).
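A minimal sketch of such a static assignment, dealing rows out cyclically so that each process gets a mixture of cheap and expensive rows; colourAt() stands in for the per-pixel computation and is a hypothetical callback:

    #include <vector>

    // rows r = rank, rank + p, rank + 2p, ... are computed by this process
    std::vector<std::vector<int>> myRows(int rank, int p, int height, int width,
                                         int (*colourAt)(int row, int col)) {
        std::vector<std::vector<int>> rows;
        for (int r = rank; r < height; r += p) {
            rows.emplace_back(width);
            for (int c = 0; c < width; ++c)
                rows.back()[c] = colourAt(r, c);
        }
        return rows;
    }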
33. Mandelbrot again!
- A more advanced algorithm might be more complicated to load balance.
  - if the border of an area is of the same colour, it can be filled without computing the inner area
  - recursive decomposition can be used
36. Load Balancing Summary
- Static vs. Dynamic
  - what?
  - advantages/disadvantages
  - when to use them?
- Techniques for Static LB
- Techniques for Dynamic LB
37. Overview
- Preliminaries
- Decomposition Techniques
- Characteristics of Tasks and Interactions
- Load Balancing
- Minimizing Communication Overhead
- Parallel Algorithm Models
38. Minimizing Communication Overhead
- Maximizing Data Locality
- Minimizing Contention and Hot Spots
- Overlapping Computation with Communication
- Replicating Data or Computation
- Using Optimized Collective Communication Routines
- Overlapping Communication
39. Maximizing Data Locality
- Example: matrix multiplication C = A x B
  - c_{i,j} = \sum_{k=0}^{n-1} a_{i,k} b_{k,j}
  - straightforward data partitioning without data repetition
(figure: matrices A and B each partitioned into 4 x 4 blocks, both assigned row by row to p0-p15; captions: partitioning of matrix A, partitioning of matrix B)
When p0 computes c_{0,0}, it has to communicate with ... When p6 computes c_{1,2}, it has to communicate with ...
40. Maximizing Data Locality
- Example: matrix multiplication C = A x B
  - c_{i,j} = \sum_{k=0}^{n-1} a_{i,k} b_{k,j}
  - straightforward data partitioning without data repetition
(figure: matrix A's 4 x 4 blocks assigned row by row to p0-p15, while matrix B's blocks are assigned column by column; captions: partitioning of matrix A, partitioning of matrix B)
When p0 computes c_{0,0}, it has to communicate with ... When p6 computes c_{1,2}, it has to communicate with ...
41. Maximizing Data Locality I
- Minimizing the volume of data exchange: matrix multiplication example
(figure: A x B under two different partitionings of the matrices)
- striped (1-D) partitioning: communication of n² - n²/p elements per processor
- block (2-D) partitioning: communication of 2n²/√p elements per processor
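For example, assuming the volumes above, with n = 1024 and p = 16 the striped partitioning exchanges n² - n²/p = 983,040 elements per processor, while the block partitioning exchanges 2n²/√p = 524,288, roughly half as much; the advantage grows with p.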
42. Maximizing Data Locality II
- Minimizing the volume of data exchange: surface-to-volume ratio, 2-D simulation example
  - strip partitioning: communication volume 2n per processor
  - block partitioning: communication volume 4n/√p per processor
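As a concrete instance of the same surface-to-volume effect: with n = 1024 and p = 16, a strip exchanges 2n = 2048 boundary values per processor, whereas a square block exchanges only 4n/√p = 1024.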
43. Maximizing Data Locality III
- Minimizing frequency of interaction
  - communication start-up time is much greater than per-byte time
- Sparse matrix multiplication example
  - examine your vectors, figure out which entries are non-zero
  - request all data you need in one block (or as much as fits into your memory)
  - process locally with the received data
44. Trading Memory vs. Communication
- If you have enough memory, you can sometimes trade memory for communication by having the input data stored at multiple places.
- Example 1: low-level image filtering
  - each pixel x is replaced by a function of x and of its neighbourhood
  - if each processor stores not only its pixels, but also their neighbourhoods, no communication is needed
45. Trading Memory vs. Communication II
- Example 2: matrix multiplication C = A x B
  - replicating rows/columns makes it possible to eliminate communication
(figure: processor p0 stores a full row of matrix A and a full column of matrix B; likewise p6)
- The processor computing c_{i,j} stores row i of matrix A and column j of matrix B.
  - no further communication needed
  - what is the memory overhead? How many processors store row i of matrix A? How many processors store column j of matrix B?
46. Overlapping Computation with Communication
- Mandelbrot with dynamic load balancing (see the sketch below)
  - request the next block when the current one is nearing its completion
  - e.g. with k-row blocks, ask when you start computing the last row of the current block
  - analysis/experimentation is needed to identify the optimal prefetch distance
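A minimal sketch of the worker side, assuming an MPI master-worker setup; the tags, message layout and stop convention (a negative block index) are made up for illustration:

    #include <mpi.h>
    #include <vector>

    // computes k-row blocks handed out by the master, asking for the next block
    // index while the last row of the current block is still being computed
    void worker(int width, int k, int master,
                void (*computeRow)(int row, std::vector<int>& out)) {
        std::vector<int> row(width);
        int current, next;
        MPI_Recv(&current, 1, MPI_INT, master, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        while (current >= 0) {                            // negative index means "stop"
            MPI_Request req = MPI_REQUEST_NULL;
            for (int r = 0; r < k; ++r) {
                if (r == k - 1) {
                    // nearing completion: prefetch the next block index now
                    MPI_Irecv(&next, 1, MPI_INT, master, 0, MPI_COMM_WORLD, &req);
                    MPI_Send(&current, 1, MPI_INT, master, 1, MPI_COMM_WORLD);
                }
                computeRow(current * k + r, row);
                MPI_Send(row.data(), width, MPI_INT, master, 2, MPI_COMM_WORLD);
            }
            MPI_Wait(&req, MPI_STATUS_IGNORE);            // next block index has arrived
            current = next;
        }
    }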
47. Overlapping Communication
Example: 4 consecutive broadcasts among processors p0-p7.
(figure: four back-to-back broadcasts take 4 x 3 = 12 steps; with pipelining the same four broadcasts finish in 10 steps)
48. Overview
- Preliminaries
- Decomposition Techniques
- Characteristics of Tasks and Interactions
- Load Balancing
- Minimizing Communication Overhead
- Parallel Algorithm Models
49. Parallel Algorithm Models
- Data-Parallel Model
- Task Graph Model
- Work Pool Model
- Master-Slave Model
- Pipeline (Producer-Consumer) Model
50. Data-Parallel Model
- divide data up amongst processors.
- process different data segments in parallel
- communicate boundary information, if necessary
- includes loop parallelism
- well suited for SIMD machines
- communication is often implicit (HPF)
51. Task Graph Model
- decompose the algorithm into different sections
- assign sections to different processors
- often uses fork()/join()/spawn()
- usually does not lend itself to a high degree of parallelism
52. Work Pool Model
- dynamic mapping of tasks to processes
- typically a small amount of data per task
- the pool of tasks (priority queue, hash table, tree) can be centralized or distributed
(figure: processes P0-P3 around a shared work pool holding tasks t0, t2, t3, t7, t8; each process gets a task, processes it, and possibly adds new tasks to the pool)
53. Master-Slave Model
- master generates and allocates tasks
- can also be hierarchical/multi-layer
- master is potentially a bottleneck
- overlapping communication and computation at the master is often useful
54. Pipelining
- a sequence of tasks whose execution can overlap
- a sequential processor must execute them one after another, without overlap
- a parallel computer can overlap the tasks, increasing throughput (but not decreasing latency)
55. 5-step Guide to Parallelization
- Identify computational hotspots
  - find what is worth parallelizing
- Partition the problem into smaller semi-independent tasks
  - find/create parallelism
- Identify communication requirements between these tasks
  - realize the constraints communication puts on parallelism
- Agglomerate smaller tasks into larger tasks
  - group the basic tasks together so that communication is minimized, while still retaining good load balancing properties
- Map tasks/data to actual processors
  - balance the load of the processors, while trying to minimize communication