Parallel Algorithm Design
1
Parallel Algorithm Design
  • Involves all of the following
  • identifying the portions of the work that can be
    performed concurrently
  • mapping the concurrent pieces of work onto
    multiple processes running in parallel
  • distributing the input, output and intermediate
    data associated with the program
  • managing access to data shared by multiple
    processes
  • synchronizing the processes in various stages of
    parallel program execution
  • Optimal choices depend on the parallel
    architecture

2
Platform dependency example
  • Problem: process each element of an array, with
    interaction between neighbouring elements
  • Setting: message-passing computer
  • Solution: distribute the array into blocks of
    size n/p (see the sketch below)
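A minimal sketch of this block distribution, assuming each of the p
processes only needs to know which contiguous index range it owns (the
helper name blockRange is ours, not from the slides):

    #include <utility>

    // Compute the [lo, hi) index range owned by process `rank` when an
    // array of n elements is split into p nearly equal contiguous blocks.
    // The first n % p blocks receive one extra element.
    std::pair<int, int> blockRange(int n, int p, int rank) {
        int base  = n / p;        // minimum block size
        int extra = n % p;        // number of blocks with one extra element
        int lo = rank * base + (rank < extra ? rank : extra);
        int hi = lo + base + (rank < extra ? 1 : 0);
        return {lo, hi};
    }

For example, with n = 10 and p = 3 the blocks are [0,4), [4,7) and [7,10).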

3
Overview
  • Preliminaries
  • Decomposition Techniques
  • Characteristics of Tasks and Interactions
  • Load Balancing
  • Minimizing Communication Overhead
  • Parallel Algorithm Models
4
Overview
  • Preliminaries
  • Decomposition Techniques
  • Characteristics of Tasks and Interactions
  • Load Balancing
  • Minimizing Communication Overhead
  • Parallel Algorithm Models
5
Preliminaries
  • Task Dependency Graph
  • directed acyclic graph capturing the causal
    dependencies between tasks
  • a task corresponding to a node can be executed
    only after all tasks at the tails of its incoming
    edges have already been executed

6
Preliminaries
  • Granularity
  • fine-grained: large number of small tasks
  • coarse-grained: small number of large tasks
  • Degree of Concurrency
  • the maximum number of tasks that can be executed
    simultaneously
  • Critical Path
  • the costliest directed path between any pair of
    start and finish nodes in the task dependency
    graph
  • the cost of a path is the sum of the weights of
    its nodes
  • Task Interaction Graph
  • tasks correspond to nodes and an edge connects
    two tasks if they communicate/interact with each
    other

7
Exercises
[Weighted task dependency graph; node weights: 3, 4, 1, 5, 2, 6, 5, 4, 3, 3, 2]
  • For this task graph, determine
  • degree of concurrency
  • critical path

8
Overview
  • Preliminaries
  • Decomposition Techniques
  • Characteristics of Tasks and Interactions
  • Load Balancing
  • Minimizing Communication Overhead
  • Parallel Algorithm Models
9
Decomposition techniques
  • Why?
  • find/create parallelism
  • How?
  • Recursive Decomposition
  • Data Decomposition
  • Task Decomposition
  • Exploratory Decomposition
  • Speculative Decomposition

10
Recursive Decomposition
  • Divide and conquer leads to natural concurrency.
  • quick sort

[Quicksort example: the array 6 2 5 8 9 5 1 7 3 4 3 0 is partitioned
around a pivot, and the resulting sub-arrays are then sorted concurrently
by the recursive calls]
  • finding minimum recursively (see the sketch below)
  • rMin(A[0..n-1]) = min(rMin(A[0..n/2-1]),
    rMin(A[n/2..n-1]))
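A minimal sketch of this recursive decomposition in C++, running the two
halves as concurrent tasks with std::async; the 1024-element sequential
cutoff is an arbitrary threshold of ours:

    #include <algorithm>
    #include <future>
    #include <vector>

    // Recursively find the minimum of A[lo..hi-1]; the two halves are
    // evaluated as independent tasks that may run in parallel.
    int rMin(const std::vector<int>& A, int lo, int hi) {
        if (hi - lo <= 1024)                  // small base case: run sequentially
            return *std::min_element(A.begin() + lo, A.begin() + hi);
        int mid = lo + (hi - lo) / 2;
        auto left = std::async(std::launch::async, rMin, std::cref(A), lo, mid);
        int right = rMin(A, mid, hi);         // right half computed by this task
        return std::min(left.get(), right);
    }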

11
Data Decomposition
  • Data Decomposition
  • Begin by focusing on the largest data
    structures, or the ones that are accessed most
    frequently.
  • Divide the data into small pieces, if possible
    of similar size.
  • Strive for more aggressive partitioning than
    your target computer will allow.
  • Use the data partitioning as a guideline for
    partitioning the computation into separate tasks:
    associate some computation with each data
    element.
  • Take communication requirements into account
    when partitioning data.

12
Data Partitioning
  • Partitioning according to:
  • input data (e.g. find minimum, sorting)
  • output data (e.g. matrix multiplication)
  • intermediate data (e.g. bucket sort)
  • Associate tasks with the data
  • do as much as you can with the data before
    further communication
  • owner-computes rule
  • Partition in a way that minimizes the cost of
    communication

13
Task Decomposition
  • Partition the computation into many small tasks
    of approximately uniform computational
    requirements.
  • Associate data with each task.
  • Common in problems where there are no obvious
    data structures to partition, or where the data
    structures are highly unstructured.

14
Exploratory Decomposition
  • commonly used in search-space exploration
  • unlike data decomposition, the search space is
    not known beforehand
  • the computation can terminate as soon as a
    solution is found
  • the amount of work can be more or less than in
    the sequential case

15
Speculative Decomposition
  • Example: discrete event simulation
  • execute both branches concurrently, then keep
    the one that was actually taken
  • the total amount of work is always more than in
    the sequential case, but execution time can be less

16
Overview
  • Preliminaries
  • Decomposition Techniques
  • Characteristics of Tasks and Interactions
  • Load Balancing
  • Minimizing Communication Overhead
  • Parallel Algorithm Models
17
Task Characteristics
  • Influence load balancing and decomposition
    choices
  • Characteristics of Tasks
  • task generation: static vs dynamic
  • task sizes: uniform or non-uniform, known or
    unknown
  • size of data associated with tasks: influences
    mapping decisions and input/output sizes
  • Characteristics of Inter-Task Communication
  • static vs dynamic
  • regular vs irregular
  • read-only vs read-write
  • one-way vs two-way

18
Overview
  • Preliminaries
  • Decomposition Techniques
  • Characteristics of Tasks and Interactions
  • Load Balancing
  • Minimizing Communication Overhead
  • Parallel Algorithm Models
19
Load Balancing
Efficiency adversely affected by uneven workload
[Gantt chart of processors P0 to P4 showing computation time and idle
(wasted) time]
20
Load Balancing (cont.)
Load balancing: shifting work from heavily loaded
processors to lightly loaded ones.
[Gantt chart of processors P0 to P4 showing computation time, idle
(wasted) time, and work moved between processors]
  • Static load balancing
  • before execution
  • Dynamic load balancing
  • during execution

21
Static Load Balancing
  • Map data and tasks onto processors prior to
    execution
  • the tasks must be known beforehand (static task
    generation)
  • usually the task sizes need to be known in order
    for this to work well
  • even if the sizes are known (but non-uniform),
    the problem of optimal mapping is NP-hard (but
    there are reasonable approximation schemes)

22
1D Array Partitioning
23
2D Array Partitioning
[Diagrams of 2D block and block-cyclic partitionings of an array over
processors p0 to p5]
24
Example: Geometric Operations
  • Image filtering, geometric transformations, etc.
  • Trivial observation:
  • Workload is directly proportional to the number
    of objects.
  • If dealing with pixels, the workload is
    proportional to area.

25
Dynamic Load Balancing
  • Centralized Schemes
  • master-slave
  • master generates tasks and distributes the
    workload
  • easy to program, but prone to the master becoming
    a bottleneck
  • self-scheduling
  • take a task from the work pool when you are
    ready
  • chunk scheduling
  • self-scheduling where each request takes a single
    task can be costly
  • take a chunk of tasks at once
  • when there are few tasks left, the chunk size
    decreases (see the sketch below)
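A minimal shared-memory sketch of chunk self-scheduling; the
shrinking-chunk rule remaining / (2 * n_threads) is one common heuristic,
chosen here for illustration rather than prescribed by the slides:

    #include <algorithm>
    #include <atomic>
    #include <functional>

    std::atomic<int> next_task{0};   // shared counter over tasks 0..n_tasks-1

    // Each worker repeatedly grabs a chunk of task ids; chunks shrink as the
    // pool empties, so the tail end of the work is balanced more finely.
    void worker(int n_tasks, int n_threads,
                const std::function<void(int)>& processTask) {
        while (true) {
            int remaining = n_tasks - next_task.load();
            if (remaining <= 0) return;
            int chunk = std::max(1, remaining / (2 * n_threads));
            int start = next_task.fetch_add(chunk);   // claim the chunk atomically
            if (start >= n_tasks) return;
            int end = std::min(start + chunk, n_tasks);
            for (int t = start; t < end; ++t)
                processTask(t);                        // do the work for task t
        }
    }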

26
Dynamic Load Balancing
  • Distributed Schemes
  • Distributively share your workload with other
    processors.
  • Issues
  • how to pair sending and receiving processors
  • transfer of workload initiated by sender or
    receiver?
  • how much work to transfer?
  • when to decide to transfer?

27
Example: Computing Mandelbrot Set
The colour of each pixel c is determined solely by its coordinates:

    int getColour(Complex c) {
        int colour = 0;
        Complex z(0, 0);
        while ((abs(z) < 2) && (colour < max)) {
            z = z*z + c;
            colour++;
        }
        return colour;
    }

[Plot of the Mandelbrot set: real axis from -2 to 1, imaginary axis from
-1.5 to 1.5]
28
Mandelbrot Set Example (cont.)
  • Possible partitioning strategies
  • partition by individual output pixels (the most
    aggressive partitioning)
  • partition by rows
  • partition by columns
  • partition by 2-D blocks
  • Want to explore the Mandelbrot set? See
  • http://aleph0.clarku.edu/djoyce/julia/explorer.html

29
Mandelbrot Bad Static Load Balancing
[Figure: the image split into four contiguous blocks assigned to p0, p1,
p2, p3; the workload bar chart is strongly uneven]
30
Mandelbrot Dynamic Load Balancing
  • The task size can be anything from a single
    pixel to sizeable blocks.
  • blocks and/or stripes can be used
  • The optimal size depends on the overhead
    associated with task distribution (mostly network
    properties).
  • one pixel is usually too small

31
Mandelbrot Load Balancing Issues
Do we really need dynamic load balancing?
32
Mandelbrot Good Static Load Balancing
[Figure: the image divided into many thin stripes assigned to p0 to p3 in
an interleaved fashion; the workload bar chart is nearly even]
The same can be achieved by random distribution
of small areas.
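A minimal sketch of such a static assignment, interleaving image rows
cyclically over the p processes (image, pixelToComplex and the loop bounds
are hypothetical; getColour is the routine from the earlier slide):

    // Static interleaved ("cyclic") assignment of image rows: process `rank`
    // (0 <= rank < p) computes rows rank, rank + p, rank + 2p, ...
    // Expensive and cheap rows are thereby spread evenly over all processes.
    void computeMyRows(int rank, int p, int height, int width) {
        for (int row = rank; row < height; row += p)
            for (int col = 0; col < width; ++col) {
                // image[row][col] = getColour(pixelToComplex(row, col));
            }
    }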
33
Mandelbrot again!
  • A more advanced algorithm might be more
    complicated to load balance
  • if the border of an area is of the same colour,
    it can be filled without computing the inner area
  • recursive decomposition can be used

36
Load Balancing Summary
  • Static vs Dynamic
  • what?
  • advantages/disadvantages
  • when to use them?
  • Techniques for Static LB
  • Techniques for Dynamic LB

37
Overview
  • Preliminaries
  • Decomposition Techniques
  • Characteristics of Tasks and Interactions
  • Load Balancing
  • Minimizing Communication Overhead
  • Parallel Algorithm Models
38
Minimizing Communication Overhead
  • Maximizing Data Locality
  • Minimizing Contention and Hot Spots
  • Overlapping Computation with Communication
  • Replicating Data or Computation
  • Using Optimized Collective Communication Routines
  • Overlapping Communication
39
Maximizing data locality
  • Example: matrix multiplication C = A x B
  • c(i,j) = Σ_{k=0..n-1} a(i,k) * b(k,j)
  • straightforward data partitioning without data
    replication

[Figure: matrices A and B each partitioned into 4 x 4 blocks assigned to
p0 through p15 row by row]
When p0 computes c(0,0), which processors does it have to communicate
with? When p6 computes c(1,2), which processors does it have to
communicate with?
40
Maximizing data locality
  • Example: matrix multiplication C = A x B
  • c(i,j) = Σ_{k=0..n-1} a(i,k) * b(k,j)
  • straightforward data partitioning without data
    replication

[Figure: matrix A partitioned into 4 x 4 blocks assigned to p0 through
p15 row by row; matrix B partitioned into 4 x 4 blocks assigned column
by column]
When p0 computes c(0,0), which processors does it have to communicate
with? When p6 computes c(1,2), which processors does it have to
communicate with?
41
Maximizing data locality I
Minimizing the volume of data exchange
Matrix multiplication example
[Figure: two partitionings of C = A x B compared]
Partitioning 1: communication of n^2 - n^2/p elements per processor
Partitioning 2 (2D blocks): communication of 2n^2/sqrt(p) elements per
processor
42
Maximizing data locality II
Minimizing the volume of data exchange: the surface-to-volume ratio
2D simulation example
  • 1D strip partitioning: communication volume 2n per processor
  • 2D block partitioning: communication volume 4n/sqrt(p) per processor
43
Maximizing data locality III
  • Minimizing the frequency of interaction
  • communication start-up time is much greater than
    the per-byte transfer time
  • Sparse matrix multiplication example (see the
    sketch below)
  • examine your vectors and figure out which entries
    are non-zero
  • request all the data you need in one block (or as
    much as fits into your memory)
  • process locally with the received data
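A minimal sketch of the idea for one sparse row times a distributed dense
vector; fetchRemote stands in for whatever single bulk transfer the real
system would use and is passed in as a callback:

    #include <cstddef>
    #include <functional>
    #include <vector>

    // One aggregated request for all needed remote entries, instead of one
    // request per entry: the start-up cost is paid once, not cols.size() times.
    double sparseRowTimesVector(
            const std::vector<int>& cols,      // column indices of the non-zeros
            const std::vector<double>& vals,   // the non-zero values themselves
            const std::function<std::vector<double>(const std::vector<int>&)>& fetchRemote) {
        std::vector<double> x = fetchRemote(cols);  // single bulk transfer of the needed entries
        double sum = 0.0;
        for (std::size_t k = 0; k < cols.size(); ++k)
            sum += vals[k] * x[k];                  // process locally with the received data
        return sum;
    }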

44
Trading Memory vs Communication
  • If you have enough memory, you can sometimes
    trade memory for communication by storing the
    input data in multiple places.
  • Example 1: low-level image filtering
  • each pixel x is replaced by a function of x and
    of its neighbourhood
  • if each processor stores not only its own pixels
    but also their neighbourhoods, no communication
    is needed (see the sketch below)
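A minimal sketch of this replication, assuming a row-partitioned image
and one replicated ghost row above and below each block (the simple
5-point average is just an illustrative filter, not one from the slides):

    #include <vector>

    // Each process stores its n_local rows plus one replicated "ghost" row
    // above and below, copied from the neighbouring blocks when the image
    // is first distributed. With the neighbourhood available locally, the
    // filter itself needs no communication at all.
    void filterLocal(const std::vector<std::vector<float>>& in,  // n_local + 2 rows
                     std::vector<std::vector<float>>& out,       // n_local rows
                     int n_local, int width) {
        for (int r = 1; r <= n_local; ++r)          // skip the two ghost rows
            for (int c = 1; c + 1 < width; ++c)     // interior columns only, for brevity
                out[r - 1][c] = (in[r][c] + in[r - 1][c] + in[r + 1][c]
                                 + in[r][c - 1] + in[r][c + 1]) / 5.0f;
    }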

45
Trading Memory vs Communication II
  • Example 2: matrix multiplication C = A x B
  • replicating rows/columns makes it possible to
    eliminate communication

[Figure: matrix A and matrix B with the rows/columns stored by p0 and by
p6 highlighted]
  • The processor computing c(i,j) stores row i of matrix A
    and column j of matrix B.
  • no further communication needed
  • what is the memory overhead? How many processors
    store row i of matrix A? How many processors
    store column j of matrix B?

46
Overlapping Computation with Communication
  • Mandelbrot with dynamic load balancing
  • request the next block while the current one is
    nearing completion (see the sketch below)
  • e.g. with k-row blocks, ask when you start
    computing the last row of the current block
  • analysis/experimentation is needed to identify
    the optimal prefetch distance
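A minimal sketch of the prefetching idea using C++ futures; requestBlock
and computeBlock are hypothetical stand-ins (stubbed out here) for the
real task-request and rendering code:

    #include <future>
    #include <optional>

    struct Block { int firstRow, numRows; };

    // Stubs standing in for the real master/worker communication and the
    // actual Mandelbrot rendering (both hypothetical).
    std::optional<Block> requestBlock() { return std::nullopt; }
    void computeBlock(const Block&) {}

    void workerLoop() {
        std::optional<Block> current = requestBlock();
        while (current) {
            // Ask for the next block *before* finishing the current one,
            // so the request overlaps with the remaining computation.
            auto next = std::async(std::launch::async, requestBlock);
            computeBlock(*current);
            current = next.get();   // by now the reply has (hopefully) arrived
        }
    }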

47
Overlapping Communication
Example: 4 consecutive broadcasts
[Figure: processors p0 to p7 performing the four broadcasts one after
another: 12 steps in total]
[Figure: the same four broadcasts pipelined over p0 to p7: 10 steps]
48
Overview
  • Preliminaries
  • Decomposition Techniques
  • Characteristics of Tasks and Interactions
  • Load Balancing
  • Minimizing Communication Overhead
  • Parallel Algorithm Models
49
Parallel Algorithm Models
  • Data-Parallel Model
  • Task Graph Model
  • Work Pool Model
  • Master-Slave Model
  • Pipeline (Producer-Consumer) Model

50
Data Parallel Model
  • divide the data up amongst the processors
  • process different data segments in parallel
  • communicate boundary information, if necessary
  • includes loop parallelism (see the sketch below)
  • well suited for SIMD machines
  • communication is often implicit (HPF)
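A minimal loop-parallel sketch using the C++17 parallel algorithms;
squaring each element is just an illustrative per-element operation:

    #include <algorithm>
    #include <execution>
    #include <vector>

    // The same operation is applied to every element; the library is free
    // to split the range over the available processors.
    void squareAll(std::vector<double>& data) {
        std::transform(std::execution::par, data.begin(), data.end(),
                       data.begin(), [](double x) { return x * x; });
    }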

51
Task Graph Model
  • decompose the algorithm into different sections
  • assign the sections to different processors
  • often uses fork()/join()/spawn()
  • usually does not lend itself to a high degree of
    parallelism

52
Work Pool Model
  • dynamic mapping of tasks to processes
  • typically a small amount of data per task
  • the pool of tasks (priority queue, hash table,
    tree) can be centralized or distributed (see the
    sketch below)

[Figure: processes P0 to P3 around a central work pool containing tasks
t0, t2, t3, t7, t8; each process gets a task, processes it, and possibly
adds new tasks to the pool]
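A minimal shared-memory sketch of a centralized work pool protected by a
mutex; representing tasks by plain integer ids is our simplification:

    #include <mutex>
    #include <optional>
    #include <queue>

    class WorkPool {
        std::queue<int> tasks;                // pending task ids
        std::mutex m;
    public:
        void add(int task) {                  // "possibly add task"
            std::lock_guard<std::mutex> g(m);
            tasks.push(task);
        }
        std::optional<int> get() {            // "get task"; empty result = no work left
            std::lock_guard<std::mutex> g(m);
            if (tasks.empty()) return std::nullopt;
            int t = tasks.front();
            tasks.pop();
            return t;
        }
    };

Each worker loops on get(), processes the returned task, and may call
add() with any newly generated tasks; a distributed variant would replace
the single queue with per-process queues.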
53
Master-Slave Model
  • the master generates and allocates tasks
  • can also be hierarchical / multi-level
  • the master is potentially a bottleneck
  • overlapping communication and computation at the
    master is often useful

54
Pipelining
  • a sequence of tasks whose execution can overlap
  • a sequential processor must execute them one
    after another, without overlap
  • a parallel computer can overlap the tasks,
    increasing throughput (but not decreasing latency)

55
5-step Guide to Parallelization
  • Identify computational hotspots
  • find what is worth parallelizing
  • Partition the problem into smaller
    semi-independent tasks
  • find/create parallelism
  • Identify the communication requirements between
    these tasks
  • realize the constraints communication places on
    parallelism
  • Agglomerate smaller tasks into larger tasks
  • group the basic tasks together so that
    communication is minimized, while still allowing
    good load balancing
  • Map tasks/data to actual processors
  • balance the load across processors, while trying
    to minimize communication