Parallelization Strategies and Load Balancing - PowerPoint PPT Presentation

About This Presentation
Title:

Parallelization Strategies and Load Balancing

Description:

Parallelization Strategies and Load Balancing. Some material borrowed from lectures of J. Demmel, UC Berkeley.

Transcript and Presenter's Notes

Title: Parallelization Strategies and Load Balancing


1
Parallelization Strategies and Load Balancing
  • Some material borrowed from lectures of J.
    Demmel, UC Berkeley

2
Ideas for dividing work
  • Embarrassingly parallel computations
  • ideal case: after perhaps some initial
    communication, all processes operate
    independently until the end of the job
  • examples: computing pi, general Monte Carlo
    calculations, simple geometric transformation of
    an image (see the sketch below)
  • static or dynamic (worker pool) task assignment
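
Below is a minimal sketch, not from the slides, of the embarrassingly parallel pattern applied to the Monte Carlo estimate of pi: each worker process runs independently, and the only communication is the final sum of the hit counts (function and variable names are illustrative).

import random
from concurrent.futures import ProcessPoolExecutor

def count_hits(samples: int, seed: int) -> int:
    # Count random points in the unit square that land inside the quarter circle.
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

if __name__ == "__main__":
    workers, samples_each = 4, 250_000          # illustrative sizes
    with ProcessPoolExecutor(max_workers=workers) as pool:
        hits = sum(pool.map(count_hits, [samples_each] * workers, range(workers)))
    print("pi is approximately", 4.0 * hits / (workers * samples_each))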

3
Ideas for dividing work
  • Partitioning
  • partition the data, or the domain, or the task
    list, perhaps master/slave
  • examples: dot product of vectors, integration on
    a fixed interval, N-body problem using domain
    decomposition (see the sketch below)
  • static or dynamic task assignment; need for care
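
A sketch of static partitioning for the dot product example, assuming a simple worker-pool setup: the index range is split into equal contiguous blocks, each worker computes a partial sum over its block, and the results are combined at the end.

from concurrent.futures import ProcessPoolExecutor

def partial_dot(xs, ys):
    # Partial dot product over one contiguous block of the vectors.
    return sum(a * b for a, b in zip(xs, ys))

if __name__ == "__main__":
    n, p = 1_000_000, 4                       # illustrative sizes
    x, y = [1.0] * n, [2.0] * n
    bounds = [(i * n // p, (i + 1) * n // p) for i in range(p)]
    with ProcessPoolExecutor(max_workers=p) as pool:
        parts = pool.map(partial_dot,
                         [x[lo:hi] for lo, hi in bounds],
                         [y[lo:hi] for lo, hi in bounds])
    print(sum(parts))                         # 2.0 * n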

4
Ideas for dividing work
  • Divide & Conquer
  • recursively partition the data, or the domain, or
    the task list
  • examples: tree algorithm for the N-body problem,
    multipole, multigrid (see the sketch below)
  • usually dynamic work assignments
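
A sketch of the divide-and-conquer idea on a toy reduction (the split threshold and the leaf computation are illustrative, not from the slides): the interval is split recursively until the pieces are small enough, and the resulting leaf tasks are handed to a worker pool.

import math
from concurrent.futures import ProcessPoolExecutor

def split(lo, hi, threshold):
    # Recursively partition [lo, hi) until each piece is below the threshold.
    if hi - lo <= threshold:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return split(lo, mid, threshold) + split(mid, hi, threshold)

def leaf_work(bounds):
    lo, hi = bounds
    return sum(math.sqrt(i) for i in range(lo, hi))

if __name__ == "__main__":
    leaves = split(0, 1_000_000, threshold=125_000)   # 8 leaf tasks
    with ProcessPoolExecutor() as pool:
        print(sum(pool.map(leaf_work, leaves)))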

5
Ideas for dividing work
  • Pipelining
  • a sequence of tasks, each performed by one of a
    host of processors (functional decomposition)
  • examples: upper triangular linear solves,
    pipeline sorts (see the sketch below)
  • usually dynamic work assignments
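
A sketch of pipelining with two functionally decomposed stages (the stage functions are placeholders): each stage runs on its own thread, and items stream between stages through queues, so different items occupy different stages at the same time.

import queue
import threading

def stage(fn, inq, outq):
    # Apply fn to each item; a None sentinel shuts the stage down and is passed on.
    while True:
        item = inq.get()
        if item is None:
            outq.put(None)
            return
        outq.put(fn(item))

q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=stage, args=(lambda x: x + 1, q0, q1)).start()
threading.Thread(target=stage, args=(lambda x: x * x, q1, q2)).start()

for i in range(5):
    q0.put(i)
q0.put(None)

while (result := q2.get()) is not None:
    print(result)        # (i + 1) ** 2 for i = 0..4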

6
Ideas for dividing work
  • Synchronous Computing
  • same computation on different sets of data, often
    with domain decomposition
  • examples: iterative linear system solves (see the
    sketch below)
  • often can schedule static work assignments, if
    data structures don't change
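
A sketch of the synchronous pattern as a toy Jacobi-style sweep over a 1D grid (sizes and boundary values are illustrative): every point gets the same update in each sweep, the end of a sweep acts as the synchronization point, and a static partition of the grid stays balanced because the data structure never changes.

n, sweeps = 10, 200
u = [0.0] * n
u[0], u[-1] = 1.0, 1.0                 # fixed boundary values

for _ in range(sweeps):
    new = u[:]                         # every update reads only old values
    for i in range(1, n - 1):
        new[i] = 0.5 * (u[i - 1] + u[i + 1])
    u = new                            # acts like a barrier between sweeps

print([round(v, 3) for v in u])        # interior values converge toward 1.0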

7
Load balancing
  • Determined by
  • Task costs
  • Task dependencies
  • Locality needs
  • Spectrum of solutions
  • Static - all information available before
    starting
  • Semi-Static - some info before starting
  • Dynamic - little or no info before starting
  • Survey of solutions
  • How each one works
  • Theoretical bounds, if any
  • When to use it

8
Load Balancing in General
  • Large literature
  • A closely related problem is scheduling, which is
    to determine the order in which tasks run

9
Load Balancing Problems
  • Task costs
  • Do all tasks have equal costs?
  • Task dependencies
  • Can all tasks be run in any order (including
    parallel)?
  • Task locality
  • Is it important for some tasks to be scheduled on
    the same processor (or nearby) to reduce
    communication cost?

10
Task Cost Spectrum
11
Task Dependency Spectrum
12
Task Locality Spectrum
13
Approaches
  • Static load balancing
  • Semi-static load balancing
  • Self-scheduling
  • Distributed task queues
  • Diffusion-based load balancing
  • DAG scheduling
  • Mixed Parallelism

14
Static Load Balancing
  • All information is available in advance
  • Common cases
  • dense matrix algorithms, e.g. LU factorization
  • done using blocked/cyclic layout
  • blocked for locality, cyclic for load balancing
  • usually a regular mesh, e.g., FFT
  • done using cyclic+transpose+blocked layout for 1D
  • sparse-matrix-vector multiplication
  • use graph partitioning, where graph does not
    change over time
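
A sketch of a 1D block-cyclic assignment of matrix columns, as in the dense LU case above (block size and processor count are illustrative): blocks of b consecutive columns give locality, and dealing blocks out round-robin gives load balance as the factorization shrinks the active part of the matrix.

def owner(block_index: int, p: int) -> int:
    # Processor that owns a block in a 1D block-cyclic layout.
    return block_index % p

p, b, n = 4, 2, 16          # processors, block size, matrix dimension (illustrative)
n_blocks = n // b
for proc in range(p):
    cols = [c for blk in range(n_blocks) if owner(blk, p) == proc
              for c in range(blk * b, (blk + 1) * b)]
    print("processor", proc, "owns columns", cols)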

15
Semi-Static Load Balance
  • Domain changes slowly; locality is important
  • use static algorithm
  • do some computation, allowing some load imbalance
    on later steps
  • recompute a new load balance using static
    algorithm
  • Particle simulations, particle-in-cell (PIC)
    methods (see the sketch after this slide)
  • tree-structured computations (Barnes-Hut, etc.)
  • grid computations with dynamically changing grid,
    which changes slowly
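
A rough sketch of the semi-static pattern on a made-up 1D particle problem: the static partition (an equal-count split of the sorted positions) is recomputed only every few steps, and because the particles drift slowly, the load imbalance stays small in between.

import random

random.seed(0)
particles = [random.random() for _ in range(1000)]
P, REBALANCE = 4, 10
splits = []

for step in range(50):
    if step % REBALANCE == 0:
        # "static algorithm": equal-count split of the sorted positions
        s = sorted(particles)
        splits = [s[(i * len(s)) // P] for i in range(1, P)]
    # advance the particles a little; slow change keeps the partition useful
    particles = [min(1.0, max(0.0, x + random.uniform(-0.01, 0.01)))
                 for x in particles]

edges = [0.0] + splits + [1.01]
counts = [sum(lo <= x < hi for x in particles) for lo, hi in zip(edges, edges[1:])]
print(counts)            # per-processor counts stay close to balanced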

16
Self-Scheduling
  • Self scheduling
  • Centralized pool of tasks that are available to
    run
  • When a processor completes its current task, it
    looks at the pool for the next one (see the
    sketch below)
  • If the computation of one task generates more,
    add them to the pool
  • Originally used for
  • Scheduling loops by compiler (really the
    runtime-system)
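
A sketch of self-scheduling on shared memory (task counts and costs are made up): workers pull the next task from a single centralized queue as soon as they finish the previous one, so the number of tasks per worker evens out even though individual task costs are unknown and uneven.

import queue
import random
import threading
import time

tasks = queue.Queue()
for _ in range(40):
    tasks.put(random.uniform(0.001, 0.01))   # unknown, uneven task costs

completed = {wid: 0 for wid in range(4)}

def worker(wid):
    while True:
        try:
            cost = tasks.get_nowait()        # grab the next available task
        except queue.Empty:
            return
        time.sleep(cost)                     # "run" the task
        completed[wid] += 1

threads = [threading.Thread(target=worker, args=(w,)) for w in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(completed)                             # tasks per worker, roughly even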

17
When is Self-Scheduling a Good Idea?
  • A set of tasks without dependencies
  • can also be used with dependencies, but most
    analysis has only been done for task sets without
    dependencies
  • Cost of each task is unknown
  • Locality is not important
  • Using a shared memory multiprocessor, so a
    centralized pool of tasks is fine

18
Variations on Self-Scheduling
  • Don't grab a small unit of parallel work.
  • Grab a chunk of tasks of size K instead.
  • If K is large, access overhead for the task queue
    is small
  • If K is small, likely to have good load balance
  • Four variations
  • Use a fixed chunk size
  • Guided self-scheduling
  • Tapering
  • Weighted Factoring

19
Variation 1: Fixed Chunk Size
  • How to compute the optimal chunk size?
  • Requires a lot of information about the problem
    characteristics, e.g., task costs and their number
  • Needs an off-line algorithm; not useful in practice
  • All tasks must be known in advance

20
Variation 2: Guided Self-Scheduling
  • Use larger chunks at the beginning to avoid
    excessive overhead and smaller chunks near the
    end to even out the finish times.
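
A sketch of the guided self-scheduling chunk rule, assuming the commonly cited form in which each request takes roughly a 1/p fraction of the remaining iterations: chunks start large to keep queue overhead low and shrink toward the end to even out finish times.

import math

def guided_chunks(iterations: int, p: int):
    # Each request takes ceil(remaining / p) iterations.
    chunks, remaining = [], iterations
    while remaining > 0:
        k = math.ceil(remaining / p)
        chunks.append(k)
        remaining -= k
    return chunks

print(guided_chunks(100, 4))
# [25, 19, 14, 11, 8, 6, 5, 3, 3, 2, 1, 1, 1, 1]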

21
Variation 3: Tapering
  • Chunk size Ki is a function of not only the
    remaining work, but also the task cost variance
  • variance is estimated using history information
  • high variance => smaller chunk sizes should be used
  • low variance => larger chunks are OK

22
Variation 4: Weighted Factoring
  • Similar to self-scheduling, but divide the task
    cost by the computational power of the requesting
    node (see the sketch below)
  • Useful for heterogeneous systems
  • Also useful for shared resources, e.g., NOWs
    (networks of workstations)
  • as with Tapering, historical information is used
    to predict future speed
  • speed may depend on the other loads currently
    on a given processor
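
A rough sketch of the weighting idea only, not the full weighted factoring algorithm: a node's chunk is scaled by its estimated relative speed, so faster or less loaded nodes receive proportionally more of the remaining iterations per request (the speed estimates are illustrative).

import math

def weighted_chunk(remaining: int, speeds: dict, node: str) -> int:
    # Scale this node's share of the remaining work by its relative speed.
    share = speeds[node] / sum(speeds.values())
    return max(1, math.ceil(remaining * share))

speeds = {"fast_node": 2.0, "slow_node": 1.0}    # illustrative speed estimates
print(weighted_chunk(90, speeds, "fast_node"))   # 60
print(weighted_chunk(90, speeds, "slow_node"))   # 30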

23
Distributed Task Queues
  • The obvious extension of self-scheduling to
    distributed memory
  • Good when locality is not very important
  • Distributed memory multiprocessors
  • Shared memory with significant synchronization
    overhead
  • Tasks that are known in advance
    The costs of tasks are not known in advance
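
A toy simulation, not a real runtime, of the distributed-task-queue idea: every worker owns its own queue, and a worker whose queue runs dry steals a task from another worker, so an initially skewed distribution of work spreads out.

import collections
import random

random.seed(1)
P = 4
queues = [collections.deque(range(65 if w == 0 else 5)) for w in range(P)]
completed = [0] * P                     # worker 0 starts with most of the work

while any(queues):
    for me in range(P):
        if not queues[me]:              # local queue empty: try to steal
            victim = random.randrange(P)
            if victim != me and len(queues[victim]) > 1:
                queues[me].append(queues[victim].pop())
        if queues[me]:
            queues[me].popleft()        # run one local task
            completed[me] += 1

print(completed)                        # work ends up spread across workers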

24
DAG Scheduling
  • Directed acyclic graph (DAG) of tasks
  • nodes represent computation (weighted)
  • edges represent orderings and usually
    communication (may also be weighted)
  • not common to have the DAG in advance

25
DAG Scheduling
  • Two application domains where DAGs are known
  • Digital Signal Processing computations
  • Sparse direct solvers (mainly Cholesky, since it
    doesn't require pivoting)
  • Basic strategy: partition the DAG to minimize
    communication and keep all processors busy
  • NP-complete, so approximations are needed
  • Different from graph partitioning, which was for
    tasks with communication but no dependencies
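
A sketch of a greedy list-scheduling heuristic for a tiny weighted task DAG (tasks, weights, and edges are made up, and edge communication costs are ignored): a task becomes ready when all of its predecessors have finished, and each ready task is placed on the processor that frees up first. This is an approximation, not an optimal schedule.

import heapq

cost = {"A": 3, "B": 2, "C": 2, "D": 4, "E": 1}          # node weights
preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["C"]}
P = 2

finish = {}
ready = [t for t, ps in preds.items() if not ps]
free_heap = [(0, proc) for proc in range(P)]              # (time free, processor)
heapq.heapify(free_heap)
scheduled = set()

while ready:
    task = ready.pop(0)
    free_at, proc = heapq.heappop(free_heap)
    # Start after the processor is free and all predecessors have finished.
    start = max([free_at] + [finish[p] for p in preds[task]])
    finish[task] = start + cost[task]
    heapq.heappush(free_heap, (finish[task], proc))
    scheduled.add(task)
    for t, ps in preds.items():
        if t not in scheduled and t not in ready and all(p in scheduled for p in ps):
            ready.append(t)

print(finish)      # {'A': 3, 'B': 5, 'C': 5, 'D': 9, 'E': 6} on 2 processors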

26
Mixed Parallelism
  • Another variation - a problem with 2 levels of
    parallelism
  • coarse-grained task parallelism
  • good when many tasks, bad if few
  • fine-grained data parallelism
  • good when much parallelism within a task, bad if
    little

27
Mixed Parallelism
  • Adaptive mesh refinement
  • Discrete event simulation, e.g., circuit
    simulation
  • Database query processing
  • Sparse matrix direct solvers

28
Mixed Parallelism Strategies
29
Which Strategy to Use
30
Switch Parallelism: A Special Case
31
A Simple Performance Model for Data Parallelism
32
Values of Sigma - problem size