Title: Introduction to Parallel Computing by Grama, Gupta, Karypis, Kumar
1. Introduction to Parallel Computing by Grama, Gupta, Karypis, Kumar
- Selected Topics from Chapter 3: Principles of Parallel Algorithm Design
2. Elements of a Parallel Algorithm
- Identifying pieces of work that can be done concurrently
  - Tasks
- Mapping of the tasks onto multiple processors
  - Processes vs. processors
- Distributing the input, output, and intermediate results across the different processors
- Management of access to shared data
  - Either input or intermediate
- Synchronization of the processors at various points of the parallel execution
3. Finding Concurrent Pieces of Work
- Decomposition
  - The process of dividing the computation into smaller pieces of work called tasks
  - Tasks are programmer-defined and are considered to be indivisible.
4. Tasks Can Be of Different Sizes
5. Task-Dependency Graph
- In most cases, there are dependencies between the different tasks
  - Certain task(s) can start only after some other task(s) have finished
  - Example: producer-consumer relationships
- These dependencies are represented using a DAG called a task-dependency graph
6. Task-Dependency Graph (cont.)
- A task-dependency graph is a directed acyclic graph in which the nodes represent tasks and the directed edges indicate the dependencies between them.
- The task corresponding to a node can be executed when all tasks connected to it by incoming edges have been completed.
- The number and size of the tasks into which the problem is decomposed determine the granularity of the decomposition.
  - Called fine-grained for a large number of small tasks
  - Called coarse-grained for a small number of large tasks
7. Task-Dependency Graph (cont.)
- Key concepts derived from the task-dependency graph:
  - Degree of concurrency
    - The number of tasks that can be executed concurrently
    - We are usually most concerned with the average degree of concurrency
  - Critical path
    - The longest vertex-weighted path in the graph
    - The weights of the nodes represent the task sizes
    - Its length is the sum of the weights of the nodes along the path
- The degree of concurrency normally increases, and the critical-path length normally decreases, as the granularity becomes finer.
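The two concepts above can be sketched in a few lines of code. This is a minimal illustration, not code from the textbook; the task names, unit weights, and the binary-tree dependency structure below are made up for the example.

```python
# Sketch: critical-path length and average degree of concurrency of a
# task-dependency graph. Tasks/weights here are invented for illustration.
from collections import defaultdict

def critical_path_length(weights, edges):
    """Longest vertex-weighted path in a DAG.

    weights: dict task -> task size
    edges:   list of (u, v) meaning task v depends on task u
    """
    deps = defaultdict(list)          # task -> prerequisite tasks
    for u, v in edges:
        deps[v].append(u)

    memo = {}
    def longest(t):                   # heaviest path ending at task t
        if t not in memo:
            memo[t] = weights[t] + max((longest(u) for u in deps[t]), default=0)
        return memo[t]

    return max(longest(t) for t in weights)

# Binary-tree reduction over 4 leaf tasks: 7 unit-weight tasks in all.
weights = {t: 1 for t in "abcdefg"}
edges = [("a", "e"), ("b", "e"), ("c", "f"), ("d", "f"), ("e", "g"), ("f", "g")]

cp = critical_path_length(weights, edges)
print(cp)                             # 3: three levels on the critical path
print(sum(weights.values()) / cp)     # average degree of concurrency = 7/3
```

Total work divided by the critical-path length gives the average degree of concurrency, which bounds the achievable speedup.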
8. Task-Interaction Graph
- Captures the pattern of interaction between tasks
- This graph usually contains the task-dependency graph as a subgraph.
  - True because there may be interactions between tasks even when there are no dependencies between them.
  - These interactions are usually due to accesses of shared data.
9. Task-Dependency and Task-Interaction Graphs
- These graphs are important in developing an effective mapping of the tasks onto the different processors.
- The mapping needs to maximize concurrency and minimize overheads.
10. Common Decomposition Methods
- Data Decomposition
- Recursive Decomposition
- Exploratory Decomposition
- Speculative Decomposition
- Hybrid Decomposition
11. Recursive Decomposition
- Suitable for problems that can be solved using the divide-and-conquer paradigm
- Each of the subproblems generated by the divide step becomes a new task.
12. Example: Quicksort
13. Another Example: Finding the Minimum
- Note that we can obtain divide-and-conquer
algorithms for problems that are usually solved
by using other methods.
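A divide-and-conquer formulation of minimum-finding can be sketched as follows. This is an illustrative sketch, not the textbook's code; in a parallel execution the two recursive calls would be independent tasks.

```python
# Sketch: divide-and-conquer minimum-finding. Each half of the array
# becomes a new task; the two halves are independent in the
# task-dependency graph and could be executed concurrently.

def dc_min(a, lo=0, hi=None):
    """Recursively find the minimum of a[lo:hi]."""
    if hi is None:
        hi = len(a)
    if hi - lo == 1:                  # base case: a single element
        return a[lo]
    mid = (lo + hi) // 2
    left = dc_min(a, lo, mid)         # independent task 1
    right = dc_min(a, mid, hi)        # independent task 2
    return min(left, right)           # combine step

print(dc_min([4, 9, 1, 7, 8, 11, 2, 12]))   # 1
```

The recursion tree is exactly the task-dependency graph: for n elements it has depth about log2(n), so the critical path is logarithmic even though the usual serial algorithm is a linear scan.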
14. Recursive Decomposition
- How good are the decompositions produced?
- Average Concurrency?
- Length of critical path?
- How do the quicksort and min-finding
decompositions measure up?
15. Data Decomposition
- Used to derive concurrency for problems that operate on large amounts of data
- The idea is to derive the tasks by focusing on the multiplicity of data
- Data decomposition is often performed in two steps:
  - Step 1: Partition the data
  - Step 2: Induce a computational partitioning from the data partitioning.
16. Data Decomposition (cont.)
- Which data should we partition?
  - Input / output / intermediate?
  - All of the above
  - This leads to different data-decomposition methods
- How do we induce a computational partitioning?
  - Use the owner-computes rule
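The two steps can be sketched concretely. This is a minimal illustration of an input-data partitioning with the owner-computes rule, not code from the book; the block partitioning, the choice of 4 owners, and the sum-of-squares computation are arbitrary choices for the example.

```python
# Sketch: data decomposition in two steps.
# Step 1: partition the input data into contiguous blocks.
# Step 2: owner-computes rule - each owner performs all computation
#         on the data it owns, producing a partial result.

def partition(data, nowners):
    """Split data into nowners contiguous blocks (input partitioning)."""
    step = (len(data) + nowners - 1) // nowners
    return [data[i:i + step] for i in range(0, len(data), step)]

def owner_compute(block):
    """Each owner computes only on its own block; here, a partial sum."""
    return sum(x * x for x in block)

data = list(range(8))
partials = [owner_compute(b) for b in partition(data, 4)]  # one task per owner
print(sum(partials))     # combine the partial results: 140
```

In a real parallel program each block would live on a different process, and only the scalar partial results would be communicated in the combine step.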
17. Exploratory Decomposition
- Used to decompose computations that correspond to
a search of the space of solutions.
18. Example: The 15-Puzzle Problem
19. Exploratory Decomposition
- Not general purpose
- After sufficient branches are generated, each node can be assigned the task of exploring one branch further down.
- As soon as one task finds a solution, the other tasks can be terminated.
- It can result in speedup and slowdown anomalies
  - The work performed by the parallel formulation of an algorithm can be either smaller or greater than that performed by the serial algorithm.
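The early-termination pattern can be sketched with threads. This is an illustrative sketch only; the "branches" below are hard-coded lists standing in for subtrees of a real search space such as the 15-puzzle.

```python
# Sketch: exploratory decomposition. Independent tasks each explore one
# branch of a search space; the first task to find a solution signals
# the others to stop. The branch contents are invented for illustration.
import threading

found = threading.Event()
result = []

def explore(branch):
    """Explore one branch, abandoning work once another task succeeds."""
    for state in branch:
        if found.is_set():        # another task already found a solution
            return                # terminated early: its remaining work is skipped
        if state == "goal":
            result.append(state)
            found.set()           # tell the other tasks to terminate

branches = [["a", "b"], ["c", "goal"], ["d", "e"]]
threads = [threading.Thread(target=explore, args=(b,)) for b in branches]
for t in threads: t.start()
for t in threads: t.join()
print(result[0])                  # "goal"
```

The anomalies on the slide show up directly here: if the solution sits early in one branch, the parallel search may do far less total work than a serial depth-first search; if every branch must be expanded before termination, it may do more.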
20. Speculative Decomposition
- Used to extract concurrency in problems in which the next step is one of many possible actions that can be determined only when the current task finishes.
- This decomposition assumes a certain outcome of the currently executing task and executes some of the next steps.
- Similar to speculative execution in a microprocessor.
21. Speculative Decomposition
- Difference from exploratory decomposition:
  - In speculative decomposition, the input at a branch leading to multiple tasks is unknown.
  - In exploratory decomposition, the output of the multiple tasks originating at the branch is unknown.
22. Example: Discrete Event Simulation
23. Speculative Execution
- If predictions are wrong
- Work is wasted
- Work may need to be undone
- State-restoring overhead
- Memory/computations
- However, it may be the only way to extract
concurrency!
24. Mapping Tasks to Processors
- A good mapping strives to achieve the following conflicting goals:
  - Reducing the amount of time that processors spend interacting with each other.
  - Reducing the total amount of time that some processors are idle while others are active.
- Good mappings attempt to reduce the parallel-processing overheads.
  - If Tp is the parallel runtime using p processors and Ts is the sequential runtime (for the same algorithm), then the total overhead To = p * Tp - Ts.
  - This is the work done by the parallel system beyond that required by the serial system.
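The overhead formula is easy to check numerically. The runtimes below are made-up numbers chosen only to illustrate the arithmetic.

```python
# Sketch: the total-overhead formula To = p*Tp - Ts, with the related
# speedup and efficiency measures. Runtimes are hypothetical.

def total_overhead(p, t_par, t_seq):
    """Work done by the parallel system beyond the serial requirement."""
    return p * t_par - t_seq

Ts, Tp, p = 100.0, 30.0, 4        # hypothetical runtimes (seconds)
To = total_overhead(p, Tp, Ts)
print(To)                         # 4*30 - 100 = 20.0
print(Ts / Tp)                    # speedup = 100/30 ~ 3.33
print(Ts / (p * Tp))              # efficiency = 100/120 ~ 0.83
```

With zero overhead (To = 0) we would have Tp = Ts / p, i.e., perfect speedup; every positive To is time the p processors spend on interaction, idling, or redundant work.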
25. Add Slides from Karypis Lecture Slides
- Add Slides 37-52 here from the PDF lecture slides by Karypis for Chapter 3 of the textbook: Introduction to Parallel Computing, Second Edition, Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar, Addison Wesley, 2003.
- Topics covered on these slides are Sections 3.3-3.5:
  - Characteristics of Tasks and Interactions
  - Mapping Techniques for Load Balancing
  - Methods for Containing Interaction Overheads
- These slides are most easily viewed by going to View and choosing Full Screen; exit Full Screen using the Esc key.
26. Parallel Algorithm Models
- The Task-Graph Model
  - Closely related to Foster's task/channel model
  - Includes the task-dependency graph, where dependencies usually result from communications between two tasks
  - Also includes the task-interaction graph, which additionally captures other interactions between tasks, such as data sharing
- The Work-Pool Model
- The Master-Slave Model
- The Pipeline or Producer-Consumer Model
- Hybrid Models
27. The Task-Graph Model
- The computations in a parallel algorithm can be viewed as a task-dependency graph.
- Tasks are mapped to processors so that locality is promoted.
  - The volume and frequency of interactions are reduced.
  - Asynchronous interaction methods are used to overlap interactions with computation.
- Typically used to solve problems in which the data associated with a task is rather large compared to the amount of computation.
28. The Task-Graph Model (cont.)
- Examples of algorithms based on the task-graph model:
  - Parallel quicksort (Section 9.4.1)
  - Sparse matrix factorization
  - Many parallel algorithms derived from divide-and-conquer decompositions
- Task parallelism
  - The type of parallelism that is expressed by the independent tasks in a task-dependency graph.
29. The Work-Pool Model
- Also called the task-pool model
- Involves dynamic mapping of tasks onto processes for load balancing
- Any task may potentially be performed by any process
- The mapping of tasks to processes can be centralized or decentralized.
- Pointers to tasks may be stored in:
  - a physically shared list, priority queue, hash table, or tree
  - a physically distributed data structure.
30. The Work-Pool Model (cont.)
- When work is generated dynamically and a decentralized mapping is used, a termination-detection algorithm is required.
- When used with a message-passing paradigm, the data required by a task is normally small relative to its computation.
  - Tasks can be readily moved around without causing too much data-interaction overhead.
- The granularity of tasks can be adjusted to obtain the desired tradeoff between load imbalance and the overhead of adding and extracting tasks.
31. The Work-Pool Model (cont.)
- Examples of algorithms based on the work-pool model:
  - Chunk scheduling
32. Master-Slave Model
- Also called the manager-worker model
- One or more master processes generate work and allocate it to worker processes.
- Masters can allocate tasks in advance if they can estimate the sizes of the tasks, or if a random mapping can avoid load-balancing problems.
- Normally, workers are assigned smaller tasks as needed.
- Work can be performed in phases:
  - The work in each phase is completed, and the workers synchronize, before the next phase is started.
- Normally, any worker can do any assigned task.
33. Master-Slave Model (cont.)
- Can be generalized to a multi-level manager-worker model:
  - Top-level managers feed large chunks of tasks to second-level managers.
  - Second-level managers subdivide the tasks among their workers and may also perform some of the work themselves.
- Danger of the manager becoming a bottleneck
  - Can happen if the tasks are too small.
  - The granularity of tasks should be chosen so that the cost of doing work dominates the cost of synchronization.
- Waiting time may be reduced if worker requests are non-deterministic.
34. Master-Slave Model (cont.)
- Examples of algorithms based on the master-slave model:
  - A master-slave example for centralized dynamic load balancing is given in Section 3.4.2 (page 130).
  - Several examples are given in the textbook: Barry Wilkinson and Michael Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, 1st or 2nd Edition, 1999/2005, Prentice Hall.
35. Pipeline or Producer-Consumer Model
- Usually similar to the linear array model studied in Akl's textbook.
- A stream of data is passed through a succession of processes, each of which performs some task on it.
  - Called stream parallelism.
- With the exception of the process initiating the work for the pipeline, the arrival of new data triggers the execution of a new task by a process in the pipeline.
- Each process can be viewed as a consumer of the data items produced by the process preceding it.
36. Pipeline or Producer-Consumer Model (cont.)
- Each process in the pipeline can be viewed as a producer of data for the process following it.
  - The pipeline is a chain of producers and consumers.
- The pipeline does not need to be a linear chain; it can be a directed graph.
- Processes could form pipelines in the form of:
  - Linear or multidimensional arrays
  - Trees
  - General graphs, with or without cycles
37. Pipeline or Producer-Consumer Model (cont.)
- Load balancing is a function of task granularity.
  - With larger tasks, it takes longer to fill up the pipeline, which keeps tasks waiting.
  - Too fine a granularity increases overhead, as processes need to receive new data and initiate a new task after only a small amount of computation.
- Examples of algorithms based on this model:
  - A two-dimensional pipeline is used in the parallel LU factorization algorithm discussed in Section 8.3.1.
  - An entire chapter is devoted to this model in the previously mentioned textbook by Wilkinson and Allen.
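The producer-consumer chaining can be sketched with Python generators. This is only a structural illustration: the three stages below run interleaved in one process, whereas in a real pipeline each stage would be a separate process and an item would flow to the next stage while the previous stage works on the next item.

```python
# Sketch: stream parallelism as a linear chain of producer-consumer
# stages. Each stage consumes the items produced by the stage before it;
# the stage functions below are invented for illustration.

def produce(n):                   # stage 1: generate the data stream
    yield from range(n)

def square(stream):               # stage 2: consume items, produce squares
    for x in stream:
        yield x * x

def accumulate(stream):           # stage 3: produce running sums
    total = 0
    for x in stream:
        total += x
        yield total

pipeline = accumulate(square(produce(5)))
print(list(pipeline))             # [0, 1, 5, 14, 30]
```

Because each stage pulls one item at a time from its predecessor, the chaining mirrors the slide's point that the arrival of a new data item is what triggers the next task in the pipeline.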
38. Hybrid Models
- In some cases, more than one model may be used in designing an algorithm, resulting in a hybrid algorithm.
- Parallel quicksort (Sections 3.2.5 and 9.4.1) is an application for which a hybrid model is ideal.