Title: HPC Parallel Programming: From Concept to Compile
1HPC Parallel Programming: From Concept to Compile
- A One-day Introductory Workshop
- October 12th, 2004
2Schedule
- Concept (a frame of mind)
- Compile (application)
3Introduction
- Programming parallel computers
- Compiler extension
- Sequential programming language extension
- Parallel programming layer
- New parallel languages
4 Concept
- Parallel Algorithm Design
- Programming paradigms
- Parallel Random Access Machine - the PRAM model
- Result, Agenda, Specialist - the RAS model
- Task / Channel - the PCAM model
- Bulk Synchronous Parallel - the BSP model
- Pattern Language
5Compile
- Serial
- Introduction to OpenMP
- Introduction to MPI
- Profilers
- Libraries
- Debugging
- Performance Analysis Formulas
6 - By the end of this workshop you will be exposed
to - Different parallel programming models and
paradigms - Serial ("I'm not bad, I was built this way")
programming and how you can optimize it - OpenMP and MPI
- libraries
- debugging
7 - Eu est velociter perfectus
- Well Done is Quickly Done
- Caesar Augustus
8Introduction
- What is Parallel Computing?
- It is the ability to program in a language that
allows you to explicitly indicate how different
portions of the computation may be executed
concurrently by different processors
9 - Why do it?
- The need for speed
- How much speedup can be expected is determined by
- Amdahl's Law: S(p) <= p / (1 + (p - 1)f), where
- f is the fraction of the computation that cannot be
divided into concurrent tasks, 0 <= f <= 1, and - p is the number of processors
- So if we have 20 processors and a serial portion
of 5% (f = 0.05) we will get a speedup of at most 20/(1 + (20 - 1)(0.05))
= 10.26 (see the example below) - Also Gustafson-Barsis's Law which takes into
account scalability, and - the Karp-Flatt Metric which takes into account the
parallel overhead, and - the Isoefficiency Relation which is used to determine
the range of processors for which a particular
level of efficiency can be maintained. Parallel
overhead increases as the number of processors
increases, so to maintain efficiency increase the
problem size
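- A worked version of this calculation (my own illustration in C,
not from the original slides):

    #include <stdio.h>

    /* Amdahl's Law: S(p) <= p / (1 + (p - 1) * f), where f is the
       serial (non-parallelizable) fraction and p the processor count. */
    static double amdahl_speedup(int p, double f)
    {
        return p / (1.0 + (p - 1) * f);
    }

    int main(void)
    {
        /* The slide's numbers: 20 processors, 5% serial fraction. */
        printf("S(20) = %.2f\n", amdahl_speedup(20, 0.05));   /* prints 10.26 */
        return 0;
    }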
10 - Why do parallel computing? Some other reasons
- Time: reduce the turnaround time of applications
- Performance: parallel computing is the only way
to extend performance toward the TFLOP realm - Cost/Performance: traditional vector computers
become too expensive as one pushes the
performance barrier - Memory: applications often require memory that
goes beyond that addressable by a single
processor - Whole classes of important algorithms are ideal
for parallel execution. Most algorithms can
benefit from parallel processing, such as the Laplace
equation, Monte Carlo, FFT (signal processing), and
image processing - Life itself is a set of concurrent processes
- Scientists use modeling, so why not model systems
in a way closer to nature
11 - Many complex scientific problems require large
computing resources. Problems such as - Quantum chemistry, statistical mechanics, and
relativistic physics - Cosmology and astrophysics
- Computational fluid dynamics and turbulence
- Biology, genome sequencing, genetic engineering
- Medicine
- Global weather and environmental modeling
- One such place is http://www-fp.mcs.anl.gov/grand-
challenges/
12 Programming Parallel Computers
- In 1988 four distinct paths for application
software development on parallel computers were
identified by McGraw and Axelrod - Extend an existing compiler to translate
sequential programs into parallel programs - Extend an existing language with new operations
that allow users to express parallelism - Add a new language layer on top of an existing
sequential language - Define a totally new parallel language
13Compiler extension
- Design parallelizing compilers that exploit
parallelism in existing programs written in a
sequential language - Advantages
- billions of dollars and thousands of years of
programmer effort have already gone into legacy
programs. - Automatic parallelization can save money and
labour. - It has been an active area of research for over
twenty years - Companies such as Parallel Software Products
http://www.parallelsp.com/ offer compilers that
translate F77 code into parallel programs for MPI
and OpenMP - Disadvantages
- Pits the programmer and compiler in a game of hide and
seek. The programmer hides parallelism in DO
loops and control structures and the compiler
might irretrievably lose some parallelism
14Sequential Programming Language Extension
- Extend a sequential language with functions that
allow programmers to create, terminate,
synchronize and communicate with parallel
processes - Advantages
- Easiest, quickest, and least expensive since it
only requires the development of a subroutine
library - Libraries meeting the MPI standard exist for
almost every parallel computer - Gives programmers flexibility with respect to
program development - Disadvantages
- Compiler is not involved in generation of
parallel code therefore it cannot flag errors - It is very easy to write parallel programs that
are difficult to debug
15Parallel Programming layers
- Think of a parallel program consisting of 2
layers. The bottom layer contains the core of
the computation which manipulates its portion of
data to get its result. The upper layer
controls creation and synchronization of
processes. A compiler would then translate these
two levels into code for execution on parallel
machines - Advantages
- Allows users to depict parallel programs as
directed graphs with nodes depicting sequential
procedures and arcs representing data dependences
among procedures - Disadvantages
- Requires programmer to learn and use a new
parallel programming system
16New Parallel Languages
- Develop a parallel language from scratch. Let
the programmer express parallel operations
explicitly. The programming language Occam is one
such famous example http://wotug.ukc.ac.uk/paralle
l/occam/ - Advantages
- Explicit parallelism means programmer and
compiler are now allies instead of adversaries - Disadvantages
- Requires development of new compilers. It
typically takes years for vendors to develop
high-quality compilers for their parallel
architectures - Some parallel languages such as C were not
adapted as standard compromising severely
portable code - User resistance. Who wants to learn another
language
17 - The most popular approach continues to be
augmenting existing sequential languages with
low-level constructs expressed by function calls
or compiler directives - Advantages
- Can exhibit high efficiency
- Portable to a wide range of parallel systems
- C, C++, and F90 with MPI or OpenMP are such examples
- Disadvantages
- More difficult to code and debug
18Concept
- An algorithm (from OED) is a set of rules or
process, usually one expressed in algebraic
notation now used in computing - A parallel algorithm is one in which the rules or
process are concurrent - There is no simple recipe for designing parallel
algorithms. However, the design can benefit from a
methodological approach, which allows the
programmer to focus on machine-independent issues
such as concurrency early in the design process
and machine-specific aspects later - You will be introduced to such approaches and
models and hopefully gain some insight into the
design process - Examining these models is a good way to start
thinking in parallel
19Parallel Programming Paradigms
- Parallel applications can be classified into well
defined programming paradigms - Each paradigm is a class of algorithms that have
the same control structure - Experience suggests that there are a relatively
few paradigms underlying most parallel programs - The choice of paradigm is determined by the
computing resources which can define the level of
granularity and type of parallelism inherent in
the program which reflects the structure of
either the data or application
20Parallel Programming Paradigms
- The most systematic definition of paradigms comes
from a technical report from the University of
Basel in 1993 entitled BACS Basel Algorithm
Classification Scheme - A generic tuple of factors which characterize a
parallel algorithm - Process properties (structure, topology,
execution) - Interaction properties
- Data properties (partitioning, placement)
- The following paradigms were described
- Task-Farming (or Master/Slave)
- Single Program Multiple Data (SPMD)
- Data Pipelining
- Divide and Conquer
- Speculative Parallelism
21PPP Task-Farming
- Task-farming consists of two entities
- Master which decomposes the problem into small
tasks and distributes/farms them to the slave
processes. It also gathers the partial results
and produces the final computational result - Slave which gets a message with a task, executes
the task and sends the result back to the master - It can use either static load balancing
(distribution of tasks is all performed at the
beginning of the computation) or dynamic
load-balancing (when the number of tasks exceeds
the number of processors or is unknown, or when
execution times are not predictable, or when
dealing with unbalanced problems). This paradigm
responds quite well to the loss of processors and
can be scaled by extending the single master to a
set of masters
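- As a rough sketch of this structure (my own minimal MPI example, not
code from the workshop: do_task(), the tags and NTASKS are made up,
and it assumes there are at least as many tasks as slaves):

    #include <mpi.h>
    #include <stdio.h>

    #define NTASKS   100          /* assumed: more tasks than slaves */
    #define TAG_WORK 1
    #define TAG_STOP 2

    /* Hypothetical stand-in for the real computation on one task. */
    static double do_task(int t) { return (double)t * t; }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                     /* master: farm tasks, gather results */
            int next = 0, active = 0, stop = -1;
            double result, sum = 0.0;
            MPI_Status st;
            for (int w = 1; w < size; w++) { /* one initial task per slave */
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            }
            while (active > 0) {             /* dynamic load balancing */
                MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
                         MPI_COMM_WORLD, &st);
                sum += result;
                if (next < NTASKS) {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                             MPI_COMM_WORLD);
                    next++;
                } else {                     /* no work left: tell slave to stop */
                    MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                             MPI_COMM_WORLD);
                    active--;
                }
            }
            printf("final result: %f\n", sum);
        } else {                             /* slave: receive, compute, send back */
            int task;
            MPI_Status st;
            for (;;) {
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                double r = do_task(task);
                MPI_Send(&r, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }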
22PPP Single Program Multiple data (SPMD)
- SPMD is the most commonly used paradigm
- Each process executes the same piece of code but
on a different part of the data which involves
the splitting of the application data among the
available processors. This is also referred to
as geometric parallelism, domain decomposition,
or data parallelism - Applications can be very efficient if the data is
well distributed by the processes on a
homogeneous system. If different workloads are
evident then some sort of load balancing scheme
is necessary during run-time execution - Highly sensitive to loss of a process. Unusually
results in a deadlock until global
synchronization point is reached
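- A minimal SPMD sketch in C with MPI (my own illustration; the
per-element work is a stand-in): every process runs the identical
program, selects its block of the index range from its rank, and
meets the others at a global combining step.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000                    /* global size, assumed divisible by p */

    int main(int argc, char **argv)
    {
        int rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        /* Same program everywhere; the rank decides which block to work on. */
        int chunk = N / p;
        int lo = rank * chunk, hi = lo + chunk;

        double local = 0.0, global = 0.0;
        for (int i = lo; i < hi; i++)
            local += 1.0 / (i + 1.0);    /* stand-in for work on this block */

        /* Global synchronization/combination point. */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %f\n", global);

        MPI_Finalize();
        return 0;
    }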
23PPP Data Pipelining
- Data pipelining is fine-grained parallelism and
is based on a functional decomposition approach - The tasks (capable of concurrent operation) are
identified and each processor executes a small
part of the total algorithm - One of the simplest and most popular functional
decomposition paradigms and can also be referred
to as data-flow parallelism. - Communication between stages of the pipeline can
be asynchronous; the efficiency is directly
dependent on the ability to balance the load
across the stages - Often used in data reduction and image processing
24PPP Divide and Conquer
- The divide and conquer approach is well known in
sequential algorithm development in which a
problem is divided into two or more subproblems.
Each subproblem is solved independently and the
results combined - In parallel divide and conquer, the subproblems
can be solved at the same time - Three generic computational operations split,
compute, and join (sort of like a virtual tree
where the tasks are computed at the leaf nodes)
25PPP Speculative Parallelism
- Employed when it is difficult to obtain
parallelism through any one of the previous
paradigms - Deals with complex data dependencies which can be
broken down into smaller parts using some
speculation or heuristic to facilitate the
parallelism
26PRAM Parallel Random Access Machine
- Descendant of the RAM (Random Access Machine)
- A theoretical model of parallel computation in
which an arbitrary but finite number of
processors can access any value in an arbitrarily
large shared memory in a single time step - Introduced in the 1970s, it still remains popular
since it is theoretically tractable and gives
algorithm designers a common target. The
Prentice Hall book from 1989 entitled The Design
and Analysis of Parallel Algorithms gives a good
introduction to the design of algorithms using
this model
27PRAM cont
- The three most important variations on this model
are - EREW (exclusive read exclusive write) where any
memory location may be accessed only once in any
one step - CREW (concurrent read exclusive write) where any
memory location may be read any number of times
during a single step but written to only once
after the reads have finished - CRCW (concurrent read concurrent write) where any
memory location may be written to or read from
any number of times during a single step. Some
rule or priority must be given to resolve
multiple writes
28PRAM cont
- This model has problems
- PRAMs cannot be emulated optimally on all
architectures - Problem lies in the assumption that every
processor can access the memory simultaneously in
a single step. Even in hypercubes messages must
take several hops between source and destination
and the hop count grows logarithmically with the machine's
size. As a result any buildable computer will
experience a logarithmic slowdown relative to the
PRAM model as its size increases - One solution is to take advantage of cases in
which there is greater parallelism in the process
than in the hardware it is running on, enabling
each physical processor to emulate many virtual
processors. An example of this follows
29PRAM cont
- Example
- Process A sends request
- Process B runs while request travels to memory
- Process C runs while memory services request
- Process D runs while reply returns to processor
- Process A is re-scheduled
- The efficiency with which physically resizable
architectures could emulate the PRAM is dictated
by the following theorem - If each of P processors sends a single message to
a randomly-selected partner, it is highly
probable that at least one processor will receive
O(log P / log log P) messages, and some others will
receive none but - If each processor sends log P messages to
randomly-selected partners, there is a high
probability that no processor will receive more
than 3 log P messages. - So if the problem size is increased at least
logarithmically faster than the machine size,
efficiency can be held constant. The problem is
that this only holds for hypercubes, in which
the number of communication links per processor grows with the number of
processors. - Several ways around the above limitation have been
suggested, such as
30PRAM cont
- XPRAM where computations are broken up into steps
such that no processor may communicate more than
a certain number of times per single time step. - Programs which fit this model can be emulated
efficiently - Problem is that it is difficult to design
algorithms in which the frequency of
communication decreases as the problem size
increases - Bird-Meertens formalism where the allowed set of
communications would be restricted to those which
can be emulated efficiently - The scan-vector model proposed by Blelloch
accounts for the relative distance of different
portions of memory - Another option was proposed by Ramade in 1991
which uses a butterfly network in which each node
is a processor/memory pair. Routing messages in
this model is complicated but the end result is
an optimal PRAM emulation
31Result, Agenda, Specialist Model
- The RAS model was proposed by Nicholas Carriero
and David Gelernter in their book How to Write
Parallel Programs in 1990 - To write a parallel program
- Choose a pattern that is most natural to the
problem - Write a program using the method that is most
natural for that pattern, and - If the resulting program is not efficient, then
transform it methodically into a more efficient
version
32RAS
- Sounds simple. We can envision parallelism in
terms of - Result - focuses on the shape of the finished
product - Plan a parallel application around a data
structure yielded as the final result. We get
parallelism by computing all elements of the
result simultaneously - Agenda - focuses on the list of tasks to be
performed - Plan a parallel application around a particular
agenda of tasks, and then assign many processes
to execute the tasks - Specialist - focuses on the make-up of the work
- Plan an application around an ensemble of
specialists connected into a logical network of
some kind. Parallelism results from all nodes
being active simultaneously, much like
pipelining
33RAS - Result
- In most cases the easiest way to think of a
parallel program is to think of the resulting
data structure. It is a good starting point for
any problem whose goal is to produce a series of
values with predictable organization and
interdependencies - Such a program reads as follows
- Build a data structure
- Determine the value of all elements of the
structure simultaneously - Terminate when all values are known
- If all values are independent then all
computations start in parallel. However, if some
elements cannot be computed until certain other
values are known, then those tasks are blocked - As a simple example consider adding two n-element
vectors (i.e. add the ith elements of both and
store the sum in another vector)
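- A minimal sketch of this vector addition in C with an OpenMP
directive (my own illustration; without OpenMP support the pragma is
simply ignored and the code runs serially):

    #include <stdio.h>

    #define N 8

    int main(void)
    {
        double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* Every element of the result is independent, so in principle
           all of them can be computed simultaneously. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        for (int i = 0; i < N; i++)
            printf("%g ", c[i]);
        printf("\n");
        return 0;
    }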
34RAS - Agenda
- Agenda parallelism adapts well to many different
problems - The most flexible is the master-worker paradigm
- in which a master process initializes the
computation and creates a collection of identical
worker processes - Each worker process is capable of performing any
step in the computation - Workers seek a task to perform and then repeat
- When no tasks remain, the program is finished
- An example would be to find the lowest ratio of
salary to dependents in a database. The master
fills a bag with records and each worker draws
from the bag, computes the ratio, and sends the
result back to the master. The master keeps
track of the minimum and, when all tasks are complete,
reports the answer
35RAS - Specialist
- Specialist parallelism involves programs that are
conceived in terms of a logical network. - Best understood in which each node executes an
autonomous computation and inter-node
communication follows predictable paths - An example could be a circuit simulation where
each element is realized by a separate process
36RAS - Example
- Consider a naïve n-body simulator where on each
iteration of the simulation we calculate the
prevailing forces between each body and all the
rest, and update each body's position accordingly - With the result parallelism approach it is easy
to restate the problem description as follows - Suppose n bodies, q iterations of the simulation,
compute a matrix M such that M[i, j] is the
position of the ith body after the jth iteration - Define each entry in terms of other entries, i.e.
write a function to compute position (i, j)
37RAS - Example
- With the agenda parallelism model we can
repeatedly apply the transformation compute next
position to all bodies in the set - So the steps involved would be to
- Create a master process and have it generate n
initial task descriptors ( one for each body ) - On the first iteration, each process repeatedly
grabs a task descriptor and computes the next
position of the corresponding body, until all
task descriptors are used - The master can then store information about each
body's position at the last iteration in a
distributed table structure where each process
can refer to it directly
38RAS - Example
- Finally, with the specialist parallelism approach
we create a series of processes, each one
specializing in a single body (i.e. each
responsible for computing a single body's current
position throughout the entire simulation) - At the start of each iteration, each process
sends data to and receives data from each other
process - The data included in the incoming message group
of messages is sufficient to allow each process
to compute a new position for its body then
repeat
39Task Channel model
- THERE IS NO SIMPLE RECIPE FOR DESIGNING PARALLEL
ALGORITHMS - However, with suggestions by Ian Foster and his
book Designing and Building Parallel Programs
there is a methodology we can use - The task/channel method is one most often sited
as a practical means to organize the design
process - It represents a parallel computation as a set of
tasks that may interact with each other by
sending messages through channels - It can be viewed as a directed graph where
vertices represent tasks and directed edges
represent channels - A thorough examination of this design process
will conclude with a practical example
40 - A task is a program, its local memory, and a
collection of I/O ports - The local memory contains the programs
instructions and its private data - It can send local data values to other tasks via
output ports - It also receives data values from other tasks via
input ports - A channel is a message queue that connects one
task's output port with another task's input port - Data values appear at the input port in the same
order as they were placed in the output port of
the channel - Tasks cannot receive data until it is sent (i.e.
receiving is blocked) - Sending is never blocked
41 - The four stages of Foster's design process are
- Partitioning: the process of dividing the
computation and data into pieces - Communication: the pattern of sends and receives
between tasks - Agglomeration: the process of grouping tasks into
larger tasks to simplify programming or improve
performance - Mapping: the process of assigning tasks to
processors - Commonly referred to as PCAM
42Partitioning
- Discover as much parallelism as possible. To
this end strive to split the computation and data
into smaller pieces - There are two approaches
- Domain decomposition
- Functional decomposition
43PCAM partitioning domain decomposition
- Domain decomposition is where you first divide
the data into pieces and then determine how to
associate computations with the data - Typically focus on the largest or most frequently
accessed data structure in the program - Consider a 3D matrix. It can be partitioned as
- Collection of 2D slices, resulting in a 1D
collection of tasks - Collection of 1D slices, resulting in a 2D
collection of tasks - Each matrix element separately resulting in a 3D
collection of tasks - At this point in the design process it is usually
best to maximize the number of tasks, hence the 3D
partitioning is best
44PCAM partitioning functional decomposition
- Functional decomposition is complementary to
domain decomposition: the computation is
first divided into pieces and then the data items
are associated with each computation. This is
often known as pipelining, which yields a collection
of concurrent tasks - Consider brain surgery
- before surgery begins a set of CT images are
input to form a 3D model of the brain - The system tracks the position of the instruments
converting physical coordinates into image
coordinates and displaying them on a monitor.
While one task is converting physical coordinates
to image coordinates, another is displaying the
previous image, and yet another is tracking the
instrument for the next image. (Anyone remember
the movie The Fantastic Voyage?)
45PCAM Partitioning - Checklist
- Regardless of decomposition we must maximize the
number of primitive tasks since it is the upper
bound on the parallelism we can exploit. Foster
has presented us a checklist to evaluate the
quality of the partitioning - There are at least an order of magnitude more
tasks than processors on the target parallel
machine. If not, there may be little flexibility
in later design options - Avoid redundant computation and storage
requirements since the design may not work well
when the size of the problem increases - Tasks are of comparable size. If not, it may be
hard to allocate each processor equal amounts of
work - The number of tasks scale with problem size. If
not, it may be difficult to solve larger problems
when more processors are available - Investigate alternative partitioning to maximize
flexibility later
46PCAM-Communication
- After the tasks have been identified it is
necessary to understand the communication
patterns between them - Communications are considered part of the
overhead of a parallel algorithm, since the
sequential algorithm does not need to do this.
Minimizing this overhead is an important goal - Two such patterns local and global are more
commonly used than others (structured/unstructured
, static/dynamic, synchronous/asynchronous) - Local communication exists when a task need
values from a small number of other tasks (its
neighbours) in order to form a computation - Global communication exits when a large number of
tasks must supply data in order to form a
computation (e.g. performing a parallel reduction
operation computing the sum of values over N
tasks)
47PCAM Communication - checklist
- These are guidelines and not hard and fast rules
- Are the communication operations balanced between
tasks? Unbalanced communication requirements
suggest a non-scalable construct - Each task communicates only with a small number
of neighbours - Tasks are able to communicate concurrently. If
not, the algorithm is likely to be inefficient and
non-scalable. - Tasks are able to perform their computations
concurrently
48PCAM - Agglomeration
- The first two steps of the design process were
focused on identifying as much parallelism as
possible - At this point the algorithm would probably not
execute efficiently on any particular parallel
computer. For example, if there are many
orders of magnitude more tasks than processors, it can lead
to a significant overhead in communication - In the next two stages of the design we consider
combining tasks into larger tasks and then
mapping them onto physical processors to reduce
parallel overhead
49PCAM - Agglomeration
- Agglomeration (according to OED) is the process
of collecting in a mass. In this case we try to
group tasks into larger tasks to facilitate
improvement in performance or to simplify
programming. - There are three main goals to agglomeration
- Lower communication overhead
- Maintain the scalability of the parallel design,
and - Reduce software engineering costs
50PCAM - Agglomeration
- How can we lower communication overhead?
- By agglomerating tasks that communicate with each
other, communication is completely eliminated,
since data values controlled by the tasks are in
the memory of the consolidated task. This
process is known as increasing the locality of
the parallel algorithm - Another way is to combine groups of transmitting
and receiving tasks thereby reducing the number
of messages sent. Sending fewer, longer messages
takes less time than sending more, shorter
messages since there is an associated startup
cost (message latency) inherent with every
message sent which is independent of the length
of the message.
51PCAM - Agglomeration
- How can we maintain the scalability of the
parallel design? - Ensure that you do not combine too many tasks
since porting to a machine with more processors
may be difficult. - For example part of your parallel program is to
manipulate a 3D array 16 X 128 X 256 and the
machine has 8 processors. By agglomerating the
2nd and 3rd dimensions each task would be
responsible for a submatrix of 2 X 128 X 256. We
can even port this to a machine that has 16
processors. However porting this to a machine
with more processors might result in large
changes to the parallel code. Therefore
agglomerating the 2nd and 3rd dimension might not
be a good idea. What about a machine with 50,
64, or 128 processors?
52PCAM - Agglomeration
- How can we reduce software engineering costs?
- By parallelizing an existing sequential program we can
reduce the time and expense of developing an equivalent
parallel program from scratch. Remember Parallel Software
Products?
53PCAM Agglomeration - checklist
- Some of these points in this checklist emphasize
quantitative performance analysis which becomes
more important as we move from the abstract to
the concrete - Has the agglomeration increased the locality of
the parallel algorithm? - Do replicated computations take less time than
the communications they replace? - Is the amount of replicated data small enough to
allow the algorithm to scale? - Do agglomerated tasks have similar computational
and communication costs? - Is the number of tasks an increasing function of
the problem size? - Are the number of tasks as small as possible and
yet as large as the number of processors on your
parallel computer? - Is the trade-off between your chosen
agglomeration and the cost of modifications to
existing sequential code reasonable?
54PCAM - Mapping
- In this 4th and final stage we specify where each
task is to execute - The goals of mapping are to maximize processor
utilization and minimize interprocessor
communications. Often these are conflicting
goals - Processor utilization is maximized when the
commutation is balanced evenly. Conversely, it
drops when one or processors are idle - Interprocessor communication increases when two
tasks connected by a channel are mapped to
different processors. Conversely, it decreases
when the two tasks connected by the channel are
mapped to the same processor - Mapping every task to the same processor cuts
communications to zero but utilization is reduced
to 1/p. The point is to choose a
mapping that represents a reasonable balance
between conflicting goals. The mapping problem
has a name and it is
55PCAM - Mapping
- The mapping problem is known to be NP-hard,
meaning that no computationally tractable
(polynomial-time) algorithm is known for evaluating
these trade-offs in the general case. Hence we
must rely on heuristics that can do a reasonably
good job of mapping - Some strategies for decomposition of a problem
are - Perfectly parallel
- Domain
- Control
- Object-oriented
- Hybrid/layered (multiple uses of the above)
56PCAM Mapping decomposition - perfect
- Perfectly parallel
- Applications that require little or no
inter-processor communication when running in
parallel - Easiest type of problem to decompose
- Results in nearly perfect speed-up
- The pi example is almost perfectly parallel
- The only communication occurs at the beginning of
the problem when the number of divisions needs to
be broadcast and at the end where the partial
sums need to be added together - The calculation of the area of each slice
proceeds independently - This would be true even if the area calculation
were replaced by something more complex
57PCAM mapping decomposition - domain
- Domain decomposition
- In simulation and modelling this is the most
common solution - The solution space (which often corresponds to
the real space) is divided up among the
processors. Each processor solves its own little
piece - Finite-difference methods and finite-element
methods lend themselves well to this approach - The method of solution often leads naturally to a
set of simultaneous equations that can be solved
by parallel matrix solvers - Sometimes the solution involves some kind of
transformation of variables (i.e. Fourier
Transform). Here the domain is some kind of
phase space. The solution and the various
transformations involved can be parallelized
58PCAM mapping decomposition - domain
- Solution of a PDE (Laplace's Equation), sketched in code below
- A finite-difference approximation
- Domain is divided into discrete finite
differences - Solution is approximated throughout
- In this case, an iterative approach can be used
to obtain a steady-state solution - Only nearest neighbour cells are considered in
forming the finite difference - Gravitational N-body, structural mechanics,
weather and climate models are other examples
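- A serial sketch of such a finite-difference (Jacobi) iteration for
Laplace's equation (my own illustration; the grid size, boundary
values and tolerance are made up). In a domain decomposition each
processor would own a block of rows and exchange only its boundary
rows with nearest neighbours:

    #include <stdio.h>
    #include <math.h>

    #define N   64          /* grid points per side */
    #define TOL 1e-4

    static double u[N][N], unew[N][N];

    int main(void)
    {
        /* Boundary condition: top edge held at 100, the rest at 0. */
        for (int j = 0; j < N; j++) u[0][j] = unew[0][j] = 100.0;

        double diff;
        int iter = 0;
        do {
            diff = 0.0;
            /* Each interior point becomes the average of its four
               nearest neighbours. */
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++) {
                    unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                         u[i][j-1] + u[i][j+1]);
                    double d = fabs(unew[i][j] - u[i][j]);
                    if (d > diff) diff = d;
                }
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++)
                    u[i][j] = unew[i][j];
            iter++;
        } while (diff > TOL);           /* iterate to steady state */

        printf("converged after %d iterations\n", iter);
        return 0;
    }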
59PCAM mapping decomposition - control
- Control decomposition
- If you cannot find a good domain to decompose,
your problem might lend itself to control
decomposition - Good for
- Unpredictable workloads
- Problems with no convenient static structures
- One type of control decomposition is functional
decomposition - The problem is viewed as a set of operations; it is
among the operations that parallelization is done - Many examples in industrial engineering (e.g.
modelling an assembly line, a chemical plant,
etc.) - Many examples in data processing where a series
of operations is performed on a continuous stream
of data
60PCAM mapping decomposition - control
- Examples
- Image processing: given a series of raw images,
perform a series of transformations that yield a
final enhanced image. Solve this in a functional
decomposition (each process represents a
different function in the problem) using data
pipelining - Game playing games feature an irregular search
space. One possible move may lead to a rich set
of possible subsequent moves to search.
61PCAM mapping decomposition - OO
- Object-oriented decomposition is really a
combination of functional and domain
decomposition - Rather than thinking about a dividing data or
functionality, we look at the objects in the
problem - The object can be decomposed as a set of data
structures plus the procedures that act on those
data structures - The goal of object-oriented parallel programming
is distributed objects - Although conceptually clear, in practice it can
be difficult to achieve good load balancing among
the objects without a great deal of fine tuning - Works best for fine-grained problems and in
environments where having functionality ready
at-the-call is more important than worrying about
under-worked processors (i.e. battlefield
simulation) - Message passing is still explicit (no standard
C++ compiler automatically parallelizes over
objects).
62PCAM mapping decomposition - OO
- Example the client-server model
- The server is an object that has data associated
with it (i.e. a database) and a set of procedures
that it performs (i.e. searches for requested
data within the database) - The client is an object that has data associated
with it (i.e. a subset of data that it has
requested from the database) and a set of
procedures it performs (i.e. some application
that massages the data). - The server and client can run concurrently on
different processors: an object-oriented
decomposition of a parallel application - In the real world, this can be large scale, when
many clients (workstations running applications)
access a large central database, kind of like a
distributed supercomputer
63PCAM mapping decomposition -summary
- A good decomposition strategy is
- Key to potential application performance
- Key to programmability of the solution
- There are many different ways of thinking about
decomposition - Decomposition models (domain, control,
object-oriented, etc.) provide standard templates
for thinking about the decomposition of a problem - Decomposition should be natural to the problem
rather than natural to the computer architecture - Communication does no useful work keep it to a
minimum - Always wise to see if a library solution already
exists for your problem - Dont be afraid to use multiple decompositions in
a problem if it seems to fit
64PCAM mapping - considerations
- If the communication pattern among tasks is
regular, create p agglomerated tasks that
minimize communication and map each task to its
own processor - If the number of tasks is fixed and communication
among them regular but the time required to
perform each task is variable, then some sort of
cyclic or interleaved mapping of tasks to
processors may result in a more balanced load - Dynamic load-balancing algorithms are needed when
tasks are created and destroyed at run-time or
computation or communication of tasks vary widely
65PCAM mapping - checklist
- It is important to keep an open mind during the
design process. These points can help you decide
if you have done a good job of considering design
alternatives - Is the design based on one task per processor or
multiple tasks per processor? - Have both static and dynamic allocation of tasks
to processors been considered? - If dynamic allocation of tasks is chosen is the
manager (task allocator) a bottle neck to
performance - If using probabilistic or cyclic methods, do you
have a large enough number of tasks to ensure
reasonable load balance (typically ten times as
many tasks as processors are required)
66PCAM example N-body problem
- In a Newtonian n-body simulation, gravitational
forces have infinite range. Sequential algorithms
to solve these problems have time complexity of
Θ(n²) per iteration, where n is the number of
objects - Let us suppose that we are simulating the motion
of n particles of varying mass in 2D. During
each iteration we need to compute the velocity
vector of each particle, given the positions of
all other particles. - Using the four stage process we get
67PCAM example N-body problem
- Partitioning
- Assume we have one task per particle.
- This particle must know the location of all other
particles - Communication
- A gather operation is a global communication that
takes a dataset distributed among a group of
tasks and collects the items on a single task - An all-gather operation is similar to gather,
except at the end of communication every task has
a copy of the entire dataset - We need to update the location of every particle
so an all-gather is necessary
68PCAM example N-body problem
- So put a channel between every pair of tasks
- During each communication step each task sends its
vector element to one other task. After n - 1
communication steps, each task has the position
of all other particles, and it can perform the
calculations needed to determine the velocity and
new location for its particle - Is there a quicker way? Suppose there were only
two particles. If each task had a single
particle, they can exchange copies of their
values. What if there were four particles?
After a single exchange step tasks 0 and 1 could
both have particles v0 and v1 , likewise for
tasks 2 and 3. If task 0 exchanges its pair of
particles with task 2 and task 1 exchanges with
task 3, then all tasks will have all four
particles. A logarithmic number of exchange
steps is sufficient to allow every processor to
acquire the value originally held by every other
processor. So in the ith exchange step the messages
have length 2^(i-1)
69PCAM example N-body problem
- Agglomeration and mapping
- In general, there are more particles n than
processors p. If n is a multiple of p we can
associate one task per processor and agglomerate
n/p particles into each task.
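- A sketch of this in C with MPI (my own illustration; the Particle
type, counts and initialization are made up): each process owns n/p
bodies, and an all-gather gives every process a copy of all of them
before the force computation. An MPI library is free to implement the
all-gather with a logarithmic exchange scheme like the one described
above.

    #include <mpi.h>
    #include <stdio.h>

    typedef struct { double x, y, vx, vy; } Particle;   /* made-up layout */

    #define NPART 1024                /* assumed to be a multiple of p */

    int main(int argc, char **argv)
    {
        int rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int local_n = NPART / p;      /* n/p bodies agglomerated per task */
        static Particle mine[NPART], all[NPART];  /* only local_n of 'mine' used */

        for (int i = 0; i < local_n; i++) {       /* dummy initial positions */
            mine[i].x = rank + 0.001 * i;
            mine[i].y = mine[i].vx = mine[i].vy = 0.0;
        }

        /* All-gather: afterwards every process holds the positions of all
           NPART bodies, which is what it needs to compute the forces on
           the bodies it owns. */
        MPI_Allgather(mine, local_n * (int)sizeof(Particle), MPI_BYTE,
                      all,  local_n * (int)sizeof(Particle), MPI_BYTE,
                      MPI_COMM_WORLD);

        /* ... compute forces and new velocities/positions for 'mine' ... */

        if (rank == 0)
            printf("body 0 is at (%g, %g)\n", all[0].x, all[0].y);

        MPI_Finalize();
        return 0;
    }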
70PCAM - summary
- Task/channel (PCAM) is a theoretical construct
that represents a parallel computation as a set
of tasks that may interact with each other by
sending messages through channels - It encourages parallel algorithm designs that
maximize local computations and minimize
communications
71BSP Bulk Synchronous Parallel
- BSP model was proposed in 1989. It provides an
elegant theoretical framework for bridging the
gap between parallel hardware and software - BSP allows for the programmer to design an
algorithm as a sequence of large steps (supersteps
in the BSP language) each containing many basic
computation or communication operations done in
parallel and a global synchronization at the end,
where all processors wait for each other to
finish their work before they proceed with the
next superstep. - BSP is currently used around the world and a very
good text (which this segment is based on) is
called Parallel Scientific Computation by Rob
Bisseling, published by Oxford University Press in 2004
72BSP Bulk Synchronous Parallel
- Some useful links
- BSP Worldwide organization
- http://www.bsp-worldwide.org
- The Oxford BSP toolset (public domain GNU
license) - http://www.bsp-worldwide.org/implmnts/oxtool
- The source files from the book together with test
programs form a package called BSPedupack and can
be found at - http://www.math.uu.nl/people/bisseling/software.ht
ml - The MPI version called MPIedupack is also
available from the previously mentioned site
73BSP Bulk Synchronous Parallel
- BSP satisfies all requirements of a useful
parallel programming model - Simple enough to allow easy development and
analysis of algorithms - Realistic enough to allow reasonably accurate
modelling of real-life parallel computing - There exists a portability layer in the form of
BSPlib - It has been efficiently implemented in the Oxford
BSP toolset and Paderborn University BSP library - Currently being used as a framework for algorithm
design and implementation on clusters of PCs,
networks of workstations, shared-memory
multiprocessors and large parallel machines with
distributed memory
74BSP Model
- BSP comprises a computer architecture, a class
of algorithms, and a function for charging costs
to algorithms (hmm no wonder it is a popular
model) - The BSP computer
- consists of a collection of processors, each with
private memory, - and a communication network that allows
processors to access each other's memories
75BSP Model
- A BSP algorithm is a series of supersteps, each of which
contains a number of computation or
communication steps followed by a global barrier
synchronization (i.e. bulk synchronization) - What is one possible problem you see right away
with designing algorithms this way?
76BSP Model
- The BSP cost function is based on the concept of an
h-relation: a superstep in which every
processor sends at most h and receives at most h
data words (real or integer). Therefore
h = max(h_send, h_receive) - It assumes sends and receives are simultaneous
- This charging cost is based on the assumption
that the bottleneck is at the entry or exit of the
communication network - The cost of an h-relation would be
- T_comm(h) = h*g + l, where
- g is the communication cost per data word, and
l is the global synchronization cost (both in
time units of 1 flop), and the cost of a BSP
algorithm is the expression - a + b*g + c*l, abbreviated (a, b, c), where a, b, c depend in
general on p and on the problem size
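- The cost expressions above, written out as small C helpers (my own
illustration; the machine parameters g and l in the example are made
up numbers):

    #include <stdio.h>

    /* Cost of communicating an h-relation in one superstep (flop time
       units): T_comm(h) = h*g + l, with h = max(h_send, h_receive). */
    static double bsp_comm_cost(int h_send, int h_recv, double g, double l)
    {
        int h = (h_send > h_recv) ? h_send : h_recv;
        return h * g + l;
    }

    /* Total cost of a BSP algorithm (a, b, c): a + b*g + c*l, where a
       counts flops, b data words communicated and c supersteps. */
    static double bsp_total_cost(double a, double b, double c,
                                 double g, double l)
    {
        return a + b * g + c * l;
    }

    int main(void)
    {
        /* Illustrative parameters: g = 4, l = 100 (in flop units). */
        printf("T_comm(64) = %.0f flops\n", bsp_comm_cost(64, 32, 4.0, 100.0));
        printf("T_alg      = %.0f flops\n",
               bsp_total_cost(1e6, 5e4, 10, 4.0, 100.0));
        return 0;
    }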
77BSP Bulk Synchronous Parallel
- This model currently allows you to convert from
BSP to MPI-2 using MPIEDUPACK as an example (i.e.
MPI can be used for programming in the BSP style) - The main difference between MPI and BSPlib is
that MPI provides more opportunities for
optimization by the user. However, BSP does
impose a discipline that can prove fruitful in
developing reusable code - The book contains an excellent section on sparse
matrix-vector multiplication and if you link to
the website you can download some interesting
solvers http://www.math.uu.nl/people/bisseling/Mon
driaan/mondriaan.html
78Pattern Language
- Primarily from the book Patterns for Parallel
Programming by Mattson, Sanders, and Massingill,
Addison-Wesley, 2004 - From the back cover: "It's the first parallel
programming guide written specifically to serve
working software developers, not just computer
scientists. The authors introduce a complete,
highly accessible pattern language that will help
any experienced developer think parallel and
start writing effective code almost immediately" - The cliché "Don't judge a book by its cover"
comes to mind
79Pattern Language
- We have come full circle. However, we have
gained some knowledge along the way - A pattern language is not a programming language.
It is an embodiment of design methodologies
which provides domain-specific advice to the
application designer - Design patterns were introduced into software
engineering in 1987
80Pattern Language
- Organized into four design spaces (sound familiar?
PCAM) - Finding concurrency
- Structure the problem to expose exploitable
concurrency - Algorithm structure
- Structure the algorithm to take advantage of the
concurrency found above - Supporting structures
- Structured approaches that represent the program
and shared data structures - Implementation mechanisms
- Mapping patterns to processors
81Concept - Summary
- What is the common thread of all these models and
paradigms?
82Concept - Conclusion
- You take a problem, break it up into n tasks and
assign them to p processors - that's the science - How you break up the problem and exploit the
parallelism - now that's the art
83This page intentionally left blank
84Compile
- Serial/sequential program optimization
- Introduction to OpenMP
- Introduction to MPI
- Profilers
- Libraries
- Debugging
- Performance Analysis Formulas
85Serial
- Some of you may be thinking: why would I want to
discuss serial code in a talk about parallel
computing? - Well, have you ever eaten just one bran flake
or one rolled oat at a time?
86Serial
- Most of the serial optimization techniques can be
used for any program parallel or serial - Well written assembler code will beat high level
programming language any day but who has the
time to write a parallel application in assembler
for one of the myriad of processors available.
However, small sections of assembler might be
more affective. - Reducing the memory requirements of an
application is a good tool that frequently
results in better processor performance - You can use these tools to write efficient code
from scratch or to optimize existing code. - First attempts at optimization may be compiler
options or modifying a loop. However performance
tuning is like trying to reach the speed of light:
more and more time and energy is expended but
the peak performance is never reached. It may be
best, before optimizing your program, to consider
how much time and energy you have and are willing
or allowed to commit. Remember, you may spend a
lot of time optimizing for one processor/compiler
only to be told to port the code to another system
87Serial
- Computers have become faster over the past years
(Moore's Law). However, application speed has
not kept pace. Why? Perhaps it is because
programmers - Write programs without any knowledge of the
hardware on which they will run - Do not know how to use compilers effectively (how
many use the gnu compilers?) - Do not know how to modify code to improve
performance
88Serial Storage Problems
- Avoid cache thrashing and memory bank contention
by dimensioning multidimensional arrays so that
the dimensions are not powers of two - Eliminate TLB (Translation Lookaside Buffer which
translates virtual memory addresses into physical
memory addresses) misses and memory bank
contention by accessing arrays in unit stride. A
TLB miss is when a process accesses memory which
does not have its translation in the TLB - Avoid Fortran I/O interfaces such as open(),
read(), write(), etc. since they are built on top
of the buffered I/O mechanisms fopen(), fread(),
fwrite(), etc. Fortran adds additional
functionality to the I/O routines which leads to
more overhead for doing the actual transfers - Do your own buffering for I/O and use system
calls to transfer large blocks of data to and
from files
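- A small C illustration of the unit-stride advice above (my own
example; the array size is deliberately not a power of two). Fortran
stores arrays column-major, so the good loop order there is the
reverse:

    #include <stdio.h>

    #define N 1025    /* deliberately not a power of two, per the advice above */

    static double a[N][N];

    /* C stores arrays row-major, so the rightmost index must vary
       fastest for unit-stride (cache- and TLB-friendly) access. */
    static double sum_unit_stride(void)
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];          /* consecutive addresses */
        return s;
    }

    /* Same arithmetic, but each access jumps N*sizeof(double) bytes,
       which can cause cache and TLB misses. */
    static double sum_strided(void)
    {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 1.0;
        printf("%f %f\n", sum_unit_stride(), sum_strided());
        return 0;
    }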
89Serial Compilers and HPC
- A compiler takes a high-level language as input
and produces assembler code which, once linked with
other objects, forms an executable that can run
on a computer - Initially programmers had no choice but to
program in assembler for a specific processor.
When processors changed, so would the code - Now programmers write in a high-level language
that can be recompiled for other processors
(source code compatibility). There is also
object and binary compatibility
90Serial the compiler and you
- How the compiler generates good assembly code and
things you can do to help it - Register allocation is when the compiler assigns
quantities to registers. C and C++ have the
register keyword. Some optimizations increase
the number of registers required - C/C++ register data type: useful when the
programmer knows the variable will be used many
times and should not be reloaded from memory - C/C++ asm macro allows assembly code to be
inserted directly into the instruction sequence.
It makes code non-portable - C/C++ include file math.h generates faster code
when used - Uniqueness of memory addresses. Different
languages make assumptions on whether memory
locations of variables are unique. Aliasing
occurs when multiple variables have the same
memory location.
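- A small illustration of the aliasing point (my own example, using
C99's restrict qualifier; some compilers offer no-alias options
instead):

    #include <stdio.h>

    /* Without extra information the compiler must assume that a and b
       might overlap (alias), so b[0] is reloaded from memory on every
       iteration. Declaring the pointers restrict promises the memory
       they point to is unique, so b[0] can be kept in a register. */
    void scale(int n, double *restrict a, const double *restrict b)
    {
        for (int i = 0; i < n; i++)
            a[i] = a[i] * b[0];
    }

    int main(void)
    {
        double v[4] = {1, 2, 3, 4}, s[1] = {10};
        scale(4, v, s);
        printf("%g %g %g %g\n", v[0], v[1], v[2], v[3]);  /* 10 20 30 40 */
        return 0;
    }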
91Serial The Compiler and You
- Dead code elimination is the removal of code that
is never used - Constant folding is when expressions with
multiple constants can be folded together and
evaluated at compile time (i.e. A 34 can be
replaced by A 7). Propagation is when variable
references are replaced by a constant value at
compile time (i.e. A34, BA3 can be replaced
by A7 and B10 - Common subexpression elimination (i.e. AB(XY),
CD(XY)) puts repeated expressions into a new
variable - Strength reduction
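- A hand-written before/after sketch of common subexpression
elimination (my own example; an optimizing compiler normally does
this for you):

    #include <stdio.h>

    /* Before: (x + y) is written and evaluated twice. */
    double before(double b, double d, double x, double y)
    {
        double a = b + (x + y);
        double c = d + (x + y);
        return a * c;
    }

    /* After: the repeated expression is hoisted into a temporary and
       computed once. */
    double after(double b, double d, double x, double y)
    {
        double t = x + y;
        double a = b + t;
        double c = d + t;
        return a * c;
    }

    int main(void)
    {
        printf("%g %g\n", before(1, 2, 3, 4), after(1, 2, 3, 4));  /* 72 72 */
        return 0;
    }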
92Serial Strength reductions
- Replace integer multiplication or division with
shift operations - Multiplies and divides are expensive
- Replace 32-bit integer division by 64-bit
floating-point division - Integer division is much more expensive than
floating-point division - Replace floating-point multiplication with
floating-point addition - Y = X + X is cheaper than Y = 2*X
- Replace multiple floating-point divisions by
division and multiplication - Division is one of the most expensive operations
a = x/z, b = y/z can be replaced by c = 1/z, a = x*c,
b = y*c - Replace power function by floating-point
multiplications - Power calculations are very expensive and take 50
times longer than performing a multiplication so
Y = X**3 can be replaced by Y = X*X*X
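- A before/after sketch of the division and power reductions above
(my own example; note the replacements can change the last bits of
the floating-point results):

    #include <stdio.h>
    #include <math.h>

    /* Straightforward version: two divides per element plus a pow() call. */
    void plain(int n, double *a, double *b,
               const double *x, const double *y, double z)
    {
        for (int i = 0; i < n; i++) {
            a[i] = pow(x[i] / z, 3.0);   /* expensive power function */
            b[i] = y[i] / z;
        }
    }

    /* Strength-reduced version: one division hoisted out of the loop,
       divides replaced by multiplies, pow() replaced by multiplications. */
    void reduced(int n, double *a, double *b,
                 const double *x, const double *y, double z)
    {
        double c = 1.0 / z;
        for (int i = 0; i < n; i++) {
            double t = x[i] * c;
            a[i] = t * t * t;            /* Y = X*X*X instead of Y = X**3 */
            b[i] = y[i] * c;
        }
    }

    int main(void)
    {
        double x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
        double a[3], b[3];
        plain(3, a, b, x, y, 2.0);
        printf("%g %g\n", a[2], b[2]);   /* 3.375 3 */
        reduced(3, a, b, x, y, 2.0);
        printf("%g %g\n", a[2], b[2]);   /* 3.375 3 */
        return 0;
    }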
93Serial Single Loop Optimization
- Induction variable optimization
- when values in a loop are a linear function of
the induction variable the code can be simplified
by replacing the expression with a counter and
replacing the multiplication by an addition - Prefetching
- what happens when the compiler prefetches off
the end of the array (fortunately it is ignored) - Test promotion in loops
- Branches in code greatly reduce performance since
they interfere with pipelining - Loop peeling
- Handle boundary conditions outside the loop (i.e.
do not test for them inside the loop) - Fusion
- If the loop control is the same (i.e. for (i = 0; i < n; i++)) for
more than one loop, combine them together (sketched below) - Fission
- Sometimes loops need to be split apart to help
performance - Copying
- Loop fission using dynamically allocated memory
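- Small sketches of loop fusion and loop peeling (my own examples):

    #include <stdio.h>

    #define N 1000

    /* Before fusion: two separate sweeps over the same index range. */
    void two_loops(double *a, double *b, const double *c)
    {
        for (int i = 0; i < N; i++)
            a[i] = c[i] * 2.0;
        for (int i = 0; i < N; i++)
            b[i] = c[i] + 1.0;
    }

    /* After fusion: one sweep, so c[] is read through memory only once. */
    void fused(double *a, double *b, const double *c)
    {
        for (int i = 0; i < N; i++) {
            a[i] = c[i] * 2.0;
            b[i] = c[i] + 1.0;
        }
    }

    /* Loop peeling: the boundary iterations are handled outside the
       loop instead of being tested for on every pass through the body. */
    void peeled(double *a, const double *c)
    {
        a[0] = c[0];
        for (int i = 1; i < N - 1; i++)
            a[i] = 0.5 * (c[i - 1] + c[i + 1]);
        a[N - 1] = c[N - 1];
    }

    int main(void)
    {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) c[i] = i;
        two_loops(a, b, c);
        fused(a, b, c);
        peeled(a, c);
        printf("%g %g %g\n", a[0], a[1], b[N - 1]);
        return 0;
    }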