Parallel Computing Final Exam Review

Slides: 142

Provided by: xw

Learn more at: http://www.cs.csi.cuny.edu

1
Parallel Computing Final Exam Review
2
What is Parallel computing?

Parallel computing involves performing parallel
tasks using more than one computer.
Real-life example illustrating the related principles:
book shelving in a library.
Single worker.
p workers, each stacking n/p books, but with an
arbitration problem (many workers try to stack
the next book on the same shelf).
p workers, each stacking n/p books, but without
an arbitration problem (each worker works on a
different set of shelves).

3
Important Issues in parallel computing

Task/Program Partitioning.
How to split a single task among the processors
so that each processor performs the same amount
of work, and all processors work collectively to
complete the task.
Data Partitioning.
How to split the data evenly among the processors
in such a way that processor interaction is
minimized.
Communication/Arbitration.
How we allow communication among different
processors and how we arbitrate communication
related conflicts.

4
Challenges

Design of parallel computers so that we resolve
the above issues.
Design, analysis and evaluation of parallel
algorithms run on these machines.
Portability and scalability issues related to
parallel programs and algorithms
Tools and libraries used in such systems.

5
Units of Measure in HPC

High Performance Computing (HPC) units are:
Flop: floating point operation
Flop/s: floating point operations per second
Bytes: size of data (a double precision floating
point number is 8 bytes)
Typical sizes are millions, billions,
trillions
Mega   Mflop/s = $10^{6}$ flop/sec    Mbyte = $2^{20}$ = 1048576 ~ $10^{6}$ bytes
Giga   Gflop/s = $10^{9}$ flop/sec    Gbyte = $2^{30}$ ~ $10^{9}$ bytes
Tera   Tflop/s = $10^{12}$ flop/sec   Tbyte = $2^{40}$ ~ $10^{12}$ bytes
Peta   Pflop/s = $10^{15}$ flop/sec   Pbyte = $2^{50}$ ~ $10^{15}$ bytes
Exa    Eflop/s = $10^{18}$ flop/sec   Ebyte = $2^{60}$ ~ $10^{18}$ bytes
Zetta  Zflop/s = $10^{21}$ flop/sec   Zbyte = $2^{70}$ ~ $10^{21}$ bytes
Yotta  Yflop/s = $10^{24}$ flop/sec   Ybyte = $2^{80}$ ~ $10^{24}$ bytes
See www.top500.org for current list of fastest
machines

6
What is a parallel computer?

A parallel computer is a collection of processors
that cooperatively solve computationally
intensive problems faster than other computers.
Parallel algorithms allow the efficient
programming of parallel computers.
This way the waste of computational resources can
be avoided.
Parallel computer vs. supercomputer:
A supercomputer is a general-purpose
computer that can solve computationally intensive
problems faster than traditional computers.
A supercomputer may or may not be a parallel
computer.

7
Flynn's taxonomy of computer architectures
(control mechanism)

Depending on the execution and data streams
computer architectures can be distinguished into
the following groups.
(1) SISD (Single Instruction, Single Data): This
is a sequential computer.
(2) SIMD (Single Instruction, Multiple Data):
This is a parallel machine like the Thinking
Machines CM-200.
SIMD machines are suited for data-parallel
programs where the same set of instructions is
executed on a large data set.
Some of the earliest parallel computers, such as
the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1,
belonged to this class of machines.
(3) MISD (Multiple Instruction, Single Data):
Some consider a systolic array a member of this
group.
(4) MIMD (Multiple Instruction, Multiple Data):
All other parallel machines. A MIMD architecture
can be MPMD or SPMD. In a Multiple Program
Multiple Data organization, each processor
executes its own program, as opposed to a single
program that is executed by all processors in a
Single Program Multiple Data organization.
Examples of such platforms include current
generation Sun Ultra Servers, SGI Origin Servers,
multiprocessor PCs, workstation clusters, and the
IBM SP.
Note: Some consider the CM-5 a combination of
MIMD and SIMD, as it contains control hardware
that allows it to operate in a SIMD mode.

8
SIMD and MIMD Processors
A typical SIMD architecture (a) and a typical
MIMD architecture (b).
9
Taxonomy based on Address-Space Organization
(memory distribution)

Message-Passing Architecture
In a distributed memory machine each processor
has its own memory. Each processor can access its
own memory faster than it can access the memory
of a remote processor (NUMA for Non-Uniform
Memory Access). This architecture is also known
as message-passing architecture and such machines
are commonly referred to as multicomputers.
Examples: Cray T3D/T3E, IBM SP1/SP2, workstation
clusters.
Shared-Address-Space Architecture
Provides hardware support for read/write to a
shared address space. Machines built this way are
often called multiprocessors.
(1) A shared memory machine has a single address
space shared by all processors (UMA, for Uniform
Memory Access).
The time taken by a processor to access any
memory word in the system is identical.
Examples: SGI Power Challenge, SMP machines.
(2) A distributed shared memory system is a
hybrid between the two previous ones. A global
address space is shared among the processors but
is distributed among them. Example: SGI Origin
2000.
Note: The existence of caches in shared-memory
parallel machines causes cache coherence problems
when a cached variable is modified by one processor
and the shared variable is requested by another
processor. cc-NUMA stands for cache-coherent NUMA
architectures (Origin 2000).

10
NUMA and UMA Shared-Address-Space Platforms
Typical shared-address-space architectures: (a)
uniform-memory-access shared-address-space
computer; (b) uniform-memory-access
shared-address-space computer with caches and
memories; (c) non-uniform-memory-access
shared-address-space computer with local memory
only.
11
Message Passing vs. Shared Address Space
Platforms

Message passing requires little hardware support,
other than a network.
Shared address space platforms can easily emulate
message passing. The reverse is more difficult to
do (in an efficient manner).

12
Taxonomy based on processor granularity

Granularity sometimes refers to the power of
individual processors. Sometimes it is also used to
denote the degree of parallelism.
(1) A coarse-grained architecture consists of
(usually few) powerful processors (e.g. old Cray
machines).
(2) A fine-grained architecture consists of
(usually many inexpensive) processors (e.g. Thinking
Machines CM-200, CM-2).
(3) A medium-grained architecture is between the
two (e.g. CM-5).
Process granularity refers to the amount of
computation assigned to a particular processor of
a parallel machine for a given parallel program.
It also refers, within a single program, to the
amount of computation performed before
communication is issued. If the amount of
computation is small (low degree of concurrency)
a process is fine-grained. Otherwise granularity
is coarse.

13
Taxonomy based on processor synchronization

(1) In a fully synchronous system a global clock
is used to synchronize all operations performed
by the processors.
(2) An asynchronous system lacks any
synchronization facilities. Processor
synchronization needs to be explicit in the user's
program.
(3) A bulk-synchronous system comes in between a
fully synchronous and an asynchronous system.
Synchronization of processors is required only at
certain parts of the execution of a parallel
program.

14
Physical Organization of Parallel Platforms:
ideal architecture (PRAM)

The Parallel Random Access Machine (PRAM) is one
of the simplest ways to model a parallel
computer.
A PRAM consists of a collection of (sequential)
processors that can synchronously access a global
shared memory in unit time. Each processor can
thus access the shared memory as fast (and
efficiently) as it can access its own local
memory.
The main advantage of the PRAM is its simplicity
in capturing parallelism and abstracting away
communication and synchronization issues related
to parallel computing.
Processors are considered to be in abundance and
unlimited in number. The resulting PRAM
algorithms thus exhibit unlimited parallelism
(number of processors used is a function of
problem size).
The abstraction thus offered by the PRAM is a
fully synchronous collection of processors and a
shared memory which makes it popular for parallel
algorithm design.
It is, however, this abstraction that also makes
the PRAM unrealistic from a practical point of
view.
Full synchronization offered by the PRAM is too
expensive and time demanding in parallel machines
currently in use.
Remote memory (i.e. shared memory) access is
considerably more expensive in real machines than
local memory access
UMA machines with unlimited parallelism are
difficult to build.

15
Four Subclasses of PRAM

Depending on how concurrent access to a single
memory cell (of the shared memory) is resolved,
there are various PRAM variants.
ER (Exclusive Read) or EW (Exclusive Write) PRAMs
do not allow concurrent access of the shared
memory.
It is allowed, however, for CR (Concurrent Read)
or CW (Concurrent Write) PRAMs.
Combining the rules for read and write access
there are four PRAM variants
EREW
access to a memory location is exclusive. No
concurrent read or write operations are allowed.
Weakest PRAM model
CREW
Multiple read accesses to a memory location are
allowed. Multiple write accesses to a memory
location are serialized.
ERCW
Multiple write accesses to a memory location are
allowed. Multiple read accesses to a memory
location are serialized.
Can simulate an EREW PRAM
CRCW
Allows multiple read and write accesses to a
common memory location.
Most powerful PRAM model
Can simulate both EREW PRAM and CREW PRAM

16
Resolve concurrent write access

(1) In the arbitrary PRAM, if multiple processors
write into a single shared memory cell, then an
arbitrary processor succeeds in writing into this
cell.
(2) In the common PRAM, processors must write the
same value into the shared memory cell.
(3) In the priority PRAM, the processor with the
highest priority (smallest or largest indexed
processor) succeeds in writing.
(4) In the combining PRAM, if more than one
processor writes into the same memory cell, the
result written into it depends on the combining
operator. If it is the sum operator, the sum of
the values is written; if it is the maximum
operator, the maximum is written.

Note: An algorithm designed for the common PRAM
can be executed on a priority or arbitrary PRAM
and exhibit similar complexity. The same holds
for an arbitrary PRAM algorithm when run on a
priority PRAM.
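The four write-conflict rules can be sketched as a small C function. This is only an illustration: the names crcw_rule and crcw_resolve are hypothetical, the priority rule is assumed to favor the smallest processor index, and the arbitrary rule is modeled by picking the last writer (any choice would be valid).

```c
#include <stddef.h>

/* Write-conflict rules of the CRCW PRAM variants (names are illustrative). */
enum crcw_rule { ARBITRARY, COMMON, PRIORITY, COMBINING_SUM, COMBINING_MAX };

/* Resolve n simultaneous writes to one shared cell. vals[i] is the value
   written by the i-th contending processor, listed by increasing processor
   index. Returns 0 and stores the resulting cell value in *out, or -1 if
   there are no writers or the rule is violated (COMMON with differing values). */
int crcw_resolve(enum crcw_rule rule, const int *vals, size_t n, int *out) {
    if (n == 0) return -1;
    int acc = vals[0];
    for (size_t i = 1; i < n; i++) {
        switch (rule) {
        case ARBITRARY:       /* any writer may win; this sketch picks the last */
            acc = vals[i];
            break;
        case COMMON:          /* all writers must write the same value */
            if (vals[i] != acc) return -1;
            break;
        case PRIORITY:        /* assumed: smallest index wins, keep vals[0] */
            break;
        case COMBINING_SUM:   /* the sum of all written values is stored */
            acc += vals[i];
            break;
        case COMBINING_MAX:   /* the maximum written value is stored */
            if (vals[i] > acc) acc = vals[i];
            break;
        }
    }
    *out = acc;
    return 0;
}
```

For three writers with values 3, 7 and 5, the priority rule stores 3, the sum rule stores 15, the maximum rule stores 7, and the common rule reports a violation.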
17
Interconnection Networks for Parallel Computers

Interconnection networks carry data between
processors and to memory.
Interconnects are made of switches and links
(wires, fiber).
Interconnects are classified as static or
dynamic.
Static networks
Consist of point-to-point communication links
among processing nodes.
Also referred to as direct networks.
Dynamic networks
Built using switches (switching elements) and
links.
Communication links are connected to one another
dynamically by the switches to establish paths
among processing nodes and memory banks.

18
Static and Dynamic Interconnection Networks
Classification of interconnection networks: (a) a
static network and (b) a dynamic network.
19
Network Topologies

Bus-Based Networks
The simplest network, consisting of a shared
medium (bus) that is common to all the nodes.
The distance between any two nodes in the network
is constant (O(1)).
Ideal for broadcasting information among nodes.
Scalable in terms of cost, but not scalable in
terms of performance.
The bounded bandwidth of a bus places limitations
on the overall performance as the number of nodes
increases.
Typical bus-based machines are limited to dozens
of nodes.
Sun Enterprise servers and Intel Pentium based
shared-bus multiprocessors are examples of such
architectures

20
Network Topologies

Crossbar Networks
Employs a grid of switches or switching nodes to
connect p processors to b memory banks.
Nonblocking network:
the connection of a processing node to a memory
bank does not block the connection of any other
processing nodes to other memory banks.
The total number of switching nodes required is
$\Theta(pb)$. (It is reasonable to assume $b \ge p$.)
Scalable in terms of performance
Not scalable in terms of cost.
Examples of machines that employ crossbars
include the Sun Ultra HPC 10000 and the Fujitsu
VPP500

21
Network Topologies: Multistage Network

Multistage Networks
Intermediate class of networks between bus-based
networks and crossbar networks.
Blocking networks: access to a memory bank by a
processor may disallow access to another memory
bank by another processor.
More scalable than the bus-based network in terms
of performance, more scalable than the crossbar
network in terms of cost.

The schematic of a typical multistage
interconnection network.
22
Network Topologies: Multistage Omega Network

Omega network
Consists of log p stages, where p is the number of
inputs (processing nodes) and also the number of
outputs (memory banks).
Each stage consists of an interconnection pattern
that connects p inputs and p outputs:
the perfect shuffle (left rotation).
Each switch has two connection modes.
Pass-through connection: the inputs are sent
straight through to the outputs.
Cross-over connection: the inputs to the
switching node are crossed over and then sent
out.
Has (p/2) log p switching nodes.

23
Network Topologies: Multistage Omega Network
A complete omega network with the perfect
shuffle interconnects and switches can now be
illustrated.
A complete omega network connecting eight inputs
and eight outputs. An omega network has $(p/2) \log p$
switching nodes, and the cost of such a
network grows as $\Theta(p \log p)$.
24
Network Topologies: Multistage Omega Network
Routing

Let s be the binary representation of the source
and d be that of the destination processor.
The data traverses the link to the first
switching node. If the most significant bits of s
and d are the same, then the data is routed in
pass-through mode by the switch; otherwise it is
routed in crossover mode.
This process is repeated for each of the log p
switching stages, using the next most significant
bit at each stage.
Note that this is not a non-blocking switch.
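The routing rule above can be traced in a few lines of C. This is a sketch under stated assumptions: the function names shuffle and omega_route are hypothetical, labels are k-bit unsigned integers with k between 1 and 31, and stage i compares the i-th most significant bits of s and d.

```c
/* Perfect shuffle on a k-bit label: rotate the bits left by one position. */
unsigned shuffle(unsigned x, int k) {
    unsigned mask = (1u << k) - 1u;
    return ((x << 1) | (x >> (k - 1))) & mask;
}

/* Route a message from source s to destination d through an omega network
   with k = log p stages. crossover[i] is set to 1 if the switch at stage i
   is in crossover mode and to 0 if it is in pass-through mode.
   Returns the label of the output that the message reaches. */
unsigned omega_route(unsigned s, unsigned d, int k, int crossover[]) {
    unsigned pos = s;
    for (int i = 0; i < k; i++) {
        int bit = k - 1 - i;                 /* start from the most significant bit */
        crossover[i] = ((s >> bit) & 1u) != ((d >> bit) & 1u);
        pos = shuffle(pos, k);               /* interconnection pattern before the switch */
        if (crossover[i]) pos ^= 1u;         /* crossover moves to the other switch output */
    }
    return pos;
}
```

For s = 010 and d = 111 with k = 3, the switches are set to crossover, pass-through, crossover, and the message arrives at output 111.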

25
Network Topologies: Multistage Omega Network
Routing
An example of blocking in an omega network: one of
the messages (010 to 111 or 110 to 100) is
blocked at link AB.
26
Network Topologies - Fixed Connection Networks
(static)

Completely-connected Network
Star-Connected Network
Linear array
2d-array or 2d-mesh or mesh
3d-mesh
Complete Binary Tree (CBT)
2d-Mesh of Trees
Hypercube

27
Evaluating Static Interconnection Network

One can view an interconnection network as a
graph whose nodes correspond to processors and
its edges to links connecting neighboring
processors. The properties of these
interconnection networks can be described in
terms of a number of criteria.
(1) Set of processor nodes V. The cardinality of
V is the number of processors p (also denoted by
n).
(2) Set of edges E linking the processors. An
edge e = (u, v) is represented by a pair (u, v)
of nodes. If the graph G = (V, E) is directed,
this means that there is a unidirectional link
from processor u to v. If the graph is
undirected, the link is bidirectional. In almost
all networks that will be considered in this
course communication links will be bidirectional.
The exceptions will be clearly distinguished.
(3) The degree $d_u$ of node u is the number of
links containing u as an endpoint. If graph G is
directed we distinguish between the out-degree of
u (number of pairs $(u, v) \in E$, for any $v \in V$)
and, similarly, the in-degree of u. The degree d
of graph G is the maximum of the degrees of its
nodes, i.e. $d = \max_u d_u$.

28
Evaluating Static Interconnection Network (cont.)

(4) The diameter D of graph G is the maximum of
the lengths of the shortest paths linking any two
nodes of G. A shortest path between u and v is
the path of minimal length linking u and v. We
denote the length of this shortest path by $d_{uv}$.
Then $D = \max_{u,v} d_{uv}$. The diameter of a graph G
denotes the maximum delay (in terms of number of
links traversed) that will be incurred when a
packet is transmitted from one node to the other
of the pair that contributes to D (i.e. from u to
v or the other way around, if $D = d_{uv}$). Of course
such a delay holds only if messages follow
shortest paths (the case for most routing
algorithms).
(5) Latency is the total time to send a message,
including software overhead. Message latency is
the time to send a zero-length message.
(6) Bandwidth is the number of bits transmitted
in unit time.
(7) Bisection width is the number of links that
need to be removed from G to split the nodes into
two sets of about the same size ($\pm 1$).

29
Network Topologies - Fixed Connection Networks
(static)

Completely-connected Network
Each node has a direct communication link to
every other node in the network.
Ideal in the sense that a node can send a message
to another node in a single step.
Static counterpart of crossbar switching networks
Nonblocking
Star-Connected Network
One processor acts as the central processor.
Every other processor has a communication link
connecting it to this central processor.
Similar to bus-based network.
The central processor is the bottleneck.

30
Network Topologies: Completely Connected and Star
Connected Networks
Example of an 8-node completely connected network.
(a) A completely connected network of eight
nodes; (b) a star-connected network of nine
nodes.
31
Network Topologies - Fixed Connection Networks
(static)

Linear array
In a linear array, each node (except the two nodes
at the ends) has two neighbors, one each to its
left and right.
Extension: ring or 1-D torus (linear array with
wraparound).
2d-array or 2d-mesh or mesh
The processors are ordered to form a
2-dimensional structure (square) so that each
processor is connected to its four neighbors
(north, south, east, west), except perhaps for the
processors on the boundary.
Extension of the linear array to two dimensions: each
dimension has $\sqrt{p}$ nodes, with a node identified by
a two-tuple (i, j).
3d-mesh
A generalization of a 2d-mesh in three
dimensions. Exercise: Find the characteristics of
this network and its generalization in k
dimensions (k > 2).
Complete Binary Tree (CBT)
There is only one path between any pair of
nodes.
Static tree network: has a processing element at
each node of the tree.
Dynamic tree network: nodes at intermediate
levels are switching nodes and the leaf nodes are
processing elements.
Communication bottleneck at higher levels of the
tree.
Solution: increase the number of communication
links and switching nodes closer to the root.

32
Network Topologies: Linear Arrays
Linear arrays: (a) with no wraparound links; (b)
with wraparound link.
33
Network Topologies: Two- and Three-Dimensional
Meshes
Two- and three-dimensional meshes: (a) 2-D mesh
with no wraparound; (b) 2-D mesh with wraparound
links (2-D torus); and (c) a 3-D mesh with no
wraparound.
34
Network Topologies: Tree-Based Networks
Complete binary tree networks: (a) a static tree
network and (b) a dynamic tree network.
35
Network Topologies: Fat Trees
A fat tree network of 16 processing nodes.
36
Evaluating Network Topologies

Linear array
$|V| = N$, $|E| = N - 1$, $d = 2$, $D = N - 1$, $bw = 1$
(bisection width).
2d-array or 2d-mesh or mesh
For $|V| = N$, we have a $\sqrt{N} \times \sqrt{N}$ mesh structure,
with $|E| \le 2N = O(N)$, $d = 4$, $D = 2\sqrt{N} - 2$,
$bw = \sqrt{N}$.
3d-mesh
Exercise: Find the characteristics of this
network and its generalization in k dimensions
(k > 2).
Complete Binary Tree (CBT) on $N = 2^n$ leaves
For a complete binary tree on N leaves, we define
the level of a node to be its distance from the
root. The root is of level 0 and the number of
nodes of level i is $2^i$. Then $|V| = 2N - 1$,
$|E| = 2N - 2 = O(N)$, $d = 3$, $D = 2 \lg N$, $bw = 1$.
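The degree, diameter and link counts quoted for the linear array and the 2d-mesh can be checked numerically on small instances. The sketch below is illustrative only: all function names and the MAXN bound are hypothetical, the network is stored as an adjacency matrix, and the diameter is computed with the Floyd-Warshall all-pairs shortest path method.

```c
#include <string.h>

#define MAXN 64                 /* assumed upper bound on the number of nodes */

static int adj[MAXN][MAXN];     /* adjacency matrix of the current network */

static void link_nodes(int u, int v) { adj[u][v] = adj[v][u] = 1; }

/* Linear array of n nodes: node i is linked to node i + 1. */
void make_linear_array(int n) {
    memset(adj, 0, sizeof adj);
    for (int i = 0; i + 1 < n; i++) link_nodes(i, i + 1);
}

/* side x side 2d-mesh without wraparound; node (r, c) has index r * side + c. */
void make_mesh(int side) {
    memset(adj, 0, sizeof adj);
    for (int r = 0; r < side; r++)
        for (int c = 0; c < side; c++) {
            int id = r * side + c;
            if (c + 1 < side) link_nodes(id, id + 1);
            if (r + 1 < side) link_nodes(id, id + side);
        }
}

/* Number of bidirectional links |E|. */
int links(int n) {
    int count = 0;
    for (int u = 0; u < n; u++)
        for (int v = u + 1; v < n; v++) count += adj[u][v];
    return count;
}

/* Degree d: the maximum number of links incident to any node. */
int degree(int n) {
    int best = 0;
    for (int u = 0; u < n; u++) {
        int du = 0;
        for (int v = 0; v < n; v++) du += adj[u][v];
        if (du > best) best = du;
    }
    return best;
}

/* Diameter D: the longest shortest path (Floyd-Warshall).
   Returns -1 if some pair of nodes is not connected. */
int diameter(int n) {
    static int dist[MAXN][MAXN];
    const int INF = 1 << 20;
    for (int u = 0; u < n; u++)
        for (int v = 0; v < n; v++)
            dist[u][v] = (u == v) ? 0 : (adj[u][v] ? 1 : INF);
    for (int k = 0; k < n; k++)
        for (int u = 0; u < n; u++)
            for (int v = 0; v < n; v++)
                if (dist[u][k] + dist[k][v] < dist[u][v])
                    dist[u][v] = dist[u][k] + dist[k][v];
    int best = 0;
    for (int u = 0; u < n; u++)
        for (int v = 0; v < n; v++)
            if (dist[u][v] > best) best = dist[u][v];
    return best >= INF ? -1 : best;
}
```

For a linear array with N = 8 this gives 7 links, degree 2 and diameter 7; for a 4 x 4 mesh (N = 16) it gives 24 links (below the 2N bound), degree 4 and diameter 6, in agreement with the formulas.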

37
Network Topologies - Fixed Connection Networks
(static)

2d-Mesh of Trees
An $N^2$-leaf 2d-MOT consists of $N^2$ nodes ordered as
in an $N \times N$ 2d-array (but without the links). The N
rows and N columns of the 2d-MOT form N row CBTs
and N column CBTs respectively.
For such a network, $|V| = N^2 + 2N(N - 1)$, $|E| = O(N^2)$,
$d = 3$, $D = 4 \lg N$, $bw = N$. The 2d-MOT
possesses an interesting decomposition property:
if the 2N roots of the CBTs are removed we get four
$N/2 \times N/2$ 2d-MOTs.
The 2d-Mesh of Trees (2d-MOT) combines the
advantages of 2d-meshes and binary trees. A
2d-mesh has large bisection width but large
diameter ($\sqrt{N}$). On the other hand a binary tree on
N leaves has small bisection width but small
diameter. The 2d-MOT has small diameter and large
bisection width.
A 3d-MOT can be defined similarly.

38
Network Topologies - Fixed Connection Networks
(static)

Hypercube
The hypercube is the major representative of a
class of networks that are called hypercubic
networks. Other such networks are the butterfly,
the shuffle-exchange graph, the de Bruijn graph,
cube-connected cycles, etc.
Each vertex of an n-dimensional hypercube is
represented by a binary string of length n.
Therefore there are $|V| = 2^n = N$ vertices in
such a hypercube. Two vertices are connected by
an edge if their strings differ in exactly one
bit position.
Let $u = u_1 u_2 \ldots u_i \ldots u_n$. An edge is a
dimension i edge if it links two nodes that
differ in the i-th bit position.
This way vertex u is connected to vertex
$u^i = u_1 u_2 \ldots \bar{u}_i \ldots u_n$ with a dimension i edge.
Therefore $|E| = (N \lg N)/2$ and $d = \lg N = n$.
The hypercube is the first network examined so
far whose degree is not a constant but a
very slowly growing function of N. The diameter
of the hypercube is $D = \lg N$. A path from node u
to node v can be determined by correcting the
bits of u to agree with those of v, starting from
dimension 1 in a left-to-right fashion. The
bisection width of the hypercube is $bw = N/2$. This
is a result of the following property of the
hypercube: if all edges of dimension i are
removed from an n-dimensional hypercube, we get
two hypercubes, each of dimension n - 1.
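The left-to-right bit-correction path can be sketched in C as follows. The function name hypercube_route is hypothetical, node labels are assumed to be n-bit unsigned integers with n at most 31, and dimension 1 is taken to be the most significant (leftmost) bit.

```c
/* Route from node u to node v in an n-dimensional hypercube by correcting
   differing bits from left (dimension 1, most significant) to right.
   path[] must have room for n + 1 entries; it receives every node visited,
   starting with u. Returns the number of hops, which equals the number of
   bit positions in which u and v differ. */
int hypercube_route(unsigned u, unsigned v, int n, unsigned path[]) {
    int hops = 0;
    unsigned cur = u;
    path[0] = cur;
    for (int i = n - 1; i >= 0; i--) {
        unsigned bit = 1u << i;
        if ((cur ^ v) & bit) {     /* this dimension still disagrees */
            cur ^= bit;            /* traverse the edge of that dimension */
            path[++hops] = cur;
        }
    }
    return hops;
}
```

Routing from 000 to 101 in a 3-dimensional hypercube visits 000, 100, 101 (two hops); routing from 000 to 111 takes three hops, which matches the diameter lg N for N = 8.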

39
Network Topologies: Hypercubes and their
Construction
Construction of hypercubes from hypercubes of
lower dimension.
40
Evaluating Static Interconnection Networks
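Network                   Diameter                  Bisection Width   Arc Connectivity   Cost (No. of links)
Completely-connected      $1$                       $p^2/4$           $p - 1$            $p(p - 1)/2$
Star                      $2$                       $1$               $1$                $p - 1$
Complete binary tree      $2 \log((p + 1)/2)$       $1$               $1$                $p - 1$
Linear array              $p - 1$                   $1$               $1$                $p - 1$
2-D mesh, no wraparound   $2(\sqrt{p} - 1)$         $\sqrt{p}$        $2$                $2(p - \sqrt{p})$
2-D wraparound mesh       $2\lfloor\sqrt{p}/2\rfloor$   $2\sqrt{p}$   $4$                $2p$
Hypercube                 $\log p$                  $p/2$             $\log p$           $(p \log p)/2$
Wraparound k-ary d-cube   $d\lfloor k/2 \rfloor$    $2k^{d-1}$        $2d$               $dp$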
41
Performance Metrics for Parallel Systems

Number of processing elements: p
Execution Time
Parallel runtime: the time that elapses from the
moment a parallel computation starts to the
moment the last processing element finishes
execution.
$T_s$: serial runtime
$T_p$: parallel runtime
Total Parallel Overhead $T_o$
Total time collectively spent by all the
processing elements, minus the running time required by
the fastest known sequential algorithm for
solving the same problem on a single processing
element.
$T_o = pT_p - T_s$

42
Performance Metrics for Parallel Systems

Speedup S
The ratio of the serial runtime of the best
sequential algorithm for solving a problem to the
time taken by the parallel algorithm to solve the
same problem on p processing elements.
$S = T_s(\text{best})/T_p$
Example: adding n numbers: $T_p = \Theta(\log n)$, $T_s = \Theta(n)$,
$S = \Theta(n/\log n)$
Theoretically, speedup can never exceed the
number of processing elements p ($S \le p$).
Proof: Assume the speedup is greater than p. Then
each processing element spends less than time
$T_s/p$ solving the problem. In this case, a single
processing element could emulate the p processing
elements and solve the problem in fewer than $T_s$
units of time. This is a contradiction because
speedup, by definition, is computed with respect
to the best sequential algorithm.
Superlinear speedup: In practice, a speedup
greater than p is sometimes observed. This
usually happens when the work performed by a
serial algorithm is greater than that of its parallel
formulation, or due to hardware features that put
the serial implementation at a disadvantage.

43
Example for Superlinear speedup

Superlinear speedup
Example 1: Superlinear effects from caches. With
a problem instance of size A and a 64KB cache,
the cache hit rate is 80%. Assuming a cache latency
of 2 ns and a DRAM latency of 100 ns, the
memory access time is $2 \times 0.8 + 100 \times 0.2 = 21.6$ ns. If
the computation is memory bound and performs one
FLOP per memory access, this corresponds to a
processing rate of 46.3 MFLOPS. With a problem
instance of size A/2 and a 64KB cache, the cache
hit rate is higher, i.e., 90%; 8% of the remaining
data comes from local DRAM and the other 2% comes
from remote DRAM with a latency of 400 ns. The
memory access time is then
$2 \times 0.9 + 100 \times 0.08 + 400 \times 0.02 = 17.8$ ns.
The corresponding execution rate at each
processor is 56.18 MFLOPS, and for two processors
the total processing rate is 112.36 MFLOPS. The
speedup is then $112.36/46.3 = 2.43$!
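The arithmetic of this cache example can be reproduced with two small helper functions. This is a sketch: the names avg_access_ns and mflops are hypothetical, and the model assumes, as the example does, exactly one FLOP per memory access.

```c
/* Average memory access time in ns, given the fraction of accesses served by
   the cache (hit), by local DRAM and by remote DRAM, and their latencies. */
double avg_access_ns(double hit, double local_frac, double remote_frac,
                     double cache_ns, double local_ns, double remote_ns) {
    return hit * cache_ns + local_frac * local_ns + remote_frac * remote_ns;
}

/* Processing rate in MFLOPS for a memory-bound computation performing one
   FLOP per memory access: 1 / (access time in ns) GFLOPS = 1000 / ns MFLOPS. */
double mflops(double access_ns) {
    return 1000.0 / access_ns;
}
```

With avg_access_ns(0.8, 0.2, 0.0, 2, 100, 400) for the single processor and avg_access_ns(0.9, 0.08, 0.02, 2, 100, 400) for each of the two processors, the ratio 2 * mflops(17.8) / mflops(21.6) comes out at about 2.43, the superlinear speedup quoted in the example.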

44
Example for Superlinear speedup

Superlinear speedup
Example 2: Superlinear effects due to exploratory
decomposition: exploring the leaf nodes of an
unstructured tree. Each leaf has a label
associated with it and the objective is to find a
node with a specified label, say 'S'. The
solution node is the rightmost leaf in the tree.
A serial formulation of this problem based on
depth-first tree traversal explores the entire
tree, i.e. all 14 nodes, taking 14 time units.
Now consider a parallel formulation in which the left
subtree is explored by processing element 0 and
the right subtree is explored by processing
element 1. The total work done by the parallel
algorithm is only 9 nodes and the corresponding
parallel time is 5 time units. The speedup
is then $14/5 = 2.8$.

45
Performance Metrics for Parallel Systems (cont.)

Efficiency E
Ratio of speedup to the number of processing
elements.
$E = S/p$
A measure of the fraction of time for which a
processing element is usefully employed.
Example: adding n numbers on n processing
elements: $T_p = \Theta(\log n)$, $T_s = \Theta(n)$, $S = \Theta(n/\log n)$,
$E = \Theta(1/\log n)$
Cost (also called work or processor-time product)
W
Product of parallel runtime and the number of
processing elements used.
$W = T_p \times p$
Example: adding n numbers on n processing
elements: $W = \Theta(n \log n)$.
Cost-optimal: the cost of solving a problem on
a parallel computer has the same asymptotic
growth (in $\Theta$ terms), as a function of the input
size, as the fastest known sequential algorithm on
a single processing element.
Problem Size $W_2$
The number of basic computation steps in the best
sequential algorithm to solve the problem on a
single processing element.
$W_2 = T_s$ of the fastest known algorithm to solve the
problem on a sequential computer.
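The metrics defined on these slides (speedup, efficiency, cost and total overhead) are simple enough to compute directly. The following C helpers are a sketch with hypothetical names; the timings used in the usage note are made-up unit-step counts, not measurements.

```c
/* Speedup S = Ts / Tp. */
double speedup(double ts, double tp) { return ts / tp; }

/* Efficiency E = S / p. */
double efficiency(double ts, double tp, int p) { return speedup(ts, tp) / p; }

/* Cost (work, processor-time product) W = Tp * p. */
double cost(double tp, int p) { return tp * p; }

/* Total parallel overhead To = p * Tp - Ts. */
double overhead(double ts, double tp, int p) { return p * tp - ts; }
```

As an illustration, take adding n = 1024 numbers on p = 1024 processing elements and assume Ts = 1024 and Tp = log2(1024) = 10 unit steps. Then S = 102.4, E = 0.1, W = 10240 and To = 9216. The cost grows like n log n while the serial time grows like n, so this formulation is not cost-optimal.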

46
Parallel vs Sequential Computing: Amdahl's Law

Theorem 0.1 (Amdahl's Law) Let f, $0 \le f \le 1$, be
the fraction of a computation that is inherently
sequential. Then the maximum obtainable speedup S
on p processors is $S = 1/(f + (1 - f)/p)$.
Proof. Let T be the sequential running time for
the named computation. fT is the time spent on
the inherently sequential part of the program. On
p processors the remaining computation, if fully
parallelizable, would achieve a running time of
at best $(1 - f)T/p$. The running time of
the parallel program on p processors is thus the sum
of the execution times of the sequential and
parallel components, that is, $fT + (1 - f)T/p$. The
maximum allowable speedup is therefore
$S = T/(fT + (1 - f)T/p)$, and the result is proven.
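The bound in the theorem is easy to evaluate numerically. The function name amdahl_speedup below is hypothetical; the formula is the one stated in the theorem.

```c
/* Maximum speedup on p processors when a fraction f (0 <= f <= 1) of the
   computation is inherently sequential: S = 1 / (f + (1 - f) / p). */
double amdahl_speedup(double f, double p) {
    return 1.0 / (f + (1.0 - f) / p);
}
```

For example, amdahl_speedup(0.05, 20) is about 10.26, and amdahl_speedup(0.1, p) approaches but never reaches 10 as p grows.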

47
Amdahl's Law

Amdahl used this observation to advocate the
building of even more powerful sequential
machines, as one cannot gain much by using
parallel machines. For example, if f = 10%, then
$S \le 10$ as $p \to \infty$. The underlying assumption in
Amdahl's Law is that the sequential component of
a program is a constant fraction of the whole
program. In many instances, as problem size
increases, the fraction of computation that is
inherently sequential decreases. In
many cases even a speedup of 10 is quite
significant by itself.
In addition, Amdahl's law is based on the concept
that parallel computing always tries to minimize
parallel time. In some cases a parallel computer
is used to increase the problem size that can be
solved in a fixed amount of time. For example, in
weather prediction this would increase the
accuracy of, say, a three-day forecast or would
allow a more accurate five-day forecast.
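The limit quoted for a sequential fraction of 10% follows in one step from Amdahl's formula. The worked equation below uses only the symbols f, p and S from the theorem.

```latex
\[
\lim_{p \to \infty} \frac{1}{f + (1 - f)/p} = \frac{1}{f},
\qquad
f = 0.1 \;\Longrightarrow\; S \le \frac{1}{0.1} = 10 .
\]
```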

48
Parallel vs Sequential Computing: Gustafson's Law

Theorem 0.2 (Gustafson's Law) Let the execution
time of a parallel algorithm consist of a
sequential segment fT and a parallel segment (1 -
f)T, and let the sequential segment be constant. The
scaled speedup of the algorithm is then
$S = (fT + (1 - f)Tp)/(fT + (1 - f)T) = f + (1 - f)p$.
For f = 0.05 and p = 20, we get S = 19.05, whereas Amdahl's
law gives S = 10.26.
                     1 proc          p proc
Sequential segment   fT              fT
Parallel segment     (1-f)Tp         (1-f)T
Total                T(f+(1-f)p)     T
Amdahl's Law assumes that problem size is fixed
when it deals with scalability. Gustafson's Law
assumes that running time is fixed.
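The scaled speedup can be evaluated the same way. The function name gustafson_speedup is hypothetical; the formula is the one in the theorem.

```c
/* Scaled speedup on p processors when the sequential segment is a fraction f
   of the parallel execution time: S = f + (1 - f) * p. */
double gustafson_speedup(double f, double p) {
    return f + (1.0 - f) * p;
}
```

With f = 0.05, the value 19.05 quoted on the slide is obtained for p = 20, since 0.05 + 0.95 * 20 = 19.05.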

49
Brent's Scheduling Principle (Emulations)

Suppose we have an efficient parallel algorithm
with unlimited parallelism, i.e. an algorithm
that runs on zillions of processors. In practice
zillions of processors may not be available. Suppose
we have only p processors. A question that arises
is what we can do to run the efficient
zillion-processor algorithm on our limited machine.
One answer is emulation: simulate the
zillion-processor algorithm on the p-processor machine.
Theorem 0.3 (Brent's Principle) Let a parallel
algorithm require m operations and run in
parallel time t. Then
running this algorithm on a limited processor
machine with only p processors would require time
$m/p + t$.
Proof. Let $m_i$ be the number of computational
operations at the i-th step, i.e. $\sum_{i=1}^{t} m_i = m$. If
we assign the p processors on the i-th step to
work on these $m_i$ operations they can conclude in
time $\lceil m_i/p \rceil \le m_i/p + 1$. Thus the total
running time on p processors would be
$\sum_{i=1}^{t} \lceil m_i/p \rceil \le \sum_{i=1}^{t} (m_i/p + 1) = m/p + t$.
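The emulation argument can be checked on a concrete schedule. The function below is a sketch with a hypothetical name (brent_time): it sums the per-step emulation times, where step i of the original algorithm performs ops[i] operations and p processors need ceil(ops[i] / p) time units for it.

```c
/* Time needed to emulate a t-step parallel algorithm on p processors, where
   ops[i] is the number of operations m_i in step i of the original algorithm.
   Each step costs ceil(m_i / p) time units on the p-processor machine. */
long brent_time(const long *ops, int t, long p) {
    long total = 0;
    for (int i = 0; i < t; i++)
        total += (ops[i] + p - 1) / p;   /* integer ceiling of m_i / p */
    return total;
}
```

For a tree-style sum of 16 numbers, the steps perform 8, 4, 2 and 1 additions (m = 15, t = 4). On p = 2 processors the emulation takes 4 + 2 + 1 + 1 = 8 time units, within the bound m/p + t = 11.5.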

50
The Message Passing Interface (MPI): Introduction

The Message-Passing Interface (MPI) is an attempt
to create a standard to allow tasks executing on
multiple processors to communicate through some
standardized communication primitives.
It defines a standard library for message passing
that one can use to develop message-passing
programs using C or Fortran.
The MPI standard defines both the syntax and the
semantics of this functional interface to
message passing.
MPI comes in a variety of flavors: freely
available ones such as LAM-MPI and MPICH, and also
commercial versions such as Critical Software's
WMPI.
It supports message passing on a variety of
platforms, from Linux-based or Windows-based PCs to
supercomputers and multiprocessor systems.
After the introduction of MPI, whose functionality
includes a set of 125 functions, a revision of
the standard took place that added C++ support,
external memory accessibility and also Remote
Memory Access (similar to BSP's put and get
capability) to the standard. The resulting
standard is known as MPI-2 and has grown to
almost 241 functions.

51
The Message Passing Interface: A minimum set

A minimum set of MPI functions is described
below. MPI functions use the prefix MPI_, and after
the prefix the remaining keyword starts with a
capital letter.
A brief explanation of the primitives can be
found in the textbook (beginning on page 242). A
more elaborate presentation is available in the
optional book.

Function Class                  Standard  Function       Operation
Initialization and Termination  MPI       MPI_Init       Start of SPMD code
                                MPI       MPI_Finalize   End of SPMD code
Abnormal Stop                   MPI       MPI_Abort      One process halts all
Process Control                 MPI       MPI_Comm_size  Number of processes
                                MPI       MPI_Comm_rank  Identifier of calling process
                                MPI       MPI_Wtime      Local (wall-clock) time
Message Passing                 MPI       MPI_Send       Send message to remote proc.
                                MPI       MPI_Recv       Receive message from remote proc.
52
General MPI program
    #include <mpi.h>
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);   /* no MPI functions called before this */
        /* ... program body ... */
        MPI_Finalize();           /* no MPI functions called after this */
        return 0;
    }
53
MPI and MPI-2: Initialization and Termination

    #include <mpi.h>
    int MPI_Init(int *argc, char ***argv);
    int MPI_Finalize(void);
Multiple processes from the same source are
created by calling MPI_Init, and
these processes are safely terminated by calling
MPI_Finalize.
The arguments of MPI_Init are pointers to the main
function's parameters argc and argv. On return they
hold the command line arguments minus the ones
that were used/processed
by the MPI implementation. Thus command line
processing should only be performed in the
program after the execution of this function
call. On success the call returns MPI_SUCCESS;
otherwise an implementation-dependent error code
is returned.
Definitions are available in <mpi.h>.

54
MPI and MPI-2: Abort

int MPI_Abort(MPI_Comm comm, int errcode);
Note that MPI_Abort aborts an MPI program
cleanly. The first argument is a communicator and
the second argument is an integer error code.
A communicator is a collection of processes that
can send messages to each other. A default
communicator is MPI_COMM_WORLD, which consists of
all the processes running when program execution
begins.

55
MPI and MPI-2: Communicators, Process Control

Under MPI, a communication domain is a set of
processors that are allowed to communicate with
each other. Information about such a domain is
stored in a communicator that uniquely identifies
the processors that participate in a
communication operation.
The default communication domain consists of all
the processors of a parallel execution; it is
called MPI_COMM_WORLD. By using multiple
communicators between possibly overlapping groups
of processors we make sure that messages do not
interfere with each other.
#include <mpi.h>
int MPI_Comm_size(MPI_Comm comm, int *size);
int MPI_Comm_rank(MPI_Comm comm, int *rank);
Thus
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &pid);
return the number of processors nprocs and the
processor id pid of the calling processor.

56
MPI and MPI-2: Communicators, Process Control
Example

A hello world! program in MPI is the following
one.
#include <stdio.h>
#include <mpi.h>
int main(int argc, char **argv)
{
  int nprocs, mypid;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* number of processes */
  MPI_Comm_rank(MPI_COMM_WORLD, &mypid);   /* rank of this process */
  printf("Hello world from process %d of total %d\n", mypid, nprocs);
  MPI_Finalize();
  return 0;
}

57
Submit job to Grid Engine

(1) Create an MPI program, e.g. example.c.
(2) Compile the program, using mpicc -O2
example.c -o example
(3) Type the command vi submit
(4) Put the following into submit:
#!/bin/bash
#$ -S /bin/sh
#$ -N example
#$ -q default
#$ -pe lammpi 4
#$ -cwd
mpirun C ./example
(5) Run chmod a+x submit
(6) Run qsub submit
(7) Run qstat to check the status of your program

58
MPI: Message-Passing primitives

#include <mpi.h>
/* Blocking send and receive */
int MPI_Send(void *buf, int count, MPI_Datatype
  dtype, int dest, int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype
  dtype, int src, int tag, MPI_Comm comm,
  MPI_Status *stat);
/* Non-Blocking send and receive */
int MPI_Isend(void *buf, int count, MPI_Datatype
  dtype, int dest, int tag, MPI_Comm comm,
  MPI_Request *req);
int MPI_Irecv(void *buf, int count, MPI_Datatype
  dtype, int src, int tag, MPI_Comm comm,
  MPI_Request *req);
int MPI_Wait(MPI_Request *preq, MPI_Status
  *stat);
buf - initial address of send/receive buffer
count - number of elements in send buffer
(nonnegative integer) or maximum number of
elements in receive buffer.
dtype - datatype of each send/receive buffer
element (handle)
dest, src - rank of destination/source (integer).
Wild-card MPI_ANY_SOURCE for recv only. No
wildcard for dest.
tag - message tag (integer). Range 0...32767.
Wild-card MPI_ANY_TAG for recv only; send must
specify tag.
comm - communicator (handle)
stat - status object (Status), which can be the
MPI constant MPI_STATUS_IGNORE; it returns the
source and tag of the message that was actually
received.

59
data type correspondence between MPI and C

MPI_CHAR            -->  signed char
MPI_SHORT           -->  signed short int
MPI_INT             -->  signed int
MPI_LONG            -->  signed long int
MPI_UNSIGNED_CHAR   -->  unsigned char
MPI_UNSIGNED_SHORT  -->  unsigned short int
MPI_UNSIGNED        -->  unsigned int
MPI_UNSIGNED_LONG   -->  unsigned long int
MPI_FLOAT           -->  float
MPI_DOUBLE          -->  double
MPI_LONG_DOUBLE     -->  long double
MPI_BYTE
MPI_PACKED

60
MPI: Message-Passing primitives

The MPI_Send and MPI_Recv functions are blocking,
that is, they do not return unless it is safe to
modify or use the contents of the send/receive
buffer.
MPI also provides non-blocking send and
receive primitives. These are MPI_Isend and
MPI_Irecv, where the I stands for Immediate.
These functions allow a process to post that it
wants to send to or receive from another process,
and then allow the process to call a function
(e.g. MPI_Wait) to complete the send-receive pair.
Non-blocking send-receives allow for the
overlapping of computation and communication. Thus
MPI_Wait plays the role of a synchronizer: the
send/receive are only advisories, and the
communication is only guaranteed to be complete
when MPI_Wait returns.

61
MPI: Message-Passing primitives: tag

A tag is simply an integer argument that is
passed to a communication function and that can
be used to uniquely identify a message. For
example, in MPI if process A sends a message to
process B, then in order for B to receive the
message, the tag used in A's call to MPI_Send
must be the same as the tag used in B's call to
MPI_Recv. Thus, if the characteristics of two
messages sent by A to B are identical (i.e., same
count and datatype), then A and B can distinguish
between the two by using different tags.
For example, suppose A is sending two floats, x
and y, to B. Then the processes can be sure that
the values are received correctly, regardless of
the order in which A sends and B receives,
provided different tags are used:
/* Assume system provides some buffering */
if (my_rank == A) {
  tag = 0;
  MPI_Send(&x, 1, MPI_FLOAT, B, tag,
    MPI_COMM_WORLD);
  . . .
  tag = 1;
  MPI_Send(&y, 1, MPI_FLOAT, B, tag,
    MPI_COMM_WORLD);
} else if (my_rank == B) {
  tag = 1;
  MPI_Recv(&y, 1, MPI_FLOAT, A, tag,
    MPI_COMM_WORLD, &status);
  . . .
  tag = 0;
  MPI_Recv(&x, 1, MPI_FLOAT, A, tag,
    MPI_COMM_WORLD, &status);
}

62
MPI Message-Passing primitives: Communicators

Now if one message from process A to process B is
being sent by the library, and another, with
identical characteristics, is being sent by the
user's code, unless the library developer insists
that user programs refrain from using certain tag
values, this approach cannot be made to work.
Clearly, partitioning the set of possible tags is
at best an inconvenience: if one wishes to modify
an existing user code so that it can use a
library that partitions tags, each message
passing function in the entire user code must be
checked.
The solution that was ultimately decided on was
the communicator.
Formally, a communicator is a pair of objects:
the first is a group or ordered collection of
processes, and the second is a context, which can
be viewed as a unique, system-defined tag. Every
communication function in MPI takes a
communicator argument, and a communication can
succeed only if all the processes participating
in the communication use the same communicator
argument. Thus, a library can either require that
its functions be passed a unique library-specific
communicator, or its functions can create their
own unique communicator. In either case, it is
straightforward for the library designer and the
user to make certain that their messages are not
confused.
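
The matching rule described above (same communicator context, matching source, matching tag, with wildcards allowed on the receive side only) can be sketched as a small predicate. This is a toy model for illustration, not how any MPI implementation is written; the type envelope_t, the function envelope_matches, and the constants TOY_ANY_SOURCE and TOY_ANY_TAG are hypothetical names, and the wildcard values are not MPI's actual constants.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical wildcard values for this toy model (not MPI's constants). */
enum { TOY_ANY_SOURCE = -1, TOY_ANY_TAG = -1 };

/* A message "envelope": the context stands for the communicator. */
typedef struct {
    int context;  /* system-defined id of the communicator */
    int source;   /* rank of the sender */
    int tag;      /* user-chosen message tag */
} envelope_t;

/* Returns true if a posted receive (recv) may match an incoming message (msg). */
bool envelope_matches(envelope_t recv, envelope_t msg)
{
    /* A send must name a concrete source and tag: no wildcards on this side. */
    assert(msg.source >= 0 && msg.tag >= 0);

    if (recv.context != msg.context)
        return false;  /* different communicator: never matches */
    if (recv.source != TOY_ANY_SOURCE && recv.source != msg.source)
        return false;
    if (recv.tag != TOY_ANY_TAG && recv.tag != msg.tag)
        return false;
    return true;
}
```

In this model a library receive posted on its own context cannot pick up a user message that has the same source and tag, which is the separation the communicator is meant to provide.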

63
For example, suppose now that the user's code is
sending a float, x, from process A to process B,
while the library is sending a float, y, from A
to B:
/* Assume system provides some buffering */
void User_function(int my_rank, float *x)
{
  MPI_Status status;
  if (my_rank == A) {
    /* MPI_COMM_WORLD is pre-defined in MPI */
    MPI_Send(x, 1, MPI_FLOAT, B, 0, MPI_COMM_WORLD);
  } else if (my_rank == B) {
    MPI_Recv(x, 1, MPI_FLOAT, A, 0, MPI_COMM_WORLD, &status);
  }
  . . .
}
void Library_function(float *y)
{
  MPI_Comm library_comm;
  MPI_Status status;
  int my_rank;
  /* Create a communicator with the same group */
  /* as MPI_COMM_WORLD, but a different context */
  MPI_Comm_dup(MPI_COMM_WORLD, &library_comm);
  /* Get process rank in new communicator */
  MPI_Comm_rank(library_comm, &my_rank);
  if (my_rank == A) {
    MPI_Send(y, 1, MPI_FLOAT, B, 0, library_comm);
  } else if (my_rank == B) {
    MPI_Recv(y, 1, MPI_FLOAT, A, 0, library_comm, &status);
  }
  . . .
}
int main(int argc, char **argv)
{
  . . .
  if (my_rank == A) {
    User_function(A, &x);
    . . .
    Library_function(&y);
  } else if (my_rank == B) {
    Library_function(&y);
    . . .
    User_function(B, &x);
  }
  . . .
}
64
MPI Message-Passing primitives: User-defined
datatypes

The second main innovation in MPI, user-defined
datatypes, allows programmers to exploit this
power, and as a consequence, to create messages
consisting of logically unified sets of data
rather than only physically contiguous blocks of
data.
Loosely, an MPI datatype is a sequence of
displacements in memory together with a
collection of basic datatypes (e.g., int, float,
double, and char). Thus, an MPI-datatype
specifies the layout in memory of data to be
collected into a single message or data to be
distributed from a single message.
For example, suppose we specify a sparse matrix
entry with the following definition.
typedef struct {
  double entry;
  int row, col;
} mat_entry_t;
MPI provides functions for creating a variable
that stores the layout in memory of a variable of
type mat_entry_t. One does this by first defining
an MPI datatype
MPI_Datatype mat_entry_mpi_t;
to be used in communication functions, and then
calling various MPI functions to initialize
mat_entry_mpi_t so that it contains the required
layout. Then, if we define
mat_entry_t x;
we can send x by simply calling
MPI_Send(&x, 1, mat_entry_mpi_t, dest, tag,
  comm);
and we can receive x with a similar call to
MPI_Recv.
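
The "sequence of displacements together with a collection of basic datatypes" can be made concrete without calling MPI at all. The sketch below only computes the byte displacements of the three members of mat_entry_t with the standard offsetof macro; the helper name mat_entry_displacements is hypothetical. In an actual MPI program these displacements, together with the basic types MPI_DOUBLE, MPI_INT, MPI_INT, are what one would presumably hand to a type constructor such as MPI_Type_create_struct before committing the type.

```c
#include <assert.h>
#include <stddef.h>

/* Sparse matrix entry, as defined in the slide. */
typedef struct {
    double entry;
    int row, col;
} mat_entry_t;

/* Hypothetical helper: fills disp[0..2] with the byte displacements
   of entry, row and col relative to the start of a mat_entry_t. */
void mat_entry_displacements(size_t disp[3])
{
    assert(disp != NULL);
    disp[0] = offsetof(mat_entry_t, entry);
    disp[1] = offsetof(mat_entry_t, row);
    disp[2] = offsetof(mat_entry_t, col);
}
```

The exact values depend on the compiler's padding rules, which is why the layout is computed rather than hard-coded.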

65
MPI: An example with the blocking operations

#include <stdio.h>
#include <mpi.h>
#define N 10000000 // Choose N to be multiple of nprocs to avoid problems.
// Parallel sum of 1 , 2 , 3, ... , N
int main(int argc, char **argv)
{
  int pid, nprocs, i, j;
  int start, end;
  long long sum, total; // N(N+1)/2 does not fit in a 32-bit int
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &pid);
  sum = 0; total = 0;
  start = (N/nprocs)*pid + 1; // Each processor sums its own block
  end = (N/nprocs)*(pid+1);
  for (i = start; i <= end; i++) sum += i;
  if (pid != 0)
    MPI_Send(&sum, 1, MPI_LONG_LONG, 0, 1, MPI_COMM_WORLD);
  else
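
The block decomposition used in this example (start, end, and one partial sum per process) can be checked without MPI. The sketch below is a serial stand-in under stated assumptions: block_sum and total_sum are hypothetical names, the loop over pid replaces the real send/receive step, and n is assumed to be a multiple of nprocs as the comment on N requires.

```c
#include <assert.h>

/* Partial sum of the block of 1..n owned by process pid out of nprocs. */
long long block_sum(long long n, int nprocs, int pid)
{
    assert(nprocs > 0 && pid >= 0 && pid < nprocs);
    assert(n % nprocs == 0);  /* same restriction as the slide's N */

    long long chunk = n / nprocs;
    long long start = chunk * pid + 1;
    long long end = chunk * (pid + 1);
    long long sum = 0;
    for (long long i = start; i <= end; i++)
        sum += i;
    return sum;
}

/* Serial stand-in for process 0 collecting every partial sum. */
long long total_sum(long long n, int nprocs)
{
    long long total = 0;
    for (int pid = 0; pid < nprocs; pid++)
        total += block_sum(n, nprocs, pid);
    return total;
}
```

For example, with n = 12 and nprocs = 4, process 0 owns 1..3 and process 3 owns 10..12, and the partial sums add up to 12*13/2 = 78.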

66
Non-Blocking send and Receive

/* Non-Blocking send and receive */
int MPI_Isend(void *buf, int count, MPI_Datatype
  dtype, int dest, int tag, MPI_Comm comm,
  MPI_Request *req);
int MPI_Irecv(void *buf, int count, MPI_Datatype
  dtype, int src, int tag, MPI_Comm comm,
  MPI_Request *req);
int MPI_Wait(MPI_Request *preq, MPI_Status
  *stat);
#include "mpi.h"
int MPI_Wait(MPI_Request *request, MPI_Status
  *status);
Waits for an MPI send or receive to complete.
Input Parameter
request - request (handle)
Output Parameter
status - status object (Status). May be
MPI_STATUS_IGNORE.

67
MPI: An example with the non-blocking operations

#include <stdio.h>
#include <mpi.h>
#define N 10000000 // Choose N to be multiple of nprocs to avoid problems.
// Parallel sum of 1 , 2 , 3, ... , N
int main(int argc, char **argv)
{
  int pid, nprocs, i, j;
  int start, end;
  long long sum, total; // N(N+1)/2 does not fit in a 32-bit int
  MPI_Status status;
  MPI_Request request;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &pid);
  sum = 0; total = 0;
  start = (N/nprocs)*pid + 1; // Each processor sums its own block
  end = (N/nprocs)*(pid+1);
  for (i = start; i <= end; i++) sum += i;
  if (pid != 0)
    // MPI_Send(&sum, 1, MPI_LONG_LONG, 0, 1, MPI_COMM_WORLD);
    MPI_Isend(&sum, 1, MPI_LONG_LONG, 0, 1, MPI_COMM_WORLD, &request);

68
MPI Basic Collective Operations

One simple collective operation:
int MPI_Bcast(void *message, int count,
  MPI_Datatype datatype, int root, MPI_Comm comm);
The routine MPI_Bcast sends data from one process
(the root) to all others.

69
MPI_Bcast
[Figure: MPI_Bcast among Process 0, Process 1, Process 2 and Process 3. Legend: Data Present.]
70
Simple Program that Demonstrates MPI_Bcast

#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv)
{
  int k, id, p, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  if (id == 0)
    k = 20;
  else
    k = 10;
  for (p = 0; p < size; p++)
    if (id == p)
      printf("Process %d: k = %d before\n", id, k);
  // note: MPI_Bcast must be put where all other processes
  // can see it.
  MPI_Bcast(&k, 1, MPI_INT, 0, MPI_COMM_WORLD);
  for (p = 0; p < size; p++)

71
Simple Program that Demonstrates MPI_Bcast

The output would look like (the order in which
the processes print may vary):
Process 0: k = 20 before
Process 0: k = 20 after
Process 3: k = 10 before
Process 3: k = 20 after
Process 2: k = 10 before
Process 2: k = 20 after
Process 1: k = 10 before
Process 1: k = 20 after

72
Parallel Algorithm Assumptions

Convention: In this subject we name processors
arbitrarily either 0, 1, . . . , p - 1 or 1, 2, .
. . , p.
The input to a particular problem would reside in
the cells of the shared memory. We assume, in
order to simplify the exposition of our
algorithms, that a cell is wide enough (in bits
or bytes) to accommodate a single instance of the
input (e.g. a key or a floating point number). If
the input is of size n, the first n cells
numbered 0, . . . , n - 1 store the input.
We assume that the number of processors of the
PRAM is n or a polynomial function of the size n
of the input. Processor indices are 0, 1, . . . ,
n - 1.

73
PRAM Algorithm: Matrix Multiplication

Matrix Multiplication
A simple algorithm for multiplying two $n \times n$
matrices on a CREW PRAM with time complexity
$T = O(\lg n)$ and $P = n^3$ follows. For
convenience, processors are indexed as triples
$(i, j, k)$, where $i, j, k = 1, \dots, n$. In the
first step processor $(i, j, k)$ concurrently reads
$a_{ij}$ and $b_{jk}$ and performs the
multiplication $a_{ij} b_{jk}$. In the following
steps, for all $i, k$ the results $(i, *, k)$ are
combined, using the parallel sum algorithm, to form
$c_{ik} = \sum_j a_{ij} b_{jk}$. After $\lg n$
steps, the result $c_{ik}$ is thus computed.
The same algorithm also works on the EREW PRAM
with the same time and processor complexity. Only
the first step of the CREW algorithm needs to be
changed. We avoid concurrency by broadcasting
element $a_{ij}$ to processors $(i, j, *)$ using
the broadcasting algorithm of the EREW PRAM in
$O(\lg n)$ steps. Similarly, $b_{jk}$ is broadcast
to processors $(*, j, k)$.
The above algorithm also shows how an n-processor
EREW PRAM can simulate an n-processor CREW PRAM
with an $O(\lg n)$ slowdown.

74
Matrix Multiplication

Step                                            CREW      EREW
1. a_ij to all (i,j,*) procs                    O(1)      O(lg n)
   b_jk to all (*,j,k) procs                    O(1)      O(lg n)
2. a_ij * b_jk at (i,j,k) proc                  O(1)      O(1)
3. parallel sum_j a_ij * b_jk at (i,*,k) procs  O(lg n)   O(lg n)   (n procs participate)
4. c_ik = sum_j a_ij * b_jk                     O(1)      O(1)
T = O(lg n), P = O(n^3), W = O(n^3 lg n), W2 = O(n^3)
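
The steps in the table can be simulated serially to see where the lg n rounds come from. This is a sketch under stated assumptions, not a PRAM implementation: pram_matmul is a hypothetical name, matrices are stored row-major in flat arrays, indices run from 0 to n - 1 instead of 1 to n, and each loop nest stands for one step that all processors would execute at once.

```c
#include <assert.h>
#include <stdlib.h>

/* Multiplies two n x n row-major matrices a and b into c.
   Returns the number of tree-reduction rounds used in step 3. */
int pram_matmul(int n, const double *a, const double *b, double *c)
{
    assert(n > 0 && a != NULL && b != NULL && c != NULL);
    double *prod = malloc((size_t)n * n * n * sizeof *prod);
    assert(prod != NULL);

    /* Step 2: "processor" (i,j,k) forms a_ij * b_jk; all n^3 are independent. */
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                prod[(i * n + k) * n + j] = a[i * n + j] * b[j * n + k];

    /* Step 3: for every (i,k), sum over j in a binary tree. Each pass of
       the outer loop is one parallel round; the stride s doubles each time. */
    int rounds = 0;
    for (int s = 1; s < n; s *= 2, rounds++)
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j + s < n; j += 2 * s)
                    prod[(i * n + k) * n + j] += prod[(i * n + k) * n + j + s];

    /* Step 4: the root of each tree holds c_ik. */
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            c[i * n + k] = prod[(i * n + k) * n];

    free(prod);
    return rounds;
}
```

The returned round count is the ceiling of lg n, which is the T = O(lg n) term in the table; the n^3 products account for P.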

75
PRAM Algorithm: Logical AND operation

Problem. Let $X_1, \dots, X_n$ be binary/boolean
values. Find $X = X_1 \wedge X_2 \wedge \dots \wedge X_n$.
The sequential problem accepts a $P = 1$,
$T = O(n)$, $W = O(n)$ direct solution.
An EREW PRAM algorithm for this problem works the
same way as the PARALLEL SUM algorithm, and its
performance is $P = O(n)$, $T = O(\lg n)$,
$W = O(n \lg n)$, along with the improvements in P
and W mentioned for the PARALLEL SUM algorithm.
In the remainder we will investigate a CRCW PRAM
algorithm. Let binary value $X_i$ reside in the
shared memory location i. We can find
$X = X_1 \wedge X_2 \wedge \dots \wedge X_n$ in
constant time on a CRCW PRAM. Processor 1 first
writes a 1 in shared memory cell 0. If $X_i = 0$,
processor i writes a 0 in memory cell 0. The
result X is then stored in this memory cell.
The result stored in cell 0 is 1 (TRUE) unless a
processor writes a 0 in cell 0; in that case one
of the $X_i$ is 0 (FALSE) and the result X should
be FALSE, as it is.

76
Logical AND operation

begin Logical AND ($X_1, \dots, X_n$)
1. Proc 1 writes 1 in cell 0.
2. if $X_i = 0$, processor i writes 0 into cell 0.
end Logical AND
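
A serial simulation of the two steps above makes the constant-time argument easy to check. This is only a sketch: crcw_logical_and is a hypothetical name, the array x[0..n-1] stands for $X_1, \dots, X_n$, and the loop stands for what all n processors would do in a single step. The concurrent writes are harmless because every writer stores the same value, 0.

```c
#include <assert.h>
#include <stddef.h>

/* Simulates the CRCW Logical AND; x[i] must be 0 or 1. Returns the AND. */
int crcw_logical_and(const int *x, int n)
{
    assert(x != NULL && n > 0);

    int cell0 = 1;                 /* step 1: processor 1 writes 1 in cell 0 */
    for (int i = 0; i < n; i++)    /* step 2: conceptually all i at once */
        if (x[i] == 0)
            cell0 = 0;             /* every concurrent writer writes 0 */
    return cell0;
}
```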
Exercise: Give an O(1) CRCW algorithm for LOGICAL
OR.

77
Parallel Operations with Multiple Outputs:
Parallel Prefix

Problem definition: Given a set of n values
$x_0, x_1, \dots, x_{n-1}$ and an associative
operator, say $\oplus$, the parallel prefix problem
is to compute the following n results/sums.
0: $x_0$,
1: $x_0 \oplus x_1$,
2: $x_0 \oplus x_1 \oplus x_2$,
. . .
n - 1: $x_0 \oplus x_1 \oplus \dots \oplus x_{n-1}$.
Parallel prefix is also called prefix sums or
scan. It has many uses in parallel computing such
as in load-balancing the work assigned to
processors and compacting data structures such as
arrays.
We shall prove that computing ALL THE SUMS is no
more difficult than computing the single sum
$x_0 \oplus \dots \oplus x_{n-1}$.
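
As a reference point for the parallel algorithms, the n results can be computed sequentially in n - 1 applications of the operator. The sketch below assumes integer values; the names binop_t, add_op, max_op and prefix_scan are hypothetical, and addition and maximum are used only as two examples of associative operators.

```c
#include <assert.h>
#include <stddef.h>

/* An associative binary operator on ints. */
typedef int (*binop_t)(int, int);

int add_op(int a, int b) { return a + b; }
int max_op(int a, int b) { return a > b ? a : b; }

/* Sequential inclusive scan: out[i] = x[0] op x[1] op ... op x[i]. */
void prefix_scan(const int *x, int *out, int n, binop_t op)
{
    assert(x != NULL && out != NULL && op != NULL && n > 0);
    out[0] = x[0];
    for (int i = 1; i < n; i++)
        out[i] = op(out[i - 1], x[i]);
}
```

With x = 3, 1, 4, 1, 5 the sums are 3, 4, 8, 9, 14 and the running maxima are 3, 3, 4, 4, 5.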

78
Parallel Prefix Algorithm 1: divide-and-conquer

x0 x1 x2 x3          x4 x5 x6 x7         << Parallel Prefix "Box" for 8 inputs
-------------------  --------------------
         1                    2          <<< 2 PP Boxes for 4 inputs each
-------------------  --------------------
Take rightmost output of Box 1 and
combine it with the outputs of Box 2
Outputs of Box 1     Outputs after combining
x0...x3              x0...x7
x0...x2              x0...x6
x0x1                 x0...x5
x0                   x0...x4
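
The box diagram corresponds to a short recursion. The following is a serial sketch with addition as the operator; dc_prefix is a hypothetical name. The two recursive calls are independent and would run in parallel, and the final loop is the step that combines the rightmost output of Box 1 with every output of Box 2 (a single parallel step if all processors may read the carry at the same time).

```c
#include <assert.h>
#include <stddef.h>

/* In-place inclusive prefix sums of x[lo..hi-1] by divide and conquer. */
void dc_prefix(int *x, int lo, int hi)
{
    assert(x != NULL && lo < hi);
    if (hi - lo == 1)
        return;                      /* a single input is its own prefix */

    int mid = lo + (hi - lo) / 2;
    dc_prefix(x, lo, mid);           /* Box 1 */
    dc_prefix(x, mid, hi);           /* Box 2, independent of Box 1 */

    int carry = x[mid - 1];          /* rightmost output of Box 1 */
    for (int i = mid; i < hi; i++)   /* combine it with the outputs of Box 2 */
        x[i] = carry + x[i];
}
```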

79
Parallel Prefix Algorithm 2

An algorithm for parallel prefix on an EREW PRAM
would require $\lg n$ phases. In phase i, processor
j reads the contents of cells j and $j - 2^i$ (if
it exists), combines them, and stores the result in
cell j.
The EREW PRAM algorithm that solves the parallel
prefix problem has performance $P = O(n)$,
$T = O(\lg n)$, and $W = O(n \lg n)$, $W_2 = O(n)$.
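
The phases can be simulated serially as follows. This is a sketch under stated assumptions: doubling_prefix is a hypothetical name, addition is the operator, and a scratch array stands in for the fact that on a PRAM all processors read before any of them writes within a phase.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* In-place inclusive prefix sums of x[0..n-1] by distance doubling.
   Returns the number of phases executed. */
int doubling_prefix(int *x, int n)
{
    assert(x != NULL && n > 0);
    int *next = malloc((size_t)n * sizeof *next);
    assert(next != NULL);

    int phases = 0;
    for (int d = 1; d < n; d *= 2, phases++) {   /* d = 2^i in phase i */
        for (int j = 0; j < n; j++)              /* all j at once on a PRAM */
            next[j] = (j - d >= 0) ? x[j - d] + x[j] : x[j];
        memcpy(x, next, (size_t)n * sizeof *x);  /* writes land after all reads */
    }
    free(next);
    return phases;
}
```

For 8 inputs the function runs 3 phases, and each phase performs at most n combinations, which is where $W = O(n \lg n)$ comes from.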

80
Parallel Prefix Algorithm 2 Example