Title: Parallel Architectures
1. Chapter 2
2. Outline
- Some chapter references
- Brief review of complexity
- Terminology for comparisons
- Interconnection networks
- Processor arrays
- Multiprocessors
- Multicomputers
- Flynn's Taxonomy (moved to Chapter 1)
3. Some Chapter References
- Selim Akl, The Design and Analysis of Parallel Algorithms, Prentice Hall, 1989 (earlier textbook).
- G. C. Fox, "What Have We Learnt from Using Real Parallel Machines to Solve Real Problems?" Technical Report C3P-522, Cal Tech, December 1989. (Included in part in more recent books co-authored by Fox.)
- A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction to Parallel Computing, Second Edition, Addison Wesley, 2003 (first edition 1994).
- Harry Jordan, Gita Alaghband, Fundamentals of Parallel Processing: Algorithms, Architectures, Languages, Prentice Hall, 2003, Ch. 1, 3-5.
- F. Thomson Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann, 1992.
4. References - continued
- Gregory Pfister, In Search of Clusters: The Ongoing Battle in Lowly Parallelism, 2nd Edition, Ch. 2. (Discusses details of some serious problems that MIMDs incur.)
- Michael Quinn, Parallel Programming in C with MPI and OpenMP, McGraw Hill, 2004 (current textbook), Chapter 2.
- Michael Quinn, Parallel Computing: Theory and Practice, McGraw Hill, 1994, Ch. 1-2.
- Sayed H. Roosta, Parallel Processing: Parallel Algorithms, Theory and Computation, Springer Verlag, 2000, Ch. 1.
- Wilkinson & Allen, Parallel Programming: Techniques and Applications, Prentice Hall, 2nd Edition, 2005, Ch. 1-2.
5. Brief Review: Complexity Concepts Needed for Comparisons
- Whenever we define a counting function, we usually characterize the growth rate of that function in terms of complexity classes.
- Technical definition: We say a function f(n) is in O(g(n)) if (and only if) there are positive constants c and n0 such that 0 ≤ f(n) ≤ c·g(n) for all n ≥ n0.
- O(n) is read as "big-oh of n".
- This notation can be used to separate counting functions into complexity classes that characterize the size of the count.
- We can use it for any kind of counting function, such as timings, bisection widths, etc.
6. Big-Oh and Asymptotic Growth Rate
- The big-Oh notation gives an upper bound on the (asymptotic) growth rate of a function.
- The statement "f(n) is O(g(n))" means that the growth rate of f(n) is not greater than the growth rate of g(n).
- We can use the big-Oh notation to rank functions according to their growth rate:

                       f(n) is O(g(n))    g(n) is O(f(n))
   g(n) grows faster   Yes                No
   f(n) grows faster   No                 Yes
   Same growth         Yes                Yes
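Since the deck carries no code of its own, a short Python sketch may help make the formal definition concrete. The functions f and g and the witness constants c = 4, n0 = 5 below are illustrative choices, not taken from the slides.

```python
# Empirically illustrate the big-Oh definition: f(n) = 3n^2 + 5n is
# O(n^2), witnessed by the constants c = 4 and n0 = 5.
def f(n):
    return 3 * n**2 + 5 * n

def g(n):
    return n**2

c, n0 = 4, 5
# Check 0 <= f(n) <= c*g(n) for a range of n >= n0 (3n^2 + 5n <= 4n^2
# holds exactly when n >= 5).
assert all(0 <= f(n) <= c * g(n) for n in range(n0, 10000))

# The ratio f(n)/g(n) tends to 3, so f is in fact Theta(n^2) as well.
print(f(10**6) / g(10**6))  # close to 3
```

A failed assertion here would mean the chosen witnesses c and n0 do not work; the definition only requires that some such pair exists.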
7. Relatives of Big-Oh
- big-Omega
  - f(n) is Ω(g(n)) if there is a constant c > 0 and an integer constant n0 ≥ 1 such that f(n) ≥ c·g(n) for n ≥ n0.
  - Intuitively, this says that, up to a constant factor, f(n) is asymptotically greater than or equal to g(n).
- big-Theta
  - f(n) is Θ(g(n)) if there are constants c1 > 0 and c2 > 0 and an integer constant n0 ≥ 1 such that 0 ≤ c1·g(n) ≤ f(n) ≤ c2·g(n) for n ≥ n0.
  - Intuitively, this says that, up to a constant factor, f(n) and g(n) are asymptotically the same.
- Note: These concepts are covered in algorithms courses.
8. Relatives of Big-Oh
- little-oh
  - f(n) is o(g(n)) if, for any constant c > 0, there is an integer constant n0 ≥ 0 such that 0 ≤ f(n) < c·g(n) for n ≥ n0.
  - Intuitively, this says f(n) is, up to a constant, asymptotically strictly less than g(n), so f(n) ∉ Θ(g(n)).
- little-omega
  - f(n) is ω(g(n)) if, for any constant c > 0, there is an integer constant n0 ≥ 0 such that f(n) > c·g(n) ≥ 0 for n ≥ n0.
  - Intuitively, this says f(n) is, up to a constant, asymptotically strictly greater than g(n), so f(n) ∉ Θ(g(n)).
- These are not used as much as the earlier definitions, but they round out the picture.
9. Summary of Intuition for Asymptotic Notation
- big-Oh: f(n) is O(g(n)) if f(n) is asymptotically less than or equal to g(n).
- big-Omega: f(n) is Ω(g(n)) if f(n) is asymptotically greater than or equal to g(n).
- big-Theta: f(n) is Θ(g(n)) if f(n) is asymptotically equal to g(n).
- little-oh: f(n) is o(g(n)) if f(n) is asymptotically strictly less than g(n).
- little-omega: f(n) is ω(g(n)) if f(n) is asymptotically strictly greater than g(n).
10. A Calculus Definition of O, Θ (often easier to use)
Definition: Let f and g be functions defined on the positive integers with nonnegative values. We say g is in O(f) if and only if lim_{n→∞} g(n)/f(n) = c for some nonnegative real number c, i.e., the limit exists and is not infinite.
Definition: We say f is in Θ(g) if and only if f is in O(g) and g is in O(f).
Note: We often use L'Hopital's Rule to calculate the limits needed.
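The limit-based definition can be sanity-checked numerically. This is an illustrative sketch only: evaluating the ratio at one large n approximates the limit but does not prove it exists, and the example functions are my own.

```python
# Numerically estimate lim_{n->inf} g(n)/f(n) to apply the calculus
# definition of big-Oh: g is in O(f) iff this limit exists and is finite.
def ratio_at(g, f, n=10**7):
    # Evaluate the ratio at a single large n as a rough limit estimate.
    return g(n) / f(n)

# Example: g(n) = 5n^2 + n, f(n) = n^2. The ratio tends to 5 (finite),
# so g is in O(f); the reverse ratio tends to 1/5, so f is in O(g).
# By the second definition, f is therefore in Theta(g).
r = ratio_at(lambda n: 5 * n**2 + n, lambda n: n**2)
print(round(r, 3))  # approximately 5.0
```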
11. Why Asymptotic Behavior Is Important
- 1) It allows us to compare counts on large sets.
- 2) It helps us understand the maximum size of input that can be handled in a given time, provided we know the environment in which we are running.
- 3) It stresses the fact that even dramatic speedups in hardware cannot overcome the handicap of an asymptotically slow algorithm.
12. Recall: Order Wins Out (example from Baase's Algorithms text)
The TRS-80: main language support is BASIC, typically a slow interpreted language. For more details on the TRS-80, see http://mate.kjsl.com/trs80/
The CRAY-YMP: the language used in the example is FORTRAN, a fast-running language. For more details on the CRAY-YMP, see http://ds.dial.pipex.com/town/park/abm64/CrayWWWStuff/Cfaqp1.html#TOC3
13. CRAY-YMP with FORTRAN vs. TRS-80 with BASIC
CRAY-YMP running time: 3n³ nanoseconds; TRS-80 running time: 19,500,000n nanoseconds.
(microsecond (abbr. μsec): one-millionth of a second; millisecond (abbr. msec): one-thousandth of a second.)

   n                      10        100      1000    2500    10000    1,000,000
   CRAY-YMP (3n³)         3 μsec    3 msec   3 sec   50 sec  49 min   95 years
   TRS-80 (19,500,000n)   200 msec  2 sec    20 sec  50 sec  3.2 min  5.4 hours

Despite the far faster machine, the linear algorithm on the TRS-80 overtakes the cubic algorithm on the CRAY near n = 2500: order wins out.
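The table above can be recomputed directly. This sketch assumes the nanosecond time units of Baase's original example; the function names are mine.

```python
# Recompute the "order wins out" comparison: a fast machine running an
# O(n^3) algorithm versus a slow machine running an O(n) algorithm.
def cray_ns(n):
    # CRAY-YMP with FORTRAN: 3 n^3 nanoseconds
    return 3 * n**3

def trs80_ns(n):
    # TRS-80 with BASIC: 19,500,000 n nanoseconds
    return 19_500_000 * n

for n in (10, 100, 1000, 2500, 10000, 1_000_000):
    print(n, cray_ns(n) / 1e9, trs80_ns(n) / 1e9)  # times in seconds

# Crossover: 3n^3 = 19,500,000n  =>  n = sqrt(6,500,000), about 2550,
# so beyond n of roughly 2500 the slower machine running the
# asymptotically better algorithm wins.
```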
14. Interconnection Networks
- Uses of interconnection networks:
  - Connect processors to shared memory
  - Connect processors to each other
- Interconnection media types:
  - Shared medium
  - Switched medium
- Different interconnection networks define different parallel machines.
- An interconnection network's properties influence the type of algorithm used on a machine, as they affect how data is routed.
15. Shared versus Switched Media
- With a shared medium, only one message is sent at a time, and all processors listen.
- With a switched medium, multiple simultaneous messages are possible.
16. Shared Medium
- Allows only one message at a time
- Messages are broadcast
- Each processor listens to every message
- Before sending a message, a processor listens until the medium is unused
- Collisions require resending of messages
- Ethernet is an example
17Switched Medium
- Supports point-to-point messages between pairs of
processors - Each processor is connected to one switch
- Advantages over shared media
- Allows multiple messages to be sent
simultaneously - Allows scaling of the network to accommodate the
increase in processors
18. Switch Network Topologies
- View a switched network as a graph:
  - Vertices: processors or switches
  - Edges: communication paths
- Two kinds of topologies:
  - Direct
  - Indirect
19. Direct Topology
- Ratio of switch nodes to processor nodes is 1:1
- Every switch node is connected to:
  - 1 processor node
  - At least 1 other switch node
Indirect Topology
- Ratio of switch nodes to processor nodes is greater than 1:1
- Some switches simply connect to other switches
20. Terminology for Evaluating Switch Topologies
- We need to evaluate 4 characteristics of a network in order to understand its effectiveness in implementing efficient parallel algorithms on a machine with that network.
- These are:
  - Diameter
  - Bisection width
  - Edges per node
  - Constant edge length
- We'll define these and see how they affect algorithm choice.
- Then we will investigate several different topologies and see how these characteristics are evaluated.
21. Terminology for Evaluating Switch Topologies
- Diameter: the largest distance between two switch nodes.
- A low diameter is desirable.
- It puts a lower bound on the complexity of parallel algorithms that require communication between arbitrary pairs of nodes.
22. Terminology for Evaluating Switch Topologies
- Bisection width: the minimum number of edges between switch nodes that must be removed in order to divide the network into two halves (within 1 node, if the number of nodes is odd).
- A high bisection width is desirable.
- In algorithms requiring large amounts of data movement, the size of the data set divided by the bisection width puts a lower bound on the complexity of the algorithm.
- Actually proving what the bisection width of a network is can be quite difficult.
23Terminology for Evaluating Switch Topologies
- Number of edges per node
- It is best if the maximum number of edges/node is
a constant independent of network size, as this
allows the processor organization to scale more
easily to a larger number of nodes. - Degree is the maximum number of edges per node.
- Constant edge length? (yes/no)
- Again, for scalability, it is best if the nodes
and edges can be laid out in 3D space so that the
maximum edge length is a constant independent of
network size.
24Evaluating Switch Topologies
- Many have been proposed and analyzed. We will
consider several well known ones - 2-D mesh
- linear network
- binary tree
- hypertree
- butterfly
- hypercube
- shuffle-exchange
- Those in yellow have been used in commercial
parallel computers.
25. 2-D Meshes
Note: Circles represent switches and squares represent processors in all these slides.
26. 2-D Mesh Network
- Direct topology
- Switches arranged into a 2-D lattice or grid
- Communication allowed only between neighboring switches
- Torus: a variant that includes wraparound connections between switches on the edge of the mesh
27. Evaluating 2-D Meshes (assumes the mesh is square)
- n = number of processors
- Diameter: Θ(n^(1/2))
  - Places a lower bound on algorithms that require processing with arbitrary nodes sharing data.
- Bisection width: Θ(n^(1/2))
  - Places a lower bound on algorithms that require distribution of data to all nodes.
- Max number of edges per switch: 4 (the degree)
- Constant edge length? Yes
- Does this scale well? Yes
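The mesh figures above can be checked with a short sketch; the function below is an illustration of the Θ(n^(1/2)) claims, assuming a square mesh without wraparound.

```python
import math

# Evaluate a sqrt(n) x sqrt(n) 2-D mesh (no torus wraparound).
# Diameter is the Manhattan distance between opposite corners;
# bisection width is the number of edges cut by a straight split.
def mesh_metrics(n):
    side = int(math.isqrt(n))
    assert side * side == n, "assumes n is a perfect square"
    diameter = 2 * (side - 1)   # Theta(n^(1/2))
    bisection = side            # Theta(n^(1/2))
    return diameter, bisection

print(mesh_metrics(16))  # (6, 4)
```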
28. Linear Network
- Switches arranged into a 1-D mesh
- Direct topology
- Corresponds to a row or column of a 2-D mesh
- Ring: a variant that allows a wraparound connection between the switches on the ends
- The linear and ring networks have many applications
- Essentially supports a pipeline in both directions
- Although these networks are very simple, they support many optimal algorithms
29. Evaluating Linear and Ring Networks
- Diameter
  - Linear: n − 1, i.e., Θ(n)
  - Ring: ⌊n/2⌋, i.e., Θ(n)
- Bisection width
  - Linear: 1, i.e., Θ(1)
  - Ring: 2, i.e., Θ(1)
- Degree of switches: 2
- Constant edge length? Yes
- Does this scale well? Yes
30. Binary Tree Network
- Indirect topology
- n = 2^d processor nodes and 2n − 1 switches, where d = 0, 1, ... is the number of levels
- E.g., for d = 3, there are 2³ = 8 processors on the bottom and 2n − 1 = 2(8) − 1 = 15 switches
31. Evaluating the Binary Tree Network
- Diameter: 2 log n, i.e., O(log n). (Note: this is small.)
- Bisection width: 1, the lowest possible number
- Degree: 3
- Constant edge length? No
- Does this scale well? No
32. Hypertree Network (of degree 4 and depth 2)
- (a) Front view: 4-ary tree of height 2
- (b) Side view: upside-down binary tree of height d
- (c) Complete network
33. Hypertree Network
- Indirect topology
- Note: the degree k and the depth d must be specified.
- From the front, this gives a k-ary tree of height d.
- From the side, the same network looks like an upside-down binary tree of height d.
- Joining the front and side views yields the complete network.
34. Evaluating a 4-ary Hypertree with Depth d
- A 4-ary hypertree has n = 4^d processors.
- The general formula for a k-ary hypertree is n = k^d.
- Diameter: 2d, which is logarithmic in n
  - Shares the low diameter of the binary tree.
- Bisection width: 2^(d+1)
  - Here, for d = 2, 2^(d+1) = 2³ = 8.
  - A large value, much better than the binary tree's.
- Constant edge length? No
- Degree: 6
35. Butterfly Network
A 2³ = 8 processor butterfly network with 8 × 4 = 32 switching nodes.
- Indirect topology
- n = 2^d processor nodes connected by n(log n + 1) switching nodes
The rows of switches are called ranks. As complicated as this switching network appears to be, it is really quite simple, as it admits a very nice routing algorithm!
Wrapped butterfly: when the top and bottom ranks are merged into a single rank.
36. Building the 2³ Butterfly Network
- There are 8 processors.
- There are 4 ranks (i.e., rows), with 8 switches per rank.
- Connections:
  - Node (i, j), for i > 0, is connected to two nodes on rank i − 1, namely node (i−1, j) and node (i−1, m), where m is the integer found by flipping the i-th most significant bit in the binary d-bit representation of j.
  - For example, suppose i = 2 and j = 3. Then node (2, 3) is connected to node (1, 3).
  - To get the other connection: 3 = 011₂, so flip the 2nd most significant bit to get 001₂ = 1, and connect node (2, 3) to node (1, 1). (NOTE: There is an error on pg 32 in this example.)
  - Nodes connected by a cross edge between rank i and rank i + 1 have node numbers that differ only in their (i + 1)-st bit.
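The wiring rule above is easy to express with a bit flip. The following is a sketch; the function name is mine, and it implements exactly the rule stated on this slide.

```python
# Butterfly wiring rule: in a d-bit butterfly, node (i, j) on rank
# i > 0 connects up to node (i-1, j) and to node (i-1, m), where m
# flips the i-th most significant bit of j.
def butterfly_up_links(i, j, d):
    m = j ^ (1 << (d - i))  # flip the i-th most significant of d bits
    return (i - 1, j), (i - 1, m)

# The slide's example: d = 3, node (2, 3). 3 = 011 in binary;
# flipping the 2nd most significant bit gives 001 = 1.
print(butterfly_up_links(2, 3, 3))  # ((1, 3), (1, 1))
```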
37. Why It Is Called a Butterfly Network
- Walk cycles such as node (i, j), node (i−1, j), node (i, m), node (i−1, m), node (i, j), where m is determined by the bit flipping shown earlier, and you see a "butterfly."
38. Butterfly Network Routing
Send a message from processor 2 to processor 5. Algorithm: a 0 bit means ship left; a 1 bit means ship right.
1) 5 = 101. Pluck off the leftmost bit (1) and send (01, msg) to the right.
2) Pluck off the leftmost bit (0) and send (1, msg) to the left.
3) Pluck off the leftmost bit (1) and send msg to the right.
Each cross edge followed changes the address by 1 bit.
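The destination-tag routing just described can be sketched in a few lines. This is an illustration of the rule on this slide; the function name is mine.

```python
# Butterfly destination-tag routing: consume the bits of the
# destination address from most to least significant; a 0 bit means
# take the left (straight) edge, a 1 bit means take the right (cross)
# edge at each rank.
def butterfly_route(dest, d):
    bits = format(dest, f'0{d}b')   # d-bit binary string, MSB first
    return ['right' if b == '1' else 'left' for b in bits]

# Slide example: destination processor 5 = 101 in a 2^3 butterfly.
print(butterfly_route(5, 3))  # ['right', 'left', 'right']
```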
39. Evaluating the Butterfly Network with n Processors
- Diameter: log n
- Bisection width: n/2 (likely an error; cf. Leighton's value below)
- Degree: 4 (even for d > 3)
- Constant edge length? No; edge length grows exponentially as rank size increases.
On pg 442, Leighton gives Θ(n / log n) as the bisection width: simply remove the cross edges between two successive ranks to create the bisection cut.
40. Hypercube (also called binary n-cube)
A hypercube with n = 2^d processors and switches, for d = 4.
41. Hypercube (or Binary n-cube) with n = 2^d Processors
- Direct topology
- A 2 × 2 × ... × 2 mesh
- Number of nodes is a power of 2
- Node addresses: 0, 1, ..., n − 1
- Node i is connected to the d nodes whose addresses differ from i in exactly one bit position.
- Example: node 0111 is connected to 1111, 0011, 0101, and 0110.
42. Growing a Hypercube
Note: For d = 4, it is called a 4-D hypercube, or just a 4-cube.
43. Evaluating the Hypercube Network with n = 2^d Nodes
- Diameter: d = log n
- Bisection width: n/2
- Edges per node: log n
- Constant edge length? No; the length of the longest edge increases as n increases.
44. Routing on the Hypercube Network
- Example: Send a message from node 2 = 0010 to node 5 = 0101.
- The nodes differ in 3 bits, so the shortest path will be of length 3.
- One path is 0010 → 0110 → 0100 → 0101, obtained by flipping one of the differing bits at each step.
- As with the butterfly network, bit flipping helps you route on this network.
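The bit-correction idea can be sketched as follows. This illustration flips differing bits from lowest to highest order, so it produces one of several equally short paths (not necessarily the one shown on the slide); the function name is mine.

```python
# Hypercube routing by bit correction: repeatedly flip a bit in which
# the current node's address differs from the destination's.
def hypercube_route(src, dst, d):
    path, cur = [src], src
    for bit in range(d):
        if (cur ^ dst) & (1 << bit):   # this bit still differs
            cur ^= (1 << bit)          # flip it, moving one hop
            path.append(cur)
    return path

# Slide example: node 2 = 0010 to node 5 = 0101 differ in 3 bits,
# so every shortest path has exactly 3 hops.
print([format(x, '04b') for x in hypercube_route(2, 5, 4)])
```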
45. A Perfect Shuffle
- A permutation produced as follows is called a perfect shuffle:
- Given a power-of-2 number of cards, numbered 0, 1, 2, ..., 2^d − 1, write each card number with d bits. Left-rotating the bits (with wraparound) gives the position of the card after the perfect shuffle.
- Example: For d = 3, card 5 = 101. Left rotating and wrapping gives 011, so card 5 goes to position 3. Note that card 0 = 000 and card 7 = 111 stay in position.
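The left rotation above is a one-liner in code. This sketch implements exactly the rule described on this slide; the function name is mine.

```python
# Perfect shuffle as a left rotation of the d-bit card number.
def perfect_shuffle(card, d):
    msb = (card >> (d - 1)) & 1                  # bit that wraps around
    return ((card << 1) & ((1 << d) - 1)) | msb  # shift left, wrap MSB

# Slide example: d = 3, card 5 = 101 -> 011 = 3; cards 0 and 7 stay put.
print([perfect_shuffle(c, 3) for c in range(8)])
```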
46. Shuffle-exchange Network with n = 2^d Processors
(Figure: nodes numbered 0 through 7.)
- Direct topology
- Number of nodes is a power of 2
- Nodes have addresses 0, 1, ..., 2^d − 1
- Two outgoing links from node i:
  - Shuffle link to node LeftCycle(i)
  - Exchange link between node i and node i + 1 when i is even
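The two link types can be sketched by building the adjacency structure explicitly. This illustration follows the rules on this slide (LeftCycle is the left bit rotation from the perfect-shuffle slide); the function names are mine.

```python
# Build the outgoing links of a shuffle-exchange network with 2^d nodes.
def left_cycle(i, d):
    # Left rotation of the d-bit address (the perfect shuffle).
    return ((i << 1) & ((1 << d) - 1)) | (i >> (d - 1))

def shuffle_exchange_links(d):
    n = 1 << d
    links = {}
    for i in range(n):
        out = [('shuffle', left_cycle(i, d))]
        # Exchange links pair node i with node i + 1 when i is even.
        out.append(('exchange', i + 1 if i % 2 == 0 else i - 1))
        links[i] = out
    return links

print(shuffle_exchange_links(3)[5])  # node 5 = 101 shuffles to 011 = 3
```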
47. Shuffle-exchange Addressing: 16 Processors
No arrows on a line segment means it is bidirectional; otherwise, you must follow the arrows. Devising a routing algorithm for this network is interesting and may be a homework problem.
48. Evaluating the Shuffle-exchange
- Diameter: 2 log n − 1
- Edges per node: 3
- Constant edge length? No
- Bisection width: Θ(n / log n)
  - Between n / (2 log n) and 2n / log n
  - See Leighton, pg 480
49. Two Problems with the Shuffle-Exchange
- The shuffle-exchange does not expand well:
  - A large shuffle-exchange network does not decompose well into smaller separate shuffle-exchange networks.
- In a large shuffle-exchange network, a small percentage of nodes will be hot spots:
  - They will encounter much heavier traffic.
- The above results are in the dissertation of one of Batcher's students.
50. Comparing Networks (see Table 2.1)
- All have logarithmic diameter except the 2-D mesh.
- The 4-ary hypertree, butterfly, and hypercube have bisection width n/2 (likely untrue for the butterfly).
- All have a constant number of edges per node except the hypercube.
- Only the 2-D mesh, linear, and ring topologies keep edge lengths constant as network size increases.
- The shuffle-exchange is a good compromise: a fixed number of edges per node, low diameter, and good bisection width.
- However, the negative results on the preceding slide also need to be considered.
51. Alternate Names for SIMDs
- Recall that all active processors of a true SIMD computer must simultaneously access the same memory location.
- The value in the i-th processor can be viewed as the i-th component of a vector.
- SIMD machines are sometimes called vector computers [Jordan et al.] or processor arrays [Quinn 94, 04] based on their ability to execute vector and matrix operations efficiently.
52SIMD Computers
- SIMD computers that focus on vector operations
- Support some vector and possibly matrix
operations in hardware - Usually limit or provide less support for
non-vector type operations involving data in the
vector components. - General purpose SIMD computers
- May also provide some vector and possibly matrix
operations in hardware. - More support for traditional type operations
(e.g., other than for vector/matrix data types).
53. Pipelined Architectures
- Pipelined architectures are sometimes considered to be SIMD architectures.
  - See pg 37 of the textbook and pgs 8-9 of Jordan et al.
- Vector components are entered successively into the first processor in the pipeline.
- The i-th processor of the pipeline receives the output from the (i−1)-th processor.
- Normal operations in each processor are much larger (coarser) in pipelined computers than in true SIMDs.
- Pipelined computers are somewhat SIMD in nature in that synchronization is not required.
54Why Processor Arrays?
- Historically, high cost of control units
- Scientific applications have data parallelism
55Data/instruction Storage
- Front end computer
- Also called the control unit
- Holds and runs program
- Data manipulated sequentially
- Processor array
- Data manipulated in parallel
56Processor Array Performance
- Performance work done per time unit
- Performance of processor array
- Speed of processing elements
- Utilization of processing elements
57. Performance Example 1
- 1024 processors
- Each adds a pair of integers in 1 μsec (1 microsecond, i.e., one-millionth of a second, or 10⁻⁶ second).
- What is the performance when adding two 1024-element vectors (one element pair per processor)?
58. Performance Example 2
- 512 processors
- Each adds two integers in 1 μsec.
- What is the performance when adding two vectors of length 600?
- Since 600 > 512, 88 processors must add two pairs of integers.
- The other 424 processors add only a single pair of integers.
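The arithmetic of Example 2 can be worked through explicitly. The variable names below are mine; the numbers come from the slide.

```python
# Performance Example 2: 512 processors, each adding one pair of
# integers per microsecond, summing two 600-element vectors.
procs, n, t_add = 512, 600, 1e-6

# 600 - 512 = 88 processors must each add two pairs; the remaining
# 424 add one pair, so the whole job takes two 1-microsecond steps.
pairs_per_proc = -(-n // procs)      # ceiling division: 2 steps
elapsed = pairs_per_proc * t_add     # 2 microseconds total
performance = n / elapsed            # additions completed per second

print(int(performance))  # 300,000,000 additions/sec (3 x 10^8)
```

Note that performance is below the 512 additions/μsec peak because 424 processors sit idle during the second step.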
59. Example of a 2-D Processor Interconnection Network in a Processor Array
Each VLSI chip has 16 processing elements. Each PE can simultaneously send a value to a neighbor. (PE = processing element.)
60. SIMD Execution Style
- The traditional (SIMD, vector, processor array) execution style (Quinn 94, pg 62; Quinn 2004, pgs 37-43):
  - The sequential processor that broadcasts the commands to the rest of the processors is called the front end or control unit (or sometimes the host).
  - The front end is a general-purpose CPU that stores the program and the data that is not manipulated in parallel.
  - The front end normally executes the sequential portions of the program.
- Alternately, all PEs needing computation can execute steps synchronously and avoid the broadcast cost of distributing results.
- Each processing element has a local memory that cannot be directly accessed by the control unit or other processing elements.
61SIMD Execution Style
- Collectively, the individual memories of the
processing elements (PEs) store the (vector) data
that is processed in parallel. - When the front end encounters an instruction
whose operand is a vector, it issues a command to
the PEs to perform the instruction in parallel. - Although the PEs execute in parallel, some units
can be allowed to skip particular instructions.
62. Masking on Processor Arrays
- All the processors work in lockstep except those that are masked out (by setting a mask register).
- The conditional if-then-else is different for processor arrays than its sequential version:
  - Every active processor tests to see if its data meets the negation of the boolean condition.
  - If it does, it sets its mask bit so that it will not participate in the operation initially.
  - Next, the unmasked processors execute the THEN part.
  - Afterwards, the mask bits (for the original set of active processors) are flipped, and the newly unmasked processors perform the ELSE part.
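The masked if-then-else can be simulated over a vector of per-PE values. This is an illustrative sketch of the steps above, not an actual SIMD instruction stream; all names are mine.

```python
# Simulate SIMD masked execution of "if (COND) then A else B" across
# a vector of per-PE data values.
def simd_if_then_else(data, cond, then_op, else_op):
    # Each PE tests the condition; PEs where it fails are masked out.
    mask = [cond(x) for x in data]
    # THEN step: only unmasked (condition-true) PEs apply then_op.
    data = [then_op(x) if m else x for x, m in zip(data, mask)]
    # Flip the mask; the remaining PEs apply else_op in the ELSE step.
    data = [x if m else else_op(x) for x, m in zip(data, mask)]
    return data

# Example: double evens, negate odds, in lockstep across 8 PEs.
print(simd_if_then_else(list(range(8)), lambda x: x % 2 == 0,
                        lambda x: 2 * x, lambda x: -x))
```

Both branch bodies are issued to all PEs in sequence, which is exactly why roughly half the machine idles during each half of the conditional (see Claim 2 later in this chapter).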
63. if (COND) then A else B
64. if (COND) then A else B
65. if (COND) then A else B
66. SIMD Machines
- An early SIMD computer designed for vector and matrix processing was the Illiac IV:
  - Initial development at the University of Illinois, 1965-70.
  - Moved to NASA Ames; completed in 1972 but not fully functional until 1976.
  - See Jordan et al., pg 7, and Wikipedia.
- The MPP, DAP, the Connection Machines CM-1 and CM-2, and MasPar's MP-1 and MP-2 are examples of SIMD computers.
  - See Akl, pg 8-12, and Quinn 94.
- The CRAY-1 and the Cyber-205 use pipelined arithmetic units to support vector operations and are sometimes called pipelined SIMDs.
  - See Jordan et al., pg 7; Quinn 94, pg 61-2; and Quinn 2004, pg 37.
67. SIMD Machines
- Quinn 1994, pg 63-67, discusses the CM-2 Connection Machine (with 64K PEs) and the smaller updated CM-200.
- Professor Batcher was the chief architect for the STARAN and the MPP (Massively Parallel Processor) and an advisor for the ASPRO.
  - The ASPRO is a small second-generation STARAN used by the Navy in its spy planes.
- Professor Batcher is best known architecturally for the MPP, which belongs to the Smithsonian Institution and is currently displayed at a D.C. airport.
68. Today's SIMDs
- SIMD functionality is sometimes embedded in sequential machines.
- Others are being built as part of hybrid architectures.
- Some SIMD and SIMD-like features are included in some multi-core/many-core processing units.
- Some SIMD-like architectures have been built as special-purpose machines, although some of these could qualify as general-purpose.
- Some of this work has been proprietary.
- The fact that a parallel computer is SIMD or SIMD-like is often not advertised by the company building it.
69. A Company that Recently Built an Inexpensive SIMD
- ClearSpeed produced a COTS (commodity off-the-shelf) SIMD board.
- WorldScape has developed some defense and commercial applications for this computer.
- It is not a traditional SIMD, as the hardware doesn't tightly synchronize the execution of instructions.
  - The hardware design supports efficient synchronization.
- This machine is programmed like a SIMD.
- The U.S. Navy observed that these machines process radar an order of magnitude faster than others.
- Quite a bit of information about this machine is posted at www.clearspeed.com
70An Example of a Hybrid SIMD
- Embedded Massively Parallel Accelerators
- Other accelerators Decypher, Biocellerator,
GeneMatcher2, Kestrel, SAMBA, P-NAC, Splash-2,
BioScan - (This and next three slides are due to Prabhakar
R. Gudla (U of Maryland) at a CMSC 838T
Presentation, 4/23/2003.)
71. Hybrid Architecture
- Combines the SIMD and MIMD paradigms within one parallel architecture → a hybrid computer.
72. Architecture of Systola 1024
- Instruction Systolic Array
- 32 × 32 mesh of processing elements
- Wavefront instruction execution
73. SIMDs Embedded in SISDs
- Intel's Pentium 4 included what they call MMX technology to gain a significant performance boost.
- IBM and Motorola incorporated the technology into their G4 PowerPC chip in what they call their Velocity Engine.
- Both MMX technology and the Velocity Engine are the chip manufacturers' names for their proprietary SIMD processors and parallel extensions to their operating code.
- This same approach is used by NVidia and Evans & Sutherland to dramatically accelerate graphics rendering.
74. Special-Purpose SIMDs in the Bioinformatics Arena
- Paracel
  - Acquired by Celera Genomics in 2000.
  - Products include the sequence supercomputer GeneMatcher, which has a high-throughput sequence analysis capability.
  - Supports over a million processors.
  - GeneMatcher was used by Celera in their race with the U.S. government to complete the sequencing of the human genome.
- TimeLogic, Inc.
  - Has DeCypher, a reconfigurable SIMD.
75. Advantages of SIMDs
- Reference: Roosta, pg 10.
- Less hardware than MIMDs, as they have only one control unit.
  - Control units are complex, so having only one saves hardware.
- Less memory needed than MIMD:
  - Only one copy of the instructions needs to be stored.
  - This allows more data to be stored in memory.
- Less startup time in communicating between PEs.
76. Advantages of SIMDs (cont.)
- The single instruction stream and synchronization of PEs make SIMD applications easier to program, understand, and debug.
  - Similar to sequential programming.
- Control flow operations and scalar operations can be executed on the control unit while the PEs are executing other instructions.
- MIMD architectures require explicit synchronization primitives, which create a substantial amount of additional overhead.
77. Advantages of SIMDs (cont.)
- During a communication operation between PEs:
  - PEs send data to a neighboring PE in parallel and in lockstep.
  - There is no need to create a header with routing information, as routing is determined by the program steps.
  - The entire communication operation is executed synchronously.
- SIMDs are deterministic and have much more predictable running times.
  - One can normally compute a tight (worst-case) upper bound on the time for communication operations.
- Less complex hardware in a SIMD, since no message decoder is needed in the PEs.
  - MIMDs need a message decoder in each PE.
78. SIMD Shortcomings (with some rebuttals)
- These claims are from our textbook, i.e., Quinn 2004.
  - Similar statements are found in Grama et al.
- Claim 1: Not all problems are data-parallel.
  - While true, most problems seem to have a data-parallel solution.
  - In Fox et al., the observation was made, in their study of large parallel applications at national labs, that most were data-parallel by nature but often had points where significant branching occurred.
79. SIMD Shortcomings (with some rebuttals)
- Claim 2: Speed drops for conditionally executed branches.
  - MIMD processors can execute multiple branches concurrently; with SIMDs, only one of these branches can be executed at a time.
  - For an if-then-else statement with execution times for the "then" and "else" parts being roughly equal, about half of the SIMD processors are idle during its execution.
  - With additional branching, the average number of inactive processors can become even higher.
  - This reason justifies the study of multiple SIMDs (or MSIMDs).
  - On many applications, any branching is quite shallow.
80. SIMD Shortcomings (with some rebuttals)
- Claim 2 (cont.): Speed drops for conditionally executed code.
  - In Fox et al., the observation was made that for the real applications surveyed, the MAXIMUM number of active branches at any point in time was about 8.
  - The cost of the extremely simple processors used in a SIMD is extremely low.
  - Programmers used to worry about full utilization of memory but stopped after memory costs became insignificant overall.
81. SIMD Shortcomings (with some rebuttals)
- Claim 3: SIMDs don't adapt to multiple users well.
  - This is true to some degree for all parallel computers.
  - If usage of a parallel processor is dedicated to an important problem, it is probably best not to risk compromising its performance by sharing.
  - This reason also justifies the study of multiple SIMDs (or MSIMDs).
  - SIMD architecture has not received the attention that MIMD has received and could greatly benefit from further research.
82. SIMD Shortcomings (with some rebuttals)
- Claim 4: SIMDs do not scale down well to affordable starter systems.
  - This point is arguable, and its truth is likely to vary rapidly over time.
  - ClearSpeed produced a very economical SIMD board that plugs into a PC, with about 48 processors per chip and 2-3 chips per board.
83. SIMD Shortcomings (with some rebuttals)
- Claim 5: SIMDs require customized VLSI for their processors, while the expense of control units (in PCs) has dropped.
  - Reliance on COTS (commodity off-the-shelf) parts has dropped the price of MIMDs.
  - The expense of PCs (with control units) has dropped significantly.
  - However, reliance on COTS has fueled the success of the low-level parallelism provided by clusters and restricted new innovative parallel architecture research for well over a decade.
84. SIMD Shortcomings (with some rebuttals)
- Claim 5 (cont.):
  - There is strong evidence that the period of continual dramatic increases in the speed of PCs and clusters is ending.
  - Continued rapid increases in parallel performance will be necessary in order to solve important problems that are beyond our current capabilities.
  - Additionally, with the appearance of very economical COTS SIMDs, this claim no longer appears to be relevant.
85. Multiprocessors
- Multiprocessor: a multiple-CPU computer with shared memory.
- The same address on two different CPUs refers to the same memory location.
- Avoids three cited criticisms of SIMDs:
  - Can be built from commodity CPUs
  - Naturally supports multiple users
  - Maintains efficiency in conditional code
86Centralized Multiprocessor
87. Centralized Multiprocessor
- A straightforward extension of the uniprocessor: add CPUs to the bus.
- All processors share the same primary memory.
- Memory access time is the same for all CPUs:
  - A uniform memory access (UMA) multiprocessor.
  - Also called a symmetric multiprocessor (SMP).
88. Private and Shared Data
- Private data: items used only by a single processor.
- Shared data: values used by multiple processors.
- In a centralized multiprocessor (i.e., an SMP), processors communicate via shared data values.
89. Problems Associated with Shared Data
- The cache coherence problem:
  - Replicating data across multiple caches reduces contention among processors for shared data values.
  - But how can we ensure that different processors have the same value for the same address?
  - The cache coherence problem arises when an obsolete value is still stored in a processor's cache.
90. Write Invalidate Protocol
- The most common solution to cache coherency.
- Each CPU's cache controller monitors (snoops) the bus and identifies which cache blocks are requested by other CPUs.
- A PE gains exclusive control of a data item before performing a write.
- Before the write occurs, all other copies of the data item cached by other PEs are invalidated.
- When any other CPU tries to read a memory location from an invalidated cache block:
  - A cache miss occurs.
  - It has to retrieve the updated data from memory.
91. Cache-coherence Problem
(Figure: memory holds X = 7; no caches hold X.)
92. Cache-coherence Problem
Reading from memory is not a problem. (Figure: one CPU's cache now holds X = 7.)
93. Cache-coherence Problem
(Figure: a second CPU's cache also holds X = 7.)
94. Cache-coherence Problem
Writing to main memory is a problem. (Figure: one CPU writes X = 2; the other cache still holds the obsolete value 7.)
95. Write Invalidate Protocol
A cache controller snoops the bus to see which cache block is being requested by other processors. (Figure: two caches hold X = 7.)
96. Write Invalidate Protocol
Before a write can occur, all other copies of the data at that address are declared invalid. (Figure: "Intent to write X" is broadcast while both caches hold 7.)
97. Write Invalidate Protocol
(Figure: after the broadcast, only the writing CPU's copy of X remains valid.)
98. Write Invalidate Protocol
When another processor tries to read from this location in its cache, it receives a cache miss and has to refresh its copy from main memory. (Figure: X = 2 in memory and in the writer's cache.)
99Synchronization Required for Shared Data
- Mutual exclusion
- Definition: At most one process can be engaged in
an activity at any time. - Example: Only one processor can write to the
same address in main memory at the same time. - We say the process must mutually exclude all
others while it performs this write. - Barrier synchronization
- Definition: Guarantees that no process will
proceed beyond a designated point (called the
barrier) until every process reaches that point.
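Both forms of synchronization can be demonstrated with Python threads standing in for processors (a sketch of the semantics only, not of multiprocessor hardware):

```python
import threading

counter = 0
lock = threading.Lock()        # mutual exclusion
barrier = threading.Barrier(4) # barrier for 4 "processors"
results = []

def worker():
    global counter
    for _ in range(1000):
        with lock:             # at most one thread updates at a time
            counter += 1
    barrier.wait()             # no thread proceeds until all arrive
    results.append(counter)    # so every thread sees the final count

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)                 # 4000: no increments were lost
```

Without the lock, concurrent increments could be lost; without the barrier, a thread could read `counter` before the others finish.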
100Distributed Multiprocessor
- Distributes primary memory among processors
- Increases aggregate memory bandwidth and lowers
average memory access time - Allows a greater number of processors
- Also called a non-uniform memory access (NUMA)
multiprocessor - Local memory access time is fast
- Non-local memory access time can vary
- The distributed memories form one logical address
space
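The NUMA idea above can be captured in a toy model: all memory modules form one logical address space, but the cost of an access depends on whether the address lives in the local module. The class name and the access times are made up for illustration.

```python
LOCAL_NS, REMOTE_NS = 100, 400   # illustrative access times only (ns)

class NumaNode:
    def __init__(self, node_id, memory_map):
        self.node_id = node_id
        self.memory_map = memory_map  # address -> owning node id

    def access_time(self, address):
        # Same logical address space everywhere; only the cost differs.
        owner = self.memory_map[address]
        return LOCAL_NS if owner == self.node_id else REMOTE_NS

# Interleave 8 addresses across 2 nodes: even addresses on node 0.
memory_map = {addr: addr % 2 for addr in range(8)}
node0 = NumaNode(0, memory_map)
print(node0.access_time(4), node0.access_time(5))  # local vs remote
```

This is why data placement matters on a NUMA machine: the same load instruction is several times slower when the address resolves to a remote module.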
101Distributed Multiprocessors
102Cache Coherence
- Some NUMA multiprocessors do not support cache
coherence in hardware - Only instructions and private data are stored in
cache - Policy creates a large memory access time
variance - Implementation more difficult
- No shared memory bus to snoop
- Directory-based protocol needed
103Directory-based Protocol
- Distributed directory contains information about
cacheable memory blocks - One directory entry for each cache block
- Each entry has
- Sharing status
- Which processors have copies
104Sharing Status
- Uncached (denoted by U)
- Block not in any processor's cache
- Shared (denoted by S)
- Cached by one or more processors
- Read only
- Exclusive (denoted by E)
- Cached by exactly one processor
- Processor has written to block
- Copy in memory is obsolete
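A directory entry with the three sharing states and the per-CPU bit vector can be sketched as follows. The class and method names are our own, and real protocols handle transitions (such as forcing a write-back when an Exclusive block is read) that this sketch omits.

```python
U, S, E = "U", "S", "E"   # the three sharing states from the slide

class DirectoryEntry:
    def __init__(self, n_cpus):
        self.state = U
        self.sharers = [0] * n_cpus  # bit vector: which CPUs hold a copy

    def read(self, cpu):
        # A read leaves the block Shared and records the reader.
        # (If another CPU held it Exclusive, a real protocol would
        # first force a write-back; omitted here.)
        self.state = S
        self.sharers[cpu] = 1

    def write(self, cpu):
        # A write invalidates all other copies; the writer becomes
        # the exclusive owner, and memory's copy is now obsolete.
        self.sharers = [1 if i == cpu else 0
                        for i in range(len(self.sharers))]
        self.state = E

entry = DirectoryEntry(3)
entry.read(0)    # -> S 1 0 0
entry.read(2)    # -> S 1 0 1
entry.write(0)   # -> E 1 0 0
print(entry.state, entry.sharers)
```

The three calls reproduce the state sequence shown in the animation that follows (steps 2-11).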
105Directory-based Protocol step 1
[Figure: CPUs with caches, memories, and directories joined by an interconnection network; each directory entry shows the sharing status and a bit vector of caching CPUs.]
106X has value 7 step 2
[Figure: directory entry for X is U 0 0 0; memory holds 7.]
107CPU 0 Reads X step 3
[Figure: CPU 0 sends a read request; directory entry still U 0 0 0.]
108CPU 0 Reads X step 4
[Figure: directory entry becomes S 1 0 0; memory still holds 7.]
109CPU 0 Reads X step 5
[Figure: CPU 0's cache now holds X = 7; directory entry S 1 0 0.]
110CPU 2 Reads X step 6
[Figure: CPU 2 sends a read request; directory entry still S 1 0 0.]
111CPU 2 Reads X step 7
[Figure: directory entry becomes S 1 0 1.]
112CPU 2 Reads X step 8
[Figure: CPU 2's cache now also holds X = 7; directory entry S 1 0 1.]
113CPU 0 Writes 6 to X step 9
[Figure: CPU 0 sends a write miss message; directory entry S 1 0 1.]
114CPU 0 Writes 6 to X step 10
[Figure: the directory sends an invalidate message to CPU 2; entry still S 1 0 1.]
115CPU 0 Writes 6 to X step 11
[Figure: directory entry becomes E 1 0 0; CPU 0's cache holds X = 6.]
116CPU 1 Reads X step 12
[Figure: CPU 1 sends a read miss message; directory entry E 1 0 0.]
117CPU 1 Reads X step 13
[Figure: the directory tells CPU 0 to switch X to shared; entry E 1 0 0.]
118CPU 1 Reads X step 14
[Figure: CPU 0 writes the current value of X back; entry E 1 0 0.]
119CPU 1 Reads X step 15
[Figure: directory entry becomes S 1 1 0; CPU 1's cache holds the value.]
120CPU 2 Writes 5 to X step 16
[Figure: CPU 2 sends a write miss message; directory entry S 1 1 0.]
121CPU 2 Writes 5 to X step 17
[Figure: the directory sends invalidate messages to CPUs 0 and 1; entry S 1 1 0.]
122CPU 2 Writes 5 to X step 18
[Figure: directory entry becomes E 0 0 1; CPU 2's cache holds X = 5.]
123CPU 0 Writes 4 to X step 19
[Figure: CPU 0 sends a write miss message; directory entry E 0 0 1.]
124CPU 0 Writes 4 to X step 20
[Figure: the directory sends a take-away message to CPU 2, the current owner.]
125CPU 0 Writes 4 to X step 21
[Figure: CPU 2 writes the block back and gives up ownership.]
126CPU 0 Writes 4 to X step 22
[Figure: directory entry becomes E 1 0 0.]
127CPU 0 Writes 4 to X step 23
[Figure: CPU 0 creates cache block storage for X.]
128CPU 0 Writes 4 to X step 24
[Figure: CPU 0's cache holds X = 4; directory entry E 1 0 0.]
129CPU 0 Writes Back X Block step 25
[Figure: CPU 0 performs a data write back to memory; entry E 1 0 0.]
130CPU 0 flushes cache block X step 26
[Figure: directory entry returns to U 0 0 0; memory holds the current value.]
131Characteristics of Multiprocessors
- Interprocessor communication is done in the
memory interface by read and write instructions - Memory may be physically distributed, and reads
and writes from different processors may take
different amounts of time. - Congestion and hot spots in the interconnection
network may occur. - Memory latency (i.e., the time to complete a read or
write) may be long and variable. - Most messages through the bus or interconnection
network are the size of single memory words. - Randomization of requests may be used to reduce
the probability of collisions.
132Multicomputers
- Distributed memory multiple-CPU computer
- Same address on different processors refers to
different physical memory locations - Processors interact through message passing
133Typically, Two Flavors of Multicomputers
- Commercial multicomputers
- Custom switch network
- Low latency (the time it takes to send a
message) - High bandwidth (data path width) between
processors - Commodity clusters
- Mass produced computers, switches and other
equipment - Use low cost components
- Message latency is higher
- Communications bandwidth is lower
134Multicomputer Communication
- Processors are connected by an interconnection
network - Each processor has a local memory and can only
access its own local memory - Data is passed between processors using messages,
as required by the program - Data movement across the network is also
asynchronous - A common approach is to use MPI to handle
message passing
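The send/receive style described above can be imitated with queues standing in for the interconnection network. This is a sketch of the programming model only: each "processor" (a thread here) touches only its private data and exchanges values through explicit messages, as real code would do with MPI_Send/MPI_Recv.

```python
import threading, queue

inbox = [queue.Queue(), queue.Queue()]  # one mailbox per "processor"

def send(dest, data):
    inbox[dest].put(data)

def recv(me):
    return inbox[me].get()              # blocks until a message arrives

result = {}

def proc0():
    local = list(range(10))             # private data, invisible to proc 1
    send(1, sum(local[:5]))             # communicate an explicit message
    result["total"] = sum(local[5:]) + recv(0)

def proc1():
    partial = recv(1)                   # receive proc 0's partial sum
    send(0, partial)                    # echo it back (trivial protocol)

t0 = threading.Thread(target=proc0)
t1 = threading.Thread(target=proc1)
t0.start(); t1.start(); t0.join(); t1.join()
print(result["total"])                  # 45 = sum(range(10))
```

Note that no address is ever shared: every value that crosses between the two "processors" does so as a copied message, which is exactly the multicomputer model.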
135Multicomputer Communications (cont)
- Multicomputers can be scaled to larger sizes much
more easily than multiprocessors. - The amount of data transmitted between
processors has a huge impact on performance - The distribution of the data among the processors
is a very important factor in performance
efficiency.
136Message-Passing Advantages
- No problem with simultaneous access to data.
- Allows different PCs to operate on the same data
independently. - Allows PCs on a network to be easily upgraded
when faster processors become available.
137Disadvantages of Message-Passing
- Programmers must make explicit message-passing
calls in the code - This is low-level programming and is error prone.
- Data is not shared but copied, which increases
the total data size. - Data Integrity
- Difficulty in maintaining correctness of multiple
copies of data item.
138Some Interconnection Network Terminology (1/2)
- References: Wilkinson et al.; Grama et al.
Also, earlier slides on architecture networks. - A link is the connection between two nodes.
- A switch that enables packets to be routed
through the node to other nodes without
disturbing the processor is assumed. - The link between two nodes can be either
bidirectional or use two directional links. - Can assume either one wire that carries one bit
or parallel wires (one wire for each bit in a
word). - The above choices do not have a major impact on
the concepts presented in this course.
139Network Terminology (2/2)
- The bandwidth is the number of bits that can be
transmitted in unit time (i.e., bits per second). - The network latency is the time required to
transfer a message through the network. - The communication latency is the total time
required to send a message, including software
overhead and interface delay. - The message latency or startup time is the time
required to send a zero-length message. - Includes software hardware overhead, such as
- Choosing a route
- packing and unpacking the message
140Circuit Switching Message Passing
- This technique establishes a path and allows the
entire message to transfer uninterrupted. - Similar to a telephone connection that is held
until the end of the call. - Links used are not available to other messages
until the transfer is complete. - Latency (message transfer time): If the length of
the control packet sent to establish the path is
small wrt (with respect to) the message length,
the latency is essentially the constant L/B, where
L is the message length and B is the bandwidth.
141Store-and-forward Packet Switching
- Message is divided into packets of information
- Each packet includes source and destination
addresses. - Packets can not exceed a fixed, maximum size
(e.g., 1000 byte). - A packet is stored in a node in a buffer until it
can move to the next node. - Different packets typically follow different
routes but are re-assembled at the destination,
as the packets arrive. - Movements of packets is asynchronous.
142Packet Switching (cont)
- At each node, the destination information is
examined and used to select which node to
forward the packet to.
to avoid hot spots and to minimize traffic jams. - Significant latency is created by storing each
packet in each node it reaches. - Latency increases linearly with the length of the
route.
143Virtual Cut-Through Packet Switching
- Used to reduce the latency.
- Allows a packet to pass through a node without
being stored, if the outgoing link is available. - If the complete path is available, a message can
move immediately from source to destination.
144Wormhole Routing
- Alternate to store-and-forward packet routing
- A message is divided into small units called
flits (flow control units). - Flits are 1-2 bytes in size.
- Can be transferred in parallel on links with
multiple wires. - Only the head flit is initially transferred
when the next link becomes available.
145Wormhole Routing (cont)
- As each flit moves forward, the next flit can
move forward. - The entire path must be reserved for a message as
these packets pull each other along (like cars of
a train). - Request/acknowledge bit messages are required to
coordinate these pull-along moves. - See Wilkinson, et. al.
- Latency: If the flit size is very small
compared to the length of the message, then the
latency is essentially the constant L/B, with L
the message length and B the link bandwidth.
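The latency formulas from the circuit-switching, store-and-forward, and wormhole slides can be compared with a back-of-the-envelope model. This is idealized (no contention, uniform links, whole message treated as one unit); the functions and the numbers plugged in are illustrative only.

```python
def circuit_switching(L, B, setup=0.0):
    # After the path is set up, the message streams uninterrupted: ~ L/B.
    return setup + L / B

def store_and_forward(L, B, hops):
    # The message is stored in full at every node along the route,
    # so latency grows linearly with the number of hops.
    return hops * (L / B)

def wormhole(L, B, hops, flit_bits=16):
    # Only the head flit pays the per-hop cost; the body pipelines
    # behind it, so for long messages latency is again ~ L/B.
    return hops * (flit_bits / B) + L / B

L, B = 1_000_000, 1_000_000_000   # 1 Mbit message over 1 Gbit/s links
print(circuit_switching(L, B))    # ~0.001 s
print(store_and_forward(L, B, 10))  # ~0.01 s: 10x worse over 10 hops
print(wormhole(L, B, 10))         # ~0.001 s: per-hop cost is tiny
```

The comparison shows why wormhole routing largely recovers circuit-switching latency while still multiplexing links at packet granularity.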
146Deadlock
- Routing algorithms are needed to find a path between
the nodes. - Adaptive routing algorithms choose different
paths, depending on traffic conditions. - Livelock is a deadlock-like situation in which a
packet keeps moving around the network
without ever reaching its destination. - Deadlock: No packet can be forwarded because all
are blocked by other stored packets waiting to be
forwarded.
147Asymmetric Multicomputers
- Has a front-end that interacts with users and I/O
devices. - Processors in back end are used for computation.
- Programming similar to SIMDs (i.e., processor
arrays) - Common with early multicomputers
- Examples of asymmetrical multicomputers given in
textbook.
148Asymmetrical MC Advantages
- Back-end processors are dedicated to parallel
computations, so performance is easier to
understand, model, and tune - Only a simple back-end operating system is needed,
which is easy for a vendor to create
149Asymmetrical MC Disadvantages
- Front-end computer is a single point of failure
- Single front-end computer limits scalability of
system - Primitive operating system in back-end processors
makes debugging difficult - Every application requires development of both
front-end and back-end program
150Symmetric Multicomputers
- Every computer executes the same operating system
and has identical functionality. - Users may log into any computer to edit or
compile their programs. - Any or all computers may be involved in the
execution of their program. - During execution of programs, every PE executes
the same program. - When only one PE should execute an operation, an
if statement is used to select the PE.
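The "if statement selects the PE" idiom above is the SPMD style used with MPI: every PE runs the identical program, and a rank test singles one out. A minimal sketch (we simulate the ranks with a loop; in real MPI code the rank would come from MPI_Comm_rank, and the partial sums would be combined with MPI_Reduce):

```python
def program(rank, n_pes, data):
    # Every PE executes this same function on its own slice of the data.
    chunk = data[rank::n_pes]            # this PE's share of the work
    partial = sum(chunk)
    if rank == 0:
        # Only one PE should perform this operation; the if statement
        # selects it, exactly as described on the slide.
        return ("PE 0 reports partial", partial)
    return ("PE %d partial" % rank, partial)

# Simulate 4 PEs all running the identical program:
outputs = [program(r, 4, list(range(8))) for r in range(4)]
print(outputs)
```

Each simulated PE computes a different partial sum from the same program text; only rank 0 takes the special branch.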
151Symmetric Multicomputers
152Symmetrical MC Advantages
- Alleviate performance bottleneck caused by single
front-end computer - Better support for debugging
- Every processor executes same program
153Symmetrical MC Disadvantages
- More difficult to maintain illusion of single
parallel computer - No simple way to balance program development
workload among processors - More difficult to achieve high performance when
multiple processes run on each processor - Details on next slide
154Symmetric MC Disadvantages (cont)
- (cont.) More difficult to achieve high
performance when multiple processes run on each
processor - Processes on same processor compete for same
resources - CPU Cycles
- Cache space
- Memory bandwidth
- Increased cache misses
- Cache is PE oriented instead of process
oriented
155Best Model for Commodity Cluster
- Full-fledged operating system (e.g., Linux)
desirable - Feature of symmetric multicomputers
- Desirable to increase cache hits
- Favors having only a single user process on each
PE - Favors most nodes being off-limits for program
development - Need fast network
- Keep program-development users off the network and
have them access the front-end by another path. - Reserve the interconnection network for use by
parallel processes - Overall, a mixed model may be best for commodity
clusters
156Ideal Commodity Cluster Features
- Co-located computers
- Computers dedicated to running a single process
at a time to lower cache misses - Accessible only from the network
- No keyboards or displays
- Users access front-end by another route.
- Identical operating systems
- Identical local disk images
- Administered as an entity
157ParPar Cluster, A Mixed Model
- Mixed model
- Incorporates both asymmetrical and symmetrical
designs.
158Network of Workstations
- Dispersed computers
- Typically located on users' desks
- First priority: response time for the person at the
keyboard - Parallel jobs wait in the background and run when
spare CPU cycles are available. - Different operating systems
- Different local images
- Checkpointing and restarting important
- Typically connected by Ethernet
- Too slow for commodity cluster usage
159A Commodity Cluster vs Network of Workstations
- A commodity cluster contains components of local
area networks - Commodity computers
- Switches
- A network of workstations is a dispersed
collection of computers - Distributed heterogeneous computers
- Located on primary users' desks
- Unused cycles available for parallel use
- Example: the SETI project
160Summary
- Commercial parallel computers appeared in the 1980s
- Multiple-CPU computers now dominate
- Small-scale: centralized multiprocessors
- Large-scale: distributed-memory architectures
- Multiprocessors
- Multicomputers