Title: Parallel Architectures
1. Chapter 2
2. Outline
- Some chapter references
- Brief review of complexity
- Terminology for comparisons
- Interconnection networks
- Processor arrays
- Multiprocessors
- Multicomputers
- Flynn's Taxonomy (moved to Chapter 1)
3. Some Chapter References
- Selim Akl, The Design and Analysis of Parallel Algorithms, Prentice Hall, 1989 (earlier textbook).
- G. C. Fox, "What Have We Learnt from Using Real Parallel Machines to Solve Real Problems?" Technical Report C3P-522, Cal Tech, December 1989. (Included in part in more recent books co-authored by Fox.)
- A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction to Parallel Computing, Second Edition, Addison Wesley, 2003 (first edition 1994).
- Harry Jordan, Gita Alaghband, Fundamentals of Parallel Processing: Algorithms, Architectures, Languages, Prentice Hall, 2003, Ch. 1, 3-5.
- F. Thomson Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann, 1992.
4. References - continued
- Gregory Pfister, In Search of Clusters: The Ongoing Battle in Lowly Parallelism, 2nd Edition, Ch. 2. (Discusses details of some serious problems that MIMDs incur.)
- Michael Quinn, Parallel Programming in C with MPI and OpenMP, McGraw Hill, 2004 (current textbook), Chapter 2.
- Michael Quinn, Parallel Computing: Theory and Practice, McGraw Hill, 1994, Ch. 1-2.
- Sayed H. Roosta, Parallel Processing: Parallel Algorithms, Theory and Computation, Springer Verlag, 2000, Ch. 1.
- Wilkinson & Allen, Parallel Programming: Techniques and Applications, Prentice Hall, 2nd Edition, 2005, Ch. 1-2.
5. Brief Review: Complexity Concepts Needed for Comparisons
- Whenever we define a counting function, we usually characterize the growth rate of that function in terms of complexity classes.
- Technical definition: We say a function f(n) is in O(g(n)) if (and only if) there are positive constants c and n0 such that 0 ≤ f(n) ≤ c·g(n) for all n ≥ n0.
- O(n) is read as "big-oh of n".
- This notation can be used to separate counting functions into complexity classes that characterize the size of the count.
- We can use it for any kind of counting function, such as timings, bisection widths, etc.
6. Big-Oh and Asymptotic Growth Rate
- The big-Oh notation gives an upper bound on the (asymptotic) growth rate of a function.
- The statement "f(n) is O(g(n))" means that the growth rate of f(n) is not greater than the growth rate of g(n).
- We can use the big-Oh notation to rank functions according to their growth rate:

                       f(n) is O(g(n))    g(n) is O(f(n))
   g(n) grows faster   Yes                No
   f(n) grows faster   No                 Yes
   Same growth         Yes                Yes
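Since the deck carries no code of its own, a short Python sketch may help make the formal definition concrete. The functions f and g and the witness constants c = 4, n0 = 5 below are illustrative choices, not taken from the slides.

```python
# Empirically illustrate the big-Oh definition: f(n) = 3n^2 + 5n is
# O(n^2), witnessed by the constants c = 4 and n0 = 5.
def f(n):
    return 3 * n**2 + 5 * n

def g(n):
    return n**2

c, n0 = 4, 5
# Check 0 <= f(n) <= c*g(n) for a range of n >= n0 (3n^2 + 5n <= 4n^2
# holds exactly when n >= 5).
assert all(0 <= f(n) <= c * g(n) for n in range(n0, 10000))

# The ratio f(n)/g(n) tends to 3, so f is in fact Theta(n^2) as well.
print(f(10**6) / g(10**6))  # close to 3
```

A failed assertion here would mean the chosen witnesses c and n0 do not work; the definition only requires that some such pair exists.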
7. Relatives of Big-Oh
- big-Omega
  - f(n) is Ω(g(n)) if there is a constant c > 0 and an integer constant n0 ≥ 1 such that f(n) ≥ c·g(n) for n ≥ n0.
  - Intuitively, this says that, up to a constant factor, f(n) is asymptotically greater than or equal to g(n).
- big-Theta
  - f(n) is Θ(g(n)) if there are constants c1 > 0 and c2 > 0 and an integer constant n0 ≥ 1 such that 0 ≤ c1·g(n) ≤ f(n) ≤ c2·g(n) for n ≥ n0.
  - Intuitively, this says that, up to a constant factor, f(n) and g(n) are asymptotically the same.
- Note: These concepts are covered in algorithms courses.
8. Relatives of Big-Oh
- little-oh
  - f(n) is o(g(n)) if, for any constant c > 0, there is an integer constant n0 ≥ 0 such that 0 ≤ f(n) < c·g(n) for n ≥ n0.
  - Intuitively, this says f(n) is, up to a constant, asymptotically strictly less than g(n), so f(n) ∉ Θ(g(n)).
- little-omega
  - f(n) is ω(g(n)) if, for any constant c > 0, there is an integer constant n0 ≥ 0 such that f(n) > c·g(n) ≥ 0 for n ≥ n0.
  - Intuitively, this says f(n) is, up to a constant, asymptotically strictly greater than g(n), so f(n) ∉ Θ(g(n)).
- These are not used as much as the earlier definitions, but they round out the picture.
9. Summary of Intuition for Asymptotic Notation
- big-Oh: f(n) is O(g(n)) if f(n) is asymptotically less than or equal to g(n).
- big-Omega: f(n) is Ω(g(n)) if f(n) is asymptotically greater than or equal to g(n).
- big-Theta: f(n) is Θ(g(n)) if f(n) is asymptotically equal to g(n).
- little-oh: f(n) is o(g(n)) if f(n) is asymptotically strictly less than g(n).
- little-omega: f(n) is ω(g(n)) if f(n) is asymptotically strictly greater than g(n).
10. A Calculus Definition of O, Θ (often easier to use)
Definition: Let f and g be functions defined on the positive integers with nonnegative values. We say g is in O(f) if and only if lim_{n→∞} g(n)/f(n) = c for some nonnegative real number c, i.e., the limit exists and is not infinite.
Definition: We say f is in Θ(g) if and only if f is in O(g) and g is in O(f).
Note: We often use L'Hopital's Rule to calculate the limits needed.
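The limit-based definition can be sanity-checked numerically. This is an illustrative sketch only: evaluating the ratio at one large n approximates the limit but does not prove it exists, and the example functions are my own.

```python
# Numerically estimate lim_{n->inf} g(n)/f(n) to apply the calculus
# definition of big-Oh: g is in O(f) iff this limit exists and is finite.
def ratio_at(g, f, n=10**7):
    # Evaluate the ratio at a single large n as a rough limit estimate.
    return g(n) / f(n)

# Example: g(n) = 5n^2 + n, f(n) = n^2. The ratio tends to 5 (finite),
# so g is in O(f); the reverse ratio tends to 1/5, so f is in O(g).
# By the second definition, f is therefore in Theta(g).
r = ratio_at(lambda n: 5 * n**2 + n, lambda n: n**2)
print(round(r, 3))  # approximately 5.0
```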
11. Why Asymptotic Behavior Is Important
- 1) It allows us to compare counts on large sets.
- 2) It helps us understand the maximum size of input that can be handled in a given time, provided we know the environment in which we are running.
- 3) It stresses the fact that even dramatic speedups in hardware cannot overcome the handicap of an asymptotically slow algorithm.
12. Recall: Order Wins Out (example from Baase's Algorithms text)
The TRS-80: main language support is BASIC, typically a slow interpreted language. For more details on the TRS-80, see http://mate.kjsl.com/trs80/
The CRAY-YMP: the language used in the example is FORTRAN, a fast-running language. For more details on the CRAY-YMP, see http://ds.dial.pipex.com/town/park/abm64/CrayWWWStuff/Cfaqp1.html#TOC3
13. CRAY-YMP with FORTRAN vs. TRS-80 with BASIC
CRAY-YMP running time: 3n³ nanoseconds; TRS-80 running time: 19,500,000n nanoseconds.
(microsecond (abbr. μsec): one-millionth of a second; millisecond (abbr. msec): one-thousandth of a second.)

   n                      10        100      1000    2500    10000    1,000,000
   CRAY-YMP (3n³)         3 μsec    3 msec   3 sec   50 sec  49 min   95 years
   TRS-80 (19,500,000n)   200 msec  2 sec    20 sec  50 sec  3.2 min  5.4 hours

Despite the far faster machine, the linear algorithm on the TRS-80 overtakes the cubic algorithm on the CRAY near n = 2500: order wins out.
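The table above can be recomputed directly. This sketch assumes the nanosecond time units of Baase's original example; the function names are mine.

```python
# Recompute the "order wins out" comparison: a fast machine running an
# O(n^3) algorithm versus a slow machine running an O(n) algorithm.
def cray_ns(n):
    # CRAY-YMP with FORTRAN: 3 n^3 nanoseconds
    return 3 * n**3

def trs80_ns(n):
    # TRS-80 with BASIC: 19,500,000 n nanoseconds
    return 19_500_000 * n

for n in (10, 100, 1000, 2500, 10000, 1_000_000):
    print(n, cray_ns(n) / 1e9, trs80_ns(n) / 1e9)  # times in seconds

# Crossover: 3n^3 = 19,500,000n  =>  n = sqrt(6,500,000), about 2550,
# so beyond n of roughly 2500 the slower machine running the
# asymptotically better algorithm wins.
```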
14. Interconnection Networks
- Uses of interconnection networks:
  - Connect processors to shared memory
  - Connect processors to each other
- Interconnection media types:
  - Shared medium
  - Switched medium
- Different interconnection networks define different parallel machines.
- An interconnection network's properties influence the type of algorithm used on a machine, as they affect how data is routed.
15. Shared versus Switched Media
- With a shared medium, only one message is sent at a time, and all processors listen.
- With a switched medium, multiple simultaneous messages are possible.
16. Shared Medium
- Allows only one message at a time
- Messages are broadcast
- Each processor listens to every message
- Before sending a message, a processor listens until the medium is unused
- Collisions require resending of messages
- Ethernet is an example
17Switched Medium
- Supports point-to-point messages between pairs of
processors - Each processor is connected to one switch
- Advantages over shared media
- Allows multiple messages to be sent
simultaneously - Allows scaling of the network to accommodate the
increase in processors
18. Switch Network Topologies
- View a switched network as a graph:
  - Vertices: processors or switches
  - Edges: communication paths
- Two kinds of topologies:
  - Direct
  - Indirect
19. Direct Topology
- Ratio of switch nodes to processor nodes is 1:1
- Every switch node is connected to:
  - 1 processor node
  - At least 1 other switch node
Indirect Topology
- Ratio of switch nodes to processor nodes is greater than 1:1
- Some switches simply connect to other switches
20. Terminology for Evaluating Switch Topologies
- We need to evaluate 4 characteristics of a network in order to understand its effectiveness in implementing efficient parallel algorithms on a machine with that network.
- These are:
  - Diameter
  - Bisection width
  - Edges per node
  - Constant edge length
- We'll define these and see how they affect algorithm choice.
- Then we will investigate several different topologies and see how these characteristics are evaluated.
21. Terminology for Evaluating Switch Topologies
- Diameter: the largest distance between two switch nodes.
- A low diameter is desirable.
- It puts a lower bound on the complexity of parallel algorithms that require communication between arbitrary pairs of nodes.
22. Terminology for Evaluating Switch Topologies
- Bisection width: the minimum number of edges between switch nodes that must be removed in order to divide the network into two halves (within 1 node, if the number of nodes is odd).
- A high bisection width is desirable.
- In algorithms requiring large amounts of data movement, the size of the data set divided by the bisection width puts a lower bound on the complexity of the algorithm.
- Actually proving what the bisection width of a network is can be quite difficult.
23Terminology for Evaluating Switch Topologies
- Number of edges per node
- It is best if the maximum number of edges/node is
a constant independent of network size, as this
allows the processor organization to scale more
easily to a larger number of nodes. - Degree is the maximum number of edges per node.
- Constant edge length? (yes/no)
- Again, for scalability, it is best if the nodes
and edges can be laid out in 3D space so that the
maximum edge length is a constant independent of
network size.
24Evaluating Switch Topologies
- Many have been proposed and analyzed. We will
consider several well known ones - 2-D mesh
- linear network
- binary tree
- hypertree
- butterfly
- hypercube
- shuffle-exchange
- Those in yellow have been used in commercial
parallel computers.
25. 2-D Meshes
Note: Circles represent switches and squares represent processors in all these slides.
26. 2-D Mesh Network
- Direct topology
- Switches arranged into a 2-D lattice or grid
- Communication allowed only between neighboring switches
- Torus: a variant that includes wraparound connections between switches on the edge of the mesh
27. Evaluating 2-D Meshes (assumes the mesh is square)
- n = number of processors
- Diameter: Θ(n^(1/2))
  - Places a lower bound on algorithms that require processing with arbitrary nodes sharing data.
- Bisection width: Θ(n^(1/2))
  - Places a lower bound on algorithms that require distribution of data to all nodes.
- Max number of edges per switch: 4 (the degree)
- Constant edge length? Yes
- Does this scale well? Yes
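The mesh figures above can be checked with a short sketch; the function below is an illustration of the Θ(n^(1/2)) claims, assuming a square mesh without wraparound.

```python
import math

# Evaluate a sqrt(n) x sqrt(n) 2-D mesh (no torus wraparound).
# Diameter is the Manhattan distance between opposite corners;
# bisection width is the number of edges cut by a straight split.
def mesh_metrics(n):
    side = int(math.isqrt(n))
    assert side * side == n, "assumes n is a perfect square"
    diameter = 2 * (side - 1)   # Theta(n^(1/2))
    bisection = side            # Theta(n^(1/2))
    return diameter, bisection

print(mesh_metrics(16))  # (6, 4)
```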
28. Linear Network
- Switches arranged into a 1-D mesh
- Direct topology
- Corresponds to a row or column of a 2-D mesh
- Ring: a variant that allows a wraparound connection between the switches on the ends
- The linear and ring networks have many applications
- Essentially supports a pipeline in both directions
- Although these networks are very simple, they support many optimal algorithms
29. Evaluating Linear and Ring Networks
- Diameter
  - Linear: n − 1, i.e., Θ(n)
  - Ring: ⌊n/2⌋, i.e., Θ(n)
- Bisection width
  - Linear: 1, i.e., Θ(1)
  - Ring: 2, i.e., Θ(1)
- Degree of switches: 2
- Constant edge length? Yes
- Does this scale well? Yes
30. Binary Tree Network
- Indirect topology
- n = 2^d processor nodes and 2n − 1 switches, where d = 0, 1, ... is the number of levels
- E.g., for d = 3, there are 2³ = 8 processors on the bottom and 2n − 1 = 2(8) − 1 = 15 switches
31. Evaluating the Binary Tree Network
- Diameter: 2 log n, i.e., O(log n). (Note: this is small.)
- Bisection width: 1, the lowest possible number
- Degree: 3
- Constant edge length? No
- Does this scale well? No
32. Hypertree Network (of degree 4 and depth 2)
- (a) Front view: 4-ary tree of height 2
- (b) Side view: upside-down binary tree of height d
- (c) Complete network
33. Hypertree Network
- Indirect topology
- Note: the degree k and the depth d must be specified.
- From the front, this gives a k-ary tree of height d.
- From the side, the same network looks like an upside-down binary tree of height d.
- Joining the front and side views yields the complete network.
34. Evaluating a 4-ary Hypertree with Depth d
- A 4-ary hypertree has n = 4^d processors.
- The general formula for a k-ary hypertree is n = k^d.
- Diameter: 2d, which is logarithmic in n
  - Shares the low diameter of the binary tree.
- Bisection width: 2^(d+1)
  - Here, for d = 2, 2^(d+1) = 2³ = 8.
  - A large value, much better than the binary tree's.
- Constant edge length? No
- Degree: 6
35. Butterfly Network
A 2³ = 8 processor butterfly network with 8 × 4 = 32 switching nodes.
- Indirect topology
- n = 2^d processor nodes connected by n(log n + 1) switching nodes
The rows of switches are called ranks. As complicated as this switching network appears to be, it is really quite simple, as it admits a very nice routing algorithm!
Wrapped butterfly: when the top and bottom ranks are merged into a single rank.
36. Building the 2³ Butterfly Network
- There are 8 processors.
- There are 4 ranks (i.e., rows), with 8 switches per rank.
- Connections:
  - Node (i, j), for i > 0, is connected to two nodes on rank i − 1, namely node (i−1, j) and node (i−1, m), where m is the integer found by flipping the i-th most significant bit in the binary d-bit representation of j.
  - For example, suppose i = 2 and j = 3. Then node (2, 3) is connected to node (1, 3).
  - To get the other connection: 3 = 011₂, so flip the 2nd most significant bit to get 001₂ = 1, and connect node (2, 3) to node (1, 1). (NOTE: There is an error on pg 32 in this example.)
  - Nodes connected by a cross edge between rank i and rank i + 1 have node numbers that differ only in their (i + 1)-st bit.
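The wiring rule above is easy to express with a bit flip. The following is a sketch; the function name is mine, and it implements exactly the rule stated on this slide.

```python
# Butterfly wiring rule: in a d-bit butterfly, node (i, j) on rank
# i > 0 connects up to node (i-1, j) and to node (i-1, m), where m
# flips the i-th most significant bit of j.
def butterfly_up_links(i, j, d):
    m = j ^ (1 << (d - i))  # flip the i-th most significant of d bits
    return (i - 1, j), (i - 1, m)

# The slide's example: d = 3, node (2, 3). 3 = 011 in binary;
# flipping the 2nd most significant bit gives 001 = 1.
print(butterfly_up_links(2, 3, 3))  # ((1, 3), (1, 1))
```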
37. Why It Is Called a Butterfly Network
- Walk cycles such as node (i, j), node (i−1, j), node (i, m), node (i−1, m), node (i, j), where m is determined by the bit flipping shown earlier, and you see a "butterfly."
38. Butterfly Network Routing
Send a message from processor 2 to processor 5. Algorithm: a 0 bit means ship left; a 1 bit means ship right.
1) 5 = 101. Pluck off the leftmost bit (1) and send (01, msg) to the right.
2) Pluck off the leftmost bit (0) and send (1, msg) to the left.
3) Pluck off the leftmost bit (1) and send msg to the right.
Each cross edge followed changes the address by 1 bit.
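The destination-tag routing just described can be sketched in a few lines. This is an illustration of the rule on this slide; the function name is mine.

```python
# Butterfly destination-tag routing: consume the bits of the
# destination address from most to least significant; a 0 bit means
# take the left (straight) edge, a 1 bit means take the right (cross)
# edge at each rank.
def butterfly_route(dest, d):
    bits = format(dest, f'0{d}b')   # d-bit binary string, MSB first
    return ['right' if b == '1' else 'left' for b in bits]

# Slide example: destination processor 5 = 101 in a 2^3 butterfly.
print(butterfly_route(5, 3))  # ['right', 'left', 'right']
```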
39. Evaluating the Butterfly Network with n Processors
- Diameter: log n
- Bisection width: n/2 (likely an error; cf. Leighton's value below)
- Degree: 4 (even for d > 3)
- Constant edge length? No; edge length grows exponentially as rank size increases.
On pg 442, Leighton gives Θ(n / log n) as the bisection width: simply remove the cross edges between two successive ranks to create the bisection cut.
40. Hypercube (also called binary n-cube)
A hypercube with n = 2^d processors and switches, for d = 4.
41. Hypercube (or Binary n-cube) with n = 2^d Processors
- Direct topology
- A 2 × 2 × ... × 2 mesh
- Number of nodes is a power of 2
- Node addresses: 0, 1, ..., n − 1
- Node i is connected to the d nodes whose addresses differ from i in exactly one bit position.
- Example: node 0111 is connected to 1111, 0011, 0101, and 0110.
42. Growing a Hypercube
Note: For d = 4, it is called a 4-D hypercube, or just a 4-cube.
43. Evaluating the Hypercube Network with n = 2^d Nodes
- Diameter: d = log n
- Bisection width: n/2
- Edges per node: log n
- Constant edge length? No; the length of the longest edge increases as n increases.
44. Routing on the Hypercube Network
- Example: Send a message from node 2 = 0010 to node 5 = 0101.
- The nodes differ in 3 bits, so the shortest path will be of length 3.
- One path is 0010 → 0110 → 0100 → 0101, obtained by flipping one of the differing bits at each step.
- As with the butterfly network, bit flipping helps you route on this network.
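The bit-correction idea can be sketched as follows. This illustration flips differing bits from lowest to highest order, so it produces one of several equally short paths (not necessarily the one shown on the slide); the function name is mine.

```python
# Hypercube routing by bit correction: repeatedly flip a bit in which
# the current node's address differs from the destination's.
def hypercube_route(src, dst, d):
    path, cur = [src], src
    for bit in range(d):
        if (cur ^ dst) & (1 << bit):   # this bit still differs
            cur ^= (1 << bit)          # flip it, moving one hop
            path.append(cur)
    return path

# Slide example: node 2 = 0010 to node 5 = 0101 differ in 3 bits,
# so every shortest path has exactly 3 hops.
print([format(x, '04b') for x in hypercube_route(2, 5, 4)])
```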
45. A Perfect Shuffle
- A permutation produced as follows is called a perfect shuffle:
- Given a power-of-2 number of cards, numbered 0, 1, 2, ..., 2^d − 1, write each card number with d bits. Left-rotating the bits (with wraparound) gives the position of the card after the perfect shuffle.
- Example: For d = 3, card 5 = 101. Left rotating and wrapping gives 011, so card 5 goes to position 3. Note that card 0 = 000 and card 7 = 111 stay in position.
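The left rotation above is a one-liner in code. This sketch implements exactly the rule described on this slide; the function name is mine.

```python
# Perfect shuffle as a left rotation of the d-bit card number.
def perfect_shuffle(card, d):
    msb = (card >> (d - 1)) & 1                  # bit that wraps around
    return ((card << 1) & ((1 << d) - 1)) | msb  # shift left, wrap MSB

# Slide example: d = 3, card 5 = 101 -> 011 = 3; cards 0 and 7 stay put.
print([perfect_shuffle(c, 3) for c in range(8)])
```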
46. Shuffle-exchange Network with n = 2^d Processors
(Figure: nodes numbered 0 through 7.)
- Direct topology
- Number of nodes is a power of 2
- Nodes have addresses 0, 1, ..., 2^d − 1
- Two outgoing links from node i:
  - Shuffle link to node LeftCycle(i)
  - Exchange link between node i and node i + 1 when i is even
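The two link types can be sketched by building the adjacency structure explicitly. This illustration follows the rules on this slide (LeftCycle is the left bit rotation from the perfect-shuffle slide); the function names are mine.

```python
# Build the outgoing links of a shuffle-exchange network with 2^d nodes.
def left_cycle(i, d):
    # Left rotation of the d-bit address (the perfect shuffle).
    return ((i << 1) & ((1 << d) - 1)) | (i >> (d - 1))

def shuffle_exchange_links(d):
    n = 1 << d
    links = {}
    for i in range(n):
        out = [('shuffle', left_cycle(i, d))]
        # Exchange links pair node i with node i + 1 when i is even.
        out.append(('exchange', i + 1 if i % 2 == 0 else i - 1))
        links[i] = out
    return links

print(shuffle_exchange_links(3)[5])  # node 5 = 101 shuffles to 011 = 3
```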
47. Shuffle-exchange Addressing: 16 Processors
No arrows on a line segment means it is bidirectional; otherwise, you must follow the arrows. Devising a routing algorithm for this network is interesting and may be a homework problem.
48. Evaluating the Shuffle-exchange
- Diameter: 2 log n − 1
- Edges per node: 3
- Constant edge length? No
- Bisection width: Θ(n / log n)
  - Between n / (2 log n) and 2n / log n
  - See Leighton, pg 480
49. Two Problems with the Shuffle-Exchange
- The shuffle-exchange does not expand well:
  - A large shuffle-exchange network does not decompose well into smaller separate shuffle-exchange networks.
- In a large shuffle-exchange network, a small percentage of nodes will be hot spots:
  - They will encounter much heavier traffic.
- The above results are in the dissertation of one of Batcher's students.
50. Comparing Networks (see Table 2.1)
- All have logarithmic diameter except the 2-D mesh.
- The 4-ary hypertree, butterfly, and hypercube have bisection width n/2 (likely untrue for the butterfly).
- All have a constant number of edges per node except the hypercube.
- Only the 2-D mesh, linear, and ring topologies keep edge lengths constant as network size increases.
- The shuffle-exchange is a good compromise: a fixed number of edges per node, low diameter, and good bisection width.
- However, the negative results on the preceding slide also need to be considered.
51. Alternate Names for SIMDs
- Recall that all active processors of a true SIMD computer must simultaneously access the same memory location.
- The value in the i-th processor can be viewed as the i-th component of a vector.
- SIMD machines are sometimes called vector computers [Jordan et al.] or processor arrays [Quinn 94, 04] based on their ability to execute vector and matrix operations efficiently.
52SIMD Computers
- SIMD computers that focus on vector operations
- Support some vector and possibly matrix
operations in hardware - Usually limit or provide less support for
non-vector type operations involving data in the
vector components. - General purpose SIMD computers
- May also provide some vector and possibly matrix
operations in hardware. - More support for traditional type operations
(e.g., other than for vector/matrix data types).
53. Pipelined Architectures
- Pipelined architectures are sometimes considered to be SIMD architectures.
  - See pg 37 of the textbook and pgs 8-9 of Jordan et al.
- Vector components are entered successively into the first processor in the pipeline.
- The i-th processor of the pipeline receives the output from the (i−1)-th processor.
- Normal operations in each processor are much larger (coarser) in pipelined computers than in true SIMDs.
- Pipelined computers are somewhat SIMD in nature in that synchronization is not required.
54Why Processor Arrays?
- Historically, high cost of control units
- Scientific applications have data parallelism
55Data/instruction Storage
- Front end computer
- Also called the control unit
- Holds and runs program
- Data manipulated sequentially
- Processor array
- Data manipulated in parallel
56Processor Array Performance
- Performance work done per time unit
- Performance of processor array
- Speed of processing elements
- Utilization of processing elements
57. Performance Example 1
- 1024 processors
- Each adds a pair of integers in 1 μsec (1 microsecond, i.e., one-millionth of a second, or 10⁻⁶ second).
- What is the performance when adding two 1024-element vectors (one element pair per processor)?
58. Performance Example 2
- 512 processors
- Each adds two integers in 1 μsec.
- What is the performance when adding two vectors of length 600?
- Since 600 > 512, 88 processors must add two pairs of integers.
- The other 424 processors add only a single pair of integers.
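The arithmetic of Example 2 can be worked through explicitly. The variable names below are mine; the numbers come from the slide.

```python
# Performance Example 2: 512 processors, each adding one pair of
# integers per microsecond, summing two 600-element vectors.
procs, n, t_add = 512, 600, 1e-6

# 600 - 512 = 88 processors must each add two pairs; the remaining
# 424 add one pair, so the whole job takes two 1-microsecond steps.
pairs_per_proc = -(-n // procs)      # ceiling division: 2 steps
elapsed = pairs_per_proc * t_add     # 2 microseconds total
performance = n / elapsed            # additions completed per second

print(int(performance))  # 300,000,000 additions/sec (3 x 10^8)
```

Note that performance is below the 512 additions/μsec peak because 424 processors sit idle during the second step.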
59. Example of a 2-D Processor Interconnection Network in a Processor Array
Each VLSI chip has 16 processing elements. Each PE can simultaneously send a value to a neighbor. (PE = processing element.)
60. SIMD Execution Style
- The traditional (SIMD, vector, processor array) execution style (Quinn 94, pg 62; Quinn 2004, pgs 37-43):
  - The sequential processor that broadcasts the commands to the rest of the processors is called the front end or control unit (or sometimes the host).
  - The front end is a general-purpose CPU that stores the program and the data that is not manipulated in parallel.
  - The front end normally executes the sequential portions of the program.
- Alternately, all PEs needing computation can execute steps synchronously and avoid the broadcast cost of distributing results.
- Each processing element has a local memory that cannot be directly accessed by the control unit or other processing elements.
61SIMD Execution Style
- Collectively, the individual memories of the
processing elements (PEs) store the (vector) data
that is processed in parallel. - When the front end encounters an instruction
whose operand is a vector, it issues a command to
the PEs to perform the instruction in parallel. - Although the PEs execute in parallel, some units
can be allowed to skip particular instructions.
62. Masking on Processor Arrays
- All the processors work in lockstep except those that are masked out (by setting a mask register).
- The conditional if-then-else is different for processor arrays than its sequential version:
  - Every active processor tests to see if its data meets the negation of the boolean condition.
  - If it does, it sets its mask bit so that it will not participate in the operation initially.
  - Next, the unmasked processors execute the THEN part.
  - Afterwards, the mask bits (for the original set of active processors) are flipped, and the newly unmasked processors perform the ELSE part.
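The masked if-then-else can be simulated over a vector of per-PE values. This is an illustrative sketch of the steps above, not an actual SIMD instruction stream; all names are mine.

```python
# Simulate SIMD masked execution of "if (COND) then A else B" across
# a vector of per-PE data values.
def simd_if_then_else(data, cond, then_op, else_op):
    # Each PE tests the condition; PEs where it fails are masked out.
    mask = [cond(x) for x in data]
    # THEN step: only unmasked (condition-true) PEs apply then_op.
    data = [then_op(x) if m else x for x, m in zip(data, mask)]
    # Flip the mask; the remaining PEs apply else_op in the ELSE step.
    data = [x if m else else_op(x) for x, m in zip(data, mask)]
    return data

# Example: double evens, negate odds, in lockstep across 8 PEs.
print(simd_if_then_else(list(range(8)), lambda x: x % 2 == 0,
                        lambda x: 2 * x, lambda x: -x))
```

Both branch bodies are issued to all PEs in sequence, which is exactly why roughly half the machine idles during each half of the conditional (see Claim 2 later in this chapter).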
63. if (COND) then A else B
64. if (COND) then A else B
65. if (COND) then A else B
66. SIMD Machines
- An early SIMD computer designed for vector and matrix processing was the Illiac IV:
  - Initial development at the University of Illinois, 1965-70.
  - Moved to NASA Ames; completed in 1972 but not fully functional until 1976.
  - See Jordan et al., pg 7, and Wikipedia.
- The MPP, DAP, the Connection Machines CM-1 and CM-2, and MasPar's MP-1 and MP-2 are examples of SIMD computers.
  - See Akl, pg 8-12, and Quinn 94.
- The CRAY-1 and the Cyber-205 use pipelined arithmetic units to support vector operations and are sometimes called pipelined SIMDs.
  - See Jordan et al., pg 7; Quinn 94, pg 61-2; and Quinn 2004, pg 37.
67. SIMD Machines
- Quinn 1994, pg 63-67, discusses the CM-2 Connection Machine (with 64K PEs) and the smaller updated CM-200.
- Professor Batcher was the chief architect for the STARAN and the MPP (Massively Parallel Processor) and an advisor for the ASPRO.
  - The ASPRO is a small second-generation STARAN used by the Navy in its spy planes.
- Professor Batcher is best known architecturally for the MPP, which belongs to the Smithsonian Institution and is currently displayed at a D.C. airport.
68. Today's SIMDs
- SIMD functionality is sometimes embedded in sequential machines.
- Others are being built as part of hybrid architectures.
- Some SIMD and SIMD-like features are included in some multi-core/many-core processing units.
- Some SIMD-like architectures have been built as special-purpose machines, although some of these could qualify as general-purpose.
- Some of this work has been proprietary.
- The fact that a parallel computer is SIMD or SIMD-like is often not advertised by the company building it.
69. A Company that Recently Built an Inexpensive SIMD
- ClearSpeed produced a COTS (commodity off-the-shelf) SIMD board.
- WorldScape has developed some defense and commercial applications for this computer.
- It is not a traditional SIMD, as the hardware doesn't tightly synchronize the execution of instructions.
  - The hardware design supports efficient synchronization.
- This machine is programmed like a SIMD.
- The U.S. Navy observed that these machines process radar an order of magnitude faster than others.
- Quite a bit of information about this machine is posted at www.clearspeed.com
70An Example of a Hybrid SIMD
- Embedded Massively Parallel Accelerators
- Other accelerators Decypher, Biocellerator,
GeneMatcher2, Kestrel, SAMBA, P-NAC, Splash-2,
BioScan - (This and next three slides are due to Prabhakar
R. Gudla (U of Maryland) at a CMSC 838T
Presentation, 4/23/2003.)
71. Hybrid Architecture
- Combines the SIMD and MIMD paradigms within one parallel architecture → a hybrid computer.
72. Architecture of Systola 1024
- Instruction Systolic Array
- 32 × 32 mesh of processing elements
- Wavefront instruction execution
73. SIMDs Embedded in SISDs
- Intel's Pentium 4 included what they call MMX technology to gain a significant performance boost.
- IBM and Motorola incorporated the technology into their G4 PowerPC chip in what they call their Velocity Engine.
- Both MMX technology and the Velocity Engine are the chip manufacturers' names for their proprietary SIMD processors and parallel extensions to their operating code.
- This same approach is used by NVidia and Evans & Sutherland to dramatically accelerate graphics rendering.
74. Special-Purpose SIMDs in the Bioinformatics Arena
- Paracel
  - Acquired by Celera Genomics in 2000.
  - Products include the sequence supercomputer GeneMatcher, which has a high-throughput sequence analysis capability.
  - Supports over a million processors.
  - GeneMatcher was used by Celera in their race with the U.S. government to complete the sequencing of the human genome.
- TimeLogic, Inc.
  - Has DeCypher, a reconfigurable SIMD.
75. Advantages of SIMDs
- Reference: Roosta, pg 10.
- Less hardware than MIMDs, as they have only one control unit.
  - Control units are complex, so having only one saves hardware.
- Less memory needed than MIMD:
  - Only one copy of the instructions needs to be stored.
  - This allows more data to be stored in memory.
- Less startup time in communicating between PEs.
76. Advantages of SIMDs (cont.)
- The single instruction stream and synchronization of PEs make SIMD applications easier to program, understand, and debug.
  - Similar to sequential programming.
- Control flow operations and scalar operations can be executed on the control unit while the PEs are executing other instructions.
- MIMD architectures require explicit synchronization primitives, which create a substantial amount of additional overhead.
77. Advantages of SIMDs (cont.)
- During a communication operation between PEs:
  - PEs send data to a neighboring PE in parallel and in lockstep.
  - There is no need to create a header with routing information, as routing is determined by the program steps.
  - The entire communication operation is executed synchronously.
- SIMDs are deterministic and have much more predictable running times.
  - One can normally compute a tight (worst-case) upper bound on the time for communication operations.
- Less complex hardware in a SIMD, since no message decoder is needed in the PEs.
  - MIMDs need a message decoder in each PE.
78. SIMD Shortcomings (with some rebuttals)
- These claims are from our textbook, i.e., Quinn 2004.
  - Similar statements are found in Grama et al.
- Claim 1: Not all problems are data-parallel.
  - While true, most problems seem to have a data-parallel solution.
  - In Fox et al., the observation was made, in their study of large parallel applications at national labs, that most were data-parallel by nature but often had points where significant branching occurred.
79. SIMD Shortcomings (with some rebuttals)
- Claim 2: Speed drops for conditionally executed branches.
  - MIMD processors can execute multiple branches concurrently; with SIMDs, only one of these branches can be executed at a time.
  - For an if-then-else statement with execution times for the "then" and "else" parts being roughly equal, about half of the SIMD processors are idle during its execution.
  - With additional branching, the average number of inactive processors can become even higher.
  - This reason justifies the study of multiple SIMDs (or MSIMDs).
  - On many applications, any branching is quite shallow.
80. SIMD Shortcomings (with some rebuttals)
- Claim 2 (cont.): Speed drops for conditionally executed code.
  - In Fox et al., the observation was made that for the real applications surveyed, the MAXIMUM number of active branches at any point in time was about 8.
  - The cost of the extremely simple processors used in a SIMD is extremely low.
  - Programmers used to worry about full utilization of memory but stopped after memory costs became insignificant overall.
81. SIMD Shortcomings (with some rebuttals)
- Claim 3: SIMDs don't adapt to multiple users well.
  - This is true to some degree for all parallel computers.
  - If usage of a parallel processor is dedicated to an important problem, it is probably best not to risk compromising its performance by sharing.
  - This reason also justifies the study of multiple SIMDs (or MSIMDs).
  - SIMD architecture has not received the attention that MIMD has received and could greatly benefit from further research.
82. SIMD Shortcomings (with some rebuttals)
- Claim 4: SIMDs do not scale down well to affordable starter systems.
  - This point is arguable, and its truth is likely to vary rapidly over time.
  - ClearSpeed produced a very economical SIMD board that plugs into a PC, with about 48 processors per chip and 2-3 chips per board.
83. SIMD Shortcomings (with some rebuttals)
- Claim 5: SIMDs require customized VLSI for their processors, while the expense of control units (in PCs) has dropped.
  - Reliance on COTS (commodity off-the-shelf) parts has dropped the price of MIMDs.
  - The expense of PCs (with control units) has dropped significantly.
  - However, reliance on COTS has fueled the success of the low-level parallelism provided by clusters and restricted new innovative parallel architecture research for well over a decade.
84. SIMD Shortcomings (with some rebuttals)
- Claim 5 (cont.):
  - There is strong evidence that the period of continual dramatic increases in the speed of PCs and clusters is ending.
  - Continued rapid increases in parallel performance will be necessary in order to solve important problems that are beyond our current capabilities.
  - Additionally, with the appearance of very economical COTS SIMDs, this claim no longer appears to be relevant.
85. Multiprocessors
- Multiprocessor: a multiple-CPU computer with shared memory.
- The same address on two different CPUs refers to the same memory location.
- Avoids three cited criticisms of SIMDs:
  - Can be built from commodity CPUs
  - Naturally supports multiple users
  - Maintains efficiency in conditional code
86Centralized Multiprocessor
87. Centralized Multiprocessor
- A straightforward extension of the uniprocessor: add CPUs to the bus.
- All processors share the same primary memory.
- Memory access time is the same for all CPUs:
  - A uniform memory access (UMA) multiprocessor.
  - Also called a symmetric multiprocessor (SMP).
88. Private and Shared Data
- Private data: items used only by a single processor.
- Shared data: values used by multiple processors.
- In a centralized multiprocessor (i.e., an SMP), processors communicate via shared data values.
89. Problems Associated with Shared Data
- The cache coherence problem:
  - Replicating data across multiple caches reduces contention among processors for shared data values.
  - But how can we ensure that different processors have the same value for the same address?
  - The cache coherence problem arises when an obsolete value is still stored in a processor's cache.
90. Write Invalidate Protocol
- The most common solution to cache coherency.
- Each CPU's cache controller monitors (snoops) the bus and identifies which cache blocks are requested by other CPUs.
- A PE gains exclusive control of a data item before performing a write.
- Before the write occurs, all other copies of the data item cached by other PEs are invalidated.
- When any other CPU tries to read a memory location from an invalidated cache block:
  - A cache miss occurs.
  - It has to retrieve the updated data from memory.
91. Cache-coherence Problem
(Figure: memory holds X = 7; no caches hold X.)
92. Cache-coherence Problem
Reading from memory is not a problem. (Figure: one CPU's cache now holds X = 7.)
93. Cache-coherence Problem
(Figure: a second CPU's cache also holds X = 7.)
94. Cache-coherence Problem
Writing to main memory is a problem. (Figure: one CPU writes X = 2; the other cache still holds the obsolete value 7.)
95. Write Invalidate Protocol
A cache controller snoops the bus to see which cache block is being requested by other processors. (Figure: two caches hold X = 7.)
96. Write Invalidate Protocol
Before a write can occur, all other copies of the data at that address are declared invalid. (Figure: "Intent to write X" is broadcast while both caches hold 7.)
97. Write Invalidate Protocol
(Figure: after the broadcast, only the writing CPU's copy of X remains valid.)
98. Write Invalidate Protocol
When another processor tries to read from this location in its cache, it receives a cache miss and has to refresh its copy from main memory. (Figure: X = 2 in memory and in the writer's cache.)
99Synchronization Required for Shared Data
- Mutual exclusion
- Definition: At most one process can be engaged in
an activity at any time. - Example: Only one processor can write to the
same address in main memory at the same time. - We say the process must mutually exclude all
others while it performs this write. - Barrier synchronization
- Definition: Guarantees that no process will
proceed beyond a designated point (called the
barrier) until every process reaches that point.
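Both forms of synchronization can be demonstrated with Python threads standing in for processors (a sketch of the semantics only, not of multiprocessor hardware):

```python
import threading

counter = 0
lock = threading.Lock()        # mutual exclusion
barrier = threading.Barrier(4) # barrier for 4 "processors"
results = []

def worker():
    global counter
    for _ in range(1000):
        with lock:             # at most one thread updates at a time
            counter += 1
    barrier.wait()             # no thread proceeds until all arrive
    results.append(counter)    # so every thread sees the final count

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)                 # 4000: no increments were lost
```

Without the lock, concurrent increments could be lost; without the barrier, a thread could read `counter` before the others finish.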
100Distributed Multiprocessor
- Distributes primary memory among processors
- Increases aggregate memory bandwidth and lowers
average memory access time - Allows a greater number of processors
- Also called a non-uniform memory access (NUMA)
multiprocessor - Local memory access time is fast
- Non-local memory access time can vary
- The distributed memories form one logical address
space
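The NUMA idea above can be captured in a toy model: all memory modules form one logical address space, but the cost of an access depends on whether the address lives in the local module. The class name and the access times are made up for illustration.

```python
LOCAL_NS, REMOTE_NS = 100, 400   # illustrative access times only (ns)

class NumaNode:
    def __init__(self, node_id, memory_map):
        self.node_id = node_id
        self.memory_map = memory_map  # address -> owning node id

    def access_time(self, address):
        # Same logical address space everywhere; only the cost differs.
        owner = self.memory_map[address]
        return LOCAL_NS if owner == self.node_id else REMOTE_NS

# Interleave 8 addresses across 2 nodes: even addresses on node 0.
memory_map = {addr: addr % 2 for addr in range(8)}
node0 = NumaNode(0, memory_map)
print(node0.access_time(4), node0.access_time(5))  # local vs remote
```

This is why data placement matters on a NUMA machine: the same load instruction is several times slower when the address resolves to a remote module.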
101Distributed Multiprocessors
102Cache Coherence
- Some NUMA multiprocessors do not support cache
coherence in hardware - Only instructions and private data are stored in
cache - Policy creates a large memory access time
variance - Implementation more difficult
- No shared memory bus to snoop
- Directory-based protocol needed
103Directory-based Protocol
- Distributed directory contains information about
cacheable memory blocks - One directory entry for each cache block
- Each entry has
- Sharing status
- Which processors have copies
104Sharing Status
- Uncached (denoted by U)
- Block not in any processor's cache
- Shared (denoted by S)
- Cached by one or more processors
- Read only
- Exclusive (denoted by E)
- Cached by exactly one processor
- Processor has written to block
- Copy in memory is obsolete
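A directory entry with the three sharing states and the per-CPU bit vector can be sketched as follows. The class and method names are our own, and real protocols handle transitions (such as forcing a write-back when an Exclusive block is read) that this sketch omits.

```python
U, S, E = "U", "S", "E"   # the three sharing states from the slide

class DirectoryEntry:
    def __init__(self, n_cpus):
        self.state = U
        self.sharers = [0] * n_cpus  # bit vector: which CPUs hold a copy

    def read(self, cpu):
        # A read leaves the block Shared and records the reader.
        # (If another CPU held it Exclusive, a real protocol would
        # first force a write-back; omitted here.)
        self.state = S
        self.sharers[cpu] = 1

    def write(self, cpu):
        # A write invalidates all other copies; the writer becomes
        # the exclusive owner, and memory's copy is now obsolete.
        self.sharers = [1 if i == cpu else 0
                        for i in range(len(self.sharers))]
        self.state = E

entry = DirectoryEntry(3)
entry.read(0)    # -> S 1 0 0
entry.read(2)    # -> S 1 0 1
entry.write(0)   # -> E 1 0 0
print(entry.state, entry.sharers)
```

The three calls reproduce the state sequence shown in the animation that follows (steps 2-11).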
105Directory-based Protocol step 1
[Figure: CPUs with caches, memories, and directories joined by an interconnection network; each directory entry shows the sharing status and a bit vector of caching CPUs.]
106X has value 7 step 2
[Figure: directory entry for X is U 0 0 0; memory holds 7.]
107CPU 0 Reads X step 3
[Figure: CPU 0 sends a read request; directory entry still U 0 0 0.]
108CPU 0 Reads X step 4
[Figure: directory entry becomes S 1 0 0; memory still holds 7.]
109CPU 0 Reads X step 5
[Figure: CPU 0's cache now holds X = 7; directory entry S 1 0 0.]
110CPU 2 Reads X step 6
[Figure: CPU 2 sends a read request; directory entry still S 1 0 0.]
111CPU 2 Reads X step 7
[Figure: directory entry becomes S 1 0 1.]
112CPU 2 Reads X step 8
[Figure: CPU 2's cache now also holds X = 7; directory entry S 1 0 1.]
113CPU 0 Writes 6 to X step 9
[Figure: CPU 0 sends a write miss message; directory entry S 1 0 1.]
114CPU 0 Writes 6 to X step 10
[Figure: the directory sends an invalidate message to CPU 2; entry still S 1 0 1.]
115CPU 0 Writes 6 to X step 11
[Figure: directory entry becomes E 1 0 0; CPU 0's cache holds X = 6.]
116CPU 1 Reads X step 12
[Figure: CPU 1 sends a read miss message; directory entry E 1 0 0.]
117CPU 1 Reads X step 13
[Figure: the directory tells CPU 0 to switch X to shared; entry E 1 0 0.]
118CPU 1 Reads X step 14
[Figure: CPU 0 writes the current value of X back; entry E 1 0 0.]
119CPU 1 Reads X step 15
[Figure: directory entry becomes S 1 1 0; CPU 1's cache holds the value.]
120CPU 2 Writes 5 to X step 16
[Figure: CPU 2 sends a write miss message; directory entry S 1 1 0.]
121CPU 2 Writes 5 to X step 17
[Figure: the directory sends invalidate messages to CPUs 0 and 1; entry S 1 1 0.]
122CPU 2 Writes 5 to X step 18
[Figure: directory entry becomes E 0 0 1; CPU 2's cache holds X = 5.]
123CPU 0 Writes 4 to X step 19
[Figure: CPU 0 sends a write miss message; directory entry E 0 0 1.]
124CPU 0 Writes 4 to X step 20
[Figure: the directory sends a take-away message to CPU 2, the current owner.]
125CPU 0 Writes 4 to X step 21
[Figure: CPU 2 writes the block back and gives up ownership.]
126CPU 0 Writes 4 to X step 22
[Figure: directory entry becomes E 1 0 0.]
127CPU 0 Writes 4 to X step 23
[Figure: CPU 0 creates cache block storage for X.]
128CPU 0 Writes 4 to X step 24
[Figure: CPU 0's cache holds X = 4; directory entry E 1 0 0.]
129CPU 0 Writes Back X Block step 25
[Figure: CPU 0 performs a data write back to memory; entry E 1 0 0.]
130CPU 0 flushes cache block X step 26
[Figure: directory entry returns to U 0 0 0; memory holds the current value.]
131Characteristics of Multiprocessors
- Interprocessor communication is done in the
memory interface by read and write instructions - Memory may be physically distributed, and reads
and writes from different processors may take
different amounts of time. - Congestion and hot spots in the interconnection
network may occur. - Memory latency (i.e., the time to complete a read or
write) may be long and variable. - Most messages through the bus or interconnection
network are the size of single memory words. - Randomization of requests may be used to reduce
the probability of collisions.
132Multicomputers
- Distributed memory multiple-CPU computer
- Same address on different processors refers to
different physical memory locations - Processors interact through message passing
133Typically, Two Flavors of Multicomputers
- Commercial multicomputers
- Custom switch network
- Low latency (the time it takes to send a
message) - High bandwidth (data path width) between
processors - Commodity clusters
- Mass produced computers, switches and other
equipment - Use low cost components
- Message latency is higher
- Communications bandwidth is lower
134Multicomputer Communication
- Processors are connected by an interconnection
network - Each processor has a local memory and can only
access its own local memory - Data is passed between processors using messages,
as required by the program - Data movement across the network is also
asynchronous - A common approach is to use MPI to handle
message passing
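The send/receive style described above can be imitated with queues standing in for the interconnection network. This is a sketch of the programming model only: each "processor" (a thread here) touches only its private data and exchanges values through explicit messages, as real code would do with MPI_Send/MPI_Recv.

```python
import threading, queue

inbox = [queue.Queue(), queue.Queue()]  # one mailbox per "processor"

def send(dest, data):
    inbox[dest].put(data)

def recv(me):
    return inbox[me].get()              # blocks until a message arrives

result = {}

def proc0():
    local = list(range(10))             # private data, invisible to proc 1
    send(1, sum(local[:5]))             # communicate an explicit message
    result["total"] = sum(local[5:]) + recv(0)

def proc1():
    partial = recv(1)                   # receive proc 0's partial sum
    send(0, partial)                    # echo it back (trivial protocol)

t0 = threading.Thread(target=proc0)
t1 = threading.Thread(target=proc1)
t0.start(); t1.start(); t0.join(); t1.join()
print(result["total"])                  # 45 = sum(range(10))
```

Note that no address is ever shared: every value that crosses between the two "processors" does so as a copied message, which is exactly the multicomputer model.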
135Multicomputer Communications (cont)
- Multicomputers can be scaled to larger sizes much
more easily than multiprocessors. - The amount of data transmitted between
processors has a huge impact on performance - The distribution of the data among the processors
is a very important factor in performance
efficiency.
136Message-Passing Advantages
- No problem with simultaneous access to data.
- Allows different PCs to operate on the same data
independently. - Allows PCs on a network to be easily upgraded
when faster processors become available.
137Disadvantages of Message-Passing
- Programmers must make explicit message-passing
calls in the code - This is low-level programming and is error prone.
- Data is not shared but copied, which increases
the total data size. - Data Integrity
- Difficulty in maintaining correctness of multiple
copies of data item.
138Some Interconnection Network Terminology (1/2)
- References: Wilkinson et al.; Grama et al.
Also, earlier slides on architecture networks. - A link is the connection between two nodes.
- A switch that enables packets to be routed
through the node to other nodes without
disturbing the processor is assumed. - The link between two nodes can be either
bidirectional or use two directional links. - Can assume either one wire that carries one bit
or parallel wires (one wire for each bit in a
word). - The above choices do not have a major impact on
the concepts presented in this course.
139Network Terminology (2/2)
- The bandwidth is the number of bits that can be
transmitted in unit time (i.e., bits per second). - The network latency is the time required to
transfer a message through the network. - The communication latency is the total time
required to send a message, including software
overhead and interface delay. - The message latency or startup time is the time
required to send a zero-length message. - Includes software hardware overhead, such as
- Choosing a route
- packing and unpacking the message
140Circuit Switching Message Passing
- This technique establishes a path and allows the
entire message to transfer uninterrupted. - Similar to a telephone connection that is held
until the end of the call. - Links used are not available to other messages
until the transfer is complete. - Latency (message transfer time): If the length of
the control packet sent to establish the path is
small wrt (with respect to) the message length,
the latency is essentially the constant L/B, where
L is the message length and B is the bandwidth.
141Store-and-forward Packet Switching
- Message is divided into packets of information
- Each packet includes source and destination
addresses. - Packets can not exceed a fixed, maximum size
(e.g., 1000 byte). - A packet is stored in a node in a buffer until it
can move to the next node. - Different packets typically follow different
routes but are re-assembled at the destination,
as the packets arrive. - Movements of packets is asynchronous.
142Packet Switching (cont)
- At each node, the destination information is
examined and used to select which node to
forward the packet to.
to avoid hot spots and to minimize traffic jams. - Significant latency is created by storing each
packet in each node it reaches. - Latency increases linearly with the length of the
route.
143Virtual Cut-Through Packet Switching
- Used to reduce the latency.
- Allows a packet to pass through a node without
being stored, if the outgoing link is available. - If the complete path is available, a message can
move immediately from source to destination.
144Wormhole Routing
- Alternate to store-and-forward packet routing
- A message is divided into small units called
flits (flow control units). - Flits are 1-2 bytes in size.
- Can be transferred in parallel on links with
multiple wires. - Only the head flit is initially transferred
when the next link becomes available.
145Wormhole Routing (cont)
- As each flit moves forward, the next flit can
move forward. - The entire path must be reserved for a message as
these packets pull each other along (like cars of
a train). - Request/acknowledge bit messages are required to
coordinate these pull-along moves. - See Wilkinson, et. al.
- Latency: If the flit size is very small
compared to the length of the message, then the
latency is essentially the constant L/B, with L
the message length and B the link bandwidth.
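The latency formulas from the circuit-switching, store-and-forward, and wormhole slides can be compared with a back-of-the-envelope model. This is idealized (no contention, uniform links, whole message treated as one unit); the functions and the numbers plugged in are illustrative only.

```python
def circuit_switching(L, B, setup=0.0):
    # After the path is set up, the message streams uninterrupted: ~ L/B.
    return setup + L / B

def store_and_forward(L, B, hops):
    # The message is stored in full at every node along the route,
    # so latency grows linearly with the number of hops.
    return hops * (L / B)

def wormhole(L, B, hops, flit_bits=16):
    # Only the head flit pays the per-hop cost; the body pipelines
    # behind it, so for long messages latency is again ~ L/B.
    return hops * (flit_bits / B) + L / B

L, B = 1_000_000, 1_000_000_000   # 1 Mbit message over 1 Gbit/s links
print(circuit_switching(L, B))    # ~0.001 s
print(store_and_forward(L, B, 10))  # ~0.01 s: 10x worse over 10 hops
print(wormhole(L, B, 10))         # ~0.001 s: per-hop cost is tiny
```

The comparison shows why wormhole routing largely recovers circuit-switching latency while still multiplexing links at packet granularity.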
146Deadlock
- Routing algorithms are needed to find a path between
the nodes. - Adaptive routing algorithms choose different
paths, depending on traffic conditions. - Livelock is a deadlock-like situation in which a
packet keeps moving around the network
without ever reaching its destination. - Deadlock: No packet can be forwarded because all
are blocked by other stored packets waiting to be
forwarded.
147Asymmetric Multicomputers
- Has a front-end that interacts with users and I/O
devices. - Processors in back end are used for computation.
- Programming similar to SIMDs (i.e., processor
arrays) - Common with early multicomputers
- Examples of asymmetrical multicomputers given in
textbook.
148Asymmetrical MC Advantages
- Back-end processors are dedicated to parallel
computations, so performance is easier to
understand, model, and tune - Only a simple back-end operating system is needed,
which is easy for a vendor to create
149Asymmetrical MC Disadvantages
- Front-end computer is a single point of failure
- Single front-end computer limits scalability of
system - Primitive operating system in back-end processors
makes debugging difficult - Every application requires development of both
front-end and back-end program
150Symmetric Multicomputers
- Every computer executes the same operating system
and has identical functionality. - Users may log into any computer to edit or
compile their programs. - Any or all computers may be involved in the
execution of their program. - During execution of programs, every PE executes
the same program. - When only one PE should execute an operation, an
if statement is used to select the PE.
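The "if statement selects the PE" idiom above is the SPMD style used with MPI: every PE runs the identical program, and a rank test singles one out. A minimal sketch (we simulate the ranks with a loop; in real MPI code the rank would come from MPI_Comm_rank, and the partial sums would be combined with MPI_Reduce):

```python
def program(rank, n_pes, data):
    # Every PE executes this same function on its own slice of the data.
    chunk = data[rank::n_pes]            # this PE's share of the work
    partial = sum(chunk)
    if rank == 0:
        # Only one PE should perform this operation; the if statement
        # selects it, exactly as described on the slide.
        return ("PE 0 reports partial", partial)
    return ("PE %d partial" % rank, partial)

# Simulate 4 PEs all running the identical program:
outputs = [program(r, 4, list(range(8))) for r in range(4)]
print(outputs)
```

Each simulated PE computes a different partial sum from the same program text; only rank 0 takes the special branch.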
151Symmetric Multicomputers
152Symmetrical MC Advantages
- Alleviate performance bottleneck caused by single
front-end computer - Better support for debugging
- Every processor executes same program
153Symmetrical MC Disadvantages
- More difficult to maintain illusion of single
parallel computer - No simple way to balance program development
workload among processors - More difficult to achieve high performance when
multiple processes run on each processor - Details on next slide
154Symmetric MC Disadvantages (cont)
- (cont.) More difficult to achieve high
performance when multiple processes run on each
processor - Processes on same processor compete for same
resources - CPU Cycles
- Cache space
- Memory bandwidth
- Increased cache misses
- Cache is PE oriented instead of process
oriented
155Best Model for Commodity Cluster
- Full-fledged operating system (e.g., Linux)
desirable - Feature of symmetric multicomputers
- Desirable to increase cache hits
- Favors having only a single user process on each
PE - Favors most nodes being off-limits for program
development - Need fast network
- Keep program-development users off the network and
have them access the front-end by another path. - Reserve the interconnection network for use by
parallel processes - Overall, a mixed model may be best for commodity
clusters
156Ideal Commodity Cluster Features
- Co-located computers
- Computers dedicated to running a single process
at a time to lower cache misses - Accessible only from the network
- No keyboards or displays
- Users access front-end by another route.
- Identical operating systems
- Identical local disk images
- Administered as an entity
157ParPar Cluster, A Mixed Model
- Mixed model
- Incorporates both asymmetrical and symmetrical
designs.
158Network of Workstations
- Dispersed computers
- Typically located on users' desks
- First priority: response time for the person at the
keyboard - Parallel jobs wait in the background and run when
spare CPU cycles are available. - Different operating systems
- Different local images
- Checkpointing and restarting important
- Typically connected by Ethernet
- Too slow for commodity cluster usage
159A Commodity Cluster vs Network of Workstations
- A commodity cluster contains components of local
area networks - Commodity computers
- Switches
- A network of workstations is a dispersed
collection of computers - Distributed heterogeneous computers
- Located on primary users' desks
- Unused cycles available for parallel use
- Example: the SETI project
160Summary
- Commercial parallel computers appeared in the 1980s
- Multiple-CPU computers now dominate
- Small-scale: centralized multiprocessors
- Large-scale: distributed-memory architectures
- Multiprocessors
- Multicomputers