Parallel Architectures - PowerPoint PPT Presentation

About This Presentation
Title:

Parallel Architectures

Description:

Chapter 2 Parallel Architectures * CPU 0 flushes cache block X step 26 Interconnection Network CPU 0 CPU 1 CPU 2 4 X Caches Memories Directories X U 0 0 0 ... – PowerPoint PPT presentation

Number of Views:120
Avg rating:3.0/5.0
Slides: 161
Provided by: Ober55
Learn more at: https://www.cs.kent.edu
Category:

less

Transcript and Presenter's Notes

Title: Parallel Architectures


1
Chapter 2
  • Parallel Architectures

2
Outline
  • Some chapter references
  • Brief review of complexity
  • Terminology for comparisons
  • Interconnection networks
  • Processor arrays
  • Multiprocessors
  • Multicomputers
  • Flynns Taxonomy moved to Chpt 1

3
Some Chapter References
  • Selim Akl, The Design and Analysis of Parallel
    Algorithms, Prentice Hall, 1989 (earlier
    textbook).
  • G. C. Fox, What Have We Learnt from Using Real
    Parallel Machines to Solve Real Problems?
    Technical Report C3P-522, Cal Tech, December
    1989. (Included in part in more recent books
    co-authored by Fox.)
  • A. Grama, A. Gupta, G. Karypis, V. Kumar,
    Introduction to Parallel Computing, Second
    Edition, 2003 (first edition 1994), Addison
    Wesley.
  • Harry Jordan, Gita Alaghband, Fundamentals of
    Parallel Processing Algorithms, Architectures,
    Languages, Prentice Hall, 2003, Ch 1, 3-5.
  • F. Thomson Leighton Introduction to Parallel
    Algorithms and Architectures Arrays, Trees,
    Hypercubes 1992 Morgan Kaufmann Publishers.

4
References - continued
  • Gregory Pfsiter, In Search of Clusters The
    ongoing Battle in Lowly Parallelism, 2nd Edition,
    Ch 2. (Discusses details of some serious problems
    that MIMDs incur).
  • Michael Quinn, Parallel Programming in C with MPI
    and OpenMP, McGraw Hill,2004 (Current Textbook),
    Chapter 2.
  • Michael Quinn, Parallel Computing Theory and
    Practice, McGraw Hill, 1994, Ch. 1,2
  • Sayed H. Roosta, Parallel Processing Parallel
    Algorithms Theory and Computation, Springer
    Verlag, 2000, Chpt 1.
  • Wilkinson Allen, Parallel Programming
    Techniques and Applications, Prentice Hall, 2nd
    Edition, 2005, Ch 1-2.

5
Brief Review Complexity Concepts Needed for
Comparisons
  • Whenever we define a counting function, we
    usually characterize the growth rate of that
    function in terms of complexity classes.
  • Technical Definition We say a function f(n) is
    in O(g(n)), if (and only if) there are positive
    constants c and n0 such that
  • 0 f(n) ? cg(n) for n ? n0
  • O(n) is read as big-oh of n.
  • This notation can be used to separate counting
    functions into complexity classes that
    characterize the size of the count.
  • We can use it for any kind of counting functions
    such as timings, bisection widths, etc.

6
Big-Oh and Asymptotic Growth Rate
  • The big-Oh notation gives an upper bound on the
    (asymptotic) growth rate of a function
  • The statement f(n) is O(g(n)) means that the
    growth rate of f(n) is not greater than the
    growth rate of g(n)
  • We can use the big-Oh notation to rank functions
    according to their growth rate

Assume f(n) is O(g(n)) g(n) is O(f(n))
g(n) grows faster Yes No
f(n) grows faster No Yes
Same growth Yes Yes
7
Relatives of Big-Oh
  • big-Omega
  • f(n) is ?(g(n)) if there is a constant c gt 0
  • and an integer constant n0 ? 1 such that
  • f(n) ? cg(n) for n ? n0
  • Intuitively, this says up to a constant factor,
    f(n) asymptotically is greater than or equal to
    g(n)
  • big-Theta
  • f(n) is ?(g(n)) if there are constants c gt 0 and
    c gt 0 and an integer constant n0 ? 1 such that
    0 cg(n) ? f(n) ? cg(n) for n ? n0
  • Intuitively, this says up to a constant factor,
    f(n) and g(n) are asymptotically the same.
  • Note These concepts are covered in algorithm
    courses

8
Relatives of Big-Oh
  • little-oh
  • f(n) is o(g(n)) if, for any constant c gt 0, there
    is an integer constant n0 ? 0 such that 0 ? f(n)
    lt cg(n) for n ? n0
  • Intuitively, this says f(n) is, up to a constant,
    asymptotically strictly less than g(n), so f(n) ?
    ?(g(n)).
  • little-omega
  • f(n) is ?(g(n)) if, for any constant c gt 0, there
    is an integer constant n0 ? 0 such that f(n) gt
    cg(n) 0 for n ? n0
  • Intuitively, this says f(n) is, up to a constant,
    asymptotically strictly greater than g(n), so
    f(n) ? ?(g(n)).
  • These are not used as much as the earlier
    definitions, but they round out the picture.

9
Summary for Intuition for Asymptotic Notation
  • big-Oh
  • f(n) is O(g(n)) if f(n) is asymptotically less
    than or equal to g(n)
  • big-Omega
  • f(n) is ?(g(n)) if f(n) is asymptotically greater
    than or equal to g(n)
  • big-Theta
  • f(n) is ?(g(n)) if f(n) is asymptotically equal
    to g(n)
  • little-oh
  • f(n) is o(g(n)) if f(n) is asymptotically
    strictly less than g(n)
  • little-omega
  • f(n) is ?(g(n)) if is asymptotically strictly
    greater than g(n)

10
A CALCULUS DEFINITION OF O, ?(often easier to
use)
Definition Let f and g be functions defined on
the positive integers with nonnegative values. We
say g is in O(f) if and only if lim
g(n)/f(n) c n -gt ? for some nonnegative real
number c--- i.e. the limit exists and is not
infinite. Definition We say f is in ?(g) if and
only if f is in O(g) and g is in O(f) Note
Often use L'Hopital's Rule to calculate the
limits you need.
11
Why Asymptotic Behavior is Important
  • 1) Allows us to compare counts on large sets.
  • 2) Helps us understand the maximum size of input
    that can be handled in a given time, provided we
    know the environment in which we are running.
  • 3) Stresses the fact that even dramatic speedups
    in hardware can not overcome the handicap of an
    asymptotically slow algorithm.

12
Recall ORDER WINS OUT(Example from Baases
Algorithms Text)
The TRS-80 Main language support BASIC -
typically a slow running interpreted language For
more details on TRS-80 see http//mate.kjsl.com/t
rs80/
The CRAY-YMP Language used in example FORTRAN- a
fast running language For more details on
CRAY-YMP see
http//ds.dial.pipex.com/town/park/abm64/CrayWWWSt
uff/Cfaqp1.htmlTOC3
13
CRAY YMP TRS-80with FORTRAN
with BASICcomplexity is 3n3
complexity is 19,500,000n
microsecond (abbr µsec) One-millionth of a
second. millisecond (abbr msec) One-thousandth of
a second.
n is 10 100 1000 2500 10000 1000000
3 microsec
200 millisec
2 sec
3 millisec
20 sec
3 sec
50 sec
50 sec
49 min
3.2 min
95 years
5.4 hours
14
Interconnection Networks
  • Uses of interconnection networks
  • Connect processors to shared memory
  • Connect processors to each other
  • Interconnection media types
  • Shared medium
  • Switched medium
  • Different interconnection networks define
    different parallel machines.
  • The interconnection networks properties
    influence the type of algorithm used for various
    machines as it effects how data is routed.

15
Shared versus Switched Media
  1. With shared medium, one message is sent all
    processors listen
  2. With switched medium, multiple messages are
    possible.

16
Shared Medium
  • Allows only message at a time
  • Messages are broadcast
  • Each processor listens to every message
  • Before sending a message, a processor listen
    until medium is unused
  • Collisions require resending of messages
  • Ethernet is an example

17
Switched Medium
  • Supports point-to-point messages between pairs of
    processors
  • Each processor is connected to one switch
  • Advantages over shared media
  • Allows multiple messages to be sent
    simultaneously
  • Allows scaling of the network to accommodate the
    increase in processors

18
Switch Network Topologies
  • View switched network as a graph
  • Vertices processors or switches
  • Edges communication paths
  • Two kinds of topologies
  • Direct
  • Indirect

19
Direct Topology
  • Ratio of switch nodes to processor nodes is 11
  • Every switch node is connected to
  • 1 processor node
  • At least 1 other switch node

Indirect Topology
  • Ratio of switch nodes to processor nodes is
    greater than 11
  • Some switches simply connect to other switches

20
Terminology for Evaluating Switch Topologies
  • We need to evaluate 4 characteristics of a
    network in order to help us understand their
    effectiveness in implementing efficient parallel
    algorithms on a machine with a given network.
  • These are
  • The diameter
  • The bisection width
  • The edges per node
  • The constant edge length
  • Well define these and see how they affect
    algorithm choice.
  • Then we will investigate several different
    topologies and see how these characteristics are
    evaluated.

21
Terminology for Evaluating Switch Topologies
  • Diameter Largest distance between two switch
    nodes.
  • A low diameter is desirable
  • It puts a lower bound on the complexity of
    parallel algorithms which requires communication
    between arbitrary pairs of nodes.

22
Terminology for Evaluating Switch Topologies
  • Bisection width The minimum number of edges
    between switch nodes that must be removed in
    order to divide the network into two halves
    (within 1 node, if the number of processors is
    odd.)
  • High bisection width is desirable.
  • In algorithms requiring large amounts of data
    movement, the size of the data set divided by the
    bisection width puts a lower bound on the
    complexity of an algorithm,
  • Actually proving what the bisection width of a
    network is can be quite difficult.

23
Terminology for Evaluating Switch Topologies
  • Number of edges per node
  • It is best if the maximum number of edges/node is
    a constant independent of network size, as this
    allows the processor organization to scale more
    easily to a larger number of nodes.
  • Degree is the maximum number of edges per node.
  • Constant edge length? (yes/no)
  • Again, for scalability, it is best if the nodes
    and edges can be laid out in 3D space so that the
    maximum edge length is a constant independent of
    network size.

24
Evaluating Switch Topologies
  • Many have been proposed and analyzed. We will
    consider several well known ones
  • 2-D mesh
  • linear network
  • binary tree
  • hypertree
  • butterfly
  • hypercube
  • shuffle-exchange
  • Those in yellow have been used in commercial
    parallel computers.

25
2-D Meshes
Note Circles represent switches and squares
represent processors in all these slides.
26
2-D Mesh Network
  • Direct topology
  • Switches arranged into a 2-D lattice or grid
  • Communication allowed only between neighboring
    switches
  • Torus Variant that includes wraparound
    connections between switches on edge of mesh

27
Evaluating 2-D Meshes(Assumes mesh is a square)
  • n number of processors
  • Diameter
  • ?(n1/2)
  • Places a lower bound on algorithms that require
    processing with arbitrary nodes sharing data.
  • Bisection width
  • ?(n1/2)
  • Places a lower bound on algorithms that require
    distribution of data to all nodes.
  • Max number of edges per switch
  • 4 is the degree
  • Constant edge length?
  • Yes
  • Does this scale well?
  • Yes

28
Linear Network
  • Switches arranged into a 1-D mesh
  • Direct topology
  • Corresponds to a row or column of a 2-D mesh
  • Ring A variant that allows a wraparound
    connection between switches on the end.
  • The linear and ring networks have many
    applications
  • Essentially supports a pipeline in both
    directions
  • Although these networks are very simple, they
    support many optimal algorithms.

29
Evaluating Linear and Ring Networks
  • Diameter
  • Linear n-1 or T(n)
  • Ring ?n/2? or T(n)
  • Bisection width
  • Linear 1 or T(1)
  • Ring 2 or T(1)
  • Degree for switches
  • 2
  • Constant edge length?
  • Yes
  • Does this scale well?
  • Yes

30
Binary Tree Network
  • Indirect topology
  • n 2d processor nodes, 2n-1 switches, where d
    0,1,... is the number of levels

i.e. 23 8 processors on bottom and 2(n) 1
2(8) 1 15 switches
31
Evaluating Binary Tree Network
  • Diameter
  • 2 log n or O(log n).
  • Note- this is small
  • Bisection width
  • 1, the lowest possible number
  • Degree
  • 3
  • Constant edge length?
  • No
  • Does this scale well?
  • No

32
Hypertree Network (of degree 4 and depth 2)
  • Front view 4-ary tree of height 2
  • (b) Side view upside down binary tree of height
    d
  • (c) Complete network

33
Hypertree Network
  • Indirect topology
  • Note- the degree k and the depth d must be
    specified.
  • This gives from the front a k-ary tree of height
    d.
  • From the side, the same network looks like an
    upside down binary tree of height d.
  • Joining the front and side views yields the
    complete network.

34
Evaluating 4-ary Hypertree with Depth d
  • A 4-ary hypertree has n 4d processors
  • General formula for k-ary hypertree is n kd
  • Diameter is 2d 2 log n
  • shares the low diameter of binary tree
  • Bisection width 2d1
  • Note here, 2d1 23 8
  • Large value - much better than binary tree
  • Constant edge length?
  • No
  • Degree 6

35
Butterfly Network
A 23 8 processor butterfly network with 8432
switching nodes
  • Indirect topology
  • n 2d processornodes connectedby n(log n
    1)switching nodes

As complicated as this switching network appears
to be, it is really quite simple as it admits a
very nice routing algorithm! Wrapped Butterfly
When top and bottom ranks are merged into single
rank.
The rows are called ranks.
36
Building the 23 Butterfly Network
  • There are 8 processors.
  • Have 4 ranks (i.e. rows) with 8 switches per
    rank.
  • Connections
  • Node(i,j), for i gt 0, is connected to two nodes
    on rank i-1, namely node(i-1,j) and node(i-1,m),
    where m is the integer found by flipping the ith
    most significant bit in the binary d-bit
    representation of j.
  • For example, suppose i 2 and j 3. Then node
    (2,3) is connected to node (1,3).
  • To get the other connection, 3 0112. So, flip
    2nd significant bit i.e. 0012 and connect
    node(2,3) to node(1,1) --- NOTE There is an
    error on pg 32 on this example.
  • Nodes connected by a cross edge from rank i to
    rank i1 have node numbers that differ only in
    their (i1) bit.

37
Why It Is Called a Butterfly Network
  • Walk cycles such as node(i,j), node(i-1,j),
    node(i,m), node(i-1,m), node(i,j) where m is
    determined by the bit flipping as shown and you
    see a butterfly

38
Butterfly Network Routing
Send message from processor 2 to processor
5. Algorithm 0 means ship left 1 means ship
right. 1) 5 101. Pluck off leftmost bit 1
and send 01msg to right. 2) Pluck off leftmost
bit 0 and send 1msg to left. 3) Pluck off
leftmost bit 1 and send msg to right. Each
cross edge followed changes address by 1 bit.
39
Evaluating the Butterfly Networkwith n Processors
  • Diameter
  • log n
  • Bisection width
  • n / 2 (Likely error 32/216)
  • Degree
  • 4 (even for d gt 3)
  • Constant edge length?
  • No, grows exponentially
  • as rank size increase

On pg 442, Leighton gives ?(n / log(n)) as
the bisection width. Simply remove cross edges
between two successive levels to create bisection
cut.
40
Hypercube (also called binary n-cube)
A hypercube with n 2d processors switches
for d4
41
Hypercube (or Binary n-cube) n 2d Processors
  • Direct topology
  • 2 x 2 x x 2 mesh
  • Number of nodes is a power of 2
  • Node addresses 0, 1, , n-1
  • Node i is connected to k nodes whose addresses
    differ from i in exactly one bit position.
  • Example k 0111 is connected to 1111, 0011,
    0101, and 0110

42
Growing a HypercubeNote For d 4, it is called
a 4-D hypercube or just a 4 cube
43
Evaluating Hypercube Networkwith n 2d nodes
  • Diameter
  • d log n
  • Bisection width
  • n / 2
  • Edges per node
  • log n
  • Constant edge length?
  • No.
  • The length of the longest edge increases as n
    increases.

44
Routing on the Hypercube Network
  • Example Send a message from node 2 0010 to
    node 5 0101
  • The nodes differ in 3 bits so the shortest path
    will be of length 3.
  • One path is
  • 0010 ? 0110 ?
  • 0100 ? 0101
  • obtained by flipping one of the differing bits at
    each step.
  • As with the butterfly network, bit flipping
    helps you route on this network.

45
A Perfect Shuffle
  • A permutation that is produced as follows is
    called a perfect shuffle
  • Given a power of 2 cards, numbered 0, 1, 2, ...,
    2d -1, write the card number with d bits. By left
    rotating the bits with a wrap, we calculate the
    position of the card after the perfect shuffle.
  • Example For d 3, card 5 101. Left rotating
    and wrapping gives us 011. So, card 5 goes to
    position 3. Note that card 0 000 and card 7
    111, stay in position.

46
Shuffle-exchange Network with n 2d Processors
0
1
2
3
4
5
6
7
  • Direct topology
  • Number of nodes is a power of 2
  • Nodes have addresses 0, 1, , 2d-1
  • Two outgoing links from node i
  • Shuffle link to node LeftCycle(i)
  • Exchange link between node i and node i1
  • when i is even

47
Shuffle-exchange Addressing 16 processors
No arrows on line segment means it is
bidirectional. Otherwise, you must follow the
arrows. Devising a routing algorithm for this
network is interesting and may be a homework
problem.
48
Evaluating the Shuffle-exchange
  • Diameter
  • 2log n 1
  • Edges per node
  • 3
  • Constant edge length?
  • No
  • Bisection width
  • ?(n/ log n)
  • Between 2n/log n and n/(2 log n)
  • See Leighton pg 480

49
Two Problems with Shuffle-Exchange
  • Shuffle-Exchange does not expand well
  • A large shuffle-exchange network does not
    decompose well into smaller separate shuffle
    exchange networks.
  • In a large shuffle-exchange network, a small
    percentage of nodes will be hot spots
  • They will encounter much heavier traffic
  • Above results are in dissertation of one of
    Batchers students.

50
Comparing Networks(See Table 2.1)
  • All have logarithmic diameterexcept 2-D mesh
  • 4nary Hypertree, butterfly, and hypercube have
    bisection width n / 2 (Likely untrue on
    butterfly)
  • All have constant edges per node except hypercube
  • Only 2-D mesh, linear, and ring topologies keep
    edge lengths constant as network size increases
  • Shuffle-exchange is a good compromise- fixed
    number of edges per node, low diameter, good
    bisection width.
  • However, negative results on preceding slide also
    need to be considered.

51
Alternate Names for SIMDs
  • Recall that all active processors of a true SIMD
    computer must simultaneously access the same
    memory location.
  • The value in the i-th processor can be viewed as
    the i-th component of a vector.
  • SIMD machines are sometimes called vector
    computers Jordan,et.al. or processor arrays
    Quinn 94,04 based on their ability to execute
    vector and matrix operations efficiently.

52
SIMD Computers
  • SIMD computers that focus on vector operations
  • Support some vector and possibly matrix
    operations in hardware
  • Usually limit or provide less support for
    non-vector type operations involving data in the
    vector components.
  • General purpose SIMD computers
  • May also provide some vector and possibly matrix
    operations in hardware.
  • More support for traditional type operations
    (e.g., other than for vector/matrix data types).

53
Pipelined Architectures
  • Pipelined architectures are sometimes considered
    to be SIMD architectures
  • See pg 37 of Textbook pg 8-9 Jordan et. al.
  • Vector components are entered successively into
    first processor in pipeline.
  • The i-th processor of the pipeline receives the
    output from the (i-1)th processor.
  • Normal operations in each processor are much
    larger (coarser) in pipelined computers than in
    true SIMDs
  • Pipelined is somewhat SIMD in nature in that
    synchronization is not required.

54
Why Processor Arrays?
  • Historically, high cost of control units
  • Scientific applications have data parallelism

55
Data/instruction Storage
  • Front end computer
  • Also called the control unit
  • Holds and runs program
  • Data manipulated sequentially
  • Processor array
  • Data manipulated in parallel

56
Processor Array Performance
  • Performance work done per time unit
  • Performance of processor array
  • Speed of processing elements
  • Utilization of processing elements

57
Performance Example 1
  • 1024 processors
  • Each adds a pair of integers in 1 ?sec (1
    microsecond or one millionth of second or 10-6
    second.)
  • What is the performance when adding two
    1024-element vectors (one per processor)?

58
Performance Example 2
  • 512 processors
  • Each adds two integers in 1 ?sec
  • What is the performance when adding two vectors
    of length 600?
  • Since 600 gt 512, 88 processor must add two pairs
    of integers.
  • The other 424 processors add only a single pair
    of integers.

59
Example of a 2-D Processor Interconnection
Network in a Processor Array
Each VLSI chip has 16 processing elements. Each
PE can simultaneously send a value to a neighbor.
PE processor element
60
SIMD Execution Style
  • The traditional (SIMD, vector, processor array)
    execution style (Quinn 94, pg 62, Quinn 2004,
    pgs 37-43
  • The sequential processor that broadcasts the
    commands to the rest of the processors is called
    the front end or control unit (or sometimes
    host).
  • The front end is a general purpose CPU that
    stores the program and the data that is not
    manipulated in parallel.
  • The front end normally executes the sequential
    portions of the program.
  • Alternately, all PEs needing computation can
    execute steps synchronously and avoid broadcast
    cost to distribute results
  • Each processing element has a local memory that
    cannot be directly accessed by the control unit
    or other processing elements.

61
SIMD Execution Style
  • Collectively, the individual memories of the
    processing elements (PEs) store the (vector) data
    that is processed in parallel.
  • When the front end encounters an instruction
    whose operand is a vector, it issues a command to
    the PEs to perform the instruction in parallel.
  • Although the PEs execute in parallel, some units
    can be allowed to skip particular instructions.

62
Masking on Processor Arrays
  • All the processors work in lockstep except those
    that are masked out (by setting mask register).
  • The conditional if-then-else is different for
    processor arrays than sequential version
  • Every active processor tests to see if its data
    meets the negation of the boolean condition.
  • If it does, it sets its mask bit so those
    processors will not participate in the operation
    initially.
  • Next the unmasked processors, execute the THEN
    part.
  • Afterwards, mask bits (for original set of active
    processors) are flipped and unmasked processors
    perform the the ELSE part.

63
if (COND) then A else B
64
if (COND) then A else B
65
if (COND) then A else B
66
SIMD Machines
  • An early SIMD computer designed for vector and
    matrix processing was the Illiac IV computer
  • Initial development at the University of Illinois
    1965-70
  • Moved to NASA Ames, completed in 1972 but not
    fully functional until 1976.
  • See Jordan et. al., pg 7 and Wikepedia
  • The MPP, DAP, the Connection Machines CM-1 and
    CM-2, and MasPars MP-1 and MP-2 are examples of
    SIMD computers
  • See Akl pg 8-12 and Quinn, 94
  • The CRAY-1 and the Cyber-205 use pipelined
    arithmetic units to support vector operations and
    are sometimes called a pipelined SIMD
  • See Jordan, et al, p7, Quinn 94, pg 61-2, and
    Quinn 2004, pg37).

67
SIMD Machines
  • Quinn 1994, pg 63-67 discusses the CM-2
    Connection Machine (with 64K PEs) and a smaller
    updated CM-200.
  • Professor Batcher was the chief architect for the
    STARAN and the MPP (Massively Parallel Processor)
    and an advisor for the ASPRO
  • ASPRO is a small second generation STARAN used by
    the Navy in the spy planes.
  • Professor Batcher is best known architecturally
    for the MPP, which is at the Smithsonian
    Institute currently displayed at a D.C. airport.

68
Todays SIMDs
  • SIMD functionality is sometimes embedded in
    sequential machines.
  • Others are being build as part of hybrid
    architectures.
  • Some SIMD and SIMD-like features are included in
    some multi/many core processing units
  • Some SIMD-like architectures have been build as
    special purpose machines, although some of these
    could classify as general purpose.
  • Some of this work has been proprietary.
  • The fact that a parallel computer is SIMD or
    SIMD-like is often not advertised by the company
    building them.

69
A Company that Recently Built an Inexpensive SIMD
  • ClearSpeed produced a COTS (commodity off the
    shelf) SIMD Board
  • WorldScape has developed some defense and
    commercial applications for this computer.
  • Not a traditional SIMD as the hardware doesnt
    tightly synchronize the execution of
    instructions.
  • Hardware design supports efficient
    synchronization
  • This machine is programmed like a SIMD.
  • The U.S. Navy observed that their machines
    process radar a magnitude faster than others.
  • Quite a bit of information about this machine is
    posted at www.clearspeed.com

70
An Example of a Hybrid SIMD
  • Embedded Massively Parallel Accelerators
  • Other accelerators Decypher, Biocellerator,
    GeneMatcher2, Kestrel, SAMBA, P-NAC, Splash-2,
    BioScan
  • (This and next three slides are due to Prabhakar
    R. Gudla (U of Maryland) at a CMSC 838T
    Presentation, 4/23/2003.)

71
Hybrid Architecture
  • combines SIMD and MIMD paradigm within a parallel
    architecture ? Hybrid Computer

72
Architecture of Systola 1024
  • Instruction Systolic Array
  • 32 ? 32 mesh of processing elements
  • wavefront instruction execution

73
SIMDs Embedded in SISDs
  • Intel's Pentium 4 included what they call MMX
    technology to gain a significant performance
    boost
  • IBM and Motorola incorporated the technology into
    their G4 PowerPC chip in what they call their
    Velocity Engine.
  • Both MMX technology and the Velocity Engine are
    the chip manufacturer's name for their
    proprietary SIMD processors and parallel
    extensions to their operating code.
  • This same approach is used by NVidia and Evans
    Sutherland to dramatically accelerate graphics
    rendering.

74
Special Purpose SIMDs in the Bioinformatics Arena
  • Parcel
  • Acquired by Celera Genomics in 2000
  • Products include the sequence supercomputer
    GeneMatcher, which has a high throughput sequence
    analysis capability
  • Supports over a million processors
  • GeneMatcher was used by Celera in their race with
    U.S. government to complete the description of
    the human genome sequencing
  • TimeLogic, Inc
  • Has DeCypher, a reconfigurable SIMD

75
Advantages of SIMDs
  • Reference Roosta, pg 10
  • Less hardware than MIMDs as they have only one
    control unit.
  • Control units are complex.
  • Less memory needed than MIMD
  • Only one copy of the instructions need to be
    stored
  • Allows more data to be stored in memory.
  • Less startup time in communicating between PEs.

76
Advantages of SIMDs (cont)
  • Single instruction stream and synchronization of
    PEs make SIMD applications easier to program,
    understand, debug.
  • Similar to sequential programming
  • Control flow operations and scalar operations can
    be executed on the control unit while PEs are
    executing other instructions.
  • MIMD architectures require explicit
    synchronization primitives, which create a
    substantial amount of additional overhead.

77
Advantages of SIMDs (cont)
  • During a communication operation between PEs,
  • PEs send data to a neighboring PE in parallel and
    in lock step
  • No need to create a header with routing
    information as routing is determined by program
    steps.
  • the entire communication operation is executed
    synchronously
  • SIMDs are deterministic have much more
    predictable running time.
  • Can normally compute a tight (worst case) upper
    bound for the time for communications operations.
  • Less complex hardware in SIMD since no message
    decoder is needed in the PEs
  • MIMDs need a message decoder in each PE.

78
SIMD Shortcomings(with some rebuttals)
  • Claims are from our textbook i.e., Quinn 2004.
  • Similar statements are found in Grama, et. al.
  • Claim 1 Not all problems are data-parallel
  • While true, most problems seem to have a data
    parallel solution.
  • In Fox, et.al., the observation was made in
    their study of large parallel applications at
    national labs, that most were data parallel by
    nature, but often had points where significant
    branching occurred.

79
SIMD Shortcomings(with some rebuttals)
  • Claim 2 Speed drops for conditionally executed
    branches
  • MIMDs processors can execute multiple branches
    concurrently.
  • For an if-then-else statement with execution
    times for the then and else parts being
    roughly equal, about ½ of the SIMD processors are
    idle during its execution
  • With additional branching, the average number of
    inactive processors can become even higher.
  • With SIMDs, only one of these branches can be
    executed at a time.
  • This reason justifies the study of multiple SIMDs
    (or MSIMDs).
  • On many applications, any branching is quite
    shallow.

80
SIMD Shortcomings(with some rebuttals)
  • Claim 2 (cont) Speed drops for conditionally
    executed code
  • In Fox, et.al., the observation was made that
    for the real applications surveyed, the MAXIMUM
    number of active branches at any point in time
    was about 8.
  • The cost of the extremely simple processors used
    in a SIMD are extremely low
  • Programmers used to worry about full utilization
    of memory but stopped this after memory cost
    became insignificant overall.

81
SIMD Shortcomings(with some rebuttals)
  • Claim 3 Dont adapt to multiple users well.
  • This is true to some degree for all parallel
    computers.
  • If usage of a parallel processor is dedicated to
    a important problem, it is probably best not to
    risk compromising its performance by sharing
  • This reason also justifies the study of multiple
    SIMDs (or MSIMD).
  • SIMD architecture has not received the attention
    that MIMD has received and can greatly benefit
    from further research.

82
SIMD Shortcomings(with some rebuttals)
  • Claim 4 Do not scale down well to starter
    systems that are affordable.
  • This point is arguable and its truth is likely
    to vary rapidly over time
  • ClearSpeed produced a very economical SIMD board
    that plugs into a PC with about 48 processors
    per chip and 2-3 chips per board.

83
SIMD Shortcomings(with some rebuttals)
  • Claim 5 Requires customized VLSI for processors
    and expense of control units in PCs has dropped.
  • Reliance on COTS (Commodity, off-the-shelf parts)
    has dropped the price of MIMDS
  • Expense of PCs (with control units) has dropped
    significantly
  • However, reliance on COTS has fueled the success
    of low level parallelism provided by clusters
    and restricted new innovative parallel
    architecture research for well over a decade.

84
SIMD Shortcomings(with some rebuttals)
  • Claim 5 (cont.)
  • There is strong evidence that the period of
    continual dramatic increases in speed of PCs and
    clusters is ending.
  • Continued rapid increases in parallel performance
    in the future will be necessary in order to solve
    important problems that are beyond our current
    capabilities
  • Additionally, with the appearance of the very
    economical COTS SIMDs, this claim no longer
    appears to be relevant.

85
Multiprocessors
  • Multiprocessor multiple-CPU computer with a
    shared memory
  • Same address on two different CPUs refers to the
    same memory location
  • Avoids three cited criticisms for SIMDs
  • Can be built from commodity CPUs
  • Naturally support multiple users
  • Maintain efficiency in conditional code

86
Centralized Multiprocessor
87
Centralized Multiprocessor
  • Straightforward extension of uniprocessor
  • Add CPUs to bus
  • All processors share same primary memory
  • Memory access time same for all CPUs
  • Uniform memory access (UMA) multiprocessor
  • Also called a symmetrical multiprocessor (SMP)

88
Private and Shared Data
  • Private data items used only by a single
    processor
  • Shared data values used by multiple processors
  • In a centralized multiprocessor (i.e. SMP),
    processors communicate via shared data values

89
Problems Associated with Shared Data
  • The cache coherence problem
  • Replicating data across multiple caches reduces
    contention among processors for shared data
    values.
  • But - how can we ensure different processors have
    the same value for same address?
  • The cache coherence problem is when an obsolete
    value is still stored in a processors cache.

90
Write Invalidate Protocol
  • Most common solution to cache coherency
  • Each CPUs cache controller monitors (snoops) the
    bus identifies which cache blocks are requested
    by other CPUs.
  • A PE gains exclusive control of data item before
    performing write.
  • Before write occurs, all other copies of data
    item cached by other PEs are invalidated.
  • When any other CPU tries to read a memory
    location from an invalidated cache block,
  • a cache miss occurs
  • It has to retrieve updated data from memory

91
Cache-coherence Problem
Memory
7
X
92
Cache-coherence Problem
Memory
Read from memory is not a problem.
7
X
7
93
Cache-coherence Problem
Memory
7
X
7
7
94
Cache-coherence Problem
Write to main memory is a problem.
Memory
2
X
2
7
95
Write Invalidate Protocol
A cache control monitor snoops the bus to
see which cache block is being requested by
other processors.
7
X
7
7
96
Write Invalidate Protocol
7
X
Intent to write X
7
7
Before a write can occur, all copies of data at
that address are declared invalid.
97
Write Invalidate Protocol
7
X
Intent to write X
7
98
Write Invalidate Protocol
When another processor tries to read from this
location in cache, it receives a cache miss error
and will have to refresh from main memory.
2
X
2
99
Synchronization Required for Shared Data
  • Mutual exclusion
  • Definition At most one process can be engaged in
    an activity at any time.
  • Example Only one processor can write to the
    same address in main memory at the same time.
  • We say that process must mutually exclude all
    others while it performs this write.
  • Barrier synchronization
  • Definition Guarantees that no process will
    proceed beyond a designated point (called the
    barrier) until every process reaches that point.

100
Distributed Multiprocessor
  • Distributes primary memory among processors
  • Increase aggregate memory bandwidth and lower
    average memory access time
  • Allows greater number of processors
  • Also called non-uniform memory access (NUMA)
    multiprocessor
  • Local memory access time is fast
  • Non-local memory access time can vary
  • Distributed memories have one logical address
    space

101
Distributed Multiprocessors
102
Cache Coherence
  • Some NUMA multiprocessors do not support it in
    hardware
  • Only instructions and private data are stored in
    cache
  • Policy creates a large memory access time
    variance
  • Implementation more difficult
  • No shared memory bus to snoop
  • Directory-based protocol needed

103
Directory-based Protocol
  • Distributed directory contains information about
    cacheable memory blocks
  • One directory entry for each cache block
  • Each entry has
  • Sharing status
  • Which processors have copies

104
Sharing Status
  • Uncached -- (denoted by U)
  • Block not in any processors cache
  • Shared (denoted by S)
  • Cached by one or more processors
  • Read only
  • Exclusive (denoted by E)
  • Cached by exactly one processor
  • Processor has written to block
  • Copy in memory is obsolete

105
Directory-based Protocol - step1
106
X has value 7 step 2
Interconnection Network
Bit Vector
X
U 0 0 0
Directories
7
X
Memories
Caches
107
CPU 0 Reads X step 3
Interconnection Network
X
U 0 0 0
Directories
7
X
Memories
Caches
108
CPU 0 Reads X step 4
Interconnection Network
X
S 1 0 0
Directories
7
X
Memories
Caches
109
CPU 0 Reads X step 5
Interconnection Network
X
S 1 0 0
Directories
Memories
Caches
110
CPU 2 Reads X step 6
Interconnection Network
X
S 1 0 0
Directories
Memories
Caches
111
CPU 2 Reads X step 7
Interconnection Network
X
S 1 0 1
Directories
Memories
Caches
112
CPU 2 Reads X step 8
Interconnection Network
X
S 1 0 1
Directories
Memories
Caches
113
CPU 0 Writes 6 to X step 9
Interconnection Network
Write Miss
X
S 1 0 1
Directories
Memories
Caches
114
CPU 0 Writes 6 to X step 10
Interconnection Network
X
S 1 0 1
Directories
Invalidate
Memories
Caches
115
CPU 0 Writes 6 to X step 11
Interconnection Network
X
E 1 0 0
Directories
Memories
Caches
6
X
116
CPU 1 Reads X step 12
Interconnection Network
Read Miss
X
E 1 0 0
Directories
Memories
Caches
117
CPU 1 Reads X step 13
Interconnection Network
Switch to Shared
X
E 1 0 0
Directories
Memories
Caches
118
CPU 1 Reads X step 14
Interconnection Network
X
E 1 0 0
Directories
Memories
Caches
119
CPU 1 Reads X step 15
Interconnection Network
X
S 1 1 0
Directories
Memories
Caches
120
CPU 2 Writes 5 to X step 16
Interconnection Network
X
S 1 1 0
Directories
Memories
Write Miss
Caches
121
CPU 2 Writes 5 to X - step 17
Interconnection Network
Invalidate
X
S 1 1 0
Directories
Memories
Caches
122
CPU 2 Writes 5 to X step 18
Interconnection Network
X
E 0 0 1
Directories
Memories
5
X
Caches
123
CPU 0 Writes 4 to X step 19
Interconnection Network
X
E 0 0 1
Directories
Memories
Caches
124
CPU 0 Writes 4 to X step 20
Interconnection Network
X
E 1 0 0
Directories
Memories
Take Away
Caches
125
CPU 0 Writes 4 to X step 21
Interconnection Network
X
E 0 1 0
Directories
Memories
Caches
126
CPU 0 Writes 4 to X step 22
Interconnection Network
X
E 1 0 0
Directories
Memories
Caches
127
CPU 0 Writes 4 to X step 23
Interconnection Network
X
E 1 0 0
Directories
Creates cache block storage for X
Memories
Caches
128
CPU 0 Writes 4 to X step 24
Interconnection Network
X
E 1 0 0
Directories
Memories
Caches
4
X
129
CPU 0 Writes Back X Block step 25
Interconnection Network
Data Write Back
X
E 1 0 0
Directories
Memories
Caches
130
CPU 0 flushes cache block X step 26
Interconnection Network
X
U 0 0 0
Directories
Memories
Caches
131
Characteristics of Multiprocessors
  • Interprocessor communication is done in the
    memory interface by read and write instructions
  • Memory may be physically distributed and the
    reads and writes from different processors may
    take different time.
  • Congestion and hotspots in the interconnection
    network may occur.
  • Memory latency (i.e., time to complete a read or
    write) may be long and variable.
  • Most messages through the bus or interconnection
    network are the size of single memory words.
  • Randomization of requests may be used to reduce
    the probability of collisions.

132
Multicomputers
  • Distributed memory multiple-CPU computer
  • Same address on different processors refers to
    different physical memory locations
  • Processors interact through message passing

133
Typically, Two Flavors of Multicomputers
  • Commercial multicomputers
  • Custom switch network
  • Low latency (the time it takes to send a
    message).
  • High bandwidth (data path width) across
    processors
  • Commodity clusters
  • Mass produced computers, switches and other
    equipment
  • Use low cost components
  • Message latency is higher
  • Communications bandwidth is lower

134
Multicomputer Communication
  • Processors are connected by an interconnection
    network
  • Each processor has a local memory and can only
    access its own local memory
  • Data is passed between processors using messages,
    as required by the program
  • Data movement across the network is also
    asynchronous
  • A common approach is to use MPI to handling
    message passing

135
Multicomputer Communications (cont)
  • Multicomputers can be scaled to larger sizes much
    easier than multiprocessors.
  • The amount of data transmissions between
    processors have a huge impact on the performance
  • The distribution of the data among the processors
    is a very important factor in the performance
    efficiency.

136
Message-Passing Advantages
  • No problem with simultaneous access to data.
  • Allows different PCs to operate on the same data
    independently.
  • Allows PCs on a network to be easily upgraded
    when faster processors become available.

137
Disadvantages of Message-Passing
  • Programmers must make explicit message-passing
    calls in the code
  • This is low-level programming and is error prone.
  • Data is not shared but copied, which increases
    the total data size.
  • Data Integrity
  • Difficulty in maintaining correctness of multiple
    copies of data item.

138
Some Interconnection Network Terminology (1/2)
  • References Wilkinson, et. al. Grama, et. al.
    Also, earlier slides on architecture networks.
  • A link is the connection between two nodes.
  • A switch that enables packets to be routed
    through the node to other nodes without
    disturbing the processor is assumed.
  • The link between two nodes can be either
    bidirectional or use two directional links .
  • Can assume either one wire that carries one bit
    or parallel wires (one wire for each bit in
    word).
  • The above choices do not have a major impact on
    the concepts presented in this course.

139
Network Terminology (2/2)
  • The bandwidth is the number of bits that can be
    transmitted in unit time (i.e., bits per second).
  • The network latency is the time required to
    transfer a message through the network.
  • The communication latency is the total time
    required to send a message, including software
    overhead and interface delay.
  • The message latency or startup time is the time
    required to send a zero-length message.
  • Includes software hardware overhead, such as
  • Choosing a route
  • packing and unpacking the message

140
Circuit Switching Message Passing
  • Technique establishes a path and allows the
    entire message to transfer uninterrupted.
  • Similar to telephone connection that is held
    until the end of the call.
  • Links used are not available to other messages
    until the transfer is complete.
  • Latency (message transfer time) If the length of
    control packet sent to establish path is small
    wrt (with respect to) the message length, the
    latency is essentially
  • the constant L/B, where L is message length and B
    is bandwidth.

141
Store-and-forward Packet Switching
  • Message is divided into packets of information
  • Each packet includes source and destination
    addresses.
  • Packets can not exceed a fixed, maximum size
    (e.g., 1000 byte).
  • A packet is stored in a node in a buffer until it
    can move to the next node.
  • Different packets typically follow different
    routes but are re-assembled at the destination,
    as the packets arrive.
  • Movements of packets is asynchronous.

142
Packet Switching (cont)
  • At each node, the designation information is
    looked at and used to select which node to
    forward the packet to.
  • Routing algorithms (often probabilistic) are used
    to avoid hot spots and to minimize traffic jams.
  • Significant latency is created by storing each
    packet in each node it reaches.
  • Latency increases linearly with the length of the
    route.

143
Virtual Cut-Through Package Switching
  • Used to reduce the latency.
  • Allows packet to pass through a node without
    being stored, if the outgoing link is available.
  • If complete path is available, a message can
    immediately move from source to destination..

144
Wormhole Routing
  • Alternate to store-and-forward packet routing
  • A message is divided into small units called
    flits (flow control units).
  • Flits are 1-2 bytes in size.
  • Can be transferred in parallel on links with
    multiple wires.
  • Only head of flit is initially transferred when
    the next link becomes available.

145
Wormhole Routing (cont)
  • As each flit moves forward, the next flit can
    move forward.
  • The entire path must be reserved for a message as
    these packets pull each other along (like cars of
    a train).
  • Request/acknowledge bit messages are required to
    coordinate these pull-along moves.
  • See Wilkinson, et. al.
  • Latency If the head of the flit is very small
    compared to the length of the message, then the
    latency is essentially the constant L/B, with L
    the message length and B the link bandwidth.

146
Deadlock
  • Routing algorithms needed to find a path between
    the nodes.
  • Adaptive routing algorithms choose different
    paths, depending on traffic conditions.
  • Livelock is a deadlock-type situation where a
    packet continues to go around the network,
    without ever reaching its destination.
  • Deadlock No packet can be forwarded because they
    are blocked by other stored packets waiting to be
    forwarded.

147
Asymmetric Multicomputers
  • Has a front-end that interacts with users and I/O
    devices.
  • Processors in back end are used for computation.
  • Programming similar to SIMDs (i.e., processor
    arrays)
  • Common with early multicomputers
  • Examples of asymmetrical multicomputers given in
    textbook.

148
Asymmetrical MC Advantages
  • Back-end processors dedicated to parallel
    computations ? Easier to understand, model, tune
    performance
  • Only a simple back-end operating system needed ?
    Easy for a vendor to create

149
Asymmetrical MC Disadvantages
  • Front-end computer is a single point of failure
  • Single front-end computer limits scalability of
    system
  • Primitive operating system in back-end processors
    makes debugging difficult
  • Every application requires development of both
    front-end and back-end program

150
Symmetric Multicomputers
  • Every computer executes the same operating system
    and has identical functionality.
  • Users may log into any computer to edit or
    compile their programs.
  • Any or all computers may be involved in the
    execution of their program.
  • During execution of programs, every PE executes
    the same program.
  • When only one PE should execute an operation, an
    if statement is used to select the PE.

151
Symmetric Multicomputers
152
Symmetrical MC Advantages
  • Alleviate performance bottleneck caused by single
    front-end computer
  • Better support for debugging
  • Every processor executes same program

153
Symmetrical MC Disadvantages
  • More difficult to maintain illusion of single
    parallel computer
  • No simple way to balance program development
    workload among processors
  • More difficult to achieve high performance when
    multiple processes run on each processor
  • Details on next slide

154
Symmetric MC Disadvantages (cont)
  • (cont.) More difficult to achieve high
    performance when multiple processes run on each
    processor
  • Processes on same processor compete for same
    resources
  • CPU Cycles
  • Cache space
  • Memory bandwidth
  • Increased cache misses
  • Cache is PE oriented instead of process
    oriented

155
Best Model for Commodity Cluster
  • Full-Fledged operating system (e.g., Linux)
    desirable
  • Feature of symmetric multicomputer
  • Desirable to increase cache hits
  • Favors having only a single user process on each
    PE
  • Favors most nodes being off-limits for program
    development
  • Need fast network
  • Keep program development users off networks and
    have them access front-end by another path.
  • Reserve interconnection network to usage by
    parallel processes
  • Overall, a mixed model may be best for commodity
    clusters

156
Ideal Commodity Cluster Features
  • Co-located computers
  • Computers dedicated to running a single process
    at a time to lower cache misses
  • Accessible only from the network
  • No keyboards or displays
  • Users access front-end by another route.
  • Identical operating systems
  • Identical local disk images
  • Administered as an entity

157
ParPar Cluster, A Mixed Model
  • Mixed model
  • Incorporates both asymetrical and symetrical
    designs.

158
Network of Workstations
  • Dispersed computers
  • Typically located on users desks
  • First priority response time for person at the
    keyboard
  • Parallel jobs wait in background and run with
    spare CPU cycles are available.
  • Different operating systems
  • Different local images
  • Checkpointing and restarting important
  • Typically connected by ethernet
  • Too slow for commodity network usage

159
A Commodity Cluster vs Network of Workstations
  • A commodity cluster contains components of local
    area networks
  • Commodity computers
  • Switches
  • A network of workstations is a dispersed
    collection of computers
  • Distributed hetergeneous computers
  • Located on primary users desks
  • Unused cycles available for parallel use
  • Example SETI project

160
Summary
  • Commercial parallel computers appearedin 1980s
  • Multiple-CPU computers now dominate
  • Small-scale Centralized multiprocessors
  • Large-scale Distributed memory architectures
  • Multiprocessors
  • Multicomputers
Write a Comment
User Comments (0)
About PowerShow.com