Parallel Architectures
Transcript and Presenter's Notes
1
Parallel Architectures
2
Design Issues
  • Interconnection Networks
  • Shared or Distributed Memory
  • Customized or General Purpose Processors
  • Uniform or Nonuniform OS

3
Interconnection Networks
  • Uses of interconnection networks
    • Connect processors to shared memory
    • Connect processors to each other
  • Interconnection media types
    • Shared medium
    • Switched medium

4
Shared versus Switched Media
5
Shared Medium
  • Allows only one message at a time
  • Messages are broadcast
  • Each processor listens to every message
  • Collisions require resending of messages
  • Ethernet is an example

6
Switched Medium
  • Supports point-to-point messages between pairs of processors
  • Each processor has its own path to the switch
  • Advantages over shared media
    • Allows multiple messages to be sent simultaneously
    • Allows scaling of the network to accommodate an increase in processors

7
Switch Network Topologies
  • View a switched network as a graph
    • Vertices = processors or switches
    • Edges = communication paths
  • Direct topology
    • Ratio of switch nodes to processor nodes is 1:1
    • Every switch node is connected to 1 processor node and at least 1 other switch node
  • Indirect topology
    • Ratio of switch nodes to processor nodes is greater than 1:1
    • Some switches simply connect other switches

8
Evaluating Switch Topologies
  • Communication
    • Diameter: largest distance between two nodes (lower is better)
  • Connectivity
    • Bisection bandwidth: smallest number of edges that must be cut to divide the network into two roughly equal halves (higher is better)
  • Scalability
    • Edges per switch node and edge length (constant is better)

9
Two Network Topologies
Linear array: Diameter = n − 1; Bisection bandwidth = 1; Edges per switch = 2; Constant edge length? Yes
Fully connected (n-to-n): Diameter = 1; Bisection bandwidth = n²/4; Edges per switch = n − 1; Constant edge length? No
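To make these metrics concrete, here is a small brute-force sketch (an illustration of my own, not from the slides): it builds the two topologies above as adjacency sets, measures the diameter with breadth-first search, and finds the bisection width by trying every balanced split, which is only feasible for tiny n.

```python
from collections import deque
from itertools import combinations

def linear_array(n):
    # Node i is connected to i-1 and i+1 (where they exist).
    return {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}

def fully_connected(n):
    # Every node is connected to every other node.
    return {i: set(range(n)) - {i} for i in range(n)}

def diameter(adj):
    # Largest shortest-path distance between any two nodes (BFS from each node).
    def eccentricity(src):
        dist, queue = {src: 0}, deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return max(dist.values())
    return max(eccentricity(v) for v in adj)

def bisection_width(adj):
    # Smallest number of edges crossing any cut into two equal halves.
    nodes, n = sorted(adj), len(adj)
    best = None
    for half in combinations(nodes, n // 2):
        half = set(half)
        crossing = sum(1 for u in half for v in adj[u] if v not in half)
        best = crossing if best is None else min(best, crossing)
    return best

n = 8
print(diameter(linear_array(n)), bisection_width(linear_array(n)))        # 7 1   (n-1, 1)
print(diameter(fully_connected(n)), bisection_width(fully_connected(n)))  # 1 16  (1, n^2/4)
```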
10
2-D Mesh Topology
  • Switches: n
  • Diameter: 2(√n − 1) = Θ(n^(1/2))
  • Bisection bandwidth: Θ(n^(1/2))
  • Number of edges per switch: 4
  • Constant edge length? Yes
  • Direct topology
  • Switches arranged into a 2-D lattice
  • Communication only between neighboring switches
  • Variants allow wraparound connections

11
Binary Tree Network
Depth: d; Processor nodes: n = 2^d; Switches: 2n − 1; Diameter: 2 log n; Bisection bandwidth: 1; Edges per node: 3; Constant edge length? No
  • Indirect topology
  • Each processor node connected to leaf of binary
    tree
  • Interior switch nodes have at most 3 links

12
Hypertree Network
  • Depth-d k-ary hypertree (k = number of children)
  • Processors: n = k^d
  • Switches: k^0·2^d + k^1·2^(d−1) + … + k^d·2^0
  • Diameter: 2d
  • Bisection bandwidth: k·2^(d−1)
  • Edges / node: k + 3 − 1
  • Constant edge length? No
  • Indirect topology
  • Shares low diameter of binary tree
  • Greatly improves bisection width
  • From front looks like k-ary tree of height d
  • From side looks like upside down binary tree of
    height d

13
Butterfly Network
Processors: n; Switches: n(log n + 1); Diameter: log n; Bisection bandwidth: n/2; Edges per node: 4; Constant edge length? No
  • Indirect topology
  • Node (i, j) is connected to node (i−1, j) and node (i−1, m)
  • m is the integer formed by inverting the i-th most significant bit of j
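As an illustration of the wiring rule above, the sketch below computes the two rank-(i−1) switches that switch (i, j) connects to. The rank/column indexing (ranks 0 through log n, columns 0 through n−1, with the "i-th most significant bit" counted from 1) is my assumption, chosen to match the rule as stated.

```python
def butterfly_up_neighbors(i, j, n):
    """Switch (i, j) connects to (i-1, j) and (i-1, m), where m is j with
    its i-th most significant bit inverted (assumed indexing: i >= 1)."""
    assert i >= 1, "rank 0 has no rank above it"
    bits = n.bit_length() - 1            # log2(n) address bits, n a power of two
    m = j ^ (1 << (bits - i))            # invert the i-th most significant bit of j
    return (i - 1, j), (i - 1, m)

# Example: an 8-column butterfly (n = 8, ranks 0..3).
print(butterfly_up_neighbors(1, 0b011, 8))   # ((0, 3), (0, 7)) -> flips the most significant bit
print(butterfly_up_neighbors(3, 0b011, 8))   # ((2, 3), (2, 2)) -> flips the least significant bit
```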

14
Butterfly Network Routing
15
Hypercube Network
Processors: n = 2^k; Switches: n; Diameter: log n; Bisection bandwidth: n/2; Edges per node: log n; Constant edge length? No
  • Direct topology
  • Number of nodes is a power of 2
  • Node addresses: 0, 1, …, 2^k − 1 (k-dimensional hypercube)
  • Node i connected to the k nodes whose addresses differ from i in exactly one bit position
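The adjacency rule on this slide reduces to flipping one address bit at a time, as in this small sketch (illustrative only):

```python
def hypercube_neighbors(i, k):
    # Node i is connected to the k nodes whose addresses differ from i
    # in exactly one bit position: flip each of the k bits in turn.
    return [i ^ (1 << b) for b in range(k)]

# Example: 3-dimensional hypercube (n = 2^3 = 8 nodes).
print(hypercube_neighbors(0b101, 3))   # [4, 7, 1], i.e. 0b100, 0b111, 0b001
```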

16
Shuffle-Exchange Addressing
Processors: n; Switches: n; Diameter: O(log n); Bisection bandwidth: Θ(n / log n) (n/4 for this graph); Edges per node: 2; Constant edge length? No
(Ref: C. D. Thompson, "Area-Time Complexity for VLSI," Proc. 11th Annual ACM Symposium on Theory of Computing)
  • Direct topology
  • Two outgoing links from node i
    • Shuffle link to node LeftCycle(i)
    • Exchange link between nodes whose numbers differ in their least significant bit
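A small sketch of the two links just listed, for n = 2^k nodes. LeftCycle is taken here to mean a one-bit left rotation of the k-bit address, which is the usual shuffle definition; the function names are mine.

```python
def shuffle_neighbor(i, k):
    # LeftCycle(i): rotate the k-bit address of i left by one bit.
    return ((i << 1) | (i >> (k - 1))) & ((1 << k) - 1)

def exchange_neighbor(i):
    # Exchange link: flip the least significant bit.
    return i ^ 1

k = 3                                    # 8-node shuffle-exchange network
for i in range(1 << k):
    print(i, shuffle_neighbor(i, k), exchange_neighbor(i))
# e.g. node 0b110 -> shuffle neighbor 0b101, exchange neighbor 0b111
```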

17
Comparing Networks
  • All have logarithmic diameter except 2-D mesh
  • Hypertree, butterfly, and hypercube have
    bisection width n / 2
  • All have constant edges per node except hypercube
  • Only 2-D mesh keeps edge lengths constant as
    network size increases

18
Vector Computers
  • Vector computer instruction set includes
    operations on vectors as well as scalars
  • Two ways to implement vector computers
    • Pipelined vector processor: streams data through pipelined arithmetic units
    • Processor array: many identical, synchronized arithmetic processing elements
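As a rough illustration of the vector-instruction idea (my own example, with NumPy standing in for a vector instruction set), the same element-wise addition can be written one element at a time or as a single whole-vector operation:

```python
import numpy as np

a = np.arange(8)
b = np.arange(8, 16)

# Scalar style: one add per element pair.
c_scalar = np.empty_like(a)
for i in range(len(a)):
    c_scalar[i] = a[i] + b[i]

# Vector style: a single operation over the whole vector.
c_vector = a + b

assert (c_scalar == c_vector).all()
```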

19
Processor Array
  • Historically, high cost of a control unit
  • Scientific applications have data parallelism

20
Performance Example 1
  • 1024 processors
  • Each adds a pair of integers in 1 μsec
  • What is performance when adding two 1024-element
    vectors (one per processor)?

21
Performance Example 2
  • 512 processors
  • Each adds two integers in 1 μsec
  • Performance adding two vectors of length 600?

2 μsec, because vector length > number of processors
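The arithmetic behind both examples, assuming one addition per processor per 1-microsecond step and at most one element per processor per step:

```python
import math

def vector_add_time_us(vector_len, num_procs, step_us=1):
    # Each step adds up to num_procs element pairs in parallel.
    return math.ceil(vector_len / num_procs) * step_us

print(vector_add_time_us(1024, 1024))   # 1 -> Example 1: 1 microsecond
print(vector_add_time_us(600, 512))     # 2 -> Example 2: 2 microseconds
```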
22
2-D Processor Interconnection Network
Each VLSI chip has 16 processing elements
23
if (COND) then A else B
[Diagram: step 1 of conditional execution on a processor array; every processing element evaluates COND]
24
if (COND) then A else B
[Diagram: step 2; elements where COND is true execute A, the rest sit idle]
25
if (COND) then A else B
[Diagram: step 3; elements where COND is false execute B, the rest sit idle]
26
Processor Array Shortcomings
  • Not all problems are data-parallel
  • Speed drops for conditionally executed code
  • Don't adapt to multiple users well
  • Do not scale down well to starter systems
  • Rely on custom VLSI for processors
  • Expense of control units has dropped

27
Multiprocessors
  • Multiprocessor: multiple-CPU computer with a shared memory
  • Same address on two different CPUs refers to the
    same memory location
  • Avoid three problems of processor arrays
  • Can be built from commodity CPUs
  • Naturally support multiple users
  • Maintain efficiency in conditional code

28
Centralized Multiprocessor
  • Straightforward extension of uniprocessor
  • Add CPUs to bus
  • All processors share same primary memory
  • Memory access time same for all CPUs
  • Uniform memory access (UMA) multiprocessor
  • Symmetrical multiprocessor (SMP)

29
Private and Shared Data
  • Private data: items used only by a single processor
  • Shared data: values used by multiple processors
  • In a multiprocessor, processors communicate via
    shared data values

30
Problems Associated with Shared Data
  • Cache coherence
    • The same address can map to different data values in a cache and in memory
  • Synchronization
    • Competing accesses to an exclusive resource (e.g., memory)
    • Synchronizing steps of a computation

31
Cache-coherence Problem
[Diagram: memory holds X = 7; CPU A's and CPU B's caches are empty]
32
Cache-coherence Problem
[Diagram: one CPU reads X and caches the value 7]
33
Cache-coherence Problem
[Diagram: the other CPU also reads X; both caches now hold 7]
34
Cache-coherence Problem
[Diagram: one CPU writes 2 to X; memory and that CPU's cache hold 2 while the other cache still holds the stale value 7]
35
Write Invalidate Protocol
[Diagram: memory and both CPUs' caches hold X = 7; a cache control monitor snoops the shared bus]
36
Write Invalidate Protocol
[Diagram: the CPU about to write broadcasts "Intent to write X"; both caches still hold 7]
37
Write Invalidate Protocol
[Diagram: the snooping cache invalidates its copy of X; only the writer's cache still holds 7]
38
Write Invalidate Protocol
[Diagram: the writer updates X to 2; memory and its cache now hold 2]
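The animation above can be summarized in code. The sketch below is a deliberately simplified write-invalidate model of my own (class and method names are illustrative, and write-through to memory is assumed because the final slide shows memory updated); it is not the exact protocol behind the slides.

```python
class SnoopyCache:
    def __init__(self, name):
        self.name, self.copy = name, None    # cached value of X, or None

class Bus:
    def __init__(self, memory_value):
        self.memory = memory_value
        self.caches = []

    def read(self, cache):
        cache.copy = self.memory             # load X into the reader's cache

    def write(self, cache, value):
        for other in self.caches:            # broadcast "intent to write X"
            if other is not cache:
                other.copy = None            # snooping caches invalidate their copies
        cache.copy = value
        self.memory = value                  # write-through, as drawn on the slides

bus = Bus(memory_value=7)
a, b = SnoopyCache("CPU A"), SnoopyCache("CPU B")
bus.caches = [a, b]
bus.read(a); bus.read(b)                     # both caches hold 7
bus.write(a, 2)                              # B's copy invalidated; A and memory hold 2
print(bus.memory, a.copy, b.copy)            # 2 2 None
```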
39
Synchronization
  • Barriers
  • No process will proceed beyond a designated point
    until every process has reached the barrier
  • Mutual Exclusion
  • At most one process can be engaged in a specific
    activity
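A minimal sketch of the two mechanisms using Python threads (illustrative only; any library that provides barriers and locks would do): every worker waits at the barrier, then the shared total is updated under mutual exclusion.

```python
import threading

NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)   # no thread proceeds until all have arrived
lock = threading.Lock()                    # at most one thread updates the shared total
total = 0

def worker(my_part):
    global total
    partial = sum(my_part)   # private computation
    barrier.wait()           # barrier: wait for every worker
    with lock:               # mutual exclusion around the shared variable
        total += partial

parts = [range(i * 100, (i + 1) * 100) for i in range(NUM_WORKERS)]
threads = [threading.Thread(target=worker, args=(p,)) for p in parts]
for t in threads: t.start()
for t in threads: t.join()
print(total)                 # 79800 == sum(range(400))
```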

40
Distributed Multiprocessor
  • Distribute primary memory among processors
  • Same address on different processors refers to the same memory location
  • Increase aggregate memory bandwidth and lower
    average memory access time
  • Allow greater number of processors
  • Also called non-uniform memory access (NUMA)
    multiprocessor

41
Distributed Multiprocessor
42
Cache Coherence
  • Some NUMA multiprocessors do not support it in hardware
    • Only instructions and private data are cached
    • Large memory access time variance
  • Implementation is more difficult
    • No shared memory bus to snoop
    • Directory-based protocol needed

43
Directory-based Protocol
  • Distributed directory contains information about cacheable memory blocks
  • One directory entry for each cache block
  • Each entry has
    • Sharing status
    • Which processors have copies

44
Sharing Status
  • Uncached
    • Block not in any processor's cache
  • Shared
    • Cached by one or more processors
    • Read only
  • Exclusive
    • Cached by exactly one processor
    • Processor has written the block
    • Copy in memory is obsolete
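As a concrete picture of slides 43-44, one directory entry can be modeled as a sharing status plus one presence bit per processor. The sketch below is illustrative only; the names are mine, not from the slides.

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    UNCACHED = "U"    # block not in any processor's cache
    SHARED = "S"      # cached read-only by one or more processors
    EXCLUSIVE = "E"   # written by exactly one processor; memory copy is obsolete

@dataclass
class DirectoryEntry:
    status: Status = Status.UNCACHED
    presence: list = field(default_factory=lambda: [0, 0, 0])  # one bit per CPU

entry = DirectoryEntry()
print(entry.status.value, entry.presence)   # U [0, 0, 0]
```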

45
Directory-based Protocol
[Diagram: CPUs with caches, local memories, and directories connected by an interconnection network]
46
Directory-based Protocol
[Diagram: initial state. The directory entry for X is U 0 0 0 (uncached, presence bits for CPU 0, CPU 1, CPU 2 all clear); memory holds X = 7; all caches are empty]
47
CPU 0 Reads X
[Diagram: CPU 0's read miss for X reaches X's directory; the entry is still U 0 0 0 and memory holds X = 7]
48
CPU 0 Reads X
[Diagram: the directory entry for X becomes S 1 0 0]
49
CPU 0 Reads X
[Diagram: the value 7 is sent to CPU 0's cache]
50
CPU 2 Reads X
[Diagram: CPU 2's read miss for X reaches the directory; the entry is S 1 0 0]
51
CPU 2 Reads X
[Diagram: the directory entry for X becomes S 1 0 1]
52
CPU 2 Reads X
[Diagram: the value 7 is sent to CPU 2's cache; CPU 0 and CPU 2 both hold 7]
53
CPU 0 Writes 6 to X
[Diagram: CPU 0 sends a write miss for X; the entry is S 1 0 1]
54
CPU 0 Writes 6 to X
[Diagram: the directory sends an invalidate to CPU 2]
55
CPU 0 Writes 6 to X
[Diagram: the entry becomes E 1 0 0 and CPU 0's cache holds X = 6]
56
CPU 1 Reads X
[Diagram: CPU 1 sends a read miss for X; the entry is E 1 0 0]
57
CPU 1 Reads X
[Diagram: the directory tells CPU 0 to switch its copy of X to shared]
58
CPU 1 Reads X
[Diagram: CPU 0 supplies its current value of X (6), and memory is updated]
59
CPU 1 Reads X
[Diagram: the entry becomes S 1 1 0; CPU 0 and CPU 1 both hold X = 6]
60
CPU 2 Writes 5 to X
[Diagram: CPU 2 sends a write miss for X; the entry is S 1 1 0]
61
CPU 2 Writes 5 to X
[Diagram: the directory sends invalidates to CPU 0 and CPU 1]
62
CPU 2 Writes 5 to X
[Diagram: the entry becomes E 0 0 1 and CPU 2's cache holds X = 5]
63
CPU 0 Writes 4 to X
[Diagram: CPU 0 sends a write miss for X; the entry is E 0 0 1, with CPU 2 as owner]
64
CPU 0 Writes 4 to X
[Diagram: the directory sends a "Take Away" message to CPU 2, the current owner]
65
CPU 0 Writes 4 to X
[Diagram: CPU 2 returns its copy of X (5), which is written back to memory]
66
CPU 0 Writes 4 to X
[Diagram: CPU 2's copy of X is invalidated]
67
CPU 0 Writes 4 to X
[Diagram: the directory entry for X becomes E 1 0 0]
68
CPU 0 Writes 4 to X
[Diagram: CPU 0's cache now holds X = 4]
69
CPU 0 Writes Back X Block
[Diagram: CPU 0 evicts the block holding X and sends a data write back; the entry is E 1 0 0]
70
CPU 0 Writes Back X Block
[Diagram: memory is updated with the written-back value and the directory entry for X returns to U 0 0 0]
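The whole sequence on slides 46-70 can be replayed against a toy directory model. The sketch below is my own simplified implementation (a single block X, three CPUs, write-back semantics); it is not the slides' exact protocol, but it reproduces the directory states shown in the animation.

```python
class ToyDirectory:
    def __init__(self, n_cpus, init_value):
        self.status, self.bits = "U", [0] * n_cpus   # sharing status + presence bits
        self.memory = init_value
        self.caches = [None] * n_cpus                # cached value of X per CPU

    def read(self, cpu):
        if self.status == "E":                       # owner's value is the newest
            self.memory = self.caches[self.bits.index(1)]
        self.status = "S"
        self.bits[cpu] = 1
        self.caches[cpu] = self.memory

    def write(self, cpu, value):
        for other, present in enumerate(self.bits):
            if present and other != cpu:             # invalidate / take away other copies
                if self.status == "E":
                    self.memory = self.caches[other] # write the owner's value back first
                self.caches[other] = None
                self.bits[other] = 0
        self.status = "E"
        self.bits[cpu] = 1
        self.caches[cpu] = value

    def write_back(self, cpu):                       # eviction: data write back
        self.memory = self.caches[cpu]
        self.caches[cpu] = None
        self.bits[cpu] = 0
        self.status = "U"

    def show(self, label):
        print(f"{label:22} X: {self.status} {self.bits}  mem={self.memory}  caches={self.caches}")

d = ToyDirectory(3, 7)
d.show("initial")                                  # U [0, 0, 0]
d.read(0);       d.show("CPU 0 reads X")           # S [1, 0, 0]
d.read(2);       d.show("CPU 2 reads X")           # S [1, 0, 1]
d.write(0, 6);   d.show("CPU 0 writes 6 to X")     # E [1, 0, 0]
d.read(1);       d.show("CPU 1 reads X")           # S [1, 1, 0], memory updated to 6
d.write(2, 5);   d.show("CPU 2 writes 5 to X")     # E [0, 0, 1]
d.write(0, 4);   d.show("CPU 0 writes 4 to X")     # E [1, 0, 0], 5 written back to memory
d.write_back(0); d.show("CPU 0 writes back X")     # U [0, 0, 0], memory now 4
```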
71
Multicomputer
  • Multicomputer: distributed-memory, multiple-CPU computer
  • Same address on different processors refers to
    different physical memory locations
  • Processors interact through message passing
  • Commercial multicomputers
  • Commodity clusters
  • No cache coherence problems
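A minimal sketch of the message-passing model just described, with operating-system processes standing in for multicomputer nodes (illustrative only): the two address spaces are separate, so data moves only through explicit messages.

```python
from multiprocessing import Process, Queue

def backend(q_in, q_out):
    a, b = q_in.get()          # data arrives only as a message
    q_out.put(a + b)           # result goes back the same way

if __name__ == "__main__":
    q_in, q_out = Queue(), Queue()
    worker = Process(target=backend, args=(q_in, q_out))
    worker.start()
    q_in.put((3, 4))           # send work to the other node
    print(q_out.get())         # 7
    worker.join()
```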

72
Asymmetrical Multicomputer
73
Asymmetrical MC Advantages
  • Advantages
    • Back-end processors dedicated to parallel computations → easier to understand, model, and tune performance
    • Only a simple back-end operating system needed → easy for a vendor to create
  • Disadvantages
    • Front-end computer is a single point of failure
    • Single front-end computer limits scalability of the system
    • Primitive operating system on back-end processors makes debugging difficult
    • Every application requires development of both a front-end and a back-end program

74
Symmetrical Multicomputer
75
Symmetrical MC Advantages
  • Advantages
    • Alleviates performance bottleneck caused by a single front-end computer
    • Better support for debugging
    • Every processor executes the same program
  • Disadvantages
    • More difficult to maintain the illusion of a single parallel computer
    • No simple way to balance program development workload among processors
    • More difficult to achieve high performance when multiple processes run on each processor

76
ParPar Cluster, A Mixed Model
77
Flynn's Taxonomy
  • Instruction stream
  • Data stream
  • Single vs. multiple
  • Four combinations
  • SISD
  • SIMD
  • MISD
  • MIMD

78
SISD
  • Single Instruction, Single Data
  • Single-CPU systems
  • Note: co-processors don't count
    • Functional
    • I/O
  • Example: PCs

79
SIMD
  • Single Instruction, Multiple Data
  • Two architectures fit this category
    • Pipelined vector processor (e.g., Cray-1)
    • Processor array (e.g., Connection Machine)

80
MISD
  • Multiple Instruction, Single Data
  • Example: systolic array

81
MIMD
  • Multiple Instruction, Multiple Data
  • Multiple-CPU computers
  • Multiprocessors
  • Multicomputers