Title: Parallel Architectures
1. Parallel Architectures
2. Design Issues
- Shared or Distributed Memory
- Customized or General Purpose Processors
3. Interconnection Networks
- Uses of interconnection networks
  - Connect processors to shared memory
  - Connect processors to each other
- Interconnection media types
  - Shared medium
  - Switched medium
4. Shared versus Switched Media
5. Shared Medium
- Allows only one message at a time
- Messages are broadcast
- Each processor listens to every message
- Collisions require resending of messages
- Ethernet is an example
6. Switched Medium
- Supports point-to-point messages between pairs of processors
- Each processor has its own path to the switch
- Advantages over shared media
  - Allows multiple messages to be sent simultaneously
  - Allows the network to scale to accommodate an increase in processors
7. Switch Network Topologies
- View the switched network as a graph
  - Vertices = processors or switches
  - Edges = communication paths
- Direct topology
  - Ratio of switch nodes to processor nodes is 1:1
  - Every switch node is connected to
    - 1 processor node
    - At least 1 other switch node
- Indirect topology
  - Ratio of switch nodes to processor nodes is greater than 1:1
  - Some switches simply connect other switches
8. Evaluating Switch Topologies
- Communication
  - Diameter: the largest distance between two nodes (lower is better)
- Connectivity
  - Bisection bandwidth: the size of the smallest cut that divides the network into two roughly equal halves (higher is better)
- Scalability
  - Edges per switch node and edge length (constant is better)
9. Two Network Topologies

  Metric                 | Linear array | Fully connected (n-to-n)
  Diameter               | n - 1        | 1
  Bisection bandwidth    | 1            | n^2 / 4
  Edges per switch       | 2            | n - 1
  Constant edge length?  | Yes          | No
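
As a quick illustration (my own sketch, not from the slides), this C program prints the metrics from the table above as functions of n for both networks:

    #include <stdio.h>

    int main(void) {
        int n = 8;   /* number of nodes; any value illustrates the trends */

        /* Linear array: long diameter, minimal bisection, constant degree */
        printf("linear array:    diameter=%d bisection=%d edges/switch=%d\n",
               n - 1, 1, 2);

        /* Fully connected (n-to-n): diameter 1, but degree grows with n */
        printf("fully connected: diameter=%d bisection=%d edges/switch=%d\n",
               1, (n * n) / 4, n - 1);
        return 0;
    }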
10. 2-D Mesh Topology
- Switches: n
- Diameter: 2(sqrt(n) - 1) = Θ(n^(1/2))
- Bisection bandwidth: Θ(n^(1/2))
- Number of edges per switch: 4
- Constant edge length? Yes
- Direct topology
- Switches arranged into a 2-D lattice
- Communication only between neighboring switches
- Variants allow wraparound connections
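
A small C sketch (an illustration assuming a square sqrt(n) × sqrt(n) mesh without wraparound) showing where the diameter figure comes from: a corner-to-corner message crosses sqrt(n) - 1 links in each dimension.

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    /* Hop count between switches (r1,c1) and (r2,c2): messages move only
       between neighboring switches, so the distance is the Manhattan distance. */
    int mesh_distance(int r1, int c1, int r2, int c2) {
        return abs(r1 - r2) + abs(c1 - c2);
    }

    int main(void) {
        int n = 16;                       /* number of switches */
        int k = (int)sqrt((double)n);     /* mesh is k x k */
        /* corner to opposite corner: 2(k - 1) = 2(sqrt(n) - 1), the diameter */
        printf("diameter of %dx%d mesh: %d\n", k, k,
               mesh_distance(0, 0, k - 1, k - 1));
        return 0;
    }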
11. Binary Tree Network
- Depth: d
- Processor nodes: n = 2^d
- Switches: 2n - 1
- Diameter: 2 log n
- Bisection bandwidth: 1
- Edges per node: 3
- Constant edge length? No
- Indirect topology
- Each processor node is connected to a leaf of the binary tree
- Interior switch nodes have at most 3 links
12. Hypertree Network
- k-ary hypertree of depth d (k = number of children per switch)
- Processors: n = k^d
- Switches: k^0·2^d + k^1·2^(d-1) + ... + k^d·2^0
- Diameter: 2d
- Bisection bandwidth: k·2^(d-1)
- Edges per node: constant (at most k + 2)
- Constant edge length? No
- Indirect topology
- Shares the low diameter of the binary tree
- Greatly improves the bisection width
- From the front, looks like a k-ary tree of height d
- From the side, looks like an upside-down binary tree of height d
13. Butterfly Network
- Processors: n
- Switches: n(log n + 1)
- Diameter: log n
- Bisection bandwidth: n / 2
- Edges per node: 4
- Constant edge length? No
- Indirect topology
- Node (i, j) is connected to node (i-1, j) and node (i-1, m), where m is the integer formed by inverting the i-th most significant bit of j
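
A short C sketch of the wiring rule just stated (a sketch only; n, the rank numbering, and the bit positions follow this slide's definitions): each node (i, j) with i ≥ 1 links back to (i-1, j) and to (i-1, m), where m is j with its i-th most significant bit flipped.

    #include <stdio.h>

    int main(void) {
        int logn = 3;                 /* log2(n) */
        int n = 1 << logn;            /* n = 8 processors / columns */

        for (int i = 1; i <= logn; i++) {
            for (int j = 0; j < n; j++) {
                /* invert the i-th most significant bit of j, i.e. bit logn - i */
                int m = j ^ (1 << (logn - i));
                printf("(%d,%d) -> (%d,%d) and (%d,%d)\n",
                       i, j, i - 1, j, i - 1, m);
            }
        }
        return 0;
    }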
14. Butterfly Network Routing
15. Hypercube Network
- Processors: n = 2^k
- Switches: n
- Diameter: log n
- Bisection bandwidth: n / 2
- Edges per node: log n
- Constant edge length? No
- Direct topology
- Number of nodes is a power of 2
- Node addresses: 0, 1, ..., 2^k - 1 (a k-dimensional hypercube)
- Node i is connected to the k nodes whose addresses differ from i in exactly one bit position
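
The neighbor rule above amounts to a one-bit XOR; a minimal C sketch (k chosen only for illustration):

    #include <stdio.h>

    int main(void) {
        int k = 3;                    /* dimension: n = 2^k = 8 nodes */
        int n = 1 << k;

        for (int i = 0; i < n; i++) {
            printf("node %d:", i);
            for (int b = 0; b < k; b++)
                printf(" %d", i ^ (1 << b));   /* flip one bit -> one neighbor */
            printf("\n");
        }
        return 0;
    }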
16. Shuffle-Exchange Addressing
- Processors: n
- Switches: n
- Diameter: O(log n)
- Bisection bandwidth: Θ(n / log n) (n / 4 for this graph; ref.: C. D. Thompson, "Area-Time Complexity for VLSI," Proc. 11th Annual Symposium on Theory of Computing)
- Edges per node: 2
- Constant edge length? No
- Direct topology
- Two outgoing links from node i
  - Shuffle link to node LeftCycle(i), the left cyclic rotation of i's binary address
  - Exchange link between nodes whose numbers differ in their least significant bit
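
A minimal C sketch of the two links, assuming k-bit node addresses: the shuffle link applies LeftCycle (a left cyclic rotation of the address), and the exchange link flips the least significant bit.

    #include <stdio.h>

    /* Left cyclic rotation of a k-bit address: the shuffle link. */
    int left_cycle(int i, int k) {
        int msb = (i >> (k - 1)) & 1;              /* bit that wraps around */
        return ((i << 1) | msb) & ((1 << k) - 1);
    }

    int main(void) {
        int k = 3;                                 /* n = 2^k = 8 nodes */
        for (int i = 0; i < (1 << k); i++)
            printf("node %d: shuffle -> %d, exchange -> %d\n",
                   i, left_cycle(i, k), i ^ 1);    /* exchange flips the LSB */
        return 0;
    }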
17. Comparing Networks
- All have logarithmic diameter except the 2-D mesh
- Hypertree, butterfly, and hypercube have bisection width n / 2
- All have a constant number of edges per node except the hypercube
- Only the 2-D mesh keeps edge lengths constant as the network size increases
18. Vector Computers
- A vector computer's instruction set includes operations on vectors as well as scalars
- Two ways to implement vector computers
  - Pipelined vector processor: streams data through pipelined arithmetic units
  - Processor array: many identical, synchronized arithmetic processing elements
19. Processor Array
- Historically, high cost of a control unit
- Scientific applications have data parallelism
20. Performance Example 1
- 1024 processors
- Each adds a pair of integers in 1 µsec
- What is the performance when adding two 1024-element vectors (one element per processor)?
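
Working the numbers (assuming one element per processor and all additions proceeding in lock step):

    1024 additions in 1 µsec = 1.024 × 10^9 additions per second (about 1 giga-op/s)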
21. Performance Example 2
- 512 processors
- Each adds two integers in 1 µsec
- What is the performance when adding two vectors of length 600?
- 2 µsec, because the vector length is greater than the number of processors
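
Spelling out the arithmetic behind that answer: with 600 elements and 512 processors, some processors must handle two elements, so two steps are needed:

    ceil(600 / 512) = 2 steps × 1 µsec = 2 µsec  →  600 additions / 2 µsec = 3 × 10^8 additions per second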
22. 2-D Processor Interconnection Network
Each VLSI chip has 16 processing elements
23-25. if (COND) then A else B
[Diagram sequence: the processor array evaluates COND on every element, then executes A only on processors where COND is true, and then executes B only on processors where COND is false; masked-off processors sit idle in each phase, as sketched below.]
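
A minimal C sketch (my own illustration of the idea, not code from the slides) of how a processor array handles the conditional: every element passes through both phases, and a mask decides which processors participate in each one.

    #include <stdio.h>

    #define N 8

    int main(void) {
        int x[N] = {3, -1, 4, -1, 5, -9, 2, 6};
        int mask[N];

        for (int i = 0; i < N; i++)      /* evaluate COND on every element */
            mask[i] = (x[i] > 0);

        for (int i = 0; i < N; i++)      /* phase 1: A runs where COND is true  */
            if (mask[i]) x[i] *= 2;

        for (int i = 0; i < N; i++)      /* phase 2: B runs where COND is false */
            if (!mask[i]) x[i] = 0;

        for (int i = 0; i < N; i++) printf("%d ", x[i]);
        printf("\n");
        return 0;
    }

In each phase the masked-off processors do no useful work, which is why the next slide lists conditionally executed code as a weakness.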
26. Processor Array Shortcomings
- Not all problems are data-parallel
- Speed drops for conditionally executed code
- Don't adapt well to multiple users
- Do not scale down well to starter systems
- Rely on custom VLSI for processors
- Expense of control units has dropped
27. Multiprocessors
- Multiprocessor: a multiple-CPU computer with a shared memory
- The same address on two different CPUs refers to the same memory location
- Avoid three problems of processor arrays
  - Can be built from commodity CPUs
  - Naturally support multiple users
  - Maintain efficiency in conditionally executed code
28. Centralized Multiprocessor
- Straightforward extension of the uniprocessor
- Add CPUs to the bus
- All processors share the same primary memory
- Memory access time is the same for all CPUs
  - Uniform memory access (UMA) multiprocessor
  - Also called a symmetrical multiprocessor (SMP)
29. Private and Shared Data
- Private data: items used only by a single processor
- Shared data: values used by multiple processors
- In a multiprocessor, processors communicate via shared data values
30. Problems Associated with Shared Data
- Cache coherence
  - A cache and main memory can end up holding different data for the same address
- Synchronization
  - Competing attempts to access an exclusive resource (e.g., memory)
  - Synchronizing the steps of a computation
31-34. Cache-coherence Problem
[Diagram sequence: memory initially holds X = 7 and the caches of CPU A and CPU B are empty; each CPU in turn reads X and caches the value 7; one CPU then writes 2 to X, so memory and its own cache hold 2 while the other CPU's cache still holds the stale value 7.]
35-38. Write Invalidate Protocol
[Diagram sequence: both CPUs hold X = 7 in their caches while a cache control monitor snoops the bus; before writing, one CPU broadcasts an "intent to write X", which invalidates the copy in the other cache; the writer then updates X to 2, leaving a single valid cached copy.]
39. Synchronization
- Barrier
  - No process proceeds beyond a designated point until every process has reached the barrier (see the sketch below)
- Mutual exclusion
  - At most one process can be engaged in a specific activity at a time
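
A minimal sketch of both mechanisms using POSIX threads (the thread count and variable names are illustrative assumptions):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    pthread_barrier_t barrier;
    pthread_mutex_t   lock = PTHREAD_MUTEX_INITIALIZER;
    int shared_counter = 0;

    void *worker(void *arg) {
        (void)arg;

        pthread_mutex_lock(&lock);       /* mutual exclusion around shared data */
        shared_counter++;
        pthread_mutex_unlock(&lock);

        pthread_barrier_wait(&barrier);  /* no thread proceeds until all arrive */

        printf("counter after barrier: %d\n", shared_counter);   /* always 4 */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        pthread_barrier_init(&barrier, NULL, NTHREADS);
        for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }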
40. Distributed Multiprocessor
- Distribute primary memory among the processors
- The same address on different processors refers to the same memory location
- Increases aggregate memory bandwidth and lowers average memory access time
- Allows a greater number of processors
- Also called a non-uniform memory access (NUMA) multiprocessor
41. Distributed Multiprocessor [diagram]
42. Cache Coherence
- Some NUMA multiprocessors do not support it in hardware
  - Only instructions and private data are cached
  - Large variance in memory access time
- Implementation is more difficult
  - No shared memory bus to snoop
  - A directory-based protocol is needed
43. Directory-based Protocol
- A distributed directory contains information about the cacheable memory blocks
- One directory entry for each cache block
- Each entry has
  - The sharing status
  - Which processors have copies
44. Sharing Status
- Uncached
  - Block is not in any processor's cache
- Shared
  - Cached by one or more processors
  - Read only
- Exclusive
  - Cached by exactly one processor
  - That processor has written the block
  - The copy in memory is obsolete
A data-structure sketch of a directory entry follows.
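
A minimal C sketch of one directory entry as described above: a sharing-status field plus a bit vector recording which processors hold copies. The three-processor count and the names (dir_entry, NPROC) are illustrative assumptions, matching the three-CPU example on the following slides.

    #include <stdio.h>

    #define NPROC 3                      /* hypothetical number of processors */

    enum status { UNCACHED, SHARED, EXCLUSIVE };

    struct dir_entry {
        enum status state;               /* U, S, or E */
        unsigned char copies[NPROC];     /* bit vector: which CPUs hold the block */
    };

    int main(void) {
        struct dir_entry x = { UNCACHED, {0, 0, 0} };   /* initial state: U 0 0 0 */

        x.state = SHARED;                /* CPU 0 reads X  -> S 1 0 0 */
        x.copies[0] = 1;

        for (int p = 0; p < NPROC; p++)  /* CPU 2 writes X: invalidate the others */
            x.copies[p] = 0;
        x.copies[2] = 1;                 /* -> E 0 0 1 */
        x.state = EXCLUSIVE;

        printf("state=%d copies=%d%d%d\n",
               x.state, x.copies[0], x.copies[1], x.copies[2]);
        return 0;
    }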
45. Directory-based Protocol
[Diagram: CPUs 0-2, each with a cache, a local memory, and a local directory, joined by an interconnection network.]
46. Directory-based Protocol
[Diagram: initial state; memory holds X = 7, all three caches are empty, and the directory entry for X is U 0 0 0 (uncached, no copies).]
47-49. CPU 0 Reads X
[Diagram sequence: CPU 0's read miss travels over the interconnection network to X's directory; the entry changes from U 0 0 0 to S 1 0 0, and the value 7 is returned and placed in CPU 0's cache.]
50-52. CPU 2 Reads X
[Diagram sequence: CPU 2's read miss reaches the directory; the entry changes from S 1 0 0 to S 1 0 1, and the value 7 is also copied into CPU 2's cache.]
53-55. CPU 0 Writes 6 to X
[Diagram sequence: CPU 0 sends a write miss to the directory; the directory invalidates CPU 2's copy and changes the entry from S 1 0 1 to E 1 0 0; CPU 0's cache now holds X = 6.]
56-59. CPU 1 Reads X
[Diagram sequence: CPU 1 sends a read miss while X is exclusive to CPU 0 (E 1 0 0); the directory asks CPU 0 to switch the block to shared, the current value is supplied to CPU 1, and the entry becomes S 1 1 0.]
60-62. CPU 2 Writes 5 to X
[Diagram sequence: CPU 2 sends a write miss while the entry is S 1 1 0; the directory invalidates the copies held by CPU 0 and CPU 1, the entry becomes E 0 0 1, and CPU 2's cache holds X = 5.]
63-68. CPU 0 Writes 4 to X
[Diagram sequence: CPU 0 sends a write miss while CPU 2 holds X exclusively (E 0 0 1); the directory takes the block away from CPU 2 and transfers ownership, the entry becomes E 1 0 0, and CPU 0's cache holds X = 4.]
69-70. CPU 0 Writes Back X Block
[Diagram sequence: CPU 0 evicts the block and sends a data write-back over the network; memory is updated and the directory entry returns from E 1 0 0 to U 0 0 0.]
71. Multicomputer
- Distributed-memory, multiple-CPU computer
- The same address on different processors refers to different physical memory locations
- Processors interact through message passing
- Commercial multicomputers
- Commodity clusters
- No cache coherence problems
72. Asymmetrical Multicomputer
73. Asymmetrical MC Advantages and Disadvantages
Advantages:
- Back-end processors dedicated to parallel computations → easier to understand, model, and tune performance
- Only a simple back-end operating system is needed → easy for a vendor to create
Disadvantages:
- The front-end computer is a single point of failure
- A single front-end computer limits the scalability of the system
- The primitive operating system on the back-end processors makes debugging difficult
- Every application requires development of both a front-end and a back-end program
74. Symmetrical Multicomputer
75. Symmetrical MC Advantages and Disadvantages
Advantages:
- Alleviates the performance bottleneck caused by a single front-end computer
- Better support for debugging
- Every processor executes the same program
Disadvantages:
- More difficult to maintain the illusion of a single parallel computer
- No simple way to balance program-development workload among the processors
- More difficult to achieve high performance when multiple processes run on each processor
76. ParPar Cluster, a Mixed Model
77. Flynn's Taxonomy
- Instruction stream
- Data stream
- Single vs. multiple
- Four combinations
- SISD
- SIMD
- MISD
- MIMD
78. SISD
- Single Instruction, Single Data
- Single-CPU systems
- Note: co-processors don't count
  - Functional co-processors
  - I/O co-processors
- Example: PCs
79. SIMD
- Single Instruction, Multiple Data
- Two architectures fit this category
  - Pipelined vector processor (e.g., the Cray-1)
  - Processor array (e.g., the Connection Machine)
80. MISD
- Multiple Instruction, Single Data
- Example: systolic array
81. MIMD
- Multiple Instruction, Multiple Data
- Multiple-CPU computers
  - Multiprocessors
  - Multicomputers