Title: Course Outline
1. Course Outline
- Introduction to algorithms and applications
- Parallel machines and architectures
  - Overview of parallel machines, trends in the top-500
  - Cluster computers, BlueGene
- Programming methods, languages, and environments
  - Message passing (SR, MPI, Java)
  - Higher-level language HPF
- Applications
  - N-body problems, search algorithms, bioinformatics
- Grid computing
  - Multimedia content analysis on Grids (guest lecture Frank Seinstra)
2. Parallel Machines
- Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers (2/e)
  - Section 1.3 (part of) and 1.4
  - Barry Wilkinson and Michael Allen
  - Pearson, 2005
3. Overview
- Processor organizations
- Types of parallel machines
- Processor arrays
- Shared-memory multiprocessors
- Distributed-memory multicomputers
- Cluster computers
- Blue Gene
4. Processor Organization
- Network topology is a graph
- A node is a processor
- An edge is a communication path
- Evaluation criteria
- Diameter (maximum distance between any two nodes; see the sketch below)
- Bisection width (minimum number of edges that must be removed to split the graph into two almost equal halves)
- Number of edges per node
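These criteria can be checked mechanically for small topologies. The C sketch below is only an illustration and not part of the course material: it builds a k x k mesh (k = 4 is an arbitrary choice) and computes the diameter by running a breadth-first search from every node. Bisection width is omitted, since computing it exactly is a much harder graph-partitioning problem.

```c
/* Illustrative sketch: compute the diameter of a k x k mesh by BFS.
 * The topology and all names here are assumptions for the demo. */
#include <stdio.h>
#include <string.h>

#define K 4                  /* mesh side length          */
#define N (K * K)            /* number of nodes           */

static int adj[N][4];        /* at most 4 neighbours/node */
static int deg[N];

static void add_edge(int a, int b) {
    adj[a][deg[a]++] = b;
    adj[b][deg[b]++] = a;
}

/* BFS from src; returns the largest distance reached (eccentricity) */
static int eccentricity(int src) {
    int dist[N], queue[N], head = 0, tail = 0, far = 0;
    memset(dist, -1, sizeof dist);
    dist[src] = 0;
    queue[tail++] = src;
    while (head < tail) {
        int u = queue[head++];
        if (dist[u] > far) far = dist[u];
        for (int i = 0; i < deg[u]; i++) {
            int v = adj[u][i];
            if (dist[v] < 0) { dist[v] = dist[u] + 1; queue[tail++] = v; }
        }
    }
    return far;
}

int main(void) {
    /* node (r,c) gets index r*K + c; connect horizontal/vertical neighbours */
    for (int r = 0; r < K; r++)
        for (int c = 0; c < K; c++) {
            if (c + 1 < K) add_edge(r * K + c, r * K + c + 1);
            if (r + 1 < K) add_edge(r * K + c, (r + 1) * K + c);
        }

    int diameter = 0;        /* diameter = maximum eccentricity */
    for (int s = 0; s < N; s++) {
        int e = eccentricity(s);
        if (e > diameter) diameter = e;
    }
    printf("diameter of %dx%d mesh = %d (formula 2(k-1) = %d)\n",
           K, K, diameter, 2 * (K - 1));
    return 0;
}
```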
5. Mesh
- q-dimensional lattice
- q = 2 -> 2-D grid
- Number of nodes: k²
- Diameter: 2(k - 1)
- Bisection width: k
- Edges per node: 4
6. Binary Tree
- Number of nodes: 2^k - 1
- Diameter: 2(k - 1)
- Bisection width: 1
- Edges per node: 3
7. Hypertree
- Tree with multiple roots (see Figure 3-3), gives better bisection width
- 4-ary tree
- Number of nodes: 2^k (2^(k+1) - 1)
- Diameter: 2k
- Bisection width: 2^(k+1)
- Edges per node: 6
8. Engineering solution: fat tree
- Tree with more bandwidth at links near the root
9. Hypercube
- k-dimensional cube; each node has a binary label, nodes that differ in 1 bit are connected
- Number of nodes: 2^k
- Diameter: k
- Bisection width: 2^(k-1)
- Edges per node: k
10. Hypercube
- Label nodes with binary values, connect nodes that differ in 1 coordinate (see the sketch below)
- Number of nodes: 2^k
- Diameter: k
- Bisection width: 2^(k-1)
- Edges per node: k
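To make the "differ in one bit/coordinate" rule concrete, here is a small C sketch (an illustrative demo, not course code, with k = 4 chosen arbitrarily) that enumerates the hypercube edges by flipping one bit of each node label and checks that the node and edge counts match the formulas above.

```c
/* Illustrative sketch: enumerate the edges of a k-dimensional hypercube
 * by connecting node labels that differ in exactly one bit. */
#include <stdio.h>

int main(void) {
    const int k = 4;             /* dimension (arbitrary for the demo) */
    const int n = 1 << k;        /* 2^k nodes                          */
    int edges = 0;

    for (int u = 0; u < n; u++)
        for (int d = 0; d < k; d++) {
            int v = u ^ (1 << d);   /* flip bit d -> neighbour label */
            if (u < v) {            /* count each edge only once     */
                edges++;
                /* printf("%d -- %d\n", u, v);  uncomment to list edges */
            }
        }

    /* every node has k neighbours, so there are n*k/2 edges in total */
    printf("k = %d: nodes = %d, edges = %d (expected n*k/2 = %d), degree = %d\n",
           k, n, edges, n * k / 2, k);
    return 0;
}
```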
11. Types of parallel machines
- Processor arrays
- Shared-memory multiprocessors
- Distributed-memory multicomputers
12. Processor Arrays
- Instructions operate on scalars or vectors
- Processor array = front-end + synchronized processing elements (see the sketch below)
- Front-end
- Sequential machine that executes program
- Vector operations are broadcast to PEs
- Processing element
- Performs operation on its part of the vector
- Communicates with other PEs through a network
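The division of work between front-end and PEs can be mimicked in ordinary sequential C. The sketch below is purely conceptual (the chunking, names, and the "add a scalar" operation are assumptions): the "front-end" broadcasts one operation and each simulated PE applies it to its own part of the vector; on a real processor array the PEs would do this in lockstep.

```c
/* Conceptual sketch of the processor-array idea, simulated in plain C. */
#include <stdio.h>

#define N    16      /* vector length                 */
#define NPE   4      /* simulated processing elements */

/* one PE applies the broadcast operation to its slice of the vector */
static void pe_execute(double *slice, int len, double scalar) {
    for (int i = 0; i < len; i++)
        slice[i] += scalar;
}

int main(void) {
    double v[N];
    for (int i = 0; i < N; i++) v[i] = i;

    /* "front-end": broadcast the operation (v += 10) to all PEs */
    const double scalar = 10.0;
    const int chunk = N / NPE;
    for (int pe = 0; pe < NPE; pe++)     /* conceptually in lockstep */
        pe_execute(v + pe * chunk, chunk, scalar);

    for (int i = 0; i < N; i++) printf("%.0f ", v[i]);
    printf("\n");
    return 0;
}
```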
13. Examples of Processor Arrays
- CM-200, MasPar MP-1, MP-2, ICL DAP (1970s)
- Japanese Earth Simulator (2002, former #1 of the top-500)
14. Shared-Memory Multiprocessors
- Bus easily gets saturated -> add caches to the CPUs
- Central problem: cache coherency (see the shared-counter sketch below)
  - Snooping cache: monitors the bus, invalidates its copy on a write
  - Write-through or copy-back
- Bus-based multiprocessors do not scale
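A tiny shared-memory program shows the kind of traffic the coherence protocol has to handle: several threads repeatedly write the same cached variable. This is a minimal sketch assuming POSIX threads (compile with -pthread); it is not course code.

```c
/* Sketch: several threads update one shared counter. The shared variable
 * is exactly the data the coherence protocol must keep consistent. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   100000

static long counter = 0;                     /* shared data          */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);           /* serialise the update */
        counter++;                           /* write: other caches must invalidate their copy */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld (expected %d)\n", counter, NTHREADS * NITERS);
    return 0;
}
```

Every write to the shared counter invalidates (or updates) the copies in the other CPUs' caches; with many CPUs this coherence traffic saturates the bus, which is why bus-based designs do not scale.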
15. Other Multiprocessor Designs (1/2)
- Switch-based multiprocessors (e.g., crossbar)
- Expensive (requires many very fast components)
16. Other Multiprocessor Designs (2/2)
- Non-Uniform Memory Access (NUMA) multiprocessors
- Memory is distributed
- Some memory is faster to access than other memory
- Example
  - Teras at SARA, Dutch national supercomputer (1024-node SGI)
17. Distributed-Memory Multicomputers
- Each processor has only a local memory
- Processors communicate by sending messages over a network (see the MPI sketch below)
- Routing of messages
  - Packet-switched message routing: split the message into packets, buffered at intermediate nodes
    - Store-and-forward
    - Cut-through routing, wormhole routing
  - Circuit-switched message routing: establish a path between source and destination
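As a flavour of message passing on such machines, here is a minimal MPI sketch in C (assuming an MPI implementation such as Open MPI or MPICH is installed; the value 42 and the tag 0 are arbitrary): rank 0 sends an integer that exists only in its local memory to rank 1.

```c
/* Minimal message-passing sketch: run with at least 2 processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                   /* lives only in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

It would typically be compiled with mpicc and started on two processes with mpirun -np 2.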
18. Store-and-forward Routing
- Messages are forwarded one node at a time
- Forwarding is done in software
- Every processor on the path from source to destination is involved
- Latency is proportional to distance x message length
- Examples: Parsytec GCel (T800 transputers), Intel iPSC
19. Circuit-switched Message Routing
- Each node has a routing module
- A circuit is set up between source and destination
- Latency is proportional to distance + message length
- Example: Intel iPSC/2
20. Modern routing techniques
- Circuit switching: needs to reserve all links in the path (cf. the old telephone system)
- Packet switching: high latency, needs buffering space (cf. postal mail)
- Cut-through routing: packet switching, but packets are forwarded immediately (without buffering) if the outgoing link is available
- Wormhole routing: transmit the head (a few bits) of the message, the rest follows like a worm (see the latency sketch below)
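The difference between these schemes is largely a difference in latency models. The sketch below uses made-up per-hop and per-byte costs (all numbers are assumptions, not measurements) to contrast store-and-forward, where the whole message is retransmitted at every hop, with circuit-switched or cut-through/wormhole routing, where only the setup/header cost is paid per hop.

```c
/* Simplified latency models for the routing schemes above. */
#include <stdio.h>

int main(void) {
    const double per_byte = 0.01;   /* time to push one byte over one link (assumed) */
    const double per_hop  = 1.0;    /* per-hop setup / header time (assumed)         */
    const int    hops     = 8;      /* distance in links                             */
    const int    bytes    = 4096;   /* message length                                */

    /* store-and-forward: the whole message is retransmitted at every hop */
    double saf = hops * (per_hop + bytes * per_byte);

    /* circuit-switched / cut-through / wormhole: per-hop cost for the
       path setup or header only, then the message streams through once */
    double ct  = hops * per_hop + bytes * per_byte;

    printf("store-and-forward : %.1f time units\n", saf);
    printf("cut-through style : %.1f time units\n", ct);
    return 0;
}
```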
21. Distributed Shared Memory
- Shared memory is relatively easy to program, but doesn't scale
- Distributed memory is hard to program, but does scale
- Distributed Shared Memory (DSM): provides a shared-memory programming model on top of distributed-memory hardware
  - Shared Virtual Memory (SVM): uses memory-management hardware (paging), copies pages over the network
  - Object-based: provides replicated shared objects (Orca language)
- Was a hot research topic in the 1990s, but performance remained the bottleneck
22. Flynn's Taxonomy
- Instruction stream: sequence of instructions
- Data stream: sequence of data manipulated by the instructions
                       Single Data    Multiple Data
Single Instruction     SISD           SIMD
Multiple Instruction   MISD           MIMD

- SISD (Single Instruction, Single Data): traditional uniprocessors
- SIMD (Single Instruction, Multiple Data): processor arrays
- MISD (Multiple Instruction, Single Data): nonexistent?
- MIMD (Multiple Instruction, Multiple Data): multiprocessors and multicomputers