Title: CS 213: Parallel Processing Architectures
1 CS 213: Parallel Processing Architectures
- Laxmi Narayan Bhuyan
- http://www.cs.ucr.edu/bhuyan
- Lecture 4
2 Commercial Multiprocessors
3 Origin2000 System
- CC-NUMA architecture
- Up to 512 nodes (1024 processors)
- 195 MHz MIPS R10K processor; peak 390 MFLOPS or 780 MIPS per processor
- Peak SysAD bus bandwidth is 780 MB/s, as is the Hub-to-memory bandwidth
- Hub to router chip and to Xbow is 1.56 GB/s (both are off-board)
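As a quick sanity check (my arithmetic, not stated on the slide), the per-processor peaks quoted above are consistent with the R10K sustaining 2 floating-point operations and issuing 4 instructions per cycle:

  \[
  195\ \text{MHz} \times 2\ \text{FLOPs/cycle} = 390\ \text{MFLOPS},
  \qquad
  195\ \text{MHz} \times 4\ \text{instructions/cycle} = 780\ \text{MIPS}
  \]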
4 Origin Network
- Each router has six pairs of 1.56 GB/s unidirectional links
- Two to nodes, four to other routers
- Latency: 41 ns pin to pin across a router
- Flexible cables up to 3 ft long
- Four virtual channels: request, reply, and two others for priority or I/O
- HPC solution stack running on industry-standard Linux operating systems
5 SGI Altix 4700 Servers
- Shared-memory NUMAflex (CC-NUMA?) architecture
- 512 sockets or 1024 cores under one instance of Linux, and as much as 128 TB of globally addressable memory
- Dual-core Intel Itanium 2 Series 9000 CPUs
- High-bandwidth fat-tree interconnection network, called NUMALink
- SGI RASC blade: two high-performance Xilinx Virtex-4 LX200 FPGA chips with 160K logic cells
6 The basic building block of the Altix 4700 system is the compute/memory blade. The compute blade contains one or two processor sockets; each socket can hold one Intel Itanium 2 processor with on-chip L1, L2, and L3 caches. The blade also carries memory DIMMs and a SHub2 ASIC.
7 The SGI Altix 4700 RASC architecture
8 Cray T3D Shared Memory
- Builds up remote-access information in a "shell" of support circuitry around the processor
- Remote memory operations encoded in the address (sketched below)
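As a purely illustrative sketch of "remote memory operations encoded in the address" (the real T3D encoding and field widths are not given here and are assumed), a global address can carry the target node number in its upper bits:

  #include <stdint.h>
  #include <stdio.h>

  /* Hypothetical layout: upper bits name the remote node, lower bits give the
     offset in that node's local memory. The 28-bit offset width is an
     assumption for illustration, not the T3D's actual format. */
  #define OFFSET_BITS 28
  #define OFFSET_MASK ((1ULL << OFFSET_BITS) - 1)

  static uint64_t make_global(unsigned node, uint64_t local_offset) {
      return ((uint64_t)node << OFFSET_BITS) | (local_offset & OFFSET_MASK);
  }

  static unsigned global_node(uint64_t gaddr)   { return (unsigned)(gaddr >> OFFSET_BITS); }
  static uint64_t global_offset(uint64_t gaddr) { return gaddr & OFFSET_MASK; }

  int main(void) {
      uint64_t g = make_global(42, 0x1234);    /* word 0x1234 on node 42 */
      printf("node %u, offset 0x%llx\n",
             global_node(g), (unsigned long long)global_offset(g));
      return 0;
  }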
9 CRAY XD1
- 2- or 4-way SMP based on AMD 64-bit Opteron processors
- FPGA accelerator at each SMP
10 Dual (Quad) SMP and Hardware Accelerator
11 Low Latency Message Passing Across Clusters in XD1
The interconnection topology, shown in Fig. 1, has three levels of latency:
- communication between the CPUs inside one blade is through shared memory (very fast);
- message-passing communication among blades within a chassis (slower);
- message-passing communication between two different chassis (slowest).
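A common way to expose these three latency tiers is a ping-pong microbenchmark between two MPI ranks placed (a) on the same blade, (b) on different blades in one chassis, and (c) in different chassis. The sketch below is generic MPI, nothing XD1-specific; the message size and iteration count are arbitrary choices:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int rank, iters = 1000;
      char buf[8];                       /* small message, so time ~ latency */
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      MPI_Barrier(MPI_COMM_WORLD);
      double t0 = MPI_Wtime();
      for (int i = 0; i < iters; i++) {
          if (rank == 0) {               /* rank 0 pings, rank 1 pongs */
              MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          } else if (rank == 1) {
              MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
      }
      double t1 = MPI_Wtime();
      if (rank == 0)
          printf("one-way latency ~ %g us\n", (t1 - t0) / (2.0 * iters) * 1e6);
      MPI_Finalize();
      return 0;
  }

Running the same binary under the three placements would show the shared-memory, intra-chassis, and inter-chassis latency levels in turn.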
12 IBM Power 4 Shared Memory
13 Power-4 Multi-chip Module
14 32-way SMP
15 NOW Message Passing
- General-purpose processor embedded in the NIC to implement VIA (to be discussed later)
16 Myrinet Message Passing
17 Interface Processor
18 InfiniBand Message Passing
19 Latency Comparison
20 IBM SP Architecture
- SMP nodes and Message passing between nodes
- Switch Architecture for High Performance
21 IBM SP2 Message Passing
22 Parallel Algorithm and Program Design: SPMD and MIMD
23 SIMD OPERATION
24 SIMD Model
- Operations can be performed in parallel on each element of a large regular data structure, such as an array
- 1 Control Processor (CP) broadcasts to many PEs. The CP reads an instruction from the control memory, decodes it, and broadcasts control signals to all PEs.
- Condition flag per PE so that individual PEs can skip an operation (see the sketch below)
- Data distributed across the PE memories
- Early 1980s VLSI => SIMD rebirth: 32 1-bit PEs + memory on a chip was the PE
- Data-parallel programming languages lay out data to processors
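To make the broadcast-plus-condition-flag idea concrete, the loops below mimic two broadcast SIMD instructions in plain C: each loop body is the single operation applied at every PE, and the per-PE flag becomes a mask that lets individual elements skip the update. The PE count, data, and mask condition are illustrative choices only:

  #include <stdio.h>

  #define NPE 8   /* pretend there are 8 processing elements */

  int main(void) {
      int a[NPE] = {1, -2, 3, -4, 5, -6, 7, -8};
      int flag[NPE];

      /* Broadcast a compare instruction: every PE sets its own condition flag. */
      for (int pe = 0; pe < NPE; pe++)
          flag[pe] = (a[pe] < 0);

      /* Broadcast a negate instruction: PEs whose flag is clear skip it. */
      for (int pe = 0; pe < NPE; pe++)
          if (flag[pe])
              a[pe] = -a[pe];            /* only the negative elements change */

      for (int pe = 0; pe < NPE; pe++)
          printf("%d ", a[pe]);
      printf("\n");
      return 0;
  }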
25 Data Parallel Model (SPMD)
- Vector processors have similar ISAs, but no data-placement restriction
- SIMD led to data-parallel programming languages
- Advancing VLSI led to single-chip FPUs and whole fast microprocessors (making SIMD less attractive)
- The SIMD programming model led to the Single Program Multiple Data (SPMD) model
- All processors execute an identical program
- Data-parallel programming languages are still useful: do communication all at once, in bulk-synchronous phases in which all processors communicate after a global barrier (see the sketch below)
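A minimal SPMD sketch of that bulk-synchronous style: every process runs this identical program, computes on its own rank-selected slice, then all processes communicate at once after a global barrier. The slice size and the use of MPI_Allgather for the communication phase are illustrative assumptions:

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define LOCAL_N 4                     /* elements owned by each process */

  int main(int argc, char **argv) {
      int rank, nprocs;
      double local[LOCAL_N];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
      double *all = malloc((size_t)nprocs * LOCAL_N * sizeof(double));

      /* Compute phase: identical code, different data (selected by rank). */
      for (int i = 0; i < LOCAL_N; i++)
          local[i] = rank * LOCAL_N + i;

      /* Global barrier, then one bulk communication phase. */
      MPI_Barrier(MPI_COMM_WORLD);
      MPI_Allgather(local, LOCAL_N, MPI_DOUBLE,
                    all,   LOCAL_N, MPI_DOUBLE, MPI_COMM_WORLD);

      if (rank == 0)
          printf("gathered %d values\n", nprocs * LOCAL_N);
      free(all);
      MPI_Finalize();
      return 0;
  }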
26 SPMD Programming: High-Performance Fortran (HPF)
- Single Program Multiple Data (SPMD)
- FORALL construct, similar to Fork (see the C rendering below):
- FORALL (I = 1:N), A(I) = B(I) + C(I), END FORALL
- Data mapping in HPF:
- 1. To reduce interprocessor communication
- 2. Load balancing among processors
- http://www.npac.syr.edu/hpfa/
- http://www.crpc.rice.edu/HPFF/
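For readers more comfortable in C, the FORALL above is semantically an independent parallel loop; a rough OpenMP rendering (my translation, assuming the elementwise operation is addition; not part of HPF itself):

  #include <stdio.h>

  #define N 1000

  int main(void) {
      static double a[N], b[N], c[N];

      for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

      /* FORALL (I = 1:N), A(I) = B(I) + C(I), END FORALL
         Every iteration is independent, so the compiler/runtime is free to
         spread the iterations across processors. */
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          a[i] = b[i] + c[i];

      printf("a[N-1] = %g\n", a[N - 1]);
      return 0;
  }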
27 Parallel Applications: SPMD and MIMD
- Commercial Workload
- Multiprogramming and OS Workload
- Scientific/Technical Applications
28 Parallel App: Commercial Workload
- Online transaction processing (OLTP) workload (like TPC-B or -C)
- Decision support system (DSS) (like TPC-D)
- Web index search (AltaVista)
29 Parallel App: Scientific/Technical
- FFT kernel: 1-D complex-number FFT
- Two matrix-transpose phases => all-to-all communication (see the sketch below)
- Sequential time for n data points: O(n log n)
- Example is a 1-million-point data set
- LU kernel: dense matrix factorization
- Blocking helps the cache miss rate; 16x16 blocks
- Sequential time for an n x n matrix: O(n^3)
- Example is a 512 x 512 matrix
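The FFT's transpose phases map naturally onto MPI_Alltoall: every process exchanges one distinct block of its locally held data with every other process. The sketch below shows only that communication step, with an illustrative block size; the local butterfly computation is omitted:

  #include <mpi.h>
  #include <stdlib.h>

  #define BLOCK 256   /* complex points sent to each peer; illustrative size */

  int main(int argc, char **argv) {
      int rank, nprocs;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      /* Each process holds nprocs blocks; block j is destined for process j.
         A complex point is stored as two doubles (real, imaginary). */
      double *send = malloc((size_t)nprocs * BLOCK * 2 * sizeof(double));
      double *recv = malloc((size_t)nprocs * BLOCK * 2 * sizeof(double));
      for (int i = 0; i < nprocs * BLOCK * 2; i++)
          send[i] = rank + 0.001 * i;

      /* Transpose phase: all-to-all exchange of the blocks. */
      MPI_Alltoall(send, BLOCK * 2, MPI_DOUBLE,
                   recv, BLOCK * 2, MPI_DOUBLE, MPI_COMM_WORLD);

      free(send);
      free(recv);
      MPI_Finalize();
      return 0;
  }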
30 Parallel App: Scientific/Technical
- Barnes app: Barnes-Hut n-body algorithm solving a problem in galaxy evolution
- n-body algorithms rely on forces dropping off with distance; if a body is far enough away, it can be ignored (e.g., gravity falls off as 1/d^2)
- Sequential time for n data points: O(n log n)
- Example is 16,384 bodies
- Ocean app: Gauss-Seidel multigrid technique to solve a set of elliptical partial differential equations
- Red-black Gauss-Seidel colors the grid points so that points are consistently updated based on previous values of adjacent neighbors (see the sketch below)
- Multigrid: solve the finite-difference equations by iteration using a hierarchy of grids
- Communication when a boundary is accessed by an adjacent subgrid
- Sequential time for an n x n grid: O(n^2)
- Input: 130 x 130 grid points, 5 iterations
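A sequential sketch of one red-black sweep on the Ocean-style grid, to show why the coloring works: every red point depends only on black neighbors, so all red points can be updated in parallel, and the black sweep then uses the freshly updated red values. The 5-point Laplace stencil and the simple boundary condition are assumptions for illustration:

  #include <stdio.h>

  #define N 130   /* interior points per side, matching the slide's input */

  static double grid[N + 2][N + 2];   /* one-cell halo holds boundary values */

  /* Update every point of one color (0 = red, 1 = black) from its 4 neighbors. */
  static void sweep_color(int color) {
      for (int i = 1; i <= N; i++)
          for (int j = 1; j <= N; j++)
              if ((i + j) % 2 == color)
                  grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                       grid[i][j - 1] + grid[i][j + 1]);
  }

  int main(void) {
      for (int j = 0; j <= N + 1; j++)
          grid[0][j] = 1.0;                  /* an arbitrary boundary condition */

      for (int iter = 0; iter < 5; iter++) { /* 5 iterations, as in the input */
          sweep_color(0);                    /* red points: read black neighbors */
          sweep_color(1);                    /* black points: read updated reds */
      }
      printf("grid[1][1] = %g\n", grid[1][1]);
      return 0;
  }

In a parallel version, communication happens only when a sweep reads halo points owned by an adjacent subgrid, which is the boundary communication noted above.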
31 Parallel Scientific App: Scaling
- p is the number of processors
- n is the data size
- Computation scales up with n (by the orders given above) and scales down linearly as p is increased
- Communication:
- FFT: all-to-all, so it scales with n
- LU, Ocean: only at subgrid boundaries, so n^(1/2)
- Barnes: complex; n^(1/2) for the greater distances, times log n to maintain the relationships among bodies
- All scale down as 1/p^(1/2)
- Keep n the same, but increase p?
- Increase n to keep communication the same as p grows?
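As a worked instance of these scaling rules (my derivation under a standard block-partitioning assumption, not taken from the slide): for Ocean with an n x n grid split into square subgrids over p processors, each processor owns an (n/sqrt(p)) x (n/sqrt(p)) block and exchanges only its four edges, so

  \[
  \text{computation per processor} \propto \frac{n^{2}}{p},
  \qquad
  \text{communication per processor} \propto \frac{4n}{\sqrt{p}},
  \qquad
  \frac{\text{computation}}{\text{communication}} \propto \frac{n}{4\sqrt{p}} .
  \]

Keeping n fixed while increasing p therefore lowers the computation-to-communication ratio, while growing n roughly with sqrt(p) keeps it constant.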