1
CS 213 Parallel Processing Architectures
  • Laxmi Narayan Bhuyan
  • http://www.cs.ucr.edu/~bhuyan
  • Lecture 4

2
  • Commercial Multiprocessors

3
Origin2000 System
  • CC-NUMA Architecture
  • Up to 512 nodes (1024 processors)
  • 195 MHz MIPS R10K processor; peak 390 MFLOPS or
    780 MIPS per processor
  • Peak SysAD bus bandwidth is 780 MB/s, as is the
    Hub-to-memory bandwidth
  • Hub to router chip and to Xbow is 1.56 GB/s (both
    are off-board)

4
Origin Network
  • Each router has six pairs of unidirectional links
    (1.56 GB/s per pair)
  • Two to nodes, four to other routers
  • Latency: 41 ns pin-to-pin across a router
  • Flexible cables up to 3 ft long
  • Four virtual channels: request, reply, and two
    others for priority or I/O
  • HPC solution stack running on industry standard
    Linux operating systems

5
SGI Altix 4700 Servers
  • Shared-memory NUMAflex (CC-NUMA?) architecture
  • 512 sockets or 1024 cores under one instance of
    Linux and as much as 128TB of globally
    addressable memory
  • Dual-core Intel Itanium 2 Series 9000 CPUs
  • High Bandwidth Fat-Tree Interconnection Network,
    called NUMALink
  • SGI RASC Blade: two high-performance Xilinx
    Virtex 4 LX200 FPGA chips with 160K logic cells

6
  • The basic building block of the Altix 4700
    system is the compute/memory blade. The compute
    blade contains one or two processor sockets,
    memory DIMMs, and a SHub2 ASIC; each processor
    socket holds one Intel Itanium 2 processor with
    on-chip L1, L2, and L3 caches.

7
The SGI Altix 4700 RASC architecture
8
Cray T3D Shared Memory
  • Remote access support is built up in a "shell" of
    circuitry around the processor
  • Remote memory operations are encoded in the
    address

9
CRAY XD1
2 or 4-way SMP based on AMD 64-bit Opteron
processors. FPGA accelerator at each SMP
10
Dual (Quad) SMP and Hardware Accelerator
11
Low Latency Message Passing Across Clusters in XD1
The interconnection topology, shown in Fig. 1, has
three levels of latency: communication between the
CPUs inside one blade is through shared memory;
message-passing communication among blades within a
chassis is very fast; and message-passing
communication between two different chassis is
slower.
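As an illustration of how these three latency levels could be observed in practice, here is a minimal MPI ping-pong sketch in plain C (not XD1-specific code); it assumes the job launcher places ranks 0 and 1 on the same blade, on different blades in one chassis, or in different chassis, and that placement is what selects which level is being measured.

```c
/* Minimal MPI ping-pong sketch for observing point-to-point latency.
 * Which of the three latency levels is measured depends on where the
 * launcher places ranks 0 and 1 (same blade, same chassis, different
 * chassis); that placement is assumed, not shown here. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char byte = 0;
    const int iters = 10000;
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("one-way latency ~ %g us\n", (t1 - t0) / (2.0 * iters) * 1e6);

    MPI_Finalize();
    return 0;
}
```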
12
IBM Power 4 Shared Memory
13
Power-4 Multi-chip Module
14
32-way SMP
15
NOW Message Passing
  • General-purpose processor embedded in the NIC to
    implement VIA (to be discussed later)

16
Myrinet Message Passing
17
Interface Processor
18
InfiniBand Message Passing
19
Latency Comparison
20
IBM SP Architecture
  • SMP nodes and Message passing between nodes
  • Switch Architecture for High Performance

21
IBM SP2 Message Passing
22
Parallel Algorithm and Program Design: SPMD and
MIMD
23
SIMD OPERATION
24
SIMD Model
  • Operations can be performed in parallel on each
    element of a large regular data structure, such
    as an array
  • 1 Control Processor (CP) broadcasts to many PEs.
    The CP reads an instruction from the control
    memory, decodes the instruction, and broadcasts
    control signals to all PEs.
  • Condition flag per PE so that individual PEs can
    skip an operation (see the sketch after this list)
  • Data distributed in each memory
  • Early 1980s VLSI => SIMD rebirth: 32 1-bit PEs +
    memory on a chip was the PE
  • Data parallel programming languages lay out data
    to processor
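A minimal sketch of the SIMD model above, written as plain sequential C rather than real SIMD hardware or intrinsics: the loop body stands in for the single operation broadcast by the control processor, and a per-element flag plays the role of the per-PE condition flag; the array names and values are illustrative.

```c
/* Sketch emulating the SIMD model in plain C: the loop body is the
 * "instruction" broadcast to all elements, and the per-element flag
 * mimics the per-PE condition flag that lets a PE skip the operation. */
#include <stdio.h>

#define N 8

int main(void) {
    float a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    int   active[N] = {1, 1, 0, 1, 0, 1, 1, 0};  /* per-"PE" condition flags */

    /* Broadcast one operation to every element; masked elements skip it. */
    for (int i = 0; i < N; i++)
        if (active[i])
            a[i] = a[i] + b[i];

    for (int i = 0; i < N; i++)
        printf("%g ", a[i]);
    printf("\n");
    return 0;
}
```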

25
Data Parallel Model (SPMD)
  • Vector processors have similar ISAs, but no data
    placement restriction
  • SIMD led to Data Parallel Programming languages
  • Advancing VLSI led to single chip FPUs and whole
    fast µProcs (SIMD less attractive)
  • SIMD programming model led to Single Program
    Multiple Data (SPMD) model
  • All processors execute identical program
  • Data parallel programming languages still useful,
    do communication all at once: "Bulk Synchronous"
    phases in which all communicate after a global
    barrier
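A minimal SPMD sketch of the bulk-synchronous pattern described above, using MPI in C (an assumption; any message-passing layer would do): every rank executes the identical program on its own data, then all ranks communicate at once after a global barrier.

```c
/* Minimal SPMD sketch: every rank runs this same program on its own
 * data, then all ranks communicate at once in a bulk-synchronous step
 * (here a global sum). Sizes and values are illustrative. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Local compute phase: each rank works on its own data. */
    double local = 0.0;
    for (int i = 0; i < 1000; i++)
        local += (rank + 1) * 1e-3;

    /* Global barrier, then the all-at-once communication phase. */
    MPI_Barrier(MPI_COMM_WORLD);
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %g (from %d ranks)\n", global, nprocs);

    MPI_Finalize();
    return 0;
}
```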

26
SPMD Programming: High-Performance Fortran (HPF)
  • Single Program Multiple Data (SPMD)
  • FORALL Construct similar to Fork
  • FORALL (I = 1:N),  A(I) = B(I) + C(I),  END
    FORALL (see the sketch after this list)
  • Data Mapping in HPF
  • 1. To reduce interprocessor communication
  • 2. Load balancing among processors
  • http://www.npac.syr.edu/hpfa/
  • http://www.crpc.rice.edu/HPFF/
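For comparison, a sketch of how the FORALL above might be realized under the SPMD model with a BLOCK-style data mapping, written in MPI C rather than HPF (so the compiler's mapping work is done by hand): each rank owns a contiguous block of A, B, and C and updates only that block, so this statement needs no interprocessor communication; sizes and names are illustrative.

```c
/* SPMD realization of FORALL A(I) = B(I) + C(I) with a block mapping:
 * each rank owns one contiguous block of the arrays and updates only
 * its own block. This mirrors what an HPF BLOCK distribution aims for;
 * the code is plain MPI C, not HPF. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024   /* global problem size (illustrative) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Block mapping: rank r owns global indices [lo, hi). Assumes
       N % nprocs == 0 for simplicity; a real mapping would also
       balance any remainder for load balancing. */
    int block = N / nprocs;
    int lo = rank * block, hi = lo + block;

    double *a = malloc(block * sizeof *a);
    double *b = malloc(block * sizeof *b);
    double *c = malloc(block * sizeof *c);
    for (int i = 0; i < block; i++) { b[i] = lo + i; c[i] = 1.0; }

    /* The local piece of FORALL (I = 1:N) A(I) = B(I) + C(I). */
    for (int i = 0; i < block; i++)
        a[i] = b[i] + c[i];

    if (rank == 0)
        printf("rank 0 owns [%d,%d), a[0] = %g\n", lo, hi, a[0]);

    free(a); free(b); free(c);
    MPI_Finalize();
    return 0;
}
```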

27
Parallel Applications SPMD and MIMD
  • Commercial Workload
  • Multiprogramming and OS Workload
  • Scientific/Technical Applications

28
Parallel App Commercial Workload
  • Online transaction processing workload (OLTP)
    (like TPC-B or -C)
  • Decision support system (DSS) (like TPC-D)
  • Web index search (Altavista)

29
Parallel App Scientific/Technical
  • FFT Kernel: 1D complex number FFT
  • 2 matrix transpose phases => all-to-all
    communication
  • Sequential time for n data points: O(n log n)
  • Example is a 1 million point data set
  • LU Kernel: dense matrix factorization
  • Blocking helps cache miss rate, 16x16
  • Sequential time for an n x n matrix: O(n³)
  • Example is a 512 x 512 matrix
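As a rough sanity check on these bounds for the example sizes quoted above (treating the asymptotic forms as operation counts up to constant factors, which is an assumption):

```latex
% FFT on n = 2^20 (about one million) points, and LU on a 512 x 512
% matrix, counting operations only up to constant factors.
\[
n\log_2 n = 2^{20}\cdot 20 \approx 2.1\times 10^{7},
\qquad
n^{3} = 512^{3} \approx 1.3\times 10^{8}.
\]
```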

30
Parallel App Scientific/Technical
  • Barnes App: Barnes-Hut n-body algorithm solving a
    problem in galaxy evolution
  • n-body algorithms rely on forces dropping off with
    distance; if far enough away, can ignore (e.g.,
    gravity is 1/d²)
  • Sequential time for n data points: O(n log n)
  • Example is 16,384 bodies
  • Ocean App: Gauss-Seidel multigrid technique to
    solve a set of elliptic partial differential
    equations
  • Red-black Gauss-Seidel colors points in the grid
    to consistently update points based on previous
    values of adjacent neighbors (see the sketch after
    this list)
  • Multigrid: solve finite difference equations by
    iteration using a hierarchy of grids
  • Communication when a boundary is accessed by an
    adjacent subgrid
  • Sequential time for an n x n grid: O(n²)
  • Input: 130 x 130 grid points, 5 iterations
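A minimal C sketch of the red-black coloring described above, reduced to one Gauss-Seidel smoothing sweep on a single grid level; the multigrid hierarchy, the right-hand side, and convergence testing are omitted, and the boundary condition and sizes are illustrative.

```c
/* One red-black Gauss-Seidel sweep on an n x n grid: points are colored
 * by (i + j) parity, all "red" points are updated from their (black)
 * neighbors, then all "black" points from the now-updated red ones.
 * Boundary values are held fixed. */
#include <stdio.h>

#define N 130   /* grid dimension, matching the 130 x 130 example */

static double u[N][N];

static void sweep_color(int color) {
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            if ((i + j) % 2 == color)
                u[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]);
}

int main(void) {
    /* Simple illustrative boundary condition: top edge = 1, rest 0. */
    for (int j = 0; j < N; j++) u[0][j] = 1.0;

    for (int iter = 0; iter < 5; iter++) {  /* 5 iterations, as in the input */
        sweep_color(0);   /* red points  */
        sweep_color(1);   /* black points */
    }
    printf("u[1][N/2] = %g\n", u[1][N/2]);
    return 0;
}
```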

31
Parallel Scientific App Scaling
  • p is the number of processors
  • n is the data size
  • Computation scales up with n by O( ), scales down
    linearly as p is increased
  • Communication:
  • FFT: all-to-all, so n
  • LU, Ocean: at the boundary, so n^(1/2)
  • Barnes: complex; n^(1/2) for greater distance,
    × log n to maintain the bodies' relationships
  • All scale down as 1/p^(1/2)
  • Keep n the same, but increase p?
  • Increase n to keep communication the same as p
    increases? (see the worked case below)
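To make the last two questions concrete, here is a hedged worked case for the boundary-communication applications (LU, Ocean), assuming per-processor computation proportional to n/p per step, which is one concrete reading of the blank O( ) above.

```latex
% Per-processor computation-to-communication ratio (LU/Ocean case):
%   computation ~ n/p,  communication ~ n^(1/2)/p^(1/2)
\[
\frac{n/p}{\sqrt{n}/\sqrt{p}} \;=\; \sqrt{\frac{n}{p}}\,.
\]
% Keeping n fixed while increasing p shrinks the ratio as 1/sqrt(p);
% holding the ratio constant requires n to grow linearly with p.
```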