Title: Architecture of Parallel Computers, CSC/ECE 506: BlueGene Architecture
1. Architecture of Parallel Computers, CSC / ECE 506
BlueGene Architecture
- 4/26/2007
- Dr. Steve Hunter
2. BlueGene/L Program
- December 1999: IBM Research announced a five-year, $100M US effort to build a petaflop/s-scale supercomputer to attack science problems such as protein folding. Goals:
  - Advance the state of the art of scientific simulation.
  - Advance the state of the art in computer design and software for capability and capacity markets.
- November 2001: Announced research partnership with Lawrence Livermore National Laboratory (LLNL).
- November 2002: Announced planned acquisition of a BG/L machine by LLNL as part of the ASCI Purple contract.
- May 11, 2004: Four racks of DD1 (4096 nodes at 500 MHz) running Linpack at 11.68 TFlop/s; ranked #4 on the 23rd Top500 list.
- June 2, 2004: 2 racks of DD2 (1024 nodes at 700 MHz) running Linpack at 8.655 TFlop/s; ranked #8 on the 23rd Top500 list.
- September 16, 2004: 8 racks running Linpack at 36.01 TFlop/s.
- November 8, 2004: 16 racks running Linpack at 70.72 TFlop/s; ranked #1 on the 24th Top500 list.
- December 21, 2004: First 16 racks of BG/L accepted by LLNL.
3. BlueGene/L Program
- Massive collection of low-power CPUs instead of a moderate-sized collection of high-power CPUs.
- A joint development of IBM and DOE's National Nuclear Security Administration (NNSA), installed at DOE's Lawrence Livermore National Laboratory.
- BlueGene/L has occupied the No. 1 position on the last three TOP500 lists (http://www.top500.org/).
- It has reached a Linpack benchmark performance of 280.6 TFlop/s (teraflops, or trillions of calculations per second) and still remains the only system ever to exceed the level of 100 TFlop/s.
- BlueGene/L holds the #1 and #3 positions in the top 10.
- Objective was to retain the exceptional cost/performance levels achieved by application-specific machines, while generalizing the massively parallel architecture enough to enable a relatively broad class of applications.
- Overview of BG/L system architecture (IBM JRD):
  - Design approach was to use a very high level of integration that made simplicity in packaging, design, and bring-up possible.
  - The JRD issue is available at http://www.research.ibm.com/journal/rd49-23.html
4. BlueGene/L Program
- BlueGene is a family of supercomputers.
- BlueGene/L is the first step, aimed as a multipurpose, massively parallel, and cost-effective supercomputer (12/04).
- BlueGene/P is the petaflop generation (12/06).
- BlueGene/Q is the third generation (2010).
- Requirements for future generations:
  - Processors will be more powerful.
  - Networks will be higher bandwidth.
  - Applications developed on BlueGene/L will run well on BlueGene/P.
5. BlueGene/L Fundamentals
- Low-complexity nodes give more flops per transistor and per watt.
- 3D interconnect supports many scientific simulations, since nature as we see it is 3D.
6. BlueGene/L Fundamentals
- Cellular architecture
  - Large numbers of low-power, more efficient processors interconnected
- Rmax of 280.6 Teraflops
  - Maximal LINPACK performance achieved
- Rpeak of 360 Teraflops
  - Theoretical peak performance (a back-of-the-envelope check follows this list)
- 65,536 dual-processor compute nodes
- 700 MHz IBM PowerPC 440 processors
- 512 MB memory per compute node, 16 TB in the entire system
- 800 TB of disk space
- 2,500 square feet
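The peak figure above can be checked from the node count and clock rate alone. A minimal sketch, assuming 2 cores per node and 4 floating-point operations per cycle per core from the double FPU (an assumption not stated on this slide); the computed value (~367 TFlops) is consistent with the ~360 TFlops peak quoted here:

```c
/* Back-of-the-envelope peak check (sketch; assumes 2 cores/node and
 * 4 flops/cycle/core from the double FPU -- not stated on this slide). */
#include <stdio.h>

int main(void) {
    const double nodes   = 65536;   /* compute nodes              */
    const double cores   = 2;       /* PowerPC 440 cores per node */
    const double clk_hz  = 700e6;   /* 700 MHz                    */
    const double flops_c = 4;       /* flops per cycle per core   */

    double rpeak = nodes * cores * clk_hz * flops_c;          /* ~3.67e14 */
    printf("Rpeak      ~ %.0f TFlops\n", rpeak / 1e12);       /* ~367     */
    printf("Rmax/Rpeak ~ %.0f%%\n", 100.0 * 280.6e12 / rpeak);/* ~76      */
    return 0;
}
```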
7. Comparing Systems (Peak)
8. Comparing Systems (Byte/Flop)
- Red Storm: 2.0 (2003)
- Earth Simulator: 2.0 (2002)
- Intel Paragon: 1.8 (1992)
- nCUBE/2: 1.0 (1990)
- ASCI Red: 1.0 (0.6) (1997)
- T3E: 0.8 (1996)
- BG/L: 1.5 = 0.75 (torus) + 0.75 (tree) (2004)
- Cplant: 0.1 (1997)
- ASCI White: 0.1 (2000)
- ASCI Q: 0.05, Quadrics (2003)
- ASCI Purple: 0.1 (2004)
- Intel Cluster: 0.1, IB (2004)
- Intel Cluster: 0.008, GbE (2003)
- Virginia Tech: 0.16, IB (2003)
- Chinese Acad. of Sci.: 0.04, QsNet (2003)
- NCSA - Dell: 0.04, Myrinet (2003)
9. Comparing Systems (GFlops/Watt)
- Power efficiencies of recent supercomputers
  - Blue: IBM machines
  - Black: other US machines
  - Red: Japanese machines
Source: IBM Journal of Research and Development
10. Comparing Systems
10 megawatts: approximate usage of 11,000 households
11. BG/L Summary of Performance Results
- DGEMM (Double-precision GEneral Matrix-Multiply)
  - 92.3% of dual-core peak on 1 node
  - Observed performance at 500 MHz: 3.7 GFlops
  - Projected performance at 700 MHz: 5.2 GFlops (tested in lab up to 650 MHz); see the arithmetic sketch after this list
- LINPACK
  - 77% of peak on 1 node
  - 70% of peak on 512 nodes (1435 GFlops at 500 MHz)
- sPPM (simplified Piecewise Parabolic Method), UMT2000
  - Single-processor performance roughly on par with POWER3 at 375 MHz
  - Tested on up to 128 nodes (also NAS Parallel Benchmarks)
- FFT (Fast Fourier Transform)
  - Up to 508 MFlops on a single processor at 444 MHz (TU Vienna)
  - Pseudo-op performance (5N log N) @ 700 MHz of 1300 MFlops (65% of peak)
- STREAM: impressive results even at 444 MHz
  - Tuned: Copy 2.4 GB/s, Scale 2.1 GB/s, Add 1.8 GB/s, Triad 1.9 GB/s
  - Standard: Copy 1.2 GB/s, Scale 1.1 GB/s, Add 1.2 GB/s, Triad 1.2 GB/s
  - At 700 MHz, would beat the STREAM numbers of most high-end microprocessors
- MPI
  - Latency < 4000 cycles (5.5 µs at 700 MHz)
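The 700 MHz DGEMM projection above is essentially peak scaling with clock frequency. A small sketch of the arithmetic, again assuming a dual-core peak of 2 cores x 4 flops/cycle x clock (not spelled out on the slide):

```c
/* Sketch: scale the measured 500 MHz DGEMM result to 700 MHz.
 * Assumes dual-core peak = 2 cores x 4 flops/cycle x clock. */
#include <stdio.h>

int main(void) {
    double peak_500 = 2 * 4 * 0.5;     /* 4.0 GFlops dual-core peak at 500 MHz */
    double eff      = 3.7 / peak_500;  /* ~92.5%, consistent with 92.3% above  */
    double peak_700 = 2 * 4 * 0.7;     /* 5.6 GFlops at 700 MHz                */

    printf("efficiency            ~ %.1f%%\n", eff * 100.0);
    printf("projected at 700 MHz  ~ %.1f GFlops\n", eff * peak_700); /* ~5.2 */
    return 0;
}
```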
12. BlueGene/L Architecture
- To achieve this level of integration, the machine was developed around a processor with moderate frequency, available in system-on-a-chip (SoC) technology.
- This approach was chosen because of the performance/power advantage:
  - In terms of performance/watt, the low-frequency, low-power, embedded IBM PowerPC core consistently outperforms high-frequency, high-power microprocessors by a factor of 2 to 10.
- Industry focus is on performance/rack:
  - Performance/rack = Performance/watt × Watt/rack
  - Watt/rack is limited to roughly 20 kW for power and thermal cooling reasons (see the sketch after this list).
- Power and cooling
  - Using conventional techniques, a 360 TFlops machine would require 10-20 megawatts.
  - BlueGene/L uses only 1.76 megawatts.
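The identity above is easy to apply with the numbers quoted on this slide. A minimal sketch of the arithmetic (illustrative only; it uses peak flops and the full 1.76 MW system figure rather than measured per-rack values):

```c
/* Sketch of the slide's identity:
 *   performance/rack = (performance/watt) x (watts/rack)
 * using only numbers quoted on this slide (peak figures, illustrative). */
#include <stdio.h>

int main(void) {
    double peak_flops   = 360e12;   /* ~360 TFlops peak      */
    double system_watts = 1.76e6;   /* 1.76 MW               */
    double watts_rack   = 20e3;     /* ~20 kW/rack budget    */

    double flops_per_watt = peak_flops / system_watts;       /* ~0.2 GFlops/W  */
    printf("performance/watt ~ %.2f GFlops/W\n", flops_per_watt / 1e9);
    printf("performance/rack ~ %.1f TFlops\n",
           flops_per_watt * watts_rack / 1e12);               /* ~4 TFlops/rack */
    return 0;
}
```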
13. Microprocessor Power Density Growth
14. System Power Comparison
15. BlueGene/L Architecture
- Networks were chosen with extreme scaling in mind:
  - Scale efficiently in terms of both performance and packaging.
  - Support very small messages (as small as 32 bytes).
  - Include hardware support for collective operations (broadcast, reduction, scan, etc.).
- Reliability, Availability and Serviceability (RAS) is another critical issue for scaling:
  - BG/L needs to be reliable and usable even at extreme scaling limits.
  - 20 failures per 1,000,000,000 hours ≈ 1 node failure every 4.5 weeks (see the sketch after this list).
- System software and monitoring are also important to scaling.
- BG/L is designed to efficiently utilize a distributed-memory, message-passing programming model:
  - MPI is the dominant message-passing model, with hardware features added and parameters tuned for it.
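The failure-rate target quoted above converts directly into a whole-machine MTBF. A minimal sketch of that arithmetic (assuming the 20-per-10^9-hours figure is per node, which is how the bullet reads):

```c
/* Sketch: 20 failures per 1e9 node-hours across 65,536 nodes. */
#include <stdio.h>

int main(void) {
    double rate_per_node = 20.0 / 1e9;   /* failures per node-hour */
    double nodes         = 65536;

    double mtbf_hours = 1.0 / (rate_per_node * nodes);   /* ~763 hours */
    printf("machine MTBF ~ %.0f hours ~ %.1f weeks\n",
           mtbf_hours, mtbf_hours / (24 * 7));            /* ~4.5 weeks */
    return 0;
}
```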
16. RAS (Reliability, Availability, Serviceability)
- System designed for RAS from top to bottom
- System issues
  - Redundant bulk supplies, power converters, fans, DRAM bits, cable bits
  - Extensive data logging (voltage, temperature, recoverable errors, ...) for failure forecasting
  - Nearly no single points of failure
- Chip design
  - ECC on all SRAMs
  - All dataflow outside the processors is protected by error-detection mechanisms
  - Access to all state via a noninvasive back door
  - Low-power, simple design leads to higher reliability
- All interconnects have multiple error detection and correction coverage
  - Virtually zero escape probability for link errors
17. BlueGene/L System
136.8 Teraflop/s on LINPACK (64K processors); 1 TF = 1,000,000,000,000 Flops. (Rochester Lab, 2005)
18. BlueGene/L System
19. BlueGene/L System
20. BlueGene/L System
21. Physical Layout of BG/L
22. Midplanes and Racks
23. The Compute Chip
- System-on-a-chip (SoC)
- 1 ASIC
- 2 PowerPC processors
- L1 and L2 Caches
- 4MB embedded DRAM
- DDR DRAM interface and DMA controller
- Network connectivity hardware
- Control / monitoring equip. (JTAG)
24. Compute Card
25. Node Card
26. BlueGene/L Compute ASIC
- IBM CU-11, 0.13 µm
- 11 x 11 mm die size
- 25 x 32 mm CBGA
- 474 pins, 328 signal
- 1.5/2.5 Volt
27. BlueGene/L Interconnect Networks
- 3-Dimensional Torus
  - Main network, for point-to-point communication
  - High-speed, high-bandwidth
  - Interconnects all compute nodes (65,536)
  - Virtual cut-through hardware routing
  - 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
  - 1 µs latency between nearest neighbors, 5 µs to the farthest
  - 4 µs latency for one hop with MPI, 10 µs to the farthest
  - Communications backbone for computations
  - 0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth
- Global Tree
  - One-to-all broadcast functionality
  - Reduction operations functionality
  - MPI collective ops in hardware (see the sketch after this list)
  - Fixed-size 256-byte packets
  - 2.8 Gb/s of bandwidth per link
  - Latency of one-way tree traversal: 2.5 µs
  - 23 TB/s total binary tree bandwidth (64K machine)
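These collectives are reached through the ordinary MPI interface; the mapping onto the tree hardware happens inside the MPI library. A minimal, generic sketch of the kind of reduction the tree accelerates (standard MPI, nothing BG/L-specific):

```c
/* Minimal MPI reduction sketch; on BG/L an MPI_Allreduce of this form
 * can be carried by the global tree network rather than the torus. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)rank;   /* each node's partial result */
    double sum   = 0.0;

    /* Combine partial results from every compute node. */
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", sum);

    MPI_Finalize();
    return 0;
}
```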
28. The Torus Network
- 3-dimensional: 64 x 32 x 32
- Each compute node is connected to its six neighbors: x+, x-, y+, y-, z+, z- (see the sketch after this list)
- Compute card is 1 x 2 x 1
- Node card is 4 x 4 x 2
  - 16 compute cards in a 4 x 2 x 2 arrangement
- Midplane is 8 x 8 x 8
  - 16 node cards in a 2 x 2 x 4 arrangement
- Communication path
  - Each unidirectional link is 1.4 Gb/s, or 175 MB/s
  - Each node can send and receive at 1.05 GB/s
  - Supports cut-through routing, along with both deterministic and adaptive routing
  - Variable-sized packets of 32, 64, 96, ..., 256 bytes
  - Guarantees reliable delivery
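The neighbor relationships in a 64 x 32 x 32 torus follow from simple modular arithmetic. A small illustrative sketch (this is not how BG/L actually maps MPI ranks to coordinates):

```c
/* Sketch: six nearest neighbors of a node at (x, y, z) in a 64x32x32
 * torus, with wrap-around in each dimension (illustrative only). */
#include <stdio.h>

#define DX 64
#define DY 32
#define DZ 32

static int wrap(int v, int dim) { return (v + dim) % dim; }

int main(void) {
    int x = 0, y = 0, z = 0;   /* example node coordinates */

    printf("x+/x-: (%d,%d,%d) (%d,%d,%d)\n",
           wrap(x + 1, DX), y, z, wrap(x - 1, DX), y, z);
    printf("y+/y-: (%d,%d,%d) (%d,%d,%d)\n",
           x, wrap(y + 1, DY), z, x, wrap(y - 1, DY), z);
    printf("z+/z-: (%d,%d,%d) (%d,%d,%d)\n",
           x, y, wrap(z + 1, DZ), x, y, wrap(z - 1, DZ));

    /* Farthest node is half the torus away in each dimension. */
    printf("max hops to farthest node: %d\n", DX / 2 + DY / 2 + DZ / 2); /* 64 */
    return 0;
}
```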
29. Complete BlueGene/L System at LLNL
System diagram (flattened figure): 65,536 BG/L compute nodes and 1,024 BG/L I/O nodes connected through a federated Gigabit Ethernet switch (2,048 ports) to front-end nodes, a service node, CWFS, visualization and archive servers, the WAN, and a separate control network.
30. System Software Overview
- Operating system: Linux
- Compilers: IBM XL C, C++, Fortran95
- Communication: MPI, TCP/IP
- Parallel file system: GPFS, NFS support
- System management: extensions to CSM
- Job scheduling: based on LoadLeveler
- Math libraries: ESSL
31. BG/L Software Hierarchical Organization
- Compute nodes are dedicated to running the user application, and almost nothing else: a simple compute node kernel (CNK).
- I/O nodes run Linux and provide a more complete range of OS services: files, sockets, process launch, signaling, debugging, and termination.
- The service node performs system management services (e.g., heartbeating, monitoring errors), transparent to application software.
32. BG/L System Software
- Simplicity
  - Space-sharing
  - Single-threaded
  - No demand paging
- Familiarity
  - MPI (MPICH2)
  - IBM XL compilers for PowerPC
33. Operating Systems
- Front-end nodes are commodity systems running Linux.
- I/O nodes run a customized Linux kernel.
- Compute nodes use an extremely lightweight custom kernel.
- The service node is a single multiprocessor machine running a custom OS.
34. Compute Node Kernel (CNK)
- Single-user, dual-threaded
- Flat address space, no paging
- Physical resources are memory-mapped
- Provides standard POSIX functionality (mostly)
- Two execution modes:
  - Virtual node mode
  - Coprocessor mode
35. Service Node OS
- Core Management and Control System (CMCS): BG/L's global operating system.
- MMCS: Midplane Monitoring and Control System
- CIOMAN: Control and I/O Manager
- DB2 relational database
36. Running a User Job
- Compiled on, and submitted from, a front-end node.
- External scheduler.
- The service node sets up the partition and transfers the user's code to the compute nodes.
- All file I/O is done using standard Unix calls (via the I/O nodes); see the sketch after this list.
- Post-facto debugging is done on the front-end nodes.
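Application code does nothing special for this I/O forwarding. A plain C sketch of the kind of call the CNK ships to an I/O node (the file name is purely illustrative):

```c
/* Sketch: ordinary Unix-style file I/O from a compute node.  Under CNK
 * these calls are serviced by an I/O node running Linux; the application
 * source is unchanged.  The path below is purely illustrative. */
#include <stdio.h>

int main(void) {
    FILE *f = fopen("results.dat", "w");   /* serviced via an I/O node */
    if (f == NULL)
        return 1;
    fprintf(f, "checkpoint step %d\n", 42);
    fclose(f);
    return 0;
}
```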
37. Performance Issues
- User code is easily ported to BG/L.
- However, the MPI implementation requires effort and skill:
  - Torus topology instead of a crossbar
  - Special hardware, such as the collective network
38. BG/L MPI Software Architecture
- GI: Global Interrupt
- CIO: Control and I/O Protocol
- CH3: Primary communication device distributed with MPICH2
- MPD: Multipurpose Daemon
39. MPI_Bcast
40. MPI_Alltoall
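For reference, a minimal generic sketch of the two collectives benchmarked on these slides (standard MPI calls; nothing BG/L-specific is assumed). MPI_Bcast can ride the global tree, while MPI_Alltoall stresses the torus bisection bandwidth quoted on slide 27:

```c
/* Sketch of the two collectives benchmarked on the previous slides. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* MPI_Bcast: the root sends one buffer to all ranks (tree-friendly). */
    int value = (rank == 0) ? 123 : 0;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* MPI_Alltoall: every rank exchanges a block with every other rank,
     * stressing the torus bisection bandwidth. */
    int *sendbuf = malloc(size * sizeof(int));
    int *recvbuf = malloc(size * sizeof(int));
    for (int i = 0; i < size; i++)
        sendbuf[i] = rank * size + i;
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```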
41. References
- IBM Journal of Research and Development, Vol. 49, No. 2/3. http://www.research.ibm.com/journal/rd49-23.html
  - Overview of the Blue Gene/L system architecture
  - Packaging the Blue Gene/L supercomputer
  - Blue Gene/L compute chip: Memory and Ethernet subsystems
  - Blue Gene/L torus interconnection network
  - Blue Gene/L programming and operating environment
  - Design and implementation of message-passing services for the Blue Gene/L supercomputer
42. References (cont.)
- BG/L homepage @ LLNL: http://www.llnl.gov/ASC/platforms/bluegenel/
- BlueGene homepage @ IBM: http://www.research.ibm.com/bluegene/