Title: Architecture of Parallel Computers, CSC/ECE 506: BlueGene Architecture
1. Architecture of Parallel Computers, CSC / ECE 506
BlueGene Architecture
- 4/26/2007
- Dr. Steve Hunter
2. BlueGene/L Program
- December 1999: IBM Research announced a five-year, $100M US effort to build a petaflop/s-scale supercomputer to attack science problems such as protein folding. Goals:
  - Advance the state of the art of scientific simulation.
  - Advance the state of the art in computer design and software for capability and capacity markets.
- November 2001: Announced research partnership with Lawrence Livermore National Laboratory (LLNL).
- November 2002: Announced planned acquisition of a BG/L machine by LLNL as part of the ASCI Purple contract.
- May 11, 2004: Four racks of DD1 (4096 nodes at 500 MHz) running Linpack at 11.68 TFlop/s; ranked #4 on the 23rd Top500 list.
- June 2, 2004: 2 racks of DD2 (1024 nodes at 700 MHz) running Linpack at 8.655 TFlop/s; ranked #8 on the 23rd Top500 list.
- September 16, 2004: 8 racks running Linpack at 36.01 TFlop/s.
- November 8, 2004: 16 racks running Linpack at 70.72 TFlop/s; ranked #1 on the 24th Top500 list.
- December 21, 2004: First 16 racks of BG/L accepted by LLNL.
3. BlueGene/L Program
- Massive collection of low-power CPUs instead of a moderate-sized collection of high-power CPUs.
- A joint development of IBM and DOE's National Nuclear Security Administration (NNSA), installed at DOE's Lawrence Livermore National Laboratory.
- BlueGene/L has occupied the No. 1 position on the last three TOP500 lists (http://www.top500.org/).
- It has reached a Linpack benchmark performance of 280.6 TFlop/s (teraflops, or trillions of calculations per second) and still remains the only system ever to exceed the level of 100 TFlop/s.
- BlueGene/L holds the #1 and #3 positions in the top 10.
- Objective was to retain the exceptional cost/performance levels achieved by application-specific machines, while generalizing the massively parallel architecture enough to enable a relatively broad class of applications.
- Overview of BG/L system architecture (IBM JRD):
  - Design approach was to use a very high level of integration that made simplicity in packaging, design, and bring-up possible.
  - The JRD issue is available at http://www.research.ibm.com/journal/rd49-23.html
4. BlueGene/L Program
- BlueGene is a family of supercomputers.
- BlueGene/L is the first step, aimed as a multipurpose, massively parallel, and cost-effective supercomputer (12/04).
- BlueGene/P is the petaflop generation (12/06).
- BlueGene/Q is the third generation (2010).
- Requirements for future generations:
  - Processors will be more powerful.
  - Networks will be higher bandwidth.
  - Applications developed on BlueGene/L will run well on BlueGene/P.
5. BlueGene/L Fundamentals
- Low-complexity nodes give more flops per transistor and per watt.
- 3D interconnect supports many scientific simulations, since nature as we see it is 3D.
6. BlueGene/L Fundamentals
- Cellular architecture
  - Large numbers of low-power, more efficient processors interconnected
- Rmax of 280.6 Teraflops
  - Maximal LINPACK performance achieved
- Rpeak of 360 Teraflops
  - Theoretical peak performance (a back-of-the-envelope check follows this list)
- 65,536 dual-processor compute nodes
- 700 MHz IBM PowerPC 440 processors
- 512 MB memory per compute node, 16 TB in the entire system
- 800 TB of disk space
- 2,500 square feet
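The peak figure above can be checked from the node count and clock rate alone. A minimal sketch, assuming 2 cores per node and 4 floating-point operations per cycle per core from the double FPU (an assumption not stated on this slide); the computed value (~367 TFlops) is consistent with the ~360 TFlops peak quoted here:

```c
/* Back-of-the-envelope peak check (sketch; assumes 2 cores/node and
 * 4 flops/cycle/core from the double FPU -- not stated on this slide). */
#include <stdio.h>

int main(void) {
    const double nodes   = 65536;   /* compute nodes              */
    const double cores   = 2;       /* PowerPC 440 cores per node */
    const double clk_hz  = 700e6;   /* 700 MHz                    */
    const double flops_c = 4;       /* flops per cycle per core   */

    double rpeak = nodes * cores * clk_hz * flops_c;          /* ~3.67e14 */
    printf("Rpeak      ~ %.0f TFlops\n", rpeak / 1e12);       /* ~367     */
    printf("Rmax/Rpeak ~ %.0f%%\n", 100.0 * 280.6e12 / rpeak);/* ~76      */
    return 0;
}
```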
7. Comparing Systems (Peak)
8. Comparing Systems (Byte/Flop)
- Red Storm: 2.0 (2003)
- Earth Simulator: 2.0 (2002)
- Intel Paragon: 1.8 (1992)
- nCUBE/2: 1.0 (1990)
- ASCI Red: 1.0 (0.6) (1997)
- T3E: 0.8 (1996)
- BG/L: 1.5 = 0.75 (torus) + 0.75 (tree) (2004)
- Cplant: 0.1 (1997)
- ASCI White: 0.1 (2000)
- ASCI Q: 0.05, Quadrics (2003)
- ASCI Purple: 0.1 (2004)
- Intel Cluster: 0.1, IB (2004)
- Intel Cluster: 0.008, GbE (2003)
- Virginia Tech: 0.16, IB (2003)
- Chinese Acad. of Sci.: 0.04, QsNet (2003)
- NCSA - Dell: 0.04, Myrinet (2003)
9. Comparing Systems (GFlops/Watt)
- Power efficiencies of recent supercomputers
  - Blue: IBM machines
  - Black: other US machines
  - Red: Japanese machines
Source: IBM Journal of Research and Development
10. Comparing Systems
10 megawatts: approximate usage of 11,000 households
11. BG/L Summary of Performance Results
- DGEMM (Double-precision GEneral Matrix-Multiply)
  - 92.3% of dual-core peak on 1 node
  - Observed performance at 500 MHz: 3.7 GFlops
  - Projected performance at 700 MHz: 5.2 GFlops (tested in lab up to 650 MHz); see the arithmetic sketch after this list
- LINPACK
  - 77% of peak on 1 node
  - 70% of peak on 512 nodes (1435 GFlops at 500 MHz)
- sPPM (simplified Piecewise Parabolic Method), UMT2000
  - Single-processor performance roughly on par with POWER3 at 375 MHz
  - Tested on up to 128 nodes (also NAS Parallel Benchmarks)
- FFT (Fast Fourier Transform)
  - Up to 508 MFlops on a single processor at 444 MHz (TU Vienna)
  - Pseudo-op performance (5N log N) @ 700 MHz of 1300 MFlops (65% of peak)
- STREAM: impressive results even at 444 MHz
  - Tuned: Copy 2.4 GB/s, Scale 2.1 GB/s, Add 1.8 GB/s, Triad 1.9 GB/s
  - Standard: Copy 1.2 GB/s, Scale 1.1 GB/s, Add 1.2 GB/s, Triad 1.2 GB/s
  - At 700 MHz, would beat the STREAM numbers of most high-end microprocessors
- MPI
  - Latency < 4000 cycles (5.5 µs at 700 MHz)
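The 700 MHz DGEMM projection above is essentially peak scaling with clock frequency. A small sketch of the arithmetic, again assuming a dual-core peak of 2 cores x 4 flops/cycle x clock (not spelled out on the slide):

```c
/* Sketch: scale the measured 500 MHz DGEMM result to 700 MHz.
 * Assumes dual-core peak = 2 cores x 4 flops/cycle x clock. */
#include <stdio.h>

int main(void) {
    double peak_500 = 2 * 4 * 0.5;     /* 4.0 GFlops dual-core peak at 500 MHz */
    double eff      = 3.7 / peak_500;  /* ~92.5%, consistent with 92.3% above  */
    double peak_700 = 2 * 4 * 0.7;     /* 5.6 GFlops at 700 MHz                */

    printf("efficiency            ~ %.1f%%\n", eff * 100.0);
    printf("projected at 700 MHz  ~ %.1f GFlops\n", eff * peak_700); /* ~5.2 */
    return 0;
}
```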
12. BlueGene/L Architecture
- To achieve this level of integration, the machine was developed around a processor with moderate frequency, available in system-on-a-chip (SoC) technology.
- This approach was chosen because of the performance/power advantage:
  - In terms of performance/watt, the low-frequency, low-power, embedded IBM PowerPC core consistently outperforms high-frequency, high-power microprocessors by a factor of 2 to 10.
- Industry focus is on performance/rack:
  - Performance/rack = Performance/watt × Watt/rack
  - Watt/rack is limited to roughly 20 kW for power and thermal cooling reasons (see the sketch after this list).
- Power and cooling
  - Using conventional techniques, a 360 TFlops machine would require 10-20 megawatts.
  - BlueGene/L uses only 1.76 megawatts.
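The identity above is easy to apply with the numbers quoted on this slide. A minimal sketch of the arithmetic (illustrative only; it uses peak flops and the full 1.76 MW system figure rather than measured per-rack values):

```c
/* Sketch of the slide's identity:
 *   performance/rack = (performance/watt) x (watts/rack)
 * using only numbers quoted on this slide (peak figures, illustrative). */
#include <stdio.h>

int main(void) {
    double peak_flops   = 360e12;   /* ~360 TFlops peak      */
    double system_watts = 1.76e6;   /* 1.76 MW               */
    double watts_rack   = 20e3;     /* ~20 kW/rack budget    */

    double flops_per_watt = peak_flops / system_watts;       /* ~0.2 GFlops/W  */
    printf("performance/watt ~ %.2f GFlops/W\n", flops_per_watt / 1e9);
    printf("performance/rack ~ %.1f TFlops\n",
           flops_per_watt * watts_rack / 1e12);               /* ~4 TFlops/rack */
    return 0;
}
```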
13. Microprocessor Power Density Growth
14. System Power Comparison
15. BlueGene/L Architecture
- Networks were chosen with extreme scaling in mind:
  - Scale efficiently in terms of both performance and packaging.
  - Support very small messages (as small as 32 bytes).
  - Include hardware support for collective operations (broadcast, reduction, scan, etc.).
- Reliability, Availability and Serviceability (RAS) is another critical issue for scaling:
  - BG/L needs to be reliable and usable even at extreme scaling limits.
  - 20 failures per 1,000,000,000 hours ≈ 1 node failure every 4.5 weeks (see the sketch after this list).
- System software and monitoring are also important to scaling.
- BG/L is designed to efficiently utilize a distributed-memory, message-passing programming model:
  - MPI is the dominant message-passing model, with hardware features added and parameters tuned for it.
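The failure-rate target quoted above converts directly into a whole-machine MTBF. A minimal sketch of that arithmetic (assuming the 20-per-10^9-hours figure is per node, which is how the bullet reads):

```c
/* Sketch: 20 failures per 1e9 node-hours across 65,536 nodes. */
#include <stdio.h>

int main(void) {
    double rate_per_node = 20.0 / 1e9;   /* failures per node-hour */
    double nodes         = 65536;

    double mtbf_hours = 1.0 / (rate_per_node * nodes);   /* ~763 hours */
    printf("machine MTBF ~ %.0f hours ~ %.1f weeks\n",
           mtbf_hours, mtbf_hours / (24 * 7));            /* ~4.5 weeks */
    return 0;
}
```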
16. RAS (Reliability, Availability, Serviceability)
- System designed for RAS from top to bottom
- System issues
  - Redundant bulk supplies, power converters, fans, DRAM bits, cable bits
  - Extensive data logging (voltage, temperature, recoverable errors, ...) for failure forecasting
  - Nearly no single points of failure
- Chip design
  - ECC on all SRAMs
  - All dataflow outside the processors is protected by error-detection mechanisms
  - Access to all state via a noninvasive back door
  - Low-power, simple design leads to higher reliability
- All interconnects have multiple error detection and correction coverage
  - Virtually zero escape probability for link errors
17. BlueGene/L System
136.8 Teraflop/s on LINPACK (64K processors); 1 TF = 1,000,000,000,000 Flops. (Rochester Lab, 2005)
18. BlueGene/L System
19. BlueGene/L System
20. BlueGene/L System
21. Physical Layout of BG/L
22. Midplanes and Racks
23. The Compute Chip
- System-on-a-chip (SoC)
- 1 ASIC
- 2 PowerPC processors
- L1 and L2 Caches
- 4MB embedded DRAM
- DDR DRAM interface and DMA controller
- Network connectivity hardware
- Control / monitoring equip. (JTAG)
24. Compute Card
25. Node Card
26. BlueGene/L Compute ASIC
- IBM CU-11, 0.13 µm
- 11 x 11 mm die size
- 25 x 32 mm CBGA
- 474 pins, 328 signal
- 1.5/2.5 Volt
27. BlueGene/L Interconnect Networks
- 3-Dimensional Torus
  - Main network, for point-to-point communication
  - High-speed, high-bandwidth
  - Interconnects all compute nodes (65,536)
  - Virtual cut-through hardware routing
  - 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
  - 1 µs latency between nearest neighbors, 5 µs to the farthest
  - 4 µs latency for one hop with MPI, 10 µs to the farthest
  - Communications backbone for computations
  - 0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth
- Global Tree
  - One-to-all broadcast functionality
  - Reduction operations functionality
  - MPI collective ops in hardware (see the sketch after this list)
  - Fixed-size 256-byte packets
  - 2.8 Gb/s of bandwidth per link
  - Latency of one-way tree traversal: 2.5 µs
  - 23 TB/s total binary tree bandwidth (64K machine)
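These collectives are reached through the ordinary MPI interface; the mapping onto the tree hardware happens inside the MPI library. A minimal, generic sketch of the kind of reduction the tree accelerates (standard MPI, nothing BG/L-specific):

```c
/* Minimal MPI reduction sketch; on BG/L an MPI_Allreduce of this form
 * can be carried by the global tree network rather than the torus. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)rank;   /* each node's partial result */
    double sum   = 0.0;

    /* Combine partial results from every compute node. */
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", sum);

    MPI_Finalize();
    return 0;
}
```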
28. The Torus Network
- 3-dimensional: 64 x 32 x 32
- Each compute node is connected to its six neighbors: x+, x-, y+, y-, z+, z- (see the sketch after this list)
- Compute card is 1 x 2 x 1
- Node card is 4 x 4 x 2
  - 16 compute cards in a 4 x 2 x 2 arrangement
- Midplane is 8 x 8 x 8
  - 16 node cards in a 2 x 2 x 4 arrangement
- Communication path
  - Each unidirectional link is 1.4 Gb/s, or 175 MB/s
  - Each node can send and receive at 1.05 GB/s
  - Supports cut-through routing, along with both deterministic and adaptive routing
  - Variable-sized packets of 32, 64, 96, ..., 256 bytes
  - Guarantees reliable delivery
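The neighbor relationships in a 64 x 32 x 32 torus follow from simple modular arithmetic. A small illustrative sketch (this is not how BG/L actually maps MPI ranks to coordinates):

```c
/* Sketch: six nearest neighbors of a node at (x, y, z) in a 64x32x32
 * torus, with wrap-around in each dimension (illustrative only). */
#include <stdio.h>

#define DX 64
#define DY 32
#define DZ 32

static int wrap(int v, int dim) { return (v + dim) % dim; }

int main(void) {
    int x = 0, y = 0, z = 0;   /* example node coordinates */

    printf("x+/x-: (%d,%d,%d) (%d,%d,%d)\n",
           wrap(x + 1, DX), y, z, wrap(x - 1, DX), y, z);
    printf("y+/y-: (%d,%d,%d) (%d,%d,%d)\n",
           x, wrap(y + 1, DY), z, x, wrap(y - 1, DY), z);
    printf("z+/z-: (%d,%d,%d) (%d,%d,%d)\n",
           x, y, wrap(z + 1, DZ), x, y, wrap(z - 1, DZ));

    /* Farthest node is half the torus away in each dimension. */
    printf("max hops to farthest node: %d\n", DX / 2 + DY / 2 + DZ / 2); /* 64 */
    return 0;
}
```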
29. Complete BlueGene/L System at LLNL
System diagram (flattened figure): 65,536 BG/L compute nodes and 1,024 BG/L I/O nodes connected through a federated Gigabit Ethernet switch (2,048 ports) to front-end nodes, a service node, CWFS, visualization and archive servers, the WAN, and a separate control network.
30. System Software Overview
- Operating system: Linux
- Compilers: IBM XL C, C++, Fortran95
- Communication: MPI, TCP/IP
- Parallel file system: GPFS, NFS support
- System management: extensions to CSM
- Job scheduling: based on LoadLeveler
- Math libraries: ESSL
31. BG/L Software Hierarchical Organization
- Compute nodes are dedicated to running the user application, and almost nothing else: a simple compute node kernel (CNK).
- I/O nodes run Linux and provide a more complete range of OS services: files, sockets, process launch, signaling, debugging, and termination.
- The service node performs system management services (e.g., heartbeating, monitoring errors), transparent to application software.
32. BG/L System Software
- Simplicity
  - Space-sharing
  - Single-threaded
  - No demand paging
- Familiarity
  - MPI (MPICH2)
  - IBM XL compilers for PowerPC
33. Operating Systems
- Front-end nodes are commodity systems running Linux.
- I/O nodes run a customized Linux kernel.
- Compute nodes use an extremely lightweight custom kernel.
- The service node is a single multiprocessor machine running a custom OS.
34. Compute Node Kernel (CNK)
- Single-user, dual-threaded
- Flat address space, no paging
- Physical resources are memory-mapped
- Provides standard POSIX functionality (mostly)
- Two execution modes:
  - Virtual node mode
  - Coprocessor mode
35. Service Node OS
- Core Management and Control System (CMCS): BG/L's global operating system.
- MMCS: Midplane Monitoring and Control System
- CIOMAN: Control and I/O Manager
- DB2 relational database
36. Running a User Job
- Compiled on, and submitted from, a front-end node.
- External scheduler.
- The service node sets up the partition and transfers the user's code to the compute nodes.
- All file I/O is done using standard Unix calls (via the I/O nodes); see the sketch after this list.
- Post-facto debugging is done on the front-end nodes.
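Application code does nothing special for this I/O forwarding. A plain C sketch of the kind of call the CNK ships to an I/O node (the file name is purely illustrative):

```c
/* Sketch: ordinary Unix-style file I/O from a compute node.  Under CNK
 * these calls are serviced by an I/O node running Linux; the application
 * source is unchanged.  The path below is purely illustrative. */
#include <stdio.h>

int main(void) {
    FILE *f = fopen("results.dat", "w");   /* serviced via an I/O node */
    if (f == NULL)
        return 1;
    fprintf(f, "checkpoint step %d\n", 42);
    fclose(f);
    return 0;
}
```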
37. Performance Issues
- User code is easily ported to BG/L.
- However, the MPI implementation requires effort and skill:
  - Torus topology instead of a crossbar
  - Special hardware, such as the collective network
38. BG/L MPI Software Architecture
- GI: Global Interrupt
- CIO: Control and I/O Protocol
- CH3: Primary communication device distributed with MPICH2
- MPD: Multipurpose Daemon
39. MPI_Bcast
40. MPI_Alltoall
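For reference, a minimal generic sketch of the two collectives benchmarked on these slides (standard MPI calls; nothing BG/L-specific is assumed). MPI_Bcast can ride the global tree, while MPI_Alltoall stresses the torus bisection bandwidth quoted on slide 27:

```c
/* Sketch of the two collectives benchmarked on the previous slides. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* MPI_Bcast: the root sends one buffer to all ranks (tree-friendly). */
    int value = (rank == 0) ? 123 : 0;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* MPI_Alltoall: every rank exchanges a block with every other rank,
     * stressing the torus bisection bandwidth. */
    int *sendbuf = malloc(size * sizeof(int));
    int *recvbuf = malloc(size * sizeof(int));
    for (int i = 0; i < size; i++)
        sendbuf[i] = rank * size + i;
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```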
41. References
- IBM Journal of Research and Development, Vol. 49, No. 2/3. http://www.research.ibm.com/journal/rd49-23.html
  - Overview of the Blue Gene/L system architecture
  - Packaging the Blue Gene/L supercomputer
  - Blue Gene/L compute chip: Memory and Ethernet subsystems
  - Blue Gene/L torus interconnection network
  - Blue Gene/L programming and operating environment
  - Design and implementation of message-passing services for the Blue Gene/L supercomputer
42. References (cont.)
- BG/L homepage @ LLNL: http://www.llnl.gov/ASC/platforms/bluegenel/
- BlueGene homepage @ IBM: http://www.research.ibm.com/bluegene/