1
Architecture of Parallel Computers CSC / ECE 506
BlueGene Architecture
  • 4/26/2007
  • Dr. Steve Hunter

2
BlueGene/L Program
  • December 1999: IBM Research announced a five-year, US $100M effort to build a petaflop/s-scale supercomputer to attack science problems such as protein folding. Goals:
  • Advance the state of the art of scientific
    simulation.
  • Advance the state of the art in computer design
    and software for capability and capacity
    markets.
  • November 2001: Announced a research partnership with Lawrence Livermore National Laboratory (LLNL).
  • November 2002: Announced the planned acquisition of a BG/L machine by LLNL as part of the ASCI Purple contract.
  • May 11, 2004: Four racks of DD1 hardware (4,096 nodes at 500 MHz) ran Linpack at 11.68 TFlop/s, ranked No. 4 on the 23rd Top500 list.
  • June 2, 2004: Two racks of DD2 hardware (1,024 nodes at 700 MHz) ran Linpack at 8.655 TFlop/s, ranked No. 8 on the 23rd Top500 list.
  • September 16, 2004: Eight racks ran Linpack at 36.01 TFlop/s.
  • November 8, 2004: Sixteen racks ran Linpack at 70.72 TFlop/s, ranked No. 1 on the 24th Top500 list.
  • December 21, 2004: First 16 racks of BG/L accepted by LLNL.

3
BlueGene/L Program
  • Massive collection of low-power CPUs instead of a
    moderate-sized collection of high-power CPUs.
  • A joint development of IBM and DOE's National Nuclear Security Administration (NNSA), installed at DOE's Lawrence Livermore National Laboratory
  • BlueGene/L has occupied the No. 1 position on the last three TOP500 lists (http://www.top500.org/)
  • It has reached a Linpack benchmark performance of
    280.6 TFlop/s (teraflops or trillions of
    calculations per second) and still remains the
    only system ever to exceed the level of 100
    TFlop/s.
  • BlueGene/L holds the No. 1 and No. 3 positions in the top 10.
  • Objective was to retain exceptional
    cost/performance levels achieved by
    application-specific machines, while generalizing
    the massively parallel architecture enough to
    enable a relatively broad class of applications
    - Overview of BG/L system architecture, IBM JRD
  • Design approach was to use a very high level of
    integration that made simplicity in packaging,
    design, and bring-up possible
  • JRD issue available at http://www.research.ibm.com/journal/rd49-23.html

4
BlueGene/L Program
  • BlueGene is a family of supercomputers.
  • BlueGene/L is the first step, aimed at being a multipurpose, massively parallel, cost-effective supercomputer (12/04)
  • BlueGene/P is the petaflop generation (12/06)
  • BlueGene/Q is the third generation (2010)
  • Requirements for future generations
  • Processors will be more powerful.
  • Networks will be higher bandwidth.
  • Applications developed on BlueGene/L will run well on BlueGene/P.

5
BlueGene/L Fundamentals
  • Low-complexity nodes give more flops per transistor and per watt
  • A 3D interconnect suits many scientific simulations, since nature as we see it is 3D

6
BlueGene/L Fundamentals
  • Cellular architecture
  • Large numbers of low power, more efficient
    processors interconnected
  • Rmax of 280.6 TFlop/s
  • Maximum LINPACK performance achieved
  • Rpeak of 360 TFlop/s
  • Theoretical peak performance (a worked estimate follows this list)
  • 65,536 dual-processor compute nodes
  • 700MHz IBM PowerPC 440 processors
  • 512 MB memory per compute node, 16 TB in entire
    system.
  • 800 TB of disk space
  • 2,500 square feet
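As a sanity check on the Rpeak figure above, here is a back-of-the-envelope estimate. It assumes (not stated on the slide) that each PowerPC 440 core's double floating-point unit can retire two fused multiply-adds, i.e. four flops, per cycle:

\[
R_{\text{peak}} \approx 65{,}536\ \text{nodes} \times 2\ \tfrac{\text{cores}}{\text{node}} \times 4\ \tfrac{\text{flops}}{\text{cycle}} \times 700\ \text{MHz} \approx 367\ \text{TFlop/s},
\]

which is consistent with the roughly 360 TFlop/s peak quoted above.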

7
Comparing Systems (Peak)
8
Comparing Systems (Byte/Flop)
  • Red Storm: 2.0 (2003)
  • Earth Simulator: 2.0 (2002)
  • Intel Paragon: 1.8 (1992)
  • nCUBE/2: 1.0 (1990)
  • ASCI Red: 1.0 (0.6) (1997)
  • T3E: 0.8 (1996)
  • BG/L: 1.5 = 0.75 (torus) + 0.75 (tree) (2004)
  • Cplant: 0.1 (1997)
  • ASCI White: 0.1 (2000)
  • ASCI Q: 0.05 (Quadrics, 2003)
  • ASCI Purple: 0.1 (2004)
  • Intel Cluster: 0.1 (InfiniBand, 2004)
  • Intel Cluster: 0.008 (GbE, 2003)
  • Virginia Tech: 0.16 (InfiniBand, 2003)
  • Chinese Academy of Sciences: 0.04 (QsNet, 2003)
  • NCSA - Dell: 0.04 (Myrinet, 2003)

9
Comparing Systems (GFlops/Watt)
  • Power efficiencies of recent supercomputers
  • Blue: IBM machines
  • Black: other US machines
  • Red: Japanese machines
IBM Journal of Research and Development
10
Comparing Systems
10 megawatts: the approximate usage of 11,000 households
11
BG/L Summary of Performance Results
  • DGEMM (Double-precision, GEneral
    Matrix-Multiply)
  • 92.3% of dual-core peak on 1 node
  • Observed performance at 500 MHz: 3.7 GFlops
  • Projected performance at 700 MHz: 5.2 GFlops (tested in the lab up to 650 MHz)
  • LINPACK
  • 77% of peak on 1 node
  • 70% of peak on 512 nodes (1,435 GFlops at 500 MHz)
  • sPPM (simplified Piecewise Parabolic Method), UMT2000
  • Single processor performance roughly on par with
    POWER3 at 375 MHz
  • Tested on up to 128 nodes (also NAS Parallel
    Benchmarks)
  • FFT (Fast Fourier Transform)
  • Up to 508 MFlops on single processor at 444 MHz
    (TU Vienna)
  • Pseudo-ops performance (5N log N) at 700 MHz of 1,300 MFlops (65% of peak)
  • STREAM: impressive results even at 444 MHz
  • Tuned: Copy 2.4 GB/s, Scale 2.1 GB/s, Add 1.8 GB/s, Triad 1.9 GB/s
  • Standard: Copy 1.2 GB/s, Scale 1.1 GB/s, Add 1.2 GB/s, Triad 1.2 GB/s
  • At 700 MHz it would beat the STREAM numbers of most high-end microprocessors
  • MPI
  • Latency < 4,000 cycles (5.5 µs at 700 MHz); see the worked conversion after this list
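The cycle count and the time quoted above are consistent with the 700 MHz clock:

\[
\frac{4{,}000\ \text{cycles}}{700\ \text{MHz}} \approx 5.7\ \mu\text{s},
\qquad
5.5\ \mu\text{s} \times 700\ \text{MHz} \approx 3{,}850\ \text{cycles} < 4{,}000\ \text{cycles}.
\]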

12
BlueGene/L Architecture
  • To achieve this level of integration, the machine
    was developed around a processor with moderate
    frequency, available in system-on-a-chip (SoC)
    technology
  • This approach was chosen because of the
    performance/power advantage
  • In terms of performance/watt, the low-frequency, low-power, embedded IBM PowerPC core consistently outperforms high-frequency, high-power microprocessors by a factor of 2 to 10
  • Industry focuses on performance per rack
  • Performance/rack = Performance/watt × Watts/rack (a worked example follows this list)
  • Watts/rack ≈ 20 kW for power and thermal-cooling reasons
  • Power and cooling
  • Using conventional techniques, a 360 TFlop/s machine would require 10-20 megawatts.
  • BlueGene/L uses only 1.76 megawatts.
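Putting the identity above together with the deck's own peak and power figures gives a rough power-efficiency number:

\[
\frac{\text{Performance}}{\text{rack}} = \frac{\text{Performance}}{\text{watt}} \times \frac{\text{watts}}{\text{rack}},
\qquad
\frac{360\ \text{TFlop/s (peak)}}{1.76\ \text{MW}} \approx 0.2\ \text{GFlop/s per watt}.
\]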

13
Microprocessor Power Density Growth
14
System Power Comparison
15
BlueGene/L Architecture
  • Networks were chosen with extreme scaling in mind
  • Scale efficiently in terms of both performance
    and packaging
  • Support very small messages
  • As small as 32 bytes
  • Includes hardware support for collective
    operations
  • Broadcast, reduction, scan, etc.
  • Reliability, Availability and Serviceability
    (RAS) is another critical issue for scaling
  • BG/L needs to be reliable and usable even at extreme scaling limits
  • 20 failures per 1,000,000,000 hours ≈ 1 node failure every 4.5 weeks (a worked check follows this list)
  • System software and monitoring are also important to scaling
  • BG/L is designed to efficiently utilize a distributed-memory, message-passing programming model
  • MPI is the dominant message-passing model, with hardware features added and parameters tuned for it
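The failure-rate bullet above checks out as simple arithmetic: a per-node rate of 20 failures per 10^9 hours across 65,536 nodes gives a system mean time between node failures of

\[
\text{MTBF}_{\text{system}} \approx \frac{10^{9}\ \text{h}}{20 \times 65{,}536} \approx 763\ \text{h} \approx 4.5\ \text{weeks}.
\]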

16
RAS (Reliability, Availability, Serviceability)
  • System designed for RAS from top to bottom
  • System issues
  • Redundant bulk supplies, power converters, fans,
    DRAM bits, cable bits
  • Extensive data logging (voltage, temperature, recoverable errors, etc.) for failure forecasting
  • Nearly no single points of failure
  • Chip design
  • ECC on all SRAMs
  • All dataflow outside processors is protected by
    error-detection mechanisms
  • Access to all state via noninvasive back door
  • Low power, simple design leads to higher
    reliability
  • All interconnects have multiple error-detection and correction coverage
  • Virtually zero escape probability for link errors

17
BlueGene/L System
136.8 TFlop/s on LINPACK (64K processors); 1 TF = 1,000,000,000,000 Flops (Rochester Lab, 2005)
18
BlueGene/L System
19
BlueGene/L System
20
BlueGene/L System
21
Physical Layout of BG/L
22
Midplanes and Racks
23
The Compute Chip
  • System-on-a-chip (SoC)
  • 1 ASIC
  • 2 PowerPC processors
  • L1 and L2 Caches
  • 4MB embedded DRAM
  • DDR DRAM interface and DMA controller
  • Network connectivity hardware
  • Control / monitoring equip. (JTAG)

24
Compute Card
25
Node Card
26
BlueGene/L Compute ASIC
  • IBM CU-11, 0.13 µm
  • 11 x 11 mm die size
  • 25 x 32 mm CBGA
  • 474 pins, 328 signal
  • 1.5/2.5 Volt

27
BlueGene/L Interconnect Networks
  • 3 Dimensional Torus
  • Main network, for point-to-point communication
  • High-speed, high-bandwidth
  • Interconnects all compute nodes (65,536)
  • Virtual cut-through hardware routing
  • 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
  • 1 µs latency between nearest neighbors, 5 µs to
    the farthest
  • 4 µs latency for one hop with MPI, 10 µs to the
    farthest
  • Communications backbone for computations
  • 0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth
  • Global Tree
  • One-to-all broadcast functionality
  • Reduction operations functionality
  • MPI collective ops in hardware (a minimal MPI sketch follows this list)
  • Fixed-size 256-byte packets
  • 2.8 Gb/s of bandwidth per link
  • Latency of a one-way tree traversal: 2.5 µs
  • 23 TB/s total binary-tree bandwidth (64K machine)
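To make the broadcast and reduction bullets concrete, here is a minimal MPI sketch of the two collective patterns the global tree accelerates. It uses only standard MPI calls; nothing BG/L-specific is assumed, and the values being broadcast and reduced are placeholders.

/* Minimal MPI sketch: a one-to-all broadcast followed by an all-to-one
 * sum reduction, the collective patterns the tree network supports.
 * Build with an MPI compiler wrapper, e.g.: mpicc collectives.c */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* One-to-all: the root distributes a parameter to every rank. */
    double dt = (rank == 0) ? 0.001 : 0.0;
    MPI_Bcast(&dt, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* All-to-one: combine per-rank partial results on the root. */
    double local = (double)rank;      /* stand-in for a local partial sum */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("dt = %g, global sum = %g over %d ranks\n", dt, global, nprocs);

    MPI_Finalize();
    return 0;
}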

28
The Torus Network
  • 3-dimensional: 64 × 32 × 32
  • Each compute node is connected to its six neighbors: x+, x-, y+, y-, z+, z- (a small sketch of the wrap-around arithmetic follows this list)
  • Compute card is 1 × 2 × 1
  • Node card is 4 × 4 × 2
  • 16 compute cards in a 4 × 2 × 2 arrangement
  • Midplane is 8 × 8 × 8
  • 16 node cards in a 2 × 2 × 4 arrangement
  • Communication path
  • Each unidirectional link is 1.4 Gb/s, or 175 MB/s.
  • Each node can send and receive at 1.05 GB/s.
  • Supports cut-through routing, along with both deterministic and adaptive routing.
  • Variable-sized packets of 32, 64, 96, ..., 256 bytes
  • Guarantees reliable delivery
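The link numbers above are self-consistent: 6 outgoing plus 6 incoming links per node at 175 MB/s each gives 12 × 175 MB/s = 2.1 GB/s per node, i.e. 1.05 GB/s in each direction, matching the figures on this slide and the previous one. The wrap-around neighbor arithmetic itself is just modular addition; here is an illustrative C sketch (the dimensions come from the slide, while the function name and example coordinates are made up for illustration):

/* Neighbor arithmetic on a 64 x 32 x 32 3D torus: each node (x, y, z)
 * has six neighbors, one in each of +x, -x, +y, -y, +z, -z, with
 * wrap-around at the edges of every dimension. */
#include <stdio.h>

#define DIM_X 64
#define DIM_Y 32
#define DIM_Z 32

/* Step a coordinate by +1 or -1, wrapping around the torus dimension. */
static int wrap(int coord, int delta, int dim)
{
    return (coord + delta + dim) % dim;
}

int main(void)
{
    int x = 63, y = 0, z = 15;   /* an example node on the machine's edge */

    printf("+x neighbor: (%d,%d,%d)\n", wrap(x, +1, DIM_X), y, z); /* wraps to x = 0  */
    printf("-x neighbor: (%d,%d,%d)\n", wrap(x, -1, DIM_X), y, z);
    printf("+y neighbor: (%d,%d,%d)\n", x, wrap(y, +1, DIM_Y), z);
    printf("-y neighbor: (%d,%d,%d)\n", x, wrap(y, -1, DIM_Y), z); /* wraps to y = 31 */
    printf("+z neighbor: (%d,%d,%d)\n", x, y, wrap(z, +1, DIM_Z));
    printf("-z neighbor: (%d,%d,%d)\n", x, y, wrap(z, -1, DIM_Z));
    return 0;
}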

29
Complete BlueGene/L System at LLNL
[Diagram: 65,536 BG/L compute nodes and 1,024 BG/L I/O nodes connect through a federated Gigabit Ethernet switch (2,048 ports) to the cluster-wide file system (CWFS), archive, visualization, the WAN, front-end nodes, and the service node; the service node also reaches the machine through a separate control network.]
30
System Software Overview
  • Operating system - Linux
  • Compilers - IBM XL C, C++, Fortran 95
  • Communication - MPI, TCP/IP
  • Parallel File System - GPFS, NFS support
  • System Management - extensions to CSM
  • Job scheduling - based on LoadLeveler
  • Math libraries - ESSL

31
BG/L Software Hierarchical Organization
  • Compute nodes are dedicated to running the user application, and almost nothing else - a simple compute node kernel (CNK)
  • I/O nodes run Linux and provide a more complete range of OS services: files, sockets, process launch, signaling, debugging, and termination
  • Service node performs system management services
    (e.g., heart beating, monitoring errors) -
    transparent to application software

32
BG/L System Software
  • Simplicity
  • Space-sharing
  • Single-threaded
  • No demand paging
  • Familiarity
  • MPI (MPICH2)
  • IBM XL Compilers for PowerPC

33
Operating Systems
  • Front-end nodes are commodity systems running
    Linux
  • I/O nodes run a customized Linux kernel
  • Compute nodes use an extremely lightweight custom
    kernel
  • Service node is a single multiprocessor machine
    running a custom OS

34
Compute Node Kernel (CNK)
  • Single user, dual-threaded
  • Flat address space, no paging
  • Physical resources are memory-mapped
  • Provides standard POSIX functionality (mostly)
  • Two execution modes
  • Virtual node mode
  • Coprocessor mode

35
Service Node OS
  • Core Management and Control System (CMCS)
  • BG/L's global operating system.
  • MMCS - Midplane Monitoring and Control System
  • CIOMAN - Control and I/O Manager
  • DB2 relational database

36
Running a User Job
  • Jobs are compiled and submitted from a front-end node.
  • External scheduler
  • The service node sets up the partition and transfers the user's code to the compute nodes.
  • All file I/O is done using standard Unix calls (via the I/O nodes); see the sketch after this list.
  • Post-facto debugging done on front-end nodes.
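As an illustration of the "standard Unix calls" point above, application code performs ordinary POSIX file I/O and, on BG/L, the compute node kernel forwards such calls to an I/O node for service. The sketch below is generic POSIX code; the file path is hypothetical.

/* Plain POSIX file I/O, written as it would be on any Unix system.
 * On BG/L these calls are shipped from the compute node kernel to an
 * I/O node, which performs them against the parallel file system. */
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    const char *path = "/gpfs/results/output.dat";   /* hypothetical path */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    const char buf[] = "simulation results\n";
    if (write(fd, buf, sizeof buf - 1) < 0)
        perror("write");

    close(fd);
    return 0;
}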

37
Performance Issues
  • User code is easily ported to BG/L.
  • However, an efficient MPI implementation requires effort and skill:
  • Torus topology instead of a crossbar (see the Cartesian-communicator sketch after this list)
  • Special hardware, such as the collective network.
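One standard, portable way for application code to cope with a torus rather than a crossbar is to describe its communication pattern as a periodic Cartesian grid, which the MPI library may then map onto neighboring nodes. This is a generic MPI-1 sketch, not the deck's specific tuning advice; the 3D decomposition it builds is chosen by MPI_Dims_create at run time.

/* Sketch: express the application's communication pattern as a periodic
 * 3D Cartesian grid so the MPI library can map it onto a torus network. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int dims[3] = {0, 0, 0};          /* let MPI factor nprocs into 3 dims   */
    MPI_Dims_create(nprocs, 3, dims);
    int periods[3] = {1, 1, 1};       /* periodic in every dim, i.e. a torus */

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, /* reorder = */ 1, &cart);

    int rank, coords[3], left, right;
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, 3, coords);

    /* Ranks of the -x and +x neighbors, wrapping around the torus. */
    MPI_Cart_shift(cart, 0, 1, &left, &right);
    printf("rank %d at (%d,%d,%d): x-neighbors %d and %d\n",
           rank, coords[0], coords[1], coords[2], left, right);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}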

38
BG/L MPI Software Architecture
Legend: GI = Global Interrupt; CIO = Control and I/O Protocol; CH3 = primary communication device distributed with MPICH2; MPD = Multipurpose Daemon
39
MPI_Bcast
40
MPI_Alltoall
41
References
  • IBM Journal of Research and Development, Vol. 49,
    No. 2-3.
  • http://www.research.ibm.com/journal/rd49-23.html
  • Overview of the Blue Gene/L system architecture
  • Packaging the Blue Gene/L supercomputer
  • Blue Gene/L compute chip Memory and Ethernet
    subsystems
  • Blue Gene/L torus interconnection network
  • Blue Gene/L programming and operating
    environment
  • Design and implementation of message-passing
    services for the Blue Gene/L supercomputer

42
References (cont.)
  • BG/L homepage @ LLNL: <http://www.llnl.gov/ASC/platforms/bluegenel/>
  • BlueGene homepage @ IBM: <http://www.research.ibm.com/bluegene/>

43
  • The End