Title: Massive Supercomputing Coping with Heterogeneity of Modern Accelerators
1. Massive Supercomputing Coping with Heterogeneity of Modern Accelerators
- Toshio Endo and Satoshi Matsuoka
- Tokyo Institute of Technology, Japan
2. Accelerators for High Performance Computing
- In HPC systems, power consumption has been, and will remain, a major concern
- SIMD accelerators are promising for their excellent Flops/Watt ratio
  - NVIDIA GeForce 8800 GTX: 375 GFlops (single precision), 200 W
  - ClearSpeed X620 accelerator: 80 GFlops, 25 W
3. Heterogeneous Architectures (1)
- HPC systems built only with special-purpose accelerators are infeasible
  - They don't directly support existing compilers, applications, MPI, or Linux
- Heterogeneous architectures are attractive for
  - Generality, provided by general-purpose CPUs
    - Typically x86/x86-64 CPUs
  - A higher Flops/Watt ratio, provided by accelerators
    - ClearSpeed accelerators, GPGPUs, Cell BE
- Examples: IBM Roadrunner, Tokyo Tech TSUBAME
4. Heterogeneous Architectures (2)
- Objectives
  - Running a large parallel application on heterogeneous systems
- Questions
  - How can we utilize heterogeneous resources effectively?
  - Are they scalable up to supercomputing scale?
- ClearSpeed X620 accelerator: 80 GFlops peak
- AMD Opteron 880: 4.8 GFlops peak / core
5. Overview of Our Work
- We take a tightly-coupled program: Linpack
- Combined usage of 10,368 Opteron cores and 648 ClearSpeed SIMD accelerators
- >60 TFlops: the world's highest Linpack performance on heterogeneous supercomputers
6. NEC/Sun/ClearSpeed/Voltaire Tokyo Tech TSUBAME Supercomputer (2006)
- SunFire X4600, 16 Opteron cores/node x 655 nodes
- Voltaire ISR9288 InfiniBand, 10 Gbps
- ClearSpeed CSX600 SIMD accelerator x 648 PCI-X boards (originally 360)
- 102 TFlops peak
  - Opteron 49.8 TF + ClearSpeed 52.2 TF
- 16th supercomputer in the world, 2nd in Asia (Top500 in Nov 2007)
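As a quick check of those peak figures (the Linpack runs use 648 of the nodes, i.e. 648 x 16 = 10,368 Opteron cores; the split into the two terms is our reading of the slide):

\[
  10{,}368 \times 4.8\ \mathrm{GFlops} \approx 49.8\ \mathrm{TFlops},
  \qquad
  648 \times 80.6\ \mathrm{GFlops} \approx 52.2\ \mathrm{TFlops},
\]

together giving the quoted peak of about 102 TFlops.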
7. Structure of a TSUBAME Node with Heterogeneous Processors
[Node diagram: a SunFire X4600 with 8 dual-core Opteron CPUs (16 cores), a ClearSpeed accelerator on PCI-X, and an InfiniBand interconnect]
8. 16 Opteron Cores x 655 Compute Nodes
- 1.6 PByte storage
- 288-port 10 Gbps InfiniBand switch x 6
- Cooling towers (20 units)
9. ClearSpeed Accelerator
- PCI-X accelerator boards
  - CSX600 SIMD processor x 2, 1 GB DRAM on board
  - 210 MHz x 2 FP x 96 SIMD x 2 = 80.6 GFlops peak
  - Configurable up to 250 MHz
  - Power: 25 W/board
- Provided software
  - Cn programming language
  - CSXL BLAS library <- used by this work
  - CSFFT library
10. DGEMM Performance of Opteron and ClearSpeed
[Graph: DGEMM performance of GOTO BLAS on an Opteron (1 core) vs. CSXL BLAS 2.50 on ClearSpeed, for a multiply of (M x B) x (B x M) matrices]
- An accelerator is equivalent to 14 CPU cores at peak
- ClearSpeed performance is much more sensitive to matrix size! (A call sketch follows below.)
- GOTO BLAS is by Kazushige Goto, U. Texas
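For concreteness, this is the kind of call both libraries serve. It is a minimal sketch, assuming both GOTO BLAS and CSXL export the standard Fortran DGEMM symbol; the values of M and B below are chosen arbitrarily, not taken from the measurements.

/* Multiply of (M x B) x (B x M), accumulated into an M x M matrix.
 * Linking against GOTO BLAS runs it on the Opteron cores; linking against
 * CSXL (assumed to provide the same standard symbol) offloads it. */
#include <stdlib.h>

/* Standard Fortran BLAS interface (column-major). */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

int main(void)
{
    int M = 4096, B = 864;            /* illustrative sizes only            */
    double alpha = -1.0, beta = 1.0;  /* the rank-B update HPL performs     */
    double *A  = calloc((size_t)M * B, sizeof(double));   /* M x B */
    double *Bm = calloc((size_t)B * M, sizeof(double));   /* B x M */
    double *C  = calloc((size_t)M * M, sizeof(double));   /* M x M */

    /* C := alpha * A * Bm + beta * C */
    dgemm_("N", "N", &M, &M, &B, &alpha, A, &M, Bm, &B, &beta, C, &M);

    free(A); free(Bm); free(C);
    return 0;
}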
11. Linpack: Our Target Application Benchmark
- Linpack is a numerical benchmark used in Top500
  - Solve an N x N dense system of linear equations
- HPL (High-Performance Linpack) by A. Petitet et al.
  - A well-known parallel MPI implementation
- Matrix multiply (DGEMM) is the most time-consuming computation: O(N^3) in total (see the count below)
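The operation count behind that O(N^3), stated here for reference:

\[
  \text{flops}(N) \;\approx\; \tfrac{2}{3}N^{3} + O(N^{2}),
\]

and nearly all of the cubic term is spent in the DGEMM updates of the trailing matrix.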
12. Data Decomposition in HPL
[Figure: matrix distribution over 6 (2 x 3) processes]
- The matrix is uniformly distributed with a 2D block-cyclic distribution (a small sketch of the mapping follows)
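A minimal sketch of the 2D block-cyclic ownership rule, assuming a P x Q process grid and B x B blocks; the 2 x 3 grid matches the figure, while N and B below are illustrative.

/* Block (I, J) of the block grid is owned by process (I mod P, J mod Q). */
#include <stdio.h>

int main(void)
{
    int N = 12, B = 2;        /* 12x12 matrix, 2x2 blocks -> 6x6 block grid */
    int P = 2, Q = 3;         /* 2x3 process grid, as in the figure         */
    int nb = N / B;

    for (int I = 0; I < nb; I++) {
        for (int J = 0; J < nb; J++) {
            int prow = I % P;         /* process row owning block row I     */
            int pcol = J % Q;         /* process column owning block col J  */
            printf("(%d,%d) ", prow, pcol);
        }
        printf("\n");
    }
    return 0;
}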
13. Flow of HPL (simplified)
[Diagram: each process repeatedly performs panel factorization etc., then matrix multiply (DGEMM) for its own data, and goes back to the top]
- Performance is bottlenecked by the slowest process (see the loop sketch below)
- HPL is designed for uniform systems
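The loop the diagram depicts, as a simplified sketch; this is not HPL's actual source, and every helper below is a stub invented for illustration.

#include <stdio.h>

static int  owns_panel(int k, int rank)     { return k % 6 == rank; }   /* stub */
static void factorize_panel(int k)          { printf("fact %d\n", k); } /* stub */
static void broadcast_panel(int k)          { (void)k; /* MPI bcast in HPL */ }
static void dgemm_update_own_blocks(int k)  { (void)k; /* DGEMM on own data */ }

static void hpl_loop_sketch(int n_panels, int my_rank)
{
    for (int k = 0; k < n_panels; k++) {
        if (owns_panel(k, my_rank))
            factorize_panel(k);          /* "Panel fact, etc."              */
        broadcast_panel(k);              /* share the panel with peers      */
        dgemm_update_own_blocks(k);      /* DGEMM on the process's own data */
        /* processes advance in lock step, so the slowest one sets the pace */
    }
}

int main(void) { hpl_loop_sketch(6, 0); return 0; }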
14. Requirements on a Heterogeneous System
- HPL is designed for homogeneous systems, but
  - Intra-node heterogeneity: a node has both general-purpose CPUs and a SIMD accelerator
  - Inter-node heterogeneity: (in the previous TSUBAME configuration) about half the nodes have accelerators, while the others do not
- We want to keep modifications to the HPL source code small
- How can we run HPL efficiently on heterogeneous systems?
15. Three System Configurations
[Diagram of the three configurations: no heterogeneity; intra-node heterogeneity only; both intra-node and inter-node heterogeneity]
16. Our Basic Policy (1/2)
- For intra-node heterogeneity, we virtualize heterogeneous processors at the library layer
  - Processors are providers of DGEMM performance
- We control the mapping between processes and processors
[Diagram: example of the mapping between processes and processors during DGEMM]
17. Our Basic Policy (2/2)
- For inter-node heterogeneity, we control the number of processes per node
  - cf. Charm++ and AMPI from UIUC
- We can keep the kernel workload of each process uniform (good for HPL), while still accommodating the heterogeneity
18. Careful Tuning Is Necessary for Performance
- Since SIMD accelerators are sensitive to many HPL parameters, careful tuning is necessary
  - Process granularity
  - Process mapping
  - Block size
- We need different tuning for each system configuration
19. Tuning of Process Granularity
[Diagram: granularity ranges from coarse (few processes, many BLAS threads each) to fine (many processes, one BLAS thread each)]
- We can tune process granularity via the number of BLAS threads per process (a sketch follows below)
- If processes are too coarse (a process uses many threads), it is more difficult to balance among nodes
- If too fine, HPL suffers from duplicated computation
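A minimal sketch of the trade-off on a 16-core node, assuming a BLAS threaded via OpenMP so that omp_set_num_threads() controls it; GOTO BLAS's own thread-count control may differ, so treat the mechanism as illustrative only.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int cores_per_node   = 16;   /* SunFire X4600: 16 Opteron cores          */
    int procs_per_node   = 4;    /* chosen granularity: 4 HPL processes/node */
    int threads_per_proc = cores_per_node / procs_per_node;

    /* Coarser (fewer processes, more threads each) gives each DGEMM larger
     * work but is harder to balance across nodes; finer (more processes,
     * 1 thread each) balances easily but duplicates HPL bookkeeping work. */
    omp_set_num_threads(threads_per_proc);
    printf("%d processes/node x %d BLAS threads each\n",
           procs_per_node, threads_per_proc);
    return 0;
}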
20. Tuning of Block Size
- When the block size B is small, ClearSpeed performance is heavily degraded
- When B is too large, HPL suffers from large overhead for panel factorization (a rough estimate follows)
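A back-of-the-envelope count (not from the slides) of why large B hurts: factoring one N' x B panel costs roughly N'B^2 flops, and there are N/B panels, so

\[
  W_{\mathrm{panel}} \;\approx\; \sum_{k=0}^{N/B-1} (N - kB)\,B^{2}
  \;\approx\; \frac{N^{2}B}{2},
\]

which grows linearly with B while the O(N^3) DGEMM work stays fixed; the chosen B therefore balances ClearSpeed's preference for large blocks against this panel-factorization overhead.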
21. Tuning in the CPU-Only Case
- Focus: bringing out the performance of the BLAS on the CPUs
[Diagram: 648 nodes, each with 16 Opteron cores]
- Block size is 240, which is good for GOTO BLAS
22. Tuning in the Fully-Acc'd Case
- Focus
  - Process granularity / block size should be large enough for the ClearSpeed BLAS
  - Balancing among processes, while utilizing both kinds of processors
[Diagram: 648 nodes, each with 16 Opteron cores and a ClearSpeed board; some CPU cores are reserved for PCI-X communication]
- Block size is 864, which is good for the ClearSpeed BLAS
23. Tuning in the Half-Acc'd Case
- Focus: balance between accelerated nodes and non-accelerated nodes
[Diagram: 288 nodes without ClearSpeed and 360 nodes with ClearSpeed; on accelerated nodes, some CPU cores are reserved for PCI-X communication]
24. Experimentation
- 648 SunFire X4600 nodes in TSUBAME
- Modified HPL + Voltaire MPI + GOTO BLAS + CSXL BLAS
- Three configurations
  - CPU Only: only Opteron CPUs are used
  - Half Acc'd: only half the nodes are accelerated
  - Fully Acc'd: all the nodes are accelerated
25. Experimental Results
[Graph: relative speed, normalized to CPU Only = 1]
- 38.18 TF in CPU Only
- 48.88 TF in Half-Acc'd
  - 28% over CPU Only
- 63.51 TF in Fully-Acc'd
- Check the precise figures in the next Top500 list in June?
26. Summary
- Scalability of heterogeneous supercomputers with SIMD accelerators is demonstrated
  - >60 TFlops Linpack performance is achieved
- Our method works efficiently even when nodes are only partially accelerated
- Future work
  - From hand tuning to automatic tuning
  - Other useful applications!
27. Our Basic Policy (2/2)
- Two types of HPL processes are introduced
  - CPU processes use GOTO BLAS's DGEMM
  - SIMD processes throw DGEMM requests to the accelerator
[Diagram: CPU processes, SIMD processes, and SIMD servers]
- Additional SIMD servers are introduced as multiplexers (a dispatch sketch follows)
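A hypothetical sketch of that role split, not the authors' code: the request structure, simd_server_submit(), and the transport behind it are invented for illustration, while the CPU path calls the standard DGEMM (link against GOTO BLAS or any BLAS).

#include <stdio.h>

/* Standard Fortran BLAS DGEMM, provided by GOTO BLAS on the host. */
extern void dgemm_(const char*, const char*, const int*, const int*, const int*,
                   const double*, const double*, const int*,
                   const double*, const int*, const double*, double*, const int*);

typedef enum { CPU_PROCESS, SIMD_PROCESS } role_t;

typedef struct {                  /* one DGEMM request */
    int m, n, k;
    double alpha, beta;
    const double *A, *B;
    double *C;
} dgemm_req_t;

/* Hypothetical: hand the request to a SIMD-server process, which multiplexes
 * requests from several SIMD processes onto the CSXL BLAS on the board. */
static void simd_server_submit(const dgemm_req_t *r)
{
    printf("forwarding a %d x %d x %d DGEMM to the SIMD server\n",
           r->m, r->n, r->k);
}

/* The DGEMM entry point a process uses; its role decides where the work runs. */
static void my_dgemm(role_t role, dgemm_req_t *r)
{
    if (role == CPU_PROCESS)      /* run on the host cores via GOTO BLAS */
        dgemm_("N", "N", &r->m, &r->n, &r->k, &r->alpha,
               r->A, &r->m, r->B, &r->k, &r->beta, r->C, &r->m);
    else                          /* offload through the SIMD server */
        simd_server_submit(r);
}

int main(void)
{
    dgemm_req_t r = { 864, 864, 864, 1.0, 0.0, NULL, NULL, NULL };
    my_dgemm(SIMD_PROCESS, &r);   /* CPU_PROCESS would need real matrices */
    return 0;
}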
28. Mapping between Processes and Nodes
- Peak DGEMM performance per node is
  - 120 GFlops with an accelerator
  - 70 GFlops without an accelerator
  - Roughly 7 : 4 (checked below)
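Checking that ratio (reading the slide's figure as 7 : 4 is our assumption, but the arithmetic supports it):

\[
  \frac{120\ \mathrm{GFlops}}{70\ \mathrm{GFlops}} \approx 1.71 \approx \frac{7}{4},
\]

so an accelerated node should host about 7 HPL processes for every 4 hosted by a non-accelerated node, which is consistent with the seven processes per accelerated node mentioned on the next slide.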
29. Mapping between Processes and Processors
- Processors are divided into seven processes
- We need to consider CPU usage for communication with the accelerator (black region in the diagram)
- Remaining idle CPU cores are used to help the accelerator