Title: Massive Supercomputing Coping with Heterogeneity of Modern Accelerators
1. Massive Supercomputing Coping with Heterogeneity of Modern Accelerators
- Toshio Endo and Satoshi Matsuoka
- Tokyo Institute of Technology, Japan
2. Accelerators for High Performance Computing
- In HPC systems, power consumption has been, and will remain, a major concern
- SIMD accelerators are promising for their excellent Flops/Watt ratio
  - NVIDIA GeForce 8800 GTX: 375 GFlops (single precision), 200 W
  - ClearSpeed X620 accelerator: 80 GFlops, 25 W
3. Heterogeneous Architectures (1)
- HPC systems built only with special-purpose accelerators are infeasible
  - They don't directly support existing compilers, applications, MPI, or Linux
- Heterogeneous architectures are attractive for
  - Generality, provided by general-purpose CPUs
    - Typically x86/x86-64 CPUs
  - A higher Flops/Watt ratio, provided by accelerators
    - ClearSpeed accelerators, GPGPUs, Cell BE
- Examples: IBM Roadrunner, Tokyo Tech TSUBAME
4. Heterogeneous Architectures (2)
- Objectives
  - Running a large parallel application on heterogeneous systems
- Questions
  - How can we utilize heterogeneous resources effectively?
  - Are they scalable up to supercomputing scale?
- ClearSpeed X620 accelerator: 80 GFlops peak
- AMD Opteron 880: 4.8 GFlops peak / core
5. Overview of Our Work
- We take a tightly-coupled program: Linpack
- Combined usage of 10,368 Opteron cores and 648 ClearSpeed SIMD accelerators
- >60 TFlops: the world's highest Linpack performance on heterogeneous supercomputers
6. NEC/Sun/ClearSpeed/Voltaire Tokyo Tech TSUBAME Supercomputer (2006)
- SunFire X4600, 16 Opteron cores/node x 655 nodes
- Voltaire ISR9288 InfiniBand, 10 Gbps
- ClearSpeed CSX600 SIMD accelerator x 648 PCI-X boards (originally 360)
- 102 TFlops peak
  - Opteron 49.8 TF + ClearSpeed 52.2 TF
- 16th supercomputer in the world, 2nd in Asia (Top500 in Nov 2007)
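As a quick check of those peak figures (the Linpack runs use 648 of the nodes, i.e. 648 x 16 = 10,368 Opteron cores; the split into the two terms is our reading of the slide):

\[
  10{,}368 \times 4.8\ \mathrm{GFlops} \approx 49.8\ \mathrm{TFlops},
  \qquad
  648 \times 80.6\ \mathrm{GFlops} \approx 52.2\ \mathrm{TFlops},
\]

together giving the quoted peak of about 102 TFlops.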
7. Structure of a TSUBAME Node with Heterogeneous Processors
[Node diagram: a SunFire X4600 with 8 dual-core Opteron CPUs (16 cores), a ClearSpeed accelerator on PCI-X, and an InfiniBand interconnect]
8. 16 Opteron Cores x 655 Compute Nodes
- 1.6 PByte storage
- 288-port 10 Gbps InfiniBand switch x 6
- Cooling towers (20 units)
9. ClearSpeed Accelerator
- PCI-X accelerator boards
  - CSX600 SIMD processor x 2, 1 GB DRAM on board
  - 210 MHz x 2 FP x 96 SIMD x 2 = 80.6 GFlops peak
  - Configurable up to 250 MHz
  - Power: 25 W/board
- Provided software
  - Cn programming language
  - CSXL BLAS library <- used by this work
  - CSFFT library
10. DGEMM Performance of Opteron and ClearSpeed
[Graph: DGEMM performance of GOTO BLAS on an Opteron (1 core) vs. CSXL BLAS 2.50 on ClearSpeed, for a multiply of (M x B) x (B x M) matrices]
- An accelerator is equivalent to 14 CPU cores at peak
- ClearSpeed performance is much more sensitive to matrix size! (A call sketch follows below.)
- GOTO BLAS is by Kazushige Goto, U. Texas
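For concreteness, this is the kind of call both libraries serve. It is a minimal sketch, assuming both GOTO BLAS and CSXL export the standard Fortran DGEMM symbol; the values of M and B below are chosen arbitrarily, not taken from the measurements.

/* Multiply of (M x B) x (B x M), accumulated into an M x M matrix.
 * Linking against GOTO BLAS runs it on the Opteron cores; linking against
 * CSXL (assumed to provide the same standard symbol) offloads it. */
#include <stdlib.h>

/* Standard Fortran BLAS interface (column-major). */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

int main(void)
{
    int M = 4096, B = 864;            /* illustrative sizes only            */
    double alpha = -1.0, beta = 1.0;  /* the rank-B update HPL performs     */
    double *A  = calloc((size_t)M * B, sizeof(double));   /* M x B */
    double *Bm = calloc((size_t)B * M, sizeof(double));   /* B x M */
    double *C  = calloc((size_t)M * M, sizeof(double));   /* M x M */

    /* C := alpha * A * Bm + beta * C */
    dgemm_("N", "N", &M, &M, &B, &alpha, A, &M, Bm, &B, &beta, C, &M);

    free(A); free(Bm); free(C);
    return 0;
}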
11. Linpack: Our Target Application Benchmark
- Linpack is a numerical benchmark used in Top500
  - Solve an N x N dense system of linear equations
- HPL (High-Performance Linpack) by A. Petitet et al.
  - A well-known parallel MPI implementation
- Matrix multiply (DGEMM) is the most time-consuming computation: O(N^3) in total (see the count below)
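The operation count behind that O(N^3), stated here for reference:

\[
  \text{flops}(N) \;\approx\; \tfrac{2}{3}N^{3} + O(N^{2}),
\]

and nearly all of the cubic term is spent in the DGEMM updates of the trailing matrix.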
12. Data Decomposition in HPL
[Figure: matrix distribution over 6 (2 x 3) processes]
- The matrix is uniformly distributed with a 2D block-cyclic distribution (a small sketch of the mapping follows)
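A minimal sketch of the 2D block-cyclic ownership rule, assuming a P x Q process grid and B x B blocks; the 2 x 3 grid matches the figure, while N and B below are illustrative.

/* Block (I, J) of the block grid is owned by process (I mod P, J mod Q). */
#include <stdio.h>

int main(void)
{
    int N = 12, B = 2;        /* 12x12 matrix, 2x2 blocks -> 6x6 block grid */
    int P = 2, Q = 3;         /* 2x3 process grid, as in the figure         */
    int nb = N / B;

    for (int I = 0; I < nb; I++) {
        for (int J = 0; J < nb; J++) {
            int prow = I % P;         /* process row owning block row I     */
            int pcol = J % Q;         /* process column owning block col J  */
            printf("(%d,%d) ", prow, pcol);
        }
        printf("\n");
    }
    return 0;
}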
13. Flow of HPL (simplified)
[Diagram: each process repeatedly performs panel factorization etc., then matrix multiply (DGEMM) for its own data, and goes back to the top]
- Performance is bottlenecked by the slowest process (see the loop sketch below)
- HPL is designed for uniform systems
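The loop the diagram depicts, as a simplified sketch; this is not HPL's actual source, and every helper below is a stub invented for illustration.

#include <stdio.h>

static int  owns_panel(int k, int rank)     { return k % 6 == rank; }   /* stub */
static void factorize_panel(int k)          { printf("fact %d\n", k); } /* stub */
static void broadcast_panel(int k)          { (void)k; /* MPI bcast in HPL */ }
static void dgemm_update_own_blocks(int k)  { (void)k; /* DGEMM on own data */ }

static void hpl_loop_sketch(int n_panels, int my_rank)
{
    for (int k = 0; k < n_panels; k++) {
        if (owns_panel(k, my_rank))
            factorize_panel(k);          /* "Panel fact, etc."              */
        broadcast_panel(k);              /* share the panel with peers      */
        dgemm_update_own_blocks(k);      /* DGEMM on the process's own data */
        /* processes advance in lock step, so the slowest one sets the pace */
    }
}

int main(void) { hpl_loop_sketch(6, 0); return 0; }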
14. Requirements on a Heterogeneous System
- HPL is designed for homogeneous systems, but
  - Intra-node heterogeneity: a node has both general-purpose CPUs and a SIMD accelerator
  - Inter-node heterogeneity: (in the previous TSUBAME configuration) about half the nodes have accelerators, while the others do not
- We want to keep modifications to the HPL source code small
- How can we run HPL efficiently on heterogeneous systems?
15. Three System Configurations
[Diagram of the three configurations: no heterogeneity; intra-node heterogeneity only; both intra-node and inter-node heterogeneity]
16. Our Basic Policy (1/2)
- For intra-node heterogeneity, we virtualize heterogeneous processors at the library layer
  - Processors are providers of DGEMM performance
- We control the mapping between processes and processors
[Diagram: example of the mapping between processes and processors during DGEMM]
17. Our Basic Policy (2/2)
- For inter-node heterogeneity, we control the number of processes per node
  - cf. Charm++ and AMPI from UIUC
- We can keep the kernel workload of each process uniform (good for HPL), while still accommodating the heterogeneity
18. Careful Tuning Is Necessary for Performance
- Since SIMD accelerators are sensitive to many HPL parameters, careful tuning is necessary
  - Process granularity
  - Process mapping
  - Block size
- We need different tuning for each system configuration
19. Tuning of Process Granularity
[Diagram: granularity ranges from coarse (few processes, many BLAS threads each) to fine (many processes, one BLAS thread each)]
- We can tune process granularity via the number of BLAS threads per process (a sketch follows below)
- If processes are too coarse (a process uses many threads), it is more difficult to balance among nodes
- If too fine, HPL suffers from duplicated computation
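A minimal sketch of the trade-off on a 16-core node, assuming a BLAS threaded via OpenMP so that omp_set_num_threads() controls it; GOTO BLAS's own thread-count control may differ, so treat the mechanism as illustrative only.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int cores_per_node   = 16;   /* SunFire X4600: 16 Opteron cores          */
    int procs_per_node   = 4;    /* chosen granularity: 4 HPL processes/node */
    int threads_per_proc = cores_per_node / procs_per_node;

    /* Coarser (fewer processes, more threads each) gives each DGEMM larger
     * work but is harder to balance across nodes; finer (more processes,
     * 1 thread each) balances easily but duplicates HPL bookkeeping work. */
    omp_set_num_threads(threads_per_proc);
    printf("%d processes/node x %d BLAS threads each\n",
           procs_per_node, threads_per_proc);
    return 0;
}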
20. Tuning of Block Size
- When the block size B is small, ClearSpeed performance is heavily degraded
- When B is too large, HPL suffers from large overhead for panel factorization (a rough estimate follows)
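A back-of-the-envelope count (not from the slides) of why large B hurts: factoring one N' x B panel costs roughly N'B^2 flops, and there are N/B panels, so

\[
  W_{\mathrm{panel}} \;\approx\; \sum_{k=0}^{N/B-1} (N - kB)\,B^{2}
  \;\approx\; \frac{N^{2}B}{2},
\]

which grows linearly with B while the O(N^3) DGEMM work stays fixed; the chosen B therefore balances ClearSpeed's preference for large blocks against this panel-factorization overhead.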
21. Tuning in the CPU-Only Case
- Focus: bringing out the performance of the BLAS on the CPUs
[Diagram: 648 nodes, each with 16 Opteron cores]
- Block size is 240, which is good for GOTO BLAS
22. Tuning in the Fully-Acc'd Case
- Focus
  - Process granularity / block size should be large enough for the ClearSpeed BLAS
  - Balancing among processes, while utilizing both kinds of processors
[Diagram: 648 nodes, each with 16 Opteron cores and a ClearSpeed board; some CPU cores are reserved for PCI-X communication]
- Block size is 864, which is good for the ClearSpeed BLAS
23. Tuning in the Half-Acc'd Case
- Focus: balance between accelerated nodes and non-accelerated nodes
[Diagram: 288 nodes without ClearSpeed and 360 nodes with ClearSpeed; on accelerated nodes, some CPU cores are reserved for PCI-X communication]
24. Experimentation
- 648 SunFire X4600 nodes in TSUBAME
- Modified HPL + Voltaire MPI + GOTO BLAS + CSXL BLAS
- Three configurations
  - CPU Only: only Opteron CPUs are used
  - Half Acc'd: only half the nodes are accelerated
  - Fully Acc'd: all the nodes are accelerated
25. Experimental Results
[Graph: relative speed, normalized to CPU Only = 1]
- 38.18 TF in CPU Only
- 48.88 TF in Half-Acc'd
  - 28% over CPU Only
- 63.51 TF in Fully-Acc'd
- Check the precise figures in the next Top500 list in June?
26. Summary
- Scalability of heterogeneous supercomputers with SIMD accelerators is demonstrated
  - >60 TFlops Linpack performance is achieved
- Our method works efficiently even when nodes are only partially accelerated
- Future work
  - From hand tuning to automatic tuning
  - Other useful applications!
27. Our Basic Policy (2/2)
- Two types of HPL processes are introduced
  - CPU processes use GOTO BLAS's DGEMM
  - SIMD processes throw DGEMM requests to the accelerator
[Diagram: CPU processes, SIMD processes, and SIMD servers]
- Additional SIMD servers are introduced as multiplexers (a dispatch sketch follows)
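A hypothetical sketch of that role split, not the authors' code: the request structure, simd_server_submit(), and the transport behind it are invented for illustration, while the CPU path calls the standard DGEMM (link against GOTO BLAS or any BLAS).

#include <stdio.h>

/* Standard Fortran BLAS DGEMM, provided by GOTO BLAS on the host. */
extern void dgemm_(const char*, const char*, const int*, const int*, const int*,
                   const double*, const double*, const int*,
                   const double*, const int*, const double*, double*, const int*);

typedef enum { CPU_PROCESS, SIMD_PROCESS } role_t;

typedef struct {                  /* one DGEMM request */
    int m, n, k;
    double alpha, beta;
    const double *A, *B;
    double *C;
} dgemm_req_t;

/* Hypothetical: hand the request to a SIMD-server process, which multiplexes
 * requests from several SIMD processes onto the CSXL BLAS on the board. */
static void simd_server_submit(const dgemm_req_t *r)
{
    printf("forwarding a %d x %d x %d DGEMM to the SIMD server\n",
           r->m, r->n, r->k);
}

/* The DGEMM entry point a process uses; its role decides where the work runs. */
static void my_dgemm(role_t role, dgemm_req_t *r)
{
    if (role == CPU_PROCESS)      /* run on the host cores via GOTO BLAS */
        dgemm_("N", "N", &r->m, &r->n, &r->k, &r->alpha,
               r->A, &r->m, r->B, &r->k, &r->beta, r->C, &r->m);
    else                          /* offload through the SIMD server */
        simd_server_submit(r);
}

int main(void)
{
    dgemm_req_t r = { 864, 864, 864, 1.0, 0.0, NULL, NULL, NULL };
    my_dgemm(SIMD_PROCESS, &r);   /* CPU_PROCESS would need real matrices */
    return 0;
}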
28. Mapping between Processes and Nodes
- Peak DGEMM performance per node is
  - 120 GFlops with an accelerator
  - 70 GFlops without an accelerator
  - Roughly 7 : 4 (checked below)
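Checking that ratio (reading the slide's figure as 7 : 4 is our assumption, but the arithmetic supports it):

\[
  \frac{120\ \mathrm{GFlops}}{70\ \mathrm{GFlops}} \approx 1.71 \approx \frac{7}{4},
\]

so an accelerated node should host about 7 HPL processes for every 4 hosted by a non-accelerated node, which is consistent with the seven processes per accelerated node mentioned on the next slide.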
29. Mapping between Processes and Processors
- Processors are divided into seven processes
- We need to consider CPU usage for communication with the accelerator (black region in the diagram)
- Remaining idle CPU cores are used to help the accelerator