1
Massive Supercomputing Coping with Heterogeneity
of Modern Accelerators
  • Toshio Endo and Satoshi Matsuoka
  • Tokyo Institute of Technology, Japan

2
Accelerators for High Performance Computing
  • In HPC systems, power consumption has been and
    will remain a major concern
  • SIMD accelerators are promising for their
    excellent Flops/Watt ratio

NVIDIA GeForce 8800 GTX, 375 GFlops (single
precision), 200 W
ClearSpeed X620 accelerator, 80 GFlops, 25 W
3
Heterogeneous Architectures (1)
  • HPC systems built only with special-purpose
    accelerators are infeasible
  • They don't directly support existing compilers,
    applications, MPI, or Linux
  • Heterogeneous architectures are attractive
    because they provide
  • Generality from general-purpose CPUs
  • Typically x86/x86-64 CPUs
  • A higher Flops/Watt ratio from accelerators
  • ClearSpeed accelerators, GPGPUs, Cell BE
  • Examples: IBM Roadrunner, Tokyo Tech TSUBAME

4
Heterogeneous Architectures (2)
  • Objectives
  • Running a large parallel application on
    heterogeneous systems
  • Questions
  • How can we utilize heterogeneous resources
    effectively?
  • Are they scalable up to supercomputing scale?

ClearSpeed X620 accelerator: 80 GFlops peak
AMD Opteron 880: 4.8 GFlops peak / core

5
Overview of Our Work
  • We take a tightly-coupled program, Linpack
  • Combined usage of 10,368 Opteron cores and 648
    ClearSpeed SIMD accelerators
  • >60 TFlops: the world's highest Linpack
    performance on heterogeneous supercomputers


6
NEC/Sun/ClearSpeed/Voltaire Tokyo Tech TSUBAME
Supercomputer (2006)
SunFire X4600, 16 Opteron cores/node x 655 nodes
Voltaire ISR9288 InfiniBand, 10 Gbps
ClearSpeed CSX600 SIMD accelerator x 360 PCI-X
boards, later upgraded to 648
  • 102 TFlops peak
  • Opteron 49.8 TF + ClearSpeed 52.2 TF

16th supercomputer in the world, 2nd in Asia
(Top500 in Nov 2007)
7
Structure of TSUBAME Node with Heterogeneous
Processors
ClearSpeed Accelerator on PCI-X
InfiniBand Interconnect
8 dual-core Opteron CPUs (16 cores)
SunFire X4600
8
16 Opteron cores x 655 compute nodes
1.6 PByte storage
288-port 10 Gbps InfiniBand switch x 6
Cooling towers (20 units)
9
ClearSpeed Accelerator
  • PCI-X accelerator boards
  • CSX600 SIMD processor x 2, 1 GB DRAM on board
  • 210 MHz x 2 FP x 96 SIMD x 2 = 80.6 GFlops peak
  • Configurable up to 250MHz
  • Power 25W/board
  • Provided software
  • Cn programming language
  • CSXL BLAS library <- used by this work
  • CSFFT library

10
DGEMM Performance of Opteron and ClearSpeed
GOTO BLAS on Opteron (1 core)
CSXL BLAS 2.50 on ClearSpeed
Multiplication of an (M x B) matrix by a (B x M) matrix
  • An accelerator is equivalent to 14 CPU cores at
    peak
  • ClearSpeed performance is much more sensitive to
    matrix size! (A DGEMM call sketch follows below.)

- GOTO BLAS is by Kazushige Goto, U. Texas
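
As an aside (not on the original slides): a minimal sketch of the measured DGEMM shape, written against the standard CBLAS interface. cblas_dgemm is the generic BLAS entry point; whether GOTO BLAS or the CSXL library actually services the call is decided by how the executable is linked, which is not shown here.

```c
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

/* HPL-style trailing-matrix update C -= A * B, with A of size m x nb and
 * B of size nb x m: the DGEMM shape benchmarked on this slide. */
static void trailing_update(int m, int nb,
                            const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, m, nb,
                -1.0, A, m,
                      B, nb,
                 1.0, C, m);
}

int main(void)
{
    const int m = 864, nb = 864;            /* block size used on TSUBAME */
    double *A = calloc((size_t)m * nb, sizeof *A);
    double *B = calloc((size_t)nb * m, sizeof *B);
    double *C = calloc((size_t)m * m,  sizeof *C);

    trailing_update(m, nb, A, B, C);
    printf("C[0] = %f\n", C[0]);            /* 0.0 for all-zero inputs */

    free(A); free(B); free(C);
    return 0;
}
```
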
11
Linpack: Our Target Application Benchmark
  • Linpack is the numerical benchmark used for the Top500
  • It solves an N x N dense linear system
  • HPL (High-Performance Linpack) by A. Petitet
  • A well-known MPI parallel implementation
  • Matrix multiply (DGEMM) is the most
    time-consuming part: O(N^3) work in total

12
Data Decomposition in HPL
Matrix distribution over 6 (2 x 3) processes
  • The matrix is uniformly distributed with a 2D
    block-cyclic distribution (see the sketch below)

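Not on the original slides: a minimal sketch of the 2D block-cyclic ownership rule assumed here, where block (bi, bj) lives on the process at grid position (bi mod P, bj mod Q) of a P x Q grid. HPL's actual grid-mapping options are richer; this shows only the basic idea.

```c
#include <stdio.h>

/* Block (bi, bj) is owned by process (bi mod P, bj mod Q) in a P x Q grid;
 * the grid is numbered row-major here. */
static int owner_rank(int bi, int bj, int P, int Q)
{
    int prow = bi % P;          /* process row holding block row bi    */
    int pcol = bj % Q;          /* process column holding block col bj */
    return prow * Q + pcol;     /* row-major rank in the P x Q grid    */
}

int main(void)
{
    /* The 2 x 3 grid from the slide: print which rank owns each of the
     * first 4 x 6 blocks. */
    for (int bi = 0; bi < 4; bi++) {
        for (int bj = 0; bj < 6; bj++)
            printf("%d ", owner_rank(bi, bj, 2, 3));
        printf("\n");
    }
    return 0;
}
```

For the 2 x 3 grid, the printout repeats the same 2 x 3 pattern of ranks every two block rows and three block columns, which is the cyclic reuse that keeps the load uniform across homogeneous processes.
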
13
Flow of HPL (simplified)
[Figure: three processes side by side, each looping over panel
factorization (etc.), then matrix multiply (DGEMM) for its own data,
then back to the top]
  • Performance is bottlenecked by the slowest
    process
  • HPL is designed for uniform systems

14
Requirements on Heterogeneous System
  • HPL is designed for homogeneous systems, but
  • Intra-node heterogeneity: a node has both general
    purpose CPUs and a SIMD accelerator
  • Inter-node heterogeneity: (in the previous TSUBAME
    configuration) about half the nodes have
    accelerators, while the others do not
  • We want to keep modification to HPL source code
    small
  • How can we run HPL efficiently on heterogeneous
    systems?

15
Three System Configurations
  • CPU only: no heterogeneity
  • Fully accelerated: intra-node hetero
  • Half accelerated: intra-node hetero + inter-node hetero
16
Our Basic Policy (1/2)
  • For intra-node heterogeneity, we virtualize
    heterogeneous processors at the library layer
  • Processors are providers of DGEMM performance
  • We control the mapping between processes and
    processors (see the dispatch sketch after this
    slide)

[Figure: example of the mapping from processes to processors during DGEMM]
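
Not the authors' code, only a sketch of what "virtualizing processors at the library layer" can look like: every HPL process calls one DGEMM entry point, and a per-process setting decides whether the call goes to the CPU BLAS or to the accelerator. The names my_dgemm, cpu_dgemm, accel_dgemm and the DGEMM_BACKEND variable are hypothetical; a real build would route to GOTO BLAS and CSXL.

```c
#include <stdio.h>
#include <stdlib.h>

typedef enum { BACKEND_CPU, BACKEND_ACCEL } backend_t;

/* Stubs standing in for the real back ends (GOTO BLAS / CSXL). */
static void cpu_dgemm(int m, int n, int k)
{
    printf("CPU DGEMM %d x %d x %d\n", m, n, k);
}

static void accel_dgemm(int m, int n, int k)
{
    printf("accelerator DGEMM %d x %d x %d\n", m, n, k);
}

/* Chosen once per process, e.g. from a variable exported by the job
 * launcher, so HPL itself needs no source change for the mapping.
 * DGEMM_BACKEND is a hypothetical name. */
static backend_t my_backend(void)
{
    const char *s = getenv("DGEMM_BACKEND");
    return (s && s[0] == 'a') ? BACKEND_ACCEL : BACKEND_CPU;
}

/* Single DGEMM entry point seen by the HPL process. */
void my_dgemm(int m, int n, int k)
{
    if (my_backend() == BACKEND_ACCEL)
        accel_dgemm(m, n, k);
    else
        cpu_dgemm(m, n, k);
}

int main(void)
{
    my_dgemm(864, 864, 864);    /* block size used in the accelerated runs */
    return 0;
}
```
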
17
Our Basic Policy (2/2)
  • For inter-node heterogeneity, we control the
    number of processes per node
  • cf. Charm++ and AMPI from UIUC
  • We can keep the kernel workload of each process
    uniform (good for HPL?), while accommodating
    heterogeneity (a sketch follows this slide)

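For illustration only (the per-node peaks are taken from the backup slide near the end; the search loop is an assumption, not the authors' method): a sketch of picking integer process counts per node type so that each process ends up with roughly the same DGEMM capability.

```c
#include <stdio.h>

/* Give each node a process count proportional to its DGEMM capability:
 * try small integer pairs (a, p) and report the splits that keep the
 * per-process performance within 5% of each other.  Several pairs
 * qualify; the slides use roughly 7:4. */
int main(void)
{
    const double accel_node = 120.0;   /* GFlops, node with ClearSpeed */
    const double plain_node =  70.0;   /* GFlops, node without         */

    for (int a = 1; a <= 8; a++) {
        for (int p = 1; p <= 8; p++) {
            double per_a = accel_node / a;
            double per_p = plain_node / p;
            double imbalance = per_a > per_p ? per_a / per_p : per_p / per_a;
            if (imbalance < 1.05)
                printf("%d procs on accelerated node, %d on plain node "
                       "(~%.1f GFlops each)\n", a, p, per_a);
        }
    }
    return 0;
}
```
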
18
Careful Tuning is Necessary for Performance
  • Since SIMD accelerator performance is sensitive
    to many HPL parameters, careful tuning is necessary
  • Process granularity
  • Process mapping
  • Block size
  • We need different tuning for each system
    configuration

19
Tuning of Process Granularity
Coarse
  • We can tune process granularity via the number of
    BLAS threads per process (see the sketch below)
  • If processes are too coarse (a process uses many
    threads), it is harder to balance load among
    nodes
  • If too fine, HPL suffers from duplicated
    computation

Fine
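
A small illustrative sketch (not from the slides) of the granularity knob on a 16-core node: choosing more HPL processes per node leaves fewer BLAS threads for each one, so "coarse vs. fine" is really one parameter. How the thread count is handed to GOTO BLAS at run time is omitted here.

```c
#include <stdio.h>

/* On a 16-core SunFire X4600 node, splitting the cores among more HPL
 * processes means fewer BLAS threads per process: coarse = few fat
 * processes, fine = many thin ones. */
int main(void)
{
    const int cores = 16;
    const int choices[] = { 1, 2, 4, 8, 16 };   /* processes per node */

    for (int i = 0; i < 5; i++) {
        int procs   = choices[i];
        int threads = cores / procs;            /* BLAS threads each  */
        printf("%2d processes/node x %2d BLAS threads  (%s)\n",
               procs, threads,
               procs <= 2 ? "coarse" : procs >= 8 ? "fine" : "medium");
    }
    return 0;
}
```
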
20
Tuning of Block Size
  • When block size B is small, ClearSpeed
    performance is heavily degraded
  • When B is too large, HPL suffers from large
    overhead for panel factorization

21
Tuning on CPU-only Case
  • The focus is bringing out the BLAS performance
    of the CPUs

x 648
16 Opteron cores
  • Block size is 240, which is good for GOTO BLAS

22
Tuning on Fully-Accd Case
  • The focus is
  • Process granularity and block size should be
    large enough for the ClearSpeed BLAS
  • Balancing among processes, while utilizing both
    kinds of processors

ClearSpeed
x 648
16 Opteron cores
For PCI-X communication
  • Block size is 864, which is good for ClearSpeed
    BLAS

23
Tuning on Half-Accd Case
  • The focus is balance between accelerated and
    non-accelerated nodes

Node w/o ClearSpeed
x 288
Node with ClearSpeed
ClearSpeed
x 360
For PCI-X
  • Block size is 864

24
Experimentation
  • 648 SunFire X4600 nodes in TSUBAME
  • Modified HPL + Voltaire MPI + GOTO BLAS + CSXL
    BLAS
  • Three configurations
  • CPU Only: only Opteron CPUs are used
  • Half Accd: only half the nodes are accelerated
  • Fully Accd: all the nodes are accelerated

25
Experimental Results
Relative speed (CPU only = 1)
  • 38.18 TF in CPU-only
  • 48.88 TF in Half-Accd
  • 28% over CPU Only
  • 63.51 TF in Fully-Accd
  • Check the precise figures in the next Top500 list in June?

26
Summary
  • The scalability of heterogeneous supercomputers
    with SIMD accelerators is demonstrated
  • >60 TFlops Linpack performance is achieved
  • Our method works efficiently even when nodes are
    only partially accelerated
  • Future work
  • From hand-tuning to automatic tuning
  • Other useful applications!

27
Our Basic Policy (2/2)
  • Two types of HPL processes are introduced
  • CPU processes use GOTO BLAS's DGEMM
  • SIMD processes send DGEMM requests to the accelerator

[Figure: CPU processes, SIMD processes, and a SIMD server within one node]
Additional SIMD servers are introduced as a
multiplexer (one plausible shape is sketched below)
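
The slides do not describe the SIMD server's internals, so the following is only one plausible sketch: a queue that lets several SIMD processes share (multiplex) the single ClearSpeed board, with the server draining requests one at a time. The clients are modelled as threads here for brevity, whereas the real SIMD processes are separate OS processes.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

typedef struct { int m, n, k; } dgemm_req_t;    /* request: shape only */

#define QCAP 64
static dgemm_req_t queue[QCAP];                 /* no overflow handling; sketch only */
static int head, tail;
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

/* Called by a SIMD process (modelled here as a thread) to hand off work. */
void submit(dgemm_req_t r)
{
    pthread_mutex_lock(&mu);
    queue[tail++ % QCAP] = r;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mu);
}

/* SIMD server loop: drains the queue and serializes access to the board. */
void *simd_server(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&mu);
        while (head == tail)
            pthread_cond_wait(&cv, &mu);
        dgemm_req_t r = queue[head++ % QCAP];
        pthread_mutex_unlock(&mu);
        /* A real server would issue the CSXL DGEMM here. */
        printf("board runs DGEMM %d x %d x %d\n", r.m, r.n, r.k);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, simd_server, NULL);
    submit((dgemm_req_t){ 864, 864, 864 });
    submit((dgemm_req_t){ 864, 864, 864 });
    sleep(1);                                   /* let the server drain */
    return 0;
}
```
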
28
Mapping between Processes and Nodes
  • Peak DGEMM performance per node is
  • 120 GFlops with accelerator
  • 70 GFlops w/o accelerator

Roughly 7:4 (since 120 : 70 ≈ 7 : 4)
29
Mapping between Processes and Processors
Processors are divided among seven processes
  • We need to consider CPU usage for communication
    with the accelerator (black region)
  • The remaining idle CPU cores are used to help the
    accelerator