1
Next KEK machine
  • Shoji Hashimoto (KEK)
  • at the 3rd ILFT Network Workshop, Jefferson Lab.,
    Oct. 3-6, 2005

2
KEK supercomputer
  • A leading computing facility of its time
  • 1985 Hitachi S810/10 350 MFlops
  • 1989 Hitachi S820/80 3 GFlops
  • 1995 Fujitsu VPP500 128 GFlops
  • 2000 Hitachi SR8000 F1 1.2 TFlops
  • 2006 ???

3
Formality
  • The KEK Large Scale Simulation Program calls for
    proposals of projects to be performed on the
    supercomputer.
  • Open to Japanese researchers working on high
    energy accelerator science (particle and nuclear
    physics, astrophysics, accelerator science,
    material science related to the Photon Factory).
  • The Program Advisory Committee (PAC) decides on
    approval and machine-time allocation.

4
Usage
  • Lattice QCD is the dominant user.
  • About 60-80% of the computer time goes to lattice QCD.
  • Among that, 60% is used by the JLQCD collaboration.
  • Others include Hatsuda-Sasaki, Nakamura et al.,
    Suganuma et al., Suzuki et al. (Kanazawa).
  • Simulation for accelerator design is another big
    user: beam-beam simulation for the KEK-B factory.

5
JLQCD collaboration
  • 1995 (on VPP500)
  • Continuum limit in the quenched approximation

(Results shown: m_s, B_K, f_B, f_D)
6
JLQCD collaboration
  • 2000 (on SR8000)
  • Dynamical QCD with the improved Wilson fermion

(Results shown: m_V vs m_PS^2, f_B, f_Bs, Kl3 form factor)
7
Around the triangle
8
The wall
  • Chiral extrapolation: very hard to push the quark
    mass below m_s/2.
  • A problem for every physical quantity.
  • May be solved by new algorithms and machines.

New generation of dynamical QCD
9
Upgrade
  • Thanks to Hideo Matsufuru (Computing Research
    Center, KEK) for his hard work.
  • Upgrade scheduled for March 1st, 2006.
  • Called for bids from vendors.
  • At least 20x more computing power, measured
    mainly using the QCD codes.
  • No restriction on architecture (scalar or vector,
    etc.), but some portion must be a shared-memory
    machine.
  • The decision was made recently.

10
The next machine
  • A combination of two systems:
  • Hitachi SR11000 K1, 16 nodes, 2.15 TFlops peak
    performance.
  • IBM Blue Gene/L, 10 racks, 57.3 TFlops peak
    performance.
  • Hitachi Ltd. is the prime contractor.

11
Hitachi SR11000 K1
To be announced tomorrow.
  • POWER5, 2.1 GHz, dual core, 2 simultaneous
    multiply-adds per cycle (8.4 GFlops/core), 1.875
    MB L2 (on chip), 36 MB L3 (off chip).
  • 8.5 GB/s chip-memory bandwidth, hardware and
    software prefetch.
  • 16-way SMP (134.4 GFlops/node), 32 GB memory
    (DDR2 SDRAM) per node.
  • 16 nodes (2.15 TFlops); see the arithmetic check
    after this list.
  • Interconnect: Federation switch, 8 GB/s
    (bidirectional).
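The quoted peak numbers are consistent with the clock and issue rates; a simple arithmetic check (not taken from the slides):

\[
\begin{aligned}
\text{per core:}\quad & 2.1\ \text{GHz} \times 2\ \text{(multiply/add per cycle)} \times 2\ \text{(flops per multiply/add)} = 8.4\ \text{GFlops},\\
\text{per node:}\quad & 16 \times 8.4\ \text{GFlops} = 134.4\ \text{GFlops},\\
\text{system:}\quad & 16 \times 134.4\ \text{GFlops} \approx 2.15\ \text{TFlops}.
\end{aligned}
\]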

12
SR11000 node
13
16-way SMP
14
High Density Module
15
IBM Blue Gene/L
  • Node: 2 PowerPC 440 cores (dual-core chip), 700 MHz,
    double FPU (5.6 GFlops/chip), 4 MB on-chip L3
    (shared), 512 MB memory.
  • Interconnect: 3D torus, 1.4 Gbps/link (6 links in, 6
    out) from each node.
  • Midplane: 8x8x8 nodes (2.87 TFlops); rack = 2
    midplanes.
  • 10-rack system (see the arithmetic check after this
    list).
  • All the information in the following comes from the
    IBM Redbooks (ibm.com/redbooks) and articles in the
    IBM Journal of Research and Development.
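The same kind of arithmetic check for the BG/L peak figures:

\[
\begin{aligned}
\text{per chip:}\quad & 0.7\ \text{GHz} \times 2\ \text{cores} \times 2\ \text{(fused multiply-add)} \times 2 = 5.6\ \text{GFlops},\\
\text{midplane:}\quad & 8 \times 8 \times 8 \times 5.6\ \text{GFlops} \approx 2.87\ \text{TFlops},\\
\text{10 racks:}\quad & 10 \times 2\ \text{midplanes} \times 2.87\ \text{TFlops} \approx 57.3\ \text{TFlops}.
\end{aligned}
\]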

16
BG/L system
10 Racks
17
BG/L node ASIC
  • Double floating-point unit (FPU) added to the
    PPC440 core: 2 fused multiply-adds per cycle per core.
  • Not a true SMP: L1 has no cache coherency; L2 has
    a snoop.
  • Shared 4 MB L3.
  • Communication between the two cores goes through the
    multi-ported shared SRAM buffer.
  • Embedded memory controller and networks.

18
Compute node modes
  • Virtual node mode: use both CPUs separately,
    running a different process on each core.
    Communication using MPI, etc. Memory and
    bandwidth are shared. (A minimal MPI sketch
    follows this list.)
  • Co-processor mode: use the secondary processor as
    a co-processor for communication. Peak performance
    is ½.
  • Hybrid node mode: use the secondary processor
    also for computation. Needs special care about
    the L1 cache incoherency. Used for Linpack.
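To make virtual node mode concrete: each core appears to the application as an ordinary MPI rank, so a standard MPI program needs no source change; the mode is selected at job launch. A minimal sketch using only standard MPI calls (the 2-ranks-per-node accounting is an illustration, not taken from the slides):

  /* vn_hello.c: in virtual node mode the job simply runs with twice as
     many MPI ranks (one per core); the source code is unchanged. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      /* Virtual node mode: size = 2 x (number of nodes);
         co-processor mode:  size = number of nodes.        */
      if (rank == 0)
          printf("running with %d MPI ranks\n", size);
      MPI_Finalize();
      return 0;
  }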

19
QCD code optimization
  • Jun Doi and Hikaru Samukawa (IBM Japan).
  • Use the virtual node mode.
  • Fully use the double FPU (hand-written assembler
    code).
  • Use a low-level communication API.

20
Double FPU
  • SIMD extension of the PPC440.
  • 32 pairs of 64-bit FP registers; addresses are
    shared.
  • Quadword load and store.
  • Primary and secondary pipelines; fused
    multiply-add in each pipe.
  • Cross operations are possible: best suited for
    complex arithmetic.

21
Examples
22
SU(3) matrix × vector

  y0 = u00*x0 + u01*x1 + u02*x2
  y1 = u10*x0 + u11*x1 + u12*x2
  y2 = u20*x0 + u21*x1 + u22*x2

Complex multiplication u00*x0:

  re(y0) = re(u00)*re(x0),  im(y0) = re(u00)*im(x0)      FXPMUL  (y0,u00,x0)
  re(y0) -= im(u00)*im(x0), im(y0) += im(u00)*re(x0)     FXCXNPMA(y0,u00,x0,y0)

Then + u01*x1 + u02*x2:

  FXCPMADD(y0,u01,x1,y0)   FXCXNPMA(y0,u01,x1,y0)
  FXCPMADD(y0,u02,x2,y0)   FXCXNPMA(y0,u02,x2,y0)

Must be combined with the other rows to avoid a pipeline
stall (wait of 5 cycles).
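For readers without the instruction tables at hand, here is a minimal scalar C sketch of the same decomposition of one output row into a paired multiply followed by parallel and cross multiply-adds. It is a reference illustration only, not the hand-written assembler of the actual code; the names are hypothetical.

  /* One row of y = U*x for SU(3), written out the way the double FPU
     executes it: FXPMUL, then FXCXNPMA, then FXCPMADD/FXCXNPMA pairs. */
  typedef struct { double re, im; } cplx;

  static cplx row0(const cplx u[3][3], const cplx x[3])
  {
      cplx y;
      /* FXPMUL(y0,u00,x0): multiply both parts of x0 by re(u00) */
      y.re = u[0][0].re * x[0].re;
      y.im = u[0][0].re * x[0].im;
      /* FXCXNPMA(y0,u00,x0,y0): cross terms with im(u00) */
      y.re -= u[0][0].im * x[0].im;
      y.im += u[0][0].im * x[0].re;
      for (int j = 1; j < 3; j++) {
          /* FXCPMADD: accumulate re(u0j)*xj into both parts */
          y.re += u[0][j].re * x[j].re;
          y.im += u[0][j].re * x[j].im;
          /* FXCXNPMA: cross terms with im(u0j) */
          y.re -= u[0][j].im * x[j].im;
          y.im += u[0][j].im * x[j].re;
      }
      return y;
  }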
23
Scheduling
  • 32+32 registers can hold 32 complex numbers.
  • 3x3 (= 9) for a gauge link, 3x4 (= 12) for a spinor;
    need 2 spinors, for input and output.
  • Load the gauge link while computing, using 6+6
    registers. Straightforward for y = U*x, but not so
    for y = conjg(U)*x.
  • Use the inline assembler of gcc; xlf and xlc have
    intrinsic functions.
  • Early xlf/xlc was not good enough to produce such
    code, but has improved more recently.

24
Parallelization on BG/L
  • Example: 24^3 x 48 lattice.
  • Use the virtual node mode.
  • For a midplane, divide the entire lattice onto
    2x8x8x8 processors. For one rack, 2x8x8x16. (The 2
    is intra-node.)
  • To use more than one rack, a 32^3 x 64 lattice is
    the minimum.
  • Each processor then holds a 12x3x3x6 (or 12x3x3x3)
    local lattice; see the sketch after this list.
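The local sizes follow from dividing each global lattice extent by the corresponding processor-grid extent. A small C sketch of that bookkeeping (function and variable names are illustrative, not from the actual code):

  #include <stdio.h>

  /* Local extent on each processor = global extent / grid extent. */
  static void local_extents(const int global[4], const int grid[4], int local[4])
  {
      for (int mu = 0; mu < 4; mu++)
          local[mu] = global[mu] / grid[mu];   /* must divide evenly */
  }

  int main(void)
  {
      int global[4]   = {24, 24, 24, 48};   /* 24^3 x 48 lattice            */
      int midplane[4] = { 2,  8,  8,  8};   /* virtual node mode, 1024 ranks */
      int rack[4]     = { 2,  8,  8, 16};   /* one rack, 2048 ranks          */
      int local[4];

      local_extents(global, midplane, local);
      printf("midplane: %dx%dx%dx%d per process\n",
             local[0], local[1], local[2], local[3]);   /* 12x3x3x6 */

      local_extents(global, rack, local);
      printf("rack:     %dx%dx%dx%d per process\n",
             local[0], local[1], local[2], local[3]);   /* 12x3x3x3 */
      return 0;
  }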

25
Communication
  • Communication is fast:
  • 6 links to nearest neighbors, 1.4 Gbps
    (bidirectional) per link.
  • Latency is 140 ns for one hop.
  • MPI is too heavy:
  • Needs an additional buffer copy, wasting cache and
    memory bandwidth.
  • Multi-threading is not available in the virtual node
    mode.
  • Overlapping computation and communication is not
    possible within MPI.

26
QCD Enhancement Package
  • Low-level communication API.
  • Directly send/recv by accessing the torus
    interface FIFO. No copy to a memory buffer.
  • Non-blocking send, blocking recv.
  • Up to 224 bytes of data per send/recv (a spinor
    at one site is 192 bytes; see the count below).
  • Assumes nearest-neighbor communication.
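The 192-byte figure is simply the size of one site's spinor in double precision:

\[
4\ \text{(spin)} \times 3\ \text{(color)} \times 2\ \text{(re, im)} \times 8\ \text{bytes} = 192\ \text{bytes} < 224\ \text{bytes}.
\]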

27
An example
  #define BGLNET_WORK_REG   30
  #define BGLNET_HEADER_REG 30
  BGLNetQuad *fifo;

  BGLNet_Send_WaitReady(BGLNET_X_PLUS, fifo, 6);   /* create packet header            */
  for (i = 0; i < Nx; i++) {
      /* put results into registers 24--29 */
      BGLNet_Send_Enqueue_Header(fifo);            /* packet header to the send buffer */
      BGLNet_Send_Enqueue(fifo, 24);               /* data to the send buffer          */
      BGLNet_Send_Enqueue(fifo, 25);
      BGLNet_Send_Enqueue(fifo, 26);
      BGLNet_Send_Enqueue(fifo, 27);
      BGLNet_Send_Enqueue(fifo, 28);
      BGLNet_Send_Enqueue(fifo, 29);
      BGLNet_Send_Packet(fifo);                    /* kick!                            */
  }
28
Benchmark
  • Wilson solver (BiCGStab):
  • 24^3 x 48 lattice on a midplane (8x8x8 = 512 nodes,
    half rack).
  • 29.2% of the peak performance (sustained-rate
    estimate after this list).
  • 32.6% if only the Dslash is measured.
  • Domain-wall solver (CG):
  • 24^3 x 48 lattice on a midplane, Ns = 16.
  • Does not fit in the on-chip L3.
  • 22% of the peak performance.
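Translating the Wilson-solver fraction into a sustained number (simple arithmetic from the peak figures above, not quoted on the slides):

\[
0.292 \times 2.87\ \text{TFlops (midplane peak)} \approx 0.84\ \text{TFlops sustained}.
\]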

29
Comparison
50% improvement
30
Physics target
  • Future opportunities: ab initio calculations at
    the physical quark masses.
  • Using dynamical overlap fermions.
  • Details are under discussion (actions,
    algorithms, etc.).
  • Primitive code has been written; test runs are
    ongoing on the SR8000.
  • Many things to do by March.

31
Summary
  • The new KEK machine will be made available to the
    Japanese lattice community on March 1st, 2006.
  • Hitachi SR11000 (2.15 TFlops) + IBM Blue Gene/L
    (57.3 TFlops).