Title: Next KEK machine
1. Next KEK machine
- Shoji Hashimoto (KEK)
- at the 3rd ILFT Network Workshop at Jefferson Lab., Oct. 3-6, 2005
2. KEK supercomputer
- A leading computing facility of its time:
- 1985: Hitachi S810/10, 350 MFlops
- 1989: Hitachi S820/80, 3 GFlops
- 1995: Fujitsu VPP500, 128 GFlops
- 2000: Hitachi SR8000 F1, 1.2 TFlops
- 2006 ???
3. Formality
- The KEK Large Scale Simulation Program calls for proposals of projects to be performed on the supercomputer.
- Open to Japanese researchers working on high energy accelerator science (particle and nuclear physics, astrophysics, accelerator science, material science related to the Photon Factory).
- The Program Advisory Committee (PAC) decides the approval and machine time allocation.
4. Usage
- Lattice QCD is the dominant user.
- About 60-80% of the computer time is for lattice QCD.
- Among them, 60% is used by the JLQCD collaboration.
- Others include Hatsuda-Sasaki, Nakamura et al., Suganuma et al., Suzuki et al. (Kanazawa), ...
- Simulation for accelerator design is another big user: beam-beam simulation for the KEK-B factory.
5. JLQCD collaboration
- 1995 (on VPP500)
- Continuum limit in the quenched approximation:
  - m_s
  - B_K
  - f_B, f_D
6. JLQCD collaboration
- 2000 (on SR8000)
- Dynamical QCD with the improved Wilson fermion:
  - m_V vs m_PS^2
  - f_B, f_Bs
  - K_l3 form factor
7. Around the triangle
8. The wall
- Chiral extrapolation: very hard to go beyond m_s/2.
- A problem for every physical quantity.
- May be solved by new algorithms and machines: a new generation of dynamical QCD.
9. Upgrade
- Thanks to Hideo Matsufuru (Computing Research Center, KEK) for his hard work.
- Upgrade scheduled for March 1st, 2006.
- Called for bids from vendors:
  - At least 20x more computing power, measured mainly using the QCD codes.
  - No restriction on architecture (scalar or vector, etc.), but some amount must be a shared-memory machine.
- The decision was made recently.
10. The next machine
- A combination of two systems:
  - Hitachi SR11000 K1, 16 nodes, 2.15 TFlops peak performance.
  - IBM Blue Gene/L, 10 racks, 57.3 TFlops peak performance.
- Hitachi Ltd. is the prime contractor.
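As a rough check (peak numbers only, not sustained QCD performance): 2.15 + 57.3 = 59.45 TFlops combined peak, about 50x the 1.2 TFlops peak of the current SR8000 F1, so the 20x requirement from the bid is met with a wide margin at the level of peak performance.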
11. Hitachi SR11000 K1
Will be announced tomorrow.
- POWER5 2.1 GHz, dual core, 2 simultaneous multiply/adds per cycle (8.4 GFlops/core), 1.875 MB L2 (on chip), 36 MB L3 (off chip).
- 8.5 GB/s chip-memory bandwidth, hardware and software prefetch.
- 16-way SMP (134.4 GFlops/node), 32 GB memory (DDR2 SDRAM).
- 16 nodes (2.15 TFlops).
- Interconnect: Federation switch, 8 GB/s (bidirectional).
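These numbers are mutually consistent: 2 multiply/adds per cycle is 4 flops/cycle, so 4 x 2.1 GHz = 8.4 GFlops/core; 16 cores give 134.4 GFlops/node; 16 nodes give about 2.15 TFlops.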
12. SR11000 node
13. 16-way SMP
14. High Density Module
15. IBM Blue Gene/L
- Node: 2 PowerPC 440 cores (dual core), 700 MHz, double FPU (5.6 GFlops/chip), 4 MB on-chip L3 (shared), 512 MB memory.
- Interconnect: 3D torus, 1.4 Gbps/link (6 in, 6 out) from each node.
- Midplane: 8x8x8 nodes (2.87 TFlops); rack = 2 midplanes.
- 10-rack system.
- All the information in the following comes from the IBM Redbooks (ibm.com/redbooks) and articles in the IBM Journal of Research and Development.
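These numbers are mutually consistent: each core does 4 flops/cycle with the double FPU, so 2 cores x 4 x 0.7 GHz = 5.6 GFlops/chip; a midplane of 8x8x8 = 512 nodes gives 512 x 5.6 GFlops ≈ 2.87 TFlops; 10 racks = 20 midplanes give ≈ 57.3 TFlops.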
16. BG/L system (10 racks)
17. BG/L node ASIC
- Double floating-point unit (FPU) added to the PPC440 core: 2 fused multiply-adds per core.
- Not a true SMP: L1 has no cache coherency; L2 has a snoop.
- Shared 4 MB L3.
- Communication between the two cores through the multi-ported shared SRAM buffer.
- Embedded memory controller and networks.
18. Compute node modes
- Virtual node mode: use both CPUs separately, running a different process on each core. Communication using MPI, etc. Memory and bandwidth are shared.
- Co-processor mode: use the secondary processor as a co-processor for communication. Peak performance is 1/2.
- Hybrid node mode: use the secondary processor also for computation. Needs special care about the L1 cache incoherency. Used for Linpack.
19. QCD code optimization
- Jun Doi and Hikaru Samukawa (IBM Japan).
- Use the virtual node mode.
- Fully use the double FPU (hand-written assembler code).
- Use a low-level communication API.
20. Double FPU
- SIMD extension of the PPC440.
- 32 pairs of 64-bit FP registers; addresses are shared.
- Quadword load and store.
- Primary and secondary pipelines; fused multiply-add for each pipe.
- Cross operations possible: best suited for complex arithmetic.
21. Examples
22. SU(3) matrix x vector

  y0 = u00 x0 + u01 x1 + u02 x2
  y1 = u10 x0 + u11 x1 + u12 x2
  y2 = u20 x0 + u21 x1 + u22 x2

Complex multiplication u00 x0:
  re(y0) = re(u00) re(x0), im(y0) = re(u00) im(x0)      ->  FXPMUL (y0, u00, x0)
  re(y0) -= im(u00) im(x0), im(y0) += im(u00) re(x0)    ->  FXCXNPMA (y0, u00, x0, y0)

Adding + u01 x1 + u02 x2:
  FXCPMADD (y0, u01, x1, y0)   FXCXNPMA (y0, u01, x1, y0)
  FXCPMADD (y0, u02, x2, y0)   FXCXNPMA (y0, u02, x2, y0)

Must be combined with the other rows to avoid pipeline stalls (5-cycle wait).
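The same arithmetic can be written out in plain C to show exactly which real and imaginary products the FXPMUL / FXCXNPMA / FXCPMADD sequence computes. This is only a sketch of the data flow (array layout and function name are mine, not from the JLQCD kernel); the production code is hand-written assembler with the three rows interleaved to hide the 5-cycle latency.

  /* Plain-C illustration of y = U x for SU(3), split into the same
   * real/imaginary operations that the double FPU performs per instruction.
   * u[row][col][0/1] and x[col][0/1] hold (re, im) pairs.                  */
  static void su3_mult_vector(double y[3][2],
                              const double u[3][3][2],
                              const double x[3][2])
  {
      for (int row = 0; row < 3; row++) {
          double yre, yim;

          /* FXPMUL pattern: re(u) times both components of x           */
          yre = u[row][0][0] * x[0][0];
          yim = u[row][0][0] * x[0][1];
          /* FXCXNPMA pattern: cross terms with im(u)                    */
          yre -= u[row][0][1] * x[0][1];
          yim += u[row][0][1] * x[0][0];

          for (int col = 1; col < 3; col++) {
              /* FXCPMADD pattern: accumulate re(u) times x              */
              yre += u[row][col][0] * x[col][0];
              yim += u[row][col][0] * x[col][1];
              /* FXCXNPMA pattern: accumulate the cross terms with im(u) */
              yre -= u[row][col][1] * x[col][1];
              yim += u[row][col][1] * x[col][0];
          }

          y[row][0] = yre;
          y[row][1] = yim;
      }
  }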
23. Scheduling
- 32x2 registers can hold 32 complex numbers.
- 3x3 (= 9) for a gauge link, 3x4 (= 12) for a spinor; need 2 spinors for input and output (already 24 registers, so the full link cannot stay resident).
- Load the gauge link while computing, using 66 registers. Straightforward for y = U x, but not so for y = conjg(U) x.
- Use the inline assembler of gcc; xlf and xlc have intrinsic functions.
- Early xlf/xlc was not good enough to produce this code, but has improved more recently.
24. Parallelization on BG/L
- Example: 24^3 x 48 lattice.
- Use the virtual node mode.
- For a midplane, divide the entire lattice onto 2x8x8x8 processors; for one rack, 2x8x8x16 (the 2 is intra-node).
- To use more than one rack, a 32^3 x 64 lattice is the minimum.
- Each processor has a 12x3x3x6 (or 12x3x3x3) local lattice (see the sketch below).
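The local sublattice is simply the global lattice divided by the process grid in each direction. The following small C program (names and layout are illustrative, not from the JLQCD code) reproduces the 12x3x3x6 and 12x3x3x3 figures quoted above.

  #include <stdio.h>

  /* Compute the local sublattice for a given global lattice and
   * (virtual-node) process grid.  Assumes each global extent is
   * divisible by the corresponding grid extent.                  */
  int main(void)
  {
      const int global[4]    = {24, 24, 24, 48};  /* 24^3 x 48 lattice      */
      const int grid_mid[4]  = {2, 8, 8, 8};      /* midplane, virtual node */
      const int grid_rack[4] = {2, 8, 8, 16};     /* one rack               */

      printf("midplane: ");
      for (int mu = 0; mu < 4; mu++)
          printf("%d ", global[mu] / grid_mid[mu]);   /* -> 12 3 3 6 */
      printf("\nrack:     ");
      for (int mu = 0; mu < 4; mu++)
          printf("%d ", global[mu] / grid_rack[mu]);  /* -> 12 3 3 3 */
      printf("\n");
      return 0;
  }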
25. Communication
- Communication is fast:
  - 6 links to nearest neighbors, 1.4 Gbps (bidirectional) for each link.
  - Latency is 140 ns for one hop.
- MPI is too heavy:
  - Needs an additional buffer copy, wasting cache and memory bandwidth (see the sketch below).
  - Multi-threading is not available in the virtual node mode.
  - Overlapping computation and communication is not possible within MPI.
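For comparison with the low-level API introduced on the next slide, a plain-MPI nearest-neighbor exchange looks roughly like the sketch below (function and buffer names are illustrative, not from the JLQCD code). The explicit pack into a contiguous send buffer and the unpack into the halo are exactly the extra copies referred to above.

  #include <mpi.h>

  /* Sketch of a boundary exchange in one direction with plain MPI.
   * nface is the number of doubles on the boundary face; sendbuf and
   * recvbuf are staging buffers that cost an extra copy and cache traffic. */
  void exchange_x_plus(const double *field, double *halo,
                       double *sendbuf, double *recvbuf,
                       const int *face_index, int nface,
                       int rank_up, int rank_down, MPI_Comm comm)
  {
      MPI_Request req[2];

      /* pack: gather the scattered boundary sites into a contiguous buffer */
      for (int i = 0; i < nface; i++)
          sendbuf[i] = field[face_index[i]];

      MPI_Irecv(recvbuf, nface, MPI_DOUBLE, rank_down, 0, comm, &req[0]);
      MPI_Isend(sendbuf, nface, MPI_DOUBLE, rank_up,   0, comm, &req[1]);
      MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

      /* unpack into the halo region */
      for (int i = 0; i < nface; i++)
          halo[i] = recvbuf[i];
  }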
26. QCD Enhancement Package
- Low-level communication API.
- Directly send/recv by accessing the torus interface FIFO. No copy to a memory buffer.
- Non-blocking send, blocking recv.
- Up to 224 bytes of data to send/recv at once (a spinor at one site is 192 bytes).
- Assumes nearest-neighbor communication.
27. An example
  #define BGLNET_WORK_REG   30
  #define BGLNET_HEADER_REG 30

  BGLNetQuad fifo;

  /* Create the packet header */
  BGLNet_Send_WaitReady(BGLNET_X_PLUS, fifo, 6);

  for (i = 0; i < Nx; i++) {
      /* ... put results into registers 24--29 ... */

      /* Put the packet header to the send buffer */
      BGLNet_Send_Enqueue_Header(fifo);

      /* Put the data to the send buffer */
      BGLNet_Send_Enqueue(fifo, 24);
      BGLNet_Send_Enqueue(fifo, 25);
      BGLNet_Send_Enqueue(fifo, 26);
      BGLNet_Send_Enqueue(fifo, 27);
      BGLNet_Send_Enqueue(fifo, 28);
      BGLNet_Send_Enqueue(fifo, 29);

      /* Kick! */
      BGLNet_Send_Packet(fifo);
  }
28. Benchmark
- Wilson solver (BiCGStab):
  - 24^3 x 48 lattice on a midplane (8x8x8 = 512 nodes, half rack).
  - 29.2% of the peak performance.
  - 32.6% if only the Dslash is measured.
- Domain-wall solver (CG):
  - 24^3 x 48 lattice on a midplane, Ns = 16.
  - Doesn't fit in the on-chip L3.
  - 22% of the peak performance.
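In absolute terms, 29.2% of the 2.87 TFlops midplane peak corresponds to roughly 0.84 TFlops sustained for the Wilson solver, and 22% to roughly 0.63 TFlops for the domain-wall solver.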
29. Comparison
50% improvement
30. Physics target
- Future opportunities: ab initio calculations at the physical quark masses.
- Using dynamical overlap fermions.
- Details are under discussion (actions, algorithms, etc.).
- A primitive code has been written; test runs are on-going on the SR8000.
- Many things to do by March.
31. Summary
- The new KEK machine will be made available to the Japanese lattice community on March 1st, 2006.
- Hitachi SR11000 (2.15 TF) + IBM Blue Gene/L (57.3 TF).