1
Next KEK machine
  • Shoji Hashimoto (KEK)
  • at the 3rd ILFT Network Workshop, Jefferson Lab.,
    Oct. 3-6, 2005

2
KEK supercomputer
  • A leading computing facility of its time
  • 1985 Hitachi S810/10 350 MFlops
  • 1989 Hitachi S820/80 3 GFlops
  • 1995 Fujitsu VPP500 128 GFlops
  • 2000 Hitachi SR8000 F1 1.2 TFlops
  • 2006 ???

3
Formality
  • The KEK Large Scale Simulation Program calls for
    proposals of projects to be performed on the
    supercomputer.
  • Open to Japanese researchers working on high
    energy accelerator science (particle and nuclear
    physics, astrophysics, accelerator science,
    material science related to the Photon Factory).
  • The Program Advisory Committee (PAC) decides on
    approval and machine-time allocation.

4
Usage
  • Lattice QCD is the dominant user.
  • About 60-80% of the computer time goes to lattice QCD.
  • Among that, 60% is used by the JLQCD collaboration.
  • Others include Hatsuda-Sasaki, Nakamura et al.,
    Suganuma et al., Suzuki et al. (Kanazawa).
  • Simulation for accelerator design is another big
    user: beam-beam simulation for the KEK-B factory.

5
JLQCD collaboration
  • 1995 (on VPP500)
  • Continuum limit in the quenched approximation

(Results shown: m_s, B_K, f_B, f_D)
6
JLQCD collaboration
  • 2000 (on SR8000)
  • Dynamical QCD with the improved Wilson fermion

(Results shown: m_V vs m_PS^2, f_B, f_Bs, Kl3 form factor)
7
Around the triangle
8
The wall
  • Chiral extrapolation: very hard to push the quark
    mass below m_s/2.
  • A problem for every physical quantity.
  • May be solved by new algorithms and machines.

New generation of dynamical QCD
9
Upgrade
  • Thanks to Hideo Matsufuru (Computing Research
    Center, KEK) for his hard work.
  • Upgrade scheduled for March 1st, 2006.
  • Called for bids from vendors.
  • At least 20x more computing power, measured
    mainly using the QCD codes.
  • No restriction on architecture (scalar or vector,
    etc.), but some portion must be a shared-memory
    machine.
  • The decision was made recently.

10
The next machine
  • A combination of two systems:
  • Hitachi SR11000 K1, 16 nodes, 2.15 TFlops peak
    performance.
  • IBM Blue Gene/L, 10 racks, 57.3 TFlops peak
    performance.
  • Hitachi Ltd. is the prime contractor.

11
Hitachi SR11000 K1
To be announced tomorrow.
  • POWER5, 2.1 GHz, dual core, 2 simultaneous
    multiply-adds per cycle (8.4 GFlops/core), 1.875
    MB L2 (on chip), 36 MB L3 (off chip).
  • 8.5 GB/s chip-memory bandwidth, hardware and
    software prefetch.
  • 16-way SMP (134.4 GFlops/node), 32 GB memory
    (DDR2 SDRAM) per node.
  • 16 nodes (2.15 TFlops); see the arithmetic check
    after this list.
  • Interconnect: Federation switch, 8 GB/s
    (bidirectional).
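The quoted peak numbers are consistent with the clock and issue rates; a simple arithmetic check (not taken from the slides):

\[
\begin{aligned}
\text{per core:}\quad & 2.1\ \text{GHz} \times 2\ \text{(multiply/add per cycle)} \times 2\ \text{(flops per multiply/add)} = 8.4\ \text{GFlops},\\
\text{per node:}\quad & 16 \times 8.4\ \text{GFlops} = 134.4\ \text{GFlops},\\
\text{system:}\quad & 16 \times 134.4\ \text{GFlops} \approx 2.15\ \text{TFlops}.
\end{aligned}
\]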

12
SR11000 node
13
16-way SMP
14
High Density Module
15
IBM Blue Gene/L
  • Node: 2 PowerPC 440 cores (dual-core chip), 700 MHz,
    double FPU (5.6 GFlops/chip), 4 MB on-chip L3
    (shared), 512 MB memory.
  • Interconnect: 3D torus, 1.4 Gbps/link (6 links in, 6
    out) from each node.
  • Midplane: 8x8x8 nodes (2.87 TFlops); rack = 2
    midplanes.
  • 10-rack system (see the arithmetic check after this
    list).
  • All the information in the following comes from the
    IBM Redbooks (ibm.com/redbooks) and articles in the
    IBM Journal of Research and Development.
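The same kind of arithmetic check for the BG/L peak figures:

\[
\begin{aligned}
\text{per chip:}\quad & 0.7\ \text{GHz} \times 2\ \text{cores} \times 2\ \text{(fused multiply-add)} \times 2 = 5.6\ \text{GFlops},\\
\text{midplane:}\quad & 8 \times 8 \times 8 \times 5.6\ \text{GFlops} \approx 2.87\ \text{TFlops},\\
\text{10 racks:}\quad & 10 \times 2\ \text{midplanes} \times 2.87\ \text{TFlops} \approx 57.3\ \text{TFlops}.
\end{aligned}
\]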

16
BG/L system
10 Racks
17
BG/L node ASIC
  • Double floating-point unit (FPU) added to the
    PPC440 core: 2 fused multiply-adds per cycle per core.
  • Not a true SMP: L1 has no cache coherency; L2 has
    a snoop.
  • Shared 4 MB L3.
  • Communication between the two cores goes through the
    multi-ported shared SRAM buffer.
  • Embedded memory controller and networks.

18
Compute node modes
  • Virtual node mode: use both CPUs separately,
    running a different process on each core.
    Communication using MPI, etc. Memory and
    bandwidth are shared. (A minimal MPI sketch
    follows this list.)
  • Co-processor mode: use the secondary processor as
    a co-processor for communication. Peak performance
    is ½.
  • Hybrid node mode: use the secondary processor
    also for computation. Needs special care about
    the L1 cache incoherency. Used for Linpack.
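To make virtual node mode concrete: each core appears to the application as an ordinary MPI rank, so a standard MPI program needs no source change; the mode is selected at job launch. A minimal sketch using only standard MPI calls (the 2-ranks-per-node accounting is an illustration, not taken from the slides):

  /* vn_hello.c: in virtual node mode the job simply runs with twice as
     many MPI ranks (one per core); the source code is unchanged. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      /* Virtual node mode: size = 2 x (number of nodes);
         co-processor mode:  size = number of nodes.        */
      if (rank == 0)
          printf("running with %d MPI ranks\n", size);
      MPI_Finalize();
      return 0;
  }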

19
QCD code optimization
  • Jun Doi and Hikaru Samukawa (IBM Japan).
  • Use the virtual node mode.
  • Fully use the double FPU (hand-written assembler
    code).
  • Use a low-level communication API.

20
Double FPU
  • SIMD extension of the PPC440.
  • 32 pairs of 64-bit FP registers; addresses are
    shared.
  • Quadword load and store.
  • Primary and secondary pipelines; fused
    multiply-add in each pipe.
  • Cross operations are possible: best suited for
    complex arithmetic.

21
Examples
22
SU(3) matrix × vector

  y0 = u00*x0 + u01*x1 + u02*x2
  y1 = u10*x0 + u11*x1 + u12*x2
  y2 = u20*x0 + u21*x1 + u22*x2

Complex multiplication u00*x0:

  re(y0) = re(u00)*re(x0),  im(y0) = re(u00)*im(x0)      FXPMUL  (y0,u00,x0)
  re(y0) -= im(u00)*im(x0), im(y0) += im(u00)*re(x0)     FXCXNPMA(y0,u00,x0,y0)

Then + u01*x1 + u02*x2:

  FXCPMADD(y0,u01,x1,y0)   FXCXNPMA(y0,u01,x1,y0)
  FXCPMADD(y0,u02,x2,y0)   FXCXNPMA(y0,u02,x2,y0)

Must be combined with the other rows to avoid a pipeline
stall (wait of 5 cycles).
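For readers without the instruction tables at hand, here is a minimal scalar C sketch of the same decomposition of one output row into a paired multiply followed by parallel and cross multiply-adds. It is a reference illustration only, not the hand-written assembler of the actual code; the names are hypothetical.

  /* One row of y = U*x for SU(3), written out the way the double FPU
     executes it: FXPMUL, then FXCXNPMA, then FXCPMADD/FXCXNPMA pairs. */
  typedef struct { double re, im; } cplx;

  static cplx row0(const cplx u[3][3], const cplx x[3])
  {
      cplx y;
      /* FXPMUL(y0,u00,x0): multiply both parts of x0 by re(u00) */
      y.re = u[0][0].re * x[0].re;
      y.im = u[0][0].re * x[0].im;
      /* FXCXNPMA(y0,u00,x0,y0): cross terms with im(u00) */
      y.re -= u[0][0].im * x[0].im;
      y.im += u[0][0].im * x[0].re;
      for (int j = 1; j < 3; j++) {
          /* FXCPMADD: accumulate re(u0j)*xj into both parts */
          y.re += u[0][j].re * x[j].re;
          y.im += u[0][j].re * x[j].im;
          /* FXCXNPMA: cross terms with im(u0j) */
          y.re -= u[0][j].im * x[j].im;
          y.im += u[0][j].im * x[j].re;
      }
      return y;
  }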
23
Scheduling
  • 32+32 registers can hold 32 complex numbers.
  • 3x3 (= 9) for a gauge link, 3x4 (= 12) for a spinor;
    need 2 spinors, for input and output.
  • Load the gauge link while computing, using 6+6
    registers. Straightforward for y = U*x, but not so
    for y = conjg(U)*x.
  • Use the inline assembler of gcc; xlf and xlc have
    intrinsic functions.
  • Early xlf/xlc was not good enough to produce such
    code, but has improved more recently.

24
Parallelization on BG/L
  • Example: 24^3 x 48 lattice.
  • Use the virtual node mode.
  • For a midplane, divide the entire lattice onto
    2x8x8x8 processors. For one rack, 2x8x8x16. (The 2
    is intra-node.)
  • To use more than one rack, a 32^3 x 64 lattice is
    the minimum.
  • Each processor then holds a 12x3x3x6 (or 12x3x3x3)
    local lattice; see the sketch after this list.
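The local sizes follow from dividing each global lattice extent by the corresponding processor-grid extent. A small C sketch of that bookkeeping (function and variable names are illustrative, not from the actual code):

  #include <stdio.h>

  /* Local extent on each processor = global extent / grid extent. */
  static void local_extents(const int global[4], const int grid[4], int local[4])
  {
      for (int mu = 0; mu < 4; mu++)
          local[mu] = global[mu] / grid[mu];   /* must divide evenly */
  }

  int main(void)
  {
      int global[4]   = {24, 24, 24, 48};   /* 24^3 x 48 lattice            */
      int midplane[4] = { 2,  8,  8,  8};   /* virtual node mode, 1024 ranks */
      int rack[4]     = { 2,  8,  8, 16};   /* one rack, 2048 ranks          */
      int local[4];

      local_extents(global, midplane, local);
      printf("midplane: %dx%dx%dx%d per process\n",
             local[0], local[1], local[2], local[3]);   /* 12x3x3x6 */

      local_extents(global, rack, local);
      printf("rack:     %dx%dx%dx%d per process\n",
             local[0], local[1], local[2], local[3]);   /* 12x3x3x3 */
      return 0;
  }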

25
Communication
  • Communication is fast:
  • 6 links to nearest neighbors, 1.4 Gbps
    (bidirectional) per link.
  • Latency is 140 ns for one hop.
  • MPI is too heavy:
  • Needs an additional buffer copy, wasting cache and
    memory bandwidth.
  • Multi-threading is not available in the virtual node
    mode.
  • Overlapping computation and communication is not
    possible within MPI.

26
QCD Enhancement Package
  • Low-level communication API.
  • Directly send/recv by accessing the torus
    interface FIFO. No copy to a memory buffer.
  • Non-blocking send, blocking recv.
  • Up to 224 bytes of data per send/recv (a spinor
    at one site is 192 bytes; see the count below).
  • Assumes nearest-neighbor communication.
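The 192-byte figure is simply the size of one site's spinor in double precision:

\[
4\ \text{(spin)} \times 3\ \text{(color)} \times 2\ \text{(re, im)} \times 8\ \text{bytes} = 192\ \text{bytes} < 224\ \text{bytes}.
\]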

27
An example
  #define BGLNET_WORK_REG   30
  #define BGLNET_HEADER_REG 30
  BGLNetQuad *fifo;

  BGLNet_Send_WaitReady(BGLNET_X_PLUS, fifo, 6);   /* create packet header            */
  for (i = 0; i < Nx; i++) {
      /* put results into registers 24--29 */
      BGLNet_Send_Enqueue_Header(fifo);            /* packet header to the send buffer */
      BGLNet_Send_Enqueue(fifo, 24);               /* data to the send buffer          */
      BGLNet_Send_Enqueue(fifo, 25);
      BGLNet_Send_Enqueue(fifo, 26);
      BGLNet_Send_Enqueue(fifo, 27);
      BGLNet_Send_Enqueue(fifo, 28);
      BGLNet_Send_Enqueue(fifo, 29);
      BGLNet_Send_Packet(fifo);                    /* kick!                            */
  }
28
Benchmark
  • Wilson solver (BiCGStab):
  • 24^3 x 48 lattice on a midplane (8x8x8 = 512 nodes,
    half rack).
  • 29.2% of the peak performance (sustained-rate
    estimate after this list).
  • 32.6% if only the Dslash is measured.
  • Domain-wall solver (CG):
  • 24^3 x 48 lattice on a midplane, Ns = 16.
  • Does not fit in the on-chip L3.
  • 22% of the peak performance.
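Translating the Wilson-solver fraction into a sustained number (simple arithmetic from the peak figures above, not quoted on the slides):

\[
0.292 \times 2.87\ \text{TFlops (midplane peak)} \approx 0.84\ \text{TFlops sustained}.
\]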

29
Comparison
50% improvement
30
Physics target
  • Future opportunities: ab initio calculations at
    the physical quark masses.
  • Using dynamical overlap fermions.
  • Details are under discussion (actions,
    algorithms, etc.).
  • Primitive code has been written; test runs are
    ongoing on the SR8000.
  • Many things to do by March.

31
Summary
  • The new KEK machine will be made available to the
    Japanese lattice community on March 1st, 2006.
  • Hitachi SR11000 (2.15 TFlops) + IBM Blue Gene/L
    (57.3 TFlops).