apeNEXT

1
apeNEXT
  • Piero Vicini (piero.vicini@roma1.infn.it)
  • INFN Roma

2
APE keywords
  • Parallel system
    • Massively parallel 3D array of computing nodes with periodic boundary conditions
  • Custom system
    • Processor: extensive use of VLSI
    • Native implementation of the complex-type a x b + c operation (complex numbers)
    • Large register file
    • VLIW microcode
  • Node interconnections
    • Optimized for nearest-neighbor communication
  • Software tools
    • Apese, TAO, OS, machine simulator
  • Dense system
    • Reliable and safe HW solution
    • Custom mechanics for wide integration
  • Cheap system
    • 0.5 Euro/MFlops
    • Very low cost maintenance
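
For readers less familiar with the topology keywords above, here is a minimal C sketch (illustrative only, not APE system software) of what a 3D array of computing nodes with periodic boundary conditions means in practice: every node has six nearest neighbors, and coordinates wrap around at the mesh edges. The mesh dimensions NX, NY, NZ and all function names are assumptions chosen for the example.

    /* Minimal sketch (not apeNEXT firmware): nearest-neighbor addressing
     * on a 3D mesh of nodes with periodic boundary conditions (a torus).
     * NX, NY, NZ are illustrative machine dimensions, e.g. one rack = 8x8x8. */
    #include <stdio.h>

    #define NX 8
    #define NY 8
    #define NZ 8

    /* Linear node index from 3D coordinates. */
    static int node_id(int x, int y, int z) {
        return (x * NY + y) * NZ + z;
    }

    /* Neighbor in direction (dx,dy,dz), wrapping around at the mesh edges. */
    static int neighbour(int x, int y, int z, int dx, int dy, int dz) {
        return node_id((x + dx + NX) % NX,
                       (y + dy + NY) % NY,
                       (z + dz + NZ) % NZ);
    }

    int main(void) {
        /* Node (7,0,0): its +X neighbor wraps back to x = 0. */
        printf("node %d -> +X neighbour %d\n",
               node_id(7, 0, 0), neighbour(7, 0, 0, +1, 0, 0));
        return 0;
    }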

3
The APE family
  • Our line of Home Made Computers

                          APE (1988)    APE100 (1993)   APEmille (1999)   apeNEXT (2003)
  Architecture            SIMD          SIMD            SIMD              SIMD
  Nodes                   16            2048            2048              4096
  Topology                flexible 1D   rigid 3D        flexible 3D       flexible 3D
  Memory                  256 MB        8 GB            64 GB             1 TB
  Registers (word size)   64 (x32)      128 (x32)       512 (x32)         512 (x64)
  Clock speed              8 MHz        25 MHz          66 MHz            200 MHz
  Total computing power   1.5 GFlops    250 GFlops      2 TFlops          8-20 TFlops
4
APE (1988) - 1 GFlops
5
APE100 (1993) - 100 GFlops
PB (8 nodes) 400 MFlops
6
APEmille 1 TFlops
  • 2048 VLSI processing nodes (0.5 GFlops each)
  • SIMD, synchronous communications
  • Fully integrated host computer: 64 cPCI-based PCs

Tower: 32 PB, 128 GFlops
Processing Board (PB): 8 nodes, 4 GFlops
Computing node
7
APEmille installations
  • Bielefeld 130 GF (2 crates)
  • Zeuthen 520 GF (8 crates)
  • Milan 130 GF (2 crates)
  • Bari 65 GF (1 crate)
  • Trento 65 GF (1 crate)
  • Pisa 325 GF (5 crates)
  • Rome 1 650 GF (10 crates)
  • Rome 2 130 GF (2 crates)
  • Orsay 16 GF (1/4 crate)
  • Swansea 65 GF (1 crate)
  • Grand total 1966 GF

8
The apeNEXT architecture
  • 3D mesh of computing nodes
  • Custom VLSI processor (JT) at 200 MHz
    • 1.6 GFlops per node (complex normal)
    • 256 MB (1 GB) memory per node
  • Nearest-neighbor communication network, loosely synchronous
    • Y, Z internal; X on cables
    • 8/16-bit links, > 200 MB/s per channel
  • Scalable from 25 GFlops to 6 TFlops
    • Processing Board: 4 x 2 x 2 nodes, 26 GFlops
    • Crate (16 PBs): 4 x 8 x 8 nodes, 0.5 TFlops
    • Rack (32 PBs): 8 x 8 x 8 nodes, 1 TFlops
    • Large systems: (8n) x 8 x 8 nodes
  • Linux PCs as host system
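
As a back-of-the-envelope check of the scaling figures above, the sketch below takes the 1.6 GFlops per-node peak as its only input and recomputes the board, crate and rack totals from the node counts; the quoted 0.5 TFlops and 1 TFlops are these numbers rounded up. The struct layout is just for the illustration.

    /* Check of the apeNEXT scaling figures quoted above: each node delivers
     * 8 flops/cycle at 200 MHz = 1.6 GFlops, and the PB / crate / rack totals
     * follow from the node counts alone. */
    #include <stdio.h>

    int main(void) {
        const double node_gflops = 8 * 200e6 / 1e9;   /* 1.6 GFlops per node */

        struct { const char *name; int nx, ny, nz; } unit[] = {
            { "Processing Board", 4, 2, 2 },   /* 16 nodes  ~ 26 GFlops    */
            { "Crate (16 PB)",    4, 8, 8 },   /* 256 nodes ~ 0.4 TFlops   */
            { "Rack (32 PB)",     8, 8, 8 },   /* 512 nodes ~ 0.8 TFlops   */
        };

        for (int i = 0; i < 3; i++) {
            int nodes = unit[i].nx * unit[i].ny * unit[i].nz;
            printf("%-18s %3d nodes  %6.1f GFlops\n",
                   unit[i].name, nodes, nodes * node_gflops);
        }
        return 0;
    }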

9
Design methodology
  • VHDL incremental model of (almost) the whole system
  • Custom (VLSI and/or FPGA) components derived from the VHDL model via synthesis tools
  • Stand-alone simulation of component VHDL models; simulation of the global VHDL model
  • Powerful test-bed for test-vector generation

First-Time-Right Silicon
  • Simplified but complete model of HW-Host interaction
  • Test environment for development of the compilation chain and OS
  • Performance (architecture) evaluation at design time

Software design environment
10
Assembling apeNEXT
JT ASIC
JT module
PB
Rack
BackPlane
11
Overview of the JT Architecture
  • Peak floating-point performance of about 1.6 GFlops
    • IEEE-compliant double precision
  • Integer arithmetic performance of about 400 MIPS
  • Link bandwidth of about 200 MByte/s in each direction
    • full duplex
    • 7 links: X+, X-, Y+, Y-, Z+, Z-, plus a 7th (I/O) link
  • Support for current-generation DDR memory
    • Memory bandwidth of 3.2 GByte/s
    • 400 Mword/s
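
The memory numbers above are mutually consistent; the short plain-C check below (assuming a 64-bit word and the 128-bit memory channel quoted on the JT summary slide) recovers 200 Mtransfers/s and the quoted 400 Mword/s from the 3.2 GByte/s figure.

    /* Consistency check of the memory figures quoted above: a 128-bit data
     * path moving 3.2 GByte/s corresponds to 200 Mtransfers/s and, in 64-bit
     * words, to the quoted 400 Mword/s. (Assumes 1 word = 8 bytes.) */
    #include <stdio.h>

    int main(void) {
        const double bw_bytes   = 3.2e9;       /* 3.2 GByte/s             */
        const double bus_bytes  = 128 / 8;     /* 128-bit channel = 16 B  */
        const double word_bytes = 8;           /* 64-bit word             */

        printf("transfers/s : %.0f M\n", bw_bytes / bus_bytes / 1e6);  /* 200 */
        printf("words/s     : %.0f M\n", bw_bytes / word_bytes / 1e6); /* 400 */
        return 0;
    }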

12
JT Top Level Diagram
13
The JT Arithmetic BOX
  • Pipelined complex normal a x b + c (8 flops) per cycle

At 200 MHz (fully pipelined): 1.6 GFlops
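
As a reminder of where the factor 8 comes from, here is a plain-C sketch of the complex normal operation a x b + c: four multiplications plus four additions/subtractions per result, hence 8 flops per cycle and 1.6 GFlops at 200 MHz. It shows only the data flow, not the JT pipeline itself.

    /* The "complex normal" operation a*b + c on double-precision complex
     * operands: 4 multiplications and 4 additions/subtractions = 8 flops.
     * One result per cycle at 200 MHz gives the quoted 1.6 GFlops. */
    #include <stdio.h>

    typedef struct { double re, im; } cmplx;

    /* a*b + c : 8 floating-point operations in total. */
    static cmplx cnormal(cmplx a, cmplx b, cmplx c) {
        cmplx r;
        r.re = a.re * b.re - a.im * b.im + c.re;  /* 2 mul, 2 add/sub */
        r.im = a.re * b.im + a.im * b.re + c.im;  /* 2 mul, 2 add     */
        return r;
    }

    int main(void) {
        cmplx a = {1, 2}, b = {3, 4}, c = {5, 6};
        cmplx r = cnormal(a, b, c);
        printf("(%g, %g)\n", r.re, r.im);   /* (1+2i)(3+4i)+(5+6i) = (0, 16) */
        return 0;
    }
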
14
The JT remote IO
  • FIFO-based communication
  • LVDS
  • 1.6 Gb/s per link (8 bit @ 200 MHz)
  • 6 (+1) independent links
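
To illustrate what FIFO-based communication means at the programming level, here is a conceptual C sketch of a link FIFO with the usual full/empty flow control; the depth, types and function names are assumptions for the example, not the JT register interface. The raw rate quoted above follows from 8 bits x 200 MHz = 1.6 Gb/s, i.e. 200 MB/s per direction.

    /* Conceptual sketch of FIFO-based link communication (not the JT
     * hardware interface): the sender enqueues 64-bit words, the receiver
     * dequeues them; flow control is simply "FIFO full / FIFO empty". */
    #include <stdint.h>
    #include <stdbool.h>

    #define FIFO_DEPTH 16

    typedef struct {
        uint64_t buf[FIFO_DEPTH];
        unsigned head, tail;          /* head: next pop, tail: next push */
    } link_fifo;

    static bool fifo_push(link_fifo *f, uint64_t w) {
        unsigned next = (f->tail + 1) % FIFO_DEPTH;
        if (next == f->head) return false;        /* full: sender stalls   */
        f->buf[f->tail] = w;
        f->tail = next;
        return true;
    }

    static bool fifo_pop(link_fifo *f, uint64_t *w) {
        if (f->head == f->tail) return false;     /* empty: receiver waits */
        *w = f->buf[f->head];
        f->head = (f->head + 1) % FIFO_DEPTH;
        return true;
    }

    int main(void) {
        link_fifo f = {0};
        uint64_t w;
        fifo_push(&f, 0xCAFE);              /* sender side   */
        return fifo_pop(&f, &w) ? 0 : 1;    /* receiver side */
    }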

15
JT summary
  • CMOS 0.18 um, 7 metal layers (ATMEL)
  • 200 MHz
  • Double-precision complex normal operation
  • 64-bit AGU
  • 8 Kword program cache
  • 128-bit local memory channel
  • 6+1 LVDS 200 MB/s links
  • BGA package, 600 pins

16
PB
  • Collaboration with NEURICAM SpA
  • 16 nodes, 3D-interconnected
    • 4 x 2 x 2 topology, 26 GFlops, 4.6 GB memory
  • Light system
    • JT Module connectors
    • Glue logic (clock tree, 10 MHz)
    • Global signal interconnection (FPGA)
    • DC-DC converters (48 V to 3.3 V / 2.5 V)
  • Dominant technologies
    • LVDS: 1728 (16 x 6 x 2 x 9) differential signals at 200 Mb/s; 144 routed via cables, 576 via backplane, on 12 controlled-impedance layers
    • High-speed differential connectors
      • Samtec QTS (JT Module)
      • ERNI ERmet ZD (backplane)

17
JT Module
  • JT
  • 9 DDR-SDRAM chips, 256 Mbit (x16)
  • 6 LVDS links, up to 400 MB/s
  • Host fast I/O link (7th link)
  • I2C link (slow control network)
  • Dual power supply 2.5 V + 1.8 V, 7-10 W estimated
  • Dominant technologies
    • SSTL-II (memory interface)
    • LVDS (network interface, I/O)

18
NEXT BackPlane
  • 16 PB slots + root slot
  • Size 447 x 600 mm2
  • 4600 LVDS differential signals, point-to-point, up to 600 Mb/s
  • 16 controlled-impedance layers (32)
  • Press-fit only
  • ERNI/Tyco connectors
    • ERmet ZD
  • Providers
    • APW (primary)
    • ERNI (second source)

Connector kit cost: 7 kEuro (!); PB insertion force: 80-150 kg (!)
19
PB Mechanics
PB constraints
  • Power consumption up to 340 W
  • PB-backplane insertion force 80-150 kg (!)

apeNEXT PB
  • Fully populated PB weight: 4-5 kg

Board-to-board connector
Detailed study of airflow
Custom design of card frame and insertion tool
20
Rack mechanics
  • Problem
    • PB weight 4-5 kg
    • PB consumption 340 W (est.)
    • 32 PBs + 2 Root Boards
    • Power supply (up to 48 V x 150 A per crate)
    • Integrated host PCs
    • Forced-air cooling
    • Robust, expandable/modular, CE, EMC ...
  • Solution
    • 42U rack (h = 2.10 m)
    • EMC proof
    • Efficient cable routing
    • 19-inch 1U slots for 9 host PCs (rack mounted)
    • Hot-swap power supply cabinet (modular)
    • Custom design of card cage and tie bar
    • Custom design of cooling system

21
(No Transcript)
22
Host I/O Architecture
23
Host I/O Interface
  • PCI board, Altera APEX II based
  • Quad Data Rate memory (x32)
  • 7th link: 1 (2) bidirectional channels
  • I2C: 4 independent ports
  • PCI interface: 64 bit, 66 MHz
    • PCI master mode for the 7th link
    • PCI target mode for I2C
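
The master/target split above can be pictured with the following purely hypothetical sketch (register names and layout invented for illustration, not the real board or driver): in target mode the host CPU itself pokes memory-mapped I2C registers, while in master mode the board's DMA engine moves 7th-link data directly into a host buffer.

    /* Purely illustrative sketch (hypothetical names, not the real driver):
     * - 7th-link data uses PCI master mode: the interface board DMAs
     *   packets into a host buffer while the CPU only polls for completion;
     * - I2C slow control uses PCI target mode: the CPU reads and writes
     *   memory-mapped board registers itself. */
    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical memory-mapped register block of the PCI board. */
    typedef struct {
        volatile uint32_t i2c_ctrl;    /* target-mode access by the host CPU  */
        volatile uint32_t i2c_data;
        volatile uint32_t dma_addr;    /* host buffer address for master mode */
        volatile uint32_t dma_len;
        volatile uint32_t dma_status;  /* bit 0: transfer done                */
    } hostif_regs;

    /* Target mode: the CPU pokes an I2C register directly. */
    static void i2c_write(hostif_regs *r, uint32_t byte) {
        r->i2c_data = byte;
        r->i2c_ctrl = 1;               /* hypothetical "start" bit */
    }

    /* Master mode: program the DMA engine, then let the board fill the buffer. */
    static void link7_receive(hostif_regs *r, void *buf, size_t len) {
        r->dma_addr = (uint32_t)(uintptr_t)buf;   /* 32-bit bus address (sketch) */
        r->dma_len  = (uint32_t)len;
        while ((r->dma_status & 1u) == 0)
            ;                          /* board is bus master; CPU only polls */
    }

    /* (No main: on real hardware the register block would be obtained by
     *  mapping the board's PCI BAR; omitted here.) */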

24
Status and expected schedule
  • JT ready to test in September '03
    • We will receive between 300 and 600 chips
    • We need 256 processors to assemble a crate !!
  • We expect them to work !!
    • The same team has designed 7 ASICs of similar complexity
    • Impressive full-detail simulations of multiple-JT systems
    • The more one simulates, the less one has to test !!
  • PB, JT Module, backplane and mechanics have been built and tested
  • Within days/weeks the first working apeNEXT computer should be operating
  • Mass production will follow ASAP
    • Mass production will start at the end of 2003
    • The INFN requirement is 8-12 TFlops of computing power !!

25
Software
  • TAO compiler and linker: READY
    • All existing APE programs will run with no changes
    • Physics code has already been run on the simulator
  • Kernels of physics codes
    • used to benchmark the efficiency of the FP unit
  • C compiler
    • gcc (2.93) and lcc have been retargeted
    • lcc WORKS (almost)

http://www.cs.princeton.edu/software/lcc/
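
To give a flavour of the C code the retargeted lcc/gcc back-ends are meant to handle, here is an illustrative (made-up) kernel dominated by complex a x b + c operations, the pattern that maps directly onto the JT complex normal unit; array names and sizes are assumptions for the example, not an actual benchmark.

    /* Illustrative C kernel of the type an APE machine targets: a long chain
     * of complex a*b + c operations (the JT "complex normal"). */
    #include <stdio.h>

    typedef struct { double re, im; } cmplx;
    #define N 1024

    static cmplx a[N], b[N], acc;

    int main(void) {
        for (int i = 0; i < N; i++) {         /* fill with something simple */
            a[i].re = 1.0; a[i].im = 0.0;
            b[i].re = 0.0; b[i].im = 1.0;
        }
        for (int i = 0; i < N; i++) {         /* acc += a[i] * b[i]         */
            double re = a[i].re * b[i].re - a[i].im * b[i].im + acc.re;
            double im = a[i].re * b[i].im + a[i].im * b[i].re + acc.im;
            acc.re = re; acc.im = im;
        }
        printf("acc = (%g, %g)\n", acc.re, acc.im);   /* expect (0, 1024) */
        return 0;
    }
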
26
Project Costs
  • Total development cost of 1700 kEuro
    • 1050 kEuro for VLSI development
    • 550 kEuro non-VLSI
  • Manpower involved: 20 man-years
  • Mass production cost: 0.5 Euro/MFlops
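
As a rough worked example of what 0.5 Euro/MFlops implies (assuming the 8-12 TFlops installation size mentioned on the status slide), the mass-production cost of such a machine would be of the order of 4-6 MEuro:

    /* 0.5 Euro/MFlops applied to an 8-12 TFlops installation. */
    #include <stdio.h>

    int main(void) {
        const double eur_per_mflops = 0.5;
        for (double tflops = 8; tflops <= 12; tflops += 4) {
            double mflops = tflops * 1e6;             /* 1 TFlops = 1e6 MFlops */
            printf("%4.0f TFlops -> %.1f MEuro\n",
                   tflops, mflops * eur_per_mflops / 1e6);
        }
        return 0;
    }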

27
Future R&D activities
  • Computing node architecture
    • Adaptable/reconfigurable computing node
    • Fat operators, short/custom FP data, multiple-node integration
    • Evaluation/integration of commercial processors in APE systems
  • Interconnection architecture and technologies
    • Custom APE-like network
    • Interface to host, PC interconnection
  • Mechanical assemblies (performance/volume, reliability)
    • Racks, cables, power distribution, etc.
  • Software
    • Standard languages (C): full support (compiler, linker)
    • Distributed OS
    • APE system integration in GRID environment

28
Conclusions
  • JT in fab, ready Summer '03 (300-600 chips)
  • Everything else ready and tested !!!
  • If tests are OK
    • mass production starting 4Q03
  • All components over-dimensioned
    • Cooling, LVDS tested @ 400 Mb/s, power supply on boards
    • Makes a technology step possible with no extra design and relatively low test effort
  • Installation plans
    • The INFN theoretical group requires 8-12 TFlops (10-15 cabinets), on delivery of a working machine
    • DESY is considering between 8 and 16 TFlops
    • Paris ...

29
APE in SciParC
  • APE is the current ("de facto") European computing platform for large-volume LQCD applications. But ...
  • Interdisciplinarity is on our pathway (i.e. APE is not only QCD):
    • Fluid dynamics (lattice Boltzmann, weather forecasting)
    • Complex systems (spin glasses, real glasses, protein folding)
    • Neural networks
    • Seismic migration
    • Plasma physics (astrophysics, thermonuclear engines)
  • So, in our opinion, it is strategic to build a general-purpose massively parallel computing platform dedicated to tackling large-scale computational problems coming from different fields of research.
  • The APE group can (and wants to) contribute to the development of such future machines.