Title: apeNEXT
1. apeNEXT
- Piero Vicini (piero.vicini@roma1.infn.it)
- INFN Roma
2. APE keywords
- Parallel system
- Massively parallel 3D array of computing nodes with periodic boundary conditions
- Custom system
- Processor: extensive use of VLSI
- Native implementation of the complex type a x b + c (complex numbers)
- Large register file
- VLIW microcode
- Node interconnections
- Optimized for nearest-neighbor communication (see the sketch at the end of this slide)
- Software tools
- Apese, TAO, OS, Machine simulator
- Dense system
- Reliable and safe HW solution
- Custom mechanics for wide integration
- Cheap system
- 0.5 Euro/MFlops
- Very low maintenance cost
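As an illustration of the 3D torus with periodic boundary conditions and nearest-neighbor communication listed above, here is a minimal C sketch (not APE code; the mesh size and names are hypothetical) of how a node finds its six first neighbors:

  #include <stdio.h>

  /* Hypothetical mesh size, for illustration only (e.g. one apeNEXT crate). */
  #define NX 4
  #define NY 8
  #define NZ 8

  /* Wrap a coordinate onto the torus: periodic boundary conditions. */
  static int wrap(int c, int n) { return (c + n) % n; }

  /* Linear node index from 3D coordinates. */
  static int node_id(int x, int y, int z) { return (x * NY + y) * NZ + z; }

  int main(void) {
      int x = 0, y = 0, z = 7;   /* example node sitting on the mesh edge */
      /* The six first neighbors, X+/X-/Y+/Y-/Z+/Z-, wrap around the torus. */
      printf("X+ %d  X- %d\n", node_id(wrap(x + 1, NX), y, z), node_id(wrap(x - 1, NX), y, z));
      printf("Y+ %d  Y- %d\n", node_id(x, wrap(y + 1, NY), z), node_id(x, wrap(y - 1, NY), z));
      printf("Z+ %d  Z- %d\n", node_id(x, y, wrap(z + 1, NZ)), node_id(x, y, wrap(z - 1, NZ)));
      return 0;
  }

On the APE machines this neighbor addressing is handled by the network hardware; the sketch only illustrates the wrap-around topology.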
3. The APE family
- Our line of Home Made Computers
                        APE (1988)    APE100 (1993)   APEmille (1999)   apeNEXT (2003)
Architecture            SIMD          SIMD            SIMD              SIMD
Nodes                   16            2048            2048              4096
Topology                flexible 1D   rigid 3D        flexible 3D       flexible 3D
Memory                  256 MB        8 GB            64 GB             1 TB
Registers (word size)   64 (x32)      128 (x32)       512 (x32)         512 (x64)
Clock speed             8 MHz         25 MHz          66 MHz            200 MHz
Total computing power   1.5 GFlops    250 GFlops      2 TFlops          8-20 TFlops
4. APE (1988): 1 GFlops
5. APE100 (1993): 100 GFlops
- PB (8 nodes): 400 MFlops
6. APEmille: 1 TFlops
- 2048 VLSI processing nodes (0.5 GFlops each)
- SIMD, synchronous communications
- Fully integrated host computer
- 64 cPCI-based PCs
- Tower: 32 PB, 128 GFlops
- Processing Board (PB): 8 nodes, 4 GFlops
- Computing node
7. APEmille installations
- Bielefeld: 130 GF (2 crates)
- Zeuthen: 520 GF (8 crates)
- Milan: 130 GF (2 crates)
- Bari: 65 GF (1 crate)
- Trento: 65 GF (1 crate)
- Pisa: 325 GF (5 crates)
- Rome 1: 650 GF (10 crates)
- Rome 2: 130 GF (2 crates)
- Orsay: 16 GF (1/4 crate)
- Swansea: 65 GF (1 crate)
- Grand total: 1966 GF
8. The apeNEXT architecture
- 3D mesh of computing nodes
- Custom VLSI processor (JT), 200 MHz
- 1.6 GFlops per node (complex normal operation)
- 256 MB (up to 1 GB) memory per node
- First-neighbor communication network, loosely synchronous
- Y, Z internal, X on cables
- 8/16 bit, > 200 MB/s per channel
- Scalable from 25 GFlops to 6 TFlops (see the scaling sketch at the end of this slide)
- Processing Board: 4 x 2 x 2, 26 GFlops
- Crate (16 PB): 4 x 8 x 8, 0.5 TFlops
- Rack (32 PB): 8 x 8 x 8, 1 TFlops
- Large systems: (8n) x 8 x 8
- Linux PCs as host system
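A back-of-the-envelope C sketch of the scaling quoted above, assuming the 1.6 GFlops-per-node figure (the crate and rack values on the slide are rounded):

  #include <stdio.h>

  int main(void) {
      const double gflops_per_node = 1.6;   /* peak, complex normal op at 200 MHz */

      /* Building blocks quoted on this slide */
      printf("PB    4x2x2 = %4d nodes -> %7.1f GFlops\n", 4*2*2, 4*2*2*gflops_per_node);
      printf("Crate 4x8x8 = %4d nodes -> %7.1f GFlops\n", 4*8*8, 4*8*8*gflops_per_node);
      printf("Rack  8x8x8 = %4d nodes -> %7.1f GFlops\n", 8*8*8, 8*8*8*gflops_per_node);

      /* Large systems: (8n) x 8 x 8, up to the ~6 TFlops end of the scaling range */
      for (int n = 1; n <= 8; n++) {
          int nodes = (8*n) * 8 * 8;
          printf("(8*%d)x8x8 = %4d nodes -> %7.1f GFlops\n", n, nodes, nodes*gflops_per_node);
      }
      return 0;
  }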
9. Design methodology
- VHDL incremental model of the (almost) whole system
- Custom (VLSI and/or FPGA) components derived from the VHDL model via synthesis tools
- Stand-alone simulation of component VHDL models, simulation of the global VHDL model
- Powerful test-bed for test-vector generation
First-Time-Right Silicon
- Simplified but complete model of the HW-Host interaction
- Test environment for development of the compilation chain and OS
- Performance (architecture) evaluation at design time
Software design environment
10. Assembling apeNEXT
(figures: JT ASIC, JT module, PB, rack, backplane)
11. Overview of the JT architecture
- Peak floating-point performance of about 1.6 GFlops
- IEEE-compliant double precision
- Integer arithmetic performance of about 400 MIPS
- Link bandwidth of about 200 MByte/s in each direction, full duplex
- 7 links: X+, X-, Y+, Y-, Z+, Z-, plus a 7th (I/O)
- Support for current-generation DDR memory
- Memory bandwidth of 3.2 GByte/s
- 400 Mword/s (see the consistency check below)
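A quick consistency check of these two figures (a sketch; it assumes 64-bit words, i.e. double precision, and the 128-bit local memory channel listed in the JT summary):

  3.2\ \mathrm{GB/s} \,/\, 8\ \mathrm{B/word} = 400\ \mathrm{Mword/s},
  \qquad 128\ \mathrm{bit} = \text{two 64-bit words per memory transfer.}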
12. JT top-level diagram
13. The JT arithmetic box
- Pipelined complex normal operation a x b + c (8 flops) per cycle
- At 200 MHz (fully pipelined): 1.6 GFlops
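Why the complex normal operation counts as 8 flops (a worked check using standard complex arithmetic, not taken from the slide):

  a \times b + c = (a_r + i a_i)(b_r + i b_i) + (c_r + i c_i)
                 = (a_r b_r - a_i b_i + c_r) \;+\; i\,(a_r b_i + a_i b_r + c_i)

That is 4 multiplications plus 4 additions, i.e. 8 floating-point operations per cycle; 8 flops x 200 MHz = 1.6 GFlops per node.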
14. The JT remote I/O
- FIFO-based communication
- LVDS signalling
- 1.6 Gb/s per link (8 bit @ 200 MHz); see the worked figure below
- 6 (+1) independent links
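The per-link figure follows directly from the width and clock quoted above (a sketch of the arithmetic, assuming 8 bits transferred per clock in each direction):

  8\ \mathrm{bit} \times 200\ \mathrm{MHz} = 1.6\ \mathrm{Gb/s} = 200\ \mathrm{MB/s}
  \quad \text{per link and direction.}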
15. JT summary
- CMOS 0.18 µm, 7 metal layers (ATMEL)
- 200 MHz
- Double Precision Complex Normal Operation
- 64 bit AGU
- 8 KW program cache
- 128 bit local memory channel
- 6 + 1 LVDS 200 MB/s links
- BGA package, 600 pins
16. PB (Processing Board)
- Collaboration with NEURICAM SpA
- 16 nodes, 3D-interconnected
- 4x2x2 topology: 26 GFlops, 4.6 GB memory
- Light system
- JT module connectors
- Glue logic (clock tree, 10 MHz)
- Global signal interconnection (FPGA)
- DC-DC converters (48 V to 3.3/2.5 V)
- Dominant technologies
- LVDS: 1728 (16 x 6 x 2 x 9) differential signals at 200 Mb/s, 144 routed via cables, 576 via backplane on 12 controlled-impedance layers
- High-speed differential connectors
- Samtec QTS (JT Module)
- Erni ERMET-ZD (Backplane)
17. JT module
- JT
- 9 DDR-SDRAM chips, 256 Mbit (x16)
- 6 LVDS links, up to 400 MB/s
- Host fast I/O link (7th link)
- I2C link (slow control network)
- Dual power 2.5 V / 1.8 V, 7-10 W estimated
- Dominant technologies
- SSTL-II (memory interface)
- LVDS (network interface I/O)
18. apeNEXT backplane
- 4600 LVDS differential signals, point-to-point, up to 600 Mb/s
- 16 controlled-impedance layers (of 32 total)
- Erni/Tyco connectors
- ERMET-ZD
- Providers
- APW (primary)
- ERNI (2nd source)
Connector kit cost: ~7 kEuro (!); PB insertion force: 80-150 kg (!)
19. PB mechanics
PB constraints
- Power consumption up to 340 W
- PB-to-backplane insertion force 80-150 kg (!)
apeNEXT PB
- Fully populated PB weight: 4-5 kg
Board-to-Board Connector
Detailed study of airflow
Custom design of card frame and insertion tool
20. Rack mechanics
- Problem
- PB weight 4-5 kg
- PB consumption 340 W (est.)
- 32 PB + 2 Root Boards
- Power supply (< 48 V x 150 A per crate)
- Integrated host PCs
- Forced air cooling
- Robust, expandable/modular, CE, EMC ...
- 42U rack (h 2.10 m)
- EMC proof
- Efficient cable routing
- 19-inch 1U slots for 9 host PCs (rack mounted)
- Hot-swap power supply cabinet (modular)
- Custom design of card cage and tie bar
- Custom design of cooling system
21. (figure only, no text)
22. Host I/O architecture
23. Host I/O interface
- PCI board, Altera APEX II based
- 7th link: 1 (2) bidirectional channels
- PCI interface: 64 bit, 66 MHz
- PCI master mode for the 7th link
- PCI target mode for I2C
24. Status and expected schedule
- JT ready to test September '03
- We will receive between 300 and 600 chips
- We need 256 processors to assemble a crate!!
- We expect them to work!!
- The same team designed 7 ASICs of the same complexity
- Impressive fully detailed simulations of multiple-JT systems
- The more one simulates, the less one has to test!!
- PB, JT module, backplane, and mechanics were built and tested
- Within days/weeks the first working apeNEXT computer should operate
- Mass production will follow ASAP
- At the end of 2003 mass production will start
- The INFN requirement is 8-12 TFlops of computing power!!
25. Software
- TAO compiler and linker ... READY
- All existing APE programs will run with no change
- Physics code has already been run on the simulator
- Kernels of physics codes
- used to benchmark the efficiency of the FP unit (a sketch of such a kernel follows this slide)
- C COMPILER
- gcc (2.93) and lcc have been retargeted
- lcc WORKS (almost):
  http://www.cs.princeton.edu/software/lcc/
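A minimal sketch of the kind of kernel that maps onto the complex normal operation. It is plain C99 for a host PC, illustrative only; it is not one of the actual APE physics benchmarks and not necessarily in the dialect accepted by the retargeted lcc/gcc:

  #include <complex.h>
  #include <stdio.h>

  #define N 1024

  /* y[i] = a[i]*x[i] + y[i]: each iteration is one complex normal
   * operation (a*b + c), i.e. 8 flops, ideally one per clock cycle. */
  static void caxpy(int n, const double complex *a,
                    const double complex *x, double complex *y)
  {
      for (int i = 0; i < n; i++)
          y[i] = a[i] * x[i] + y[i];
  }

  int main(void) {
      static double complex a[N], x[N], y[N];
      for (int i = 0; i < N; i++) {
          a[i] = 1.0 + 0.5 * I;
          x[i] = (double)i - 0.25 * I;
          y[i] = 0.0;
      }
      caxpy(N, a, x, y);
      printf("y[1] = %f + %f i\n", creal(y[1]), cimag(y[1]));
      return 0;
  }

A fully pipelined FP unit would retire one such iteration per clock, which is where the 1.6 GFlops per-node peak comes from.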
26. Project costs
- Total development cost of 1700 kEuro
- 1050 kEuro for VLSI development
- 550 kEuro non-VLSI
- Manpower involved: 20 man-years
- Mass production cost: 0.5 Euro/MFlops (see the back-of-the-envelope note below)
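As a back-of-the-envelope illustration (not stated on the slides), combining this production cost with the 8-12 TFlops INFN target quoted earlier:

  0.5\ \mathrm{Euro/MFlops} \times (8\text{-}12)\times 10^{6}\ \mathrm{MFlops} \approx 4\text{-}6\ \mathrm{MEuro}.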
27. Future R&D activities
- Computing node architecture
- Adaptable/reconfigurable computing node
- Fat operators, short/custom FP data, multiple-node integration
- Evaluation/integration of commercial processors in the APE system
- Interconnection architecture and technologies
- Custom APE-like network
- Interface to host, PC interconnection
- Mechanical assemblies (performance/volume, reliability)
- Racks, cables, power distribution, etc.
- Software
- Full support for standard languages (C): compiler, linker
- Distributed OS
- APE system integration in a GRID environment
28. Conclusions
- JT in fab, ready Summer '03 (300-600 chips)
- Everything else ready and tested!!!
- If tests are OK
- mass production starting 4Q03
- All components over-dimensioned
- Cooling, LVDS tested @ 400 Mb/s, power supply on boards
- This makes a technology step possible with no extra design and relatively low test effort
- Installation plans
- INFN theoretical group requires 8-12 TFlops (10-15 cabinets), on delivery of a working machine
- DESY considering between 8 and 16 TFlops
- Paris ...
29. APE in SciParC
- APE is the de-facto European computing platform for large-volume LQCD applications. But...
- Interdisciplinarity is on our pathway (i.e. APE is not only QCD)
- Fluid dynamics (lattice Boltzmann, weather forecasting)
- Complex systems (spin glasses, real glasses, protein folding)
- Neural networks
- Seismic migration
- Plasma physics (astrophysics, thermonuclear engines)
- ...
- So, in our opinion, it is strategic to build a general-purpose massively parallel computing platform dedicated to approaching large-scale computational problems coming from different fields of research.
- The APE group can (and wants to) contribute to the development of such future machines.