apeNEXT

1
apeNEXT
  • Piero Vicini (piero.vicini@roma1.infn.it)
  • INFN Roma

2
APE keywords
  • Parallel system
    • Massively parallel 3D array of computing nodes with periodic boundary conditions
  • Custom system
    • Processor: extensive use of VLSI
    • Native implementation of the complex-type a x b + c operation (complex numbers)
    • Large register file
    • VLIW microcode
  • Node interconnections
    • Optimized for nearest-neighbor communication
  • Software tools
    • Apese, TAO, OS, machine simulator
  • Dense system
    • Reliable and safe HW solution
    • Custom mechanics for wide integration
  • Cheap system
    • 0.5 Euro/MFlops
    • Very low cost maintenance
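
For readers less familiar with the topology keywords above, here is a minimal C sketch (illustrative only, not APE system software) of what a 3D array of computing nodes with periodic boundary conditions means in practice: every node has six nearest neighbors, and coordinates wrap around at the mesh edges. The mesh dimensions NX, NY, NZ and all function names are assumptions chosen for the example.

    /* Minimal sketch (not apeNEXT firmware): nearest-neighbor addressing
     * on a 3D mesh of nodes with periodic boundary conditions (a torus).
     * NX, NY, NZ are illustrative machine dimensions, e.g. one rack = 8x8x8. */
    #include <stdio.h>

    #define NX 8
    #define NY 8
    #define NZ 8

    /* Linear node index from 3D coordinates. */
    static int node_id(int x, int y, int z) {
        return (x * NY + y) * NZ + z;
    }

    /* Neighbor in direction (dx,dy,dz), wrapping around at the mesh edges. */
    static int neighbour(int x, int y, int z, int dx, int dy, int dz) {
        return node_id((x + dx + NX) % NX,
                       (y + dy + NY) % NY,
                       (z + dz + NZ) % NZ);
    }

    int main(void) {
        /* Node (7,0,0): its +X neighbor wraps back to x = 0. */
        printf("node %d -> +X neighbour %d\n",
               node_id(7, 0, 0), neighbour(7, 0, 0, +1, 0, 0));
        return 0;
    }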

3
The APE family
  • Our line of Home Made Computers

                          APE (1988)    APE100 (1993)   APEmille (1999)   apeNEXT (2003)
  Architecture            SIMD          SIMD            SIMD              SIMD
  Nodes                   16            2048            2048              4096
  Topology                flexible 1D   rigid 3D        flexible 3D       flexible 3D
  Memory                  256 MB        8 GB            64 GB             1 TB
  Registers (word size)   64 (x32)      128 (x32)       512 (x32)         512 (x64)
  Clock speed              8 MHz        25 MHz          66 MHz            200 MHz
  Total computing power   1.5 GFlops    250 GFlops      2 TFlops          8-20 TFlops
4
APE (1988) - 1 GFlops
5
APE100 (1993) - 100 GFlops
PB (8 nodes) 400 MFlops
6
APEmille 1 TFlops
  • 2048 VLSI processing nodes (0.5 GFlops each)
  • SIMD, synchronous communications
  • Fully integrated host computer: 64 cPCI-based PCs

Tower: 32 PB, 128 GFlops
Processing Board (PB): 8 nodes, 4 GFlops
Computing node
7
APEmille installations
  • Bielefeld 130 GF (2 crates)
  • Zeuthen 520 GF (8 crates)
  • Milan 130 GF (2 crates)
  • Bari 65 GF (1 crate)
  • Trento 65 GF (1 crate)
  • Pisa 325 GF (5 crates)
  • Rome 1 650 GF (10 crates)
  • Rome 2 130 GF (2 crates)
  • Orsay 16 GF (1/4 crate)
  • Swansea 65 GF (1 crate)
  • Grand total 1966 GF

8
The apeNEXT architecture
  • 3D mesh of computing nodes
  • Custom VLSI processor (JT) at 200 MHz
    • 1.6 GFlops per node (complex normal)
    • 256 MB (1 GB) memory per node
  • Nearest-neighbor communication network, loosely synchronous
    • Y, Z internal; X on cables
    • 8/16-bit links, > 200 MB/s per channel
  • Scalable from 25 GFlops to 6 TFlops
    • Processing Board: 4 x 2 x 2 nodes, 26 GFlops
    • Crate (16 PBs): 4 x 8 x 8 nodes, 0.5 TFlops
    • Rack (32 PBs): 8 x 8 x 8 nodes, 1 TFlops
    • Large systems: (8n) x 8 x 8 nodes
  • Linux PCs as host system
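
As a back-of-the-envelope check of the scaling figures above, the sketch below takes the 1.6 GFlops per-node peak as its only input and recomputes the board, crate and rack totals from the node counts; the quoted 0.5 TFlops and 1 TFlops are these numbers rounded up. The struct layout is just for the illustration.

    /* Check of the apeNEXT scaling figures quoted above: each node delivers
     * 8 flops/cycle at 200 MHz = 1.6 GFlops, and the PB / crate / rack totals
     * follow from the node counts alone. */
    #include <stdio.h>

    int main(void) {
        const double node_gflops = 8 * 200e6 / 1e9;   /* 1.6 GFlops per node */

        struct { const char *name; int nx, ny, nz; } unit[] = {
            { "Processing Board", 4, 2, 2 },   /* 16 nodes  ~ 26 GFlops    */
            { "Crate (16 PB)",    4, 8, 8 },   /* 256 nodes ~ 0.4 TFlops   */
            { "Rack (32 PB)",     8, 8, 8 },   /* 512 nodes ~ 0.8 TFlops   */
        };

        for (int i = 0; i < 3; i++) {
            int nodes = unit[i].nx * unit[i].ny * unit[i].nz;
            printf("%-18s %3d nodes  %6.1f GFlops\n",
                   unit[i].name, nodes, nodes * node_gflops);
        }
        return 0;
    }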

9
Design methodology
  • VHDL incremental model of (almost) the whole system
  • Custom (VLSI and/or FPGA) components derived from the VHDL model via synthesis tools
  • Stand-alone simulation of component VHDL models; simulation of the global VHDL model
  • Powerful test-bed for test-vector generation

First-Time-Right Silicon
  • Simplified but complete model of HW-Host interaction
  • Test environment for development of the compilation chain and OS
  • Performance (architecture) evaluation at design time

Software design environment
10
Assembling apeNEXT
JT ASIC
JT module
PB
Rack
BackPlane
11
Overview of the JT Architecture
  • Peak floating-point performance of about 1.6 GFlops
    • IEEE-compliant double precision
  • Integer arithmetic performance of about 400 MIPS
  • Link bandwidth of about 200 MByte/s in each direction
    • full duplex
    • 7 links: X+, X-, Y+, Y-, Z+, Z-, plus a 7th (I/O) link
  • Support for current-generation DDR memory
    • Memory bandwidth of 3.2 GByte/s
    • 400 Mword/s
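
The memory numbers above are mutually consistent; the short plain-C check below (assuming a 64-bit word and the 128-bit memory channel quoted on the JT summary slide) recovers 200 Mtransfers/s and the quoted 400 Mword/s from the 3.2 GByte/s figure.

    /* Consistency check of the memory figures quoted above: a 128-bit data
     * path moving 3.2 GByte/s corresponds to 200 Mtransfers/s and, in 64-bit
     * words, to the quoted 400 Mword/s. (Assumes 1 word = 8 bytes.) */
    #include <stdio.h>

    int main(void) {
        const double bw_bytes   = 3.2e9;       /* 3.2 GByte/s             */
        const double bus_bytes  = 128 / 8;     /* 128-bit channel = 16 B  */
        const double word_bytes = 8;           /* 64-bit word             */

        printf("transfers/s : %.0f M\n", bw_bytes / bus_bytes / 1e6);  /* 200 */
        printf("words/s     : %.0f M\n", bw_bytes / word_bytes / 1e6); /* 400 */
        return 0;
    }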

12
JT Top Level Diagram
13
The JT Arithmetic BOX
  • Pipelined complex normal a x b + c (8 flops) per cycle

At 200 MHz (fully pipelined): 1.6 GFlops
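
As a reminder of where the factor 8 comes from, here is a plain-C sketch of the complex normal operation a x b + c: four multiplications plus four additions/subtractions per result, hence 8 flops per cycle and 1.6 GFlops at 200 MHz. It shows only the data flow, not the JT pipeline itself.

    /* The "complex normal" operation a*b + c on double-precision complex
     * operands: 4 multiplications and 4 additions/subtractions = 8 flops.
     * One result per cycle at 200 MHz gives the quoted 1.6 GFlops. */
    #include <stdio.h>

    typedef struct { double re, im; } cmplx;

    /* a*b + c : 8 floating-point operations in total. */
    static cmplx cnormal(cmplx a, cmplx b, cmplx c) {
        cmplx r;
        r.re = a.re * b.re - a.im * b.im + c.re;  /* 2 mul, 2 add/sub */
        r.im = a.re * b.im + a.im * b.re + c.im;  /* 2 mul, 2 add     */
        return r;
    }

    int main(void) {
        cmplx a = {1, 2}, b = {3, 4}, c = {5, 6};
        cmplx r = cnormal(a, b, c);
        printf("(%g, %g)\n", r.re, r.im);   /* (1+2i)(3+4i)+(5+6i) = (0, 16) */
        return 0;
    }
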
14
The JT remote IO
  • FIFO-based communication
  • LVDS
  • 1.6 Gb/s per link (8 bit @ 200 MHz)
  • 6 (+1) independent links
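
To illustrate what FIFO-based communication means at the programming level, here is a conceptual C sketch of a link FIFO with the usual full/empty flow control; the depth, types and function names are assumptions for the example, not the JT register interface. The raw rate quoted above follows from 8 bits x 200 MHz = 1.6 Gb/s, i.e. 200 MB/s per direction.

    /* Conceptual sketch of FIFO-based link communication (not the JT
     * hardware interface): the sender enqueues 64-bit words, the receiver
     * dequeues them; flow control is simply "FIFO full / FIFO empty". */
    #include <stdint.h>
    #include <stdbool.h>

    #define FIFO_DEPTH 16

    typedef struct {
        uint64_t buf[FIFO_DEPTH];
        unsigned head, tail;          /* head: next pop, tail: next push */
    } link_fifo;

    static bool fifo_push(link_fifo *f, uint64_t w) {
        unsigned next = (f->tail + 1) % FIFO_DEPTH;
        if (next == f->head) return false;        /* full: sender stalls   */
        f->buf[f->tail] = w;
        f->tail = next;
        return true;
    }

    static bool fifo_pop(link_fifo *f, uint64_t *w) {
        if (f->head == f->tail) return false;     /* empty: receiver waits */
        *w = f->buf[f->head];
        f->head = (f->head + 1) % FIFO_DEPTH;
        return true;
    }

    int main(void) {
        link_fifo f = {0};
        uint64_t w;
        fifo_push(&f, 0xCAFE);              /* sender side   */
        return fifo_pop(&f, &w) ? 0 : 1;    /* receiver side */
    }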

15
JT summary
  • CMOS 0.18 um, 7 metal layers (ATMEL)
  • 200 MHz
  • Double-precision complex normal operation
  • 64-bit AGU
  • 8 Kword program cache
  • 128-bit local memory channel
  • 6+1 LVDS 200 MB/s links
  • BGA package, 600 pins

16
PB
  • Collaboration with NEURICAM SpA
  • 16 nodes, 3D-interconnected
    • 4 x 2 x 2 topology, 26 GFlops, 4.6 GB memory
  • Light system
    • JT Module connectors
    • Glue logic (clock tree, 10 MHz)
    • Global signal interconnection (FPGA)
    • DC-DC converters (48 V to 3.3 V / 2.5 V)
  • Dominant technologies
    • LVDS: 1728 (16 x 6 x 2 x 9) differential signals at 200 Mb/s; 144 routed via cables, 576 via backplane, on 12 controlled-impedance layers
    • High-speed differential connectors
      • Samtec QTS (JT Module)
      • ERNI ERmet ZD (backplane)

17
JT Module
  • JT
  • 9 DDR-SDRAM chips, 256 Mbit (x16)
  • 6 LVDS links, up to 400 MB/s
  • Host fast I/O link (7th link)
  • I2C link (slow control network)
  • Dual power supply 2.5 V + 1.8 V, 7-10 W estimated
  • Dominant technologies
    • SSTL-II (memory interface)
    • LVDS (network interface, I/O)

18
NEXT BackPlane
  • 16 PB slots + root slot
  • Size 447 x 600 mm2
  • 4600 LVDS differential signals, point-to-point, up to 600 Mb/s
  • 16 controlled-impedance layers (32)
  • Press-fit only
  • ERNI/Tyco connectors
    • ERmet ZD
  • Providers
    • APW (primary)
    • ERNI (second source)

Connector kit cost: 7 kEuro (!); PB insertion force: 80-150 kg (!)
19
PB Mechanics
PB constraints
  • Power consumption up to 340 W
  • PB-backplane insertion force 80-150 kg (!)

apeNEXT PB
  • Fully populated PB weight: 4-5 kg

Board-to-board connector
Detailed study of airflow
Custom design of card frame and insertion tool
20
Rack mechanics
  • Problem
    • PB weight 4-5 kg
    • PB consumption 340 W (est.)
    • 32 PBs + 2 Root Boards
    • Power supply (up to 48 V x 150 A per crate)
    • Integrated host PCs
    • Forced-air cooling
    • Robust, expandable/modular, CE, EMC ...
  • Solution
    • 42U rack (h = 2.10 m)
    • EMC proof
    • Efficient cable routing
    • 19-inch 1U slots for 9 host PCs (rack mounted)
    • Hot-swap power supply cabinet (modular)
    • Custom design of card cage and tie bar
    • Custom design of cooling system

21
(No Transcript)
22
Host I/O Architecture
23
Host I/O Interface
  • PCI board, Altera APEX II based
  • Quad Data Rate memory (x32)
  • 7th link: 1 (2) bidirectional channels
  • I2C: 4 independent ports
  • PCI interface: 64 bit, 66 MHz
    • PCI master mode for the 7th link
    • PCI target mode for I2C
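
The master/target split above can be pictured with the following purely hypothetical sketch (register names and layout invented for illustration, not the real board or driver): in target mode the host CPU itself pokes memory-mapped I2C registers, while in master mode the board's DMA engine moves 7th-link data directly into a host buffer.

    /* Purely illustrative sketch (hypothetical names, not the real driver):
     * - 7th-link data uses PCI master mode: the interface board DMAs
     *   packets into a host buffer while the CPU only polls for completion;
     * - I2C slow control uses PCI target mode: the CPU reads and writes
     *   memory-mapped board registers itself. */
    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical memory-mapped register block of the PCI board. */
    typedef struct {
        volatile uint32_t i2c_ctrl;    /* target-mode access by the host CPU  */
        volatile uint32_t i2c_data;
        volatile uint32_t dma_addr;    /* host buffer address for master mode */
        volatile uint32_t dma_len;
        volatile uint32_t dma_status;  /* bit 0: transfer done                */
    } hostif_regs;

    /* Target mode: the CPU pokes an I2C register directly. */
    static void i2c_write(hostif_regs *r, uint32_t byte) {
        r->i2c_data = byte;
        r->i2c_ctrl = 1;               /* hypothetical "start" bit */
    }

    /* Master mode: program the DMA engine, then let the board fill the buffer. */
    static void link7_receive(hostif_regs *r, void *buf, size_t len) {
        r->dma_addr = (uint32_t)(uintptr_t)buf;   /* 32-bit bus address (sketch) */
        r->dma_len  = (uint32_t)len;
        while ((r->dma_status & 1u) == 0)
            ;                          /* board is bus master; CPU only polls */
    }

    /* (No main: on real hardware the register block would be obtained by
     *  mapping the board's PCI BAR; omitted here.) */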

24
Status and expected schedule
  • JT ready to test in September '03
    • We will receive between 300 and 600 chips
    • We need 256 processors to assemble a crate !!
  • We expect them to work !!
    • The same team has designed 7 ASICs of similar complexity
    • Impressive full-detail simulations of multiple-JT systems
    • The more one simulates, the less one has to test !!
  • PB, JT Module, backplane and mechanics have been built and tested
  • Within days/weeks the first working apeNEXT computer should be operating
  • Mass production will follow ASAP
    • Mass production will start at the end of 2003
    • The INFN requirement is 8-12 TFlops of computing power !!

25
Software
  • TAO compiler and linker: READY
    • All existing APE programs will run with no changes
    • Physics code has already been run on the simulator
  • Kernels of physics codes
    • used to benchmark the efficiency of the FP unit
  • C compiler
    • gcc (2.93) and lcc have been retargeted
    • lcc WORKS (almost)

http://www.cs.princeton.edu/software/lcc/
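
To give a flavour of the C code the retargeted lcc/gcc back-ends are meant to handle, here is an illustrative (made-up) kernel dominated by complex a x b + c operations, the pattern that maps directly onto the JT complex normal unit; array names and sizes are assumptions for the example, not an actual benchmark.

    /* Illustrative C kernel of the type an APE machine targets: a long chain
     * of complex a*b + c operations (the JT "complex normal"). */
    #include <stdio.h>

    typedef struct { double re, im; } cmplx;
    #define N 1024

    static cmplx a[N], b[N], acc;

    int main(void) {
        for (int i = 0; i < N; i++) {         /* fill with something simple */
            a[i].re = 1.0; a[i].im = 0.0;
            b[i].re = 0.0; b[i].im = 1.0;
        }
        for (int i = 0; i < N; i++) {         /* acc += a[i] * b[i]         */
            double re = a[i].re * b[i].re - a[i].im * b[i].im + acc.re;
            double im = a[i].re * b[i].im + a[i].im * b[i].re + acc.im;
            acc.re = re; acc.im = im;
        }
        printf("acc = (%g, %g)\n", acc.re, acc.im);   /* expect (0, 1024) */
        return 0;
    }
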
26
Project Costs
  • Total development cost of 1700 kEuro
    • 1050 kEuro for VLSI development
    • 550 kEuro non-VLSI
  • Manpower involved: 20 man-years
  • Mass production cost: 0.5 Euro/MFlops
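
As a rough worked example of what 0.5 Euro/MFlops implies (assuming the 8-12 TFlops installation size mentioned on the status slide), the mass-production cost of such a machine would be of the order of 4-6 MEuro:

    /* 0.5 Euro/MFlops applied to an 8-12 TFlops installation. */
    #include <stdio.h>

    int main(void) {
        const double eur_per_mflops = 0.5;
        for (double tflops = 8; tflops <= 12; tflops += 4) {
            double mflops = tflops * 1e6;             /* 1 TFlops = 1e6 MFlops */
            printf("%4.0f TFlops -> %.1f MEuro\n",
                   tflops, mflops * eur_per_mflops / 1e6);
        }
        return 0;
    }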

27
Future R&D activities
  • Computing node architecture
    • Adaptable/reconfigurable computing node
    • Fat operators, short/custom FP data, multiple-node integration
    • Evaluation/integration of commercial processors in APE systems
  • Interconnection architecture and technologies
    • Custom APE-like network
    • Interface to host, PC interconnection
  • Mechanical assemblies (performance/volume, reliability)
    • Racks, cables, power distribution, etc.
  • Software
    • Standard languages (C): full support (compiler, linker)
    • Distributed OS
    • APE system integration in GRID environment

28
Conclusions
  • JT in fab, ready Summer '03 (300-600 chips)
  • Everything else ready and tested !!!
  • If tests are OK
    • mass production starting 4Q03
  • All components over-dimensioned
    • Cooling, LVDS tested @ 400 Mb/s, power supply on boards
    • Makes a technology step possible with no extra design and relatively low test effort
  • Installation plans
    • The INFN theoretical group requires 8-12 TFlops (10-15 cabinets), on delivery of a working machine
    • DESY is considering between 8 and 16 TFlops
    • Paris ...

29
APE in SciParC
  • APE is the current ("de facto") European computing platform for large-volume LQCD applications. But ...
  • Interdisciplinarity is on our pathway (i.e. APE is not only QCD):
    • Fluid dynamics (lattice Boltzmann, weather forecasting)
    • Complex systems (spin glasses, real glasses, protein folding)
    • Neural networks
    • Seismic migration
    • Plasma physics (astrophysics, thermonuclear engines)
  • So, in our opinion, it is strategic to build a general-purpose massively parallel computing platform dedicated to tackling large-scale computational problems coming from different fields of research.
  • The APE group can (and wants to) contribute to the development of such future machines.