1
LQCD benchmarks on cluster architectures
M. Hasenbusch, D. Pop, P. Wegner (DESY Zeuthen), A. Gellrich, H. Wittig (DESY Hamburg)
CHEP03, 25 March 2003, Category 6: Lattice Gauge Computing
  • Motivation
  • PC Cluster @ DESY
  • Benchmark architectures
  • DESY Cluster
  • E7500 systems
  • Infiniband blade servers
  • Itanium2
  • Benchmark programs, Results
  • Future
  • Conclusions, Acknowledgements

2
PC Cluster Motivation: LQCD, Stream Benchmark, Myrinet Bandwidth
  • 32/64-bit Dirac kernel, LQCD (Martin Lüscher, DESY/CERN, 2000)
  • P4, 1.4 GHz, 256 MB Rambus, using SSE1(2) instructions incl. cache pre-fetch
  • Time per lattice point (see the conversion sketch below):
    • 0.926 microseconds (1503 Mflops, 32-bit arithmetic)
    • 1.709 microseconds (814 Mflops, 64-bit arithmetic)
  • Stream benchmark, memory bandwidth:
    • P4 (1.4 GHz, PC800 Rambus): 1.4 - 2.0 GB/s
    • PIII (800 MHz, PC133 SDRAM): 400 MB/s
    • PIII (400 MHz, PC133 SDRAM): 340 MB/s
  • Myrinet, external bandwidth:
    • 2.0 + 2.0 Gb/s optical connection, bidirectional, 240 MB/s sustained
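The Mflops figures above are just the flop count of the Dirac kernel divided by the time per lattice point. The sketch below shows that conversion; the value of about 1392 flops per site is inferred from the slide's own numbers (1503 Mflops x 0.926 microseconds), not quoted from the talk.

```c
/* Sketch: convert the measured time per lattice site into sustained Mflop/s.
 * FLOPS_PER_SITE (~1392) is inferred from the slide's own numbers
 * (1503 Mflop/s * 0.926 us); it is an assumption, not quoted from the talk. */
#include <stdio.h>

#define FLOPS_PER_SITE 1392.0

/* flops per microsecond are numerically equal to Mflop/s */
static double mflops(double usec_per_site) {
    return FLOPS_PER_SITE / usec_per_site;
}

int main(void) {
    printf("32-bit: %4.0f Mflop/s\n", mflops(0.926)); /* ~1503 */
    printf("64-bit: %4.0f Mflop/s\n", mflops(1.709)); /* ~814  */
    return 0;
}
```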

3
Benchmark Architectures - DESY Cluster Hardware
Nodes:
  • Mainboard Supermicro P4DC6
  • 2 x XEON P4, 1.7 (2.0) GHz, 256 (512) kByte cache
  • 1 GByte (4 x 256 MByte) RDRAM
  • IBM 18.3 GB DDYS-T18350 U160 3.5" SCSI disk
  • Myrinet 2000 M3F-PCI64B-2 interface
Network:
  • Fast Ethernet switch Gigaline 2024M, 48 x 100BaseTX ports, GIGAline 2024 1000BaseSX-SC
  • Myrinet fast interconnect: M3-E32 5-slot chassis, 2 x M3-SW16 line cards
Installation:
  • Zeuthen: 16 dual-CPU nodes, Hamburg: 32 dual-CPU nodes
4
Benchmark Architectures - DESY Cluster: i860 chipset problem
[Block diagram of the Intel i860 chipset: 400 MHz system bus, MCH with dual-channel RDRAM (3.2 GB/s, up to 4 GB via MRHs), two P64H bridges (800 MB/s hub links each) serving the 66 MHz/64-bit PCI slots, and a 266 MB/s hub link to the ICH2 with 33 MHz/32-bit PCI, ATA 100 (dual IDE channels), 10/100 Ethernet, 6-channel audio and 4 USB ports.]
Measured PCI throughput: bus_read (send) 227 MBytes/s, bus_write (recv) 315 MBytes/s of max. 528 MBytes/s
External Myrinet bandwidth: 160 MBytes/s, 90 MBytes/s bidirectional
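For scale, the quoted 528 MBytes/s is the theoretical peak of a 64-bit/66 MHz PCI segment (8 bytes x 66 MHz). A quick check, as a sketch with the measured values taken from the slide:

```c
/* Sketch: the 528 MB/s ceiling is standard 64-bit/66 MHz PCI arithmetic;
 * the bus_read/bus_write rates are the measured values from the slide. */
#include <stdio.h>

int main(void) {
    const double pci_peak  = 8.0 * 66.0;  /* 8 bytes * 66 MHz = 528 MB/s */
    const double bus_read  = 227.0;       /* MB/s, measured (send) */
    const double bus_write = 315.0;       /* MB/s, measured (recv) */

    printf("PCI peak : %.0f MB/s\n", pci_peak);
    printf("bus_read : %.0f%% of peak\n", 100.0 * bus_read  / pci_peak); /* ~43% */
    printf("bus_write: %.0f%% of peak\n", 100.0 * bus_write / pci_peak); /* ~60% */
    return 0;
}
```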
5
Benchmark Architectures - Intel E7500 chipset
6
Benchmark Architectures - E7500 system
Par-Tec (Wuppertal), 4 nodes:
  • Intel(R) Xeon(TM) CPU 2.60 GHz
  • 2 GB ECC PC1600 (DDR-200) SDRAM
  • Super Micro P4DPE-G2, Intel E7500 chipset, PCI 64/66
  • 2 x Intel(R) PRO/1000 Network Connection
  • Myrinet M3F-PCI64B-2
7
Benchmark Architectures
Leibniz-Rechenzentrum Munich (single-CPU tests):
  • Pentium IV 3.06 GHz with ECC Rambus
  • Pentium IV 2.53 GHz with Rambus 1066 memory
  • Xeon 2.4 GHz with PC2100 DDR SDRAM memory (probably FSB400)
Megware, 8 nodes:
  • dual XEON, 2.4 GHz, E7500 chipset, 2 GB DDR ECC memory, Myrinet2000, Supermicro P4DMS-6GM
University of Erlangen:
  • Itanium2, 900 MHz, 1.5 MB cache, 10 GB RAM, zx1 chipset (HP)
8
Benchmark Architectures - Infiniband
Megware, 10 Mellanox ServerBlades:
  • Single Xeon 2.2 GHz, 2 GB DDR RAM, ServerWorks GC-LE chipset
  • InfiniBand 4X HCA
  • RedHat 7.3, kernel 2.4.18-3
  • MPICH-1.2.2.2 and OSU patch for VIA/InfiniBand 0.6.5
  • Mellanox firmware 1.14, Mellanox SDK (VAPI) 0.0.4
  • Compiler GCC 2.96
9
Dirac Operator Benchmark (SSE), 16x16³ lattice, single P4/XEON CPU
[Chart: MFLOPS of the Dirac operator and linear algebra kernels]
10
Parallel (1-dim) Dirac Operator Benchmark (SSE), even-odd preconditioned, 2 x 16³ lattice, XEON CPUs, single-CPU performance
[Chart; Myrinet2000 bandwidth: i860 chipset 90 MB/s, E7500 chipset 190 MB/s]
11
Parallel (1-dim) Dirac Operator Benchmark (SSE), even-odd preconditioned, 2 x 16³ lattice, XEON CPUs, single-CPU performance, 2 and 4 nodes
Performance comparisons (MFLOPS):
                     Single node    Dual node
  SSE2               446            330
  non-SSE            328 (74%)      283 (85%)
ParaStation3 software, non-blocking I/O support (MFLOPS, non-SSE; see the MPI sketch below):
  blocking I/O       308
  non-blocking I/O   367 (119%)
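The gain in the non-blocking case comes from overlapping the halo exchange with computation on the interior of the local lattice. Below is a hedged MPI sketch of that pattern; all identifiers (buffer names, compute_interior, compute_boundary) are illustrative and not taken from the benchmark code.

```c
/* Hedged sketch of the non-blocking pattern behind the ParaStation numbers
 * above: post the halo exchange, compute the interior while messages are in
 * flight, then finish the boundary sites. */
#include <mpi.h>

static void compute_interior(void) { /* placeholder: sites needing no remote data */ }
static void compute_boundary(void) { /* placeholder: sites that need the halo */ }

void dirac_apply_overlapped(double *send_buf, double *recv_buf, int n,
                            int up, int down, MPI_Comm comm) {
    MPI_Request req[2];

    /* post the boundary exchange first ... */
    MPI_Irecv(recv_buf, n, MPI_DOUBLE, down, 0, comm, &req[0]);
    MPI_Isend(send_buf, n, MPI_DOUBLE, up,   0, comm, &req[1]);

    compute_interior();                        /* ... and overlap it with local work */

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);  /* halo has arrived */
    compute_boundary();
}
```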

12
Maximal Efficiency of External I/O

                                                       MFLOPS       MFLOPS        Max. bandwidth   Efficiency
                                                       (no comm.)   (with comm.)  (MB/s)
  Myrinet (i860), SSE                                  579          307           90               0.53
  Myrinet/GM (E7500), SSE                              631          432           190              0.68
  Myrinet/ParaStation (E7500), SSE                     675          446           181              0.66
  Myrinet/ParaStation (E7500), non-blocking, non-SSE   406          368           hidden           0.91
  Gigabit Ethernet, non-SSE                            390          228           100              0.58
  Infiniband, non-SSE                                  370          297           210              0.80
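As the table's numbers imply, the efficiency column is simply the ratio of the two MFLOPS columns:

    Efficiency = MFLOPS (with communication) / MFLOPS (without communication)

e.g. 307 / 579 ≈ 0.53 for Myrinet on the i860 nodes, and 368 / 406 ≈ 0.91 when ParaStation hides the communication behind the computation.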

13
Parallel (1-dim) Dirac Operator Benchmark (SSE), even-odd preconditioned, 2 x 16³ lattice, XEON/Itanium2 CPUs, single-CPU performance, 4 nodes
4 single-CPU nodes, Gbit Ethernet, non-blocking switch, full duplex (see the wire-speed comparison below):
  • P4 (2.4 GHz, 0.5 MB cache), SSE:           285 MFLOPS, 88.92 MB/s
  • P4 (2.4 GHz, 0.5 MB cache), non-SSE:       228 MFLOPS, 75.87 MB/s
  • Itanium2 (900 MHz, 1.5 MB cache), non-SSE: 197 MFLOPS, 63.13 MB/s
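For reference, full-duplex Gigabit Ethernet tops out at 1 Gbit/s, i.e. about 125 MB/s per direction, so the sustained rates above reach roughly 50-70% of wire speed. A small sketch of that comparison; the 125 MB/s limit is the only number not taken from the slide.

```c
/* Sketch: compare the per-node bandwidth sustained by the Dirac kernel with
 * the theoretical full-duplex Gigabit Ethernet limit (1 Gbit/s = 125 MB/s per
 * direction, an added reference number, not from the slide). */
#include <stdio.h>

int main(void) {
    const double gbe_limit   = 1000.0 / 8.0;                 /* 125 MB/s per direction */
    const double sustained[] = { 88.92, 75.87, 63.13 };      /* MB/s, from the slide */
    const char  *label[]     = { "P4 SSE", "P4 non-SSE", "Itanium2 non-SSE" };

    for (int i = 0; i < 3; ++i)
        printf("%-18s %.2f MB/s = %.0f%% of GbE wire speed\n",
               label[i], sustained[i], 100.0 * sustained[i] / gbe_limit);
    return 0;
}
```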

14
Infiniband interconnect
[Block diagram of an InfiniBand fabric: hosts with HCAs connected through switches to I/O controllers with TCAs; links labelled "up to 10 GB/s bidirectional"]
  • Switch: simple, low-cost, multistage network
  • Link: high-speed serial; 1x, 4x, and 12x widths
  • TCA (Target Channel Adapter): interface to I/O controllers (SCSI, FC-AL, GbE, ...)
  • HCA (Host Channel Adapter): protocol engine, moves data via messages queued in memory
  • Chips: IBM, Mellanox; PCI-X cards: Fujitsu, Mellanox, JNI, IBM
  • http://www.infinibandta.org
15
Infiniband interconnect

  
16
Parallel (2-dim) Dirac Operator Benchmark (Ginsparg-Wilson fermions), XEON CPUs, single-CPU performance, 4 nodes
Infiniband vs Myrinet performance, non-SSE (MFLOPS):

                                        XEON 1.7 GHz, Myrinet,     XEON 2.2 GHz, Infiniband,
                                        i860 chipset               E7500 chipset
                                        32-bit       64-bit        32-bit       64-bit
  8x8³ lattice, 2x2 processor grid      370          281           697          477
  16x16³ lattice, 2x4 processor grid    338          299           609          480
17
Future - Low Power Cluster Architectures?
18
Future Cluster Architectures - Blade Servers?
  • NEXCOM low-voltage blade server: 200 low-voltage Intel XEON CPUs (1.6 GHz, 30 W) in a 42U rack, integrated Gbit Ethernet network
  • Mellanox Infiniband blade server: single-XEON blades connected via a 10 Gbit (4X) Infiniband network (MEGWARE, NCSA, Ohio State University)
19
Conclusions
  • PC CPUs have an extremely high sustained LQCD performance using SSE/SSE2 (SIMD + pre-fetch), assuming a sufficiently large local lattice
  • Bottlenecks are the memory throughput and the external I/O bandwidth; both components are improving (chipsets: i860 → E7500 → E7505 → ..., FSB: 400 MHz → 533 MHz → 667 MHz → ..., external I/O: Gbit Ethernet → Myrinet2000 → QSnet → Infiniband → ...)
  • Non-blocking MPI communication can improve the performance by using adequate MPI implementations (e.g. ParaStation)
  • 32-bit architectures (e.g. IA32) have a much better price/performance ratio than 64-bit architectures (Itanium, Opteron?)
  • Large low-voltage dense blade clusters could play an important role in LQCD computing (low-voltage XEON, CENTRINO?, ...)
20
Acknowledgements
We would like to thank Martin Lüscher (CERN) for the benchmark codes and the fruitful discussions about PCs for LQCD, and Isabel Campos Plasencia (Leibniz-Rechenzentrum Munich), Gerhard Wellein (Uni Erlangen), Holger Müller (Megware), Norbert Eicker (Par-Tec), and Chris Eddington (Mellanox) for the opportunity to run the benchmarks on their clusters.