Title: Dolphin Wulfkit and Scali software - The Supercomputer interconnect
1. Dolphin Wulfkit and Scali software - The Supercomputer interconnect
Amal D'Silva (amal_at_summnet.com)
- Summation Enterprises Pvt. Ltd.
- Preferred System Integrators since 1991.
2. Agenda
- Dolphin Wulfkit hardware
- Scali Software / some commercial benchmarks
- Summation profile
3. Interconnect Technologies
[Figure: design space for different interconnect technologies - application areas from WAN, LAN and I/O down to memory, processor and cache, with Ethernet, ATM, FibreChannel, SCSI, Myrinet/cLan, proprietary busses and Dolphin SCI placed along the bus-to-network range; axes show cluster interconnect requirements (distance, bandwidth, latency) against application requirements.]
4. Interconnect impact on cluster performance
- Some real-world examples from the Top500 May 2004 list:
- Intel, Bangalore cluster
- 574 Xeon 2.4 GHz CPUs / GigE interconnect
- Rpeak 2755 GFLOPs, Rmax 1196 GFLOPs
- Efficiency (Rmax/Rpeak): 43%
- Kabru, IMSc, Chennai
- 288 Xeon 2.4 GHz CPUs / Wulfkit 3D interconnect
- Rpeak 1382 GFLOPs, Rmax 1002 GFLOPs
- Efficiency (Rmax/Rpeak): 72%
- Simply put, Kabru gives 84% of the performance with HALF the number of CPUs!
5. Commodity interconnect limitations
- Cluster performance depends primarily on two factors: bandwidth and latency.
- Gigabit Ethernet speed is limited to 1000 Mbps (approx. 80 Megabytes/s in the real world). This is fixed irrespective of processor power.
- With increasing processor speeds, latency (the time taken to move data from one node to another) plays an increasing role in cluster performance.
- Gigabit typically gives an internode latency of 120-150 microseconds. As a result, CPUs in a node are often idling, waiting to get data from another node.
- In any switch-based architecture, the switch becomes a single point of failure. If the switch goes down, so does the cluster.
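The bandwidth and latency figures quoted above are the kind of numbers a simple MPI ping-pong test reports. A minimal sketch of such a test in C follows; it uses plain MPI-1 calls only, nothing specific to Gigabit, SCI or Scali, and the iteration count and default message size are arbitrary choices for illustration.

/* Minimal MPI ping-pong micro-benchmark (plain MPI-1 calls, nothing
 * interconnect-specific).  Run on exactly 2 ranks and pass the message
 * size in bytes as the first argument: a few bytes expose latency,
 * around 1 MB exposes bandwidth.                                      */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000;                       /* arbitrary repeat count */
    int bytes = (argc > 1) ? atoi(argv[1]) : 4;   /* default: latency test  */
    char *buf = malloc(bytes);
    int rank;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double half_rtt = (MPI_Wtime() - t0) / (2.0 * iters);   /* seconds */

    if (rank == 0)
        printf("%d bytes: half round-trip %.2f us, throughput %.1f MB/s\n",
               bytes, half_rtt * 1e6, (bytes / half_rtt) / 1e6);

    MPI_Finalize();
    free(buf);
    return 0;
}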
6. Dolphin Wulfkit advantages
- Internode bandwidth of 260 Megabytes/s on Xeon (over three times faster than Gigabit).
- Latency under 5 microseconds (over TWENTY FIVE times quicker than Gigabit).
- Matrix-type internode connections: no switch, hence no single point of failure.
- Cards can be moved across processor generations. This leads to investment protection.
7. Dolphin Wulfkit advantages (contd.)
- Linear scalability: e.g. adding 8 nodes to a 16 node cluster involves known, fixed costs - eight nodes and eight Dolphin SCI cards. With any switch-based architecture, there are additional issues like unused ports on the switch to consider; e.g. for Gigabit, one has to throw away the 16 port switch and buy a 32 port switch.
- Real-world performance on par with or better than proprietary interconnects like Memory Channel (HP) and NUMAlink (SGI), at cost-effective price points.
8. Wulfkit - The Supercomputer Interconnect
- Wulfkit is based on the Scalable Coherent Interface (SCI), the ANSI/IEEE 1596-1992 standard that defines a point-to-point interface and a set of packet protocols.
- Wulfkit is not a networking technology, but a purpose-designed cluster interconnect.
- The SCI interface has two unidirectional links that operate concurrently.
- Bus-imitating protocol with packet-based handshake protocols and guaranteed data delivery.
- Up to 667 Megabytes/s internode bandwidth.
9. PCI-SCI Adapter Card - 1 slot, 2 dimensions
- SCI adapters (64 bit, 66 MHz)
- PCI/SCI adapter (D335)
- D330 card with LC3 daughter card
- Supports 2 SCI ring connections
- Switching over B-Link
- Used for Wulfkit 2D clusters
- PCI 64/66
- D339 2-slot version
[Diagram: 2D adapter card - the PSB on the PCI 64/66 bus feeding two LC link controllers.]
10. System Interconnect
11. System Architecture
12. 3D Torus topology (for more than 64-72 nodes)
13. Linköping University - NSC - SCI Clusters
Also in Sweden: Umeå University, 120 Athlon nodes
- Monolith: 200 nodes, 2x Xeon 2.2 GHz, 3D SCI
- INGVAR: 32 nodes, 2x AMD 900 MHz, 2D SCI
- Otto: 48 nodes, 2x P4 2.26 GHz, 2D SCI
- Commercial (under installation): 40 nodes, 2x Xeon, 2D SCI
- Total: 320 SCI nodes
14. MPI Connect middleware and MPI Manage cluster setup/management tools
http://www.scali.com
15. Scali Software Platform
- Scali MPI Manage
- Cluster Installation /Management
- Scali MPI Connect
- High Performance MPI Libraries
16. Scali MPI Connect
- Fault Tolerant
- High Bandwidth
- Low Latency
- Multi-Thread safe
- Simultaneous inter-/intra-node operation
- UNIX command line replicated
- Exact message size option
- Manual/debugger mode for selected processes
- Explicit host specification
- Job queuing
- PBS, DQS, LSF, CCS, NQS, Maui
- Conformance to MPI-1.2 verified through 1665 MPI
tests
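Since Scali MPI Connect handles inter- and intra-node traffic simultaneously, a quick way to see where ranks actually land is a tiny placement check. The sketch below uses only standard MPI-1.2 calls, so it is not specific to Scali MPI Connect and should build with any conforming MPI.

/* Tiny placement check: every rank reports which host it runs on.
 * Only standard MPI-1.2 calls are used.                           */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, namelen;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &namelen);

    printf("rank %d of %d running on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}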
17. Scali MPI Manage features
- System Installation and Configuration
- System Administration
- System Monitoring, Alarms and Event Automation
- Work Load Management
- Hardware Management
- Heterogeneous Cluster Support
18. Fault Tolerance
The 2D torus topology offers more routing options. With the XY routing algorithm: node 33 fails, the nodes on node 33's ringlets become unavailable, and the cluster is fractured with the current routing setting.
19. Fault Tolerance (contd.)
Scali's advanced routing algorithm, from the Turn Model family of routing algorithms: all nodes but the failed one (node 33) can be utilised as one big partition.
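The fracture described above is a property of dimension-ordered (XY) routing itself. The small C model below is purely illustrative - it is not Scali's Turn Model code - and simply counts how many source/destination pairs on a 4x4 torus lose their XY route when one node (here coordinates (3,3), i.e. "node 33") fails.

/* Illustrative model of XY (dimension-ordered) routing on a 2D torus.
 * A packet first travels along X to the destination column, then along Y.
 * If any hop lands on the failed node, that XY route is unusable,
 * which is why a smarter (Turn Model) algorithm is needed.
 * This is a teaching sketch, not Scali's actual routing code.          */
#include <stdio.h>

#define DIM 4                                    /* 4x4 torus, arbitrary size */

static int next_hop(int cur, int dst)            /* one step toward dst on a ring */
{
    int fwd = (dst - cur + DIM) % DIM;           /* hops in the "+1" direction */
    return (fwd <= DIM / 2) ? (cur + 1) % DIM    /* take the shorter way round */
                            : (cur - 1 + DIM) % DIM;
}

/* Returns 1 if the XY route from (sx,sy) to (dx,dy) avoids the failed node. */
static int xy_route_ok(int sx, int sy, int dx, int dy, int fx, int fy)
{
    int x = sx, y = sy;
    while (x != dx) {                            /* X dimension first */
        x = next_hop(x, dx);
        if (x == fx && y == fy) return 0;
    }
    while (y != dy) {                            /* then Y dimension  */
        y = next_hop(y, dy);
        if (x == fx && y == fy) return 0;
    }
    return 1;
}

int main(void)
{
    int fx = 3, fy = 3;                          /* the failed node   */
    int blocked = 0, total = 0;

    for (int sx = 0; sx < DIM; sx++) for (int sy = 0; sy < DIM; sy++)
        for (int dx = 0; dx < DIM; dx++) for (int dy = 0; dy < DIM; dy++) {
            if ((sx == fx && sy == fy) || (dx == fx && dy == fy)) continue;
            if (sx == dx && sy == dy) continue;
            total++;
            if (!xy_route_ok(sx, sy, dx, dy, fx, fy)) blocked++;
        }

    printf("%d of %d node pairs lose their XY route when node (%d,%d) fails\n",
           blocked, total, fx, fy);
    return 0;
}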
20. Scali MPI Manage GUI
21. Monitoring (contd.)
22. System Monitoring
- Resource Monitoring
- CPU
- Memory
- Disk
- Hardware Monitoring
- Temperature
- Fan Speed
- Operator alarms on selected parameters at specified thresholds
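As a rough illustration of the "alarms at specified thresholds" idea, the sketch below reads the Linux 1-minute load average from /proc/loadavg and flags it when it crosses a chosen limit. Scali MPI Manage does this kind of monitoring through its own agents and GUI; this is only a conceptual stand-in, and the threshold value is arbitrary.

/* Minimal sketch of a threshold alarm: read the Linux 1-minute load
 * average from /proc/loadavg and warn when it exceeds a chosen limit.
 * Only a conceptual stand-in for what a monitoring agent would do.   */
#include <stdio.h>

int main(void)
{
    const double threshold = 4.0;       /* arbitrary alarm threshold */
    double load1;
    FILE *f = fopen("/proc/loadavg", "r");

    if (!f || fscanf(f, "%lf", &load1) != 1) {
        fprintf(stderr, "cannot read /proc/loadavg\n");
        return 1;
    }
    fclose(f);

    if (load1 > threshold)
        printf("ALARM: 1-min load %.2f exceeds threshold %.2f\n", load1, threshold);
    else
        printf("OK: 1-min load %.2f\n", load1);
    return 0;
}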
23. Events/Alarms
24. SCI vs. Myrinet 2000: Ping-Pong comparison
25. Itanium vs. Cray T3E: Bandwidth
26. Itanium vs. T3E: Latency
27. Some Reference Customers
- Max Planck Institute für Plasmaphysik, Germany
- University of Alberta, Canada
- University of Manitoba, Canada
- Etnus Software, USA
- Oracle Inc., USA
- University of Florida, USA
- deCODE Genetics, Iceland
- Uni-Heidelberg, Germany
- GMD, Germany
- Uni-Giessen, Germany
- Uni-Hannover, Germany
- Uni-Düsseldorf, Germany
- Linux NetworX, USA
- Magmasoft AG, Germany
- University of Umeå, Sweden
- University of Linköping, Sweden
- PGS Inc., USA
- US Naval Air, USA
- Spacetec/Tromsø Satellite Station, Norway
- Norwegian Defense Research Establishment
- Parallab, Norway
- Paderborn Parallel Computing Center, Germany
- Fujitsu Siemens computers, Germany
- Spacebel, Belgium
- Aerospatiale, France
- Fraunhofer Gesellschaft, Germany
- Lockheed Martin TDS, USA
- University of Geneva, Switzerland
- University of Oslo, Norway
- Uni-C, Denmark
- University of Lund, Sweden
- University of Aachen, Germany
- DNV, Norway
- DaimlerChrysler, Germany
- AEA Technology, Germany
- BMW AG, Germany
28. Some more Reference Customers
- Rolls Royce Ltd., UK
- Norsk Hydro, Norway
- NGU, Norway
- University of Santa Cruz, USA
- Jodrell Bank Observatory, UK
- NTT, Japan
- CEA, France
- Ford/Visteon, Germany
- ABB AG, Germany
- National Technical University of Athens, Greece
- Medasys Digital Systems, France
- PDG Linagora S.A., France
- Workstations UK, Ltd., England
- Bull S.A., France
- The Norwegian Meteorological Institute, Norway
- Nanco Data AB, Sweden
- Aspen Systems Inc., USA
- Atipa Linux Solution Inc., USA
- Intel Corporation Inc., USA
- IOWA State University, USA
- Los Alamos National Laboratory, USA
- Penguin Computing Inc., USA
- Times N Systems Inc., USA
- University of Alberta, Canada
- Monash University, Australia
- University of Southern Mississippi, USA
- Jacusiel Acuna Ltda., Chile
- University of Copenhagen, Denmark
- Caton Sistemas Alternativos, Spain
- Mapcon Geografical Inform, Sweden
- Fujitsu Software Corporation, USA
- City Team OY, Finland
- Falcon Computers, Finland
- Link Masters Ltd., Holland
- MIT, USA
- Paralogic Inc., USA
29. Application Benchmarks
- With Dolphin SCI and Scali MPI
30. NAS parallel benchmarks (16 CPUs / 8 nodes)
31. Magma (16 CPUs / 8 nodes)
32. Eclipse (16 CPUs / 8 nodes)
33. FEKO parallel speedup
34. Acusolve (16 CPUs / 8 nodes)
35. Visage (16 CPUs / 8 nodes)
36. CFD scaling: MM5 linear to 400 CPUs
37. Scaling: Fluent on the Linköping cluster
38. Dolphin Software
- All Dolphin software is free open source (GPL or LGPL)
- SISCI
- SCI-SOCKET
- Low-latency socket library
- TCP and UDP replacement (unmodified socket code keeps working; see the sketch after this slide)
- User and kernel level support
- Release 2.0 available
- SCI-MPICH (RWTH Aachen)
- MPICH 1.2 and some MPICH 2 features
- New release is being prepared, beta available
- SCI Interconnect Manager
- Automatic failover recovery
- No single point of failure in 2D and 3D networks
- Other
- SCI Reflective Memory, Scali MPI, Linux Labs SCI Cluster Cray-compatible shmem and Clugres PostgreSQL, MandrakeSoft Clustering HPC solution, Xprime X1 Database Performance Cluster for Microsoft SQL Servers, ClusterFrame from Qlusters and SunCluster 3.1 (Oracle 9i), MySQL Cluster
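Because SCI-SOCKET is presented above as a transparent TCP and UDP replacement, ordinary Berkeley-sockets code is meant to keep working unchanged. The plain POSIX TCP client below shows the kind of unmodified code such a library would carry over the SCI interconnect; the peer name "node-02" and port 5000 are made-up placeholders, and nothing here is SCI-SOCKET-specific.

/* Plain POSIX TCP client -- ordinary code like this needs no changes;
 * a transparent TCP replacement such as SCI-SOCKET carries the same
 * traffic over the SCI interconnect.  Peer name and port are placeholders. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

int main(void)
{
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof hints);
    hints.ai_family   = AF_UNSPEC;       /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;     /* TCP          */

    if (getaddrinfo("node-02", "5000", &hints, &res) != 0) {
        fprintf(stderr, "lookup failed\n");
        return 1;
    }

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0) {
        fprintf(stderr, "connect failed\n");
        freeaddrinfo(res);
        return 1;
    }

    const char msg[] = "hello over ordinary sockets\n";
    write(fd, msg, sizeof msg - 1);      /* same call, whatever the transport */

    close(fd);
    freeaddrinfo(res);
    return 0;
}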
39. Summation Enterprises Pvt. Ltd.
40.
- Our expertise: clustering for High Performance Technical Computing, clustering for High Availability, terabyte storage solutions, SANs
- O.S. skills: Linux (Alpha 64-bit, x86 32-bit and 64-bit), Solaris (SPARC and x86), Tru64 Unix, Windows NT/2K/2003 and the QNX Realtime O.S.
41. Summation milestones
- Working with Linux since 1996
- First in India to deploy/support 64-bit Alpha Linux workstations (1999)
- First in India to spec, deploy and support a 26-processor Alpha Linux cluster (2001)
- Only company in India to have worked with Gigabit, SCI and Myrinet interconnects
- Involved with the design, setup and support of many of the largest HPTC clusters in India
42. Exclusive Distributors / System Integrators in India
- Dolphin Interconnect AS, Norway
- SCI interconnect for supercomputer performance
- Scali AS, Norway
- Cluster management tools
- Absoft, Inc., USA
- FORTRAN development tools
- Steeleye Inc., USA
- High Availability clustering and Disaster Recovery solutions for Windows and Linux
- Summation is the sole distributor, consulting services and technical support partner for Steeleye in India
43. Partnering with industry leaders
- Sun Microsystems, Inc.
- Focus on Education and Research segments
- High Performance Technical Computing, Grid Computing initiative with Sun Grid Engine (SGE/SGEE)
- HPTC Competency Centre
44. Wulfkit / HPTC users
- Institute of Mathematical Sciences, Chennai
- 144 node Dual Xeon Wulfkit 3D cluster
- 9 node Dual Xeon Wulfkit 2D cluster
- 9 node Dual Xeon Ethernet cluster
- 1.4 TB RAID storage
- Bhabha Atomic Research Centre, Mumbai
- 64 node Dual Xeon Wulfkit 2D cluster
- 40 node P4 Wulfkit 3D cluster
- Alpha servers / Linux OpenGL workstations / rackmount servers
- Harish Chandra Research Institute, Allahabad
- 42 node Dual Xeon Wulfkit cluster
- 1.1 TB RAID storage
45. Wulfkit / HPTC users (contd.)
- Intel Technology India Pvt. Ltd., Bangalore
- Eight node Dual Xeon Wulfkit clusters (ten nos.)
- NCRA (TIFR), Pune
- 4 node Wulfkit 2D cluster
- Bharat Forge Ltd., Pune
- Nine node Dual Xeon Wulfkit 2D cluster
- Indian Rare Earths Ltd., Mumbai
- 26 processor Alpha Linux cluster with RAID storage
- Tata Institute of Fundamental Research, Mumbai
- RISC/Unix servers, four node Xeon cluster
- Centre for Advanced Technology, Indore
- Alpha / Sun workstations
46. Questions?
- Amal D'Silva, email: amal_at_summnet.com
- GSM: 98202 83309