Title: Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters
1. Performance Evaluation of Gigabit Ethernet-Based Interconnects for HPC Clusters
- Pawel Pisarczyk pawel.pisarczyk_at_atm.com.pl
- Jaroslaw Weglinski jaroslaw.weglinski_at_atm.com.pl
Cracow, 16 October 2006
2. Agenda
- Introduction
- HPC cluster interconnects
- Message propagation model
- Experimental setup
- Results
- Conclusions
3. Who we are
- joint stock company
- founded in 1994, earlier (since 1991) as a department within PP ATM
- IPO in September 2004 (Warsaw Stock Exchange)
- major shares owned by founders (Polish citizens)
- no state capital involved
- financial data
  - stock capital about 6 million
  - 2005 sales 29.7 million
- about 230 employees
4. Mission
- building business value through innovative information and communication technology initiatives creating new markets in Poland and abroad
- ATM's competitive advantage is based on combining three key competences:
  - integration of comprehensive IT systems
  - telecommunication services
  - consulting and software development
5. Achievements
- 1991 Poland's first company connected to the Internet
- 1993 Poland's first commercial ISP
- 1994 Poland's first LAN with ATM backbone
- 1994 Poland's first supercomputer on Dongarra's Top 500 list
- 1995 Poland's first MAN in ATM technology
- 1996 Poland's first corporate network with voice and data integration
- 2000 Poland's first prototype interactive TV system over a public network
- 2002 Poland's first validated MES system for a pharmaceutical factory
- 2003 Poland's first commercial, public wireless LAN
- 2004 Poland's first public IP content billing system
6. Client base
(based on 2005 sales revenues)
7. HPC clusters developed by ATM
- 2004 - Poznan Supercomputing and Networking Center
  - 238 Itanium2 CPUs, 119 HP rx2600 nodes with Gigabit Ethernet interconnect
- 2005 - University of Podlasie
  - 34 Itanium2 CPUs, 17 HP rx2600 nodes with Gigabit Ethernet interconnect and the Lustre 1.2 filesystem
- 2005 - Poznan Supercomputing and Networking Center
  - 86 dual-core Opteron CPUs, 42 Sun SunFire v20z and 1 Sun SunFire v40z with Gigabit Ethernet interconnect
- 2006 - Military University of Technology, Faculty of Engineering, Chemistry and Applied Physics
  - 32 Itanium2 CPUs, 16 HP rx1620 with Gigabit Ethernet interconnect
- 2006 - Gdansk University of Technology, Department of Pharmaceutical Technology and Chemistry
  - 22 Itanium2 CPUs (11 HP rx1620) with Gigabit Ethernet interconnect
8. Selected software projects related to distributed systems
- Distributed Multimedia Archive in Interactive Television (iTVP) Project
  - scalable storage for the iTVP platform with the ability to process the stored content
- ATM Objects
  - scalable storage for a multimedia content distribution platform
  - system for the Cinem@n company (founded by ATM and Monolith)
  - Cinem@n will introduce high-quality movie, news and entertainment digital content distribution services
- Spread Screens Manager
  - platform for POS TV
  - system currently used by Zabka (shopping network) and Neckermann (travel service)
  - about 300 terminals presenting multimedia content, located in many Polish cities
9. Selected current projects
- ATMFS
  - distributed filesystem for petabyte-scale storage based on COTS hardware
  - based on variable-sized chunks
  - advanced replication and enhanced error detection
  - dependability evaluation based on a software fault injection technique
- FastGig
  - RDMA stack for a Gigabit Ethernet-based interconnect
  - message passing latency reduction
  - increases application performance
10. Uses of computer networks in HPC clusters
- Exchange of messages between cluster nodes to coordinate distributed computation
  - requires high maximal throughput and also low latency
  - inefficiency is observed when the time consumed in a single computation step is comparable to the message passing time
- Access to shared data through a network or cluster file system
  - requires high bandwidth when transferring data in blocks of defined size
  - filesystem and storage drivers try to reduce the number of I/O operations issued (by buffering data and aggregating transfers)
11. Comparison of characteristics of interconnect technologies
Brett M. Bode, Jason J. Hill, and Troy R. Benjegerdes, "Cluster Interconnect Overview", Scalable Computing Laboratory, Ames Laboratory
12. Gigabit Ethernet interconnect characteristics
- Popular technology for low-cost cluster interconnects
- Satisfactory throughput for long frames (1000 bytes and longer)
- High latency and low throughput for small frames
- These drawbacks are mostly caused by the design of existing network interfaces
- What is the influence of the network stack implementation on the communication latency?
13. Message propagation model
- Latency between transferring the message to/from the MPI library and transferring data to/from the stack
- Time difference between the sendto/recvfrom functions and the driver start_xmit/interrupt functions
- Execution time of the driver functions
- Processing time of the network interface
- Propagation latency and latency introduced by active network elements
14. Experimental setup
- Two HP rx2600 servers
  - 2 x Intel Itanium2 1.3 GHz, 3 MB cache
  - Debian GNU/Linux Sarge 3.1 operating system (kernel 2.6.8-2-mckinley-smp)
- Gigabit Ethernet interfaces
  - Broadcom BCM5701 chipset connected via the PCI-X bus
- To eliminate additional delays that could be introduced by external active network devices, the servers were connected with crossover cables
- Two NIC drivers were tested: tg3 (polling NAPI driver) and bcm5700 (interrupt-driven driver)
15. Tools used for measurements
- NetPipe package for measuring throughput and latency for TCP and several MPI implementations
- For low-level testing, test programs working directly on Ethernet frames were developed
- Test programs and NIC drivers were modified to allow measuring, inserting, and transferring timestamps
16. Throughput characteristic for the tg3 driver
17. Latency characteristic for the tg3 driver
18. Results for the tg3 driver
- The overhead introduced by the MPI library is relatively low
- There is a big difference between transmission latencies in ping-pong and streaming modes
- The latency introduced for small frames is similar to that of a 115 kbps UART (in the case of transmitting one byte only)
- We can deduce that some mechanism in the transmission path delays the transmission of single packets
- What is the difference between a NAPI and an interrupt-driven driver?
19. Interrupt-driven driver vs NAPI driver (throughput characteristic)
20. Interrupt-driven driver vs NAPI driver (latency characteristic)
21. Interrupt-driven driver vs NAPI driver (latency characteristic) - details
22. Comparison of the bcm5700 and tg3 drivers
- With the default configuration, the bcm5700 driver has worse characteristics than tg3
- The interrupt-driven version (default configuration) cannot achieve more than 650 Mb/s of throughput for frames of any size
- After disabling interrupt coalescing, the performance of the bcm5700 driver exceeded the results obtained with the tg3 driver
- Disabling polling can improve the characteristics of the network driver, but NAPI is not the major cause of the transmission delay
23. Tools for message processing time measurement
- Timestamps were inserted into the message at each processing stage
- Processing stages on the transmitter side:
  - sendto() function
  - bcm5700_start_xmit()
  - interrupt notifying frame transmission
- Processing stages on the receiver side:
  - interrupt notifying frame receipt
  - netif_rx()
  - recvfrom() function
- The CPU clock cycle counter was used as a high-precision timer (precision of 0.77 ns, i.e. 1/1.3 GHz)
24. Transmitter latency in streaming mode
(timing diagram: Send - 17 us, Answer - 17 us)
25. Distribution of delays in the transmission path between cluster nodes
26. Conclusions
- We estimate that RDMA-based communication can reduce the MPI message propagation time from 43 µs to 23 µs (nearly doubling the performance for short messages)
- There is also a possibility of reducing the T3 and T5 latencies by changing the configuration of the network interface (transmit and receive thresholds)
- The conducted research did not consider differences between network interfaces (the T3 and T5 delays may be longer or shorter than measured)
- The latency introduced by a switch is also omitted
- The FastGig project includes not only a communication library, but also a measurement and communication profiling framework