Title: The Future of Parallel Computing
1. The Future of Parallel Computing
- Dave Turner
- In collaboration with
- Xuehua Chen, Adam Oline, and Troy Benjegerdes
- Scalable Computing Laboratory of Ames Laboratory
- This work was funded by the MICS office of the US
Department of Energy
2. Outline
- Overview of parallel computing
- Measuring the performance of the communication system
- Improving the message-passing performance
- Taking advantage of the network topology
- Science enabled by parallel computing
3. The typical small cluster
- 2.8 GHz dual-Xeon node ($1800 each)
- 2 GB RAM
- SuperMicro or Tyan motherboards
- Built-in single or dual Gigabit Ethernet
- 1U or 2U rackmount, as needed for expansion
- 24-port Gigabit Ethernet switch
- Asante or Netgear ($1400)
- 47U Rack ($3000)
- KVM, UPS, Monitor, cables
- Software
- Intel compilers ($560 academic)
- MPI, PVM, PBS scheduler
$25,000 will get you a master plus 10 compute
nodes (22 processors). Cluster vendors such as
Atipa and Microway will sell fully integrated
clusters.
4. The IBM Blue Gene/L
- 65,536 dual-processor nodes
- 700 MHz PowerPC 440 cores with dual floating point units
- 256-512 MB RAM
- Each node runs a very stripped-down version of Linux
- Should fit in a room the size of a tennis court
- Peak computational rate of 360 Teraflops
- 5 separate networks
- 3D torus network for MPI communications (MPICH2)
- 1.4 Gbps peak bandwidth in each direction
- A tree network connects every 64 nodes to an I/O node
- Will also handle some MPI collectives
- A Fast Ethernet control network
- Global interrupt network
- I/O nodes
- Gigabit Ethernet connection
- Run a full Linux kernel
5. The dual-Athlon cluster with SCI interconnects
- Initially we will connect 64 dual-Athlon PCs with an 8x8 SCI grid.
- The Mezzanine card will allow for a 3D torus in the future.
6. Blade-based clusters
- dual-Xeon blades ($5000 each)
- 2-8 GB RAM
- 1-2 slow mini-disks (40-80 GB each)
- Built-in single or dual Gigabit Ethernet
- 1 PCI-X expansion slot
- 6U Chassis
- Holds 10 blades plus a network switch module
- Lots of cooling
- 48U rack
- Could hold 80 blades (160 processors)
Prices will come down.
7. Inefficiencies in the communication system
[Diagram: a message passes from the application through MPI, the native layer, the internal buses (PCI, memory), the driver, and the NIC to the switch fabric. Annotations: 75% of peak bandwidth and 2-3x the latency; topological bottlenecks; poor MPI usage; no mapping; hardware limits; driver tuning; OS bypass; TCP tuning.]
8. Waveguide simulations using the parallel Finite Difference Time Domain method
Kai-Ming Ho, Rana Biswas, Mihail Sigalas, Ihab El-Kady, Mehmet Su, Dave Turner, Bogdan Vasiliu
9. Waveguide bends in three-dimensional layer-by-layer photonic band gap materials, M.M. Sigalas,
R. Biswas, K.M. Ho, C.M. Soukoulis, D.E. Turner,
B. Vasiliu, S.C. Kothari, and Shawn Lin,
Microwave and Optical Technology Letters, Vol.
23, 56-59 (Oct. 5, 1999).
10. [Slide fragment] ...with or without fence calls. Measure performance or do an integrity test.
http://www.scl.ameslab.gov/Projects/NetPIPE/
11. The NetPIPE utility
- NetPIPE does a series of ping-pong tests between two nodes.
- Message sizes are chosen at regular intervals, and with slight perturbations, to fully test the communication system for idiosyncrasies.
- Latencies reported represent half the ping-pong time for messages smaller than 64 Bytes.
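A minimal sketch of such a ping-pong measurement written directly in MPI (illustrative only, not NetPIPE's actual source; the message size and repeat count here are arbitrary):

    /* Minimal MPI ping-pong timing sketch; NetPIPE's real tests vary
     * the message size and add slight perturbations. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, i, nrepeat = 1000, nbytes = 1024;
        double t0, t1, rtt;
        char *buf = malloc(nbytes);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < nrepeat; i++) {
            if (rank == 0) {        /* bounce the message back and forth */
                MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0) {            /* latency = half the round-trip time */
            rtt = (t1 - t0) / nrepeat;
            printf("latency %g us   bandwidth %g Mbps\n",
                   0.5 * rtt * 1.0e6, 8.0 * nbytes / (0.5 * rtt) / 1.0e6);
        }
        free(buf);
        MPI_Finalize();
        return 0;
    }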
Some typical uses
- Measuring the overhead of message-passing protocols.
- Help in tuning the optimization parameters of message-passing libraries.
- Optimizing driver and OS parameters (socket buffer sizes, etc.).
- Identifying dropouts in networking hardware and drivers.
What is not measured
- NetPIPE cannot measure the load on the CPU yet.
- The effects from the different methods for maintaining message progress.
- Scalability with system size.
12. Recent additions to NetPIPE
- Can do an integrity test instead of measuring performance.
- Streaming mode measures performance in 1 direction only.
- Must reset sockets to avoid effects from a collapsing window size.
- A bi-directional ping-pong mode has been added (-2).
- One-sided Get and Put calls can be measured (MPI or SHMEM).
- Can choose whether to use an intervening MPI_Fence call to synchronize (a fenced Put is sketched after the diagram below).
- Messages can be bounced between the same buffers (default mode), or they can be started from a different area of memory each time.
- There are lots of cache effects in SMP message-passing.
- InfiniBand can show similar effects since memory must be registered with the card.
[Diagram: processes 0 and 1 exchanging messages between buffers 0-3.]
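For reference, a fenced one-sided Put with standard MPI-2 calls looks roughly like the sketch below (illustrative, not NetPIPE's code; the window is assumed to have been created earlier with MPI_Win_create):

    /* Sketch of a fenced one-sided Put (MPI-2).  NetPIPE can also run
     * the same transfer without the intervening fence to expose its cost. */
    #include <mpi.h>

    void put_with_fence(void *buf, int nbytes, int target, MPI_Win win)
    {
        MPI_Win_fence(0, win);                      /* open the access epoch   */
        MPI_Put(buf, nbytes, MPI_BYTE,              /* write nbytes into the   */
                target, 0, nbytes, MPI_BYTE, win);  /* target's exposed window */
        MPI_Win_fence(0, win);                      /* complete the transfer   */
    }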
13. Current projects
- Overlapping pair-wise ping-pong tests.
- Must consider synchronization if not using
bi-directional communications.
[Diagram: four nodes (n0-n3) attached to an Ethernet switch, contrasting line-speed-limited with end-point-limited communication patterns.]
- Investigate other methods for testing the global network.
- Evaluate the full range from simultaneous nearest-neighbor communications to all-to-all.
14. LAM/MPI
- LAM 6.5.6-4 release from the RedHat 7.2 distribution.
- Must lamboot the daemons.
- -lamd directs messages through the daemons.
- -O avoids data conversion for homogeneous systems.
- No socket buffer size tuning.
- No threshold adjustments.
Currently developed at Indiana University.
http://www.lam-mpi.org/
15. PVM
- PVM 3.4.3 release from the RedHat 7.2 distribution.
- Uses XDR encoding and the pvmd daemons by default.
- pvm_setopt(PvmRoute, PvmRouteDirect) bypasses the pvmd daemons.
- pvm_initsend(PvmDataInPlace) avoids XDR encoding for homogeneous systems.
Developed at Oak Ridge National Laboratory.
http://www.csm.ornl.gov/pvm/
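For reference, the two PVM optimizations above might be used together as in this sketch (the destination tid, message tag, and data are made up for illustration):

    /* Sketch of a PVM send that bypasses the pvmd daemons and skips
     * XDR encoding on homogeneous systems. */
    #include <pvm3.h>

    void send_direct(int dest_tid, double *data, int n)
    {
        pvm_setopt(PvmRoute, PvmRouteDirect);  /* direct task-to-task sockets */
        pvm_initsend(PvmDataInPlace);          /* no XDR encoding or copy     */
        pvm_pkdouble(data, n, 1);              /* pack n doubles, stride 1    */
        pvm_send(dest_tid, 99);                /* arbitrary message tag       */
    }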
16. A NetPIPE example: performance on a Cray T3E
- Raw SHMEM delivers
- 2600 Mbps
- 2-3 us latency
- Cray MPI originally delivered
- 1300 Mbps
- 20 us latency
- MP_Lite delivers
- 2600 Mbps
- 9-10 us latency
- New Cray MPI delivers
- 2400 Mbps
- 20 us latency
The tops of the spikes are where the message size is divisible by 8 bytes.
17. Channel-bonding Gigabit Ethernet for better communications between nodes
Channel-bonding uses 2 or more Gigabit Ethernet cards per PC to increase the communication rate between nodes in a cluster. GigE cards cost $40 each and 24-port switches cost $1400, or roughly $100 per computer. This is much more cost-effective for PC clusters than using more expensive networking hardware, and may deliver similar performance.
18. Performance for channel-bonded Gigabit Ethernet
GigE can deliver 900 Mbps with latencies of 25-62
us for PCs with 64-bit / 66 MHz PCI
slots. Channel-bonding 2 GigE cards / PC using
MP_Lite doubles the performance for large
messages. Adding a 3rd card does not help
much. Channel-bonding 2 GigE cards / PC using
Linux kernel level bonding actually results in
poorer performance. The same tricks that make
channel-bonding successful in MP_Lite should make
Linux kernel bonding work even better. Any
message-passing system could then make use of
channel-bonding on Linux systems.
Channel-bonding multiple GigE cards using MP_Lite
and Linux kernel bonding
19. Performance on Mellanox InfiniBand cards
A new NetPIPE module allows us to measure the raw
performance across InfiniBand hardware (RDMA and
Send/Recv). Burst mode preposts all receives to
duplicate the Mellanox test. The no-cache
performance is much lower when the memory has to
be registered with the card. An MP_Lite
InfiniBand module will be incorporated into
LAM/MPI.
MVAPICH 0.9.1
20. 10 Gigabit Ethernet
Intel 10 Gigabit Ethernet cards on a 133 MHz PCI-X bus, single-mode fiber, Intel ixgb driver. Can only achieve 2 Gbps now, with a latency of 75 us. Streaming mode delivers up to 3 Gbps. Much more development work is needed.
21. Comparison of high-speed interconnects
InfiniBand can deliver 4500 - 6500 Mbps at a 7.5
us latency. Atoll delivers 1890 Mbps with a 4.7
us latency. SCI delivers 1840 Mbps with only a
4.2 us latency. Myrinet performance reaches 1820
Mbps with an 8 us latency. Channel-bonded GigE
offers 1800 Mbps for very large messages. Gigabit
Ethernet delivers 900 Mbps with a 25-62
us latency. 10 GigE only delivers 2 Gbps with a
75 us latency.
22. The MP_Lite message-passing library
- A light-weight MPI implementation
- Highly efficient for the architectures supported
- Designed to be very user-friendly
- Ideal for performing message-passing research
- http://www.scl.ameslab.gov/Projects/MP_Lite/
23. 2-copy SMP message-passing
[Diagram: two processes on two processors, each with its own cache; a message goes from process 0's memory into a shared-memory segment in main memory and then into process 1's memory, requiring two copies.]
24. Shared-memory message-passing using a typical semaphore-based approach
- One large segment shared by all processors.
- Minimize lockouts to only when the linked list changes.
- Minimize search time with a second linked list for each destination.
- Semaphores are slow.
- Still not scalable.
[Diagram: a single shared-memory segment holds the queued messages (0→1, 2→0, 1→3, 1→0, 3→2, 1→2) for processes 0-3.]
25. The MP_Lite locking FIFO approach
- The message headers are sent through shared-memory FIFO pipes.
- The main segment is only locked during allocation/de-allocation.
- A process spins on an atomic operation with an occasional schedule yield (sketched after the diagram below).
- An optimized memory copy routine is used.
[Diagram: message headers flow through per-pair shared-memory FIFOs (FIFO 0→1, 0→2, 0→3, 1→0, 1→2, 1→3, 3→2, ...), while the message data (0→1, 2→0, 1→3, 1→0, 3→2, 1→2) is held in the shared-memory segment for processes 0-3.]
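A minimal sketch of the kind of spin-wait described above, assuming a flag word in the shared segment (using a GCC atomic builtin and sched_yield(); MP_Lite's actual code differs):

    /* Spin on a flag in shared memory, yielding the CPU occasionally so
     * another process scheduled on the same processor can make progress. */
    #include <sched.h>

    #define SPINS_PER_YIELD 1000

    void wait_for_flag(volatile int *flag)
    {
        int spins = 0;

        /* an atomic add of 0 acts as an atomic load with a full barrier */
        while (__sync_fetch_and_add((int *) flag, 0) == 0) {
            if (++spins >= SPINS_PER_YIELD) {
                sched_yield();      /* occasionally give up the time slice */
                spins = 0;
            }
        }
    }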
26. Optimizing the memory copy routine
The 686 version of GLIBC has a memcpy routine
that does byte-by-byte copies for messages not
divisible by 4 bytes. The Intel memcpy is good,
but does not make use of the non-temporal copy in
the Pentium 4 instruction set. An optimal memcpy
is being developed to try to provide the best
performance everywhere.
[Plot: memory copy performance with the data starting in cache; 2.4 GHz Xeon running RedHat 7.3.]
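A minimal sketch of a non-temporal block copy using SSE2 intrinsics (assuming 16-byte-aligned buffers and a size that is a multiple of 64 bytes; MP_Lite's tuned routine handles the general case):

    /* Non-temporal copy: streaming stores bypass the cache so a large
     * message copy does not evict the application's working set. */
    #include <emmintrin.h>      /* SSE2 intrinsics */

    void memcpy_nt(void *dst, const void *src, long nbytes)
    {
        const __m128i *s = (const __m128i *) src;
        __m128i       *d = (__m128i *) dst;
        long i, n = nbytes / 16;            /* 16 bytes per vector */

        for (i = 0; i < n; i += 4) {        /* 64 bytes per iteration */
            __m128i a = _mm_load_si128(s + i);
            __m128i b = _mm_load_si128(s + i + 1);
            __m128i c = _mm_load_si128(s + i + 2);
            __m128i e = _mm_load_si128(s + i + 3);
            _mm_stream_si128(d + i,     a); /* non-temporal stores */
            _mm_stream_si128(d + i + 1, b);
            _mm_stream_si128(d + i + 2, c);
            _mm_stream_si128(d + i + 3, e);
        }
        _mm_sfence();   /* make the streaming stores globally visible */
    }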
27. The MP_Lite lock-free approach
- Each process has its own section for outgoing messages.
- Other processes only flip a cleanup flag.
- Having no lockouts provides excellent scaling.
- Doubly linked lists allow very efficient searches, or this can be combined with the shared-memory FIFO method.
[Diagram: the shared-memory segment is divided into per-process sections (0-3); each section holds that process's outgoing messages (0→1; 1→3, 1→0, 1→2; 2→0; 3→2).]
28. SMP message-passing performance with cache effects
With the data starting in cache.
LAM/MPI has the lowest latency at 1 us, with
MPICH2 and MP_Lite at 2 us. MP_Lite dominates in
the cache region due to better lock and header
handling. The non-temporal memory copy boosts
performance by 50% for large messages.
1.7 GHz dual-Xeon running RedHat 8.0
29. Bi-directional SMP message-passing performance
With the data starting in cache.
MP_Lite and LAM/MPI have latencies around 3 us,
with MPICH at 16 us. MP_Lite and MPICH peak at 7000 Mbps. The non-temporal memory copy of MP_Lite boosts the large message rate by 20%.
Bi-directional results are pretty similar to the
uni-directional results.
1.7 GHz dual-Xeon running RedHat 8.0
30. Performance using the Ames Lab Classical
Molecular Dynamics code
Communication times in microseconds per iterative
cycle
2 MPI processes on a 1.7 GHz dual-Xeon computer
running RedHat 7.3
31. 1-copy SMP message-passing for Linux
[Diagram: a kernel Put or Get copies the message directly from process 0's memory to process 1's memory in a single kernel copy, with no intermediate shared-memory segment.]
This should double the throughput. It is unclear
what the latency will be. It should be
relatively easy to write an MPI implementation.
32. Writing the Linux module
- The kernel has 2 functions for transferring data to and from user space.
- copy_from_user() checks for read access then gets data.
- copy_to_user() checks for write access then puts data.
- Write a copy_from_user_to_user() function.
- Create an initialization function join_copy_group().
- Check for read/write access to all the processes once.
- Expose these to the message-passing layer using a module kernel_copy.c.
- Write an MPI implementation using 1-sided Gets and Puts.
- MPI_Init() will call join_copy_group().
- MPI_Send() will put data if a receive is preposted, else push to a send buffer and post.
- MPI_Recv() will block on message reception, or on the posting of a matching buffered send, in which case it would get the data.
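A rough user-space sketch of that send logic; find_preposted_recv(), kernel_put(), and queue_buffered_send() are hypothetical placeholders for what the kernel_copy module and message-passing layer would provide, not an existing API:

    #include <stddef.h>

    /* Hypothetical interfaces the kernel_copy module would expose. */
    extern void *find_preposted_recv(int dest, int tag);
    extern void  kernel_put(void *src, void *dst, int nbytes, int dest);
    extern void  queue_buffered_send(void *buf, int nbytes, int dest, int tag);

    /* Sketch of a send built on a 1-sided kernel put (illustrative only). */
    int sketch_send(void *buf, int nbytes, int dest, int tag)
    {
        void *recv_addr = find_preposted_recv(dest, tag);

        if (recv_addr != NULL) {
            /* receiver already posted: one kernel copy, user to user */
            kernel_put(buf, recv_addr, nbytes, dest);
        } else {
            /* otherwise copy into a send buffer and post the header so
             * the receiver can get the data when its receive is posted */
            queue_buffered_send(buf, nbytes, dest, tag);
        }
        return 0;
    }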
33. 0-copy SMP message-passing for Linux?
[Diagram: MP_FreeSend() on process 0 and MP_MallocRecv() on process 1 let the kernel remap memory pages 1, 2, and 3 from one process to the other as a virtual copy, with no data movement through main memory.]
- If the source node does not need to retain a copy, do an MP_FreeSend().
- The kernel can re-assign the physical memory pages to the destination node.
- The destination node then does an MP_MallocRecv().
- Only the partially filled pages would need to be copied.
- In Fortran, the source and dest buffers would need to be similarly aligned.
34. Taking advantage of the network topology
- Most applications do not take advantage of the network topology:
  - there are many different topologies, which can even vary at run-time
  - there is no portable method for mapping to the topology
  - this results in a loss of performance and scalability
- NodeMap will automatically determine the network topology at run-time and pass the information to the application or message-passing library.
35. How NodeMap works
- gethostname() → SMP processes
- Latency and bandwidth → individual connections
- Saturated network performance → regular meshes
- Global shifts → identify regular mesh structures
- Vendor functions when available
- Static configuration files as a last resort
How NodeMap will be used
- MPI_Cart_create( reorder = 1 ) will use NodeMap to provide the best mapping (see the sketch below)
- MPI_Init() can run NodeMap and optimize global communications
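The standard call looks like this (a 4x4 periodic grid is chosen purely for illustration); with reorder set to 1 the MPI library is free to renumber the ranks, which is where NodeMap's topology information would be applied:

    /* Create a 2D Cartesian communicator and let MPI reorder the ranks
     * to match the underlying network topology. */
    #include <mpi.h>

    void make_cart(MPI_Comm *cart_comm)
    {
        int dims[2]    = {4, 4};    /* 4x4 process grid (illustrative)     */
        int periods[2] = {1, 1};    /* periodic in both directions (torus) */
        int reorder    = 1;         /* allow rank renumbering              */

        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, cart_comm);
    }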
36. A parallel integral transport equation based
radiography simulation code
Feyzi Inanc, Bogdan Vasiliu, Dave Turner
Nuclear Science and Engineering 137, 173-182
(2001).
37. Performance on ALICE, normalized to 4 nodes
38. Summary
- Provided an overview of the current state of parallel computing.
- Measuring and tuning the performance is necessary, and easy.
- Much research is being done to improve performance.
- Channel bonding can double communication performance for small clusters at a minimal cost.
- Parallel computing does take significant effort, but it opens up new areas of science.
39. Contact information
- Dave Turner - turner@ameslab.gov
- http://www.scl.ameslab.gov/Projects/MP_Lite/
- http://www.scl.ameslab.gov/Projects/NetPIPE/
40. CPCM-assisted clusters at Iowa State University
- 9-node PC cluster for Math
- 16 PC Octopus cluster for Biology/Bioinformatics
- pre-built 22-processor Atipa cluster for
Astronomy - 24-node Alpha cluster with GigE for Physics
- 24-node PC cluster for Materials
- 24-node Athlon cluster with GigE for Physics
- 22-processor Athlon cluster with GigE for
Magnetics
41. IBM RS/6000 Workstation Cluster
- The cluster consists of 22 IBM 43P-260 and 2 IBM 44P-270 workstations.
- Each 43P node consists of
- Dual 200MHz Power3 processors
- (800 MFLOP peak)
- 2.5 GB of RAM
- 18 GB striped disk storage
- Fast Ethernet
- Gigabit Ethernet supporting Jumbo Frames
- Each 44P node consists of
- Quad 375MHz Power3 processors
- (1500 MFLOP peak)
- 16 GB of RAM
- 72 GB of striped disk storage
- Fast Ethernet
- Dual Gigabit Ethernet adapters supporting Jumbo
Frames
The Cluster is currently operated in a mixed
research/production environment with nearly 100%
aggregate utilization, mostly due to production
GAMESS calculations. The Cluster was made
possible by an IBM Shared University Research
(SUR) grant and by the DOE MICS program.
http://www.scl.ameslab.gov/Projects/IBMCluster/
42. IBM pSeries 64-bit workstation cluster
- The cluster consists of 32 IBM pSeries p640 workstations. Each p640 node consists of
- Quad 375MHz Power3 II processors (1500 MFLOP peak)
- 16 GB of RAM
- 144 or 288 GB striped disk storage
- Fast Ethernet
- Dual Gigabit Ethernet adapters supporting Jumbo Frames
- Dual Myrinet 2000 adapters (planned)
- Aggregate total of
- 128 CPUs (192 GigaFLOP peak)
- 1/2 Terabyte of RAM
- 6 terabytes of disk
- Nodes will run a mixture of AIX 5.1L (64 bit
kernel) and 64 bit PPC Linux.
The Cluster was made possible by an IBM Shared
University Research (SUR) grant, the Air Force
Office of Scientific Research, and by the DOE
MICS program.
http://www.scl.ameslab.gov/Projects/pCluster/
43. Channel-bonding in MP_Lite
[Diagram: the application on node 0 passes each message to MP_Lite in user space, which splits it into two streams (a and b). Each stream has its own large socket buffer, TCP/IP stack, device queue (dev_q_xmit), and GigE card doing DMA in kernel space.]
Flow control may stop a given stream at several
places. With MP_Lite channel-bonding, each
stream is independent of the others.
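A minimal sketch of the striping idea, assuming two already-connected TCP sockets, one per GigE card (a real implementation such as MP_Lite's would use non-blocking writes so the two streams overlap, and would handle short send() counts):

    /* Stripe one message across two sockets so each GigE card carries
     * half of the data and the two TCP streams stay independent. */
    #include <sys/types.h>
    #include <sys/socket.h>

    int send_striped(int sock_a, int sock_b, const char *buf, size_t nbytes)
    {
        size_t half = nbytes / 2;

        /* first half down interface a, second half down interface b;
         * for brevity this sketch sends them back-to-back and ignores
         * partial writes */
        if (send(sock_a, buf, half, 0) < 0)
            return -1;
        if (send(sock_b, buf + half, nbytes - half, 0) < 0)
            return -1;
        return 0;
    }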
44. Linux kernel channel-bonding
[Diagram: the application on node 0 writes into a single large socket buffer and one TCP/IP stack; bonding.c then splits the packets (dev_q_xmit) between the two GigE cards' device queues for DMA.]
A full device queue will stop the flow at
bonding.c to both device queues. Flow control on
the destination node may stop the flow out of the
socket buffer. In both of these cases, problems
with one stream can affect both streams.
45. SMP message-passing performance without cache effects
LAM/MPI has the lowest latency at 1 us, with
MP_Lite at 2 us and MPICH2 at 3 us. MP_Lite
and LAM/MPI do best in the intermediate
region. The non-temporal memory copy is tuned for
the cache case, kicking in above 128 kB to boost
performance by 50% for large messages. MPI/Pro is
also using an optimized memory copy routine.
With the data starting in main memory.
1.7 GHz dual-Xeon running RedHat 8.0