Title: The Future of Parallel Computing
1. The Future of Parallel Computing
- Dave Turner
- In collaboration with
- Xuehua Chen, Adam Oline, and Troy Benjegerdes
- Scalable Computing Laboratory of Ames Laboratory
- This work was funded by the MICS office of the US
Department of Energy
2. Outline
- Overview of parallel computing
- Measuring the performance of the communication system
- Improving the message-passing performance
- Taking advantage of the network topology
- Science enabled by parallel computing
3. The typical small cluster
- 2.8 GHz dual-Xeon node ($1800 each)
- 2 GB RAM
- SuperMicro or Tyan motherboards
- Built-in single or dual Gigabit Ethernet
- 1U or 2U rackmount, as needed for expansion
- 24-port Gigabit Ethernet switch
- Asante or Netgear ($1400)
- 47U Rack ($3000)
- KVM, UPS, Monitor, cables
- Software
- Intel compilers ($560 academic)
- MPI, PVM, PBS scheduler
$25,000 will get you a master plus 10 compute
nodes (22 processors). Cluster vendors such as
Atipa and Microway will sell fully integrated
clusters.
4. The IBM Blue Gene/L
- 65,536 dual-processor nodes
- 700 MHz PowerPC 440 cores with dual floating point units
- 256-512 MB RAM
- Each node runs a very stripped-down version of Linux
- Should fit in a room the size of a tennis court
- Peak computational rate of 360 Teraflops
- 5 separate networks
- 3D torus network for MPI communications (MPICH2)
- 1.4 Gbps peak bandwidth in each direction
- A tree network connects every 64 nodes to an I/O node
- Will also handle some MPI collectives
- A Fast Ethernet control network
- Global interrupt network
- I/O nodes
- Gigabit Ethernet connection
- Run a full Linux kernel
5. The dual-Athlon cluster with SCI interconnects
- Initially we will connect 64 dual-Athlon PCs with an 8x8 SCI grid.
- The Mezzanine card will allow for a 3D torus in the future.
6. Blade-based clusters
- dual-Xeon blades ($5000 each)
- 2-8 GB RAM
- 1-2 slow mini-disks (40-80 GB each)
- Built-in single or dual Gigabit Ethernet
- 1 PCI-X expansion slot
- 6U Chassis
- Holds 10 blades plus a network switch module
- Lots of cooling
- 48U rack
- Could hold 80 blades (160 processors)
Prices will come down.
7. Inefficiencies in the communication system
[Diagram: a message passes from the application through MPI, the native layer, the internal buses (PCI, memory), the driver, and the NIC to the switch fabric. Annotations: 75% of peak bandwidth and 2-3x the latency; topological bottlenecks; poor MPI usage; no mapping; hardware limits; driver tuning; OS bypass; TCP tuning.]
8. Waveguide simulations using the parallel Finite Difference Time Domain method
Kai-Ming Ho, Rana Biswas, Mihail Sigalas, Ihab El-Kady, Mehmet Su, Dave Turner, Bogdan Vasiliu
9. Waveguide bends in three-dimensional layer-by-layer photonic band gap materials, M.M. Sigalas,
R. Biswas, K.M. Ho, C.M. Soukoulis, D.E. Turner,
B. Vasiliu, S.C. Kothari, and Shawn Lin,
Microwave and Optical Technology Letters, Vol.
23, 56-59 (Oct. 5, 1999).
10. [Slide fragment] ...with or without fence calls. Measure performance or do an integrity test.
http://www.scl.ameslab.gov/Projects/NetPIPE/
11. The NetPIPE utility
- NetPIPE does a series of ping-pong tests between two nodes.
- Message sizes are chosen at regular intervals, and with slight perturbations, to fully test the communication system for idiosyncrasies.
- Latencies reported represent half the ping-pong time for messages smaller than 64 Bytes.
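A minimal sketch of such a ping-pong measurement written directly in MPI (illustrative only, not NetPIPE's actual source; the message size and repeat count here are arbitrary):

    /* Minimal MPI ping-pong timing sketch; NetPIPE's real tests vary
     * the message size and add slight perturbations. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, i, nrepeat = 1000, nbytes = 1024;
        double t0, t1, rtt;
        char *buf = malloc(nbytes);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < nrepeat; i++) {
            if (rank == 0) {        /* bounce the message back and forth */
                MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0) {            /* latency = half the round-trip time */
            rtt = (t1 - t0) / nrepeat;
            printf("latency %g us   bandwidth %g Mbps\n",
                   0.5 * rtt * 1.0e6, 8.0 * nbytes / (0.5 * rtt) / 1.0e6);
        }
        free(buf);
        MPI_Finalize();
        return 0;
    }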
Some typical uses
- Measuring the overhead of message-passing protocols.
- Help in tuning the optimization parameters of message-passing libraries.
- Optimizing driver and OS parameters (socket buffer sizes, etc.).
- Identifying dropouts in networking hardware and drivers.
What is not measured
- NetPIPE cannot measure the load on the CPU yet.
- The effects from the different methods for maintaining message progress.
- Scalability with system size.
12. Recent additions to NetPIPE
- Can do an integrity test instead of measuring performance.
- Streaming mode measures performance in 1 direction only.
- Must reset sockets to avoid effects from a collapsing window size.
- A bi-directional ping-pong mode has been added (-2).
- One-sided Get and Put calls can be measured (MPI or SHMEM).
- Can choose whether to use an intervening MPI_Fence call to synchronize (a fenced Put is sketched after the diagram below).
- Messages can be bounced between the same buffers (default mode), or they can be started from a different area of memory each time.
- There are lots of cache effects in SMP message-passing.
- InfiniBand can show similar effects since memory must be registered with the card.
[Diagram: processes 0 and 1 exchanging messages between buffers 0-3.]
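For reference, a fenced one-sided Put with standard MPI-2 calls looks roughly like the sketch below (illustrative, not NetPIPE's code; the window is assumed to have been created earlier with MPI_Win_create):

    /* Sketch of a fenced one-sided Put (MPI-2).  NetPIPE can also run
     * the same transfer without the intervening fence to expose its cost. */
    #include <mpi.h>

    void put_with_fence(void *buf, int nbytes, int target, MPI_Win win)
    {
        MPI_Win_fence(0, win);                      /* open the access epoch   */
        MPI_Put(buf, nbytes, MPI_BYTE,              /* write nbytes into the   */
                target, 0, nbytes, MPI_BYTE, win);  /* target's exposed window */
        MPI_Win_fence(0, win);                      /* complete the transfer   */
    }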
13. Current projects
- Overlapping pair-wise ping-pong tests.
- Must consider synchronization if not using
bi-directional communications.
[Diagram: four nodes (n0-n3) attached to an Ethernet switch, contrasting line-speed-limited with end-point-limited communication patterns.]
- Investigate other methods for testing the global network.
- Evaluate the full range from simultaneous nearest-neighbor communications to all-to-all.
14. LAM/MPI
- LAM 6.5.6-4 release from the RedHat 7.2 distribution.
- Must lamboot the daemons.
- -lamd directs messages through the daemons.
- -O avoids data conversion for homogeneous systems.
- No socket buffer size tuning.
- No threshold adjustments.
Currently developed at Indiana University.
http://www.lam-mpi.org/
15. PVM
- PVM 3.4.3 release from the RedHat 7.2 distribution.
- Uses XDR encoding and the pvmd daemons by default.
- pvm_setopt(PvmRoute, PvmRouteDirect) bypasses the pvmd daemons.
- pvm_initsend(PvmDataInPlace) avoids XDR encoding for homogeneous systems.
Developed at Oak Ridge National Laboratory.
http://www.csm.ornl.gov/pvm/
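For reference, the two PVM optimizations above might be used together as in this sketch (the destination tid, message tag, and data are made up for illustration):

    /* Sketch of a PVM send that bypasses the pvmd daemons and skips
     * XDR encoding on homogeneous systems. */
    #include <pvm3.h>

    void send_direct(int dest_tid, double *data, int n)
    {
        pvm_setopt(PvmRoute, PvmRouteDirect);  /* direct task-to-task sockets */
        pvm_initsend(PvmDataInPlace);          /* no XDR encoding or copy     */
        pvm_pkdouble(data, n, 1);              /* pack n doubles, stride 1    */
        pvm_send(dest_tid, 99);                /* arbitrary message tag       */
    }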
16. A NetPIPE example: performance on a Cray T3E
- Raw SHMEM delivers
- 2600 Mbps
- 2-3 us latency
- Cray MPI originally delivered
- 1300 Mbps
- 20 us latency
- MP_Lite delivers
- 2600 Mbps
- 9-10 us latency
- New Cray MPI delivers
- 2400 Mbps
- 20 us latency
The tops of the spikes are where the message size is divisible by 8 bytes.
17. Channel-bonding Gigabit Ethernet for better communications between nodes
Channel-bonding uses 2 or more Gigabit Ethernet cards per PC to increase the communication rate between nodes in a cluster. GigE cards cost $40 each and 24-port switches cost $1400, or roughly $100 per computer. This is much more cost-effective for PC clusters than using more expensive networking hardware, and may deliver similar performance.
18. Performance for channel-bonded Gigabit Ethernet
GigE can deliver 900 Mbps with latencies of 25-62
us for PCs with 64-bit / 66 MHz PCI
slots. Channel-bonding 2 GigE cards / PC using
MP_Lite doubles the performance for large
messages. Adding a 3rd card does not help
much. Channel-bonding 2 GigE cards / PC using
Linux kernel level bonding actually results in
poorer performance. The same tricks that make
channel-bonding successful in MP_Lite should make
Linux kernel bonding work even better. Any
message-passing system could then make use of
channel-bonding on Linux systems.
Channel-bonding multiple GigE cards using MP_Lite
and Linux kernel bonding
19. Performance on Mellanox InfiniBand cards
A new NetPIPE module allows us to measure the raw
performance across InfiniBand hardware (RDMA and
Send/Recv). Burst mode preposts all receives to
duplicate the Mellanox test. The no-cache
performance is much lower when the memory has to
be registered with the card. An MP_Lite
InfiniBand module will be incorporated into
LAM/MPI.
MVAPICH 0.9.1
20. 10 Gigabit Ethernet
Intel 10 Gigabit Ethernet cards on a 133 MHz PCI-X bus, single-mode fiber, Intel ixgb driver. Can only achieve 2 Gbps now, with a latency of 75 us. Streaming mode delivers up to 3 Gbps. Much more development work is needed.
21. Comparison of high-speed interconnects
InfiniBand can deliver 4500 - 6500 Mbps at a 7.5
us latency. Atoll delivers 1890 Mbps with a 4.7
us latency. SCI delivers 1840 Mbps with only a
4.2 us latency. Myrinet performance reaches 1820
Mbps with an 8 us latency. Channel-bonded GigE
offers 1800 Mbps for very large messages. Gigabit
Ethernet delivers 900 Mbps with a 25-62
us latency. 10 GigE only delivers 2 Gbps with a
75 us latency.
22. The MP_Lite message-passing library
- A light-weight MPI implementation
- Highly efficient for the architectures supported
- Designed to be very user-friendly
- Ideal for performing message-passing research
- http://www.scl.ameslab.gov/Projects/MP_Lite/
23. 2-copy SMP message-passing
[Diagram: two processes on two processors, each with its own cache; a message goes from process 0's memory into a shared-memory segment in main memory and then into process 1's memory, requiring two copies.]
24. Shared-memory message-passing using a typical semaphore-based approach
- One large segment shared by all processors.
- Minimize lockouts to only when the linked list changes.
- Minimize search time with a second linked list for each destination.
- Semaphores are slow.
- Still not scalable.
[Diagram: a single shared-memory segment holds the queued messages (0→1, 2→0, 1→3, 1→0, 3→2, 1→2) for processes 0-3.]
25. The MP_Lite locking FIFO approach
- The message headers are sent through shared-memory FIFO pipes.
- The main segment is only locked during allocation/de-allocation.
- A process spins on an atomic operation with an occasional schedule yield (sketched after the diagram below).
- An optimized memory copy routine is used.
[Diagram: message headers flow through per-pair shared-memory FIFOs (FIFO 0→1, 0→2, 0→3, 1→0, 1→2, 1→3, 3→2, ...), while the message data (0→1, 2→0, 1→3, 1→0, 3→2, 1→2) is held in the shared-memory segment for processes 0-3.]
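A minimal sketch of the kind of spin-wait described above, assuming a flag word in the shared segment (using a GCC atomic builtin and sched_yield(); MP_Lite's actual code differs):

    /* Spin on a flag in shared memory, yielding the CPU occasionally so
     * another process scheduled on the same processor can make progress. */
    #include <sched.h>

    #define SPINS_PER_YIELD 1000

    void wait_for_flag(volatile int *flag)
    {
        int spins = 0;

        /* an atomic add of 0 acts as an atomic load with a full barrier */
        while (__sync_fetch_and_add((int *) flag, 0) == 0) {
            if (++spins >= SPINS_PER_YIELD) {
                sched_yield();      /* occasionally give up the time slice */
                spins = 0;
            }
        }
    }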
26. Optimizing the memory copy routine
The 686 version of GLIBC has a memcpy routine
that does byte-by-byte copies for messages not
divisible by 4 bytes. The Intel memcpy is good,
but does not make use of the non-temporal copy in
the Pentium 4 instruction set. An optimal memcpy
is being developed to try to provide the best
performance everywhere.
[Plot: memory copy performance with the data starting in cache; 2.4 GHz Xeon running RedHat 7.3.]
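A minimal sketch of a non-temporal block copy using SSE2 intrinsics (assuming 16-byte-aligned buffers and a size that is a multiple of 64 bytes; MP_Lite's tuned routine handles the general case):

    /* Non-temporal copy: streaming stores bypass the cache so a large
     * message copy does not evict the application's working set. */
    #include <emmintrin.h>      /* SSE2 intrinsics */

    void memcpy_nt(void *dst, const void *src, long nbytes)
    {
        const __m128i *s = (const __m128i *) src;
        __m128i       *d = (__m128i *) dst;
        long i, n = nbytes / 16;            /* 16 bytes per vector */

        for (i = 0; i < n; i += 4) {        /* 64 bytes per iteration */
            __m128i a = _mm_load_si128(s + i);
            __m128i b = _mm_load_si128(s + i + 1);
            __m128i c = _mm_load_si128(s + i + 2);
            __m128i e = _mm_load_si128(s + i + 3);
            _mm_stream_si128(d + i,     a); /* non-temporal stores */
            _mm_stream_si128(d + i + 1, b);
            _mm_stream_si128(d + i + 2, c);
            _mm_stream_si128(d + i + 3, e);
        }
        _mm_sfence();   /* make the streaming stores globally visible */
    }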
27. The MP_Lite lock-free approach
- Each process has its own section for outgoing messages.
- Other processes only flip a cleanup flag.
- Having no lockouts provides excellent scaling.
- Doubly linked lists allow very efficient searches, or this can be combined with the shared-memory FIFO method.
[Diagram: the shared-memory segment is divided into per-process sections (0-3); each section holds that process's outgoing messages (0→1; 1→3, 1→0, 1→2; 2→0; 3→2).]
28. SMP message-passing performance with cache effects
With the data starting in cache.
LAM/MPI has the lowest latency at 1 us, with
MPICH2 and MP_Lite at 2 us. MP_Lite dominates in
the cache region due to better lock and header
handling. The non-temporal memory copy boosts
performance by 50% for large messages.
1.7 GHz dual-Xeon running RedHat 8.0
29. Bi-directional SMP message-passing performance
With the data starting in cache.
MP_Lite and LAM/MPI have latencies around 3 us,
with MPICH at 16 us. MP_Lite and MPICH peak at 7000 Mbps. The non-temporal memory copy of MP_Lite boosts the large message rate by 20%.
Bi-directional results are pretty similar to the
uni-directional results.
1.7 GHz dual-Xeon running RedHat 8.0
30. Performance using the Ames Lab Classical
Molecular Dynamics code
Communication times in microseconds per iterative
cycle
2 MPI processes on a 1.7 GHz dual-Xeon computer
running RedHat 7.3
31. 1-copy SMP message-passing for Linux
[Diagram: a kernel Put or Get copies the message directly from process 0's memory to process 1's memory in a single kernel copy, with no intermediate shared-memory segment.]
This should double the throughput. It is unclear
what the latency will be. It should be
relatively easy to write an MPI implementation.
32. Writing the Linux module
- The kernel has 2 functions for transferring data to and from user space.
- copy_from_user() checks for read access then gets data.
- copy_to_user() checks for write access then puts data.
- Write a copy_from_user_to_user() function.
- Create an initialization function join_copy_group().
- Check for read/write access to all the processes once.
- Expose these to the message-passing layer using a module kernel_copy.c.
- Write an MPI implementation using 1-sided Gets and Puts.
- MPI_Init() will call join_copy_group().
- MPI_Send() will put data if a receive is preposted, else push to a send buffer and post.
- MPI_Recv() will block on message reception, or on the posting of a matching buffered send, in which case it would get the data.
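A rough user-space sketch of that send logic; find_preposted_recv(), kernel_put(), and queue_buffered_send() are hypothetical placeholders for what the kernel_copy module and message-passing layer would provide, not an existing API:

    #include <stddef.h>

    /* Hypothetical interfaces the kernel_copy module would expose. */
    extern void *find_preposted_recv(int dest, int tag);
    extern void  kernel_put(void *src, void *dst, int nbytes, int dest);
    extern void  queue_buffered_send(void *buf, int nbytes, int dest, int tag);

    /* Sketch of a send built on a 1-sided kernel put (illustrative only). */
    int sketch_send(void *buf, int nbytes, int dest, int tag)
    {
        void *recv_addr = find_preposted_recv(dest, tag);

        if (recv_addr != NULL) {
            /* receiver already posted: one kernel copy, user to user */
            kernel_put(buf, recv_addr, nbytes, dest);
        } else {
            /* otherwise copy into a send buffer and post the header so
             * the receiver can get the data when its receive is posted */
            queue_buffered_send(buf, nbytes, dest, tag);
        }
        return 0;
    }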
33. 0-copy SMP message-passing for Linux?
[Diagram: MP_FreeSend() on process 0 and MP_MallocRecv() on process 1 let the kernel remap memory pages 1, 2, and 3 from one process to the other as a virtual copy, with no data movement through main memory.]
- If the source node does not need to retain a copy, do an MP_FreeSend().
- The kernel can re-assign the physical memory pages to the destination node.
- The destination node then does an MP_MallocRecv().
- Only the partially filled pages would need to be copied.
- In Fortran, the source and dest buffers would need to be similarly aligned.
34. Taking advantage of the network topology
- Most applications do not take advantage of the network topology:
  - there are many different topologies, which can even vary at run-time
  - there is no portable method for mapping to the topology
  - this results in a loss of performance and scalability
- NodeMap will automatically determine the network topology at run-time and pass the information to the application or message-passing library.
35. How NodeMap works
- gethostname() → SMP processes
- Latency and bandwidth → individual connections
- Saturated network performance → regular meshes
- Global shifts → identify regular mesh structures
- Vendor functions when available
- Static configuration files as a last resort
How NodeMap will be used
- MPI_Cart_create( reorder = 1 ) will use NodeMap to provide the best mapping (see the sketch below)
- MPI_Init() can run NodeMap and optimize global communications
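The standard call looks like this (a 4x4 periodic grid is chosen purely for illustration); with reorder set to 1 the MPI library is free to renumber the ranks, which is where NodeMap's topology information would be applied:

    /* Create a 2D Cartesian communicator and let MPI reorder the ranks
     * to match the underlying network topology. */
    #include <mpi.h>

    void make_cart(MPI_Comm *cart_comm)
    {
        int dims[2]    = {4, 4};    /* 4x4 process grid (illustrative)     */
        int periods[2] = {1, 1};    /* periodic in both directions (torus) */
        int reorder    = 1;         /* allow rank renumbering              */

        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, cart_comm);
    }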
36. A parallel integral transport equation based
radiography simulation code
Feyzi Inanc, Bogdan Vasiliu, Dave Turner
Nuclear Science and Engineering 137, 173-182
(2001).
37. Performance on ALICE, normalized to 4 nodes
38. Summary
- Provided an overview of the current state of parallel computing.
- Measuring and tuning the performance is necessary, and easy.
- Much research is being done to improve performance.
- Channel bonding can double communication performance for small clusters at a minimal cost.
- Parallel computing does take significant effort, but it opens up new areas of science.
39. Contact information
- Dave Turner - turner@ameslab.gov
- http://www.scl.ameslab.gov/Projects/MP_Lite/
- http://www.scl.ameslab.gov/Projects/NetPIPE/
40. CPCM-assisted clusters at Iowa State University
- 9-node PC cluster for Math
- 16 PC Octopus cluster for Biology/Bioinformatics
- pre-built 22-processor Atipa cluster for
Astronomy - 24-node Alpha cluster with GigE for Physics
- 24-node PC cluster for Materials
- 24-node Athlon cluster with GigE for Physics
- 22-processor Athlon cluster with GigE for
Magnetics
41. IBM RS/6000 Workstation Cluster
- The cluster consists of 22 IBM 43P-260 and 2 IBM 44P-270 workstations.
- Each 43P node consists of
- Dual 200MHz Power3 processors
- (800 MFLOP peak)
- 2.5 GB of RAM
- 18 GB striped disk storage
- Fast Ethernet
- Gigabit Ethernet supporting Jumbo Frames
- Each 44P node consists of
- Quad 375MHz Power3 processors
- (1500 MFLOP peak)
- 16 GB of RAM
- 72 GB of striped disk storage
- Fast Ethernet
- Dual Gigabit Ethernet adapters supporting Jumbo
Frames
The Cluster is currently operated in a mixed
research/production environment with nearly 100%
aggregate utilization, mostly due to production
GAMESS calculations. The Cluster was made
possible by an IBM Shared University Research
(SUR) grant and by the DOE MICS program.
http://www.scl.ameslab.gov/Projects/IBMCluster/
42. IBM pSeries 64-bit workstation cluster
- The cluster consists of 32 IBM pSeries p640 workstations. Each p640 node consists of
- Quad 375MHz Power3 II processors (1500 MFLOP peak)
- 16 GB of RAM
- 144 or 288 GB striped disk storage
- Fast Ethernet
- Dual Gigabit Ethernet adapters supporting Jumbo Frames
- Dual Myrinet 2000 adapters (planned)
- Aggregate total of
- 128 CPUs (192 GigaFLOP peak)
- 1/2 Terabyte of RAM
- 6 terabytes of disk
- Nodes will run a mixture of AIX 5.1L (64 bit
kernel) and 64 bit PPC Linux.
The Cluster was made possible by an IBM Shared
University Research (SUR) grant, the Air Force
Office of Scientific Research, and by the DOE
MICS program.
http://www.scl.ameslab.gov/Projects/pCluster/
43. Channel-bonding in MP_Lite
[Diagram: the application on node 0 passes each message to MP_Lite in user space, which splits it into two streams (a and b). Each stream has its own large socket buffer, TCP/IP stack, device queue (dev_q_xmit), and GigE card doing DMA in kernel space.]
Flow control may stop a given stream at several
places. With MP_Lite channel-bonding, each
stream is independent of the others.
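A minimal sketch of the striping idea, assuming two already-connected TCP sockets, one per GigE card (a real implementation such as MP_Lite's would use non-blocking writes so the two streams overlap, and would handle short send() counts):

    /* Stripe one message across two sockets so each GigE card carries
     * half of the data and the two TCP streams stay independent. */
    #include <sys/types.h>
    #include <sys/socket.h>

    int send_striped(int sock_a, int sock_b, const char *buf, size_t nbytes)
    {
        size_t half = nbytes / 2;

        /* first half down interface a, second half down interface b;
         * for brevity this sketch sends them back-to-back and ignores
         * partial writes */
        if (send(sock_a, buf, half, 0) < 0)
            return -1;
        if (send(sock_b, buf + half, nbytes - half, 0) < 0)
            return -1;
        return 0;
    }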
44. Linux kernel channel-bonding
[Diagram: the application on node 0 writes into a single large socket buffer and one TCP/IP stack; bonding.c then splits the packets (dev_q_xmit) between the two GigE cards' device queues for DMA.]
A full device queue will stop the flow at
bonding.c to both device queues. Flow control on
the destination node may stop the flow out of the
socket buffer. In both of these cases, problems
with one stream can affect both streams.
45. SMP message-passing performance without cache effects
LAM/MPI has the lowest latency at 1 us, with
MP_Lite at 2 us and MPICH2 at 3 us. MP_Lite
and LAM/MPI do best in the intermediate
region. The non-temporal memory copy is tuned for
the cache case, kicking in above 128 kB to boost
performance by 50% for large messages. MPI/Pro is
also using an optimized memory copy routine.
With the data starting in main memory.
1.7 GHz dual-Xeon running RedHat 8.0