Title: Optimizing SMP Message-Passing Systems
1. Optimizing SMP Message-Passing Systems
- Dave Turner
- In collaboration with
- Adam Oline, Xuehua Chen, and Troy Benjegerdes
- This project is funded by the Mathematical, Information, and Computational Sciences division of the Department of Energy.
2. The MP_Lite message-passing library
- A lightweight MPI implementation
- Highly efficient for the architectures supported
- Designed to be very user-friendly
- Ideal for performing message-passing research
- http://www.scl.ameslab.gov/Projects/MP_Lite/
3. SMP message-passing using a shared-memory segment
[Diagram: Process 0 and Process 1, each on its own processor with its own cache, communicate through a shared-memory segment in main memory]
- 2-copy mechanism
- Variety of methods to manage the segment
- Variety of methods to signal the destination process
4. The MP_Lite lock-free approach
Each process has its own section of the shared-memory segment for its outgoing messages; other processes only flip a cleanup flag. With no lockouts, this approach provides excellent scaling. Doubly linked lists allow very efficient searches, or this method can be combined with the shared-memory FIFO method.
[Diagram: the shared-memory segment is divided into per-process sections; Process 0 holds message 0→1, Process 1 holds 1→3, 1→0, and 1→2, Process 2 holds 2→0, and Process 3 holds 3→2]
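A minimal sketch of what one per-process outgoing section might look like. All identifiers, field names, and sizes below are assumptions for illustration, not the actual MP_Lite data layout:

```c
/* Illustrative sketch only: a per-process outgoing-message section with a
 * cleanup flag that the receiving process flips, so no locks are required.
 * Names and sizes are assumptions, not the real MP_Lite layout. */
#include <stdint.h>

#define MAX_MSGS_PER_PROC 128

struct msg_header {
    int32_t  src, dest;        /* source and destination ranks            */
    int32_t  nbytes;           /* message size in bytes                   */
    int32_t  next, prev;       /* doubly linked list stored as offsets so */
                               /* links stay valid in every address space */
    volatile int32_t cleanup;  /* destination sets this when it is done;  */
                               /* only the owner ever reclaims slots      */
};

struct outgoing_section {
    struct msg_header hdr[MAX_MSGS_PER_PROC];
    /* message payloads would follow in the same per-process region */
};

/* The destination never modifies the list itself; it only flips the
 * cleanup flag, and the owning process unlinks the entry later. */
static inline void mark_done(struct msg_header *h) { h->cleanup = 1; }
```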
5. 1-copy SMP message-passing for Linux
[Diagram: a kernel put or kernel get performs a single kernel copy between the user buffers of Process 0 and Process 1 through main memory]
- This should double the throughput
- Unclear at inception what the latency would be
- Relatively easy to implement a full MPI implementation
- Similar work was done by the BIP-SMP project
6. The kcopy.c Linux module
- kcopy_open()
  - Does nothing, since it is difficult to pass data in
- kcopy_ioctl()
  - KCOPY_INIT: initialize the synchronization arrays
  - KCOPY_PUT/GET: call copy_user_to_user to transfer the message data
  - KCOPY_SYNC: out-of-band synchronization
  - KCOPY_BCAST/GATHER: support routines for exchanging initial pointers
- kcopy_release()
- copy_user_to_user
  - kmap the destination pages to kernel space, then copy 1 page at a time
[Diagram: source and destination virtual address spaces (3 GB user + 1 GB kernel each, 4 GB total) mapped onto 4 GB of physical memory; the kernel copies directly between the two user regions]
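A rough sketch of the per-page copy described above, assuming the destination pages have already been pinned (for example with get_user_pages()). This illustrates the kmap-and-copy pattern for a 2.4/2.6-era kernel; it is not the actual kcopy.c source, and the function and parameter names are invented:

```c
/* Sketch: copy nbytes from a source user buffer into another process's
 * pages, one page at a time.  'dpages' is assumed to already hold the
 * pinned destination pages, 'doffset' the offset into the first page. */
#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/errno.h>
#include <asm/uaccess.h>

static int copy_user_to_pages(struct page **dpages, unsigned long doffset,
                              const char *sbuf, unsigned long nbytes)
{
    unsigned long copied = 0;
    int i = 0;

    while (copied < nbytes) {
        unsigned long chunk = PAGE_SIZE - doffset;
        void *kaddr;

        if (chunk > nbytes - copied)
            chunk = nbytes - copied;

        kaddr = kmap(dpages[i]);              /* map destination page */
        if (copy_from_user((char *)kaddr + doffset, sbuf + copied, chunk)) {
            kunmap(dpages[i]);
            return -EFAULT;
        }
        kunmap(dpages[i]);

        copied += chunk;
        doffset = 0;                          /* later pages start at 0 */
        i++;
    }
    return 0;
}
```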
7. Programming for the kcopy.o Linux module
- dd = open( "/dev/kcopy", O_WRONLY )
  - Open the connection to the kcopy.o module and return a device descriptor
- ioctl( dd, KCOPY_INIT, hdr )
  - Pass myproc and nprocs to the module
  - Initialize the sync arrays
- ioctl( dd, KCOPY_PUT, hdr )
  - hdr: sbuf, dbuf, myproc, nprocs, dest, nbytes, comm (put/get)
  - Checks for write access using access_ok()
  - copy_user_to_user copies nbytes from sbuf to/from dbuf
- ioctl( dd, KCOPY_SYNC, hdr )
  - All processes increment their sync element, then wait for a go signal from proc 0
- ioctl( dd, KCOPY_GATHER, hdr )
  - hdr: myproc, nprocs, sizeof(element), myelement → array of elements
- ioctl( dd, KCOPY_BCAST, hdr )
  - hdr: myproc, nprocs, sizeof(element), myelement → element broadcast
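A hedged user-space sketch of the calling sequence above. The header struct layout and the ioctl request codes are assumptions made for illustration; only the call order and the field names come from the slide:

```c
/* Illustrative only: driving the kcopy.o module from user space.
 * The struct layout and request codes are assumed, not from kcopy.h. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct kcopy_hdr {              /* fields named on the slide */
    void *sbuf, *dbuf;
    int   myproc, nprocs, dest;
    long  nbytes;
    int   comm;                 /* put or get */
};

/* request codes would really come from the module's own header file */
enum { KCOPY_INIT = 1, KCOPY_PUT, KCOPY_GET, KCOPY_SYNC,
       KCOPY_BCAST, KCOPY_GATHER };

int kcopy_example(int myproc, int nprocs, int dest,
                  void *sbuf, void *dbuf, long nbytes)
{
    struct kcopy_hdr hdr = { sbuf, dbuf, myproc, nprocs, dest, nbytes, 0 };
    int dd = open("/dev/kcopy", O_WRONLY);      /* device descriptor */

    if (dd < 0) { perror("open /dev/kcopy"); return -1; }

    ioctl(dd, KCOPY_INIT, &hdr);   /* pass myproc/nprocs, set up sync   */
    ioctl(dd, KCOPY_PUT,  &hdr);   /* kernel copies nbytes sbuf -> dbuf */
    ioctl(dd, KCOPY_SYNC, &hdr);   /* out-of-band barrier via proc 0    */

    close(dd);
    return 0;
}
```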
8. The MP_Lite kput.c module
- 1-sided MP_Put/MP_Get are not implemented yet
- 2-sided communications use both gets and puts
- 4 separate circular queues are maintained to manage the 2-sided communications
- MP_Send
  - If a receive is pre-posted, put the data to the destination process, then post a completion flag
  - If blocking on a send, malloc a buffer, copy the data, and post the header to the destination
  - The destination process will then get the data from the source process and post a completion flag
  - The source process can use the completion flag to free the send buffer
- MP_Recv
  - If a buffered send is posted, get the data from the source process, then post a completion flag so the source process can free the send buffer
  - Else pre-post a receive header, then block until a completion notification flag is posted in return
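A pseudocode-level sketch of the send-side decision described above. The helper functions (find_preposted_recv, kernel_put, and so on) are hypothetical stand-ins, not MP_Lite internals, and are declared but not implemented here:

```c
/* Illustrative sketch of the two-sided send path built on kernel puts
 * and gets.  Every helper below is a hypothetical placeholder. */
#include <stdlib.h>
#include <string.h>

struct recv_hdr;                                   /* opaque here */
struct recv_hdr *find_preposted_recv(int dest, int tag);
void kernel_put(int dest, struct recv_hdr *r, const void *buf, long n);
void post_completion_flag(int dest, struct recv_hdr *r);
void post_buffered_header(int dest, int tag, void *buf, long n);

void MP_Send_sketch(const void *buf, long nbytes, int dest, int tag)
{
    struct recv_hdr *r = find_preposted_recv(dest, tag);

    if (r != NULL) {
        /* Receive already posted: put directly into the destination
         * buffer, then flag completion. */
        kernel_put(dest, r, buf, nbytes);
        post_completion_flag(dest, r);
    } else {
        /* No receive yet: buffer the message so the send can complete,
         * and let the destination get the data later.  The destination
         * posts a completion flag, which tells us to free this buffer. */
        void *tmp = malloc(nbytes);
        memcpy(tmp, buf, nbytes);
        post_buffered_header(dest, tag, tmp, nbytes);
    }
}
```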
9. SMP message-passing between caches
All MPI libraries can be made to achieve a 1 us latency. 2-copy methods can still show some possible benefits in the cache range (measurements with real codes are needed). The 1-copy mechanism in MP_Lite nearly doubles the throughput for large messages. Adding an optimized non-temporal memory copy routine to the Linux kernel module nearly doubles the performance again, but only for Pentium 4 systems.
10. SMP message-passing performance without cache effects
The benefits of the 1-copy method and the optimized memory copy routine are seen more clearly when the messages do not start in cache. There may still be room to improve the throughput by another 10-20%:
- Copy contiguous pages
- Get rid of the spin-lock in kmap
Are optimized memory copy routines available for other processors? This Linux module is the perfect place to put any optimized memory copy routines.
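For context, a non-temporal copy on a Pentium 4 generally means streaming stores that bypass the cache. Below is a user-space sketch of the idea using SSE2 intrinsics; it is not the kernel routine used in MP_Lite, and the alignment and length restrictions are simplifications:

```c
/* Sketch of a non-temporal (streaming) memory copy for SSE2-capable
 * x86 processors such as the Pentium 4.  Buffers must be 16-byte
 * aligned and nbytes a multiple of 64; a real routine would handle
 * the general case and fall back to memcpy for small copies. */
#include <emmintrin.h>
#include <stddef.h>

void ntcopy(void *dst, const void *src, size_t nbytes)
{
    __m128i       *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t i;

    for (i = 0; i < nbytes / 16; i += 4) {
        /* load normally, store with non-temporal hints so the copy
         * neither pollutes nor is limited by the cache */
        __m128i x0 = _mm_load_si128(s + i);
        __m128i x1 = _mm_load_si128(s + i + 1);
        __m128i x2 = _mm_load_si128(s + i + 2);
        __m128i x3 = _mm_load_si128(s + i + 3);
        _mm_stream_si128(d + i,     x0);
        _mm_stream_si128(d + i + 1, x1);
        _mm_stream_si128(d + i + 2, x2);
        _mm_stream_si128(d + i + 3, x3);
    }
    _mm_sfence();   /* make the streaming stores globally visible */
}
```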
11. 0-copy SMP message-passing using a copy-on-write mechanism
[Diagram: a kernel virtual copy maps pages 1-3 of the source buffer into Process 1's address space; only the main-memory page mappings change]
- If the source and destination buffers are similarly aligned, and if the message contains full pages to copy, give access to both processes and mark the pages as copy-on-write
- Pages will only be copied if either process writes to them
- Both processes could easily trample the buffers much later
- Effective use would require the user to align buffers (posix_memalign)
- The user must also protect against trampling (malloc before, free right after)
- This still does not help with transferring writable buffers
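Since effective use of copy-on-write would require page-aligned buffers, here is a short example of allocating one with posix_memalign; the buffer size is arbitrary and the padding to whole pages is an illustrative choice:

```c
/* Allocate a page-aligned, page-padded message buffer so that a
 * copy-on-write transfer could remap whole pages. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    long   page   = sysconf(_SC_PAGESIZE);          /* e.g. 4096 */
    size_t nbytes = 100000;
    /* round the allocation up to a whole number of pages */
    size_t padded = ((nbytes + page - 1) / page) * page;
    void  *buf;
    int    err = posix_memalign(&buf, page, padded);

    if (err != 0) {
        fprintf(stderr, "posix_memalign failed: %d\n", err);
        return 1;
    }
    printf("buffer of %zu bytes aligned to %ld-byte pages\n", padded, page);
    free(buf);
    return 0;
}
```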
12. 0-copy SMP message-passing by transferring ownership of the physical pages
[Diagram: MP_SendFree() on Process 0 and MP_MallocRecv() on Process 1; a kernel virtual copy re-assigns physical pages 1-3 from the source process to the destination process]
- If the source node does not need to retain a copy, do an MP_SendFree()
- The kernel can re-assign the physical pages to the destination process
- The destination process then does an MP_MallocRecv()
- Only the partially filled pages would need to be copied
- Can also provide alignment help to page-align and pad buffers
- The destination process owns the new pages (they are writable)
- Requires extensions to the MPI standard
- Must modify the code, but should be fairly easy (but what about Fortran?)
- Can be combined with copy-on-write to share read-only data
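A hedged sketch of how an application might call the proposed page-ownership routines. MP_SendFree and MP_MallocRecv are named on the slide, but their signatures are not, so the prototypes and semantics below are assumptions:

```c
/* Illustrative use of the proposed ownership-transfer extensions.
 * The prototypes are guesses; only the names come from the slide. */
#include <stdlib.h>

/* assumed prototypes for the proposed MP_Lite extensions */
int MP_SendFree(void *buf, long nbytes, int dest, int tag);
int MP_MallocRecv(void **buf, long nbytes, int src, int tag);

void sender(int dest)
{
    long  nbytes = 1 << 20;
    char *buf    = malloc(nbytes);
    /* ... fill buf ... */
    /* Hand the pages to the destination; buf must not be touched again,
     * since the kernel re-assigns its physical pages. */
    MP_SendFree(buf, nbytes, dest, /*tag*/ 0);
}

void receiver(int src)
{
    long  nbytes = 1 << 20;
    void *buf;
    /* Receive a newly allocated, writable buffer that now owns the
     * transferred pages; only partially filled pages get copied. */
    MP_MallocRecv(&buf, nbytes, src, /*tag*/ 0);
    /* ... use buf, then release it when done ... */
}
```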
13. ... with or without fence calls. Measure performance or do an integrity test.
http://www.scl.ameslab.gov/Projects/NetPIPE/
NetPIPE_4.x CVS repository on AFS
14. NetPIPE 4.x options
- -H hostfile : choose a host file name other than the default nphosts
- -B : bidirectional mode
- -a : asynchronous sends and receives
- -S : synchronous sends and receives
- -O s,d : offset the source buffer by s bytes and the destination buffer by d bytes
- -I : invalidate cache (no cache effects)
- -i : do an integrity check (do not measure performance)
- -s : stream data in one direction only
15. Aggregate Measurements
- Overlapping pair-wise ping-pong tests
- Measure switch performance or line-speed limitations
- Use bidirectional mode to fully saturate and avoid synchronization concerns
[Diagram: nodes n0-n3 communicating pair-wise through an Ethernet switch versus directly connected pairs, illustrating line-speed versus end-point limitations]
- Investigate other methods for testing the global network
- Evaluate the full range from simultaneous nearest-neighbor communications to all-to-all
16. The IBM SP at NERSC
- 416 16-way SMP nodes connected by an SP switch
  - 380 IBM Nighthawk compute nodes → 6080 compute processors
- 16-processor SMP node
  - 375 MHz Power3 processors
  - 4 Flops/cycle → peak of 1500 MFlops/processor
  - ALCMD gets around 150 MFlops/processor
  - 16 GB RAM/node (some 32 and 64 GB nodes)
  - Limited to 2 GB/process
  - AIX with MPI or OpenMP
- IBM Colony switch connecting the SMP nodes
  - 2 network adapters per node
  - MPI
- http://hpcf.nersc.gov/computers/SP/
17. SMP message-passing performance on an IBM SP node
The aggregate bi-directional bandwidth is 4500 Mbps between one pair of processes on the same SMP node, with a latency of 14 us. The bandwidth scales ideally for two pairs communicating simultaneously. Efficiency drops to 80% when 4 pairs are communicating, saturating the main memory bandwidth on the node. Communication-bound codes will suffer when run on all 16 processors due to this saturation.
18. Message-passing performance between IBM SP nodes
The aggregate bi-directional bandwidth is 3 Gbps between one pair of processes on different nodes, with a latency of 23 us. The bandwidth scales ideally for two pairs communicating simultaneously, which makes sense given that there are two network adapter cards per node. Efficiency drops to 70% when 4 pairs are communicating, just about saturating the off-node communication bandwidth. Communication-bound codes will suffer when run on more than 4 processors due to this saturation.
19. 12-processor Cray XD1 chassis
- 6 dual-processor Opteron nodes → 12 processors
- 2-processor SMP nodes
  - 2.2 GHz Opteron processors
  - 2 GB RAM/node
  - SuSE Linux with the 2.4.21 kernel
- RapidArray interconnect
  - 2 network chips per node
  - MPICH 1.2.5
20. Message-passing performance on a Cray XD1
TCP performance reaches 2 Gbps with a 10 us
latency. The MPI performance between processors
on a dual-processor SMP node on the Cray XD1
reaches 7.5 Gbps with a 1.8 us latency. MPI
performance between nodes reaches 10 Gbps with a
1.8 us latency. With MPI, it is currently faster
to communicate between processors on separate
nodes than within the same SMP node.
21. Message-passing performance between nodes on a Cray XD1
The MPI performance between 2 processors on a
dual-processor SMP node on the Cray XD1 reaches
7.5 Gbps with a 1.8 us latency. There are
severe dropouts starting at 8 kB. Similar
dropouts have been seen in an InfiniBand module
by D.K. Panda based on MPICH. That module may be
the source of the Cray XD1 MPI version. We need
to work with Cray, D.K. Panda, and Argonne to
resolve this problem.
22. Message-passing performance on a Cray XD1 SMP node
The MPI performance between 2 processors
on a dual-processor SMP node on the Cray XD1
reaches 7.5 Gbps with a 1.8 us latency. The
characteristics show no sign of a special SMP
module, so data probably goes out to the network
chip and back to the 2nd processor. There are
severe dropouts starting at 8 kB similar to
off-node performance.
23. Aggregate message-passing performance between nodes
The MPI performance between 2 nodes reached a maximum of 10 Gbps. The aggregate performance for bidirectional communications between the same 2 nodes reaches 15 Gbps, showing a loss of 50%. The aggregate performance across the same link using both processors on each node reaches the same maximum of 15 Gbps. This is measuring the limitation of a single link. We do not have access to a large enough system to test for saturation of the RapidArray switch.
24. Aggregate message-passing performance across the switch
The aggregate performance across one link using
both processors on each node reaches 15 Gbps. The
same measurement using 4 nodes (4 pairs of
processors communicating simultaneously) reaches
an ideal doubling, showing no saturation of the
switch.
25.
- Most applications do not take advantage of the network topology
  - There are many different topologies, which can even vary at run-time
  - Schedulers usually do not allow requests for a given topology
  - There is no portable method for mapping to the topology
  - → loss of performance and scalability
- NodeMap will automatically determine the network topology at run-time and pass the information to the message-passing library
26.
- Network modules identify the topology using a variety of discovery techniques
- Hosts and switches are stored from nearest neighbors out
- Topology analysis routines identify regular topologies
- Latency, maximum, and aggregate bandwidth tests provide more accurate performance measurements
- The best mapping is provided through the MPI_Init or MPI_Cart_create functions
27. NodeMap network modules
- Run-time use of NodeMap means it must operate very quickly (seconds, not minutes)
- Use brute-force ping-pong measurements as a last resort
- Reduce the complexity of the problem at each stage
  - Always identify SMP processes first using gethostname
  - Identify each node's nearest neighbors, then work outwards
- Store the neighbor and switch information in the manner it is discovered
  - Store local neighbors, then 2nd neighbors, etc.
  - This data structure based on discovery makes it easy to write the topology analysis routines
- The network type or types should be known from the MPI configuration
  - This identifies which network modules to run
  - The MPI configuration provides paths to the native software libraries
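A small sketch of the first reduction step, identifying which MPI ranks share an SMP node by gathering hostnames. This is a generic illustration of the technique, not NodeMap source code:

```c
/* Identify SMP processes by comparing hostnames across all ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define HLEN 64

int main(int argc, char **argv)
{
    int   myproc, nprocs, i, same_node = 0;
    char  myhost[HLEN], *allhosts;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myproc);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    gethostname(myhost, HLEN);
    myhost[HLEN - 1] = '\0';

    allhosts = malloc((size_t)nprocs * HLEN);
    MPI_Allgather(myhost, HLEN, MPI_CHAR, allhosts, HLEN, MPI_CHAR,
                  MPI_COMM_WORLD);

    for (i = 0; i < nprocs; i++)                 /* count on-node peers */
        if (strcmp(myhost, allhosts + (size_t)i * HLEN) == 0)
            same_node++;

    printf("rank %d of %d: %d processes on host %s\n",
           myproc, nprocs, same_node, myhost);

    free(allhosts);
    MPI_Finalize();
    return 0;
}
```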
28. Myrinet static.map file

s - "s0" 15
  0 s - "s1" 14
  1 s - "s2" 14
  2 s - "s3" 14
  3 s - "s4" 14
  4 s - "s5" 14
  5 s - "s6" 14
  6 s - "s7" 14
  7 s - "s8" 14
  9 h - "a18" 0
  10 h - "a19" 0
  11 h - "4pack" 0
  12 h - "m22" 0
  13 h - "m19" 0
  14 h - "m18" 0
  15 h - "m20" 0

s - "s1" 8
  0 s - "s9" 0
  1 s - "s10" 0
  6 s - "s11" 0
  11 s - "s12" 0
  12 s - "s13" 0
  13 s - "s14" 0
  14 s - "s0" 0
  15 s - "s15" 0

h - "m22" 1
  0 s - "s0" 12
  number 0 address 0060dd7fb1f9 gmId 78 hostType 0

h - "m27" 1
  0 s - "s14" 14
  number 0 address 0060dd7fb0e8 gmId 62 hostType 0

- Parse the gm/sbin/static.map file
- Each host has an entry
  - Connected to what switch?
- Each switch has an entry
  - Lists all hosts connected
  - Lists all switches connected
  - Internal switches have no hosts
- Determine the complete topology
- Determine our topology
29. Myrinet module for NodeMap

4pack> gm_board_info
lanai_cpu_version = 0x0900 (LANai9.0)
max_lanai_speed = 134 MHz (should be labeled "M3M-PCI64B-2-59521")

gmID  MAC Address    gmName  Route
----  -------------  ------  -------------
   2  0060dd7fb1c0   m26     ba bd 88
   3  0060dd7fb0e0   m18     83
   4  0060dd7fb1f6   m29     ba b3 89
  55  0060dd7facb2   m25     ba b2 88
  56  0060dd7fb0ec   m24     ba bf 86
  58  0060dd7fb106   m20     84
  59  0060dd7fb0e3   m19     82
  61  0060dd7fb1bd   m28     ba be 86
  62  0060dd7fb0e8   m27     ba bf 89
  67  0060dd7faca8   m23     ba be 89
  77  0060dd7facb0   4pack   80 (this node)
  78  0060dd7fb1f9   m22     ba be 88
  80  0060dd7fb0ed   m32     ba b3 88
  93  0060dd7fb0e1   m31     ba be 87

- Probe using gm_board_info
- Use the header info to ID the board
- Verify the clock rate with dmesg
- Provides the exact routing (not really needed)
- Measure the latency and bandwidth
- Provide the best mapping
30. InfiniBand module for NodeMap

opteron1> minism InfiniHost0
minism> d
New Discovered Node
New Node - Type=CA NumPorts=02 LID=0003
New Discovered Node
New Link 4x FromLID=0003 FromPort=01 ToLID=0004 ToPort=08
New Node - Type=Sw NumPorts=08 LID=0004
No Link 1x FromLID=0004 FromPort=01
No Link 1x FromLID=0004 FromPort=02
No Link 1x FromLID=0004 FromPort=03
No Link 1x FromLID=0004 FromPort=04
New Discovered Node
New Link 4x FromLID=0004 FromPort=05 ToLID=0002 ToPort=06
New Link 4x FromLID=0004 FromPort=06 ToLID=0002 ToPort=05
New Link 4x FromLID=0004 FromPort=07 ToLID=0009 ToPort=01
New Link 4x FromLID=0004 FromPort=08 ToLID=0003 ToPort=01
New Discovered Node
New Node - Type=CA NumPorts=02 LID=0009
New Link 4x FromLID=0009 FromPort=01 ToLID=0004 ToPort=07

- Probe the subnet manager (minism or other)
- Identify my LID
- Exchange LIDs
- Parse and store the links, switches, and other HCAs
- This is all that is needed to determine the topology
31. IP module for NodeMap
- The IP interface can sit on top of many types of network hardware
  - Ethernet, ATM, IP over Myrinet, IP over InfiniBand, etc.
- How to determine what network cards are present?
  - ifconfig provides a list of active interfaces
    - Does tell what type of network
    - May tell what speed for Ethernet
  - lspci and hinv provide a description of what is plugged into the PCI slots
    - Sometimes helpful, but may require a database
- Can measuring the latency identify the number of switches in between?
  - This may require many point-to-point measurements
  - The OS, driver, and NIC can affect the latency themselves
  - It may identify which nodes are on a local switch
  - Can simultaneous measurements be done to make this efficient?
  - Use aggregate measurements to probe higher-level switches?
32. MPP modules for NodeMap
- Use uname or compiler definitions to identify the MPP type
  - _CRAYT3E is defined for the Cray T3E
  - _AIX for the IBM SP (anything better?)
  - These identify the global network topology
- Use gethostname to identify SMP processes
  - This reduces the complexity of the problem
- Use vendor functions if available (getphysnode on the Cray T3E)
- Do we need a module for each new type of MPP?
33. A Generic MPI Module
- How quickly can a brute-force analysis be done?
- Run NodeMap once to generate a static map file of the topology?
- Identify SMP processes first using gethostname
- Gather information for a single host
  - Measure the latency and maximum throughput to one or more nodes
- Probe using overlapping pair-wise bandwidth measurements (see the sketch below)
  - Try to measure the congestion factor
  - Increase the number of pairs until the performance improves → X-dimension
  - Repeat in the Y-dimension
- Use 2D global shifts for several regular arrangements
  - Measure the bi-sectional bandwidth → can identify fat trees
- Additional tricks needed!
[Diagram: nodes n0-n3 exchanging overlapping pair-wise messages]
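A generic-MPI sketch of the overlapping pair-wise probe mentioned above: ranks are split into pairs (rank i with rank nprocs-1-i), all pairs exchange messages simultaneously, and each pair reports its aggregate rate. This illustrates the measurement technique, not NodeMap code, and the message size and repetition count are arbitrary:

```c
/* Overlapping pair-wise bandwidth probe: rank i pairs with nprocs-1-i. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int nbytes = 1 << 20, reps = 100;
    int myproc, nprocs, partner, r;
    char *sbuf, *rbuf;
    double t0, t, mbps;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myproc);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    partner = nprocs - 1 - myproc;

    sbuf = malloc(nbytes);
    rbuf = malloc(nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (r = 0; r < reps; r++) {
        /* bidirectional exchange keeps every pair's link saturated */
        MPI_Sendrecv(sbuf, nbytes, MPI_CHAR, partner, 0,
                     rbuf, nbytes, MPI_CHAR, partner, 0,
                     MPI_COMM_WORLD, &status);
    }
    t = MPI_Wtime() - t0;

    /* each rank moves nbytes in and out per repetition */
    mbps = 2.0 * 8.0 * nbytes * reps / t / 1.0e6;
    printf("rank %d <-> %d: %.1f Mbps aggregate for this pair\n",
           myproc, partner, mbps);

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}
```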
34. Topological Analysis Routines
- Initially concentrate only on regular arrangements
  - N-dimensional mesh/torus, trees, SMP nodes
- Identify the topology from the host/switch information gathered
  - How many unique 1st, 2nd, 3rd neighbors? → mesh/torus
  - Determine whether a tree is fully connected
- Eventually handle more irregular arrangements of nodes
  - Identify the general type of network
  - There may be an irregular arrangement of nodes on a mesh/torus
  - Identify which nodes are irregular
35. Performance Measurements
- Performance may be limited by many factors
- Measure the latency and maximum bandwidth for each network layer
- Measure the aggregate performance across switches or a given link
- Performance data can help determine the topology
  - A tree with fewer internal switches may still be a fat tree
- Feed the performance data to the Mapper along with the topological data
36. The Mapper
- NodeMap will initially be run from MPI_Cart_create( ..., reorder=1 )
  - The user must determine which direction requires the optimal passing
- The Mapper will take the topology and performance data and provide the best mapping of the desired 2D (or eventually 3D or tree) algorithm
  - Concentrate on regular arrangements first
  - Use Gray codes for mapping 2D onto N-dimensional topologies to guarantee that nearest neighbors are directly connected
- NodeMap may also be run from MPI_Init()
  - Provide optimal mappings for global operations (mainly binary trees)
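For reference, the standard MPI calls involved: a 2D Cartesian communicator created with reorder=1, followed by MPI_Cart_shift to find nearest neighbors. This is plain MPI usage, not the Mapper itself:

```c
/* Create a reorderable 2D Cartesian communicator and find neighbors. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int nprocs, myproc, dims[2] = {0, 0}, periods[2] = {1, 1};
    int reorder = 1;                 /* allow ranks to be remapped */
    int left, right, down, up, coords[2];
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Dims_create(nprocs, 2, dims);            /* choose a 2D grid */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &cart);

    MPI_Comm_rank(cart, &myproc);                /* rank may differ  */
    MPI_Cart_coords(cart, myproc, 2, coords);
    MPI_Cart_shift(cart, 0, 1, &left, &right);   /* X-direction      */
    MPI_Cart_shift(cart, 1, 1, &down, &up);      /* Y-direction      */

    printf("rank %d at (%d,%d): x-neighbors %d/%d, y-neighbors %d/%d\n",
           myproc, coords[0], coords[1], left, right, down, up);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```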
37. Conclusions
- Codes must be mapped to the network topology to scale well
  - Many codes can use a 2D decomposition
  - 2D algorithms can be mapped ideally to most network topologies
- NodeMap will provide automatic mapping to the topology
  - A portable means of taking advantage of the network topology
- Questions
  - How well will NodeMap handle irregular networks?
  - Will it be difficult to provide a reasonable mapping?
  - Can a generic MPI module effectively discover a topology?
  - If so, how quickly can this be done?
  - Will NodeMap need to generate static map files ahead of time?
38. Contact information
- Dave Turner - turner@ameslab.gov
- http://www.scl.ameslab.gov/Projects/MP_Lite/
- http://www.scl.ameslab.gov/Projects/NetPIPE/
39. Ames Lab Classical Molecular Dynamics code
Embedded-atom method, Lennard-Jones, and Tersoff potentials. Uses cubic spline functions for optimal performance. Local interactions only, typically 5-6 Å → 50-60 neighbors per atom.
2D decomposition of the 3D simulation space. Map neighboring regions to neighboring nodes to localize communications. Shift coordinates and accumulators to all nodes above and to the right to calculate all pair interactions within the cutoff range. Large systems require just 5 communications, while systems spread across more nodes may involve many more passes in a serpentine fashion around half the interaction range.
40. ALCMD scaling on the IBM SP
Proper mapping of the columns to SMP nodes helps greatly. Parallel efficiency goes from 50% to 70% for 10,000,000 atoms on 1024 processors. Scaling beyond 1024 processors will be difficult: on-node and off-node communications are saturated even at 1024 processors (16 x 16-way SMPs).
41. 2D decomposition of algorithms
- Many codes can be naturally decomposed onto a 2D mesh
  - Atomistic codes: classical and order-N tight-binding
  - Many matrix operations
  - Finite difference and finite element codes
  - Grid and multi-grid approaches
- 2D decompositions map well to most network topologies
  - 2D and 3D meshes and toruses, hypercubes, trees, fat trees
  - Direct connections to nearest-neighbor nodes prevent contention
- Writing algorithms using a 2D decomposition can provide the initial step to taking advantage of the network topology
42. Point-to-point Performance
InfiniBand can deliver 4500 - 6500 Mbps at a 7.5
us latency. Atoll delivers 1890 Mbps with a 4.7
us latency. SCI delivers 1840 Mbps with only a
4.2 us latency. Myrinet performance reaches 1820
Mbps with an 8 us latency. Channel-bonded GigE
offers 1800 Mbps for very large messages. Gigabit
Ethernet delivers 900 Mbps with a 25-62
us latency. 10 GigE only delivers 2 Gbps with a
75 us latency.
43. Evaluating network performance using ping-pong tests
- Bounce messages back and forth between two processes to measure the performance
- Messages are bounced many times for each test to provide an accurate timing
- The latency is the minimum time needed to send a small message (1/2 the round-trip time)
- The throughput is the communication rate
- Start with a small message size (1 byte) and increase exponentially up to 8 MB
- Use perturbations from the factors of 2 to fully test the system
[Diagram: nodes n0-n3 connected through a switch]
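A minimal MPI ping-pong sketch of the measurement described above, reporting half the round-trip time as the latency and the corresponding throughput for a single message size (NetPIPE itself sweeps the sizes and applies perturbations around the factors of 2):

```c
/* Minimal two-process ping-pong: rank 0 bounces a message off rank 1. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int nbytes = 1024, reps = 1000;
    int myproc, r;
    char *buf;
    double t0, t;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myproc);
    buf = malloc(nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (r = 0; r < reps; r++) {
        if (myproc == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (myproc == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t = MPI_Wtime() - t0;

    /* throughput uses the one-way time, i.e. half the round-trip time */
    if (myproc == 0)
        printf("%d bytes: latency %.2f us, throughput %.1f Mbps\n",
               nbytes, 0.5e6 * t / reps,
               8.0 * nbytes * reps / (0.5 * t) / 1.0e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```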
44. Using NetPIPE 4.x
mpirun -np nprocs -hostfile nphosts NPmpi [NetPIPE options]
- The -H hostfile option to NPmpi can be used to change the order of the pairings
- The nplaunch script launches the other NetPIPE executables
- The default host file name is nphosts
  - It lists the hosts with the first and last communicating, the 2nd and 2nd-to-last, etc.
nplaunch NPtcp [NetPIPE options]
- For aggregate measurements, use -B for bidirectional mode
- For NPmpi, -a may be needed for asynchronous sends and receives