Title: Optimizing SMP Message-Passing Systems
1. Optimizing SMP Message-Passing Systems
- Dave Turner
- In collaboration with
- Adam Oline, Xuehua Chen, and Troy Benjegerdes
- This project is funded by the Mathematical, Information, and Computational Sciences division of the Department of Energy.
2. The MP_Lite message-passing library
- A lightweight MPI implementation
- Highly efficient for the architectures supported
- Designed to be very user-friendly
- Ideal for performing message-passing research
- http://www.scl.ameslab.gov/Projects/MP_Lite/
3. SMP message-passing using a shared-memory segment
[Diagram: Process 0 and Process 1, each on its own processor with its own cache, communicate through a shared-memory segment in main memory]
- 2-copy mechanism
- Variety of methods to manage the segment
- Variety of methods to signal the destination process
4. The MP_Lite lock-free approach
Each process has its own section of the shared-memory segment for its outgoing messages; other processes only flip a cleanup flag. With no lockouts, this approach provides excellent scaling. Doubly linked lists allow very efficient searches, or this method can be combined with the shared-memory FIFO method.
[Diagram: the shared-memory segment is divided into per-process sections; Process 0 holds message 0→1, Process 1 holds 1→3, 1→0, and 1→2, Process 2 holds 2→0, and Process 3 holds 3→2]
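A minimal sketch of what one per-process outgoing section might look like. All identifiers, field names, and sizes below are assumptions for illustration, not the actual MP_Lite data layout:

```c
/* Illustrative sketch only: a per-process outgoing-message section with a
 * cleanup flag that the receiving process flips, so no locks are required.
 * Names and sizes are assumptions, not the real MP_Lite layout. */
#include <stdint.h>

#define MAX_MSGS_PER_PROC 128

struct msg_header {
    int32_t  src, dest;        /* source and destination ranks            */
    int32_t  nbytes;           /* message size in bytes                   */
    int32_t  next, prev;       /* doubly linked list stored as offsets so */
                               /* links stay valid in every address space */
    volatile int32_t cleanup;  /* destination sets this when it is done;  */
                               /* only the owner ever reclaims slots      */
};

struct outgoing_section {
    struct msg_header hdr[MAX_MSGS_PER_PROC];
    /* message payloads would follow in the same per-process region */
};

/* The destination never modifies the list itself; it only flips the
 * cleanup flag, and the owning process unlinks the entry later. */
static inline void mark_done(struct msg_header *h) { h->cleanup = 1; }
```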
5. 1-copy SMP message-passing for Linux
[Diagram: a kernel put or kernel get performs a single kernel copy between the user buffers of Process 0 and Process 1 through main memory]
- This should double the throughput
- Unclear at inception what the latency would be
- Relatively easy to implement a full MPI implementation
- Similar work was done by the BIP-SMP project
6. The kcopy.c Linux module
- kcopy_open()
  - Does nothing, since it is difficult to pass data in
- kcopy_ioctl()
  - KCOPY_INIT: initialize the synchronization arrays
  - KCOPY_PUT/GET: call copy_user_to_user to transfer the message data
  - KCOPY_SYNC: out-of-band synchronization
  - KCOPY_BCAST/GATHER: support routines for exchanging initial pointers
- kcopy_release()
- copy_user_to_user
  - kmap the destination pages to kernel space, then copy 1 page at a time
[Diagram: source and destination virtual address spaces (3 GB user + 1 GB kernel each, 4 GB total) mapped onto 4 GB of physical memory; the kernel copies directly between the two user regions]
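A rough sketch of the per-page copy described above, assuming the destination pages have already been pinned (for example with get_user_pages()). This illustrates the kmap-and-copy pattern for a 2.4/2.6-era kernel; it is not the actual kcopy.c source, and the function and parameter names are invented:

```c
/* Sketch: copy nbytes from a source user buffer into another process's
 * pages, one page at a time.  'dpages' is assumed to already hold the
 * pinned destination pages, 'doffset' the offset into the first page. */
#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/errno.h>
#include <asm/uaccess.h>

static int copy_user_to_pages(struct page **dpages, unsigned long doffset,
                              const char *sbuf, unsigned long nbytes)
{
    unsigned long copied = 0;
    int i = 0;

    while (copied < nbytes) {
        unsigned long chunk = PAGE_SIZE - doffset;
        void *kaddr;

        if (chunk > nbytes - copied)
            chunk = nbytes - copied;

        kaddr = kmap(dpages[i]);              /* map destination page */
        if (copy_from_user((char *)kaddr + doffset, sbuf + copied, chunk)) {
            kunmap(dpages[i]);
            return -EFAULT;
        }
        kunmap(dpages[i]);

        copied += chunk;
        doffset = 0;                          /* later pages start at 0 */
        i++;
    }
    return 0;
}
```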
7. Programming for the kcopy.o Linux module
- dd = open( "/dev/kcopy", O_WRONLY )
  - Open the connection to the kcopy.o module and return a device descriptor
- ioctl( dd, KCOPY_INIT, hdr )
  - Pass myproc and nprocs to the module
  - Initialize the sync arrays
- ioctl( dd, KCOPY_PUT, hdr )
  - hdr: sbuf, dbuf, myproc, nprocs, dest, nbytes, comm (put/get)
  - Checks for write access using access_ok()
  - copy_user_to_user copies nbytes from sbuf to/from dbuf
- ioctl( dd, KCOPY_SYNC, hdr )
  - All processes increment their sync element, then wait for a go signal from proc 0
- ioctl( dd, KCOPY_GATHER, hdr )
  - hdr: myproc, nprocs, sizeof(element), myelement → array of elements
- ioctl( dd, KCOPY_BCAST, hdr )
  - hdr: myproc, nprocs, sizeof(element), myelement → element broadcast
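A hedged user-space sketch of the calling sequence above. The header struct layout and the ioctl request codes are assumptions made for illustration; only the call order and the field names come from the slide:

```c
/* Illustrative only: driving the kcopy.o module from user space.
 * The struct layout and request codes are assumed, not from kcopy.h. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct kcopy_hdr {              /* fields named on the slide */
    void *sbuf, *dbuf;
    int   myproc, nprocs, dest;
    long  nbytes;
    int   comm;                 /* put or get */
};

/* request codes would really come from the module's own header file */
enum { KCOPY_INIT = 1, KCOPY_PUT, KCOPY_GET, KCOPY_SYNC,
       KCOPY_BCAST, KCOPY_GATHER };

int kcopy_example(int myproc, int nprocs, int dest,
                  void *sbuf, void *dbuf, long nbytes)
{
    struct kcopy_hdr hdr = { sbuf, dbuf, myproc, nprocs, dest, nbytes, 0 };
    int dd = open("/dev/kcopy", O_WRONLY);      /* device descriptor */

    if (dd < 0) { perror("open /dev/kcopy"); return -1; }

    ioctl(dd, KCOPY_INIT, &hdr);   /* pass myproc/nprocs, set up sync   */
    ioctl(dd, KCOPY_PUT,  &hdr);   /* kernel copies nbytes sbuf -> dbuf */
    ioctl(dd, KCOPY_SYNC, &hdr);   /* out-of-band barrier via proc 0    */

    close(dd);
    return 0;
}
```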
8. The MP_Lite kput.c module
- 1-sided MP_Put/MP_Get are not implemented yet
- 2-sided communications use both gets and puts
- 4 separate circular queues are maintained to manage the 2-sided communications
- MP_Send
  - If a receive is pre-posted, put the data to the destination process, then post a completion flag
  - If blocking on a send, malloc a buffer, copy the data, and post the header to the destination
  - The destination process will then get the data from the source process and post a completion flag
  - The source process can use the completion flag to free the send buffer
- MP_Recv
  - If a buffered send is posted, get the data from the source process, then post a completion flag so the source process can free the send buffer
  - Else pre-post a receive header, then block until a completion notification flag is posted in return
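A pseudocode-level sketch of the send-side decision described above. The helper functions (find_preposted_recv, kernel_put, and so on) are hypothetical stand-ins, not MP_Lite internals, and are declared but not implemented here:

```c
/* Illustrative sketch of the two-sided send path built on kernel puts
 * and gets.  Every helper below is a hypothetical placeholder. */
#include <stdlib.h>
#include <string.h>

struct recv_hdr;                                   /* opaque here */
struct recv_hdr *find_preposted_recv(int dest, int tag);
void kernel_put(int dest, struct recv_hdr *r, const void *buf, long n);
void post_completion_flag(int dest, struct recv_hdr *r);
void post_buffered_header(int dest, int tag, void *buf, long n);

void MP_Send_sketch(const void *buf, long nbytes, int dest, int tag)
{
    struct recv_hdr *r = find_preposted_recv(dest, tag);

    if (r != NULL) {
        /* Receive already posted: put directly into the destination
         * buffer, then flag completion. */
        kernel_put(dest, r, buf, nbytes);
        post_completion_flag(dest, r);
    } else {
        /* No receive yet: buffer the message so the send can complete,
         * and let the destination get the data later.  The destination
         * posts a completion flag, which tells us to free this buffer. */
        void *tmp = malloc(nbytes);
        memcpy(tmp, buf, nbytes);
        post_buffered_header(dest, tag, tmp, nbytes);
    }
}
```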
9. SMP message-passing between caches
All MPI libraries can be made to achieve a 1 us latency. 2-copy methods can still show some possible benefits in the cache range (measurements with real codes are needed). The 1-copy mechanism in MP_Lite nearly doubles the throughput for large messages. Adding an optimized non-temporal memory copy routine to the Linux kernel module nearly doubles the performance again, but only for Pentium 4 systems.
10. SMP message-passing performance without cache effects
The benefits of the 1-copy method and the optimized memory copy routine are seen more clearly when the messages do not start in cache. There may still be room to improve the throughput by another 10-20%:
- Copy contiguous pages
- Get rid of the spin-lock in kmap
Are optimized memory copy routines available for other processors? This Linux module is the perfect place to put any optimized memory copy routines.
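For context, a non-temporal copy on a Pentium 4 generally means streaming stores that bypass the cache. Below is a user-space sketch of the idea using SSE2 intrinsics; it is not the kernel routine used in MP_Lite, and the alignment and length restrictions are simplifications:

```c
/* Sketch of a non-temporal (streaming) memory copy for SSE2-capable
 * x86 processors such as the Pentium 4.  Buffers must be 16-byte
 * aligned and nbytes a multiple of 64; a real routine would handle
 * the general case and fall back to memcpy for small copies. */
#include <emmintrin.h>
#include <stddef.h>

void ntcopy(void *dst, const void *src, size_t nbytes)
{
    __m128i       *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t i;

    for (i = 0; i < nbytes / 16; i += 4) {
        /* load normally, store with non-temporal hints so the copy
         * neither pollutes nor is limited by the cache */
        __m128i x0 = _mm_load_si128(s + i);
        __m128i x1 = _mm_load_si128(s + i + 1);
        __m128i x2 = _mm_load_si128(s + i + 2);
        __m128i x3 = _mm_load_si128(s + i + 3);
        _mm_stream_si128(d + i,     x0);
        _mm_stream_si128(d + i + 1, x1);
        _mm_stream_si128(d + i + 2, x2);
        _mm_stream_si128(d + i + 3, x3);
    }
    _mm_sfence();   /* make the streaming stores globally visible */
}
```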
11. 0-copy SMP message-passing using a copy-on-write mechanism
[Diagram: a kernel virtual copy maps pages 1-3 of the source buffer into Process 1's address space; only the main-memory page mappings change]
- If the source and destination buffers are similarly aligned, and if the message contains full pages to copy, give access to both processes and mark the pages as copy-on-write
- Pages will only be copied if either process writes to them
- Both processes could easily trample the buffers much later
- Effective use would require the user to align buffers (posix_memalign)
- The user must also protect against trampling (malloc before, free right after)
- This still does not help with transferring writable buffers
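Since effective use of copy-on-write would require page-aligned buffers, here is a short example of allocating one with posix_memalign; the buffer size is arbitrary and the padding to whole pages is an illustrative choice:

```c
/* Allocate a page-aligned, page-padded message buffer so that a
 * copy-on-write transfer could remap whole pages. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    long   page   = sysconf(_SC_PAGESIZE);          /* e.g. 4096 */
    size_t nbytes = 100000;
    /* round the allocation up to a whole number of pages */
    size_t padded = ((nbytes + page - 1) / page) * page;
    void  *buf;
    int    err = posix_memalign(&buf, page, padded);

    if (err != 0) {
        fprintf(stderr, "posix_memalign failed: %d\n", err);
        return 1;
    }
    printf("buffer of %zu bytes aligned to %ld-byte pages\n", padded, page);
    free(buf);
    return 0;
}
```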
12. 0-copy SMP message-passing by transferring ownership of the physical pages
[Diagram: MP_SendFree() on Process 0 and MP_MallocRecv() on Process 1; a kernel virtual copy re-assigns physical pages 1-3 from the source process to the destination process]
- If the source node does not need to retain a copy, do an MP_SendFree()
- The kernel can re-assign the physical pages to the destination process
- The destination process then does an MP_MallocRecv()
- Only the partially filled pages would need to be copied
- Can also provide alignment help to page-align and pad buffers
- The destination process owns the new pages (they are writable)
- Requires extensions to the MPI standard
- Must modify the code, but should be fairly easy (but what about Fortran?)
- Can be combined with copy-on-write to share read-only data
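A hedged sketch of how an application might call the proposed page-ownership routines. MP_SendFree and MP_MallocRecv are named on the slide, but their signatures are not, so the prototypes and semantics below are assumptions:

```c
/* Illustrative use of the proposed ownership-transfer extensions.
 * The prototypes are guesses; only the names come from the slide. */
#include <stdlib.h>

/* assumed prototypes for the proposed MP_Lite extensions */
int MP_SendFree(void *buf, long nbytes, int dest, int tag);
int MP_MallocRecv(void **buf, long nbytes, int src, int tag);

void sender(int dest)
{
    long  nbytes = 1 << 20;
    char *buf    = malloc(nbytes);
    /* ... fill buf ... */
    /* Hand the pages to the destination; buf must not be touched again,
     * since the kernel re-assigns its physical pages. */
    MP_SendFree(buf, nbytes, dest, /*tag*/ 0);
}

void receiver(int src)
{
    long  nbytes = 1 << 20;
    void *buf;
    /* Receive a newly allocated, writable buffer that now owns the
     * transferred pages; only partially filled pages get copied. */
    MP_MallocRecv(&buf, nbytes, src, /*tag*/ 0);
    /* ... use buf, then release it when done ... */
}
```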
13. ... with or without fence calls. Measure performance or do an integrity test.
http://www.scl.ameslab.gov/Projects/NetPIPE/
NetPIPE_4.x CVS repository on AFS
14. NetPIPE 4.x options
- -H hostfile : choose a host file name other than the default nphosts
- -B : bidirectional mode
- -a : asynchronous sends and receives
- -S : synchronous sends and receives
- -O s,d : offset the source buffer by s bytes and the destination buffer by d bytes
- -I : invalidate cache (no cache effects)
- -i : do an integrity check (do not measure performance)
- -s : stream data in one direction only
15. Aggregate Measurements
- Overlapping pair-wise ping-pong tests
- Measure switch performance or line-speed limitations
- Use bidirectional mode to fully saturate and avoid synchronization concerns
[Diagram: nodes n0-n3 communicating pair-wise through an Ethernet switch versus directly connected pairs, illustrating line-speed versus end-point limitations]
- Investigate other methods for testing the global network
- Evaluate the full range from simultaneous nearest-neighbor communications to all-to-all
16. The IBM SP at NERSC
- 416 16-way SMP nodes connected by an SP switch
  - 380 IBM Nighthawk compute nodes → 6080 compute processors
- 16-processor SMP node
  - 375 MHz Power3 processors
  - 4 Flops/cycle → peak of 1500 MFlops/processor
  - ALCMD gets around 150 MFlops/processor
  - 16 GB RAM/node (some 32 and 64 GB nodes)
  - Limited to 2 GB/process
  - AIX with MPI or OpenMP
- IBM Colony switch connecting the SMP nodes
  - 2 network adapters per node
  - MPI
- http://hpcf.nersc.gov/computers/SP/
17. SMP message-passing performance on an IBM SP node
The aggregate bi-directional bandwidth is 4500 Mbps between one pair of processes on the same SMP node, with a latency of 14 us. The bandwidth scales ideally for two pairs communicating simultaneously. Efficiency drops to 80% when 4 pairs are communicating, saturating the main memory bandwidth on the node. Communication-bound codes will suffer when run on all 16 processors due to this saturation.
18. Message-passing performance between IBM SP nodes
The aggregate bi-directional bandwidth is 3 Gbps between one pair of processes on different nodes, with a latency of 23 us. The bandwidth scales ideally for two pairs communicating simultaneously, which makes sense given that there are two network adapter cards per node. Efficiency drops to 70% when 4 pairs are communicating, just about saturating the off-node communication bandwidth. Communication-bound codes will suffer when run on more than 4 processors due to this saturation.
19. 12-processor Cray XD1 chassis
- 6 dual-processor Opteron nodes → 12 processors
- 2-processor SMP nodes
  - 2.2 GHz Opteron processors
  - 2 GB RAM/node
  - SuSE Linux with the 2.4.21 kernel
- RapidArray interconnect
  - 2 network chips per node
  - MPICH 1.2.5
20. Message-passing performance on a Cray XD1
TCP performance reaches 2 Gbps with a 10 us
latency. The MPI performance between processors
on a dual-processor SMP node on the Cray XD1
reaches 7.5 Gbps with a 1.8 us latency. MPI
performance between nodes reaches 10 Gbps with a
1.8 us latency. With MPI, it is currently faster
to communicate between processors on separate
nodes than within the same SMP node.
21. Message-passing performance between nodes on a Cray XD1
The MPI performance between 2 processors on a
dual-processor SMP node on the Cray XD1 reaches
7.5 Gbps with a 1.8 us latency. There are
severe dropouts starting at 8 kB. Similar
dropouts have been seen in an InfiniBand module
by D.K. Panda based on MPICH. That module may be
the source of the Cray XD1 MPI version. We need
to work with Cray, D.K. Panda, and Argonne to
resolve this problem.
22. Message-passing performance on a Cray XD1 SMP node
The MPI performance between 2 processors
on a dual-processor SMP node on the Cray XD1
reaches 7.5 Gbps with a 1.8 us latency. The
characteristics show no sign of a special SMP
module, so data probably goes out to the network
chip and back to the 2nd processor. There are
severe dropouts starting at 8 kB similar to
off-node performance.
23. Aggregate message-passing performance between nodes
The MPI performance between 2 nodes reached a maximum of 10 Gbps. The aggregate performance for bidirectional communications between the same 2 nodes reaches 15 Gbps, showing a loss of 50%. The aggregate performance across the same link using both processors on each node reaches the same maximum of 15 Gbps. This is measuring the limitation of a single link. We do not have access to a large enough system to test for saturation of the RapidArray switch.
24. Aggregate message-passing performance across the switch
The aggregate performance across one link using
both processors on each node reaches 15 Gbps. The
same measurement using 4 nodes (4 pairs of
processors communicating simultaneously) reaches
an ideal doubling, showing no saturation of the
switch.
25.
- Most applications do not take advantage of the network topology
  - There are many different topologies, which can even vary at run-time
  - Schedulers usually do not allow requests for a given topology
  - There is no portable method for mapping to the topology
  - → loss of performance and scalability
- NodeMap will automatically determine the network topology at run-time and pass the information to the message-passing library
26.
- Network modules identify the topology using a variety of discovery techniques
- Hosts and switches are stored from nearest neighbors out
- Topology analysis routines identify regular topologies
- Latency, maximum, and aggregate bandwidth tests provide more accurate performance measurements
- The best mapping is provided through the MPI_Init or MPI_Cart_create functions
27. NodeMap network modules
- Run-time use of NodeMap means it must operate very quickly (seconds, not minutes)
- Use brute-force ping-pong measurements as a last resort
- Reduce the complexity of the problem at each stage
  - Always identify SMP processes first using gethostname
  - Identify each node's nearest neighbors, then work outwards
- Store the neighbor and switch information in the manner it is discovered
  - Store local neighbors, then 2nd neighbors, etc.
  - This data structure based on discovery makes it easy to write the topology analysis routines
- The network type or types should be known from the MPI configuration
  - This identifies which network modules to run
  - The MPI configuration provides paths to the native software libraries
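A small sketch of the first reduction step, identifying which MPI ranks share an SMP node by gathering hostnames. This is a generic illustration of the technique, not NodeMap source code:

```c
/* Identify SMP processes by comparing hostnames across all ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define HLEN 64

int main(int argc, char **argv)
{
    int   myproc, nprocs, i, same_node = 0;
    char  myhost[HLEN], *allhosts;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myproc);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    gethostname(myhost, HLEN);
    myhost[HLEN - 1] = '\0';

    allhosts = malloc((size_t)nprocs * HLEN);
    MPI_Allgather(myhost, HLEN, MPI_CHAR, allhosts, HLEN, MPI_CHAR,
                  MPI_COMM_WORLD);

    for (i = 0; i < nprocs; i++)                 /* count on-node peers */
        if (strcmp(myhost, allhosts + (size_t)i * HLEN) == 0)
            same_node++;

    printf("rank %d of %d: %d processes on host %s\n",
           myproc, nprocs, same_node, myhost);

    free(allhosts);
    MPI_Finalize();
    return 0;
}
```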
28. Myrinet static.map file

s - "s0" 15
  0 s - "s1" 14
  1 s - "s2" 14
  2 s - "s3" 14
  3 s - "s4" 14
  4 s - "s5" 14
  5 s - "s6" 14
  6 s - "s7" 14
  7 s - "s8" 14
  9 h - "a18" 0
  10 h - "a19" 0
  11 h - "4pack" 0
  12 h - "m22" 0
  13 h - "m19" 0
  14 h - "m18" 0
  15 h - "m20" 0

s - "s1" 8
  0 s - "s9" 0
  1 s - "s10" 0
  6 s - "s11" 0
  11 s - "s12" 0
  12 s - "s13" 0
  13 s - "s14" 0
  14 s - "s0" 0
  15 s - "s15" 0

h - "m22" 1
  0 s - "s0" 12
  number 0 address 0060dd7fb1f9 gmId 78 hostType 0

h - "m27" 1
  0 s - "s14" 14
  number 0 address 0060dd7fb0e8 gmId 62 hostType 0

- Parse the gm/sbin/static.map file
- Each host has an entry
  - Connected to what switch?
- Each switch has an entry
  - Lists all hosts connected
  - Lists all switches connected
  - Internal switches have no hosts
- Determine the complete topology
- Determine our topology
29. Myrinet module for NodeMap

4pack> gm_board_info
lanai_cpu_version = 0x0900 (LANai9.0)
max_lanai_speed = 134 MHz (should be labeled "M3M-PCI64B-2-59521")

gmID  MAC Address    gmName  Route
----  -------------  ------  -------------
   2  0060dd7fb1c0   m26     ba bd 88
   3  0060dd7fb0e0   m18     83
   4  0060dd7fb1f6   m29     ba b3 89
  55  0060dd7facb2   m25     ba b2 88
  56  0060dd7fb0ec   m24     ba bf 86
  58  0060dd7fb106   m20     84
  59  0060dd7fb0e3   m19     82
  61  0060dd7fb1bd   m28     ba be 86
  62  0060dd7fb0e8   m27     ba bf 89
  67  0060dd7faca8   m23     ba be 89
  77  0060dd7facb0   4pack   80 (this node)
  78  0060dd7fb1f9   m22     ba be 88
  80  0060dd7fb0ed   m32     ba b3 88
  93  0060dd7fb0e1   m31     ba be 87

- Probe using gm_board_info
- Use the header info to ID the board
- Verify the clock rate with dmesg
- Provides the exact routing (not really needed)
- Measure the latency and bandwidth
- Provide the best mapping
30. InfiniBand module for NodeMap

opteron1> minism InfiniHost0
minism> d
New Discovered Node
New Node - Type=CA NumPorts=02 LID=0003
New Discovered Node
New Link 4x FromLID=0003 FromPort=01 ToLID=0004 ToPort=08
New Node - Type=Sw NumPorts=08 LID=0004
No Link 1x FromLID=0004 FromPort=01
No Link 1x FromLID=0004 FromPort=02
No Link 1x FromLID=0004 FromPort=03
No Link 1x FromLID=0004 FromPort=04
New Discovered Node
New Link 4x FromLID=0004 FromPort=05 ToLID=0002 ToPort=06
New Link 4x FromLID=0004 FromPort=06 ToLID=0002 ToPort=05
New Link 4x FromLID=0004 FromPort=07 ToLID=0009 ToPort=01
New Link 4x FromLID=0004 FromPort=08 ToLID=0003 ToPort=01
New Discovered Node
New Node - Type=CA NumPorts=02 LID=0009
New Link 4x FromLID=0009 FromPort=01 ToLID=0004 ToPort=07

- Probe the subnet manager (minism or other)
- Identify my LID
- Exchange LIDs
- Parse and store the links, switches, and other HCAs
- This is all that is needed to determine the topology
31. IP module for NodeMap
- The IP interface can sit on top of many types of network hardware
  - Ethernet, ATM, IP over Myrinet, IP over InfiniBand, etc.
- How to determine what network cards are present?
  - ifconfig provides a list of active interfaces
    - Does tell what type of network
    - May tell what speed for Ethernet
  - lspci and hinv provide a description of what is plugged into the PCI slots
    - Sometimes helpful, but may require a database
- Can measuring the latency identify the number of switches in between?
  - This may require many point-to-point measurements
  - The OS, driver, and NIC can affect the latency themselves
  - It may identify which nodes are on a local switch
  - Can simultaneous measurements be done to make this efficient?
  - Use aggregate measurements to probe higher-level switches?
32. MPP modules for NodeMap
- Use uname or compiler definitions to identify the MPP type
  - _CRAYT3E is defined for the Cray T3E
  - _AIX for the IBM SP (anything better?)
  - These identify the global network topology
- Use gethostname to identify SMP processes
  - This reduces the complexity of the problem
- Use vendor functions if available (getphysnode on the Cray T3E)
- Do we need a module for each new type of MPP?
33. A Generic MPI Module
- How quickly can a brute-force analysis be done?
- Run NodeMap once to generate a static map file of the topology?
- Identify SMP processes first using gethostname
- Gather information for a single host
  - Measure the latency and maximum throughput to one or more nodes
- Probe using overlapping pair-wise bandwidth measurements (see the sketch below)
  - Try to measure the congestion factor
  - Increase the number of pairs until the performance improves → X-dimension
  - Repeat in the Y-dimension
- Use 2D global shifts for several regular arrangements
  - Measure the bi-sectional bandwidth → can identify fat trees
- Additional tricks needed!
[Diagram: nodes n0-n3 exchanging overlapping pair-wise messages]
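A generic-MPI sketch of the overlapping pair-wise probe mentioned above: ranks are split into pairs (rank i with rank nprocs-1-i), all pairs exchange messages simultaneously, and each pair reports its aggregate rate. This illustrates the measurement technique, not NodeMap code, and the message size and repetition count are arbitrary:

```c
/* Overlapping pair-wise bandwidth probe: rank i pairs with nprocs-1-i. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int nbytes = 1 << 20, reps = 100;
    int myproc, nprocs, partner, r;
    char *sbuf, *rbuf;
    double t0, t, mbps;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myproc);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    partner = nprocs - 1 - myproc;

    sbuf = malloc(nbytes);
    rbuf = malloc(nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (r = 0; r < reps; r++) {
        /* bidirectional exchange keeps every pair's link saturated */
        MPI_Sendrecv(sbuf, nbytes, MPI_CHAR, partner, 0,
                     rbuf, nbytes, MPI_CHAR, partner, 0,
                     MPI_COMM_WORLD, &status);
    }
    t = MPI_Wtime() - t0;

    /* each rank moves nbytes in and out per repetition */
    mbps = 2.0 * 8.0 * nbytes * reps / t / 1.0e6;
    printf("rank %d <-> %d: %.1f Mbps aggregate for this pair\n",
           myproc, partner, mbps);

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}
```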
34. Topological Analysis Routines
- Initially concentrate only on regular arrangements
  - N-dimensional mesh/torus, trees, SMP nodes
- Identify the topology from the host/switch information gathered
  - How many unique 1st, 2nd, 3rd neighbors? → mesh/torus
  - Determine whether a tree is fully connected
- Eventually handle more irregular arrangements of nodes
  - Identify the general type of network
  - There may be an irregular arrangement of nodes on a mesh/torus
  - Identify which nodes are irregular
35. Performance Measurements
- Performance may be limited by many factors
- Measure the latency and maximum bandwidth for each network layer
- Measure the aggregate performance across switches or a given link
- Performance data can help determine the topology
  - A tree with fewer internal switches may still be a fat tree
- Feed the performance data to the Mapper along with the topological data
36. The Mapper
- NodeMap will initially be run from MPI_Cart_create( ..., reorder=1 )
  - The user must determine which direction requires the optimal passing
- The Mapper will take the topology and performance data and provide the best mapping of the desired 2D (or eventually 3D or tree) algorithm
  - Concentrate on regular arrangements first
  - Use Gray codes for mapping 2D onto N-dimensional topologies to guarantee that nearest neighbors are directly connected
- NodeMap may also be run from MPI_Init()
  - Provide optimal mappings for global operations (mainly binary trees)
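For reference, the standard MPI calls involved: a 2D Cartesian communicator created with reorder=1, followed by MPI_Cart_shift to find nearest neighbors. This is plain MPI usage, not the Mapper itself:

```c
/* Create a reorderable 2D Cartesian communicator and find neighbors. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int nprocs, myproc, dims[2] = {0, 0}, periods[2] = {1, 1};
    int reorder = 1;                 /* allow ranks to be remapped */
    int left, right, down, up, coords[2];
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Dims_create(nprocs, 2, dims);            /* choose a 2D grid */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &cart);

    MPI_Comm_rank(cart, &myproc);                /* rank may differ  */
    MPI_Cart_coords(cart, myproc, 2, coords);
    MPI_Cart_shift(cart, 0, 1, &left, &right);   /* X-direction      */
    MPI_Cart_shift(cart, 1, 1, &down, &up);      /* Y-direction      */

    printf("rank %d at (%d,%d): x-neighbors %d/%d, y-neighbors %d/%d\n",
           myproc, coords[0], coords[1], left, right, down, up);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```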
37. Conclusions
- Codes must be mapped to the network topology to scale well
  - Many codes can use a 2D decomposition
  - 2D algorithms can be mapped ideally to most network topologies
- NodeMap will provide automatic mapping to the topology
  - A portable means of taking advantage of the network topology
- Questions
  - How well will NodeMap handle irregular networks?
  - Will it be difficult to provide a reasonable mapping?
  - Can a generic MPI module effectively discover a topology?
  - If so, how quickly can this be done?
  - Will NodeMap need to generate static map files ahead of time?
38. Contact information
- Dave Turner - turner@ameslab.gov
- http://www.scl.ameslab.gov/Projects/MP_Lite/
- http://www.scl.ameslab.gov/Projects/NetPIPE/
39. Ames Lab Classical Molecular Dynamics code
Embedded-atom method, Lennard-Jones, and Tersoff potentials. Uses cubic spline functions for optimal performance. Local interactions only, typically 5-6 Å → 50-60 neighbors per atom.
2D decomposition of the 3D simulation space. Map neighboring regions to neighboring nodes to localize communications. Shift coordinates and accumulators to all nodes above and to the right to calculate all pair interactions within the cutoff range. Large systems require just 5 communications, while systems spread across more nodes may involve many more passes in a serpentine fashion around half the interaction range.
40. ALCMD scaling on the IBM SP
Proper mapping of the columns to SMP nodes helps greatly. Parallel efficiency goes from 50% to 70% for 10,000,000 atoms on 1024 processors. Scaling beyond 1024 processors will be difficult: on-node and off-node communications are saturated even at 1024 processors (16 x 16-way SMPs).
41. 2D decomposition of algorithms
- Many codes can be naturally decomposed onto a 2D mesh
  - Atomistic codes: classical and order-N tight-binding
  - Many matrix operations
  - Finite difference and finite element codes
  - Grid and multi-grid approaches
- 2D decompositions map well to most network topologies
  - 2D and 3D meshes and toruses, hypercubes, trees, fat trees
  - Direct connections to nearest-neighbor nodes prevent contention
- Writing algorithms using a 2D decomposition can provide the initial step to taking advantage of the network topology
42. Point-to-point Performance
InfiniBand can deliver 4500 - 6500 Mbps at a 7.5
us latency. Atoll delivers 1890 Mbps with a 4.7
us latency. SCI delivers 1840 Mbps with only a
4.2 us latency. Myrinet performance reaches 1820
Mbps with an 8 us latency. Channel-bonded GigE
offers 1800 Mbps for very large messages. Gigabit
Ethernet delivers 900 Mbps with a 25-62
us latency. 10 GigE only delivers 2 Gbps with a
75 us latency.
43. Evaluating network performance using ping-pong tests
- Bounce messages back and forth between two processes to measure the performance
- Messages are bounced many times for each test to provide an accurate timing
- The latency is the minimum time needed to send a small message (1/2 the round-trip time)
- The throughput is the communication rate
- Start with a small message size (1 byte) and increase exponentially up to 8 MB
- Use perturbations from the factors of 2 to fully test the system
[Diagram: nodes n0-n3 connected through a switch]
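A minimal MPI ping-pong sketch of the measurement described above, reporting half the round-trip time as the latency and the corresponding throughput for a single message size (NetPIPE itself sweeps the sizes and applies perturbations around the factors of 2):

```c
/* Minimal two-process ping-pong: rank 0 bounces a message off rank 1. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int nbytes = 1024, reps = 1000;
    int myproc, r;
    char *buf;
    double t0, t;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myproc);
    buf = malloc(nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (r = 0; r < reps; r++) {
        if (myproc == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (myproc == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t = MPI_Wtime() - t0;

    /* throughput uses the one-way time, i.e. half the round-trip time */
    if (myproc == 0)
        printf("%d bytes: latency %.2f us, throughput %.1f Mbps\n",
               nbytes, 0.5e6 * t / reps,
               8.0 * nbytes * reps / (0.5 * t) / 1.0e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```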
44. Using NetPIPE 4.x
mpirun -np nprocs -hostfile nphosts NPmpi [NetPIPE options]
- The -H hostfile option to NPmpi can be used to change the order of the pairings
- The nplaunch script launches the other NetPIPE executables
- The default host file name is nphosts
  - It lists the hosts with the first and last communicating, the 2nd and 2nd-to-last, etc.
nplaunch NPtcp [NetPIPE options]
- For aggregate measurements, use -B for bidirectional mode
- For NPmpi, -a may be needed for asynchronous sends and receives