Title: Locality-aware Connection Management and Rank Assignment for Wide-area MPI
1. Locality-aware Connection Management and Rank Assignment for Wide-area MPI
- Hideo Saito, Kenjiro Taura
- The University of Tokyo
- May 16, 2007
2. Background
- Increase in the bandwidth of WANs
- → More opportunities to perform parallel computation using multiple clusters
[Figure: multiple clusters connected over a WAN]
3. Requirements for Wide-area MPI
- Wide-area connectivity
- Firewalls and private addresses
- Only some nodes can connect to each other
- Perform routing using the connections that happen to be possible
[Figure: nodes behind a NAT and a firewall]
4. Reqs. for Wide-area MPI (2)
- Scalability
- The number of conns. must be limited in order to scale to thousands of nodes
- Various allocation limits of the system (e.g., memory, file descriptors, router sessions)
- Simplistic schemes that may potentially result in O(n^2) connections won't scale
- Lazy connect strategies work for many apps, but not for those that involve all-to-all communication
5. Reqs. for Wide-area MPI (3)
- Locality awareness
- To achieve high performance with few conns., select conns. in a locality-aware manner
- Many connections with nearby nodes, few connections with faraway nodes
[Figure: few conns. between clusters, many conns. within a cluster]
6. Reqs. for Wide-area MPI (4)
- Application awareness
- Select connections according to the application's communication pattern
- Assign ranks according to the application's communication pattern
- Adaptivity
- Automatically, without tedious manual configuration
(rank = process ID in MPI)
7. Contributions of Our Work
- Locality-aware connection management
- Uses latency and traffic information obtained from a short profiling run
- Locality-aware rank assignment
- Uses the same info. to discover rank-process mappings with low comm. overhead
- → Multi-Cluster MPI (MC-MPI)
- Wide-area-enabled MPI library
8. Outline
- Introduction
- Related Work
- Proposed Method
  - Profiling Run
  - Connection Management
  - Rank Assignment
- Experimental Results
- Conclusion
9. Grid-enabled MPI Libraries
- MPICH-G2 [Karonis et al. '03], MagPIe [Kielmann et al. '99]
- Locality-aware communication optimizations
- E.g., wide-area-aware collective operations (broadcast, reduction, ...)
- Doesn't work with firewalls
10. Grid-enabled MPI Libraries (cont'd)
- MPICH/MADIII [Aumage et al. '03], StaMPI [Imamura et al. '00]
- Forwarding mechanisms that allow nodes to communicate even in the presence of FWs
- Manual configuration
- Amount of necessary config. becomes overwhelming as more resources are used
[Figure: message forwarded around a firewall]
11. P2P Overlays
- Pastry [Rowstron et al. '00]
- Each node maintains just O(log n) connections
- Messages are routed using those connections
- Highly scalable, but routing properties are unfavorable for high-performance computing
- Few connections between nearby nodes
- Messages between nearby nodes need to be forwarded, causing large latency penalties
12. Adaptive MPI
- [Huang et al. '06]
- Performs load balancing by migrating virtual processors
- Balance the exec. times of the physical processors
- Minimize inter-processor communication
- Adapts to apps. by tracking the amount of communication performed between procs.
- Assumes that the communication cost of every processor pair is the same
- MC-MPI takes differences in communication costs into account
[Figure: virtual processors mapped onto physical processors]
13. Lazy Connect Strategies
- MPICH [Gropp et al. '96], Scalable MPI over InfiniBand [Yu et al. '06]
- Establish connections only on demand
- Reduces the number of conns. if each proc. only communicates with a few other procs.
- Some apps. generate all-to-all comm. patterns, resulting in many connections
- E.g., IS in the NAS Parallel Benchmarks
- Doesn't extend to wide-area environments where some communication may be blocked
14. Outline
- Introduction
- Related Work
- Proposed Method
  - Profiling Run
  - Connection Management
  - Rank Assignment
- Experimental Results
- Conclusion
15. Overview of Our Method
Short Profiling Run
- Latency matrix (L)
- Traffic matrix (T)
↓
Optimized Real Run
- Locality-aware connection management
- Locality-aware rank assignment
16. Outline
- Introduction
- Related Work
- Proposed Method
  - Profiling Run
  - Connection Management
  - Rank Assignment
- Experimental Results
- Conclusion
17. Latency Matrix
- Latency matrix L = (l_ij)
- l_ij = latency between processes i and j in the target environment
- Each process autonomously measures the RTT between itself and other processes
- Reduce the num. of measurements by using the triangle inequality to estimate RTTs (sketched below)
- If rtt_pr > a * rtt_rq for some already-measured node r (a: a constant), estimate rtt_pq ≈ rtt_pr
[Figure: triangle of nodes p, q, r with edges rtt_pq, rtt_pr, rtt_rq]
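
A minimal Python sketch of this estimation rule; the measured-RTT table layout and the constant `a` are illustrative assumptions, not MC-MPI's actual code:

```python
def estimate_rtts(measured, processes, a=2.0):
    """measured: dict {(p, q): rtt} of RTTs that were actually probed.
    Returns a fuller RTT table, estimating unmeasured pairs via a relay r:
    if rtt(p, r) > a * rtt(r, q), take rtt(p, q) to be close to rtt(p, r)."""
    rtt = dict(measured)
    for (p, q), v in list(rtt.items()):  # RTTs are symmetric
        rtt[(q, p)] = v
    for p in processes:
        for q in processes:
            if p == q or (p, q) in rtt:
                continue
            for r in processes:
                pr, rq = rtt.get((p, r)), rtt.get((r, q))
                if pr is not None and rq is not None and pr > a * rq:
                    # r is much closer to q than to p, so the p-q distance
                    # is dominated by the p-r distance
                    rtt[(p, q)] = rtt[(q, p)] = pr
                    break
    return rtt
```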
18. Traffic Matrix
- Traffic matrix T = (t_ij)
- t_ij = traffic between ranks i and j in the target application
- Many applications repeat similar communication patterns
- → Execute the application for a short amount of time and make t_ij the number of transmitted messages (see the sketch below)
- (E.g., one iteration of an iterative app.)
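
A minimal sketch of such a profiling counter, assuming a hypothetical `profile_send` hook wrapped around the library's send path (not MC-MPI's actual API):

```python
import numpy as np

def make_traffic_counter(n_ranks):
    """Returns the traffic matrix T and a hook that counts messages into it."""
    T = np.zeros((n_ranks, n_ranks), dtype=np.int64)

    def profile_send(src, dst):
        # Count one transmitted message from rank `src` to rank `dst`.
        T[src, dst] += 1

    return T, profile_send

# Usage: run e.g. one iteration of the app with the hook installed, then use T.
T, profile_send = make_traffic_counter(4)
profile_send(0, 1); profile_send(1, 0); profile_send(0, 1)
print(T[0, 1])  # -> 2
```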
19. Outline
- Introduction
- Related Work
- Proposed Method
  - Profiling Run
  - Connection Management
  - Rank Assignment
- Experimental Results
- Conclusion
20Connection Management
Establishcandidateconnectionson demand
Candidate connections
Bounding Graph
Spanning Tree
Lazy Connection Establishment
Application Body
MPI_Init
21. Selection of Candidate Connections
- Each process selects O(log n) neighbors based on L and T (see the sketch below)
- λ: parameter that controls connection density
- n: number of processes
[Figure: neighbor groups selected with density λ, λ/2, λ/4, ... for successively farther processes]
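
A minimal sketch of one way to realize this selection, under the assumption of geometrically growing distance groups with halving selection density (the exact grouping in MC-MPI may differ, and for brevity only L is used here, although the slide says T is consulted as well):

```python
def select_candidates(me, latency_row, lam=8):
    """latency_row: dict {peer: rtt} for this process.
    Picks lam peers from the nearest group, lam/2 from the next (larger)
    group, lam/4 from the next, ... giving O(lam + log n) candidates."""
    peers = sorted((p for p in latency_row if p != me),
                   key=lambda p: latency_row[p])
    candidates, start, size, take = [], 0, lam, lam
    while start < len(peers):
        group = peers[start:start + size]
        candidates.extend(group[:take])   # keep only `take` from this group
        start += size
        size *= 2                         # groups grow with distance...
        take = max(1, take // 2)          # ...while selection density halves
    return candidates
```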
22. Bounding Graph
- Procs. try to establish temporary conns. to their selected neighbors (see the sketch below)
- The collective set of successful connections
- → Bounding graph
- (Some conns. may fail due to FWs)
[Figure: bounding graph formed by the successful connections]
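
A minimal sketch of the probing step, using plain TCP connects with a timeout as a stand-in for MC-MPI's actual handshake; hosts and ports are illustrative:

```python
import socket

def probe_candidates(candidates, timeout=2.0):
    """candidates: list of (host, port). Returns the edges that succeeded;
    the union of every process's successful edges is the bounding graph."""
    reachable = []
    for host, port in candidates:
        try:
            # Temporary connection: opened only to test reachability,
            # then closed again (real connections are established lazily).
            with socket.create_connection((host, port), timeout=timeout):
                reachable.append((host, port))
        except OSError:
            pass  # blocked by a FW/NAT or simply down: not part of the graph
    return reachable
```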
23. Routing Table Construction
- Construct a routing table using just the bounding graph (see the sketch below)
- Close the temporary connections
- Conns. of the bounding graph are reestablished lazily as real conns.
- Temporary conns. → small bufs.
- Real conns. → large bufs.
[Figure: routing table built over the bounding graph]
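
A minimal sketch of building a next-hop table over the bounding graph; plain BFS is used here for brevity, whereas a real implementation might run a latency-weighted shortest path instead:

```python
from collections import deque

def routing_table(me, edges, nodes):
    """edges: set of undirected (u, v) pairs in the bounding graph.
    Returns {destination: next_hop} from the viewpoint of `me`."""
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    parent, frontier = {me: None}, deque([me])
    while frontier:                       # BFS from `me`
        u = frontier.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                frontier.append(v)
    next_hop = {}
    for dst in nodes:
        if dst == me or dst not in parent:
            continue                      # unreachable even via forwarding
        hop = dst
        while parent[hop] != me:          # walk back toward `me`
            hop = parent[hop]
        next_hop[dst] = hop               # first edge on the path to dst
    return next_hop
```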
24. Lazy Connection Establishment
[Figure: bounding-graph connections established on demand during the run; connections blocked by a FW are routed around]
25. Outline
- Introduction
- Related Work
- Proposed Method
  - Profiling Run
  - Connection Management
  - Rank Assignment
- Experimental Results
- Conclusion
26. Commonly-used Method
- Sort the processes by host name (or IP address) and assign ranks in that order (see the sketch below)
- Assumptions
- Most communication takes place between processes with close ranks
- The communication cost between processes with close host names is low
- However,
- Applications have various comm. patterns
- Host names don't necessarily have a correlation to communication costs
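
For concreteness, a minimal sketch of this hostname-based scheme; the host names below are illustrative (echoing the clusters used later in the experiments):

```python
def hostname_ranks(hosts):
    """hosts: list of (process_id, hostname). Returns {process_id: rank},
    assigning ranks in lexicographic hostname order."""
    ordered = sorted(hosts, key=lambda ph: ph[1])
    return {pid: rank for rank, (pid, _) in enumerate(ordered)}

print(hostname_ranks([(7, "sheep03"), (2, "chiba001"), (5, "hongo012")]))
# -> {2: 0, 5: 1, 7: 2}  (chiba001 < hongo012 < sheep03 lexicographically)
```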
27. Our Rank Assignment Scheme
- Find a rank-process mapping with low communication overhead
- Map the rank assignment problem to the Quadratic Assignment Problem (QAP)
- Given two n x n cost matrices, L and T, find a permutation p of 0, 1, ..., n-1 that minimizes (see the sketch below)

  cost(p) = Σ_{i,j} t_{ij} · l_{p(i),p(j)}
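
The objective can be evaluated directly; a minimal NumPy sketch, where `p` is a candidate rank-to-process permutation:

```python
import numpy as np

def qap_cost(p, L, T):
    """p: permutation as an int array (rank i -> process p[i]).
    L: n x n latency matrix; T: n x n traffic matrix.
    Returns sum over i, j of T[i, j] * L[p[i], p[j]]."""
    p = np.asarray(p)
    return float((L[np.ix_(p, p)] * T).sum())

rng = np.random.default_rng(0)
n = 8
L, T = rng.random((n, n)), rng.random((n, n))
print(qap_cost(np.arange(n), L, T))  # cost of the identity assignment
```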
28. Solving QAPs
- NP-hard, but there are heuristics for finding good suboptimal solutions (a simplified sketch follows below)
- Library based on GRASP [Resende et al. '96]
- Tested against QAPLIB [Burkard et al. '97]
- Instances of up to n = 256
- n processors for problem size n
- Approximate solutions that were within one to two percent of the best known solution in under one second
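
GRASP itself interleaves a greedy randomized construction with local search; as a much simpler stand-in (not the GRASP library the slide refers to), here is the pairwise-swap local search such heuristics typically use:

```python
import numpy as np

def qap_cost(p, L, T):  # same objective as in the previous sketch
    p = np.asarray(p)
    return float((L[np.ix_(p, p)] * T).sum())

def swap_local_search(p, L, T, max_rounds=50):
    """Repeatedly tries swapping the processes assigned to two ranks,
    keeping any swap that lowers the QAP cost, until a local optimum."""
    p = np.array(p)
    best = qap_cost(p, L, T)
    for _ in range(max_rounds):
        improved = False
        for i in range(len(p)):
            for j in range(i + 1, len(p)):
                p[i], p[j] = p[j], p[i]        # try swapping two ranks
                c = qap_cost(p, L, T)
                if c < best:
                    best, improved = c, True   # keep the improving swap
                else:
                    p[i], p[j] = p[j], p[i]    # undo
        if not improved:
            break  # local optimum reached
    return p, best
```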
29. Outline
- Introduction
- Related Work
- Profiling Run
- Connection Management
- Rank Assignment
- Experimental Results
- Conclusion
30. Experimental Environment
[Figure: four clusters of 64 nodes each (sheepXX, chibaXXX, istbsXXX, hongoXXX); inter-cluster RTTs of 10.8, 6.9, 6.8, 4.4, 4.3, and 0.3 ms between the six cluster pairs; hongoXXX is behind a FW]
- Xeon/Pentium M
- Linux
- Intra-cluster RTT: 60-120 microsecs
- TCP send/recv bufs: 256KB ea.
31. Experiment 1: Conn. Management
- Measure the performance of the NPB with limited numbers of connections
- MC-MPI
- Limit the number of connections to 10%, 20%, ..., 100% by varying λ
- Random
- Establish a comparable number of connections randomly
32. BT, LU, MG and SP
[Plots: SOR (Successive Over-Relaxation), LU (Lower-Upper)]
33. BT, LU, MG and SP (2)
[Plots: MG (Multi-Grid), BT (Block Tridiagonal)]
34. BT, LU, MG and SP (3)
- The number of connections actually established was lower than that shown by the x-axis
- B/c of lazy connection establishment
- To be discussed in more detail later
[Plot: SP (Scalar Pentadiagonal)]
35. EP
- EP involves very little communication
[Plot: EP (Embarrassingly Parallel)]
36. IS
- Performance decrease due to congestion!
[Plot: IS (Integer Sort)]
37. Experiment 2: Lazy Conn. Establish.
- Compare our lazy conn. establishment method with an MPICH-like method
- MC-MPI
- Select λ so that the maximum number of allowed connections is 30%
- MPICH-like
- Establish connections on demand without preselecting candidate connections (we can also say that we preselect all connections)
38. Experiment 2: Results
[Plots: connections established and performance, MC-MPI vs. the MPICH-like method]
- Comparable number of conns. except for IS
- Comparable performance except for IS
39. Experiment 3: Rank Assignment
- Compare 3 assignment algorithms
- Random
- Hostname (24 patterns)
- Real host names (1)
- What if istbsXXX were named sheepXX, etc. (23)
- MC-MPI (QAP)
[Figure: the four clusters chibaXXX, sheepXX, hongoXXX, istbsXXX]
40. LU and MG
[Plots: LU and MG performance for Hostname (Best/Worst), Random, and MC-MPI (QAP)]
41. BT and SP
[Plots: BT and SP performance for Hostname (Best/Worst), Random, and MC-MPI (QAP)]
42. BT and SP (cont'd)
[Figure: source vs. destination rank traffic maps under the Hostname and MC-MPI (QAP) assignments, with ranks grouped into clusters A-D]
43. EP and IS
[Plots: EP and IS performance for Hostname (Best/Worst), Random, and MC-MPI (QAP)]
44. Outline
- Introduction
- Related Work
- Profiling Run
- Connection Management
- Rank Assignment
- Experimental Results
- Conclusion
45. Conclusion
- MC-MPI
- Connection management
- High performance with connections between just 10% of all process pairs
- Rank assignment
- Up to 300% faster than locality-unaware assignments
- Future Work
- An API to perform profiling w/in a single run
- Integration of adaptive collectives