Title: Locality-aware Connection Management and Rank Assignment for Wide-area MPI
1. Locality-aware Connection Management and Rank Assignment for Wide-area MPI
- Hideo Saito, Kenjiro Taura
- The University of Tokyo
- May 16, 2007
2. Background
- Increase in the bandwidth of WANs
- → More opportunities to perform parallel computation using multiple clusters
[Figure: multiple clusters connected over a WAN]
3. Requirements for Wide-area MPI
- Wide-area connectivity
- Firewalls and private addresses
- Only some nodes can connect to each other
- Perform routing using the connections that happen to be possible
[Figure: nodes behind a NAT and a firewall]
4. Reqs. for Wide-area MPI (2)
- Scalability
- The number of conns. must be limited in order to scale to thousands of nodes
- Various allocation limits of the system (e.g., memory, file descriptors, router sessions)
- Simplistic schemes that may potentially result in O(n^2) connections won't scale
- Lazy connect strategies work for many apps, but not for those that involve all-to-all communication
5. Reqs. for Wide-area MPI (3)
- Locality awareness
- To achieve high performance with few conns., select conns. in a locality-aware manner
- Many connections with nearby nodes, few connections with faraway nodes
[Figure: few conns. between clusters, many conns. within a cluster]
6. Reqs. for Wide-area MPI (4)
- Application awareness
- Select connections according to the application's communication pattern
- Assign ranks according to the application's communication pattern
- Adaptivity
- Automatically, without tedious manual configuration
(rank = process ID in MPI)
7. Contributions of Our Work
- Locality-aware connection management
- Uses latency and traffic information obtained from a short profiling run
- Locality-aware rank assignment
- Uses the same info. to discover rank-process mappings with low comm. overhead
- → Multi-Cluster MPI (MC-MPI)
- Wide-area-enabled MPI library
8. Outline
- Introduction
- Related Work
- Proposed Method
  - Profiling Run
  - Connection Management
  - Rank Assignment
- Experimental Results
- Conclusion
9. Grid-enabled MPI Libraries
- MPICH-G2 [Karonis et al. '03], MagPIe [Kielmann et al. '99]
- Locality-aware communication optimizations
- E.g., wide-area-aware collective operations (broadcast, reduction, ...)
- Doesn't work with firewalls
10. Grid-enabled MPI Libraries (cont'd)
- MPICH/MADIII [Aumage et al. '03], StaMPI [Imamura et al. '00]
- Forwarding mechanisms that allow nodes to communicate even in the presence of FWs
- Manual configuration
- Amount of necessary config. becomes overwhelming as more resources are used
[Figure: message forwarded around a firewall]
11. P2P Overlays
- Pastry [Rowstron et al. '00]
- Each node maintains just O(log n) connections
- Messages are routed using those connections
- Highly scalable, but routing properties are unfavorable for high-performance computing
- Few connections between nearby nodes
- Messages between nearby nodes need to be forwarded, causing large latency penalties
12. Adaptive MPI
- [Huang et al. '06]
- Performs load balancing by migrating virtual processors
- Balance the exec. times of the physical processors
- Minimize inter-processor communication
- Adapts to apps. by tracking the amount of communication performed between procs.
- Assumes that the communication cost of every processor pair is the same
- MC-MPI takes differences in communication costs into account
[Figure: virtual processors mapped onto physical processors]
13. Lazy Connect Strategies
- MPICH [Gropp et al. '96], Scalable MPI over InfiniBand [Yu et al. '06]
- Establish connections only on demand
- Reduces the number of conns. if each proc. only communicates with a few other procs.
- Some apps. generate all-to-all comm. patterns, resulting in many connections
- E.g., IS in the NAS Parallel Benchmarks
- Doesn't extend to wide-area environments where some communication may be blocked
14. Outline
- Introduction
- Related Work
- Proposed Method
  - Profiling Run
  - Connection Management
  - Rank Assignment
- Experimental Results
- Conclusion
15. Overview of Our Method
Short Profiling Run
- Latency matrix (L)
- Traffic matrix (T)
↓
Optimized Real Run
- Locality-aware connection management
- Locality-aware rank assignment
16. Outline
- Introduction
- Related Work
- Proposed Method
  - Profiling Run
  - Connection Management
  - Rank Assignment
- Experimental Results
- Conclusion
17. Latency Matrix
- Latency matrix L = (l_ij)
- l_ij = latency between processes i and j in the target environment
- Each process autonomously measures the RTT between itself and other processes
- Reduce the num. of measurements by using the triangle inequality to estimate RTTs (sketched below)
- If rtt_pr > a * rtt_rq for some already-measured node r (a: a constant), estimate rtt_pq ≈ rtt_pr
[Figure: triangle of nodes p, q, r with edges rtt_pq, rtt_pr, rtt_rq]
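
A minimal Python sketch of this estimation rule; the measured-RTT table layout and the constant `a` are illustrative assumptions, not MC-MPI's actual code:

```python
def estimate_rtts(measured, processes, a=2.0):
    """measured: dict {(p, q): rtt} of RTTs that were actually probed.
    Returns a fuller RTT table, estimating unmeasured pairs via a relay r:
    if rtt(p, r) > a * rtt(r, q), take rtt(p, q) to be close to rtt(p, r)."""
    rtt = dict(measured)
    for (p, q), v in list(rtt.items()):  # RTTs are symmetric
        rtt[(q, p)] = v
    for p in processes:
        for q in processes:
            if p == q or (p, q) in rtt:
                continue
            for r in processes:
                pr, rq = rtt.get((p, r)), rtt.get((r, q))
                if pr is not None and rq is not None and pr > a * rq:
                    # r is much closer to q than to p, so the p-q distance
                    # is dominated by the p-r distance
                    rtt[(p, q)] = rtt[(q, p)] = pr
                    break
    return rtt
```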
18. Traffic Matrix
- Traffic matrix T = (t_ij)
- t_ij = traffic between ranks i and j in the target application
- Many applications repeat similar communication patterns
- → Execute the application for a short amount of time and make t_ij the number of transmitted messages (see the sketch below)
- (E.g., one iteration of an iterative app.)
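
A minimal sketch of such a profiling counter, assuming a hypothetical `profile_send` hook wrapped around the library's send path (not MC-MPI's actual API):

```python
import numpy as np

def make_traffic_counter(n_ranks):
    """Returns the traffic matrix T and a hook that counts messages into it."""
    T = np.zeros((n_ranks, n_ranks), dtype=np.int64)

    def profile_send(src, dst):
        # Count one transmitted message from rank `src` to rank `dst`.
        T[src, dst] += 1

    return T, profile_send

# Usage: run e.g. one iteration of the app with the hook installed, then use T.
T, profile_send = make_traffic_counter(4)
profile_send(0, 1); profile_send(1, 0); profile_send(0, 1)
print(T[0, 1])  # -> 2
```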
19. Outline
- Introduction
- Related Work
- Proposed Method
  - Profiling Run
  - Connection Management
  - Rank Assignment
- Experimental Results
- Conclusion
20Connection Management
Establishcandidateconnectionson demand
Candidate connections
Bounding Graph
Spanning Tree
Lazy Connection Establishment
Application Body
MPI_Init
21. Selection of Candidate Connections
- Each process selects O(log n) neighbors based on L and T (see the sketch below)
- λ: parameter that controls connection density
- n: number of processes
[Figure: neighbor groups selected with density λ, λ/2, λ/4, ... for successively farther processes]
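
A minimal sketch of one way to realize this selection, under the assumption of geometrically growing distance groups with halving selection density (the exact grouping in MC-MPI may differ, and for brevity only L is used here, although the slide says T is consulted as well):

```python
def select_candidates(me, latency_row, lam=8):
    """latency_row: dict {peer: rtt} for this process.
    Picks lam peers from the nearest group, lam/2 from the next (larger)
    group, lam/4 from the next, ... giving O(lam + log n) candidates."""
    peers = sorted((p for p in latency_row if p != me),
                   key=lambda p: latency_row[p])
    candidates, start, size, take = [], 0, lam, lam
    while start < len(peers):
        group = peers[start:start + size]
        candidates.extend(group[:take])   # keep only `take` from this group
        start += size
        size *= 2                         # groups grow with distance...
        take = max(1, take // 2)          # ...while selection density halves
    return candidates
```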
22. Bounding Graph
- Procs. try to establish temporary conns. to their selected neighbors (see the sketch below)
- The collective set of successful connections
- → Bounding graph
- (Some conns. may fail due to FWs)
[Figure: bounding graph formed by the successful connections]
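
A minimal sketch of the probing step, using plain TCP connects with a timeout as a stand-in for MC-MPI's actual handshake; hosts and ports are illustrative:

```python
import socket

def probe_candidates(candidates, timeout=2.0):
    """candidates: list of (host, port). Returns the edges that succeeded;
    the union of every process's successful edges is the bounding graph."""
    reachable = []
    for host, port in candidates:
        try:
            # Temporary connection: opened only to test reachability,
            # then closed again (real connections are established lazily).
            with socket.create_connection((host, port), timeout=timeout):
                reachable.append((host, port))
        except OSError:
            pass  # blocked by a FW/NAT or simply down: not part of the graph
    return reachable
```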
23. Routing Table Construction
- Construct a routing table using just the bounding graph (see the sketch below)
- Close the temporary connections
- Conns. of the bounding graph are reestablished lazily as real conns.
- Temporary conns. → small bufs.
- Real conns. → large bufs.
[Figure: routing table built over the bounding graph]
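
A minimal sketch of building a next-hop table over the bounding graph; plain BFS is used here for brevity, whereas a real implementation might run a latency-weighted shortest path instead:

```python
from collections import deque

def routing_table(me, edges, nodes):
    """edges: set of undirected (u, v) pairs in the bounding graph.
    Returns {destination: next_hop} from the viewpoint of `me`."""
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    parent, frontier = {me: None}, deque([me])
    while frontier:                       # BFS from `me`
        u = frontier.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                frontier.append(v)
    next_hop = {}
    for dst in nodes:
        if dst == me or dst not in parent:
            continue                      # unreachable even via forwarding
        hop = dst
        while parent[hop] != me:          # walk back toward `me`
            hop = parent[hop]
        next_hop[dst] = hop               # first edge on the path to dst
    return next_hop
```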
24. Lazy Connection Establishment
[Figure: bounding-graph connections established on demand during the run; connections blocked by a FW are routed around]
25. Outline
- Introduction
- Related Work
- Proposed Method
  - Profiling Run
  - Connection Management
  - Rank Assignment
- Experimental Results
- Conclusion
26. Commonly-used Method
- Sort the processes by host name (or IP address) and assign ranks in that order (see the sketch below)
- Assumptions
- Most communication takes place between processes with close ranks
- The communication cost between processes with close host names is low
- However,
- Applications have various comm. patterns
- Host names don't necessarily have a correlation to communication costs
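
For concreteness, a minimal sketch of this hostname-based scheme; the host names below are illustrative (echoing the clusters used later in the experiments):

```python
def hostname_ranks(hosts):
    """hosts: list of (process_id, hostname). Returns {process_id: rank},
    assigning ranks in lexicographic hostname order."""
    ordered = sorted(hosts, key=lambda ph: ph[1])
    return {pid: rank for rank, (pid, _) in enumerate(ordered)}

print(hostname_ranks([(7, "sheep03"), (2, "chiba001"), (5, "hongo012")]))
# -> {2: 0, 5: 1, 7: 2}  (chiba001 < hongo012 < sheep03 lexicographically)
```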
27. Our Rank Assignment Scheme
- Find a rank-process mapping with low communication overhead
- Map the rank assignment problem to the Quadratic Assignment Problem (QAP)
- Given two n x n cost matrices, L and T, find a permutation p of 0, 1, ..., n-1 that minimizes (see the sketch below)

  cost(p) = Σ_{i,j} t_{ij} · l_{p(i),p(j)}
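
The objective can be evaluated directly; a minimal NumPy sketch, where `p` is a candidate rank-to-process permutation:

```python
import numpy as np

def qap_cost(p, L, T):
    """p: permutation as an int array (rank i -> process p[i]).
    L: n x n latency matrix; T: n x n traffic matrix.
    Returns sum over i, j of T[i, j] * L[p[i], p[j]]."""
    p = np.asarray(p)
    return float((L[np.ix_(p, p)] * T).sum())

rng = np.random.default_rng(0)
n = 8
L, T = rng.random((n, n)), rng.random((n, n))
print(qap_cost(np.arange(n), L, T))  # cost of the identity assignment
```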
28. Solving QAPs
- NP-hard, but there are heuristics for finding good suboptimal solutions (a simplified sketch follows below)
- Library based on GRASP [Resende et al. '96]
- Tested against QAPLIB [Burkard et al. '97]
- Instances of up to n = 256
- n processors for problem size n
- Approximate solutions that were within one to two percent of the best known solution in under one second
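
GRASP itself interleaves a greedy randomized construction with local search; as a much simpler stand-in (not the GRASP library the slide refers to), here is the pairwise-swap local search such heuristics typically use:

```python
import numpy as np

def qap_cost(p, L, T):  # same objective as in the previous sketch
    p = np.asarray(p)
    return float((L[np.ix_(p, p)] * T).sum())

def swap_local_search(p, L, T, max_rounds=50):
    """Repeatedly tries swapping the processes assigned to two ranks,
    keeping any swap that lowers the QAP cost, until a local optimum."""
    p = np.array(p)
    best = qap_cost(p, L, T)
    for _ in range(max_rounds):
        improved = False
        for i in range(len(p)):
            for j in range(i + 1, len(p)):
                p[i], p[j] = p[j], p[i]        # try swapping two ranks
                c = qap_cost(p, L, T)
                if c < best:
                    best, improved = c, True   # keep the improving swap
                else:
                    p[i], p[j] = p[j], p[i]    # undo
        if not improved:
            break  # local optimum reached
    return p, best
```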
29. Outline
- Introduction
- Related Work
- Profiling Run
- Connection Management
- Rank Assignment
- Experimental Results
- Conclusion
30. Experimental Environment
[Figure: four clusters of 64 nodes each (sheepXX, chibaXXX, istbsXXX, hongoXXX); inter-cluster RTTs of 10.8, 6.9, 6.8, 4.4, 4.3, and 0.3 ms between the six cluster pairs; hongoXXX is behind a FW]
- Xeon/Pentium M
- Linux
- Intra-cluster RTT: 60-120 microsecs
- TCP send/recv bufs: 256KB ea.
31. Experiment 1: Conn. Management
- Measure the performance of the NPB with limited numbers of connections
- MC-MPI
- Limit the number of connections to 10%, 20%, ..., 100% by varying λ
- Random
- Establish a comparable number of connections randomly
32. BT, LU, MG and SP
[Plots: SOR (Successive Over-Relaxation), LU (Lower-Upper)]
33. BT, LU, MG and SP (2)
[Plots: MG (Multi-Grid), BT (Block Tridiagonal)]
34. BT, LU, MG and SP (3)
- The number of connections actually established was lower than that shown by the x-axis
- B/c of lazy connection establishment
- To be discussed in more detail later
[Plot: SP (Scalar Pentadiagonal)]
35. EP
- EP involves very little communication
[Plot: EP (Embarrassingly Parallel)]
36. IS
- Performance decrease due to congestion!
[Plot: IS (Integer Sort)]
37. Experiment 2: Lazy Conn. Establish.
- Compare our lazy conn. establishment method with an MPICH-like method
- MC-MPI
- Select λ so that the maximum number of allowed connections is 30%
- MPICH-like
- Establish connections on demand without preselecting candidate connections (we can also say that we preselect all connections)
38. Experiment 2: Results
[Plots: connections established and performance, MC-MPI vs. the MPICH-like method]
- Comparable number of conns. except for IS
- Comparable performance except for IS
39. Experiment 3: Rank Assignment
- Compare 3 assignment algorithms
- Random
- Hostname (24 patterns)
- Real host names (1)
- What if istbsXXX were named sheepXX, etc. (23)
- MC-MPI (QAP)
[Figure: the four clusters chibaXXX, sheepXX, hongoXXX, istbsXXX]
40. LU and MG
[Plots: LU and MG performance for Hostname (Best/Worst), Random, and MC-MPI (QAP)]
41. BT and SP
[Plots: BT and SP performance for Hostname (Best/Worst), Random, and MC-MPI (QAP)]
42. BT and SP (cont'd)
[Figure: source vs. destination rank traffic maps under the Hostname and MC-MPI (QAP) assignments, with ranks grouped into clusters A-D]
43. EP and IS
[Plots: EP and IS performance for Hostname (Best/Worst), Random, and MC-MPI (QAP)]
44. Outline
- Introduction
- Related Work
- Profiling Run
- Connection Management
- Rank Assignment
- Experimental Results
- Conclusion
45. Conclusion
- MC-MPI
- Connection management
- High performance with connections between just 10% of all process pairs
- Rank assignment
- Up to 300% faster than locality-unaware assignments
- Future Work
- An API to perform profiling w/in a single run
- Integration of adaptive collectives