Locality-aware Connection Management and Rank Assignment for Wide-area MPI

1
Locality-aware Connection Management and Rank
Assignment for Wide-area MPI
  • Hideo Saito, Kenjiro Taura
  • The University of Tokyo
  • May 16, 2007

2
Background
  • Increase in the bandwidth of WANs
  • → More opportunities to perform parallel
    computation using multiple clusters

[Figure: multiple clusters connected over a WAN]
3
Requirements for Wide-area MPI
  • Wide-area connectivity
  • Firewalls and private addresses
  • Only some nodes can connect to each other
  • Perform routing using the connections that happen
    to be possible

[Figure: a NAT box and a firewall between clusters]
4
Reqs. for Wide-area MPI (2)
  • Scalability
  • The number of conns. must be limited in order to
    scale to thousands of nodes
  • Various allocation limits of the system (e.g.,
    memory, file descriptors, router sessions)
  • Simplistic schemes that may potentially result
    in O(n²) connections won't scale
  • Lazy connect strategies work for many apps, but
    not for those that involve all-to-all
    communication

5
Reqs. for Wide-area MPI (3)
  • Locality awareness
  • To achieve high performance with few conns,
    select conns. in a locality-aware manner
  • Many connections with nearby nodes, few
    connections with faraway nodes

Few conns. between clusters
Many conns. within a cluster
6
Reqs. for Wide-area MPI (4)
  • Application awareness
  • Select connections according to the application's
    communication pattern
  • Assign ranks according to the application's
    communication pattern
  • Adaptivity
  • Automatically, without tedious manual
    configuration

rank = process ID in MPI
7
Contributions of Our Work
  • Locality-aware connection management
  • Uses latency and traffic information obtained
    from a short profiling run
  • Locality-aware rank assignment
  • Uses the same info. to discover rank-process
    mappings with low comm. overhead
  • → Multi-Cluster MPI (MC-MPI)
  • Wide-area-enabled MPI library

8
Outline
  • Introduction
  • Related Work
  • Proposed Method
  • Profiling Run
  • Connection Management
  • Rank Assignment
  • Experimental Results
  • Conclusion

9
Grid-enabled MPI Libraries
  • MPICH-G2 [Karonis et al. '03], MagPIe [Kielmann
    et al. '99]
  • Locality-aware communication optimizations
  • E.g., wide-area-aware collective operations
    (broadcast, reduction, ...)
  • Doesn't work with firewalls

10
Grid-enabled MPI Libraries (contd)
  • MPICH/MADIII [Aumage et al. '03], StaMPI [Imamura
    et al. '00]
  • Forwarding mechanisms that allow nodes to
    communicate even in the presence of FWs
  • Manual configuration
  • Amount of necessary config. becomes overwhelming
    as more resources are used

[Figure: a message forwarded around a firewall]
11
P2P Overlays
  • Pastry [Rowstron et al. '00]
  • Each node maintains just O(log n) connections
  • Messages are routed using those connections
  • Highly scalable, but routing properties are
    unfavorable for high performance computing
  • Few connections between nearby nodes
  • Messages between nearby nodes need to be
    forwarded, causing large latency penalties

12
Adaptive MPI
[Figure: virtual processors mapped onto physical
processors]
  • [Huang et al. '06]
  • Performs load balancing by migrating virtual
    processors
  • Balance the exec. times of the physical
    processors
  • Minimize inter-processor communication
  • Adapts to apps. by tracking the amount of
    communication performed between procs.
  • Assumes that the communication cost of every
    processor pair is the same
  • MC-MPI takes differences in communication costs
    into account

13
Lazy Connect Strategies
  • MPICH [Gropp et al. '96], Scalable MPI over
    InfiniBand [Yu et al. '06]
  • Establish connections only on demand
  • Reduces the number of conns. if each proc. only
    communicates with a few other procs.
  • Some apps. generate all-to-all comm. patterns,
    resulting in many connections
  • E.g., IS in the NAS Parallel Benchmarks
  • Doesn't extend to wide-area environments where
    some communication may be blocked

14
Outline
  • Introduction
  • Related Work
  • Proposed Method
  • Profiling Run
  • Connection Management
  • Rank Assignment
  • Experimental Results
  • Conclusion

15
Overview of Our Method
Short Profiling Run
  • Latency matrix (L)
  • Traffic matrix (T)

Optimized Real Run
  • Locality-aware connection management
  • Locality-aware rank assignment

16
Outline
  • Introduction
  • Related Work
  • Proposed Method
  • Profiling Run
  • Connection Management
  • Rank Assignment
  • Experimental Results
  • Conclusion

17
Latency Matrix
  • Latency matrix L = (l_ij)
  • l_ij = latency between processes i and j in the
    target environment
  • Each process autonomously measures the RTT
    between itself and other processes
  • Reduce the num. of measurements by using the
    triangle inequality to estimate RTTs (sketched
    below)

[Figure: nodes p, q, r with RTTs rtt_pq, rtt_pr,
rtt_rq. If rtt_pr > a · rtt_rq, then estimate
rtt_pq ≈ rtt_pr (a: a constant)]
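
As an illustration of the shortcut above, here is a minimal Python
sketch. The function name, data layout, and the value a = 4.0 are
assumptions for illustration; the slide only fixes the rule itself.

    # Sketch of the triangle-inequality shortcut: if r is much closer
    # to q than to p, then p's RTT to q is roughly its RTT to r.
    def estimate_rtt(p, q, r, measured, a=4.0):
        """Estimate rtt(p, q) through intermediate node r, or None."""
        rtt_pr = measured.get((p, r))
        rtt_rq = measured.get((r, q))
        if rtt_pr is None or rtt_rq is None:
            return None          # not enough data; measure for real
        if rtt_pr > a * rtt_rq:
            return rtt_pr        # rtt_pq ~ rtt_pr by the rule above
        return None              # rule doesn't apply; measure for real

    measured = {("p", "r"): 10.8, ("r", "q"): 0.3}  # RTTs in ms
    print(estimate_rtt("p", "q", "r", measured))    # -> 10.8

Each skipped measurement saves one probe, which cuts the O(n²) probe
count substantially across n processes.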
18
Traffic Matrix
  • Traffic matrix T = (t_ij)
  • t_ij = traffic between ranks i and j in the target
    application
  • Many applications repeat similar communication
    patterns
  • → Execute the application for a short amount of
    time and make t_ij the number of transmitted
    messages
  • (E.g., one iteration of an iterative app.)
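
For concreteness, message counting during the profiling window might
look like the sketch below; the class and method names are
hypothetical, not MC-MPI's actual instrumentation hooks.

    from collections import defaultdict

    class TrafficProfiler:
        """Counts messages per (source rank, destination rank)."""
        def __init__(self, n_ranks):
            self.n = n_ranks
            self.counts = defaultdict(int)

        def record_send(self, src, dst):  # called on each profiled send
            self.counts[(src, dst)] += 1

        def matrix(self):
            """Return T as a dense matrix: t[i][j] = messages i -> j."""
            t = [[0] * self.n for _ in range(self.n)]
            for (i, j), c in self.counts.items():
                t[i][j] = c
            return t

    prof = TrafficProfiler(n_ranks=4)
    for _ in range(3):                    # e.g., one profiled iteration
        prof.record_send(0, 1)
    T = prof.matrix()                     # T[0][1] == 3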

19
Outline
  • Introduction
  • Related Work
  • Proposed Method
  • Profiling Run
  • Connection Management
  • Rank Assignment
  • Experimental Results
  • Conclusion

20
Connection Management
[Figure: timeline of connection management]
  • MPI_Init: candidate connections → bounding graph
    → spanning tree
  • Application body: lazy connection establishment
    (establish candidate connections on demand)
21
Selection of Candidate Connections
  • Each process selects O(log n) neighbors based on
    L and T (one possible rule is sketched below)
  • λ: parameter that controls connection density
  • n: number of processes

[Figure: nearby processes are selected with high
probability, faraway ones with low probability]
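
The slides leave the exact selection rule to the paper; below is a
minimal sketch of one rule consistent with the O(log n) claim: keep
the k-th closest process with probability min(1, λ/k), which yields
roughly λ·ln n expected neighbors. Both the probability formula and
the omission of the traffic weighting (T) are simplifying assumptions.

    import random

    def select_neighbors(me, latency, lam=4.0, seed=0):
        """Pick candidate neighbors of process `me`: the k-th closest
        process is kept with probability min(1, lam / k), so the
        expected neighbor count is about lam * ln(n)."""
        rng = random.Random(seed)
        n = len(latency)
        by_distance = sorted((p for p in range(n) if p != me),
                             key=lambda p: latency[me][p])
        return [p for k, p in enumerate(by_distance, start=1)
                if rng.random() < min(1.0, lam / k)]

Because the keep-probability decays with latency rank, nearby
processes are almost always selected while faraway ones rarely are,
matching the locality goal on slide 5.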
22
Bounding Graph
  • Procs. try to establish temporary conns. to
    their selected neighbors
  • The collective set of successful connections
  • → Bounding graph
  • (Some conns. may fail due to FWs)

[Figure: bounding graph]
23
Routing Table Construction
  • Construct a routing table using just the
    bounding graph
  • Close the temporary connections
  • Conns. of the bounding graph are reestablished
    lazily as real conns.
  • Temporary conns. → small bufs.
  • Real conns. → large bufs.

[Figure: bounding graph]
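
One way to realize this step is a shortest-path computation over the
bounding graph. The Dijkstra-based sketch below, with measured
latencies as edge weights, is an illustrative choice, not necessarily
MC-MPI's actual routing algorithm.

    import heapq

    def next_hop_table(src, edges, n):
        """Build src's routing table over the bounding graph.
        edges: {(u, v): latency}, undirected.
        Returns {dest: neighbor of src to forward through}."""
        adj = [[] for _ in range(n)]
        for (u, v), w in edges.items():
            adj[u].append((v, w))
            adj[v].append((u, w))
        dist = {src: 0.0}
        table = {}
        pq = [(0.0, src, None)]           # (distance, node, first hop)
        while pq:
            d, u, hop = heapq.heappop(pq)
            if d > dist.get(u, float("inf")):
                continue                  # stale queue entry
            if u != src:
                table[u] = hop
            for v, w in adj[u]:
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    heapq.heappush(
                        pq, (d + w, v, v if u == src else hop))
        return table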
24
Lazy Connection Establishment
[Figure: connections established lazily over the
bounding graph; a firewall (FW) blocks some direct
connections]
25
Outline
  • Introduction
  • Related Work
  • Proposed Method
  • Profiling Run
  • Connection Management
  • Rank Assignment
  • Experimental Results
  • Conclusion

26
Commonly-used Method
  • Sort the processes by host name (or IP address)
    and assign ranks in that order (see the sketch
    below)
  • Assumptions
  • Most communication takes place between processes
    with close ranks
  • The communication cost between processes with
    close host names is low
  • However,
  • Applications have various comm. patterns
  • Host names don't necessarily have a correlation
    with communication costs
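
For reference, the commonly used scheme is essentially a sort
(illustrative sketch; the host names are made up):

    # Rank i goes to the process with the i-th smallest host name.
    hostnames = ["sheep03", "chiba101", "sheep01", "chiba100"]
    order = sorted(range(len(hostnames)), key=lambda p: hostnames[p])
    rank_of_process = {p: r for r, p in enumerate(order)}
    # chiba100 -> rank 0, chiba101 -> rank 1,
    # sheep01  -> rank 2, sheep03  -> rank 3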

27
Our Rank Assignment Scheme
  • Find a rank-process mapping with low
    communication overhead
  • Map the rank assignment problem to the Quadratic
    Assignment Problem
  • QAP
  • Given two n×n cost matrices, L and T, find a
    permutation p of 0, 1, ..., n-1 that minimizes
    Σ_{i,j} t_ij · l_{p(i)p(j)}
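
In code, the objective reads as follows: rank i is hosted by process
p[i], so rank-level traffic t_ij is charged the process-level latency
l_{p(i),p(j)} (sketch; matrices as nested lists):

    def qap_cost(p, L, T):
        """QAP objective: sum of T[i][j] * L[p[i]][p[j]] over all rank
        pairs (i, j), where p[i] is the process assigned to rank i."""
        n = len(p)
        return sum(T[i][j] * L[p[i]][p[j]]
                   for i in range(n) for j in range(n))

A good permutation places heavily communicating ranks (large t_ij) on
process pairs with small l_ij.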

28
Solving QAPs
  • NP-Hard, but there are heuristics for finding
    good suboptimal solutions
  • Library based on GRASP [Resende et al. '96]
  • Test against QAPLIB [Burkard et al. '97]
  • Instances of up to n = 256
  • n processors for problem size n
  • Approximate solutions that were within one to two
    percent of the best known solution in under one
    second
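
The slides don't show the heuristic itself; the toy below illustrates
the GRASP pattern (randomized construction plus local search, here
simplified to random restarts and 2-swaps over qap_cost from the
sketch above). It is not the [Resende et al. '96] implementation.

    import random

    def grasp_qap(L, T, iters=20, seed=0):
        """Toy GRASP: random restarts refined by 2-swap local search."""
        rng = random.Random(seed)
        n = len(L)
        best, best_cost = None, float("inf")
        for _ in range(iters):
            p = list(range(n))
            rng.shuffle(p)            # simplified randomized construction
            cost = qap_cost(p, L, T)
            improved = True
            while improved:           # 2-swap local search
                improved = False
                for i in range(n):
                    for j in range(i + 1, n):
                        p[i], p[j] = p[j], p[i]
                        c = qap_cost(p, L, T)
                        if c < cost:
                            cost, improved = c, True
                        else:
                            p[i], p[j] = p[j], p[i]   # undo the swap
            if cost < best_cost:
                best, best_cost = list(p), cost
        return best, best_cost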

29
Outline
  1. Introduction
  2. Related Work
  3. Profiling Run
  4. Connection Management
  5. Rank Assignment
  6. Experimental Results
  7. Conclusion

30
Experimental Environment
[Figure: four 64-node clusters (sheepXX, chibaXXX,
istbsXXX, hongoXXX); inter-cluster RTTs range from
0.3 ms to 10.8 ms; a firewall (FW) blocks some
inter-cluster connections]
  • Xeon/Pentium M
  • Linux
  • Intra-cluster RTT: 60-120 microsecs
  • TCP send/recv bufs: 256 KB ea.
31
Experiment 1: Conn. Management
  • Measure the performance of the NPB with limited
    numbers of connections
  • MC-MPI
  • Limit the number of connections to 10, 20, ...,
    100 by varying λ
  • Random
  • Establish a comparable number of connections
    randomly

32
BT, LU, MG and SP
[Graphs: performance of LU (Lower-Upper), a
Successive Over-Relaxation (SOR) solver, vs. number
of connections]
33
BT, LU, MG and SP (2)
[Graphs: performance of MG (Multi-Grid) and BT
(Block Tridiagonal)]
34
BT, LU, MG and SP (3)
  • The number of connections actually established
    was lower than that shown on the x-axis
  • B/c of lazy connection establishment
  • To be discussed in more detail later

[Graph: performance of SP (Scalar Pentadiagonal)]
35
EP
  • EP involves very little communication

[Graph: performance of EP (Embarrassingly Parallel)]
36
IS
Performance decrease due to congestion!

[Graph: performance of IS (Integer Sort)]
37
Experiment 2: Lazy Conn. Establish.
  • Compare our lazy conn. establishment method with
    an MPICH-like method
  • MC-MPI
  • Select λ so that the maximum number of allowed
    connections is 30
  • MPICH-like
  • Establish connections on demand without
    preselecting candidate connections (we can also
    say that we preselect all connections)

38
Experiment 2: Results
Comparable number of conns. except for IS
Comparable performance except for IS

[Graph: connections established, MC-MPI vs.
MPICH-like]
39
Experiment 3: Rank Assignment
  • Compare 3 assignment algorithms
  • Random
  • Hostname (24 patterns)
  • Real host names (1)
  • What if istbsXXX were named sheepXX, etc. (23)
  • MC-MPI (QAP)

[Figure: host-name permutations across the four
clusters: chibaXXX, sheepXX, hongoXXX, istbsXXX]
40
LU and MG
[Graphs: performance of LU and MG under Random,
Hostname (Best), Hostname (Worst), and MC-MPI (QAP)
rank assignments]
41
BT and SP
[Graphs: performance of BT and SP under Random,
Hostname (Best), Hostname (Worst), and MC-MPI (QAP)
rank assignments]
42
BT and SP (contd)
  • Rank Assignment
  • Traffic Matrix

[Figure: traffic matrices (source rank vs.
destination rank) under the Hostname and MC-MPI
(QAP) assignments, with clusters A-D marked]
43
EP and IS
[Graphs: performance of EP and IS under Random,
Hostname (Best), Hostname (Worst), and MC-MPI (QAP)
rank assignments]
44
Outline
  1. Introduction
  2. Related Work
  3. Profiling Run
  4. Connection Management
  5. Rank Assignment
  6. Experimental Results
  7. Conclusion

45
Conclusion
  • MC-MPI
  • Connection management
  • High performance with connections between just
    10% of all process pairs
  • Rank assignment
  • Up to 300% faster than locality-unaware
    assignments
  • Future Work
  • An API to perform profiling w/in a single run
  • Integration of adaptive collectives