Title: High Performance Broadcast Support in LA-MPI over Quadrics

1. High Performance Broadcast Support in LA-MPI over Quadrics
- W. Yu, S. Sur and D. K. Panda
  Dept. of Computer Science, The Ohio State University
- R. T. Aulwes and R. L. Graham
  Advanced Computing Lab, Los Alamos National Laboratory, Los Alamos, NM 87545
2. Presentation Outline
- Problem Statement and Goals
- Design Challenges and Implementation
- Performance Evaluation
- Conclusions and Future Work
3. LA-MPI
- The Los Alamos Message Passing Interface (LA-MPI)
- Provides end-to-end reliable message passing
  - Protects against network errors
  - Protects against I/O bus errors
- Concurrent message passing over multiple interconnects
  - Message striping over multiple network interface cards
- Supported platforms
  - Operating systems: Tru64, Linux, IRIX, Mac OS X (32- and 64-bit)
  - Communication protocols: Shared Memory, UDP, HiPPI-800, Quadrics, Myrinet (GM), InfiniBand (ongoing)
4. LA-MPI Architecture
5. Point-to-Point Communication
[Figure: point-to-point message flow — a message is bound and assembled into fragments (Frag); each fragment is acknowledged (ACK) or negatively acknowledged (NACK) by the receiver, and fragments are retransmitted as needed]
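The diagram's control flow amounts to a per-fragment acknowledge-or-retransmit loop on the sender. Below is a minimal sketch of that loop; the types and helper functions (frag_t, send_frag, wait_reply) are hypothetical illustrations, not LA-MPI's actual code.

```c
/* Minimal sketch of the sender-side fragment loop implied by the
 * flow above. All helpers and types are hypothetical. */
#include <stddef.h>

typedef struct { const void *data; size_t len; unsigned seq; } frag_t;
typedef enum { REPLY_ACK, REPLY_NACK, REPLY_TIMEOUT } reply_t;

extern void    send_frag(const frag_t *f);   /* post one fragment     */
extern reply_t wait_reply(unsigned seq);     /* wait for the receiver */

/* Send one fragment reliably: retransmit on NACK (checksum failure
 * at the receiver) or on timeout (fragment presumed lost). */
void send_reliable(const frag_t *f)
{
    for (;;) {
        send_frag(f);
        if (wait_reply(f->seq) == REPLY_ACK)
            return;             /* receiver verified the data */
        /* NACK or timeout: fall through and retransmit */
    }
}
```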
6. LA-MPI Broadcast
- Generic tree-based broadcast, built on point-to-point communication (see the sketch below)
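For contrast with the hardware broadcast discussed next, a generic tree-based broadcast is built purely on point-to-point operations. The following is a minimal binomial-tree sketch in plain MPI; LA-MPI's internal version differs, so treat this only as an illustration of the algorithm.

```c
/* A minimal binomial-tree broadcast built on point-to-point
 * send/recv, illustrating the generic (non-hardware) algorithm. */
#include <mpi.h>

void tree_bcast(void *buf, int count, MPI_Datatype dtype,
                int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int rel = (rank - root + size) % size;  /* rank relative to root */
    int mask = 1;

    /* Non-root processes receive from their parent exactly once. */
    while (mask < size) {
        if (rel & mask) {
            int parent = (rel - mask + root) % size;
            MPI_Recv(buf, count, dtype, parent, 0, comm,
                     MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }
    /* Then every process forwards the data down its subtree. */
    mask >>= 1;
    while (mask > 0) {
        if (rel + mask < size) {
            int child = (rel + mask + root) % size;
            MPI_Send(buf, count, dtype, child, 0, comm);
        }
        mask >>= 1;
    }
}
```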
7-9. Quadrics Hardware Broadcast
[Figure slides illustrating the operation of the Quadrics hardware broadcast]
10. Quadrics Hardware Broadcast
- Benefits
  - Efficient, scalable, and reliable
- Limitations
  - The receive address must be global
  - Receiving processes must be on contiguous nodes
- Existing broadcast implementation making use of the hardware broadcast: Elanlib
11. Research Goals
- Can we make use of the hardware broadcast to provide efficient and scalable broadcast support in LA-MPI while achieving the goal of end-to-end reliability?
  - Acknowledgments from receivers (after verifying the CRC) must be collected to ensure reliability
- Reduce the overhead of buffer management
  - Raw hardware broadcast latency: 3.3 µs
  - Elanlib broadcast latency: 8.5 µs
  - ~5 µs of overhead when making use of the hardware broadcast
- Maintain the high performance and scalability of the hardware broadcast
12. Presentation Outline
- Problem Statement and Goals
- Design Challenges and Implementation
- Performance Evaluation
- Conclusions and Future Work
13. Challenges
- Memory management for global buffers
- Broadcast over processes on non-contiguous nodes
- Synchronization and acknowledgement
- Retransmission and Reliability
14. Global Buffer Management
- Global buffers must be consistent across processes
- Option 1: use a global allocator to provide global buffers on demand
  - Hard to manage, and low buffer reuse rate
  - Can satisfy a large number of requests
- Option 2: maintain a static number of fixed-size global channels (see the sketch below)
  - Easy to manage, and high reuse rate
  - Needs more frequent synchronization on the use of channels
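A minimal sketch of the static-channel approach, assuming an illustrative channel count, size, and layout (none of these names or numbers are LA-MPI's):

```c
/* Sketch of a static pool of fixed-size global broadcast channels,
 * taken round-robin and reclaimed en masse after synchronization. */
#define NUM_CHANNELS 64
#define CHANNEL_SIZE 16384          /* fixed payload size per channel */

typedef struct {
    char payload[CHANNEL_SIZE];     /* globally mapped buffer */
    int  in_use;                    /* set until acknowledged  */
} bcast_channel_t;

static bcast_channel_t channels[NUM_CHANNELS];
static int next_channel;            /* round-robin cursor */

/* Take the next free channel, or signal that the communicator must
 * synchronize to reclaim channels (returns NULL). */
bcast_channel_t *acquire_channel(void)
{
    bcast_channel_t *c = &channels[next_channel];
    if (c->in_use)
        return NULL;                /* pool exhausted: synchronize first */
    c->in_use = 1;
    next_channel = (next_channel + 1) % NUM_CHANNELS;
    return c;
}

/* All channels are reclaimed at once after the communicator
 * synchronizes and every receiver has acknowledged. */
void release_all_channels(void)
{
    for (int i = 0; i < NUM_CHANNELS; i++)
        channels[i].in_use = 0;
}
```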
15. Single Communicator
- A communicator must recycle its global channels; it may synchronize:
  - Before the use of a channel
  - After the use of a channel
  - When the global buffers are about to be used up
- Synchronizing only when the buffers are nearly used up reduces the frequency of synchronization
  - Amortizes the cost of synchronization across multiple operations
16. Multiple Communicators
- Global buffers must be recycled across different communicators
- Observations: there are typically only a small number of concurrent communicators, and communicators tend to be disjoint
- Our solution (sketched below)
  - 8 sets of global buffers, one reserved for COMM_WORLD
  - A communicator performs an Allreduce() to find the list of buffer sets available on all of its processes, and takes the first available set
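A minimal sketch of the Allreduce-based selection, assuming each process contributes a bitmask of the buffer sets it sees as free; the names and mask encoding are illustrative assumptions:

```c
/* Sketch of picking a buffer set with a single Allreduce: a bitmask
 * of locally free sets is ANDed across all members, and everyone
 * then takes the first set that is free everywhere. */
#include <mpi.h>

#define NUM_BUFFER_SETS 8           /* one reserved for COMM_WORLD */

int pick_buffer_set(MPI_Comm comm, unsigned local_free_mask)
{
    unsigned global_free_mask;

    /* Bitwise AND keeps only sets free on every process. */
    MPI_Allreduce(&local_free_mask, &global_free_mask, 1,
                  MPI_UNSIGNED, MPI_BAND, comm);

    for (int i = 0; i < NUM_BUFFER_SETS; i++)
        if (global_free_mask & (1u << i))
            return i;               /* first set free on all processes */
    return -1;                      /* none available */
}
```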
17. Challenges
- Memory management for global buffers
- Broadcast over processes on non-contiguous nodes
- Synchronization and acknowledgement
- Retransmission and Reliability
18. Broadcast over Non-contiguous Nodes
- To make use of the hardware broadcast, group processes into sets of contiguous nodes, called broadcast segments (see the sketch below)
- Approach 1: linearly chained broadcast RDMAs
  - The root performs a broadcast RDMA to each segment
  - Not scalable
  - Completely distributed topology, i.e., the formation of broadcast segments by one node is transparent to all other nodes
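A minimal sketch of forming broadcast segments from a sorted list of destination node ids; the data layout and names are assumptions for illustration:

```c
/* Group destination nodes into broadcast segments: maximal runs of
 * contiguous node ids, each reachable by one hardware broadcast
 * RDMA. Node ids are assumed sorted; seg[] is assumed large enough. */
typedef struct { int first, last; } segment_t;

/* Returns the number of segments written into seg[]. */
int build_segments(const int *nodes, int n, segment_t *seg)
{
    int nseg = 0;
    for (int i = 0; i < n; i++) {
        if (nseg > 0 && nodes[i] == seg[nseg - 1].last + 1)
            seg[nseg - 1].last = nodes[i];    /* extend current run  */
        else {
            seg[nseg].first = seg[nseg].last = nodes[i];
            nseg++;                           /* start a new segment */
        }
    }
    return nseg;
}
```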
19. Tree-Based Chained Broadcast RDMAs
- Approach 2: tree-based chaining
  - Broadcast to the largest broadcast segment
  - Each process that receives the data broadcasts to another broadcast segment
- More sophisticated topology
  - Different trees are needed for different roots
20. Synchronization and Acknowledgments
- Delayed synchronization for small messages (see the sketch below)
  - Buffer the message at the broadcast channels
  - Trigger broadcast RDMA(s) to send the message
  - Synchronize the processes only after a number of operations
  - Amortizes the synchronization cost across multiple operations
- With delayed synchronization, all nodes need to be notified about the conclusion on the status of the used channels
- For large messages (>16 KB), synchronize the processes at the completion of each broadcast to avoid the message buffering cost
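A minimal sketch of this policy, assuming a hypothetical sync interval and helper function; the 16 KB large-message threshold comes from the slide above:

```c
/* Sketch of delayed synchronization: small messages are buffered in
 * broadcast channels and the processes only synchronize after a
 * number of operations; large messages synchronize every time. */
#include <stddef.h>

#define LARGE_MSG_THRESHOLD 16384   /* > 16 KB: synchronize per bcast */
#define SYNC_INTERVAL       32      /* small msgs: sync every N ops   */

static int ops_since_sync;

extern void collect_acks_and_recycle(void);  /* hypothetical sync step */

void after_broadcast(size_t msg_len)
{
    if (msg_len > LARGE_MSG_THRESHOLD ||
        ++ops_since_sync >= SYNC_INTERVAL) {
        /* All nodes learn the status of the used channels here. */
        collect_acks_and_recycle();
        ops_since_sync = 0;
    }
}
```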
21. Synchronization Approaches
- Hardware barrier
  - Efficient and scalable
  - Not available for non-contiguous nodes
  - May generate too much broadcast traffic
- Tree-based synchronization
  - One process acts as the manager for a communicator
  - ACKs are propagated to the manager through chained RDMAs
  - NACKs are sent directly to the manager
22. Retransmission and Reliability
- Reliability against two kinds of errors
  - I/O bus errors: retransmit the data
  - Network errors, e.g., card failures: fail over to the tree-based broadcast, which is built on top of point-to-point communication and is end-to-end reliable
- Retransmission (see the sketch below)
  - A timestamp is created with each broadcast request
  - Retransmit the data when the timer goes off or a NACK is detected
  - If a card failure is suspected, fail over to the tree-based broadcast
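A minimal sketch of the timer/NACK-driven retransmission with fail-over; the timeout value, retry limit, and helper functions are illustrative assumptions, not LA-MPI's:

```c
/* Sketch of the retransmission policy: each broadcast request
 * carries a timestamp; on timeout or NACK the data is retransmitted,
 * and repeated failures (a suspected card failure) trigger fail-over
 * to the tree-based, end-to-end reliable broadcast. */
#include <time.h>
#include <stdbool.h>

#define RETX_TIMEOUT_SEC 1.0
#define MAX_RETRIES      8          /* beyond this, suspect the card */

typedef struct {
    double posted_at;               /* timestamp at broadcast request */
    int    retries;
    bool   nacked;
} bcast_req_t;

extern void hw_bcast_retransmit(bcast_req_t *r);   /* hypothetical */
extern void tree_bcast_failover(bcast_req_t *r);   /* hypothetical */

static double now_sec(void) { return (double)clock() / CLOCKS_PER_SEC; }

void progress(bcast_req_t *r)
{
    if (r->nacked || now_sec() - r->posted_at > RETX_TIMEOUT_SEC) {
        if (++r->retries > MAX_RETRIES) {
            tree_bcast_failover(r); /* card failure suspected */
            return;
        }
        r->posted_at = now_sec();   /* restart the timer */
        r->nacked = false;
        hw_bcast_retransmit(r);
    }
}
```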
23. Broadcast Message Flow Path
24. Presentation Outline
- Problem Statement and Goals
- Design Challenges and Implementation
- Performance Evaluation
- Conclusions and Future Work
25. Experiment Testbeds
- Experiment testbeds
  - A 256-node quad-1.25 GHz Alpha Tru64 cluster at LANL
  - An 8-node quad-700 MHz Linux cluster at OSU
  - Both are equipped with Elan3 QM-400 cards
- Evaluated MPI implementations
  - LA-MPI
  - MPICH
  - HP's Alaska
26. Performance Evaluation
- Performance tests
  - Broadcast latency
  - Broadcast latency with SMP support
  - Scalability
  - Impact of the number of broadcast channels
  - Cost of reliability
27. Broadcast Latency
- Reduces the broadcast latency compared to the generic broadcast implementation
- Achieves a 4-byte broadcast latency of 3.5 µs over 8 nodes
- Low overhead for buffer recycling and acknowledgments
28. SMP Support
- Achieves a 4-byte broadcast latency of 7.1 µs over 256 processes
- Achieves better performance for small messages than MPICH and HP's Alaska, without using the hardware barrier
29. Scalability
- Achieves better scalability than the generic algorithm
- Good scalability while achieving high performance
30. Broadcast Channels
- The synchronization cost is about 13 µs
- With a large number of broadcast channels, the cost of synchronization is amortized across multiple broadcast operations
31. Reliability Cost
- A reliability cost of about 1 µs for small messages
- The reliability cost for large messages is largely due to the CRC/checksum computation
32. Presentation Outline
- Problem Statement and Goals
- Design Challenges and Implementation
- Performance Evaluation
- Conclusions and Future Work
33. Conclusions
- Achieved end-to-end reliable broadcast with low performance impact
- Achieved efficient and scalable broadcast with the Quadrics hardware broadcast
- Reduced the overhead of broadcast buffer management
34. Future Work
- Reduce the synchronization cost by using the hardware-based barrier
- Implement the tree-based chained broadcast RDMAs for processes over non-contiguous nodes
- Dynamically choose broadcast algorithms according to the message pattern
- Enhance the broadcast further by making use of multiple Quadrics NICs
35. More Information