Title: High Performance Broadcast Support in LA-MPI over Quadrics

1. High Performance Broadcast Support in LA-MPI over Quadrics
- W. Yu, S. Sur and D. K. Panda
  Dept. of Computer Science, The Ohio State University
- R. T. Aulwes and R. L. Graham
  Advanced Computing Lab, Los Alamos National Laboratory, Los Alamos, NM 87545
2. Presentation Outline
- Problem Statement and Goals
- Design Challenges and Implementation
- Performance Evaluation
- Conclusions and Future Work
3. LA-MPI
- The Los Alamos Message Passing Interface (LA-MPI)
- Provides end-to-end reliable message passing
  - Protects against network errors
  - Protects against I/O bus errors
- Concurrent message passing over multiple interconnects
  - Message striping over multiple network interface cards
- Supported platforms
  - Operating systems: Tru64, Linux, IRIX, Mac OS X (32- and 64-bit)
  - Communication protocols: Shared Memory, UDP, HiPPI-800, Quadrics, Myrinet (GM), InfiniBand (ongoing)
4. LA-MPI Architecture
5. Point-to-Point Communication
[Figure: point-to-point message flow — a message is bound and assembled into fragments (Frag); each fragment is acknowledged (ACK) or negatively acknowledged (NACK) by the receiver, and fragments are retransmitted as needed]
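The diagram's control flow amounts to a per-fragment acknowledge-or-retransmit loop on the sender. Below is a minimal sketch of that loop; the types and helper functions (frag_t, send_frag, wait_reply) are hypothetical illustrations, not LA-MPI's actual code.

```c
/* Minimal sketch of the sender-side fragment loop implied by the
 * flow above. All helpers and types are hypothetical. */
#include <stddef.h>

typedef struct { const void *data; size_t len; unsigned seq; } frag_t;
typedef enum { REPLY_ACK, REPLY_NACK, REPLY_TIMEOUT } reply_t;

extern void    send_frag(const frag_t *f);   /* post one fragment     */
extern reply_t wait_reply(unsigned seq);     /* wait for the receiver */

/* Send one fragment reliably: retransmit on NACK (checksum failure
 * at the receiver) or on timeout (fragment presumed lost). */
void send_reliable(const frag_t *f)
{
    for (;;) {
        send_frag(f);
        if (wait_reply(f->seq) == REPLY_ACK)
            return;             /* receiver verified the data */
        /* NACK or timeout: fall through and retransmit */
    }
}
```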
6. LA-MPI Broadcast
- Generic tree-based broadcast, built on point-to-point communication (see the sketch below)
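For contrast with the hardware broadcast discussed next, a generic tree-based broadcast is built purely on point-to-point operations. The following is a minimal binomial-tree sketch in plain MPI; LA-MPI's internal version differs, so treat this only as an illustration of the algorithm.

```c
/* A minimal binomial-tree broadcast built on point-to-point
 * send/recv, illustrating the generic (non-hardware) algorithm. */
#include <mpi.h>

void tree_bcast(void *buf, int count, MPI_Datatype dtype,
                int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int rel = (rank - root + size) % size;  /* rank relative to root */
    int mask = 1;

    /* Non-root processes receive from their parent exactly once. */
    while (mask < size) {
        if (rel & mask) {
            int parent = (rel - mask + root) % size;
            MPI_Recv(buf, count, dtype, parent, 0, comm,
                     MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }
    /* Then every process forwards the data down its subtree. */
    mask >>= 1;
    while (mask > 0) {
        if (rel + mask < size) {
            int child = (rel + mask + root) % size;
            MPI_Send(buf, count, dtype, child, 0, comm);
        }
        mask >>= 1;
    }
}
```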
7-9. Quadrics Hardware Broadcast
[Figure slides illustrating the operation of the Quadrics hardware broadcast]
10. Quadrics Hardware Broadcast
- Benefits
  - Efficient, scalable, and reliable
- Limitations
  - The receive address must be global
  - Receiving processes must be on contiguous nodes
- Existing broadcast implementation making use of the hardware broadcast: Elanlib
11. Research Goals
- Can we make use of the hardware broadcast to provide efficient and scalable broadcast support in LA-MPI while achieving the goal of end-to-end reliability?
  - Acknowledgments from receivers (after verifying the CRC) must be collected to ensure reliability
- Reduce the overhead of buffer management
  - Raw hardware broadcast latency: 3.3 µs
  - Elanlib broadcast latency: 8.5 µs
  - ~5 µs of overhead when making use of the hardware broadcast
- Maintain the high performance and scalability of the hardware broadcast
12. Presentation Outline
- Problem Statement and Goals
- Design Challenges and Implementation
- Performance Evaluation
- Conclusions and Future Work
13. Challenges
- Memory management for global buffers
- Broadcast over processes on non-contiguous nodes
- Synchronization and acknowledgement
- Retransmission and Reliability
14. Global Buffer Management
- Global buffers must be consistent across processes
- Option 1: use a global allocator to provide global buffers on demand
  - Hard to manage, and low buffer reuse rate
  - Can satisfy a large number of requests
- Option 2: maintain a static number of fixed-size global channels (see the sketch below)
  - Easy to manage, and high reuse rate
  - Needs more frequent synchronization on the use of channels
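A minimal sketch of the static-channel approach, assuming an illustrative channel count, size, and layout (none of these names or numbers are LA-MPI's):

```c
/* Sketch of a static pool of fixed-size global broadcast channels,
 * taken round-robin and reclaimed en masse after synchronization. */
#define NUM_CHANNELS 64
#define CHANNEL_SIZE 16384          /* fixed payload size per channel */

typedef struct {
    char payload[CHANNEL_SIZE];     /* globally mapped buffer */
    int  in_use;                    /* set until acknowledged  */
} bcast_channel_t;

static bcast_channel_t channels[NUM_CHANNELS];
static int next_channel;            /* round-robin cursor */

/* Take the next free channel, or signal that the communicator must
 * synchronize to reclaim channels (returns NULL). */
bcast_channel_t *acquire_channel(void)
{
    bcast_channel_t *c = &channels[next_channel];
    if (c->in_use)
        return NULL;                /* pool exhausted: synchronize first */
    c->in_use = 1;
    next_channel = (next_channel + 1) % NUM_CHANNELS;
    return c;
}

/* All channels are reclaimed at once after the communicator
 * synchronizes and every receiver has acknowledged. */
void release_all_channels(void)
{
    for (int i = 0; i < NUM_CHANNELS; i++)
        channels[i].in_use = 0;
}
```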
15. Single Communicator
- A communicator must recycle its global channels; it may synchronize:
  - Before the use of a channel
  - After the use of a channel
  - When the global buffers are about to be used up
- Synchronizing only when the buffers are nearly used up reduces the frequency of synchronization
  - Amortizes the cost of synchronization across multiple operations
16. Multiple Communicators
- Global buffers must be recycled across different communicators
- Observations: there are typically only a small number of concurrent communicators, and communicators tend to be disjoint
- Our solution (sketched below)
  - 8 sets of global buffers, one reserved for COMM_WORLD
  - A communicator performs an Allreduce() to find the list of buffer sets available on all of its processes, and takes the first available set
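A minimal sketch of the Allreduce-based selection, assuming each process contributes a bitmask of the buffer sets it sees as free; the names and mask encoding are illustrative assumptions:

```c
/* Sketch of picking a buffer set with a single Allreduce: a bitmask
 * of locally free sets is ANDed across all members, and everyone
 * then takes the first set that is free everywhere. */
#include <mpi.h>

#define NUM_BUFFER_SETS 8           /* one reserved for COMM_WORLD */

int pick_buffer_set(MPI_Comm comm, unsigned local_free_mask)
{
    unsigned global_free_mask;

    /* Bitwise AND keeps only sets free on every process. */
    MPI_Allreduce(&local_free_mask, &global_free_mask, 1,
                  MPI_UNSIGNED, MPI_BAND, comm);

    for (int i = 0; i < NUM_BUFFER_SETS; i++)
        if (global_free_mask & (1u << i))
            return i;               /* first set free on all processes */
    return -1;                      /* none available */
}
```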
17. Challenges
- Memory management for global buffers
- Broadcast over processes on non-contiguous nodes
- Synchronization and acknowledgement
- Retransmission and Reliability
18. Broadcast over Non-contiguous Nodes
- To make use of the hardware broadcast, group processes into sets of contiguous nodes, called broadcast segments (see the sketch below)
- Approach 1: linearly chained broadcast RDMAs
  - The root performs a broadcast RDMA to each segment
  - Not scalable
  - Completely distributed topology, i.e., the formation of broadcast segments by one node is transparent to all other nodes
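A minimal sketch of forming broadcast segments from a sorted list of destination node ids; the data layout and names are assumptions for illustration:

```c
/* Group destination nodes into broadcast segments: maximal runs of
 * contiguous node ids, each reachable by one hardware broadcast
 * RDMA. Node ids are assumed sorted; seg[] is assumed large enough. */
typedef struct { int first, last; } segment_t;

/* Returns the number of segments written into seg[]. */
int build_segments(const int *nodes, int n, segment_t *seg)
{
    int nseg = 0;
    for (int i = 0; i < n; i++) {
        if (nseg > 0 && nodes[i] == seg[nseg - 1].last + 1)
            seg[nseg - 1].last = nodes[i];    /* extend current run  */
        else {
            seg[nseg].first = seg[nseg].last = nodes[i];
            nseg++;                           /* start a new segment */
        }
    }
    return nseg;
}
```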
19. Tree-Based Chained Broadcast RDMAs
- Approach 2: tree-based chaining
  - Broadcast to the largest broadcast segment
  - Each process that receives the data broadcasts to another broadcast segment
- More sophisticated topology
  - Different trees are needed for different roots
20. Synchronization and Acknowledgments
- Delayed synchronization for small messages (see the sketch below)
  - Buffer the message at the broadcast channels
  - Trigger broadcast RDMA(s) to send the message
  - Synchronize the processes only after a number of operations
  - Amortizes the synchronization cost across multiple operations
- With delayed synchronization, all nodes need to be notified about the conclusion on the status of the used channels
- For large messages (>16 KB), synchronize the processes at the completion of each broadcast to avoid the message buffering cost
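A minimal sketch of this policy, assuming a hypothetical sync interval and helper function; the 16 KB large-message threshold comes from the slide above:

```c
/* Sketch of delayed synchronization: small messages are buffered in
 * broadcast channels and the processes only synchronize after a
 * number of operations; large messages synchronize every time. */
#include <stddef.h>

#define LARGE_MSG_THRESHOLD 16384   /* > 16 KB: synchronize per bcast */
#define SYNC_INTERVAL       32      /* small msgs: sync every N ops   */

static int ops_since_sync;

extern void collect_acks_and_recycle(void);  /* hypothetical sync step */

void after_broadcast(size_t msg_len)
{
    if (msg_len > LARGE_MSG_THRESHOLD ||
        ++ops_since_sync >= SYNC_INTERVAL) {
        /* All nodes learn the status of the used channels here. */
        collect_acks_and_recycle();
        ops_since_sync = 0;
    }
}
```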
21. Synchronization Approaches
- Hardware barrier
  - Efficient and scalable
  - Not available for non-contiguous nodes
  - May generate too much broadcast traffic
- Tree-based synchronization
  - One process acts as the manager for a communicator
  - ACKs are propagated to the manager through chained RDMAs
  - NACKs are sent directly to the manager
22. Retransmission and Reliability
- Reliability against two kinds of errors
  - I/O bus errors: retransmit the data
  - Network errors, e.g., card failures: fail over to the tree-based broadcast, which is built on top of point-to-point communication and is end-to-end reliable
- Retransmission (see the sketch below)
  - A timestamp is created with each broadcast request
  - Retransmit the data when the timer goes off or a NACK is detected
  - If a card failure is suspected, fail over to the tree-based broadcast
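A minimal sketch of the timer/NACK-driven retransmission with fail-over; the timeout value, retry limit, and helper functions are illustrative assumptions, not LA-MPI's:

```c
/* Sketch of the retransmission policy: each broadcast request
 * carries a timestamp; on timeout or NACK the data is retransmitted,
 * and repeated failures (a suspected card failure) trigger fail-over
 * to the tree-based, end-to-end reliable broadcast. */
#include <time.h>
#include <stdbool.h>

#define RETX_TIMEOUT_SEC 1.0
#define MAX_RETRIES      8          /* beyond this, suspect the card */

typedef struct {
    double posted_at;               /* timestamp at broadcast request */
    int    retries;
    bool   nacked;
} bcast_req_t;

extern void hw_bcast_retransmit(bcast_req_t *r);   /* hypothetical */
extern void tree_bcast_failover(bcast_req_t *r);   /* hypothetical */

static double now_sec(void) { return (double)clock() / CLOCKS_PER_SEC; }

void progress(bcast_req_t *r)
{
    if (r->nacked || now_sec() - r->posted_at > RETX_TIMEOUT_SEC) {
        if (++r->retries > MAX_RETRIES) {
            tree_bcast_failover(r); /* card failure suspected */
            return;
        }
        r->posted_at = now_sec();   /* restart the timer */
        r->nacked = false;
        hw_bcast_retransmit(r);
    }
}
```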
23. Broadcast Message Flow Path
24. Presentation Outline
- Problem Statement and Goals
- Design Challenges and Implementation
- Performance Evaluation
- Conclusions and Future Work
25. Experiment Testbeds
- Experiment testbeds
  - A 256-node quad-1.25 GHz Alpha Tru64 cluster at LANL
  - An 8-node quad-700 MHz Linux cluster at OSU
  - Both are equipped with Elan3 QM-400 cards
- Evaluated MPI implementations
  - LA-MPI
  - MPICH
  - HP's Alaska
26. Performance Evaluation
- Performance tests
  - Broadcast latency
  - Broadcast latency with SMP support
  - Scalability
  - Impact of the number of broadcast channels
  - Cost of reliability
27. Broadcast Latency
- Reduces the broadcast latency compared to the generic broadcast implementation
- Achieves a 4-byte broadcast latency of 3.5 µs over 8 nodes
- Low overhead for buffer recycling and acknowledgments
28. SMP Support
- Achieves a 4-byte broadcast latency of 7.1 µs over 256 processes
- Achieves better performance for small messages than MPICH and HP's Alaska, without using the hardware barrier
29. Scalability
- Achieves better scalability than the generic algorithm
- Good scalability while achieving high performance
30. Broadcast Channels
- The synchronization cost is about 13 µs
- With a large number of broadcast channels, the cost of synchronization is amortized across multiple broadcast operations
31. Reliability Cost
- A reliability cost of about 1 µs for small messages
- The reliability cost for large messages is largely due to the CRC/checksum computation
32. Presentation Outline
- Problem Statement and Goals
- Design Challenges and Implementation
- Performance Evaluation
- Conclusions and Future Work
33. Conclusions
- Achieved end-to-end reliable broadcast with low performance impact
- Achieved efficient and scalable broadcast with the Quadrics hardware broadcast
- Reduced the overhead of broadcast buffer management
34. Future Work
- Reduce the synchronization cost by using the hardware-based barrier
- Implement the tree-based chained broadcast RDMAs for processes over non-contiguous nodes
- Dynamically choose broadcast algorithms according to the message pattern
- Enhance the broadcast further by making use of multiple Quadrics NICs
35. More Information