High Performance Broadcast Support in LA-MPI over Quadrics

1 / 35
About This Presentation
Title:

High Performance Broadcast Support in LAMPI over Quadrics

Description:

Use a global allocator to provide global buffer on demand ... Buffer Message at the broadcast channels. Trigger broadcast RDMA(s) to send the message ... –

Number of Views:31
Avg rating:3.0/5.0
Slides: 36
Provided by: Pan35
Category:

less

Transcript and Presenter's Notes

Title: High Performance Broadcast Support in LAMPI over Quadrics


1
High Performance Broadcast Support in LA-MPI
over Quadrics
  • W. Yu, S. Sur, D.K. Panda,
  • R.T. Aulwes and R.L. Graham

Advanced Computing Lab, Los Alamos, NM 87545
Dept. of Computer Science, The Ohio State University
2
Presentation Outline
  • Problem Statement and Goals
  • Design Challenges and Implementation
  • Performance Evaluation
  • Conclusions and Future Work

3
LA-MPI
  • The Los Alamos Message Passing Interface
    (LA-MPI)
  • Provide End-to-End Reliable Message Passing
  • Protect against network errors
  • Protect against I/O bus errors
  • Concurrent Message passing over multiple
    interconnects
  • Message striping over multiple network interface
    cards
  • Supported Platforms
  • Operating Systems
  • Tru64, Linux, IRIX, Mac OS X (32 and 64-bit)
  • Communication protocols
  • Shared Memory, UDP
  • HIPPI-800, Quadrics, Myrinet (GM),
    InfiniBand (ongoing)

4
LA-MPI Architecture
5
Point-to-Point Communication
[Flowchart: a message is bound to a path and fragmented; each fragment
is sent and acknowledged (ACK/NACK), with NACKs or missing ACKs
triggering retransmission until the message is assembled at the
receiver.]
6
LA-MPI Broadcast
Generic Tree-based Broadcast
7
Quadrics Hardware Broadcast
[Slides 7–9: step-by-step figures of the Quadrics hardware broadcast
operation; no text content]
10
Quadrics Hardware Broadcast
  • Benefits
  • Efficient, Scalable and Reliable
  • Limitations
  • The receive address must be global
  • Receiving processes must be on contiguous nodes
  • Existing broadcast implementations that make use
    of the hardware broadcast
  • Elanlib

11
Research Goals
  • Can we make use of the hardware Broadcast to
    provide an efficient and scalable broadcast
    support to LA-MPI while achieving the goal of
    end-to-end reliability?
  • Acknowledgments from receivers (after verifying
    CRC) must be collected to ensure reliability
  • Reduce the overhead for buffer management
  • Raw hardware broadcast latency: 3.3us
  • Elanlib broadcast latency: 8.5us
  • i.e., about 5us of software overhead on top of
    the hardware broadcast
  • Maintain the high performance and scalability of
    hardware broadcast

12
Presentation Outline
  • Problem Statement and Goals
  • Design Challenges and Implementation
  • Performance Evaluation
  • Conclusions and Future Work

13
Challenges
  • Memory management for global buffers
  • Broadcast over processes on non-contiguous nodes
  • Synchronization and acknowledgement
  • Retransmission and Reliability

14
Global Buffer Management
  • Global Buffer must be consistent
  • Use a global allocator to provide global buffer
    on demand
  • Hard to manage and low buffer reuse rate
  • Can satisfy a large number of requests
  • Maintain a static number of fixed size global
    channels
  • Easy to manage and high reuse rate
  • Need more frequent synchronization on the use of
    channels

15
Single Communicator
  • A communicator must recycle its global channels.
  • Synchronize before the use of a channel
  • Synchronize after the use of a channel
  • Synchronize when the global buffers are about to
    be used up
  • Reduce the frequency of synchronization
  • Amortize the cost of synchronization across
    multiple operations
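The channel-recycling policy above can be sketched in a few lines. This is a hypothetical illustration, not LA-MPI's actual data structures: a communicator hands out its fixed-size channels in order and only pays a collective synchronization when they are about to be used up, so the cost is amortized over `NCHANNELS` broadcasts.

```c
#include <assert.h>

/* Illustrative sketch (names are assumptions): a communicator cycles
 * through NCHANNELS fixed-size broadcast channels and synchronizes
 * only when the channels are about to be reused. */
#define NCHANNELS 8

typedef struct {
    int next;        /* index of the next channel to hand out */
    int sync_count;  /* how many synchronizations were needed */
} chan_state_t;

/* Stand-in for the collective synchronization (e.g. a barrier). */
static void synchronize(chan_state_t *s) { s->sync_count++; }

/* Returns the channel index to use for the next broadcast. */
int acquire_channel(chan_state_t *s)
{
    if (s->next == NCHANNELS) {  /* channels about to be used up */
        synchronize(s);          /* confirm receivers drained them */
        s->next = 0;
    }
    return s->next++;
}
```

With 8 channels, 24 broadcasts need only 2 synchronizations instead of 24.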

16
Multiple Communicators
  • Global buffers must be recycled across different
    communicators
  • A small number of concurrent communicators
  • Communicators tend to be disjoint
  • Our solution
  • 8 sets of global buffers, one for COMM_WORLD
  • A communicator performs an Allreduce() to find
    the list of available buffer sets and takes the
    first available one
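The buffer-set selection above amounts to a bitwise-AND reduction: each process contributes a bitmask of the sets it considers free, an MPI_Allreduce with MPI_BAND yields the sets free everywhere, and the communicator takes the lowest one. A minimal sketch, with the allreduce simulated locally and set 0 assumed reserved for COMM_WORLD:

```c
#include <assert.h>
#include <stdint.h>

#define NSETS 8  /* 8 sets of global buffers, per the design above */

/* Simulated MPI_Allreduce(..., MPI_BAND): AND all local masks. */
uint8_t allreduce_band(const uint8_t *masks, int nprocs)
{
    uint8_t out = 0xFF;
    for (int i = 0; i < nprocs; i++) out &= masks[i];
    return out;
}

/* Take the first buffer set free on every process; bit 0 is
 * assumed reserved for COMM_WORLD, so the search starts at 1. */
int pick_buffer_set(uint8_t global_mask)
{
    for (int s = 1; s < NSETS; s++)
        if (global_mask & (1u << s)) return s;
    return -1;  /* no set available everywhere */
}
```

Because communicators tend to be disjoint, the AND-ed mask is rarely empty in practice.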

17
Challenges
  • Memory management for global buffers
  • Broadcast over processes on non-contiguous nodes
  • Synchronization and acknowledgement
  • Retransmission and Reliability

18
Broadcast over Non-contiguous Nodes
  • To make use of hardware broadcast
  • Group processes into sets of contiguous nodes,
    called broadcast segments
  • Approach 1, linearly chained broadcast RDMAs
  • The root performs a broadcast RDMA to each
    segment
  • Not scalable
  • Completely distributed topology, i.e., the
    formation of broadcast segments by one node is
    transparent to all other nodes.
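Forming broadcast segments is a simple grouping pass over the sorted node ids a communicator spans: each maximal run of contiguous ids can be covered by one hardware broadcast RDMA. A hypothetical sketch (types and names are illustrative); with linear chaining the root then issues one RDMA per segment, which is why that approach does not scale with the number of segments:

```c
#include <assert.h>

typedef struct { int first, last; } segment_t;

/* Group sorted node ids into maximal contiguous runs ("broadcast
 * segments"); fills segs[] and returns the segment count. */
int form_segments(const int *nodes, int n, segment_t *segs)
{
    int nsegs = 0;
    for (int i = 0; i < n; i++) {
        if (nsegs == 0 || nodes[i] != segs[nsegs - 1].last + 1)
            segs[nsegs++] = (segment_t){ nodes[i], nodes[i] };
        else
            segs[nsegs - 1].last = nodes[i];  /* extend current run */
    }
    return nsegs;
}
```

E.g. nodes {0,1,2,5,6,9} form three segments: [0–2], [5–6], [9].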

19
Tree-Based Chained Broadcast RDMAs
  • Approach 2 (Tree-based Chaining)
  • Broadcast to the largest broadcast segment
  • Each process that receives data broadcasts to
    another broadcast segment
  • Sophisticated topology
  • Different trees are needed for different roots

20
Synchronization and Acknowledgments
  • Delayed synchronization for small messages
  • Buffer Message at the broadcast channels
  • Trigger broadcast RDMA(s) to send the message
  • Synchronize the processes after a number of
    operations
  • Amortize the synchronization cost across multiple
    operations
  • With delayed synchronization, all nodes need to
    be notified about the conclusion on the status of
    used channels
  • For large messages (>16KB), synchronize the
    processes at the completion of each broadcast to
    avoid the cost of buffering the message
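The two policies above can be sketched as one size-dependent branch. This is an assumed illustration (the threshold and sync interval are the slide's, the names are not): small messages are buffered in a channel and the synchronization is delayed until several operations have used the channels; large messages synchronize at the end of each broadcast so the payload never has to be buffered.

```c
#include <assert.h>
#include <stddef.h>

#define LARGE_MSG  (16 * 1024)  /* >16KB: sync per broadcast     */
#define SYNC_EVERY 8            /* ops between delayed syncs     */

typedef struct { int pending_ops; int syncs; } bcast_state_t;

static void synchronize(bcast_state_t *s) { s->syncs++; s->pending_ops = 0; }

/* Returns 1 if this broadcast triggered a synchronization. */
int broadcast(bcast_state_t *s, size_t nbytes)
{
    if (nbytes > LARGE_MSG) {  /* large: no buffering, sync now */
        synchronize(s);
        return 1;
    }
    /* small: buffer in a channel, trigger the RDMA, sync later */
    if (++s->pending_ops == SYNC_EVERY) { synchronize(s); return 1; }
    return 0;
}
```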

21
Synchronization Approaches
  • Hardware barrier
  • Efficient and scalable
  • Not available for non-contiguous nodes
  • May generate too much broadcast traffic
  • Tree-based synchronization
  • One process as the manager for a communicator
  • ACKs are propagated to the manager through
    chained RDMA
  • NACKs are generated to the manager directly

22
Retransmission and Reliability
  • Reliability against two kinds of errors
  • I/O bus errors
  • Retransmit the data
  • Network errors, e.g., card failures
  • Fail-over to the tree-based broadcast, which is
    built on top of point-to-point communication and
    is end-to-end reliable
  • Retransmission
  • A timestamp is created with each broadcast
    request
  • Retransmit the data when the timer expires or a
    NACK is detected
  • If a card failure is suspected, then fail-over to
    tree-based broadcast
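The retransmission decision above reduces to a small state check per outstanding request. A hypothetical sketch (the slides do not give the actual structures): a suspected card failure takes priority and triggers fail-over; otherwise a NACK or an expired timer triggers retransmission.

```c
#include <assert.h>

typedef enum { ACT_WAIT, ACT_RETRANSMIT, ACT_FAILOVER } action_t;

typedef struct {
    double sent_at;     /* timestamp created with the request */
    int    nack_seen;   /* a receiver reported bad data (CRC) */
    int    card_failed; /* NIC failure suspected              */
} bcast_req_t;

/* Decide what to do with an outstanding broadcast request. */
action_t progress(const bcast_req_t *r, double now, double timeout)
{
    if (r->card_failed)            /* network error: fall back to the */
        return ACT_FAILOVER;       /* reliable tree-based broadcast   */
    if (r->nack_seen || now - r->sent_at >= timeout)
        return ACT_RETRANSMIT;     /* NACK detected or timer expired  */
    return ACT_WAIT;
}
```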

23
Broadcast Message Flow Path
24
Presentation Outline
  • Problem Statement and Goals
  • Design Challenges and Implementation
  • Performance Evaluation
  • Conclusions and Future Work

25
Experiment Testbeds
  • 256-node quad-1.25GHz Alpha Tru64 cluster at LANL
  • 8-node quad-700MHz Linux cluster at OSU
  • Both are equipped with Elan3 QM-400 cards
  • Evaluated MPI implementations
  • LA-MPI
  • MPICH
  • HP's Alaska

26
Performance Evaluation
  • Performance tests
  • Broadcast latency
  • Broadcast latency with SMP support
  • Scalability
  • Impact of the number of broadcast channels
  • Cost of reliability

27
Broadcast Latency
  • Reduce the broadcast latency compared to the
    generic broadcast implementation
  • Achieve 4-byte broadcast latency of 3.5us over 8
    nodes
  • Low overhead for buffer recycling and
    acknowledgments

28
SMP Support
  • Achieve 4-byte broadcast latency of 7.1us over
    256 processes
  • Achieve better performance for small messages
    than MPICH and HP's Alaska, without using the
    hardware barrier

29
Scalability
  • Achieve better scalability compared to the
    generic algorithm
  • Good scalability while achieving high performance

30
Broadcast Channels
  • The synchronization cost is about 13us
  • The cost of synchronization is amortized across
    multiple broadcast operations with a large number
    of broadcast channels.
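The amortization above is simple arithmetic: a ~13us synchronization shared by K broadcast channels costs roughly 13/K us per broadcast, so a larger channel count drives the per-operation overhead toward zero. A trivial helper (name is illustrative):

```c
#include <assert.h>

/* Per-broadcast share of one synchronization amortized over
 * nchannels broadcast operations. */
double sync_overhead_us(double sync_cost_us, int nchannels)
{
    return sync_cost_us / nchannels;
}
```

With 13 channels the 13us synchronization adds only ~1us per broadcast.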

31
Reliability Cost
  • A reliability cost of 1us for small messages
  • The reliability cost for large messages is
    largely due to CRC/checksum computation

32
Presentation Outline
  • Problem Statement and Goals
  • Design Challenges and Implementation
  • Performance Evaluation
  • Conclusions and Future Work

33
Conclusions
  • Achieve end-to-end reliable broadcast with low
    performance impact
  • Achieve efficient and scalable broadcast with
    Quadrics hardware broadcast
  • Reduce the overhead of broadcast buffer management

34
Future Work
  • Reduce the synchronization cost by using hardware
    based barrier.
  • Implement the tree-based chained Broadcast RDMAs
    for processes over non-contiguous nodes
  • Dynamically choose broadcast algorithms according
    to the message pattern
  • Enhance the broadcast further by making use of
    multiple Quadrics NICs

35
More Information