Title: SCTP-based Middleware for MPI
1. SCTP-based Middleware for MPI
- Humaira Kamal, Brad Penoff, Alan Wagner
- Department of Computer Science
- University of British Columbia
2. What is MPI and SCTP?
- Message Passing Interface (MPI)
  - A library that is widely used to parallelize scientific and compute-intensive programs
- Stream Control Transmission Protocol (SCTP)
  - A general-purpose unicast transport protocol for IP network data communications
  - Recently standardized by the IETF
  - Can be used anywhere TCP is used
3. What is MPI and SCTP?
- Question
  - Can we take advantage of SCTP features to better support parallel applications using MPI?
4. Communicating MPI Processes
- TCP is often used as the transport protocol for MPI
(Figure: MPI processes communicating, with SCTP as the transport)
5. SCTP Key Features
- Reliable in-order delivery, flow control, full-duplex transfer
- SACK is built into the protocol
- TCP-like congestion control
6. SCTP Key Features
- Message oriented
- Use of associations
- Multihoming
- Multiple streams within an association
7. Logical View of Multiple Streams in an Association
8. Partially Ordered User Messages Sent on Different Streams
9. MPI Middleware
MPI_Send(msg, count, type, dest-rank, tag, context)
MPI_Recv(msg, count, type, source-rank, tag, context)
- Message matching is done based on Tag, Rank, and Context (TRC)
- Combinations such as blocking, non-blocking, synchronous, asynchronous, buffered, unbuffered
- Use of wildcards for receive
10. MPI Messages Using Same Context, Two Processes
11. MPI Messages Using Same Context, Two Processes
- Out-of-order messages with the same tags violate MPI semantics
12. MPI Middleware
- Message Progression Layer
- Short Messages vs. Long Messages
13. Design and Implementation
- LAM (Local Area Multicomputer) is an open-source implementation of the MPI library
- We redesigned LAM-MPI to use SCTP
- Three-phased iterative process
  - Use of One-to-One Style Sockets
  - Use of Multiple Streams
  - Use of One-to-Many Style Sockets
14. Using SCTP for MPI
- Striking similarities between SCTP and MPI
15. Implementation Issues
- Maintaining State Information
  - Maintain state appropriately for each request function to work with the one-to-many style
- Message Demultiplexing
  - Extend RPI initialization to map associations to ranks
  - Demultiplex each incoming message to direct it to the proper receive function
- Concurrency and SCTP Streams
  - Consistently map MPI tag-rank-context to SCTP streams, maintaining proper MPI semantics
- Resource Management
  - Make RPI more message-driven
  - Eliminate the use of the select() system call, making the implementation more scalable
  - Eliminate the need to maintain a large number of socket descriptors
16. Implementation Issues
- Eliminating Race Conditions
  - Find solutions for race conditions due to added concurrency
  - Use of a barrier after the association setup phase
- Reliability
  - Modify out-of-band daemons and the request progression interface (RPI) to use a common transport layer protocol so that all components of LAM can multihome successfully
- Support for Large Messages
  - Devised a long-message protocol to handle messages larger than the socket send buffer
- Experiments with different SCTP stacks
17. Features of Design
- Head-of-Line Blocking
- Multihoming and Reliability
- Security
18. Head-of-Line Blocking
19. Multihoming
- Heartbeats
- Failover
- Retransmissions
- User adjustable controls
20. Added Security
- User data can be piggybacked on the third and fourth legs (of SCTP's four-way handshake)
- SCTP's use of a signed cookie
21. Limitations
- Comprehensive CRC32c checksum offload to the NIC is not yet commonly available
- SCTP bundles messages together, so it might not always be able to pack a full MTU
- The SCTP stack is in its early stages and will improve over time
- Performance is stack dependent (Linux lksctp stack << FreeBSD KAME stack)
22. Experiments for Loss
Performance of MPI Program that Uses Multiple Tags
23. Experiments: Head-of-Line Blocking
Use of Different Tags vs. Same Tags
24. Experiments: SCTP versus TCP
MPBench Ping-Pong Test under No Loss
25. Conclusions
- SCTP is better suited for MPI
  - Avoids unnecessary head-of-line blocking due to its use of streams
  - Increased fault tolerance in the presence of multihomed hosts
  - Built-in security features
- SCTP might be key to moving MPI programs from LANs to WANs
26. Thank you!
- More information about our work is at
- http://www.cs.ubc.ca/labs/dsg/mpi-sctp/