Title: FT-MPI
1. FT-MPI
- Graham E. Fagg
- Making of the holy grail
- or: a YAMI that is FT
2. FT-MPI
- What is FT-MPI (it's no YAMI)
- Building an MPI for Harness
- First sketch of FT-MPI
- Simple FT-enabled example
- A bigger, meaner example (PSTSWM)
- Second view of FT-MPI
- Future directions
3. FT-MPI is not just a YAMI
- FT-MPI as in Fault Tolerant MPI.
- Why make an FT version?
- Harness is going to be very robust compared to previous systems: no single point of failure, unlike PVM.
- Allow MPI users to take advantage of this high level of robustness, rather than just providing Yet Another MPI Implementation (YAMI).
4. Why FT-MPI?
- Current MPI applications live under the MPI fault tolerance model of "no faults allowed".
- This is fine on an MPP, since if you lose a node you generally lose a partition anyway.
- It also makes reasoning about results easy: if there was a fault, you might have received incomplete/incorrect values and hence have the wrong result anyway.
5. Why FT-MPI?
- No matter how we implement FT-MPI, it must follow current MPI-1.2 (or 2) practices, i.e. we can't really change too much about how it works (semantics) or how it looks (syntax).
- This makes coding for FT a little interesting and very dependent on the target application classes, as will be shown.
6. So first, what does MPI do?
- All communication is via a communicator.
- Communicators form an envelope in which communication can occur, and contain information such as process groups, topology information and attributes (key values).
7. What does MPI do?
- When an application starts up, it has a single communicator that contains all members, known as MPI_COMM_WORLD.
- Other communicators containing sub-sections of the original communicator can be created from it using collective (meaning blocking, group) operations (see the sketch below).
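As a concrete illustration, the sketch below uses the standard, collective MPI_Comm_split call to carve sub-communicators out of MPI_COMM_WORLD. This is plain MPI-1; the choice of colour (rank / 4) is arbitrary and only for illustration.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, row_rank;
        MPI_Comm row_comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Collective, blocking group operation: every process in
         * MPI_COMM_WORLD must call it.  Processes with the same colour
         * (here: rank / 4) end up in the same new communicator. */
        MPI_Comm_split(MPI_COMM_WORLD, rank / 4, rank, &row_comm);

        MPI_Comm_rank(row_comm, &row_rank);
        printf("world rank %d -> row rank %d\n", rank, row_rank);

        MPI_Comm_free(&row_comm);
        MPI_Finalize();
        return 0;
    }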
8. What does MPI do?
- Until MPI-2 and the advent of MPI_Spawn (which is not really supported by any implementation except LAM), it was not possible to add new members to the range of addressable members in an MPI application.
- If you can't address (name) them, you can't communicate directly with them.
9. What does MPI do?
- If a member of a communicator failed for some reason, the specification mandated that, rather than continuing (which would lead to unknown results in a doomed application), the communicator be invalidated and the application halted in a clean manner.
- In short: if something fails, everything does.
10. What would we like?
- Many applications are capable, or can be made capable, of surviving such a random failure.
- Initial goal: provide a version of MPI that allows a range of alternatives to an application when a sub-part of the application has failed.
- The range of alternatives depends on how the applications themselves will handle the failure.
11. Building an MPI for Harness
- Before we get into the nitty-gritty of what we do when we get an error: how are we going to build something in the first place?
- Two methods:
- Take an existing implementation (a la MPICH) and re-engineer it for our own uses (the most popular method currently).
- Build an implementation from the ground up.
12. Building a YAMI
- Taking MPICH and building an FT version should be simple?
- It has a layering system: the MPI API sits on top of the data structures, which sit on top of a collective communication model, which calls an ADI that provides p2p communications.
13. Building a YAMI
- MSS tried this with their version of MPI for the Cray T3E.
- They found that the layering was not very clean: lots of short cuts, and data passed between the layers without going through the expected APIs.
- Especially true of routines that handle startup (i.e. process management).
14. Building a YAMI
- Building a YAMI from scratch:
- Not impossible, but time consuming.
- Too many function calls to support (200).
- We can implement a subset (just like compiler writers did for HPF with subset HPF).
- If we later want a full implementation, then we need a much larger team than we currently have. (Look at how long it has taken ANL to keep up to date, and look at their currently outstanding bug list.)
15. Building a YAMI
- A subset of operations is the best way to go.
- It allows us to test a few key applications and find out just how useful and applicable an FT-MPI would be.
16. Building an MPI for Harness
- What does Harness give us, and what do we have to build ourselves?
- Harness will give us the basic functionality of starting tasks, some basic comms between them, some attribute storage (mboxes) and some indication of errors and failures.
- I.e. mostly what PVM gives us at the moment.
- As well as the ability to plug extra bits in...
17-22. Harness Basic Structure
[Figure sequence: build-up of the Harness basic structure. Applications talk over pipes/sockets and a basic TCP/IP link to the Harness run-time and the HARNESS daemon; the daemon holds the internal Harness meta-data storage (the Repository); an FM-Comms-Plugin in the Harness run-time provides communication between applications.]
23. So what do we need to build for FT-MPI?
- Build the run-time components that provide the user application with an MPI API.
- Build an interface in this run-time component that allows for fast communications, so that we at least provide something that doesn't run like a three-legged dog.
24. Building the run-time system
- The system can be built as several layers.
- The top layer is the MPI API.
- The next layer handles the internal MPI data structures and some of the data buffering.
- The next layer handles the collective communications. It breaks them down to p2p, but in a modular way so that different collective operations can be optimised differently depending on the target architecture (see the sketch after this list).
- The lowest layer handles p2p communications.
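A minimal sketch of the idea behind the collective layer: a collective (here a deliberately naive linear broadcast) expressed purely in terms of p2p sends and receives, so that a tree-based or architecture-specific variant with the same signature could be swapped in. This is illustrative only, not the FT-MPI implementation.

    #include <mpi.h>

    /* Naive linear broadcast built only from p2p calls; a smarter
     * (tree/butterfly) version with the same signature could replace
     * it per target architecture. */
    int linear_bcast(void *buf, int count, MPI_Datatype type,
                     int root, MPI_Comm comm)
    {
        int rank, size, i, rc = MPI_SUCCESS;
        MPI_Status status;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank == root) {
            for (i = 0; i < size && rc == MPI_SUCCESS; i++)
                if (i != root)
                    rc = MPI_Send(buf, count, type, i, 0, comm);
        } else {
            rc = MPI_Recv(buf, count, type, root, 0, comm, &status);
        }
        return rc;
    }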
25. Building the run-time system
- Do we have any of this already?
- Yes: the MPI API layer is currently in a file called MPI_Connect/src/com_layer.c.
- Most of the data structures are in com_list, msg_list.c, lists.c and hash.c.
- Hint: try compiling the library with the flag -DNOMPI.
- This means we know what we are up against.
26. Building the run-time system
- The most complex part is handling the collective operations and all the variants of vector operations.
- PACX and MetaMPI do not support them all, but MagPie is getting closer.
27. What is MagPie?
- A black and white bird that collects shiny objects.
- A software system by Thilo Kielmann of Vrije Universiteit, Amsterdam, NL.
- "Collects" is the important word here, as it is a package that supports efficient collective operations across multiple clusters.
- Most collective operations in most MPI implementations break down into a series of broadcasts, which scale well across switches as long as the switches are homogeneous; this is not the case for clusters of clusters.
- I.e. we can use MagPie to provide the collective substrate.
28. Building the run-time system
- That just leaves the p2p system, and the interface to the Harness daemons themselves.
- The p2p system can be built on Martin's fast message layer.
- The Harness interface can be implemented on top of PVM 3.4 for now, until Harness itself becomes available.
29. Building the run-time system
- The last details to worry about are how we are going to change the MPI semantics to report errors, and how we continue after them.
- Taking note of how we know there is a failure in the first place.
30. First sketch of FT-MPI
- The first view of FT-MPI is one where the user's application is able to handle errors, and all we have to provide is:
- A simple method for indicating errors/failures.
- A simple method for recovering from errors.
31. First sketch of FT-MPI
- Three initial models of failure (another comes later on):
- (1) There is a failure and the application is shut down (the MPI default; gains us little other than meeting the standard).
- (2) The failure only affects members of a communicator which communicate with the failed party, i.e. p2p comms still work within the communicator.
- (3) That communicator is invalidated completely.
32. First sketch of FT-MPI
- How do we detect failure? Four ways:
- (1) We are told it is going to happen by a member of a particular application (i.e. "I have NaNs everywhere... panic").
- (2) A point-to-point communication fails (a sketch of this case follows below).
- (3) The p2p system tells us that someone failed (error propagation within a communicator at the run-time system layer), much like (1).
- (4) Harness tells us via a message from the daemon.
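A minimal sketch of detection case (2): with the communicator's error handler switched from the default MPI_ERRORS_ARE_FATAL to MPI_ERRORS_RETURN (standard MPI-1), a failed send comes back as a return code instead of aborting the job. handle_failure() is a hypothetical application hook, not an FT-MPI call.

    #include <mpi.h>

    /* Hypothetical application recovery hook (not part of MPI/FT-MPI). */
    void handle_failure(MPI_Comm comm);

    void send_with_failure_check(double *buf, int count, int dest, int tag)
    {
        int rc;

        /* Without this, the default handler aborts the whole job. */
        MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        rc = MPI_Send(buf, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS) {
            /* dest (or the path to it) has failed: hand off to recovery. */
            handle_failure(MPI_COMM_WORLD);
        }
    }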
33. First sketch of FT-MPI
- How do we tell the user application?
- Return it an MPI_ERR_OTHER.
- Force it to check an additional MPI error call to find where the failure occurred.
- Or via the cached attribute key values: FT_MPI_PROC_FAILED, which is a vector of length MPI_COMM_SIZE of the original communicator (see the sketch below).
- How do we recover if we have just invalidated the communicator the application will use to recover on?
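A sketch, under stated assumptions, of how an application might read that cached failure vector: FT_MPI_PROC_FAILED is the proposed FT-MPI key value from this slide, MPI_Attr_get is standard MPI-1, and the vector layout (one flag per rank of the original communicator) is assumed purely for illustration.

    #include <stdio.h>
    #include <mpi.h>

    /* Assumed layout: one flag per rank of the original communicator.
     * FT_MPI_PROC_FAILED is the FT-MPI attribute key proposed above. */
    void report_failed_ranks(MPI_Comm comm)
    {
        int size, flag, i;
        char *failed;

        MPI_Comm_size(comm, &size);
        MPI_Attr_get(comm, FT_MPI_PROC_FAILED, &failed, &flag);

        if (flag) {                 /* attribute is cached on this comm */
            for (i = 0; i < size; i++)
                if (failed[i])
                    printf("rank %d of the original communicator failed\n", i);
        }
    }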
34. First sketch of FT-MPI
- Some functions are allowed to be used in a partial form to facilitate recovery.
- I.e. MPI_Barrier() can still be used to sync processes, but will only wait for the surviving processes.
- The formation of a new communicator will also be allowed to work with a broken communicator.
- MPI_Finalize does not need a communicator specified.
35. First sketch of FT-MPI
- Forming a new communicator that the application can use to continue is the important part.
- Two functions can be modified to be used:
- MPI_COMM_CREATE (comm, group, newcomm)
- MPI_COMM_SPLIT (comm, colour, key, newcomm)
36. First sketch of FT-MPI
- MPI_COMM_CREATE():
- Called with the group set to a new constant, FT_MPI_LIVING (!).
- Creates a new communicator that contains all the processes that continue to survive.
- A special case could be to allow MPI_COMM_WORLD to be specified as both the input and output communicator.
37. First sketch of FT-MPI
- MPI_COMM_SPLIT():
- Called with the colour set to a new constant, FT_MPI_NOT_DEAD_YET (!).
- key can be used to control the new rank of processes within the new communicator.
- Again, this creates a new communicator that contains all the processes that continue to survive (see the sketch below).
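A minimal sketch of the two recovery paths just described. FT_MPI_LIVING and FT_MPI_NOT_DEAD_YET are the proposed constants from these slides; everything else is illustrative.

    #include <mpi.h>

    /* Rebuild a working communicator from the survivors of a broken
     * one.  Either path yields a communicator of living processes. */
    void rebuild_after_failure(MPI_Comm broken, int my_old_rank,
                               MPI_Comm *survivors)
    {
        /* Path 1: MPI_Comm_create with the special "living" group. */
        MPI_Comm_create(broken, FT_MPI_LIVING, survivors);

        /* Path 2 (alternative): MPI_Comm_split with the special colour;
         * the key argument (here the old rank) orders the new ranks.
         *
         * MPI_Comm_split(broken, FT_MPI_NOT_DEAD_YET, my_old_rank, survivors);
         */
    }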
38. Simple FT-enabled example
- A simple application at first: a bag of tasks, where the tasks know how to handle a failure.
- The server just divides up the next set of data to be calculated between the survivors (a sketch of this loop follows below).
- Clients nominate a new server if they have enough state.
- (They can get the state by using all-to-all communications for results.)
39. A bigger, meaner example (PSTSWM)
- Parallel Spectral Transform Shallow Water Model.
- A 2-D grid calculation; 3-D in the actual computation, with one axis performing FFTs, the second global reductions, and the third layering sequentially upon each logical processor.
- The calculation cannot support reduced grids like those supported by the Parallel Community Climate Model (PCCM), a future target application for FT-MPI.
- I.e. if we lose a logical grid point (node) we must replace it!
40. A bigger, meaner example (PSTSWM)
- The first-sketch ideas for FT-MPI are fine for applications that can handle a failure and have function calling sequences that are not too deep.
- I.e. MPI API calls can be buried deep within routines, and any errors may take quite a while to bubble to the surface where the application can take effective action to handle them and recover (a sketch of this pattern follows below).
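A sketch of the "bubble the error to the top" pattern this implies: each layer checks the MPI return code and passes it straight up, so only the top-level loop decides to restart from the last user checkpoint. Function names and the choice of MPI_Alltoall for the transpose are illustrative, not taken from PSTSWM.

    #include <mpi.h>

    /* Hypothetical top-level hook: rebuild communicators and reload the
     * last user checkpoint, rewinding the step counter as needed. */
    void recover_and_restart(MPI_Comm *comm, double *in, double *out,
                             int n, int *step);

    /* Lowest layer: do the exchange, report errors but do not handle them. */
    int fft_transpose(MPI_Comm row_comm, double *in, double *out, int n)
    {
        return MPI_Alltoall(in, n, MPI_DOUBLE, out, n, MPI_DOUBLE, row_comm);
    }

    /* Middle layer: keep bubbling the error code upwards. */
    int time_step(MPI_Comm row_comm, double *in, double *out, int n)
    {
        int rc = fft_transpose(row_comm, in, out, n);
        if (rc != MPI_SUCCESS)
            return rc;              /* do not try to recover down here */
        /* ... reductions, local physics, etc. ... */
        return MPI_SUCCESS;
    }

    /* Top layer: the only place that knows how to restart from a checkpoint. */
    void run(MPI_Comm row_comm, double *in, double *out, int n, int steps)
    {
        int s;
        for (s = 0; s < steps; s++)
            if (time_step(row_comm, in, out, n) != MPI_SUCCESS)
                recover_and_restart(&row_comm, in, out, n, &s);
    }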
41. A bigger, meaner example (PSTSWM)
- This application proceeds in a number of well defined stages and can only handle failure by restarting from a known set of data.
- I.e. user checkpoints have to be taken, and must still be reachable.
- The user requirement is for the application to be started and run to completion, with the system automatically handling errors without manual intervention.
42. A bigger, meaner example (PSTSWM)
- Invalidating only the failed communicators, as in the first sketch, is not enough for this application.
- PSTSWM creates communicators for each row and column of the 2-D grid.
43-47. A bigger, meaner example (PSTSWM)
[Figure sequence: the PSTSWM 2-D logical process grid. A node fails, and the row and column communicators containing it become failed communicators. Communication along other rows and columns still works, but butterfly p2p exchanges along an axis are in an unknown state, since a previous failure on that axis might not have been detected yet.]
48. A bigger, meaner example (PSTSWM)
- What is really wanted is for four things to happen.
- Firstly, ALL communicators are marked as broken, even if some are recoverable: the underlying system propagates error messages to all communicators, not just the ones directly affected by the failure.
- Secondly, all MPI operations become NOPs where possible, so that the application can bubble the error to the top level as fast as possible.
49. A bigger, meaner example (PSTSWM)
- Thirdly, the run-time system spawns a replacement node on behalf of the application using a predetermined set of metrics.
- Finally, the system allows this new process to be combined with the surviving communicators at MPI_Comm_create time.
- The position (rank) of the new process is not so important in this application, as restart data has to be redistributed anyway, but it may be important for other applications.
50. A bigger, meaner example (PSTSWM)
- For this to occur, we need a means of identifying whether a process has been spawned for the purpose of recovery (by either the run-time system or an application itself):
- MPI_Comm_split (com, ft_mpi_still_alive, ...) vs
- MPI_Comm_split (ft_mpi_external_com, ft_mpi_new_spawned, ...) (a sketch of this branch follows below)
- PSTSWM doesn't care which task died, and frankly doesn't want to know!
- It just wants to continue calculating...
51. A bigger, meaner example (PSTSWM)
- How are we going to build an FT version of this application?
- Patrick Worley (ORNL) is currently adding (user) checkpoint and restart capability into the application, as well as "on error, get to the top layer" functionality, so that a restart can be performed.
- FT-MPI will need to provide an MPI-2 spawn function as well as the baseline MPI-1 calls (see the sketch below).
- Initially the spawning will be performed by the PSTSWM code, and later by the run-time on its behalf.
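For reference, a sketch of the MPI-2 spawn step as PSTSWM might initially invoke it. This is the standard MPI_Comm_spawn call; the executable name and the choice of root are placeholders.

    #include <mpi.h>

    /* Spawn a single replacement process from the communicator of
     * survivors; an intercommunicator to the new task is returned. */
    void respawn_one(MPI_Comm survivors, MPI_Comm *intercomm)
    {
        int errcode;

        MPI_Comm_spawn("pstswm",        /* placeholder executable name   */
                       MPI_ARGV_NULL,   /* no extra command-line args    */
                       1,               /* one replacement process       */
                       MPI_INFO_NULL,
                       0,               /* root rank among the survivors */
                       survivors,
                       intercomm,
                       &errcode);
    }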
52-60. A bigger, meaner example (PSTSWM)
[Figure sequence: recovery walk-through on the PSTSWM grid. A node fails; the error is detected and the communicators are invalidated across the grid; a new task is spawned; the application reforms its communicators; the application is back on-line.]
61. A bigger, meaner example (PSTSWM)
- We hope to demo FT-PSTSWM at SC99.
- Performance will not be great, as the application is sensitive to latency and very sensitive to bandwidth.
- But it is probably one of the most difficult classes of applications to support.
- PCCM is the next big application on the list, as this model can be reconfigured to handle different grid sizes dynamically.
62. Future Directions
- When we move from teraflop systems to petaflop machines, we will have a mean time between failures (MTBF) that is less than the expected execution time of a run.
- Solutions like FT-MPI might help application developers better cope with this situation, without having to checkpoint their applications to (performance) death.
63. For now, what next?
- Implement a simple MPI implementation on top of PVM 3.4, using as much existing software as possible.
- Support the functions needed by our two exemplars.
- Make sure the lower-level systems will use the high-performance comms layer efficiently when it becomes available.
- Fool some students into working for us (5 years?), for when we want to support the other 200 functions in MPI.