Title: FT-MPI
1. FT-MPI
- Graham E. Fagg
- Making of the holy grail
- or: a YAMI that is FT
2. FT-MPI
- What is FT-MPI (it's no YAMI)
- Building an MPI for Harness
- First sketch of FT-MPI
- Simple FT-enabled example
- A bigger, meaner example (PSTSWM)
- Second view of FT-MPI
- Future directions
3. FT-MPI is not just a YAMI
- FT-MPI as in Fault Tolerant MPI.
- Why make an FT version?
- Harness is going to be very robust compared to previous systems: no single point of failure, unlike PVM.
- Allow MPI users to take advantage of this high level of robustness, rather than just providing Yet Another MPI Implementation (YAMI).
4. Why FT-MPI?
- Current MPI applications live under the MPI fault tolerance model of "no faults allowed".
- This is fine on an MPP, since if you lose a node you generally lose a partition anyway.
- It also makes reasoning about results easy: if there was a fault, you might have received incomplete/incorrect values and hence have the wrong result anyway.
5. Why FT-MPI?
- No matter how we implement FT-MPI, it must follow current MPI-1.2 (or 2) practices, i.e. we can't really change too much about how it works (semantics) or how it looks (syntax).
- This makes coding for FT a little interesting and very dependent on the target application classes, as will be shown.
6. So first, what does MPI do?
- All communication is via a communicator.
- Communicators form an envelope in which communication can occur, and contain information such as process groups, topology information and attributes (key values).
7. What does MPI do?
- When an application starts up, it has a single communicator that contains all members, known as MPI_COMM_WORLD.
- Other communicators containing sub-sections of the original communicator can be created from it using collective (meaning blocking, group) operations (see the sketch below).
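As a concrete illustration, the sketch below uses the standard, collective MPI_Comm_split call to carve sub-communicators out of MPI_COMM_WORLD. This is plain MPI-1; the choice of colour (rank / 4) is arbitrary and only for illustration.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, row_rank;
        MPI_Comm row_comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Collective, blocking group operation: every process in
         * MPI_COMM_WORLD must call it.  Processes with the same colour
         * (here: rank / 4) end up in the same new communicator. */
        MPI_Comm_split(MPI_COMM_WORLD, rank / 4, rank, &row_comm);

        MPI_Comm_rank(row_comm, &row_rank);
        printf("world rank %d -> row rank %d\n", rank, row_rank);

        MPI_Comm_free(&row_comm);
        MPI_Finalize();
        return 0;
    }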
8. What does MPI do?
- Until MPI-2 and the advent of MPI_Spawn (which is not really supported by any implementation except LAM), it was not possible to add new members to the range of addressable members in an MPI application.
- If you can't address (name) them, you can't communicate directly with them.
9. What does MPI do?
- If a member of a communicator failed for some reason, the specification mandated that, rather than continuing (which would lead to unknown results in a doomed application), the communicator be invalidated and the application halted in a clean manner.
- In short: if something fails, everything does.
10. What would we like?
- Many applications are capable, or can be made capable, of surviving such a random failure.
- Initial goal: provide a version of MPI that allows a range of alternatives to an application when a sub-part of the application has failed.
- The range of alternatives depends on how the applications themselves will handle the failure.
11. Building an MPI for Harness
- Before we get into the nitty-gritty of what we do when we get an error: how are we going to build something in the first place?
- Two methods:
- Take an existing implementation (a la MPICH) and re-engineer it for our own uses (the most popular method currently).
- Build an implementation from the ground up.
12. Building a YAMI
- Taking MPICH and building an FT version should be simple?
- It has a layering system: the MPI API sits on top of the data structures, which sit on top of a collective communication model, which calls an ADI that provides p2p communications.
13. Building a YAMI
- MSS tried this with their version of MPI for the Cray T3E.
- They found that the layering was not very clean: lots of short cuts, and data passed between the layers without going through the expected APIs.
- Especially true of routines that handle startup (i.e. process management).
14. Building a YAMI
- Building a YAMI from scratch:
- Not impossible, but time consuming.
- Too many function calls to support (200).
- We can implement a subset (just like compiler writers did for HPF with subset HPF).
- If we later want a full implementation, then we need a much larger team than we currently have. (Look at how long it has taken ANL to keep up to date, and look at their currently outstanding bug list.)
15. Building a YAMI
- A subset of operations is the best way to go.
- It allows us to test a few key applications and find out just how useful and applicable an FT-MPI would be.
16. Building an MPI for Harness
- What does Harness give us, and what do we have to build ourselves?
- Harness will give us the basic functionality of starting tasks, some basic comms between them, some attribute storage (mboxes) and some indication of errors and failures.
- I.e. mostly what PVM gives us at the moment.
- As well as the ability to plug extra bits in...
17-22. Harness Basic Structure
[Figure sequence: build-up of the Harness basic structure. Applications talk over pipes/sockets and a basic TCP/IP link to the Harness run-time and the HARNESS daemon; the daemon holds the internal Harness meta-data storage (the Repository); an FM-Comms-Plugin in the Harness run-time provides communication between applications.]
23. So what do we need to build for FT-MPI?
- Build the run-time components that provide the user application with an MPI API.
- Build an interface in this run-time component that allows for fast communications, so that we at least provide something that doesn't run like a three-legged dog.
24. Building the run-time system
- The system can be built as several layers.
- The top layer is the MPI API.
- The next layer handles the internal MPI data structures and some of the data buffering.
- The next layer handles the collective communications. It breaks them down to p2p, but in a modular way so that different collective operations can be optimised differently depending on the target architecture (see the sketch after this list).
- The lowest layer handles p2p communications.
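A minimal sketch of the idea behind the collective layer: a collective (here a deliberately naive linear broadcast) expressed purely in terms of p2p sends and receives, so that a tree-based or architecture-specific variant with the same signature could be swapped in. This is illustrative only, not the FT-MPI implementation.

    #include <mpi.h>

    /* Naive linear broadcast built only from p2p calls; a smarter
     * (tree/butterfly) version with the same signature could replace
     * it per target architecture. */
    int linear_bcast(void *buf, int count, MPI_Datatype type,
                     int root, MPI_Comm comm)
    {
        int rank, size, i, rc = MPI_SUCCESS;
        MPI_Status status;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank == root) {
            for (i = 0; i < size && rc == MPI_SUCCESS; i++)
                if (i != root)
                    rc = MPI_Send(buf, count, type, i, 0, comm);
        } else {
            rc = MPI_Recv(buf, count, type, root, 0, comm, &status);
        }
        return rc;
    }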
25. Building the run-time system
- Do we have any of this already?
- Yes: the MPI API layer is currently in a file called MPI_Connect/src/com_layer.c.
- Most of the data structures are in com_list, msg_list.c, lists.c and hash.c.
- Hint: try compiling the library with the flag -DNOMPI.
- This means we know what we are up against.
26. Building the run-time system
- The most complex part is handling the collective operations and all the variants of vector operations.
- PACX and MetaMPI do not support them all, but MagPie is getting closer.
27. What is MagPie?
- A black and white bird that collects shiny objects.
- A software system by Thilo Kielmann of Vrije Universiteit, Amsterdam, NL.
- "Collects" is the important word here, as it is a package that supports efficient collective operations across multiple clusters.
- Most collective operations in most MPI implementations break down into a series of broadcasts, which scale well across switches as long as the switches are homogeneous; this is not the case for clusters of clusters.
- I.e. we can use MagPie to provide the collective substrate.
28. Building the run-time system
- That just leaves the p2p system, and the interface to the Harness daemons themselves.
- The p2p system can be built on Martin's fast message layer.
- The Harness interface can be implemented on top of PVM 3.4 for now, until Harness itself becomes available.
29. Building the run-time system
- The last details to worry about are how we are going to change the MPI semantics to report errors, and how we continue after them.
- Taking note of how we know there is a failure in the first place.
30. First sketch of FT-MPI
- The first view of FT-MPI is one where the user's application is able to handle errors, and all we have to provide is:
- A simple method for indicating errors/failures.
- A simple method for recovering from errors.
31. First sketch of FT-MPI
- Three initial models of failure (another comes later on):
- (1) There is a failure and the application is shut down (the MPI default; gains us little other than meeting the standard).
- (2) The failure only affects members of a communicator which communicate with the failed party, i.e. p2p comms still work within the communicator.
- (3) That communicator is invalidated completely.
32. First sketch of FT-MPI
- How do we detect failure? Four ways:
- (1) We are told it is going to happen by a member of a particular application (i.e. "I have NaNs everywhere... panic").
- (2) A point-to-point communication fails (a sketch of this case follows below).
- (3) The p2p system tells us that someone failed (error propagation within a communicator at the run-time system layer), much like (1).
- (4) Harness tells us via a message from the daemon.
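A minimal sketch of detection case (2): with the communicator's error handler switched from the default MPI_ERRORS_ARE_FATAL to MPI_ERRORS_RETURN (standard MPI-1), a failed send comes back as a return code instead of aborting the job. handle_failure() is a hypothetical application hook, not an FT-MPI call.

    #include <mpi.h>

    /* Hypothetical application recovery hook (not part of MPI/FT-MPI). */
    void handle_failure(MPI_Comm comm);

    void send_with_failure_check(double *buf, int count, int dest, int tag)
    {
        int rc;

        /* Without this, the default handler aborts the whole job. */
        MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        rc = MPI_Send(buf, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS) {
            /* dest (or the path to it) has failed: hand off to recovery. */
            handle_failure(MPI_COMM_WORLD);
        }
    }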
33. First sketch of FT-MPI
- How do we tell the user application?
- Return it an MPI_ERR_OTHER.
- Force it to check an additional MPI error call to find where the failure occurred.
- Or via the cached attribute key values: FT_MPI_PROC_FAILED, which is a vector of length MPI_COMM_SIZE of the original communicator (see the sketch below).
- How do we recover if we have just invalidated the communicator the application will use to recover on?
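A sketch, under stated assumptions, of how an application might read that cached failure vector: FT_MPI_PROC_FAILED is the proposed FT-MPI key value from this slide, MPI_Attr_get is standard MPI-1, and the vector layout (one flag per rank of the original communicator) is assumed purely for illustration.

    #include <stdio.h>
    #include <mpi.h>

    /* Assumed layout: one flag per rank of the original communicator.
     * FT_MPI_PROC_FAILED is the FT-MPI attribute key proposed above. */
    void report_failed_ranks(MPI_Comm comm)
    {
        int size, flag, i;
        char *failed;

        MPI_Comm_size(comm, &size);
        MPI_Attr_get(comm, FT_MPI_PROC_FAILED, &failed, &flag);

        if (flag) {                 /* attribute is cached on this comm */
            for (i = 0; i < size; i++)
                if (failed[i])
                    printf("rank %d of the original communicator failed\n", i);
        }
    }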
34. First sketch of FT-MPI
- Some functions are allowed to be used in a partial form to facilitate recovery.
- I.e. MPI_Barrier() can still be used to sync processes, but will only wait for the surviving processes.
- The formation of a new communicator will also be allowed to work with a broken communicator.
- MPI_Finalize does not need a communicator specified.
35. First sketch of FT-MPI
- Forming a new communicator that the application can use to continue is the important part.
- Two functions can be modified to be used:
- MPI_COMM_CREATE (comm, group, newcomm)
- MPI_COMM_SPLIT (comm, colour, key, newcomm)
36. First sketch of FT-MPI
- MPI_COMM_CREATE():
- Called with the group set to a new constant, FT_MPI_LIVING (!).
- Creates a new communicator that contains all the processes that continue to survive.
- A special case could be to allow MPI_COMM_WORLD to be specified as both the input and output communicator.
37. First sketch of FT-MPI
- MPI_COMM_SPLIT():
- Called with the colour set to a new constant, FT_MPI_NOT_DEAD_YET (!).
- key can be used to control the new rank of processes within the new communicator.
- Again, this creates a new communicator that contains all the processes that continue to survive (see the sketch below).
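A minimal sketch of the two recovery paths just described. FT_MPI_LIVING and FT_MPI_NOT_DEAD_YET are the proposed constants from these slides; everything else is illustrative.

    #include <mpi.h>

    /* Rebuild a working communicator from the survivors of a broken
     * one.  Either path yields a communicator of living processes. */
    void rebuild_after_failure(MPI_Comm broken, int my_old_rank,
                               MPI_Comm *survivors)
    {
        /* Path 1: MPI_Comm_create with the special "living" group. */
        MPI_Comm_create(broken, FT_MPI_LIVING, survivors);

        /* Path 2 (alternative): MPI_Comm_split with the special colour;
         * the key argument (here the old rank) orders the new ranks.
         *
         * MPI_Comm_split(broken, FT_MPI_NOT_DEAD_YET, my_old_rank, survivors);
         */
    }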
38. Simple FT-enabled example
- A simple application at first: a bag of tasks, where the tasks know how to handle a failure.
- The server just divides up the next set of data to be calculated between the survivors (a sketch of this loop follows below).
- Clients nominate a new server if they have enough state.
- (They can get the state by using all-to-all communications for results.)
39. A bigger, meaner example (PSTSWM)
- Parallel Spectral Transform Shallow Water Model.
- A 2-D grid calculation; 3-D in the actual computation, with one axis performing FFTs, the second global reductions, and the third layering sequentially upon each logical processor.
- The calculation cannot support reduced grids like those supported by the Parallel Community Climate Model (PCCM), a future target application for FT-MPI.
- I.e. if we lose a logical grid point (node) we must replace it!
40. A bigger, meaner example (PSTSWM)
- The first-sketch ideas for FT-MPI are fine for applications that can handle a failure and have function calling sequences that are not too deep.
- I.e. MPI API calls can be buried deep within routines, and any errors may take quite a while to bubble to the surface where the application can take effective action to handle them and recover (a sketch of this pattern follows below).
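A sketch of the "bubble the error to the top" pattern this implies: each layer checks the MPI return code and passes it straight up, so only the top-level loop decides to restart from the last user checkpoint. Function names and the choice of MPI_Alltoall for the transpose are illustrative, not taken from PSTSWM.

    #include <mpi.h>

    /* Hypothetical top-level hook: rebuild communicators and reload the
     * last user checkpoint, rewinding the step counter as needed. */
    void recover_and_restart(MPI_Comm *comm, double *in, double *out,
                             int n, int *step);

    /* Lowest layer: do the exchange, report errors but do not handle them. */
    int fft_transpose(MPI_Comm row_comm, double *in, double *out, int n)
    {
        return MPI_Alltoall(in, n, MPI_DOUBLE, out, n, MPI_DOUBLE, row_comm);
    }

    /* Middle layer: keep bubbling the error code upwards. */
    int time_step(MPI_Comm row_comm, double *in, double *out, int n)
    {
        int rc = fft_transpose(row_comm, in, out, n);
        if (rc != MPI_SUCCESS)
            return rc;              /* do not try to recover down here */
        /* ... reductions, local physics, etc. ... */
        return MPI_SUCCESS;
    }

    /* Top layer: the only place that knows how to restart from a checkpoint. */
    void run(MPI_Comm row_comm, double *in, double *out, int n, int steps)
    {
        int s;
        for (s = 0; s < steps; s++)
            if (time_step(row_comm, in, out, n) != MPI_SUCCESS)
                recover_and_restart(&row_comm, in, out, n, &s);
    }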
41. A bigger, meaner example (PSTSWM)
- This application proceeds in a number of well defined stages and can only handle failure by restarting from a known set of data.
- I.e. user checkpoints have to be taken, and must still be reachable.
- The user requirement is for the application to be started and run to completion, with the system automatically handling errors without manual intervention.
42. A bigger, meaner example (PSTSWM)
- Invalidating only the failed communicators, as in the first sketch, is not enough for this application.
- PSTSWM creates communicators for each row and column of the 2-D grid.
43-47. A bigger, meaner example (PSTSWM)
[Figure sequence: the PSTSWM 2-D logical process grid. A node fails, and the row and column communicators containing it become failed communicators. Communication along other rows and columns still works, but butterfly p2p exchanges along an axis are in an unknown state, since a previous failure on that axis might not have been detected yet.]
48. A bigger, meaner example (PSTSWM)
- What is really wanted is for four things to happen.
- Firstly, ALL communicators are marked as broken, even if some are recoverable: the underlying system propagates error messages to all communicators, not just the ones directly affected by the failure.
- Secondly, all MPI operations become NOPs where possible, so that the application can bubble the error to the top level as fast as possible.
49. A bigger, meaner example (PSTSWM)
- Thirdly, the run-time system spawns a replacement node on behalf of the application using a predetermined set of metrics.
- Finally, the system allows this new process to be combined with the surviving communicators at MPI_Comm_create time.
- The position (rank) of the new process is not so important in this application, as restart data has to be redistributed anyway, but it may be important for other applications.
50. A bigger, meaner example (PSTSWM)
- For this to occur, we need a means of identifying whether a process has been spawned for the purpose of recovery (by either the run-time system or an application itself):
- MPI_Comm_split (com, ft_mpi_still_alive, ...) vs
- MPI_Comm_split (ft_mpi_external_com, ft_mpi_new_spawned, ...) (a sketch of this branch follows below)
- PSTSWM doesn't care which task died, and frankly doesn't want to know!
- It just wants to continue calculating...
51. A bigger, meaner example (PSTSWM)
- How are we going to build an FT version of this application?
- Patrick Worley (ORNL) is currently adding (user) checkpoint and restart capability into the application, as well as "on error, get to the top layer" functionality, so that a restart can be performed.
- FT-MPI will need to provide an MPI-2 spawn function as well as the baseline MPI-1 calls (see the sketch below).
- Initially the spawning will be performed by the PSTSWM code, and later by the run-time on its behalf.
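For reference, a sketch of the MPI-2 spawn step as PSTSWM might initially invoke it. This is the standard MPI_Comm_spawn call; the executable name and the choice of root are placeholders.

    #include <mpi.h>

    /* Spawn a single replacement process from the communicator of
     * survivors; an intercommunicator to the new task is returned. */
    void respawn_one(MPI_Comm survivors, MPI_Comm *intercomm)
    {
        int errcode;

        MPI_Comm_spawn("pstswm",        /* placeholder executable name   */
                       MPI_ARGV_NULL,   /* no extra command-line args    */
                       1,               /* one replacement process       */
                       MPI_INFO_NULL,
                       0,               /* root rank among the survivors */
                       survivors,
                       intercomm,
                       &errcode);
    }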
52-60. A bigger, meaner example (PSTSWM)
[Figure sequence: recovery walk-through on the PSTSWM grid. A node fails; the error is detected and the communicators are invalidated across the grid; a new task is spawned; the application reforms its communicators; the application is back on-line.]
61. A bigger, meaner example (PSTSWM)
- We hope to demo FT-PSTSWM at SC99.
- Performance will not be great, as the application is sensitive to latency and very sensitive to bandwidth.
- But it is probably one of the most difficult classes of applications to support.
- PCCM is the next big application on the list, as this model can be reconfigured to handle different grid sizes dynamically.
62. Future Directions
- When we move from teraflop systems to petaflop machines, we will have a mean time between failures (MTBF) that is less than the expected execution time of a run.
- Solutions like FT-MPI might help application developers better cope with this situation, without having to checkpoint their applications to (performance) death.
63. For now, what next?
- Implement a simple MPI implementation on top of PVM 3.4, using as much existing software as possible.
- Support the functions needed by our two exemplars.
- Make sure the lower-level systems will use the high-performance comms layer efficiently when it becomes available.
- Fool some students into working for us (5 years?), for when we want to support the other 200 functions in MPI.