FT-MPI - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
FT-MPI
  • Graham E Fagg
  • Making of the holy grail
  • or
  • a YAMI that is FT

2
FT-MPI
  • What is FT-MPI (it's no YAMI)
  • Building an MPI for Harness
  • First sketch of FT-MPI
  • Simple FT enabled Example
  • A bigger meaner example (PSTSWM)
  • Second view of FT-MPI
  • Future directions

3
FT-MPI is not just a YAMI
  • FT-MPI as in Fault Tolerant MPI
  • Why make a FT version?
  • Harness is going to be very robust compared to
    previous systems.
  • No single point of failure, unlike PVM
  • Allow MPI users to take advantage of this high
    level of robustness, rather than just provide Yet
    Another MPI Implementation (YAMI)

4
Why FT-MPI
  • Current MPI applications live under the MPI fault
    tolerance model of no faults allowed.
  • This is fine on an MPP, as losing a node generally
    means losing a partition anyway.
  • It makes reasoning about results easy: if there was
    a fault you might have received incomplete or
    incorrect values, and hence the wrong result
    anyway.

5
Why FT-MPI
  • No matter how we implement FT-MPI, it must follow
    current MPI-1.2 (or 2) practice, i.e. we can't
    really change too much about how it works
    (semantics) or how it looks (syntax).
  • This makes coding for FT a little interesting and
    very dependent on the target application classes,
    as will be shown.

6
So first what does MPI do?
  • All communication is via a communicator
  • Communicators form an envelope in which
    communication can occur, and contain information
    such as process groups, topology information and
    attributes (key values).

7
What does MPI do?
  • When an application starts up, it has a single
    communicator that contains all members, known as
    MPI_COMM_WORLD.
  • Other communicators containing sub-sections of the
    original communicator can be created from it using
    collective (i.e. blocking, group-wide) operations,
    as in the sketch below.

8
What does MPI do?
  • Until MPI-2 and the advent of MPI_Spawn (which is
    not really supported by any implementation except
    LAM) it was not possible to add new members to the
    range of addressable members in an MPI
    application.
  • If you can't address (name) them, you can't
    communicate directly with them.

9
What does MPI do?
  • If a member of a communicator fails for some
    reason, the specification mandates that, rather
    than continuing (which would lead to unknown
    results in a doomed application), the communicator
    is invalidated and the application halted in a
    clean manner.
  • In short: if something fails, everything does.

10
What would we like?
  • Many applications can survive, or can be made to
    survive, such a random failure.
  • Initial goal
  • Provide a version of MPI that offers a range of
    alternatives to an application when a sub-part of
    the application has failed.
  • The range of alternatives depends on how the
    applications themselves will handle the failure.

11
Building an MPI for Harness
  • Before we get into the nitty-gritty of what we do
    when we get an error, how are we going to build
    something in the first place?
  • Two methods
  • Take an existing implementation (e.g. MPICH) and
    re-engineer it for our own uses (the most popular
    method currently).
  • Build an implementation from the ground up.

12
Building a YAMI
  • Taking MPICH and building an FT version should be
    simple?
  • It has a layered design: the MPI API sits on top
    of the data structures, which sit on top of a
    collective communication model, which calls an
    ADI that provides p2p communications.

13
Building a YAMI
  • MSS tried this with their version of MPI for the
    Cray T3E
  • They found that the layering was not very clean:
    lots of shortcuts, and data passed between the
    layers without going through the expected APIs.
  • Especially true of the routines that handle
    startup (i.e. process management).

14
Building a YAMI
  • Building a YAMI from scratch
  • Not impossible but time consuming
  • Too many function calls to support (200)
  • Can implement a subset (just like compiler
    writers did for HPF with subset HPF)
  • If we later want a full implementation then we
    need a much larger team than we currently have.
    (Look at how long it has taken ANL to keep up to
    date, and look at their currently outstanding bug
    list.)

15
Building a YAMI
  • A subset of operations is the best way to go.
  • Allows us to test a few key applications and find
    out just how useful and applicable a FT-MPI would
    be.

16
Building an MPI for Harness
  • What does Harness give us, and what do we have to
    build ourselves?
  • Harness will give us basic functionality of
    starting tasks, some basic comms between them,
    some attribute storage (mboxes) and some
    indication of errors and failures.
  • I.e. mostly what PVM gives us at the moment.
  • As well as the ability to plug extra bits in...

17
Harness Basic Structure
[Diagram: two applications connected to the HARNESS daemon through the Harness run-time, using pipes/sockets and TCP/IP; daemons joined by a basic TCP/IP link]
18
Harness Basic Structure
[Diagram: a repository and internal Harness meta-data storage added behind the HARNESS daemon]
19
Harness Basic Structure
[Diagram: as above - repository, applications, pipes/sockets, TCP/IP, internal Harness meta-data storage, HARNESS daemon]
20
Harness Basic Structure
[Diagram: applications connected to the HARNESS daemon via pipes/sockets and TCP/IP, with internal Harness meta-data storage]
21
Harness Basic Structure
[Diagram: as above]
22
Harness Basic Structure
[Diagram: an FM-Comms-Plugin loaded into the Harness run-time, between the applications and the daemon's pipes/sockets and TCP/IP links]
23
So what do we need to build for FT-MPI?
  • Build the run-time components that provide the
    user application with an MPI API
  • Build an interface in this run-time component
    that allows for fast communications, so that we at
    least provide something that doesn't run like a
    three-legged dog.

24
Building the run-time system
  • The system can be built as several layers.
  • The top layer is the MPI API
  • The next layer handles the internal MPI data
    structures and some of the data buffering.
  • The next layer handles the collective
    communications.
  • It breaks them down into p2p operations, but in a
    modular way so that different collective operations
    can be optimised differently depending on the
    target architecture.
  • The lowest layer handles p2p communications.

25
Building the run-time system
  • Do we have any of this already?
  • Yes: the MPI API layer is currently in a file
    called MPI_Connect/src/com_layer.c
  • Most of the data structures are in com_list,
    msg_list.c, lists.c and hash.c
  • Hint: try compiling the library with the flag
    -DNOMPI
  • This means we know what we are up against.

26
Building the run-time system
  • The most complex part is handling the collective
    operations and all the variants of the vector
    operations.
  • PACX and MetaMPI do not support them all, but
    MagPie is getting closer.

27
What is MagPie ?
  • A black and white bird that collects shiny
    objects.
  • A software system by Thilo Kielmann of Vrije
    Universiteit, Amsterdam, NL.
  • "Collects" is the important word here, as it is a
    package that supports efficient collective
    operations across multiple clusters.
  • Most collective operations in most MPI
    implementations break down into a series of
    broadcasts, which scale well across switches as
    long as the switches are homogeneous; this is not
    the case for clusters of clusters.
  • I.e. we can use MagPie to provide the collective
    substrate.

28
Building the run-time system
  • That just leaves the p2p system, and the interface
    to the Harness daemons themselves.
  • The p2p system can be built on Martin's fast
    message layer.
  • The Harness interface can be implemented on top
    of PVM 3.4 for now, until Harness itself becomes
    available.

29
Building the run-time system
  • The last details to worry about are how we are
    going to change the MPI semantics to report errors
    and how we continue after them.
  • Taking note of how we know there is a failure in
    the first place.

30
First sketch of FT-MPI
  • The first view of FT-MPI is one where the user's
    application is able to handle errors, and all we
    have to provide is:
  • A simple method for indicating errors/failures
  • A simple method for recovering from errors

31
First sketch of FT-MPI
  • 3 initial models of failure (another later on)
  • (1) There is a failure and the application is
    shut down (the MPI default; gains us little other
    than meeting the standard).
  • (2) The failure only affects members of a
    communicator which communicate with the failed
    party, i.e. p2p comms still work within the
    communicator.
  • (3) That communicator is invalidated completely.

32
First sketch of FT-MPI
  • How do we detect failure?
  • 4 ways
  • (1) We are told it is going to happen by a member
    of a particular application (i.e. "I have NaNs
    everywhere... panic").
  • (2) A point-to-point communication fails.
  • (3) The p2p system tells us that someone failed
    (error propagation within a communicator at the
    run-time system layer), much like (1).
  • (4) Harness tells us via a message from the
    daemon.

33
First sketch of FT-MPI
  • How do we tell the user application?
  • Return MPI_ERR_OTHER to it.
  • Force it to check an additional MPI error call to
    find where the failure occurred.
  • Via the cached attribute key values:
  • FT_MPI_PROC_FAILED, which is a vector of length
    MPI_COMM_SIZE of the original communicator (see
    the sketch below).
  • How do we recover if we have just invalidated the
    communicator the application will use to recover
    on?
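
A minimal sketch of this reporting path, assuming the proposed FT-MPI behaviour above: the failed call returns MPI_ERR_OTHER and an FT_MPI_PROC_FAILED attribute is cached on the communicator. The attribute key, the meaning of its entries and start_recovery() are assumptions for illustration, not a fixed API; MPI_Errhandler_set and MPI_Attr_get are standard MPI-1 calls.

    #include <stdio.h>
    #include <mpi.h>

    /* FT_MPI_PROC_FAILED and start_recovery() are hypothetical names taken
       from the slide, not part of any released MPI implementation. */
    void checked_send(double *buf, int n, int dest, int tag, MPI_Comm comm)
    {
        int rc, flag, size, i, *failed;

        MPI_Errhandler_set(comm, MPI_ERRORS_RETURN); /* return codes, don't abort */
        MPI_Comm_size(comm, &size);

        rc = MPI_Send(buf, n, MPI_DOUBLE, dest, tag, comm);
        if (rc == MPI_ERR_OTHER) {
            /* vector of length MPI_COMM_SIZE cached on the communicator */
            MPI_Attr_get(comm, FT_MPI_PROC_FAILED, &failed, &flag);
            if (flag) {
                for (i = 0; i < size; i++)
                    if (failed[i])               /* assumed: rank i has died */
                        fprintf(stderr, "rank %d failed\n", i);
                start_recovery(comm);            /* application-supplied */
            }
        }
    }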

34
First sketch of FT-MPI
  • Some functions are allowed to be used in a
    partial form to facilitate recovery.
  • I.e. MPI_Barrier() can still be used to
    sync processes, but it will only wait for the
    surviving processes.
  • The formation of a new communicator will also be
    allowed to work with a broken communicator.
  • MPI_Finalize does not need a communicator
    specified.

35
First sketch of FT-MPI
  • Forming a new communicator that the application
    can use to continue is the important part.
  • Two functions can be modified for this use:
  • MPI_COMM_CREATE (comm, group, newcomm )
  • MPI_COMM_SPLIT (comm, colour, key, newcomm )

36
First sketch of FT-MPI
  • MPI_COMM_CREATE ( )
  • Called with the group set to a new constant
  • FT_MPI_LIVING (!)
  • Creates a new communicator that contains all the
    processes that continue to survive.
  • A special case could be to allow MPI_COMM_WORLD to
    be specified as both the input and the output
    communicator (see the sketch below).
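
A minimal sketch of the call described above, assuming the proposed FT_MPI_LIVING constant from the slide (it is not a standard MPI group); in standard MPI the group argument would come from MPI_Comm_group().

    #include <mpi.h>

    /* Sketch: rebuild a communicator containing only the survivors of a
       broken communicator.  FT_MPI_LIVING is the proposed FT-MPI constant. */
    MPI_Comm rebuild_from_survivors(MPI_Comm broken)
    {
        MPI_Comm survivors;

        MPI_Comm_create(broken, FT_MPI_LIVING, &survivors);
        return survivors;
    }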

37
First sketch of FT-MPI
  • MPI_COMM_SPLIT ( )
  • Called with the colour set to a new constant
  • FT_MPI_NOT_DEAD_YET (!)
  • The key can be used to control the new rank of
    processes within the new communicator.
  • Again this creates a new communicator that
    contains all the processes that continue to
    survive (see the sketch below).
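
The equivalent sketch for the split variant, assuming the proposed FT_MPI_NOT_DEAD_YET colour from the slide; the old rank is reused as the key so surviving processes keep their relative order in the new communicator.

    #include <mpi.h>

    /* Sketch of the proposed MPI_Comm_split() recovery variant. */
    MPI_Comm split_survivors(MPI_Comm broken)
    {
        MPI_Comm survivors;
        int old_rank;

        MPI_Comm_rank(broken, &old_rank);
        MPI_Comm_split(broken, FT_MPI_NOT_DEAD_YET, old_rank, &survivors);
        return survivors;
    }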

38
Simple FT enabled Example
  • Simple application at first
  • Bag of tasks, where the tasks know how to handle
    a failure.
  • The server just divides up the next set of data to
    be calculated between the survivors.
  • Clients nominate a new server if they have enough
    state.
  • (They can get the state by using all-to-all
    communications for the results; see the sketch
    below.)
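
A rough sketch of the server side of this bag-of-tasks pattern under the first-sketch semantics: when a send to a worker fails, the server rebuilds the communicator from the survivors and redistributes the work. FT_MPI_LIVING is the proposed constant from the earlier slide; work_left(), next_chunk() and collect_results() are hypothetical application helpers.

    #include <mpi.h>

    #define CHUNK 1024

    /* Sketch only: assumes MPI_ERRORS_RETURN has been set on `comm`. */
    void serve(MPI_Comm comm)
    {
        double chunk[CHUNK];
        int nprocs, w, rc;

        MPI_Comm_size(comm, &nprocs);
        while (work_left()) {
            for (w = 1; w < nprocs; w++) {          /* rank 0 is the server  */
                next_chunk(chunk);
                rc = MPI_Send(chunk, CHUNK, MPI_DOUBLE, w, 0, comm);
                if (rc != MPI_SUCCESS) {            /* a worker has died     */
                    MPI_Comm_create(comm, FT_MPI_LIVING, &comm);
                    MPI_Comm_size(comm, &nprocs);   /* fewer survivors now   */
                    break;                          /* redistribute the work */
                }
            }
            collect_results(comm);                  /* hypothetical helper   */
        }
    }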

39
A bigger meaner example (PSTSWM)
  • Parallel Spectral Transform Shallow Water Model
  • 2D grid calculation
  • 3-D in the actual computation, with one axis
    performing FFTs, the second global reductions, and
    the third layered sequentially upon each logical
    processor.
  • The calculation cannot support reduced grids like
    those supported by the Parallel Community Climate
    Model (PCCM), a future target application for
    FT-MPI.
  • I.e. if we lose a logical grid point (node) we
    must replace it!

40
A bigger meaner example (PSTSWM)
  • The first-sketch ideas for FT-MPI are fine for
    applications that can handle a failure and whose
    function calling sequences are not too deep.
  • Otherwise, MPI API calls can be buried deep within
    routines and any error may take quite a while to
    bubble to the surface, where the application can
    take effective action to handle it and recover.

41
A bigger meaner example (PSTSWM)
  • This application proceeds in a number of
    well-defined stages and can only handle failure by
    restarting from a known set of data.
  • I.e. user checkpoints have to be taken, and must
    still be reachable.
  • The user requirement is for the application to be
    started and run to completion, with the system
    automatically handling errors without manual
    intervention (see the sketch below).
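
A sketch of the top-level driver structure this implies: each well-defined stage returns an MPI error code that bubbles up, and the driver rebuilds the communicator and falls back to the last reachable user checkpoint instead of requiring manual intervention. The Model type, the stage and checkpoint helpers, and rebuild_from_survivors() (sketched earlier) are all hypothetical.

    /* Hypothetical top-level driver for a staged application such as PSTSWM. */
    void drive(MPI_Comm comm, Model *m)
    {
        int step = 0;

        while (step < m->nsteps) {
            if (run_stage(comm, m, step) != MPI_SUCCESS) { /* error bubbled up */
                comm = rebuild_from_survivors(comm);       /* earlier sketch   */
                load_checkpoint(m, &step);                 /* last reachable   */
                continue;
            }
            if (step % m->ckpt_interval == 0)
                write_checkpoint(m, step);                 /* user checkpoints */
            step++;
        }
    }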

42
A bigger meaner example (PSTSWM)
  • Invalidating only the failed communicators, as in
    the first sketch, is not enough for this
    application.
  • PSTSWM creates communicators for each row and
    column of the 2-D grid.

43
A bigger meaner example (PSTSWM)
44
A bigger meaner example (PSTSWM)
[Diagram: the 2-D process grid with one failed node]
45
A bigger meaner example (PSTSWM)
[Diagram: the failed node invalidates both its row communicator and its column communicator]
46
A bigger meaner example (PSTSWM)
[Diagram: around the failed node, some communication still works while the butterfly p2p exchange is in an unknown state]
47
A bigger meaner example (PSTSWM)
[Diagram: communication on the failed node's axis is in an unknown state, as the previous failure on that axis might not have been detected yet]
48
A bigger meaner example (PSTSWM)
  • What is really wanted is for four things to
    happen.
  • Firstly, ALL communicators are marked as broken,
    even if some are recoverable.
  • The underlying system propagates error messages
    to all communicators, not just the ones directly
    affected by the failure.
  • Secondly, all MPI operations become NOPs where
    possible, so that the application can bubble the
    error to the top level as fast as possible.

49
A bigger meaner example (PSTSWM)
  • Thirdly, the run-time system spawns a replacement
    node on behalf of the application using a
    predetermined set of metrics.
  • Finally, the system allows this new process to be
    combined with the surviving communicators at
    MPI_Comm_create time.
  • The position (rank) of the new process is not so
    important in this application, as the restart data
    has to be redistributed anyway, but it may be
    important for other applications.

50
A bigger meaner example (PSTSWM)
  • For this to occur, we need a means of identifying
    if a process has been spawned for the purpose of
    recovery (by either the run-time system or an
    application itself).
  • MPI_Comm_split (com, ft_mpi_still_alive,..) vs
  • MPI_Comm_split (ft_mpi_external_com,
    ft_mpi_new_spawned,..)
  • PSTSWM doesn't care which task died and frankly
    doesn't want to know!
  • It just wants to continue calculating (see the
    sketch below).
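
A sketch of the distinction drawn above: an original survivor splits its broken communicator with the ft_mpi_still_alive colour, while a process spawned purely for recovery joins through a special external communicator with the ft_mpi_new_spawned colour. The ft_mpi_* names and am_replacement() are the proposed/hypothetical names from the slide, not an existing API.

    #include <mpi.h>

    /* Sketch: survivors and replacement processes meet in one new communicator. */
    MPI_Comm rejoin(MPI_Comm old_comm)
    {
        MPI_Comm newcomm;
        int key = 0;  /* rank ordering is unimportant; data is redistributed */

        if (am_replacement())   /* started by the run-time to fill a hole */
            MPI_Comm_split(ft_mpi_external_com, ft_mpi_new_spawned, key, &newcomm);
        else                    /* an original process that survived      */
            MPI_Comm_split(old_comm, ft_mpi_still_alive, key, &newcomm);

        return newcomm;         /* PSTSWM never asks which rank died      */
    }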

51
A bigger meaner example (PSTSWM)
  • How are we going to build an FT version of this
    application?
  • Patrick Worley (ORNL) is currently adding (user)
    checkpoint and restart capability to the
    application, as well as "on error, get to the top
    layer" functionality, so that a restart can be
    performed.
  • FT-MPI will need to provide an MPI-2 spawn
    function as well as baseline MPI-1 calls.
  • Initially the spawning will be performed by the
    PSTSWM code itself, and later by the run-time on
    its behalf (see the sketch below).
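
Since the spawning will initially be done by the PSTSWM code itself, here is a minimal sketch using the standard MPI-2 MPI_Comm_spawn/MPI_Intercomm_merge pair to create one replacement process and fold it into an intra-communicator with the survivors; how FT-MPI then rebuilds the application's row and column communicators is left open.

    #include <mpi.h>

    /* Sketch: the survivors spawn one replacement and merge it in.
       `binary` is the path to the application executable (an assumption). */
    MPI_Comm spawn_replacement(MPI_Comm survivors, char *binary)
    {
        MPI_Comm inter, merged;

        MPI_Comm_spawn(binary, MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0 /* root rank */, survivors, &inter,
                       MPI_ERRCODES_IGNORE);
        MPI_Intercomm_merge(inter, 0 /* this group asks to be ordered first */,
                            &merged);
        return merged;
    }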

52
A bigger meaner example (PSTSWM)
Failed Node
53
A bigger meaner example (PSTSWM)
Error detected, comms invalidated
54
A bigger meaner example (PSTSWM)
Error detected, comms invalidated
55
A bigger meaner example (PSTSWM)
Error detected, comms invalidated
56
A bigger meaner example (PSTSWM)
New task spawned
57
A bigger meaner example (PSTSWM)
Application reforming communicators
58
A bigger meaner example (PSTSWM)
Application reforming communicators
59
A bigger meaner example (PSTSWM)
Application reforming communicators
60
A bigger meaner example (PSTSWM)
Application back on-line.
61
A bigger meaner example (PSTSWM)
  • Hope to demo FT-PSTSWM at SC99
  • Performance will not be great, as it is sensitive
    to latency and very sensitive to bandwidth.
  • But it is probably one of the most difficult
    classes of applications to support.
  • PCCM is the next big application on the list as
    this model can be reconfigured to handle
    different grid sizes dynamically.

62
Future Directions
  • When we move from teraflop systems to petaflop
    machines we will have a mean time between failures
    (MTBF) that is less than the expected execution
    time of a run.
  • Solutions like FT-MPI might help application
    developers better cope with this situation,
    without having to checkpoint their applications
    to (performance) death.

63
For now, what next?
  • Implement a simple MPI implementation on top of
    PVM 3.4 using as much existing software as
    possible.
  • Support functions needed by our two exemplars.
  • Make sure the lower-level systems will use the
    high-performance comms layer efficiently when it
    becomes available.
  • Fool some students into working for us (5
    years?), for when we want to support the other
    200 functions in MPI.